Scientific reasoning driven by influential data: resuscitate dfstat!

submited by

Style Pass

2024-11-02 07:00:08

In biomedical literature, one of the most widely employed statistical procedures to analyze and visualize the association between two variables is linear regression. Data points that exert influence on the fit and its parameters are routinely, but not as often as required, identified by established influence measures and their corresponding cut-off values. In this work, we are specifically concerned with the presence of influential data points that directly impact hypothesis testing of linear regressions, which none of the established measures describe. Interestingly, the highly overlooked influence measure dfstat and its derived leave-one-out p-value exists exactly for this purpose, unmentioned in the majority of statistical text books as well as absent from all available statistical software packages. Its application for identifying these data points seems pivotal, as scientific reasoning in publications is almost exclusively based on the p-value of the fit, commonly adhering to the alpha = 0.05 threshold to state significance or not. With this metric, we found for 29 of 100 digitizable papers published in Science, Nature and PNAS in 2016, a time when the "reproducibility crisis" was a growing concern, that stated significances (or their absence) are based on the presence of a single influential data point.