Frequency of Type I Errors in Professional Journals

BS (Bad Science)

    Gasparikova-Krasnec and Ging (1987) expressed the opinion that use of the .05 criterion of statistical significance results in 1 out of every 20 published studies representing a Type I error. I believe that they have grossly overestimated the frequency of published Type I errors and that uncritical acceptance of their opinion may produce more skepticism about the veracity of published studies than is warranted.

    The frequency of Type I errors in the literature is critically dependent upon how often psychologists test null hypotheses that are in fact true. To keep my argument very simple, I shall assume that all significant studies are published and that no nonsignificant studies are published.

    If 50% of the null hypotheses tested by psychologists were true and 50% were false, then for every 1000 null hypotheses tested, 500 would be true, and 500(.05) = 25 of those would produce significant results (using the .05 criterion for p) and thus be published. Let us also assume, unrealistically, that every researcher used methods and sample sizes adequate to hold the Type II error rate (for nontrivial effect sizes) at the level we consider acceptable for Type I errors, 5%. Then of the 500 false null hypotheses in every 1000 tested, 500(.95) = 475 would produce significant results and thus be published. Of the total of 25 + 475 = 500 studies published, 25/500 = 5% would indeed represent Type I errors.
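
    The arithmetic is easy to verify. Below is a minimal sketch in Python (my own illustration, not part of the original comment) that reproduces this 50/50 scenario under the stated assumptions:

        # Hypothetical batch of 1000 tested null hypotheses, half of them true.
        # Assumes alpha = .05, power = .95, and that all and only significant
        # results are published.
        n_nulls = 1000
        alpha, power = 0.05, 0.95

        true_nulls = 0.50 * n_nulls              # 500 true nulls
        false_nulls = n_nulls - true_nulls       # 500 false nulls

        type1_published = alpha * true_nulls     # 500(.05) = 25 false positives
        correct_published = power * false_nulls  # 500(.95) = 475 true positives

        published = type1_published + correct_published   # 500 published studies
        print(type1_published / published)                # 0.05, i.e., 5%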

    Let us now assume that only 10% of the null hypotheses tested by psychologists are true. For every 1000 null hypotheses tested, 100 would be true and would lead to .05(100) = 5 published Type I errors; 900 would be false and would lead to .95(900) = 855 published correct rejections of the null hypothesis. Of the 5 + 855 = 860 studies published, only 5/860 = 0.6% would represent Type I errors.
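
    The same bookkeeping generalizes to any base rate of true nulls. Here is a sketch, again purely illustrative and with a function name of my own invention, showing how quickly the published Type I rate falls as true nulls become rarer:

        def published_type1_rate(p_true, alpha=0.05, power=0.95):
            """Fraction of published (significant) studies that are Type I
            errors, assuming all and only significant results are published."""
            false_pos = p_true * alpha         # true nulls wrongly rejected
            true_pos = (1 - p_true) * power    # false nulls correctly rejected
            return false_pos / (false_pos + true_pos)

        print(published_type1_rate(0.50))   # 0.05    -> the 5% scenario above
        print(published_type1_rate(0.10))   # ~0.0058 -> about 0.6%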

    What percentage of the null hypotheses tested by psychologists are likely to be true? I believe that psychologists rarely, if ever, test an absolutely true null hypothesis. Almost all of the null hypotheses tested by psychologists can be reduced to the hypothesis that no correlation exists between (among) two (or more) variables. The probability of a psychologist picking two variables that are absolutely uncorrelated in the population to which the results are to be generalized is extremely small. For example, I would wager a considerable sum on the hypothesis that in the population of all humans, mean IQ is associated with the number of letters in the person's last name. That is, were we to have data for the entire population, I seriously doubt that the mean IQ of persons with one-letter last names is exactly equal to that of persons with two-letter last names, and so forth.

    Of course, psychologists may often study null hypotheses that, while not absolutely true, are nearly true. That is, they may study effects that are trivial, such as the effect upon IQ of the number of letters in one's last name. I believe that psychologists should pay much more attention to the effects of power upon the probability of a Type II error and upon the probability that a practically trivial effect will be found statistically significant. Most psychologists realize that unacceptably small sample sizes and large error variance make it all too likely that even a relatively large effect will not be found statistically significant, but many forget that large sample sizes and artificially low error variance can allow one to declare a practically trivial effect statistically significant.
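
    That last point is easy to demonstrate numerically. Here is a rough sketch (my illustration; it assumes a two-sided one-sample z test, which the text does not specify) of how the power to detect a practically trivial standardized effect, d = .01, climbs toward 1 as the sample size grows:

        from math import sqrt
        from statistics import NormalDist

        def z_test_power(d, n, alpha=0.05):
            """Approximate power of a two-sided one-sample z test for
            standardized effect size d and sample size n."""
            z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # 1.96 for alpha = .05
            shift = d * sqrt(n)                            # noncentrality
            return ((1 - NormalDist().cdf(z_crit - shift))
                    + NormalDist().cdf(-z_crit - shift))

        for n in (100, 10_000, 1_000_000):
            print(n, round(z_test_power(0.01, n), 3))
        # 100       0.051  -- barely above alpha
        # 10000     0.17
        # 1000000   1.0    -- the trivial effect is almost certain to be
        #                     declared statistically significant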


    The manuscript above was submitted to the American Psychologist in 1987 as a comment on the Gasparikova-Krasnec and Ging article. In 1988 I got a reply from the editor, Leonard D. Goodstein, rejecting the submission. He had received one unfavorable review, and the second reviewer never replied, so he gave up waiting for the second review and rejected the manuscript on the basis of that single unfavorable review.

    Interestingly, while I was waiting for Goodstein's decision, there appeared in the Quantitative Methods section of the Psychological Bulletin an excellent article on the same topic: Pollard, P., & Richardson, J. T. E. (1987). On the probability of making Type I errors. Psychological Bulletin, 102, 159-163. I really enjoyed reading this article, but it is written at a level which will result, IMHO, in its not being read by many, certainly not by those who most need to read it. Pollard and Richardson made these points far more eloquently than I could; if you find the topic interesting, please do obtain their article and read the full text of it.

    I wonder if the reviewer of my comment or the editor of the American Psychologist ever got around to reading the article by Pollard and Richardson. More recently, Raymond Nickerson (2000) has discussed "Misconceptions Associated with NHST," including the belief that alpha is the probability that, if one has rejected the null hypothesis, one has made a Type I error; the belief that the value at which alpha is set for a given experiment is the probability that a Type I error will be made in interpreting the results of that experiment; and the belief that the value at which alpha is set is the probability of Type I error across a large set of experiments in which alpha is set at that value. Nickerson's article is a dandy review of various controversies about NHST. Regrettably, because it appears in Psychological Methods, it may not be read by those who most need to read it - I assume that the readership of Psychological Methods includes relatively few of the many who suffer from the misconceptions that Nickerson reviews.

    Even more recently, the American Psychologist carried an article by Erceg-Hurn and Mirosevich (2008) in which the same error was again made: "Usually, the Type I error rate (also known as the alpha rate, or α) is set at .05. This means that if a result is deemed statistically significant, there should be less than a 5% risk that a Type I error has been made." This is BS (Bad Statistics). The 5% risk is P(significant | null is true), not P(null is true | test is significant).
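
    In the language of Bayes' theorem, P(null is true | significant) = P(significant | null is true) x P(null is true) / P(significant), so it depends on the base rate of true nulls and on power, not on alpha alone. The sketch below (my illustration, with hypothetical base rates; it is the same computation as published_type1_rate above, relabeled to make the Bayesian reading explicit) shows how far apart the two conditional probabilities can be:

        def p_null_given_sig(p_null, alpha=0.05, power=0.95):
            # Bayes: P(null | sig) = P(sig | null) * P(null) / P(sig)
            return alpha * p_null / (alpha * p_null + power * (1 - p_null))

        # P(significant | null true) is fixed at alpha = .05 by the researcher;
        # P(null true | significant) is not:
        print(p_null_given_sig(0.50))   # 0.05   -- equals alpha only here
        print(p_null_given_sig(0.10))   # ~0.006 -- far below alpha
        print(p_null_given_sig(0.90))   # ~0.32  -- far above alpha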

    In 2011, Douglas Medin, president of the Association for Psychological Science, wrote "These and related factors should tend to inflate 'false positives' (aka Type I errors), which leads inexorably to the pessimistic conclusion that some unknown (but considerably higher than five percent) proportion of our field's published effects are not true effects."  Sigh.

References

Erceg-Hurn, D. M., & Mirosevich, V. M. (2008). Modern robust statistical methods: An easy way to maximize the accuracy and power of your research. American Psychologist, 63, 591-601.

Gasparikova-Krasnec, M., & Ging, N. (1987). Experimental replication and professional cooperation. American Psychologist, 42, 266-267.

Nickerson, R. S. (2000). Null hypothesis significance testing: A review of an old and continuing controversy. Psychological Methods, 5, 241-301.

Pollard, P., & Richardson, J. T. E. (1987). On the probability of making Type I errors. Psychological Bulletin, 102, 159-163.
