in Classical Hypothesis Testing

Wuensch, K. L. (1994). Evaluating the relative seriousness of type I versus type II errors in classical hypothesis testing. In B. Brown (Ed.), *Disseminations of the International Statistical Applications Institute: Vol 1* (3rd ed., pp. 76-79). Wichita, KS: ACG Press.

In many disciplines (including mine, Psychology) classical hypothesis testing is the usual method of analyzing research data. Typically we have a relatively small sample of data and we employ a .05 (alpha) criterion of significance, a combination which makes a Type II error much more probable than a Type I error. This is probably quite reasonable for much of the research that is done in my discipline (where the null hypothesis is usually that there is no relationship between two variables or sets of variables), but it might not always be reasonable. Rarely do we consider why the .05 criterion is used, and often we don't consider the effect of varying sample size. To stimulate thought on this matter, I suggest you imagine that you are testing an experimental drug that is supposed to reduce blood pressure, but is suspected of inducing cancer. You administer the drug to a sample of rodents. Since you know that the tumor rate in this strain is 10% among untreated animals, your null hypothesis (the one which includes an equals sign) is that the tumor rate in treated animals is less than or equal to 10%, that is, the drug is safe, it does not cause tumors. The alternative hypothesis is that the tumor rate in treated animals is more than 10%, that is, the drug is not safe. A Type I error is defined as rejecting a true null hypothesis (not being a believer in the utility of testing point null hypotheses, what I really mean here is rejecting a null hypothesis that is so close to true that for practical purposes it is true). In this example that amounts to concluding that the drug is not safe when in fact it is. A Type II error is defined as failing to reject a false null hypothesis -- here, concluding that the drug is safe when in fact it is not.

If you were a potential consumer of this new drug, which of these types of errors would you consider more serious? Your initial response might be that it is more serious to make the Type II error, to declare an unsafe drug as being safe. Having decided that the Type II error is more serious, one should consider techniques to decrease the probability of making such an error, beta. One way to decrease beta is to increase alpha. That is, one might be willing to trade an increased risk of a Type I error for a decreased risk of a Type II error. It is, however, possible to decrease beta without increasing alpha. Additional power (the ability to detect the falsity of the null hypothesis, 1 - beta) may be obtained by using larger sample sizes, more efficient statistics, and/or by reducing "error variance" (any variance in the dependent variable not caused by the independent variable).
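The trade-offs just described can be sketched numerically for the tumor example. What follows is only an illustration, not part of the original paper. It assumes an exact one-sided binomial test, a true tumor rate of 20% when the drug is unsafe, and samples of 20 or 200 rodents; only the 10% base rate and the .05 criterion come from the text, and the function names are hypothetical.

```python
from math import comb

def upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def critical_count(n, p0, alpha):
    """Smallest tumor count c such that P(X >= c | p0) <= alpha."""
    for c in range(n + 2):
        if upper_tail(c, n, p0) <= alpha:
            return c

def type2_error(n, p0, p1, alpha):
    """Beta: probability of NOT rejecting the null when the true rate is p1."""
    c = critical_count(n, p0, alpha)
    return 1 - upper_tail(c, n, p1)

P0 = 0.10  # base tumor rate from the text
P1 = 0.20  # assumed true rate if the drug is unsafe (illustrative only)

for n, alpha in [(20, 0.05), (20, 0.20), (200, 0.05)]:
    beta = type2_error(n, P0, P1, alpha)
    print(f"n={n:3d}  alpha={alpha:.2f}  beta={beta:.3f}  power={1 - beta:.3f}")
```

Under these assumptions, 20 animals and the .05 criterion give a beta of roughly .63, far larger than alpha. Raising alpha to .20 lowers beta to roughly .41, and raising the sample size to 200 lowers it much further without touching alpha.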

There are, however, several difficult-to-quantify factors that we have not considered so far in our evaluation of the relative seriousness of Type I and Type II errors. What if you are one of those persons for whom currently available drugs are not effective? Might that make you reconsider the relative seriousness of the two types of errors? Concluding that the drug is not safe when in fact it is (Type I error) may now seem the more serious error, since it denies you the opportunity to obtain a new drug which might save your life. Furthermore, even if the drug does "significantly" raise tumor rates, you might be willing to accept an increased risk of developing cancer in return for achieving effective control of your blood pressure. If we use methods that maximize power we run the risk of declaring as "significant" an increase in tumor rate which is quite small, too small to outweigh the potential benefits of the new drug (but large enough to attract the attention of attorneys who specialize in medical/pharmaceutical malpractice).

Now imagine that you are not a potential consumer of this drug but rather a stockholder in the pharmaceutical company whose primary concern is with the profits to be made in the short term. Concluding that the drug is unsafe, when it really is safe (Type I) now becomes an extremely serious error, one which could not only deny patients a potentially useful medication but deny you your well-deserved capital gains. You might also be less than enthusiastic about increasing power by gathering more data, since it costs money to gather more data and the increased power would make it more likely that you would detect an increase in tumor rate should one exist.

It might be useful to consider an economic analysis of the problem. You could attempt to quantify the likely costs associated with making the one or the other type of error, the costs of collecting additional data, and note how these costs change as you vary sample size and alpha, choosing the sample size and alpha which minimize the costs. But you and I might differ with respect to our quantification of the costs of Type I versus Type II errors, right?
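The economic analysis suggested above might be sketched as follows. This is a toy illustration with loudly assumed inputs: the dollar costs, the cost per animal, the 50% chance that the drug is unsafe, the 20% tumor rate under the alternative, and the grid of designs are all invented. Only the 10% base rate comes from the text, and the function names are hypothetical.

```python
from math import comb

def upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def error_rates(n, alpha, p0=0.10, p1=0.20):
    """Actual Type I and Type II error rates of the exact one-sided test."""
    c = next(k for k in range(n + 2) if upper_tail(k, n, p0) <= alpha)
    return upper_tail(c, n, p0), 1 - upper_tail(c, n, p1)

def expected_cost(n, alpha, cost_type1, cost_type2, cost_per_animal, prob_unsafe):
    """Data-collection cost plus probability-weighted costs of each error."""
    a, b = error_rates(n, alpha)
    return (n * cost_per_animal
            + (1 - prob_unsafe) * a * cost_type1
            + prob_unsafe * b * cost_type2)

def cheapest_design(costs, sample_sizes=(20, 50, 100, 200),
                    alphas=(0.01, 0.05, 0.10, 0.20)):
    """Grid search for the (n, alpha) pair with the lowest expected cost."""
    designs = [(n, a) for n in sample_sizes for a in alphas]
    return min(designs, key=lambda d: expected_cost(d[0], d[1], **costs))

# Two hypothetical parties who price the two errors very differently.
consumer = dict(cost_type1=1_000, cost_type2=100_000,
                cost_per_animal=10, prob_unsafe=0.5)
stockholder = dict(cost_type1=100_000, cost_type2=1_000,
                   cost_per_animal=10, prob_unsafe=0.5)

print("consumer:   ", cheapest_design(consumer))
print("stockholder:", cheapest_design(stockholder))
```

Which design comes out cheapest depends entirely on the assumed costs. Two people who price the errors differently can both run this calculation correctly and still disagree about alpha and sample size.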

Now imagine that we have decided that the drug is safe. To get approval to market the drug we must also show that it is effective. We test its effect on blood pressure. Our dependent variable is pretreatment blood pressure minus post-treatment blood pressure. Positive scores indicate that the drug lowered blood pressure. The null hypothesis, with the equals sign, is that the mean decrease in blood pressure is less than or equal to zero, that is, the drug is not effective. The alternative hypothesis is that the mean decrease is greater than zero, the drug is effective. A Type I error is concluding that the drug is effective when in fact it is not. A Type II error is concluding that the drug is not effective when in fact it is. Most of my students initially opine that the Type I error is more serious in this example. What do you think? In evaluating this question consider the same sorts of issues we addressed in the previous example.

In closing, I would like to express my thanks to the many persons who have discussed this issue on the EDSTAT-L list (edstat-l@jse.stat.ncsu.edu) on the Internet. I have attempted to include several of their thoughts in this brief paper. Those interested in the full discussion are referred to the archives for the first three weeks of September, 1994.

Karl's initial post in response to a query on the list:

Pierre Duchesne has asked us about the relative seriousness of Type I versus Type II errors in hypothesis testing. I address this issue with my first semester stats students, using a contrived (and possibly not very realistic) example, something like this.

You are testing to see if a new drug, which is intended to lower blood pressure, has as a side effect induction of cancer. Your subjects are rats, and you know the base rate of cancer in this population of untreated rats. The null hypothesis is that the new drug does not increase cancer rate, that is, in treated rats the rate is less than or equal to the base rate, that is, the drug is safe. The alternative hypothesis is that the drug is unsafe, does increase cancer rate.

Now you test the effectiveness of the drug. Your null hypothesis is that treatment produces zero or less reduction in blood pressure, it is not effective. The alternative is that it does reduce blood pressure, it is effective.

For each of these scenarios I ask my students to consider which is the more serious error -- "Type I" or "Type II." Most agree that a Type II error (drug is actually unsafe, we conclude it is safe) is more serious than a Type I error (drug is safe, we conclude it is not), in the first scenario, and most agree that in the second a Type I error (drug is not effective, but we conclude it is effective) is more serious than a Type II error (drug is effective, we conclude it is not).

As noted in an earlier post, the null hypothesis is the one which specifies a value of the tested parameter.

From this point I try to convince my students that one should set the "alpha-criterion" (for rejecting the null) by considering the relative seriousness of Type I and Type II errors for the particular circumstances in which the test is being used. This leads into discussion of Beta, Power, choosing sample sizes sufficiently large so that meaningful effects, if they exist, are nearly certain to be detected (and if they are not detected, one may be able to conclude they likely do not exist). One can also discuss how different persons might have different perspectives on the relative seriousness of Type I and Type II errors in a given situation -- a stockholder of the drug company might differ from a potential consumer for the scenarios above. Some students will ask very relevant questions, such as "Are there other drugs that are effective for this condition?" or "Might the benefit of effective treatment outweigh some elevated risk of developing cancer?"

Date: Mon, 12 Sep 94 00:59:11 EDT

Pierre Duchesne

Like Karl Wuensch, I take up these issues with my introductory stats class (mainly psychology students), and I use (probably totally unrealistic) scenarios like this one:

Suppose the Australian government imposes mandatory HIV blood-tests on all citizens. Assume the tests have a .01 false positive rate and a .01 false negative rate. Unknown to the testers, 50,000 out of 17,000,000 Australians are HIV-positive. The result is that we should expect 500 false negatives and 169,500 false positives out of 17,000,000 tests.

Which is the more serious error? Is it 500 undetected HIV carriers or 169,500 people who are falsely believed to be HIV-positive? Students catch onto the point that the rarity of a disorder or disease can not only make the diagnosticity of a test problematic (Prob(HIV|Positive test) = 49,500/219,000) but can also alter our perceptions of which error is the worse one. They also start to see some of the difficulties that arise from using imperfect diagnostic tests on nonclinical populations. I have a small interactive tutorial on the Mac that allows them to try out different false positive and negative rates, and different numbers of HIV-infected people.
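The arithmetic in this scenario can be reproduced, and varied in the spirit of the tutorial mentioned above, with a few lines of Python. This is a minimal sketch and not the Mac tutorial itself; the function and key names are hypothetical, while the population, prevalence, and error rates are the ones given in the post.

```python
def screening_outcomes(population, infected, false_pos_rate, false_neg_rate):
    """Expected counts from testing everyone once with an imperfect test."""
    uninfected = population - infected
    false_negatives = infected * false_neg_rate
    true_positives = infected - false_negatives
    false_positives = uninfected * false_pos_rate
    return {
        "false_negatives": false_negatives,
        "false_positives": false_positives,
        "true_positives": true_positives,
        # Prob(HIV | positive test)
        "prob_infected_given_positive":
            true_positives / (true_positives + false_positives),
    }

result = screening_outcomes(population=17_000_000, infected=50_000,
                            false_pos_rate=0.01, false_neg_rate=0.01)
for name, value in result.items():
    print(f"{name}: {value:,.3f}")
```

With the figures from the post this gives 500 false negatives, 169,500 false positives, and Prob(HIV|Positive test) = 49,500/219,000, or about .23. Changing `infected` shows how strongly the rarity of the condition drives that last number.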

While in this case I tell them that Ho is "the person is uninfected" and H1 is "the person has HIV", I also caution them that under different circumstances one error may be perceived as more serious than the other, and so they need to worry about both.

Dr. Michael Smithson, email: Michael.Smithson@jcu.edu.au, Behavioural Sciences, James Cook University, Queensland Australia 4811

Date: Mon, 12 Sep 94 15:02:30 EDT

In a recent note, Wuensch implied that the experimenter could decide the level of alpha. In experimental psychology, it seems to me that alpha is set at .05 by the enterprise of psychology, and experimenters have little choice in the matter.

This seems appropriate, since the decision is always the same -- whether or not to let the experimenter make a claim. For more important claims, the cost of a Type I error rises with the cost of a Type II error.

Bob Frick, RFRICK@psych1.psy.sunysb.edu

Date: Wed, 14 Sep 94 11:44:05 EDT

Concerning the thread by Elaine Allen, R. Frick, A. Taylor, H. Rubin, et al. on setting alpha, I believe, from experience in the semiconductor industry, that what we are talking about is the fact that the applied statistics fields and the applied economics fields (and other fields, such as reliability!) are in fact inextricably intertwined, and the point of juncture is the study of the EXPECTED COSTS of the various possible outcomes of the experiment being designed. It is fascinating to try to do this for a particular experiment:

- Cost of the sample size
- Alpha Error Cost (Type I)
- Beta Error Cost (Type II)
- Cost of the resulting real-life decision (e.g., do we buy the new $2M machine, or do we keep running the old one?)
  - Costs to "society"
  - manufacturer
  - consumer
  - third parties that may be either helped or damaged by the product

These various costs are then matched up with the probabilities of the various possible outcomes, then integrated to find the EXPECTED COST. This results in a map of alpha error setting versus EXPECTED COST versus sample size.

Only after the affected parties do this can you responsibly set the alpha level, IMHO. The trouble is, I do not know that any of us are teaching the students that this is necessary, and how to do it. I would be game to work up a "realistic" example with one or more of you that could be used in teaching. This isn't an assigned project for me, please understand, but I think it is important enough, especially if you concur. The semiconductor data is very complex, so I wouldn't necessarily suggest an example from my experience. The drug-study type example might be more interesting to students, with obvious types of expected costs.

Kevin Hankins, Reliability Engineer, Delco Electronics MS R117, KOKOMO IN 46902

A1_KOESS_hankins_kt%email@delcoelect.com

Date: Wed, 14 Sep 94 18:45:41 EDT

>>What about the case in which people's life span is reduced in the absence of the drug, and the *unsafe* condition is a heightened rate of birth defects? Saying the drug is unsafe when it is indeed safe, means that many people die sooner than they would have otherwise. Saying that it is safe when it is in fact unsafe means an increased rate of birth defects. In this case determining which is more *serious* becomes a moral judgement.

>This is ALWAYS the case. Not only which is more serious, but quantitatively how much more serious.

This poses an interesting question. How do we quantify moral judgements? In some societies, life is not considered all that valuable while in others it is sacrosanct. Which is correct and by how much?

brad.brown@acginc.com (Brad Brown)

Date: Wed, 14 Sep 94 18:48:42 EDT

>>I agree with your approach to getting students to consider type I and II errors, however, taking no action is not always the least *serious* option. Using medical examples in particular, in many cases people will die without the treatment whereas they may only suffer loss of limb or diminished quality of life as adverse outcomes. Is it appropriate to deny a person continued life just because they encounter the risk of losing a limb?

>The attitude above is also wrong. How much more serious something is depends on the individual; some may even prefer to die rather than to have a diminished quality of life. The risk needs to be evaluated probabilistically; utility analysis tells us to take the expected utility, the utility being highly personal.

We don't disagree at all. However, we are not talking about the same thing.

The patient has virtually no choice regarding the therapies which are available to treat their condition. Those choices are made by the FDA, Medicare, Hospital Administration and Medical Staff. The patient only gets to choose from among the therapies which are available. This is where the issues you raise come in. The issue that I was referring to is involved in determining whether or not the therapy would be available for the patient to choose.

For example, say I am a Medicare reimbursement specialist who has to make a decision about whether to reimburse on a national basis for a particular mode of therapy or not. If the therapy does no harm but also does no good, I am wasting money if I reimburse for it and will be embarrassed if it later is evident that the therapy is worthless. If the therapy provides great benefit and also could cause great harm, I now am perched upon a peak with a possible precipice on either side, compounded by the fact that I am effectively wearing a blindfold since I know that I can't KNOW the TRUTH with certainty. If the therapy MIGHT produce benefit and there is high confidence that it does not cause harm, but costs me some money, this is an easy decision. Wait until the null hypothesis (the therapy does not provide benefit) is rejected with an alpha of .001 (or until my boss or one of her relatives contracts a disease which might benefit from that therapy. Oh, wait... Government employees aren't under Medicare, are they?)

In this case, I do not care about YOUR utility. I am interested in MINE. (Oh, surely, this sort of thing never happens in real life...) In this particular hypothetical situation, I make a decision based on my utility that affects your receipt of therapy regardless of your utility for that therapy. You are free to make your decision regarding your utility for that therapy by paying for it yourself if I don't (at least for now; that may not be an option in the future. There are proposals that would prohibit you from paying for therapy other than what the government system provides/allows. Also, remember it is frequently cheaper to let a person die than to try to extend their life or improve their quality of life.) After all, in a perfect free market environment, PRICE will optimize overall UTILITY; however, no one has ever argued that medicine, in the US at least, is a free market.

brad.brown@acginc.com (Brad Brown)

Date: Thu, 15 Sep 94 18:40:34 EDT

hrubin@snap.stat.purdue.edu (Herman Rubin) wrote:

>RFRICK@psych1.psy.sunysb.edu writes:

>>In a recent note, Wuensch implied that the experimenter could decide the level of alpha. In experimental psychology, it seems to me that alpha is set at .05 by the enterprise of psychology, and experimenters have little choice in the matter.

>>This seems appropriate, since the decision is always the same -- whether or not to let the experimenter make a claim. For more important claims, the cost of a Type I error rises with the cost of a Type II error.

>That setting alpha to anything is wrong can be seen by comparing the results of testing with 100 observations and 1000000 observations. The costs of the errors stay put, but the type II error probability as a function of the state of nature decreases. Some of the reduced cost should be used to reduce the type I error probability.

Given the data, I would agree. However, a statistical investigation starts before the data is collected. Part of the statistician's task is to decide how much data to collect. I would suggest that some of the cost of collecting 1000000 observations would usually be better spent by investigating other problems.

Remember that precision is proportional to the square root of the sample size, so one can do four studies for the cost of doubling the precision in one study.
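A short sketch of the square-root relationship, assuming that "precision" means the reciprocal of the standard error of a sample mean. The standard deviation of 10 and the sample size of 100 are arbitrary illustrative values, and the function names are hypothetical.

```python
from math import sqrt

def standard_error(sigma, n):
    """Standard error of a sample mean with population SD sigma."""
    return sigma / sqrt(n)

def n_for_precision_gain(n, factor):
    """Sample size needed to shrink the standard error by `factor`."""
    return n * factor ** 2

n = 100
print(standard_error(10, n))        # baseline standard error
print(n_for_precision_gain(n, 2))   # sample size for half the standard error
print(standard_error(10, n_for_precision_gain(n, 2)))
```

Halving the standard error requires four times the observations, which is the same budget as four studies of the original size.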

If one chooses the smallest sample necessary to gain a reasonable degree of precision, many of Herman's objections to classical methods disappear. (That does not mean that a Bayesian decision analysis may not be better, but with reasonable intuition about costs, one can choose alpha to give a fair approximation to a more thorough analysis.)

Terry Moore, Statistics Department, Massey University, New Zealand.

T.Moore@massey.ac.nz

Date: Fri, 16 Sep 94 21:11:12 EDT

I appreciate Terry Moore's comments on choosing small, but sufficient, sample sizes. I would like to amplify this theme and suggest that a study's design and size are more important than the alpha level.

First, any source of bias in design and data collection, such as a biased sampling frame or non-response, can overwhelm a large study. I believe Cochran, in his sampling book, demonstrated how bias may exceed precision in such a manner as to make a nominal 95% confidence interval have hardly a chance to cover the true parameter. In such a situation we are actually estimating the wrong thing with high precision. A smaller sample size may not decrease bias, but at least we won't be misled by the appearance of high precision.
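The effect of bias on coverage can be sketched with a simple calculation. This is an illustration under stated assumptions and not Cochran's own presentation: the estimate is taken to be normally distributed around the true value plus a fixed bias, and the bias of 1 unit, the standard deviation of 10, and the sample sizes are invented values. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def coverage(bias, sigma, n, level=0.95):
    """Chance that a nominal `level` interval covers the true parameter
    when the estimator is normal with a fixed additive bias."""
    std_normal = NormalDist()
    z = std_normal.inv_cdf(0.5 + level / 2)
    shift = bias / (sigma / sqrt(n))   # bias measured in standard errors
    return std_normal.cdf(z - shift) - std_normal.cdf(-z - shift)

for n in (25, 100, 400, 10_000):
    print(f"n={n:6d}  actual coverage={coverage(bias=1, sigma=10, n=n):.3f}")
```

The bias stays the same while the standard error shrinks, so the interval tightens around the wrong value and actual coverage falls toward zero as the sample grows.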

Second, overprecision may lead to irrelevant significance. There is no utility in obtaining "statistical significance" beyond practical importance. It is foolish to measure timber with a micrometer. So it is wise to choose a sample size only as large as is needed to obtain a practical degree of precision. (Note that this approach avoids the asymptotic foolishness of the so-called point null hypothesis by suggesting that optimal in the sense of practical decision making is somewhat short of asymptotically large.)

Third, strategic allocation of resources can lead to improvements in both bias reduction and efficiency. As Moore points out, we can execute four studies for the price of one with twice the precision. But we can actually do better than that. It may be possible to design a battery of studies so as to check sources of bias and to improve the efficiency of each study in succession. I think most of us would agree that if we had the resources to conduct a 1,000,000 simple random sample study, then we would do better with a pilot study leading to an optimally allocated 250,000 stratified sample with two or three follow-ups for catching nonrespondents and checking response accuracy.

For researchers who must publish in journals which impose the 5% significance rule, as neanderthal as that may be, the optimal strategy is to allocate optimally the minimum of resources needed to have a reasonable chance of "significance." This strategy will maximize the researcher's rate of publication, hastening the researcher's advancement in a publish-or-perish environment. Perhaps in light of the three points above, this may be a reasonably efficient way for science to advance. However, to be unbiased, small, well-crafted studies should be published on the quality of design and importance of subject matter, and not on the specific results of such a study.

I have said nothing new here. I wish only to emphasize the importance of good planning over concern for choosing the right alpha. I find arguments for the asymptotic foolishness of hypothesis testing irrelevant in spite of their validity. We live in a finite world. Who would ever commission a $1,000,000 study to answer a $5 question, U.S. government notwithstanding?

James Hilden-Minton, jhilden@stat.ucla.edu

Date: Sat, 17 Sep 94 17:16:58 EDT

Subject: Re: who sets alpha?

May I commend to readers of this debate the excellent chapter in Leamer's Specification Searches book. It is, IMHO, the most lucid treatment of this very important subject.

Neal Beck, Dept of Political Science, UCSD, beck@ucsd.edu

Date: Sat, 17 Sep 94 16:25:33 EDT

Re the messages below and many others:

MM's second point -- and several of the things the original poster said -- are based on what the null hypothesis usually is in some particular discipline. However, what ends up being the null hypothesis depends on how you quantify the problem. It could be that the new drug has no effect, or it could be that the new drug has no side effects. It could be that the patient is healthy (T=98.6 F) or that the patient is ill (T=100.0 F) or dead (T=68 F). Even if you make a (probably tacit and unconscious) assumption that the only thing we ever test is a difference of means, you can't be sure what the interpretation of Ho will be until you see just how the research question was quantified.

I teach that alpha cannot be set just by a statistician, because it depends on the consequences of the decision being made

So far I agree, as have many other respondents.

more important (expensive, life-affecting) decisions need more evidence in support of them than minor ones that may be retrieved if further evidence suggests that one's conclusion was not well-founded. And more evidence translates to smaller alphas.

This seems a common attitude, but I strongly disagree. If the decision is important then, yes, it should be made carefully. But this does not mean leaning towards the null hypothesis, regardless of all else.

Imagine that an inexpensive, totally safe new treatment for some currently untreatable fatal disease is being tested, but the test must be small (perhaps the disease is rare, so available patients are few). The preceding argument would say that because the test is so important, we must have improvement significant at some tiny alpha, before recommending use of the treatment. I would say quite the opposite: almost any evidence of improvement at all should lead to adoption of the treatment. Why? Because in this case there is little if any cost to a Type I error, but considerable cost to a Type II error (assuming H0 is no effect). There might be indirect costs of adopting an ineffective, or barely effective treatment (e.g. getting patients' hopes up, or reducing effort at finding other treatments) but these could be managed in ways other than avoiding use of the tested treatment, i.e. by emphasizing the uncertainty about the effectiveness of the treatment.

- Andy Taylor, Department of Zoology, University of Hawaii at Manoa, ataylor@lala.zoo.hawaii.edu

Robert W. Hayden, Department of Mathematics, Plymouth State College, Plymouth, New Hampshire 03264, hayden@oz.plymouth.edu

Date: Thu, 22 Sep 94 10:31:42 EDT

From: "Karl L. Wuensch" PSWUENSC@ECUVM1

Subject: alpha: how do we set it?

Another interesting chapter on this topic is "The Inference Revolution" in Gigerenzer & Murray's *Cognition as Intuitive Statistics* (Lawrence Erlbaum, 1987). A few quotes (inserted parenthetical material is mine):

"The choice of the decision criterion (the critical value, determined by the alpha one is willing to accept) allows a balance between these two errors (Type I and Type II), depending on such considerations as the costs of the respective errors, which are heavily content-dependent and lie outside of the statistical theory."

"The decision criterion has to be chosen by weighing the possible consequences of each error. This balance of utilities must be based on informed personal judgment: the formal statistical theory does not stipulate how this balance should be achieved. Here is the dividing line between the statistical and subjective, or behavioral, parts of the theory (Neyman- Pearson). Once we have agreed on a decision criterion, then the statistical theory tells us exactly the probability of Type I and Type II errors and their relationship to the size n of the sample we choose."

The authors provide an example involving industrial quality control, give some history on the origins of the .05 level of significance as a common standard, and other interesting things.

Raymond Nickerson (2000, Null hypothesis significance testing: A review of an old and continuing controversy, *Psychological Methods*, *5*, 241-301) addresses the controversy about how the criterion of statistical significance should be set - see


Dr. Karl L. Wuensch

This page most recently revised on 23. July 2001.