Tests of Equivalence and Confidence Intervals for Standardized Effect Sizes


    Point or sharp null hypotheses specify that a parameter has a particular value -- for example, (μ1 - μ2) = 0, or ρ = 0.  Such null hypotheses are highly unlikely ever to be true.  They may, however, be close to true, and it may be more useful to test range or loose null hypotheses that state that the value of the parameter of interest is close to a hypothetical value.  For example, one might test the null hypothesis that the difference between the effect of drug G and that of drug A is so small that the drugs are essentially equivalent.  Biostatisticians do exactly this, and they call it bioequivalence testing.

   Steiger (2004) presents a simple example of bioequivalence testing.  Suppose that we wish to determine whether or not generic drug G is bioequivalent to brand name drug B.  Suppose that the FDA defines bioequivalence as bioavailability within 20% of that of the brand name drug.  Let θ1 represent the lower limit (bioavailability 20% less than that of the brand name drug), θ2 the upper limit (bioavailability 20% greater than that of the brand name drug), and θG the bioavailability of the generic drug.  A test of bioequivalence amounts to pitting the following two hypotheses against one another:

HNE: θG < θ1 or θG > θ2 -- the drugs are not equivalent
HE: θ1 ≤ θG ≤ θ2 -- the drugs are equivalent -- note that this is a range hypothesis

     In practice, this amounts to testing two pairs of directional hypotheses:
H0: θG ≤ θ1 versus H1: θG > θ1 and
H0: θG ≥ θ2 versus H1: θG < θ2.

    If both of these null hypotheses are rejected, then we conclude that the drugs are equivalent.  Alternatively, we can simply construct a confidence interval for θG -- if the confidence interval falls entirely within θ1 to θ2, then bioequivalence is established.

    Steiger (2004) opines that tests of equivalence (also described as tests of close fit) have a place in psychology too, especially when we are interested in demonstrating that an effect is trivial in magnitude.  Steiger recommends the use of confidence intervals, dispensing with the traditional NHST procedures (computation of test statistic, p value, decision).

    Suppose, for example, that we are interested in determining whether or not two different therapies for anorexia are equivalent.  Our criterion variable will be the average amount of weight gained during a two-month period of therapy.  By how much would the groups need to differ before we would say they differ by a nontrivial amount?  Suppose we decide that a difference of less than three pounds is trivial.  The hypothesis that the difference (D) is trivial in magnitude can be evaluated with two simultaneous one-sided tests:

H0: D ≤ -3 versus H1: D > -3 and
H0: D ≥ 3 versus H1: D < 3

    After obtaining our data, we simply construct a confidence interval for the difference in the two means.  If that confidence interval is entirely enclosed within the "range of triviality," -3 to +3, then we retain the loose null hypothesis that the two therapies are equivalent.  What if the entire confidence interval is outside the range of triviality?  I assume we would then conclude that there is a nontrivial difference between the therapies.  If part of the confidence interval is within the range of triviality and part outside the range, then we suspend judgment and wish we had obtained more data and/or less error variance.  Of course, if the confidence interval extended into the range of triviality but not all the way to the point of no difference then we would probably want to conclude that there is a difference but confess that it might be trivial.
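
    The decision rule just described can be written out as a short Python sketch.  The function name classify_interval and the example interval limits are hypothetical, chosen only for illustration; the interval itself is assumed to have been computed already by whatever method is appropriate for the data.

```python
def classify_interval(lower, upper, margin):
    """Compare a confidence interval for a difference with the range of
    triviality (-margin, +margin) and return a verbal conclusion."""
    if lower > upper:
        raise ValueError("lower must not exceed upper")
    if -margin < lower and upper < margin:
        return "equivalent"                    # whole interval inside the range
    if upper <= -margin or lower >= margin:
        return "nontrivial difference"         # whole interval outside the range
    if lower > 0 or upper < 0:
        return "difference, possibly trivial"  # excludes zero, overlaps the range
    return "suspend judgment"                  # includes zero and crosses a boundary


# Hypothetical intervals (in pounds) judged against a margin of 3 pounds.
print(classify_interval(-1.2, 2.1, 3))  # equivalent
print(classify_interval(3.5, 7.0, 3))   # nontrivial difference
print(classify_interval(0.8, 4.6, 3))   # difference, possibly trivial
print(classify_interval(-1.0, 4.6, 3))  # suspend judgment
```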

    Psychologists often use instruments which produce measurements in units that are not as meaningful as pounds and inches.  For example, suppose that we are interested in studying the relationship between political affiliation and misanthropy.  We treat political affiliation as dichotomous (Democrat or Republican) and obtain a measure of misanthropy on a 100-point scale.  The point null is that mean misanthropy in Democrats is exactly the same as that in Republicans.  While this hypothesis is highly unlikely to be true, it could be very close to true.  Can we construct a loose null hypothesis, as we did for the anorexia therapies?  What is the smallest difference between means on the misanthropy scale that we would consider to be nontrivial?  Is a 5-point difference small, medium, or large?  Faced with questions like this, we often resort to using standardized measures of effect size.  In this case, we could use Cohen's d, the standardized difference between means.  Suppose that we decide that the smallest difference that would be nontrivial is d = .1.  All we need to do is get our data and then construct a confidence interval for d.  If that interval is totally enclosed within the range -.1 to .1, then we conclude that affiliates of the two parties are equivalent in misanthropy, and if the entire confidence interval is outside the range, then we conclude that there is a nontrivial difference between the parties.

    So, how do we get a confidence interval for d?  Regrettably, it is not as simple as finding the confidence interval in the raw unit of measure and then dividing the upper and lower limits by the pooled standard deviation.  Because we are estimating both means and standard deviations, we will be dealing with noncentral distributions (see Cumming & Finch, 2001; Fidler & Thompson, 2001; Smithson, 2001).  Iterative computations that cannot reasonably be done by hand will be required.  There are, out there on the Internet, statistical programs designed to construct confidence intervals for standardized effect size estimates, but I think it unlikely that such confidence intervals will be commonly used unless and until they are incorporated in major statistical packages such as SAS, SPSS, BMDP, Minitab, and so on.  I have, on my SAS Program Page and my SPSS Program Page, programs for constructing confidence intervals for Cohen's d.
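
    To give a sense of what the iterative computation involves, here is a minimal Python sketch; it is not the SAS or SPSS program mentioned above.  It assumes two independent groups, d computed with the pooled standard deviation, and at least 2 degrees of freedom.  The function names (nct_cdf, d_confidence_interval) and the sample sizes in the example are hypothetical.  The noncentral t distribution function is approximated here by Simpson's rule rather than taken from a statistical library, and the noncentrality parameters are found by bisection, so treat the results as approximate.

```python
import math


def _phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def nct_cdf(t, df, ncp, steps=1000):
    """Approximate P(T <= t) for a noncentral t variable with df >= 2.

    Uses T = (Z + ncp) / S with S = sqrt(chi-square / df), so P(T <= t) is
    the integral over s of Phi(t*s - ncp) times the density of S.
    The integral is evaluated by Simpson's rule (steps must be even).
    """
    log_const = (math.log(2.0) + 0.5 * df * math.log(df / 2.0)
                 - math.lgamma(0.5 * df))

    def integrand(s):
        if s <= 0.0:
            return 0.0  # the density of S vanishes at zero when df >= 2
        log_density = log_const + (df - 1) * math.log(s) - 0.5 * df * s * s
        return _phi(t * s - ncp) * math.exp(log_density)

    s_max = 1.0 + 12.0 / math.sqrt(2.0 * df)  # far into the upper tail of S
    h = s_max / steps
    total = integrand(0.0) + integrand(s_max)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(i * h)
    return total * h / 3.0


def _solve_ncp(t, df, target):
    """Find ncp with nct_cdf(t, df, ncp) == target; the CDF falls as ncp rises."""
    lo, hi, width = t - 1.0, t + 1.0, 1.0
    while nct_cdf(t, df, lo) < target:   # widen the bracket downward
        width *= 2.0
        lo -= width
    while nct_cdf(t, df, hi) > target:   # widen the bracket upward
        width *= 2.0
        hi += width
    for _ in range(50):                  # bisection
        mid = 0.5 * (lo + hi)
        if nct_cdf(t, df, mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def d_confidence_interval(d, n1, n2, confidence=0.95):
    """Approximate confidence interval for Cohen's d, two independent groups."""
    scale = math.sqrt(n1 * n2 / (n1 + n2))  # converts d to t and back
    t_obs = d * scale
    df = n1 + n2 - 2
    alpha = 1.0 - confidence
    ncp_lower = _solve_ncp(t_obs, df, 1.0 - alpha / 2.0)
    ncp_upper = _solve_ncp(t_obs, df, alpha / 2.0)
    return ncp_lower / scale, ncp_upper / scale


# Hypothetical data: observed d = .05 with 200 cases in each group.
low, high = d_confidence_interval(0.05, 200, 200, confidence=0.95)
print(f"95% CI for d: ({low:.3f}, {high:.3f})")
```

    The resulting limits can then be compared with the range of triviality, here -.1 to .1, exactly as one would compare a raw-score interval with -3 to +3 pounds.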

    Steiger (2004) argues that when testing for close fit, the appropriate confidence interval for testing range hypotheses is a 100(1 - 2α)% confidence interval.  For example, with the traditional .05 criterion, use a 90% confidence interval, not a 95% confidence interval.  His argument is that the estimated effect cannot be small in both directions, so the confidence coefficient is relaxed to provide the same amount of power that would be obtained with a one-sided test.  I am not entirely comfortable with this argument, especially after reading the Monte Carlo work by Serlin & Zumbo (2001).
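
    The correspondence between two one-sided tests at α and a 100(1 - 2α)% confidence interval can be checked numerically.  The Python sketch below is only an illustration under simplifying assumptions: it uses large-sample z tests with a known standard error rather than t tests, and the function names, the standard error of 1 pound, and the observed differences are all hypothetical.  It reuses the three-pound range of triviality from the anorexia example.

```python
from statistics import NormalDist


def tost_rejects_both(diff, se, margin, alpha=0.05):
    """Two one-sided z tests of H0: D <= -margin and H0: D >= +margin."""
    z_crit = NormalDist().inv_cdf(1.0 - alpha)
    reject_lower = (diff + margin) / se > z_crit   # evidence that D > -margin
    reject_upper = (diff - margin) / se < -z_crit  # evidence that D < +margin
    return reject_lower and reject_upper


def interval_inside_range(diff, se, margin, confidence):
    """Is the z-based confidence interval wholly inside (-margin, +margin)?"""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return -margin < diff - z * se and diff + z * se < margin


# Columns: observed difference, both one-sided tests reject at .05,
# 90% interval inside the range, 95% interval inside the range.
for diff in (0.0, 0.8, 1.2, 2.5):
    print(diff,
          tost_rejects_both(diff, 1.0, 3.0, alpha=0.05),
          interval_inside_range(diff, 1.0, 3.0, confidence=0.90),
          interval_inside_range(diff, 1.0, 3.0, confidence=0.95))
```

    With these hypothetical numbers the 90% interval agrees with the pair of one-sided tests in every row, while the 95% interval is more conservative: for an observed difference of 1.2 it extends past +3 even though both one-sided tests reject.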

References



Contact Information for the Webmaster,
Dr. Karl L. Wuensch



This page most recently revised on 14. May 2005.