Pairwise Comparisons

East Carolina University
Department of Psychology

Almost all of the techniques designed to control familywise error rate when making pairwise or other multiple comparisons do not require that one first conduct an ANOVA, and if one does conduct such an ANOVA, it need not be significant to use such multiple comparison procedures. It seems that almost nobody knows this, and the few who do take a lot of undeserved crap from the ignorant experts that review their manuscripts.

Here are a couple of quotes from not-ignorant experts:

"In fact, if one of these procedures is used, there seems to be little reason for applying the F test at all! In light of the results just cited, it might seem that the F test could be abandoned completely." From page 36, Wilcox, R. R. (1987). New designs in analysis of variance. Annual Review of Psychology, 38, 29-60.

"In the past, the general advice was that without a significant <ANOVA> group effect, individual comparisons were inappropriate. ... However, this is a case where the general advice is wrong. The logic behind most of our multiple comparison procedures does not require overall significance before making specific comparisons. ... requiring overall significance will actually change the FW <familywise error rate>, making the multiple comparisons tests <unreasonably> conservative. These tests were designed, and their significance levels established, without regard to the overall F." From pages 372 and 373, Howell, D. C. (2010). Statistical methods for psychology (8th ed.). Belmont, CA: Cengage Wadsworth. ISBN-13: 978-1-111-83548-4.

ANOVA Significant, REGWQ Not. Yes, it is possible for the ANOVA to be significant but the REGWQ to show no significant pairwise differences. The null in the ANOVA is that all of the means are identical. Rejecting that null indicates that the data do not fit that hypothesis well – but which means differ enough in our samples for us to be confident that they also differ in the populations? Pairwise comparisons attempt to answer that question, but may be more conservative than the omnibus ANOVA. Also, there may be a linear contrast involving the means that is significant but is not a pairwise contrast. For example, when completing this four-group one-way ANOVA assignment, one of my students obtained a significant ANOVA but REGWQ indicated no significant pairwise differences. While this seems to be a paradox, it really is not. If he had tested the combined male groups versus the combined female groups, he would have found that the mean pulse rate increase was significantly greater for the men than for the women. Later he will employ a factorial ANOVA to make exactly that contrast, and will find that the main effect of sex/gender is significant (at the .005 level). So, what to report in this circumstance? It is probably best to simply report that the ANOVA was significant but there were no significant pairwise contrasts, but it is unlikely that reviewers of the manuscript would find that satisfying. It might be worth testing non-pairwise contrasts that make sense, like the two male groups combined versus the two female groups combined, or the two sexual infidelity groups combined versus the two emotional infidelity groups combined. Check out Steve Simon’s thoughts on this.
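The apparent paradox above can be reproduced with a few lines of arithmetic. The sketch below uses made-up summary statistics (not the student's data): four groups of 10 whose means fall in two tight clusters. The critical values are approximate .05 table values typed in by hand, and the widest-pair range test stands in for REGWQ, on the assumption that REGWQ tests the full range of k means at the same critical value as Tukey's HSD and stops if that first test fails.

```python
import math

# Hypothetical summary statistics (NOT the student's actual data):
# four groups of n = 10, means in two tight clusters, pooled MSE = 30.
means = {"male_sexual": 16.0, "male_emotional": 16.0,
         "female_sexual": 10.0, "female_emotional": 10.0}
n = 10
mse = 30.0
k = len(means)
df_error = k * (n - 1)  # 36

# Approximate .05 critical values from standard tables for 36 error df
# (assumed values, typed in rather than computed).
F_CRIT = 2.87   # F(3, 36)
Q_CRIT = 3.81   # studentized range, 4 means, 36 df
T_CRIT = 2.03   # two-tailed t, 36 df

# Omnibus F: between-groups mean square over the pooled error term.
grand_mean = sum(means.values()) / k
ss_between = n * sum((m - grand_mean) ** 2 for m in means.values())
f_omnibus = (ss_between / (k - 1)) / mse

# Studentized range statistic for the widest pair of means.
q_widest = (max(means.values()) - min(means.values())) / math.sqrt(mse / n)

# Non-pairwise contrast: combined male groups versus combined female groups.
coefs = {"male_sexual": 0.5, "male_emotional": 0.5,
         "female_sexual": -0.5, "female_emotional": -0.5}
estimate = sum(coefs[g] * means[g] for g in means)
std_error = math.sqrt(mse * sum(c ** 2 for c in coefs.values()) / n)
t_contrast = estimate / std_error

print(f"omnibus F = {f_omnibus:.2f}, significant: {f_omnibus > F_CRIT}")
print(f"widest-pair q = {q_widest:.2f}, significant: {q_widest > Q_CRIT}")
print(f"male vs. female t = {t_contrast:.2f}, significant: {abs(t_contrast) > T_CRIT}")
```

With these numbers the omnibus F and the male-versus-female contrast both exceed their critical values while the widest pairwise range does not, which is the pattern described above.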

Here are some comments taken from the statistical consultants list, STAT-L.

Date: Fri, 15 Sep 95 15:27:28 EDT
From: "Karl L. Wuensch" <PSWUENSC@ECUVM1>
Subject: Re: ANOVA vs Tukey HSD test
To: Multiple recipients of list STAT-L <STAT-L%MCGILL1.BITNET@VTBIT.CC.VT.EDU>

    In Psychology it is common to run an omnibus ANOVA and then if and
only if that ANOVA is significant do pairwise comparisons with a
procedure like Tukey, Newman-Keuls, etc. This is really rather foolish,
since these post-hoc procedures were designed to control familywise error
rates IN THE ABSENCE OF ANY SIGNIFICANT PRIOR OMNIBUS ANALYSIS. I tell
my students that protecting oneself from the dreaded Type I error by
requiring both significance of an omnibus test and a post-hoc test like
the Tukey HSD is as paranoid as protecting oneself from sexually
transmitted diseases by both wearing a condom and abstaining from sex.

    I refer my students to an old article by Ryan (Comments on Orthogonal
Components, Psych. Bull., 1959, 56: 394-396). I quote: "The newer methods
for multiple comparisons do NOT require an initial F test of the over-all
variation between groups. All of the methods mentioned in the present note
can be applied immediately, without such a preliminary test."

    Members of the STAT-L were recently asked:

>I am running a one way ANOVA, and testing significance between groups using
>the tukey HSD test. The ANOVA shows a statistically significant between group
>difference. However the tukey HSD shows no pair of groups that are different
>from each other. <how can this be?>

Hans-Peter Piepho <piepho@WIZ.UNI-KASSEL.DE> also responded:

>If you apply two different tests to the same problem, there is no guarantee
>you get the same result. You should decide a priori which test to use.

>Some might argue as follows: I do an F-test, and if this is significant, I
>do multiple comparisons (MCs). Well, if you do MCs by Fisher's LSD, this is
>Fishers protected LSD, which controls the familywise error (FWE) rate of
>making at least one false inference ONLY IN CASE THE GLOBAL NULL IS TRUE,
>i.e. no treatment differences at all, but not otherwise.

>TUKEY's TEST NEEDS NO PROTECTION BY PRELIMINARY F-TEST! Tukey's test
>controls the FWE without a preliminary F-test, also when there are real
>treatment differences. Doing the Tukey test only when the F-test is
>significant, is an "over-protection".

    Why do so many researchers continue with the practice of requiring
a significant preliminary test followed by a very conservative post-hoc test??

Karl L. Wuensch, Dept. of Psychology, East Carolina Univ.
Greenville, NC 27858-4353, phone 919-328-4102, fax 919-328-6283
Bitnet Address: PSWUENSC@ECUVM1
Internet Address: PSWUENSC@ECUVM.CIS.ECU.EDU

Date: Mon, 18 Sep 95 11:47:34 -0700
From: Patrick Onghena <Patrick.Onghena@psy.kuleuven.ac.be>
Organization: Katholieke Universiteit Leuven

"Karl L. Wuensch" <PSWUENSC@ECUVM.CIS.ECU.EDU> wrote:
..
> Why do so many researchers continue with the practice of requiring
>a significant preliminary test followed by a very conservative post-hoc test??

Because textbooks describe them (e.g. Tukey's HSD) as "a posteriori" (e.g.,
Kirk)? "A posteriori" should mean "without any a priori hypotheses", but it can
easily be misunderstood as "after the omnibus F-test".

Date: Mon, 18 Sep 1995 09:48:01 +0100
Sender: Stat-l Discussion List <STAT-L%MCGILL1.BITNET@VTBIT.CC.VT.EDU>
From: "Mr. N.W.A. Marsh" <jw34@LIVERPOOL.AC.UK>

I mail simply to state that I experience the same problem as a statistician
in a psychology department here in the UK, and thus share Karl Wuensch's
frustration. I would say that within the UK this is widespread rather than
being special to my own institution.

I suppose that the explanation must be that on both sides of the Atlantic
an ageing generation of psychologists has been taught statistics in a
manner which has proved extremely convincing. From what I have seen the
statistical curriculum for psychologists has been generally sensible and
well-thought out, but in this context we have what I would agree is a
(pervasive) error.

Norman Marsh
University of Liverpool

Sender: Stat-l Discussion List <STAT-L%MCGILL1.BITNET@VTBIT.CC.VT.EDU>
From: "T. A. Ryan" <TAR1@PSUVM.PSU.EDU>
Subject: "Post hoc" tests

    The "post hoc" discussion, some comments:

    1. It is surprising that the Newman-Keuls test is still being seriously
recommended. Fortunately Hans-Peter Piepho pointed out that it does not
control the Type I error experimentwise. I am surprised that it is still
around since Tukey pointed out its deficiencies in his well-known but
unpublished monograph in 1953, and the problem has been pointed out
many times since then.
    The method controls the Type I error rate only if all of the
population means are equal. In cases where there are several groups of
equal populations, the error rate can climb sharply. For example, if
there are 5 pairs of equal populations (10 samples being compared at the
.05 level) the experimentwise error rate is .23. Tukey's methods were
developed to correct this error. They set an upper limit on the error
rate for all possible partial null hypotheses.
    To use the Newman-Keuls method to gain power is the equivalent
of gaining power by raising the alpha level of the test. This can be
done with any test procedure, but then you have to publicly raise the
alpha level, rather than doing it under the table.

    Note that Tukey's monograph has now been published along with several
other papers on multiple testing:

COLLECTED WORKS OF JOHN W. TUKEY, Volume VIII, Edited by Henry I.
Braun, Publisher: Chapman & Hall, 1994.

    The monograph is still an important source for understanding the
problems of multiple testing.

    2. The term "post hoc" is misleading and should be replaced. It
implies falsely that the methods of multiple comparison or multiple
testing are needed only if the tests were not planned in advance.
Multiple tests are multiple tests whether planned in advance, or not.
The more tests, the higher the probability of error and planning does
not affect this fact. The only effect of planning is that it might
reduce the number of tests performed.

    We should speak only of "methods of multiple comparison" or
"methods of multiple testing."

Thomas A. Ryan
Professor of Psychology, Emeritus
Cornell University
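Ryan's .23 figure for Newman-Keuls can be checked with one line of arithmetic. This is a sketch under a stated assumption: the five pairs are far enough apart that each true null (the two equal means within a pair) ends up tested as a two-mean range at the nominal .05 level, and the five tests are treated as independent.

```python
# Assumption for this sketch: five true nulls, each tested at the nominal
# level, treated as independent tests.
alpha = 0.05
true_nulls = 5

# Probability of at least one false rejection among the five tests.
experimentwise_rate = 1 - (1 - alpha) ** true_nulls
print(f"experimentwise error rate = {experimentwise_rate:.4f}")
```

The result rounds to the .23 quoted in the post.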

Date: Sat, 2 Dec 1995 11:38:28 -0500
From: Gregory Robert Hancock <ghancock@WAM.UMD.EDU>

    If I may add to some comments made by Dr. Ryan (Professor of Psychology,
Emeritus, Cornell University):

> 1. It is surprising that the Newman-Keuls test is still being seriously
> recommended....
> The method controls the Type I error rate only if all of the
> population means are equal....

    I agree, with one exception. In the case of k =3 treatment groups,
Newman-Keuls (NK) controls experimentwise error in the strong sense (i.e.,
over all possible configurations of partial null hypotheses). There have
been numerous modifications of the NK procedure in an attempt to "close"
it (i.e., to achieve strong experimentwise error control), including
those started by Dr. Ryan in highly regarded 1959 and 1960 Psychological
Bulletin pieces. Further modifications by Einot & Gabriel (1975, JASA),
Welsch (1977, JASA), and Peritz (1970, unpub) have sought to offer more
power than Tukey's method while facilitating the desired closure. For
the case of k = 3, these modifications all become equivalent to the NK
method. For k > 3, Tukey's method, the aforementioned modifications of NK,
or a method by Hayter (1986, JASA) are all viable alternatives to the weak
NK method.

    It was my (possibly mistaken) understanding that the term "post hoc"
meant "following an omnibus F test over all between-group variability."
Such an F test, often believed to offer "protection" from the dangers of
Type I error, only offers that protection weakly. That is, if the
complete null is true (i.e., if all population means are equal), an
F-test does provide experimentwise Type I error control. However, so
does Tukey's test or other methods all by themselves making the F-test
redundant and in fact potentially inhibiting to experimental power. When
partial nulls exist (i.e., when only some population means are equal),
the omnibus F is easily rejected because of the false nulls in the mean
set, and experimentwise Type I error control is only as good as the
so-called "post hoc" method you're using. Because the omnibus F test
doesn't really do much statistically (and in fact, does little
substantively as well), its utility is in question. Thus, because many
multiple comparison procedures do not require an omnibus F, the term
"post hoc" is often a misnomer. I concur with Dr. Ryan that more precise
language needs to be used.
=============================================================
Footnote from Karl Wuensch

    The k = 3 case is really a special case. With only three groups, even the
Fisher's LSD test adequately controls familywise error, as long as you
employ the omnibus ANOVA as a protection -- that is, do the omnibus ANOVA
first, if and only if that is significant, do the multiple comparisons --
they really are just multiple t-tests, usually with a pooled error term, and, of course,
if the variances were heterogeneous, one should use separate variances t's
instead. There was an article in Psychological Bulletin not long ago
explaining how Fisher's LSD really does control familywise alpha adequately
when k = 3 (and extending the procedure to some other 2 df situations) -- and
it certainly does afford one with more power (Psychol. Bull. 115: 153-159).

Date: Fri, 26 Apr 1996 15:22:00 EDT
From: "T. A. Ryan" <TAR1@PSUVM.PSU.EDU>
Subject: post-hoc, etc.

On April 25th, Steven J. Pierce wrote:

>As I understand it, a priori (planned) comparisons don't require a
>significant overall F-test. If you've got a reason to expect one group to
>be different it seems entirely logical to go ahead with planned
>comparisons without a significant F-test.

This statement involves several misconceptions. Contrary to the implications
of the statement the following facts should be recognized:

    1. Tukey, Bonferroni, and most other multiple comparison procedures DO NOT require
a preliminary F test, whether the tests are considered "post-hoc" or a priori.
The habit of doing an F test first is a holdover from the days when the
Fisher "LSD" method was the only method available. LSD is the only commonly
used procedure in which the F test is an integral part of the method, and
it is inadequate for other reasons.

    2. A priori or planned tests require correction for multiple tests just as
much as "post-hoc" tests. If multiple tests are made, the probability of
Type I errors familywise needs to be taken into account whether the tests
are preplanned or not. The only way in which planning can affect the
procedure is in the rare case where the planning reduces the number of tests.
In that case the Bonferroni method can be changed by reducing the denominator
of the correction.

    3. There is no justification whatever for the notion that planning allows
us to use uncorrected t tests. This notion is perpetrated in a number of
textbooks but never given any logical justification. It is simply stated
that it is "self evident." It is a dangerous notion, since those
who want significance at all costs can always claim they planned their
tests in advance. Whether they did or not is actually irrelevant.

    4. We should stop labeling the Tukey and other tests for multiple comparisons
as "post-hoc" tests. The term is misleading because corrections for multiple
comparison are needed whether we plan the tests in advance, or not.

T. A. Ryan, Sr.
Professor of Psychology, Emeritus
Cornell University

    PS. It is ironic that Fisher's description of the "LSD" method specified
that the t's following the F test should be planned in advance. Yet the
method is now usually considered a "post hoc" procedure.
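Ryan's second point, that planning matters only when it shrinks the number of tests, can be illustrated with the Bonferroni denominator. A minimal sketch; the five-group design and the three planned comparisons are hypothetical numbers chosen for illustration.

```python
from math import comb

def bonferroni_alpha(familywise_alpha: float, n_tests: int) -> float:
    """Per-test alpha that keeps the familywise rate at or below the target."""
    return familywise_alpha / n_tests

familywise_alpha = 0.05
groups = 5                     # hypothetical number of groups
all_pairs = comb(groups, 2)    # every pairwise comparison among the groups
planned = 3                    # hypothetical smaller set planned in advance

alpha_all_pairs = bonferroni_alpha(familywise_alpha, all_pairs)
alpha_planned = bonferroni_alpha(familywise_alpha, planned)

print(f"{all_pairs} pairwise tests: per-test alpha = {alpha_all_pairs:.4f}")
print(f"{planned} planned tests:  per-test alpha = {alpha_planned:.4f}")
```

Planning buys a less stringent per-test level only because the denominator drops; it does not remove the need for a correction.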

Date: Fri, 26 Apr 1996 16:17:02 CDT
From: Ed Cook <SBSF081@UABDPO.DPO.UAB.EDU>
Subject: post-hoc, etc.

If one takes the position that one should always correct
for multiple comparisons, would that not apply to main effects and
interactions in the factorial design as well? In the 2 x 2, for example,
the two main effects and the interaction are equivalent to 3 t-tests
on the set of 4 cell means. I've never seen any recommendation
that *those* 3 tests should be corrected for multiple comparisons.
If not those, then why 3 orthogonal comparisons among 4 cell means
in a one-way design?

Perhaps it's critical that one be able to defend the set of comparisons
that one is doing. As an author and frequent journal reviewer, my
impression is that one cannot "always claim [one] planned [one's]
tests in advance." At least in my experience authors, editors, and
reviewers just aren't that naive. (But then, perhaps it is *I* who
is being naive.)

Ed Cook, Assoc Prof of Psychology, Univ of Alabama at Birmingham, USA
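Cook's 2 x 2 example can be written out as contrast coefficients on the four cell means. The sketch below checks that the two main effects and the interaction are mutually orthogonal, then computes what three uncorrected tests at .05 would cost. Treating the three tests as independent is an approximation (they share an error term), and the cell ordering is an assumption made for illustration.

```python
from itertools import combinations

# Assumed cell order: A1B1, A1B2, A2B1, A2B2
contrasts = {
    "main effect A": (1, 1, -1, -1),
    "main effect B": (1, -1, 1, -1),
    "A x B interaction": (1, -1, -1, 1),
}

def dot(u, v):
    """Sum of products of two coefficient vectors."""
    return sum(a * b for a, b in zip(u, v))

# Contrasts are orthogonal (with equal cell sizes) when every pair of
# coefficient vectors has a zero dot product.
orthogonal = all(dot(contrasts[a], contrasts[b]) == 0
                 for a, b in combinations(contrasts, 2))

# Familywise rate for three uncorrected tests, treated as independent.
alpha = 0.05
familywise_rate = 1 - (1 - alpha) ** len(contrasts)

print(f"mutually orthogonal: {orthogonal}")
print(f"familywise rate for {len(contrasts)} uncorrected tests = {familywise_rate:.4f}")
```

Orthogonality limits the number of tests to three; it does not hold the familywise rate at .05.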

Date: Sun, 28 Apr 1996 16:59:00 EDT
From: "T. A. Ryan" <TAR1@PSUVM.PSU.EDU>
Subject: post-hoc etc.

    I have argued for this recommendation a number of times over the past
30 or more years, but I have not succeeded in convincing those in
positions of influence, including Tukey. One argument that is offered
against this conclusion is that the three tests in a two-way ANOVA
can be considered as three independent experiments. Supporting this
position becomes complicated if there is interaction, and I do not find
it convincing. R. B. Darlington has a thoughtful discussion of the problem
of separating comparisons into separate families ("Regression and Linear
Models", 1990). There are no easy answers to this question and there is
no good basis for automatically treating the main effects and interactions
as separate families.

    A stronger reason for my failure is the general resistance among researchers
against making any corrections for multiple testing at all. Naturally,
they want all the power they can get, and it is expensive to do it by
getting more data. So they use any excuse they can find for doing
uncorrected tests. Unfortunately, editors often let them get away with it.

    I believe that the cost of Type I errors is badly underestimated. To the
researcher, Type II errors have great personal cost -- he can't get his paper
published or he misses his promotion. Our treatment of data, however,
ought to be based upon the cost to science -- is it really important if we
miss a small effect? Isn't it more important to find the big ones?
The cost of Type I error includes a lot of time wasted by researchers
trying to explain a non-existent effect. The falsely "significant" finding
can result in a furor of activity which gradually peters out because there
wasn't really any effect to work on. In practical research a Type I error
can mean the use of a treatment which really does no good. This is surely
an important cost.

***************************

    In a private communication, Bob Wheeler reminded me that I had failed to
mention the Scheffe method as one which requires an initial F test.
I apologize for the omission; my only excuse is that I was thinking in
the context of simple comparisons, the problem which started this series.
Scheffe's method is not commonly used for simple comparisons since it is
less powerful than Tukey's.

    The fact that Scheffe's method is good for complex contrasts reminds me
of another custom which I consider erroneous. This is the notion that, if
we test only orthogonal contrasts we are justified in testing each on
a per-test basis. I have never found any logical justification for this
idea. If we make several orthogonal tests, we are still making multiple
tests and the type I error rate per family or per experiment is increased
just as it is for non-orthogonal tests. It is true that restricting our
attention to orthogonal tests reduces the total number of tests so the
inflation of Type I error is less than that for unrestricted uncorrected
tests. On the other hand it is often difficult to find orthogonal contrasts
that are meaningful for the research problem.

Date: Mon, 29 Apr 96 11:37:06 EDT
From: "Karl L. Wuensch" <PSWUENSC@ECUVM1>
Subject: Re: post-hoc etc.

T. A. Ryan, in his delightful post, mentions a problem that has
bothered me for years: how does one define the "family" in "familywise
error rate." I guess I'll have to go find Darlington's book and read his
answer. Is "family" the set of comparisons I am doing in this "experiment?"
Why not the set of comparisons I am doing in all experiments reported in this
manuscript? Why not all that I expect to do this year? Why not all that
faculty in my department expect to do in this decade?

Date: Mon, 29 Apr 1996 12:10:00 -0600
From: Michael Lacy <mglacy@LAMAR.COLOSTATE.EDU>
Organization: Colorado State University, Fort Collins, CO 80523

One logical extension is to consider all hypothesis tests ever performed
by a particular investigator, the "careerwise error rate." If we apply
appropriate corrections to p-values based on careerwise comparisons, a very
laudatory democratizing of the practice of statistical testing would
occur: Young investigators, with fewer past comparisons, would have
less stringent per-test alpha levels, thus enabling them to be
more likely to get "significant" and therefore publishable results.
This is good, since their need for publications is greatest.
Older investigators, at some point in their careers, would cease
hypothesis testing at all, since they would have to use such
stringent per-test alpha levels as to drastically cut the power
of their tests. All in all, this seems like a good thing to
me. <Grin>
=-=-=-=-=-=-=-=-=-==-=-=-=
Mike Lacy, Sociology Dept., Colo. State Univ. FT COLLINS CO 80523

From: Meredith_Warshaw@frankston.com
To: sci.stat.consult%news@iecc.com
Subject: Re: post-hoc etc.
Date: Mon, 29 Apr 1996 15:31 -0400

Michael Lacy wrote:
<<One logical extension is to consider all hypothesis tests ever performed
by a particular investigator, the "careerwise error rate.">> Actually,
a few years ago Murray Jorgensen posted the following suggestion for
dealing with careerwise error rates:

Date: 08/06/92 08:01:38 PM
Subject: A Market in \alpha s.

Dear STAT-L,

    It would be a disgrace for the statistical profession
if any true null hypothesis were ever wrongly rejected
by an approved statistician, so I have a modest proposal
to make to make this state of affairs unlikely and so
bring our profession to new heights of public respect.

    The idea is to Bonferronize all inferences published
from the date of the inception of the scheme ad infinitum.

    We will base the allocation on the geometric series:

$$ \sum_{n=1}^{\infty} 2^{-n} = 1 $$

    Each statistician to be registered must purchase a
personal alpha from the authority to be created. The
alphas would be available in sizes 2^{-n} starting at
n = 6 so that the total of the sold personal alphas
would be less than 1/32.

    The authority would distribute the alphas on a
'more market' basis by auctions, tenders, etc.

    Each statistician, when approving a significance claim
would use as level their own personal alpha divided by
2, 4, 8, etc. Each significance test would be logged with
the central authority.

    There would be a market in 'second-hand' alphas like that
in personalized 'vanity' car plates. A thoughtful parent
may well secure their child's future as a statistician by
purchasing and laying aside an \alpha for their child.

    As a special concession it will be possible to use
Milton Friedman's nonparametric two-way ANOVA without
dividing one's alpha by 2.

    Some may feel that this will lead to rather conservative
inferential behaviour, but I feel that conservatism is
nothing to be ashamed of and that protection against
false claims is something worth making a stand about.
--
Murray A. Jorgensen [ maj@waikato.ac.nz ] University of Waikato
Department of Mathematics and Statistics Hamilton, New Zealand
__________________________________________________________________
'Tis the song of the Jubjub! the proof is complete,
if only I've stated it thrice.'
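For readers checking the arithmetic of the proposal: assuming the personal alphas are distinct sizes of the form 2^{-n} with n starting at 6, the 1/32 ceiling is the tail of the same geometric series.

```latex
$$ \sum_{n=6}^{\infty} 2^{-n} = \frac{2^{-6}}{1 - \tfrac{1}{2}} = 2^{-5} = \frac{1}{32} $$
```

Any finite collection of sold alphas therefore totals strictly less than 1/32.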

    Meanwhile, Karl Wuensch asked:
<<T. A. Ryan, in his delightful post, mentions a problem that has
bothered me for years: how does one define the "family" in "familywise
error rate." >>

    According to Barney (the purple philosopher of tripe) "a family is people and
a family is love.." :-)

Scheffe's approach doesn't require a preliminary F-test in order to
maintain its Type I protection; in fact, it's redundant. If you find
any contrast significant using a Scheffe approach, you will by
definition also have a significant F-test.
-----------------------------------------------------------------------------
David Nichols Senior Support Statistician SPSS, Inc.
Phone: (312) 329-3684 Internet: nichols@spss.com Fax: (312) 329-3668
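The redundancy Nichols describes follows from a simple identity: the contrast whose coefficients are the deviations of the group means from the grand mean has a sum of squares equal to the whole between-groups sum of squares, and no contrast can exceed it. Scheffe's criterion compares a contrast's F divided by (k - 1) with the omnibus critical value, so the best possible contrast reproduces the omnibus F exactly. A sketch with hypothetical summary statistics and equal group sizes:

```python
# Hypothetical summary statistics: four groups of n = 8, pooled MSE = 20.
means = [12.0, 15.0, 9.0, 14.0]
n = 8
mse = 20.0
k = len(means)

# Omnibus F from the between-groups sum of squares.
grand_mean = sum(means) / k
ss_between = n * sum((m - grand_mean) ** 2 for m in means)
f_omnibus = (ss_between / (k - 1)) / mse

# The maximal contrast: coefficients are the deviations from the grand mean.
coefs = [m - grand_mean for m in means]
psi = sum(c * m for c, m in zip(coefs, means))
ss_contrast = n * psi ** 2 / sum(c ** 2 for c in coefs)
f_contrast = ss_contrast / mse          # single-df contrast F

# Scheffe compares f_contrast / (k - 1) with the omnibus critical F.
scheffe_stat = f_contrast / (k - 1)

print(f"SS between = {ss_between:.1f}, SS of maximal contrast = {ss_contrast:.1f}")
print(f"omnibus F = {f_omnibus:.2f}, Scheffe statistic = {scheffe_stat:.2f}")
```

Because the two statistics coincide for the maximal contrast, some Scheffe contrast is significant exactly when the omnibus F is.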

Date: Fri, 21 Feb 1997 07:47:08 -0500
From: Greg Hancock <ghancock@WAM.UMD.EDU>

John Reece wrote...

>I would like to garner some opinion on the relationship between post-hoc
>means comparison procedures and the omnibus F test. The traditional view in
>teaching psychology students (and I suspect students from many other
>disciplines) is that one should not carry out exploratory pairwise means
>comparisons unless an omnibus F test indicates significance at some
>arbitrary value, usually .05. However, several sources (Howell & Wilcox to
>name two) indicate that, far from being a requirement, exploratory means
>comparisons should be carried out regardless of the significance of the
>overall F, or even in lieu of an overall F. This makes some sense to me,
>mainly because it is my understanding that exploratory means comparison
>procedures were developed independent of the notion of an overall omnibus
>test.

    I have many thoughts on this subject, few if any of which come down in
favor of an omnibus F. This does not mean, however, that I don't
think it should be taught. It is still a useful frame of reference
for understanding main effects in factorial designs (although these
could also be couched in a different "complex contrast" context). Anyway,
the first question I would ask is what do you mean by "exploratory?"

    If by "exploratory" you mean that this is pilot work, meant to be a
precursor to more rigorous statistical analyses on other samples, I guess
I'd say spare the F and go nuts with your t-tests. But don't make any
grandiose proclamations based on your findings. Save those to follow your
more formal cross-validation work.

    If by "exploratory" you mean that you're looking at some data
(probably sample means, specifically), and those data suggest some
particular comparisons that were not planned a priori, then I'm all in
favor of exacting some kind of familywise Type I error control.
However, the F-test is not the beast for doing this. It only works as a
control mechanism if the complete null hypothesis (that all population
means are equal) is true. When one population mean differs, and you have
sufficient power, you can get past the omnibus F pretty easily and then
make Type I errors on other null comparisons if you don't have another
control mechanism. Scheffe provides such a mechanism. The only value of
the omnibus F here is that it tells you if you should even bother with
Scheffe; if the omnibus F is not significant, then no Scheffe contrast or
comparison will be either. If the omnibus F is significant, then at least
one exploratory contrast or comparison will be as well (although it may
not be one you're interested in or one that makes sense).

    By the way, Scheffe's test is actually unnecessarily conservative. A
paper presented last year at the American Educational Research Association
(Klockars & Hancock) dealt with this issue, and has been addressed in
independent work by Ottaway & Harris (personal communication).

    An excerpt from a recent paper which addresses the omnibus F, among many
other topics, is presented below:

    "There are a number of problems associated with the requirement of
an omnibus test rejection prior to conducting multiple comparisons; we
will present four. First, and most simply, few research questions are
directly addressed by an omnibus test. In a well planned study, the
researcher's questions involve specific contrasts of group means; the
omnibus test addresses each question only tangentially. Some might argue
that the omnibus test is not present to answer questions; rather, it is
there to facilitate control over the rate of Type I error. This issue of
control, however, brings us to our second point: the belief that an
omnibus test offers protection is not completely accurate. When the
complete null hypothesis is true, weak familywise Type I error control is
facilitated by the omnibus test; but, when the complete null is false and
partial nulls exist, the F-test does not maintain strong control over the
familywise error rate.

"A third point, which Games (1971) so elegantly demonstrated in
his figures, is that the F-test may not be completely consistent with the
results of a pairwise comparison approach. Consider, for example, a
researcher who is instructed to conduct Tukey's test only if an
alpha-level F-test rejects the complete null. It is possible for the
complete null to be rejected but for the widest ranging means not to
differ significantly. This is an example of what has been referred to as
incoherence (Gabriel, 1969) or incompatibility (Lehmann, 1957). On the
other hand, the complete null may be retained while the null associated
with the widest ranging means would have been rejected had the decision
structure allowed it to be tested. This has been referred to by Gabriel
(1969) as nonconsonance. One wonders if, in fact, a practitioner in this
situation would simply conduct the MCP contrary to the omnibus test's
recommendation. Strangely enough, such a seeming breach of multiple
comparison ethics would have largely positive statistical ramifications as
we discuss in our next and final point.

"The fourth argument against the traditional implementation of an
initial omnibus F-test stems from the fact that its well-intentioned but
unnecessary protection contributes to a decrease in power. The first test
in a pairwise MCP, such as that of the most disparate means in Tukey's
test, is a form of omnibus test all by itself, controlling the familywise
error rate at the alpha-level in the weak sense. Requiring a preliminary
omnibus F-test amounts to forcing a researcher to negotiate two hurdles to
proclaim the most disparate means significantly different, a task that the
range test accomplished at an acceptable alpha-level all by itself. If
these two tests were perfectly redundant, the results of both would be
identical and the omnibus test would represent neither friend nor foe;
probabilistically speaking, the joint probability of rejecting both would
be alpha when the complete null hypothesis was true. However, the two
tests are not completely redundant; as a result the joint probability of
their rejection is less than alpha. The F-protection therefore imposes
unnecessary conservatism (see Bernhardson, 1975, for a simulation of this
conservatism). For this reason, and those listed before, we agree with
Games' (1971) statement regarding the traditional implementation of a
preliminary omnibus F-test:

'There seems to be little point in applying the overall F test
prior to running c contrasts by procedures that set the
familywise error rate <= alpha.... If the c contrasts express
the experimental interest directly, they are justified whether the
overall F is significant or not and [familywise error rate] is
still controlled. (Games, 1971, p. 560)'"

Whole passage from:

Hancock, G. R., & Klockars, A. J. (1996). The quest for alpha:
Developments in multiple comparison procedures in the quarter century
since Games (1971). Review of Educational Research, 66(3), 269-306.
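The fourth argument in the excerpt, that requiring both the omnibus F and the range test rejects less often than either test alone, can be explored with a small simulation under the complete null. This is a rough sketch: the design (four groups of 10), the seed, the number of replications, and the hand-typed approximate .05 critical values are all assumptions made for illustration.

```python
import math
import random

random.seed(1997)

K, N = 4, 10                 # hypothetical design: 4 groups of 10
DF_ERROR = K * (N - 1)       # 36
F_CRIT = 2.87                # approximate .05 table value, F(3, 36)
Q_CRIT = 3.81                # approximate .05 table value, q(4 means, 36 df)
REPS = 5000

f_rejects = q_rejects = both_reject = 0
for _ in range(REPS):
    # Complete null: every group is drawn from the same normal population.
    groups = [[random.gauss(0.0, 1.0) for _ in range(N)] for _ in range(K)]
    means = [sum(g) / N for g in groups]
    grand_mean = sum(means) / K
    ms_between = N * sum((m - grand_mean) ** 2 for m in means) / (K - 1)
    ms_error = sum((x - m) ** 2
                   for g, m in zip(groups, means) for x in g) / DF_ERROR

    f_sig = ms_between / ms_error > F_CRIT
    # Range test on the most disparate pair of means.
    q_sig = (max(means) - min(means)) / math.sqrt(ms_error / N) > Q_CRIT

    f_rejects += f_sig
    q_rejects += q_sig
    both_reject += f_sig and q_sig

f_rate = f_rejects / REPS
q_rate = q_rejects / REPS
both_rate = both_reject / REPS
print(f"omnibus F alone:   {f_rate:.3f}")
print(f"range test alone:  {q_rate:.3f}")
print(f"both required:     {both_rate:.3f}")
```

The rate for "both required" can never exceed either single-test rate, and it falls below them whenever the two tests disagree on some samples, which is the conservatism the excerpt attributes to F-protection.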

I'd be interested to hear others chime in on this topic.

Good luck,
Greg Hancock

Gregory R. Hancock Cogito, ergo S.E.M.
Department of Educational Measurement,
Statistics, and Evaluation
1230 Benjamin Building phone: (301) 405-3621
University of Maryland fax: (301) 314-9245
College Park, MD 20742-1115 e-mail: ghancock@wam.umd.edu

Check out our graduate program in measurement, stats, and evaluation:
http://www.inform.umd.edu:8080/EdRes/Colleges/EDUC/.WWW/Depts/EDMS/

Date: Fri, 21 Feb 1997 09:52:41 +0100
From: Hans-Peter Piepho <piepho@WIZ.UNI-KASSEL.DE>

The suggestion to pursue multiple comparisons only when an overall F-test
rejects, is connected with Fisher's protected LSD test. This guarantees the
experiment-wise Type I error to be controlled only in the weak sense, i.e.
only if the global null is true, but not otherwise (there is no protection
when the global null is false).

To control the experiment-wise error rate in the strong sense, i.e. also
when the global null is false (which I think is the most common situation),
a host of other procedures have been suggested, the most prominent of them
being Tukey's test, which uses studentized ranges. These tests do NOT require
a preliminary F-test.
_______________________________________________________________________
Hans-Peter Piepho
Institut f. Nutzpflanzenkunde WWW: http://www.wiz.uni-kassel.de/fts/
Universitaet Kassel Mail: piepho@wiz.uni-kassel.de
Steinstrasse 19 Fax: +49 5542 98 1230
37213 Witzenhausen, Germany Phone: +49 5542 98 1248

Ten years later, and the "expert reviewers" of Psychology journals are still in the dark. Here is part of a review that a colleague of mine received in July of 2007 from the Journal of Applied Social Psychology:

I am unfamiliar with the REGWQ approach employed by the author(s) in analyzing group means. Although the author(s) state that ANOVA does not necessarily have to accompany such analysis, REGWQ is often accompanied by ANOVA. Given the unfamiliarity of the data analytic technique I would recommend that the authors follow the usual convention and accompany their REGWQ analysis with an ANOVA.

Contact Information for the Webmaster,
Dr. Karl L. Wuensch

This page most recently revised on the 16th of February, 2017.