East Carolina University

Department of Psychology

Pairwise Comparisons

Almost all of the techniques designed to
control familywise error rate when making pairwise or other multiple comparisons
do **not** require that one first conduct an ANOVA, and if one does conduct such an
ANOVA, it need not be significant to use such multiple comparison procedures.
It seems that almost nobody knows this, and the few who do take a lot of
undeserved crap from the ignorant experts that review their manuscripts.
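To see for yourself how little ceremony these procedures actually require, here is a minimal Python sketch of Tukey's HSD computed directly from raw data (the data are invented, with equal group sizes; `scipy.stats.studentized_range` requires SciPy 1.7 or later). Notice that no omnibus ANOVA appears anywhere in the procedure.

```python
import numpy as np
from scipy.stats import studentized_range

rng = np.random.default_rng(1)
# three made-up treatment groups, n = 20 each
groups = [rng.normal(mu, 10, 20) for mu in (50, 55, 62)]

k, n = len(groups), 20
means = np.array([g.mean() for g in groups])
df = k * n - k                                            # error df
mse = sum(((g - g.mean()) ** 2).sum() for g in groups) / df

# Tukey HSD: each pairwise difference is referred to the
# studentized range distribution -- no omnibus F anywhere.
for i in range(k):
    for j in range(i + 1, k):
        q = abs(means[i] - means[j]) / np.sqrt(mse / n)
        p = studentized_range.sf(q, k, df)
        print(i, j, round(p, 4))
```

With unequal *n* one would substitute the Tukey-Kramer standard error, sqrt((MSE/2)(1/n_i + 1/n_j)), in place of sqrt(MSE/n).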

Here are a couple of quotes from not-ignorant experts:

"In fact, if one of these procedures is
used, there seems to be little reason for applying the *F* test at all!
In light of the results just cited, it might seem that the *F* test could
be abandoned completely." From page 36, Wilcox, R. R. (1987). New
designs in analysis of variance. *Annual Review of Psychology, 38*,
29-60.

"In the past, the general advice was that
without a significant <ANOVA> group effect, individual comparisons were
inappropriate. ... However, this is a case where the general advice is wrong.
The logic behind most of our multiple comparison procedures does not require
overall significance before making specific comparisons. ... requiring overall
significance will actually change the FW <familywise error rate>, making the
multiple comparisons tests <unreasonably> conservative. These tests were
designed, and their significance levels established, without regard to the
overall *F*." From pages 372 and 373, Howell, D. C. (2010). **Statistical methods for psychology** (8th ed.).

**ANOVA Significant, REGWQ Not**. Yes, it is
possible for the ANOVA to be significant but the REGWQ to show no significant
pairwise differences. The null in the ANOVA is that all of the means are
identical. Rejecting that null indicates that the data do not fit with that
hypothesis well – but what are the means that differ enough in our samples for
us to be confident that they also differ in the populations? Pairwise
comparisons attempt to answer that question, but may be more conservative than
the omnibus ANOVA. Also, there may be a linear contrast involving the means that
is significant but is not a pairwise contrast. For example, when completing
this four-group
one-way ANOVA assignment, one of my students obtained a significant ANOVA
but REGWQ indicated no significant pairwise differences. While this seems to be
a paradox, it really is not. If he had tested the combined male groups versus
the combined female groups, he would have found that the mean pulse rate
increase was significantly greater for the men than for the women. Later he will
employ a factorial ANOVA to make exactly that contrast, and will find that the
main effect of sex/gender is significant (at the .005 level). So, what to report
in this circumstance? It is probably best simply to report that the ANOVA was
significant but there were no significant pairwise contrasts, but it is unlikely
that reviewers of the manuscript would find that satisfying. It might be worth
testing non-pairwise contrasts that make sense, like the two male groups
combined versus the two female groups combined, or the two sexual infidelity
groups combined versus the two emotional infidelity groups combined. Check out
Steve Simon’s thoughts on this.
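For readers who want to see what such a non-pairwise contrast looks like in practice, here is a hedged Python sketch. The four group means, standard deviation, and sample sizes are invented for illustration (they are not my student's data); the contrast pools the two hypothetical male groups against the two female groups, using the pooled MSE as the error term.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# hypothetical groups: male/sexual, male/emotional, female/sexual, female/emotional
groups = [rng.normal(mu, 8, 15) for mu in (25, 23, 12, 11)]

k = len(groups)
n = np.array([len(g) for g in groups])
means = np.array([g.mean() for g in groups])
mse = sum(((g - g.mean()) ** 2).sum() for g in groups) / (n.sum() - k)  # pooled error

c = np.array([0.5, 0.5, -0.5, -0.5])       # males minus females
psi = c @ means                            # contrast estimate
se = np.sqrt(mse * np.sum(c ** 2 / n))     # its standard error
t = psi / se
p = 2 * stats.t.sf(abs(t), df=n.sum() - k)
print(round(t, 2), round(p, 4))
```

The same arithmetic generalizes to any set of contrast coefficients that sums to zero.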

Here are some comments taken from the statistical consultants list, STAT-L.

Date: Fri, 15 Sep 95 15:27:28 EDT

From: "Karl L. Wuensch" <PSWUENSC@ECUVM1>

Subject: Re: ANOVA vs Tukey HSD test

To: Multiple recipients of list STAT-L <STAT-L%MCGILL1.BITNET@VTBIT.CC.VT.EDU>

In Psychology it is common to run an omnibus ANOVA and then if and

only if that ANOVA is significant do pairwise comparisons with a

procedure like Tukey, Newman-Keuls, etc. This is really rather foolish,

since **these post-hoc procedures were designed to control familywise error
rates IN THE ABSENCE OF ANY SIGNIFICANT PRIOR OMNIBUS ANALYSIS**. I tell

my students that protecting oneself from the dreaded Type I error by

requiring both significance of an omnibus test and a post-hoc test like

the Tukey HSD is as paranoid as protecting oneself from sexually

transmitted diseases by both wearing a condom and abstaining from sex.

I refer my students to an old article by Ryan (Comments on Orthogonal

Components), in which he noted that methods for multiple comparisons do NOT

require an initial test of the significance of the variation between groups.

All of the methods mentioned in the present note can be applied immediately,

without such a preliminary test.

Members of the STAT-L were recently asked:

>I am running a one way ANOVA, and testing significance between groups using

>the tukey HSD test. The ANOVA shows a statistically significant between group

>difference. However the tukey HSD shows no pair of groups that are different

>from each other. <how can this be?>

Hans-Peter Piepho <piepho@WIZ.UNI-KASSEL.DE> also responded:

>If you apply two different tests to the same problem, there is no guarantee

>you get the same result. You should decide a priori which test to use.

>Some might argue as follows: I do an overall *F*-test first, and only if it

>rejects do I do multiple comparisons (MCs). Well, if you do MCs by Fisher's LSD, this is

>Fisher's protected LSD, which controls the familywise error (FWE) rate of

>making at least one false inference ONLY IN CASE THE GLOBAL NULL IS TRUE,

>i.e. no treatment differences at all, but not otherwise.

>

>The Tukey test, in contrast, controls the FWE without a preliminary *F*-test,

>even when there are some treatment differences. Doing the Tukey test

>only when the overall *F*-test is significant is an "over-protection".

Why do so many researchers continue with the practice of requiring

a significant preliminary test followed by a very conservative post-hoc test??

Karl L. Wuensch, Dept. of Psychology, East Carolina Univ.

Greenville, NC 27858-4353, phone 919-328-4102, fax 919-328-6283

Bitnet Address: PSWUENSC@ECUVM1

Internet Address: PSWUENSC@ECUVM.CIS.ECU.EDU

Date: Mon, 18 Sep 95 11:47:34 -0700

From: Patrick Onghena <Patrick.Onghena@psy.kuleuven.ac.be>

Organization: Katholieke Universiteit Leuven

"Karl L. Wuensch" <PSWUENSC@ECUVM.CIS.ECU.EDU> wrote:

..

> Why do so many researchers continue with the practice of requiring

>a significant preliminary test followed by a very conservative post-hoc test??

Because textbooks describe them (e.g. Tukey's HSD) as "a posteriori" (e.g.,

Kirk)? "A posteriori" should mean "without any a priori hypotheses", but it can

easily be misunderstood as "after the omnibus *F*-test".

Date: Mon, 18 Sep 1995 09:48:01 +0100

Sender: Stat-l Discussion List <STAT-L%MCGILL1.BITNET@VTBIT.CC.VT.EDU>

From: "Mr. N.W.A. Marsh" <jw34@LIVERPOOL.AC.UK>

I mail simply to state that I experience the same problem as a statistician

in a psychology department here in the UK, and thus share Karl Wuensch's

frustration. I would say that within the UK this is widespread rather than

being special to my own institution.

I suppose that the explanation must be that on both sides of the Atlantic

an ageing generation of psychologists has been taught statistics in a

manner which has proved extremely convincing. From what I have seen the

statistical curriculum for psychologists has been generally sensible and

well-thought out, but in this context we have what I would agree is a

(pervasive) error.

Norman Marsh

University of Liverpool

Sender: Stat-l Discussion List <STAT-L%MCGILL1.BITNET@VTBIT.CC.VT.EDU>

From: "**T. A. Ryan**" <TAR1@PSUVM.PSU.EDU>

Subject: "Post hoc" tests

The "post hoc" discussion, some comments:

1. It is surprising that the Newman-Keuls test is still being seriously

recommended. Fortunately Hans-Peter Piepho pointed out that it does not

control the Type I error experimentwise. I am surprised that it is still

around since Tukey pointed out its deficiencies in his well-known but

unpublished monograph in 1953, and the problem has been pointed out

many times since then.

The method controls the Type I error rate only if all of the

population means are equal. In cases where there are several groups of

equal populations, the error rate can climb sharply. For example, if

there are 5 pairs of equal populations (10 samples being compared at the

.05 level) the experimentwise error rate is .23. Tukey's methods were

developed to correct this error. They set an upper limit on the error

rate for all possible partial null hypotheses.
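Ryan's .23 is easy to verify: with 5 such pairs, each tested at .05 and treated as independent, the chance of at least one false rejection is 1 - .95^5.

```python
# familywise error rate for 5 independent tests, each at alpha = .05
alpha = 0.05
fwe = 1 - (1 - alpha) ** 5
print(round(fwe, 2))   # 0.23
```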

To use the Newman-Keuls method to gain power is the equivalent

of gaining power by raising the alpha level of the test. This can be

done with any test procedure, but then you have to publicly raise the

alpha level, rather than doing it under the table.

Note that Tukey's monograph has now been published along with several

other papers on multiple testing:

COLLECTED WORKS OF JOHN W. TUKEY, Volume VIII, Edited by Henry I.

Braun, Publisher: Chapman & Hall, 1994.

The monograph is still an important source for understanding the

problems of multiple testing.

2. The term "post hoc" is misleading and should be replaced. It

implies falsely that the methods of multiple comparison or multiple

testing are needed only if the tests were not planned in advance.

Multiple tests are multiple tests whether planned in advance, or not.

The more tests, the higher the probability of error and planning does

not affect this fact. The only effect of planning is that it might

reduce the number of tests performed.

We should speak only of "methods of multiple comparison" or

"methods of multiple testing."

Thomas A. Ryan

Professor of Psychology, Emeritus

Cornell University

Date: Sat, 2 Dec 1995 11:38:28 -0500

From: Gregory Robert Hancock <ghancock@WAM.UMD.EDU>

If I may add to some comments made by Dr. Ryan (Professor of Psychology,

Emeritus, Cornell University):

> 1. It is surprising that the Newman-Keuls test is still being seriously

> recommended....

> The method controls the Type I error rate only if all of the

> population means are equal....

I agree, with one exception. In the case of *k* = 3 treatment groups,

Newman-Keuls (NK) controls experimentwise error in the strong sense (i.e.,

over all possible configurations of partial null hypotheses). There have

been numerous modifications of the NK procedure in an attempt to "close"

it (i.e., to achieve strong experimentwise error control), including

those started by Dr. Ryan in highly regarded 1959 and 1960 Psychological

Bulletin pieces. Further modifications by Einot & Gabriel (1975, JASA),

Welsch (1977, JASA), and Peritz (1970, unpub) have sought to offer more

power than Tukey's method while facilitating the desired closure. For

the case of *k* = 3, these modifications all become equivalent to the NK

method. For *k* > 3, Tukey's method, the aforementioned modifications of NK,

or a method by Hayter (1986, JASA) are all viable alternatives to the weak

NK method.

It was my (possibly mistaken) understanding that the term "post hoc"

meant "following an omnibus *F* test over all between-group variability."

Such an *F* test, often believed to offer "protection" from the dangers of

Type I error, only offers that protection weakly. That is, if the

complete null is true (i.e., if all population means are equal), an

*F*-test does provide experimentwise Type I error control. However, so

does Tukey's test or other methods all by themselves, making the *F*-test

redundant and in fact potentially inhibiting to experimental power. When

partial nulls exist (i.e., when only some population means are equal),

the omnibus *F* is easily rejected because of the false nulls in the mean

set, and experimentwise Type I error control is only as good as the

so-called "post hoc" method you're using. Because the omnibus *F* test

doesn't really do much statistically (and in fact, does little

substantively as well), its utility is in question. Thus, because many

multiple comparison procedures do not require an omnibus *F*, the term

"post hoc" is often a misnomer. I concur with Dr. Ryan that more precise

language needs to be used.

=============================================================

Footnote from Karl Wuensch

The *k* = 3 case is really a special case. With only three groups, even the

Fisher's LSD test adequately controls familywise error, as long as you

employ the omnibus ANOVA as a protection -- that is, do the omnibus ANOVA

first, if and only if that is significant, do the multiple comparisons --

they really are just multiple *t*-tests, usually with a pooled error term, and, of course,

if the variances were heterogeneous, one should use separate variances *t*'s

instead. There was an article in Psychological Bulletin not long ago

explaining how Fisher's LSD really does control familywise alpha adequately

when *k* = 3 (and extending the procedure to some other 2 *df* situations) -- and

it certainly does afford one more power (*Psychol. Bull*., *115*, 153-159).
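For the curious, the protected-LSD recipe for *k* = 3 can be sketched in a few lines of Python (fabricated data; the pairwise *t*s use the pooled error term from all three groups, as described above):

```python
import numpy as np
from itertools import combinations
from scipy import stats

rng = np.random.default_rng(3)
groups = [rng.normal(mu, 5, 12) for mu in (30, 30, 36)]  # invented data

k = 3
n = np.array([len(g) for g in groups])
means = np.array([g.mean() for g in groups])
mse = sum(((g - g.mean()) ** 2).sum() for g in groups) / (n.sum() - k)

f, p = stats.f_oneway(*groups)          # step 1: the omnibus ANOVA
if p < .05:                             # step 2: only then, pairwise t's
    for i, j in combinations(range(k), 2):
        se = np.sqrt(mse * (1 / n[i] + 1 / n[j]))   # pooled error term
        t = (means[i] - means[j]) / se
        pt = 2 * stats.t.sf(abs(t), df=n.sum() - k)
        print(i, j, round(pt, 4))
else:
    print("omnibus ANOVA not significant; under LSD logic, stop here")
```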

Date: Fri, 26 Apr 1996 15:22:00 EDT

From: "T. A. Ryan" <TAR1@PSUVM.PSU.EDU>

Subject: post-hoc, etc.

On April 25th, Steven J. Pierce wrote:

>As I understand it, a priori (planned) comparisons don't require a

>significant overall *F*-test. If you've got a reason to expect one group to

>be different it seems entirely logical to go ahead with planned

>comparisons without a significant *F*-test.

This statement involves several misconceptions. Contrary to the implications

of the statement the following facts should be recognized:

1. **Tukey, Bonferroni, and most other multiple comparison procedures DO NOT require
a preliminary F test, whether the tests are considered "post-hoc" or a priori.**

The habit of doing an *F* test first probably dates from the time when Fisher's

"LSD" method was the only method available. LSD is the only commonly

used procedure in which the preliminary *F* test is an integral part, and

it is inadequate for other reasons.

2. A priori or planned tests require correction for multiple tests just as

much as "post-hoc" tests. If multiple tests are made, the probability of

Type I errors familywise needs to be taken into account whether the tests

are preplanned or not. The only way in which planning can affect the

procedure is in the rare case where the planning reduces the number of tests.

In that case the Bonferroni method can be changed by reducing the denominator

of the correction.
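Ryan's point about the denominator is just arithmetic: Bonferroni divides alpha by the number of tests actually performed, so planning helps only insofar as it shrinks that count. For example, with four groups (numbers are illustrative):

```python
# Bonferroni per-test alpha: nominal alpha over the number of tests made
alpha = 0.05
all_pairs = 6    # all pairwise comparisons among 4 groups
planned = 3      # a smaller, planned subset of comparisons
print(alpha / all_pairs)   # stricter criterion when all pairs are tested
print(alpha / planned)     # more lenient when planning reduced the count
```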

3. There is no justification whatever for the notion that planning allows

us to use uncorrected per-test alpha levels. This notion is repeated in many

textbooks but never given any logical justification. It is simply stated

that it is "self evident." It is a dangerous notion, since those

who want significance at all costs can always claim they planned their

tests in advance. Whether they did or not is actually irrelevant.

4. We should stop labeling the Tukey and other tests for multiple comparisons

as "post-hoc" tests. The term is misleading because corrections for multiple

comparison are needed whether we plan the tests in advance, or not.

T. A. Ryan, Sr.

Professor of Psychology, Emeritus

Cornell University

PS. It is ironic that Fisher's description of the "LSD" method specified

that the comparisons be planned in advance, yet the

method is now usually considered a "post hoc" procedure.

Date: Fri, 26 Apr 1996 16:17:02 CDT

From: Ed Cook <SBSF081@UABDPO.DPO.UAB.EDU>

Subject: post-hoc, etc.

If one takes the position that one should always correct

for multiple comparisons, would that not apply to main effects and

interactions in the factorial design as well? In the 2 x 2, for example,

the two main effects and the interaction are equivalent to 3 *t*-tests

on the set of 4 cell means. I've never seen any recommendation

that *those* 3 tests should be corrected for multiple comparisons.

If not those, then why 3 orthogonal comparisons among 4 cell means

in a one-way design?

Perhaps it's critical that one be able to defend the set of comparisons

that one is doing. As an author and frequent journal reviewer, my

impression is that one cannot "always claim [one] planned [one's]

tests in advance." At least in my experience authors, editors, and

reviewers just aren't that naive. (But then, perhaps it is *I* who

is being naive.)

Ed Cook, Assoc Prof of Psychology, Univ of Alabama at Birmingham, USA

Date: Sun, 28 Apr 1996 16:59:00 EDT

From: "T. A. Ryan" <TAR1@PSUVM.PSU.EDU>

Subject: post-hoc etc.

I have argued for this recommendation a number of times over the past

30 or more years, but I have not succeeded in convincing those in

positions of influence, including Tukey. One argument that is offered

against this conclusion is that the three tests in a two-way ANOVA

can be considered as three independent experiments. Supporting this

position becomes complicated if there is interaction, and I do not find

it convincing. R. B. Darlington has a thoughtful discussion of the problem

of separating comparisons into separate families ("Regression and Linear

Models", 1990). There are no easy answers to this question and there is

no good basis for automatically treating the main effects and interactions

as separate families.

A stronger reason for my failure is the general resistance among researchers

against making any corrections for multiple testing at all. Naturally,

they want all the power they can get, and it is expensive to do it by

getting more data. So they use any excuse they can find for doing

uncorrected tests. Unfortunately, editors often let them get away with it.

I believe that the cost of Type I errors is badly underestimated. To the

researcher, Type II errors have great personal cost -- he can't get his paper

published or he misses his promotion. Our treatment of data, however,

ought to be based upon the cost to science -- is it really important if we

miss a small effect? Isn't it more important to find the big ones?

The cost of Type I error includes a lot of time wasted by researchers

trying to explain a non-existent effect. The falsely "significant" finding

can result in a furor of activity which gradually peters out because there

wasn't really any effect to work on. In practical research a Type I error

can mean the use of a treatment which really does no good. This is surely

an important cost.

***************************

In a private communication, Bob Wheeler reminded me that I had failed to

mention the Scheffe method as one which requires an initial *F* test.

I apologize for the omission; my only excuse is that I was thinking in

the context of simple comparisons, the problem which started this series.

Scheffe's method is not commonly used for simple comparisons since it is

less powerful than Tukey's.

The fact that Scheffe's method is good for complex contrasts reminds me

of another custom which I consider erroneous. This is the notion that, if

we test only orthogonal contrasts we are justified in testing each on

a per-test basis. I have never found any logical justification for this

idea. If we make several orthogonal tests, we are still making multiple

tests and the type I error rate per family or per experiment is increased

just as it is for non-orthogonal tests. It is true that restricting our

attention to orthogonal tests reduces the total number of tests so the

inflation of Type I error is less than that for unrestricted uncorrected

tests. On the other hand it is often difficult to find orthogonal contrasts

that are meaningful for the research problem.

Date: Mon, 29 Apr 96 11:37:06 EDT

From: "Karl L. Wuensch" <PSWUENSC@ECUVM1>

Subject: Re: post-hoc etc.

T. A. Ryan, in his delightful post, mentions a problem that has

bothered me for years: how does one define the "family" in "familywise

error rate." I guess I'll have to go find Darlington's book and read his

answer. Is "family" the set of comparisons I am doing in this "experiment?"

Why not the set of comparisons I am doing in all experiments reported in this

manuscript? Why not all that I expect to do this year? Why not all that

faculty in my department expect to do in this decade?

Date: Mon, 29 Apr 1996 12:10:00 -0600

From: Michael Lacy <mglacy@LAMAR.COLOSTATE.EDU>

Organization: Colorado State University, Fort Collins, CO 80523

One logical extension is to consider all hypothesis tests ever performed

by a particular investigator, the "careerwise error rate." If we apply

appropriate corrections to p-values based on careerwise comparisons, a very

laudable democratizing of the practice of statistical testing would

occur: Young investigators, with fewer past comparisons, would have

less stringent per-test alpha levels, thus enabling them to be

more likely to get "significant" and therefore publishable results.

This is good, since their need for publications is greatest.

Older investigators, at some point in their careers, would cease

hypothesis testing at all, since they would have to use such

stringent per-test alpha levels as to drastically cut the power

of their tests. All in all, this seems like a good thing to

me. <Grin>

=-=-=-=-=-=-=-=-=-==-=-=-=

Mike Lacy, Sociology Dept., Colo. State Univ. FT COLLINS CO 80523

From: Meredith_Warshaw@frankston.com

To: sci.stat.consult%news@iecc.com

Subject: Re: post-hoc etc.

Date: Mon, 29 Apr 1996 15:31 -0400

Michael Lacy wrote:

<<One logical extension is to consider all hypothesis tests ever performed

by a particular investigator, the "careerwise error rate.">> Actually,

a few years ago Murray Jorgenson posted the following suggestion for

dealing with careerwise error rates:

Date: 08/06/92 08:01:38 PM

Subject: A Market in \alpha s.

Dear STAT-L,

It would be a disgrace for the statistical profession

if any true null hypothesis were ever wrongly rejected

by an approved statistician, so I have a modest proposal

to make to make this state of affairs unlikely and so

bring our profession to new heights of public respect.

The idea is to Bonferronize all inferences published

from the date of the inception of the scheme ad infinitum.

We will base the allocation on the geometric series:

$$ \sum_{n=1}^{\infty} 2^{-n} = 1 $$

Each statistician to be registered must purchase a

personal alpha from the authority to be created. The

alphas would be available in sizes 2^{-n} starting at

*n* = 6 so that the total of the sold personal alphas

would be less than 1/32.
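Jorgensen's arithmetic checks out (assuming, as the garbled series suggests, that the alphas are sizes 2^{-n}): personal alphas sold from *n* = 6 onward can never total 1/32 or more, since the full tail of the geometric series sums to exactly 2^{-5}.

```python
# sum of the personal alphas 2**-n for n = 6, 7, ... (a long finite run)
total = sum(2.0 ** -n for n in range(6, 40))
print(total)   # just under 1/32 = 0.03125
```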

The authority would distribute the alphas on a

'free market' basis by auctions, tenders, etc.

Each statistician, when approving a significance claim

would use as level their own personal alpha divided by

2, 4, 8, etc. Each significance test would be logged with

the central authority.

There would be a market in 'second-hand' alphas like that

in personalized 'vanity' car plates. A thoughtful parent

may well secure their child's future as a statistician by

purchasing and laying aside an \alpha for their child.

As a special concession it will be possible to use

Milton Friedman's nonparametric two-way ANOVA without

dividing one's alpha by 2.

Some may feel that this will lead to rather conservative

inferential behaviour, but I feel that conservatism is

nothing to be ashamed of and that protection against

false claims is something worth making a stand about.

--

Murray A. Jorgensen [ maj@waikato.ac.nz ] University of Waikato

Department of Mathematics and Statistics Hamilton, New Zealand

__________________________________________________________________

'Tis the song of the Jubjub! the proof is complete,

if only I've stated it thrice.'

Meanwhile, Karl Wuensch asked:

<<T. A. Ryan, in his delightful post, mentions a problem that has

bothered me for years: how does one define the "family" in "familywise

error rate." >>

According to Barney (the purple philosopher of tripe) "a family is people and

a family is love.." :-)

Scheffe's approach doesn't require a preliminary *F*-test in order to

maintain its Type I protection; in fact, it's redundant. If you find

any contrast significant using a Scheffe approach, you will by

definition also have a significant *F*-test.
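Nichols' claim can be made concrete. Scheffe judges a contrast significant when its squared *t* exceeds (*k* - 1) times the omnibus critical *F*; since the largest possible contrast sum of squares equals the full between-groups sum of squares, any significant Scheffe contrast forces the omnibus *F* past its critical value. A sketch (the group and sample counts are invented):

```python
from scipy import stats

k, N = 4, 40                # hypothetical: 4 groups, 40 subjects total
alpha = 0.05
# omnibus critical value on (k-1, N-k) df
f_crit = stats.f.ppf(1 - alpha, k - 1, N - k)
# Scheffe's critical |t| for ANY contrast, planned or not
scheffe_crit_t = ((k - 1) * f_crit) ** 0.5
print(round(f_crit, 3), round(scheffe_crit_t, 3))
```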

-----------------------------------------------------------------------------

David Nichols Senior Support Statistician SPSS, Inc.

Phone: (312) 329-3684 Internet: nichols@spss.com Fax: (312) 329-3668

Date: Fri, 21 Feb 1997 07:47:08 -0500

From: Greg Hancock <ghancock@WAM.UMD.EDU>

John Reece wrote...

>I would like to garner some opinion on the relationship between post-hoc

>means comparison procedures and the omnibus* F* test. The traditional view
in

>teaching psychology students (and I suspect students from many other

>disciplines) is that one should not carry out exploratory pairwise means

>comparisons unless an omnibus F test indicates significance at some

>arbitrary value, usually .05. However, several sources (Howell & Wilcox to

>name two) indicate that, far from being a requirement, exploratory means

>comparisons should be carried out regardless of the significance of the

>overall *F*, or even in lieu of an overall *F*. This makes some sense to me,

>mainly because it is my understanding that exploratory means comparison

>procedures were developed independent of the notion of an overall omnibus

>test.

I have many thoughts on this subject, few if any of which come down in

favor of an omnibus *F*. This does not mean, however, that I don't

think it should be taught. It is still a useful frame of reference

for understanding main effects in factorial designs (although these

could also be couched in a different "complex contrast" context). Anyway,

the first question I would ask is what do you mean by "exploratory?"

If by "exploratory" you mean that this is pilot work, meant to be a

precursor to more rigorous statistical analyses on other samples, I guess

I'd say spare the F and go nuts with your t-tests. But don't make any

grandiose proclamations based on your findings. Save those to follow your

more formal cross-validation work.

If by "exploratory" you mean that you're looking at some data

(probably sample means, specifically), and those data suggest some

particular comparisons that were not planned a priori, then I'm all in

favor of exacting some kind of familywise Type I error control.

However, the *F*-test is not the beast for doing this. It only works as a

control mechanism if the complete null hypothesis (that all population

means are equal) is true. When one population mean differs, and you have

sufficient power, you can get past the omnibus *F* pretty easily and then

make Type I errors on other null comparisons if you don't have another

control mechanism. Scheffe provides such a mechanism. The only value of

the omnibus *F* here is that it tells you if you should even bother with

Scheffe; if the omnibus *F* is not significant, then no Scheffe contrast or

comparison will be either. If the omnibus *F* is significant, then at least

one exploratory contrast or comparison will be as well (although it may

not be one you're interested in or one that makes sense).

By the way, Scheffe's test is actually unnecessarily conservative. A

paper presented last year at the American Educational Research Association

(Klockars & Hancock) dealt with this issue, and has been addressed in

independent work by Ottaway & Harris (personal communication).

An excerpt from a recent paper which addresses the omnibus *F*, among many

other topics, is presented below:

"There are a number of problems associated with the requirement of

an omnibus test rejection prior to conducting multiple comparisons; we

will present four. First, and most simply, few research questions are

directly addressed by an omnibus test. In a well planned study, the

researcher's questions involve specific contrasts of group means; the

omnibus test addresses each question only tangentially. Some might argue

that the omnibus test is not present to answer questions; rather, it is

there to facilitate control over the rate of Type I error. This issue of

control, however, brings us to our second point: the belief that an

omnibus test offers protection is not completely accurate. When the

complete null hypothesis is true, weak familywise Type I error control is

facilitated by the omnibus test; but, when the complete null is false and

partial nulls exist, the *F*-test does not maintain strong control over the

familywise error rate.

"A third point, which Games (1971) so elegantly demonstrated in

his figures, is that the *F*-test may not be completely consistent with the

results of a pairwise comparison approach. Consider, for example, a

researcher who is instructed to conduct Tukey's test only if an

alpha-level *F*-test rejects the complete null. It is possible for the

complete null to be rejected but for the widest ranging means not to

differ significantly. This is an example of what has been referred to as

incoherence (Gabriel, 1969) or incompatibility (Lehmann, 1957). On the

other hand, the complete null may be retained while the null associated

with the widest ranging means would have been rejected had the decision

structure allowed it to be tested. This has been referred to by Gabriel

(1969) as nonconsonance. One wonders if, in fact, a practitioner in this

situation would simply conduct the MCP contrary to the omnibus test's

recommendation. Strangely enough, such a seeming breach of multiple

comparison ethics would have largely positive statistical ramifications as

we discuss in our next and final point.

"The fourth argument against the traditional implementation of an

initial omnibus *F*-test stems from the fact that its well-intentioned but

unnecessary protection contributes to a decrease in power. The first test

in a pairwise MCP, such as that of the most disparate means in Tukey's

test, is a form of omnibus test all by itself, controlling the familywise

error rate at the alpha-level in the weak sense. Requiring a preliminary

omnibus F-test amounts to forcing a researcher to negotiate two hurdles to

proclaim the most disparate means significantly different, a task that the

range test accomplished at an acceptable alpha-level all by itself. If

these two tests were perfectly redundant, the results of both would be

identical and the omnibus test would represent neither friend nor foe;

probabilistically speaking, the joint probability of rejecting both would

be alpha when the complete null hypothesis was true. However, the two

tests are not completely redundant; as a result the joint probability of

their rejection is less than alpha. The *F*-protection therefore imposes

unnecessary conservatism (see Bernhardson, 1975, for a simulation of this

conservatism). For this reason, and those listed before, we agree with

Games' (1971) statement regarding the traditional implementation of a

preliminary omnibus *F*-test:

'**There seems to be little point in applying the overall F test
prior to running c contrasts by procedures that set the
familywise error rate <= alpha.... If the c contrasts express
the experimental interest directly, they are justified whether the
overall F is significant or not and [familywise error rate] is
still controlled.** (Games, 1971, p. 560)'"

Whole passage from:

Hancock, G. R., & Klockars, A. J. (1996). The quest for alpha:

Developments in multiple comparison procedures in the quarter century

since Games (1971).
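The conservatism Hancock and Klockars describe (and Bernhardson simulated) is easy to reproduce. In this little Monte Carlo sketch (all parameters invented), the complete null is true; the joint rate of clearing both hurdles necessarily falls at or below the rate for Tukey's range test alone:

```python
import numpy as np
from scipy import stats
from scipy.stats import studentized_range

rng = np.random.default_rng(0)
k, n, alpha, reps = 4, 10, 0.05, 2000
df = k * n - k
q_crit = studentized_range.ppf(1 - alpha, k, df)
f_crit = stats.f.ppf(1 - alpha, k - 1, df)

tukey_only = both = 0
for _ in range(reps):
    x = rng.normal(0, 1, (k, n))        # complete null: all means equal
    means = x.mean(axis=1)
    mse = x.var(axis=1, ddof=1).mean()  # pooled within-group variance
    q = (means.max() - means.min()) / np.sqrt(mse / n)
    f = n * means.var(ddof=1) / mse     # omnibus F ratio
    tukey_only += q > q_crit
    both += (q > q_crit) and (f > f_crit)

# both/reps <= tukey_only/reps, which hovers near alpha
print(tukey_only / reps, both / reps)
```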

I'd be interested to hear others chime in on this topic.

Good luck,

Greg Hancock

Gregory R. Hancock Cogito, ergo S.E.M.

Department of Educational Measurement,

Statistics, and Evaluation

1230 Benjamin Building phone: (301) 405-3621

University of Maryland fax: (301) 314-9245

College Park, MD 20742-1115 e-mail: ghancock@wam.umd.edu

Check out our graduate program in measurement, stats, and evaluation:

http://www.inform.umd.edu:8080/EdRes/Colleges/EDUC/.WWW/Depts/EDMS/

Date: Fri, 21 Feb 1997 09:52:41 +0100

From: Hans-Peter Piepho <piepho@WIZ.UNI-KASSEL.DE>

The suggestion to pursue multiple comparisons only when an overall *F*-test

rejects, is connected with Fisher's protected LSD test. This guarantees the

experiment-wise Type I error to be controlled only in the weak sense, i.e.

only if the global null is true, but not otherwise (there is no protection

when the global null is false).

To control the experiment-wise error rate in the strong sense, i.e. also

when the global null is false (which I think is the most common situation),

a host of other procedures have been suggested, the most prominent of them

being Tukey's test, which uses studentized ranges. **These tests do NOT require
a preliminary F-test**.
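As a small illustration of the studentized-range machinery Piepho mentions (hypothetical numbers; `scipy.stats.studentized_range` needs SciPy 1.7 or later):

```python
from scipy.stats import studentized_range

k, df, alpha = 4, 36, 0.05   # 4 groups, 36 error df (invented)
q_crit = studentized_range.ppf(1 - alpha, k, df)
print(round(q_crit, 2))
# With equal n, Tukey declares means i and j different when
# |mean_i - mean_j| / sqrt(MSE / n) exceeds q_crit -- no F-test involved.
```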

_______________________________________________________________________

Hans-Peter Piepho

Institut f. Nutzpflanzenkunde WWW: http://www.wiz.uni-kassel.de/fts/

Universitaet Kassel Mail: piepho@wiz.uni-kassel.de

Steinstrasse 19 Fax: +49 5542 98 1230

37213 Witzenhausen, Germany Phone: +49 5542 98 1248

Ten years later, and the "expert reviewers"
of Psychology journals are still in the dark. Here is part of a review
that a colleague of mine received in July of 2007 from the *Journal of Applied
Social Psychology*:

I am unfamiliar with
the REGWQ approach employed by the author(s) in analyzing group means. Although
the author(s) state that ANOVA does not necessarily have to accompany such
analysis, REGWQ is often accompanied by ANOVA. Given the unfamiliarity of the
data analytic technique I would recommend that the authors follow the usual
convention and accompany their REGWQ analysis with an ANOVA.

Contact Information for the Webmaster,

Dr. Karl L. Wuensch

This page most recently revised on the 16th of February, 2017.