East Carolina University
Department of Psychology
What Effect Size Should I Use in Power Analysis &
How Much Power Should I Want?
Correspondent Tony Napoli asks some great questions here. My responses are in purple.
Dear Dr. Wuensch,
I’m Tony Napoli, a social scientist and perennial student of
statistics, who has wandered into some murky waters regarding estimating effect
sizes.
First off, let me express my appreciation for your very
informative statistics Web pages. They are indeed an oasis in the vast desert. I
and my students have benefitted greatly from them.
My dilemma concerns estimating effect sizes for power
analysis (sample size estimation) in the absence of any information on what
effect size to expect.
As a graduate student in the 1990s, the prevailing view
within my department was that without preliminary information
(aka pilot data) or published results to “hang your hat on” the researcher
should “shoot” for the minimum sample size that would produce a statistically
significant result under the worst case scenario – that would be a small
effect size.
I tell my students that the ideal situation is to have lots of power (95%) to detect the smallest effect that you would consider not to be trivial. What that smallest nontrivial effect size is depends on situational factors and is rather subjective. Of course, one could always fall back on Cohen's benchmarks, such as .2 for a standardized difference between means or .1 for a Pearson r.
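As a rough illustration of what 95% power for Cohen's "small" benchmarks costs, here is a minimal sketch using the usual normal approximations: a two-tailed, two-independent-groups comparison for d, and the Fisher z transformation for r. The function names are mine, chosen only for this example, and an exact program such as G*Power will give slightly different numbers.

```python
import math
from statistics import NormalDist


def n_per_group_for_d(d, power=0.95, alpha=0.05):
    """Approximate cases per group to detect standardized difference d.

    Normal approximation for a two-tailed test with two equal-sized
    independent groups (an assumption of this sketch).
    """
    z = NormalDist()
    z_total = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return math.ceil(2 * (z_total / d) ** 2)


def n_for_r(r, power=0.95, alpha=0.05):
    """Approximate total cases to detect a Pearson r (Fisher z method)."""
    z = NormalDist()
    z_total = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return math.ceil((z_total / math.atanh(r)) ** 2 + 3)


print("d = .2:", n_per_group_for_d(0.2), "cases per group")
print("r = .1:", n_for_r(0.1), "cases in total")
```

Under these assumptions the answer is on the order of 650 cases per group for d = .2 and about 1,300 cases in total for r = .1.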
If one has 95% power to detect the
smallest effect considered to be nontrivial, then one can make a strong
statement regardless of whether the null is rejected. If the null is
rejected, great, and the large power will enable you to estimate the size of the
effect with good precision. If the null is not rejected, then you can
argue that the effect is so small that it might as well be zero. That
argument would be strengthened by providing a 95% confidence interval for the
effect. For example, -.05 < rho < +.02 is pretty convincing evidence that
rho is nearly zero. That said, confidence intervals like .03 < rho < .06
are also convincing that the effect is nearly zero, even though it is statistically significant.
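A confidence interval like those above can be approximated with the Fisher z transformation. The sketch below is only an illustration: the function name is mine, and the sample values (r = -.015 with 3,000 cases) are hypothetical numbers picked to give an interval close to the first example, not data from any study.

```python
import math
from statistics import NormalDist


def rho_confidence_interval(r, n, level=0.95):
    """Approximate confidence interval for rho via the Fisher z transformation."""
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    center = math.atanh(r)                 # r transformed to Fisher z
    half_width = z_crit / math.sqrt(n - 3)  # standard error is 1 / sqrt(n - 3)
    return math.tanh(center - half_width), math.tanh(center + half_width)


# Hypothetical sample values, for illustration only.
lower, upper = rho_confidence_interval(r=-0.015, n=3000)
print(f"{lower:+.3f} < rho < {upper:+.3f}")
```

Note how many cases it takes to get an interval that narrow, which is the practical price of being able to argue that an effect is trivially small.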
In my review of the literature, aspiring investigators are
sometimes advised to select an effect size based on a desired (e.g., clinical)
effect (for example
http://www.power-analysis.com/power_analysis.htm). This seems to be
somewhat self-serving: Who wouldn’t want a large effect, obtained from a small
sample?
Not I, unless I were a
pharmaceutical firm trying to establish the bioequivalence of my product.
<grin>
Folks at the
University of Michigan's MEERA website opine that because effect size can
only be calculated after you collect data from program participants, you will
have to use an estimate for the power analysis. Common practice is to use a
value of 0.5 as it indicates a moderate to large difference. This suggestion, to
assume a moderate/large effect size (ES; presumably Cohen’s d in
this case), seems unwarranted and likely to lead to a Type II error. Also, I cannot
find an authoritative source for this recommendation/convention.
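The Type II error concern can be made concrete with a rough sketch. It assumes a two-tailed comparison of two equal-sized independent groups, uses the normal approximation, and ignores the negligible chance of rejecting in the wrong direction. The function names are illustrative only. It plans the sample for d = .5 with 80% power and then asks how much power that sample has if the true effect is only d = .2.

```python
import math
from statistics import NormalDist


def n_per_group(d, power=0.80, alpha=0.05):
    """Approximate cases per group needed to detect d (normal approximation)."""
    z = NormalDist()
    z_total = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return math.ceil(2 * (z_total / d) ** 2)


def approx_power(d, n, alpha=0.05):
    """Approximate power of a two-tailed, two-group test with n cases per group."""
    z = NormalDist()
    return z.cdf(d * math.sqrt(n / 2) - z.inv_cdf(1 - alpha / 2))


planned_n = n_per_group(0.5)  # sample size planned around d = .5
print("planned n per group:", planned_n)
print("power if true d = .5:", round(approx_power(0.5, planned_n), 2))
print("power if true d = .2:", round(approx_power(0.2, planned_n), 2))
```

Under these assumptions, a study planned around d = .5 has only about one chance in five of detecting a true effect of d = .2.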
So, unlike other conventions (for example, that alpha should be set to .05
(The Earth is Round) and power should be set to .80 (Trochim,
2006)), there doesn’t appear to be an acceptable convention for estimating ES
in the absence of any other information.
If you use G*Power, you will see that the
default value for amount of desired power is 95%. So, why do the Germans
use 95% and those in the US use 80%? My guess is that it is because we are so
shocked when we see how many cases we need to get that much power. Perhaps the
Germans are better able to get enough data because their government still
supports research. IMHO, setting both alpha and beta should follow a
consideration of the relative seriousness of Type I versus Type II errors.
If these two sorts of errors are considered equally
serious, and .05 a reasonable value for alpha, then the reasonable value for
beta should also be .05 -- that is, we should have 95% power. Does the
convention of alpha = .05 and beta = .20 indicate that the researchers consider
Type I errors more serious than Type II errors? I doubt it. I doubt
that the typical researcher ever even ponders the relative seriousness of Type I
versus Type II errors.
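To see what the two conventions imply, the sketch below compares the cases needed at beta = .20 and beta = .05, holding alpha at .05. It assumes a two-tailed, two-independent-groups design with a small effect (d = .2) and uses the normal approximation. The function name is illustrative only.

```python
import math
from statistics import NormalDist


def n_per_group(d, alpha, beta):
    """Approximate cases per group for a two-tailed, two-group comparison."""
    z = NormalDist()
    z_total = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(1 - beta)
    return math.ceil(2 * (z_total / d) ** 2)


alpha = 0.05
for beta in (0.20, 0.05):
    n = n_per_group(d=0.2, alpha=alpha, beta=beta)
    print(f"beta = {beta:.2f} (power = {1 - beta:.0%}, "
          f"beta/alpha = {beta / alpha:.0f}): n per group = {n}")
```

With beta = .20 the researcher is implicitly tolerating a Type II error rate four times the Type I error rate. Under these assumptions, making the two rates equal raises the required sample size by roughly two thirds.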
In the spring of 2020 I reviewed a manuscript out of China. They reported having conducted an a priori analysis to determine how many cases would be needed to have 90% power to detect an effect that is of typical size for research in social psychology. That typical size was r = .21 (Richard, F. D., Bond, C. F., Jr., & Stokes-Zoota, J. J. (2003). One hundred years of social psychology quantitatively described. Review of General Psychology, 7(4), 331-363).
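An a priori analysis of that kind can be approximated as follows. This is my own sketch using the Fisher z approximation with a two-tailed alpha of .05 (the alpha is an assumption). It is not the manuscript's calculation, and the manuscript's figure is not reproduced here.

```python
import math
from statistics import NormalDist


def n_for_r(r, power, alpha=0.05):
    """Approximate total cases to detect a Pearson r (Fisher z method)."""
    z = NormalDist()
    z_total = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return math.ceil((z_total / math.atanh(r)) ** 2 + 3)


print("cases needed for 90% power at r = .21:", n_for_r(0.21, power=0.90))
```

Under these assumptions the requirement comes out at a bit under 240 cases.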
Contact Information for the Webmaster,
Dr. Karl L. Wuensch
This page most recently revised on 27-April-2020.