Retrospective (Observed) Power Analysis

Query posted to EDSTAT-L in April of 2003:

A colleague of mine submitted a manuscript to a behavioral journal. He reported the results of an ANOVA which showed there to be a "significant" relationship between a categorical variable and a putatively normally distributed criterion variable. The editor of the journal requested a revision of the manuscript which included a power analysis.

My query to this group is this: Of what value is it to report a power analysis for an effect which has been found to be "significant?" That is, what is the value of finding the probability of obtaining significant results (assuming some nonzero effect size) when significance has already been found? Furthermore, when computing such a power analysis, what effect size should one assume? I have been told that it is common practice to use the observed effect size from the obtained sample, but that strikes me as foolish.

After foaming at the mouth about how editors seem to have no understanding of the logic of Statistical Hypothesis Inference Testing (Cohen's phrase), I cooled down and gave this advice: Pretend that you conducted the power analysis a priori (which you should have), determine what power would have been for a small but not trivial effect and for a medium-sized effect (given the sample sizes you anticipated being able to obtain), and report those statistics. My fear is that if the power statistic is low, the editor will think that it somehow invalidates the "significant" result which was reported. Sigh.
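The pretend-a-priori calculation suggested above can be sketched as follows. This is my own illustration, not the poster's: it uses a normal approximation to a two-group comparison, Cohen's conventional small (d = 0.2) and medium (d = 0.5) effect sizes, and an assumed sample size of 64 per group.

```python
# Sketch of an a priori power calculation (normal approximation to a
# two-sample test; effect sizes and n per group are illustrative).
from math import sqrt
from statistics import NormalDist

Z = NormalDist()

def apriori_power(d, n_per_group, alpha=0.05):
    z_crit = Z.inv_cdf(1 - alpha / 2)       # two-sided critical value
    ncp = d * sqrt(n_per_group / 2)         # noncentrality, two groups of n
    # Probability the test statistic lands in the rejection region
    return Z.cdf(ncp - z_crit) + Z.cdf(-ncp - z_crit)

for d in (0.2, 0.5):                        # Cohen's small, medium
    print(d, round(apriori_power(d, n_per_group=64), 2))
# 64 per group gives roughly 80% power for a medium effect,
# but only about 20% power for a small one.
```

In practice one would use the noncentral F distribution for an ANOVA; the normal approximation is just the simplest way to show the shape of the calculation.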

From: "Simon, Steve, PhD" ssimon@cmh.edu
Subject: RE: A Posteriori Power Analysis
Date: Monday, April 07, 2003 6:18 PM

This type of post hoc power is clearly bogus. It has a one-to-one relationship with the p-value and is always large when the p-value is small and vice versa. If you wanted a publication to back you up, you could cite one or more of the following references.

•  Confidence limit analyses should replace power calculations in the interpretation of epidemiologic studies. A. H. Smith and M. N. Bates. Epidemiology 1992: 3(5); 449-52.    Abstract: "Frequently, after an epidemiologic study is completed, statistical power to detect a relative risk of interest is recalculated using data obtained during the course of the study. A negative study may then be dismissed on the grounds that its power was too low. However, post hoc power calculations ignore the actual relative estimate and its variance, which are by then known. We present evidence that post-study power calculations have little value and should be replaced by a more informative method using the upper (1 - alpha)% confidence limit of the point estimate that touches the value of the relative risk of interest."

•  The Overemphasis On Power Analysis. Thomas Knapp. Nursing Research 1996: 45(6); 379.

•  Some Practical Guidelines for Effective Sample Size Determination. R. V. Lenth. The American Statistician 2001: 55(3); 187-193.

•  The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis. John M. Hoenig and Dennis M. Heisey. The American Statistician 2001: 55(1); 19-24.

The best thing to present in the paper is an a priori sample size calculation. If this was not done, rely on the width of the confidence intervals to demonstrate whether the sample size was adequate. A post hoc power computed at a biologically relevant effect size is a poor third choice, and a post hoc power at the observed effect size is pathetic.

If you want to fight, then fight. You have plenty of material to cite. But I must admit that acquiescing sure is tempting. I've gone along with some bad referee comments in the past just because I wanted the whole publication process to end.

Steve Simon, ssimon@cmh.edu

From: "Bob Wheeler"
Date: Monday, April 07, 2003 7:26 PM

Here are two paragraphs from the SSize write-up that may be of interest:

There are two large problems with a posteriori power procedures. First of all, it is an improper calculation of a probability. Power is, of course, the probability that the alternate hypothesis will be correctly judged to occur when it is true. The value of this probability changes upon collection of the data, and the probabilities before and after the data are not necessarily equal. Zumbo and Hubley (1998) illustrate this for a simple case in which the probabilities before and after the data are 0.483 and 0.935 respectively. If, as most a posteriori users do, the usual power formulas had been used after taking the data and, upon finding a significant result, judging the alternate to be true, the user would mistakenly have assessed the probability of the alternate hypothesis to be 0.483, rather than the correct value of 0.935.

Secondly, it cannot be used to add anything to the interpretation of the results, such as an assessment of the likelihood of the null. This is in fact what is attempted by those who calculate observed power, where the observed values are fed into a reversed power calculation and the size of the resulting power is used as evidence of the adequacy of the study. In point of fact, "the observed power is completely determined by the observed p value" (Hoenig and Heisey, 2001). It follows that those who perform such observed power calculations on nonsignificant results will inevitably find that the power is low.
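Hoenig and Heisey's point can be checked directly. The sketch below is my own illustration (a two-sided z-test standing in for the more general case): "observed" power, the power computed at the observed effect size, is a function of the p-value alone, so it can never tell you anything the p-value did not.

```python
# Illustration: observed power for a two-sided z-test depends only on p.
from statistics import NormalDist

Z = NormalDist()

def observed_power(p, alpha=0.05):
    z_obs = Z.inv_cdf(1 - p / 2)        # |z| implied by the p-value
    z_crit = Z.inv_cdf(1 - alpha / 2)   # two-sided critical value
    # Rejection probability if the true effect equaled the observed one
    return Z.cdf(z_obs - z_crit) + Z.cdf(-z_obs - z_crit)

print(round(observed_power(0.05), 2))   # 0.5: p exactly at alpha
print(observed_power(0.20) < 0.5)       # True for any nonsignificant p
```

A result with p exactly at alpha always has observed power of about one half, and any nonsignificant p gives observed power below one half — which is why such calculations for negative studies "inevitably find that the power is low."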

Zumbo, B.D. and Hubley, A.M. (1998). A note on misconceptions concerning prospective and retrospective power. The Statistician, 47(2), 385-388. A PDF copy of this paper may be obtained by e-mail from bruno.zumbo@ubc.ca.

Hoenig, J.M. and Heisey, D.M. (2001). The abuse of power: The pervasive fallacy of power calculations for data analysis. The American Statistician, 55(1), 19-24.

PS [KLW]: Statisticians finally convinced the persons who construct the Publication Manual of the American Psychological Association that researchers should conduct power analysis as part of their research projects. However, editors and others who rely on the APA Publication Manual appear not to understand that the power analysis should be conducted a priori, as part of the planning of the research, when determining how many cases need be sampled to have a reasonable chance of detecting an effect of a specified size. Accordingly, when a researcher sends in a manuscript that does not include a power analysis, some editors will ask that the researcher include a power analysis in a revised manuscript.

If the reported effect is statistically significant, reporting a power analysis is just plain foolish. However, if the reported effect is not significant, a power analysis may be helpful: if the researcher can show that the study had great power for detecting even a small effect, then the nonsignificant result can be used to argue that the effect is, for all practical purposes, zero. This is essentially what Steve Simon labeled "a poor third choice" earlier in this document. A better choice would be to present a confidence interval for the parameter of interest. If it is narrow and includes the null value, then one can make a strong statement about the effect being trivial in magnitude. Such an approach is referred to as equivalence testing.
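The confidence-interval approach described in the postscript can be sketched as follows. The equivalence margin `delta` and the numbers are my illustrative assumptions, not from this thread: if the 90% CI lies entirely inside (-delta, +delta), both one-sided tests at alpha = .05 reject, which is the two-one-sided-tests (TOST) form of equivalence testing.

```python
# Sketch of a CI-based equivalence check (margin and numbers illustrative).
from statistics import NormalDist

def equivalent(estimate, se, delta, alpha=0.05):
    z = NormalDist().inv_cdf(1 - alpha)            # one-sided critical value
    lo, hi = estimate - z * se, estimate + z * se  # 90% CI when alpha = .05
    # Effect is "practically zero" only if the whole CI sits inside the margin
    return -delta < lo and hi < delta

print(equivalent(0.02, 0.05, delta=0.2))  # True: narrow CI around ~0
print(equivalent(0.02, 0.30, delta=0.2))  # False: CI too wide to conclude
```

Note the asymmetry with the usual significance test: a wide CI that includes zero says "we don't know," while only a narrow CI inside the margin licenses the claim that the effect is trivial.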