LSMEANS-SAS.txt

Here is a posting from SAS-L that may tell you more about LSMEANS than you ever wanted to know. LSMEANS are commonly employed in analysis of covariance to produce adjusted cell and marginal means, adjusted in the sense that the effect of any covariates is statistically removed from the scores prior to computing the means. Another use of LSMEANS is to obtain "unweighted" (equally weighted) means from a nonorthogonal (unequal cell sizes) ANOVA. These means are those that we would expect if the independent (class) variables were actually NOT correlated in the population of interest.

-------------------------------------------------------------------------------
Date:     Tue, 28 Aug 90 17:56:11 EDT
Reply-To: Tim Dorcey
Sender:   "SAS(r) Discussion"
From:     Tim Dorcey
Subject:  What are LSMEANS?

The question has recently been raised about the construction and interpretation of LSMEANS, with discussion about various sorts of "adjustments." I look at LSMEANS rather differently, and thought it would be worth presenting this point of view. I apologize to those who are not interested for the length of this posting.

To make things concrete, suppose that we have a 2 x 2 design with a single covariate. Denote the two class variables by A and B, each with levels 1 & 2. For convenience, suppose that the covariate X can take on only the integer values 1,2,...,p. For the expected value of the response variable Y(i,j,x), assume the following linear model:

   E[Y(i,j,x)] = m + a(i) + b(j) + ab(i,j)
                 + c*x + ac(i)*x + bc(j)*x + abc(i,j)*x

   for i=1,2; j=1,2; x=1,2,...,p

Then, estimate by least squares the 18 model parameters:

   m, a(1), a(2), b(1), b(2), ab(1,1), ab(1,2), ab(2,1), ab(2,2),
   c, ac(1), ac(2), bc(1), bc(2), abc(1,1), abc(1,2), abc(2,1), abc(2,2)

If we now are given values of i, j and x and wish to predict the response variable Y(i,j,x), the "least-squares" estimate, which minimizes the expected squared error, is simply the expected value E[Y(i,j,x)] given by the above formula.
This could be termed the "LSMEAN" for A=i, B=j, X=x.

Suppose, instead, that we are not given the values i, j and x. What if we simply draw a random element from this population and are asked to guess the value of the response variable, without knowing i, j or x? Again, the least-squares estimate is an expected value, this time averaged over i, j, x:

   E[Y] = Sum{ Pr[A=i,B=j,X=x]*E[Y(i,j,x)] }
          over i=1,2; j=1,2; x=1,...,p

Plugging in the linear model for E[Y(i,j,x)] and collecting terms gives:

   E[Y] = m + Sum{ Pr[A=i]*a(i) } + Sum{ Pr[B=j]*b(j) }
            + Sum{ Pr[A=i,B=j]*ab(i,j) }
            + Sum{ Pr[X=x]*c*x }
            + Sum{ Pr[A=i,X=x]*ac(i)*x }
            + Sum{ Pr[B=j,X=x]*bc(j)*x }
            + Sum{ Pr[A=i,B=j,X=x]*abc(i,j)*x }

So, in order to guess the value of the response without knowing the values of the predictors, we need to know the probability that the predictors take on different values (i.e., their distribution). Viewed in this framework, the assumptions that PROC GLM makes when computing LSMEANS are:

   1) The class variables have independent, uniform distributions
      (e.g., Pr[A=i] = .5; Pr[B=j] = .5; Pr[A=i,B=j] = .25).

   2) Variables that do not appear in a CLASS statement are independent
      of CLASS variables.

   3) The X values that occur in the experiment are a simple random
      sample from the population (i.e., the observed frequencies of the
      X values are indicative of the true population distribution).

Thus, CLASS and non-CLASS variables are treated quite differently. The observed frequency distributions of CLASS variables *are not* viewed as representative of the population, whereas the observed (marginal) frequencies of non-CLASS variables *are* viewed as representing the population. As with the CLASS variables, the joint distribution of CLASS and non-CLASS variables is treated as an artifact of the experiment, not representative of the population.
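The averaging described above can be checked numerically. The following is only a sketch in Python, not anything PROC GLM itself runs: the parameter values and the covariate sample are made up for illustration, and the names (expected_y, pr, overall) are hypothetical. It evaluates the general formula by brute force under the three assumptions just listed.

```python
from itertools import product

# Hypothetical parameter estimates, invented for illustration only.
m = 10.0
a = {1: 1.0, 2: 0.0}
b = {1: 0.5, 2: 0.0}
ab = {(1, 1): 0.2, (1, 2): 0.0, (2, 1): 0.0, (2, 2): 0.0}
c = 2.0
ac = {1: 0.3, 2: 0.0}
bc = {1: 0.1, 2: 0.0}
abc = {(1, 1): 0.05, (1, 2): 0.0, (2, 1): 0.0, (2, 2): 0.0}


def expected_y(i, j, x):
    """E[Y(i,j,x)] under the linear model in the posting."""
    return (m + a[i] + b[j] + ab[i, j]
            + (c + ac[i] + bc[j] + abc[i, j]) * x)


# Hypothetical covariate values observed in the sample.
x_sample = [1, 1, 2, 3, 3, 3, 4]
n = len(x_sample)

# Assumption 3: Pr[X=x] is taken from the observed frequencies.
pr_x = {x: x_sample.count(x) / n for x in set(x_sample)}


def pr(i, j, x):
    """Assumptions 1 and 2: A and B uniform and independent, X independent of both."""
    return 0.5 * 0.5 * pr_x[x]


# E[Y] = Sum{ Pr[A=i,B=j,X=x] * E[Y(i,j,x)] } over all i, j, x.
overall = sum(pr(i, j, x) * expected_y(i, j, x)
              for i, j, x in product((1, 2), (1, 2), pr_x))
print(overall)
```

Replacing pr with any other joint distribution that sums to 1 gives the least-squares guess for that hypothetical population instead.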
Using these assumptions and the general formula above, the LSMEAN for the INTERCEPT can then be computed as follows:

   E[Y] = m + .5*Sum{a(i)} + .5*Sum{b(j)} + .25*Sum{ab(i,j)}
            + c*xbar + .5*Sum{ac(i)}*xbar + .5*Sum{bc(j)}*xbar
            + .25*Sum{abc(i,j)}*xbar

Here, xbar denotes the sample mean of the x's, which would be the usual estimate of Sum{Pr[X=x]*x} (i.e., the population mean) based on a simple random sample.

Now, it turns out (as near as I can figure) that PROC GLM will not actually compute an LSMEAN for the INTERCEPT term. However, the above can be extended to explain how GLM does compute LSMEANS for other factors. If we drew a random observation and were told only that A=1, then to get our least squares estimate of the response variable, E[Y|A=1], we simply replace each probability in the general formula with a conditional probability, e.g.,

   Pr[A=i]     ---> Pr[A=i|A=1]     = 1 if i=1 and 0 otherwise
   Pr[B=j]     ---> Pr[B=j|A=1]
   Pr[A=i,B=j] ---> Pr[A=i,B=j|A=1] = Pr[B=j|A=1] if i=1 and 0 otherwise
   etc.

Under the assumptions previously stated for the distribution of the predictors, we then have the LSMEAN for A=1 as:

   E[Y|A=1] = m + a(1) + .5*Sum{ b(j) } + .5*Sum{ ab(1,j) }
                + c*xbar + ac(1)*xbar + .5*Sum{ bc(j) }*xbar
                + .5*Sum{ abc(1,j) }*xbar

That this is what SAS actually does can be verified by fitting a model like the one above and requesting LSMEANS with the E option to see the coefficients that are applied to the model parameters.

<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>

To get even more concrete and avoid dealing with GLM's parameterization, imagine a 2 x 2 experiment without any covariate. Then the LSMEANS for the AB interaction are simply the observed cell means; the LSMEANS for the A factor are obtained by averaging the cell means across the levels of B; and the LSMEANS for B are the averages of the cell means across the levels of A. Computing the mean of the means eliminates the influence of different cell sizes on the estimate.
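The no-covariate case above is easy to try by hand. Here is a minimal Python sketch with invented, deliberately unbalanced data (the cell contents and variable names are hypothetical) that contrasts the mean of the cell means with the ordinary pooled mean for A=1.

```python
from statistics import mean

# Hypothetical unbalanced 2 x 2 data: responses keyed by (A level, B level).
cells = {
    (1, 1): [10.0, 12.0, 11.0, 13.0],  # 4 observations
    (1, 2): [20.0],                    # 1 observation
    (2, 1): [8.0, 9.0],                # 2 observations
    (2, 2): [15.0, 17.0, 16.0],        # 3 observations
}

# With no covariate, the LSMEANS for the A*B interaction are the cell means.
cell_means = {key: mean(values) for key, values in cells.items()}

# LSMEAN-style estimate for A=1: average the cell means across levels of B,
# so each cell counts equally regardless of its size.
lsmean_a1 = mean([cell_means[1, j] for j in (1, 2)])

# Ordinary mean for A=1: pool every observation with A=1,
# so the larger cell dominates.
pooled_a1 = mean(cells[1, 1] + cells[1, 2])

print(lsmean_a1, pooled_a1)
```

The two numbers differ only because the cells with A=1 have unequal sizes; with equal cell sizes they would coincide. The mean of the cell means is the one that ignores cell sizes.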
Averaging the cell means in this way makes sense if the different cell sizes are not indicative of the population of interest. If, on the other hand, the observed cell sizes are representative of the population, then computing the usual means (e.g., the mean of all observations for which A=1) makes more sense. This would also be equivalent to inserting appropriate probabilities in the preceding framework.

<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>

In my view, this framework of imagining how one would guess the value of the response variable based upon various amounts of information about the predictors helps to make LSMEANS make sense. I imagine that there are other conceptualizations that also work, and the above may not have been the original motivation for LSMEANS. However, based upon the above, one should be able to construct variations on the LSMEANS (i.e., using the ESTIMATE statement) that are most appropriate for the research question at hand (e.g., using something other than the sample mean for the covariate, or dropping the assumption that the distribution of the covariate is independent of the class variables, etc.).

I hope that this has been useful to some, and that others will point out any errors or alternative points of view.

Tim Dorcey                             BITNET:   TCD@CORNELLA
Statistical Software Consultant        Internet: TCD@CORNELLA.CIT.CORNELL.EDU
Cornell Information Technologies
Cornell University
Ithaca, NY 14853

======================================================================== 28
Sender:  "SAS(r) Discussion"
From:    Dick Campbell 312/413-3759 FAX 312/996-5104
Subject: ANCOVA and LSMEANS

If one has a standard one way ANCOVA, the SAS LSMEANS statement produces classic adjusted means. However, suppose the design is a two way factorial. If the design is orthogonal, LSMEANS still produces standard results. However, if the design is non-orthogonal, it is my understanding that the adjusted means that one gets are estimated as though the design were orthogonal.
This makes sense when departures from orthogonality are "nuisance" effects due to trivial amounts of missing data. It doesn't make sense if the data come from a survey. Example:

   PROC GLM;
      CLASS RACE SEX;
      MODEL y = RACE SEX EDUC;   /* assume no sig. two way interaction */
      LSMEANS RACE SEX;

In this case, unequal N's in the cells of the two by two reflect RACE and SEX differences in mortality. You would not want adjusted means to assume such effects away.

I am concerned about this because a loose reading of the SAS manual would lead one to think that LSMEANS is just an easy way to get classical adjusted means in ANCOVA. It is for simple designs. For more complex designs it appears to do more than that.

======================================================================== 131
Sender:  "SAS(r) Discussion"
From:    Tim Dorcey
Subject: Re: ANCOVA and LSMEANS

This is an important point to be aware of when using LSMEANS. However, you should be able to get whatever you want by using an appropriate ESTIMATE statement. I like to conceptualize this in terms of trying to predict the response variable based on varying amounts of information. Using the example above, we have estimated the parameters of a linear model for the expected value of Y given values of Race (R), Sex (S), and Education (E):

   E{Y|R=i,S=j,E=e} = m + r[i] + s[j] + b*e

So, if I drew a random individual from the population and was told that their Race was i, their Sex was j, and their Education was e, I would use the above formula to calculate a least squares estimate of their value for Y. Now, what would I do if I was only told what their Race was? In order to get a least squares mean for Race=1, I would need to make some assumptions about the distribution of the characteristics in the sample.
The value that PROC GLM will compute for a least squares mean can be obtained by making the following assumptions about the population distribution of the various predictors in the model:

   1) All predictors are independent of one another.

   2) All CLASS variables are uniformly distributed across their levels.

   3) The distribution of non-CLASS variables is represented by their
      observed distribution in the sample.

Taken together, these assumptions imply that:

   E{Y|R=1} = m + r[1] + .5*s[1] + .5*s[2] + b*ebar

where ebar is the mean of E in the sample, and assuming that Sex takes on 2 different values. I.e., since we don't know what S or E are, we assume that each value of S is equally likely and that the likelihood of a particular value for E is proportional to its frequency in the sample.

If the assumptions which lead to this estimate are not reasonable for the population of interest, then one can use the ESTIMATE statement. For example, suppose that the joint distribution of Race x Sex in the population that we are interested in is:

             Race=1   Race=2   Race=3
   Sex=1       .3       .2       .1
   Sex=2       .1       .1       .2

but that we continue to assume that Education is independent of either of these factors. Note that this assumption about Education may seem implausible for most currently existing populations, but we still might like to guess what we would expect to see in a population where there were no education differences among the groups. Then, to get our least squares mean for Race = 1, we first note that the conditional distribution of Sex given Race = 1 is:

   Pr{S=1|R=1} = .3/(.3+.1) = .75   and   Pr{S=2|R=1} = .1/(.3+.1) = .25

Thus:

   E{Y|R=1} = m + r[1] + .75*s[1] + .25*s[2] + b*ebar

or, in PROC GLM terminology:

   ESTIMATE 'R=1' INTERCEPT 1 RACE 1 0 0 SEX .75 .25 EDUC 10.243;

(assuming the mean of EDUC is 10.243).
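The conditional weights used in the ESTIMATE statement above can be derived mechanically from the joint table. The Python sketch below does only that arithmetic; the helper name sex_given_race is hypothetical, and the EDUC mean of 10.243 is the value assumed in the posting, not something computed here.

```python
# Joint distribution of Race x Sex from the posting, keyed by (race, sex).
joint = {
    (1, 1): 0.3, (2, 1): 0.2, (3, 1): 0.1,
    (1, 2): 0.1, (2, 2): 0.1, (3, 2): 0.2,
}


def sex_given_race(r):
    """Pr{S=s | R=r}: divide each joint probability by the marginal Pr{R=r}."""
    marginal = sum(p for (race, _), p in joint.items() if race == r)
    return {s: joint[r, s] / marginal for s in (1, 2)}


w = sex_given_race(1)

# Coefficients in the order used by the ESTIMATE statement:
# INTERCEPT, RACE (3 levels), SEX (2 levels), EDUC.
educ_mean = 10.243  # assumed sample mean of EDUC, as in the posting
coefficients = [1, 1, 0, 0, round(w[1], 2), round(w[2], 2), educ_mean]
print(coefficients)
```

Calling sex_given_race(2) or sex_given_race(3) would give the SEX weights for the other two Race levels under the same joint table.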
Similarly, we could construct, e.g.,

   E{Y|S=2} = m + .25*r[1] + .25*r[2] + .5*r[3] + s[2] + b*ebar

or,

   E{Y|E=11.5} = m + .4*r[1] + .3*r[2] + .3*r[3] + .6*s[1] + .4*s[2] + b*11.5

Now let's drop the assumption that Education is independent of Race and Sex, and suppose that the expected value of Education for each of the groups is:

             Race=1   Race=2   Race=3
   Sex=1       10        9        8
   Sex=2       12       20       11

Then we have:

   E{Y|R=1} = m + r[1] + .75*s[1] + .25*s[2] + b*(.75*10 + .25*12)

Note that we can make up whatever particular joint distribution for S, R, E that we like, and obtain least squares means for that hypothetical population (as long as we believe our original linear model is valid in the population we are thinking of).

Also, we can test whether two least squares means are different from one another by creating the appropriate estimates as above, and then creating a new estimate whose coefficients are the differences between corresponding coefficients of the 2 estimates. E.g., we could ask questions like "if we had a population where the average education in each group was 13.2, would we expect a difference in Y between Race 1 and Race 2?" (Use a CONTRAST statement to compare more than 2 estimates.)

If we were to adopt the most general assumptions that were considered above, and then suppose that the sample from which we estimated the model parameters is representative of the population that we are actually interested in, then we could estimate the joint distribution of Race and Sex and Education from that same data (i.e., we could fill in the two 2 x 3 tables above with sample proportions and sample means).
In that case, however, there would be little point in fitting the linear model in the first place, because all of our least squares means would end up being the same as the simple arithmetic means, e.g.,

   E{Y|R=1} = average of all observations with R=1

Some time ago I posted a similar discussion of this way of looking at LSMEANS and ESTIMATE, and would be happy to forward a copy to anyone who found this posting useful (hey, you made it this far....)