Transform-SAS.txt

It is sometimes desirable to apply a nonlinear but monotonic transformation to a variable that is so badly skewed that you feel you have seriously violated the normality assumption of the inferential statistic you wish to use with that variable.

One can obtain measures of skewness (and kurtosis), as well as a test of the null hypothesis that the variable is normally distributed in the population, with SAS' PROC UNIVARIATE. This procedure produces one page of output for each variable. If you have several variables and are willing to settle for measures of skewness (and kurtosis if desired), then PROC MEANS is more paper-efficient.

Suppose that we wish to evaluate variable X1 with PROC UNIVARIATE. Here is how to do so:

     PROC UNIVARIATE NORMAL; VAR X1;

If you wished to obtain normality tests of many variables, you could reduce the size of the output by using the NOPRINT option and

     OUTPUT OUT=name KEYWORD1=name11 name12 ....... KEYWORD2=name21 name22 ...........;

and then printing the statistics that are output. Here is an example where I evaluated two variables and three different transformations of each. The keyword SKEWNESS is for the skewness statistic. The keyword PROBN is for the probability level for the normality test.

------------------------ part of the program ----------------------------
Q8a=sqrt(6-q8); Q8b=log10(6-q8); Q8c=1/(6-q8);
Q60a=sqrt(q60); Q60b=log10(q60); Q60c=1/(q60);
proc univariate normal noprint;
  output out=lotus
    probn=pn8 pn8a pn8b pn8c pn60 pn60a pn60b pn60c
    skewness=sk8 sk8a sk8b sk8c sk60 sk60a sk60b sk60c;
  VAR Q8 Q8a Q8b Q8c Q60 Q60a Q60b Q60c;
proc print;
------------------------------- the output -------------------------------
OBS       SK8      SK8A      SK8B      SK8C      SK60     SK60A     SK60B
  1  -1.52205   0.83072   0.25711   0.38757   1.53966   0.87672   0.41336

OBS     SK60C    PN8   PN8A   PN8B   PN8C   PN60  PN60A  PN60B  PN60C
  1  0.028908      0      0      0      0      0      0      0      0
-----------------------------------------------------------------------

Note that although some of the transformations did a good job of reducing the skewness, they all remain significantly non-normal. I do not advise using a significance test of the normality assumption to determine whether or not a variable is close enough to normal not to worry about the assumption. You see, the sample size (which was large for the example above) has a great influence on the size of the obtained p. With even moderate sample sizes the normality test will have so much power that it will be significant even with relatively minor deviations from normality, deviations so small that one should not worry about them.

Here are some recommendations regarding what sort of transformations to try. It is OK to try several transformations and then choose the one that best normalizes the variable.

To reduce positive skewness, try:

1. a square-root transformation (as I did with Q60a above). If any of the data are < 0 then you will need to add the appropriate constant (to make all data non-negative) prior to doing the square root (unless you wish to work with imaginary numbers).

2. a log transformation (as I did with Q60b) -- as with sqrt, you may need to add a constant first, here so that all data are greater than zero.

3. a reciprocal transformation (as I did with Q60c) -- but don't try to divide by zero.
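
The "add a constant first" step in items 1 and 2 can be sketched as follows. This is only an illustration under stated assumptions: the data set RAW, the variable X, and its values are made up, and the constant 4 is simply one more than the absolute value of the smallest score here (-3), so that the smallest shifted score is 1. Making the minimum 1 is one convenient choice, not a rule from the text above; it keeps all three transformations defined at once.

```sas
/* Hypothetical data: X is positively skewed and includes negative values. */
data raw;
   input X @@;
   datalines;
-3 -2 -2 -1 -1 -1 0 0 1 2 4 9 20
;
run;

/* The smallest X is -3, so adding 4 makes the smallest shifted score 1.  */
/* With your own data, find the minimum first (for example with PROC      */
/* MEANS MIN) and choose the constant accordingly.                        */
data shifted;
   set raw;
   Xa = sqrt(X + 4);      /* square root: needs X + 4 >= 0     */
   Xb = log10(X + 4);     /* log: needs X + 4 > 0              */
   Xc = 1/(X + 4);        /* reciprocal: needs X + 4 not zero  */
run;

/* Compare skewness before and after each transformation. */
proc means data=shifted n min max skewness kurtosis;
   var X Xa Xb Xc;
run;
```

Checking MIN and MAX in the PROC MEANS output is a quick way to confirm that no transformed score came out missing because of an invalid argument.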

To reduce negative skewness simply apply one of the three transformations noted above for positive skewness but first REFLECT the data -- that is, subtract each datum from the maximum value of the variable plus one. Look at what I did with Q8. Q8 ranged from 1 to 5, so I subtracted each datum from 6 before applying a transformation. Do keep in mind that this reverses the meaning of the variable -- for example, on Q6 the higher the score the greater the subject's agreement with a statement, so on reflected Q6 the higher the score the greater the subject's disagreement with the statement.

The reciprocal transformation also reverses the meaning of a variable. For example, if X is how long it took a subject to complete a task, 1/X is a measure of how fast the subject completed the task (the higher 1/X, the less time it took). You may want to multiply reciprocally transformed variables by minus one to restore their original meaning.

In some cases you may wish to reduce skewness by eliminating the outliers that cause the skewness (if you can justify doing so) or recoding the data to bring them closer to the median. For example, consider Q10 below:

------------------------------------------------------------------------
if Q10 < 5 then Q10a=4; else Q10a=5;
proc freq; tables Q10;
proc means mean std skewness kurtosis; VAR Q10 Q10a;

                           Cumulative  Cumulative
Q10   Frequency   Percent   Frequency     Percent
-------------------------------------------------
  1           3       2.7           3         2.7
  2           1       0.9           4         3.6
  3           1       0.9           5         4.5
  4          34      30.6          39        35.1
  5          72      64.9         111       100.0

Variable          Mean       Std Dev      Skewness      Kurtosis
----------------------------------------------------------------
Q10          4.5405405     0.8066066    -2.6741092     8.8193877
Q10A         4.6486486     0.4795575    -0.6313159    -1.6311630
----------------------------------------------------------------

Q10 was a measure of how strongly the subject agreed with the statement "Animals can feel pain." Note that most subjects strongly agreed with this statement and very few disagreed.
I wanted to use this item as a predictor in a canonical regression analysis, but I did not want the few outliers to have an enormous influence on the solution, so I dichotomized the variable so that 5 = strong agreement and 4 = not strong agreement.

Do note the use of PROC MEANS to obtain measures of skewness and kurtosis. PROC MEANS uses only one line of output for each variable, so I advise its use when you wish to evaluate many variables.

See the program file "REFLECT SAS" for an example of how to use an ARRAY statement and a DO loop to reflect a large number of variables.
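
That program file is not reproduced here. As a rough idea of the general technique only, the following sketch reflects several items at once with an ARRAY statement and a DO loop. Everything in it is an assumption made for illustration: the data set SURVEY, the items Q1 to Q3, their values, and the new names Q1r to Q3r are hypothetical, and each item is assumed to range from 1 to 5, so the reflection constant is 6 (the maximum plus one).

```sas
/* Hypothetical data: three 5-point items (1 to 5), each negatively skewed. */
data survey;
   input Q1 Q2 Q3;
   datalines;
5 4 5
5 5 4
4 5 5
3 5 5
1 2 5
;
run;

data reflected;
   set survey;
   array item{3} Q1 Q2 Q3;       /* original variables                */
   array refl{3} Q1r Q2r Q3r;    /* reflected copies (new variables)  */
   do i = 1 to 3;
      /* 6 = maximum possible score (5) plus one; missing stays missing. */
      if not missing(item{i}) then refl{i} = 6 - item{i};
   end;
   drop i;                       /* the loop index is not needed in the output */
run;

/* Skewness should change sign on the reflected copies. */
proc means data=reflected mean std skewness kurtosis;
   var Q1 Q2 Q3 Q1r Q2r Q3r;
run;
```

Writing the reflected scores to new variables, rather than overwriting the originals, keeps both versions available so that the reversed meaning of the reflected items is harder to forget.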