Splitting a Data File into Two Random Halves

Q:  Why would one want to do this?

A:  To cross-validate the results you obtained when you analyzed the whole data set.  For example, suppose that I have conducted a multiple regression and I am wondering whether the regression coefficients I obtained are stable -- that is, if I were to obtain a second random sample from the population of interest, would multiple regression produce essentially the same regression coefficients?  Rather than going out to get a second sample, I split the data file into two random halves, use multiple regression to develop a model for each half, and then see whether the two models are pretty much the same.

Q:  How can I do this?

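A:  Assign a random number to each case in the data file.  Then sort the cases by the random numbers.  Then split the file into two halves at the median random number.  The example below draws the random numbers from a uniform distribution on 0 to 1 and cuts at .5, the expected median, so the two halves will be approximately, but not necessarily exactly, equal in size.  Here is an example, using SPSS code: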

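* Assign each case a random number between 0 and 1 .
COMPUTE randnum = RV.UNIFORM(0,1) .
EXECUTE .
* Sort the cases by the random numbers .
SORT CASES BY randnum (A) .
* Cases at or below .5 go into half 1, the rest into half 2 .
RECODE randnum (Lowest thru .5=1) (.5 thru Highest=2) INTO half .
EXECUTE .
* Run the same regression separately within each half .
SORT CASES BY half .
SPLIT FILE SEPARATE BY half .
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT ar
  /METHOD=ENTER ideal misanth .
* Turn split-file processing off before running other analyses .
SPLIT FILE OFF .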
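
For readers who want to see the three steps (assign random numbers, sort, split) outside SPSS, here is a minimal sketch in Python.  It is an illustration under stated assumptions, not a translation of the SPSS job above: the function name split_into_random_halves, the seed value, and the eleven case IDs are all hypothetical, and the sketch cuts the sorted cases at the midpoint rather than at .5, so the two halves differ in size by at most one case.

```python
import random

def split_into_random_halves(cases, seed=None):
    """Return (half1, half2), a random partition of cases into two halves."""
    rng = random.Random(seed)  # seeded generator so the split can be reproduced
    # Step 1: attach a uniform random number to each case (by position).
    tagged = [(rng.random(), i) for i in range(len(cases))]
    # Step 2: sort the cases by their random numbers.
    tagged.sort()
    # Step 3: cut the sorted list at its midpoint (the median random number).
    cut = len(tagged) // 2
    half1 = [cases[i] for _, i in tagged[:cut]]
    half2 = [cases[i] for _, i in tagged[cut:]]
    return half1, half2

# Hypothetical data: 11 case IDs standing in for the rows of a data file.
cases = list(range(1, 12))
half1, half2 = split_into_random_halves(cases, seed=12345)
print(len(half1), len(half2))  # prints "5 6"
```

Each half would then be passed to whatever regression routine you use, and the two sets of coefficients compared.  Fixing the seed is a design choice: it lets you reproduce the same split later, while omitting it gives a fresh split on every run.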

Contact Information for the Webmaster,
Dr. Karl L. Wuensch