10 Longitudinal Datasets

Data from large-scale educational assessment programs require special statistical methods because of their scope and complex sample designs. EdSurvey gives users functions to perform analyses that account for these designs. This chapter provides analysis guidelines and tips that apply to most NCES longitudinal data, using the Early Childhood Longitudinal Study, Kindergarten Class of 2010–11 (ECLS-K:2011) as the example data.

10.1 Using EdSurvey to Access ECLS-K:2011 Data for Analysis

Refer to Chapter 4 for how to download and read in a specific longitudinal dataset of interest. The following example shows how to download the ECLS-K:2011 Kindergarten–Fifth Grade public-use data file:

downloadECLS_K(years = 2011, root = "C:/", cache = FALSE)

To load the ECLS-K:2011 data for fifth graders and create an edsurvey.data.frame, select the pathway to the ECLS-K:2011 data folder and assign it the name eclsk11 with this call:

eclsk11 <- readECLS_K2011("C:/ECLS_K/2011")

The function may take several minutes to run the first time. The processed data are then stored on the user’s drive, so subsequent calls to readECLS_K2011 retrieve them almost instantly. Once the data are read in, users can analyze and merge data from the ECLS-K:2011 dataset in the R working environment.

10.2 Retrieving Survey Weights

The variables associated with survey weights can be viewed with the showWeights function when the verbose argument is set to TRUE.

showWeights(data = eclsk11, verbose = TRUE)

In this version, the (lengthy) results are not shown, but the user can easily see the results by running the same code. Selecting the survey weights is especially important for ECLS-K and ECLS-K:2011. Once selected, users can specify the survey weight using the weightVar argument in EdSurvey analytical functions.

To learn more about selecting sample weights for analyses using ECLS-K:2011 data, consult the Calculation and Use of Sample Weights section of the Public-Use Data File User’s Manual for the public-use file of interest. The ECLS-K:2011 Kindergarten–Fifth Grade User’s Manual, Public Version is relevant for the K–5 data used in this chapter. Alternative releases are on the NCES site: ECLS-K:2011 Public-Use Data File User’s Manuals.

10.3 Retrieving Stratum and PSU Variables

The functions getStratumVar and getPSUVar return the names of the stratum and PSU variables associated with a weight variable. Because ECLS-K:2011 does not have default weights, users need to specify a weight to return the associated PSU and stratum variables. For example, the total student weight from the ninth round, weightVar = "w9c29p_9a0", returns the following:

getStratumVar(data = eclsk11, weightVar = "w9c29p_9a0")
#> [1] "w9c29p_9astr"
getPSUVar(data = eclsk11, weightVar = "w9c29p_9a0")
#> [1] "w9c29p_9apsu"

These functions are quite useful for accessing the variables associated with the weights in longitudinal surveys.

10.4 Recoding Data

Data recoding is especially important when performing analyses with ECLS-K:2011 data. By default, EdSurvey omits special values, such as multiple entries, skipped values, or NAs. Typically, this setting helps users by dropping the levels of factors not typically included in regressions, tables, correlations, and other analyses. For ECLS-K:2011, this default setting requires careful consideration. There are many instances in which users should keep special values for their analyses; in these cases, recoding the data is advised.

In ECLS-K:2011, special codes are used to indicate item nonresponse, legitimate skips, and unit nonresponse.

Table 10.1. Missing value codes used in the ECLS-K:2011 data file
Value Description
-1 Not applicable, including legitimate skips
-2 Data suppressed (public-use data file only)
-4 Data suppressed due to administration error
-5 Item not asked in School Administrator Questionnaire Form B
-7 Refused (a type of item nonresponse)
-8 Don’t know (a type of item nonresponse)
-9 Not ascertained (a type of item nonresponse)
(blank) System missing, including unit nonresponse
SOURCE: U.S. Department of Education, National Center for Education Statistics, Early Childhood Longitudinal Study, Kindergarten Class of 2010–11 (ECLS-K:2011), kindergarten–fifth grade (K–5) restricted-use data file.

The method for recoding these values appears later in this chapter, in the Recoding Variables in a Dataset subsection of Retrieving Data for Further Manipulation With getData.

10.5 Removing Special Values

By default, EdSurvey uses listwise deletion to remove special values, such as those detailed in Table 10.1, in all analyses. To use a different method, set dropOmittedLevels = FALSE when running your analysis. You can then remove only the levels you choose with a call to subset, as discussed in the “Subsetting the Data” section in Chapter 3.
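
To make the choice concrete, the following base R sketch contrasts dropping every special value with dropping only selected ones. It works on a small invented data frame rather than calling EdSurvey; the object toy, its columns hmwork and score, and all values are hypothetical and only imitate the label style of Table 10.1.

```r
# Toy data: labels imitate Table 10.1, values are invented
toy <- data.frame(
  hmwork = c("1: NEVER", "3: 1 TO 2 TIMES A WEEK", "-1: NOT APPLICABLE",
             "-7: REFUSED", "5: 5 OR MORE TIMES A WEEK", "-8: DON'T KNOW"),
  score  = c(101, 118, 97, 110, 125, 104),
  stringsAsFactors = FALSE
)

# In this toy coding, special values start with a minus sign
is_special <- startsWith(toy$hmwork, "-")

# (a) Listwise deletion: drop every row that has a special value
drop_all <- toy[!is_special, ]

# (b) Selective removal: keep "-1: NOT APPLICABLE" for later recoding,
#     drop only the item-nonresponse codes
nonresponse <- c("-7: REFUSED", "-8: DON'T KNOW", "-9: NOT ASCERTAINED")
drop_some <- subset(toy, !(hmwork %in% nonresponse))

nrow(drop_all)   # 3 rows remain
nrow(drop_some)  # 4 rows remain
```

Option (b) keeps the legitimate-skip rows available so that they can be recoded to a substantive value instead of being lost.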

10.6 Explore Variable Distributions With summary2

The summary2 function produces weighted and unweighted descriptive statistics for a variable. This functionality is quite useful for gathering response information for survey variables when conducting data exploration. By default, the estimates are not weighted. For example, the variable x9povty_i (“Imputed poverty level”) returns the following output:

summary2(data = eclsk11, variable = "x9povty_i")
#> Estimates are not weighted.
#>                                                                  x9povty_i
#> 1                                               1: BELOW POVERTY THRESHOLD
#> 2 2: AT OR ABOVE POVERTY THRESHOLD, BELOW 200 PERCENT OF POVERTY THRESHOLD
#> 3                          3: AT OR ABOVE 200 PERCENT OF POVERTY THRESHOLD
#> 4                                                                     <NA>
#>      N  Percent
#> 1 2185 12.02267
#> 2 2226 12.24827
#> 3 5809 31.96324
#> 4 7954 43.76582

By default, the summary2 function includes omitted levels; to remove those, set dropOmittedLevels = TRUE:

summary2(data = eclsk11, variable = "x9povty_i", dropOmittedLevels = TRUE)
#> Estimates are not weighted.
#>                                                                  x9povty_i
#> 1                                               1: BELOW POVERTY THRESHOLD
#> 2 2: AT OR ABOVE POVERTY THRESHOLD, BELOW 200 PERCENT OF POVERTY THRESHOLD
#> 3                          3: AT OR ABOVE 200 PERCENT OF POVERTY THRESHOLD
#>      N  Percent
#> 1 2185 21.37965
#> 2 2226 21.78082
#> 3 5809 56.83953

When a weight is specified in the weightVar argument, the summary2 function returns the weighted number of cases, the weighted percentage, and the weighted standard error for a categorical variable. The following example uses the total student weight from the ninth round, weightVar = "w9c29p_9a0":

summary2(data = eclsk11, variable = "x9povty_i", weightVar = "w9c29p_9a0")
#> Warning in calcEdsurveyTable(formula, data, weightVar,
#> jrrIMax, pctAggregationLevel, : Removing 9632 rows with 0
#> weight from analysis.
#> Estimates are weighted using the weight variable 'w9c29p_9a0'
#>                                                                  x9povty_i
#> 1                                               1: BELOW POVERTY THRESHOLD
#> 2 2: AT OR ABOVE POVERTY THRESHOLD, BELOW 200 PERCENT OF POVERTY THRESHOLD
#> 3                          3: AT OR ABOVE 200 PERCENT OF POVERTY THRESHOLD
#>      N Weighted N Weighted Percent Weighted Percent SE
#> 1 1720     887797         22.28798           0.9054683
#> 2 1823     936566         23.51232           0.6704569
#> 3 4999    2158936         54.19970           1.0531340
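
The weighted columns can be understood with a small base R sketch. The data frame toy and its columns povty and wt are invented for illustration. The calculation (sum of weights in a category divided by the sum of all positive weights) is the textbook weighted percentage, not a reproduction of EdSurvey's internals, and it does not compute standard errors.

```r
# Invented data: a three-level category and a sampling weight
toy <- data.frame(
  povty = c("below", "below", "middle", "above", "above", "above"),
  wt    = c(100, 0, 300, 250, 250, 100)
)

# Rows with zero weight contribute nothing and are dropped first
toy_pos <- toy[toy$wt > 0, ]

n_unweighted <- table(toy_pos$povty)                    # unweighted N
n_weighted   <- tapply(toy_pos$wt, toy_pos$povty, sum)  # weighted N
pct_weighted <- 100 * n_weighted / sum(n_weighted)      # weighted percent

data.frame(level        = names(n_weighted),
           N            = as.vector(n_unweighted[names(n_weighted)]),
           weighted_N   = as.vector(n_weighted),
           weighted_pct = as.vector(pct_weighted))
```

In this toy case the "above" level holds 3 of 5 retained rows but 600 of 1,000 weight units, so its weighted percentage is 60.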

10.7 Retrieving Data for Further Manipulation With getData

10.7.1 Retrieving a Set of Variables in a Dataset

Although EdSurvey allows for rudimentary data manipulation and analysis directly on an edsurvey.data.frame connection, the function getData() can extract a dataset of variables for manipulation and analysis as with other data.frame objects. This object, referred to as a light.edsurvey.data.frame, can then be used with packaged EdSurvey analytical functions.

Variables are extracted from an edsurvey.data.frame and returned as a light.edsurvey.data.frame by specifying a set of variable names in varnames or by entering a formula in formula.

To access and manipulate the x_chsex_r (“Sex of students”), x12sesl, p5sumsch (“Child attended summer school”), and p5nhrprm (“Hours per day child attended summer school”) variables in eclsk11, along with the weight variable w5cf5pf_50, call getData:

gddat <- getData(data = eclsk11, varnames = c("x_chsex_r", "w5cf5pf_50", "x12sesl",
                                              "p5sumsch", "p5nhrprm"), 
                                 dropOmittedLevels = FALSE, addAttributes = TRUE)

By default, dropOmittedLevels is set to TRUE, which removes special values such as multiple entries or NAs. This default helps by dropping factor levels that are not typically included in regressions, tables, and correlations. Here we set dropOmittedLevels to FALSE so that special values can be recoded in an example that follows.

The argument addAttributes = TRUE ensures that the analysis functions shown so far can continue to be used with the resulting dataset: gddat.

10.7.2 Retrieving All Variables in a Dataset

To extract all the data in an edsurvey.data.frame, define the varnames argument as colnames(eclsk11), which will query all variables. Setting the argument dropOmittedLevels = FALSE ensures that values that would normally be removed are included:

lsdf0 <- getData(data = eclsk11, varnames = colnames(eclsk11), addAttributes = TRUE,
                 dropOmittedLevels = FALSE)
dim(x = lsdf0) 
dim(x = eclsk11)

Additional details on the features of the getData function appear in the vignette titled Using the getData Function in EdSurvey.

10.7.3 Recoding Variables in a Dataset

As mentioned earlier, data recoding is of particular importance when performing analyses with ECLS-K:2011 data, given the complexity of its survey design. EdSurvey offers methods of recoding data to fit these needs.

Suppose you want to explore student performance in mathematics based on the number of hours per day a parent reported that their child attended summer school (p5nhrprm). You would first need to recode that variable so that students who did not attend summer school (where p5sumsch is coded as 2: NO) are included in the analytic subset with 0 hours.

The table function is a simple method of ascertaining the number of values for each level of a variable in a dataset. Using the table function for the p5nhrprm variable indicates that parents reported their child attending summer school anywhere from 2 to 7 hours per day:

table(gddat$p5nhrprm, useNA = "ifany")
#> 
#>     2     3     4     5     6     7  <NA> 
#>    55    72   126    43    87    62 17729

To include children who attended summer school for 0 hours per day—those who were skipped by the design of the survey—recode p5nhrprm values to zero where p5sumsch == "2: NO":

gddat$p5nhrprm <- ifelse(gddat$p5sumsch == "2: NO", 0, gddat$p5nhrprm)
table(gddat$p5nhrprm, useNA = "ifany")
#> 
#>     0     2     3     4     5     6     7  <NA> 
#>  3913    55    72   126    43    87    62 13816
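
The same recode can be tried on a tiny invented data frame before running it on the full file. In this sketch the object toy and its values are hypothetical; only the variable names and the "2: NO" label come from the example above. It also shows that rows where p5sumsch is itself missing stay NA, because ifelse returns NA when its test is NA.

```r
# Invented rows: two attendees, two legitimate skips, one fully missing case
toy <- data.frame(
  p5sumsch = c("1: YES", "2: NO", "2: NO", NA, "1: YES"),
  p5nhrprm = c(4, NA, NA, NA, 6),
  stringsAsFactors = FALSE
)

# Legitimate skips (no summer school) become 0 hours; everything else is kept
toy$p5nhrprm <- ifelse(toy$p5sumsch == "2: NO", 0, toy$p5nhrprm)

# Two zeros, the reported hours 4 and 6, and one remaining NA
table(toy$p5nhrprm, useNA = "ifany")
```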

Alternatively, for demonstration purposes, a researcher also may choose to recode the -1 values for the p5nhrprm variable directly:

gddat$p5nhrprm <- ifelse(gddat$p5nhrprm == "-1: NOT APPLICABLE*", 0, gddat$p5nhrprm)
table(gddat$p5nhrprm, useNA = "ifany")
#> 
#>     0     2     3     4     5     6     7  <NA> 
#>  3913    55    72   126    43    87    62 13816

A second example of recoding a variable in response to a skip pattern pertains to the frequency that (a) a child does homework (the variable p9hmwork) and (b) a parent/someone else helps (the variable p9hlphwk). The levelsSDF function is useful to show a variable’s levels and their unweighted n sizes.

levelsSDF("p9hmwork", eclsk11)
#> Levels for Variable 'p9hmwork' (Lowest level first):
#>     1. 1: NEVER (n = 294)
#>     2. 2: LESS THAN ONCE A WEEK (n = 381)
#>     3. 3: 1 TO 2 TIMES A WEEK (n = 1406)
#>     4. 4: 3 TO 4 TIMES A WEEK (n = 4236)
#>     5. 5: 5 OR MORE TIMES A WEEK (n = 3838)
#>     -1. -1: NOT APPLICABLE* (n = 51)
#>     -7. -7: REFUSED* (n = 1)
#>     -8. -8: DON'T KNOW* (n = 13)
#>     -9. -9: NOT ASCERTAINED* (n = 0)
#>     NOTE: * indicates an omitted level.
levelsSDF("p9hlphwk", eclsk11)
#> Levels for Variable 'p9hlphwk' (Lowest level first):
#>     1. 1: NEVER (n = 643)
#>     2. 2: LESS THAN ONCE A WEEK (n = 1786)
#>     3. 3: 1 TO 2 TIMES A WEEK (n = 3774)
#>     4. 4: 3 TO 4 TIMES A WEEK (n = 2558)
#>     5. 5: 5 OR MORE TIMES A WEEK (n = 1094)
#>     -1. -1: NOT APPLICABLE* (n = 359)
#>     -7. -7: REFUSED* (n = 2)
#>     -8. -8: DON'T KNOW* (n = 4)
#>     -9. -9: NOT ASCERTAINED* (n = 0)
#>     NOTE: * indicates an omitted level.

The skip pattern for this sequence of survey questions is as follows: If p9hmwork == "1: NEVER" then p9hlphwk is skipped and coded "-1: NOT APPLICABLE". To include this subset of data in the analysis, the variable p9hlphwk can be recoded to 0. First, retrieve the data via getData (along with a few other variables for a subsequent example) to recode using ifelse:

mvData <- getData(data = eclsk11, varnames = c("p9hmwork", "p9hlphwk", "x_chsex_r",
                                              "x9rscalk5", "x9mscalk5", "w9c29p_9t90"), 
                                 dropOmittedLevels = FALSE, addAttributes = TRUE)
mvData$p9hlphwk <- ifelse(mvData$p9hmwork == "1: NEVER" &
                          mvData$p9hlphwk == "-1: NOT APPLICABLE", 0,
                          mvData$p9hlphwk)

Then use table to view the counts of each level.

table(mvData$p9hlphwk, useNA = "ifany")
#> 
#>    0    1    2    3    4    5    6    7    8 <NA> 
#>  294  643 1786 3774 2558 1094   65    2    4 7954

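Now the 294 cases for which mvData$p9hmwork == "1: NEVER" are included as the value 0 in our recoded mvData$p9hlphwk. Because ifelse returns the underlying integer codes of a factor rather than its labels, the values 6, 7, and 8 in this table correspond to the remaining "-1: NOT APPLICABLE" (65), "-7: REFUSED" (2), and "-8: DON'T KNOW" (4) cases; analysts may want to set these to NA before treating the variable as numeric.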
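
The numeric values in the recoded table follow from how base R's ifelse treats factors: it drops the factor attributes and returns the underlying integer codes. The sketch below reproduces that behavior on an invented factor (the objects lv, f, and skipped are illustrative, not ECLS-K:2011 data) and shows one possible way to set leftover special-value codes to NA. Whether that step is appropriate depends on the analysis.

```r
# Invented factor whose special levels come last, as in the levelsSDF output
lv <- c("1: NEVER", "2: LESS THAN ONCE A WEEK",
        "-1: NOT APPLICABLE", "-7: REFUSED")
f  <- factor(c("1: NEVER", "-1: NOT APPLICABLE", "2: LESS THAN ONCE A WEEK",
               "-7: REFUSED", "-1: NOT APPLICABLE"), levels = lv)
skipped <- c(FALSE, TRUE, FALSE, FALSE, FALSE)  # TRUE = legitimate skip

# ifelse() returns the integer codes of f, not its level labels
recoded <- ifelse(skipped & f == "-1: NOT APPLICABLE", 0, f)
recoded
# 1 0 2 4 3

# Codes 3 and 4 are positions of special levels, not substantive responses
special_codes <- which(startsWith(levels(f), "-"))
recoded[recoded %in% special_codes] <- NA
recoded
# 1 0 2 NA NA
```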

With a few recoding steps, the appropriate value levels can be included in the dataset in preparation for analysis with EdSurvey. To find more information about special values specific to ECLS-K:2011, consult the Missing Values section of the ECLS-K:2011 Public-Use Data File User’s Manual.

10.7.3.1 Applying rebindAttributes to Use EdSurvey Functions With Manipulated Data Frames

A helper function that pairs well with getData is rebindAttributes. This function allows users to reassign the attributes from a survey dataset to a data frame that might have had its attributes stripped during the manipulation process. After rebinding attributes, all variables—including those outside the original dataset—are available for use in EdSurvey analytical functions.

The p9hlphwk variable from the Recoding Variables in a Dataset section of this chapter has been recoded using the ifelse function; therefore, the following example will display how to apply survey attributes to this object for analysis.

Using the mvData object created earlier, apply rebindAttributes from the attribute data eclsk11 to the manipulated data frame mvData. The new variables are now available for EdSurvey analytical functions:

mvData <- rebindAttributes(data = mvData, attributeData = eclsk11)
lm2 <- lm.sdf(formula = x9rscalk5 ~ x_chsex_r + p9hlphwk, data = mvData,
              weightVar = "w9c29p_9t90")
#> Removing 1422 rows with nonpositive weight from analysis.
summary(object = lm2)
#> 
#> Formula: x9rscalk5 ~ x_chsex_r + p9hlphwk
#> 
#> Weight variable: 'w9c29p_9t90'
#> Variance method: jackknife
#> JK replicates: 80
#> full data n: 18174
#> n used: 7906
#> 
#> Coefficients:
#>                         coef        se        t    dof
#> (Intercept)        142.46410   0.74660 190.8168 67.852
#> x_chsex_r2: FEMALE   1.44536   0.38854   3.7200 46.223
#> p9hlphwk            -2.11562   0.21672  -9.7619 54.904
#>                     Pr(>|t|)    
#> (Intercept)        < 2.2e-16 ***
#> x_chsex_r2: FEMALE 0.0005384 ***
#> p9hlphwk           1.338e-13 ***
#> ---
#> Signif. codes:  
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Multiple R-squared: 0.0284

More information on the rebindAttributes function is available in Chapter 9.

10.8 Making a Table With edsurveyTable

Summary tables can be created in EdSurvey using the edsurveyTable function. A call to edsurveyTable with two variables, x_chsex_r (“Sex of students”) and p9curmar (“Current marital status”), creates a table that shows the number and percentage of students by gender and their parents’ current marital status. Percentages add up to 100 within each gender.

es1 <- edsurveyTable(formula = ~ x_chsex_r + p9curmar, data = eclsk11,
                     weightVar = "w9c29p_9t90",
                     varMethod = "jackknife")
#> Warning in calcEdsurveyTable(formula, data, weightVar,
#> jrrIMax, pctAggregationLevel, : Removing 2251 rows with 0
#> weight from analysis.

This edsurveyTable is saved as the object es1, and the resulting table can be displayed by printing the object:

Table 10.2. Weighted and Unweighted Sample Size, Percentage Distribution, and Standard Error of Percentage Distribution of Children by Students’ Gender and Their Parents’ Marital Status
x_chsex_r p9curmar N WTD_N PCT SE(PCT)
1: MALE 1: MARRIED (1) 2938 1367616.83 67.608642 1.1756039
1: MALE 2: SEPARATED (2) 151 86412.02 4.271810 0.3944507
1: MALE 3: DIVORCED OR WIDOWED (3, 4) 442 250607.34 12.388866 0.7625198
1: MALE 4: NEVER MARRIED (5) 425 273190.02 13.505249 0.9075561
1: MALE 5: CIVIL UNION/DOMESTIC PARTNERSHIP (6) 81 45017.01 2.225432 0.3368478
2: FEMALE 1: MARRIED (1) 2870 1319848.64 69.131210 1.0257652
2: FEMALE 2: SEPARATED (2) 143 80672.81 4.225491 0.4357400
2: FEMALE 3: DIVORCED OR WIDOWED (3, 4) 428 224738.15 11.771365 0.6138104
2: FEMALE 4: NEVER MARRIED (5) 385 237346.10 12.431746 0.7270084
2: FEMALE 5: CIVIL UNION/DOMESTIC PARTNERSHIP (6) 82 46587.90 2.440187 0.2406026

Given that the previous analysis uses parent data from Round 9, the weight variable "w9c29p_9a0" also might be appropriate. Either "w9c29p_9t90" or "w9c29p_9a0" could be used for this analysis, although each includes nonresponse adjustments for more data components or rounds of data collection than those of interest here. Analysts therefore need to decide which weight they prefer, because no weight adjusts for nonresponse for only the sources used in this analysis. Successive analyses in this chapter that mix Round 9 child and parent variables might substitute the chosen weight. Note the slight differences in the n used and in the results. Consult section 4.3.1, Types of Sample Weights, of the ECLS-K:2011 Kindergarten–Fifth Grade User’s Manual, Public Version for additional guidance on choosing the most appropriate sample weight for an analysis.

es1p <- edsurveyTable(formula = ~ x_chsex_r + p9curmar, data = eclsk11,
                     weightVar = "w9c29p_9a0",
                     varMethod = "jackknife")
#> Warning in calcEdsurveyTable(formula, data, weightVar,
#> jrrIMax, pctAggregationLevel, : Removing 1673 rows with 0
#> weight from analysis.
Table 10.3. Weighted and Unweighted Sample Size, Percentage Distribution, and Standard Error of Percentage Distribution of Children by Students’ Gender and Their Parents’ Marital Status—Using Parent Weights
x_chsex_r p9curmar N WTD_N PCT SE(PCT)
1: MALE 1: MARRIED (1) 3160 1384646.17 67.662995 1.2201761
1: MALE 2: SEPARATED (2) 165 87807.35 4.290850 0.4019977
1: MALE 3: DIVORCED OR WIDOWED (3, 4) 473 257917.26 12.603548 0.7437032
1: MALE 4: NEVER MARRIED (5) 465 273250.63 13.352838 0.8934741
1: MALE 5: CIVIL UNION/DOMESTIC PARTNERSHIP (6) 85 42764.76 2.089770 0.3269484
2: FEMALE 1: MARRIED (1) 3072 1340918.74 69.585339 0.9959942
2: FEMALE 2: SEPARATED (2) 147 78575.20 4.077564 0.4313110
2: FEMALE 3: DIVORCED OR WIDOWED (3, 4) 449 223633.62 11.605193 0.5995680
2: FEMALE 4: NEVER MARRIED (5) 417 234899.85 12.189841 0.7083241
2: FEMALE 5: CIVIL UNION/DOMESTIC PARTNERSHIP (6) 90 48985.90 2.542063 0.2864988

The function also supports different variance estimation methods through the varMethod argument. As shown in the previous examples, the default varMethod = "jackknife" indicates that the call used the jackknife method for variance estimation. By setting varMethod = "Taylor", the edsurveyTable call that produced Table 10.2 can return results using Taylor series variance estimation:

es1t <- edsurveyTable(formula = ~ x_chsex_r + p9curmar, data = eclsk11,
                      weightVar = "w9c29p_9t90",
                      varMethod = "Taylor")
#> Warning in calcEdsurveyTable(formula, data, weightVar,
#> jrrIMax, pctAggregationLevel, : Removing 2251 rows with 0
#> weight from analysis.
Table 10.4. Weighted and Unweighted Sample Size, Percentage Distribution, and Standard Error of Percentage Distribution of Children by Students’ Gender and Their Parents’ Marital Status—Taylor Series
x_chsex_r p9curmar N WTD_N PCT SE(PCT)
1: MALE 1: MARRIED (1) 2938 1367616.83 67.608642 1.3241743
1: MALE 2: SEPARATED (2) 151 86412.02 4.271810 0.4490012
1: MALE 3: DIVORCED OR WIDOWED (3, 4) 442 250607.34 12.388866 0.7819802
1: MALE 4: NEVER MARRIED (5) 425 273190.02 13.505249 1.1456202
1: MALE 5: CIVIL UNION/DOMESTIC PARTNERSHIP (6) 81 45017.01 2.225432 0.3459628
2: FEMALE 1: MARRIED (1) 2870 1319848.64 69.131210 1.1836404
2: FEMALE 2: SEPARATED (2) 143 80672.81 4.225491 0.4377188
2: FEMALE 3: DIVORCED OR WIDOWED (3, 4) 428 224738.15 11.771365 0.6806314
2: FEMALE 4: NEVER MARRIED (5) 385 237346.10 12.431746 0.8638412
2: FEMALE 5: CIVIL UNION/DOMESTIC PARTNERSHIP (6) 82 46587.90 2.440187 0.3446689
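
For intuition about the jackknife option used in the earlier tables, the sketch below computes a replicate-weight standard error for a weighted percentage on invented data. Every object name, the number of groups, and the weights are hypothetical. The sketch uses a generic delete-one-group jackknife, which is an assumption chosen for simplicity; ECLS-K:2011 ships its own replicate weights, and EdSurvey's calculation with them may differ in its details.

```r
set.seed(1)
n       <- 40
G       <- 8                         # invented number of groups
married <- rbinom(n, 1, 0.7)         # 1 = married, 0 = otherwise (invented)
w_full  <- runif(n, 50, 150)         # invented full-sample weights
grp     <- rep(seq_len(G), each = 5) # invented group membership

# One replicate per group: zero out that group, scale up the others
rep_wts <- sapply(seq_len(G), function(g) {
  ifelse(grp == g, 0, w_full * G / (G - 1))
})

# Weighted percentage married for any weight vector
pct <- function(w) 100 * sum(w * married) / sum(w)

est_full <- pct(w_full)              # full-sample estimate
est_reps <- apply(rep_wts, 2, pct)   # one estimate per replicate

# Delete-one-group jackknife: (G - 1) / G times the sum of squared deviations
se_jk <- sqrt((G - 1) / G * sum((est_reps - est_full)^2))
c(estimate = est_full, se = se_jk)
```

The point estimate depends only on the full-sample weights; the replicate weights affect only the standard error, which is why the PCT columns in the jackknife and Taylor series tables agree while the SE(PCT) columns differ.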

If the percentages do not add up to 100 at the desired level, adjust the pctAggregationLevel argument to change the aggregation level. In our previous example, the default behaves as pctAggregationLevel = 1, so the percentages add up to 100 within each level of the first variable in the call, x_chsex_r. Setting pctAggregationLevel = 0 makes the percentages add up to 100 across the entire table.
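
As a rough base R analogy (not EdSurvey code), the sketch below contrasts percentages that sum to 100 within each level of the first variable, the pctAggregationLevel = 1 behavior described above, with percentages that sum to 100 over the whole table, which is how we read the package documentation for pctAggregationLevel = 0. The data frame toy, its columns, and its weights are invented.

```r
# Invented data: two categorical variables and a weight
toy <- data.frame(
  sex     = c("MALE", "MALE", "MALE", "FEMALE", "FEMALE", "FEMALE"),
  marital = c("MARRIED", "MARRIED", "SEPARATED",
              "MARRIED", "SEPARATED", "SEPARATED"),
  wt      = c(30, 30, 40, 50, 25, 25)
)

# Weighted two-way table of sex by marital status
wtab <- xtabs(wt ~ sex + marital, data = toy)

# Percentages that sum to 100 within each level of the first variable
pct_within_sex <- 100 * prop.table(wtab, margin = 1)

# Percentages that sum to 100 over the whole table
pct_overall <- 100 * prop.table(wtab)

rowSums(pct_within_sex)  # each row sums to 100
sum(pct_overall)         # the whole table sums to 100
```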

The calculation of means and standard errors requires computation time that the user may not want to wait for. If you wish to simply see a table of the levels and the N sizes, set the returnMeans and returnSepct arguments to FALSE to omit those columns as follows:

es1b <- edsurveyTable(formula = ~ x_chsex_r + p9curmar, data = eclsk11,
                      weightVar = "w9c29p_9t90", jrrIMax = Inf,
                      returnMeans = FALSE, returnSepct = FALSE)

The resulting table can again be displayed by printing the object:

Table 10.5. Weighted and Unweighted Sample Size and Percentage Distribution of Children by Students’ Gender and Their Parents’ Marital Status
x_chsex_r p9curmar N WTD_N PCT
1: MALE 1: MARRIED (1) 3718 1367616.83 67.608642
1: MALE 2: SEPARATED (2) 210 86412.02 4.271810
1: MALE 3: DIVORCED OR WIDOWED (3, 4) 570 250607.34 12.388866
1: MALE 4: NEVER MARRIED (5) 599 273190.02 13.505249
1: MALE 5: CIVIL UNION/DOMESTIC PARTNERSHIP (6) 103 45017.01 2.225432
2: FEMALE 1: MARRIED (1) 3585 1319848.64 69.131210
2: FEMALE 2: SEPARATED (2) 194 80672.81 4.225491
2: FEMALE 3: DIVORCED OR WIDOWED (3, 4) 545 224738.15 11.771365
2: FEMALE 4: NEVER MARRIED (5) 558 237346.10 12.431746
2: FEMALE 5: CIVIL UNION/DOMESTIC PARTNERSHIP (6) 114 46587.90 2.440187

For more details on the arguments in the edsurveyTable function, look at the examples using ?edsurveyTable.