6 Data Manipulation in EdSurvey and Base R

EdSurvey gives users functions to efficiently analyze education survey data. As mentioned in Chapter 3, EdSurvey is flexible when conducting data manipulation and preparing for survey analysis. It allows for rudimentary data manipulation and analysis with both EdSurvey and base R functions to edit data before processing. Also, by calling the function getData(), one can extract a light.edsurvey.data.frame to manipulate similarly as other data.frame objects in other R packages. This concept is further detailed in Chapter 9analysisOutsideEdSurvey).

6.1 Subsetting the Data

Analysts can directly use a subset of a dataset with EdSurvey functions. In this example, a summary table is created with edsurveyTable after filtering the sample to include only those students whose value for the dsex variable is male and race (as the variable sdracem) is either 1 or 3 (White or Hispanic). Both value levels and labels can be used in EdSurvey functions.

sdf <- readNAEP(system.file("extdata/data", "M36NT2PM.dat", package = "NAEPprimer"))
sdfm <- subset(sdf, dsex == "Male" & (sdracem == 3 | sdracem == 1))
es1 <- edsurveyTable(formula = composite ~ dsex + sdracem, data = sdfm)
es1
Table 6.1. Subsetting Data
dsex sdracem N WTD_N PCT SE(PCT) MEAN SE(MEAN)
Male White 5160 5035.169 76.11329 1.625174 287.6603 0.8995013
Male Hispanic 1244 1580.192 23.88671 1.625174 260.8268 1.5822251

6.2 Recoding Variable Names and Levels Using recode.sdf and rename.sdf

To assist in the process of standardizing data for an edsurvey.data.frame, a light.edsurvey.data.frame, and an edsurvey.data.frame.list, the functions recode.sdf() and rename.sdf() are very handy.

The recode.sdf() function accepts the levels of a variable as a vector from their current values to their newly recoded value. For example, changing the lowest level of b017451 from "Never or hardly ever" to "Infrequently" and the highest level from "Every day" to "Frequently", will recode levels for that variable in our connection to sdf:

sdf2 <- recode.sdf(sdf, recode = list(
  b017451 = list(from = c("Never or hardly ever"),
               to = c("Infrequently")),
  b017451 = list(from = c("Every day"),
               to = c ("Frequently"))
  )
)
searchSDF("b017451", sdf2, levels = TRUE)                              
#> Variable: b017451
#> Label: Talk about studies at home
#> Levels (Lowest level first):
#>      2. Once every few weeks
#>      3. About once a week
#>      4. 2 or 3 times a week
#>      8. Omitted
#>      0. Multiple
#>      9. Infrequently
#>      10. Frequently

In addition, we can change the name of variables using rename.sdf(). The recoded variable "b017451" can be changed to a value that more effectively describes its contents, such as "studytalkfrequency":

sdf2 <- rename.sdf(sdf2, "b017451", "studytalkfrequency")
searchSDF("studytalkfrequency", sdf2, levels = TRUE)
#> Variable: studytalkfrequency
#> Label: Talk about studies at home
#> Levels (Lowest level first):
#>      2. Once every few weeks
#>      3. About once a week
#>      4. 2 or 3 times a week
#>      8. Omitted
#>      0. Multiple
#>      9. Infrequently
#>      10. Frequently

Most EdSurvey analytical functions allow the user to recode variable levels through the recode argument, including, for example, cor.sdf(), lm.sdf(), and edsurveyTable(). Reference a function’s documentation page for details.

It also is important to note that EdSurvey functions (and function arguments) do not permanently overwrite the variable information from your data source; they recode it only for the current connection to the data in R. The original file formatting can be retrieved by reconnecting to the data source via readNAEP().

6.3 Retrieving Data for Further Manipulation With getData

Data can be extracted and manipulated using the function getData. The function getData takes an edsurvey.data.frame and returns a light.edsurvey.data.frame containing the requested variables by either specifying a set of variable names in varnames or entering a formula in formula.4 For plausible values and weights, only the names of the main scale/subscale and the weight variable need to be included here for all necessary supplementary variables to be automatically included.

To access and manipulate data for the dsex and b017451 variables and the num_oper subject scale in sdf, call getData. In the following code, the head function is used, which reveals only the first few rows of the resulting data:

gddat <- getData(data = sdf, varnames = c("dsex","b017451", "num_oper"),
                 omittedLevels = TRUE)
head(gddat)
#>     dsex              b017451 mrps11 mrps12 mrps13 mrps14
#> 1   Male            Every day 321.57 299.37 306.94 310.63
#> 2 Female    About once a week 283.79 273.33 285.96 284.48
#> 3 Female            Every day 334.42 323.01 316.28 361.03
#> 4   Male            Every day 337.25 315.84 316.92 319.00
#> 6 Female Once every few weeks 253.99 248.50 260.63 270.22
#> 7   Male  2 or 3 times a week 313.27 329.85 315.97 313.91
#>   mrps15
#> 1 308.04
#> 2 281.61
#> 3 321.40
#> 4 323.68
#> 6 267.75
#> 7 311.69

By default, setting omittedLevels to TRUE removes special values such as multiple entries or instances of NA. getData tries to help by dropping the levels of factors for regression, tables, and correlations that are not typically included in analysis.

6.4 Retrieving All Variables in a Dataset

To extract all data in an edsurvey.data.frame, define the varnames argument as colnames(sdf), which will query all variables. Setting the arguments omittedLevels and defaultConditions to FALSE ensures that values that would normally be removed are included:

lsdf0 <- getData(data = sdf, varnames = colnames(sdf), addAttributes = TRUE,
                 omittedLevels = FALSE, defaultConditions = FALSE)
dim(lsdf0) # excludes the one school variable in the sdf
dim(sdf)

Once retrieved, this dataset can be used with all EdSurvey functions.

6.5 Using EdSurvey Functions on a Unique light.edsurvey.data.frame

After manipulating the data, you can use a light.edsurvey.data.frame with any EdSurvey function. Most notably, a light.edsurvey.data.frame can create tables using edsurveyTable and run regressions with the lm.sdf function.

6.5.1 edsurveyTable

The following example creates an edsurveyTable using the manipulated light.edsurvey.data.frame (named gddat), the variables dsex and b017451, the five plausible values for composite, and the default weight origwt:5

gddat <- getData(data = sdf, varnames = c("composite", "dsex", "b017451",
                                          "c052601","origwt"), addAttributes = TRUE)
es2 <- edsurveyTable(formula = composite ~ dsex + b017451,
                     weightVar = "origwt", data = gddat)
Table 6.2. Using EdSurvey Functions on a light.edsurvey.data.frame
dsex b017451 N WTD_N PCT SE(PCT) MEAN SE(MEAN)
Male Never or hardly ever 2171 2276.820 28.99585 0.7044670 270.8526 1.090086
Male Once every few weeks 1489 1535.884 19.55985 0.5538779 275.6296 1.357837
Male About once a week 1293 1339.204 17.05508 0.5278360 281.7165 1.449683
Male 2 or 3 times a week 1424 1454.934 18.52893 0.5158073 284.7212 1.661465
Male Every day 1203 1245.385 15.86028 0.5824622 277.8021 1.929363
Female Never or hardly ever 1383 1425.512 18.24810 0.5115641 266.7741 1.555760
Female Once every few weeks 1419 1454.837 18.62349 0.5134568 271.5970 1.295964
Female About once a week 1379 1450.724 18.57084 0.5789385 279.3023 1.660139
Female 2 or 3 times a week 1697 1737.825 22.24604 0.5070853 282.8398 1.459509
Female Every day 1686 1742.940 22.31153 0.6531813 275.7997 1.321104

6.5.2 lm.sdf

To generate a linear model using a light.edsurvey.data.frame, the included arguments from the previous example, as well as the weight origwt, are passed through the lm.sdf function:6

lm2 <- lm.sdf(formula = composite ~ dsex + b017451, weightVar = "origwt", data = gddat)
summary(lm2)
#> 
#> Formula: composite ~ dsex + b017451
#> 
#> Weight variable: 'origwt'
#> Variance method: jackknife
#> JK replicates: 62
#> Plausible values: 5
#> jrrIMax: 1
#> full data n: 17606
#> n used: 15144
#> 
#> Coefficients:
#>                                  coef        se        t
#> (Intercept)                 270.40708   1.05390 256.5768
#> dsexFemale                   -2.92147   0.61554  -4.7462
#> b017451Once every few weeks   4.68200   1.16792   4.0088
#> b017451About once a week     11.57319   1.26477   9.1504
#> b0174512 or 3 times a week   14.88024   1.23890  12.0108
#> b017451Every day              7.93104   1.28155   6.1886
#>                                dof  Pr(>|t|)    
#> (Intercept)                 51.496 < 2.2e-16 ***
#> dsexFemale                  53.963 1.565e-05 ***
#> b017451Once every few weeks 55.188 0.0001848 ***
#> b017451About once a week    49.005 3.519e-12 ***
#> b0174512 or 3 times a week  77.130 < 2.2e-16 ***
#> b017451Every day            50.501 1.074e-07 ***
#> ---
#> Signif. codes:  
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Multiple R-squared: 0.0224

Contrasts from treatment groups also can be omitted from a linear model by stating the variable name in the relevels argument. In this example, values with dsex = "Female" are withheld from the regression. Use the base R function summary to view details about the linear model.

lm3 <- lm.sdf(formula = composite ~ dsex + b017451, data = gddat,
              relevels = list(dsex = "Female"))
summary(lm3)
#> 
#> Formula: composite ~ dsex + b017451
#> 
#> Weight variable: 'origwt'
#> Variance method: jackknife
#> JK replicates: 62
#> Plausible values: 5
#> jrrIMax: 1
#> full data n: 17606
#> n used: 15144
#> 
#> Coefficients:
#>                                  coef        se        t
#> (Intercept)                 267.48561   1.11204 240.5350
#> dsexMale                      2.92147   0.61554   4.7462
#> b017451Once every few weeks   4.68200   1.16792   4.0088
#> b017451About once a week     11.57319   1.26477   9.1504
#> b0174512 or 3 times a week   14.88024   1.23890  12.0108
#> b017451Every day              7.93104   1.28155   6.1886
#>                                dof  Pr(>|t|)    
#> (Intercept)                 65.757 < 2.2e-16 ***
#> dsexMale                    53.963 1.565e-05 ***
#> b017451Once every few weeks 55.188 0.0001848 ***
#> b017451About once a week    49.005 3.519e-12 ***
#> b0174512 or 3 times a week  77.130 < 2.2e-16 ***
#> b017451Every day            50.501 1.074e-07 ***
#> ---
#> Signif. codes:  
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Multiple R-squared: 0.0224

6.5.3 cor.sdf

Users might want to generate a correlation to explore a manipulated light.edsurvey.data.frame. The marginal correlation coefficient among plausible values of the subject scales and subscales can be calculated on a light.edsurvey.data.frame object eddat using the cor.sdf function and the Pearson method. In this example, the variable dsex == "Female" subsets our light.edsurvey.data.frame to calculate the correlation between the subject subscales num_oper and algebra using the default weight origwt:7

eddat <- getData(data = sdf, varnames = c("num_oper","algebra","dsex", 'origwt'),
                addAttributes = TRUE, omittedLevels = FALSE)
eddat <- subset(eddat,dsex == "Female")
cor2 <- cor.sdf(x = "num_oper", y = "algebra", weightVar = "origwt",
                data = eddat, method = "Pearson")
cor2
#> Method: Pearson
#> full data n: 17606
#> n used: 8429
#> 
#> Correlation: 0.8917132
#> Standard Error: 0.006153243
#> Confidence Interval: [0.8785106, 0.9035547]

6.6 Applying rebindAttributes to Use EdSurvey Functions With Manipulated Data Frames

A helper function that pairs well with getData is rebindAttributes. This function allows users to reassign the attributes from a survey dataset to a data frame that might have had its attributes stripped during the manipulation process. Once attributes have been rebinded, all variables—including those outside the original dataset—are available for use in EdSurvey analytical functions.

For instance, the sdf object contains the following attributes:

attributes(sdf)
#> $names
#>  [1] "userConditions"       "defaultConditions"   
#>  [3] "dataList"             "weights"             
#>  [5] "pvvars"               "subject"             
#>  [7] "year"                 "assessmentCode"      
#>  [9] "dataType"             "gradeLevel"          
#> [11] "achievementLevels"    "omittedLevels"       
#> [13] "survey"               "country"             
#> [15] "psuVar"               "stratumVar"          
#> [17] "jkSumMultiplier"      "recodes"             
#> [19] "dim0"                 "validateFactorLabels"
#> [21] "reqDecimalConversion" "fr2Path"             
#> [23] "cache"               
#> 
#> $class
#> [1] "edsurvey.data.frame" "edsurvey.data"

These attributes are lost when variables are retrieved via getData(). For example, a user might want to run a linear model using composite, the default weight origwt, the variable dsex, and the categorical variable b017451 recoded into a binary variable. To do so, we can return a portion of the sdf survey data as the gddat object. Next, use the base R function ifelse to conditionally recode the variable b017451 by collapsing the levels "Never or hardly ever" and "Once every few weeks" into one level ("Rarely") and all other levels into "At least once a week".

gddat <- getData(data = sdf, varnames = c("dsex", "b017451", "origwt", "composite"),
                 omittedLevels = TRUE)
gddat$studyTalk <- ifelse(gddat$b017451 %in% c("Never or hardly ever",
                                               "Once every few weeks"),
                          "Rarely", "At least once a week")

From there, apply rebindAttributes from the attribute data sdf to the manipulated data frame gddat. The new variables are now available for use in EdSurvey analytical functions:

gddat <- rebindAttributes(gddat, sdf)
lm2 <- lm.sdf(formula = composite ~ dsex + studyTalk, data = gddat)
summary(lm2)
#> 
#> Formula: composite ~ dsex + studyTalk
#> 
#> Weight variable: 'origwt'
#> Variance method: jackknife
#> JK replicates: 62
#> Plausible values: 5
#> jrrIMax: 1
#> full data n: 17606
#> n used: 16331
#> 
#> Coefficients:
#>                      coef        se        t    dof
#> (Intercept)     281.69030   0.96690 291.3349 39.915
#> dsexFemale       -2.89797   0.59549  -4.8665 52.433
#> studyTalkRarely  -9.41418   0.79620 -11.8239 53.205
#>                  Pr(>|t|)    
#> (Intercept)     < 2.2e-16 ***
#> dsexFemale      1.081e-05 ***
#> studyTalkRarely < 2.2e-16 ***
#> ---
#> Signif. codes:  
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Multiple R-squared: 0.0168

Additional details on the features of the getData function appear in the vignette titled Using the getData Function in EdSurvey.

6.7 Important Data Manipulation Notes

6.7.1 Memory Usage

Because many NCES databases have hundreds of columns and millions of rows, EdSurvey allows users to analyze data without storing it in the global environment. Alternatively, the getData function retrieves a light.edsurvey.data.frame into the global environment, which can be costlier to memory usage. The base R function object.size estimates the memory being used to store an R object. Computations using objects stored in the global environment are markedly costlier to memory than those made directly from the EdSurvey database:

object.size(gddat <- getData(data = sdf,
                             varnames = c('composite', 'dsex', 'b017451', 'origwt'),
                             addAttributes = TRUE, omittedLevels = FALSE))
#> 9545568 bytes
object.size(lm7 <- lm.sdf(formula = composite ~ dsex + b017451,
                          weightVar='origwt', data = gddat))
#> 7170632 bytes
object.size(lm8 <- lm.sdf(formula = composite ~ dsex + b017451,
                          weightVar='origwt', data = sdf))
#> 2390816 bytes

Although a manipulated light.edsurvey.data.frame requires nearly 10 MB of working memory to store both the light.edsurvey.data.frame and the regression model object (lm7), the resulting object of the same computation made directly from the EdSurvey database (lm8) holds only 5–7 kB. It is a good practice to remove unnecessary values saved in the global environment. Because we have stored many large data objects, let’s remove these before moving on.

rm(df,gddat,eddat)
#> Warning in rm(df, gddat, eddat): object 'df' not found

Some operating systems continue to hold the memory usage even after removing an object. R will clean up your global environment automatically, but a forced garbage cleanup also can be employed:

gc()

6.7.2 Forgetting to Include a Column Variable

When creating a summary table or running regression, EdSurvey will give a warning when a column is missing:

gddat <- getData(data = sdf, 
                 varnames = c(all.vars(composite ~ lep + dsex + iep), "origwt"), 
                 addAttributes = TRUE, omittedLevels = FALSE)
lm9 <- lm.sdf(formula = composite ~ lep + dsex + iep + b017451, data = gddat)
## Using default weight variable 'origwt'
## Error in getData(sdf, c(all.vars(formula), wgt), ..., includeNaLabel = TRUE)
  ## The following variable names are required for this call 
  ## and are not on the incoming data 'b017451'.

The solution is simple: Edit the call to getData to include the variable and rerun the linear model.

gddat <- getData(data = sdf,
                 varnames = c(all.vars(composite ~ lep + dsex + iep + b017451),"origwt"), 
                 addAttributes = TRUE, omittedLevels = FALSE)
lm9 <- lm.sdf(formula = composite ~ lep + dsex + iep + b017451, data = gddat)
lm9
#>                 (Intercept)                       lepNo 
#>                  207.356989                   35.278034 
#>                  dsexFemale                       iepNo 
#>                   -5.285498                   36.170641 
#> b017451Once every few weeks    b017451About once a week 
#>                    3.254744                    9.210189 
#>  b0174512 or 3 times a week            b017451Every day 
#>                   12.659496                    6.808825