9 Analysis Outside EdSurvey
Last edited: July 2023
Suggested Citation
Lee, M. Analysis Outside EdSurvey. In Bailey, P. and Zhang, T. (eds.), Analyzing NCES Data Using EdSurvey: A User’s Guide.
EdSurvey
gives users functions to efficiently analyze education survey data. Although EdSurvey
allows for rudimentary data manipulation and analysis, this chapter will discuss how to integrate other R packages into EdSurvey
. As this chapter will demonstrate, this functionality is especially useful for data processing and manipulation in popular R packages such as dplyr
.
9.1 Integration With Any Other Package
By calling the function getData()
, one can extract a light.edsurvey.data.frame
: a data.frame
-like object containing requested variables, weights, and each weight’s associated replicate weights. This light.edsurvey.data.frame
can be not only manipulated as with other data.frame
objects but also used with packaged EdSurvey
functions. As noted in Chapter 6, setting the arguments dropOmittedLevels
and defaultConditions
to FALSE
ensures that the values that would normally be removed are included. The argument addAttributes = TRUE
ensures the extraction of necessary survey design attributes, including the replicate weights, PSU variables, and strata variables.
library(EdSurvey)
#> Loading required package: car
#> Loading required package: carData
#> Loading required package: lfactors
#> lfactors v1.0.4
#> Loading required package: Dire
#> Dire v2.2.0
#> EdSurvey v4.0.7
sdf <- readNAEP(path = system.file("extdata/data", "M36NT2PM.dat", package = "NAEPprimer"))
gddat <- getData(data = sdf, varnames = c('composite', 'dsex', 'b017451', 'origwt'),
addAttributes = TRUE, dropOmittedLevels = FALSE)
The base R function gsub
allows users to substitute one string for another. The following step recodes “Every day” to “Seven days a week”. The head
function reveals the first 6 values of the recoded variable b017451
accessed by the $
operator:
# 1. Recode a Column Based on a String
gddat$b017451 <- gsub(pattern = "Every day", replacement = "Seven days a week",
x = gddat$b017451)
head(x = gddat$b017451)
#> [1] "Seven days a week" "About once a week"
#> [3] "Seven days a week" "Seven days a week"
#> [5] "Once every few weeks" "2 or 3 times a week"
After manipulating the data, you can use a light.edsurvey.data.frame
with any EdSurvey
function. As shown in the previous example, after retrieving a dataset, it can be used with most other R package functions, but occasionally one might encounter errors. A helper function to circumvent these errors is rebindAttributes
.
9.2 Applying rebindAttributes
to Use EdSurvey
Functions With Manipulated Data Frames
The rebindAttributes
function allows users to reassign the survey data attributes required by EdSurvey
to a data frame that might have had its attributes stripped during the manipulation process. After rebinding the attributes, all variables—including those outside the original dataset—are available for EdSurvey
analytical functions.
For example, a user might want to run a linear model using composite
, the default weight origwt
, the variable dsex
, and the categorical variable b017451
recoded into a binary variable. To do so, we can return a portion of the sdf
survey data as the gddat
object. Next, use the base R function ifelse
to conditionally recode the variable b017451
by collapsing the levels "Never or hardly ever"
and "Once every few weeks"
into one level ("Rarely"
) and all other levels into "At least once a week"
.
gddat <- getData(data = sdf, varnames = c("dsex", "b017451", "origwt", "composite"),
dropOmittedLevels = TRUE)
gddat$studyTalk <- ifelse(gddat$b017451 %in% c("Never or hardly ever",
"Once every few weeks"),
"Rarely", "At least once a week")
From there, apply rebindAttributes
from the attribute data sdf
to the manipulated data frame gddat
. The new variables are now available for use in EdSurvey
analytical functions:
gddat <- rebindAttributes(data = gddat, attributeData = sdf)
lm2 <- lm.sdf(formula = composite ~ dsex + studyTalk, data = gddat)
summary(object = lm2)
#>
#> Formula: composite ~ dsex + studyTalk
#>
#> Weight variable: 'origwt'
#> Variance method: jackknife
#> JK replicates: 62
#> Plausible values: 5
#> jrrIMax: 1
#> full data n: 17606
#> n used: 16331
#>
#> Coefficients:
#> coef se t dof
#> (Intercept) 281.69030 0.96690 291.3349 39.915
#> dsexFemale -2.89797 0.59549 -4.8665 52.433
#> studyTalkRarely -9.41418 0.79620 -11.8239 53.205
#> Pr(>|t|)
#> (Intercept) < 2.2e-16 ***
#> dsexFemale 1.081e-05 ***
#> studyTalkRarely < 2.2e-16 ***
#> ---
#> Signif. codes:
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Multiple R-squared: 0.0168
9.3 Integration With dplyr
One popular package for data manipulation in the R ecosystem is dplyr
. Given its ubiquity, it merits noting common errors that one might encounter when performing analyses using EdSurvey
together with dplyr
.
Let’s say a user is interested in predicting how often a student talks about studies at home based on their gender and disability status. The following example demonstrates how to predict whether a student talks about studies at home (b017451
) based on their sex (dsex
) and whether they have an individualized education plan (iep
) using the weight origwt
. The dependent variable b017451
specified using the outcome level of the regression with I(b017451 == "Never or hardly ever")
:
gddat <- getData(data = sdf, varnames = c("dsex", "b017451", "iep", "lep", "origwt", "composite"),
addAttributes = TRUE, dropOmittedLevels = TRUE)
The dplyr
function unite()
takes multiple variables and concatenates them, similar to the base R function paste0()
. The %>%
(pipe) operator allows an object to be passed forward to another function call.
# Unite columns
gddat <- gddat %>% unite(col = "combinedVar", dsex, iep, sep = "_")
table(gddat$combinedVar)
#>
#> Female_No Female_Yes Male_No Male_Yes
#> 7574 590 7044 1113
# Specify level in I()
logit1 <- logit.sdf(formula = I(b017451 == "Never or hardly ever") ~ combinedVar,
data = gddat)
#> Error in checkDataClass(data, c("edsurvey.data.frame", "light.edsurvey.data.frame", : The argument 'data' must be an edsurvey.data.frame, a light.edsurvey.data.frame, or an edsurvey.data.frame.list. See "Using the 'EdSurvey' Package's getData Function to Manipulate the NAEP Primer Data vignette" for how to work with data in a light.edsurvey.data.frame.
When we attempt to run the logistic regression, EdSurvey
returns an error that it cannot locate the survey weights for this data frame. After creating a new variable, EdSurvey
can no longer access the survey attributes needed to complete this analysis. To remedy, apply rebindAttributes
from the attribute data sdf
to the manipulated data frame gddat
:
gddat <- rebindAttributes(data = gddat, attributeData = sdf)
logit1 <- logit.sdf(formula = I(b017451 =="Never or hardly ever") ~ combinedVar,
data = gddat)
Other functions, such as rowwise()
, group_by()
, and ungroup()
silently override the class of the light.edsurvey.data.frame
, causing the attributes to be inaccessible. In the following example, we use mutate()
to create a new variable mrpcmAverage
that calculates the mean of each row’s plausible values:
gddat <- getData(data = sdf,
varnames = c("dsex", "b017451", "iep", "lep", "origwt", "composite"),
addAttributes = TRUE, dropOmittedLevels = TRUE)
gddat <- gddat %>%
rowwise() %>%
mutate(mrpcmAverage = mean(c(mrpcm1, mrpcm2, mrpcm3, mrpcm4, mrpcm5), na.rm = TRUE))
class(gddat)
#> [1] "rowwise_df" "tbl_df" "tbl" "data.frame"
The function rebindAttributes()
reapplies survey attributes and prepares the data for use with EdSurvey
analysis functions.
gddat <- rebindAttributes(data = gddat, attributeData = sdf)
class(gddat)
#> [1] "light.edsurvey.data.frame" "data.frame"