Cross-Validation
comparing random effects vs. fixed effects
This is my attempt
to replicate one of the cross-validation procedures discussed in Gelman
(2006, Technometrics). According to Gelman,
the purpose of the study was “to estimate the distribution of radon levels in
each of the approximately 3,000 U.S. counties, so that homeowners could make
decisions about measuring or remediating the radon in their houses based on the
best available knowledge of local conditions.”
Also from the article:
· In performing the analysis, we had an important predictor: whether the measurement was taken in a basement. I believe this information is captured by the “floor” variable.
· We also had an important county-level predictor: a measurement of soil uranium that was available at the county level.
· The level 1 residuals represent “within-county variation,” which in this case includes measurement error, natural variation in radon levels within a house over time, and variation between houses (beyond what is explained by the basement indicator).
· The level 2 residuals represent variation between counties beyond what is explained by the county-level uranium predictor.
· The article focuses on a subset of the data: the 919 houses from the state radon survey of the 85 counties of Minnesota.
· Complete pooling = the same line for every county
  o “particularly inappropriate for this application, whose goal is to identify the locations in which residents are at high risk of radon”
· No pooling = 85 least-squares lines (different intercepts but same slope); a sketch of all three fits in R follows this list.
  o “the no-pooling model overfits the data; for example, it gave an implausibly high estimate of the average radon levels in Lac Qui Parle County, in which only two observations were available”
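For concreteness, here is a minimal sketch of how the three fits could be set up in R with lme4. The data frame and variable names (radon, log.radon, floor, county) are assumptions for illustration, not necessarily the names used in the assignment code.

# Minimal sketch, assuming a data frame `radon` with columns log.radon
# (log radon measurement), floor (basement indicator), and county.
# These names are illustrative; the assignment's data may differ.
library(lme4)

# Complete pooling: one regression line shared by every county
fit.pool <- lm(log.radon ~ floor, data = radon)

# No pooling: a separate intercept for each of the 85 counties, common slope for floor
fit.nopool <- lm(log.radon ~ floor + factor(county) - 1, data = radon)

# Partial pooling: multilevel model with county intercepts shrunk toward the overall mean
fit.mlm <- lmer(log.radon ~ floor + (1 | county), data = radon)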
Here’s the point of this exercise:
We can use cross-validation to formally demonstrate the benefits of multilevel modeling. We perform two cross-validation tests: first removing single data points and checking the prediction from the model fit to the rest of the data, then removing single counties and performing the same procedure. For each cross-validation step, we compare complete-pooling, no-pooling, and multilevel estimates.
(a) First, a
warm-up. Create model1. Plot the residuals for this model against the (log)
uranium variable (include your plot). Describe what you learn from this graph
and what it tells you about adding (log) uranium to the model.
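The definition of model1 isn’t reproduced here. Assuming it is the varying-intercept model with the floor predictor and no uranium term, and assuming the county-level log-uranium values sit in a vector u (one entry per county, ordered to match levels(radon$county)), a rough sketch of the plot is:

# Sketch under the assumptions above; model1 and u are placeholders for
# whatever the assignment code actually defines.
library(lme4)
model1 <- lmer(log.radon ~ floor + (1 | county), data = radon)

# Estimated county intercepts (level-2 residuals) from model1
a.hat <- ranef(model1)$county[, "(Intercept)"]

# Plot them against county-level log uranium
plot(u, a.hat,
     xlab = "county-level log uranium",
     ylab = "estimated county intercept")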
(b) Now run the
remaining code and report the 3 SSE values.
Write a few sentences explaining what the code is doing. In particular, is
this implementing “leave one observation out” or “leave one county out”? How many observations are used in this
analysis? Why not 919? What is the point of “likeme”?
What are the sse values measuring? How are they
computed? What are the roles of J and k?
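The code referred to in (b) isn’t reproduced in this section. Purely as a point of reference, a leave-one-observation-out comparison of the three methods could be sketched as below; the variable names are my own and need not match the code’s likeme, J, or k.

# Rough sketch of a leave-one-observation-out comparison, using the same
# assumed data frame `radon` as above; this is not the assignment's code.
library(lme4)

n <- nrow(radon)

# A held-out house in a single-house county leaves no within-county data
# to fit against, so restrict the loop to counties with at least two houses
# (one reason such a check may use fewer than all 919 observations).
county.size <- ave(rep(1, n), radon$county, FUN = sum)
keep <- which(county.size >= 2)

err <- matrix(NA, nrow = n, ncol = 3,
              dimnames = list(NULL, c("pool", "nopool", "mlm")))

for (i in keep) {
  train <- radon[-i, ]
  test  <- radon[i, , drop = FALSE]

  f.pool   <- lm(log.radon ~ floor, data = train)                      # complete pooling
  f.nopool <- lm(log.radon ~ floor + factor(county) - 1, data = train) # no pooling
  f.mlm    <- lmer(log.radon ~ floor + (1 | county), data = train)     # partial pooling

  y <- test$log.radon
  err[i, "pool"]   <- y - predict(f.pool,   newdata = test)
  err[i, "nopool"] <- y - predict(f.nopool, newdata = test)
  err[i, "mlm"]    <- y - predict(f.mlm,    newdata = test)
}

# One SSE per method: sum of squared prediction errors over the held-out houses
sse <- colSums(err^2, na.rm = TRUE)
sse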
(c) Which method (complete pooling, no pooling, or partial pooling) gives the most accurate predictions?