Cross-validation comparing random effects vs. fixed effects


This is my attempt to replicate one of the cross-validation procedures discussed in Gelman (2006, Technometrics). According to Gelman, the purpose of the study was “to estimate the distribution of radon levels in each of the approximately 3,000 U.S. counties, so that homeowners could make decisions about measuring or remediating the radon in their houses based on the best available knowledge of local conditions.” Also from the article:

- In performing the analysis, we had an important predictor: whether the measurement was taken in a basement. I believe this information is captured by the “floor” variable.

- We also had an important county-level predictor: a measurement of soil uranium that was available at the county level.

- The level 1 residuals represent “within-county variation,” which in this case includes measurement error, natural variation in radon levels within a house over time, and variation between houses (beyond what is explained by the basement indicator).

- The level 2 residuals represent variation between counties beyond what is explained by the county-level uranium predictor.

- The analysis focuses on a subset of the data: the 919 houses from the state radon survey of the 85 counties of Minnesota.

- Complete pooling = the same line for every county (see the model-fitting sketch after this list).

  - Gelman calls this “particularly inappropriate for this application, whose goal is to identify the locations in which residents are at high risk of radon.”

- No pooling = 85 least-squares lines (different intercepts but the same slope).

  - “The no-pooling model overfits the data; for example, it gave an implausibly high estimate of the average radon levels in Lac Qui Parle County, in which only two observations were available.”
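To make the two extremes, and the multilevel compromise between them, concrete, here is a minimal sketch of the three fits. The data frame radon and its columns log.radon, floor, and county are my assumed names for the Minnesota data, not necessarily those used in the script.

    library(lme4)

    ## Complete pooling: one common line, counties ignored entirely
    fit.pool <- lm(log.radon ~ floor, data = radon)

    ## No pooling: a separate intercept for each of the 85 counties,
    ## a common floor slope, and no shrinkage
    fit.nopool <- lm(log.radon ~ floor + factor(county) - 1, data = radon)

    ## Partial pooling: county intercepts shrunk toward their common mean,
    ## most strongly for counties with few houses (e.g., Lac Qui Parle)
    fit.part <- lmer(log.radon ~ floor + (1 | county), data = radon)

The multilevel fit sits between the other two: counties with many houses keep nearly their least-squares intercepts, while counties like Lac Qui Parle are pulled toward the statewide mean.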

Here’s the point of this exercise:

We can use cross-validation to demonstrate the benefits of multilevel modeling formally. We perform two cross-validation tests: first removing single data points and checking the prediction from the model fit to the rest of the data, then removing whole counties and repeating the procedure. At each cross-validation step we compare the complete-pooling, no-pooling, and multilevel (partial-pooling) estimates.
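As a concrete (and hedged) reconstruction of the first test, a leave-one-observation-out loop might look like the sketch below; the radon data frame and column names are the same assumptions as above, and the actual course script surely differs in detail.

    library(lme4)

    n <- nrow(radon)
    err <- matrix(NA, n, 3, dimnames = list(NULL, c("pool", "nopool", "multi")))

    for (i in 1:n) {
      train <- radon[-i, ]
      test  <- radon[i, , drop = FALSE]

      ## Refit all three models without house i
      f.pool   <- lm(log.radon ~ floor, data = train)
      f.nopool <- lm(log.radon ~ floor + factor(county), data = train)
      f.multi  <- lmer(log.radon ~ floor + (1 | county), data = train)

      ## Prediction error for the held-out house under each model.
      ## NB: if dropping house i removes its county's only training
      ## observation, the no-pooling prediction is undefined; the course
      ## script may be excluding such houses (one possible reason fewer
      ## than 919 observations are used).
      err[i, "pool"]   <- test$log.radon - predict(f.pool,   newdata = test)
      err[i, "nopool"] <- test$log.radon - predict(f.nopool, newdata = test)
      err[i, "multi"]  <- test$log.radon - predict(f.multi,  newdata = test)
    }

    sse <- colSums(err^2)   # one SSE per estimation method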

R script

(a) First, a warm-up. Create model1. Plot the residuals for this model against the (log) uranium variable (include your plot). Describe what you learn from this graph and what it tells you about adding (log) uranium to the model.
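This is not an answer key, but if model1 is the multilevel fit without the uranium predictor (my assumption), and log.u holds county-level log uranium, the plot for (a) might be produced like this:

    library(lme4)

    model1 <- lmer(log.radon ~ floor + (1 | county), data = radon)

    ## County-level log uranium (one value per county) and the estimated
    ## county intercepts, i.e., the level 2 variation around the grand mean
    log.u.county <- tapply(radon$log.u, radon$county, mean)
    a.hat <- coef(model1)$county[, "(Intercept)"]

    plot(log.u.county, a.hat,
         xlab = "county-level log uranium",
         ylab = "estimated county intercept",
         main = "model1 county intercepts vs. log uranium")
    abline(lm(a.hat ~ log.u.county))

An upward trend in the county intercepts against log uranium is the graphical argument for adding log.u as a county-level predictor.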

(b) Now run the remaining code and report the 3 SSE values. Write a few sentences explaining what the code is doing. In particular: is it implementing “leave one observation out” or “leave one county out”? How many observations are used in this analysis, and why not 919? What is the point of “likeme”? What are the SSE values measuring, and how are they computed? What are the roles of J and k?
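For contrast with the leave-one-observation-out sketch above, the “leave one county out” version might look roughly like this. I reuse J and k from the question (here, the number of counties and the loop variable); I can only guess at what “likeme” does in the actual script, so it has no counterpart below. Note that the no-pooling model cannot predict for an unseen county at all, so this sketch compares complete pooling against the multilevel model, which falls back on the county-level uranium regression for a new county:

    library(lme4)

    J <- length(unique(radon$county))   # 85 counties
    sse <- c(pool = 0, multi = 0)

    for (k in unique(radon$county)) {
      train <- subset(radon, county != k)
      test  <- subset(radon, county == k)

      f.pool  <- lm(log.radon ~ floor + log.u, data = train)
      f.multi <- lmer(log.radon ~ floor + log.u + (1 | county), data = train)

      ## allow.new.levels = TRUE: predict for the held-out county from the
      ## fixed effects alone (its random intercept is set to zero)
      p.pool  <- predict(f.pool, newdata = test)
      p.multi <- predict(f.multi, newdata = test, allow.new.levels = TRUE)

      sse["pool"]  <- sse["pool"]  + sum((test$log.radon - p.pool)^2)
      sse["multi"] <- sse["multi"] + sum((test$log.radon - p.multi)^2)
    }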

(c) Which method (complete pooling, no pooling, or partial pooling) gives the most accurate predictions?