Stat 414 – HW 6
Due 8am, Monday, Nov. 10
1) Consider a basic random slopes model:
Y_ij = β0 + β1 x_ij + u_0j + u_1j x_ij + ε_ij,
with (u_0j, u_1j) ~ N(0, Σ), where Var(u_0j) = τ0², Var(u_1j) = τ1², Cov(u_0j, u_1j) = τ01, and ε_ij ~ N(0, σ²), independent of the random effects.
(a) Derive the expression for Var(Y_ij), the marginal variance of an observation.
(Hint: Var(aX + bZ) = a²Var(X) + b²Var(Z) + 2abCov(X, Z).)
You can turn in your work in hard copy or upload a screen capture – just make sure it’s readable!
(b) Derive the expression for Cov(Y_ij, Y_i'j), the covariance between two observations in the same Level 2 group.
(Hint: Cov(aX + bZ, cY + dW) = acCov(X, Y) + adCov(X, W) + bcCov(Z, Y) + bdCov(Z, W).) Show all your steps!
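The covariance identity in the hint can be sanity-checked numerically; the sketch below is just an illustration of the variance version (the names and constants are made up, and the identity holds exactly for sample moments because the sample variance is bilinear in the same way):

```r
# Numerical check of Var(aX + bZ) = a^2 Var(X) + b^2 Var(Z) + 2ab Cov(X, Z)
set.seed(414)
X <- rnorm(1000)
Z <- 0.5 * X + rnorm(1000)   # an arbitrary correlated pair
a <- 2; b <- -3
lhs <- var(a * X + b * Z)
rhs <- a^2 * var(X) + b^2 * var(Z) + 2 * a * b * cov(X, Z)
all.equal(lhs, rhs)          # TRUE: the identity holds for sample moments too
```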
Recall the 2006 Programme for International Student Assessment (PISA), organized by the OECD, gathered data on 15- to 16-year-old children from schools across Australia. Variables in the dataset included
· z_read: a standardized reading score
· cen_pos: parental occupational status, a 64-category ordinal score (the lowest value represents low occupational status), which is centered and we will treat as quantitative
· cen_escs: the centered PISA index of economic, social, and cultural status, a composite index created from items on the student questionnaire, including parental occupation and education as well as possessions within the home (e.g., books, computers, areas for children to study)
· female: an indicator variable with 1 = female and 0 = male
· cen_size: grand-mean centered school size (school sizes range from 10 to 2835)
ReadingScores = read.table("https://www.rossmanchance.com/stat414/data/ReadingScores.txt", header=T)
head(ReadingScores)
library(dplyr)   # for filter()
ReadingScores2 <- ReadingScores |>
  filter(!(schoolid %in% c(139, 350)))
One of the models we explored last time had random slopes for cen_escs (after removing 2 schools with no cen_escs value).
library(nlme)    # for lme()
model2b = lme(z_read ~ cen_escs, random = ~ 1 + cen_escs | schoolid, data = ReadingScores2, method = "REML")
summary(model2b)
preds = predict(model2b, newdata = ReadingScores2)
library(ggplot2)
ggplot(ReadingScores2, aes(x = cen_escs, y = preds, group = schoolid)) +
  geom_smooth(method = "lm", formula = y ~ x, alpha = .5, se = FALSE) +
  theme_bw()
(c) Use this output to find the estimated value for τ01, the covariance between the random intercepts and the random slopes. Show the details!
(d) Calculate the estimated variance of an observation (Y_ij) with cen_escs = -1.209. Show the details!
(e) Calculate the cen_escs value that is predicted to have the smallest variation in z_read scores. Show the details! Does it match your graph?
(f) Calculate the estimated covariance of two observations in the same school (Y_ij and Y_i'j), one with cen_escs = -1.209 and one with cen_escs = -0.7123. Show the details!
(g) Calculate the estimated correlation for the two observations in (f). Show the details! (Hint: you may need to do another calculation or two first.)
(h) Verify your calculations in (c)-(g) in the output below. Be clear where you find each value. (The values won't match exactly – partly depending on whether or not the missing-value observations are removed, and on whether REML or ML estimates are used – so there will be some rounding discrepancies.)
> getVarCov(model2b)
Random effects variance covariance matrix
(Intercept) cen_escs
(Intercept) 0.166580 -0.023286
cen_escs -0.023286 0.019821
Standard Deviations: 0.40814 0.14079
> vcm2 <- getVarCov(model2b, type = "marginal")   # marginal variance covariance matrices, one per school
> vcm2[[1]][1:5, 1:5]   # schoolid 1
1 2 3 4 5
1 0.9597177 0.2506416 0.2283951 0.2234525 0.2139925
2 0.2506416 0.9572646 0.2274188 0.2225303 0.2131739
3 0.2283951 0.2274188 0.9176519 0.2058967 0.1984085
4 0.2234525 0.2225303 0.2058967 0.9100439 0.1951280
5 0.2139925 0.2131739 0.1984085 0.1951280 0.8966920
> cov2cor(vcm2[[1]])[1:5, 1:5]
1 2 3 4 5
1 1.0000000 0.2614962 0.2433750 0.2391014 0.2306774
2 0.2614962 1.0000000 0.2426451 0.2384196 0.2300893
3 0.2433750 0.2426451 1.0000000 0.2253094 0.2187256
4 0.2391014 0.2384196 0.2253094 1.0000000 0.2160064
5 0.2306774 0.2300893 0.2187256 0.2160064 1.0000000
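As a sketch of the part (c) verification: the `summary(model2b)` output reports the random-effect standard deviations and their correlation (labeled `Corr`), and the covariance is the product of the three. The SDs below are the ones printed by `getVarCov` above; the correlation is backed out from that same matrix rather than copied from the summary, so treat the ≈ -0.405 value as an assumption to check against your own output:

```r
# Rebuild tau01 (the intercept-slope covariance) from SDs and correlation.
sd_int   <- 0.40814                           # SD of random intercepts (output above)
sd_slope <- 0.14079                           # SD of random slopes (output above)
corr     <- -0.023286 / (sd_int * sd_slope)   # approx -0.405; summary() reports this as Corr
tau01    <- corr * sd_int * sd_slope          # recovers -0.023286
```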
2) Now focus on the female variable instead.
(a) Summarize the differences you find in the z_read variable between males and females.
load(url("http://www.rossmanchance.com/iscam4/ISCAM.RData"))
iscamsummary(ReadingScores$z_read, ReadingScores$female)
boxplot(ReadingScores$z_read~ ReadingScores$female)
The graph below shows the OLS lines for each school comparing males and females:
ggplot(ReadingScores2, aes(x = female , y = z_read, group=schoolid)) +
geom_smooth(method = "lm", formula = y ~ x, alpha = .5, se = FALSE) +
theme_bw()
(b) Which appears larger, the school-to-school variation in the intercepts or in the slopes? Speculate whether you think either will be statistically significant.
(c) Create modelA with female (as the only predictor variable) and modelB with female with random slopes. Include your R code and the most relevant output. Which model do you recommend? Justify your answer.
Extra credit: Does your answer to (c) surprise you based on the graph? What could be the explanation?
(d) Include the “summary” output for modelB. Report and interpret (in context) the estimated correlation between the random intercepts and the random slopes.
(e) According to modelB, which group (males or females) is predicted to have more variability in their standardized reading scores? Explain. Is this consistent with the (raw) data? (Be very clear what numbers you are comparing.)
(f) According to modelB, is it surprising to find a school with a negative slope? Show your calculation details and also interpret what it means for a school to have a negative slope in this context.
3) Return to model 2b from problem 1.
(a) According to model2b, are the random effects for intercepts and slopes pretty normally distributed? (Which graph is which? How do you know?)
hist(ranef(model2b)[[1]])
hist(ranef(model2b)[[2]])
# same plots from the lmer fit (model2, defined in part (b) below):
hist(ranef(model2)$schoolid[[1]])
hist(ranef(model2)$schoolid[[2]])
Use R to identify schools with extreme slopes or intercepts in model2b. (Hint: Ask ChatGPT about the which() command. Include your prompt, code, and answer.)
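For instance, which() keeps the names of the elements it selects, which is what makes it useful here. A toy illustration on a named vector standing in for the random slopes in ranef(model2b)$schoolid (the school ids and values below are made up):

```r
# which() returns the positions, with names, of the TRUE entries
re_slope <- c("101" = 0.35, "107" = -0.41, "132" = 0.02, "350" = -0.05)
extreme  <- which(abs(re_slope) > 0.3)   # schools with an "extreme" slope effect
names(extreme)                           # "101" "107"
```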
DFBETAS is a measure of how much the estimated coefficients (intercept and slope) change when individual schools are removed from the dataset.
(b) Run the influence.ME package (it may take a couple of minutes). Do any schools appear to be influential on the estimated coefficients (intercept and/or slope)? How are you deciding?
install.packages("influence.ME")
library(influence.ME)
library(lme4)    # for lmer()
model2 <- lmer(z_read ~ cen_escs + (1 + cen_escs | schoolid), data = ReadingScores2, REML = F)
modelinfstats <- influence(model2, "schoolid") #may take a couple minutes
plot(modelinfstats,
which="dfbetas",
xlab="DFbetaS",
ylab="school id")
which(dfbetas(modelinfstats)[,1] < -.2)
which(dfbetas(modelinfstats)[,2] > .2)
Can you identify any of these schools in the graph of fitted lines in problem 1? What feature(s) should they have?
Cook’s distance is a measure of how much the (sum of squared) prediction errors change when individual schools are removed from the dataset – that is, how well we predict each value without that value in the dataset, similar to the PRESS statistic you saw before.
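The idea is the same one Cook's distance captures for ordinary regression; here is a base-R toy illustration on a simple lm (influence.ME applies the analogous leave-one-group-out calculation school by school; the data below are simulated, not the PISA data):

```r
# One planted influential point dominates the Cook's distance values
set.seed(1)
x <- 1:20
y <- 2 * x + rnorm(20)
y[20] <- 100                    # plant an influential observation
fit <- lm(y ~ x)
cd  <- cooks.distance(fit)
which(cd > 4 / length(cd))      # a common rule-of-thumb cutoff flags obs 20
```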
(c) Do any schools appear to be influential on the overall model fit/predictions? What do you learn from the following?
plot(modelinfstats, which="cook", xlab="cooks d", ylab="school id")
which(cooks.distance(modelinfstats) > .1)   # pick (and report) a good cutoff
(d) What happens when school 132 is removed from the dataset? Do any of the general interpretations/conclusions change? Are the changes what you expected?
model2d <- exclude.influence(model2, grouping = "schoolid", level = "132")
summary(model2d, corr=F)
texreg::screenreg(list(model2, model2d), digits = 3, single.row = T, stars = 0, custom.model.names = c("with 132", "without 132"))
(e) Repeat (d) for school 107. Include your code and well-labeled output table and discussion.