Stat 414 – HW 5
Due midnight, Friday, Nov. 1, or Monday, Nov. 4
Please submit 3 separate files. Remember to include relevant code and output (please trim out the irrelevant output). Make sure you justify your answers.
1) The data in CityToSea5K2023 are the results of the 5K race in Pismo in Nov 2023 (sadly, the race is not being run this year). Variables include gender, age, and finish time (in minutes).
race = read.table("https://www.rossmanchance.com/chance/stat414F24/data/CityToSea5K2023.txt", header = T)
head(race)
(a) Fit a linear regression model (OLS) to predict finish time based on age. Produce a scatterplot showing the fitted model, summarize the relationship, and comment on whether the direction of the association makes sense in this context.
summary(model0 <- lm(time ~ Age, data = race))
plot(race$time ~ race$Age, col = factor(race$Gender))
abline(model0)
(b) Do you see any patterns in the residuals related to the gender variable?
plot(model0$residuals ~ model0$fitted.values, col = factor(race$Gender))
abline(h = 0)
(c) Predict finish time based on gender and age. Produce a color-coded scatterplot, overlaying the two fitted lines.
summary(model1 <- lm(time ~ Gender + Age, data = race))
# Enter appropriate values for a and b!
plot(race$time ~ race$Age, col = factor(race$Gender))
abline(a = , b = )
abline(a = , b = , col = "red")
Explain the values you chose for a and b to create the two lines.
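If you would rather pull the values from the fitted model than type them in by hand, here is a minimal sketch using coef(); it assumes R's default dummy coding, so the second coefficient is the gender indicator (check names(coef(model1)) to confirm the order in your output):
b <- coef(model1)  # order: (Intercept), gender indicator, Age
abline(a = b[1], b = b[3])                      # reference gender group
abline(a = b[1] + b[2], b = b[3], col = "red")  # other gender group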
(d) What concerning patterns do you still see in the residuals vs. fitted values plot?
plot(residuals(model1) ~ fitted.values(model1), col = factor(race$Gender))
Add the interaction to the model and examine the fitted model.
summary(model2 <- lm(time ~ Gender + Age + Gender:Age, data = race))
# or
summary(model2 <- lm(time ~ Gender*Age, data = race))
# Enter appropriate values for a and b!
plot(race$time ~ race$Age, col = factor(race$Gender))
abline(a = , b = )
abline(a = , b = , col = "red")
# see also? with(race, interaction.plot(Age, Gender, model2$fitted.values, type = "l"))
What values did you choose for a and b?
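As before, you can extract the values from the fitted model; a sketch assuming R's default dummy coding (with the interaction, both the intercept and the slope change for the non-reference group):
b <- coef(model2)  # order: (Intercept), gender indicator, Age, gender:Age
abline(a = b[1], b = b[3])                             # reference gender group
abline(a = b[1] + b[2], b = b[3] + b[4], col = "red")  # other gender group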
(e) Describe the nature of the interaction in context.
(f) How many term(s) did the interaction add to the model? Is the interaction statistically significant? State your hypotheses and include supporting evidence (include code).
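One common way to supply that evidence is a nested-model F test comparing the models with and without the interaction (with a single added term, this is equivalent to the t test on the interaction coefficient):
anova(model1, model2)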
Although the multicollinearity from the interaction term is not huge, we will still consider centering our quantitative variable.
race$Age.c = race$Age - mean(race$Age)
summary(model3 <- lm(time ~ Gender + Age.c + Gender:Age.c, data = race))
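To see what centering buys you, one quick check (a sketch using base R only) is the correlation among the design-matrix columns before and after centering; the correlation between the age column and the interaction column should drop noticeably:
round(cor(model.matrix(model2)[, -1]), 2)  # drop the intercept column
round(cor(model.matrix(model3)[, -1]), 2)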
(g) Explain how to interpret each of the coefficients, including the intercept, in this model.
(h) Explain why the coefficient of gender changed between model 2 and model 3, based on the graph. Did the statistical significance change?
(i) How do these plots compare to each other? Why?
plot(effects::allEffects(model2))
plot(effects::allEffects(model3))
(j) Explain the output you get from
effects::allEffects(model3)
Opinion: Do you think this would be an effective way to present the model to a non-statistician?
(k) Explain why it would probably not be a good idea to use gender as a random effect for these data.
2) Recall that the 2006 Programme for International Student Assessment (PISA), organized by the OECD, gathered data on 15- to 16-year-old children from schools across Australia. Variables in the dataset include
· z_read, a standardized reading score
· cen_pos, parental occupational status, a 64-category ordinal score (the lowest value represents low occupational status), which is centered and which we will treat as quantitative
· cen_escs, the centered PISA index of economic, social, and cultural status, a composite index created from items on the student questionnaire, including parental occupation and education as well as possessions within the home (e.g., books, computers, areas for children to study)
· female, an indicator variable with 1 = female and 0 = male
ReadingScores = read.table("https://www.rossmanchance.com/stat414/data/ReadingScores.txt", header = T)
head(ReadingScores)
We mentioned that multilevel models were originally conceptualized as “intercepts as outcomes” models; now we also want to consider “slopes as outcomes.”
(a) Fit a separate OLS line for each school, regressing reading score on cen_pos:
# run a separate OLS regression for each of the 356 schools
intercepts = numeric(356)
slopes = numeric(356)
for (j in 1:356){
  modelj = lm(z_read ~ cen_pos, data = ReadingScores, subset = schoolid == j)
  intercepts[j] = coef(modelj)[1]
  slopes[j] = coef(modelj)[2]
}
# index by schoolid so each student is matched with their own school's coefficients
ReadingScores$yhat_ols <- intercepts[ReadingScores$schoolid] + slopes[ReadingScores$schoolid] * ReadingScores$cen_pos
plot(ReadingScores$cen_pos, ReadingScores$yhat_ols, main = "Fixed Coefficient\n Regression Lines", ylim = c(-6, 2), col = "lightgrey")
for (j in 1:356){
  abline(a = intercepts[j], b = slopes[j])
}
par(mfrow = c(1, 2))
hist(intercepts)
hist(slopes)
load(url("http://www.rossmanchance.com/iscam3/ISCAM.RData"))
iscamsummary(intercepts)
iscamsummary(slopes)
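As an aside, if you prefer to avoid the explicit loop, lme4's lmList fits the same per-school OLS regressions in one call (a sketch, assuming the lme4 package is installed):
library(lme4)
olsfits <- lmList(z_read ~ cen_pos | schoolid, data = ReadingScores)
head(coef(olsfits))  # one (Intercept), cen_pos pair per school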
Which would you say is larger, the variation in the intercepts or in the slopes? How are you deciding?
(b) Suppose we use pos rather than cen_pos.
intercepts = numeric(356)
slopes = numeric(356)
for (j in 1:356){
  modelj = lm(z_read ~ pos, data = ReadingScores, subset = schoolid == j)
  intercepts[j] = coef(modelj)[1]
  slopes[j] = coef(modelj)[2]
}
ReadingScores$yhat_ols <- intercepts[ReadingScores$schoolid] + slopes[ReadingScores$schoolid] * ReadingScores$pos
plot(ReadingScores$pos, ReadingScores$yhat_ols, main = "Fixed Coefficient\n Regression Lines", ylim = c(-6, 2), col = "lightgrey", xlim = c(0, 85))
for (j in 1:356){
  abline(a = intercepts[j], b = slopes[j])
}
iscamsummary(intercepts)
iscamsummary(slopes)
Is there now more or less variation in the intercepts? Explain why. How about the slopes?
(c) Explain what you learn from the following graph:
plot(slopes ~ intercepts)
Be sure your comments are in context (e.g., don’t just use the words intercepts and slopes, but include what is meant by intercept and what is meant by slope), e.g., “Schools that have higher …”
(d) Now consider the cen_escs variable (“family capital”) and whether it should be included with fixed or random slopes. First look at the OLS lines.
# run a separate OLS regression for each school
intercepts2 = numeric(356)
slopes2 = numeric(356)
for (j in 1:356){
  modelj = lm(z_read ~ cen_escs, data = ReadingScores, subset = schoolid == j)
  intercepts2[j] = coef(modelj)[1]
  slopes2[j] = coef(modelj)[2]
}
ReadingScores$yhat_ols <- intercepts2[ReadingScores$schoolid] + slopes2[ReadingScores$schoolid] * ReadingScores$cen_escs
plot(ReadingScores$cen_escs, ReadingScores$yhat_ols, main = "Fixed Coefficient\n Regression Lines", ylim = c(-6, 2), col = "lightgrey")
for (j in 1:356){
  abline(a = intercepts2[j], b = slopes2[j])
}
iscamsummary(intercepts2)
iscamsummary(slopes2)
Is there more or less variation in the slopes than you found for cen_pos?
(e) Now compare the following models:
library(lme4)
model1 = lmer(z_read ~ cen_pos + (1 | schoolid), data = ReadingScores, REML = F)
model1b = lmer(z_read ~ cen_pos + (1 + cen_pos | schoolid), data = ReadingScores, REML = F)
model2 = lmer(z_read ~ cen_escs + (1 | schoolid), data = ReadingScores, REML = F)
model2b = lmer(z_read ~ cen_escs + (1 + cen_escs | schoolid), data = ReadingScores, REML = F)
texreg::screenreg(list(model1, model1b), digits = 3, single.row = TRUE, stars = 0, custom.model.names = c("fixed slope cen_pos", "random slope cen_pos"), custom.note = "")
anova(model1, model1b)
texreg::screenreg(list(model2, model2b), digits = 3, single.row = TRUE, stars = 0, custom.model.names = c("fixed slope cen_escs", "random slope cen_escs"), custom.note = "")
anova(model2, model2b)
Discuss whether the effect of increasing parental occupational status or the effect of family capital on reading scores differs significantly across the schools. Make sure it’s clear what part of the output you are using to justify your answers. Also discuss how these conclusions do or don’t match what you saw earlier in this exercise.
(f) Based on model2b, give a “95% interval” for the slopes of cen_escs for this population of schools. Is it plausible for a school to have a negative slope?
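Here is a minimal sketch of one way to compute such an interval, taking the fixed slope plus/minus two random-slope standard deviations (this assumes the random slopes are roughly normal, and pulls the SD out of VarCorr):
fix_slope <- fixef(model2b)["cen_escs"]
sd_slope <- attr(VarCorr(model2b)$schoolid, "stddev")["cen_escs"]
fix_slope + c(-2, 2) * sd_slope  # approximate 95% interval for school slopes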
(g) Create a graph of the “fitted model” for cen_escs and discuss how it compares to the individual least squares fit lines (other than the number of lines). In particular, is there evidence of “shrinkage” in the multilevel model compared to fitting separate OLS lines? Explain.
ReadingScores$yhat.mlm <- predict(model2b, ReadingScores)
plot(ReadingScores$cen_escs, ReadingScores$yhat.mlm, main = "Random Coefficient\n Regression Lines", ylim = c(-6, 2), col = "lightgrey")
abline(h = seq(-6, 2, 2), col = "lightgrey", lwd = 0.5)
for (j in 1:356){
  lines(ReadingScores$cen_escs[ReadingScores$schoolid == j], ReadingScores$yhat.mlm[ReadingScores$schoolid == j])
}
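One quick numeric check on shrinkage (a sketch; school 5 is an arbitrary choice, and this assumes schoolid runs 1 to 356 so that row 5 of the ranef table is school 5, with intercepts2 and slopes2 still in your workspace from part (d)):
c(ols = intercepts2[5], mlm = fixef(model2b)["(Intercept)"] + ranef(model2b)$schoolid[5, "(Intercept)"])
c(ols = slopes2[5], mlm = fixef(model2b)["cen_escs"] + ranef(model2b)$schoolid[5, "cen_escs"])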
Here is an awesome way to look at a random sample of 9 of the lines:
library(ggeffects)
plot(ggpredict(model2b, terms = c("cen_escs", "schoolid [sample = 9]"), type = "re"), show_ci = FALSE)
# if you have trouble suppressing the ci bands, try
plot(ggpredict(model2b, terms = c("cen_escs", "schoolid [sample = 9]"), type = "re", ci = FALSE))
(h) Regress the OLS intercepts and slopes on the school_size variable.
school_size = aggregate(ReadingScores$schsize, list(ReadingScores$schoolid), FUN = mean)$x
summary(lm(intercepts2 ~ school_size))
summary(lm(slopes2 ~ school_size))
Interpret the intercept and slope coefficients for each of these models.
(i) Does school size explain significant variation in either the intercepts or the slopes of cen_escs? Summarize what you learn in English.
(j) Now add school size to the multilevel model (the total enrollment at the school, schsize, but centered as cen_size; the I() function evaluates the expression before fitting the model).
model3 = lmer(z_read ~ cen_escs + I(cen_size/100) + (1 + cen_escs | schoolid), data = ReadingScores, REML = F)
summary(model3, corr = F)
texreg::screenreg(list(model2b, model3), digits = 3, single.row = TRUE, stars = 0, custom.model.names = c("random slope cen_escs", "adding school size"), custom.note = "")
· Interpret the coefficient of cen_size in context. (Keep in mind what variable was actually used in the model.)
· What percentage of the variation in intercepts is explained by school size?
· What percentage of the variation in cen_escs slopes is explained by school size? (A sketch for computing these percentages follows this list.)
· For which is the explained variation larger? Why does that make sense from what you saw in (h)?
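A minimal sketch for the two percentages, comparing the school-level variance components before and after adding school size (this assumes the row order that as.data.frame(VarCorr()) returns: intercept variance first, then the cen_escs slope variance):
vc2b <- as.data.frame(VarCorr(model2b))
vc3 <- as.data.frame(VarCorr(model3))
(vc2b$vcov[1] - vc3$vcov[1]) / vc2b$vcov[1]  # proportion of intercept variance explained
(vc2b$vcov[2] - vc3$vcov[2]) / vc2b$vcov[2]  # proportion of slope variance explained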
(k) How many parameters are estimated by this model?
(l) Identify and interpret the intercept and the coefficient of cen_escs in context.
(m) Explain why it doesn’t make sense to consider random slopes for the cen_size variable.
3) More exploration of Reading Scores
(a) Are the random effects for the intercepts and slopes pretty normally distributed? Do we still have any extreme slopes or intercepts with model 3?
hist(ranef(model3)$schoolid[[1]])
hist(ranef(model3)$schoolid[[2]])
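Histograms can be hard to judge for normality; a normal probability plot of the same random effects (a quick base R sketch) may be easier to read:
qqnorm(ranef(model3)$schoolid[[1]]); qqline(ranef(model3)$schoolid[[1]])
qqnorm(ranef(model3)$schoolid[[2]]); qqline(ranef(model3)$schoolid[[2]])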
Which schools are they?
which(ranef(model3)$schoolid[[1]] < -1)
which(ranef(model3)$schoolid[[2]] > .4)
How many observations are at each of these schools?
table(ReadingScores$schoolid)
(b) Use the influence.ME package (it may take a couple of minutes). Do any schools appear to be influential on the estimated coefficients (intercept and/or slope)? How are you deciding?
library(influence.ME)
# This is back to model 1, but it should behave a little better :)
modelinfstats <- influence(model1, "schoolid")  # may take a couple of minutes
plot(modelinfstats, which = "dfbetas", xlab = "DFbetaS", ylab = "school id")
which(dfbetas(modelinfstats)[, 1] < -.4)
which(dfbetas(modelinfstats)[, 2] < -.2)
(c) Do any schools appear to be influential on the overall model fit/predictions? What do you learn from the following?
plot(modelinfstats, which = "cook", xlab = "cooks d", ylab = "school id")
which(cooks.distance(modelinfstats) > .1)
(d) What happens when school 132 is removed from the dataset? Do any of the general interpretations/conclusions change? Are the changes what you expected?
# Now starting with model 3
model4 <- exclude.influence(model3, grouping = "schoolid", level = "132")
summary(model4, corr = F)
texreg::screenreg(list(model3, model4), digits = 3, single.row = T, stars = 0, custom.model.names = c("with 132", "without 132"))