Stat 414 – HW 5

Due midnight, Friday, Nov. 1, or Monday, Nov. 4

 

Please submit 3 separate files.  Remember to include relevant code and output (please trim out the irrelevant output).  Make sure you justify your answers.

 

1) The data in CityToSea5K2023 are the results of the 5K race in Pismo in Nov 2023 (sadly, the race is not being run this year). Variables include gender, age, and finish time (in minutes).

race = read.table("https://www.rossmanchance.com/chance/stat414F24/data/CityToSea5K2023.txt", header=T)

head(race)

(a) Fit a linear regression model (OLS) to predict finish time from age. Produce a scatterplot showing the fitted model, summarize the relationship, and comment on whether the direction of the association makes sense in this context.

summary(model0 <- lm(time ~ Age, data = race))

plot(race$time ~ race$Age, col = factor(race$Gender))

abline(model0)

 

(b) Do you see any patterns in the residuals related to the gender variable?

plot(model0$residuals ~ model0$fitted.values, col=factor(race$Gender))

abline(h = 0)

 

(c) Predict finish time based on gender and age.  Produce a color-coded scatterplot, overlaying the two lines. 

summary(model1 <- lm(time ~ Gender + Age, data = race))

#Enter appropriate values for a and b!

plot(race$time ~ race$Age, col = factor(race$Gender))

abline(a =    , b = )

abline(a =  , b =   , col = "red")

Explain the values you chose for a and b to create the two lines.
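
If helpful, the two intercepts and the common slope can also be read off programmatically with coef() rather than typed by hand. A minimal sketch on simulated data (the coefficient name "GenderM" comes from this simulation's coding; check names(coef(model1)) on the real data):

```r
# Sketch with simulated data: extracting a and b for two parallel lines
# from an additive model. Names here mirror the race data but are simulated.
set.seed(1)
d <- data.frame(Age = runif(100, 15, 70),
                Gender = sample(c("F", "M"), 100, replace = TRUE))
d$time <- 30 + 0.2 * d$Age + 4 * (d$Gender == "M") + rnorm(100, sd = 2)
fit <- lm(time ~ Gender + Age, data = d)
cf <- coef(fit)                  # (Intercept), GenderM, Age
a_F <- cf[["(Intercept)"]]       # intercept for the baseline group (F)
a_M <- a_F + cf[["GenderM"]]     # baseline intercept + gender shift
b   <- cf[["Age"]]               # common slope: the lines are parallel
plot(d$time ~ d$Age, col = factor(d$Gender))
abline(a = a_F, b = b)
abline(a = a_M, b = b, col = "red")
```

The key point: in the additive model the gender coefficient shifts the intercept only, so both lines share the slope on Age.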

 

(d) What concerning patterns do you still see in the residuals vs. fitted plot?

plot(residuals(model1)~fitted.values(model1), col = factor(race$Gender))

 

Add the interaction to the model and examine the fitted model.

summary(model2 <- lm(time ~ Gender + Age + Gender:Age, data = race))

#or summary(model2 <- lm(time ~ Gender*Age, data = race))

#Enter appropriate values for a and b!

plot(race$time ~ race$Age, col = factor(race$Gender))

abline(a =    , b =    )

abline(a =    , b =    , col = "red")

#see also?  with(race, interaction.plot(Age, Gender, model2$fitted.values, type = "l"))

What values did you choose for a and b?

 

(e) Describe the nature of the interaction in context.

(f) How many term(s) did the interaction add to the model? Is the interaction statistically significant? State your hypotheses and include supporting evidence (include code).
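
For the significance question, one standard tool is the partial F-test comparing the nested models with anova(); a sketch on simulated data (for the race data the corresponding call would be anova(model1, model2)):

```r
# Sketch: partial F-test for an interaction term, on simulated data.
set.seed(2)
d <- data.frame(x = rnorm(80), g = sample(c("A", "B"), 80, replace = TRUE))
d$y <- 1 + 2 * d$x + 3 * (d$g == "B") * d$x + rnorm(80)
m_add <- lm(y ~ g + x, data = d)   # additive model (parallel lines)
m_int <- lm(y ~ g * x, data = d)   # adds the g:x interaction
anova(m_add, m_int)                # F test on the interaction's 1 df
```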

Although the multicollinearity from the interaction term is not huge, we will still consider centering our quantitative variable.

Age.c = race$Age - mean(race$Age)

summary(model3 <- lm(time ~ Gender + Age.c + Gender:Age.c, data = race))
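
To see why centering helps: with uncentered Age, the interaction column (Age times the gender dummy) is nearly a multiple of the dummy itself, so the two predictors are highly correlated. A quick simulated demonstration (all names hypothetical):

```r
# Demo: centering the quantitative variable cuts the correlation between
# the gender dummy and the interaction column. (Simulated data.)
set.seed(3)
age <- runif(200, 20, 70)
g   <- rbinom(200, 1, 0.5)     # stand-in gender dummy
cor(g, age * g)                # raw: large
age_c <- age - mean(age)
cor(g, age_c * g)              # centered: near 0
```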

 

(g) Explain how to interpret each of the coefficients, including the intercept, in this model. 

(h) Based on the graph, explain why the coefficient of gender changed between model 2 and model 3.  Did the statistical significance change?

(i) How do these plots compare to each other?  Why?

plot(effects::allEffects(model2))

plot(effects::allEffects(model3))

(j) Explain the output you get from

effects::allEffects(model3)

Opinion: Do you think this would be an effective way to present the model to a non-statistician?

 

(k) Explain why it would probably not be a good idea to use gender as a random effect for these data.


2) Recall that the 2006 Programme for International Student Assessment (PISA), organized by the OECD, gathered data on 15- to 16-year-old children from schools across Australia.  Variables in the dataset include

·       z_read, a standardized reading score

·       cen_pos, parental occupational status, a 64-category ordinal score (the lowest value represents low occupational status), which has been centered and which we will treat as quantitative

·       cen_escs: the centered PISA index of economic, social, and cultural status, a composite index created from items on the student questionnaire, including parental occupation and education as well as possessions within the home (e.g., books, computers, areas for children to study).

·       female, an indicator variable with 1 = female and 0 = male

ReadingScores = read.table("https://www.rossmanchance.com/stat414/data/ReadingScores.txt", header=T)

head(ReadingScores)

 

We mentioned that multilevel models were originally conceptualized as “intercepts as outcomes” models and now we also want to consider “slopes as outcomes.”

(a) Fit a separate OLS line for each school, regressing reading score on cen_pos:

#run a separate OLS regression for each school

intercepts = numeric(356)

slopes = numeric(356)

for (j in 1:356){

  modelj = lm(z_read ~ cen_pos, data = ReadingScores, subset = schoolid == j)

  intercepts[j] = coef(modelj)[1]

  slopes[j] = coef(modelj)[2]

}

#index by schoolid so each student gets their own school's fitted line

ReadingScores$yhat_ols <- intercepts[ReadingScores$schoolid] + slopes[ReadingScores$schoolid] * ReadingScores$cen_pos

plot(ReadingScores$cen_pos, ReadingScores$yhat_ols, main="Fixed Coefficient\n Regression Lines", ylim=c(-6, 2), col="lightgrey")

for (j in 1:356){

  abline(a = intercepts[j], b = slopes[j])

}

#R gives a warning, but I think we are ok!

 

par(mfrow=c(1,2))

hist(intercepts)

hist(slopes)

 

load(url("http://www.rossmanchance.com/iscam3/ISCAM.RData"))

iscamsummary(intercepts)

iscamsummary(slopes)

 

Which would you say is larger, the variation in the intercepts or in the slopes? How are you deciding?

(b) Suppose we use pos rather than cen_pos.

intercepts = numeric(356)

slopes = numeric(356)

for (j in 1:356){

  modelj = lm(z_read ~ pos, data = ReadingScores, subset = schoolid == j)

  intercepts[j] = coef(modelj)[1]

  slopes[j] = coef(modelj)[2]

}

ReadingScores$yhat_ols <- intercepts[ReadingScores$schoolid] + slopes[ReadingScores$schoolid] * ReadingScores$pos

plot(ReadingScores$pos, ReadingScores$yhat_ols, main="Fixed Coefficient\n Regression Lines", ylim=c(-6, 2), col="lightgrey", xlim = c(0, 85))

for (j in 1:356){

  abline(a = intercepts[j], b = slopes[j])

}

iscamsummary(intercepts)

iscamsummary(slopes)

 

Is there now more or less variation in the intercepts?  Explain why. How about the slopes?

 

(c) Explain what you learn from the following graph:  plot(slopes ~ intercepts)

Be sure your comments are in context (don’t just use the words “intercepts” and “slopes”; say what the intercept and the slope each represent here), e.g., “Schools that have higher …”

(d) Now consider the cen_escs variable (“family capital”) and whether it should be included with fixed or random slopes. First look at the OLS lines:

#run a separate OLS regression for each school

intercepts2 = numeric(356)

slopes2 = numeric(356)

for (j in 1:356){

  modelj = lm(z_read ~ cen_escs, data = ReadingScores, subset = schoolid == j)

  intercepts2[j] = coef(modelj)[1]

  slopes2[j] = coef(modelj)[2]

}

ReadingScores$yhat_ols <- intercepts2[ReadingScores$schoolid] + slopes2[ReadingScores$schoolid] * ReadingScores$cen_escs

plot(ReadingScores$cen_escs, ReadingScores$yhat_ols, main="Fixed Coefficient\n Regression Lines", ylim=c(-6, 2), col="lightgrey")

for (j in 1:356){

  abline(a = intercepts2[j], b = slopes2[j])

}

iscamsummary(intercepts2)

iscamsummary(slopes2)

 

Is there more or less variation in the slopes than you found for cen_pos?

(e) Now compare the following models:

library(lme4)

model1 = lmer(z_read ~ cen_pos + (1 | schoolid), data = ReadingScores, REML = F)

model1b = lmer(z_read~cen_pos + (1 + cen_pos | schoolid), data = ReadingScores, REML = F)

model2 =  lmer(z_read~cen_escs + (1  | schoolid), data = ReadingScores, REML = F)

model2b = lmer(z_read~cen_escs + (1 + cen_escs | schoolid), data = ReadingScores, REML = F)

 

texreg::screenreg(list(model1, model1b), digits = 3, single.row = TRUE, stars = 0,

custom.model.names = c("fixed slope cen_pos", " random slope cen_pos"), custom.note = "")

anova(model1, model1b)

 

texreg::screenreg(list(model2, model2b), digits = 3, single.row = TRUE, stars = 0,

custom.model.names = c("fixed slope cen_escs", " random slope cen_escs"), custom.note = "")

anova(model2, model2b)

 

Discuss whether the effect of increasing parental occupation or the effect of family capital on reading scores differs significantly across the schools. Make sure it’s clear what part of the outputs you are using to justify your answers.  Also discuss how these conclusions match/don’t match what you saw earlier in this exercise.

(f) Based on model2b, give a “95% interval” for the slopes of cen_escs for this population of schools. Is it plausible for a school to have a negative slope?
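
A common recipe for such an interval is the fixed slope plus/minus 1.96 times the random-slope SD from VarCorr(). A sketch using lme4's built-in sleepstudy data, which has the same random intercept + random slope structure; for the homework, swap in model2b and cen_escs:

```r
library(lme4)
# Sketch on sleepstudy: "95% interval" for group-specific slopes =
# fixed slope +/- 1.96 * SD of the random slope.
fit <- lmer(Reaction ~ Days + (1 + Days | Subject), sleepstudy, REML = FALSE)
vc <- as.data.frame(VarCorr(fit))
slope_sd <- vc$sdcor[which(vc$var1 == "Days" & is.na(vc$var2))]
fixef(fit)[["Days"]] + c(-1.96, 1.96) * slope_sd
```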

 

(g) Create a graph of the “fitted model” for cen_escs and discuss how it compares to the individual least squares fit lines (other than the number of lines). In particular, is there evidence of “shrinkage” in the multilevel model compared to fitting separate OLS lines? Explain.

ReadingScores$yhat.mlm <- predict(model2b, ReadingScores)

plot(ReadingScores$cen_escs, ReadingScores$yhat.mlm, main="Random Coefficient\n Regression Lines", ylim=c(-6, 2), col="lightgrey")

abline(h=seq(-6, 2, 2), col="lightgrey", lwd=0.5)

for (j in 1:356){

  #sort by cen_escs within each school so lines() draws a clean segment

  rows = which(ReadingScores$schoolid == j)

  rows = rows[order(ReadingScores$cen_escs[rows])]

  lines(ReadingScores$cen_escs[rows], ReadingScores$yhat.mlm[rows])

}

 

Here is an awesome way to look at a random sample of 9 of the lines:

library(ggeffects)

plot(ggpredict(model2b, terms = c("cen_escs", "schoolid [sample = 9]"), type = "re"), show_ci = FALSE)

#if you have trouble suppressing the ci bands, try

plot(ggpredict(model2b, terms = c("cen_escs", "schoolid [sample = 9]"), type = "re", ci = FALSE))

 

(h) Regress the OLS intercepts and slopes on the school_size variable.

school_size = aggregate(ReadingScores$schsize, list(ReadingScores$schoolid), FUN=mean )$x

summary(lm(intercepts2 ~ school_size))

summary(lm(slopes2 ~ school_size))

Interpret the intercept and slope coefficients for each of these models. 

(i) Does school size explain significant variation in either the intercepts or the slopes of cen_escs?  Summarize what you learn in English.

 

(j) Now add school size to the multilevel model: schsize, the total enrollment at the school, centered as cen_size. (The I() function evaluates the expression before fitting the model.)

model3 = lmer(z_read ~ cen_escs + I( cen_size/100) + (1 + cen_escs  | schoolid), data = ReadingScores, REML = F)

summary(model3, corr=F)

texreg::screenreg(list(model2b, model3), digits = 3, single.row = TRUE, stars = 0,

custom.model.names = c("random slope cen_escs", "adding school size"), custom.note = "")

 

·       Interpret the coefficient of cen_size in context. (Keep in mind what variable was actually used in the model)

·       What percentage of the variation in intercepts is explained by school size?

·       What percentage of the variation in cen_escs slopes is explained by the school size?

·       For which is the variation explained larger? Why does that make sense from what you saw in (h)?
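
For the percentage-of-variation bullets, the usual pseudo-R² compares each level-2 variance component before and after adding the school-level predictor. The numbers below are hypothetical placeholders, not model output; substitute the variance components reported for model2b and model3:

```r
# Pseudo-R^2 sketch (placeholder numbers, NOT the real output):
# proportion of a level-2 variance component explained by cen_size.
tau_without <- 0.20   # e.g., intercept variance in model2b (hypothetical)
tau_with    <- 0.15   # intercept variance after adding cen_size (hypothetical)
(tau_without - tau_with) / tau_without   # 0.25, i.e., 25% explained
```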

 

(k) How many parameters are estimated by this model?
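
One way to check a parameter count against R is the df attribute attached to the log-likelihood (for the homework model, attr(logLik(model3), "df")). A sketch on lme4's sleepstudy data, where a random intercept + random slope model has 6 parameters:

```r
library(lme4)
# 2 fixed effects + 2 random-effect variances + 1 covariance
# + 1 residual variance = 6 parameters.
fit <- lmer(Reaction ~ Days + (1 + Days | Subject), sleepstudy, REML = FALSE)
attr(logLik(fit), "df")
```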

(l) Identify and interpret the intercept, coefficient of cen_escs, and  in context.

(m) Explain why it doesn’t make sense to consider random slopes for the cen_size variable.

 

3) More exploration of Reading Scores

(a) Are the random effects for intercepts and slopes pretty normally distributed? Do we still have any extreme slopes or intercepts with model 3?

hist(ranef(model3)$schoolid[[1]])

hist(ranef(model3)$schoolid[[2]])

Which schools are they?

which(ranef(model3)$schoolid[[1]]< -1)

which(ranef(model3)$schoolid[[2]]>.4)

How many observations at each of these schools?

table(ReadingScores$schoolid)

 

(b) Run the influence.ME package (it may take a couple of minutes). Do any schools appear to be influential on the estimated coefficients (intercept and/or slope)? How are you deciding? 

library(influence.ME)

#This is back to model 1, but should behave a little better :)

modelinfstats <- influence(model1, "schoolid")  #may take a couple minutes

plot(modelinfstats,

     which="dfbetas",

     xlab="DFbetaS",

     ylab="school id")

 

which(dfbetas(modelinfstats)[,1] < -.4)

which(dfbetas(modelinfstats)[,2] < -.2)

 

(c)  Do any schools appear to be influential on the overall model fit/predictions? What do you learn from the following?

plot(modelinfstats, which="cook", xlab="cooks d", ylab="school id")

which(cooks.distance(modelinfstats) >  .1)
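
Note the 0.1 cutoff above is ad hoc; a common rule of thumb for Cook's distance with k groups is 4/k, which with 356 schools is much smaller and so flags more schools:

```r
# Rule-of-thumb Cook's distance cutoff with k = 356 schools.
4 / 356   # about 0.011
```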

 

(d) What happens when school 132 is removed from the dataset?  Do any of the general interpretations/conclusions change? Are the changes what you expected?

#Now starting with model 3

model4 <- exclude.influence(model3, grouping = "schoolid", level = "132")

summary(model4, corr=F)

texreg::screenreg(list(model3, model4), digits = 3, single.row = T, stars = 0, custom.model.names = c("with 132", "without 132"))