Stat 414 – HW 5

Due 8am Monday Nov. 3

 

Please submit 3 separate files.  Remember to include relevant code and output (please trim out the irrelevant output).  Make sure you justify your answers.

 

1) Reconsider the “life satisfaction” question on the 2014 European Social Survey (0 = extremely dissatisfied and 10 = extremely satisfied) for 29 countries.  Run the following commands in R.

library(dplyr)   #for %>% and mutate()

library(lme4)    #for lmer()

load(url("https://www.rossmanchance.com/chance/stat414F22/data/ess5000.Rdata"))  #loads a data frame named d

essdata <- d

model0 <- lmer(stflife ~ 1 + (1 | cntry), data = essdata)

essdata <- essdata %>%  mutate(edrange = eduyrs/20)

model1 <- lmer(stflife ~ 1 + edrange + (1 | cntry), data = essdata)

essdata <- essdata %>% mutate(aveed = ave(edrange, cntry))

model2 <- lmer(stflife ~ 1 + edrange + aveed + (1 | cntry), data = essdata)

essdata <- essdata %>% mutate(deved = edrange - aveed)

model3 <- lmer(stflife ~ 1 + deved + aveed + (1 | cntry), data = essdata)

essdata <- essdata %>% mutate(gni01 = scales::rescale(gnipc1000)) #min-max scaling

model4 <- lmer(stflife ~ 1 + deved + aveed + gni01 + (1 | cntry), data = essdata)

#install.packages("texreg")   #run once if texreg is not already installed

texreg::screenreg(

     list(model0, model1, model2, model3, model4),

     custom.model.names = c("null", "model1", "model2", "model3", "model4"),

     single.row = TRUE,        # estimates & SEs on one line

     digits = 3

 )

Include only the screenreg output and answer the following questions from that output. Yes, you have seen some of these questions before.
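As an aside on the gni01 step above: with default arguments, scales::rescale() is just min-max scaling, which is easy to do by hand. A sketch with illustrative values (not the real GNI data):

```r
#min-max scaling by hand: maps the smallest value to 0 and the largest to 1
gnipc1000 <- c(5, 20, 50, 80)   #illustrative values, not the real data
gni01 <- (gnipc1000 - min(gnipc1000)) / (max(gnipc1000) - min(gnipc1000))
gni01  #0.0 0.2 0.6 1.0
```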

(a) How are the coefficients of “individual education” related in models 2 and 3?  Explain why.

(b) How is the coefficient of “country-mean education” in model 3 related to model 2? Explain why.

(c) Is the coefficient of “country-mean education” significantly different from zero in model 2? (How are you deciding from the screenreg output?) What does this tell you?

(d) Is the coefficient of “country-mean education” significantly different from zero in model 3? What does this tell you?

(e) Interpret the coefficient of the (rescaled) GNI variable in context. 

(f) Is the coefficient of the GNI variable statistically significant? What does that tell you?  (Be pretty specific.)

(g) How does adding the GNI variable impact the coefficient of country-mean education?  Explain what you learn.

(h) If you had to pick one of these models, which would you choose and why?

(i) Consider the following three “types” of variables:

·     Variables that only explain variation at Level 2

·     Variables that explain variation at both Level 1 and Level 2

·     Variables that only explain variation at Level 1.

For the four variables you considered here (edrange, gni01, deved, aveed), which type is each?
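The decomposition behind models 2 and 3 can be checked on a toy example (the group labels and values below are made up):

```r
#each observation splits exactly into its group mean plus a within-group deviation
x     <- c(1, 3, 2, 6)
group <- c("a", "a", "b", "b")
xbar  <- ave(x, group)  #group means: 2 2 4 4
dev   <- x - xbar       #deviations: -1 1 -2 2
stopifnot(all.equal(x, xbar + dev))  #x = group mean + deviation, always
```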

 

2) The data in CityToSea5K2023 are the results of the 5K race in Pismo in Nov 2023 (sadly, the races were not run in 2024, but you should come cheer folks on in 2025!). Variables include gender, age, and finish time (in minutes).

race= read.table("https://www.rossmanchance.com/chance/stat414F24/data/CityToSea5K2023.txt", header=T)

head(race)

(a) Fit a linear regression model (OLS) to predict finish time based on age. Produce a scatterplot showing the fitted model, summarize the relationship, and comment on whether the direction of the association makes sense in this context.

summary(model1 <- lm(time ~ Age, data = race))

race$Gender <- factor(race$Gender)
plot(race$time ~ race$Age, col = as.numeric(race$Gender), pch = 16,
     xlab = "Age", ylab = "Time", main = "Time vs Age by Gender")
abline(model1) # Add the regression line
legend_colors <- 1:length(levels(race$Gender))  # Generate two unique color indices for the levels
legend("bottomright", legend = levels(race$Gender), col = legend_colors, pch = 16, title = "Gender")

 (b) Do you see any patterns in the residuals related to the gender variable?

plot(model1$residuals ~ model1$fitted.values, col=factor(race$Gender))

abline(h = 0) 

(c) Predict finish time based on gender and age.  Produce a color-coded scatterplot, overlaying the two lines. 

summary(model2 <- lm(time ~ Gender + Age, data = race))

#Enter appropriate values for a and b!

plot (race$time ~ race$Age, col=factor(race$Gender))

abline(a =    , b = )

abline(a =  , b =   , col = "red")

Explain the values you chose for a and b to create each of the two lines.
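In general, for a model with a two-level factor plus a quantitative predictor, both lines can be read off coef(). A sketch on simulated data (all names and numbers below are made up, not the race data):

```r
#simulated data: two groups sharing a common slope
set.seed(1)
g <- factor(rep(c("F", "M"), each = 50))
x <- runif(100, 20, 60)
y <- 30 + 5 * (g == "M") + 0.2 * x + rnorm(100)
fit <- lm(y ~ g + x)
b <- coef(fit)  #(Intercept), gM, x
plot(y ~ x, col = g, pch = 16)
abline(a = b[1], b = b[3])                      #baseline group "F": intercept, common slope
abline(a = b[1] + b[2], b = b[3], col = "red")  #group "M": intercept shifted by the gM coefficient
```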

(d) What concerning patterns do you still see in the residuals vs. fitted plot?

plot(residuals(model2) ~ fitted.values(model2), col = factor(race$Gender))

abline(h = 0)

 

Add the interaction to the model and examine the fitted model.

summary(model2 <- lm(time ~ Gender + Age + Gender:Age, data = race))

#or summary(model2 <- lm(time ~ Gender*Age, data = race))

#Enter appropriate values for a and b!

plot (race$time ~ race$Age, col=factor(race$Gender))

abline(a =    , b =    )

abline(a =    , b =    , col = "red")

#see also: with(race, interaction.plot(Age, Gender, model2$fitted.values, type = "l"))

What values did you choose for a and b?

 

(e) Describe the nature of the interaction in context.

(f) How many term(s) did the interaction add to the model? Is the interaction statistically significant? State your hypotheses and include supporting evidence (include code).
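One generic way to test an interaction is a partial F-test comparing nested models. A sketch on simulated data (the variable names and numbers are made up, not the race data):

```r
#simulated data where the slope truly differs by group
set.seed(2)
g <- factor(rep(c("F", "M"), each = 60))
x <- runif(120, 20, 60)
y <- 30 + 0.1 * x + 0.3 * (g == "M") * x + rnorm(120)
reduced <- lm(y ~ g + x)  #parallel lines
full    <- lm(y ~ g * x)  #separate slopes (one extra term here)
anova(reduced, full)      #partial F-test for the interaction term
```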

 

Although the multicollinearity from the interaction term is not huge, we should still consider centering our quantitative variable.

Age.c = race$Age - mean(race$Age)

summary(model3 <- lm(time ~ Gender + Age.c + Gender:Age.c, data = race))
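Centering only re-parameterizes the model. A quick simulated check (all numbers made up) that the slope and interaction coefficients are unchanged while the intercept and dummy coefficient now describe the mean age:

```r
set.seed(3)
g  <- factor(rep(c("F", "M"), each = 50))
x  <- runif(100, 20, 60)
y  <- 30 + 2 * (g == "M") + 0.2 * x + rnorm(100)
xc <- x - mean(x)  #centered predictor
raw <- lm(y ~ g * x)
ctr <- lm(y ~ g * xc)
#slope and interaction coefficients are identical across the two fits;
#only the intercept and the dummy coefficient change meaning
stopifnot(all.equal(unname(coef(raw)["x"]), unname(coef(ctr)["xc"])))
stopifnot(all.equal(unname(coef(raw)["gM:x"]), unname(coef(ctr)["gM:xc"])))
```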

(g) Explain how to interpret each of the coefficients, including the intercept, in this model. 

(h) Explain why the coefficient of gender changed between model 2 and model 3 based on the graph.  Did the statistical significance change?

(i) Explain to a consulting client the output you get from

effects::allEffects(model3)

 

Opinion: Do you think this output (from part (i)) would be an effective way to present the model to a non-statistician?

 

3) Data from the 2006 Programme for International Student Assessment (PISA) were obtained for 13,646 children 15-16 years of age from schools in all eight states and territories of Australia.  Let’s consider schools to be the Level 2 units. Here are some of the variables:

·       Standardized reading scores (z_read) – our response variable

·       Female (1 = females, 0 = males)

·       Grand-mean centered parental occupational status (cen_pos); occupational status is treated as a quantitative scale ranging from 16 (low occupational status) to 90 (high occupational status)

·       Grand-mean centered PISA index (cen_escs), a composite index created from items on the student questionnaire, including parental occupation and education as well as possessions within the home (e.g., books, computers, areas for children to study). Basically a measure of “family capital.”

·       Grand-mean centered school size (cen_size), school sizes range from 10 to 2835

Load in the data

ReadingScores = read.table("https://www.rossmanchance.com/stat414/data/ReadingScores.txt", header=T)

head(ReadingScores)

We mentioned that multilevel models were originally conceptualized as “intercepts as outcomes” models, and now we also want to consider “slopes as outcomes.”

(a) Of the variables given above, which are Level 1 and which are Level 2?

(b) Fit a separate OLS line for each school, regressing reading score on cen_pos, storing the intercept and slope for each of the 356 schools:

#run a separate OLS regression for each school

intercepts = numeric(356)   #one intercept per school

slopes = numeric(356)       #one slope per school

for (j in 1:356){

  modelj = lm(z_read ~ cen_pos, data = ReadingScores, subset = schoolid == j)

  intercepts[j] = summary(modelj)$coefficients[1]

  slopes[j] = summary(modelj)$coefficients[2]

}

ReadingScores$yhat_ols <- intercepts[ReadingScores$schoolid] + slopes[ReadingScores$schoolid] * ReadingScores$cen_pos  #index by schoolid so each student gets their own school's fitted value

plot(ReadingScores$cen_pos, ReadingScores$yhat_ols, main="Fixed Coefficient\n Regression Lines", ylim=c(-6, 2), col="lightgrey")

for (j in 1:356){

  abline(a = intercepts[j], b = slopes[j])

}

 

par(mfrow=c(1,2))

hist(intercepts)

hist(slopes)

 

load(url("http://www.rossmanchance.com/iscam3/ISCAM.RData"))

iscamsummary(intercepts)

iscamsummary(slopes)

Which would you say is larger, the variation in the intercepts or in the slopes? How are you deciding?


(c) Suppose we use pos rather than cen_pos.

intercepts = numeric(356)

slopes = numeric(356)

for (j in 1:356){

  modelj = lm(z_read ~ pos, data = ReadingScores, subset = schoolid == j)

  intercepts[j] = summary(modelj)$coefficients[1]

  slopes[j] = summary(modelj)$coefficients[2]

}

ReadingScores$yhat_ols <- intercepts[ReadingScores$schoolid] + slopes[ReadingScores$schoolid] * ReadingScores$pos

plot(ReadingScores$pos, ReadingScores$yhat_ols, main="Fixed Coefficient\n Regression Lines", ylim=c(-6, 2), col="lightgrey", xlim = c(0, 85))

for (j in 1:356){

  abline(a = intercepts[j], b = slopes[j])

}

iscamsummary(intercepts)

iscamsummary(slopes)

 Is there now more or less variation in the intercepts?  Explain why. How about the slopes?

 

(d) Now consider the cen_escs variable (“family capital”). First look at the OLS lines

#run a separate OLS regression for each school

intercepts2 = numeric(356)

slopes2 = numeric(356)

for (j in 1:356){

  modelj = lm(z_read ~ cen_escs, data = ReadingScores, subset = schoolid == j)

  intercepts2[j] = summary(modelj)$coefficients[1]

  slopes2[j] = summary(modelj)$coefficients[2]

}

ReadingScores$yhat_ols <- intercepts2[ReadingScores$schoolid] + slopes2[ReadingScores$schoolid] * ReadingScores$cen_escs

plot(ReadingScores$cen_escs, ReadingScores$yhat_ols, main="Fixed Coefficient\n Regression Lines", ylim=c(-6, 2), col="lightgrey")

for (j in 1:356){

  abline(a = intercepts2[j], b = slopes2[j])

}

iscamsummary(intercepts2)

iscamsummary(slopes2)

Is there more or less variation in the slopes of cen_escs than you found for cen_pos?


(e) Now compare the following models.

model1 = lmer(z_read~cen_pos + (1 | schoolid), data = ReadingScores, REML = F)

model1b = lmer(z_read~cen_pos + (1 + cen_pos | schoolid), data = ReadingScores, REML = F)

model2 =  lmer(z_read~cen_escs + (1  | schoolid), data = ReadingScores, REML = F)

model2b = lmer(z_read~cen_escs + (1 + cen_escs | schoolid), data = ReadingScores, REML = F)

 

texreg::screenreg(list(model1, model1b), digits = 3, single.row = TRUE, stars = 0,

custom.model.names = c("fixed slope cen_pos", " random slope cen_pos"), custom.note = "")

anova(model1, model1b)

 

texreg::screenreg(list(model2, model2b), digits = 3, single.row = TRUE, stars = 0,

custom.model.names = c("fixed slope cen_escs", " random slope cen_escs"), custom.note = "")

anova(model2, model2b)

 

Discuss whether the effect of increasing parental occupation or the effect of family capital on reading scores differs significantly across the schools. Make sure it’s clear what part of the outputs you are using to justify your answers.  Also discuss how these conclusions match/don’t match what you saw earlier in this exercise.

 

(f) Based on model2b, give a “95% interval” for the slopes of cen_escs for this population of schools. Is it plausible for a school to have a negative slope?
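The interval in (f) combines the fixed slope with the estimated SD of the random slopes. A numeric sketch with hypothetical values (NOT the real model2b output):

```r
#hypothetical estimates read off summary() of a random-slopes model (made up)
beta_escs <- 0.40  #fixed-effect slope of cen_escs (hypothetical)
tau_slope <- 0.10  #SD of the school-level random slopes (hypothetical)
interval <- beta_escs + c(-1.96, 1.96) * tau_slope
interval  #roughly 95% of schools' slopes fall in this range
```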

 

(g) Create a graph of the “fitted model” for cen_escs and discuss how it compares to the individual least squares fit lines. In particular, is there evidence of “shrinkage” in the multilevel model compared to fitting separate OLS lines? Explain.

ReadingScores$yhat.mlm <- predict(model2b, ReadingScores)

plot(ReadingScores$cen_escs, ReadingScores$yhat.mlm, main="Random Coefficient\n Regression Lines", ylim=c(-6, 2), col="lightgrey")

abline(h=seq(-6, 2, 2), col="lightgrey", lwd=0.5)

for (j in 1:356){

  lines(ReadingScores$cen_escs[ReadingScores$schoolid == j], ReadingScores$yhat.mlm[ReadingScores$schoolid == j])

}
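For intuition about the shrinkage in (g): in the simpler random-intercept case, each school's estimate is pulled toward the overall line by a reliability factor that depends on the school's sample size. A numeric sketch (variance values below are made up):

```r
#shrinkage/reliability factor: lambda = tau2 / (tau2 + sigma2/n)
tau2   <- 0.3            #between-school variance (hypothetical)
sigma2 <- 0.7            #within-school variance (hypothetical)
n      <- c(5, 20, 100)  #school sample sizes
lambda <- tau2 / (tau2 + sigma2 / n)
round(lambda, 2)  #small schools are shrunk the most (lambda farthest below 1)
```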

 

(h) Regress the OLS intercepts and the OLS slopes on the school_size variable.

school_size = aggregate(ReadingScores$schsize, list(ReadingScores$schoolid), FUN=mean )$x

summary(lm(intercepts2 ~ school_size))

summary(lm(slopes2 ~ school_size))

Interpret the intercept and slope coefficients for each of these models.


(i) Based on these models, does school size explain significant variation in either the intercepts or the slopes of cen_escs?  Summarize what you learn in English.

 

(j) Now add school size to the multilevel model, using the total enrollment at the school (schsize) centered as cen_size. (The I() function tells R to evaluate the arithmetic expression cen_size/100 before fitting the model.)

model3 = lmer(z_read ~ cen_escs + I( cen_size/100) + (1 + cen_escs  | schoolid), data = ReadingScores, REML = F)

summary(model3, corr=F)

texreg::screenreg(list(model2b, model3), digits = 3, single.row = TRUE, stars = 0,

custom.model.names = c("random slope cen_escs", "adding school size"), custom.note = "")