Stat 414 – HW 5

Due midnight, Friday, Nov. 3

 

Don’t forget to always justify your answers and to include relevant code and output.

 

1) One of the questions of the 2014 European Social Survey was “All things considered, how satisfied are you with your life as a whole nowadays? Please answer assuming 0 = extremely dissatisfied and 10 = extremely satisfied.” We have data on 29 countries.

load(url("https://www.rossmanchance.com/chance/stat414F22/data/ess5000.Rdata"))  # loads a data frame named d

essdata <- d

Optional: View(essdata) to get more information about the variables

(a) Consider the following passage from a different study:

For each school in the study, all teachers at a particular time were selected; in other words, the sample was the full population of teachers for each school. If we restrict our attention to a particular school (as individual school reports did) it may seem that no question of sampling arises. However, it is unlikely that we would get the same observations if we were to hypothetically conceive of repeating the same study. We recognise uncertainty in the measurement of human characteristics such as those involving teachers, and there is likely to be measurement error.

Statisticians deal with this by assuming we are in fact dealing with a sample from the hypothetically infinite superpopulation of all teachers who might conceivably have been teachers in this school. The idea is that it makes sense to make more abstract generalisations about this school than are governed by the situation and context of just one measurement occasion. Many current studies in education also use data collected on the full population of school children at a point in time, and will make inferences from this population about a superpopulation. Generally, we will assume that certain aspects of our data arise as if the study units were randomly drawn from some hypothetical population, even when we include the full actual population in our study.

Explain how this passage is relevant to this dataset.

 

(b) The following produces a graph of the life satisfaction distributions for the countries.

ggplot(essdata, aes(x=stflife, group = cntry, color = cntry)) +

geom_density() + theme_bw() + labs(x = "Life Satisfaction")

What would this graph look like if there were no “within country” variation? What would it look like if there were no “between country” variation? Which type of variation (within country or between country) appears larger to you? (Justify your answer.)

 

(c) Fit a null model (model0) with random intercepts at the country level.  Find and interpret the ICC value. (Be sure to include your code!)
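A minimal sketch of one way to do this (assuming the lme4 and performance packages are installed; model and variable names follow the assignment):

```r
library(lme4)

# null model: no predictors, random intercepts for countries (Level 2)
model0 <- lmer(stflife ~ 1 + (1 | cntry), data = essdata)
summary(model0)

# ICC = between-country variance / (between-country + within-country variance)
performance::icc(model0)
```

The ICC here is the proportion of the total variation in life satisfaction that lies between countries.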

 

(d) Suppose we want to know whether education increases life satisfaction (i.e., whether more education appears to correspond, on average, to higher life satisfaction scores). The education variable (eduyrs) is on a 0–20 scale. What does the following R command do? Why might this be helpful?

essdata <- essdata %>%  mutate(edrange = eduyrs/20)

 

(e) Summarize what you learn from the following graph. (Do some countries appear to have higher life satisfaction scores than others? Does there appear to be a relationship between education and life satisfaction? Does this relationship appear to vary across countries? Justify your answers.)

ggplot(essdata, aes(x = eduyrs, y = stflife)) +   # eduyrs so the 0/10/20 axis breaks match

  geom_jitter(alpha = .1) +

  scale_x_continuous(breaks = c(0, 10, 20)) +

  scale_y_continuous(limits = c(-.5, 10.5), breaks = c(0, 10)) +

  geom_smooth(method = "lm", se = FALSE) +

  facet_wrap(~ cntry) +

  theme_bw() +

  labs(x = "Years of Education", y = "Life Satisfaction")

 

(f) Create a model (model1) that includes the edrange variable (include your code). How much variation at Level 1 is explained? (Show your work/output.) At Level 2? Which is larger? Which did we learn more about: why some people are happier, or why some countries are happier?
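A sketch of the kind of variance comparison involved (assumes lme4; the null model is refit here so the chunk stands alone):

```r
library(lme4)

model0 <- lmer(stflife ~ 1 + (1 | cntry), data = essdata)
model1 <- lmer(stflife ~ edrange + (1 | cntry), data = essdata)

# variance components: first entry = country intercept (Level 2), last = residual (Level 1)
vc0 <- as.data.frame(VarCorr(model0))$vcov
vc1 <- as.data.frame(VarCorr(model1))$vcov

# proportion of each variance component explained by adding edrange
(vc0 - vc1) / vc0
```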

 

(g) Gross national income (GNI) is a measure of the economic value produced or owned by the citizens of a country. The variable gnipc1000 is this value per capita, in thousands of euros. We could try to center this variable, but it's not clear whether we would use the “every group counts once” overall mean or the pooled group mean (larger countries count more) to do the centering. Instead, “max-min” scale this variable:

essdata <- essdata %>% mutate(gni01 = scales::rescale(gnipc1000))

Create a model (model2) that includes (edrange and) the rescaled GNI variable (include your code). Provide a detailed interpretation of the GNI variable slope in this context. Is this variable statistically significant?
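For example, the model might take this form (a sketch assuming lme4; gni01 enters as a country-level predictor):

```r
library(lme4)

# individual-level education plus the rescaled country-level GNI
model2 <- lmer(stflife ~ edrange + gni01 + (1 | cntry), data = essdata)
summary(model2)
```

Note that summary() from lme4 reports t values but not p-values; refitting with lmerTest::lmer adds approximate p-values if desired.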

 

(h) When GNI is added to the model, how much variation at Level 1 is explained? At Level 2? Which is larger? Explain why this comparison makes sense.

 

(i) Now we want to answer: Does having higher education have the same effect as living in a country with higher education?  To explore this question, first create the group mean variable:

essdata <- essdata %>% mutate(aveed = ave(edrange, cntry))   # country mean of edrange

Add this variable to the model (model3, include your code).  Explain what you learn from the sign of this coefficient.
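One form the model could take (a sketch assuming lme4):

```r
library(lme4)

# add the country mean of education alongside individual education
model3 <- lmer(stflife ~ edrange + aveed + (1 | cntry), data = essdata)
summary(model3)
```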

 

(j) When average education is added to the model, how much variation at Level 1 is explained? At Level 2? Which is larger?

 

(k) Now find the deviation variable using the group means:

essdata <- essdata %>% mutate(deved = edrange - aveed)

Replace the edrange variable with this deviation variable (model4). How does the coefficient of aveed change? How much more (or less) variation is explained by replacing this variable (vs. model3)?
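A sketch of the between-within model (assumes lme4 and that the deviation variable deved has been created as in (k)):

```r
library(lme4)

# within-country deviation in education; aveed carries the between-country part
model4 <- lmer(stflife ~ deved + aveed + (1 | cntry), data = essdata)
summary(model4)
```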

 

(l) The last model (model4) can be considered a “between-within” model because it separates the between-group variation from the within-group variation. If the coefficient of the average education variable in the preceding model (model3) is not significant, that says the effect of education is the same within countries as between countries, i.e., allowing the within and between effects to differ did not improve the model fit.

performance::compare_performance(model2, model4, metrics = c("BIC","RMSE"), rank = TRUE)

anova(model2, model4)

What do you conclude?

 

(m) Consider the following three “types” of variables:

·       Variables that only explain variation at Level 2

·       Variables that explain variation at both Level 1 and Level 2

·       Variables that only explain variation at Level 1.

For the three variables you considered here (edrange, the GNI variable, and the deviation education variable), which type is each?

 

2) The 2006 Programme for International Student Assessment (PISA), organized by the OECD, gathered data on 15- to 16-year-old children from schools across Australia. Variables in the dataset include:

·       z_read, a standardized reading score

·       cen_pos, parental occupational status: a 64-category ordinal score (the lowest value represents low occupational status), which has been centered and which we will treat as quantitative

·       cen_escs: the centered PISA index of economic, social, and cultural status, a composite index created from items on the student questionnaire, including parental occupation and education as well as possessions within the home (e.g., books, computers, areas for children to study).

·       female, an indicator variable with 1 = female and 0 = male

ReadingScores = read.table("https://www.rossmanchance.com/stat414/data/ReadingScores.txt", header=T)

head(ReadingScores)

 

(a) Fit model 1 and interpret the (intercept and slope) coefficients of the model in context.

model1 = lmer(z_read ~ cen_pos + female + (1 | schoolid), data = ReadingScores, REML = F)

summary(model1, corr = F)

 

(b) Sometimes the intercept interpretation is not meaningful. Does the intercept of this model make contextual sense? Explain.

 

The following gives you another way to conceptualize the multilevel models we are running, e.g., “slopes as outcomes” models.

(c) Fit a separate OLS line for each school, regressing reading score on cen_pos:

#run a separate OLS regression for each school

n_schools <- length(unique(ReadingScores$schoolid))   # 356 schools

intercepts <- numeric(n_schools)

slopes <- numeric(n_schools)

for (j in 1:n_schools){

  modelj <- lm(z_read ~ cen_pos, data = ReadingScores, subset = schoolid == j)

  intercepts[j] <- coef(modelj)[1]

  slopes[j] <- coef(modelj)[2]

}

# each student's fitted value from their own school's OLS line

ReadingScores$yhat_ols <- intercepts[ReadingScores$schoolid] + slopes[ReadingScores$schoolid] * ReadingScores$cen_pos

plot(ReadingScores$cen_pos, ReadingScores$yhat_ols, main="Fixed Coefficient\n Regression Lines", ylim=c(-6, 2), col="lightgrey")

for (j in 1:n_schools){

  abline(a = intercepts[j], b = slopes[j])

}

#R gives a warning, but I think we are ok!

 

hist(intercepts)

hist(slopes)

load(url("http://www.rossmanchance.com/iscam3/ISCAM.RData"))

iscamsummary(intercepts)

iscamsummary(slopes)

 

Comment on the behavior of these lines. How might we measure the variation in the intercepts? In the slopes? Which is larger?

Optional: You could also just look at, say, 20 of the schools, e.g.:

first20 = ReadingScores[ReadingScores$schoolid < 21, ]

and modify the code above.

 

(d) Suppose we use pos rather than cen_pos.

intercepts <- numeric(356)

slopes <- numeric(356)

for (j in 1:356){

  modelj <- lm(z_read ~ pos, data = ReadingScores, subset = schoolid == j)

  intercepts[j] <- coef(modelj)[1]

  slopes[j] <- coef(modelj)[2]

}

ReadingScores$yhat_ols <- intercepts[ReadingScores$schoolid] + slopes[ReadingScores$schoolid] * ReadingScores$pos

plot(ReadingScores$pos, ReadingScores$yhat_ols, main="Fixed Coefficient\n Regression Lines", ylim=c(-6, 2), col="lightgrey", xlim = c(0, 85))

for (j in 1:356){

  abline(a = intercepts[j], b = slopes[j])

}

Is there now more or less variation in the intercepts? Explain why.

 

(e) Explain what you learn from the following graph: plot(slopes ~ intercepts)

Be sure your comments are in context (e.g., what is meant by intercept, what is meant by slope), e.g., Schools that have higher …

(f) Now consider the cen_escs variable (“family capital”) and whether it should be included with fixed or random slopes. First look at the OLS lines:

#run a separate OLS regression for each school

intercepts2 <- numeric(356)

slopes2 <- numeric(356)

for (j in 1:356){

  modelj <- lm(z_read ~ cen_escs, data = ReadingScores, subset = schoolid == j)

  intercepts2[j] <- coef(modelj)[1]

  slopes2[j] <- coef(modelj)[2]

}

ReadingScores$yhat_ols <- intercepts2[ReadingScores$schoolid] + slopes2[ReadingScores$schoolid] * ReadingScores$cen_escs

plot(ReadingScores$cen_escs, ReadingScores$yhat_ols, main="Fixed Coefficient\n Regression Lines", ylim=c(-6, 2), col="lightgrey")

for (j in 1:356){

  abline(a = intercepts2[j], b = slopes2[j])

}

Is there more or less variation in the slopes than you found for cen_pos?

(g) Now compare the following models:

model1 = lmer(z_read ~ cen_pos + (1 | schoolid), data = ReadingScores, REML = F)

model1b = lmer(z_read ~ cen_pos + (1 + cen_pos | schoolid), data = ReadingScores, REML = F)

model2 = lmer(z_read ~ cen_escs + (1 | schoolid), data = ReadingScores, REML = F)

model2b = lmer(z_read ~ cen_escs + (1 + cen_escs | schoolid), data = ReadingScores, REML = F)

 

texreg::screenreg(list(model1, model1b), digits = 3, single.row = TRUE, stars = 0,

custom.model.names = c("fixed slope cen_pos", " random slope cen_pos"), custom.note = "")

anova(model1, model1b)

 

texreg::screenreg(list(model2, model2b), digits = 3, single.row = TRUE, stars = 0,

custom.model.names = c("fixed slope cen_escs", " random slope cen_escs"), custom.note = "")

anova(model2, model2b)

 

Discuss whether the effect of increasing parental occupation or family capital on reading scores differs significantly across the schools. Make sure it’s clear what part of the outputs you are using to justify your answers.  Also discuss how these conclusions match/don’t match what you saw earlier in this exercise.

(h) Based on model2b, give a “95% interval” for the slopes of cen_escs for this population of schools. Is it plausible for a school to have a negative slope?
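A sketch of the calculation (assumes lme4; model2b is refit here so the chunk stands alone; the interval treats the school slopes as roughly normal):

```r
library(lme4)

model2b <- lmer(z_read ~ cen_escs + (1 + cen_escs | schoolid), data = ReadingScores, REML = F)

# average (fixed) slope and the SD of the school-level slope deviations
beta <- fixef(model2b)["cen_escs"]
vc <- as.data.frame(VarCorr(model2b))
slope_sd <- vc$sdcor[vc$var1 == "cen_escs" & is.na(vc$var2)]

# about 95% of schools' slopes should fall in this range
beta + c(-1.96, 1.96) * slope_sd
```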

 

(i) Create a graph of the “fitted model” for cen_escs and discuss how it compares to the individual least squares fit lines (other than the number of lines). In particular, is there evidence of “shrinkage” in the multilevel model compared to fitting separate OLS lines? Explain.

ReadingScores$yhat.mlm <- predict(model2b, ReadingScores)

plot(ReadingScores$cen_escs, ReadingScores$yhat.mlm, main="Random Coefficient\n Regression Lines", ylim=c(-6, 2), col="lightgrey")

abline(h=seq(-6, 2, 2), col="lightgrey", lwd=0.5)

for (j in 1:356){

  school_rows <- ReadingScores$schoolid == j

  ord <- order(ReadingScores$cen_escs[school_rows])   # sort by x so lines() draws a clean line

  lines(ReadingScores$cen_escs[school_rows][ord], ReadingScores$yhat.mlm[school_rows][ord])

}

 

Alternative:

library(ggeffects)

plot(ggpredict(model2b, terms = c("cen_escs", "schoolid [sample = 9]"), type = "re"), ci = FALSE)

 

(j) Regress the OLS intercepts and slopes on the school_size variable.

school_size = aggregate(ReadingScores$schsize, list(ReadingScores$schoolid), FUN=mean )$x

summary(lm(intercepts2 ~ school_size))

summary(lm(slopes2 ~ school_size))

Interpret the intercept and slope coefficients for each of these models.  Does school size explain significant variation in either the intercepts or the slopes of cen_escs?  Summarize what you learn in English.

 

(k) Now add school size (schsize, the total enrollment at the school, centered as cen_size) to the model:

model3 = lmer(z_read ~ cen_escs +  cen_size + (1 + cen_escs  | schoolid), data = ReadingScores, REML = F)

summary(model3)

texreg::screenreg(list(model2b, model3), digits = 3, single.row = TRUE, stars = 0,

custom.model.names = c("random slope cen_escs", "adding school size"), custom.note = "")

 

·       Interpret the coefficient of cen_size in context.

·       Create a new variable cen_size100 that divides cen_size by 100. Substitute this variable into the model instead (include your code, output) and explain why the interpretation of this coefficient is more meaningful in this context.

·       What percentage of the variation in intercepts is explained by school size? What percentage of the variation in cen_escs slopes is explained by the school size? Which is larger? Why does that make sense from what you saw in (j)?

·       Explain why it doesn’t make sense to consider random slopes for the cen_size variable.
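The rescaling asked for in the second bullet might look like this (a sketch; model3b is a hypothetical name for the refit model, which uses cen_size100 in place of cen_size):

```r
library(lme4)

# enrollment in hundreds of students, so the slope is per 100 students
ReadingScores$cen_size100 <- ReadingScores$cen_size / 100

model3b <- lmer(z_read ~ cen_escs + cen_size100 + (1 + cen_escs | schoolid), data = ReadingScores, REML = F)
summary(model3b)
```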