**Stat 414 – HW 5**

**Due midnight, Friday, Nov. 3**

**Don’t forget to always justify your answers and to include relevant code and output.**

**1) **One of the questions on the 2014 European Social Survey was “All things considered, how satisfied are you with your life as a whole nowadays? Please answer assuming 0 = extremely dissatisfied and 10 = extremely satisfied.” We have data on 29 countries.

load(url("https://www.rossmanchance.com/chance/stat414F22/data/ess5000.Rdata"))  # loads a data frame named d

essdata <- d

*Optional*: View(essdata) to get more information about the variables

(a) Consider the following passage from a different study:

For each school in the study, all teachers at a particular time were selected; in other words, the sample was the full population of teachers for each school. If we restrict our attention to a particular school (as individual school reports did) it may seem that no question of sampling arises. However, it is unlikely that we would get the same observations if we were to hypothetically conceive of repeating the sample study. We recognise uncertainty in the measurement of human characteristics such as those involving teachers, and there is likely to be measurement error.

Statisticians deal with this by assuming we are in fact dealing with a sample from the hypothetically infinite superpopulation of all teachers who might conceivably have been teachers in this school. The idea is that it makes sense to make more abstract generalisations about this school than are governed by the situation and context of just one measurement occasion. Many current studies in education also use data collected on the full population of school children at a point in time, and will make inferences from this population about a superpopulation. Generally, we will assume that certain aspects of our data arise as if the study units were randomly drawn from some hypothetical population, even when we include the full actual population in our study.

Explain how this passage is relevant to this dataset.

(b) The following produces a graph of the life satisfaction distributions for the countries.

ggplot(essdata, aes(x = stflife, group = cntry, color = cntry)) +
  geom_density() + theme_bw() + labs(x = "Life Satisfaction")

What would this graph look like if there were no “within country” variation? What would this graph look like if there were no “between country” variation? Which (within or between country) variation appears larger to you? (Justify your answer.)

(c) Fit a null model (model0) with random intercepts at the country level. Find **and** interpret the ICC value. (Be sure to include your code!)
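As a reminder of what the ICC measures, here is a minimal sketch; the model call in the comment and the variance components used below are illustrative placeholders, not the actual output for this dataset.

```r
# A null (random-intercepts) model could be fit with, e.g.,
#   model0 <- lme4::lmer(stflife ~ 1 + (1 | cntry), data = essdata)
# The ICC is the between-country variance as a share of the total variance:
icc <- function(var_between, var_within) {
  var_between / (var_between + var_within)
}
# With made-up variance components of 1.2 (between) and 6.0 (within):
icc(1.2, 6.0)  # about 0.17, i.e., roughly 17% of the variation is between countries
```

The performance::icc() function computes this directly from a fitted lmer model.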

(d) Suppose we want to know whether education increases life satisfaction (i.e., whether more education appears to correspond, on average, to higher life satisfaction scores). The education variable is on a 0–20 scale. What does the following R command do? Why might this be helpful?

essdata <- essdata %>% mutate(edrange = eduyrs/20)

(e) Summarize what you learn from the following graph. (Do some countries appear to have higher life satisfaction scores than others? Does there appear to be a relationship between education and life satisfaction? Does this relationship appear to vary across countries? Justify your answers.)

ggplot(essdata, aes(x = edrange, y = stflife)) +
  geom_jitter(alpha = .1) +
  scale_x_continuous(breaks = c(0, .5, 1)) +
  scale_y_continuous(limits = c(-.5, 10.5), breaks = c(0, 10)) +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ cntry) +
  theme_bw() +
  labs(x = "Education (rescaled 0-1)", y = "Life Satisfaction")

(f) Create a model (model1) that includes the *edrange* variable (include your code). How much variation at Level 1 is explained? (Show your work/output.) At Level 2? Which is larger? Which did we learn more about: why some people are happier or why some countries are happier?
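The "variation explained" comparisons in this problem follow one pattern, sketched below with made-up variance components (the actual numbers come from comparing VarCorr() output for model0 and model1).

```r
# Sketch of "proportion of variance explained" at a given level: compare the
# null model's variance component to the new model's (numbers here are made up).
prop_explained <- function(var_null, var_model) {
  (var_null - var_model) / var_null
}
# Suppose VarCorr(model0) and VarCorr(model1) gave these components:
prop_explained(6.0, 5.7)  # Level 1 (residual): 0.05, so 5% explained
prop_explained(1.2, 0.9)  # Level 2 (country): 0.25, so 25% explained
```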

(g) Gross national income (GNI) is a measure of the economic value produced or owned by the citizens of a country. The variable *gnipc1000* is this value per capita in thousands of euros. We could try to center this variable, but it’s not clear whether we would use the “every group counts once” overall mean or the pooled group mean (larger countries count more) to do this centering. Instead, “max-min” scale this variable:

essdata <- essdata %>% mutate(gni01 = scales::rescale(gnipc1000))
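For reference, a minimal sketch of what max-min scaling does (this matches the default behavior of scales::rescale, which maps a variable onto [0, 1]):

```r
# Max-min ("0-1") rescaling: map the minimum to 0 and the maximum to 1.
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

rescale01(c(10, 20, 40))  # 0, 1/3, 1
```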

Create a model (model2) that includes (edrange and) the rescaled GNI variable (include your code). Provide a detailed interpretation of the GNI variable slope in this context. Is this variable statistically significant?

(h) When GNI is added to the model, how much variation at Level 1 is explained? At Level 2? Which is larger? Explain why this comparison makes sense.

(i) Now we want to answer: Does having higher education have the same effect as living in a country with higher education? To explore this question, first create the group mean variable:

essdata <- essdata %>% mutate(aveed = ave(edrange, cntry))

Add this variable to the model (model3, include your code). Explain what you learn from the *sign* of this coefficient.

(j) When *average education* is added to the model, how much variation at Level 1 is explained? At Level 2? Which is larger?

(k) Now find the deviation variable using the group means:

essdata$deved <- essdata$edrange - essdata$aveed

Replace the *edrange* variable with this deviation variable (model4). How does the coefficient of *aveed* change? How much additional/less variation is explained by replacing this variable (vs. model3)?

(l) The last model (model4) can be considered a “between-within” model because it separates the between-group variation and the within-group variation. If the coefficient of the *average education* variable in the preceding model (model3) is not significant, that says the effect of education is the same within countries as between countries, i.e., allowing the within and between effects to differ did not improve the model fit.

performance::compare_performance(model2, model4, metrics = c("BIC", "RMSE"), rank = TRUE)

anova(model2, model4)

What do you conclude?

(m) Consider the following three “types” of variables:

· Variables that only explain variation at Level 2

· Variables that explain variation at both Level 1 and Level 2

· Variables that only explain variation at Level 1

For the three variables you considered here (edrange, GNI, the deviation education variable), which type is each?

**2) **The 2006 Programme for International Student Assessment (PISA), organized by the OECD, gathered data on 15- to 16-year-old children from schools across Australia. Variables in the dataset include

· *z_read*, a standardized reading score

· *cen_pos*, parental occupational status, a 64-category ordinal score (the lowest value represents low occupational status), which is centered and which we will treat as quantitative

· *cen_escs*, the centered PISA index of economic, social, and cultural status, a composite index created from items on the student questionnaire, including parental occupation and education as well as possessions within the home (e.g., books, computers, areas for children to study)

· *female*, an indicator variable with 1 = female and 0 = male

ReadingScores = read.table("https://www.rossmanchance.com/stat414/data/ReadingScores.txt", header = T)

head(ReadingScores)

(a) Fit model 1 and interpret the (intercept and slope) coefficients of the model in context.

model1 = lmer(z_read ~ cen_pos + female + (1 | schoolid), data = ReadingScores, REML = F)

summary(model1, corr = F)

(b) Sometimes the intercept interpretation is not meaningful. Does the intercept of this model make contextual sense? Explain.

*The following gives you another way to conceptualize the multilevel models we are running, e.g., “slopes as outcomes” models.*

(c) Fit a separate OLS line for each school, regressing reading score on cen_pos:

#run a separate OLS regression for each school
intercepts = seq(1, length(levels(ReadingScores$schoolid)))
slopes = seq(1, length(levels(ReadingScores$schoolid)))
for (j in 1:356){
  modelj = lm(ReadingScores$z_read ~ ReadingScores$cen_pos, subset = ReadingScores$schoolid == j)
  intercepts[j] = summary(modelj)$coefficients[1]
  slopes[j] = summary(modelj)$coefficients[2]
}

ReadingScores$yhat_ols <- intercepts + (slopes * ReadingScores$cen_pos)
plot(ReadingScores$cen_pos, ReadingScores$yhat_ols, main = "Fixed Coefficient\n Regression Lines", ylim = c(-6, 2), col = "lightgrey")
for (j in 1:356){
  abline(a = intercepts[j], b = slopes[j])
}
#R gives a warning, but I think we are ok!

hist(intercepts)
hist(slopes)

load(url("http://www.rossmanchance.com/iscam3/ISCAM.RData"))
iscamsummary(intercepts)
iscamsummary(slopes)

Comment on the behavior of these lines. How might we measure the variation in the intercepts? In the slopes? Which is larger?
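One simple way to quantify that comparison, sketched here with made-up values in place of the actual per-school estimates:

```r
# Sketch: measure variation in the per-school intercepts and slopes by their
# standard deviations. These vectors are made-up stand-ins for the real
# `intercepts` and `slopes` computed in the loop above.
intercepts_demo <- c(-0.5, 0.1, 0.4, -0.2)
slopes_demo <- c(0.02, 0.03, 0.01, 0.02)

sd(intercepts_demo)  # variation in intercepts
sd(slopes_demo)      # variation in slopes (much smaller in this toy example)
```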

*Optional*: You could also just look at, say, 20 of the schools, e.g.,

first20 = ReadingScores[ReadingScores$schoolid < 21,]

and modify the code accordingly.

(d) Suppose we use *pos* rather than *cen_pos*.

intercepts = seq(1, length(levels(ReadingScores$schoolid)))
slopes = seq(1, length(levels(ReadingScores$schoolid)))
for (j in 1:356){
  modelj = lm(ReadingScores$z_read ~ ReadingScores$pos, subset = ReadingScores$schoolid == j)
  intercepts[j] = summary(modelj)$coefficients[1]
  slopes[j] = summary(modelj)$coefficients[2]
}

ReadingScores$yhat_ols <- intercepts + (slopes * ReadingScores$pos)
plot(ReadingScores$pos, ReadingScores$yhat_ols, main = "Fixed Coefficient\n Regression Lines", ylim = c(-6, 2), col = "lightgrey", xlim = c(0, 85))
for (j in 1:356){
  abline(a = intercepts[j], b = slopes[j])
}

Is there now more or less variation in the intercepts? Explain why.

(e) Explain what you learn from the following graph:

plot(slopes ~ intercepts)

Be sure your comments are in context (e.g., what is meant by intercept, what is meant by slope), e.g., “Schools that have higher …”

(f) Now consider the *cen_escs* variable (“family capital”) and whether it should be included with fixed or random slopes. First look at the OLS lines:

#run a separate OLS regression for each school
intercepts2 = seq(1, length(levels(ReadingScores$schoolid)))
slopes2 = seq(1, length(levels(ReadingScores$schoolid)))
for (j in 1:356){
  modelj = lm(ReadingScores$z_read ~ ReadingScores$cen_escs, subset = ReadingScores$schoolid == j)
  intercepts2[j] = summary(modelj)$coefficients[1]
  slopes2[j] = summary(modelj)$coefficients[2]
}

ReadingScores$yhat_ols <- intercepts2 + (slopes2 * ReadingScores$cen_escs)
plot(ReadingScores$cen_escs, ReadingScores$yhat_ols, main = "Fixed Coefficient\n Regression Lines", ylim = c(-6, 2), col = "lightgrey")
for (j in 1:356){
  abline(a = intercepts2[j], b = slopes2[j])
}

Is there more or less variation in the slopes than you found for *cen_pos*?

(g) Now compare the following models:

model1 = lmer(z_read ~ cen_pos + (1 | schoolid), data = ReadingScores, REML = F)
model1b = lmer(z_read ~ cen_pos + (1 + cen_pos | schoolid), data = ReadingScores, REML = F)
model2 = lmer(z_read ~ cen_escs + (1 | schoolid), data = ReadingScores, REML = F)
model2b = lmer(z_read ~ cen_escs + (1 + cen_escs | schoolid), data = ReadingScores, REML = F)

texreg::screenreg(list(model1, model1b), digits = 3, single.row = TRUE, stars = 0,
  custom.model.names = c("fixed slope cen_pos", "random slope cen_pos"), custom.note = "")
anova(model1, model1b)

texreg::screenreg(list(model2, model2b), digits = 3, single.row = TRUE, stars = 0,
  custom.model.names = c("fixed slope cen_escs", "random slope cen_escs"), custom.note = "")
anova(model2, model2b)

Discuss whether the effect of increasing parental occupation or family capital on reading scores differs significantly across the schools. Make sure it’s clear what part of the output you are using to justify your answers. Also discuss how these conclusions match/don’t match what you saw earlier in this exercise.

(h) Based on model2b, give a “95% interval” for the slopes of *cen_escs* for this population of schools. Is it plausible for a school to have a negative slope?
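The interval in (h) comes from the fixed slope plus or minus about two standard deviations of the random slopes, both read off summary(model2b). A sketch with made-up values (the real fixed slope and random-slope SD come from your output):

```r
# "95% interval" for school-level slopes under a normal model for the random
# slopes: fixed slope +/- 1.96 * SD of the random slopes (values made up).
slope_interval <- function(fixed_slope, slope_sd, z = 1.96) {
  c(lower = fixed_slope - z * slope_sd, upper = fixed_slope + z * slope_sd)
}

slope_interval(0.40, 0.10)  # (0.204, 0.596): a negative slope would be implausible here
```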

(i) Create a graph of the “fitted model” for cen_escs and discuss how it compares to the individual least squares fit lines (other than the number of lines). In particular, is there evidence of “shrinkage” in the multilevel model compared to fitting separate OLS lines? Explain.

ReadingScores$yhat.mlm <- predict(model2b, ReadingScores)
plot(ReadingScores$cen_escs, ReadingScores$yhat.mlm, main = "Random Coefficient\n Regression Lines", ylim = c(-6, 2), col = "lightgrey")
abline(h = seq(-6, 2, 2), col = "lightgrey", lwd = 0.5)
for (j in 1:356){
  lines(ReadingScores$cen_escs[ReadingScores$schoolid == j], ReadingScores$yhat.mlm[ReadingScores$schoolid == j])
}

Alternative:

library(ggeffects)
plot(ggpredict(model2b, terms = c("cen_escs", "schoolid [sample = 9]"), type = "re"), ci = FALSE)

(j) Regress the OLS intercepts and slopes on the *school_size* variable.

school_size = aggregate(ReadingScores$schsize, list(ReadingScores$schoolid), FUN = mean)$x
summary(lm(intercepts2 ~ school_size))
summary(lm(slopes2 ~ school_size))

Interpret the intercept and slope coefficients for each of these models. Does school size explain significant variation in either the intercepts or the slopes of cen_escs? Summarize what you learn in English.

(k) Now add school size to the model (the total enrollment at the school, *schsize*, but centered as *cen_size*):

model3 = lmer(z_read ~ cen_escs + cen_size + (1 + cen_escs | schoolid), data = ReadingScores, REML = F)
summary(model3)

texreg::screenreg(list(model2b, model3), digits = 3, single.row = TRUE, stars = 0,
  custom.model.names = c("random slope cen_escs", "adding school size"), custom.note = "")

· Interpret the coefficient of *cen_size* in context.

· Create a new variable *cen_size100* that divides *cen_size* by 100. Substitute this variable into the model instead (include your code and output) and explain why the interpretation of this coefficient is more meaningful in this context.

· What percentage of the variation in intercepts is explained by school size? What percentage of the variation in cen_escs slopes is explained by school size? Which is larger? Why does that make sense from what you saw in (j)?

· Explain why it doesn’t make sense to consider random slopes for the cen_size variable.