---
title: "Stat 414 - Day 19"
subtitle: "Adding Level 2 Variables (4.5)"
output:
word_document:
reference_docx: BCStyleGuide.docx
html_notebook: default
editor: visual
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(eval = FALSE)
knitr::opts_knit$set(global.par = TRUE)
#knitr::opts_chunk$set(fig.width=4, fig.height=2.67)
```
```{r, echo=FALSE, message=FALSE}
par(mar=c(4.1, 4.1, 1.1, 1.1))
options(digits = 7, scipen=3)
load(url("http://www.rossmanchance.com/iscam3/ISCAM.RData"))
library(tidyverse)
library(nlme)
library(lme4)
library(performance)
```
------------------------------------------------------------------------
**Last Time: Adding Level 1 variables**
- Adding Level 1 variables should explain Level 1 variability (reduce $\sigma\hat$)
- Can calculate the percentage of Level 1 variability explained
- Can calculate $R^2$ values as the amount of total variation explained by the fixed effects
- Can also change Level 2 variability (and ICC)
- Explains Level 2 variability if the association at the group level is 'consistent'
- Can increase the Level 2 variability if the association is not consistent
- Can use performance::r2(model, by_group = TRUE)
- Can refer to $\sigma^2$ as the “conditional” variance and $\tau^2 + \sigma^2$ as the “marginal” variance (across the groups).
------------------------------------------------------------------------
**Example 1: Radon cont.**
Recall the radon dataset.
```{r}
ctydata <- read_csv("https://www.rossmanchance.com/stat414/data/cty.dat") #county-level data, including uranium
srrs2 <- read_csv("https://www.rossmanchance.com/stat414/data/srrs2.dat") #house-level data including stfips and cntyfips
#names need to match
ctydata$cntyfips <- as.numeric(ctydata$ctfips)
#have a few duplicated rows in county data
ctydata$newfips <- as.numeric(ctydata$stfips)*1000 + ctydata$cntyfips
length(unique(ctydata$newfips))
merged <- merge(srrs2, ctydata[!duplicated(ctydata$newfips),], by=c("cntyfips", "stfips"))
#focus on just Minnesota
mergedmn <- merged[which(merged$state == "MN"),]
nrow(mergedmn)
mergedmn$uranium = log(ifelse(mergedmn$Uppm==0, .1, mergedmn$Uppm))
mergedmn$logradon = log(ifelse(mergedmn$activity == 0, .1, mergedmn$activity))
#library(lme4)
model0 <- lmer(logradon ~ 1 + (1 | newfips), data = mergedmn)
summary(model0)
```
(a) Now we want to add the soil uranium level to the model. Is this a level 1 or level 2 variable? How can you tell?
```{r}
head(mergedmn)
head(ranef(model0)$newfips[[1]])
scatter.smooth(ranef(model0)$newfips[[1]]~unique(mergedmn$uranium))
```
(b) Show how to add this variable to the model equations. How many terms will it add to the model?
(c) What do we have to do differently in the lmer command to tell R about this variable?
```{r}
model1 <- lmer(logradon ~ 1 + uranium + (1 | newfips), data = mergedmn)
summary(model1, corr= F) #finally get rid of that "correlation of parameter estimates" output at the bottom.
```
(d) Interpret the coefficient of uranium in this model. Do you consider it statistically significant?
(e) How did the variance components change? Do these changes make sense? Explain.
```{r}
performance::r2(model1, by_group = TRUE)
```
**Example 2:** Recall the Netherlands study on language test performance for Grade 8 students nested within schools.
```{r message=FALSE}
neth = read.table("https://www.rossmanchance.com/stat414F20/data/NetherlandsLanguage.txt", header=TRUE)
head(neth)
model0 = lmer(langPOST ~ 1 + (1 | schoolnr), data = neth, REML = F)
summary(model0, corr = FALSE)
performance::icc(model0)
model1 = lmer(langPOST ~ 1 + IQ_verb + (1 | schoolnr), data = neth, REML = F)
summary(model1, corr = FALSE)
performance::icc(model1)
performance::r2(model1, by_group=TRUE)
```
Another important type of Level 2 variable is aggregating a Level 1 variable, e.g., average school IQ. These are sometimes referred to as "contextual effects."
(a) Is the school mean verbal IQ related to (average) performance score?
```{r}
plot(neth$langPOST ~ neth$sch_iqv)
#this enters the group means for each row in the full dataset
schoolmeans <- neth %>%
group_by(schoolnr) %>%
mutate(meanLang = mean(langPOST), meanIQ = mean(IQ_verb)) %>%
ungroup()
nrow(schoolmeans)
plot(schoolmeans$meanLang ~ schoolmeans$meanIQ)
lm(schoolmeans$meanLang ~ schoolmeans$meanIQ)
```
(b) Adding sch_iqv, what are the Level 1 and Level 2 model equations?
```{r}
model3 = lmer(langPOST ~ 1 + IQ_verb + sch_iqv + (1 | schoolnr), data = neth, REML = F)
summary(model3, corr = FALSE)
anova(model2, model3)
performance::r2(model3, by_group=TRUE)
```
(c) Interpret the slope of the sch_iqv variable in this model in context.
(d) Based on the output you have, is this new variable a significant addition to the model? How are you deciding?

(e) How much Level 1 variability did we explain? How much Level 2?

(f) Which do we understand more, why some schools have higher language scores or why some students have higher language scores? Explain your reasoning.
(g) What if we change to the "deviation" variable, verbal_IQ - sch_iqv? (aka Group Mean Centering)
```{r}
devIQ = neth$IQ_verb - neth$sch_iqv
model4 = lmer(langPOST ~ 1 + devIQ + sch_iqv + (1 | schoolnr), data = neth, REML = F)
summary(model4, corr = FALSE)
anova(model3, model4)
```
Interpret the slope coefficients for this model.
(h) Does the deviation variable explain variation at Level 1 and/or Level 2?
```{r}
model4 = lmer(langPOST ~ 1 + devIQ + sch_iqv + (1 | schoolnr), data = neth, REML = F)
summary(model4, corr = FALSE)
anova(model3, model4)
performance::r2(model4, by_group = TRUE)
model5 = lmer(langPOST ~ 1 + sch_iqv + (1 | schoolnr), data = neth, REML = F)
performance::r2(model5, by_group = TRUE)
model6 = lmer(langPOST ~ 1 + devIQ + (1 | schoolnr), data = neth, REML = F)
performance::r2(model6, by_group = TRUE)
```
_Notes:_
- “Group mean centering” (as opposed to grand mean centering) creates a “within” group variable.
- Some recommend calculating group means before observations with missing values are deleted (or better yet, use imputation)
- When using x and x-bar (rather than deviation and x-bar), the coefficient of x-bar is the difference between the within and between group effect. The significance of the group mean variable is akin to the Hausman specification test in econometrics: Is the difference between the "within group" and "between group" effects statistically significant?