---
title: "Stat 414 - Day 20"
subtitle: "Random Slopes (5.1)"
output:
word_document:
reference_docx: BCStyleGuide.docx
html_notebook: default
editor: visual
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(eval = FALSE)
knitr::opts_knit$set(global.par = TRUE)
#knitr::opts_chunk$set(fig.width=4, fig.height=2.67)
```
```{r, echo=FALSE, message=FALSE}
par(mar=c(4.1, 4.1, 1.1, 1.1))
options(digits = 7, scipen=3)
load(url("http://www.rossmanchance.com/iscam3/ISCAM.RData"))
library(tidyverse)
library(nlme)
library(lme4)
library(performance)
#install.packages("texreg")
#library(texreg)
```
------------------------------------------------------------------------
**Last Week: Adding Level 1 and Level 2 variables to random intercept models**
- Level 1 variables may explain association at Level 1 and/or Level 2 (if the distribution of the variable differs across the Level 2 units)
- Can also increase unexplained variation at Level 2
- Some can only explain Level 1 variation (e.g., income percentile, z-score in country)
- Level 2 variables only explain association at Level 2 (e.g., country uranium)
- Aggregating a Level 1 variable to Level 2 gives a nice "contextual" variable
- Does coefficient of group mean variable represent the additional contribution or the group level effect?
- $\hat\beta_1 x_{ij} + \hat\beta_2 \bar x_j$ ($\hat\beta_2$ is the additional contribution at Level 2)
- If not significant, the Level 1 and Level 2 associations are "the same"
- $\hat\beta_1 (x_{ij}-\bar x_j + \hat\beta_2 \bar x_{ij} (\hat\beta_2$ is the effect at Level 2)
- Nicely separates the Level 1 and Level 2 associations
------------------------------------------------------------------------
**Example 1:** Data were collected from nine beaches along the Dutch coast. Five readings were taken for each beach. The response variable was species richness (number of different species), and available variables are NAP (the height of the sampling station relative to the mean tidal level), and Exposure (a composite measure of wave action, length of the surf zone, slope, grain size, and the depth of the anaerobic layer).
```{r}
rikzdata<-read.table("http://www.rossmanchance.com/stat414/data/RIKZ.txt",header=T)
head(rikzdata)
```
(a) What are the Level 1 units in this study? How many are there? Any Level 1 variables?
(b) What are the Level 2 units in this study? How many are there? Any Level 2 variables?
(c) Does Richness vary with NAP? In the expected direction?
```{r}
plot(rikzdata$Richness ~ rikzdata$NAP)
```
Does Richness vary by beach?
```{r}
plot(rikzdata$Richness ~ rikzdata$Beach)
```
(d) Suppose we treated the beaches as a "fixed effect," write out the model equation:
```{r}
plot(predict(lm(Richness ~ factor(Beach), data = rikzdata))~rikzdata$Beach)
```
Let’s consider instead the the “random intercepts” or “variance components” or "unconditional means" model.
$Y_{ij}=\beta_0 + u_j+ \epsilon_{ij}$ for the $i^{th}$ site on $j^{th}$ beach
(e) How will $\hat u_1$ compare to $\hat\beta_1$?
Fit the null model:
```{r}
model0 = lmer(Richness ~ 1 + (1 | rikzdata$Beach), data = rikzdata)
summary(model0, corr=F)
plot(predict(model0)~rikzdata$Beach)
```
(f) According to this model, how much of the variation in Richness is due to the different beaches?
Fit the multilevel model that also allows Richness to vary with NAP, after adjusting for beach:
```{r}
model1 = lmer(Richness ~ NAP + (1 | Beach), data = rikzdata)
summary(model1, corr=F)
#Our predicted models
BeachAsFactor = factor(rikzdata$Beach)
preds = predict(model1, newdata = rikzdata)
ggplot(rikzdata, aes(x = NAP , y = preds , group = Beach, color = BeachAsFactor )) +
geom_smooth(method = "lm", alpha = .5, se = FALSE) +
geom_abline(intercept = 6.686, slope = -2.867) +
geom_point(data = rikzdata, aes(y = Richness, color=BeachAsFactor), alpha = .5) +
theme_bw()
```
(g) Does this model appear to be valid? Do any of the patterns in the residuals vs. fitted plots appear related to any particular beach? What does the pattern suggest doing with the model?
```{r}
plot(model1, col=BeachAsFactor)
plot(residuals(model1) ~ fitted.values(model1), col=rikzdata$Beach)
ggplot(rikzdata, aes(x = NAP, y = Richness)) +
geom_point() +
geom_smooth(method="lm", se=FALSE) +
facet_wrap(~Beach) +
theme_bw() +
geom_abline(intercept = fixef(model1)[[1]], slope=fixef(model1)[[2]])
```
To allow the slopes to vary across the Level 2 units in the model equation, we add a $j$ index to the slope too. Level 1 equation: $$y_{ij} = \beta_{0j} + \beta_{1j} x_{1ij} + \epsilon_{ij}$$ where $\beta_{0j} = \beta_{00} + u_{0j}$ and $u_{0j}\sim N(0, \tau_0^2)$ and ...
(h) Write out the second level 2 equation
(i) Now create the composite equation.
(j) Give the expression for beach j's intercept. Give the expression for beach j's slope.
(k) What assumptions are we making about the distribution of the random slopes?
Fit the random slopes (or "random coefficients") model, allowing the slopes to vary across the beaches:
```{r warnings = FALSE}
model2 = lmer(Richness ~ NAP + (1 + NAP | Beach), data = rikzdata, REML = FALSE)
# you get a warning (not an error) and can ignore it
summary(model2)
#Our predicted models
preds = predict(model2, newdata = rikzdata)
ggplot(rikzdata, aes(x = NAP , y = preds , group = Beach, color = BeachAsFactor )) +
geom_smooth(method = "lm", alpha = .5, se = FALSE) +
geom_abline(intercept = 6.58, slope = -2.83) +
geom_point(data = rikzdata, aes(y = Richness, color=BeachAsFactor), alpha = .5) +
theme_bw()
```
(l) How many parameters does this add to the model? (What if beach was a fixed effect?) What do these new parameter estimate(s) tell you?
(m) Based on this model, what is the equation for Beach 1? What is the equation for Beach 9? (Hint: See output below)
```{r}
fixef(model2)
ranef(model2)
```
(n) Are the differences in the slopes statistically significant? (State hypotheses, df, test statistic, p-value, conclusion in context.)
```{r}
#do something here
```
**Notes:**
- In a random slopes model, be careful with the interpretation of the intercept variance and the intercept-by-slope covariance, they assume $x$ = 0. Another reason why its always good practice to center your explanatory variables so the intercept is meaningful.
- You can use Likelihood-ratio tests to assess whether the random slopes model is "worth pursuing statistically." As you can see, randomly slopes can improve the complexity of the model pretty quickly, so you should have empirical, or better yet theoretical, reasons for doing so.