---
title: "Stat 414 - Day 25"
subtitle: "Model Diagnostics: Residuals (Ch. 10)"
output:
word_document:
reference_docx: BCStyleGuide.docx
html_notebook: default
editor: visual
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(eval = FALSE)
knitr::opts_knit$set(global.par = TRUE)
#knitr::opts_chunk$set(fig.width=4, fig.height=2.67)
```
```{r, echo=FALSE, message=FALSE}
par(mar=c(4.1, 4.1, 1.1, 1.1))
options(digits = 7, scipen=3)
load(url("http://www.rossmanchance.com/iscam3/ISCAM.RData"))
library(tidyverse)
library(nlme)
library(lme4)
library(performance)
```
------------------------------------------------------------------------
**Last Time: Multiple random slopes**
- Try to minimize use of random slopes or model gets very complicated very quickly
- Can zero out covariances to simplify model but makes sense in context?
- Centering variables can sometimes help with convergence
------------------------------------------------------------------------
**Example 1: Beach data** Reconsider our beach data
```{r}
rikzdata <- read.table("http://www.rossmanchance.com/stat414/data/RIKZ.txt", header = TRUE)
head(rikzdata)
model2 = lmer(Richness ~ 1 + NAP + (1 | rikzdata$Beach), data = rikzdata)
summary(model2)
plot(model2)
head(ranef(model2)) #this is a data frame of the estimate u_j
```
(a) What do you learn from the residuals plot? How do you interpret these residuals?
(b) What does this model predict for the Richness when NAP = 0.045 for the “average” beach? What does this model predict for the first observation in the first beach? What is the first observation in the first beach?
(c) How do we calculate the residual for this observation?
We will consider three different types of residuals!
- _Conditional residuals_: our usual level 1 residuals, the prediction errors within a particular level 2 group
- These are what R returns with residuals(model)
- Check for normality, equal variance
- Can also plot residuals vs. other variables, use smoothers
- _Level 2 residuals_: our estimated random effects.
- This is what R returns with ranef(model)
- Check for normality but doesn’t always guarantee real effects follow normal distribution, check for outliers
- Useful to plot the Level 2 random effects vs. Level 2 units, other Level 2 variables
- Can also plot squared Level 2 residuals against Level 2 variables to check for heteroscedasticity (similar to "Levene's Test")
- "Random effect residuals" = response – fixed effects – conditional residuals
- _Marginal residuals_: prediction errors from overall model
- In R: response - model.matrix(model) %*% fixef(model)
- Accounts for (confounds) both random effects and random error
- Check for unusual observations
- Can be informative to plot these across the groups (probably differ)
(d) Verify the calculation of the conditional residual, the level 2 residual, and the marginal residual for the first observation in the first beach. Which is largest? Why?
```{r}
(model.matrix(model2) %*% fixef(model2))[1,1]
fitted.values(model2)[1]
residuals(model2)[1]
ranef(model2)[[1]][1,1]
(rikzdata$Richness[1] - model.matrix(model2) %*% fixef(model2))[1,1]
```
**Example 2:** Consider a hypothetical dataset with 10 subjects with 4 temporal-based observations (one every year). Each person has data for age, sex, average number of cigarettes smoked each week, level of nicotine dependence from the Fagerstrom Test of Nicotine Dependence (FTND), ratings of depressive symptoms from the Beck Depression Inventory (BDI), and a count variable for the total number of lifetime major depressive episodes suffered up to that point of data collection.
```{r}
cigstudy= read.delim("http://www.rossmanchance.com/stat414F20/data/CigStudy.txt", "\t", header=TRUE)
head(cigstudy)
plot(cigstudy$Cigs ~ cigstudy$subjectID)
identify(cigstudy$subjectID, cigstudy$Cigs)
```
(a) Fit a model predicting the number of cigarettes smoked each month use based on time and self-reported depression (BDI), including their interaction, and FTND score, with random intercepts for subjects (subjectID):
```{r}
model1 = lmer(Cigs ~ Time + BDI + Time:BDI + FTND + (1 |subjectID), data = cigstudy)
summary(model1, corr=F)
ranef(model1)
```
(b) Predict the average number of cigarettes smoked each week for the first time point for everyone with BDI = 7 and FTND = 6.
(c) Calculate the conditional and marginal residuals for the first person in the dataset.
```{r}
plot(cigstudy$Cigs ~ cigstudy$Time, col = cigstudy$subjectID, ylim = c(0, 30), xlim = c(0, 2))
abline(a = -8.1441 + .6416*7 +1.1425*6 - .3127*(6*0), b = 5.4092-.3127*7, col = "green")
abline(a = -8.1441 + .6416*7 +1.1425*6 - .3127*(6*0) - 1.016, b = 5.4092-.3127*7, col = "blue")
```
(d) Plot the conditional residuals vs. the predicted values. Look for equal variance across the fitted values, outliers. Also check for normality. What do you conclude?
```{r}
plot(residuals(model1) ~ fitted.values(model1))
residuals(model1)
qqnorm(residuals(model1))
#try the identify function? I can't always get it to work
#Copy and paste the commands below into the Console
plot(fitted.values(model1), residuals(model1))
identify(fitted.values(model1), residuals(model1) )
#If identify doesn't work for you, try:
#Tools -> Global Options... -> Appearance -> Zoom: 100%
#Also changed the computer display settings to 100% (make them match)
which(residuals(model1) > 20)
```
(e) Examine a graph of the marginal residuals vs. the marginal fitted values. Do you see evidence of outliers?
```{r}
margfits = model.matrix(model1) %*% fixef(model1)
margresids = cigstudy$Cigs - margfits
plot(margresids~ margfits)
which(margresids > 40)
```
(f) Investigate and discuss the nature of this outlier. Which subject does this outlier belong to? Compare the conditional and marginal residual for this observation. Which is larger? Why?
```{r}
cigstudy
margresids[28,1]
residuals(model1)[28]
plot(cigstudy$Cigs ~ cigstudy$Time, col = cigstudy$subjectID)
abline(a = -8.1441 + .6416*8 +1.1425*9 - .3127*(8*4), b = 5.4092-.3127*8, col = "green")
abline(a = -8.1441 + .6416*8 +1.1425*9 - .3127*(8*4) + 12.54, b = 5.4092-.3127*8, col = "blue")
```
```{r}
library(ggeffects)
cigstudy$subjectIDF <- factor(cigstudy$subjectID)
model1 = lmer(Cigs ~ Time + BDI + Time:BDI + FTND + (1 |subjectIDF), data = cigstudy)
preds <- ggpredict(model1, terms = c("Time", "subjectIDF"), type = "re")
ggplot(preds, aes(x = x, y = predicted, group = group, color = group)) +
geom_smooth(method = "lm", alpha = .5, se = FALSE) +
theme_bw() +
geom_abline(intercept = -8.1441 + .6416*7 +1.1425*6 - .3127*(6*0), slope = 5.4092-.3127*7, col = "black") +
labs(x = "Time", y = "Cigs")
```