---
title: "Stat 414 - Day 30"
subtitle: "Multilevel Logistic Regression (Ch. 17)"
output:
word_document:
reference_docx: BCStyleGuide.docx
html_notebook: default
editor: visual
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(eval = FALSE)
knitr::opts_knit$set(global.par = TRUE)
#knitr::opts_chunk$set(fig.width=4, fig.height=2.67)
```
```{r, echo=FALSE, message=FALSE}
par(mar=c(4.1, 4.1, 1.1, 1.1))
options(digits = 7, scipen=3)
load(url("http://www.rossmanchance.com/iscam3/ISCAM.RData"))
library(tidyverse)
library(nlme)
library(lme4)
library(performance)
#install.packages("texreg")
#library(texreg)
```
**Logistic Regression**
With a binary response variable, we can predict the probability of success using the logistic "link function" to create a linear relationship with the log-odds. $$ ln (\pi/(1-\pi)) = \beta_0 + \beta_1 x $$ In running the generalized linear model, we also specify the distribution of the errors. For the logistic model, the default is the binomial distribution. For a binomial random variable, recall that $E(Y) = n \pi$ and $Var(Y) = n\pi(1-\pi)$. Note that we have only one parameter here, $\pi$, rather than separate parameters for the mean and variance. Also note that the logistic model works just as well for binomial observations (response = number or proportion of successes) as for Bernoulli observations (response = success or failure).
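As a quick numeric illustration of the link function, R's built-in `qlogis()` and `plogis()` compute the log-odds transform and its inverse:

```{r}
# logit (log-odds) of a probability, and the inverse (back-) transform
p = 0.7
logit_p = qlogis(p)      # same as log(p/(1-p))
plogis(logit_p)          # exp(eta)/(1 + exp(eta)); recovers 0.7
```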
**Example 1:** Data were collected on 5,366 women who recently gave birth in Bangladesh. One question we can ask is whether mother's age ($mage$) predicts whether or not the mother receives prenatal care during pregnancy ($antemed$).
Fit a logistic regression model:
```{r}
bang = read.delim("https://www.rossmanchance.com/stat414/data/Bangladesh.txt", header=TRUE, "\t")
head(bang$antemed) #note, the response variable is in "ungrouped" (Bernoulli) format
model.glm = glm(antemed ~ mage, data = bang, family=binomial)
summary(model.glm)
```
Note, these models can get complicated to fit and you will start to notice they take a while to run. "Fisher scoring" is the iterative algorithm `glm` uses to find the maximum likelihood estimates; the output reports how many iterations were needed to converge.
```{r}
#library(effects)
plot(effects::allEffects(model.glm))
#expanding the x-values to ridiculous numbers to see the "S-shaped" curve
xseq = seq(-200, 300)
preds = exp(.694793 - .007690*xseq)/(1 + exp(.694793 - .007690*xseq))
plot(preds~xseq, type="l", ylab = "predicted probability")
```
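For interpreting the slope, recall that a coefficient on the log-odds scale back-transforms to an odds ratio. A minimal sketch using the slope estimate hard-coded in the plot above (once the model is fit, `exp(coef(model.glm)["mage"])` gives the same thing):

```{r}
# slope on the log-odds scale, as coded in the plot above
b1 = -0.007690
exp(b1)  # odds ratio: multiplicative change in the odds of prenatal care per one-year increase in mother's age
```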
(a) Summarize the "effect" of mom's age on the response.
These observations were taken across 361 communities. Are there substantial community-to-community differences in the likelihood of receiving prenatal care?
```{r}
props = aggregate(bang, by = list(bang$comm), FUN = mean)
hist(props$antemed)
plot(props$antemed ~ as.factor(props$comm), ylab= "proportion prenatal care", xlab = "community ID")
summary(props$antemed); sd(props$antemed)
```
(b) Why did I use the "mean" function in the aggregate command?
(c) What is the average proportion?
(d) Is the association between "whether or not prenatal care" and "community" statistically significant?
```{r}
counts = table(bang$comm, bang$antemed) #two-way table of community by prenatal care
chisq.test(counts)
```
So let's add the community variable to the model:
```{r}
bang$magec = bang$mage - mean(bang$mage) #grand-mean centered age
commF = as.factor(bang$comm)
model.glm2 = glm(antemed ~ magec + commF, data = bang, family = binomial, contrasts = list(commF = contr.sum))
#Are you sure you want to see this?
summary(model.glm2)$coefficients[1:5,1:4]
```
Why did I use "as.factor" with the comm variable?
Does the "effect" of mom's age change much when we add community to the model? What does this tell you?
We do see significant differences among communities, but again, the individual communities aren't my primary interest, and that's a lot of coefficients to print out!
Once again, we have the option of treating the community id as a random effect rather than a fixed effect. Here is the random intercepts model $$ln (\pi_j/(1-\pi_j)) = \beta_0 + u_{0j}$$ where $$u_{0j} \sim N(0, \tau_0^2)$$
(e) Explain what $u_{0j}$ and $\tau_{0}^2$ represent in this context.
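To build intuition for these parameters, here is a small simulation with hypothetical values ($\beta_0 = 0.5$, $\tau_0 = 0.6$; not the fitted estimates) showing how normal variation on the log-odds scale translates into community-to-community variation in probabilities:

```{r}
set.seed(414)
beta0 = 0.5  # hypothetical overall intercept (log-odds scale)
tau0 = 0.6   # hypothetical SD of the community random effects
u0j = rnorm(361, mean = 0, sd = tau0)  # one draw per community
pij = plogis(beta0 + u0j)              # back-transformed community probabilities
summary(pij)
hist(pij, main = "Simulated community probabilities", xlab = "P(prenatal care)")
```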
We will use the "glmer" function (in lme4 package) to fit multilevel logistic regression models.
```{r}
#The random intercepts model
model0 = glmer(antemed~ 1 + (1 | comm), family=binomial, data = bang)
summary(model0)
confint(model0)
#can use fitted.values to see the (back-transformed) predicted probabilities. Note how we assume the same probability for every woman in the same community
head(fitted.values(model0), 20)
```
(f) Interpret the intercept in context.
Notice that we are only given an estimate for $\tau_0$ but not $\sigma$. That's because there is no separate "within community variation" parameter in logistic regression. (We are assuming the same odds for each woman within the same community.) This has led to different suggestions for calculating the intraclass correlation coefficient. I'm partial to $$ICC = \tau_0^2 / (\tau_0^2 + \pi^2/3)$$ where $\pi^2/3$ comes from the variance of the logistic distribution.
(g) Use the suggested formula to calculate an ICC.
Note, this agrees with the performance package.
```{r}
performance::icc(model0)
```
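The same calculation is easy to do by hand from the reported variance component. A small helper, shown here with a hypothetical $\hat{\tau}_0^2 = 1.5$ (plug in the actual variance estimate from `summary(model0)`):

```{r}
# ICC for a multilevel logistic model: between-group variance over
# (between-group variance + pi^2/3, the level-one logistic variance)
icc_logistic = function(tau0sq) tau0sq / (tau0sq + pi^2/3)
icc_logistic(1.5)  # hypothetical tau0^2, for illustration only
```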
(h) Fit the random intercepts model to predict the probability of receiving prenatal care from the mother's age when the child was born (grand mean centered), while allowing for the odds to vary among the communities.
```{r, warning = FALSE}
model1.mlm = glmer(antemed~ 1 + magec + (1 | comm), family=binomial, data = bang)
summary(model1.mlm, corr=FALSE)
```
- Write out the estimated model equation.
But the back-transformed probability functions will vary by community:
```{r}
#library(effects)
plot(effects::allEffects(model1.mlm))
#library(ggeffects)
plot(ggeffects::ggpredict(model1.mlm, terms=c("magec", "comm [sample = 9]"), type = "re"), ci = FALSE)
predprob=fitted(model1.mlm)
ggplot(data = bang, aes(y=predprob, x=magec, group=comm)) +
stat_smooth(method="loess", se=F) +
theme_bw()
```
**Example 2:** The data in shot.attempts.csv represent shot attempts in an imaginary hockey league.
```{r}
library(readr)
hockey = read_csv("https://www.rossmanchance.com/stat414/data/shot-attempts.csv")
head(hockey)
#Create distance and angle variables (from Galamini Jr)
hockey$dist = with(hockey, sqrt((x_coord - 89)^2 + (y_coord^2)))
hockey$angle = with(hockey, atan2(y_coord, 89 - x_coord))
#make binary response variable
hockey$goal = ifelse(hockey$shot_result == "Goal", 1, 0)
```
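A quick sanity check on the derived variables, using a made-up shot location and the same net coordinates (89, 0) assumed in the formulas above:

```{r}
# made-up shot from (x, y) = (54, 20): 35 ft in front of the net, 20 ft to the side
x = 54; y = 20
dist = sqrt((x - 89)^2 + y^2)  # straight-line distance to the net
angle = atan2(y, 89 - x)       # shot angle in radians, relative to the center line
c(dist = dist, angle_degrees = angle * 180/pi)
```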
(a) Fit a random intercepts model to predict the probability of scoring a goal from the distance and angle of the shot, while allowing for the odds to vary among the players.
```{r}
model2.mlm = glmer(goal ~ 1 + angle + dist + (1 | shooter_id), family=binomial, data = hockey)
summary(model2.mlm, corr=FALSE)
confint(model2.mlm)
plot(effects::allEffects(model2.mlm))
plot(ggeffects::ggpredict(model2.mlm, terms=c("dist", "shooter_id [sample = 9]"), type = "re"), ci = FALSE)
```
(b) Include random effects for goalie as well. How many parameters does this add to the model? How do you interpret the parameter(s)? How do the variance components compare? What does this tell you? What assumption are we making in this model that you might want to question?
```{r}
model3.mlm = glmer(goal ~ 1 + angle + dist + (1 | shooter_id) + (1 | goalie_id), family=binomial, data = hockey) #crossed random effects
summary(model3.mlm, corr=FALSE)
confint(model3.mlm)
plot(effects::allEffects(model3.mlm))
plot(ggeffects::ggpredict(model3.mlm, terms=c("dist", "shooter_id [sample = 9]"), type = "re"), ci = FALSE)
performance::icc(model3.mlm)
```
Explore the estimated player and goalie effects:
```{r}
u0 <- as.data.frame(ranef(model3.mlm, condVar = TRUE) )
ggplot(u0[u0$grpvar=="shooter_id",], aes(y = condval, x = grp)) +
geom_point() +
geom_errorbar(aes(ymin = condval - 1.96*condsd,
ymax = condval + 1.96*condsd), width = 0) +
labs(title = "Shooter Intercepts", x = "Shooter", y = "Estimate and 95% CI") +
theme(axis.text.x = element_text(angle = 90)) + theme_bw()
ggplot(u0[u0$grpvar=="goalie_id",], aes(y = condval, x = grp)) +
geom_point() +
geom_errorbar(aes(ymin = condval - 1.96*condsd,
ymax = condval + 1.96*condsd), width = 0) +
labs(title = "Goalie Intercepts", x = "Goalie", y = "Estimate and 95% CI") +
theme(axis.text.x = element_text(angle = 90)) + theme_bw()
```