---
title: "Logistic Regresson (brief overview)"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE)
load(url("http://www.rossmanchance.com/iscam3/ISCAM.RData"))
library(tidyverse)
library(lme4)
library(effects)
library(readr)
```
This lesson gives a brief overview of logistic regression. The main distinction is we have a binary response variable rather than a quantitative response variable.
**Example 1:** Between 1972-1974 a survey was taken in the Whickham district of the United Kingdom (Appleton et al., 1996; Simonoff, 2003), including information such as smoking status and age. Twenty years later, a follow-up study was conducted, and it was determined whether the interviewee was still alive. First consider the smokers and non-smokers:
```{r}
mymatrix = matrix(c(443, 139, 502, 230), ncol=2, dimnames = list(c("alive", "died"), c("smokers", "non-smokers")))
mymatrix
```
*(a) What is the response variable? Quantitative or categorical?*
There are several statistics we could use to compare the likelihood of being alive between the smokers and non-smokers, including
* difference in conditional proportions `443/(443+139) - 502/(502+230)`
* relative risk = ratio of conditional proportions `(443/(443+139)) / (502/(502+230))`
* odds ratio
The difference in conditional proportions has some limitations, namely that when the success probability is small, you will be working with small numbers, so it is difficult to look at the difference and judge whether it is large or not. The relative risk helps you see whether one value is large compared to the other value, but it is problematic to use with "case-control studies" (I deliberately find some successes and I find some failures, so I can't turn around and use the data to estimate the probability of success). The odds ratio doesn't have either of these issues, but it is more difficult to interpret.
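As a quick check, all three of these statistics can be computed directly from the counts in the table above:

```{r}
#conditional proportions of being alive
pSmoke = 443/(443+139)   #smokers
pNon   = 502/(502+230)   #non-smokers
pSmoke - pNon            #difference in conditional proportions
pSmoke/pNon              #relative risk
(443/139)/(502/230)      #odds ratio
```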
**Definition:** *Odds* is the ratio of the probability of success to the probability of failure, e.g., how many times more likely is a success outcome than a failure outcome.
Here, the odds of being alive for smokers are `(443/582) / (139/582) = 443/139 = 3.19`. Smokers were about 3.2 times more likely to be alive than dead at the follow-up study.
*(b) Compute the odds of being alive for the non-smokers*
To compare these two values, it is again better to look at the ratio than the difference. You can choose to put the larger number in the numerator. So the odds ratio would be `3.19/2.18 = 1.46`, so we can say the odds of being alive (rather than not) were 1.46 times larger for the smokers than for the non-smokers. We could also say the odds of being alive were `(1.46 - 1)*100 = 46%` higher for the smokers than for the non-smokers.
*(c) Compute the odds ratio of being alive for the non-smokers (numerator) compared to the smokers (denominator). Also report the percentage change (subtract 1 and multiply by 100% and report as a decrease)*
Either interpretation is perfectly fine. However you should be more bothered by these data suggesting that smoking is beneficial for your health! So we want to "adjust" for possible confounding variables.
Consider the following data:
```{r}
mymatrix2 = matrix(c(21, 114, 117-114, 29, 273, 281-273, 39, 209, 230-209, 49, 169, 208-169, 59, 145, 236-145, 69, 35, 165-35, 79, 0,77-0), nrow=3, dimnames = list(c("midage", "alive", "died"), c("under 25", "25-35", "35-45", "45-55", "55-65", "65-75", "over 75")))
mymatrix2
```
Does probability of being alive appear to depend on the age at the first interview? Let's explore:
```{r}
props = prop.table(mymatrix2[2:3,], margin=2)
par(mar = c(5,5,0,1) + 0.1) #the excess white space around the graphs is really starting to annoy me
plot(props[1,]~mymatrix2[1,])
```
*(d) Explain what this is a graph of and what you learn*
Of course when we have a relationship we want to fit a line, but that's not appropriate here (and generally for proportions as the response) for two main reasons:
* We can't extend the line much further without predicting probabilities below 0 or above 1
* The relationship is not linear.
To solve the second issue, we want to transform the data. But remember the transformations we saw before were for "monotonic" relationships or polynomial relationships. With proportions we tend to see more of an "S-shaped" curve where the "response" values approach zero in one direction and approach one in the other direction. So we will use a different kind of transformation.
**Definition:** The *logit* transformation is $\ln(\pi/(1-\pi))$, which is equivalent to the log odds of success.
```{r}
#We have to do something about the zero 'alive' count in the last column. I'm just going to put in 1 alive (out of 77) there for now.
props[,7] = c(1/77, 76/77)
logodds= log(props[1,]/props[2,])
par(mar = c(5,5,0,1) + 0.1)
plot(logodds~mymatrix2[1,])
```
The relationship is now more linear, and I don't have any problem with the response going off to plus or minus infinity.
So the logistic regression model is:
$$expected\ \ln(\pi/(1-\pi)) = \beta_0 + \beta_1 x$$
We can fit a logistic regression model in R using glm (generalized linear model) rather than lm.
```{r}
WhickhamData = read_delim("http://www.rossmanchance.com/stat414/data/WhickhamData.txt", "\t")
WhickhamData
```
Notice this data file is in "count/frequency format" or "grouped" (we already have the counts for each possible explanatory variable combination), so we can think of each row as a binomial random variable where we have the observed number of successes and the sample size for that binomial random variable.
```{r}
#when the data is in this grouped format, tell R the counts for successes and failures
WhickhamData$failures = WhickhamData$interviewed-WhickhamData$alive
#I want to treat age as quantitative in this model
agecats = c("18-24", "25-34", "35-44", "45-54", "55-64", "65-74", "75+")
agevalues = c(21, 29, 39, 49, 59, 69, 79)
ageQ = as.numeric(agevalues[match(WhickhamData$age, agecats)])
model1 = glm(cbind(alive, failures)~ ageQ, family=binomial("logit"), data = WhickhamData)
summary(model1)
par(mar = c(5,5,0,1) + 0.1)
age100 = 0:100
#predicted probability is odds/(1+odds)
predprob = exp(7.38 - .1228*age100)/(1+exp(7.38 - .1228*age100))
plot(predprob~age100, type="l")
points(x=mymatrix2[1,], y=props[1,])
```
So the fitted model is
$$ predicted\ log\ odds\ of\ alive = 7.38 - 0.123 age$$
Clearly age is statistically significant ($z = -17.58$) and with a negative coefficient, which seems to imply the probability of being alive 20 years later is smaller for individuals who were older at the time of the first interview.
*(e) To interpret the intercept, what are the predicted log odds of being alive at age = 0? What are the predicted odds of being alive at age = 0?*
Again, most people don't have good intuition for odds, so you can convert this back to a probability by using the relationship $probability = odds/(1 + odds)$
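For instance, for someone who was 59 at the first interview (one of the midpoints above), the conversion can be done by hand or with R's built-in `plogis`, which computes $e^x/(1+e^x)$:

```{r}
logodds59 = 7.38 - .1228*59        #predicted log odds at age 59
exp(logodds59)/(1+exp(logodds59))  #odds/(1+odds)
plogis(logodds59)                  #same conversion, built in
```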
*(f) What is the predicted probability of someone who was a newborn in Whickham UK at the start of the study being alive 20 years later?*
To interpret the slope, we start off as usual with "a one-unit increase in x..." If you do the algebra, $\frac{odds(x+1)}{odds(x)} = \frac{e^{\beta_0 + \beta_1(x+1)}}{e^{\beta_0 + \beta_1 x}} = e^{\beta_1}$, so a one-unit increase in $x$ corresponds to an $e^{\hat\beta_1}$ predicted *multiplicative* change in the odds.
*(g) Estimate the decrease in the odds of survival with each additional year of age. What about a 10-year age difference?*
Now let's go back to the smoking variable:
```{r}
model2 = glm(cbind(alive, failures)~ smoking.status, family=binomial("logit"), data = WhickhamData)
summary(model2)
```
Smoking.status is a categorical variable; remember that R creates a 0/1 indicator variable in the model.
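You can see this coding for yourself by looking at the design matrix R builds for the fitted model:

```{r}
#first few rows of the design matrix: the smoking.status column is a 0/1 indicator
head(model.matrix(model2))
```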
*(h) Provide an interpretation of the slope coefficient in this model. How does it compare to the odds ratio we computed by hand above?*
You saw above that age is related to the response and it turns out the real question is whether there is an association between smoking status and survival status *after adjusting for age*.
```{r}
model3 = glm(cbind(alive, failures)~ ageQ + smoking.status, family=binomial("logit"), data = WhickhamData)
summary(model3)
```
*(i) What do you learn about the association between smoking.status and probability of being alive, after adjusting for age? Interpret as if to a non-statistician.*
To see why this is happening, we can explore the relationship between age and smoking status.
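One way to explore this (a sketch; it uses the grouped structure of `WhickhamData` described above, one row per age-by-smoking.status combination):

```{r}
#proportion of interviewees in each age group who were smokers vs. non-smokers
prop.table(xtabs(interviewed ~ smoking.status + age, data = WhickhamData), margin = 2)
```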
*(j) Explain what you learn and how this relates to the above analyses.*
**Summary:** Logistic Regression allows us to model the log odds of success for a categorical response variable based on any number of quantitative or categorical predictor variables. In general, if $x_j$ is increased by one unit (all other variables fixed), the odds of success, that is the odds that $Y = 1$, are multiplied by $e^{\hat\beta_j}$. (And the estimated increase in the odds associated with a change of d units is $exp(d \times \hat\beta_j)$.)
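For example, using the age coefficient from `model3` above:

```{r}
#multiplicative change in the odds of being alive for a 1-year (d = 1) and 10-year (d = 10) age difference
exp(coef(model3)["ageQ"])
exp(10*coef(model3)["ageQ"])
```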
With a binary predictor, $e^{\beta}$ is the ratio of the population odds when $x=1$ to the odds when $x=0$; that is, the odds ratio between these two groups.
Note: The conclusions are the same no matter which outcome is labeled as success vs. failure.