(b) You have enough information to write out the fitted regression model using indicator coding.
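The indicator coding can also be inspected directly: `model.matrix()` shows the 0/1 column R builds for Smoker. A minimal sketch, assuming `FEVdata` is already loaded with Smoker coded "no"/"yes":

```{r}
#Inspect the indicator (0/1) column R creates for Smoker
#(assumes FEVdata is loaded, with Smoker coded "no"/"yes")
head(model.matrix(~ Smoker, data = FEVdata))
contrasts(as.factor(FEVdata$Smoker)) #"no" is the reference level by default
```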

(c) Fit a linear regression model to predict FEV from the smoker variable and confirm your equation.

```{r}
model1 = lm(FEV ~ Smoker, data = FEVdata)
summary(model1)
```

(d) Is the smoker variable statistically significant? How are you deciding?

Note: The previous test is equivalent to a one-way ANOVA or a two-sample (pooled) $t$-test.

(e) Is the model valid?

```{r, echo=FALSE}
par(mfrow=c(2,2))
plot(model1)
par(mfrow=c(1,1))
```

*Note:* The third graph (Scale-Location) is also used to explore the equal variance assumption. It plots the square root of the (absolute value of the standardized) residuals. As with residuals vs. fitted values, you don't want to see the variation change at different fitted values, and you would like the red smoother to be approximately horizontal. Otherwise it tells you that the average magnitude of the residuals is changing with the fitted values (e.g., like a linear relationship).

Sometimes patterns in the residual plots reveal more than nonlinearity or unequal variance. For example, you can see whether a pattern in the residuals appears to be related to another variable not currently in the model.

(f) What do you learn from a graph of residuals vs. age?

```{r}
plot(model1$residuals ~ FEVdata$Age) #Reminder: R is very picky about punctuation
```
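As a quick check of the equivalence noted above, the pooled two-sample $t$-test and the one-way ANOVA reproduce the $p$-value from the slope test in `model1`. A sketch, assuming `FEVdata` and `model1` are in the workspace:

```{r}
#Same p-value three ways: pooled t-test, one-way ANOVA, regression slope test
t.test(FEV ~ Smoker, data = FEVdata, var.equal = TRUE)
anova(model1)
```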

(g) Add Age to the model. Has the coefficient of Smoker changed? Why? Is Smoker still significant? How do we now interpret the coefficient of Smoker?

```{r}
model2 = lm(FEV ~ Smoker + Age, data = FEVdata)
summary(model2)
```

(h) What does this model look like?

Use R to overlay the model on the raw data:

```{r, echo=FALSE}
newx1 = data.frame(Age = FEVdata$Age, Smoker = rep("yes", 654))
newx2 = data.frame(Age = FEVdata$Age, Smoker = rep("no", 654))
fits1 = predict(model2, newx1)
fits2 = predict(model2, newx2)
#may need to copy these into Session window/run them as a set
plot(FEVdata$FEV ~ FEVdata$Age, col=as.factor(FEVdata$Smoker))
lines(FEVdata$Age, fits1, col=1)
lines(FEVdata$Age, fits2, col=2)

#or, with tidyverse
ggplot(FEVdata, aes(x=Age, y=FEV, color=Smoker)) +
  geom_point() +
  geom_line(aes(y=predict(model2))) +
  theme_bw()

#but might see more with the residual plots
par(mfrow=c(2,2))
plot(model2, col=as.factor(FEVdata$Smoker))
par(mfrow=c(1,1))
```

Including the binary variable allows the intercepts to differ, but we are still assuming the slopes are the same. Produce a graph to decide whether there is evidence that the relationship between FEV and age differs for the smokers and nonsmokers.

```{r}
coplot(FEV ~ Age | Smoker, data = FEVdata,
       panel = function(x, y, ...) {
         panel.smooth(x, y)
         abline(lm(y ~ x), col="blue")
       })

#or
FEVdata %>% ggplot(aes(x = Age, y = FEV)) +
  geom_point() +
  facet_wrap(~Smoker) +
  geom_smooth(method = "lm", se=FALSE, color="black", formula = y ~ x) +
  labs(title = "FEV vs. Age for non-smokers and smokers") +
  theme_bw()
```

(i) What do you learn?

**Definition**: A quantitative variable and a categorical variable *interact* if the slopes of the regression lines differ. (After all, it's the slope that tells us about the association between the two variables, so this says the association between the response and the quantitative variable depends on the category of the categorical variable.) To include an interaction between $x_1$ and $x_2$ in the model, we literally multiply $x_1$ and $x_2$ together and add this variable to the model.

(j) What does it mean to multiply Smoker and Age (one categorical and one quantitative variable) together?
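One way to see the multiplication concretely: build the 0/1 indicator yourself and multiply it by Age. A sketch, assuming Smoker is coded "no"/"yes"; the fit should match the `Smoker:Age` model that follows:

```{r}
#Hand-built interaction column: (0/1 indicator for smoker) * Age
#(works on a copy so FEVdata itself is unchanged)
tmp = FEVdata
tmp$SmokerYes = as.numeric(tmp$Smoker == "yes")
summary(lm(FEV ~ SmokerYes + Age + I(SmokerYes * Age), data = tmp))
```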

Add the interaction to the model.

```{r}
#You can make R do the multiplication for you by including Smoker:Age, with a colon to signify an interaction
model3 = lm(FEV ~ Smoker + Age + Smoker:Age, data = FEVdata)
#or
model3 = lm(FEV ~ Smoker*Age, data = FEVdata)
#notice the possible short-cut here: the * means include all 3 terms
summary(model3)
```

(k) Write out the full equation, and then write out the equation (FEV vs. age) for the smokers and for the non-smokers.

(l) What does the sign of the interaction term tell you? (Note: Another way to interpret this interaction - what is the slope of age in the full equation?)

```{r}
ggplot(FEVdata, aes(x=Age, y=FEV, color=Smoker)) +
  geom_point() +
  geom_line(aes(y=predict(model3))) +
  labs(title = "FEV vs. Age with interaction") +
  theme_bw()
```

(m) Is this model valid?

```{r, echo=FALSE}
par(mfrow=c(2,2))
plot(model3)
```

(n) What about multicollinearity?

```{r}
#install.packages("car")
car::vif(model3)
par(mfrow=c(1,1))
```

(o) Why are VIF values more informative than looking at pairwise correlation coefficients among the explanatory variables?
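Hint for part (o): a pairwise correlation only compares two predictors at a time, while VIF is based on regressing each predictor on *all* of the others, $\mathrm{VIF} = 1/(1 - R^2)$. A sketch computing the VIF for Age in `model3` by hand (assumes `FEVdata` and `model3` are in the workspace):

```{r}
#VIF by hand: regress Age on the other columns of the model matrix
SmokerYes = as.numeric(FEVdata$Smoker == "yes")
r2 = summary(lm(FEVdata$Age ~ SmokerYes + I(SmokerYes * FEVdata$Age)))$r.squared
1/(1 - r2) #should match the Age entry from car::vif(model3)
```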

Because an interaction is a "product," centering the quantitative variable might help with the multicollinearity.

```{r}
#Manually centering the age variable
Age.c = FEVdata$Age - mean(FEVdata$Age)
model4 = lm(FEV ~ Smoker*Age.c, data = FEVdata)
summary(model4)
car::vif(model4)
```

(p) Did we improve the multicollinearity?

(q) How do we interpret the coefficient of smoker in this model?

**Notes:**

- Indicator variables change intercepts; interaction terms change slopes.
- It is _always_ best to use graphs to help illustrate an interaction.
- Centering to remove multicollinearity doesn't work on all pairs of variables, just "products" like quadratic and interaction terms.
- When centering with an interaction, the interpretation of the "main effect" is about the change in the response when the other variable is at its mean (to "zero out" the interaction).

In this case, the *adjusted association* is much different from the *unadjusted association.* You can get a graph of the adjusted association, for example an *added variable plot*.

```{r}
car::avPlots(model2)
```