(b) How does the residual standard error compare to the standard deviation of Heart? Examine the ANOVA table.

```{r}
# Note: aov() and anova() do different things
anova(model1)
```

(c) Confirm the calculation of the residual standard error from this table.
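As a sanity check on (c), a minimal sketch with simulated data (PaceData is not assumed to be loaded here): the residual standard error reported by `summary()` is the square root of the residual mean square (MSE) in the ANOVA table.

```{r}
# Hypothetical stand-in data for illustration only
set.seed(1)
x <- rnorm(36)
y <- 2 + 0.5 * x + rnorm(36)
fit <- lm(y ~ x)

# Residual standard error = sqrt(MSE) from the ANOVA table
mse <- anova(fit)["Residuals", "Mean Sq"]
sqrt(mse)           # matches summary(fit)$sigma
summary(fit)$sigma
```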

(d) How is the $F$ value for Walk calculated? What are the degrees of freedom for this $F$-value? Why? What does this $F$-value (and p-value) tell you? (What is the null hypothesis?)
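To make the arithmetic in (d) concrete, here is a hedged sketch on simulated data (not PaceData): the $F$ value is the mean square for the predictor divided by the residual mean square, on $1$ and $n-2$ degrees of freedom.

```{r}
set.seed(2)
x <- rnorm(36)                  # n = 36 simulated observations
y <- 2 + 0.5 * x + rnorm(36)
tab <- anova(lm(y ~ x))

# F = MS(predictor) / MS(residual), with df = 1 and n - 2 = 34
F_by_hand <- tab["x", "Mean Sq"] / tab["Residuals", "Mean Sq"]
F_by_hand
```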

(e) How does the $F$ value for Walk compare to the $t$-statistic for Walk in the first table? How do the p-values compare? What null hypothesis does this p-value test?
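A quick demonstration of the relationship in (e), again on simulated data: with a single predictor, the ANOVA $F$ equals the squared $t$-statistic from the regression table, and the two p-values are identical.

```{r}
set.seed(3)
x <- rnorm(36)
y <- 2 + 0.5 * x + rnorm(36)
fit <- lm(y ~ x)

t_val <- summary(fit)$coefficients["x", "t value"]
F_val <- anova(fit)["x", "F value"]
p_t   <- summary(fit)$coefficients["x", "Pr(>|t|)"]
p_F   <- anova(fit)["x", "Pr(>F)"]
c(t_squared = t_val^2, F = F_val)   # the two agree
```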

(f) In the Regression Table output, what is the standard error of the slope coefficient for Walk? How do we interpret this value?

Demo: [Regression applet](https://www.rossmanchance.com/applets/2021/regshuffle/regshuffle.htm) | [PaceData](https://www.rossmanchance.com/stat414/data/Pace.txt)
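For reference, the slope's standard error can be computed by hand as $s / \sqrt{\sum (x_i - \bar{x})^2}$, where $s$ is the residual standard error. A sketch on simulated data (not PaceData):

```{r}
set.seed(4)
x <- rnorm(36)
y <- 2 + 0.5 * x + rnorm(36)
fit <- lm(y ~ x)

s <- summary(fit)$sigma                       # residual standard error
se_by_hand  <- s / sqrt(sum((x - mean(x))^2)) # SE of the slope
se_reported <- summary(fit)$coefficients["x", "Std. Error"]
c(by_hand = se_by_hand, reported = se_reported)
```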

(g) What if we had centered the Walk variable first?

```{r}
# Centering = subtract the variable's mean from each value
model2 <- lm(Heart ~ I(Walk - mean(Walk)), data = PaceData)
anova(model2)
```
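What centering does can be verified directly on simulated data (a hedged stand-in for PaceData): the slope is unchanged, while the intercept becomes the mean of the response.

```{r}
set.seed(5)
x <- rnorm(36)
y <- 2 + 0.5 * x + rnorm(36)

raw <- lm(y ~ x)
cen <- lm(y ~ I(x - mean(x)))

slope_raw <- unname(coef(raw)[2])
slope_cen <- unname(coef(cen)[2])  # identical to slope_raw
int_cen   <- unname(coef(cen)[1])  # equals mean(y)
```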

(h) What if we had standardized the Walk variable first?

```{r}
# scale() lets you center and/or standardize (also divide by the SD; think z-score)
zWalk <- scale(PaceData$Walk, mean(PaceData$Walk), sd(PaceData$Walk))  # I used a capital W!
model3 <- lm(Heart ~ zWalk, data = PaceData)
anova(model3)
plot(PaceData$Heart ~ zWalk)
abline(model3)
```
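One useful fact about a standardized predictor, sketched on simulated data (not PaceData): since the z-scored variable has SD 1, the fitted slope equals $r \cdot s_y$, the correlation times the SD of the response.

```{r}
set.seed(6)
x <- rnorm(36)
y <- 2 + 0.5 * x + rnorm(36)

z <- as.numeric(scale(x))        # z-scores of x
fit <- lm(y ~ z)
slope_z <- unname(coef(fit)[2])  # equals cor(x, y) * sd(y)
```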

For each region in the United States, 3 large cities, 3 medium-size cities, and 3 smaller cities were selected. Let's see whether the regional differences are statistically significant.

```{r}
boxplot(Heart ~ Region, data = PaceData)
summary(model4a <- aov(Heart ~ Region, data = PaceData))
# ICC = intraclass correlation coefficient
# (MSG - MSW)/(MSG + (k-1)*MSW), where k = (common) group size (here k = 9)
(63.509 - 23.785) / (63.509 + 8 * 23.785)
library(multilevel)  # needed for ICC1()
ICC1(model4a)
```

The above ANOVA is equivalent to a "linear model" ...

```{r}
model4 <- lm(Heart ~ Region, data = PaceData)
anova(model4)
```

(i) Why is this a linear model? But what should we worry about?
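The ICC formula above can be computed by hand from any balanced one-way ANOVA table. A sketch with simulated data (4 hypothetical regions of 9 cities each, standing in for PaceData):

```{r}
set.seed(7)
g <- gl(4, 9)  # 4 groups of common size k = 9
y <- rnorm(36, mean = rep(c(20, 24, 26, 30), each = 9), sd = 5)

tab <- anova(lm(y ~ g))
msg <- tab["g", "Mean Sq"]          # between-group mean square
msw <- tab["Residuals", "Mean Sq"]  # within-group mean square
k <- 9
icc <- (msg - msw) / (msg + (k - 1) * msw)
icc
```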

Now fit the multiple regression model.

```{r}
# Note the shortcut: assigning the model inside anova()
anova(model5 <- lm(Heart ~ Walk + Region, data = PaceData))
```

(j) What do you conclude?
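A caution worth demonstrating here (simulated correlated predictors, not PaceData): `anova()` on an `lm` uses sequential (Type I) sums of squares, so each predictor's SS and $F$ depend on the order in which terms enter the formula.

```{r}
set.seed(8)
x1 <- rnorm(40)
x2 <- x1 + rnorm(40)      # deliberately correlated with x1
y  <- x1 + x2 + rnorm(40)

ss_x1_first <- anova(lm(y ~ x1 + x2))["x1", "Sum Sq"]
ss_x1_last  <- anova(lm(y ~ x2 + x1))["x1", "Sum Sq"]
# The two SS for x1 differ: the second is adjusted for x2
c(first = ss_x1_first, last = ss_x1_last)
```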

(k) Does the coefficient of Walk change between the two models?

```{r}
model1
model5
#summary(model1)
#summary(model5)
```

What does that tell you?

(l) What do you learn from the following model?

```{r}
model6 <- lm(Heart ~ Walk + Region + Talk + Bank + Watch, data = PaceData)
summary(model6)
```

Is Talk a statistically significant predictor of heart disease?

The most common method to check for linear associations among the explanatory variables is
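For context, a pairwise correlation matrix gives a first look at linear associations among explanatory variables. A minimal sketch with simulated predictors (hypothetical `x1`, `x2`, `x3`, not the Pace variables):

```{r}
set.seed(9)
x1 <- rnorm(40)
x2 <- 0.8 * x1 + rnorm(40, sd = 0.5)  # built to be correlated with x1
x3 <- rnorm(40)

round(cor(cbind(x1, x2, x3)), 2)  # large off-diagonal entries flag collinearity
```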

Not sure why $year$ and $year^2$ are no longer collinear?

```{r}
plot(centeredyear ~ I(centeredyear^2))
```

All we have done is move the curve in the quadratic relationship to the middle of our x-space. This is fine, even beneficial: strongly related x-variables only cause problems when they "line up."

(o) How do we interpret the intercept of this model with both variables centered?

#### Notes

- The above focuses on "statistical significance," but it is also important to look at "standardized effect sizes" to help assess "practical significance." (They are unitless, so they don't depend on the scaling of the variables.) See the Quiz for some calculations.
- Centering a variable (by subtracting the mean from each value) can help with
  - making the intercept more interpretable (when x is at the mean rather than when x is at zero, which may not be a value in the dataset)
  - making comparisons of slopes more meaningful (a one-SD change)
  - reducing multicollinearity in "product" terms

_For next time_: Review interpreting coefficients with categorical variables.