**Stat 414 – HW 3**

**Due Friday, midnight, Oct. 20**

Please submit individual files for each problem. There are a few nice packages for displaying multiple models. For example:

install.packages("stargazer")

library(stargazer)

stargazer(model1, model2, type = "text")

**1)** Read this Oct. 1, 2023 article by Nate Silver: https://www.natesilver.net/p/fine-ill-run-a-regression-analysis

(a) The article mentions “true and robust,” and we looked at “robust standard errors” in the course. What is meant by the term “robust” in statistics? Identify another robust procedure you have seen in this course this quarter.

(b) One of the critiques of Nate’s claims was that “unadjusted state comparisons are misleading.” Explain the argument in your own words. Do you agree or disagree? Explain your reasoning.

(c) Nate mentions “this is almost entirely orthogonal to state partisanship.” What is meant by “orthogonal” in this context?

(d) In the regression model, define the “biden” variable. Is this variable quantitative or categorical? What does it mean for the variable to have a negative coefficient in the model? What needs to be true for this coefficient to be meaningful?

(e) What are the observational units (aka cases) in his regression analysis?

(f) Do you agree with his argument to drop Biden from the model?

**2)** Reconsider the salary data from HW 2.

The “within group” slope (regressing salary on number of semesters after adjusting for major) was -2.186, and the “between group” slope (regressing the major mean salary on the major mean number of semesters) was 1.822. But in the model with both semesters and avgsem, the coefficient of avgsem was “awkward” to interpret (the average semesters for the major increased by one, but everyone in the major stayed the same?). That slope ended up being the difference of the two previous slopes: 1.822 – (-2.186) = 3.990. Here is another way we could run the model.
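As a sanity check on that within/between relationship, here is a small simulation. This uses synthetic data, not the course salary file, and all variable names (g, gm, sem, sal) are made up for illustration; it shows that, in a balanced design, the coefficient on the group mean in the combined model equals the between-group slope minus the within-group slope:

```r
# Simulated balanced data (hypothetical, not the HW salary data)
set.seed(414)
g   <- rep(1:10, each = 20)                   # 10 "majors", 20 students each
gm  <- rep(seq(2, 11), each = 20)             # true group means of semesters
sem <- gm + rnorm(200)                        # individual semesters
sal <- 2 * gm - 1 * (sem - gm) + rnorm(200)   # between slope 2, within slope -1
dat <- data.frame(sal, sem, g)
dat$avgsem <- ave(dat$sem, dat$g)             # observed group means

within  <- coef(lm(sal ~ sem + factor(g), dat))["sem"]     # within-group slope
grp     <- aggregate(cbind(sal, sem) ~ g, dat, mean)       # group-level means
between <- coef(lm(sal ~ sem, grp))["sem"]                 # between-group slope
contextual <- coef(lm(sal ~ sem + avgsem, dat))["avgsem"]  # the "awkward" coefficient

# contextual equals between - within (exactly here, since groups are balanced)
c(within = within, between = between, contextual = contextual)
```

With unbalanced group sizes the match is only approximate, which is one reason the course data’s numbers need not line up perfectly.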

(a) First, we will create a “deviation” variable, where we have subtracted the group mean from each observation. This is called “group mean centering,” as opposed to the “grand mean centering” we did before by subtracting the overall mean. Is this a “level 1” or “level 2” variable?

(b) Now include this variable and the group mean variable in the same model:

saldata = read.table("http://www.rossmanchance.com/stat414/data/saldata.txt", sep = "\t", header = TRUE)

#create the group mean variable

saldata$avgsem = ave(saldata$semesters, saldata$major)

summary(model4 <- lm(salary ~ semesters + avgsem, data = saldata))

#create a “deviation” variable

saldata$dev = saldata$semesters - saldata$avgsem

model6 = lm(salary ~ dev + avgsem, data = saldata)

Which coefficient(s) have changed from model 4? How do you now interpret each slope coefficient? (What is going on here?)

**3)** Recall the squid data, where we looked at several different models that allowed the variances to vary. (See the models / recreate the model summaries from class this week, as well as SquidModels.R.)

(a) Look at the first ten observations:

head(Squid, 10)

(b) Install the nlraa package and then look at the variance-covariance matrices for each model for the first 10 observations:

#install.packages("nlraa")

vcmatrix1 = nlraa::var_cov(model1REML); vcmatrix1[1:10, 1:10]

vcmatrix2 = nlraa::var_cov(model2REML); vcmatrix2[1:10, 1:10]

vcmatrix3 = nlraa::var_cov(model3REML); vcmatrix3[1:10, 1:10]

vcmatrix4 = nlraa::var_cov(model4REML); vcmatrix4[1:10, 1:10]

(c) What is true about the diagonal elements for matrix 1? Why? How is the first value related to the residual standard error for model 1?
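For reference, here is a generic illustration of the idea (using a built-in dataset and an ordinary lm, not the squid models): under homoscedasticity the implied error covariance matrix is σ̂² I, so every diagonal element is the same constant and every off-diagonal element is zero.

```r
# Generic homoscedastic lm on the built-in cars data (not the Squid data)
fit <- lm(dist ~ speed, data = cars)
s2  <- sigma(fit)^2            # squared residual standard error
vc  <- diag(s2, nrow(cars))    # implied var-cov matrix of the errors: s2 * I
vc[1:3, 1:3]                   # constant diagonal, zero off-diagonal
```

The variance-structure models in problem 3 relax exactly this constant-diagonal pattern.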

(d) For matrix 2, how is the very first value related to the residual standard error for model 2? Which observation has the largest variance in matrix 2? Why?

(e) Which observation has the largest variance in matrix 3? Why? How is the very first value related to the residual standard error for model 3? How is the second value related to the residual standard error for model 3?

(f) In matrix 4, explain how/why the variances for observations 8-10 differ.