Stat 414 – HW 2

Due by midnight, Friday, Oct. 4

 

Please submit each problem as a separate file in Canvas (doc or pdf format). You will need to install packages, including the tidyverse, for this assignment.
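For example, one possible setup:

install.packages("tidyverse") #only needed once

library(tidyverse) #for ggplot() in problems 2 and 3

library(nlme) #for gls() and the variance functions in problem 1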

 

1) Recall the squid data, where we looked at several different models that allowed the variances to vary.

Model 1: gls(Testisweight ~ DML)
Residual standard error: 3.3523

Model 2: gls(Testisweight ~ DML, weights = varFixed(~DML))
Residual standard error: 0.1935

Model 3: gls(Testisweight ~ DML, weights = varIdent(form = ~ 1 | MONTH))
Residual standard error: 1.555

Model 4: gls(Testisweight ~ DML, weights = varPower(form = ~ DML | MONTH))
Residual standard error: 0.0003

(a) Look at the first 10 observations (include your output).

head(Squid, 10)

(b) Install the nlraa package and then look at the variance-covariance matrices for each model for the first 5 observations (include your output):

install.packages("nlraa") 

vcmatrix1 = nlraa::var_cov(model1REML); vcmatrix1[1:5, 1:5]

(c) What is true about the diagonal elements for matrix 1? Why? How is the first value related to the residual standard error for model 1 (model1REML)? Note: You will probably want the residual standard error for each model. One shortcut for that is sigma(model).
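For example (assuming the four fitted models are stored as model1REML through model4REML):

sigma(model1REML) #should match the 3.3523 reported above

and similarly for the other three models.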

 

vcmatrix2 = nlraa::var_cov(model2REML); vcmatrix2[1:5, 1:5]

(d) For vcmatrix2, how is the very first value related to the residual standard error for model 2? Which observation has the largest variance in the first 5 rows of vcmatrix2? Why?
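A numerical check of your answer (a sketch; varFixed(~DML) sets Var(ε_i) = σ² × DML_i):

sigma(model2REML)^2 * Squid$DML[1] #should reproduce vcmatrix2[1, 1]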

 

summary(model3REML)  #include the variance function output

vcmatrix3 = nlraa::var_cov(model3REML); vcmatrix3[1:5, 1:5]

(e) For vcmatrix3, how is the very first value related to the residual standard error for model 3? How is the second value related to the residual standard error for model 3? Which observation has the largest variance in the first 5 rows of vcmatrix3? Why?
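To see the estimated per-month multipliers (a sketch; with varIdent, the standard deviation for month m is σ × δ_m, and the reference month has δ = 1):

deltas <- coef(model3REML$modelStruct$varStruct, unconstrained = FALSE, allCoef = TRUE) #month multipliers

(sigma(model3REML) * deltas)^2 #per-month variances, as seen on the diagonal of vcmatrix3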

 

summary(model4REML) #include the variance function output

vcmatrix4 = nlraa::var_cov(model4REML); vcmatrix4[1:10, 1:10]

(f) In matrix 4, verify the calculation of the value for squid #8 (show the numbers substituted into a formula). Explain how/why the variances for observations 8-10 differ. (Which is largest and why?)
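A sketch of this check, assuming the named vector of power estimates uses the month codes as names (with varPower by MONTH, Var(ε_i) = σ² × |DML_i|^(2δ_m), where δ_m is the estimated power for observation i's month):

deltas <- coef(model4REML$modelStruct$varStruct) #one estimated power per month

delta8 <- deltas[as.character(Squid$MONTH[8])] #power for squid #8's month

sigma(model4REML)^2 * abs(Squid$DML[8])^(2 * delta8) #should match vcmatrix4[8, 8]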

 

Let’s add month to the model.

(g) Explain why this command is necessary

Squid$fMONTH = factor(Squid$MONTH)

 

(h) Add fMONTH to the model:

model5REML <- gls(Testisweight ~ DML + fMONTH, data = Squid, method = "REML")

summary(model5REML)

How many parameters does this add to the model? According to the model, which month tends to have the highest testis weights? Does this agree with what we saw in the raw data in class? Explain how you are deciding.

 

(i) Carry out a likelihood ratio test comparing model5REML to model1REML. Include your output and summarize your conclusions in context.
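One possible approach (models that differ in their fixed effects should be compared with maximum likelihood rather than REML fits, so refit both with method = "ML" first):

model1ML <- gls(Testisweight ~ DML, data = Squid, method = "ML")

model5ML <- gls(Testisweight ~ DML + fMONTH, data = Squid, method = "ML")

anova(model1ML, model5ML) #likelihood ratio test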

 

(j) Does adding fMONTH to the model as a predictor improve the homogeneity assumption?

plot(model5REML)

 

2) A risk factor for cardiovascular disease is the “topography of adipose tissue” (AT). However, the only way to reliably measure deep abdominal AT is “computed tomography,” which is costly, requires irradiation of the subjects, and is not available to many physicians. Després et al. (1991) wanted to explore whether more easily available measurements (e.g., waist circumference, cm) could accurately estimate “deep” AT area (cm²). Their subjects were men from the Quebec City metropolitan area (recruited by “solicitation through the media”) between the ages of 18 and 42 years who were free from metabolic disease (e.g., diabetes, CHD) that would require treatment.

deepfat <- read.table("https://rossmanchance.com/stat414/data/deepfat.txt", header=TRUE)

head(deepfat)

scatter.smooth(deepfat$AT ~ deepfat$waist)

Fit the least squares regression model and evaluate the residuals.

summary(model1 <- lm(AT ~ waist, data = deepfat))


plot(model1)

(a) Include and summarize what you learn from the Scale-Location graph. (Be specific as to the nature of the relationship.)

 

With transformations, we don’t always know whether we should transform Y or transform X. Let’s first try a “variance-stabilizing transformation” on the response like log or square root. 

deepfat$sqrtAT <- sqrt(deepfat$AT)

summary(model2 <- lm(sqrtAT ~ waist, data = deepfat))

plot(model2)

(b) Does this help the unequal variance assumption? Justify your answers, including appropriate output.

 

(c) Did this transformation cause any other problems? Why?  Is transforming the X variable likely to be helpful here?

 

An alternative to a variable transformation, when the only issue is heterogeneity, is weighted least squares.  For these data, let’s model the variability as a function of the explanatory variable (and use the standardized residuals in the residual plot).

summary(model3 <- lm(AT ~ waist, weights = 1/waist, data = deepfat))

plot(rstandard(model3) ~ fitted.values(model3))

(d) How does the slope change between model 1 and model 3? Did the residual standard error change between model 1 and model 3?  Explain why these changes make sense based on the original scatterplot.

(e) Create prediction intervals and confidence intervals for model 1:
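A sketch (the 100 cm waist is just an illustrative value):

newwaist <- data.frame(waist = 100) #hypothetical waist circumference

predict(model1, newdata = newwaist, interval = "confidence")

predict(model1, newdata = newwaist, interval = "prediction")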

Write a one-sentence interpretation for each interval.

(f) How does the prediction interval compare to ___? (Show your work)
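A possible check, assuming the intended comparison is the rough interval ŷ ± 2s (an assumption; s denotes the residual standard error):

s <- sigma(model1) #residual standard error

yhat <- predict(model1, newdata = data.frame(waist = 100)) #same illustrative waist as in (e)

c(yhat - 2*s, yhat + 2*s) #compare to the prediction interval from (e)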

 

(g) Try this code:

predictions <- predict(model1, newdata = data.frame(waist = deepfat$waist), interval = "prediction")

alldata <- cbind(deepfat, predictions)

ggplot(alldata, aes(x = waist, y = AT)) + #define x and y axis variables

  geom_point() + #add scatterplot points

  stat_smooth(method = lm) + #confidence bands

  geom_line(aes(y = lwr), col = "coral2", linetype = "dashed") + #lwr pred interval

  geom_line(aes(y = upr), col = "coral2", linetype = "dashed") + #upr pred interval

  theme_bw()

Where are the confidence bands widest? Why?

 

(h) Repeat (g) for the weighted regression.

predictions <- predict(model3, newdata = data.frame(waist = deepfat$waist), interval = "prediction", weights = 1/deepfat$waist)

alldata <- cbind(deepfat, predictions)

ggplot(alldata, aes(x = waist, y = AT)) + #define x and y axis variables

  geom_point() + #add scatterplot points

  stat_smooth(method = lm, aes(weight = 1/waist)) + #confidence bands, weighted to match model 3

  geom_line(aes(y = lwr), col = "coral2", linetype = "dashed") + #lwr pred interval

  geom_line(aes(y = upr), col = "coral2", linetype = "dashed") + #upr pred interval

  theme_bw()

Where are the prediction bands widest? Why?

 

(i) If you are worried about outliers, R has a built-in function for fitting Tukey’s resistant line:

              line(deepfat$waist, deepfat$AT, iter=10)

              plot(deepfat$AT ~ deepfat$waist)

              abline(model1)

              abline(line(deepfat$waist, deepfat$AT, iter=10), col="red")

How do the slope and intercept change?  Provide a possible explanation as to why they have changed this way, as if to a consulting client. (Find and) Include a simple explanation of how this line is calculated. Yes, I want you to do some investigation here. Cite your source(s).

 

3) Data on unprovoked shark attacks in Florida were collected from 1946 to 2011. The response variable is the number of unprovoked shark attacks in the state each year, and the potential predictors are the resident population of the state and the year.

sharks <- read.table("https://rossmanchance.com/stat414/data/sharks.txt", header=TRUE)

head(sharks)

plot(sharks) #nice way to see all the bivariate associations at once

(a) Examine the graph of the number of attacks vs. year. Comment on whether you think the linearity and equal variance assumptions are met.

 

Create year groups:

sharks$Year_coded <- factor(sharks$Year_coded, levels = c("45-55", "55-65", "65-75", "75-85", "85-95", "95-05", "05-15"))

ggplot(sharks, aes(x=Attacks)) + geom_histogram() + facet_wrap(~Year_coded)

(b) Explain what is revealed by these histograms and what they tell you about the normality assumption. (Try to be very precise in your wording here.)

 

Your answers should agree with the residual plots too:

model1 <- lm(Attacks ~ Year, data=sharks)

plot(model1)

 

(c) Explain why a transformation of the response variable is likely to be helpful here.

 

(d) We can’t take the log of zero, so try:

sharks$lnattacks <- log(sharks$Attacks + 0.5)

model2 <- lm(lnattacks ~ Year, data = sharks)

plot(model2)

Would you consider the issues in the regression assumptions you found in (a) and (b) to be fixed? Improved? Explain.

 

Extra Credit: How do we interpret the slope when the response has been log-transformed? (See handout in Canvas.)

 

Note: Even better, instead of assuming the conditional distributions are approximately normal, you can use a generalized linear model to model them with a Poisson distribution.
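A minimal sketch of that approach (the name modelPois is just illustrative):

modelPois <- glm(Attacks ~ Year, family = poisson, data = sharks)

summary(modelPois)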