Stat 414 – HW 2
Due by midnight, Friday, Oct. 4
Please submit each problem as a separate file in Canvas (doc or pdf format). You will need to install packages, including tidyverse, for this assignment.
1) Recall the squid data, where we looked at several different models that allowed the variances to vary.
Model 1: gls(Testisweight ~ DML)
    Residual standard error: 3.3523
Model 2: gls(Testisweight ~ DML, weights = varFixed(~DML))
    Residual standard error: 0.1935
Model 3: gls(Testisweight ~ DML, weights = varIdent(form = ~1 | MONTH))
    Residual standard error: 1.555
Model 4: gls(Testisweight ~ DML, weights = varPower(form = ~DML | MONTH))
    Residual standard error: 0.0003
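For reference, a minimal sketch of how these four models can be fit with nlme (assuming the Squid data frame is already loaded):
library(nlme)
model1REML <- gls(Testisweight ~ DML, data = Squid, method = "REML")
model2REML <- gls(Testisweight ~ DML, data = Squid, method = "REML", weights = varFixed(~DML))
model3REML <- gls(Testisweight ~ DML, data = Squid, method = "REML", weights = varIdent(form = ~1 | MONTH))
model4REML <- gls(Testisweight ~ DML, data = Squid, method = "REML", weights = varPower(form = ~DML | MONTH))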
(a) Look at the first 10 observations (include your output):
head(Squid, 10)
(b) Install the nlraa package and then look at the variance-covariance matrices for each model for the first 5 observations (include your output):
install.packages("nlraa")
vcmatrix1 = nlraa::var_cov(model1REML); vcmatrix1[1:5, 1:5]
(c) What is true about the diagonal elements for matrix 1? Why? How is the first value related to the residual standard error for model 1 (model1REML)? Note: You will probably want the residual standard error for each model. One shortcut for that is sigma(model).
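For example, a minimal check (since model 1 assumes constant variance, each diagonal element should equal sigma^2):
sigma(model1REML)^2 # compare with vcmatrix1[1, 1]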
vcmatrix2 = nlraa::var_cov(model2REML); vcmatrix2[1:5, 1:5]
(d) For vcmatrix2, how is the very first value related to the residual standard error for model 2? Which observation has the largest variance in the first 5 rows of vcmatrix2? Why?
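A hint, as a sketch: varFixed(~DML) assumes Var(e_i) = sigma^2 * DML_i, so for the first squid:
sigma(model2REML)^2 * Squid$DML[1] # compare with vcmatrix2[1, 1]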
summary(model3REML) #include the variance function output
vcmatrix3 = nlraa::var_cov(model3REML); vcmatrix3[1:5, 1:5]
(e) For vcmatrix3, how is the very first value related to the residual standard error for model 3? How is the second value related to the residual standard error for model 3? Which observation has the largest variance in the first 5 rows of vcmatrix3? Why?
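A sketch for checking: varIdent(form = ~1 | MONTH) assumes Var(e_i) = (sigma * delta_m)^2, where delta_m is the multiplier for squid i's month (1 for the reference month), as shown in the variance function output.
deltas <- coef(model3REML$modelStruct$varStruct, unconstrained = FALSE, allCoef = TRUE)
(sigma(model3REML) * deltas)^2 # month-specific variances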
summary(model4REML) #include the variance function output
vcmatrix4 = nlraa::var_cov(model4REML); vcmatrix4[1:10, 1:10]
(f) In matrix 4, verify the calculation of the value for squid #8 (show the numbers substituted into a formula). Explain how/why the variances for observations 8-10 differ. (Which is largest and why?)
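A sketch of the hand calculation: varPower(form = ~DML | MONTH) assumes Var(e_i) = sigma^2 * |DML_i|^(2 * delta_m), with delta_m the estimated power for squid i's month (assuming the coefficient vector is named by month):
deltas <- coef(model4REML$modelStruct$varStruct)
sigma(model4REML)^2 * abs(Squid$DML[8])^(2 * deltas[as.character(Squid$MONTH[8])]) # compare with vcmatrix4[8, 8]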
Let’s add month to the model.
(g) Explain why this command is necessary:
Squid$fMONTH = factor(Squid$MONTH)
(h) Add fMONTH to the model:
model5REML <- gls(Testisweight ~ DML + fMONTH, data = Squid, method = "REML")
summary(model5REML)
How many parameters does this add to the model? According to the model, which month tends to have the highest testis weights? Does this agree with what we saw in the raw data in class? Explain how you are deciding.
(i) Carry out a likelihood ratio test comparing model5REML to model1REML. Include your output and summarize your conclusions in context.
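One common approach, sketched here: REML likelihoods are not comparable across models with different fixed effects, so refit both with maximum likelihood before comparing.
model1ML <- update(model1REML, method = "ML")
model5ML <- update(model5REML, method = "ML")
anova(model1ML, model5ML) # likelihood ratio test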
(j) Does adding fMONTH to the model as a predictor improve the homogeneity assumption?
plot(model5REML)
2) A risk factor of cardiovascular disease is the “topography of adipose tissue” (AT). However, the only way to reliably measure deep abdominal AT is “computed tomography,” which is costly, requires irradiation of the subjects, and is not available to many physicians. Després et al. (1991) wanted to explore whether more easily available measurements (e.g., waist circumference, cm) could accurately estimate “deep” AT area (cm²). Their subjects were men from the Quebec City metropolitan area (recruited by “solicitation through the media”) between the ages of 18 and 42 years who were free from metabolic disease (e.g., diabetes, CHD) that would require treatment.
deepfat <- read.table("https://rossmanchance.com/stat414/data/deepfat.txt", header=TRUE)
head(deepfat)
scatter.smooth(deepfat$AT ~ deepfat$waist)
Fit the least squares regression model and evaluate the residuals.
summary(model1 <- lm(AT ~ waist, data = deepfat))
plot(model1)
(a) Include and summarize what you learn from the Scale-Location graph. (Be specific as to the nature of the relationship.)
With transformations, we don’t always know whether we should transform Y or transform X. Let’s first try a “variance-stabilizing transformation” on the response, like log or square root.
deepfat$sqrtAT <- sqrt(deepfat$AT)
summary(model2 <- lm(sqrtAT ~ waist, data = deepfat))
plot(model2)
(b) Does this help the unequal variance assumption? Justify your answers, including appropriate output.
(c) Did this transformation cause any other problems? Why? Is transforming the X variable likely to be helpful here?
An alternative to a variable transformation, when the only issue is heterogeneity, is weighted least squares. For these data, let’s model the variability as a function of the explanatory variable (and use the standardized residuals in the residual plot).
summary(model3 <- lm(AT ~ waist, weights = 1/waist, data = deepfat))
plot(rstandard(model3) ~ fitted.values(model3))
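As an aside, the same variance structure can be fit with gls() from Problem 1; a sketch, assuming nlme is loaded (varFixed(~waist) likewise takes the error variance proportional to waist):
summary(gls(AT ~ waist, data = deepfat, weights = varFixed(~waist)))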
(d) How does the slope change between model 1 and model 3? Did the residual standard error change between model 1 and model 3? Explain why these changes make sense based on the original scatterplot.
(e) Create prediction intervals and confidence intervals for model 1. Write a one-sentence interpretation for each interval.
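A minimal sketch, using a hypothetical waist value of 100 cm:
new <- data.frame(waist = 100)
predict(model1, newdata = new, interval = "confidence") # CI for the mean AT when waist = 100
predict(model1, newdata = new, interval = "prediction") # PI for an individual AT when waist = 100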
(f) How does the prediction interval compare to the rough interval ŷ ± 2(residual standard error)? (Show your work.)
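A sketch of the hand calculation, assuming the comparison is to the rough interval above and reusing the hypothetical waist = 100:
yhat <- predict(model1, newdata = data.frame(waist = 100))
yhat + c(-2, 2) * sigma(model1) # compare with the prediction interval from (e)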
(g) Try this code:
library(ggplot2)
predictions <- predict(model1, newdata = data.frame(waist = deepfat$waist), interval = "prediction")
alldata <- cbind(deepfat, predictions)
ggplot(alldata, aes(x = waist, y = AT)) +  # define x and y axis variables
  geom_point() +                           # add scatterplot points
  stat_smooth(method = lm) +               # confidence bands
  geom_line(aes(y = lwr), col = "coral2", linetype = "dashed") +  # lower prediction interval
  geom_line(aes(y = upr), col = "coral2", linetype = "dashed") +  # upper prediction interval
  theme_bw()
Where are the confidence bands widest? Why?
(h) Repeat (g) for the weighted regression.
predictions <- predict(model3, newdata = data.frame(waist = deepfat$waist), interval = "prediction", weights = 1/deepfat$waist)
alldata <- cbind(deepfat, predictions)
ggplot(alldata, aes(x = waist, y = AT)) +  # define x and y axis variables
  geom_point() +                           # add scatterplot points
  stat_smooth(method = lm) +               # confidence bands
  geom_line(aes(y = lwr), col = "coral2", linetype = "dashed") +  # lower prediction interval
  geom_line(aes(y = upr), col = "coral2", linetype = "dashed") +  # upper prediction interval
  theme_bw()
Where are the prediction bands widest? Why?
(i) If you are worried about outliers, R has a built-in function for fitting Tukey’s resistant line:
line(deepfat$waist, deepfat$AT, iter = 10)
plot(deepfat$AT ~ deepfat$waist)
abline(model1)
abline(line(deepfat$waist, deepfat$AT, iter = 10), col = "red")
How do the slope and intercept change? Provide a possible explanation as to why they have changed this way, as if to a consulting client. (Find and) include a simple explanation of how this line is calculated. Yes, I want you to do some investigation here. Cite your source(s).
3) Data on unprovoked shark attacks in Florida were collected from 1946-2011. The response variable is the number of unprovoked shark attacks in the state each year, and potential predictors are the resident population of the state and the year.
sharks <- read.table("https://rossmanchance.com/stat414/data/sharks.txt", header=TRUE)
head(sharks)
plot(sharks) # nice way to see all the bivariate associations at once
(a) Examine the graph of the number of attacks vs. year. Comment on whether you think the linearity and equal variance assumptions are met.
Create year groups:
sharks$Year_coded <- factor(sharks$Year_coded, levels = c("45-55", "55-65", "65-75", "75-85", "85-95", "95-05", "05-15"))
ggplot(sharks, aes(x = Attacks)) + geom_histogram() + facet_wrap(~Year_coded)
(b) Explain what is revealed by these histograms and what they tell you about the normality assumption. (Try to be very precise in your wording here.) Your answers should agree with the residual plots too:
model1 <- lm(Attacks ~ Year, data = sharks)
plot(model1)
(c) Explain why a transformation of the response variable is likely to be helpful here.
(d) We can’t take the log of zero, so try:
lnattacks = log(sharks$Attacks + .5)
model2 <- lm(lnattacks ~ Year, data = sharks)
plot(model2)
Would you consider the issues in the regression assumptions you found in (a) and (b) to be fixed? Improved? Explain.
Extra Credit: How do we interpret the slope when the response has been log transformed? (See handout in Canvas.)
Note: Even better, instead of assuming the conditional distributions are approximately normal, you can use a generalized linear model to model them as a Poisson distribution.
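A minimal sketch of that alternative (model name hypothetical):
model3 <- glm(Attacks ~ Year, family = poisson, data = sharks)
summary(model3) # coefficients are on the log(mean attacks) scale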