Stat 414 – HW 2

Due by midnight, Friday, Oct. 6

 

This assignment still includes some review material and a few things you may need to look up. Please cite your sources.

 

Please submit each problem as a separate file in Canvas (doc or pdf format).

 

1) The file saldata.txt contains data on 24 college graduates, including their starting salary (in thousands of dollars), how many semesters they spent in college, and their major. (This isn’t real data, but it is based on real data :)).

saldata = read.table("http://www.rossmanchance.com/stat414/data/saldata.txt", sep = "\t", header = TRUE)
model1 <- lm(salary ~ semesters, data = saldata)
summary(model1)
plot(saldata$salary ~ saldata$semesters)

(a)      Describe the association between salary and time spent in college. Interpret the slope coefficient in context.

(b)      Fit a model that includes both semesters and major as predictors of salary.

summary(model2 <- lm(salary ~ semesters + major, data = saldata))
Provide an interpretation of the coefficient of semesters in this model. 

(c)       Is major a statistically significant predictor of salary after adjusting for number of semesters?

State H0 and Ha, and include an appropriate test statistic and p-value. Interpret your results in context.  Be sure to include any output you create to answer this question.
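
One possible way to obtain such a test (a sketch, assuming model1 and model2 above have already been fit) is a nested-model comparison:

anova(model1, model2)  # partial F-test for major after adjusting for semesters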

(d)      What would you do differently if I asked you whether an individual’s number of semesters significantly predicts salary after adjusting for major?

 

Let’s aggregate this information by major.

Majordata = aggregate(saldata[, 1:2], list(saldata$major), mean)

head(Majordata)

plot(Majordata)

(e)      What has this command done?

(f)        Find the least squares regression model for these data and carefully interpret the slope in context.

summary(model3 <- lm(salary ~ semesters, data = Majordata))

Now we want to fit a model that uses both the individual-level and the major-average information, so we need to create a dataset that contains both.  Try

newsaldata <- merge(saldata, Majordata, by.x="major", by.y="Group.1", suffixes=c("",".G"))

head(newsaldata)

Now try

saldata$avgsem = ave(saldata$semesters, saldata$major)

head(saldata, 10)

with(saldata, plot(salary ~ semesters, col = avgsem))
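
As a quick sanity check (a sketch using the objects created above), you can confirm that avgsem simply repeats each major’s mean number of semesters from Majordata:

unique(saldata[, c("major", "avgsem")])  # one row per major
Majordata  # compare avgsem to the semesters column (the group means)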

 

(g)      Now fit

summary(model4 <- lm(salary ~ semesters + avgsem, data = saldata))

Where have you seen the coefficient of semesters before?  Why?

 

(h)      In model 4, how do we interpret the coefficient of avgsem?  (Hint: Why is it not the same as in (b)?  Can you hold the other variable in the model constant? How are this coefficient and the one in (f) related?)

 

Now fit

summary(model5 <- lm(salary ~ semesters*avgsem, data = saldata))

(i)        Provide a detailed interpretation of the coefficient of the interaction. Hint: What’s the difference between the graph below and if I made that graph for model 4?

with(saldata, interaction.plot(semesters, major, model5$fitted.values, type = "l"))

(j)        Is the interaction statistically significant? (Cite your evidence.)
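
One possible way to check this (a sketch, assuming model4 and model5 above have been fit) is to compare the two nested models directly:

anova(model4, model5)  # F-test for the interaction term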

 

 

2) A risk factor of cardiovascular disease is the “topography of adipose tissue” (AT).  However, the only way to reliably measure deep abdominal AT is “computed tomography,” which is costly, requires irradiation of the subjects, and is not available to many physicians.  Researchers wanted to explore whether more easily available measurements could accurately estimate AT.   Their subjects were men between the ages of 18 and 42 years who were free from metabolic disease that would require treatment.

deepfat <- read.table("https://rossmanchance.com/stat414/data/deepfat.txt", header=TRUE)

head(deepfat)

plot(deepfat) #nice way to see all the bivariate associations at once

Fit the least squares regression model and evaluate the residuals.

summary(model1 <- lm(AT ~ waist.cm., data = deepfat))

plot(model1)

(a) Interpret the R2 value in context.

            Note: Here is another interesting relationship that holds for any number of explanatory variables:

cor(deepfat$AT, model1$fitted.values)^2
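
For instance, you can verify this with model1 (same objects as above):

summary(model1)$r.squared  # matches the squared correlation computed above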

 

It appears the variability in the residuals increases with the response variable (see the increasing trend in the third residual plot, the scale-location plot). This suggests we could try a “variance-stabilizing transformation” such as a log or square root.  Try taking the square root of the response.

sqrtAT = sqrt(deepfat$AT)

summary(model2 <- lm(sqrtAT ~ waist.cm., data = deepfat))

plot(model2)

(b) Does this help with the unequal variance issue? Justify your answer.
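
If you want something to compare against, here is a similar sketch using a log transformation instead (this assumes all AT values are positive; model2b is just a hypothetical name):

logAT = log(deepfat$AT)
summary(model2b <- lm(logAT ~ waist.cm., data = deepfat))
plot(model2b)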

 

Of course, transforming one variable can impact the form of the association.  An alternative to a variable transformation, when the only issue is unequal variance (heteroscedasticity), is weighted least squares.  For this problem, model the variability as a function of the explanatory variable (and use the standardized residuals in the residual plot).

summary(model3 <- lm(AT ~ waist.cm.,   weights = 1/waist.cm., data = deepfat))

plot(rstandard(model3)~fitted.values(model3))

(c) How does the slope change? Did the residual standard error change?  Explain why these changes make sense based on the original scatterplot.
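
If it helps, here is a small sketch (using model1 and model3 above) for pulling out the two slopes and residual standard errors side by side:

coef(model1); coef(model3)  # compare intercepts and slopes
summary(model1)$sigma; summary(model3)$sigma  # compare residual standard errors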

 

(d) Create prediction intervals and confidence intervals for model 1:

predict(model1, newdata = data.frame(waist.cm. = 100), interval = "prediction")

predict(model1, newdata = data.frame(waist.cm. = 100), interval = "confidence")

Write a one-sentence interpretation for each interval.

 

(e) Try this code

predictions <- predict(model1, newdata = data.frame(waist.cm. = deepfat$waist.cm.), interval = "prediction")

alldata <- cbind(deepfat, predictions)

 

ggplot(alldata, aes(x = waist.cm., y = AT)) + #define x and y axis variables

  geom_point() + #add scatterplot points

  stat_smooth(method = lm) + #confidence bands

  geom_line(aes(y = lwr), col = "coral2", linetype = "dashed") + #lwr pred interval

  geom_line(aes(y = upr), col = "coral2", linetype = "dashed") + #upr pred interval

  theme_bw()

Where are the confidence bands widest? Why?
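
One way to explore this numerically (a sketch; the waist values 70, 100, and 130 are just hypothetical illustration points) is to look at the standard error of the estimated mean response at several x values:

predict(model1, newdata = data.frame(waist.cm. = c(70, 100, 130)), se.fit = TRUE)$se.fit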

 

(f) How does the prediction interval compare to  ? (Show your work)

 

(g) Repeat (e) for the weighted regression.

predictions <- predict(model3, newdata = data.frame(waist.cm. = deepfat$waist.cm.), interval = "prediction", weights=1/deepfat$waist.cm.)

alldata <- cbind(deepfat, predictions)

 

ggplot(alldata, aes(x = waist.cm., y = AT)) + #define x and y axis variables

  geom_point() + #add scatterplot points

  stat_smooth(method = lm) + #confidence bands

  geom_line(aes(y = lwr), col = "coral2", linetype = "dashed") + #lwr pred interval

  geom_line(aes(y = upr), col = "coral2", linetype = "dashed") + #upr pred interval

  theme_bw()

Describe and explain the main change.

 

(h) If you are worried about outliers, R has a built-in function for fitting Tukey’s resistant line:

               line(deepfat$waist.cm., deepfat$AT, iter=10)

               plot(deepfat$AT ~ deepfat$waist.cm.)

               abline(model1)

               abline(line(deepfat$waist.cm., deepfat$AT, iter=10), col="red")

How do the slope and intercept change?  Provide a possible explanation as to why they have changed this way, as if to a consulting client. (Find and) Include a simple explanation of how this line is calculated. Yes, I want you to do some investigation here. Cite your source(s).
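
To see the change numerically, one possible sketch (using the objects above) is to compare the two sets of coefficients directly:

coef(model1)  # least squares intercept and slope
coef(line(deepfat$waist.cm., deepfat$AT, iter=10))  # Tukey's resistant line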

 

3) Poisson regression

What do I do with count data, which are often skewed to the right and whose variability increases with the mean?


Data on unprovoked shark attacks in Florida were collected from 1946-2011. The response variable is the number of unprovoked shark attacks in the state each year, and potential predictors are the resident population of the state and the year.

(a) Create a graph of the number of attacks vs. year and vs. population.  Do the graphs behave as expected:  Do the conditional distributions appear skewed to the right? Does the variability in the response seem to increase with the number of attacks? Do the residuals show any issues with the linear model?

sharks <- read.table("https://rossmanchance.com/stat414/data/sharks.txt", header=TRUE)

head(sharks)

plot(sharks) #nice way to see all the bivariate associations at once

sharks$Year_coded <- factor(sharks$Year_coded, levels = c("45-55", "55-65", "65-75", "75-85", "85-95", "95-05", "05-15"))

#tidyverse

ggplot(sharks, aes(x=Attacks)) + geom_histogram() + facet_wrap(~Year_coded)

model1 <- lm(Attacks ~ Year, data=sharks)

plot(model1)

Instead of assuming the conditional distributions are approximately normal, we can model them as Poisson distributions. This extension of the basic linear regression model is a generalized linear model. The Poisson regression model fits the model

log(λi) = β0 + β1xi

where λi = E(Yi) and the observed yi values are assumed to follow a Poisson distribution with mean λi.  The log transformation here is referred to as the “link function” and expresses how the mean response is linearly related to the explanatory variable. We’ve been doing a special case, the “identity” link and the normal distribution for the errors.
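
As a point of comparison (a sketch; model1_glm is just a hypothetical name), the “identity link with normal errors” special case can itself be written as a glm, which gives the same coefficient estimates as the lm fit from (a):

model1_glm <- glm(Attacks ~ Year, data = sharks, family = gaussian(link = "identity"))
summary(model1_glm)  # compare coefficients to summary(model1)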

(b) Fit a Poisson regression model to these data.

model2 <- glm(Attacks ~ Year + Population, data=sharks, family=poisson)

summary(model2)

ggplot(sharks,aes(x=Year,y=Attacks))+geom_point()+

  geom_line(aes(y=exp(predict(model2)))) +

  theme_bw()
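
Reminder (a small sketch using model2 above): the coefficients are reported on the log scale, so you may find it helpful to view them back on the count scale:

exp(coef(model2))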

Interpret the coefficient of Year in this model.

Extra Credit: How do we interpret the slope when the response has been log transformed? (See handout in Canvas)