Stat 414 – HW 1

Due by midnight, Sunday, Sept. 28

 

This first assignment does assume you remember a few topics from earlier courses, you may need to continue reviewing your notes from your previous courses, and be sure to ask questions!  Cite your sources!

Remember to include all relevant output in your final submission. If you make a qmd file, make sure you render the file into a Word file and then convert to pdf?

 

1) The Kentucky Derby is an annual horse race run at Churchill Downs in Louisville, KY, USA, on the first Saturday in May (2020 was the first year since 1945 that it wasn’t run in May). The race is known as the “Most Exciting Two Minutes in Sports,” and is the first leg of racing’s Triple Crown. The dataset KYDerby25.txt contains information on each running of the Kentucky Derby since 1875.

 

First, load in the data:

KYDerby25 = read.table("https://www.rossmanchance.com/KYDerby25.txt", header=TRUE)
#You may want to comment out this next line out before knitting, especially on a mac
View(KYDerby25)

#For a quicker view, always a good idea to start with
head(KYDerby25)

 

(a) Examine the distribution of times, what is the first thing you notice? Why is it called the most exciting two minutes in sports?

hist(KYDerby25$Time, xlab = "Time to finish (seconds)")

 

The two clumps in the data are caused by a change in track length. Let’s change the variable of interest (the “response variable”) to speed, taking the track length into account.

speed = (.25*(KYDerby25$Year<1896)+1.25)/(KYDerby25$Time/3600)
hist(speed)

with(KYDerby25, summary(speed))

#I actually prefer

load(url("https://www.rossmanchance.com/iscam3/ISCAM.RData"))
iscamsummary(speed)

 

(b)     Interpret the mean and the standard deviation (in context).

(c)      Summarize how the speeds have changed over time.

with(KYDerby25, plot(speed ~ Year, ylab="speed", xlab="year"))

 

Model 1

(d)     Fit, interpret (interpret the intercept and the slope in context), and evaluate (i.e., summarize what you learn about the validity of model 2 with respect to linearity, normality, and equal variance) the linear model

model1 = lm(speed ~ Year, data = KYDerby25); model1

par(mfrow=c(2,2))

plot(model1)

 

Model 2

Because the relationship isn’t linear, let’s try a quadratic model, including both  and  in the model. The additional term allows the model to “turn.”

#Create the quadratic term
yearsq
= KYDerby25$Year*KYDerby25$Year

#We can also use the I() function to tell R to evaluate the expression before fitting the model
model2
= lm(speed~Year + I(Year^2), data = KYDerby25)
summary(model2)

(e) The coefficient of  is positive and the coefficient of  is negative. What does this imply about the behavior of the model?

 

Overlay the fitted model onto the scatterplot. Submit a copy of your output.

par(mfrow=c(1,1))

#Need to run the next two lines together

plot(speed~KYDerby25$Year)
lines(cbind(KYDerby25$Year, model2$fitted.values),
col="red")

#An alternative approach to show the model on the graph for visual inspection using tidyverse/ggplot. Choose one to include
library(tidyverse)

KYDerby25 %>%
  ggplot(aes(
x = Year, y = speed)) +
  geom_point() +
  geom_line(aes(
x = Year, y = model2$fitted.values), color = "red") +
  labs(
title = "Quadratic model") +
 theme_bw()

 

(f) Based on the graph you have created, does this model appear to have the right form?

 

Now that we are fitting a model with more than one explanatory variable, we should consider multi-collinearity or the presence of a linear relationship among explanatory variables.

par(mfrow=c(1,1))
#yearsq = KYDerby25$Year*KYDerby25$Year
plot(yearsq~ KYDerby25$Year)
#install.packages("car")
car::vif(model2)

(g) VIF values larger than 10 indicate a high degree of multicollinearity, meaning the standard errors of our slope coefficient estimates may be quite large.  Is there evidence of multicollinearity in model 2?

 

We discussed in class some advantages of centering variables.  Turns out centering can also help polynomial models.

#create the centered year variable

c.year = KYDerby25$Year - mean(KYDerby25$Year)

#and it’s quadratic buddy

c.year.sq = c.year*c.year

#are these variables linearly related?

plot(c.year.sq~ c.year)

model2b <- lm(speed ~ c.year + c.year.sq)

car::vif(model2b)

summary(model2b)

(h) Summarize whether/how/why centering the variable has helped with the collinearity in the year and year2 variables.

 

(i) How did centering the variable change the residual standard error?

 

(j) Reft the model using maximum likelihood estimation

model2b.ML = nlme::gls(speed ~ 1 + c.year + c.year.sq, method = "ML", data = KYDerby25 )

summary(model2b.ML)

Are the estimated intercept and slope coefficients similar to the OLS model? How does the residual standard error compare?

 

Model 3

Instead of a polynomial model, if we consider the nonlinear relationship is “monotonic” we can try a power transformation. With  increasing at a slower and slower rate, we can try a log transformation of the X variable (to “slow it down”).

#Like many other packages "log" refers to natural log
log.year
= log(KYDerby25$Year)
model3
= lm(speed ~ log.year, data = KYDerby25)
#May have to copy the next two lines together into the session window
plot(speed ~ KYDerby25$Year)
lines(cbind(KYDerby25$Year, model3$fitted.values),
col="green")

 

This model does not appear to be very helpful! The model we are fitting is curved, but not curved in the right place. We can often solve this by first shifting the data…

#Let's make the first year = 1 (we could start at zero but then couldn't take the log)
shiftedyear
= KYDerby25$Year - 1874
logx
= log(shiftedyear)
model3b
= lm(speed~logx)

summary(model3b)
plot(speed~logx)

(j) Is the association between speed and log(year) linear?

 

Again, compare the model to the raw data

#May have to copy the next two lines together into the session window
plot(speed~KYDerby25$Year)
lines(cbind(KYDerby25$Year, model3b$fitted.values),
col="blue")

#tidyverse version
KYDerby25 %>%
  ggplot(aes(
x = Year, y = speed)) +
  geom_point() +
  geom_line(aes(
x = Year, y = model3b$fitted.values), color = "red") +
  labs(
title = "Speed vs. log(Year - 1874)") +
 theme_bw()

 

Compare the models

Produce and include a graph that overlays both models.

plot(speed~KYDerby25$Year)
lines(cbind(KYDerby25$Year, model2$fitted.values),
col="red")
lines(cbind(KYDerby25$Year, model3b$fitted.values),
col="blue")

(k) Which model would you recommend based on the graphs comparing the models to the raw data?

(l) Which model would you recommend based on the sums of squared errors? (be sure to report the values you are comparing)  Is it ok to compare the two models this way?  Explain why you do/do not think so.

(13)  Which model form makes the most sense in context? Explain.

(14)  Which model would you recommend and why?