This first assignment assumes you remember a few topics from earlier courses. You may need to review your notes from those courses, and be sure to ask questions! Cite your sources!

1) Introductions/Initial Surveys

2) The Kentucky Derby is an annual horse race run at Churchill Downs in Louisville, KY, USA, on the first Saturday in May (2020 was the first year since 1945 that it wasn’t run in May). The race is known as the “Most Exciting Two Minutes in Sports,” and is the first leg of racing’s Triple Crown. The dataset KYDerby24.txt contains information on each running of the Kentucky Derby since 1875.

Load in the data:

KYDerby24 = read.table("https://www.rossmanchance.com/KYDerby24.txt", header=TRUE)

#You may want to comment out this next line before knitting, especially on a Mac
#View(KYDerby24)

#For a quicker view, always a good idea to start with 
head(KYDerby24)

(a) Examine the distribution of times. What is the first thing you notice? Why is it called the most exciting two minutes in sports?

hist(KYDerby24$Time, xlab = "Time to finish (seconds)")

The two clumps in the data are caused by a change in track length (1.5 miles before 1896, 1.25 miles since). Let’s change the variable of interest (the “response variable”) to speed, taking the track length into account.

#Distance (miles) divided by time (hours; Time is in seconds) gives speed in mph
speed = (.25*(KYDerby24$Year<1896)+1.25)/(KYDerby24$Time/3600) 
hist(speed) 
with(KYDerby24, summary(speed)) 
#I actually prefer the iscamsummary function (load the ISCAM workspace first)
load(url("https://www.rossmanchance.com/iscam3/ISCAM.RData")) 
iscamsummary(speed)
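
As a sanity check on the conversion above, here is a minimal sketch using made-up finishing times (not values from the dataset). It assumes, as the code above does, a 1.5-mile track before 1896 and a 1.25-mile track from 1896 on. The helper name speed_mph is hypothetical.

```r
# Hypothetical helper: convert a finishing time (seconds) to speed (mph).
# Assumes 1.5 miles before 1896 and 1.25 miles from 1896 on.
speed_mph <- function(year, time_sec) {
  miles <- 1.25 + 0.25 * (year < 1896)  # the logical is coerced to 0/1
  miles / (time_sec / 3600)             # miles divided by hours
}

# Made-up inputs, for illustration only
speed_mph(year = 1880, time_sec = 150)  # 1.5 miles in 150 s -> 36 mph
speed_mph(year = 2000, time_sec = 120)  # 1.25 miles in 120 s -> 37.5 mph
```

A two-minute finish on the current track therefore corresponds to 37.5 mph, which is a handy reference point when reading the histogram of speeds.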

(b) Interpret the mean and the standard deviation (in context).

(c) Summarize how the speeds have changed over time.

with(KYDerby24, plot(speed ~ Year, ylab="speed (mph)", xlab="year"))

Because the relationship isn’t linear, let’s try a quadratic model, including both \(year\) and \(year^2\) in the model. The additional term allows the model to “turn.”

#Create the quadratic term
yearsq = KYDerby24$Year*KYDerby24$Year

#We can also use the I() function to tell R to evaluate the expression before fitting the model
model2 = lm(speed~Year + I(Year^2), data = KYDerby24)
model2
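
To see why I() matters, here is a small sketch on simulated toy data (not the Derby data; the names toy, fit_a, and fit_b are made up for illustration). It suggests that a precomputed squared column and I(x^2) give the same fit, while a bare x^2 inside a formula is read as formula syntax rather than arithmetic.

```r
# Simulated toy data, for illustration only
set.seed(1)
toy <- data.frame(x = 1:30)
toy$y <- 5 + 0.8 * toy$x - 0.02 * toy$x^2 + rnorm(30, sd = 0.5)

# Option 1: precompute the squared term as its own column
toy$xsq <- toy$x * toy$x
fit_a <- lm(y ~ x + xsq, data = toy)

# Option 2: let I() square x inside the formula
fit_b <- lm(y ~ x + I(x^2), data = toy)

# Same coefficient estimates; only the term name differs
all.equal(unname(coef(fit_a)), unname(coef(fit_b)))

# Without I(), ^ is a formula operator, so x^2 reduces to x
# and the model has no quadratic term at all
coef(lm(y ~ x + x^2, data = toy))
```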

(d) The coefficient of \(year\) is positive and the coefficient of \(year^2\) is negative. What does this imply about the behavior of the model?

Overlay the fitted model onto the scatterplot. Submit a copy of your output.

#speed was created as its own vector above, not as a column of KYDerby24
plot(speed~KYDerby24$Year)
lines(cbind(KYDerby24$Year, model2$fitted.values), col="red") 

#An alternative approach to show the model on the graph for visual inspection using tidyverse/ggplot. You can comment one out. 
library(tidyverse) #provides %>% and ggplot()
KYDerby24 %>% ggplot(aes(x = Year, y = speed)) + 
  geom_point() + 
  geom_line(aes(x = Year, y = model2$fitted.values), color = "red") + 
  labs(title = "Quadratic model") + theme_bw() 

(e) Based on the graph you have created, does this model appear to have the right form?

Examine the residual plots

#R has a nice set of 4 graphs
par(mfrow=c(2,2)) 
plot(model2)

(f) Summarize what you learn about the validity of model 2 with respect to linearity, normality, and equal variance. Be sure to justify your conclusions!

Now that we are fitting a model with more than one explanatory variable, we should consider multicollinearity, that is, the presence of a linear relationship among the explanatory variables.

par(mfrow=c(1,1)) 
yearsq = KYDerby24$Year*KYDerby24$Year 
plot(yearsq~ KYDerby24$Year)
#install.packages("car") 
car::vif(model2) 
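
For intuition about what a VIF measures, here is a minimal sketch on simulated data (the variables x1, x2, and y are made up and are not from the Derby data). It uses the standard definition: the VIF for a predictor is 1/(1 - R^2), where R^2 comes from regressing that predictor on the other explanatory variables.

```r
# Simulated predictors, deliberately correlated with each other
set.seed(2)
x1 <- rnorm(50)
x2 <- 0.9 * x1 + rnorm(50, sd = 0.3)
y  <- 1 + x1 + x2 + rnorm(50)   # only needed for the optional car::vif() check

# R^2 from regressing x1 on the other explanatory variable
r2_x1  <- summary(lm(x1 ~ x2))$r.squared
vif_x1 <- 1 / (1 - r2_x1)
vif_x1

# Optional check, if the car package is installed:
# car::vif(lm(y ~ x1 + x2))["x1"]
```

A VIF of 1 means the predictor is uncorrelated with the others; the closer R^2 gets to 1, the faster the VIF grows.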

(g) VIF values larger than 10 indicate a high degree of multicollinearity, meaning the standard errors of our slope coefficient estimates may be quite large. Is there evidence of multicollinearity in model 2?

We discussed in class some advantages of centering variables. Turns out centering can also help polynomial models.

#create the centered year variable 
c.year <- KYDerby24$Year - mean(KYDerby24$Year) 
#and its quadratic buddy 
c.year.sq <- c.year*c.year 
#are these variables linearly related? 
plot(c.year.sq ~ c.year) 
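
The same comparison can be made numerically. This sketch assumes one race per year from 1875 through 2024, which may not match the dataset exactly, and uses made-up names (yr, c_yr) so it does not overwrite anything above.

```r
# Assumed span of years; a stand-in for KYDerby24$Year
yr <- 1875:2024

# Raw year and its square are almost perfectly linearly related
cor(yr, yr^2)

# After centering, the linear correlation with the square essentially vanishes
c_yr <- yr - mean(yr)
cor(c_yr, c_yr^2)
```

The second correlation is essentially zero here because the centered years are symmetric around zero; with unevenly spaced data it would be small but not exactly zero.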

Extra credit: Summarize why this is helpful!

Instead of a polynomial model, if we consider the nonlinear relationship to be “monotonic,” we can try a power transformation. With \(Y\) increasing at a slower and slower rate, we can try a log transformation of the \(X\) variable (to “slow it down”).

#As in many other packages, "log" refers to the natural log 
log.year = log(KYDerby24$Year)
model3 = lm(speed ~ log.year, data = KYDerby24)
#May have to copy the next two lines together into the session window
par(mfrow=c(1,1))
plot(KYDerby24$Year, speed)
lines(cbind(KYDerby24$Year, model3$fitted.values), col="green")

This model does not appear to be very helpful! The model we are fitting is curved, but not curved in the right place. We can often solve this by first shifting the data…

#Let's make the first year = 1 (we could start at zero but then couldn't take the log) 
shiftedyear = KYDerby24$Year - 1874
logx = log(shiftedyear)
model3b = lm(speed~logx)
plot(speed~logx, xlab="log(year - 1874)")
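
A rough way to see why the shift matters, sketched under the assumption that the years run from 1875 through 2024 (the names yr and yr_shift are made up for illustration): over that range the log of the raw year is almost a straight-line function of the year, so it cannot add much curvature, while the log of the shifted year bends noticeably.

```r
# Assumed span of years; a stand-in for KYDerby24$Year
yr       <- 1875:2024
yr_shift <- yr - 1874          # first year becomes 1

# log(yr) only changes by about 0.08 over the whole range
diff(range(log(yr)))

# ...so it is almost perfectly linear in yr
cor(yr, log(yr))

# log of the shifted year spans a much wider range and is clearly curved
diff(range(log(yr_shift)))
cor(yr_shift, log(yr_shift))
```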

(h) Is the association between speed and log(year - 1874) linear?

(i) Does the model seem to have the right form?

#May have to copy the next two lines together into the session window
plot(speed~KYDerby24$Year)
lines(cbind(KYDerby24$Year, model3b$fitted.values), col="blue")

#tidyverse version (needs library(tidyverse) loaded)
KYDerby24 %>% ggplot(aes(x = Year, y = speed)) +
  geom_point() + 
  geom_line(aes(x = Year, y = model3b$fitted.values), color = "red") + 
  labs(title = "Speed vs. log(Year - 1874)") + theme_bw()

Produce and include a graph that overlays both models.

par(mfrow=c(1,1))
#speed was created as its own vector above, not as a column of KYDerby24
plot(speed~KYDerby24$Year)
lines(cbind(KYDerby24$Year, model2$fitted.values), col="red")
lines(cbind(KYDerby24$Year, model3b$fitted.values), col="blue") 

Consider the AIC for both models.

#Formula: 2*k + n*log(mean(residuals(model2)^2)) + n + n*log(2*pi)
#Note: to match AIC() in R, k counts the regression coefficients plus one for sigma 
AIC(model2)
AIC(model3b)
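
The formula in the comment can be checked by hand. This is a minimal sketch on R's built-in mtcars data rather than the Derby data, with made-up object names (fit_check, n_obs, k_par, aic_manual). It suggests that the k that reproduces AIC() for an lm fit counts sigma along with the regression coefficients.

```r
# Built-in dataset, used only to check the formula
fit_check <- lm(mpg ~ wt, data = mtcars)

n_obs <- nobs(fit_check)
k_par <- length(coef(fit_check)) + 1   # regression coefficients plus sigma

# mean(residuals^2) is RSS/n, the maximum likelihood estimate of sigma^2
aic_manual <- 2 * k_par + n_obs * log(mean(residuals(fit_check)^2)) +
  n_obs + n_obs * log(2 * pi)

all.equal(aic_manual, AIC(fit_check))
```

The same steps should work for model2 and model3b by swapping in those fitted objects.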

(j) Is it ok to compare these two values? Which is better?

(k) Which model form makes the most sense in context? Explain.

(l) Which model would you recommend and why?