This first assignment does assume you remember a few topics from earlier courses; you may need to review your notes from those courses, and be sure to ask questions! Cite your sources!
Remember to include all relevant
output in your final submission. If you make a qmd file, make sure you render
the file into a Word document and then convert it to PDF.
1) The Kentucky Derby is an annual horse race run at Churchill Downs in Louisville, KY, USA, on the first Saturday in May (2020 was the first year since 1945 that it wasn’t run in May). The race is known as the “Most Exciting Two Minutes in Sports,” and is the first leg of racing’s Triple Crown. The dataset KYDerby25.txt contains information on each running of the Kentucky Derby since 1875.
KYDerby25 = read.table("https://www.rossmanchance.com/KYDerby25.txt", header=TRUE)
#You may want to comment out the next line before knitting, especially on a Mac
View(KYDerby25)
#For a quicker view, it's always a good idea to start with
head(KYDerby25)
(a) Examine the distribution of times. What is the first thing you notice? Why is the race called the "Most Exciting Two Minutes in Sports"?
hist(KYDerby25$Time, xlab = "Time to finish (seconds)")
The two clumps in the data are caused by a change in track length. Let’s change the variable of interest (the “response variable”) to speed, taking the track length into account.
#Distance was 1.5 miles before 1896 and 1.25 miles after; Time is in seconds, so this gives miles per hour
speed = (.25*(KYDerby25$Year<1896)+1.25)/(KYDerby25$Time/3600)
hist(speed)
with(KYDerby25, summary(speed))
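The expression above converts finishing time to speed in miles per hour: the numerator is the race distance (1.5 miles before 1896, 1.25 miles from 1896 on) and dividing Time by 3600 converts seconds to hours. A quick arithmetic check of the conversion (the 120-second time below is illustrative, not taken from the dataset):

```r
#Sanity check of the conversion: a 1.25-mile race run in exactly
#120 seconds ("two minutes") is 1.25 / (120/3600) = 37.5 mph
time_sec <- 120
distance_mi <- 1.25
speed_mph <- distance_mi / (time_sec / 3600)
speed_mph   #37.5
```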
#I actually prefer the ISCAM summary function
load(url("https://www.rossmanchance.com/iscam3/ISCAM.RData"))
iscamsummary(speed)
(b) Interpret the mean and the standard deviation (in context).
(c) Summarize how the speeds have changed over time.
with(KYDerby25, plot(speed ~ Year, ylab="speed", xlab="year"))
(d) Fit, interpret (interpret the intercept and the slope in context), and evaluate (i.e., summarize what you learn about the validity of model 1 with respect to linearity, normality, and equal variance) the linear model
model1 = lm(speed ~ Year, data = KYDerby25); model1
par(mfrow=c(2,2))
plot(model1)
Because the relationship isn't linear, let's try a quadratic model, including both Year and Year^2 in the model. The additional term allows the model to "turn."
#Create the quadratic term
yearsq = KYDerby25$Year*KYDerby25$Year
#We can also use the I() function to tell R to evaluate the expression before fitting the model
model2 = lm(speed~Year + I(Year^2), data = KYDerby25)
summary(model2)
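An equivalent way to write the same model uses poly() — a side note, not required for the assignment. With raw = TRUE the coefficients match model2 exactly; with the default (orthogonal polynomials) the coefficients change but the fitted values do not.

```r
#Same fit as model2: poly(Year, 2, raw = TRUE) expands to
#Year and Year^2 rather than orthogonal polynomial terms
model2.poly = lm(speed ~ poly(Year, 2, raw = TRUE), data = KYDerby25)
summary(model2.poly)
```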
(e) The coefficient of Year is positive and the coefficient of Year^2 is negative. What does this imply about the behavior of the model?
Overlay the fitted model onto the scatterplot. Submit a copy of your output.
par(mfrow=c(1,1))
#Need to run the next two lines together
plot(speed~KYDerby25$Year)
lines(cbind(KYDerby25$Year, model2$fitted.values), col="red")
#An alternative approach to show the model on the graph for visual inspection using tidyverse/ggplot. Choose one to include
library(tidyverse)
KYDerby25 %>%
ggplot(aes(x = Year, y = speed)) +
geom_point() +
geom_line(aes(x = Year, y = model2$fitted.values), color = "red") +
labs(title = "Quadratic model") +
theme_bw()
(f) Based on the graph you have created, does this model appear to have the right form?
Now that we are fitting a model with more than one explanatory variable, we should consider multicollinearity: the presence of a (near-)linear relationship among the explanatory variables.
par(mfrow=c(1,1))
#yearsq = KYDerby25$Year*KYDerby25$Year
plot(yearsq~ KYDerby25$Year)
#install.packages("car")
car::vif(model2)
(g) VIF values larger than 10 indicate a high degree of multicollinearity, meaning the standard errors of our slope coefficient estimates may be quite large. Is there evidence of multicollinearity in model 2?
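With only two explanatory variables, the VIF can also be computed by hand from its definition, 1/(1 - R^2), where R^2 comes from regressing one explanatory variable on the other. A sketch using the variables defined above:

```r
#Regress Year^2 on Year; the R-squared measures how linearly
#related the two predictors are, and 1/(1 - R^2) is the VIF
r2 = summary(lm(yearsq ~ KYDerby25$Year))$r.squared
1 / (1 - r2)   #should match car::vif(model2)
```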
We discussed in class some advantages of centering variables. It turns out centering can also help with polynomial models.
#create the centered year variable
c.year = KYDerby25$Year - mean(KYDerby25$Year)
#and its quadratic buddy
c.year.sq = c.year*c.year
#are these variables linearly related?
plot(c.year.sq~ c.year)
model2b <- lm(speed ~ c.year + c.year.sq)
car::vif(model2b)
summary(model2b)
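A side benefit of the quadratic form: a parabola b0 + b1*x + b2*x^2 turns at x = -b1/(2*b2). Because model2b uses centered years, adding back mean(Year) puts that turning point on the calendar scale. A sketch, using the objects defined above:

```r
#Locate the year at which the fitted speed curve turns:
#the centered parabola peaks at c.year = -b1/(2*b2)
b = coef(model2b)
turn = -b["c.year"] / (2 * b["c.year.sq"])
turn + mean(KYDerby25$Year)   #turning point in calendar years
```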
(h) Summarize whether/how/why centering the variable has helped with the collinearity between the Year and Year^2 variables.
(i) How did centering the variable change the residual standard error?
(j) Refit the model using maximum likelihood estimation.
model2b.ML = nlme::gls(speed ~ 1 + c.year + c.year.sq, method = "ML", data = KYDerby25)
summary(model2b.ML)
Are the estimated intercept and slope coefficients similar to the OLS model? How does the residual standard error compare?
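The coefficient estimates should agree; the residual standard errors differ only by a known factor, because the ML estimate divides the residual sum of squares by n while OLS divides by n - p. Assuming gls with method = "ML" reports the ML sigma, the relationship can be checked directly (object names as above):

```r
#sigma_ML = sqrt((n - p)/n) * sigma_OLS, where p counts the
#estimated coefficients (here 3), so the ML value is slightly smaller
n = nrow(KYDerby25)
p = length(coef(model2b))
sigma(model2b) * sqrt((n - p) / n)   #compare to summary(model2b.ML)
```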
Instead of a polynomial model, if we consider the nonlinear relationship to be "monotonic," we can try a power transformation. With speed increasing at a slower and slower rate, we can try a log transformation of the x variable (to "slow it down").
#As in many other packages, "log" refers to the natural log
log.year = log(KYDerby25$Year)
model3 = lm(speed ~ log.year, data = KYDerby25)
#May have to copy the next two lines together into the session window
plot(speed ~ KYDerby25$Year)
lines(cbind(KYDerby25$Year, model3$fitted.values), col="green")
This model does not appear to be very helpful! The model we are fitting is curved, but not curved in the right place. We can often solve this by first shifting the data…
#Let's make the first year = 1 (we could start at zero but then couldn't take the log)
shiftedyear = KYDerby25$Year - 1874
logx = log(shiftedyear)
model3b = lm(speed~logx)
summary(model3b)
plot(speed~logx)
(k) Is the association between speed and log(shifted year) linear?
Again, compare the model to the raw data
#May have to copy the next two lines together into the session window
plot(speed~KYDerby25$Year)
lines(cbind(KYDerby25$Year, model3b$fitted.values), col="blue")
#tidyverse version
KYDerby25 %>%
ggplot(aes(x = Year, y = speed)) +
geom_point() +
geom_line(aes(x = Year, y = model3b$fitted.values), color = "red") +
labs(title = "Speed vs. log(Year - 1874)") +
theme_bw()
Produce and include a graph that overlays both models.
plot(speed~KYDerby25$Year)
lines(cbind(KYDerby25$Year, model2$fitted.values), col="red")
lines(cbind(KYDerby25$Year, model3b$fitted.values), col="blue")
(l) Which model would you recommend based on the graphs comparing the models to the raw data?
(m) Which model would you recommend based on the sums of squared errors? (Be sure to report the values you are comparing.) Is it OK to compare the two models this way? Explain why or why not.
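The sums of squared errors can be read straight off the residuals (deviance() returns the same number). A sketch with the model objects defined above:

```r
#SSE = sum of squared residuals for each candidate model;
#both models predict speed for the same rows, so the SSEs are comparable
sse2 = sum(residuals(model2)^2)     #quadratic model
sse3 = sum(residuals(model3b)^2)    #log(shifted year) model
c(quadratic = sse2, log = sse3)
```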
(n) Which model form makes the most sense in context? Explain.
(o) Which model would you recommend overall, and why?