---
title: "Stat 414 - Day 1"
editor: visual
output:
  word_document: default
  html_notebook: default
---
```{r echo= FALSE}
#some set up here, echo=FALSE means this won't show up in document
#My preferences:
#Keep as .Rmd file with output as html_notebook.
#Use the cog to set Preview in Viewer Pane (should appear when save)
#then when done with document choose Knit to Word
#you may still need to install the RMarkdown package beforehand
```
```{r setup, include=FALSE}
##Step 1 is to set to eval = TRUE
##Step 2 is to change the word_document: setting above to default
knitr::opts_chunk$set(eval = TRUE)
knitr::opts_knit$set(global.par = TRUE)
```
```{r echo = FALSE}
#Some global settings
options(digits=2) #controls the number of significant digits in output
par(mar = c(4.1, 4.1, 1.1, 1.1)) #bottom, left, top, right so graphs don't use as much space in Word
```
------------------------------------------------------------------------
### Last Time:
- Data are *multilevel* when the *structure* of the data has "observational units" at more than one level, often because of clustering or nesting in the data (e.g., students nested in classrooms)
- Multilevel data need to be analyzed differently from single-level data
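
As a reminder of what that structure looks like, here is a minimal sketch with made-up numbers. The data frame `students` and all of its values are hypothetical and used only for illustration; the point is that some variables change from student to student while others change only from classroom to classroom.

```{r}
# Hypothetical toy data: 6 students (level 1) nested in 2 classrooms (level 2)
students = data.frame(
  classroom  = c("A", "A", "A", "B", "B", "B"),
  student    = 1:6,
  score      = c(78, 85, 90, 62, 70, 66),  # varies by student (level 1)
  class_size = c(24, 24, 24, 31, 31, 31)   # varies only by classroom (level 2)
)
table(students$classroom)  # number of students observed in each classroom
aggregate(score ~ classroom, data = students, FUN = mean)  # one mean score per classroom
```
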
------------------------------------------------------------------------
## Example 1: Kentucky Derby winners
The Kentucky Derby is an annual horse race run at Churchill Downs in Louisville, KY, USA, on the first Saturday in May (2020 was the first year since 1945 in which it wasn't run in May). The race is known as the "Most Exciting Two Minutes in Sports" and is the first leg of racing's Triple Crown. The dataset KYDerby23.txt contains information on each running of the Kentucky Derby since 1875.
### First, load in the data:
```{r}
KYDerby23 = read.table("https://www.rossmanchance.com/KYDerby23.txt", header=TRUE)
#View() opens the data viewer and only works in an interactive session,
#so this line skips it when knitting (which can otherwise fail, especially on a mac)
if (interactive()) View(KYDerby23)
```
### Step 1 - Start with a graph!
```{r}
hist(KYDerby23$Time)
```
a) Examine the distribution of times. What is the first thing you notice? Why is it called the most exciting **two minutes** in sports?
The two clumps in the data are caused by a change in track length (from 1.5 miles to 1.25 miles in 1896). Let's change the variable of interest (the "response variable") to speed, taking the track length into account.
```{r}
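#distance is 1.5 miles before 1896 and 1.25 miles since
#dividing Time by 3600 assumes Time is in seconds, so speed is in miles per hour
speed = (.25*(KYDerby23$Year<1896)+1.25)/(KYDerby23$Time/3600)
hist(speed)
summary(speed)
#speed is a standalone vector here, not a column of KYDerby23
qqnorm(speed)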
#these are some functions I use
load(url("http://www.rossmanchance.com/iscam3/ISCAM.RData"))
iscamsummary(speed)
```
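To check the arithmetic behind the conversion above, here is a small sketch. The helper `derby_speed()` and its argument names are hypothetical (they are not part of the ISCAM functions), and it assumes, as the division by 3600 suggests, that winning times are recorded in seconds.

```{r}
# Hypothetical helper: average speed in mph from a winning time in seconds
derby_speed = function(time_sec, year) {
  miles = ifelse(year < 1896, 1.5, 1.25)  # track length changed in 1896
  miles / (time_sec / 3600)               # miles divided by hours
}
derby_speed(time_sec = 120, year = 2000)  # 1.25 miles in 2 minutes: 37.5 mph
derby_speed(time_sec = 150, year = 1880)  # 1.5 miles in 2.5 minutes: 36 mph
```

So a two-minute race at the current distance corresponds to an average speed of 37.5 miles per hour.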
b) Interpret the mean and the standard deviation (in context).
c) Is the response variable normally distributed? Is this a problem? What are some steps we can take if we think this is a problem?
### Bivariate graph
```{r}
with(KYDerby23, plot(speed ~ Year))
```
d) Summarize how the speeds have changed over time.
e) Is the association (time trend) linear? Is this a problem? What are some things we can do if we think this is a problem?
### Fit and interpret a model
A *least squares regression* model finds the best-fitting line by minimizing the sum of the squared residuals.
```{r}
model1 = lm(speed~ Year, data=KYDerby23)
model1
```
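To see what "minimizing the sum of the squared residuals" means, here is a minimal sketch on made-up numbers (the vectors `x` and `y` and the helper `sse()` are hypothetical, not the Derby data). With one predictor, the least squares slope and intercept have closed-form expressions, and they should match what `lm()` reports.

```{r}
# Hypothetical toy data, not the Derby data
x = c(1, 2, 3, 4, 5)
y = c(2.1, 3.9, 6.2, 7.8, 10.1)
# closed-form least squares estimates for one predictor
b1 = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 = mean(y) - b1 * mean(x)                                     # intercept
c(intercept = b0, slope = b1)
coef(lm(y ~ x))  # lm() should give the same two numbers
# sum of squared residuals for any candidate line a + b*x
sse = function(a, b) sum((y - (a + b * x))^2)
sse(b0, b1) < sse(b0, b1 + 0.1)  # moving the slope away from b1 makes the fit worse
```
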
f) Write out the least squares regression equation, using appropriate statistical notation, and interpret the [coefficients]{.underline} in context.
### Validate the model
g) Before we look at p-values and confidence intervals, what are the primary "assumptions" that need to be satisfied for inference in regression models? How do we "check" these assumptions? What can we do if any assumptions are not met?
With more complicated models, residual plots are an important diagnostic tool. The two to start with are a graph of residuals vs. fitted values (aka predicted values) and a histogram and/or normal probability plot of the residuals. (Note: You may need to make sure RMarkdown is installed; do this outside of this document.)
```{r}
#the next line is optional, can also resize once in Word document
par(mfrow=c(1,2))
#residuals vs. fitted values, with a smoother
scatter.smooth(model1$residuals ~ model1$fitted.values)
#normal probability plot of residuals
qqnorm(model1$residuals)
```
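The histogram option mentioned above is not drawn in the chunk, so here is a minimal sketch of it, along with a reference line for the normal probability plot. To keep it runnable on its own, it uses R's built-in `cars` data as a stand-in (there, `speed` is a car's speed and `dist` is its stopping distance, unrelated to the Derby speeds); the names `fit` and `res` are just for illustration.

```{r}
# Stand-in model using R's built-in cars data (speed in mph, stopping dist in feet)
fit = lm(dist ~ speed, data = cars)
res = residuals(fit)
hist(res, main = "Histogram of residuals", xlab = "Residual")
qqnorm(res)
qqline(res)  # reference line; points far from it suggest non-normal residuals
```

The same calls can be applied to `model1$residuals` for the Derby model.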
_For tomorrow:_ Summarize what you learn from these graphs.
What are the assumptions of the basic regression model? How do we check the assumptions? What can we do if any assumptions are not met?
#### Practice:
A researcher suspects that loud music can affect how quickly drivers react. She randomly selects drivers to drive the same stretch of road with varying levels of music volume. Stopping distances for each driver are measured along with the decibel level of the music on their car radio.
(a) Describe each assumption of the basic regression model in this context.