Stat 414 – HW 6
Due Friday, Nov 8, 2pm
0) Due Thursday 8am
In class on Thursday we
are going to have a guest speaker: Sam Ventura. Dr. Ventura will be presenting
on how hierarchical or multilevel models can play an important role in player evaluation
in team sports. He will look at models for wins above replacement and offensive
and defensive player ratings in football and hockey. He will then apply these
ideas to the NFL and the NFL draft, including providing a definitive answer to
the question of whether Joe Flacco is elite.
Dr. Ventura received
his Ph.D. in Statistics from Carnegie Mellon in 2015 (and a BS in computational
finance and statistics). He is currently Director of Hockey Research for the
Pittsburgh Penguins and an affiliated faculty member at Carnegie Mellon
University’s Department of Statistics & Data Sciences. He is also associate
editor for the Journal of Quantitative Analysis in
Sports. His
academic research focuses on clustering, prediction, record linkage, synthetic
data, infectious diseases, and sports (particularly hockey and football).
By Thursday morning, preferably sooner, please email at least one
question you would like to ask Dr. Ventura during his “visit” (Zoom). Please also try to arrive on time and
well-rested Thursday; I had to call in a lot of favors for this one :).
1) Read the paper “Song discrimination by nestling collared flycatchers during early development” by
McFarlane et al. (Biology Letters, 2016) (http://rsbl.royalsocietypublishing.org/content/12/7/20160234#F2).
Note that there is
supplemental material for the paper that contains some additional details on
the model used.
(a) Describe the response variable being considered in Figure 2.
(b) They use a mixed model that contains a random effect. What is the random effect as
they describe it in the paper and why are they accounting for it?
(c) They did not clearly specify this, but they used a random intercept model and the
estimated variance for the random effect is 0.0009031 and the estimated
residual (or error) variance is 0.0083427. Calculate and interpret the
intra-class correlation for two different observations taken within the same
level of the random effect.
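Hint for (c): in a random intercept model, the intra-class correlation is the between-group variance divided by the total (between-group plus residual) variance. Here is a minimal R sketch of that ratio. The function name icc and the variance components 1 and 3 are made up for illustration; they are not the paper's estimates.

```r
# Intra-class correlation for a random intercept model:
# between-group variance / (between-group variance + residual variance)
icc <- function(var_between, var_residual) {
  var_between / (var_between + var_residual)
}

# Illustrative (made-up) variance components
icc(var_between = 1, var_residual = 3)  # 1 / (1 + 3) = 0.25
```

You would plug in the two estimated variances given above in place of the made-up values.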
(d) In the first model that they report results from in Section 3, they are ignoring
song type or species and are just using age
as a fixed effect. Interpret the estimated coefficients and test results that
they provide (e.g., the day 7, 9, and 12 test results vs. embryo).
(e) They did not report the day 4 results
except in the supplement because “they responded similarly.” Do you agree?
Explain.
This is an example of issues with
selective reporting of results and is not good science - if you did a test it
should be reported and discussed.
2) Data were collected by the Minnesota
Department of Education for all Minnesota schools during the years 2008-2010 to
compare charter and non-charter schools.
School performance is measured by the mean score on the math portion of
the Minnesota Comprehensive Assessment (MCA-II) for the 6th grade students
enrolled in 618 different Minnesota schools during the years 2008, 2009, and
2010. (MCA test scores for sixth graders
are scaled to fall between 600 and 700, where scores above 650 for individual
students indicate “meeting standards.” Thus, schools with averages below 650
will often have increased incentive to improve their scores the following
year.)
(a) Identify Level 1 and Level 2. Are the variables listed below Level 1 or
Level 2 variables?
· percentage of students with free and reduced lunch
· percentage of students with special education needs
· percentage of students who are non-white
· charter or public non-charter school
· urban or rural
Note: the Level 2 variables are the 2010 values (why is
this OK to do?).
(b) Next we want to explore how MCA math test scores
relate to these variables. This can be
done using the data values for all three years or by averaging the data values
for the three years into one number.
Give a brief pro/con of these approaches.
(c) For the second approach, open the “wide format” of
the data (chart.wide.txt;
this includes three columns for the three time points for each school) and use
the SchoolAvg variable as the response.
Examine the association of this variable with each of the variables
listed in (a).
Which variables seem most useful in predicting the average math score?
(d) Now open the “long
format” of the data (chart.long.txt). Create two
visual representations of math scores vs. time for the first 20 schools:
· separate graphs for each school
· connecting lines or smoothers for each school overlaid on
the same graph (i.e., “spaghetti plot”)
Explain what year08
represents.
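If you are unsure how to build the two graphs in (d), here is one possible sketch using the lattice package. It runs on simulated data, not chart.long.txt: the data frame sim_long and all of its values are made up, and coding year08 as 0, 1, 2 is an assumption about the real file. Only the column names MathAvgScore, year08, and schoolnum come from the assignment.

```r
library(lattice)  # ships with standard R installations

# Simulated stand-in for the long-format data: 20 schools x 3 years (made-up values)
set.seed(414)
n_schools <- 20
school_id <- rep(seq_len(n_schools), each = 3)
sim_long <- data.frame(
  schoolnum = school_id,
  year08    = rep(0:2, times = n_schools)  # assumed coding of the time variable
)
school_intercept <- rnorm(n_schools, mean = 650, sd = 6)
school_slope     <- rnorm(n_schools, mean = 1.5, sd = 1)
sim_long$MathAvgScore <- school_intercept[school_id] +
  school_slope[school_id] * sim_long$year08 +
  rnorm(nrow(sim_long), sd = 2)

# (1) Separate panel for each school
panel_plot <- xyplot(MathAvgScore ~ year08 | factor(schoolnum), data = sim_long,
                     type = c("p", "l"), xlab = "year08", ylab = "MathAvgScore")

# (2) Spaghetti plot: one line per school, overlaid on the same axes
spaghetti_plot <- xyplot(MathAvgScore ~ year08, groups = schoolnum, data = sim_long,
                         type = "l", xlab = "year08", ylab = "MathAvgScore")

print(panel_plot)      # lattice plots must be printed when run from a script
print(spaghetti_plot)
```

With the real data you would first subset to the first 20 schools and replace sim_long with your own data frame. ggplot2 with facet_wrap() would work just as well.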
(e) Do some schools have higher intercepts? What does
this mean in context?
(f) Do some schools have higher slopes? What does this
mean in context?
(g) Separate the first graph by charter (charter = 1)
and non-charter (charter = 0) schools. Does one group tend to have higher
scores? Does one group tend to have more
variability?
(h) Fit a multilevel model with year08, random
intercepts, and random slopes. (Be sure to use schoolnum, which is unique, not
school name.) Describe what this model is doing. What percentage of
within-school variation is explained by the linear increase over time?
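One common way to answer the last part of (h) is to compare the residual (Level 1) variance from a model with no time variable to the residual variance from the model with year08. The sketch below shows that comparison with lme() from the nlme package, run on simulated data. The data frame sim_long, the object names model0 and model1, and all simulated values are made up for illustration. lmer() from lme4 is an equally reasonable choice, and the optimizer setting is only there to make the toy example more robust.

```r
library(nlme)  # provides lme(); ships with standard R installations

# Simulated stand-in for the long-format data (made-up values)
set.seed(414)
n_schools <- 100
school_id <- rep(seq_len(n_schools), each = 3)
sim_long <- data.frame(
  schoolnum = factor(school_id),
  year08    = rep(0:2, times = n_schools)  # assumed coding of the time variable
)
b0 <- rnorm(n_schools, mean = 650, sd = 6)    # school-specific intercepts
b1 <- rnorm(n_schools, mean = 1.5, sd = 1.5)  # school-specific slopes
sim_long$MathAvgScore <- b0[school_id] + b1[school_id] * sim_long$year08 +
  rnorm(nrow(sim_long), sd = 2)

ctrl <- lmeControl(opt = "optim")

# Model with random intercepts only (no time variable)
model0 <- lme(MathAvgScore ~ 1, random = ~ 1 | schoolnum,
              data = sim_long, control = ctrl)

# Model with year08, random intercepts, and random slopes
model1 <- lme(MathAvgScore ~ year08, random = ~ year08 | schoolnum,
              data = sim_long, control = ctrl)

# Proportional reduction in the Level 1 (residual) variance
sigma2_0 <- model0$sigma^2  # $sigma is the estimated residual SD
sigma2_1 <- model1$sigma^2
pct_explained <- 100 * (sigma2_0 - sigma2_1) / sigma2_0
pct_explained
```

The same comparison of variance components before and after adding a predictor is what later parts mean by "how much variation has been explained."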
(i) Produce a graph of the
math scores vs. year, separated by charter/non-charter schools. [R: boxplot(MathAvgScore ~ year08*charter, data = chart_long)] What do you learn?
(j) Include charter
as a Level 2 variable (remember that means you include it as a fixed effect and
its interaction with year08).
Summarize the charter effect on the intercepts and the charter effect on the
slopes. Is either statistically significant?
(Be very clear how you are deciding.) How much school-to-school
variation in the intercepts has been explained by the charter school
variable? What about the slopes?
(k) Write out the overall equations for
non-charter schools and for charter schools.
(l) Provide a detailed interpretation of each of
the estimated parameters in your model.
3) Continuing the previous problem, return to the
model with time as a Level 1 variable and charter as a Level 2 variable (which
I’m calling “model2” below).
(a) Graph the Level 1 conditional residuals vs.
the fitted values. Do you see any
problems?
plot(resid(model2)~ fitted.values(model2))
(b) Graph the Level 1 conditional residuals vs.
the Level 1 variable (year08). Does the
linearity assumption seem reasonable?
plot(resid(model2)~ chart_long$year08)
(c) What do you conclude from the normal
probability plot?
qqnorm(resid(model2))
(d) Do the Level 2 residuals appear to follow a
normal distribution? Any outliers?
interceptresids = ranef(model2)[[1]][,1]  # these are the random "effects" for the intercepts
sloperesids = ranef(model2)[[1]][,2]  # these are the random "effects" for the slopes
qqnorm(interceptresids)
qqnorm(sloperesids)
Note: We could check linearity by plotting the
Level 2 residuals vs. a quantitative Level 2 variable in the model.
(e) Is there any evidence that these residuals
are related to the percentage of students receiving
free lunch? (Use the wide format here.)
plot(interceptresids ~ chart_wide$schPctfree)
plot(sloperesids ~ chart_wide$schPctfree)
(f) Add “schPctfree” into the model (for both intercepts and
slopes). How does this impact the
charter effect? Why would that be? How
does it impact the growth per year? Does this reduce any unexplained
variability between schools? Is this a significantly better model?
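For the "significantly better model" question in (f), one standard approach is a likelihood ratio test between the two nested models. Because the models differ in their fixed effects, both should be fit by maximum likelihood rather than REML. The sketch below illustrates this with lme() on simulated data. The data frame sim_long, the object names model2_ml and model3_ml, the effect sizes, and the 0 to 1 scale used for schPctfree are all made up and may not match the real file.

```r
library(nlme)

# Simulated stand-in for the long-format data with two Level 2 variables (made-up values)
set.seed(414)
n_schools  <- 100
school_id  <- rep(seq_len(n_schools), each = 3)
charter    <- rbinom(n_schools, size = 1, prob = 0.2)
schPctfree <- runif(n_schools, min = 0, max = 1)
b0 <- 660 - 3 * charter - 15 * schPctfree + rnorm(n_schools, sd = 4)
b1 <- 1 + 0.5 * charter + rnorm(n_schools, sd = 1.5)
sim_long <- data.frame(
  schoolnum  = factor(school_id),
  year08     = rep(0:2, times = n_schools),
  charter    = charter[school_id],
  schPctfree = schPctfree[school_id]
)
sim_long$MathAvgScore <- b0[school_id] + b1[school_id] * sim_long$year08 +
  rnorm(nrow(sim_long), sd = 2)

ctrl <- lmeControl(opt = "optim")

# Charter affects intercepts and slopes
model2_ml <- lme(MathAvgScore ~ year08 * charter,
                 random = ~ year08 | schoolnum, data = sim_long,
                 method = "ML", control = ctrl)

# Add schPctfree for both intercepts and slopes
model3_ml <- lme(MathAvgScore ~ year08 * charter + year08 * schPctfree,
                 random = ~ year08 | schoolnum, data = sim_long,
                 method = "ML", control = ctrl)

# Likelihood ratio test of the two nested models
lrt <- anova(model2_ml, model3_ml)
lrt
```

Comparing the fixed-effect estimates and the variance components of the two fits side by side addresses the other parts of (f).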
4) Reconsider the math scores
for students in charter and non-charter schools. Open the data in the wide format.
(a) Find the correlation
matrix of these observations.
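A possible sketch for (a), using simulated data in place of chart.wide.txt. The data frame sim_wide and the column names MathAvgScore.0, MathAvgScore.1, and MathAvgScore.2 are hypothetical; check names() on the real data for the actual names of the three yearly score columns.

```r
# Simulated stand-in for the wide-format data: one row per school (made-up values)
set.seed(414)
n_schools <- 50
school_effect <- rnorm(n_schools, sd = 6)
sim_wide <- data.frame(
  MathAvgScore.0 = 650 + school_effect + rnorm(n_schools, sd = 3),
  MathAvgScore.1 = 651 + school_effect + rnorm(n_schools, sd = 3),
  MathAvgScore.2 = 652 + school_effect + rnorm(n_schools, sd = 3)
)
sim_wide$MathAvgScore.1[c(3, 17)] <- NA  # pretend two schools are missing a year

# Correlation matrix of the three yearly scores, using all available pairs
score_cols <- c("MathAvgScore.0", "MathAvgScore.1", "MathAvgScore.2")
cor_mat <- cor(sim_wide[, score_cols], use = "pairwise.complete.obs")
round(cor_mat, 3)
```

The use = "pairwise.complete.obs" argument matters only if some schools are missing a score in some year; without it, cor() returns NA for any pair involving a missing value.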
(b) How does this compare
to the covariance matrix found in the last model?
cov2cor(getVarCov(model3, type="marginal", individual = 1)[[1]])
This command only works for one of our
ways of running multilevel models in R.
(c) The previous line was for school 1. How does the correlation matrix change for
school 2?
cov2cor(getVarCov(model3, type="marginal", individual = 2)[[1]])
Note: You may want to work the following formulas
by hand and include pictures of your work…
(d) Write out the equation
for a three level model with one Level 1 variable.
I would start by writing out the equations for each level:
Level 1, random intercepts and random slopes
Level 2, random intercepts
Level 3, (equations for the Level 2 intercepts)
I believe you will end up with 5 “random terms” in the composite model.
(e) Show how to find the
formula for the variance of the response for an individual observation.
(f) Now find the
covariance between two observations in the same level 2 group.
(g) Now find the
covariance between two observations in different level 2 groups but the same level
3 group.
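As a warm-up for (e) through (g), it may help to see the same steps in the simpler two-level random intercept model. This is only a sketch under the usual assumptions (random effects and errors are independent of each other and have mean zero). The symbols \tau^2 and \sigma^2 are notation chosen here and may differ from the notation used in class.

```latex
\begin{align*}
Y_{ij} &= \beta_0 + u_j + \epsilon_{ij}, \qquad u_j \sim N(0, \tau^2), \quad \epsilon_{ij} \sim N(0, \sigma^2) \\
\operatorname{Var}(Y_{ij}) &= \operatorname{Var}(u_j) + \operatorname{Var}(\epsilon_{ij}) = \tau^2 + \sigma^2 \\
\operatorname{Cov}(Y_{ij}, Y_{i'j}) &= \operatorname{Cov}(u_j + \epsilon_{ij},\; u_j + \epsilon_{i'j}) = \operatorname{Var}(u_j) = \tau^2 \quad (i \neq i') \\
\operatorname{Cov}(Y_{ij}, Y_{i'j'}) &= 0 \quad (j \neq j')
\end{align*}
```

The three-level derivations follow the same pattern: write each observation as fixed part plus random terms, then keep only the random terms the two observations share.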
This
will be put on the review 2 problem set
1)
Consider this
paragraph: The multilevel models we have considered up to this point control
for clustering, and allow us to quantify the extent of dependency
and to investigate whether the effects of level 1 variables vary across
these clusters.
(a) I have underlined 3
components. Explain in detail what each of these components means in the
multilevel model.
(b) The multilevel model
in the paragraph does not account for “contextual effects.” What is meant by
that?
(c) Give a short rule in your own words describing when an interpretation of an estimated coefficient should “hold constant” another covariate or “set to 0” that covariate.
2) The article you read for HW 5 had the following: “application of multilevel models for clustered data has attractive features: (a) the correction of underestimation of standard errors, (b) the examination of the cross-level interaction, (c) the elimination of concerns about aggregation bias, and (d) the estimation of the variability of coefficients at the cluster level.”
Explain each of these
components to a non-statistician.