Stat 414 – HW 6

Due Friday, Nov 8, 2pm

 

0) Due Thursday 8am

In class on Thursday we are going to have a guest speaker: Sam Ventura. Dr. Ventura will be presenting on how hierarchical or multilevel models can play an important role in player evaluation in team sports. He will look at models for wins above replacement and offensive and defensive player ratings in football and hockey. He will then apply these ideas to the NFL and the NFL draft, including providing a definitive answer to the question of whether Joe Flacco is elite.

Dr. Ventura received his Ph.D. in Statistics from Carnegie Mellon in 2015 (and a BS in computational finance and statistics). He is currently Director of Hockey Research for the Pittsburgh Penguins and an affiliated faculty member at Carnegie Mellon University’s Department of Statistics & Data Sciences. He is also associate editor for the Journal of Quantitative Analysis in Sports. His academic research focuses on clustering, prediction, record linkage, synthetic data, infectious diseases, and sports (particularly hockey and football).

By Thursday morning, preferably sooner, please email at least one question you would like to ask Dr. Ventura during his “visit” (Zoom).  Please also try to arrive on time and well-rested Thursday; I had to call in a lot of favors for this one :)

 

1) Read the paper “Song discrimination by nestling collared flycatchers during early development” by McFarlane et al. (Biology Letters, 2016) (http://rsbl.royalsocietypublishing.org/content/12/7/20160234#F2)

Note that there is supplemental material for the paper that contains some additional details on the model used.

(a)   Describe the response variable being considered in Figure 2.

(b)   They use a mixed model that contains a random effect. What is the random effect as they describe it in the paper and why are they accounting for it?

(c)   They did not clearly specify this, but they used a random intercept model and the estimated variance for the random effect is 0.0009031 and the estimated residual (or error) variance is 0.0083427. Calculate and interpret the intra-class correlation for two different observations taken within the same level of the random effect.
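For part (c), the intra-class correlation in a random intercept model is the share of the total variance that belongs to the random effect. A minimal R sketch of that arithmetic, using the two variance estimates quoted above (the variable names are my own, and the interpretation is still left to you):

```r
# Variance components quoted in part (c)
var_random   <- 0.0009031  # estimated variance of the random intercepts
var_residual <- 0.0083427  # estimated residual (error) variance

# ICC = random-effect variance / (random-effect variance + residual variance)
icc <- var_random / (var_random + var_residual)
print(round(icc, 4))
```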

(d)   In the first model that they report results from in Section 3, they are ignoring song type or species and are just using age as a fixed effect. Interpret the estimated coefficients and test results that they provide (e.g., the day 7, 9, and 12 test results vs. embryo).

(e)   They did not report the day 4 results except in the supplement because “they responded similarly.” Do you agree? Explain.

This is an example of the problems with selective reporting of results, and it is not good science: if you ran a test, it should be reported and discussed.

 

2) hw6problem1.Rmd

Data were collected by the Minnesota Department of Education for all Minnesota schools during the years 2008-2010 to compare charter and non-charter schools.  School performance is measured by the mean score on the math portion of the Minnesota Comprehensive Assessment (MCA-II) for the 6th grade students enrolled in each of 618 different Minnesota schools during the years 2008, 2009, and 2010.  (MCA test scores for sixth graders are scaled to fall between 600 and 700, where scores above 650 for individual students indicate “meeting standards.” Thus, schools with averages below 650 will often have increased incentive to improve their scores the following year.)

(a) Identify Level 1 and Level 2.  Are the variables listed below Level 1 or Level 2 variables?

·       percentage of students with free and reduced lunch

·       percentage of students with special education needs

·       percentage of students who are non-white

·       charter or public non-charter school

·       urban or rural

Note: the Level 2 variables are the 2010 values (why is this ok to do?).

(b) Next we want to explore how MCA math test scores relate to these variables.  This can be done using the data values for all three years or by averaging the data values for the three years into one number.  Give a brief pro/con of these approaches.

(c) For the second approach, open the “wide format” of the data (chart.wide.txt, which includes three columns for the three time points for each school) and use the SchoolAvg variable as the response.  Examine the association of this variable with each of the variables listed in (a). Which variables seem most useful in predicting the average math score?

(d) Now open the “long format” of the data (chart.long.txt). Create two visual representations of math scores vs. time for the first 20 schools:

·       separate graphs for each school

·       connecting lines or smoothers for each school overlaid on same graph (i.e., “spaghetti plot”)

Explain what year08 represents.
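A spaghetti plot for part (d) can be sketched in base R as below. This is only an illustration on simulated data: the data frame chart_sim is made up, and coding year08 as 0, 1, 2 is an assumption you should check against the real file.

```r
# Simulated stand-in for the first 20 schools of chart.long.txt.
# Column names mirror those used in this assignment; the values are made up,
# and year08 = 0, 1, 2 is an assumed coding.
set.seed(414)
n_schools <- 20
chart_sim <- data.frame(
  schoolnum = rep(1:n_schools, each = 3),
  year08    = rep(0:2, times = n_schools)
)
school_intercept <- rnorm(n_schools, mean = 650, sd = 6)
school_slope     <- rnorm(n_schools, mean = 1, sd = 1)
chart_sim$MathAvgScore <- school_intercept[chart_sim$schoolnum] +
  school_slope[chart_sim$schoolnum] * chart_sim$year08 +
  rnorm(nrow(chart_sim), sd = 2)

# Spaghetti plot: draw an empty frame first, then add one line per school
plot(MathAvgScore ~ year08, data = chart_sim, type = "n",
     xaxt = "n", xlab = "year08", ylab = "MathAvgScore")
axis(1, at = 0:2)
for (s in unique(chart_sim$schoolnum)) {
  one_school <- chart_sim[chart_sim$schoolnum == s, ]
  lines(one_school$year08, one_school$MathAvgScore, col = "gray40")
}
```

For the separate-graphs version, one option is a conditioning plot such as xyplot() from the lattice package with `| factor(schoolnum)` in the formula.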

(e) Do some schools have higher intercepts? What does this mean in context?

(f) Do some schools have higher slopes? What does this mean in context?

(g) Separate the first graph by charter (charter = 1) and non-charter (charter = 0) schools. Does one group tend to have higher scores?  Does one group tend to have more variability?

(h) Fit a multilevel model with year08, random intercepts, and random slopes. (Be sure to use schoolnum, which is unique, not school name.) Describe what this model is doing. What percentage of within-school variation is explained by the linear increase over time?
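One way to approach part (h), sketched on simulated data and assuming the lme4 package is installed. The object names (sim, model0, model1) are hypothetical. Comparing the residual variance to a random-intercepts-only model is one common way to measure "variation explained", and it may not be the exact comparison intended in class.

```r
library(lme4)

# Simulated stand-in for the long-format data (values are made up)
set.seed(414)
n_schools <- 100
sim <- data.frame(
  schoolnum = rep(1:n_schools, each = 3),
  year08    = rep(0:2, times = n_schools)
)
u0 <- rnorm(n_schools, sd = 6)  # school-specific intercept deviations
u1 <- rnorm(n_schools, sd = 1)  # school-specific slope deviations
sim$MathAvgScore <- 650 + u0[sim$schoolnum] +
  (1 + u1[sim$schoolnum]) * sim$year08 + rnorm(nrow(sim), sd = 2)

# Baseline: random intercepts only, no time variable
model0 <- lmer(MathAvgScore ~ 1 + (1 | schoolnum), data = sim)
# Growth model: fixed year08 effect plus random intercepts and random slopes
model1 <- lmer(MathAvgScore ~ year08 + (1 + year08 | schoolnum), data = sim)

# Proportional reduction in the Level 1 (residual) variance
sigma2_null   <- sigma(model0)^2
sigma2_growth <- sigma(model1)^2
pct_explained <- 100 * (sigma2_null - sigma2_growth) / sigma2_null
print(pct_explained)
```

With only three time points per school, lmer may warn about a singular or nearly singular fit on some data sets; that is a message to read, not an error.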

(i) Produce a graph of the Math scores vs. year, separated by the charter/non-charter schools. [R: boxplot(MathAvgScore ~ year08*charter)] What do you learn?

(j) Include charter as a Level 2 variable (remember that means you include it as a fixed effect and its interaction with year08). Summarize the charter effect on the intercepts and the charter effect on the slopes. Is either statistically significant?  (Be very clear how you are deciding.) How much school to school variation in the intercepts has been explained by the charter school variable?  What about the slopes?
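As a notation sketch for part (j), here is how "a fixed effect and its interaction with year08" follows from putting charter in both Level 2 equations. The symbols (i for school, j for time point, the gammas and u's) are my own choice and may differ from the course notes.

```latex
\[
\begin{aligned}
\text{Level 1:}\quad & \text{MathAvgScore}_{ij} = \beta_{0i} + \beta_{1i}\,\text{year08}_{ij} + \varepsilon_{ij} \\
\text{Level 2:}\quad & \beta_{0i} = \gamma_{00} + \gamma_{01}\,\text{charter}_{i} + u_{0i} \\
                     & \beta_{1i} = \gamma_{10} + \gamma_{11}\,\text{charter}_{i} + u_{1i} \\
\text{Composite:}\quad & \text{MathAvgScore}_{ij} = \gamma_{00} + \gamma_{01}\,\text{charter}_{i} + \gamma_{10}\,\text{year08}_{ij}
  + \gamma_{11}\,\text{charter}_{i}\,\text{year08}_{ij} \\
                     & \qquad\qquad\qquad\quad + u_{0i} + u_{1i}\,\text{year08}_{ij} + \varepsilon_{ij}
\end{aligned}
\]
```

Substituting the two Level 2 equations into Level 1 produces the charter main effect (from the intercept equation) and the charter-by-year08 interaction (from the slope equation).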

(k) Write out the overall equations for non-charter schools and for charter schools. 

(l) Provide detailed interpretation of each of the estimated parameters in your model.

 

3) hw6problem3.Rmd

Continuing the previous problem. Return to the model with time as a Level 1 variable and charter as a Level 2 variable (which I’m calling “model2” below).

(a) Graph the Level 1 conditional residuals vs. the fitted values.  Do you see any problems?

plot(resid(model2)~ fitted.values(model2))

(b) Graph the Level 1 conditional residuals vs. the Level 1 variable (year08).  Does the linearity assumption seem reasonable?

plot(resid(model2)~ chart_long$year08)

(c) What do you conclude from the normal probability plot?

qqnorm(resid(model2))

(d) Do the Level 2 residuals appear to follow a normal distribution?  Any outliers?

interceptresids = ranef(model2)[[1]][,1]  #these are the random "effects" for the intercepts

sloperesids = ranef(model2)[[1]][,2] #these are the random "effects" for the slopes

qqnorm(interceptresids)

qqnorm(sloperesids)

Note: We could check linearity by plotting the Level 2 residuals vs. a quantitative Level 2 variable in the model.

(e) Is there any evidence that these residuals are related to the percentage of students receiving free lunch? (Use the wide format here.)

plot(interceptresids~ chart_wide$schPctfree)

plot(sloperesids~ chart_wide$schPctfree)

(f) Add “schPctfree” into the model (for both intercepts and slopes).  How does this impact the charter effect? Why would that be?  How does it impact the growth per year? Does this reduce any unexplained variability between schools? Is this a significantly better model?

 

4) hw6problem4.Rmd

Reconsider the math scores for students in charter and non-charter schools.  Open the data in the wide format.

(a) Find the correlation matrix of these observations.
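A minimal sketch of part (a) on simulated data. The three score column names below are hypothetical placeholders for whatever the wide file actually uses, and the pairwise-complete option is included only in case some schools are missing a year.

```r
# Simulated stand-in for chart.wide.txt; column names are hypothetical
# and the values are made up.
set.seed(414)
n_schools <- 50
school_effect <- rnorm(n_schools, sd = 6)
wide_sim <- data.frame(
  MathAvgScore.0 = 650 + school_effect + rnorm(n_schools, sd = 3),
  MathAvgScore.1 = 651 + school_effect + rnorm(n_schools, sd = 3),
  MathAvgScore.2 = 652 + school_effect + rnorm(n_schools, sd = 3)
)
wide_sim$MathAvgScore.1[c(3, 17)] <- NA  # mimic schools missing a year

# Correlations between the three time points, using all available pairs
score_cols <- c("MathAvgScore.0", "MathAvgScore.1", "MathAvgScore.2")
obs_cor <- cor(wide_sim[, score_cols], use = "pairwise.complete.obs")
print(round(obs_cor, 2))
```

The resulting 3 x 3 matrix is the observed counterpart to the model-implied correlation matrix asked about in part (b).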

(b) How does this compare to the covariance matrix found in the last model?

cov2cor(getVarCov(model3, type="marginal", individual = 1)[[1]])

This command only works for one of our ways of running multilevel models in R.

(c) The previous line was for school 1.  How does the correlation matrix change for school 2?

cov2cor(getVarCov(model3, type="marginal", individual = 2)[[1]])

Note: You may want to work the following formulas by hand and include pictures of your work…

(d) Write out the equation for a three level model with one Level 1 variable.

I would start by writing out the equations for each level:

Level 1, random intercepts and random slopes

Level 2, random intercepts

Level 3, (equations for the level 2 intercepts)

I believe you will end up with 5 “random terms” in the composite model.

(e) Show how to find the formula for the variance of the response for an individual observation.

(f) Now find the covariance between two observations in the same level 2 group.

(g) Now find the covariance between two observations in different level 2 groups but the same level 3 group.

 

 

This will be put on the review 2 problem set

1) Consider this paragraph: The multilevel models we have considered up to this point (i) control for clustering, and allow us to (ii) quantify the extent of dependency and to (iii) investigate whether the effects of level 1 variables vary across these clusters. 

(a) I have marked 3 components, (i) through (iii); explain in detail what each of these components means in the multilevel model.

(b) The multilevel model in the paragraph does not account for “contextual effects.” What is meant by that?

(c) Give a short rule in your own words describing when an interpretation of an estimated coefficient should “hold constant” another covariate or “set to 0” that covariate

 

2) The article you read for HW 5 had the following: “application of multilevel models for clustered data has attractive features: (a) the correction of underestimation of standard errors, (b) the examination of the cross-level interaction, (c) the elimination of concerns about aggregation bias, and (d) the estimation of the variability of coefficients at the cluster level.”

Explain each of these components to a non-statistician.