Stat 414 – Review 2 problem solutions

1) Consider this paragraph: The multilevel models we have considered up to this point control for clustering, and allow us to quantify the extent of dependency and to investigate whether the effects of level 1 variables vary across these clusters.

(a) I have underlined 3 components. Explain in detail what each of these components means in the multilevel model.

Control for clustering: Our observations fall into natural groups, and we don’t want to treat the observations within the same group as independent. By including the “clustering variable” in the model (whether we treat it as fixed or random), the other slope coefficients are “adjusted” or “controlled” for that clustering variable.

Quantify the extent of the dependency: The ICC (intraclass correlation coefficient) measures how correlated the observations in the same group are.

Whether the effects of Level 1 variables vary across the clusters: Random slopes. We let the coefficient of a Level 1 variable differ from cluster to cluster and estimate the variance of those slopes.
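For a two-level random intercepts model, the ICC can be written as a ratio of variance components. The symbols here are assumed notation rather than something fixed by the problem: \tau_0^2 is the between-cluster variance of the intercepts and \sigma^2 is the within-cluster (residual) variance.

```latex
\text{ICC} = \frac{\tau_0^2}{\tau_0^2 + \sigma^2}
```

An ICC near 0 says observations in the same cluster are about as alike as observations from different clusters; an ICC near 1 says almost all of the variation is between clusters.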

(b) The multilevel model in the paragraph does not account for “contextual effects.” What is meant by that?

Contextual effects come from including Level 2 variables, that is, variables that explain differences among the clusters. In particular, we can aggregate Level 1 variables to Level 2 (e.g., group means).

(c) Give a short rule in your own words describing when an interpretation of an estimated coefficient should “hold constant” another covariate or “set to 0” that covariate.

We should hold variable 2 constant when we interpret the coefficient of variable 1. If the interaction of these two variables is also in the model, the coefficient of variable 1 (its “main effect”) is the effect of variable 1 only when variable 2 is set to 0.
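A short worked equation may make the rule concrete. The variables x_1 and x_2 and the coefficients are generic placeholders, not a model fit to any data in this review.

```latex
\hat{y} = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 + \hat\beta_3 x_1 x_2
\quad\Longrightarrow\quad
\frac{\partial \hat{y}}{\partial x_1} = \hat\beta_1 + \hat\beta_3 x_2
```

Without the interaction (\hat\beta_3 = 0), the change in \hat{y} per unit of x_1 is \hat\beta_1 at any fixed value of x_2, so we say “holding x_2 constant.” With the interaction, the change depends on x_2, and it equals \hat\beta_1 only at x_2 = 0.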

2) The article you read for HW 5 had the following: “application of multilevel models for clustered data has attractive features: (a) the correction of underestimation of standard errors, (b) the examination of the cross-level interaction, (c) the elimination of concerns about aggregation bias, and (d) the estimation of the variability of coefficients at the cluster level.”

Explain each of these components to a non-statistician.

(a) Assuming independence lets us think we have a larger sample size (more information) than we really do, which underestimates the standard errors. The multilevel model corrects for this.

(b) We can include interaction terms between Level 1 and Level 2 variables, that is, ask whether the effect of an individual-level variable depends on a characteristic of the cluster.

(c) We still get to analyze the data at Level 1, rather than aggregating the variables to Level 2, where the relationship could be different from the Level 1 relationship.

(d) We are able to estimate how much the intercepts and the slopes vary from cluster to cluster at Level 2.
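Points (a) and (c) can be illustrated with a small sketch. Everything numeric here is made up for illustration: the cluster sizes, the ICC of 0.30, and the toy data are assumptions, and the design-effect formula 1 + (m - 1) x ICC is the standard approximation for estimating a mean with equal-sized clusters, not something taken from the article.

```python
def design_effect(cluster_size, icc):
    """Kish design effect for equal-sized clusters: 1 + (m - 1) * ICC."""
    return 1.0 + (cluster_size - 1.0) * icc


def ols_slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


# (a) Hypothetical numbers: 200 observations in clusters of 10, ICC = 0.30.
n_total, cluster_size, icc = 200, 10, 0.30
deff = design_effect(cluster_size, icc)
n_effective = n_total / deff  # how many independent observations this is "worth"

# (c) Made-up toy data: within every cluster y falls as x rises,
# but clusters with larger average x also have larger average y.
clusters = {
    "A": ([9, 10, 11], [11, 10, 9]),
    "B": ([19, 20, 21], [21, 20, 19]),
    "C": ([29, 30, 31], [31, 30, 29]),
}
mean_x, mean_y, x_centered, y_centered = [], [], [], []
for x, y in clusters.values():
    mx, my = sum(x) / len(x), sum(y) / len(y)
    mean_x.append(mx)
    mean_y.append(my)
    # Subtract the cluster mean to isolate the within-cluster relationship
    x_centered += [v - mx for v in x]
    y_centered += [v - my for v in y]

between_slope = ols_slope(mean_x, mean_y)         # slope using cluster means only
within_slope = ols_slope(x_centered, y_centered)  # slope inside clusters

print(f"design effect = {deff:.2f}, effective n = {n_effective:.1f}")
print(f"between-cluster slope = {between_slope:.1f}, within-cluster slope = {within_slope:.1f}")
```

In the toy data the aggregated (cluster-mean) slope and the within-cluster slope have opposite signs, which is the kind of aggregation bias a Level 1 analysis avoids.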

3) (a) Identify the three levels in this study.

Level 1 = time point

Level 2 = plant

Level 3 = pot

The treatments were applied to the pots, so they are Level 3 variables. Time is a Level 1 variable.

(b) Examine spaghetti plots of the plant heights across the measurements for each of the species (coneflower and leadplant). Is it reasonable to assume linear growth between Day 13 and Day 28? Does the initial height and/or rate of growth seem to differ between the species? Is there more variability in one species than the other?

Linear growth seems a reasonable approximation. The initial heights at day 13 and the overall slope from day 13 to day 28 look similar between the two species, but there is much more plant-to-plant variation for the coneflowers. There is maybe some evidence that coneflowers grow faster over days 13-18 and then slow down, with leadplants catching up at the end.

(c) Examine spaghetti plots of the plant heights over time separately for the three types of soil, separately for each species. What do you learn?

The rate of growth looks faster for Cult and Stp soils, especially for the leadplants. Again more variation with the coneflowers.

(d) Examine spaghetti plots of the plant heights over time separately for the two levels of sterilization, separately for each species. What do you learn?

The rate of growth is faster for the sterilized plants. Among the non-sterilized plants, the coneflowers did better; there is little growth for the non-sterilized leadplants. Here I also see higher initial heights (day 13) for the coneflowers.

Focusing on just the leadplants

(e) I next calculated time13 = time - 13. Give two reasons this could be a good idea.

If we were to include any interactions or quadratic terms, this would guard against multicollinearity. It also makes the intercept of the model correspond to the first time point.
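The multicollinearity point can be checked with a quick sketch. The measurement days used below (13, 18, 23, 28) are an assumption consistent with four measurements between day 13 and day 28; the helper `pearson_r` is written here for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


days = [13, 18, 23, 28]                 # ASSUMED measurement days
time13 = [d - 13 for d in days]         # shifted so day 13 becomes 0
mean_day = sum(days) / len(days)
time_c = [d - mean_day for d in days]   # mean-centered, for comparison

# Correlation between the linear term and its square under each coding
r_raw = pearson_r(days, [d ** 2 for d in days])
r_shifted = pearson_r(time13, [t ** 2 for t in time13])
r_centered = pearson_r(time_c, [t ** 2 for t in time_c])

print(f"raw: {r_raw:.3f}, time13: {r_shifted:.3f}, mean-centered: {r_centered:.3f}")
```

Under these assumed days, shifting to time13 lowers the correlation between the linear and quadratic terms somewhat, while mean-centering would remove it entirely. The main practical gain from time13 is the interpretable intercept.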

(f) Then I fit an “unconditional means” or “random intercepts” model with no predictors.

How many parameters are estimated? Provide an interpretation of each, including the variance components. Anything interesting about the relative size of the variance components?

We have estimated 4 parameters.

The intercept (2.388 mm) estimates the overall average height of the leadplants across the four measurements (average plant, average pot). With no time variable in the model yet, this is not the day 13 height.

The plant variance component (0.28) is a measure of the plant-to-plant variation in average heights within the same pot.

The pot variance (0.05) is a measure of the pot-to-pot variation in average heights.

The residual variance (0.73) is the variability of the plant heights across the 4 time measurements within the same plant.

Total variance = .278 + .0487 + .7278 = 1.05

69% of the total variation is due to differences over time within each plant, 26.4% is due to variability among the plants in the same pot, and only 4.6% is pot-to-pot variation.
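These percentages can be reproduced from the three reported variance estimates. This is only a check of the arithmetic; the dictionary keys are labels chosen here, and the two correlations at the end use the usual three-level definitions, which I am assuming apply to this model.

```python
# Variance estimates reported for the unconditional means model
components = {
    "plant": 0.278,      # plant-to-plant, within pot (Level 2)
    "pot": 0.0487,       # pot-to-pot (Level 3)
    "residual": 0.7278,  # over time, within plant (Level 1)
}

total = sum(components.values())
shares = {name: value / total for name, value in components.items()}

for name, share in shares.items():
    print(f"{name:>8}: {100 * share:.1f}% of total variance")

# Assumed standard definitions for a three-level random intercepts model:
# correlation between two measurements on the same plant ...
corr_same_plant = (components["plant"] + components["pot"]) / total
# ... and between measurements on different plants in the same pot
corr_same_pot = components["pot"] / total
print(f"same plant: {corr_same_plant:.3f}, same pot: {corr_same_pot:.3f}")
```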

(g) Next I included the new time variable in the model assuming linear growth.

Explain what (time13|pot/plant) means to the model. Write out the theoretical level equations (in terms of β’s). How many variance/covariance parameters are there/why?

How much of the within-plant variability is explained by the linear changes over time?

Interpret the fixed effects. Are either of the fixed effects statistically significant?

The (time13 | pot/plant) component indicates that plants are nested within pots and that we are allowing the growth rate (slope of time) to vary across the plants and across the pots (as well as the intercepts = measurement on day 13).

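Let ijk refer to the i^th measurement of the j^th plant in the k^th pot.

Level 1 equation: $Y_{ijk} = \beta_{0jk} + \beta_{1jk}\,\text{time13}_{ijk} + \epsilon_{ijk}$

Level 2 equations: $\beta_{0jk} = \beta_{00k} + u_{0jk}$ and $\beta_{1jk} = \beta_{10k} + u_{1jk}$

Level 3 equations: $\beta_{00k} = \beta_{000} + v_{00k}$ and $\beta_{10k} = \beta_{100} + v_{10k}$

The composite equation can then be written as

$Y_{ijk} = \beta_{000} + \beta_{100}\,\text{time13}_{ijk} + v_{00k} + u_{0jk} + (v_{10k} + u_{1jk})\,\text{time13}_{ijk} + \epsilon_{ijk}$

where we have random variation in the intercepts at both Levels 2 and 3 and random variation in the slopes at both Levels 2 and 3.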

There are 5 variance parameters (within plant variation, between plant variation in intercepts, between plant variation in slopes, between pot variation in intercepts, between pot variation in slopes). There are 2 covariance parameters: intercepts and slopes at level 2, intercepts and slopes at level 3. There are also 2 fixed parameters for 9 parameters total.
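The parameter count follows a simple bookkeeping rule: q correlated random effects at a level contribute q(q + 1)/2 variance/covariance parameters, plus one residual variance, plus the fixed effects. A minimal sketch, with function names made up here for illustration:

```python
def n_cov_params(q):
    """Variances plus covariances for q correlated random effects: q(q+1)/2."""
    return q * (q + 1) // 2


def count_parameters(n_fixed, random_effects_per_level):
    """Fixed effects + random-effect (co)variances at each level + residual variance."""
    n_random = sum(n_cov_params(q) for q in random_effects_per_level)
    return n_fixed + n_random + 1


# Model with time13 only: intercept and slope are random at both
# the plant level and the pot level, with 2 fixed effects.
total_g = count_parameters(n_fixed=2, random_effects_per_level=[2, 2])
print(total_g)  # 2 fixed + 3 (plant) + 3 (pot) + 1 residual
```

The same rule gives 8 + 7 = 15 once sterilization, soil type, and their interactions with time are added as fixed effects, and 3 + 1 + 1 = 5 variance/covariance parameters when the pot-level random slope is dropped.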

The within plant variability decreased from 0.7278 to 0.0822, an 89% decrease! In other words 89% of the within plant variation in heights can be explained by the linear growth over time.
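The 89% figure is a proportional reduction in the Level 1 variance (a "pseudo R-squared"). A quick check of the arithmetic, with a helper name chosen here for illustration:

```python
def proportion_reduction(var_before, var_after):
    """Share of a variance component explained after adding predictors."""
    return (var_before - var_after) / var_before


# Residual (within-plant) variance: unconditional means model vs. linear growth model
pseudo_r2_level1 = proportion_reduction(var_before=0.7278, var_after=0.0822)
print(f"{100 * pseudo_r2_level1:.1f}% of within-plant variance explained")
```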

The estimated average leadplant height on day 13 is 1.54 mm, with an average increase of 0.112 mm per day for an average plant in an average pot.

Both the intercept (initial plant height) and the slope are statistically significant with t-ratios of 22 and 14 respectively.
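The two fixed effects define the fitted growth line for an average plant in an average pot. A minimal sketch using the reported estimates; the function name is made up, and extending the line outside days 13 to 28 would be extrapolation.

```python
def fitted_height(day, intercept=1.54, slope=0.112):
    """Fitted mean leadplant height (mm) on a given day, using time13 = day - 13."""
    return intercept + slope * (day - 13)


for day in (13, 18, 23, 28):
    print(f"day {day}: {fitted_height(day):.2f} mm")
```

The fitted height at the midpoint of the study (time13 = 7.5) is about 2.38 mm, close to the 2.388 intercept of the unconditional means model, as it should be if that intercept is an overall average.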

(h) Next I added the sterilization and soil type variables, including interactions with the time variable.

Why did I include interactions with the time variable? Is this model a significant improvement from the model in (g)?

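The interactions with time are what put the pot-level (Level 3) variables into the equation for the slopes. For example,

$\beta_{10k} = \beta_{100} + \beta_{101}\,\text{sterile}_k + \dots + v_{10k}$

corresponds to having a sterile x time term in the model.

Level 1 equation: $Y_{ijk} = \beta_{0jk} + \beta_{1jk}\,\text{time13}_{ijk} + \epsilon_{ijk}$

Level 2 equations: $\beta_{0jk} = \beta_{00k} + u_{0jk}$ and $\beta_{1jk} = \beta_{10k} + u_{1jk}$

Level 3 equations:

$\beta_{00k} = \beta_{000} + \beta_{001}\,\text{sterile}_k + \beta_{002}\,\text{soilREM}_k + \beta_{003}\,\text{soilSTP}_k + v_{00k}$

and

$\beta_{10k} = \beta_{100} + \beta_{101}\,\text{sterile}_k + \beta_{102}\,\text{soilREM}_k + \beta_{103}\,\text{soilSTP}_k + v_{10k}$

This model estimates 15 parameters (8 fixed effects and 7 variance/covariance parameters).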

(i) But this model is running into some boundary conditions. One option is to simplify the model, e.g., removing some variance components. Write out the model equations for a new model so that the intercepts have random components at Levels 2 and 3 but the slopes are only allowed to vary at Level 2. What is the practical interpretation of this modelling choice? How many parameters does this remove from the model?

[If you check, this model should be more stable, and not significantly worse.]

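Level 1 equation: $Y_{ijk} = \beta_{0jk} + \beta_{1jk}\,\text{time13}_{ijk} + \epsilon_{ijk}$

Level 2 equations: $\beta_{0jk} = \beta_{00k} + u_{0jk}$ and $\beta_{1jk} = \beta_{10k} + u_{1jk}$

Level 3 equations:

$\beta_{00k} = \beta_{000} + \beta_{001}\,\text{sterile}_k + \beta_{002}\,\text{soilREM}_k + \beta_{003}\,\text{soilSTP}_k + v_{00k}$

and

$\beta_{10k} = \beta_{100} + \beta_{101}\,\text{sterile}_k + \beta_{102}\,\text{soilREM}_k + \beta_{103}\,\text{soilSTP}_k$

Composite equation:

$Y_{ijk} = \beta_{000} + \beta_{001}\,\text{sterile}_k + \beta_{002}\,\text{soilREM}_k + \beta_{003}\,\text{soilSTP}_k + (\beta_{100} + \beta_{101}\,\text{sterile}_k + \beta_{102}\,\text{soilREM}_k + \beta_{103}\,\text{soilSTP}_k)\,\text{time13}_{ijk} + v_{00k} + u_{0jk} + u_{1jk}\,\text{time13}_{ijk} + \epsilon_{ijk}$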

This model allows the growth rates to differ from plant to plant but not from pot to pot (after accounting for soil type and sterilization).

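This doesn’t change the fixed effects, but now there are only 5 variance/covariance parameters to estimate (2 fewer: the pot-level slope variance and its covariance with the pot-level intercepts).

To run this model in R, you would use something like (time13 | plant) + (1 | pot): intercepts and slopes vary across plants, but only intercepts vary across pots. (This assumes the plant IDs are unique across pots; otherwise use (time13 | pot:plant) for the plant-level term.)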

(j) Next we could consider adding an interaction between sterilization and soil type to the model, along with the three-way interaction between sterilization, soil type, and time.

How many parameters does this add? Interpret the nature of the three-way interactions. Explain what type of visual would help you assess the evidence of such an interaction.

This adds 4 parameters to the model. We could look at a graph of the plant heights over time vs. soil type with separate panels for sterilized or not. This would allow us to assess whether the change in growth rate across soil types is different depending on whether or not the plants have been sterilized. In other words, do the differences in growth rates among the soil types differ between the sterilized and non-sterilized plants?

In the graph below, we see that the growth rates (slopes of the line) appear similar across the three soil types for the non-sterilized plants. For the sterilized plants, the growth rate looks a bit smaller for the REM plants compared to CULT and STP. This three-way interaction ends up not being statistically significant.

(k) How would you change the previous model so that neither sterilization nor soil type (nor their interaction) is allowed to influence Day 13 measurements? Why might this be a reasonable consideration?

We would drop sterile, soilREM, and soilSTP and the 2 sterile x soil interaction terms from the model. The first Level 3 equation would become $\beta_{00k} = \beta_{000} + v_{00k}$, but the second one would not change. In other words, we would be forcing the (overall) fitted model to start in the same spot in the 6 graphs above.

We saw in our exploratory data analysis that there didn’t seem to be much overall difference in the initial heights across these treatments (maybe it takes two weeks for the effects of these treatments to kick in). This is also supported by the small t-values for these 5 terms (remember if your focus is on model building, you might want to create test and validation datasets…)

(l) Return to the model in (j). Interpret it! (A brief summary of the important features, especially as they agree/disagree with your exploratory data analysis. What would the “effects plots” look like? What seems to maximize growth?!)

Focusing on the significant treatments: Sterilizing the plants does not appear to have a significant effect on day 13 height (t = -.551) but does appear to improve their growth rate (t = 5.966), after adjusting for soil type. Soil type doesn’t appear to make much of a difference overall, although restored (STP) soil appears to increase growth rate over cultivated soil (t = 2.546). We don’t really have to talk about holding sterilization fixed or picking a category: this is a factorial design (so no confounding) with no significant 3-way interaction (so the effect doesn’t change).

From the later model, there is also weak evidence of an interaction, with sterilized remnant soil having smaller growth rates (i.e., the benefit of sterilization is somewhat muted in the remnant soil). So to maximize growth, use sterilized restored (STP) soil.