Stat 414 – Review 1 Problem Solutions

 

1) Data was collected on corn grown on the island of Antigua (corn.txt). The response variable is the harvest weight (harvwt) per plot (unknown units). There are eight sites with eight separate plots within each site where the corn is grown under the same treatment condition.  One question of interest is whether site has an effect on harvest weight.

(a) Identify the response variable and the Level 1 units and Level 2 units.

 

Response variable = harvest weight

Level 1 units = plots

Level 2 units = sites

 

(b) Produce a graph of the harvest weights for the eight sites.  Does there appear to be substantial site to site variation?

 

 

 

Yes, we do see substantial site to site variation, for example the ORAN site and NSAN site indicate very different havest weight values.

 

(c) Explain how we could use a traditional one-way ANOVA to model these data: identify the parameters, identify the error. Write out the statistical model (with proper indices, assumptions of error distribution, parameters). 

 

Yij = harvest weight of ith plot in site j

Yij =  + j + eij  j = 1, …, 8 sites, eij ~ N(0, 2)

where  is the overall mean harvest weight in the population, j is the fixed effect of the jth site, and 2 is the within site variation (or the plot to plot variation, which we are assuming is the same for each site). For one-way ANOVA the j are considered fixed effects; so these are the “effects” (not the group means, think effect coding).

 

(d) State null and alternative hypothesis and report a test statistic and p-value for assessing the site-to-site variation. Also produce a table of the estimated coefficients for the sites.

 

H0: 1 = …. = 7 = 0

Ha: at least one j ≠ 0

(So we only need 7 of them and we compare the set to zero, not just each other)

F = 17.4, df = 7 and 24, p-value < .0001.  We have strong evidence that the population mean weights are these sites are not equal.

 

> summary(lm(harvwt~site, contrasts=list(site="contr.treatment")))

This gives indicator parameterization

 

Coefficients:

            Estimate Std. Error t value Pr(>|t|)   

(Intercept)   4.8850     0.3800  12.856 2.96e-12 ***

siteLFAN     -0.6775     0.5374  -1.261 0.219513   

siteNSAN     -2.7950     0.5374  -5.201 2.50e-05 ***

siteORAN      2.0300     0.5374   3.778 0.000922 ***

siteOVAN     -0.0525     0.5374  -0.098 0.922984   

siteTEAN     -1.8487     0.5374  -3.440 0.002135 **

siteWEAN      0.6412     0.5374   1.193 0.244412   

siteWLAN     -2.0438     0.5374  -3.803 0.000865 ***

 

> summary(lm(harvwt~site, contrasts=list(site="contr.sum")))

This gives effect parameterization

 

Coefficients:

            Estimate Std. Error t value Pr(>|t|)   

(Intercept)  4.29172    0.13434  31.946  < 2e-16 ***

site1        0.59328    0.35544   1.669  0.10808   

site2       -0.08422    0.35544  -0.237  0.81471   

site3       -2.20172    0.35544  -6.194 2.12e-06 ***

site4        2.62328    0.35544   7.380 1.28e-07 ***

site5        0.54078    0.35544   1.521  0.14121   

site6       -1.25547    0.35544  -3.532  0.00170 **

site7        1.23453    0.35544   3.473  0.00197 **

 

 

By the way, there is a nice “display()” function in the arm package.

 

But these are more about effects or group differences rather than estimates for the sites.

> sapply(split(fitted(lm(harvwt~site)), site),mean)

   DBAN    LFAN    NSAN    ORAN    OVAN    TEAN    WEAN    WLAN

4.88500 4.20750 2.09000 6.91500 4.83250 3.03625 5.52625 2.84125

 

So these do match the group means.

 

 

Now suppose we want to use a multilevel model. The main advantages are the ability to have plot level covariates, as well as site level covariates and an estimate of the error associated with the site level.

(e) Write out a statistical model for such a multilevel model, defining the parameters, with proper indices, assumptions of error distributions. 

 

Yij = harvest weight of ith plot in site j

Yij =  + j + eij  j = 1, …, 8 sites, eij ~ N(0, 2) and j ~ N(0, 2)

where  is the overall mean harvest weight in the population,  is the site to site variability, and 2 is the within site variation (or the plot to plot variation, which we are assuming is the same for each site).

 

(f) State null and alternative hypothesis and report a test statistic and p-value for assessing the site-to-site variation.

 

H0:  = 0

Ha:  > 0

 

> model1=lme(fixed=harvwt~ 1, random = ~1 | site)

> model2=gls(harvwt~1)

> anova(model1, model2)

       Model df      AIC      BIC    logLik   Test  L.Ratio p-value

model1     1  3 100.4163 104.7183 -47.20816                       

model2     2  2 126.4154 129.2834 -61.20772 1 vs 2 27.99912  <.0001

 

(g) Also produce a table of the estimated coefficients for the sites.  How does these compare to the estimates from (d)? Why?

 

 

DBAN is first site, 4.291719 + .55918 = 4.851

> model1$fitted

      fixed     site

1  4.291719 4.850901

2  4.291719 4.212340

3  4.291719 2.216544

4  4.291719 6.764226

5  4.291719 4.801418

6  4.291719 3.108408

7  4.291719 5.455295

8  4.291719 2.924616

9  4.291719 4.850901

10 4.291719 4.212340

11 4.291719 2.216544

12 4.291719 6.764226

13 4.291719 4.801418

14 4.291719 3.108408

15 4.291719 5.455295

16 4.291719 2.924616

17 4.291719 4.850901

18 4.291719 4.212340

19 4.291719 2.216544

20 4.291719 6.764226

21 4.291719 4.801418

22 4.291719 3.108408

23 4.291719 5.455295

24 4.291719 2.924616

25 4.291719 4.850901

26 4.291719 4.212340

27 4.291719 2.216544

28 4.291719 6.764226

29 4.291719 4.801418

30 4.291719 3.108408

31 4.291719 5.455295

32 4.291719 2.924616

 

 

> sapply(split(fitted(model1), site),mean)

    DBAN     LFAN     NSAN     ORAN     OVAN     TEAN     WEAN     WLAN

4.850901 4.212340 2.216544 6.764226 4.801418 3.108408 5.455295 2.924616

 

DBAN decreased from 0.593 to 0.559.  All of the estimated effects got closer to zero because of the “shrinkage” that occurs when we treat them as random effects. (So the estimated group means shrink towards 4.2819 – for example DBAN decreases from 4.885 to 4.851, WLAN increased from 2.841 to 2.925.

 

(h) Estimate and interpret the intraclass correlation coefficient. (How does this compare to R2 from the one-way ANOVA?)

 

 

ICC = 2.3677/2.94527 = .8039, about 80% of the variation in response is at the site level. It’s a little lower because we are now assuming more variation in the data to be explained (MSTotal for the fixed model is 2.176).

 

This is in the ball park of R2

 

(i) Examine the residuals vs. fits graphs for this model and summarize what you learn.

 

plot(model1)

 

 

There does not appear to be huge violations in terms on linearity (though not really a condition in this one-way anova type model but don’t see any concerning patterns) or equal variance.

 

Below are the standard deviations of the residuals per level. We do have some that are more than twice the size of others (e.g., OVAN and DBAN).

 

 

2) A researcher is interested in seeing how different light conditions (Long = 16 hr days, Short = 8 hr days) affects the concentration of an enzyme in golden hamsters in heart tissue. Ten hamsters are randomly placed into each of the two light treatment groups, for a total of twenty hamsters. After a treatment period of several weeks, the hamsters are sacrificed and enzyme concentration (mg/ml) is measured in four subsamples of heart tissue for each hamster. The data is in the file hamster.txt. The primary research question is to compare the effects of light on the enzyme concentration in heart tissue.

(a) Display a graph that highlights the primary research question.

 

Clearly the long day is associated with higher concentrations of the heart enzyme. There is variation in the measurements within each hamster, but that is smaller than the treatment to treatment variation.  The within group variation in the hamsters looks pretty similar.

 

ggplot(hamsters, aes(y=conc, x = day, fill = id)) +

       geom_boxplot()+

 # geom_dotplot(binaxis='y', stackdir='center', dotsize=1) +

    theme_bw()

   

 

(b) Fit and interpret a model that addresses the primary research question. Summarize your results.

 

Fitting a linear model between concentration and day with hamsters as a random effect (following a normal distribution). (Note: You could consider using a no intercept model here as there should not be any concentration on day 0.)

 

> model1= lmer(conc ~ day + (1 | id))

> summary(model1)

Linear mixed model fit by REML ['lmerMod']

Formula: conc ~ day + (1 | id)

 

REML criterion at convergence: -227.7

 

Scaled residuals:

     Min       1Q   Median       3Q      Max

-2.30009 -0.54445 -0.03795  0.54010  1.93376

 

Random effects:

 Groups   Name        Variance Std.Dev.

 id       (Intercept) 0.004216 0.06493

 Residual             0.001644 0.04055

Number of obs: 80, groups:  id, 20

 

Fixed effects:

            Estimate Std. Error t value

(Intercept)  1.63225    0.02151  75.884

dayShort    -0.25550    0.03042  -8.399

 

 

The t-statistic for dayShort is -0.255 / 0.0304 = -.12775/.01521 =  -8.4, which is large (I consider anything larger than 2 in absolute value to be large), giving us convincing evidence that, after adjusting for the hamster to hamster variation, the mean concentration increases with time.

 

With indicator parameterization, the intercept predicts the mean concentration for the Long day treatment to be 1.63225 (with standard error 0.02151) or 1.37675 for the Short day treatment for the average hamster.

 

The dayShort slope estimates that the average concentration with the Short day treatment will be 0.2555 lower than average Long day treatment concentration for the average hamster.

 

An approximate 95% confidence interval for the difference (long minus short) is -.255 + 2(.030) or (0.195, 0.315).

 

This was a randomized experiment, so we can conclude that the Long treatment causes an increase in the concentration of the heart enzyme.

 

The hamsters explain .004216/(.004216 + .001644) or 71.9% of the random variation in the concentrations, after adjusting for the treatments.

 

3)

(a) A has the largest ICC because the variation between the schools is the largest and B has the smallest ICC.

(b) The estimated effect would increase because there would be less shrinkage to the overall mean due to the larger sample size.

(c) The width of the confidence interval would decrease because it would be a more precise estimate.

 

4)

(a) 90.86

(b) 90.86 – sqrt(10.66)

(c) 90.86 + 1.4145 + 2 × sqrt(10.66)

(d) 90.86 + 3.113