Stat 414 – HW 7

Due 8am, Monday, Nov. 17

 

You don’t need to include R code I give you but should include R code you create/modify and relevant output. Please submit separate files for each question.

 

1) Toenail dermatophyte onychomycosis (TDO) is a common toenail infection where the nail plate becomes separated from the nail bed. Traditionally, antifungal compounds have been used to treat TDO. New compounds have been developed in an attempt to reduce the treatment duration. The data in toenail.txt are from a randomized, double-blind, multi-center study comparing two oral treatments (Itraconazole and Terbinafine) over a 3-month period, with measurements taken at baseline, every month during treatment, and every 3 months afterwards, with a maximum of 7 measurements per subject.  At each visit, the target nail is coded as 0 (not severe, none or mild separation) or 1 (moderate or severe = complete separation). The research questions are whether the probability of severe infection decreases over time, and whether that evolution differs between the two treatments.

#library(readr)
toenail = read.table("https://www.rossmanchance.com/stat414/data/toenail.txt", header=TRUE)

head(toenail)

contrasts(factor(toenail$ttmt)) #so itraconazole will be the reference group in the models

(a) Explain what is measured by the time variable in this dataset (what do the values in the dataset indicate?).

(b) Explore the data:

with(toenail, round(tapply(y, list(ttmt, time), mean), 2))

props = ave(toenail$y, by = list(toenail$time, toenail$ttmt))

library(ggplot2)
ggplot(data = toenail, aes(y = props, x = time, group = ttmt, col = ttmt)) +
  geom_line() +
  theme_bw()

Include the graph and briefly summarize what you learn. (Be clear what the y-axis represents in the graph.)

(c) First fit a model that treats the observations on each individual as independent:

model1 = glm(y ~ time*ttmt, family = binomial, data = toenail)

summary(model1)

How would you interpret the estimated interaction in this context? Is the interaction close to significant?

(d) Descriptively, how does the behavior of the fitted model compare to that of the raw data?

glm_fits = fitted.values(model1)
ggplot(data = toenail, aes(y = glm_fits, x = time, color = ttmt)) +
  geom_line(aes(group = ttmt)) +
  labs(title = "glm", x = "time", y = "estimated prop") +
  theme_bw() + ylim(0, .5)
 

plot(effects::allEffects(model1))

Notice the non-uniform scaling of the y-axis for the second graph! Don’t look at the second graph and conclude there is a linear association/model! In fact, we should be saying “curve” rather than “line” much more often…
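To see why “curve” is the right word, here is a minimal sketch (with made-up coefficients, not estimates from these data): a predictor that is a straight line on the log-odds scale becomes an S-shaped curve once back-transformed to the probability scale.

```r
# A sketch with made-up coefficients: linear on the log-odds scale,
# curved on the probability scale
t <- seq(0, 12, by = 0.5)
logodds <- -0.5 - 0.2 * t            # straight line in t
p <- plogis(logodds)                 # back-transformed probabilities
# plot(t, logodds, type = "l")       # a line
# plot(t, p, type = "l")             # an S-shaped curve
```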

 

Now fit the multilevel model with random intercepts for the subjects.

model2 = lme4::glmer(y ~ time*ttmt + (1 | SubjID), family = binomial, data = toenail, nAGQ = 50)

#adaptive Gauss-Hermite quadrature, nAGQ = 50

summary(model2, corr=F)

(e) Does this model (model2) suggest a statistically significant difference in the rate of decline between the two treatments? Explain how you are deciding.

(f) Pick and include one of the following three graphs and briefly explain why you preferred the graph you selected.

#Look at all the graphs but do not include

plot(effects::allEffects(model2))

plot(ggeffects::ggpredict(model2, terms=c("time", "SubjID [sample=9]"), type="random"), show_ci = F)  #rerun if not enough subjects displayed

 

fits = predict(model2, re.form = NA, newdata = toenail) #fixed effects only

mlm_fits = exp(fits)/(1 + exp(fits)) #convert log odds to predicted probabilities (equivalently, plogis(fits))

ggplot(data = toenail, aes(y = mlm_fits, x = time, color = ttmt)) +
  geom_line(aes(group = ttmt)) +
  labs(title = "glmer", x = "time", y = "estimated prop") +
  theme_bw() + ylim(0, .5)

 

Compare the models

combined_data <- data.frame(time = toenail$time, ttmt = toenail$ttmt,
                            fits = c(glm_fits, mlm_fits),
                            model = rep(c("GLM", "MLM"), each = nrow(toenail)))

ggplot(data = combined_data, aes(x = time, y = fits, color = ttmt, linetype = model)) +
  geom_line(aes(group = interaction(ttmt, model))) +
  labs(title = "Comparison of GLM and GLMER Models", x = "Time", y = "Estimated Probability") +
  theme_bw() + ylim(0, 0.5)

You may notice that these models look pretty different! That’s because they estimate different things! Previously, we interpreted the slope coefficient of a fixed effect in a multilevel model as the “effect” of changing x-variables within a specific group or, “on average,” the marginal effect in the population. With logistic regression models, the “within individual” rate of decline is not the same as the “overall average” rate of decline (because of the nonlinear back-transformation). The slope of the marginal relationship (solid curves) applies to a randomly selected person (population average). The slope of the conditional relationship (dotted curves) applies to a particular individual. Keep in mind this is more about “adjusting for the individual” than about using random effects; we would see the same distinction if we included subjects as fixed effects (which we can’t really do here because of missing values, etc.).
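A quick numeric sketch of this distinction, using hypothetical values (the intercept, slope, and random-intercept SD below are made up, not estimates from the toenail data): averaging the subject-specific probabilities across many simulated subjects does not give the same answer as plugging in the “typical” (random intercept = 0) subject.

```r
# Numeric sketch (hypothetical values): the average of subject-specific
# probabilities is not the probability for the "average" (u = 0) subject
set.seed(1)
b0 <- -1.6; b1 <- -0.4               # hypothetical fixed effects
u  <- rnorm(10000, mean = 0, sd = 4) # hypothetical random intercepts
t  <- 3
p_marginal    <- mean(plogis(b0 + u + b1 * t)) # population-average probability
p_conditional <- plogis(b0 + b1 * t)           # typical-subject probability
c(p_marginal, p_conditional)         # these differ because plogis is nonlinear
```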

 

(g) Because the intercepts are so different, it is difficult to compare the slopes of these models from the graphs. For model 1 (not including subject), what are the predicted odds of severe infection for Itraconazole at time 0 and again at time 6? Calculate the odds ratio (how many times smaller are the odds at time 6?). Repeat for model2. Which model (population-average in model 1 or subject-specific in model 2) reflects a sharper decrease in the odds of severe infection?
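A template for this calculation is sketched below; the coefficient values are placeholders, so substitute your own estimates from coef(summary(model1)) (and then from fixef(model2)).

```r
# Template with placeholder coefficients -- substitute the estimates from
# coef(summary(model1)), then repeat with fixef(model2)
b0 <- -0.56; b_time <- -0.17    # hypothetical intercept and time slope
odds0 <- exp(b0)                # predicted odds of severe infection at time 0
odds6 <- exp(b0 + 6 * b_time)   # predicted odds at time 6
odds0 / odds6                   # how many times smaller the odds are at time 6
```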

 

We can use this equation to convert the conditional (“subject specific”) slope to the marginal (think population average) coefficient:

marginal slope ≈ conditional slope / sqrt(1 + 0.324 × intercept variance)
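In R, the conversion is a one-liner; the values below are hypothetical placeholders (in practice, take the intercept variance from as.data.frame(VarCorr(model2)) and the slope from fixef(model2)).

```r
# Sketch with hypothetical values -- in practice, pull the intercept variance
# from as.data.frame(VarCorr(model2)) and the slope from fixef(model2)
sigma2 <- 16        # hypothetical random-intercept variance
b_cond <- -0.39     # hypothetical subject-specific slope for time
b_marg <- b_cond / sqrt(1 + 0.324 * sigma2)
b_marg              # attenuated toward 0 relative to the conditional slope
```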

(h) For model2, use this equation to estimate the marginal intercept and the marginal slope for time. Are these values closer to those from model 1?

 

Another way to estimate the marginal (population average) coefficients with robust standard errors (think ‘sandwich estimators’) is to use generalized estimating equations (GEEs). Decent reference on GEEs vs. MLMs

First, fit a model with a “bad correlation structure.” (Output here in case it doesn’t run for you.)

install.packages("gee")

model3 = gee::gee(y ~ time*ttmt, family = binomial, data = toenail, id = SubjID,
                  corstr = "exchangeable", scale.fix = TRUE, scale.value = 1)

summary(model3)

(i) Why do we consider the “working correlation matrix” assumptions to be unreasonable in this context?

 

Now allow for a more appropriate correlation matrix (output here):

model4 = gee::gee(y ~ time*ttmt, family = binomial, data = toenail, id = SubjID,

                  corstr = "unstructured")

summary(model4)

ggplot(data = toenail, aes(y = fitted.values(model4), x = time, color = ttmt)) +
  geom_line(aes(group = ttmt)) +
  labs(title = "gee", x = "time", y = "estimated prop") +
  theme_bw() + ylim(0, .5)

(j) What do you notice about the pattern of correlations in the new “working correlation matrix”? Does that make sense in this context?

(k) For models 3 and 4, how do the “robust standard errors” (which ‘correct’ for the dependence in the data) compare to the “naïve standard errors”? Is this what we expect to see? Explain. For which model are they more different?

(l) Which are closer to the observed proportions of severe infections, the marginal (population-averaged) values or the conditional (subject-specific) values?

head(fitted.values(model2), 10)
head(fitted.values(model4), 10)

(m) Suggest a good candidate for “level 3” units in this study.

 

2) In this exercise, you will analyze data from a study involving 67 schools. In this study, students were given a math test in grades 3, 4, and 5 that was similar to the SAT math test (e.g., scale 0-800).  The data in schooldata involves 234 repeated measures collected from 122 students across 12 teachers.

schooldata = read.table("https://www.rossmanchance.com/stat414/data/schooldata.txt", header=TRUE)
head(schooldata)
with(schooldata, table(studid, tchrid))[1:10, 1:10]

(a) Explain what is measured by the time variable in this dataset.

(b) Are students nested within teacher?  Explain how you can tell from the output of the table.  Also, sketch a diagram showing the “crossed effects” for a few cases.

(c) Based on the following graph, would you treat year as a quantitative variable or categorical (factor) in this study?  Explain your reasoning.

means <- tapply(schooldata$math, schooldata$year, mean)
with(schooldata, boxplot(math ~ year))
points(1:3, means, col = "red", pch = 19)

(d) Write out the level (model) equations corresponding to an unconditional growth (include year but not random slopes) model (not the prediction equation) with separate student random effects and teacher random effects.  Include the relevant assumptions on the random components.

(e) From the following output

model1 <- lme4::lmer(math ~ year + (1 | studid) + (1 | tchrid), data = schooldata)
summary(model1, corr= F)
confint(model1)

Which has more ‘impact’ on test scores in a given year, the student or their teacher?  (How are you deciding?)

(f) Are the student effects statistically significant? (How are you deciding?)

(g) Is the time effect statistically significant? (How are you deciding?)

 

The first two rows of the estimated variance-covariance matrix for the responses will have:

Outcome/Group                   Student 13099,   Student 13100,   Student 13100,
                                Teacher 14433    Teacher 14433    Teacher 14484
Student 13099, Teacher 14433         a.               b.               c.
Student 13100, Teacher 14433         d.               e.               f.

(h) According to the random intercepts model above, specify these 6 values. (You may assume the student and teacher random intercepts are independent of each other and that the residuals are independent across all observations, including repeated measurements on the same student, but be thinking about how we will change our model to account for that!)

(i) Examine the normality of each of the random effects, e.g.,

qqnorm(ranef(model1)$studid[[1]])

qqnorm(ranef(model1)$tchrid[[1]])

What do you conclude/suggest next?

 

(j) So far, we have assumed “additive effects” for the students and teachers.  Suppose we wanted to consider an interaction between our crossed random effects.  We are unlikely to have repeat combinations for our longitudinal data (recall you need replication to include the interaction df), but what about our hypothetical hockey data…

hockey = readr::read_csv("https://www.rossmanchance.com/stat414/data/shot-attempts.csv")

hockey$dist = with(hockey, sqrt((x_coord - 89)^2 + y_coord^2))

hockey$angle = with(hockey, atan2(y_coord, 89 - x_coord))

hockey$goal = ifelse(hockey$shot_result == "Goal", 1, 0)

 

model3.mlm = lme4::glmer(goal ~ 1 + angle + dist + (1 | shooter_id) + (1 | goalie_id), family = binomial, data = hockey)

model4.mlm = lme4::glmer(goal ~ 1 + angle + dist + (1 | shooter_id) + (1 | goalie_id) + (1 | shooter_id:goalie_id), family = binomial, data = hockey)  #be patient!

summary(model4.mlm, corr = F)

anova(model3.mlm, model4.mlm)

How many terms does this add to the model?  Is this model a significant improvement over the additive model?   (Include relevant output and be clear how you are deciding.)  Does the random interaction variance appear substantial? (Include relevant output and be clear how you are deciding.)  Provide a brief interpretation of the implications of this interaction in context.