Stat 414 – HW 7

Due midnight, Friday, Dec. 1

 

You don’t need to include R code I give you but should include R code you create/modify and relevant output.

 

1) Toenail dermatophyte onychomycosis (TDO) is a common toenail infection where the nail plate becomes separated from the nail bed. Traditionally, antifungal compounds have been used to treat TDO. New compounds have been developed in an attempt to reduce the treatment duration. The data in toenail.txt are from a randomized, double-blind, multi-center study comparing two oral treatments (Itraconazole and Terbinafine) over a 3-month period, with measurements taken at baseline, every month during treatment, and every 3 months afterwards, with a maximum of 7 measurements per subject. At each visit, the target nail is coded as 0 (not severe, none or mild separation) or 1 (moderate or severe = complete separation). The research questions are whether the probability of severe infection decreases over time, and whether that evolution differs between the two treatments.

 

#library(readr)
toenail = read.table("https://www.rossmanchance.com/stat414/data/toenail.txt", header=TRUE)

toenail$ttmt = factor(toenail$ttmt)

(a) Explain what is measured by the time variable in this dataset.

(b) Explore the data:

with(toenail, round(tapply(y, list(ttmt, time), mean), 2))

#there is something wrong with interaction.plot, so let’s use this instead

library(ggplot2)

props = ave(toenail$y, toenail$time, toenail$ttmt)  #proportion severe within each time-by-treatment group

ggplot(data = toenail, aes(y=props, x=time, group=ttmt, col = ttmt)) +

  geom_line() +

theme_bw()

Include the output and briefly summarize what you learn. (Be clear what the y-axis represents in the graph.)

(c) First fit a model that treats the observations on each individual as independent:

model1 = glm(y ~ time*ttmt, family = binomial, data = toenail)

summary(model1)

Is the interaction close to significant? How would you interpret the interaction in this context?

(d) Descriptively, how does the behavior of the fitted model compare to that of the raw data?

ggplot(data = toenail, aes(y = fitted.values(model1), x = time, color = ttmt)) +
  geom_line(aes(group=ttmt)) +
  labs(title = "glm", x = "time", y = "estimated prop") +
  theme_bw() + ylim(0, .5)

plot(effects::allEffects(model1))

 

(e) Now fit the multilevel model

library(lme4)

model2 = glmer(y ~ time*ttmt + (1 | SubjID), family = binomial, data = toenail, nAGQ = 50)

#adaptive Gauss-Hermite quadrature with nAGQ = 50 quadrature points

summary(model2, corr=F)

#Include one of the following graphs!

plot(effects::allEffects(model2))

 

fits = predict(model2,re.form=NA,newdata=toenail) #fixed effects only

fits = exp(fits)/(1 + exp(fits)) #convert log odds to predicted probabilities

ggplot(data = toenail, aes(y = fits, x = time, color = ttmt)) +
  geom_line(aes(group=ttmt)) +
  labs(title = "mlm", x = "time", y = "estimated prop") +
  theme_bw() + ylim(0, .5)

 

plot(ggeffects::ggpredict(model2, terms=c("time", "SubjID [sample=9]"), type="re"), ci = F)

 

Does this model suggest a significant difference in the rate of decline between the two treatments? Explain how you are deciding.

 

You may have noticed that these models look pretty different. That’s because they estimate different things! Previously, we interpreted the slope coefficient of a fixed effect in a multilevel model either as the “effect” of changing an x-variable within a specific group or, “on average,” as the marginal effect in the population. With logistic regression models, the “within individual” rate of decline is not the same as the “overall average” rate of decline (partly because of the nonlinear back-transformation). The slope of the marginal relationship applies to a randomly selected person (population average); the slope of the conditional relationship applies to a particular individual. The difference between these values increases as the cluster curves become more spread out (larger variance of the intercepts).

 

One approach is to convert the conditional (“subject specific”) slope to the marginal (think population average) coefficient with the approximation

β_marginal ≈ β_conditional / √(1 + 0.346 σ_u²),

where σ_u² is the random-intercept variance and 0.346 ≈ (16√3/(15π))².
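A minimal sketch of this conversion in R, using model2 from (e) and the approximation constant above (the VarCorr/fixef extraction assumes lme4’s usual output layout):

c2 = (16 * sqrt(3) / (15 * pi))^2                   #approximately 0.346
sigma2_u = as.data.frame(VarCorr(model2))$vcov[1]   #estimated random-intercept variance
fixef(model2)["time"] / sqrt(1 + c2 * sigma2_u)     #approximate marginal (population-average) slope for time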

(f) For the above model, what is the marginal slope for time?  Which is larger here, the subject-specific slope or the population-average (group level) slope? Does that make sense in context? Explain.

 

Alternatively, we can use generalized estimating equations (GEEs) to come up with parameter estimates and robust standard errors (think ‘sandwich estimators’).  First, fit a model with a “bad correlation structure.”

install.packages("gee")

model3 = gee::gee(y ~ time*ttmt, family = binomial, data = toenail, id = SubjID,

                  corstr = "exchangeable", scale.fix = TRUE, scale.value = 1)

summary(model3)


(g) Why do I consider the “working correlation matrix” assumptions to be unreasonable in this context?

Now allow for a more appropriate correlation matrix:

model4 = gee::gee(y ~ time*ttmt, family = binomial, data = toenail, id = SubjID,

                  corstr = "unstructured")

summary(model4)

ggplot(data = toenail, aes(y = fitted.values(model4), x = time, color = ttmt)) +
  geom_line(aes(group=ttmt)) +
  labs(title = "gee", x = "time", y = "estimated prop") +
  theme_bw() + ylim(0, .5)

(h) What do you notice about the pattern of correlations in the “working correlation matrix”? Is the correlation modelled as the same between any two observations on the same individual?  Does that make sense in this context?

(i) For models 3 and 4, how do the “robust standard errors” (which ‘correct’ for the dependence in the data) compare to the “naïve standard errors”? Is this what we expect to see? Explain. For which model are they more different?
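One convenient way to line up the two sets of standard errors side by side (a sketch; it assumes the gee summary object stores its coefficient table, including the naive and robust SE columns, in a $coefficients component, which str(summary(model3)) will confirm):

round(summary(model3)$coefficients, 3)   #Estimate, naive SE/z, robust SE/z
round(summary(model4)$coefficients, 3)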

(j) Which are closer to the observed proportions of severe infections, the marginal (population-averaged) values or the conditional (subject-specific) values?

head(fitted.values(model2), 10)
head(fitted.values(model4), 10)

Decent reference on GEEs vs. MLMs

(k) Suggest a good candidate for a “level 3” in this study.

 

2) In this exercise you will analyze data from the Scotland Neighborhood Study (Garner and Raudenbush, 1991). This study set out to test the hypothesis that a neighborhood's level of social deprivation has a negative effect on a student’s educational attainment even after controlling for the student’s prior attainment and family background. The data consist of 2,310 students who attended 17 secondary schools and resided in 524 neighborhoods (for one “school district”). Scottish secondary schools teach students from age 11-12 to the end of compulsory schooling (age 15-16). The response variable is a total attainment score (approximately standardized), based on a series of national examinations (the GCSE) taken at the end of compulsory secondary schooling in Scotland (age 16). Successful performance in these examinations is a crucial factor in decisions regarding employment or further post-compulsory education, possibly leading to entrance to universities. Higher scores indicate higher attainment. Predictor variables include student-level verbal reasoning and prior reading attainment scores on entering secondary education, a range of family-level background characteristics, and a neighborhood-level deprivation score.

install.packages("haven") #to be able to open a STATA data file

library(haven)

GCSE <- read_dta("https://www.rossmanchance.com/stat414/data/GCSE.dta")

head(GCSE)

 

(a)  Are these data purely hierarchical? (e.g., are neighborhoods nested within schools?)

head(with(GCSE, table(neighid, schid)), 20)

 

(b)  Summarize what you learn from the following commands. (You may want to run parts of the commands and do further exploration of the data to see what is going on; please document what you do.)

library(tidyverse)

schoolcombos<- GCSE %>% select(schid,neighid) %>% unique()
summary(tapply(schoolcombos$neighid, as.factor(schoolcombos$schid), length))
summary(tapply(schoolcombos$schid, as.factor(schoolcombos$neighid), length))
 
Hints: Maybe also consider

table(GCSE[GCSE$schid == 0,]$neighid)

table(GCSE[GCSE$neighid == 29,]$schid)

 
 

(c)   Fit the null (unconditional) model with separate school random effects and neighborhood random effects. (Model 0)

model0 = lmer(attain ~ 1 + (1|schid) + (1|neighid), data = GCSE, REML = FALSE)

summary(model0)

Briefly interpret the 3 variance estimates.

 

(d)  Is the model with random effects a significant improvement over the single-level model without random effects?

 

library(nlme)

modelNR = gls(attain ~ 1, data = GCSE, method = "ML")

summary(modelNR)

logLik(model0)

logLik(modelNR)
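If you want the test itself, here is a sketch of the likelihood ratio statistic computed by hand (the chi-square reference with 2 df is conservative here because the null values of the variances sit on the boundary of the parameter space):

LRT = as.numeric(2 * (logLik(model0) - logLik(modelNR)))   #likelihood ratio statistic
LRT
pchisq(LRT, df = 2, lower.tail = FALSE)                    #conservative p-value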

 

(e)  Summarize what you learn from the following output

[R output shown in the original handout]

(f)    Back to your null model in (c). What percentage of variation is explained by each component (i.e., variance partition coefficients or VPCs)? Are the “educational disparities” larger among schools or among neighborhoods?
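A sketch of one way to pull the variance estimates out of the null model for these calculations (as.data.frame(VarCorr()) is standard lme4; check the grp column to see which row is which):

vc = as.data.frame(VarCorr(model0))                   #one row per variance component (plus Residual)
vc
setNames(round(vc$vcov / sum(vc$vcov), 3), vc$grp)    #VPC for each component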

(g)  To calculate the correlation between two students (ICCs), you need to consider which variance components they share. Writing σ²_school and σ²_neigh for the school and neighborhood intercept variances and σ²_e for the residual variance:

-       Two students in the same school and neighborhood: (σ²_school + σ²_neigh) / (σ²_school + σ²_neigh + σ²_e)

-       Two students in the same school but different neighborhoods: σ²_school / (σ²_school + σ²_neigh + σ²_e)

What is the formula for two students in the same neighborhood but different schools?

Calculate the three ICC values. Which of these correlations is largest? Does this make sense in context? Explain briefly.

(h)  Examine the normality of each of the random effects, e.g.,

qqnorm(ranef(model0)$schid[[1]])
qqnorm(ranef(model0)$neighid[[1]])
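Optionally, a reference line makes departures from normality easier to judge (qqline() is base R):

qqnorm(ranef(model0)$schid[[1]]); qqline(ranef(model0)$schid[[1]])
qqnorm(ranef(model0)$neighid[[1]]); qqline(ranef(model0)$neighid[[1]])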

What do you conclude/suggest next?

 

(i)    So far, we have assumed “additive effects” for the schools and neighborhoods.  Let’s consider a model that allows for an interaction between the schools and neighborhoods.

                    Try something like: GCSE$combos = paste(as.character(GCSE$schid), as.character(GCSE$neighid)), which stores combos as a column of GCSE so lmer can find it in the data.

model1 = lmer(attain ~ 1 + (1|schid) + (1|neighid) + (1 | combos), data = GCSE, REML = FALSE)

summary(model1)

anova(model0, model1)

How many terms does this add to the model?  Is this model a significant improvement over the additive model?   (Include relevant output and be clear how you are deciding.)  Does the random interaction variance appear substantial? (Include relevant output and be clear how you are deciding.)  Provide a brief interpretation of the implications of this interaction in context.

 

(j)    Now add two measures of prior attainment: verbal reasoning (p7vrq) and reading scores (p7read). Are these variables jointly significant? (Include relevant output and be clear how you are deciding.) How much school-level, neighborhood-level, and Level 1 variance is explained by these two variables? What does it tell you if these variables explain any Level 2 variance?
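One possible sketch of this step (building on the additive null model from (c); the name modelPrior is just a placeholder, and a likelihood ratio test is one reasonable way to judge joint significance):

modelPrior = lmer(attain ~ p7vrq + p7read + (1|schid) + (1|neighid), data = GCSE, REML = FALSE)
anova(model0, modelPrior)              #joint (likelihood ratio) test of the two prior-attainment variables
VarCorr(model0); VarCorr(modelPrior)   #compare school, neighborhood, and residual variances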

 

(k)   Suggest a potential school-level variable that could be included (not necessarily one in the current data set).  Suggest a potential neighborhood-level variable that could be included (not necessarily in the current data set). 

 

3) The original U.S. National Longitudinal Study of Youth (NLSY) began in 1979 with a nationally representative sample of nearly 13,000 young people aged 14 to 21. CurranData.txt contains data from a sub-study of children of the female NLSY respondents, which began in 1986 when the children were aged between 6 and 8 years. Child assessments (e.g., reading scores) were then measured biennially in 1988, 1990, and 1992. The sub-sample of 221 children used here were assessed on all four occasions.

(a)  Read in two datasets.  The first is in “wide format” and the second is in “long format.”  What is the difference in these formats?

CurranDataWide <- read.delim("https://www.rossmanchance.com/stat414/data/CurranData.txt", "\t", header=TRUE)

head(CurranDataWide)

CurranDataLong <- read.delim("https://www.rossmanchance.com/stat414/data/CurranData_Long.txt", "\t", header=TRUE)

head(CurranDataLong)
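If it helps to see how the two layouts relate, here is a sketch that reshapes the wide file into long format; the column selection contains("reading") is borrowed from the correlation code further below, but the exact column names in CurranData.txt are an assumption, so check names(CurranDataWide) first.

library(tidyverse)

#Sketch: stack the four reading-score columns into one column, one row per child per occasion
CurranDataWide %>%
  pivot_longer(cols = contains("reading"), names_to = "occasion", values_to = "RAscore") %>%
  head()
#(you would still need to turn "occasion" into a numeric Time variable)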

(b)  Look at the first few rows in wide format.  Which are time-varying variables and which are time-invariant?

(c)   Summarize what you learn from the “spaghetti plot” (pick one to include)

library(tidyverse)

ggplot(data = CurranDataLong, aes(y = RAscore, x = Time)) +

  geom_line(aes(group=id), color="light grey") +

  geom_smooth(aes(group=1), color="black", size=1) +

  theme_bw() +

labs(x="Years since 1986", y="Reading Score")

OR

ggplot(CurranDataLong, aes(x = Time, y = RAscore, group = id)) +

  geom_smooth(method = "loess", alpha = .5, se = FALSE) +

  geom_smooth(aes(group=1), color="black") +

  theme_bw()

 

(d) Examine the correlations of the reading scores across the four time points. (Explain what the R code is doing.) How would you summarize their behavior? Explain why the pattern in the covariation/correlation makes sense in this context.

cor(select(CurranDataWide, contains("reading")))             

 

(e) If we were to fit an OLS model to these data, what assumption would this model make about the correlations of the four observations on each person? Does that assumption seem reasonable here?

 

(f) Use lme (not lmer) to fit a random intercepts longitudinal model (meaning, include Time). Tell me about the variable Time. Report the intra-class correlation coefficient. What assumption does this model make about the correlations of the four observations on each person? Does that assumption seem reasonable here?

library(nlme)

summary(model1 <- lme(RAscore ~ Time, random = ~1 | id, data = CurranDataLong))

cov2cor(getVarCov(model1, type="marginal")[[1]])

 

(g) Fit a random slopes model and examine the covariance matrix. Verify the values for var(y11) = variance of the “first observation” for person 1, var(y21) = variance of the “second observation” for person 1 (which can differ because we are using random slopes), cor(1,2) = correlation of the first two measurements on each person, and cor(1,3) = correlation of the first and third measurements on each person. (Show me the numbers!) How are the variances changing over time? How do the correlations change as the observations get further apart?

model2 = lme(RAscore ~ Time, random = ~Time|id, data = CurranDataLong)

summary(model2)

getVarCov(model2, type="marginal")

cov2cor(getVarCov(model2, type="marginal")[[1]])

fits = fitted.values(model2)

ggplot(data = CurranDataLong, aes(y = fits, x = Time)) +

  geom_line(aes(group=id)) +

  theme_bw()
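To verify the entries by hand, here is one sketch of how the marginal variances and covariances are assembled from the random-effects covariance matrix; it assumes Time is coded 0, 2, 4, 6 (years since 1986), so check unique(CurranDataLong$Time) first.

G = unclass(getVarCov(model2))                #2x2 covariance of the (intercept, Time) random effects, as a plain matrix
sigma2 = model2$sigma^2                       #Level 1 (residual) variance
tt = c(0, 2, 4, 6)                            #assumed Time values
Z = cbind(1, tt)                              #random-effects design matrix for one person
Z %*% G %*% t(Z) + diag(sigma2, length(tt))   #compare to getVarCov(model2, type="marginal")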

 

(h) Return to the basic two-level random intercepts model with Time and Time-squared as Level 1 variables, but without random slopes. Fit the AR(1) error structure. Is this model significantly better than the “independent errors” random intercepts model?

model4 = lme(fixed = RAscore ~ Time + I(Time*Time), random = ~1| id, correlation = corAR1(), data = CurranDataLong)

summary(model4); intervals(model4)

cov2cor(getVarCov(model4, type="marginal")[[1]])

model5 = lme(fixed = RAscore ~ Time + I(Time*Time), random = ~1| id, data = CurranDataLong)

anova(model5, model4)

 

(i) What is the estimated parameter of the AR(1) model (the “autocorrelation”)? How do you interpret it? Does there appear to be correlation between adjacent observations even after conditioning on student?

 

(j) How would we decide whether the pattern of growth (i.e., reading progress) differs between males and females? Explain the process; you don’t have to carry it out.