Stat 414 – HW 7

Due midnight, Friday, Nov. 22

You don’t need to include R code I give you but should include R code you create/modify and relevant output. Please submit separate files for each question.

1) Toenail dermatophyte onychomycosis (TDO) is a common toenail infection where the nail plate becomes separated from the nail bed. Traditionally, antifungal compounds have been used to treat TDO. New compounds have been developed in an attempt to reduce the treatment duration. The data in toenail.txt are from a randomized, double-blind, multi-center study comparing two oral treatments ((Itraconazole and Terbinafine) over a 3-month period, with measurements taken at baseline, every month during treatment, and ever 3 months afterwards, with a maximum of 7 measurements per subject. At each visit, the target nail is coded as 0 (not severe, none or mild separation) or 1 (moderate or severe = complete separation). The research questions are whether the probability of severe infection decreases over time, and whether that evolution differs between the two treatments.

#library(readr)

toenail = read.table("https://www.rossmanchance.com/stat414/data/toenail.txt", header=TRUE)

toenail$ttmt = factor(toenail$ttmt)

(a) Explain what is measured by the time variable in this dataset.

(b) Explore the data:

with(toenail, round(tapply(y, list(ttmt, time), mean), 2))

props = ave(toenail$y, by = list(toenail$time, toenail$ttmt))

ggplot(data = toenail, aes(y=props, x=time, group=ttmt, col = ttmt)) +

geom_line() +

theme_bw()

Include the output and briefly summarize what you learn. (Be clear what the y-axis represents in the graph.)

model1 = glm(y ~ time*ttmt, family = binomial, data = toenail)

summary(model1)

Is the interaction close to significant? How would you interpret the interaction in this context?

(d) Descriptively, how does the behavior of the fitted model compare to that of the raw data?

        glm_fits = fitted.values(model1)

        ggplot(data = toenail, aes(y = glm_fits, x = time, color = ttmt)) +

        geom_line(aes(group=ttmt)) +

                 labs(title = "glm", xlab="time", ylab="estimated prop") +

        theme_bw() + ylim(0, .5)

plot(effects::allEffects(model1))

Notice the non-uniform scaling of the y-axis for the graph on the right! Don’t look at the graph on the right and conclude there is a linear association/model! In fact, we should be saving “curve” rather than “line” much more often…

(e) Now fit the multilevel model

model2 = glmer(y ~ time*ttmt + (1 | SubjID), family = binomial, data = toenail, nAGQ = 50)

#adaptive Gause-Hermite Quadrature, nAGQ = 50

summary(model2, corr=F)

#Look at all the graphs but do not include

plot(effects::allEffects(model2))

plot(ggeffects::ggpredict(model2, terms=c("time", "SubjID [sample=9]"), type="re"), show_ci = F)

fits = predict(model2,re.form=NA,newdata=toenail) #fixed effects only

mlm_fits=exp(fits)/(1+exp(fits)) #convert log odds to predicted probabilities

                 ggplot(data = toenail, aes(y = mlm_fits, x = time, color = ttmt)) +

        geom_line(aes(group=ttmt)) +

                 labs(title = "glm", xlab="time", ylab="estimated prop") +

        theme_bw() + ylim(0, .5)

Does this model suggest a significant difference in the rate of decline between the two treatments? Explain how you are deciding.

Compare the models

#Include this graph

combined_data <- data.frame( time = toenail$time, ttmt = toenail$ttmt, fits = c(glm_fits, mlm_fits), model = rep(c("GLM", "MLM"), each = nrow(toenail)) )

ggplot(data = combined_data, aes(x = time, y = fits, color = ttmt, linetype = model)) +

geom_line(aes(group = interaction(ttmt, model))) +

labs(title = "Comparison of GLM and GLMER Models", x = "Time", y = "Estimated Probability") +

theme_bw() + ylim(0, 0.5)

You may have noticed that these models look pretty different. That’s because they estimate different things! Previously, we interpreted the slope coefficient of a fixed effect in a multilevel model as the “effect” of changing x-variables within a specific group or “on average,” the marginal effect in the population. With logistic regression models, the “within individual” rate of decline is not the same as the “overall average” rate of decline (because of the nonlinear back-transformation). The slope of the marginal relationship applies to a randomly selected person (population average). The slope of the conditional relationship applies to a particular individual.

Use the equation to convert the conditional (“subject specific”) slope to the marginal (think population average) coefficient:

(f) For the above model, what is the marginal slope for time? Which is larger here, the subject-specific slope or the population-average (group level) slope? Does that make sense in context? Explain.

Alternatively, we can use generalized estimating equations (GEEs) to come up with parameter estimates and robust standard errors (think ‘sandwich estimators’). First, fit a model with a “bad correlation structure.” (Output included)

install.packages("gee")

model3 = gee::gee(y ~ time*ttmt, family = binomial, data = toenail, id = SubjID,

corstr = "exchangeable", scale.fix = TRUE, scale.value = 1)

summary(model3)

A screenshot of a computer

Description automatically generated

(g) Why do I consider the “working correlation matrix” assumptions to be unreasonable in this context?

Now allow for a more appropriate correlation matrix (output included):

model4 = gee::gee(y ~ time*ttmt, family = binomial, data = toenail, id = SubjID,

corstr = "unstructured")

summary(model4)

        ggplot(data = toenail, aes(y = fitted.values(model4), x = time, color = ttmt)) +

        geom_line(aes(group=ttmt)) +

                 labs(title = "gee", xlab="time", ylab="estimated prop") +

        theme_bw() + ylim(0, .5)

(h) What do you notice about the pattern of correlations in the “working correlation matrix”? Does that make sense in this context?

(i) For models 3 and 4, how do the “robust standard errors” (which ‘correct’ for the dependence in the data) compare to the “naïve standard errors”? Is this what we expect to see? Explain. For which model are they more different?

(j) Which are closer to the observed proportions of severe infections, the marginal (population-averaged) values or the conditional (subject-specific) values?

head(fitted.values(model2), 10)

head(fitted.values(model4), 10)

Decent reference on GEEs vs. MLMs

(k) Suggest a good candidate for a “level 3” in this study.

2) In this exercise, you will analyze data from a study involving 67 schools. In this study, students were given a math test in graders 3, 4, and 5 that was similar to the SAT math test (e.g., scale 0-800). The data in schooldata involves 234 repeated measures collected from 122 students across 12 teachers.

schooldata = read.table("https://www.rossmanchance.com/stat414/data/schooldata.txt", header=TRUE)

head(schooldata)

with(schooldata, table(studid, tchrid))[1:10, 1:10]

(a) Explain what is measured by the time variable in this dataset.

(b) Are students nested within teacher? Explain how you can tell from the output of the table. Also, sketch a diagram showing the “crossed effects” for a few cases.

(c) Based on the following graph, would you treat year as a quantitative variable or categorical (factor) in this study? Explain your reasoning.

means <- tapply(schooldata$math, schooldata$year, mean)

with(schooldata, boxplot(math ~ year))

points(1:3, means, col = "red", pch = 19)

(d) Write out the level (model) equations corresponding to an unconditional growth (include year but not random slopes) model (not the prediction equation). Include the relevant assumptions on the random components.

(e) From the following output

model1 <- lmer(math ~ year + (1 | studid) + (1 | tchrid), data = schooldata)

summary(model1, corr= F)

confint(model1)

Which has more ‘impact’ on test scores in a given year, the student or their teacher? (How are you deciding?)

(f) Are the student effects statistically significant? (How are you deciding?)

(g) Is the time effect statistically significant? (How are you deciding?)

The first two rows of the estimated variance-covariance matrix for the responses will have

Student 13099, grade 3

Student 13100, grade 3

Student 13100, grade 4

Student 13099, grade 3

Student 13100, grade 3

(h) Use the random intercepts model above to specify these 6 values.

(i) Examine the normality of each of the random effects, e.g.,

qnorm(ranef(model1)$studid[[1]])

qqnorm(ranef(model1)$tchrid[[1]])

What do you conclude/suggest next?

(j) So far, we have assumed “additive effects” for the students and teachers. Suppose we wanted to consider an interaction between our crossed random effects. We are unlikely to have repeat combinations for our longitudinal data, but what about our hypothetical hockey data…

model3.mlm = glmer(goal ~ 1 + angle + dist + (1 | shooter_id) + (1 | goalie_id), family=binomial, data = hockey)

model4.mlm = glmer(goal ~ 1 + angle + dist + (1 | shooter_id) + (1 | goalie_id) + (1 | shooter_id:goalie_id, family=binomial, data = hockey) #be patient!

summary(model4.mlm, corr = F)

anova(model3.mlm, model4.mlm)

How many terms does this add to the model? Is this model a significant improvement over the additive model? (Include relevant output and be clear how you are deciding.) Does the random interaction variance appear substantial? (Include relevant output and be clear how you are deciding.) Provide a brief interpretation of the implications of this interaction in context.

3) The U.S. National Longitudinal Study of Youth (NLSY) original study began in 1979 with a nationally-representative sample of nearly 13,000 young people aged 14 to 21. CurranData.txt contains data from a sub-study of children of the female NLSY respondents which began in 1986 when the children were aged between 6 and 8 years. Child assessments (e.g., reading scores) were then measured biennially in 1988, 1990 and 1992. The sub-sample of 221 children were assessed on all four occasions.

(a) Read in two datasets. The first is in “wide format” and the second is in “long format.” What is the difference in these formats?

CurranDataWide <- read.delim("https://www.rossmanchance.com/stat414/data/CurranData.txt", "\t", header=TRUE)

head(CurranDataWide)

CurranDataLong <- read.delim("https://www.rossmanchance.com/stat414/data/CurranData_Long.txt", "\t", header=TRUE)

head(CurranDataLong)

(b) Look at the first few rows in wide format. Which are time-varying variables and which are time-invariant?

library(tidyverse)

ggplot(data = CurranDataLong, aes(y = RAscore, x = Time)) +

geom_line(aes(group=id), color="light grey") +

geom_smooth(aes(group=1), color="black", size=1) +

theme_bw() +

labs(x="Years since 1986", y="Reading Score")

ggplot(CurranDataLong, aes(x = Time, y = RAscore, group = id)) +

geom_smooth(method = "loess", alpha = .5, se = FALSE) +

geom_smooth(aes(group=1), color="black") +

theme_bw()

(c) Examine the correlations of the reading scores across the four time points. (Explain what the R code is doing.) How would you summarize their behavior? Explain why the pattern in the covariation/ correlation makes sense in this context.

cor(select(CurranDataWide, contains("reading")))

(d) Use lme (not lmer) to fit a random intercepts longitudinal model (meaning, include Time).

1. Tell me about the variable Time.

2. Report the intra-class correlation coefficient.

3. What assumptions does this model make about the correlations of the four observations on each person? Does that assumption seem reasonable here?

library(nlme)

summary(model1 <- lme(RAscore ~ Time, random = ~1 | id, data = CurranDataLong))

cov2cor(getVarCov(model1, type="marginal")[[1]])

(e) Fit a random slopes model and examine the covariance matrix.

model2 = lme(RAscore ~ Time, random = ~Time|id, data = CurranDataLong)

summary(model2)

getVarCov(model2, type="marginal")

cov2cor(getVarCov(model2, type="marginal")[[1]])

fits = fitted.values(model2)

ggplot(data = CurranDataLong, aes(y = fits, x = Time)) +

geom_line(aes(group=id)) +

theme_bw()

1. Verify the values for (Show me the numbers!)

· var(y₁₁) = variance of “first observation” for person 1,

· var(y₂₁) = variance of “second observation” for person 1 (which can differ because we are using random slopes),

· cor(1,2) = correlation of the first two measurements on each person,

· cor(1,3) = correlation of first and 3rd measurement on each person.

2. How are the variances changing over time? How do the correlations change as the observations get further apart?

(f) Return to the basic two-level random intercepts model with time and time² as Level 1 variables, but without random slopes. Fit the AR(1) error structure.

model4 = lme(fixed = RAscore ~ Time + I(Time*Time), random = ~1| id, correlation = corAR1(), data = CurranDataLong)

summary(model4); intervals(model4)

cov2cor(getVarCov(model4, type="marginal")[[1]])

model5 = lme(fixed = RAscore ~ Time + I(Time*Time), random = ~1| id, data = CurranDataLong)

anova(model5, model4)

Is this model significantly better than the “independent errors” random intercepts model?

(g) What is the estimated parameter of the AR(1) model (“autocorrelation”). How do you interpret it? Does there appear to be correlation between adjacent observations even after conditioning on student?

(h) How would we decide whether the pattern of growth (i.e., reading progress) differs between males and females? Explain the process, you don’t have to carry out.

4) Several research studies have explored the hypothesis that teachers’ expectations influence pupils’ intellectual development. In this exercise you will combine results across 19 different experiments. In these studies, the treatment was whether or not teachers were encouraged to have “high” expectations.

study	Author	year	weeks	setting	tester	n1	n2	Cohen’s d	SE of diff
1	Rosenthal et al.	1974	2	group	aware	77	339	0.03	0.0156
2	Conn et al.	1968	21	group	aware	60	198	0.12	0.0216
3	Jose & Cody	1971	19	group	aware	72	72	-0.14	0.0279
4	Pellegrini & Hicks	1972	0	group	aware	11	22	1.18	0.1391
5	Pellegrini & Hicks	1972	0	group	blind	11	22	0.26	0.1362
6	Evans & Rosenthal	1969	3	group	aware	129	348	-0.06	0.0106
7	Fielder et al.	1971	17	group	blind	110	636	-0.02	0.0106
8	Claiborn	1969	24	group	aware	26	99	-0.32	0.0484
9	Kester	1969	0	group	aware	75	74	0.27	0.0269
10	Maxwell	1970	1	indiv	blind	32	32	0.8	0.063
11	Carter	1970	0	group	blind	22	22	0.54	0.0912
12	Flowers	1966	0	group	blind	43	38	0.18	0.0497
13	Keshock	1970	1	indiv	blind	24	24	-0.02	0.0835
14	Henrikson	1970	2	indiv	blind	19	32	0.23	0.0841
15	Fine	1972	17	group	aware	80	79	-0.18	0.0253
16	Grieger	1970	5	group	blind	72	72	-0.06	0.0279
17	Rosenthal & Jacobson	1968	1	group	aware	65	255	0.3	0.0193
18	Fleming & Anttonen	1971	2	group	blind	233	224	0.07	0.0088
19	Ginsburg	1970	7	group	aware	65	67	-0.07	0.0303

To compare different studies, you need to use a “standardized difference” or “effect size.” Cohen’s d is like the t-statistic but doesn’t use sample sizes, . The above table reports the Cohen’s d values and their associated standard errors. Alas it is rare for studies to provide access to the raw data and we must rely on the effect size reported (and then different packages in R), so then convert each study to the same measure as was done here. This also allows us to use the reported SE’s of the effects to measure the Level 1 residual variance (to estimate

(a) Do the effect sizes appear similar across the studies?

So now we want to estimate the true effect of teacher expectations.

(b) Which 1 or 2 studies would you consider the most reliable?

bryk = read.table("https://www.rossmanchance.com/stat414/data/bryk.txt", header=TRUE)

bryk$se = sqrt(bryk$var)

bryk$weeks.b <- ifelse(bryk$weeks > 1, 1, 0) #we will use this variable later

install.packages("meta"); library(meta)

model1 <- metagen(TE = diff,

seTE = se,

studlab = study,

data = bryk,

sm = "SMD", #we are working with standardized mean differences

method.tau = "REML",

method.random.ci = "HK"

)

summary(model1) #output below

(a) The output first reports the weight (as a percentage) given to each study (for both fixed effects and random effects). Which study is given the most weight? Is this what you expected? Explain.

(b) Report the estimated “pooled effect size” treating the studies as random effects. Interpret the associated confidence interval in context. Would this meta-analysis suggest that the true treatment effect differs from zero?

(c) In this context, represents the study-to-study variability in the true effect sizes. In other words, this provides a measure of how homogeneous or heterogeneous the studies are. Does the study to study variability in effect sizes appear to be statistically significant? Justify your answer.

Relatedly, Cochran’s Q statistic is a weighted sum of squares comparing the individual study effects to the estimated common effect:

where is the inverse of the study’s variance and is from the fixed effects model.

(d) Show a calculation of Cochran’s Q “by hand.”

(e) If Q is large, this is evidence of “excess variation.” It is assumed that Q follows a chi-square distribution with df = k – 1 where k is the number of studies. What do you conclude from the R output?

(f) A common visualization in a meta-analysis is a forest plot:

meta::forest(model1) #output below

Provide a brief summary of how the “fixed effects” results and the “random effects” results differ.

So maybe we shouldn’t pool these studies together, maybe small/negative effects are masking large positive effects? We can use a lot of our usual tools to check for outliers and influential studies. Maybe we can explain some of the variability across studies?

The “expectancy treatment” involved deception – in some cases researchers presented teachers with a list of pupils who allegedly displayed potential for tremendous intellectual or a list of students where they had inflated the IQ scores. (In reality, the student names on these lists were determined randomly.) Why question is whether the teachers really believed these claims (and the treatment was truly applied). This could be related to how well the teachers knew their students when they were given the information. This is encapsulated in the “weeks of prior contact” variable.

(g) “Update” the meta-analysis to separate the studies based on the “number of prior contact weeks”

update(model1, subgroup= weeks.b, tau.common=FALSE)

1. In the Results for subgroups section, do the estimated effect sizes appear to differ substantially across the subgroups?

2. What do you learn from the test of subgroup differences for the fixed effects (common effect) model? (Hint: What are the df and why?)

3. What do you learn from the test of subgroup differences for the random effects model

(h) Summarize what you learn from the following graph (is it consistent with the discussion above?):

A diagram of a graph

Description automatically generated

A screenshot of a computer

Description automatically generated

A screenshot of a graph

Description automatically generated