Stat 414 – HW 3

Due by midnight, Friday, Oct. 11

 

Please submit each problem as a separate file in Canvas (doc or pdf format). You should include all relevant output.

 

1) Recall the data on 24 college graduates, including their starting salary (in thousands of dollars), how many semesters they spent in college, and their major.

saldata = read.table("http://www.rossmanchance.com/stat414/data/saldata.txt", sep="\t", header=TRUE)

summary(model1 <- lm(salary ~ semesters, data = saldata))

(a)      Fit a model that includes both semesters and major.

summary(model2 <- lm(salary ~ semesters + major, data = saldata))
Provide an interpretation of the coefficient of semesters in this model. 

(b)      Is major a statistically significant predictor of salary after adjusting for number of semesters?

State H0 and Ha using appropriate symbols, and include an appropriate test statistic, degrees of freedom, and p-value. Interpret your results in context.  Be sure to include any output you create to answer this question.
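One common way to test a multi-category predictor after adjusting for another variable is a partial (nested-model) F test. The sketch below uses simulated data with hypothetical names (toy, major_shift, reduced, full) rather than saldata, so the numbers are illustrative only.

```r
# Simulated stand-in for the salary data (hypothetical values, not the course data)
set.seed(414)
toy <- data.frame(
  major     = factor(rep(c("A", "B", "C"), each = 8)),
  semesters = rep(c(8, 9, 10, 11), times = 6)
)
major_shift <- c(A = 0, B = 5, C = 10)   # assumed major effects for the simulation
toy$salary <- 30 + 2 * toy$semesters +
  unname(major_shift[as.character(toy$major)]) + rnorm(24, sd = 3)

# Reduced model (semesters only) versus full model (semesters + major)
reduced <- lm(salary ~ semesters, data = toy)
full    <- lm(salary ~ semesters + major, data = toy)

# Partial F test: do the major indicators jointly improve the fit?
ftest <- anova(reduced, full)
print(ftest)
```

The second row of the table reports the numerator degrees of freedom (number of major indicators), the F statistic, and the p-value.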

 

Let’s aggregate this information to the different majors.

Majordata = aggregate(saldata[, 1:2], list(saldata$major), mean)

head(Majordata)

plot(Majordata$salary ~ Majordata$semesters)

(c)      What has this command done? How many observations are in Majordata?

(d)      Find the least squares regression model for these data and carefully interpret the slope in context.

summary(model3 <- lm(salary ~ semesters, data = Majordata))

(d2) Does this appear to be a strong association? How are you deciding?

 

Now we want to fit a model that includes both each graduate's semesters and the average semesters for their major, so we need a dataset that contains both variables.  Try

newsaldata <- merge(saldata, Majordata, by.x="major", by.y="Group.1", suffixes=c("",".G"))

head(newsaldata, 10)

and/or

saldata$avgsem = ave(saldata$semesters, saldata$major)

head(saldata, 10)

with(saldata, plot(salary ~ semesters, col = avgsem))
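To see what ave() returns, here is a small sketch on a made-up vector (sem and major below are hypothetical toy values, not saldata). Each observation gets the mean of its own group, so the result has the same length as the input.

```r
# Toy data: five graduates in two majors (hypothetical values)
sem   <- c(8, 10, 9, 11, 13)
major <- c("A", "A", "B", "B", "B")

# ave() repeats each group's mean for every member of that group
avg <- ave(sem, major)
print(avg)

# tapply() gives one mean per group instead (the aggregated view)
print(tapply(sem, major, mean))
```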

 

(e)      Now fit

summary(model4 <- lm(salary ~ semesters + avgsem, data = saldata))

Where have you seen the coefficient of semesters before?  Why?

 

(f)        In model 4, how do we interpret the coefficient of avgsem?  (Hint: Why is it not the same as in (d)?  Can you hold the other variable in the model constant? How are this coefficient and the one in (d) related?)

 

Now “grand mean center” the semesters variable and use it in the model instead of semesters:

semesters.c <- saldata$semesters - mean(saldata$semesters)

summary(model5 <- lm(salary ~ semesters.c + avgsem, data = saldata))

(g)      Have the coefficients changed?  Why or why not?

 

Now “group mean center” the semesters variable and use it in the model instead of semesters:

# create a "deviation" variable

saldata$dev = saldata$semesters - saldata$avgsem

summary(model6 <- lm(salary ~ dev + avgsem, data = saldata))

Which coefficient(s) have changed from model 4?  How do you now interpret each slope coefficient? (What is going on here?)
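As a rough illustration of what group mean centering does, the sketch below builds the deviation variable on made-up values (sem and major are hypothetical, not saldata) and checks two of its properties.

```r
# Toy data: five graduates in two majors (hypothetical values)
sem   <- c(8, 10, 9, 11, 13)
major <- c("A", "A", "B", "B", "B")

avg <- ave(sem, major)   # group mean attached to each observation
dev <- sem - avg         # within-group deviation

# Deviations sum to zero within every group
print(tapply(dev, major, sum))

# The deviation variable is uncorrelated with the group means
print(cor(dev, avg))
```

Keep these two properties in mind when comparing the model 6 output with model 4.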

 

You may want to organize your output using

model1$coefficients #semesters (model1)

model2$coefficients #semesters + major (a)

model3$coefficients #aggregated  (d)

model4$coefficients #semesters + avg semesters (e)

model5$coefficients #semesters centered + avg semesters (g)

model6$coefficients #semesters group mean centered + avg semesters (last question)

 

 

2) Cal Poly student researchers wanted to assess the impact of wearing a swim cap and the type of swim stroke (freestyle, breaststroke, backstroke, and butterfly) on 25-yard lap times (Basurto, Frattone, & Garcia, 2015). Swimmers at the campus recreation center were recruited and confirmed they were comfortable with all four strokes at that distance. Four swimmers were randomly assigned to each of the eight conditions in random order, giving each swimmer one minute to rest between laps. The data are in swimdata.txt.

swimdata = read.table("http://www.rossmanchance.com/stat414F20/data/swimdata.txt", sep="\t", header=TRUE)

head(swimdata) #Notice how R has named the Time variable

(a) Fit a model to predict swim time from “cap” and “stroke type” (use lm).

(b) Is there statistically significant (meaning give me a p-value) swimmer-to-swimmer variation?

(c) Is there substantial correlation among the four observations for these swimmers (meaning give me the ICC)?

(d) Now add the swimmers as fixed effects. Remember to make sure R recognizes ID as a factor. Include both the “summary” and the “anova” output for your model.  How many parameters were estimated by this model?  How did adding swimmer impact the significance of Cap and Stroke? The slope coefficients? Is this what you might have predicted?

(e) Fit the model from (d) using swimmer as a random effect (lmer requires the lme4 package):

library(lme4)

summary(model4 <- lmer(Time.sec. ~ Stroke + Cap +  (1 | ID), data = swimdata))

How many parameters are estimated by this model?

(f) Use confint(model4) and tell me what/how you learn about the significance of the swimmer-to-swimmer variability.

 

3) Brooks et al. (2008) studied incentives to improve adult literacy. Twenty-eight classes were assigned either to the treatment group (participants received a £5 (US $10) M&S voucher for each class they attended) or to a control group. The main outcome of interest was the number of class sessions attended.

adultlit <- read.table("https://www.rossmanchance.com/stat414/data/adultlit.txt", header=TRUE)

model1 <- lm(sessions ~ group, data = adultlit); summary(model1)

(a) What are the estimated treatment effect (0 = intervention, 1 = control), its standard error, and the t-statistic?

(b) What model assumption of this analysis is violated by this study design? Why do you think the data were collected this way?

(c) Calculate and interpret the intraclass correlation coefficient. (Show your work.)

Hints: model2 <- lm(sessions ~ factor(classid), data = adultlit); anova(model2)

mean(as.numeric(table(adultlit$classid)))
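One common ANOVA-based estimate of the ICC uses the between-group and within-group mean squares: ICC = (MSB - MSW) / (MSB + (n - 1) MSW), where n is the (average) group size. The sketch below applies it to simulated, balanced data; the names grp, y, msb, and msw are hypothetical, and this estimator is an assumption about the intended approach, not the only option.

```r
# Simulated clustered data: 10 groups of 6 (hypothetical, not adultlit)
set.seed(1)
n_groups <- 10
n_per    <- 6
grp <- factor(rep(seq_len(n_groups), each = n_per))
group_effect <- rnorm(n_groups, sd = 2)                 # between-group variation
y <- group_effect[as.integer(grp)] + rnorm(n_groups * n_per, sd = 1)

# One-way ANOVA table with the grouping factor as the only predictor
tab <- anova(lm(y ~ grp))
msb <- tab["grp", "Mean Sq"]         # between-group mean square
msw <- tab["Residuals", "Mean Sq"]   # within-group mean square

# ANOVA estimate of the intraclass correlation
icc <- (msb - msw) / (msb + (n_per - 1) * msw)
print(icc)
```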

(d) So instead of thinking we have 152 observations, we could analyze the data at the classroom level.  How many observations would we have at the classroom level?

Hint: length(unique(adultlit$classid))

Instead, the “effective sample size” will be something in between, depending on how strongly correlated the observations are within a class. We can compute the effective sample size as $n_{\text{eff}} = \dfrac{nI}{1 + (n - 1)\rho}$, where I is the number of groups, n is the (common) sample size for each group, and $\rho$ is the ICC.

(e) If the ICC = 0, what is the effective sample size? If the ICC = 1, what is the effective sample size?  Explain the intuition for each.

With unequal sample sizes, we will use the average sample size, e.g., mean(as.numeric(table(adultlit$classid)))

(f) Compute the effective sample size for this study.
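The calculation can be wrapped in a small helper. This sketch assumes the standard design-effect form, effective n = nI / (1 + (n - 1) ICC); the function name effective_n and the input values are hypothetical and are not taken from the adultlit data.

```r
# Effective sample size under the assumed design-effect formula
# n_groups    = I, the number of groups
# n_per_group = n, the common (or average) group size
# icc         = intraclass correlation coefficient
effective_n <- function(n_groups, n_per_group, icc) {
  (n_groups * n_per_group) / (1 + (n_per_group - 1) * icc)
}

# Hypothetical example: 10 groups of 5 with ICC = 0.25
n_eff <- effective_n(n_groups = 10, n_per_group = 5, icc = 0.25)
print(n_eff)   # 50 / (1 + 4 * 0.25) = 25
```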

Of course, the sample size matters because it enters our estimates of the standard errors of the coefficients.  We could simply adjust those values.

(g) Think of $SE(\hat{\beta}_1)$ as a form of $s/\sqrt{n}$ and recompute the estimate of the standard error of the slope of “group” in model 1 using the effective sample size. (Show your work)

(h) Using the standard error estimate from (g), is the treatment effect still statistically significant? (Show your work)
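If the standard error behaves like a quantity proportional to 1/sqrt(n), then swapping the nominal sample size for the effective one rescales it by sqrt(n / n_eff). The sketch below shows that arithmetic with made-up inputs; estimate, se_naive, n_total, and n_eff are hypothetical values, and the scaling rule itself is an assumption about the intended adjustment.

```r
# Hypothetical inputs (not taken from the adultlit output)
estimate <- 3.0    # slope estimate
se_naive <- 1.2    # standard error that treats all observations as independent
n_total  <- 50     # nominal number of observations
n_eff    <- 25     # effective sample size

# Assumed adjustment: SE is proportional to 1 / sqrt(n)
se_adj <- se_naive * sqrt(n_total / n_eff)

# Recomputed t-statistic using the adjusted standard error
t_adj <- estimate / se_adj
print(c(se_adj = se_adj, t_adj = t_adj))
```

The adjusted standard error is larger than the naive one whenever the effective sample size is smaller than the nominal one, so the t-statistic shrinks.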

 

Alternatively, we can fit a model that takes the different classes into account. 

(i) Would you recommend treating the classid variable as fixed or random? Explain your reasoning.

(j) Run the model you suggested in (i) and summarize how this impacts the estimate of the standard error of the “group” coefficient: is it closer to (a) or (g)?