Stat 414 – HW 3
Due by midnight, Friday, Oct. 11
Please submit each problem as a separate file in Canvas (doc or pdf format). You should include all relevant output.
1) Recall the data on 24 college graduates, including their starting salary (in thousands of dollars), how many semesters they spent in college, and their major.
saldata = read.table("http://www.rossmanchance.com/stat414/data/saldata.txt", sep = "\t", header = TRUE)
summary(model1 <- lm(salary ~ semesters, data = saldata))
(a) Fit a model that includes both semesters and major.
summary(model2 <- lm(salary ~ semesters + major, data = saldata))
Provide an interpretation of the coefficient of semesters in this model.
(b) Is major a statistically significant predictor of salary after adjusting for number of semesters? State Ho and Ha using appropriate symbols, and include an appropriate test statistic, degrees of freedom, and p-value. Interpret your results in context. Be sure to include any output you create to answer this question.
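One common way to test a multi-category predictor after adjusting for another variable is a partial F-test comparing nested models. The following is a minimal sketch on simulated data; the data frame `toy` and its columns `x`, `g`, and `y` are made up for illustration and are not the assignment data.

```r
# Simulated data, for illustration only (not the saldata file)
set.seed(414)
toy <- data.frame(
  x = rnorm(24, mean = 8, sd = 1),
  g = factor(rep(c("A", "B", "C"), each = 8))
)
toy$y <- 30 + 2 * toy$x + c(0, 3, 6)[as.integer(toy$g)] + rnorm(24)

reduced <- lm(y ~ x, data = toy)       # without the grouping variable
full    <- lm(y ~ x + g, data = toy)   # with the grouping variable

# Partial F-test: do the indicator terms for g jointly improve the fit?
ftest <- anova(reduced, full)
print(ftest)

# Numerator df = number of added coefficients (3 groups -> 2 indicators)
ftest$Df[2]       # 2
# Denominator df = residual df of the full model (24 - 4 coefficients)
ftest$Res.Df[2]   # 20
```

The F statistic and p-value appear in the second row of the printed table.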
Let’s aggregate this information to the different majors.
Majordata = aggregate(saldata[, 1:2], list(saldata$major), mean)
head(Majordata)
plot(Majordata$salary ~ Majordata$semesters)
(c) What has this command done? How many observations are in Majordata?
(d) Find the least squares regression model for these data and carefully interpret the slope in context.
summary(model3 <- lm(salary ~ semesters, data = Majordata))
(d2) Does this appear to be a strong association? How are you deciding?
Now we want to fit a model that includes both the individual semesters and the major-level average semesters, so we need to create a dataset that contains both of these variables. Try
newsaldata <- merge(saldata, Majordata, by.x = "major", by.y = "Group.1", suffixes = c("", ".G"))
head(newsaldata, 10)
and/or
saldata$avgsem = ave(saldata$semesters, saldata$major)
head(saldata, 10)
with(saldata, plot(salary ~ semesters, col = avgsem))
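To see the difference between collapsing a dataset to group means and attaching group means to every row, here is a small sketch on made-up data. The data frame `toy` and its columns are hypothetical and are not the assignment data.

```r
# Toy data, for illustration only
toy <- data.frame(
  y = c(10, 12, 20, 22, 24),
  x = c(1, 3, 5, 7, 9),
  g = c("A", "A", "B", "B", "B")
)

# aggregate() collapses the data to one row per group
grp <- aggregate(toy[, c("y", "x")], list(toy$g), mean)
grp
#   Group.1  y x
# 1       A 11 2
# 2       B 22 7

# ave() keeps every original row and repeats the group mean on each
toy$xbar <- ave(toy$x, toy$g)
toy$xbar   # 2 2 7 7 7
```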
(e) Now fit
summary(model4 <- lm(salary ~ semesters + avgsem, data = saldata))
Where have you seen the coefficient of semesters before? Why?
(f) In model 4, how do we interpret the coefficient of avgsem? (Hint: Why is it not the same as in (d)? Can you hold the other variable in the model constant? How are this coefficient and the one in (d) related?)
Now “grand mean center” the semesters variable and use it in the model instead of semesters.
semesters.c <- saldata$semesters - mean(saldata$semesters)
summary(model5 <- lm(salary ~ semesters.c + avgsem, data = saldata))
(g) Have the coefficients changed? Why or why not?
Now “group mean center” the semesters variable and use it in the model instead of semesters.
#create a “deviation” variable
saldata$dev = saldata$semesters - saldata$avgsem
summary(model6 <- lm(salary ~ dev + avgsem, data = saldata))
(h) Which coefficient(s) have changed from model 4? How do you now interpret each slope coefficient? (What is going on here?)
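Group mean centering re-expresses the same fitted model with different predictors. The sketch below, on simulated data with hypothetical names (`x`, `xbar`, `dev`, `g`), checks two algebraic identities that hold whenever `dev = x - xbar`. It is meant as a way to check your reasoning, not as a substitute for interpreting the assignment output.

```r
# Simulated data, for illustration only: 4 groups of 6
set.seed(1)
g <- rep(1:4, each = 6)
x <- rnorm(24, mean = g)                      # group means of x differ
y <- 5 + 1.5 * x + 2 * ave(x, g) + rnorm(24)
d <- data.frame(y, x, g, xbar = ave(x, g))
d$dev <- d$x - d$xbar                         # group-mean-centered x

raw <- coef(lm(y ~ x + xbar, data = d))       # raw predictor + group mean
gmc <- coef(lm(y ~ dev + xbar, data = d))     # centered predictor + group mean

# Identity 1: the within-group slope is unchanged by centering
same_within <- all.equal(unname(raw["x"]), unname(gmc["dev"]))

# Identity 2: the group-mean slope after centering equals the sum of the
# two slopes from the uncentered fit
sum_rule <- all.equal(unname(raw["x"] + raw["xbar"]), unname(gmc["xbar"]))

c(same_within, sum_rule)
```

Both checks follow from rewriting b1*x + b2*xbar as b1*(x - xbar) + (b1 + b2)*xbar.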
You may want to organize your output using
model1$coefficients #semesters (model1)
model2$coefficients #semesters + major (a)
model3$coefficients #aggregated (d)
model4$coefficients #semesters + avg semesters (e)
model5$coefficients #semesters centered + avg semesters (g)
model6$coefficients #semesters group mean centered + avg semesters (h)
2) Cal Poly student researchers wanted to assess the impact of
wearing a swim cap and the type of swim stroke (freestyle, breaststroke,
backstroke, and butterfly) on 25-yard lap times (Basurto, Frattone, &
Garcia, 2015). Swimmers at the campus recreation center were recruited and
confirmed they were comfortable with all four strokes at that distance. Four
swimmers were randomly assigned to each of the eight conditions in random
order, giving each swimmer one minute to rest between laps. The data are in
swimdata.txt.
swimdata = read.table("http://www.rossmanchance.com/stat414F20/data/swimdata.txt", sep = "\t", header = TRUE)
head(swimdata) #Notice how R has named the Time variable
(a) Fit a model to predict swim time
from “cap” and “stroke type” (use lm).
(b) Is there statistically significant
(meaning give me a p-value) swimmer-to-swimmer variation?
(c) Is there substantial correlation among
the four observations for these swimmers (meaning give me the ICC)?
(d) Now add the swimmers as fixed effects. Remember to make sure R recognizes ID as a factor. Include both the “summary” and the “anova” output of your model. How many parameters were estimated by this model? How did adding swimmer impact the significance of Cap and Stroke? The slope coefficients? Is this what you might have predicted?
(e) Fit the model from (d) using swimmer as a random effect.
library(lme4) #provides lmer
summary(model4 <- lmer(Time.sec. ~ Stroke + Cap + (1 | ID), data = swimdata))
How many parameters are estimated by this model?
(f) Use confint(model4) and tell me what/how you learn about
the significance of the swimmer-to-swimmer variability.
3) Brooks et al. (2008) studied incentives to improve adult literacy. Twenty-eight classes were assigned either to the treatment group (participants received a £5 (US $10) M&S voucher for each class they attended) or to a control group. The main outcome of interest was the number of class sessions attended.
adultlit <- read.table("https://www.rossmanchance.com/stat414/data/adultlit.txt", header = TRUE)
model1 <- lm(sessions ~ group, data = adultlit); summary(model1)
(a) What are the estimated treatment effect (0 = intervention, 1 = control), its standard error, and the t-statistic?
(b) What model assumption of this analysis is violated
by this study design? Why do you think the data were collected this way?
(c) Calculate and interpret the intraclass correlation
coefficient. (Show your work.)
Hints: model2 <- lm(sessions ~ factor(classid), data = adultlit); anova(model2)
mean(as.numeric(table(adultlit$classid)))
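The hints above point toward estimating the ICC from one-way ANOVA mean squares. Here is a minimal sketch on made-up data, assuming the common ANOVA-based estimator (MSB - MSW) / (MSB + (n - 1) MSW); check your course notes for the exact form expected. The data frame `toy` is hypothetical.

```r
# Toy data, for illustration only: 3 groups of 4 observations
toy <- data.frame(
  y = c(1, 2, 3, 4,  5, 6, 7, 8,  9, 10, 11, 12),
  g = factor(rep(1:3, each = 4))
)

tab <- anova(lm(y ~ g, data = toy))
msb <- tab["g", "Mean Sq"]              # between-group mean square
msw <- tab["Residuals", "Mean Sq"]      # within-group mean square
n   <- mean(as.numeric(table(toy$g)))   # (average) group size

# ANOVA-based ICC estimator (an assumption about the intended formula)
icc <- (msb - msw) / (msb + (n - 1) * msw)
icc   # roughly 0.90 for these toy values
```

In this toy example almost all of the variation is between groups, so the estimated ICC is close to 1.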
(d) So instead of thinking we have 152 observations, we
could analyze the data at the classroom level.
How many observations would we have at the classroom level?
Hint: length(unique(adultlit$classid))
Instead, the “effective sample size” will be something in between, depending on how strongly correlated the observations are within a class. We can compute the effective sample size as $n_{\text{eff}} = \dfrac{I n}{1 + (n - 1)\,\text{ICC}}$, where I is the number of groups and n is the (common) sample size for each group.
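As a quick way to experiment with this quantity, here is a small helper, assuming the usual design-effect form I*n / (1 + (n - 1)*ICC). The function name `n_eff` and the input values are made up for illustration.

```r
# Effective sample size under the usual design-effect form (an assumption):
# I groups of common size n, with intraclass correlation icc
n_eff <- function(I, n, icc) {
  (I * n) / (1 + (n - 1) * icc)
}

# Made-up inputs: 10 groups of 5 observations with ICC = 0.5
n_eff(I = 10, n = 5, icc = 0.5)   # 50 / 3, about 16.7
```

Trying a few ICC values between 0 and 1 shows how the result moves between the total number of observations and the number of groups.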
(e) If the ICC = 0, what is the effective sample size? If the
ICC = 1, what is the effective sample size?
Explain the intuition for each.
With unequal sample sizes, we will use the average sample size,
e.g., mean(as.numeric(table(adultlit$classid)))
(f) Compute the effective sample size for this study.
Of course, the sample size matters because it drives our estimates of the standard errors of the coefficients. We could just adjust those values.
(g) Think of the standard error as a form of $s/\sqrt{n}$ and recompute the estimate of the standard error of the slope of “group” in model 1 using the effective sample size. (Show your work.)
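One way to carry out this kind of adjustment is sketched below, assuming the standard error scales like 1 over the square root of the sample size. All numbers are made up for illustration and are not taken from the adultlit data.

```r
# Made-up numbers, for illustration only
se_naive <- 2.0   # standard error computed as if all N observations were independent
N        <- 100   # nominal number of observations
N_eff    <- 25    # effective sample size

# If SE is proportional to 1 / sqrt(sample size), replacing N with N_eff
# inflates the standard error by a factor of sqrt(N / N_eff)
se_adj <- se_naive * sqrt(N / N_eff)
se_adj            # 4

# The t-statistic shrinks by the same factor
estimate <- 6.0
t_adj <- estimate / se_adj
t_adj             # 1.5
```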
(h) Using the estimate from (g), is the
treatment still statistically significant? (Show your work)
Alternatively, we can fit a model that takes the different
classes into account.
(i) Would you recommend treating the classid variable as fixed
or random? Explain your reasoning.
(j) Run the model you suggested in (i) and summarize how this impacts the estimate of the standard error of the group coefficient: is it closer to (a) or (g)?