Stat 414 - HW 3

Due by 8am, Monday Oct. 13

Please submit each problem as a separate file in Canvas (doc or pdf format). You should include all relevant output.

1) The lq2002 data file contains data from a special issue of Leadership Quarterly. There are 27 columns, 2,042 observations, and 49 army companies. Our response variable is going to be HOSTILE, the "hostility scale score." Let's first use the Leadership Climate Scale Score (LEAD). Higher values of LEAD indicate a perception of better leadership.

(a) Create a scatterplot (add some jitter because of the overlapping values) and determine the correlation coefficient between LEAD and HOSTILE.

library(ggplot2) # needed for ggplot()

ggplot(lq2002, aes(x = LEAD, y = HOSTILE)) +
  geom_jitter(width = 0.2, height = 0.2, alpha = 0.5) +
  theme_bw()

with(lq2002, cor(LEAD, HOSTILE))

Optional:

install.packages("GGally")

GGally::ggpairs(lq2002[,c("LEAD", "HOSTILE")])

(b) Model 1: Fit a linear model predicting HOSTILE from LEAD. Include your output but also report the t-statistic, df, and p-value for the slope of LEAD. Is the relationship statistically significant?

(c) Model 2: Fit a model including both LEAD and TSIG = "perception of task significance." Interpret the coefficient of LEAD in this model.

Now let's analyze the data at the company level (using company as the observational unit). So first we need to aggregate the data to the company level.

glq2002<-aggregate(lq2002, by=list(lq2002$COMPID),mean)
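To see what this aggregate() call returns, here is a minimal sketch on made-up data. The toy data frame and its values are hypothetical and are not taken from lq2002; only the call pattern matches the line above.

```r
# Hypothetical toy data: two companies with three soldiers each (NOT the lq2002 data)
toy <- data.frame(
  COMPID  = c(1, 1, 1, 2, 2, 2),
  LEAD    = c(2.0, 3.0, 4.0, 3.0, 4.0, 5.0),
  HOSTILE = c(1.5, 1.0, 0.5, 1.0, 0.5, 0.0)
)

# Same call pattern as above: average every column within each company
gtoy <- aggregate(toy, by = list(toy$COMPID), mean)

# The result has one row per company; the Group.1 column holds the grouping value
print(gtoy)
nrow(gtoy)
```

Each row of the aggregated data frame is now a company, so any model fit to it treats the company as the observational unit.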

(d) Now recreate the scatterplot, correlation coefficient, and linear model predicting HOSTILE from LEAD. Include your output but also report your t-statistic, df, and p-value for the slope (Model 3).

(e) Would you consider the company-level relationship between HOSTILE and LEAD stronger or weaker than the individual-level relationship? More or less significant? Would you consider the "impact" of increasing leadership perception larger or smaller at the company level vs. the individual level? Justify your answers and discuss any discrepancies.

(f) State a research question answered by the analysis in (b). State a research question answered by the analysis in (e).

(g) Model 4: Using lq2002, fit a model that includes both LEAD and COMPID (as a factor).

summary(model4<- lm(HOSTILE ~ LEAD + as.factor(COMPID), data = lq2002))

How many parameters are estimated by this model? Interpret the slope of LEAD in this model. Has it changed (larger or smaller or not really) from (b)?

Hint: summary(model4)$coefficients[1:2, ]

The original dataset has the group level information entered as well. (Notice how the values of GLEAD are all the same within a specific company.)

(h) Model 5: Fit a model that uses both LEAD and GLEAD as predictors. Compare the slope coefficient of LEAD in this model to Model 4. Explain why they are related this way.

But the slope of GLEAD is not the same as in Model 3. In particular, how can we change the group mean while holding the LEAD value of every individual in the group constant?

Model 6: "Group mean center" the LEAD variable and fit a model with this "deviation" variable and the GLEAD variable.

#create a "deviation" variable

lq2002$dev = lq2002$LEAD - lq2002$GLEAD

summary(model6 <- lm(HOSTILE ~ dev + GLEAD, data = lq2002))

Which coefficient(s) have changed from Model 5, and how specifically? How do you now interpret each slope coefficient? (What is going on here?)
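If a dataset does not already contain the group means, a group-mean-centered variable can be built by hand. The following is a minimal sketch on made-up data; the toy data frame is hypothetical, and ave() is used here as one convenient base R way to get each row's group mean.

```r
# Hypothetical toy data: two companies with three soldiers each (NOT the lq2002 data)
toy <- data.frame(
  COMPID = c(1, 1, 1, 2, 2, 2),
  LEAD   = c(2, 3, 4, 3, 4, 5)
)

# ave() returns the group mean for each row, aligned with the original rows
toy$GLEAD <- ave(toy$LEAD, toy$COMPID)

# Deviation of each individual from his or her own company mean
toy$dev <- toy$LEAD - toy$GLEAD

# By construction, the deviations sum to zero within every company
dev_sums <- tapply(toy$dev, toy$COMPID, sum)
print(toy)
print(dev_sums)
```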

2) Cal Poly student researchers wanted to assess the impact of wearing a swim cap and the type of swim stroke (freestyle, breaststroke, backstroke, and butterfly) on 25-yard lap times (Basurto, Frattone, & Garcia, 2015). Swimmers at the campus recreation center were recruited and confirmed they were comfortable with all four strokes at that distance. Four swimmers were randomly assigned to each of the eight conditions in random order, giving each swimmer one minute to rest between laps. The data are in swimdata.txt.

swimdata = read.table("http://www.rossmanchance.com/stat414F20/data/swimdata.txt", sep = "\t", header = TRUE)

head(swimdata) #Notice how R has named the Time variable

(a) Fit a linear model to predict swim time from "cap" and "stroke type" (use lm). Include both the "summary" and "ANOVA table" output. Interpret the coefficient of Cap in your model. Which stroke is estimated to have quicker swim times? How are you deciding?

(b) Fit a one-way ANOVA with the swimmers. Calculate omega-squared to estimate the proportion of variation in swim times that is due to the different swimmers (Show the values substituted into the formula, being clear where you get your values).
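For reference, one common form of omega-squared for a one-way ANOVA is shown below. This is an assumption about which version is intended; if your course notes give a different form, use that one. All quantities come from the ANOVA table: the between-group (swimmer) sum of squares and degrees of freedom, the within-group (residual) mean square, and the total sum of squares.

```latex
\hat{\omega}^2 = \frac{SS_{\text{between}} - df_{\text{between}} \cdot MS_{\text{within}}}{SS_{\text{total}} + MS_{\text{within}}}
```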

(d) Now add the swimmers to the model in (a). Remember to make sure R recognizes ID as a factor. Include both the "summary" and the "anova" output for your model. How many parameters were estimated by this model? How did adding swimmer impact the significance of Cap and Stroke? The slope coefficients? Is this what you might have predicted?

(e) Is either stroke type or cap a significant predictor of swim time after adjusting for any swimmer-to-swimmer differences?

3) Brooks et al. (2008) studied incentives to improve adult literacy. Twenty-eight classes were assigned either to a treatment group (participants received a £5 (US $10) M&S voucher for each class they attended) or to a control group. The main outcome of interest was the number of class sessions attended.

adultlit <- read.table("https://www.rossmanchance.com/stat414/data/adultlit.txt", header=TRUE)

model1 <- lm(sessions ~ group, data = adultlit); summary(model1)

(a) Report and interpret the coefficient of group (0 = intervention, 1 = control), and report its standard error and t-statistic.

(b) What basic regression model assumption (LINE) is violated by this study design? Why do you think the data were collected this way?

Hints: model2 <- lm(sessions ~ factor(classid), data = adultlit); anova(model2)

mean(as.numeric(table(adultlit$classid)))

When we have unequal group sizes, to keep our formulas simple, we often substitute in the average group size.
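The hints above can be combined to estimate the intraclass correlation (ICC) from the one-way ANOVA table. The sketch below uses made-up data and assumes the usual ANOVA-based estimator, ICC = (MSB - MSW) / (MSB + (n - 1) MSW), with the average group size substituted for n. The toy data frame and the names msb, msw, n_bar, and icc are hypothetical.

```r
# Hypothetical toy data: three classes with two students each (NOT the adultlit data)
toy <- data.frame(
  classid  = rep(c("A", "B", "C"), each = 2),
  sessions = c(2, 4, 6, 8, 10, 12)
)

# One-way ANOVA with class as the only predictor
fit <- lm(sessions ~ factor(classid), data = toy)
tab <- anova(fit)

# Row 1 is the between-class line, row 2 is the residual (within-class) line
msb <- tab[["Mean Sq"]][1]
msw <- tab[["Mean Sq"]][2]

# Average group size, as in the hint above
n_bar <- mean(as.numeric(table(toy$classid)))

# ANOVA-based ICC estimate (assumed form; check against your notes)
icc <- (msb - msw) / (msb + (n_bar - 1) * msw)
icc
```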

(d) So instead of thinking we have 152 observations, we could analyze the data at the classroom level. How many observations would we have at the classroom level?

Hint: length(unique(adultlit$classid))

The "effective sample size" will be something in between the number of classes and the number of individuals, depending on how strongly correlated the observations are within a class. We can compute the effective sample size as $n_{\text{eff}} = \frac{I n}{1 + (n - 1)\,ICC}$, where I is the number of groups and n is the (common) sample size for each group.

(e) If the ICC = 0, what is the effective sample size? If the ICC = 1, what is the effective sample size? Explain the intuition for each.

With unequal sample sizes, we will use the average sample size, e.g., mean(as.numeric(table(adultlit$classid)))

(f) Compute the effective sample size for this study.
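As a check on your arithmetic, the calculation can be wrapped in a small helper. This is a minimal sketch that assumes the usual design-effect form, effective n = I n / (1 + (n - 1) ICC); the function name effective_n and the example inputs are hypothetical and are not the values for this study.

```r
# Effective sample size under the assumed design-effect formula
#   I   = number of groups
#   n   = (average) number of observations per group
#   icc = intraclass correlation coefficient
effective_n <- function(I, n, icc) {
  (I * n) / (1 + (n - 1) * icc)
}

# Made-up example: 10 groups of 5 with ICC = 0.25
# 50 / (1 + 4 * 0.25) = 50 / 2 = 25
example_neff <- effective_n(I = 10, n = 5, icc = 0.25)
example_neff
```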

Of course, the sample size matters because it determines our estimates of the standard errors of the coefficients. We could simply adjust those values.

(g) Think of the standard error as a form of $s/\sqrt{n}$ and recompute the estimate of the standard error of the slope of "group" in model 1 using the effective sample size. (Show your work, e.g., how does changing the sample size change the calculated value?)

(h) Using the estimate from (g), is the treatment still statistically significant? (Show your work)

Alternatively, we can fit a model that takes the different classes into account.

(i) Add the class variable into the model and summarize how this impacts the estimated standard error of the coefficient of group. Is it closer to (a) or (g)?