Stat 301 - HW 7

Due noon, Friday, March 6

If you submit your assignment in Canvas, remember to upload separate files for each problem and to put your name inside each file. Remember to show your work/calculations/computer details and to integrate this output into the body of the solution.

FYI: Solutions to Quiz 24

1) For the study on elephants’ walking distances, we considered the two groups of elephants as random samples from their respective populations. We considered these populations to be large, but had no access to the actual population data. The sample distributions were not particularly normal and the sample sizes were not particularly large, so this meant we were a little skeptical about the validity of the t-procedures. How can we investigate this? What can we do if we don’t want to use the t-procedures? If we had access to the populations, we could carry out a simulation to investigate whether the distribution of t-statistics follows a t-distribution, but what do we do when we (more realistically!) only have the samples? One way to create “populations” to sample from is to repeat our samples infinitely many times! This is equivalent to sampling with replacement. To estimate the sample-to-sample variation for a confidence interval, we can resample from each sample separately. To estimate the sample-to-sample variation for a test of significance, we can pool the two samples together and resample from that combined (infinite) sample. Keep in mind that you will always use the same sample sizes as the observed data.
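The two resampling schemes described above can be sketched in R. This is only a minimal illustration, not the applet's implementation: the vectors `asian` and `african` hold simulated values (not the elephant data), and the helper names `boot_separate` and `boot_pooled` are hypothetical.

```r
# Simulated walking distances; NOT the actual elephant data.
set.seed(301)
asian   <- round(rnorm(23, mean = 5, sd = 1.5), 2)
african <- round(rnorm(33, mean = 6, sd = 2.0), 2)

# Scheme 1 (confidence intervals): resample each sample separately,
# with replacement, keeping the original sample sizes.
boot_separate <- function(x, y) {
  list(x = sample(x, size = length(x), replace = TRUE),
       y = sample(y, size = length(y), replace = TRUE))
}

# Scheme 2 (tests of significance): pool both samples, then draw two
# groups of the original sizes, with replacement, from the pool.
boot_pooled <- function(x, y) {
  pool <- c(x, y)
  list(x = sample(pool, size = length(x), replace = TRUE),
       y = sample(pool, size = length(y), replace = TRUE))
}

one_sep  <- boot_separate(asian, african)
one_pool <- boot_pooled(asian, african)

# Difference in means for one bootstrap sample under each scheme.
mean(one_sep$x)  - mean(one_sep$y)
mean(one_pool$x) - mean(one_pool$y)
```

Because sampling is with replacement, some observations typically appear more than once in a bootstrap sample and others not at all; `table(one_sep$x)` is one way to spot the repeats.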

(a) Open the Two sample bootstrapping applet. Paste the elephant data into the Sample data box on the left and press Use data. Confirm the values for the sample means and standard deviations. Check the Show Sampling Options box and press Bootstrap Samples. Verify you have selected 23 Asian elephants and 33 African elephants. Look at the Plot and/or Data windows. Include a screen capture in which you identify any elephants that were selected more than once, and explain how you can tell.

(b) Press Bootstrap Samples again. Did you find the same sample means or does this procedure model sample-to-sample variation from the random sampling process?

(c) Include a screen capture of this sample selection (including the mean and SD values, Selected Summary Statistics) and use this bootstrap sample data to calculate a t-statistic (show your work). Use the pull-down menu to change the Statistic to t-statistic to confirm your calculation (the blue one).

(d) Now (with t-statistic selected), take at least 1,000 bootstrap samples. Check the box to overlay the t-distribution. Include a screen capture. Does this simulation analysis support the use of the t-distribution to calculate the p-value? Explain your answer.

(e) Use the pull-down menu to switch the statistic back to the difference in means. Where is this distribution centered? Why does this center make sense?

(f) What is the standard deviation of the difference in sample means distribution? How does it compare to what we predicted with the Central Limit Theorem? (Cite both values.)

(g) Use the observed difference in sample means and this standard deviation (from the bootstrap distribution of the difference in sample means) to approximate a 95% confidence interval for the difference in population means (estimate ± 2 SE). How does it compare to the 95% t-confidence interval? (Cite both intervals.)

(h) Uncheck and then recheck the Show Sampling Options box (to clear out the previous simulation results) and return the Number of Samples to 1. This time, check the Pooled box. Press the Bootstrap Samples button until you find a resampling that demonstrates how this method pools the two samples together and then selects a group of 23 and a group of 33 (identify some elephants that change species!). Now generate a bootstrap distribution (at least 1,000 bootstrap samples). What is the mean of this bootstrap distribution, and why does the value make sense? How does the standard deviation compare to the one in (f)?

(i) Count how many of the bootstrap differences in sample means are at least as extreme as the observed difference (in either direction) to approximate a two-sided p-value. How does the p-value compare to the t-test p-value?

(j) One large benefit of bootstrapping is it works with statistics other than differences in sample means. Use the pull-down menu to choose the difference in sample medians. Report a two-sided p-value. (Include a screen capture.) How does the p-value compare? Which p-value is smaller and why?
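For anyone who wants to mirror parts (c) through (i) outside the applet, here is a rough R sketch under stated assumptions. The data are simulated (not the elephant data), `t_stat` is a hypothetical helper, and the applet may count "extreme" results differently than the two-sided rule used here.

```r
# Simulated data; NOT the actual elephant data.
set.seed(7)
asian   <- rnorm(23, mean = 5, sd = 1.5)
african <- rnorm(33, mean = 6, sd = 2.0)
obs_diff <- mean(asian) - mean(african)

# Unpooled two-sample t-statistic, as in part (c).
t_stat <- function(x, y) {
  (mean(x) - mean(y)) / sqrt(var(x) / length(x) + var(y) / length(y))
}
t_stat(asian, african)

n_boot <- 1000

# Separate resampling: the SD of these differences estimates the SE.
sep_diffs <- replicate(n_boot,
  mean(sample(asian, replace = TRUE)) - mean(sample(african, replace = TRUE)))
se_boot <- sd(sep_diffs)
ci_2se  <- obs_diff + c(-2, 2) * se_boot   # estimate +/- 2 SE

# Pooled resampling: models the null hypothesis of no difference.
pool <- c(asian, african)
pooled_diffs <- replicate(n_boot,
  mean(sample(pool, length(asian), replace = TRUE)) -
  mean(sample(pool, length(african), replace = TRUE)))

# Two-sided p-value: proportion of resampled differences at least as
# far from zero as the observed difference.
p_two_sided <- mean(abs(pooled_diffs) >= abs(obs_diff))
```

Swapping `mean` for `median` in the pooled step gives the analogous distribution for the difference in sample medians in part (j).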

2) hw7RMarkdown_2.Rmd

A group of Cal Poly students wanted to investigate whether men with children tend to live longer than men without children. They randomly sampled men from the obituaries page on the San Luis Obispo Tribune’s website between June and November 2012. For each man selected, they noted the age at which the person died and whether or not the person had any children.

(a) State appropriate null and alternative hypotheses for testing whether the average lifespan is longer for men with children than for men without children.

(b) Identify and classify the explanatory variable and the response variable in this study.

(d) The data are in ChildrenandLifespan.txt. Use R, JMP, or the Theory-Based Inference applet to create numerical and graphical summaries of the data comparing the two samples. Summarize what they reveal about the shapes, centers, and spreads of the two samples. Explain why the shape of the distribution of the response variable makes sense in this context.

R users: check out

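# Proportion of men in each group, used to scale the box widths
proportion <- table(ChildrenandLifespan$Children) / nrow(ChildrenandLifespan)

# Side-by-side boxplots of age at death by group
boxplot(Age ~ Children, data = ChildrenandLifespan, width = proportion)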

JMP users: check out using Fit Y by X and then selecting Boxplots under Display Options.

(e) Do you consider the t-procedures valid for these data? Explain how you are deciding.

(f) Carry out a two-sample t-test to estimate the p-value for this study. Include your output, including a well-labeled graph of the null distribution with the p-value shaded. Would you reject or fail to reject the null hypothesis at the 5% level of significance?

(g) Calculate a 95% confidence interval for these data. (Interpretation in next question.)

(h) Summarize the conclusions you would draw from this study including significance, estimation, causation, and generalizability. Provide a brief justification for each component.
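R users may find a sketch of the two-sample t-procedures helpful. The data frame `lifespan` below is simulated (not the obituary data) and only borrows the `Children` and `Age` column names; the group labels "no" and "yes" are an assumption, so check the levels in the actual file before choosing the direction of the alternative.

```r
# Simulated data frame; NOT the obituary data.
set.seed(2012)
lifespan <- data.frame(
  Children = factor(rep(c("no", "yes"), times = c(15, 40))),
  Age      = c(round(rnorm(15, 72, 14)), round(rnorm(40, 78, 12)))
)

# Group summaries: sample size, mean, and SD by group.
aggregate(Age ~ Children, data = lifespan,
          FUN = function(a) c(n = length(a), mean = mean(a), sd = sd(a)))

# t.test() subtracts in factor-level order (first level minus second).
levels(lifespan$Children)   # here "no" then "yes", so the difference is no - yes

# One-sided unpooled (Welch) test: "less" matches Ha: mu_no - mu_yes < 0.
test_one_sided <- t.test(Age ~ Children, data = lifespan,
                         alternative = "less")

# Two-sided call, used here only to extract the 95% confidence interval.
ci_95 <- t.test(Age ~ Children, data = lifespan, conf.level = 0.95)$conf.int
```

Note that `t.test()` reports a confidence interval matched to the `alternative` argument, which is why the interval comes from a separate two-sided call.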

3) To investigate an association between violent video games and aggressive behavior, British researchers Hollingdale and Greitemeyer (2014) randomly assigned 49 students from a university in the United Kingdom to play Call of Duty: Modern Warfare (a violent video game) and 52 students to play LittleBigPlanet 2 (a nonviolent/neutral video game). After 30 minutes of playing the video games, the subjects were asked to complete a marketing survey investigating a new hot chili sauce recipe. They were told they were to prepare some chili sauce for a taste tester and that the taste tester “couldn't stand hot chili sauce but was taking part due to good payment.” They were then presented with what appeared to be a very hot chili sauce and asked to spoon what they thought would be an appropriate amount into a bowl for a new recipe. The amount of chili sauce was weighed in grams after the participant left the experiment. The amount of chili sauce was used as a measure of aggression: the more chili sauce, the greater the subject’s aggression.

(a) Does this study involve random sampling or random assignment or both or neither?

(b) Load the VideoAgression data into the Comparing Groups – Quantitative applet. Screen capture the numerical and graphical summaries of the data comparing the two groups. Summarize what they reveal about the shapes, centers, and spreads of the two samples.

(d) Do you think “equal variances” is a reasonable assumption for these data? Explain.

(e) State appropriate null and alternative hypotheses to test whether there is an association between type of video games and level of aggression.

(f) Create a randomization distribution for the difference in means. Include a screen capture. How does the SD of this distribution compare to the “pooled standard deviation” (calculate)?

(g) Use the pull-down menu to select the t-statistic. Report the observed value of the t-statistic for the actual study (this is unpooled if you want to verify its value) and use it to determine the simulation-based and the t-distribution-based p-values. Include a screen capture. How do the p-values compare?

Questions (h)-(j) all use the same simulation results.

(h) Does 10 appear to be a plausible value for the increase in average aggression with more violent games? Specify 10 as the hypothesized difference (or -10; check the direction of subtraction). Set the Number of Shuffles to 1 and select the Plot. Press Shuffle Responses and watch the animation. Explain in your own words what this animation is doing and why.

(i) Set the Number of Shuffles to 1000 and regenerate the randomization distribution of the difference in sample means. How do the values of the mean and standard deviation compare to (f)? Which change(s), and why or why not?

(j) Generate a two-sided p-value (include a screen capture). What conclusion do you draw in context?

(k) Check the box for a 95% confidence interval (lower left). Interpret the interval in context and comment on whether it is consistent with your p-value in (j).

(l) Put the data into Excel or R or JMP and log transform the aggression scores. (Note: there is a zero, which you can turn into 0.5 first.) Use the log-transformed data to calculate a two-sample t-test for an association. (Include a screen capture.) Do the results differ/does one analysis provide stronger evidence of an association than the other? Explain.
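For those working in R, the pooled standard deviation from (f) and the log transformation from (l) might look something like the sketch below. The chili-sauce amounts are simulated (not the study data), and `pooled_sd` and `log_amount` are hypothetical helper names.

```r
# Simulated chili-sauce amounts in grams; NOT the study data.
set.seed(49)
violent    <- c(0, round(rexp(48, rate = 1 / 16), 1))  # includes a zero
nonviolent <- round(rexp(52, rate = 1 / 10), 1)

# Pooled SD: weighted average of the two sample variances, then square root.
pooled_sd <- function(x, y) {
  sqrt(((length(x) - 1) * var(x) + (length(y) - 1) * var(y)) /
       (length(x) + length(y) - 2))
}
sp <- pooled_sd(violent, nonviolent)

# Standard error of the difference in means built from the pooled SD.
sp * sqrt(1 / length(violent) + 1 / length(nonviolent))

# log(0) is -Inf, so replace zeros with 0.5 before transforming.
log_amount <- function(x) log(ifelse(x == 0, 0.5, x))

t_raw <- t.test(violent, nonviolent)
t_log <- t.test(log_amount(violent), log_amount(nonviolent))

# Compare the evidence on the two scales.
c(raw = t_raw$p.value, log = t_log$p.value)

# Back-transforming the log-scale interval gives an interval for a ratio.
exp(t_log$conf.int)
```

A difference of logs is the log of a ratio, which is why exponentiating the interval endpoints yields a multiplicative comparison rather than a difference in grams.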

Possible Extension Assignments

· Dr. Anna Bargagliotti from Loyola Marymount, Thursday, Mar 5 at 11:10 in 38-121. She will be speaking about her NSF-funded research on Undergraduate Data Pathways -- an assessment of how universities provide undergraduate students opportunities to work with data across a variety of disciplines.

· How many elephants are in North American zoos? (Include your reference(s).) Does this impact our analysis? If so, how? (Be specific.)

· For problem 1, give an intuitive explanation for why we would expect the first simulation (unpooled) to give a larger standard deviation for the distribution of differences in sample means than the pooled approach.

· For the data in problem 2, use R (see p. 277) to carry out a randomization test to determine whether the ratio in standard deviations is statistically significant. (Feel free to first explore the difference in means to compare to the applet output.) Interpret your results in context (when/why might this be an interesting research question?).

· For problem 3, use the log-transformed data to obtain a 95% confidence interval and back-transform the endpoints to obtain a confidence interval for the ratio of the population medians. Interpret the interval in context. Explain why this becomes a ratio rather than a difference.

· Check out http://datavizcatalogue.com/blog/box-plot-variations/ and use the data in problem 2 to make boxplots of varying width and some other variations, then comment on the effectiveness of the different displays (e.g., the bee swarm!).

· Check out the Guess the p-value applet. How accurately can you anticipate the p-value from the picture? What other information is important? And how often do you get a small p-value when the null hypothesis is true? Does increasing the sample size change that?
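For the randomization-test extension on the ratio of standard deviations, one possible approach is sketched below. It is not the textbook's code: the data are simulated (not the obituary data), the group labels "no" and "yes" are assumed, and measuring extremeness on the log scale is just one reasonable choice for a two-sided test of a ratio.

```r
# Simulated data; NOT the obituary data.
set.seed(42)
age      <- c(round(rnorm(15, 72, 16)), round(rnorm(40, 78, 11)))
children <- rep(c("no", "yes"), times = c(15, 40))

# Statistic: ratio of group SDs (no / yes).
sd_ratio <- function(y, g) sd(y[g == "no"]) / sd(y[g == "yes"])
obs_ratio <- sd_ratio(age, children)

# Shuffle the group labels to break any association with age,
# recomputing the ratio each time.
shuffled <- replicate(1000, sd_ratio(age, sample(children)))

# Two-sided p-value on the log scale, where a ratio of 1 maps to 0 and
# ratios r and 1/r count as equally extreme.
p_value <- mean(abs(log(shuffled)) >= abs(log(obs_ratio)))
p_value
```

Replacing `sd_ratio` with a difference in group means gives a randomization test that can be checked against the applet output.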