In this chapter, we focus on developing two-sample normal-based procedures, both for comparing two populations and for comparing two treatment groups. We again start with categorical data (Sections 5.1 and 5.2) before proceeding to quantitative data (Sections 5.3-5.5). We encourage you to again focus on the distinction between random sampling (Sections 5.1 and 5.3) and randomization (Sections 5.2 and 5.4), the corresponding impact on the scope of conclusions, and the place of inference in the entire statistical process (collecting data; analyzing data, including choosing the appropriate numerical and graphical summaries; and then, if appropriate, drawing inferential conclusions). Several points in this chapter allow students to combine many of the techniques they have learned throughout the course. Students should also be familiar with the pedagogical style of the course by now. For example, they should not be surprised that they are first asked about study design issues, observational units, and variables, that they then use simulation to obtain an empirical p-value, and that they then examine the empirical sampling/randomization distribution to see whether it can be approximated by a normal probability model. For students who have caught on to this approach, this chapter can proceed fairly quickly and demonstrate to them that they have “learned how to learn” about statistical methods.
Please note that the following timings are especially approximate: we have often jumped around more in this chapter and frequently asked students to work on these investigations outside of class. We do feel there is a tremendous amount of flexibility in how you use this chapter’s investigations in your own courses.
Section 5.1: Two Samples on a Categorical Response
Timing/materials: This chapter makes heavy use of Minitab (all but Investigation 5.1.2). Investigation 5.1.1 may take about 35 minutes as a student exploration but you can also lead students through the questions more quickly. Investigation 5.1.2, with help, may only take 15 minutes. On p. 417-8, students are stepped through using Minitab and an applet for these calculations. Investigation 5.1.3 can take about 50 minutes.
Investigation 5.1.1 helps students see that the binomial model they have been using with categorical data does not apply in analyzing the difference between two groups. The survey being conducted in two different years provides a nice context for emphasizing the independence of the samples. Notice that we still have students begin by considering questions of observational units and variable definitions. Students then use simulation in (j) to see that a normal model does provide a useful approximation for the difference in the sample proportions. As they have often done in the course, students analyze the study data through an empirical p-value from their simulation.
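For instructors who want to preview the simulation idea in (j) outside Minitab, the following is a minimal Python sketch rather than the text's own procedure. The sample sizes, the common proportion p, the observed difference, and the function name simulate_diff_props are all hypothetical values chosen for illustration. It compares the empirical standard deviation of the simulated differences with the usual normal-theory value for two independent samples sharing a common p.

```python
import math
import random
import statistics

def simulate_diff_props(n1, n2, p, reps=5000, seed=1):
    """Simulate the difference in sample proportions for two independent
    samples drawn from populations with a common success probability p.
    All arguments are illustrative assumptions, not values from the study."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(reps):
        x1 = sum(rng.random() < p for _ in range(n1))  # successes in sample 1
        x2 = sum(rng.random() < p for _ in range(n2))  # successes in sample 2
        diffs.append(x1 / n1 - x2 / n2)
    return diffs

n1, n2, p = 200, 250, 0.4  # hypothetical sample sizes and common proportion
diffs = simulate_diff_props(n1, n2, p)

empirical_sd = statistics.stdev(diffs)
theoretical_sd = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))

# Empirical one-sided p-value for a hypothetical observed difference
observed = 0.08
p_value = sum(d >= observed for d in diffs) / len(diffs)

print(f"mean of simulated differences: {statistics.mean(diffs):.4f}")
print(f"empirical SD: {empirical_sd:.4f}  theoretical SD: {theoretical_sd:.4f}")
print(f"one-sided empirical p-value: {p_value:.4f}")
```

A histogram of diffs should look roughly symmetric and bell-shaped, which is the observation the investigation asks students to make before they turn to the normal model.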
Investigation 5.1.2 then steps students through the mathematical derivation of the mean and standard deviation of the sampling distribution of the difference in sample proportions, including a brief discussion of the
Investigation 5.1.3 steps students through the derivation of a confidence interval for an odds ratio. This investigation reinforces and combines many ideas from earlier in the course: case-control studies, the odds ratio as the most appropriate parameter in case-control studies, simulation of an empirical sampling distribution of the sample statistic, transformations to normalize a distribution, and construction of a normal-based confidence interval using the form: estimate ± (critical value × standard error of statistic). We again start students by considering study design issues before even revealing the study results. The journal article can be found at http://bmj.com/cgi/content/full/324/7346/1125. The researchers in this study conducted the interviews with case drivers in the hospital room or by telephone; they say only that proxies were used for drivers who had died. They defined a full night’s sleep to be at least seven hours, mostly between 11pm and 7am. They also used several other measures of sleep, specifically the Stanford and Epworth sleepiness scales, in case you would like students to look up the journal article and analyze some other variables. Note that the sample size for the “case” drivers given on the bottom of p. 421 is smaller than that reported on p. 420, reflecting the nonrespondents. Also note the typo on p. 422: the sample odds ratio is 1.59, not 1.48. In the simulation conducted in (i), each row will represent a new two-way table. Remember that the simulation assumes a value for the common population proportion, which also motivates an advantage of the more general normal-based model: it applies for any value of p. Many students are surprised in (i) that this time, unlike in so many earlier investigations, the sampling distribution of this statistic (the odds ratio) is not well approximated by a normal distribution, and we hope that students will think of applying a transformation to make the distribution more normal.
We made the decision to simply tell students the formula for the standard error but you may want to go through the derivation with more mathematically inclined students. In (m), we ask for a 90% confidence interval, but you could consider a 95% confidence interval as well [(1.06, 2.39)]. You may need to remind students to “back-transform” with exponentiation in order to produce a confidence interval for the odds ratio rather than the log odds ratio. Another point to remind them of is that it’s relevant to check whether the interval includes the value one, rather than the value zero as with a difference in proportions.
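The back-transformation step can be illustrated with a short Python sketch. It assumes the usual large-sample standard error for the log odds ratio, sqrt(1/a + 1/b + 1/c + 1/d), which we take to be the formula students are given; the cell counts and the function name odds_ratio_ci are hypothetical and are not the data from the study.

```python
import math
from statistics import NormalDist

def odds_ratio_ci(a, b, c, d, confidence=0.90):
    """Normal-based confidence interval for an odds ratio from the 2x2 table
         [[a, b],
          [c, d]]
    built on the log scale and then back-transformed by exponentiation."""
    odds_ratio = (a * d) / (b * c)
    log_or = math.log(odds_ratio)
    # Assumed large-sample standard error of the log odds ratio
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    lower = math.exp(log_or - z_star * se_log_or)
    upper = math.exp(log_or + z_star * se_log_or)
    return odds_ratio, lower, upper

# Hypothetical counts, NOT the data from the study discussed above
odds_ratio, lower, upper = odds_ratio_ci(20, 10, 10, 20, confidence=0.95)
includes_one = lower <= 1 <= upper

print(f"sample odds ratio = {odds_ratio:.2f}")
print(f"95% CI for the odds ratio = ({lower:.2f}, {upper:.2f})")
print("interval includes 1:", includes_one)
```

Because the interval is symmetric on the log scale, the sample odds ratio is the geometric mean of the two endpoints rather than their midpoint, which can be a useful check for students who forget to exponentiate.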
Section 5.2: Randomized Experiments Revisited
Timing/materials: Investigation 5.2.1 requires a Minitab macro (and letrozole.mtw) but may only take about 30 minutes.
Section 5.2 transitions from independent random samples to randomized experiments, but you might remind students that this scenario was already discussed in Ch. 1. Students should recall that the hypergeometric distribution for Fisher’s exact test often looked fairly symmetric and normal. Here, we focus on using the normal model as a large sample approximation to Fisher’s exact test, complete with null and alternative hypothesis statements about the treatment effect, d. The normal-based model also has the advantage of providing a direct method for determining a confidence interval for the size of the treatment effect.
In Investigation 5.2.1, students use a macro to obtain an empirical sampling distribution for both the difference in group proportions and the sample odds ratio. You may wish to provide them with an existing file to execute instead of asking them to take the time to type in all the commands. Students are also reminded that the normal-based method provides a test statistic as an additional measure of the extremeness of the sample result. Otherwise, you can caution them that the large-sample method currently offers few advantages, but that before modern computers Fisher’s exact test was computationally intensive with large sample sizes. The details for using Minitab or an applet to carry out the z-procedures are given on p. 431. Once again, you might want to emphasize that the randomization in the experimental design allows for drawing causal conclusions when the difference in the groups turns out to be statistically significant.
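To see how closely the large-sample z-procedure tracks Fisher’s exact test, the following Python sketch computes both one-sided p-values for the same two-way table. The counts are hypothetical (they are not the letrozole results), and the function names fisher_one_sided and two_proportion_z are ours; the z statistic uses the pooled proportion in its standard error, which we assume matches the text's version of the test.

```python
import math
from statistics import NormalDist

def fisher_one_sided(x_a, n_a, x_b, n_b):
    """Exact one-sided p-value: the probability that random assignment alone
    places x_a or more of the total successes in group A (hypergeometric)."""
    total_success = x_a + x_b
    total = n_a + n_b
    upper = min(n_a, total_success)
    favorable = sum(
        math.comb(total_success, k) * math.comb(total - total_success, n_a - k)
        for k in range(x_a, upper + 1)
    )
    return favorable / math.comb(total, n_a)

def two_proportion_z(x_a, n_a, x_b, n_b):
    """Large-sample z statistic and one-sided p-value, using the pooled
    proportion in the standard error."""
    pooled = (x_a + x_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (x_a / n_a - x_b / n_b) / se
    return z, 1 - NormalDist().cdf(z)

# Hypothetical counts: successes and group sizes for groups A and B
x_a, n_a, x_b, n_b = 60, 100, 45, 100
p_exact = fisher_one_sided(x_a, n_a, x_b, n_b)
z, p_normal = two_proportion_z(x_a, n_a, x_b, n_b)

print(f"Fisher's exact one-sided p-value: {p_exact:.4f}")
print(f"z = {z:.2f}, normal-based one-sided p-value: {p_normal:.4f}")
```

With group sizes this large the two p-values lead to the same conclusion; rerunning with much smaller hypothetical counts shows where the approximation starts to break down.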
Section 5.3: Comparing Two Samples on a Quantitative Response
Timing/materials: Investigation 5.3.1 also requires a Minitab macro that students create to use with NBASalaries0203.mtw, and Investigation 5.3.2 requires features that are new to Version 14 of Minitab. These two activities together should take 50-60 minutes. Investigation 5.3.3 (shopping99.mtw) should only take about 15 minutes.
Section 5.3 transitions to comparing two groups generated from independent random samples on a quantitative response variable. It might be useful to highlight the different contexts they will examine in this section (NBA salaries by conference, body temperatures of men vs. women, life expectancy of right and left handers) to help them understand the settings in which these techniques will apply. In Investigation 5.3.1, students again first examine simulation results and then consider the theoretical derivations of the mean and standard deviation. This is another situation, as with the Scottish militiamen and mothers’ ages, where we give students access to an entire population and ask them to repeatedly take random samples from it. You might want to remind students that this is a pedagogical device for studying sampling distributions; in real life we would only have access to one sample, and if we did have access to the whole population there would be no need to conduct inference. By the end of this investigation, we remind them of the utility of the t distribution with quantitative data. Again you may choose to present these results to students more directly if you are short on time. It will be important that they at least read through the “Probability Detour” on p. 436-7. You will probably also want to remind them of how Minitab handles stacked and unstacked data. We encourage you not to shortchange the discussion of the numerical and graphical summaries, contrasting what they suggest with what the inferential tools tell students.
Investigation 5.3.2 presents an interesting application of the two-sample t-statistic, assuming use of Minitab 14, where only the sample means and sample standard deviations need to be specified (as opposed to the raw data). If Minitab 14 is not available, you may consider having different students carry out the calculations for the different scenarios by hand. The point is to make sure the students have time to describe the effects of the sample standard deviations and the relative sample sizes in the two groups on the test statistic and p-value. You can also engage in a class discussion about which scenarios seem “plausible.” Most students will agree that sample standard deviations of 50 do not make much sense with means of 66 and 75, because even though the distribution of lifetimes may not be symmetric, we might expect the minimum to be further than just one standard deviation from the mean. Students will also debate the reasonableness of the different percentages of left-handers. The point we hope to make is that even if we don’t know these values exactly (they truly were not reported in this study), we can still draw some tentative conclusions about the significance of the result. Of course, you will still want to emphasize in class discussions that a statistically significant result is not sufficient, due to the observational nature of the study, to imply a cause-and-effect relationship. Questions (j) and (k) also reinforce this point. The context of this investigation is interesting to students, though be wary of sensitivity issues. Students will often have opinions (especially left-handed students) related to this study. You can also bring in some of the “history” of this type of research and some of the doubts about its validity (see p. 441). Practice Problem 5.3.3 provides practice with a similar context, along with a reminder of statistical vs. practical significance (“Should students pay money to be able to increase their scores by 65 points?”) that may be worthwhile to reinforce in class.
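If Minitab 14 is not available, the scenario comparison in Investigation 5.3.2 can be sketched from summary statistics alone. The Python sketch below uses the means of 66 and 75 and the standard deviation of 50 mentioned above; every other standard deviation and every sample size is a hypothetical scenario, and the function name two_sample_t and the group labels A and B are ours. It reports only the unpooled t statistic, so p-values would still need a t table or statistical software.

```python
import math

def two_sample_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Unpooled two-sample t statistic computed from summary statistics only."""
    se = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
    return (mean_a - mean_b) / se

# The means 75 and 66 come from the discussion above. The SDs (other than 50)
# and all sample sizes are hypothetical scenarios, not reported study values.
scenarios = [
    # (label, common SD, n in group A, n in group B)
    ("sd=15, 10% of subjects in group B", 15, 900, 100),
    ("sd=50, 10% of subjects in group B", 50, 900, 100),
    ("sd=15, 50% of subjects in group B", 15, 500, 500),
]

results = {}
for label, sd, n_a, n_b in scenarios:
    results[label] = two_sample_t(75, sd, n_a, 66, sd, n_b)
    print(f"{label}: t = {results[label]:.2f}")
```

Holding the total sample size fixed, the statistic shrinks as the standard deviations grow and as the split between the groups becomes more lopsided, which is the pattern students are asked to describe.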
Investigation 5.3.3 provides another application of the two-sample t test by again analyzing the comparison shopping data, this time focusing on the price differences. The crucial point, which arises in (c) and (d), is that the earlier methods of Chapter 5 do not apply because of the paired (and therefore non-independent) nature of the data collection. This context has been used several times and you may want to compare the results of the different analyses (e.g., sign test vs. two-sample t test) as in (j). It’s also important to emphasize the difference between a statistically significant difference and a practically significant difference as in (i), where they are asked to comment on whether an average price difference per item of $.03 to $.29 is worthwhile. You can ask them to consider how much farther away the cheaper store would need to be for such a difference to no longer be worthwhile. The Practice Problems also provide a few additional applications of the t-procedures for comparing paired quantitative data, but you may want to add a few more or be ready to provide help on HW problems. Be sure not to miss the t-procedure summary on p. 452-3.
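The consequence of ignoring the pairing can be shown with a small Python sketch. The prices below are hypothetical (they are not the values in shopping99.mtw); the sketch computes the t statistic on the item-by-item differences and, for contrast, the two-sample statistic that wrongly treats the two columns as independent samples.

```python
import math
import statistics

# Hypothetical prices for the same eight items at two stores
store_a = [1.00, 2.50, 3.75, 0.99, 5.49, 2.19, 4.29, 1.59]
store_b = [1.10, 2.65, 3.80, 1.15, 5.69, 2.25, 4.55, 1.75]

# Paired analysis: one-sample t statistic on the item-by-item differences
diffs = [b - a for a, b in zip(store_a, store_b)]
n = len(diffs)
paired_t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

# Inappropriate analysis: treats the two columns as independent samples
se_unpaired = math.sqrt(
    statistics.variance(store_a) / n + statistics.variance(store_b) / n
)
unpaired_t = (statistics.mean(store_b) - statistics.mean(store_a)) / se_unpaired

print(f"mean price difference: {statistics.mean(diffs):.4f}")
print(f"paired t statistic: {paired_t:.2f}")
print(f"two-sample t statistic (ignores pairing): {unpaired_t:.2f}")
```

The item-to-item variation in price swamps the store-to-store difference in the two-sample analysis, while the paired analysis removes it by working with differences.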
Section 5.4: Randomized Experiments Revisited
Timing/materials: Investigation 5.4.1 only requires analysis of the data in SleepDeprivation.mtw from Chapter 2. Exploration 5.4 also uses Minitab.
Section 5.4 mirrors Section 5.2 in that it demonstrates that the t-distribution provides a reasonable approximation to the randomization distribution. In Investigation 5.4.1 students are presented with the relevant output to see this, returning to the sleep deprivation study for which they approximated a randomization test near the end of Chapter 2. If you have more time, you may want students to help create this output for themselves. You may remind students that near the end of Chapter 2, they were actually shown the picture of a t-distribution (p. 148) to foreshadow what they are now learning in this section. We initially show the parallel with the pooled t-test but in general do not recommend pooling even with experimental data as the benefits do not appear to outweigh the risks. This is also a good point to remind students that in writing their final conclusions they should focus on whether the difference is statistically significant, the population(s) the results can be generalized to and whether a cause-and-effect conclusion can be drawn.
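If you would like students to generate a randomization distribution for themselves, the idea can be sketched in a few lines of Python. The response values below are hypothetical and are not the values in SleepDeprivation.mtw; the function name randomization_diffs is ours. The sketch re-randomizes the pooled responses to two groups, records the difference in group means, and reports an empirical p-value alongside the unpooled t statistic.

```python
import math
import random
import statistics

def randomization_diffs(group_a, group_b, reps=5000, seed=1):
    """Re-randomize the pooled responses to two groups of the original sizes
    and record the difference in group means each time."""
    rng = random.Random(seed)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    diffs = []
    for _ in range(reps):
        rng.shuffle(pooled)
        diffs.append(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
    return diffs

# Hypothetical responses for two treatment groups (made-up values)
group_a = [12, 18, 25, 9, 30, 22, 15, 27]
group_b = [5, 11, 8, 14, 2, 19, 7, 10]

observed = statistics.mean(group_a) - statistics.mean(group_b)
diffs = randomization_diffs(group_a, group_b)
p_value = sum(d >= observed for d in diffs) / len(diffs)

# Unpooled two-sample t statistic for comparison with a t table
se = math.sqrt(
    statistics.variance(group_a) / len(group_a)
    + statistics.variance(group_b) / len(group_b)
)
t_stat = observed / se

print(f"observed difference in means: {observed:.2f}")
print(f"empirical one-sided p-value: {p_value:.4f}")
print(f"t statistic: {t_stat:.2f}")
```

Students can compare the empirical p-value with the one-sided p-value a t table gives for the printed statistic to judge how well the t-distribution approximates the randomization distribution in this small example.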
Exploration 5.4 can be considered optional. The exploration asks students to examine various approximations for the (unknown) exact degrees of freedom in non-pooled, two-sample t-procedures. This exploration may appeal to more mathematically inclined students who want to examine the relative merits of different approximations that are recommended.
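As a point of reference for this exploration, the Python sketch below computes three degrees-of-freedom values that commonly appear in textbooks: the Welch-Satterthwaite approximation, the conservative choice of the smaller sample size minus one, and the pooled value. These may or may not be exactly the approximations the exploration examines, and the summary statistics are hypothetical.

```python
def welch_df(sd1, n1, sd2, n2):
    """Welch-Satterthwaite approximation to the degrees of freedom."""
    v1, v2 = sd1**2 / n1, sd2**2 / n2
    return (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

def conservative_df(n1, n2):
    """Conservative choice: the smaller sample size minus one."""
    return min(n1, n2) - 1

def pooled_df(n1, n2):
    """Degrees of freedom used by the pooled two-sample t procedure."""
    return n1 + n2 - 2

# Hypothetical sample standard deviations and sample sizes
sd1, n1, sd2, n2 = 4, 10, 9, 15
df_welch = welch_df(sd1, n1, sd2, n2)
df_low = conservative_df(n1, n2)
df_high = pooled_df(n1, n2)

print(f"conservative df: {df_low}")
print(f"Welch-Satterthwaite df: {df_welch:.2f}")
print(f"pooled df: {df_high}")
```

The Welch-Satterthwaite value always falls between the other two, so students can treat the conservative and pooled values as bounds when comparing approximations.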
Section 5.5: Other Statistics
Timing/materials: Investigation 5.5.1 also uses macros and will take about 20 minutes.
Section 5.5 can also be considered optional as it returns to the issue of bootstrapping, this time in the two-sample case. The case is made that this approach may be advantageous when something other than the difference in sample/group means is of interest. In Investigation 5.5.1 the difference in group medians is considered for “truncated” data (not all response times will be completed by the time of the end of the study), where it’s impossible to calculate group means directly. Students create an empirical bootstrap distribution and see that it is not normal and then proceed to consider a bootstrap percentile interval. The simulations again emphasize whether we want to model the data as coming from two independent samples or from random assignment. Both of these approaches are hypothetical for this particular observational study, but students can see that the conclusions about statistical significance would be similar. We do not go into extensive detail on the bootstrapping procedures (e.g., bias-corrected methods), but if this is a year-long course for your students this could be a good place to expand the discussion.
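A bootstrap percentile interval for a difference in medians can be sketched as follows. This Python sketch covers only the independent-samples version of the simulation, not the random-assignment version. The response times are hypothetical, with times still incomplete at the end of the study recorded at an assumed cutoff of 60; the function names are ours, and the percentiles are taken with simple index arithmetic rather than any particular interpolation rule.

```python
import random
import statistics

def bootstrap_median_diffs(group_a, group_b, reps=2000, seed=1):
    """Resample each group with replacement (treating the groups as two
    independent samples) and record the difference in sample medians."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(reps):
        resample_a = rng.choices(group_a, k=len(group_a))
        resample_b = rng.choices(group_b, k=len(group_b))
        diffs.append(statistics.median(resample_b) - statistics.median(resample_a))
    return diffs

def percentile_interval(values, confidence=0.95):
    """Bootstrap percentile interval: endpoints that cut off roughly equal
    tails of the sorted bootstrap statistics (simple index-based version)."""
    ordered = sorted(values)
    tail = (1 - confidence) / 2
    lower = ordered[int(tail * len(ordered))]
    upper = ordered[int((1 - tail) * len(ordered)) - 1]
    return lower, upper

# Hypothetical response times; 60 marks times not completed by the cutoff
group_a = [12, 15, 18, 21, 25, 30, 34, 40, 47, 55]
group_b = [20, 26, 31, 37, 44, 52, 58, 60, 60, 60]

observed = statistics.median(group_b) - statistics.median(group_a)
boot_diffs = bootstrap_median_diffs(group_a, group_b)
lower, upper = percentile_interval(boot_diffs)

print(f"observed difference in medians: {observed}")
print(f"95% bootstrap percentile interval: ({lower}, {upper})")
```

The medians stay computable here because fewer than half of the values in either group are cut off, whereas the group means would depend on the unknown completion times.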
There are two examples here, one highlighting a comparison of proportions and one highlighting a comparison of means.
At the end of this chapter it will be important to highlight how the appropriate procedure follows from the study design and the type and number of variables involved. We encourage you to give students a mixture of problems where deciding on t-test vs. z-test is not always transparent, and similarly for one-sample (matched pairs) vs. two-sample procedures. (We especially like exercise 15 for focusing on this issue as well as reminding them of important non-computational issues such as question wording in surveys.) Students will also need to be reminded of when and why they might want to consider Fisher’s exact test. Since this chapter focused mostly on methods, it will also be important to remind them not to ignore some of the larger issues such as interpreting statistical significance, type I and type II errors, association vs. causation, meaning of confidence, scope of conclusions, etc.
Issues of comparing proportions (as in Sections 5.1 and 5.2) are addressed in Exercises #1-22 and #35. Issues of comparing means (as in Sections 5.3 and 5.4) are addressed in Exercises #22-43. Exercises #22 and #35 concern issues of both types of analyses.