**Chapter 5**

In this chapter, we focus on developing two-sample normal-based procedures,
both for comparing two populations and for comparing two treatment groups. We again start with categorical data
(Sections 5.1 and 5.2) before proceeding to quantitative data (Sections
5.3-5.5). We encourage you to again
focus on the distinction between random sampling (Sections 5.1 and 5.3) and
randomization (Sections 5.2 and 5.4), the corresponding impact on the scope of
conclusions, and the place of inference in the entire statistical process
(collecting data, analyzing data, including focusing on the appropriate
numerical and categorical summaries, and then, if appropriate, inferential
conclusions). Several points in this
chapter allow students to combine many of the techniques they have learned
throughout the course. Students should
also be familiar with the pedagogical style of the course by now; for example,
they should not be surprised that they are first asked about study design
issues and about observational units and variables, and they then use
simulation to come up with an empirical *p*-value.

Please note that the following timings are especially approximate as we have often done more jumping around in this chapter, often requiring students to work outside of class with these investigations. We do feel there is a tremendous amount of flexibility in how you use this chapter’s investigations in your own courses.

**Section 5.1: Two Samples on a Categorical Response**

*Timing/materials:* This chapter makes heavy use of Minitab (all but Investigation 5.1.2). Investigation 5.1.1 may take about 35 minutes as a student exploration, but you can also lead students through the questions more quickly. Investigation 5.1.2, with help, may take only 15 minutes. On pp. 417-418, students are stepped through using Minitab and an applet for these calculations. Investigation 5.1.3 can take about 50 minutes.

Investigation 5.1.1 helps students see that the binomial
model they have been using with categorical data does not apply in analyzing
the difference between two groups. The
survey being conducted in two different years provides a nice context for
emphasizing the independence of the samples.
Notice that we still have students begin by considering questions of
observational units and variable definitions.
Students then use simulation in (j) to see that a normal model does provide a useful approximation for the difference in the sample proportions. As they have often done in the course, students analyze the study data through an empirical *p*-value from their simulation.
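If you would like to preview the simulation in (j) outside Minitab, a minimal sketch follows; the sample sizes and the common proportion are made up for illustration, not the survey's actual values.

```python
# Hypothetical sketch: draw many pairs of independent samples and
# record the difference in sample proportions.  Values are made up.
import random
import statistics

random.seed(1)
n1, n2 = 500, 500          # hypothetical sample sizes for the two years
p1, p2 = 0.45, 0.45        # a common proportion, as under the null

diffs = []
for _ in range(10000):
    x1 = sum(random.random() < p1 for _ in range(n1))
    x2 = sum(random.random() < p2 for _ in range(n2))
    diffs.append(x1 / n1 - x2 / n2)

# The empirical distribution should be roughly bell-shaped, centered
# near p1 - p2 = 0, with SD near sqrt(p1(1-p1)/n1 + p2(1-p2)/n2).
print(statistics.mean(diffs), statistics.stdev(diffs))
```

Plotting a histogram of `diffs` makes the approximate normality visible, mirroring what students see in the Minitab simulation.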

Investigation 5.1.2 then steps students through the mathematical derivation of the mean and standard deviation of the sampling distribution of the difference in sample proportions.
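For reference, the derivation arrives at the standard results for independent samples, which you may want to have on the board:

```latex
E(\hat{p}_1 - \hat{p}_2) = p_1 - p_2,
\qquad
SD(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}
```

The key step is that independence of the two samples lets the variances add.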

Investigation 5.1.3 steps students through the derivation of a confidence interval for an odds ratio. This investigation reinforces and combines many ideas from earlier in the course: case-control studies, the odds ratio as the most appropriate parameter in case-control studies, simulation of an empirical sampling distribution of the sample statistic, transformations to normalize a distribution, and construction of a normal-based confidence interval of the form: estimate ± (critical value × standard error of statistic). We again start students by considering study design issues before even revealing the study results.

The journal article can be found at http://bmj.com/cgi/content/full/324/7346/1125. The researchers in this study conducted the interviews with case drivers in the hospital room or by telephone; they say only that proxies were used for drivers who had died. They defined a full night’s sleep to be at least seven hours, mostly between 11pm and 7am. They also used several other measures of sleep, specifically the Stanford and Epworth sleepiness scales, in case you would like students to look up the journal article and analyze some other variables. Note that the sample size for the “case” drivers given on the bottom of p. 421 is smaller than that reported on p. 420, reflecting the nonrespondents. Also note the typo on p. 422: the sample odds ratio is 1.59, not 1.48.

In the simulation conducted in (i), each row will represent a new two-way table. Remember that the simulation assumes a value for the common population proportion, which also motivates an advantage of the more general normal-based model that applies for any value of p. Many students are surprised in (i) that this time, unlike in so many earlier investigations, the sampling distribution of this statistic (the odds ratio) is not well approximated by a normal distribution, and we hope that students will think of applying a transformation to make the distribution more normal.
We made the decision to simply tell students the formula for the standard error but you may want to go through the derivation with more mathematically inclined students. In (m), we ask for a 90% confidence interval, but you could consider a 95% confidence interval as well [(1.06, 2.39)]. You may need to remind students to “back-transform” with exponentiation in order to produce a confidence interval for the odds ratio rather than the log odds ratio. Another point to remind them of is that it’s relevant to check whether the interval includes the value one, rather than the value zero as with a difference in proportions.
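As a supplement, the whole log-transform-and-back procedure can be sketched in a few lines; the 2×2 counts below are hypothetical placeholders, not the sleep study's data.

```python
# Sketch of the log-transform confidence interval for an odds ratio.
# The 2x2 counts here are hypothetical, not the study's values.
import math

a, b = 60, 240   # cases: exposed, unexposed (hypothetical)
c, d = 40, 260   # controls: exposed, unexposed (hypothetical)

or_hat = (a * d) / (b * c)
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)   # SE of ln(odds ratio)

z = 1.645                                      # 90% confidence
lo = math.exp(math.log(or_hat) - z * se_log_or)
hi = math.exp(math.log(or_hat) + z * se_log_or)

# Back-transforming with exp() yields an interval for the odds ratio
# itself; check whether it covers 1 (no association), not 0.
print(or_hat, lo, hi)
```

This makes both reminders concrete: the exponentiation back-transform, and checking the interval against the value one.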

**Section 5.2: Randomized Experiments Revisited**

*Timing/materials:* Investigation 5.2.1 requires a Minitab macro
(and letrozole.mtw) but may only take about 30
minutes.

Section 5.2 transitions from independent random samples to randomized experiments, but you might remind students that this scenario was already discussed in Ch. 1. Students should recall that the hypergeometric distribution for Fisher’s exact test often looked fairly symmetric and normal. Here, we focus on using the normal model as a large sample approximation to Fisher’s exact test, complete with null and alternative hypothesis statements about the treatment effect, d. The normal-based model also has the advantage of providing a direct method for determining a confidence interval for the size of the treatment effect.
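A minimal sketch of the large-sample *z*-procedure students could compare against Fisher's exact test follows; the group counts are hypothetical.

```python
# Sketch: large-sample z-test comparing two group proportions, the
# normal approximation discussed above.  Counts are hypothetical.
import math

x1, n1 = 30, 80    # successes / size, treatment group (hypothetical)
x2, n2 = 18, 80    # successes / size, control group (hypothetical)

p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)          # pooled estimate under H0: d = 0

se = math.sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))
z = (p1 - p2) / se

# Two-sided p-value via the standard normal CDF (using math.erf).
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
print(z, p_value)
```

Unlike the exact test, the same standard error (with the unpooled version) also drives a direct confidence interval for the treatment effect.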

In Investigation 5.2.1, students use a macro to obtain an
empirical sampling distribution for both the difference in group proportions
and the sample odds ratio. You may wish
to provide them with an existing file to execute instead of asking them to take
the time to type all the commands in.
Students are also reminded that the normal-based method provides a test statistic as an additional measure of the extremeness of the sample result. You can caution them that the large-sample method holds few practical advantages today, but that before modern computers Fisher’s exact test was computationally intensive with large sample sizes. The details for using Minitab or an applet to carry out the *z*-procedures are given on p. 431. Once again, you might want to emphasize that the randomization in the experimental design allows for drawing causal conclusions when the difference between the groups turns out to be statistically significant.

**Section 5.3: Comparing Two Samples on a Quantitative Response**

*Timing/materials:* Investigation 5.3.1 also requires a Minitab
macro that students create to use with NBASalaries0203.mtw, and
Investigation 5.3.2 requires features that are new to Version 14 of
Minitab. These two activities together should
take 50-60 minutes. Investigation 5.3.3
(shopping99.mtw)
should only take about 15 minutes.

Section 5.3 transitions to comparing two groups generated
from independent random samples on a quantitative response variable. It might be useful to highlight the different
contexts they will examine in this section (NBA salaries by conference, body
temperatures of men vs. women, life expectancy of right and left handers) to help them understand the settings in which
these techniques will apply. In Investigation 5.3.1, students again first examine simulation results and then consider the theoretical derivations of the mean and standard deviation. This is another situation, as with the
Scottish militiamen and mothers’ ages, where we give students access to an
entire population and ask them to repeatedly take random samples from it. You might want to remind students that this
is a pedagogical device for studying sampling distributions; in real life we
would only have access to one sample, and if we did have access to the whole
population there would be no need to conduct inference. By the end of this investigation, we remind
them of the utility of the *t*
distribution with quantitative data.
Again you may choose to present these results to students more directly
if you are short on time. It will be
important that they at least read through the “Probability Detour” on p.
436-7. You will probably also want to
remind them of how Minitab handles stacked and unstacked
data. We encourage you not to shortchange the discussion of the numerical and graphical summaries and what they imply, in contrast to what the inferential tools tell them.
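If Minitab is unavailable, the repeated-sampling pedagogical device can be previewed with a short script; the two populations below are synthetic stand-ins, not the NBA salary data.

```python
# Pedagogical-device sketch: treat lists as the whole "populations,"
# repeatedly draw two independent random samples, and record the
# difference in sample means.  Populations here are synthetic.
import random
import statistics

random.seed(2)
pop1 = [random.gauss(70, 10) for _ in range(5000)]   # hypothetical
pop2 = [random.gauss(65, 12) for _ in range(5000)]   # hypothetical

diffs = []
for _ in range(5000):
    s1 = random.sample(pop1, 25)
    s2 = random.sample(pop2, 25)
    diffs.append(statistics.mean(s1) - statistics.mean(s2))

# Center should be near mu1 - mu2 = 5, and the spread near
# sqrt(sigma1^2/25 + sigma2^2/25), roughly 3.1 here.
print(statistics.mean(diffs), statistics.stdev(diffs))
```

As the text notes, in real life we would have only one sample; the script just makes the sampling distribution visible.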

Investigation 5.3.2 presents an interesting application of
the two-sample *t*-statistic, assuming
use of Minitab 14, where only the sample means and sample standard deviations
need to be specified (as opposed to the raw data). If Minitab 14 is not available, you may
consider having different students carry out the calculations for the different
scenarios by hand. The point is to make
sure the students have time to describe the effects of the sample standard
deviations and the relative sample sizes in the two groups on the test
statistic and *p*-value. You can also engage in a class discussion
about which scenarios seem “plausible.”
Most students will agree that sample standard deviations of 50 do not
make much sense with means of 66 and 75, because even though the distribution
of lifetimes may not be symmetric, we might expect the minimum to be further
than just one standard deviation from the mean.
Students will also debate the reasonableness of the different
percentages of left-handers. The point
we hope to make is even if we don’t know these values exactly (they truly were
not reported in this study), we can still make some tentative conclusions about
the significance of the result. Of
course, you will still want to emphasize in class discussions that a
statistically significant result is not sufficient, due to the observational
nature of the study, to imply a cause and effect result. Questions (j) and (k) also reinforce this
point. The context of this investigation
is interesting to students, though be wary of sensitivity issues. Students will often have opinions (especially
left-handed students) related to this study. You can also bring in some of the
“history” of this type of research and some of the doubts of its validity (see
p. 441). Practice Problem 5.3.3 provides practice with a similar context and a reminder of statistical vs. practical significance (“Should students pay money to be able to increase their scores by 65 points?”) that may be worthwhile to reinforce in class.
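For instructors without Minitab 14, the summary-statistics version of the unpooled two-sample *t*-statistic is easy to script; the numbers below are placeholders for one what-if scenario, not the study's reported values.

```python
# Sketch of the two-sample t-statistic computed from summary
# statistics alone (no raw data needed).  Inputs are hypothetical.
import math

mean1, sd1, n1 = 66.0, 15.0, 99     # e.g., one group (hypothetical)
mean2, sd2, n2 = 75.0, 15.0, 888    # e.g., other group (hypothetical)

se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
t = (mean1 - mean2) / se

# Welch-Satterthwaite degrees of freedom for the unpooled procedure.
v1, v2 = sd1**2 / n1, sd2**2 / n2
df = (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

print(t, df)
```

Rerunning with different standard deviations and sample sizes lets students see exactly the effects on the test statistic that the investigation targets.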

Investigation 5.3.3 provides another application of the
two-sample *t* test by again analyzing
the comparison shopping data, this time focusing on the price differences. The crucial point, which arises in (c) and
(d), is that the earlier methods of Chapter 5 do not apply because of the
paired (and therefore non-independent) nature of the data collection. This
context has been used several times and you may want to compare the results of
the different analyses (e.g., sign test vs. two-sample *t* test) as in (j). It’s also
important to emphasize the difference between a statistically significant
difference and a practically significant difference as in (i)
where they are asked to comment on whether an average price difference *per item* of $.03 to $.29 is
worthwhile. You can ask them to consider
how much farther away the cheaper store would need to be for such a difference
to no longer be worthwhile. The Practice
Problems also provide a few additional applications of the *t*-procedures for comparing paired quantitative data, but you may
want to add a few more or be ready to provide help on homework problems. Be sure not to miss the *t*-procedure summary on pp. 452-453.
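The paired analysis in Investigation 5.3.3 amounts to a one-sample *t*-procedure on the item-by-item differences; a sketch with made-up prices (not the shopping99 data) follows.

```python
# Sketch of a paired analysis: compute item-by-item price differences
# and run a one-sample t-procedure on them, rather than treating the
# two stores as independent samples.  Prices are made up.
import math
import statistics

store_a = [2.99, 1.49, 3.79, 0.99, 4.25, 2.10, 1.89, 3.15]  # hypothetical
store_b = [3.09, 1.59, 3.99, 0.95, 4.50, 2.30, 1.99, 3.40]  # hypothetical

diffs = [b - a for a, b in zip(store_a, store_b)]
n = len(diffs)
mean_d = statistics.mean(diffs)
se = statistics.stdev(diffs) / math.sqrt(n)

t = mean_d / se
print(mean_d, t)   # compare t to a t-distribution with n - 1 = 7 df
```

The within-pair dependence is exactly why the two-sample methods earlier in the chapter do not apply here.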

**Section 5.4: Randomized Experiments Revisited**

*Timing/materials:* Investigation 5.4.1 only requires analysis of
the data in SleepDeprivation.mtw from
Chapter 2. Exploration 5.4 also uses
Minitab.

Section 5.4 mirrors Section 5.2 in that it demonstrates that
the *t*-distribution provides a
reasonable approximation to the randomization distribution. In Investigation 5.4.1 students are presented
with the relevant output to see this, returning to the sleep deprivation study
for which they approximated a randomization test near the end of Chapter
2. If you have more time, you may want
students to help create this output for themselves. You may remind students that near the end of
Chapter 2, they were actually shown the picture of a *t*-distribution (p. 148) to foreshadow what they are now learning in
this section. We initially show the
parallel with the *pooled t-test* but
in general do not recommend pooling even with experimental data as the benefits
do not appear to outweigh the risks.
This is also a good point to remind students that in writing their final
conclusions they should focus on whether the difference is statistically
significant, the population(s) the results can be generalized to and whether a
cause-and-effect conclusion can be drawn.
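To show what the *t*-distribution is approximating here, a randomization distribution can be built by re-shuffling the group labels; the data values below are made up, not the sleep deprivation measurements.

```python
# Sketch of a randomization test: repeatedly re-shuffle the group
# labels and recompute the difference in group means.  Data are
# hypothetical, not from SleepDeprivation.mtw.
import random
import statistics

random.seed(3)
group_a = [15.2, 12.5, 18.1, 9.9, 14.3, 16.0, 11.7, 13.8]  # hypothetical
group_b = [10.1, 8.7, 12.2, 7.5, 9.8, 11.0, 6.9, 10.6]     # hypothetical

observed = statistics.mean(group_a) - statistics.mean(group_b)
combined = group_a + group_b

count = 0
reps = 10000
for _ in range(reps):
    random.shuffle(combined)
    diff = statistics.mean(combined[:8]) - statistics.mean(combined[8:])
    if diff >= observed:
        count += 1

print(observed, count / reps)   # empirical one-sided p-value
```

Comparing this empirical *p*-value to the two-sample *t* result illustrates how good the approximation tends to be.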

Exploration 5.4 can be considered optional. The exploration asks students to examine
various approximations for the (unknown) exact degrees of freedom in
non-pooled, two-sample *t*-procedures. This exploration may appeal to more
mathematically inclined students who want to examine the relative merits of
different approximations that are recommended.
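For the exploration, it can help to compute two of the commonly recommended approximations side by side; the summary inputs below are made up.

```python
# Sketch comparing approximations to the unknown exact degrees of
# freedom for the unpooled two-sample t-procedure: the conservative
# min(n1, n2) - 1 rule vs. Welch-Satterthwaite.  Inputs are made up.

sd1, n1 = 8.0, 12
sd2, n2 = 14.0, 30

v1, v2 = sd1**2 / n1, sd2**2 / n2
welch = (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
conservative = min(n1, n2) - 1

print(conservative, welch)   # Welch df falls between min-1 and n1+n2-2
```

Students can vary the standard deviations and sample sizes to see when the two approximations diverge most.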

**Section 5.5: Other Statistics**

*Timing/materials:* Investigation 5.5.1 also uses macros and will
take about 20 minutes.

Section 5.5 can also be considered optional, as it returns to the issue of bootstrapping, this time in the two-sample case. The case is made that this approach may be advantageous when something other than the difference in sample/group means is of interest.

In Investigation 5.5.1, the difference in group medians is considered for “truncated” data (not all response times will be complete by the end of the study), where it is impossible to calculate group means directly. Students create an empirical bootstrap distribution, see that it is not normal, and then proceed to consider a bootstrap percentile interval. The simulations again emphasize whether we want to model the data as coming from two independent samples or from random assignment. Both of these approaches are hypothetical for this particular observational study, but students can see that the conclusions about statistical significance would be similar.

We do not go into extensive detail on bootstrapping procedures (e.g., bias-corrected methods), but if this is a year-long course for your students, this could be a good place to expand the discussion.
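The two-sample bootstrap percentile interval for a difference in medians can be sketched as follows; the data values are synthetic, not the investigation's response times.

```python
# Sketch of a two-sample bootstrap percentile interval for the
# difference in group medians (useful when means can't be computed
# from truncated data).  Data values are synthetic.
import random
import statistics

random.seed(4)
group1 = [3.1, 4.5, 2.8, 6.0, 5.2, 3.9, 4.1, 7.3, 2.5, 5.8]  # hypothetical
group2 = [5.9, 7.2, 4.8, 8.1, 6.6, 5.5, 9.0, 6.1, 7.7, 4.9]  # hypothetical

boot = []
for _ in range(5000):
    r1 = [random.choice(group1) for _ in group1]   # resample within group
    r2 = [random.choice(group2) for _ in group2]
    boot.append(statistics.median(r1) - statistics.median(r2))

boot.sort()
# 95% percentile interval: cut off 2.5% in each tail.
lo, hi = boot[int(0.025 * len(boot))], boot[int(0.975 * len(boot)) - 1]
print(lo, hi)
```

A histogram of `boot` will typically look non-normal here, motivating the percentile interval over a normal-based one.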

**Examples**

There are two examples here, one highlighting a comparison of proportions and one highlighting a comparison of means.

**Summary**

At the end of this chapter it will be important to highlight
how the appropriate procedure follows from the study design and the type and
number of variables involved. We
encourage you to give students a mixture of problems where deciding on a *t*-test vs. a *z*-test is not always transparent, and similarly for choosing between one-sample (matched pairs) and two-sample procedures. (We especially like Exercise 15 for focusing
on this issue as well as reminding them of important non-computational issues
such as question wording in surveys.)
Students will also need to be reminded of when and why they might want
to consider Fisher’s exact test. Since
this chapter focused mostly on methods, it will also be important to remind
them not to ignore some of the larger issues such as interpreting statistical
significance, type I and type II errors, association vs. causation, meaning of
confidence, scope of conclusions, etc.


Issues of comparing proportions (as in Sections 5.1 and 5.2) are addressed in Exercises #1-22 and #35. Issues of comparing means (as in Sections 5.3 and 5.4) are addressed in Exercises #22-43. Exercises #22 and #35 concern issues of both types of analyses.