Example 1: Rowers' Weights

We present students with data on the weights of the rowers on the 1996 U.S. Men's Olympic rowing team and ask them to create a visual display of the distribution of weights. We ask students to comment on interesting features of the distribution, and they usually notice that the weights are clustered into two groups: one in the 180-220 pound range and another in the 150-160 pound range. They also observe that one team member weighed just 121 pounds. We then ask them to try to think of explanations for these findings, and some realize that those in the 150's are rowers in "lightweight" events and that the 121-pounder is the coxswain, who calls out instructions but does not help with the rowing. Thus, students begin to discover some of the key features to look for when describing a distribution of data, and they also start to appreciate the importance of context for understanding and explaining what the data analysis reveals.

We also ask students to use these data to explore the concept of resistance. They find that the mean of the weights changes much more substantially than the median does as they first omit the coxswain, then omit the lightweights, and finally alter the weight of the heaviest rower.
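For instructors who want a quick demonstration of this comparison, the sketch below carries out the same calculations in Python. The weights are illustrative placeholders rather than the actual 1996 roster, but the pattern is the same: omitting or altering extreme values moves the mean noticeably while leaving the median nearly unchanged.

```python
import statistics

# Illustrative weights in pounds (not the actual 1996 roster):
# one coxswain, a few lightweights, and a group of heavier rowers.
weights = [121, 152, 155, 158, 185, 190, 198, 205, 210, 215, 220]

def summarize(data, label):
    print(f"{label}: mean = {statistics.mean(data):.1f}, "
          f"median = {statistics.median(data):.1f}")

summarize(weights, "Full team")
summarize(weights[1:], "Without the coxswain")   # drop the 121-pounder

heavy = [w for w in weights if w >= 180]
summarize(heavy, "Without the lightweights")

# Changing the heaviest rower's weight pulls the mean but barely moves the median.
summarize(heavy[:-1] + [300], "Heaviest rower set to 300")
```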

Example 2: Cancer Pamphlets

Students analyze data from a study of whether cancer pamphlet information is written at an appropriate level to be read and understood by cancer patients. The data consist of a sample of 63 patients whose reading level was determined and a sample of 30 pamphlets whose readability level was assessed on the same scale. We ask students to calculate the median of the patients' reading levels and the median of the pamphlets' readability levels, and these turn out to be identical. We then ask students if this reveals that the pamphlets are well-suited to the patients. Many students astutely notice the dilemma that more than one-quarter of the patients have a reading level below that of the simplest pamphlet. Through this activity students realize that measures of center do not tell the whole story of a set of data, that variability is often equally important, and that one must not lose sight of the original question that motivated the study. They also realize the importance of graphical displays for making such patterns more apparent.
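A brief sketch of the underlying computation appears below. The grade-level scores are hypothetical stand-ins for the study's measurements, chosen only so that the two medians coincide while a sizable fraction of patients still read below the simplest pamphlet.

```python
import statistics

# Hypothetical grade-level scores standing in for the study's data.
patient_levels  = [3, 4, 4, 5, 5, 6, 7, 8, 8, 9, 10, 11, 12, 12, 13]
pamphlet_levels = [6, 7, 7, 8, 8, 8, 9, 10, 11, 12]

print("Median patient reading level:", statistics.median(patient_levels))
print("Median pamphlet readability: ", statistics.median(pamphlet_levels))

# Equal medians can still hide a mismatch at the low end of the scale.
too_hard = sum(level < min(pamphlet_levels) for level in patient_levels)
print(f"Patients reading below the simplest pamphlet: "
      f"{too_hard} of {len(patient_levels)}")
```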

Example 3: Draft Lottery

The 1970 draft lottery provides a classic example through which a fairly elementary statistical analysis reveals an important finding. The data are the 366 birthdates of the year and the draft numbers assigned to those birthdates by the 1970 lottery. Mindful of the lesson from the "cancer pamphlets" example to start with a graph of the data, we ask students to examine a scatterplot of the draft numbers versus the birthdates arranged sequentially from 1 (January 1) through 366 (December 31). Most students report that the scatterplot reveals nothing but a random scattering, as one would expect from a fair lottery. We then ask students to find the median draft number for their birth month, and we compare results across the twelve months. This analysis reveals a clear pattern: the median draft numbers tend to get smaller as the months move forward through the year. Equipped with this finding, students return to the scatterplot and notice a general lack of points in the lower left (few low draft numbers assigned to birthdates early in the year) and in the upper right (few high draft numbers assigned to birthdates late in the year). This activity illustrates that numerical summaries such as medians can indeed be helpful for detecting patterns, that randomness is not as easy to achieve as we might think, and that statistics can involve life-and-death situations.
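The by-month summary is easy to script, as the sketch below shows. It uses a random permutation of the numbers 1 through 366 in place of the actual 1970 assignments; with a genuinely random lottery the monthly medians come out roughly flat, whereas the real data show them declining toward the end of the year.

```python
import random
import statistics
from datetime import date, timedelta

# Placeholder lottery: a random permutation of draft numbers 1-366.
# The actual 1970 assignments would be substituted here.
random.seed(0)
draft_numbers = list(range(1, 367))
random.shuffle(draft_numbers)

# Pair each of the 366 possible birthdates with a draft number
# (1972 is used only because it is a leap year with 366 days).
birthdates = [date(1972, 1, 1) + timedelta(days=i) for i in range(366)]

by_month = {}
for day, number in zip(birthdates, draft_numbers):
    by_month.setdefault(day.month, []).append(number)

for month in range(1, 13):
    print(f"Month {month:2d}: median draft number = "
          f"{statistics.median(by_month[month])}")
```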

Example 4: Graduate Admissions

Another well-known historical example concerns data on graduate school applications to the University of California at Berkeley in 1973. We provide students with data on each applicant's sex and whether he or she was admitted or rejected for the six largest programs at the university. Students calculate the acceptance rates for the six programs combined and find that men had a significantly higher acceptance rate than women. We then divide students into groups and ask each group to find the acceptance rates by sex for the six programs individually. It turns out that the acceptance rates are quite equitable between men and women within each program, with more programs actually having a higher acceptance rate for women than for men. We then ask students to explain how these findings could possibly be consistent. While the point here is subtle, some students realize that the explanation is that very few women applied to the programs with high acceptance rates, while women applied in large numbers to programs with much lower acceptance rates. Students thus discover the phenomenon known as "Simpson's paradox," and they begin to recognize the importance of searching for confounding variables and to realize that one cannot draw cause-and-effect conclusions from observational studies.
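The arithmetic behind the paradox can be laid out in a few lines. The counts below are invented for illustration (they are not the actual Berkeley figures); they are chosen so that women have the higher acceptance rate within each program yet the lower rate overall, because most women applied to the more selective program.

```python
# Invented applicant counts (not the actual Berkeley data), chosen only
# to show how Simpson's paradox can arise.
# program -> {sex: (admitted, applied)}
programs = {
    "A": {"men": (500, 800), "women": (90, 100)},   # lenient program, few women apply
    "B": {"men": (20, 200),  "women": (110, 900)},  # selective program, many women apply
}

totals = {"men": [0, 0], "women": [0, 0]}
for name, groups in programs.items():
    for sex, (admitted, applied) in groups.items():
        totals[sex][0] += admitted
        totals[sex][1] += applied
        print(f"Program {name}, {sex}: {admitted / applied:.0%} admitted")

for sex, (admitted, applied) in totals.items():
    print(f"Overall, {sex}: {admitted / applied:.0%} admitted")
```

Within each program women fare at least as well as men, yet the combined rates favor men, because the program to which women applied in large numbers admits far fewer applicants of either sex.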

Example 5: Televisions and Life Expectancy

To lead them to one of the most important lessons to be gained from a statistics course, we present students with real data but in a somewhat contrived situation. The data are the life expectancy and the number of people per television in a sample of countries around the world. We ask students to create a scatterplot of the two variables and to comment on whether a relationship is apparent between them. It turns out that there is a moderately strong, negative association, indicating that countries with more people per television tend to have lower life expectancies than countries with fewer people per television. We then ask students whether it follows that sending televisions to a country with a high ratio of people per television, such as Haiti, would raise its life expectancy. Students realize immediately that this argument is ridiculous, and they are quick to suggest more plausible explanations for the association. Through this activity students discover and articulate for themselves one of the most important principles of statistics: that an association between two variables does not imply a cause-and-effect relationship between them.
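A small sketch of the numerical side of this activity follows; the country values are made up for illustration (and statistics.correlation requires Python 3.10 or later). The point is only that a strongly negative correlation coefficient describes the association without saying anything about cause and effect.

```python
import statistics

# Made-up values standing in for the activity's data:
# (people per television, life expectancy in years) for a handful of countries.
data = [(1.3, 79), (2.6, 76), (4.0, 73), (24.0, 64), (190.0, 52), (320.0, 50)]

people_per_tv = [x for x, _ in data]
life_exp = [y for _, y in data]

# Pearson's r summarizes the association; it says nothing about causation.
r = statistics.correlation(people_per_tv, life_exp)
print(f"Correlation between people-per-TV and life expectancy: r = {r:.2f}")
```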