Teaching the Reasoning of Statistical Inference
A "Top Ten" List
Allan J. Rossman and Beth L. Chance
"Certainly the present trend toward reemphasizing actual experience with data analysis in beginning instruction before plunging into probability and inference makes sense pedagogically as well as presenting a more balanced introduction to statistical practice....Yet teachers of statistics, however we face the pedagogical obstacles posed by the difficulty of probability ideas, are obligated to present at least the basic reasoning of confidence intervals and significance testing as essential parts of our subject." - David Moore [16, p. 8]
During the past decade, a reform movement in statistics education has emphasized that introductory statistics courses should focus on student experiences with data and understanding of fundamental concepts. Cobb  summarized the guidelines of an MAA/ASA Joint Committee on Undergraduate Statistics as:
• Emphasize statistical thinking.
• More data and concepts; less theory, fewer recipes.
• Foster active learning.
Additional references on statistics education reform and current statistical practice include Gordon and Gordon , Cobb , Hoaglin and Moore , and Cobb and Moore . The guidelines also correspond with the philosophy of the recently developed Advanced Placement Statistics syllabus .
As the quote from David Moore suggests, these principles have been incorporated more readily into the teaching of data analysis than into the teaching of statistical inference. Moreover, instructors of introductory statistics often treat statistical inference as an isolated subject with little connection to the issues of exploratory data analysis and data collection that precede it in most courses. With apologies to David Letterman, we offer the following "Top Ten" list of recommendations for teaching the reasoning of statistical inference. We do not mean to imply an order of importance for these recommendations. Rather, our goal is to focus on the following themes: student investigation and discovery of inferential reasoning, proper interpretation and cautious use of results, and effective communication of findings. We include examples of activities and exercises that illustrate the principles of these suggestions.
#10 Have students perform physical simulations to discover basic ideas of inference.
We contend that simulation, not formal probability, provides the most effective introduction to sampling distributions and to concepts of inference. Surveying recent research in mathematics and statistics education, Garfield  stated that the use of simulations can help students learn concepts through visualization and manipulation of concrete representations of abstract ideas. Moore  remarked that simulations offer "an alternative to proofs and algebraic derivations as a way of convincing students of the truth of important facts."
While modern technology performs simulations quickly and efficiently, we worry that students fail to connect the numbers and displays being produced with the process being simulated. We therefore advocate beginning with physical simulations, where students literally get a hands-on view of the process.
Example: We ask each student in the class to take a sample of 25 Reese’s Pieces candies and to calculate the proportion of orange candies in the sample (see ). Students then aggregate their results with their classmates’ to discover the simple but fundamental idea of sampling variability. Students observe first-hand that the outcomes of a statistic vary from sample to sample under repeated random sampling from the same population. They also notice that while these values differ, a predictable pattern emerges. Once this idea is clearly established, students can use technology to perform simulations more efficiently and to develop their understanding of properties of the sampling distribution.
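Once students have done the activity by hand, a short simulation can extend it. The following is a minimal Python sketch, not the authors' own materials: the population proportion of orange candies (0.45) and the number of repetitions (500) are assumptions chosen only for illustration.

```python
import random
from collections import Counter

def sample_proportion(n, p_orange, rng):
    """Proportion of orange candies in one random sample of n candies."""
    return sum(rng.random() < p_orange for _ in range(n)) / n

rng = random.Random(1)      # fixed seed so the run is repeatable
P_ORANGE = 0.45             # hypothetical population proportion (assumption)
proportions = [sample_proportion(25, P_ORANGE, rng) for _ in range(500)]

# Crude text dotplot: one row per distinct sample proportion observed.
for value, count in sorted(Counter(proportions).items()):
    print(f"{value:.2f} {'*' * count}")
```

The rows of the printout show both ideas from the activity: the sample proportions vary, yet they pile up in a predictable pattern around the population value.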
Example: Landwehr, Swift, and Watkins  introduce the idea of confidence intervals without resorting to formulas. Using a fixed sample size, they ask groups of students to draw samples using a specific population proportion, varying this value among the groups. Each group then produces a "90% boxplot" containing the middle 90% of their generated proportions. Students then construct a chart which displays the 90% boxplots for the different values of the population proportion. Given a new sample proportion, students see which of these boxplots overlap with this value; these boxplots indicate which population proportions could reasonably have produced the new sample proportion. Students thus interpret a confidence interval as a set of plausible values for the population parameter based on the observed sample statistic.
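The chart-based reasoning can be imitated in a few lines of Python. This is only a sketch under stated assumptions: the sample size (40), the number of simulated samples per group (1000), the grid of population proportions, and the new sample proportion (0.35) are all hypothetical, and the helper `middle_90` is our own name rather than anything from the cited text.

```python
import random

def middle_90(p, n, reps, rng):
    """Simulate reps sample proportions; return their 5th and 95th percentiles."""
    props = sorted(sum(rng.random() < p for _ in range(n)) / n for _ in range(reps))
    return props[int(0.05 * reps)], props[int(0.95 * reps) - 1]

rng = random.Random(2)
n, reps = 40, 1000                      # hypothetical sample size and repetitions
# One "90% boxplot" (stored as a low/high pair) per population proportion.
bands = {k / 10: middle_90(k / 10, n, reps, rng) for k in range(1, 10)}

new_sample_proportion = 0.35            # hypothetical observed value
plausible = [p for p, (lo, hi) in bands.items() if lo <= new_sample_proportion <= hi]
print("plausible population proportions:", plausible)
```

The population proportions whose band overlaps the observed value play the role of the confidence interval in the activity.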
Example: Scheaffer, Gnanadesikan, Watkins, and Witmer  introduce the concept of statistical significance by asking students to shuffle and deal cards. This activity pertains to a question of sex discrimination in a company’s firing of employees. The cards represent employees retained or dismissed, and students use the cards to investigate how often the actual number of females dismissed would occur by chance. This activity helps students to understand and to verbalize the concept of a p-value before they have ever seen an inference formula.
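After the physical card dealing, the same reasoning can be sketched in Python. The counts below (20 women, 30 men, 10 dismissals, 7 of them women) are hypothetical values chosen for illustration, not the data used in the cited activity.

```python
import random

rng = random.Random(3)
# Hypothetical workforce: 20 women ("F") and 30 men ("M"); 10 are dismissed.
deck = ["F"] * 20 + ["M"] * 30
observed_women_dismissed = 7            # hypothetical observed count
reps = 10_000

at_least_as_extreme = 0
for _ in range(reps):
    rng.shuffle(deck)
    dealt = deck[:10]                   # the 10 "dismissed" cards
    if dealt.count("F") >= observed_women_dismissed:
        at_least_as_extreme += 1

# Share of random deals with at least as many women dismissed as observed.
approx_p_value = at_least_as_extreme / reps
print(approx_p_value)
```

The printed proportion is an approximate p-value: how often chance alone produces a result at least as extreme as the one observed.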
#9 Encourage students to use technology to explore properties of inference procedures.
The power of modern computers and graphing calculators allows instructors to shift the emphasis from students performing extensive hand calculations to students exploring the underlying concepts and properties of inference procedures. After students have conducted physical simulations to become comfortable with the idea of repeated samples, technology enables us to extend these ideas quickly and efficiently. For example, students can discover for themselves what the phrases "95% confidence" and "significance level" represent.
Example: Patterned after an exercise of Moore & McCabe , we provide students with a population of 1000 hypothetical SAT-M scores and ask them to use the computer to calculate the mean of this population, which turns out to equal 500. Students then use the computer to take 50 different samples of size 100 from this population and to construct a confidence interval for the population mean from each sample. Since students know the value of the population mean, they can then count how many of these intervals contain the population parameter. They also see first-hand that the intervals which fail to capture the population mean arise from samples with unusually high or low sample means. Through this activity students develop an understanding of the notion of "confidence" as describing how often a confidence interval captures the population parameter in the long run. Similarly, students test at the 5% level whether the population mean equals 500 for each sample and then count how many of the samples would lead to a false rejection of the null hypothesis. This activity helps students to develop intuition about Type I error.
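A minimal Python sketch of this activity follows. The population here is generated artificially (normal scores shifted so that the mean is exactly 500), and the intervals use the normal critical value rather than a t value; both choices are simplifying assumptions, not the exercise's actual data or software.

```python
import random
from statistics import NormalDist, mean, stdev

rng = random.Random(4)
# Hypothetical population of 1000 scores, shifted so its mean is 500.
raw = [rng.gauss(500, 100) for _ in range(1000)]
shift = 500 - mean(raw)
population = [x + shift for x in raw]
mu = mean(population)

z_star = NormalDist().inv_cdf(0.975)    # about 1.96
captured = 0
for _ in range(50):
    sample = rng.sample(population, 100)
    margin = z_star * stdev(sample) / 100 ** 0.5
    if abs(mean(sample) - mu) <= margin:    # interval contains the true mean
        captured += 1

print(f"{captured} of 50 intervals contain the population mean {mu:.1f}")
```

Because a two-sided z test at the 5% level rejects the hypothesis that the mean equals 500 exactly when the 95% interval misses 500, the quantity 50 minus `captured` also counts the false rejections in this sketch.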
Technology can also free students from computational drudgery, allowing them to concentrate on exploring properties of the inferential procedures such as the effects of the sample size or confidence level.
Example: We ask students whether results from a random sample provide strong evidence that more than half of a population favors a certain candidate, telling them only that 54% of the sample are in favor (see ). Students realize that the answer depends on the sample size used, and they use technology to determine the smallest sample size for which the result is significant. Students further investigate how this minimum sample size changes for different significance levels. This activity reinforces the idea that one cannot base decisions solely on point estimates.
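The search for the minimum sample size can be sketched as follows. The code assumes a one-sided test based on the normal approximation and treats the sample proportion as exactly 0.54 for every sample size (ignoring that the count of supporters must be a whole number); the function names are our own.

```python
from statistics import NormalDist

def p_value_more_than_half(p_hat, n):
    """One-sided p-value for H0: p = 0.5 vs Ha: p > 0.5 (normal approximation)."""
    z = (p_hat - 0.5) / (0.25 / n) ** 0.5
    return 1 - NormalDist().cdf(z)

def smallest_significant_n(p_hat, alpha):
    """Smallest n at which the p-value drops below alpha."""
    n = 1
    while p_value_more_than_half(p_hat, n) >= alpha:
        n += 1
    return n

for alpha in (0.10, 0.05, 0.01):
    print(alpha, smallest_significant_n(0.54, alpha))
```

Under these assumptions the same 54% that is unconvincing in a sample of 100 becomes significant at the 5% level once the sample reaches a few hundred, and stricter significance levels demand larger samples still.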
In addition, technology enables students to investigate more complicated sampling distributions, such as those arising in regression or chi-square analyses.
Example: Returning to the SAT data, we introduce 1000 corresponding GPA values and ask students to calculate the population regression equation. Students then repeatedly sample 100 pairs of observations from this population and calculate the regression equation for each sample. Students plot the sample regression lines to visualize how they vary about the population regression line. By examining the sample regression coefficients, students observe the normality, variability, and unbiasedness of these sampling distributions.
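A sketch of the regression version in Python is given below. The population of (SAT-M, GPA) pairs is invented for illustration (a linear trend plus normal noise), and the number of repeated samples (200) is likewise an assumption.

```python
import random
from statistics import mean, stdev

def slope(xs, ys):
    """Least-squares slope of ys on xs."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sum((x - mx) ** 2 for x in xs)

rng = random.Random(5)
# Hypothetical population: 1000 (SAT-M, GPA) pairs with a linear trend plus noise.
sat = [rng.gauss(500, 100) for _ in range(1000)]
gpa = [1.0 + 0.004 * s + rng.gauss(0, 0.5) for s in sat]
population = list(zip(sat, gpa))
pop_slope = slope(sat, gpa)

# Slope of the fitted line in each of 200 random samples of 100 pairs.
slopes = []
for _ in range(200):
    xs, ys = zip(*rng.sample(population, 100))
    slopes.append(slope(xs, ys))

print(f"population slope      {pop_slope:.5f}")
print(f"mean of sample slopes {mean(slopes):.5f} (sd {stdev(slopes):.5f})")
```

Comparing the mean of the sample slopes with the population slope illustrates unbiasedness, and their standard deviation shows the sample-to-sample variability.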
#8 Present tests of significance in terms of p-values rather than rejection regions.
Not only do p-values provide more information than simple statements of rejection, they also better reflect statistical practice. We try to help students realize that while the significance level allows one to make a decision, the p-value expresses the strength of evidence provided by the sample data.
Example: We ask students to decide (at the 5% significance level) whether more than half of a company’s customers are women, based on two random samples of 200 customers (see ). In the first, 112 of the customers are women (p-value = .0448), and in the second, 124 are (p-value = .0003). Students indicate whether their report to the company would be the same in both cases. While they reject the null hypothesis in both cases, students realize that the sample results are quite different and that the p-value provides more information than a simple statement of rejection.
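The two p-values can be reproduced with a short calculation. This sketch assumes the one-sided z test for a proportion based on the normal approximation, which appears to be the procedure behind the figures quoted above; the function name is our own.

```python
from statistics import NormalDist

def one_sided_p_value(successes, n, p0=0.5):
    """p-value for H0: p = p0 vs Ha: p > p0 using the normal approximation."""
    p_hat = successes / n
    z = (p_hat - p0) / (p0 * (1 - p0) / n) ** 0.5
    return 1 - NormalDist().cdf(z)

for women in (112, 124):
    print(women, round(one_sided_p_value(women, 200), 4))
```

Both results fall below .05, but they differ by more than a factor of 100, which is the information a bare "reject" conceals.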
Example: In an episode of the television series ER, a doctor excitedly reports that the p-value of his study is currently .06 so that he is just "one successful outcome away from statistical significance." He eagerly begins looking for that one last patient so that his work can be published. We have students critique this argument, reinforcing cautions against using fixed significance levels. The example becomes even more dramatic when another doctor realizes that unsuccessful outcomes have been dubiously dropped from the study; we ask students to comment on this practice as well.
#7 Accompany tests of significance with confidence intervals whenever possible.
Confidence intervals provide more information than tests of significance but are generally underutilized in statistical teaching and practice. While a test of significance indicates whether a sample result is statistically significant, a confidence interval estimates the magnitude of the population parameter. This allows one to assess the practical significance of the sample result. Students need to understand the difference between "strong evidence of an effect" (a low p-value) and a "strong effect" (e.g., a very large difference in means).
Example: Utts  discusses a meta-analysis which showed a statistically significant reduction in cholesterol levels between a group of subjects who consumed oat bran and a control group. However, a 95% confidence interval for the mean amount of reduction extended from 3.3 mg/dl to 8.4 mg/dl, suggesting that the magnitude of the difference was actually quite small relative to average cholesterol levels of about 210 mg/dl.
Example: Students analyze data reported in The 1992 Statistical Abstract of the United States that 30.5% of a sample of 40,000 American households own a pet cat (see ). We ask whether this sample provides strong evidence that less than one-third of the population of all American households owns a cat and then whether it provides evidence that much less than one-third owns a cat. A significance test answers the first question in the affirmative (p-value < .0001), but a confidence interval supplies the additional information needed to answer the second question in the negative (95% c.i.: (.300, .310)). Students discern that large sample sizes can often lead to statistically significant results that are not practically significant.
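The contrast between the test and the interval can be checked numerically. The sketch below assumes the usual large-sample z procedures for a single proportion; the variable names are our own.

```python
from statistics import NormalDist

n, p_hat, p0 = 40_000, 0.305, 1 / 3

# Test of H0: p = 1/3 against Ha: p < 1/3 (standard error uses p0).
z = (p_hat - p0) / (p0 * (1 - p0) / n) ** 0.5
p_value = NormalDist().cdf(z)                   # lower-tail probability

# 95% confidence interval (standard error uses the sample proportion).
z_star = NormalDist().inv_cdf(0.975)
margin = z_star * (p_hat * (1 - p_hat) / n) ** 0.5
interval = (p_hat - margin, p_hat + margin)

print(f"z = {z:.2f}, p-value = {p_value:.2g}")
print(f"95% interval: ({interval[0]:.4f}, {interval[1]:.4f})")
```

The test statistic is far out in the tail, yet the whole interval lies within a few percentage points of one-third.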
Example: We ask students to gather prices at two different grocery stores on the same set of products and perform a matched pairs t-test. While the significance test examines whether there is a price difference, a confidence interval estimates the average amount of savings. Students use the interval to decide if the amount of savings is enough to compensate for other factors, such as additional travel time.
#6 Help students to recognize that insignificant results do not necessarily mean that no effect exists.
Just as statistical significance does not establish that an effect is practically important or even guarantee that the effect is present, lack of significance does not constitute proof of no effect. Instead, there may be an effect that the test procedure fails to detect, typically due to a sample size that was not large enough. We aim to help students develop an intuitive understanding of the subtle but important point that failing to reject the null hypothesis does not establish it to be true. Students should realize that the null hypothesis could be false but that the sample data did not provide sufficient evidence to reject it.
Example: In an activity illustrating the famous "Monty Hall Problem," we give three playing cards, two red and one black, to pairs of students. The cards represent prizes behind doors used in a game show, one a winner (black) and two losers (red). One student (dealer) shuffles the cards and holds them facing away from the other (contestant). The contestant chooses a card and the dealer reveals one of the two remaining cards to be red. The contestant is then asked to either switch to the remaining card or stay with the original choice, in an effort to find the (winning) black card. We have the students play this game 20 times and perform a test of significance to see if the proportion of wins using the switching strategy is different from 1/2. Most students fail to reject this hypothesis (power = .152). Since 1/2 agrees with many students’ intuition, they do not find this result surprising. However, if they continue to play the game or reason probabilistically, they discover that the actual probability of winning with the switch strategy is 2/3. Thus, they learn that the original sample size of 20 was not large enough to enable them to detect this difference.
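Students who "continue to play the game" can do so by simulation. The following Python sketch encodes the card game as described above; the function name and the numbers of games are our own choices.

```python
import random

def switch_wins(rng):
    """Play one game with the switching strategy; return True on a win."""
    cards = ["black", "red", "red"]
    rng.shuffle(cards)
    choice = rng.randrange(3)
    # Dealer turns over a red card from the two not chosen
    # (the first one found, which does not affect the switch outcome).
    revealed = next(i for i in range(3) if i != choice and cards[i] == "red")
    switched = next(i for i in range(3) if i not in (choice, revealed))
    return cards[switched] == "black"

rng = random.Random(6)
rates = {}
for games in (20, 200, 20_000):
    rates[games] = sum(switch_wins(rng) for _ in range(games)) / games
    print(games, round(rates[games], 3))
```

With only 20 games the winning proportion can easily land near 1/2; with many more games it settles close to 2/3.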
#5 Stress the limited role that inference plays in statistical analysis.
While statistical inference is a widely used and very important class of techniques, it is just one component of a statistical analysis. Other important considerations include the design of the data collection procedure and an exploratory analysis of the data. Moreover, in many situations, statistical inference procedures cannot even be applied in a meaningful manner. Since inference procedures are overused, students should be taught to adopt a cautious attitude toward them.
Foremost among its limitations, statistical inference applies only to situations where sample data have been randomly selected from a population or process or where experimental subjects have been randomly divided into treatment groups. This reliance on randomization is crucial for helping students to understand what statistical inference is all about.
Example: We ask students to use the data that 9 of the 100 U.S. Senators in 1998 are women to construct a confidence interval for the proportion of women in the 1998 Senate (see ). While the numbers are very easy to substitute into the familiar formula, the interval is meaningless since one knows with certainty the value of the population parameter in this case.
Such examples help students focus on the purposes of statistical inference: drawing conclusions about a population based on a sample or about a treatment effect based on random allocation of subjects.
#4 Always consider issues of data collection.
Another temptation is to ignore issues of random sampling and experimental design when moving on to the inference part of the course. We strongly advocate that students be forced to confront these issues when making inferences. For example, the distinction between an observational study and a controlled experiment determines what conclusions can be drawn from a test of significance. Moreover, applying inference techniques to poorly collected data can produce very misleading conclusions.
Example: A classic example often used to illustrate Simpson’s paradox is the data from a sex discrimination case against the University of California at Berkeley’s graduate admissions process (see  for an original source and  for an interesting account). We suggest having students also analyze these data in the context of statistical inference. A significance test reveals that the difference in the acceptance rates between men and women (.446 and .305, respectively) is highly significant (p-value < .0001), but students should realize that since the data come from an observational study and not a controlled experiment, they cannot conclude that discrimination occurred.
Example: An infamous example often used to illustrate the pitfalls of biased sampling methods is the 1936 Literary Digest survey, in which 57% of 2.4 million respondents favored Alf Landon over Franklin Roosevelt in the presidential election. We introduce students to this example early in the course as an illustration of how improper sampling techniques can produce very misleading data (see  for a discussion of the sources of sampling biases). We return to this example when discussing inference, asking students to use the Literary Digest result to estimate the proportion of Landon supporters in the population (95% c.i.: (.5694,.5706)). As Roosevelt beat Landon in the actual election by a landslide, students see that the poor data collection methods in this setting render any inference results completely invalid.
#3 Always examine visual displays of the data.
An instructor can easily be tempted to limit discussion of exploratory and graphical methods to the first part of the introductory course. However, we strongly recommend always having students apply these techniques to data, including before they carry out inference procedures. In many cases an initial analysis of the data reveals much that a significance test or confidence interval does not. An exploratory analysis can also determine whether or not the inference procedure is even appropriate for the data at hand.
Example: We provide students with data on times between eruptions (in minutes) for the Old Faithful geyser, originally reported in  and also presented in  and . If students merely calculate a confidence interval for the population mean inter-eruption time (95% c.i.: (70.73, 73.90)) without inspecting the data first, they fail to notice the pronounced bimodal nature of the data with peaks around 55 and 78 minutes. With this realization students are able to describe the inter-eruption times more effectively.
Example: Anscombe  provides a particularly effective illustration of the need to examine data first. Given four different bivariate data sets, students calculate the same correlation coefficient (.816) and very significant regression equation (p-value = .002) for each one. However, when students examine scatterplots of the data, they discover that the data sets differ dramatically from each other. Students discern that linear regression is entirely inappropriate for all but one of the data sets, a fact they cannot recognize from the numerical summaries alone.
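The numerical agreement among the four data sets can be verified directly. The values below are Anscombe's published quartet as commonly reproduced; readers should check them against the original paper. The `correlation` helper is our own.

```python
from statistics import mean

def correlation(xs, ys):
    """Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]       # shared by sets I-III
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

quartet = {"I": (x123, y1), "II": (x123, y2), "III": (x123, y3), "IV": (x4, y4)}
correlations = {name: correlation(xs, ys) for name, (xs, ys) in quartet.items()}
for name, r in correlations.items():
    print(name, round(r, 3))
```

The four correlations agree to about three decimal places, which is exactly why the scatterplots, not the summaries, have to carry the message.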
#2 Help students to see the common elements of inference procedures.
We want students to see that the reasoning and structure of statistical inference procedures are consistent, regardless of the specific technique being studied. For example, students should see the sampling distributions for several types of statistics to appreciate their similarities and understand the common reasoning process underlying the inference formulas. In addition, students can view these formulas as special cases of one basic idea. For example, confidence intervals in the introductory course have the form
estimate ± (critical value)(standard error of the estimate).
Similarly, test statistics are typically of the form
(estimated value - hypothesized value) / (standard error of the estimate).
By understanding this general structure of the formulas, students can concentrate on understanding one big idea, rather than trying to memorize a series of seemingly unrelated formulas. Students can then focus on the type and number of variables involved in order to properly decide which formula is applicable. This approach also empowers students to extend their knowledge beyond the inference procedures covered in the introductory course.
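The "one big idea" can be made concrete in code. The sketch below writes the two general forms as Python functions and then specializes them to a single proportion, using the Literary Digest figures quoted earlier. The function names are our own, and the normal critical value is an assumption (a t value would replace it for small-sample means).

```python
from statistics import NormalDist

def confidence_interval(estimate, standard_error, confidence=0.95):
    """estimate +/- (critical value)(standard error of the estimate)."""
    critical = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = critical * standard_error
    return estimate - margin, estimate + margin

def standardized_statistic(estimate, hypothesized, standard_error):
    """(estimated value - hypothesized value) / (standard error of the estimate)."""
    return (estimate - hypothesized) / standard_error

# Special case: one proportion, with the Literary Digest numbers.
p_hat, n = 0.57, 2_400_000
se_for_interval = (p_hat * (1 - p_hat) / n) ** 0.5
low, high = confidence_interval(p_hat, se_for_interval)
print(f"95% interval: ({low:.4f}, {high:.4f})")

# Same template for a test of H0: p = 0.5 (standard error uses the null value).
se_under_null = (0.5 * 0.5 / n) ** 0.5
print("z =", round(standardized_statistic(p_hat, 0.5, se_under_null), 1))
```

Only the estimate and its standard error change from one procedure to the next; the surrounding template stays the same.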
#1 Insist on complete presentation and interpretation of results in the context of the data.
Students need to realize that the end result of statistical inference is not simply a "yes" or "no" answer. We consider it unacceptable for a student to write a conclusion as brief as "reject the null hypothesis". Instead, students should discuss inference results in the context of the issue at hand, as in "the data provide strong evidence that Vietnam veterans divorce at a rate higher than the general population". Our goal is not only for students to be able to interpret conclusions reported in scholarly and popular literature, but also to be able to explain them clearly to people who are not familiar with statistics. Ideally, students also describe the reasoning behind the inference statement, for example by interpreting the phrases "95% confidence" and "significant result" in their own words. Finally, students should be given the opportunity to submit their interpretations repeatedly, with frequent feedback from the instructor, until they are able to express their ideas clearly. This emphasis on mastering the language further helps students internalize the concepts.
Statistical education reform emphasizes active learning on the part of students, conceptual understanding of fundamental statistical ideas, use of engaging applications involving genuine data, and development of student communication skills. While these principles have largely been accepted for teaching data analysis, we believe they have not been sufficiently implemented for teaching inference. To facilitate incorporation of these principles into the teaching of statistical inference, we have provided suggestions and examples that:
1. F.J. Anscombe, Graphs in statistical analysis, The American Statistician, 27 (1973), 17-21.
2. A. Azzalini and A.W. Bowman, A look at some data on the Old Faithful geyser, Journal of the Royal Statistical Society, Series C, 39 (1990), 357-366.
3. P. Bickel and J.W. O'Connell, Is there a sex bias in graduate admissions?, Science, 187 (1975), 398-404.
4. Samprit Chatterjee, Mark S. Handcock, and Jeffrey S. Simonoff, A Casebook to Accompany a First Course in Data Analysis. John Wiley & Sons, 1995.
5. George Cobb, Reconsidering statistics education: a National Science Foundation conference, Journal of Statistics Education [Online]. 1 (1993), http://www.stat.ncsu.edu/info/jse/v1n1/cobb.html.
6. George Cobb, Teaching statistics, in Lynn Steen, ed. Heeding the Call for Change: Suggestions for Curricular Action, MAA Notes #22, Mathematical Association of America, 1992, 3-43.
7. George W. Cobb and David S. Moore, Mathematics, statistics, and teaching, The American Mathematical Monthly, 104 (1997), 801-824.
8. The College Board, Advanced Placement course description: Statistics, College Entrance Examination Board and Educational Testing Services, 1996.
9. David Freedman, Robert Pisani, and Roger Purves, Statistics (3rd ed.), W.W. Norton & Co., 1998.
10. Joan Garfield, How students learn statistics, International Statistical Review, 63 (1995), 25-34.
11. Florence and Sheldon Gordon, eds., Statistics for the Twenty-First Century, MAA Notes #26, Mathematical Association of America, 1992.
12. D.J. Hand, F. Daly, A.D. Lunn, K.J. McConway, and E. Ostrowski, eds. A Handbook of Small Data Sets, Chapman & Hall, 1994.
13. David Hoaglin and David Moore, eds., Perspectives on Contemporary Statistics, MAA Notes #21, Mathematical Association of America, 1992.
14. James M. Landwehr, Jim Swift, and Ann E. Watkins, Exploring Surveys and Information from Samples, Dale Seymour Publications, 1987.
15. David S. Moore, New pedagogy and new content: the case of statistics, International Statistical Review, 65 (1997), 123-165.
16. David S. Moore, What is statistics?, in David Hoaglin and David Moore, eds., Perspectives on Contemporary Statistics, MAA Notes #21, Mathematical Association of America, 1992, 1-17.
17. David S. Moore and George W. Cobb, Mathematics, statistics, and teaching, American Mathematical Monthly, 104 (1997), 801-824.
18. David S. Moore and George P. McCabe, Introduction to the Practice of Statistics (2nd ed.), W.H. Freeman, 1993.
19. Allan J. Rossman and Beth L. Chance, Workshop Statistics: Discovery with Data and Minitab, Springer-Verlag, 1998.
20. Richard L. Scheaffer, Mrudulla Gnanadesikan, Ann Watkins, and Jeffrey A. Witmer, Activity-Based Statistics, Springer-Verlag, 1996.
21. Jessica M. Utts, Seeing Through Statistics, Duxbury Press, 1996.