BETH CHANCE

Context: My answers pertain to an introductory statistics course for nonmajors with an intermediate algebra prerequisite. Most recently I've tended to teach psychology and other social science and liberal arts majors in a general education course. Class sizes are capped at 48 and I typically have 2-3 sections of such a course each quarter.

I start by reviewing the content I want to cover and look through other question sources to get the juices flowing (see answer to constructing exams as well). I try to start a few days ahead of time so I have time to put it aside and then look at it again with a fresh perspective the next day. Often I will look through the week's news to find an interesting context. Once I have a tentative exam, I show it to another faculty member to check for coverage and level of difficulty. I also advocate showing it to a TA to check for reasonableness and clarity. I'm fortunate to have a colleague who will review my exam after each of many revisions. I also time how long it takes me to take the exam (reading the questions and writing complete solutions) and multiply by 5 or 6 to see if that is close to the time allotment the students will have or if I should reduce the number of questions. I decide if the exam is reasonable by considering the point distribution, and I even try to think a bit about the usefulness of each question (how much did I learn about the students?). As I am grading the exam, I make notes to myself on ways to change the question or its grading the next time around.
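
The timing check described above is simple arithmetic, and a minimal sketch may make it concrete. The function names and the 9-minute and 50-minute figures are hypothetical; only the multipliers of 5 and 6 come from the text.

```python
def estimated_student_minutes(instructor_minutes, low=5, high=6):
    """Return the (low, high) range of minutes students may need.

    instructor_minutes is the time the instructor takes to read the
    questions and write complete solutions. The default multipliers
    5 and 6 are the rule of thumb quoted in the text.
    """
    return instructor_minutes * low, instructor_minutes * high


def fits_allotment(instructor_minutes, allotted_minutes, low=5, high=6):
    """True if even the high estimate fits within the allotted time."""
    _, high_estimate = estimated_student_minutes(instructor_minutes, low, high)
    return high_estimate <= allotted_minutes


# Hypothetical example: the instructor needs 9 minutes; the period is 50.
print(estimated_student_minutes(9))  # (45, 54)
print(fits_allotment(9, 50))         # False, so consider cutting a question
```

Other instructors may prefer a different multiplier, which is why it is left as a parameter.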

GEORGE COBB

Context: My comments refer to Mount Holyoke's Stat240: Intro to Design and Analysis of Experiments, which for many students serves as an alternative introductory statistics course. (I teach this regularly, but have taught the "standard" intro course only once in 20 years; in fact that course has existed in our department for only the last four years.)

1. In the weeks leading up to the exam, I look for data sets. (They must be both real and new, where "new" means not in the book, and not on any previous exams or quizzes ever given here.)
2. Next, I write questions tied to the data sets, based on what I consider a reasonable analysis given what we've done so far in the course.
3. Next, I look for duplications (I won't ask for the same thing to be done to two different data sets) and omissions. (Sometimes I have to find another data set to make sure that everything gets covered.)
4. Then I break the questions into equal-size parts, to help students plan their time.
5. Finally, I check for length by taking the exam myself. However, since I try to allow students extra time, I don't worry a lot about getting the length exactly right.

I suppose you could call this a "data driven" approach to creating the exams. Steps 1 and 2 are the important ones; the rest is mainly grooming. I really do allow the data sets to suggest the questions, based on what I think a reasonable analysis would look like. My hope is that this process helps keep the exams more like what statisticians actually do than would be the case with a different process. If, as I plan the exam, there are gaps in coverage, that tells me I need to find an additional data set of a different sort.

JOAN GARFIELD

Context: An introductory statistics class taught in a college of education but serving beginning graduate students in many departments across the university who have never before studied statistics. Typically about 25-30 students in a class.

In creating a test, I think about the content that the exam is to cover, and decide which key concepts and competencies I want to assess. I think of important things that students may misunderstand, so that I can see whether anyone is misunderstanding the material. I try to have a balance of items that assess basic literacy as well as challenging items that require students to reason about data or concepts and explain their reasoning. I never try to include "everything" studied in a unit but try to sample from each general topic area covered on the exam. If there's time I ask my TA to review the exam and check it for clarity and any typos. I usually have a pretty good sense of time, and try to keep the exams brief, realizing that for every minute it takes me to solve a problem it will take the students about 10! Therefore, I do "take" the exam myself, writing down the correct answers, before I give it. I have also discovered many errors this way! I know it's a good exam if the students do not misunderstand the questions (as evidenced by questions in class as well as responses to the questions) and if I get a negatively skewed distribution with most students doing well, and at least some students getting all answers correct. (My students are graduate students who study and try to succeed in this class.) My goal is for all students to master the material and I design the test to help me determine if this has happened.

JOHN HOLCOMB

Context: The introductory statistics course in the mathematics department at Cleveland State University (CSU) that I teach generally runs 2-3 sections of 30-45 students each, with one section offered during the summer. The course prerequisite is a college intermediate algebra course or a suitable score on our mathematics placement exam, although I do not believe that prerequisites are actually checked on our campus. The course runs for a fifteen-week semester with an additional exam week. Cleveland State comes from a legacy of years of teaching on a ten-week quarter system, and although we are now on semesters, each course is 4.0 credit hours. The introductory statistics course is generally offered for 65 minutes per class on M-W-F.

Cleveland State University is a comprehensive metropolitan university located in downtown Cleveland. There is only one dormitory on campus, so almost all the students are commuters. In addition, CSU is an open-enrollment institution that accepts every applicant with a high school diploma. The mathematics department where I teach offers a master's degree in mathematics. The general teaching load for each instructor is 8.0 credit hours per semester provided the faculty member is active in research in some way.

I tend to follow this outline.

Figure out the concepts/topics/methods that I want to test.
Find real data problems that will test those issues.
Write the questions.
Make the key.
Evaluate time.
Evaluate if there is a question to separate the A's from the B's.

My goal in an exam is to test the material that I told the students was important. I do not try to test material that I did not spend a considerable portion of time going over. This can be challenging when one gives open note/open book exams. Students often do not study enough and are not prepared for the exams. I try to warn them that with only 65 minutes for the exam, they have to be prepared and organized.

I find that I have to make the key BEFORE I make the final copies to distribute. I always find mistakes in my wording, or I find that the data did not test the concept I wanted, or some similar problem. I have difficulty making myself do this, but whenever I don't, I end up announcing several corrections while the students are taking the test, and this is not good. Making the key also helps me assess how long students might take to complete the exam.

Time on exams is difficult for me. Students often complain that my exams are too long. My feeling tends to be that if they knew the material, it would not take them as long. Judging exam length takes experience.

I also try to write one question that is a little bit of a stretch of understanding to separate the A's and the B's. My grades tend to be very U-shaped - lots of A's and B's or D's and F's, and very few C's. I want to make sure that the A students really understand the material well.

I typically do not have a colleague look at my exams. I find it difficult to help colleagues when they ask me to look at an exam, so I don't generally ask for their help. The colleague does not know my students, nor what I covered in class.

My guiding principle is basically this: if a student comes to class and does the practice homework, will he or she do well on this exam? If the answer is that the student needs additional information somehow to get a B or an A, then I think the exam is too difficult.

CARL LEE

Context: The type of course: Introductory statistics. Covers content typical of an introductory statistics course. The majority of students are business majors (75%). The rest are from a variety of departments other than Science & Technology. Most students are juniors, with ages ranging from 20 to 25. They are full-time students, but many of them have part-time jobs. Each semester, we have about 400 to 500 students. Their background is usually weak. Less than 10 percent of students have had pre-calculus.

I decide the important concepts to be tested first. Then, I look for some real world data sets that may be appropriate for the concepts covered and some concept questions, such as true/false questions, that I gave previously. I also consider some real world scenarios and create questions based on the scenarios without having the actual data values.

I run some analyses myself before developing questions. Using the results, I begin to think about questions that will test the important concepts I intend to test. Some questions will be multiple choice questions, some will be reasoning questions, and some will be true/false questions.

I do not ask another instructor to review my questions. However, on many occasions, we share with each other the types of questions we use.

This really comes from experience. We have a testing center to monitor our tests. The summary report also records the amount of time spent by each student. This gives me data to decide if the exam is reasonable timewise.

We have a testing center to monitor our tests. The summary report sent to the instructor also includes the percent of students who selected each choice on the multiple choice questions. This summary gives me a good idea of how students chose their answers, and knowing how I constructed the choices gives me a good idea of why they made mistakes. The formative summary of this assessment report tells me the degree of difficulty for each question and where the mistakes were. I usually look for a balance of some difficult and some easier questions. An average and median between 70% and 80% would be considered a good exam. In addition, I look at the level of difficulty and where/why they made mistakes as feedback for revising my instructional approach in future semesters.
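
The kind of summary described above can be reproduced by hand for a single item. The following is only a sketch: the responses, the answer key, and the exam scores are hypothetical, and the testing center's actual report format is not specified in the text.

```python
from collections import Counter
from statistics import mean, median


def choice_percentages(responses, choices="ABCD"):
    """Percent of students who selected each choice on one item."""
    counts = Counter(responses)
    n = len(responses)
    return {choice: 100 * counts.get(choice, 0) / n for choice in choices}


def item_difficulty(responses, key):
    """Proportion of students who chose the keyed (correct) answer."""
    return sum(r == key for r in responses) / len(responses)


def in_target_range(scores, low=70, high=80):
    """True if both the mean and the median fall in [low, high] percent."""
    return low <= mean(scores) <= high and low <= median(scores) <= high


# Hypothetical responses of 10 students to one item whose key is "A".
item_responses = list("AABACABDAA")
print(choice_percentages(item_responses))
print(item_difficulty(item_responses, "A"))

# Hypothetical exam scores in percent.
exam_scores = [62, 71, 74, 78, 80, 85, 91]
print(in_target_range(exam_scores))
```

A distractor that attracts a large share of responses points to a specific misconception, which is the feedback the paragraph above describes.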

TONY ONWUEGBUZIE

Context: The 3-hour statistics classes that I teach involve graduate students (i.e., master's and doctoral students). My comments below, in boldface type, reflect statistics courses taught at the introductory, intermediate, and advanced levels. I have taught graduate-level statistics at the University of Central Arkansas (previously a master's-granting institution), Valdosta State University (a doctoral-granting institution), Howard University (a Research I, Research Intensive institution), and the University of South Florida (a Research Intensive Institution). My courses almost always are required. My average class size is 20.

When writing an examination from scratch, I typically use a Table of Specifications to decide which domains to sample, the level of mastery to assess using Bloom's (1956) taxonomy (i.e., knowledge, comprehension, application, analysis, synthesis, and evaluation), and the number of items to write for each combination of domain and level. Where possible, I ask another instructor to review my exams.

As mentioned in my response to the "Use of Exams" section, I never administer timed in-class statistics examinations. Rather, my in-class examinations are administered under untimed conditions, utilizing an open-book and open-notes format. I record the time taken by students as they turn in their examination forms. Subsequently, I note the correlation between the length of time taken and each student's score. My assumption is that a significant positive relationship would indicate that some students are not spending sufficient time on each question (i.e., poor test-taking skills as a possible primary cause and poor study skills as a possible secondary cause), to their detriment, whereas a significant negative relationship would indicate that students who are taking the most time are most likely to be under-prepared for the examination (i.e., poor study skills as a possible primary cause and poor test-taking skills as a possible secondary cause). However, in 15 years of teaching at the college level, I have never observed a significant relationship.
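
A check of this kind might look like the sketch below. The times and scores are hypothetical, and the permutation test is a substitution chosen because it suits small classes and needs only the standard library; the text does not say which significance test is actually used.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def permutation_p_value(x, y, n_permutations=5000, seed=0):
    """Two-sided permutation p-value for the correlation of x and y.

    Shuffling y breaks any link with x, so the share of shuffles whose
    |r| is at least the observed |r| estimates the p-value.
    """
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson_r(x, shuffled)) >= observed:
            extreme += 1
    # The add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical data: minutes taken and exam score for 10 students.
minutes = [55, 62, 70, 75, 81, 90, 96, 104, 110, 125]
scores = [88, 72, 91, 65, 79, 84, 70, 93, 77, 81]

r = pearson_r(minutes, scores)
p = permutation_p_value(minutes, scores)
print(f"r = {r:.2f}, permutation p = {p:.3f}")
```

The sign of r distinguishes the two interpretations given above, and the p-value indicates whether either one is worth pursuing.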

As noted in my section entitled "Post-Exam Feedback," I spend at least one-half class period providing post-examination feedback. (This also allows me to model item analyses.) I provide each student with a scoring key that provides the solutions or model solutions to all items. In particular, I undertake an item analysis (classical test theory due to the relatively small sample sizes [i.e., .30]). I compute item indices such as item difficulty, item discrimination, and point-biserial correlation. These indices are compared to the cutpoints provided in the literature (cf. Crocker & Algina, 1986) for deeming good items. For example, I consider point-biserial correlations (obtained by correlating scores for open-ended responses and overall test scores) that are two or more standard deviations above 0 as being indicative of a good item (Crocker & Algina, 1986, p. 324). For example, a class size of 20 would yield an approximate cutpoint of .23 for a point-biserial correlation. Also, I use Ebel's (1965) criteria for interpreting discrimination indices, D. (I compute discrimination indices for open-ended responses via Kelley's [1939] recommendations.) Specifically,

(a) If D ≥ .40, the item is functioning satisfactorily

(b) If .30 ≤ D ≤ .39, little or no revision is required

(c) If .20 ≤ D ≤ .29, the item is marginal and needs revision

(d) If D ≤ .19, the item should be revised or eliminated
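
The indices named above can be computed in a few lines. This is a sketch under stated assumptions rather than the procedure actually used: the data are hypothetical, the upper and lower groups are taken as the top and bottom 27% of total scores (the usual reading of Kelley's recommendation), the .20 to .29 "marginal" band follows the usual statement of Ebel's criteria, and the standard error of a correlation is approximated by 1/sqrt(n - 1). Under that approximation, n = 20 gives about .23 for one standard error, so the multiplier is left as a parameter.

```python
import math


def pearson_r(x, y):
    """Pearson correlation; with a 0/1 item this is the point-biserial."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def item_difficulty(item_scores, max_points=1):
    """Mean item score as a proportion of the maximum points."""
    return sum(item_scores) / (len(item_scores) * max_points)


def discrimination_index(item_scores, total_scores, max_points=1, fraction=0.27):
    """D = difficulty in the upper group minus difficulty in the lower group."""
    n = len(total_scores)
    k = max(1, round(fraction * n))  # size of each extreme group
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower = [item_scores[i] for i in order[:k]]
    upper = [item_scores[i] for i in order[-k:]]
    return item_difficulty(upper, max_points) - item_difficulty(lower, max_points)


def ebel_category(d):
    """Label a discrimination index using Ebel's bands."""
    if d >= 0.40:
        return "functioning satisfactorily"
    if d >= 0.30:
        return "little or no revision required"
    if d >= 0.20:
        return "marginal, needs revision"  # band (c) as commonly cited
    return "revise or eliminate"


def correlation_cutpoint(n, multiplier=2):
    """multiplier standard errors above 0, assuming SE is 1/sqrt(n - 1)."""
    return multiplier / math.sqrt(n - 1)


# Hypothetical data: one 0/1 item and total scores for 10 students.
item = [1, 1, 0, 1, 1, 0, 1, 0, 1, 0]
totals = [95, 90, 88, 84, 80, 75, 70, 66, 60, 52]

d = discrimination_index(item, totals)
print(f"difficulty = {item_difficulty(item):.2f}")
print(f"D = {d:.2f} ({ebel_category(d)})")
print(f"point-biserial = {pearson_r(item, totals):.2f}")
print(f"one SE for n = 20: {correlation_cutpoint(20, multiplier=1):.2f}")
```

For open-ended items scored with partial credit, passing the item's maximum score as max_points keeps the difficulty and D on a 0 to 1 scale.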

ROXY PECK

Context: The comments I gave are based on Stat 130 and Stat 217. These are courses primarily for students majoring in liberal arts fields (Stat 130) or social sciences (Stat 217). Class size is usually 45-48. Stat 130 is a general education course in statistical literacy, whereas Stat 217 is more of a methods course for students who will continue on to a research methods course in their own discipline.

I usually start by making a list of content areas that are covered by the exam, and then for each I list the concepts that I would like to explore. For example, I might list linear regression as the content topic and then the meaning of least squares and how to identify potentially influential observations as being among the concepts that I want to test. Then for each topic, I try to come up with a reasonably simple, but hopefully interesting, context for the problem. As I mentioned in earlier responses, I am an advocate of using real data and strive to do that in in-class examples, homework problems, and project assignments. Even so, I often use realistic rather than real data on exams in order to be able to keep the contexts relatively simple--given the time constraints of an exam I don't want students to have to spend an inordinate amount of time trying to understand the context. But context is essential; if there is no context, you can't evaluate a student's ability to interpret the results of statistical analyses.

Once I have a set of candidate questions, I start to worry about the length of the exam, and pare down as seems warranted. I work through the exam, and if I can complete it in 10 minutes or less, then I think it is OK as a 50 minute exam for students. I don't usually have another instructor review the exam, mainly because I am not always working far enough ahead of time, but I probably ought to do this more often.

As far as deciding if, after all is said and done, it was a good exam or not, I guess I don't really do any kind of formal analysis like some of the others have suggested they do. Mainly, if most students are able to finish the exam in the allotted time, I get a reasonable distribution of scores, and I wasn't toooo depressed by the collection of responses to any of the questions, I'm happy.

ALLAN ROSSMAN

Context: My comments apply to a "Stat 101" algebra-based service course for students in humanities and social science majors. I have in mind the Math 121 course at Dickinson and courses such as Stat 130 and Stat 217 at Cal Poly.

At this point in my career I usually start by looking at exams from previous years. I never re-use exams, but I do lift and revise selected questions. On the other hand, when I'm starting from scratch, I try to create a list of potential questions or issues while I'm teaching the course. Another strategy is that when I start to write the exam, I make a list of topics/issues that I want to ask about, and I try to put on that list several issues that students have struggled with in class or on homework. Then I start writing the questions themselves, and I try to find real data from other textbooks or from the Web in order to provide the contexts for the questions. Sometimes I'll even steal entire questions from the exercises of another text, but most often I do substantial tweaking of them. I usually write more questions than I'll be able to use, and then I look back over the exam to check for two things: a) whether I've "covered" the most important ideas, and b) whether I've included a reasonable variety of question types. I always aim to have at least half, preferably 2/3 or so, of the points on an exam pertain to students' conceptual understanding and interpretive ability, with no more than half on computational and mechanical skills. At this stage I often have to remind myself to include some fairly routine questions and not just challenging ones. Then when I think I have a reasonable exam draft written, I always ask a colleague to review it, commenting on whether it seems reasonable and suggesting ways to improve it. Then at the final stage I assign tentative point allocations to the questions. After the exam is given and graded, I evaluate it primarily by seeing how well students did on the questions that I consider the more challenging and conceptual ones. I'm most pleased if the best students give very good answers, the mediocre students give competent answers, and the less-prepared students struggle. 
I also look at the grade distribution and hope that the exam has succeeded in differentiating the students' performances. In a typical introductory course I hope for an average score in the low 80s, with several scores in the 90s, lots in the 80s, many in the 70s, and a few below that.

CANDACE SCHAU

Context: I taught introductory statistics to graduate students in a College of Education for over 20 years. My classes contained 15 to 30 students and met twice a week for 1.25 hours per session. Most of the students in my classes were working toward master's degrees in Education (although some were working toward Ph.D. degrees), most were in the course because it was required, and almost all did not want to be in the course. Most of the students held jobs, had significant others, and considered receiving their degrees (with a grade point of 4.0) as essential but usually third in importance (with loved ones first and jobs second). Because so many already were established in jobs (and in their lives) and felt as though they performed them well without ever reading research or understanding the statistics found in everyday life, they viewed my introductory statistics course as the most difficult educational hurdle they faced and not very relevant to them. When I first started teaching, I really had hoped that my one introductory course would convince them that statistics is important in their lives. Given my students and their situations, however, that was a dream born from my ignorance (and, I'm sorry to say, my arrogance). For the last several years that I taught, I really hoped to accomplish a few more realistic (for my students) goals. By the end of my course, I wanted my students to:

Develop a foundational understanding of important introductory statistical concepts and their uses.

Learn that they could understand a discipline that involved numbers, if they worked hard.

I also hoped that a few students would really like statistics and recognize its value to them and so decide to take additional statistics courses that were not required. My course assessments, however, were designed to assess the first goal.

I think the best way to write a test is to first construct a table of specifications. This table crosses the content with the kinds of learning outcomes you want to assess. As an example, I’ll use language from the ARTIST website. On a first test, you might want to cover the five content areas of data types, univariate data representation, data production, measures of center, and measures of spread. You might want to assess the two learning outcomes of literacy and reasoning. Your table of specifications would include the five content areas as rows and the two learning outcomes as columns. Based on the importance that you attribute to each content area crossed with each outcome, you then fill in each cell with the percentage of points on your test that you wish to devote to that content-outcome combination. You choose the content areas and learning outcomes based on your course objectives. You then write your items. As I worked from tables like this one, I would alter the content, outcomes, and percentages as needed. This process also helped me identify content-outcome combinations that I thought were important enough to assess but that I hadn’t covered well in class. In other words, the process of creating this table and writing a test from it provided me with information that I fed back into my course planning and delivery.
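
A table of specifications like the one just described can be kept as a small data structure. In the sketch below the five content areas and two outcomes come from the paragraph above, while every percentage, the 50-point exam total, and the function names are hypothetical placeholders.

```python
content_areas = [
    "data types",
    "univariate data representation",
    "data production",
    "measures of center",
    "measures of spread",
]
outcomes = ["literacy", "reasoning"]

# Hypothetical percentages of exam points: (literacy, reasoning) per area.
weights = {
    "data types": (10, 5),
    "univariate data representation": (10, 15),
    "data production": (5, 15),
    "measures of center": (10, 10),
    "measures of spread": (5, 15),
}


def check_total(table):
    """Raise an error unless the cell percentages sum to 100."""
    total = sum(sum(row) for row in table.values())
    if total != 100:
        raise ValueError(f"percentages sum to {total}, not 100")
    return total


def allocate_points(table, total_points):
    """Convert each cell's percentage into points on the exam."""
    return {
        area: tuple(total_points * pct / 100 for pct in row)
        for area, row in table.items()
    }


def outcome_totals(table):
    """Percentage of the exam devoted to each learning outcome."""
    return tuple(sum(column) for column in zip(*table.values()))


check_total(weights)
print(f"{'content area':<32}{outcomes[0]:>10}{outcomes[1]:>11}")
for area in content_areas:
    literacy_pct, reasoning_pct = weights[area]
    print(f"{area:<32}{literacy_pct:>10}{reasoning_pct:>11}")
print(outcome_totals(weights))
print(allocate_points(weights, 50)["data types"])
```

Revising the table then amounts to editing the weights and rerunning the check, which mirrors the alter-as-needed process described above.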

I would ask other statistics instructors if I could look at their tests, and I also examined test banks that accompanied various introductory statistics texts. I rarely would use an item exactly as I found it, but I often wrote test items based on an idea found in these other sources.

I always try to begin my tests with a very easy item that assesses low-level understanding of an important but simple concept. My goal is to help students gain some confidence at the beginning of the test and to help them relax. It is amazing how difficult it can be to write an item that everyone answers correctly.