When writing an examination from scratch, I typically use a Table of Specifications to decide which domains to sample, the level of mastery to assess using Bloom's (1956) taxonomy (i.e., knowledge, comprehension, application, analysis, synthesis, and evaluation), and the number of items to write for each combination of domain and level. Where possible, I ask another instructor to review my exams.
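To make the blueprint concrete, here is a minimal Python sketch of how such a Table of Specifications might be tabulated; the domains, Bloom levels, and item counts shown are purely hypothetical illustrations, not the contents of any actual exam.

    # Hypothetical Table of Specifications (exam blueprint).
    # Domains and item counts are illustrative only.
    BLOOM_LEVELS = ["knowledge", "comprehension", "application",
                    "analysis", "synthesis", "evaluation"]

    # blueprint[domain][Bloom level] = number of items to write
    blueprint = {
        "descriptive statistics": {"knowledge": 2, "comprehension": 2, "application": 3},
        "hypothesis testing":     {"comprehension": 2, "application": 4, "analysis": 2},
        "regression":             {"application": 3, "analysis": 2, "evaluation": 1},
    }

    total_items = sum(n for levels in blueprint.values() for n in levels.values())
    print(f"Total items planned: {total_items}")
    for domain, levels in blueprint.items():
        print(domain, "->", {lvl: levels.get(lvl, 0) for lvl in BLOOM_LEVELS})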

As mentioned in my response to the "Use of Exams" section, I never administer timed in-class statistics examinations. Rather, my in-class examinations are administered under untimed conditions, in an open-book, open-notes format. I record the time taken by each student as they turn in their examination forms. Subsequently, I compute the correlation between the length of time taken and each student's score. My assumption is that a significant positive relationship would indicate that some students are not spending sufficient time on each question (i.e., poor test-taking skills as a possible primary cause and poor study skills as a possible secondary cause), to their detriment, whereas a significant negative relationship would indicate that the students who take the most time are most likely to be under-prepared for the examination (i.e., poor study skills as a possible primary cause and poor test-taking skills as a possible secondary cause). However, in 15 years of teaching at the college level, I have never observed a significant relationship.
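As an illustration of the time-score check described above, the following Python sketch computes the Pearson correlation between completion times and scores along with its p value; the data are hypothetical, and scipy is assumed to be available.

    # Hypothetical sketch: correlate time-to-completion with exam scores.
    from scipy.stats import pearsonr

    minutes_taken = [48, 55, 62, 70, 75, 80, 84, 90, 95, 102]   # hypothetical hand-in times
    exam_scores   = [88, 92, 79, 85, 74, 90, 81, 77, 83, 70]    # hypothetical scores

    r, p_value = pearsonr(minutes_taken, exam_scores)
    print(f"r = {r:.2f}, p = {p_value:.3f}")

    # A significant positive r would suggest some students rush (test-taking
    # skills); a significant negative r would suggest the slowest students are
    # under-prepared (study skills); a non-significant r supports neither pattern.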

As noted in my section entitled "Post-Exam Feedback," I spend at least one-half of a class period providing post-examination feedback. (This also allows me to model item analyses.) I provide each student with a scoring key that contains the solutions or model solutions to all items. In particular, I undertake an item analysis (using classical test theory, owing to the relatively small sample sizes [i.e., fewer than 30]). I compute item indices such as item difficulty, item discrimination, and the point-biserial correlation. These indices are compared to the cutpoints provided in the literature (cf. Crocker & Algina, 1986) for identifying good items. For example, I consider point-biserial correlations (obtained by correlating scores for each open-ended response with overall test scores) that are two or more standard deviations above 0 as indicative of a good item (Crocker & Algina, 1986, p. 324); a class size of 20 would yield an approximate cutpoint of .23 for a point-biserial correlation. Also, I use Ebel's (1965) criteria for interpreting discrimination indices, D. (I compute discrimination indices for open-ended responses via Kelley's [1939] recommendations.) Specifically,

    (a) If D ≥ .40, the item is functioning satisfactorily
    (b) If .30 ≤ D ≤ .39, little or no revision is required
    (c) If .20 ≤ D ≤ .29, the item is marginal and needs revision
    (d) If D ≤ .19, the item should be revised or eliminated
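As a minimal sketch of the item analysis described above, the following Python code computes item difficulty, a discrimination index D based on upper and lower 27% groups (following Kelley's [1939] recommendation), and the point-biserial correlation for one dichotomously scored item, then classifies D against Ebel's cutpoints. The response matrix is randomly generated for illustration only and does not represent actual student data.

    # Hypothetical classical-test-theory item analysis for one 0/1-scored item.
    import numpy as np

    rng = np.random.default_rng(0)
    n_students, n_items = 20, 10
    responses = rng.integers(0, 2, size=(n_students, n_items))  # simulated 0/1 item scores
    total = responses.sum(axis=1)                               # each student's total score

    item = responses[:, 0]                                      # analyze the first item

    # Item difficulty: proportion of students answering the item correctly.
    difficulty = item.mean()

    # Discrimination index D: p(upper 27% on total) - p(lower 27% on total).
    k = max(1, round(0.27 * n_students))
    order = np.argsort(total)
    D = item[order[-k:]].mean() - item[order[:k]].mean()

    # Point-biserial correlation between the item and the total score
    # (uncorrected; a corrected version would exclude the item from the total).
    r_pb = np.corrcoef(item, total)[0, 1]

    if D >= .40:
        verdict = "functioning satisfactorily"
    elif D >= .30:
        verdict = "little or no revision required"
    elif D >= .20:
        verdict = "marginal; needs revision"
    else:
        verdict = "revise or eliminate"

    print(f"difficulty = {difficulty:.2f}, D = {D:.2f}, r_pb = {r_pb:.2f} -> {verdict}")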
