Stat 301 – Review 2

Due Monday, midnight: Submit Exam 2 Question and Exam 2 Example Questions in Canvas and then review responses before exam

Optional Review Session: Tuesday, 7-8pm, Zoom office hours link in Canvas

Exam Format: The exam will cover topics from Chapter 2 – one quantitative variable and Chapter 3 – comparing two groups on a binary response variable (Weeks 5-7, Quizzes 4-6, HW 4-6). The exam format will be similar to Exam 1 with a mix of multiple choice, short answer, and longer answer questions. You will not be using software or applets but you will be expected to interpret provided output (e.g., from R, JMP, applets, other), discuss how you would use technology (e.g., R or applet), and interpret “code.” You may use two pages of your own notes (8.5 x 11, front and back, i.e., four sides). You should also have your calculator (not a cell phone). The questions will not be heavily computational, but you are expected to know how to set up the calculations by hand. You are also expected to explain your reasoning, show your steps, and interpret your results. The exam will contain approximately 50 points (so a 2-point problem should only take you about 2 minutes).

Study Hints: You should study from the text (including chapter summaries), class meetings, investigations, PowerPoint slides, labs, homeworks, homework solutions, grader comments on your assignments, Chapter Examples, practice questions, and quizzes. See also the Choice of Procedures pages and the ISCAM Glossary. In studying, I recommend re-doing investigations and homeworks, without looking at the online solutions, then checking your answers. Review comments on your graded assignments and general comments in Canvas. You also have What Went Wrong examples, Practice quizzes, and Self-check videos.

Review Problems: Click here (Solutions)

Principles to keep in mind:

· Standardized/test statistic = (statistic – hypothesized)/(SE of statistic)

· Many confidence intervals = observed statistic ± (critical value) × (SE of statistic)

· SE of statistic estimates sample to sample variation of statistic (vs. variability in data)
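
As a quick illustration of the first two principles, here is a minimal Python sketch (the course itself uses R and applets; Python is used only for illustration). The function names and all numbers are hypothetical, and the critical value of 2 is the rough 95% multiplier rather than an exact t* value.

```python
import math

def standardized_statistic(statistic, hypothesized, se):
    """(statistic - hypothesized) / (SE of statistic)."""
    return (statistic - hypothesized) / se

def confidence_interval(statistic, critical_value, se):
    """statistic +/- (critical value) x (SE of statistic)."""
    margin = critical_value * se
    return statistic - margin, statistic + margin

# Hypothetical sample: n = 25, sample mean 98.2, sample SD 0.7, H0: mu = 98.6
n, xbar, s, mu0 = 25, 98.2, 0.7, 98.6

se = s / math.sqrt(n)                              # SE of the sample mean
t_stat = standardized_statistic(xbar, mu0, se)     # number of SEs from the hypothesized value
lower, upper = confidence_interval(xbar, 2, se)    # rough 95% interval (multiplier of 2)

print(round(se, 3), round(t_stat, 2), round(lower, 2), round(upper, 2))
```

Note that the SE here (0.14) is much smaller than the sample SD (0.7): the SE describes how much sample means vary from sample to sample, not how much individual observations vary.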

From Chapter 2

The big ideas in Chapter 2 are essentially the same as Chapter 1, just focused on a quantitative variable and likely using the mean as the parameter of interest. In this scenario, “descriptive statistics” (and “exploratory data analysis”) are more interesting, and you should be able to discuss which graphs/values are more informative in different situations and which comparisons are most useful (e.g., a comparison of centers wasn’t helpful in answering the cancer pamphlet question, looking at the mean water usage probably wasn’t that helpful). The only change we make when we use the sample mean as the statistic is to use the t-distribution as our ‘reference distribution’ rather than the normal distribution. One new idea is a “prediction interval” = a confidence interval for an individual (e.g., future) observation. With proportions, we assumed every individual observation had the same probability, that is, no person-to-person variation. The person-to-person variation with a quantitative variable is summarized by σ (in the population) and s (in the sample). The validity conditions for the Central Limit Theorem for the normality of the sampling distribution of the sample means also changed form slightly (large n > 30 or normal population).

From Section 2.1 (Investigations 2.1, 2.2, 2.3) you should be able to:

· Anticipate the behavior of a variable based on context

· Create and interpret different types of graphs: histogram, dotplot, stemplot, boxplot

o Critique effectiveness of display (e.g., labels, bin width, hiding patterns and important features)

· Describe the behavior of a variable’s distribution from the graph: shape, center, variability, unusual observations

· Describe the shape of a distribution of a quantitative variable as symmetric, skewed to the left, or skewed to the right (This refers to the longer tail: Where is there space in the graph to write out the description? To the left or to the right?)

· Assess the normality of a distribution (e.g., overlay curve on histogram, is normal probability plot linear, does it follow 68-95-99.7 rule)

· Identify possible outlier(s) in the dataset (visually, 1.5IQR criteria) and suggest explanations based on context

· “Outliers” are individual values far from the rest of the data. Observations can behave strangely in other ways too (e.g., being the only non-integer)

· Critique justifications for removing outlier(s) from dataset

· Interpret the five-number summary and the inter-quartile range (IQR)

· Understand how skewness and/or outliers impact the relative positions of the mean and median and the values of the standard deviation and IQR

· Explore data transformations (e.g., log) to normalize a distribution

· Transform data and use a normal model to estimate a probability
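
A minimal Python sketch of two items from this list: the 1.5×IQR outlier criterion, and log-transforming skewed data before using a normal model to estimate a probability. The data values are made up, and the quartile method ("inclusive") is an assumption; different software and textbooks compute quartiles slightly differently, so fences can differ a little.

```python
import math
import statistics
from statistics import NormalDist

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]   # hypothetical, right-skewed values

# Quartiles, IQR, and the 1.5*IQR fences
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]

# Skewness to the right pulls the mean above the median
mean = statistics.mean(data)

# Log-transform, then use a normal model on the log scale
logs = [math.log(x) for x in data]
log_model = NormalDist(statistics.mean(logs), statistics.stdev(logs))
p_above_10 = 1 - log_model.cdf(math.log(10))   # estimated P(value > 10)

print(q1, median, q3, outliers, mean, round(p_above_10, 3))
```

The key step is that the cutoff (10) must be transformed the same way as the data before it is handed to the normal model.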

From Section 2.2 (Investigations 2.4, 2.5, 2.6) you should be able to:

· Continue to consider whether or not you are likely to have a representative sample

· Explain the reasoning behind simulating random samples from a finite population

o Critique assumptions made about the population

· Predict the behavior of the sampling distribution of the sample mean (mean, standard deviation, shape) and compare to the population distribution

o How/when does the shape of the population matter?

o Distinguish between (i) the population, (ii) a sample, and (iii) the sampling distribution

§ Larger sample sizes impact the shape and variation of the sampling distribution but NOT the population or sample

· Apply the Central Limit Theorem of the sample mean

o Know when it does/does not apply

o “Use technology” (R or Normal Probability Calculator) to approximate probabilities for sample means with the normal distribution

§ What values to use for mean, SD, observation; direction

§ Sketch and label the axis of the corresponding distribution and shade the probability of interest

§ Interpret provided output

· Define a population mean in context

· State appropriate null and alternative hypotheses about a population mean for a given research question

· Calculate and interpret the standard error of the sample mean, s/√n.

· Determine and interpret the “standardized” distance between x̄ and μ (the mean of sample means)

· Roughly approximate a 95% confidence interval for μ by x̄ ± 2 s/√n

· Use the t-distribution to model the behavior of the standardized statistic

o Determine the degrees of freedom (sample size minus one) and the impact of degrees of freedom on the t distribution

§ As sample size increases, df increases, t distribution approaches standard normal distribution

o Explain the difference between the normal distribution and the t distribution

§ Heavier tails

§ Why it’s helpful to use the t-distribution for inference about the population mean (when we have estimated both μ and σ)

§ Consequences of using the t-distribution on p-values, confidence intervals, coverage rate (long-run proportion of intervals capturing μ)

o Assess the validity of the t procedures (same as CLT conditions)

· Interpret a confidence interval for μ in context (include measurement units)

o If asked, interpret the confidence level (long-run coverage rate)

· Determine and interpret a prediction interval (PI) for a future observation

o With raw data, by hand (e.g., x̄ ± t* × s√(1 + 1/n), or roughly x̄ ± 2s)

o Explain the reasoning behind the SE formula for a PI

o Compare a confidence interval to a prediction interval

o Assess the validity of the prediction interval procedure
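
The sketch below, in Python with made-up numbers, illustrates two ideas from this section: using the Central Limit Theorem to approximate a probability for a sample mean, and comparing a rough 95% confidence interval for μ with a rough 95% prediction interval for one future observation. The multiplier 2 stands in for the t* critical value, and the prediction interval assumes a roughly normal population.

```python
import math
from statistics import NormalDist

# --- CLT: probability for a sample mean (hypothetical population values) ---
mu, sigma, n = 50, 12, 36
sd_of_xbar = sigma / math.sqrt(n)                 # SD of the sample mean = 2.0
sampling_dist = NormalDist(mu, sd_of_xbar)        # approximate sampling distribution
p_xbar_above_53 = 1 - sampling_dist.cdf(53)       # P(sample mean > 53), z = 1.5

# --- Confidence interval vs. prediction interval (hypothetical sample) ---
xbar, s = 52, 12
se_mean = s / math.sqrt(n)                        # SE for estimating mu
se_pred = s * math.sqrt(1 + 1 / n)                # SE for predicting one observation

ci = (xbar - 2 * se_mean, xbar + 2 * se_mean)     # rough 95% CI for mu
pi = (xbar - 2 * se_pred, xbar + 2 * se_pred)     # rough 95% PI for next observation

print(round(p_xbar_above_53, 4), ci, tuple(round(v, 2) for v in pi))
```

The prediction interval is much wider because its SE combines the uncertainty in estimating μ (the 1/n piece) with the person-to-person variation (the 1 piece), and the second piece does not shrink as n grows.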

From Chapter 3

In Ch. 3, the big picture to keep in mind is comparing a categorical response variable between two groups and that it matters a lot how those groups were formed: an observational study with independent random samples or a randomized experiment. This distinction must be considered when drawing your final conclusions (can I generalize to the larger populations, can I draw cause and effect?), and should probably also be considered when you analyze the data (are you modeling random samples from populations or random assignment). This can impact the standard errors that you use, but with large sample sizes the results won’t differ too much, and analysts tend to apply the same normal approximation in both cases.

From Section 3.1 (Investigations 3.1 and 3.2) you should be able to:

· Construct a two-way table of counts (explanatory variable as columns)

· Calculate (appropriate) conditional proportions and compare them

o Proportion of 6 ft tall men in the NBA vs. Proportion of NBA players over 6 ft vs. proportion of men that are 6 foot tall NBA players. (Hint: Follow the ‘of’)

· Create a segmented bar graph from a two-way table ~~(may use technology)~~ and describe what it reveals (e.g., do the distributions differ across the groups)

· Define the parameter in terms of the difference in population proportions

· State hypotheses in terms of the difference in population proportions

· Simulate random sampling (independent binomials) from two (large) populations under the null hypothesis

o Create a null distribution of differences in sample proportions

o Interpret graphical and numerical summaries of this distribution

o Estimate or obtain a p-value from simulation results

o Explain the simulation process (e.g., independent random samples with same probability of success)

o Interpret the p-value in context (e.g., X% of random samples…)

· Determine whether a normal approximation to the null distribution of difference in sample proportions should be valid

o Remember the simple check is that all cell counts in the table are at least 5; list the values you are looking at

o Should also consider the sizes of the (finite) populations you are sampling from and whether they are more than 20 times the sizes of the samples

o Reasoning behind the standard error formula (adding variances)

· Pooled vs. unpooled variance estimates (test statistic vs. CI)

· Calculate a z test statistic and p-value using the normal distribution

o Interpret the standard error, test statistic, and p-value in context

· Calculate and interpret a z-confidence interval for the difference in two population proportions

o Make sure the direction is clear. Go beyond saying π₁ − π₂ is in the interval, and state how much higher/lower π₁ is than π₂ (in context)

· Discuss factors that will/not affect standard error, test statistic, p-value, confidence interval, and how

o e.g., sample size, order of subtraction, size of difference in sample proportions

· Distinguish between the explanatory variable and the response variable from a study description

· Identify and explain a potential confounding variable in observational studies

o Be sure to explain how there could be a differential effect by the confounding variable on the response variable between the explanatory variable groups (Make sure it’s an alternative explanation for the observed difference between groups separate from the explanatory variable, not just another variable or a feature that applies equally to both groups).
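
To make the pooled vs. unpooled distinction concrete, here is a minimal Python sketch of the two-sample z-procedures using made-up counts (40 of 100 successes in group 1, 25 of 100 in group 2). The function name is hypothetical. The test statistic uses the pooled proportion in the SE (because the null hypothesis says the two probabilities are equal), while the confidence interval uses the unpooled SE.

```python
import math
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2, conf_level=0.95):
    """Two-sample z-test (pooled SE) and z-interval (unpooled SE) for p1 - p2."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2                                   # order of subtraction: group 1 - group 2

    # Test statistic: pooled SE, assuming equal probabilities under H0
    pooled = (x1 + x2) / (n1 + n2)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = diff / se_pooled
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))

    # Confidence interval: unpooled SE (variances add)
    se_unpooled = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_star = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    ci = (diff - z_star * se_unpooled, diff + z_star * se_unpooled)
    return z, p_two_sided, ci

z, p_value, ci = two_proportion_z(40, 100, 25, 100)
print(round(z, 3), round(p_value, 4), tuple(round(v, 3) for v in ci))
```

Reversing the order of subtraction flips the sign of z and of both interval endpoints, but leaves the two-sided p-value and the width of the interval unchanged.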

From Section 3.2 (Investigations 3.3, 3.4) you should be able to:

· Distinguish between an observational study and an experimental study

o Be able to justify which type of study you have

o Be able to critique advantages and disadvantages of different study designs in different contexts

· Discuss the advantages of using a placebo treatment

· Discuss the advantages to blinding and double-blinding in a study

· Discuss the purposes/goals/merits of “randomization” (aka random assignment to treatment groups)

· Identify when we are allowed to draw cause-and-effect conclusions (perhaps just about the experimental units in the study)

· Interpret and critique a description of a research study (e.g., Inv 3.4)

· Discuss some of the limitations in the type of conclusions that can be drawn from different designs

o Do not draw cause-and-effect conclusions from an observational study no matter how small the p-value

· Can still decide whether there is evidence of an association, measure how strong the association is

· Identify and justify the appropriate “scope of conclusions” (generalizability, causation) from the study design

o Best Table in the book!

From Section 3.3 (Investigations 3.5, 3.6, 3.7) you should be able to:

· Define the parameter in terms of the difference in (long-run) treatment probabilities

· Simulate random assignment under the null hypothesis, create a null (or randomization) distribution of the difference in two sample proportions

o Explain the reasoning behind randomization test (fixing number of successes and failures models “no effect” from treatment group assignment)

o Carry out and interpret the results from a randomization simulation for a two-way table (e.g., shuffling index cards, including how many cards and how many of each color, how many deal out to each group)

o Use the Analyzing Two-way Tables applet

o Understand the equivalence of using the number of successes in group A, difference in group proportions, relative risk, and odds ratio as the statistic in this simulation

o Including how to approximate the (one or two-sided) p-value based on the simulation results

§ Can double one-sided p-value if distribution is symmetric, use method of small p-values otherwise

o Interpret the p-value in context (e.g., X% of random shuffles…)

· Calculate the exact (one or two-sided) p-value using the hypergeometric distribution (aka Fisher’s Exact Test)

o Including showing set up by hand and with technology

o Including writing out the probability statement P(X > …. ) and the input values of the hypergeometric (N, M, n)

o “Using technology” to carry out the full FET p-value calculation

· Approximate (and interpret) the p-value and confidence interval for π₁ − π₂ using the normal distribution (two-sample z-procedures)

o Decide whether the z-procedure is valid (just worry about cell counts, not population size)

· Consider continuity correction for p-value (half-way to next possible statistic outcome)

· Consider Wilson adjustment/Plus Four adjustment (adding 1 to each cell in the table) as an improvement for the confidence interval
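
The randomization test and Fisher's Exact Test from this section can be sketched in Python as follows. The table is made up (10 of 12 successes in group A, 4 of 12 in group B), the function names are hypothetical, and the reading of (N, M, n) as total units, total successes, and group A size is an assumption about the notation. The statistic is the number of successes in group A, and the p-value is one-sided (upper tail).

```python
import math
import random

def hypergeom_pmf(k, N, M, n):
    """P(X = k): k successes land in group A (size n) when N units include M successes."""
    return math.comb(M, k) * math.comb(N - M, n - k) / math.comb(N, n)

def fisher_upper_p(k_obs, N, M, n):
    """Exact one-sided p-value, P(X >= k_obs), from the hypergeometric distribution."""
    return sum(hypergeom_pmf(k, N, M, n) for k in range(k_obs, min(M, n) + 1))

def randomization_upper_p(k_obs, N, M, n, reps=10000, seed=1):
    """Simulated p-value: shuffle M success cards and N - M failure cards, deal n to group A."""
    rng = random.Random(seed)
    cards = [1] * M + [0] * (N - M)        # successes and failures stay fixed ("no effect")
    as_extreme = 0
    for _ in range(reps):
        rng.shuffle(cards)                 # one random assignment
        if sum(cards[:n]) >= k_obs:        # successes dealt to group A
            as_extreme += 1
    return as_extreme / reps

# Hypothetical study: 24 units, 14 successes overall, 12 assigned to group A, 10 successes observed in A
N, M, n, k_obs = 24, 14, 12, 10
exact_p = fisher_upper_p(k_obs, N, M, n)
sim_p = randomization_upper_p(k_obs, N, M, n)
print(round(exact_p, 4), sim_p)
```

The simulated p-value should land close to the exact one; the simulation simply estimates the hypergeometric probabilities by repeated shuffling.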

From Section 3.4 (Investigations 3.8, 3.9, 3.10) you should be able to:

· Calculate and interpret relative risk as an alternative measure of association between two binary variables

o Remember that the difference in proportions does not take into account the magnitude of the baseline risk

§ Small differences in proportions “seem” much larger when the baseline risk is small

· Simulate a null distribution (using random sampling and/or random assignment) under the null hypothesis and interpret the results using the relative risk as the statistic

o Create a null distribution of relative risk

o Including how to approximate the p-value based on the simulation results

· Determine (by hand and with applet) and interpret a confidence interval for the ratio of treatment probabilities using the normal distribution

o Including how and why we “transformed” the statistic to log relative risk

o Calculate and interpret the standard error of the transformed statistic

· Back transform and interpret the confidence interval in context

o Make sure direction is clear

· Distinguish between cohort, case-control, and cross-classified designs of an observational study and how the design affects which numerical summaries you can reasonably interpret

o Cohort: sample based on EV; Case-control: sample based on RV; Cross-classified: Sample and ask two questions

o Don’t use relative risk or difference in proportions with case-control studies

o It is always ok to calculate odds ratio

· Calculate and interpret odds ratio as an alternative measure of the association between two binary variables

o How to decide which calculation is being asked for in the context of the problem (how define success, group A)

o How to interpret the results of the calculations

· Interpret a confidence interval for the population odds ratio in context

· Make sure direction is clear
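
A minimal Python sketch of the relative risk and odds ratio calculations, including the log transformation and back-transformation for the confidence intervals. The counts are made up (30 of 100 successes in group 1, 15 of 100 in group 2), the function names are hypothetical, and 1.96 is used as the 95% critical value.

```python
import math

def relative_risk_ci(x1, n1, x2, n2, z_star=1.96):
    """Relative risk (group 1 / group 2) with a CI built on the log scale."""
    rr = (x1 / n1) / (x2 / n2)
    se_log_rr = math.sqrt(1 / x1 - 1 / n1 + 1 / x2 - 1 / n2)   # SE of ln(relative risk)
    log_lo = math.log(rr) - z_star * se_log_rr
    log_hi = math.log(rr) + z_star * se_log_rr
    return rr, (math.exp(log_lo), math.exp(log_hi))            # back-transform endpoints

def odds_ratio_ci(a, b, c, d, z_star=1.96):
    """Odds ratio for a 2x2 table: a, b = successes, failures in group 1; c, d = group 2."""
    odds_ratio = (a / b) / (c / d)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)       # SE of ln(odds ratio)
    log_lo = math.log(odds_ratio) - z_star * se_log_or
    log_hi = math.log(odds_ratio) + z_star * se_log_or
    return odds_ratio, (math.exp(log_lo), math.exp(log_hi))

rr, rr_ci = relative_risk_ci(30, 100, 15, 100)
oratio, or_ci = odds_ratio_ci(30, 70, 15, 85)
print(round(rr, 3), tuple(round(v, 3) for v in rr_ci))
print(round(oratio, 3), tuple(round(v, 3) for v in or_ci))
```

Because the intervals are built on the log scale, they are not symmetric around the observed ratio after back-transforming. The value to check for "no association" is 1, not 0.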

Coding principles you should be able to interpret/explain/write pseudo-code

· Subsetting data

· Recoding a categorical variable

· Splitting the graph by an explanatory variable

· Create simulations to replicate random sampling and/or random assignment

· You should also be able to interpret generic computer output for the procedures we have learned (one-sample t-procedures, two-sample z-procedures)
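
The coding principles above might look like the following in Python (the course uses R; this is only an illustrative sketch of the same logic). The records, variable names, and function name are all hypothetical. Grouped conditional proportions stand in for "splitting the graph by an explanatory variable," and the simulation models random sampling from two populations with the same probability of success.

```python
import random
import statistics

records = [   # hypothetical data
    {"age": 18, "year": "first-year", "exercises": "yes"},
    {"age": 19, "year": "sophomore", "exercises": "no"},
    {"age": 20, "year": "junior", "exercises": "yes"},
    {"age": 21, "year": "senior", "exercises": "yes"},
    {"age": 22, "year": "senior", "exercises": "no"},
    {"age": 20, "year": "junior", "exercises": "yes"},
]

# Subsetting: keep only the rows meeting a condition
subset = [r for r in records if r["age"] >= 20]

# Recoding: collapse four class years into two categories
recoded = [dict(r, division="upper" if r["year"] in ("junior", "senior") else "lower")
           for r in records]

# Splitting by an explanatory variable: conditional proportion of "yes" in each group
prop_yes = {}
for group in ("lower", "upper"):
    rows = [r for r in recoded if r["division"] == group]
    prop_yes[group] = sum(r["exercises"] == "yes" for r in rows) / len(rows)

def simulate_null_differences(n1, n2, common_prob, reps=2000, seed=301):
    """Independent binomial samples with the SAME success probability (null hypothesis)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(reps):
        x1 = sum(rng.random() < common_prob for _ in range(n1))
        x2 = sum(rng.random() < common_prob for _ in range(n2))
        diffs.append(x1 / n1 - x2 / n2)
    return diffs

diffs = simulate_null_differences(100, 100, 0.325)
observed_diff = 0.15
# One-sided simulated p-value (small tolerance guards against floating-point ties)
p_sim = sum(d >= observed_diff - 1e-12 for d in diffs) / len(diffs)
print(len(subset), prop_yes, round(statistics.mean(diffs), 4), p_sim)
```

To model random assignment instead of random sampling, the simulation would shuffle a fixed set of successes and failures between the two groups rather than generating two independent binomial samples.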

What you should be able to do with the calculator

· Calculate conditional proportions, relative risk, odds ratio

· Approximate 95% confidence intervals for population mean, next observation

· Calculate confidence intervals for relative risk, odds ratio

Things you need to remember from Exam 1

· Defining observational units, variables, and parameters in context

· How to interpret probability as a long-run relative frequency

· Showing your work/explaining how you would use the computer

· Explaining your simulation process

· One-sided vs. two-sided alternatives

· The reasoning of statistical significance and what a p-value measures

· Making conclusions based on the size of the p-value (remember to provide “linkage” between the size of your p-value and the decision to reject/fail to reject H₀)

· Interpreting confidence intervals and confidence levels

· “Duality” between confidence intervals and tests of significance (two-sided p-values)

· ~~The concept of power and factors that affect power~~

· Comparing “theoretical,” “exact,” and “simulation” results

· Margin-of-error measures sample to sample variation (due to random sampling) but does not account for any “nonsampling errors” (e.g., poorly worded questions)

· Confidence level refers to the reliability of the method – how often, in the long run, random samples (or random shuffles) will produce an interval that succeeds in capturing the population parameter
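
The last point, that the confidence level describes the long-run reliability of the method, can be checked by simulation. The Python sketch below repeatedly draws random samples from a hypothetical normal population (mean 50, SD 10, n = 25) and records how often the one-sample t-interval captures the population mean. The critical value 2.064 is the 95% t* for 24 degrees of freedom, hard-coded here as an assumption tied to n = 25.

```python
import math
import random
import statistics

def coverage_rate(mu=50.0, sigma=10.0, n=25, t_star=2.064, reps=2000, seed=301):
    """Long-run proportion of t-intervals (xbar +/- t* s/sqrt(n)) that capture mu."""
    rng = random.Random(seed)
    captured = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]    # one random sample
        xbar = statistics.mean(sample)
        se = statistics.stdev(sample) / math.sqrt(n)         # SE of the sample mean
        if xbar - t_star * se <= mu <= xbar + t_star * se:
            captured += 1
    return captured / reps

rate = coverage_rate()
print(rate)   # should be close to 0.95
```

Any single interval either captures μ or it does not; the 95% describes the procedure across many random samples, not one particular interval.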

Keep in mind

· When to talk in terms of population means, μ, and when to talk in terms of probabilities, π

· When comparing distributions, remember to cite your evidence if you think there is a difference in the groups. In particular, tell me what you see in the summary statistics (e.g., a higher proportion) that leads to your conclusion (e.g., abstainers more likely to develop peanut allergy than consumers)

· Remember to think about the direction of subtraction used by the technology

· We can use a one-sample t-procedure even when the sample sizes are small if we have reason to believe the population distribution is normally distributed. You can try to judge this, especially if you don’t have past experience with the variable, based on graphs of the sample data. If the sample data looks reasonably normally distributed (normal probability plots are a useful tool for helping this judgment), you can cite this as evidence that the population distribution is normally distributed. If you aren’t sure, then use an alternative analysis instead (e.g., data transformation, simulation like bootstrapping).

· Keep in mind the one-sample t-procedure only tells you about the population mean (vs. other aspects of the distributions)

· Always put your conclusions in the context of the research study

· Including considering “practical significance” (statistical significance = could the difference have happened by random chance alone; practical significance = is the difference considered meaningful in the context of the variable, e.g., is a 0.5 °F difference likely to matter)

· Try to avoid the word “accurate” without explaining exactly what you mean by it.

· Always try to say the distribution of what

· Try to avoid use of the word “group” but clarify if you mean the sample or the population or the long-run treatment

· Avoid use of the word “it”

Also keep in mind:

· Part of your grade will be based on communication. Be precise in your statements and use of terminology. Avoid unclear statements, and especially don’t use the word “it”! Always relate your comments to the study context.

· Show the details of any of your calculations.

· Organize your notes ahead of time, and don’t plan to rely on your notes too much.

· Be able to both make conclusions from a p-value and provide a detailed interpretation of what the p-value measures in context

o Improving interpretations of p-values:

§ Random chance = random sampling or random assignment

§ Alone = null hypothesis is true

§ Observed result = cite value of statistic from study

§ Or more extreme = give direction(s) based on H_a

· Keep in mind that “statistical significance” is an adjective of the sample data or the statistic, NOT the population parameter

· You should continue to focus on the overall statistical process from collecting the data, to looking at the data, to analyzing the data, to interpreting results

· When stating final conclusions, cite the specific evidence (e.g., it is/is not statistically significant because my p-value of XXX is small/not small; when interpreting the p-value insert the specific observed statistic value, direction etc.)

· Simulation-based vs. Exact vs. Theory-based (normal) procedures

· Think big picture and be able to apply your knowledge to new situations

Some additional Lessons from HW

HW 4

· Subsetting data and possible consequences on scope of conclusions

· Utility of a “data dictionary”/making sure results make sense in context/common explanations for unusual results (e.g., coding of missing values)

· Relationship and slightly different interpretations between mean and median

· Applying a data transformation to produce a more normal looking distribution

· Sample vs. Population vs. Sampling Distribution and what impacts the behavior of each

HW 5

· Be specific to the study context vs. generic statements (e.g., to guard against carry-over effects vs. to avoid bias)

· (Lack of) role of population shape when sample size is large

· Be able to state hypotheses both in symbols and in words. Make sure you define the symbols. Make sure you clarify what number is being tested.

o I find the phrase “true value” unclear, and instead would talk in terms of “population mean” or “long-run mean”– keeping in mind that all a test of significance can make conclusions about is the mean, not individual observations.

· If asked to “estimate a parameter” – use a confidence interval, not just the sample statistic

o If you are interpreting an interval for a difference be very clear what you think is larger than what

· Remember the validity conditions for using t-procedures are “either or”

o And a prediction interval requires normality, large sample size doesn’t solve that requirement

· Be able to distinguish (in interpretation, in identification) differences between a confidence interval for a mean and a prediction interval

· Bootstrapping is an alternative approach for estimating sample to sample variation in any statistic

o Goal of our simulation/theory-based methods is to estimate the “chance variation” to help us determine how far our statistic could plausibly be from the parameter of interest

· ~~‘Effect sizes’ are often used as a measure of “practical” rather than statistical significance~~

HW 6

· Justify “experiment” by considering “active imposition of explanatory variable”

· Using the research context to formulate a one or two-sided alternative hypothesis

· Matching the simulation model to how randomness was used in the study design (and why it might matter)

o Interpreting the p-value accordingly (by this point in the course, your interpretation should now be beyond “by chance alone” – you need to explain the source of randomness and the assumptions of the null hypothesis)

· Justifying conclusions

o Significance: p-value

o Estimation: confidence interval

o Causation: randomized experiment (and significant)?

o Generalizability: random sample?

Applets

I will assume you are familiar with the output/functionality of these applets for Ch. 2 & 3.

· Descriptive Statistics (mean, median, SD, IQR for quantitative data, possibly across groups, boxplots, histograms, dotplots, normal probability plot, time plot)

· Sampling from a Finite Population (can input a large population of values to random sample from, view individual samples, generate a sampling distribution for mean, median, t-statistic)

· Normal Probability Calculator (can specify variable, mean, SD, and region of interest to find probability or probability to find region or z-score to find probability and region)

· Simulating Confidence Intervals (can explore behavior of different interval procedures to see long-run coverage rate)

· t Probability Calculator (can specify df and region of interest to find probability or probability to find t* (critical value))

· Theory-Based Inference (can conduct one and two-sample tests and confidence intervals for proportion and means)

· Comparing Two Population Proportions (can simulate independent random samples from binomial processes, examine individual sampling distributions and sampling distribution of difference in proportions)

· Randomizing Subjects (can explore “balance” created between groups with random assignment) – see Investigation 3.3

· Analyzing Two-way Tables (given 2x2 two-way table, can find simulation-based p-value, Fisher’s Exact Test, normal approximation, 95% confidence intervals for difference in probabilities, relative risk, and odds ratio).