Stat 301 – Review 2
Problems Solutions
1) Weights of 30 (fun-size) Mounds candy bars and 20 (fun-size) PayDay candy bars, in grams, are
shown in the dotplots below.
(a) Which
distribution would you consider skewed to the right?
The
Mounds distribution is a bit skewed to the right and the PayDay distribution is
strongly skewed to the left.
(b) Which
distribution do you expect has a larger mean?
The
PayDay distribution is clearly centered around larger values than the Mounds
distribution.
(c) Which
distribution do you expect has a larger standard deviation?
The
PayDay values are more spread out/less consistent than the Mounds distribution.
In
other words, the Mounds weights are more consistent, but occasionally a few weigh more. The
PayDay distribution is less predictable and often has weights that are much
lower than typical, perhaps the difference of one or two peanuts?
(d) Which
distribution would you suspect will have its mean larger than its median?
Mounds
because it is skewed to the right
2) The highway miles per gallon rating of the 1999 Volkswagen
Passat was 31 mpg (Consumer Reports, 1999). The fuel efficiency that a
driver obtains on an individual tank of gasoline naturally varies from tankful
to tankful. Suppose the mpg calculations per tank of gas have a mean of μ = 31 mpg and a standard deviation of σ = 3 mpg.
(a) Would it be
surprising to obtain 30.4 mpg on one tank of gas? Explain.
Not
really, 30.4 is well within one standard deviation of the “population” mean of
31.
z =
(30.4 – 31)/3 = -0.20
(b) Would it be surprising for a sample of 30
tanks of gas to produce a sample mean of 30.4 mpg or less? Explain, referring
to the CLT and to a sketch that you draw of the sampling distribution.
First,
does the CLT apply here? We don’t know
much about the shape of the population distribution, though it’s reasonable to
assume the mileage from different tanks will be symmetric and roughly normal. But we also don’t care too much because our
sample size of 30 is considered large.
We are also assuming these observations are taken under identical
conditions.
So we will model the distribution of the averages of 30 tanks to be normally
distributed with mean equal to 31 mpg and standard deviation equal to 3/sqrt(30) = 0.5477 mpg.
So a sample mean of 30.4 mpg has z = (30.4 – 31)/0.5477 = -1.095, that is, it falls 1.095
standard deviations below the mean. This
is still less than 2 standard deviations from the mean.
Using the normal distribution, P(x̄ < 30.4) ≈ 0.137.
About 13.7% of random samples of 30 tanks will have an average mileage of 30.4 mpg or less by
random chance alone. We would probably
not consider this a surprising outcome (happens more than 10% of the time).
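The arithmetic in (a) and (b) can be checked with a short sketch. It assumes the normal model described above (mean 31, SD 3) and uses Python's standard-library NormalDist; the variable names are just for illustration.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 31.0, 3.0, 30   # values given in the problem
xbar = 30.4                    # observed tank value / sample mean

# (a) standard score for a single tank of gas
z_single = (xbar - mu) / sigma

# (b) CLT: the mean of n tanks has standard deviation sigma / sqrt(n)
se = sigma / sqrt(n)
z_mean = (xbar - mu) / se

# Probability of a sample mean of 30.4 mpg or less under the normal model
p_mean = NormalDist(mu, se).cdf(xbar)

print(f"z for one tank:         {z_single:.2f}")
print(f"SD of the sample mean:  {se:.4f}")
print(f"z for the sample mean:  {z_mean:.3f}")
print(f"P(sample mean <= 30.4): {p_mean:.3f}")
```

The same 0.6 mpg shortfall is 0.2 standard deviations for one tank but about 1.1 standard deviations for the mean of 30 tanks, because the sample mean varies less than individual tanks do.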
(c) Assess the
validity of your calculations in (a) and (b)
It’s
always reasonable to calculate a “standard score” as I did in (a). If I wanted to convert this z-value to a
probability, then I would need to know that the tank MPGs follow a normal
distribution. We aren’t told that here though it seems a reasonable assumption.
As stated in (b), we can use the CLT if
we continue to believe the MPG values are normally distributed in general,
or if the population distribution is not too badly behaved, because then the sample size of 30 tells us
that the distribution of sample means should still be approximately
normal.
If
you go to the Sampling from a Finite Population applet and
check the box for Population Model, you can simulate drawing random samples
from a probability distribution rather than a finite population. When the
probability distribution is a normal distribution, everything works very well:
If
the theoretical probability distribution is not normal but symmetric, things
still work pretty well.
If
the theoretical probability distribution is not normal to begin with, things
still work pretty well due to the “large” sample size
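If you don't have the applet handy, the same idea can be sketched in a few lines. This is only an illustration, not the applet's method: the skewed population used here (28 plus an exponential with mean 3, so that the mean is 31 and the SD is 3) is my own assumption, chosen to match the problem's mean and standard deviation.

```python
import random
from statistics import mean, stdev

random.seed(1)  # fixed seed so the sketch is reproducible

def one_tank():
    # Hypothetical right-skewed population (an assumption for illustration):
    # 28 + exponential with mean 3, so the mean is 31 and the SD is 3.
    return 28 + random.expovariate(1 / 3)

n, reps = 30, 5000
sample_means = [mean(one_tank() for _ in range(n)) for _ in range(reps)]

sim_mean = mean(sample_means)
sim_sd = stdev(sample_means)
frac_low = sum(m <= 30.4 for m in sample_means) / reps

print(f"mean of sample means: {sim_mean:.3f}  (theory: 31)")
print(f"SD of sample means:   {sim_sd:.3f}  (theory: 3/sqrt(30) = 0.548)")
print(f"fraction <= 30.4:     {frac_low:.3f}  (normal model: about 0.137)")
```

Even with a clearly skewed population, the mean and SD of the simulated sample means should land close to the theoretical values, and the tail proportion should be in the same ballpark as the normal calculation.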
3)
The file AgeGuesses.txt
contains students’ guesses of my age on the first day of class a few years ago.
(a) Estimate and interpret a 95%
confidence interval for the population mean.
Confidence interval for the population mean:
x̄ ± t* (s/sqrt(n)) = 48.43 ± t* (10.89/sqrt(30))
For 95% confidence, we estimate t* to be around 2.
48.43 ± 2(1.99) = (44.4, 52.4) years
I’m 95% confident that the
average guess of my age in the population of all Cal Poly students on such an
activity would be between 44.4 and 52.4 years
(More precisely, t* = 2.045)
x̄ ± t* (s/sqrt(n)) = 48.43 ± 2.045 (10.89/sqrt(30))
= (44.36, 52.50). A little bit wider.
On an exam without the computer, for 95% confidence you can use 2
for either z* or t*. That’s why I said “estimate”
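Here is a small sketch of the interval for the mean, using only the summary statistics quoted above (mean 48.43, SD 10.89, n = 30). The multiplier 2.045 is taken from the solution rather than computed, since Python's standard library has no t-quantile function; the function name is my own.

```python
from math import sqrt

xbar, s, n = 48.43, 10.89, 30   # summary statistics quoted above

def mean_ci(t_star):
    """Return (lower, upper) for xbar +/- t* * s / sqrt(n)."""
    margin = t_star * s / sqrt(n)
    return xbar - margin, xbar + margin

rough = mean_ci(2.0)      # the "by hand" multiplier
precise = mean_ci(2.045)  # t* for 29 df, as quoted in the solution

print("t* = 2:     (%.2f, %.2f)" % rough)
print("t* = 2.045: (%.2f, %.2f)" % precise)
```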
(b) Estimate and interpret a 95%
confidence interval for the next student’s guess of my age.
x̄ ± t* s sqrt(1 + 1/n) = 48.43 ± 2 (10.89 × sqrt(1+1/30))
= (26.29, 70.6)
I’m 95% confident that any one
Cal Poly student would guess my age between 26.3 and 70.6 years.
(More precisely)
x̄ ± t* s sqrt(1 + 1/n) = 48.43 ± 2.045 (10.89 × sqrt(1+1/30))
= (25.79, 71.07)
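The prediction interval can be sketched the same way as the confidence interval. Again the summary statistics and t* = 2.045 come from the solution, and the function name is just for illustration. The comparison at the end shows how much wider the prediction interval is than the interval for the mean.

```python
from math import sqrt

xbar, s, n = 48.43, 10.89, 30   # summary statistics quoted above
t_star = 2.045                  # t* for 29 df, as quoted in the solution

def prediction_interval(t):
    """Return (lower, upper) for xbar +/- t * s * sqrt(1 + 1/n)."""
    margin = t * s * sqrt(1 + 1 / n)
    return xbar - margin, xbar + margin

pi_low, pi_high = prediction_interval(t_star)
pi_margin = (pi_high - pi_low) / 2
ci_margin = t_star * s / sqrt(n)   # margin for the mean, for comparison

print(f"95% prediction interval: ({pi_low:.2f}, {pi_high:.2f})")
print(f"PI margin / CI margin:   {pi_margin / ci_margin:.2f}")
```

The ratio of the two margins is sqrt(n + 1), so with n = 30 the prediction interval is more than five times as wide.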
(c) Which interval do you feel is more
meaningful in this context?
Opinions will vary, the
prediction interval is quite wide due to the huge amount of variation in the
responses given to this question. Typically a prediction interval is more meaningful (what
will happen next, vs. what is the long-run mean), but because it’s so wide this
one is not very informative, basically saying I went to graduate school but I’m
still alive!
(d) What information would you need to
know to decide whether students are “biased” in how
they guess my age? If you did a test of
significance, would this be a one-sided or a two-sided test?
You would need to know my
actual age, then we could see whether the sample mean fell above that
(overestimating my age on average) or below that (underestimating my age on
average).
(e) Evaluate the validity of your
calculations in (a) and (b).
The distribution is pretty symmetric and the sample size is 30 so the confidence
interval in (a) is probably ok (achieves the stated 95% confidence in the long
run), but with the outliers on both sides, the sample distribution of age
guesses has heavier tails than we might expect for a normally distributed
population. If we believe these long tails exist in the population, then this
would cast some doubt as to the validity of the prediction interval (though
again, at least the distribution is symmetric, but there may be less than 95%
of the population distribution falling within 2 standard deviations of the
mean, or more if the population standard deviation is inflated by such
outliers). The nonlinear nature of the
normal probability plot suggests these data are not coming from a normally
distributed population.
(f) Column 2 indicates whether the data
were collected in Section 1 or Section 2.
I changed something about my appearance between the two sections.
Suppose I find a statistically significant difference in the average guess of
my age between the two classes, flipping a coin in advance to decide which
appearance I would use in each section. Would you be willing to attribute the
change in the ages to the change I made in my appearance? Explain why or why
not.
While I did randomly assign
the two treatments in a sense, I did so at the class level rather than at the
individual student level. So there could still be a confounding variable between the
two sections (e.g., I looked more tired later in the day) and we should not
draw any cause-and-effect conclusion here. (Actually
the average guess was 10 years larger in section 1!)
4)
In a recent study (Klein, Thomas, and Sutter, 2007), researchers found that
current smokers were more likely to have used candy cigarettes as children than
current nonsmokers were.
(a) Identify and classify the
explanatory and response variables.
EV = whether used candy cigarettes as child
RV = whether or not current smoker
(b) When first hearing of this study,
someone responded by saying, “Isn’t the smoking status of the parents a
confounding variable here?”
Explain
what “confounding variable” means in this context, and
describe how parents’ smoking status could be confounding (i.e., describe what
would need to be true).
It would be a confounding variable if it provides an alternative
explanation for the observed association. To do this, it must differ between
the explanatory variable groups and potentially impact the response
variable. So if
those with smoking parents are more likely to be allowed to play with candy
cigarettes as children but also more likely to smoke due to the environment
they were raised and/or genetics, then the smoking habits of the parents might
better predict who is a later smoker, but would also explain why current
smokers are more likely to have played with candy cigarettes.
5) Newspaper headlines proclaimed
that chocolate lovers live longer, following the publication of a study titled
“Life is Sweet: Candy Consumption and Longevity” in the British Medical
Journal (Lee and Paffenbarger, 1998). In 1988, researchers sent a health
questionnaire to men who entered Harvard University as undergraduates between
1916 and 1950. The study included 7841 men, free of cardiovascular disease and
cancer. From the questionnaire they determined whether the respondents consumed
candy “almost never” (3312 men) or “sometimes or often” (4529 men), and then
they tracked the participants to determine whether or not they had died by
1993.
(a) Identify
the observational units.
men
(b) Identify the
response variable.
Whether
or not the person had died by 1993.
(c) Identify
the explanatory variable.
Whether
the person was classified a candy consumer (sometimes or often) or not a candy
consumer (almost never)
(d) Was this an
experiment or an observational study? If an experiment, was it a randomized,
comparative experiment? If observational, was it a cohort, cross-classified, or
case-control study?
This was an observational study
because the candy-consumption levels were not imposed on the men in the study; the men chose for
themselves. This is probably best
classified as a cohort study because the men were identified, their candy
consumption determined, and then followed for 5 years to determine the outcome
for the response variable. This means it’s legitimate for us to use these data to
estimate the probability of still being alive.
(e) Researchers
found that of respondents who admitted to consuming candy regularly, 267 had
died by the end of 1993, compared to 247 of the non-consumers of candy. Set up
the calculation for Fisher’s Exact Test for deciding whether candy consumers
are significantly less likely to have died than non-consumers by completing the
following:
Note:
The conditional proportions of death are 267/4529 = .05895 and 247/3312 =
.07458
Best
bet is to set up the two-way table:
| | candy consumer | non-consumer | Total |
| still alive | 4262 | 3065 | 7327 |
| Died | 267 | 247 | 514 |
| Total | 4529 | 3312 | 7841 |
If we
let X represent the number still alive in the candy consumer group, then we
want the probability of values at or above the observed X (even more survivors in the candy consumer group):
p-value = P(X ≥ 4262), where X follows a hypergeometric distribution
with parameters N = 7841, M = 7327, n = 4529.
We
can also look at the number of deaths in the candy consumer group, which we expect
(in the long run) to be less than the number of deaths in the non-consumer
group. In this case, p-value = P(X ≤ 267),
where X follows a hypergeometric distribution with parameters N = 7841, M = 514, and n = 4529.
(There
are other correct set ups as well.)
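As a rough check that the two set-ups agree, here is a sketch that evaluates the hypergeometric tail probabilities directly with exact integer arithmetic. It includes the observed value in each tail (at least 4262 survivors, or at most 267 deaths, among the candy consumers). The helper name is hypothetical, and this is a direct calculation rather than what any particular applet does.

```python
from math import comb

def hypergeom_prob(lo, hi, N, M, n):
    """P(lo <= X <= hi) when n units are drawn from N, of which M are 'successes'."""
    ways = sum(comb(M, k) * comb(N - M, n - k) for k in range(lo, hi + 1))
    return ways / comb(N, n)   # exact integers, converted to a float at the end

# Set-up 1: X = survivors in the candy-consumer group (at least as many as observed)
p_alive = hypergeom_prob(4262, 4529, N=7841, M=7327, n=4529)

# Set-up 2: X = deaths in the candy-consumer group (at most as many as observed)
p_deaths = hypergeom_prob(0, 267, N=7841, M=514, n=4529)

print(f"P(X >= 4262 survivors) = {p_alive:.5f}")
print(f"P(X <= 267 deaths)     = {p_deaths:.5f}")
```

Both calls add up exactly the same tables, so the two p-values match.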
(f) Suppose you
wanted to carry out a simulation to determine how surprising it is for two
random samples from the same population to give a difference in sample proportions
at least this large. Describe the
simulation process (if describing an applet, name the applet and the input
information you would use).
We
would randomly sample 4529 men and 3312 men, each from a population with a
probability of success of 7327/7841 = 0.934
Then
we would count how many samples have a difference as extreme (one-sided
alternative hypothesis) as 0.05895 – 0.07458 = -0.0156.
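The process described above could be sketched as follows. This is a minimal illustration rather than the applet itself: the number of repetitions (1000) and the seed are my own choices, and each simulated man is a single Bernoulli draw.

```python
import random

random.seed(2)  # fixed seed so the sketch is reproducible

n_candy, n_none = 4529, 3312
p_alive = 7327 / 7841                     # common "success" probability under the null
observed_diff = 267 / 4529 - 247 / 3312   # difference in death proportions

def death_proportion(n):
    """Simulate n men from the common population; return the proportion who died."""
    deaths = sum(random.random() > p_alive for _ in range(n))
    return deaths / n

reps = 1000
diffs = [death_proportion(n_candy) - death_proportion(n_none) for _ in range(reps)]

# One-sided: how often is the simulated difference at least as negative as observed?
p_value = sum(d <= observed_diff for d in diffs) / reps

print(f"observed difference: {observed_diff:.4f}")
print(f"approximate p-value: {p_value:.3f}")
```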
For
fun:
With
such large sample sizes, this is extremely statistically significant.
Also notice
that even though random sampling is probably a better model here, the FET
p-value is quite similar as well.
(g) The study
reported: Between 1988 and 1993, 514 men died: 7.5% of
non-consumers, but only 5.9% of consumers (age adjusted relative risk 0.83; 95%
confidence interval 0.70 to 0.98). Interpret
this statement as if to someone who has never taken a statistics class. In particular, what
do you think is meant by “age adjusted relative risk”?
This
interval provides an assessment for how much less likely a candy consumer is to
die in this time frame than a non-consumer. The values in the interval are all
less than one, so if we knew the death rate of non-consumers, we would multiply
by .70 to .98 to find the death rate for those who eat candy.
“Age
adjusted relative risk” essentially looks at the relative risks in different
age groups (so only comparing men of similar ages) and then roughly averages
across those values to get an age-adjusted relative risk. This helps ensure we
have “controlled” for age since we couldn’t do random assignment.
(h) Based on
this interval, I would consider the comparison statistically significant. Why?
Yes,
because 1 is not inside this 95% confidence interval, we know the two-sided
p-value is less than .05.
(i) This does
not appear to be a large difference (7.5% vs. 5.9%), are you surprised that
this result is statistically significant? Explain.
1. No
because the relative risk takes the magnitudes of the values into account: 1.6
percentage points may not be a lot but it’s a decent fraction of 5.9%.
2.
The sample sizes are pretty large so even a weak
association will probably end up being “statistically significant.”
(j) The study
also reports: We then examined different levels of candy
intake. Compared with non-consumers, the relative risks of mortality among men
who consumed candy 1-3 times a month (1704 men), 1-2 times a week (1589 men),
and 3 or more times a week (1236 men) were 0.64 (0.48 to 0.86), 0.73 (0.55 to
0.96), and 0.84 (0.64 to 1.11), respectively.
Does this
result provide evidence of a “dose-response”? Explain.
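Yes,
in the sense that the relative risk changes steadily with the amount of candy: the relative
risk of mortality (compared with non-consumers) rises from 0.64 to 0.73 to 0.84 as candy intake increases.
Note the direction, though: the lightest consumers show the largest reduction in risk, and the
interval for the heaviest consumers (0.64 to 1.11) includes 1.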
(k) And then: Finally, using life table analysis
truncated at age 95, we estimated that (after adjustment for age and cigarette
smoking) candy consumers enjoyed, on average, 0.92 (0.04 to 1.80) added years
of life, up to age 95, compared with non-consumers.
Based on these
results, are you willing to conclude that eating candy leads to a longer life?
No,
this was not a randomized comparative experiment, so we can’t draw any
cause-and-effect conclusions.
A
possible confounding variable is “happiness” – those who are happy and relaxed
and not worried about what they eat are more likely to consume candy than those
who are stressed and worried and watching their diet closely. But that happier lifestyle may also be
responsible for longer lives.
(l) What population
are you willing to generalize these results to? Explain.
At
most well-off males (graduates from Harvard), but even that is risky as this
study did not involve random sampling. It’s possible the access to medical care
and long-life span for such individuals is not representative of all adults
(certainly not women).
6) A study of whether AZT
helps to reduce transmission of AIDS from mother to baby (Connor et al., 1994):
Of the 180 babies whose mothers had been randomly assigned to receive AZT, 13 babies
were HIV-infected, compared to 40 of the 183 babies in the placebo group.
(a) Create a segmented bar graph to display
these results. Comment on what the graph reveals.
This
bar graph (and the conditional proportions of 13/180 vs. 40/183) indicates that
mothers given the placebo were about 3 times as likely to have babies that
were HIV positive than were the mothers given AZT.
(b) Check the validity conditions for whether a
two-sample z-test can be applied to these data. Be sure to mention
whether the study involves random sampling from populations or random
assignment to treatment groups.
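The study involves random
assignment of mothers to treatment groups (AZT or placebo), not random sampling from populations. The number of
successes and failures in each group should be at least 5. The four
values are 13, 180-13 = 167, 40, 183-40 = 143. This condition is met.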
(c) If you were to carry out a simulation to
obtain a p-value, would you simulate random sampling or random assignment? Explain.
The data are
from randomly assigning subjects to two treatment groups. So our p-value will want to reflect the random variation
from random assignment (e.g., shuffling the 363 cards (53 successes and 310
failures) to groups of 180 and 183).
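The card-shuffling idea can be sketched directly. This is a minimal illustration of re-randomizing the 363 outcomes, with the number of shuffles (10,000) and the seed chosen arbitrarily.

```python
import random

random.seed(3)  # fixed seed so the sketch is reproducible

cards = [1] * 53 + [0] * 310          # 1 = HIV-infected baby, 0 = not infected
observed_diff = 13 / 180 - 40 / 183   # AZT proportion minus placebo proportion

def shuffled_diff():
    """Re-deal the 363 cards into groups of 180 (AZT) and 183 (placebo)."""
    random.shuffle(cards)
    azt, placebo = cards[:180], cards[180:]
    return sum(azt) / 180 - sum(placebo) / 183

reps = 10_000
diffs = [shuffled_diff() for _ in range(reps)]

# One-sided p-value: shuffles with a difference as small as (or smaller than) observed
p_value = sum(d <= observed_diff for d in diffs) / reps

print(f"observed difference: {observed_diff:.3f}")
print(f"approximate p-value: {p_value:.4f}")
```

Every shuffle keeps the 53 infections fixed and only changes which group they land in, which is exactly the variation that random assignment produces when the treatment has no effect.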
(d) Conduct an appropriate test of significance
to determine whether the data provide convincing evidence that AZT is more
effective than a placebo for reducing mother-to-infant transmission of AIDS.
Report the hypotheses, test statistic, and p-value. Also indicate the test
decision using .01 as the level of significance.
The null
hypothesis is that AZT and a placebo are equally effective in reducing
mother-to-infant transmission of AIDS. Specifically, the probability of
HIV-positive babies born to mothers who could potentially take AZT is the same
as the probability of HIV-positive babies born to mothers who could potentially
take a placebo. In symbols, the null hypothesis is H0: πAZT
- πplacebo = 0.
The
alternative hypothesis is that AZT is more effective than a placebo for
reducing mother-to-infant transmission of AIDS, or that the probability of
HIV-positive babies born to mothers who could potentially take AZT is smaller
than the probability of HIV-positive babies born to mothers who could
potentially take a placebo. In symbols, the alternative hypothesis is Ha:
πAZT - πplacebo < 0.
Because this
is a randomized experiment and the counts are on the small side, we could carry
out Fisher’s Exact Test.
Or we could
carry out the random assignment simulation
And find the
p-value by counting how many re-random assignments have a difference in
proportion with HIV positive babies (AZT – placebo) of -.146 or less
Or, because
we said in (b) that the theory-based approach should be valid, we could go
straight to the Theory-Based applet to carry out a ‘two-sample z-test’
With such a
small p-value, reject H0 at the .01 level of significance.
We have very
strong statistical evidence that AZT is more effective than a placebo for
reducing mother-to-infant transmission of AIDS. We can say “more effective”
because this was a randomized, comparative experiment.
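For the theory-based route, here is a sketch of the two-sample z-test using the pooled proportion under the null hypothesis. It is a hand calculation with Python's standard library, not the output of the Theory-Based applet.

```python
from math import sqrt
from statistics import NormalDist

x_azt, n_azt = 13, 180   # infected babies, group size (AZT)
x_plc, n_plc = 40, 183   # infected babies, group size (placebo)

p_azt, p_plc = x_azt / n_azt, x_plc / n_plc
p_pooled = (x_azt + x_plc) / (n_azt + n_plc)   # common proportion under H0

se_null = sqrt(p_pooled * (1 - p_pooled) * (1 / n_azt + 1 / n_plc))
z = (p_azt - p_plc) / se_null

# One-sided p-value, because Ha says pi_AZT - pi_placebo < 0
p_value = NormalDist().cdf(z)

print(f"difference in proportions: {p_azt - p_plc:.3f}")
print(f"z = {z:.2f}, one-sided p-value = {p_value:.6f}")
```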
(e) Estimate the relative risk of transmission
with the placebo compared to AZT with a 95% confidence interval. Also be sure
to interpret this interval in context.
Sample relative risk (with placebo in numerator): (40/183)/(13/180) = 3.03
For 95% confidence, z* = 1.96
For relative risk, we first take the ln rel risk: ln(3.03) = 1.106
SE(ln rel risk) = sqrt(1/40 – 1/183 + 1/13 – 1/180) = 0.3015
Confidence interval for ln rel risk: 1.106 ± 1.96(.3015) = (0.515, 1.697)
Back-transforming: (exp(0.515), exp(1.697)) = (1.67, 5.46)
We are 95%
confident that the probability of transmission is 1.67 to 5.46 times higher
with the placebo than with AZT.
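The interval calculation can be sketched as below. Because the code carries full precision instead of rounding at each step, the lower endpoint may differ from the hand calculation in the second decimal place.

```python
from math import exp, log, sqrt

a, n1 = 40, 183   # placebo: infected babies, group size
b, n2 = 13, 180   # AZT: infected babies, group size

rr = (a / n1) / (b / n2)                             # sample relative risk
se_log_rr = sqrt(1 / a - 1 / n1 + 1 / b - 1 / n2)    # SE of ln(relative risk)

z_star = 1.96
log_low = log(rr) - z_star * se_log_rr
log_high = log(rr) + z_star * se_log_rr
ci = (exp(log_low), exp(log_high))                   # back-transform to the RR scale

print(f"relative risk = {rr:.2f}, SE(ln RR) = {se_log_rr:.4f}")
print(f"95% CI for relative risk: ({ci[0]:.2f}, {ci[1]:.2f})")
```

The interval is built on the log scale and then exponentiated, which is why it is not symmetric around 3.03.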
(f) Summarize the conclusion that you could draw
from this study (significance, estimation, causation, and generalizability).
Also explain the reasoning behind each component.
Because
this was a well-designed experiment with a small p-value, we can
conclude that AZT caused the observed difference in HIV transmission
rates. If AZT and a placebo were equally effective in reducing
mother-to-infant transmission of AIDS, we would virtually never see sample results as
extreme or more extreme than those we saw in this experiment by random assignment alone
(p-value < .0001). We are 95% confident that using the placebo increases the
probability of transmission by 67% to 446% (that is, 1.67 to 5.46 times the probability with AZT). We might have some caution in
generalizing these results to a larger population as we don’t know how the
HIV-positive mothers willing to participate in this study were recruited.
7) Consider the question of whether exposure to second-hand smoke is
harmful to the health of children. EV = whether or not
exposed to second hand smoke, RV = health of child
(a) Describe a prospective
cohort observational study that could address this question.
Find children who will be
exposed to second hand smoke and children who
won’t. In a few years, compare the health of the two groups.
(b) Describe a retrospective
case-control observational study that could address this question.
Find healthy children and unhealthy children and then see how much
second-hand smoke they were exposed to growing up.
(c) Describe a cross-classified
observational study that could address this question.
Find older children and
then determine their health status and whether they were exposed to different
amounts of second-hand smoke.
(d) Describe how you could (in
principle) design an experiment to address this question.
Randomly assign some
children to be exposed to second-hand smoke and some children not to be exposed
to second-hand smoke.
(e) Would it be ethical to
conduct an experiment to address this question? Explain.
Because second-hand smoke
is so potentially hazardous, it would not be ethical to willingly impose this
treatment on children.
8)
Investigation 3.10
(a) The observational units are drivers;
Explanatory variable is whether or not got a full
night’s sleep in previous week; Response variable is whether or not were
involved in a car crash; This is a case-control observational study.
(b)
| | No full night’s sleep | Full night’s sleep | Total |
| Crash | 61 | 535 - 61 = 474 | 535 |
| No Crash | 44 | 588 - 44 = 544 | 588 |
| Total | 105 | 1018 | 1123 |
(c) Because this was a case-control study
we should use the odds ratio. If we look at the odds of being in a crash if
they did not get a full night’s sleep compared to the odds of being in a crash
if they did get a full night’s sleep: (61/44)/(474/544)
= 1.5911
Note,
this is the same as the odds of not getting a full night’s sleep for the crash
victims vs. the odds of not getting a full night’s sleep for the non-crash
subjects.
The
odds of being in a crash for drivers who did not get a full night’s sleep were 1.59 times
the odds of being in a crash for drivers who did get at least one full night’s
sleep.
(d) We could replicate the sampling
design by sampling independently from two binomial processes with the same
probability of success (this will model H0: odds ratio = 1). One
process will represent the sampling of the crash victims (n = 535) and the other will represent the sampling of the no-crash
population (n = 588). We can use 105/1123 ≈ 0.0935 as the common probability of
success. Once we get the two samples, we
will calculate the simulated odds ratio.
Then we will see how often we get a sample odds ratio of 1.5911 or
larger (Ha: odds ratio > 1).
Note: it’s not clear here whether they had a one- or two-sided alternative in
mind, but it’s reasonable to think that they suspected the lack of sleep would
be associated with an increase in the odds of a car crash.
(e) Let Xcrash be the number
of successes (no full night’s sleep) in the crash group, so we are modeling Xcrash
as binomial with n = 535 and π = 0.0935.
Let
Xno crash be the number of successes in the “no crash” group, and
we are modeling Xno crash as binomial with n = 588 and π = 0.0935.
Then
odds ratio = (Xcrash/Xno crash) / [(535 - Xcrash)/(588
- Xno crash)]
The
distribution should appear skewed to the right with mean close to 1 and
standard deviation near 0.2.
Example
results:
mean
= 1.108, standard deviation = 0.2129
The
observed ratio of 1.59 is a fair bit out in the tail of the distribution and
appears to have a smallish p-value. If we
count how many of these simulated odds ratios are 1.59 or larger (rounding down from
1.5911 so that 1.5911 is included), we find about 1% (10 out of 1000) of the
simulated sample odds ratios are at least this extreme. Using Fisher's Exact Test, the p-value is 0.0158 = P(X
≥ 61), for X hypergeometric with N
= 1123, M = 535, n = 105. This
gives us strong evidence of a relationship in the population.
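Both approaches in (e) can be sketched in a few lines. The simulation follows the two-binomial set-up from (d) with a common probability of 105/1123; the seed and the 1000 repetitions are my own choices, so the simulated p-value will wobble from run to run. The Fisher's Exact Test line is a direct hypergeometric sum.

```python
import random
from math import comb

random.seed(4)  # fixed seed so the sketch is reproducible

n_crash, n_no_crash = 535, 588
pi_hat = 105 / 1123                     # common probability under H0
observed_or = (61 / 44) / (474 / 544)   # sample odds ratio, about 1.5911

def binomial(n, p):
    """Number of successes in n independent trials with success probability p."""
    return sum(random.random() < p for _ in range(n))

def simulated_odds_ratio():
    x_crash = binomial(n_crash, pi_hat)
    x_no = binomial(n_no_crash, pi_hat)
    # Counts near 0 or n are essentially impossible here, so no zero guard is added.
    return (x_crash / x_no) / ((n_crash - x_crash) / (n_no_crash - x_no))

reps = 1000
ratios = [simulated_odds_ratio() for _ in range(reps)]
sim_p = sum(r >= observed_or for r in ratios) / reps

# Fisher's Exact Test: P(X >= 61), X hypergeometric with N = 1123, M = 535, n = 105
N, M, n = 1123, 535, 105
fisher_p = sum(comb(M, k) * comb(N - M, n - k) for k in range(61, n + 1)) / comb(N, n)

print(f"observed odds ratio: {observed_or:.4f}")
print(f"simulated p-value:   {sim_p:.3f}")
print(f"Fisher p-value:      {fisher_p:.4f}")
```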
(f) Theoretical
standard error for the log odds ratio:
sqrt(1/61 + 1/474 + 1/44 + 1/544) ≈ 0.2075
ln(1.59) ± 1.96(0.2075) = 0.4637 ± 0.4067 ⇒ (0.057, 0.870)
(exp(0.057), exp(0.870)) ⇒ (1.06, 2.39)
We
are 95% confident that the odds of being in a car crash are 1.06 to 2.39 times
larger for those without a full night’s sleep in the previous week compared to
those with at least one full night’s sleep.
This
interval does not capture one, so we have statistically significant evidence of
an increase in odds for those without a full night’s sleep.
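The z-formula interval for the odds ratio can be sketched from the four cell counts in the table from (b). This is the hand calculation only; the R and JMP intervals are computed differently and will not match exactly.

```python
from math import exp, log, sqrt

a, b = 61, 474   # crash group: no full night's sleep, full night's sleep
c, d = 44, 544   # no-crash group: no full night's sleep, full night's sleep

odds_ratio = (a / c) / (b / d)
se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # SE of ln(odds ratio)

z_star = 1.96
log_low = log(odds_ratio) - z_star * se_log_or
log_high = log(odds_ratio) + z_star * se_log_or
ci = (exp(log_low), exp(log_high))                # back-transform to the odds-ratio scale

print(f"odds ratio = {odds_ratio:.4f}, SE(ln OR) = {se_log_or:.4f}")
print(f"95% CI for odds ratio: ({ci[0]:.2f}, {ci[1]:.2f})")
```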
R output (R inverts a test (doubling the
one-sided p-value) rather than using the z-formula)
JMP output
These
are very close to the confidence interval we calculated.
(g) We have
strong evidence (p-value = 0.0158) that sleep deprived drivers have higher odds
(4% to 140% higher) of being in a car crash, at least for drivers like these
New Zealand drivers. We cannot draw a cause-and-effect conclusion however as
this was an observational study. We
should probably apply these results only to drivers in New Zealand at that
time.