Workshop Statistics: Discovery with Data and Fathom

Topic 11: Least Squares Regression II

Activity 11-1: Gestation and Longevity

(a) gestation = 22 + 13.1 * longevity

(b) Fitted value: 22 + 13.1(20) = 284; Residual: -97
(d) For each additional year of an animal's longevity, its predicted gestation period is longer by 13.1 days.
(e) .439
(f)

        It seems as though the predictions are generally closer when the longevity is very small.

(g) This should produce the same graph you created in (f) (you may have to enlarge it to see it better).
(h)

        The elephant (residual = 98.23) is an outlier in both longevity and gestation.  However, 6 other animals have larger positive residuals, and 6 other animals have negative residuals of larger magnitude.  So no, the elephant, although extreme in longevity and gestation, does not have the largest residual.

(i) The giraffe has the largest residual.  Its gestation period is much longer than would be expected for an animal of its longevity.
(j) regression line: gestation = 9 + 13.6 * longevity;  r^2 = .501
(k) This regression line with the giraffe omitted is not substantially different from the original one.
(l) regression line: gestation = 45 + 11.1 * longevity;  r^2 = .269
(m) The removal of the elephant affected the regression line much more than the removal of the giraffe.
(n) Changing the gestation period of the elephant (an animal with an extreme longevity) has a very large effect on the regression line, whereas changing the gestation period of an animal with a more typical longevity has little effect.  Thus, the regression line is not resistant to outliers, especially those that are extreme in the horizontal direction.
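
The arithmetic in (b) and the contrast drawn in (k) through (n) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the observed value in part (b) is the one implied by the reported residual, the six data points are hypothetical (they are not the animal data), and fit_line and drop are helper names introduced here.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def drop(values, index):
    """Return a copy of values with the item at index removed."""
    return values[:index] + values[index + 1:]


# Part (b): fitted value from the equation in (a) at longevity = 20.
fitted = 22 + 13.1 * 20
# Residual = observed - fitted. The observed value here is only the one
# implied by the reported residual of -97; it is not read from the data.
observed = fitted - 97
residual = observed - fitted

# Hypothetical data, NOT the animal data set.
# Index 2 has a large residual at a typical x; index 5 is extreme in x.
xs = [1, 2, 3, 4, 5, 20]
ys = [2, 4, 12, 8, 10, 20]

_, slope_all = fit_line(xs, ys)
_, slope_no_typical = fit_line(drop(xs, 2), drop(ys, 2))
_, slope_no_extreme = fit_line(drop(xs, 5), drop(ys, 5))

# How far the slope moves when each point is removed.
shift_typical = abs(slope_no_typical - slope_all)
shift_extreme = abs(slope_no_extreme - slope_all)

print(f"fitted = {fitted:.0f}, residual = {residual:.0f}")
print(f"slope with all points:          {slope_all:.2f}")
print(f"slope without typical-x point:  {slope_no_typical:.2f}")
print(f"slope without extreme-x point:  {slope_no_extreme:.2f}")
```

In this made-up example, removing the point that is extreme in x moves the slope far more than removing the point with the large residual at a typical x, which is the same qualitative behavior described for the elephant and the giraffe.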

Activity 11-2: Residual Plots

(a) (b) (c) Plots 1 and 4 summarize the relationship in the data about as well as possible.  The points fall roughly evenly about the least squares regression line.  Plots 2 and 3 would best be described by some type of curve.
(d) The scatterplots where the lines summarize the data about as well as possible do not correspond to the highest values of r^2.  More points fall closer to the line in the other two plots, although they aren't best modeled by a linear fit.
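
A small sketch can show how a curved relationship may still give a high r^2 while its residuals reveal the curvature. The data below are hypothetical (y = x^2 for x from 1 to 10), not any of the four plots in the activity, and fit_line is a helper name introduced here.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical curved data, NOT taken from the activity.
xs = list(range(1, 11))
ys = [x ** 2 for x in xs]

intercept, slope = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# r^2 = 1 - (sum of squared residuals) / (total sum of squares)
mean_y = sum(ys) / len(ys)
ss_total = sum((y - mean_y) ** 2 for y in ys)
ss_resid = sum(e ** 2 for e in residuals)
r_squared = 1 - ss_resid / ss_total

print(f"r^2 = {r_squared:.3f}")
for x, e in zip(xs, residuals):
    print(f"x = {x:2d}  residual = {e:6.1f}")
```

Here r^2 comes out at about .95, yet the residuals are positive at both ends and negative in the middle. That pattern in a residual plot is a sign that a curve would describe the data better than a line, whatever the value of r^2.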
 

Activity 11-3: Televisions and Life Expectancy (cont.)

(a)

        correlation: -.804.  The relationship does not appear to be linear, but rather curved.
(b)

        The pattern here is slightly U-shaped: the extreme positive residuals occur where there are either very few or very many people per TV, while most of the negative residuals fall in the middle.

(d)

        new regression equation: life expectancy = 80.6 - 13.3 * logTV;  The transformation has made the scatterplot much more linear, and has greatly improved the fit of the regression line.

(e) .850
(f) 67.3
(g) 54;  difference = 13.3, the magnitude of the slope coefficient.  Since the horizontal axis is logarithmic, a change of one unit on the horizontal scale corresponds to a change by a factor of 10 in the number of people per TV.  The number of people per TV increases by a factor of 10 from (f) to (g), which is one unit on the horizontal scale, so the predicted life expectancy changes by exactly the slope coefficient: it drops by 13.3 years.

(h)

        This scatterplot reveals no clear pattern.

(i) The linear regression model is a better fit with the transformed data.
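
The predictions in (f) and (g) can be checked with a short sketch. It assumes that logTV is the base-10 logarithm of people per TV (consistent with the "factor of 10" reasoning in (g)) and that (f) and (g) refer to 10 and 100 people per TV; those two inputs are inferred from the reported answers rather than stated in this key. The function name is introduced here for illustration.

```python
import math


def predicted_life_expectancy(people_per_tv):
    """Prediction from the transformed fit in (d), assuming base-10 logs."""
    return 80.6 - 13.3 * math.log10(people_per_tv)


# Inputs inferred from the reported answers to (f) and (g).
at_10 = predicted_life_expectancy(10)
at_100 = predicted_life_expectancy(100)

# A tenfold increase in people per TV is one unit on the log scale,
# so the two predictions differ by the size of the slope coefficient.
difference = at_10 - at_100

print(f"10 people per TV:  {at_10:.1f}")
print(f"100 people per TV: {at_100:.1f}")
print(f"difference:        {difference:.1f}")
```

Any two inputs that differ by a factor of 10 (for example 5 and 50) would give the same difference, because only the ratio matters on a logarithmic scale.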