Math 37 - Lecture 25

Regression

Example Anscombe data: correlation coefficient

If graph shows a linear relationship, can we model it?

Goal: Model overall linear pattern found in the scatterplot.

A regression line is a straight line that describes how the response variable varies with the explanatory variable. It also allows us to predict the response from a given value of the explanatory variable.

Example Manatees: Can we summarize the linear relationship between the number of boat registrations and number killed?

Explanatory variable = number of boat registrations

Response variable = number of manatees killed

Scatterplot showed a strong, positive, linear association

Equations for a Line: y=a+bx

y=height on vertical axis; a=intercept (height when x=0); b=slope (rate of change in y as x changes); x=position on horizontal axis

Example: ŷ = a + bx, i.e. predicted (killed) = -41.4 + 0.125(registrations)

To add the regression line to a plot, pick two values of x (far apart), find the predicted values, ŷ, and draw a line through the two points.
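The drawing recipe can be sketched in a few lines of Python. This is only an illustration: the function name `line_endpoints` and the two x values (400 and 700) are assumptions chosen for the example, not values taken from the data; the intercept and slope are the manatee values quoted in the example equation.

```python
def line_endpoints(a, b, x_low, x_high):
    """Return two (x, predicted y) points on the line y-hat = a + b*x."""
    return [(x, a + b * x) for x in (x_low, x_high)]

# Intercept and slope come from the manatee example equation.
# The two x values are hypothetical; any two far-apart values work.
points = line_endpoints(a=-41.4, b=0.125, x_low=400, x_high=700)

for x, y_hat in points:
    print(f"x = {x}: predicted y = {y_hat:.1f}")
```

Plotting these two points and joining them with a ruler gives the regression line; choosing x values far apart keeps small plotting errors from tilting the line.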

Residual = observed y - predicted y = y - ŷ

The Least Squares technique minimizes the sum of the squared residuals

Fitting a line:

One Technique - Least Squares Regression

Def: Minimizes sum of the squared vertical distances from the line

Calculation: b = r(sy/sx), a = ȳ - b x̄
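As a rough check of the calculation, the sketch below computes the slope and intercept from r, the means, and the standard deviations, then confirms that nudging the line increases the sum of squared residuals. The five data points and the function names are made up for illustration; they are not the manatee or NBA data.

```python
from math import sqrt

def least_squares_line(xs, ys):
    """Return (a, b) using b = r * sy / sx and a = y_bar - b * x_bar."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations (divide by n - 1).
    sx = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    sy = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    # Correlation: average product of standardized values.
    r = sum(((x - x_bar) / sx) * ((y - y_bar) / sy)
            for x, y in zip(xs, ys)) / (n - 1)
    b = r * sy / sx
    a = y_bar - b * x_bar
    return a, b

def sum_squared_residuals(xs, ys, a, b):
    """Sum of squared vertical distances from the line y-hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, chosen only so the arithmetic is easy to follow.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = least_squares_line(xs, ys)
best = sum_squared_residuals(xs, ys, a, b)
print(f"a = {a:.2f}, b = {b:.2f}, sum of squared residuals = {best:.2f}")
```

Any other line, for example one with a slightly larger slope or intercept, gives a larger sum of squared residuals on the same data, which is what "least squares" means.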

Example Manatees, r= , x̄= , sx= , ȳ= , sy=

If x=585, predict killed. Observed
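The prediction at x = 585 can be worked out directly from the manatee equation given earlier. The sketch below is one way to do the arithmetic; the function names are assumptions, and the observed count is left as an input because it has to be read from the data table.

```python
A = -41.4   # intercept of the manatee regression line
B = 0.125   # slope of the manatee regression line

def predict_killed(registrations):
    """Predicted number killed: y-hat = a + b*x."""
    return A + B * registrations

def residual(observed_killed, registrations):
    """Residual = observed y - predicted y."""
    return observed_killed - predict_killed(registrations)

prediction = predict_killed(585)
print(f"{prediction:.3f}")  # 31.725
```

Once the observed count for x = 585 is known, `residual(observed, 585)` gives how far that observation sits above (positive) or below (negative) the line.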

Example Predicting NBA attendance

Is there a highly associated variable?

Scatterplot:

Correlation Coefficient =

x̄ = 18.57, sx = 3.87, ȳ = 15710, sy = 3570

Regression Line:

Outliers?

Are any teams unusual, e.g. have a large residual?

Can you give an explanation?

What if you refit the model without these observations?

Unusual Observations

Need an explanation before we can remove a point from the data

- Often have small residuals

- Often occur for observations with extreme x values.

- Should point be included?

Are they influential? How do we decide?

R² value = % of variation in y explained by regression on x

i.e. Does x do a good job of telling us why y varies?

Example For Manatees, R² = 88.6%: 88.6% of the variation in the number killed is explained by the regression on boat registrations

- The number of boat registrations does a lot to explain the increase in manatee deaths

- Prediction of manatees killed from number of boat registrations is reasonably accurate
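To make the "% of variation explained" idea concrete, the sketch below computes the value as 1 - (sum of squared residuals)/(total variation in y) on a small made-up data set. The data and function name are assumptions for illustration only. It also recovers the size of r implied by the manatee value of 88.6%, taking the positive root because the association in the scatterplot was positive.

```python
from math import sqrt

def r_squared(xs, ys):
    """Fraction of the variation in y explained by the least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx                # least squares slope
    a = y_bar - b * x_bar        # least squares intercept
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # unexplained
    sst = sum((y - y_bar) ** 2 for y in ys)                    # total
    return 1 - sse / sst

# Hypothetical data, not the manatee data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r2 = r_squared(xs, ys)
print(f"Toy data: {r2:.1%} of the variation in y is explained")

# For the manatee example, 88.6% explained implies r = +sqrt(0.886).
manatee_r = sqrt(0.886)
print(f"Manatee r is about {manatee_r:.3f}")
```

For a straight-line fit with one explanatory variable, this value equals the square of the correlation coefficient, which is why it is written as r squared.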

Cautions

- The regression line describes only linear dependence

- Extrapolation: Be very cautious when predicting for values of the explanatory variable outside the range for which we have data
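The manatee line itself shows why extrapolation is risky: at x = 0 it predicts a negative number of manatees killed, which is impossible. The sketch below flags predictions outside the observed x range. The function names and the range used here (400 to 750) are hypothetical placeholders; the real limits have to be read from the data.

```python
def predict_killed(registrations):
    """Manatee regression line: y-hat = -41.4 + 0.125 * x."""
    return -41.4 + 0.125 * registrations

def is_extrapolation(x, x_min, x_max):
    """True when x lies outside the range of x values in the data."""
    return x < x_min or x > x_max

# Hypothetical observed range of the explanatory variable.
X_MIN, X_MAX = 400, 750

for x in (0, 585):
    flag = "extrapolation" if is_extrapolation(x, X_MIN, X_MAX) else "within range"
    print(f"x = {x}: predicted killed = {predict_killed(x):.1f} ({flag})")
```

A prediction flagged as an extrapolation is not necessarily wrong, but nothing in the data supports the assumption that the linear pattern continues that far.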