Stat 414 – HW 1

Due by 2pm, Friday, Sept. 27

 

This first assignment assumes you remember a few topics from earlier courses, such as interpreting a regression equation and interpreting R². It also assumes you can use help menus to figure out how to do something in a particular software package. Ask me, or use the discussion board in PolyLearn, to ask questions and share lessons learned, especially about different technologies. I will be most helpful with R, JMP, and Minitab questions, but you are welcome to use other packages as well.

 

Submitting your file

Include your name inside the file. On the PolyLearn page, under Assignments, click the link for Upload HW 1. Either Word or PDF is fine, but I have a slight preference for Word (.doc or .docx).

·       If this ever does not work for you, email it to me as an attachment and make sure your name is clearly indicated in your email, in the file, and in the file name.  You should use the subject line: Stat 414 HW submission. 

Please submit a separate file for each problem.

 

1) All of the studies below have at least one violation of the LINE assumptions for inference. Begin by identifying the response and explanatory variables. Write the assumptions in the context of each study as if there were no violations. Then identify which assumptions are likely to be invalid. Explain your answer in the context of the study.

a) A randomized clinical trial investigated postnatal depression and the use of an estrogen patch (Gregoire et al. 1996). Patients were randomly assigned to either use the patch or not. Depression scores were recorded on six different visits.

b) The Minnesota Pollution Control Agency is interested in using traffic volume data to generate predictions of particulate distributions, as measured in counts per cubic foot.

c)  As part of a study investigating possible gender discrimination in beginning salaries at a particular company, researchers study the relationship between years of education and beginning salary among company employees.

d) Do college basketball referees tend to even out the foul calls on the two teams over the course of a game? For example, if several more fouls have been called on the visitors at a certain point in the game, does it become more likely that the next foul will be called on the home team? And do these chances depend on the score of the game, the size of the crowd, or the referees working the game?

 

Bring questions to class on Tuesday!

2) The Kentucky Derby is an annual horse race run at Churchill Downs in Louisville, KY, USA, on the first Saturday in May. The race is known as the “Most Exciting Two Minutes in Sports,” and is the first leg of racing’s Triple Crown. The dataset KYDerby18.txt contains information on each running of the Kentucky Derby since 1875. Use JMP, R, or another package for the analyses in the following questions. Cite any help documentation you use to determine the appropriate commands.

(a) Graph winning time vs. year. Describe any patterns you see.  How would you suggest modeling these data?

(b) Replace time on the vertical axis with speed and describe any patterns you see.  How would you suggest modeling these data?

(c) Let $Y_i$ represent the speed of the winning horse in year $i$. Consider

Model 1: $Y_i = \beta_0 + \beta_1 \text{Year}_i + \epsilon_i$, where $\epsilon_i \sim N(0, \sigma^2)$.

Based on the scatterplot in (b), does this model seem valid?  How are you deciding?  How will you interpret the intercept and slope values?

(d) Fit the model suggested in (c) using least squares. Is the model useful? Is the relationship statistically significant? How are you deciding? 

(e) How do you determine and interpret “Root Mean Square” aka “Residual standard error”?
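
As a reminder of the arithmetic behind this quantity, here is a minimal sketch in plain Python. It assumes a simple linear regression with one explanatory variable, so the residual degrees of freedom are n - 2. The function names and the numbers are made up for illustration and are not the Derby data.

```python
import math

def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def residual_standard_error(x, y):
    """sqrt(SSE / (n - 2)) for a simple linear regression."""
    b0, b1 = fit_line(x, y)
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(sse / (len(x) - 2))

# Made-up illustration values, NOT the Derby data.
x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
rse = residual_standard_error(x, y)
print(round(rse, 4))
```

Your software reports this value directly in its regression output; the sketch only shows how it is built from the residuals.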

(f) What do you learn from the residuals vs. predicted graph?

 

One way to create a more meaningful intercept is by centering the year variable,

Centered year = Year – mean(Year)
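
A minimal sketch of this centering step in plain Python, so you can see what it does to a fitted line before trying it on the Derby data. The years and speeds below are hypothetical values chosen for easy arithmetic, and `fit_line` is an illustrative helper, not a function from any package.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical values, NOT taken from KYDerby18.txt.
year = [2000, 2001, 2002, 2003, 2004]
speed = [10.0, 10.5, 10.9, 11.6, 12.0]

# Centered year = Year - mean(Year)
mean_year = sum(year) / len(year)
centered_year = [yr - mean_year for yr in year]

b0_raw, b1_raw = fit_line(year, speed)
b0_ctr, b1_ctr = fit_line(centered_year, speed)

print("raw:     ", b0_raw, b1_raw)
print("centered:", b0_ctr, b1_ctr)
```

Compare the two printed lines and note which coefficient moves and which does not.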

(g) Fit Model 1 using centered year: $Y_i = \beta_0 + \beta_1 \text{CenteredYear}_i + \epsilon_i$, where $\epsilon_i \sim N(0, \sigma^2)$.

How have the model p-value, R², and MSE values changed?  What about the residual plot? How has the intercept changed? How do you interpret its value?

 

One approach to modeling curvature in the data is with a polynomial model, especially when there are “bends” or changes in direction that power transformations can’t address.

(h) Fit a quadratic model using centered year and centered year squared. Create a graph that shows the scatterplot of the data with the fitted values overlaid. In your opinion, how well does the model capture the curvature in the data? How do you interpret the quadratic behavior? Have the R² and MSE values changed? Why?  What about the residual plots? How do you interpret the residual standard error?
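
If you are working outside R, JMP, or Minitab, the quadratic fit amounts to least squares with two explanatory columns: centered year and its square. The sketch below, in plain Python, solves the normal equations directly. The helper names are illustrative, and the data are generated from a known quadratic (an assumption made so the recovered coefficients can be checked), not from the Derby file.

```python
def solve(a, b):
    """Solve the square system a * coef = b by Gaussian elimination
    with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = (m[i][n] - tail) / m[i][i]
    return coef

def fit_quadratic(x, y):
    """Least-squares fit of y = b0 + b1*xc + b2*xc^2, xc = x - mean(x)."""
    x_bar = sum(x) / len(x)
    xc = [xi - x_bar for xi in x]
    rows = [[1.0, v, v * v] for v in xc]
    # Normal equations: (X'X) b = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)]
           for i in range(3)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    return solve(xtx, xty), x_bar

# Made-up data that follow an exact quadratic in centered x.
x = [1, 2, 3, 4, 5, 6, 7]
y = [5 + 2 * (xi - 4) - 0.5 * (xi - 4) ** 2 for xi in x]

coef, x_bar = fit_quadratic(x, y)
fitted = [coef[0] + coef[1] * (xi - x_bar) + coef[2] * (xi - x_bar) ** 2
          for xi in x]
print(coef)
```

The `fitted` list holds the values you would overlay on the scatterplot.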

 

Another approach to modeling curvature is with a linearizing transformation.  These assume a particular form of the relationship. In this case, you will want to shift the explanatory variable first by creating:   Year – 1874

(i) Produce a graph of speed vs. (year-1874). Fit a model to predict speed from log(year-1874) and overlay this model on your graph.  Does the “shape” of this function appear to match the shape of the data? Explain.

Hint: I would like a graph of speed vs. (year-1874) and I would like the model you find overlaid on that graph (and the one from (h)?) for comparison.
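
The mechanics of the shifted log model can be sketched as follows in plain Python: shift the year, take the natural log, fit a line to the logged variable, then compute fitted values on the original (year-1874) scale for the overlay. The years and speeds are made up (generated from a known log curve), and `fit_line` is an illustrative helper.

```python
import math

def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical years; speeds generated from an exact log curve,
# NOT taken from KYDerby18.txt.
year = [1875, 1885, 1900, 1925, 1950, 2000]
shifted = [yr - 1874 for yr in year]          # Year - 1874
log_shifted = [math.log(t) for t in shifted]  # natural log
speed = [30.0 + 2.0 * lt for lt in log_shifted]

# The model is linear in log(year - 1874).
b0, b1 = fit_line(log_shifted, speed)

# Fitted values to plot against (year - 1874), not against the log.
fitted = [b0 + b1 * math.log(t) for t in shifted]
print(b0, b1)
```

Plotting `fitted` against `shifted` gives a curve rather than a straight line, which is what you compare with the shape of the data.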

(j) How do you interpret the intercept of this model?

(k) Is the slope of the model in (i) statistically significant?  Define the parameter of interest, state null and alternative hypotheses in terms of this parameter, produce a test statistic and p-value (including appropriate output), and draw your conclusion.

Extra credit: Which model, the quadratic or the transformed model, would you recommend, and why?

 

(l) Create a coded scatterplot of speed vs. year, coded by condition. Describe any patterns you see.  How would you suggest modeling these data?  In particular, how would you suggest including condition in a general linear model?

(m) Consider the model from (i). Color code the residuals vs. fits graph by condition. Does there appear to be a pattern to the residuals?

(n) Recode the condition variable into three categories (“slow,” “good,” and “fast”) and include this new variable in the model. How do you interpret the condition coefficients?  Is this new condition variable significantly related to speed? How are you deciding?  (State the null and alternative hypotheses. Bonus: Define the parameter in words). Is it worth including in the model?

(o) Determine and interpret in context the R² value for this model.

Extra Credit:  Explain how you would interpret an interaction between condition and year in this context.

 

3) A student collected data from a restaurant where she was a waitress. The student was interested in learning under what conditions a waitress can expect the largest tips—for example: At dinner time or late at night? From younger or older patrons? From patrons receiving free meals? From patrons drinking alcohol? From patrons tipping with cash or credit? And should tip amount be measured as total dollar amount or as a percentage? The tips1.txt data file has data on tip percentage and age.

a) Why would you analyze tip percentage rather than tip amount?

b) Carry out an Analysis of Variance to decide whether the tip percentage varies among younger, middle, and older patrons.
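
For reference, the F statistic in a one-way ANOVA is the ratio of the between-group mean square to the within-group mean square. A minimal plain-Python sketch follows. The function name and the three small groups are hypothetical (they are not the tips1.txt data), and the sketch stops at the F statistic and its degrees of freedom; the p-value comes from the F distribution in your statistics package.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: group means vs. the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Within-group sum of squares: observations vs. their group mean.
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, means) for v in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical values for three groups, NOT the tips1.txt data.
younger = [1, 2, 3]
middle = [2, 3, 4]
older = [5, 6, 7]

f_stat, df1, df2 = one_way_anova_f([younger, middle, older])
print(f_stat, df1, df2)
```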

c) Does the equal variance condition appear to be met for these data?  Produce a graph to help support your argument. If not, what would you suggest?

d) Carry out a more formal test of the equality of the variances of tip percentages across the age groups.
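
The assignment does not name a specific test here. One common choice is Levene's test, which is a one-way ANOVA applied to the absolute deviations of each observation from its group center. The plain-Python sketch below uses the group median as the center (the Brown-Forsythe variant); that choice, the function names, and the data are assumptions for illustration, not the tips1.txt data.

```python
from statistics import median

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, means) for v in g)
    df_between = k - 1
    df_within = n_total - k
    return ((ss_between / df_between) / (ss_within / df_within),
            df_between, df_within)

def levene_statistic(groups, center=median):
    """ANOVA F statistic on |value - group center| (median by default)."""
    deviations = [[abs(v - center(g)) for v in g] for g in groups]
    return one_way_anova_f(deviations)

# Hypothetical tip percentages for three groups, NOT the tips1.txt data.
younger = [10, 12, 14]
middle = [5, 12, 19]
older = [11, 12, 13]

w_stat, df1, df2 = levene_statistic([younger, middle, older])
print(round(w_stat, 4), df1, df2)
```

As with the ANOVA itself, the p-value comes from comparing the statistic to an F distribution with the printed degrees of freedom; most packages offer this test as a built-in.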