If we think the variability is increasing with the response variable, then this suggests using the log or square root transformation on the response.

```{r}
sqrtdemand = sqrt(utility$demand)
scatter.smooth(sqrtdemand ~ utility$usage)
model2 = lm(sqrtdemand ~ usage, data = utility)
par(mfrow = c(2, 2))
plot(model2)
```

(b) Fit the log(response) model. The log is considered the more "severe" transformation.

```{r}
lndemand = log(utility$demand)  # note, this is the natural log by default
scatter.smooth(lndemand ~ utility$usage)
model2 = lm(lndemand ~ usage, data = utility)
plot(model2)
```

Does this "fix" the issue?
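If it is unclear which power transformation to try, one option is to inspect the Box-Cox profile likelihood: a $\lambda$ near 0.5 points toward the square root, and a $\lambda$ near 0 toward the log. This is a sketch, assuming the MASS package is installed and the `utility` data frame has already been read in:

```{r}
# Box-Cox profile likelihood for the response transformation (sketch;
# assumes MASS is installed and `utility` is already loaded)
library(MASS)
bc <- boxcox(demand ~ usage, data = utility)  # plots log-likelihood vs lambda
bc$x[which.max(bc$y)]  # lambda near 0 suggests log; near 0.5 suggests sqrt
```

Keep in mind the profile is usually flat near the maximum, so the log and square root can fit about equally well.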

One concern with transforming the response is that it can impact other model assumptions, like linearity and normality. You also have to be careful comparing $R^2$ values across such models. There are alternatives, such as weighted least squares and "modelling" the pattern in the variability.

**Example 3: Modeling Heterogeneity/Weighted Least Squares**

Smith et al. (2005) examined reproductive and somatic tissues in the squid *Loligo forbesi*. The data in Squid.txt include the dorsal mantle length (DML, in mm) and testis weight from 768 male squid, over different months. "The idea behind the original analysis was to investigate the role of endogenous and exogenous factors affecting sexual maturation, more specifically to determine the extent to which maturation is size-related and seasonal" (Zuur et al., 2009).

```{r}
Squid <- read.table("http://www.rossmanchance.com/stat414/data/Squid.txt", header = T)
plot(Testisweight ~ DML, data = Squid)
```

(a) How does Testis weight appear to change with DML? For which DML values do we have less variable measurements of Testis weight?

Fit a linear model predicting testis weight from dorsal mantle length.

```{r, echo=FALSE, message=FALSE}
plot(Testisweight ~ DML, data = Squid)
model1 <- lm(Testisweight ~ DML, data = Squid)
summary(model1)

# Bear with me, just sorted the data first so the interval lines plot cleanly
Squid2 <- Squid[with(Squid, order(DML)), ]
model1 <- lm(Testisweight ~ DML, data = Squid2)
predictions <- predict(model1, newdata = data.frame(DML = Squid2$DML), interval = "prediction")
alldata <- cbind(Squid2, predictions)

library(ggplot2)  # needed for the plot below
ggplot(alldata, aes(x = DML, y = Testisweight)) +  # define x and y axis variables
  geom_point() +                                   # add scatterplot points
  stat_smooth(method = lm) +                       # confidence bands
  geom_line(aes(y = lwr), col = "coral2", linetype = "dashed") +  # lower prediction interval
  geom_line(aes(y = upr), col = "coral2", linetype = "dashed") +  # upper prediction interval
  theme_bw()
```

(b) What do you conclude from the residual plots? (Also pay attention to the third one now.)

```{r}
plot(model1)
# Note, this pattern remains even in a more complicated model we will fit
# later, though including an interaction between DML and month does account
# for some of the "curvature."
```

(c) Why might a transformation not be helpful here?

Another approach is to model the relationship between the variance and the explanatory variable. Because the variability in the residuals is increasing with DML values, we could try modelling the error variance as proportional to DML: $\varepsilon_i \sim N(0, \sigma^2 \times DML_i)$. We can think of this as a special case of weighted least squares, which assumes $V(Y_i \mid x_i) = \sigma^2 / w_i$; if we know the $w_i$, then we minimize $\sum w_i (Y_i - \beta_0 - \beta_1 x_i)^2$. For this dataset, we can take $w_i = 1/DML_i$, which will give more "weight" in the least squares estimation to squid with smaller DML values. (d)
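In R, these $w_i$ can be passed directly to `lm()` through its `weights` argument, which is interpreted exactly as the $w_i$ in the weighted sum of squares above. A minimal sketch, assuming `Squid` has been read in as above (the object name `model2` is just a convenient choice here):

```{r}
# Weighted least squares with w_i = 1/DML_i (sketch; assumes `Squid` is loaded).
# lm() treats `weights` as the w_i in sum(w_i * residual_i^2).
model2 <- lm(Testisweight ~ DML, data = Squid, weights = 1/DML)
summary(model2)
```

Note that the reported residual standard error is then on the weighted scale, which is one reason the standardized residuals (below) are the right thing to plot.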

To see whether this has sufficiently addressed the heterogeneity we saw in the residuals, we want to look again at the residual plots. However, with weighted least squares, we need to look at the *standardized* residuals rather than the non-standardized residuals. You can think of these like z-scores, though there are different versions that divide by slightly different estimates of SD(residual).

```{r}
head(residuals(model2))
stdresids = rstandard(model2)
head(stdresids)
plot(stdresids ~ fitted.values(model2))
# plot(rstandard(model2) ~ fitted.values(model2))
```

(h) Have things improved? Be clear how you are deciding.
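To demystify what `rstandard()` is computing, for an unweighted fit the standardized residual divides each raw residual by its estimated standard deviation, $\hat\sigma\sqrt{1-h_{ii}}$, where $h_{ii}$ is the leverage (for a weighted fit, the weights enter as well). A sketch using the unweighted squid model for illustration:

```{r}
# Manual standardized residuals for an unweighted lm fit, matching rstandard():
# e_i / (sigma_hat * sqrt(1 - h_ii))
m <- lm(Testisweight ~ DML, data = Squid)
h <- hatvalues(m)                                     # leverages h_ii
manual <- residuals(m) / (summary(m)$sigma * sqrt(1 - h))
all.equal(unname(manual), unname(rstandard(m)))       # TRUE
```

(The related `rstudent()` instead uses a leave-one-out estimate of $\sigma$ for each residual.)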

**Notes:**

- Can explore other "forms" (e.g., powers) of the variance covariates. (Watch for zero and negative values.)
- The estimates of the coefficients will usually be nearly the same as the unweighted estimates, but the weights will impact the widths of prediction intervals.
- But there is a different problem with how we have done things here. We don't know the true $\sigma_i$ values, we have only estimated them from the sample data. Instead we should use *generalized least squares* (Aitken, 1935), which is an iterative approach for simultaneously estimating the regression coefficients and estimating the variance terms....
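One common way to carry out this generalized least squares fit is `gls()` in the nlme package (the package Zuur et al. use), with a variance structure declaring that the error variance is proportional to DML. A sketch, assuming nlme is installed and `Squid` is loaded as above:

```{r}
# GLS with Var(eps_i) = sigma^2 * DML_i (sketch; assumes nlme is installed
# and `Squid` is loaded as above)
library(nlme)
model.gls <- gls(Testisweight ~ DML, data = Squid,
                 weights = varFixed(~DML))
summary(model.gls)
plot(model.gls)  # plot.gls shows standardized residuals vs fitted values
```

Other variance functions such as `varPower()` or `varExp()` let the data estimate the form of the variance covariate rather than fixing it in advance.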