5 ::: Linear Models

This will only cover linear models for the purposes of Stat M12.

### basic linear models

If there is a variable x that is believed to hold a linear relationship with another variable y, then a linear model may be useful. The model will take the form

`yi = axi + b + ei`

It is easier to think of the model without the ei, which is just the residual of data point i (some may call this the 'error'). So, the goal is the find the best fitting line to the data.

As discussed in lab, this best linear model (by many standards) and the most commonly used method is called the 'least squares regression line' and it has some special properties:
- it minimizes the sum of the squared residuals,
- the sum of the residuals is zero, and
- the point (mean(x), mean(y)) falls on the line.

top

### fitting a model in R

Given a vector x that may be an explanatory variable of y (through a linear relationship), a model may be fit in R using the function lm():

```> x   # the values of x
[1]  1  2  3  4  5  6  7  8  9 10 11 22 13 14 15 16 17 18 19 20 21 12 23
[24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
[47] 47 48 49 50
> y   # the values of y
[1]  17.904874  16.869444   8.577667  11.514831  37.501027  36.153710
[7]  31.095892  -3.168944  21.165378  43.404508  24.303967  43.302560
[13]  46.234474  59.766088  27.568532  23.638366  53.436572  65.486651
[19]  90.248055  48.700811  43.273224  38.537765  64.810771  45.214556
[25]  36.170230  96.215944  50.387662  54.798343 134.467379  75.629519
[31]  33.460263  56.375976  57.751208  52.897848  77.047355  62.359089
[37]  38.304368  57.394667  82.131770  62.388716  92.809843 108.512499
[43]  84.240950  82.989272 100.224232  82.375033 121.487718  97.679322
[49]  90.685929 111.203797
> plot(x, y)         # look at the scatter plot
> fit <- lm(y ~ x)   # y 'as a linear function of' x
> abline(fit)        # add the least squares line onto the scatter plot```

The output is saved in fit and may be viewed by typing in fit and hitting enter or by looking at summary(fit):

```> fit

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
15.610        1.659  ```

The number corresponding to the y-intercept is the number below (Intercept). The second value that falls below x corresponds to the coefficient of x (the slope in the equation). To put this into the form of an equation, use the form 'y=mx+b':
y = 1.659*x + 15.610

```> summary(fit)   # the blue highlighting was done by me, not R

Call:
lm(formula = y ~ x)

Residuals:
Min      1Q  Median      3Q     Max
-38.683 -11.689  -2.379  10.182  70.751

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  15.6101     5.6999   2.739  0.00863 **
x             1.6588     0.1945   8.527 3.56e-11 ***
---
Signif. codes:  0 Ô***Õ 0.001 Ô**Õ 0.01 Ô*Õ 0.05 Ô.Õ 0.1 Ô Õ 1

Residual standard error: 19.85 on 48 degrees of freedom
Multiple R-Squared: 0.6024,	Adjusted R-squared: 0.5941
F-statistic: 72.71 on 1 and 48 DF,  p-value: 3.558e-11 ```

Notice that the two values from the output of fit fall under the column Estimate. The r2 of the fit is found by looking at Multiple R-Squared (in this case, it is 0.6024). For Stat M12, we will not need the rest of the information in the summary.

top

### finding predictors and residuals

Continuing on with the example above, it's time to put the model fit to use. Given the model, if we want to look at what the predicted value of one of our x's would be, say at 12, what do we do? To find this, we just use our least squares fit and plug 12 into the equation:
y-predicted = 1.659*12 + 15.610 = 35.518
So when x=12, we expect y to be about 35.518. This could be computed in R rather than using a calculator. The way I constructed x in R, the position in x corresponding to the value 12 is the 22nd position, so I could the following in R:

```> 1.659*x[22] + 15.610
[1] 35.518```

To check my answer in R, I could also use fit, which stores extra information including predicted values:

```> fit\$fitted[22]
12
35.51623 ```

The reason why the answers are slightly different is due to round off error in the intercept and slope of the values in the R output (R rounded to 4 decimal places in the output of fit and summary(fit)). The output from fit\$fitted[12] is more precise.

If we are interested in the predicted value, odds are we also have interest in the residual at the given point. To compute the residual of a point, we use
residual = observed - predicted
So, in the case for when x=12 with a predicted value of 35.52 (and an observed value of 38.54, which is shown by the red dot in the plot), the residual would be
38.54 - 35.52 = 3.02
As with the pedicted, R will save the residuals in fit, and the residuals may be extracted using

```> fit\$resid[22]
12
3.021535 ```

As before, the manually computed and R-output may differ slightly, which is again due to round-off error in the values of the model fit.

One last remark. If a residual is positive, then we know the observed value was larger than the predicted. If a residual is negative, then we know the observed value was smaller than the predicted. This follows directly from the equation
residual = observed - predicted

[[ Stat M12 students: be certain to understand how to do the manual computations of the predicted value and residuals from the output of fit and summary(fit). Use fit\$fitted and fit\$resid only as a method to check your answers. ]]

top

### regression diagnostics

The plots below are the first two diagnostic plots available in R by plotting fit:

`> plot(fit)`

To view the plots, hit 'return'. The first plot gives an idea of whether there is any curvature in the data. If the red line is strongly curved, a quadratic or other model may be better. In this case, the curvature is not strong (so a quadratic component in the model is not necessary). The second plot is to check whether the residuals are normally distributed. (Stat M12 students, don't worry about this.)

The third plot is used to check if the variance is contant (ie, if the standard deviation among the residuals appears to be about constant). If the red line is strongly tilted up/down, that is a red flag. There are no issues with that in this example - the variance appears constant. (The red line will always move up/down a little because of inherent randomness.) The last plot is used to check to see if there were any overly influential points (Stat M12 students, don't worry about this plot at all).

top

### confidence intervals

Given a linear model, a confidence interval may be easily obtained by using the function confint():

```> confint(fit)
2.5 %    97.5 %
(Intercept) 4.149683 27.070573
x           1.267703  2.049981```

To change the level of confidence, add in the level argument:

```> confint(fit, level=0.9)
5 %      95 %
(Intercept) 6.050094 25.170162
x           1.332563  1.985121
> confint(fit, level=0.99)
0.5 %    99.5 %
(Intercept) 0.3217958 30.898460
x           1.1370588  2.180625```

top