Linear Regression, Part 4

Linear Correlation and Regression

Part 4: Regression

In Part 3, we noticed that when two variables have a correlation coefficient near +1 or -1, a scatter plot shows the data points tightly clustered near a line. When the correlation coefficient is near 0, the data points form a less dense cloud. In this section, our goal is to find the equation of a line which provides the best predictive ability for one variable in terms of the other. The predictor variable is generally denoted by x, and the predicted variable by y.

Our first task is to determine a measure for how well a line fits the data. Certainly we want a line which is as close as possible to all of the points, so the question becomes "How do we measure the distance between a point and the line?" A natural choice might be to measure the perpendicular distance between each data point and a proposed line. But remember that we want to use the line to predict y-values, given x-values, suggesting that the vertical distance between each data point and a proposed line might be a better measure of distance.

If we agree that the vertical distance between a data point and a proposed line will be a reasonable measure for the distance between a single point and the line, we are still faced with the question of how to measure distance for all of the points at once. There are many reasonable alternatives, including

the sum of the vertical distances, allowing negatives for points below the line

the sum of the absolute values of the distances

the largest absolute distance

the sum of the squares of the distances

Statisticians favor the last alternative in the list, known as the sum of the squared errors.
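To make the four alternatives concrete, here is a minimal Python sketch that computes each of them for a proposed line. The score lists are hypothetical values invented for illustration (they are not the Test 1 and Test 2 data), and the function names are assumptions of this sketch rather than anything in the helper worksheet.

```python
# Hypothetical scores, invented for illustration (not the module's data).
test1 = [62, 70, 75, 81, 88, 94]
test2 = [60, 68, 69, 76, 79, 87]

def vertical_errors(xs, ys, intercept, slope):
    """Signed vertical distance from each point to the line y = intercept + slope*x."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

def fit_measures(xs, ys, intercept, slope):
    """Return the four candidate measures of overall distance as a dict."""
    errors = vertical_errors(xs, ys, intercept, slope)
    return {
        "signed_sum": sum(errors),                   # negatives allowed below the line
        "sum_abs": sum(abs(e) for e in errors),      # sum of absolute distances
        "max_abs": max(abs(e) for e in errors),      # largest absolute distance
        "sum_squares": sum(e * e for e in errors),   # sum of the squared errors
    }

# Measures for the line y = 10 + 0.8x.
measures = fit_measures(test1, test2, 10, 0.8)

# A horizontal line through the average y-value ignores x entirely,
# yet its signed sum is (up to rounding) zero.
mean_y = sum(test2) / len(test2)
flat = fit_measures(test1, test2, mean_y, 0.0)

for label, result in [("y = 10 + 0.8x", measures), ("y = ave of y", flat)]:
    print(label)
    for name, value in result.items():
        print(f"  {name:12s} {value:10.3f}")
```

Comparing the two lines suggests one weakness of the first alternative: positive and negative distances can cancel, so a line that is clearly poor may still have a signed sum near zero, while its sum of squared errors is large.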

Use your helper application to superimpose the graph of the line y = 10 + 0.8x on the scatter plot of Test 2 scores versus Test 1 scores (so x denotes Test 1 scores). Does the line appear to fit the data reasonably well? Why or why not?

In your helper application worksheet, there is a short procedure for computing the sum of the squared errors. Which of the following expressions is being implemented by the procedure?

A.
B.
C.
D.
E.


Use the procedure in your worksheet to compute the sum of the squared errors for the line y = 10 + 0.8x and the scores from Test 1 and Test 2.

Find the equation of another line which appears to fit the data reasonably well. Record the equation of your line, graph the line with the data, and compute the sum of the squared errors. Is your line better than the one from the previous part? How can you tell?

Try another line and compute the sum of the squared errors (often abbreviated as SSE). Given these data points, how small do you think the SSE could be for any line? Could the SSE ever be 0? Explain.
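The comparison of candidate lines can be sketched in Python as follows. The scores are hypothetical values invented for illustration, the candidate lines other than y = 10 + 0.8x are arbitrary guesses, and the function name `sse` is an assumption of this sketch.

```python
# Hypothetical scores, invented for illustration (not the module's data).
test1 = [62, 70, 75, 81, 88, 94]
test2 = [60, 68, 69, 76, 79, 87]

def sse(xs, ys, intercept, slope):
    """Sum of squared vertical errors for the line y = intercept + slope*x."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Candidate lines as (intercept, slope) pairs; the last two are arbitrary guesses.
candidates = [(10, 0.8), (5, 0.85), (20, 0.7)]
for intercept, slope in candidates:
    value = sse(test1, test2, intercept, slope)
    print(f"y = {intercept} + {slope}x  ->  SSE = {value:.3f}")

# The candidate with the smallest SSE is the best of these three.
best = min(candidates, key=lambda line: sse(test1, test2, *line))
print("best candidate (intercept, slope):", best)

# The SSE is zero only when every point lies exactly on the line.
collinear_x = [1, 2, 3]
collinear_y = [3, 5, 7]  # exactly y = 1 + 2x
zero_sse = sse(collinear_x, collinear_y, 1, 2)
print("SSE for collinear points:", zero_sse)
```

A smaller SSE means a better fit among the lines tried, but it says nothing about lines that were not tried, which is why a formula for the minimizing line is useful.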

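Using calculus (this is, after all, a minimization problem), or linear algebra (see the linear algebra module Least Squares), it is not too difficult to find a formula for the slope and intercept of the line with the smallest possible sum of the squared errors. This line is known as the least squares line. Writing $\bar{x}$ for the average of the x-values and $\bar{y}$ for the average of the y-values, the slope turns out to be

\[ m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \]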

We would like our least squares line to pass through the point with coordinates ((ave of x),(ave of y)). Why is this a reasonable thing to want? Assuming the least squares line has the slope given above, and contains the point ((ave of x),(ave of y)), find an equation for the y-intercept of the line.
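For readers who want to see where the slope and the point of averages come from, the calculus argument can be sketched as follows. The notation is an assumption of this sketch: b is the intercept, m is the slope, the data are the n points (x_i, y_i), and bars denote averages.

```latex
\begin{align*}
\mathrm{SSE}(b, m) &= \sum_{i=1}^{n} \bigl(y_i - b - m x_i\bigr)^2 \\[4pt]
\frac{\partial\,\mathrm{SSE}}{\partial b} = 0
  &\;\Longrightarrow\; \sum_{i=1}^{n} \bigl(y_i - b - m x_i\bigr) = 0
  \;\Longrightarrow\; b = \bar{y} - m\,\bar{x} \\[4pt]
\frac{\partial\,\mathrm{SSE}}{\partial m} = 0
  &\;\Longrightarrow\; \sum_{i=1}^{n} x_i \bigl(y_i - b - m x_i\bigr) = 0 \\[4pt]
\text{substituting } b:\quad
m &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
\end{align*}
```

The first condition says that the errors sum to zero, which is equivalent to the line passing through the point ((ave of x),(ave of y)). The last step uses the fact that deviations from an average sum to zero, so replacing x_i by (x_i - ave of x) in the outer factor changes neither the numerator nor the denominator.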

Use your computer algebra system to find the slope and intercept of the least squares line for Test 2 scores versus Test 1 scores.

Compute the SSE for this line and compare it to the SSE values you computed for the lines above. How close to the least squares line were your guesses?
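As a rough illustration of these two steps outside a computer algebra system, the following Python sketch computes the least squares slope and intercept from the standard closed-form expressions and compares the resulting SSE with that of a guessed line. The scores are hypothetical values invented for illustration, and the function names are assumptions of this sketch.

```python
# Hypothetical scores, invented for illustration (not the module's data).
test1 = [62, 70, 75, 81, 88, 94]
test2 = [60, 68, 69, 76, 79, 87]

def least_squares_line(xs, ys):
    """Return (intercept, slope) of the line minimizing the sum of squared errors."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar  # puts the line through (x_bar, y_bar)
    return intercept, slope

def sse(xs, ys, intercept, slope):
    """Sum of squared vertical errors for the line y = intercept + slope*x."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

intercept, slope = least_squares_line(test1, test2)
best_sse = sse(test1, test2, intercept, slope)
guess_sse = sse(test1, test2, 10, 0.8)

print(f"least squares line: y = {intercept:.3f} + {slope:.4f}x")
print(f"SSE of least squares line: {best_sse:.3f}")
print(f"SSE of the guess y = 10 + 0.8x: {guess_sse:.3f}")

# Nudging the slope or intercept in either direction should never lower the SSE.
for d_int, d_slope in [(0.5, 0), (-0.5, 0), (0, 0.01), (0, -0.01)]:
    nudged = sse(test1, test2, intercept + d_int, slope + d_slope)
    print(f"nudge ({d_int:+}, {d_slope:+}): SSE = {nudged:.3f}")
```

Python 3.10 and later also provide `statistics.linear_regression` in the standard library, which can serve as an independent check on the slope and intercept.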

Use your computer algebra system to find the least squares line for Test 3 scores versus Test 2 scores. Why is the slope of this line so much smaller than the slope of the least squares line for Test 2 scores versus Test 1 scores?

Compute the SSE for the least squares line for Test 3 scores versus Test 2 scores. How does it compare with the SSE for the least squares line for Test 2 scores versus Test 1 scores? Explain why this makes sense given the two scatter plots.
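One way to explore this comparison numerically is to relate the SSE of the least squares line to the correlation coefficient r from Part 3. For the least squares line, the SSE equals (1 - r^2) times the sum of squared deviations of the y-values from their average, so tightly clustered data (r near +1 or -1) give a small SSE relative to the spread in y. The Python sketch below checks this on two hypothetical data sets invented for illustration; the names `summary`, `tight_ys`, and `loose_ys` are assumptions of this sketch.

```python
import math

def summary(xs, ys):
    """Return (r, SSE of the least squares line, sum of squared y-deviations)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    r = sxy / math.sqrt(sxx * syy)
    return r, sse, syy

# Hypothetical data: the same x-values paired with a tightly clustered
# and a loosely scattered set of y-values.
xs = [62, 70, 75, 81, 88, 94]
tight_ys = [60, 68, 69, 76, 79, 87]
loose_ys = [75, 58, 83, 64, 80, 71]

for label, ys in [("tight", tight_ys), ("loose", loose_ys)]:
    r, sse, syy = summary(xs, ys)
    predicted = (1 - r * r) * syy
    print(f"{label}: r = {r:.3f}, SSE = {sse:.3f}, (1 - r^2) * Syy = {predicted:.3f}")
```

In this sketch the two y-lists have a similar overall spread, so the difference in SSE comes almost entirely from the difference in r.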