E-Example 7.4: Understanding of the Least Squares Regression Line

Understanding the Least-Squares Regression Line with a Visual Model: Measuring Error in a Linear Model

This example allows students to explore three methods for measuring how well a linear model fits a set of data points. The Data Analysis and Probability Standard calls for students to explore how residuals (the differences between observed and predicted values) may be used to measure the "goodness of fit" of a linear model. In this example, two of the methods use residuals, which are measured vertically from each data point to the line, and the third uses the shortest (perpendicular) distance between a data point and the line given by the model. To introduce the idea of a measure of fit, in the first three tasks a line is given and the students explore the effects that six data points have on three measures of error. However, it rarely happens that the model is known and the data are not. Generally, we know the data and need to find a linear model. The additional tasks provide an opportunity to suggest and evaluate a variety of linear models and methods for a particular set of data.

Tasks

In this task a linear equation is used to model a set of data. By modifying the data points, explore how each of three methods—distance squared, absolute value, and shortest distance—measures how well the model approximates the data. How do individual data points contribute to the error? How do these contributions differ among the three methods of measuring the "goodness of fit"?
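
For readers without access to the interactive figure, the three measures can be approximated in a few lines of Python. This is a minimal sketch, not the applet's own code: it assumes each method adds up one contribution per data point, that "distance squared" and "absolute value" refer to the vertical residual, and that "shortest distance" is the perpendicular distance to the line. The function name `error_measures` and the sample points are hypothetical.

```python
import math

def error_measures(points, slope, intercept):
    """Return (sum of squared residuals, sum of absolute residuals,
    sum of shortest distances) for the line y = slope*x + intercept."""
    squared = absolute = shortest = 0.0
    for x, y in points:
        # Residual: observed y minus the y predicted by the line
        residual = y - (slope * x + intercept)
        squared += residual ** 2
        absolute += abs(residual)
        # Perpendicular distance from (x, y) to the line
        shortest += abs(residual) / math.sqrt(1 + slope ** 2)
    return squared, absolute, shortest

# Hypothetical data: five points close to y = x and one far from it
points = [(1, 1.2), (2, 1.9), (3, 3.1), (4, 4.0), (5, 4.8), (6, 10.0)]
sq, ab, sh = error_measures(points, slope=1.0, intercept=0.0)
print(f"squared: {sq:.2f}  absolute: {ab:.2f}  shortest: {sh:.2f}")
```

Moving the last point closer to or farther from the line and rerunning the sketch shows how much of each total comes from that single point.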

How do the three methods compare when one of the points is far from the line and the rest of the points are quite close?
For at least four different sets of data points, record the error measured by the absolute-value and shortest-distance methods. Be sure to use data sets that are quite different from one another in the number of points that are close to and far from the line. What relationships do you notice among the errors? (Hint: For each data set, try doing some arithmetic with the errors measured by the two methods.)
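
Teachers may want to have the following relationship in mind while students work on the hint. It is a sketch under stated assumptions: the line is written as y = mx + b, r_i denotes the residual of the i-th point, and d_i denotes its shortest (perpendicular) distance to the line. These symbols are introduced here for illustration and do not appear in the interactive figure.

```latex
r_i = y_i - (m x_i + b), \qquad
d_i = \frac{|r_i|}{\sqrt{1 + m^2}}
\quad\Longrightarrow\quad
\sum_i d_i = \frac{1}{\sqrt{1 + m^2}} \sum_i |r_i|
```

For a fixed line, then, dividing the absolute-value error by the shortest-distance error gives the same number, \sqrt{1 + m^2}, whatever the data set.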

Additional Tasks

For the given data set and for each of the measures of error, find a line (a linear model) for which the error is as small as possible. Try various slopes and various y-intercepts before you settle on your line of "best fit." For each method, record the equation of the line.

Change the data set so that all the points except one lie in a line. Again find a line of best fit for each of the methods.
Change the data set so that the points follow a curve, and find the line of best fit for each of the methods. Is there only one line of best fit for each case?
Change the data set so that the points appear to follow no particular pattern, and find the line of best fit. Is there only one line of best fit for each case?
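
The trial-and-error search described in these tasks can be imitated with a coarse grid search. The sketch below is one possible approach, not the method used by the interactive figure: it assumes the same three measures as above, tries slopes and y-intercepts in steps of 0.1, and keeps the pair with the smallest error. The names `total_error` and `grid_search`, the grid ranges, and the data set (five points on y = 2x + 1 plus one point off the line) are all hypothetical.

```python
import math

def total_error(points, slope, intercept, method):
    """Total error of y = slope*x + intercept under the named method."""
    total = 0.0
    for x, y in points:
        r = abs(y - (slope * x + intercept))  # size of the vertical residual
        if method == "squared":
            total += r * r
        elif method == "absolute":
            total += r
        elif method == "shortest":
            total += r / math.sqrt(1 + slope * slope)
        else:
            raise ValueError(f"unknown method: {method}")
    return total

def grid_search(points, method, slopes, intercepts):
    """Try every (slope, intercept) pair; return (error, slope, intercept)
    for the pair with the smallest error."""
    best = None
    for m in slopes:
        for b in intercepts:
            err = total_error(points, m, b, method)
            if best is None or err < best[0]:
                best = (err, m, b)
    return best

# Hypothetical data: all points except (5, 3) lie on y = 2x + 1
points = [(0, 1), (1, 3), (2, 5), (3, 7), (4, 9), (5, 3)]
slopes = [i / 10 for i in range(-50, 51)]        # -5.0 to 5.0
intercepts = [i / 10 for i in range(-100, 101)]  # -10.0 to 10.0

for method in ("squared", "absolute", "shortest"):
    err, m, b = grid_search(points, method, slopes, intercepts)
    print(f"{method:>8}: y = {m:.1f}x + {b:.1f}  (error {err:.3f})")
```

A grid search only finds the best line among the candidates it tries, so a finer grid or a wider range may give a slightly different answer. It also reports only the first best pair it meets, so ties, which bear on the question of whether there is only one line of best fit, need to be checked separately.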

Discussion

Students should have experience graphing data generated by linear situations and writing equations for the lines that pass through such data points. Finding equations is relatively straightforward when the data all lie on a line. When the data are only approximately linear, however, no line will fit the data exactly and students must decide from among many possible linear models. This situation often arises when data come from real contexts and a model is desired from which predictions can be made. Before students engage with these interactive examples, they should be given a set of data that is somewhat, but not exactly, linear and asked to plot a line that they think fits the data well. They should be asked to defend their choice of linear model. Some might argue that their line is a good fit because it "passes through" many of the points. Others might argue that fitting well means that it is "closest to the most points" or that it is "in the middle of the points." Students could be asked to define statements such as "closest to the most points" numerically and to quantify their reasoning in other ways so that the effectiveness of two proposed models can be compared.

Given a set of bivariate data, graphing calculators or spreadsheets may be used to find the least-squares regression line for the data set. This investigation may be used to help students develop an understanding that there is more than one way to define the "line of best fit" and to help them develop meaning for the approach they are most likely to encounter: the method of "least squares." The least-squares regression line minimizes the sum of the squares of the residuals, a criterion not often suggested by students. The interactive figure above provides a visual model for the sum of the squares and prepares students to approximate a least-squares regression line in subsequent examples.
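
What a calculator or spreadsheet reports as the least-squares regression line can be reproduced from the standard textbook formulas for the slope and y-intercept. The sketch below assumes those formulas and plain Python. The function name `least_squares_line` and the sample data are hypothetical.

```python
def least_squares_line(points):
    """Return (slope, intercept) of the line that minimizes the sum of
    squared residuals for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Spread of the x-values and joint variation of x and y about their means
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0:
        raise ValueError("all x-values are equal; the slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # the line passes through (mean_x, mean_y)
    return slope, intercept

# Hypothetical data that is roughly, but not exactly, linear
points = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.1)]
slope, intercept = least_squares_line(points)
print(f"y = {slope:.3f}x + {intercept:.3f}")
```

Students can compare the line returned here with the line they settle on by eye or by trial and error using the sum-of-squares measure.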

Take Time to Reflect

What do you learn about how well the linear model shown represents the data as points move farther from the line vertically? As they move parallel to the line? As they move horizontally across the line?
How do outliers, points that are significantly distant from the other data points, affect the selection and evaluation of a linear model? How does this effect vary among the methods for determining the line of best fit?

For a fixed data set, determine the ratio of the minimum sum of squares to the sum of squares when the line is horizontal and passes through the average of the y-values. What could this ratio tell you about the set of data?
What other criteria might be used to assess the goodness of fit of a linear model?
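
The ratio described above can be computed directly. The sketch below assumes that the minimum sum of squares is the one achieved by the least-squares regression line, and it uses the standard formulas for that line. The function name `sum_of_squares_ratio` and the sample data are hypothetical. For a line fitted this way, the ratio equals 1 - r², where r is the correlation coefficient, so values near 0 suggest a strongly linear data set and values near 1 suggest that the line explains little more than the mean of the y-values does.

```python
def sum_of_squares_ratio(points):
    """Ratio of the least-squares line's sum of squared residuals to the
    sum of squares about the horizontal line y = mean of the y-values."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    # Sum of squares for the horizontal line through the mean y-value
    total = sum((y - mean_y) ** 2 for _, y in points)
    if sxx == 0 or total == 0:
        raise ValueError("x-values and y-values must each vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Sum of squares for the least-squares line
    minimum = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
    return minimum / total

# Hypothetical data that is roughly, but not exactly, linear
points = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.1)]
print(f"ratio = {sum_of_squares_ratio(points):.4f}")
```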

The absolute-value and shortest-distance methods should give the same line of best fit. Why? How would you characterize the difference between the two methods?
