Understanding the Least-Squares Regression Line with a Visual Model:
Measuring Error in a Linear Model
This example allows
students to explore three methods for measuring how well a linear model
fits a set of data points. The Data
Analysis and Probability Standard calls for students to explore how
residuals (the differences between observed and predicted values) may
be used to measure the "goodness of fit" of a linear model. In this example,
two of the methods use residuals and the third uses the shortest distance
between a data point and the line given by the model. To introduce the
idea of a measure of fit, in the first three tasks a line is given and
the students explore the effects that six data points have on three measures
of error. However, it rarely happens that the model is known and the data
are not. Generally, we know the data and need to find a linear model.
The additional tasks provide an opportunity to suggest and evaluate a
variety of linear models and methods for a particular set of data.
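The three measures of error described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `errors` and the six sample points are assumptions made for this example, not part of the interactive figure. The shortest distance is computed with the standard point-to-line distance formula for the line y = mx + b.

```python
import math

def errors(points, m, b):
    """Return three measures of error for the line y = m*x + b:
    (sum of squared residuals, sum of absolute residuals,
     sum of shortest distances from each point to the line)."""
    squared = sum((y - (m * x + b)) ** 2 for x, y in points)
    absolute = sum(abs(y - (m * x + b)) for x, y in points)
    # Perpendicular distance from (x, y) to the line m*x - y + b = 0.
    shortest = sum(abs(m * x - y + b) / math.sqrt(m * m + 1) for x, y in points)
    return squared, absolute, shortest

# Hypothetical data: six points scattered around the line y = x.
points = [(0, 1), (1, 0), (2, 3), (3, 2), (4, 5), (5, 4)]
squared, absolute, shortest = errors(points, m=1.0, b=0.0)
print(squared, absolute, shortest)
```

Changing individual points in `points` and rerunning the sketch mirrors what the tasks below ask students to do by dragging points in the figure.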
Tasks
- In this task a linear
equation is used to model a set of data. By modifying the data points,
explore how each of three methods (distance squared, absolute value,
and shortest distance) measures how well the model approximates
the data. How do individual data points contribute to the error? How
do these contributions differ among the three methods of measuring the
"goodness of fit"?
- How
do the three methods compare when one of the points is far from the
line and the rest of the points are quite close?
- For at least four
different sets of data points, record the error measured by the absolute-value
and shortest-distance methods. Be sure to use data sets that are quite
different from one another in the number of points that are close to
and far from the line. What relationships do you notice among the errors?
(Hint: For each data set, try doing some arithmetic with the
errors measured by the two methods.)
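One way to explore how a single distant point contributes to the error is to compute its share of the total under different measures. The sketch below assumes a hypothetical list of vertical residuals (five points near the line and one far from it); the function name `outlier_share` is made up for this illustration.

```python
def outlier_share(residuals):
    """Fraction of the total error contributed by the largest residual,
    under the squared and the absolute-value measures."""
    largest = max(abs(r) for r in residuals)
    squared_share = largest ** 2 / sum(r * r for r in residuals)
    absolute_share = largest / sum(abs(r) for r in residuals)
    return squared_share, absolute_share

# Hypothetical vertical residuals: five points near the line, one far away.
residuals = [0.5, -0.5, 0.5, -0.5, 0.5, 5.0]
squared_share, absolute_share = outlier_share(residuals)
print(squared_share, absolute_share)
```

With these assumed residuals, the distant point accounts for about 95 percent of the squared error but only about 67 percent of the absolute-value error, which suggests why the squared measure is more sensitive to a single far-off point.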
Additional Tasks
- For the given data set
and for each of the measures of error, find a line (a linear model) for
which the error is as small as possible. Try various slopes and various
y-intercepts before you settle on your line of "best fit." For
each method, record the equation of the line.
- Change the data set
so that all the points except one lie in a line. Again find a line of
best fit for each of the methods.
- Change the data set
so that the points follow a curve, and find the line of best fit for
each of the methods. Is there only one line of best fit for each case?
- Change the data set
so that the points appear to follow no particular pattern, and find
the line of best fit. Is there only one line of best fit for each case?
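The trial-and-error search in these tasks can be imitated with a coarse grid search over slopes and y-intercepts. The sketch below is one possible approach under stated assumptions, not the method used by the interactive figure: the data set (five points on y = x plus one point off the line), the grid ranges, the step of 0.1, and the function names are all hypothetical choices for illustration.

```python
import math

def total_error(points, m, b, measure):
    """Total error of the line y = m*x + b under the named measure."""
    residuals = [y - (m * x + b) for x, y in points]
    if measure == "squared":
        return sum(r * r for r in residuals)
    if measure == "absolute":
        return sum(abs(r) for r in residuals)
    if measure == "shortest":
        # Each perpendicular distance is |residual| / sqrt(1 + m^2).
        return sum(abs(r) for r in residuals) / math.sqrt(1 + m * m)
    raise ValueError(f"unknown measure: {measure}")

def best_line_on_grid(points, measure):
    """Try slopes -3.0..3.0 and intercepts -5.0..5.0 in steps of 0.1
    and return the (slope, intercept) pair with the smallest error."""
    slopes = [i / 10 for i in range(-30, 31)]
    intercepts = [j / 10 for j in range(-50, 51)]
    candidates = [(m, b) for m in slopes for b in intercepts]
    return min(candidates,
               key=lambda line: total_error(points, line[0], line[1], measure))

# Hypothetical data: all points except one lie on the line y = x.
points = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 10)]
best = {measure: best_line_on_grid(points, measure)
        for measure in ("squared", "absolute", "shortest")}
for measure, (m, b) in best.items():
    print(f"{measure}: y = {m}x + {b}")
```

A grid search only approximates the best line to the resolution of the grid, and it reports a single line even when several lines tie, so it does not by itself answer whether the line of best fit is unique.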
Discussion
Students should have
experience graphing data generated by linear situations and writing equations
for the lines that pass through such data points. Finding equations is
relatively straightforward when the data all lie on a line. When the data
are only approximately linear, however, no line will fit the data exactly
and students must decide from among many possible linear models. This
situation often arises when data come from real contexts and a model is
desired from which predictions can be made. Before students engage with
these interactive examples, they should be given a set of data that is
somewhat, but not exactly, linear and asked to plot a line that they think
fits the data well. They should be asked to defend their choice of linear
model. Some might argue that their line is a good fit because it "passes
through" many of the points. Others might argue that fitting well means
that it is "closest to the most points" or that it is "in the middle of
the points." Students could be asked to define statements such as
"closest to the most points" numerically and to quantify their reasoning
in other ways so that the effectiveness of two proposed models can be
compared.
Given
a set of bivariate data, graphing calculators or spreadsheets may be used
to find the least-squares regression line for the data set. This investigation
may be used to help students develop an understanding that there is more
than one way to define the "line of best fit" and to help them develop
meaning for the approach they are most likely to encounter: the method
of "least squares." The least-squares regression line minimizes the
sum of the squares of the residuals, a criterion not often suggested by
students. The interactive figure above provides a visual model for the
sum of the squares and prepares students to approximate a least-squares
regression line in subsequent examples.
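For reference, the least-squares regression line that calculators and spreadsheets report can be computed directly from the standard closed-form formulas for the slope and intercept. The sketch below uses a small hypothetical data set; the function name `least_squares_line` is an assumption made for this example.

```python
def least_squares_line(points):
    """Return (slope, intercept) of the line that minimizes the sum of
    squared vertical residuals. Assumes the x-values are not all equal."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data set that is roughly, but not exactly, linear.
points = [(1, 2), (2, 3), (3, 5), (4, 4)]
slope, intercept = least_squares_line(points)
print(f"y = {slope}x + {intercept}")
```

Students can compare this line with the one they settled on by eye to see how closely their own criterion matches the least-squares criterion.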
Take Time to Reflect
- What do you
learn about how well the linear model shown represents the data
as points move farther from the line vertically? As they move
parallel to the line? As they move horizontally across the line?
- How do outliers,
points that are significantly distant from the other data points,
affect the selection and evaluation of a linear model? How does
this effect vary among the methods for determining the line
of best fit?
- For a fixed
data set, determine the ratio of the minimum sum of squares
to the sum of squares when the line is horizontal and passes
through the average of the y-values. What could this
ratio tell you about the set of data?
- What other
criteria might be used to assess the goodness of fit of a linear
model?
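- For any fixed
line, the errors measured by the absolute-value and shortest-distance
methods differ only by a factor that depends on the slope. Why? Do the
two methods therefore always give the same line of best fit? How would
you characterize the difference between the two methods?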
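The ratio described in the third reflection question can be computed as follows. This is a sketch on a hypothetical data set, and the function name `fit_ratio` is an assumption. The ratio lies between 0 and 1 and equals 1 - r^2, where r is the correlation coefficient, so values near 0 indicate that the least-squares line accounts for most of the variation in the y-values.

```python
def fit_ratio(points):
    """Ratio of the minimum sum of squares (least-squares line) to the sum
    of squares about the horizontal line through the mean of the y-values.
    Assumes the x-values are not all equal and the y-values are not all equal."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse_line = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
    sse_mean = sum((y - mean_y) ** 2 for _, y in points)
    return sse_line / sse_mean

# Hypothetical data set that is roughly, but not exactly, linear.
points = [(1, 2), (2, 3), (3, 5), (4, 4)]
ratio = fit_ratio(points)
print(ratio)
```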