Generating Data
Researchers employ two ways of generating data: the observational study and the randomized experiment. In either, the researcher studies one or more populations; a population is a collection of experimental units or subjects about which the researcher wishes to draw a conclusion. Since examining every subject of a population is usually not feasible, a subset of the population is chosen for examination; such a subset is called a sample.
The sample or samples are assigned to groups, each of which receives a treatment; a treatment is a procedure, active or passive, which is applied to each member of an experimental group and which produces data.
· An observational study is characterized by group assignments’ being beyond the control of the researcher.
· A randomized experiment is characterized by assigning experimental units from the sample(s) to the groups by chance.
Since learning is best accomplished by means of examples, we offer as an illustration of observational study an investigation into the relationship between island area and number of species.
For randomized experiment the example is a study of the sexual preferences of female swordtail fish.
For an observational study to permit valid inference about the population, the sample should be chosen by chance. The simplest way of doing this is the simple random sample. A simple random sample of a given size is one chosen so that every subset of the population of that size has the same probability of being selected; as a consequence, each member of the population has the same probability of being included in the sample. For large finite populations one way of choosing a random sample is to label each member of the population with a number containing the same number of digits. For example, if the population contains between 100 and 999 members, we label its members 001, 002, 003, and so on up to the population size. Then use a random number generating device such as a hand calculator, statistics software, or a table of random numbers (found in most statistics textbooks) to choose a sample of the required size. Again we stress: the sample must be random before a valid statistical inference can be drawn about the population.
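The labeling procedure just described can be sketched in Python. This is only an illustration: the function name simple_random_sample, the population size of 250, and the seed are assumptions made for the example, not part of the original procedure. The standard library call random.sample draws without replacement, so every subset of the requested size is equally likely.

```python
import random


def simple_random_sample(population_size, sample_size, seed=None):
    """Return the labels of a simple random sample, zero-padded to equal width.

    Hypothetical helper for illustration; members are assumed to be
    labeled 1, 2, ..., population_size.
    """
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    width = len(str(population_size))  # e.g. 3 digits for 100 to 999 members
    chosen = rng.sample(range(1, population_size + 1), sample_size)
    return [str(label).zfill(width) for label in sorted(chosen)]


# Illustrative values only: a population of 250 members, a sample of 10.
sample = simple_random_sample(population_size=250, sample_size=10, seed=1)
print(sample)
```

The members carrying the printed labels form the sample; changing or omitting the seed gives a different, equally valid draw.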
For randomized experiments, chance plays its role in assigning the subjects from the random sample(s) to the treatment groups. If there are three groups, for example, then assign the digits 1, 2, and 3 randomly to the subjects to determine the group assignments. Once again, no valid inference is possible unless the group assignments are random.
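Random assignment to groups can be sketched as follows. The sketch assumes a balanced variant, in which the subjects are shuffled and then dealt to groups 1, 2, and 3 in turn so that the group sizes differ by at most one; the function name assign_to_groups and the subject labels are hypothetical.

```python
import random


def assign_to_groups(subjects, n_groups, seed=None):
    """Randomly assign subjects to groups numbered 1..n_groups.

    Hypothetical helper: shuffles the subjects, then deals them out in turn,
    which keeps the group sizes as equal as possible.
    """
    rng = random.Random(seed)
    shuffled = list(subjects)  # copy, so the caller's list is not reordered
    rng.shuffle(shuffled)
    groups = {g: [] for g in range(1, n_groups + 1)}
    for position, subject in enumerate(shuffled):
        groups[position % n_groups + 1].append(subject)
    return groups


# Illustrative subject labels only.
subjects = ["A", "B", "C", "D", "E", "F", "G", "H", "I"]
groups = assign_to_groups(subjects, n_groups=3, seed=7)
for number, members in groups.items():
    print(number, members)
```

If balanced group sizes are not required, drawing a digit independently for each subject with rng.randint(1, n_groups) is an alternative.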
The main practical distinction between these two methods of generating data is that valid cause/effect relationships can be deduced from randomized experiments, but not from observational studies. Observational studies yield valid inferences about associations in populations and can suggest randomized experiments for further research.
In simple linear regression one explanatory variable X is related to the mean of one response variable Y by a linear equation:
μ[Y|X] = β0 + β1X
Here the left side of the equation denotes the population mean of Y as a function of X.
To begin a regression analysis, first draw a scatterplot of the bivariate data from the sample. Inspect whether or not the data seem to be roughly linear in arrangement. As an example, we first inspect the data from the island area/number of species study. To form the scatterplot: Graph > Plot. A dialog box then appears, which should be completed as shown. Then click OK.
The data do not seem promising with respect to clustering about a straight line. Later we shall learn how to transform the data values so that they acquire pronounced linearity.
Our task now is to fit a line to these data, even though the fitted line will not be useful in this case because of the data’s non-linearity. The criterion for the line of best fit is that the line should minimize the sum of the squares of the deviations. (The deviation of a datum from a line is the difference between the y-coordinates of the datum and of the point on the line directly above or below it.) By calculus or by linear algebra the slope and the y-intercept of this line are deduced, and the line is shown to be unique. For any set of bivariate data whose x-values are not all equal there exists a line of best fit (in the least squares sense).
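For readers who want to see the computation that the software performs, here is a minimal Python sketch of the standard closed-form least squares solution: the slope is the sum of products of deviations from the means divided by the sum of squared x-deviations, and the intercept makes the line pass through the point of means. The function name and the five data points are invented for illustration; they are not the island or swordtail data.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least squares line for paired data."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of squared x-deviations and sum of cross-products of deviations.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x-values are equal; the line is not defined")
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar  # the line passes through (x_bar, y_bar)
    return intercept, slope


# Made-up illustrative data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
intercept, slope = least_squares_line(xs, ys)
print(f"fitted line: y = {intercept:.2f} + {slope:.2f} x")
```

For these made-up points the means are 3 and 4, the sums are 10 and 6, and so the slope is 0.6 and the intercept 2.2.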
Minitab can compute the coefficients of this line and plot the line on the bivariate scatterplot. Let us do this for the previous example: Stat > Regression > Fitted Line Plot. Complete the dialog box as shown:
Read the equation of the line of regression directly under the title Regression Plot. R-squared gives the proportion of the variability of Y that is due to the linear regression model.
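R-squared can also be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares about the mean of Y. The Python sketch below assumes the line has already been fitted by least squares; the function name, the data, and the coefficients 2.2 and 0.6 are illustrative values, not output from the studies discussed here.

```python
def r_squared(xs, ys, intercept, slope):
    """Proportion of the variability of Y explained by the fitted line."""
    y_bar = sum(ys) / len(ys)
    # Residual sum of squares: squared deviations of the data from the line.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # Total sum of squares: squared deviations of the data from the mean of Y.
    sst = sum((y - y_bar) ** 2 for y in ys)
    if sst == 0:
        raise ValueError("all y-values are equal; R-squared is not defined")
    return 1.0 - sse / sst


# Made-up data with its least squares line y = 2.2 + 0.6 x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
r2 = r_squared(xs, ys, intercept=2.2, slope=0.6)
print(f"R-squared = {r2:.3f}")
```

Here the residual sum of squares is 2.4 and the total sum of squares is 6, so R-squared is 0.6, which a statistics package would typically report as 60%.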
EXERCISE: Learn what S means by using Minitab’s Help: Help > Search the StatGuide. Enter S, then scroll in the lowest window (3) in the dialog box until you reach S, then click Display.
You might want to dispense with the preliminary scatterplot if your intent is to eventually plot the line of regression.
EXERCISE: For the data in Preferences of Swordtail Fish prepare a scatterplot of the data without the line of regression, then a scatterplot with the line of regression. Answer.
In the exercise you noticed that the data fit the line much better than did the data from the island/species study. One way of measuring the tendency towards linearity is the correlation coefficient. Minitab computes this for us too: Stat > Basic Statistics > Correlation. Complete the dialog box thus, then click OK:
Because the correlation coefficient is close to zero, the fit of the regression line is inadequate, even though this is the line of best fit; the data are simply not linear. For any set of bivariate data the correlation coefficient can be computed.
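The correlation coefficient can likewise be computed directly from its definition: the sum of cross-products of deviations divided by the square root of the product of the two sums of squared deviations. The following Python sketch uses a hypothetical function name and made-up data; note that the coefficient is undefined when either variable is constant.

```python
import math


def correlation(xs, ys):
    """Pearson correlation coefficient of paired data."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("a constant variable has no defined correlation")
    return sxy / math.sqrt(sxx * syy)


# Made-up illustrative data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
r = correlation(xs, ys)
print(f"r = {r:.4f}")
```

The value always lies between -1 and 1, and in simple linear regression its square equals R-squared.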
EXERCISE: Find the correlation coefficient for the data in Preferences of Swordtail Fish. Answer
Notice how the larger magnitude of the correlation coefficient reflects the linear disposition of the data.
Often data that are not of linear configuration can be made so by suitably transforming one or both variables. For example, the data of the island area/number of species study are not linear, so let us attempt to transform the variables.