Correlation and Regression

Andrew F. Siegel, in Practical Business Statistics (Seventh Edition), 2016

The Standard Error of Estimate: How Large Are the Prediction Errors?

The standard error of estimate, denoted S_e here (but often denoted S in computer printouts), tells you approximately how large the prediction errors (residuals) are for your data set, in the same units as Y. How well can you predict Y? The answer is, to within about S_e above or below. Since you usually want your forecasts and predictions to be as accurate as possible, you would be glad to find a small value for S_e. You can interpret S_e as a standard deviation in the sense that, if you have a normal distribution for the prediction errors, then you will expect about two-thirds of the data points to fall within a distance S_e either above or below the regression line. Also, about 95% of the data values should fall within 2S_e, and so forth. This is illustrated in Fig. 11.2.10 for the production cost example.

Fig. 11.2.10. The standard error of estimate, S_e, indicates approximately how much error you make when you use the predicted value for Y (on the least-squares line) instead of the actual value of Y. You may expect about two-thirds of the data points to be within S_e above or below the least-squares line for a data set with a normal linear relationship, such as this one.

The standard error of estimate may be found using the following formulas:

Standard Error of Estimate

S_e = S_Y \sqrt{(1 - r^2)\,\frac{n-1}{n-2}} \quad \text{(for computation)} = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n}\left[Y_i - (a + bX_i)\right]^2} \quad \text{(for interpretation)}

The first formula shows how S_e is computed by reducing S_Y according to the correlation and sample size. Indeed, S_e will usually be smaller than S_Y because the line a + bX summarizes the relationship and therefore comes closer to the Y values than does the simpler summary, Ȳ. The second formula shows how S_e can be interpreted as the estimated standard deviation of the residuals: The squared prediction errors are averaged by dividing by n − 2 (the appropriate number of degrees of freedom when two numbers, a and b, have been estimated), and the square root undoes the earlier squaring, giving you an answer in the same measurement units as Y.

For the production cost data, the correlation was found to be r = 0.869193, the variability in the individual cost numbers is S_Y = $389.6131, and the sample size is n = 18. The standard error of estimate is therefore

S_e = S_Y \sqrt{(1 - r^2)\,\frac{n-1}{n-2}} = 389.6131 \sqrt{(1 - 0.869193^2)\,\frac{18-1}{18-2}} = 389.6131 \sqrt{0.244503 \times \frac{17}{16}} = 389.6131 \sqrt{0.259785} = \$198.58

This tells you that, for a typical week, the actual cost was different from the predicted cost (on the least-squares line) by about $198.58. Although the least-squares prediction line takes full advantage of the relationship between cost and number produced, the predictions are far from perfect.
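As a quick check of the arithmetic, the computational formula can be evaluated directly. The short Python sketch below is ours, not the text's; it simply plugs in the values of r, S_Y, and n reported above.

```python
import math

r = 0.869193    # correlation between cost and number produced
s_y = 389.6131  # standard deviation of the weekly costs (dollars)
n = 18          # number of weeks

# Computational formula: S_e = S_Y * sqrt((1 - r^2) * (n - 1) / (n - 2))
s_e = s_y * math.sqrt((1 - r**2) * (n - 1) / (n - 2))
print(f"Standard error of estimate: ${s_e:,.2f}")  # about $198.58
```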

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780128042502000110

Detection of a Trend in Population Estimates

William L. Thompson, ... Charles Gowan, in Monitoring Vertebrate Populations, 1998

5.2 VARIANCE COMPONENTS

In this section, we discuss the sources of variation that must be considered when making inferences from data to detect trends. Three sources must be considered: sampling variation, temporal variation in the population dynamics process, and spatial variation in the dynamics of the population across space. The latter two sources often are referred to as process variation, i.e., variation in the population dynamics process associated with environmental variation (such as rainfall, temperature, community succession, fires, or elevation). Methods to separate process variation from sampling variation will be presented.

Detection of a trend in a population's size requires at least two abundance estimates. For example, if the population size of Mexican spotted owls in Mesa Verde National Park is determined as 50 pairs in 1990, and as only 10 pairs in 1995, we would be concerned that a significant negative trend in the population exists during this time period, and that action must be taken to alleviate the trend. However, if the 1995 estimate was 40 pairs, we might still be concerned, but would be less confident that immediate action is required. Two sources of variation must be assessed before we are confident of our inference from these estimates.

The first source of variation is the uncertainty we have in our population estimates. We want to be sure that the two estimates are different, i.e., that the difference between the two estimates is greater than would be expected from chance alone because of the sampling errors associated with each estimate. Typically, we present our uncertainty in an estimate as its variance, and use this variance to generate a confidence interval for the estimate. Suppose that the 1990 estimate of N̂_90 = 50 pairs has a sampling variance of Vâr(N̂_90) = 25. Then, under the assumption of the estimate being normally distributed with a large sample size (i.e., large degrees of freedom), we would compute a 95% confidence interval as 50 ± 1.96√25, or 40.2–59.8. If the 1995 estimate was N̂_95 = 40 with a sampling variance of Vâr(N̂_95) = 20, then the 95% confidence interval for this estimate is 40 ± 1.96√20, or 31.2–48.8. Based on the overlap of the two confidence intervals (Fig. 5.2), we would conclude that these two estimates probably do not differ by more than expected from chance alone. We also could compute a simple test as

Figure 5.2. The 95% confidence intervals plotted with the 1990 and 1995 population estimates.

(5.3) z = \frac{\hat{N}_{90} - \hat{N}_{95}}{\sqrt{\widehat{\operatorname{Var}}(\hat{N}_{90}) + \widehat{\operatorname{Var}}(\hat{N}_{95})}},

which for this example results in z = 1.491, with a probability of observing a z statistic this large or larger of P = 0.136. Although we might be alarmed, a change this large would be observed about 13.6 times out of 100 by random chance alone.
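The z statistic and its two-sided P value are easy to reproduce; a minimal sketch (ours) using scipy, with the estimates and sampling variances given above:

```python
from math import sqrt
from scipy.stats import norm

n90, n95 = 50.0, 40.0      # population estimates (pairs)
var90, var95 = 25.0, 20.0  # sampling variances of the estimates

# Eq. (5.3): z test for a difference between the two estimates
z = (n90 - n95) / sqrt(var90 + var95)
p = 2 * (1 - norm.cdf(abs(z)))  # two-sided P value

print(f"z = {z:.3f}, P = {p:.3f}")  # z = 1.491, P = 0.136
```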

A variation of the previous test is commonly conducted for several reasons: (1) we often are interested in the ratio of two population estimates (rather than the difference) because a ratio represents the rate of change of the population, (2) the variance of N̂ is usually linked to its estimate by Vâr(N̂) = N̂C (e.g., Skalski and Robson, 1992, pp. 28–29), and (3) ln(N̂) is more likely to be normally distributed than N̂. Fortuitously, a log transformation provides some correction for all three of the above reasons and results in a more efficient statistical procedure. Because

(5.4) \operatorname{Var}[\ln(\hat{N})] = \frac{\operatorname{Var}(\hat{N})}{\hat{N}^2},

we construct the z test as

(5.5) z = \frac{\ln(\hat{N}_{90}) - \ln(\hat{N}_{95})}{\sqrt{\widehat{\operatorname{Var}}[\ln(\hat{N}_{90})] + \widehat{\operatorname{Var}}[\ln(\hat{N}_{95})]}}

to provide a more efficient (i.e., more powerful) test.
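A sketch (ours) of the log-scale test, using Eq. (5.4) to convert the sampling variances and Eq. (5.5) for the z statistic; for the owl example it gives essentially the same answer as the untransformed test.

```python
from math import log, sqrt
from scipy.stats import norm

n90, n95 = 50.0, 40.0
var90, var95 = 25.0, 20.0

# Eq. (5.4): approximate variance of ln(N-hat)
var_log90 = var90 / n90**2
var_log95 = var95 / n95**2

# Eq. (5.5): z test on the log scale (i.e., on the log of the ratio of estimates)
z = (log(n90) - log(n95)) / sqrt(var_log90 + var_log95)
p = 2 * (1 - norm.cdf(abs(z)))
print(f"z = {z:.3f}, P = {p:.3f}")  # roughly z = 1.49, P = 0.14
```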

Suppose we had made a much more intensive effort in sampling the owl population, so that the sampling variances were one-half of the values observed (which would generally take about 4 times the effort). Thus, Vâr(N̂_90) = 12.5 and Vâr(N̂_95) = 10, giving a z statistic of 2.108 with a probability value of P = 0.035. Now, we would conclude that the owl population was lower in 1995 than in 1990, and that this difference is unlikely to be due to variation in our samples, i.e., that an actual reduction in population size has taken place.

This leads us to the second variance component associated with determining whether a trend in the population is important. We would expect the size of the owl population (and any other population, for that matter) to fluctuate through time. How can we determine if this reduction is important? The answer lies in determining what the variation in the owl population has been for some period of time in the past, and then if the observed reduction is outside the range expected from this past fluctuation. Consider the example in Fig. 5.3, where the true population size (no sampling variation) is plotted. The population fluctuates around a mean of 50, but values more extreme than the range 40 to 60 are common. Note that a decline from 76 to 29 pairs occurred from 1984 to 1985, and that declines from over 50 pairs to under 40 pairs are fairly common occurrences. Thus, based on our previous example, a decline from 50 to 40 is not at all unreasonable given the past population dynamics of this hypothetical population.

Figure 5.3. Actual number of pairs of owls that exist each year. In reality, we never know these values, and can only estimate them.

To determine the level of change in population size that should receive our attention and suggest management action, we need to know something about the temporal variation in the population. The only way to estimate this variance component is to observe the population across a number of years. The exact number of years will depend on the magnitude of the temporal variation. Thus, if the population does not change much from year to year, a few observations will show this consistency. On the other hand, if the population fluctuates a lot, as in Fig. 5.3, many years of observations are needed to estimate the temporal variance. For the example in Fig. 5.3, we could compute the temporal variance as the variance of the 16 annual values. We find a variance of 265.7, or a standard deviation of 16.3 (Example 5.1). With an SD of 16.3, we would expect roughly 95% of the population values to be in the range of ±2 SD of the mean population size. This inference is based on the population being stable, i.e., not having an upward or downward trend, and being roughly normally distributed. For a normal distribution, 95% of the values lie in the interval ±2 SD of the mean. Therefore, a change of 2 SD, or 32.6, is not a particularly big change given the temporal variation observed over the 1980–1995 period. Such a change should occur with probability greater than 1/20, or 0.05.

A complicating problem with estimating the temporal variance of a population's size is that we are seldom allowed to observe the true value of the population size. Rather, we are required to sample the population, and hence only obtain an estimate of the population size each year, with its associated sampling variance. Thus, we would need to include the 95% confidence bars on the annual estimates. As a result of this uncertainty from our sampling procedure, we would conclude that many of the year-to-year changes were not really changes because the estimates were not different. This complication leads to a further problem. If we compute the variance with the usual formula when estimates of population size replace the actual population size shown in Fig. 5.3, we obtain a variance estimate larger than the true temporal variance because our sampling uncertainty is included in the variance. For low levels of sampling effort each year, we would have a high sampling variance associated with each estimate, and as a result, we would have a high variance across years. The noise associated with our low sampling intensity would suggest that the population is fluctuating widely, when in fact the population could be constant (i.e., temporal variance is zero), and the estimated changes in the population are just due to sampling variance.

This mixture of sampling and temporal variation becomes particularly important in population viability analysis (PVA). The objective of a PVA is to estimate the probability of extinction for a population, given current size, and some idea of the variation in the population dynamics (i.e., temporal variation). If our estimate of temporal variation includes sampling variation, and the level of effort to obtain the estimates is relatively low, the high sampling variation causes our naive estimate of temporal variation to be much too large. When we apply our PVA analysis with this inflated estimate of temporal variance, we conclude that the population is much more likely to go extinct than it really is, and hence the importance of separating sampling variation from process variation.

Typically, we estimate variance components with analysis of variance (ANOVA) procedures. For the example considered here, we would have to have at least two estimates of population size for a series of years to obtain valid estimates of sampling and temporal variation. Further, typical ANOVA techniques assume that the sampling variation is constant, and so do not account for differences in levels of effort, or the fact that sampling variance is usually a function of population size. For our example, we have an estimate of sampling variance for each of our estimates, obtained from the population estimation methods considered in this manual. That is, capture–recapture, mark–resight, line transects, removal methods, and quadrat counts all produce estimates of sampling variation. Thus, we do not want to estimate sampling variation by obtaining replicate estimates, but want to use the available estimate. Therefore, we present a method of moments estimator developed in Burnham et al. (1987, Part 5). Skalski and Robson (1992, Chapter 2) also present a similar procedure, but do not develop the weighted estimator presented here.

Example 5.1 Population Size, Estimates, Standard Error of the Estimates, and Confidence Intervals for Owl Pairs in Fig. 5.3

Year  Population  Estimate  Standard error  Lower 95% CI  Upper 95% CI
1980 44 40.04 5.926 28.42 51.66
1981 48 50.51 11.004 28.94 72.08
1982 61 61.36 15.278 31.42 91.31
1983 48 47.6 11.062 25.92 69.28
1984 76 95.51 18.988 58.3 132.72
1985 29 33.81 8.803 16.56 51.06
1986 60 34.39 5.804 23.01 45.76
1987 59 38.52 11.168 16.63 60.41
1988 76 84.57 21.312 42.8 126.34
1989 42 30.04 6.918 16.48 43.6
1990 29 20.29 7.529 5.54 35.05
1991 68 68.42 17.969 33.2 103.64
1992 42 45.51 13.225 19.6 71.44
1993 27 27.01 6.137 14.98 39.04
1994 72 71.12 14.511 42.67 99.56
1995 54 51.45 8.054 35.66 67.24

The variance of the n = 16 true population sizes is 265.628, whereas the variance of the 16 estimates is 450.376. Sampling variation causes the estimates to have a larger variance than the actual population sizes. The difference of these two variances is an estimate of the sampling variation, i.e., 450.376 − 265.628 = 184.748. The square root of 184.748 is 13.592, and is the approximate mean of the 16 reported standard errors.
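The variance decomposition in Example 5.1 can be verified directly from the table; a short numpy sketch (ours), with the two columns of true sizes and estimates typed in from above:

```python
import numpy as np

population = np.array([44, 48, 61, 48, 76, 29, 60, 59, 76, 42,
                       29, 68, 42, 27, 72, 54], dtype=float)
estimates = np.array([40.04, 50.51, 61.36, 47.60, 95.51, 33.81, 34.39, 38.52,
                      84.57, 30.04, 20.29, 68.42, 45.51, 27.01, 71.12, 51.45])

var_pop = population.var(ddof=1)  # temporal variance of the true sizes, about 265.6
var_est = estimates.var(ddof=1)   # variance of the estimates, about 450.4

# The difference estimates the average sampling variance
sampling_var = var_est - var_pop
print(var_pop, var_est, sampling_var, np.sqrt(sampling_var))
```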

To obtain an unbiased estimate of the temporal variance, we must remove the sampling variation from the estimate of the total variance. Define σ²_total as the total variance, estimated from the n = 16 estimates of owl pairs (N̂_i, i = 1980, …, 1995) as

(5.6) \hat{\sigma}^2_{\mathrm{total}} = \frac{\sum_{i=1980}^{1995}\left(\hat{N}_i - \bar{N}\right)^2}{n-1} = \frac{\sum_{i=1980}^{1995}\hat{N}_i^2 - \left(\sum_{i=1980}^{1995}\hat{N}_i\right)^2 / n}{n-1},

where the ˆ symbol indicates the estimate of the parameter. Thus, the N̂_i are the estimates of the actual populations, N_i, and σ̂²_total is an estimate of the total variance σ²_total. For each estimate, N̂_i, we also have an associated sampling variance, σ̂²_i. Then, a simple estimator of the temporal variance, σ²_time, is given by

(5.7) \hat{\sigma}^2_{\mathrm{time}} = \hat{\sigma}^2_{\mathrm{total}} - \frac{\sum_{i=1980}^{1995}\hat{\sigma}_i^2}{n},

when we can assume that all of the sampling variances, σ̂²_i, are equal. The above equation corresponds to Eq. (2.6) of Skalski and Robson (1992). When the σ̂²_i cannot all be assumed to be equal, a more complex calculation is required (Burnham et al., 1987, Section 4.3) because each estimate must be weighted by its sampling variance. We take as the weight of each estimate the reciprocal of the sum of the temporal variance plus the sampling variance, 1/(σ̂²_time + σ̂²_i). That is, Var(N̂_i) = σ̂²_time + σ̂²_i, so w_i = 1/Var(N̂_i) = 1/(σ̂²_time + σ̂²_i). Then, the weighted total variance is computed as

(5.8) \hat{\sigma}^2_{\mathrm{total}} = \frac{\sum_{i=1980}^{1995} w_i\left(\hat{N}_i - \bar{N}\right)^2}{(n-1)\sum_{i=1980}^{1995} w_i}

with the mean of the estimates now computed as a weighted mean,

(5.9) \bar{N} = \frac{\sum_{i=1980}^{1995} w_i \hat{N}_i}{\sum_{i=1980}^{1995} w_i}.

We now know that the theoretical variance of N̄ is

(5.10) \operatorname{Var}(\bar{N}) = \operatorname{Var}\left(\frac{\sum_{i=1980}^{1995} w_i \hat{N}_i}{\sum_{i=1980}^{1995} w_i}\right) = \frac{1}{\sum_{i=1980}^{1995} w_i}

and the empirical variance estimator is Eq. (5.8). Setting these two equations equal,

(5.11) \frac{1}{\sum_{i=1980}^{1995} w_i} = \frac{\sum_{i=1980}^{1995} w_i\left(\hat{N}_i - \bar{N}\right)^2}{(n-1)\sum_{i=1980}^{1995} w_i}

or

(5.12) 1 = \frac{\sum_{i=1980}^{1995} w_i\left(\hat{N}_i - \bar{N}\right)^2}{n-1}.

Because we cannot solve for σ̂²_time directly, we have to use an iterative numerical approach to estimate it. This procedure involves substituting values of σ̂²_time into Eq. (5.12) via the w_i until the two sides are equal. When both sides are the same, we have our estimate of σ̂²_time. Using this estimate, we can now decide what level of change from N̂_i to N̂_i+1 is important and deserves attention. If the change across a series of estimates is greater than 2σ̂_time (i.e., two standard deviations), we may want to take action.
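The iteration is easy to code. Below is a minimal sketch (ours, including the variable names) that solves Eq. (5.12) for σ̂²_time by root finding, taking the squared standard errors from Example 5.1 as the sampling variances σ̂²_i.

```python
import numpy as np
from scipy.optimize import brentq

estimates = np.array([40.04, 50.51, 61.36, 47.60, 95.51, 33.81, 34.39, 38.52,
                      84.57, 30.04, 20.29, 68.42, 45.51, 27.01, 71.12, 51.45])
se = np.array([5.926, 11.004, 15.278, 11.062, 18.988, 8.803, 5.804, 11.168,
               21.312, 6.918, 7.529, 17.969, 13.225, 6.137, 14.511, 8.054])
sampling_var = se**2  # sigma_i^2, the sampling variance of each estimate
n = len(estimates)

def discrepancy(sigma2_time):
    """Difference between the two sides of Eq. (5.12) for a trial sigma^2_time."""
    w = 1.0 / (sigma2_time + sampling_var)     # weights 1 / (sigma^2_time + sigma_i^2)
    n_bar = np.sum(w * estimates) / np.sum(w)  # weighted mean, Eq. (5.9)
    return np.sum(w * (estimates - n_bar) ** 2) / (n - 1) - 1.0

# Find the value of sigma^2_time that makes Eq. (5.12) hold
sigma2_time = brentq(discrepancy, 1e-6, 10 * estimates.var(ddof=1))
print(sigma2_time, np.sqrt(sigma2_time))
```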

Typically, we do not have the luxury of enough background data to estimate σ̂²_time, so we end up trying to evaluate whether a series of estimated population sizes is in fact signaling a decline in the population when both sampling and process variance are present. Note that even if we see the estimates decline for 3–4 consecutive years, we cannot be sure that the population is actually in a serious decline without knowledge of the mean population size and the temporal variation prior to the decline. Usually, however, we do not have good knowledge of the population size prior to some observed decline, and make a decision to act based on biological perceptions. Keep in mind the kinds of trends displayed in Fig. 5.1. Is the suggested trend part of a cycle, or are we observing a real change in population size? In this discussion, we have only considered temporal variation. A similar procedure can be used to separate spatial variation from sampling variation.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780126889604500058

Multiple Regression

Andrew F. Siegel, in Practical Business Statistics (Seventh Edition), 2016

Typical Prediction Error: Standard Error of Estimate

Just as for simple regression, with only one X, the standard error of estimate indicates the approximate size of the prediction errors. For the magazine ads example, S_e = $53,812. This tells you that actual page costs for these magazines are typically within about $53,812 of the predicted page costs, in the sense of a standard deviation. That is, if the error distribution is normal, then you would expect about 2/3 of the actual page costs to be within S_e of the predicted page costs, about 95% to be within 2S_e, and so forth.

The standard error of estimate, S_e = $53,812, indicates the remaining variation in page costs after you have used the X variables (audience, percent male, and median income) in the regression equation to predict page costs for each magazine. Compare this to the ordinary univariate standard deviation, S_Y = $105,639, for the page costs, computed by ignoring all the other variables. This standard deviation, S_Y, indicates the remaining variation in page costs after you have used only Ȳ to predict the page costs for each magazine. Note that S_e = $53,812 is smaller than S_Y = $105,639; your errors are typically smaller if you use the regression equation instead of just Ȳ to predict page costs. This suggests that the X variables are helpful in explaining page costs.

Think of the situation this way. If you knew nothing of the X variables, you would use the average page costs (Ȳ = $160,397) as your best guess, and you would be wrong by about S_Y = $105,639. But if you knew the audience, percent male readership, and median reader income, you could use the regression equation to find a prediction for page costs that would be wrong by only S_e = $53,812. This reduction in prediction error (from $105,639 to $53,812) is one of the helpful payoffs from running a regression analysis.
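The comparison is easy to reproduce in code for any regression; the sketch below (ours) uses simulated data with three predictors as a stand-in for the magazine data, which are not reproduced here, and shows how S_Y and S_e are computed and compared.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-ins for the three predictors and the response
n, k = 55, 3
X = rng.normal(size=(n, k))
y = 2.0 + X @ np.array([3.0, 1.0, 0.5]) + rng.normal(scale=1.5, size=n)

# Ordinary univariate standard deviation of Y (prediction using only the mean)
s_y = y.std(ddof=1)

# Least-squares fit and standard error of estimate S_e = sqrt(SSE / (n - k - 1))
X1 = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
residuals = y - X1 @ coef
s_e = np.sqrt(np.sum(residuals**2) / (n - k - 1))

print(f"S_Y = {s_y:.3f}, S_e = {s_e:.3f}")  # S_e should be noticeably smaller
```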

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780128042502000122

Multiple Regression

Gary Smith, in Essential Statistics, Regression, and Econometrics, 2012

Confidence Intervals for the Coefficients

If the error term is normally distributed and satisfies the four assumptions detailed in the simple regression chapter, the estimators are normally distributed with expected values equal to the parameters they estimate:

a \sim N[\alpha,\ \text{standard deviation of } a] \qquad b_i \sim N[\beta_i,\ \text{standard deviation of } b_i]

To compute the standard errors (the estimated standard deviations) of these estimators, we need to use the standard error of estimate (SEE) to estimate the standard deviation of the error term:

(10.3) \mathrm{SEE} = \sqrt{\frac{\sum\left(Y - \hat{Y}\right)^2}{n - (k + 1)}}

Because n observations are used to estimate k + 1 parameters, we have n − (k + 1) degrees of freedom. After choosing a confidence level, such as 95 percent, we use the t distribution with n − (k + 1) degrees of freedom to determine the value t* that corresponds to this probability. The confidence interval for each coefficient is equal to the estimate plus or minus the requisite number of standard errors:

(10.4) a \pm t^*(\text{standard error of } a) \qquad b_i \pm t^*(\text{standard error of } b_i)

For our consumption function, statistical software calculates SEE = 59.193 and these standard errors:

\text{standard error of } a = 27.327 \qquad \text{standard error of } b_1 = 0.019 \qquad \text{standard error of } b_2 = 0.003

With 49 observations and 2 explanatory variables, we have 49 − (2 + 1) = 46 degrees of freedom. Table A.2 gives t* = 2.013 for a 95 percent confidence interval, so that 95 percent confidence intervals are

\alpha: a \pm t^*(\text{standard error of } a) = 110.126 \pm 2.013(27.327) = 110.126 \pm 55.010
\beta_1: b_1 \pm t^*(\text{standard error of } b_1) = 0.798 \pm 2.013(0.019) = 0.798 \pm 0.039
\beta_2: b_2 \pm t^*(\text{standard error of } b_2) = 0.026 \pm 2.013(0.003) = 0.026 \pm 0.006
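The interval arithmetic can be reproduced from the reported estimates and standard errors; a minimal sketch (ours) using scipy's t distribution. Tiny differences in the last digit can occur because the book's standard errors are rounded.

```python
from scipy.stats import t

n, k = 49, 2
df = n - (k + 1)           # 46 degrees of freedom
t_star = t.ppf(0.975, df)  # about 2.013 for a 95 percent confidence interval

estimates = {"a": 110.126, "b1": 0.798, "b2": 0.026}
std_errors = {"a": 27.327, "b1": 0.019, "b2": 0.003}

for name in estimates:
    half_width = t_star * std_errors[name]
    print(f"{name}: {estimates[name]} +/- {half_width:.3f}")
```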

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780123822215000106

Multiple Regression

Gary Smith, in Essential Statistics, Regression, and Econometrics (Second Edition), 2015

Confidence Intervals for the Coefficients

If the error term is normally distributed and satisfies the four assumptions detailed in the simple regression chapter, the estimators are normally distributed with expected values equal to the parameters they estimate:

a \sim N[\alpha,\ \text{standard deviation of } a] \qquad b_i \sim N[\beta_i,\ \text{standard deviation of } b_i]

To compute the standard errors (the estimated standard deviations) of these estimators, we need to use the standard error of estimate (SEE) to estimate the standard deviation of the error term:

(10.5) \mathrm{SEE} = \sqrt{\frac{\sum\left(y - \hat{y}\right)^2}{n - (k + 1)}}

Because n observations are used to estimate k + 1 parameters, we have n − (k + 1) degrees of freedom. After choosing a confidence level, such as 95 percent, we use the t distribution with n − (k + 1) degrees of freedom to determine the value t that corresponds to this probability. The confidence interval for each coefficient is equal to the estimate plus or minus the requisite number of standard errors:

(10.6) a \pm t(\text{standard error of } a) \qquad b_i \pm t(\text{standard error of } b_i)

For our consumption function, statistical software calculates SEE  =   59.193 and these standard errors:

\text{standard error of } a = 27.327 \qquad \text{standard error of } b_1 = 0.019 \qquad \text{standard error of } b_2 = 0.003

With 49 observations and two explanatory variables, we have 49 − (2 + 1) = 46 degrees of freedom. Table A.2 gives t = 2.013 for a 95 percent confidence interval, so that the 95 percent confidence intervals are:

\alpha: a \pm t(\text{standard error of } a) = 110.126 \pm 2.013(27.327) = 110.126 \pm 55.010
\beta_1: b_1 \pm t(\text{standard error of } b_1) = 0.798 \pm 2.013(0.019) = 0.798 \pm 0.039
\beta_2: b_2 \pm t(\text{standard error of } b_2) = 0.026 \pm 2.013(0.003) = 0.026 \pm 0.006

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780128034590000108

Simple Regression

Gary Smith, in Essential Statistics, Regression, and Econometrics (Second Edition), 2015

Abstract

The simple regression model assumes a linear relationship, Y = α + βX + ε, between a dependent variable Y and an explanatory variable X, with the error term ε encompassing omitted factors. The least squares estimates a and b minimize the sum of squared errors when the fitted line is used to predict the observed values of Y. The standard error of estimate (SEE) is our estimate of the standard deviation of the error term. The standard errors of the estimates a and b can be used to construct confidence intervals for α and β and test null hypotheses, most often that the value of β is zero (Y and X are not linearly related). The coefficient of determination R² compares the model's sum of the squared prediction errors to the sum of the squared deviations of Y about its mean, and can be interpreted as the fraction of the variation in the dependent variable that is explained by the regression model. The correlation coefficient is equal to the square root of R².
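The quantities named in this abstract are straightforward to compute; the sketch below (ours, on simulated data) obtains a, b, the SEE, R², and the correlation for a simple regression.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data for Y = alpha + beta*X + error
x = rng.uniform(0, 10, size=40)
y = 3.0 + 2.0 * x + rng.normal(scale=2.0, size=40)

# Least squares estimates a and b
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

# Standard error of estimate: sqrt(SSE / (n - 2))
residuals = y - (a + b * x)
see = np.sqrt(np.sum(residuals**2) / (len(x) - 2))

# R-squared, and its square root (the magnitude of the correlation coefficient)
r_squared = 1 - np.sum(residuals**2) / np.sum((y - y.mean()) ** 2)
print(a, b, see, r_squared, np.sqrt(r_squared))
```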

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B978012803459000008X

Bootstrap Method

K. Singh, M. Xie, in International Encyclopedia of Education (Third Edition), 2010

Approximating Standard Error of a Sample Estimate

Let us suppose information is sought about a population parameter θ. Suppose θ̂ is a sample estimator of θ based on a random sample of size n; that is, θ̂ is a function of the data (X_1, X_2, …, X_n). In order to estimate the standard error of θ̂ as the sample varies over the class of all possible samples, one has the following simple bootstrap approach.

Compute θ*_1, θ*_2, …, θ*_N, using the same computing formula as the one used for θ̂, but now base it on N different bootstrap samples (each of size n). A crude recommendation for the size N could be N = n² (in our judgment), unless n² is too large. In that case, it could be reduced to an acceptable size, say n log_e n. One defines

\mathrm{SE}_B(\hat{\theta}) = \left[\frac{1}{N}\sum_{i=1}^{N}\left(\theta_i^* - \hat{\theta}\right)^2\right]^{1/2}

following the philosophy of bootstrap: replace the population by the empirical population.

An older resampling technique used for this purpose is the jackknife, though the bootstrap is more widely applicable. The famous example where the jackknife fails while the bootstrap is still useful is that of θ̂ = the sample median.
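As an illustration of the recipe above, the sketch below (ours) approximates the standard error of a sample median, the case just mentioned as problematic for the jackknife. The sample is simulated, and N is kept much smaller than the n² recommendation simply to make the example quick.

```python
import numpy as np

rng = np.random.default_rng(42)

# A simulated sample of size n; theta-hat is its median
x = rng.exponential(scale=2.0, size=50)
theta_hat = np.median(x)

# Draw N bootstrap samples (with replacement) and recompute the median each time
N = 2000
boot_medians = np.array([np.median(rng.choice(x, size=x.size, replace=True))
                         for _ in range(N)])

# Bootstrap standard error: root mean squared deviation from theta-hat
se_boot = np.sqrt(np.mean((boot_medians - theta_hat) ** 2))
print(theta_hat, se_boot)
```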

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780080448947013099

Pearson, Karl

M. Eileen Magnello, in Encyclopedia of Social Measurement, 2005

The Biometric School

Although Pearson's success in attracting such large audiences in his Gresham lectures may have played a role in encouraging him to further develop his work in biometry, he resigned from the Gresham Lectureship due to his doctor's recommendation. Following the success of his Gresham lectures, Pearson began to teach statistics to students at UCL in October 1894. Not only did Galton's work on his law of ancestral heredity enable Pearson to devise the mathematical properties of the product-moment correlation coefficient (which measures the relationship between two continuous variables) and simple regression (used for the linear prediction between two continuous variables) but also Galton's ideas led to Pearson's introduction of multiple correlation and part correlation coefficients, multiple regression and the standard error of estimate (for regression), and the coefficient of variation. By then, Galton had determined graphically the idea of correlation and regression for the normal distribution only. Because Galton's procedure for measuring correlation involved measuring the slope of the regression line (which was a measure of regression instead), Pearson kept Galton's "r" to symbolize correlation. Pearson later used the letter b (from the equation for a straight line) to symbolize regression. After Weldon had seen a copy of Pearson's 1896 paper on correlation, he suggested to Pearson that he should extend the range for correlation from 0 to +1 (as used by Galton) so that it would include all values from −1 to +1.

Pearson achieved a mathematical resolution of multiple correlation and multiple regression, adumbrated in Galton's law of ancestral heredity in 1885, in his seminal paper Regression, Heredity, and Panmixia in 1896, when he introduced matrix algebra into statistical theory. (Arthur Cayley, who taught at Cambridge when Pearson was a student, created matrix algebra by his discovery of the theory of invariants during the mid-19th century.) Pearson's theory of multiple regression became important to his work on Mendel in 1904 when he advocated a synthesis of Mendelism and biometry. In the same paper, Pearson also introduced the following statistical methods: eta (η) as a measure for a curvilinear relationship, the standard error of estimate, multiple regression, and multiple and part correlation. He also devised the coefficient of variation as a measure of the ratio of a standard deviation to the corresponding mean expressed as a percentage.

By the end of the 19th century, he began to consider the relationship between two discrete variables, and from 1896 to 1911 Pearson devised more than 18 methods of correlation. In 1900, he devised the tetrachoric correlation and the phi coefficient for dichotomous variables. The tetrachoric correlation requires that both X and Y represent continuous, normally distributed, and linearly related variables, whereas the phi coefficient was designed for so-called point distributions, which implies that the two classes have two point values or merely represent some qualitative attribute. Nine years later, he devised the biserial correlation, where one variable is continuous and the other is discontinuous. With his son Egon, he devised the polychoric correlation in 1922 (which is very similar to canonical correlation today). Although not all of Pearson's correlational methods have survived him, a number of these methods are still the principal tools used by psychometricians for test construction. Following the publication of his first three statistical papers in Philosophical Transactions of the Royal Society, Pearson was elected a fellow of the Royal Society in 1896. He was awarded the Darwin Medal from the Royal Society in 1898.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B0123693985002280

Advances in Analysis of Mean and Covariance Structure when Data are Incomplete*

Mortaza Jamshidian, Matthew Mata, in Handbook of Latent Variable and Related Models, 2007

3.2.5 Generalized least squares and minimum chi-square

Lee (1986) has proposed a generalized least squares method for estimating the parameters θ in SEM in an effort to do without the normality assumption. Suppose that there are m missing data patterns and that for each pattern j there exist n_j cases, sufficiently large, based on which a positive definite sample covariance S_j is obtained. Lee (1986) proposed estimating θ by minimizing

(10) G(\theta) = \sum_{j=1}^{m} \frac{n_j}{n}\, \operatorname{trace}\left\{\left[\left(S_j - \Sigma_j(\theta)\right) W_j\right]^2\right\},

where Σ_j(θ) is the submatrix of Σ(θ) corresponding to the observed components in pattern j, and W_j is a positive definite weight matrix that converges in probability to the true Σ_j(θ)⁻¹. He gave an iterative algorithm to accomplish this and gave formulas for the standard errors of the estimates.

With the same intention of moving away from the assumption of normality, Yuan and Bentler (2000) recently gave an estimation method that utilizes the minimum chi-square method of Ferguson (1996, Chapter 23). Let vech(·) be an operator that transforms a symmetric matrix into a vector by stacking the columns of the matrix, leaving out the elements above the diagonal. Let β(θ) = (vech(Σ(θ))^T, μ(θ)^T)^T, and let β̂ = (vech(Σ̂)^T, μ̂^T)^T be the estimate of β obtained from what we called the EM estimates of μ and Σ from the saturated model. Furthermore, let Ω̂ denote the sandwich-type estimate of the asymptotic covariance of β̂. Then the minimum chi-square estimate of θ is obtained by minimizing

(11) Q(\theta) = \left(\hat{\beta} - \beta(\theta)\right)^{\mathrm{T}} \hat{\Omega}^{-1} \left(\hat{\beta} - \beta(\theta)\right)

with respect to θ. Yuan and Bentler (2000) gave asymptotic standard error formulas for this estimator and stated that it is asymptotically normal. They also stated that when the data are not normal, the minimum chi-square estimator is asymptotically at least as efficient as the FIML estimate and the ML estimate θ̃ that uses the EM mean and covariance.
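Eq. (11) is a weighted quadratic form, so the minimization itself is routine to sketch. The toy below (entirely illustrative: the placeholder values, the hypothetical model beta_model, and all variable names are ours, not Yuan and Bentler's) shows the mechanics with scipy.optimize for two observed variables.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative stand-ins for beta-hat (vech of the EM covariance plus the EM means)
# and for Omega-hat, its estimated asymptotic covariance. In a real application
# both would come from the saturated-model EM fit.
beta_hat = np.array([1.2, 0.5, 1.1, 0.10, -0.20])  # (s11, s12, s22, m1, m2)
omega_hat = np.diag([0.05, 0.02, 0.05, 0.01, 0.01])
omega_inv = np.linalg.inv(omega_hat)

def beta_model(theta):
    """Hypothetical structural model beta(theta): free variances and correlation
    but a common mean, parameterized so the implied covariance is always valid."""
    log_s11, z, log_s22, mu = theta
    s11, s22 = np.exp(log_s11), np.exp(log_s22)
    s12 = np.tanh(z) * np.sqrt(s11 * s22)
    return np.array([s11, s12, s22, mu, mu])

def Q(theta):
    # Eq. (11): quadratic form in the difference between beta-hat and beta(theta)
    d = beta_hat - beta_model(theta)
    return d @ omega_inv @ d

result = minimize(Q, x0=np.zeros(4), method="BFGS")
print(result.x, result.fun)  # minimum chi-square estimate and Q at the minimum
```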

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780444520449500057

Exploratory and Confirmatory Clinical Trials

Joseph Tal, in Strategy and Statistics in Clinical Trials, 2011

The all too human model

There are times when I sit impatiently facing the computer monitor awaiting the Word in its contemporary form, called "output." Depending on the results I may praise or rail, but this does not trouble the machine. In moments of clarity, it does not trouble me either. On occasion even I realize that processors running statistical procedures are neither for nor against me; they simply go about their business, paying no personal heed to the hands that key.

The programs I use are set up to analyze data in fixed steps, and even the "decisions" they make—their choice of one routine or another within an analysis—are burned into their programs. They are, in this sense, unfeeling. But this does not necessarily mean that they are without bias. In fact, being extensions of our own reasoning, they are very much subject to the kind of human foibles I have been describing.

Figure 9.1 represents measurements of Engine Size (displacement in inches³) and Acceleration (time in seconds from 0 to 60 miles per hour) for 404 car models. As you might imagine, and as can be seen in the figure, the larger the engine, the quicker the acceleration (that is, the shorter the 0-to-60 time). Knowing this before having analyzed the data, and assuming the relationship would be adequately described by a straight line, I planned to fit a simple linear regression model to the data. This I did, and I present the computed line in the scatterplot as well.

Figure 9.1. Relationship between Acceleration and Engine Size with a fitted first-order linear model.

The line in Figure 9.1 seems to fit the data, and I conclude that this simple linear model reasonably describes the relationship between Acceleration and Engine Size. Moreover, having prespecified the model before analyzing the data would seem to make this a confirmatory study and especially credible. Well, almost. Actually, I did prespecify that I would fit a simple linear model to the data, but I did not prespecify its specific parameters.

Reach back to your high school algebra and recall that lines in two-dimensional space have the following general form:

(9.1) y = c + b × x

In this particular example:

y = Acceleration

x = Engine Size

c = Constant (termed intercept, which is the point on the y-axis that the line intersects)

b = Regression coefficient, which is the line's slope

On the technical side, the procedure requires that I provide the computer program with the model's (linear) form and data. It will then use the programmed procedure to compute the parameters "c" and "b" of the equation. Once done, I have a formula with which I can predict Acceleration from Engine Size. In this particular case it is the following equation:

(9.2) Acceleration = 18.434 − 0.0150 × (Engine Size)

Plug in an Engine Size—one of the 404 models for which the data were already collected, or another of your choice—and you will obtain the model's prediction for Acceleration. For example, estimating Acceleration for an engine of size 280 inches³, I plug the number into the formula as follows:

Acceleration = 18.434 − 0.0150 × 280 = 14.2

Thus, given the model, I expect a car with an engine of size 280 inches³ to go from 0 to 60 miles per hour in about 14.2 seconds. Not too impressive.
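The prediction step is simple arithmetic; a minimal Python sketch (ours) of Eq. (9.2):

```python
def predicted_acceleration(engine_size):
    """Eq. (9.2): predicted 0-to-60 time (seconds) from displacement (cubic inches)."""
    return 18.434 - 0.0150 * engine_size

print(predicted_acceleration(280))  # about 14.2 seconds
```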

Having posited a linear model before building it but not having prespecified the model's parameters, this "trial" is not purely confirmatory but rather:

Confirmatory with respect to testing whether a linear model fits the data.

Exploratory in that it uses an optimization procedure to determine the model's actual parameters after having "looked" at the data.

In a wholly confirmatory study, I would do the following:

1. Prespecify the model, complete with specific values for "c" and "b," which would likely have been developed on pilot data.

2. Collect new data—obtain another random sample of cars from the population—and use Eq. (9.2) to estimate Acceleration given Engine Size.

3. Compare the results predicted by the model to the cars' actual acceleration values.

If the differences between observed and predicted Accelerations are sufficiently small, I have confirmed the model. If the differences are unacceptable, the model is disconfirmed. Thus, a confirmatory trial would test an existing model "as is" on data; it would not use the data to build or modify any part of Eq. (9.2).

The "partial-confirmatory" approach just described is common in research, and there is nothing wrong with it. Regardless, when building models we should distinguish between the confirmatory and exploratory elements in them. This will provide us information on the degree to which we can trust our results. In this particular example, I can be confident that a straight line is a reasonable way to model the data. After all, I specified the simple model in advance, and it fits the data. However, I am less trusting of the model's specific parameters; in other words, I am less certain that the constant (18.434) and the slope (–0.0150) are reasonable estimates of the true values in the population because they were not prespecified.

Testing and Trusting

Models are useful to the degree that they apply to the population in general. In this sense, their performance on the data used for building them provides limited information. So it is with a drug in an exploratory study that is shown to be effective; its utility must then be confirmed on a new sample of subjects—on another exemplar of the population. But confirming takes time and resources, and it would be nice to have some indication of the degree to which a model will perform in the future. Thus, the regression procedure provides standard errors for c and b (constant and slope) that are quantitative indicators of how these parameters are likely to change when they are computed on a new sample. Other measures of model fit, such as R-square and the standard error of estimate, relate to the distance of the points on the graph from the line. If the points, which represent actually observed cars, are far from the line, we say the linear model does not fit very well (R-square is low and the standard error of estimate is high). If the points are near the line, we say that our model fits well (R-square is high and the standard error of estimate is low). Additional statistics abound, each of which provides some information on the degree to which you can trust a model's future performance before actually testing it (in the future).

I said before that a simple linear model fits the data reasonably well, and the figure presented bears this out. Still, life is not mathematics—and neither are cars—and most of the points are off the line; in other words, the model does not fit perfectly, and there is no one-to-one correspondence between Acceleration and Engine Size. This too was expected. Examining the figure, I wonder if I can do better. I see, for example, that there are several relatively small engines (about 100 inches³) for which the Acceleration is in the vicinity of 25 seconds, yet, according to the estimated model, they "should be" about 17 seconds. At the other end of the scale—when looking at the largest engines—all of the points are below the line. In other words, for all these engines the model predicts slower Acceleration than is actually the case. This is not optimal. Ideally a line will "pass through" the data throughout the range rather than be uniformly above or below in certain subranges.

Thus, the error in prediction of Acceleration from Engine Size produced by Eq. (9.2) is, at times, more than I would like. Hoping for a better model, I decide to fit a more complex equation to the data. Specifically, I choose a model whose form is a second-order polynomial—one that posits an element of nonlinearity in the relationship between Acceleration and Engine Size. The new equation (which you might also vaguely recall from high school algebra) has the following form:

(9.3) y = c + b_1 × x + b_2 × x²

where:

x² = Engine Size squared (parabola, remember?)

b_2 = An additional coefficient to be estimated—the weight assigned to the nonlinear term (x²) in predicting Acceleration

Estimating the model using the optimization procedure mentioned yields the following:

(9.4) Acceleration = 15.863 + 0.0139 × (Engine Size) − 0.0001 × (Engine Size)²

If you are familiar with this procedure, fine. Otherwise, do not fret. It is enough to know that instead of using the simple model (Eq. (9.1)), I have fit a more complex model (Eq. (9.3)) to the data.
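For readers who want to reproduce this kind of comparison, both fits can be obtained with numpy's polynomial routines. The sketch below is generic and ours: the engine_size and acceleration arrays are assumed to hold the observed data (the 404-car values themselves are not reproduced here).

```python
import numpy as np

def compare_fits(engine_size, acceleration):
    """Fit first- and second-order polynomials and return their R-squared values."""
    p1 = np.polyfit(engine_size, acceleration, deg=1)  # straight line, Eq. (9.1)
    p2 = np.polyfit(engine_size, acceleration, deg=2)  # parabola, Eq. (9.3)

    def r_squared(coeffs):
        predicted = np.polyval(coeffs, engine_size)
        ss_res = np.sum((acceleration - predicted) ** 2)
        ss_tot = np.sum((acceleration - np.mean(acceleration)) ** 2)
        return 1 - ss_res / ss_tot

    return r_squared(p1), r_squared(p2)  # the text reports roughly 0.32 and 0.36
```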

Having run the procedure, I obtain Eq. (9.4) and present the results of my efforts in Figure 9.2.

Figure 9.2. Relationship between Acceleration and Engine Size with fitted second-order linear model.

Looking at Figure 9.2, I see that the line is curved and appears to fit the data better than the line in Figure 9.1. There is a clear improvement at the high end of Engine Size, where the first-order model passed above the data points, while the second-order model passed through them. But there appears to be little or no improvement at the lower end of the Engine Size scale. As it turns out, the model in Figure 9.2 is statistically superior to the first in that its R² (R squared) is 0.36 as opposed to the first one's 0.32. This is no dramatic improvement but is a step in the right direction. Moreover, the difference between the models is statistically significant. In other words, I can be pretty certain that the second model will fit in the population better than the first. Or can I?

I now have a better fit to the current data and, arguably, a better prediction of Acceleration from Engine Size. I write "arguably" because while the model fits the data at hand better, it is far from clear that its predictive prowess in the population will exceed that of the simpler model. Keep in mind that my more recent effort has two post hoc elements to it: the model's form and the parameters estimated. While in my first effort the model's form (linear) was prespecified, in my second effort I added a nonlinear term after looking at the numbers.

So I am once again in a situation where post hoc testing has provided me with apparently useful information. But how useful is it? This is no idle question, since in the future I will want to use some model to predict Acceleration from Engine Size on new data. This means that I must now choose between the two models computed. Based on fit alone, I should choose my second effort, while based on "planned versus post hoc analyses," I should choose the first. All the while I should remember there is a third option—namely, fitting an even more complex model to the data, since neither of my first two efforts yielded particularly impressive results. I will consider this third possibility soon.

The example described presents the statistician with a quandary: choosing between models to predict future events based on sample data. In one form or another, this issue arises continuously in clinical research.

Suppose you conduct a dose-response study testing five doses ranging from 0   mg to 60   mg in an indication where the stronger the response, the better. You start at 0   mg to make sure the kind of response you are trying to elicit does not occur naturally in the human body. And you choose 60   mg for your highest dose, expecting that your drug produces its maximal response at about 40   mg and weakens thereafter. This is what your early data have suggested and, judging by the limited research reported, what others have found as well. In fact, you could have probably forgone the 60   mg dose, but you had the resources and wanted to substantiate your earlier results and those reported in the literature. Figure 9.3 shows the results obtained in this trial.

Figure 9.3. Relationship between Medication Dose and Subjects' Response.

Surprisingly, the 60 mg dose elicits a stronger response than 40 mg, and you must now decide which should be used in your upcoming feasibility trial. Specifically, should you believe your expectation that 40 mg is optimal, or should you go with the apparently more effective dose of 60 mg observed?

As before, the preferred solution is to repeat the study and test whether these results replicate. But this is not an option. While your company has the monetary resources for another study, it cannot spare the time; its plan requires that the molecule move forward to a feasibility trial, and you have no say in the matter. Discussing the issue with your superiors, they agree that another study is a good idea but that neither you nor they can make it happen. Given the results obtained, they say, any dose between 40   mg and 60   mg should do for now. They propose that you bring up the issue again after the drug is approved and starts generating income. Perhaps then, they say, the company will be open to considering tests of alternative doses.

So you have been given the authority to determine the dose, and decide on 40   mg. It is the safe option and probably not a bad one. But you are not completely comfortable with this route, and you review your results repeatedly to see whether more can be gleaned from them. In reassessing your results, you come up with Figure 9.4, in which two dose-response curves (models) have been fit to the data.

Figure 9.4. Relationship between Medication Dose and Subjects' Response with fitted models.

Looking at the simpler, dotted-line curve, 60   mg is your best option; it produces the strongest response, which is preferred in this indication. In fact, you do not really need a model for this. The strongest response was observed at 60   mg, and that is that. At the same time, this result is at variance with your expectations. Specifically, the maximal response was expected at 40   mg, and pushing it up to 60   mg is sufficiently inconsistent to be questionable. Keeping this in mind, you fit the more complex, solid-line model. And although more complex, its form is often encountered in dose-response studies and thus may more accurately reflect this relationship than the simpler model. Examining the solid curve, the strongest response is slightly over 50   mg rather than 60   mg, which is situated on the curve's downward trend.

So it seems that repeated looks at the data have only complicated matters. You are now faced with three choices, each with advantages and disadvantages:

The safest choice is 40   mg. It is consistent with expectation, yields an acceptable response, and no one will blame you for selecting it. Yet, you would not have conducted the study in the first place if you were completely sure of the 40   mg dose.

Using a simple dose-response model, you ought to choose 60   mg. Of the doses tested, this provides the best response. Indeed, even if you had not used a model at all, you would have reached the same conclusion, which is also in 60   mg's favor.

Given a more complex model of the type often fitted to dose-response data, the dose of choice for the feasibility trial should be in the vicinity of 50 mg.

Having dug into the data a bit deeper, you now have three options and no clear criterion to guide your choice. Now as a statistician I would really like to help you, but there is not much I can do. While the first argument might be the most powerful statistically, all three make sense. In this particular circumstance I would likely push (mildly) for the first option but would ask to discuss the matter further with clinicians.

Allow me another example—one that in one form or another I have encountered in the development of diagnostic tests. Suppose you have designed a device that produces a number for detecting some disorder. Let us call this number Score-X. Now Score-X is a product of theory, early testing, and intensive work by your company's algorithm experts. Applying it in a pilot yields reasonable, but not outstanding, accuracy: sensitivity and specificity of about 0.8 each. You are now planning your pivotal trial and writing a protocol in preparation for submission to the regulator. Once approved, you will go ahead with the trial, and if the expected results materialize, your device will likely be approved. Meanwhile your company has added another algorithm expert to the staff, and she comes up with a modified diagnostic indicator that she calls Score-Y. Reanalyzing your pilot data with Score-Y, both sensitivity and specificity increase appreciably (to about 0.9 each). What should you do?

If this new expert came up with Score-Y after looking at the data, the solution is probably "post hoc enough" to be rejected. Indeed, chances are that your other experts would also have achieved better results if they were allowed to modify the algorithm based on available data. But this is not how it happened. The new person came up with the algorithm using the same information your other experts had before the pilot trial. She only used the more recently acquired data to test the model rather than to develop it. So here is what you have:

1. A post hoc study in the sense that Score-Y was computed after the pilot.

2. A planned study in that you assessed Score-Y on data that were not used in its development, which is certainly not post hoc.

The important moral of this story is that the difference between "planned" and "after-the-fact" is not necessarily a matter of chronology. To evaluate trustworthiness of outcomes, you must examine how these outcomes were obtained. When your model is fit to some data, these same data do not provide an optimal test of your model. However, if the model was built independently of a specific data set, this latter set provides an acceptable test of the model even if its numbers were collected before the model was constructed.

To address this issue, when developing algorithms, researchers often take existing data and divide them into two or three subsets. They use the first subset—usually the largest—for what is called "learning." Once the learning phase is completed, one has an algorithm that can be tested on "virgin" data—numbers that were not used in algorithm development (though collected at the same time). Whether you choose one or two data sets in addition to the first depends both on the amount of data at your disposal and your researcher's preferences. This is a technical issue that need not concern us. My point is this: Where possible, do not "waste" all your data on building a model. Select a random subset of them to be excluded from the model-building phase. Once your equation has been developed, use the data that have been set aside to test your accuracy. This way you have, for all practical purposes, conducted two trials, the first being exploratory and the second confirmatory.
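One common way to set up such a split is sketched below (ours), using scikit-learn's train_test_split; the fitting step here is just a placeholder least-squares line, and any model-building routine could stand in for it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def fit_and_validate(x, y):
    """Hold out a random subset for confirmation; build the model on the rest."""
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=0.3, random_state=0)

    # "Learning" phase: fit only on the training subset
    coeffs = np.polyfit(x_train, y_train, deg=1)

    # Confirmatory phase: evaluate prediction error on the held-out data
    predictions = np.polyval(coeffs, x_test)
    rmse = np.sqrt(np.mean((y_test - predictions) ** 2))
    return coeffs, rmse
```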

The examples described thus far involve "shades of gray" in decision making. Based on statistical "fit parameters" only, there are better results and worse. But statistical fit is not the only concern in choosing between models, and this makes your choice less than straightforward. Now all this might be an amusing intellectual exercise if the reality of biomedical development did not necessitate your deciding one way or the other. But you must decide, and at times the decision-making process feels a bit like gambling, which it is. You can collect information to increase your odds for success, but you will not know whether your choice has been a good one until the results are in.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780123869098000106