2 @cindex least squares fit
3 @cindex regression, least squares
4 @cindex weighted linear fits
5 @cindex unweighted linear fits
6 This chapter describes routines for performing least squares fits to
7 experimental data using linear combinations of functions. The data
8 may be weighted or unweighted, i.e. with known or unknown errors. For
9 weighted data the functions compute the best fit parameters and their
10 associated covariance matrix. For unweighted data the covariance
11 matrix is estimated from the scatter of the points, giving a
12 variance-covariance matrix.
14 The functions are divided into separate versions for simple one- or
15 two-parameter regression and multiple-parameter fits. The functions
16 are declared in the header file @file{gsl_fit.h}.
21 * Linear fitting without a constant term::
22 * Multi-parameter fitting::
24 * Fitting References and Further Reading::
27 @node Fitting Overview
30 Least-squares fits are found by minimizing @math{\chi^2}
31 (chi-squared), the weighted sum of squared residuals over @math{n}
32 experimental datapoints @math{(x_i, y_i)} for the model @math{Y(c,x)},
36 \chi^2 = \sum_i w_i (y_i - Y(c, x_i))^2
43 \chi^2 = \sum_i w_i (y_i - Y(c, x_i))^2
48 The @math{p} parameters of the model are @c{$c = \{c_0, c_1, \dots\}$}
49 @math{c = @{c_0, c_1, @dots{}@}}. The
50 weight factors @math{w_i} are given by @math{w_i = 1/\sigma_i^2},
51 where @math{\sigma_i} is the experimental error on the data-point
52 @math{y_i}. The errors are assumed to be
53 gaussian and uncorrelated.
54 For unweighted data the chi-squared sum is computed without any weight factors.
56 The fitting routines return the best-fit parameters @math{c} and their
57 @math{p \times p} covariance matrix. The covariance matrix measures the
58 statistical errors on the best-fit parameters resulting from the
59 errors on the data, @math{\sigma_i}, and is defined
60 @cindex covariance matrix, linear fits
61 as @c{$C_{ab} = \langle \delta c_a \delta c_b \rangle$}
62 @math{C_@{ab@} = <\delta c_a \delta c_b>} where @c{$\langle \, \rangle$}
63 @math{< >} denotes an average over the gaussian error distributions of the underlying datapoints.
65 The covariance matrix is calculated by error propagation from the data
66 errors @math{\sigma_i}. The change in a fitted parameter @math{\delta
67 c_a} caused by a small change in the data @math{\delta y_i} is given
72 \delta c_a = \sum_i {\partial c_a \over \partial y_i} \delta y_i
79 \delta c_a = \sum_i (dc_a/dy_i) \delta y_i
84 allowing the covariance matrix to be written in terms of the errors on the data,
88 C_{ab} = \sum_{i,j} {\partial c_a \over \partial y_i}
89 {\partial c_b \over \partial y_j}
90 \langle \delta y_i \delta y_j \rangle
97 C_@{ab@} = \sum_@{i,j@} (dc_a/dy_i) (dc_b/dy_j) <\delta y_i \delta y_j>
102 For uncorrelated data the fluctuations of the underlying datapoints satisfy
103 @c{$\langle \delta y_i \delta y_j \rangle = \sigma_i^2 \delta_{ij}$}
104 @math{<\delta y_i \delta y_j> = \sigma_i^2 \delta_@{ij@}}, giving a
105 corresponding parameter covariance matrix of
109 C_{ab} = \sum_{i} {1 \over w_i} {\partial c_a \over \partial y_i} {\partial c_b \over \partial y_i}
116 C_@{ab@} = \sum_i (1/w_i) (dc_a/dy_i) (dc_b/dy_i)
121 When computing the covariance matrix for unweighted data, i.e. data with unknown errors,
122 the weight factors @math{w_i} in this sum are replaced by the single estimate @math{w =
123 1/\sigma^2}, where @math{\sigma^2} is the computed variance of the
124 residuals about the best-fit model, @math{\sigma^2 = \sum (y_i - Y(c,x_i))^2 / (n-p)}.
125 This is referred to as the @dfn{variance-covariance matrix}.
126 @cindex variance-covariance matrix, linear fits
128 The standard deviations of the best-fit parameters are given by the
129 square root of the corresponding diagonal elements of
130 the covariance matrix, @c{$\sigma_{c_a} = \sqrt{C_{aa}}$}
131 @math{\sigma_@{c_a@} = \sqrt@{C_@{aa@}@}}.
132 The correlation coefficient of the fit parameters @math{c_a} and @math{c_b}
133 is given by @c{$\rho_{ab} = C_{ab} / \sqrt{C_{aa} C_{bb}}$}
134 @math{\rho_@{ab@} = C_@{ab@} / \sqrt@{C_@{aa@} C_@{bb@}@}}.
136 @node Linear regression
137 @section Linear regression
138 @cindex linear regression
140 The functions described in this section can be used to perform
141 least-squares fits to a straight line model, @math{Y(c,x) = c_0 + c_1 x}.
143 @cindex covariance matrix, from linear regression
144 @deftypefun int gsl_fit_linear (const double * @var{x}, const size_t @var{xstride}, const double * @var{y}, const size_t @var{ystride}, size_t @var{n}, double * @var{c0}, double * @var{c1}, double * @var{cov00}, double * @var{cov01}, double * @var{cov11}, double * @var{sumsq})
145 This function computes the best-fit linear regression coefficients
146 (@var{c0},@var{c1}) of the model @math{Y = c_0 + c_1 X} for the dataset
147 (@var{x}, @var{y}), two vectors of length @var{n} with strides
148 @var{xstride} and @var{ystride}. The errors on @var{y} are assumed unknown so
149 the variance-covariance matrix for the
150 parameters (@var{c0}, @var{c1}) is estimated from the scatter of the
151 points around the best-fit line and returned via the parameters
152 (@var{cov00}, @var{cov01}, @var{cov11}).
153 The sum of squares of the residuals from the best-fit line is returned
154 in @var{sumsq}. Note: the correlation coefficient of the data can be computed using @code{gsl_stats_correlation} (@pxref{Correlation}), it does not depend on the fit.
157 @deftypefun int gsl_fit_wlinear (const double * @var{x}, const size_t @var{xstride}, const double * @var{w}, const size_t @var{wstride}, const double * @var{y}, const size_t @var{ystride}, size_t @var{n}, double * @var{c0}, double * @var{c1}, double * @var{cov00}, double * @var{cov01}, double * @var{cov11}, double * @var{chisq})
158 This function computes the best-fit linear regression coefficients
159 (@var{c0},@var{c1}) of the model @math{Y = c_0 + c_1 X} for the weighted
160 dataset (@var{x}, @var{y}), two vectors of length @var{n} with strides
161 @var{xstride} and @var{ystride}. The vector @var{w}, of length @var{n}
162 and stride @var{wstride}, specifies the weight of each datapoint. The
163 weight is the reciprocal of the variance for each datapoint in @var{y}.
165 The covariance matrix for the parameters (@var{c0}, @var{c1}) is
166 computed using the weights and returned via the parameters
167 (@var{cov00}, @var{cov01}, @var{cov11}). The weighted sum of squares
168 of the residuals from the best-fit line, @math{\chi^2}, is returned in
172 @deftypefun int gsl_fit_linear_est (double @var{x}, double @var{c0}, double @var{c1}, double @var{cov00}, double @var{cov01}, double @var{cov11}, double * @var{y}, double * @var{y_err})
173 This function uses the best-fit linear regression coefficients
174 @var{c0}, @var{c1} and their covariance
175 @var{cov00}, @var{cov01}, @var{cov11} to compute the fitted function
176 @var{y} and its standard deviation @var{y_err} for the model @math{Y =
177 c_0 + c_1 X} at the point @var{x}.
180 @node Linear fitting without a constant term
181 @section Linear fitting without a constant term
183 The functions described in this section can be used to perform
184 least-squares fits to a straight line model without a constant term,
187 @deftypefun int gsl_fit_mul (const double * @var{x}, const size_t @var{xstride}, const double * @var{y}, const size_t @var{ystride}, size_t @var{n}, double * @var{c1}, double * @var{cov11}, double * @var{sumsq})
188 This function computes the best-fit linear regression coefficient
189 @var{c1} of the model @math{Y = c_1 X} for the datasets (@var{x},
190 @var{y}), two vectors of length @var{n} with strides @var{xstride} and
191 @var{ystride}. The errors on @var{y} are assumed unknown so the
192 variance of the parameter @var{c1} is estimated from
193 the scatter of the points around the best-fit line and returned via the
194 parameter @var{cov11}. The sum of squares of the residuals from the
195 best-fit line is returned in @var{sumsq}.
198 @deftypefun int gsl_fit_wmul (const double * @var{x}, const size_t @var{xstride}, const double * @var{w}, const size_t @var{wstride}, const double * @var{y}, const size_t @var{ystride}, size_t @var{n}, double * @var{c1}, double * @var{cov11}, double * @var{sumsq})
199 This function computes the best-fit linear regression coefficient
200 @var{c1} of the model @math{Y = c_1 X} for the weighted datasets
201 (@var{x}, @var{y}), two vectors of length @var{n} with strides
202 @var{xstride} and @var{ystride}. The vector @var{w}, of length @var{n}
203 and stride @var{wstride}, specifies the weight of each datapoint. The
204 weight is the reciprocal of the variance for each datapoint in @var{y}.
206 The variance of the parameter @var{c1} is computed using the weights
207 and returned via the parameter @var{cov11}. The weighted sum of
208 squares of the residuals from the best-fit line, @math{\chi^2}, is
209 returned in @var{chisq}.
212 @deftypefun int gsl_fit_mul_est (double @var{x}, double @var{c1}, double @var{cov11}, double * @var{y}, double * @var{y_err})
213 This function uses the best-fit linear regression coefficient @var{c1}
214 and its covariance @var{cov11} to compute the fitted function
215 @var{y} and its standard deviation @var{y_err} for the model @math{Y =
216 c_1 X} at the point @var{x}.
219 @node Multi-parameter fitting
220 @section Multi-parameter fitting
221 @cindex multi-parameter regression
222 @cindex fits, multi-parameter linear
223 The functions described in this section perform least-squares fits to a
224 general linear model, @math{y = X c} where @math{y} is a vector of
225 @math{n} observations, @math{X} is an @math{n} by @math{p} matrix of
226 predictor variables, and the elements of the vector @math{c} are the @math{p} unknown best-fit parameters which are to be estimated. The chi-squared value is given by @c{$\chi^2 = \sum_i w_i (y_i - \sum_j X_{ij} c_j)^2$}
227 @math{\chi^2 = \sum_i w_i (y_i - \sum_j X_@{ij@} c_j)^2}.
229 This formulation can be used for fits to any number of functions and/or
230 variables by preparing the @math{n}-by-@math{p} matrix @math{X}
231 appropriately. For example, to fit to a @math{p}-th order polynomial in
232 @var{x}, use the following matrix,
248 where the index @math{i} runs over the observations and the index
249 @math{j} runs from 0 to @math{p-1}.
251 To fit to a set of @math{p} sinusoidal functions with fixed frequencies
252 @math{\omega_1}, @math{\omega_2}, @dots{}, @math{\omega_p}, use,
256 X_{ij} = \sin(\omega_j x_i)
263 X_@{ij@} = sin(\omega_j x_i)
268 To fit to @math{p} independent variables @math{x_1}, @math{x_2}, @dots{},
285 where @math{x_j(i)} is the @math{i}-th value of the predictor variable
288 The functions described in this section are declared in the header file
289 @file{gsl_multifit.h}.
291 The solution of the general linear least-squares system requires an
292 additional working space for intermediate results, such as the singular
293 value decomposition of the matrix @math{X}.
295 @deftypefun {gsl_multifit_linear_workspace *} gsl_multifit_linear_alloc (size_t @var{n}, size_t @var{p})
296 This function allocates a workspace for fitting a model to @var{n}
297 observations using @var{p} parameters.
300 @deftypefun void gsl_multifit_linear_free (gsl_multifit_linear_workspace * @var{work})
301 This function frees the memory associated with the workspace @var{w}.
304 @deftypefun int gsl_multifit_linear (const gsl_matrix * @var{X}, const gsl_vector * @var{y}, gsl_vector * @var{c}, gsl_matrix * @var{cov}, double * @var{chisq}, gsl_multifit_linear_workspace * @var{work})
305 @deftypefunx int gsl_multifit_linear_svd (const gsl_matrix * @var{X}, const gsl_vector * @var{y}, double @var{tol}, size_t * @var{rank}, gsl_vector * @var{c}, gsl_matrix * @var{cov}, double * @var{chisq}, gsl_multifit_linear_workspace * @var{work})
306 These functions compute the best-fit parameters @var{c} of the model
307 @math{y = X c} for the observations @var{y} and the matrix of predictor
308 variables @var{X}. The variance-covariance matrix of the model
309 parameters @var{cov} is estimated from the scatter of the observations
310 about the best-fit. The sum of squares of the residuals from the
311 best-fit, @math{\chi^2}, is returned in @var{chisq}. If the coefficient
312 of determination is desired, it can be computed from the expression
313 @math{R^2 = 1 - \chi^2 / TSS}, where the total sum of squares (TSS) of
314 the observations @var{y} may be computed from @code{gsl_stats_tss}.
316 The best-fit is found by singular value decomposition of the matrix
317 @var{X} using the preallocated workspace provided in @var{work}. The
318 modified Golub-Reinsch SVD algorithm is used, with column scaling to
319 improve the accuracy of the singular values. Any components which have
320 zero singular value (to machine precision) are discarded from the fit.
321 In the second form of the function the components are discarded if the
322 ratio of singular values @math{s_i/s_0} falls below the user-specified
323 tolerance @var{tol}, and the effective rank is returned in @var{rank}.
326 @deftypefun int gsl_multifit_wlinear (const gsl_matrix * @var{X}, const gsl_vector * @var{w}, const gsl_vector * @var{y}, gsl_vector * @var{c}, gsl_matrix * @var{cov}, double * @var{chisq}, gsl_multifit_linear_workspace * @var{work})
327 @deftypefunx int gsl_multifit_wlinear_svd (const gsl_matrix * @var{X}, const gsl_vector * @var{w}, const gsl_vector * @var{y}, double @var{tol}, size_t * @var{rank}, gsl_vector * @var{c}, gsl_matrix * @var{cov}, double * @var{chisq}, gsl_multifit_linear_workspace * @var{work})
329 This function computes the best-fit parameters @var{c} of the weighted
330 model @math{y = X c} for the observations @var{y} with weights @var{w}
331 and the matrix of predictor variables @var{X}. The covariance matrix of
332 the model parameters @var{cov} is computed with the given weights. The
333 weighted sum of squares of the residuals from the best-fit,
334 @math{\chi^2}, is returned in @var{chisq}. If the coefficient
335 of determination is desired, it can be computed from the expression
336 @math{R^2 = 1 - \chi^2 / WTSS}, where the weighted total sum of squares
337 (WTSS) of the observations @var{y} may be computed from
338 @code{gsl_stats_wtss}.
340 The best-fit is found by singular value decomposition of the matrix
341 @var{X} using the preallocated workspace provided in @var{work}. Any
342 components which have zero singular value (to machine precision) are
343 discarded from the fit. In the second form of the function the
344 components are discarded if the ratio of singular values @math{s_i/s_0}
345 falls below the user-specified tolerance @var{tol}, and the effective
346 rank is returned in @var{rank}.
349 @deftypefun int gsl_multifit_linear_est (const gsl_vector * @var{x}, const gsl_vector * @var{c}, const gsl_matrix * @var{cov}, double * @var{y}, double * @var{y_err})
350 This function uses the best-fit multilinear regression coefficients
351 @var{c} and their covariance matrix
352 @var{cov} to compute the fitted function value
353 @var{y} and its standard deviation @var{y_err} for the model @math{y = x.c}
354 at the point @var{x}.
357 @deftypefun int gsl_multifit_linear_residuals (const gsl_matrix * @var{X}, const gsl_vector * @var{y}, const gsl_vector * @var{c}, gsl_vector * @var{r})
358 This function computes the vector of residuals @math{r = y - X c} for
359 the observations @var{y}, coefficients @var{c} and matrix of predictor
363 @node Fitting Examples
366 The following program computes a least squares straight-line fit to a
367 simple dataset, and outputs the best-fit line and its
368 associated one standard-deviation error bars.
371 @verbatiminclude examples/fitting.c
375 The following commands extract the data from the output of the program
376 and display it using the @sc{gnu} plotutils @code{graph} utility,
381 # best fit: Y = -106.6 + 0.06 X
387 $ for n in data fit hi lo ;
389 grep "^$n" tmp | cut -d: -f2 > $n ;
391 $ graph -T X -X x -Y y -y 0 20 -m 0 -S 2 -Ie data
392 -S 0 -I a -m 1 fit -m 2 hi -m 2 lo
397 @center @image{fit-wlinear,3.0in}
400 The next program performs a quadratic fit @math{y = c_0 + c_1 x + c_2
401 x^2} to a weighted dataset using the generalised linear fitting function
402 @code{gsl_multifit_wlinear}. The model matrix @math{X} for a quadratic
407 X=\pmatrix{1&x_0&x_0^2\cr
410 \dots&\dots&\dots\cr}
417 X = [ 1 , x_0 , x_0^2 ;
425 where the column of ones corresponds to the constant term @math{c_0}.
426 The two remaining columns corresponds to the terms @math{c_1 x} and
429 The program reads @var{n} lines of data in the format (@var{x}, @var{y},
430 @var{err}) where @var{err} is the error (standard deviation) in the
434 @verbatiminclude examples/fitting2.c
438 A suitable set of data for fitting can be generated using the following
439 program. It outputs a set of points with gaussian errors from the curve
440 @math{y = e^x} in the region @math{0 < x < 2}.
443 @verbatiminclude examples/fitting3.c
447 The data can be prepared by running the resulting executable program,
450 $ ./generate > exp.dat
462 To fit the data use the previous program, with the number of data points
463 given as the first argument. In this case there are 19 data points.
467 0.1 0.97935 +/- 0.110517
468 0.2 1.3359 +/- 0.12214
470 # best fit: Y = 1.02318 + 0.956201 X + 0.876796 X^2
472 [ +1.25612e-02, -3.64387e-02, +1.94389e-02
473 -3.64387e-02, +1.42339e-01, -8.48761e-02
474 +1.94389e-02, -8.48761e-02, +5.60243e-02 ]
479 The parameters of the quadratic fit match the coefficients of the
480 expansion of @math{e^x}, taking into account the errors on the
481 parameters and the @math{O(x^3)} difference between the exponential and
482 quadratic functions for the larger values of @math{x}. The errors on
483 the parameters are given by the square-root of the corresponding
484 diagonal elements of the covariance matrix. The chi-squared per degree
485 of freedom is 1.4, indicating a reasonable fit to the data.
489 @center @image{fit-wlinear2,3.0in}
492 @node Fitting References and Further Reading
493 @section References and Further Reading
495 A summary of formulas and techniques for least squares fitting can be
496 found in the ``Statistics'' chapter of the Annual Review of Particle
497 Physics prepared by the Particle Data Group,
501 @cite{Review of Particle Properties},
502 R.M. Barnett et al., Physical Review D54, 1 (1996)
503 @uref{http://pdg.lbl.gov/}
507 The Review of Particle Physics is available online at the website given
510 @cindex NIST Statistical Reference Datasets
511 @cindex Statistical Reference Datasets (StRD)
512 The tests used to prepare these routines are based on the NIST
513 Statistical Reference Datasets. The datasets and their documentation are
514 available from NIST at the following website,
516 @center @uref{http://www.nist.gov/itl/div898/strd/index.html}.