
Enabling Science

IDBS • Unit 2 • Occam Court • Surrey Research Park • Guildford • Surrey GU2 7QB • UK

t: +44 1483 595000 • e: [email protected] • w: http://www.idbs.com

Curve Fitting Best Practice Part 5: Robust Fitting and Complex Models

Most researchers are familiar with standard kinetics, Michaelis-Menten and dose response curves, but there are many more modern analysis techniques available that allow you to get greater value from your data. This article discusses the methods used in curve fitting today, including Iteratively Re-weighted Least Squares (IRLS), also known as robust fitting. The constraints of this technique are explored, along with the reasons why robust fitting is more widely accepted and used today than when it was introduced some 20 years ago. The principles behind complex models, and how they can be applied, are also discussed.

Quick introduction to weights

By default, equal weight is given to every data point in a curve fit. To determine the best fit, the Levenberg-Marquardt algorithm (LVM) minimizes the sum of squares of the vertical distances (residuals) between the observed data and the fitted curve.

By default, LVM minimizes:

Σ (Y_data − Y_fit)²

Unequal weighting can be assigned according to any scheme.

If weight values are assigned, LVM minimizes:

Σ [(Y_data − Y_fit) / Weight]²

The lower the weight value (the closer it is to 0), the greater that data point’s bearing on the fit.
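To make the two objectives concrete, below is a minimal sketch using scipy.optimize.least_squares with its Levenberg-Marquardt method. The 4-parameter logistic model, the synthetic data and the 5% relative-error weighting are illustrative assumptions, not taken from this article.

```python
# A minimal sketch of the two objectives above, using SciPy's
# Levenberg-Marquardt implementation (method="lm").
import numpy as np
from scipy.optimize import least_squares

def model(params, x):
    # illustrative 4-parameter logistic (dose response) model
    bottom, top, ec50, hill = params
    return bottom + (top - bottom) / (1.0 + (ec50 / x) ** hill)

def unweighted_residuals(params, x, y):
    # LVM minimizes sum((Y_data - Y_fit)^2)
    return y - model(params, x)

def weighted_residuals(params, x, y, weights):
    # LVM minimizes sum(((Y_data - Y_fit) / Weight)^2)
    return (y - model(params, x)) / weights

rng = np.random.default_rng(0)
x = np.logspace(-2, 2, 12)                             # dose values
y = model([0.0, 100.0, 1.0, 1.0], x) + rng.normal(0.0, 3.0, x.size)
weights = 0.05 * np.maximum(np.abs(y), 1.0)            # e.g. 5% relative error per point

p0 = [0.0, 100.0, 1.0, 1.0]
fit_equal = least_squares(unweighted_residuals, p0, args=(x, y), method="lm")
fit_weighted = least_squares(weighted_residuals, p0, args=(x, y, weights), method="lm")
print(fit_equal.x)
print(fit_weighted.x)
```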

Unequal weights can be assigned to data points within a certain tolerance, so that all points are included in the analysis but down-weighted points are given less bearing on the ultimate result.

For example, an instrument may guarantee a high level of accuracy within a certain data range, but once the limits of that range are exceeded its accuracy decreases. In this scenario, more bearing (weight) can be given to the data points within the instrument’s accurate range; data points outside that range are still included in the analysis but have less bearing on the ultimate result.

A set of weighting values can be applied to reflect this assumption and reduce the impact, during fitting, of any outliers that fall outside the tolerance range.
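One way to express such a scheme in code is sketched below, using scipy.optimize.curve_fit, whose sigma argument plays the role of the per-point Weight in the formula above. The logistic model, the accurate range of 10 to 90 units and the factor of 10 are illustrative assumptions.

```python
# Sketch of an instrument-tolerance weighting scheme: readings inside the
# instrument's accurate range keep a weight of 1, readings outside it get a
# weight of 10, so they still contribute but with far less bearing.
import numpy as np
from scipy.optimize import curve_fit

def logistic(x, bottom, top, ec50, hill):
    return bottom + (top - bottom) / (1.0 + (ec50 / x) ** hill)

rng = np.random.default_rng(1)
x = np.logspace(-2, 2, 12)
y = logistic(x, 0.0, 100.0, 1.0, 1.0) + rng.normal(0.0, 3.0, x.size)

in_range = (y > 10.0) & (y < 90.0)          # hypothetical accurate range of the instrument
weights = np.where(in_range, 1.0, 10.0)     # curve_fit divides each residual by these values

popt, pcov = curve_fit(logistic, x, y, p0=[0.0, 100.0, 1.0, 1.0], sigma=weights)
print(popt)
```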

IRLS (Robust Fitting)

Standard regression analysis is very sensitive to outliers, and even a single outlier can affect the results considerably, as shown in Fig 1 below. Knocking out the individual outlier improves the curve fit considerably, as shown in Fig 2.


Robust fitting is an extension of standard regression (standard non-linear Least Squares Fitting, LSF) that can down-weight individual outliers in a data set and neutralize their effect on the ultimate result.

Robust fitting was introduced about 20 years ago but was not initially widely accepted, partly because of the many competing techniques available at the time and a lack of understanding of the most appropriate way to use it. Another reason for the general reluctance to adopt IRLS was its computationally intensive nature. Standard non-linear LSF could be worked through on paper using standard mathematical techniques, but IRLS was much harder to perform in the same way, and early curve-fitting software packages were not able to employ robust fitting, leaving the technique and its algorithms largely unavailable to the mainstream.

Fig 1: Even one outlying data point can significantly affect the quality of a fit

Fig 2: Knocking out the outlier considerably improves results


A fitting process is iterative: on each iteration, the fitting algorithm changes the parameter values based on the data set provided in order to converge on the best result.

Robust fitting introduces another variable to this process by varying individual weights for individual data points as well as the parameter values. On each cycle of the iteration, the weighting value for each data point is changed to enable the fit to converge on the best fit for the data. If there is an outlier in the data set, it is significantly down-weighted, giving a more robust and better fit for the rest of the data set.
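The loop below is a minimal sketch of this idea rather than the exact algorithm used by any particular package: it starts from an ordinary least-squares fit, then on each iteration rescales the residuals, recomputes per-point weights (here with Tukey’s Biweight) and refits with those weights until the parameters stop changing. The logistic model, the tuning constant c = 4.685 and the convergence test are standard but assumed choices.

```python
# A minimal IRLS sketch: re-weight each point from its residual on every
# iteration, then refit with those weights.
import numpy as np
from scipy.optimize import curve_fit

def tukey_biweight(u, c=4.685):
    # weight falls to exactly 0 for |u| > c, i.e. the point is effectively removed
    w = np.zeros_like(u)
    inside = np.abs(u) < c
    w[inside] = (1.0 - (u[inside] / c) ** 2) ** 2
    return w

def irls_fit(f, x, y, p0, n_iter=20, tol=1e-8):
    params, _ = curve_fit(f, x, y, p0=p0)          # start from an ordinary LSF fit
    weights = np.ones_like(y)
    for _ in range(n_iter):
        residuals = y - f(x, *params)
        # robust scale estimate: 1.4826 * median absolute deviation
        scale = 1.4826 * np.median(np.abs(residuals - np.median(residuals)))
        weights = tukey_biweight(residuals / max(scale, 1e-12))
        # curve_fit divides residuals by sigma, so sigma = 1 / sqrt(weight);
        # zero-weight points get a huge sigma and effectively no bearing on the fit
        sigma = 1.0 / np.sqrt(np.clip(weights, 1e-12, None))
        new_params, _ = curve_fit(f, x, y, p0=params, sigma=sigma)
        if np.max(np.abs(new_params - params)) < tol:
            return new_params, weights
        params = new_params
    return params, weights

def logistic(x, bottom, top, ec50, hill):
    return bottom + (top - bottom) / (1.0 + (ec50 / x) ** hill)

rng = np.random.default_rng(2)
x = np.logspace(-2, 2, 12)
y = logistic(x, 0.0, 100.0, 1.0, 1.0) + rng.normal(0.0, 2.0, x.size)
y[5] += 60.0                                       # inject a single gross outlier

params, w = irls_fit(logistic, x, y, p0=[0.0, 100.0, 1.0, 1.0])
print(params)                                      # close to the outlier-free parameters
print(w)                                           # the outlier's weight is ~0
```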

There are many IRLS techniques available, but the six most commonly used are:

• Tukey’s Biweight*

• Andrew’s Sine*

• Geman-McClure

• Huber

• Welsch

• Cauchy

*Not defined over the complete error space, resulting in outliers being ‘removed’

Tukey’s Biweight and Andrew’s Sine are the most commonly used and, because they are not defined over the whole error space, these two techniques behave slightly differently from the other four. When employing Tukey’s Biweight or Andrew’s Sine, a data point whose weight falls sufficiently low is construed as an outlier and removed from the data set. This is what happens in curve-fitting applications such as XLfit when a user chooses to automatically remove outlying points from a data set.

Note: The other four techniques down-weight outlying points until they have effectively no bearing on the fit, which in practice is equivalent to knocking them out. It is possible to combine IRLS with manual outlier knock-out where appropriate.
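The distinction drawn in this note can be seen directly from the weight functions themselves. The sketch below uses commonly quoted tuning constants; it is illustrative rather than the exact parameterization used in any particular package.

```python
# Weight as a function of the scaled residual u for three of the schemes
# listed above. Tukey's Biweight reaches exactly 0 beyond its tuning
# constant; Huber and Cauchy only approach 0, however large u becomes.
import numpy as np

def tukey_weight(u, c=4.685):
    return np.where(np.abs(u) < c, (1.0 - (u / c) ** 2) ** 2, 0.0)

def huber_weight(u, c=1.345):
    return c / np.maximum(np.abs(u), c)    # 1 inside the threshold, c/|u| outside

def cauchy_weight(u, c=2.385):
    return 1.0 / (1.0 + (u / c) ** 2)

u = np.array([0.5, 2.0, 5.0, 20.0])        # increasingly outlying scaled residuals
print(tukey_weight(u))                     # the last two weights are exactly 0
print(huber_weight(u))                     # small for large u, but never exactly 0
print(cauchy_weight(u))                    # small for large u, but never exactly 0
```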

In the IRLS fitting scenarios shown in Fig 3 below, Tukey’s Biweight is applied to three different sets of data, all of which are well defined but contain easily identifiable outliers. IRLS has removed these data points from each set, making manual intervention unnecessary because the fit is of sufficient quality to be confident in the results produced.

Note: These data sets are well defined and complete, with a reasonably high number of data points. Applying IRLS to well-formed data sets allows the analysis to be of high quality and the process to produce accurate results. Much like standard non-linear LSF, robust fitting does not work if there are errors in the X values.


For a data set with a large amount of scatter, re-weighting every point on each iteration makes it very difficult for the fitting process to converge on a single best fit. IRLS therefore requires data of reasonable quality, otherwise it is prone to failure, and it is recommended that IRLS is always used in conjunction with other data quality checks to ensure good results.

Fig 4 below illustrates how the IRLS process works. The vertical axis shows the impact an individual point has on the curve fit, which grows as its residual (its distance from the fitted curve) grows. The blue line, which continues to infinity, represents standard least squares: the further a point lies from the fitted line, the greater its outlier status and the greater its impact on the fit. The red line represents the IRLS fit: for a given individual outlier in the data set, as its outlier status increases, its impact on the fit decreases and eventually reaches 0.
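As a short sketch of the behaviour this graph describes (treating a point's influence as its residual multiplied by its weight, an assumption made for illustration), the impact of a point under ordinary least squares keeps growing with its residual, whereas under a redescending scheme such as Tukey's Biweight it rises, falls and reaches zero.

```python
# Influence (impact on the fit) as a function of a point's residual r,
# contrasting standard least squares with Tukey's Biweight IRLS.
import numpy as np

def least_squares_influence(r):
    return r                               # unbounded: grows with the residual (the blue line)

def tukey_influence(r, c=4.685):
    w = np.where(np.abs(r) < c, (1.0 - (r / c) ** 2) ** 2, 0.0)
    return r * w                           # redescends to exactly 0 (the red line)

r = np.linspace(0.0, 10.0, 6)
print(least_squares_influence(r))          # 0, 2, 4, 6, 8, 10
print(tukey_influence(r))                  # rises, then returns to 0 beyond c
```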

Fig 3: IRLS fitting improves the accuracy of fit results when applied to three different data sets


Complex models

Data fitting and analysis is not confined to basic Michaelis-Menten and dose response models: complex models can be used to analyze different types of data using standard non-linear LSF. The example in Fig 5 below shows time-controlled drug delivery, with a number of different parameters being measured while a drug is administered at different time points in a pulsed manner. The graph analyzes absorption of the drug into the bloodstream over time, indicated by the wave-like fit, allowing the researcher to determine the cycle and rate at which the drug is distributed.

Fig 4: The impact of IRLS on an outlier compared to standard least-squares regression

Fig 5: Analyzing the cycle and rate at which the drug is absorbed
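As a hedged sketch of fitting such a profile with standard non-linear LSF, the decaying-sinusoid model below is an assumption chosen for illustration, not the model behind Fig 5. The point is that a wave-like model is fitted in exactly the same way as a dose response curve, and its fitted period and decay constant describe the cycle and rate of absorption.

```python
# Fitting a pulsed, wave-like absorption profile with ordinary non-linear LSF.
import numpy as np
from scipy.optimize import curve_fit

def pulsed_absorption(t, baseline, amplitude, period, phase, decay):
    # baseline concentration plus a slowly decaying oscillation (illustrative form)
    return baseline + amplitude * np.exp(-decay * t) * np.sin(2.0 * np.pi * t / period + phase)

rng = np.random.default_rng(3)
t = np.linspace(0.0, 24.0, 49)                               # hours after first dose
y = pulsed_absorption(t, 5.0, 3.0, 6.0, 0.0, 0.02) + rng.normal(0.0, 0.2, t.size)

p0 = [5.0, 2.0, 6.0, 0.0, 0.01]                              # rough starting estimates
popt, _ = curve_fit(pulsed_absorption, t, y, p0=p0)
print(dict(zip(["baseline", "amplitude", "period", "phase", "decay"], popt)))
```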


Composite models such as those shown in Fig 6 allow us to analyze a data set using two different models. For example, the researcher fits the first model up to the point in time at which the data points start to fall, after which a second model is used to analyze the different phase within the data. Although this is a complex model, it allows the researcher to fit the results with a high degree of confidence.
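A minimal sketch of this idea, under assumed sub-models, is shown below: an exponential uptake phase up to a switch-over time, followed by an exponential decline, with the switch-over time treated as just another fitted parameter. The sub-models are illustrative, not those used in Fig 6.

```python
# A composite model: one sub-model for the rising phase, a second for the
# declining phase, joined continuously at a fitted switch-over time.
import numpy as np
from scipy.optimize import curve_fit

def composite(t, peak, k_rise, k_fall, t_switch):
    rising = peak * (1.0 - np.exp(-k_rise * t))                  # phase 1: uptake
    level_at_switch = peak * (1.0 - np.exp(-k_rise * t_switch))
    falling = level_at_switch * np.exp(-k_fall * (t - t_switch)) # phase 2: decline
    return np.where(t <= t_switch, rising, falling)

rng = np.random.default_rng(4)
t = np.linspace(0.0, 20.0, 41)
y = composite(t, 10.0, 0.5, 0.3, 8.0) + rng.normal(0.0, 0.2, t.size)

popt, _ = curve_fit(composite, t, y, p0=[8.0, 0.4, 0.2, 7.0])
print(dict(zip(["peak", "k_rise", "k_fall", "t_switch"], popt)))
```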

Fig 7 below shows a common scenario in which data is fitted to a standard dose response curve but the data points start to decrease towards the end of the measurements. The researcher can set up a technique to remove those final data points, for example by applying an IRLS fitting technique to eliminate the points that drop off.

Alternatively, the researcher can use a model constructed to tackle this kind of scenario. A bell-shaped dose response model fits both the rising and falling phases of the data, so that parameters C1 and C2 can be extracted as the EC50 values of the two linked dose response curves, each with a measurable slope factor. Bell-shaped models provide an effective means of analyzing and interpreting a whole data set, rather than having to reject data points.

A scenario such as this comes up frequently in standard dose response analysis. If the last six points of this example were knocked out and a standard dose response curve were fitted to the remaining data, the results for the first curve in the bell-shaped model would be similar or identical to those of the standard dose response curve.
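One common way to parameterize such a bell-shaped model is sketched below; the exact equation used by XLfit may differ, so treat the form and the synthetic data as assumptions. C1 and C2 come out as the EC50 values of the rising and falling curves, with B1 and B2 as their slope factors.

```python
# A bell-shaped dose response model: two linked dose response curves sharing
# a common bottom and top, with EC50s C1 and C2 and slope factors B1 and B2.
import numpy as np
from scipy.optimize import curve_fit

def bell_shaped(x, bottom, top, c1, b1, c2, b2):
    rise = 1.0 / (1.0 + (c1 / x) ** b1)    # first (ascending) dose response
    fall = 1.0 / (1.0 + (x / c2) ** b2)    # second (descending) dose response
    return bottom + (top - bottom) * rise * fall

rng = np.random.default_rng(5)
x = np.logspace(-3, 3, 15)
y = bell_shaped(x, 0.0, 100.0, 0.05, 1.2, 50.0, 1.5) + rng.normal(0.0, 2.0, x.size)

p0 = [0.0, 100.0, 0.1, 1.0, 20.0, 1.0]
popt, _ = curve_fit(bell_shaped, x, y, p0=p0, maxfev=5000)
print(dict(zip(["bottom", "top", "C1", "B1", "C2", "B2"], popt)))
```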

Fig 6: Fitting a data set with a composite model


Summary

IRLS provides an advanced technique for reducing and neutralizing the effect of outliers on a fit. By weighting individual data points, IRLS can increase the accuracy of fit results compared to those achieved using standard regression (standard non-linear LSF). Both techniques, however, must be applied to a well-defined and complete data set in order to produce quality results.

Curve fitting is a flexible process offering a range of data analysis types, and researchers do not have to be constrained by standard analysis techniques. Complex models extend data fitting and analysis beyond basic Michaelis-Menten and dose response models, providing a variety of innovative ways of extracting the required results in varying scenarios across a wide range of applications.

Fig 7: A bell-shaped dose response model producing two fit results