System Identification and Curve Fitting with a Genetic Algorithm Hierarchy
Alice E. Smith and Mehmet Gulsen
Department of Industrial Engineering
University of Pittsburgh
INFORMS Fall 1997
Curve Fitting
Process of fitting a closed-form function to a given data set of independent variables and a dependent variable (variable selection, closed-form function selection, coefficient estimation).
Used for:
- System identification
- Judging the strength of a relationship
- Identifying main variables and interactions between variables
- Interpolating/extrapolating to new data
Conventional Approaches
- Various regression techniques
- Time series analysis
- Spline fitting
- Neural networks
Genetic Algorithm Hierarchy
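[Diagram: the Upper Module (function and variable selection) sends candidate functions to the Lower Module (coefficient estimation), which returns optimized coefficients for those functions. Example shown: a candidate form in x1 and x2 with a cosine term and coefficients c1 to c4 comes back with fitted values 9.234, 2.123, 0.093, 4.823 and SSE = 0.346271.]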
Search Structure
[Diagram: the Upper GA search works on the Upper GA population; the Lower GA search works on Lower GA populations (sizes n1, n2, ..., n) and is evaluated against the data.]
Genetic Search Process
[Diagram: an initial population of size n produces offspring and mutants (n1 and n2 new members); the best n of the combined pool form the final population of size n. Selection labels: Top Half Selection, Uniform Selection.]
Upper GA - Function Selection
Explore the possible functional forms that could represent the underlying relationship between independent and dependent variables of a data set
Objective Function: Minimize “adjusted” total error corresponding to the functional form. Adjustment is performed by penalizing more complex representations (more variables, higher order terms)
Stopping Criteria: Search is terminated when no improvement is observed for a specific number of generations
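The stopping rule can be illustrated with a minimal Python sketch. The names `run_until_stalled`, `step` and `patience` are hypothetical and do not come from the slides; `step` stands in for one generation of the search and returns that generation's best (lowest) adjusted error.

```python
def run_until_stalled(step, patience):
    """Run generations until `patience` consecutive generations bring no improvement.

    `step` is a hypothetical callable that performs one generation and
    returns the best (lowest) error found in it.
    """
    best = float("inf")
    stalled = 0
    generations = 0
    while stalled < patience:
        error = step()
        generations += 1
        if error < best:
            best = error      # improvement: reset the stall counter
            stalled = 0
        else:
            stalled += 1      # no improvement in this generation
    return best, generations


# Toy error sequence standing in for a real search.
errors = iter([5.0, 4.0, 4.0, 3.5, 3.5, 3.5, 3.5, 1.0])
best, generations = run_until_stalled(lambda: next(errors), patience=3)
```

With this toy sequence the loop stops after three generations without improvement, so the later value 1.0 is never reached. That is the usual trade-off of this kind of criterion.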
Upper GA Function Selection - Encoding
Tree structure: $y = C_1 + C_2 x_1^3 + C_3 \cos(C_4 x_2) + C_5 x_1 x_2$
[Diagram: expression tree for this function, with + and * operator nodes, a cos node, and leaves C1 to C5, x1, x2 and 1.]
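Upper GA Function Selection - Penalty Function
[Diagram: the expression tree for $y = C_1 + C_2 x_1^3 + C_3 \cos(C_4 x_2) + C_5 x_1 x_2$ from the encoding slide.]
Penalty = $\left( \dfrac{\text{number of nodes}}{m} \right)^{\text{penalty factor}}$, where $m$ is a constant
Example: $(14/5)^{0.05} = 1.0528$
Penalty Factor = 0.05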
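Reading the slide's example as (14 / 5) raised to the power 0.05, which gives about 1.0528, the penalty can be sketched in Python as follows. Applying the penalty as a multiplier on the raw error is an assumption made here for illustration, since the slides only say the error is "adjusted". The function names are hypothetical.

```python
def complexity_penalty(num_nodes, m=5, penalty_factor=0.05):
    """(number of nodes / constant m) ** penalty factor, as read from the slide."""
    return (num_nodes / m) ** penalty_factor


def adjusted_error(total_error, num_nodes, m=5, penalty_factor=0.05):
    # Assumption: the penalty scales the raw error multiplicatively.
    return total_error * complexity_penalty(num_nodes, m, penalty_factor)


penalty = complexity_penalty(14)  # the slide's example: 14 nodes, m = 5
```

Because the exponent is small, the penalty grows slowly with tree size: a tree three times larger is penalized by only a few percent.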
Upper GA Function Selection - Crossover
Before:
Parent 1: $y = C_1 + C_2 x_1^3 + C_3 \cos(C_4 x_2) + C_5 x_1 x_2$
Parent 2: a tree over x1 and x2 with sin, ln and division nodes and coefficients C1 to C5
After:
Offspring 1: a seven-coefficient function (C1 to C7) containing ln and cos terms
Offspring 2: a three-coefficient function (C1 to C3) containing a sin term
[Diagram: the two parent trees with the exchanged subtrees marked "crossover".]
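Subtree crossover of this kind can be sketched in Python on nested-tuple trees. This is a generic illustration, not the authors' implementation: the tuple representation, the uniform choice of crossover points and all function names are assumptions.

```python
import random

def paths(node, prefix=()):
    """Yield the index path to every node of a nested-tuple tree."""
    yield prefix
    if isinstance(node, tuple):
        for i, child in enumerate(node[1:], start=1):
            yield from paths(child, prefix + (i,))


def get(node, path):
    for i in path:
        node = node[i]
    return node


def replace(node, path, subtree):
    """Return a copy of `node` with the subtree at `path` swapped out."""
    if not path:
        return subtree
    i = path[0]
    return node[:i] + (replace(node[i], path[1:], subtree),) + node[i + 1:]


def size(node):
    return 1 + sum(size(c) for c in node[1:]) if isinstance(node, tuple) else 1


def crossover(parent1, parent2, rng):
    # Assumption: crossover points are drawn uniformly over all nodes.
    p1 = rng.choice(list(paths(parent1)))
    p2 = rng.choice(list(paths(parent2)))
    child1 = replace(parent1, p1, get(parent2, p2))
    child2 = replace(parent2, p2, get(parent1, p1))
    return child1, child2


parent1 = ("+", "C1", ("*", "C2", ("cos", "x2")))
parent2 = ("sin", ("/", "x1", ("ln", "x2")))
child1, child2 = crossover(parent1, parent2, random.Random(0))
```

After an exchange like this, coefficient leaves would typically be renumbered, which may be why the offspring on the slide carry a different number of coefficients than their parents.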
Upper GA Function Selection - Mutation
Before:
Parent 1: $y = C_1 + C_2 x_1^3 + C_3 \cos(C_4 x_2) + C_5 x_1 x_2$
After:
Mutant: a seven-coefficient function (C1 to C7) that contains cos and exp terms
[Diagram: the parent tree and the randomly generated tree that mutation inserts into it.]
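Mutation by inserting a randomly generated tree can be sketched as below. The operator set, the terminal set, the depth limit and the 0.3 probabilities are all assumptions chosen for the example; the slides do not state them.

```python
import random

FUNCTIONS = {"+": 2, "*": 2, "cos": 1, "exp": 1}  # operator -> arity (assumed set)
TERMINALS = ["x1", "x2", "C"]                      # "C" marks a coefficient leaf


def random_tree(rng, depth):
    """Grow a random tree of at most `depth` operator levels."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMINALS)
    op = rng.choice(sorted(FUNCTIONS))
    return (op,) + tuple(random_tree(rng, depth - 1) for _ in range(FUNCTIONS[op]))


def mutate(tree, rng, depth=2):
    """Replace one randomly chosen subtree with a freshly generated tree."""
    if not isinstance(tree, tuple) or rng.random() < 0.3:
        return random_tree(rng, depth)
    i = rng.randrange(1, len(tree))  # descend into a random child
    return tree[:i] + (mutate(tree[i], rng, depth),) + tree[i + 1:]


def is_valid(node):
    """Check that every operator has the right number of children."""
    if isinstance(node, str):
        return node in TERMINALS
    return (node[0] in FUNCTIONS and len(node) - 1 == FUNCTIONS[node[0]]
            and all(is_valid(child) for child in node[1:]))


parent = ("+", "C", ("*", "x1", ("cos", "x2")))
mutant = mutate(parent, random.Random(1))
```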
Lower GA - Coefficient Estimation
Estimate the coefficients of a given closed form function which minimize the total error over the set of data points
Objective Function: Minimize total squared error
Minimize $\sum_{i=1}^{K} \left( y_{\text{actual},i} - y_{\text{model},i} \right)^2$
K: number of data points
Stopping Criteria: Search is terminated when no improvement is observed for a specific number of generations
Detailed results are published in “International Journal of Production Research”, Vol. 33, No. 7, 1995
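The lower GA's objective, the total squared error over the K data points, can be written as a short Python function. The names `total_squared_error`, `model` and `data` are illustrative, and the three data points are made up for the example.

```python
def total_squared_error(model, data):
    """Sum of (y_actual - y_model)^2 over all (x, y_actual) pairs."""
    return sum((y_actual - model(x)) ** 2 for x, y_actual in data)


# Made-up data and a candidate model y = 1 + 2x, purely for illustration.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.5)]
model = lambda x: 1.0 + 2.0 * x
sse = total_squared_error(model, data)
```

Only the last point is off the line, by 0.5, so the total squared error here is 0.25.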
Lower GA Coefficient Estimation - Encoding
$y = C_1 + C_2 x_1^3 + C_3 \cos(C_4 x_2) + C_5 x_1 x_2$
Encoded as: C1 C2 C3 C4 C5
Lower GA - Selection/Breeding
- Parents are selected for breeding uniformly from the superior half of the population
- The values of the offspring's coefficients are determined by calculating the arithmetic mean of the corresponding coefficients of the two parents
Parent A: 45.876 32.958 12.098 -3.892 0.2356
Parent B: 12.988 35.832 0.234 -12.984 2.4576
Offspring: 29.432 34.395 6.166 -8.438 1.3466
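The breeding step can be reproduced in a few lines of Python. The averaging follows the slide directly. The `breed` helper is a sketch, and drawing two distinct parents is an assumption.

```python
import random

def average(parent_a, parent_b):
    """Offspring coefficients: arithmetic mean of the parents' coefficients."""
    return [(a + b) / 2 for a, b in zip(parent_a, parent_b)]


def breed(population, errors, rng):
    """Pick two parents uniformly from the better half and average them."""
    ranked = [c for _, c in sorted(zip(errors, population), key=lambda pair: pair[0])]
    top_half = ranked[: max(2, len(ranked) // 2)]
    # Assumption: the two parents are distinct members of the top half.
    parent_a, parent_b = rng.sample(top_half, 2)
    return average(parent_a, parent_b)


parent_a = [45.876, 32.958, 12.098, -3.892, 0.2356]
parent_b = [12.988, 35.832, 0.234, -12.984, 2.4576]
offspring = average(parent_a, parent_b)
```

Averaging always places the offspring between its parents, so this operator contracts the population. New regions of the search space have to come from mutation.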
Lower GA - Mutation
- Perturbing existing solutions to explore new regions of the search space
- Perturbation value is obtained by multiplying the current population range by a random factor
[Diagram: each coefficient C1 to C5 is perturbed by its own random factor, k1 to k5, times the corresponding range.]
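A minimal Python sketch of this perturbation is given below. The slides do not state how the random factor is distributed. Drawing it uniformly from [-1, 1] is an assumption, as are the names `mutate` and `value_range`.

```python
import random

def mutate(coeffs, population, rng):
    """Perturb each coefficient by a random factor times its population range."""
    mutant = []
    for i, c in enumerate(coeffs):
        values = [member[i] for member in population]
        value_range = max(values) - min(values)  # current range of coefficient i
        k = rng.uniform(-1.0, 1.0)               # assumed distribution of the factor
        mutant.append(c + k * value_range)
    return mutant


parent_a = [45.876, 32.958, 12.098, -3.892, 0.2356]
parent_b = [12.988, 35.832, 0.234, -12.984, 2.4576]
mutant = mutate(parent_a, [parent_a, parent_b], random.Random(0))
```

Tying the step size to the population range makes mutation self-scaling: steps are large while the population is spread out and shrink as it converges.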
Test Problem
C     Run 1    Run 2    Run 3    Run 4    Run 5    Run 6    Mean     Std. Dev.
1     9.986    9.998    10.002   10.000   9.996    10.001   9.997    0.005
2     9.999    10.000   10.000   10.000   10.000   10.000   10.000   0.000
3     10.000   10.000   10.000   10.000   10.000   10.000   10.000   0.000
4     10.000   10.000   10.000   10.000   10.000   10.000   10.000   0.000
5     10.000   10.000   10.000   10.000   10.000   10.000   10.000   0.000
6     10.000   10.000   10.000   10.000   10.000   10.000   10.000   0.000
7     10.000   10.000   10.000   10.000   10.000   10.000   10.000   0.000
8     10.000   10.000   10.000   10.000   10.000   10.000   10.000   0.000
9     10.000   10.000   10.000   10.000   10.000   10.000   10.000   0.000
10    10.000   10.000   10.000   10.000   10.000   10.000   10.000   0.000
SE    0.0017   0.000    0.0000   0.000    0.000    0.000    0.000    -
$y = C_1 + C_2 x_1 + C_3 x_2 + C_4 x_3 + C_5 x_1^2 + C_6 x_2^2 + C_7 x_3^2 + C_8 x_1 x_2 + C_9 x_1 x_3 + C_{10} x_2 x_3$
Test Problem: Different Error Metrics
[Figure: log10 of squared error versus number of generations (0 to 1500), with one curve each for the squared error, absolute error and maximum error metrics.]
Test Problem: Different Numbers of Data Points
[Figure: log10 of squared error versus number of generations (0 to 4000), with one curve each for 25 points and 100 points.]
Empirical Data Sets
Five benchmark problems from the literature:
1. onion growth
2. children growth
3. sunspots
4. chemical plant
5. slip casting
- Single variable/50 observations to 13 variables/1000 observations
- Nonlinear regression, time series analysis, model identification
- Sunspot data from 1700 to 1995
- Highly cyclic, with peaks and troughs recurring approximately every 11.1 years
- The cycle is not symmetric: the count rises to its maximum faster than it falls to its minimum
- Training range: 1700-1979
- Validation range: 1980-1995
Test Problem 3, Sunspot Data
Functions Identified
In each equation, (t-k) denotes the series value k years before year t.
Model A (SSE = 61964):
$1.1965\,(t-1) - 0.4585\,(t-2) + 0.2471\,(t-9)$
Model B (SSE = 45533):
$0.8337\,(t-1) - 1.1989\exp(-0.3512\,(t-9)) + 15.7476\exp(-2.7260\exp(-0.3263\,(t-1)) - 0.6271\,(t-2))$
Model C (SSE = 40341):
$1.2410\exp(-0.6099\cos(1.4097\,(t-4) + 0.4282\,(t-1)) - 0.8446\,(t-2))\,(t-1) + 0.8064\,(t-1) - 0.1316\,(t-4) + 0.1148\,(t-9)$
Model D (SSE = 38715):
$1.6258\exp(-0.5564\cos(-1.4893\,(t-4) + 0.6979\exp(-3.3442\cos(0.2561\,(t-4)) + 2.8807\,(t-2)) - 3.1756\,(t-2)) - 0.7485\,(t-2)) - 0.9362\,(t-2)\,(t-1) + 0.8253\,(t-1) - 0.1413\,(t-4) + 0.1046\,(t-9)$
Model D
[Figure: x-axis: Year (1700 to 2000); y-axis: 20 x Annual Number of Sunspots (0 to 10).]
Extrapolation of Model D
[Figure: data and fitted function versus year, 1980 to 1995.]
Conclusions
- A unique approach for curve fitting problems
- Provides a closed-form function for the given data set
- Can handle non-linear, discontinuous functions
- Flexible in terms of error metric
- Can be used separately for function selection and coefficient optimization
- Computationally intensive and needs a priori setting of search parameters and penalty function components
Forthcoming paper: “A hierarchical genetic algorithm for system identification and curve fitting with a supercomputer implementation,” Mehmet Gulsen and Alice E. Smith, Institute for Mathematics and its Applications, Volumes in Mathematics and its Applications, Volume on Evolutionary Computing.