DESCRIPTION
This is the Tutorial given at the FedCSIS2011 in Poland.
Intro Classic Algorithms Metaheuristic Markov Analysis All and NFL Constraints Applications Thanks
Nature-Inspired Metaheuristic Algorithms for Optimization and Computational Intelligence
Xin-She Yang
National Physical Laboratory, UK
@ FedCSIS2011
Xin-She Yang FedCSIS2011
Metaheuristics and Computational Intelligence
Intro

Computational science is now the third paradigm of science, complementing theory and experiment.
– Ken Wilson (Cornell University), Nobel Laureate

All models are wrong, but some are useful.
– George Box, Statistician

All algorithms perform equally well on average over all possible functions. Not quite! (more later)
– No-free-lunch theorems (Wolpert & Macready)
Overview

Part I
Introduction
Metaheuristic Algorithms
Monte Carlo and Markov Chains
Algorithm Analysis

Part II
Exploration & Exploitation
Dealing with Constraints
Applications
Discussions & Bibliography
A Perfect Algorithm

What is the best relationship among E, m and c?
Initial state: m, E, c =⇒ E = mc²

Steepest Descent (which descent curve minimizes the travel time?):

min t = \int_0^d \frac{1}{v}\, ds = \int_0^d \frac{\sqrt{1 + y'^2}}{\sqrt{2g[h - y(x)]}}\, dx

=⇒ the cycloid:

x = \frac{A}{2}(\theta - \sin\theta), \qquad y = h - \frac{A}{2}(1 - \cos\theta)
Computing in Reality

A Problem & Problem Solvers
⇓
Mathematical/Numerical Models
⇓
Computer & Algorithms & Programming
⇓
Validation
⇓
Results
What is an Algorithm?

Essence of an Optimization Algorithm
To move to a new, better point x_{i+1} from an existing known location x_i.

Population-based algorithms use multiple, interacting paths.

Different algorithms: different strategies/approaches in generating these moves!
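This move-generation view can be sketched in a few lines: propose a random move from the current point and keep it only if it improves the objective. The function name, objective and parameter values below are illustrative, not from the tutorial.

```python
import random

def hill_climb(f, x0, step=0.1, iters=1000, seed=0):
    """Simplest move generation: propose a random candidate x_{i+1}
    near x_i and accept it only if it improves f."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    for _ in range(iters):
        x_new = x + rng.gauss(0.0, step)   # candidate move x_{i+1}
        f_new = f(x_new)
        if f_new < fx:                     # keep only better points
            x, fx = x_new, f_new
    return x, fx

# Example: minimize f(x) = (x - 2)^2 starting from x = 0
x_best, f_best = hill_climb(lambda x: (x - 2.0) ** 2, 0.0)
```

Every algorithm in this tutorial can be read as a different rule for proposing and accepting such moves.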
Optimization is Like Treasure Hunting

How to find a treasure, a hidden 1 million dollars? What is your best strategy?
Optimization Algorithms

Deterministic
Newton's method (1669, published in 1711), Newton–Raphson (1690), hill-climbing/steepest descent (Cauchy 1847), least-squares (Gauss 1795),
linear programming (Dantzig 1947), conjugate gradient (Lanczos et al. 1952), interior-point method (Karmarkar 1984), etc.
Stochastic/Metaheuristic

Genetic algorithms (1960s/1970s), evolution strategy (Rechenberg & Schwefel 1960s), evolutionary programming (Fogel et al. 1960s).
Simulated annealing (Kirkpatrick et al. 1983), Tabu search (Glover 1980s), ant colony optimization (Dorigo 1992), genetic programming (Koza 1992), particle swarm optimization (Kennedy & Eberhart 1995), differential evolution (Storn & Price 1996/1997),
harmony search (Geem et al. 2001), honeybee algorithm (Nakrani & Tovey 2004), ..., firefly algorithm (Yang 2008), cuckoo search (Yang & Deb 2009), ...
Steepest Descent/Hill Climbing

Gradient-Based Methods
Use gradient/derivative information – very efficient for local search.
Newton's Method

x_{n+1} = x_n - H^{-1}\nabla f, \qquad
H = \begin{pmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_n^2}
\end{pmatrix}.

Quasi-Newton
If H is replaced by I, we have
x_{n+1} = x_n - \alpha I \nabla f(x_n).
Here \alpha controls the step length.

Generation of new moves by gradient.
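The contrast between the full Newton step and the simplified α·I step can be seen on a one-dimensional toy problem; the function names and the quadratic objective below are illustrative sketches, not part of the tutorial.

```python
def newton_min(df, d2f, x0, iters=20):
    """Newton's method in 1D: x_{n+1} = x_n - f'(x_n)/f''(x_n).
    In n dimensions, f'' becomes the Hessian H."""
    x = x0
    for _ in range(iters):
        x = x - df(x) / d2f(x)
    return x

def quasi_newton_min(df, x0, alpha=0.1, iters=200):
    """Replacing H^{-1} by alpha*I gives the simple gradient step
    x_{n+1} = x_n - alpha * f'(x_n)."""
    x = x0
    for _ in range(iters):
        x = x - alpha * df(x)
    return x

# f(x) = x^2 - 4x has its minimum at x = 2: f'(x) = 2x - 4, f''(x) = 2
x_newton = newton_min(lambda x: 2 * x - 4, lambda x: 2.0, 10.0)
x_simple = quasi_newton_min(lambda x: 2 * x - 4, 10.0)
```

On this quadratic, Newton's method lands on the minimum in a single step, while the fixed-step version only approaches it geometrically, which is the usual trade-off between curvature information and cheap iterations.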
Steepest Descent Method (Cauchy 1847, Riemann 1863)

From the Taylor expansion of f(x) about x^{(n)}, we have
f(x^{(n+1)}) = f(x^{(n)} + \Delta s) \approx f(x^{(n)}) + (\nabla f(x^{(n)}))^T \Delta s,
where \Delta s = x^{(n+1)} - x^{(n)} is the increment vector. To guarantee a decrease we need
f(x^{(n)} + \Delta s) - f(x^{(n)}) = (\nabla f)^T \Delta s < 0.
Therefore, we choose
\Delta s = -\alpha \nabla f(x^{(n)}),
where \alpha > 0 is the step size. In the case of finding maxima, this method is often referred to as hill-climbing.
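The update Δs = −α∇f(x^{(n)}) translates directly into a loop; the objective f(x, y) = x² + 2y² and the step size below are illustrative choices, not from the slides.

```python
def grad_descent(grad, x, alpha=0.1, iters=500):
    """Steepest descent: repeatedly apply Delta s = -alpha * grad f(x^(n))."""
    for _ in range(iters):
        g = grad(x)
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
    return x

# f(x, y) = x^2 + 2y^2 has gradient (2x, 4y) and its minimum at (0, 0)
x_min = grad_descent(lambda p: (2 * p[0], 4 * p[1]), [3.0, -2.0])
```

With α = 0.1 each coordinate shrinks by a constant factor per iteration, so the iterates converge linearly to the minimizer (0, 0).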
Conjugate Gradient (CG) Method

Belongs to the Krylov subspace iteration methods. The conjugate gradient method was pioneered by Magnus Hestenes, Eduard Stiefel and Cornelius Lanczos in the 1950s. It was named as one of the top 10 algorithms of the 20th century.

A linear system with a symmetric positive definite matrix A,
A u = b,
is equivalent to minimizing the following function f(u):
f(u) = \frac{1}{2} u^T A u - b^T u + v,
where v is a constant and can be taken to be zero. We can easily see that \nabla f(u) = 0 leads to A u = b.
CG

The theory behind these iterative methods is closely related to the Krylov subspace K_n spanned by A and b, defined by
K_n(A, b) = \{Ib, Ab, A^2 b, ..., A^{n-1} b\},
where A^0 = I. If we use an iterative procedure to obtain the approximate solution u_n to Au = b at the nth iteration, the residual is given by
r_n = b - A u_n,
which is essentially the negative gradient -\nabla f(u_n).

The search direction vector in the conjugate gradient method is subsequently determined by
d_{n+1} = r_n - \frac{d_n^T A r_n}{d_n^T A d_n}\, d_n.
The solution often starts with an initial guess u_0 at n = 0, and proceeds iteratively. The above steps can compactly be written as
u_{n+1} = u_n + \alpha_n d_n, \qquad r_{n+1} = r_n - \alpha_n A d_n,
and
d_{n+1} = r_{n+1} + \beta_n d_n,
where
\alpha_n = \frac{r_n^T r_n}{d_n^T A d_n}, \qquad \beta_n = \frac{r_{n+1}^T r_{n+1}}{r_n^T r_n}.
Iterations stop when a prescribed accuracy is reached.
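The compact update rules above can be sketched directly in code. This is a minimal plain-Python version (no linear-algebra library) using an illustrative 2×2 system, not an example from the tutorial.

```python
def conj_grad(A, b, iters=25, tol=1e-10):
    """Conjugate gradient for A u = b with A symmetric positive definite.
    Follows u_{n+1} = u_n + a_n d_n, r_{n+1} = r_n - a_n A d_n,
    d_{n+1} = r_{n+1} + b_n d_n."""
    n = len(b)
    dot = lambda p, q: sum(pi * qi for pi, qi in zip(p, q))
    matvec = lambda M, v: [dot(row, v) for row in M]
    u = [0.0] * n
    r = [bi - avi for bi, avi in zip(b, matvec(A, u))]  # r0 = b - A u0
    d = r[:]                                            # d0 = r0
    for _ in range(iters):
        Ad = matvec(A, d)
        alpha = dot(r, r) / dot(d, Ad)
        u = [ui + alpha * di for ui, di in zip(u, d)]
        r_new = [ri - alpha * adi for ri, adi in zip(r, Ad)]
        if dot(r_new, r_new) < tol:                     # prescribed accuracy
            break
        beta = dot(r_new, r_new) / dot(r, r)
        d = [ri + beta * di for ri, di in zip(r_new, d)]
        r = r_new
    return u

# A = [[4,1],[1,3]], b = [1,2] has the exact solution (1/11, 7/11)
u = conj_grad([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

For an n × n SPD system, CG reaches the exact solution (in exact arithmetic) in at most n steps; here it converges in two iterations.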
Gradient-free Methods

Gradient-based methods
Require derivative information. Not suitable for problems with discontinuities.

Gradient-free or derivative-free methods
BFGS, downhill simplex, trust-region, SQP ...
Nelder–Mead Downhill Simplex Method

The Nelder–Mead method is a downhill simplex algorithm, first developed by J. A. Nelder and R. Mead in 1965.

A Simplex
In n-dimensional space, a simplex, which is a generalization of a triangle on a plane, is a convex hull of n + 1 distinct points. For simplicity, a simplex in n-dimensional space is referred to as an n-simplex.
Downhill Simplex Method

(figure: reflection x_r, expansion x_e and contraction x_c of a simplex about the worst vertex x_{n+1})

The first step is to rank and re-order the vertex values
f(x_1) \le f(x_2) \le \cdots \le f(x_{n+1}),
at x_1, x_2, ..., x_{n+1}, respectively.
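A compact sketch of the ranking plus reflection/expansion/contraction/shrink cycle is below, using the standard textbook coefficients (1, 2, 1/2, 1/2); the test function and starting simplex are illustrative, not from the slides.

```python
def nelder_mead(f, simplex, iters=200):
    """Minimal Nelder-Mead sketch; 'simplex' is a list of n+1 points."""
    for _ in range(iters):
        simplex.sort(key=f)              # rank: f(x_1) <= ... <= f(x_{n+1})
        best, worst = simplex[0], simplex[-1]
        n = len(best)
        # centroid of all vertices except the worst
        xo = [sum(p[i] for p in simplex[:-1]) / n for i in range(n)]
        xr = [xo[i] + (xo[i] - worst[i]) for i in range(n)]          # reflection
        if f(best) <= f(xr) < f(simplex[-2]):
            simplex[-1] = xr
        elif f(xr) < f(best):
            xe = [xo[i] + 2.0 * (xr[i] - xo[i]) for i in range(n)]   # expansion
            simplex[-1] = xe if f(xe) < f(xr) else xr
        else:
            xc = [xo[i] + 0.5 * (worst[i] - xo[i]) for i in range(n)]  # contraction
            if f(xc) < f(worst):
                simplex[-1] = xc
            else:                        # shrink all vertices towards the best
                simplex = [best] + [[best[i] + 0.5 * (p[i] - best[i])
                                     for i in range(n)] for p in simplex[1:]]
    simplex.sort(key=f)
    return simplex[0]

# minimize f(x, y) = (x - 1)^2 + (y - 2)^2 from a small starting simplex
best = nelder_mead(lambda p: (p[0] - 1) ** 2 + (p[1] - 2) ** 2,
                   [[0.0, 0.0], [1.2, 0.0], [0.0, 0.8]])
```

Note that no derivative of f is ever evaluated: the simplex moves purely on function-value comparisons, which is exactly what makes the method derivative-free.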
Metaheuristic

Most are nature-inspired, mimicking certain successful features in nature.

Simulated annealing
Genetic algorithms
Ant and bee algorithms
Particle swarm optimization
Firefly algorithm and cuckoo search
Harmony search ...
Simulated Annealing

Metal annealing to increase strength =⇒ simulated annealing.

Probabilistic move: p ∝ exp[−∆E/(k_B T)].
k_B = Boltzmann constant (e.g., k_B = 1), T = temperature, E = energy.
E ∝ f(x), T = T_0 α^t (cooling schedule), (0 < α < 1).
As T → 0, p → 0 =⇒ hill climbing.

This is essentially a Markov chain. Generation of new moves by Markov chain.
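The acceptance rule and geometric cooling schedule above can be sketched as follows; the double-well test function, step size and cooling parameters are illustrative choices, not from the tutorial.

```python
import math
import random

def simulated_annealing(f, x0, T0=10.0, alpha=0.95, iters=3000, seed=42):
    """SA sketch: accept worse moves with probability exp(-dE/T), with
    the geometric cooling schedule T = T0 * alpha^t (taking k_B = 1)."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best, fbest = x, fx
    T = T0
    for _ in range(iters):
        x_new = x + rng.gauss(0.0, 0.5)        # random move (Markov chain step)
        f_new = f(x_new)
        dE = f_new - fx
        if dE < 0 or rng.random() < math.exp(-dE / T):
            x, fx = x_new, f_new               # accept (possibly worse) move
            if fx < fbest:
                best, fbest = x, fx
        T = max(T * alpha, 1e-9)               # cool down; T -> 0 => hill climbing
    return best, fbest

# double-well function x^4 - 4x^2 + x: global minimum near x = -1.47
best, fbest = simulated_annealing(lambda x: x**4 - 4 * x**2 + x, 2.0)
```

Early on, the high temperature lets the chain jump between wells; as T shrinks, the acceptance of uphill moves vanishes and the search degenerates into hill climbing inside one basin.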
An Example
(figure: simulated annealing demonstration)
Genetic Algorithms

(figures: crossover and mutation operators acting on chromosome strings)

Generation of new solutions by crossover, mutation and elitism.
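Crossover, mutation and elitism fit in a short loop; below is a sketch on the standard "one-max" toy problem (maximize the number of 1-bits). The problem choice, operator rates and tournament size are illustrative, not from the slides.

```python
import random

def ga_onemax(length=20, pop_size=30, gens=60, mu=0.05, seed=1):
    """GA sketch: tournament selection, one-point crossover,
    bit-flip mutation, and elitism (best individual always survives)."""
    rng = random.Random(seed)
    fitness = lambda ind: sum(ind)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(gens):
        elite = max(pop, key=fitness)
        new_pop = [elite[:]]                       # elitism
        while len(new_pop) < pop_size:
            p1 = max(rng.sample(pop, 3), key=fitness)   # tournament selection
            p2 = max(rng.sample(pop, 3), key=fitness)
            cut = rng.randrange(1, length)              # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [b ^ 1 if rng.random() < mu else b for b in child]  # mutation
            new_pop.append(child)
        pop = new_pop
    return max(pop, key=fitness)

best = ga_onemax()
```

Because the elite copy is never mutated, the best fitness in the population is non-decreasing from one generation to the next.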
Swarm Intelligence

Ants, bees, birds, fish ...
Simple rules lead to complex behaviour.
Cuckoo Search

Local random walk:
x_i^{t+1} = x_i^t + s \otimes H(p_a - \epsilon) \otimes (x_j^t - x_k^t),
where x_i, x_j, x_k are three different solutions, H(u) is a Heaviside function, \epsilon is a random number drawn from a uniform distribution, and s is the step size.

Global random walk via Lévy flights:
x_i^{t+1} = x_i^t + \alpha L(s, \lambda), \qquad L(s, \lambda) = \frac{\lambda \Gamma(\lambda) \sin(\pi \lambda / 2)}{\pi} \frac{1}{s^{1+\lambda}}, \quad (s \gg s_0).

Generation of new moves by Lévy flights, random walk and elitism.
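The local random walk is simple to sketch as a single update step. Here the Heaviside switch is applied per solution (one common reading of the formula); the function name and the numeric example are illustrative.

```python
def cuckoo_local_step(x_i, x_j, x_k, s, pa, eps):
    """One local random-walk step of cuckoo search:
    x_i^{t+1} = x_i^t + s * H(pa - eps) * (x_j^t - x_k^t),
    where H is the Heaviside step function."""
    H = 1.0 if pa - eps > 0 else 0.0        # move only when eps < pa
    return [xi + s * H * (xj - xk) for xi, xj, xk in zip(x_i, x_j, x_k)]

# eps >= pa: H = 0, so the solution is left unchanged
unchanged = cuckoo_local_step([1.0, 2.0], [3.0, 1.0], [2.0, 2.0], 0.5, 0.25, 0.9)
# eps < pa: the solution moves by s * (x_j - x_k)
moved = cuckoo_local_step([1.0, 2.0], [3.0, 1.0], [2.0, 2.0], 0.5, 0.25, 0.1)
```

Drawing eps uniformly means each nest is perturbed with probability p_a per generation, which is how the "fraction p_a of nests abandoned" behaviour arises.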
Monte Carlo Methods

Almost everyone has used Monte Carlo methods in some way ...
Measure temperatures, choose a product, ... Taste soup, wine ...
Markov Chains

Random walk – a drunkard's walk:
u_{t+1} = \mu + u_t + w_t,
where w_t is a random variable, and \mu is the drift.
For example, w_t \sim N(0, \sigma^2) (Gaussian).

(figures: a sample path of a 1D random walk over 500 steps, and a 2D random-walk trajectory in the plane)
Markov Chains

Markov chain: the next state only depends on the current state and the transition probability:
P(i, j) \equiv P(V_{t+1} = S_j \mid V_0 = S_p, ..., V_t = S_i) = P(V_{t+1} = S_j \mid V_t = S_i),
=⇒ P_{ij} \pi_i^* = P_{ji} \pi_j^*, where \pi^* is the stationary probability distribution.

Example: Brownian motion
u_{i+1} = \mu + u_i + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2).
Markov Chains

Monopoly (board games): the movement of pieces around the board can be modelled as a Markov chain.
Markov Chain Monte Carlo

Landmarks: Monte Carlo method (1930s, 1945, from 1950s), e.g., Metropolis algorithm (1953), Metropolis–Hastings (1970).
Markov Chain Monte Carlo (MCMC) methods – a class of methods.
Really took off in the 1990s, now applied to a wide range of areas: physics, Bayesian statistics, climate change, machine learning, finance, economics, medicine, biology, materials and engineering ...
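A minimal Metropolis sampler (the 1953 algorithm mentioned above) fits in a few lines; the target density, proposal width and sample count are illustrative choices.

```python
import math
import random

def metropolis(log_p, x0=0.0, n=10000, step=1.0, seed=7):
    """Metropolis sketch: symmetric Gaussian proposal, accept with
    probability min(1, p(x') / p(x)) -- computed in log space."""
    rng = random.Random(seed)
    x, lp = x0, log_p(x0)
    samples = []
    for _ in range(n):
        x_new = x + rng.gauss(0.0, step)
        lp_new = log_p(x_new)
        if rng.random() < math.exp(min(0.0, lp_new - lp)):  # accept/reject
            x, lp = x_new, lp_new
        samples.append(x)                                   # keep current state
    return samples

# target: standard normal, log p(x) = -x^2/2 up to an additive constant
samples = metropolis(lambda x: -0.5 * x * x)
```

The chain's stationary distribution is the target p, so long-run sample averages estimate expectations under p even though p is only known up to a constant.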
Convergence Behaviour

As the MCMC runs, convergence may be reached.
When does a chain converge? When to stop the chain ... ?
Are multiple chains better than a single chain?

(figure: sample paths of an MCMC run)
Convergence Behaviour

(figure: chains started further and further in the past, t = −n, ..., −2, 0, 2, coupling into a converged state)

Multiple, interacting chains
Multiple agents trace multiple, interacting Markov chains during the Monte Carlo process.
Analysis

Classifications of Algorithms
Trajectory-based: hill-climbing, simulated annealing, pattern search ...
Population-based: genetic algorithms, ant & bee algorithms, artificial immune systems, differential evolution, PSO, HS, FA, CS, ...

Ways of Generating New Moves/Solutions
Markov chains with different transition probabilities.
Trajectory-based =⇒ a single Markov chain; population-based =⇒ multiple, interacting chains.
Tabu search (with memory) =⇒ self-avoiding Markov chains.
Ergodicity

Markov Chains & Markov Processes
Most theoretical studies use Markov chains/processes as a framework for convergence analysis.

A Markov chain is said to be regular if some positive power k of the transition matrix P has only positive elements.
A chain is called time-homogeneous if its transition matrix P is the same after each step; thus the transition probability after k steps becomes P^k.
A chain is ergodic or irreducible if it is aperiodic and positive recurrent – it is possible to reach every state from any state.
Convergence Behaviour

As k → ∞, we have the stationary probability distribution π:
π = πP =⇒ the first eigenvalue is always 1.

Asymptotic convergence to optimality:
lim_{k→∞} θ_k → θ^* (with probability one).

The rate of convergence is usually determined by the second eigenvalue 0 < λ_2 < 1.
An algorithm can converge, but may not necessarily be efficient, as the rate of convergence is typically low.
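The fixed point π = πP can be found by power iteration on a small transition matrix; the two-state chain below is an illustrative example, not one from the slides.

```python
def stationary(P, iters=200):
    """Power-iterate pi <- pi P; for an ergodic chain this converges to the
    stationary distribution (the left eigenvector with eigenvalue 1).
    The convergence rate is set by the second eigenvalue of P."""
    n = len(P)
    pi = [1.0 / n] * n                   # start from the uniform distribution
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

# two-state chain: solving pi = pi P by hand gives pi = (2/3, 1/3)
pi = stationary([[0.9, 0.1], [0.2, 0.8]])
```

Here the second eigenvalue of P is 0.7, so the error shrinks by a factor 0.7 per step, illustrating how λ_2 governs the speed (not just the fact) of convergence.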
Convergence of GA
Convergence of GA
Important studies by Aytug et al. (1996)1, Aytug and Koehler(2000)2, Greenhalgh and Marschall (2000)3, Gutjahr (2010),4 etc.5
The number of iterations t(ζ) in GA with a convergenceprobability of ζ can be estimated by
t(ζ) ≤
⌈
ln(1− ζ)
ln
{
1−min[(1− µ)Ln, µLn]
}
⌉
,
where µ=mutation rate, L=string length, and n=population size.
1 H. Aytug, S. Bhattacharyya and G. J. Koehler, A Markov chain analysis of genetic algorithms with power of 2 cardinality alphabets, Euro. J. Operational Research, 96, 195-201 (1996).
2 H. Aytug and G. J. Koehler, New stopping criterion for genetic algorithms, Euro. J. Operational Research, 126, 662-674 (2000).
3 D. Greenhalgh and S. Marshall, Convergence criteria for genetic algorithms, SIAM J. Computing, 30, 269-282 (2000).
4 W. J. Gutjahr, Convergence analysis of metaheuristics, Annals of Information Systems, 10, 159-187 (2010).
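The bound above is straightforward to evaluate numerically. A minimal sketch, with arbitrary illustrative parameter values (log1p is used to keep precision when the min term is tiny):

```python
import math

def ga_iteration_bound(zeta, mu, L, n):
    """Upper bound on the number of GA iterations needed to converge
    with probability zeta (mutation rate mu, string length L, population n)."""
    p = min((1 - mu) ** (L * n), mu ** (L * n))
    # log1p(-p) avoids catastrophic loss of precision when p is very small
    return math.ceil(math.log(1 - zeta) / math.log1p(-p))

# Illustrative values only: 95% convergence probability, mu = 0.5, L = 4, n = 2
print(ga_iteration_bound(zeta=0.95, mu=0.5, L=4, n=2))  # 766
```

Note how quickly the bound blows up for realistic small mutation rates, since min[(1 − µ)^{Ln}, µ^{Ln}] then becomes astronomically small; this matches the remark that convergence can be guaranteed yet extremely slow.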
Multiobjective Metaheuristics
Asymptotic convergence of metaheuristics for multiobjective optimization (Villalobos-Arias et al. 2005)^6:
The transition matrix P of a metaheuristic algorithm has a stationary distribution π such that
|P^k_{ij} − π_j| ≤ (1 − ζ)^{k−1},  ∀ i, j,  (k = 1, 2, ...),
where ζ is a function of the mutation probability µ, string length L, and population size n. For example, ζ = 2nLµ^{nL}, so µ < 0.5.
Note: an algorithm satisfying this condition may not converge (for multiobjective optimization). However, an algorithm with elitism, obeying the above condition, does converge!
6 M. Villalobos-Arias, C. A. Coello Coello and O. Hernandez-Lerma, Asymptotic convergence of metaheuristics for multiobjective optimization problems, Soft Computing, 10, 1001-1005 (2005).
Other results
Limited results on convergence analysis exist, mostly for finite states/domains, concerning:
ant colony optimization,
generalized hill-climbers and simulated annealing,
best-so-far convergence of cross-entropy optimization,
the nested partition method, Tabu search, and,
of course, combinatorial optimization.
However, infinite states/domains and continuous problems pose far more challenging tasks.
Many, many open problems need satisfactory answers.
Converged?
‘Converged’ often means best-so-far convergence, not necessarily convergence to the global optimum.
In theory, a Markov chain can converge, but the number of iterations tends to be large.
In practice, only a finite (hopefully small) number of generations is used; even if the algorithm converges, it may not reach the global optimum.
How to avoid premature convergence
Equip an algorithm with the ability to escape a local optimum
Increase diversity of the solutions
Enough randomization at the right stage
....(unknown, new) ....
Coffee Break (15 Minutes)
All and NFL
So many algorithms – what are the common characteristics?
What are the key components?
How to use and balance different components?
What controls the overall behaviour of an algorithm?
Exploration and Exploitation
Characteristics of Metaheuristics
Exploration and Exploitation, or Diversification and Intensification.
Exploitation/Intensification
Intensive local search, exploiting local information, e.g., hill-climbing.
Exploration/Diversification
Exploratory global search, using randomization/stochastic components, e.g., hill-climbing with random restart.
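The two components combine in a few lines. A minimal sketch; the 1-D Rastrigin-style test function and all parameter values are illustrative assumptions:

```python
import math
import random

def rastrigin(x):
    # Multimodal test function: many local minima, global minimum f(0) = 0
    return x * x - 10 * math.cos(2 * math.pi * x) + 10

def hill_climb(f, x0, step=0.1, iters=200):
    """Pure exploitation: accept only improving local moves."""
    x, fx = x0, f(x0)
    for _ in range(iters):
        cand = x + random.uniform(-step, step)
        fc = f(cand)
        if fc < fx:
            x, fx = cand, fc
    return x, fx

def restart_hill_climb(f, lo, hi, restarts=30):
    """Random restarts add exploration on top of the local search."""
    best_x, best_f = None, float("inf")
    for _ in range(restarts):
        x, fx = hill_climb(f, random.uniform(lo, hi))  # random start: exploration
        if fx < best_f:
            best_x, best_f = x, fx
    return best_x, best_f
```

Pure hill-climbing from a single start typically gets stuck in whichever local basin it lands in; with enough random restarts the search usually escapes the poorer local minima of `rastrigin` on [−5, 5].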
Summary

[Diagram: algorithms placed on an exploitation-exploration spectrum, from pure exploitation (steepest descent, Newton-Raphson) through Nelder-Mead, Tabu search, genetic algorithms, PSO/FA, EP/ES, SA and Ant/Bee algorithms, to CS and pure exploration (uniform search).]
Best?
Free lunch?
No-Free-Lunch (NFL) Theorems
Algorithm Performance
Any algorithm is as good/bad as random search, when averaged over all possible problems/functions.
Finite domains
No universally efficient algorithm!
Any free taster or dessert?
Yes and no. (more later)
NFL Theorems (Wolpert and Macready 1997)
The search space is finite (though possibly quite large), thus the space of possible “cost” values is also finite. The objective function is f : X → Y, with F = Y^X (the space of all possible problems). Assumptions: finite domain, closed under permutation (c.u.p.).
For m iterations, the m distinct visited points form a time-ordered set
d_m = {(d^x_m(1), d^y_m(1)), ..., (d^x_m(m), d^y_m(m))}.
The performance of an algorithm a iterated m times on a cost function f is denoted by P(d^y_m | f, m, a).
For any pair of algorithms a and b, the NFL theorem states
Σ_f P(d^y_m | f, m, a) = Σ_f P(d^y_m | f, m, b).
Any algorithm is as good (bad) as a random search!
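The statement can be verified by brute force on a toy instance. The sketch below (an illustrative assumption, not the general proof) enumerates all functions f : {0, 1, 2} → {0, 1} and shows that two deterministic algorithms with different fixed search orders have identical best-so-far performance when averaged over all functions:

```python
from itertools import product

X = [0, 1, 2]                         # tiny finite search space
order_a = [0, 1, 2]                   # algorithm a: one fixed visiting order
order_b = [2, 1, 0]                   # algorithm b: a different fixed order

def avg_best(order, m):
    """Best-so-far value after m distinct evaluations,
    averaged over ALL functions f: X -> {0, 1}."""
    funcs = list(product([0, 1], repeat=len(X)))   # all 2^3 = 8 functions
    total = 0
    for ys in funcs:
        f = dict(zip(X, ys))
        total += max(f[x] for x in order[:m])
    return total / len(funcs)

for m in (1, 2, 3):
    # identical averaged performance, as the NFL theorem predicts
    assert avg_best(order_a, m) == avg_best(order_b, m)
```

Any advantage one order gains on some functions is exactly cancelled on the permuted functions, which is the intuition behind the c.u.p. assumption.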
Open Problems
Framework: need to develop a unified framework for algorithmic analysis (e.g., convergence).
Exploration and exploitation: what is the optimal balance between these two components? (50-50 or what?)
Performance measure: what are the best performance measures? Statistically? Why?
Convergence: convergence analysis of algorithms for infinite, continuous domains requires systematic approaches.
More Open Problems
Free lunches: unproved for infinite or continuous domains and for multiobjective optimization (possible free lunches!). What are the implications of the NFL theorems in practice? If free lunches exist, how do we find the best algorithm(s)?
Knowledge: does problem-specific knowledge always help to find appropriate solutions? How can such knowledge be quantified?
Intelligent algorithms: is there any practical way to design truly intelligent, self-evolving algorithms?
Constraints
In describing optimization algorithms, we have so far not been concerned with constraints, yet algorithms must solve both unconstrained and, more often, constrained problems.
The handling of constraints is an implementation issue, though incorrect or inefficient ways of dealing with constraints can slow down an algorithm, or even result in wrong solutions.
Methods of handling constraints
Direct methods
Lagrange multipliers
Barrier functions
Penalty methods
Aims
Either converting a constrained problem to an unconstrained one
or changing the search space into a regular domain
The ease of programming and implementation
Improve (or at least not hinder) the efficiency of the chosen algorithm in implementation.
Scalability
The approach used should be able to deal with small, large and very large-scale problems.
Common Approaches
Direct method
Simple, but not versatile, and often difficult to program.
Lagrange multipliers
Mainly for equality constraints.
Barrier functions
Very powerful and widely used in convex optimization.
Penalty methods
Simple and versatile, widely used.
Others
Direct Methods
Minimize f(x, y) = (x − 2)^2 + 4(y − 3)^2,
subject to −x + y ≤ 2, x + 2y ≤ 3.
[Figure: feasible region bounded by −x + y ≤ 2 and x + 2y ≤ 3, with the optimal point marked.]
Direct methods generate solutions/points inside the feasible region (easy for rectangular regions).
Method of Lagrange Multipliers
Maximize f(x, y) = 10 − x^2 − (y − 2)^2 subject to x + 2y = 5.
Defining a combined function Φ using a multiplier λ, we have
Φ = 10 − x^2 − (y − 2)^2 + λ(x + 2y − 5).
The optimality conditions are
∂Φ/∂x = −2x + λ = 0,  ∂Φ/∂y = −2(y − 2) + 2λ = 0,  ∂Φ/∂λ = x + 2y − 5 = 0,
whose solutions are
x = 1/5, y = 12/5, λ = 2/5, ⟹ f_max = 49/5.
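A quick numerical check of this stationary point (a minimal sketch):

```python
# Verify the Lagrange conditions at the claimed solution.
x, y, lam = 1/5, 12/5, 2/5

def f(x, y):
    return 10 - x**2 - (y - 2)**2

assert abs(-2*x + lam) < 1e-12           # dPhi/dx = 0
assert abs(-2*(y - 2) + 2*lam) < 1e-12   # dPhi/dy = 0
assert abs(x + 2*y - 5) < 1e-12          # constraint x + 2y = 5
print(round(f(x, y), 6))                 # 9.8 = 49/5
```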
Barrier Functions
As an equality h(x) = 0 can be written as two inequalities h(x) ≤ 0 and −h(x) ≤ 0, we only use inequalities.
For a general optimization problem:
minimize f(x), subject to g_i(x) ≤ 0 (i = 1, 2, ..., N),
we can define an indicator or barrier function
I_−[u] = { 0 if u ≤ 0,  ∞ if u > 0.
This is not so easy to deal with numerically. It is also discontinuous!
Logarithmic Barrier Functions
A log barrier function is
I_−(u) = −(1/t) log(−u),  u < 0,
where t > 0 is an accuracy parameter (and can be very large). Then, the above minimization problem becomes
minimize f(x) + Σ_{i=1}^{N} I_−(g_i(x)) = f(x) − (1/t) Σ_{i=1}^{N} log[−g_i(x)].
This is an unconstrained problem and easy to implement!
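A minimal 1-D sketch of the idea, with an assumed toy problem f(x) = (x − 2)^2 subject to g(x) = x − 1 ≤ 0 (constrained optimum at x = 1); increasing t drives the barrier minimizer toward the boundary:

```python
import math

def barrier_objective(x, t):
    # Toy problem (assumed): f(x) = (x - 2)^2, constraint g(x) = x - 1 <= 0
    if x >= 1:
        return float("inf")                      # infeasible region
    return (x - 2) ** 2 - (1.0 / t) * math.log(1 - x)

def minimize_1d(obj, lo, hi, iters=200):
    """Ternary search for the minimum of a unimodal 1-D function."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if obj(m1) < obj(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

# As t grows, the barrier minimizer approaches the constrained optimum x = 1
solutions = [minimize_1d(lambda x, t=t: barrier_objective(x, t), -5.0, 1 - 1e-9)
             for t in (1, 10, 100, 1000)]
print([round(x, 4) for x in solutions])
```

The barrier term keeps every iterate strictly feasible, which is exactly what makes this an interior-point approach.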
Penalty Methods
For a nonlinear optimization problem with equality and inequality constraints,
minimize_{x ∈ ℜ^n} f(x),  x = (x_1, ..., x_n)^T ∈ ℜ^n,
subject to φ_i(x) = 0 (i = 1, ..., M),  ψ_j(x) ≤ 0 (j = 1, ..., N),
the idea is to define a penalty function so that the constrained problem is transformed into an unconstrained one. Now we define
Π(x, µ_i, ν_j) = f(x) + Σ_{i=1}^{M} µ_i φ_i^2(x) + Σ_{j=1}^{N} ν_j ψ_j^2(x),
where µ_i ≫ 1 and ν_j ≥ 0, which should be large enough, depending on the solution quality needed.
In addition, for simplicity of implementation, we can use µ = µ_i for all i and ν = ν_j for all j. That is, we can use a simplified
Π(x, µ, ν) = f(x) + µ Σ_{i=1}^{M} Q_i[φ_i(x)] φ_i^2(x) + ν Σ_{j=1}^{N} H_j[ψ_j(x)] ψ_j^2(x).
Here the barrier/indicator-like functions are
H_j = { 0 if ψ_j(x) ≤ 0,  1 if ψ_j(x) > 0,   Q_i = { 0 if φ_i(x) = 0,  1 if φ_i(x) ≠ 0.
In general, for most applications, µ and ν can be taken as 10^10 to 10^15. We will use these values in most implementations.
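This simplified Π is straightforward to implement. A minimal sketch that builds it for user-supplied callables (the toy problem at the end is an illustrative assumption):

```python
def penalized(f, eqs, ineqs, mu=1e10, nu=1e10):
    """Build the simplified penalty function Pi(x, mu, nu) above.
    f, eqs and ineqs are user-supplied callables."""
    def Pi(x):
        value = f(x)
        for phi in eqs:                  # equality constraints: phi(x) = 0
            v = phi(x)
            if v != 0:                   # Q_i = 1 only when violated
                value += mu * v * v
        for psi in ineqs:                # inequality constraints: psi(x) <= 0
            v = psi(x)
            if v > 0:                    # H_j = 1 only when violated
                value += nu * v * v
        return value
    return Pi

# Toy example (assumed): minimize (x - 3)^2 subject to x - 2 <= 0
Pi = penalized(lambda x: (x - 3) ** 2, eqs=[], ineqs=[lambda x: x - 2])
print(Pi(2.0), Pi(2.5))   # feasible point vs heavily penalized point
```

Because the H and Q factors switch the penalty on only when a constraint is violated, feasible points are evaluated by the original objective alone; any metaheuristic can then minimize Pi as an unconstrained problem.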
Pressure Vessel Design Optimization
[Figure: cylindrical pressure vessel of inner radius r and length L with hemispherical heads; shell thickness d1, head thickness d2.]
Formulation
minimize f(x) = 0.6224 d1 r L + 1.7781 d2 r^2 + 3.1661 d1^2 L + 19.84 d1^2 r,
subject to
g1(x) = −d1 + 0.0193 r ≤ 0,
g2(x) = −d2 + 0.00954 r ≤ 0,
g3(x) = −π r^2 L − (4π/3) r^3 + 1296000 ≤ 0,
g4(x) = L − 240 ≤ 0,
h1(x) = [d1/0.0625] − n = 0,
h2(x) = [d2/0.0625] − k = 0.
The simple bounds are
0.0625 ≤ d1, d2 ≤ 99 × 0.0625,  10.0 ≤ r, L ≤ 200.0,
and 1 ≤ n, k ≤ 99 are integers.
Minimize
Π(x, λ) = f(x) + λ Σ_{i=1}^{2} Q_i[h_i(x)] h_i^2(x) + λ Σ_{j=1}^{4} H_j[g_j(x)] g_j^2(x),
where λ = 10^15.
This becomes an unconstrained optimization problem in a regular domain.
Best solution found so far in the literature:
f* = $6059.714 at (d1, d2, r, L) = (0.8125, 0.4375, 42.0984, 176.6366).
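As a sanity check, the objective can be evaluated at this reported best point (a minimal sketch; with these rounded values the volume constraint g3 is essentially active):

```python
import math

def vessel_cost(d1, d2, r, L):
    # Objective from the formulation above
    return (0.6224 * d1 * r * L + 1.7781 * d2 * r**2
            + 3.1661 * d1**2 * L + 19.84 * d1**2 * r)

d1, d2, r, L = 0.8125, 0.4375, 42.0984, 176.6366
cost = vessel_cost(d1, d2, r, L)   # approx 6059.71

# Inequality constraints (should be <= 0); g3 is essentially active here
g1 = -d1 + 0.0193 * r
g2 = -d2 + 0.00954 * r
g4 = L - 240
print(round(cost, 2))
```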
Applications
Design optimization: structural engineering, product design ...
Scheduling, routing and planning: often discrete, combinatorial problems ...
Applications in almost all areas (e.g., finance, economics,engineering, industry, ...)
Dome Design
120-bar dome: Divided into 7 groups, 120 design elements, about 200
constraints (Kaveh and Talatahari 2010; Gandomi and Yang 2011).
Tower Design
26-storey tower: 942 design elements, 244 nodal links, 59 groups/types,
> 4000 nonlinear constraints (Kaveh & Talatahari 2010; Gandomi & Yang 2011).
Topology Optimization of Nanoscale Device
The topology optimization of a nanoscale heat-conducting system is a shape optimization,^7 which can be considered an inverse problem for the shape or distribution of materials.
[Figure: 150 nm × 150 nm benchmark design domain with an applied heat flux; boundary temperatures T = 1 and T = 0 on opposite sides, and T = 1 − x along the other two.]

Benchmark Design
Two materials with heat diffusivities K1 and K2, respectively; for example, Si and Mg2Si, with K1/K2 ≈ 10. The aim is to distribute the two materials such that the difference |Ta − Tb| is as large as possible.
7 A. Evgrafov, K. Maute, R. G. Yang and M. L. Dunn, Topology optimization for nano-scale heat transfer, Int. J. Num. Methods in Engrg., 77(2), 285-300 (2009).
Initial Configuration
Unit square with two different materials (initial configuration).
[Figure: unit square initially split between the two materials K1 and K2, with measurement points Ta and Tb.]
Then, the firefly algorithm (FA) is used to redistribute these two materials so as to maximize the temperature difference.
Optimal shape and distribution of materials: Si (blue) and Mg2Si (red).
Optimal topology (left) and temperature distribution (right).
References
Sambridge, M. and Mosegaard, K., (2002). Monte Carlo methods in geophysical inverse problems, Reviews of Geophysics, 40, 3-1-29.
Scales, J. A., Smith, M. L., and Treitel, S., (2001). Introductory Geophysical Inverse Theory, Samizdat Press.
Yang, X. S., (2008). Nature-Inspired Metaheuristic Algorithms, Luniver Press, UK.
Yang, X. S., (2009). Firefly algorithms for multimodal optimization, 5th Symposium on Stochastic Algorithms, Foundations and Applications (SAGA 2009) (Eds Watanabe O. and Zeugmann T.), LNCS, 5792, pp. 169-178.
Yang, X. S. and Deb, S., (2009). Cuckoo search via Lévy flights, World Congress on Nature & Biologically Inspired Computing (NaBIC 2009), IEEE Publications, pp. 210-214. arXiv:1003.1594v1.
Thanks
International Journal of Mathematical Modelling and Numerical Optimization (IJMMNO): http://www.inderscience.com/ijmmno
Books:
Computational Optimization, Methods and Algorithms (Slawomir Koziel and Xin-She Yang), Springer (2011). http://www.springerlink.com/content/978-3-642-20858-4
Engineering Optimization: An Introduction with Metaheuristic Applications (Xin-She Yang), John Wiley & Sons (2010). http://eu.wiley.com/WileyCDA/WileyTitle/productCd-0470582464.html
Notes
https://sites.google.com/site/tutorialmetaheuristic/tutorials
Thank you!