Stochastic Optimization
Marc Schoenauer
DataScience @ LHC Workshop 2015
Content
1 Setting the scene
2 Continuous Optimization
3 Discrete/Combinatorial Optimization
4 Non-Parametric Representations
5 Conclusion
Stochastic Optimization
Hypotheses
Search Space Ω with some topological structure
Objective function F assume some weak regularity
Hill-Climbing
Randomly draw x0 ∈ Ω and compute F(x0) Initialisation
Until(happy)
y = Best neighbor(xt) neighbor structure on Ω
Compute F(y)
If F(y) < F(xt) then xt+1 = y accept if improvement
else xt+1 = xt
Comments
Find closest local optimum defined by neighborhood structure
Iterate from different x0’s Until(very happy)
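The hill-climbing loop above is easy to sketch in Python; here is a minimal version (the bit-flip toy problem, function names, and parameters are illustrative, not from the talk):

```python
import random

def hill_climb(f, x0, neighbors, max_iters=1000):
    # move to the best improving neighbor until none exists (minimisation)
    x, fx = x0, f(x0)
    for _ in range(max_iters):
        best_y, best_fy = None, fx
        for y in neighbors(x):
            fy = f(y)
            if fy < best_fy:
                best_y, best_fy = y, fy
        if best_y is None:               # local optimum for this neighborhood
            break
        x, fx = best_y, best_fy
    return x, fx

# toy problem on bit-strings: minimise the number of zeros
f = lambda bits: bits.count(0)

def neighbors(bits):                     # neighbor structure: flip each bit in turn
    for i in range(len(bits)):
        yield bits[:i] + (1 - bits[i],) + bits[i + 1:]

random.seed(1)
x0 = tuple(random.randint(0, 1) for _ in range(20))
x_best, f_best = hill_climb(f, x0, neighbors)
```

Restarting from different x0's, as the slide suggests, is a one-line loop around `hill_climb`.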
Stochastic Optimization
Hypotheses
Search Space Ω with some topological structure
Objective function F assume some weak regularity
Stochastic (Local) Search
Randomly draw x0 ∈ Ω and compute F(x0) Initialisation
Until(happy)
y = Random neighbor(xt) neighbor structure on Ω
Compute F(y)
If F(y) < F(xt) then xt+1 = y accept if improvement
else xt+1 = xt
Comments
Find one close local optimum defined by neighborhood structure
Iterate, leaving current optimum Iterated Local Search
Stochastic Optimization
Hypotheses
Search Space Ω with some topological structure
Objective function F assume some weak regularity
Stochastic (Local) Search – alternative viewpoint
Randomly draw x0 ∈ Ω and compute F(x0) Initialisation
Until(happy)
y = Move(xt) stochastic variation on Ω
Compute F(y)
If F(y) < F(xt) then xt+1 = y accept if improvement
else xt+1 = xt
Comments
Find one close local optimum defined by the Move operator
Iterate, leaving current optimum Iterated Local Search
Stochastic Optimization
Hypotheses
Search Space Ω with some topological structure
Objective function F assume some weak regularity
Stochastic Search aka Metaheuristics
Randomly draw x0 ∈ Ω and compute F(x0) Initialisation
Until(happy)
y = Move(xt) stochastic variation on Ω
Compute F(y)
xt+1 = Select(xt, y) stochastic selection, biased toward better points
Comments
Can escape local optima e.g., Select = Boltzmann selection
Iterate ?
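With a Boltzmann acceptance rule for Select, this template is essentially simulated annealing. A sketch (the objective, move operator, and cooling schedule are illustrative assumptions):

```python
import math, random

def boltzmann_select(x, fx, y, fy, T):
    # stochastic selection biased toward better points (minimisation):
    # worse moves are accepted with probability exp(-(fy - fx) / T)
    if fy <= fx or random.random() < math.exp(-(fy - fx) / T):
        return y, fy
    return x, fx

def metaheuristic(f, move, x0, T0=1.0, cooling=0.995, iters=2000):
    x, fx = x0, f(x0)
    best, fbest = x, fx
    T = T0
    for _ in range(iters):
        y = move(x)                          # stochastic variation on the search space
        x, fx = boltzmann_select(x, fx, y, f(y), T)
        if fx < fbest:
            best, fbest = x, fx
        T *= cooling                         # annealing schedule
    return best, fbest

# toy multimodal objective; local optima can be escaped while T is high
f = lambda x: x * x + 2.0 * math.sin(5.0 * x) + 2.0
move = lambda x: x + random.gauss(0.0, 0.3)
random.seed(0)
best, fbest = metaheuristic(f, move, x0=3.0)
```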
Stochastic Optimization
Hypotheses
Search Space Ω with some topological structure
Objective function F assume some weak regularity
Population-based Metaheuristics
Randomly draw xi0 ∈ Ω and compute F(xi0), i = 1, . . . , µ Initialisation
Until(happy)
yi = Move(x1t, . . . , xµt), i = 1, . . . , λ stochastic variations on Ω
Compute F(yi), i = 1, . . . , λ
(x1t+1, . . . , xµt+1) = Select(x1t, . . . , xµt, y1, . . . , yλ) (µ + λ) selection
                      = Select(y1, . . . , yλ) (µ, λ) selection
Evolutionary Metaphor: ’natural’ selection and ’blind’ variations
yi = Movem(xit) for some i: mutation
yi = Movec(xit, xjt) for some (i, j): crossover
Metaheuristics: an Alternative Viewpoint
Population and variation operators define a distribution on Ω
−→ directly evolve a parameterized distribution P (θ)
Distribution-based Metaheuristics
Initialize distribution parameters θ0
Until(happy)
Sample distribution P(θt) → y1, . . . , yλ ∈ Ω
Evaluate y1, . . . , yλ on F
Update parameters θt+1 ← Fθ(θt, y1, . . . , yλ, F(y1), . . . , F(yλ)) includes selection
Covers
Estimation of Distribution Algorithms
Population-based Metaheuristics Evolutionary Algorithms
Particle Swarm Optimization, Differential Evolution, . . .
Deterministic algorithms
This is (mostly) Experimental Science
Theory
lagging behind practice
though rapidly catching up in recent years
not discussed here
Results need to be experimentally validated
Experiments
Validation relies on
grounded statistical validation CPU costly
informed (meta)parameter setting/tuning CPU costly ++
reproducibility Open Science
From Solution Space to Search Space
Choices
Objective function
Search space representation
and move operators / parameterized distribution
Approximation bias vs optimization error
This talk
Non-convex, non-smooth objective parametric continuous optimization
→ Rank-based ML seeded identification of influencers
Search spaces and crossover operators the TSP
Non-parametric representations exploration vs optimization
Content
1 Setting the scene
2 Continuous Optimization
    The Evolution Strategy
    The CMA-ES (Covariance Matrix Adaptation Evolution Strategy)
    The Rank-based DataScience
3 Discrete/Combinatorial Optimization
4 Non-Parametric Representations
5 Conclusion
Stochastic Optimization Templates
Population-based Metaheuristics
Randomly draw xi0 ∈ Ω and compute F(xi0), i = 1, . . . , µ Initialisation
Until(happy)
yi = Move(x1t, . . . , xµt), i = 1, . . . , λ stochastic variations on Ω
Compute F(yi), i = 1, . . . , λ
(x1t+1, . . . , xµt+1) = Select(x1t, . . . , xµt, y1, . . . , yλ) (µ + λ) selection
                      = Select(y1, . . . , yλ) (µ, λ) selection

Distribution-based Metaheuristics
Initialize distribution parameters θ0
Until(happy)
Sample distribution P(θt) → y1, . . . , yλ ∈ Ω
Evaluate y1, . . . , yλ on F
Update parameters θt+1 ← Fθ(θt, y1, . . . , yλ, F(y1), . . . , F(yλ))
Stochastic Optimization Templates
Distribution-based Metaheuristics
Initialize distribution parameters θ0
Until(happy)
Sample distribution P(θt) → y1, . . . , yλ ∈ Ω
Evaluate y1, . . . , yλ on F
Update parameters θt+1 ← Fθ(θt, y1, . . . , yλ, F(y1), . . . , F(yλ))
The (µ, λ)−Evolution Strategy
Ω = Rn P (θ): normal distributions
Initialize distribution parameters m, σ,C, set population size λ ∈ N
While not terminate
1 Sample distribution N(m, σ²C) → y1, . . . , yλ ∈ Rn
      yi ∼ m + σ Ni(0, C) for i = 1, . . . , λ
2 Evaluate y1, . . . , yλ on F
      Compute F(y1), . . . , F(yλ)
3 Select µ best samples y1, . . . , yµ
4 Update distribution parameters
      m, σ, C ← F(m, σ, C, y1, . . . , yµ, F(y1), . . . , F(yµ))
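The template can be sketched in a few lines of NumPy. In place of the adaptive update rules introduced next, this sketch uses a naive geometric decay of σ and keeps C = I; all names and constants are illustrative:

```python
import numpy as np

def mu_lambda_es(f, n=10, mu=5, lam=20, sigma=1.0, iters=200, seed=3):
    # (mu, lambda)-ES template with C = I; the geometric sigma decay is a
    # naive stand-in for the adaptive rules (CSA, CMA) discussed below
    rng = np.random.default_rng(seed)
    m = np.full(n, 2.0)                                  # arbitrary initial mean
    for _ in range(iters):
        Y = m + sigma * rng.standard_normal((lam, n))    # 1. sample N(m, sigma^2 I)
        F = np.array([f(y) for y in Y])                  # 2. evaluate on F
        m = Y[np.argsort(F)[:mu]].mean(axis=0)           # 3.-4. mean of the mu best
        sigma *= 0.95                                    # naive step-size decay
    return m, f(m)

sphere = lambda x: float(np.dot(x, x))
m, fm = mu_lambda_es(sphere)
```

On the sphere function this converges, but the fixed decay schedule is exactly what the following slides replace with principled adaptation.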
The (µ, λ)−Evolution Strategy (2)
Gaussian Mutations
mean vector m ∈ Rn is the current solution
the so-called step-size σ ∈ R+ controls the step length
the covariance matrix C ∈ Rn×n determines the shape of thedistribution ellipsoid
How to update m, σ, and C?
m = (1/µ) Σi=1..µ xi:λ “Crossover” of best µ individuals
Adaptive σ and C
Need for adaptation
Sphere function f(x) = Σi xi², n = 10
Content
1 Setting the scene
2 Continuous Optimization
    The Evolution Strategy
    The CMA-ES (Covariance Matrix Adaptation Evolution Strategy)
    The Rank-based DataScience
3 Discrete/Combinatorial Optimization
4 Non-Parametric Representations
5 Conclusion
Cumulative Step-Size Adaptation (CSA)
Measure the length of the evolution path
    the path of the mean vector m over the generation sequence
path shorter than expected → decrease σ
path longer than expected → increase σ
    compared to a random walk without selection
Covariance Matrix Adaptation – Rank-One Update

m ← m + σ yw,   yw = Σi=1..µ wi yi:λ,   yi ∼ Ni(0, C)

initial distribution: C = I
yw: movement of the population mean m (disregarding σ)
new distribution: C ← 0.8 × C + 0.2 × yw ywT
    a mixture of the distribution C and the step yw

ruling principle: the adaptation increases the probability of successful steps, yw, to appear again
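The rank-one update takes a few lines of NumPy. The 0.8/0.2 weights are the slide's example values; the repeatedly "successful" step yw is a toy assumption:

```python
import numpy as np

def rank_one_update(C, y_w, c1=0.2):
    # C <- 0.8 * C + 0.2 * y_w y_w^T, with the slide's example weights
    return (1.0 - c1) * C + c1 * np.outer(y_w, y_w)

C = np.eye(2)
y_w = np.array([1.0, 1.0])              # a repeatedly 'successful' step
for _ in range(10):
    C = rank_one_update(C, y_w)

# the distribution elongates along y_w ...
var_along = (y_w @ C @ y_w) / (y_w @ y_w)
# ... and shrinks in the orthogonal direction
v = np.array([1.0, -1.0])
var_ortho = (v @ C @ v) / (v @ v)
```

After ten updates the variance along yw approaches 2 while the orthogonal variance decays like 0.8^t: steps similar to yw become much more likely to be sampled again.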
CMA-ES in one slide
Input: m ∈ Rn, σ ∈ R+, λ
Initialize: C = I, pc = 0, pσ = 0
Set: cc ≈ 4/n, cσ ≈ 4/n, c1 ≈ 2/n², cµ ≈ µw/n², c1 + cµ ≤ 1, dσ ≈ 1 + √(µw/n),
     and wi=1...λ such that µw = 1 / Σi=1..µ wi² ≈ 0.3 λ

While not terminate
xi = m + σ yi,  yi ∼ Ni(0, C), for i = 1, . . . , λ sampling
m ← Σi=1..µ wi xi:λ = m + σ yw  where  yw = Σi=1..µ wi yi:λ update mean
pc ← (1 − cc) pc + 1‖pσ‖<1.5√n √(1 − (1 − cc)²) √µw yw cumulation for C
pσ ← (1 − cσ) pσ + √(1 − (1 − cσ)²) √µw C^(−1/2) yw cumulation for σ
C ← (1 − c1 − cµ) C + c1 pc pcT + cµ Σi=1..µ wi yi:λ yi:λT update C
σ ← σ × exp( (cσ/dσ) (‖pσ‖ / E‖N(0, I)‖ − 1) ) update of σ
Not covered on this slide: termination, restarts, active or mirrored sampling,
outputs, and boundaries
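Putting the one-slide description together, a sketch implementation in NumPy. It follows the slide's formulas with the slide's approximate constants; the choice of λ is a common default and an assumption here, and none of the omitted ingredients (termination, restarts, boundaries) are included:

```python
import numpy as np

def cma_es(f, m, sigma, iters=300, seed=0):
    rng = np.random.default_rng(seed)
    n = len(m)
    lam = 4 + int(3 * np.log(n))                     # assumed default population size
    mu = lam // 2
    w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
    w /= w.sum()                                     # positive weights summing to 1
    mu_w = 1.0 / np.sum(w ** 2)
    cc, cs = 4.0 / n, 4.0 / n
    c1, cmu = 2.0 / n ** 2, min(mu_w / n ** 2, 1.0 - 2.0 / n ** 2)
    d_s = 1.0 + np.sqrt(mu_w / n)
    chi_n = np.sqrt(n) * (1 - 1 / (4 * n) + 1 / (21 * n ** 2))  # approx E||N(0,I)||
    C, pc, ps = np.eye(n), np.zeros(n), np.zeros(n)
    for _ in range(iters):
        Y = rng.standard_normal((lam, n)) @ np.linalg.cholesky(C).T  # y_i ~ N(0,C)
        X = m + sigma * Y                                            # sampling
        idx = np.argsort([f(x) for x in X])[:mu]                     # mu best
        y_w = w @ Y[idx]
        m = m + sigma * y_w                                          # update mean
        vals, B = np.linalg.eigh(C)
        C_inv_half = B @ np.diag(vals ** -0.5) @ B.T                 # C^(-1/2)
        ps = (1 - cs) * ps + np.sqrt(1 - (1 - cs) ** 2) * np.sqrt(mu_w) * (C_inv_half @ y_w)
        hs = np.linalg.norm(ps) < 1.5 * np.sqrt(n)                   # stall indicator
        pc = (1 - cc) * pc + hs * np.sqrt(1 - (1 - cc) ** 2) * np.sqrt(mu_w) * y_w
        C = ((1 - c1 - cmu) * C + c1 * np.outer(pc, pc)
             + cmu * (w[:, None] * Y[idx]).T @ Y[idx])               # update C
        sigma *= np.exp(cs / d_s * (np.linalg.norm(ps) / chi_n - 1)) # update sigma
    return m, f(m)

sphere = lambda x: float(np.dot(x, x))
m, fm = cma_es(sphere, m=np.full(5, 3.0), sigma=1.0)
```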
Invariances: Guarantee for Generalization
Invariance properties of CMA-ES
Invariance to order-preserving transformations in function space
    like all comparison-based algorithms
Translation and rotation invariance to rigid transformations of the search space
→ Natural Gradient in distribution space NES, Schmidhuber et al., 08
IGO, Ollivier et al., 11
CMA-ES is almost parameterless
Tuned on a small set of functions Hansen & Ostermeier 2001
Default values generalize to whole classes of functions
Exception: population size for multi-modal functions
    but see IPOP-CMA-ES Auger & Hansen, 05
    and BIPOP-CMA-ES Hansen, 09
BBOB – Black-Box Optimization Benchmarking
ACM-GECCO workshops, 2009-2015 http://coco.gforge.inria.fr/
Set of 25 benchmark functions, dimensions 2 to 40
With known difficulties (ill-conditioning, non-separability, . . . )
Noisy and non-noisy versions
Competitors include
BFGS (Matlab version),
Fletcher-Powell,
DFO (Derivative-FreeOptimization, Powell 04)
Differential Evolution
Particle Swarm Optimization
and many more

[ECDF plot over f1–24: proportion of functions solved vs. log10(ERT / dimension), comparing dozens of algorithms from random search to BIPOP-saACM-ES and the best-2009 portfolio]
(should be) available in ROOT6 Benazera, Hansen, Kegl, 15
Content
1 Setting the scene
2 Continuous Optimization
    The Evolution Strategy
    The CMA-ES (Covariance Matrix Adaptation Evolution Strategy)
    The Rank-based DataScience
3 Discrete/Combinatorial Optimization
4 Non-Parametric Representations
5 Conclusion
Seeded Influencer Identification
Philippe Caillou & Michele Sebag, coll. SME Augure
Goal
Find the influencers in a given social network
from public data sources tweets, blogs, articles, . . .
Commercial goal: so as to bribe them :-)
Scientific goal: with little user input
What is an Influencer?
No clear definition
# followers?
Highly retweeted? but # retweets disagrees with # followers
Sources of information cascades?
# invitations to join?
Topic-specific PageRanking?
Presence on Wikipedia?
Need for
Topic-specific features
and graph features
and user input no generic ground truth
Seeded Influencer Ranking
Input
A social network
(Big) Data of user interactions Tweets, blogs, messages, . . .
Some identified influencers
The global picture
Derive features representing the users using content and traces
Optimize scoring function giving highest scores to known influencers
return the k best-scoring users the most (known + new) influencers
Search Space and Learning Criterion
The scoring function
Each individual is described by d features xi ∈ IRd
Non-linear score h defined by pair (v, a) in IRd × IRd
hv,a(x) = Σi=1..d vi |xi − ai|
A rank-based criterion
Each score hv,a induces a ranking Rv,a on the dataset
Goal: Known influencers (KIs) ranked highest Best has rank n
Non-convex optimization
ArgMax(v,a) Σxi∈KI Rv,a(xi)
−→ Stochastic Optimizer e.g., sep-CMA-ES
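A sketch of the rank-based criterion that the stochastic optimizer would maximize over (v, a); the data, feature dimension, and known-influencer indices below are made up for illustration:

```python
import numpy as np

def score(v, a, X):
    # h_{v,a}(x) = sum_i v_i |x_i - a_i|, for each row x of X
    return np.sum(v * np.abs(X - a), axis=1)

def ki_rank_sum(v, a, X, ki_idx):
    # rank 1 = lowest score, rank n = highest; the fitness of (v, a) is the
    # sum of the ranks of the known influencers, to be maximised
    ranks = np.argsort(np.argsort(score(v, a, X))) + 1
    return int(ranks[ki_idx].sum())

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))        # 100 users, d = 5 features
ki_idx = np.array([0, 1, 2])             # pretend rows 0..2 are the known influencers
v, a = rng.standard_normal(5), rng.standard_normal(5)
fitness = ki_rank_sum(v, a, X, ki_idx)
```

The criterion is piecewise constant in (v, a), hence non-convex and non-smooth, which is why a comparison-based optimizer such as sep-CMA-ES fits.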
Data
Domains
Fashion: 18/23 influencers with twitter account
SmartPhones: 69/206 influencers with twitter account
Social Influence: 39/39 influencers with twitter account
Wine: 28/235 influencers with twitter account, coming from public sources (e.g., Time Magazine)
Human Resources: 99 influencers from the 100 most influential people in HR and recruiting on twitter
Sources
random 10% of all retweets of November 2014
100M tweets, 45M unique tweets,
with origin–destination pairs → 43M-node graph
only words in at least 0.001% and at most 10% of tweets considered → 10.5M candidates
Features
100 Content-based Features
for each medium, possibly
For each user x, identify W(x), the N words with max tf-idf
    term frequency – inverse document frequency
50 words that appear most often in all W(influencer)
50 words with max sum of tf-idf over ⋃ W(influencer)
each selected word is a feature for x: 0 if not present in W(x), its tf-idf otherwise
    Many are null for most candidates
6 Network Features
from the weighted graph of retweets centrality, PageRank, . . .
5 Social Features
from tweeter user profiles # tweets, # followers, . . .
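A toy sketch of the tf-idf weighting behind the content features (the three miniature "documents" are illustrative, not from the study):

```python
import math
from collections import Counter

def tf_idf(docs):
    # weight of word w in document d: tf(w, d) * log(N / df(w)),
    # where df(w) is the number of documents containing w
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))
    out = []
    for d in docs:
        tf = Counter(d)
        out.append({w: (tf[w] / len(d)) * math.log(n / df[w]) for w in tf})
    return out

docs = [["cms", "higgs", "boson"], ["higgs", "field"], ["wine", "tasting", "wine"]]
weights = tf_idf(docs)
```

Words frequent for one user but rare across the corpus get high weight, which is what makes them discriminative features.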
Seeded Influencer Identification: Discussion
Results
Better than using only social features
Found unexpected influencers
Confirmed by experts
Forthcoming application
Unveil some influential xxx-sceptics Global warming, genocides, . . .
Content
1 Setting the scene
2 Continuous Optimization
3 Discrete/Combinatorial Optimization
    The Crossover Controversy
    The TSP Illustration
4 Non-Parametric Representations
5 Conclusion
Stochastic Optimization Templates (reminder)
Population-based Metaheuristics
Randomly draw xi0 ∈ Ω and compute F(xi0), i = 1, . . . , µ Initialisation
Until(happy)
yi = Move(x1t, . . . , xµt), i = 1, . . . , λ stochastic variations on Ω
Compute F(yi), i = 1, . . . , λ
(x1t+1, . . . , xµt+1) = Select(x1t, . . . , xµt, y1, . . . , yλ) (µ + λ) selection
                      = Select(y1, . . . , yλ) (µ, λ) selection

Distribution-based Metaheuristics
Initialize distribution parameters θ0
Until(happy)
Sample distribution P(θt) → y1, . . . , yλ ∈ Ω
Evaluate y1, . . . , yλ on F
Update parameters θt+1 ← Fθ(θt, y1, . . . , yλ, F(y1), . . . , F(yλ))
Crossover or not Crossover

Issues with crossover
Crossover is ubiquitous in nature though not in all bacteria
Shown to mainly work on dynamical problems compared to ILS, SA; Ch. Papadimitriou et al., 15

Yes, but
Most artificial crossover operators exchange chunks of solutions
Nature does not exchange “organs”,
but chunks of the programs that build the solution

Better no crossover than a poor crossover
Content
1 Setting the scene
2 Continuous Optimization
3 Discrete/Combinatorial Optimization
    The Crossover Controversy
    The TSP Illustration
4 Non-Parametric Representations
5 Conclusion
The Traveling Salesman Problem
Find the shortest Hamiltonian circuit on n cities dk11455.tsp
Permutation Representation
Ordered list of cities: no semantic link with the objective
Blind exchange of chunks is meaningless, and leads to infeasible circuits
Edge Representation
List of all edges of the tour (unordered):
clearly related to the objective,
but blind exchange of edges doesn’t work either
Challenge or opportunity?
A strong exact solver Concorde
A very strong heuristic solver LKH-2
Lin-Kernighan-Helsgaun heuristic Helsgaun, 06
LKH-2
Local Search
Deep k-opt moves efficient implementation
Many ad hoc heuristics
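A plain 2-opt local search is the simplest member of the k-opt family; the sketch below has none of LKH-2's deep moves or ad hoc heuristics, and the tiny instance is illustrative:

```python
import math, random

def tour_length(tour, dist):
    return sum(dist[tour[i - 1]][tour[i]] for i in range(len(tour)))

def two_opt(tour, dist):
    # 2-opt move: remove edges (a,b) and (c,d), reconnect as (a,c) and (b,d)
    # by reversing the segment in between; repeat while any move shortens the tour
    n, improved = len(tour), True
    while improved:
        improved = False
        for i in range(n - 1):
            for j in range(i + 2, n):
                a, b = tour[i], tour[i + 1]
                c, d = tour[j], tour[(j + 1) % n]
                if a == d:                       # edges adjacent via wrap-around
                    continue
                if dist[a][c] + dist[b][d] < dist[a][b] + dist[c][d] - 1e-12:
                    tour[i + 1:j + 1] = reversed(tour[i + 1:j + 1])
                    improved = True
    return tour

# tiny instance: 6 cities on a unit circle, optimal tour = the hexagon, length 6
pts = [(math.cos(2 * math.pi * k / 6), math.sin(2 * math.pi * k / 6)) for k in range(6)]
dist = [[math.dist(p, q) for q in pts] for p in pts]
random.seed(4)
tour = list(range(6))
random.shuffle(tour)
best = two_opt(tour, dist)
```

In the Euclidean plane a 2-opt local optimum has no crossing edges, which is why even this naive version untangles a shuffled tour.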
Multi-trial LKH-2
Iterated Local Search
Using . . . a crossover operator between best-so-far and new local optima
    Iterative Partial Transcription, Mobius et al., 99-08
    a particular case of GPX next slide
Holds most best-known results on non-exactly-solved instances, up to 10M cities
Generalized Partition Crossover (GPX) Whitley, Hains, & Howe, 09-12

Respectful and transmitting crossover
Find k-partitions of cost 2, if they exist O(n)
Choose the best of 2^k − 2 solutions O(k)
All common edges are kept; no new edge is created
Two possible hybridizations
Evolutionary Algorithm using GPX and iterated 2-opt as mutation operator
Some best results on small instances < 10k cities
Replacing IPT in the LKH-2 heuristic improved some state-of-the-art results
Edge Assembly Crossover Nagata & Kobayashi, 06-13
The EAX crossover
1 Merge A and B
2 Alternate A/B edges
3 Local, then global choice strategy
4 Use sub-tours to remove parent edges
5 Local search for minimal new edges
The EAX algorithm
For each randomly chosen parent pair typically 300 iterations
apply EAX N times typically 30
keep best of parents + N offspring diversity-preserving selection
Best known results on several instances . . .≤ 200k cities
Evolutionary TSP: Discussion
Meta or not Meta?
GPX can be applied to other graph problems
EAX is definitely a specialized TSP operator
Take Home Message
Efficient crossovers might exist
for semantically relevant representations
but require domain knowledge . . . and fine-tuning
Content
1 Setting the scene
2 Continuous Optimization
3 Discrete/Combinatorial Optimization
4 Non-Parametric Representations
    The Velocity Identification
    The Planning Decomposition
5 Conclusion
Identification of Spatial Geological Models
[sketch: seismic source and measure points at the surface]
The direct problem
Function φ applied to model M:
find the result R = φ(M) → numerical simulation

The inverse problem
Given experimental results R∗,
find a model M∗ such that φ(M∗) = R∗
→ Arg MinM ‖φ(M) − R∗‖²
MS, Ehinger, Braunschweig, 98
(simulated) 3D seismogram
New first seismograms: shot 11
[figure: plane and trace views of the simulated seismograms]
120× 2D seismogram
Voronoi Representation
The solution space
Velocities in some 3D domain
Blocky model piecewise constant
Values on a grid/mesh
Thousands of variables
Uncorrelated
Voronoi representation
List of labelled Voronoi sites
Velocity = site label inside each cell
Variation Operators
Geometric exchange crossover
Add, remove, move sites mutations
Variable-length unordered list of Voronoi sites, and the corresponding ’colored’ Voronoi diagram
[figure: Parent 1 × Parent 2 → Offspring 1, Offspring 2]
The crossover operator
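A sketch of the representation and of a hyperplane-cut variant of the geometric exchange crossover (the exact cut used in the original work may differ; sites and labels below are made up):

```python
import numpy as np

# a velocity model as an unordered list of labelled Voronoi sites:
# the velocity at any point is the label of the nearest site
def velocity(sites, labels, p):
    return labels[np.argmin(np.linalg.norm(sites - p, axis=1))]

# geometric exchange crossover (sketch): cut space by a random hyperplane
# through the origin and swap the sites on one side between the two parents
def voronoi_crossover(sitesA, labA, sitesB, labB, rng):
    w = rng.standard_normal(sitesA.shape[1])        # random cut direction
    keepA = sitesA @ w <= 0                         # parent A sites kept
    takeB = sitesB @ w > 0                          # parent B sites taken
    return (np.vstack([sitesA[keepA], sitesB[takeB]]),
            np.concatenate([labA[keepA], labB[takeB]]))

sitesA = np.array([[-0.5, 0.0], [0.5, 0.0], [0.0, 0.7]])
labA = np.array([2.0, 3.0, 4.0])                    # velocity label per cell
sitesB = np.array([[-0.3, -0.3], [0.6, 0.6], [0.0, -0.8]])
labB = np.array([5.0, 2.5, 3.5])
rng = np.random.default_rng(2)
childS, childL = voronoi_crossover(sitesA, labA, sitesB, labB, rng)
v = velocity(childS, childL, np.array([0.0, 0.0]))
```

Because the exchange is spatial rather than positional in the genotype, the offspring inherits coherent regions of each parent's model, which is the semantic link the permutation-style crossovers lack.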
Other Voronoi successes
Evolutionary chairs
Minimize weight
Constraint on stiffness
Permanent collection of the Beaubourg Modern Art Museum
The Seroussi architectural competition
A building for private exhibitions
Optimize natural lighting of paintings all year round
Room seeds for adaptive room design
1st prize (out of 8 submissions)
Content
1 Setting the scene
2 Continuous Optimization
3 Discrete/Combinatorial Optimization
4 Non-Parametric Representations
    The Velocity Identification
    The Planning Decomposition
5 Conclusion
The TGV paradigm
Improve a dummy algorithm
[figure: I → G directly vs. via intermediate stations]
Cannot directly go from I to G
Can go sequentially from I to the intermediate stations, and then to G
→ evolve the sequence of intermediate ’stations’
Representation
Variable-length ordered lists of intermediate stations (1, 2, 3)
Variation operators
Chromosome-like swap variable-length crossover
Move, add or remove intermediate stations mutations
Evolutionary AI Planning MS, P. Saveant, V. Vidal, 06
AI Planning problem (S,A, I, G)
Given a state space S and a set of actions A conditions, effects
An initial state I and a goal state G possibly partially defined
Find optimal sequence of actions from I to G.
Many complete / satisficing planners biennial IPC since 1998
Evolving intermediate states
Search space: Variable-length ordered lists of states (S1, . . . , Sl)
Solve (S,A, Si, Si+1) using a given planner “dummy”
All sub-problems solved → solution of initial problem
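The decomposition principle can be illustrated with a toy "planner" that only solves short hops (everything here is a made-up illustration, not the actual divide-and-evolve encoding):

```python
def dummy_planner(s, g, max_hop=3):
    # toy planner on integer states: only solves problems with |g - s| <= max_hop
    if abs(g - s) > max_hop:
        return None
    step = 1 if g >= s else -1
    return list(range(s, g, step)) + [g]

def solve_with_stations(start, goal, stations):
    # chain the weak planner through the evolved intermediate states
    plan, cur = [start], start
    for s in stations + [goal]:
        sub = dummy_planner(cur, s)
        if sub is None:
            return None                  # one sub-problem failed
        plan, cur = plan + sub[1:], s
    return plan

unsolved = dummy_planner(0, 10)          # the direct problem is too hard
plan = solve_with_stations(0, 10, [3, 6, 9])
```

The evolutionary algorithm searches the space of station sequences; the fitness of a sequence reflects whether (and how cheaply) all its sub-problems are solved.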
Highlights J. Bibaï et al., 09-11
Solves instances where ’dummy’ planner fails
Silver Medal, Humies Awards, GECCO 2010
Winner, satisficing track, IPC 2011
First multi-objective planner, IJCAI 2013 M. Khouadjia et al.
Non-Parametric Representations: Discussion
Further examples
Spatial geological models: horizontal layers + geological parameters patented
Neural Networks HyperNEAT, Stanley 07; Compressed NNs, Schmidhuber 10
Non-parametric representations
Explore larger search spaces, but with some constraints
    → explore a sub-variety of the whole space
Modifies both the approximation bias
    how far from the true optimum is the best solution under the streetlight?
and the optimization error
    our algorithm is not perfect
Optimization as an exploratory tool
Content
1 Setting the scene
2 Continuous Optimization
3 Discrete/Combinatorial Optimization
4 Non-Parametric Representations
5 Conclusion
The Black-Box Hypothesis
Meta or not Meta?
A continuum of embedded classes of problems:
where does Black Box start? e.g., GPX and EAX
Continuous optimization: at least the dimension is known
The more information the better e.g. multimodality for restarts
Learn about F during the search→ online tuning beyond distribution parameters
Not Covered Today
General
Statistical tests Tutorials available
From parameter setting Programming by Optimization, Hoos, 2014
to hyper-heuristics Burke et al., 08-15
Multi-objective Full track with own conferences
Continuous
Surrogates ML helping Optimization
Constraints from Lagrangian to specific approaches
Noise handling e.g., using Hoeffding or Bernstein inequalities
Other Representations/Paradigms
Genetic Programming Search a space of programs
Generative Developmental Systems . . . that build the solution
Conclusion
Flexibility