BBO Example MADS Computational experiments Discussion References
The Mesh Adaptive Direct Search Algorithm for Discrete Blackbox Optimization
Sébastien Le Digabel, Charles Audet, Christophe Tribes
GERAD and École Polytechnique de Montréal
ICCOPT, Tokyo
2016-08-08
BBO: Algorithm and Applications 1/32
Presentation outline
Blackbox optimization
Motivating example
The MADS algorithm
Computational experiments
Discussion
Blackbox optimization (BBO) problems
- Optimization problem:
      min_{x ∈ Ω} f(x)
- Evaluations of f (the objective function) and of the functions defining Ω are usually the result of a computer code (a blackbox).
- Variables are typically continuous, but in this work some of them are discrete: integers or granular variables.
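Concretely, a blackbox is just a function call whose body is an external code. A minimal Python sketch (the `simulator` command in the comment and the stand-in quadratic objective are hypothetical, for illustration only):

```python
import subprocess  # used only in the commented-out real-world variant

def blackbox(x):
    """Evaluate f(x); the optimizer sees only the returned value."""
    # In practice this would invoke an external code, e.g.:
    #   out = subprocess.run(["./simulator"] + [str(v) for v in x],
    #                        capture_output=True, text=True)
    #   return float(out.stdout)
    # Stand-in objective so the sketch is runnable:
    return sum((v - 0.5) ** 2 for v in x)

print(blackbox([0.25, 0.75]))  # → 0.125
```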
Blackboxes as illustrated by J. Simonis [ISMP 2009]
Motivating example: Parameter tuning (1/2)
- [Audet and Orban, 2006].
- The classical Trust-Region algorithm depends on four parameters x = (η1, η2, α1, α2) ∈ R^4_+.
- Consider a collection of 55 test problems from CUTEr.
- Let f(x) be the CPU time required to solve the collection of problems by a Trust-Region algorithm with parameters x.
- f(x0) ≈ 3h45 with the textbook values x0 = (1/4, 3/4, 1/2, 2).
Motivating example: Parameter tuning (2/2)
[Diagram: the MADS algorithm proposes x ∈ R^4_+ to the blackbox, which runs the Trust-Region code on the collection and returns the CPU time f(x).]
- This optimization produced x with f(x) ≈ 2h50.
- Victory? No, because x = (0.22125, 0.94457031, 0.37933594, 2.3042969).
Granular variables
The initial point x0 = (0.25, 0.75, 0.50, 2.00) is frequently used because each entry is a multiple of 0.25. Its granularity is G = 0.25.
- How can we devise a direct search algorithm so that it stops on a prescribed granularity? With a granularity of G = 0.05, the code might produce x = (0.20, 0.95, 0.40, 2.30), which is much nicer (for a human) than x = (0.22125, 0.94457031, 0.37933594, 2.3042969).
- This may be achieved using integer variables together with a relative scaling, but there is a simpler way for mesh-based methods.
Mesh Adaptive Direct Search (MADS) in R^n
- [Audet and Dennis, Jr., 2006].
- An iterative algorithm that evaluates the blackbox at trial points lying on a spatial discretization called the mesh.
- One iteration = search and poll.
- The search allows trial points generated anywhere on the mesh.
- The poll consists in generating a list of trial points constructed from poll directions. These directions grow dense.
- At the end of the iteration, the mesh size is reduced if no new success point is found.
[0] Initializations (x0; ∆0: initial poll size)
[1] Iteration k
    Let δk ≤ ∆k be the mesh size parameter.
    Search:
        test a finite number of mesh points.
    Poll (if the Search failed):
        construct a set of directions Dk;
        test the poll set Pk = {xk + δk d : d ∈ Dk}, with ‖δk d‖ ≈ ∆k.
[2] Updates
    If success:
        xk+1 ← success point; increase ∆k.
    Else:
        xk+1 ← xk; decrease ∆k.
    k ← k + 1; stop if ∆k ≤ ∆min, or go to [1].
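The [0]-[1]-[2] skeleton above can be sketched in a few lines of Python. This is a deliberately simplified, poll-only illustration with fixed coordinate directions and a δk = ∆k² mesh rule, not the actual MADS (whose poll directions grow dense in the unit sphere):

```python
import itertools

def mads_sketch(f, x0, delta0=1.0, delta_min=1e-6):
    """Bare-bones poll-only direct search following the [0]-[1]-[2]
    skeleton (coordinate poll directions only; no search step)."""
    x, Delta = list(x0), delta0
    while Delta > delta_min:                 # stop if Delta_k <= Delta_min
        delta = min(Delta, Delta ** 2)       # mesh size delta_k <= Delta_k
        scale = round(Delta / delta)
        # Poll set: x_k + delta_k * d with ||delta_k * d|| ~ Delta_k
        dirs = [d for d in itertools.product((-scale, 0, scale), repeat=len(x))
                if any(d)]
        trial = min((tuple(xi + delta * di for xi, di in zip(x, d))
                     for d in dirs), key=f)
        if f(trial) < f(x):                  # success: move, enlarge Delta_k
            x, Delta = list(trial), min(1.0, 2 * Delta)
        else:                                # failure: shrink Delta_k
            Delta /= 4
    return x

x_star = mads_sketch(lambda x: (x[0] - 1/3) ** 2, [0.0])
print(x_star)  # close to 1/3
```

Even on this smooth one-dimensional problem, the final iterate carries the "ugly decimals" issue motivated earlier: the mesh is dyadic, so the answer is a long binary fraction near 1/3.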
Poll illustration (successive fails and mesh shrinks)
[Figure: three successive polls around xk. With δk = 1 and ∆k = 1, the trial points are {p1, p2, p3}; after a failure, δk+1 = 1/4 and ∆k+1 = 1/2 give trial points {p4, p5, p6}; after another failure, δk+2 = 1/16 and ∆k+2 = 1/4 give trial points {p7, p8, p9}.]
Discrete variables in MADS – so far
- MADS has been designed for continuous variables.
- Some theory exists for categorical variables [Abramson, 2004].
- So far, only a patch allows handling integer variables: rounding plus a minimal mesh size of one.
In this work, we start from scratch and present direct search methods with a natural way of handling discrete variables.
Mesh refinement on min (x − 1/3)^2

∆k            xk
1             0
0.5           0.5
0.25          0.25
0.125         0.375
0.0625        0.3125
0.03125       0.34375
0.015625      0.328125
0.0078125     0.3359375
0.00390625    0.33203125
0.001953125   0.333984375

alternately

∆k      xk
1       0
0.5     0.5
0.2     0.4
0.1     0.3
0.05    0.35
0.02    0.34
0.01    0.33
0.005   0.335
0.002   0.332
0.001   0.333

Idea: instead of dividing ∆k by 2, change it so that
    10 × 10^b refines to 5 × 10^b,
    5 × 10^b refines to 2 × 10^b,
    2 × 10^b refines to 1 × 10^b.

To get three decimals, one simply sets the granularity to 0.001. Integer variables are treated by setting the granularity to G = 1.
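The 10 → 5 → 2 → 1 refinement rule above can be sketched directly (the helper name `refine` is ours; the small epsilon and final rounding only guard against floating-point noise):

```python
import math

def refine(delta):
    """One refinement step on the {1, 2, 5} x 10^b ladder:
    ... -> 5*10^b -> 2*10^b -> 1*10^b -> 5*10^(b-1) -> ..."""
    b = math.floor(math.log10(delta) + 1e-12)   # write delta = a * 10^b
    a = round(delta / 10 ** b)                  # a in {1, 2, 5}
    nxt = {5: 2 * 10 ** b, 2: 1 * 10 ** b, 1: 5 * 10 ** (b - 1)}[a]
    return round(nxt, 12)                       # tidy float noise

# Refine from 1 down to a granularity of 0.001:
seq, d = [1.0], 1.0
while d > 0.001:
    d = refine(d)
    seq.append(d)
print(seq)  # 1.0, 0.5, 0.2, 0.1, 0.05, 0.02, 0.01, 0.005, 0.002, 0.001
```

Unlike halving, every value on this ladder has a bounded number of decimals, so the iterates it generates stay on a decimal grid.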
Poll and mesh size parameter update
- The poll size parameter ∆k is updated as
      10 × 10^b ←→ 5 × 10^b ←→ 2 × 10^b ←→ 1 × 10^b.
- The fine underlying mesh is defined with the mesh size parameter
      δk = 1 if ∆k ≥ 1, and δk = max{10^(2b), G} otherwise, where ∆k ∈ {1, 2, 5} × 10^b.
- Example with a granularity of G = 0.005:

  δk      ∆k
  1       5
  1       2
  1       1
  0.01    0.5
  0.01    0.2
  0.01    0.1
  0.005   0.05
  0.005   0.02
  0.005   0.01
  0.005   0.005  ← stop
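The mesh size rule δk = 1 if ∆k ≥ 1, else max{10^(2b), G} with ∆k = a × 10^b, a ∈ {1, 2, 5}, can be sketched as follows (the helper name `mesh_size` is ours; the epsilon guards the floating-point floor):

```python
import math

def mesh_size(Delta, G):
    """Mesh size delta_k for a poll size Delta_k = a * 10^b, a in {1,2,5}:
    delta_k = 1 if Delta_k >= 1, else max(10^(2b), G)."""
    if Delta >= 1:
        return 1.0
    b = math.floor(math.log10(Delta) + 1e-12)   # exponent b of Delta_k
    return max(10.0 ** (2 * b), G)

# Reproduce the G = 0.005 ladder:
for Delta in (5, 2, 1, 0.5, 0.2, 0.1, 0.05, 0.02, 0.01, 0.005):
    print(Delta, mesh_size(Delta, G=0.005))
```

Note how δk shrinks like ∆k² (as in continuous MADS) until it hits the floor G, at which point the algorithm can stop exactly on the requested granularity.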
[0] Initializations (x0; ∆0 ∈ {1, 2, 5} × 10^b; G: granularity)
[1] Iteration k
    Let δk ≤ ∆k be the mesh size parameter.
    Search:
        test a finite number of mesh points.
    Poll (if the Search failed):
        construct a set of directions Dk;
        test the poll set Pk = {xk + δk d : d ∈ Dk}, with ‖δk d‖ ≈ ∆k.
[2] Updates
    If success:
        xk+1 ← success point; increase ∆k.
    Else:
        xk+1 ← xk; decrease ∆k.
    k ← k + 1; stop if δk = G, or go to [1].
Theory
- The MADS analysis relies on the property that "the sequence of trial points is located on some discretization of the space of variables called the mesh."
- By multiplying or dividing ∆k by a rational number τ, [Torczon, 1997] showed that all trial points from iteration 0 to ℓ are located on a fine underlying mesh. The proof is not trivial, uses the fact that τ ∈ Q, and does not work for τ ∈ R (the paper won the SIAM Outstanding Paper Prize).
- With the new mesh, that technical part of the proof becomes: consider any trial point t considered from iterations k = 0 to ℓ.
    If a granularity Gi is requested on variable i, then ti lies on the mesh of granularity Gi.
    If no granularity is requested on variable i, then ti lies on the mesh of granularity 10^(bi), with bi = min{bi^k : k = 0, 1, ..., ℓ}.
Trust-Region parameter tuning
Find the values of the four parameters x = (η1, η2, α1, α2) ∈ R^4_+ that minimize the overall CPU time to solve 55 CUTEr problems.
- A surrogate function s is defined as the time to solve a collection of small-sized problems.
- In 2005, f(x) ≈ 3h-4h and s(x) ≈ 1m. The surrogate was 200 times faster (now ≈ 140 times).

Theorem (Moore's Law is valid)

  Year    CPU f(x)   CPU s(x)
  2005    13800 s    69 s
  2016    330 s      2.3 s
  Ratio:  42         30

Moore's Law says speed doubles every 2 years: 2016 − 2005 = 11 years ⇒ 5.5 cycles, a speedup of 2^5.5 ≈ 45.
Standard ×2, ÷2 versus new {1, 2, 5} × 10^b [comparison figure]
Trust-Region parameter tuning: Results

        Standard        New               Granularity
        ×2, ÷2          {1,2,5} × 10^b    0.01       0.05
CPU:    228 s           207 s             205 s      220 s
η1      0.2508936274    0.5325498121      0.35       0.35
η2      0.9008242693    0.9938378210      0.99       0.95
α1      0.5464425868    0.4817164472      0.50       0.45
α2      1.5505441560    3.5206901980      2.82       3.20

Observations:
- On this example, the new strategies seem preferable.
- Trust-region recommendation for humans: (η1, η2, α1, α2) = (0.35, 0.99, 0.5, 2.82).
Results on a collection of mixed-integer problems
- 43 problems taken from 3 papers by J. Müller et al.:
    - SO-MI [Müller et al., 2013]: 19 problems, mixed, constrained;
    - SO-I [Müller et al., 2014]: 16 problems, discrete, constrained;
    - MISO [Müller, 2016]: 8 problems, mixed, unconstrained.
- 10 LHS starting points are considered for each problem, for a total of 430 instances.
- NOMAD 3.7.3 (current release, classic mesh) vs NOMAD 3.8.Dev (prototype, new mesh).
- Performance and data profiles [Moré and Wild, 2009].
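As a reminder of what a data profile measures [Moré and Wild, 2009]: for a tolerance τ, an instance counts as solved once f(x) ≤ f_L + τ(f(x0) − f_L), where f_L is the best value found by any solver, and the profile reports the fraction of instances solved within a budget of κ simplex gradients, i.e. κ(n + 1) evaluations. A minimal sketch (the history format is our own assumption):

```python
def data_profile(histories, f_best, kappas, tau=1e-3):
    """Moré-Wild data profile for one solver.
    histories: list of (n, [f after 1st eval, after 2nd eval, ...]);
    f_best: best value f_L found by any solver on each instance;
    kappas: budgets in units of simplex gradients (n + 1 evaluations)."""
    profile = []
    for kappa in kappas:
        solved = 0
        for (n, fs), fL in zip(histories, f_best):
            target = fL + tau * (fs[0] - fL)      # convergence test level
            budget = int(kappa * (n + 1))         # evaluation budget
            if any(f <= target for f in fs[:budget]):
                solved += 1
        profile.append(solved / len(histories))
    return profile

# Toy check on two 2-variable instances (hypothetical histories):
hists = [(2, [10.0, 4.0, 1.0, 0.5]), (2, [8.0, 8.0, 8.0, 8.0])]
print(data_profile(hists, f_best=[0.5, 8.0], kappas=[1.0]))  # → [0.5]
```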
Performance profiles for the 430 instances
[Figure: performance profiles (43 × 10 instances) for NOMAD 3.7.3 vs NOMAD 3.8.Dev; percentage of problems solved (75-100%) against precision (0-0.9).]
Data profiles for the 430 instances (precision 10^-1)
[Figure: data profiles (43 × 10 instances, precision 10^-1) for NOMAD 3.7.3 vs NOMAD 3.8.Dev; percentage of problems solved against the number of simplex gradients (0-1000).]
Data profiles for the 430 instances (precision 10^-3)
[Figure: data profiles (43 × 10 instances, precision 10^-3) for NOMAD 3.7.3 vs NOMAD 3.8.Dev; percentage of problems solved against the number of simplex gradients (0-1000).]
Data profiles for the 430 instances (precision 10^-7)
[Figure: data profiles (43 × 10 instances, precision 10^-7) for NOMAD 3.7.3 vs NOMAD 3.8.Dev; percentage of problems solved against the number of simplex gradients (0-1000).]
Discussion (1/2)
- New mesh parameter update rules, {1, 2, 5} × 10^b, control the number of decimals:
    - a native way to handle the granularity G of the variables;
    - integer variables are handled by setting G = 1.
- Computational experiments on trust-region parameters:
    - the new parameters reduce CPU time by 35% (versus the textbook values);
    - the new parameters have granularity 0.01 (readable by humans);
    - experiments on other test problems are in progress.
- Computational experiments on analytical problems: ≈ 3% performance improvement over the previous NOMAD version.
Discussion (2/2)
- This will be part of our NOMAD 3.8 software:
    - the only additional input from the user is G...
    - ...and it is optional.
- www.gerad.ca/nomad.
References I

Abramson, M. (2004). Mixed variable optimization of a load-bearing thermal insulation system using a filter pattern search algorithm. Optimization and Engineering, 5(2):157-177.

Audet, C. and Dennis, Jr., J. (2006). Mesh adaptive direct search algorithms for constrained optimization. SIAM Journal on Optimization, 17(1):188-217.

Audet, C. and Orban, D. (2006). Finding optimal algorithmic parameters using derivative-free optimization. SIAM Journal on Optimization, 17(3):642-664.

Le Digabel, S. (2011). Algorithm 909: NOMAD: Nonlinear optimization with the MADS algorithm. ACM Transactions on Mathematical Software, 37(4):44:1-44:15.

Moré, J. and Wild, S. (2009). Benchmarking derivative-free optimization algorithms. SIAM Journal on Optimization, 20(1):172-191.
References II

Müller, J. (2016). MISO: mixed-integer surrogate optimization framework. Optimization and Engineering, 17(1):177-203.

Müller, J., Shoemaker, C., and Piché, R. (2013). SO-MI: A surrogate model algorithm for computationally expensive nonlinear mixed-integer black-box global optimization problems. Computers and Operations Research, 40(5):1383-1400.

Müller, J., Shoemaker, C., and Piché, R. (2014). SO-I: A surrogate model algorithm for expensive nonlinear integer programming problems including global optimization applications. Journal of Global Optimization, 59(4):865-889.

Torczon, V. (1997). On the convergence of pattern search algorithms. SIAM Journal on Optimization, 7(1):1-25.