Evolutionary Deep Neural Network (or NeuroEvolution)
Soo-Yong Shin
2017. 8. 29
@SNU TensorFlow Study
2
https://www.youtube.com/watch?v=aeWmdojEJf0
https://github.com/ssusnic/Machine-Learning-Flappy-Bird
3
Evolutionary DNN
• Usually used to decide the DNN structure
– Number of layers, number of nodes, …
• Can be used to decide weight values
– Flappy bird example
Evolutionary Computation
5
Biological Basis
• Biological systems adapt themselves to a new environment by evolution.
• Biological evolution
– Production of descendants changed from their parents
– Selective survival of some of these descendants to produce more descendants
Survival of the Fittest
6
Evolutionary Computation
• Stochastic search (or problem-solving) techniques based on the metaphor of natural biological evolution.
7
General Framework
1. Generate an initial set of candidate solutions
2. Evaluate fitness (via the fitness function)
3. Terminate? If yes, output the best solution found
4. If no, select parent individuals (selection operation)
5. Generate offspring (crossover and mutation operations), then go back to step 2
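The loop above can be sketched as minimal Python (a sketch only; all names are illustrative, and truncation selection stands in for the parent-selection step):

```python
import random

def evolve(pop_size, genome_len, fitness_fn, generations):
    # 1. Generate an initial population of random bitstrings.
    pop = [[random.randint(0, 1) for _ in range(genome_len)]
           for _ in range(pop_size)]
    for _ in range(generations):
        # 2./4. Evaluate fitness and select parents (truncation: best half).
        parents = sorted(pop, key=fitness_fn, reverse=True)[:pop_size // 2]
        # 5. Generate offspring by crossover and mutation.
        children = []
        while len(children) < pop_size:
            p1, p2 = random.sample(parents, 2)
            cut = random.randrange(1, genome_len)   # one-point crossover
            child = p1[:cut] + p2[cut:]
            if random.random() < 0.05:              # occasional mutation
                i = random.randrange(genome_len)
                child[i] ^= 1                       # flip one bit
            children.append(child)
        pop = children
    # 3. Terminate after a fixed budget; return the best solution found.
    return max(pop, key=fitness_fn)

# OneMax: fitness = number of 1-bits; the optimum is the all-ones string.
best = evolve(pop_size=20, genome_len=16, fitness_fn=sum, generations=60)
```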
8
Paradigms in EC
• Genetic Algorithm (GA) [J. Holland, 1975]
– Bitstrings, mainly crossover, proportionate selection
• Genetic Programming (GP) [J. Koza, 1992]
– Trees, mainly crossover, proportionate selection
• Evolutionary Programming (EP) [L. Fogel et al., 1966]
– FSMs, mutation only, tournament selection
• Evolution Strategy (ES) [I. Rechenberg, 1973]
– Real values, mainly mutation, ranking selection
Genetic Algorithms
10
GA (Genetic Algorithms)
• A method that finds fit hypotheses by mimicking natural genetic inheritance
• Characteristics
– Evolution is a successful and sophisticated adaptation mechanism in nature
– Applicable even to complex problems that are hard to model
– Parallelizable, so it can benefit from hardware performance
• A general optimization process that searches a large search space for the solution with the best fitness
• Finding the optimal solution is not guaranteed, but solutions of high fitness can be obtained
11
Basic GA Terms (1/2)
• Chromosome / Individual
– A candidate solution (hypothesis) for the given problem
– Usually represented as a string
– The string's elements may be integers, reals, etc., as the problem requires
• Population
– A set of individuals (hypotheses)
1 1 0 1 0 0 1 1
12
Basic GA Terms (2/2)
• Fitness
– A numeric measure of how good a hypothesis is
– Evaluates how well each individual is adapted to its environment
– For optimization problems, typically the objective function value, or a penalty function value that accounts for the constraints
• Fitness function
– The criterion used to compute the fitness
13
GA Operators (1/5)
• Selection operator
– Selects individuals to serve as parents
– Individuals with better fitness are given a higher probability of selection, so that better offspring are produced (and a solution is found)
– Proportional (roulette wheel) selection
– Tournament selection
– Ranking-based selection
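Roulette-wheel (proportional) selection can be sketched as follows (a sketch; assumes non-negative fitness values):

```python
import random

def roulette_select(population, fitnesses):
    # Spin the wheel: each individual occupies a slice of [0, total)
    # proportional to its fitness.
    total = sum(fitnesses)
    r = random.uniform(0, total)
    acc = 0.0
    for individual, fit in zip(population, fitnesses):
        acc += fit
        if acc >= r:
            return individual
    return population[-1]  # guard against floating-point round-off

pop = ["A", "B", "C"]
fits = [1.0, 1.0, 8.0]  # "C" should win about 80% of the spins
picks = [roulette_select(pop, fits) for _ in range(10000)]
```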
14
GA Operators (2/5)
• Crossover operator
– Creates offspring by crossing the parents' chromosomes, as in biological reproduction
– Whether crossover is performed is decided by a probability called the crossover rate
15
GA Operators (3/5)
– One-point crossover
1 1 0 1 0 0 1 1
0 1 1 1 0 1 1 0
Crossover point
1 1 0 1 0 1 1 0
0 1 1 1 0 0 1 1
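The figure's exchange can be reproduced with a minimal one-point crossover (the explicit cut point is for illustration; in a GA the cut is drawn at random):

```python
def one_point_crossover(parent1, parent2, cut):
    # Swap the tails of the two parents after the cut point.
    child1 = parent1[:cut] + parent2[cut:]
    child2 = parent2[:cut] + parent1[cut:]
    return child1, child2

# The parents and cut point from the slide's figure.
p1 = [1, 1, 0, 1, 0, 0, 1, 1]
p2 = [0, 1, 1, 1, 0, 1, 1, 0]
c1, c2 = one_point_crossover(p1, p2, cut=5)
# c1 == [1, 1, 0, 1, 0, 1, 1, 0] and c2 == [0, 1, 1, 1, 0, 0, 1, 1]
```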
16
GA Operators (4/5)
• Mutation operator
– Flips a bit with a probability called the mutation rate
– Applied with a very small probability, e.g., 0.001
1 1 0 1 0 0 1 1
1 1 0 1 1 0 1 1
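Bit-flip mutation can be sketched as:

```python
import random

def mutate(bits, mutation_rate=0.001):
    # Flip each bit independently with a very small probability.
    return [bit ^ 1 if random.random() < mutation_rate else bit
            for bit in bits]

genome = [1, 1, 0, 1, 0, 0, 1, 1]
mutant = mutate(genome)   # almost always identical at rate 0.001
```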
17
Example of Genetic Algorithm
18
Hypothesis Space Search
• Compared with other search methods
– Less likely to get stuck in local minima (abrupt moves are possible)
• Crowding
– Similar individuals come to occupy a large fraction of the population
– Reduces diversity
19
Crowding
• Remedies for crowding
– Change the selection method
• Tournament selection, ranking selection
– “Fitness sharing”
• Reduce an individual's fitness when many similar individuals exist
– Restrict which individuals can mate
• Let the most similar individuals mate with each other, forming clusters or multiple subspecies
• Distribute individuals spatially and allow mating only between neighbors
20
Typical behavior of an EA
• Phases in optimizing on a 1-dimensional fitness landscape
Early phase:
quasi-random population distribution
Mid-phase:
population arranged around/on hills
Late phase:
population concentrated on high hills
21
Geometric Analogy - Mathematical Landscape
22
Typical run: progression of fitness
Typical run of an EA shows so-called “anytime behavior”
[Plot: best fitness in population vs. time (number of generations)]
23
[Plot: best fitness in population vs. time (number of generations), annotated with the progress made in the 1st and in the 2nd half of the run]
Are long runs beneficial?
• Answer: it depends
– how much you want the last bit of progress
– it may be better to do more, shorter runs
24
[Plot: performance of methods across the scale of “all” problems, comparing three curves: random search, a special problem-tailored method, and an evolutionary algorithm]
ECs as problem solvers: Goldberg’s 1989 view
25
Advantages of EC
• No presumptions w.r.t. problem space
• Widely applicable
• Low development & application costs
• Easy to incorporate other methods
• Solutions are interpretable (unlike NN)
• Can be run interactively, accommodate user proposed solutions
• Provide many alternative solutions
26
Disadvantages of EC
• No guarantee for optimal solution within finite time
• Weak theoretical basis
• May need parameter tuning
• Often computationally expensive, i.e. slow
Genetic Programming
28
Genetic Programming
• Genetic programming uses variable-size tree-representations rather than fixed-length strings of binary values.
• Program tree
= S-expression
= LISP parse tree
• Tree = Functions (Nonterminals) + Terminals
29
GP Tree: An Example
• Function set: internal nodes
– Functions, predicates, or actions which take one or more arguments
• Terminal set: leaf nodes
– Program constants, actions, or functions which take no arguments
S-expression: (+ 3 (/ (- 5 4) 7))
Terminals = {3, 4, 5, 7}
Functions = {+, -, /}
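A tree like this S-expression can be evaluated with a small recursive interpreter (a sketch; the nested-tuple encoding is an illustrative choice):

```python
import operator

# Function set maps symbols to Python operators.
FUNCTIONS = {"+": operator.add, "-": operator.sub, "/": operator.truediv}

def eval_tree(node):
    # Terminals are plain numbers; function nodes are (symbol, left, right).
    if isinstance(node, tuple):
        symbol, left, right = node
        return FUNCTIONS[symbol](eval_tree(left), eval_tree(right))
    return node

# The slide's S-expression (+ 3 (/ (- 5 4) 7)) as a nested tuple.
tree = ("+", 3, ("/", ("-", 5, 4), 7))
value = eval_tree(tree)   # 3 + (5 - 4) / 7
```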
30
Tree based representation
• Trees are a universal form, e.g. consider
• Arithmetic formula: 2 · π + ((x + 3) - (y / (5 + 1)))
• Logical formula: (x ∧ true) → ((x ∨ y) ∨ (z ↔ (x ∧ y)))
• Program:
i = 1;
while (i < 20)
{
i = i + 1
}
31
Tree based representation
• In GA, ES, EP chromosomes are linear structures (bit strings, integer string, real-valued vectors, permutations)
• Tree shaped chromosomes are non-linear structures.
• In GA, ES, EP the size of the chromosomes is fixed.
• Trees in GP may vary in depth and width.
32
Crossover: Subtree Exchange
[Figure: two parent trees exchange randomly chosen subtrees, producing two offspring trees]
33
Mutation
[Figure: a randomly chosen subtree is replaced by a newly generated random subtree]
Evolution strategies
35
ES quick overview
• Developed: Germany in the 1970’s
• Early names: I. Rechenberg, H.-P. Schwefel
• Typically applied to:
– Numerical optimisation
• Attributed features:
– Fast
– Good optimizer for real-valued optimisation
– Relatively much theory
• Special:
– Self-adaptation of (mutation) parameters is standard
36
ES technical summary
Representation: Real-valued vectors
Recombination: Discrete or intermediary
Mutation: Gaussian perturbation
Parent selection: Uniform random
Survivor selection: (μ,λ) or (μ+λ)
Specialty: Self-adaptation of mutation step sizes
37
Introductory example
• Task: minimise f : Rⁿ → R
• Algorithm: “two-membered ES” using
– Vectors from Rⁿ directly as chromosomes
– Population size 1
– Only mutation creating one child
– Greedy selection
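The two-membered (1+1)-ES described above can be sketched as follows (the sphere function and the fixed step size σ are illustrative choices, not part of the slide):

```python
import random

def two_membered_es(f, x, sigma=0.5, iterations=2000):
    # Population size 1: mutate the single parent with Gaussian noise
    # and keep the child only if it is at least as good (greedy selection).
    for _ in range(iterations):
        child = [xi + random.gauss(0.0, sigma) for xi in x]
        if f(child) <= f(x):
            x = child
    return x

def sphere(v):
    # A standard test function to minimise; the optimum is 0 at the origin.
    return sum(xi * xi for xi in v)

best = two_membered_es(sphere, [5.0, -3.0, 4.0])
```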
38
Parent selection
• Parents are selected by uniform random distribution whenever an operator needs one/some
• Thus: ES parent selection is unbiased -every individual has the same probability to be selected
• Note that in ES “parent” means a population member (in GA’s: a population member selected to undergo variation)
39
Survivor selection
• Applied after creating children from the parents by mutation and recombination
• Deterministically chops off the “bad stuff”
• Basis of selection is either:
– The set of children only: (μ,λ)-selection
– The set of parents and children: (μ+λ)-selection
40
Survivor selection cont’d
• (μ+λ)-selection is an elitist strategy
• (μ,λ)-selection can “forget”
• Often (μ,λ)-selection is preferred for:
– Better in leaving local optima
– Better in following moving optima
– Using the + strategy, bad σ values can survive in ⟨x, σ⟩ too long if their host x is very fit
• Selective pressure in ES is very high (λ ≈ 7 · μ is the common setting)
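The two survivor-selection schemes can be sketched in a few lines (a sketch; individuals are represented here just by their fitness values, with minimisation assumed):

```python
def survivor_select(parents, children, mu, scheme):
    # (μ,λ): keep the μ best children only -- parents are "forgotten".
    # (μ+λ): keep the μ best of parents and children together (elitist).
    pool = children if scheme == "comma" else parents + children
    return sorted(pool)[:mu]   # minimisation: smaller value is better

parents = [1.0, 2.0]
children = [0.5, 3.0, 4.0, 5.0]
comma = survivor_select(parents, children, mu=2, scheme="comma")
plus = survivor_select(parents, children, mu=2, scheme="plus")
# comma == [0.5, 3.0]: the good parent 1.0 is lost ("forgetting")
# plus == [0.5, 1.0]: the good parent survives (elitism)
```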
Evolutionary Programming
42
EP quick overview
• Developed: USA in the 1960’s
• Early names: L. Fogel (traditional EP), later D. Fogel (contemporary EP)
• Typically applied to:
– Traditional EP: machine learning tasks by finite state machines
– Contemporary EP: (numerical) optimization
• Attributed features:
– Very open framework: any representation and mutation operators are OK
– Crossbred with ES (contemporary EP)
– Consequently: hard to say what “standard” EP is
• Special:
– No recombination
– Self-adaptation of parameters is standard (contemporary EP)
43
EP technical summary tableau
Representation: Real-valued vectors
Recombination: None
Mutation: Gaussian perturbation
Parent selection: Deterministic
Survivor selection: Probabilistic (μ+μ)
Specialty: Self-adaptation of mutation step sizes (in meta-EP)
Evolutionary Neural Networks(or Neuro-evolution)
45
ENN
• The back-propagation learning algorithm cannot guarantee an optimal solution.
• In real-world applications, the back-propagation algorithm might converge to a set of sub-optimal weights from which it cannot escape.
• As a result, the neural network is often unable to find a desirable solution to a problem at hand.
46
ENN
• Another difficulty is related to selecting an optimal topology for the neural network.
– The “right” network architecture for a particular problem is often chosen by means of heuristics, and designing a neural network topology is still more art than engineering.
• Genetic algorithms are an effective optimization technique that can guide both weight optimization and topology selection.
47
Encoding a set of weights in a chromosome
[Figure: a feed-forward network with inputs x1-x3 (neurons 1-3), hidden neurons 4-7, and output neuron 8 (y); the weight of each connection is written into a from-neuron × to-neuron matrix, whose non-zero rows are read out to form the chromosome]
Chromosome: 0.9 -0.3 -0.7 -0.8 0.6 0.3 0.1 -0.2 0.2 0.4 0.5 0.8 -0.6 0.1 -0.2 0.9
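Row-by-row flattening of the weight matrices, as on the slide, can be sketched as follows (a sketch; `encode`/`decode` are illustrative helper names, and the 3-input, 4-hidden, 1-output layout is taken from the figure):

```python
import numpy as np

def encode(weight_matrices):
    # Read each weight matrix row by row into one flat chromosome.
    return np.concatenate([w.ravel() for w in weight_matrices])

def decode(chromosome, shapes):
    # Rebuild the weight matrices from the flat chromosome.
    matrices, offset = [], 0
    for rows, cols in shapes:
        n = rows * cols
        matrices.append(chromosome[offset:offset + n].reshape(rows, cols))
        offset += n
    return matrices

# Hidden neurons 4-7, each fed by inputs 1-3 (values from the slide).
w_hidden = np.array([[0.9, -0.3, -0.7],
                     [-0.8, 0.6, 0.3],
                     [0.1, -0.2, 0.2],
                     [0.4, 0.5, 0.8]])
# Output neuron 8, fed by hidden neurons 4-7.
w_out = np.array([[-0.6, 0.1, -0.2, 0.9]])

chromosome = encode([w_hidden, w_out])   # 16 genes, as on the slide
w_h2, w_o2 = decode(chromosome, [(4, 3), (1, 4)])
```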
48
Fitness function
• The second step is to define a fitness function for evaluating the chromosome’s performance.
– This function must estimate the performance of a given neural network.
– A simple choice is the sum of squared errors over the training data.
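Such a fitness function can be sketched as follows (negating the sum of squared errors so that higher fitness is better is an assumed convention; the toy "network" stands in for a decoded chromosome):

```python
def fitness(network, dataset):
    # Sum of squared errors over the data; negated so that a better
    # network gets a higher fitness.
    sse = sum((network(x) - y) ** 2 for x, y in dataset)
    return -sse

# A toy one-input "network" that happens to fit the data perfectly.
net = lambda x: 0.5 * x
data = [(0.0, 0.0), (1.0, 0.5), (2.0, 1.0)]
```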
49
Crossover
[Figure: two parent networks (parent 1 and parent 2) are encoded as weight chromosomes; crossover exchanges a segment of the weight genes, producing a child network that takes some connection weights from each parent]
50
Mutation
[Figure: in the original network, two randomly selected weights (0.4 and -0.3) are replaced (by -0.1 and 0.2), yielding the mutated network]
51
Architecture Selection
• The architecture of the network (i.e. the number of neurons and their interconnections) often determines the success or failure of the application.
• Usually the network architecture is decided by trial and error; there is a great need for a method of automatically designing the architecture for a particular application.
– Genetic algorithms may well be suited for this task.
52
Encoding
[Figure: a network of six neurons (inputs x1, x2; hidden neurons 3-5; output neuron 6, y); its topology is encoded as a 6 × 6 binary from-neuron/to-neuron connectivity matrix, read out row by row]
0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 1 1 1 0
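Decoding such a chromosome back into a connectivity matrix can be sketched as follows (a sketch assuming the matrix rows are listed in order, as on the slide; `decode_architecture` is an illustrative helper name):

```python
def decode_architecture(chromosome, n_neurons):
    # Cut the flat chromosome into the rows of a from/to connectivity
    # matrix: entry (i, j) == 1 means neuron j+1 feeds neuron i+1.
    return [chromosome[i * n_neurons:(i + 1) * n_neurons]
            for i in range(n_neurons)]

# The chromosome from the slide (6 neurons -> 36 bits).
chromosome = [0, 0, 0, 0, 0, 0,
              0, 0, 0, 0, 0, 0,
              1, 1, 0, 0, 0, 0,
              1, 0, 0, 0, 0, 0,
              0, 1, 0, 0, 0, 0,
              0, 1, 1, 1, 1, 0]
matrix = decode_architecture(chromosome, 6)
# Row 3 says neuron 3 is fed by neurons 1 and 2; the last row says
# the output neuron 6 is fed by neurons 2-5.
```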
53
Process
[Figure: in generation i, each neural network j is trained on a small numeric training data set and assigned a fitness (e.g., fitness = 117); parents 1 and 2 are selected, crossover and mutation produce children 1 and 2, which enter generation i + 1]
Evolutionary DNN
55
Good reference blog
• https://medium.com/@stathis/design-by-evolution-393e41863f98
56
Evolving Deep Neural Networks
• https://arxiv.org/pdf/1703.00548.pdf
• CoDeepNEAT
– for optimizing deep learning architectures through evolution
– Evolving DNNs for CIFAR-10
– Evolving LSTM architectures
– The experimental comparison is not so clear, though
57
Large-Scale Evolution of Image Classifiers
• https://arxiv.org/abs/1703.01041
• Individual
– a trained architecture
• Fitness
– Individual’s accuracy on a validation set
• Selection (tournament selection)
– Randomly choose two individuals
– Select better one (parent)
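The tournament step described above can be sketched as follows (a sketch; the bitstring individuals and `sum` fitness are toy stand-ins for trained architectures and their validation accuracy):

```python
import random

def binary_tournament(population, fitness):
    # Randomly choose two individuals; the fitter one becomes a parent.
    a, b = random.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b

population = [[1, 0, 1], [0, 0, 0], [1, 1, 1]]
parent = binary_tournament(population, fitness=sum)
# The all-zero individual can never win a tournament here.
```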
58
Large-Scale Evolution of Image Classifiers
• Mutation
– Pick a mutation from a predetermined set
• Train child
• Repeat.
59
Large-Scale Evolution of Image Classifiers
60
Convolution by Evolution
• https://arxiv.org/pdf/1606.02580.pdf
• GECCO16 paper
• Differentiable version of the Compositional Pattern Producing Network (DPPN)
– Topology is evolved but the weights are learned
– Compressed the weights of a denoising autoencoder from 157,684 to roughly 200 parameters with comparable image reconstruction accuracy
61
62
@likesky3