Evolutionary Deep Neural Network (or NeuroEvolution)
Soo-Yong Shin
2017. 8. 29
@SNU TensorFlow Study
2
https://www.youtube.com/watch?v=aeWmdojEJf0
https://github.com/ssusnic/Machine-Learning-Flappy-Bird
3
Evolutionary DNN
• Usually used to decide the DNN structure
– Number of layers, number of nodes, …
• Can be used to decide weight values
– Flappy bird example
Evolutionary Computation
5
Biological Basis
• Biological systems adapt themselves to a new environment by evolution.
• Biological evolution
– Production of descendants changed from their parents
– Selective survival of some of these descendants to produce more descendants
Survival of the Fittest
6
Evolutionary Computation
• Stochastic search (or problem-solving) techniques based on the metaphor of natural biological evolution.
7
General Framework
1. Generate an initial set of candidate solutions
2. Evaluate fitness (via the fitness function)
3. Terminate? If yes, output the best solution found
4. If no, select parent individuals (selection operation)
5. Generate offspring (crossover and mutation operations), then go back to step 2
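The loop above can be sketched as minimal Python (a sketch only; all names are illustrative, and truncation selection stands in for the parent-selection step):

```python
import random

def evolve(pop_size, genome_len, fitness_fn, generations):
    # 1. Generate an initial population of random bitstrings.
    pop = [[random.randint(0, 1) for _ in range(genome_len)]
           for _ in range(pop_size)]
    for _ in range(generations):
        # 2./4. Evaluate fitness and select parents (truncation: best half).
        parents = sorted(pop, key=fitness_fn, reverse=True)[:pop_size // 2]
        # 5. Generate offspring by crossover and mutation.
        children = []
        while len(children) < pop_size:
            p1, p2 = random.sample(parents, 2)
            cut = random.randrange(1, genome_len)   # one-point crossover
            child = p1[:cut] + p2[cut:]
            if random.random() < 0.05:              # occasional mutation
                i = random.randrange(genome_len)
                child[i] ^= 1                       # flip one bit
            children.append(child)
        pop = children
    # 3. Terminate after a fixed budget; return the best solution found.
    return max(pop, key=fitness_fn)

# OneMax: fitness = number of 1-bits; the optimum is the all-ones string.
best = evolve(pop_size=20, genome_len=16, fitness_fn=sum, generations=60)
```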
8
Paradigms in EC
• Genetic Algorithm (GA) [J. Holland, 1975]
– Bitstrings, mainly crossover, proportionate selection
• Genetic Programming (GP) [J. Koza, 1992]
– Trees, mainly crossover, proportionate selection
• Evolutionary Programming (EP) [L. Fogel et al., 1966]
– FSMs, mutation only, tournament selection
• Evolution Strategy (ES) [I. Rechenberg, 1973]
– Real values, mainly mutation, ranking selection
Genetic Algorithms
10
GA (Genetic Algorithms)
• A method that finds fit hypotheses by mimicking natural genetic inheritance
• Characteristics
– Evolution is a successful and sophisticated adaptation mechanism in nature
– Applicable even to complex problems that are hard to model
– Parallelizable, so it can benefit from hardware performance
• A general optimization process that searches a large search space for the solution with the best fitness
• Finding the optimal solution is not guaranteed, but solutions of high fitness can be obtained
11
Basic GA Terms (1/2)
• Chromosome / Individual
– A candidate solution (hypothesis) for the given problem
– Usually represented as a string
– The string's elements may be integers, reals, etc., as the problem requires
• Population
– A set of individuals (hypotheses)
1 1 0 1 0 0 1 1
12
Basic GA Terms (2/2)
• Fitness
– A numeric measure of how good a hypothesis is
– Evaluates how well each individual is adapted to its environment
– For optimization problems, typically the objective function value, or a penalty function value that accounts for the constraints
• Fitness function
– The criterion used to compute the fitness
13
GA Operators (1/5)
• Selection operator
– Selects individuals to serve as parents
– Individuals with better fitness are given a higher probability of selection, so that better offspring are produced (and a solution is found)
– Proportional (roulette wheel) selection
– Tournament selection
– Ranking-based selection
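Roulette-wheel (proportional) selection can be sketched as follows (a sketch; assumes non-negative fitness values):

```python
import random

def roulette_select(population, fitnesses):
    # Spin the wheel: each individual occupies a slice of [0, total)
    # proportional to its fitness.
    total = sum(fitnesses)
    r = random.uniform(0, total)
    acc = 0.0
    for individual, fit in zip(population, fitnesses):
        acc += fit
        if acc >= r:
            return individual
    return population[-1]  # guard against floating-point round-off

pop = ["A", "B", "C"]
fits = [1.0, 1.0, 8.0]  # "C" should win about 80% of the spins
picks = [roulette_select(pop, fits) for _ in range(10000)]
```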
14
GA Operators (2/5)
• Crossover operator
– Creates offspring by crossing the parents' chromosomes, as in biological reproduction
– Whether crossover is performed is decided by a probability called the crossover rate
15
GA Operators (3/5)
– One-point crossover
1 1 0 1 0 0 1 1
0 1 1 1 0 1 1 0
Crossover point
1 1 0 1 0 1 1 0
0 1 1 1 0 0 1 1
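The figure's exchange can be reproduced with a minimal one-point crossover (the explicit cut point is for illustration; in a GA the cut is drawn at random):

```python
def one_point_crossover(parent1, parent2, cut):
    # Swap the tails of the two parents after the cut point.
    child1 = parent1[:cut] + parent2[cut:]
    child2 = parent2[:cut] + parent1[cut:]
    return child1, child2

# The parents and cut point from the slide's figure.
p1 = [1, 1, 0, 1, 0, 0, 1, 1]
p2 = [0, 1, 1, 1, 0, 1, 1, 0]
c1, c2 = one_point_crossover(p1, p2, cut=5)
# c1 == [1, 1, 0, 1, 0, 1, 1, 0] and c2 == [0, 1, 1, 1, 0, 0, 1, 1]
```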
16
GA Operators (4/5)
• Mutation operator
– Flips a bit with a probability called the mutation rate
– Applied with a very small probability, e.g., 0.001
1 1 0 1 0 0 1 1
1 1 0 1 1 0 1 1
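Bit-flip mutation can be sketched as:

```python
import random

def mutate(bits, mutation_rate=0.001):
    # Flip each bit independently with a very small probability.
    return [bit ^ 1 if random.random() < mutation_rate else bit
            for bit in bits]

genome = [1, 1, 0, 1, 0, 0, 1, 1]
mutant = mutate(genome)   # almost always identical at rate 0.001
```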
17
Example of Genetic Algorithm
18
Hypothesis Space Search
• Compared with other search methods
– Less likely to get stuck in local minima (abrupt moves are possible)
• Crowding
– Similar individuals come to occupy a large fraction of the population
– Reduces diversity
19
Crowding
• Remedies for crowding
– Change the selection method
• Tournament selection, ranking selection
– “Fitness sharing”
• Reduce an individual's fitness when many similar individuals exist
– Restrict which individuals can mate
• Let the most similar individuals mate with each other, forming clusters or multiple subspecies
• Distribute individuals spatially and allow mating only between neighbors
20
Typical behavior of an EA
• Phases in optimizing on a 1-dimensional fitness landscape
Early phase:
quasi-random population distribution
Mid-phase:
population arranged around/on hills
Late phase:
population concentrated on high hills
21
Geometric Analogy - Mathematical Landscape
22
Typical run: progression of fitness
Typical run of an EA shows so-called “anytime behavior”
[Plot: best fitness in population vs. time (number of generations)]
23
[Plot: best fitness in population vs. time (number of generations), annotated with the progress made in the 1st and in the 2nd half of the run]
Are long runs beneficial?
• Answer: it depends
– how much you want the last bit of progress
– it may be better to do more, shorter runs
24
[Plot: performance of methods across the scale of “all” problems, comparing three curves: random search, a special problem-tailored method, and an evolutionary algorithm]
ECs as problem solvers: Goldberg’s 1989 view
25
Advantages of EC
• No presumptions w.r.t. problem space
• Widely applicable
• Low development & application costs
• Easy to incorporate other methods
• Solutions are interpretable (unlike NN)
• Can be run interactively, accommodate user proposed solutions
• Provide many alternative solutions
26
Disadvantages of EC
• No guarantee for optimal solution within finite time
• Weak theoretical basis
• May need parameter tuning
• Often computationally expensive, i.e. slow
Genetic Programming
28
Genetic Programming
• Genetic programming uses variable-size tree-representations rather than fixed-length strings of binary values.
• Program tree
= S-expression
= LISP parse tree
• Tree = Functions (Nonterminals) + Terminals
29
GP Tree: An Example
• Function set: internal nodes
– Functions, predicates, or actions which take one or more arguments
• Terminal set: leaf nodes
– Program constants, actions, or functions which take no arguments
S-expression: (+ 3 (/ (- 5 4) 7))
Terminals = {3, 4, 5, 7}
Functions = {+, -, /}
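A tree like this S-expression can be evaluated with a small recursive interpreter (a sketch; the nested-tuple encoding is an illustrative choice):

```python
import operator

# Function set maps symbols to Python operators.
FUNCTIONS = {"+": operator.add, "-": operator.sub, "/": operator.truediv}

def eval_tree(node):
    # Terminals are plain numbers; function nodes are (symbol, left, right).
    if isinstance(node, tuple):
        symbol, left, right = node
        return FUNCTIONS[symbol](eval_tree(left), eval_tree(right))
    return node

# The slide's S-expression (+ 3 (/ (- 5 4) 7)) as a nested tuple.
tree = ("+", 3, ("/", ("-", 5, 4), 7))
value = eval_tree(tree)   # 3 + (5 - 4) / 7
```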
30
Tree based representation
• Trees are a universal form, e.g. consider
• Arithmetic formula: 2 · π + ((x + 3) - (y / (5 + 1)))
• Logical formula: (x ∧ true) → ((x ∨ y) ∨ (z ↔ (x ∧ y)))
• Program:
i = 1;
while (i < 20)
{
i = i + 1
}
31
Tree based representation
• In GA, ES, EP chromosomes are linear structures (bit strings, integer string, real-valued vectors, permutations)
• Tree shaped chromosomes are non-linear structures.
• In GA, ES, EP the size of the chromosomes is fixed.
• Trees in GP may vary in depth and width.
32
Crossover: Subtree Exchange
[Figure: two parent trees exchange randomly chosen subtrees, producing two offspring trees]
33
Mutation
[Figure: a randomly chosen subtree is replaced by a newly generated random subtree]
Evolution strategies
35
ES quick overview
• Developed: Germany in the 1970’s
• Early names: I. Rechenberg, H.-P. Schwefel
• Typically applied to:
– Numerical optimisation
• Attributed features:
– Fast
– Good optimizer for real-valued optimisation
– Relatively much theory
• Special:
– Self-adaptation of (mutation) parameters is standard
36
ES technical summary
Representation: Real-valued vectors
Recombination: Discrete or intermediary
Mutation: Gaussian perturbation
Parent selection: Uniform random
Survivor selection: (μ,λ) or (μ+λ)
Specialty: Self-adaptation of mutation step sizes
37
Introductory example
• Task: minimise f : Rⁿ → R
• Algorithm: “two-membered ES” using
– Vectors from Rⁿ directly as chromosomes
– Population size 1
– Only mutation creating one child
– Greedy selection
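The two-membered (1+1)-ES described above can be sketched as follows (the sphere function and the fixed step size σ are illustrative choices, not part of the slide):

```python
import random

def two_membered_es(f, x, sigma=0.5, iterations=2000):
    # Population size 1: mutate the single parent with Gaussian noise
    # and keep the child only if it is at least as good (greedy selection).
    for _ in range(iterations):
        child = [xi + random.gauss(0.0, sigma) for xi in x]
        if f(child) <= f(x):
            x = child
    return x

def sphere(v):
    # A standard test function to minimise; the optimum is 0 at the origin.
    return sum(xi * xi for xi in v)

best = two_membered_es(sphere, [5.0, -3.0, 4.0])
```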
38
Parent selection
• Parents are selected by uniform random distribution whenever an operator needs one/some
• Thus: ES parent selection is unbiased -every individual has the same probability to be selected
• Note that in ES “parent” means a population member (in GA’s: a population member selected to undergo variation)
39
Survivor selection
• Applied after creating children from the parents by mutation and recombination
• Deterministically chops off the “bad stuff”
• Basis of selection is either:
– The set of children only: (μ,λ)-selection
– The set of parents and children: (μ+λ)-selection
40
Survivor selection cont’d
• (μ+λ)-selection is an elitist strategy
• (μ,λ)-selection can “forget”
• Often (μ,λ)-selection is preferred for:
– Better in leaving local optima
– Better in following moving optima
– Using the + strategy, bad σ values can survive in ⟨x, σ⟩ too long if their host x is very fit
• Selective pressure in ES is very high (λ ≈ 7 · μ is the common setting)
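The two survivor-selection schemes can be sketched in a few lines (a sketch; individuals are represented here just by their fitness values, with minimisation assumed):

```python
def survivor_select(parents, children, mu, scheme):
    # (μ,λ): keep the μ best children only -- parents are "forgotten".
    # (μ+λ): keep the μ best of parents and children together (elitist).
    pool = children if scheme == "comma" else parents + children
    return sorted(pool)[:mu]   # minimisation: smaller value is better

parents = [1.0, 2.0]
children = [0.5, 3.0, 4.0, 5.0]
comma = survivor_select(parents, children, mu=2, scheme="comma")
plus = survivor_select(parents, children, mu=2, scheme="plus")
# comma == [0.5, 3.0]: the good parent 1.0 is lost ("forgetting")
# plus == [0.5, 1.0]: the good parent survives (elitism)
```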
Evolutionary Programming
42
EP quick overview
• Developed: USA in the 1960’s
• Early names: L. Fogel (traditional EP), later D. Fogel (contemporary EP)
• Typically applied to:
– Traditional EP: machine learning tasks by finite state machines
– Contemporary EP: (numerical) optimization
• Attributed features:
– Very open framework: any representation and mutation operators are OK
– Crossbred with ES (contemporary EP)
– Consequently: hard to say what “standard” EP is
• Special:
– No recombination
– Self-adaptation of parameters is standard (contemporary EP)
43
EP technical summary tableau
Representation: Real-valued vectors
Recombination: None
Mutation: Gaussian perturbation
Parent selection: Deterministic
Survivor selection: Probabilistic (μ+μ)
Specialty: Self-adaptation of mutation step sizes (in meta-EP)
Evolutionary Neural Networks(or Neuro-evolution)
45
ENN
• The back-propagation learning algorithm cannot guarantee an optimal solution.
• In real-world applications, the back-propagation algorithm might converge to a set of sub-optimal weights from which it cannot escape.
• As a result, the neural network is often unable to find a desirable solution to a problem at hand.
46
ENN
• Another difficulty is related to selecting an optimal topology for the neural network.
– The “right” network architecture for a particular problem is often chosen by means of heuristics, and designing a neural network topology is still more art than engineering.
• Genetic algorithms are an effective optimization technique that can guide both weight optimization and topology selection.
47
Encoding a set of weights in a chromosome
[Figure: a feed-forward network with inputs x1-x3 (neurons 1-3), hidden neurons 4-7, and output neuron 8 (y); the weight of each connection is written into a from-neuron × to-neuron matrix, whose non-zero rows are read out to form the chromosome]
Chromosome: 0.9 -0.3 -0.7 -0.8 0.6 0.3 0.1 -0.2 0.2 0.4 0.5 0.8 -0.6 0.1 -0.2 0.9
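Row-by-row flattening of the weight matrices, as on the slide, can be sketched as follows (a sketch; `encode`/`decode` are illustrative helper names, and the 3-input, 4-hidden, 1-output layout is taken from the figure):

```python
import numpy as np

def encode(weight_matrices):
    # Read each weight matrix row by row into one flat chromosome.
    return np.concatenate([w.ravel() for w in weight_matrices])

def decode(chromosome, shapes):
    # Rebuild the weight matrices from the flat chromosome.
    matrices, offset = [], 0
    for rows, cols in shapes:
        n = rows * cols
        matrices.append(chromosome[offset:offset + n].reshape(rows, cols))
        offset += n
    return matrices

# Hidden neurons 4-7, each fed by inputs 1-3 (values from the slide).
w_hidden = np.array([[0.9, -0.3, -0.7],
                     [-0.8, 0.6, 0.3],
                     [0.1, -0.2, 0.2],
                     [0.4, 0.5, 0.8]])
# Output neuron 8, fed by hidden neurons 4-7.
w_out = np.array([[-0.6, 0.1, -0.2, 0.9]])

chromosome = encode([w_hidden, w_out])   # 16 genes, as on the slide
w_h2, w_o2 = decode(chromosome, [(4, 3), (1, 4)])
```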
48
Fitness function
• The second step is to define a fitness function for evaluating the chromosome’s performance.
– This function must estimate the performance of a given neural network.
– A simple choice is the sum of squared errors over the training data.
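Such a fitness function can be sketched as follows (negating the sum of squared errors so that higher fitness is better is an assumed convention; the toy "network" stands in for a decoded chromosome):

```python
def fitness(network, dataset):
    # Sum of squared errors over the data; negated so that a better
    # network gets a higher fitness.
    sse = sum((network(x) - y) ** 2 for x, y in dataset)
    return -sse

# A toy one-input "network" that happens to fit the data perfectly.
net = lambda x: 0.5 * x
data = [(0.0, 0.0), (1.0, 0.5), (2.0, 1.0)]
```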
49
Crossover
[Figure: two parent networks (parent 1 and parent 2) are encoded as weight chromosomes; crossover exchanges a segment of the weight genes, producing a child network that takes some connection weights from each parent]
50
Mutation
[Figure: in the original network, two randomly selected weights (0.4 and -0.3) are replaced (by -0.1 and 0.2), yielding the mutated network]
51
Architecture Selection
• The architecture of the network (i.e. the number of neurons and their interconnections) often determines the success or failure of the application.
• Usually the network architecture is decided by trial and error; there is a great need for a method of automatically designing the architecture for a particular application.
– Genetic algorithms may well be suited for this task.
52
Encoding
[Figure: a network of six neurons (inputs x1, x2; hidden neurons 3-5; output neuron 6, y); its topology is encoded as a 6 × 6 binary from-neuron/to-neuron connectivity matrix, read out row by row]
0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 1 1 1 0
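Decoding such a chromosome back into a connectivity matrix can be sketched as follows (a sketch assuming the matrix rows are listed in order, as on the slide; `decode_architecture` is an illustrative helper name):

```python
def decode_architecture(chromosome, n_neurons):
    # Cut the flat chromosome into the rows of a from/to connectivity
    # matrix: entry (i, j) == 1 means neuron j+1 feeds neuron i+1.
    return [chromosome[i * n_neurons:(i + 1) * n_neurons]
            for i in range(n_neurons)]

# The chromosome from the slide (6 neurons -> 36 bits).
chromosome = [0, 0, 0, 0, 0, 0,
              0, 0, 0, 0, 0, 0,
              1, 1, 0, 0, 0, 0,
              1, 0, 0, 0, 0, 0,
              0, 1, 0, 0, 0, 0,
              0, 1, 1, 1, 1, 0]
matrix = decode_architecture(chromosome, 6)
# Row 3 says neuron 3 is fed by neurons 1 and 2; the last row says
# the output neuron 6 is fed by neurons 2-5.
```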
53
Process
[Figure: in generation i, each neural network j is trained on a small numeric training data set and assigned a fitness (e.g., fitness = 117); parents 1 and 2 are selected, crossover and mutation produce children 1 and 2, which enter generation i + 1]
Evolutionary DNN
55
Good reference blog
• https://medium.com/@stathis/design-by-evolution-393e41863f98
56
Evolving Deep Neural Networks
• https://arxiv.org/pdf/1703.00548.pdf
• CoDeepNEAT
– for optimizing deep learning architectures through evolution
– Evolving DNNs for CIFAR-10
– Evolving LSTM architectures
– The experimental comparison is not so clear, though
57
Large-Scale Evolution of Image Classifiers
• https://arxiv.org/abs/1703.01041
• Individual
– a trained architecture
• Fitness
– Individual’s accuracy on a validation set
• Selection (tournament selection)
– Randomly choose two individuals
– Select better one (parent)
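The tournament step described above can be sketched as follows (a sketch; the bitstring individuals and `sum` fitness are toy stand-ins for trained architectures and their validation accuracy):

```python
import random

def binary_tournament(population, fitness):
    # Randomly choose two individuals; the fitter one becomes a parent.
    a, b = random.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b

population = [[1, 0, 1], [0, 0, 0], [1, 1, 1]]
parent = binary_tournament(population, fitness=sum)
# The all-zero individual can never win a tournament here.
```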
58
Large-Scale Evolution of Image Classifiers
• Mutation
– Pick a mutation from a predetermined set
• Train child
• Repeat.
59
Large-Scale Evolution of Image Classifiers
60
Convolution by Evolution
• https://arxiv.org/pdf/1606.02580.pdf
• GECCO16 paper
• Differentiable version of the Compositional Pattern Producing Network (DPPN)
– Topology is evolved but the weights are learned
– Compressed the weights of a denoising autoencoder from 157,684 to roughly 200 parameters with comparable image reconstruction accuracy
61
62
@likesky3