Designing architectures by hand is hard
The manual loop: change architecture → run experiments on the architecture → analyze results (and bugs, training details, …)
McCulloch-Pitts Neuron: 1943
LSTM: 1997
Search architectures automatically
• speed up architecture search enormously
• remove the human prior
• perhaps reveal what makes a good architecture
[Diagram: the same loop, automated. A Controller changes the architecture, GPUs are booted up to run experiments on it, and the measured Performance flows back to the Controller as a Reward.]
Baker et al. 2016, Zoph and Le 2017
Recurrent Neural Networks (RNN)
[Diagram: an RNN cell consuming input $x_t$ and producing hidden state $h_t$.]
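For reference, the standard (Elman) update, a well-known formulation not spelled out on the slide:

$h_t = \tanh(W x_t + U h_{t-1} + b)$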
Recurrent Neural Networks (RNN)
Commonly used: Long Short-Term Memory (LSTM)
[Diagram: an LSTM cell with memory cell $c_t$, input $x_t$, hidden state $h_t$, and the previous step's $x_{t-1}$ and $h_{t-1}$.]
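For reference, the standard LSTM equations (Hochreiter & Schmidhuber, 1997; biases omitted):

$i_t = \sigma(W_i x_t + U_i h_{t-1}), \quad f_t = \sigma(W_f x_t + U_f h_{t-1}), \quad o_t = \sigma(W_o x_t + U_o h_{t-1})$
$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1}), \quad h_t = o_t \odot \tanh(c_t)$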
Outline
1. Flexible language (DSL) to define architectures
2. Components: Ranking Function & Reinforcement Learning Generator
3. Experiments: Language Modeling & Machine Translation
Domain Specific Language (DSL), or how to define an architecture
Zoph and Le 2017
Domain Specific Language (DSL), or how to define an architecture
$\mathrm{Tanh}(\mathrm{Add}(\mathrm{MM}(x_t), \mathrm{MM}(h_{t-1})))$
Core
• Variables: $x_t$, $x_{t-1}$, $h_{t-1}$
• MM
• Sigmoid, Tanh, ReLU
• Add, Mult
• $\mathrm{Gate3}(x, y, f) = \sigma(f) \odot x + (1 - \sigma(f)) \odot y$ (see the code sketch after these lists)
• Memory cell $c_t$

Expanded
• Sub, Div
• Sin, Cos, PosEnc
• LayerNorm
• SeLU
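As an illustration (not the authors' code; the `gate3` helper and the nested-tuple encoding are assumptions for this sketch), an architecture in the DSL is just an operator tree, and `Gate3` follows directly from its definition above:

```python
import torch

def gate3(x, y, f):
    # Gate3(x, y, f) = sigmoid(f) * x + (1 - sigmoid(f)) * y
    g = torch.sigmoid(f)
    return g * x + (1 - g) * y

# The example architecture Tanh(Add(MM(x_t), MM(h_{t-1})))
# encoded as a nested (operator, children...) tree:
arch = ("Tanh", ("Add", ("MM", "x_t"), ("MM", "h_t-1")))
```

The same tree encoding is consumed by the compilation sketch in the backup slides.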
Domain Specific Language (DSL), or how to define an architecture
Instantiable Framework
Architecture Generator
given the current architecture, output the next operator
1. Random
2. REINFORCE
Reinforcement Learning Generator
[Diagram: the agent-environment loop of reinforcement learning. The agent (the generator) emits an action, e.g. the operator ReLU; the environment returns an observation and a reward, e.g. Performance: 42.]
Ranking Function
Goal: predict performance of an architecture
Train with architecture-performance pairs
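A minimal sketch of such a ranking function, assuming the architecture is serialized into a sequence of operator IDs and the network is trained as a regressor with a mean-squared-error loss (all names and sizes here are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RankingFunction(nn.Module):
    """Regress from an architecture's operator sequence to its performance."""
    def __init__(self, num_ops, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(num_ops, hidden)
        self.rnn = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, op_ids):                  # op_ids: (batch, seq_len)
        states, _ = self.rnn(self.embed(op_ids))
        return self.head(states[:, -1]).squeeze(-1)

# One training step on (architecture, measured performance) pairs:
# loss = F.mse_loss(model(op_ids), measured_performance)
```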
Language Modeling
$P(w_i \mid w_1, w_2, \ldots, w_{i-1})$
“Why did the chicken cross the ___”
Performance measurement: perplexity
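Perplexity is the exponentiated average negative log-likelihood of the corpus under the model; lower is better:

$\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log P(w_i \mid w_1, \ldots, w_{i-1})\right)$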
Language Modeling (LM) with Random Search + Ranking Function
LM with Ranking Function: selected architectures improve
The BC3 cell
Weight matrices $W, U, V, X \in \mathbb{R}^{H \times H}$
LM with Ranking Function: improvement over many human architectures
Machine Translation
Test evaluation: BLEU score
[Diagram: encoder-decoder translation model. The source sentence “He loved to eat .” is embedded and encoded; the decoder, fed the previous target tokens (NULL, Er, …), produces the translation “Er liebte …” through a softmax layer.]
Machine Translation (MT) with Reinforcement Learning Generator (RL)
• Generator = 3-layer NN (linear-LSTM-linear) outputting action scores (see the sketch after this list)
• Choose an action by multinomial sampling with an epsilon-greedy strategy ($\epsilon = 0.05$)
• Train the generator on soft priors first (use activations, …)
• Small dataset, so an architecture can be evaluated in ~2 hours
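A minimal sketch of that generator and the epsilon-greedy choice (the hidden size and the one-hot encoding of the partial architecture are assumptions):

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """3-layer generator: linear -> LSTM -> linear, one score per operator."""
    def __init__(self, num_ops, hidden=64):
        super().__init__()
        self.inp = nn.Linear(num_ops, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, num_ops)

    def forward(self, partial_arch):      # (1, steps, num_ops) one-hot history
        states, _ = self.lstm(self.inp(partial_arch))
        return self.out(states[:, -1]).squeeze(0)  # scores for the next operator

def pick_action(scores, eps=0.05):
    """Multinomial sampling over scores; random action with probability eps."""
    if torch.rand(()).item() < eps:
        return torch.randint(scores.numel(), ()).item()
    return torch.multinomial(torch.softmax(scores, dim=-1), 1).item()
```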
MT with RL: re-scale the loss to reward great architectures more
[Plot: reward as a function of loss, with loss running from ∞ down to 0; the re-scaling assigns disproportionately more reward as the loss approaches 0.]
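One plausible re-scaling, purely illustrative (the exact function is not given on the slide): squash the unbounded loss into a bounded reward whose slope is steepest near zero loss:

```python
import math

def loss_to_reward(loss, scale=1.0):
    # Maps loss in (0, inf) to reward in (0, 1]. The exponential is steepest
    # near loss = 0, so near-great architectures gain disproportionately
    # more reward per unit of loss improvement. Illustrative only.
    return math.exp(-loss / scale)
```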
MT with RL: switch between exploration and exploitation
[Plot: log(performance) over epochs.]
MT with RL: good architectures found
MT with RL: many good architectures found
[Histogram: number of architectures per perplexity bin.]
MT with RL: rediscovery of human architectures
• $\mathrm{Add}(\mathrm{Transformation}(x_t), x_t)$: a variant of residual networks (He et al., 2016)
• $\mathrm{Gate3}(\mathrm{Transformation}(x_t), x_t, \mathrm{Sigmoid}(\ldots))$: highway networks (Srivastava et al., 2015)
• Motifs found in multiple cells
MT with RL: novel operators only used after “it clicked”
[Plot: usage of novel operators over epochs.]
MT with RL: novel operators contribute to successful architectures
Related work
• Hyper-parameter search: Bergstra et al. 2011, Snoek et al. 2012
• Neuroevolution: Stanley et al. 2009, Bayer et al. 2009, Fernando et al. 2016,
Liu et al. 2017 (← also random search)
• RL search: Baker et al. 2016, Zoph and Le 2017
• Subgraph selection: Pham, Guan et al. 2018
• Weight prediction: Ha et al. 2016, Brock et al. 2018
• Optimizer search: Bello et al. 2017
Discussion
• Removes the need for expert knowledge, to a degree
• Cost of running these experiments:
  • us: 5 days on 28 GPUs (best architecture after 40 hours)
  • Zoph and Le 2017: 4 days using 450 GPUs
• Hard to analyze the diversity of architectures (much more quantitative than qualitative)
• Defining the search space is difficult
• We’re using a highly complex system to find other highly complex systems in a highly complex space
Contributions
1. Flexible language (DSL) to define architectures
2. Ranking Function (Language Modeling) and Reinforcement Learning Generator (Machine Translation)
3. Explore uncommon operators

Future Work
• Search for architectures that correspond to biology
• Allow for a more flexible search space
• Find architectures that do well on multiple tasks
Backup
Compilation: DSL → Model
• The DSL is basically executable
• Traverse the tree from the source nodes towards the final node $h_t$
• Produce code: initialization and forward call (see the sketch below)
• Collect all matrix multiplications on a single source node and batch them
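A minimal sketch of such a compiler (not the authors' implementation; the batching of matrix multiplications is omitted). It walks the operator tree from the earlier DSL example and emits the expression for the forward call, collecting one weight matrix per MM node for the initialization code:

```python
OP_FN = {"Tanh": "torch.tanh", "Sigmoid": "torch.sigmoid", "ReLU": "torch.relu",
         "Add": "torch.add", "Mult": "torch.mul", "Gate3": "gate3"}

def compile_node(node, weights):
    """Recursively emit the forward-pass expression for one DSL node."""
    if isinstance(node, str):                 # source node: "x_t", "h_t-1", ...
        return node.replace("-", "_")
    op, *children = node
    args = ", ".join(compile_node(c, weights) for c in children)
    if op == "MM":                            # fresh weight matrix per MM node
        name = f"W{len(weights)}"
        weights.append(name)                  # collected for initialization code
        return f"({args} @ self.{name})"
    return f"{OP_FN[op]}({args})"

# compile_node(("Tanh", ("Add", ("MM", "x_t"), ("MM", "h_t-1"))), [])
# -> "torch.tanh(torch.add((x_t @ self.W0), (h_t_1 @ self.W1)))"
```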
Restrictions on generated architectures (a check is sketched below)
• Gate3(…, …, Sigmoid(…))
• Have to use $x_t$, $h_{t-1}$
• Maximum 21 nodes, depth 8
• Prevent stacking two identical operations:
  • MM(MM(x)) is mathematically identical to MM(x)
  • Sigmoid(Sigmoid(x)) is unlikely to be useful
  • ReLU(ReLU(x)) is redundant
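A minimal sketch of these checks over the same tree encoding (limits taken from the slide; the set of operations that must not be stacked is the one listed above):

```python
MAX_DEPTH, MAX_NODES = 8, 21
NO_STACK = {"MM", "Sigmoid", "ReLU"}        # ops that are redundant when stacked

def count_nodes(node):
    return 1 if isinstance(node, str) else 1 + sum(count_nodes(c) for c in node[1:])

def is_valid(node, depth=1, parent_op=None):
    if depth > MAX_DEPTH:
        return False
    if isinstance(node, str):                # source node: always a leaf
        return True
    op, *children = node
    if op == parent_op and op in NO_STACK:   # e.g. MM(MM(x)) == MM(x)
        return False
    return all(is_valid(c, depth + 1, op) for c in children)

# An architecture is accepted iff is_valid(arch) and count_nodes(arch) <= MAX_NODES
```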
How to define a proper search space?
• Too small: nothing radically novel will be found
• Too big: you need Google-scale computing resources
• The baseline experiment parameters restrict which architectures can succeed
MT with RL: learned encoding very different
MT with RL: parent-child operator preference