© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Nikko Strom, Sr. Principal Scientist
Arpit Gupta, Scientist
November 30, 2016
Deep Learning in Alexa (MAC202)
Outline
• History of Deep Learning
• Deep Learning in Alexa
• The Alexa Skills Kit
History of Deep Learning
Eras: intense academic activity → "neural winter" → the "GPU era"
Timeline: 1986, 1998, 2007, 2014, 2016
1986: Hinton, Rumelhart, and Williams invent backpropagation training
2014: Amazon Echo launches!
Multilayer perceptron
input x → "input layer" → "hidden layer" → "hidden layer" → "output layer" → output y
h1 = sigmoid(A1·x + b1)
h2 = sigmoid(A2·h1 + b2)
y = sigmoid(Ao·h2 + bo)
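The three equations above can be run directly; here is a minimal NumPy sketch of the forward pass, with made-up layer sizes and random weights purely for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, A1, b1, A2, b2, Ao, bo):
    h1 = sigmoid(A1 @ x + b1)     # first hidden layer
    h2 = sigmoid(A2 @ h1 + b2)    # second hidden layer
    return sigmoid(Ao @ h2 + bo)  # output layer

rng = np.random.default_rng(0)
x = rng.standard_normal(4)                                    # input
A1, b1 = rng.standard_normal((8, 4)), rng.standard_normal(8)
A2, b2 = rng.standard_normal((8, 8)), rng.standard_normal(8)
Ao, bo = rng.standard_normal((3, 8)), rng.standard_normal(3)
y = mlp_forward(x, A1, b1, A2, b2, Ao, bo)                    # shape (3,)
```

Because every activation is a sigmoid, each output element lands strictly between 0 and 1.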
Mohamed, Dahl, and Hinton beat a well-known speech recognition benchmark (TIMIT)
Deep Learning milestones
Timeline: 1986, 1997, 1998, 2002, 2009, 2010, 2011, 2016 (with the "neural winter" in between)
• Hinton, Rumelhart, and Williams invent backpropagation training.
• Hochreiter and Schmidhuber (1997) invent the LSTM for recurrent networks with long memory.
• LeCun, Bottou, Bengio, and Haffner publish CNNs for computer vision.
• Salakhutdinov and Hinton discover a method to train very deep neural networks.
• Microsoft and Google demonstrate breakthrough results on large-vocabulary speech recognition.
• Krizhevsky, Sutskever, and Hinton win the ImageNet object recognition challenge.
• AlphaGo beats a Go world champion (2016).
Deep Learning in Speech Recognition
Timeline: 1986, '89, '91, '92, '96, 1998, 2002, 2009, 2010, 2011, 2016
• Waibel, Hanazawa, Hinton, Shikano, and Lang publish the time-delay neural network (TDNN).
• Robinson demonstrates RNNs for ASR and gets the best result on TIMIT so far.
• Bourlard, Morgan, Wooters, and Renals introduce context-dependent MLP models.
• Strom combines time-delay NNs and RNNs (RTDNN).
• Strom introduces speaker vectors for speaker adaptation.
• Mohamed, Dahl, and Hinton beat a well-known speech recognition benchmark (TIMIT).
• Microsoft and Google demonstrate breakthrough results on large-vocabulary speech recognition.
Impact of data corpus size
16 years = 140,160 hours; of that, ≈14,016 hours (10%) is speech
Impact of data corpus size
[Timeline chart, 1986–2016]
Impact of compute capacity
Timeline: 1986, 1998, 2007, 2016 (with the "neural winter" in between)
• Cray X-MP/48 (1986): 1 GFLOPS
• Sun Ultra 60: 1 GFLOPS
• ASCI Red: 1 TFLOPS
• 8800 GTX (2007): 350 GFLOPS
• Roadrunner: 1 PFLOPS
• cg1.4xlarge: 1 TFLOPS
• p2.16xlarge: 23 TFLOPS (70 TFLOPS single precision)
• Sunway TaihuLight: 100 PFLOPS
Impact of compute infrastructure
Timeline: 1986, 1998, 2007, 2012, 2016
Reign of EM
• During the "neural winter," EM became the dominant distributed computing paradigm for machine learning (ML)
• ML algorithms that use the EM algorithm benefited greatly
• Distributed SGD broke deep learning out of the single box
Distributed SGD: Strom (2015); Dean et al.
Conclusion – how we got here
We are in a period of massive Deep Learning adoption because:
• Theory and algorithm design in the 80s and 90s
• Orders of magnitude more data available
• Orders of magnitude more computational capacity
• A few algorithmic inventions enabled deep networks
• The rise of distributed SGD training
Deep Learning in Alexa
Large-scale distributed training
• Up to 80 EC2 g2.2xlarge GPU instances working in sync to train a model
• Thousands of hours of speech training data stored in Amazon S3
Large-scale distributed training
• GPUs compute model updates fast – think updates per second
• A model update is hundreds of MB
• All nodes must communicate updates to the model to all other nodes
[Chart: DNN training speed – frames per second (0–600,000) vs. number of GPU workers (0–80)]
Strom, Nikko. "Scalable Distributed DNN Training Using Commodity GPU Cloud Computing." INTERSPEECH, 2015.
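The data-parallel training above can be sketched as synchronous gradient averaging across workers; note this is a simplified toy, while the cited paper additionally compresses communication by transmitting only gradient elements above a threshold.

```python
import numpy as np

def distributed_sgd_step(weights, worker_grads, lr=0.1):
    """One synchronous data-parallel step: average the gradients
    computed by all workers on their data shards, then apply a
    single update to the shared model."""
    avg_grad = np.mean(worker_grads, axis=0)
    return weights - lr * avg_grad

w = np.zeros(5)
worker_grads = [np.full(5, k) for k in (1.0, 2.0, 3.0, 4.0)]  # 4 workers
w = distributed_sgd_step(w, worker_grads)
# every element is now -0.1 * mean(1, 2, 3, 4) = -0.25
```

In a real cluster the averaging step is the all-to-all communication the earlier slide warns about: with updates of hundreds of MB per step, bandwidth, not compute, becomes the bottleneck.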
Speech Recognition
Sound → Signal processing → Feature vectors [4.7, 2.3, -1.4, …]
→ Acoustic model → Phonetic probabilities [0.1, 0.1, 0.4, …]
→ Decoder (inference) → Words: "increase to 70 degrees"
→ Post processing → Text: "Increase to 70°"
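The four pipeline stages can be sketched as chained functions. The stage names follow the slide, but the stub bodies below are hypothetical placeholders, not real components.

```python
# Hypothetical stage stubs illustrating the recognition pipeline;
# a real recognizer replaces each stub with a trained component.

def signal_processing(audio):
    """Sound -> feature vectors, e.g. [4.7, 2.3, -1.4, ...]."""
    return [[4.7, 2.3, -1.4]]

def acoustic_model(features):
    """Feature vectors -> phonetic probabilities, e.g. [0.1, 0.1, 0.4, ...]."""
    return [[0.1, 0.1, 0.4]]

def decoder(phonetic_probs):
    """Phonetic probabilities -> most likely word sequence (inference)."""
    return "increase to 70 degrees"

def post_processing(words):
    """Words -> display text."""
    return "Increase to 70°"

def recognize(audio):
    return post_processing(decoder(acoustic_model(signal_processing(audio))))

print(recognize(b"\x00\x01"))  # -> Increase to 70°
```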
Speech recognition
Transfer learning from English to German
Hidden layer 1 → Hidden layer 2 → … → Last hidden layer → Output layer
Output layer phonemes – English: æ I ɑ ɜ ʊ … e; German: æ I ɑ ɜ u: … œ
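A minimal sketch of the transfer idea, with made-up layer sizes: the hidden layers trained on English are kept as-is, and only the output layer is swapped for one sized to the German phoneme set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hidden layers trained on English data are reused unchanged...
english_hidden = [rng.standard_normal((512, 512)) * 0.05 for _ in range(3)]
# ...and only the output layer is replaced, sized for German phonemes.
german_output = rng.standard_normal((60, 512)) * 0.05

def forward(x, hidden_layers, output_layer):
    h = x
    for W in hidden_layers:
        h = np.tanh(W @ h)       # shared hidden representation
    return output_layer @ h      # language-specific phoneme scores

x = rng.standard_normal(512)
scores = forward(x, english_hidden, german_output)  # shape (60,)
```

The design bet is that the hidden layers learn a largely language-independent representation of speech, so only the final mapping to phoneme labels must be retrained.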
Natural Language Understanding
Intent and entities: "play two steps behind by def leppard"
Intent: PlayMusic
Entities: Song ("two steps behind"), Artist ("def leppard")
Two problems:
1. Words are symbols – not vectors of numbers
2. Requests are of different lengths
Recurrent Neural Networks
A recurrent network consumes the request word by word: play two steps behind by def leppard
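A toy sketch of how both problems are handled: an embedding table turns word symbols into vectors (problem 1), and a simple recurrent step folds a request of any length into one fixed-size vector (problem 2). All names and sizes here are illustrative, not Alexa's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"play": 0, "two": 1, "steps": 2, "behind": 3,
         "by": 4, "def": 5, "leppard": 6}
E = rng.standard_normal((len(vocab), 16))   # problem 1: symbols -> vectors
Wx = rng.standard_normal((32, 16))          # input-to-hidden weights
Wh = rng.standard_normal((32, 32))          # hidden-to-hidden weights

def encode(words):
    """Problem 2: fold an any-length request into one fixed-size state."""
    h = np.zeros(32)
    for w in words:
        h = np.tanh(Wx @ E[vocab[w]] + Wh @ h)  # one recurrent step
    return h

h = encode("play two steps behind by def leppard".split())
print(h.shape)  # (32,) regardless of request length
```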
Speech synthesis
Pipeline: Text → Text normalization → Grapheme-to-phoneme conversion → Waveform generation → Speech
Example:
"She has 20$ in her pocket."
→ "she has twenty dollars in her pocket"
→ ˈʃi ˈhæz ˈtwɛn.ti ˈdɑ.ɫəɹz ˈɪn ˈhɝɹ ˈpɑ.kət
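The three synthesis stages can be sketched as chained functions; the stubs below (a tiny hand-written lexicon, a trivial string-replace normalizer) are hypothetical stand-ins for the trained components.

```python
# Hypothetical stubs for the three synthesis pipeline stages.

def normalize_text(text):
    """Text normalization: '20$' -> 'twenty dollars', case, punctuation."""
    return text.lower().replace("20$", "twenty dollars").rstrip(".")

def grapheme_to_phoneme(text):
    """Grapheme-to-phoneme conversion via a (toy) pronunciation lexicon."""
    lexicon = {"she": "ʃi", "has": "hæz", "twenty": "twɛn.ti",
               "dollars": "dɑ.ɫəɹz", "in": "ɪn", "her": "hɝɹ",
               "pocket": "pɑ.kət"}
    return " ".join(lexicon.get(w, w) for w in text.split())

def generate_waveform(phonemes):
    """Waveform generation (stubbed)."""
    return b"<pcm audio bytes>"

text = "She has 20$ in her pocket."
phonemes = grapheme_to_phoneme(normalize_text(text))
print(phonemes)  # ʃi hæz twɛn.ti dɑ.ɫəɹz ɪn hɝɹ pɑ.kət
```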
Concatenative synthesis
Input: ˈʃi ˈhæz ˈtwɛn.ti ˈdɑ.ɫəɹz ˈɪn ˈhɝɹ ˈpɑ.kət
→ Di-phone unit selection (backed by a di-phone segment database) → Speech
Prosody for natural-sounding reading
Inputs per segment:
• Phonetic features
• Linguistic features
• Semantic word vectors
A bi-directional recurrent network predicts pitch, duration, and intensity targets for each segment.
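A toy bi-directional recurrent pass over segment features, predicting the three prosody targets named above; the layer sizes and random weights are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
F, H = 8, 16                             # features per segment; hidden size per direction
Wf, Uf = rng.standard_normal((H, F)), rng.standard_normal((H, H))  # forward RNN
Wb, Ub = rng.standard_normal((H, F)), rng.standard_normal((H, H))  # backward RNN
Wo = rng.standard_normal((3, 2 * H))     # -> (pitch, duration, intensity)

def prosody_targets(segments):
    T = len(segments)
    fwd, bwd = [None] * T, [None] * T
    h = np.zeros(H)
    for t in range(T):                   # left-to-right pass
        h = np.tanh(Wf @ segments[t] + Uf @ h)
        fwd[t] = h
    h = np.zeros(H)
    for t in reversed(range(T)):         # right-to-left pass
        h = np.tanh(Wb @ segments[t] + Ub @ h)
        bwd[t] = h
    # Concatenate both directions, then predict the three targets per segment.
    return [Wo @ np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

segments = [rng.standard_normal(F) for _ in range(5)]
targets = prosody_targets(segments)      # 5 segments, 3 targets each
```

Running both directions lets each segment's targets depend on context to its left and right, which matters for natural phrasing.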
Long-form example
“Over a lunch of diet cokes and lobster salad one
balmy fall day in Boston, Joseph Martin, the
genial, white-haired, former dean of Harvard
medical school, told me how many hours of pain
education Harvard med students get during four
years of medical school.”
Before / After (audio samples)
The Alexa Skills Kit
The Alexa Skills Kit
Customers ↔ Alexa ↔ Developers
Growth of Published Skills
[Chart: number of published skills, March–September 2016; y-axis 0–4,000]
Alexa Skills: Examples
Business: Uber, Dominos, Fidelity, Capital One, Home Advisor, 1-800 Flowers
Info: Washington Post, Campbell's Kitchen, Boston Children's Hospital, Stocks, Bitcoin Price, History Buff, Savvy Consumer
Fitness: Fitbit, 7-Minute Workout
Automation: Nest, Garageio, Alarm.com, Scout Alarm
Misc: Quick Events, Phone Finder, Cat Facts, Famous Quotes
Games: Jeopardy!, Minesweeper, Word Master, Blackjack, Math Puzzles, Guess Number, Spelling Bee
ASK for Developers
Customers ↔ Alexa ↔ Developers
ASK for Developers
• Define a Voice User Interface
• Provide a finite number of sample utterances
• ASK automatically builds and deploys machine learning models from this developer input
Model Build Workflow
• The developer creates/edits a skill via the Developer Portal website
• The portal writes the skill definition (skill.json) to a data store
• The Skill Model Builder reads skill.json, builds the skill models, and uploads them to the runtime cloud store
Model Building
Developer input feeds two models:
• Finite-state transducers (FSTs) – exact match
• ML entity recognizer and ML intent recognizer – fuzzy match
We build two models: FSTs are for exact matches, machine learning models for fuzzy matches.
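A toy illustration of the two-path setup: an exact-match table stands in for the FSTs, with a trivial rule standing in for the ML intent/entity recognizers that handle everything the table misses. Names and behavior are illustrative, not the real ASK implementation.

```python
# Exact-match table standing in for the compiled FSTs.
exact_matches = {
    "get me a car": ("GetCarIntent", {}),
    "get a car to starbucks": ("GetCarIntent", {"Destination": "starbucks"}),
}

def ml_recognizer(utterance):
    """Stand-in for the trained intent/entity recognizers (fuzzy match)."""
    if "car" in utterance:
        return ("GetCarIntent", {})
    return ("Unknown", {})

def recognize(utterance):
    if utterance in exact_matches:       # FST path: exact match
        return exact_matches[utterance]
    return ml_recognizer(utterance)      # ML path: fuzzy match

print(recognize("hey uhm i need a car to starbucks"))  # handled by the ML path
```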
ASK Machine Learning
Training (from developers): a finite number of sample utterances, e.g.
  get a car to <Destination>
  get me a car
  …
Runtime (from customers): an infinite number of possible utterances, e.g.
  hey uhm i need a car to starbucks
The ASK machine learning model is trained on the samples and matches live utterances against them.
ASK Machine Learning (contd.)
• Neural Networks (NNs)
• Transfer Learning:
  • Use knowledge learned from large, related training data
  • Example: We've seen slots like <Destination> before – no need to learn from scratch.
    get a car to <Destination>
    get me a car
    …
How to Write Great Skills
Slots
• Catalogs: provide as many values as possible; add representative values of different lengths where appropriate
• Use built-in slots where possible (e.g., cities, states, first names)
• Do not use too many slots in one utterance (rather, ask for missing slots in a dialog)
• Use context around each slot
How to Write Great Skills
Intents
• Split heterogeneous intents
• Use built-in intents where possible
• Provide as many carrier phrases as possible
• Use a thesaurus or paraphrasing tools; ask your friends or Mechanical Turk for utterances
Conclusions
• ASK connects developers to customers
• Developers constantly extend Alexa's capabilities
• We constantly get more data and improve the experience via machine learning
• Making Alexa more intelligent and powerful, bridging the gap between human and machine
Thank you!
Remember to complete your evaluations!
Related Sessions
Images used
• Glove vectors. Produced internally.
• Macaw. Public domain. https://pixabay.com/en/macaw-bird-beak-parrot-650638/
• VW. Free for editorial use. http://media.vw.com/images/category/11/
• ASCI Red. Public domain. https://commons.wikimedia.org/wiki/File:Asci_red_-_tflop4m.jpeg
• 8800 GTX. Permission by email from Tri Hyunth at Nvidia.
• https://commons.wikimedia.org/wiki/File:President_Ronald_Reagan_addresses_Congress_in_1981.jpg
• https://commons.wikimedia.org/wiki/File:President_George_W._Bush_(8003096992).jpg
• https://commons.wikimedia.org/wiki/File:President_Obama_interview_January_27,_2009.jpg
• https://commons.wikimedia.org/wiki/File:US_Navy_020828-N-1058W-025_Former_U.S._President_George_H._W._Bush_congratulates_Sailor_aboard_USS_Harry_S._Truman_(CVN_75).jpg
• https://commons.wikimedia.org/wiki/File:President_Clinton_speaks_on_tax_cut_deal.jpg