
AWS re:Invent 2016: Bringing Deep Learning to the Cloud with Amazon EC2 (CMP314)


© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Tom 'Elvis' Jones, Partner Solutions Architect, Amazon Web Services

Diego Oppenheimer, CEO, Algorithmia

December 1, 2016

Bringing Deep Learning to the Cloud

with Amazon EC2

CMP314

AWS Pace of Innovation

[Chart: new features and services launched per year. 2009: 48, 2011: 82, 2013: 280, 2015: 722]

AWS has been continually expanding its services to support virtually any cloud workload, and now has more than 70 services that range across compute, storage, networking, databases, analytics, application services, deployment, management, and mobile. AWS has launched a total of 706 new features and/or services year to date*, for a total of 2,601 new features and/or services since inception in 2006.

* As of 1 October 2016

Microservices

Serverless

Lambda

Pay as you go

Scalability

Machine learning

Machine learning is the technology that automatically finds patterns in your data and uses them to make predictions for new data points as they become available.

Your data + machine learning = smart applications
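To make that equation concrete, here is a minimal, illustrative sketch (not from the talk) using scikit-learn: fit a model on existing data, then predict for a new data point as it arrives. The tiny dataset and model choice are made-up assumptions.

from sklearn.linear_model import LogisticRegression

# Historical data: feature vectors and the outcomes observed for them.
X_train = [[0.2, 1.1], [0.9, 0.4], [1.5, 1.3], [0.1, 0.2]]
y_train = [0, 1, 1, 0]

model = LogisticRegression()
model.fit(X_train, y_train)        # find patterns in existing data

# Make a prediction for a new data point as it becomes available.
print(model.predict([[1.0, 0.9]]))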

New P2 GPU Instance Types

• New EC2 GPU instance type for accelerated computing
• Offers up to 16 NVIDIA K80 GPUs (8 K80 cards) in a single instance
• The 16xlarge size provides:
  • A combined 192 GB of GPU memory, 40 thousand CUDA cores
  • 70 teraflops of single-precision floating point performance
  • Over 23 teraflops of double-precision floating point performance
• Example workloads include:
  • Deep learning, computational fluid dynamics, computational finance, seismic analysis, molecular modeling, genomics, VR content rendering, accelerated databases

P2 Instance Types

Three P2 instance sizes:

Instance Size   GPUs   GPU Peer-to-Peer   vCPUs   Memory (GiB)   Network Bandwidth*
p2.xlarge       1      -                  4       61             1.25 Gbps
p2.8xlarge      8      Y                  32      488            10 Gbps
p2.16xlarge     16     Y                  64      732            20 Gbps

*In a placement group
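As a sketch of how you might launch one of these sizes programmatically with boto3 (the AMI ID, key pair name, and placement group below are placeholders, not values from the talk):

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-xxxxxxxx",          # placeholder, e.g., a Deep Learning AMI
    InstanceType="p2.16xlarge",      # 16 K80 GPUs, 64 vCPUs, 732 GiB
    MinCount=1,
    MaxCount=1,
    KeyName="my-key-pair",           # placeholder
    # Placement group needed to reach the 20 Gbps figure above.
    Placement={"GroupName": "my-placement-group"},
)
print(response["Instances"][0]["InstanceId"])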

P2 Instance Summary

High-performance GPU instance types, with many innovations:

• Based on NVIDIA K80 GPUs, with up to 16 GPUs in a single instance
• With dedicated peer-to-peer connections supporting GPUDirect
• Intel Broadwell processors with up to 64 vCPUs and up to 732 GiB RAM
• 20 Gbps network on EC2, using the Elastic Network Adapter (ENA)
• Supporting a wide variety of ISV applications and open-source frameworks

Diego Oppenheimer – CEO, Algorithmia

• Product developer, entrepreneur, extensive background in all things data
• Microsoft: PowerPivot, Power BI, Excel, and SQL Server
• Founder of an algorithmic trading startup
• BS/MS, Carnegie Mellon University

Make state-of-the-art algorithms accessible and discoverable by everyone.

A marketplace for algorithms...

We host algorithms
• Anyone can turn their algorithms into scalable web services
• Typical users: scientists, academics, domain experts

We make them discoverable
• Anyone can use and integrate these algorithms into their solutions
• Typical users: businesses, data scientists, app developers, IoT makers

We make them monetizable
• Users of algorithms pay for the algorithms they use
• Typical scenarios: heavy-load use cases with a large user base

Sample algorithms (2,600+ and growing daily)

● Text analysis: summarizer, sentence tagger, profanity detection
● Machine learning: digit recognizer, recommendation engines
● Web: crawler, scraper, PageRank, emailer, HTML to text
● Computer vision: image similarity, face detection, smile detection
● Audio & video: speech recognition, sound filters, file conversions
● Computation: linear regression, spike detection, Fourier filter
● Graph: traveling salesman, maze generator, theta star
● Utilities: parallel for-each, geographic distance, email validator

Machine Intelligence Stack

Applications

Scalable CPU/GPU compute

Algorithms as microservices

Data stores

Scale?

• Support the workloads of 32,000 developers
• Mixed CPU/GPU use
• Spiky traffic
• Heterogeneous hardware

Hosting Deep Learning

Cloud hosting of deep learning models can be especially challenging due to complex hardware and software dependencies.

At Algorithmia we had to:

• Learn how to host and scale the 5 most common deep learning frameworks (more coming)
• Deal with scaling and spiky traffic on GPUs
• Deal with multitenancy on GPUs
• Build an extensible and dynamic architecture to support deep learning

First: What is Deep Learning?

• Deep learning uses artificial neurons, similar to those in the brain, to represent high-dimensional data
• The mammal brain is organized in a deep architecture; e.g., the visual system has 5 to 10 levels [1]
• Deep learning excels in tasks where the basic unit (a single pixel, frequency, or word) has very little meaning in and of itself, but the data contains high-level structure. Deep nets have been effective at learning this structure without human intervention.

[1] Serre, Kreiman, Kouh, Cadieu, Knoblich & Poggio

What is Deep Learning Being Applied To?

Primarily: huge growth in unstructured data

• Pictures

• Videos

• Audio

• Speech

• Websites

• Emails

• Reviews

• Log files

• Social media

Today’s Use Cases

• Computer vision: image classification, object detection, face recognition
• Natural language: speech to text, chatbots, Q&A systems (Siri, Alexa, Google Now), machine translation
• Optimization: anomaly detection, recommender systems

Why Now? ...and why is deep learning suddenly everywhere?

Advances in research:

• LeCun, Gradient-Based Learning Applied to Document Recognition, 1998
• Hinton, A Fast Learning Algorithm for Deep Belief Nets, 2006
• Bengio, Learning Deep Architectures for AI, 2009

Advances in hardware:

• GPUs: 10x performance, 5x energy efficiency

http://www.nvidia.com/content/events/geoInt2015/LBrown_DL_Image_ClassificationGEOINT.pdf

Deep Learning Hardware (2016)

GPUs: NVIDIA is dominating

• One of the first GPU neural nets, up to 9 layers, was trained on an NVIDIA GTX 280 (Ciresan and Schmidhuber, 2010)
• NVIDIA chips tend to outperform AMD
• More importantly, all the major frameworks treat CUDA as a first-class citizen; support for AMD's OpenCL is poor

Deep Learning Hardware

GPU:
• Becoming more tailored for deep learning (e.g., the Pascal chipset)

Custom hardware:
• FPGA (AWS F1, MSFT Project Catapult)

ASIC:
• Google TPU
• IBM TrueNorth
• Nervana Engine
• Graphcore IPUs

GPU Deep Learning Dependencies

The stack, from top to bottom (a version-check sketch follows the list):

• Meta deep learning framework (wrappers such as Keras)
• Deep learning framework
• cuDNN
• CUDA
• NVIDIA driver
• GPU
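One quick way to inspect the bottom layers of this stack on a GPU instance is to shell out to the tools that ship with the driver and the CUDA toolkit. This is an illustrative sketch; it assumes nvidia-smi and nvcc are on the PATH.

import subprocess

def run(cmd):
    try:
        return subprocess.check_output(cmd, shell=True).decode().strip()
    except subprocess.CalledProcessError:
        return "unavailable"

# NVIDIA driver + GPU (bottom of the stack)
print(run("nvidia-smi --query-gpu=name,driver_version --format=csv,noheader"))
# CUDA toolkit version (cuDNN has no CLI; check its header or package instead)
print(run("nvcc --version | tail -n 1"))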

Deep Learning Frameworks

Theano

Created by Université de Montréal. Theano pioneered the trend of using a symbolic graph for programming a network. Very mature framework, with good support for many kinds of networks.

Pros:
• Uses Python + Numpy
• Declarative computational graph (see the sketch below)
• Good support for RNNs
• Wrapper frameworks make it more accessible (Keras, Lasagne, Blocks)
• BSD License

Cons:
• Low-level framework
• Error messages can be unhelpful
• Large models can have long compile times
• Weak support for pre-trained models
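A minimal, illustrative Theano example of that declarative graph (not from the slides): define symbols, build an expression, let Theano differentiate it symbolically, then compile the graph into a callable function.

import theano
import theano.tensor as T

x = T.dvector("x")                 # symbolic input, no data yet
y = (x ** 2).sum()                 # symbolic expression, nothing computed
grad = T.grad(y, x)                # symbolic differentiation of the graph

f = theano.function([x], [y, grad])   # compile the graph
print(f([1.0, 2.0, 3.0]))             # y = 14.0, grad = [2. 4. 6.]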

Torch

Created by a collaboration of various researchers; used by DeepMind (prior to Google). Torch is a general scientific computing framework for Lua.

Torch is more flexible than TensorFlow and Theano in that it is imperative, while TF/Theano are declarative. That makes some operations (e.g., beam search) much easier to do.

Pros:
• Very flexible multidimensional array engine
• Multiple back ends (CUDA and OpenMP)
• Lots of pre-trained models available

Cons:
• Lua
• Not good for recurrent networks
• Lack of commercial support

Caffe

Created by the Berkeley Vision and Learning Center and community contributors. Probably the most used framework today, certainly for computer vision.

Pros:
• Optimized for feedforward networks, convolutional nets, and image processing
• Simple Python API (see the loading sketch below)
• BSD License

Cons:
• C++/CUDA required for new GPU layers
• Limited support for recurrent networks (recently added)
• Cumbersome for big networks (GoogLeNet, ResNet)
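A minimal sketch of that Python API for loading a pre-trained Model Zoo network. The file names are placeholders, and the "data"/"prob" blob names are assumptions that hold for AlexNet-style deploy files.

import numpy as np
import caffe

caffe.set_mode_gpu()                          # or caffe.set_mode_cpu()
net = caffe.Net("deploy.prototxt",            # network architecture (placeholder)
                "weights.caffemodel",         # pre-trained weights (placeholder)
                caffe.TEST)                   # inference mode

# Feed one image-shaped blob through the network.
net.blobs["data"].data[...] = np.random.rand(1, 3, 227, 227)
out = net.forward()
print(out["prob"].argmax())                   # index of the top class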

TensorFlow

Created by Google. TensorFlow is written with a Python API over a C/C++ engine. TensorFlow generates a computational graph (e.g., a series of matrix operations) and performs automatic differentiation.

Pros:
• Uses Python + Numpy
• Lots of interest from the community
• Highly parallel, and designed to use various back ends (software, GPU, ASIC)
• Apache License

Cons:
• Slower than other frameworks [1]
• More features, and more abstractions, than Torch
• Not many pre-trained models yet

[1] https://arxiv.org/pdf/1511.06435v3.pdf
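For contrast with Theano and Torch, a minimal illustrative example in the TensorFlow 1.x style current at the time of this talk: build a graph of matrix operations, then execute it in a session, letting TF place the ops on whichever back end is available.

import tensorflow as tf

a = tf.placeholder(tf.float32, shape=(2, 2))
b = tf.constant([[1.0, 0.0], [0.0, 1.0]])
product = tf.matmul(a, b)            # a node in the graph, not a value

with tf.Session() as sess:
    result = sess.run(product, feed_dict={a: [[1, 2], [3, 4]]})
    print(result)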

Networks for Training

Where to get networks:

• If you're just interested in using deep learning to classify images, you can usually find off-the-shelf networks
• VGG, GoogLeNet, AlexNet, SqueezeNet
• Caffe Model Zoo

Training vs. Running

Deep learning generally consists of two phases: training and running. Training deep learning models is challenging, with many solutions available today. Running deep learning models (at scale) is the next step, and has its own challenges.

Hosting Deep Learning

Making deep learning models available as an API presents a unique set of challenges that are rarely, if ever, addressed in tutorials.

Why ML in the Cloud?

• Need to react to live user data
• Don't want to manage your own servers
• You need enough servers to sustain maximum load; you can save money using cloud services
• Limited compute capacity on mobile

http://blog.algorithmia.com/2016/07/cloud-hosted-deep-learning-models

Use case: http://colorize-it.com

[Chart: request volume over time, annotated with the capacity required at peak and the potential for ghost/wasted capacity]

Without elastic compute on GPUs, our cost would be ~75% more.

Service Oriented Architecture

You're going to want dedicated infrastructure for handling computationally intensive tasks like deep learning. A toy sketch of the routing idea follows the diagram.

[Diagram: CLIENTS -> LOAD BALANCERS -> API SERVERS -> CPU WORKERS #1..#N running Docker(algorithm#1), Docker(algorithm#2), ..., Docker(algorithm#n), and GPU WORKERS #1..#N running Docker(deep-algo#1), Docker(deep-algo#2), ..., Docker(deep-algo#n); instance type labels m4 and x1 appear alongside the workers]
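A toy sketch (not Algorithmia's actual code) of the API-server routing idea in the diagram: requests for deep learning algorithms go to the GPU worker pool, everything else to the CPU pool. The algorithm registry and in-process pools here are stand-ins for real workers.

from concurrent.futures import ThreadPoolExecutor

# Hypothetical registry: which algorithms need a GPU worker.
GPU_ALGORITHMS = {"deep-algo-1", "deep-algo-2"}

cpu_pool = ThreadPoolExecutor(max_workers=4)   # stands in for CPU workers
gpu_pool = ThreadPoolExecutor(max_workers=1)   # stands in for GPU workers

def run_in_container(algorithm_name, payload):
    # The real system would invoke Docker(algorithm_name) on a worker;
    # we just echo so the sketch stays runnable.
    return "{} processed {!r}".format(algorithm_name, payload)

def dispatch(algorithm_name, payload):
    """API-server logic: route each request to the right worker pool."""
    pool = gpu_pool if algorithm_name in GPU_ALGORITHMS else cpu_pool
    return pool.submit(run_in_container, algorithm_name, payload)

print(dispatch("deep-algo-1", {"image": "cat.jpg"}).result())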

Why P2s?

• More video memory: 12 GB per GPU
• Modern CUDA support
• More CUDA cores to run in parallel
• New messages; in particular, we had problems with CUDA 3.0 not allowing us to share memory as efficiently
• Price per flop

Customer Showcase: CSDisco

• CSDisco offers "DISCO ML", cutting-edge eDiscovery technology for attorneys.
• DISCO ML is a deep learning based exploration tool that asynchronously re-learns as attorneys move through their normal discovery workflows.
• "The proprietary, multi-layer artificial network uses deep learning and arrays of GPUs to unpack and process learned information quickly. Combining Google's advanced Word2Vec embeddings with powerful convolutional and recurrent neural networks for text..."

Customer Showcase: Why did they choose to host their ML on Algorithmia?

• Scalability: required a scalable GPU-based compute fabric for their neural net based ML approach, to onboard hundreds of new customers without taxing their engineering department.
• Flexibility: expected high peaks of usage during certain hours.
• Reduced ghost compute: excess capacity = unnecessary cost.
• Ability to chain algorithms: the process is a series of operations (scoring, training, validation, querying), each its own model hosted on Algorithmia. Algorithmia provides an easy way to pipe algorithms into each other.

Challenges and how EC2 helps (with some)

Challenge #1: New Hardware, Latest CUDA

• AWS: G2 (GRID K520, 2013) and P2 instances
• Azure: N-Series
• Google: just announced a preview
• SoftLayer: various cards, including Tesla K80 and M60
• Small providers: Nimbix, Cirrascale, Penguin

Challenge #2: Language Bindings

You probably already have an existing stack in some programming language. How does it talk to your deep learning framework? Hope you like Python (or Lua).

Solution: Services!
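From the caller's side, "services" means the model sits behind an HTTP API that any language can reach. A sketch with the Algorithmia Python client; the API key and the algorithm path are placeholders, not a real hosted model.

import Algorithmia

client = Algorithmia.client("YOUR_API_KEY")            # placeholder key
algo = client.algo("deeplearning/SomeClassifier/1.0")  # hypothetical path
response = algo.pipe({"image": "https://example.com/cat.jpg"})
print(response.result)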

Challenge #3: Large Models

Deep learning models are getting larger:

• State-of-the-art networks are easily multi-gigabyte
• They need to be loaded and scaled

Solutions:

• More hardware
• Smaller models

Memory Per Model

Model                   Size (MB)   Error % (top-5)
SqueezeNet Compressed   0.6         19.7%
SqueezeNet              4.8         19.7%
AlexNet                 240         19.7%
Inception v3            84          5.6%
VGG-19                  574         7.5%
ResNet-50               102         7.8%
ResNet-200              519         4.8%

SqueezeNet

Hypothesis: the networks we're using today are much larger and more complicated than they need to be.

Enter SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and a < 0.5 MB model size. Not quite state of the art, but close, and MUCH easier to host. [1]

[1] Iandola, Han, Moskewicz, Ashraf, Dally & Keutzer

Model Compression

Recent efforts are aimed at pruning the size of networks: "Reduce the storage requirement of neural networks by 35x to 49x without affecting their accuracy." (Han, Mao, Dally, 2015)
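A minimal numpy sketch of the core idea, magnitude-based weight pruning. This is only one stage of the Han, Mao, Dally pipeline, which also quantizes and Huffman-codes the surviving weights.

import numpy as np

def prune(weights, sparsity=0.9):
    """Zero out the smallest-magnitude weights."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return np.where(np.abs(weights) < threshold, 0.0, weights)

w = np.random.randn(1000)
pruned = prune(w)
print("nonzero fraction:", np.count_nonzero(pruned) / w.size)  # ~0.10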

Challenge #4: GPU Sharing

GPUs were not designed to be shared the way CPUs are:

• Limited amount of video memory
• Even with multi-context management, memory overflows and unrestricted pointer logic are very dangerous for other applications
• Developers need a way to share GPU resources that is safe from potentially malicious applications

Challenge #5: GPU Sharing - Containers

Docker is the new standard for deploying applications, but it adds an additional layer of challenges to GPU computing:

• NVIDIA driver versions must match inside and outside containers
• CUDA versions must match inside and outside containers
• Some algorithms require X Windows, which must be started outside the container and mounted inside
• The nvidia-docker container tooling is helpful, but not a complete solution (a version-check sketch follows this list)
• The new AWS Deep Learning AMI is a huge step in the right direction
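A small sketch of checking the driver-match requirement: compare the driver version the host reports with the one a container sees. The nvidia-docker 1.x command style and the nvidia/cuda image name are assumptions of this sketch.

import subprocess

def driver_version(cmd_prefix=()):
    out = subprocess.check_output(
        list(cmd_prefix)
        + ["nvidia-smi", "--query-gpu=driver_version",
           "--format=csv,noheader"])
    return out.decode().strip()

host = driver_version()
container = driver_version(("nvidia-docker", "run", "--rm", "nvidia/cuda"))
print("match" if host == container else "MISMATCH", host, container)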

Lessons Learned

• Deep learning in the cloud is still in its infancy
• Hosting deep learning models is the next logical step after training them, but the difficulty is underappreciated
• Tooling and frameworks are making things easier, but there is a lot of opportunity for improvement

Big picture: the challenges involved in creating DL models are only half the problem. Deploying them is an entirely different skill set.

Demo

Thank you!

Try out Algorithmia for free: Algorithmia.com
Code: reinvent16

Remember to complete your evaluations!