
Page 1: Terascale Learning

A Terascale Learning Algorithm

Alekh Agarwal, Olivier Chapelle, Miroslav Dudik, and John Langford

... And the Vowpal Wabbit project.

Page 2: Terascale Learning

Applying for a fellowship in 1997

Interviewer: So, what do you want to do?

John: I’d like to solve AI.

I: How?

J: I want to use parallel learning algorithms to create fantastic learning machines!

I: You fool! The only thing parallel machines are good for is computational windtunnels!

The worst part: he had a point. At that time, smarter learning algorithms always won. To win, we must master the best single-machine learning algorithms, then clearly beat them with a parallel approach.

Page 9: Terascale Learning

Demonstration

Page 10: Terascale Learning

Terascale Linear Learning ACDL11

Given 2.1 terafeatures of data, how can you learn a good linear predictor $f_w(x) = \sum_i w_i x_i$?

2.1T sparse features
17B examples
16M parameters
1K nodes

70 minutes = 500M features/second: faster than the I/O bandwidth of a single machine ⇒ we beat all possible single-machine linear learning algorithms.

(Actually, we can do even better now.)
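
As a quick sanity check on that throughput figure (back-of-the-envelope arithmetic, not from the slides):

$$\frac{2.1 \times 10^{12}\ \text{features}}{70 \times 60\ \text{s}} = 5 \times 10^{8}\ \text{features/s} = 500\text{M features/s}.$$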

Page 14: Terascale Learning

Compare: Other Supervised Algorithms in the Parallel Learning book

[Bar chart: “Speed per method” — input throughput in features/s on a log scale from 100 to 1e+09, distinguishing parallel from single-machine systems. Methods shown include RBF-SVM MPI?-500 (RCV1), Ensemble Tree MPI-128 (Synthetic), RBF-SVM TCP-48 (MNIST 220K), Decision Tree MapRed-200 (Ad-Bounce #), Boosted DT MPI-32 (Ranking #), Linear Threads-2 (RCV1), and Linear Hadoop+TCP-1000 (Ads *).]

Page 15: Terascale Learning

The tricks we use

First Vowpal Wabbit   Newer Algorithmics       Parallel Stuff
Feature Caching       Adaptive Learning        Parameter Averaging
Feature Hashing       Importance Updates       Smart Averaging
Online Learning       Dimensional Correction   Gradient Summing
                      L-BFGS                   Hadoop AllReduce
                                               Hybrid Learning

We’ll discuss Hashing, AllReduce, then how to learn.

Page 16: Terascale Learning

[Diagram: RAM layout — a conventional learner stores a string → index dictionary plus the weights; VW stores only the weights.]

Most algorithms use a hashmap to change a word into an index for a weight. VW uses a hash function, which takes almost no RAM, is about 10x faster, and is easily parallelized. Empirically, radical state-compression tricks are possible with this.
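
As an illustration of the hashing trick described above, a minimal sketch (it uses std::hash as a stand-in; VW’s real implementation uses its own hash function and feature namespaces):

#include <cstdint>
#include <functional>
#include <string>
#include <vector>

int main() {
    // Keep 2^18 weights; the hash is masked down to this many bits.
    const uint32_t num_bits = 18;
    std::vector<float> weights(1u << num_bits, 0.0f);

    // Map a feature name straight to a weight index: no string -> index
    // dictionary is ever stored, only the fixed-size weight vector.
    auto feature_index = [&](const std::string& name) -> uint32_t {
        return static_cast<uint32_t>(std::hash<std::string>{}(name)) & ((1u << num_bits) - 1);
    };

    // Example update: touch the weight for a text feature.
    weights[feature_index("word=terascale")] += 0.1f;
    return 0;
}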

Page 17: Terascale Learning

MPI-style AllReduce

[Diagram, animated across several slides: seven nodes in a binary tree start with the values 1-7; partial sums are reduced up the tree until the root holds 28, and 28 is then broadcast back down so every node finishes with the total.]

AllReduce = Reduce + Broadcast

Properties:

1 Easily pipelined, so no latency concerns.

2 Bandwidth ≤ 6n.

3 No need to rewrite code!
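
To make the reduce-then-broadcast flow concrete, here is a minimal single-process simulation of a tree AllReduce-sum on the seven values above (an illustration of the idea only, not VW’s networked implementation):

#include <cstdio>
#include <vector>

int main() {
    // Node i's children are 2i+1 and 2i+2, as in the seven-node example.
    std::vector<double> value = {1, 2, 3, 4, 5, 6, 7};
    const int n = static_cast<int>(value.size());

    // Reduce: add each node's partial sum into its parent, deepest nodes first.
    for (int i = n - 1; i > 0; --i) value[(i - 1) / 2] += value[i];

    // Broadcast: copy the root's total back down to every node.
    for (int i = 1; i < n; ++i) value[i] = value[0];

    for (int i = 0; i < n; ++i) std::printf("node %d: %g\n", i, value[i]);
    return 0;  // every node ends with 28 = 1 + 2 + ... + 7
}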

Page 23: Terascale Learning

An Example Algorithm: Weight averaging

n = AllReduce(1)
While (pass number < max):

1 While (examples left): do online update.

2 AllReduce(weights)

3 For each weight: w ← w/n

(A runnable sketch of this loop appears after the list below.)

Other algorithms implemented:

1 Nonuniform averaging for online learning

2 Conjugate Gradient

3 LBFGS
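
A minimal runnable sketch of the weight-averaging loop above; allreduce_sum() is a hypothetical stand-in for the real AllReduce call, stubbed here as a single-node no-op so the sketch compiles on its own, and the per-example online update is omitted:

#include <cstdio>
#include <vector>

void allreduce_sum(std::vector<float>& /*v*/) {
    // Stub: with one node the sum is just the local vector.
    // In the real setting every node's vector becomes the element-wise sum.
}

void averaged_training(std::vector<float>& w, int max_passes) {
    // n = AllReduce(1): count the participating nodes by summing a single 1.
    std::vector<float> one = {1.0f};
    allreduce_sum(one);
    const float n = one[0];

    for (int pass = 0; pass < max_passes; ++pass) {
        // While (examples left): do online updates on the local shard (omitted here).
        allreduce_sum(w);                // AllReduce(weights)
        for (float& wi : w) wi /= n;     // For each weight: w <- w / n
    }
}

int main() {
    std::vector<float> w(4, 0.5f);
    averaged_training(w, 2);
    std::printf("w[0] = %g\n", w[0]);
    return 0;
}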

Page 25: Terascale Learning

What is Hadoop-Compatible AllReduce?

1 “Map” job moves the program to the data.

2 Delayed initialization: Most failures are disk failures. First read (and cache) all data before initializing allreduce. Failures autorestart on a different node with identical data.

3 Speculative execution: In a busy cluster, one node is often slow. Hadoop can speculatively start additional mappers. We use the first to finish reading all data once.

The net effect: Reliable execution out to perhaps 10K node-hours.

Page 29: Terascale Learning

Algorithms: Preliminaries

Optimize so that few data passes are required ⇒ smart algorithms.

Basic problem with gradient descent: confused units. For $f_w(x) = \sum_i w_i x_i$,

$$\frac{\partial\,(f_w(x)-y)^2}{\partial w_i} = 2\,(f_w(x)-y)\,x_i,$$

which has units of $i$. But $w_i$ naturally has units of $1/i$, since doubling $x_i$ implies halving $w_i$ to get the same prediction.

Crude fixes:

1 Newton: multiply the gradient by the inverse Hessian $\left(\frac{\partial^2}{\partial w_i\,\partial w_j}\right)^{-1}$ to get the update direction ... but the computational complexity kills you.

2 Normalize the update so the total step size is controlled ... but this works only globally rather than per dimension.
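
To make the units mismatch concrete, a short rescaling argument (not on the slide, but implied by it): if every value of feature $i$ is scaled up by $c$, the weight that gives the same predictions scales down by $1/c$, yet the explicit $x_i$ factor in the gradient scales up by $c$:

$$x_i \to c\,x_i \;\Rightarrow\; w_i^{*} \to \frac{w_i^{*}}{c}, \qquad 2\,(f_w(x)-y)\,x_i \;\to\; 2\,(f_w(x)-y)\,c\,x_i.$$

So a plain gradient step with one global learning rate moves $w_i$ further exactly when it should move it less; the adaptive, dimensionally corrected update on the next slide is designed to fix this per dimension.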

Page 31: Terascale Learning

Approach Used

1 Optimize hard so few data passes are required.

  1 L-BFGS = a batch algorithm that builds up an approximate inverse Hessian from terms of the form $\frac{\Delta_w \Delta_w^T}{\Delta_w^T \Delta_g}$, where $\Delta_w$ is a change in the weights $w$ and $\Delta_g$ is a change in the loss gradient $g$.

  2 Dimensionally correct, adaptive, online gradient descent for a small number of passes (a code sketch of this update follows the list).

    1 Online = update weights after seeing each example.

    2 Adaptive = learning rate of feature $i$ scales as $\frac{1}{\sqrt{\sum g_i^2}}$, where the sum runs over the previous gradients $g_i$ for that feature.

    3 Dimensionally correct = still works if you double all feature values.

  3 Use (2) to warmstart (1).

2 Use map-only Hadoop for process control and error recovery.

3 Use custom AllReduce code to sync state.

4 Always save input examples in a cachefile to speed later passes.

5 Use the hashing trick to reduce input complexity.

Open source in Vowpal Wabbit 6.0. Search for it.
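
A minimal sketch of item 2 above: adaptive, per-feature learning rates for squared loss on a sparse example. This is an AdaGrad-style illustration, not VW’s actual update, which is also importance-weight aware and dimensionally corrected:

#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

struct SparseExample {
    std::vector<std::pair<size_t, float>> features;  // (index, value) pairs
    float label;
};

void adaptive_update(std::vector<float>& w, std::vector<float>& grad_sq,
                     const SparseExample& ex, float eta) {
    // Prediction f_w(x) = sum_i w_i * x_i over the nonzero features.
    float prediction = 0.0f;
    for (const auto& [i, x] : ex.features) prediction += w[i] * x;

    // Gradient of (f_w(x) - y)^2 with respect to w_i is 2 * (f_w(x) - y) * x_i.
    const float residual = prediction - ex.label;
    for (const auto& [i, x] : ex.features) {
        const float g = 2.0f * residual * x;
        grad_sq[i] += g * g;  // accumulate squared gradients for this feature
        // Per-feature learning rate eta / sqrt(sum of squared gradients);
        // the small epsilon is an implementation detail, not from the slide.
        w[i] -= eta * g / std::sqrt(grad_sq[i] + 1e-8f);
    }
}

int main() {
    std::vector<float> w(8, 0.0f), grad_sq(8, 0.0f);
    SparseExample ex{{{0, 1.0f}, {3, 2.0f}}, 1.0f};
    adaptive_update(w, grad_sq, ex, 0.5f);
    return 0;
}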

Page 38: Terascale Learning

Robustness & Speedup

[Plot: speedup versus number of nodes (10 to 100), with speedup on a 0-10 scale; curves labeled Average_10, Min_10, and Max_10, plus a linear-speedup reference line.]

Page 39: Terascale Learning

Webspam optimization

[Plot: “Webspam test results” — test log loss (0 to 0.5) versus pass number (0 to 50), comparing online learning, warmstarted L-BFGS, L-BFGS alone, and a single-machine online baseline.]

Page 40: Terascale Learning

How does it work?

The launch sequence (on Hadoop):

1 mapscript.sh <hdfs output> <hdfs input> <maps asked>

2 mapscript.sh starts a spanning tree on the gateway. Each vw connects to the spanning tree to learn which other vw is adjacent in the binary tree.

3 mapscript.sh starts a Hadoop streaming map-only job: runvw.sh

4 runvw.sh calls vw twice.

5 The first vw call does online learning while saving examples in a cachefile. Weights are averaged before saving.

6 The second vw uses the online solution to warmstart L-BFGS. Gradients are shared via allreduce.

Example: mapscript.sh outdir indir 100

Everything also works in a non-Hadoop cluster: you just need to script more to do some of the things that Hadoop does for you.

Page 41: Terascale Learning

How does this compare to other cluster learning efforts?

Mahout was founded as the MapReduce ML project, but has grown to include VW-style online linear learning. For linear learning, VW is superior:

1 Vastly faster.

2 Smarter algorithms.

Other algorithms exist in Mahout, but they are apparently buggy. AllReduce seems a superior paradigm in the 10K node-hours regime for machine learning.

GraphLab (@CMU) doesn’t overlap much. It is mostly about graphical model evaluation and learning.

Ultra LDA (@Y!) overlaps partially. VW has non-cluster-parallel LDA which is perhaps 3x more efficient.

Page 42: Terascale Learning

Can allreduce be used by others?

It might be incorporated directly into next generation Hadoop.

But for now, it’s quite easy to use the existing code.

void all_reduce(char* buffer, int n, std::string master_location,
                size_t unique_id, size_t total, size_t node);

buffer = pointer to some floats
n = number of bytes (4 * number of floats)
master_location = IP address of the gateway
unique_id = nonce (unique for different jobs)
total = total number of nodes
node = node id number

The call is stateful: it initializes the topology if necessary.
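
A hedged usage sketch built around the prototype above; the wrapper function and its argument values (gateway IP, job id, node count) are illustrative placeholders, and the snippet is meant to be linked against VW’s allreduce code rather than run on its own:

#include <string>
#include <vector>

// Prototype from the slide, assumed to be provided by the VW allreduce code.
void all_reduce(char* buffer, int n, std::string master_location,
                size_t unique_id, size_t total, size_t node);

void sync_gradient(std::vector<float>& gradient,
                   const std::string& gateway_ip,  // placeholder gateway address
                   size_t job_id,                  // placeholder nonce for this job
                   size_t total_nodes, size_t my_node) {
    // Each node passes its local floats; after the call every node holds the sum.
    all_reduce(reinterpret_cast<char*>(gradient.data()),
               static_cast<int>(gradient.size() * sizeof(float)),
               gateway_ip, job_id, total_nodes, my_node);
}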

Page 43: Terascale Learning

Further Pointers

search://Vowpal Wabbit

mailing list: vowpal [email protected]

VW tutorial (Preparallel): http://videolectures.net

Machine Learning (Theory) blog: http://hunch.net

Page 44: Terascale Learning

Bibliography: Original VW

Caching L. Bottou, Stochastic Gradient Descent Examples on Toy Problems, http://leon.bottou.org/projects/sgd, 2007.

Release Vowpal Wabbit open source project, http://github.com/JohnLangford/vowpal_wabbit/wiki, 2007.

Hashing Q. Shi, J. Petterson, G. Dror, J. Langford, A. Smola, and S. V. N. Vishwanathan, Hash Kernels for Structured Data, AISTATS 2009.

Hashing K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J. Attenberg, Feature Hashing for Large Scale Multitask Learning, ICML 2009.

Page 45: Terascale Learning

Bibliography: Algorithmics

L-BFGS J. Nocedal, Updating Quasi-Newton Matrices with Limited Storage, Mathematics of Computation 35:773–782, 1980.

Adaptive H. B. McMahan and M. Streeter, Adaptive Bound Optimization for Online Convex Optimization, COLT 2010.

Adaptive J. Duchi, E. Hazan, and Y. Singer, Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, COLT 2010.

Import. N. Karampatziakis and J. Langford, Online Importance Weight Aware Updates, UAI 2011.

Page 46: Terascale Learning

Bibliography: Parallel

grad sum C. Teo, Q. Le, A. Smola, and S. V. N. Vishwanathan, A Scalable Modular Convex Solver for Regularized Risk Minimization, KDD 2007.

averaging G. Mann, R. McDonald, M. Mohri, N. Silberman, and D. Walker, Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models, NIPS 2009.

averaging K. Hall, S. Gilpin, and G. Mann, MapReduce/Bigtable for Distributed Optimization, LCCC 2010.

(More forthcoming, of course)