
Challenges in Large Scale Machine Learning


Page 1: Challenges in Large Scale Machine Learning

Sudarsun Santhiappan, RISE Lab, IIT Madras

Disclaimer: The slide contents are retrieved from the free Internet but cleverly edited, sequenced and stitched for academic lecturing purposes.

Page 2: Why all the hype?

Max Welling, ICML 2014

Big Models, Big Data

Peter Norvig, Alon Halevy. The Unreasonable Effectiveness of Data. IEEE Intelligent Systems, 2009.

Page 3: Scalability

● Strong scaling: if you throw twice as many machines at the task, you solve it in half the time. Usually relevant when the task is CPU bound.

● Weak scaling: if the dataset is twice as big, throw twice as many machines at it to solve the task in constant time. Usually relevant when the task is memory bound.

Most “big data” problems are I/O bound: it is hard to solve the task in an acceptable time independently of the size of the data (weak scaling).

Page 4: Why Distributed ML?

● The usual answer is that the data are too big to be stored on one computer!

● Some say it is because “Hadoop”, “Spark” and “MapReduce” are buzzwords
  – No, we should never believe buzzwords
  – For most jobs, multi-machine architectures are probably not required at all

● Let's argue that things are more complicated than we thought...

Page 5: When Distributed ML?

● When the data or the method does not fit in one computer system, we consider using multiple computers to solve the problem.

● Typically, Distributed ML becomes an option when one or more of the following grow very large:
  – Number of data points (sensor networks)
  – Number of attributes (image data)
  – Number of model parameters (deep neural networks)

Page 6: Some Assumptions

Much of the current work in large scale learning makes the standard assumptions about the data:
  – That it is drawn IID from a stationary distribution
  – That linear-time algorithms are cheap, while super-linear-time algorithms are expensive
  – That all the data are available and ready for use

Page 7: Big Data! Assumptions broken...

● Data is generated from the real world, and hardly anything in the real world follows a stationary distribution – IID broken.

● Data arrives at different speeds and at different times – availability broken.

● Sometimes, offline processing is allowed, so super-linear algorithms are in fact feasible.

● Sparsity – is that an advantage or a pain point?
  – When the data dimensionality is high, the data set mostly ends up sparse; e.g., text data.
  – Applying transformations makes the data dense.
  – Continuous vs categorical vs cardinal attributes.

Page 8: Size of Training Data

● Say you have 1K labeled and 1M unlabeled instances
  – Labeled/unlabeled ratio: 0.1%
  – Is 1K enough to train a supervised model?

● Now you have 1M labeled and 1B unlabeled instances
  – Labeled/unlabeled ratio: 0.1%
  – Is 1M enough to train a supervised model?

Page 9: Class Imbalance

● Given a classification problem,
  – What is the guarantee that the class proportions are uniform?
● What is the problem if there are imbalanced classes?
● Can't machine learning algorithms handle class imbalance in classification problems?
● What if the class imbalance is very heavily skewed?
  – Ex: 1% vs 99%
● Why can't accuracy be a good measure here?
  – Label everything with the majority label to get 99% accuracy! (see the sketch below)
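A minimal sketch of this point, assuming a hypothetical 1%/99% label vector and using scikit-learn metrics, showing why plain accuracy is misleading under heavy imbalance:

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

rng = np.random.default_rng(0)

# Hypothetical ground truth: roughly 1% positives vs 99% negatives.
y_true = (rng.random(10_000) < 0.01).astype(int)

# A "classifier" that always predicts the majority (negative) class.
y_pred = np.zeros_like(y_true)

print("accuracy           :", accuracy_score(y_true, y_pred))            # ~0.99
print("balanced accuracy  :", balanced_accuracy_score(y_true, y_pred))   # 0.5
print("F1 (positive class):", f1_score(y_true, y_pred, zero_division=0)) # 0.0
```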

Page 10: Curse of Dimensionality

● Exponential increase in volume associated with adding extra dimensions
● Joint problem of the data and the algorithm being applied
● The expected edge length e needed to cover a fraction r of the volume, when the dimensionality of the data is p, is e_p(r) = r^(1/p) (see the sketch after this list)
  – For p = 10, covering 1% of the volume requires 63% of the edge
  – For p = 10, covering 10% of the volume requires 80% of the edge
● Hughes Effect: with a fixed number of training samples, the predictive power reduces as the dimensionality increases
● Mitigate this problem using standard dimensionality reduction techniques such as Linear Discriminant Analysis, Principal Component Analysis, SVD, etc.
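A tiny numerical check of the edge-length formula above, in plain NumPy (the set of p values is illustrative):

```python
import numpy as np

def edge_length(r, p):
    """Edge length of a hypercube covering a fraction r of the unit volume in p dimensions."""
    return r ** (1.0 / p)

for p in (1, 3, 10, 100):
    print(f"p={p:>3}: 1% volume -> edge {edge_length(0.01, p):.2f}, "
          f"10% volume -> edge {edge_length(0.10, p):.2f}")
# For p = 10 this reproduces the 63% and ~80% figures on the slide.
```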

Page 11: Overfitting

● When a lot of data is available, it inherently means the possibility of a lot of noise!

● Overfit statistical models describe “noise” instead of the underlying input-output relationship.

● Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations.

● A model which has been over-fit will generally have poor predictive performance, as it can exaggerate minor fluctuations in the data.

● Avoid overfitting by: regularization, cross-validation, early stopping, pruning (see the sketch below).
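A minimal scikit-learn sketch of two of these remedies, regularization and cross-validation, on synthetic data (the dataset and the grid of regularization strengths are illustrative choices, not from the slides):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Small, noisy, high-dimensional data: an easy setting in which to overfit.
X, y = make_regression(n_samples=100, n_features=200, noise=10.0, random_state=0)

# Pick the regularization strength by 5-fold cross-validation.
for alpha in (0.01, 0.1, 1.0, 10.0, 100.0):
    scores = cross_val_score(Ridge(alpha=alpha), X, y, cv=5, scoring="r2")
    print(f"alpha={alpha:>6}: mean CV R^2 = {scores.mean():.3f}")
```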

Page 12: Method Complexities

Network latency and disk read times may dominate the cost of some learning algorithms
  – One pass over the data is expensive
  – Multiple passes may be out of the question

Because reading the data dominates costs, we can do intensive computation in a given locality without significantly impacting cost
  – Read the data once into memory, do several hundred passes, read the next block, . . .
  – Super-linear algorithms aren't so bad?

Page 13: Slow Arrival Problem

A lot of big data doesn't arrive all at once; we only get a chunk of the data every so often
  – Transactional data, sensor data

Streaming algorithms, incremental updates
  – Good, but limit our options somewhat
  – Typically have to make choices about how long it takes for data to “expire” (e.g., learning rate)

Lazy accumulators / reservoir sampling (see the sketch after this list)
  – Lazy algorithms limit options
  – Reservoir sampling isn't using all the data
  – Implicit expiry of data is “never”

Window-based retraining
  – Completely forgets past data
  – Window size is an explicit choice
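A minimal sketch of reservoir sampling (Algorithm R), which keeps a fixed-size uniform sample of a stream without storing the whole stream:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Return k items drawn uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Item i is kept with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), k=5))
```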

Page 14: Let's Start with an Example

● Using the linear classifier LIBLINEAR (Fan et al., 2008) to train on the rcv1 document data set (RCV1: A New Benchmark Collection for Text Categorization Research; Lewis et al., 2004)
● # instances: 677,399; # features: 47,236
● On a typical PC:
  – $ time ./train rcv1_test.binary
  – Total time: 50.88 seconds
  – Loading time: 43.51 seconds
● For this example: loading time >> running time

Page 15: Loading vs Running Time

● Let's assume the memory hierarchy contains only disk
● Assume the # of instances is N; Loading time: N × (a big constant); Running time: N^k × (some constant), where k ≥ 1
● Traditionally, Machine Learning and Data Mining methods consider only running time
  – If running time dominates, then we should design algorithms to reduce the number of operations
  – If loading time dominates, then we should design algorithms to reduce the number of data accesses
● A distributed environment is another layer of the memory hierarchy
  – So things become even more complicated

Page 16: Adv. of Distrib. Storage

● One apparent reason for using distributed clusters is that the data is too large for one disk (e.g., petabytes of data)

● Parallel data loading:
  – Reading several TB of data from one disk ⇒ a few hours
  – Using 100 machines, each with 1/100 of the data on its local disk ⇒ a few minutes

● Fault tolerance: some data is replicated across machines, so if one fails, the others are still available
  – How to do this efficiently/effectively is a challenge

Page 17: Distributed Systems

● Distributed file systems
  – We need them because a file is now managed at different nodes
  – A file is split into chunks and each chunk is replicated ⇒ if some nodes fail, the data is still available
  – Examples: GFS (Google File System), HDFS (Hadoop Distributed File System)

● Parallel programming frameworks
  – A framework is like a language or a specification; you can then have different implementations

Page 18: Out-of-Core Computing

● Problem: the training data does not fit in RAM
● Solution: lay out the data efficiently on disk and load it into memory as needed (see the sketch below)

● Very fast online learning: one thread to read, one to train
  – Hashing trick, online error, etc.

● Parallel matrix multiplication
  – The bottleneck tends to be CPU-GPU memory transfer
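A minimal out-of-core sketch, assuming a hypothetical CSV file (big_training_data.csv, last column = binary label) that is too large for RAM: stream it in chunks with pandas and update a linear model incrementally with scikit-learn's partial_fit:

```python
import pandas as pd
from sklearn.linear_model import SGDClassifier

CSV_PATH = "big_training_data.csv"   # hypothetical file laid out on disk

model = SGDClassifier()              # online linear classifier (linear SVM by default)
classes = [0, 1]                     # all labels must be declared for partial_fit

for chunk in pd.read_csv(CSV_PATH, chunksize=100_000):
    X = chunk.iloc[:, :-1].to_numpy()
    y = chunk.iloc[:, -1].to_numpy()
    model.partial_fit(X, y, classes=classes)   # one incremental update per chunk
```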

Page 19: Map-Reduce

● MapReduce (Dean and Ghemawat, 2008): a framework now commonly used for large-scale data processing

● In MapReduce, every element is a (key, value) pair
  – Mapper: a list of data elements is provided; each element is transformed into an output element
  – Reducer: values with the same key are presented to a single reducer
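A minimal in-memory imitation of the (key, value) flow, not a real MapReduce job: map emits pairs, a shuffle groups them by key, and a reduce aggregates each group (the word-count task is illustrative):

```python
from collections import defaultdict

def map_phase(documents):
    """Mapper: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: all values with the same key arrive at a single reducer."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big models", "big data"]
print(reduce_phase(shuffle(map_phase(docs))))   # {'big': 3, 'data': 2, 'models': 1}
```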

Page 20: Map-Reduce: Statistical Query Model

Many learning algorithms fit the statistical query model: they only need sums of a function f evaluated over the data points.

f, the map function, is sent to every machine; the sum corresponds to the reduce operation.
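The equation on the original slide is an image and is not reproduced in this text; a plausible reconstruction of the statistical-query form (cf. Chu et al., "Map-Reduce for Machine Learning on Multicore") is:

```latex
% Sufficient statistics as a sum over data points: each mapper computes
% partial sums of f on its local shard, the reducer adds them up.
\[
  \theta \;=\; \sum_{i=1}^{N} f(x_i, y_i)
  \;=\; \underbrace{\sum_{i \in \text{shard}_1} f(x_i, y_i)}_{\text{mapper 1}}
  \;+\; \cdots \;+\;
  \underbrace{\sum_{i \in \text{shard}_p} f(x_i, y_i)}_{\text{mapper } p}
\]
```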

Page 21: Map-Reduce (Salient Points)

● Resilient to failure: HDFS disk replication.

● Can run on huge clusters.

● Makes use of data locality: the program (query) is moved to the data and not the opposite.

● Map functions must be stateless: state is lost between map iterations.

● The computation graph is very constrained by the independence requirements: not ideal for computation on arbitrary graphs.

Page 22: From Hadoop to Spark

Disk I/O coupling and stateless mapping make MapReduce ineffective for iterative algorithms.

Spark (Zaharia et al., 2010) supports data caching between iterations.

Page 23: MPI vs MapReduce

● MPI: communication is explicitly specified; MapReduce: communication is performed implicitly
● MPI is like an assembly language; MapReduce is high-level
● MPI sends/receives data to/from a node's memory; MapReduce communication involves expensive disk I/O
● MPI: no fault tolerance; MapReduce: supports fault tolerance

MPI (Snir and Otto, 1998) is a parallel programming framework (a message-passing standard).

MPICH is a portable implementation of the MPI standard; "CH" stands for Chameleon, the portability layer by William Gropp on which the original MPICH was built.

MPICH2 is a popular and widely adopted implementation of MPI, which was used as the foundation for IBM MPI, Intel MPI, Cray MPI, MS MPI, etc.

Page 24: Some Distributed/Scalable Machine Learning Frameworks

● Vowpal Wabbit – out-of-core ML framework
● Mahout on Hadoop
● MLlib on Spark
● GraphLab – vertex-parallel programming
● Pregel – large-scale graph processing
● Parameter Server – big models

Page 25: Evaluation

● Traditionally, a parallel program is evaluated by its scalability

● We expect that when the # of machines is doubled, the speedup doubles (strong scaling)

● But it does not scale linearly. Why?

Page 26: Data Locality – One Reason

● Transferring data across networks is slow, so we should try to access data from the local disk

● Hadoop tries to move computation to the data: if the data is on node A, try to use node A for the computation

● But most machine learning algorithms are not designed to achieve good data locality

● Traditional parallel machine learning algorithms distribute computation to nodes
  – This works well in dedicated parallel machines with fast communication among nodes
  – But in data-center environments this may not work ⇒ communication cost is very high

Page 27: Summary

● Going distributed or not is sometimes a difficult decision

● There are many considerations
  – Whether the data is already in a distributed file system or not
  – The availability of distributed learning algorithms for your problem
  – The effort of writing distributed code
  – The selection of parallel frameworks

Page 28: Classification vs Clustering

● For which type of machine-learning problem is distributed computing suitable?

Classification: You may not need to use all your training data. Many training data + a so-so method may not be better than some training data + an advanced method. [mostly iterative]

Clustering: If you have N data instances, you need to cluster all of them. [parallelizable]

Page 29: A Multi-class Problem

● Problem: 55M documents, 29 classes, 3-100M features depending on settings

● Does this qualify for distributed computing?
● Not necessarily!

● With a 75 GB RAM machine, LIBLINEAR takes about 20 minutes to train the classifier with 55M docs and 3M features
● On a single computer there is room to try different features, and hence an opportunity to increase accuracy
● Go to a distributed framework only when you have designed your experiment completely!

Page 30: A Bagging Implementation

● Assume the data is large, say 1 TB, and you have 10 machines with 100 GB RAM each
● One way to train on this large data is a bagging approach (see the sketch below)
  – machine 1 trains on 1/10 of the data
  – machine 2 trains on 1/10 of the data
  – ...
  – machine 10 trains on 1/10 of the data
● Then use the 10 models for prediction and combine the results
● Obvious reason: parallel data loading and parallel computation
● But it is not that simple if MapReduce/Hadoop is used
● HDFS is not designed for easily copying a subset of data to a node!
  – Assign 10 different keys to the data, such that each key lands in a separate reducer
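A minimal single-process sketch of the bagging idea (partition, train one model per partition, combine by majority vote); in the distributed setting each partition would live on its own machine. The dataset and base learner are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)

# Split into 10 partitions, standing in for the 10 machines.
parts = np.array_split(np.arange(len(X)), 10)
models = [LogisticRegression(max_iter=1000).fit(X[idx], y[idx]) for idx in parts]

# Combine the 10 models by majority vote at prediction time (labels are 0/1 here).
votes = np.stack([m.predict(X) for m in models])     # shape: (10, n_samples)
combined = (votes.mean(axis=0) > 0.5).astype(int)
print("training-set accuracy of the vote:", (combined == y).mean())
```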

Page 31: Before Going to Distributed Computing...

● Remember that these systems (e.g., Hadoop) are not designed in particular for machine learning applications.

● We need to know when and where they are suitable to be used.

● Also, whether your data is already in a distributed system or not is important.

Page 32: Distributed Algorithms

● Let's see how well-known machine learning algorithms for classification and clustering are converted into distributed algorithms:
  – Distributed K-Means and Parallel Spectral Clustering (Chen et al., IEEE PAMI 2011)
  – DC-SVM: A Divide-and-Conquer Solver for Kernel Support Vector Machines (Hsieh, Si, Dhillon, ICML 2014)
  – LDA: An Architecture for Parallel Topic Models (Smola and Narayanamurthy, VLDB 2010)

Page 33: Clustering

K-Means, Spectral Clustering

Page 34: Distributed K-Means

● Let's discuss the differences between MPI and MapReduce implementations of k-means

Page 35: K-Means: MPI

● Broadcast the initial centers to all machines
● While not converged:
  – Each node assigns its data to the k clusters and computes the local sum of each cluster
  – An MPI AllReduce operation obtains the global sum of each of the k clusters to find the new centers (see the sketch below)

● Communication versus computation: if x ∈ R^d, then we transfer k × d elements after k × d × N/p operations
  – N: total number of data points, p: number of nodes
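A minimal mpi4py sketch of the broadcast/AllReduce pattern described above (run with mpirun; the random per-rank data, k, d, and the fixed iteration count are purely illustrative):

```python
# Run with e.g.: mpirun -n 4 python kmeans_mpi_sketch.py
import numpy as np
from mpi4py import MPI

comm, k, d = MPI.COMM_WORLD, 5, 10
rng = np.random.default_rng(comm.Get_rank())
X = rng.normal(size=(1000, d))                 # this rank's local data shard

# Rank 0 picks initial centers and broadcasts them to every machine.
centers = comm.bcast(X[:k].copy() if comm.Get_rank() == 0 else None, root=0)

for _ in range(10):                            # fixed iteration count for the sketch
    # Assign local points to the nearest center.
    dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = dist.argmin(axis=1)

    # Local per-cluster sums and counts.
    sums, counts = np.zeros((k, d)), np.zeros(k)
    for j in range(k):
        sums[j] = X[labels == j].sum(axis=0)
        counts[j] = (labels == j).sum()

    # AllReduce: every node ends up with the global sums and counts.
    comm.Allreduce(MPI.IN_PLACE, sums, op=MPI.SUM)
    comm.Allreduce(MPI.IN_PLACE, counts, op=MPI.SUM)
    centers = sums / np.maximum(counts, 1)[:, None]

if comm.Get_rank() == 0:
    print("final centers:\n", centers)
```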

Page 36: K-Means: MapReduce

● Thomas Jungblut: http://codingwiththomas.blogspot.com/2011/05/k-means-clustering-with-mapreduce.html
● You don't specifically assign data to nodes
  – That is, the data has been stored somewhere on HDFS
● Each instance is a (key, value) pair
  – key: its associated cluster center
  – value: the instance
● Map: each (key, value) pair finds the closest center and updates the key
● Reduce: for instances with the same key (cluster), calculate the new cluster center
● Given that you don't control where the data points are, it's unclear how expensive loading and communication are!

Page 37: Spectral Clustering

● Input: data points x_1, . . . , x_n; k: number of desired clusters
● Construct the similarity matrix S ∈ R^(n×n)
● Modify S to be a sparse matrix
● Construct D, the degree matrix
● Compute the symmetric normalized Laplacian L = I − D^(−1/2) S D^(−1/2)
● Compute the first k eigenvectors of L and construct V ∈ R^(n×k), whose columns are the k eigenvectors
● Use the k-means algorithm to cluster the n rows of V into k groups (see the sketch below)
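A compact NumPy/scikit-learn sketch of these steps on toy data (the Gaussian similarity, the sparsification threshold, and k are illustrative choices; at scale one would use an iterative eigensolver such as ARPACK, as discussed on the next slide):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0, 3, 6)])
n, k = len(X), 3

# Similarity matrix S (Gaussian), sparsified by zeroing small entries.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
S = np.exp(-sq_dists)
S[S < 1e-3] = 0.0

# Degree matrix D and normalized Laplacian L = I - D^{-1/2} S D^{-1/2}.
d = S.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L = np.eye(n) - D_inv_sqrt @ S @ D_inv_sqrt

# First k eigenvectors (smallest eigenvalues of L); rows of V embed the points.
eigvals, eigvecs = np.linalg.eigh(L)
V = eigvecs[:, :k]

labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(V)
print(np.bincount(labels))   # roughly 50/50/50 for this toy data
```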

Page 38: Challenges

● Similarity matrix
  – Only computed once: suitable for MapReduce
  – But its size grows as O(n²)
● First k eigenvectors
  – Implicitly restarted Arnoldi is iterative
  – Iterative: not suitable for MapReduce
  – MPI is used, but there is no fault tolerance

● Parallel similarity matrix using MPI/MapReduce
● Parallel ARPACK using MPI
● Parallel K-Means using MPI/MapReduce

Page 39: Sample Result

● 2,121,863 points and 1,000 classes

Page 40: How to Scale Up?

● We can see that the scalability of the eigendecomposition is not good!
● We can see two bottlenecks
  – computation: the O(n²) similarity matrix
  – communication: finding the eigenvectors
● To handle even larger sets we may need to modify the algorithm
● For example, we can use only part of the similarity matrix (e.g., the Nyström approximation): slightly worse performance, but it may scale up better
● The decision depends on your number of data points and other considerations

Page 41: Classification

Distributed SVM method

Page 42: Support Vector Machines

● Given:
  – Training data points x_1, . . . , x_n
  – Each x_i ∈ R^d is a feature vector
  – Consider a simple case with two classes: y_i ∈ {+1, −1}

● Goal: find a hyperplane to separate these two classes of data:
  – if y_i = +1, then w^T x_i ≥ 1 − ξ_i
  – if y_i = −1, then w^T x_i ≤ −1 + ξ_i

Page 43: Linearly non-separable?

● Basis expansion: map the data x_i to a higher dimensional (maybe infinite) feature space φ(x_i), where the classes are linearly separable

● Kernel trick: K(x_i, x_j) = φ(x_i)^T φ(x_j)

● Various types of kernels (see the sketch below)
  – Gaussian: K(x, y) = exp(−γ ||x − y||²)
  – Polynomial: K(x, y) = (γ x^T y + c)^d
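A small NumPy sketch of the two kernels above, computed as full n × n matrices; it also prints the memory a dense kernel matrix would need for n = 1M, which is the bottleneck discussed two slides later (the sample size and kernel parameters are illustrative):

```python
import numpy as np

def gaussian_kernel(X, Y, gamma=0.5):
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def polynomial_kernel(X, Y, gamma=1.0, c=1.0, degree=3):
    return (gamma * X @ Y.T + c) ** degree

X = np.random.default_rng(0).normal(size=(1000, 20))
K = gaussian_kernel(X, X)
print(K.shape)                          # (1000, 1000)

# Dense kernel matrix memory for n = 1M points, 8 bytes per entry:
n = 1_000_000
print(f"{n * n * 8 / 1e12:.0f} TB")     # 8 TB, as on the 'Bottlenecks' slide
```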

Page 44: Support Vector Machines

● Training data {y_i, x_i}, x_i ∈ R^d, i = 1, . . . , N; y_i = ±1

● SVM solves the following optimization problem, which includes a regularization term (a standard form is given below)

● Decision function, where φ(x) is the basis expansion function
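The optimization problem and decision function on the original slide are images; a standard statement of the soft-margin primal (without a bias term, matching the constraints on Page 42), which is presumably what the slide shows, is:

```latex
% Soft-margin SVM primal with regularization term (1/2) w^T w:
\[
  \min_{w,\;\xi}\;\; \tfrac{1}{2}\, w^{\top} w \;+\; C \sum_{i=1}^{N} \xi_i
  \quad \text{s.t.}\quad y_i\, w^{\top} \phi(x_i) \ge 1 - \xi_i,\;\; \xi_i \ge 0,\;\; i = 1,\dots,N
\]
% Decision function:
\[
  f(x) \;=\; \operatorname{sign}\!\left( w^{\top} \phi(x) \right)
\]
```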

Page 45: Bottlenecks

● Assume a Gaussian kernel K(x, y) = exp(−γ ||x − y||²) and the full square kernel matrix:
  – space complexity is O(n²)
  – compute time complexity is O(n²·d)
  – Example: when N = 1M, the space required for the kernel matrix is 1M × 1M × 8B = 8TB

● Existing methods try not to use the whole kernel matrix at the same time.

Page 46: Dual Problem of SVM

● Challenges for solving kernel SVMs (a standard form of the dual is given below):
  – Space: O(n²)
  – Time: O(n³), assuming O(n) support vectors

● n = number of variables = number of samples
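The dual on the original slide is likewise an image; the standard kernel SVM dual for the bias-free formulation above, with one variable α_i per sample (hence n variables), is:

```latex
\[
  \max_{\alpha}\;\; \sum_{i=1}^{n} \alpha_i
  \;-\; \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j\, y_i y_j\, K(x_i, x_j)
  \quad \text{s.t.}\quad 0 \le \alpha_i \le C,\;\; i = 1,\dots,n
\]
```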

Page 47: Scalability

● LIBSVM takes more than 8 hours to train on the CoverType dataset with 0.5 million samples (prediction accuracy 96%)

● Many inexact solvers have been developed: AESVM (Nandan et al., 2014), Budgeted SVM (Wang et al., 2012), Fastfood (Le et al., 2013), Cascade SVM (Graf et al., 2005), . . .
  – 1-3 hours, with prediction accuracy 85-90%

● Divide the problem into smaller subproblems – DC-SVM: 11 minutes, with prediction accuracy 96%

Page 48: DC-SVM with Single-Level Data Division

Page 49: DC-SVM: Conquer Step

It is shown that the clustered objective function depends on D(π), the between-cluster error, where σ_n is the smallest eigenvalue of the kernel matrix.

Page 50: Quality of α (solution from subproblems)

Page 51: Kernel K-Means Clustering

● Want a partition which
  – minimizes D(π) = Σ_{i,j: π(x_i) ≠ π(x_j)} |K(x_i, x_j)|
  – has balanced cluster sizes (for efficient training)

● Kernel k-means is slow, so use two-step kernel k-means instead:
  – Run kernel k-means on a subset of samples of size M << N to find cluster centers
  – Identify the clusters for the rest of the data

Software available at: http://www.cs.utexas.edu/~cjhsieh/dcsvm

Page 52: Document Modeling

Latent Dirichlet Allocation

Page 53: Topic Models

● Topic models for text
  – Text documents are composed of topics
  – Each topic can generate a set of words
  – P(w, d) = P(d) Σ_z P(w|z) P(z|d) = Σ_z P(w|z) P(d|z) P(z)

● Basic idea
  – Each word 'w' or document 'd' can be represented as a topic vector
  – Each topic can be enumerated as a ranked list of words

● For a query such as “ice skating”, LDA (Blei et al., 2003) can infer from “ice” that “skating” is closer to the topic “sports” than to the topic “computer”

Page 54: Latent Dirichlet Allocation

● w_ij: the jth word of the ith document
● p(w_ij | z_ij, Φ) and p(z_ij | Θ_i): multinomial distributions; that is, w_ij is drawn from (z_ij, Φ) and z_ij is drawn from Θ_i
● p(Θ_i | α), p(Φ_j | β): Dirichlet distributions
● α, β: priors of Θ and Φ respectively

Page 55: Gibbs Sampling

● Maximizing the likelihood is not easy, so Griffiths and Steyvers (2004) propose using Gibbs sampling to iteratively estimate the posterior p(z|w)

● While the model looks complicated, Θ and Φ can be integrated out, giving p(w, z | α, β)

● Then, at each iteration, only a counting procedure is needed

Page 56: LDA Algorithm

● For each iteration
  – For each document i
    ● For each word j in document i
      – Sampling and counting (see the sketch below)

● Distributed learning seems straightforward
  – Divide the data across several nodes
  – Each node counts its local data
  – The models are summed up!
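A compact single-machine sketch of the collapsed Gibbs sampling loop, to make the "sampling and counting" step concrete (the tiny corpus, K, α, and β are all illustrative; count tables are n_dk, n_kw, n_k):

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny illustrative corpus: each document is a list of word ids from a vocabulary of size V.
docs = [[0, 1, 2, 0], [2, 3, 3, 4], [0, 4, 1, 2]]
V, K, alpha, beta = 5, 2, 0.1, 0.01

# Count tables: document-topic, topic-word, topic totals.
n_dk = np.zeros((len(docs), K))
n_kw = np.zeros((K, V))
n_k = np.zeros(K)

# Random initial topic assignment for every token, with counting.
z = [[int(rng.integers(K)) for _ in doc] for doc in docs]
for d, doc in enumerate(docs):
    for j, w in enumerate(doc):
        k = z[d][j]
        n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

for _ in range(100):                       # iterations
    for d, doc in enumerate(docs):         # for each document
        for j, w in enumerate(doc):        # for each word in the document
            k = z[d][j]
            # Remove the token's current assignment from the counts.
            n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
            # Sample a new topic proportional to (n_dk + alpha)(n_kw + beta)/(n_k + V*beta).
            p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
            k = int(rng.choice(K, p=p / p.sum()))
            # Put the token back with its new topic (the "counting" step).
            z[d][j] = k
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

print("topic-word counts:\n", n_kw)
```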

Page 57: Smola et al., 2010

● A direct MapReduce implementation may not be efficient due to the I/O at each iteration

● Smola and Narayanamurthy (2010) use quite sophisticated techniques to get high throughput
  – They don't partition documents across machines; otherwise machines would need to wait for synchronization
  – Instead, they run several samplers and synchronize between them
  – They used memcached, so data is stored in memory rather than on disk
  – They used Hadoop Streaming, so C++ rather than Java is used

Page 58: Conclusion

● Distributed machine learning is still an active research topic

● It is related to both machine learning and systems

● Even if machine learning people cannot develop systems themselves, they need to know how to choose systems

● An important fact is that existing distributed systems and parallel frameworks are not particularly designed for machine learning algorithms

● Machine learning people can
  – help to influence how systems are designed
  – design new algorithms for existing systems