Learning with Hadoop – A case study on MapReduce based Data Mining
Evan Xiang, HKUST


Page 1

Learning with Hadoop – A case study on MapReduce based Data Mining

Evan Xiang, HKUST

1

Page 2

Outline

Hadoop Basics

Case Study: Word Count, Pairwise Similarity, PageRank, K-Means Clustering, Matrix Factorization, Cluster Coefficient

Resource Entries to ML labs

Advanced Topics

Q&A

2

Page 3

Introduction to Hadoop

Hadoop MapReduce is a Java-based software framework for easily writing applications that process vast amounts of data (multi-terabyte datasets) in parallel on large clusters (thousands of nodes) of commodity hardware, in a reliable, fault-tolerant manner.

3

Page 4

Hadoop Cluster Architecture

[Diagram: a client submits jobs to the job submission node, which runs the JobTracker; the HDFS master runs the NameNode; each slave node runs a TaskTracker and a DataNode.]

From Jimmy Lin’s slides

Page 5

Hadoop HDFS

5

Page 6

Hadoop Cluster Rack Awareness

6

Page 7

Hadoop Development Cycle

1. Scp data to cluster
2. Move data into HDFS
3. Develop code locally
4. Submit MapReduce job
4a. Go back to Step 3
5. Move data out of HDFS
6. Scp data from cluster

From Jimmy Lin’s slides

Page 8

Divide and Conquer

[Diagram: the "Work" is partitioned into w1, w2, w3; each piece is handled by a "worker" producing r1, r2, r3; the partial results are combined into the final "Result".]

From Jimmy Lin’s slides

Page 9

High-level MapReduce pipeline

9

Page 10

Detailed Hadoop MapReduce data flow

10

Page 11

Outline

Hadoop Basics

Case Study: Word Count, Pairwise Similarity, PageRank, K-Means Clustering, Matrix Factorization, Cluster Coefficient

Resource Entries to ML labs

Advanced Topics

Q&A

11

Page 12

Word Count with MapReduce

Map:
Doc 1 ("one fish, two fish") → (one, 1), (two, 1), (fish, 2)
Doc 2 ("red fish, blue fish") → (red, 1), (blue, 1), (fish, 2)
Doc 3 ("cat in the hat") → (cat, 1), (hat, 1)

Shuffle and Sort: aggregate values by keys

Reduce:
(fish, 4), (one, 1), (two, 1), (red, 1), (blue, 1), (cat, 1), (hat, 1)

From Jimmy Lin’s slides
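The corresponding Hadoop job is essentially the canonical WordCount example that ships with Hadoop, written against the org.apache.hadoop.mapreduce API (input and output paths are passed on the command line):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (term, 1) for every token in the input line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the counts for each term after the shuffle groups them by key.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Note the combiner: reusing the reducer class for local aggregation reduces the amount of intermediate data shuffled across the network.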

Page 13

Outline

Hadoop Basics

Case Study: Word Count, Pairwise Similarity, PageRank, K-Means Clustering, Matrix Factorization, Cluster Coefficient

Resource Entries to ML labs

Advanced Topics

Q&A

13

Page 14

14

Calculating document pairwise similarity

Trivial solution: load each vector O(N) times, and load each term O(df_t^2) times.

Goal: a scalable and efficient solution for large collections.

From Jimmy Lin’s slides

Page 15

15

Better Solution

Load the weights for each term once. Each term contributes O(df_t^2) partial scores, and a term contributes only if it appears in both documents.

From Jimmy Lin’s slides

Page 16

16

Decomposition

Load the weights for each term once; each term contributes O(df_t^2) partial scores, and only if it appears in both documents. The partial scores are generated in the map phase and summed in the reduce phase.

From Jimmy Lin’s slides
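The decomposition the last two slides refer to is the term-wise expansion of the inner-product similarity; writing w_{t,d} for the weight of term t in document d (notation assumed here), it reads:

```latex
\mathrm{sim}(d_i, d_j) \;=\; \sum_{t \,\in\, d_i \cap d_j} w_{t,d_i}\, w_{t,d_j}
```

Each term therefore contributes one partial score per document pair in its posting list, which is exactly what the map phase emits and the reduce phase sums.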

Page 17

17

Standard Indexing

[Pipeline: (a) Map: each doc is tokenized into (term, posting) pairs; (b) Shuffle: values are grouped by term; (c) Reduce: the postings for each term are combined into a posting list.]

From Jimmy Lin’s slides

Page 18

Inverted Indexing with MapReduce

Map:
Doc 1 ("one fish, two fish") → (one, (1, 1)), (two, (1, 1)), (fish, (1, 2))
Doc 2 ("red fish, blue fish") → (red, (2, 1)), (blue, (2, 1)), (fish, (2, 2))
Doc 3 ("cat in the hat") → (cat, (3, 1)), (hat, (3, 1))

Shuffle and Sort: aggregate values by keys (terms)

Reduce (posting lists):
fish → [(1, 2), (2, 2)]; one → [(1, 1)]; two → [(1, 1)]; red → [(2, 1)]; blue → [(2, 1)]; cat → [(3, 1)]; hat → [(3, 1)]

From Jimmy Lin’s slides
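A minimal sketch of this indexing job in Hadoop Java (job configuration omitted; the input convention docId<TAB>text and the plain-text posting encoding docId:tf are illustrative assumptions, and in practice the raw tf would be replaced by a term weight such as tf-idf):

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndex {

  // Map: for each document, emit (term, "docId:tf") for every distinct term.
  public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Assumed input format: "docId<TAB>document text".
      String[] parts = value.toString().split("\t", 2);
      if (parts.length < 2) return;
      String docId = parts[0];
      Map<String, Integer> tf = new HashMap<>();
      for (String term : parts[1].toLowerCase().split("\\s+")) {
        if (!term.isEmpty()) tf.merge(term, 1, Integer::sum);
      }
      for (Map.Entry<String, Integer> e : tf.entrySet()) {
        context.write(new Text(e.getKey()), new Text(docId + ":" + e.getValue()));
      }
    }
  }

  // Reduce: concatenate the (docId, tf) postings for each term into one posting list.
  public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text term, Iterable<Text> postings, Context context)
        throws IOException, InterruptedException {
      StringBuilder list = new StringBuilder();
      for (Text posting : postings) {
        if (list.length() > 0) list.append(' ');
        list.append(posting.toString());
      }
      context.write(term, new Text(list.toString()));
    }
  }
}
```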

Page 19

19

Indexing (3-doc toy collection)

Documents: d1 = "Clinton Obama Clinton", d2 = "Clinton Cheney", d3 = "Clinton Barack Obama"

Resulting postings (term frequencies): Clinton → (d1, 2), (d2, 1), (d3, 1); Obama → (d1, 1), (d3, 1); Cheney → (d2, 1); Barack → (d3, 1)

From Jimmy Lin’s slides

Page 20

20

Pairwise Similarity

(a) Generate pairs: for each term's posting list (Clinton, Barack, Cheney, Obama), emit a partial score for every pair of documents in the list. (b) Group pairs by document pair. (c) Sum the partial scores to obtain the pairwise similarities.

How to deal with the long list? (A frequent term such as Clinton appears in every document, so its posting list generates a quadratic number of pairs.)

From Jimmy Lin’s slides
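Steps (a)-(c) can be implemented as a second Hadoop job over the inverted index. The sketch below assumes each input line has the form term<TAB>doc:weight doc:weight ... (matching the indexing sketch earlier, with weights instead of raw counts); these format choices are assumptions, not part of the original slides:

```java
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PairwiseSimilarity {

  // Map: read one posting list and emit a partial score for every pair of documents in it.
  public static class PairMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split("\t", 2);
      if (parts.length < 2) return;
      String[] postings = parts[1].split("\\s+");
      for (int a = 0; a < postings.length; a++) {
        String[] da = postings[a].split(":");
        for (int b = a + 1; b < postings.length; b++) {
          String[] db = postings[b].split(":");
          // Key the partial score by the document pair (smaller id first).
          String pair = da[0].compareTo(db[0]) < 0
              ? da[0] + "," + db[0] : db[0] + "," + da[0];
          double partial = Double.parseDouble(da[1]) * Double.parseDouble(db[1]);
          context.write(new Text(pair), new DoubleWritable(partial));
        }
      }
    }
  }

  // Reduce: sum the partial scores for each document pair.
  public static class PairReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    public void reduce(Text pair, Iterable<DoubleWritable> partials, Context context)
        throws IOException, InterruptedException {
      double sum = 0.0;
      for (DoubleWritable p : partials) sum += p.get();
      context.write(pair, new DoubleWritable(sum));
    }
  }
}
```

One common answer to the "long list" question is to drop or handle separately the most frequent terms, since a posting list of length L generates O(L^2) pairs.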

Page 21

Outline

Hadoop Basics

Case Study: Word Count, Pairwise Similarity, PageRank, K-Means Clustering, Matrix Factorization, Cluster Coefficient

Resource Entries to ML labs

Advanced Topics

Q&A

21

Page 22

PageRank

PageRank – an information propagation model

Requires intensive access to each node's neighborhood list.

22
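For reference, one standard formulation of this propagation model (notation assumed here): with random-jump probability α, node set V, L(n) the set of nodes linking to n, and out(m) the out-degree of m, each iteration computes

```latex
PR(n) \;=\; \frac{\alpha}{|V|} \;+\; (1-\alpha) \sum_{m \in L(n)} \frac{PR(m)}{\mathrm{out}(m)}
```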

Page 23

PageRank with MapReduce

Map: for each node n with adjacency list [m1, ..., mk], emit (m, PR(n)/k) for every out-link m, and also re-emit (n, [m1, ..., mk]) so the graph structure is preserved.
Example graph: n1 → [n2, n4], n2 → [n3, n5], n3 → [n4], n4 → [n5], n5 → [n1, n2, n3].

Shuffle: group the incoming contributions and the adjacency list by node id.

Reduce: sum the contributions received by each node to obtain its new PageRank value, and pass its adjacency list on to the next iteration.

How to maintain the graph structure?

From Jimmy Lin’s slides
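A hedged sketch of one such iteration as a Hadoop job (job setup, dangling-node mass, and convergence checks omitted). The key point from the slide, maintaining the graph structure, is handled by having the mapper re-emit each node's adjacency list alongside the rank mass it distributes; the LINKS/MASS tags and the tab-separated record format are illustrative assumptions:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PageRankIteration {

  // Assumed input record per node: "nodeId<TAB>rank<TAB>neighbor1,neighbor2,..."
  public static class PRMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split("\t");
      String node = parts[0];
      double rank = Double.parseDouble(parts[1]);
      String[] neighbors = parts.length > 2 && !parts[2].isEmpty()
          ? parts[2].split(",") : new String[0];
      // Pass the graph structure along so the reducer can rebuild the node record.
      context.write(new Text(node), new Text("LINKS\t" + String.join(",", neighbors)));
      // Distribute this node's rank evenly over its out-links.
      for (String m : neighbors) {
        context.write(new Text(m), new Text("MASS\t" + (rank / neighbors.length)));
      }
    }
  }

  public static class PRReducer extends Reducer<Text, Text, Text, Text> {
    private static final double ALPHA = 0.15;      // random-jump probability (assumed)
    private static final long NUM_NODES = 1000000; // |V|, normally passed via the job config

    @Override
    public void reduce(Text node, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      String links = "";
      double mass = 0.0;
      for (Text v : values) {
        String[] parts = v.toString().split("\t", 2);
        if (parts[0].equals("LINKS")) {
          links = parts.length > 1 ? parts[1] : "";
        } else {
          mass += Double.parseDouble(parts[1]);
        }
      }
      double rank = ALPHA / NUM_NODES + (1 - ALPHA) * mass;
      // Emit the node in the same record format so the next iteration can consume it.
      context.write(node, new Text(rank + "\t" + links));
    }
  }
}
```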

Page 24

Outline

Hadoop Basics

Case Study: Word Count, Pairwise Similarity, PageRank, K-Means Clustering, Matrix Factorization, Cluster Coefficient

Resource Entries to ML labs

Advanced Topics

Q&A

24

Page 25

K-Means Clustering

25

Page 26

K-Means Clustering with MapReduce

26

Each mapper loads a subset of the data samples and assigns each sample to its nearest centroid; every mapper needs to keep its own copy of the current centroids. The reducers then collect the samples assigned to each centroid and recompute it.

How the initial centroids are set is very important! Usually the centroids are initialized using Canopy Clustering.

[McCallum, Nigam and Ungar: "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", SIGKDD 2000]
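A hedged sketch of one K-Means iteration as a Hadoop job (job setup, the convergence loop, and the Canopy-based initialization are omitted; the comma-separated vector format and the local centroids.txt file loaded in setup(), e.g. shipped via the distributed cache, are illustrative assumptions):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class KMeansIteration {

  // Map: assign each sample to its nearest centroid. Every mapper keeps its own copy
  // of the current centroids, loaded once in setup().
  public static class AssignMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
    private final List<double[]> centroids = new ArrayList<>();

    @Override
    protected void setup(Context context) throws IOException {
      // Assumed local file "centroids.txt": one comma-separated vector per line.
      try (BufferedReader in = new BufferedReader(new FileReader("centroids.txt"))) {
        String line;
        while ((line = in.readLine()) != null) centroids.add(parse(line));
      }
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      double[] x = parse(value.toString());
      int best = 0;
      double bestDist = Double.MAX_VALUE;
      for (int c = 0; c < centroids.size(); c++) {
        double d = 0.0;
        for (int i = 0; i < x.length; i++) {
          double diff = x[i] - centroids.get(c)[i];
          d += diff * diff;
        }
        if (d < bestDist) { bestDist = d; best = c; }
      }
      context.write(new IntWritable(best), value);
    }

    static double[] parse(String line) {
      String[] tok = line.split(",");
      double[] v = new double[tok.length];
      for (int i = 0; i < tok.length; i++) v[i] = Double.parseDouble(tok[i].trim());
      return v;
    }
  }

  // Reduce: recompute each centroid as the mean of the samples assigned to it.
  public static class UpdateReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    public void reduce(IntWritable cluster, Iterable<Text> samples, Context context)
        throws IOException, InterruptedException {
      double[] sum = null;
      long count = 0;
      for (Text s : samples) {
        double[] x = AssignMapper.parse(s.toString());
        if (sum == null) sum = new double[x.length];
        for (int i = 0; i < x.length; i++) sum[i] += x[i];
        count++;
      }
      if (sum == null) return;
      StringBuilder mean = new StringBuilder();
      for (int i = 0; i < sum.length; i++) {
        if (i > 0) mean.append(',');
        mean.append(sum[i] / count);
      }
      context.write(cluster, new Text(mean.toString()));
    }
  }
}
```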

Page 27

Outline

Hadoop Basics

Case Study: Word Count, Pairwise Similarity, PageRank, K-Means Clustering, Matrix Factorization, Cluster Coefficient

Resource Entries to ML labs

Advanced Topics

Q&A

27

Page 28

Matrix Factorization for Link Prediction

In this task, we observe a sparse matrix X ∈ R^{m×n} with entries x_ij. Let R = {(i, j, r) : r = x_ij, x_ij ≠ 0} denote the set of observed links in the system. In order to predict the unobserved links in X, we model the users and the items by a user factor matrix U ∈ R^{k×m} and an item factor matrix V ∈ R^{k×n}. The goal is to approximate the link matrix X by the product of the factor matrices U and V, which can be learned by minimizing the following objective:
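The objective itself appears only as a figure in the original slides; a standard regularized squared-error form consistent with the description (the exact regularization used in the slides may differ) is

```latex
\min_{U, V} \;\sum_{(i,j) \in R} \left( x_{ij} - u_i^{\top} v_j \right)^{2}
  \;+\; \lambda \left( \lVert U \rVert_F^{2} + \lVert V \rVert_F^{2} \right)
```

where u_i and v_j are the i-th and j-th columns of U and V, and λ is a regularization weight.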

28

Page 29

Solving Matrix Factorization via Alternating Least Squares (ALS)

Given X and V, update U; similarly, given X and U, we can alternately update V.

29

[Figure: for each user i, the observed entries in row i of X (m×n) and the corresponding k-dimensional columns of V (k×n) are accumulated into a k×k matrix A and a k-vector b, from which u_i is solved.]
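Concretely (a standard ALS derivation, with the same caveat about the exact regularization), the update for user factor u_i solves a small k×k linear system assembled from the items rated by user i:

```latex
A_i = \sum_{j:\,(i,j)\in R} v_j v_j^{\top} + \lambda I_k, \qquad
b_i = \sum_{j:\,(i,j)\in R} x_{ij}\, v_j, \qquad
u_i = A_i^{-1} b_i
```

The update for each item factor v_j is symmetric, with the roles of U and V exchanged.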

Page 30

MapReduce for ALS

30

Stage 1. Map: group the rating data in X by the key for item j, and group the feature vectors in V by the key for item j. Reduce: align the ratings and the features for item j, and emit a copy of V_j for each observed x_ij, keyed by the corresponding user i.

Stage 2. Map: group the rating data and the V_j copies by the key for user i. Reduce: run standard ALS for user i, i.e., accumulate A and b from the received ratings and V_j copies, and update U_i.

Page 31

Outline

Hadoop Basics

Case Study: Word Count, Pairwise Similarity, PageRank, K-Means Clustering, Matrix Factorization, Cluster Coefficient

Resource Entries to ML labs

Advanced Topics

Q&A

31

Page 32

Cluster Coefficient

32

In graph mining, the clustering coefficient is a measure of the degree to which nodes in a graph tend to cluster together. The local clustering coefficient of a vertex quantifies how close its neighbors are to forming a clique (complete graph), and it is used to determine whether a graph is a small-world network.

How to maintain the Tier-2 neighbors?

[D. J. Watts and Steven Strogatz (June 1998). "Collective dynamics of 'small-world' networks". Nature 393 (6684): 440–442]
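For an undirected graph, the local clustering coefficient in the Watts-Strogatz definition cited above is, for a vertex i with k_i neighbors and e_i edges among those neighbors,

```latex
C_i = \frac{2\, e_i}{k_i \,(k_i - 1)}
```

so computing it for node i requires knowing which of i's neighbors are connected to each other, i.e. exactly the Tier-2 information the slide asks about.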

Page 33

Cluster Coefficient with MapReduce

33

The computation runs as two MapReduce stages (each with its own mappers and reducers): Stage 1 gathers each node's Tier-2 neighborhood, and Stage 2 calculates the cluster coefficient.

A BFS-based method needs three stages, but actually we only need two!

Page 34

Resource Entries to ML labs

Mahout: Apache's scalable machine learning libraries

Jimmy Lin's Lab: iSchool at the University of Maryland

Jimeng Sun & Yan Rong's Collections: IBM TJ Watson Research Center

Edward Chang & Yi Wang: Google Beijing

34

Page 35

Advanced Topics in Machine Learning with MapReduce

35

Probabilistic graphical models

Gradient-based optimization methods

Graph Mining

Others…

Page 36

Some Advanced Tips

Design your algorithm in a divide-and-conquer manner

Make your functional units loosely coupled

Carefully manage your memory and disk storage

Discussions…

36

Page 37

Outline

Hadoop Basics

Case Study: Word Count, Pairwise Similarity, PageRank, K-Means Clustering, Matrix Factorization, Cluster Coefficient

Resource Entries to ML labs

Advanced Topics

Q&A

37

Page 38

Q&A

Why not MPI? Hadoop is cheap in everything… D.P.T.H…

What are the advantages of Hadoop? Scalability!

How do you guarantee model equivalence? Guarantee equivalent/comparable functional logic.

How can you beat a “large memory” solution? Clever use of sequential disk access.

38