CS525: Big Data Analytics

Machine Learning on Hadoop

Fall 2013

Elke A. Rundensteiner

Analytics?

• Machine learning, data mining & statistics tools

• Analyze/mine/summarize large datasets

• Extract knowledge from past or streaming data

• Predict trends in future data

ML Today

• Internet search clustering

• Social network analysis

• Taxonomy transformations

• Market analytics

• Recommendation systems

• Log analysis & event filtering

• SPAM filtering

• Fraud detection

Tools & Algorithms

• Collaborative Filtering

• Clustering Techniques

• Classification Algorithms

• Association Rules

• Frequent Pattern Mining

• Statistical libraries (Regression, SVM, …)

• Others…

Common Use Cases

Make It Industry Strength: Big Data

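• Analytics tools (machine learning, data mining, statistics): efficient at analyzing and mining data, but do not scale

• Big data platforms such as Hadoop: efficient at managing big data, but do not analyze or mine it

How to integrate these two worlds?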

Some Projects

• Apache Mahout
  • Open-source package on Hadoop for data mining and machine learning

• Revolution R (R-Hadoop or Radoop)
  • Extensions to R package to run on Hadoop

Apache Mahout

• Apache Software Foundation project

• Create scalable machine learning libraries

• Why ?

• Many open-source ML libraries either:
  • Lack community
  • Lack documentation
  • Lack scalability
  • Or are research-oriented only

Support Machine Learning

But Must Scale & Perform

• Be as fast as possible

• Scale to as much data as possible

But Must Scale & Perform

• Be as fast as possible, given the intrinsic cost of the algorithm

• What can be expressed as MapReduce jobs?

• Work in progress...
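
To make the "expressible as map-reduce jobs" question concrete, here is a minimal sketch of one K-Means iteration written as a map step and a reduce step. It is plain Python run in a single process, not Hadoop or Mahout code; the function names (`map_assign`, `reduce_mean`, `kmeans_iteration`) and the sample points are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def map_assign(point, centroids):
    """Map step: emit (index of nearest centroid, point)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = min(range(len(centroids)),
                  key=lambda i: sq_dist(point, centroids[i]))
    return nearest, point

def reduce_mean(points):
    """Reduce step: average all points that share a centroid index."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans_iteration(points, centroids):
    # Shuffle phase: group mapped values by key, as a framework would.
    groups = defaultdict(list)
    for p in points:
        key, value = map_assign(p, centroids)
        groups[key].append(value)
    # A centroid with no assigned points keeps its old position.
    return [reduce_mean(groups[i]) if i in groups else centroids[i]
            for i in range(len(centroids))]

# Hypothetical sample data: two obvious groups of 2-D points.
points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 8.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
new_centroids = kmeans_iteration(points, centroids)
print(new_centroids)
```

The map step only needs one point and the current centroids, and the reduce step only needs the points for one key, which is what lets a framework spread both steps across machines. Repeating the iteration until the centroids stop moving would require chaining several such jobs.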

C1: Collaborative Filtering

C2: Clustering

• Group similar objects together

• K-Means, Fuzzy K-Means, Density-Based,…

• Different distance measures
  • Manhattan, Euclidean, …
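
The choice of distance measure can change which objects count as "similar". The sketch below defines the two measures named above and shows a case where they disagree about which point is closer; the points `p` and `q` are made-up values for illustration.

```python
import math

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """Straight-line distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical points: which one is closer to the origin?
origin = (0.0, 0.0)
p = (3.0, 3.0)
q = (0.0, 5.0)

print("Manhattan:", manhattan(origin, p), manhattan(origin, q))
print("Euclidean:", euclidean(origin, p), euclidean(origin, q))
```

Under the Manhattan measure `q` is closer to the origin, while under the Euclidean measure `p` is closer, so a clustering run on the same data may group objects differently depending on the measure.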

C3: Classification

FPM: Frequent Pattern Mining

• Find the frequent itemsets
  • <milk, bread, cheese> are sold frequently together

• Very common in market analysis, access pattern analysis, etc…
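
As a small illustration of what "frequent itemset" means, the sketch below counts every itemset of up to three items across a handful of made-up baskets and keeps those that meet a minimum support count. This is brute-force enumeration, shown only to define the problem; it is not the algorithm any particular library uses and would not scale to large item catalogs. The function name `frequent_itemsets`, the `max_size` parameter, and the transactions are assumptions for the example.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=3):
    """Return {itemset: count} for itemsets seen in >= min_support baskets."""
    counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))  # canonical order, no duplicates
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    return {s: c for s, c in counts.items() if c >= min_support}

# Hypothetical market baskets.
transactions = [
    ["milk", "bread", "cheese"],
    ["milk", "bread", "cheese", "eggs"],
    ["milk", "bread"],
    ["bread", "cheese", "milk"],
    ["eggs", "juice"],
]
frequent = frequent_itemsets(transactions, min_support=3)
for itemset, count in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print(itemset, count)
```

Counting per basket and then summing per itemset has the same shape as a map step followed by a reduce step, which is one reason this task is a natural fit for Hadoop.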

Matrices and Statistics

• Math libraries
  • Vectors, matrices, etc.

• Noise reduction

• Similarity Functions
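
The slide does not name a specific similarity function, so as one common example, here is a minimal sketch of cosine similarity between two vectors. The choice of cosine, the function name, and the sample rating vectors are assumptions for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical rating vectors for three users over the same three items.
alice = (1.0, 2.0, 3.0)
bob = (2.0, 4.0, 6.0)    # same taste as alice, larger scale
carol = (3.0, 0.0, 0.0)  # only rated the first item

print(cosine_similarity(alice, bob))
print(cosine_similarity(alice, carol))
```

Cosine similarity ignores vector length, so `alice` and `bob` come out as near-identical even though their raw ratings differ by a factor of two.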

Apache Mahout

• http://mahout.apache.org/
