
Page 1: Predictive Analytics with Hadoop

Predictive Analytics with HadoopMichael HausenblasMapR Technologies

Page 2: Predictive Analytics with Hadoop


Predictive Analytics with Hadoop

Michael Hausenblas, Chief Data Engineer EMEA

Hadoop Summit, Amsterdam, April 2014

Page 3: Predictive Analytics with Hadoop


Agenda

• Examples
• Data-driven solutions
• Obtaining big training data
• Recommendation with Mahout and Solr
• Operational considerations

Page 4: Predictive Analytics with Hadoop


Recommendation is Everywhere

Media and Advertising · e-commerce · Enterprise Sales

Enterprise Sales example:
• Recommend sales opportunities to partners
• $40M revenue in year 1
• 1.5B records per day
• Using MapR

Page 5: Predictive Analytics with Hadoop


Classification is Everywhere

IP address blacklisting · Fraud Detection · Customer 360 · Scoring & Categorization

Customer 360 (Fortune 100 Telco):
• Each customer is scored and categorized based on all their activity
• Data from hundreds of streams and databases
• Using MapR

IP address blacklisting:
• 600+ variables considered for every IP address
• Billions of data points
• Using MapR

Fraud Detection:
• Identify anomalous patterns indicating fraud, theft and criminal activity
• Stop phishing attempts
• Using MapR

Page 6: Predictive Analytics with Hadoop


Data-Driven Solutions

• Physics is ‘simple’: F = ma, E = mc²
• Human behavior is much more complex
  – Which ad will they click?
  – Is a behavior fraudulent? Why?
• Don’t look for complex models that try to discover general rules
  – The size of the dataset is the most important factor
  – Simple models (n-grams, linear classifiers) with Big Data

• A. Halevy, P. Norvig, and F. Pereira. The unreasonable effectiveness of data. IEEE Intelligent Systems, 24(2):8-12, 2009.

Page 7: Predictive Analytics with Hadoop


The Algorithms Are Less Important

Page 8: Predictive Analytics with Hadoop


Focus on the Data

• Most algorithms come down to counting and simple math
• Invest your time where you can make a difference
  – Getting more data can improve results by 2x
    • e.g., add beacons everywhere to instrument user behavior
  – Tweaking an ML algorithm will yield a fraction of 1%
• Data wrangling
  – Feature engineering
  – Moving data around
  – …

Page 9: Predictive Analytics with Hadoop


Obtaining Big Training Data

• Can’t really rely on experts to label the data
  – Doesn’t scale (not enough experts out there)
  – Too expensive
• So how do you get the training data?
  – Crowdsourcing
  – Implicit feedback
    • “Obvious” features
    • User engagement

Page 10: Predictive Analytics with Hadoop


Using Crowdsourcing for Annotation

R. Snow, B. O'Connor, D. Jurafsky, and A. Ng. Cheap and fast – but is it good? Evaluating non-expert annotations for natural language tasks. EMNLP, 2008.

Quantity: $2 for 7,000 annotations (leveraging Amazon Mechanical Turk and a “flat world”)

Quality: 4 non-experts ≈ 1 expert

Page 11: Predictive Analytics with Hadoop


Using “Obvious” Features for Annotation

Example: emoticons such as :) and :( in user-generated text serve as “obvious” sentiment labels.

Page 12: Predictive Analytics with Hadoop


Leveraging Implicit Feedback

• User behavior provides valuable training data
• Google adjusts search rankings based on engagement
  – Did the user click on the result?
  – Did the user come back to the search page within seconds?
• Most recommendation algorithms are based solely on user activity
  – What products did they view/buy?
  – What ads did they click on?

• T. Joachims, L. Granka, B. Pan, H. Hembrooke, F. Radlinski, and G. Gay. Evaluating the accuracy of implicit feedback from clicks and query reformulations in Web search. ACM TOIS, 25(2):1, 2007.

Page 13: Predictive Analytics with Hadoop


Increasing Exploration

We need to find ways to increase exploration to broaden our learning: we can’t learn much about results on page 2, 3 or 4!

Page 14: Predictive Analytics with Hadoop


Result Dithering

• Dithering is used to re-order recommendation results
  – Re-ordering is done randomly
• Dithering is guaranteed to make off-line performance worse
• Dithering also has a near-perfect record of making actual performance much better
  – “Made more difference than any other change”

Page 15: Predictive Analytics with Hadoop


Simple Dithering Algorithm

• Generate a synthetic score from log rank plus Gaussian noise: s = log(r) + N(0, ε)
• Pick the noise scale ε to provide the desired level of mixing
• Typically ε is around 0.5 (the examples that follow use ε = 0.5 and ε = log 2 ≈ 0.69)
• Oh… use floor(t/T) as the random seed so results don’t change too often
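
A minimal sketch of this in R; the function name, the 600-second window, and the trailing usage comments are illustrative assumptions, only the score formula and the floor(t/T) seeding come from the slide:

  dither <- function(n = 8, epsilon = 0.5,
                     t = as.numeric(Sys.time()), period = 600) {
    set.seed(floor(t / period))           # floor(t/T): order is stable within each window
    s <- log(1:n) + rnorm(n, 0, epsilon)  # synthetic score: log rank + Gaussian noise
    order(s)                              # re-ordered presentation of ranks 1..n
  }
  dither()                  # mostly 1, 2, ... up front, with occasional swaps
  dither(epsilon = log(2))  # more aggressive mixing, as in the second example below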

Page 16: Predictive Analytics with Hadoop


Example: ε = 0.5

• Each line represents a recommendation of 8 items

• The non-dithered recommendation would be 1, 2, …, 8

Page 17: Predictive Analytics with Hadoop


Example: ε = log 2 ≈ 0.69

• Each line represents a recommendation of 8 items

• The non-dithered recommendation would be 1, 2, …, 8

Page 18: Predictive Analytics with Hadoop


Recommendations with Mahout and Solr

Page 19: Predictive Analytics with Hadoop


What is Recommendation?

The behavior of a crowd helps us understand what individuals will do…

Page 20: Predictive Analytics with Hadoop


Batch and Real-Time

• We can learn about the relationship between items every X hours
  – These relationships don’t change often
  – People who buy Nikon D7100 cameras also buy a Nikon EN-EL15 battery
• What to recommend to Bob has to be determined in real time
  – Bob may be a new user with no history
  – Bob is shopping for a camera right now, but he was buying a baby bottle an hour ago
• How do we do that?
  – Mahout for the heavy number crunching
  – Solr/Elasticsearch for the real-time recommendations

Page 21: Predictive Analytics with Hadoop


Real-Time Recommender Architecture

[Architecture diagram: the complete user history (log files/tables) feeds co-occurrence analysis in Mahout; the resulting indicators plus item metadata are indexed by Solr into index shards; the web tier runs Solr searches using the recent history of a single user to produce recommendations.]

Note: All data lives in the cluster.

Page 22: Predictive Analytics with Hadoop


Recommendations

Alice got an apple and a puppy

Charles got a bicycle


Page 23: Predictive Analytics with Hadoop


Recommendations

Alice got an apple and a puppy

Charles got a bicycle

Bob got an apple


Page 24: Predictive Analytics with Hadoop


Recommendations

What else would Bob like?


Page 25: Predictive Analytics with Hadoop


Log Files

[Diagram: raw log entries, one per interaction, recording which user (Alice, Bob or Charles) touched which item.]

Page 26: Predictive Analytics with Hadoop


History Matrix

[Diagram: the logs collapsed into a users-by-items history matrix, one row per user (Alice, Bob, Charles), one column per item, with a ✔ where that user touched that item.]

Page 27: Predictive Analytics with Hadoop


Co-Occurrence Matrix: Items by Items

Q: How do you tell which co-occurrences are useful?

A: Let Mahout do the math…

[Diagram: the items-by-items co-occurrence matrix derived from the history matrix, e.g. a 2 where two users got both items, 1 for a single co-occurrence, 0 elsewhere.]
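
A toy version of that math in R, using the Alice/Bob/Charles history above (a minimal sketch, not Mahout itself):

  # Binary users-by-items history matrix from the walkthrough
  H <- matrix(c(1, 1, 0,    # Alice: apple, puppy
                1, 0, 0,    # Bob: apple
                0, 0, 1),   # Charles: bicycle
              nrow = 3, byrow = TRUE,
              dimnames = list(c("Alice", "Bob", "Charles"),
                              c("apple", "puppy", "bicycle")))
  C <- t(H) %*% H   # items-by-items: C[i, j] = number of users who got both i and j
  C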

Page 28: Predictive Analytics with Hadoop


Indicator Matrix: Anomalous Co-Occurrences

[Diagram: the co-occurrence matrix with the statistically anomalous cells marked ✔.]

Result: the marked row will be added to the indicator field in the item document…
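
Mahout flags “anomalous” co-occurrences with Ted Dunning’s log-likelihood ratio (G²) test; here is a compact R sketch of that statistic (the cutoff you apply to it is your choice):

  # k11 = users with both items, k12/k21 = one but not the other, k22 = neither
  llr <- function(k11, k12, k21, k22) {
    H <- function(k) {                 # sum(p * log p) over raw counts, with 0*log(0) := 0
      N <- sum(k)
      sum(ifelse(k == 0, 0, (k / N) * log(k / N)))
    }
    k <- c(k11, k12, k21, k22)
    2 * sum(k) * (H(k) - H(c(k11 + k12, k21 + k22)) - H(c(k11 + k21, k12 + k22)))
  }
  llr(10, 5, 5, 10000)   # large value: surprising co-occurrence, keep it as an indicator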

Page 29: Predictive Analytics with Hadoop


Indicator Matrix

id: t4
title: puppy
desc: The sweetest little puppy ever.
keywords: puppy, dog, pet
indicators: (t1)

That one row from the indicator matrix becomes the indicator field in the Solr document used to deploy the recommendation engine.

Note: Data for the indicator field is added directly to the metadata of a document in the Solr index. You don’t need to create a separate index for the indicators.

Page 30: Predictive Analytics with Hadoop


Internals of the Recommender Engine

Q: What should we recommend if new user listened to 2122:Fats Domino & 303:Beatles?

A: Search the index with “indicator_artists:2122 OR indicator_artists:303”
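
From R, that search could look roughly like this (hypothetical host, core and result handling; only the query string comes from the slide):

  library(httr)
  resp <- GET("http://localhost:8983/solr/items/select",   # assumed Solr URL and core
              query = list(q = "indicator_artists:2122 OR indicator_artists:303",
                           wt = "json", rows = 10))
  str(content(resp)$response$docs)   # ranked matches, most indicator hits first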

Page 31: Predictive Analytics with Hadoop


Internals of the Recommender Engine

1710:Chuck Berry is the top recommendation

Page 32: Predictive Analytics with Hadoop


Toolbox for Predictive Analytics with Hadoop

• Mahout
  – Use it for recommendations, clustering, math
• Vowpal Wabbit
  – Use it for classification (but it’s harder to use)
• SkyTree
  – Commercial (not open source) implementation of common algorithms

Page 33: Predictive Analytics with Hadoop


Operational Considerations

• Snapshot your raw data before training
  – Reproducible process
  – MapR snapshots
• Leverage NFS for real-time data ingestion
  – Train the model on today’s data
  – Learning schedule independent from ingestion schedule
• Look for HA, data protection, disaster recovery
  – Predictive analytics increases revenue or reduces cost
  – Quickly becomes a must-have, not a nice-to-have

Page 36: Predictive Analytics with Hadoop

Analyzing Big Data with Open Source R and Hadoop
Steven Sit
IBM

Page 37: Predictive Analytics with Hadoop

Analyzing Big Data with Open Source R and Hadoop

Steven Sit
IBM Silicon Valley Laboratory

Page 38: Predictive Analytics with Hadoop


Outline

• R on Hadoop
  • Rationale and requirements
  • Various approaches
• Using R and “Big R” packages
  • Data exploration, statistical analysis and visualization
  • Scale out R with data partitioning
  • Distributed machine learning
• Demo

Page 39: Predictive Analytics with Hadoop


Rationale and requirements

Productivity
• Use natural R syntax to access massive data in Hadoop
• Leverage existing packages and libraries

Platform for analytics
• Common input and output data formats across analytic algorithms

Customizability & extensibility
• Easily customize existing algorithms
• Support for both conventional data mining and large-scale ML needs

Scalability & performance
• Scale to massively parallel clusters
• Analyze terabytes and petabytes of data

Page 40: Predictive Analytics with Hadoop


Common Patterns for Hadoop and Analytics

[Architecture diagram: sources (streams, files, NoSQL databases, warehouses/marts/MDM) feed Hadoop via ETL tools, Sqoop, Flume, NFS and REST APIs. On top of Hadoop sit SQL engines, analytic runtimes, search indexes and models, consumed through business intelligence tools (ODBC/JDBC), analytical tools, exploration tools, and database tools and utilities (Pig, Java, DDL, etc.).]

Page 41: Predictive Analytics with Hadoop


Quick Tour of R

• R is an interpreted language
• Open-source implementation of the S language (1976)
• Best suited for statistical analysis and modeling
  • Data exploration and manipulation
  • Descriptive statistics
  • Predictive analytics and machine learning
  • Visualization
  • +++
• Can produce “publication quality graphics”
• Emerging as a competitor to proprietary platforms
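
A thirty-second taste of that workflow, on one of R’s built-in datasets:

  fit <- lm(dist ~ speed, data = cars)  # model stopping distance as a function of speed
  summary(fit)                          # coefficients, R^2, residual diagnostics
  plot(cars$speed, cars$dist)           # base graphics; add-on packages for publication quality
  abline(fit)                           # overlay the fitted regression line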

Page 42: Predictive Analytics with Hadoop


R is hot!... but

• Quite lean as far as software goes
• Free!
• State-of-the-art algorithms
  • Statistical researchers often provide their methods as R packages
  • New techniques available without delay
  • Commercial packages usually behind the curve
• 4700+ packages as of today
• Active and vibrant user community
• Universities are teaching R
• IT shops are adopting R
• Companies are integrating R into their products
• R jobs and demand for R skills on the rise

Unfortunately, R is not built for Big Data: working with large datasets is limited by RAM.

Page 43: Predictive Analytics with Hadoop


R and Big Data: Various approaches

• R + Hadoop streaming
  • Use R as a scripting language
  • Write map/reduce jobs in R, or use Python for the mappers and R for the reducers
• R + open-source packages for Hadoop
  • RHadoop
    • rmr2: write low-level map/reduce jobs with an API in R
    • rhbase and rhdfs: simple R APIs over Hadoop capabilities
  • RHIPE
• R + SparkR + Hadoop
  • R frontend to Spark’s in-memory structures and data from Hadoop
• R + high-level languages
  • Jaql <-> R “bridge”
• R interfaces over non-R engines
  • ScaleR from Revolution R
  • Oracle’s ORE, Netezza’s NZR, Teradata’s teradataR, +++

Page 44: Predictive Analytics with Hadoop


RHadoop Example

Given a table describing flight information for every airline, spanning several decades (“Carrier”, “Year”, “Month”, “Day”, “DepDelay”, …):

What’s the mean departure delay (“DepDelay”) for each airline for each month?
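
For reference, this is the single-machine R answer, assuming a data.frame called flights that fits in memory; the MapReduce version on the next slides exists precisely because it usually doesn’t:

  # flights is a hypothetical in-memory data.frame with the columns listed above
  aggregate(DepDelay ~ Carrier + Year + Month, data = flights, FUN = mean)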

Page 45: Predictive Analytics with Hadoop


RHadoop Example (contd.)

Input:   key = NULL, value = vector (columns in a line)
Map:     key = c(Carrier, Year, Month), value = DepDelay
Shuffle: key = c(Carrier, Year, Month), value = vector(DepDelay)
Reduce:  key = c(Carrier, Year, Month), value = mean(DepDelay)

Page 46: Predictive Analytics with Hadoop


RHadoop Example (contd.)

csvtextinputformat = function(line) keyval(NULL, unlist(strsplit(line, "\\,")))

deptdelay = function(input, output) {
  mapreduce(input = input,
            output = output,
            textinputformat = csvtextinputformat,
            map = function(k, fields) {
              # Skip header lines and bad records:
              if (!(identical(fields[[1]], "Year")) & length(fields) == 29) {
                deptDelay <- fields[[16]]
                # Skip records where departure delay is "NA":
                if (!(identical(deptDelay, "NA"))) {
                  # field[9] is carrier, field[1] is year, field[2] is month:
                  keyval(c(fields[[9]], fields[[1]], fields[[2]]), deptDelay)
                }
              }
            },
            reduce = function(keySplit, vv) {
              keyval(keySplit[[2]],
                     c(keySplit[[3]], length(vv), keySplit[[1]],
                       mean(as.numeric(vv))))
            })
}

from.dfs(deptdelay("/data/airline/1987.csv", "/dept-delay-month"))

Source: http://blog.revolutionanalytics.com/2012/03/r-and-hadoop-step-by-step-tutorials.html

Page 47: Predictive Analytics with Hadoop


What is “Big R”?

• Explore, visualize, transform, and model big data using familiar R syntax and paradigms
• Scale out R with MR programming
  • Partitioning of large data
  • Parallel cluster execution of R code
• Distributed machine learning
  • A scalable statistics engine that provides canned algorithms, and an ability to author new ones, all via R

[Architecture diagram: R clients with IBM R packages sit in front of Hadoop, which hosts embedded R execution and a scalable machine-learning engine over the data sources. Three usage patterns are numbered: (1) pull data and summaries to the R client, (2) push R functions right onto the data, (3) run distributed machine learning.]

Page 48: Predictive Analytics with Hadoop

Let’s mix it up a little ...

... with a demo interspersed with slides.

Page 49: Predictive Analytics with Hadoop


Big R Demo Data

• Publicly available “airline” data
  • 22 years of actual arrival/departure information
  • Every scheduled flight in the US
  • 1987 onwards
• From the U.S. Department of Transportation
• Cleansed version: http://stat-computing.org/dataexpo/2009/the-data.html

Page 50: Predictive Analytics with Hadoop


Airline data description

Year 1987-2008

Month 1-12

DayofMonth 1-31

DayOfWeek 1 (Monday) - 7 (Sunday)

DepTime actual departure time (local, hhmm)

CRSDepTime scheduled departure time (local, hhmm)

ArrTime actual arrival time (local, hhmm)

CRSArrTime scheduled arrival time (local, hhmm)

UniqueCarrier unique carrier code

FlightNum flight number

TailNum plane tail number

ActualElapsedTime in minutes

CRSElapsedTime in minutes

AirTime in minutes

ArrDelay arrival delay, in minutes

Page 51: Predictive Analytics with Hadoop

Airline data description (contd.)

Origin origin IATA airport code

Dest destination IATA airport code

Distance in miles

TaxiIn taxi in time, in minutes

TaxiOut taxi out time in minutes

Cancelled was the flight cancelled?

CancellationCode reason for cancellation (A = carrier, B = weather, C = NAS, D = security)

Diverted 1 = yes, 0 = no

CarrierDelay in minutes

WeatherDelay in minutes

NASDelay in minutes

SecurityDelay in minutes

LateAircraftDelay in minutes

Page 52: Predictive Analytics with Hadoop


Explore, visualize, transform, model Hadoop data with R

• Represent Big Data objects as R datatypes
• R’s programming syntax and paradigm
• Data stays in HDFS
  • R classes (e.g. bigr.data.frame) as proxies
• No low-level map/reduce constructs
• No underlying scripting languages

# (1) Construct a bigr.frame to access a large data set
air <- bigr.frame(dataPath = "airline_demo.csv", …)

# How many flights were flown by United or Delta?
length(UniqueCarrier[UniqueCarrier %in% c("UA", "DL")])

# Filter all flights that were delayed by 15+ minutes at departure or arrival
airSubset <- air[air$Cancelled == 0
                 & (air$DepDelay >= 15 | air$ArrDelay >= 15),
                 c("UniqueCarrier", "Origin", "Dest", "DepDelay", "ArrDelay")]

# For these filtered flights, compute key statistics (number of flights,
# average flying distance and flying time), grouped by airline
summary(count(UniqueCarrier) + mean(Distance) + mean(CRSElapsedTime)
        ~ UniqueCarrier, dataset = airSubset)

bigr.boxplot(air$Distance ~ air$UniqueCarrier, …)

Page 53: Predictive Analytics with Hadoop


Scale out R in Hadoop

• Support parallel / partitioned execution of R
  • Work around R’s memory limitations
• Execute R snippets on chunks of data
  • Partitioned by key, by #rows, via sampling, …
  • Follows R’s “apply” model
  • Parallelized seamlessly by the Map/Reduce engine

# (2) Filter the airline data on United and Hawaiian
bf <- air[air$UniqueCarrier %in% c("HA", "UA"), ]

# Build one decision-tree model per airline
models <- groupApply(data = bf,
                     groupingColumns = list(bf$UniqueCarrier),
                     rfunction = function(df) {
                       library(rpart)
                       predcols <- c('ArrDelay', 'DepDelay', 'DepTime', 'Distance')
                       return(rpart(ArrDelay ~ ., df[, predcols]))
                     })

# Pull the model for HA to the client
modelHA <- bigr.pull(models$HA)

# Visualize the model
prettyTree(modelHA)

Page 54: Predictive Analytics with Hadoop

Big R API: Core Classes & Methods

Connection handling
• bigr.connect() and bigr.disconnect()
• is.bigr.connected()

Core classes
• bigr.frame: modeled after R’s data.frame; a proxy for tabular data
• bigr.vector: modeled after R’s vector datatype; a proxy for a column
• bigr.list: (loosely) modeled after R’s list; a proxy for collections of serialized R objects

Methods
• Basic exploration: head(), tail(), dim(), length(), nrow(), ncol(), str(), summary()
• Selections and projections: [, $
• Arithmetic and logical operators: +, -, *, /, &, |, !, ifelse()
• String and math functions: lots of these
• Other relational operators: table(), unique(), merge(), summary(), sort(), na.omit(), na.exclude()
• Data movement: as.data.frame(), as.bigr.frame(), bigr.persist()
• Sampling: bigr.sample(), bigr.random()
• Visualizations: built into R packages (e.g. ggplot2)

Page 55: Predictive Analytics with Hadoop

Big R API: Core Classes & Methods (contd.)

• Summarization and aggregation
  • summary(), very powerful when used with “formula” notation
  • max(), min(), range()
• Mean, variance and standard deviation: mean(), var(), sd()
• Correlation: cov() and cor()
• Hadoop options: bigr.get.server.option(), bigr.set.server.option()
• bigr.debug(T)
  • Useful for servicing; will print out internal debugging output
• Catalog access: bigr.listfs(), bigr.listTables(), bigr.listColumns()
• groupApply()
  • Primary function for embedded execution
  • Can return tables or objects
  • Run “help(groupApply)” inside R for extensive documentation
• Examining R execution logs: bigr.logs()
• Other *Apply functions
  • rowApply(), for running R on batches of rows
  • tableApply(), for running R on the entire dataset
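
Stitching those calls into a plausible session; the connection arguments and the sampling fraction are assumptions, the function names are the ones listed above:

  bigr.connect(host = "bigdata.example.com", user = "bigr")  # argument names assumed
  air <- bigr.frame(dataPath = "airline.csv")    # proxy object; data stays in Hadoop
  dim(air); str(air); summary(air)               # basic exploration runs on the cluster
  samp <- as.data.frame(bigr.sample(air, 0.01))  # pull a small sample into local R
  bigr.disconnect()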

Page 56: Predictive Analytics with Hadoop


Example: What is the average scheduled flight time, actual gate-to-gate time, and actual airtime for each city pair per year?

mapper.year.market.enroute_time = function(key, val) {
  # Skip header lines, cancellations, and diversions:
  if ( !identical(as.character(val['Year']), 'Year')
       & identical(as.numeric(val['Cancelled']), 0)
       & identical(as.numeric(val['Diverted']), 0) ) {
    # We don't care about direction of travel, so construct 'market'
    # with airports ordered alphabetically
    # (e.g., LAX to JFK becomes 'JFK-LAX')
    if (val['Origin'] < val['Dest'])
      market = paste(val['Origin'], val['Dest'], sep='-')
    else
      market = paste(val['Dest'], val['Origin'], sep='-')
    # key consists of year, market
    output.key = c(val['Year'], market)
    # output gate-to-gate elapsed times (CRS and actual) + time in air
    output.val = c(val['CRSElapsedTime'], val['ActualElapsedTime'], val['AirTime'])
    return( keyval(output.key, output.val) )
  }
}

reducer.year.market.enroute_time = function(key, val.list) {
  # val.list is a list of row vectors
  # a data.frame is a list of column vectors
  # plyr's ldply() is the easiest way to convert IMHO
  if ( require(plyr) )
    val.df = ldply(val.list, as.numeric)
  else {
    # this is as close as my deficient *apply skills can come w/o plyr
    val.list = lapply(val.list, as.numeric)
    val.df = data.frame( do.call(rbind, val.list) )
  }
  colnames(val.df) = c('actual', 'crs', 'air')
  output.key = key
  output.val = c( nrow(val.df), mean(val.df$actual, na.rm=T),
                  mean(val.df$crs, na.rm=T), mean(val.df$air, na.rm=T) )
  return( keyval(output.key, output.val) )
}

mr.year.market.enroute_time = function(input, output) {
  mapreduce(input = input,
            output = output,
            input.format = asa.csvtextinputformat,
            map = mapper.year.market.enroute_time,
            reduce = reducer.year.market.enroute_time,
            backend.parameters = list(
              hadoop = list(D = "mapred.reduce.tasks=10") ),
            verbose = T)
}

hdfs.output.path = file.path(hdfs.output.root, 'enroute-time')
results = mr.year.market.enroute_time(hdfs.input.path, hdfs.output.path)
results.df = from.dfs(results, to.data.frame=T)
colnames(results.df) = c('year', 'market', 'flights', 'scheduled', 'actual', 'in.air')
save(results.df, file="out/enroute.time.RData")

RHadoop Implementation (>35 lines of code)

Equivalent Big R Implementation (4 lines of code)

air <- bigr.frame(dataPath = "airline.csv", dataSource = "DEL", na.string = "NA")
air$City1 <- ifelse(air$Origin < air$Dest, air$Origin, air$Dest)
air$City2 <- ifelse(air$Origin >= air$Dest, air$Origin, air$Dest)
summary(count(UniqueCarrier) + mean(ActualElapsedTime) + mean(CRSElapsedTime)
        + mean(AirTime) ~ Year + City1 + City2,
        dataset = air[air$Cancelled == 0 & air$Diverted == 0, ])

Page 57: Predictive Analytics with Hadoop


Use Cases

• Where Big R works well
  • When data can be partitioned cleanly ...
  • ... and each partition fits in the memory of a server node
  • In other words: the entire dataset can be bigger than cluster memory, but each individual partition is limited to node memory (due to R)
• Real-world customer scenarios we’ve seen:
  • Model data from individual tractors
  • Build time-series anomaly detection on IP address pairs
  • Build models on each customer’s behavior
• And where it doesn’t ...
  • When building one monolithic model on the entire data, without sampling and without using some form of blending such as ensembles

Page 58: Predictive Analytics with Hadoop


Large scale analytics in Hadoop

• Some workloads are not logically partitionable, or their partitions are still large
• SystemML, a scalable engine running natively over Hadoop:
  • Delivers pre-built ML algorithms (regression, classification, clustering, etc.)
  • Ability to author new algorithms

# (3) Build a model on the entire data set
model <- bigr.lm(ArrDelay ~ ., df)

# Or, build several models, partitioned by airline
models <- groupApply(input = air,
                     groupBy = air$UniqueCarrier,
                     function(df) {
                       # Linear regression
                       model <- bigr.lm(ArrDelay ~ ., df)
                       return(model)
                     })

Page 59: Predictive Analytics with Hadoop


Large scale analytics (SystemML) modules

Data Mining
• Association rules: Apriori and pattern growth for frequent itemsets; sequence miner
• Clustering: K-Means
• Dimension reduction: non-negative matrix factorizations; principal component analysis (large n, small p); singular value decompositions
• Time series analysis: Granger modeling

Predictive Analytics
• Regression: linear regression for large, sparse datasets; generalized linear models
• Classification: linear logistic regression (trust region method); linear SVMs (modified finite Newton method); random decision trees for classification & regression
• Ranking: PageRank of a directed graph; HITS (hubs and authorities)

Optimization
• Conjugate gradient for sparse linear systems
• Parallel optimization for sparse linear models
• Stochastic gradient descent

Outlier Detection
• Recursive binning and reprojection for distance-based outlier detection

Data Exploration
• Univariate (scale, nominal, ordinal variables)
  • Scale: min, max, range, mean, variance, moments (kurtosis, skewness), order statistics (median, quantile, IQM), outliers
  • Categorical: mode, histograms
• Bivariate
  • Scale/categorical: eta, F-test, grouped mean/variance/weight
  • Categorical/categorical: Cramér’s V, Pearson chi-square, Spearman

Recommender Systems
• Matrix completion algorithms

Meta Learning
• Ensemble learning; cross validation

Page 60: Predictive Analytics with Hadoop


Example - Recommendation Systems with Collaborative Filtering

[Diagram: a sparse people-by-movies ratings matrix, stored as (row, column, value) triples, is factored into W (people by K factors) and H (K factors by movies).]

Analyzing both similarity of people and products.

Page 61: Predictive Analytics with Hadoop


Example: Topic Evolution in Social Media with Clustering

[Diagram: analogously, a sparse documents-by-tokens matrix, stored as (row, column, value) triples, is factored into W (documents by K topics) and H (K topics by words).]

Page 62: Predictive Analytics with Hadoop


Example: Gaussian Non-negative Matrix Factorization

package gnmf;

import java.io.IOException;
import java.net.URISyntaxException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class MatrixGNMF
{
  public static void main(String[] args) throws IOException, URISyntaxException
  {
    if(args.length < 10)
    {
      System.out.println("missing parameters");
      System.out.println("expected parameters: [directory of v] [directory of w] [directory of h] "
          + "[k] [num mappers] [num reducers] [replication] [working directory] "
          + "[final directory of w] [final directory of h]");
      System.exit(1);
    }
    String vDir = args[0];
    String wDir = args[1];
    String hDir = args[2];
    int k = Integer.parseInt(args[3]);
    int numMappers = Integer.parseInt(args[4]);
    int numReducers = Integer.parseInt(args[5]);
    int replication = Integer.parseInt(args[6]);
    String outputDir = args[7];
    String wFinalDir = args[8];
    String hFinalDir = args[9];
    JobConf mainJob = new JobConf(MatrixGNMF.class);
    FileSystem.get(mainJob).delete(new Path(outputDir));
    // Each multiplicative update is a chain of MapReduce jobs:
    System.out.print("calculating X = WT * V... ");
    String workingDirectory = UpdateWHStep1.runJob(numMappers, numReducers,
        replication, UpdateWHStep1.UPDATE_TYPE_H, vDir, wDir, outputDir, k);
    String resultDirectoryX = UpdateWHStep2.runJob(numMappers, numReducers,
        replication, workingDirectory, outputDir);
    FileSystem.get(mainJob).delete(new Path(workingDirectory));
    System.out.print("calculating Y = WT * W * H... ");
    workingDirectory = UpdateWHStep3.runJob(numMappers, numReducers,
        replication, wDir, outputDir);
    String resultDirectoryY = UpdateWHStep4.runJob(numMappers, replication,
        workingDirectory, UpdateWHStep4.UPDATE_TYPE_H, hDir, outputDir);
    System.out.print("calculating H = H .* X ./ Y... ");
    workingDirectory = UpdateWHStep5.runJob(numMappers, numReducers, replication,
        hDir, resultDirectoryX, resultDirectoryY, hFinalDir, k);
    // ... the symmetric update of W (X = V * HT, Y = W * H * HT, W = W .* X ./ Y)
    // follows, and the mapper/reducer classes UpdateWHStep1 and UpdateWHStep2
    // shown on the original slide continue in the same style; the listing is
    // truncated here, as on the slide.
  }
}

Java Implementation (>1500 lines of code)

Equivalent Big R - SystemML Implementation (12 lines of code)

# Perform matrix operations, say non-negative factorization
# V ~~ W H
V <- bigr.matrix("V.mtx")                     # initial matrix on HDFS
W <- bigr.matrix(nrow = nrow(V), ncols = k)   # initialize starting points
H <- bigr.matrix(nrow = k, ncols = ncols(V))

for (i in 1:numiterations) {
  H <- H * (t(W) %*% V / t(W) %*% W %*% H)
  W <- W * (V %*% t(H) / W %*% H %*% t(H))
}
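
Those two update lines are the classic Lee & Seung multiplicative updates; you can sanity-check them in plain, in-memory R before scaling out (the matrix sizes here are illustrative):

  set.seed(42)
  V <- matrix(runif(20 * 30), nrow = 20)   # matrix to factor: V ~ W %*% H
  k <- 5
  W <- matrix(runif(20 * k), nrow = 20)
  H <- matrix(runif(k * 30), nrow = k)
  for (i in 1:200) {
    H <- H * (t(W) %*% V) / (t(W) %*% W %*% H)
    W <- W * (V %*% t(H)) / (W %*% H %*% t(H))
  }
  norm(V - W %*% H, "F")   # Frobenius reconstruction error, small after the updates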

Page 63: Predictive Analytics with Hadoop


Summary - Popular R Analytics on Hadoop

• R is the preferred language for Data Scientists
• Many approaches exist to enable R on Hadoop, each with pros and cons
• The “Big R” approach:
  • Data exploration, statistical analysis and visualization with natural R syntax
  • Scale out R with data partitioning
  • Support for standard R tools and existing packages and libraries
  • SystemML engine for distributed machine learning that provides canned algorithms, and an ability to author new ones, all via R-like syntax

[Diagram: R clients connect to an R runtime and a distributed ML runtime on Hadoop, over data sources such as Hive tables, HBase tables and files.]

Page 64: Predictive Analytics with Hadoop


www.ibm.com/software/data/infosphere/hadoop

Page 65: Predictive Analytics with Hadoop