Copyright ©2015 Treasure Data. All Rights Reserved.
Makoto YUI (@myui), Research Engineer, Treasure Data Inc.
2015/05/14, TD tech talk #3 @Retty
http://myui.github.io/
20 min. Introduction to Hivemall
Who am I?
• 2015/04: Joined Treasure Data, Inc.
• 1st Research Engineer in Treasure Data
• My mission in TD is developing ML-as-a-Service (MLaaS)
• 2010/04-2015/03: Senior Researcher at National Institute of Advanced Industrial Science and Technology, Japan
• Worked on a large-scale Machine Learning project and Parallel Databases
• 2009/03: Ph.D. in Computer Science from NAIST
• My research topic was building XML native database and Parallel Database systems
• Super programmer award from the MITOU Foundation (a government-funded program for finding young and talented programmers)
• Super creators in Treasure Data: Sada Furuhashi, Keisuke Nishida
Figures of Imported Data in Treasure Data
[Chart: imported records (unit: billion records) per month from Aug-12 to Oct-14, y-axis from 0 to 12,000. Milestones marked on the curve: Service in, Series A Funding, Reached 100 customers, Selected as "Cool Vendor in Big Data" by Gartner, 5 trillion records, 10 trillion records.]
Figures on Oct. 2014:
• 400 thousand records imported every second
• 10+ trillion records imported in total
• 12 billion records sent by an ad-tech company
The latest numbers in Treasure Data
• 100+ customers in Japan
• 15 trillion stored records
• 4,000 nodes: a single company sends data to us from 4,000 nodes
• 500,000 records stored per second
Plan of the Talk
1. Brief introduction to Hivemall
2. How to use Hivemall
3. Real-time prediction w/ Hivemall and RDBMS
What is Hivemall
Scalable machine learning library built on top of Apache Hive, licensed under the Apache License v2
Check http://github.com/myui/hivemall

[Diagram: software stack, top layer first]
• Machine Learning: Hivemall
• Query Processing: Hive / Pig
• Parallel Data Processing Framework: MapReduce (MRv1), Apache Tez (DAG processing), MR v2
• Resource Management: Apache YARN
• Distributed File System: Hadoop HDFS
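MapReduce and DAG engine
[Diagram: on the MapReduce side, chained jobs of map (M) and reduce (R) tasks read from and write to HDFS between every stage; on the DAG engine side (Tez/Spark), the same map and reduce stages are chained directly and touch HDFS only at the start and end.]
No intermediate DFS reads/writes!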
The key characteristic of Hivemall
Very easy to use; Machine Learning on SQL

Classification with Mahout: 100+ lines of code

CREATE TABLE lr_model AS
SELECT
  feature, -- reducers perform model averaging in parallel
  avg(weight) as weight
FROM (
  SELECT logress(features,label,..) as (feature,weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers

This SQL query automatically runs in parallel on Hadoop

✓ Machine Learning made easy for SQL developers (ML for the rest of us)
✓ APIs are very stable because of SQL abstraction
List of functions in Hivemall v0.3

• Classification (both binary- and multi-class)
  ✓ Perceptron
  ✓ Passive Aggressive (PA)
  ✓ Confidence Weighted (CW)
  ✓ Adaptive Regularization of Weight Vectors (AROW)
  ✓ Soft Confidence Weighted (SCW)
  ✓ AdaGrad+RDA
• Regression
  ✓ Logistic Regression (SGD)
  ✓ PA Regression
  ✓ AROW Regression
  ✓ AdaGrad
  ✓ AdaDELTA
• kNN and Recommendation
  ✓ Minhash and b-Bit Minhash (LSH variant)
  ✓ Similarity Search using K-NN
  ✓ Matrix Factorization
• Feature engineering
  ✓ Feature hashing
  ✓ Feature scaling (normalization, z-score)
  ✓ TF-IDF vectorizer

Treasure Data will support Hivemall v0.3.1 next week!
bit.ly/hivemall-mf
Hivemall on Apache Pig
• Contribution from Daniel Dai (Pig PMC) from Hortonworks
• To be supported from Pig 0.15
Plan of the Talk
1. Brief introduction to Hivemall
2. How to use Hivemall
3. Real-time prediction w/ Hivemall and RDBMS
How to use Hivemall
[Diagram: in Training, feature vectors and labels go through machine learning to produce a prediction model; in Prediction, the model maps a feature vector to a label. Highlighted step: Data preparation.]
How to use Hivemall - Data preparation
Define a Hive table for training/testing data

CREATE EXTERNAL TABLE e2006tfidf_train (
  rowid int,
  label float,
  features ARRAY<STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ","
STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';
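As a rough illustration of the file layout that this table definition expects, the sketch below parses one text row in Python. The tab and comma delimiters come from the table definition; the "index:value" shape of each feature string and the sample row are assumptions made for the example, not something shown on the slide.

```python
def parse_row(line):
    """Split one text row into (rowid, label, features).

    Fields are tab-separated and the features column is a comma-separated
    collection, matching the ROW FORMAT DELIMITED clause.
    """
    rowid, label, features = line.rstrip("\n").split("\t")
    parsed = []
    for item in features.split(","):
        # Assumption: each collection item looks like "index:value".
        index, value = item.split(":")
        parsed.append((index, float(value)))
    return int(rowid), float(label), parsed


# Hypothetical sample row, for illustration only.
print(parse_row("1\t-3.5\t10:0.25,42:0.5\n"))
```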
How to use Hivemall
[Diagram: in Training, feature vectors and labels go through machine learning to produce a prediction model; in Prediction, the model maps a feature vector to a label. Highlighted step: Feature Engineering.]
How to use Hivemall - Feature Engineering
Applying a Min-Max Feature Normalization

create view e2006tfidf_train_scaled as
select
  rowid,
  rescale(label,${min_label},${max_label}) as label,
  features
from e2006tfidf_train;

Transforming a label value to a value between 0.0 and 1.0
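Min-max normalization maps a value v to (v - min) / (max - min), so the smallest label becomes 0.0 and the largest becomes 1.0. The Python sketch below shows that arithmetic only; it is not Hivemall's implementation, the sample labels are made up, and raising an error when min equals max is a choice made for this example.

```python
def rescale(value, min_value, max_value):
    """Min-max normalization: map value from [min_value, max_value] to [0.0, 1.0]."""
    if max_value == min_value:
        # Choice made for this sketch; a library may handle this case differently.
        raise ValueError("min_value and max_value must differ")
    return (value - min_value) / (max_value - min_value)


# Hypothetical label values, for illustration only.
labels = [-8.0, -4.0, 0.0]
min_label, max_label = min(labels), max(labels)
print([rescale(v, min_label, max_label) for v in labels])
```

In the view, `${min_label}` and `${max_label}` play the role of `min_label` and `max_label` here, presumably computed beforehand over the training table.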
How to use Hivemall
[Diagram: in Training, feature vectors and labels go through machine learning to produce a prediction model; in Prediction, the model maps a feature vector to a label. Highlighted step: Training.]
How to use Hivemall - Training
Training by logistic regression

CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight
FROM (
  SELECT logress(features,label,..) as (feature,weight)
  FROM train
) t
GROUP BY feature

• The inner SELECT is a map-only task that learns a prediction model
• GROUP BY shuffles map outputs to reducers by feature
• Reducers perform model averaging in parallel
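The data flow of this query can be mimicked in a few lines of Python. This is a sketch of the averaging step only: each inner list stands for the (feature, weight) rows emitted by one map task, and the numbers are invented rather than produced by `logress`.

```python
from collections import defaultdict


def average_models(mapper_outputs):
    """Average per-feature weights across the outputs of several map tasks.

    Mirrors "SELECT feature, avg(weight) ... GROUP BY feature".
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for rows in mapper_outputs:
        for feature, weight in rows:
            sums[feature] += weight
            counts[feature] += 1
    return {feature: sums[feature] / counts[feature] for feature in sums}


# Hypothetical outputs of two map tasks, each trained on its own data split.
mapper_outputs = [
    [("a", 0.5), ("b", -1.0)],
    [("a", 1.5), ("c", 2.0)],
]
model = average_models(mapper_outputs)
print(model)
```

A feature that only one map task has seen keeps that task's weight, because the average runs over the rows that exist for the feature.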
How to use Hivemall - Training
Training for the Confidence Weighted (CW) classifier

CREATE TABLE news20b_cw_model1 AS
SELECT
  feature,
  voted_avg(weight) as weight
FROM (
  SELECT
    train_cw(features,label) as (feature,weight)
  FROM
    news20b_train
) t
GROUP BY feature

voted_avg: vote to use negative or positive weights for avg
Example weights: +0.7, +0.3, +0.2, -0.1, +0.7
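One way to read the voting rule is: count the positive and the negative weights, keep whichever sign is in the majority, and average only those. The Python sketch below follows that reading. How ties and zero weights are treated is an assumption of this sketch and may differ from Hivemall's `voted_avg`.

```python
def voted_avg(weights):
    """Average only the weights whose sign wins a majority vote."""
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    # Assumption: a tie goes to the negative side and zero weights are ignored.
    winners = positives if len(positives) > len(negatives) else negatives
    if not winners:
        return 0.0
    return sum(winners) / len(winners)


# The weights from the slide: four positive votes against one negative.
result = voted_avg([0.7, 0.3, 0.2, -0.1, 0.7])
print(result)
```

For the slide's example the positive side wins, so the result is the mean of 0.7, 0.3, 0.2 and 0.7, about 0.475, and the single negative weight is discarded.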
Ensemble learning for stable prediction performance
Just stack prediction models by union all

create table news20mc_ensemble_model1 as
select
  label,
  cast(feature as int) as feature,
  cast(voted_avg(weight) as float) as weight
from (
  select
    train_multiclass_cw(addBias(features),label) as (label,feature,weight)
  from news20mc_train_x3
  union all
  select
    train_multiclass_arow(addBias(features),label) as (label,feature,weight)
  from news20mc_train_x3
  union all
  select
    train_multiclass_scw(addBias(features),label) as (label,feature,weight)
  from news20mc_train_x3
) t
group by label, feature;
How to use Hivemall
[Diagram: in Training, feature vectors and labels go through machine learning to produce a prediction model; in Prediction, the model maps a feature vector to a label. Highlighted step: Prediction.]
How to use Hivemall - Prediction

CREATE TABLE lr_predict as
SELECT
  t.rowid,
  sigmoid(sum(m.weight)) as prob
FROM
  testing_exploded t LEFT OUTER JOIN
  lr_model m ON (t.feature = m.feature)
GROUP BY t.rowid

• Prediction is done by LEFT OUTER JOIN between test data and prediction model
• No need to load the entire model into memory
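The join-based prediction can be sketched in Python as a per-feature lookup followed by a sum and a sigmoid. This illustrates the query's logic rather than how Hive executes it: the model is a plain dict here, and the test rows and weights are invented.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(testing_exploded, model):
    """testing_exploded: iterable of (rowid, feature); model: dict of feature -> weight."""
    totals = {}
    for rowid, feature in testing_exploded:
        # LEFT OUTER JOIN: a feature missing from the model adds nothing to the sum.
        # (In SQL a row with no matches at all would yield NULL; this sketch uses 0.0.)
        totals[rowid] = totals.get(rowid, 0.0) + model.get(feature, 0.0)
    return {rowid: sigmoid(total) for rowid, total in totals.items()}


# Hypothetical model and exploded test rows (one row per rowid/feature pair).
model = {"a": 1.0, "b": -1.0}
rows = [(1, "a"), (1, "b"), (2, "a"), (2, "unseen")]
probs = predict(rows, model)
print(probs)
```

Each test row only touches the weights of its own features, which is why the whole model never has to sit in the memory of a single task.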
Plan of the Talk
1. Brief introduction to Hivemall
2. How to use Hivemall
3. Real-time prediction w/ Hivemall and RDBMS
Type/Purpose Matrix of Machine Learning

|                    | Online Learning                                      | Offline Learning                                              |
| Online Prediction  | Algorithmic trading (HFT); Twitter real-time analysis | Ad-tech (e.g., CTR/CVR prediction); real-time recommendation |
| Offline Prediction | No/few needs?                                        | Daily/weekly batch systems; business analytics/reporting      |
How to use Hivemall
[Diagram: the same training and prediction pipeline split across two systems: batch training on Hadoop, online prediction on an RDBMS. Highlighted step: Export prediction models.]
Export Prediction Model to an RDBMS

hive> desc news20b_cw_model1;
feature int
weight double

Sample rows (feature, weight):
103 -0.4896543622016907
104 -0.0955817922949791
105 0.12560302019119263
106 0.09214721620082855

TD export to any RDBMS: periodic export is very easy in Treasure Data
Real-time Prediction on MySQL

#2 Preparing a Test data table

hive> desc testing_exploded;
feature string
value float

#3 Online prediction on MySQL

SIGMOID(x) = 1.0 / (1.0 + exp(-x))

SELECT sigmoid(sum(t.value * m.weight)) as prob
FROM
  testing_exploded t LEFT OUTER JOIN
  prediction_model m ON (t.feature = m.feature)

• Alternatively, you can define the testing target as a SQL view
• Index lookups are very efficient in RDBMSs

[Diagram: a feature vector is matched against the prediction model to produce a label]
http://bit.ly/hivemall-rtp
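To try the online-prediction query without a MySQL server, the sketch below runs the same SELECT against an in-memory SQLite database from Python. SQLite is only a stand-in for MySQL. The table and column names follow the slides, while the TEXT type for both feature columns, the registered `sigmoid` function, and the test rows are assumptions of this sketch.

```python
import math
import sqlite3


def sigmoid(x):
    # sum() yields NULL when no feature matches; pass NULL through.
    if x is None:
        return None
    return 1.0 / (1.0 + math.exp(-x))


conn = sqlite3.connect(":memory:")
conn.create_function("sigmoid", 1, sigmoid)
conn.executescript("""
CREATE TABLE prediction_model (feature TEXT PRIMARY KEY, weight REAL);
CREATE TABLE testing_exploded (feature TEXT, value REAL);
INSERT INTO prediction_model VALUES
  ('103', -0.4896543622016907),
  ('105', 0.12560302019119263);
-- Hypothetical test input; feature '999' is not in the model.
INSERT INTO testing_exploded VALUES ('103', 1.0), ('105', 2.0), ('999', 1.0);
""")

(prob,) = conn.execute("""
SELECT sigmoid(sum(t.value * m.weight)) AS prob
FROM testing_exploded t
LEFT OUTER JOIN prediction_model m ON (t.feature = m.feature)
""").fetchone()
print(prob)
```

The primary key on `prediction_model.feature` gives the database an index to probe for each test feature, which is the access pattern the slide calls efficient.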
Cost of Amazon Machine Learning
Amazon-ML is suspected to be based on Vowpal Wabbit (single process)

• Data Analysis and Model Building Fees: $0.42/Instance per Hour
• Batch Prediction: $0.1/1000 requests
• Real-time Prediction: $0.0001 per request

Pay-per-request is apparently not suitable for doing prediction for each web request (e.g., online CTR prediction)
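A quick back-of-the-envelope calculation shows why per-request pricing adds up. The price is the real-time figure quoted above; the traffic level of 1,000 requests per second is a hypothetical number chosen for the example.

```python
PRICE_PER_REALTIME_REQUEST = 0.0001  # USD per request, as quoted on the slide
SECONDS_PER_DAY = 60 * 60 * 24


def realtime_cost_per_day(requests_per_second):
    """Daily cost in USD when every web request triggers one real-time prediction."""
    return requests_per_second * SECONDS_PER_DAY * PRICE_PER_REALTIME_REQUEST


# Hypothetical traffic level, for illustration only.
daily_cost = realtime_cost_per_day(1000)
print(round(daily_cost, 2))
```

At that rate the bill is roughly $8,640 per day, since 86.4 million requests are each charged $0.0001.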
Real-time Prediction on Treasure Data
[Diagram: run batch training job periodically → periodic export → real-time prediction on an RDBMS]
Beyond Query-as-a-Service!
We ❤️ Open-source! We invented ..
We are Hiring!