Hivemall dbtechshowcase 20160713 #dbts2016

Machine Learning Made Easy by using Hivemall

Research Engineer Makoto YUI @myui

<myui@treasure-data.com>

bit.ly/hivemall

2016/07/13 DB tech showcase

Who am I?

➢ 2015/04 Joined Treasure Data, Inc.
  ➢ 1st Research Engineer in Treasure Data
  ➢ My mission in TD is developing ML-as-a-Service (MLaaS)
➢ 2010/04-2015/03 Senior Researcher at National Institute of Advanced Industrial Science and Technology, Japan.
  ➢ Worked on a large-scale Machine Learning project and Parallel Databases
➢ 2009/03 Ph.D. in Computer Science from NAIST
  ➢ XML native database and Parallel Database systems

Treasure Data Cloud Service

[Figure: architecture diagram. Labels: Server, App log, Sensor, Apache log; Data Collectors (Treasure Agent, Embedded, Embulk, Mobile SDK, JS SDK); Hive (Batch), Presto (Adhoc), Machine Learning; ODBC/JDBC, External Integrations; BI tools, Data analysis.]

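Data Imported to Treasure Data

900,000 records stored per sec.

[Figure: growth chart of imported records (unit: billion records). Milestones: service launch; Series A funding; 100 customer companies; selected by Gartner as a "Cool Vendor in Big Data"; 5 trillion records; 10 trillion records.]

Treasure Data in numbers (October 2014):
• 400,000 records imported per second
• More than 10 trillion records imported in total
• 12 billion records sent every day by a single customer in the ad-tech industry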

Agenda

1. What is Hivemall (short intro.)

2. Why Hivemall (motivations etc.)

3. How to use Hivemall

What is Hivemall

Scalable machine learning library built as a collection of Hive UDFs, licensed under the Apache License v2

[Figure: software stack, top to bottom. Machine Learning: Hivemall. Query Processing: Hive, Pig, SparkSQL. Parallel Data Processing Framework: MapReduce (MRv1), Apache Tez (DAG processing), Apache Spark. Resource Management: Apache YARN. Distributed File System: Hadoop HDFS.]

Won IDG's InfoWorld 2014

Bossie Awards 2014: The best open source big data tools

InfoWorld's top picks in distributed data processing, data analytics, machine learning, NoSQL databases, and the Hadoop ecosystem (awarded along w/ Spark, Tez, Jupyter notebook, Pandas, Impala, Kafka)

bit.ly/hivemall-award

List of supported Algorithms

Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1, PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA
✓ Factorization Machines
✓ RandomForest Classification

Regression
✓ Logistic Regression (SGD)
✓ PA Regression
✓ AROW Regression
✓ AdaGrad (logistic loss)
✓ AdaDELTA (logistic loss)
✓ Factorization Machines
✓ RandomForest Regression


SCW is a good first choice. Try RandomForest if SCW does not work.

Logistic regression is good for getting the probability of the positive class.

Factorization Machines work well when features are sparse and categorical.

List of Algorithms for Recommendation

K-Nearest Neighbor
✓ Minhash and b-Bit Minhash (LSH variant)
✓ Similarity Search on Vector Space (Euclid/Cosine/Jaccard/Angular)

Matrix Completion
✓ Matrix Factorization
✓ Factorization Machines (regression)

The each_top_k function of Hivemall is useful for recommending top-k items
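To make the top-k idea concrete, here is a minimal plain-Python sketch of selecting the k best-scored items per user. It only illustrates what a per-group top-k computes; it is not Hivemall's each_top_k signature or implementation, and the function name `top_k_per_group`, the (group, item, score) row layout, and the sample scores are all hypothetical.

```python
import heapq
from collections import defaultdict


def top_k_per_group(rows, k):
    """Return the k highest-scoring items for each group.

    rows: iterable of (group, item, score) tuples (hypothetical layout).
    """
    by_group = defaultdict(list)
    for group, item, score in rows:
        by_group[group].append((score, item))

    result = {}
    for group, scored in by_group.items():
        # nlargest returns the entries sorted by score, highest first.
        best = heapq.nlargest(k, scored)
        result[group] = [item for _score, item in best]
    return result


# Made-up user/item scores, for example predicted ratings.
rows = [
    ("user1", "itemA", 0.9),
    ("user1", "itemB", 0.4),
    ("user1", "itemC", 0.7),
    ("user2", "itemA", 0.2),
    ("user2", "itemD", 0.8),
]
recommendations = top_k_per_group(rows, k=2)
```

In a SQL engine the same result is usually produced per group without collecting all rows of a group in memory at once, which is the practical reason to prefer a dedicated top-k function over a full sort.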

Other Supported Algorithms

Anomaly Detection
✓ Local Outlier Factor (LoF)

Feature Engineering
✓ Feature Hashing
✓ Feature Scaling (normalization, z-score)
✓ TF-IDF vectorizer
✓ Polynomial Expansion (Feature Pairing)
✓ Amplifier

NLP
✓ Basic English text Tokenizer
✓ Japanese Tokenizer (Kuromoji)

Industry use cases of Hivemall

Ø CTR prediction of Ad click logs
• Freakout Inc., Fan communication, and more
• Replaced Spark MLlib w/ Hivemall at company X

http://www.slideshare.net/masakazusano75/sano-hmm-2015051212

Ø Gender prediction of Ad click logs
• Scaleout Inc. and Fan Communications

http://eventdots.jp/eventreport/458208

Ø Value prediction of Real estates
• Livesense

http://www.slideshare.net/y-ken/real-estate-tech-with-hivemall

Source: http://itnp.net/article/2016/02/18/2286.html

Ø Churn Detection
• OISIX

http://www.slideshare.net/TaisukeFukawa/hivemall-meetup-vol2-oisix

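Churn prediction for a membership service

• Recurring purchases by 100,000 members drive the company's overall revenue and profit, but there was no way to identify members at risk of churning in advance or to prevent them from leaving
• Machine learning without statistical expertise
• Granting points to members on the churn prediction list halved the churn rate
• Surfaced the campaigns and events that carry churn risk, and also made it possible to identify behavior characteristic of members who do not churn
• Also enabled indirect service improvements, such as changing the UI according to the level of risk
• Used machine learning to build a list of customers likely to churn in the coming month, based on data from the past month
• Specifically, each step (create the training table -> normalize -> build the training model -> logistic regression) was implemented simply as queries using TD + Hivemall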

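[Figure: churn prediction workflow. Data used for prediction: attribute information, behavior logs, complaint information, acquisition source, service usage information (including Mobile). Direct measures: granting points, care calls. Indirect measures: guiding members toward successful experiences, UI changes.]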

Ø Recommendation
• Portal site


Why Hivemall

1. In my experience working on ML, I used Hive for preprocessing and Python (scikit-learn etc.) for ML. This was INEFFICIENT and ANNOYING. Also, Python is not as scalable as Hive.

2. Why not run ML algorithms inside Hive? Fewer components to manage and more scalable.

That's why I built Hivemall.

How I used to do ML projects before Hivemall

Given raw data stored on Hadoop HDFS

[Figure: pipeline. Raw data (HDFS, S3) goes through Extract-Transform-Load to become feature vectors (height: 173cm, weight: 60kg, age: 34, gender: man, …), which are passed to Machine Learning.]

Need to do expensive data preprocessing

(Joins, Filtering, and Formatting of Data that does not fit in memory)

Machine Learning

• Does not scale
• Have to learn R/Python APIs
• Does not meet my needs in terms of scalability, ML algorithms, and usability

I ❤ scalable SQL query

Survey on existing ML frameworks

Framework                             User interface
Mahout                                Java API programming
Spark MLlib/MLI                       Scala API programming, Scala shell (REPL)
H2O                                   R programming, GUI
Cloudera Oryx                         HTTP REST API programming
Vowpal Wabbit (w/ Hadoop streaming)   C++ API programming, command line

Existing distributed machine learning frameworks are NOT easy to use

Hivemall's Vision: ML on SQL

Classification with Mahout

[Figure: Mahout code listing, shown for comparison with the Hivemall query.]

CREATE TABLE lr_model AS
SELECT
  feature, -- reducers perform model averaging in parallel
  avg(weight) as weight
FROM (
  SELECT logress(features, label, ..) as (feature, weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers

✓ Machine Learning made easy for SQL developers (ML for the rest of us)

✓ Interactive and Stable APIs w/ SQL abstraction

This SQL query automatically runs in parallel on Hadoop

Hivemall on Apache Spark

Installation is very easy as follows:

$ spark-shell --packages maropu:hivemall-spark:0.0.6


How to use Hivemall

[Figure: workflow diagram. Labels: Data preparation, Feature Vector, Label, Machine Learning, Training, Prediction Model, Prediction.]

How to use Hivemall - Data preparation

Define a Hive table for training/testing data

CREATE EXTERNAL TABLE e2006tfidf_train (
  rowid int,
  label float,
  features ARRAY<STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ","
STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';


How to use Hivemall - Feature Engineering

Applying a Min-Max Feature Normalization

Transforming a label value to a value between 0.0 and 1.0

create view e2006tfidf_train_scaled
as
select
  rowid,
  rescale(label, ${min_label}, ${max_label}) as label,
  features
from
  e2006tfidf_train;
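The arithmetic behind min-max normalization can be sketched in a few lines of plain Python. This is an illustration of the formula (value - min) / (max - min) only, not Hivemall's rescale implementation; the sample labels are made up, and raising an error on a zero range is a choice made for this sketch.

```python
def rescale(value, min_value, max_value):
    """Min-max normalization: map [min_value, max_value] onto [0.0, 1.0]."""
    if max_value == min_value:
        # Assumption for this sketch: a zero range is treated as an error.
        raise ValueError("min_value and max_value must differ")
    return (value - min_value) / (max_value - min_value)


# Made-up label values; min and max play the role of
# ${min_label} and ${max_label} in the query above.
labels = [2.0, 4.0, 10.0]
min_label, max_label = min(labels), max(labels)
scaled = [rescale(v, min_label, max_label) for v in labels]
```

The minimum and maximum have to be computed over the training data first, which is why the query takes them as substituted variables.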


How to use Hivemall - Training

Training by logistic regression

CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight
FROM (
  SELECT logress(features, label, ..) as (feature, weight)
  FROM train
) t
GROUP BY feature

• Map-only task to learn a prediction model
• Shuffle map outputs to reducers by feature
• Reducers perform model averaging in parallel
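The model-averaging step can be pictured with a small plain-Python sketch: each map task emits its own (feature, weight) pairs, and the pairs are grouped by feature and averaged. This mirrors what GROUP BY feature with avg(weight) computes; it is not Hivemall code, and the function name `average_models` and the sample weights are hypothetical.

```python
from collections import defaultdict


def average_models(partial_models):
    """Average per-feature weights across independently trained models.

    partial_models: list of dicts mapping feature -> weight,
    one dict per map task (hypothetical layout).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for model in partial_models:
        for feature, weight in model.items():
            sums[feature] += weight
            counts[feature] += 1
    # Like SQL avg(), a feature is averaged only over the tasks that emitted it.
    return {feature: sums[feature] / counts[feature] for feature in sums}


# Made-up outputs of three map tasks.
mapper_outputs = [
    {"f1": 0.5, "f2": -1.0},
    {"f1": 1.5, "f3": 2.0},
    {"f1": 1.0, "f2": -3.0},
]
model = average_models(mapper_outputs)
```

Because each feature is averaged independently, the work splits cleanly across reducers keyed by feature.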

How to use Hivemall - Training

Training of Confidence Weighted Classifier

CREATE TABLE news20b_cw_model1 AS
SELECT
  feature,
  voted_avg(weight) as weight
FROM (
  SELECT train_cw(features, label) as (feature, weight)
  FROM news20b_train
) t
GROUP BY feature

• Training for the CW classifier
• Vote to use negative or positive weights for avg: +0.7, +0.3, +0.2, -0.1, +0.7
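As a rough illustration of the voting idea on this slide, the sketch below takes a majority vote on the sign of the weights and then averages only the weights on the winning side. It is written from the slide's description, not from Hivemall's source; resolving ties toward the positive side and returning 0.0 when no weight is non-zero are assumptions made for this sketch.

```python
def voted_avg(weights):
    """Average only the weights whose sign wins a majority vote."""
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    # Assumption: ties go to the positive side.
    winners = positives if len(positives) >= len(negatives) else negatives
    if not winners:
        # Assumption: no non-zero weights means a zero weight.
        return 0.0
    return sum(winners) / len(winners)


# The weights shown on the slide: four positive votes, one negative vote.
slide_weights = [0.7, 0.3, 0.2, -0.1, 0.7]
weight = voted_avg(slide_weights)
```

For the slide's example the positive side wins, so the result is the mean of the four positive weights (about 0.475) and the -0.1 outlier does not pull the average down.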


How to use Hivemall - Prediction

CREATE TABLE lr_predict AS
SELECT
  t.rowid,
  sigmoid(sum(m.weight)) as prob
FROM
  testing_exploded t
  LEFT OUTER JOIN lr_model m ON (t.feature = m.feature)
GROUP BY t.rowid

• Prediction is done by LEFT OUTER JOIN between test data and prediction model
• No need to load the entire model into memory
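The join-then-aggregate logic of the prediction query can be traced in plain Python. This is a sketch of the query's semantics under simplifying assumptions, not how Hive executes it: the model is a small in-memory dict here purely for illustration, and the rows and weights are made up. Features missing from the model behave like SQL NULLs, which sum() skips.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(test_rows, model):
    """test_rows: (rowid, feature) pairs, one per feature of each test row.
    model: dict mapping feature -> weight.
    """
    totals = {}
    for rowid, feature in test_rows:
        totals.setdefault(rowid, None)
        weight = model.get(feature)  # None plays the role of SQL NULL
        if weight is not None:
            totals[rowid] = (totals[rowid] or 0.0) + weight
    # A row with no matching feature keeps a NULL (None) probability.
    return {
        rowid: None if total is None else sigmoid(total)
        for rowid, total in totals.items()
    }


# Made-up model and exploded test rows.
model = {"a": 2.0, "b": -2.0, "c": 0.5}
test_rows = [(1, "a"), (1, "b"), (2, "c"), (2, "unseen"), (3, "unseen")]
probs = predict(test_rows, model)
```

The LEFT OUTER JOIN matters because it keeps test rows whose features never appeared in training; an inner join would silently drop row 3.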

Real-time prediction

[Figure: feature vectors feed batch training on Hadoop; prediction models are exported for online prediction on an RDBMS.]

bit.ly/hivemall-rtp

Export Prediction Model to an RDBMS

• Export target: any RDBMS
• TD export: periodical export is very easy in Treasure Data

Prediction Model

feature   weight
103       -0.4896543622016907
104       -0.0955817922949791
105       0.12560302019119263
106       0.09214721620082855

Real-time Prediction on MySQL

SIGMOID(x) = 1.0 / (1.0 + exp(-x))

Online prediction on MySQL

Feature Vector

SELECT
  sigmoid(sum(t.value * m.weight)) as prob
FROM
  testing_exploded t
  LEFT OUTER JOIN prediction_model m ON (t.feature = m.feature)

Index lookups are very efficient in RDBMSs
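The online-prediction query can be tried end to end with Python's built-in sqlite3 module. SQLite stands in for MySQL here purely so the example is self-contained, and sigmoid is registered as a user-defined function because SQLite does not ship one. The model weights are the four rows shown on the export slide; the table schemas and the request's feature values (including the unseen feature 999) are assumptions for illustration.

```python
import math
import sqlite3


def sigmoid(x):
    # SQL NULL arrives as None; keep it NULL instead of failing.
    if x is None:
        return None
    return 1.0 / (1.0 + math.exp(-x))


conn = sqlite3.connect(":memory:")
conn.create_function("sigmoid", 1, sigmoid)

# The primary key on feature gives the join an index to look up.
conn.executescript("""
    CREATE TABLE prediction_model (feature INTEGER PRIMARY KEY, weight REAL);
    CREATE TABLE testing_exploded (feature INTEGER, value REAL);
""")
conn.executemany(
    "INSERT INTO prediction_model VALUES (?, ?)",
    [
        (103, -0.4896543622016907),
        (104, -0.0955817922949791),
        (105, 0.12560302019119263),
        (106, 0.09214721620082855),
    ],
)
# Hypothetical feature vector of one incoming request.
conn.executemany(
    "INSERT INTO testing_exploded VALUES (?, ?)",
    [(103, 1.0), (105, 2.0), (999, 1.0)],
)

prob = conn.execute("""
    SELECT sigmoid(sum(t.value * m.weight)) AS prob
    FROM testing_exploded t
    LEFT OUTER JOIN prediction_model m ON (t.feature = m.feature)
""").fetchone()[0]
```

Only the features present in the request are looked up, so the cost of a prediction depends on the size of the feature vector rather than the size of the model table.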

RandomForest in Hivemall

Ensemble of Decision Trees

Training of RandomForest

Prediction of RandomForest

https://console.treasuredata.com/jobs/75633717

Conclusion

Hivemall provides a collection of machine learning algorithms as Hive UDFs/UDTFs

Ø For SQL users that need ML
Ø For those already using Hive
Ø Ease of use and scalability in mind

Does not require coding, packaging, compiling, or introducing a new programming language or APIs.

Hivemall's Positioning

Treasure Data provides ML-as-a-Service using the latest version of Hivemall

We support machine learning in Cloud

Any feature request? Or, questions?
