Recommendation 101 using Hivemall Research Engineer Makoto YUI @myui <[email protected]> 1


Page 1: Recommendation 101 using Hivemall

Recommendation 101 using Hivemall

Research Engineer Makoto YUI @myui

<[email protected]>

1

Page 2: Recommendation 101 using Hivemall

Agenda

1. Introduction to Hivemall
2. Recommendation 101
3. Matrix Factorization
4. Bayesian Personalized Ranking

2

Page 3: Recommendation 101 using Hivemall

What is Hivemall

Scalable machine learning library built as a collection of Hive UDFs, licensed under the Apache License v2

3

https://github.com/myui/hivemall

Page 4: Recommendation 101 using Hivemall

Hivemall’s Vision: ML on SQL

Classification with Hivemall

CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight -- reducers perform model averaging in parallel
FROM (
  SELECT logress(features, label, ..) as (feature, weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers

✓ Machine Learning made easy for SQL developers (ML for the rest of us)
✓ Interactive and Stable APIs w/ SQL abstraction

This SQL query automatically runs in parallel on Hadoop

4

Page 5: Recommendation 101 using Hivemall

How to use Hivemall

[Workflow diagram "Machine Learning": feature vectors and labels feed Training to build a prediction model; the model plus new feature vectors feed Prediction to output labels. This slide highlights the Data preparation step.]

5

Page 6: Recommendation 101 using Hivemall

CREATE EXTERNAL TABLE e2006tfidf_train (
  rowid int,
  label float,
  features ARRAY<STRING>
) ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';

How to use Hivemall - Data preparation

Define a Hive table for training/testing data

6

Page 7: Recommendation 101 using Hivemall

How to use Hivemall

[Workflow diagram "Machine Learning"; this slide highlights the Feature Engineering step.]

7

Page 8: Recommendation 101 using Hivemall

create view e2006tfidf_train_scaled
as
select
  rowid,
  rescale(target, ${min_label}, ${max_label}) as label,
  features
from
  e2006tfidf_train;

Applying a Min-Max Feature Normalization

How to use Hivemall - Feature Engineering

Transforming a label value to a value between 0.0 and 1.0

8
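The rescale() call above is plain min-max normalization. A minimal Python sketch of the same formula (the function name mirrors the UDF for readability; this is not Hivemall's implementation):

```python
def rescale(value, min_label, max_label):
    # Min-max normalization: maps [min_label, max_label] onto [0.0, 1.0]
    return (value - min_label) / (max_label - min_label)

# A rating of 3 on a 1..5 scale becomes 0.5
print(rescale(3.0, 1.0, 5.0))  # -> 0.5
```

The `${min_label}` / `${max_label}` variables in the query play the same role as the two arguments here.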

Page 9: Recommendation 101 using Hivemall

How to use Hivemall

[Workflow diagram "Machine Learning"; this slide highlights the Training step.]

9

Page 10: Recommendation 101 using Hivemall

How to use Hivemall - Training

CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight
FROM (
  SELECT logress(features, label, ..) as (feature, weight)
  FROM train
) t
GROUP BY feature;

Training by logistic regression

map-only task to learn a prediction model

Shuffle map-outputs to reducers by feature

Reducers perform model averaging in parallel

10
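The three steps above (map-only training, shuffle by feature, reducer-side averaging) amount to per-feature model averaging, which is what avg(weight) ... GROUP BY feature expresses. A toy Python sketch of that dataflow with made-up feature names and weights (this illustrates the aggregation only, not the logress learner):

```python
from collections import defaultdict

# Each "mapper" emits (feature, weight) pairs from the model it learned locally
mapper_outputs = [
    [("f1", 0.4), ("f2", -0.2)],               # model from split 1
    [("f1", 0.6), ("f2", -0.4), ("f3", 1.0)],  # model from split 2
]

# Shuffle: group weights by feature (GROUP BY feature)
grouped = defaultdict(list)
for output in mapper_outputs:
    for feature, weight in output:
        grouped[feature].append(weight)

# Reduce: model averaging in parallel (avg(weight) as weight)
model = {f: sum(ws) / len(ws) for f, ws in grouped.items()}
print(model["f1"])  # -> 0.5
```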

Page 11: Recommendation 101 using Hivemall

How to use Hivemall - Training

CREATE TABLE news20b_cw_model1 AS
SELECT
  feature,
  voted_avg(weight) as weight
FROM (
  SELECT train_cw(features, label) as (feature, weight)
  FROM news20b_train
) t
GROUP BY feature;

Training of Confidence Weighted Classifier

Vote to use negative or positive weights for avg

+0.7, +0.3, +0.2, -0.1, +0.7

Training for the CW classifier

11
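One plausible reading of voted_avg, sketched in Python (the exact semantics here are an assumption for illustration, not taken from Hivemall's source): the weights first vote on a sign, and only weights carrying the winning sign are averaged.

```python
def voted_avg(weights):
    # Assumed semantics: the majority sign wins the vote, then only
    # the weights carrying that sign contribute to the average.
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    winners = positives if len(positives) >= len(negatives) else negatives
    return sum(winners) / len(winners)

# For the slide's example the positive side wins the vote 4 to 1,
# so only the four positive weights are averaged.
print(voted_avg([0.7, 0.3, 0.2, -0.1, 0.7]))
```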

Page 12: Recommendation 101 using Hivemall

How to use Hivemall

[Workflow diagram "Machine Learning"; this slide highlights the Prediction step.]

12

Page 13: Recommendation 101 using Hivemall

How to use Hivemall - Prediction

CREATE TABLE lr_predict AS
SELECT
  t.rowid,
  sigmoid(sum(m.weight)) as prob
FROM
  testing_exploded t
  LEFT OUTER JOIN lr_model m ON (t.feature = m.feature)
GROUP BY
  t.rowid;

Prediction is done by LEFT OUTER JOIN between test data and prediction model

No need to load the entire model into memory

13
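The LEFT OUTER JOIN above is effectively a sparse dot product: each feature of a test row looks up its learned weight (unseen features contribute 0), the weights are summed per rowid, and sigmoid turns the sum into a probability. A Python sketch with an illustrative toy model:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Learned model: feature -> weight (the lr_model table)
model = {"f1": 1.2, "f2": -0.4}

# One test row exploded into its features (testing_exploded);
# "f3" has no learned weight, so the outer join contributes 0 for it
test_features = ["f1", "f2", "f3"]

score = sum(model.get(f, 0.0) for f in test_features)  # sum(m.weight)
prob = sigmoid(score)                                  # sigmoid(...) as prob
print(round(prob, 3))
```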

Page 14: Recommendation 101 using Hivemall

14

Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1, PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad + RDA
✓ Factorization Machines
✓ RandomForest Classification

Regression
✓ Logistic Regression (SGD)
✓ PA Regression
✓ AROW Regression
✓ AdaGrad (logistic loss)
✓ AdaDELTA (logistic loss)
✓ Factorization Machines
✓ RandomForest Regression

List of supported Algorithms

Page 15: Recommendation 101 using Hivemall

List of supported Algorithms

15

Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1, PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad + RDA
✓ Factorization Machines
✓ RandomForest Classification

Regression
✓ Logistic Regression (SGD)
✓ AdaGrad (logistic loss)
✓ AdaDELTA (logistic loss)
✓ PA Regression
✓ AROW Regression
✓ Factorization Machines
✓ RandomForest Regression

SCW is a good first choice.
Try RandomForest if SCW does not work.

Logistic regression is good for getting a probability of the positive class.

Factorization Machines are a good fit where features are sparse and mostly categorical.

Page 16: Recommendation 101 using Hivemall

List of Algorithms for Recommendation

16

K-Nearest Neighbor
✓ MinHash and b-Bit MinHash (LSH variant)
✓ Similarity Search on Vector Space (Euclid/Cosine/Jaccard/Angular)

Matrix Completion
✓ Matrix Factorization
✓ Factorization Machines (regression)

The each_top_k function of Hivemall is useful for recommending top-k items.
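each_top_k keeps the k best rows per group. The same effect, sketched in plain Python over illustrative (user, item, score) rows (the helper name is made up; only the grouping-and-top-k idea is from the slide):

```python
import heapq
from collections import defaultdict

# Illustrative (user, item, score) candidates, e.g. from a similarity join
scored = [
    ("u1", "item1", 0.9), ("u1", "item2", 0.4), ("u1", "item3", 0.7),
    ("u2", "item1", 0.2), ("u2", "item3", 0.8),
]

def top_k_per_user(rows, k):
    # Group scores per user, then keep only the k largest per group
    per_user = defaultdict(list)
    for user, item, score in rows:
        per_user[user].append((score, item))
    return {u: heapq.nlargest(k, items) for u, items in per_user.items()}

top2 = top_k_per_user(scored, k=2)
print(top2["u1"])  # -> [(0.9, 'item1'), (0.7, 'item3')]
```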

Page 17: Recommendation 101 using Hivemall

Other Supported Algorithms

17

Anomaly Detection
✓ Local Outlier Factor (LOF)

Feature Engineering
✓ Feature Hashing
✓ Feature Scaling (normalization, z-score)
✓ TF-IDF Vectorizer
✓ Polynomial Expansion (Feature Pairing)
✓ Amplifier

NLP
✓ Basic English text Tokenizer
✓ Japanese Tokenizer (Kuromoji)

Page 18: Recommendation 101 using Hivemall

Agenda

1. Introduction to Hivemall
2. Recommendation 101
3. Matrix Factorization
4. Bayesian Personalized Ranking

18

Page 19: Recommendation 101 using Hivemall

• Explicit Feedback
  • Item Rating
  • Item Ranking

• Implicit Feedback
  • Positive-only Implicit Feedback
    • Bought (or not)
    • Clicked (or not)
    • Converted (or not)

19

Recommendation 101

Page 20: Recommendation 101 using Hivemall

• Explicit Feedback
  • Item Rating
  • Item Ranking

• Implicit Feedback
  • Positive-only Implicit Feedback
    • Bought (or not)
    • Clicked (or not)
    • Converted (or not)

20

Recommendation 101

Case for Coursehero?

Page 21: Recommendation 101 using Hivemall

U/I   | Item1 | Item2 | Item3 | … | ItemI
User1 |       |   5   |       |   |   3
User2 |   2   |       |   1   |   |
…     |       |   3   |       | 4 |
UserU |   1   |       |   4   |   |   5

21

Explicit Feedback

Page 22: Recommendation 101 using Hivemall

U/I   | Item1 | Item2 | Item3 | … | ItemI
User1 |   ?   |   5   |   ?   | ? |   3
User2 |   2   |   ?   |   1   | ? |   ?
…     |   ?   |   3   |   ?   | 4 |   ?
UserU |   1   |   ?   |   4   | ? |   5

22

Explicit Feedback

Page 23: Recommendation 101 using Hivemall

23

Explicit Feedback

U/I   | Item1 | Item2 | Item3 | … | ItemI
User1 |   ?   |   5   |   ?   | ? |   3
User2 |   2   |   ?   |   1   | ? |   ?
…     |   ?   |   3   |   ?   | 4 |   ?
UserU |   1   |   ?   |   4   | ? |   5

• Very sparse dataset
• # of feedback is small
• Unknown data >> training data
• User preference to rated items is clear
• Has negative feedback
• Evaluation is easy (MAE/RMSE)
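MAE and RMSE, the two metrics named above, compare predicted ratings against held-out true ratings; RMSE squares each error before averaging, so large mistakes dominate. A self-contained Python sketch with illustrative ratings:

```python
import math

def mae(actual, predicted):
    # Mean absolute error: average of |r - r_hat|
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    # Root mean squared error: square before averaging, so large errors dominate
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

actual = [5.0, 2.0, 4.0]     # held-out true ratings
predicted = [4.0, 2.0, 5.0]  # model predictions
print(mae(actual, predicted), rmse(actual, predicted))
```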

Page 24: Recommendation 101 using Hivemall

U/I Item1 Item2 Item3 … ItemI

User1 ⭕ ⭕

User2 ⭕ ⭕

… ⭕ ⭕

UserU ⭕ ⭕ ⭕

24

Implicit Feedback

Page 25: Recommendation 101 using Hivemall

U/I Item1 Item2 Item3 … ItemI

User1 ⭕ ⭕

User2 ⭕ ⭕

… ⭕ ⭕

UserU ⭕ ⭕ ⭕

25

Implicit Feedback

• Sparse dataset
• Number of feedbacks is large
• User preference is unclear
• No negative feedback
• Known feedback may be negative
• Unknown feedback may be positive
• Evaluation is not so easy (NDCG, Prec@K, Recall@K)

Page 26: Recommendation 101 using Hivemall

26

Pros and Cons

                | Explicit Feedback | Implicit Feedback
Data size       | Bad               | Good
User preference | Good              | Bad
Dislike/Unknown | Good              | Bad
Impact of bias  | Bad               | Good

Page 27: Recommendation 101 using Hivemall

Agenda

1. Introduction to Hivemall
2. Recommendation 101
3. Matrix Factorization
4. Bayesian Personalized Ranking

27

Page 28: Recommendation 101 using Hivemall

28

Matrix Factorization / Completion

Factorize a matrix into a product of matrices having k latent factors

Page 29: Recommendation 101 using Hivemall

29

Matrix Completion How-to

• Mean Rating μ
• Rating Bias for each Item Bi
• Rating Bias for each User Bu

Page 30: Recommendation 101 using Hivemall

30

Criteria of Biased MF Factorization:

  minimize Σ_(u,i) ( r_ui − (μ + b_u + b_i + p_u · q_i) )² + λ ( ‖p_u‖² + ‖q_i‖² + b_u² + b_i² )

where
  r_ui − r̂_ui … diff in prediction
  μ           … mean rating
  b_u, b_i    … bias for each user/item
  p_u · q_i   … matrix factorization term
  λ(…)        … regularization
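Numerically, the criterion above means: predict r̂_ui = μ + b_u + b_i + p_u·q_i and take SGD steps on the squared diff plus the regularization term. A minimal Python sketch of one such step (the learning rate eta and regularizer lam are illustrative values, not Hivemall defaults):

```python
def predict(mu, bu, bi, pu, qi):
    # r_hat = mu + b_u + b_i + p_u . q_i
    return mu + bu + bi + sum(p * q for p, q in zip(pu, qi))

def sgd_step(r, mu, bu, bi, pu, qi, eta=0.01, lam=0.02):
    # err is the "diff in prediction"; lam weights the regularization term
    err = r - predict(mu, bu, bi, pu, qi)
    new_bu = bu + eta * (err - lam * bu)
    new_bi = bi + eta * (err - lam * bi)
    new_pu = [p + eta * (err * q - lam * p) for p, q in zip(pu, qi)]
    new_qi = [q + eta * (err * p - lam * q) for p, q in zip(pu, qi)]
    return new_bu, new_bi, new_pu, new_qi
```

Repeated steps on each observed rating pull r̂_ui toward r_ui while the λ terms keep the parameters small.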

Page 31: Recommendation 101 using Hivemall

31

Training of Matrix Factorization

Supports iterative training using a local disk cache

Page 32: Recommendation 101 using Hivemall

32

Prediction of Matrix Factorization

Page 33: Recommendation 101 using Hivemall

Agenda

1. Introduction to Hivemall
2. Recommendation 101
3. Matrix Factorization
4. Bayesian Personalized Ranking

33

Still in beta, but will officially be supported soon

Page 34: Recommendation 101 using Hivemall

34

Implicit Feedback

A naïve approach (not recommended): fill each unknown cell in as a negative

Page 35: Recommendation 101 using Hivemall

35

Sampling scheme for Implicit Feedback

Sample pairs <u, i, j> of a positive item i and a negative item j for each user u

• Uniform user sampling
  → Sample a user. Then, sample a pair.
• Uniform pair sampling
  → Sample pairs directly (distribution follows the original dataset)
• With-replacement or without-replacement sampling

U/I Item1 Item2 Item3 … ItemI

User1 ⭕ ⭕

User2 ⭕ ⭕

… ⭕ ⭕

UserU ⭕ ⭕ ⭕

Default Hivemall sampling scheme:
- Uniform user sampling
- With replacement
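The default scheme (uniform user sampling, with replacement) can be sketched as: draw a user uniformly, draw one of that user's positive items as i, and draw an item the user has not touched as j. A toy Python sketch with made-up feedback data:

```python
import random

# Positive-only implicit feedback: user -> set of observed (positive) items
feedback = {
    "u1": {"item1", "item3"},
    "u2": {"item2"},
}
all_items = {"item1", "item2", "item3", "item4"}

def sample_triple(rng):
    u = rng.choice(sorted(feedback))                  # uniform user sampling
    i = rng.choice(sorted(feedback[u]))               # a positive item for u
    j = rng.choice(sorted(all_items - feedback[u]))   # an unobserved ("negative") item
    return u, i, j

# With replacement: the same user or pair may be drawn again
rng = random.Random(42)
print(sample_triple(rng))
```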

Page 36: Recommendation 101 using Hivemall

• Rendle et al., “BPR: Bayesian Personalized Ranking from Implicit Feedback”, Proc. UAI, 2009.

• Arguably the most proven algorithm for recommendation from implicit feedback

36

Bayesian Personalized Ranking (BPR)

Key assumption: user u prefers item i over a non-observed item j
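Under that assumption, BPR maximizes ln σ(x_ui − x_uj), where x is the MF score, so each SGD step pushes the score of the positive item i above the sampled negative j. A minimal Python sketch of one BPR-MF step on the latent factors (eta and lam are illustrative, not Hivemall defaults; item biases are omitted for brevity):

```python
import math

def bpr_step(pu, qi, qj, eta=0.05, lam=0.01):
    # x_uij = x_ui - x_uj, the score difference between items i and j
    x_uij = sum(p * (a - b) for p, a, b in zip(pu, qi, qj))
    # Gradient of ln sigma(x) w.r.t. x is (1 - sigma(x))
    g = 1.0 - 1.0 / (1.0 + math.exp(-x_uij))
    new_pu = [p + eta * (g * (a - b) - lam * p) for p, a, b in zip(pu, qi, qj)]
    new_qi = [a + eta * (g * p - lam * a) for p, a in zip(pu, qi)]
    new_qj = [b + eta * (-g * p - lam * b) for p, b in zip(pu, qj)]
    return new_pu, new_qi, new_qj
```

Each step moves q_i toward the user vector and q_j away from it, widening the ranking margin x_uij.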

Page 37: Recommendation 101 using Hivemall

Bayesian Personalized Ranking (BPR)

37

Image taken from Rendle et al., “BPR: Bayesian Personalized Ranking from Implicit Feedback”, Proc. UAI, 2009. http://www.algo.uni-konstanz.de/members/rendle/pdf/Rendle_et_al2009-Bayesian_Personalized_Ranking.pdf

BPR-MF’s task can be considered as filling in the 0/1 item-item matrix and getting the probability of i >u j

Page 38: Recommendation 101 using Hivemall

Train by BPR-Matrix Factorization

38

Page 39: Recommendation 101 using Hivemall

39

Predict by BPR-Matrix Factorization

Page 40: Recommendation 101 using Hivemall

40

Predict by BPR-Matrix Factorization

Page 41: Recommendation 101 using Hivemall

41

Predict by BPR-Matrix Factorization

Page 42: Recommendation 101 using Hivemall

42

Recommendation for Implicit Feedback Dataset

1. Efficient top-k computation is important for prediction: O(U × I)

2. Memory consumption is heavy where the item size |I| is large
   • MyMediaLite requires lots of memory
   • Maximum data size of MovieLens: 33,000 movies by 240,000 users, 20 million ratings

3. Better to avoid computing predictions every time
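For point 1, scoring every user against every item is O(U × I) work, but prediction only needs the k best items per user; a bounded min-heap keeps memory at O(k) while streaming over items. A Python sketch (the item factors are illustrative):

```python
import heapq

def topk_scores(pu, item_factors, k):
    # Stream over items, keeping only the k best scores in a bounded min-heap
    heap = []  # min-heap of (score, item)
    for item, qi in item_factors.items():
        score = sum(p * q for p, q in zip(pu, qi))
        if len(heap) < k:
            heapq.heappush(heap, (score, item))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, item))
    return sorted(heap, reverse=True)

item_factors = {"item1": [0.9, 0.1], "item2": [0.2, 0.8], "item3": [0.5, 0.5]}
pu = [1.0, 0.0]  # this user loads entirely on the first latent factor
print(topk_scores(pu, item_factors, k=2))  # -> [(0.9, 'item1'), (0.5, 'item3')]
```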

Page 43: Recommendation 101 using Hivemall

43

We support machine learning in the Cloud

Any feature requests? Or questions?