makoto-yui
Agenda
1. Introduction to Hivemall
2. Recommendation 101
3. Matrix Factorization
4. Bayesian Personalized Ranking
What is Hivemall

Scalable machine learning library built as a collection of Hive UDFs, licensed under the Apache License v2

https://github.com/myui/hivemall
Hivemall’s Vision: ML on SQL

[Comparison panel: "Classification with Mahout" (code not captured in this transcript)]

CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight -- reducers perform model averaging in parallel
FROM (
  SELECT logress(features, label, ..) as (feature, weight)
  FROM train -- map-only task
) t
GROUP BY feature; -- shuffled to reducers

✓ Machine Learning made easy for SQL developers (ML for the rest of us)
✓ Interactive and Stable APIs w/ SQL abstraction

This SQL query automatically runs in parallel on Hadoop
How to use Hivemall

[Diagram: Machine Learning workflow. Training consumes (feature vector, label) pairs and builds a prediction model; Prediction applies the model to a feature vector to output a label]

Data preparation
How to use Hivemall - Data preparation

Define a Hive table for training/testing data

CREATE EXTERNAL TABLE e2006tfidf_train (
  rowid int,
  label float,
  features ARRAY<STRING>
) ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';
How to use Hivemall

Feature Engineering
How to use Hivemall - Feature Engineering

Applying a Min-Max Feature Normalization: transforming a label value to a value between 0.0 and 1.0

create view e2006tfidf_train_scaled
as
select
  rowid,
  rescale(target, ${min_label}, ${max_label}) as label,
  features
from
  e2006tfidf_train;
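The rescale UDF in the view above performs min-max normalization. The arithmetic can be sketched in plain Python (an illustrative function, not Hivemall's implementation):

```python
def rescale(value, min_label, max_label):
    """Min-max normalization: map value from [min_label, max_label] to [0.0, 1.0]."""
    return (value - min_label) / (max_label - min_label)
```

For example, with min_label = 0 and max_label = 10, a target of 5 becomes 0.5.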
How to use Hivemall

Training
How to use Hivemall - Training

Training by logistic regression

CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight
FROM (
  SELECT logress(features, label, ..) as (feature, weight)
  FROM train
) t
GROUP BY feature;

• A map-only task learns a prediction model
• Map-outputs are shuffled to reducers by feature
• Reducers perform model averaging in parallel
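The map/shuffle/reduce flow above can be sketched as a toy single-process analogue in Python (names are illustrative; each mapper emits (feature, weight) pairs and the reducers average them per feature):

```python
from collections import defaultdict

def average_models(mapper_outputs):
    """Average (feature, weight) pairs per feature, mimicking the
    GROUP BY feature / avg(weight) performed by the reducers."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for feature, weight in mapper_outputs:  # shuffled to reducers by feature
        sums[feature] += weight
        counts[feature] += 1
    return {feature: sums[feature] / counts[feature] for feature in sums}
```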
How to use Hivemall - Training

Training of Confidence Weighted Classifier

CREATE TABLE news20b_cw_model1 AS
SELECT
  feature,
  voted_avg(weight) as weight
FROM (
  SELECT train_cw(features, label) as (feature, weight)
  FROM news20b_train
) t
GROUP BY feature;

Vote to decide whether to average the negative or the positive weights
e.g. +0.7, +0.3, +0.2, -0.1, +0.7
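As described, voted_avg first votes on the sign of the weights and then averages only the winning side. A sketch under that reading (this mirrors the described behaviour, not Hivemall's actual code):

```python
def voted_avg(weights):
    """Majority vote on sign, then average only the weights on the winning side."""
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w <= 0]
    winners = positives if len(positives) >= len(negatives) else negatives
    return sum(winners) / len(winners)
```

For the example weights +0.7, +0.3, +0.2, -0.1, +0.7, the positive side wins the vote 4 to 1, so the result is (0.7 + 0.3 + 0.2 + 0.7) / 4 = 0.475.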
How to use Hivemall

Prediction
HowtouseHivemall- Prediction
CREATETABLElr_predictasSELECTt.rowid,sigmoid(sum(m.weight)) asprobFROMtesting_exploded tLEFTOUTERJOINlr_model mON(t.feature =m.feature)GROUPBYt.rowid
PredictionisdonebyLEFTOUTERJOINbetweentestdataandpredictionmodel
Noneedtoloadtheentiremodelintomemory
13
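Per rowid, the query sums the weights of the model features that appear in the test row (features missing from the model contribute nothing, as with a LEFT OUTER JOIN) and squashes the sum with the sigmoid. A Python sketch of that computation:

```python
import math

def predict_prob(test_features, model):
    """model: {feature: weight}. Sum the matching weights, then apply sigmoid."""
    total = sum(model.get(f, 0.0) for f in test_features)  # unmatched feature -> 0
    return 1.0 / (1.0 + math.exp(-total))
```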
List of supported Algorithms

Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1, PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad + RDA
✓ Factorization Machines
✓ Random Forest Classification

Regression
✓ Logistic Regression (SGD)
✓ PA Regression
✓ AROW Regression
✓ AdaGrad (logistic loss)
✓ AdaDELTA (logistic loss)
✓ Factorization Machines
✓ Random Forest Regression
SCW is a good first choice; try Random Forest if SCW does not work
Logistic regression is good for getting a probability of a positive class
Factorization Machines are good where features are sparse and mostly categorical
List of Algorithms for Recommendation

K-Nearest Neighbor
✓ MinHash and b-Bit MinHash (LSH variant)
✓ Similarity Search on Vector Space (Euclid/Cosine/Jaccard/Angular)

Matrix Completion
✓ Matrix Factorization
✓ Factorization Machines (regression)

The each_top_k function of Hivemall is useful for recommending top-k items
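A toy Python analogue of per-group top-k selection (illustrative only, not the each_top_k UDF's API):

```python
import heapq

def top_k_per_user(scored_items, k):
    """scored_items: iterable of (user, item, score) rows.
    Returns {user: [(item, score), ...]} with the k highest-scoring items per user."""
    per_user = {}
    for user, item, score in scored_items:
        per_user.setdefault(user, []).append((score, item))
    return {user: [(item, score) for score, item in heapq.nlargest(k, rows)]
            for user, rows in per_user.items()}
```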
Other Supported Algorithms

Anomaly Detection
✓ Local Outlier Factor (LOF)

Feature Engineering
✓ Feature Hashing
✓ Feature Scaling (normalization, z-score)
✓ TF-IDF vectorizer
✓ Polynomial Expansion (Feature Pairing)
✓ Amplifier

NLP
✓ Basic English text Tokenizer
✓ Japanese Tokenizer (Kuromoji)
Recommendation 101

• Explicit Feedback
  • Item Rating
  • Item Ranking
• Implicit Feedback
  • Positive-only Implicit Feedback
    • Bought (or not)
    • Clicked (or not)
    • Converted (or not)
Case for Coursehero?
Explicit Feedback

U/I   | Item1 | Item2 | Item3 | … | ItemI
User1 |       | 5     |       |   | 3
User2 | 2     |       | 1     |   |
…     |       | 3     |       | 4 |
UserU | 1     |       | 4     |   | 5
Unrated cells are unknown ("?"):

U/I   | Item1 | Item2 | Item3 | … | ItemI
User1 | ?     | 5     | ?     | ? | 3
User2 | 2     | ?     | 1     | ? | ?
…     | ?     | 3     | ?     | 4 | ?
UserU | 1     | ?     | 4     | ? | 5
• Very Sparse Dataset
  • # of feedbacks is small
  • Unknown data >> Training data
• User preference to rated items is clear
• Has negative feedbacks
• Evaluation is easy (MAE/RMSE)
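MAE and RMSE over (true, predicted) rating pairs are straightforward to compute:

```python
import math

def mae(truth, preds):
    """Mean Absolute Error over (true, predicted) rating pairs."""
    return sum(abs(t - p) for t, p in zip(truth, preds)) / len(truth)

def rmse(truth, preds):
    """Root Mean Squared Error over (true, predicted) rating pairs."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(truth, preds)) / len(truth))
```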
Implicit Feedback

U/I   | Item1 | Item2 | Item3 | … | ItemI
User1 |       | ⭕    |       |   | ⭕
User2 | ⭕    |       | ⭕    |   |
…     |       | ⭕    |       | ⭕ |
UserU | ⭕    |       | ⭕    |   | ⭕
• Sparse Dataset
• Number of feedbacks is large
• User preference is unclear
  • No negative feedback
  • Known feedback may be negative
  • Unknown feedback may be positive
• Evaluation is not so easy (NDCG, Prec@K, Recall@K)
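Prec@K and Recall@K (standard definitions) can be sketched as:

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    return sum(1 for item in recommended[:k] if item in relevant) / k

def recall_at_k(recommended, relevant, k):
    """Fraction of the relevant items found in the top-k recommendations."""
    return sum(1 for item in recommended[:k] if item in relevant) / len(relevant)
```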
Pros and Cons

                | Explicit Feedback | Implicit Feedback
Data size       | ☹ (small)         | ☺ (large)
User preference | ☺ (clear)         | ☹ (unclear)
Dislike/Unknown | ☺ (distinguished) | ☹ (ambiguous)
Impact of bias  | ☹ (high)          | ☺ (low)
Matrix Factorization / Completion

Factorize a matrix into a product of matrices having k latent factors
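With user factors P and item factors Q (k-dimensional latent vectors), any cell of the matrix, including an unobserved one, is reconstructed as a dot product. A minimal Python sketch:

```python
def predict_rating(P, Q, user, item):
    """Reconstruct cell (user, item) as the dot product of the
    k-dimensional latent factor vectors P[user] and Q[item]."""
    return sum(p * q for p, q in zip(P[user], Q[item]))
```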
Matrix Completion How-to

Biased MF models each rating with:
• Mean Rating μ
• Rating Bias for each Item b_i
• Rating Bias for each User b_u

Criteria of Biased MF (minimize over observed ratings):

  Σ_(u,i) ( r_ui − μ − b_u − b_i − p_u·q_i )²  +  λ ( ||p_u||² + ||q_i||² + b_u² + b_i² )

where r_ui − μ − b_u − b_i − p_u·q_i is the diff in prediction, p_u·q_i the matrix factorization term, b_u and b_i the biases for each user/item, and the λ(…) term the regularization.
Training of Matrix Factorization

Supports iterative training using a local disk cache
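A minimal single-machine SGD sketch for the biased MF criteria above (illustrative Python; Hivemall's actual trainer runs as Hive UDFs, and all names here are assumptions):

```python
import random

def predict(model, u, i):
    """r_hat = mu + b_u + b_i + p_u . q_i (missing users/items fall back to mu)."""
    mu, bu, bi, P, Q = model
    dot = sum(p * q for p, q in zip(P.get(u, []), Q.get(i, [])))
    return mu + bu.get(u, 0.0) + bi.get(i, 0.0) + dot

def sgd_train(ratings, k=2, lr=0.01, reg=0.05, iters=200, seed=1):
    """For each observed rating r_ui, step the biases and latent factors along
    the gradient of (r_ui - r_hat)^2 plus the regularization term."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)  # mean rating
    bu, bi, P, Q = {}, {}, {}, {}
    for _ in range(iters):
        for u, i, r in ratings:
            p = P.setdefault(u, [rng.gauss(0.0, 0.1) for _ in range(k)])
            q = Q.setdefault(i, [rng.gauss(0.0, 0.1) for _ in range(k)])
            e = r - predict((mu, bu, bi, P, Q), u, i)  # diff in prediction
            bu[u] = bu.get(u, 0.0) + lr * (e - reg * bu.get(u, 0.0))
            bi[i] = bi.get(i, 0.0) + lr * (e - reg * bi.get(i, 0.0))
            for f in range(k):
                pf, qf = p[f], q[f]
                p[f] += lr * (e * qf - reg * pf)
                q[f] += lr * (e * pf - reg * qf)
    return mu, bu, bi, P, Q
```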
Prediction of Matrix Factorization
Still in beta, but will officially be supported soon

Implicit Feedback

A naïve ☹ approach: fill each unknown cell with a negative label
Sampling scheme for Implicit Feedback

Sample pairs <u, i, j> of a positive item i and a negative item j for each user u

• Uniform user sampling
  Ø Sample a user, then sample a pair
• Uniform pair sampling
  Ø Sample pairs directly (distributed along with the original dataset)
• With-replacement or without-replacement sampling

Default Hivemall sampling scheme:
- Uniform user sampling
- With replacement
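The default scheme (uniform user sampling, with replacement) can be sketched as follows; the sketch assumes every user has at least one non-observed item:

```python
import random

def sample_triple(feedback, all_items, rng=random):
    """Uniform user sampling with replacement: pick a user uniformly, then a
    positive item i for that user and a non-observed (negative) item j.
    feedback: {user: set of positively observed items}."""
    u = rng.choice(list(feedback))
    positives = feedback[u]
    i = rng.choice(list(positives))
    j = rng.choice(all_items)
    while j in positives:  # resample until we hit a non-observed item
        j = rng.choice(all_items)
    return u, i, j
```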
Bayesian Personalized Ranking

• Rendle et al., "BPR: Bayesian Personalized Ranking from Implicit Feedback", Proc. UAI, 2009.
• Arguably the most proven(?) algorithm for recommendation from implicit feedback

Key assumption: user u prefers item i over a non-observed item j
Image taken from Rendle et al., "BPR: Bayesian Personalized Ranking from Implicit Feedback", Proc. UAI, 2009. http://www.algo.uni-konstanz.de/members/rendle/pdf/Rendle_et_al2009-Bayesian_Personalized_Ranking.pdf

BPR-MF's task can be considered as filling the per-user item-item preference matrix with 0/1 and getting the probability of i >u j
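Under MF, x_ui = p_u · q_i, and BPR maximizes Σ ln σ(x_ui − x_uj) over sampled (u, i, j) triples. A sketch of the (negative) objective that training would minimize:

```python
import math

def bpr_loss(P, Q, triples):
    """Negative BPR log-likelihood: -sum ln sigmoid(x_ui - x_uj),
    where x_ui is the dot product of the latent vectors P[u] and Q[i]."""
    def score(u, i):
        return sum(p * q for p, q in zip(P[u], Q[i]))
    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))
    return -sum(math.log(sigmoid(score(u, i) - score(u, j)))
                for u, i, j in triples)
```

Ranking the observed item above the non-observed one yields a lower loss than the reverse.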
Train by BPR-Matrix Factorization
Predict by BPR-Matrix Factorization
Recommendation for Implicit Feedback Datasets

1. Efficient top-k computation is important for prediction: naïvely it is O(U × I)
2. Memory consumption is heavy where the item size |I| is large
   • MyMediaLite requires lots of memory
   • Maximum data size of MovieLens: 33,000 movies by 240,000 users, 20 million ratings
3. Better to avoid recomputing predictions each time
We support machine learning in the Cloud

Any feature requests? Or questions?