makoto-yui
Agenda
1. Introduction to Hivemall
2. Recommendation 101
3. Matrix Factorization
4. Bayesian Personalized Ranking
What is Hivemall

Scalable machine learning library built as a collection of Hive UDFs, licensed under the Apache License v2

https://github.com/myui/hivemall
Hivemall’s Vision: ML on SQL

[Comparison panel: "Classification with Mahout" (code not captured in this transcript)]

CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight -- reducers perform model averaging in parallel
FROM (
  SELECT logress(features, label, ..) as (feature, weight)
  FROM train -- map-only task
) t
GROUP BY feature; -- shuffled to reducers

✓ Machine Learning made easy for SQL developers (ML for the rest of us)
✓ Interactive and Stable APIs w/ SQL abstraction

This SQL query automatically runs in parallel on Hadoop
How to use Hivemall

[Diagram: Machine Learning workflow. Training consumes (feature vector, label) pairs and builds a prediction model; Prediction applies the model to a feature vector to output a label]

Data preparation
How to use Hivemall - Data preparation

Define a Hive table for training/testing data

CREATE EXTERNAL TABLE e2006tfidf_train (
  rowid int,
  label float,
  features ARRAY<STRING>
) ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';
How to use Hivemall

Feature Engineering
How to use Hivemall - Feature Engineering

Applying a Min-Max Feature Normalization: transforming a label value to a value between 0.0 and 1.0

create view e2006tfidf_train_scaled
as
select
  rowid,
  rescale(target, ${min_label}, ${max_label}) as label,
  features
from
  e2006tfidf_train;
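The rescale UDF in the view above performs min-max normalization. The arithmetic can be sketched in plain Python (an illustrative function, not Hivemall's implementation):

```python
def rescale(value, min_label, max_label):
    """Min-max normalization: map value from [min_label, max_label] to [0.0, 1.0]."""
    return (value - min_label) / (max_label - min_label)
```

For example, with min_label = 0 and max_label = 10, a target of 5 becomes 0.5.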
How to use Hivemall

Training
How to use Hivemall - Training

Training by logistic regression

CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight
FROM (
  SELECT logress(features, label, ..) as (feature, weight)
  FROM train
) t
GROUP BY feature;

• A map-only task learns a prediction model
• Map-outputs are shuffled to reducers by feature
• Reducers perform model averaging in parallel
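The map/shuffle/reduce flow above can be sketched as a toy single-process analogue in Python (names are illustrative; each mapper emits (feature, weight) pairs and the reducers average them per feature):

```python
from collections import defaultdict

def average_models(mapper_outputs):
    """Average (feature, weight) pairs per feature, mimicking the
    GROUP BY feature / avg(weight) performed by the reducers."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for feature, weight in mapper_outputs:  # shuffled to reducers by feature
        sums[feature] += weight
        counts[feature] += 1
    return {feature: sums[feature] / counts[feature] for feature in sums}
```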
How to use Hivemall - Training

Training of Confidence Weighted Classifier

CREATE TABLE news20b_cw_model1 AS
SELECT
  feature,
  voted_avg(weight) as weight
FROM (
  SELECT train_cw(features, label) as (feature, weight)
  FROM news20b_train
) t
GROUP BY feature;

Vote to decide whether to average the negative or the positive weights
e.g. +0.7, +0.3, +0.2, -0.1, +0.7
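As described, voted_avg first votes on the sign of the weights and then averages only the winning side. A sketch under that reading (this mirrors the described behaviour, not Hivemall's actual code):

```python
def voted_avg(weights):
    """Majority vote on sign, then average only the weights on the winning side."""
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w <= 0]
    winners = positives if len(positives) >= len(negatives) else negatives
    return sum(winners) / len(winners)
```

For the example weights +0.7, +0.3, +0.2, -0.1, +0.7, the positive side wins the vote 4 to 1, so the result is (0.7 + 0.3 + 0.2 + 0.7) / 4 = 0.475.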
How to use Hivemall

Prediction
HowtouseHivemall- Prediction
CREATETABLElr_predictasSELECTt.rowid,sigmoid(sum(m.weight)) asprobFROMtesting_exploded tLEFTOUTERJOINlr_model mON(t.feature =m.feature)GROUPBYt.rowid
PredictionisdonebyLEFTOUTERJOINbetweentestdataandpredictionmodel
Noneedtoloadtheentiremodelintomemory
13
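Per rowid, the query sums the weights of the model features that appear in the test row (features missing from the model contribute nothing, as with a LEFT OUTER JOIN) and squashes the sum with the sigmoid. A Python sketch of that computation:

```python
import math

def predict_prob(test_features, model):
    """model: {feature: weight}. Sum the matching weights, then apply sigmoid."""
    total = sum(model.get(f, 0.0) for f in test_features)  # unmatched feature -> 0
    return 1.0 / (1.0 + math.exp(-total))
```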
List of supported Algorithms

Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1, PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad + RDA
✓ Factorization Machines
✓ Random Forest Classification

Regression
✓ Logistic Regression (SGD)
✓ PA Regression
✓ AROW Regression
✓ AdaGrad (logistic loss)
✓ AdaDELTA (logistic loss)
✓ Factorization Machines
✓ Random Forest Regression
SCW is a good first choice; try Random Forest if SCW does not work
Logistic regression is good for getting a probability of a positive class
Factorization Machines are good where features are sparse and mostly categorical
List of Algorithms for Recommendation

K-Nearest Neighbor
✓ MinHash and b-Bit MinHash (LSH variant)
✓ Similarity Search on Vector Space (Euclid/Cosine/Jaccard/Angular)

Matrix Completion
✓ Matrix Factorization
✓ Factorization Machines (regression)

The each_top_k function of Hivemall is useful for recommending top-k items
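A toy Python analogue of per-group top-k selection (illustrative only, not the each_top_k UDF's API):

```python
import heapq

def top_k_per_user(scored_items, k):
    """scored_items: iterable of (user, item, score) rows.
    Returns {user: [(item, score), ...]} with the k highest-scoring items per user."""
    per_user = {}
    for user, item, score in scored_items:
        per_user.setdefault(user, []).append((score, item))
    return {user: [(item, score) for score, item in heapq.nlargest(k, rows)]
            for user, rows in per_user.items()}
```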
Other Supported Algorithms

Anomaly Detection
✓ Local Outlier Factor (LOF)

Feature Engineering
✓ Feature Hashing
✓ Feature Scaling (normalization, z-score)
✓ TF-IDF vectorizer
✓ Polynomial Expansion (Feature Pairing)
✓ Amplifier

NLP
✓ Basic English text Tokenizer
✓ Japanese Tokenizer (Kuromoji)
Recommendation 101

• Explicit Feedback
  • Item Rating
  • Item Ranking
• Implicit Feedback
  • Positive-only Implicit Feedback
    • Bought (or not)
    • Clicked (or not)
    • Converted (or not)
Case for Coursehero?
Explicit Feedback

U/I   | Item1 | Item2 | Item3 | … | ItemI
User1 |       | 5     |       |   | 3
User2 | 2     |       | 1     |   |
…     |       | 3     |       | 4 |
UserU | 1     |       | 4     |   | 5
Unrated cells are unknown ("?"):

U/I   | Item1 | Item2 | Item3 | … | ItemI
User1 | ?     | 5     | ?     | ? | 3
User2 | 2     | ?     | 1     | ? | ?
…     | ?     | 3     | ?     | 4 | ?
UserU | 1     | ?     | 4     | ? | 5
• Very Sparse Dataset
  • # of feedbacks is small
  • Unknown data >> Training data
• User preference to rated items is clear
• Has negative feedbacks
• Evaluation is easy (MAE/RMSE)
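MAE and RMSE over (true, predicted) rating pairs are straightforward to compute:

```python
import math

def mae(truth, preds):
    """Mean Absolute Error over (true, predicted) rating pairs."""
    return sum(abs(t - p) for t, p in zip(truth, preds)) / len(truth)

def rmse(truth, preds):
    """Root Mean Squared Error over (true, predicted) rating pairs."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(truth, preds)) / len(truth))
```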
Implicit Feedback

U/I   | Item1 | Item2 | Item3 | … | ItemI
User1 |       | ⭕    |       |   | ⭕
User2 | ⭕    |       | ⭕    |   |
…     |       | ⭕    |       | ⭕ |
UserU | ⭕    |       | ⭕    |   | ⭕
• Sparse Dataset
• Number of feedbacks is large
• User preference is unclear
  • No negative feedback
  • Known feedback may be negative
  • Unknown feedback may be positive
• Evaluation is not so easy (NDCG, Prec@K, Recall@K)
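Prec@K and Recall@K (standard definitions) can be sketched as:

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    return sum(1 for item in recommended[:k] if item in relevant) / k

def recall_at_k(recommended, relevant, k):
    """Fraction of the relevant items found in the top-k recommendations."""
    return sum(1 for item in recommended[:k] if item in relevant) / len(relevant)
```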
Pros and Cons

                | Explicit Feedback | Implicit Feedback
Data size       | ☹ (small)         | ☺ (large)
User preference | ☺ (clear)         | ☹ (unclear)
Dislike/Unknown | ☺ (distinguished) | ☹ (ambiguous)
Impact of bias  | ☹ (high)          | ☺ (low)
Matrix Factorization / Completion

Factorize a matrix into a product of matrices having k latent factors
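With user factors P and item factors Q (k-dimensional latent vectors), any cell of the matrix, including an unobserved one, is reconstructed as a dot product. A minimal Python sketch:

```python
def predict_rating(P, Q, user, item):
    """Reconstruct cell (user, item) as the dot product of the
    k-dimensional latent factor vectors P[user] and Q[item]."""
    return sum(p * q for p, q in zip(P[user], Q[item]))
```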
Matrix Completion How-to

Biased MF models each rating with:
• Mean Rating μ
• Rating Bias for each Item b_i
• Rating Bias for each User b_u

Criteria of Biased MF (minimize over observed ratings):

  Σ_(u,i) ( r_ui − μ − b_u − b_i − p_u·q_i )²  +  λ ( ||p_u||² + ||q_i||² + b_u² + b_i² )

where r_ui − μ − b_u − b_i − p_u·q_i is the diff in prediction, p_u·q_i the matrix factorization term, b_u and b_i the biases for each user/item, and the λ(…) term the regularization.
Training of Matrix Factorization

Supports iterative training using a local disk cache
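A minimal single-machine SGD sketch for the biased MF criteria above (illustrative Python; Hivemall's actual trainer runs as Hive UDFs, and all names here are assumptions):

```python
import random

def predict(model, u, i):
    """r_hat = mu + b_u + b_i + p_u . q_i (missing users/items fall back to mu)."""
    mu, bu, bi, P, Q = model
    dot = sum(p * q for p, q in zip(P.get(u, []), Q.get(i, [])))
    return mu + bu.get(u, 0.0) + bi.get(i, 0.0) + dot

def sgd_train(ratings, k=2, lr=0.01, reg=0.05, iters=200, seed=1):
    """For each observed rating r_ui, step the biases and latent factors along
    the gradient of (r_ui - r_hat)^2 plus the regularization term."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)  # mean rating
    bu, bi, P, Q = {}, {}, {}, {}
    for _ in range(iters):
        for u, i, r in ratings:
            p = P.setdefault(u, [rng.gauss(0.0, 0.1) for _ in range(k)])
            q = Q.setdefault(i, [rng.gauss(0.0, 0.1) for _ in range(k)])
            e = r - predict((mu, bu, bi, P, Q), u, i)  # diff in prediction
            bu[u] = bu.get(u, 0.0) + lr * (e - reg * bu.get(u, 0.0))
            bi[i] = bi.get(i, 0.0) + lr * (e - reg * bi.get(i, 0.0))
            for f in range(k):
                pf, qf = p[f], q[f]
                p[f] += lr * (e * qf - reg * pf)
                q[f] += lr * (e * pf - reg * qf)
    return mu, bu, bi, P, Q
```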
Prediction of Matrix Factorization
Still in beta, but will officially be supported soon

Implicit Feedback

A naïve ☹ approach: fill each unknown cell with a negative label
Sampling scheme for Implicit Feedback

Sample pairs <u, i, j> of a positive item i and a negative item j for each user u

• Uniform user sampling
  Ø Sample a user, then sample a pair
• Uniform pair sampling
  Ø Sample pairs directly (distributed along with the original dataset)
• With-replacement or without-replacement sampling

Default Hivemall sampling scheme:
- Uniform user sampling
- With replacement
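The default scheme (uniform user sampling, with replacement) can be sketched as follows; the sketch assumes every user has at least one non-observed item:

```python
import random

def sample_triple(feedback, all_items, rng=random):
    """Uniform user sampling with replacement: pick a user uniformly, then a
    positive item i for that user and a non-observed (negative) item j.
    feedback: {user: set of positively observed items}."""
    u = rng.choice(list(feedback))
    positives = feedback[u]
    i = rng.choice(list(positives))
    j = rng.choice(all_items)
    while j in positives:  # resample until we hit a non-observed item
        j = rng.choice(all_items)
    return u, i, j
```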
Bayesian Personalized Ranking

• Rendle et al., "BPR: Bayesian Personalized Ranking from Implicit Feedback", Proc. UAI, 2009.
• Arguably the most proven(?) algorithm for recommendation from implicit feedback

Key assumption: user u prefers item i over a non-observed item j
Image taken from Rendle et al., "BPR: Bayesian Personalized Ranking from Implicit Feedback", Proc. UAI, 2009. http://www.algo.uni-konstanz.de/members/rendle/pdf/Rendle_et_al2009-Bayesian_Personalized_Ranking.pdf

BPR-MF's task can be considered as filling the per-user item-item preference matrix with 0/1 and getting the probability of i >u j
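Under MF, x_ui = p_u · q_i, and BPR maximizes Σ ln σ(x_ui − x_uj) over sampled (u, i, j) triples. A sketch of the (negative) objective that training would minimize:

```python
import math

def bpr_loss(P, Q, triples):
    """Negative BPR log-likelihood: -sum ln sigmoid(x_ui - x_uj),
    where x_ui is the dot product of the latent vectors P[u] and Q[i]."""
    def score(u, i):
        return sum(p * q for p, q in zip(P[u], Q[i]))
    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))
    return -sum(math.log(sigmoid(score(u, i) - score(u, j)))
                for u, i, j in triples)
```

Ranking the observed item above the non-observed one yields a lower loss than the reverse.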
Train by BPR-Matrix Factorization
Predict by BPR-Matrix Factorization
Recommendation for Implicit Feedback Datasets

1. Efficient top-k computation is important for prediction: naïvely it is O(U × I)
2. Memory consumption is heavy where the item size |I| is large
   • MyMediaLite requires lots of memory
   • Maximum data size of MovieLens: 33,000 movies by 240,000 users, 20 million ratings
3. Better to avoid recomputing predictions each time
We support machine learning in the Cloud

Any feature requests? Or questions?