µ HAWQ ¾ MADlib O N 5xunzhangthu.org/talk/Develop_RecSys_with_HAWQ.pdf · Meetup @Shanghai,...

Preview:

Citation preview

HAWQ MADlib

Hong Wu: hwu@pivotal.io Ruilong Huo: rhuo@pivotal.ioMeetup @Shanghai, 2016/04/28

Agenda

• RecSysReview

• RecommendationwithHAWQ

• Demo

• Q&A

2

Agenda

• RecSysReview

3

What is RecSys

• Recommendation

• Morethansearchengine

• exploretheworldwithoutquery

• Personalization

• discoveritemsbasedonpreference

• Self-evolving

Industry Trends

System Design

• Content-basedrecommendation

• featuresfromitemitself

• Collaborativefiltering

• user-usersimilarity&item-itemsimilarity

• Model-basedrecommendation

• factormodel

System Design

• Content-basedrecommendation

• featuresfromitemitself

• Collaborativefiltering

• user-usersimilarity&item-itemsimilarity

• Model-basedrecommendation

• factormodel

Agenda

• RecSysReview

• RecommendationwithHAWQ

8

System Overview

Import Training Data Matrix Factorization

Compute Item-Item

SimilaritiesMake Recommendation

Mod

el P

ipel

ine

Preparation Map original interactive data into user-item-rating tuples. Import into HAWQ Database.

Latent Factor Model Use lmf_igd_run function in MADLib to train user factors and item factors.

Similarity Apply CF strategy to compute similarity.

Recommendation Three ways to return(see later slides).

Preparation

• Userdata

• Itemdata

• Interactivedata

• Userdata

• Itemdata

• Interactivedata• {(user,item):preferenceweight}

Preparation

Latent Factor Model

• Matrixfactorization• modeleachuser-itemasavectoroffactors

Latent Factor Model

Matrix Factorization in MADlib

SELECTmadlib.lmf_igd_run(output,—resulttabletostoremodelinput,—inputtablerow,—userattributenamecol,—itemattributenamevalue,—preferenceweightrow_dim,—thenumberofuserscol_dim,—thenumberofitemsmax_rank,—thenumberoffactorsstepsize,—learningrateiterations,—trainingroundstolerance);

Similarity

• WehaveuserfactorsUanditemfactorsV

• Applycollaborativefilteringstrategy

• computeitem-itemsimilarities

• lesscomputation,largescale

• gooddiversityvsoriginalCF

Similarity

CREATE TABLE item_sim AS ( SELECT ifacA.iid, ifacB.iid, madlib.cosine_similarity(ifacA.rating, ifacB.rating) as sim FROM ifactor AS ifacA, ifactor AS ifacB WHERE ifacA.iid < ifacB.iid sim DESC LIMIT K );

Recommendation

• Useitemsimilaritydirectly

radioFMbooklike-like

Recommendation

• Useuserfactoranditemfactordirectly

• dotproduct

Recommendation

• Useitemsimilaritytopredictrating

• weightedsum

20

FeedStream

Playlist

MovieGuess

Agenda

• RecSysReview

• RecommendationwithHAWQ

• Demo

21

Try it Out

• https://github.com/xunzhang/xz_talk/tree/master/develop_recsys_with_hawq

Why Choose HAWQ to Build Machine Learning Application

• Simple&Efficient

• highlevelabstraction,notransfercost

• focusmoreonapplication’sownlogic

• dataflowthinking

• Highscalability&performance

• distributedmachinelearningwithMADlib

• parallelqueryenginewithHAWQ

23

Summary

• BriefintroofRecSys

• AhybridmodelpipelinebasedonHAWQ

24

Agenda

• RecSysReview

• RecommendationwithHAWQ

• Demo

• Q&A

25

Thanks Questions

Backup Slide

SolutionofCFinlargescale:AB_similarity.

Storagedisasterofdotproductbasedonfactormodel:BuildBalltreeindex.

Recommended