Recommendation Techn

  • Published on
    15-Jan-2015

  • View
    427

  • Download
    4

Embed Size (px)

DESCRIPTION

 

Transcript

  • 1. 2014 MapR Technolo 2g0ie1s4 MapR Technologies 1

2. 2014 MapR Technologies 2Who I amTed Dunning, Chief Applications Architect, MapR TechnologiesEmail tdunning@mapr.com tdunning@apache.orgTwitter @Ted_DunningApache Mahout https://mahout.apache.org/Twitter @ApacheMahout22 June 2014 Big Data Everywhere Conference #DataIsrael 3. Recommendation: Widely Used Machine LearningExample: Open source Apache Mahout used in production 2014 MapR Technologies 3http://www.wired.com/wiredenterprise/2012/12/mahout/ 4. 2014 MapR Technologies 4Recommendations Data: interactions between people taking action (users) and items Data used to train recommendation model Goal is to suggest additional interactions Example applications: movie, music or map-based restaurant choices;suggesting sale items for e-stores or via cash-register receipts 5. Google maps: restaurant recommendations 2014 MapR Technologies 5 6. 2014 MapR Technologies 6Google maps: tech recommendations 7. 2014 MapR Technologies 7Tutorial Part 1:How recommendation works,or I want a pony 8. First question:Are you using the right data? 2014 MapR Technologies 9 9. 2014 MapR Technologies 10RecommendationBehavior of a crowdhelps us understandwhat individuals will do 10. 2014 MapR Technologies 11RecommendationsAlice got an apple andAlice a puppy 11. 2014 MapR Technologies 12RecommendationsAlice got an apple andAlice a puppyCharles Charles got a bicycle 12. 2014 MapR Technologies 13RecommendationsAlice got an apple andAlice a puppyBob Bob got an appleCharles Charles got a bicycle 13. 2014 MapR Technologies 14RecommendationsAlice got an apple andAlice a puppyBob What else would Bob like?Charles Charles got a bicycle 14. 2014 MapR Technologies 15RecommendationsAlice got an apple andAlice a puppyBob A puppy!Charles Charles got a bicycle 15. 2014 MapR Technologies 16You get the idea of howrecommenders can work 16. By the way, like me, Bob alsowants a pony 2014 MapR Technologies 17 17. 2014 MapR Technologies 18Recommendations?AliceBobAmeliaCharlesWhat if everybody gets apony?What else would you recommend fornew user Amelia? 18. 2014 MapR Technologies 19Recommendations?AliceBobAmeliaCharlesIf everybody gets a pony, itsnot a very good indicator ofwhat to else predict... 19. 2014 MapR Technologies 20Problems with Raw Co-occurrence Very popular items co-occur with everything or why its not veryhelpful to know that everybody wants a pony Examples: Welcome document; Elevator music Very widespread occurrence is not interesting to generate indicatorsfor recommendation Unless you want to offer an item that is constantly desired, such asrazor blades (or ponies) What we want is anomalous co-occurrence This is the source of interesting indicators of preference on which tobase recommendation 20. Overview: Get Useful Indicators from Behaviors 2014 MapR Technologies 211. Use log files to build history matrix of users x items Remember: this history of interactions will be sparse compared to allpotential combinations2. Transform to a co-occurrence matrix of items x items3. Look for useful indicators by identifying anomalous co-occurrences tomake an indicator matrix Log Likelihood Ratio (LLR) can be helpful to judge which co-occurrencescan with confidence be used as indicators of preference ItemSimilarityJob in Apache Mahout uses LLR 21. Apache Mahout: Overview Open source Apache project http://mahout.apache.org/ Mahout version is 0.9 released Feb 2014; inc Scala 2014 MapR Technologies 22 Summary 0.9 blog at http://bit.ly/1rirUUL Library of scalable algorithms for machine learning Some run on Apache Hadoop distributions; others do not require Hadoop Some can be run at small scale Some are run in parallel; others are sequential Includes the following main areas: Clustering & related techniques Classification Recommendation Mahout Math Library 22. 2014 MapR Technologies 24Log FilesAliceCharlesCharlesAliceAliceBobBob 23. 2014 MapR Technologies 25Log Filesu1u2u2u1u1u3u3t1t4t3t2t3t3t1 24. 2014 MapR Technologies 26History Matrix: Users x ItemsAliceBobCharles 25. 2014 MapR Technologies 27Co-Occurrence Matrix: Items x Items1 2 0111 11020 0How do you tell which co-occurrencesare useful? 26. 2014 MapR Technologies 28Co-Occurrence Matrix: Items x Items1 2 0111 11020 0Use LLR test to turn co-occurrenceintoindicators 27. 2014 MapR Technologies 29Co-occurrence Binary Matrix1not1not 1 28. Which one is the anomalous co-occurrence?A not AB 1 0not B 0 2 2014 MapR Technologies 30A not AB 13 1000not B 1000 100,000A not AB 1 0not B 0 10,000A not AB 10 0not B 0 100,000 29. Which one is the anomalous co-occurrence?A not A0.90 1.95B 1 0not B 0 2 2014 MapR Technologies 31A not AB 13 1000not B 1000 100,000A not A4.52 14.3B 1 0not B 0 10,000A not AB 10 0not B 0 100,000 30. 2014 MapR Technologies 32Co-Occurrence Matrix: Items x Items1 2 0111 11020 0Recap:Use LLR test to turn co-occurrenceinto indicators 31. Indicator Matrix: Anomalous Co-OccurrenceResult:Each marked row showsthe indicators of what torecommend 2014 MapR Technologies 33 32. Indicator Matrix: Anomalous Co-Occurrence 2014 MapR Technologies 34Why not pony + other item? 33. 2014 MapR Technologies 36How will you deliverrecommendations to users? 34. 2014 MapR Technologies 37Seeking SimplicityInnovation:Exploit search technology to deployyour recommendation system 35. 2014 MapR Technologies 38But first, a look at howsearch works 36. Apache Solr/Apache Lucene Apache Solr/Lucene is an open-source powerful searchengine used for flexible, heavily indexed queries including datasuch as Full text, geographical data, statistically weighted data 2014 MapR Technologies 39 Lucene Provides core retrieval Is a low-level library Solr Is a web-based wrapper around Lucene Is easy to integrate because you talk to it via a web-interface URL http://host machine:8888 37. 2014 MapR Technologies 40LucidWorks Enterprise platform and collection of applications based onApache Solr/Lucene Wrapper around Solr A free version ships with MapR LucidWorks leaves Solr exposed but makes Solradministration much easier, which in turn makes it easier touse Lucene URL http://host machine:8989 38. 2014 MapR Technologies 41LucidWorksSolrQuery ResponseIndexLuceneQuery ResponseData SourceRelationship of Solr /Lucene/LucidWorks 39. 2014 MapR Technologies 42Other Options: Elastic Search Apache Lucene library at the heart of several approaches It can also be used on its own Elastic Search is a different web interface for Lucene Real time search and analytics Open source (not Apache) Big advantage is less accumulated crufthttp://www.elasticsearch.com/ 40. 2014 MapR Technologies 44What is a Document? Data is stored in collections made up of documents Documents contain fields that can be Indexed makes field searchable; dont have to index all Stored If you want Solr to return content, have Solr store content Not all data of interest must be stored: can access via stored URL (good for verylarge data set) Multi-value Body field can contain more than one type of data Facetted A way to refine a search or use for statistics Example: data for country: Could return facetted as 37 from US, 23 UK, 7 Japan 41. 2014 MapR Technologies 45Fields and How to Set Them Up Lucene is mostly used for text Text has to be tokenized Also supports other types Long, string, keywords, comma separated Fields properties such as stored, indexed, faceted can also bedefined Defaults arent usually so great 42. 2014 MapR Technologies 46Example of Facetted SearchThe indexed fields Area and Gender have beenfacetted to provide counts for the results 43. Field-specific Searches Using Lucene Syntax Documents often have title, author, keywords and body text General syntax is 2014 MapR Technologies 48field:(term1 term2) Alternativesfield:(term1 OR term2) field:(term1 term2 ~ 5) field:(term1 AND term2) Default field, default interpretations work very well for text 44. 2014 MapR Technologies 49Send Data to Solr (not LucidWorks) LucidWorks has lots of spiders that can use file extension ormime type to trigger certain file parsers Works best with web-ish sources and mime types Also includes MapR and MapR high volume indexers More common at modest volumes or for updates to use JSONformat{"id":"book_314", "title":{"set":"The Call of the Wild"}} Use REST interface to send update filescurl http://localhost:8983/solr/update -H 'Content-type:application/json' --data-binary @file.json 45. Back to recommendation:How do you abuse search to makerecommendation easy? 2014 MapR Technologies 51 46. Collection of Documents: Insert Meta-Data 2014 MapR Technologies 52SearchTechnologyItemmeta-dataIngest easily via NFSDocument forpuppy id: t4title: puppydesc: The sweetest little puppyever.keywords: puppy, dog, pet 47. From Indicator Matrix to New Indicator Field 2014 MapR Technologies 53id: t4title: puppydesc: The sweetest little puppyever.keywords: puppy, dog, petindicators: (t1)Solr documentfor puppyNote: data for the indicator field is added directly to meta-data for a document in ApacheSolr or Elastic Search index. You dont need to create a separate index for the indicators. 48. Lets look at a real example:we built a music recommender 2014 MapR Technologies 54 49. 2014 MapR Technologies 56 50. User activity: Listens to classic jazz hit Take the A Train 2014 MapR Technologies 57 51. System delivers recommendations based on activity 2014 MapR Technologies 58 52. 2014 MapR Technologies 59Lets look inside the musicrecommender 53. Music Meta Data for Search Document Collections 2014 MapR Technologies 60 MusicBrainz data Data includes Artist ID, MusicBrainz ID, Name, Group/Person, From (geolocations) and Gender as seen in this sample 54. Sample User Behavior Histories: Music Log Files 2014 MapR Technologies 6213 START 10113 218265428123 BEACON 10113 218265428124 START 10113 7960061193502834 BEACON 10113 7960061193502844 BEACON 10113 7960061193502854 BEACON 10113 7960061193502864 BEACON 10113 7960061193502874 BEACON 10113 7960061193502884 BEACON 10113 7960061193502894 BEACON 10113 79600611935028104 BEACON 10113 79600611935028109 FINISH10113 79600611935028111 START 10113 58999912011972121 BEACON 10113 58999912011972TimeEvent typeUser IDArtist IDTrack ID 55. 2014 MapR Technologies 63Sample Music Log FilesArtist ID for jazzmusician Duke EllingtonWhat has user 119done here in thehighlighted lines? 56. 2014 MapR Technologies 64Internals of a Recommendation Engine 57. 2014 MapR Technologies 65Internals of a Recommendation Engine 58. Lucene Documents for Music RecommendationNotice that data from indicator matrix of trained Mahout recommendermodel has been added to indicator field in documents of the artistscollection 2014 MapR Technologies 66id 1710mbid 592a3b6d-c42b-4567-99c9-ecf63bd66499name Chuck Berryarea United Statesgender Maleindicator_artists 386685,875994,637954,3418,1344,789739,1460, id 541902mbid 983d4f8f-473e-4091-8394-415c105c4656name Charlie Winstonarea United Kingdomgender Noneindicator_artists 997727,815,830794,59588,900,2591,1344,696268, 59. 2014 MapR Technologies 68Offline AnalysisAnalysis UsingMahoutItemMeta-DataUsers HistorySearchTechnologyLog Files Indicators 60. Real-time recommendations using MapR data platform 2014 MapR Technologies 69Log FilesNew User Historyvia NFS PythonMahoutAnalysisSearchTechnologyItemMeta-DataIngest easily via NFSMapR ClusterUse Pythondirectly via NFSPigWebRecommendations Tier 61. 2014 MapR Technologies 70A Quick Simplification Users who do h Also doAhAT Ah ( )(ATA)hUser-centric recommendationsItem-centric recommendations 62. 2014 MapR Technologies 71Architectural AdvantageAT (Ah)AT( A)hUser-centric recommendationsItem-centric recommendations 63. 2014 MapR Technologies 72Architectural AdvantageAT Ah ( ) User-centric recommendationsWith the first design, you have to do the real-time computation first (inparenthesis). No way to pre-compute. Less efficient, less fast.(ATA)h Item-centric recommendationsWith the second design, you can pre-compute offline (overnight)things that change slowly. Only the smaller computation for new uservector (h) is done in real-time, so response is very fast. 64. 2014 MapR Technologies 74Tutorial Part 2:How to make recommendation better 65. Going Further: Multi-Modal Recommendation 2014 MapR Technologies 75 66. Going Further: Multi-Modal Recommendation 2014 MapR Technologies 76 67. 2014 MapR Technologies 77For example Users enter queries (A) (actor = user, item=query) Users view videos (B) (actor = user, item=video) ATA gives query recommendation did you mean to ask for BTB gives video recommendation you might like these videos 68. 2014 MapR Technologies 78The punch-line BTA recommends videos in response to a query (isnt that a search engine?) (not quite, it doesnt look at content or meta-data) 69. 2014 MapR Technologies 79Real-life example Query: Paco de Lucia Conventional meta-data search results: hombres de paco times 400 not much else Recommendation based search: Flamenco guitar and dancers Spanish and classical guitar Van Halen doing a classical/flamenco riff 70. 2014 MapR Technologies 80Real-life example 71. 2014 MapR Technologies 81Hypothetical Example Want a navigational ontology? Just put labels on a web page with traffic This gives A = users x label clicks Remember viewing history This gives B = users x items Cross recommend BA = label to item mapping After several users click, results are whatever users think theyshould be 72. 2014 MapR Technologies 82Nice. But we cando better? 73. 2014 MapR Technologies 84Symmetry Gives Cross Recommentations(ATA)hBT( A)hConventional recommendations withoff-line learningCross recommendations 74. 2014 MapR Technologies 85thingsA users 75. 2014 MapR Technologies 86A1 A2usersthingtype 1thingtype 2 76. 2014 MapR Technologies 87A1 A2TA1 A2=TA1TA2A1 A2=TA1 A1A1TA2AT2A1 AT2A2r1r2=TA1 A1A1TA2AT2A1 AT2A2h1h2TA1 A1r1 = A1TA2h1h2 77. 2014 MapR Technologies 88Bonus Round:When worse is better 78. 2014 MapR Technologies 89The Real Issues After First Production Exploration Diversity Speed Not the last fraction of a percent 79. 2014 MapR Technologies 90Result Dithering Dithering is used to re-order recommendation results Re-ordering is done randomly Dithering is guaranteed to make off-line performance worse Dithering also has a near perfect record of making actual performancemuch better 80. 2014 MapR Technologies 91Result Dithering Dithering is used to re-order recommendation results Re-ordering is done randomly Dithering is guaranteed to make off-line performance worse Dithering also has a near perfect record of making actual performancemuch betterMade more difference than any other change 81. 2014 MapR Technologies 92Why Dithering WorksReal-timerecommenderOvernighttrainingLog Files 82. 2014 MapR Technologies 93Why Use Dithering? 83. 2014 MapR Technologies 94Simple...