Searching for The Matrix in haystack (with Elasticsearch)

  1. Searching for The Matrix in haystack (with Elasticsearch)
     - Synopsi.TV case study
     - Tom Sirn, @junckritter
     - Pyvo/Rubyslava, November 2012
  2. The Environment
     - Recommendation service for movies and TV shows
     - People mark titles they have watched (check-in) and rate them
     - Get recommendations
     - Make Watch Later or other-purpose lists
     - Search (to check in, add to a list, share, etc.)
  3. The Problem
     - Input box for search at the top of the web page
     - Many movies and TV shows in the database
     - A lot of them have similar titles or use similar words
     - Some are more likely to be searched for than others
     - Little input information: 3 or 4 letters
     - Autocomplete, not only exact match
  4. The Red Pill
  5. The Blue Pill
  6. The Tool
     - Elasticsearch: designed for searching in documents
     - Based on Lucene, the de facto standard
     - Young yet feature-rich
     - Quick development (despite 1 core developer)
     - Business company recently founded, 10M funding in the A round
  7. The (Wannabe) Solution
     - Differentiate titles:
       - Have cover, plot, cast, directors
       - Year
       - Popularity (whatever it means)
     - Prefer the ones with more data and the more popular ones
  8. The Text: First Attempt
     - Text Query (now Match Query)
     - phrase_prefix type: all words in the input, with matching of prefixes (m, ma, mat, ...), in the same order of words
     - operator: and
     - not_analyzed name field (not broken down into words)
  9. The Text: First Attempt
     - slop parameter: allows a change in the order of words and skipped words
     - Examples: "matrix revolutions", "revolutions matrix", "matrix first revolutions"
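The first attempt can be sketched as a request body built in Python. This is a minimal sketch, assuming the 2012-era query DSL in which the match query accepts a `type` of `phrase_prefix` (later Elasticsearch versions moved this to a separate `match_phrase_prefix` query). The helper name `build_name_query` and the sample input are illustrative, not from the talk.

```python
import json


def build_name_query(user_input, slop=0):
    """Build a phrase_prefix match query on the `name` field.

    `slop` controls how far words may be reordered or skipped.
    """
    return {
        "query": {
            "match": {
                "name": {
                    "query": user_input,
                    "type": "phrase_prefix",
                    "operator": "and",
                    "slop": slop,
                }
            }
        }
    }


# The user has typed "matrix rev" so far; allow some reordering.
body = build_name_query("matrix rev", slop=2)
print(json.dumps(body, indent=2))
```

The printed JSON is what would be sent as the search request body.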
  10. The Sorting: First Attempt
      - Default scoring considers only the occurrence of the text in documents
      - We also want other properties of the document to count
      - Custom Score Query: define a script for scoring
      - script: _score * doc["rating"].value
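As a rough illustration, the script above could be attached to any text query by wrapping it in a `custom_score` query. This sketch assumes the 2012-era query DSL; the wrapper function `with_rating_score` and the inner query are made up for the example.

```python
def with_rating_score(inner_query):
    """Wrap a query so its text score is multiplied by the document's rating."""
    return {
        "query": {
            "custom_score": {
                "query": inner_query,
                # Script taken from the slide; `rating` is a numeric field.
                "script": '_score * doc["rating"].value',
            }
        }
    }


# Hypothetical inner query; any text query could be used here.
rated = with_rating_score({"match": {"name": "matrix"}})
```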
  11. The Rating
      - Allows us to prefer more popular titles
      - External: top lists, links, etc.
      - Internal: usage data from the system
      - Problem for newly added titles: lack of data of both types
  12. The Tuning of Rating
      - Get rid of external data
      - Only score the completeness of each document
      - Release year
      - script: 3 * log(_score) + 1 * log(doc["year"].date.year - 1880) + 0.75 * log(doc["watched_count"].value + 1)
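To get a feel for how the tuned script weighs its inputs, the same arithmetic can be reproduced in plain Python. This is only a sketch: it assumes `log` is the natural logarithm, that the text score is positive, and that the release year is later than 1880. The function name and the sample numbers are invented for illustration.

```python
import math


def tuned_score(text_score, year, watched_count):
    """Mirror the slide's scoring script outside Elasticsearch."""
    return (
        3 * math.log(text_score)
        + 1 * math.log(year - 1880)
        + 0.75 * math.log(watched_count + 1)
    )


# Two titles with the same text score: an old, unwatched one
# and a recent, frequently watched one.
same_text = 2.0
old_unwatched = tuned_score(same_text, 1950, 0)
recent_popular = tuned_score(same_text, 2012, 5000)
print(old_unwatched, recent_popular)
```

Because every term is a logarithm, large watch counts and recent years raise the score without letting one factor swamp the text match.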
  13. The Tuning of Query
      - Name field analyzed, with an edgeNGram filter

          index:
            analysis:
              filter:
                my_ngram:
                  type: edgeNGram
                  min_gram: 1
                  max_gram: 11
                  side: front
              analyzer:
                my_analyzer:
                  type: custom
                  tokenizer: standard
                  filter: [lowercase, asciifolding, my_ngram]
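What this analyzer puts into the index can be approximated in a few lines of Python. This is a simplified sketch, not Elasticsearch's implementation: whitespace splitting stands in for the standard tokenizer, and Unicode decomposition stands in for the asciifolding filter. All function names are illustrative.

```python
import unicodedata


def ascii_fold(text):
    """Strip diacritics by removing combining marks (rough asciifolding stand-in)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def edge_ngrams(token, min_gram=1, max_gram=11):
    """Return the front-anchored prefixes of a token, from min_gram to max_gram."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def analyze(name):
    """Lowercase, fold to ASCII, split on whitespace, emit edge n-grams."""
    tokens = ascii_fold(name.lower()).split()
    return [gram for tok in tokens for gram in edge_ngrams(tok)]


print(analyze("The Matrix"))
# ['t', 'th', 'the', 'm', 'ma', 'mat', 'matr', 'matri', 'matrix']
```

With prefixes stored at index time, a partially typed word such as "mat" matches as an ordinary term.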
  14. The AKAs
      - "Also known as": names of a title in different countries
      - A lot of additional data, sometimes only noise
      - The original is still the most important
  15. The AKAs
      - Array of AKAs: problems with scoring of short names
      - Nested AKA documents: the query does not return the nested document which matched
      - AKA document is a child of the title
        - AKAs have their own information (original, country, slug)
        - Top Children Query: which AKA matched
        - Another query with Ids Filter: get titles
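The two request bodies for the parent/child variant might look roughly like this. This is a sketch under several assumptions: the 2012-era DSL with `top_children`, `filtered` and the `ids` filter; a child type called `aka` with a `name` field (both names are guesses); and helper names invented here. The slide does not say how the results of the first query feed the second, so that wiring is left out.

```python
def aka_query(user_input):
    """First step: run the text query against child AKA documents."""
    return {
        "query": {
            "top_children": {
                "type": "aka",  # hypothetical child type name
                "query": {"match": {"name": user_input}},
                "score": "max",
            }
        }
    }


def titles_by_ids(title_ids):
    """Second step: fetch title documents for a known set of IDs."""
    return {
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {"ids": {"values": list(title_ids)}},
            }
        }
    }


first = aka_query("matrix")
second = titles_by_ids(["101", "102"])
```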
  16. The Sorting: Second Attempt
      - Custom Filters Score Query: apply a set of filters; each filter boosts documents which pass its condition
      - boost parameter of a filter: differentiates the importance of that filter
      - score_mode: sum or product of boost values
  17. The Sorting: Used Score Filters
      - Release date (for a TV show, the last episode) in the last 6 months
      - Release date in the next 3 months
      - "original" AKA
      - Has all important categories filled
      - Not the Short genre
      - Not a TV movie
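How filter boosts combine can be simulated without Elasticsearch. The sketch below is not the Elasticsearch API: it models each filter as a Python predicate with a boost and combines the boosts of the filters a document passes by sum or product. The field names, the boost values, and the choice of 1.0 when no filter matches are all assumptions made for the example.

```python
from functools import reduce
import operator


def filter_boost(doc, filters, score_mode="sum"):
    """Combine the boosts of every filter whose predicate the document passes."""
    boosts = [boost for predicate, boost in filters if predicate(doc)]
    if not boosts:
        return 1.0  # assumption: no matching filter leaves the score unchanged
    if score_mode == "sum":
        return float(sum(boosts))
    if score_mode == "product":
        return float(reduce(operator.mul, boosts, 1.0))
    raise ValueError("unknown score_mode: %r" % score_mode)


# Hypothetical filters modelled on three entries from the slide.
filters = [
    (lambda d: d.get("genre") != "Short", 2.0),
    (lambda d: not d.get("tv_movie", False), 1.5),
    (lambda d: d.get("has_all_categories", False), 3.0),
]

feature = {"genre": "Action", "tv_movie": False, "has_all_categories": True}
short_tv = {"genre": "Short", "tv_movie": True}

print(filter_boost(feature, filters, "sum"))      # 6.5
print(filter_boost(feature, filters, "product"))  # 9.0
print(filter_boost(short_tv, filters, "sum"))     # 1.0
```

Product mode rewards documents that pass many filters far more steeply than sum mode, which is the main trade-off between the two.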
  18. The Sorting: Short Input
      - Special case: 1-3 letters
      - An exact match is very rare
      - Should work after typing the first letter
      - Only titles from this year
      - 3 letters: also titles from the near future and the previous year
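The short-input rule can be expressed as a small helper that picks the range of release years to search. This is a sketch under assumptions: "near future" is read as "next year", and the function name and return shape are invented here.

```python
def year_window(user_input, current_year):
    """Return (first_year, last_year) to restrict to, or None for longer input."""
    n = len(user_input.strip())
    if n == 0 or n > 3:
        return None  # not the short-input special case
    if n < 3:
        return (current_year, current_year)  # only titles from this year
    # Exactly 3 letters: widen to the previous year and the near future
    # (assumed here to mean next year).
    return (current_year - 1, current_year + 1)


print(year_window("m", 2012))       # (2012, 2012)
print(year_window("mat", 2012))     # (2011, 2013)
print(year_window("matrix", 2012))  # None
```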
  19. The Year in Input
      - Matrix 1999
      - Matrix Reloaded (2003)
      - Matrix 2000-: released up to 2000
      - Matrix 2000+: released since 2000
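One way to recognize these four input forms is a single regular expression. This is a sketch, not the parser used by the service: the pattern, the function name, and the decision to treat a bare year with no other text as a title are assumptions.

```python
import re

# Title text, then a year (optionally in parentheses), then an optional + or -.
_YEAR = re.compile(
    r"^(?P<text>.*?)\s*\(?(?P<year>(?:18|19|20)\d{2})\)?(?P<mod>[+-]?)\s*$"
)


def parse_year(user_input):
    """Split input into (title_text, year_from, year_to); None means unbounded."""
    m = _YEAR.match(user_input)
    if not m or not m.group("text").strip():
        # No trailing year, or the whole input is a year (could be a title).
        return user_input.strip(), None, None
    text = m.group("text").strip()
    year = int(m.group("year"))
    if m.group("mod") == "+":
        return text, year, None  # released since `year`
    if m.group("mod") == "-":
        return text, None, year  # released up to `year`
    return text, year, year      # released in exactly `year`


print(parse_year("Matrix 1999"))             # ('Matrix', 1999, 1999)
print(parse_year("Matrix Reloaded (2003)"))  # ('Matrix Reloaded', 2003, 2003)
print(parse_year("Matrix 2000-"))            # ('Matrix', None, 2000)
print(parse_year("Matrix 2000+"))            # ('Matrix', 2000, None)
```

The resulting bounds could then be turned into a range condition on the release year.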
  20. One More Thing: Advanced Search
      - Titles also have data about their usage
      - Watched by Friends filter: shows titles with the IDs of your friends in the proper field (TermsFilter([IDS]))
      - Not Watched filter: shows titles in which your ID is absent (NotFilter(TermFilter(ID)))
      - Combination: titles to watch to catch up with friends
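Expressed as raw filter JSON, the combination might look like the sketch below. `TermsFilter`, `NotFilter` and `TermFilter` on the slide appear to be client-library wrapper classes; here the equivalent `terms`, `not`, `term` and `and` filters of the 2012-era DSL are written out as plain dicts. The field name `watched_by` and the function name are hypothetical.

```python
def catch_up_filter(my_id, friend_ids, field="watched_by"):
    """Titles watched by at least one friend but not yet by me."""
    watched_by_friends = {"terms": {field: list(friend_ids)}}
    not_watched_by_me = {"not": {"term": {field: my_id}}}
    return {"and": [watched_by_friends, not_watched_by_me]}


flt = catch_up_filter(my_id=7, friend_ids=[3, 12, 42])
print(flt)
```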
  21. The End
      - Thanks
      - Tom Sirn, @junckritter