1. Searching for The Matrix in haystack (with Elasticsearch)
   Synopsi.TV case study
   Tom Sirn, @junckritter
   Pyvo/Rubyslava, November 2012
2. The Environment
   - Recommendation service for movies and TV shows
   - People mark titles they have watched (check-in) and rate them
   - Get recommendations
   - Make Watch Later or other-purpose lists
   - Search (to check in, add to a list, share, etc.)
3. The Problem
   - Input box for search at the top of the web page
   - Many movies and TV shows in the database
   - Lots of them have similar titles and use similar words
   - Some are more likely to be searched for
   - Little input information: 3-4 letters
   - Autocomplete, not only exact match
4. The Red Pill
5. The Blue Pill
6. The Tool
   - Elasticsearch: designed for searching in documents
   - Based on Lucene, the de facto standard
   - Young yet feature-rich
   - Quick development (despite 1 core developer)
   - Company recently founded, $10M funding in the A round
7. The (Wannabe) Solution
   - Differentiate titles
   - Have cover, plot, cast, directors
   - Year
   - Popularity (whatever it means)
   - Prefer ones with more data, more popular
8. The Text: First Attempt
   - Text Query (now Match Query)
   - phrase_prefix type: all words in the input, with matching of prefixes (m, ma, mat, ...), same order of words
   - operator and
   - not_analyzed name field (not broken down into words)
9. The Text: First Attempt
   - slop parameter: allows a change of word order and skipped words
   - "matrix revolutions"
   - "revolutions matrix"
   - "matrix first revolutions"
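
A minimal sketch of what the request body for this first attempt might look like, written as a Python dict. It assumes the legacy (Elasticsearch 0.x era) match query syntax with `type: phrase_prefix`; the helper name `title_prefix_query` and the default `slop` value are hypothetical, and the field name `name` follows the slides.

```python
import json


def title_prefix_query(text, slop=0):
    """Build a phrase_prefix match query body (legacy 0.x-style syntax)."""
    return {
        "query": {
            "match": {
                "name": {
                    "query": text,
                    "type": "phrase_prefix",  # last word is treated as a prefix
                    "operator": "and",        # listed on the slide
                    "slop": slop,             # tolerated word moves / skipped words
                }
            }
        }
    }


if __name__ == "__main__":
    # With some slop, "matrix rev" may also reach "matrix first revolutions".
    print(json.dumps(title_prefix_query("matrix rev", slop=2), indent=2))
```

Later Elasticsearch releases expose the same behaviour through a separate `match_phrase_prefix` query, so the exact keys depend on the version in use.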
10. The Sorting: First Attempt
   - Default scoring considers only the occurrence of text in documents
   - We also want other properties of the document to count
   - Custom Score Query
   - Define a script for scoring
     script: _score * doc["rating"].value
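
As an illustration, the Custom Score Query can be thought of as a wrapper around the text query. The sketch below builds such a body as a Python dict, assuming the legacy `custom_score` query of Elasticsearch 0.x (later replaced by `function_score`); the helper name `custom_score_query` is hypothetical.

```python
import json


def custom_score_query(inner_query, script="_score * doc['rating'].value"):
    """Wrap a query so that the script computes the final score."""
    return {
        "query": {
            "custom_score": {
                "query": inner_query,  # produces the text-based _score
                "script": script,      # combines _score with a document field
            }
        }
    }


if __name__ == "__main__":
    text_query = {"match": {"name": {"query": "matrix", "type": "phrase_prefix"}}}
    print(json.dumps(custom_score_query(text_query), indent=2))
```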
11. The Rating
   - Allows preferring more popular titles
   - External: top lists, links, etc.
   - Internal: usage data from the system
   - Problem for newly added titles: lack of data of both types
12. The Tuning of Rating
   - Get rid of external data
   - Only score completeness of each document
   - Release year
     script: 3 * log(_score) + 1 * log(doc["year"].date.year - 1880) + 0.75 * log(doc["watched_count"].value + 1)
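
To get a feel for how the three terms of the script trade off, the formula can be evaluated outside Elasticsearch. This is a sketch under two assumptions: the operator lost between `year` and `1880` is a subtraction, and `log` is the natural logarithm. The function name `tuned_score` is hypothetical.

```python
import math


def tuned_score(text_score, year, watched_count):
    """Evaluate the slide's formula; needs text_score > 0 and year > 1880."""
    return (
        3 * math.log(text_score)               # text relevance dominates
        + 1 * math.log(year - 1880)            # newer titles score higher
        + 0.75 * math.log(watched_count + 1)   # +1 keeps unwatched titles valid
    )


if __name__ == "__main__":
    for year, watched in [(1999, 0), (1999, 5000), (2012, 0)]:
        print(year, watched, round(tuned_score(2.0, year, watched), 3))
```

Because every term goes through a logarithm, a large difference in watch counts or years shifts the score only moderately, so the text match keeps most of the weight.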
13. The Tuning of Query
   - Name field analyzed, edgeNGram filter

     index:
       analysis:
         filter:
           my_ngram:
             type: edgeNGram
             min_gram: 1
             max_gram: 11
             side: front
         analyzer:
           my_analyzer:
             type: custom
             tokenizer: standard
             filter: [lowercase, asciifolding, my_ngram]
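
The effect of this analyzer chain can be approximated in plain Python to see which tokens end up in the index. This is only a sketch: whitespace splitting stands in for the standard tokenizer, Unicode decomposition stands in for the asciifolding filter, and the function names are hypothetical.

```python
import unicodedata


def ascii_fold(text):
    """Strip diacritics, roughly like the asciifolding filter."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def edge_ngrams(token, min_gram=1, max_gram=11):
    """Return front-anchored prefixes of the token (side: front)."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def analyze(text):
    """Lowercase, fold, split on whitespace, then expand into edge n-grams."""
    tokens = ascii_fold(text.lower()).split()
    return [gram for token in tokens for gram in edge_ngrams(token)]


if __name__ == "__main__":
    print(analyze("The Matrix"))
```

With these prefixes stored at index time, a partially typed word such as `mat` can match a whole indexed token, which is what makes matching work from the first typed letters.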
14. The AKAs
   - "Also known as": names of a title in different countries
   - Lots of additional data, sometimes only noise
   - The original is still the most important
15. The AKAs
   - Array of AKAs: problems with scoring of short names
   - Nested AKA documents: the query does not return the nested document which matched
   - AKA document is a child of the title: has its own information (original, country, slug)
   - Top Children Query: which AKA matched
   - Another query with Ids Filter: get titles
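
The slides do not show how the two requests were wired together, so the following only sketches the shape each request body might take, assuming the legacy `top_children` query and `ids` filter of Elasticsearch 0.x. The child type name `aka`, the `factor` value, and both helper names are hypothetical.

```python
import json


def matching_akas_query(text, factor=5):
    """First request: score parents through their matching AKA children."""
    return {
        "query": {
            "top_children": {
                "type": "aka",                       # hypothetical child type
                "query": {"match": {"name": text}},  # runs against AKA documents
                "score": "max",                      # best child decides the score
                "factor": factor,                    # how many child hits to fetch
            }
        }
    }


def titles_by_ids_query(title_ids):
    """Second request: fetch the titles themselves by their IDs."""
    return {
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {"ids": {"values": list(title_ids)}},
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(matching_akas_query("matrix"), indent=2))
    print(json.dumps(titles_by_ids_query(["603", "604"]), indent=2))
```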
16. The Sorting: Second Attempt
   - Custom Filters Score Query: apply a set of filters; each filter boosts documents which pass its condition
   - boost parameter of a filter: differentiates the importance of that filter
   - score_mode: sum or product of boost values
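
A sketch of how such a request body might be assembled, assuming the legacy `custom_filters_score` query of Elasticsearch 0.x. To my recollection the sum and product modes were spelled `total` and `multiply` there, which should be checked against the documentation of the version in use. The helper name, field names, and boost values are hypothetical.

```python
import json


def boosted_query(inner_query, boosted_filters, score_mode="total"):
    """Attach (filter, boost) pairs to a query; matching filters raise the score."""
    return {
        "query": {
            "custom_filters_score": {
                "query": inner_query,
                "filters": [
                    {"filter": flt, "boost": boost} for flt, boost in boosted_filters
                ],
                "score_mode": score_mode,  # how boosts of several matches combine
            }
        }
    }


if __name__ == "__main__":
    filters = [
        ({"term": {"original": True}}, 2.0),           # hypothetical field
        ({"not": {"filter": {"term": {"genres": "short"}}}}, 1.5),
    ]
    print(json.dumps(boosted_query({"match": {"name": "matrix"}}, filters), indent=2))
```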
17. The Sorting: Used Score Filters
   - Release date (in case of a TV show, the last episode) in the last 6 months
   - Release date in the next 3 months
   - Original AKA
   - Has all important categories filled
   - Not Short genre
   - Not TV movie
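
The two release-date conditions can be expressed as range filters whose bounds are computed on the client. The sketch below does the month arithmetic in Python and assumes a hypothetical date field called `release_date`; the function names are illustrative as well.

```python
import calendar
from datetime import date


def shift_months(day, months):
    """Move a date by whole months, clamping to the end of shorter months."""
    index = day.year * 12 + (day.month - 1) + months
    year, month0 = divmod(index, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(day.day, last_day))


def released_recently_filter(today, field="release_date"):
    """Range filter: released within the last 6 months."""
    start = shift_months(today, -6)
    return {"range": {field: {"gte": start.isoformat(), "lte": today.isoformat()}}}


def released_soon_filter(today, field="release_date"):
    """Range filter: released within the next 3 months."""
    end = shift_months(today, 3)
    return {"range": {field: {"gt": today.isoformat(), "lte": end.isoformat()}}}


if __name__ == "__main__":
    today = date(2012, 11, 15)
    print(released_recently_filter(today))
    print(released_soon_filter(today))
```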
18. The Sorting: Short Input
   - Special case: 1-3 letters
   - Exact match is very rare
   - Should work after the first typed letter
   - Only titles from this year
   - With 3 letters, also titles from the near future and the previous year
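
One way to read the short-input rule is as a year window chosen by input length. The sketch below encodes that reading; treating "near future" as the following calendar year is an assumption, and the function name `year_window` is hypothetical.

```python
def year_window(text, current_year):
    """Return an inclusive (from_year, to_year) window, or None for longer input."""
    length = len(text.strip())
    if length == 0 or length > 3:
        return None  # the normal search path applies
    if length < 3:
        return (current_year, current_year)  # only titles from this year
    # Exactly 3 letters: widen to the previous year and the (assumed) next year.
    return (current_year - 1, current_year + 1)


if __name__ == "__main__":
    for typed in ["m", "ma", "mat", "matr"]:
        print(repr(typed), year_window(typed, 2012))
```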
19. The Year in Input
   - Matrix 1999
   - Matrix Reloaded (2003)
   - Matrix 2000-: released up to 2000
   - Matrix 2000+: released since 2000
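
The four input forms can be recognised with a single regular expression. This is a sketch of one possible parser, not the one used by Synopsi.TV; the function name `parse_year` and the returned tuple layout (title, from year, to year) are hypothetical.

```python
import re

# Title text, then a four-digit year (optionally in parentheses), then + or -.
_YEAR_SUFFIX = re.compile(
    r"^(?P<title>.*?)\s*\(?(?P<year>(?:18|19|20)\d{2})\)?(?P<op>[+-]?)\s*$"
)


def parse_year(text):
    """Split input into (title, from_year, to_year); missing bounds are None."""
    match = _YEAR_SUFFIX.match(text)
    if not match or not match.group("title"):
        return (text.strip(), None, None)  # no year, or the input is only a number
    title = match.group("title")
    year = int(match.group("year"))
    if match.group("op") == "+":
        return (title, year, None)   # released since that year
    if match.group("op") == "-":
        return (title, None, year)   # released up to that year
    return (title, year, year)       # exactly that year


if __name__ == "__main__":
    for typed in ["Matrix 1999", "Matrix Reloaded (2003)", "Matrix 2000-", "Matrix 2000+"]:
        print(typed, "->", parse_year(typed))
```

A parser like this misreads titles that end in a year-like number, so a real implementation would probably fall back to a plain title search when the year-restricted search finds nothing.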
20. One More Thing: Advanced Search
   - Titles also have data about their usage
   - Watched by Friends filter: shows titles with IDs of your friends in the proper field (TermsFilter([IDS]))
   - Not Watched filter: shows titles in which your ID is absent (NotFilter(TermFilter(ID)))
   - Combination: titles to watch to catch up with friends
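
In raw query DSL terms, the combination might look like the following sketch, assuming the legacy `and`, `terms`, `not`, and `term` filters of Elasticsearch 0.x. The field name `watched_by` and the helper name `catch_up_filter` are hypothetical.

```python
import json


def catch_up_filter(user_id, friend_ids, field="watched_by"):
    """Titles watched by at least one friend but not by the user."""
    return {
        "and": [
            {"terms": {field: list(friend_ids)}},             # Watched by Friends
            {"not": {"filter": {"term": {field: user_id}}}},  # Not Watched
        ]
    }


if __name__ == "__main__":
    print(json.dumps(catch_up_filter(42, [7, 8, 9]), indent=2))
```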