Upload
nguyenque
View
234
Download
2
Embed Size (px)
Citation preview
Apache: Big Data 2015
Combining Solr and Elasticsearch to Improve Autosuggestion on Mobile Local Search
Toan Vinh Luu, PhD Senior Search Engineer local.ch AG
Apache: Big Data 2015
In this talk
• Requirements of an autosuggestion feature
• Autosuggestion architecture
• Evaluation
Apache: Big Data 2015
local.ch • Local search engine in Switzerland (web, mobile)
• Each month: – > 4 millions unique users – > 8 millions queries on mobile (iOS, android,…)
• Users search for: – Services (e.g “restaurant zurich”) – Resident information (e.g “toan luu”) – Phone number (e.g. 079574xxyy) – Addresses, weather, – ...
Apache: Big Data 2015
Why autosuggestion is important?
User taps on the phone 8 times instead of 34 times to get to the result list when searching for “Electric installation Wallisellen”
Apache: Big Data 2015
What should we suggest to user?
Apache: Big Data 2015
Popular data suggestion
Apache: Big Data 2015
Popular queries suggestion
>2000 queries/month for “cablecom” which have only 1 entry
“mc donalds” has less entries than “muller” but is queried >10x
Apache: Big Data 2015
Query history suggestion • 9% mobile queries are
historical queries.
• 38% users search by a query in the past
Apache: Big Data 2015
Spellchecker suggestion >700’000 mistakes per month on mobile (9%)
Apache: Big Data 2015
Detail entry suggestion
Apache: Big Data 2015
Special information suggestion
Apache: Big Data 2015
Autosuggestion Architecture
Aut
osug
gest
AP
I/S
earc
h A
PI SuggestData
component
Query history component
Popular query component
Spellchecker component
Index
Index
Index
Index
Index
Query log
Popular query processor
Local.ch Database
Apache: Big Data 2015
Data suggestion • Pre-generating suggested queries from the data • Entry:
– Name: Subito – Category: Restaurant – Street: Konradstrasse – Zipcode: 8005 – City: Zürich
Possible suggested queries: • Restaurant • Subito • Restaurant Zürich • Restaurant Subito • Restaurant Subito Zürich • Konradstrasse, 8005 Zürich • Zürich
Apache: Big Data 2015
Compute data popularity • Use faceting to get suggested queries sorted by frequency
• This approach guarantees near real-time suggestion
• Suggested queries are copied to 2 fields: – Search field used for matching, apply analyzers, tokenizer… – Facet field used for displaying and for computing frequency
• Example: – q=restaurant zu* => suggest “Restaurant Zürich” – q=zurich restau* => suggest “Restaurant Zürich”
Apache: Big Data 2015
Improvement • Faceting is expensive for short prefix match queries
⇒ Store suggested results in a Cache for all queries with 1, 2 characters
• Filter duplicated suggestion – “Restaurant Subito” and “Restaurant Subito Zürich” is
1 entity if they have same frequency => keep only 1 suggestion
• Store location, language with suggested queries to filter out irrelevant suggestion to user.
Apache: Big Data 2015
How do we process “popular queries” • Popular is just not high frequency!
• Depend on user’s language – 4 languages are used in Switzerland. Fail if we suggest “bäckerei” for a French
speaking user
• Depend on location – Fail if we suggest a hospital in Zurich for an user in Geneva
• Misspell – Fail if we suggest “zürich” and “züruch”
• Number of unique users – Fail if we suggest “toan” just because I searched my name thousands of times
• Blacklist – Fail if we suggest “f**k”, “pe**is”
Apache: Big Data 2015
Popular query processor • Preprocessing query log:
– Text normalization, stopword, blacklist, keep only queries return results…
• A query log item in elasticsearch index { "q": "restaurant", "language": "de", "lon": 8.50646, "lat": 47.4192, "datetime": "2014-06-02 11:10:07”, "user": “eeaad0c09abc41676c1c99530693”
}
Apache: Big Data 2015
Find candidate popular queries for each language
{ "query" : { "query_string" : { "query" : "language:" + language } }, "facets" : { "q" : { "terms" : { "field" : "q.untouched", "size" : TOP_POPULAR } } } }
Apache: Big Data 2015
Find number of unique users given a query
{ "query" : { "query_string" : { "query" : "q.untouched:" + query } },
"aggs": { "num_users": { "cardinality": { "field": "user" } } } }
Apache: Big Data 2015
Bounding box to limit popular queries given location
0
50
100
150
200
250
300
5.9
5
6.0
5
6.1
5
6.2
5
6.3
5
6.4
5
6.5
5
6.6
5
6.7
5
6.8
5
6.9
5
7.0
5
7.1
5
7.2
5
7.3
5
7.4
5
7.5
5
7.6
5
7.7
5
7.8
5
7.9
5
8.0
5
8.1
5
8.2
5
8.3
5
8.4
5
8.5
5
8.6
5
8.7
5
8.8
5
8.9
5
9.0
5
9.1
5
9.2
5
9.3
5
9.4
5
9.5
5
9.6
5
9.7
5
9.8
5
9.9
5
10.0
5
10.1
5
10.2
5
10.3
5
10.4
5
90% Popular query: Chuv (Centre Hospitalier Universitaire Vaudois)
Apache: Big Data 2015
45.81 45.88 45.95 46.02 46.09 46.16 46.23 46.3
46.37 46.44 46.51 46.58 46.65 46.72 46.79 46.86 46.93
47 47.07 47.14 47.21 47.28 47.35 47.42 47.49 47.56 47.63 47.7
47.77
5.9
5
6.0
4
6.1
3
6.2
2
6.3
1
6.4
6.4
9
6.5
8
6.6
7
6.7
6
6.8
5
6.9
4
7.0
3
7.1
2
7.2
1
7.3
7.3
9
7.4
8
7.5
7
7.6
6
7.7
5
7.8
4
7.9
3
8.0
2
8.1
1
8.2
8.2
9
8.3
8
8.4
7
8.5
6
8.6
5
8.7
4
8.8
3
8.9
2
9.0
1
9.1
9.1
9
9.2
8
9.3
7
9.4
6
9.5
5
9.6
4
9.7
3
9.8
2
9.9
1
10
10.0
9
10.1
8
10.2
7
10.3
6
10.4
5
Histogram of query “chuv” based on freq, longitude and latitude
Apache: Big Data 2015
46.5243,6.6397
46.52,6.63
46.53,6.64
Apache: Big Data 2015
Percentiles aggregation to find min, max value of querying location
"query" : { "match" : {"q" : {"query" :”chuv”}}
}, "aggs" : { "lat_outlier" : { "percentiles" : { "field" : "lat", "percents" : [5, 95] } }, "lon_outlier" : { "percentiles" : { "field" : "lon", "percents" : [5, 95] } } }
Apache: Big Data 2015
Popular query stored in Solr index
{ "q": "chuv", "lang": ["de”,"fr”, "en”], "users": 7435, "min_lat": 46.2245, "max_lon": 7.3332, "max_lat": 46.9909, "min_lon": 6.29637, "freq": 9524
}
Apache: Big Data 2015
Solr request to suggest popular query
q:ch* lang:en users: [100 TO *] min_lat:[* TO " + user_lat + "] min_lon:[* TO " + user_lon + "] max_lat:[" + user_lat + " TO *] max_lon:[" + user_lon + " TO *] & sort=freq desc
Apache: Big Data 2015
Evaluation
• Several metrics are used to evaluate autosuggestion feature – Number of typed characters to get to result list
• Average length of input: 10.0 chars • Average length of suggestion: 15.4 chars
– Number of clicks on suggested items – Average rank of clicked item
Apache: Big Data 2015
Number of clicks on suggested items since query history release
Release date
Apache: Big Data 2015
0
0.5
1
1.5
2
2.5
Average rank of clicked item
Release query history suggestion
Apache: Big Data 2015
Conclusion
• Requirement of an autosuggestion feature: – reduces number of user’s interactions with your
application to get search result.
• We can combine 2 search frameworks to bring better search experience to user: – Solr is efficient for querying, faceting and caching – Elasticsearch is efficient for big data aggregation
and query log storing
Apache: Big Data 2015
Contact information
• Search team at local.ch – [email protected] – [email protected] – [email protected]