38
Dailymotion Elasticsearch June 10, 2014 Meetup Elasticsearch France #7 Cédric Hourcade Core developer at Dailymotion twitter.com/hced

Elasticsearch at Dailymotion

Embed Size (px)

DESCRIPTION

A quick overview of Elasticsearch usage at Dailymotion for video search Talk given at Elasticsearch Meetup France #7 June 10, 2014 http://www.meetup.com/elasticsearchfr/events/171946592/

Citation preview

Page 1: Elasticsearch at Dailymotion

DailymotionElasticsearch

June 10, 2014Meetup Elasticsearch

France #7Cédric HourcadeCore developer at Dailymotiontwitter.com/hced

Page 2: Elasticsearch at Dailymotion

> video search @ Dailymotion

1 > cluster overview and indexation

2 > query and score

3 > sharding

4 > benchmarks, tuning

5 > questions?

Page 3: Elasticsearch at Dailymotion

elasticsearch clustercluster and indexation

Page 4: Elasticsearch at Dailymotion

search clustercluster and indexation

> 10 nodes for video search

_ one main video index

_ 5 shards with 1 replica

_ nodes : 32 cores, 48 gb RAM, 15k disks

> 600 to 1000 search requests per second

> end-to-end response time < 40 ms

Page 5: Elasticsearch at Dailymotion

search clustercluster and indexation

elasticsearch cluster

10 nodes

mysql

farm

indexconstantly

Page 6: Elasticsearch at Dailymotion

search clustercluster and indexation

elasticsearch cluster

10 nodes

mysql

farm

elasticsearch indexer

2 nodesindexconstantly

filter data for search

update if hash changed

hash data

Page 7: Elasticsearch at Dailymotion

searchquery and score

Page 8: Elasticsearch at Dailymotion

"query" : { "function_score": { "query": { "custom_common": { … } }, "script_score": { "script": "custom_scorer", "lang": "native" }, "scoring_score": "multiply" } }

searchquery and score

> custom query

x

> custom scorer

=

> score

Page 9: Elasticsearch at Dailymotion

searchquery and score

Scorer> custom scorer

_ only slightly alter the query score

_ take into account: recency, popularity, etc.

> boosted filters and scripts when testing

> native java for performance

Page 10: Elasticsearch at Dailymotion

searchquery and score

Query> we need to keep control of the query base score

> problem is our text content is thin _ short title, a few tags _ a more or less relevant description

> bare bones TF-IDF may not be suitable _ TF not that relevant to us

Page 11: Elasticsearch at Dailymotion

searchquery and score

> BM25: reduce importance of

document length

> why common terms query

_ increase performance

_ ignore popular terms when searching

_ but still use them for scoring

_ like a real time specialized stop words list

similarity: my_bm25: type: BM25 b: 0.001

Page 12: Elasticsearch at Dailymotion

> ignore inexistent terms in query

> boost repeated terms (TF) only if repeated in query

a doc titled “A A A game” has a better score than “A game” only when explicitly searching for “A A A”

> boost term by position in query and documents

searchquery and score

brown fox zerzer brown fox zerzer

the quick brown fox jumps.

^1.1 ^1.07 ^1.05 ^1.03 ^1.02

Page 13: Elasticsearch at Dailymotion

> keep both stemmed and original terms

> score them with dis_max (tie_breaker = 0)

> disable coord factor for consistent scoring

I like dogsi like dogs dog token

1 2 3 position

searchquery and score

"field": "dogs" "dis_max": { "tie_breaker" : 0, "queries" : [ { "term": { "field": "dogs" } }, { "term": { "field": "dog" } } ...

Page 14: Elasticsearch at Dailymotion

shardingwhat suits us

Page 15: Elasticsearch at Dailymotion

shardingwhat suits us

> less shards make a query slower

> but not 16 times slower (112 ms vs 12 ms)

index / 16 shards index / 1 shard

1 ms request handling 1 ms request handling

10 ms

shard 0 (9ms)

shard 1(10ms)

shard 2(6ms)

shard 3(9ms)

110 ms shard 0

shard 4(10ms)

shard 5(10ms)

shard 6(9ms)

shard 7(10ms)

shard 8(10ms)

shard 9(8ms)

shard 10(5ms)

shard 11(10ms)

shard 12(7ms)

shard 13(7ms)

shard 14(10ms)

shard 15(10ms)

1 ms return result 1 ms return result

12 ms 112 ms

Page 16: Elasticsearch at Dailymotion

shardingwhat suits us

> takes more resources

> everything runs at 100 % for each query

> less requests per second for the same hardware

9 ms + 10 ms + 6 ms+ (…)

=140 ms

shard 0 (9ms)

shard 1(10ms)

shard 2(6ms)

shard 3(9ms)

110 ms shard 0

shard 4(10ms)

shard 5(10ms)

shard 6(9ms)

shard 7(10ms)

shard 8(10ms)

shard 9(8ms)

shard 10(5ms)

shard 11(10ms)

shard 12(7ms)

shard 13(7ms)

shard 14(10ms)

shard 15(10ms)

140 ms spent by the shards 110 ms spent

Page 17: Elasticsearch at Dailymotion

shardingwhat suits us

Before> we used to have 40 shards on 18 nodes

_ ~2 millions docs per shard

_ 3 gb by shards

_ ~ 120 gb total index size

> cluster was very loaded

_ every single query was hitting all the nodes

_ response times could have been better

Page 18: Elasticsearch at Dailymotion

shardingwhat suits us

After> we now have 5 shards on 10 nodes

> cluster run smoother, less load

_ only 5 nodes involved per query

_ it handles many times more requests

Page 19: Elasticsearch at Dailymotion

shardingwhat suits us

less data! _ ~10 millions docs per shard _ 4 gb by shards _ ~ 25 gb total index size

> only data we need right now _ { "_source" : false }

_ round numbers and dates

_ { "precision_step" : 2147483647 }

> less updates, faster indexation, rebalance, merges...

Page 20: Elasticsearch at Dailymotion

shardingwhat suits us

drawbacks

> queries taken individually are slower…

> but only marginally slower _ eg: 7 ms instead of 5 ms

> but some slower queries became more noticeable

Page 21: Elasticsearch at Dailymotion

how do we testbenchmarks, tuning

Page 22: Elasticsearch at Dailymotion

how do we testbenchmarks, tuning

load test

_ benchmark with Tsung _ dedicated test cluster

_ run real queries, lots of them

_ aim for our expected load

_ monitor everything

_ reshard, change schema

_ set masters, data-only nodes...

repeat

Page 23: Elasticsearch at Dailymotion

how do we testbenchmarks, tuning

use warmers

> warm segments after each merge

_ prevent slow first queries

> set it up to build cache for the filters we use

> zero reasons for not using them

Page 24: Elasticsearch at Dailymotion

how do we testbenchmarks, tuning

"constant_score": {

"filter": {

"term": {

"visible": "yes"

}

}

}

"constant_score": { "filter": { "bool": { "_cache": true, "must": [ { "term": { "visible": "yes" } }, { "range": { "age": { "from": 18, "to": 30 } } } ] ...

Page 25: Elasticsearch at Dailymotion

how do we testbenchmarks, tuning

query testing

> to test a particular query raw performance

_ one index, one shard

_ millions of simple documents

_ merged in one segment

_ with some deletes

Page 26: Elasticsearch at Dailymotion

?how do we test

benchmarks, tuning

{ "query": { "filtered": { "query": { "match": { "title": "some very popular terms" } }, "filter": { "term": { "user": "cedric" } } } }}

Page 27: Elasticsearch at Dailymotion

!how do we test

benchmarks, tuning

{ "query": { "filtered": { "strategy": "leap_frog", "query": { "match": { "title": "some very popular terms" } }, "filter": { "term": { "user": "cedric" } } } }}

Page 28: Elasticsearch at Dailymotion

how do we testbenchmarks, tuning

> we also use Elasticsearch to just filter and sort

> these queries match millions of documents

_ they are slow

_ even when terms are cached

_ iterating, scoring and sorting is tedious

Page 29: Elasticsearch at Dailymotion

how do we testbenchmarks, tuning

query

"sort": { "created": "desc" },"query": { "bool": { "must": [ { "term": { "public": true } } ] }}

Page 30: Elasticsearch at Dailymotion

how do we testbenchmarks, tuning

query result

"sort": { "created": "desc" },"query": { "bool": { "must": [ { "term": { "public": true } } ] }}

"took": 695"hits": { "total": 79582599}

Page 31: Elasticsearch at Dailymotion

how do we testbenchmarks, tuning

> we know our data

> we can help our query

Page 32: Elasticsearch at Dailymotion

how do we testbenchmarks, tuning

query

"sort": { "created": "desc" },"query": { "bool": { "must": [ { "term": { "public": true } }, { "range": { "created": { "from": "2014-06-03" } } } ]

...

> with a range filter

on the sorted field

Page 33: Elasticsearch at Dailymotion

how do we testbenchmarks, tuning

query result

"sort": { "created": "desc" },"query": { "bool": { "must": [ { "term": { "public": true } }, { "range": { "created": { "from": "2014-06-03" } } } ]

...

"took": 15"hits": { "total": 92312}

// Same top docs// returned

Page 34: Elasticsearch at Dailymotion

how do we testbenchmarks, tuning

> what if there are not enough hits?

_ re-run the query without the filter

> we use a custom query to do just that!

_ breaks once it matches enough hits

_ runs at segment level

_ no round-trips

Page 35: Elasticsearch at Dailymotion

"sort": { "created": "desc" },"query": { "break_once": {

"minimum_hits": 100, "query": { { "term": { "public": true } },

"filters": [

{ "range": { "created": { "from": "2014-06-03" } } }, { "range": { "created": { "from": "2014-05-03” } } }, { "range": { "created": { "from": "2014-01-03" } } }

]}

}

how do we testbenchmarks, tuning

stop once there are

enough hits

Page 36: Elasticsearch at Dailymotion
Page 37: Elasticsearch at Dailymotion

thank you

> index only what you need now

> shard today, reshard tomorrow

> benchmark to find what suits you best

> test and optimize your queries

Page 38: Elasticsearch at Dailymotion

thank you

> questions?