www.elastic.co
Copyright Elastic 2015. Copying, publishing and/or distributing without written permission is strictly prohibited.
A typical “event-centric” deployment
Event stream
Time-based event indexes
Problem: some aggregations are expensive
We need to join all event-level data together at query-time.
Using web server log data, answer the question:
“how long on average do customers spend on my site?”
How to cripple Elasticsearch with a bucket explosion:
1. Ask a question about values that need to be derived from multiple documents (e.g. deriving a web session’s duration)
2. Make the joining key a high-cardinality field, e.g. something like “IP address”
3. Extra points if you use no routing of your documents, so that related content is spray-gunned across multiple shards
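To make the cost concrete, here is a minimal Python sketch of what the event-centric question forces at query time. It is an illustration, not the deck's code: the function name and the `(ip, epoch_seconds)` event shape are assumptions. The point is that one bucket must be held per distinct joining key, so memory grows with the cardinality of the key.

```python
def average_session_seconds(events):
    """events: iterable of (ip, epoch_seconds) pairs, in any order (assumed shape)."""
    first_seen = {}
    last_seen = {}
    for ip, ts in events:
        # One entry per distinct IP: this is the "bucket" that explodes.
        first_seen[ip] = min(ts, first_seen.get(ip, ts))
        last_seen[ip] = max(ts, last_seen.get(ip, ts))
    durations = [last_seen[ip] - first_seen[ip] for ip in first_seen]
    if not durations:
        return 0.0, 0
    return sum(durations) / len(durations), len(durations)


events = [
    ("10.0.0.1", 100),
    ("10.0.0.2", 105),
    ("10.0.0.1", 160),
    ("10.0.0.2", 125),
]
average, bucket_count = average_session_seconds(events)
```

With millions of distinct IPs spread across shards, every one of those per-key entries has to be built and merged before a single average can be reported.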
Solution: an “entity-centric” model
Usual stream of events
Time-based event indexes
Entity-based summary indexes
Periodic extracts sorted by entity ID and time
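The "sorted by entity ID and time" detail matters because it lets the extract be cut into per-entity bundles in a single streaming pass. A rough sketch in Python, assuming each event is a dict with hypothetical `entity_id` and `time` keys:

```python
from itertools import groupby
from operator import itemgetter


def bundle_by_entity(sorted_events, key="entity_id"):
    """Yield (entity_id, [events]) from events already sorted by entity ID and time.

    groupby only merges adjacent items, which is why the sort order is required.
    """
    for entity_id, group in groupby(sorted_events, key=itemgetter(key)):
        yield entity_id, list(group)


raw = [
    {"entity_id": "b", "time": 2, "page": "/cart"},
    {"entity_id": "a", "time": 1, "page": "/home"},
    {"entity_id": "a", "time": 3, "page": "/exit"},
]
# Stand-in for the periodic extract, which would arrive pre-sorted.
extract = sorted(raw, key=itemgetter("entity_id", "time"))
bundles = dict(bundle_by_entity(extract))
```

Each bundle can then be sent as one update to the matching entity document.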
Entity-centric queries
• Web sessions
  • “how long on average do my customers spend on my site?”
  • “which users behave like bots?”
  • “what is the most common exit page?”
• Bank accounts
  • “Does this new payment match the typical spending behaviour of bank account X?”
Entity-centric queries
• Buyers
  • “What do the users who bought product X also buy?”
  • “Which buyers behave like ‘shills’ and who are they promoting?”
• Cars
  • “Which cars drove long distances after failing a road worthiness test?”
Use case: GFORCES
• Analyses website traffic for retailers and manufacturers in the automotive industry
• Summarises many behaviours over time, e.g.
  • unique numbers of visitors per month
  • engagement: average session durations
• Faced scaling issues producing some results from raw events
Results of moving to entity-centric indexing
• Data store contains 150m events generated by 26m user sessions
• Event-centric aggregations were taking ~25 seconds
• Equivalent entity-centric aggregations take <50ms
• Simplified queries for common entry pages, common exit pages etc
Worked example
Amazon marketplace reviews: building profiles for reviewers
Play along! Code + data here: bit.ly/entcent
An “entity-centric” model
AmazonReviews (an event-centric index)
reviews.csv, loaded by loadEvents.sh
Review event fields
• rating
• seller
• reviewer
• date
AmazonReviewers (an entity-centric index)
buildEntities.sh
• Drops and creates reviewers index.
• Uses Python client to query and scroll list of reviews sorted by reviewerId and time
• Python uses the bulk indexing API to push _update requests to ~400k “Reviewer” documents; each request carries a bundle of that reviewer’s recent reviews
• Shard-side Groovy script collapses the multiple reviews into a single reviewer JSON document summarising behaviour
Reviewer entity fields
• positivity
• num sellers reviewed
• last 50 reviews
• profile (“newbie”, “fanboy” etc)
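As a rough illustration of the "push _update requests using the bulk indexing API" step, the sketch below renders per-reviewer bundles as newline-delimited bulk lines. It is not the deck's buildEntities.sh code. The action/body pairing follows the bulk API's update format, but the body shape (a file script reference with `params`, plus `scripted_upsert`) is an assumption modelled on 2.x-era Elasticsearch and differs between versions; the `reviewer` type name and the `events` parameter name are hypothetical.

```python
import json


def bulk_update_lines(bundles, index="reviewers", doc_type="reviewer", script_name="foo"):
    """Yield bulk-API lines: one action line plus one body line per reviewer.

    bundles: iterable of (reviewer_id, [review dicts]).
    script_name refers to a script stored on the cluster (name is an assumption).
    """
    for reviewer_id, reviews in bundles:
        action = {"update": {"_index": index, "_type": doc_type, "_id": reviewer_id}}
        body = {
            # Assumed body shape; check the update API docs for your version.
            "script": {"file": script_name, "params": {"events": reviews}},
            "scripted_upsert": True,  # run the script even when the document is new
            "upsert": {},
        }
        yield json.dumps(action)
        yield json.dumps(body)


lines = list(bulk_update_lines([("r1", [{"seller": 187, "rating": 5}])]))
payload = "\n".join(lines) + "\n"  # bulk bodies end with a trailing newline
```

Sending one update per reviewer, rather than one per review, is what keeps the number of shard-side script executions near the ~400k entity count.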
Anatomy of an entity indexing groovy script
Initialize if new document
Loop to consolidate latest events
Re-run risk profile logic
Load stored state
Store the script in ES_HOME/config/scripts/foo.groovy
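The steps labelled above can be sketched as follows. This is a Python stand-in for the shard-side Groovy script, in one plausible ordering of those steps, and not the script from the talk. The field names, the rating cut-off for "positive", the profile thresholds and the "regular" label are all assumptions made for illustration.

```python
def update_reviewer(doc, new_reviews, max_reviews=50):
    """Fold a bundle of new review events into one reviewer document (a dict)."""
    # Load stored state, initialising it if this is a new document.
    reviews = list(doc.get("reviews", []))

    # Loop to consolidate the latest events, keeping only the most recent ones.
    for review in new_reviews:
        reviews.append(review)
    reviews = reviews[-max_reviews:]

    # Re-run the profile logic over the retained reviews.
    ratings = [r["rating"] for r in reviews]
    sellers = {r["seller"] for r in reviews}
    positive = sum(1 for rating in ratings if rating >= 4)  # assumed cut-off

    doc["reviews"] = reviews
    doc["positivity"] = positive / len(ratings) if ratings else 0.0
    doc["num_sellers_reviewed"] = len(sellers)
    if len(reviews) < 3:  # hypothetical threshold
        doc["profile"] = "newbie"
    elif len(sellers) == 1 and all(rating == 5 for rating in ratings):
        doc["profile"] = "fanboy"  # hypothetical rule: only 5-star reviews, one seller
    else:
        doc["profile"] = "regular"  # hypothetical fallback label
    return doc


reviewer = update_reviewer({}, [{"seller": 187, "rating": 5}] * 6)
```

Because the summary is recomputed from the retained reviews on every update, the document stays bounded in size however many events arrive.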
Insight: which sellers have a lot of fanboys?
Seller #187 has more than his fair share of “fanboy” reviewers …
Drilling down into seller #187’s fanboys
Suspiciously synchronised behaviour
Example background
• In the UK all vehicles must pass an annual roadworthiness test, called an MOT (named after the Ministry of Transport)
• It is illegal to drive a car that has failed an MOT (unless driving home from a test or to a repair centre)
• Taxis and other forms of public transport have to be tested more frequently: every 6 months.
• All data is freely available from data.gov.uk but with anonymised vehicle ID and inexact test locations.
Example background
MOTs
mots.csv, loaded by loadMOTs.sh
• Drops and creates mots index.
• Uses Python client to bulk load all 37m road worthiness test results for 2013 (data source http://data.gov.uk/)
MOT event fields
• result (pass/fail)
• vehicle ID
• Make + model + age
• mileage
• test date
• test location
Cars
buildEntities.sh
• Drops and creates cars index.
• Registers CarProfileUpdater.groovy as a stored script
• Uses Python client to query and scroll list of MOT test results sorted by vehicle ID and time
• Python uses the bulk indexing API to push _update requests to ~27m “Car” documents; each request carries a bundle of related MOT test results
• Shard-side Groovy script collapses the multiple tests into a single summary JSON document for a car, deriving summaries such as the Car entity fields listed here
Car entity fields
• Make + model + age
• last test result, date, location
• miles driven while failed
• days between fail and fix
• complete test history
• suspected bad mileometer readings
Car attributes derived from 3 test result documents
Data fusion logic
[Figure: mileometer reading plotted against test date for test results 1, 2 and 3, annotated with the derived attributes daysForFix, badReading?, milesDrivenAfterFailure and mile-o-meterRewind]
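A minimal Python sketch of this kind of data fusion follows, deriving three of the attributes named in the figure from a car's time-ordered test results. The input shape (`date`, `mileage`, `passed`) and the exact rules are assumptions for illustration; the talk's CarProfileUpdater.groovy may differ.

```python
from datetime import date


def fuse_tests(tests):
    """tests: list of dicts with 'date', 'mileage', 'passed', sorted by date (assumed shape)."""
    summary = {"daysForFix": None, "milesDrivenAfterFailure": 0, "badReading": False}
    previous = None
    failed_since = None  # date of the earliest failure not yet followed by a pass
    for test in tests:
        if previous is not None:
            delta = test["mileage"] - previous["mileage"]
            if delta < 0:
                summary["badReading"] = True  # mileometer appears to have gone backwards
            elif failed_since is not None:
                summary["milesDrivenAfterFailure"] += delta
        if test["passed"]:
            if failed_since is not None:
                summary["daysForFix"] = (test["date"] - failed_since).days
                failed_since = None
        elif failed_since is None:
            failed_since = test["date"]
        previous = test
    return summary


car = fuse_tests([
    {"date": date(2013, 1, 10), "mileage": 50000, "passed": False},
    {"date": date(2013, 1, 24), "mileage": 50120, "passed": True},
    {"date": date(2013, 7, 1), "mileage": 49000, "passed": True},
])
```

None of these attributes exists on any single test document; each only appears once consecutive tests for the same vehicle are compared.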
Insight: who is driving failed vehicles?
Q: Why is there an unexpected peak in milesDrivenWithFailure around 6 months?
A: Taxis
Insight: Taxis keep on trucking after failures...
Example background
MovieLens data
• A public dataset* of 10m movie ratings made by 71k users
• One Elasticsearch document per user with a list of their movie ratings
* http://files.grouplens.org/datasets/movielens/ml-10m-README.html
“Uncommonly common” user behaviours
Entity centric indexing: Advantages
• Efficient and simple queries
• Advanced analytics/insights
• Can provide a cheaper data retention policy (daily->weekly->monthly roll-ups)
• Can reuse existing Elasticsearch APIs or build entity documents using external technologies
Entity centric indexing: tips
• Avoid “fat entities”
  • Use forgetful collections: priority queues, circular buffers, HyperLogLog
• Avoid pointless updates
  • Use ctx.op = "none" to avoid writes of insignificant changes
• Consider options for reducing event volumes:
  • Use of aggregations in gathering events
  • Reduce related events in event-gathering script that issues updates
• Parallelise the pull of event information
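Two of these tips can be illustrated together in a short Python sketch: a circular buffer as a "forgetful collection", and skipping the write when nothing significant changed. The returned string stands in for setting ctx.op in an update script; the function name, the `recent` field and the "nothing new means insignificant" rule are assumptions.

```python
from collections import deque


def apply_events(entity, events, max_recent=50):
    """Merge events into entity['recent']; return the operation a script would choose."""
    # Circular buffer: the oldest items fall off once max_recent is reached.
    recent = deque(entity.get("recent", []), maxlen=max_recent)
    new_events = [event for event in events if event not in recent]
    if not new_events:
        return "none"  # nothing changed, so skip the write entirely
    recent.extend(new_events)
    entity["recent"] = list(recent)
    return "index"


entity = {}
first_op = apply_events(entity, [1, 2, 3], max_recent=2)
second_op = apply_events(entity, [3], max_recent=2)
```

The bounded buffer keeps each entity document small, and the early return avoids reindexing a document whose content would be identical.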
Entity centric indexing
• Incremental entity updates can be achieved by querying all events since the timestamp of the last run
• Data integrity - implement policies for:
  • handling any failures in performing entity updates
  • retiring old entities (use of TTL?)
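For the incremental case, the event-gathering query only needs a range filter on the event timestamp plus the usual entity-then-time sort. A sketch of such a search body, built as a plain Python dict: the helper name is hypothetical, and the default field names are borrowed from the reviews example (`reviewerId`, `date`), so treat them as assumptions about the mapping.

```python
def events_since(last_run_iso, entity_field="reviewerId", time_field="date"):
    """Build a search body selecting events newer than the previous run."""
    return {
        # Only events strictly after the last run's timestamp.
        "query": {"range": {time_field: {"gt": last_run_iso}}},
        # Same ordering as the full extract: entity ID first, then time.
        "sort": [{entity_field: "asc"}, {time_field: "asc"}],
    }


body = events_since("2015-03-01T00:00:00Z")
```

The timestamp of each successful run would need to be persisted somewhere durable, which ties into the failure-handling policy mentioned above.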