yalisassoon
Understanding event data
Data Insights, Cambridge, April 2015
In the beginning…
3 years ago, we were pretty frustrated…
Product / platform development decisions
But…
• Hard to identify patterns in user behavior
• Hard to identify good / bad engagements
• We were using tools (GA / Adobe) to answer questions that the tools were not designed to support
Cloud services + open source big data technology -> we can collect and warehouse the event-level data in a linearly scalable way…
…and perform any analysis we want on it
Snowplow v1 was born…
• Every event on your website, represented as a line of data in your own data warehouse (on EMR)
• Data queried mostly using Apache Hive, but opportunity to use any Hadoop-based data processing framework
Lots of flexibility to perform more involved / advanced web analytics
Is web analytics data a subset of a broader category of digital event data?
• A stream of events describes “what has happened” over time
• High volume of data: one line of data per event, potentially 1000s of events per second
• In all cases, having the complete, event-level dataset available for analysis provides the possibility of using the data to answer a very broad set of questions
Could we extend the pipeline we built for web data to encompass digital event data more generally?
Yes! Just build more trackers…
…BUT – what about the structure of the data? Doesn’t this vary by event type? And aren’t we now looking at many more event types?
Web events
• Page view
• Page activity
• Order
• Add to basket

All events
• Game saved
• Machine broke
• Car started
• Spellcheck run
• Screenshot taken
• Fridge empty
• App crashed
• Disk full
• SMS sent
• Screen viewed
• Tweet drafted
• Player died
• Taxi arrived
• Phonecall ended
• Cluster started
• Till opened
• Product returned
There are two historic approaches to dealing with the explosion of possible event types
• Web analytics vendors: custom variables
• Mobile and app analytics vendors: schema-less JSONs
Custom variables are very restrictive
1. Take a standard web event, like a page view: Page View
2. Add custom variables until it becomes something totally different: Page View + vehicle=taxi23 + status=arrived = a “taxi arrived” event, kind of!
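A minimal sketch of the hack described above, with hypothetical field names (`custom_var_1`, `custom_var_2` stand in for whatever fixed custom-variable slots the vendor offers):

```python
# A "taxi arrived" event forced into a web-analytics page-view model:
# the only extension point is a fixed set of custom variable slots.
page_view = {
    "event": "page_view",
    "url": "/driver/app",              # meaningless for a taxi event
    "custom_var_1": "vehicle=taxi23",
    "custom_var_2": "status=arrived",
}

# Downstream, an analyst must know (out of band) that this "page view"
# is really a taxi event, and must parse the key=value strings by hand.
taxi_event = dict(
    kv.split("=")
    for kv in (page_view["custom_var_1"], page_view["custom_var_2"])
)
```

The event's real meaning lives entirely in the analyst's head, not in the data.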
Schema-less JSONs are better, but they have a different set of problems
Issues with the event name:
• Separate from the event properties
• Not versioned
• Not unique – “HBO video played” versus “Brightcove video played”

Lots of unanswered questions about the properties:
• Is length required, and is it always a number?
• Is id required, and is it always a string?
• What other optional properties are allowed for a video play?

Other issues:
• What if the developer accidentally starts sending “len” instead of “length”? The data will end up split across two separate fields
• Why does the analyst need to keep an implicit schema in their head to analyze “video played” events?
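A small illustration of the field-drift problem above, using a hypothetical “video played” event:

```python
# Two developers emit the "same" event with schema-less JSON:
event_a = {"event": "video played", "length": 213, "id": "abc123"}
event_b = {"event": "video played", "len": "213"}  # drifted name and type

# Without a schema, the warehouse ends up with both a `length` and a
# `len` column, and analysts must reconcile them by hand.
columns = set()
for e in (event_a, event_b):
    columns.update(k for k in e if k != "event")
# columns now contains both "length" and "len"
```

Nothing in the pipeline can flag `event_b` as malformed, because nothing defines what well-formed means.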
Our approach: schema our JSONs
When a developer or analyst defines a new event in JSON, let’s ask them to create a JSON Schema for that event
The schema answers the questions the raw JSON leaves open:
• Yes, length should always be a number
• No other fields are allowed
• Additional optional fields we might not otherwise know about are declared explicitly
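A sketch of what such a JSON Schema could look like for a hypothetical “video played” event, together with a tiny hand-rolled validator covering just the keywords used here (the real pipeline validates against full JSON Schema):

```python
# JSON Schema (sketch) for a hypothetical "video played" event.
video_played_schema = {
    "type": "object",
    "properties": {
        "length": {"type": "number"},        # length is always a number
        "id": {"type": "string"},            # id is always a string
        "quality": {"enum": ["low", "hd"]},  # optional field, declared up front
    },
    "required": ["length", "id"],
    "additionalProperties": False,           # no other fields allowed
}

def validate(event, schema):
    """Check an event dict against the subset of JSON Schema used above."""
    type_map = {"number": (int, float), "string": str}
    props = schema["properties"]
    errors = []
    for field in schema.get("required", []):
        if field not in event:
            errors.append(f"missing required field: {field}")
    for key, value in event.items():
        if key not in props:
            if not schema.get("additionalProperties", True):
                errors.append(f"unexpected field: {key}")
            continue
        spec = props[key]
        if "type" in spec and not isinstance(value, type_map[spec["type"]]):
            errors.append(f"{key}: expected {spec['type']}")
        if "enum" in spec and value not in spec["enum"]:
            errors.append(f"{key}: must be one of {spec['enum']}")
    return errors
```

A well-formed event passes; an event sending `"length": "213"` or an unexpected `len` field is rejected at validation time rather than silently polluting the warehouse.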
But we need to let our event definitions evolve, so let’s add versioning – we’re calling this SchemaVer
MODEL-REVISION-ADDITION
• MODEL changes when the new schema breaks compatibility; REVISION when it may break some consumers; ADDITION for backwards-compatible additions
• Start versioning at 1-0-0 – so 1-0-0, 1-0-1, 1-0-2, 1-1-0 etc.
• Try to stick to backwards-compatible ADDITION upgrades as much as possible
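The versioning rule above can be sketched as a pair of small helpers (illustrative, not the pipeline's actual implementation):

```python
def parse_schemaver(version):
    """Parse a MODEL-REVISION-ADDITION string like '1-0-2' into ints."""
    model, revision, addition = (int(part) for part in version.split("-"))
    return model, revision, addition

def is_backwards_compatible(old, new):
    """ADDITION-only bumps are backwards-compatible under SchemaVer:
    data valid against the old schema stays valid against the new one."""
    old_model, old_rev, old_add = parse_schemaver(old)
    new_model, new_rev, new_add = parse_schemaver(new)
    return (new_model, new_rev) == (old_model, old_rev) and new_add >= old_add
```

So a consumer written against 1-0-0 can safely read 1-0-2 events, but a 1-1-0 or 2-0-0 bump signals that it may need updating.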
We make the event JSONs self-describing, with a schema header and data body
The schema field determines where the schema can be found in Iglu, our schema repository. Schemas are namespaced…
For each event being processed, we can ‘look up’ the schema from the repo and use it to drive event validation and loading of event data into structured data stores
Being able to load data into multiple different stores is very valuable
• Different data stores support different types of analyses:
• SQL databases -> pivoting / OLAP
• Elasticsearch -> search, simple OLAP
• Graph databases -> pathing analysis
• Many of these data stores (all except Elasticsearch) are ‘structured’
• Having the data pass through the pipeline in a schemaed format means we do not need to manually structure it ourselves, which is expensive and error-prone
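One way this plays out: table definitions can be derived mechanically from the JSON Schema instead of being hand-written. A sketch, with a hypothetical JSON-Schema-type to SQL-type mapping:

```python
# Hypothetical mapping from JSON Schema types to SQL column types.
SQL_TYPES = {
    "number": "DOUBLE PRECISION",
    "string": "VARCHAR(255)",
    "boolean": "BOOLEAN",
}

def schema_to_ddl(table, schema):
    """Derive a CREATE TABLE statement from an event's JSON Schema."""
    required = set(schema.get("required", []))
    columns = []
    for name, spec in schema["properties"].items():
        nullability = "NOT NULL" if name in required else "NULL"
        columns.append(f"  {name} {SQL_TYPES[spec['type']]} {nullability}")
    return f"CREATE TABLE {table} (\n" + ",\n".join(columns) + "\n);"

ddl = schema_to_ddl("video_played", {
    "properties": {"length": {"type": "number"}, "id": {"type": "string"}},
    "required": ["length", "id"],
})
```

The same schema that validates events at collection time also drives the structure of the store they land in.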
We are working on a second, real-time version of the Snowplow data pipeline
Batch:
• Requests logged to S3
• Event data processed in Scalding / EMR and SQL

Real-time:
• Requests logged to Amazon Kinesis and Kafka
• Event data processed using the Kinesis Client Library / Samza
Kinesis and Kafka enable us to publish and consume events from a distributed stream in a real-time, robust and scalable way
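The pattern can be illustrated with a toy in-memory stream (this is a model of the distributed-log idea, not the actual Kinesis or Kafka API): producers append to a shared log, and each decoupled consumer keeps its own read position.

```python
class ToyStream:
    """In-memory stand-in for a Kinesis stream / Kafka topic."""

    def __init__(self):
        self.log = []       # the append-only event log
        self.offsets = {}   # consumer name -> next index to read

    def publish(self, event):
        self.log.append(event)

    def consume(self, consumer):
        """Return all events this consumer has not yet seen."""
        start = self.offsets.get(consumer, 0)
        batch = self.log[start:]
        self.offsets[consumer] = len(self.log)
        return batch

stream = ToyStream()
stream.publish({"event": "page_view"})
stream.publish({"event": "add_to_basket"})

# Two independent consumers read the same events at their own pace.
warehouse_batch = stream.consume("warehouse-loader")
fraud_batch = stream.consume("fraud-detector")
```

Because consumers track their own offsets, the warehouse loader and a real-time fraud detector can read the same stream without coordinating with each other.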
With the real-time pipeline, event data can be fed into data-driven applications alongside the data warehouse
[Diagram: narrow data silos (CMS, CRM, ERP, e-commerce, email marketing, search) spread across SaaS vendors, cloud vendors, and the company’s own data center, with only some low-latency local loops, feed a unified log via streaming APIs / web hooks. The unified log event stream holds a few days’ data history at low latency with wide data coverage, feeding applications such as systems monitoring, product recommendations, fraud detection, churn prevention, and APIs; the data warehouse holds the full data history at higher latency, serving ad hoc analytics and management reporting.]
This is exciting for data scientists
• Build predictive models based on the complete history of events in the data warehouse, e.g. forecast revenue for acquired users over their lifetime…
• Put those models live on the same source of truth, as the data comes in in real-time
• Use this approach for all types of applications: fraud detection, real-time personalization, product recommendation
[Diagram: the same event stream feeds both the data warehouse and real-time data-driven applications]
However, this only makes figuring out how to model and describe events more important
• Lots of applications (not just offline reporting / data warehouse) fed off the event stream
• All those applications are decoupled: downstream applications have no control over the structure of the data generated upstream of them
• So the better able we are to specify (and constrain) the data structures upstream, the easier it’ll be to write downstream applications to consume the data
Can we specify a standard framework / structure for events? What about a semantic model?
We can extend our self-describing JSON model to encapsulate this semantic model…
This is something we need to put more thought / research into
The next few months are pretty exciting
• Building out the real-time pipeline
• Encouraging an ecosystem of developers / partners to build apps to run on the real-time stream
• Developing the semantic model / event grammar
Any questions?