yalisassoon
Understanding event data
Data Insights, Cambridge, April 2015
In the beginning…
3 years ago, we were pretty frustrated…
Product / platform development decisions
But…
• Hard to identify patterns in user behavior
• Hard to identify good / bad engagements
• We were using tools (GA / Adobe) to answer questions that the tools were not designed to support
Cloud services + open source big data technology -> we can collect and warehouse the event-level data in a linearly scalable way…
…and perform any analysis we want on it
Snowplow v1 was born…
• Every event on your website, represented as a line of data in your own data warehouse (on EMR)
• Data queried mostly using Apache Hive, but opportunity to use any Hadoop-based data processing framework
Lots of flexibility to perform more involved / advanced web analytics
Is web analytics data a subset of a broader category of digital event data?
• A stream of events describes “what has happened” over time
• High volume of data: one line of data per event, potentially 1000s of events per second
• In all cases, having the complete, event-level dataset available for analysis provides the possibility of using the data to answer a very broad set of questions
Could we extend the pipeline we built for web data to encompass digital event data more generally?
Yes! Just build more trackers…
…BUT – what about the structure of the data? Doesn’t this vary by event type? And aren’t we now looking at many more event types?
Web events
• Page view
• Page activity
• Order
• Add to basket

All events
• Game saved
• Machine broke
• Car started
• Spellcheck run
• Screenshot taken
• Fridge empty
• App crashed
• Disk full
• SMS sent
• Screen viewed
• Tweet drafted
• Player died
• Taxi arrived
• Phonecall ended
• Cluster started
• Till opened
• Product returned
There are two historic approaches to dealing with the explosion of possible event types
• Web analytics vendors: custom variables
• Mobile and app analytics vendors: schema-less JSONs
Custom variables are very restrictive
1. Take a standard web event, like a page view: Page View
2. Add custom variables until it becomes something totally different: Page View + vehicle=taxi23 + status=arrived = a “taxi arrived” event, kind of!
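A minimal sketch of the hack described above, with hypothetical field names (`custom_var_1`, `custom_var_2` stand in for whatever fixed custom-variable slots the vendor offers):

```python
# A "taxi arrived" event forced into a web-analytics page-view model:
# the only extension point is a fixed set of custom variable slots.
page_view = {
    "event": "page_view",
    "url": "/driver/app",              # meaningless for a taxi event
    "custom_var_1": "vehicle=taxi23",
    "custom_var_2": "status=arrived",
}

# Downstream, an analyst must know (out of band) that this "page view"
# is really a taxi event, and must parse the key=value strings by hand.
taxi_event = dict(
    kv.split("=")
    for kv in (page_view["custom_var_1"], page_view["custom_var_2"])
)
```

The event's real meaning lives entirely in the analyst's head, not in the data.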
Schema-less JSONs are better, but they have a different set of problems
Issues with the event name:
• Separate from the event properties
• Not versioned
• Not unique – “HBO video played” versus “Brightcove video played”

Lots of unanswered questions about the properties:
• Is length required, and is it always a number?
• Is id required, and is it always a string?
• What other optional properties are allowed for a video play?

Other issues:
• What if the developer accidentally starts sending “len” instead of “length”? The data will end up split across two separate fields
• Why does the analyst need to keep an implicit schema in their head to analyze “video played” events?
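A small illustration of the field-drift problem above, using a hypothetical “video played” event:

```python
# Two developers emit the "same" event with schema-less JSON:
event_a = {"event": "video played", "length": 213, "id": "abc123"}
event_b = {"event": "video played", "len": "213"}  # drifted name and type

# Without a schema, the warehouse ends up with both a `length` and a
# `len` column, and analysts must reconcile them by hand.
columns = set()
for e in (event_a, event_b):
    columns.update(k for k in e if k != "event")
# columns now contains both "length" and "len"
```

Nothing in the pipeline can flag `event_b` as malformed, because nothing defines what well-formed means.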
Our approach: schema our JSONs
When a developer or analyst defines a new event in JSON, let’s ask them to create a JSON Schema for that event
The schema answers the questions the raw JSON leaves open:
• Yes, length should always be a number
• No other fields are allowed
• Additional optional fields we might not otherwise know about are declared explicitly
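A sketch of what such a JSON Schema could look like for a hypothetical “video played” event, together with a tiny hand-rolled validator covering just the keywords used here (the real pipeline validates against full JSON Schema):

```python
# JSON Schema (sketch) for a hypothetical "video played" event.
video_played_schema = {
    "type": "object",
    "properties": {
        "length": {"type": "number"},        # length is always a number
        "id": {"type": "string"},            # id is always a string
        "quality": {"enum": ["low", "hd"]},  # optional field, declared up front
    },
    "required": ["length", "id"],
    "additionalProperties": False,           # no other fields allowed
}

def validate(event, schema):
    """Check an event dict against the subset of JSON Schema used above."""
    type_map = {"number": (int, float), "string": str}
    props = schema["properties"]
    errors = []
    for field in schema.get("required", []):
        if field not in event:
            errors.append(f"missing required field: {field}")
    for key, value in event.items():
        if key not in props:
            if not schema.get("additionalProperties", True):
                errors.append(f"unexpected field: {key}")
            continue
        spec = props[key]
        if "type" in spec and not isinstance(value, type_map[spec["type"]]):
            errors.append(f"{key}: expected {spec['type']}")
        if "enum" in spec and value not in spec["enum"]:
            errors.append(f"{key}: must be one of {spec['enum']}")
    return errors
```

A well-formed event passes; an event sending `"length": "213"` or an unexpected `len` field is rejected at validation time rather than silently polluting the warehouse.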
But we need to let our event definitions evolve, so let’s add versioning – we’re calling this SchemaVer
MODEL-REVISION-ADDITION
• MODEL changes when the new schema breaks compatibility; REVISION when it may break some consumers; ADDITION for backwards-compatible additions
• Start versioning at 1-0-0 – so 1-0-0, 1-0-1, 1-0-2, 1-1-0 etc.
• Try to stick to backwards-compatible ADDITION upgrades as much as possible
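The versioning rule above can be sketched as a pair of small helpers (illustrative, not the pipeline's actual implementation):

```python
def parse_schemaver(version):
    """Parse a MODEL-REVISION-ADDITION string like '1-0-2' into ints."""
    model, revision, addition = (int(part) for part in version.split("-"))
    return model, revision, addition

def is_backwards_compatible(old, new):
    """ADDITION-only bumps are backwards-compatible under SchemaVer:
    data valid against the old schema stays valid against the new one."""
    old_model, old_rev, old_add = parse_schemaver(old)
    new_model, new_rev, new_add = parse_schemaver(new)
    return (new_model, new_rev) == (old_model, old_rev) and new_add >= old_add
```

So a consumer written against 1-0-0 can safely read 1-0-2 events, but a 1-1-0 or 2-0-0 bump signals that it may need updating.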
We make the event JSONs self-describing, with a schema header and data body
The schema field determines where the schema can be found in Iglu, our schema repository. Schemas are namespaced…
For each event being processed, we can ‘look up’ the schema from the repo and use it to drive event validation and loading of event data into structured data stores
Being able to load data into multiple different stores is very valuable
• Different data stores support different types of analyses:
• SQL databases -> pivoting / OLAP
• Elasticsearch -> search, simple OLAP
• Graph databases -> pathing analysis
• Many of these data stores (all except Elasticsearch) are ‘structured’
• Having the data pass through the pipeline in a schemaed format means we do not need to manually structure it ourselves, which is expensive and error-prone
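One way this plays out: table definitions can be derived mechanically from the JSON Schema instead of being hand-written. A sketch, with a hypothetical JSON-Schema-type to SQL-type mapping:

```python
# Hypothetical mapping from JSON Schema types to SQL column types.
SQL_TYPES = {
    "number": "DOUBLE PRECISION",
    "string": "VARCHAR(255)",
    "boolean": "BOOLEAN",
}

def schema_to_ddl(table, schema):
    """Derive a CREATE TABLE statement from an event's JSON Schema."""
    required = set(schema.get("required", []))
    columns = []
    for name, spec in schema["properties"].items():
        nullability = "NOT NULL" if name in required else "NULL"
        columns.append(f"  {name} {SQL_TYPES[spec['type']]} {nullability}")
    return f"CREATE TABLE {table} (\n" + ",\n".join(columns) + "\n);"

ddl = schema_to_ddl("video_played", {
    "properties": {"length": {"type": "number"}, "id": {"type": "string"}},
    "required": ["length", "id"],
})
```

The same schema that validates events at collection time also drives the structure of the store they land in.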
We are working on a second, real-time version of the Snowplow data pipeline
Batch:
• Requests logged to S3
• Event data processed in Scalding / EMR and SQL

Real-time:
• Requests logged to Amazon Kinesis and Kafka
• Event data processed using the Kinesis Client Library / Samza
Kinesis and Kafka enable us to publish and consume events from a distributed stream in a real-time, robust and scalable way
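The pattern can be illustrated with a toy in-memory stream (this is a model of the distributed-log idea, not the actual Kinesis or Kafka API): producers append to a shared log, and each decoupled consumer keeps its own read position.

```python
class ToyStream:
    """In-memory stand-in for a Kinesis stream / Kafka topic."""

    def __init__(self):
        self.log = []       # the append-only event log
        self.offsets = {}   # consumer name -> next index to read

    def publish(self, event):
        self.log.append(event)

    def consume(self, consumer):
        """Return all events this consumer has not yet seen."""
        start = self.offsets.get(consumer, 0)
        batch = self.log[start:]
        self.offsets[consumer] = len(self.log)
        return batch

stream = ToyStream()
stream.publish({"event": "page_view"})
stream.publish({"event": "add_to_basket"})

# Two independent consumers read the same events at their own pace.
warehouse_batch = stream.consume("warehouse-loader")
fraud_batch = stream.consume("fraud-detector")
```

Because consumers track their own offsets, the warehouse loader and a real-time fraud detector can read the same stream without coordinating with each other.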
With the real-time pipeline, event data can be fed into data-driven applications alongside the data warehouse
[Diagram: narrow data silos (CMS, CRM, ERP, e-commerce, email marketing, search) spread across SaaS vendors, cloud vendors, and the company’s own data center, with only some low-latency local loops, feed a unified log via streaming APIs / web hooks. The unified log event stream holds a few days’ data history at low latency with wide data coverage, feeding applications such as systems monitoring, product recommendations, fraud detection, churn prevention, and APIs; the data warehouse holds the full data history at higher latency, serving ad hoc analytics and management reporting.]
This is exciting for data scientists
• Build predictive models based on the complete history of events in the data warehouse, e.g. forecast revenue for acquired users over their lifetime…
• Put those models live on the same source of truth, as the data comes in in real-time
• Use this approach for all types of applications: fraud detection, real-time personalization, product recommendation
[Diagram: the same event stream feeds both the data warehouse and real-time data-driven applications]
However, this only makes figuring out how to model and describe events more important
• Lots of applications (not just offline reporting / data warehouse) fed off the event stream
• All those applications are decoupled: downstream applications have no control over the structure of the data generated upstream of them
• So the better able we are to specify (and constrain) the data structures upstream, the easier it’ll be to write downstream applications to consume the data
Can we specify a standard framework / structure for events? What about a semantic model?
We can extend our self-describing JSON model to encapsulate this semantic model…
This is something we need to put more thought / research into
The next few months are pretty exciting
• Building out the real-time pipeline
• Encouraging an ecosystem of developers / partners to build apps to run on the real-time stream
• Developing the semantic model / event grammar
Any questions?