
Page 1: Snowplow - the evolving data pipeline

Snowplow: evolve your analytics stack with your business

Snowplow Meetup Tel Aviv, July 2016

Page 2: Snowplow - the evolving data pipeline

Hello! I’m Yali

• Co-founder at Snowplow: open source event data pipeline

• Analytics Lead, focused on business analytics

Page 3: Snowplow - the evolving data pipeline

I work with our clients so they get more out of their data

• Marketing / customer analytics: how do we engage users better?

• Product analytics: how do we improve our user-facing products?

• Content / merchandise analytics:

• How do we write/produce/buy better content?

• How do we optimise the use of our existing content?

Page 4: Snowplow - the evolving data pipeline

Self-describing data + Event data modeling
= Event data pipeline that evolves with your business

Page 5: Snowplow - the evolving data pipeline

Self-describing data: overview

Page 6: Snowplow - the evolving data pipeline

Event data varies widely by company

Page 7: Snowplow - the evolving data pipeline

As a Snowplow user, you can define your own events and entities:

Events: • Build castle • Form alliance • Declare war
Entities (contexts): • Player • Game • Level • Currency

Events: • View product • Buy product • Deliver product
Entities (contexts): • Product • Customer • Basket • Delivery van

Page 8: Snowplow - the evolving data pipeline

You then define a schema for each event and entity, and upload the schema to Iglu:

{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "description": "Schema for a fighter context",
  "self": {
    "vendor": "com.ufc",
    "name": "fighter_context",
    "format": "jsonschema",
    "version": "1-0-1"
  },
  "type": "object",
  "properties": {
    "FirstName": { "type": "string" },
    "LastName": { "type": "string" },
    "Nickname": { "type": "string" },
    "FacebookProfile": { "type": "string" },
    "TwitterName": { "type": "string" },
    "GooglePlusProfile": { "type": "string" },
    "HeightFormat": { "type": "string" },
    "HeightCm": { "type": ["integer", "null"] },
    "Weight": { "type": ["integer", "null"] },
    "WeightKg": { "type": ["integer", "null"] },
    "Record": { "type": "string", "pattern": "^[0-9]+-[0-9]+-[0-9]+$" },
    "Striking": { "type": ["number", "null"], "maxdecimal": 15 },
    "Takedowns": { "type": ["number", "null"], "maxdecimal": 15 },
    "Submissions": { "type": ["number", "null"], "maxdecimal": 15 },
    "LastFightUrl": { "type": "string" },
    "LastFightEventText": { "type": "string" },
    "NextFightUrl": { "type": "string" },
    "NextFightEventText": { "type": "string" },
    "LastFightDate": { "type": "string", "format": "timestamp" }
  },
  "additionalProperties": false
}
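Once uploaded, the schema is addressable by an Iglu URI of the form iglu:{vendor}/{name}/{format}/{version}. As an illustration, assuming a standard static Iglu registry layout, the fighter context above is referenced and resolved like this:

iglu:com.ufc/fighter_context/jsonschema/1-0-1
-> {registry root}/schemas/com.ufc/fighter_context/jsonschema/1-0-1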

Page 9: Snowplow - the evolving data pipeline

Then send data into Snowplow as self-describing JSONs:

Event (with its schema reference):

{
  "schema": "iglu:com.israel365/temperature_measure/jsonschema/1-0-0",
  "data": {
    "timestamp": "2016-07-11 17:53:21",
    "location": "Tel-Aviv",
    "temperature": 32,
    "units": "Centigrade"
  }
}

Schema:

{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "description": "Schema for a temperature measurement event",
  "self": {
    "vendor": "com.israel365",
    "name": "temperature_measure",
    "format": "jsonschema",
    "version": "1-0-0"
  },
  "type": "object",
  "properties": {
    "timestamp": { "type": "string" },
    "location": { "type": "string" },
    …
  },
  …
}
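In practice you would send such an event through one of the Snowplow trackers. A minimal sketch with the Snowplow Python Tracker follows; the collector endpoint is hypothetical and constructor arguments vary between tracker versions, so treat this as illustrative rather than copy-paste ready:

# pip install snowplow-tracker
from snowplow_tracker import Tracker, Emitter, SelfDescribingJson

# Hypothetical collector endpoint: point this at your own Snowplow collector
emitter = Emitter("collector.example.com")
tracker = Tracker(emitter)

# Wrap the event data and its Iglu schema reference together, mirroring
# the self-describing JSON on the slide above.
# (Older tracker versions call this method track_unstruct_event.)
tracker.track_self_describing_event(SelfDescribingJson(
    "iglu:com.israel365/temperature_measure/jsonschema/1-0-0",
    {
        "timestamp": "2016-07-11 17:53:21",
        "location": "Tel-Aviv",
        "temperature": 32,
        "units": "Centigrade",
    },
))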

Page 10: Snowplow - the evolving data pipeline

The schemas can then be used in a number of ways:

• Validate the data (important for data quality; see the sketch after this list)

• Load the data into tidy tables in your data warehouse

• Make it easy / safe to write downstream data processing applications (for real-time users)
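To make the validation point concrete, here is a minimal sketch using Python's jsonschema library against a trimmed-down temperature_measure schema. The "temperature" and "units" property definitions are assumptions, since the slide elides them; Snowplow's pipeline does its own validation, this only illustrates the mechanism:

# pip install jsonschema
from jsonschema import validate, ValidationError

# Trimmed-down version of the temperature_measure schema; the
# "temperature" and "units" definitions are assumed for illustration.
schema = {
    "type": "object",
    "properties": {
        "timestamp":   {"type": "string"},
        "location":    {"type": "string"},
        "temperature": {"type": ["integer", "null"]},
        "units":       {"type": "string"},
    },
    "additionalProperties": False,
}

good_event = {"timestamp": "2016-07-11 17:53:21", "location": "Tel-Aviv",
              "temperature": 32, "units": "Centigrade"}
bad_event = {"timestamp": "2016-07-11 17:53:21", "temperature": "hot"}

for event in (good_event, bad_event):
    try:
        validate(instance=event, schema=schema)
        print("valid:", event)
    except ValidationError as err:
        print("invalid:", err.message)  # "hot" fails the integer/null type check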

Page 11: Snowplow - the evolving data pipeline

Event data modeling: overview

Page 12: Snowplow - the evolving data pipeline

What is event data modeling?

Event data modeling is the process of using business logic to aggregate over event-level data to produce 'modeled' data that is simpler to query.
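A minimal sketch of the idea, with toy event and field names rather than Snowplow's actual table layout: business logic collapses many event-level rows into one easy-to-query row per user.

from collections import defaultdict

# Toy event stream: one dict per event-level record.
events = [
    {"user_id": "u1", "event": "view_product", "timestamp": "2016-07-11T09:00:00"},
    {"user_id": "u1", "event": "buy_product",  "timestamp": "2016-07-11T09:05:00"},
    {"user_id": "u2", "event": "view_product", "timestamp": "2016-07-11T10:00:00"},
]

def model_users(event_stream):
    """Business logic: aggregate event-level data into one row per user."""
    users = defaultdict(lambda: {"views": 0, "purchases": 0})
    for e in event_stream:
        if e["event"] == "view_product":
            users[e["user_id"]]["views"] += 1
        elif e["event"] == "buy_product":
            users[e["user_id"]]["purchases"] += 1
    return dict(users)

# Because the raw stream is immutable, changing model_users() and rerunning
# it over the full stream recomputes all historical rows under the new logic.
print(model_users(events))
# {'u1': {'views': 1, 'purchases': 1}, 'u2': {'views': 1, 'purchases': 0}}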

Page 13: Snowplow - the evolving data pipeline

Unmodeled data: immutable, unopinionated, hard to consume, not contentious.

Modeled data: mutable and opinionated, easy to consume, may be contentious.

Page 14: Snowplow - the evolving data pipeline

In general, event data modeling is performed on the complete event stream:

• Late-arriving events can change the way you understand earlier-arriving events

• If we change our data models, we have the flexibility to recompute historical data based on the new model

Page 15: Snowplow - the evolving data pipeline

The evolving event data pipeline

Page 16: Snowplow - the evolving data pipeline

How do we handle pipeline evolution?

PUSH FACTORS: what is being tracked will change over time.

PULL FACTORS: what questions are being asked of the data will change over time.

Businesses are not static, so event pipelines should not be either.

Page 17: Snowplow - the evolving data pipeline

Push example: new source of event data

• If data is self-describing, it is easy to add additional sources

• Self-describing data is good for managing bad data and pipeline evolution

"I'm an email send event, and I have information about the recipient (email address, customer ID) and the email (id, tags, variation)"
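As a sketch, such an email send event could arrive as a self-describing JSON like the one below; the vendor, schema name, and field names here are hypothetical, chosen only to match the information in the quote:

{
  "schema": "iglu:com.acme/email_send/jsonschema/1-0-0",
  "data": {
    "recipientEmail": "jane@example.com",
    "customerId": "c-123",
    "emailId": "welcome-01",
    "tags": ["onboarding"],
    "variation": "B"
  }
}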

Page 18: Snowplow - the evolving data pipeline

Pull example: new business question

Page 19: Snowplow - the evolving data pipeline

Answering the question: 3 possibilities

1. Existing data model supports the answer
• Possible to answer the question with existing modeled data

2. Need to update the data model
• Data collected already supports the answer
• Additional computation required in the data modeling step (additional logic)

3. Need to update the data model and data collection
• Need to extend event tracking
• Need to update data models to incorporate the additional data (and potentially additional logic)

Page 20: Snowplow - the evolving data pipeline

Self-describing data and the ability to recompute data models are essential to enable pipeline evolution.

Self-describing data:
• Update existing events and entities in a backwards-compatible way, e.g. add optional new fields (see the versioning sketch after this list)
• Update existing events and entities in a backwards-incompatible way, e.g. change field types, remove fields, add compulsory fields
• Add new event and entity types

Recompute data models on the entire data set:
• Add new columns to existing derived tables, e.g. add a new audience segmentation
• Change the way existing derived tables are generated, e.g. change sessionization logic
• Create new derived tables
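Snowplow makes the compatible/incompatible distinction concrete through SchemaVer, its MODEL-REVISION-ADDITION versioning scheme for schemas. As a sketch, reusing the hypothetical email_send schema from earlier:

• Backwards-compatible change (e.g. add an optional field): bump the ADDITION component
  iglu:com.acme/email_send/jsonschema/1-0-0 -> iglu:com.acme/email_send/jsonschema/1-0-1

• Backwards-incompatible change (e.g. change a field type or add a required field): bump the MODEL component
  iglu:com.acme/email_send/jsonschema/1-0-0 -> iglu:com.acme/email_send/jsonschema/2-0-0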