Why your company needs a Unified Log

Span Conference, London, 28th October 2014


Introducing myself

• Alex Dean
• Co-founder and technical lead at Snowplow, the open-source event analytics platform based here in London [1]
• Weekend writer of Unified Log Processing, available on the Manning Early Access Program [2]

[1] https://github.com/snowplow/snowplow

[2] http://manning.com/dean

So what’s a Unified Log?

A quick history lesson: the three eras of business data processing [1]

1. The classic era, 1996+
2. The hybrid era, 2005+
3. The unified era, 2013+

[1] http://snowplowanalytics.com/blog/2014/01/20/the-three-eras-of-business-data-processing/

The classic era of business data processing, 1996+

[Diagram, OWN DATA CENTER: narrow data silos (CMS, CRM, E-comm, ERP), each with its own low-latency local loop, joined by point-to-point connections. A nightly batch ETL process loads a data warehouse that feeds management reporting: wide data coverage and full data history, but high latency.]

The hybrid era, 2005+

[Diagram, CLOUD VENDOR / OWN DATA CENTER plus SAAS VENDORS #1, #2 and #3: narrow data silos with low-latency local loops (Search, E-comm, CRM, Email marketing, ERP, CMS, Web analytics), connected through APIs and bulk exports. Low latency: stream processing (product rec’s) and micro-batch processing (systems monitoring). High latency: batch processing into the data warehouse (management reporting) and batch processing into Hadoop (ad hoc analytics).]

The hybrid era: a surfeit of software vendors


The hybrid era: company-wide reporting and analytics ends up like Rashomon

The bandit’s story vs. the wife’s story vs. the samurai’s story vs. the woodcutter’s story

The hybrid era: the number of data integrations is unsustainable

So how do we unravel the hairball?

The unified era, 2013+

[Diagram, CLOUD VENDOR / OWN DATA CENTER plus SAAS VENDORS #1 and #2: narrow data silos (Search, E-comm, CRM, Email marketing, ERP, CMS), some with low-latency local loops, feed the unified log through streaming APIs / web hooks. Unified log (eventstream): low latency, wide data coverage, few days’ data history. Archiving to Hadoop: wide data coverage, full data history. Uses shown across the HIGH LATENCY and LOW LATENCY paths: systems monitoring, product rec’s, ad hoc analytics, management reporting, fraud detection, churn prevention, APIs.]


The unified log is Amazon Kinesis, or Apache Kafka

• Amazon Kinesis, a hosted AWS service
  • Extremely similar semantics to Kafka
• Apache Kafka, an append-only, distributed, ordered commit log
  • Developed at LinkedIn to serve as their organization’s unified log

“Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization” [1]

[1] http://kafka.apache.org/
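To make "append-only, ordered commit log" concrete, here is a minimal in-memory sketch in Python. It is not the Kafka or Kinesis API: the CommitLog and Consumer classes and their methods are hypothetical names chosen for illustration, and it ignores distribution, partitioning and persistence entirely. It only shows that records are appended at increasing offsets and that each consumer reads at its own pace.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class CommitLog:
    """Toy in-memory log: records are only ever appended, never changed."""
    records: List[Any] = field(default_factory=list)

    def append(self, record: Any) -> int:
        """Append a record and return its offset (its position in the log)."""
        self.records.append(record)
        return len(self.records) - 1

    def read(self, offset: int, max_records: int = 10) -> List[Any]:
        """Return up to max_records records starting at offset, in log order."""
        return self.records[offset:offset + max_records]


@dataclass
class Consumer:
    """Each consumer keeps its own offset, so readers never interfere."""
    log: CommitLog
    offset: int = 0

    def poll(self, max_records: int = 10) -> List[Any]:
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch


log = CommitLog()
for event in ["page_view", "add_to_basket", "checkout"]:
    log.append(event)

reporting = Consumer(log)
monitoring = Consumer(log)
first_batch = reporting.poll(max_records=2)  # reads offsets 0 and 1
all_events = monitoring.poll()               # reads offsets 0, 1 and 2
```

Because the log itself never tracks who has read what, adding a new consumer does not require any change to the producers or to the other consumers.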

So what does a unified log give us?

1. A single version of the truth
2. Our truth is now upstream from the data warehouse
3. The hairball of point-to-point connections has been unravelled
4. Local loops have been unbundled

What does a unified log let us do that we couldn’t do before?

Populating a unified log with your company’s event streams

To enable…

• Real-time management reporting
• Holistic systems monitoring
• Re-running models from Day 0
• A/B testing end-to-end pipelines
• Shipping offline models to RT

… anything requiring low latency response / holistic view of our company’s data!
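As an illustration of "re-running models from Day 0", the sketch below replays an archived event history through two versions of a model. Everything here is assumed for the example: the event shape, the replay helper and the two revenue models are hypothetical, and a plain Python list stands in for the archived log.

```python
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]
State = Dict[str, float]


def replay(log: List[Event], model: Callable[[State, Event], State]) -> State:
    """Fold a model over every event in the log, starting from the first one."""
    state: State = {}
    for event in log:
        state = model(state, event)
    return state


def revenue_v1(state: State, event: Event) -> State:
    """Original model: sum every order value per customer."""
    if event["type"] == "order":
        customer = event["customer"]
        state[customer] = state.get(customer, 0.0) + event["value"]
    return state


def revenue_v2(state: State, event: Event) -> State:
    """Revised model: also subtract refunds, which v1 ignored."""
    state = revenue_v1(state, event)
    if event["type"] == "refund":
        customer = event["customer"]
        state[customer] = state.get(customer, 0.0) - event["value"]
    return state


archived_log: List[Event] = [
    {"type": "order", "customer": "alice", "value": 30.0},
    {"type": "order", "customer": "bob", "value": 12.5},
    {"type": "refund", "customer": "alice", "value": 10.0},
]

old_view = replay(archived_log, revenue_v1)
new_view = replay(archived_log, revenue_v2)
```

The point of the sketch is that the revised model sees the same full history as the original one, so the two views can be compared directly.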

But garbage in, garbage out: it’s crucial to properly model the event streams feeding into the unified log

• We are working on a semantic model for events (an “event grammar”) at Snowplow [1]
• The event grammar borrows concepts from human language: Subject, Verb, Direct Object, Indirect Object, Prepositional Object, Event Context
• A semantic model prevents business and technology assumptions leaking into the event stream, making it less brittle over time

[1] http://snowplowanalytics.com/blog/2013/08/12/towards-universal-event-analytics-building-an-event-grammar/
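One way to picture the grammar is as a record type with one field per grammatical role. The sketch below is an assumption made for illustration, not Snowplow's actual event model: the Entity and Event classes, the field names and the describe helper are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class Entity:
    """A participant in an event, e.g. a shopper, a product or a basket."""
    kind: str
    id: str

    def __str__(self) -> str:
        return f"{self.kind}:{self.id}"


@dataclass(frozen=True)
class Event:
    """An event phrased like a sentence: subject, verb, objects, context."""
    subject: Entity
    verb: str
    direct_object: Entity
    indirect_object: Optional[Entity] = None
    prepositional_object: Optional[Entity] = None
    context: Dict[str, Any] = field(default_factory=dict)


def describe(event: Event) -> str:
    """Render the event as a sentence-like string, skipping absent roles."""
    parts = [str(event.subject), event.verb, str(event.direct_object)]
    if event.indirect_object is not None:
        parts.append(f"to {event.indirect_object}")
    if event.prepositional_object is not None:
        parts.append(f"in {event.prepositional_object}")
    return " ".join(parts)


added = Event(
    subject=Entity("shopper", "123"),
    verb="adds",
    direct_object=Entity("product", "sku-42"),
    prepositional_object=Entity("basket", "b-9"),
    context={"platform": "web"},
)
sentence = describe(added)
```

Nothing in the record mentions a page, a button or a database table, which is the sense in which a semantic model keeps technology assumptions out of the stream.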

We also need to store and version the schemas used to describe our events, as these will change over time

[Diagram: unified log]
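A minimal sketch of what storing and versioning schemas could look like, assuming each event names the schema and version it was written against. The SchemaRegistry class, its methods and the event layout are hypothetical, and "schema" is reduced here to a set of required field names.

```python
from typing import Any, Dict, FrozenSet, Iterable, Tuple

SchemaKey = Tuple[str, int]  # (schema name, version)


class SchemaRegistry:
    """Keeps every published schema version so old events stay readable."""

    def __init__(self) -> None:
        self._schemas: Dict[SchemaKey, FrozenSet[str]] = {}

    def register(self, name: str, version: int, required: Iterable[str]) -> None:
        """Publish a schema version; published versions are never overwritten."""
        key = (name, version)
        if key in self._schemas:
            raise ValueError(f"{name} v{version} is already registered")
        self._schemas[key] = frozenset(required)

    def is_valid(self, event: Dict[str, Any]) -> bool:
        """Check an event against the exact schema version it declares."""
        required = self._schemas.get((event["schema"], event["version"]))
        return required is not None and required.issubset(event["data"])


registry = SchemaRegistry()
registry.register("add_to_basket", 1, ["sku"])
registry.register("add_to_basket", 2, ["sku", "currency"])

old_event = {"schema": "add_to_basket", "version": 1, "data": {"sku": "sku-42"}}
new_event = {"schema": "add_to_basket", "version": 2,
             "data": {"sku": "sku-42", "currency": "GBP"}}
broken_event = {"schema": "add_to_basket", "version": 2, "data": {"sku": "sku-42"}}
```

Keeping version 1 registered after version 2 is published means events written months ago can still be validated when the log is replayed.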

How are we embracing the unified log at Snowplow?

Some background: early on, we decided that Snowplow should be composed of a set of loosely coupled subsystems

1. Trackers: generate event data from any environment
2. Collectors: log raw events from trackers
3. Enrich: validate and enrich raw events
4. Storage: store enriched events ready for analysis
5. Analytics: analyze enriched events

A, B, C, D = standardised data protocols linking the subsystems

These turned out to be critical to allowing us to evolve the above stack

Today almost all users/customers are running a batch-based Snowplow configuration

[Diagram components: Snowplow event tracking SDK, HTTP-based event collector, Amazon S3, Hadoop-based enrichment, Amazon Redshift]

• Batch-based
• Normally run overnight; sometimes every 4-6 hours

The Snowplow batch-based flow uses Amazon S3 as a “poor man’s” unified log


Can we implement Snowplow on top of Kinesis/Kafka?

We are working on Amazon Kinesis support first; Apache Kafka will come later (using Apache Samza for stream processing)

[Diagram components: Snowplow Trackers, Scala Stream Collector, raw event stream, Enrich Kinesis app, bad raw events stream, enriched event stream, S3 sink Kinesis app, Redshift sink Kinesis app, Elasticsearch sink Kinesis app, Event aggregator Kinesis app. Stores: S3, Redshift, DynamoDB, Elasticsearch. A legend marks some components as not yet released.]

• Analytics on Read (for agile exploration of event stream, ML, auditing, applying alternate models, reprocessing etc)
• Analytics on Write (for dashboarding, audience segmentation, RTB, etc)
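The shape of this flow (a raw stream split into an enriched stream and a bad stream, then aggregated as events arrive) can be sketched without any Kinesis code at all. The version below is an assumption-laden toy: Python lists stand in for streams, and the enrich and aggregate functions, the required fields and the normalisation step are hypothetical rather than Snowplow's actual enrichment logic.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

RawEvent = Dict[str, str]

# Hypothetical validation rule: both fields must be present and non-empty.
REQUIRED_FIELDS = ("user", "event")


def enrich(raw_stream: Iterable[RawEvent]) -> Tuple[List[RawEvent], List[RawEvent]]:
    """Route each raw event to the enriched stream or to the bad stream."""
    enriched: List[RawEvent] = []
    bad: List[RawEvent] = []
    for raw in raw_stream:
        if all(raw.get(name) for name in REQUIRED_FIELDS):
            # Stand-in "enrichment": normalise the event name.
            enriched.append({**raw, "event": raw["event"].strip().lower()})
        else:
            bad.append(raw)
    return enriched, bad


def aggregate(enriched_stream: Iterable[RawEvent]) -> Counter:
    """Analytics on write: maintain a count per event type."""
    return Counter(event["event"] for event in enriched_stream)


raw_stream = [
    {"user": "u1", "event": "Page_View"},
    {"user": "u2", "event": "page_view"},
    {"event": "checkout"},  # missing user, so it goes to the bad stream
    {"user": "u1", "event": "checkout"},
]

enriched_stream, bad_stream = enrich(raw_stream)
counts = aggregate(enriched_stream)
```

Keeping rejected events in their own stream, rather than dropping them, leaves them available for inspection and reprocessing later.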

Live demo!

Questions?

http://snowplowanalytics.com

https://github.com/snowplow/snowplow

@snowplowdata

To meet up or chat, @alexcrdean on Twitter or [email protected]

Discount code: spancNw, 43% off all Manning eBooks for Span :-)