Analytics Infrastructure @ Viki Grokking Engineering Dec 2014

Grokking Engineering - Data Analytics Infrastructure at Viki - Huy Nguyen

Analytics Infrastructure @ Viki

Grokking Engineering

Dec 2014

Talk Outline

• Introduction

• Background + Problems

• Data Architecture

– Data Collection & Storage

– Data Processing & Aggregation

– Data Presentation & Visualization

– Real-time dashboard and alerts

• Other Comments

YouTube - A Typical Web Application

• Daily/weekly registered users by different platforms, countries?

• How many video uploads do we have every day?

Behavioral Data? (vs Transactional Data)

• Transactional Data

Mission-critical data (e.g. user accounts, bookings, payments)

Often fixed schema

Lower volume

Transaction control

• Behavioral Data

Logging data (e.g. page view, video start, ad impression)

Often semi-structured (JSON)

Huge volume

No transaction control

Data Infrastructure

1. Collect and Store Data

2. Centralize and Process Data

3. Present and Visualize Data

1. Collect & Store Data

{
  "origin": "tv_show_show",
  "app_ver": "2.9.3.151",
  "uuid": "80833c5a760597bf1c8339819636df04",
  "user_id": "5298933u",
  "vs_id": "1008912v-1380452660-7920",
  "app_id": "100004a",
  "event": "video_play",
  "timed_comment": "off",
  "stream_quality": "variable",
  "bottom_subtitle": "en",
  "device_size": "tablet",
  "feature": "auto_play",
  "video_id": "1008912v",
  "subtitle_completion_percent": "100",
  "device_id": "iPad2,1|6.1.3|apple",
  "t": "1380452846",
  "ip": "99.232.169.246",
  "country": "ca",
  "city_name": "Toronto",
  "region_name": "ON"
}

• Samples: page view, video start, ad impression, etc.

• Behavioural Data

Semi-structured (JSON)

Massive Volume (100M+/day)

Does not fit traditional RDBMS databases
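
A minimal sketch of reading one such event, assuming events arrive as one JSON object per record and that "t" holds Unix epoch seconds (as the sample value suggests). The function name parse_event and the trimmed sample are illustrative, not part of the talk.

```python
import json
from datetime import datetime, timezone

# Trimmed copy of the sample event; note that every value arrives as a string.
RAW_EVENT = (
    '{"event": "video_play", "video_id": "1008912v", '
    '"user_id": "5298933u", "country": "ca", "t": "1380452846"}'
)


def parse_event(raw: str) -> dict:
    """Decode one JSON event and add a UTC datetime derived from "t"."""
    event = json.loads(raw)
    # Assumption: "t" is Unix epoch seconds encoded as a string.
    event["timestamp"] = datetime.fromtimestamp(int(event["t"]), tz=timezone.utc)
    return event


event = parse_event(RAW_EVENT)
print(event["event"], event["timestamp"].date().isoformat())
```

Because the payload is schemaless, any field beyond the ones a consumer reads explicitly may be missing, which is part of why such data sits awkwardly in a fixed-schema table.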

1. Collect & Store Data

• Fluentd

Scalable

Extensible

Forward data to Hadoop, MongoDB, PostgreSQL, etc.

Hydration System

• Inject time-sensitive information into events
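
One way to picture hydration, as a sketch only: when an event is received, look up attributes that may change later and copy their current values into the event, so that later analysis sees the state as of event time. The lookup table, the "subscription" attribute, and the user_ prefix are all hypothetical; the talk does not specify which fields are injected.

```python
from typing import Dict

# Hypothetical lookup: each user's state at the moment the event is received.
CURRENT_USER_STATE: Dict[str, Dict[str, str]] = {
    "5298933u": {"subscription": "free"},
}


def hydrate(event: dict, lookup: Dict[str, Dict[str, str]]) -> dict:
    """Return a copy of the event with time-sensitive fields attached."""
    hydrated = dict(event)
    state = lookup.get(event.get("user_id", ""), {})
    for key, value in state.items():
        # The "user_" prefix is an illustrative convention for injected fields.
        hydrated[f"user_{key}"] = value
    return hydrated


event = {"event": "video_play", "user_id": "5298933u", "video_id": "1008912v"}
hydrated = hydrate(event, CURRENT_USER_STATE)
print(hydrated)
```

Copying the value into the event avoids a join against a table whose rows may have changed by the time the report runs.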

2. Centralizing & Processing Data

• Centralizing All Data Sources

• Cleaning Data

• Transforming Data

• Managing Job Dependencies

Getting All Data To 1 Place

thor db:cp --source prod1 --destination analytics -t public.* --force-schema prod1

thor db:cp --source A --destination B -t reporting.video_plays --increment

b) Click-stream Data (Hadoop) to Analytics DB

Hadoop → Aggregation (Hive) → Export Output / Sqoop → PostgreSQL

{"origin":"tv_show_show", "app_ver":"2.9.3.151", "uuid":"80833c5a760597bf1c8339819636df04", "user_id":"5298933u", "vs_id":"1008912v-1380452660-7920", "app_id":"100004a", "event":"video_play", "timed_comment":"off", "stream_quality":"variable", "bottom_subtitle":"en", "device_size":"tablet", "feature":"auto_play", "video_id":"1008912v", "subtitle_completion_percent":"100", "device_id":"iPad2,1|6.1.3|apple", "t":"1380452846", "ip":"99.232.169.246", "country":"ca", "city_name":"Toronto", "region_name":"ON"}…

date        source   partner  event       video_id  country  cnt
2013-09-29  ios      viki     video_play  1008912v  ca       2
2013-09-29  android  viki     video_play  1008912v  us       18

Simple Aggregation SQL

SELECT
  SUBSTR( FROM_UNIXTIME( time ), 0, 10 ) AS `date_d`,
  v['source'],
  v['partner'],
  v['event'],
  v['video_id'],
  v['country'],
  COUNT(1) AS cnt
FROM events
WHERE TIME_RANGE(time, '2013-09-29', '2013-09-30')
  AND v['event'] = 'video_play'
GROUP BY
  SUBSTR( FROM_UNIXTIME( time ), 0, 10 ),
  v['source'],
  v['partner'],
  v['event'],
  v['video_id'],
  v['country'];
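
For readers less familiar with Hive, the same group-and-count can be sketched in plain Python. This is an illustration of what the query computes, not how it runs on Hadoop; the sample events are made up, and dates are derived in UTC as an assumption (FROM_UNIXTIME follows the cluster's time zone setting).

```python
from collections import Counter
from datetime import datetime, timezone

DIMENSIONS = ("source", "partner", "event", "video_id", "country")

# Illustrative events; "time" is Unix epoch seconds, as in the query.
EVENTS = [
    {"time": 1380452846, "source": "ios", "partner": "viki",
     "event": "video_play", "video_id": "1008912v", "country": "ca"},
    {"time": 1380456000, "source": "ios", "partner": "viki",
     "event": "video_play", "video_id": "1008912v", "country": "ca"},
    {"time": 1380459000, "source": "android", "partner": "viki",
     "event": "video_play", "video_id": "1008912v", "country": "us"},
    {"time": 1380459100, "source": "ios", "partner": "viki",
     "event": "pv", "video_id": None, "country": "ca"},
]


def aggregate_video_plays(events):
    """Count video_play events per (date, source, partner, event, video_id, country)."""
    counts = Counter()
    for e in events:
        if e.get("event") != "video_play":
            continue  # mirrors the WHERE v['event'] = 'video_play' filter
        day = datetime.fromtimestamp(e["time"], tz=timezone.utc).strftime("%Y-%m-%d")
        counts[(day,) + tuple(e.get(d) for d in DIMENSIONS)] += 1
    return counts


counts = aggregate_video_plays(EVENTS)
for key, cnt in sorted(counts.items()):
    print(*key, cnt)
```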

But… The Data Is Not Clean!

Event properties and names change as we develop:

Old Version: {"user_id": "152", "country_code": "sg"}

New Version: {"user_id": "152u", "country": "sg"}
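
A small sketch of reconciling the two event shapes shown on this slide into one form. The target shape (a "u" suffix on user_id and a "country" key) is taken from the sample events in this deck; the function name normalize is hypothetical.

```python
def normalize(event: dict) -> dict:
    """Map either event shape onto one form: user_id ends in 'u', key is 'country'."""
    out = {k: v for k, v in event.items() if k != "country_code"}

    # Accept either key for the country; prefer "country" when both exist.
    country = event.get("country") or event.get("country_code")
    if country is not None:
        out["country"] = country.strip().lower()

    # Append the "u" suffix only when the id still ends in a digit.
    user_id = event.get("user_id")
    if user_id and user_id[-1].isdigit():
        out["user_id"] = user_id + "u"
    return out


a = normalize({"user_id": "152", "country_code": "sg"})
b = normalize({"user_id": "152u", "country": "sg"})
print(a, b)
```

Keeping the function idempotent (normalizing twice changes nothing) makes it safe to re-run over data that is already partly cleaned.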

(Not so) simple Aggregation SQL (Hadoop)

SELECT
  SUBSTR( FROM_UNIXTIME( time ), 0, 10 ) AS `date_d`,
  v['app_id'] AS `app_id`,
  CASE WHEN v['app_ver'] LIKE '%_ax' THEN 'axis'
       WHEN v['app_ver'] LIKE '%_kd' THEN 'amazon'
       WHEN v['app_ver'] LIKE '%_kf' THEN 'amazon'
       WHEN v['app_ver'] LIKE '%_lv' THEN 'lenovo'
       WHEN v['app_ver'] LIKE '%_nx' THEN 'nexian'
       WHEN v['app_ver'] LIKE '%_sf' THEN 'smartfren'
       WHEN v['app_ver'] LIKE '%_vp' THEN 'samsung_viki_premiere'
       ELSE LOWER( v['partner'] )
  END AS `partner`,
  CASE WHEN ( v['app_id'] = '65535a' AND ( v['site'] IN ( 'www.viki.com', 'viki.com', 'www.viki.mx', 'viki.mx', '' ) ) ) THEN 'direct'
       WHEN ( v['event'] = 'pv' OR v['app_id'] = '100000a' ) THEN 'direct'
       WHEN ( v['app_id'] = '65535a' AND v['site'] NOT IN ( 'www.viki.com', 'viki.com', 'www.viki.mx', 'viki.mx', '' ) ) THEN 'embed'
       WHEN ( v['source'] = 'mobile' AND v['os'] = 'android' ) THEN 'android'
       WHEN ( v['source'] = 'mobile' AND v['device_id'] LIKE '%|apple' ) THEN 'ios'
       ELSE TRIM( v['source'] )
  END AS `source`,
  LOWER( CASE WHEN LENGTH( TRIM( COALESCE( v['country'], v['country_code'] ) ) ) = 2
              THEN TRIM( COALESCE( v['country'], v['country_code'] ) )
              ELSE NULL END ) AS `country`,
  COALESCE( v['device_size'], v['device'] ) AS `device`,
  COUNT( 1 ) AS `cnt`
FROM events
WHERE time >= 1380326400 AND time <= 1380412799
  AND v['event'] = 'video_play'
GROUP BY
  SUBSTR( FROM_UNIXTIME( time ), 0, 10 ),
  v['app_id'],
  CASE WHEN v['app_ver'] LIKE '%_ax' THEN 'axis'
       WHEN v['app_ver'] LIKE '%_kd' THEN 'amazon'
       WHEN v['app_ver'] LIKE '%_kf' THEN 'amazon'
       WHEN v['app_ver'] LIKE '%_lv' THEN 'lenovo'
       WHEN v['app_ver'] LIKE '%_nx' THEN 'nexian'
       WHEN v['app_ver'] LIKE '%_sf' THEN 'smartfren'
       WHEN v['app_ver'] LIKE '%_vp' THEN 'samsung_viki_premiere'
       ELSE LOWER( v['partner'] )
  END,
  CASE WHEN ( v['app_id'] = '65535a' AND ( v['site'] IN ( 'www.viki.com', 'viki.com', 'www.viki.mx', 'viki.mx', '' ) ) ) THEN 'direct'
       WHEN ( v['event'] = 'pv' OR v['app_id'] = '100000a' ) THEN 'direct'
       WHEN ( v['app_id'] = '65535a' AND v['site'] NOT IN ( 'www.viki.com', 'viki.com', 'www.viki.mx', 'viki.mx', '' ) ) THEN 'embed'
       WHEN ( v['source'] = 'mobile' AND v['os'] = 'android' ) THEN 'android'
       WHEN ( v['source'] = 'mobile' AND v['device_id'] LIKE '%|apple' ) THEN 'ios'
       ELSE TRIM( v['source'] )
  END,
  LOWER( CASE WHEN LENGTH( TRIM( COALESCE( v['country'], v['country_code'] ) ) ) = 2
              THEN TRIM( COALESCE( v['country'], v['country_code'] ) )
              ELSE NULL END ),
  COALESCE( v['device_size'], v['device'] );
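
The partner and country rules buried in the CASE expressions can be read more easily as small functions. The sketch below restates those two rules in Python for illustration; the function names are hypothetical. One difference is deliberate: in SQL LIKE, "_" matches any single character, so '%_ax' also matches a value such as "1.0-ax", whereas this sketch assumes the intent was a literal "_ax" suffix.

```python
from typing import Optional

# Suffix-to-partner mapping copied from the CASE expression on app_ver.
PARTNER_BY_SUFFIX = {
    "_ax": "axis",
    "_kd": "amazon",
    "_kf": "amazon",
    "_lv": "lenovo",
    "_nx": "nexian",
    "_sf": "smartfren",
    "_vp": "samsung_viki_premiere",
}


def derive_partner(event: dict) -> Optional[str]:
    """Partner from the app_ver suffix, else the lowercased 'partner' field."""
    app_ver = event.get("app_ver") or ""
    for suffix, partner in PARTNER_BY_SUFFIX.items():
        if app_ver.endswith(suffix):  # literal suffix match (assumed intent)
            return partner
    partner = event.get("partner")
    return partner.lower() if partner is not None else None


def derive_country(event: dict) -> Optional[str]:
    """First non-null of country / country_code, kept only if it is 2 characters."""
    raw = event.get("country")
    if raw is None:
        raw = event.get("country_code")
    if raw is None:
        return None
    value = raw.strip()
    return value.lower() if len(value) == 2 else None


print(derive_partner({"app_ver": "2.1_kd"}), derive_country({"country_code": " SG "}))
```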

UPDATE "reporting"."cl_main_2013_09"
SET source = 'embed', partner = 'partner1'
WHERE app_id = '100105a' AND (source != 'embed' OR partner != 'partner1');

UPDATE "reporting"."cl_main_2013_09"
SET app_id = '100105a'
WHERE (source = 'embed' AND partner = 'partner1') AND (app_id != '100105a');

UPDATE reporting.cl_main_2013_09
SET user_id = user_id || 'u'
WHERE RIGHT(user_id, 1) ~ '[0-9]';

UPDATE "reporting"."cl_main_2013_09"
SET app_id = '100106a'
WHERE (source = 'embed' AND partner = 'partner2') AND (app_id != '100106a');

UPDATE reporting.cl_main_2013_09
SET source = 'raynor', partner = 'viki', app_id = '100000a'
WHERE event = 'pv'
  AND source IS NULL
  AND partner IS NULL
  AND app_id IS NULL;

…post-import cleanup

PostgreSQL

Cleaning Up Data Takes Lots of Time
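
The user_id fix-up above has a useful property: its WHERE clause only matches rows that still need fixing, so the statement can be re-run safely. The sketch below demonstrates this with Python's built-in sqlite3 module standing in for PostgreSQL, purely as an assumption for the sake of a runnable example. SQLite has no RIGHT() or ~ operator, so the predicate is rewritten with substr() and GLOB, and the table is a simplified stand-in.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for reporting.cl_main_2013_09.
conn.execute("CREATE TABLE cl_main (user_id TEXT, event TEXT)")
conn.executemany(
    "INSERT INTO cl_main VALUES (?, ?)",
    [("152", "video_play"), ("153u", "video_play")],
)

# SQLite rewrite of: SET user_id = user_id || 'u' WHERE RIGHT(user_id, 1) ~ '[0-9]'
FIX_USER_ID = (
    "UPDATE cl_main SET user_id = user_id || 'u' "
    "WHERE substr(user_id, -1) GLOB '[0-9]'"
)

# Running the statement twice must not produce ids like "152uu".
for _ in range(2):
    conn.execute(FIX_USER_ID)

user_ids = sorted(row[0] for row in conn.execute("SELECT user_id FROM cl_main"))
print(user_ids)
```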

Transforming Data

• Centralizing All Data Sources

• Cleaning Data

• Transforming Data

• Managing Job Dependencies

Transforming Data

Table A

Table B

Analytics DB (PostgreSQL)

a) Reducing Table Size By Dropping Dimension (Aggregation)

video_plays_with_video_id (PostgreSQL, 20M records)

date        source  partner  event       video_id  country  cnt
2013-09-29  ios     viki     video_play  1v        ca       2
2013-09-29  ios     viki     video_play  2v        ca       18

video_plays (PostgreSQL, 4M records)

date        source  partner  event       country  cnt
2013-09-29  ios     viki     video_play  ca       20
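
Dropping a dimension amounts to summing cnt over every value of that column. A minimal Python sketch of the roll-up, using the two sample rows from this slide; the function name drop_dimension is hypothetical, and in practice this step would be a GROUP BY inside the analytics database.

```python
from collections import defaultdict

COLUMNS = ("date", "source", "partner", "event", "video_id", "country", "cnt")

# The two detailed rows shown for video_plays_with_video_id.
DETAILED = [
    ("2013-09-29", "ios", "viki", "video_play", "1v", "ca", 2),
    ("2013-09-29", "ios", "viki", "video_play", "2v", "ca", 18),
]


def drop_dimension(rows, columns, dimension):
    """Sum 'cnt' over all values of one dimension; cnt is the last output column."""
    drop_at = columns.index(dimension)
    cnt_at = columns.index("cnt")
    totals = defaultdict(int)
    for row in rows:
        key = tuple(v for i, v in enumerate(row) if i not in (drop_at, cnt_at))
        totals[key] += row[cnt_at]
    return [key + (cnt,) for key, cnt in totals.items()]


video_plays = drop_dimension(DETAILED, COLUMNS, "video_id")
print(video_plays)
```

The smaller table answers any question that does not need video_id, which is why reports that only slice by date, source, or country can read it instead of the larger one.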

b) Injecting Extra Fields For Analysis

shows (PostgreSQL)

id  title
1c  Game of Thrones
2c  How I Met Your Mother

shows (1) to videos (n)

shows (with num_videos injected)

id  title                  num_videos
1c  Game of Thrones        30
2c  How I Met Your Mother  16

Injecting Extra Fields For Analysis

containers (PostgreSQL)

id  title
1c  Game of Thrones
2c  My Girlfriend Is A Gumiho

containers (1) to videos (n)

containers (with video_count injected)

id  title                      video_count
1c  Game of Thrones            30
2c  My Girlfriend Is A Gumiho  16
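
The injected column is a precomputed count over the one-to-many relation. A sketch in Python, where the container_id foreign key on videos is an assumed column name (the slide only shows the 1:n relation) and the video rows are generated to match the counts on the slide.

```python
from collections import Counter

CONTAINERS = [
    {"id": "1c", "title": "Game of Thrones"},
    {"id": "2c", "title": "My Girlfriend Is A Gumiho"},
]

# Hypothetical videos table: only the assumed foreign key matters here.
VIDEOS = [{"container_id": "1c"}] * 30 + [{"container_id": "2c"}] * 16


def with_video_count(containers, videos):
    """Return container rows with a precomputed video_count column."""
    counts = Counter(v["container_id"] for v in videos)
    return [{**c, "video_count": counts.get(c["id"], 0)} for c in containers]


enriched = with_video_count(CONTAINERS, VIDEOS)
for row in enriched:
    print(row["id"], row["title"], row["video_count"])
```

Storing the count on the row trades some freshness for simpler report queries, since analysts no longer need the join.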

Chunk Tables By Month

video_plays (parent table)

video_plays_2013_06

video_plays_2013_07

video_plays_2013_08

video_plays_2013_09

ALTER TABLE video_plays_2013_09 INHERIT video_plays;

ALTER TABLE video_plays_2013_09
  ADD CONSTRAINT video_plays_2013_09_date_check  -- constraint name is illustrative
  CHECK (date >= '2013-09-01' AND date < '2013-10-01');
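
Loading jobs need to know which monthly table a row belongs to and what date bounds its CHECK constraint should carry. A small sketch following the naming pattern on this slide; both helper names are hypothetical.

```python
from datetime import date
from typing import Tuple


def partition_name(parent: str, day: date) -> str:
    """Monthly child table name, e.g. video_plays_2013_09."""
    return f"{parent}_{day.year:04d}_{day.month:02d}"


def month_bounds(day: date) -> Tuple[date, date]:
    """Inclusive start and exclusive end of the month containing `day`."""
    start = day.replace(day=1)
    if start.month == 12:
        end = date(start.year + 1, 1, 1)
    else:
        end = date(start.year, start.month + 1, 1)
    return start, end


day = date(2013, 9, 29)
print(partition_name("video_plays", day), month_bounds(day))
```

Using a half-open range (start inclusive, end exclusive) matches the constraint shown above and avoids gaps or overlaps between adjacent months.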

Managing Job Dependency

• Centralizing All Data Sources

• Cleaning Data

• Transforming Data

• Managing Job Dependencies

Managing Job Dependency

Job A

Job B

Analytics DB (PostgreSQL)

Managing Job Dependency

tableA

tableB

Analytics DB (PostgreSQL)

Azkaban

Cron dependency management

(Viki Cron Dependency Graph)
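
The core idea behind dependency management is that a job runs only after everything it depends on has finished, which is a topological ordering of the job graph. The sketch below illustrates that idea with Python's standard graphlib module (Python 3.9+). It is not Azkaban's API or configuration format, and the job names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs: each key maps to the set of jobs that must finish first.
DEPENDENCIES = {
    "import_events": set(),
    "copy_prod_tables": set(),
    "aggregate_video_plays": {"import_events"},
    "build_report": {"aggregate_video_plays", "copy_prod_tables"},
}


def run_order(dependencies):
    """Return one valid execution order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(dependencies).static_order())


order = run_order(DEPENDENCIES)
print(order)
```

With plain cron, the same ordering has to be approximated by spacing start times, which breaks silently when an upstream job runs long.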

3. Data Presentation and Visualization

Query Reports

Summary report

• Higher level view of metrics

• See changes over time

• (screen shot)

Data Explorer

“The world is your oyster”

4. Real Time Infrastructure

Real Time Infrastructure (Apache Storm)

Real Time Dashboard

Alerts

Know when the house is burning down!
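
The talk does not describe how alerts are computed. As one possible illustration only, the sketch below flags an interval whose event count falls well below the average of the preceding intervals. The class name, window size, and ratio are all assumptions.

```python
from collections import deque


class DropAlert:
    """Alert when a count falls below `ratio` times the trailing average."""

    def __init__(self, window: int = 5, ratio: float = 0.5):
        self.history = deque(maxlen=window)
        self.ratio = ratio

    def observe(self, count: int) -> bool:
        """Record one interval's count; return True if it should raise an alert."""
        alert = False
        # Only judge once a full window of history is available.
        if len(self.history) == self.history.maxlen:
            baseline = sum(self.history) / len(self.history)
            alert = count < baseline * self.ratio
        self.history.append(count)
        return alert


detector = DropAlert(window=3, ratio=0.5)
results = [detector.observe(c) for c in [100, 110, 90, 95, 20]]
print(results)
```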

Then Global Content Source and Consumption

Our Technology Stack

• Languages/Frameworks

– Ruby, Rails, Python, Go, JavaScript, NodeJS

– Fluentd (Log collector)

– Java, Apache Storm, Kestrel

• Databases

– PostgreSQL, MongoDB, Redis

– Hadoop/Hive, Amazon Redshift

– Amazon Elastic MapReduce

Hadoop vs. Amazon Redshift

• Hadoop is a big-data storage and processing platform

– HDFS: data-storage layer

– YARN: resource management

– MapReduce/Pig/Hive/Spark: processing layer

• Amazon Redshift (MPP, massively parallel processing)

– Columnar-storage database, meant for analytics.

– OLAP: Online Analytical Processing

– Examples: Vertica, Amazon Redshift, ParAccel

Recap

Thank You!

http://engineering.viki.com/blog/2014/data-warehouse-and-analytics-infrastructure-at-viki/

http://bit.ly/viki-datawarehouse

engineering.viki.com