Data Platform and Services Vipul Sharma and Eyal Reuveni

Page 1: Testtting

Data Platform and Services

Vipul Sharma and Eyal Reuveni

Page 2: Testtting

Agenda

• Eventbrite
• Data Products
• Data Platform
• Recommendations
• Questions

Page 3: Testtting

• A social event ticketing and discovery platform
• 50 millionth ticket sold
• Revenue doubled YoY
• 180 employees in SoMa, SF
• Solving significant engineering problems
  • Data, Infrastructure, Mobile, Web, Scale, Ops, QA
• Firing on all cylinders and hiring blazing fast: www.eventbrite.com/jobs

Page 4: Testtting

Data Products

Page 5: Testtting
Page 6: Testtting
Page 7: Testtting

Analytics

• Ad hoc queries by analysts

Page 8: Testtting

Fraud and Spam

Page 9: Testtting

Data Platform

Page 10: Testtting
Page 11: Testtting

Hadoop Cluster

• 30 persistent EC2 High-Memory instances
• 30 TB disk with replication factor of 2, ext3 formatted
• CDH3
• Fair Scheduler
• HBase

Page 12: Testtting

Infrastructure

• Search
  • Solr
  • Incremental updates towards event driven
• Recommendation/Graph
  • Hadoop
  • Native Java MapReduce
  • Bash for workflow
• Persistence
  • MySQL
  • HDFS
  • HBase
  • MongoDB (investigating Cassandra and Riak)

Page 13: Testtting

Infrastructure

• Stream
  • RabbitMQ
  • Internal fire hose (investigating Kafka)
• Offline
  • MapReduce
  • Streaming
  • Hive
  • Hue

Page 14: Testtting

Infrastructure - Sqoozie

• Workflow for MySQL imports to HDFS
  • Generate Sqoop commands
  • Run these imports in parallel
• Transparent to schema changes
• Include or exclude on column, data type, table level
• Data type casting: tinyint(1) to Integer
• Distributed table imports
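The Sqoozie ideas on this slide (generate one Sqoop command per table, exclude columns, cast tinyint(1) to Integer, run imports in parallel) could be sketched roughly as follows. This is not Sqoozie's actual code: the function names, the `(name, type)` schema shape, the JDBC URL, and the target path are hypothetical. Only the `sqoop import` flags shown (`--connect`, `--table`, `--columns`, `--target-dir`, `--map-column-java`) are real Sqoop options.

```python
from concurrent.futures import ThreadPoolExecutor


def build_sqoop_command(jdbc_url, table, columns, target_root, excluded=()):
    """Build one `sqoop import` argument list for a single MySQL table.

    `columns` is a list of (name, mysql_type) pairs, assumed to be read from
    the live schema on every run so that schema changes are picked up.
    """
    excluded = set(excluded)
    kept = [(name, sql_type) for name, sql_type in columns if name not in excluded]
    cmd = [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--columns", ",".join(name for name, _ in kept),
        "--target-dir", f"{target_root}/{table}",
    ]
    # Cast MySQL tinyint(1) columns to Java Integer, as the slide describes.
    casts = [f"{name}=Integer" for name, sql_type in kept
             if sql_type.lower() == "tinyint(1)"]
    if casts:
        cmd += ["--map-column-java", ",".join(casts)]
    return cmd


def run_imports(commands, runner, max_workers=4):
    """Run the generated commands in parallel; `runner` executes one command."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(runner, commands))


if __name__ == "__main__":
    schema = [("id", "int(11)"), ("is_paid", "tinyint(1)"), ("secret", "varchar(64)")]
    cmd = build_sqoop_command("jdbc:mysql://db.example/app", "orders",
                              schema, "/warehouse", excluded={"secret"})
    # Dry run: print the command instead of executing it.
    run_imports([cmd], runner=lambda c: print(" ".join(c)))
```

In a real workflow the `runner` would hand each argument list to `subprocess.run`; keeping it pluggable makes a dry run possible without a cluster.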

Page 15: Testtting

Infrastructure - Blammo

• Raw logs are imported to HDFS via Flume
• Almost real-time – 5 min latency
• Logs are key-value pairs in JSON
• Each log producer publishes schema in YAML
• Hive schema and schema YAML in sync using Thrift
• Control exclusion and inclusion
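As a rough illustration of keeping a Hive schema aligned with a producer-published schema, the sketch below derives a Hive column list from a schema and projects a JSON log line onto the included fields. It is an assumption-laden stand-in, not Blammo itself: the schema is shown as an already-parsed dictionary (what loading the YAML might yield), the field names and type mapping are invented, and the Thrift-based sync from the slide is not modeled.

```python
import json

# Hypothetical mapping from schema types to Hive column types.
HIVE_TYPES = {"string": "STRING", "int": "BIGINT", "float": "DOUBLE", "bool": "BOOLEAN"}


def included_fields(schema):
    """Fields that survive the schema's exclusion list, in declared order."""
    excluded = set(schema.get("exclude", ()))
    return [(name, kind) for name, kind in schema["fields"].items()
            if name not in excluded]


def hive_ddl(schema):
    """Render a CREATE EXTERNAL TABLE statement from the schema."""
    cols = ",\n  ".join(f"{name} {HIVE_TYPES[kind]}"
                        for name, kind in included_fields(schema))
    return f"CREATE EXTERNAL TABLE IF NOT EXISTS {schema['table']} (\n  {cols}\n)"


def project(line, schema):
    """Keep only the included fields of one JSON log line."""
    record = json.loads(line)
    return {name: record.get(name) for name, _ in included_fields(schema)}


if __name__ == "__main__":
    schema = {
        "table": "page_views",
        "fields": {"user_id": "int", "url": "string", "ip": "string"},
        "exclude": ["ip"],
    }
    print(hive_ddl(schema))
    print(project('{"user_id": 7, "url": "/e/1", "ip": "10.0.0.1"}', schema))
```

Deriving both the table definition and the per-line projection from one schema object is what keeps the two from drifting apart.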

Page 16: Testtting

Recommendations

Page 17: Testtting

You will like to attend this event

Page 18: Testtting

Recommendation Engines

Item Hierarchy (You bought a camera so you need batteries - Amazon)

Collaborative Filtering – User-User Similarity (People who bought a camera also bought batteries - Amazon)

Collaborative Filtering – Item-Item Similarity (You like Godfather so you will like Scarface - Netflix)

Social Graph Based (Your friends like Lady Gaga so you will like Lady Gaga, PYMK - Facebook, LinkedIn)

Interest Graph Based (Your friends who, like you, like rock music are attending an Eric Clapton event - Eventbrite)
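To make the item-item collaborative filtering entry concrete, here is a minimal sketch that scores two items by the cosine similarity of their user rating vectors. Cosine similarity is one common choice, not something the slides specify, and the data and names are made up for illustration.

```python
from math import sqrt


def item_similarity(ratings, a, b):
    """Cosine similarity between items `a` and `b`.

    `ratings` maps item -> {user: rating}; a missing user means no rating.
    """
    users_a = ratings.get(a, {})
    users_b = ratings.get(b, {})
    shared = users_a.keys() & users_b.keys()
    dot = sum(users_a[u] * users_b[u] for u in shared)
    norm = (sqrt(sum(v * v for v in users_a.values()))
            * sqrt(sum(v * v for v in users_b.values())))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    likes = {
        "Godfather": {"u1": 1, "u2": 1, "u3": 1},
        "Scarface": {"u1": 1, "u2": 1},
        "Bambi": {"u4": 1},
    }
    print(item_similarity(likes, "Godfather", "Scarface"))
    print(item_similarity(likes, "Godfather", "Bambi"))
```

Items liked by overlapping sets of users score close to 1, while items with no users in common score 0.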

Page 19: Testtting

Why Interest?

Events are Social
Events are Interest

Dense Graph is Irrelevant
Interests are Changing

Page 20: Testtting

How do we know your Interest?

• We ask you
• Based on your activity
  • Events attended
  • Events browsed
• Facebook interests
  • User interest has to match event category
  • Static
• Machine learning
  • Logistic regression using MLE
  • Sparse matrix is generated using MapReduce
  • A model for each interest
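The machine learning bullets (logistic regression fit by maximum likelihood, sparse inputs, one model per interest) can be sketched as below. This is a toy under stated assumptions: the feature indices, labels, learning rate, and the use of plain batch gradient ascent are all illustrative choices, and the sparse rows stand in for the matrix that the slides say is generated with MapReduce.

```python
from math import exp


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    ez = exp(z)
    return ez / (1.0 + ez)


def predict(w, b, x):
    """Probability that a user with sparse features `x` has the interest."""
    return sigmoid(b + sum(w[j] * v for j, v in x.items()))


def fit_logistic(rows, labels, n_features, lr=0.5, epochs=500):
    """Maximize the log-likelihood by batch gradient ascent.

    `rows` is a list of sparse feature dicts {feature_index: value}.
    """
    w, b = [0.0] * n_features, 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(rows, labels):
            err = y - predict(w, b, x)  # gradient of the log-likelihood
            grad_b += err
            for j, v in x.items():
                grad_w[j] += err * v
        b += lr * grad_b / n
        w = [wj + lr * gj / n for wj, gj in zip(w, grad_w)]
    return w, b


if __name__ == "__main__":
    # Hypothetical features: 0 = attended rock event, 1 = browsed rock event,
    # 2 = attended tech event.
    rows = [{0: 1, 1: 1}, {0: 1}, {1: 1}, {2: 1}, {2: 1}, {}]
    labels_by_interest = {"rock": [1, 1, 1, 0, 0, 0], "tech": [0, 0, 0, 1, 1, 0]}
    # One independent model per interest, as on the slide.
    models = {name: fit_logistic(rows, y, n_features=3)
              for name, y in labels_by_interest.items()}
    w, b = models["rock"]
    print(predict(w, b, {0: 1}), predict(w, b, {2: 1}))
```

Storing each row as a dictionary of non-zero entries keeps the per-example cost proportional to the number of features a user actually has.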

Page 21: Testtting

Model Based vs Clustering

Building the Social Graph is the Clustering Step

Social Graph Recommendation is a Ranking Problem

Item-Item vs User-User

Page 22: Testtting

Implicit Social Graph

[Diagram: graph of user nodes U1 to U5 and event nodes E1 to E4]

Page 23: Testtting

Mixed Social Graph

[Diagram: graph of user nodes U1 to U5, event nodes E1 to E3, and FB and LI nodes]

Page 24: Testtting

15M * 260 * 260 ≈ 1.01 Trillion Edges

4 Billion edges ranked

Each node is a feature vector representing a User

Each edge is a feature vector representing a Relationship

Page 25: Testtting

Feature Generation

• Mixed features
• A series of MapReduce jobs
• Output on HDFS in flat files; input to subsequent jobs
• Orders = event attendees
  • MAP: eid: uid
  • REDUCE: eid:[uid]
• Attendees social graph
  • Input: eid:[uid]
  • MAP: uid_i:[uid]
  • REDUCE: uid:[neighbors]
• Interest-based features, user-specific, graph mining, etc.
• Upload feature values to HBase
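The two chained jobs above (orders to eid:[uid], then eid:[uid] to uid:[neighbors]) can be imitated in memory as follows. This is only a sketch of the data flow: `run_job` is a hypothetical stand-in for a Hadoop job, the sample orders are invented, and the slides' real jobs are native Java MapReduce writing flat files to HDFS.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """In-memory stand-in for one MapReduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in sorted(groups.items())]


# Job 1: orders -> eid:[uid]
def orders_map(order):
    uid, eid = order
    yield eid, uid


def attendees_reduce(eid, uids):
    return eid, sorted(set(uids))


# Job 2: eid:[uid] -> uid:[neighbors]
def coattendee_map(event):
    _eid, uids = event
    for uid in uids:
        # Emit every co-attendee of this event as a candidate neighbor.
        yield uid, [other for other in uids if other != uid]


def neighbors_reduce(uid, neighbor_lists):
    merged = set()
    for neighbors in neighbor_lists:
        merged.update(neighbors)
    return uid, sorted(merged)


if __name__ == "__main__":
    orders = [("u1", "e1"), ("u2", "e1"), ("u3", "e1"), ("u1", "e2"), ("u4", "e2")]
    attendees = run_job(orders, orders_map, attendees_reduce)
    graph = dict(run_job(attendees, coattendee_map, neighbors_reduce))
    print(attendees)
    print(graph)
```

The output of the first job is fed unchanged into the second, mirroring how each job's flat files on HDFS become the input of the next.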

Page 26: Testtting

[Diagram: user nodes U1, U2, U3]

Page 27: Testtting

HBase

Page 28: Testtting

HBase

• Collect data from multiple MapReduce jobs
• Stores entire social graph
• Over one million writes per second

Page 29: Testtting

HBase

rowid    neighbors  events  featureX
2718282  101        3       0.3678795

Page 30: Testtting

HBase

rowid    314159:n  314159:e  314159:fx  161803:n  161803:e  161803:fx
2718282  31        1         0.3183     83        2         0.618
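One way to read the wide row above is that each column name combines a second id with a feature name, so all per-pair features for a user live in a single row. The sketch below models that layout with plain dictionaries; it does not use an HBase client, and treating the prefix as a neighbor id and `n`, `e`, `fx` as feature names is an interpretation, not something the slide states.

```python
def to_wide_row(edges):
    """Flatten {other_id: {feature: value}} into {"<other_id>:<feature>": value}."""
    return {f"{other_id}:{name}": value
            for other_id, features in edges.items()
            for name, value in features.items()}


def from_wide_row(row):
    """Rebuild the nested per-id feature dicts from a wide row."""
    edges = {}
    for qualifier, value in row.items():
        other_id, name = qualifier.split(":", 1)
        edges.setdefault(int(other_id), {})[name] = value
    return edges


if __name__ == "__main__":
    edges = {
        314159: {"n": 31, "e": 1, "fx": 0.3183},
        161803: {"n": 83, "e": 2, "fx": 0.618},
    }
    table = {2718282: to_wide_row(edges)}  # row key -> wide row
    print(table[2718282]["314159:fx"])
    print(from_wide_row(table[2718282]) == edges)
```

With this layout, one row read returns every feature for a user, and a new related id only adds columns rather than rows.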

Page 31: Testtting

Tips & Tricks

• Distributed cache database
• Sped up some MapReduce jobs by hours
• Be sure to use counters!
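A plausible reading of the first and last tips is a map-side join: a small lookup table is shipped to every mapper so the join needs no shuffle, and counters record how often the lookup hits or misses. The sketch below shows that pattern in plain Python. The lookup contents and counter names are hypothetical, and in Hadoop the table would arrive through the distributed cache and the counts through the framework's counters.

```python
from collections import Counter


def make_mapper(lookup, counters):
    """Build a mapper that joins each (uid, eid) record against `lookup`."""
    def mapper(record):
        uid, eid = record
        category = lookup.get(eid)
        if category is None:
            counters["lookup_miss"] += 1  # surfaces bad or stale lookup data
            return
        counters["lookup_hit"] += 1
        yield uid, category
    return mapper


if __name__ == "__main__":
    event_category = {"e1": "music", "e2": "tech"}  # small side table
    counters = Counter()
    mapper = make_mapper(event_category, counters)
    records = [("u1", "e1"), ("u2", "e2"), ("u3", "e9")]
    joined = [pair for record in records for pair in mapper(record)]
    print(joined)
    print(dict(counters))
```

Checking the miss counter after a run is a cheap way to notice that the side table and the main input have drifted apart.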

Page 32: Testtting

Tips & Tricks

• Hive (ab)uses
• Almost as many Hive jobs as custom ones
• “flip join”
• Statistical functions using Hive
• UDF

Page 33: Testtting

Tips & Tricks

• Memory, memory, memory
• LZO, WAL
• Combiners are great until
• Shuffle and sorting stage
• Hadoop ecosystem is still new

Page 34: Testtting

Questions?