COMPLEMENTING HADOOP WITH REAL-TIME DATA ANALYSIS from Structure:Data 2013


DESCRIPTION

Presentation by Vipul Sharma, Eventbrite (#dataconf). More at http://event.gigaom.com/structuredata/


COMPLEMENTING HADOOP WITH REAL-TIME DATA ANALYSIS

SPEAKER: Vipul Sharma, Director of Data Engineering, Eventbrite

Monday, April 1, 13

Real Time Data Processing at Scale

Vipul Sharma, Director of Data Engineering


Eventbrite by the Numbers


• 1.5 million events
• 80 million tickets sold
• $1 billion in gross ticket sales
• Events in 179 countries


Who am I?

• Director of Data Engineering at Eventbrite
• Infrastructure, Data Science, Analytics, Spam and Fraud

• linkedin.com/in/vipulsharma3
• @vipulsharma
• vipul@eventbrite.com


Real Time

• Definition of real time varies with use case
• Real time at scale is a challenge
• Active learning requires real time data processing
  • Spam/Fraud
  • Discovery
  • Search
• Analytics
  • Real time analytics
• Data Changes
  • Changes in inventory, user settings etc.


Scaling for Growth

• Decouple Services
  • Decouple services based on CAP, size and growth
  • NoSQL is attractive for out-of-the-box sharding, replication and multi-data-center support, along with high write speeds
  • Multiple data stores pose the challenge of moving data between services in real time
• Batch Processing
  • Batch processing for big data, e.g. data science, analytics etc.
  • MapReduce is not built for real time
  • Data locality requires data to be stored on HDFS
  • Syncing data to Hadoop in real time is a challenge


Challenges with Real Time

• Data Flow
  • How to transfer data captured in logs to services in real time
  • How to transfer data captured in database to services in real time
• Data Processing
  • How to process significant data in real time
  • Distributed data processing for real time


Data Flow

• Database polling
  • Rather than each application polling, build a single polling service
  • Downstream applications poll from this service
  • Built for consistency and read scalability
  • Example: Event Cache
  • Excited about LinkedIn’s Databus - http://data.linkedin.com/projects/databus
• Persisted Queues
  • Transfer logs via a distributed persisted message queue
  • Downstream applications subscribe to these queues, getting a stream of data
  • Example: Firehose
  • Excited about LinkedIn’s Kafka - http://kafka.apache.org/index.html
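
The two data-flow patterns above can be sketched together in a few lines of Python. This is a minimal in-memory illustration, not Eventbrite's Event Cache or Firehose and not the Databus or Kafka API: the names `ChangeLog`, `PollingService` and `Subscriber`, and the `version` column used as a change marker, are all assumptions made for the example. One service polls the database, publishes changes to an append-only log, and each downstream application reads from the log at its own offset.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


@dataclass
class ChangeLog:
    """Append-only log; an in-memory stand-in for a persisted queue."""
    entries: List[Row] = field(default_factory=list)

    def append(self, entry: Row) -> int:
        self.entries.append(entry)
        return len(self.entries) - 1  # offset of the new entry

    def read_from(self, offset: int, limit: int = 100) -> Tuple[List[Row], int]:
        batch = self.entries[offset:offset + limit]
        return batch, offset + len(batch)  # batch plus the next offset to read


class PollingService:
    """The only component that polls the database table."""

    def __init__(self, table: List[Row], log: ChangeLog) -> None:
        self.table = table
        self.log = log
        self.last_version = 0  # high-water mark of rows already published

    def poll_once(self) -> int:
        changed = sorted(
            (row for row in self.table if row["version"] > self.last_version),
            key=lambda row: row["version"],
        )
        for row in changed:
            self.log.append(dict(row))  # copy so later edits keep history intact
            self.last_version = row["version"]
        return len(changed)


class Subscriber:
    """A downstream application that tracks its own position in the log."""

    def __init__(self, name: str, log: ChangeLog) -> None:
        self.name = name
        self.log = log
        self.offset = 0

    def consume(self) -> List[Row]:
        batch, self.offset = self.log.read_from(self.offset)
        return batch


def demo() -> Dict[str, int]:
    table = [{"id": 1, "title": "Data meetup", "version": 1},
             {"id": 2, "title": "Hadoop night", "version": 2}]
    log = ChangeLog()
    poller = PollingService(table, log)
    search, analytics = Subscriber("search", log), Subscriber("analytics", log)

    poller.poll_once()
    first = search.consume()              # search reads the two initial rows
    table[0].update(title="Data meetup (moved)", version=3)
    poller.poll_once()
    second = search.consume()             # search reads only the new change
    late = analytics.consume()            # a late subscriber replays everything
    return {"first": len(first), "second": len(second), "late": len(late)}
```

Because each subscriber keeps its own offset, a slow or newly added consumer does not affect the others, and the database sees one polling client regardless of how many applications consume the changes.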


Data Processing

• Denormalization
  • Write data ready to serve
  • NoSQL built for denormalization
  • Example: See who’s visiting
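
"Write data ready to serve" can be illustrated with a small sketch. The slides do not describe how the "See who's visiting" feature is built, so everything here is hypothetical: the `VisitorStore` class, the `MAX_RECENT` cap, and a plain dictionary standing in for a NoSQL key-value store. The idea is that each visit updates a per-event list at write time, so serving the feature is a single key lookup with no joins or aggregation.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List

MAX_RECENT = 3  # hypothetical cap on visitors kept per event


class VisitorStore:
    """Keeps a ready-to-serve list of recent visitors per event."""

    def __init__(self) -> None:
        # event_id -> most recent visitors, newest first
        self._recent: Dict[int, Deque[str]] = defaultdict(
            lambda: deque(maxlen=MAX_RECENT)
        )

    def record_visit(self, event_id: int, user_name: str) -> None:
        recent = self._recent[event_id]
        if user_name in recent:
            recent.remove(user_name)   # a repeat visitor moves to the front
        recent.appendleft(user_name)   # the oldest entry drops off when full

    def who_is_visiting(self, event_id: int) -> List[str]:
        # Read path: one lookup, already in display order
        return list(self._recent.get(event_id, ()))


store = VisitorStore()
for name in ["ann", "bob", "ann", "cat", "dan"]:
    store.record_visit(7, name)
print(store.who_is_visiting(7))  # ['dan', 'cat', 'ann']
```

The trade-off is extra work on every write in exchange for a cheap, predictable read.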

• Distributed Data Processing
  • Complex business logic needs more than denormalization
  • Example: API stats using Storm
  • http://storm-project.net/
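
The API stats example is not detailed in the slides, so the following is only a rough sketch of the kind of computation involved, written in plain Python rather than against the Storm API. It assumes requests arrive as `(timestamp, endpoint)` pairs and that the statistic is a per-minute request count; `CountWorker`, `partition` and `run` are hypothetical names. Requests are routed by a stable hash of the endpoint so the same endpoint always reaches the same worker, similar in spirit to a fields grouping in Storm, and each worker keeps counts per time window.

```python
import zlib
from collections import Counter
from typing import Dict, Iterable, Tuple

Request = Tuple[int, str]      # (unix timestamp in seconds, API endpoint)
WindowKey = Tuple[int, str]    # (window start timestamp, API endpoint)


def partition(endpoint: str, num_workers: int) -> int:
    """Stable hash so one endpoint is always handled by one worker."""
    return zlib.crc32(endpoint.encode("utf-8")) % num_workers


class CountWorker:
    """Counts requests per endpoint in fixed-size time windows."""

    def __init__(self, window_seconds: int = 60) -> None:
        self.window_seconds = window_seconds
        self.counts: Counter = Counter()

    def process(self, request: Request) -> None:
        timestamp, endpoint = request
        window_start = timestamp - timestamp % self.window_seconds
        self.counts[(window_start, endpoint)] += 1


def run(requests: Iterable[Request], num_workers: int = 2) -> Dict[WindowKey, int]:
    workers = [CountWorker() for _ in range(num_workers)]
    for request in requests:
        workers[partition(request[1], num_workers)].process(request)
    merged: Counter = Counter()
    for worker in workers:
        merged.update(worker.counts)  # keys never overlap across workers
    return dict(merged)


stats = run([(0, "/events"), (30, "/events"), (61, "/events"), (10, "/orders")])
print(stats[(0, "/events")], stats[(60, "/events")], stats[(0, "/orders")])  # 2 1 1
```

In a real deployment the workers would run on separate machines and consume from a stream rather than a list; the routing-by-key step is what lets each worker count independently.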


Questions?

See it in action. Download our app:

eventbrite.com/eventbriteapp


Thank You!

@vipulsharma / vipul@eventbrite.com
