
Page 1

Big Data at uberVU

Mihnea Giurgea
Lead Developer @ uberVU

Page 2

Contents

Introduction

Infrastructure

Signals

Lessons learned

Questions?

Page 3

What do we do?

● Gather mentions that refer to a specific subject from the web:
  ○ a brand (Coca-Cola)
  ○ an event (Comic-Con), etc.

● Search, aggregate and analyze data to provide statistics and insights to clients

● Everything is within the context of a "stream"

Page 4

What do we do?

● social media monitoring, reporting and engagement

● a reactive market led us to actionable insights

● Signals - a collection of intelligent algorithms
  ○ some machine learning
  ○ mostly just statistics

Page 5

Signals

Page 6

Signals

● Twitter top influencers
  ○ reach out to promote your brand

● spikes & bursts
  ○ be aware of global events for specific annotations

● "asking for help" mentions
  ○ generate leads

● trending stories
  ○ promote and raise engagement

Page 7

BigData?

What is "big data"?

Wikipedia says: "Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications"

Page 8

BigData?

The LHC is one of the biggest data sources:
● produces 15 PB per year (~41 TB per day)

Page 9

What's our data?

● ~70M mentions per day
  ○ tweets
  ○ Facebook (public) posts
  ○ Google+ posts, etc.

● 100s of Amazon instances

● we record ~3TB per month

Page 10

Contents

Introduction

Infrastructure

Signals

Lessons learned

Questions?

Page 11

Technologies

● Amazon Web Services

● MongoDB - lots of use-cases

● Kestrel - fast & low administration

● Redis - fast, in-memory dataset

● DynamoDB - fast, easy to scale, but $$$

Page 12

Data acquisition

● collect data from multiple platforms (20+)
  ○ Twitter, Facebook, Google+, blogs, boards, etc.

● specialized workers for each platform, so that adding a new platform is easy

● in-house refreshing system (sketched below)
  ○ periodically poll each feed
  ○ adjust refresh rate according to activity
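
A minimal sketch of such an adaptive poller. The rate bounds and back-off factors are illustrative assumptions (the deck doesn't give uberVU's actual numbers), and fetch_new_items is a hypothetical helper:

```python
import time

MIN_INTERVAL = 60      # illustrative lower bound: busy feeds, once a minute
MAX_INTERVAL = 3600    # illustrative upper bound: dead feeds, once an hour

def adjust_interval(interval, new_items):
    """Poll busy feeds more often, quiet feeds less often."""
    if new_items > 0:
        interval /= 2                   # activity seen: speed up
    else:
        interval *= 1.5                 # nothing new: back off
    return max(MIN_INTERVAL, min(MAX_INTERVAL, interval))

def poll_forever(feed, fetch_new_items):
    """fetch_new_items(feed) -> list of new mentions (hypothetical helper)."""
    interval = MIN_INTERVAL
    while True:
        mentions = fetch_new_items(feed)
        # ... push mentions into the processing pipeline ...
        interval = adjust_interval(interval, len(mentions))
        time.sleep(interval)
```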

Page 13

Data processing

Each mention needs to be processed by multiple modules:
● language detection
● sentiment detection
● location detection
● persistence (database storage)
● ...

(preferably in real-time)

Page 14

Data processing

Page 15

Workers

● each processing step is done by a specialized worker
  ○ input: a (single) tweet
  ○ output: a (processed) tweet
  ○ output is sent to the next worker or written to the database

● a worker is just a Python process (see the sketch below)
  ○ each worker runs in multiple instances
  ○ across multiple machines
  ○ capable of processing multiple tweets in parallel
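
A stripped-down sketch of such a worker loop. The queue objects and the process_one callback are stand-ins; the deck only states the input/output contract, not uberVU's actual worker framework:

```python
import threading

def run_worker(in_queue, out_queue, process_one, threads=4):
    """Generic worker process: read one tweet, process it, pass it on.

    in_queue/out_queue are assumed to expose blocking get() and put()
    over the distributed queues; process_one is this module's logic
    (language detection, sentiment detection, ...).
    """
    def loop():
        while True:
            tweet = in_queue.get()        # blocks until a message arrives
            result = process_one(tweet)   # e.g. tweet + {"lang": "en"}
            out_queue.put(result)         # next worker, or the DB writer

    # "capable of processing multiple tweets in parallel"
    workers = [threading.Thread(target=loop, daemon=True) for _ in range(threads)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
```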

Page 16

Queues

● workers communicate using a system of distributed queues

● each worker has its own input queue

● queues need to be persistent

● multiple independent Kestrel servers

Page 17

Kestrel sharding

● multiple independent servers (Round-robin) - see the sketch below
  ○ start with a random server
  ○ switch to the next server every 100 operations
  ○ or when a server is down

● dequeue
  ○ switch to the next server when the queue is empty

● sharding causes loose ordering :(
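
A sketch of the round-robin sharding client described above. It assumes server objects that expose put/get and raise ConnectionError when down; the deck names the policy, not the API:

```python
import random

OPS_PER_SERVER = 100   # from the slide: rotate every 100 operations

class ShardedQueue:
    """Round-robin over independent Kestrel servers.

    Each server object is assumed to expose put(queue, msg) and
    get(queue) -> message or None, raising ConnectionError when the
    server is down."""

    def __init__(self, servers):
        self.servers = servers
        self.i = random.randrange(len(servers))   # start with a random server
        self.ops = 0

    def _rotate(self):
        self.i = (self.i + 1) % len(self.servers)
        self.ops = 0

    def put(self, queue, message):
        self.ops += 1
        if self.ops >= OPS_PER_SERVER:
            self._rotate()                         # switch every 100 operations
        try:
            self.servers[self.i].put(queue, message)
        except ConnectionError:
            self._rotate()                         # server down: use the next one
            self.servers[self.i].put(queue, message)

    def get(self, queue):
        # Dequeue policy: when the current server's queue is empty,
        # try the next one; give up after a full cycle.
        for _ in range(len(self.servers)):
            try:
                msg = self.servers[self.i].get(queue)
            except ConnectionError:
                msg = None
            if msg is not None:
                return msg
            self._rotate()
        return None
```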

Page 18

Kestrel sharding

Page 19

Fault-tolerance

Guaranteed by queueing system

● Kestrel servers are durable
  ○ persisted to EBS
  ○ resistant to instance failure

● ack messages using Kestrel's primitives (sketched below)
  ○ /open to read
  ○ /close to confirm a processed message
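
A minimal sketch of the /open and /close ack cycle. Kestrel speaks the memcached text protocol (default port 22133); the response parsing below is deliberately simplified for illustration:

```python
import socket

def reliable_pop(host, queue, process, port=22133):
    """Reliably consume one message from Kestrel.

    GET <queue>/open hands out a tentative message; GET <queue>/close
    confirms it. Die before /close and Kestrel re-delivers.
    """
    sock = socket.create_connection((host, port))
    try:
        sock.sendall(("get %s/open\r\n" % queue).encode())
        data = sock.recv(65536)    # "VALUE <q> 0 <len>\r\n<payload>\r\nEND\r\n"
        if not data.startswith(b"VALUE"):
            return None            # "END\r\n": queue was empty
        payload = data.split(b"\r\n")[1]
        process(payload)           # the worker's actual processing step
        sock.sendall(("get %s/close\r\n" % queue).encode())
        sock.recv(65536)           # read the confirmation
        return payload
    finally:
        sock.close()
```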

Page 20

Failure scenarios

● worker failure?
  ○ just a small decrease in processing speed

● Kestrel server failure?
  ○ using multiple servers (Round-robin)
  ○ workload will be handled by the remaining servers
  ○ possible performance impact
  ○ some messages will be temporarily unavailable

Page 21

Scalability

● easily scalable by adding more nodes
  ○ both Kestrel servers & workers

● worker requirements
  ○ stateless
  ○ fail-fast
  ○ boot-fast

● small granularity
  ○ allows you to grow infrastructure costs steadily

Page 22

Contents

Introduction

Infrastructure

Signals

Lessons learned

Questions?

Page 23

Signals

● machine learning turned out to be very slow
  ○ language detection
  ○ sentiment detection
  ○ mention clustering

● most ML boils down to matrix multiplication

● detecting sentiment for one tweet at a time is very inefficient (see the example below)
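
A toy NumPy example of why batching wins. The weight matrix and feature sizes are made up, but the effect (one large matrix multiplication instead of many small ones) is the point the slide is making:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((5000, 3))          # e.g. vocabulary x sentiment classes
tweets = rng.standard_normal((100, 5000))   # 100 tweets as feature vectors

# One tweet at a time: 100 small matrix-vector products.
scores_slow = np.array([tweet @ W for tweet in tweets])

# Batched: a single 100x5000 by 5000x3 multiplication -- one big BLAS
# call instead of 100 small ones. Same numbers, far less overhead.
scores_fast = tweets @ W

assert np.allclose(scores_slow, scores_fast)
```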

Page 24

Signals

Solution?

● batch processing
  ○ e.g.: run the algorithm for 100 tweets at a time

● within the same pipeline

● batch modules "wait" to gather more data

Page 25

Signals

Still real-time?

● yes, via max_wait (sketched below)
  ○ don't wait more than 30 seconds

● overall performance increased
  ○ despite the artificial "wait"
  ○ then we found a lot of other use-cases for this
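
One way the max_wait batching described above might look. That in_queue.get(timeout=...) blocks and returns None on timeout is an assumption of this sketch:

```python
import time

def batch_consume(in_queue, process_batch, batch_size=100, max_wait=30):
    """Gather up to batch_size messages, but never hold the first one
    longer than max_wait seconds (the 30s limit from the slide)."""
    batch, deadline = [], None
    while True:
        timeout = max(0.0, deadline - time.time()) if deadline else None
        msg = in_queue.get(timeout=timeout)   # None means the timeout expired
        if msg is not None:
            if not batch:
                deadline = time.time() + max_wait  # clock starts on first message
            batch.append(msg)
        if batch and (len(batch) >= batch_size or time.time() >= deadline):
            process_batch(batch)              # e.g. one batched model invocation
            batch, deadline = [], None
```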

Page 26

Twitter Influencers

● find the Twitter users that are most influential

● in a given context (topic)
  ○ e.g.: "big data OR #bigdata location:Romania"

● right now
  ○ top influencers change quickly, every few days

Page 27

Twitter Influencers

Build a Twitter graph (sketched below) using:

● each user is a node
  ○ Weight(node) = f(# tweets of user)
  ○ only measure activity in the context of a given topic

● each retweet is a directed edge
  ○ Weight(edge) = g(# retweets)
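
A sketch of the graph bookkeeping. The deck doesn't define f and g, so the log-damping used here is purely illustrative, as is the retweeter-to-author edge direction:

```python
import math
from collections import defaultdict

tweet_counts = defaultdict(int)      # user -> # tweets (within the topic)
retweet_counts = defaultdict(int)    # (retweeter, author) -> # retweets

node_weight = {}                     # user -> Weight(node)
edge_weight = {}                     # directed edge -> Weight(edge)

# f and g are unspecified in the deck; log-damping is a common choice.
def f(n): return math.log1p(n)
def g(n): return math.log1p(n)

def on_tweet(user):
    tweet_counts[user] += 1
    node_weight[user] = f(tweet_counts[user])

def on_retweet(retweeter, author):
    # Assumed direction: the edge points from retweeter to author.
    retweet_counts[(retweeter, author)] += 1
    edge_weight[(retweeter, author)] = g(retweet_counts[(retweeter, author)])
```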

Page 28

Influencers Graph

Page 29

Twitter Influencers

● we're only interested in recent data
  ○ e.g.: the last few days

● because we want to determine current influencers
  ○ not all-time influencers

● how? use Redis to create a "rolling graph"
  ○ Redis > memcache for eviction strategies

Page 30

Twitter Influencers

● each tweet updates a node

● each retweet updates an edge

● use Redis to expire old data

● => only recent data will be stored (one possible scheme is sketched below)
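
One common way to build such a rolling graph in Redis: bucket counters by time and let EXPIRE drop whole buckets. This is an assumed scheme, not necessarily uberVU's:

```python
import time
import redis

r = redis.Redis()
BUCKET = 3600          # one-hour buckets (illustrative)
WINDOW = 3 * 86400     # "last few days", per the slide

def _bucket_key(prefix):
    return "%s:%d" % (prefix, int(time.time()) // BUCKET)

def on_tweet(user):
    key = _bucket_key("nodes")
    r.hincrby(key, user, 1)       # bump this user's tweet count
    r.expire(key, WINDOW)         # Redis evicts the whole bucket later

def on_retweet(retweeter, author):
    key = _bucket_key("edges")
    r.hincrby(key, "%s->%s" % (retweeter, author), 1)
    r.expire(key, WINDOW)

def rolling_node_counts():
    """Union of the live buckets = the rolling window."""
    counts = {}
    for key in r.scan_iter("nodes:*"):
        for user, n in r.hgetall(key).items():
            counts[user] = counts.get(user, 0) + int(n)
    return counts
```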

Page 31

Twitter Influencers

Algorithm:

● similar to Google's PageRank (a plain version is sketched below)

● influencers are computed almost in real-time
  ○ batch processing (hence "almost" real-time)
  ○ continuous computation
  ○ each subgraph is updated every ~30 minutes
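
For reference, a textbook PageRank over the weighted retweet graph. The deck only says "similar to PageRank", so uberVU's actual variant may differ:

```python
def pagerank(nodes, edges, damping=0.85, iterations=20):
    """Plain weighted PageRank.

    nodes: iterable of user ids
    edges: dict (src, dst) -> weight, meaning src retweeted dst
    Dangling nodes simply leak a little rank mass; fine for a sketch.
    """
    nodes = list(nodes)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    out_weight = {n: 0.0 for n in nodes}
    for (src, _dst), w in edges.items():
        out_weight[src] += w

    for _ in range(iterations):
        nxt = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for (src, dst), w in edges.items():
            if out_weight[src] > 0:
                nxt[dst] += damping * rank[src] * w / out_weight[src]
        rank = nxt
    return rank       # sort descending to get the current top influencers
```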

Page 32

Contents

Introduction

Infrastructure

Signals

Lessons learned

Questions?

Page 33

Lessons learned

Monitoring is vital, monitor everything!

● we use Graphite for everything (see the sketch below)
  ○ number of messages
  ○ average processing speed
  ○ usage reports (histograms)
  ○ etc.
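
For illustration, pushing a datapoint with Graphite's plaintext protocol (one "path value timestamp" line to port 2003). The host name is a placeholder:

```python
import socket
import time

def send_metric(path, value, host="graphite.internal", port=2003):
    """Push one datapoint via Graphite's plaintext protocol:
    "<path> <value> <timestamp>\n" to port 2003."""
    line = "%s %f %d\n" % (path, value, int(time.time()))
    with socket.create_connection((host, port)) as sock:
        sock.sendall(line.encode())

# e.g. from a worker, after each batch:
# send_metric("workers.sentiment.processed", 100)
```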

Page 34

Lessons learned

● Assume nothing
  ○ Eventually, everything that can fail, will!

● Plan for everything
  ○ "Failing to plan is planning to fail"

● Failures are usually correlated
  ○ Expect multiple components to fail at the same time.

Page 35

Thank you!

Mihnea Giurgea @ uberVU

Credits to the following colleagues:
● Andrei Vasilescu
● Sonia Stan
● Bogdan Sandulescu

Page 36

Questions?