
Page 1

Big Data at uberVU

Mihnea Giurgea
Lead Developer @ uberVU

Page 2

Contents

Introduction

Infrastructure

Signals

Lessons learned

Questions?

Page 3

What do we do?

● Gather mentions that refer to a specific subject from the web:
  ○ a brand (Coca-Cola)
  ○ an event (Comic-Con), etc.

● Search, aggregate and analyze data to provide statistics and insights to clients

● Everything is within the context of a "stream"

Page 4

What do we do?

● social media monitoring, reporting and engagement

● a reactive market led us to actionable insights

● Signals - a collection of intelligent algorithms
  ○ some machine learning
  ○ mostly just statistics

Page 5

Signals

Page 6

Signals

● Twitter top influencers
  ○ reach out to promote your brand

● spikes & bursts
  ○ be aware of global events for specific annotations

● "asking for help" mentions
  ○ generate leads

● trending stories
  ○ promote and raise engagement

Page 7

BigData?

What is "big data"?

Wikipedia says: "Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications"

Page 8

BigData?

The LHC is one of the biggest data sources:
● produces 15 PB per year (~41 TB per day)

Page 9

What's our data?

● ~70M mentions per day
  ○ tweets
  ○ Facebook (public) posts
  ○ Google+ posts, etc.

● 100s of Amazon instances

● we record ~3TB per month

Page 10

Contents

Introduction

Infrastructure

Signals

Lessons learned

Questions?

Page 11

Technologies

● Amazon Web Services

● MongoDB - lots of use-cases

● Kestrel - fast & low administration

● Redis - fast, in-memory dataset

● DynamoDB - fast, easy to scale, but $$$

Page 12

Data acquisition

● collect data from multiple platforms (20+)
  ○ Twitter, Facebook, Google+, blogs, boards, etc.

● specialized workers for each platform, so that adding a new platform is easy

● in-house refreshing system (sketched below)
  ○ periodically poll each feed
  ○ adjust refresh rate according to activity
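
A minimal sketch of such an adaptive poller. The rate bounds and back-off factors are illustrative assumptions (the deck doesn't give uberVU's actual numbers), and fetch_new_items is a hypothetical helper:

```python
import time

MIN_INTERVAL = 60      # illustrative lower bound: busy feeds, once a minute
MAX_INTERVAL = 3600    # illustrative upper bound: dead feeds, once an hour

def adjust_interval(interval, new_items):
    """Poll busy feeds more often, quiet feeds less often."""
    if new_items > 0:
        interval /= 2                   # activity seen: speed up
    else:
        interval *= 1.5                 # nothing new: back off
    return max(MIN_INTERVAL, min(MAX_INTERVAL, interval))

def poll_forever(feed, fetch_new_items):
    """fetch_new_items(feed) -> list of new mentions (hypothetical helper)."""
    interval = MIN_INTERVAL
    while True:
        mentions = fetch_new_items(feed)
        # ... push mentions into the processing pipeline ...
        interval = adjust_interval(interval, len(mentions))
        time.sleep(interval)
```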

Page 13

Data processing

Each mention needs to be processed by multiple modules:
● language detection
● sentiment detection
● location detection
● persistence (database storage)
● ...

(preferably in real-time)

Page 14

Data processing

Page 15

Workers

● each processing step is done by a specialized worker
  ○ input: a (single) tweet
  ○ output: a (processed) tweet
  ○ output is sent to the next worker or written to the database

● a worker is just a Python process (see the sketch below)
  ○ each worker runs in multiple instances
  ○ across multiple machines
  ○ capable of processing multiple tweets in parallel
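
A stripped-down sketch of such a worker loop. The queue objects and the process_one callback are stand-ins; the deck only states the input/output contract, not uberVU's actual worker framework:

```python
import threading

def run_worker(in_queue, out_queue, process_one, threads=4):
    """Generic worker process: read one tweet, process it, pass it on.

    in_queue/out_queue are assumed to expose blocking get() and put()
    over the distributed queues; process_one is this module's logic
    (language detection, sentiment detection, ...).
    """
    def loop():
        while True:
            tweet = in_queue.get()        # blocks until a message arrives
            result = process_one(tweet)   # e.g. tweet + {"lang": "en"}
            out_queue.put(result)         # next worker, or the DB writer

    # "capable of processing multiple tweets in parallel"
    workers = [threading.Thread(target=loop, daemon=True) for _ in range(threads)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
```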

Page 16

Queues

● workers communicate using a system of distributed queues

● each worker has its own input queue

● queues need to be persistent

● multiple independent Kestrel servers

Page 17

Kestrel sharding

● multiple independent servers (Round-robin) - see the sketch below
  ○ start with a random server
  ○ switch to the next server every 100 operations
  ○ or when a server is down

● dequeue
  ○ switch to the next server when the queue is empty

● sharding causes loose ordering :(
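
A sketch of the round-robin sharding client described above. It assumes server objects that expose put/get and raise ConnectionError when down; the deck names the policy, not the API:

```python
import random

OPS_PER_SERVER = 100   # from the slide: rotate every 100 operations

class ShardedQueue:
    """Round-robin over independent Kestrel servers.

    Each server object is assumed to expose put(queue, msg) and
    get(queue) -> message or None, raising ConnectionError when the
    server is down."""

    def __init__(self, servers):
        self.servers = servers
        self.i = random.randrange(len(servers))   # start with a random server
        self.ops = 0

    def _rotate(self):
        self.i = (self.i + 1) % len(self.servers)
        self.ops = 0

    def put(self, queue, message):
        self.ops += 1
        if self.ops >= OPS_PER_SERVER:
            self._rotate()                         # switch every 100 operations
        try:
            self.servers[self.i].put(queue, message)
        except ConnectionError:
            self._rotate()                         # server down: use the next one
            self.servers[self.i].put(queue, message)

    def get(self, queue):
        # Dequeue policy: when the current server's queue is empty,
        # try the next one; give up after a full cycle.
        for _ in range(len(self.servers)):
            try:
                msg = self.servers[self.i].get(queue)
            except ConnectionError:
                msg = None
            if msg is not None:
                return msg
            self._rotate()
        return None
```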

Page 18

Kestrel sharding

Page 19

Fault-tolerance

Guaranteed by queueing system

● Kestrel servers are durable
  ○ persisted to EBS
  ○ resistant to instance failure

● ack messages using Kestrel's primitives (sketched below)
  ○ /open to read
  ○ /close to confirm a processed message
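
A minimal sketch of the /open and /close ack cycle. Kestrel speaks the memcached text protocol (default port 22133); the response parsing below is deliberately simplified for illustration:

```python
import socket

def reliable_pop(host, queue, process, port=22133):
    """Reliably consume one message from Kestrel.

    GET <queue>/open hands out a tentative message; GET <queue>/close
    confirms it. Die before /close and Kestrel re-delivers.
    """
    sock = socket.create_connection((host, port))
    try:
        sock.sendall(("get %s/open\r\n" % queue).encode())
        data = sock.recv(65536)    # "VALUE <q> 0 <len>\r\n<payload>\r\nEND\r\n"
        if not data.startswith(b"VALUE"):
            return None            # "END\r\n": queue was empty
        payload = data.split(b"\r\n")[1]
        process(payload)           # the worker's actual processing step
        sock.sendall(("get %s/close\r\n" % queue).encode())
        sock.recv(65536)           # read the confirmation
        return payload
    finally:
        sock.close()
```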

Page 20

Failure scenarios

● worker failure?
  ○ just a small decrease in processing speed

● Kestrel server failure?
  ○ using multiple servers (Round-robin)
  ○ workload will be handled by the remaining servers
  ○ possible performance impact
  ○ some messages will be temporarily unavailable

Page 21

Scalability

● easily scalable by adding more nodes
  ○ both Kestrel servers & workers

● worker requirements
  ○ stateless
  ○ fail-fast
  ○ boot-fast

● small granularity
  ○ allows you to grow infrastructure costs steadily

Page 22

Contents

Introduction

Infrastructure

Signals

Lessons learned

Questions?

Page 23

Signals

● machine learning turned out to be very slow
  ○ language detection
  ○ sentiment detection
  ○ mention clustering

● most ML boils down to matrix multiplication

● detecting sentiment for one tweet at a time is very inefficient (see the example below)
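
A toy NumPy example of why batching wins. The weight matrix and feature sizes are made up, but the effect (one large matrix multiplication instead of many small ones) is the point the slide is making:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((5000, 3))          # e.g. vocabulary x sentiment classes
tweets = rng.standard_normal((100, 5000))   # 100 tweets as feature vectors

# One tweet at a time: 100 small matrix-vector products.
scores_slow = np.array([tweet @ W for tweet in tweets])

# Batched: a single 100x5000 by 5000x3 multiplication -- one big BLAS
# call instead of 100 small ones. Same numbers, far less overhead.
scores_fast = tweets @ W

assert np.allclose(scores_slow, scores_fast)
```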

Page 24

Signals

Solution?

● batch processing
  ○ e.g.: run the algorithm for 100 tweets at a time

● within the same pipeline

● batch modules "wait" to gather more data

Page 25

Signals

Still real-time?

● yes, via max_wait (sketched below)
  ○ don't wait more than 30 seconds

● overall performance increased
  ○ despite the artificial "wait"
  ○ then we found a lot of other use-cases for this
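
One way the max_wait batching described above might look. That in_queue.get(timeout=...) blocks and returns None on timeout is an assumption of this sketch:

```python
import time

def batch_consume(in_queue, process_batch, batch_size=100, max_wait=30):
    """Gather up to batch_size messages, but never hold the first one
    longer than max_wait seconds (the 30s limit from the slide)."""
    batch, deadline = [], None
    while True:
        timeout = max(0.0, deadline - time.time()) if deadline else None
        msg = in_queue.get(timeout=timeout)   # None means the timeout expired
        if msg is not None:
            if not batch:
                deadline = time.time() + max_wait  # clock starts on first message
            batch.append(msg)
        if batch and (len(batch) >= batch_size or time.time() >= deadline):
            process_batch(batch)              # e.g. one batched model invocation
            batch, deadline = [], None
```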

Page 26

Twitter Influencers

● find the Twitter users that are most influential

● in a given context (topic)
  ○ e.g.: "big data OR #bigdata location:Romania"

● right now
  ○ top influencers change quickly, every few days

Page 27

Twitter Influencers

Build a Twitter graph (sketched below) using:

● each user is a node
  ○ Weight(node) = f(# tweets of user)
  ○ only measure activity in the context of a given topic

● each retweet is a directed edge
  ○ Weight(edge) = g(# retweets)
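
A sketch of the graph bookkeeping. The deck doesn't define f and g, so the log-damping used here is purely illustrative, as is the retweeter-to-author edge direction:

```python
import math
from collections import defaultdict

tweet_counts = defaultdict(int)      # user -> # tweets (within the topic)
retweet_counts = defaultdict(int)    # (retweeter, author) -> # retweets

node_weight = {}                     # user -> Weight(node)
edge_weight = {}                     # directed edge -> Weight(edge)

# f and g are unspecified in the deck; log-damping is a common choice.
def f(n): return math.log1p(n)
def g(n): return math.log1p(n)

def on_tweet(user):
    tweet_counts[user] += 1
    node_weight[user] = f(tweet_counts[user])

def on_retweet(retweeter, author):
    # Assumed direction: the edge points from retweeter to author.
    retweet_counts[(retweeter, author)] += 1
    edge_weight[(retweeter, author)] = g(retweet_counts[(retweeter, author)])
```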

Page 28

Influencers Graph

Page 29

Twitter Influencers

● we're only interested in recent data
  ○ e.g.: the last few days

● because we want to determine current influencers
  ○ not all-time influencers

● how? use Redis to create a "rolling graph"
  ○ Redis > memcache for eviction strategies

Page 30

Twitter Influencers

● each tweet updates a node

● each retweet updates an edge

● use Redis to expire old data

● => only recent data will be stored (one possible scheme is sketched below)
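
One common way to build such a rolling graph in Redis: bucket counters by time and let EXPIRE drop whole buckets. This is an assumed scheme, not necessarily uberVU's:

```python
import time
import redis

r = redis.Redis()
BUCKET = 3600          # one-hour buckets (illustrative)
WINDOW = 3 * 86400     # "last few days", per the slide

def _bucket_key(prefix):
    return "%s:%d" % (prefix, int(time.time()) // BUCKET)

def on_tweet(user):
    key = _bucket_key("nodes")
    r.hincrby(key, user, 1)       # bump this user's tweet count
    r.expire(key, WINDOW)         # Redis evicts the whole bucket later

def on_retweet(retweeter, author):
    key = _bucket_key("edges")
    r.hincrby(key, "%s->%s" % (retweeter, author), 1)
    r.expire(key, WINDOW)

def rolling_node_counts():
    """Union of the live buckets = the rolling window."""
    counts = {}
    for key in r.scan_iter("nodes:*"):
        for user, n in r.hgetall(key).items():
            counts[user] = counts.get(user, 0) + int(n)
    return counts
```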

Page 31

Twitter Influencers

Algorithm:

● similar to Google's PageRank (a plain version is sketched below)

● influencers are computed almost in real-time
  ○ batch processing (hence "almost" real-time)
  ○ continuous computation
  ○ each subgraph is updated every ~30 minutes
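
For reference, a textbook PageRank over the weighted retweet graph. The deck only says "similar to PageRank", so uberVU's actual variant may differ:

```python
def pagerank(nodes, edges, damping=0.85, iterations=20):
    """Plain weighted PageRank.

    nodes: iterable of user ids
    edges: dict (src, dst) -> weight, meaning src retweeted dst
    Dangling nodes simply leak a little rank mass; fine for a sketch.
    """
    nodes = list(nodes)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    out_weight = {n: 0.0 for n in nodes}
    for (src, _dst), w in edges.items():
        out_weight[src] += w

    for _ in range(iterations):
        nxt = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for (src, dst), w in edges.items():
            if out_weight[src] > 0:
                nxt[dst] += damping * rank[src] * w / out_weight[src]
        rank = nxt
    return rank       # sort descending to get the current top influencers
```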

Page 32

Contents

Introduction

Infrastructure

Signals

Lessons learned

Questions?

Page 33

Lessons learned

Monitoring is vital, monitor everything!

● we use Graphite for everything (see the sketch below)
  ○ number of messages
  ○ average processing speed
  ○ usage reports (histograms)
  ○ etc.
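
For illustration, pushing a datapoint with Graphite's plaintext protocol (one "path value timestamp" line to port 2003). The host name is a placeholder:

```python
import socket
import time

def send_metric(path, value, host="graphite.internal", port=2003):
    """Push one datapoint via Graphite's plaintext protocol:
    "<path> <value> <timestamp>\n" to port 2003."""
    line = "%s %f %d\n" % (path, value, int(time.time()))
    with socket.create_connection((host, port)) as sock:
        sock.sendall(line.encode())

# e.g. from a worker, after each batch:
# send_metric("workers.sentiment.processed", 100)
```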

Page 34

Lessons learned

● Assume nothing
  ○ Eventually, everything that can fail, will!

● Plan for everything
  ○ "Failing to plan is planning to fail"

● Failures are usually correlated
  ○ Expect multiple components to fail at the same time.

Page 35

Thank you!

Mihnea Giurgea @ uberVU

Credits to the following colleagues:
● Andrei Vasilescu
● Sonia Stan
● Bogdan Sandulescu

Page 36

Questions?