Scalable real-time processing techniques
How to almost count
Lars Albertsson, Schibsted


A glance at a few scalable stream processing techniques.


Page 1: Scalable real-time processing techniques

Scalable real-time processing techniques

How to almost count

Lars Albertsson, Schibsted

Page 2: Scalable real-time processing techniques

“We promised to count live...

...but since you can’t do that, we used historical numbers and this cool math to extrapolate.”

?!?

Page 3: Scalable real-time processing techniques

Stream counting is simple

You already have the building blocks

Yet many wait for batch execution

Or go through estimation hoops

Page 4: Scalable real-time processing techniques

Accurate counting

● Straightforward, with some plumbing.
● Heavier than you need.

[Diagram: Servers, Bus, Bucketisers, Aggregator]

Page 5: Scalable real-time processing techniques

Now or later? Exact or rough?

Approximation now >> accurate later

Page 6: Scalable real-time processing techniques

Basic scenarios

● How many distinct items in last x minutes?
● What are the top k items in last x minutes?
● How many Ys in last x minutes?

These base techniques are sufficient for implementing e.g. personalisation and recommendation algorithms.

Page 7: Scalable real-time processing techniques

Cardinality - distinct stream count

● Naive: Set of hashes. X bits per item.

Page 8: Scalable real-time processing techniques

Cardinality - distinct stream count

● Naive: Set of hashes. X bits per item.
● Naive 2: Set approximation with Bloom filter + counter.
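The "Bloom filter + counter" idea can be sketched as follows. This is a minimal illustration, not the talk's implementation: the class name, the bit-array size, the number of hashes and the salted SHA-1 hashing are all assumptions. The counter is incremented only when the filter reports an item as new, so false positives make it undercount slightly and it never overcounts.

```python
import hashlib


class BloomDistinctCounter:
    """Hypothetical helper: counts items the Bloom filter has not seen yet."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)
        self.count = 0

    def _positions(self, item):
        # Derive num_hashes bit positions from salted SHA-1 digests (assumed scheme).
        for seed in range(self.num_hashes):
            digest = hashlib.sha1(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        seen = True
        for pos in self._positions(item):
            byte, mask = pos // 8, 1 << (pos % 8)
            if not self.bits[byte] & mask:
                seen = False          # at least one bit was unset: item is new
                self.bits[byte] |= mask
        if not seen:
            self.count += 1


counter = BloomDistinctCounter()
for user in ["anna", "bo", "anna", "cy", "bo"]:
    counter.add(user)
print(counter.count)
```

Note that the filter plus a single integer does not merge cleanly with another instance, which is one reason the later attempts on this slide are more attractive.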

Page 9: Scalable real-time processing techniques

Counting in context

● Look backward, different time windows, compare.

● Count for a small time quantum, keep history.

● Aggregate old windows.
● Monoid representations are desirable.
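One way to read these bullets: keep one small counter per time quantum and answer "last x minutes" by merging the most recent quanta. The sketch below assumes exact `collections.Counter` buckets, in-order timestamps and a hypothetical `WindowedCounts` class. Counter addition is associative with an empty Counter as identity, which is the monoid property the slide asks for. An approximate structure with a merge operation could take the Counter's place.

```python
from collections import Counter, deque


class WindowedCounts:
    """Hypothetical helper: one Counter per time quantum, merged on demand."""

    def __init__(self, quantum_seconds=60, max_quanta=60):
        self.quantum_seconds = quantum_seconds
        # Each entry is (quantum_index, Counter); old quanta fall off the left.
        self.buckets = deque(maxlen=max_quanta)

    def add(self, key, timestamp):
        # Assumes events arrive in timestamp order.
        index = int(timestamp // self.quantum_seconds)
        if not self.buckets or self.buckets[-1][0] != index:
            self.buckets.append((index, Counter()))
        self.buckets[-1][1][key] += 1

    def window(self, now, quanta):
        """Merge the most recent `quanta` buckets into one Counter."""
        oldest = int(now // self.quantum_seconds) - quanta + 1
        total = Counter()
        for index, counts in self.buckets:
            if index >= oldest:
                total += counts
        return total


w = WindowedCounts()
for key, t in [("a", 0), ("a", 30), ("b", 70), ("a", 130)]:
    w.add(key, t)
print(w.window(now=130, quanta=1))  # only the current minute
print(w.window(now=130, quanta=3))  # the last three minutes merged
```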

Page 10: Scalable real-time processing techniques

Cardinality - distinct stream count

● Naive: Set of hashes. X bits per item.
● Naive 2: Set approximation with Bloom filter + counter.
● Naive 3: Hash to bitmap. Count bits.

Page 11: Scalable real-time processing techniques

Cardinality - distinct stream count

● Naive: Set of hashes. X bits per item.
● Naive 2: Set approximation with Bloom filter + counter.
● Naive 3: Hash to bitmap. Count bits.
● Attempt 4: Hash, bitmap, count + collision compensation. Linear Probabilistic Counter.
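Attempts 3 and 4 can be illustrated together. The sketch below hashes each item to one position in a bitmap; `bits_set()` is the raw bit count of attempt 3, and `estimate()` applies the usual linear counting correction, n ≈ -m · ln(zero bits / m), to compensate for collisions. The class name, the bitmap size and the SHA-1 hashing are assumptions made for the example.

```python
import hashlib
import math


class LinearCounter:
    """Hypothetical linear probabilistic counter over an m-slot bitmap."""

    def __init__(self, num_bits=1 << 14):
        self.num_bits = num_bits
        self.bits = bytearray(num_bits)  # one byte per slot, for simplicity

    def add(self, item):
        digest = hashlib.sha1(str(item).encode()).digest()
        self.bits[int.from_bytes(digest[:8], "big") % self.num_bits] = 1

    def bits_set(self):
        # Attempt 3: undercounts, because colliding items share a slot.
        return sum(self.bits)

    def estimate(self):
        # Attempt 4: compensate for collisions using the fraction of empty slots.
        zeros = self.num_bits - self.bits_set()
        if zeros == 0:
            return float("inf")  # bitmap saturated, no estimate possible
        return -self.num_bits * math.log(zeros / self.num_bits)

    def merge(self, other):
        # Bitwise OR of two bitmaps of equal size counts the union.
        self.bits = bytearray(a | b for a, b in zip(self.bits, other.bits))


lc = LinearCounter()
for i in range(5000):
    lc.add(f"user-{i}")
print(lc.bits_set(), round(lc.estimate()))
```

The bitmap has to be sized in proportion to the expected cardinality, which is the cost that motivates the next step on the slide.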

Page 12: Scalable real-time processing techniques

Cardinality - distinct stream count

● Naive: Set of hashes. X bits per item.
● Naive 2: Set approximation with Bloom filter + counter.
● Naive 3: Hash to bitmap. Count bits.
● Attempt 4: Hash, bitmap, count + collision compensation. Linear Probabilistic Counter.
● Read papers… -> HyperLogLog counter
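For orientation, here is a compact HyperLogLog sketch. It is an illustration under assumptions (64-bit hash taken from SHA-1, precision 12, the standard bias constant for m ≥ 128, small-range correction only), not a tuned or production implementation. Merging takes the element-wise maximum of the registers, so it also fits the monoid requirement from the "Counting in context" slide.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch with 2**precision registers."""

    def __init__(self, precision=12):
        self.p = precision
        self.m = 1 << precision
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(str(item).encode()).digest()
        x = int.from_bytes(digest[:8], "big")      # 64-bit hash value
        index = x >> (64 - self.p)                 # first p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1 bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other):
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)      # bias constant for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small cardinalities: fall back to linear counting on empty registers.
            return self.m * math.log(self.m / zeros)
        return raw


first, second = HyperLogLog(), HyperLogLog()
for i in range(60000):
    first.add(f"user-{i}")
for i in range(40000, 100000):
    second.add(f"user-{i}")
first.merge(second)
print(round(first.estimate()))  # close to 100000 distinct users
```

With 4096 registers the typical relative error is around 1.6 %, using a few kilobytes regardless of how many items are seen.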

Page 13: Scalable real-time processing techniques

Cardinality - distinct stream count

Source: Shakespeare, highscalability.com

Page 14: Scalable real-time processing techniques

Top K counting

U2      65      U2      65      U2      65
Gaga    46      Gaga    46      Gaga    46
Avicii  23      Avicii  23      Avicii  23
Eminem  21      Eminem  21      Eminem  21
Dolly   18      Peps    19      Dolly   20

● Keep k items, assume absentees have lowest value

● Accurate at top, overcounting in bottom
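The rule "keep k items, assume absentees have lowest value" can be sketched in a few lines. The class below is a hypothetical illustration: when the list is full, a newcomer evicts the current minimum and inherits its count plus one. Seeded with the numbers from the slide, it reproduces the three lists shown (Dolly 18, then Peps 19, then Dolly 20).

```python
class TopK:
    """Hypothetical top-k list where absentees are assumed to have the lowest count."""

    def __init__(self, k):
        self.k = k
        self.counts = {}

    def add(self, item):
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.k:
            self.counts[item] = 1
        else:
            # Evict the current minimum; the newcomer inherits its count, plus one.
            victim = min(self.counts, key=self.counts.get)
            self.counts[item] = self.counts.pop(victim) + 1

    def top(self):
        return sorted(self.counts.items(), key=lambda kv: -kv[1])


chart = TopK(5)
chart.counts = {"U2": 65, "Gaga": 46, "Avicii": 23, "Eminem": 21, "Dolly": 18}
chart.add("Peps")    # Dolly (18) is evicted, Peps enters with 19
chart.add("Dolly")   # Peps (19) is evicted, Dolly returns with 20
print(chart.top())
```

The overcount at the bottom is visible here: Dolly was played 19 times but is listed with 20.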

Page 15: Scalable real-time processing techniques

Approx counting - Count-Min Sketch

● Compute n hashes for key.
● Increment once on each row, col by mod(hash)
● Retrieve by min() over rows

   3    7   20    3   11    6  3+1    4    1    1
   3    8    6  2+1   17   13    1    0    4    5
  12    7    6   14    2    0    2    3  6+1    7
   3    2   12  8+1   10    2    7    2   11    2
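The three steps above can be sketched as a small class. The table dimensions and the salted SHA-1 hashing (one salt per row) are assumptions for the example. Estimates never fall below the true count, since collisions can only add to a cell.

```python
import hashlib


class CountMinSketch:
    """Minimal Count-Min Sketch: one hash function per row."""

    def __init__(self, rows=4, cols=1000):
        self.rows, self.cols = rows, cols
        self.table = [[0] * cols for _ in range(rows)]

    def _cells(self, key):
        # One column per row, chosen by a row-salted hash modulo the row width.
        for row in range(self.rows):
            digest = hashlib.sha1(f"{row}:{key}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.cols

    def add(self, key, amount=1):
        for row, col in self._cells(key):
            self.table[row][col] += amount

    def estimate(self, key):
        # The smallest cell is the one least polluted by collisions.
        return min(self.table[row][col] for row, col in self._cells(key))

    def merge(self, other):
        # Cell-wise addition of two sketches with identical shape and hashing.
        for mine, theirs in zip(self.table, other.table):
            for col, value in enumerate(theirs):
                mine[col] += value


cms = CountMinSketch()
for _ in range(65):
    cms.add("U2")
cms.add("Peps", 2)
print(cms.estimate("U2"), cms.estimate("Peps"))
```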

Page 16: Scalable real-time processing techniques

Top K with Count-Min Sketch

U2      65      U2      65      U2      65
Gaga    46      Gaga    46      Gaga    46
Avicii  23      Avicii  23      Avicii  23
Eminem  21      Eminem  21      Eminem  21
Dolly   18      Peps     2      Dolly   19

● Keep Heavy Hitters list.
● Lookup absentees in CMS.
● Risk of overcount is smaller and spread out.
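A possible combination of the two structures is sketched below. Every event goes into the sketch, and the heavy hitters list stores the sketch's estimate instead of an inherited count. The admission policy (a newcomer replaces the weakest entry only when its estimate is higher) is an assumption; the slide does not state one, and its lists show Peps entering with its sketch count of 2. The compact `CountMinSketch` is included only to keep the example self-contained.

```python
import hashlib


class CountMinSketch:
    """Compact Count-Min Sketch, assumed hashing via row-salted SHA-1."""

    def __init__(self, rows=4, cols=2048):
        self.rows, self.cols = rows, cols
        self.table = [[0] * cols for _ in range(rows)]

    def _cells(self, key):
        for row in range(self.rows):
            digest = hashlib.sha1(f"{row}:{key}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.cols

    def add(self, key):
        for row, col in self._cells(key):
            self.table[row][col] += 1

    def estimate(self, key):
        return min(self.table[row][col] for row, col in self._cells(key))


class HeavyHitters:
    """Hypothetical top-k list whose counts come from the sketch."""

    def __init__(self, k, sketch):
        self.k = k
        self.sketch = sketch
        self.heavy = {}

    def add(self, key):
        self.sketch.add(key)
        estimate = self.sketch.estimate(key)   # absentees are looked up here
        if key in self.heavy or len(self.heavy) < self.k:
            self.heavy[key] = estimate
            return
        weakest = min(self.heavy, key=self.heavy.get)
        if estimate > self.heavy[weakest]:     # assumed admission policy
            del self.heavy[weakest]
            self.heavy[key] = estimate


hh = HeavyHitters(5, CountMinSketch())
plays = [("U2", 65), ("Gaga", 46), ("Avicii", 23), ("Eminem", 21),
         ("Dolly", 18), ("Peps", 2), ("Dolly", 1)]
for artist, n in plays:
    for _ in range(n):
        hh.add(artist)
print(sorted(hh.heavy.items(), key=lambda kv: -kv[1]))
```

Unlike the plain top-k list, Peps does not inherit Dolly's 18 plays, and Dolly's count advances by exactly the one new play unless hash collisions inflate the estimate.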

Page 17: Scalable real-time processing techniques

Cubic CMS

● Decorate song with geo, age, etc. Pour into CMS.

● Keep heavy hitters per geo, age group.

*:*:<U2>         +1
SE:*:<U2>        +1
*:31-40:<U2>     +1
SE:31-40:<U2>    +1
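The key decoration shown above can be generated mechanically: for each dimension, emit either the concrete value or the wildcard `*`. The function name and the key layout (`geo:age:<item>`, copied from the slide) are assumptions of this sketch, and a plain `Counter` stands in for the Count-Min Sketch that each key would be poured into.

```python
from collections import Counter
from itertools import product


def cube_keys(dimensions, item):
    """Return every wildcard combination of the dimension values for one item."""
    keys = []
    for choice in product(*[("*", value) for value in dimensions]):
        keys.append(":".join(choice) + f":<{item}>")
    return keys


# Stand-in for the sketch: each play increments all 2**d decorated keys.
counts = Counter()
counts.update(cube_keys(["SE", "31-40"], "U2"))
print(sorted(counts))
```

The number of keys per event doubles with every added dimension, so this approach suits a handful of dimensions.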

Page 18: Scalable real-time processing techniques

Machinery

O(10^4) messages / s per machine.

You probably only need one. If not, use Storm.

Read and write to pub/sub channel, e.g. Kafka or ZeroMQ.

Page 19: Scalable real-time processing techniques

Brute force alternative

Dump every single message into ElasticSearch.

Suitable for high dimensionality cubes.

Page 20: Scalable real-time processing techniques

Recommendations, you said?

● Collaborative filtering - similarity matrix

2 4 1 1 5 2
0 1 7 1 0 6
5 2 9 0 3 0
3 8 0 6 0 7

(Axes: Users, Items)

Page 21: Scalable real-time processing techniques

Shave the matrix

Input matrix (axes: Users, Items):

2 4 1 1 5 2
0 1 7 1 0 6
5 2 9 0 3 0
3 8 0 6 0 7

Cell values, keyed by column,row (row 0 is the bottom row):

0,0  3
0,1  5
0,2  0
0,3  2
1,0  8
...  ...

Flip, Sort:

2,1  9
1,0  8
2,2  7
5,0  7
5,2  6
...  ...

Cut:

2,1  9
1,0  8
2,2  7
5,0  7
5,2  6

Result:

0 0 0 0 0 0
0 0 7 0 0 6
0 0 9 0 0 0
0 8 0 0 0 7

Noise removed - fine for recommendations
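The flip, sort and cut steps can be sketched as one small function. The name `shave` and the `keep` parameter are assumptions; rows are indexed top-down here, unlike the bottom-up coordinates on the slide. Applied to the slide's matrix with `keep=5`, it yields the shaved matrix shown. Ties (the two cells holding 6) are broken by row-major order because Python's sort is stable.

```python
def shave(matrix, keep):
    """Zero out every cell except the `keep` largest values."""
    # Flip: list each cell as (value, row, column).
    cells = [(value, r, c)
             for r, row in enumerate(matrix)
             for c, value in enumerate(row)]
    # Sort: largest values first; equal values keep their row-major order.
    cells.sort(key=lambda cell: cell[0], reverse=True)
    # Cut: write only the top `keep` cells back into an all-zero matrix.
    shaved = [[0] * len(row) for row in matrix]
    for value, r, c in cells[:keep]:
        shaved[r][c] = value
    return shaved


similarity = [
    [2, 4, 1, 1, 5, 2],
    [0, 1, 7, 1, 0, 6],
    [5, 2, 9, 0, 3, 0],
    [3, 8, 0, 6, 0, 7],
]
for row in shave(similarity, keep=5):
    print(row)
```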
