Transcript
Page 1: OWF14 - Big Data Track : Abstract Algebra for Analytics

Abstract Algebra for Analytics

Sam BESSALAH

@samklr


What do we want?

• We want to build scalable systems.

• Preferably by leveraging distributed computing.

• A lot of analytics amounts to counting or adding in some way.


• Example : Finding TopK Elements

Read Input

Sort, Filter and take top K records

Write Output

Input : 11, 12, 0, 3, 56, 48

K = 3

Output : 56, 48, 12

Hadoop Map-Reduce


In Scalding


Problems

• Curse of the last reducer

• Network chatter, which hinders performance

• Inefficient ordering of map and reduce steps

• Multiple jobs, with a sync barrier at the reducer


But in Scalding, « sortWithTake » uses :

Priority Queue

• Can be empty

• Two Priority Queues can be added in any order

• Associative + Commutative

PQ1 : 55, 45, 21, 3

PQ2 : 100, 80, 40, 3

K = 4

PQ1 (+) PQ2 : 100, 80, 55, 45

In a single Pass
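
The priority-queue addition above can be sketched in a few lines of Python. This is only an illustration of the idea, not Scalding's or Algebird's actual code; the names `pq_zero` and `pq_plus` are made up for this example, and a plain list sorted in descending order stands in for the queue.

```python
import heapq

K = 4  # number of elements to keep, as in the slide example

def pq_zero():
    """The empty queue: the identity element of the monoid (hypothetical name)."""
    return []

def pq_plus(a, b, k=K):
    """Add two top-k queues: pool their elements and keep the k largest."""
    return heapq.nlargest(k, a + b)

pq1 = [55, 45, 21, 3]
pq2 = [100, 80, 40, 3]

merged = pq_plus(pq1, pq2)  # [100, 80, 55, 45]
```

Because `pq_plus` is associative and commutative, each mapper can keep its own small queue and the partial queues can be added in whatever order they arrive.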


Why is it better and faster?


Associativity allows parallelism
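
A small sketch of why this holds, assuming a hypothetical helper `regrouped_reduce` that reduces each chunk on its own (as separate workers could) and then combines the partial results. With an associative operation the regrouping does not change the answer; with subtraction, which is not associative, it does.

```python
from functools import reduce
from operator import add, sub

def chunked(xs, n):
    """Split a list into consecutive chunks of at most n elements."""
    return [xs[i:i + n] for i in range(0, len(xs), n)]

def regrouped_reduce(xs, plus, zero, chunk_size):
    """Reduce every chunk independently, then reduce the partial results.

    Each chunk could be handled by a different worker; here they simply
    run one after another to keep the sketch short.
    """
    partials = [reduce(plus, chunk, zero) for chunk in chunked(xs, chunk_size)]
    return reduce(plus, partials, zero)

data = [11, 12, 0, 3, 56, 48]

sequential_sum = reduce(add, data, 0)
regrouped_sum = regrouped_reduce(data, add, 0, chunk_size=2)

# Subtraction is not associative, so regrouping changes the result.
sequential_sub = reduce(sub, data, 0)
regrouped_sub = regrouped_reduce(data, sub, 0, chunk_size=2)
```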


Do we have data structures that are intrinsically parallelizable?


Abstract Algebra Redux

• Semigroup

A set with an associative operation (grouping doesn’t matter)

• Monoid

Semigroup with a zero, i.e. an identity element (zeros get ignored)

• Group

Monoid where every element has an inverse

• Abelian Group

Group whose operation is commutative (ordering doesn’t matter)
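
To make the hierarchy concrete, here is a sketch that spot-checks the laws on a few sample values. The helper names are invented for this example, and checking a handful of samples is evidence, not a proof.

```python
from itertools import product
from operator import add, sub

def is_associative(op, samples):
    """Check (a op b) op c == a op (b op c) on every sample triple."""
    return all(op(op(a, b), c) == op(a, op(b, c))
               for a, b, c in product(samples, repeat=3))

def has_identity(op, zero, samples):
    """Check that zero is ignored on both sides."""
    return all(op(zero, a) == a and op(a, zero) == a for a in samples)

def is_commutative(op, samples):
    """Check a op b == b op a on every sample pair."""
    return all(op(a, b) == op(b, a) for a, b in product(samples, repeat=2))

ints = [-3, 0, 2, 7]
strings = ["", "a", "bc"]

def concat(a, b):
    return a + b

# Integers under + : associative, zero is 0, inverse of a is -a, commutative
#   -> abelian group.
# Strings under concatenation : associative, zero is "", not commutative
#   -> monoid only.
# Integers under max : associative and commutative -> semigroup
#   (it becomes a monoid once -infinity is added as the zero).
# Integers under - : not associative -> not even a semigroup.
```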


Stream mining challenges

• Update predictions after every observation

• Single pass : can’t read old data or replay the stream

• Limited time for computation per observation

• O(n) memory size


Existing solutions

• Knuth’s Reservoir Sampling works on an evolving stream of data, in fixed memory.

• Stream subsampling

• Adaptive sliding windows : build decision trees on these windows, e.g. Hoeffding Trees

• Use time series analysis methods …

• Etc.
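
As an illustration of the first item, a minimal reservoir sampling sketch (the classic "Algorithm R" variant). The function name and the optional `rng` parameter are choices made for this example.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length.

    Memory use is fixed at k items and the stream is read exactly once.
    """
    if rng is None:
        rng = random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # both ends inclusive
            if j < k:
                reservoir[j] = item
    return reservoir
```

Calling `reservoir_sample(range(1000), 10)` returns 10 of the 1000 values; a stream shorter than `k` is returned whole.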


Approximate algorithms for stream analytics


Idea : Hash, don’t Sample


Bloom filters

• Approximate data structure for set membership

• Like an approximate set

BloomFilter.contains(x) => Maybe | NO

P(False Positive) > 0

P(False Negative) = 0


• Bit Array of fixed size

add(x) : for each hash function i, set b[h(x,i)] = 1

contains(x) : TRUE if b[h(x,i)] == 1 for all i.


• Bloom Filters

Adding an element uses a boolean OR

Querying uses a boolean AND

Both are Monoids
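
A minimal Bloom filter sketch showing add (OR), query (AND) and the merge of two filters. It is not Algebird's implementation: the class layout, the default sizes and the use of SHA-256 with a per-function prefix as h(x, i) are all assumptions made for this example.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3, bits=0):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bits  # a Python int used as a fixed-size bit array

    def _positions(self, item):
        """Yield h(item, i) for each of the num_hashes hash functions."""
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        # Adding is a boolean OR into the bit array.
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # Querying is a boolean AND over the selected bits:
        # True means "maybe", False means "definitely not".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def merge(self, other):
        """Union of two filters with the same shape: OR the bit arrays."""
        assert (self.num_bits, self.num_hashes) == (other.num_bits, other.num_hashes)
        return BloomFilter(self.num_bits, self.num_hashes, self.bits | other.bits)
```

Two filters built on different machines can be merged with `a.merge(b)`; the empty filter is the zero of that monoid.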


HyperLogLog


Intuition

• Long runs of trailing 0s in a random bit string are rare

• But the more bit strings you look at, the more likely you are to find a long one

• The longest run of trailing 0-bits seen can be used to estimate the number of unique bit strings observed.
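
The intuition can be sketched as a single, very rough estimator: hash every item, remember the longest run of trailing zeros, and guess 2 to that power. The function names and the choice of SHA-256 as the hash are assumptions for this example, and one estimator on its own has a very high variance.

```python
import hashlib

def hash64(item):
    """Map an item to a pseudo-random 64-bit integer."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def trailing_zeros(n, width=64):
    """Number of 0-bits at the low end of n (width if n is 0)."""
    if n == 0:
        return width
    return (n & -n).bit_length() - 1

def longest_run(items):
    """Longest run of trailing zeros seen over all hashed items."""
    return max((trailing_zeros(hash64(x)) for x in items), default=0)

def rough_cardinality(items):
    """Crude estimate of the number of distinct items: 2 ** longest run."""
    return 2 ** longest_run(items)
```

Duplicates hash to the same bits, so they never change the longest run, which is why this counts distinct items rather than all items.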


HyperLogLog

• Popular sketch for cardinality estimation

HLL.size = Approx[Number]

We know the distribution of the error.


http://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/


• HyperLogLog

Adding an element uses MAX, which is a monoid (an ordered semigroup, really ...)

Querying uses a harmonic sum, which is also a monoid.
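
A simplified HyperLogLog sketch showing both operations: MAX when adding or merging, a harmonic sum when querying. It is an illustration, not Algebird's implementation: the class layout and the SHA-256 hash are assumptions, it uses the position of the first 1-bit from the left (as in the original HyperLogLog paper), and it leaves out the small-range and large-range corrections.

```python
import hashlib

class HyperLogLog:
    def __init__(self, p=10, registers=None):
        self.p = p            # number of hash bits used to pick a register
        self.m = 1 << p       # number of registers
        self.registers = list(registers) if registers is not None else [0] * self.m

    def add(self, item):
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")        # 64-bit hash
        width = 64 - self.p
        idx = h >> width                             # first p bits: register index
        rest = h & ((1 << width) - 1)                # remaining bits
        rank = width - rest.bit_length() + 1         # position of the first 1-bit
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        """Merging two sketches is an element-wise MAX of the registers."""
        assert self.p == other.p
        return HyperLogLog(self.p, [max(a, b) for a, b in zip(self.registers, other.registers)])

    def estimate(self):
        """Raw estimate from the harmonic sum of 2 ** -register."""
        alpha = 0.7213 / (1 + 1.079 / self.m)        # bias constant, valid for m >= 128
        harmonic = sum(2.0 ** -r for r in self.registers)
        return alpha * self.m * self.m / harmonic
```

With `p=10` the sketch holds 1024 small registers, and the expected relative error is about 1.04 / sqrt(1024), roughly 3%.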


Min Hash

• Estimates how similar two sets are : the probability that their minimum hash values agree equals their similarity.

• Essentially amounts to

|A ∩ B| / |A ∪ B|

• Jaccard Similarity
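
A minimal MinHash sketch under a few assumptions: seeded SHA-256 stands in for a family of independent hash functions, the function names are made up for this example, and the input sets must be non-empty.

```python
import hashlib

def seeded_hash(seed, item):
    """One member of a family of hash functions, selected by seed."""
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(items, num_hashes=256):
    """For each hash function, keep the minimum hash over the set."""
    return [min(seeded_hash(seed, x) for x in items) for seed in range(num_hashes)]

def merge_signatures(sig_a, sig_b):
    """Signature of the union: element-wise MIN, an associative operation."""
    return [min(a, b) for a, b in zip(sig_a, sig_b)]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)

def exact_jaccard(a, b):
    """|A ∩ B| / |A ∪ B| computed directly, for comparison."""
    return len(a & b) / len(a | b)
```

With 256 hash functions the estimate for two sets whose true similarity is 1/3 typically lands within a few hundredths of the exact value.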


Count min Sketch

Gives an approximation of the number of occurrences of an element in a stream.


• Count min sketch

Adding an element is a numerical addition

Querying uses a MIN function.

Both are associative.
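
A minimal Count-min sketch under the usual assumptions: a small table of counters, one hash function per row (seeded SHA-256 here, which is a choice made for this example), numerical addition on update and MIN on query. The class layout is illustrative and not Algebird's API.

```python
import hashlib

class CountMinSketch:
    def __init__(self, width=256, depth=4, table=None):
        self.width = width    # counters per row
        self.depth = depth    # number of rows, one hash function each
        self.table = table if table is not None else [[0] * width for _ in range(depth)]

    def _column(self, row, item):
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        # Adding is a numerical addition in one counter per row.
        for row in range(self.depth):
            self.table[row][self._column(row, item)] += count

    def estimate(self, item):
        # Querying takes the MIN over rows; collisions can only inflate a
        # counter, so the estimate never undercounts.
        return min(self.table[row][self._column(row, item)] for row in range(self.depth))

    def merge(self, other):
        """Merging two sketches of the same shape adds them cell by cell."""
        assert (self.width, self.depth) == (other.width, other.depth)
        table = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(self.table, other.table)]
        return CountMinSketch(self.width, self.depth, table)
```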


Anomaly Detection


- Online Summarizer : Approximate data structure to find quantiles in a continuous stream of data.

- Many exist : Q-Tree, Q-Digest, T-Digest

- All of those are associative.

- Another neat thing : they type your data uniformly.


Many more sketches and tricks

• FM Counters, KMV

• Histograms

• Ball Sketches : streaming k-means, clustering

• SGD : fit online machine learning algorithms


Algebird


Conclusion

• Hashed data structures can be resolved to usual data structures like Set, Map, etc., which are easier to reason about as developers

• As data size grows, sampling becomes painful; hashing provides a more cost-effective solution

• Abstract algebra with sketched data is a no-brainer, and guarantees less error and better scalability of analytics systems.

http://speakerdeck.com/samklr


DON’T BE SCARED ANYMORE.


Bibliography

• Great intro to Algebird

http://www.michael-noll.com/blog/2013/12/02/twitter-algebird-monoid-monad-for-large-scala-data-analytics/

• Aggregate Knowledge

http://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/

• Probabilistic data structures for web analytics

http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/

• Algebird

github.com/twitter/algebird

• Algebra for analytics

https://speakerdeck.com/johnynek/algebra-for-analytics

• http://infolab.stanford.edu/~ullman/mmds/ch3.pdf

