OWF14 - Big Data Track : Abstract Algebra for Analytics

DESCRIPTION

Sam BESSALAH

Algebird is an abstract algebra library for Scala, developed at Twitter and released under the ASL 2.0 license. It supports algebraic structures such as semigroups, monoids, groups, rings and fields, as well as standard functional abstractions like monads. More interesting, though, are the probabilistic data structures and accompanying monoids that come out of the box. I'll talk a bit about Algebird in general and how it eases building large-scale analytics systems, whether on MapReduce systems or in a stream processing context.

Abstract Algebra for Analytics

Sam BESSALAH

@samklr

What do we want?

• We want to build scalable systems.

• Preferably by leveraging distributed computing

• A lot of analytics amount to counting or adding in some sort of way.

• Example : Finding TopK Elements

Read Input

Sort, Filter and take top K records

Write Output

Input : 11, 12, 0, 3, 56, 48

K = 3

Output : 56, 48, 12

Hadoop Map-Reduce

In Scalding

Problems

• Curse of the last reducer

• Network chatter, which hinders performance

• Inefficient ordering of map and reduce steps

• Multiple jobs, with a sync barrier at the reducer

But in Scalding, « sortWithTake » uses :

Priority Queue

• Can be empty

• Two Priority Queues can be added in any order

• Associative + Commutative

PQ1 : 55, 45, 21, 3

PQ2 : 100, 80, 40, 3

K = 4

PQ1 (+) PQ2 : 100, 80, 55, 45

In a single Pass
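
The priority-queue addition above can be sketched in a few lines of Python. This is only an illustration of the idea, not Scalding's actual « sortWithTake » implementation; the names pq_zero and pq_plus are made up for the example.

```python
import heapq
from functools import reduce

K = 4  # queue capacity, as in the slide's example


def pq_zero():
    """The empty queue: adding it to any queue changes nothing."""
    return []


def pq_plus(a, b, k=K):
    """Add two bounded queues: keep the k largest items of both, descending."""
    return heapq.nlargest(k, a + b)


pq1 = [55, 45, 21, 3]
pq2 = [100, 80, 40, 3]
merged = pq_plus(pq1, pq2)  # [100, 80, 55, 45]

# Each mapper can build a size-k queue from its own split; the partial
# queues can then be added in any grouping and any order.
splits = [[11, 12, 0], [3, 56], [48]]
partials = [pq_plus(pq_zero(), s, k=3) for s in splits]
top3 = reduce(lambda a, b: pq_plus(a, b, k=3), partials, pq_zero())  # [56, 48, 12]
```

Because every partial queue holds at most K items, only small summaries travel between mappers and reducers instead of fully sorted data.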

Why is it better and faster?

Associativity allows parallelism

Do we have data structures that are intrinsically parallelizable?

Abstract Algebra Redux

• Semigroup

A set with an associative operation (grouping doesn’t matter)

• Monoid

Semigroup with a zero, i.e. an identity element (zeros get ignored)

• Group

Monoid in which every element has an inverse

• Abelian Group

Group whose operation is commutative (ordering doesn’t matter)
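
As a rough sketch of why these definitions matter for parallelism, here is a tiny Monoid type in Python. The class and the chunked_sum helper are hypothetical names for this example and are not Algebird's API; the point is only that an associative operation with a zero can be applied to chunks independently and the partial results combined afterwards.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Any, Callable


@dataclass(frozen=True)
class Monoid:
    zero: Any                        # identity element
    plus: Callable[[Any, Any], Any]  # associative binary operation

    def sum(self, items):
        return reduce(self.plus, items, self.zero)


int_add = Monoid(0, lambda a, b: a + b)
int_max = Monoid(float("-inf"), max)
set_union = Monoid(frozenset(), lambda a, b: a | b)


def chunked_sum(monoid, items, chunk_size):
    """Sum each chunk on its own (as parallel workers would), then sum the partials."""
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
    partials = [monoid.sum(chunk) for chunk in chunks]
    return monoid.sum(partials)


data = [11, 12, 0, 3, 56, 48]
total = chunked_sum(int_add, data, chunk_size=2)    # same as int_add.sum(data)
largest = chunked_sum(int_max, data, chunk_size=4)  # same as int_max.sum(data)
```

Whatever the chunk size, the result matches the sequential fold, which is what lets a framework split the work freely.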

Stream mining challenges

• Update predictions after every observation

• Single pass : can’t read old data or replay the stream

• Limited time for computation per observation

• Limited memory : storing all n observations would take O(n) space

Existing solutions

• Knuth’s Reservoir Sampling works on an evolving stream of data, in fixed memory.

• Stream subsampling

• Adaptive sliding windows : build decision trees on these windows, e.g. Hoeffding Trees

• Use time series analysis methods …

• Etc
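
To make the first item concrete, here is a minimal sketch of reservoir sampling in Python, using the classic single-pass scheme often called Algorithm R. The function name and the seeded generator are choices made for this example only.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)   # fill the reservoir first
        else:
            j = rng.randrange(n)     # uniform in [0, n)
            if j < k:
                reservoir[j] = item  # replace a slot with probability k / n
    return reservoir


# One pass over the stream, memory bounded by k.
sample = reservoir_sample(iter(range(100_000)), k=10, rng=random.Random(42))
```

The stream is read once and never replayed, and memory stays at k items no matter how long the stream runs.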

Approximate algorithms for stream analytics

Idea : Hash, don’t Sample

Bloom filters

• Approximate data structure for set membership

• Like an approximate set

BloomFilter.contains(x) => Maybe | NO

P(False Positive) > 0

P(False Negative) = 0

• Bit Array of fixed size

add(x) : for each hash function i, set b[h(x,i)] = 1

contains(x) : TRUE if b[h(x,i)] == 1 for all i.

• Bloom Filters

Adding an element uses a boolean OR

Querying uses a boolean AND

Both are Monoids
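
A small Python sketch of the structure described above, assuming salted SHA-256 as the family of hash functions h(x, i); that choice and the class layout are illustrative, not Algebird's implementation.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3, bits=0):
        self.m = num_bits
        self.k = num_hashes
        self.bits = bits  # bit array stored as one Python int

    def _positions(self, item):
        # h(x, i): salt the item with the hash index i (an illustrative choice).
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos  # b[h(x, i)] = 1

    def contains(self, item):
        # "Maybe" only if every probed bit is set; otherwise a definite "no".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def merge(self, other):
        # Monoid plus: bitwise OR; the all-zero filter is the identity.
        assert (self.m, self.k) == (other.m, other.k)
        return BloomFilter(self.m, self.k, self.bits | other.bits)


left, right = BloomFilter(), BloomFilter()
for word in ["scala", "hadoop"]:
    left.add(word)
for word in ["storm", "spark"]:
    right.add(word)
union = left.merge(right)  # filter for all four words, built from two partial filters
```

Filters built on separate partitions can be OR-ed together in any order, so the structure fits a map-side aggregation naturally.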

HyperLogLog

Intuition

• Long runs of trailing 0s in a random bit string are rare

• But the more bit strings you look at, the more likely you are to find a long one

• The longest run of trailing 0-bits seen can be an estimator of the number of unique bit strings observed.

HyperLogLog

• Popular sketch for cardinality estimation

HLL.size = Approx[Number]

We know the distribution of the error.

http://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/

• HyperLogLog

Adding an element uses MAX, which is a monoid (Ordered Semi Group really ...)

Querying uses a harmonic sum : also a Monoid.
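
The following Python sketch shows the MAX-based update and the harmonic-sum query on a simplified HyperLogLog. It uses the standard bias constant and small-range correction from the original algorithm, with SHA-256 as an assumed hash function; the class layout is for illustration and is not Algebird's API.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p=10, registers=None):
        self.p = p                 # 2**p registers
        self.m = 1 << p
        self.registers = registers or [0] * self.m

    def add(self, item):
        digest = hashlib.sha256(str(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big")  # 64-bit hash
        index = h & (self.m - 1)               # low p bits pick a register
        rest = h >> self.p                     # remaining 64 - p bits
        # rank = number of trailing zero bits + 1 (capped when rest == 0)
        rank = (rest & -rest).bit_length() if rest else 64 - self.p + 1
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other):
        # Monoid plus: element-wise MAX of the registers.
        assert self.p == other.p
        merged = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return HyperLogLog(self.p, merged)

    def size(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant, valid for m >= 128
        harmonic = sum(2.0 ** -r for r in self.registers)
        estimate = alpha * self.m * self.m / harmonic
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            estimate = self.m * math.log(self.m / zeros)  # small-range correction
        return estimate


a, b = HyperLogLog(), HyperLogLog()
for i in range(6000):
    a.add(f"user-{i}")
for i in range(4000, 10000):
    b.add(f"user-{i}")
union = a.merge(b)  # sketch of the 10,000 distinct users seen by either side
```

Merging two sketches gives exactly the registers that a single sketch over the combined stream would hold, so overlapping users are not double counted.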

Min Hash

• Estimates how similar two sets are : the probability that their minimum hash values agree equals their similarity.

• Essentially amounts to

|A ∩ B| / |A ∪ B|

• Jaccard Similarity
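
A minimal Python sketch of the idea, assuming salted SHA-256 as the hash family and 256 hash functions (both arbitrary choices for the example): each set is reduced to a signature of per-hash minima, and the fraction of positions where two signatures agree estimates the Jaccard similarity.

```python
import hashlib


def _hash(seed, item):
    digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def signature(items, num_hashes=256):
    """One minimum per hash function."""
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]


def merge(sig_a, sig_b):
    """Signature of the union: element-wise MIN (associative and commutative)."""
    return [min(a, b) for a, b in zip(sig_a, sig_b)]


def similarity(sig_a, sig_b):
    """Fraction of agreeing positions, an estimate of the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


A = set(range(0, 300))
B = set(range(100, 400))  # true Jaccard similarity: 200 / 400 = 0.5
sig_a, sig_b = signature(A), signature(B)
estimate = similarity(sig_a, sig_b)
```

Since signatures combine with MIN, the signature of a union can be computed from partial signatures without revisiting the original sets.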

Count min Sketch

Gives an approximation of the number of occurrences of an element in a stream.

• Count min sketch

Adding an element is a numerical addition

Querying uses a MIN function.

Both are associative.
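
Here is a compact Python sketch of a count-min sketch with the two operations named above, plus a merge. The table dimensions and the salted SHA-256 hashing are assumptions made for the example, not Algebird's implementation.

```python
import hashlib


class CountMinSketch:
    def __init__(self, depth=4, width=256, table=None):
        self.depth, self.width = depth, width
        self.table = table or [[0] * width for _ in range(depth)]

    def _column(self, row, item):
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._column(row, item)] += count  # numerical addition

    def estimate(self, item):
        # Collisions can only inflate a cell, so the minimum is the tightest bound.
        return min(self.table[row][self._column(row, item)]
                   for row in range(self.depth))

    def merge(self, other):
        # Monoid plus: cell-wise addition; the all-zero table is the identity.
        assert (self.depth, self.width) == (other.depth, other.width)
        table = [[a + b for a, b in zip(r1, r2)]
                 for r1, r2 in zip(self.table, other.table)]
        return CountMinSketch(self.depth, self.width, table)


morning, evening = CountMinSketch(), CountMinSketch()
for tag in ["#scala"] * 5 + ["#hadoop"] * 2:
    morning.add(tag)
for tag in ["#scala"] * 3 + ["#storm"] * 4:
    evening.add(tag)
day = morning.merge(evening)
scala_count = day.estimate("#scala")  # never below the true count of 8
```

The estimate can overcount when items collide in every row, but it never undercounts, and sketches from separate partitions add up cell by cell.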

Anomaly Detection

- Online Summarizer : Approximate data structure to find quantiles in a continuous stream of data.

- Many exist : Q-Tree, Q-Digest, T-Digest

- All of those are associative.

- Another neat thing : types your data uniformly.

Many more sketches and tricks

• FM Counters, KMV

• Histograms

• Ball Sketches : streaming k-means, clustering

• SGD : fit online machine learning algorithms

Algebird

Conclusion

• Hashed data structures can be resolved to usual data structures like Set, Map, etc., which are easier for developers to reason about

• As data size grows, sampling becomes painful; hashing provides a more cost-effective solution

• Abstract algebra with sketched data is a no-brainer, and guarantees less error and better scalability of analytics systems.

http://speakerdeck.com/samklr

DON’T BE SCARED ANYMORE.

Bibliography

• Great intro into Algebird

http://www.michael-noll.com/blog/2013/12/02/twitter-algebird-monoid-monad-for-large-scala-data-analytics/

• Aggregate Knowledge http://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/

• Probabilistic data structures for web analytics.

http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/

Algebird : github.com/twitter/algebird

Algebra for analytics https://speakerdeck.com/johnynek/algebra-for-analytics

http://infolab.stanford.edu/~ullman/mmds/ch3.pdf
