PAGE 1
www.exensa.com
Approximate counting for NLP: Count-Min Tree Sketch
Guillaume Pitel, Geoffroy Fouquier, Emmanuel Marchand, Abdul Mouhamadsultane
Presenter: Guillaume Pitel, June 9, 2016
PAGE 2
A bit of context: Why do we need to count?
Data analysis platform: eXenGine. It processes different kinds of data (mostly text).
We need to create relevant cross-features; to do that, we need to count the occurrences of all possible cross-features. For text data, a particular kind of cross-feature is known as n-grams.
There are many different measures for deciding whether an n-gram is interesting. All of them require counting the occurrences of both the cross-features and the features themselves (i.e. counting bigrams and the words inside bigrams).
Counting exactly is easy and distributable, but very slow because of memory usage: keeping the whole count data structure in memory is impossible, so one has to resort to huge map/reduce jobs with joins.
PAGE 3
A bit of context: What kind of data are we talking about?
Google N-grams:
  tokens                  1,024 billion
  sentences               95 billion
  1-grams (count > 200)   14 million
  2-grams (count > 40)    314 million
  3-grams                 977 million
  4-grams                 1.3 billion
  5-grams                 1.2 billion
PAGE 4
A bit of context: What kind of data are we talking about?
Zipfian distribution
[Le Quan et al. 2003]
PAGE 5
A bit of context: What kind of measures are we talking about?
PMI, TF-IDF, LLR
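As an illustration of why these measures only need counts (toy numbers, not from the talk), PMI of a bigram can be computed from the bigram count, the two word counts, and the corpus size:

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information of a bigram (x, y).

    PMI = log2( p(x,y) / (p(x) * p(y)) ), estimated from raw counts.
    """
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))

# A bigram seen 1,000 times in a 1M-token corpus (illustrative values):
print(pmi(1000, 2000, 1500, 1_000_000))
```

Note that only the counts enter the formula, and they enter through a logarithm, which is what the next slides exploit.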
PAGE 6
A bit of context: Summary / Goals
• Many counts: we need to store a large number of counts.
• Logarithms in the measures: we care about the order of magnitude.
• Fast and memory-controlled: we don't want distributed memory for the counts.
• Zipfian counts: many very small counts that will be filtered out later.
PAGE 7
A bit of context: Summary / Goals
We can use probabilistic structures
PAGE 8
Count-Min Sketch: A probabilistic data structure to store counts [Cormode & Muthukrishnan 2005]
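A minimal sketch of the classic Count-Min Sketch (plain Python, names ours, not the eXenGine implementation): a small 2D table of counters, one hashed cell per row per item, with the minimum over rows as the estimate.

```python
import random

class CountMinSketch:
    """Minimal Count-Min Sketch: depth rows of width counters."""

    def __init__(self, width, depth, seed=42):
        self.width = width
        self.depth = depth
        rnd = random.Random(seed)
        self.seeds = [rnd.getrandbits(32) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One cell per row, chosen by a seeded hash of the item.
        for row, s in enumerate(self.seeds):
            yield row, hash((s, item)) % self.width

    def update(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def query(self, item):
        # Collisions can only inflate cells, so the minimum over the
        # rows is the best (never-underestimating) estimate.
        return min(self.table[row][col] for row, col in self._cells(item))
```

The key property: estimates never underestimate the true count, and the overestimation shrinks as width and depth grow.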
PAGE 9
Count-Min Sketch: A probabilistic data structure to store counts
Conservative update: improve the CMS by updating only the cells currently holding the minimum value.
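A minimal sketch of conservative update on a raw count-min table (plain Python lists; the helper names are ours): instead of incrementing every hashed cell, only the cells at the current minimum estimate are raised, since cells already above it were inflated by collisions.

```python
def cu_update(table, width, seeds, item, count=1):
    """Conservative update: raise only the cells at the current minimum."""
    cells = [(row, hash((s, item)) % width) for row, s in enumerate(seeds)]
    new_est = min(table[r][c] for r, c in cells) + count
    for r, c in cells:
        if table[r][c] < new_est:
            table[r][c] = new_est

def cu_query(table, width, seeds, item):
    """Standard count-min read: minimum over the item's cells."""
    cells = [(row, hash((s, item)) % width) for row, s in enumerate(seeds)]
    return min(table[r][c] for r, c in cells)

# Toy table: 4 rows of 256 counters.
width, seeds = 256, [1, 2, 3, 4]
table = [[0] * width for _ in seeds]
for _ in range(10):
    cu_update(table, width, seeds, "bigram")
print(cu_query(table, width, seeds, "bigram"))  # 10 (exact: no collisions here)
```

With a single item there are no collisions, so the estimate is exact; under collisions, conservative update strictly reduces overestimation compared with the plain update.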
PAGE 10
Count-Min Log Sketch: A probabilistic data structure to store logarithmic counts
[Pitel & Fouquier, 2015]: the same idea as [Talbot, 2009], applied inside a Count-Min Sketch.
Instead of regular 32-bit counters, we use 8- or 16-bit "Morris" counters that count logarithmically.
Since the counts end up inside logarithms anyway, the error on PMI/TF-IDF/... is almost the same, but we can fit more counters in the same space.
However, a count of 1 still uses the same amount of memory as a count of 10,000. Also, at some point the error stops improving with space (there is an inherent residual error).
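A minimal sketch of a Morris-style logarithmic counter (the base value and function names are our illustrative choices, not the paper's exact parameters): the counter is bumped with probability base^(-c), so a small integer represents an exponentially large count.

```python
import random

def morris_increment(c, base=1.08, rng=random.random):
    """Probabilistically bump a log-scale ("Morris") counter.

    Counter value c represents about (base**c - 1) / (base - 1) events,
    so an increment is applied with probability base**(-c).
    """
    if rng() < base ** (-c):
        c += 1
    return c

def morris_estimate(c, base=1.08):
    """Point estimate of the number of events counted so far."""
    return (base ** c - 1) / (base - 1)

# Count 10,000 events with a single small counter (seeded for repeatability):
rng = random.Random(0).random
c = 0
for _ in range(10_000):
    c = morris_increment(c, rng=rng)
```

A smaller base gives finer granularity (lower relative error) at the cost of a counter that saturates sooner; this is the tradeoff behind choosing 8-bit vs 16-bit counters.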
PAGE 11
Count-Min Tree Sketch: A Count-Min Sketch with shared counters
Idea: use a hierarchical storage where the most significant bits are shared between counters.
Somewhat similar to TOMB counters [Van Durme, 2009], except that overflow is managed very differently.
PAGE 12
Tree Shared Counters: Sharing most significant bits
Or: how can we store counts with an average approaching 4 bits / counter?
An 8-counter structure:
• A tree is made of three kinds of storage: counting bits, barrier bits, and a spire (not required, except for performance).
• Several layers alternate counting and barrier bits.
• Here we have a <[(8,8),(4,4),(2,2),(1,1)],4> counter.
[Diagram: an 8-counter tree structure; the base layer holds the counting bits, barrier bits link the layers, and the spire sits on top.]
PAGE 13
Tree Shared Counters: Sharing most significant bits
Or: how can we store counts with an average approaching 4 bits / counter?
An 8-counter structure:
• 8 counters in 30 bits + a spire.
• Without a spire, n bits can count up to …
• Many small shared counters with spires are more efficient than one large shared counter.
[Diagram: the same 8-counter tree structure (counting bits, barrier bits, spire).]
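The "8 counters in 30 bits + spire" bit budget can be checked in a couple of lines (a small sanity check we added, not part of the original deck):

```python
# Bit budget of the <[(8,8),(4,4),(2,2),(1,1)], 4> structure shown above:
layers = [(8, 8), (4, 4), (2, 2), (1, 1)]  # (counting bits, barrier bits) per layer
spire_bits = 4
total_bits = sum(c + b for c, b in layers) + spire_bits  # 30 + 4
shared_counters = layers[0][0]                           # 8 counters share one tree
print(total_bits, total_bits / shared_counters)          # 34 bits, 4.25 bits/counter
```

So 8 logical counters occupy 34 bits in total, i.e. 4.25 bits per counter on average, which matches the "approaching 4 bits / counter" claim.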
PAGE 14
Tree Shared Counters: Reading values
• A counter stops at the first ZERO barrier.
• When two barrier paths meet, there is a conflict.
• The barrier length (b) is evaluated in unary.
• The counting bits (c) are evaluated in a more classical (binary) way.
[Diagram: reading two counters from the tree; one decodes as b=2/c=110, the other as b=4/c=01011001, with a conflict between counters 4 and 7.]
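A toy reader illustrating the traversal rule from the bullets above (climb while the barrier bit is set, stop at the first zero barrier, collect counting bits along the way). The bit ordering and the absence of conflict handling are our simplifying assumptions; this shows the traversal, not the paper's exact encoding.

```python
def read_counter(count_bits, barrier_bits, spire, idx):
    """Toy reader for one logical counter of a tree shared counter.

    count_bits[i] / barrier_bits[i] are the bit lists of layer i (here
    8, 4, 2, 1 positions wide).  We climb as long as the barrier bit is
    set; the spire is reached only when every barrier on the path is
    set.  Returns (b, c) in the b=... / c=... notation of the diagram.
    NOTE: bit ordering and conflict handling are simplified assumptions.
    """
    b, c, pos = 0, [], idx
    for layer in range(len(count_bits)):
        c.append(count_bits[layer][pos])
        if not barrier_bits[layer][pos]:
            return b, c
        b += 1
        pos //= 2                    # two children share one parent position
    return b, c + spire              # all barriers set: spire tops off the count

# Counter 0 with two barriers crossed reads as b=2 / c=[1, 1, 0]:
count_bits   = [[1, 0, 0, 0, 0, 0, 0, 0], [1, 0, 0, 0], [0, 0], [0]]
barrier_bits = [[1, 0, 0, 0, 0, 0, 0, 0], [1, 0, 0, 0], [0, 0], [0]]
spire = [0, 0, 0, 0]
print(read_counter(count_bits, barrier_bits, spire, 0))  # (2, [1, 1, 0])
```

A counter whose base barrier is zero stops immediately and yields only its own base counting bit, which is how most of the many small Zipfian counts stay at 1 bit each.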
PAGE 15
Tree Shared Counters: Incrementing (counter 5)
[Diagram: increment steps 0, 1, 2 for counter 5.]
PAGE 16
Tree Shared Counters: Incrementing (counter 5)
[Diagram: increment steps 3, 4, 5 for counter 5.]
PAGE 17
Tree Shared Counters: Incrementing (counter 5)
[Diagram: the final increment step for counter 5, annotated with the value a bit is worth at each level (the numbers are garbled in extraction).]
PAGE 18
Count-Min Tree Sketches: Experiments
Results!
• 140M tokens from English Wikipedia*
• 14.7M words (unigrams + bigrams)
• Reference counts stored in an UnorderedMap: 815 MiB
Perfect storage size: suppose we had a perfect hash function and stored the counts in 32-bit counters. For 14.7M words, that amounts to 59 MiB.
Performance: our implementation of a CMTS using <[(128,128),(64,64)…],32> counters is on par with the native UnorderedMap.
We use 3-layer sketches (a good performance/precision tradeoff).
* We preferred to test our counters with a large number of parameter settings rather than with a large corpus, so we limited ourselves to 5% of Wikipedia.
PAGE 19
Count-Min Tree Sketches: Average Relative Error
[Results plot]
PAGE 20
Count-Min Tree Sketches: RMSE
[Results plot]
PAGE 21
Count-Min Tree Sketches: RMSE on PMI
[Results plot]
PAGE 22
Count-Min Tree Sketch: Question: are CMTS really useful in real life?
1. CMTS are better on the whole vocabulary, but what happens if we skip the least frequent words / bigrams?
2. CMTS are better on average, but what happens quantile by quantile?
PAGE 23
Count-Min Tree Sketches: PMI error per quantile (sketches at 50% of perfect size; evaluation limited to f > 10^-7)
[Results plot]
PAGE 24
Count-Min Tree Sketches: Relative error per log2-quantile (sketches at 50% of perfect size; evaluation limited to f > 10^-7)
[Results plot]
PAGE 25
Conclusion: Where are we?
CMTS significantly outperforms the other methods for storing and updating Zipfian counts.
Because most of the time spent in sketch accesses is due to memory access, its speed is on par with the other methods.
• Main drawback: at very high (and impractical anyway) pressure (less than 10% of the perfect storage size), the error skyrockets.
• Other drawback: the implementation is not straightforward; we have devised at least 4 different ways to increment the counters.
Merging (and thus distributing) is easy once you can read and set a counter.
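As a sketch of why merging is easy (plain Python; this shows the addition-based merge for an ordinary count-min table, and the CMTS read/add/set variant described above follows the same pattern):

```python
def merge_tables(table_a, table_b):
    """Cell-wise merge of two sketches with identical dimensions.

    For a plain count-min table this is an exact merge (cell counts
    simply add up); for a CMTS, the same effect is obtained by reading
    each logical counter from both trees, adding the decoded values,
    and setting the result in the destination tree.
    """
    assert len(table_a) == len(table_b)
    return [[a + b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(table_a, table_b)]

print(merge_tables([[1, 2], [3, 4]], [[10, 0], [0, 1]]))  # [[11, 2], [3, 5]]
```

Because the merge is associative and commutative, sketches built on different shards can be combined in any order, which is what makes distributed counting straightforward.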
PAGE 26
Conclusion: Where are we going?
Dynamic sizing: we are working on a CMTS version that can grow automatically (more layers added below).
Pressure control: when we detect that the pressure becomes too high, we can divide the counts and subsample to stop collisions from cascading.
An open-source Python package is on its way.