Streaming Data Mining
PRESENTED BY Edo Liberty⎪ April 11, 2014
Copyright © 2014 Yahoo! All rights reserved. No reproduction or distribution allowed without express written permission.
Parts of this presentation were given with Jelani Nelson (Harvard) as a KDD tutorial on streaming data mining.
2 Yahoo Confidential & Proprietary
Single machine data mining
[Diagram: The World, Data, Computation, Result]
Distributed storage
[Diagram: The World, four Data stores, Computation, Result]
Distributed model (map/reduce, message passing, …)
[Diagram: The World, eight Data + Compute nodes, Computation, Result]
Distributed model (indexes, tables, databases, …)
[Diagram: The World, eight Data + Compute nodes, Computation, Query, Result]
207 big-data infographics (meta infographic)
The streaming model
[Diagram: The World, Computation, Sketch, Query, Algorithm, Result]
The parallel streaming model
[Diagram: The World, four Compute + Sketch nodes, Aggregate + Sketch, Query, Algorithm, Result]
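One way to read the parallel streaming picture: each worker folds its share of the stream into a small sketch, and an aggregator merges the sketches before answering queries. The code below is a minimal illustration, assuming a toy count/min/max summary. The names `RangeSketch` and `sketch_shard` and the four example shards are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Iterable, Optional


@dataclass
class RangeSketch:
    """Toy mergeable summary (hypothetical): item count plus min and max."""
    count: int = 0
    low: Optional[int] = None
    high: Optional[int] = None

    def update(self, item: int) -> None:
        # Fold one stream item into the summary.
        self.count += 1
        self.low = item if self.low is None else min(self.low, item)
        self.high = item if self.high is None else max(self.high, item)

    def merge(self, other: "RangeSketch") -> "RangeSketch":
        # Combine two summaries without revisiting the underlying items.
        merged = RangeSketch(self.count + other.count)
        lows = [v for v in (self.low, other.low) if v is not None]
        highs = [v for v in (self.high, other.high) if v is not None]
        merged.low = min(lows) if lows else None
        merged.high = max(highs) if highs else None
        return merged


def sketch_shard(shard: Iterable[int]) -> RangeSketch:
    """The 'Compute + Sketch' step: one pass over a single shard."""
    sketch = RangeSketch()
    for item in shard:
        sketch.update(item)
    return sketch


# Hypothetical split of the example stream 1 7 8 1 0 1 7 7 across four workers.
shards = [[1, 7], [8, 1], [0, 1], [7, 7]]
partials = [sketch_shard(shard) for shard in shards]

# The 'Aggregate + Sketch' step: merge the partial sketches.
merged = reduce(RangeSketch.merge, partials, RangeSketch())
print(merged.count, merged.low, merged.high)
```

The important property is that `merge` needs only the sketches, never the raw items, so the aggregator's cost depends on the number of workers rather than on the stream length.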
The streaming model (more accurately)
[Diagram: item stream 1 7 8 1 0 1 7 7, Iterator, Computation, Sketch, Result]
O(n) items
O(polylog(n)) space
O(polylog(n)) computation per item
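As a rough illustration of these constraints, the code below reads the items once through an iterator, keeps a summary whose size does not grow with the number of items, and does constant work per item. The running-mean summary and the names `MeanSketch` and `item_stream` are assumptions chosen for brevity, not something the talk specifies.

```python
from typing import Iterator


class MeanSketch:
    """Fixed-size summary (hypothetical): a count and a running sum."""

    def __init__(self) -> None:
        self.count = 0
        self.total = 0

    def update(self, item: int) -> None:
        # O(1) work per item; the item itself is not stored.
        self.count += 1
        self.total += item

    def query(self) -> float:
        # Answer from the summary alone, without touching the stream again.
        return self.total / self.count if self.count else 0.0


def item_stream() -> Iterator[int]:
    """Stand-in for the iterator; yields the example stream one item at a time."""
    yield from (1, 7, 8, 1, 0, 1, 7, 7)


sketch = MeanSketch()
for item in item_stream():  # single pass, items seen in arrival order
    sketch.update(item)

print(sketch.count, sketch.query())
```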
Communication complexity
[Diagram: item stream 1 7 8 1 0 1 7 7, two Iterators, Sketch, Result]
Frequent items
Misra, Gries. Finding repeated elements, 1982.
Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet streams with limited space, 2002.
Karp, Shenker, Papadimitriou. A simple algorithm for finding frequent elements in streams and bags, 2003.
The name "Lossy Counting" was used for a different algorithm by Manku and Motwani, 2002.
Metwally, Agrawal, Abbadi. Efficient Computation of Frequent and Top-k Elements in Data Streams, 2006.
[Figure sequence: a stream of $n$ items over $d$ distinct item types, where $f(x)$ is the exact count of item $x$ (for example $f(x) = 5$). The animation steps reduce the stored counts, governed by the parameter $\ell$, and yield approximate counts $f'(x)$ (for example $f'(x) = 0$ and $f'(x) = 2$).]
The proof (very short)
Assume we perform the deletion step $t$ times.
First fact: $f'(x) \le f(x)$
Second fact: $f'(x) \ge f(x) - t$
Third (not so obvious) fact: $0 \le \sum_x f'(x) = \sum_x f(x) - t \cdot \ell = n - t \cdot \ell$
Which gives $t \le n/\ell$. In words: we can only delete items $n/\ell$ times!
Combined with the first two facts: $|f'(x) - f(x)| \le t \le n/\ell$ ∎
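The three facts can be checked on a small stream. The code below is one way to realize the counting scheme the proof assumes: each deletion step removes one copy of each of $\ell$ distinct items, so the stored counts sum to $n - t \cdot \ell$. The function name `frequent_items`, the choice $\ell = 3$, and the trigger condition (delete as soon as $\ell$ distinct items are held) are assumptions for illustration; published variants of the algorithm differ in such details.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Tuple


def frequent_items(stream: Iterable[Hashable], ell: int) -> Tuple[Dict[Hashable, int], int]:
    """Return approximate counts f' and the number t of deletion steps."""
    if ell < 1:
        raise ValueError("ell must be at least 1")
    counters: Dict[Hashable, int] = {}
    deletions = 0
    for item in stream:
        counters[item] = counters.get(item, 0) + 1
        if len(counters) >= ell:
            # Delete one copy of each of the ell distinct items held,
            # dropping counters that reach zero.
            counters = {k: v - 1 for k, v in counters.items() if v > 1}
            deletions += 1
    return counters, deletions


stream = [1, 7, 8, 1, 0, 1, 7, 7]
ell = 3
approx, t = frequent_items(stream, ell)
exact = Counter(stream)
n = len(stream)

for x in exact:
    f, f_approx = exact[x], approx.get(x, 0)
    assert f_approx <= f        # first fact
    assert f_approx >= f - t    # second fact
assert sum(approx.values()) == n - t * ell  # third fact
assert t <= n / ell
print(approx, t)
```

On this stream the sketch ends with a count of 1 for each of the items 1 and 7 after $t = 2$ deletion steps. Their exact counts are 3 each, so the error of 2 stays within the bound $n/\ell \approx 2.67$.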
Useful form…
Define $p(x) = f(x)/n$ and $p'(x) = f'(x)/n$.
We get that $|p'(x) - p(x)| \le 1/\ell$.
This is very useful for keeping approximate distributions!
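A small illustration of the normalized form: dividing the approximate counts by $n$ gives an estimate of the empirical distribution whose per-item error is at most $1/\ell$. The code below is a sketch under assumptions. The name `approx_distribution`, the synthetic stream, and $\ell = 10$ are hypothetical, and the counter-deletion rule is one variant consistent with the proof above.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def approx_distribution(stream: Sequence[Hashable], ell: int) -> Dict[Hashable, float]:
    """Estimate p'(x) = f'(x) / n while holding fewer than ell counters."""
    n = len(stream)
    if n == 0:
        return {}
    counters: Dict[Hashable, int] = {}
    for item in stream:
        counters[item] = counters.get(item, 0) + 1
        if len(counters) >= ell:
            # One deletion step: remove one copy of each held item.
            counters = {k: v - 1 for k, v in counters.items() if v > 1}
    return {k: v / n for k, v in counters.items()}


# Synthetic stream (hypothetical): item "a" is heavy, the rest are rare.
stream = ["a"] * 60 + [f"rare{i}" for i in range(40)]
ell = 10
p_approx = approx_distribution(stream, ell)

n = len(stream)
p_exact = {x: c / n for x, c in Counter(stream).items()}
worst = max(abs(p_approx.get(x, 0.0) - p) for x, p in p_exact.items())
assert worst <= 1 / ell
print(p_approx.get("a", 0.0), worst)
```

A consequence worth noting: any item with $p(x) > 1/\ell$ keeps a nonzero counter, since $f'(x) \ge f(x) - n/\ell > 0$, so heavy items are never lost from the sketch.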
Threading Machine Generated Email
Email threads
A simple email thread (that’s not very hard to do…)
Ailon, Karnin, Maarek, Liberty, Threading Machine Generated Email, WSDM 2013
What else can we do in the streaming model…
Items (words, IP addresses, events, clicks, ...):
- Item frequencies
- Counting distinct elements
- Moment and entropy estimation
- Approximate set operations

Vectors (text documents, images, example features, ...):
- Dimensionality reduction
- Clustering (k-means, k-median, …)
- Linear regression
- Machine learning (some of it at least)

Matrices (text corpora, user preferences, graphs, ...):
- Covariance matrix estimation
- Low rank approximation
- Sparsification
Thanks!
Yahoo does big data algorithms, software and systems! Speak to our Talent Team or visit Careers.Yahoo.com and explore our career opportunities in NYC or Sunnyvale, CA
Seth Tropper [email protected]
Doug DeSimone [email protected]
Keith Daniels [email protected]
Yahoo is an equal opportunity employer.