Deep Learning and Topological Data Analysis for
machine intelligence and predictive analytics
Unlabeled Data
The vast majority of data is unlabeled
• Medical and DNA profiling
• Images
• Text
• Stock market transactions
• Customer activities
• Sensor signals
• System logs
• Sound
Unlabeled Data
Business questions to unlabeled data:
• How many categories are in my dataset?
• Which categories are the best for the business?
• Why are some objects not like the others?
• How can I contextualize new objects?
• Is there a simpler way to describe my data?
Topological invariants
A topological invariant is a map f that assigns the same object to homeomorphic spaces, that is: if X ≅ Y, then f(X) = f(Y).
Homology is a machine that converts local data about a space into global algebraic structure.
Reference: Wikipedia, 2010.
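To make the definition of a topological invariant concrete, here is a minimal sketch using the Euler characteristic, a classical invariant chosen here purely for illustration (the slides do not name it). The helper names `closure` and `euler_characteristic` are hypothetical. Two different triangulations of a circle give the same value, while a triangulated sphere gives a different one.

```python
from itertools import combinations


def closure(maximal_simplices):
    """Return every non-empty face of the given maximal simplices."""
    faces = set()
    for simplex in maximal_simplices:
        vertices = sorted(simplex)
        for size in range(1, len(vertices) + 1):
            faces.update(combinations(vertices, size))
    return faces


def euler_characteristic(maximal_simplices):
    """Alternating sum over faces: +1 for even dimension, -1 for odd."""
    return sum((-1) ** (len(face) - 1) for face in closure(maximal_simplices))


# Two triangulations of a circle (homeomorphic spaces).
triangle_boundary = [(0, 1), (1, 2), (0, 2)]
square_boundary = [(0, 1), (1, 2), (2, 3), (0, 3)]

# A triangulated sphere: the four triangular faces of a tetrahedron.
tetrahedron_surface = list(combinations(range(4), 3))

print(euler_characteristic(triangle_boundary))    # circle
print(euler_characteristic(square_boundary))      # circle again, same value
print(euler_characteristic(tetrahedron_surface))  # sphere, different value
```

The invariant only works in one direction: equal values do not prove that two spaces are homeomorphic, but different values prove that they are not.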
The Čech Complex
Combinatorial representations
Topology Data Analysis Pipeline
a. Compute a combinatorial model approximating the structure of the underlying space
b. Then compute topological invariants of this structure
c. Represent these topological invariants in 2d space
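Steps a and b of the pipeline can be sketched on a toy point cloud. The example below builds only the edges (1-skeleton) of a Čech complex, where two points are joined when their balls of a given radius intersect, and then counts connected components, which is the simplest topological invariant of that structure. The point cloud, the radii, and the function names are illustrative assumptions, not taken from the slides.

```python
from itertools import combinations
from math import dist


def cech_edges(points, radius):
    """Edges of the Cech complex: two balls of this radius intersect
    exactly when their centres are at most 2 * radius apart."""
    return [
        (i, j)
        for i, j in combinations(range(len(points)), 2)
        if dist(points[i], points[j]) <= 2 * radius
    ]


def count_components(n_points, edges):
    """Number of connected components, computed with union-find."""
    parent = list(range(n_points))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in edges:
        parent[find(i)] = find(j)
    return len({find(i) for i in range(n_points)})


# Hypothetical point cloud: two well-separated groups.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0), (11.0, 10.0)]

for radius in (0.25, 0.6, 8.0):
    edges = cech_edges(points, radius)
    print(radius, count_components(len(points), edges))
```

Sweeping the radius shows how the number of components changes with scale, which is the idea that persistent homology formalises. Higher-dimensional simplices and higher homology groups are left out of this sketch.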
Morse Theory and Reeb Graph
Theorem: Suppose h: X → ℝ is a discrete Morse function. Then X is homotopy equivalent to a CW-complex with exactly one cell of dimension p for each critical simplex of dimension p.
Reference: Teng Ma; Zhuangzhi Wu; Pei Luo; Lu Feng. Reeb graph computation through spectral clustering, 2011.
Deep Generative Nets + TDA
1. Learning of deep generative model
2. Fine-tuning using topological loss
Case study: Netflix competition
A dataset from the Netflix open competition for the best collaborative filtering algorithm to predict user ratings for films:
• 100,480,507 ratings
• 480,189 users
• 17,770 movies
• 2.1 GB of CSV data
Case study: Netflix competition
Standard approach to cluster analysis
• PCA
• Hessian LLE
• Isomap
• Locally-Linear Embedding (LLE)
• Local Tangent Space Alignment (LTSA)
Case study: Netflix competition Topological Result
Case study: Netflix competition Topological Result with Labels
Case study: Netflix competition Horror Movies
Case study: Netflix competition Science Fiction / Fantasy Series
Case study: 20 Newsgroups
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.
• 18,820 documents
• From 6 to 5000 words each
• 20 newsgroups (classes)
20 Newsgroups academic dataset (semi-supervised)
Case study: 20 Newsgroups
• alt.atheism
• comp.graphics
• comp.os.ms-windows.misc
• comp.sys.ibm.pc.hardware
• comp.sys.mac.hardware
• comp.windows.x
• misc.forsale
• rec.autos
• rec.motorcycles
• rec.sport.baseball
• rec.sport.hockey
• sci.crypt
• sci.electronics
• sci.med
• sci.space
• soc.religion.christian
• talk.politics.guns
• talk.politics.mideast
• talk.politics.misc
• talk.religion.misc
Detailed topology (user group overlay) Case study: 20 Newsgroups
Baseball cluster Case study: 20 Newsgroups
“pitch” > 1.2
This must be baseball
speed game margin realist chip ucdavi edu gari
built villanova huckabai basebal game and shade hour that damn long don plai hour game watch game for that long butt fall asleep and watch
channel surf pitch catch color
Motorcycles cluster Case study: 20 Newsgroups
“bike” > 1.114
This must be motorcycles
ride sixteen dai had put test drive honda final saturdai rain fact clear warm and sunni and wind di week ago long cool ride hawk cycl for test ride
had sold and deliv demo fifteen hour arriv and demo vfr bike lock showroom surround bike and
not like move todai even bike us dirt bike us street bike car and big tent full outlandishli fat tour bike trailer squeez park lot sort fat bike convent shelli and dave run msf each time classroom and back lot usual free cookout
distribut severli affect will bike perform such load cling back rest secur shift increas chanc surf
collect wisdom request can afford leather pant boot and jean can make you knee protector
rollerblad us bean and sell
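The two cluster slides read like simple threshold rules: “pitch” > 1.2 suggests baseball and “bike” > 1.114 suggests motorcycles. A minimal sketch of applying such rules follows. It assumes the numbers are lower bounds on a per-document weight for a stemmed term; the slides do not say how that weight is computed, so the weights below are invented for illustration and `label_by_rules` is a hypothetical helper.

```python
# Thresholds quoted on the slides. Interpreting them as lower bounds on a
# per-document term weight is an assumption.
RULES = [
    ("pitch", 1.2, "baseball"),
    ("bike", 1.114, "motorcycles"),
]


def label_by_rules(term_weights, rules=RULES):
    """Return the label of the first rule whose term weight exceeds its
    threshold, or None when no rule fires."""
    for term, threshold, label in rules:
        if term_weights.get(term, 0.0) > threshold:
            return label
    return None


# Hypothetical term weights for three documents.
doc_a = {"pitch": 1.7, "game": 0.9, "catch": 0.4}
doc_b = {"bike": 2.3, "ride": 1.1, "honda": 0.6}
doc_c = {"pitch": 0.3, "bike": 0.2, "window": 1.5}

print(label_by_rules(doc_a))
print(label_by_rules(doc_b))
print(label_by_rules(doc_c))
```

Documents that trigger no rule stay unlabeled, which matches the semi-supervised setting where only part of the collection receives a label.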
Result of learning first two groups Case study: 20 Newsgroups
Figure labels: Labeled baseball, Unlabeled baseball, Labeled motorcycles, Unlabeled motorcycles, Autos, Pc.hardware, Mac.hardware
Result of learning five groups Case study: 20 Newsgroups
Figure labels: Mac.hardware, Baseball, Pc.hardware, Autos, Motorcycles, Sci.med, Politics.misc, Politics.mideast, Hockey
Final result for 2nd layer Case study: 20 Newsgroups
Figure labels: Motorcycles, Christian, Atheism, Religion.misc, Politics.guns, Politics.misc, Politics.mideast, Sci.crypt, Sci.med, Hockey, Baseball, Autos, Forsale, Mac.hardware, Electronics, Sci.space, Comp.graphics, Windows.x, Ms-windows.misc, Pc.hardware
Case study: Badoo A subset of user activity in the United States. Aggregated activity metrics over two weeks in August 2014.
• 88,567 users • 867 metrics
Case study: Badoo Data Transformation
Used aggregated representations of user activities per day:
• Number of likes
• Number of dislikes
• Number of matches
• Profiles visited
• Photos uploaded
• Number of messages sent (no content analysed)
• Number of message replies
• Interactions with different app features
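The per-day aggregation described above can be sketched as a simple counting step. The event log format, the user ids, and the metric names below are hypothetical; the slides only state that activity was aggregated per day into counts such as likes and messages sent.

```python
from collections import Counter, defaultdict

# Hypothetical raw event log: (user_id, day, event_type).
events = [
    ("u1", "2014-08-01", "like"),
    ("u1", "2014-08-01", "like"),
    ("u1", "2014-08-01", "message_sent"),
    ("u1", "2014-08-02", "profile_visit"),
    ("u2", "2014-08-01", "dislike"),
    ("u2", "2014-08-01", "photo_upload"),
]

# Fixed column order for the resulting feature rows (assumed names).
METRICS = ["like", "dislike", "profile_visit", "photo_upload", "message_sent"]


def aggregate_daily(event_log):
    """Count events of each type per (user, day) pair."""
    daily = defaultdict(Counter)
    for user_id, day, event_type in event_log:
        daily[(user_id, day)][event_type] += 1
    return daily


def to_feature_row(counts, metrics=METRICS):
    """Turn a Counter into a fixed-length list of counts."""
    return [counts.get(metric, 0) for metric in metrics]


daily = aggregate_daily(events)
for key in sorted(daily):
    print(key, to_feature_row(daily[key]))
```

Only counts are kept, so message content never enters the feature rows, in line with the "no content analysed" note on the slide.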
Case study: Badoo
Case study: Badoo Messages sent / received
Case study: Badoo Users with high retention
Case study: Badoo Users grouped in retention clusters by using deep generative nets
“Pretty boys”: users with a high score who received a lot of likes and messages in their first 3 days
“Dedicated”: users who invested much time in their profiles, were active on the site and received several messages in their first three days
“Curious”: users who invested less time in their profiles and sent lots of messages, sometimes being blocked by other users
Case study: Badoo On-line learning and prediction of user clusters
1. Configure integration
2. Perform segmentation
3. System performs classification
4. Report classification results
• CSV API
• JSON API
• Database connector
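Steps 3 and 4 can be illustrated with a small sketch. The slides do not say how a new user is assigned to an existing cluster, so nearest-centroid assignment is used here as a stand-in technique, and the result is reported as JSON. The centroid values, feature layout, and function names are all hypothetical; only the cluster names “Dedicated” and “Curious” come from the slides.

```python
import json
from math import dist

# Hypothetical cluster centroids in a 3-feature space:
# (messages sent per day, time spent on profile, times blocked).
CENTROIDS = {
    "Dedicated": (3.0, 9.0, 0.0),
    "Curious": (12.0, 2.0, 1.5),
}


def nearest_cluster(features, centroids=CENTROIDS):
    """Assign a feature vector to the cluster with the closest centroid."""
    return min(centroids, key=lambda name: dist(features, centroids[name]))


def classification_report(users, centroids=CENTROIDS):
    """Build a JSON string mapping each user id to its predicted cluster."""
    result = {
        user_id: nearest_cluster(features, centroids)
        for user_id, features in users.items()
    }
    return json.dumps(result, sort_keys=True)


# Hypothetical new users arriving through one of the integrations.
new_users = {
    "u101": (2.0, 8.0, 0.0),
    "u102": (15.0, 1.0, 2.0),
}

print(classification_report(new_users))
```

A production system would replace the centroid lookup with whatever model the segmentation step produced; the JSON report shape shown here is likewise only an assumption.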
Case study: Financial Articles
Understand the main topics of news and scientific articles on economics
• 17,020 documents
Case study: Financial Articles
Demo
Links
Topology and Data (Gunnar Carlsson): http://www.ams.org/journals/bull/2009-46-02/S0273-0979-09-01249-X/S0273-0979-09-01249-X.pdf
Discrete Morse Theory and Persistent Homology (Kevin P. Knudson): http://www.math.fsu.edu/~hironaka/FSUUF/knudson.pdf
Topological Persistence and Simplification (Herbert Edelsbrunner, David Letscher, Afra Zomorodian): http://math.uchicago.edu/~shmuel/AAT-readings/Data%20Analysis%20/PersTop.pdf
Extracting and Composing Robust Features with Denoising Autoencoders (Pascal Vincent, Hugo Larochelle, Yoshua Bengio, Pierre-Antoine Manzagol): http://www.iro.umontreal.ca/~vincentp/Publications/denoising_autoencoders_tr1316.pdf
www.datarefiner.com