June 2010 Indo-Spain ICT Workshop 1
Data Mining for Information Retrieval, Business and Scientific Applications
Jayant Haritsa
Database Systems Lab, SERC
Indian Institute of Science
DM Research Map in India
• IIIT Bangalore
– Srinath Srinivasa [Graph, Doc]
• IISc Bangalore (CSA, EE, SERC)
– M Narasimha Murthy [ML, ARM]
– C Bhattacharyya [ML, Bio, Hand]
– Shirish Shevade [SVM]
– P S Sastry [Temporal, DecTrees]
– Jayant Haritsa [ARM, Privacy]
• IIT Bombay (CSE, Chemical)
– Sunita Sarawagi [IR]
– Soumen Chakrabarti [Web]
– Saketa Nath [ML]
– Pramod Wangikar [Bio]
• IIT Madras (CSE)
– P Sreenivasa Kumar [Text, XML]
– D Janaki Ram [S/W Engg]
– B Ravindran [Text, Tutoring, Motion]
• IIIT Hyderabad
– Kamal Karlapalem [Ecom, Cluster]
– Vikram Pudi [ARM, Multimedia]
– P Krishna Reddy [ARM]
• IIT Delhi (CSE)
– S K Gupta [ARM, Security]
• IIT Kanpur (CSE)
– Arnab Bhattacharya [Spatial, Temporal, Bio]
• IIT Kharagpur (CSE)
– Pabitra Mitra [ML]
Temporal Data Mining [P S Sastry]
Temporal Data Mining: Analyzing Symbolic Time Series Data
A temporal data mining framework
• Data is a sequence of events. Each event has a type and a time of occurrence.
• ‘Patterns’ in the formalism are episodes: partially ordered sets of event types.
• Episode occurrence: events in the data that conform to the partial order.
TDM (contd)
• Algorithms are developed to discover frequent episodes of different structures.
• Can handle additional temporal constraints.
• Statistical theory developed to assess significance under various null hypothesis models.
• Unified view of counting-based CS and model-based Stats approaches [equivalence between frequent episodes and HMMs]
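To make "episode occurrence" and "frequent episode" concrete, here is a minimal sketch. It assumes a serial episode (a totally ordered list of event types) and takes frequency to be the number of non-overlapped occurrences, which is one common measure; the slides do not say which measure is used. The function name and the sample data are illustrative only, and temporal constraints are not handled.

```python
from typing import Sequence, Tuple

Event = Tuple[str, float]  # (event type, time of occurrence)


def count_non_overlapped(events: Sequence[Event], episode: Sequence[str]) -> int:
    """Count non-overlapped occurrences of a serial episode.

    `events` must be sorted by time. A single left-to-right scan tracks
    which event type of the episode is awaited next; when the last type
    is seen, one occurrence is counted and tracking restarts.
    """
    if not episode:
        raise ValueError("episode must contain at least one event type")
    count = 0
    pos = 0  # index into `episode` of the event type awaited next
    for etype, _time in events:
        if etype == episode[pos]:
            pos += 1
            if pos == len(episode):  # a full occurrence has completed
                count += 1
                pos = 0
    return count


# Hypothetical event sequence with three event types.
events = [("A", 1), ("B", 2), ("C", 3), ("A", 4),
          ("C", 5), ("B", 6), ("A", 7), ("B", 8)]
print(count_non_overlapped(events, ["A", "B"]))
print(count_non_overlapped(events, ["A", "B", "C"]))
```

An episode would then be called frequent when this count reaches a user-chosen threshold.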
Example Data Sequences
• Fault report logs in manufacturing plants: root cause diagnostics
• Multi-neuronal spike train data: discovering microcircuits or groups of neurons with ‘strong’ interactions
• Web navigation data: prediction of user behaviour
An Example Application
• Data: Sequence of fault codes in an assembly plant
• Event types: fault codes, date and time
• Episodes: Causative (?) chains
• Rules derived from episodes provide help in root cause diagnosis
• Currently in use in some engine assembly plants of General Motors
Application: Status logs from Assembly Plants
[Figure labels: Fault Logs, Fault Correlations, Root cause diagnostics]
Associated Publications
• S Laxman, PS Sastry & KP Unnikrishnan, Discovering frequent episodes and learning HMMs: A formal connection, IEEE Trans Knowledge Data Engg, Vol 17, pp 1505-1517, November 2005
• S Laxman, PS Sastry & KP Unnikrishnan, Learning frequent generalized episodes when events persist for different durations, IEEE Trans Knowledge Data Engg, Vol 19, pp 1188-1201, September 2007
• D Patnaik, PS Sastry & KP Unnikrishnan, Inferring neuronal network connectivity from spike data: A temporal data mining approach, Scientific Programming (special issue on biological data mining), Vol 16, pp 49-77, 2008
Relevant Publications (contd)
• C Diekman, PS Sastry & KP Unnikrishnan, Statistical significance of sequential firing patterns in multi-neuronal spike trains, J. Neuroscience Methods, Vol 182, pp 279-284, 2009
• PS Sastry & KP Unnikrishnan, Conditional probability based significance test for sequential patterns in multi-neuronal spike trains, Neural Computation, Vol 22, pp 1025-1059, April 2010
• KP Unnikrishnan, BQ Shadid, PS Sastry & S Laxman, Root cause diagnostics using temporal data mining, US Patent No 7509234, issued on 24 March 2009
Online Generation of Web Tables [Prof. Sunita Sarawagi]
The semi-structured web
• Table
• List
• Regular page
• Formatted list
Queries in WWT
• Query by example
    Alan Turing        Turing Machine
    E. F. Codd         Relational Databases
• Query by description
    Inventor           Computer science concept
• Answer: Table with ranked rows
    Inventor           Computer Science Concept
    Alan Turing        Turing Machine
    Seymour Cray       Supercomputer
    E. F. Codd         Relational Databases
    Tim Berners-Lee    WWW
    Charles Babbage    Babbage Engine
WWT Architecture
[Figure: architecture diagram. Component labels: Web, Extract record sources, Annotate, Ontology, Store, Content+context index, Offline, User, Query Table, Index Query Builder, Keyword Query, Source L1,…,Lk, Type Inference, Type system Hierarchy, Resolver builder, Resolver, Extractor, Record labeler, CRF models, Tables T1,…,Tk, Consolidator, Cell resolver, Row resolver, Consolidated Table, Ranker, Row and cell scores, Final consolidated table]
Experiments
• Aim: Reconstruct Wikipedia tables from only a few sample rows.
• Sample queries
– TV Series: Character name, Actor name, Season
– Oil spills: Tanker, Region, Time
– Golden Globe Awards: Actor, Movie, Year
– Dadasaheb Phalke Awards: Person, Year
– Parrots: common name, scientific name, family
Accuracy of joint labeling
• Dataset
– Manually labeled: 450 tables spanning general Web and Wikipedia
– Automatically labeled: 650 tables from Wikipedia where cells have entity links
[Figure: bar chart of F1 accuracy (axis 0 to 1) comparing LCA, Majority, and Ours; x-axis labels: Entity, Types, Relations; datasets: Manual, Automatic]
Summary
• Amazing amount of quality information on the semi-structured web.
• WWT
– Online: structure interpretation at query time
– Domain-independent methods for extraction and coreference resolution
– Relies heavily on unsupervised statistical learning
• Joint training for exploiting overlap during extraction
• Graphical model for table annotation
• Collective column labeling for descriptive queries
• Bayesian network for consolidation
• Page rank + confidence from a probabilistic extractor for ranking
Further Details
• Gupta and Sarawagi, VLDB 2009
Association Rule Mining [Jayant Haritsa]
Data Layouts for AR Mining
• Table Organization
– Horizontal (transactions / rows)
– Vertical (items / columns)
• Data Representation
– Value List (only presence)
– Bit Vector (presence and absence)
[Figure: one transaction shown as a value list (1 2 5) and as a bit vector]
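The two layout choices can be illustrated with a small sketch. It assumes items are numbered from 1 and uses a made-up three-transaction database; the function names are hypothetical. A Python integer stands in for each vertical bit vector, with bit t set when transaction t contains the item.

```python
from typing import Dict, List


def horizontal_bit_vector(items: List[int], num_items: int) -> List[int]:
    """One transaction as a bit vector: records presence AND absence."""
    present = set(items)
    return [1 if i in present else 0 for i in range(1, num_items + 1)]


def vertical_bit_vectors(transactions: List[List[int]], num_items: int) -> Dict[int, int]:
    """One bit vector per item (column); bit `tid` is set when the
    transaction with that id contains the item."""
    columns = {item: 0 for item in range(1, num_items + 1)}
    for tid, items in enumerate(transactions):
        for item in items:
            columns[item] |= 1 << tid
    return columns


# Hypothetical database in horizontal value-list form (only presence is stored).
transactions = [[1, 2, 5], [2, 3], [1, 5]]

print(horizontal_bit_vector(transactions[0], num_items=5))
columns = vertical_bit_vectors(transactions, num_items=5)
for item, bits in columns.items():
    print(item, format(bits, "03b"))
```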
Data Layout Combinations
[Figure: grid of data layout combinations, with cells labeled “Our Approach” and “Apriori”]
Why Vertical Organization?
• “Natural” for association rule mining’s goal of discovering correlated items (columns)
• Support counting simple (set intersections)
• No excess baggage from disk (automatic and immediate reduction of database after each scan)
• Ideal for parallel implementations (asynchronous, not level-wise, computation)
– counting of AB can start before item C has been fully counted
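Support counting by set intersection can be sketched as follows. This assumes each item's column is held as a Python set of transaction ids, which is a simplification for illustration and not the compressed layout used in the talk; the data and names are made up.

```python
from typing import Dict, Iterable, Set


def tid_set(columns: Dict[str, Set[int]], itemset: Iterable[str]) -> Set[int]:
    """Transaction ids that contain every item of `itemset`."""
    items = list(itemset)
    if not items:
        raise ValueError("itemset must not be empty")
    result = set(columns[items[0]])
    for item in items[1:]:
        result &= columns[item]  # one intersection per additional item
    return result


def support(columns: Dict[str, Set[int]], itemset: Iterable[str]) -> int:
    """Support of an itemset = size of the intersected tid set."""
    return len(tid_set(columns, itemset))


# Hypothetical vertical database: item -> ids of transactions containing it.
columns = {"A": {0, 1, 3}, "B": {0, 3, 4}, "C": {3, 4}}

print(support(columns, ["A", "B"]))
print(support(columns, ["A", "B", "C"]))

# The tid set of AB depends only on columns A and B, so it can be
# computed without reading column C and reused later for ABC.
ab = tid_set(columns, ["A", "B"])
print(len(ab & columns["C"]))
```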
Why Bit-Vector Representation?
• Tremendous scope for compression, especially since databases are typically sparse
• Vertical orientation offers better compression than horizontal since column lengths are proportional to size of database whereas row lengths are proportional to size of schema
• In fact, compressed VTV occupies much less space than HIL
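Why sparse bit vectors compress well can be seen with plain run-length encoding. The sketch below is a stand-in chosen for illustration; it is not the “Snake” format from the talk, and the function names are hypothetical. Each 1 is stored as the number of 0s preceding it, so a long, mostly-zero column shrinks to a few small integers.

```python
from typing import List, Tuple


def rle_encode(bits: List[int]) -> Tuple[List[int], int]:
    """Return (gaps, trailing): `gaps[k]` is the number of 0s before the
    k-th 1, and `trailing` is the number of 0s after the last 1."""
    gaps: List[int] = []
    run = 0
    for b in bits:
        if b:
            gaps.append(run)
            run = 0
        else:
            run += 1
    return gaps, run


def rle_decode(gaps: List[int], trailing: int) -> List[int]:
    """Inverse of rle_encode."""
    bits: List[int] = []
    for gap in gaps:
        bits.extend([0] * gap)
        bits.append(1)
    bits.extend([0] * trailing)
    return bits


# Hypothetical sparse column: 2 of 12 transactions contain the item.
column = [0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0]
gaps, trailing = rle_encode(column)
print(gaps, trailing)
print(rle_decode(gaps, trailing) == column)
```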
Contributions
• Compressed VTV data layout: “Snake”
• VIPER snake mining algorithm
– (Vertical Itemset Partitioning for Efficient Rule-extraction)
– several optimizations for snake generation, intersection, counting and storage
– “general-purpose” (prior vertical algorithms have restrictions on DB size, shape, contents, mining process)
– substantial performance improvement (response time, disk space, disk traffic)
– in some cases, beats “optimal” horizontal!
• Presented in ACM SIGMOD 2000
Privacy Preserving Mining
JUNE 2008 Slide 30 S N Bose Centre
A Typical Web-Service Form (e.g. Amazon.com)
State-of-the-Art
• To cater to such privacy concerns, users are forced to resort to falsifying their data
– (E.g.: There appear to be numerous grandmothers from Bangalore downloading rock music, because the actual clients, young male IISc students, have falsified their age and gender)
• But then, the models become meaningless …
– Garbage in, Garbage out
Design Wish-list
• High Privacy
– User-visible (i.e. should be done at user site)
• Highly accurate models
– Association Rules correctly identified
• Efficiency
– Data collection / Mining-process
[Figure: Privacy, Accuracy, and Efficiency as competing goals; pipeline User → Service Provider → Mining Company]
The Digital Divide
Data Privacy versus Accurate Models
Desire “Globally accurate, locally private”, but
Bridging the Divide
[Figure labels: User Data, Data Distortion Procedure, Distribution Reconstruction Procedure, Mining Algorithm, Accurate Models]
Optimal Distortion Matrix
• Symmetric positive-definite Toeplitz matrix, with Gamma diagonal
• Results in matrix with lowest “condition number” (ratio of maximum and minimum eigenvalues)
– makes reconstruction least sensitive to the variance in the distribution of the distorted database
– order-of-magnitude accuracy improvements
• Dependent column perturbation
γ determines privacy level
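A small numeric sketch of distortion and reconstruction follows. The slide does not spell out the matrix, so the form used here is an assumption: an n × n matrix with γ·x on the diagonal and x elsewhere, where x = 1/(γ + n − 1) so that every column sums to 1. Under that assumption the eigenvalues are 1 (once) and (γ − 1)·x (n − 1 times), which gives the condition number in closed form. The function names and the counts are hypothetical, and the code works with expected counts instead of sampling a random perturbation per record.

```python
from typing import List


def gamma_diagonal(n: int, gamma: float) -> List[List[float]]:
    """Assumed distortion matrix: gamma*x on the diagonal, x elsewhere."""
    if gamma <= 1:
        raise ValueError("this sketch assumes gamma > 1")
    x = 1.0 / (gamma + n - 1)
    return [[gamma * x if i == j else x for j in range(n)] for i in range(n)]


def condition_number(n: int, gamma: float) -> float:
    """Ratio of largest eigenvalue (1) to smallest ((gamma - 1) * x)."""
    if gamma <= 1:
        raise ValueError("this sketch assumes gamma > 1")
    return (gamma + n - 1) / (gamma - 1)


def expected_distorted(counts: List[float], gamma: float) -> List[float]:
    """Expected counts after distortion: Y = A X."""
    n = len(counts)
    a = gamma_diagonal(n, gamma)
    return [sum(a[i][j] * counts[j] for j in range(n)) for i in range(n)]


def reconstruct(distorted: List[float], gamma: float) -> List[float]:
    """Solve A X = Y in closed form, using A = (gamma-1)*x*I + x*J
    (J is the all-ones matrix) and sum(X) == sum(Y)."""
    if gamma <= 1:
        raise ValueError("this sketch assumes gamma > 1")
    n = len(distorted)
    x = 1.0 / (gamma + n - 1)
    total = sum(distorted)
    return [(y - x * total) / ((gamma - 1) * x) for y in distorted]


counts = [50.0, 30.0, 15.0, 5.0]  # hypothetical true category counts
gamma = 19.0
y = expected_distorted(counts, gamma)
print([round(v, 3) for v in y])
print([round(v, 3) for v in reconstruct(y, gamma)])
print(round(condition_number(len(counts), gamma), 4))
```

In this form, a smaller γ flattens the distorted counts toward uniform and raises the condition number, so reconstruction becomes more sensitive to noise in the distorted data.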
Summary
• By careful mathematical design, it is possible to simultaneously achieve user data privacy, accurate statistical models, and good runtime efficiency.
Further Details
• “Maintaining Data Privacy in Association Rule Mining” [VLDB 2002]
• “On Addressing Efficiency Concerns in Privacy-Preserving Mining” [DASFAA 2004]
• “A Framework for High-Accuracy Privacy-Preserving Mining” [ICDE 2005]
• All available at http://dsl.serc.iisc.ernet.in
Questions?