Data Mining for Information Retrieval, Business and Scientific … · 2010-10-21 · Data Mining for Information Retrieval, Business and Scientific Applications. Jayant Haritsa Database

June 2010 Indo-Spain ICT Workshop 1

Data Mining for Information Retrieval, Business and Scientific Applications

Jayant Haritsa
Database Systems Lab, SERC
Indian Institute of Science

DM Research Map in India


• IIIT Bangalore
– Srinath Srinivasa [Graph, Doc]

• IISc Bangalore (CSA, EE, SERC)
– M Narasimha Murthy [ML, ARM]
– C Bhattacharyya [ML, Bio, Hand]
– Shirish Shevade [SVM]
– P S Sastry [Temporal, DecTrees]
– Jayant Haritsa [ARM, Privacy]

• IIT Bombay (CSE, Chemical)
– Sunita Sarawagi [IR]
– Soumen Chakrabarti [Web]
– Saketa Nath [ML]
– Pramod Wangikar [Bio]

• IIT Madras (CSE)
– P Sreenivasa Kumar [Text, XML]
– D Janaki Ram [S/W Engg]
– B Ravindran [Text, Tutoring, Motion]

• IIIT Hyderabad
– Kamal Karlapalem [Ecom, Cluster]
– Vikram Pudi [ARM, Multimedia]
– P Krishna Reddy [ARM]

• IIT Delhi (CSE)
– S K Gupta [ARM, Security]

• IIT Kanpur (CSE)
– Arnab Bhattacharya [Spatial, Temporal, Bio]

• IIT Kharagpur (CSE)
– Pabitra Mitra [ML]

Temporal Data Mining [P S Sastry]



Temporal Data Mining: Analyzing Symbolic Time Series Data

• A temporal data mining framework
– Data is a sequence of events. Each event has a type and a time of occurrence.
– ‘Patterns’ in the formalism are episodes: partially ordered sets of event types.
– Episode occurrence: events in the data that conform to the partial order.
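A minimal sketch of the framework above, restricted to serial episodes (the special case where the partial order is a total order). It counts non-overlapped occurrences with a single left-to-right scan. The function name, the event encoding as (type, time) pairs, and the choice of non-overlapped counting as the frequency measure are assumptions made for illustration, not details taken from the slides.

```python
from typing import List, Tuple

Event = Tuple[str, float]  # (event type, time of occurrence)


def count_serial_episode(events: List[Event], episode: List[str]) -> int:
    """Count non-overlapped occurrences of a serial episode.

    An occurrence is a set of events whose types follow `episode` in
    time order; unrelated events may be interleaved between them.
    """
    if not episode:
        return 0
    count = 0
    pos = 0  # index of the next episode symbol we are waiting for
    for etype, _time in sorted(events, key=lambda e: e[1]):
        if etype == episode[pos]:
            pos += 1
            if pos == len(episode):
                # One full occurrence found; start looking for the next one.
                count += 1
                pos = 0
    return count


# Hypothetical fault-code sequence: (code, timestamp)
faults = [("A", 1), ("C", 2), ("B", 3), ("A", 4), ("B", 5), ("B", 6), ("A", 7)]
print(count_serial_episode(faults, ["A", "B"]))  # 2
```

Here the occurrences are (A at 1, B at 3) and (A at 4, B at 5); the trailing A at time 7 never completes. An episode would be called frequent when this count exceeds a user-chosen threshold.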


TDM (contd)

• Algorithms are developed to discover frequent episodes of different structures. Can handle additional temporal constraints.
• Statistical theory developed to assess significance under various null hypothesis models.
• Unified view of counting-based CS and model-based Stats approaches [equivalence between frequent episodes and HMMs]


Example Data Sequences

• Fault report logs in manufacturing plants: root cause diagnostics
• Multi-neuronal spike train data: discovering microcircuits or groups of neurons with ‘strong’ interactions
• Web navigation data: prediction of user behaviour


An Example Application

• Data: Sequence of fault codes in an assembly plant
• Event types: fault codes, date and time
• Episodes: Causative (?) chains
• Rules derived from episodes provide help in root cause diagnosis
• Currently in use in some engine assembly plants of General Motors


Fault Correlations

Root cause diagnostics

Application: Status logs from Assembly Plants

Fault Logs


Associated Publications

• S Laxman, PS Sastry & KP Unnikrishnan, Discovering frequent episodes and learning HMMs: A formal connection, IEEE Trans Knowledge Data Engg, Vol 17, pp 1505-1517, November 2005
• S Laxman, PS Sastry & KP Unnikrishnan, Learning frequent generalized episodes when events persist for different durations, IEEE Trans Knowledge Data Engg, Vol 19, pp 1188-1201, September 2007
• D Patnaik, PS Sastry & KP Unnikrishnan, Inferring neuronal network connectivity from spike data: A temporal data mining approach, Scientific Programming (special issue on biological data mining), Vol 16, pp 49-77, 2008


Relevant Publications (contd)

• C Diekman, PS Sastry & KP Unnikrishnan, Statistical significance of sequential firing patterns in multi-neuronal spike trains, J. Neuroscience Methods, Vol 182, pp 279-284, 2009
• PS Sastry & KP Unnikrishnan, Conditional probability based significance test for sequential patterns in multi-neuronal spike trains, Neural Computation, Vol 22, pp 1025-1059, April 2010
• KP Unnikrishnan, BQ Shadid, PS Sastry & S Laxman, Root cause diagnostics using temporal data mining, US Patent No 7509234, issued on 24 March 2009



Online Generation of Web Tables [Prof. Sunita Sarawagi]


The semi-structured web

• Table
• List
• Regular page
• Formatted list

Queries in WWT

• Query by example

Alan Turing         Turing Machine
E. F. Codd          Relational Databases

• Query by description

Inventor            Computer science concept

• Answer: Table with ranked rows

Inventor            Computer Science Concept
Alan Turing         Turing Machine
Seymour Cray        Supercomputer
E. F. Codd          Relational Databases
Tim Berners-Lee     WWW
Charles Babbage     Babbage Engine

WWT Architecture

[Figure: WWT architecture diagram. Labels shown: Web, Extract record sources, Annotate, Ontology, Store, Content+context index, Offline, Query Table, Index Query Builder, Keyword Query, Source L1,…,Lk, Type Inference, Typesystem Hierarchy, Resolver builder, Resolver, Cell resolver, Row resolver, Extractor, Record labeler, CRF models, Tables T1,…,Tk, Consolidator, Consolidated Table, Ranker, Row and cell scores, Final consolidated table, User]

Experiments

• Aim: Reconstruct Wikipedia tables from only a few sample rows.
• Sample queries
– TV Series: Character name, Actor name, Season
– Oil spills: Tanker, Region, Time
– Golden Globe Awards: Actor, Movie, Year
– Dadasaheb Phalke Awards: Person, Year
– Parrots: common name, scientific name, family

Accuracy of joint labeling

• Dataset
– Manually labeled: 450 tables spanning general Web and Wikipedia
– Automatically labeled: 650 tables from Wikipedia where cells have entity links

[Figure: bar chart of F1 accuracy (axis from 0 to 1) comparing LCA, Majority, and Ours. x-axis labels: Entity, Types, Relations, Entity, grouped under Manual and Automatic]

Summary

• Amazing amount of quality information on the semi-structured web.
• WWT
– Online: structure interpretation at query time
– Domain-independent methods for extraction and coreference resolution
– Relies heavily on unsupervised statistical learning
• Joint training for exploiting overlap during extraction
• Graphical model for table annotation
• Collective column labeling for descriptive queries
• Bayesian network for consolidation
• Page rank + confidence from a probabilistic extractor for ranking

Further Details

• Gupta and Sarawagi, VLDB 2009


Association Rule Mining [Jayant Haritsa]


Data Layouts for AR Mining

• Table Organization
– Horizontal (transactions / rows)
– Vertical (items / columns)

• Data Representation
– Value List (only presence)
– Bit Vector (presence and absence)

1 2 5

1 1 0 10
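The two layout choices (horizontal vs. vertical organization, value list vs. bit vector representation) can be illustrated with a small sketch. The toy database, the 1-based item numbering, and both helper names are hypothetical and chosen only for this example.

```python
from typing import Dict, List


def to_vertical_tid_lists(transactions: List[List[int]]) -> Dict[int, List[int]]:
    """Turn a horizontal value-list database (one item list per transaction)
    into a vertical one (one transaction-id list per item)."""
    vertical: Dict[int, List[int]] = {}
    for tid, items in enumerate(transactions):
        for item in items:
            vertical.setdefault(item, []).append(tid)
    return vertical


def to_bit_vector(positions: List[int], length: int) -> List[int]:
    """Turn a value list (only presence) into a bit vector
    (presence and absence) of the given length."""
    present = set(positions)
    return [1 if i in present else 0 for i in range(length)]


# Horizontal value lists: three transactions over items numbered 1..5.
db = [[1, 2, 5], [2, 3], [1, 2, 3, 5]]

# Horizontal bit vector of the first transaction (item i sits at index i - 1).
print(to_bit_vector([i - 1 for i in db[0]], 5))   # [1, 1, 0, 0, 1]

# Vertical value lists (tid-lists), then the vertical bit vector of item 1.
tid_lists = to_vertical_tid_lists(db)
print(tid_lists[1])                               # [0, 2]
print(to_bit_vector(tid_lists[1], len(db)))       # [1, 0, 1]
```

Note how a horizontal bit vector is as long as the schema (5 items), while a vertical bit vector is as long as the database (3 transactions).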


Data Layout Combinations

Our Approach

Apriori


Why Vertical Organization?

• “Natural” for association rule mining’s goal of discovering correlated items (columns)
• Support counting simple (set intersections)
• No excess baggage from disk (automatic and immediate reduction of database after each scan)
• Ideal for parallel implementations (asynchronous, not level-wise, computation)
– counting of AB can start before item C has been fully counted
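To make the "support counting by set intersection" point concrete, here is a small sketch in which each item's column is held as a Python integer used as a bit set (bit t is set when transaction t contains the item). The helper names and the toy columns are assumptions for illustration; this is not the VIPER implementation.

```python
from typing import Iterable


def tid_vector(tids: Iterable[int]) -> int:
    """Pack a list of transaction ids into an integer bit set."""
    vec = 0
    for t in tids:
        vec |= 1 << t
    return vec


def support(*columns: int) -> int:
    """Support of an itemset = number of transactions containing every item,
    i.e. the population count of the AND of the item columns."""
    acc = columns[0]
    for col in columns[1:]:
        acc &= col
    return bin(acc).count("1")


# Hypothetical vertical columns for items A, B and C.
A = tid_vector([0, 2, 3])
B = tid_vector([0, 1, 3])
C = tid_vector([3])

print(support(A, B))     # 2  (transactions 0 and 3)
print(support(A, B, C))  # 1  (transaction 3)
```

Because `support(A, B)` needs only columns A and B, it can be computed without touching C, which is the asynchronous (not level-wise) property noted above.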


Why Bit-Vector Representation?

• Tremendous scope for compression, especially since databases are typically sparse

• Vertical orientation offers better compression than horizontal since column lengths are proportional to size of database whereas row lengths are proportional to size of schema

• In fact, compressed VTV (vertical tid-vector) occupies much less space than HIL (horizontal item-list)
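A rough illustration of why sparse bit vectors compress well: the sketch below stores only the lengths of the zero-runs between successive 1 bits. Plain run-length encoding is used here as a generic stand-in; it is an assumption for illustration and not necessarily the encoding used in the talk's compressed layout.

```python
from typing import List


def compress(bits: List[int]) -> List[int]:
    """Encode a bit vector as the zero-run length before each 1,
    followed by the length of the trailing zero-run."""
    runs: List[int] = []
    zeros = 0
    for b in bits:
        if b:
            runs.append(zeros)
            zeros = 0
        else:
            zeros += 1
    runs.append(zeros)  # trailing zeros (possibly 0)
    return runs


def decompress(runs: List[int]) -> List[int]:
    """Inverse of compress()."""
    bits: List[int] = []
    for r in runs[:-1]:
        bits.extend([0] * r)
        bits.append(1)
    bits.extend([0] * runs[-1])
    return bits


# A sparse vertical column: an item present in 2 of 20 transactions.
column = [0] * 20
column[3] = 1
column[17] = 1

encoded = compress(column)
print(encoded)                        # [3, 13, 2]
print(decompress(encoded) == column)  # True
```

The longer the column and the sparser the item, the fewer run lengths are needed relative to the raw bits, which matches the argument that long vertical columns compress better than short horizontal rows.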


Contributions

• Compressed VTV data layout: “Snake”
• VIPER snake mining algorithm
– (Vertical Itemset Partitioning for Efficient Rule-extraction)
– several optimizations for snake generation, intersection, counting and storage
– “general-purpose” (prior vertical algorithms have restrictions on DB size, shape, contents, mining process)
– substantial performance improvement (response time, disk space, disk traffic)
– in some cases, beats “optimal” horizontal!

• Presented in ACM SIGMOD 2000


Privacy Preserving Mining

JUNE 2008 Slide 30S N Bose Centre

A Typical Web-Service Form (e.g. Amazon.com)


State-of-the-Art

• To cater to such privacy concerns, users are forced to resort to falsifying their data
– (E.g.: There appear to be numerous grandmothers from Bangalore downloading rock music, because the actual clients, young male IISc students, have falsified their age and gender)
• But then, the models become meaningless …
– Garbage in, Garbage out


Design Wish-list

• High Privacy
– User-visible (i.e. should be done at user site)
• Highly accurate models
– Association Rules correctly identified
• Efficiency
– Data collection / Mining-process

[Figure: Privacy, Accuracy and Efficiency shown against the pipeline User → Service Provider → Mining Company]


The Digital Divide

Data Privacy versus Accurate Models

Desire “Globally accurate, locally private”, but


Bridging the Divide

[Figure: block diagram with the labels User Data, Data Distortion Procedure, Distribution Reconstruction Procedure, Mining Algorithm, Accurate Models]


Optimal Distortion Matrix

• Symmetric positive-definite Toeplitz matrix, with Gamma diagonal
– γ determines privacy level
• Results in matrix with lowest “condition number” (ratio of maximum and minimum eigenvalues)
– makes reconstruction least sensitive to the variance in the distribution of the distorted database
– order-of-magnitude accuracy improvements
• Dependent column perturbation
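A small numerical sketch of distortion and reconstruction with a gamma-diagonal matrix. The concrete form used here (γx on the diagonal and x everywhere else, with x = 1/(γ + n − 1) so that each column sums to 1) is an assumption consistent with the description above, as are all function names. The example works with expected counts only and ignores the sampling noise of a real randomized perturbation.

```python
from typing import List


def gamma_diagonal(n: int, gamma: float) -> List[List[float]]:
    """Assumed form: A[i][j] = gamma*x if i == j else x, x = 1/(gamma+n-1)."""
    x = 1.0 / (gamma + n - 1)
    return [[gamma * x if i == j else x for j in range(n)] for i in range(n)]


def condition_number(n: int, gamma: float) -> float:
    """For this form (gamma > 1) the eigenvalues are 1 (once) and
    (gamma-1)*x (n-1 times), so max/min = (gamma+n-1)/(gamma-1)."""
    return (gamma + n - 1) / (gamma - 1)


def distort_expected(A: List[List[float]], counts: List[float]) -> List[float]:
    """Expected distorted counts Y = A X."""
    n = len(counts)
    return [sum(A[i][j] * counts[j] for j in range(n)) for i in range(n)]


def reconstruct(distorted: List[float], gamma: float) -> List[float]:
    """Estimate X = A^{-1} Y using the closed form
    A^{-1} = (I - x*J) / (x*(gamma-1)), where J is the all-ones matrix."""
    n = len(distorted)
    x = 1.0 / (gamma + n - 1)
    total = sum(distorted)
    return [(y - x * total) / (x * (gamma - 1)) for y in distorted]


# Hypothetical original counts over n = 4 categories, with gamma = 19.
true_counts = [50.0, 30.0, 15.0, 5.0]
A = gamma_diagonal(4, 19.0)
Y = distort_expected(A, true_counts)
X_hat = reconstruct(Y, 19.0)
print([round(v, 6) for v in X_hat])            # [50.0, 30.0, 15.0, 5.0]
print(round(condition_number(4, 19.0), 4))     # 1.2222
```

A smaller γ makes the rows of the matrix closer to uniform (more privacy) but pushes the condition number up, so noise in the distorted counts is amplified more during reconstruction.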


Summary

• By careful mathematical design, it is possible to simultaneously achieve user data privacy, accurate statistical models, and good runtime efficiency.


Further Details

• “Maintaining Data Privacy in Association Rule Mining” [VLDB 2002]

• “On Addressing Efficiency Concerns in Privacy-Preserving Mining” [DASFAA 2004]

• “A Framework for High-Accuracy Privacy-Preserving Mining” [ICDE 2005]

• All available at http://dsl.serc.iisc.ernet.in


Questions?

