Database Research: Data Mining & Other Areas
Dr. Aparna Varde
Ph.D., Computer Science, WPI, MA
Assistant Professor, Computer Science, VSU, VA
Presentation at Montclair State University, NJ
May 2, 2008
Agenda
Database Systems
- Introduction to Databases and Research Areas
Data Mining
- Research Problem in Graphical Data Mining
Other Areas
- Data Warehousing
- Web Databases
Data in Various Forms
Human Mind (Too much data)
Documents (Processed)
Raw Data (Handwritten)
Flat Files (Unprocessed)
Images (Complex)
Simple Tables (Organized)
Need for Databases
Integration of data
Efficient storage
Fast retrieval
Ease of modification
Security of information
Recovery from failures
Database System Environment
Database
DBMS (Database Management System)
Application Programs/Queries
Users
Database System
Roles in the Database World
Database Administrator
Database Application Programmer
Database User
Database Researcher
Examples of Database Research Areas
Query Processing and Optimization
Privacy and Security
Storage and Indexing
Data Mining
Data Warehousing
Web Databases
Data Mining
Discovering knowledge from data
- Non-trivial process of finding novel and interesting patterns in large datasets to guide future decisions
Types of Data
- Numbers
- Graphs
- Images
- Text
Data Mining Techniques
Association Rule Mining
- Discovering relationships of the type A => B
Clustering
- Grouping objects based on similarity
Classification
- Predicting the class of a target
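As a small illustration of the first technique, a rule A => B is commonly scored by its support and confidence. The sketch below is not from the talk: the transactions and the function names `support` and `confidence` are assumptions chosen only for the example.

```python
from typing import FrozenSet, List

Transaction = FrozenSet[str]


def support(transactions: List[Transaction], itemset: Transaction) -> float:
    """Fraction of transactions that contain every item in itemset."""
    matches = sum(1 for t in transactions if itemset <= t)
    return matches / len(transactions)


def confidence(transactions: List[Transaction],
               antecedent: Transaction,
               consequent: Transaction) -> float:
    """Estimate of P(consequent | antecedent) for the rule antecedent => consequent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the rule never fires, so report zero confidence
    return support(transactions, antecedent | consequent) / base


# Hypothetical market-basket data, invented for illustration.
transactions = [frozenset(t) for t in (
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
)]

a, b = frozenset({"bread"}), frozenset({"milk"})
rule_support = support(transactions, a | b)        # bread and milk together
rule_confidence = confidence(transactions, a, b)   # milk given bread
```

A rule is usually kept only if both scores exceed user-chosen thresholds; the thresholds themselves are application-dependent.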
Graphical Data Mining Problem
Experimental results in scientific domains plotted as graphs
Users pose queries for predictive analysis:
- Given input conditions, predict most likely graph
- Given desired graph, predict most likely conditions
Need for mining graphical data to discover knowledge
Proposed Approach: AutoDomainMine
AutoDomainMine: Prediction of Graph
AutoDomainMine: Prediction of Conditions
Main Tasks
Task 1: AutoDomainMine Learning Strategy of Integrating Clustering and Classification [AAAI-06 Poster, ACM SIGART’s ICICIS-05]
Task 2: Learning Domain-Specific Distance Metrics for Graphs [ACM KDD’s MDM-05, MTAP-06 Journal]
Task 3: Designing Semantics-Preserving Representatives for Clusters [ACM SIGMOD’s IQIS-06, ACM CIKM-06]
Learning Distance Metrics for Graphs
Various distance metrics
• Absolute position of points
• Statistical observations
• Critical features
Issues
• Not known what metrics apply
• Multiple metrics may be relevant
Need for distance metric learning in graphs
Example of domain-specific problem
Proposed Distance Metric Learning Approach: LearnMet
Given
• Training set with actual clusters of graphs
Additional Input
• Components: distance metrics applicable to graphs
LearnMet Metric
• $D = \sum_i w_i D_i$
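The weighted-sum form of the metric can be sketched as follows. This is only an illustration under stated assumptions: a graph is represented as a list of y-values sampled at common x positions, and the three component functions are hypothetical stand-ins for the position-based, statistical, and critical-feature distances named earlier, not the components used in LearnMet.

```python
from typing import Callable, List, Sequence

Graph = Sequence[float]  # assumption: y-values sampled at shared x positions
Component = Callable[[Graph, Graph], float]


def position_distance(a: Graph, b: Graph) -> float:
    """Mean absolute difference between corresponding points."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def mean_distance(a: Graph, b: Graph) -> float:
    """Difference of a simple statistical observation (the mean)."""
    return abs(sum(a) / len(a) - sum(b) / len(b))


def peak_distance(a: Graph, b: Graph) -> float:
    """Difference of one critical feature (the maximum value)."""
    return abs(max(a) - max(b))


def combined_distance(a: Graph, b: Graph,
                      components: List[Component],
                      weights: List[float]) -> float:
    """D = sum_i w_i * D_i(a, b)."""
    return sum(w * d(a, b) for w, d in zip(weights, components))


components = [position_distance, mean_distance, peak_distance]
weights = [0.5, 0.25, 0.25]  # hypothetical weights; learning them is the point of LearnMet

g1 = [1.0, 2.0, 3.0]
g2 = [2.0, 2.0, 5.0]
d = combined_distance(g1, g2, components, weights)
```

Keeping the components as plain functions means a learning loop only has to change the `weights` list between epochs.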
Evaluate Accuracy
Use pairs of graphs
A pair $(g_a, g_b)$ is
TP - same predicted, same actual cluster: $(g_1, g_2)$
TN - different predicted, different actual clusters: $(g_2, g_3)$
FP - same predicted cluster, different actual clusters: $(g_3, g_4)$
FN - different predicted, same actual clusters: $(g_4, g_5)$
Evaluate Accuracy (Contd.)
How do we compute error for whole set of graphs?
• For all pairs
Error Measure
• Failure Rate FR
• $FR = (FP + FN) / (TP + TN + FP + FN)$
Error Threshold (t)
• Extent of FR allowed
• If (FR < t) then clustering is accurate
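The pairwise evaluation and the failure rate can be sketched as below. The cluster assignments are invented so that the pairs (g1, g2), (g2, g3), (g3, g4) and (g4, g5) fall into the TP, TN, FP and FN cases respectively, as in the slide example; the threshold value is likewise an assumption.

```python
from itertools import combinations
from typing import Dict


def pair_counts(predicted: Dict[str, int], actual: Dict[str, int]) -> Dict[str, int]:
    """Classify every pair of graphs as TP, TN, FP or FN."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, b in combinations(sorted(actual), 2):
        same_predicted = predicted[a] == predicted[b]
        same_actual = actual[a] == actual[b]
        if same_predicted and same_actual:
            counts["TP"] += 1
        elif not same_predicted and not same_actual:
            counts["TN"] += 1
        elif same_predicted:
            counts["FP"] += 1  # grouped together, but actually in different clusters
        else:
            counts["FN"] += 1  # split apart, but actually in the same cluster
    return counts


def failure_rate(counts: Dict[str, int]) -> float:
    """FR = (FP + FN) / (TP + TN + FP + FN)."""
    return (counts["FP"] + counts["FN"]) / sum(counts.values())


# Hypothetical cluster labels for five graphs.
actual = {"g1": 0, "g2": 0, "g3": 1, "g4": 2, "g5": 2}
predicted = {"g1": 0, "g2": 0, "g3": 1, "g4": 1, "g5": 2}

counts = pair_counts(predicted, actual)
fr = failure_rate(counts)
t = 0.25                      # hypothetical error threshold
clustering_is_accurate = fr < t
```

With five graphs there are ten pairs, of which one is FP and one is FN, so the failure rate here is 0.2.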
Adjust the Metric
Weight Adjustment Heuristic: for each $D_i$
• New $w_i = w_i - sf_i \left( DFN_i / DFN + DFP_i / DFP \right)$ [KDD’s MDM-05]
Testing of LearnMet
Details: MTAP-06
Effect of pairs per epoch (ppe)
• G = number of graphs, e.g., G = 25
• $^{G}C_2$ = total number of pairs, e.g., 300
• Select subset of $^{G}C_2$ pairs per epoch
Observations
• Highest accuracy with middle range of ppe
• Learning efficiency best with low ppe
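The pair count in the example can be checked directly: 25 graphs give 25 × 24 / 2 = 300 pairs. The sketch below also draws a random subset of pairs for one epoch; the ppe value of 50, the fixed seed, and the use of uniform random sampling are assumptions, since the slide does not say how the subset is selected.

```python
import math
import random
from itertools import combinations
from typing import List, Tuple

G = 25                                   # number of graphs, as in the slide example
graph_ids = list(range(G))
all_pairs: List[Tuple[int, int]] = list(combinations(graph_ids, 2))
total_pairs = math.comb(G, 2)            # G choose 2


def sample_epoch_pairs(pairs: List[Tuple[int, int]], ppe: int,
                       rng: random.Random) -> List[Tuple[int, int]]:
    """Pick ppe distinct pairs for one epoch (uniform sampling is an assumption)."""
    return rng.sample(pairs, ppe)


rng = random.Random(0)                   # fixed seed so the run is repeatable
epoch_pairs = sample_epoch_pairs(all_pairs, 50, rng)
```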
Accuracy of Learned Metrics over Test Set
Learning Efficiency over Training Set
User Surveys of the AutoDomainMine System
Formal user surveys in different applications
Evaluation Process
• Compare estimation with real data in test set
• If they match, estimation is accurate
Observations
• Estimation accuracy around 90 to 95%
Accuracy: Estimating Graphs
Accuracy: Estimating Conditions
Related Work
Similarity Search [HK-01, WF-00]
• Non-matching conditions could be significant
Mathematical Modeling [M-95, S-60]
• Existing models not applicable under certain situations
Case-based Reasoning [K-93, AP-03]
• Adaptation of cases not feasible with graphs
Learning nearest neighbor in high-dimensional spaces [HAK-00]
• Focus is dimensionality reduction, do not deal with graphs
Distance metric learning given basic formula [XNJR-03]
• Deal with position-based distances for points, no graphs involved
Similarity search in multimedia databases [KB-04]
• Use various metrics in different applications, do not learn a single metric
Image Rating [HH-01]
• User intervention involved in manual rating
Semantic Fish Eye Views [JP-04]
• Display multiple objects in small space, no representatives
PDA Displays in Levels of Detail [BGMP-01]
• Do not evaluate different types of representatives
Data Warehousing
Data Warehouse
- Subject-oriented, integrated repository of relevant data from various information sources
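[Figure: a data warehouse (DW) view built through a mediator over information sources IS1, IS2 and IS3 and their relations (R11, R12, R21, R22, R23, R31)]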
Research Problem in Data Warehousing
View Maintenance (VM)
- Keeping warehouse view consistent with respect to change in sources
Incremental VM
- Update warehouse as the source data changes
- Propagate only the updates, not all data
Concurrency Conflicts
- Two or more sources / relations try to send updates at the same time
Problem
- Solve concurrency conflicts in view maintenance in multi-source multi-relation environments
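To make the idea of incremental view maintenance concrete, the sketch below keeps a join view up to date by propagating only the delta produced by one inserted tuple, then compares the result with a full re-computation. It is a simplified illustration, not the algorithm from the talk: the relations, the join on the first field, and the function names are assumptions, and it deliberately ignores the concurrency conflicts described above by handling one update at a time.

```python
from typing import List, Tuple

Row = Tuple[int, str]
ViewRow = Tuple[int, str, str]


def join(r1: List[Row], r2: List[Row]) -> List[ViewRow]:
    """Equi-join of two relations on their first field."""
    return [(k1, a, b) for (k1, a) in r1 for (k2, b) in r2 if k1 == k2]


def apply_insert_r1(view: List[ViewRow], r2: List[Row], new_tuple: Row) -> List[ViewRow]:
    """Incremental step: join only the new tuple with r2 and append the delta."""
    delta = join([new_tuple], r2)
    view.extend(delta)
    return delta


# Hypothetical source relations and the warehouse view defined over them.
r1: List[Row] = [(1, "a")]
r2: List[Row] = [(1, "x"), (2, "y")]
view = join(r1, r2)

# A source reports an insert; only the delta is propagated to the warehouse.
new_tuple: Row = (2, "b")
delta = apply_insert_r1(view, r2, new_tuple)
r1.append(new_tuple)

# Full re-computation, shown only to check the incremental result.
recomputed = join(r1, r2)
```

If a second source changed `r2` while the delta query was being answered, the delta could be wrong; handling that case is the problem the talk addresses.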
Proposed Solution: MEDWRAP (MEDiator WRAPper compensation)
[Figure: a data warehouse view V maintained by a mediator (multi-source VM algorithm) over wrappers (single-source VM algorithm), one per information source IS1, IS2 and IS3, with relations R11, R21, R22, R23, R31 and R32]
Advantages of MEDWRAP
Generic for any compensation-based algorithm
Allows sources to be semi-autonomous
- Sources do not participate in maintenance beyond processing queries and reporting updates
- No locking needed
Low Storage Cost
- Additional views not stored at wrappers
- Copies of source relations not stored at warehouse
Efficient Processing Time
- No need to re-compute whole view
Details in DEXA-2002 paper
Related Work
RV: Re-computation of View (Traditional)
- Rewrite all tuples, not only affected ones
- Highly inefficient if done for every update
SM: Self Maintenance [Q-96, G-96]
- DW stores copies of source relations for maintenance
- Huge storage at warehouse
Version Control [K-99, C-00]
- Versions of transactions / tuples stored at wrappers
- Latest version used to answer queries
- Huge storage at source wrappers
Web Databases
Management of Data on the Web
XML, the eXtensible Markup Language
- Widespread standard in storing and publishing data
Domain-specific markup languages designed with XML tag sets
Standardization bodies extend these to include additional semantics
Aspects such as domain knowledge and XML constraints are important
Domain-specific Markup Language
Medium of communication for potential users of the domain
Follows XML syntax
Encompasses the semantics of the domain
Examples
- MML: Medical Markup Language
- ChemML: Chemical Markup Language
[Figure: the markup language as a shared medium among industries, consumers, universities, research organizations and publishers]
Markup Language Development Steps
1. Acquisition of Domain Knowledge
- Familiarity with related markups
2. Data Modeling
- E.g., Entity Relationship models
3. Requirements Specification
- E.g., Interviews with Domain Experts
4. Ontology Creation
- Analogous to pilot version of software
5. Revision of Ontology
- Alpha version
6. Schema Definition
- Beta version
7. Reiteration of Schema until Standardization
- Release Version
Snapshot of Final Schema with data storage
Desired Features of Markup Languages
Avoidance of Redundancy
- No duplicate information
Non-Ambiguous Presentation of Data
- Issues such as synonymy & polysemy
Easy Interpretability of Data
- E.g. in scientific domains, store experimental input conditions before results
Incorporation of Domain-Specific Requirements
- E.g. conflicts such as: in financial domains, a person can be either insolvent or asset-holder but not both
Extensibility of the Markup
- Users should be able to capture additional semantics
Application of XML Constraints
Sequence Constraint
- To control the order of tags
Choice Constraint
- To use either one tag or the other
Key Constraint
- To identify an attribute as a unique primary key
Occurrence Constraint
- To declare minimum and maximum occurrences
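In practice these constraints are declared in a schema language and enforced by a validator. Purely to show what each one checks, the sketch below hand-codes the four constraints for a hypothetical markup in which each `experiment` has a unique `id` (key), exactly one `conditions` element and at most two `note` elements (occurrence), either a `graph` or a `table` (choice), in that order (sequence). The tag names and rules are invented for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical tag order: conditions, then graph or table, then notes.
RANK = {"conditions": 0, "graph": 1, "table": 1, "note": 2}


def check_constraints(root: ET.Element) -> List[str]:
    """Return a list of constraint violations found under root."""
    errors = []
    seen_ids = set()
    for exp in root.findall("experiment"):
        exp_id = exp.get("id")
        tags = [child.tag for child in exp]
        # Key constraint: the id attribute must be present and unique.
        if exp_id is None or exp_id in seen_ids:
            errors.append(f"key: missing or duplicate id {exp_id!r}")
        seen_ids.add(exp_id)
        # Occurrence constraint: minimum and maximum counts per tag.
        if tags.count("conditions") != 1:
            errors.append(f"occurrence: {exp_id!r} needs exactly one conditions tag")
        if tags.count("note") > 2:
            errors.append(f"occurrence: {exp_id!r} allows at most two note tags")
        # Choice constraint: one tag or the other, not both and not neither.
        if tags.count("graph") + tags.count("table") != 1:
            errors.append(f"choice: {exp_id!r} needs either graph or table")
        # Sequence constraint: tags must appear in the declared order.
        ranks = [RANK.get(tag, len(RANK)) for tag in tags]
        if ranks != sorted(ranks):
            errors.append(f"sequence: {exp_id!r} has tags out of order")
    return errors


valid = ET.fromstring(
    '<experiments>'
    '<experiment id="e1"><conditions/><graph/><note/></experiment>'
    '<experiment id="e2"><conditions/><table/></experiment>'
    '</experiments>')
invalid = ET.fromstring(
    '<experiments>'
    '<experiment id="e1"><graph/><conditions/></experiment>'
    '<experiment id="e1"><conditions/><graph/><table/></experiment>'
    '</experiments>')

valid_errors = check_constraints(valid)
invalid_errors = check_constraints(invalid)
```

The second document breaks the sequence constraint in its first experiment and the key and choice constraints in its second.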
Convenient Access to Information
Data stored using XML-based markup languages can be easily accessed using languages such as
- XQuery: XML Query Language
- XSLT: Extensible Stylesheet Language Transformations
- XPath: XML Path Language
Details on markup language development
- Chapter on “XML Based Markup Languages for Specific Domains” by Varde et al. in book “XML Based Support Systems”, Springer 2008
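A brief sketch of path-based access follows, using Python's standard `xml.etree.ElementTree` module, which supports a limited subset of XPath (XQuery and XSLT are not part of the standard library). The document, its tag names and its values are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical domain-specific markup: experimental conditions stored before results.
doc = """
<experiments>
  <experiment id="e1">
    <conditions><temperature unit="C">850</temperature></conditions>
    <result>curve-A</result>
  </experiment>
  <experiment id="e2">
    <conditions><temperature unit="C">900</temperature></conditions>
    <result>curve-B</result>
  </experiment>
</experiments>
"""

root = ET.fromstring(doc)

# Select the result of one experiment by its key attribute.
result_e2 = root.find("./experiment[@id='e2']/result").text

# Collect every temperature value anywhere in the document.
temperatures = [int(t.text) for t in root.findall(".//temperature")]

# List the keys of all experiments.
experiment_ids = [e.get("id") for e in root.findall("experiment")]
```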
Related Work
Semantic Extensions of XML for Advanced Applications [YKB-2001]
Versions and Standards of HTML [B-95]
The Latest MML (Medical Markup Language) Version 2.3 - XML based Standard for Medical Data Exchange/Storage [GATSSTSNY-2003]
XQuery 1.0: An XML Query Language [BFFRS-2003]
Handbook of Modern Finance [SL-2004]
Propagating XML Constraints to Relations [DFHQ-2003]
Conclusions and Ongoing Work
Data Mining
- Graphical Data Mining Area, AutoDomainMine approach
- Ongoing Work
• Feature Selection in Image Mining (with colleagues in VSU and WPI: NSF Grants involved)
• Mining Genomic and Proteomic Data (with ISB: Institute for Systems Biology)
Data Warehousing
- View Maintenance Area, MEDWRAP approach
- Ongoing Work
• Data Warehouse Maintenance in real-time environments (with researchers at Microsoft Search Labs)
Web Databases
- Book Chapter on XML Based Markup Languages for Specific Domains
- Ongoing Work
• Development of Domain-specific markups (with NIST: National Institute of Standards and Technology)