Dejing Dou
Computer and Information Science, University of Oregon, Eugene, Oregon
September 2010 @ Kent State University
Outline
• Introduction
  • Ontology and the Semantic Web
  • Biomedical Ontology Development
  • Challenges for Data-driven Approaches
• The NEMO Project
  • Mining ERP Ontologies (KDD'07)
  • Modeling NEMO Ontology Databases (SSDBM'08, JIIS'10)
  • Mapping ERP Metrics (PAKDD'10)
  • Ongoing Work
What is an Ontology?
A formal specification of a vocabulary of domain concepts and the relationships among them.
A Genealogy Ontology
[Diagram: classes Individual, Male, Female, Gender, FamilyEvent, MarriageEvent, DivorceEvent, DeathEvent, and BirthEvent, linked by properties such as sex, husband, wife, childIn, marriage, divorce, and birth]
Classes: Individual, Male, Female, Family, MarriageEvent, …
Properties: sex, husband, wife, birth, …
Axioms: If there is a MarriageEvent, there will be a Family related to the husband and wife properties.
Ontology languages: OWL, KIF, OBO …
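As a small illustration, the marriage axiom above can be read as a forward-chaining rule over assertions. The following is a minimal sketch in plain Python; the triple encoding, the generated family ID scheme, and the rule function are illustrative assumptions, not the OWL/KIF encoding:

```python
# Sketch: the genealogy axiom "if there is a MarriageEvent, there will be a
# Family related to the husband and wife" as a forward-chaining rule over
# (subject, predicate, object) triples. Encoding and IDs are assumptions.
def apply_marriage_axiom(facts):
    inferred = set(facts)
    for (s, p, o) in facts:
        if p == "type" and o == "MarriageEvent":
            husband = next(o2 for (s2, p2, o2) in facts
                           if s2 == s and p2 == "husband")
            wife = next(o2 for (s2, p2, o2) in facts
                        if s2 == s and p2 == "wife")
            family = "family_of_" + s  # hypothetical ID scheme
            inferred |= {(family, "type", "Family"),
                         (family, "husband", husband),
                         (family, "wife", wife)}
    return inferred

facts = {("m1", "type", "MarriageEvent"),
         ("m1", "husband", "John"),
         ("m1", "wife", "Mary")}
kb = apply_marriage_axiom(facts)
```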
Current WWW
The majority of data resources on the WWW are in human-readable format only (e.g., HTML).
The Semantic Web
One major goal of the Semantic Web is that web-based agents can process and "understand" data [Berners-Lee et al. 2001].
Ontologies formally describe the semantics of data, and web-based agents can take web documents (e.g., in RDF, OWL) as a set of assertions and draw inferences from them.
Biomedical Ontologies
The Gene Ontology (GO): standardizes the formal representation of gene and gene product attributes across all species and gene databases (e.g., zebrafish, mouse, fruit fly).
  Classes: cellular component, molecular function, biological process, …
  Properties: is_a, part_of
The Unified Medical Language System (UMLS): a comprehensive thesaurus and ontology of biomedical concepts.
The National Center for Biomedical Ontology (NCBO) at Stanford University: >200 ontologies (hundreds to thousands of concepts each), 4 million mappings.
Biomedical Ontology Development
Typically knowledge-driven: a top-down process.
Some basic steps and principles:
• Discussions among domain experts and ontology engineers
• Select basic (root) classes and properties (i.e., terms)
• Recurse to deeper levels for sub-concepts and relationships; modularization may be considered if the ontology is expected to be large
• Add constraints (axioms)
• Add unique IDs (e.g., URLs) and textual definitions for terms
• Consistency checking
• Updating and evolution (e.g., GO is updated every 15 minutes)
Challenges: Knowledge Sharing Does Not Automatically Help Data Sharing
Annotation (like tags) helps search in text (e.g., papers), but is not well suited to experimental data (e.g., numerical values).
Three main challenges for knowledge/data sharing:
• Heterogeneity: different labs use different analysis methods, spreadsheet attributes, and DB schemas.
• Reusability: knowledge mined from different experimental data may not be consistent and sharable.
• Scalability: experimental data grow much larger than the ontologies themselves; ontology-based reasoning (e.g., ABox reasoning) over large data is a headache.
Case Study: EEG Data
Electroencephalogram (EEG) data
Observing Brain Functions through EEG
• Brain activity occurs in the cortex, and cortical activity generates the scalp EEG.
• EEG data (dense-array, 256 channels) have high temporal resolution (1 ms) but poor, 2D spatial resolution; MR imaging (fMRI, PET) has good, 3D spatial resolution but poor temporal resolution (~1.0 sec).
ERP Data and Pattern Analysis
Event-related potentials (ERPs) are created by averaging across segments of EEG data from different trials, time-locked (e.g., every 2 seconds) to stimulus events or responses.
Some existing tools (e.g., Net Station, EEGLAB, APECS, the Dien PCA Toolbox) can process ERP data and perform pattern analysis.
(A) 128-channel ERPs to visual word and nonword stimuli. (B) Time course for P100 pattern by PCA. (C) Scalp topography (spatial distribution) of P100 pattern.
NEMO: NeuroElectroMagnetic Ontologies
Some challenges in ERP study:
• Patterns can be difficult to identify, and definitions vary across research labs.
• Methods for ERP analysis differ across research sites.
• It is hard to compare and share results across experiments and across labs.
The NEMO (NeuroElectroMagnetic Ontologies) project addresses these challenges by developing ontologies to support ERP data and pattern representation, sharing, and meta-analysis. It has been funded by the NIH as an R01 project since 2009.
Architecture
Progress in Data-Driven Approaches
• Mining ERP Ontologies (KDD'07) -- Reusability
• Modeling NEMO Ontology Databases (SSDBM'08, JIIS'10) -- Scalability
• Mapping ERP Metrics (PAKDD'10) -- Heterogeneity
Ontology Mining
Ontology mining is a process for learning an ontology, including classes, class taxonomy, properties, and axioms, from data.
Existing ontology mining approaches focus on text mining or web mining (web content, usage, structure, user profiles). Clustering and association rule mining have been used for learning classes and properties [Li & Zhong @ TKDE 18(4), Maedche & Staab @ EKAW'00, Reinberger et al. @ ODBASE'03]. The NetAffx Gene Ontology Mining Tool has been applied to microarray data [Cheng et al. @ Bioinformatics 20(9)].
Our approach is novel: it uses hierarchical clustering and classification to mine the class taxonomy, properties, and axioms of a first-generation ERP data-specific ontology from spreadsheets.
Knowledge Reuse in KDD
[Diagram: the classic KDD pipeline (Databases, Data Cleaning, Data Integration, Data Warehouse, Selection of Task-relevant Data, Data Mining, Pattern Evaluation), annotated with a question: lack of formal semantics?]
Our Framework (KDD’07)
A semi-automatic framework for mining ontologies
Four General Procedures
Classes <= Clustering-based Classification
Class Taxonomy <= Hierarchical Clustering
Properties <= Classification
Axioms <= Association Rule Mining and Classification
Experiments on ERP DataPreprocessing Data with Temporal PCA
Mining ERP Classes with Clustering-based Classification
Mining ERP Class Taxonomy with Hierarchical Clustering
Mining Properties and Axioms (Rules) with Classification
Discovering Axioms among Properties with Association Rule Mining
Input: Raw ERP Data
Subject Condition Channel# Time1(µv) Time2(µv) Time3(µv) Time4(µv) Time5(µv) Time6(µv)
S01 A 1 0.077 0.136 0.075 0.095 0.188 0.097
S01 A 2 0.891 1.780 0.895 0.805 1.612 0.813
S01 A 3 0.014 0.018 0.013 0.040 0.066 0.035
S01 A 4 0.657 1.309 0.657 0.789 1.571 0.785
S01 A 5 0.437 0.864 0.432 1.007 2.002 1.003
S01 B 1 0.303 0.603 0.303 0.128 0.250 0.123
S01 B 2 0.477 0.951 0.483 0.418 0.841 0.418
S01 B 3 0.538 0.073 0.038 0.029 0.043 0.022
S01 B 4 0.509 1.061 0.533 0.628 1.254 0.626
S01 B 5 1.497 1.024 0.510 0.218 0.434 0.219
S02 A 1 1.275 2.987 1.500 0.382 0.769 0.386
S02 A 2 0.666 2.555 1.281 0.326 0.648 0.329
S02 A 3 0.673 1.321 0.666 1.026 2.051 1.029
S02 A 4 0.284 1.341 0.678 1.966 3.914 1.966
S02 A 5 0.980 0.564 0.292 0.511 1.012 0.507
S02 B 1 0.367 1.960 0.978 1.741 3.486 1.739
S02 B 2 0.864 0.721 0.365 1.470 2.934 1.472
S02 B 3 0.568 1.729 0.866 1.342 2.680 1.337
S02 B 4 0.149 1.134 0.575 0.210 0.423 0.215
S02 B 5 0.042 0.287 0.151 0.433 0.860 0.433
Sampling rate: 250 Hz for 1500 ms (375 samples)
Experiments 1-2: 89 subjects and 6 experiment conditions
Experiment 3: 36 subjects and 4 experiment conditions
Data Preprocessing (1)
Temporal PCA Decomposition
[Diagram: component 1 + component 2 = complex waveform]
PCA extracts as many factors (components) as there are variables (i.e., the number of samples). We retain the first 15 PCA factors, accounting for most of the variance (> 75%). The remaining factors are assumed to contain "noise".
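The retention rule above can be sketched as follows: keep leading factors until their cumulative explained variance passes the threshold. This is a minimal stdlib sketch; the eigenvalues are made-up example values, not ERP data:

```python
# Sketch of the factor-retention rule: keep the leading PCA factors until
# cumulative explained variance exceeds a threshold (> 75% in the pipeline).
def factors_to_retain(eigenvalues, threshold=0.75):
    total = sum(eigenvalues)
    cumulative = 0.0
    for k, ev in enumerate(eigenvalues, start=1):
        cumulative += ev / total  # proportion of variance explained so far
        if cumulative >= threshold:
            return k
    return len(eigenvalues)

# eigenvalues sorted in decreasing order, as PCA returns them (illustrative)
eigenvalues = [40.0, 20.0, 10.0, 5.0, 3.0, 2.0] + [1.0] * 20
k = factors_to_retain(eigenvalues)
```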
Data Preprocessing (2)
Intensity, spatial, temporal, and functional metrics (attributes) are computed for each factor.
ERP Factors after PCA Decomposition

TI-max (µs) | IN-mean (ROI) (µv) | IN-mean (ROCC) (µv) | … | SP-min (channel#)
128 | 4.2823 | 4.7245 | … | 24
96 | 1.2223 | 1.3955 | … | 62
164 | -6.6589 | -4.7608 | … | 59
220 | -3.635 | -2.0782 | … | 58
244 | -0.81322 | 0.29263 | … | 65

For Experiment 1 data, number of factors = (474) (594)
For Experiment 2 data, number of factors = (588) (598)
For Experiment 3 data, number of factors = 708
Mining ERP Classes with Clustering (1)
We use EM (Expectation-Maximization) clustering. E.g., for Experiment 1, group 2 data:

Pattern | Cluster 0 | Cluster 1 | Cluster 2 | Cluster 3
P100 | 0 | 76 | 0 | 2
N100 | 117 | 1 | 0 | 54
lateN1/N2 | 13 | 14 | 0 | 104
P300 | 0 | 61 | 110 | 42
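The EM step can be sketched compactly. Below is a minimal one-dimensional, two-component Gaussian mixture fitted by EM in plain Python; the data values, initialization scheme, and cluster count are illustrative assumptions, not the multi-metric ERP setup:

```python
# Minimal sketch of EM clustering: a 1-D Gaussian mixture fitted by
# alternating E-steps (responsibilities) and M-steps (parameter updates).
import math

def em_gmm_1d(xs, k=2, iters=50):
    lo, hi = min(xs), max(xs)
    # deterministic initial means spread over the data range (an assumption)
    mu = [lo + (j + 0.5) * (hi - lo) / k for j in range(k)]
    var = [1.0] * k
    pi = [1.0 / k] * k
    resp = []
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in xs:
            ps = [pi[j] / math.sqrt(2 * math.pi * var[j]) *
                  math.exp(-(x - mu[j]) ** 2 / (2 * var[j])) for j in range(k)]
            s = sum(ps)
            resp.append([p / s for p in ps])
        # M-step: re-estimate mixture weights, means, and variances
        for j in range(k):
            nj = sum(r[j] for r in resp)
            pi[j] = nj / len(xs)
            mu[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var[j] = max(sum(r[j] * (x - mu[j]) ** 2
                             for r, x in zip(resp, xs)) / nj, 1e-6)
    # hard cluster assignment: argmax responsibility
    return [max(range(k), key=lambda j: r[j]) for r in resp]

# two well-separated groups of illustrative metric values
data = [0.9, 1.1, 1.0, 5.0, 5.2, 4.8]
labels = em_gmm_1d(data, k=2)
```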
Mining ERP Classes with Clustering (2)
We use OWL to represent ERP classes.
Mining ERP Class Taxonomy with Hierarchical Clustering
We use EM clustering in both divisive and agglomerative ways. E.g., for Experiment 3 data.
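The agglomerative direction can be sketched in a few lines: start from singleton clusters and repeatedly merge the closest pair, recording the merge order as a dendrogram. This is a single-linkage stand-in over 1-D values for illustration, not the EM-based procedure itself:

```python
# Illustrative sketch of agglomerative (bottom-up) hierarchical clustering.
# Single-linkage over 1-D values; the real pipeline clusters ERP factor metrics.
def agglomerate(points):
    clusters = [(p,) for p in points]  # start with singleton clusters
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single-linkage distance between two clusters
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters)
                    if idx not in (i, j)] + [merged]
    return merges  # the merge order encodes the taxonomy (dendrogram)

# two "latency" groups (illustrative values, e.g., in ms)
history = agglomerate([100.0, 110.0, 300.0, 310.0])
```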
Mining ERP Class Taxonomy with Hierarchical Clustering
We use OWL to represent the class taxonomy.
Mining Properties and Axioms with Clustering-based Classification (1)
We use decision tree learning (C4.5) to perform classification, with training data labeled by the clustering results.
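At the heart of this step is the attribute-selection criterion of decision-tree learners like C4.5: information gain computed from entropy. A minimal sketch follows; the attribute names, toy values, and split threshold are illustrative assumptions:

```python
# Sketch of decision-tree attribute selection: information gain of a
# candidate split threshold, computed from class-label entropy.
import math
from collections import Counter

def entropy(labels):
    if not labels:
        return 0.0
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr_index, threshold):
    left = [l for r, l in zip(rows, labels) if r[attr_index] <= threshold]
    right = [l for r, l in zip(rows, labels) if r[attr_index] > threshold]
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - remainder

# toy factors (TI-max, IN-mean) with cluster labels from the EM step
rows = [(128, 4.3), (96, 1.2), (164, -6.7), (220, -3.6)]
labels = ["P100", "P100", "N100", "N100"]
gain = info_gain(rows, labels, attr_index=0, threshold=150)
```

Here the split on the first attribute separates the labels perfectly, so the gain equals the full label entropy.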
Mining Properties and Axioms with Clustering-based Classification (2)
We use OWL to represent datatype properties, which are based on the attributes with high information gain (e.g., the top 6).
Mining Properties and Axioms with Clustering-based Classification (3)
We use SWRL to represent axioms. In FOL:
Discovering Axioms among Properties with Association Rule Mining
We use the Apriori algorithm to find association rules among properties. The split points are determined by classification rules. In FOL, they look like:
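The level-wise Apriori search can be sketched as follows. The "transactions" here are illustrative boolean items standing in for discretized ERP metrics (the item names and split points are assumptions), and the subset-pruning optimization is omitted for brevity:

```python
# Minimal Apriori sketch: level-wise search for frequent itemsets over
# boolean "property holds for this factor" transactions.
def apriori(transactions, min_support):
    n = len(transactions)
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n
    singletons = {frozenset([item]) for t in transactions for item in t}
    frequent = []
    level = {s for s in singletons if support(s) >= min_support}
    while level:
        frequent.extend(level)
        size = len(next(iter(level))) + 1
        # candidate generation: unions of frequent k-itemsets of size k+1
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        level = {c for c in candidates if support(c) >= min_support}
    return {tuple(sorted(f)): support(f) for f in frequent}

# illustrative discretized-metric transactions, one per factor
transactions = [frozenset(t) for t in
                [{"TI-max<=150", "IN-mean>0"},
                 {"TI-max<=150", "IN-mean>0"},
                 {"TI-max>150"},
                 {"TI-max<=150", "IN-mean>0"}]]
freq = apriori(transactions, min_support=0.5)
```

Frequent itemsets like {TI-max<=150, IN-mean>0} are then turned into candidate association rules between properties.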
Rule Optimization
Idea: (A → B) and (A ∧ B → C) => (A → C)
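The chaining idea above can be sketched as a small fixpoint computation over rules. The rule encoding (antecedent set, single consequent) is an illustrative assumption:

```python
# Sketch of the rule-optimization idea: given A -> B and (A and B) -> C,
# derive the simpler rule A -> C. A rule is (frozenset_of_antecedents, consequent).
def optimize(rules):
    derived = set(rules)
    changed = True
    while changed:
        changed = False
        for ante1, cons1 in list(derived):
            for ante2, cons2 in list(derived):
                # if rule 2's antecedent equals rule 1's antecedent plus
                # rule 1's consequent, the two rules chain: ante1 -> cons2
                if ante2 == ante1 | {cons1}:
                    new_rule = (ante1, cons2)
                    if new_rule not in derived:
                        derived.add(new_rule)
                        changed = True
    return derived

rules = {(frozenset({"A"}), "B"), (frozenset({"A", "B"}), "C")}
optimized = optimize(rules)
```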
A Partial View of the Mined ERP Data Ontology
• Our first-generation ERP ontology consists of 16 classes, 57 properties and 23 axioms.
Ontology-based Data Modeling (SSDBM'08, JIIS'10)
In general, ontologies can be treated as a kind of conceptual model. Since the data (e.g., PCA factors) can be large, instead of building a knowledge base to store them, we propose to use relational databases.
We designed database schemas based on our ERP ontologies, which include temporal, spatial, and functional concepts.
Ontology Databases
[Diagram: ontology constructs (axioms, classes, datatypes, objects, facts) aligned with their database counterparts (constraints, triggers, views, relations, datatypes, keys, tuples)]
Now we have bridged these.
Lehigh University Benchmark
[Charts: load time for 1.5 million facts (10 universities, 20 departments) and query performance on a logarithmic time scale]
Ontology-based Data Modeling
For example, for the important subsumption axioms (e.g., subClassOf) of the current ERP ontologies, we use SQL triggers and foreign keys to represent them.
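One way this encoding can look is sketched below with Python's built-in sqlite3: a foreign key states that every member of the subclass table is also in the superclass table, and a trigger maintains that invariant on insert. The table and column names are assumptions for illustration, not the actual NEMO schema:

```python
# Sketch (assumed schema): the axiom "P100 subClassOf ERPPattern" encoded
# with a foreign key plus a trigger that propagates subclass rows upward.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ERPPattern (id TEXT PRIMARY KEY);
CREATE TABLE P100 (
    id TEXT PRIMARY KEY,
    FOREIGN KEY (id) REFERENCES ERPPattern(id)  -- every P100 is an ERPPattern
);
-- trigger maintains the subsumption axiom on every insert
CREATE TRIGGER p100_isa_pattern BEFORE INSERT ON P100
BEGIN
    INSERT OR IGNORE INTO ERPPattern(id) VALUES (NEW.id);
END;
""")
conn.execute("INSERT INTO P100(id) VALUES ('factor_17')")
rows = conn.execute("SELECT id FROM ERPPattern").fetchall()
```

After the insert into P100, the superclass table ERPPattern also contains the new instance, so subsumption queries reduce to plain table scans.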
Ontology-based Data Modeling
The ER Diagram for the ERP ontology database shows tables (boxes) and foreign key constraints (arrows). The concepts pattern, factor, and channel are most densely connected (toward the right-side of the image) as expected.
NEMO Data Mapping (PAKDD'10)
Motivation: lack of meta-analysis across experiments, because different labs may use different metrics.
Goal of the study: mapping alternative sets of ERP spatial and temporal metrics.
Problem definition: alternative sets of ERP metrics.
Challenges:
• Semi-structured data
• Uninformative column headers (string-similarity matching does not work)
• Numerical values
Approach steps: grouping and reordering, sequence post-processing, cross-spatial join.
Cross-spatial Join
• Process all point-sequence curves
• Calculate the Euclidean distance between sequences in the Cartesian product set (cross-spatial join)
[Diagram: each curve in Metric Set 1 is compared with each curve in Metric Set 2]
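The two bullets above can be sketched as follows: score every pair in the Cartesian product of the two metric sets by Euclidean distance, then map each curve to its nearest counterpart. The curve names and values are illustrative assumptions:

```python
# Sketch of the cross-spatial join: Euclidean distance over the Cartesian
# product of two sets of point-sequence curves, then nearest-neighbor mapping.
import math
from itertools import product

def euclidean(seq_a, seq_b):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(seq_a, seq_b)))

def cross_spatial_join(set1, set2):
    # score every (curve1, curve2) pair in the Cartesian product
    scores = {(n1, n2): euclidean(s1, s2)
              for (n1, s1), (n2, s2) in product(set1.items(), set2.items())}
    # map each curve in set1 to its closest curve in set2
    return {n1: min(set2, key=lambda n2: scores[(n1, n2)]) for n1 in set1}

# illustrative point-sequence curves from two alternative metric sets
metric_set1 = {"m1_P100": [1.0, 2.0, 1.0], "m1_N100": [-1.0, -2.0, -1.0]}
metric_set2 = {"m2_P100": [1.1, 2.1, 0.9], "m2_N100": [-0.9, -2.2, -1.0]}
mapping = cross_spatial_join(metric_set1, metric_set2)
```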
Assumptions and Heuristics
The two datasets contain the same or similar ERP patterns if they come from the same paradigms (e.g., a visual or auditory oddball: watching or listening for uncommon or fake words among common words).
[Mapping matrix: gold-standard mappings fall along the diagonal cells; with the wrong mappings shown, precision = 9/13]
Experiment
Design of experiment data:
• 2 simulated "subject groups" (samples): SG1 = sample 1, SG2 = sample 2
• 2 data decompositions: tPCA = temporal PCA decomposition, sICA = spatial ICA (Independent Component Analysis) decomposition
• 2 sets of alternative metrics: m1 = metric set 1, m2 = metric set 2
Experiment Result
Overall Precision: 84.6%
NEMO-Related Ongoing Work
• Application of our framework to other domains: microRNA, medical informatics, gene databases, …
• Mapping discovery and integration across ontologies related to different modalities (e.g., EEG vs. fMRI).
Joint EEG-fMRI Data Mapping
Joint work with:
Gwen Frishkoff, Jiawei Rong, Robert Frank, Paea LePendu, Haishan Liu, Allen Malony, and Don Tucker
Thanks for your attention!
Any questions?