INDUS: A System for Information Integration and Knowledge Acquisition from Autonomous, Distributed and Semantically Heterogeneous Data Sources

Jie Bao, Doina Caragea, Jyotishman Pathak, Adrian Silvescu, Carson Andorf, Changhui Yan, Drena Dobbs and Vasant Honavar

Iowa State University, Department of Computer Science, Artificial Intelligence Research Laboratory

June 28, 2005

Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).



Background and Motivation

Transformation of biology from a data poor science into a data rich science

Proliferation of autonomous, semantically heterogeneous, distributed data sources (more than 500 data repositories of interest to molecular biologists alone)

Needed: Software tools for knowledge acquisition from semantically heterogeneous distributed data sources

[Figure: example data sources: InterPro, MIPS, Swissprot]


Outline

• INDUS Information Integration System
• INDUS Tools: Technical Details and Demo
• Summary and Work in Progress


Ontology-based information integration in INDUS


Semantically Heterogeneous Data Sources

D1:

Protein ID | Protein Name | Protein Sequence | Prosite Motifs | EC Number
P35626 | Beta-adrenergic receptor kinase 2 | MADLEAVLAD VSYLMAMEKS | RGS, PROT_KIN_DOM, PH_DOMAIN | 2.7.1.126 Beta-adrenergic receptor kinase
Q12797 | Aspartyl/asparaginyl beta-hydroxylase | MAQRKNAKSS GNSSSSGSGS | TPR, TPR_REGION, TPR | 1.14.11.16 Peptide-aspartate beta-dioxygenase

D2:

Accession Number AN | Gene | AA Sequence | Length | Pfam Domains | MIPS Funcat
P32589 | SSE1 | STPFGLDLGN NNSVLAVARN | 692 | HSP70 | 16.01 protein binding
P07278 | BCY1 | VSSLPKESQA ELQLFQNEIN | 415 | RIIa | 16.19.01 cyclic nucleotide binding (cAMP, cGMP, etc.)


Capabilities of INDUS

INDUS provides support for:
• Specification and update of schemas and ontologies
• Specification of mappings between ontologies
• Registration of new data sources
• Specification of user views
• Specification and execution of queries across distributed, semantically heterogeneous data sources


INDUS Tools

• Ontology Editor for specifying or modifying ontologies
• Schema Editor for specifying or modifying data source schemas
• Mapping Editor for specifying mappings between ontologies and between schemas
• Data Editor for registering data sources with INDUS
• View Editor for defining user views
• Query Interface for formulating queries and displaying results


INDUS Users: Domain Ontologists

A domain ontologist can:
• Specify or update ontologies
• Specify or update schemas
• Specify or update mappings between ontologies
• Specify or update mappings between schemas


INDUS Users: Data Providers

A data provider can:
• Associate a predefined schema and ontology with a data source
• Specify data source location, type and access procedures
• Register a data source
• Act as a domain ontologist


INDUS Users: Domain Experts

A domain expert can specify an application view, i.e.,
• Select data sources of interest in an application domain
• Select an application-specific schema
• Select an application-specific ontology
• Select relevant mappings

A domain expert can serve as:
• Domain ontologist
• Data provider


INDUS Users: Domain Scientists

A domain scientist can:
• Select an application view
• Formulate and execute queries

A domain scientist can act as:
• Domain ontologist
• Data provider
• Domain expert


Outline

• INDUS Information Integration System
• INDUS Tools: Technical Details and Demo
• Summary and Work in Progress


Semantically Heterogeneous Data

D1:

Protein ID | Protein Name | Protein Sequence | Prosite Motifs | EC Number
P35626 | Beta-adrenergic receptor kinase 2 | MADLEAVLAD VSYLMAMEKS | RGS, PROT_KIN_DOM, PH_DOMAIN | 2.7.1.126 Beta-adrenergic receptor kinase
Q12797 | Aspartyl/asparaginyl beta-hydroxylase | MAQRKNAKSS GNSSSSGSGS | TPR, TPR_REGION, TPR | 1.14.11.16 Peptide-aspartate beta-dioxygenase

D2:

Accession Number AN | Gene | AA Sequence | Length | Pfam Domains | MIPS Funcat
P32589 | SSE1 | STPFGLDLGN NNSVLAVARN | 692 | HSP70 | 16.01 protein binding
P07278 | BCY1 | VSSLPKESQA ELQLFQNEIN | 415 | RIIa | 16.19.01 cyclic nucleotide binding (cAMP, cGMP, etc.)

Data sources need to be made self-describing by specifying the relevant meta data


Meta Data

• Schema: structure of data
  • Specification of the attributes of the data and their types
• Ontology: conceptualization of semantics of data
  • Domains of attributes and relationships between values

Schema for protein data in D1:

Protein ID: Swissprot ID
Protein Name: String
Protein Sequence: AA String
Prosite Motifs: Motifs
EC Number: EC Hierarchy
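A schema of this kind can be written down as a plain mapping from attribute names to type names. The sketch below is only an illustration, not the INDUS storage format: the names D1_SCHEMA and conforms are hypothetical, and the check only compares attribute names, not value types.

```python
# Schema for protein data in D1, taken from the slide: attribute -> type name.
D1_SCHEMA = {
    "Protein ID": "Swissprot ID",
    "Protein Name": "String",
    "Protein Sequence": "AA String",
    "Prosite Motifs": "Motifs",
    "EC Number": "EC Hierarchy",
}


def conforms(record, schema):
    """Return True if the record has exactly the attributes named in the schema.

    Hypothetical helper: it checks attribute names only, not the value types.
    """
    return set(record) == set(schema)


# One D1 row from the example table.
RECORD = {
    "Protein ID": "P35626",
    "Protein Name": "Beta-adrenergic receptor kinase 2",
    "Protein Sequence": "MADLEAVLAD VSYLMAMEKS",
    "Prosite Motifs": ["RGS", "PROT_KIN_DOM", "PH_DOMAIN"],
    "EC Number": "2.7.1.126",
}

print(conforms(RECORD, D1_SCHEMA))
```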


Attribute value hierarchy

An attribute value hierarchy (AVH) is a partial order ontology over the values of attributes of data

Example: MIPS Funcat Hierarchy
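A minimal sketch of an attribute value hierarchy, assuming a tree-shaped fragment (a tree is a special case of a partial order). The codes 16.01 and 16.19.01 appear in the D2 example data; the parent links between them are an assumption inferred from the dotted numbering, and the names PARENT, ancestors and is_a are hypothetical.

```python
# Child -> parent links for a small fragment of a functional hierarchy.
# Assumption: a code's parent is obtained by dropping its last dotted component.
PARENT = {
    "16.01": "16",
    "16.19": "16",
    "16.19.01": "16.19",
}


def ancestors(value, parent=PARENT):
    """Return all values strictly above `value`, nearest first."""
    found = []
    while value in parent:
        value = parent[value]
        found.append(value)
    return found


def is_a(value, concept, parent=PARENT):
    """True if `value` equals `concept` or lies below it in the hierarchy."""
    return value == concept or concept in ancestors(value, parent)


print(ancestors("16.19.01"))
print(is_a("16.19.01", "16"))
```

Storing only child-to-parent links keeps the structure small; the full order is recovered by walking upwards.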


Making data sources self-describing: ontology-extended data source

Data + Schema + Ontology

Schema:

Accession Number: MIPS ID
Gene: Gene ID
Length: Positive Integer
Prosite Motifs: Motifs
MIPS Funcat: MIPS Hierarchy

Data:

P32589 | SSE1 | STPFGLDLGN NNSVLAVARN | 692 | HSP70 | 16.01 protein binding
P07278 | BCY1 | VSSLPKESQA ELQLFQNEIN | 415 | RIIa | 16.19.01 cyclic nucleotide binding (cAMP, cGMP, etc.)


INDUS: Ontology Editor


INDUS: Schema Editor


INDUS: Data Editor


User view

A user view is given by:
• a set of ontology-extended data sources that are of interest to the user
• a user schema and ontology (defining a virtual data source)
• a set of mappings from data source schemas and ontologies to the user schema and ontology

Data Sources of Interest: MIPS, Swissprot

User Schema:

PID: Swissprot ID
Source: Species String
Protein: AA String
Structural Class: SCOP
GO Function: GO Hierarchy

[Figure: user view, showing the data sources of interest, the user schema and the user ontology]


Mappings

The interoperation between the schema and ontology associated with a data source and a user schema and ontology is facilitated by specifying mappings at:
• Schema level: between attributes in different schemas
• Ontology level: between values of the attributes described in different ontologies

The consistency of the set of mappings between data source schemas and ontologies and user schema and ontology can be checked using a reasoner


Mappings at schema level

D1:

Protein ID: Swissprot ID
Protein Name: String
Protein Sequence: AA String
Prosite Motifs: AA String
EC Number: EC Hierarchy

D2:

Accession No AN: MIPS ID
Gene: Gene ID
AA Sequence: AA String
Length: Pos Integer
MIPS Funcat: MIPS Hierarchy
Pfam Motifs: Motifs

DU:

PID: Swissprot ID
Protein: AA String
GO Function: GO Hierarchy
Source: Species String


Mappings at schema level

Protein ID : D1 ≡ PID : DU
Accession Number AN : D2 ≡ PID : DU

D1:

Protein ID: Swissprot ID
Protein Name: String
Protein Sequence: AA String
Prosite Motifs: AA String
EC Number: EC Hierarchy

D2:

Accession No AN: MIPS ID
Gene: Gene ID
AA Sequence: AA String
Length: Pos Integer
MIPS Funcat: MIPS Hierarchy
Pfam Motifs: Motifs

DU:

PID: Swissprot ID
Protein: AA String
GO Function: GO Hierarchy
Source: Species String


Mappings at schema level

Protein ID : D1 ≡ PID : DU
Accession Number AN : D2 ≡ PID : DU
Protein Sequence : D1 ≡ AA Composition : DU
AA Sequence : D2 ≡ AA Composition : DU

D1:

Protein ID: Swissprot ID
Protein Name: String
Protein Sequence: AA String
Prosite Motifs: AA String
EC Number: EC Hierarchy

D2:

Accession No AN: MIPS ID
Gene: Gene ID
AA Sequence: AA String
Length: Pos Integer
MIPS Funcat: MIPS Hierarchy
Pfam Motifs: Motifs

DU:

PID: Swissprot ID
Protein: AA String
GO Function: GO Hierarchy
Source: Species String


Mappings at schema level

Protein ID : D1 ≡ PID : DU
Accession Number AN : D2 ≡ PID : DU
Protein Sequence : D1 ≡ AA Composition : DU
AA Sequence : D2 ≡ AA Composition : DU
EC Number : D1 ≡ GO Function : DU
MIPS Funcat : D2 ≡ GO Function : DU

D1:

Protein ID: Swissprot ID
Protein Name: String
Protein Sequence: AA String
Prosite Motifs: AA String
EC Number: EC Hierarchy

D2:

Accession No AN: MIPS ID
Gene: Gene ID
AA Sequence: AA String
Length: Pos Integer
MIPS Funcat: MIPS Hierarchy
Pfam Motifs: Motifs

DU:

PID: Swissprot ID
Protein: AA String
GO Function: GO Hierarchy
Source: Species String
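The schema-level mappings listed on this slide can be sketched as a lookup table from (data source, attribute) pairs to user-view attributes. This is an illustration under assumptions, not the INDUS implementation: SCHEMA_MAPPINGS and to_user_view are hypothetical names, and attribute values are passed through unchanged because translating values is the job of ontology-level mappings.

```python
# (data source, source attribute) -> user-view attribute, as listed on the slide.
SCHEMA_MAPPINGS = {
    ("D1", "Protein ID"): "PID",
    ("D2", "Accession Number AN"): "PID",
    ("D1", "Protein Sequence"): "AA Composition",
    ("D2", "AA Sequence"): "AA Composition",
    ("D1", "EC Number"): "GO Function",
    ("D2", "MIPS Funcat"): "GO Function",
}


def to_user_view(source, record, mappings=SCHEMA_MAPPINGS):
    """Rename mapped attributes to their user-view names.

    Attributes without a mapping are dropped; values are not translated.
    """
    return {
        mappings[(source, attribute)]: value
        for attribute, value in record.items()
        if (source, attribute) in mappings
    }


d2_row = {"Accession Number AN": "P32589", "Gene": "SSE1", "MIPS Funcat": "16.01"}
print(to_user_view("D2", d2_row))
```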


Mappings at ontology level

[Figure: ontologies of D1 and DU]


Mappings at ontology level

EC 2.7.1.126 : D1 ≡ GO 0047696 : DU

[Figure: ontologies of D1 and DU]


Mappings at ontology level

EC 2.7.1 : D1 GO 0047696 : DU

[Figure: ontologies of D1 and DU]


Mappings at ontology level

EC 2.7.1.126 : D1 GO 0004672 : DU

[Figure: ontologies of D1 and DU]
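An ontology-level mapping relates values rather than attributes. The sketch below models only the equivalence mapping stated explicitly on these slides (EC 2.7.1.126 in D1 and GO 0047696 in DU); the other mappings are left out because their relation symbols are not legible in the source. ONTOLOGY_MAPPINGS and translate_value are hypothetical names.

```python
# (data source, source value) -> equivalent value in the user ontology DU.
# Only the equivalence stated on the slides is included.
ONTOLOGY_MAPPINGS = {
    ("D1", "EC 2.7.1.126"): "GO 0047696",
}


def translate_value(source, value, mappings=ONTOLOGY_MAPPINGS):
    """Return the equivalent user-ontology value, or None if no mapping is known."""
    return mappings.get((source, value))


print(translate_value("D1", "EC 2.7.1.126"))
```

Returning None for an unmapped value leaves it to the caller to decide whether to skip the record or report the gap.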


INDUS: View Editor


INDUS: Mapping Editor


Sample Query

Return ALL proteins whose GO function isa nucleotide binding

Return ALL proteins whose GO function isa kinase activity OR those that are involved in the GO process phosphate metabolism
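Queries of this form can be sketched as a filter that follows isa links in the user ontology. Everything below is an illustrative assumption: the hierarchy fragment, the toy records and their annotations, and the names GO_PARENT, is_a and select. Only the GO function part of the second query is modeled, with OR expressed as a list of concepts.

```python
# Small fragment of a GO-style hierarchy: child term -> parent term.
GO_PARENT = {
    "cyclic nucleotide binding": "nucleotide binding",
    "protein kinase activity": "kinase activity",
}

# Toy records in the user view; the annotations are illustrative only.
PROTEINS = [
    {"PID": "P07278", "GO Function": "cyclic nucleotide binding"},
    {"PID": "P35626", "GO Function": "protein kinase activity"},
    {"PID": "Q12797", "GO Function": "dioxygenase activity"},
]


def is_a(term, concept, parent=GO_PARENT):
    """True if `term` equals `concept` or is one of its descendants."""
    while term is not None:
        if term == concept:
            return True
        term = parent.get(term)
    return False


def select(records, concepts, attribute="GO Function"):
    """Return records whose attribute value isa at least one of the concepts."""
    return [r for r in records if any(is_a(r[attribute], c) for c in concepts)]


print([r["PID"] for r in select(PROTEINS, ["nucleotide binding"])])
print([r["PID"] for r in select(PROTEINS, ["kinase activity", "nucleotide binding"])])
```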


Query processing in INDUS

[Figure: query processing pipeline. Stages shown: Query Formulation, Local Rewriting, Query Decomposition, Query Translation, Remote Rewriting, Query Execution, Inverse Translation, Result Composition. The query is expressed first in the local ontology and local schema, then in the remote ontology and remote schema of each data source D1 ... Dn, using mappings M1 ... Mn; the results r1 ... rn are translated back and composed into the final result.]


INDUS: Query Editor


INDUS

Some features of INDUS:
• Clear distinction between structure and semantics of data
• Data integration from a user perspective: user-specifiable ontologies and mappings (no single global ontology)
• Data integration on the fly
• Semantic integrity of queries ensured by means of semantics-preserving mappings


Related work

Information integration: [Sheth and Larson, 1990; Davidson et al., 2001; Eckman, 2003; Levy, 1998]

Biological data integration: SRS [Etzold et al., 2003], K2 [Tannen et al., 2003], Kleisli [Chen et al., 2003], IBM’s Discovery Link [Haas et al., 2001], TAMBIS [Stevens et al., 2003], Bio-Mediator [Shaker et al., 2004], etc.

Ontology and mappings editors: Protégé [Noy et al., 2000], Clio [Eckman et al., 2002], DAG-Edit etc.

Ontology-extended relational algebra: [Bonatti et al., 2003]


Outline

• INDUS Information Integration System
• INDUS Tools: Technical Details and Demo
• Summary and Work in Progress


Summary


Work in progress

Ontologies and mappings
• Support for more expressive ontologies (beyond hierarchies) [Bao et al., 2005]
• Support for interactive specification of mappings between ontologies, including automated generation of candidate mappings
• Support for modular ontologies and mappings [Bao and Honavar, 2004]
• Scalability: efficient mechanisms for storage, manipulation, retrieval and use of large ontologies and mappings
• More powerful reasoning to ensure the semantic integrity of mappings
• Support for import, export, and sharing of ontologies and mappings (e.g. OBO and OWL)


Work in progress

Query Processing
• Query optimization under access, bandwidth and computational constraints
• Implementation of data retrieval procedures (iterators) for widely used bioinformatics data sources
• Support for data caching and data sharing


Work in progress

Knowledge Acquisition
• Support for learning classifiers and other predictive models from semantically heterogeneous data [Caragea et al., 2005]
• Support for statistical queries, including queries over partially specified data [Caragea et al., 2004]
• Support for annotating and sharing results of knowledge acquisition


Work in progress

Applications in bioinformatics: data-driven discovery of macromolecular sequence-structure-function relationships
• Prediction of protein function [Andorf et al., 2004]
• Prediction of protein-protein, protein-DNA and protein-RNA interfaces [Yan et al., 2004]
• Analysis, visualization, and interpretation of gene expression data [Caragea et al., 2005]
• Modeling and discovery of gene regulatory networks

Usability studies
Design of better user interfaces
Performance evaluation


Relevant Publications

Caragea, D., Pathak, J., Bao, J., Silvescu, A., Andorf, C., Dobbs, D. and Honavar, V. (2005). Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources. In: Proceedings of the 2nd International Workshop on Data Integration in Life Sciences (DILS'05), San Diego, CA.

Caragea, D., Pathak, J., and Honavar, V. (2004). Learning Classifiers from Semantically Heterogeneous Data. In: Proceedings of the Third International Conference on Ontologies, DataBases and Applications of Semantics for Large Scale Information Systems (ODBASE’04), October 25-29, 2004, Agia Napa, Cyprus.

Caragea, D., Silvescu, A., and Honavar, V. (2004). A Framework for Learning from Distributed Data Using Sufficient Statistics and its Application to Learning Decision Trees. International Journal of Hybrid Intelligent Systems. Vol. 1, No. 2. Invited Paper.


http://www.cs.iastate.edu/~dcaragea/indus.html