iBioSearch: The Integrated Biological Database Search


www.ischool.drexel.edu


Ritu Khare and Yuan An

METHODOLOGY

1. Web Interface (WI) Collection: Collect WIs to biological databases.
2. Information Extraction: For each WI, extract the attributes corresponding to the WI metamodel. Broadly, a WI can be represented as a collection of search entities and their respective labels (search criteria).
3. Mapping WI to Metamodel: Map each WI to the WI metamodel to generate instances of the metamodel. This yields a list of search entities and their respective criteria (labels). For a given search entity S_i, there is a label set (l_i1, l_i2, l_i3, …, l_im).
4. Clustering: Find non-overlapping classes of search entities that represent synonyms and, for each class, find a list of non-redundant labels.
5. Generation of GBWS: Finally, we generate another conceptual model that we call the "Global Biological WI Schema" (GBWS). It would represent all possible input WIs in a non-redundant manner and capture the matchings between individual instances of the WI metamodel.
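Steps 2 to 5 can be illustrated with a minimal sketch. It assumes each WI has already been reduced to a mapping from search entity to labels, and it uses a hand-written synonym table (`SYNONYMS`) in place of a real clustering method, which the poster does not specify. The interface names, entities, and labels are invented for illustration.

```python
from collections import defaultdict

# Hypothetical extracted WIs (steps 2-3): each maps a search entity to its labels.
# The interface names, entities, and labels below are invented for illustration.
web_interfaces = {
    "wi_a": {"Gene": ["Name", "Organism"]},
    "wi_b": {"gene": ["name", "Chromosome"], "Protein": ["Sequence"]},
    "wi_c": {"Locus": ["organism", "Symbol"]},
}

# Assumed synonym table; the poster does not say how synonyms are detected.
SYNONYMS = {"locus": "gene"}


def normalize(term):
    """Lower-case a term and map it to its canonical synonym, if any."""
    key = term.strip().lower()
    return SYNONYMS.get(key, key)


def build_gbws(interfaces):
    """Cluster search entities (step 4) and merge their labels (step 5)."""
    schema = defaultdict(dict)
    for wi_name, entities in interfaces.items():
        for entity, labels in entities.items():
            cluster = schema[normalize(entity)]
            for label in labels:
                # Record which interfaces offer each non-redundant label.
                cluster.setdefault(normalize(label), set()).add(wi_name)
    return {entity: dict(labels) for entity, labels in schema.items()}


gbws = build_gbws(web_interfaces)
for entity, labels in sorted(gbws.items()):
    print(entity, {label: sorted(srcs) for label, srcs in sorted(labels.items())})
```

Keeping the set of source interfaces next to each label is one way to preserve the matchings between individual WI instances that the GBWS is meant to capture.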

CURRENT AND PREDICTED RESULTS

The GBWS, or ontology, could be presented as a meta-search interface in which biologists search for most biological entities using the search criteria available across different databases.
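As a rough sketch of how such a meta-search interface might use the GBWS, the snippet below decides which databases can accept which of the user's criteria. The `GBWS` fragment, the database names, and the `plan_search` helper are all hypothetical; the poster does not describe the dispatch mechanism.

```python
# Hypothetical GBWS fragment: entity -> search criterion -> databases offering it.
GBWS = {
    "gene": {
        "name": {"db_a", "db_b"},
        "organism": {"db_a", "db_c"},
        "chromosome": {"db_b"},
    },
}


def plan_search(gbws, entity, criteria):
    """Return, per database, the subset of the user's criteria it can accept."""
    plan = {}
    for criterion, value in criteria.items():
        for source in gbws.get(entity, {}).get(criterion, ()):
            plan.setdefault(source, {})[criterion] = value
    return plan


plan = plan_search(GBWS, "gene", {"name": "BRCA1", "organism": "Homo sapiens"})
for source in sorted(plan):
    print(source, plan[source])
```

Under these assumptions, a database receives only the criteria its own interface supports, so a single user query fans out into several narrower ones.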

Eventually, we aim to answer further research questions concerning:

1. Differences between commercial and biological databases.
2. Automatic identification of biological search interfaces.
3. Reverse engineering of a WI into an ER diagram.
4. Integration of multiple ER diagrams.
5. Extraction of relationships between biological search entities.

FUTURE WORK

In the future, we intend to dynamically update the repository of biological databases, maintain semantic mappings as the base databases evolve, translate user queries, and consolidate, reconcile, and rank the query results using data cleansing and relevance computing algorithms. We also plan to run usability tests of the iBioSearch system with the help of biologists.
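The consolidate, reconcile, and rank step could look roughly like the sketch below. The poster names no specific algorithms, so every rule here is an assumption: identifiers are cleansed by trimming and upper-casing, duplicates are reconciled by keeping the best score, and relevance favors records returned by more sources. The record fields (`id`, `score`) and database names are invented.

```python
def consolidate_and_rank(results_by_source):
    """Merge per-source hits, reconcile duplicates, and rank the merged records."""
    merged = {}
    for source, hits in results_by_source.items():
        for hit in hits:
            # Data cleansing (assumed): compare identifiers case-insensitively.
            key = hit["id"].strip().upper()
            record = merged.setdefault(key, {"id": key, "sources": set(), "score": 0.0})
            record["sources"].add(source)
            # Reconciliation (assumed): keep the best score reported for a record.
            record["score"] = max(record["score"], hit["score"])
    # Relevance (assumed): more agreeing sources first, then higher score.
    return sorted(merged.values(), key=lambda r: (-len(r["sources"]), -r["score"], r["id"]))


results = {
    "db_a": [{"id": "brca1", "score": 0.9}, {"id": "TP53", "score": 0.4}],
    "db_b": [{"id": "BRCA1 ", "score": 0.7}, {"id": "EGFR", "score": 0.8}],
}
ranked = consolidate_and_rank(results)
print([r["id"] for r in ranked])
```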


Fig. 1: Problem - biologist searching for an entity

Fig. 2: WI Metamodel

Fig. 3: Methodology

PROBLEM

The very large number of biological Web databases and their interfaces makes it difficult for biologists to search for any biological entity (see Fig. 1). Currently, their only option is to search each of these interfaces individually.

OUR SOLUTION

We aim to provide a unified search interface capable of searching multiple (1000+) biological databases. This interface would be a representation of the biological search interface ontology. To find the global search ontology, we take a novel approach: reverse engineering each individual search interface into a conceptual model, and then finding an integrated model that is consistent with all the interfaces up to a level of significance.

HYPOTHESIS & ASSUMPTIONS

WI Metamodel: We observe that all input Web Interfaces (WIs) have an underlying global model. We created this global model manually and termed it the "WI Metamodel" (see Fig. 2).

WI: Every Web Interface (WI) can be represented as an instance of the metamodel.
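To make the "instance of the metamodel" idea concrete, here is a minimal sketch. The structure of the actual metamodel in Fig. 2 is not recoverable from the text, so the classes below assume only what the poster states elsewhere: a WI is a collection of search entities, each with a set of labels. The class names, fields, and URL are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SearchEntity:
    """A searchable entity and the labels (search criteria) offered for it."""
    name: str
    labels: list = field(default_factory=list)


@dataclass
class WebInterface:
    """One instance of the assumed metamodel: a collection of search entities."""
    url: str
    entities: list = field(default_factory=list)

    def label_set(self, entity_name):
        """Return the labels (l_i1, ..., l_im) of entity S_i, or an empty tuple."""
        for entity in self.entities:
            if entity.name == entity_name:
                return tuple(entity.labels)
        return ()


wi = WebInterface(
    url="http://example.org/search",  # placeholder URL, not a real database
    entities=[SearchEntity("Gene", ["Name", "Organism"])],
)
print(wi.label_set("Gene"))
```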

[Fig. 1 labels: Which interface to search? Which database to access? What search criteria do I have? How many sources to consider? Databases are labeled OLDB.]

[Fig. 3 labels: Information Extraction; Mapping WI with Metamodel; Reverse Engineering; Information Retrieval; Meta-Search Interface; WI Metamodel; Clustering Search Entities and Labels; Generation of Global Biological WI Schema.]