An XML Database for Gene Expression

MSc Dissertation Thesis

Lalit Kumar Registration No.: 061056187

[email protected] / [email protected]

M.Sc. Bioinformatics September 2007

Heriot-Watt University

Edinburgh, United Kingdom http://www.hw.ac.uk

Supervisors:

Dr. Albert Burger [email protected]

School of Mathematical and Computer

Sciences, Heriot-Watt University Edinburgh, United Kingdom

http://www.macs.hw.ac.uk

Dr. Yiya Yang

[email protected]

Human Genetics Unit Medical Research Council

Edinburgh, United Kingdom http://www.hgu.mrc.ac.uk

Declaration

I, Lalit Kumar, confirm that this work submitted for assessment is my own and is

expressed in my own words. Any uses made within it of the words of other authors in

any form e.g., ideas, equations, figures, text, tables, programs etc are properly

acknowledged. A list of references employed is included.

………………………………

Lalit Kumar

Acknowledgements

I wish to express my sincere gratitude towards my supervisors Dr. Albert Burger and

Dr. Yiya Yang. Thanks to Dr. Burger for reading my thesis drafts, listening to my

problems and for supporting me every step of the way. Thanks to Dr. Yang for her

understanding and putting up with all my learning-curve troubles. Their guidance and

generous time throughout this project helped me in timely completion of the project

and also in appreciating the potential of XML technologies.

I cannot thank my family enough for their love and support, which enabled me

to pursue this MSc. While Mummy, Papa, Uncle and Auntie’s blessings gave me

encouragement; Kishore, Sudha, Pooja, Naveen, Kapil and Manish’s love remained a

source of meaning of life. The arrival of a new member in family, Dhairya, Kishore

and Sudha’s son, provided immense joy.

Thanks to the Scottish Executive that awarded me the Scottish International

Scholarship for the pursuit of this MSc. The scholarship was managed by the British

Council. Thanks to Alison Kanby, Regional Services Officer, British Council,

Scotland, who did excellent work in managing my scholarship.

The Internet and mobile phones have made the world a much smaller place. My old friends

remained in touch with me when I moved to the UK for studies. I thank Daniel Reinharz

for his words of wisdom and for his friendship that I treasure. Supatra Kundu,

Sangeeta Yaduvanshi and Reena Torawat have been great friends through all ups and

downs. No words would suffice for Shubhda.

Scotland has given me a number of new friends. I am thankful to my friends and

fellow SISP scholars Aishwarya, Anurag, Naren, Ruchi and Sujai for their cheerful

company on so many occasions. Chris, Daylan, Kanchan, Basil, Rizwan and Atiq also

have been pleasant folks to be with.

Edinburgh is an amazing city.

Abstract

Extensible Markup Language (XML) is fast becoming a standard method of

information exchange among computing devices. The flexible nature of XML has

made it possible to develop subject specific languages out of it (for example MathML

is XML used for describing mathematical notations). People with varying computer

skills can manage XML easily because of its ability to keep the data in human readable

format. For these reasons, among others, XML is also widely used in

bioinformatics applications.

In Bioinformatics, gene expression databases have become greatly useful tools. Gene

expression data is fairly complex and is gathered using various experimentation

techniques in laboratories. The transmission of this data from laboratories to gene

expression databases is now being standardized. Efforts are under way to develop

standard formats for storing this data. MISFISHIE (Minimum Information Specification For In Situ Hybridization and Immunohistochemistry Experiments) is one such standard.

The EMAP project of the Medical Research Council’s Human Genetics Unit (MRC

HGU) has developed a gene expression database called EMAGE, which is an object

database. Now the MRC HGU wants to investigate the possibility of converting

EMAGE into an XML based and MISFISHIE compliant database. The objective of this

dissertation project is to develop this XML database.

Table of Contents

Declaration
Acknowledgements
Abstract
Table of Contents
List of Figures
1. Introduction
  1.1 Dissertation Thesis Outline
2. Background
  2.1 Bioinformatics
    2.1.1 Gene Expression Data
    2.1.2 Gene Expression Databases
  2.2 Introduction to XML
  2.3 XML in Bioinformatics Applications
  2.4 The EMAP Project
  2.5 EMAGE: A Gene Expression Database
3. Project Description
  3.1 Problem Situation
  3.2 Project Aims and Objectives
  3.3 Overview of Solution Development
4. Development of XML Schema
  4.1 Existing Data Structure
  4.2 XML Schema for Original Dataset
  4.3 MISFISHIE Standard
  4.4 Rationale for a New XML Schema
  4.5 Validation of New XML Schema
5. Transformation of Original Dataset
  5.1 Investigated Approaches
    5.1.1 JDOM based Application
    5.1.2 XML Shredding using DB2 9 Database
    5.1.3 XSL Transformation
  5.2 XSL Transformation
    5.2.1 Rationale of Using XSL Transformation
    5.2.2 Mapping Scheme
    5.2.3 Mapping Tools
    5.2.4 Automated Code Generation
    5.2.5 Transformation Process
6. Database Preparation and Querying
  6.1 Why IBM DB2 9
    6.1.1 pureXML™ Technology
  6.2 Insertion of XML into DB2 Database
  6.3 Querying of XML Data
    6.3.1 XQuery Language
    6.3.2 SQL/XML
  6.4 User Interface Development
    6.4.1 Java Server Pages (JSP)
    6.4.2 Hypertext Preprocessor (PHP)
  6.5 User Interface for a Few Queries
7. Conclusion
  7.1 Summary of the Work Done
  7.2 Summary of Evaluation
  7.3 Accomplishments
  7.4 Limitations Encountered
  7.5 Skills Acquired
  7.6 Future Work
    7.6.1 Comprehensive Query Interface Development
    7.6.2 Interface for Inserting New Data
    7.6.3 Performance Evaluation
    7.6.4 Query Optimization
  7.7 Final Thoughts
References
Appendix A
Other Appendices (on CD-ROM)

List of Figures

Figure 1: Protein synthesis process [18]
Figure 2: Spatial queries formulated using EMAGE Java interface [4]
Figure 3: Screenshot of the EditiX interface
Figure 4: Screenshot of DTD/Schema menu of EditiX
Figure 5: Screenshot of Oxygen XML Editor
Figure 6: Screenshot of a portion of Mapping generated using MapForce
Figure 7: Structure of the relational table created to hold the XML documents
Figure 8: Input screen for query about gene expression detection
Figure 9: Output of the gene expression detection query
Figure 10: Input screen for query that counts the fully or partially sequenced assays
Figure 11: Output of the query that counts the fully or partially sequenced assays
Figure 12: Input screen for the query that finds components where a gene is expressed
Figure 13: Output of the query that finds components where a gene is expressed

1. Introduction

Recent years have seen rapid development in natural science disciplines such as

biotechnology, cell biology, molecular biology, genetics and bioinformatics. As a

consequence, an increasingly vast amount of related data is being produced in

these fields. A matter of concern has been the management of this data

and making it readily available as and when required. Traditionally, the data produced

by experiments in these fields was distributed by means of journals and

other types of print publications. However, a few problems have been associated with

this traditional approach:

• Lack of standardized formats for data publication made it difficult to compile

data from different sources

• Availability of print publications is not the same everywhere in the world

• Data compilations produced in print media are difficult to search through

• Analysis of printed data and generating information out of it is difficult

The problem of the absence of standard formats has lately been mitigated to a

considerable extent with the advent and steady development of such formats.

Information Technology (IT) tools have been very useful in solving the other

problems listed above. A variety of databases containing bioinformatics data have

become available in the past couple of decades. Online availability of most of these

databases ensures global reach and easy access to the data. With the growing

importance and availability of gene expression data, many gene expression databases

have also come into existence (see Section 2.1.2).

Gene expression refers to the presence (or absence) of the effect of a particular gene.

The genome¹ contains genes which determine the amino acid² sequences of the resultant

proteins. In addition, the genome also contains a comprehensive mechanism for

controlling the synthesis of functional proteins from genes.

Figure 1: Protein synthesis process [18]

Often the amino acid sequence produced from the genetic blueprint undergoes

extensive modifications before it becomes a functional protein. Moreover, the

functionality of the matured proteins itself is regulated by a number of other factors

which can suppress or enhance the functionality. Therefore, merely knowing the gene

and protein sequences is not enough. It is very important to “functionalise” the

genome by finding out the structure of genes and the regulatory mechanisms which

give rise to the functional proteins. In short, it is important to know how, when, why

and where a gene expresses itself. The systematic collection of gene expression data

helps in deducing the eventual effects or functions of the gene in the body of the

organism. [15]

¹ Genome refers to all the genetic (hereditary) material in an organism. Mostly, it is composed of DNA and sometimes RNA.

² Amino acids are building blocks of proteins. A protein is composed of chain(s) of amino acid molecules.

Gene expression databases store not only information about genes and their

expression sites in the organism’s body but also other relevant

information, for example the details of experiments that were conducted to find out

about gene expression. Often, these databases are put online and access is given

through a web interface. Using this interface the users can query the available data.

The Medical Research Council’s Human Genetics Unit (MRC HGU) has developed a

mouse gene expression database called EMAGE. It has been developed as a part of a

larger project called the Edinburgh Mouse Atlas Project (EMAP). The underlying

technology of the present EMAGE database is that of an object database. The format

of data stored in this database does not comply with the relevant standards like

MISFISHIE. Because XML is fast becoming a standard for data exchange

among computing devices, MRC HGU intends to investigate the feasibility

of organizing the present data in compliance with the MISFISHIE standard and

migrating the EMAGE object database to an XML based database.

This dissertation project takes the first step towards the intended migration. While

developing the XML version of the EMAGE database, the project aims to research

and document the nitty-gritty of the process. The performance evaluation of the

resultant XML database is not a part of this project and would be carried out

separately by MRC HGU.

1.1 Dissertation Thesis Outline

This thesis describes the author’s work on the abovementioned migration.

Chapter 2 introduces the relevant background information. It explains what

gene expression is and how the data about gene expression is collected. Further, it

introduces a few well-established gene expression databases before moving on to a

brief introduction to the concept of XML. Nowadays, XML is being used in a

number of bioinformatics applications. The next section in Chapter 2 discusses the

advantages and disadvantages of the use of XML in bioinformatics applications. Then

the EMAP project and the EMAGE database are described in the subsequent

sections.

Chapter 3 outlines the objectives that the work involved in this dissertation was

supposed to fulfill. It also presents an overview of the solution strategy that was

adopted for this purpose.

The subsequent chapters describe the process that was followed to develop the XML

database. Chapter 4 explains the development and validation of the XML schema

which would act as a template for the XML documents to be stored in the database.

Chapter 5 explains the process of transformation of existing EMAGE data into XML

documents prepared as per the new XML schema. Chapter 6 details the development

of a relational database which would hold the XML documents. It also discusses the

retrieval of the desired data from the database by means of the XQuery language,

and details the development of the user interface through which users

will be able to query the new XML database.

Chapter 7 presents the summary of and conclusions drawn from the project. This

project lays the foundation of the overall migration process. There are several other

tasks which should be done but are beyond the scope of this project. Some of

these tasks are discussed under the Future Work section of this chapter.

2. Background

2.1 Bioinformatics

“Bioinformatics is an interdisciplinary research area which uses computers for storage, retrieval,

manipulation and distribution of information related to biological macromolecules such as DNA,

RNA and proteins.” [11]

Bioinformatics uses computer applications to perform functions such as analysis of biological

sequence information, recovery of evolutionary patterns, prediction of gene function

and biological data mining. Bioinformatics deals with

and generates enormous amounts of complex data. This gives rise to the need to

develop more efficient and sophisticated computer tools to manage and

analyze this data.

2.1.1 Gene Expression Data

Just as computers work according to software instructions, the development

and functioning of living organisms are controlled by the instructions encoded in

their genetic material (which is mostly DNA and sometimes RNA). Genes are the

portions of the genetic material that contain the encoded instructions. These

instructions initiate and control the process of protein formation in the cells of the

organism. The proteins then carry out various functions in the cells. The process by which the

instructions in a gene are converted into functional protein(s) is called gene

expression.
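
The flow from gene to protein described above can be sketched in a few lines of Python. This is a deliberately simplified illustration and not part of the EMAGE work: the function names `transcribe` and `translate` and the toy sequence are hypothetical, the codon table contains only the handful of entries of the standard genetic code that the example needs, and the regulatory and processing steps of real gene expression are ignored.

```python
# Simplified sketch of gene expression: DNA -> mRNA -> protein.
# Assumption: only the few codons needed by the toy sequence are listed;
# this is not a complete genetic code table.
CODON_TABLE = {
    "AUG": "M",   # methionine, also the start codon
    "UUU": "F",   # phenylalanine
    "GGC": "G",   # glycine
    "UAA": None,  # stop codon
}


def transcribe(coding_dna: str) -> str:
    """Return the mRNA for a DNA coding strand (T is replaced by U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate mRNA codon by codon until a stop codon is reached."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid is None:  # a stop codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGTTTGGCTAA")  # hypothetical toy gene
protein = translate(mrna)
print(mrna, protein)
```

For the toy gene, transcription yields the mRNA `AUGUUUGGCUAA`, and translation stops at the `UAA` codon, giving the three-residue chain `MFG`.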

A paper by D’haeseleer, Liang and Somogyi (1999) titled “Gene Expression Data

Analysis and Modelling” [3] provides elementary information about what gene

expression is and how it is measured. The paper is written in the form of a tutorial.

Although the focus of the paper is on data analysis and modelling, which is out of the

scope of this dissertation project, the general information about gene expression was

found to be useful for understanding gene expression data.

The paper begins by emphasizing that in this “Age of Genomics” we are no longer dealing

only with data related to isolated genes and their products (i.e. proteins). The

advancements in the field of genomics have given rise to an enormous amount of gene

expression data. This dataset includes not only data about isolated gene

expressions and proteins but also data about complex interactions among

genes and proteins. Analysis of this data is of high importance because the more

information we have on gene expression, the better we will be able to understand

how organisms function.

In its section 2, the paper introduces the methods that are used nowadays for

obtaining gene expression data. In most organisms, genes are transcribed into messenger

RNA (mRNA), and the mRNA is then translated into proteins (see Figure 1).

Therefore, by measuring the concentration of a particular type of mRNA in the cell, it

is possible to assess the expression level of the gene from which that mRNA was transcribed.

A higher concentration of mRNA implies a higher level of expression of the

associated gene. A higher concentration of mRNA, however, does not always

confirm a higher level of gene expression because the expression of some genes is

known to be regulated after transcription [10]. In this approach to measuring gene

expression, DNA microarrays are often used for gene expression profiling. DNA

microarrays are glass slides onto which cDNA³ is deposited by high-speed

robotic printing.

The D’haeseleer, Liang and Somogyi (1999) paper was not written specifically to

provide a general description of gene expression and related measurement

techniques. However, the basic information that it provides about gene expression is

good and was found to be useful.

³ cDNA, or complementary DNA, is made by the process of reverse transcription of mRNA.

Christiansen et al (2006) in their paper “EMAGE: a spatial database of gene

expression patterns during mouse embryo development” [4] say that the major

challenge in modern biology is to functionalize genomes that have been sequenced,

and to understand the interactions among genes and their products. It has become

very important to know the sites in the body where genes are expressed, and in situ⁴

techniques are available for determining the sites of gene expression.

These techniques include immunohistochemistry and in situ hybridization. Using

these techniques, the concentration of the gene products in the cell can be visualized.

• Immunohistochemistry and in situ hybridization

Immunohistochemistry is a method of localizing proteins in the cells. The

antibodies in an organism bind themselves to specific antigens. Antigens are

molecules that provoke the immune system into generating antibodies. Antigens

are usually proteins or polysaccharides. If an antibody is available that works in

situ, then it is possible to know the distribution of the associated antigen

protein throughout the body of the organism. In immunohistochemistry, the

antibody with a known target protein antigen is used to find the presence,

distribution or absence of the protein in tissues. Once applied, the antibody

will bind itself with the target protein if it is present in the tissue cells. These

antibodies are used in combination with the coloring or fluorescent agents so

that the location and concentration of the antibody could be known. [16]

In situ hybridization is a technique of using a labeled cDNA or RNA sequence

to find the location of DNA or RNA sequence in tissues. The labeled

sequence binds itself with the complementary naturally present DNA/RNA

sequence, thus revealing its presence and location.
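
The base-pairing idea behind in situ hybridization can be illustrated with a short Python sketch. It is only a conceptual model under stated assumptions: the sequences and the function names `reverse_complement` and `find_binding_sites` are hypothetical, only DNA letters are used (an RNA probe would use U instead of T), and only perfect matches are considered, whereas real hybridization depends on experimental conditions and tolerates some mismatches.

```python
from typing import List

# Conceptual sketch of probe/target base pairing (not a laboratory protocol).
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(dna.upper()))


def find_binding_sites(probe: str, target: str) -> List[int]:
    """Return 0-based positions in target where the probe can base-pair.

    The probe binds wherever the target contains the probe's reverse
    complement (perfect matches only).
    """
    site = reverse_complement(probe)
    positions = []
    start = target.find(site)
    while start != -1:
        positions.append(start)
        start = target.find(site, start + 1)
    return positions


target = "GGATCCAAGTTTGGATCC"  # toy sequence standing in for a tissue transcript
probe = "AAACTT"               # labeled probe; its reverse complement is AAGTTT
sites = find_binding_sites(probe, target)
print(sites)
```

In this toy case the probe pairs with the target at a single position (index 6), which is the computational analogue of the label revealing where the sequence is present.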

⁴ In situ, in the context of biology, indicates observation of a biological activity in the place where it naturally happens. For example, observation of cell activities in a living organism would be in situ observation.

The Christiansen et al (2006) paper states that traditionally the gene expression data

has been archived by publishing in journals. But this approach does not allow easy

access to the published data. The lack of availability in electronic format, lack of

proper citations and lack of standardization of the writing format are the main

reasons which hamper the fast retrieval and distribution of the gene expression data.

To overcome these problems, nowadays gene expression databases are being

developed to make use of the benefits provided by information technology.

2.1.2 Gene Expression Databases

Collecting and managing gene expression data has become an important task in

bioinformatics. To cater to the need of management of large amounts of data from

different types of expression assays, several gene expression databases have been

developed. Some of these are mentioned below.

GXD is the “Gene Expression Database” developed and managed by The Jackson

Laboratory, United States. This database contains gene expression data on

mouse development. It gathers data from published literature, and researchers also

submit data directly to the database via electronic submissions. A number of web

forms are provided for querying the data available in this database. The records in the

database have links to other data resources which makes it easier to find relevant

information about the data in GXD. Home page of GXD is available at:

http://www.informatics.jax.org/mgihome/GXD/aboutGXD.shtml

GENSAT stands for “Gene Expression Nervous System Atlas” and it contains the

gene expression data of the mouse central nervous system. This atlas is managed by

the National Center for Biotechnology Information (NCBI). The aim of GENSAT is

to create a mapping of expression of all genes that express in mouse brain at various

stages of its development cycle. The data is freely available to anyone through the

homepage of GENSAT, which is available at:

http://www.ncbi.nlm.nih.gov/projects/gensat

“Gene Expression Atlas” is managed by the Genomics Institute of the Novartis Research

Foundation. It contains gene expression information related to mouse and

human. This atlas can be accessed at:

http://expression.gnf.org/cgi-bin/index.cgi

EMAGE is the gene expression database developed and managed by the Human

Genetics Unit of MRC. This dissertation project aims to develop an XML based

version of the EMAGE database. This database is discussed in more detail in

Section 2.5. The home page of EMAGE database is available at:

http://genex.hgu.mrc.ac.uk/Emage/database/emageIntro.html

2.2 Introduction to XML

The development of the World Wide Web (WWW) gained immense momentum when

Hypertext Markup Language (HTML) became the de facto language for web

development. HTML provides a set of markup elements (or tags) which are used for

page layout and content formatting on web pages. For example, when we want to

write text in bold typeface on a web page, we write as below:

<b>Cardiovascular</b>

Here, the <b> and </b> tags inform the web browser to display the contained text

in bold typeface. Thus, although HTML can be used to control the display of data,

it does not provide any information as to what the data is about. In the above

example, HTML does not tell us anything about what the word “Cardiovascular” means. This

is where XML comes into the picture.

To get the basic information about XML, a book titled “XML in a Nutshell, 3rd

Edition” by Harold and Means was consulted [6]. The book introduces XML as a

general purpose language to mark up the data in a document with simple and human-

readable tags. XML does not have a finite set of tags; rather, it allows developers

to define and use their own tags (which makes XML an extensible and customizable

language). A developer can form the relevant XML tags to give meaning to the word

“Cardiovascular”. For example:

<organsystem>Cardiovascular</organsystem>

The XML tags <organsystem> and </organsystem> tell that the word

“Cardiovascular” indicates an organ system. By adding more XML tags, the meaning

of the text can be made even clearer. For example,

<mouse>

<organsystem>Cardiovascular</organsystem>

</mouse>

Now the information is more focussed as it is related with the cardiovascular organ

system of mouse. It is important to note that, unlike in HTML, the tags <mouse>

and <organsystem> are custom made and do not have any effect on the

appearance of the data. In XML, the same information can be given a different

meaning by changing the tags around it. For example:

<venom>

<targetorgans>Cardiovascular</targetorgans>

</venom>

Here “Cardiovascular” is no longer interpreted in context of the mouse organ system.

The change in tags has given a new meaning to the information. Now it talks about

the target organs of a venom.
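
The point that the enclosing tags, and not the text itself, carry the meaning can be checked programmatically. The following minimal sketch uses Python's standard `xml.etree.ElementTree` module to parse the two fragments shown above; the helper name `describe` is hypothetical and was chosen only for this illustration.

```python
import xml.etree.ElementTree as ET

# The two fragments from the text: same character data, different tags.
mouse_doc = "<mouse><organsystem>Cardiovascular</organsystem></mouse>"
venom_doc = "<venom><targetorgans>Cardiovascular</targetorgans></venom>"


def describe(xml_text: str) -> tuple:
    """Return (root tag, child tag, text) for a one-child XML fragment."""
    root = ET.fromstring(xml_text)
    child = root[0]  # the single nested element
    return (root.tag, child.tag, child.text)


mouse_info = describe(mouse_doc)
venom_info = describe(venom_doc)
print(mouse_info)
print(venom_info)
```

Both fragments yield the same text, “Cardiovascular”, but a program reading them can tell from the tag names whether the word refers to an organ system of the mouse or to the target organs of a venom.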

In addition to this basic understanding of what exactly XML is, the book by Harold

and Means was also used for learning about features and syntax of XML (XML

syntax is a set of rules for writing XML in correct form). The book presents the XML

concepts and fundamentals in a clear, concise and easy to understand way. A large

number of examples are given, which makes XML easy to understand

even for those who have no previous knowledge of the language.

2.3 XML in Bioinformatics Applications

In recent years, XML has been increasingly used for managing biological data. The

data generated in the discipline of biology is flexible in nature, and this is one of the

reasons why XML is a good candidate for bioinformatics applications such as

gene expression databases.

Achard, Vaysseix and Barillot (2000) produced a paper “XML, Bioinformatics and

Data Integration” [5] in which the authors discuss the suitability of XML for the

bioinformatics applications. In the beginning, the authors briefly introduce the XML

concepts and then set out to outline some of the areas where XML is being used. The

paper states that a number of commercial and academic actors are now using XML

for managing their data and “within a few years (XML) will be as widespread as HTML is

today”. XML is also used as a framework for developing specialized languages to be

used in various different fields. For instance, Wireless Markup Language (WML) is

based on XML and is used in the wireless application development. In the field of

biology, one of the uses of XML is to annotate the gene or protein sequence data.

Two examples of specialized XML-based languages for the field of biology

are listed in the paper:

1. Bioinformatics Sequence Markup Language (BSML): It is an extensible

language specification for bioinformatics data like DNA, RNA, protein

sequences and their graphical properties

2. BIOpolymer Markup Language (BioML): The paper quotes the developer of

the BioML as saying “BioML’s goal is to allow the expression of complex annotation

for protein and nucleotide sequence information. BioML was designed to mimic the

hierarchical structure of a living organism.”

The paper continues by discussing the data management in Bioinformatics. The

authors note that bioinformatics deals with very large quantities of data and managing

this data is one of the key concerns, which has started to become a bottleneck in the

development of the discipline. The amount of data in bioinformatics, however, is not

the real problem because disciplines like particle physics generate even more data

than bioinformatics and are still able to manage it efficiently. The main issues in

managing bioinformatics data arise from certain characteristics of this data. For

instance:

1. It is complex to model bioinformatics data. There are numerous data types

with complex relationships among them

2. New data types keep emerging regularly and proper modelling and integration

of these new data types often requires changes in the whole semantics of the

system. This happens because bioinformatics is under rapid development and

the new data often redefines the previously known concepts

3. Analysis of known data generates even more data and this new data has to be

integrated back into the original data

4. Experimental raw data needs to be archived because the researchers and

scientists often want to consult it in order to confirm the results given by the

computer analysis

5. The granularity of bioinformatics data is finer than in many other fields.

Objects and entities are smaller in size and therefore a unit amount of data

often contains a larger number of objects

6. Data is accessed, queried, exchanged and updated frequently

7. Data is used by a variety of users (with varying computer skills) which include

biologists, programmers, database administrators and data analysts etc.

Based on these observations, the paper suggests that the data management

technology in bioinformatics must be scalable, flexible and expressive. It also points

out some technical issues related with bioinformatics data management. These issues

include the sustained and rapid growth in the amount of data, the storage of data in a

number of different types of databases, data redundancy, and the frequent use of

different flat-file formats, which makes indexing difficult.

The paper then identifies some pros and cons of XML in the context of bioinformatics.

Some of the strengths of XML are listed below:

1. XML is very flexible. It is human readable and therefore can be easily edited

by people with little computer skills.

2. It is capable of linking data and is Internet oriented. This capability enables

XML to provide cross references among various data sources on the Internet.

3. XML allows defining the customized specifications. The ever changing

bioinformatics data lacks standardization and therefore XML can be used to

construct a customized specification for the data.

Alongside these advantages, XML has its own share of weaknesses as well. Some of

these weaknesses are listed below:

1. XML is a text based format and has the overhead of data parsing5.

2. XML itself does not provide facilities like indexing and clustering of data to

improve the performance.

3. The expressiveness of XML is not sufficient for molecular biology. Unlike

object-oriented technologies, XML does not provide a mechanism for

inheritance and does not have a concept of relationships among data as such.

There are no elaborate data types and constraints available.

This paper provided a good understanding of challenges in managing bioinformatics

data and role of XML in this context. The paper, however, was written in the year

2000. It is now 2007, and seven years is a long time in both computer science and

bioinformatics. Though the challenges posed by bioinformatics in the area of data

management still exist, the XML technologies have become much more powerful

than they were in the year 2000. As the authors of the paper also hoped, XML

schemas have been able to solve many of the weaknesses listed above. In addition,

5 In computer science, data parsing means the analysis of syntax of input against a pre-defined set of rules.

commercial database management systems now provide built-in XML

capabilities and as a result the performance of data stored in XML format has

improved.

The paper continues further by comparing some approaches that are used for

bioinformatics data management. These approaches include flat-files, Abstract Syntax

Notation One (ASN.1), Common Object Request Broker Architecture (CORBA),

Java Remote Method Invocation (RMI) and Object Database Management System

(ODBMS). Out of these, the ODBMS was of particular interest because the current

EMAGE gene expression database is based on this approach. The paper stated that

ODBMS provides a rich data model which fits well to the requirements of

bioinformatics field. In addition, ODBMS:

• Provide indexing, object clustering and query optimization

• Have a standardized data definition and query language

• Guarantee security, concurrent access, integrity, consistency and reliability

• Provide Application Programming Interfaces (APIs)

While stating that XML does not provide many of these ODBMS features, the paper

suggests that the use of a combination of ODBMS and XML technologies might be

best for the bioinformatics needs. In this combination, the ODBMS is used for data

management and XML is used for display and exchange. The data is queried from the

ODBMS and ODBMS returns the results in XML format to enable users to exchange

the results easily.

It is said in conclusion of the paper that XML was not completely mature at the time

when the paper was written. The paper concludes that because of its simplicity,

flexibility and interconnection capabilities XML is a very promising candidate for a

standard language of data exchange.

By studying this paper a fairly deep insight was gained into the challenges present in

the bioinformatics data management. It explains, in detail, these challenges and also

the promise of XML to overcome these challenges. The paper has also been useful

because it compares ODBMS and XML technologies and points out the strengths

and weaknesses of both of these.

2.4 The EMAP Project

As this dissertation is related to the Edinburgh Mouse Atlas Project (EMAP) of the

MRC HGU, the author looked for published literature about the EMAP project that

could help in gaining the basic understanding of the overall project. The EMAP

project has set up a website (http://genex.hgu.mrc.ac.uk) to distribute the information

about its activities. In addition to the general information and publications, the

website provides the outcomes of the EMAP project that have been obtained so far.

Detailed information was found about the EMAP project from a paper titled “EMAP

and EMAGE: A Framework for Understanding Spatially Organized Data” produced

by Baldock et al (2003) [2]. The relevant parts of this paper are briefly reviewed

below.

Baldock et al (2003), in their paper, state that “the EMAP project has implemented a spatio-

temporal framework for capturing spatially organized and mapped data”. This framework

consists of three main components:

1. 3D grey-level (voxel) models of the mouse embryos

2. An anatomical ontology

3. A mapping between the spatial context of the digital model and the textual

context of the anatomy

The paper continues by outlining the details of these components. It describes

various tools (like Atlas browsing tools) and resources that have been developed

under the EMAP project. The information given in this paper about the Edinburgh

Mouse Atlas Gene Expression (EMAGE) database was of particular interest because

the primary aim of this dissertation is to convert the current EMAGE object database

into an XML based database. The EMAGE part of this paper is reviewed in

the next section.

The Baldock et al paper is very useful in knowing the context and background of this

dissertation project. It gives details of the aims and objectives of the EMAP project

and also describes the progress that had been made by the time the paper was

written.

2.5 EMAGE: A Gene Expression Database

The EMAGE database is a major component of the EMAP project. It holds the gene

expression data and is to be converted into an XML database under this dissertation

project. Current version of EMAGE database is available at:

http://genex.hgu.mrc.ac.uk/Emage/database/emageIntro.html

Christiansen et al (2006) describe the EMAGE database in their paper [4]. The paper

states that this database contains data related to gene expression in the mouse

embryo which is gathered using in situ techniques. This data is described by a

combination of associated text and space. The text description presents a list of

anatomical parts where a particular gene expresses itself. The space description is

used for showing the sites of expression in the images and 3D models of the mouse

embryo.

The paper explains the structure of EMAGE database and its content. The EMAGE

database is an integral part of the EMAP framework of Digital Atlas of Mouse

Development. The framework contains the 3D models of the post-implantation

Theiler Stages6 of mouse embryo. The domains (areas or spaces) within these 3D

models are mapped directly with anatomy ontology. As a result, corresponding

anatomical information can be fetched from the database upon selection of a domain

in a 3D model.

6 Theiler Stages describe and encompass gestation period of 18 days in mouse embryo. In total, there are 26 Theiler Stages.

The EMAGE database operates in a client-server environment. The server software

has been developed in C++ while the client software has been developed in Java. The

server software accesses the database and the client software communicates with the

server using Common Object Request Broker Architecture (CORBA).

The authors of the paper report that in July 2005 there were 1905 records in the

EMAGE database. These records pertain to 704 genes and 22 Theiler Stages. Out of

these records, 10% were directly submitted by the individual laboratories; 46% were

submitted by screening consortia and 44% are those that have been previously

published in journals etc. The incorporation of previously published data is done in

cooperation with the GXD7, a gene expression database developed by the Jackson

Laboratory, USA. For this purpose, both these databases (EMAGE and GXD) have a

joint global copyright agreement with the Company of Biologists Ltd and Elsevier

B.V.

The paper also details the methods of interaction with the EMAGE database. The

EMAGE interface has been developed in Java which enables it to run on any

platform (like Windows, MacOSX, Solaris, UNIX and Linux) which has Java 1.4.2 or

higher installed on it. The EMAGE Java interface can be downloaded from the

EMAGE home page http://genex.hgu.mrc.ac.uk/Emage/database. The interface is

downloaded to user’s computer through JavaWebStart which ensures that on each

start the user gets the latest version of the interface software. This interface can be

used for querying and browsing the EMAGE database. It can also create local

database on researchers’ local computers where they can store the partial data while

completing the dataset. Once the dataset is complete, the researchers can submit this

data to the central EMAGE database. The individual laboratories and screening

consortia submit data directly to EMAGE. There is a dedicated Editorial Staff for

EMAGE database which curates the data that comes from various sources.

7 http://www.informatics.jax.org/mgihome/GXD/aboutGXD.shtml

Figure 2: Spatial queries formulated using EMAGE Java interface [4]

The Christiansen et al (2006) paper is a particularly useful source of information that

was found on the subject. It describes the complete setup of the EMAGE database in

a concise form. The paper provides a good understanding of how the EMAGE

database setup was working when the paper was written. This helped in

brainstorming about the solution for the conversion of the EMAGE database into an

XML based database.

Before continuing, a quick review of a part of the Baldock et al (2003) paper would be

appropriate in this context. The paper discusses the EMAGE database and provides

the same information as the Christiansen et al (2006) paper but in more detail. A few

aspects of the overall working of the EMAGE system (e.g. how exactly the textual

and spatial data is mapped with the 3D models) become clearer by studying the

Baldock et al (2003) paper. This paper has provided an insight into the finer details of

the EMAGE functioning. Although these details are not directly related to this

dissertation project, they still helped in visualizing the present system as a whole.

3. Project Description

3.1 Problem Situation

The MRC HGU intends to investigate the feasibility of, and approaches to,

converting the existing EMAGE object database into an XML based database. The

purpose of this changeover is to address some of the concerns and problems which are

being encountered in the present setup of the database and the software client. Some

of these problems are listed below:

• The researchers who submit the data directly to EMAGE have to install the

software client.

• Though the client is based on Java technology and, therefore, is platform

independent (as long as appropriate version of Java is installed), yet the

researchers sometimes face problems due to the lack of the required version of the JVM

or a Java-capable browser.

• Often the results of the experiments come slowly and the researchers want to

store the partial results in their local computers as the results keep trickling in.

They send the data to the central EMAGE database when the dataset is

complete. With the present setup, it is difficult for the researchers to

conveniently keep the partial records of the data on their local machines.

By developing an XML database for gene expressions, complete with a web based

interface, the MRC HGU aims to solve the above problems. In addition, XML

is rapidly becoming a global standard for structured data exchange among the

computing devices. Therefore, it is only rational to keep the data in the latest standard

format. This would make it easier to exchange the data with other organizations,

databases and other computer systems.

3.2 Project Aims and Objectives

To give structure to the project and to ensure that the project is completed within the

given timeframe, the scope of the project was clearly defined. Following aims and

objectives were determined for this dissertation project:

• To analyse the database structure of the existing EMAGE object database

• To develop an equivalent XML schema for the current EMAGE database

structure

• To design a new XML schema for the proposed XML database. This schema

needs to be MISFISHIE standard compliant and will be developed with the

help of MRC HGU staff

• To transform the existing EMAGE data into new XML documents as per the

new XML schema

• To develop an XML based database which would hold the XML documents

• To develop a web based user interface for querying the data present in the

XML database

3.3 Overview of Solution Development

After initial research and brainstorming, it was found that the following solution

could lead to the development of an XML based implementation of EMAGE, as

required:

1. Prepare a typical dataset of about 3000 XML documents drawn from the

existing data in EMAGE. This dataset would be used for the purpose of this

project

2. Develop an equivalent valid XML schema for the existing dataset. This

schema would be used for transformation purpose

3. Develop a MISFISHIE compliant valid XML schema

4. Develop a method to transform the original dataset as per the new

MISFISHIE compliant XML schema

5. Use IBM DB2 to create a database and put the transformed XML documents

into it

6. Use XQuery and SQL/XML to retrieve the required data from the DB2

database

7. Use a server-side scripting language (like JSP, PHP, ASP etc.) to present the

retrieved data to the user

In subsequent chapters, the details of how these steps were implemented are

given.

4. Development of XML Schema

4.1 Existing Data Structure

The existing EMAGE database is an object oriented database which uses the

ObjectStore® Object Database Management System (ODBMS). The ODBMS

databases are designed to efficiently handle the data from applications which are

created using object-oriented programming languages (e.g. C++ and Java). The data

created by such applications is represented in the form of objects and these objects

can directly be stored in the ODBMS.

It was required to know the structure of the current EMAGE object database and to

have the existing data in order to proceed further in the project. For this purpose, the

structure of the current EMAGE database was exported in the form of CORBA

Interface Definition Language (IDL) by the MRC HGU staff. Also, approximately

3600 records were exported from the EMAGE database in the form of XML

documents. This dataset of about 3600 XML documents would be referred to as the

original dataset.

The IDL representation of a SpecimenDetails object is given below (See the

attached CD-ROM for complete EMAGE data structure in IDL form).

struct SpecimenDetails
{
    string species;
    string strain;
    string sex;
    boolean wildType;
    MutantDetailsSeq mutants;
    string stageFormat;
    string stage;
    string assayType;
    string fixationMethod;
    string embedding;
    string clearingMethod;
    string notes;
};

The elements corresponding to variables in an IDL struct were to be created in the

new MISFISHIE compliant XML schema. For example, the species string variable

became the element commonName under the parent element organismType in the

new XML schema.

<xsd:complexType name="organismType">
  <xsd:sequence>
    <xsd:choice>
      <xsd:element name="commonName" type="nonEmptyToken"/>
      <xsd:element name="taxon" type="taxonType"/>
    </xsd:choice>
    <xsd:element name="stage" type="stageSystemType"
                 minOccurs="0" maxOccurs="unbounded"/>
    <xsd:element name="tissue" type="xsd:token" minOccurs="0"/>
    <xsd:element name="strain" type="nonEmptyToken"
                 minOccurs="0"/>
  </xsd:sequence>
</xsd:complexType>

This correspondence between the variables of IDL and the elements of the new

schema was established through a mapping scheme. (See section 5.2.2)

While creating the new XML schema, conventions given below were followed so as

to make the schema easily understandable by present and future developers:

• All the element names in the schema were written using camel case8 with first

letter in lower case (often referred to as lowerCamelCase)

• All the element type names ended in word “Type”. For example,

sequenceType

• nonEmptyToken and nonEmptyString types were used to indicate that

the element must have a value

• Extended and restricted simple types were kept anonymous unless these were

the utility types

• Complex types were always named and never left anonymous

• All element names were spelled in singular form

• minOccurs and maxOccurs were used to show the cardinality (the number

of occurrences that an element should have)

• Instead of ID the name accession was used for the unique identification

fields. Accession is used in most of the bioinformatics applications to give

data a unique identity.

• Union of the elements was used to indicate the preferred or common values

The resultant new XML schema was a fairly complex one and consisted of more than

850 lines of code.

8 In camel case, the words are written joined together with each word’s first letter as capital. E.g., AntibodyAssayType

4.2 XML Schema for Original Dataset

The partial existing dataset of EMAGE records in the form of XML documents was

to be used for the purpose of this project. The dataset consisted of approximately

3600 XML documents. But these documents were not created on the basis of an

XML schema. These were generated by automatically exporting the EMAGE data in

the form of XML documents. This conversion was done according to CORBA

Interface Definition Language, which defined the structure of the EMAGE object

database.

However, it was necessary for the project to have an XML schema for the original

dataset. Without a schema, the dataset could not be validated and mapped with the

new XML schema. To get around this problem, the author initially began to manually

write the schema for the dataset. This schema was to be a simple schema because it

was being written for the data that already existed. No rules or constraints

needed to be enforced in this schema. Simultaneously, the author looked for a tool

which could create a schema for the XML documents of the original dataset.

It was found that EditiX 5.1 is capable of generating a W3C9 standard compliant

XML schema from an existing XML document. After this tool was found, the

schema for the original dataset was generated using this tool. A sample document

from the dataset was provided to this tool as input and it generated a basic schema

for the document.

9 W3C stands for the World Wide Web Consortium. This consortium defines the standards for the World Wide Web. For more information, see http://www.w3.org

Figure 3: Screenshot of the EditiX interface

Figure 4: Screenshot of DTD/Schema menu of EditiX

The automatically generated schema did not define any constraints for the data, nor

was this necessary. However, the generated schema was not entirely correct. A few

modifications were done by the author and also by Dr. Yiya Yang of MRC HGU in

order to make the schema capture the biological information in a more correct way.

4.3 MISFISHIE Standard

The XML schema developed at MRC for the purpose of this project is MISFISHIE

compliant. MISFISHIE stands for “Minimum Information Standard For In Situ

Hybridization and Immunohistochemistry Experiments”. A Minimum Information

Standard (MIS) is an information reporting guideline that specifies the minimum

information required to achieve a particular aim. In the case of MISFISHIE, this aim is

to enable the reproduction of the results of experiments related to in situ

hybridization and immunohistochemistry. The MISFISHIE Standard Working Group

defines the MISFISHIE standard as below:

“MISFISHIE specification details the minimum information that should be provided when

publishing, making public, or exchanging results from visual interpretation-based tissue gene

expression localization experiments such as in situ hybridization, immunohistochemistry, and

reporter construct genetic experiments (GFP/green fluorescent protein, β-galactosidase), etc.” [12]

The structure of information in the current EMAGE database has not been built

according to an international standard. This makes it difficult to exchange

information among various organizations due to lack of compatibility in the

information structure. Therefore, it was decided to base the new XML schema on the

MISFISHIE standard.

The MISFISHIE specification, in its current form, describes the following aspects of

the in situ hybridization and immunohistochemistry experiments:

• Experimental Design

• Specimens

• Probe or Antibody Information

• Staining Protocols and Parameters

• Imaging Data and Parameters; and

• Image Characterization

4.4 Rationale for a New XML Schema

Creation of an XML schema is one of the very first steps in the process of designing

an XML database. The schema defines the structure of the data that is going to reside

in the XML documents. The XML documents should conform to the underlying

XML schema to enable the database and any other application using it to function

smoothly.

The constraints in the existing EMAGE object database were not strictly defined.

As the development of the equivalent XML database was to begin from

scratch, it was prudent to invest some time in coming up with a good functional

schema. The schema designing staff at MRC HGU developed the schema in

consultation with the EMAGE editors and other biologists so that a tightly defined

schema could be developed. Such a schema would enable the database to capture all

the biological information in a meaningful and consistent fashion.

XML schema acts like a template which the associated documents have to comply

with. It is the successor of Document Type Definition (DTD) and is more elaborate

and powerful than DTD. The XML schema allows defining the rules and constraints

which the data must follow. It ensures the consistency in the data. Following are

some of the constraint types that can be applied using the XML Schema Definition

(XSD): [9]

• XSD can define which elements are allowed to appear in an XML document

• The correctness of the data in the XML document can be validated

• Restrictions on data values can be defined. For example, it can be defined that

an element must either have “DNA” or “RNA” as its data. No third value

would be allowed

• Patterns of data can be defined. For example, data in an element must start

with character “t” and end with character “e”

• Data can be converted between data types
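
As an illustration only (this is not part of the project schema), two of the constraint types listed above, a fixed set of allowed values and a pattern, can be mimicked in a few lines of Python. The names ALLOWED_MOLECULES, is_valid_molecule and matches_pattern are hypothetical, and the regular expression is an assumed equivalent of an XSD pattern facet with the value "t.*e"; a real XML schema would express both rules declaratively with xsd:enumeration and xsd:pattern.

```python
import re

# Hypothetical stand-ins for two XSD restriction facets described in the text.
ALLOWED_MOLECULES = {"DNA", "RNA"}   # plays the role of xsd:enumeration
T_TO_E_PATTERN = re.compile(r"t.*e")  # plays the role of xsd:pattern value="t.*e"


def is_valid_molecule(value: str) -> bool:
    """Accept only the enumerated values; no third value is allowed."""
    return value in ALLOWED_MOLECULES


def matches_pattern(value: str) -> bool:
    """XSD patterns apply to the whole value, so fullmatch is used here."""
    return T_TO_E_PATTERN.fullmatch(value) is not None
```

With these definitions, "DNA" passes the enumeration check while "protein" does not, and "tissue" satisfies the pattern while "stage" does not.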

The XML schema is extensible, which means a schema can be used in another schema.

Extensibility also means that the developers can define their own user defined data

types using the standard data types available as part of the XML Schema Definition.

Another good feature of XML schema is that it is written in XML itself. XML

schema supports the namespaces and a variety of standard data types.

4.5 Validation of New XML Schema

The XML schema which was provided to the author by MRC HGU was not a valid

schema and contained syntactical and other errors. To begin with the project, having

a valid new XML schema was mandatory because this schema was the foundation of

the whole project.

There are a number of tools available to work with XML schema. These tools help

users to create and validate schemas. The author downloaded evaluation versions

of various such tools and tested them for their features and user-friendliness. For

validation of the XML schema, Oxygen XML Editor 8.2 was used. It is a very

user-friendly tool but uses considerably more memory than some other tools.

Figure 5: Screen shot of Oxygen XML Editor

Use of Oxygen XML Editor made the task of validation easier as it provides features

like several visual aids and automatic completion of end tags. However, the

process of correcting the schema still took quite some time. It was found, in the end, that

most of the errors reported by the editor were related to the incorrect definition

and use of namespace in the schema. It is important to correctly define namespaces

while developing the schema. Also, from the author’s experience, it is recommended to

check the namespace definition first while debugging. The wrong definition or usage

of namespace gives rise to a large number of errors in the schema and schema editing

tools are unable to identify the wrong definition of namespace as the cause of these

errors.

Concept of Namespace in XML

Namespace in XML is a mechanism of avoiding conflicts among the names of

elements. For instance, consider the following two XML documents:

<table>

<tr><td>5 prime</td></tr>

</table>

and

<table>

<helix>3 prime</helix>

</table>

If these two documents are merged together, there would be a conflict

between the two <table> elements. Such conflicts can be avoided by using

namespace as below:

<htm:table xmlns:htm="http://www.w3.org/namespace1">

<htm:tr><htm:td>5 prime</htm:td></htm:tr>

</htm:table>

and

<x:table xmlns:x="http://www.w3.org/namespace2">

<x:helix>3 prime</x:helix>

</x:table>

Now, both the <table> elements are identified by two different namespace

prefixes “htm” and “x” and there would not be any conflict if the two documents

are merged together. [6] [9]
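
The effect of the prefixes can be checked with a short sketch. It uses Python's standard xml.etree.ElementTree module, which was not used in this project, and it assumes that the prefixes are bound with xmlns:htm and xmlns:x declarations; the wrapping root element and the variable names are hypothetical. After parsing, each element name is expanded to the form {namespace-URI}localname, so the two table elements are no longer the same name.

```python
import xml.etree.ElementTree as ET

# Namespace URIs taken from the example in the text.
HTML_NS = "http://www.w3.org/namespace1"
HELIX_NS = "http://www.w3.org/namespace2"

# Hypothetical merged document: both <table> elements live side by side.
merged = f"""<root xmlns:htm="{HTML_NS}" xmlns:x="{HELIX_NS}">
  <htm:table><htm:tr><htm:td>5 prime</htm:td></htm:tr></htm:table>
  <x:table><x:helix>3 prime</x:helix></x:table>
</root>"""

root = ET.fromstring(merged)

# ElementTree replaces each prefix with its namespace URI in braces.
tables = [child.tag for child in root]

# Each table is addressed unambiguously through its namespace-qualified name.
html_cell = root.find(f"{{{HTML_NS}}}table/{{{HTML_NS}}}tr/{{{HTML_NS}}}td")
helix = root.find(f"{{{HELIX_NS}}}table/{{{HELIX_NS}}}helix")
```

The prefix itself carries no meaning; only the URI it is bound to distinguishes the two vocabularies, which is why a wrong or missing binding produces many seemingly unrelated errors.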

5. Transformation of Original Dataset

After successfully validating the new XML schema, the transformation of the original

dataset according to the new XML schema was the next step. The inputs to this step

were the XML documents in the dataset and the validated new XML schema. As an

output, the XML documents were to be transformed into documents with the new structure.

The data in these documents would remain the same but the names and positions of the

elements would change as per the new schema.

5.1 Investigated Approaches

5.1.1 JDOM based Application

JDOM is a Document Object Model (DOM) designed for the Java platform. It is

freely available in the form of a set of APIs which can be used to develop

applications. The API can be downloaded from http://www.jdom.org website.

JDOM is Java-centric, Java-optimized and combines DOM with Simple API for

XML (SAX). For parsing of XML documents, JDOM uses external parsers and it is

possible to specify a particular parser to be used for the purpose. The XPath and

XSLT support is integrated in JDOM. [13]

This approach was investigated by starting to build a Java application which would use

JDOM. The fundamental logic behind the application was as below:

• Read the source XML document element by element

• Find a new name for the element from an XML document containing

mapping information

• Write the element into a new XML document using data from the source

element and element name from the mapping information.
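
The renaming logic listed above can be sketched as follows. This is a minimal illustration in Python with the standard xml.etree.ElementTree module, not the JDOM application written for the project; the mapping NAME_MAP, the function rename_elements and the sample element names are hypothetical, and the mapping is held in a dictionary rather than in a separate XML document.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from source element names to new element names.
NAME_MAP = {"specimen": "organism", "species": "commonName"}


def rename_elements(source_xml: str, name_map: dict) -> str:
    """Copy a document element by element, renaming elements found in name_map."""
    source_root = ET.fromstring(source_xml)

    def copy(old: ET.Element) -> ET.Element:
        # Look up the new name; elements without a mapping keep their name.
        new = ET.Element(name_map.get(old.tag, old.tag), dict(old.attrib))
        new.text = old.text
        new.tail = old.tail
        for child in old:
            new.append(copy(child))
        return new

    return ET.tostring(copy(source_root), encoding="unicode")
```

For example, rename_elements("<specimen><species>mouse</species></specimen>", NAME_MAP) yields an organism element containing a commonName element. The sketch only renames elements in place; it does not move them to new positions, which is the kind of restructuring the real transformation also required.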

A basic application was developed which worked as per this logic. However, the

XML documents in the original dataset and the expected transformation were far

more complex than could be handled by this basic application. It was also noted that

building an application which could handle the transformation task at hand would

take much more time than was available. As a result, this approach was found to be

unsuitable for this project.

5.1.2 XML Shredding using DB2 9 Database

IBM DB2 9 database management system10 allows already existing XML documents

to be “shredded” into the columns of a relational table. It was thought that this

facility could be used for carrying out transformation. The logic behind this approach

was as below:

• Shred the XML documents into a relational database table. One XML

document would fill one row (record) in the table

• Develop a Java application which would connect to the relational database

table

• Read a record from the relational table and write the values in the record in a

new XML document according to the new element names and positions
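
The three steps above can be illustrated with a small sketch. It is written in Python and uses the standard sqlite3 module as a stand-in for DB2, so it does not show DB2's own shredding functions or the schema annotations they rely on; the table name, the chosen columns and the simplified target structure (stage as plain text) are assumptions made for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical subset of fields from a source record.
COLUMNS = ("species", "strain", "stage")


def shred(conn: sqlite3.Connection, documents: list) -> None:
    """Store each XML document as one row of a relational table."""
    conn.execute("CREATE TABLE specimen (species TEXT, strain TEXT, stage TEXT)")
    for doc in documents:
        root = ET.fromstring(doc)
        row = tuple(root.findtext(col) for col in COLUMNS)
        conn.execute("INSERT INTO specimen VALUES (?, ?, ?)", row)


def rebuild(conn: sqlite3.Connection) -> list:
    """Read each row back and write it out under new element names and positions."""
    rebuilt = []
    query = "SELECT species, strain, stage FROM specimen"
    for species, strain, stage in conn.execute(query):
        organism = ET.Element("organism")
        ET.SubElement(organism, "commonName").text = species
        ET.SubElement(organism, "stage").text = stage
        ET.SubElement(organism, "strain").text = strain
        rebuilt.append(ET.tostring(organism, encoding="unicode"))
    return rebuilt


conn = sqlite3.connect(":memory:")
shred(conn, [
    "<SpecimenDetails><species>mouse</species>"
    "<strain>CD1</strain><stage>TS12</stage></SpecimenDetails>"
])
documents = rebuild(conn)
```

In this sketch the column list is written by hand, which is only practical for a flat record with a few fields.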

This approach was also found unsuitable because it needed the XML schema for the

original dataset to be annotated. The shredding functions of DB2 database shred an

10 See section 6.1 for more details

XML document as per the annotations given in the schema. It was estimated that

the time required for the correct annotation of the schema might exceed the time at

hand. Therefore, this approach was also found to be unsuitable for this project.

5.1.3 XSL Transformation

XSL is a language that is used for formatting XML data, for example in an HTML web page. XSL

Transformation (XSLT) is a mechanism of transforming the XML document into

other formats like HTML, XHTML or XML. This approach was found to be suitable

for the project because it required the least amount of time among the approaches that

were considered.

5.2 XSL Transformation

XSL stands for Extensible Stylesheet Language. XSL Transformation (XSLT) is a

language that is used for transforming the XML data into various other formats

including XML. The XSLT itself is XML based. During transformation the source

XML document remains unchanged and the target document is created using only the

content of the source document.

5.2.1 Rationale of Using XSL Transformation

After investigating different approaches described in the previous section, it was

found that using XSL transformation was the most suitable method to perform the

task at hand. The decision of selecting XSL transformation was based on many

advantages that this approach provides. These advantages are explained below.

The foremost reason for using XSL transformation was that it was the quickest way to transform the XML documents in the original dataset. The dissertation project had to be completed in a limited amount of time and meeting the deadline was of utmost importance, so the approach that took the least time to do the desired work was the most suitable one. Tools are available which help the user to create XSLT code from a mapping between two XML schemas; this code is then used to transform the original XML source documents. The two XML schemas used in this project were large and complex, with a large number of mappings between them, so the resulting XSLT code was bound to be large too and producing it correctly by hand would have been time-consuming. The availability of mapping tools saved the effort of developing code that was repetitive in nature and did not require complex logic.

XSL transformation is a standard method of transforming XML documents, which is why tools have been built to make XSL transformations easier. There is a well-established W3C standard for XSLT, which provides variables, conditional constructs such as if and choose/when, and loop constructs such as for-each. These facilities make XSLT a full-fledged programming language for writing transformation code.

Another advantage of XSLT is that the code can be modified to generate different kinds of target formats from the same source XML. This dissertation project required an XML-to-XML transformation but, if needed, the XSLT code could be adapted to transform the original XML data into other formats such as HTML, XHTML or (through XSL-FO) PDF.

5.2.2 Mapping Scheme

To produce the XSLT code, it was necessary to know which elements and attributes of the schema for the original source documents map to which elements and attributes of the new XML schema. For this purpose, a mapping scheme was provided to the author by the staff at the MRC HGU. The scheme assigned numbers to the IDL variables in the IDL structure of the existing EMAGE database, and the corresponding elements in the new XML schema were marked with the same numbers.

A small portion of this mapping scheme is given below to demonstrate how the source schema was mapped to the target schema.

// details for any publication

struct PublicationDetails

{

string authors; ............. 56

string journal; ............. 57

string title; ............. 58

string volume; ............. 59

string issue; ............. 60

unsigned short year; ............. 61

string pages; ............. 62

string accessionNo; ............. 63

};

Snippet: A portion of the IDL structure showing mapping numbers in bold

<xsd:complexType name="publicationType">
  <xsd:sequence>
    <xsd:element name="author" type="nonEmptyToken"/> ............. 56
    <xsd:element name="journal" type="xsd:token" minOccurs="0"/> ............. 57
    <xsd:element name="title" type="nonEmptyToken"/> ............. 58
    <xsd:element name="volume" type="xsd:token" minOccurs="0"/> ............. 59
    <xsd:element name="issue" type="xsd:token" minOccurs="0"/> ............. 60
    <xsd:element name="year" type="xsd:token"/> ............. 61
    <xsd:element name="page" type="xsd:token" minOccurs="0"/> ............. 62
    <xsd:group ref="accession" minOccurs="0"/> ............. 63
  </xsd:sequence>
</xsd:complexType>

Snippet: A portion of the new XML schema showing mapping numbers in bold

As this portion of the mapping scheme shows, the IDL variable numbered 56 (authors) maps to the author element within the publicationType complex type of the new XML schema.
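
The numbered mapping can be made concrete with a small sketch. The Java program below uses the standard JAXP API (javax.xml.transform) to apply a hand-written stylesheet covering mapping numbers 56 to 58 only. It assumes, purely for illustration, that a source document exposes the IDL variables as child elements of a <PublicationDetails> element and that the target element is called <publication>; neither name is taken from the actual datasets, and the stylesheet is not the one generated for the project.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class PublicationMappingSketch {

    // Hypothetical stylesheet: source element names follow the IDL variables,
    // target element names follow publicationType (mapping numbers 56-58 only).
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:template match='/PublicationDetails'>"
      + "<publication>"
      + "<author><xsl:value-of select='authors'/></author>"                  // 56
      + "<xsl:if test='journal'>"                                            // 57 is optional
      + "<journal><xsl:value-of select='journal'/></journal>"
      + "</xsl:if>"
      + "<title><xsl:value-of select='title'/></title>"                      // 58
      + "</publication>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies the stylesheet to the given source XML and returns the target XML as text. */
    static String transform(String sourceXml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            // The source is only read; the result is written to a separate object.
            t.transform(new StreamSource(new StringReader(sourceXml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String source = "<PublicationDetails><authors>Smith J</authors>"
                      + "<title>A title</title></PublicationDetails>";
        System.out.println(transform(source));
    }
}
```

In this sketch the journal element (57) is written only when it is present in the source, which mirrors the minOccurs="0" setting in the new schema.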

5.2.3 Mapping Tools

Two tools for mapping between XML schemas were tested: Altova® MapForce 2007 Enterprise Ed. and Stylus Studio® 2007 XML Enterprise Suite. Both provide similar functionality and a graphical interface for mapping. Stylus Studio is an integrated tool for working with different XML technologies, while MapForce is designed specifically for mapping. The author downloaded the evaluation versions of both tools from the websites of their respective makers and explored the facilities they provide. In the end, the author chose Altova® MapForce for the development of the XSLT code. The reasons behind this decision were:

• Stylus Studio consumed a lot of computer resources and took a long time to start on the author’s machine.

• The author was using evaluation versions of both tools under academic licenses. The makers of Stylus Studio did not extend the evaluation period beyond 30 days, after which the software could not be used further. The makers of MapForce, however, extended the academic license.

• In general, in the author’s experience, MapForce was found to be more user-friendly.

MapForce accepts a source and a target schema from the user and presents the elements and attributes of both schemas on the screen. The user can drag an element or attribute of the source schema and drop it on the corresponding element or attribute of the target schema to make a connection between them. The connection indicates the mapping between the two.

This tool also provides a graphical drag-and-drop interface for introducing:

• String Functions including concatenate string, find sub-string, find string

length etc.

• Functions for testing if a node exists or does not exist

• Math functions like add, divide, multiply, subtract, ceiling, floor etc.

• Logical functions including equal-to, less-than, greater-than etc.

• Conversion functions for converting data to string, number and boolean

values

• XSLT-specific functions like current, document, element-available, generate-id etc.

• Constructs like if, when and filter nodes

Figure 6: Screenshot of a portion of Mapping generated using MapForce

Mapping tools generate XSLT code in the background as the user proceeds with the mapping. MapForce can generate code in several languages: XSLT 1.0, XSLT 2.0, XQuery, Java, C# and C++.

Another very useful feature of MapForce is that it can accept a sample XML document based on the source schema. As the user proceeds with the mapping between the source and target schemas, the software can then instantly show how the sample XML document will be transformed on the basis of the defined mapping.

5.2.4 Automated Code Generation

Although software like MapForce is very helpful in generating XSLT code from a mapping, it could not always generate the code as desired. In such cases, the automatically generated code has to be edited manually after the mapping work is finished to make it fit for purpose. In this project, the automatically generated XSLT code missed some important information when it transformed the original dataset, and this was resolved by changing the code manually. Later, some further changes were made by the MRC staff in consultation with biologists in order to capture the required biological meaning in the XML data.

The code generated by MapForce was also unnecessarily complicated, partly because all of the XSLT code was created in a single template. The author believes that it would be a good idea to develop mapping software that generates XSLT code using a template-based approach. Such software should allow the user to create templates and then define the mapping on the basis of these templates instead of on an element-to-element basis. Element-to-element mapping is easier to do and seems intuitive, but the resulting code is far more complicated than it should be. Towards the end of the dissertation, the MRC HGU staff simplified the XSLT code by breaking it into templates, which should make the code easier for them to maintain and develop further after this dissertation project is over.

5.2.5 Transformation Process

Once the required XSLT code was ready and validated, the next step was to use it to transform the approximately 3600 XML documents in the original dataset. Transforming one document at a time would have taken too long, so the author looked for a tool that could carry out batch transformation.

Altova® XMLSpy 2007 Professional Ed. is a tool that can carry out batch XSL transformation. To do this, the user creates a new project in XMLSpy and adds the directory of the source XML documents to the “XML Files” folder of the project. In the properties of the project, the user can specify the XSL file to be used for the transformation. In addition, the location of an XML schema can be specified, against which the transformed XML documents are validated.
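
Batch transformation of this kind can also be scripted. The following is a minimal sketch in Java, not the XMLSpy mechanism used in the project: it compiles a stylesheet once with the standard JAXP API and applies it to every .xml file in a directory. The class name, method name and directory layout are assumptions made for the example, and schema validation of the output is left out.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class BatchTransformSketch {

    /**
     * Transforms every *.xml file in sourceDir with the given stylesheet and
     * writes each result, under the same file name, to targetDir.
     * Returns the number of documents transformed.
     */
    static int transformAll(Path sourceDir, Path targetDir, Path stylesheet) {
        try {
            // Compile the stylesheet once and reuse it for every document.
            Templates compiled = TransformerFactory.newInstance()
                .newTemplates(new StreamSource(stylesheet.toFile()));
            Files.createDirectories(targetDir);
            int count = 0;
            try (DirectoryStream<Path> files = Files.newDirectoryStream(sourceDir, "*.xml")) {
                for (Path source : files) {
                    Path target = targetDir.resolve(source.getFileName());
                    try (OutputStream out = Files.newOutputStream(target)) {
                        compiled.newTransformer().transform(
                            new StreamSource(source.toFile()), new StreamResult(out));
                    }
                    count++;
                }
            }
            return count;
        } catch (IOException | TransformerException e) {
            throw new IllegalStateException("Batch transformation failed", e);
        }
    }

    public static void main(String[] args) {
        if (args.length != 3) {
            System.err.println("usage: BatchTransformSketch <sourceDir> <targetDir> <stylesheet.xsl>");
            return;
        }
        int n = transformAll(Paths.get(args[0]), Paths.get(args[1]), Paths.get(args[2]));
        System.out.println(n + " documents transformed");
    }
}
```

Compiling the stylesheet into a Templates object once avoids re-parsing it for each of the several thousand source documents.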

6. Database Preparation and Querying

Having transformed the original XML documents derived from the EMAGE object database, the next step was to prepare a new database to store the transformed documents. IBM DB2 9 was chosen as the platform. The choice was not difficult because MRC HGU had already acquired and set up a DB2 9 server on their premises, so using the same platform meant that the results of this project could easily be migrated to the MRC HGU server. This was not the only reason, however: this latest version of the DB2 database management system offers strong support for XML data management.

6.1 Why IBM DB2 9

The DB2 9 database management system from IBM comes with pureXML™ technology, which changes the way XML data is stored and managed. pureXML™ is designed to overcome various long-standing problems in the storage and retrieval of XML data.

Traditionally, XML data management has involved one or more of the following approaches: [14]

• Store XML documents in the file system

• Stuff the XML data into relational databases using large objects (LOBs)

• Shred the XML data among different columns in relational tables

These are the obvious approaches, but they often fail to perform well. The file system approach is easy but, as the number of documents grows, a file system does not scale like a database, and searching through a large number of documents in the file system is very slow. Features like concurrency, security, recoverability and usability are also not available to data stored in the file system.

Stuffing XML data into a VARCHAR column or large objects (like CLOB or BLOB) in a relational database overcomes some of the concerns related to storage in the file system. But the issue of low performance remains, because LOBs are good only if the whole XML document is to be retrieved from the database. Searching for portions of XML (like elements, attributes or sub-trees) is still a tedious task, as all the documents need to be scanned at run-time in order to perform the search. [14]

Shredding is the process of decomposing XML documents so that the decomposed portions of a document are stored in the columns of relational tables. To achieve this, the XML schema is annotated with mapping information, which the shredding facility then uses to store the portions of XML in the appropriate columns. The normalization rules of relational design and the complexity of the XML document may cause the document to span a large number of columns. Retrieving data or reconstructing the XML document then requires complex queries, and sometimes reconstruction may even become impossible. [14]

To learn about the capabilities and features of DB2, a paper titled “DB2/XML: Designing for Evolution” by Beyer, Özcan, Saiprasad and Linden (2005) was studied [1]. The paper states that “DB2 provides native XML storage, indexing, navigation and query processing through both SQL/XML and XQuery”. DB2 can store XML data in a column of a relational table. XML data stored in this way preserves all the information in the XQuery data model, which means that DB2 supports XML fidelity. DB2 can also shred XML data into relational form, thus supporting relational fidelity, and it supports textual fidelity by allowing XML data to be stored in CLOB columns. An XML column in a DB2 table does not have to be associated with an XML schema; validation can be done when data is inserted or at query time.

The paper notes that retrieval of data from large XML documents is slow because XML itself does not provide any indexing capability. In DBMS products supporting XML, it is possible to create indexes on entire XML documents, but data retrieval still remains quite slow. DB2 overcomes this problem by indexing on XPath expressions instead of indexing the whole document, which makes queries execute significantly faster. The paper claims that XML support in DB2 was designed with evolution in mind: design decisions were taken so that XML schemas can be enhanced as and when required. The paper then explains these features of DB2 by means of a case study.

This paper gave the author a good understanding of DB2 features with respect to XML. Although it is not written for beginners in the field of DB2/XML, it is useful for gaining insight into the issues of XML data management and how DB2 claims to resolve them. DB2 9 with pureXML™ technology was released after the publication of this paper, and some of the concerns the paper raises regarding XML data management have been addressed more efficiently by pureXML™.

To acquire a more practical understanding of DB2, courseware from IBM Corporation [7] on DB2 9 was studied. This courseware is detailed and suitable for those who are new to DB2 or XML databases. It provides information not only on the functioning of DB2 but also on topics such as XML concepts, XPath and XQuery. The pureXML™ technology of DB2 9 is a new approach to storing XML data natively in databases. This technology is discussed in the next section.

6.1.1 pureXML™ Technology

The pureXML™ technology in IBM DB2 9 provides an XML data type to store XML data natively. It also provides efficient data management techniques for storing the hierarchical structures that are commonly present in XML data. The pureXML™ technology used in DB2 9: [7]

• Provides seamless consolidation of diverse data sources

• Provides XQuery/SQL interface, which enables faster and easier

development than the previously available methods

• Eliminates need for proprietary software to shred the XML data, which

means XML searches have become faster

• Provides flexible XML schema, which makes changes in the schema much

quicker

• Assists in conversion of data available as .DOC, .XLS and .PPT formats into

XML format

• Has XML support in all the APIs

6.2 Insertion of XML into DB2 Database

In order to query the XML documents through a web interface, it was important to

put all these documents in a relational database. This way, the web application would

be able to connect to the database server via the APIs like JDBC/ODBC and access

the XML data stored in the database.

For this purpose, a DB2 9 database was created and in the database a simple table

was created with the following structure:

Figure 7: Structure of the relational table created to hold the XML documents

The structure of the database table is simple because all the data actually resides in the XML documents; the table merely holds the XML documents in a relational database. Therefore, one column, XMLDOC, was created to hold the XML documents and another, RECID, to hold the unique identifier of each XML document. Once the database was ready, a simple Java application was written to insert records into the relational table.

At this point, a problem occurred which took significant time to solve. When the Java application attempted to connect to the DB2 9 database, the following error was shown at runtime:

java.lang.ClassNotFoundException: COM.ibm.db2.jdbc.app.DB2Driver

ClassNotFoundException: COM.ibm.db2.jdbc.app.DB2Driver

A ClassNotFoundException is thrown by the Java Runtime Environment when it is not able to find the definition of a class which is being called by the code. The DB2 9 database server comes with a driver which can be used to connect to a database via JDBC. The driver class, DB2Driver, is found in the package COM.ibm.db2.jdbc.app. It is stored in a zip file called db2java.zip, which can be found in the installation directory of the DB2 9 server. On the Windows XP machine used by the author, the location of this file was:

C:\Program Files\IBM\SQLLIB\java\db2java.zip

If this error occurs, in the author’s experience, the first thing to check is whether the location of the db2java.zip file is included in the CLASSPATH environment variable. If it is not, the CLASSPATH variable should be edited to include the location of the zip file. The Java Runtime uses the CLASSPATH variable to find the classes that are used in a Java program.

The problem, however, could not be solved even after setting the CLASSPATH variable correctly. The author was using the Eclipse platform for Java development, and it was found that in Eclipse any referenced files must be added to the reference library of the Eclipse project. To add the db2java.zip file to the reference library, the author followed these steps:

Go to Project Properties of the project > Select Java Build Path > Add External Jar Files.

Zip files can also be added using the “Add external Jar files” option. Once the db2java.zip file was added to the reference library, the Java Runtime was able to find the driver required to connect to the DB2 9 database. Apparently, when Java programs are run from within the Eclipse platform, the referenced files have to be added to the reference library, whereas the CLASSPATH variable must be set correctly if the application is run from outside the Eclipse environment.
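
A quick way to separate classpath problems from other connection problems is to test whether the driver class can be loaded at all. The sketch below is an illustrative helper: the class and method names are hypothetical, and only the driver class name comes from the text above. It also prints the classpath that the running JVM actually sees, which may differ from the CLASSPATH environment variable when a program is launched from an IDE.

```java
class DriverCheckSketch {

    /** Returns true if the named class can be loaded from the current classpath. */
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Not found, or found but not loadable (for example a missing dependency).
            return false;
        }
    }

    public static void main(String[] args) {
        String driver = "COM.ibm.db2.jdbc.app.DB2Driver"; // driver class named in the text
        System.out.println(driver + " available: " + isOnClasspath(driver));
        System.out.println("java.class.path = " + System.getProperty("java.class.path"));
    }
}
```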

It is useful to mention here that the XML documents were inserted into the XML column of the table as a binary stream. The Java program used a FileInputStream to open the XML file as a stream, and this input stream was then passed to the XML column. The following snippet shows the code.

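// Requires java.io.* and java.sql.*; con is an open java.sql.Connection.
String strSQL = "INSERT INTO LK.EMAGEXML VALUES (?, ?)";
File xmlFile = new File(fileLoc);
// try-with-resources closes the statement and the file stream when done
try (PreparedStatement ps = con.prepareStatement(strSQL);
     InputStream in = new FileInputStream(xmlFile)) {
    ps.setString(1, fileNameForID);                      // RECID
    ps.setBinaryStream(2, in, (int) xmlFile.length());   // XMLDOC
    ps.executeUpdate();
} catch (SQLException | IOException e) {
    e.printStackTrace();
}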

6.3 Querying of XML Data

6.3.1 XQuery Language

XQuery is the language used for querying XML data. What SQL is for a Relational Database Management System (RDBMS), XQuery is for XML data. XQuery uses XPath and FLWOR expressions to locate data in an XML document.

XPath is the syntax for navigating an XML document. It uses path expressions, similar to the directory paths of a computer file system, to select one or more nodes in the document. For example, consider the following XML snippet:

<specimen>
  <organism>
    <commonName>mouse</commonName>
    <stage>
      <name>dpc</name>
      <value>9.5</value>
    </stage>
    <strain>-</strain>
  </organism>
  <type>whole mount</type>
  <sex>unknown</sex>
  <genotype>
    <wildType>true</wildType>
  </genotype>
</specimen>

To select all the <wildType> nodes, the XPath would be:

/specimen/genotype/wildType

The initial forward slash (/) represents the root of the XML document. XPath has more than a hundred built-in functions which make node selection and other tasks easier. For example, to get the text present within the <value> node, text() could be used:

/specimen/organism/stage/value/text()

Similarly, the last() function can be used in a predicate to select the last <specimen> element among its siblings:

/specimen[last()]
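
The path expressions above can be tried out from Java with the javax.xml.xpath API that ships with the JDK. The sketch below is only an illustration: the class and method names are made up, and the JDK implementation supports XPath 1.0, so functions introduced in later XPath versions are not available in it. The last expression in main uses a numeric predicate to keep the specimen only if its stage value is at least 9.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class XPathSketch {

    // The specimen snippet from the text, written on one line.
    static final String SPECIMEN_XML =
        "<specimen><organism><commonName>mouse</commonName>"
      + "<stage><name>dpc</name><value>9.5</value></stage><strain>-</strain></organism>"
      + "<type>whole mount</type><sex>unknown</sex>"
      + "<genotype><wildType>true</wildType></genotype></specimen>";

    /** Evaluates an XPath 1.0 expression and returns the string value of the result. */
    static String evaluate(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Cannot evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(evaluate(SPECIMEN_XML, "/specimen/genotype/wildType"));
        System.out.println(evaluate(SPECIMEN_XML, "/specimen/organism/stage/value/text()"));
        System.out.println(evaluate(SPECIMEN_XML,
            "/specimen[organism/stage/value >= 9]/organism/commonName"));
    }
}
```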

XPath syntax comes in very handy while working with XQuery, which uses XPath expressions to select the required nodes in the XML document. The FLWOR (For, Let, Where, Order by and Return) expressions of XQuery make it easy to manipulate the nodes and data selected by the XPath expressions. A simple example is given below:

for $x in doc("EMAGE_100.xml")/specimen
let $cName := $x/organism/commonName
where $x/organism/stage/value >= 9
return $cName

This XQuery expression first opens the “EMAGE_100.xml” document and iterates over the <specimen> elements within it. For each <specimen> element, it binds the commonName element to the variable $cName and returns $cName only if the stage value (given in dpc in this example) is 9 or higher. The comparison is numeric, so that a value such as 10.5 is also matched.

6.3.2 SQL/XML

SQL/XML is an extension of the SQL standard. It provides several functions for constructing and querying XML data in SQL queries. With the introduction of the XML data type in several commercial databases, retrieving and manipulating XML data became a major concern [17], and the SQL/XML extension answers this concern. Its functions can be used in combination with SQL queries as well as XQuery expressions. For instance:

SELECT RECID,
       XMLQUERY('$t/hguMrcSubmission/inSituAssay/entityBeingDetected/symbol/text()'
                PASSING XMLDOC AS "t") AS xmltxt
FROM LK.EMAGEXML

The above SQL query retrieves data from an XML column alongside a column of an ordinary data type. RECID is a VARCHAR column, whereas the data type of the XMLDOC column is XML. The SQL SELECT statement retrieves data from RECID directly, while the data from the XMLDOC column is retrieved using the XMLQUERY function of the SQL/XML extension. The following query shows how an SQL SELECT statement and an XQuery FLWOR expression can be used together (the backslashes appear to be escape characters carried over from the PHP string in which the query was embedded):

SELECT XMLSERIALIZE(XMLQUERY('for \$doc in \$t/hguMrcSubmission
    let \$status:=\$doc/annotation/expressionByOntology/expression/strength/text()
    where \$status=\"detected\"
    return \$doc/inSituAssay/entityBeingDetected/symbol/text()'
    PASSING XMLDOC AS \"t\" RETURNING SEQUENCE) AS CLOB (32K))
FROM LK.EMAGEXML

6.4 User Interface Development

Although IBM DB2 9 is a good candidate for the development of an XML database, in the author’s experience working with DB2 was a challenging task. Many DB2-related issues slowed the progress of the project. It took significant time to resolve these issues, and sometimes an issue could not be resolved and an alternative approach had to be adopted in order to finish the project within the available time.

6.4.1 Java Server Pages (JSP)

For the development of the user interface, a web application based on Java Server Pages (JSP) was tried first. The setup involved the Tomcat web server, JDBC and DB2. Tomcat (http://tomcat.apache.org) is a freely available application server which implements the JSP and Servlet specifications and is often used for JSP/Servlet applications. After installing Tomcat, the necessary configuration was done; the instructions for configuring Tomcat come with the installation package. The web server was installed successfully and served simple JSP pages without any problem. However, when code to connect to the DB2 database was included in a JSP page, the web server could not serve the page. The author searched the Internet for ways to solve this DB2/Tomcat connection problem, but it could not be solved. In the author’s experience, connecting to database management systems like SQL Server or Oracle through Tomcat is a rather straightforward procedure, but with DB2 it was not as straightforward.

The first problem which appeared while connecting with DB2 database through

Tomcat was in the form of following exception:

java.lang.ClassNotFoundException:

com.ibm.db2.jdbc.app.DB2Driver

It was learnt that the DB2 driver comes in the following two versions:

Application driver: com.ibm.db2.jdbc.app.DB2Driver

Network driver: com.ibm.db2.jdbc.net.DB2Driver

The network driver is used for connecting to DB2 from Java applets, whereas the application driver is used by local applications. Since the JSP application in this dissertation needed to connect to DB2 locally, the application driver was used. The ClassNotFoundException mentioned above was resolved by including the db2java.zip file in the Eclipse environment (see section 6.2 for more details). This, however, did not help much, because the following exception then appeared:

SQLException: [IBM][JDBC Driver] CLI0627E The result set is not scrollable

This exception arises when the JSP code tries to navigate through the resultset (see footnote 11) obtained from the DB2 database. Navigation through the resultset is very important and the intended web application could not be developed without this facility. The only possible solution for this error that could be found in the IBM DB2 manual and on the Internet was that the resultset should be created as a scrollable one (here, scroll-insensitive). This could be done as below:

Statement stmtObj =
    con.createStatement(ResultSet.TYPE_SCROLL_INSENSITIVE,
                        ResultSet.CONCUR_UPDATABLE);

The problem of resultset scrolling, however, was not solved even with the above change. Further research into the problem yielded no success and, as a result, the JSP implementation of the user interface was taking too much time. At this point, in order to complete the project in time, the author decided to investigate an alternative approach and develop the user interface using PHP instead of JSP.

11 A resultset is an object which contains the records matching with the criteria specified in the query.

6.4.2 Hypertext Preprocessor (PHP)

The decision to use PHP, taken after much time-consuming research into the Tomcat/JDBC/DB2 problems, was made because PHP is one of the most widely used server-side scripting languages. It was hoped that PHP would have a readily available solution for DB2 connectivity.

PHP is a language used for developing dynamic web pages. To make use of it, a PHP-enabled web server must be installed. The author used the Apache web server, which is freely available from the Apache Software Foundation’s website (http://httpd.apache.org/download.cgi). In addition, PHP needs to be downloaded from http://www.php.net and installed. In order to connect to the DB2 database, a PECL extension also needs to be installed on the computer where PHP has been installed. PECL stands for PHP Extension Community Library; it is a mechanism for distributing PHP extensions. The PECL extension required for working with DB2 is ibm_db2, which can be downloaded from http://pecl.php.net/package/ibm_db2. The configuration of a PHP installation resides in the php.ini file. The extension should be installed in the directory where PHP has been installed (for PHP 5 on a Windows machine, the ibm_db2 extension consists of one DLL file which needs to be copied to the PHP directory). The following changes need to be made in the php.ini configuration file to make PHP work with DB2:

• Add the ibm_db2 extension so that it is loaded whenever PHP is loaded in

the web server. (Syntax is extension=php_ibm_db2.dll )

• include_path variable should be correctly set to the locations where

required include files are located. The location of ibm_db2 PECL must be in

this path

Besides this, it is also recommended to set the following variables as shown. This

helps in tracking errors in the PHP pages.

• display_errors = On

• error_reporting = E_ALL

• log_errors = On

When PHP is installed, it automatically changes the Apache web server configuration file so that Apache loads the PHP module whenever it starts. If there is any problem, it is recommended to check whether Apache is loading the PHP module. This can be ascertained by checking that the following line is present in the Apache configuration file (httpd.conf):

LoadModule php5_module "C:\\Program Files\\PHP\\php5apache2_2.dll"

This line tells Apache to load the PHP 5 module from the given location when it starts.

6.5 User Interface for a Few Queries

The development of a full-fledged query interface was not part of this dissertation. Instead, MRC HGU suggested a small set of queries that could be implemented, together with a user interface, for demonstration purposes. These queries were provided as plain English questions, for which appropriate XQueries had to be prepared and executed against the XML database. The user interface for entering the query criteria and displaying the returned data was prepared using PHP. Details of these queries and the user interface are given below.

Query 1: What genes are detected (or not detected) in an anatomical structure

between a specified range of Theiler Stages?

This query has three inputs: the status of detection of gene expression, the anatomical structure and the range of Theiler Stages. The anatomical structure was to be entered as a reference, because the EMAGE database stores anatomy as references to the EMAP Mouse Atlas. In the complete system, the user would enter the reference, and the name of the anatomical structure, images and other details would be fetched from the Mouse Atlas and shown in the result. However, this connectivity between the EMAGE XML database and the EMAP Mouse Atlas was not to be developed as part of this dissertation (it would have required using web services). Consequently, the query takes a reference EMAP ID as input from the user and returns a record if it has a matching EMAP ID associated with it. To receive the inputs from the user, the following interface was developed:

Figure 8: Input screen for query about gene expression detection

When the user enters the query input through this interface and clicks the “Search” button, the result of the query is displayed as below:

Figure 9: Output of the gene expression detection query

The XQuery code which brings the results for this query from the XML database is

given below:

    $query = "SELECT XMLSERIALIZE(XMLQUERY('for \$doc in \$t/hguMrcSubmission
    let \$status:=\$doc/annotation/textAnnotation/expressionByOntology/expression/strength/text()
    let \$anatomyref:=\$doc/annotation/textAnnotation/expressionByOntology/accession/text()
    let \$TStage:=\$doc/annotation/referenceStage/value/text()
    let \$id:=\$doc/@accession
    let \$DPCStage:=\$doc/specimen/organism/stage/value/text()
    let \$inSituAssayPresence:=if(fn:exists(\$doc/inSituAssay/firstLabel/text())) then(\"ISH\") else(\"\")
    let \$antibodyAssayPresence:=if(fn:exists(\$doc/antibodyAssay/firstLabel/text())) then(\"IHC\") else(\"\")
    let \$reporterAssayPresence:=if(fn:exists(\$doc/reporterAssay/firstLabel/text())) then(\"ISR\") else(\"\")
    let \$inSituSymbol:=if(fn:exists(\$doc/inSituAssay/entityBeingDetected/symbol/text())) then(\$doc/inSituAssay/entityBeingDetected/symbol/text()) else(\"\")
    let \$antibodySymbol:=if(fn:exists(\$doc/antibodyAssay/entityBeingDetected/symbol/text())) then(\$doc/antibodyAssay/entityBeingDetected/symbol/text()) else(\"\")
    let \$reporterSymbol:=if(fn:exists(\$doc/reporterAssay/entityBeingDetected/symbol/text())) then(\$doc/reporterAssay/entityBeingDetected/symbol/text()) else(\"\")
    let \$specimenType:=\$doc/specimen/type/text()
    let \$inSituProbeID:=if(fn:exists(\$doc/inSituAssay/detectionReagent/name/text())) then(\$doc/inSituAssay/detectionReagent/name/text()) else()
    let \$antibodyProbeID:=if(fn:exists(\$doc/antibodyAssay/detectionReagent/name/text())) then(\$doc/antibodyAssay/detectionReagent/name/text()) else()
    let \$genotype:=if(\$doc/specimen/genotype/wildType/text()) then(\"wild-type\") else(\"mutant\")
    where \$status=\"$detection\" and ";
    if($anatomaicalname!="")
    {$query=$query."\$anatomyref=\"$anatomaicalname\" and ";}
    $query=$query."\$TStage>=\"$TSfrom\" and \$TStage<=\"$TSto\"
    return <tr>
    <td class=\"tab_item\">{\$inSituSymbol}{\$antibodySymbol}{\$reporterSymbol}</td>
    <td class=\"tab_item\">{data(\$id)}</td>
    <td class=\"tab_item\">{\$inSituProbeID}{\$antibodyProbeID}</td>
    <td class=\"tab_item\">TS{\$TStage}</td>
    <td class=\"tab_item\">{\$DPCStage}dpc</td>
    <td class=\"tab_item\">{\$inSituAssayPresence}{\$antibodyAssayPresence}{\$reporterAssayPresence}</td>
    <td class=\"tab_item\">{\$specimenType}</td>
    <td class=\"tab_item\">{\$genotype}</td>
    </tr>' PASSING XMLDOC AS \"t\" RETURNING SEQUENCE) AS CLOB (32K)) FROM LK.EMAGEXML";
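The way the PHP fragment above assembles the where clause, adding the anatomy condition only when the user supplied a term, can be sketched in isolation. The following is a minimal Python illustration rather than the code used in the project: the function name `build_where_clause` and the sample anatomy term are hypothetical, and the sketch only builds the text of the clause; it does not contact DB2.

```python
def build_where_clause(detection, anatomy, ts_from, ts_to):
    """Assemble the XQuery where clause in the same order as the PHP fragment.

    All names here are illustrative; the parameters stand in for the PHP
    variables $detection, $anatomaicalname, $TSfrom and $TSto.
    """
    conditions = [f'$status="{detection}"']
    # The anatomy condition is appended only when the user entered a term.
    if anatomy != "":
        conditions.append(f'$anatomyref="{anatomy}"')
    conditions.append(f'$TStage>="{ts_from}"')
    conditions.append(f'$TStage<="{ts_to}"')
    return "where " + " and ".join(conditions)


# Without an anatomy term the clause has three conditions, with one it has four.
print(build_where_clause("detected", "", "10", "12"))
print(build_where_clause("detected", "ANAT:1", "10", "12"))  # "ANAT:1" is a made-up term
```

Collecting the conditions in a list and joining them with " and " avoids having to track trailing "and" keywords by hand, which is the delicate part of the string concatenation approach.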

Query 2: How many fully (or partially) sequenced assay records are available for different types of assays?

This query has two inputs: the sequence status (fully or partially sequenced) and the type of assay (ISH, IHC or ISR)¹². To receive these inputs from the user, the following interface was developed:

¹² ISH is used for in situ assays, IHC for antibody assays and ISR for reporter assays.

Figure 10: Input screen for query that counts the fully or partially sequenced assays

After the user selects the sequence status and assay type and clicks the “Search” button, the query reports the count of the matching records:

Figure 11: Output of the query that counts the fully or partially sequenced assays

The XQuery code that generates the above output is given below:

    SELECT XMLSERIALIZE(XMLQUERY('for \$doc in \$t/hguMrcSubmission
    let \$assayType:=\$doc/experiment/assayType/text()
    let \$id:=\$doc/@accession
    let \$inSituSeqSts:=\$doc/inSituAssay/entityBeingDetected/sequence/sequenceField/@sequenceStatusType
    let \$antibodySeqSts:=\$doc/antibodyAssay/entityBeingDetected/sequence/sequenceField/@sequenceStatusType
    let \$reporterSeqSts:=\$doc/reporterAssay/entityBeingDetected/sequence/sequenceField/@sequenceStatusType
    where \$assayType=\"$assayType\" and
    (if (\$assayType eq \"ish\") then \$inSituSeqSts eq \"$seqStatus\"
    else if(\$assayType eq \"ihc\") then \$antibodySeqSts eq \"$seqStatus\"
    else false())
    return <tr><td>{data(\$id)}</td></tr>'
    PASSING XMLDOC AS \"t\" RETURNING SEQUENCE) AS CLOB (32K)) FROM LK.EMAGEXML
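The selection logic of this query can be tried out without a database. The sketch below is an illustrative Python re-statement using only the standard library, not the project's code: `make_doc` and `count_matches` are hypothetical helper names, and the three sample records are invented and deliberately not schema-valid. As in the XQuery above, only the "ish" and "ihc" branches can match; any other assay type yields no records.

```python
import xml.etree.ElementTree as ET

SEQ_PATH = "entityBeingDetected/sequence/sequenceField"
# Assay type code -> element that carries the sequence information.
ASSAY_ELEMENT = {"ish": "inSituAssay", "ihc": "antibodyAssay"}


def make_doc(accession, assay_type, assay_element, seq_status):
    """Build a tiny test record (illustrative only, not schema-valid)."""
    return (
        f'<hguMrcSubmission accession="{accession}">'
        f"<experiment><assayType>{assay_type}</assayType></experiment>"
        f"<{assay_element}><entityBeingDetected><sequence>"
        f'<sequenceField sequenceStatusType="{seq_status}"/>'
        f"</sequence></entityBeingDetected></{assay_element}>"
        "</hguMrcSubmission>"
    )


def count_matches(documents, assay_type, seq_status):
    """Count records of the given assay type with the given sequence status."""
    element = ASSAY_ELEMENT.get(assay_type)
    if element is None:  # mirrors the final "else false()" branch
        return 0
    count = 0
    for text in documents:
        doc = ET.fromstring(text)
        if doc.findtext("experiment/assayType") != assay_type:
            continue
        fields = doc.findall(f"{element}/{SEQ_PATH}")
        if any(f.get("sequenceStatusType") == seq_status for f in fields):
            count += 1
    return count


DOCUMENTS = [
    make_doc("DOC:1", "ish", "inSituAssay", "fully-sequenced"),
    make_doc("DOC:2", "ish", "inSituAssay", "partially-sequenced"),
    make_doc("DOC:3", "ihc", "antibodyAssay", "fully-sequenced"),
]
print(count_matches(DOCUMENTS, "ish", "fully-sequenced"))
```

One difference is worth noting: the sketch accepts a record if any of its sequence fields has the requested status, whereas the `eq` comparison in the XQuery expects a single value on each side.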

Query 3: Which components express a particular gene?

A gene symbol is the only input for this query. The following interface accepts this input:

Figure 12: Input screen for the query that finds components where a gene is expressed

The following interface displays the results:

Figure 13: Output of the query that finds components where a gene is expressed

The XQuery code for this query is given below:

    SELECT XMLSERIALIZE(XMLQUERY('for \$doc in \$t/hguMrcSubmission
    let \$status:=\$doc/annotation/textAnnotation/expressionByOntology/expression/strength/text()
    let \$anatomyref:=\$doc/annotation/textAnnotation/expressionByOntology/accession
    let \$inSituSymbol:=if(fn:exists(\$doc/inSituAssay/entityBeingDetected/symbol/text())) then(\$doc/inSituAssay/entityBeingDetected/symbol/text()) else(\"\")
    let \$antibodySymbol:=if(fn:exists(\$doc/antibodyAssay/entityBeingDetected/symbol/text())) then(\$doc/antibodyAssay/entityBeingDetected/symbol/text()) else(\"\")
    let \$reporterSymbol:=if(fn:exists(\$doc/reporterAssay/entityBeingDetected/symbol/text())) then(\$doc/reporterAssay/entityBeingDetected/symbol/text()) else(\"\")
    where \$status=\"detected\" and
    (\$inSituSymbol=\"$geneSymbol\" or \$antibodySymbol=\"$geneSymbol\" or \$reporterSymbol=\"$geneSymbol\")
    return <tr><td>{data(\$anatomyref)}</td></tr>'
    PASSING XMLDOC AS \"t\" RETURNING SEQUENCE) AS CLOB (32K)) FROM LK.EMAGEXML
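The intent of this query, listing the anatomical components in which a gene is detected, can be illustrated on a couple of invented records. The sketch below is a Python interpretation under stated assumptions, not a transcription of the XQuery: the function name `components_expressing`, the gene symbols and the accession values are hypothetical, and the records are not schema-valid. The sketch pairs each expression strength with its own accession. The XQuery above binds `$status` and `$anatomyref` to whole sequences, so if a record held several `expressionByOntology` entries it would return all of their accessions as soon as any one of them is "detected".

```python
import xml.etree.ElementTree as ET

ASSAYS = ("inSituAssay", "antibodyAssay", "reporterAssay")

# Invented sample records, for illustration only.
DOCUMENTS = [
    """<hguMrcSubmission accession="DOC:1">
  <inSituAssay><entityBeingDetected><symbol>Shh</symbol></entityBeingDetected></inSituAssay>
  <annotation><textAnnotation>
    <expressionByOntology><accession>ANAT:1</accession>
      <expression><strength>detected</strength></expression></expressionByOntology>
    <expressionByOntology><accession>ANAT:2</accession>
      <expression><strength>not detected</strength></expression></expressionByOntology>
  </textAnnotation></annotation>
</hguMrcSubmission>""",
    """<hguMrcSubmission accession="DOC:2">
  <antibodyAssay><entityBeingDetected><symbol>Pax6</symbol></entityBeingDetected></antibodyAssay>
  <annotation><textAnnotation>
    <expressionByOntology><accession>ANAT:3</accession>
      <expression><strength>detected</strength></expression></expressionByOntology>
  </textAnnotation></annotation>
</hguMrcSubmission>""",
]


def components_expressing(documents, gene_symbol):
    """Return accessions of components where the gene is marked as detected."""
    components = []
    for text in documents:
        doc = ET.fromstring(text)
        # The symbol may sit under any of the three assay elements.
        symbols = {doc.findtext(f"{a}/entityBeingDetected/symbol") for a in ASSAYS}
        if gene_symbol not in symbols:
            continue
        for entry in doc.iterfind("annotation/textAnnotation/expressionByOntology"):
            if entry.findtext("expression/strength") == "detected":
                components.append(entry.findtext("accession"))
    return components


print(components_expressing(DOCUMENTS, "Shh"))
```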


7. Conclusion

7.1 Summary of the Work Done

The project developed an XML version of the EMAGE gene expression database. At present, EMAGE is an object-based database and is part of the larger EMAP project of the Human Genetics Unit of the MRC. During development, the project investigated various approaches, technologies and tools that could be used to perform such tasks. The data in the current EMAGE database were transformed into the new XML format, and an XML schema was developed to validate the newly created XML documents. A web-based query interface was also developed, through which potential users of the XML database can retrieve the desired data.

7.2 Summary of Evaluation

The purpose of this project was to convert the existing EMAGE database into an XML database and to explore and evaluate the tools, technologies and approaches involved in the process. This evaluation is discussed at the relevant places in this thesis. Most of the work in this project does not produce a visible outcome, so a user evaluation was neither possible nor required. The query interface is the only visible outcome, and it was developed for demonstration purposes; it is not a full-fledged interface. The development and evaluation of a complete user interface were not part of this project and are therefore listed in the Future Work section. The performance evaluation of the XML database is also left as future work.

7.3 Accomplishments

The following contributions have resulted from this project:

• The transformation code that converts the existing EMAGE data into XML documents, and an XML schema for validating the transformed documents, are in place. This was accomplished in cooperation with the MRC staff, particularly Dr. Yiya Yang.

• Various tools for working with different XML technologies have been explored. Their advantages and disadvantages are noted in this document, which should help the MRC select appropriate tools for further development of the XML database.

• An XML database was set up using IBM DB2 9. This database holds the XML documents.

• Retrieval of the desired data from the XML database using XQuery has been demonstrated.

• A web-based query interface for potential users has been developed. Users can interact with the XML database through this interface and get answers to their queries.

• The difficulties encountered and the solutions used throughout this project have been documented in this dissertation. This should help future developers of the project.


7.4 Limitations Encountered

The main limitation encountered in the project was that a lot of time had to be spent researching “side things”. Though equally important, these do not contribute directly to the measurable output of the project. They included, for example, setting up the web server and the DB2 connection, and validating the XML schema. This left less time for tangible work. Another considerable limitation was that the author was not involved in the XML schema design from the beginning. As a result, it was sometimes difficult to understand the meaning, logic or need of a particular element in the schema.

7.5 Skills Acquired

From this project, the author acquired considerable skills in a number of technologies. These include developing XML schemas, performing XSL transformations, using XPath and XQuery for XML data retrieval, and working with IBM DB2 9, the Eclipse Java platform, JSP and PHP. In addition, the author became a proficient user of various XML development tools, such as:

• Oxygen XML Editor 8.2 [ SyncRO Soft Ltd, www.oxygenxml.com ]

• EditiX 5.2.2 [ JAPIsoft, www.editix.com ]

• XMLWriter 2.7 [ Wattle Software, www.xmlwriter.net ]

• Altova® XMLSpy 2007 Professional Ed. [ Altova®, www.Altova.com/xmlspy ]

• Altova® MapForce 2007 Enterprise Ed. [ Altova®, www.Altova.com/mapforce ]

• Stylus Studio® 2007 XML Enterprise Suite [ DataDirect Technologies, www.stylusstudio.com ]

Above all, the author gained a good understanding (or Weltanschauung¹³) of how XML databases can be used to store data that has traditionally been stored in relational form, and how this XML data can be used to build websites. Naturally, the project also gave insight into the content and workings of gene expression databases.

7.6 Future Work

The XML database and a user interface catering to a few queries have been successfully developed as part of this project. The dissertation has achieved its aim of researching various approaches that could be helpful in developing an XML database. Beyond this project, however, more work needs to be done before the XML database can replace the object-based EMAGE database. Future work that could build on this dissertation is described below.

7.6.1 Comprehensive Query Interface Development

The XML database contains gene expression data that users need to analyze thoroughly in order to draw conclusions. A comprehensive and well-designed query interface would be very useful in this regard. The potential users of the database are mostly researchers with varying computer skills; the query interface should therefore be easy to use. Applying website usability principles while developing the interface is highly recommended.

7.6.2 Interface for Inserting New Data

This dissertation project only needed to deal with the existing data in the EMAGE database. A future requirement would be an interface through which researchers from around the world could add more data to the XML version of EMAGE. The interface should allow users to save partial data either on the EMAGE server or on the researcher's local machine. This is important because the data in one EMAGE record often arrive gradually as the related laboratory experiments progress. An XML-enabled web interface would be a good choice in this case: the user would save the partial data through a web form, and the web application would store the data in an XML file on the server and keep it in file form until the user or the database administrator submits it to the EMAGE database.

¹³ Weltanschauung is a German term used in philosophy. It means the “mental construct” or the “world view”.
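The idea of keeping an incomplete record in an XML file until it is ready for submission can be sketched as follows. This is a minimal Python illustration under stated assumptions and not a proposed design: the root element `draftSubmission`, the field names and the functions `save_draft` and `load_draft` are all hypothetical, and a real interface would map the form fields onto the elements of the EMAGE schema.

```python
import xml.etree.ElementTree as ET


def save_draft(fields, path):
    """Write the non-empty form fields to an XML draft file.

    `fields` maps field names (which must be valid XML element names)
    to the text entered so far; empty fields are simply left out.
    """
    root = ET.Element("draftSubmission")  # hypothetical root element
    for name, value in fields.items():
        if value:
            ET.SubElement(root, name).text = value
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)


def load_draft(path):
    """Read a draft file back into a field dictionary."""
    root = ET.parse(path).getroot()
    return {child.tag: child.text for child in root}
```

A web form handler could call `save_draft` each time the researcher presses a save button and `load_draft` when the researcher returns to the form; only when the record is complete would the document be validated against the schema and inserted into the database.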

7.6.3 Performance Evaluation

The XML version of the EMAGE database has been created, but it should be investigated whether it can match or exceed the performance of the present object-based database. Even though EMAGE does not contain a large number of records at present, the equivalent XML database might take longer than the object database to retrieve search results, because the query engine has to go through all the XML documents in the database. This may or may not affect performance significantly; a carefully conducted performance evaluation would reveal the answer.
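A performance evaluation of this kind usually repeats each query several times and compares a robust statistic such as the median. The sketch below shows such a timing harness in Python; it is only an illustration under assumptions. The function `time_query` is hypothetical, and the workload is a stand-in that parses invented in-memory documents, whereas a real evaluation would pass functions that run the same query against the XML database and against the object database.

```python
import statistics
import time
import xml.etree.ElementTree as ET


def time_query(run_query, repeats=5):
    """Run `run_query` several times and return the median wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Stand-in workload: scan 1000 invented documents for one accession.
DOCUMENTS = ['<hguMrcSubmission accession="DOC:%d"/>' % i for i in range(1000)]


def scan_all():
    return sum(1 for text in DOCUMENTS if ET.fromstring(text).get("accession") == "DOC:999")


print(f"median scan time: {time_query(scan_all):.6f} s")
```

Using the median rather than a single run reduces the influence of caching effects and of other activity on the machine.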

7.6.4 Query Optimization

Depending on the results of the performance evaluation, it may be useful to optimize the queries that the database application uses to answer the users' questions. The objective should be to minimize the time between submitting a query to the database and receiving the answer.
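One common way of reducing query time is to avoid scanning every document by looking values up in an index. DB2 9 provides its own indexing facility for XML columns, which is not shown here; the sketch below only illustrates the principle in Python on invented records. The function `build_symbol_index` and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

SYMBOL_PATHS = (
    "inSituAssay/entityBeingDetected/symbol",
    "antibodyAssay/entityBeingDetected/symbol",
    "reporterAssay/entityBeingDetected/symbol",
)

# Invented sample records, for illustration only.
DOCUMENTS = [
    '<hguMrcSubmission accession="DOC:1"><inSituAssay><entityBeingDetected>'
    "<symbol>Shh</symbol></entityBeingDetected></inSituAssay></hguMrcSubmission>",
    '<hguMrcSubmission accession="DOC:2"><antibodyAssay><entityBeingDetected>'
    "<symbol>Pax6</symbol></entityBeingDetected></antibodyAssay></hguMrcSubmission>",
]


def build_symbol_index(documents):
    """Map each gene symbol to the accessions of the records that mention it."""
    index = defaultdict(list)
    for text in documents:
        doc = ET.fromstring(text)
        for path in SYMBOL_PATHS:
            symbol = doc.findtext(path)
            if symbol:
                index[symbol].append(doc.get("accession"))
    return index


INDEX = build_symbol_index(DOCUMENTS)  # built once, reused for every lookup
print(INDEX["Shh"])
```

The index is built in a single pass over the documents; afterwards each gene symbol lookup is a dictionary access instead of a full scan. Whether a comparable gain is achievable for the EMAGE queries would have to be established by the performance evaluation.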

7.7 Final Thoughts

XML has turned out to be a very important technology. With the continuous development of related technologies, standards and tools, the scope of XML applications is increasing by the day. The initiative taken by various relational database vendors to provide native XML support in their products is likely to make XML a “first class citizen” among the data types in databases. This will help XML realize its full potential, as it can then be used more easily in relational database applications as well. Bioinformatics, like almost all other fields in need of data management, has a number of applications where XML can make things easier and better. Gene expression databases are among these applications.


References

[ 1 ] Beyer, K., Özcan, F., Saiprasad, S., Linden, B., (2005). DB2/XML: Designing for Evolution, Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data SIGMOD '05, ACM Press New York, USA

[ 2 ] Baldock, R. A., et al., (2003). EMAP and EMAGE: A Framework for Understanding Spatially Organized Data, Neuroinformatics, Vol. 1, 2003, pp 309-325

[ 3 ] D’haeseleer, R., Liang, S., Somogyi, R., (1999). Gene Expression Data Analysis and Modeling, Pacific Symposium on Biocomputing

[ 4 ] Christiansen, J., et al., (2006). EMAGE: a spatial database of gene expression patterns during mouse embryo development, Nucleic Acids Research, Database issue, Vol. 34, 2006, pp D637–D641

[ 5 ] Achard, F., Vaysseix, G., Barillot, E., (2001). "XML, Bioinformatics and Data Integration", Bioinformatics Review, Oxford University Press, Vol. 17, No. 2, 2001, pp 115-125

[ 6 ] Harold, E.R., Means, W.S., (2004). "XML in a Nutshell", 3rd Ed., O'Reilly Media

[ 7 ] DB2 9 Bootcamp, Student Notes, (2006). IBM Corporation

[ 8 ] Edinburgh Mouse Atlas Project, < http://genex.hgu.mrc.ac.uk > [Accessed on: 27th May 2007]

[ 9 ] W3 Schools, < www.w3schools.com > [Accessed on: 25th May 2007]

[ 10 ] Wikipedia, < http://en.wikipedia.org/wiki/Gene_expression > [Accessed on: 25th May 2007]

[ 11 ] Xiong, J., (2006). Essential Bioinformatics. Cambridge University Press.

[ 12 ] MISFISHIE Standard Working Group. <http://mged.sourceforge.net/misfishie>. [Accessed on: 09th August 2007]

[ 13 ] JDOM.org. < http://www.jdom.org>. [Accessed on: 02nd July 2007]

[ 14 ] DB2 9 pureXML Guide, Redbooks, IBM Corp. 1st Ed., January 2007


[ 15 ] Hishiki, T., Kawamoto S., Morishita S., Okubo K., (2000). BodyMap: a human and mouse gene expression database, Nucleic Acids Res., Vol. 28, No. 2, January 2000, pp 136–138

[ 16 ] Duerr, J., Immunohistochemistry, WormBooks.org, pp 1-6

[ 17 ] Funderburk, J. E., Malaika, S., Reinwald, B., (2002). XML programming with SQL/XML and XQuery, IBM Systems Journal, Vol. 41, No. 4, 2002, pp 642-665

[ 18 ] The National Institute of Health. Stem Cell Information. <http://stemcells.nih.gov/info/scireport/appendixA.asp>. [Accessed on: 02nd September 2007]


Appendix A

Given below is the complete new MISFISHIE-compliant XML schema:

<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:annotation>
    <xsd:documentation>MISFISHIE compliant schema for ISH/IHC data transfer</xsd:documentation>
  </xsd:annotation>
  <xsd:element name="hguMrcSubmission" type="hguMrcSubmissionType"/>
  <xsd:complexType name="hguMrcSubmissionType">
    <xsd:sequence>
      <xsd:element name="administration" type="administrationType"/>
      <xsd:element name="specimen" type="specimenType"/>
      <xsd:element name="experiment" type="experimentType" minOccurs="0"/>
      <xsd:element ref="assayRef"/>
      <xsd:element name="result" type="resultType"/>
      <xsd:element name="annotation" type="annotationType" minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element name="contributor" type="contributorType"/>
      <xsd:element name="reference" type="referenceType" minOccurs="0"/>
    </xsd:sequence>
    <xsd:attribute name="accession" type="xsd:token" use="required"/>
    <xsd:attribute name="status" type="xsd:token" use="optional"/>
  </xsd:complexType>
  <xsd:complexType name="specimenType">
    <xsd:sequence>
      <xsd:element name="organism" type="organismType"/>
      <xsd:element name="type">
        <xsd:simpleType>
          <xsd:union memberTypes="anyMeaningfulToken">
            <xsd:simpleType>
              <xsd:restriction base="xsd:token">
                <xsd:enumeration value="wholemount"/>
                <xsd:enumeration value="section"/>
                <xsd:enumeration value="section from wholemount"/>
                <xsd:enumeration value="whole cells"/>
                <xsd:enumeration value="sections of cells"/>
              </xsd:restriction>
            </xsd:simpleType>
          </xsd:union>
        </xsd:simpleType>
      </xsd:element>
      <xsd:element name="sex" type="xsd:token" minOccurs="0"/>
      <xsd:element name="genotype" type="genotypeType"/>

      <xsd:element name="phenotype" type="phenotypeType" minOccurs="0"/>
      <xsd:element name="physiologicalState" type="xsd:token" minOccurs="0"/>
      <xsd:element name="supplier" type="supplierType" minOccurs="0"/>
      <xsd:element name="note" type="xsd:string" minOccurs="0"/>
      <xsd:element name="tissueExamined" type="tissueType" minOccurs="0" maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="organismType">
    <xsd:sequence>
      <xsd:choice>
        <xsd:element name="commonName" type="nonEmptyToken"/>
        <xsd:element name="taxon" type="taxonType"/>
      </xsd:choice>
      <xsd:element name="stage" type="stageSystemType" minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element name="tissue" type="xsd:token" minOccurs="0"/>
      <xsd:element name="strain" type="nonEmptyToken" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="taxonType">
    <xsd:sequence>
      <xsd:element name="name" type="nonEmptyToken"/>
      <xsd:group ref="accession" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="genotypeType">
    <xsd:choice>
      <xsd:element name="wildType">
        <xsd:simpleType>
          <xsd:restriction base="xsd:token">
            <xsd:enumeration value="T"/>
            <xsd:enumeration value="t"/>
            <xsd:enumeration value="Y"/>
            <xsd:enumeration value="y"/>
            <xsd:enumeration value="TRUE"/>
            <xsd:enumeration value="true"/>
            <xsd:enumeration value="True"/>
          </xsd:restriction>
        </xsd:simpleType>
      </xsd:element>
      <xsd:element name="mutantAllele" type="mutantAlleleType" maxOccurs="unbounded"/>
    </xsd:choice>
  </xsd:complexType>
  <xsd:complexType name="mutantAlleleType">

    <xsd:choice>
      <xsd:sequence>
        <xsd:group ref="triplet"/>
        <xsd:element name="alleleOnFirstChromatid" type="nonEmptyToken"/>
        <xsd:choice minOccurs="0">
          <xsd:element name="alleleOnSecondChromatid" type="nonEmptyToken"/>
          <xsd:element name="nonPairedOrMissingChromosome" type="nonEmptyToken"/>
        </xsd:choice>
      </xsd:sequence>
      <xsd:element name="localAllele" type="localAlleleType"/>
    </xsd:choice>
  </xsd:complexType>
  <xsd:complexType name="localAlleleType">
    <xsd:sequence>
      <xsd:element name="name" type="xsd:token" minOccurs="0"/>
      <xsd:element name="description" type="nonEmptyToken"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="phenotypeType">
    <xsd:sequence>
      <xsd:element name="name" type="nonEmptyToken"/>
      <xsd:group ref="accession" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="tissueType">
    <xsd:choice>
      <xsd:element name="name" type="nonEmptyToken"/>
      <xsd:group ref="accession"/>
    </xsd:choice>
  </xsd:complexType>
  <xsd:complexType name="experimentType">
    <xsd:sequence>
      <xsd:element name="description" type="nonEmptyString"/>
      <xsd:element name="design" type="xsd:token" minOccurs="0"/>
      <xsd:element name="experimentalFactor" type="xsd:token" minOccurs="0"/>
      <xsd:element name="assayType" type="xsd:token" minOccurs="0"/>
      <xsd:element name="numberOfAssaysPerformed" type="xsd:integer"/>

      <xsd:element name="controlData" type="xsd:token" minOccurs="0"/>
      <xsd:element name="note" type="xsd:token" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:element name="assayRef" type="assayType"/>
  <xsd:element name="inSituAssay" type="inSituAssayType" substitutionGroup="assayRef"/>
  <xsd:element name="antibodyAssay" type="antibodyAssayType" substitutionGroup="assayRef"/>
  <xsd:element name="reporterAssay" type="reporterAssayType" substitutionGroup="assayRef"/>
  <xsd:complexType name="assayType">
    <xsd:sequence>
      <xsd:element name="firstLabel" type="nonEmptyToken"/>
      <xsd:element name="lastLabel" type="xsd:token" minOccurs="0"/>
      <xsd:element name="exogenousFactor" type="xsd:token" minOccurs="0"/>
      <xsd:element name="fixationReagent" type="xsd:token" minOccurs="0"/>
      <xsd:element name="embeddingReagent" type="xsd:token" minOccurs="0"/>
      <xsd:element name="clearingMethod" type="xsd:token" minOccurs="0"/>
      <xsd:element name="detectionProcedure" type="detectionProcedureType"/>
      <xsd:element name="protocol" type="protocolType" minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element name="note" type="xsd:token" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="protocolType">
    <xsd:sequence>
      <xsd:choice>
        <xsd:element name="value" type="nonEmptyString"/>
        <xsd:element name="linkedProtocol" type="nonEmptyString"/>
      </xsd:choice>
      <xsd:element name="type" minOccurs="0">
        <xsd:simpleType>
          <xsd:restriction base="xsd:string">
            <xsd:enumeration value="general"/>

            <xsd:enumeration value="specimen pre treatment"/>
            <xsd:enumeration value="reagent production"/>
            <xsd:enumeration value="detection reagent binding"/>
            <xsd:enumeration value="staining"/>
            <xsd:enumeration value="embedding"/>
            <xsd:enumeration value="imaging"/>
          </xsd:restriction>
        </xsd:simpleType>
      </xsd:element>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="reagentTypeType">
    <xsd:sequence maxOccurs="unbounded">
      <xsd:element name="value" type="xsd:token"/>
      <xsd:element name="order">
        <xsd:simpleType>
          <xsd:union memberTypes="anyMeaningfulToken">
            <xsd:simpleType>
              <xsd:restriction base="xsd:token">
                <xsd:enumeration value="secondary"/>
                <xsd:enumeration value="tertiary"/>
                <xsd:enumeration value="quaternaery"/>
              </xsd:restriction>
            </xsd:simpleType>
          </xsd:union>
        </xsd:simpleType>
      </xsd:element>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="detectionProcedureType">
    <xsd:sequence>
      <xsd:element name="signalDetectionMethod" type="nonEmptyToken"/>
      <xsd:element name="type" minOccurs="0">
        <xsd:simpleType>
          <xsd:union memberTypes="anyMeaningfulToken">
            <xsd:simpleType>
              <xsd:restriction base="xsd:token">
                <xsd:enumeration value="direct"/>
                <xsd:enumeration value="indirect"/>
              </xsd:restriction>
            </xsd:simpleType>

          </xsd:union>
        </xsd:simpleType>
      </xsd:element>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="detectionReagentType">
    <xsd:sequence>
      <xsd:choice>
        <xsd:element name="name" type="nonEmptyToken"/>
        <xsd:element name="accession" type="nonEmptyToken"/>
      </xsd:choice>
      <xsd:element name="concentration" type="xsd:token" minOccurs="0"/>
      <xsd:element name="reagentType" type="reagentTypeType" minOccurs="0"/>
      <xsd:choice minOccurs="0">
        <xsd:element name="supplier" type="supplierType"/>
        <xsd:element name="localGenerated" type="xsd:token"/>
      </xsd:choice>
      <xsd:element name="permanentLabel" type="nonEmptyToken" minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element name="note" type="xsd:token" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="reporterAssayType">
    <xsd:complexContent>
      <xsd:extension base="assayType">
        <xsd:sequence>
          <xsd:element name="detectionReagent" type="detectionReagentType" minOccurs="0" maxOccurs="unbounded"/>
          <xsd:element name="entityBeingDetected" type="entityBeingDetectedByReporterType" minOccurs="0" maxOccurs="unbounded"/>
        </xsd:sequence>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>
  <xsd:complexType name="entityBeingDetectedByReporterType">
    <xsd:sequence>
      <xsd:group ref="triplet"/>
      <xsd:element name="note" type="nonEmptyString"/>
    </xsd:sequence>
  </xsd:complexType>

  <xsd:complexType name="sequenceType">
    <xsd:choice>
      <xsd:element name="sequenceField" type="sequenceFieldType"/>
      <xsd:element name="description" type="nonEmptyString"/>
    </xsd:choice>
  </xsd:complexType>
  <xsd:complexType name="sequenceFieldType">
    <xsd:sequence maxOccurs="unbounded">
      <xsd:choice minOccurs="0">
        <xsd:element name="sequenceInFile" type="fileType"/>
        <xsd:element name="sequenceDirect" type="xsd:token"/>
      </xsd:choice>
      <xsd:group ref="accession"/>
      <xsd:element name="note" type="xsd:token" minOccurs="0"/>
      <xsd:choice minOccurs="0">
        <xsd:sequence>
          <xsd:element name="startLocation" type="xsd:integer"/>
          <xsd:element name="endLocation" type="xsd:integer"/>
        </xsd:sequence>
        <xsd:sequence>
          <xsd:element name="startLocationOfFragment" type="xsd:integer"/>
          <xsd:element name="fragmentSize" type="xsd:integer"/>
        </xsd:sequence>
        <xsd:sequence>
          <xsd:element name="endLocation" type="xsd:integer"/>
          <xsd:element name="fragmentSize" type="xsd:integer"/>
        </xsd:sequence>
        <xsd:element name="fivePrimePrimer" type="xsd:token"/>
        <xsd:element name="threePrimePrimer" type="xsd:token"/>
      </xsd:choice>
    </xsd:sequence>
    <xsd:attribute name="sequenceStatusType">
      <xsd:simpleType>
        <xsd:union memberTypes="anyMeaningfulToken">
          <xsd:simpleType>

            <xsd:restriction base="xsd:token">
              <xsd:enumeration value="fully-sequenced"/>
              <xsd:enumeration value="partially-sequenced"/>
            </xsd:restriction>
          </xsd:simpleType>
        </xsd:union>
      </xsd:simpleType>
    </xsd:attribute>
  </xsd:complexType>
  <xsd:complexType name="variantType">
    <xsd:sequence>
      <xsd:element name="name" type="xsd:token"/>
      <xsd:element name="activity" type="xsd:token"/>
      <xsd:element name="activityType" minOccurs="0">
        <xsd:simpleType>
          <xsd:union memberTypes="anyMeaningfulToken">
            <xsd:simpleType>
              <xsd:restriction base="xsd:token">
                <xsd:enumeration value="inactive form"/>
                <xsd:enumeration value="activated form"/>
                <xsd:enumeration value="both"/>
              </xsd:restriction>
            </xsd:simpleType>
          </xsd:union>
        </xsd:simpleType>
      </xsd:element>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="originType">
    <xsd:choice>
      <xsd:element name="cellLine" type="cellLineType"/>
      <xsd:element name="organism" type="organismType"/>
    </xsd:choice>
  </xsd:complexType>
  <xsd:complexType name="antibodyAssayType">
    <xsd:complexContent>
      <xsd:extension base="assayType">
        <xsd:sequence>

          <xsd:element name="detectionReagent" type="detectionReagentType" minOccurs="0" maxOccurs="unbounded"/>
          <xsd:element name="primaryReagentStorage" type="xsd:token" minOccurs="0"/>
          <xsd:element name="entityBeingDetected" type="entityBeingDetectedByAntibodyType" minOccurs="0" maxOccurs="unbounded"/>
          <xsd:element name="thingToGenerateDetectionReagent" type="thingToGenerateAntibodyType" minOccurs="0"/>
          <xsd:element name="type" type="antibodyType" minOccurs="0"/>
        </xsd:sequence>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>
  <xsd:complexType name="entityBeingDetectedByAntibodyType">
    <xsd:sequence>
      <xsd:group ref="triplet"/>
      <xsd:element name="sequence" type="sequenceType" minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element name="proteinVariant" type="proteinVariantType" minOccurs="0"/>
      <xsd:element name="anatomicalStructure" type="xsd:token" minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element name="speciesSpecificity" type="xsd:token" minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element name="molecularGroup" type="xsd:token" minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element name="cdMarker" type="xsd:token" minOccurs="0" maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="proteinVariantType">
    <xsd:complexContent>
      <xsd:extension base="variantType">
        <xsd:sequence>
          <xsd:element name="proteinIsoform" type="xsd:token" minOccurs="0"/>
        </xsd:sequence>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>
  <xsd:complexType name="thingToGenerateAntibodyType">
    <xsd:sequence>
      <xsd:element name="antigen" type="nonEmptyToken"/>
      <xsd:element name="supplier" type="supplierType" minOccurs="0"/>
      <xsd:element name="sequence" type="sequenceType" minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element name="originOfAntigen" type="originType" minOccurs="0"/>
      <xsd:element name="proteinDomainCovered" type="xsd:token" minOccurs="0" maxOccurs="unbounded"/>

      <xsd:element name="postTranslationalModification" type="xsd:token" minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element name="carrierOrFusion" type="xsd:token" minOccurs="0"/>
      <xsd:element name="note" type="xsd:token" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="antibodyType">
    <xsd:sequence>
      <xsd:choice>
        <xsd:element name="monoclonal" type="monoclonalType"/>
        <xsd:element name="polyclonal" type="polyclonalType"/>
      </xsd:choice>
      <xsd:element name="chainSubType" type="xsd:token" minOccurs="0"/>
      <xsd:element name="productionMethod" minOccurs="0"/>
      <xsd:element name="purificationMethod" type="xsd:token" minOccurs="0"/>
      <xsd:element name="immunoGlobulinIsoType" minOccurs="0">
        <xsd:simpleType>
          <xsd:restriction base="xsd:token">
            <xsd:enumeration value="I"/>
            <xsd:enumeration value="i"/>
            <xsd:enumeration value="G"/>
            <xsd:enumeration value="gm"/>
            <xsd:enumeration value="GM"/>
            <xsd:enumeration value="Gm"/>
          </xsd:restriction>
        </xsd:simpleType>
      </xsd:element>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="monoclonalType">
    <xsd:sequence>
      <xsd:element name="hybridoma" type="xsd:token" minOccurs="0"/>
      <xsd:element name="phageDisplay" type="xsd:token" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="polyclonalType">
    <xsd:sequence>
      <xsd:element name="speciesImmunized" type="xsd:token" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="cellLineType">
    <xsd:sequence>
      <xsd:element name="name" type="xsd:token"/>
      <xsd:group ref="accession"/>
    </xsd:sequence>

  </xsd:complexType>
  <xsd:complexType name="inSituAssayType">
    <xsd:complexContent>
      <xsd:extension base="assayType">
        <xsd:sequence>
          <xsd:element name="detectionReagent" type="probeDetectionReagentType" minOccurs="0" maxOccurs="unbounded"/>
          <xsd:element name="primaryReagentStorage" type="xsd:token" minOccurs="0"/>
          <xsd:element name="entityBeingDetected" type="entityBeingDetectedByProbeType" minOccurs="0" maxOccurs="unbounded"/>
          <xsd:element name="thingToGenerateDetectionReagent" type="thingToGenerateProbeType" minOccurs="0"/>
        </xsd:sequence>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>
  <xsd:complexType name="probeDetectionReagentType">
    <xsd:complexContent>
      <xsd:extension base="detectionReagentType">
        <xsd:sequence>
          <xsd:element name="chemistry">
            <xsd:simpleType>
              <xsd:union memberTypes="anyMeaningfulToken">
                <xsd:simpleType>
                  <xsd:restriction base="xsd:token">
                    <xsd:enumeration value="RNA"/>
                    <xsd:enumeration value="DNA"/>
                  </xsd:restriction>
                </xsd:simpleType>
              </xsd:union>
            </xsd:simpleType>
          </xsd:element>
          <xsd:element name="direction">
            <xsd:simpleType>
              <xsd:union memberTypes="anyMeaningfulToken">
                <xsd:simpleType>
                  <xsd:restriction base="xsd:token">

                    <xsd:enumeration value="sense"/>
                    <xsd:enumeration value="antisense"/>
                  </xsd:restriction>
                </xsd:simpleType>
              </xsd:union>
            </xsd:simpleType>
          </xsd:element>
        </xsd:sequence>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>
  <xsd:complexType name="entityBeingDetectedByProbeType">
    <xsd:sequence>
      <xsd:group ref="triplet"/>
      <xsd:element name="sequence" type="sequenceType" minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element name="spliceVariant" type="spliceVariantType" minOccurs="0"/>
      <xsd:element name="anatomicalStructure" type="xsd:token" minOccurs="0" maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="spliceVariantType">
    <xsd:complexContent>
      <xsd:extension base="variantType">
        <xsd:sequence>
          <xsd:element name="transcriptSplice" type="xsd:token" minOccurs="0"/>
        </xsd:sequence>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>
  <xsd:complexType name="thingToGenerateProbeType">
    <xsd:sequence>
      <xsd:element name="cloneName" type="nonEmptyToken"/>
      <xsd:element name="supplier" type="supplierType" minOccurs="0"/>
      <xsd:element name="sequence" type="sequenceType" minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element name="templateDNAType" type="templateDNATypeType" minOccurs="0"/>
      <xsd:element name="originOfTemplate" type="originType" minOccurs="0"/>
      <xsd:element name="note" type="xsd:token" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="templateDNATypeType">
    <xsd:sequence>
      <xsd:choice>
        <xsd:element name="genomic" type="xsd:token" maxOccurs="3"/>

        <xsd:element name="cdna" type="xsd:token" maxOccurs="3"/>
      </xsd:choice>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="resultType">
    <xsd:sequence maxOccurs="unbounded">
      <xsd:element name="element" type="supplementaryFileType"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="supplementaryFileType">
    <xsd:sequence maxOccurs="unbounded">
      <xsd:element name="name" type="xsd:token" minOccurs="0"/>
      <xsd:element name="file" type="fileType"/>
      <xsd:element name="resolution" type="xsd:token" minOccurs="0"/>
      <xsd:element name="mode" type="xsd:token" minOccurs="0"/>
      <xsd:element name="magnification" type="xsd:token" minOccurs="0"/>
      <xsd:element name="photographicPlatform" type="xsd:token" minOccurs="0"/>
      <xsd:element name="description">
        <xsd:simpleType>
          <xsd:union memberTypes="anyMeaningfulToken">
            <xsd:simpleType>
              <xsd:restriction base="xsd:string">
                <xsd:enumeration value="default image for single section"/>
                <xsd:enumeration value="default image for default section in multi-sections"/>
                <xsd:enumeration value="any other image of default section in multi-sections"/>
                <xsd:enumeration value="any other assay image"/>
                <xsd:enumeration value="multi-section montage image"/>
                <xsd:enumeration value="movie of 3D voxel"/>


                <xsd:enumeration value="best frame of the movie of 3D voxel"/>
                <xsd:enumeration value="OPT default image"/>
                <xsd:enumeration value="OPT wlz"/>
                <xsd:enumeration value="OPT movie"/>
              </xsd:restriction>
            </xsd:simpleType>
          </xsd:union>
        </xsd:simpleType>
      </xsd:element>
      <xsd:element name="height" type="xsd:integer" minOccurs="0"/>
      <xsd:element name="width" type="xsd:integer" minOccurs="0"/>
      <xsd:element name="position" type="sectionType" minOccurs="0"/>
      <xsd:choice minOccurs="0">
        <xsd:element name="nonOverlayChannel" type="channelType"/>
        <xsd:element name="multipleOverlayChannel" type="channelType" minOccurs="2" maxOccurs="unbounded"/>
      </xsd:choice>
      <xsd:element name="note" type="xsd:string" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="channelType">
    <xsd:sequence>
      <xsd:element name="name" type="nonEmptyToken"/>
      <xsd:element name="falseColour" type="xsd:token" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>

  <xsd:complexType name="annotationType">
    <xsd:sequence>
      <xsd:element name="referenceStage" type="stageSystemType" minOccurs="0"/>
      <xsd:element name="annotator">
        <xsd:simpleType>
          <xsd:union memberTypes="anyMeaningfulToken">
            <xsd:simpleType>
              <xsd:restriction base="xsd:token">


                <xsd:enumeration value="EMAGE editor"/>
              </xsd:restriction>
            </xsd:simpleType>
          </xsd:union>
        </xsd:simpleType>
      </xsd:element>
      <xsd:element name="confidenceOfAnnotator" type="confidenceType" minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element name="textAnnotation" type="textAnnotationType" minOccurs="0"/>
      <xsd:element name="imageAnnotation" type="imageAnnotationType" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="confidenceType">
    <xsd:sequence>
      <xsd:element name="aspect">
        <xsd:simpleType>
          <xsd:union memberTypes="anyMeaningfulToken">
            <xsd:simpleType>
              <xsd:restriction base="xsd:string">
                <xsd:enumeration value="morphological match to model"/>
                <xsd:enumeration value="pattern clarity and extraction"/>
              </xsd:restriction>
            </xsd:simpleType>
          </xsd:union>
        </xsd:simpleType>
      </xsd:element>
      <xsd:element name="level">
        <xsd:simpleType>
          <xsd:union memberTypes="anyMeaningfulToken">
            <xsd:simpleType>
              <xsd:restriction base="xsd:token">
                <xsd:enumeration value="low"/>
                <xsd:enumeration value="medium"/>
                <xsd:enumeration value="high"/>
              </xsd:restriction>
            </xsd:simpleType>
          </xsd:union>
        </xsd:simpleType>
      </xsd:element>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="textAnnotationType">
    <xsd:sequence>


      <xsd:element ref="textAnnotationRef" maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="imageAnnotationType">
    <xsd:sequence>
      <xsd:element ref="imageAnnotationRef" maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:element name="textAnnotationRef" type="expressionAnnotationType"/>
  <xsd:element name="expressionByOntology" type="expressionByOntologyType" substitutionGroup="textAnnotationRef"/>
  <xsd:element name="imageAnnotationRef" type="expressionAnnotationType"/>
  <xsd:element name="expressionByWholemount" type="expressionByImageType" substitutionGroup="imageAnnotationRef"/>
  <xsd:element name="expressionByVoxel" type="expressionByImageType" substitutionGroup="imageAnnotationRef"/>
  <xsd:complexType name="expressionAnnotationType">
    <xsd:sequence>
      <xsd:element name="expression" type="expressionType"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="expressionByOntologyType">
    <xsd:complexContent>
      <xsd:extension base="expressionAnnotationType">
        <xsd:sequence>
          <xsd:group ref="accession"/>
        </xsd:sequence>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>
  <xsd:complexType name="expressionByImageType">
    <xsd:complexContent>
      <xsd:extension base="expressionAnnotationType">
        <xsd:sequence>
          <xsd:element name="correspondingResultName" type="xsd:token" minOccurs="0"/>
          <xsd:element name="file" type="fileType"/>
          <xsd:element name="section" type="sectionType" minOccurs="0" maxOccurs="unbounded"/>
          <xsd:element name="referenceModel" type="xsd:token" minOccurs="0"/>
        </xsd:sequence>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>
  <xsd:complexType name="sectionType">
    <xsd:sequence>


      <xsd:element name="name" type="xsd:token" minOccurs="0"/>
      <xsd:element name="x" type="xsd:decimal"/>
      <xsd:element name="y" type="xsd:decimal"/>
      <xsd:element name="z" type="xsd:decimal"/>
      <xsd:element name="theta" type="xsd:decimal"/>
      <xsd:element name="phi" type="xsd:decimal"/>
      <xsd:element name="distance" type="xsd:decimal"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="expressionType">
    <xsd:sequence>
      <xsd:element name="strength">
        <xsd:simpleType>
          <xsd:union memberTypes="anyMeaningfulToken">
            <xsd:simpleType>
              <xsd:restriction base="xsd:token">
                <xsd:enumeration value="not detected"/>
                <xsd:enumeration value="detected"/>
                <xsd:enumeration value="present"/>
                <xsd:enumeration value="not examined"/>
                <xsd:enumeration value="uncertain"/>
                <xsd:enumeration value="possible"/>
                <xsd:enumeration value="strong"/>
                <xsd:enumeration value="moderate"/>
                <xsd:enumeration value="weak"/>
              </xsd:restriction>
            </xsd:simpleType>
          </xsd:union>
        </xsd:simpleType>
      </xsd:element>
      <xsd:element name="pattern" minOccurs="0">
        <xsd:simpleType>
          <xsd:union memberTypes="anyMeaningfulToken">
            <xsd:simpleType>
              <xsd:restriction base="xsd:token">
                <xsd:enumeration value="graded"/>
                <xsd:enumeration value="homogenous"/>
                <xsd:enumeration value="single cell"/>


                <xsd:enumeration value="spotted"/>
                <xsd:enumeration value="regional"/>
                <xsd:enumeration value="n/a"/>
                <xsd:enumeration value="not applicable"/>
              </xsd:restriction>
            </xsd:simpleType>
          </xsd:union>
        </xsd:simpleType>
      </xsd:element>
      <xsd:element name="location" minOccurs="0" maxOccurs="unbounded">
        <xsd:simpleType>
          <xsd:union memberTypes="anyMeaningfulToken">
            <xsd:simpleType>
              <xsd:restriction base="xsd:token">
                <xsd:enumeration value="dorsal"/>
                <xsd:enumeration value="ventral"/>
                <xsd:enumeration value="anterior"/>
                <xsd:enumeration value="posterior"/>
                <xsd:enumeration value="caudal"/>
                <xsd:enumeration value="deep"/>
                <xsd:enumeration value="lateral"/>
                <xsd:enumeration value="medial"/>
                <xsd:enumeration value="proximal"/>
                <xsd:enumeration value="radial"/>
                <xsd:enumeration value="surface"/>
                <xsd:enumeration value="n/a"/>
              </xsd:restriction>
            </xsd:simpleType>
          </xsd:union>
        </xsd:simpleType>
      </xsd:element>
      <xsd:element name="note" type="xsd:token" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>

  <xsd:complexType name="administrationType">
    <xsd:sequence>
      <xsd:element name="softwareType" type="xsd:token" minOccurs="0"/>
      <xsd:element name="softwareVersion" type="xsd:token" minOccurs="0"/>
      <xsd:element name="creationDate" type="nonEmptyToken" minOccurs="0"/>
      <xsd:element name="createdBy" type="nonEmptyToken" minOccurs="0"/>
      <xsd:element name="lastModificationDate" type="xsd:string" minOccurs="0"/>
      <xsd:element name="modifiedBy" type="xsd:token" minOccurs="0"/>
    </xsd:sequence>
    <xsd:attribute name="schemaVersion" type="xsd:token" fixed="1.0"/>
  </xsd:complexType>

  <xsd:complexType name="contributorType">
    <xsd:sequence>
      <xsd:element name="author" type="xsd:string" minOccurs="0"/>
      <xsd:element name="contactPerson" type="personType"/>
      <xsd:sequence minOccurs="0" maxOccurs="unbounded">
        <xsd:element ref="roleRef"/>
      </xsd:sequence>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:group name="contact">
    <xsd:sequence>
      <xsd:element name="tel" type="nonEmptyToken" minOccurs="0"/>
      <xsd:element name="email" type="nonEmptyToken" minOccurs="0"/>
      <xsd:element name="fax" type="xsd:token" minOccurs="0"/>
      <xsd:element name="url" type="xsd:token" minOccurs="0"/>
    </xsd:sequence>
  </xsd:group>
  <xsd:element name="roleRef" type="personType"/>


  <xsd:element name="submitter" type="personType" substitutionGroup="roleRef"/>
  <xsd:element name="principalInvestigator" type="personType" substitutionGroup="roleRef"/>
  <xsd:element name="acknowledgement" type="acknowledgementType" substitutionGroup="roleRef"/>
  <xsd:complexType name="personType">
    <xsd:sequence>
      <xsd:element name="name" type="nonEmptyToken"/>
      <xsd:group ref="contact"/>
      <xsd:element name="organization" type="organizationType"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="organizationType">
    <xsd:sequence>
      <xsd:element name="name" type="xsd:token" minOccurs="0"/>
      <xsd:group ref="address"/>
      <xsd:group ref="contact"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="acknowledgementType">
    <xsd:complexContent>
      <xsd:extension base="personType">
        <xsd:sequence>
          <xsd:element name="description" type="xsd:token" minOccurs="0"/>
        </xsd:sequence>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>
  <xsd:group name="address">
    <xsd:sequence>
      <xsd:element name="addressOne" type="xsd:string"/>
      <xsd:element name="addressTwo" type="xsd:string" minOccurs="0"/>
      <xsd:element name="addressThree" type="xsd:string" minOccurs="0"/>
      <xsd:element name="city" type="xsd:string" minOccurs="0"/>
      <xsd:element name="county" type="xsd:string" minOccurs="0"/>
      <xsd:element name="postcode" type="xsd:string" minOccurs="0"/>
      <xsd:element name="country" type="xsd:string" minOccurs="0"/>
    </xsd:sequence>
  </xsd:group>

  <xsd:complexType name="referenceType">
    <xsd:sequence>
      <xsd:element name="history" type="linkType" minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element name="relation" type="linkType" minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element name="publication" type="publicationType" minOccurs="0" maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="linkType">
    <xsd:sequence>
      <xsd:element name="type" type="xsd:token" minOccurs="0" maxOccurs="unbounded"/>
      <xsd:group ref="accession"/>
      <xsd:element name="url" type="xsd:token" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="publicationType">
    <xsd:sequence>
      <xsd:element name="author" type="nonEmptyToken"/>
      <xsd:element name="journal" type="xsd:token" minOccurs="0"/>
      <xsd:element name="title" type="nonEmptyToken"/>
      <xsd:element name="volume" type="xsd:token" minOccurs="0"/>
      <xsd:element name="issue" type="xsd:token" minOccurs="0"/>
      <xsd:element name="year" type="xsd:token"/>
      <xsd:element name="page" type="xsd:token" minOccurs="0"/>
      <xsd:group ref="accession" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>

  <xsd:simpleType name="nonEmptyString">
    <xsd:restriction base="xsd:string">
    </xsd:restriction>
  </xsd:simpleType>
  <xsd:simpleType name="nonEmptyToken">
    <xsd:restriction base="xsd:token">
    </xsd:restriction>
  </xsd:simpleType>


  <xsd:simpleType name="anyMeaningfulToken">
    <xsd:restriction base="xsd:token"/>
  </xsd:simpleType>
  <xsd:complexType name="stageSystemType">
    <xsd:sequence>
      <xsd:element name="name" type="xsd:token" minOccurs="0"/>
      <xsd:element name="value" type="nonEmptyToken"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:group name="accession">
    <xsd:sequence>
      <xsd:element name="accession" type="nonEmptyToken" minOccurs="0"/>
      <xsd:element name="source" type="xsd:token" minOccurs="0"/>
    </xsd:sequence>
  </xsd:group>
  <xsd:group name="triplet">
    <xsd:sequence>
      <xsd:element name="symbol" type="nonEmptyToken"/>
      <xsd:element name="accession" type="xsd:token" minOccurs="0"/>
      <xsd:element name="name" type="xsd:token" minOccurs="0"/>
    </xsd:sequence>
  </xsd:group>
  <xsd:complexType name="supplierType">
    <xsd:sequence>
      <xsd:element name="name" type="nonEmptyToken"/>
      <xsd:element name="catalogueNumber" type="xsd:token" minOccurs="0"/>
      <xsd:element name="lotNumber" type="xsd:token" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="fileType">
    <xsd:sequence>
      <xsd:element name="name" type="nonEmptyToken"/>
      <xsd:element name="type" type="xsd:token" minOccurs="0"/>
      <xsd:element name="zipFileName" type="xsd:token" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:element name="nonEmptyToken" type="xsd:token"/>

</xsd:schema>
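
Many elements in the schema above (for example level in confidenceType, and strength, pattern and location in expressionType) are typed as a union of anyMeaningfulToken with an anonymous enumeration. Since anyMeaningfulToken is declared as an unrestricted restriction of xsd:token, such a union accepts any token, so the enumerated values act as a recommended vocabulary rather than a hard constraint. The following is a minimal sketch, not part of the project code, of how a client could read that recommended vocabulary out of the schema with Python's standard library. The function name suggested_values is hypothetical, the fragment is copied from confidenceType and wrapped in its own xsd:schema element so that it parses in isolation, and the xsd prefix is assumed to be bound to the standard XML Schema namespace.

```python
import xml.etree.ElementTree as ET

# Assumption: the xsd prefix maps to the standard XML Schema namespace.
XSD_NS = "http://www.w3.org/2001/XMLSchema"

# Fragment of confidenceType from the schema above, wrapped in a schema
# element of its own so that it can be parsed without the rest of the file.
FRAGMENT = """
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="level">
    <xsd:simpleType>
      <xsd:union memberTypes="anyMeaningfulToken">
        <xsd:simpleType>
          <xsd:restriction base="xsd:token">
            <xsd:enumeration value="low"/>
            <xsd:enumeration value="medium"/>
            <xsd:enumeration value="high"/>
          </xsd:restriction>
        </xsd:simpleType>
      </xsd:union>
    </xsd:simpleType>
  </xsd:element>
</xsd:schema>
"""


def suggested_values(xsd_text, element_name):
    """Return the enumerated values declared inside the named element.

    Hypothetical helper: it only inspects the schema text and performs
    no XSD validation of instance documents.
    """
    root = ET.fromstring(xsd_text)
    for element in root.iter(f"{{{XSD_NS}}}element"):
        if element.get("name") == element_name:
            # Collect every enumeration facet nested under this element.
            return [
                facet.get("value")
                for facet in element.iter(f"{{{XSD_NS}}}enumeration")
            ]
    return []


if __name__ == "__main__":
    print(suggested_values(FRAGMENT, "level"))
```

Calling suggested_values(FRAGMENT, "level") returns the three enumerated levels in document order, and an element name that is not declared in the fragment yields an empty list. A value outside this list would still be valid against the schema, because the union also admits anyMeaningfulToken.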


Other Appendices (on CD-ROM)

Other appendices are included on the CD-ROM attached to this dissertation. They contain the code used and produced during this project, and the CD-ROM has an index page listing all of them. This thesis is also available in electronic form on the CD-ROM.

The CD should run automatically when inserted in the CD drive. If it does not, go to the CD drive and open the file index.htm. The following page, which lists all the files on the CD, will appear.
