GUS: The Genomics Unified Schema
A Platform for Genomics Databases

V. Babenko, B. Brunk, J. Crabtree, S. Diskin, S. Fischer, G. Grant, Y. Kondrahkin, L. Li, J. Liu, J. Mazzarelli, D. Pinney, A. Pizarro, E. Manduchi, S. McWeeney, J. Schug, C. Stoeckert

Center for Bioinformatics, University of Pennsylvania

stevef,[email protected]

Overview

Abstract

The Genomics Unified Schema (GUS) is a strongly typed relational database schema and accompanying portable object-based software platform used for integration, analysis, curation, mining and presentation of sequence-based genomics information. The schema is organized into five domains: a detailed model of the central dogma (gene, RNA, protein) including DNA, assembled RNA, and protein sequence, and a diversity of sequence annotation (DoTS); an MGED-compliant warehouse of transcript expression experiments (RAD); a catalogue of grammars describing regulatory regions (TESS); a wide range of controlled vocabularies and ontologies (SRES); and a detailed representation of data provenance (CORE). (A sixth domain for protein expression is in progress.) GUS’s normalized relational structure and extent of integrated data enable powerful queries not viable in many other genomics systems. The platform facilitates maintenance of the warehouse and its utilization in web and data mining applications.

Goals of GUS

- Generic platform for model organism or disease-specific databases
- Freely available at www.gusdev.org and www.cbil.upenn.edu
- Integration of genome, transcript and protein data, including:
  - Sequence
  - Function
  - Expression
  - Interaction
  - Regulation
  - Orthologs and paralogs
- Support for:
  - automated annotation and integration
  - manual curation
  - data mining/analysis and sophisticated queries
  - web access

GUS Powers Multiple Genomics DBs

[Figure: layered diagram. AllGenes, PlasmoDB, EPConDB, and other sites and projects sit on Java servlets and an object layer for data loading, over an Oracle RDBMS that holds the Core, SRes, TESS, RAD and DoTS domains.]

Components of GUS

- Relational database schema
- Lightweight object layer
- Application frameworks
  - Data access
  - Pipeline/workflow
  - Web (servlets)
- Applications
  - Annotator’s interface
  - Parsers and exporters (using standards)
  - Annotation and analysis programs
  - Schema browser
- Utilizes Oracle 9i

Architecture of GUS

[Figure: architecture diagram. Inputs: genomic sequence; GSSs and ESTs; microarray and SAGE experiments; mapping data; GenBank, InterPro, GO, etc.; annotation; QTL, POP, SNP and clinical data. These pass through automated analysis and integration and the object layer into the Oracle/SQL database (DoTS, RAD, TESS, Core, SRes). Outputs: the Annotator’s Interface, data mining applications, and WWW queries, browsing and download through Java servlets and Perl CGI.]

Usage of GUS

- Annotation
  - Of genomes: gene models, sequence features
  - Of genes: function, expression, regulation
- Integration
  - From sequence to expression
  - Map identifiers to/from external databases
- Data mining, creating curated datasets
  - Algorithm-based: GO function prediction
  - Genome-wide querying: find all pancreas-specific transcripts
  - PANCchip: non-redundant genes expressed in pancreas found using ESTs, microarrays and cDNA libraries

GUS Schema

Schema features

- Extensive integrated genomics schema (300 tables)
- Divided into 5 distinct domains
- Highly normalized
- Strongly typed
- Controlled vocabularies used extensively
- Avoid using name-value pairs
- Subclassing
  - Use views of superclass to define subclasses
  - Useful for mapping into the object layer
- Warehousing
  - Include databases such as GenBank, GO terms, ProDom, CDD
  - Facilitates management of value-added annotation across updates
- Cross references to external databases
- Tracking and versioning
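The subclassing feature listed above (views of a superclass table define the subclasses) can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module rather than Oracle, and every table, view and column name here (na_feature_imp, subclass_view, string1, int1, gene_feature, binding_site) is a hypothetical stand-in, not the actual GUS DDL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical superclass table: one physical table stores every feature.
    CREATE TABLE na_feature_imp (
        na_feature_id  INTEGER PRIMARY KEY,
        subclass_view  TEXT NOT NULL,   -- discriminator: which subclass owns the row
        na_sequence_id INTEGER NOT NULL,
        name           TEXT,
        string1        TEXT,            -- generic columns, renamed by each view
        int1           INTEGER
    );

    -- Each subclass is a view that filters on the discriminator and
    -- exposes the generic columns under meaningful names.
    CREATE VIEW gene_feature AS
        SELECT na_feature_id, na_sequence_id, name,
               string1 AS gene_type, int1 AS number_of_exons
        FROM na_feature_imp
        WHERE subclass_view = 'GeneFeature';

    CREATE VIEW binding_site AS
        SELECT na_feature_id, na_sequence_id, name,
               string1 AS model_name, int1 AS mismatches
        FROM na_feature_imp
        WHERE subclass_view = 'BindingSite';
""")

# Made-up rows, one per subclass.
conn.executemany(
    "INSERT INTO na_feature_imp VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "GeneFeature", 100, "feature-1", "protein_coding", 7),
        (2, "BindingSite", 100, "feature-2", "model-A", 0),
    ],
)

gene_rows = conn.execute(
    "SELECT name, gene_type, number_of_exons FROM gene_feature"
).fetchall()
site_rows = conn.execute(
    "SELECT name, model_name, mismatches FROM binding_site"
).fetchall()
print(gene_rows)
print(site_rows)
```

Because each view exposes typed, named columns, an object layer can map one class to one view, which is presumably what makes this layout "useful for mapping into the object layer".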

Five domains

GUS is divided into 5 domains* (separate name spaces)

Namespace                        Domain                     Highlights
DoTS (DB of Transcribed Seqs)    Sequence and annotation    Central dogma
RAD (RNA Abundance DB)           Gene expression            MIAME/MAGE
TESS (Trans Elem Search Site)    Gene regulation            Grammars
Core                             Data Provenance            Evidence
SRes (Shared Resources)          Shared Resources           Ontologies

* Protein interaction domain underway

[Figure: contents of the five domains]

- Core (Data Provenance): Ownership, Protection, Algorithms, Versioning, Workflows
- SRes (Ontologies): GO, Species, Anatomy/Tissue, Developmental stage, Disease state
- DoTS
  - Genomic Sequence: Genes, gene models; STSs, repeats, etc.; Cross-species analysis
  - Transcribed Sequence: Characterize transcripts; RH mapping; Library analysis; Cross-species analysis; DoTS assemblies
  - Protein Sequence: Domains; Function; Structure; Cross-species analysis
- RAD (Transcript Expression): Arrays, SAGE, Conditions
- TESS (Gene Regulation): Binding Sites, Patterns, Grammars

Querying across the domains

"Transcription factors upregulated in acute myeloid leukemia with sequence similarity to c-fos and common promoter motifs"

[Figure: the DoTS, RAD, SRes, Core and TESS domains, linked by the query above.]

DoTS central dogma schema

[Figure: three parallel chains of tables, one per level of the central dogma]

- Gene, GeneInstance, GeneFeature (isa NA Feature), GenomicSequence (isa NA Sequence)
- RNA, RNAInstance, RNAFeature (isa NA Feature), RNASequence (isa NA Sequence)
- Protein, ProteinInstance, ProteinFeature (isa NA Feature), ProteinSequence (isa AA Sequence)

RAD schema uses MAGE/MIAME

[Figure: UML class diagram of the RAD schema. Tables shown: ElementAnnotation, CompositeElementAnnotation, ArrayAnnotation, ElementImp, CompositeElementImp, ElementResultImp, CompositeElementResultImp, Array, Control, Study, StudyDesign, StudyDesignDescription, StudyDesignAssay, StudyAssay, StudyFactor, StudyFactorValue, Assay, AssayLabeledExtract, BioMaterialImp, BioMaterialCharacteristic, BioMaterialMeasurement, LabelMethod, Treatment, Channel, Protocol, ProtocolParam, Acquisition, AcquisitionParam, RelatedAcquisition, Quantification, QuantificationParam, RelatedQuantification, Analysis, AnalysisInput, AnalysisOutput, AnalysisImplementation, AnalysisImplementationParam, AnalysisInvocation, AnalysisInvocationParam, ProcessImplementation, ProcessImplementationParam, ProcessInvocation, ProcessInvocationParam, ProcessIO, ProcessResult, MAGEDocumentation, MAGE_ML, OntologyEntry. Associations are drawn as 1 to 0..* or 0..1 to 0..*.]

- MAGE: Experiment; Array; BioMaterial; BioAssay; BioAssayData; Protocol, Descr.; HigherLevelAnalysis
- MIAME: Experimental Design; Array design; Samples; Hybridization, Measure; Normalization

TESS schema

[Figure: TESS schema diagram]

- TESS.Model: ModelString, ModelConsensusString, ModelPositionalWeightMatrix, ModelGrammar
- TESS.Activity: ActivityProteinDnaBinding, ActivityTissueSpecificity
- TESS.Moiety: Moiety, MoietyMultimer, MoietyHeterodimer, MoietyComplex
- TESS.FootprintInstance
- DoTS.NaFeature: BindingSite, Promoter, . . .
- DoTS.NaSequence
- TESS.TrainingSet
- TESS.ParameterGroup
- TESS.Note

Ontologies and vocabularies

- Ontologies
  - Gene Ontology (GO)
  - Sequence Ontology (SO) (sequence features)
  - Phenotype and Trait Ontology (PATO)
  - Taxon (NCBI)
  - Anatomy (Penn)
  - Disease (ICD9)
  - Developmental stage (multiple sources)
- And vocabularies
  - External database names
  - Genetic codes
  - Review status

Evidence trail

- Evidence and tracking
  - Data tables have columns for user, date, project, algorithm invocation
  - Tables dedicated to algorithm, algorithm version and parameters
  - 176 algorithms, including public and in-house
  - Tracks automated and manual annotation, similarity and integration
- Versioning
  - All updated or deleted rows are copied to a version table
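A minimal sketch of the versioning rule above: before a row is updated or deleted, its old contents are copied into a version table. The poster does not say how GUS performs the copy, so the use of database triggers here is an assumption, as are SQLite (in place of Oracle) and all table and column names (gene, gene_ver, row_user, row_project, row_alg_invocation_id).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical data table carrying tracking columns.
    CREATE TABLE gene (
        gene_id               INTEGER PRIMARY KEY,
        name                  TEXT NOT NULL,
        row_user              TEXT NOT NULL,
        row_project           TEXT NOT NULL,
        row_alg_invocation_id INTEGER NOT NULL,
        modification_date     TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    );

    -- Version table: the same columns plus the time the old row was retired.
    CREATE TABLE gene_ver (
        gene_id               INTEGER,
        name                  TEXT,
        row_user              TEXT,
        row_project           TEXT,
        row_alg_invocation_id INTEGER,
        modification_date     TEXT,
        version_date          TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    );

    -- Assumption: triggers do the copying.
    CREATE TRIGGER gene_before_update BEFORE UPDATE ON gene
    BEGIN
        INSERT INTO gene_ver (gene_id, name, row_user, row_project,
                              row_alg_invocation_id, modification_date)
        VALUES (OLD.gene_id, OLD.name, OLD.row_user, OLD.row_project,
                OLD.row_alg_invocation_id, OLD.modification_date);
    END;

    CREATE TRIGGER gene_before_delete BEFORE DELETE ON gene
    BEGIN
        INSERT INTO gene_ver (gene_id, name, row_user, row_project,
                              row_alg_invocation_id, modification_date)
        VALUES (OLD.gene_id, OLD.name, OLD.row_user, OLD.row_project,
                OLD.row_alg_invocation_id, OLD.modification_date);
    END;
""")

conn.execute(
    "INSERT INTO gene (gene_id, name, row_user, row_project, row_alg_invocation_id) "
    "VALUES (1, 'draft name', 'annotator', 'demo', 1)"
)
conn.execute(
    "UPDATE gene SET name = 'curated name', row_alg_invocation_id = 2 WHERE gene_id = 1"
)
conn.execute("DELETE FROM gene WHERE gene_id = 1")

# Each retired state of the row, oldest first.
history = conn.execute(
    "SELECT name, row_alg_invocation_id FROM gene_ver ORDER BY rowid"
).fetchall()
print(history)
```

The live table ends up empty while the version table keeps both earlier states, each still tagged with the algorithm invocation that wrote it.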

Sophisticated queries

Sample queries from three projects that utilize GUS’s data integration and analysis:

- www.allgenes.org: “Is my cDNA similar to any mouse genes that are predicted to encode transcription factors and have been localized to mouse chromosome 5?”
- http://plasmodb.org: “List all genes whose proteins are predicted to contain a signal peptide and for which there is evidence that they are expressed in Plasmodium falciparum’s late schizont stage.”
- www.cbil.upenn.edu/EPConDB: “Which genes on chromosome 2 are expressed in pancreas and are involved in signal transduction based on GO function assignments?”
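The EPConDB question can be sketched as a single join across namespaces. This is only an illustration: it uses SQLite with one attached in-memory database per namespace as a stand-in for the GUS domains, and the tables (dots.gene, dots.go_assignment, rad.expression, sres.anatomy, sres.go_term), their columns, and the gene symbols are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One attached database per namespace stands in for a GUS domain.
for namespace in ("dots", "rad", "sres"):
    conn.execute(f"ATTACH DATABASE ':memory:' AS {namespace}")

conn.executescript("""
    CREATE TABLE sres.anatomy (anatomy_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sres.go_term (go_term_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dots.gene (gene_id INTEGER PRIMARY KEY, symbol TEXT, chromosome TEXT);
    CREATE TABLE dots.go_assignment (gene_id INTEGER, go_term_id INTEGER);
    CREATE TABLE rad.expression (gene_id INTEGER, anatomy_id INTEGER);

    -- Made-up rows, for illustration only.
    INSERT INTO sres.anatomy VALUES (1, 'pancreas'), (2, 'liver');
    INSERT INTO sres.go_term VALUES (10, 'signal transduction'), (11, 'transport');
    INSERT INTO dots.gene VALUES
        (1, 'geneA', '2'), (2, 'geneB', '2'), (3, 'geneC', '5'), (4, 'geneD', '2');
    INSERT INTO dots.go_assignment VALUES (1, 10), (2, 10), (3, 10), (4, 11);
    INSERT INTO rad.expression VALUES (1, 1), (2, 2), (3, 1), (4, 1);
""")

# Sequence (dots), expression (rad) and vocabularies (sres) in one query.
query = """
    SELECT DISTINCT g.symbol
    FROM dots.gene AS g
    JOIN rad.expression AS e      ON e.gene_id = g.gene_id
    JOIN sres.anatomy AS a        ON a.anatomy_id = e.anatomy_id
    JOIN dots.go_assignment AS ga ON ga.gene_id = g.gene_id
    JOIN sres.go_term AS t        ON t.go_term_id = ga.go_term_id
    WHERE g.chromosome = '2'
      AND a.name = 'pancreas'
      AND t.name = 'signal transduction'
    ORDER BY g.symbol
"""
symbols = [row[0] for row in conn.execute(query)]
print(symbols)
```

Only geneA passes all three filters: geneB is expressed in liver, geneC is on chromosome 5, and geneD has a different GO term. Tissue and function are matched through controlled-vocabulary tables rather than free text, which is what keeps a query like this a plain join.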

Application Frameworks

GUS Object layer

- Lightweight Perl implementation
- Java on the way
- One object per table
- Parent/child relationships
- Cascading delete
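The object-layer ideas above (one object per table, parent/child relationships, cascading delete) can be sketched as follows. The real object layer is written in Perl; this Python version is a loose illustration, and the class and method names (GusRow, add_child, mark_deleted, submit) are hypothetical, not the GUS API.

```python
class GusRow:
    """Hypothetical base class: each subclass maps to one table."""

    table = None  # set by each subclass

    def __init__(self, row_id):
        self.row_id = row_id
        self.parent = None
        self.children = []
        self.deleted = False

    def add_child(self, child):
        """Attach a child row (a row holding a foreign key to this one)."""
        child.parent = self
        self.children.append(child)
        return child

    def mark_deleted(self):
        """Cascading delete: marking a parent marks its whole subtree."""
        self.deleted = True
        for child in self.children:
            child.mark_deleted()

    def submit(self, statements):
        """Collect SQL in an order that respects foreign keys."""
        if self.deleted:
            # Children must go before the parent they reference.
            for child in self.children:
                child.submit(statements)
            statements.append(f"DELETE FROM {self.table} WHERE id = {self.row_id}")
        else:
            # A parent must exist before its children are inserted.
            statements.append(f"INSERT INTO {self.table} (id) VALUES ({self.row_id})")
            for child in self.children:
                child.submit(statements)
        return statements


class Gene(GusRow):
    table = "Gene"


class RNA(GusRow):
    table = "RNA"


class Protein(GusRow):
    table = "Protein"


gene = Gene(1)
rna = gene.add_child(RNA(10))
rna.add_child(Protein(100))
gene.add_child(RNA(11))

gene.mark_deleted()
statements = gene.submit([])
for statement in statements:
    print(statement)
```

Deleting the gene emits the Protein delete first, then both RNA deletes, then the Gene delete, so no statement leaves a dangling child row.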

Data input

The GusApplication program manages inserts and updates to GUS, handling tracking and versioning.

Specific tasks are implemented as plugins. Plugins use either GUS objects or SQL access. Low-level database access is provided by DBI classes.

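[Figure: GusApplication runs a Plugin. The plugin reaches the database through GUS objects (one set each for Core, SRes, DoTS, RAD and TESS, built on shared superclasses) or through SQL; both sit on DBI.]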
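The division of labor described above, where the application handles tracking and each plugin implements one task, can be sketched like this. GusApplication itself is Perl; the Python below only illustrates the pattern, with SQLite as the database and hypothetical names throughout (run_application, InsertGenes, algorithm_invocation, row_alg_invocation_id).

```python
import sqlite3

SCHEMA = """
CREATE TABLE algorithm_invocation (
    algorithm_invocation_id INTEGER PRIMARY KEY,
    plugin_name             TEXT NOT NULL,
    row_user                TEXT NOT NULL,
    result                  TEXT
);
CREATE TABLE gene (
    gene_id               INTEGER PRIMARY KEY,
    name                  TEXT NOT NULL,
    row_alg_invocation_id INTEGER NOT NULL
        REFERENCES algorithm_invocation (algorithm_invocation_id)
);
"""


class InsertGenes:
    """Hypothetical plugin: one specific loading task, using direct SQL."""

    name = "InsertGenes"

    def run(self, conn, invocation_id, names):
        # Every row is stamped with the invocation that created it.
        conn.executemany(
            "INSERT INTO gene (name, row_alg_invocation_id) VALUES (?, ?)",
            [(n, invocation_id) for n in names],
        )
        return f"inserted {len(names)} genes"


def run_application(conn, plugin, user, *args):
    """Record the invocation, run the plugin, then commit or roll back."""
    cursor = conn.execute(
        "INSERT INTO algorithm_invocation (plugin_name, row_user) VALUES (?, ?)",
        (plugin.name, user),
    )
    invocation_id = cursor.lastrowid
    try:
        result = plugin.run(conn, invocation_id, *args)
    except Exception:
        conn.rollback()  # a failed plugin leaves no partial load behind
        raise
    conn.execute(
        "UPDATE algorithm_invocation SET result = ? WHERE algorithm_invocation_id = ?",
        (result, invocation_id),
    )
    conn.commit()
    return invocation_id


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
invocation = run_application(conn, InsertGenes(), "annotator", ["geneA", "geneB"])
loaded = conn.execute(
    "SELECT name, row_alg_invocation_id FROM gene ORDER BY gene_id"
).fetchall()
print(invocation, loaded)
```

The plugin never writes the tracking record itself; it only receives an invocation id, so every plugin gets the same evidence trail without extra code.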

Pipeline

- Perl API for defining annotation pipelines
- Supports sequential protocols
- Distributes compute-intensive work to compute cluster
- Used for 90-stage pipeline to build DoTS transcript index
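As a rough illustration of a sequential pipeline with one distributed stage, the sketch below runs named steps in order and fans the work of one step out to a thread pool. The thread pool is only a local stand-in for a compute cluster, and nothing here reflects the actual Perl pipeline API; Pipeline, add_step and distribute are hypothetical names, and the sequences are made up.

```python
from concurrent.futures import ThreadPoolExecutor


class Pipeline:
    """Hypothetical sequential pipeline: each step consumes the previous output."""

    def __init__(self):
        self.steps = []
        self.completed = []

    def add_step(self, name, func):
        self.steps.append((name, func))

    def run(self, data):
        for name, func in self.steps:
            data = func(data)
            self.completed.append(name)  # record progress step by step
        return data


def distribute(func, workers=4):
    """Wrap a per-item function so items are processed in parallel.

    A thread pool stands in for the compute cluster in this sketch.
    """
    def step(items):
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(func, items))  # map preserves input order
    return step


def gc_fraction(sequence):
    """Return the sequence with its fraction of G and C bases."""
    gc = sequence.count("G") + sequence.count("C")
    return sequence, gc / len(sequence)


pipeline = Pipeline()
pipeline.add_step("clean", lambda seqs: [s.strip().upper() for s in seqs])
pipeline.add_step("gc_content", distribute(gc_fraction))
pipeline.add_step("filter", lambda scored: [s for s, gc in scored if gc >= 0.5])

result = pipeline.run([" acgt ", "ggcc", "aatt"])
print(result, pipeline.completed)
```

Keeping the distribution logic in a wrapper means a step can be moved on or off the parallel backend without changing the step function itself.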

Web

- Servlets and CGI-based design (JSP on the way)
- Automatic generation of HTML FORMs
  - Automated input checking
  - Integrated help features
  - INPUT elements populated from the database
- Query history facility
- Boolean queries (AND, OR, SUBTRACT)
- Declarative configuration file
- Base system is relatively independent of GUS
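The query history and Boolean query features listed above can be read as set operations over saved result sets. The sketch below takes that reading: AND is intersection, OR is union, and SUBTRACT is set difference. The QueryHistory class and its methods are hypothetical, the gene ids are made up, and the poster does not describe how the web framework implements these operations.

```python
class QueryHistory:
    """Hypothetical query history: numbered result sets that can be combined."""

    def __init__(self):
        self.entries = []  # list of (description, frozenset of ids)

    def add(self, description, ids):
        """Store a result set and return its 1-based history number."""
        self.entries.append((description, frozenset(ids)))
        return len(self.entries)

    def combine(self, left, operator, right):
        """Combine two earlier results and store the outcome as a new entry."""
        left_desc, left_ids = self.entries[left - 1]
        right_desc, right_ids = self.entries[right - 1]
        if operator == "AND":
            ids = left_ids & right_ids
        elif operator == "OR":
            ids = left_ids | right_ids
        elif operator == "SUBTRACT":
            ids = left_ids - right_ids
        else:
            raise ValueError(f"unknown operator: {operator}")
        return self.add(f"({left_desc}) {operator} ({right_desc})", ids)

    def result(self, number):
        return sorted(self.entries[number - 1][1])


history = QueryHistory()
q1 = history.add("expressed in pancreas", [1, 2, 3, 4])
q2 = history.add("on chromosome 2", [2, 4, 6])
q3 = history.combine(q1, "AND", q2)
q4 = history.combine(q1, "SUBTRACT", q2)
q5 = history.combine(q3, "OR", q4)

print(history.result(q3))  # ids in both result sets
print(history.result(q4))  # ids in the first set but not the second
print(history.result(q5))  # union of the two combined results
```

Since each combination is stored as a new history entry, combined queries can themselves be combined, as q5 does with q3 and q4.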

Provided Applications

Annotator’s interface

[Screenshot: annotator’s interface, with panels labeled Assign Gene Name/Symbol, Assign Gene Description, Assign Gene Synonym(s) and Evidence.]

Parsing & exporting

- Parsing
  - Sequence DBs: GenBank (main, dbEST, NRDB), SWISS-PROT, TIGR
  - Protein Motifs: CDD, ProDom, InterPro
  - Expression: MAGE
  - Ontologies: GO, SO, PATO
  - Mapping data: RH maps
  - Gene predictors: GLIMMER, Genscan, PHAT, GeneFinder
  - Similarity: BLAST, BLAT, Sim4
  - CAP4
- Exporting
  - FASTA
  - MAGE
  - Table dumps
  - DoTS Assemblies

Analysis & annotation

- GO functional assignment
- Expression analysis (PaGE)
- Anatomy classification
- Library distribution
- Genes from BLAT of DoTS against genome
- DoTS assembly and annotation
  - Refresh warehouse
  - Cluster and assemble mRNAs/ESTs into putative transcripts
  - Annotate transcripts through similarity, GO function and markers
  - Integrate previously existing manual curation

DoTS Pipeline

[Figure: flow diagram of the DoTS pipeline. Labeled elements: mRNA/EST sequence; clustering and assembly; DoTS consensus sequences; genomic sequence; SIM4 or BLAT; gene/RNA cluster assignment; gene predictions (GenScan/HMMer, PHAT); predicted genes; merge genes; gene index; RNAs; translation (framefinder); proteins; BLAST similarities (BLASTP, BLASTX); protein motifs (PFAM, Smart, ProDom); functional predictions (GO functions); other computed annotation (EPCR, AssemblyAnatomyPercent, Index Key Words, SNP analysis); annotate DoTS; manual annotation tasks.]

References & Acknowledgements

References

Scearce, L. Marie, Brestelli, John E., McWeeney, Shannon K., Lee, Catherine S., Mazzarelli, Joan, Pinney, Deborah F., Pizarro, Angel, Stoeckert, C. J. Jr., Clifton, Sandra, Permutt, M. Alan, Brown, Juliana, Melton, Douglas A., Kaestner, Klaus H. (2002) Functional Genomics of the Endocrine Pancreas: The Pancreas Clone Set and PancChip, New Resources for Diabetes Research. Diabetes 51: 1997-2004.

Schug, J., Diskin, S., Mazzarelli, J., Brunk, Brian P., Stoeckert, C.J. (2002) Predicting Gene Ontology Functions from ProDom and CDD Protein Domains. Genome Res. 12: 648-655.

Bahl, A., Brunk, B., Coppel, R.L., Crabtree, J., Diskin, S.J., Fraunholz, M.J., Grant, G.R., Gupta, D., Huestis, R.L., Kissinger, J.C., Labo, P., Li, L., McWeeney, S.K., Milgram, A.J., Roos, D.S., Schug, J., Stoeckert, C.J. (2002) PlasmoDB: The Plasmodium Genome Resource. An integrated database providing tools for accessing and analyzing mapping, expression and sequence data (both finished and unfinished). Nucleic Acids Res. 30: 87-90.

Brazma, A., Hingamp, P., Quackenbush, J., Sherlock, G., Spellman, P., Stoeckert, C., Aach, J., Ansorge, W., Ball, C.A., Causton, H.C., Gaasterland, T., Glenisson, P., Holstege, F.C.P., Kim, I.F., Markowitz, V., Matese, J.C., Parkinson, H., Robinson, A., Sarkans, U., Schulze-Kremer, S., Stewart, J., Taylor, R., Vilo, J., Vingron, M. (2001) Minimum Information About a Microarray Experiment (MIAME): Toward Standards for Microarray Data. Nature Genetics 29: 365-371.

Manduchi, E., Pizarro, A., Stoeckert, C. (2001) RAD (RNA Abundance Database): an infrastructure for array data analysis. Proc. SPIE, vol 4266, pp. 68-78.

Davidson, S.B., Crabtree, J., Brunk, Brian P., Schug, J., Tannen, V., Overton, G.C., Stoeckert, C.J. Jr. (2001) K2/Kleisli and GUS: Experiments in Integrated Access to Genomic Data Sources. IBM Systems Journal 40(2): 512-531.

Crabtree, J., Wiltshire, T., Brunk, B., Zhao, S., Schug, J., Stoeckert, C., Bucan, M. (2001) High-resolution BAC-based Map of the Central Portion of Mouse Chromosome 5. Genome Res. 11: 1746-1757.

Acknowledgements

- NIH grant RO1-HG-01539-03
- DOE grant DE-FG02-00ER62893
- Burroughs Wellcome Fund
- NIDDK 56947 and 56954 with cosponsorship from the JDFI

Related posters

114A. Web-Based Biological Discovery using the GUS Integrated Database.

170A. TESS-II: Describing and Finding Gene Regulatory Sequences with Grammars

148A. Integrating Eukaryotic Genomes by Orthologous Groups: What is Unique about Apicomplexan Parasites?
