GUS: The Genomics Unified Schema
A Platform for Genomics Databases
V. Babenko, B. Brunk, J. Crabtree, S. Diskin, S. Fischer, G. Grant, Y. Kondrahkin, L. Li, J. Liu, J. Mazzarelli, D. Pinney, A. Pizarro, E. Manduchi, S. McWeeney, J. Schug, C. Stoeckert
Center for Bioinformatics, University of Pennsylvania
stevef,[email protected]
Overview
Abstract
The Genomics Unified Schema (GUS) is a strongly typed relational database schema and accompanying portable object-based software platform used for integration, analysis, curation, mining and presentation of sequence-based genomics information. The schema is organized into five domains: a detailed model of the central dogma (gene, RNA, protein) including DNA, assembled RNA, and protein sequence, and a diversity of sequence annotation (DoTS); an MGED-compliant warehouse of transcript expression experiments (RAD); a catalogue of grammars describing regulatory regions (TESS); a wide range of controlled vocabularies and ontologies (SRES); and a detailed representation of data provenance (CORE). (A sixth domain for protein expression is in progress.) GUS’s normalized relational structure and extent of integrated data enable powerful queries not viable in many other genomics systems. The platform facilitates maintenance of the warehouse and its utilization in web and data mining applications.
Goals of GUS
- Generic platform for model-organism or disease-specific databases
- Freely available at www.gusdev.org and www.cbil.upenn.edu
- Integration of genome, transcript and protein data, including:
    - Sequence
    - Function
    - Expression
    - Interaction
    - Regulation
    - Orthologs and paralogs
- Support for:
    - automated annotation and integration
    - manual curation
    - data mining/analysis and sophisticated queries
    - web access
GUS Powers Multiple Genomics DBs
[Figure: AllGenes, PlasmoDB, EPConDB, and other sites and projects, shown above Java Servlets, an Object Layer for Data Loading, and an Oracle RDBMS holding the Core, SRes, TESS, RAD and DoTS domains.]
Components of GUS
- Relational database schema
- Lightweight object layer
- Application frameworks
    - Data access
    - Pipeline/workflow
    - Web (servlets)
- Applications
    - Annotator’s interface
    - Parsers and exporters (using standards)
    - Annotation and analysis programs
    - Schema browser
- Utilizes Oracle 9i
Architecture of GUS
[Figure: architecture diagram. Labels shown:]
- Data sources: Genomic Sequence; microarray & SAGE Experiments; Mapping Data; GenBank, InterPro, GO, etc; GSSs & ESTs; Annotation; QTL, POP, SNP, Clinical
- Database (Oracle/SQL): DoTS, RAD, Core, SRes, TESS
- Access: Object Layer; Annotator’s Interface; Automated Analysis & Integration
- Outputs: WWW queries, browsing, & download (Java Servlets & Perl CGI); Mining Applications
Usage of GUS
- Annotation
    - Of genomes: gene models, sequence features
    - Of genes: function, expression, regulation
- Integration
    - From sequence to expression
    - Map identifiers to/from external databases
- Data mining, creating curated datasets
    - Algorithm-based: GO function prediction
    - Genome-wide querying: find all pancreas-specific transcripts
    - PancChip: non-redundant genes expressed in pancreas, found using ESTs, microarrays and cDNA libraries
GUS Schema
Schema features
- Extensive integrated genomics schema (300 tables)
- Divided into 5 distinct domains
- Highly normalized
- Strongly typed
    - Controlled vocabularies used extensively
    - Avoid using name-value pairs
- Subclassing
    - Use views of superclass to define subclasses
    - Useful for mapping into the object layer
- Warehousing
    - Include databases such as GenBank, GO terms, ProDom, CDD
    - Facilitates management of value-added annotation across updates
- Cross references to external databases
- Tracking and versioning
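The subclassing bullet can be illustrated with a small sketch. It assumes a single superclass table whose rows carry a column naming their subclass, plus one view per subclass that renames generic columns. All table, view and column names here (NAFeatureImp, subclass_view, string1, and so on) are hypothetical, and SQLite stands in for Oracle so the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical superclass table: one row per feature of any subclass.
CREATE TABLE NAFeatureImp (
    na_feature_id  INTEGER PRIMARY KEY,
    subclass_view  TEXT NOT NULL,   -- names the subclass this row belongs to
    na_sequence_id INTEGER NOT NULL,
    name           TEXT,
    string1        TEXT,            -- generic column, renamed by each view
    number1        INTEGER          -- generic column, renamed by each view
);

-- Each subclass is a view that filters on subclass_view and gives the
-- generic columns meaningful names.
CREATE VIEW GeneFeature AS
    SELECT na_feature_id, na_sequence_id, name,
           string1 AS gene_type, number1 AS number_of_exons
    FROM NAFeatureImp WHERE subclass_view = 'GeneFeature';

CREATE VIEW RepeatFeature AS
    SELECT na_feature_id, na_sequence_id, name,
           string1 AS repeat_family
    FROM NAFeatureImp WHERE subclass_view = 'RepeatFeature';
""")

conn.executemany(
    "INSERT INTO NAFeatureImp VALUES (?, ?, ?, ?, ?, ?)",
    [(1, "GeneFeature", 10, "geneA", "protein_coding", 4),
     (2, "RepeatFeature", 10, "rep1", "Alu", None)],
)

# Queries against a view see only that subclass, with typed column names.
gene_rows = conn.execute(
    "SELECT name, gene_type, number_of_exons FROM GeneFeature").fetchall()
repeat_rows = conn.execute(
    "SELECT name, repeat_family FROM RepeatFeature").fetchall()
print(gene_rows)
print(repeat_rows)
```

Because every subclass view exposes its own column names, an object layer could map one class to each view without knowing about the shared storage underneath.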
Five domains
GUS is divided into 5 domains* (separate name spaces)

Namespace                        Domain                    Highlights
DoTS (DB of Transcribed Seqs)    Sequence and annotation   Central dogma
RAD (RNA Abundance DB)           Gene expression           MIAME/MAGE
TESS (Trans Elem Search Site)    Gene regulation           Grammars
Core                             Data Provenance           Evidence
SRes (Shared Resources)          Shared Resources          Ontologies

* Protein interaction domain underway
Querying across the domains
[Figure: the five domains and their contents]
- Core (Data Provenance): Ownership, Protection, Algorithms, Versioning, Workflows
- SRes (Ontologies): GO, Species, Anatomy/Tissue, Developmental stage, Disease state
- DoTS
    - Genomic Sequence: Genes, gene models; STSs, repeats, etc; Cross-species analysis
    - Transcribed Sequence: Characterize transcripts; RH mapping; Library analysis; Cross-species analysis; DoTS assemblies
    - Protein Sequence: Domains; Function; Structure; Cross-species analysis
- RAD (Transcript Expression): Arrays, SAGE, Conditions
- TESS (Gene Regulation): Binding Sites, Patterns, Grammars
Example query spanning DoTS, RAD, SRes, Core and TESS: "Transcription factors upregulated in acute myeloid leukemia with sequence similarity to c-fos and common promoter motifs"
DoTS central dogma schema
[Figure: central dogma schema. Tables shown:]
- Gene, GeneInstance, GeneFeature (isa NA Feature), GenomicSequence (isa NA Sequence)
- RNA, RNAInstance, RNAFeature (isa NA Feature), RNASequence (isa NA Sequence)
- Protein, ProteinInstance, ProteinFeature (isa NA Feature), ProteinSequence (isa AA Sequence)
RAD schema uses MAGE/MIAME
[Figure: UML class diagram of the RAD schema. Associations are drawn with multiplicities 1 to 0..* or 0..1 to 0..*. Classes shown:]
Acquisition, AcquisitionParam, Analysis, AnalysisImplementation, AnalysisImplementationParam, AnalysisInput, AnalysisInvocation, AnalysisInvocationParam, AnalysisOutput, Array, ArrayAnnotation, Assay, AssayLabeledExtract, BioMaterialCharacteristic, BioMaterialImp, BioMaterialMeasurement, Channel, CompositeElementAnnotation, CompositeElementImp, CompositeElementResultImp, Control, ElementAnnotation, ElementImp, ElementResultImp, LabelMethod, MAGEDocumentation, MAGE_ML, OntologyEntry, ProcessImplementation, ProcessImplementationParam, ProcessInvocation, ProcessInvocationParam, ProcessIO, ProcessResult, Protocol, ProtocolParam, Quantification, QuantificationParam, RelatedAcquisition, RelatedQuantification, Study, StudyAssay, StudyDesign, StudyDesignAssay, StudyDesignDescription, StudyFactor, StudyFactorValue, Treatment
[Figure: correspondence between the RAD schema, MAGE and MIAME]
- MAGE: Experiment, Array, BioMaterial, BioAssay, BioAssayData, Protocol/Descr., HigherLevelAnalysis
- MIAME: Experimental Design, Array design, Samples, Hybridization, Measure, Normalization
TESS schema
[Figure: TESS schema diagram. Tables shown:]
- TESS.Model: ModelString, ModelConsensusString, ModelPositionalWeightMatrix, ModelGrammar
- TESS.Activity: ActivityProteinDnaBinding, ActivityTissueSpecificity
- TESS.Moiety: Moiety, MoietyMultimer, MoietyHeterodimer, MoietyComplex
- TESS.FootprintInstance
- DoTS.NaFeature: BindingSite, Promoter, . . .
- DoTS.NaSequence
- TESS.TrainingSet
- TESS.ParameterGroup
- TESS.Note
Ontologies and vocabularies
- Ontologies
    - Gene Ontology (GO)
    - Sequence Ontology (SO) (sequence features)
    - Phenotype and Trait Ontology (PATO)
    - Taxon (NCBI)
    - Anatomy (Penn)
    - Disease (ICD9)
    - Developmental stage (multiple sources)
- And vocabularies
    - External database names
    - Genetic codes
    - Review status
Evidence trail
- Evidence and tracking
    - Data tables have columns for user, date, project, algorithm invocation
    - Tables dedicated to algorithm, algorithm version and parameters
    - 176 algorithms, including public and in-house
    - Tracks automated and manual annotation, similarity and integration
- Versioning
    - All updated or deleted rows are copied to version table
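The tracking and versioning rules above can be sketched as follows. This is only an illustration: it uses database triggers in SQLite to copy old rows into a version table, whereas the poster attributes tracking and versioning to the GusApplication program. The table and column names (Gene, GeneVer, row_user, row_alg_invocation_id, version_action) are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical data table with tracking columns on every row.
CREATE TABLE Gene (
    gene_id               INTEGER PRIMARY KEY,
    name                  TEXT,
    row_user              TEXT,     -- who wrote the row
    modification_date     TEXT,     -- when it was written
    row_alg_invocation_id INTEGER   -- which algorithm run wrote it
);

-- Hypothetical version table: same columns plus version bookkeeping.
CREATE TABLE GeneVer (
    gene_id               INTEGER,
    name                  TEXT,
    row_user              TEXT,
    modification_date     TEXT,
    row_alg_invocation_id INTEGER,
    version_date          TEXT DEFAULT CURRENT_TIMESTAMP,
    version_action        TEXT
);

-- Copy the old row to the version table before it is changed or removed.
CREATE TRIGGER gene_before_update BEFORE UPDATE ON Gene
BEGIN
    INSERT INTO GeneVer (gene_id, name, row_user, modification_date,
                         row_alg_invocation_id, version_action)
    VALUES (OLD.gene_id, OLD.name, OLD.row_user, OLD.modification_date,
            OLD.row_alg_invocation_id, 'update');
END;

CREATE TRIGGER gene_before_delete BEFORE DELETE ON Gene
BEGIN
    INSERT INTO GeneVer (gene_id, name, row_user, modification_date,
                         row_alg_invocation_id, version_action)
    VALUES (OLD.gene_id, OLD.name, OLD.row_user, OLD.modification_date,
            OLD.row_alg_invocation_id, 'delete');
END;
""")

conn.execute("INSERT INTO Gene VALUES (1, 'geneA', 'loader', '2002-01-01', 7)")
conn.execute(
    "UPDATE Gene SET name = 'geneA-renamed', row_user = 'curator' WHERE gene_id = 1")
conn.execute("DELETE FROM Gene WHERE gene_id = 1")

versions = conn.execute(
    "SELECT name, row_user, version_action FROM GeneVer ORDER BY rowid").fetchall()
remaining = conn.execute("SELECT COUNT(*) FROM Gene").fetchone()[0]
print(versions)
```

After the update and the delete, the version table holds both earlier states of the row, each still carrying the user and algorithm invocation that wrote it.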
Sophisticated queries
Sample queries from three projects that utilize GUS’s data integration and analysis:
- www.allgenes.org: “Is my cDNA similar to any mouse genes that are predicted to encode transcription factors and have been localized to mouse chromosome 5?”
- http://plasmodb.org: “List all genes whose proteins are predicted to contain a signal peptide and for which there is evidence that they are expressed in Plasmodium falciparum’s late schizont stage”
- www.cbil.upenn.edu/EPConDB: “Which genes on chromosome 2 are expressed in pancreas and are involved in signal transduction based on GO function assignments?”
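A query of the third kind can be sketched as a join across domains. The tables below are a drastically simplified, hypothetical stand-in for the real schema (the names dots_gene, rad_expression, sres_anatomy, dots_go_assignment and sres_go_term are invented for illustration), and SQLite replaces Oracle so the example runs on its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical, simplified tables named after the domain they would live in.
CREATE TABLE dots_gene (gene_id INTEGER PRIMARY KEY, symbol TEXT, chromosome TEXT);
CREATE TABLE sres_anatomy (anatomy_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sres_go_term (go_term_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE rad_expression (gene_id INTEGER, anatomy_id INTEGER);
CREATE TABLE dots_go_assignment (gene_id INTEGER, go_term_id INTEGER);

INSERT INTO sres_anatomy VALUES (1, 'pancreas'), (2, 'liver');
INSERT INTO sres_go_term VALUES (1, 'signal transduction'), (2, 'transport');
INSERT INTO dots_gene VALUES
    (1, 'Gene1', '2'), (2, 'Gene2', '2'), (3, 'Gene3', '5'), (4, 'Gene4', '2');
INSERT INTO rad_expression VALUES (1, 1), (2, 2), (3, 1), (4, 1);
INSERT INTO dots_go_assignment VALUES (1, 1), (2, 1), (3, 1), (4, 2);
""")

# Sequence location, expression and GO function are combined in one statement.
query = """
SELECT DISTINCT g.symbol
FROM dots_gene g
JOIN rad_expression e      ON e.gene_id = g.gene_id
JOIN sres_anatomy a        ON a.anatomy_id = e.anatomy_id
JOIN dots_go_assignment ga ON ga.gene_id = g.gene_id
JOIN sres_go_term t        ON t.go_term_id = ga.go_term_id
WHERE g.chromosome = ? AND a.name = ? AND t.name = ?
ORDER BY g.symbol
"""
hits = [row[0] for row in
        conn.execute(query, ("2", "pancreas", "signal transduction"))]
print(hits)
```

Only Gene1 satisfies all three conditions in this toy data: Gene2 is expressed in liver, Gene3 lies on chromosome 5, and Gene4 has a different GO term.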
Application Frameworks
GUS Object layer
- Lightweight Perl implementation
- Java on the way
- One object per table
- Parent/child relationships
- Cascading delete
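The object-layer ideas listed above (one object per table, parent/child relationships, cascading delete) can be sketched in a few lines. The GUS object layer itself is written in Perl; the Python class below, its method names, and the Gene/RNA/Protein parent chain are hypothetical and only illustrate the concepts.

```python
class GusRow:
    """In-memory object standing for one row of one table (hypothetical API)."""

    def __init__(self, table, **attrs):
        self.table = table        # name of the table this object maps to
        self.attrs = attrs        # column values
        self.parent = None
        self.children = []
        self.deleted = False

    def add_child(self, child):
        """Link a child row (one holding a foreign key to this row)."""
        child.parent = self
        self.children.append(child)
        return child

    def mark_deleted(self):
        """Cascading delete: children are marked before their parent.

        Returns the rows in the order they would have to be deleted so that
        no foreign key is left dangling.
        """
        order = []
        for child in self.children:
            order.extend(child.mark_deleted())
        self.deleted = True
        order.append(self)
        return order


gene = GusRow("Gene", name="geneA")
rna = gene.add_child(GusRow("RNA", name="rnaA"))
protein = rna.add_child(GusRow("Protein", name="protA"))

deletion_order = [row.table for row in gene.mark_deleted()]
print(deletion_order)
```

Deleting the gene marks the protein first, then the RNA, then the gene itself, which is the order a database with foreign-key constraints would require.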
Data input
The GusApplication program manages inserts and updates to GUS, handling tracking and versioning.
Specific tasks are implemented as plugins. Plugins use either GUS objects or SQL access. Low-level database access is provided by DBI classes.
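The division of labour between the application and its plugins can be sketched roughly as below. This is a Python illustration of the pattern rather than the Perl GusApplication API: the class names, the register/run methods, and the row_user and row_alg_invocation_id columns are all assumptions.

```python
class TrackedDb:
    """Hypothetical database handle that stamps tracking data on each row."""

    def __init__(self, user, invocation_id):
        self.user = user
        self.invocation_id = invocation_id
        self.rows = []            # stands in for rows written to the database

    def insert(self, table, values):
        row = dict(values, table=table, row_user=self.user,
                   row_alg_invocation_id=self.invocation_id)
        self.rows.append(row)


class Plugin:
    """Base class: a plugin implements one specific loading task."""
    name = "base"

    def run(self, db, args):
        raise NotImplementedError


class InsertGeneNames(Plugin):
    name = "InsertGeneNames"

    def run(self, db, args):
        for gene_name in args["names"]:
            db.insert("Gene", {"name": gene_name})
        return len(args["names"])


class Application:
    """Runs plugins and gives each run its own invocation id."""

    def __init__(self):
        self.plugins = {}
        self.next_invocation_id = 1

    def register(self, plugin_cls):
        self.plugins[plugin_cls.name] = plugin_cls

    def run(self, plugin_name, user, args):
        db = TrackedDb(user, self.next_invocation_id)
        self.next_invocation_id += 1
        count = self.plugins[plugin_name]().run(db, args)
        return db.rows, count


app = Application()
app.register(InsertGeneNames)
rows, count = app.run("InsertGeneNames", "annotator1", {"names": ["geneA", "geneB"]})
print(count, rows[0])
```

The plugin only says what to insert; the application supplies who ran it and under which invocation, so tracking stays uniform across all plugins.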
[Figure: GusApplication and its Plugins, with Objects and their SuperClasses, SQL and DBI as the routes to the DoTS, RAD, TESS, Core and SRes domains.]
Pipeline
- Perl API for defining annotation pipelines
- Supports sequential protocols
- Distributes compute-intensive work to compute cluster
- Used for 90-stage pipeline to build DoTS transcript index
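A sequential pipeline with some distributed steps might look like the sketch below. It is written in Python rather than Perl, the Pipeline class and its add_step/run methods are hypothetical, and a local thread pool stands in for the compute cluster.

```python
from concurrent.futures import ThreadPoolExecutor


class Pipeline:
    """Runs named steps in order; steps flagged as distributed run per item."""

    def __init__(self):
        self.steps = []

    def add_step(self, name, func, distributed=False):
        self.steps.append((name, func, distributed))

    def run(self, items, workers=4):
        log = []
        for name, func, distributed in self.steps:
            if distributed:
                # Per-item work is farmed out; map() preserves input order.
                with ThreadPoolExecutor(max_workers=workers) as pool:
                    items = list(pool.map(func, items))
            else:
                # Whole-dataset steps run in-process, one after another.
                items = func(items)
            log.append(name)
        return items, log


def filter_short(seqs):
    """Whole-dataset step: drop sequences shorter than 4 bases."""
    return [s for s in seqs if len(s) >= 4]


def gc_content(seq):
    """Per-sequence step: fraction of G and C bases."""
    return seq, (seq.count("G") + seq.count("C")) / len(seq)


pipeline = Pipeline()
pipeline.add_step("filter_short", filter_short)
pipeline.add_step("gc_content", gc_content, distributed=True)

results, log = pipeline.run(["ACGT", "GG", "GGCC"])
print(results, log)
```

Each step receives the output of the previous one, so the declared order is the protocol; only the steps marked as distributed are split into independent units of work.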
Web
- Servlets and CGI-based design (JSP on the way)
- Automatic generation of HTML FORMs
    - Automated input checking
    - Integrated help features
    - INPUT elements populated from the database
- Query history facility
- Boolean queries (AND, OR, SUBTRACT)
- Declarative configuration file
- Base system is relatively independent of GUS
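Boolean queries over a query history reduce to set operations on stored result sets. The sketch below assumes that each history entry holds a set of identifiers; the combine function and the history layout are hypothetical.

```python
def combine(left, op, right):
    """Combine two stored result sets with AND, OR or SUBTRACT."""
    if op == "AND":
        return left & right       # in both results
    if op == "OR":
        return left | right       # in either result
    if op == "SUBTRACT":
        return left - right       # in the first result but not the second
    raise ValueError(f"unknown operator: {op}")


# Hypothetical query history: query number -> identifiers it returned.
history = {
    1: {"DT.1", "DT.2", "DT.3"},  # e.g. transcripts expressed in pancreas
    2: {"DT.2", "DT.3", "DT.4"},  # e.g. transcripts on chromosome 2
}

both = combine(history[1], "AND", history[2])
either = combine(history[1], "OR", history[2])
only_first = combine(history[1], "SUBTRACT", history[2])
print(sorted(both), sorted(either), sorted(only_first))
```

Keeping earlier results in the history means a user can refine a search step by step instead of rewriting one large query.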
Provided Applications
Annotator’s interface
[Screenshot: forms to Assign Gene Name/Symbol, Assign Gene Description, and Assign Gene Synonym(s), with Evidence.]
Parsing & exporting
- Parsing
    - Sequence DBs: GenBank (main, dbEST, NRDB), SWISS-PROT, TIGR
    - Protein Motifs: CDD, ProDom, InterPro
    - Expression: MAGE
    - Ontologies: GO, SO, PATO
    - Mapping data: RH maps
    - Gene predictors: GLIMMER, Genscan, PHAT, GeneFinder
    - Similarity: BLAST, BLAT, Sim4
    - CAP4
- Exporting
    - FASTA
    - MAGE
    - Table dumps
    - DoTS Assemblies
Analysis & annotation
- GO functional assignment
- Expression analysis (PaGE)
- Anatomy classification
- Library distribution
- Genes from BLAT of DoTS against genome
- DoTS assembly and annotation
    - Refresh warehouse
    - Cluster and assemble mRNAs/ESTs into putative transcripts
    - Annotate transcripts through similarity, GO function and markers
    - Integrate previously existing manual curation
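The clustering step can be pictured with a toy example. The sketch below groups sequences by single linkage over pairs already judged similar, using a union-find structure. This is a stand-in chosen for brevity and is not claimed to be the method used to build DoTS; the function name and the identifiers are hypothetical.

```python
def cluster(ids, similar_pairs):
    """Group ids so that any two linked by a chain of similar pairs share a cluster."""
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the cluster representative, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a   # merge the two clusters

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(members) for members in groups.values())


clusters = cluster(
    ["est1", "est2", "est3", "mrna1", "est4"],
    [("est1", "est2"), ("est2", "mrna1")],
)
print(clusters)
```

Here est1, est2 and mrna1 fall into one cluster because est2 links the other two, while est3 and est4 remain singletons; each cluster would then be handed to an assembler to produce a consensus sequence.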
DoTS Pipeline
[Figure: DoTS pipeline flow diagram. Labels shown:]
- Inputs: Genomic Sequence; mRNA/EST Sequence
- Transcript steps: Clustering and Assembly; DoTS consensus Sequences; Gene/RNA cluster assignment (SIM4 or BLAT); Merge Genes; Gene Index
- Gene predictions (GenScan/HMMer, PHAT): Predicted Genes
- Products: RNAs; Proteins (translation, framefinder)
- Annotation: BLAST Similarities (BLASTP, BLASTX); Protein Motifs (PFAM, Smart, ProDom); Functional predictions (GO Functions); Other computed annotation (EPCR, AssemblyAnatomyPercent, Index Key Words, SNP analysis)
- Annotate DoTS: Manual Annotation Tasks
Acknowledgements
- NIH grant RO1-HG-01539-03
- DOE grant DE-FG02-00ER62893
- Burroughs Wellcome Fund
- NIDDK 56947 and 56954 with cosponsorship from the JDFI
Related posters
114A. Web-Based Biological Discovery using the GUS Integrated Database.
170A. TESS-II: Describing and Finding Gene Regulatory Sequences with Grammars
148A. Integrating Eukaryotic Genomes by Orthologous Groups: What is Unique about Apicomplexan Parasites?