Managing Gene Annotation Information
the search is over … one problem solved … another begins
observations from a foot soldier in the bio-information (r)evolution
Bill Farmerie -- ICBR Genomics Group
Interdisciplinary Center for Biotechnology Research
Established at the University of Florida in 1987 by the Florida Legislature
Centralized organization of biomedical core facilities supporting biotechnology-based research
How did information management become my problem?
1998 GSAC Miami Beach
Why should I care about this problem? Because my paycheck depends on it. Avoid fatal failure in the funding loop.
PI has $ for large gene-based project
Core Lab generates data
Downstream data management & analysis
PI writes papers, gives talks
PI applies for new funding
Other PIs think this looks like a good idea
From Sequence to Function
The genomic sequence identifies the 'parts'; the next trick is understanding gene function
Post-genomic era = functional genomics
Critical concept: genes of similar sequence may have similar functions
Inferring function for a new gene begins with searching for its nearest neighbor (or homolog) of known function
BLAST
Most common starting point for gene identification
Similarity search of sequence repository (GenBank)
Output
  Calculated scores (bit score and e-value)
  Text string (definition line), ID Reference Tag
  Sequence alignment
Advantages
  Fast algorithm, very good at finding close homologs
Disadvantages
  Not good at finding distant relatives
Cluster and Grid-enabled versions available
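The two calculated scores in BLAST output are related, and a short sketch may make that concrete. Under Karlin-Altschul statistics the expected number of chance hits is E = m * n * 2^(-S'), where S' is the bit score, m the query length, and n the database length. The function name and the example lengths below are illustrative assumptions; BLAST itself uses effective lengths corrected for edge effects, so the E-values it reports will differ somewhat from this simplified form.

```python
def expected_chance_hits(bit_score, query_length, database_length):
    """Approximate E-value from a bit score: E = m * n * 2**(-S').

    Simplified: uses raw lengths, not BLAST's effective lengths.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Hypothetical search: 500-residue query against a 1e9-residue database.
    for score in (30, 50, 100):
        e_value = expected_chance_hits(score, 500, 1_000_000_000)
        print(f"bit score {score:>3}: E ~ {e_value:.3g}")
```

Each additional bit halves the expected number of chance hits, which is why filtering on either score gives a similar ranking within one search.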
HMMER
HMMER developed by Sean Eddy
Uses Hidden Markov Models
Searches unknown protein query sequence against a database of protein family models
Statistical models constructed from alignment of conserved protein regions (Pfam)
Advantages
  Superior to BLAST for discovering more distant homology relations
Disadvantages
  More computationally intensive than BLAST
GRID enabled
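To illustrate what "a statistical model constructed from an alignment" means, here is a deliberately simplified sketch. It is not HMMER and not a hidden Markov model: it builds only per-column log-odds scores from an ungapped toy alignment, with no insert or delete states. The alignment, the uniform background, the pseudocount, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Per-column log2-odds scores against a uniform background (assumed)."""
    width = len(alignment[0])
    if any(len(seq) != width for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for col in range(width):
        counts = Counter(seq[col] for seq in alignment)
        profile.append({
            aa: math.log2(((counts.get(aa, 0) + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score_segment(profile, segment):
    """Sum the column scores for a segment the same width as the profile."""
    return sum(column[aa] for column, aa in zip(profile, segment))


def best_window(profile, query):
    """Return (score, start) of the best-scoring ungapped window in query."""
    width = len(profile)
    windows = range(len(query) - width + 1)
    return max((score_segment(profile, query[i:i + width]), i) for i in windows)


if __name__ == "__main__":
    toy_alignment = ["HCDEH", "HCNEH", "HCDEH"]  # made-up conserved region
    profile = build_profile(toy_alignment)
    print(best_window(profile, "AAAHCDEHAAA"))
```

A family model scores each position by how conserved it is across the family, which is one reason profile methods can detect more distant relatives than pairwise comparison.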
OK! Great!
Sequencing done. Homology searches complete.
But how will I deliver this information to scientists spread all over campus, and their worldwide collaborators?
Search for summarizing information that restores sanity
CTGGGTTCTGTTCGGGATCCCAGTCACAGGGACAATGGCGCATTCATATGTCACTTCCTTTACCTGCCTGGA
GAGGTGTGGCCACAGACTCTGGTGGCTGCGAACGGGGACTCTGACCCAGTCGACTTTATCGCCTTGACGAAG
AACCAGATTGACGTTGTCGGAGTCGGAACTCACCTGGTCACCTGTACGACTCAGCCGTCGCTGGGTTGCGTT
CTGACACGCGGCTCCTCGTGTGGAGCCGAAACCCCGACAAAAGCGAAGGAGAGAGTGAGTATGAGCAGGCGG
BlastQuest
A small idea with a big mission
BlastQuest Requirements
Accessible to research groups at remote locations
Privacy-constrained sharing of results among the scientists
Selective browsing of BLAST homology search results
Selective data filtering on statistical criteria
  e-value or bit score
Selective data grouping on criteria such as GI number, or a defined number of top-scoring results
Ad hoc search capability on user-determined criteria:
  text terms
  boolean logic
From a computational point of view, BlastQuest is embarrassingly simple. However, it solved our problems of information storage, selective retrieval, and distribution.
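The three retrieval requirements (statistical filtering, top-N grouping, and ad hoc text search with boolean logic) can be sketched against a relational table. This is a minimal sketch, not BlastQuest's actual schema or code: the `hits` table, its columns, the demo rows, and the function names are all hypothetical, and Python's built-in `sqlite3` stands in for the MySQL DBMS so the example is self-contained.

```python
import sqlite3
from itertools import groupby


def make_demo_db():
    """Create an in-memory table of made-up BLAST hits (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE hits (query_id TEXT, gi INTEGER, definition TEXT, "
        "bit_score REAL, e_value REAL)"
    )
    conn.executemany("INSERT INTO hits VALUES (?, ?, ?, ?, ?)", [
        ("contig1", 101, "heat shock protein 70 [Homo sapiens]", 250.0, 1e-70),
        ("contig1", 102, "hypothetical protein [Mus musculus]", 60.0, 1e-8),
        ("contig1", 103, "heat shock protein 70 [Danio rerio]", 180.0, 1e-50),
        ("contig2", 201, "actin, cytoplasmic [Gallus gallus]", 300.0, 1e-90),
        ("contig2", 202, "hypothetical protein [Homo sapiens]", 35.0, 0.5),
    ])
    return conn


def filter_hits(conn, max_e_value):
    """Selective filtering on a statistical criterion (e-value cutoff)."""
    return conn.execute(
        "SELECT query_id, gi, bit_score, e_value FROM hits "
        "WHERE e_value <= ? ORDER BY query_id, bit_score DESC",
        (max_e_value,),
    ).fetchall()


def top_hits_per_query(conn, n):
    """Selective grouping: keep the n top-scoring hits for each query."""
    rows = conn.execute(
        "SELECT query_id, gi, bit_score FROM hits "
        "ORDER BY query_id, bit_score DESC"
    ).fetchall()
    return {
        query: [(gi, score) for _, gi, score in list(group)[:n]]
        for query, group in groupby(rows, key=lambda row: row[0])
    }


def text_search(conn, must_contain, must_not_contain):
    """Ad hoc boolean search on the definition line: term AND NOT term."""
    rows = conn.execute(
        "SELECT gi FROM hits WHERE definition LIKE ? "
        "AND definition NOT LIKE ? ORDER BY gi",
        (f"%{must_contain}%", f"%{must_not_contain}%"),
    ).fetchall()
    return [gi for (gi,) in rows]


if __name__ == "__main__":
    db = make_demo_db()
    print(filter_hits(db, 1e-10))
    print(top_hits_per_query(db, 1))
    print(text_search(db, "protein", "hypothetical"))
```

Parameterized queries keep user-supplied search terms out of the SQL text, which matters once the interface is exposed to remote research groups.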
Overview of BlastQuest Architecture
[Figure: three-tier architecture diagram]
Tier 3: Web browser (client-side GUI)
Tier 2: Web server, Client Interface Module, SQL Constructor, XML Loader, ACE Loader, JDBC
Tier 1: MySQL DBMS
Inputs: BLAST XML document (XML Loader), assembly ACE file (ACE Loader)
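The XML Loader step in the diagram can be sketched as follows: read a BLAST XML report and flatten each hit into a row ready for database insertion. This is an illustrative sketch, not the BlastQuest loader. The element names follow the NCBI BLAST XML report format; the sample document, its identifiers, and the function name are made up, and only the first HSP of each hit is kept.

```python
import xml.etree.ElementTree as ET

# Made-up miniature report using NCBI BLAST XML element names.
SAMPLE_XML = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>contig1</Iteration_query-def>
      <Iteration_hits>
        <Hit>
          <Hit_id>gi|101|demo</Hit_id>
          <Hit_def>heat shock protein 70 [made-up example]</Hit_def>
          <Hit_hsps>
            <Hsp>
              <Hsp_bit-score>250.0</Hsp_bit-score>
              <Hsp_evalue>1e-70</Hsp_evalue>
            </Hsp>
          </Hit_hsps>
        </Hit>
        <Hit>
          <Hit_id>gi|102|demo</Hit_id>
          <Hit_def>hypothetical protein [made-up example]</Hit_def>
          <Hit_hsps>
            <Hsp>
              <Hsp_bit-score>60.0</Hsp_bit-score>
              <Hsp_evalue>1e-08</Hsp_evalue>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""


def parse_blast_xml(xml_text):
    """Flatten a BLAST XML report into one dict per hit (first HSP only)."""
    root = ET.fromstring(xml_text)
    rows = []
    for iteration in root.iter("Iteration"):
        query = iteration.findtext("Iteration_query-def")
        for hit in iteration.iter("Hit"):
            hsp = hit.find("Hit_hsps/Hsp")
            if hsp is None:
                continue  # a hit without HSPs carries no scores to store
            rows.append({
                "query": query,
                "hit_id": hit.findtext("Hit_id"),
                "definition": hit.findtext("Hit_def"),
                "bit_score": float(hsp.findtext("Hsp_bit-score")),
                "e_value": float(hsp.findtext("Hsp_evalue")),
            })
    return rows


if __name__ == "__main__":
    for row in parse_blast_xml(SAMPLE_XML):
        print(row)
```

Once hits are flat rows, filtering, grouping, and text search reduce to ordinary SQL over a single table.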
Welcome to BlastQuest
Choose among client projects
Results Selection
Grouped Results
Ad Hoc Text Searching
Internal BLAST Searches
Viewing a Gene Ontology Tree
KEGG Classification
Kyoto Encyclopedia of Genes and Genomes
“Wiring diagrams of life”
KEGG Protein Networks
Metabolic pathways
Regulatory pathways
Molecular complexes
Network-network relations
Network-environment relations
[Figure: Venn diagram with regions labeled "Unique to non-Unigene", "Common to both", and "Unique to Unigene"]
Bacterial Genome Annotation Workbench
Another simple idea driven by necessity
Start
Project Summary
Contig Browser
Contig summary
Physical map linked to annotation
Simple problems. Simple solutions. Why are these simple ideas important?
Human Genome Project
HGP drove innovation in biotechnology
2 major technological benefits
  stimulated development of high throughput methods
  reliance on computational tools for data mining and visualization of biological information
The HGP and the cost of DNA sequencing
“finished” quality DNA sequence
  a DNA base call is considered finished if the probability of base call error is less than 1 in 10,000, also known as phred > 40
  contiguous DNA sequence of phred > 40 is usually achieved by multifold sequencing of the same region; typically 7-10X coverage
1985: $10 per finished base
2001: $1 per 10 finished bases
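The link between "1 in 10,000" and "phred 40", and the size of the cost drop quoted above, can be checked with a few lines of arithmetic. The phred quality score is defined as Q = -10 * log10(P), where P is the probability of a base-call error. The function names are illustrative; the dollar figures are the ones on the slide.

```python
import math


def phred_quality(error_probability):
    """Phred score from base-call error probability: Q = -10 * log10(P)."""
    return -10.0 * math.log10(error_probability)


def error_probability(phred):
    """Inverse mapping: P = 10 ** (-Q / 10)."""
    return 10.0 ** (-phred / 10.0)


def cost_fold_decrease(cost_before, cost_after):
    """How many times cheaper one finished base became."""
    return cost_before / cost_after


if __name__ == "__main__":
    print(phred_quality(1 / 10_000))          # error rate of 1 in 10,000
    print(error_probability(40))              # phred 40
    # Slide figures: $10 per base (1985) vs $1 per 10 bases (2001).
    print(cost_fold_decrease(10.0, 1.0 / 10))
```

On the slide's figures, the cost per finished base fell roughly 100-fold between 1985 and 2001.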
GenBank August 22, 2005
Public Collections of DNA and RNA Sequence Reach 100 Gigabases
Trends in the cost efficiency of DNA sequencing§
§Shendure, J., Mitra, R., Varma, C., and Church, G.M. (2004) “Advanced sequencing technologies: methods and goals” Nature Reviews Genetics 5:335
454 Life Sciences Corporation
The first commercial, massively parallel, DNA sequencing technology
454 Technology
Cyclic-array sequencing on in vitro amplified DNA molecules
  individual molecules must be amplified to give a detectable sequencing signal
Instead of biological cloning, we amplify individual DNA fragments on solid state beads using PCR
Instead of terminator-based sequencing, pyrosequencing is used to determine nucleotide order
  “sequencing by synthesis”
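How pyrosequencing "determines nucleotide order" can be sketched as a toy base caller. In sequencing by synthesis, nucleotides are flowed over the template one at a time in a fixed cycle, and the light signal from each flow is roughly proportional to the number of bases incorporated. The sketch below assumes a T, A, C, G flow cycle, already-normalized signals, and simple rounding; the signal values are made up, and real base callers model noise far more carefully.

```python
FLOW_ORDER = "TACG"  # assumed nucleotide flow cycle


def call_bases(flow_signals, flow_order=FLOW_ORDER):
    """Convert normalized per-flow signals into a base sequence.

    Each signal is rounded to the nearest whole number of incorporated
    bases, so a signal near 2.0 yields a homopolymer run of length 2.
    """
    bases = []
    for index, signal in enumerate(flow_signals):
        nucleotide = flow_order[index % len(flow_order)]
        run_length = max(0, round(signal))
        bases.append(nucleotide * run_length)
    return "".join(bases)


if __name__ == "__main__":
    # Two cycles of T, A, C, G flows with made-up signal intensities.
    signals = [1.02, 0.05, 1.90, 0.10, 0.00, 2.96, 0.03, 1.10]
    print(call_bases(signals))
```

Because run length is inferred from signal intensity, long homopolymer runs are the hardest case for this kind of chemistry: the difference between 7 and 8 incorporations is a small fraction of the total signal.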
454 Process Overview
The bottom line …
  efficiency of DNA sequencing increased 100X
  cost per finished base declined 10- to 30-fold
… so what happens next?
  The “democratization” of large-scale genomic biology
  Many projects are now possible that were once fiscally inviable
  We must deal with basic local data management and information issues or lose this opportunity
If you thought bioinformatics was important before
By terminator-based sequencing we @ UF produce 60-70 Mbp per year
By synthesis-based sequencing we produce 60-70 Mbp per day
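The scale of that change is easy to work out. The sketch below uses the 60-70 Mbp figures from the slide and assumes, purely for illustration, that the synthesis-based instrument runs every day of the year; actual annual output depends on instrument uptime.

```python
DAYS_PER_YEAR = 365  # assumption: the instrument runs every day


def annual_output_mbp(mbp_per_day, days=DAYS_PER_YEAR):
    """Yearly output if the daily rate were sustained all year."""
    return mbp_per_day * days


def fold_increase(mbp_per_day, mbp_per_year, days=DAYS_PER_YEAR):
    """Ratio of sustained synthesis-based output to terminator-based output."""
    return annual_output_mbp(mbp_per_day, days) / mbp_per_year


if __name__ == "__main__":
    for rate in (60, 70):
        yearly = annual_output_mbp(rate)
        print(f"{rate} Mbp/day -> {yearly} Mbp/year ({yearly / 1000:.2f} Gbp)")
    print(fold_increase(65, 65))
```

Under that assumption, one year of synthesis-based sequencing produces on the order of 22 to 26 Gbp, several hundred times the terminator-based output, all of which has to be stored, annotated, and delivered.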