Managing Gene Annotation Information: the search is over … one problem solved … another begins

Observations from a foot soldier in the bio-information (r)evolution

Bill Farmerie -- ICBR Genomics Group

Interdisciplinary Center for Biotechnology Research

Established at the University of Florida in 1987 by the Florida Legislature

Centralized organization of biomedical core facilities supporting biotechnology-based research

How did information management become my problem?

1998 GSAC Miami Beach

Why should I care about this problem? Because my paycheck depends on it. Avoid fatal failure in the funding loop.

PI has $ for large gene-based project

Core Lab generates data

Downstream data management & analysis

PI writes papers, gives talks

PI applies for new funding

Other PIs think this looks like a good idea

From Sequence to Function

The genomic sequence identifies the 'parts'; the next trick is understanding gene function

Post-genomic era = functional genomics

Critical concept: genes of similar sequence may have similar functions

Inferring function for a new gene begins with searching for its nearest neighbor (or homolog) of known function

BLAST

Most common starting point for gene identification

Similarity search of sequence repository (GenBank)

Output: calculated scores (bit score and e-value); text string (definition line), ID Reference Tag; sequence alignment

Advantages: fast algorithm, very good at finding close homologs

Disadvantages: not good at finding distant relatives

Cluster and Grid-enabled versions available
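The two calculated scores in the BLAST output are linked. As a rough sketch, assuming the standard Karlin-Altschul relation E = m * n * 2^(-S'), where S' is the bit score and m and n are the effective lengths of the query and the database, an e-value can be estimated from a bit score. The function name and the example lengths are illustrative and do not come from the talk.

```python
def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Expected number of chance alignments scoring at least bit_score.

    Assumes E = m * n * 2**(-S'), with m and n taken as the effective
    query and database lengths (hypothetical values in this sketch).
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Illustrative numbers only: a 300-residue query against 10**9 residues.
for score in (30, 40, 50):
    print(score, expect_value(score, 300, 10**9))
```

Each additional 10 bits lowers the e-value by roughly a factor of 1,000, which is why close homologs stand out so sharply and distant relatives get lost in the noise.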

HMMER

HMMER developed by Sean Eddy

Uses Hidden Markov Models

Searches unknown protein query sequence against a database of protein family models

Statistical models constructed from alignment of conserved protein regions (Pfam)

Advantages: superior to BLAST for discovering more distant homology relations

Disadvantages: more computationally intensive than BLAST

Grid-enabled

OK! Great!

Sequencing done. Homology searches complete.

But how will I deliver this information to scientists spread all over campus, and their worldwide collaborators?

Search for summarizing information that restores sanity

CTGGGTTCTGTTCGGGATCCCAGTCACAGGGACAATGGCGCATTCATATGTCACTTCCTTTACCTGCCTGGA

GAGGTGTGGCCACAGACTCTGGTGGCTGCGAACGGGGACTCTGACCCAGTCGACTTTATCGCCTTGACGAAG

AACCAGATTGACGTTGTCGGAGTCGGAACTCACCTGGTCACCTGTACGACTCAGCCGTCGCTGGGTTGCGTT

CTGACACGCGGCTCCTCGTGTGGAGCCGAAACCCCGACAAAAGCGAAGGAGAGAGTGAGTATGAGCAGGCGG

BlastQuest

A small idea with a big mission

BlastQuest Requirements

Accessible to research groups at remote locations

Privacy-constrained sharing of results among the scientists

Selective browsing of BLAST homology search results

Selective data filtering on statistical criteria: e-value or bit score

Selective data grouping on criteria such as GI number, or a defined number of top-scoring results

Ad hoc search capability on user-determined criteria: text terms, boolean logic

From a computational point of view, BlastQuest is embarrassingly simple. However, it solved our problems of information storage, selective retrieval, and distribution.
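The filtering, grouping, and text-search requirements map naturally onto SQL queries. The following is a minimal sketch, not BlastQuest's actual code: it uses Python's built-in sqlite3 as a stand-in for MySQL, and the `hits` table, its columns, the sample rows, and the `filter_hits` helper are all hypothetical.

```python
import sqlite3

# Hypothetical schema and made-up rows; sqlite3 stands in for MySQL here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hits (query_id TEXT, gi INTEGER, definition TEXT, "
    "bit_score REAL, e_value REAL)"
)
conn.executemany(
    "INSERT INTO hits VALUES (?, ?, ?, ?, ?)",
    [
        ("contig1", 101, "heat shock protein 70 [Homo sapiens]", 250.0, 1e-70),
        ("contig1", 102, "hypothetical protein [Mus musculus]", 45.0, 2e-3),
        ("contig1", 103, "heat shock protein [Danio rerio]", 180.0, 1e-50),
        ("contig2", 201, "actin, cytoplasmic [Gallus gallus]", 300.0, 1e-90),
        ("contig2", 202, "kinase domain protein", 30.0, 5.0),
    ],
)


def filter_hits(max_e_value, text_terms=(), top_n=None):
    """Return hits at or below an e-value cutoff whose definition line
    contains every term in text_terms (boolean AND), keeping at most
    top_n best-scoring hits per query when top_n is given."""
    sql = (
        "SELECT query_id, gi, definition, bit_score, e_value "
        "FROM hits WHERE e_value <= ?"
    )
    params = [max_e_value]
    for term in text_terms:
        sql += " AND definition LIKE ?"
        params.append(f"%{term}%")
    sql += " ORDER BY query_id, bit_score DESC"
    rows = conn.execute(sql, params).fetchall()
    if top_n is None:
        return rows
    kept, seen = [], {}
    for row in rows:
        seen[row[0]] = seen.get(row[0], 0) + 1
        if seen[row[0]] <= top_n:
            kept.append(row)
    return kept


# Best hit per query among statistically strong matches.
for row in filter_hits(1e-10, top_n=1):
    print(row)
```

Building the query string from user choices while passing the values as bound parameters is one plausible reading of what a "SQL constructor" component would do.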

Overview of BlastQuest Architecture

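[Figure: three-tier architecture diagram, with tiers labeled Tier 1, Tier 2, and Tier 3. Labeled components: Web Browser (client-side GUI); Web Server; Client Interface Module; SQL Constructor; JDBC; XML Loader (BLAST XML document); ACE Loader (assembly ACE file); MySQL DBMS]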
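To illustrate what the loader step of such an architecture might do, here is a small sketch that flattens a BLAST XML document into rows ready for a database insert. It assumes the classic NCBI BLAST XML element names (`Iteration`, `Hit`, `Hit_def`, `Hsp_bit-score`, `Hsp_evalue`); the sample document, its identifiers, and the `load_hits` helper are made up for illustration, and the original loader was presumably written in Java rather than Python.

```python
import xml.etree.ElementTree as ET

# Made-up, heavily trimmed sample in the classic BLAST XML layout.
SAMPLE = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>contig1</Iteration_query-def>
      <Iteration_hits>
        <Hit>
          <Hit_id>gi|101|ref|EXAMPLE|</Hit_id>
          <Hit_def>heat shock protein 70 [Homo sapiens]</Hit_def>
          <Hit_hsps>
            <Hsp>
              <Hsp_bit-score>250.0</Hsp_bit-score>
              <Hsp_evalue>1e-70</Hsp_evalue>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""


def load_hits(xml_text):
    """Flatten BLAST XML into (query, hit id, definition, bit score, e-value)
    tuples, taking only the first HSP reported for each hit."""
    rows = []
    root = ET.fromstring(xml_text)
    for iteration in root.iter("Iteration"):
        query = iteration.findtext("Iteration_query-def")
        for hit in iteration.iter("Hit"):
            hsp = hit.find("Hit_hsps/Hsp")
            rows.append(
                (
                    query,
                    hit.findtext("Hit_id"),
                    hit.findtext("Hit_def"),
                    float(hsp.findtext("Hsp_bit-score")),
                    float(hsp.findtext("Hsp_evalue")),
                )
            )
    return rows


print(load_hits(SAMPLE))
```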

Welcome to BlastQuest

Choose among client projects

Results Selection

Grouped Results

Ad Hoc Text Searching

Internal BLAST Searches

Viewing a Gene Ontology Tree



KEGG Classification

Kyoto Encyclopedia of Genes and Genomes

“Wiring diagrams of life”

KEGG Protein Networks: metabolic pathways; regulatory pathways; molecular complexes

Network-network relations

Network-environment relations

[Figure labels: Unique to non-Unigene; Common to both; Unique to Unigene]

Bacterial Genome Annotation Workbench

Another simple idea driven by necessity

Start

Project Summary

Contig Browser

Contig summary

Physical map linked to annotation

Simple problems. Simple solutions. Why are these simple ideas important?

Human Genome Project

HGP drove innovation in biotechnology

2 major technological benefits: stimulated development of high-throughput methods; reliance on computational tools for data mining and visualization of biological information

The HGP and the cost of DNA sequencing

“Finished” quality DNA sequence: a DNA base call is considered finished if the probability of base call error is less than 1 in 10,000, also known as phred > 40

Contiguous DNA sequence of phred > 40 is usually achieved by multifold sequencing of the same region; typically 7-10X coverage

1985: $10 per finished base

2001: $1 per 10 finished bases
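The link between "1 in 10,000" and "phred 40" follows from the standard phred definition, Q = -10 * log10(P), where P is the probability that a base call is wrong. A minimal sketch of the conversion in both directions (the function names are illustrative):

```python
import math


def phred_quality(error_probability: float) -> float:
    """Phred quality score for a given base-call error probability."""
    return -10.0 * math.log10(error_probability)


def error_probability(phred: float) -> float:
    """Base-call error probability implied by a phred quality score."""
    return 10.0 ** (-phred / 10.0)


print(phred_quality(1 / 10_000))  # an error rate of 1 in 10,000
print(error_probability(20))      # error rate implied by phred 20
```

Because the scale is logarithmic, every 10 phred points correspond to a tenfold drop in the error rate.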

GenBank August 22, 2005


Public Collections of DNA and RNA Sequence Reach 100 Gigabases

Trends in the cost efficiency of DNA sequencing§

§Shendure, J., Mitra, R., Varma, C., and Church, G.M. (2004) “Advanced sequencing technologies: methods and goals” Nature Reviews Genetics 5:335

454 Life Sciences Corporation

The first commercial, massively parallel, DNA sequencing technology

454 Technology

Cyclic-array sequencing on in vitro amplified DNA molecules

Individual molecules must be amplified to give a detectable sequencing signal

Instead of biological cloning, we amplify individual DNA fragments on solid-state beads using PCR

Instead of terminator-based sequencing, pyrosequencing is used to determine nucleotide order: “sequencing by synthesis”

454 Process Overview

The bottom line …

Efficiency of DNA sequencing increased 100X

Cost per finished base declined 10- to 30-fold

… so what happens next?

The “democratization” of large-scale genomic biology

Many projects are now possible that were once fiscally inviable

We must deal with basic local data management and information issues or lose this opportunity

If you thought bioinformatics was important before

By terminator-based sequencing we @ UF produce 60-70 Mbp per year

By synthesis-based sequencing we produce 60-70 Mbp per day
