Managing Gene Annotation Information the search is over … one problem solved … another begins observations from a foot soldier in the bio-information (r)evolution Bill Farmerie -- ICBR Genomics Group

Page 1

Managing Gene Annotation Information

the search is over … one problem solved … another begins

observations from a foot soldier in the bio-information (r)evolution

Bill Farmerie -- ICBR Genomics Group

Page 2

Interdisciplinary Center for Biotechnology Research
  • Established at the University of Florida in 1987 by the Florida Legislature
  • Centralized organization of biomedical core facilities supporting biotechnology-based research

How did information management become my problem?

Page 3

1998 GSAC Miami Beach

Page 4

Why should I care about this problem?
Because my paycheck depends on it. Avoid fatal failure in the funding loop.

The funding loop:
  1. PI has $ for large gene-based project
  2. Core Lab generates data
  3. Downstream data management & analysis
  4. PI writes papers, gives talks
  5. PI applies for new funding
  6. Other PIs think this looks like a good idea

Page 5

From Sequence to Function
  • The genomic sequence identifies the 'parts'
  • The next trick is understanding gene function
  • Post-genomic era = functional genomics
  • Critical concept: genes of similar sequence may have similar functions
  • Inferring function for a new gene begins with searching for its nearest neighbor (or homolog) of known function

Page 6

BLAST
  • Most common starting point for gene identification
  • Similarity search of sequence repository (GenBank)
  • Output
      - Calculated scores (bit score and e-value)
      - Text string (definition line), ID Reference Tag
      - Sequence alignment
  • Advantages
      - Fast algorithm, very good at finding close homologs
  • Disadvantages
      - Not good at finding distant relatives
  • Cluster and Grid-enabled versions available
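
A note on how the two calculated scores relate. Under the standard Karlin-Altschul statistics, the e-value is roughly E = m × n × 2^(-S'), where S' is the bit score, m the query length, and n the database length. The sketch below is illustrative only: the function name and the example lengths are assumptions, and real BLAST applies effective-length corrections that are ignored here.

```python
def expected_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate e-value from a bit score: E = m * n * 2**(-S').

    Simplified: uses raw lengths, not BLAST's effective lengths.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Hypothetical 300-residue query against a 1-gigabase database.
    for score in (30, 50, 100):
        print(score, expected_hits(score, 300, 1_000_000_000))
```

Each additional bit of score halves the expected number of chance hits, which is why filtering on either value gives similar rankings within one search.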

Page 7

HMMER
  • Developed by Sean Eddy
  • Uses Hidden Markov Models
  • Searches unknown protein query sequence against a database of protein family models
  • Statistical models constructed from alignment of conserved protein regions (Pfam)
  • Advantages
      - Superior to BLAST for discovering more distant homology relations
  • Disadvantages
      - More computationally intensive than BLAST
  • GRID enabled

Page 8

OK! Great!

Sequencing done. Homology searches complete.

But how will I deliver this information to scientists spread all over campus, and their worldwide collaborators?

Page 9

Search for summarizing information that restores sanity

CTGGGTTCTGTTCGGGATCCCAGTCACAGGGACAATGGCGCATTCATATGTCACTTCCTTTACCTGCCTGGA
GAGGTGTGGCCACAGACTCTGGTGGCTGCGAACGGGGACTCTGACCCAGTCGACTTTATCGCCTTGACGAAG
AACCAGATTGACGTTGTCGGAGTCGGAACTCACCTGGTCACCTGTACGACTCAGCCGTCGCTGGGTTGCGTT
CTGACACGCGGCTCCTCGTGTGGAGCCGAAACCCCGACAAAAGCGAAGGAGAGAGTGAGTATGAGCAGGCGG

Page 10

BlastQuest

A small idea with a big mission

Page 11

BlastQuest Requirements
  • Accessible to research groups at remote locations
  • Privacy-constrained sharing of results among the scientists
  • Selective browsing of BLAST homology search results
  • Selective data filtering on statistical criteria
      - e-value or bit score
  • Selective data grouping on criteria such as GI number, or a defined number of top-scoring results
  • Ad hoc search capability on user-determined criteria:
      - text terms
      - boolean logic

From a computational point of view BlastQuest is embarrassingly simple. However, it solved our problem for information storage, selective retrieval, and distribution.
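
The filtering, grouping, and ad hoc search requirements above map naturally onto SQL queries. What follows is a minimal sketch, not the BlastQuest implementation: it swaps in Python's built-in SQLite for MySQL, and the `hits` table, its columns, the function names, and the sample rows are all hypothetical.

```python
import sqlite3

# Made-up sample rows: (query_id, gi, definition, bit_score, e_value)
DEMO_ROWS = [
    ("contig1", 101, "heat shock protein 70 [Homo sapiens]", 250.0, 1e-70),
    ("contig1", 102, "heat shock protein 70 [Mus musculus]", 240.0, 1e-67),
    ("contig1", 103, "hypothetical protein", 35.0, 0.5),
    ("contig2", 201, "actin, cytoplasmic [Danio rerio]", 180.0, 1e-45),
    ("contig2", 202, "hypothetical protein", 40.0, 0.01),
]


def make_demo_db():
    """Build an in-memory database holding the hypothetical hits table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE hits (query_id TEXT, gi INTEGER, definition TEXT,"
        " bit_score REAL, e_value REAL)"
    )
    conn.executemany("INSERT INTO hits VALUES (?, ?, ?, ?, ?)", DEMO_ROWS)
    return conn


def filter_hits(conn, max_e_value, min_bit_score=0.0):
    """Selective filtering on statistical criteria (e-value, bit score)."""
    sql = ("SELECT query_id, gi, e_value FROM hits"
           " WHERE e_value <= ? AND bit_score >= ?"
           " ORDER BY query_id, e_value")
    return conn.execute(sql, (max_e_value, min_bit_score)).fetchall()


def top_hits(conn, n):
    """Grouping: keep the n top-scoring hits for each query sequence."""
    sql = ("SELECT h.query_id, h.gi, h.bit_score FROM hits h"
           " WHERE (SELECT COUNT(*) FROM hits h2"
           "        WHERE h2.query_id = h.query_id"
           "          AND h2.bit_score > h.bit_score) < ?"
           " ORDER BY h.query_id, h.bit_score DESC")
    return conn.execute(sql, (n,)).fetchall()


def text_search(conn, include, exclude):
    """Ad hoc text search with boolean logic: include AND NOT exclude."""
    sql = ("SELECT query_id, definition FROM hits"
           " WHERE definition LIKE ? AND definition NOT LIKE ?"
           " ORDER BY query_id, gi")
    return conn.execute(sql, (f"%{include}%", f"%{exclude}%")).fetchall()


if __name__ == "__main__":
    db = make_demo_db()
    print(filter_hits(db, 1e-10))
    print(top_hits(db, 1))
    print(text_search(db, "protein", "hypothetical"))
```

All user-supplied values are passed as bound parameters rather than pasted into the SQL string, which matters once the interface is exposed to remote users.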

Page 12

Overview of BlastQuest Architecture

[Figure: three-tier architecture diagram. Tier labels: Tier 1, Tier 2, Tier 3. Components: Web Browser (Client Side GUI); Web Server; Client Interface Module; SQL Constructor; JDBC; MySQL DBMS; XML Loader (input: BLAST XML document); ACE Loader (input: Assembly ACE file)]
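
To make the XML Loader step concrete, here is a rough sketch of turning a BLAST XML document into rows ready for a database insert. It is not the BlastQuest loader (which the diagram suggests was Java behind JDBC); the function name, the sample document, and the hit identifier in it are made up. The element names follow NCBI's BLAST XML output format.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written sample in the shape of NCBI BLAST XML output.
SAMPLE_XML = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>contig1</Iteration_query-def>
      <Iteration_hits>
        <Hit>
          <Hit_id>gi|101|ref|XP_000001.1|</Hit_id>
          <Hit_def>heat shock protein 70</Hit_def>
          <Hit_hsps>
            <Hsp>
              <Hsp_bit-score>250.0</Hsp_bit-score>
              <Hsp_evalue>1e-70</Hsp_evalue>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""


def parse_blast_xml(xml_text):
    """Return one (query, hit_id, definition, bit_score, e_value) per hit.

    Only the first HSP listed for each hit is used.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for iteration in root.iter("Iteration"):
        query = iteration.findtext("Iteration_query-def")
        for hit in iteration.iter("Hit"):
            hsp = hit.find("Hit_hsps/Hsp")
            if hsp is None:
                continue  # skip hits that carry no alignment data
            rows.append((
                query,
                hit.findtext("Hit_id"),
                hit.findtext("Hit_def"),
                float(hsp.findtext("Hsp_bit-score")),
                float(hsp.findtext("Hsp_evalue")),
            ))
    return rows


if __name__ == "__main__":
    for row in parse_blast_xml(SAMPLE_XML):
        print(row)
```

Flattening each hit to one row keeps the database schema simple; storing every HSP would need a second table keyed on the hit.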

Page 13

Welcome to BlastQuest

Page 14

Choose among client projects

Page 15

Results Selection

Page 16

Grouped Results

Page 17
Page 18
Page 19

Ad Hoc Text Searching

Page 20

Internal BLAST Searches

Page 21

Viewing a Gene Ontology Tree

Page 22

Viewing a Gene Ontology Tree

Page 23

Viewing a Gene Ontology Tree

Page 24

KEGG Classification
  • Kyoto Encyclopedia of Genes and Genomes
  • “Wiring diagrams of life”
  • KEGG Protein Networks
      - Metabolic pathways
      - Regulatory pathways
      - Molecular complexes
      - Network-network relations
      - Network-environment relations

Page 25

[Figure: Venn diagram. Regions: Unique to non-Unigene; Common to both; Unique to Unigene]

Page 26

Bacterial Genome Annotation Workbench

Another simple idea driven by necessity

Page 27

Start

Page 28

Project Summary

Page 29

Contig Browser

Page 30

Contig summary

Page 31

Physical map linked to annotation

Page 32

Simple problems.
Simple solutions.
Why are these simple ideas important?

Page 33

Human Genome Project
  • HGP drove innovation in biotechnology
  • 2 major technological benefits
      - stimulated development of high-throughput methods
      - reliance on computational tools for data mining and visualization of biological information

Page 34

The HGP and the cost of DNA sequencing
  • “Finished” quality DNA sequence
      - a DNA base call is considered finished if the probability of base call error is less than 1 in 10,000, also known as phred > 40
      - contiguous DNA sequence of phred > 40 is usually achieved by multifold sequencing of the same region; typically 7-10X coverage
  • 1985: $10 per finished base
  • 2001: $1 per 10 finished bases
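
The "1 in 10,000" and "phred 40" figures are the same statement in two notations, since a phred quality score is defined as Q = -10 × log10(P), with P the probability of a base call error. A minimal sketch of the conversion (the function names are illustrative):

```python
import math


def phred_quality(p_error: float) -> float:
    """Phred score from base-call error probability: Q = -10 * log10(P)."""
    return -10.0 * math.log10(p_error)


def error_probability(quality: float) -> float:
    """Inverse conversion: P = 10 ** (-Q / 10)."""
    return 10.0 ** (-quality / 10.0)


if __name__ == "__main__":
    # An error rate of 1 in 10,000 sits exactly at the phred 40 threshold.
    print(phred_quality(1 / 10_000))
    print(error_probability(40))
```

Every 10 points of phred score corresponds to a tenfold drop in error probability, so phred 20 is 1 error in 100 and phred 30 is 1 in 1,000.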

Page 35

GenBank, August 22, 2005

Public Collections of DNA and RNA Sequence Reach 100 Gigabases

Page 36

Trends in the cost efficiency of DNA sequencing§

§Shendure, J., Mitra, R., Varma, C., and Church, G.M. (2004) “Advanced sequencing technologies: Methods and Goals” Nature Reviews Genetics 5:335

Page 37

454 Life Sciences Corporation

The first commercial, massively parallel, DNA sequencing technology

Page 38

454 Technology
  • Cyclic-array sequencing on in vitro amplified DNA molecules
      - individual molecules must be amplified to give a detectable sequencing signal
  • Instead of biological cloning, we amplify individual DNA fragments on solid-state beads using PCR
  • Instead of terminator-based sequencing, pyrosequencing is used to determine nucleotide order
      - “sequencing by synthesis”
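
For readers new to sequencing by synthesis, the idea can be shown with a toy model: nucleotides are flowed over the template one type at a time in a fixed cycle, and each flow yields a light signal roughly proportional to the number of bases incorporated. The sketch below is an idealized, noise-free simplification, not the 454 base-calling software; the function names are hypothetical and the T, A, C, G flow order is an assumption.

```python
from itertools import cycle


def simulate_flowgram(read: str, flow_order: str = "TACG"):
    """Return (nucleotide, signal) per flow for the strand being synthesized.

    Signal is the number of consecutive bases incorporated in that flow
    (the homopolymer run length); 0 means the flowed base did not match.
    """
    unknown = set(read) - set(flow_order)
    if unknown:
        raise ValueError(f"bases not in flow order: {sorted(unknown)}")
    signals = []
    pos = 0
    flows = cycle(flow_order)
    while pos < len(read):
        nucleotide = next(flows)
        run = 0
        while pos < len(read) and read[pos] == nucleotide:
            run += 1
            pos += 1
        signals.append((nucleotide, run))
    return signals


def call_bases(signals) -> str:
    """Rebuild the sequence by repeating each nucleotide by its signal."""
    return "".join(nucleotide * run for nucleotide, run in signals)


if __name__ == "__main__":
    flowgram = simulate_flowgram("TTAGC")
    print(flowgram)
    print(call_bases(flowgram))
```

In real data the signal is an analog value, so long homopolymer runs are the hard case: telling a run of 7 from a run of 8 depends on a small relative difference in light intensity.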

Page 39

454 Process Overview

Page 40

The bottom line …
  • efficiency of DNA sequencing increased 100X
  • cost per finished base declined 10- to 30-fold

… so what happens next?
  • The “democratization” of large-scale genomic biology
  • Many projects are now possible that were once fiscally inviable
  • We must deal with basic local data management and information issues or lose this opportunity

Page 41

If you thought bioinformatics was important before

  • By terminator-based sequencing we @ UF produce 60-70 Mbp per year
  • By synthesis-based sequencing we produce 60-70 Mbp per day
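
A quick back-of-the-envelope check of what that shift means for data volume. This sketch assumes, purely for illustration, that the instrument runs every day of the year; the function names and the 365-day figure are assumptions, not numbers from the slides.

```python
def annual_output_mbp(mbp_per_day: float, run_days_per_year: int = 365) -> float:
    """Yearly output in Mbp, given daily output and the number of run days."""
    return mbp_per_day * run_days_per_year


def fold_increase(new_mbp_per_day: float, old_mbp_per_year: float,
                  run_days_per_year: int = 365) -> float:
    """Ratio of new yearly output to old yearly output."""
    return annual_output_mbp(new_mbp_per_day, run_days_per_year) / old_mbp_per_year


if __name__ == "__main__":
    # Low and high ends of the 60-70 Mbp range quoted above.
    for mbp in (60, 70):
        print(mbp, annual_output_mbp(mbp), fold_increase(mbp, mbp))
```

Moving the same output from a year to a day is a several-hundred-fold jump in raw data, and downstream storage and annotation have to keep pace with it.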