Genome Annotation

Md. Imtiyaz Hassan, Ph.D.


Genome projects:

As of                              6/25/04   7/25/05   7/20/07   6/29/08
Genome projects (total)               1128      1496      2680      3825
Complete                               199       274       579       827
  of which eukaryotes                   28        36        49        94
Prokaryotic genomes in progress        508       728      1285      1932
Eukaryotic genomes in progress         421       494       721       936

Genome sizes, starting from small:
• Nanoarchaeum equitans (archaebacterium): 500 kb
• Bacillus anthracis (anthrax): 5228 kb
• S. cerevisiae (yeast): 12,069 kb
• Arabidopsis thaliana: 115,428 kb
• Drosophila melanogaster (fruit fly): 137,000 kb
• Anopheles gambiae (malaria mosquito): 278,000 kb
• Oryza sativa (rice): 420,000 kb
• Mus musculus (mouse): 2,493,000 kb
• Homo sapiens (human): 2,900,000 kb

http://www.genomesonline.org/

Genome Sequencing


• Annotation is the process of interpreting raw sequence data into useful biological information

• Annotations describe the genome and transform raw genome sequences into biological information by integrating computational analyses, other biological data and biological expertise.

• Old days: one gene studied by one lab = lots of information.

• Now: many genes = superficial and incomplete annotation of many genes.

• Features could be repeats, genes, promoters, protein domains, and so on.

• Features can be linked to other databases, e.g. Pfam/PubMed.

Genome sequencing helps in:
• identifying new genes ("gene discovery")
• looking at chromosome organization and structure
• finding gene regulatory sequences
• comparative genomics

These in turn lead to advances in:
• medicine
• agriculture
• biotechnology
• understanding evolution and other basic science questions

Because of the vast amounts of data that are generated, we need new approaches:
• high-throughput assays
• robotics
• high-speed computing
• statistics
• bioinformatics

[Figure: Functional genomic elements being identified by the ENCODE pilot phase. The ENCODE Project Consortium, Science 306, 636-640 (2004). Published by AAAS.]

Annotation of eukaryotic genomes

[Figure: from gene to function. Genomic DNA → (transcription) → unprocessed RNA → (RNA processing) → mature mRNA with poly-A tail → (translation) → nascent polypeptide → (folding) → active enzyme → function (Reactant A → Product B). Annotation approaches shown alongside: ab initio gene prediction, comparative gene prediction, functional identification.]

How many genes?

• Consortium: 35,000 genes?
• Celera: 30,000 genes?
• Affymetrix: 60,000 human genes on GeneChips?
• Incyte and HGS: over 120,000 genes?
• GenBank: 49,000 unique gene coding sequences?
• UniGene: > 89,000 clusters of unique ESTs?


Current consensus (in flux …)

15,000 known genes (similarity to previously isolated genes and expressed sequences from a large variety of different organisms)

17,000 predicted (GenScan, GeneFinder, GRAIL)

Based on and limited to previous knowledge

The Annotation Process

[Figure: DNA SEQUENCE → ANALYSIS SOFTWARE → Annotator → Useful Information]

A Common Mistake!

[Figure: PROTEIN SEQUENCE → BLAST → Annotator → Function]

Protein Families, Motifs & Domains.

• BLAST and FASTA

• Sequence alignment

• Domains

• Prosite

• Pfam/HMMs

• SignalP/ TMHMM

BLAST

• Local alignment.

• Suggests the presence of a common domain between two proteins.

• However, common domains can be conserved between proteins with very different functions.

• E.g. ATP binding is common to many proteins.
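To make "local alignment" concrete, here is a minimal sketch of Smith-Waterman scoring. It is not BLAST's heuristic, and the scoring values and the toy sequences are assumptions chosen for illustration. Two otherwise unrelated sequences that share one short segment still get a positive local score, which is why a hit alone does not establish a shared function.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best Smith-Waterman local alignment score between a and b.

    The scoring scheme (match/mismatch/gap) is an illustrative assumption;
    real searches use substitution matrices such as BLOSUM62.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Toy sequences (hypothetical): only the segment "GKS" is shared.
print(local_alignment_score("XXGKSYY", "ZZZGKSWW"))
```

The score reflects only the shared segment; the flanking residues, which could carry entirely different functions, do not lower it.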

BLAST/FASTA

• FASTA is a global alignment tool.

• BLAST is local.

• Reduces sensitivity, increases specificity.

[Figure labels: BLAST, FASTA]

Using FASTA

• Global alignment.

• Annotation gained from homology hits is only as good as the annotation you are transferring.

• E.g. there are two different genes called ESAG2 in SWALL.

• Small changes in "your gene" might confer functional differences.

FASTA

[Figure: two example FASTA hits. 10^-5: low-scoring hits can give good alignments. 10^-8: high-scoring hits can give poor alignments.]

The big problem with searching public databases is…

There is a need to reduce the amount of sequences we search and to prevent bad annotation from spreading.

Protein Families, Motifs & Domains.

• Proteins with common functions have some common features.

• Domains and motifs are defined from conserved residues.

• Families can be grouped, profiles and HMMs derived.

• There is more to life than BLAST.

Sequence Alignment

• Sequence alignments allow us to see which residues are important to a family of proteins.

• This lets us build motifs, profiles, fingerprints and HMMs to define families.
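As a rough illustration of how an alignment shows which residues matter, the sketch below scores each column by the fraction of sequences that carry its most common residue. The toy alignment is invented for the example, and real profile and HMM construction is considerably more elaborate.

```python
from collections import Counter


def column_conservation(alignment: list[str]) -> list[float]:
    """Fraction of sequences sharing the most common residue in each column.

    Gap characters ("-") never count as the conserved residue.
    """
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != "-")
        top = counts.most_common(1)[0][1] if counts else 0
        scores.append(top / len(column))
    return scores


# Hypothetical four-sequence alignment of a short glycine-rich segment.
toy_alignment = ["GAGKST", "GSGKST", "GTGKSA", "G-GKTT"]
for position, score in enumerate(column_conservation(toy_alignment), start=1):
    print(position, score)
```

Columns scoring close to 1.0 are candidates for the conserved positions of a motif; variable columns become wildcards.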

Domains

• A domain is a functional part of a protein.

• It may contain amino acid sequence motifs that can be used to identify it.

• A set of more than one motif is known as a fingerprint.

DOMAINS

[Figure: a domain alignment and the resources derived from it: motifs (Prosite), fingerprints, blocks, and HMMs (Pfam).]

Overview Profile DB (1)

Identifying functional motifs and structural domains by comparing sequences against PROSITE, BLOCKS, SMART, Pfam, CDD, ProDom, TrEMBL and InterPro databases.

• Prosite patterns - http://www.expasy.ch/prosite/
• Prosite profiles
• Pfam – database of HMMs for domains and families - http://www.sanger.ac.uk/Software/Pfam/index.shtml
• SMART - http://smart.embl-heidelberg.de/
• Prints
• TIGRFAMs
• BLOCKS

Alignment databases
• ProDom – Protein Domain Database - http://www.toulouse.inra.fr/prodom.html
• PIR-ALN
• ProtoMap
• Domo
• ProClass

Overview Profile DB (2)

• Integrated Pattern Databases:

MetaFam

IProClass

InterPro

CDD – Conserved Domain Database http://www.ncbi.nih.gov/Structure/

CDD Search DART

Prosite
http://us.expasy.org/prosite/

• Maintained at the Swiss Institute of Bioinformatics.

• All motifs are checked for false positives and fine-tuned.

• Sometimes a family can be defined by more than one expression.

• Fingerprints and BLOCKS automatically scan proteins for a number of motifs.

• http://bioinf.man.ac.uk/dbbrowser/PRINTS/
• http://blocks.fhcrc.org/help/
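A Prosite pattern is essentially a regular expression over residues. The sketch below translates the common pattern syntax into a Python regex; it handles only the basic elements (x, [..], {..}, repeat counts and terminal anchors) and is not the official ScanProsite implementation. The example pattern is the widely cited P-loop (ATP/GTP-binding site motif A), and the test sequence is made up.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a Prosite pattern into a Python regular expression.

    Covers common syntax only; anchors inside brackets are not handled.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P}  -> [^P]
        element = element.replace("(", "{").replace(")", "}")   # x(4) -> x{4}
        element = element.replace("x", ".")                     # x    -> any residue
        element = element.replace("<", "^").replace(">", "$")   # terminal anchors
        parts.append(element)
    return "".join(parts)


p_loop = prosite_to_regex("[AG]-x(4)-G-K-[ST]")
hit = re.search(p_loop, "MSGAPGSGKSTLA")  # hypothetical sequence
if hit:
    print(p_loop, hit.start() + 1, hit.group())
```

A short pattern like this matches many unrelated ATP- and GTP-binding proteins, so a hit supports "contains a nucleotide-binding motif" rather than any specific product name.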

Pfam

• Pfam 7.0 contains a total of 3360 families.

• Pfam is a database of two parts:
– Pfam A: curated
– Pfam B: automatically generated

• All HMMs have a seed alignment which is added to using the HMMER package.

Pfam

http://www.sanger.ac.uk/Software/Pfam/
http://pfam.wustl.edu/

Interpro curation

http://www.ebi.ac.uk/interpro/

Gene Ontology

• http://www.geneontology.org/

TMHMM
http://www.cbs.dtu.dk/services/TMHMM/

Transmembrane Domains: Membrane bound proteins

SIGNALP

• What is a signal peptide?

• Any protein that has to be targeted to a specific part of the cell requires a signal peptide.

• The signal peptide ensures that the protein is translated at the ER, where it can enter the secretory pathway.

• I.e., the signal peptide suggests a cellular (or extracellular) location other than the cytoplasm.

Signal Peptides: Secreted/targeted proteins

Using secondary databases for functional assignments

• Better, more detailed, professional annotation.

• More powerful and sensitive search methods: HMMs, profiles, weight matrices.

• Not as good coverage.

Protein Secondary Structure

• CATH (Class, Architecture, Topology, Homology) http://www.biochem.ucl.ac.uk/dbbrowser/cath/

• SCOP (Structural Classification of Proteins) - hierarchical database of protein folds http://scop.mrc-lmb.cam.ac.uk/scop

• FSSP Fold classification using structure-structure alignment of proteins http://www2.ebi.ac.uk/fssp/fssp.html

• TOPS Cartoon representation of topology showing helices and strands http://tops.ebi.ac.uk/tops/

The Gene Prediction Process

[Figure: DNA SEQUENCE → ANALYSIS SOFTWARE (Prosite, TMHMM, Pfam, SignalP, FASTA, BLAST) → Annotator → Functional Assignments]

Slide Break – EMBL Features

More…

• More on gene prediction

• Gene Finding

• Genome Comparison and Further Genome Analysis


• Genome Databases

• The GenBank/EMBL file format

• Editing GenBank/EMBL files with Artemis

• The annotation process

• Common pitfalls

Public Databases

• GenBank, EMBL and DDBJ.

• All databases update each other automatically.

EMBL and TrEMBL

• Patricia Rodriguez-Tomé, Peter J. Stoehr, Graham N. Cameron and Tomas P. Flores, "The European Bioinformatics Institute (EBI) databases", Nucleic Acids Res. 24:(6-13), 1996

• EMBL currently contains 14366182 entries

EMBL File

Contains:

• A header file containing:
– Information about the sequence
– Organism
– Authors
– References
– Comments

• A feature table containing:
– Sequence features and co-ordinates

Header File

ID   PFMAL1P4   standard; DNA; INV; 66441 BP.
XX
AC   AL031747;
XX
SV   AL031747.8
XX
DT   24-SEP-1998 (Rel. 57, Created)
DT   27-APR-2000 (Rel. 63, Last updated, Version 13)
XX
DE   Plasmodium falciparum DNA from MAL1P4
XX
KW   HTG; rifin; telomere; var; var-like hypothetical protein.
XX
OS   Plasmodium falciparum (malaria parasite P. falciparum)
OC   Eukaryota; Alveolata; Apicomplexa; Haemosporida; Plasmodium.
XX
RN   [1]
RA   Oliver K., Bowman S., Churcher C., Harris B., Harris D., Lawson D.,
RA   Quail M., Rajandream M., Barrell B.;
RT   ;
RL   Submitted (24-SEP-1998) to the EMBL/GenBank/DDBJ databases.
RL   P.falciparum Genome Sequencing Consortium, The Sanger Centre, Wellcome
RL   Trust Genome Campus, Hinxton, Cambridge CB10 1S.
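Each header line starts with a two-letter line code (ID, AC, DT, ...), with XX lines as spacers. As a minimal sketch of reading such a header, the function below groups values by line code. It assumes well-formed input like the record above and is not a replacement for a full parser such as those in Biopython or EMBOSS.

```python
from collections import defaultdict


def parse_embl_header(text: str) -> dict[str, list[str]]:
    """Group EMBL header lines by their two-letter line code, skipping XX spacers."""
    fields = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        code, _, value = line.partition(" ")
        if code == "XX":
            continue
        fields[code].append(value.strip())
    return dict(fields)


# A few lines copied from the header shown above.
HEADER = """\
ID   PFMAL1P4   standard; DNA; INV; 66441 BP.
XX
AC   AL031747;
XX
DT   24-SEP-1998 (Rel. 57, Created)
DT   27-APR-2000 (Rel. 63, Last updated, Version 13)
XX
OS   Plasmodium falciparum (malaria parasite P. falciparum)
"""

fields = parse_embl_header(HEADER)
# The sequence length is the last ';'-separated item of the ID line.
length_bp = int(fields["ID"][0].split(";")[-1].split()[0])
print(fields["AC"], fields["OS"], length_bp)
```

Repeated codes such as DT or RA are kept as lists so that multi-line fields stay in their original order.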

EMBL File Feature Table

Anything that can have a coordinate on a DNA sequence.

Feature keys:
attenuator, C_region, CAAT_signal, CDS, conflict, D-loop, D_segment, enhancer, exon, GC_signal, gene, iDNA, intron, J_segment, LTR, mat_peptide, misc_binding, misc_difference, misc_feature, misc_recomb, misc_RNA, misc_signal, misc_structure, modified_base, mRNA, N_region, old_sequence, polyA_signal, polyA_site, precursor_RNA, prim_transcript, primer_bind, promoter, protein_bind, RBS, repeat_region, repeat_unit, rep_origin, rRNA, S_region, satellite, scRNA, sig_peptide, snRNA, snoRNA, source, stem_loop, STS, TATA_signal, terminator, transit_peptide, tRNA, unsure, V_region, V_segment, variation, 3'clip, 3'UTR, 5'clip, 5'UTR, -10_signal, -35_signal

Feature qualifiers

• Additional information about a feature

/allele="text"
/citation=[number]
/codon=(seq:"text",aa:<amino_acid>)
/codon_start=<1
/db_xref="<database>:<identifier>"
/EC_number="text"
/evidence=<evidence_value>
/exception="text"
/function="text"
/gene="text"
/label=feature_label
/map="text"
/note="text"
/number=unquoted
/product="text"
/protein_id="<identifier>"
/pseudo
/standard_name="text"
/translation="text"
/transl_except=(pos:<base_range>,aa:<amino_acid>)
/transl_table
/usedin=accnum:feature_label

Features

Annotation in Artemis

FT   CDS             732..1415
FT                   /db_xref="IPR002038"
FT                   /gene="PfLtest.01"
FT                   /label=PfLtest.01
FT                   /note="PfLtest.01. len=227aa. Asp-rich protein. Predicted
FT                   by Genefinder, Phat and GlimmerM. Similar to Plasmodium
FT                   falciparum hypothetical 132.2 kDa protein TR:O97242
FT                   (EMBL:AL034558) (1114 aa) fasta scores: E(): 7.1e-21,
FT                   44.388% id in 196 aa."
FT                   /product="Asp-rich hypothetical protein"
FT                   /colour=10
FT                   /fasta_file="fasta/sanger_100kb.embl.seq.00001.out"
FT   misc_feature    complement(1855..1871)
FT                   /fasta_file="fasta/TEST100.tab.seq.00105.out"
FT   CDS             3151..4821
FT                   /gene="PfLtest.02"
FT                   /label=PfLtest.02
FT                   /note="PfLtest.02. len=556aa. Predicted by Genefinder,
FT                   Phat and GlimmerM. Unknown hypothetical protein"
FT                   /product="unknown hypothetical protein"
FT                   /colour=8
FT                   /fasta_file="fasta/sanger_100kb.embl.seq.00002.out"
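The two CDS features above quote protein lengths in their /note fields (len=227aa and len=556aa). As a quick consistency check, the sketch below derives the protein length from the coordinates. It assumes a simple start..end location (optionally wrapped in complement()) that includes the stop codon; joins and partial locations are deliberately not handled.

```python
import re


def cds_protein_length(location: str) -> int:
    """Amino-acid count for a simple CDS location that includes the stop codon."""
    match = re.fullmatch(r"(?:complement\()?(\d+)\.\.(\d+)\)?", location)
    if match is None:
        raise ValueError(f"unsupported location: {location!r}")
    start, end = map(int, match.groups())
    n_bases = end - start + 1          # coordinates are 1-based and inclusive
    if n_bases % 3:
        raise ValueError("CDS length is not a multiple of three")
    return n_bases // 3 - 1            # subtract the stop codon


for location in ("732..1415", "3151..4821"):
    print(location, cds_protein_length(location))
```

For 732..1415 this gives (1415 - 732 + 1) / 3 - 1 = 227 and for 3151..4821 it gives 556, in agreement with the lengths recorded in the notes.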

CDS features

• CDS stands for coding sequence and is used to denote genes and pseudogenes.

• These features are automatically translated on submission and the protein added to the protein databases.
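As an illustration of what automatic translation of a CDS involves, here is a minimal sketch using the standard genetic code. The databases themselves apply the code given by /transl_table and handle exceptions such as /transl_except, which this example ignores; the input sequence is invented.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate a coding sequence in frame 1, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X for ambiguous codons
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGATGACTAA"))  # hypothetical four-codon CDS
```

Because the protein entry is generated directly from the CDS coordinates, an error in the gene model propagates straight into the protein databases.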

/note

• Note field contains all the evidence for a gene call, plus anything else:
– Similarity (fasta or blast)
– Domain/motif information (Pfam, TMHMM etc.)
– Unusual features (repeats, aa richness)

/product

• The name of the gene product, e.g. alcohol dehydrogenase.

• Unless there is proof we must qualify:
– Putative
– Possible

• Always be conservative! E.g. putative dehydrogenase, dehydrogenase-like protein.

• Only piece of annotation added to the protein databases.

Naming protocols

• Hypothetical protein: unknown function and no homology.

• Conserved hypothetical protein: unknown function WITH homology.

• Alcohol dehydrogenase like: looks a bit like it, but may not be.

• Putative alcohol dehydrogenase: probably an alcohol dehydrogenase.

• Alcohol dehydrogenase: this has previously been characterised and shown to be alcohol dehydrogenase in this organism.
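The naming ladder above can be written down as a small decision function. This is only a sketch: the evidence labels ("characterised", "strong", "weak") and the function name are hypothetical, and in practice the annotator decides which level the evidence supports.

```python
from typing import Optional


def product_name(function: Optional[str], has_homology: bool,
                 evidence: str = "weak") -> str:
    """Pick a conservative /product name.

    function     -- suggested function, or None if nothing is known
    has_homology -- whether any database homologue was found
    evidence     -- hypothetical label: "characterised", "strong" or "weak"
    """
    if function is None:
        return "conserved hypothetical protein" if has_homology else "hypothetical protein"
    if evidence == "characterised":
        return function                      # shown experimentally in this organism
    if evidence == "strong":
        return f"putative {function}"        # probably correct
    return f"{function}-like protein"        # resembles it, but may not be


print(product_name(None, has_homology=False))
print(product_name(None, has_homology=True))
print(product_name("alcohol dehydrogenase", True, "weak"))
print(product_name("alcohol dehydrogenase", True, "strong"))
print(product_name("alcohol dehydrogenase", True, "characterised"))
```

Defaulting to the weakest label keeps the function conservative: an unqualified name is returned only when the caller explicitly states that the product has been characterised.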

/gene

• The gene name, e.g. ADH1.

• Only transfer a gene name if it is meaningful.

• Never transfer a gene name like PfB0024.

• Is it a gene family? Make sure two genes have the same name.

Transitive Annotation

• AKA annotation catastrophe

• Junk in = Junk out

• Mis-annotations spread through incorrect database submissions.
