Genome Annotation

Md. Imtiyaz Hassan, Ph.D.

Genome Sequencing

Genome projects (http://www.genomesonline.org/):

As of                              6/25/04   7/25/05   7/20/07   6/29/08
Genome projects                    1128      1496      2680      3825
Complete                           199       274       579       827
  (of which eukaryotes)            28        36        49        94
Prokaryotic genomes in progress    508       728       1285      1932
Eukaryotic genomes in progress     421       494       721       936

Genome sizes, smallest first:
• Nanoarchaeum equitans (archaebacterium): 500 kb
• Bacillus anthracis (anthrax): 5228 kb
• S. cerevisiae (yeast): 12,069 kb
• Arabidopsis thaliana: 115,428 kb
• Drosophila melanogaster (fruit fly): 137,000 kb
• Anopheles gambiae (malaria mosquito): 278,000 kb
• Oryza sativa (rice): 420,000 kb
• Mus musculus (mouse): 2,493,000 kb
• Homo sapiens (human): 2,900,000 kb

Genome Annotation

• Annotation is the process of interpreting raw sequence data into useful biological information

• Annotations describe the genome and transform raw genome sequences into biological information by integrating computational analyses, other biological data and biological expertise.

• Old days: one gene studied by one lab = lots of information about that gene.

• Now: many genes = superficial and incomplete information about many genes.

• Features could be repeats, genes, promoters, protein domains and so on.

• Features can be linked to other databases, e.g. Pfam or PubMed.

Genome sequencing helps in:
• identifying new genes (“gene discovery”)
• looking at chromosome organization and structure
• finding gene regulatory sequences
• comparative genomics

These in turn lead to advances in:
• medicine
• agriculture
• biotechnology
• understanding evolution and other basic science questions

Because of the vast amounts of data that are generated, we need new approaches:
• high throughput assays
• robotics
• high speed computing
• statistics
• bioinformatics

[Figure: Functional genomic elements being identified by the ENCODE pilot phase. Source: The ENCODE Project Consortium, Science 306, 636-640 (2004). Published by AAAS.]

Annotation of eukaryotic genomes

[Figure: Genomic DNA → (transcription) → unprocessed RNA → (RNA processing) → mature mRNA with poly-A tail → (translation) → nascent polypeptide → (folding) → active enzyme → function (Reactant A → Product B). Annotation approaches shown alongside: ab initio gene prediction, comparative gene prediction, functional identification.]

How many genes?

• Consortium: 35,000 genes?
• Celera: 30,000 genes?
• Affymetrix: 60,000 human genes on GeneChips?
• Incyte and HGS: over 120,000 genes?
• GenBank: 49,000 unique gene coding sequences?
• UniGene: > 89,000 clusters of unique ESTs?

Current consensus (in flux …)

• 15,000 known genes (similarity to previously isolated genes and expressed sequences from a large variety of different organisms)

• 17,000 predicted (GenScan, GeneFinder, GRAIL)

• Based on and limited to previous knowledge

The Annotation Process

[Diagram: DNA sequence → analysis software → annotator → useful information]

A Common Mistake!

[Diagram: protein sequence → BLAST → function, with the annotator relying on the BLAST result alone]

Protein Families, Motifs & Domains

• BLAST and FASTA

• Sequence alignment

• Domains

• Prosite

• Pfam/HMMs

• SignalP/ TMHMM

BLAST

• Local alignment

• Suggests the presence of a common domain between two proteins.

• However, common domains can be conserved between proteins with very different functions.

• E.g. ATP binding is common to many proteins.
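
To make "local alignment" concrete, here is a minimal Smith-Waterman sketch in Python. BLAST itself uses a faster seed-and-extend heuristic rather than this full dynamic programme, so this is only an illustration of the idea. The scoring values (match +2, mismatch -1, gap -2) and the toy sequences are assumptions chosen for the example.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned a, aligned b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # H[i][j]: best score ending at a[i-1], b[j-1]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment restart anywhere: this is what makes it local.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        score = H[i][j]
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score == diag:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Two otherwise unrelated toy sequences that share one short region.
score, top, bottom = smith_waterman("MMMMGSGKSTWWWW", "PPGSGKSTQQ")
print(score, top, bottom)
```

Only the shared region is reported; the unrelated flanking residues are ignored. That is why a strong local hit says something about a shared domain, not necessarily about the function of the whole protein.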

BLAST/FASTA

• FASTA is a global alignment tool
• BLAST is local

[Figure comparing BLAST and FASTA alignments]

• Reduces sensitivity, increases specificity

Using FASTA

• Global alignment

• Annotation gained from homology hits is only as good as the annotation you are transferring.

• E.g. there are two different genes called ESAG2 in SWALL.

• Small changes in “your gene” might confer functional differences.

FASTA

Example hit at 10^-5: low-scoring hits can give good alignments.

Example hit at 10^-8: high-scoring hits can give poor alignments.

The big problem with searching public databases is…

There is a need to reduce the number of sequences we search and to prevent bad annotation from spreading.

Protein Families, Motifs & Domains

• Proteins with common functions have some common features.

• Domains and motifs from conserved residues.

• Families can be grouped, profiles and HMMs derived.

• There is more to life than BLAST.

Sequence Alignment

• Sequence alignments allow us to see which residues are important to a family of proteins.

• This lets us build motifs, profiles, fingerprints and HMMs to define families.
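
As a rough illustration of how an alignment reveals the important residues, the sketch below scores each column of a small multiple alignment by how often its most common residue appears, then writes fully conserved columns as a simple motif. The four aligned sequences are hypothetical, and real profile or HMM construction (for example in Pfam) is considerably more sophisticated than this.

```python
from collections import Counter


def column_conservation(alignment):
    """For each column, return (most common residue, fraction of sequences with it)."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    result = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != "-")  # ignore gaps
        if not counts:
            result.append(("-", 0.0))
            continue
        residue, n = counts.most_common(1)[0]
        result.append((residue, n / len(column)))
    return result


def conserved_motif(alignment, min_fraction=1.0):
    """Write conserved columns as residues and variable columns as 'x'."""
    return "".join(
        residue if fraction >= min_fraction else "x"
        for residue, fraction in column_conservation(alignment)
    )


# Hypothetical aligned fragments from one protein family.
toy_alignment = ["GSGKST", "GAGKSS", "GPGKTT", "GSGKST"]
print(conserved_motif(toy_alignment))
```

Lowering `min_fraction` trades specificity for sensitivity: more columns count as conserved, so the motif becomes stricter about what it matches.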

Domains

• A domain is a functional part of a protein

• It may contain amino acid sequence motifs that can be used to identify it.

• A set of more than one motif is known as a fingerprint.

Domains

[Diagram relating a domain alignment to motifs (Prosite), fingerprints, Blocks and Pfam (HMMs)]

Overview Profile DB (1)

Identifying functional motifs and structural domains by comparing sequences against PROSITE, BLOCKS, SMART, Pfam, CDD, ProDom, TrEMBL and InterPro:
• Prosite patterns - http://www.expasy.ch/prosite/
• Prosite profiles
• Pfam - database of HMMs for domains and families http://www.sanger.ac.uk/Software/Pfam/index.shtml
• SMART - http://smart.embl-heidelberg.de/
• Prints
• TIGRFAMs
• BLOCKS

Alignment databases:
• ProDom - Protein Domain Database http://www.toulouse.inra.fr/prodom.html
• PIR-ALN
• ProtoMap
• Domo
• ProClass

Overview Profile DB (2)

• Integrated pattern databases:
– MetaFam
– iProClass
– InterPro
– CDD - Conserved Domain Database http://www.ncbi.nih.gov/Structure/
– CDD Search, DART

Prosite

http://us.expasy.org/prosite/

• Maintained at the Swiss Institute of Bioinformatics.

• All motifs are checked for false positives and fine-tuned.

• Sometimes a family can be defined by more than one expression.

• Fingerprints and BLOCKS automatically scan proteins for a number of motifs.

• http://bioinf.man.ac.uk/dbbrowser/PRINTS/
• http://blocks.fhcrc.org/help/
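
A Prosite pattern is essentially a regular expression written in its own notation. The sketch below converts the common parts of that notation (`x` for any residue, `[...]` for allowed residues, `{...}` for forbidden residues, `(n)` or `(n,m)` for repeats, `<` and `>` for the sequence ends) into a Python regex. It deliberately ignores rarer constructs, the function name is made up for this example, and the test sequence is hypothetical. The pattern used is the ATP/GTP-binding site motif A (P-loop), `[AG]-x(4)-G-K-[ST]`.

```python
import re

# One pattern element: optional '<', a core, an optional repeat count, optional '>'.
ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern):
    """Translate a simple Prosite pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, low, high, end = m.groups()
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # any residue except these
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if start else "") + core + repeat + ("$" if end else ""))
    return "".join(parts)


p_loop = prosite_to_regex("[AG]-x(4)-G-K-[ST]")
hit = re.search(p_loop, "MAGPSGSGKSTLLR")   # hypothetical sequence
print(p_loop, hit.start(), hit.group())
```

A match like this only says the motif is present. As the slides on BLAST point out for ATP binding, the same motif occurs in proteins with very different functions.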

Pfam

• Pfam 7.0 contains a total of 3360 families.

• Pfam is a database of two parts:
– Pfam A: curated
– Pfam B: automatically generated

• All HMMs have a seed alignment which is added to using the HMMer package.

Pfam

http://www.sanger.ac.uk/Software/Pfam/
http://pfam.wustl.edu/

Interpro curation

http://www.ebi.ac.uk/interpro/

Gene Ontology

• http://www.geneontology.org/

TMHMM

http://www.cbs.dtu.dk/services/TMHMM/

Transmembrane domains: membrane-bound proteins
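
TMHMM predicts transmembrane helices with a hidden Markov model. The sketch below is not TMHMM: it is the much older and simpler Kyte-Doolittle sliding-window hydropathy scan, shown only to illustrate why transmembrane segments are detectable from sequence at all. The window of 19 residues and the cut-off of 1.6 are the commonly quoted Kyte-Doolittle settings; the function name and the test sequence are hypothetical.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobic_windows(sequence, window=19, threshold=1.6):
    """Return (start, mean hydropathy) for every window above the threshold."""
    hits = []
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        mean = sum(KYTE_DOOLITTLE[aa] for aa in segment) / window
        if mean > threshold:
            hits.append((start, round(mean, 2)))
    return hits


# Hypothetical protein: a 20-residue leucine stretch between charged flanks.
toy_protein = "DDDDDKKKKK" + "L" * 20 + "DDDDDKKKKK"
print(hydrophobic_windows(toy_protein))
```

A run of overlapping windows above the threshold suggests a candidate membrane-spanning segment, which would then need checking with a dedicated predictor.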

SignalP

• What is a signal peptide?

• Any protein that has to be targeted to a specific part of the cell requires a signal peptide.

• The signal peptide ensures that the protein is translated at the ER, where it can enter the secretory pathway.

• I.e. the signal peptide suggests a cellular (or extracellular) location other than the cytoplasm.

Signal peptides: secreted/targeted proteins

Using secondary databases for functional assignments

• Better, more detailed, professional annotation.

• More powerful and sensitive search methods: HMMs, profiles and weight matrices.

• Not as good coverage.

Protein Secondary Structure

• CATH (Class, Architecture, Topology, Homology) http://www.biochem.ucl.ac.uk/dbbrowser/cath/

• SCOP (Structural Classification of Proteins): hierarchical database of protein folds http://scop.mrc-lmb.cam.ac.uk/scop

• FSSP: fold classification using structure-structure alignment of proteins http://www2.ebi.ac.uk/fssp/fssp.html

• TOPS: cartoon representation of topology showing helices and strands http://tops.ebi.ac.uk/tops/

The Gene Prediction Process

[Diagram: DNA sequence → analysis software (BLAST, FASTA, Pfam, Prosite, SignalP, TMHMM) → annotator → functional assignments]

Slide Break – EMBL Features

More…

• More on gene prediction

• Gene Finding

• Genome Comparison and Further Genome Analysis

Genome Annotation

• Genome Databases

• The GenBank/EMBL file format

• Editing GenBank/EMBL files with Artemis

• The annotation process

• Common pitfalls

Public Databases

• GenBank, EMBL and DDBJ.

• All databases update each other automatically

EMBL and TrEMBL

• Patricia Rodriguez-Tomé, Peter J. Stoehr, Graham N. Cameron and Tomas P. Flores, "The European Bioinformatics Institute (EBI) databases", Nucleic Acids Res. 24:6-13, 1996

• EMBL currently contains 14366182 entries

EMBL File

• Contains:
• A header file containing:
– Information about the sequence
– Organism
– Authors
– References
– Comments
• A feature table containing:
– Sequence features and co-ordinates

Header file:

ID   PFMAL1P4 standard; DNA; INV; 66441 BP.
XX
AC   AL031747;
XX
SV   AL031747.8
XX
DT   24-SEP-1998 (Rel. 57, Created)
DT   27-APR-2000 (Rel. 63, Last updated, Version 13)
XX
DE   Plasmodium falciparum DNA from MAL1P4
XX
KW   HTG; rifin; telomere; var; var-like hypothetical protein.
XX
OS   Plasmodium falciparum (malaria parasite P. falciparum)
OC   Eukaryota; Alveolata; Apicomplexa; Haemosporida; Plasmodium.
XX
RN   [1]
RA   Oliver K., Bowman S., Churcher C., Harris B., Harris D., Lawson D.,
RA   Quail M., Rajandream M., Barrell B.;
RT   ;
RL   Submitted (24-SEP-1998) to the EMBL/GenBank/DDBJ databases.
RL   P.falciparum Genome Sequencing Consortium, The Sanger Centre, Wellcome
RL   Trust Genome Campus, Hinxton, Cambridge CB10 1S.
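
Each header line starts with a two-letter line code (ID, AC, DE, OS and so on), and XX lines are just spacers. A minimal sketch of reading such a header into a dictionary follows; the function name is hypothetical, and the parser splits on whitespace rather than fixed columns so that it tolerates collapsed spacing. In practice a library such as Biopython would normally be used instead.

```python
def parse_embl_header(text):
    """Group EMBL header lines by their two-letter line code."""
    fields = {}
    for line in text.splitlines():
        parts = line.split(None, 1)          # line code, then the rest of the line
        if not parts or parts[0] == "XX":    # skip blank and spacer lines
            continue
        code = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        fields.setdefault(code, []).append(value)
    return fields


header = """\
ID   PFMAL1P4 standard; DNA; INV; 66441 BP.
XX
AC   AL031747;
XX
DE   Plasmodium falciparum DNA from MAL1P4
XX
OS   Plasmodium falciparum (malaria parasite P. falciparum)
OC   Eukaryota; Alveolata; Apicomplexa; Haemosporida; Plasmodium.
"""

fields = parse_embl_header(header)
print(fields["AC"][0].rstrip(";"))   # accession number
print(fields["DE"][0])               # description
```

Values are kept as lists because several codes (DT, RA, RL) legitimately repeat within one entry.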

EMBL File Feature Table

Feature keys (anything that can have a coordinate on a DNA sequence):

attenuator, C_region, CAAT_signal, CDS, conflict, D-loop, D_segment, enhancer, exon, GC_signal, gene, iDNA, intron, J_segment, LTR, mat_peptide, misc_binding, misc_difference, misc_feature, misc_recomb, misc_RNA, misc_signal, misc_structure, modified_base, mRNA, N_region, old_sequence, polyA_signal, polyA_site, precursor_RNA, prim_transcript, primer_bind, promoter, protein_bind, RBS, repeat_region, repeat_unit, rep_origin, rRNA, S_region, satellite, scRNA, sig_peptide, snRNA, snoRNA, source, stem_loop, STS, TATA_signal, terminator, transit_peptide, tRNA, unsure, V_region, V_segment, variation, 3'clip, 3'UTR, 5'clip, 5'UTR, -10_signal, -35_signal

Feature qualifiers

• Additional information about a feature

/allele="text"
/citation=[number]
/codon=(seq:"text",aa:<amino_acid>)
/codon_start=<1
/db_xref="<database>:<identifier>"
/EC_number="text"
/evidence=<evidence_value>
/exception="text"
/function="text"
/gene="text"
/label=feature_label
/map="text"
/note="text"
/number=unquoted
/product="text"
/protein_id="<identifier>"
/pseudo
/standard_name="text"
/translation="text"
/transl_except=(pos:<base_range>,aa:<amino_acid>)
/transl_table
/usedin=accnum:feature_label

Features

Annotation in Artemis

FT   CDS             732..1415
FT                   /db_xref="IPR002038"
FT                   /gene="PfLtest.01"
FT                   /label=PfLtest.01
FT                   /note="PfLtest.01. len=227aa. Asp-rich protein. Predicted
FT                   by Genefinder, Phat and GlimmerM. Similar to Plasmodium
FT                   falciparum hypothetical 132.2 kDa protein TR:O97242
FT                   (EMBL:AL034558) (1114 aa) fasta scores: E(): 7.1e-21,
FT                   44.388% id in 196 aa."
FT                   /product="Asp-rich hypothetical protein"
FT                   /colour=10
FT                   /fasta_file="fasta/sanger_100kb.embl.seq.00001.out"
FT   misc_feature    complement(1855..1871)
FT                   /fasta_file="fasta/TEST100.tab.seq.00105.out"
FT   CDS             3151..4821
FT                   /gene="PfLtest.02"
FT                   /label=PfLtest.02
FT                   /note="PfLtest.02. len=556aa. Predicted by Genefinder,
FT                   Phat and GlimmerM. Unknown hypothetical protein"
FT                   /product="unknown hypothetical protein"
FT                   /colour=8
FT                   /fasta_file="fasta/sanger_100kb.embl.seq.00002.out"
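
The feature table has a regular layout: the feature key starts in column 6, and the location and qualifiers start in column 22. A minimal sketch of reading it follows, assuming that standard column layout is intact. The function name is hypothetical, repeated qualifiers overwrite one another, and a continuation line that happens to start with "/" would be misread, so this is an illustration rather than a robust parser.

```python
def parse_feature_table(text):
    """Parse EMBL FT lines into a list of {key, location, qualifiers} dicts."""
    features = []
    current = None                               # qualifier being continued
    for line in text.splitlines():
        if not line.startswith("FT") or len(line) <= 5:
            continue
        if line[5] != " ":                       # key in column 6: a new feature
            key, location = line[5:].split(None, 1)
            features.append({"key": key, "location": location.strip(), "qualifiers": {}})
            current = None
            continue
        content = line[21:].strip()              # qualifier text starts in column 22
        if not features or not content:
            continue
        qualifiers = features[-1]["qualifiers"]
        if content.startswith("/"):
            name, _, value = content[1:].partition("=")
            qualifiers[name] = value
            current = name
        elif current is not None:                # wrapped qualifier value
            qualifiers[current] += " " + content
    for feature in features:
        feature["qualifiers"] = {k: v.strip('"') for k, v in feature["qualifiers"].items()}
    return features


PAD = "FT" + " " * 19                            # prefix of a qualifier line
SAMPLE = "\n".join([
    "FT   " + "CDS".ljust(16) + "732..1415",
    PAD + '/gene="PfLtest.01"',
    PAD + '/note="PfLtest.01. len=227aa. Asp-rich protein. Predicted',
    PAD + 'by Genefinder, Phat and GlimmerM."',
    PAD + "/colour=10",
    "FT   " + "misc_feature".ljust(16) + "complement(1855..1871)",
])

for feature in parse_feature_table(SAMPLE):
    print(feature["key"], feature["location"], feature["qualifiers"].get("gene"))
```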

CDS features

• CDS stands for coding sequence and is used to denote genes and pseudogenes.

• These features are automatically translated on submission and the protein added to the protein databases.
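
The translation step can be sketched in a few lines. This assumes the standard genetic code and a CDS on the forward strand with no introns; the function names are hypothetical, and the databases' own pipelines handle many more cases (other translation tables, `/transl_except`, joined locations). The coordinate check uses the two CDS features from the Artemis example: 732..1415 spans 684 bases, i.e. 228 codons, which is 227 amino acids plus the stop codon, matching the "len=227aa" in its note.

```python
from itertools import product

# Standard genetic code, written in the usual T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a coding sequence, dropping a terminal stop codon if present."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of three")
    protein = "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds), 3))
    return protein[:-1] if protein.endswith("*") else protein


def protein_length(start, end):
    """Amino acids encoded by a simple start..end CDS that ends in a stop codon."""
    return (end - start + 1) // 3 - 1


print(translate("ATGGCCAAATAA"))
print(protein_length(732, 1415), protein_length(3151, 4821))
```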

/note

• Note field contains all the evidence for a gene call, plus anything else:
– Similarity (FASTA or BLAST)
– Domain/motif information (Pfam, TMHMM etc.)
– Unusual features (repeats, amino acid richness)

/product

• The name of the gene product, e.g. alcohol dehydrogenase

• Unless there is proof we must qualify:
– Putative
– Possible

• Always be conservative! E.g. putative dehydrogenase, dehydrogenase-like protein

• Only piece of annotation added to the protein databases.

Naming protocols

• Hypothetical protein: unknown function and no homology

• Conserved hypothetical protein: unknown function WITH homology

• Alcohol dehydrogenase-like: looks a bit like it, but may not be.

• Putative alcohol dehydrogenase: probably an alcohol dehydrogenase

• Alcohol dehydrogenase: this has previously been characterised and shown to be alcohol dehydrogenase in this organism.
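
The naming rules above amount to a small lookup from strength of evidence to wording. The sketch below encodes them. The `Evidence` category names and the function are invented for this example, and deciding which category a gene falls into is still the annotator's judgment, not something this code does.

```python
from enum import Enum, auto


class Evidence(Enum):
    NO_HOMOLOGY = auto()        # unknown function, nothing similar found
    HOMOLOGY_ONLY = auto()      # similar to other genes of unknown function
    WEAK_SIMILARITY = auto()    # looks a bit like a known protein
    STRONG_SIMILARITY = auto()  # probably the same function
    CHARACTERISED = auto()      # shown experimentally in this organism


def product_name(evidence, function=""):
    """Return a conservatively worded /product value for a gene call."""
    if evidence is Evidence.NO_HOMOLOGY:
        return "hypothetical protein"
    if evidence is Evidence.HOMOLOGY_ONLY:
        return "conserved hypothetical protein"
    if not function:
        raise ValueError("a function name is needed for this evidence level")
    if evidence is Evidence.WEAK_SIMILARITY:
        return f"{function}-like protein"
    if evidence is Evidence.STRONG_SIMILARITY:
        return f"putative {function}"
    return function


print(product_name(Evidence.STRONG_SIMILARITY, "alcohol dehydrogenase"))
```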

/gene

• The gene name, e.g. ADH1

• Only transfer a gene name if it is meaningful

• Never transfer a gene name like PfB0024.

• Is it a gene family? make sure two genes have the same name.

Transitive Annotation

• AKA annotation catastrophe

• Junk in = junk out

• Mis-annotations spread through incorrect database submissions.
