Genome Annotation
Md. Imtiyaz Hassan, Ph.D.
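Genome projects over time:
• As of 6/25/04: 1128 genome projects; 199 complete (includes 28 eukaryotes); 508 prokaryotic genomes in progress; 421 eukaryotic genomes in progress
• As of 7/25/05: 1496 genome projects; 274 complete (includes 36 eukaryotes); 728 prokaryotic genomes in progress; 494 eukaryotic genomes in progress
• As of 7/20/07: 2680 genome projects; 579 complete (includes 49 eukaryotes); 1285 prokaryotic genomes in progress; 721 eukaryotic genomes in progress
• As of 6/29/08: 3825 genome projects; 827 complete (includes 94 eukaryotes); 1932 prokaryotic genomes in progress; 936 eukaryotic genomes in progress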
Genome sizes, starting from the small end:
• Nanoarchaeum equitans (archaebacterium): 500 kb
• Bacillus anthracis (anthrax): 5228 kb
• S. cerevisiae (yeast): 12,069 kb
• Arabidopsis thaliana: 115,428 kb
• Drosophila melanogaster (fruit fly): 137,000 kb
• Anopheles gambiae (malaria mosquito): 278,000 kb
• Oryza sativa (rice): 420,000 kb
• Mus musculus (mouse): 2,493,000 kb
• Homo sapiens (human): 2,900,000 kb
http://www.genomesonline.org/
Genome Sequencing
Genome Annotation
• Annotation is the process of interpreting raw sequence data into useful biological information
• Annotations describe the genome and transform raw genome sequences into biological information by integrating computational analyses, other biological data and biological expertise.
• Old days: one gene studied by one lab = lots of information
• Now: many genes = superficial and incomplete information about many genes
• Features could be repeats, genes, promoters, protein domains...
• Features can be linked to other databases, e.g. Pfam/PubMed
Genome sequencing helps in:
• identifying new genes ("gene discovery")
• looking at chromosome organization and structure
• finding gene regulatory sequences
• comparative genomics

These in turn lead to advances in:
• medicine
• agriculture
• biotechnology
• understanding evolution and other basic science questions

Because of the vast amounts of data that are generated, we need new approaches:
• high throughput assays
• robotics
• high speed computing
• statistics
• bioinformatics
Figure: Functional genomic elements being identified by the ENCODE pilot phase. The ENCODE Project Consortium, Science 306, 636-640 (2004). Published by AAAS.
Annotation of eukaryotic genomes
Diagram: Genomic DNA → (transcription) → unprocessed RNA → (RNA processing) → mature mRNA with poly-A tail (AAAAAAA) → (translation) → nascent polypeptide → (folding) → active enzyme → function (Reactant A → Product B).
Annotation approaches shown alongside the diagram:
• ab initio gene prediction
• Comparative gene prediction
• Functional identification
How many genes?
• Consortium: 35,000 genes?
• Celera: 30,000 genes?
• Affymetrix: 60,000 human genes on GeneChips?
• Incyte and HGS: over 120,000 genes?
• GenBank: 49,000 unique gene coding sequences?
• UniGene: > 89,000 clusters of unique ESTs?
Current consensus (in flux …)
• 15,000 known genes (similarity to previously isolated genes and expressed sequences from a large variety of different organisms)
• 17,000 predicted (GenScan, GeneFinder, GRAIL)
• Based on and limited to previous knowledge
The Annotation Process
Diagram: DNA sequence → analysis software → annotator → useful information.
A Common Mistake!
Diagram: protein sequence → BLAST → annotator → function.
Protein Families, Motifs & Domains
• BLAST and FASTA
• Sequence alignment
• Domains
• Prosite
• Pfam/HMMs
• SignalP/ TMHMM
BLAST
• Local alignment
• Suggests the presence of a common domain between two proteins.
• However, common domains can be conserved between proteins with very different functions
• E.g. ATP binding is common to many proteins
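To make "local alignment" concrete, here is a minimal sketch of the Smith-Waterman dynamic-programming algorithm, the exact method that tools such as BLAST approximate with faster heuristics. It is not how BLAST itself is implemented. The scoring values (match +2, mismatch -1, gap -2) and the two toy sequences are assumptions chosen for illustration; real protein searches use substitution matrices and affine gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for the best local alignment.

    Scoring parameters are illustrative assumptions, not BLAST defaults.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; a cell never drops below 0, which is what makes
    # the alignment local rather than global.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Two made-up sequences that share only a short ATP-binding-like stretch.
result = smith_waterman("MKVLAAGSGKSTLAR", "TTPGSGKSTQQ")
print(result)
```

The two toy sequences are unrelated outside the shared stretch, yet the local alignment still reports a hit. That is exactly why a shared domain alone does not establish a shared function.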
BLAST/FASTA
• FASTA is a global alignment tool
• BLAST is local
Diagram labels: BLAST, FASTA
• Reduces sensitivity, increases specificity
Using FASTA
• Global alignment
• Annotation gained from homology hits is only as good as the annotation you are transferring.
• E.g. there are two different genes called ESAG2 in SWALL.
• Small changes in "your gene" might confer functional differences.
FASTA
• 10^-5: low scoring hits can give good alignments
• 10^-8: high scoring hits can give poor alignments
The big problem with searching public databases is…
There is a need to reduce the number of sequences we search and to prevent bad annotation from spreading.
Protein Families, Motifs & Domains
• Proteins with common functions have some common features.
• Domains and motifs from conserved residues.
• Families can be grouped, profiles and HMMs derived.
• There is more to life than BLAST
Sequence Alignment
• Sequence alignments allow us to see which residues are important to a family of proteins.
• This lets us build motifs, profiles, fingerprints and HMMs to define families.
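As a rough illustration of how an alignment reveals important residues, the sketch below scores each column of a small multiple alignment by the fraction of sequences that share the most common residue. The three aligned sequences are invented toy data, and the function names and the 100% threshold are assumptions. Real profile and HMM construction weights sequences and uses substitution statistics rather than raw counts.

```python
from collections import Counter


def column_conservation(alignment):
    """For each column, return (most common residue, fraction of sequences with it)."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    summary = []
    for column in zip(*alignment):
        residue, count = Counter(column).most_common(1)[0]
        summary.append((residue, count / len(column)))
    return summary


def conserved_positions(alignment, threshold=1.0):
    """Return 0-based column indices whose top residue reaches the threshold.

    Columns dominated by gap characters ('-') are ignored.
    """
    return [
        index
        for index, (residue, fraction) in enumerate(column_conservation(alignment))
        if residue != "-" and fraction >= threshold
    ]


# Hypothetical aligned fragments from three family members.
toy_alignment = ["GSGKST", "GAGKSS", "GTGKTT"]
print(conserved_positions(toy_alignment))
```

Columns that stay invariant across the family are the natural candidates for a motif; lowering `threshold` admits positions that are only mostly conserved.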
Domains
• A domain is a functional part of a protein
• It may contain amino acid sequence motifs that can be used to identify it.
• More than one motif is known as a fingerprint
DOMAINS
Diagram: a domain alignment feeding the different secondary databases: Fingerprints, Blocks, Prosite motifs, Pfam (HMMs).
Overview Profile DB (1)
Identifying functional motifs and structural domains by comparing sequences against PROSITE, BLOCKS, SMART, Pfam, CDD databases, ProDom, TrEMBL, InterPro
• Prosite patterns - http://www.expasy.ch/prosite/
• Prosite profiles
• Pfam – database of HMMs for domains and families http://www.sanger.ac.uk/Software/Pfam/index.shtml
• SMART - http://smart.embl-heidelberg.de/
• Prints
• TIGRFAMs
• BLOCKS

Alignment databases
• ProDom – Protein Domain Database http://www.toulouse.inra.fr/prodom.html
• PIR-ALN
• ProtoMap
• Domo
• ProClass
Overview Profile DB (2)
• Integrated Pattern Databases:
– MetaFam
– iProClass
– InterPro
– CDD – Conserved Domain Database http://www.ncbi.nih.gov/Structure/ (CDD Search, DART)
Prosite
http://us.expasy.org/prosite/
• Maintained at the Swiss Institute of Bioinformatics.
• All motifs are checked for false positives and fine-tuned.
• Sometimes a family can be defined by more than one expression.
• Fingerprints and BLOCKS automatically scan proteins for a number of motifs.
• http://bioinf.man.ac.uk/dbbrowser/PRINTS/
• http://blocks.fhcrc.org/help/
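A Prosite pattern is essentially a restricted regular expression, so a small translator shows what "scanning a protein for a motif" involves. The sketch below handles the core of the pattern syntax (`x` for any residue, `[..]` for allowed residues, `{..}` for forbidden residues, `(n)` or `(n,m)` for repeats, `<` and `>` for the sequence ends). The function name and test sequences are assumptions, and the two patterns are quoted from memory as the commonly cited N-glycosylation and ATP/GTP-binding P-loop motifs; check them against the current Prosite release before relying on them.

```python
import re


def prosite_to_regex(pattern):
    """Translate a Prosite-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):       # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):         # anchored at the C-terminus
            suffix, element = "$", element[:-1]

        # Separate the residue specification from an optional repeat count.
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()

        if core == "x":                   # any residue
            core = "."
        elif core.startswith("{"):        # forbidden residues
            core = "[^" + core[1:-1] + "]"
        # "[ST]" and single letters are already valid regex fragments.

        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


n_glyc = prosite_to_regex("N-{P}-[ST]-{P}.")
p_loop = prosite_to_regex("[AG]-x(4)-G-K-[ST].")
print(n_glyc, p_loop)
print(re.search(p_loop, "MAGESGSGKSTLL").group())
```

Short patterns like these match by chance in many unrelated proteins, which is why the slides stress that motifs are checked for false positives.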
Pfam
• Pfam 7.0 contains a total of 3360 families.
• Pfam is a database of two parts:
– Pfam A: curated
– Pfam B: automatically generated
• All HMMs have a seed alignment which is added to using the HMMER package.
Pfam
http://www.sanger.ac.uk/Software/Pfam/
http://pfam.wustl.edu/
Interpro curation
http://www.ebi.ac.uk/interpro/
Gene Ontology
• http://www.geneontology.org/
TMHMM
http://www.cbs.dtu.dk/services/TMHMM/
Transmembrane Domains: Membrane bound proteins
SIGNALP
• What is a signal peptide?
• Any protein that has to be targeted to a specific part of the cell requires a signal peptide.
• The signal peptide ensures that the protein is translated at the ER where it can enter the secretory pathway.
• I.e. the signal peptide suggests a cellular (or extracellular) location other than the cytoplasm.
Signal Peptides: Secreted/targeted proteins
Using secondary databases for functional assignments
• Better, more detailed, professional annotation.
• More powerful and sensitive search methods: HMMs, profiles, weight matrices.
• Not as good coverage.
Protein Secondary Structure
• CATH (Class, Architecture, Topology, Homology) http://www.biochem.ucl.ac.uk/dbbrowser/cath/
• SCOP (structural classification of proteins): hierarchical database of protein folds http://scop.mrc-lmb.cam.ac.uk/scop
• FSSP: fold classification using structure-structure alignment of proteins http://www2.ebi.ac.uk/fssp/fssp.html
• TOPS: cartoon representation of topology showing helices and strands http://tops.ebi.ac.uk/tops/
The Gene Prediction Process
Diagram: DNA sequence → analysis software (Prosite, TMHMM, Pfam, SignalP, FASTA, BLAST) → annotator → functional assignments.
Slide Break – EMBL Features
More…
• More on gene prediction
• Gene Finding
• Genome Comparison and Further Genome Analysis
Genome Annotation
• Genome Databases
• The GenBank/EMBL file format
• Editing GenBank/EMBL files with Artemis
• The annotation process
• Common pitfalls
Public Databases
• GenBank, EMBL and DDBJ.
• All databases update each other automatically
EMBL and TrEMBL
• Patricia Rodriguez-Tomé, Peter J. Stoehr, Graham N. Cameron and Tomas P. Flores, "The European Bioinformatics Institute (EBI) databases", Nucleic Acids Res. 24:6-13, 1996
• EMBL currently contains 14366182 entries
EMBL File
Contains:
• A header containing:
– Information about the sequence
– Organism
– Authors
– References
– Comments
• A feature table containing:
– Sequence features and co-ordinates
ID PFMAL1P4 standard; DNA; INV; 66441 BP.
XX
AC AL031747;
XX
SV AL031747.8
XX
DT 24-SEP-1998 (Rel. 57, Created)
DT 27-APR-2000 (Rel. 63, Last updated, Version 13)
XX
DE Plasmodium falciparum DNA from MAL1P4
XX
KW HTG; rifin; telomere; var; var-like hypothetical protein.
XX
OS Plasmodium falciparum (malaria parasite P. falciparum)
OC Eukaryota; Alveolata; Apicomplexa; Haemosporida; Plasmodium.
XX
RN [1]
RA Oliver K., Bowman S., Churcher C., Harris B., Harris D., Lawson D.,
RA Quail M., Rajandream M., Barrell B.;
RT ;
RL Submitted (24-SEP-1998) to the EMBL/GenBank/DDBJ databases.
RL P.falciparum Genome Sequencing Consortium, The Sanger Centre, Wellcome
RL Trust Genome Campus, Hinxton, Cambridge CB10 1S.
Header File
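Because every header line starts with a two-letter line code (ID, AC, DE, OS, ...) and `XX` lines are only spacers, the header can be read with a few lines of code. The sketch below is a minimal reader under that assumption; the function name is hypothetical, the sample is a shortened copy of the record above, and production work would normally use an established EMBL parser rather than this.

```python
def parse_embl_header(text):
    """Group EMBL header lines by their two-letter line code.

    Returns a dict mapping each code (e.g. 'AC') to the list of values
    found for it, in file order. 'XX' spacer lines are skipped.
    """
    fields = {}
    for line in text.splitlines():
        parts = line.split(None, 1)   # line code, then the rest of the line
        if not parts or parts[0] == "XX":
            continue
        code = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        fields.setdefault(code, []).append(value)
    return fields


# Shortened copy of the header shown above.
sample = """\
ID   PFMAL1P4 standard; DNA; INV; 66441 BP.
XX
AC   AL031747;
XX
DT   24-SEP-1998 (Rel. 57, Created)
DT   27-APR-2000 (Rel. 63, Last updated, Version 13)
XX
DE   Plasmodium falciparum DNA from MAL1P4
XX
OS   Plasmodium falciparum (malaria parasite P. falciparum)
"""

header = parse_embl_header(sample)
print(header["AC"], header["DE"], len(header["DT"]))
```

Multi-line fields such as DT, RA or RL come back as lists, so joining or further splitting them is left to the caller.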
EMBL File Feature Table
A feature can be anything that can have a coordinate on a DNA sequence. Feature keys:
attenuator, C_region, CAAT_signal, CDS, conflict, D-loop, D_segment, enhancer, exon, GC_signal, gene, iDNA, intron, J_segment, LTR, mat_peptide, misc_binding, misc_difference, misc_feature, misc_recomb, misc_RNA, misc_signal, misc_structure, modified_base, mRNA, N_region, old_sequence, polyA_signal, polyA_site, precursor_RNA, prim_transcript, primer_bind, promoter, protein_bind, RBS, repeat_region, repeat_unit, rep_origin, rRNA, S_region, satellite, scRNA, sig_peptide, snRNA, snoRNA, source, stem_loop, STS, TATA_signal, terminator, transit_peptide, tRNA, unsure, V_region, V_segment, variation, 3'clip, 3'UTR, 5'clip, 5'UTR, -10_signal, -35_signal
Feature qualifiers
• Additional information about a feature
/allele="text"
/citation=[number]
/codon=(seq:"text",aa:<amino_acid>)
/codon_start=<1
/db_xref="<database>:<identifier>"
/EC_number="text"
/evidence=<evidence_value>
/exception="text"
/function="text"
/gene="text"
/label=feature_label
/map="text"
/note="text"
/number=unquoted
/product="text"
/protein_id="<identifier>"
/pseudo
/standard_name="text"
/translation="text"
/transl_except=(pos:<base_range>,aa:<amino_acid>)
/transl_table
/usedin=accnum:feature_label
Features
Annotation in Artemis
FT   CDS             732..1415
FT                   /db_xref="IPR002038"
FT                   /gene="PfLtest.01"
FT                   /label=PfLtest.01
FT                   /note="PfLtest.01. len=227aa. Asp-rich protein. Predicted
FT                   by Genefinder, Phat and GlimmerM. Similar to Plasmodium
FT                   falciparum hypothetical 132.2 kDa protein TR:O97242
FT                   (EMBL:AL034558) (1114 aa) fasta scores: E(): 7.1e-21,
FT                   44.388% id in 196 aa."
FT                   /product="Asp-rich hypothetical protein"
FT                   /colour=10
FT                   /fasta_file="fasta/sanger_100kb.embl.seq.00001.out"
FT   misc_feature    complement(1855..1871)
FT                   /fasta_file="fasta/TEST100.tab.seq.00105.out"
FT   CDS             3151..4821
FT                   /gene="PfLtest.02"
FT                   /label=PfLtest.02
FT                   /note="PfLtest.02. len=556aa. Predicted by Genefinder,
FT                   Phat and GlimmerM. Unknown hypothetical protein"
FT                   /product="unknown hypothetical protein"
FT                   /colour=8
FT                   /fasta_file="fasta/sanger_100kb.embl.seq.00002.out"
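The feature table has a regular shape: an `FT` line with a feature key and a location, followed by `FT` lines that each start a `/qualifier` or continue a quoted value. The sketch below reads that shape into a list of dictionaries. It is a simplified illustration, not the parser Artemis uses: the function name is hypothetical, the sample is an abbreviated copy of the entry above, escaped quotes inside values are not handled, and a repeated qualifier name overwrites the earlier value.

```python
def parse_feature_table(text):
    """Parse EMBL 'FT' lines into [{'key', 'location', 'qualifiers'}, ...]."""
    features = []
    open_name = None   # qualifier whose quoted value continues on the next line
    for line in text.splitlines():
        if not line.startswith("FT"):
            continue
        rest = line[2:].strip()
        if not rest:
            continue
        if open_name is not None:
            # Continuation of a multi-line quoted value.
            qualifiers = features[-1]["qualifiers"]
            qualifiers[open_name] += " " + rest
            if rest.endswith('"'):
                qualifiers[open_name] = qualifiers[open_name].strip('"')
                open_name = None
        elif rest.startswith("/"):
            name, _, value = rest[1:].partition("=")
            closed = len(value) > 1 and value.endswith('"')
            if value.startswith('"') and not closed:
                features[-1]["qualifiers"][name] = value
                open_name = name
            else:
                features[-1]["qualifiers"][name] = value.strip('"')
        else:
            key, location = rest.split(None, 1)
            features.append({"key": key, "location": location, "qualifiers": {}})
    return features


# Abbreviated copy of the Artemis example above.
sample_ft = """\
FT   CDS             732..1415
FT                   /gene="PfLtest.01"
FT                   /note="PfLtest.01. len=227aa. Asp-rich protein. Predicted
FT                   by Genefinder, Phat and GlimmerM."
FT                   /colour=10
FT   misc_feature    complement(1855..1871)
FT   CDS             3151..4821
FT                   /gene="PfLtest.02"
"""

features = parse_feature_table(sample_ft)
for feature in features:
    print(feature["key"], feature["location"], feature["qualifiers"].get("gene"))
```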
CDS features
• CDS stands for coding sequence and is used to denote genes and pseudogenes.
• These features are automatically translated on submission and the protein added to the protein databases.
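The automatic translation mentioned above can be sketched in a few lines, which also gives a handy sanity check on CDS coordinates. The example assumes the standard genetic code (translation table 1), a forward-strand CDS with no introns, and a terminal stop codon included in the coordinates; the function names are hypothetical. Under those assumptions the two CDS features in the Artemis example, 732..1415 and 3151..4821, should encode 227 and 556 residues, matching the `len=` values in their notes.

```python
from itertools import product

# Standard genetic code (table 1), codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(dna):
    """Translate a coding sequence; a single trailing stop codon is dropped."""
    dna = dna.upper()
    if len(dna) % 3 != 0:
        raise ValueError("CDS length is not a multiple of three")
    protein = "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")   # 'X' marks ambiguous codons
        for i in range(0, len(dna), 3)
    )
    return protein[:-1] if protein.endswith("*") else protein


def cds_protein_length(start, end):
    """Protein length implied by 1-based inclusive CDS coordinates."""
    nucleotides = end - start + 1
    if nucleotides % 3 != 0:
        raise ValueError("coordinates do not span whole codons")
    return nucleotides // 3 - 1               # minus the stop codon


print(translate_cds("ATGGCTTAA"))
print(cds_protein_length(732, 1415), cds_protein_length(3151, 4821))
```

A mismatch between the coordinate-derived length and the stated protein length is a cheap early warning that a gene model or its note is out of date.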
/note
• The note field contains all the evidence for a gene call, plus anything else:
– Similarity (FASTA or BLAST)
– Domain/motif information (Pfam, TMHMM etc.)
– Unusual features (repeats, amino acid richness)
/product
• The name of the gene product, e.g. alcohol dehydrogenase
• Unless there is proof we must qualify:
– Putative
– Possible
• Always be conservative! E.g. "putative dehydrogenase" or "dehydrogenase-like protein"
• This is the only piece of annotation added to the protein databases.
Naming protocols
• Hypothetical protein: unknown function and no homology
• Conserved hypothetical protein: unknown function WITH homology
• Alcohol dehydrogenase-like: looks a bit like it, but may not be
• Putative alcohol dehydrogenase: probably an alcohol dehydrogenase
• Alcohol dehydrogenase: this has previously been characterised and shown to be alcohol dehydrogenase in this organism
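The naming ladder above can be written down as a small decision function, which helps keep product names consistent across a genome. The evidence categories used here ("experimental", "strong", "weak") and the function name are assumptions introduced for illustration; the slides do not define thresholds, so deciding which category a hit falls into remains the annotator's judgement.

```python
def suggest_product_name(function_name=None, has_homology=False, evidence=None):
    """Suggest a conservative /product name.

    function_name: name of the best-matching characterised protein, if any.
    has_homology:  True if the protein has database hits at all.
    evidence:      'experimental' (characterised in this organism),
                   'strong' or 'weak' similarity; hypothetical categories.
    """
    if not has_homology:
        return "hypothetical protein"
    if function_name is None:
        return "conserved hypothetical protein"
    if evidence == "experimental":
        return function_name
    if evidence == "strong":
        return "putative " + function_name
    # Anything weaker only justifies the most cautious label.
    return function_name + "-like protein"


print(suggest_product_name())
print(suggest_product_name(has_homology=True))
print(suggest_product_name("alcohol dehydrogenase", True, "weak"))
print(suggest_product_name("alcohol dehydrogenase", True, "strong"))
print(suggest_product_name("alcohol dehydrogenase", True, "experimental"))
```

The default branch falls through to the weakest claim, so missing or unrecognised evidence never produces an unqualified name.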
/gene
• The gene name, e.g. ADH1
• Only transfer a gene name if it is meaningful
• Never transfer a gene name like PfB0024.
• Is it a gene family? make sure two genes have the same name.
Transitive Annotation
• AKA annotation catastrophe
• Junk in = Junk out
• Misannotations spread through incorrect database submissions.