I529: Lab5 02/20/2009
AI : Kwangmin Choi
Today’s topics
• Gene Ontology prediction/mapping
  – AmiGO
    • http://amigo.geneontology.org/cgi-bin/amigo/go.cgi
  – PFP
    • http://dragon.bio.purdue.edu/pfp/
  – GOtcha
    • http://www.compbio.dundee.ac.uk/gotcha/
• Pathway prediction/mapping
  – KAAS
    • http://www.genome.jp/kegg/kaas
Gene Ontology
• The GO project has developed three structured controlled vocabularies (ontologies) that describe gene products, in a species-independent manner, in terms of their associated biological processes, molecular functions, and cellular components.
GO: biological process
• A biological process is a series of events accomplished by one or more ordered assemblies of molecular functions.
  – E.g. cellular physiological process or signal transduction.
  – E.g. pyrimidine metabolic process or alpha-glucoside transport.
• It can be difficult to distinguish between a biological process and a molecular function, but the general rule is that a process must have more than one distinct step.
• A biological process is not equivalent to a pathway; at present, GO does not try to represent the dynamics or dependencies that would be required to fully describe a pathway.
GO: molecular functions
• Molecular function describes activities, such as catalytic or binding activities, that occur at the molecular level.
• GO molecular function terms represent activities rather than the entities (molecules or complexes) that perform the actions.
• GO molecular function terms do not specify where or when, or in what context, the action takes place.
  – E.g. (general) catalytic activity, transporter activity, or binding.
  – E.g. (specific) adenylate cyclase activity or Toll receptor binding.
GO: cellular components
• A cellular component is just that, a component of a cell, but with the proviso that it is part of some larger object.
• Less informative
• This may be an anatomical structure
  – e.g. rough endoplasmic reticulum or nucleus
• or a gene product group
  – e.g. ribosome, proteasome or a protein dimer
AmiGO
• URL: http://amigo.geneontology.org/cgi-bin/amigo/go.cgi
• AmiGO is the official tool for searching and browsing the Gene Ontology database
• A simple BLAST search is provided (not useful)
• The GO database that AmiGO searches consists of a controlled vocabulary of terms covering biological concepts, and a large number of genes or gene products whose attributes have been annotated using GO terms.
PFP (Automated Protein Function Prediction Server)
• Hawkins, T., Luban, S. and Kihara, D. 2006. Enhanced Automated Function Prediction Using Distantly Related Sequences and Contextual Association by PFP. Protein Science 15: 1550-6.
• The PFP algorithm has been shown to increase coverage of sequence-based function annotation more than fivefold by extending a PSI-BLAST search to extract and score GO terms individually.
• It applies the Function Association Matrix (FAM) to score significantly associated pairs of annotations.
PFP method
• PFP uses a scoring scheme to rank GO annotations assigned to all of the most similar sequences according to
  – (1) their frequency of occurrence in those sequences
  – (2) the degree of similarity of the originating sequence to the query.
• This is similar to the scoring basis for the R-value used by the GOtcha method to score annotations from pairwise alignment matches (Martin et al. 2004).
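The two ranking criteria above can be sketched in a few lines of Python. This is only an illustration, not the PFP implementation: it assumes each hit contributes a weight of -log10(E-value) + b to every GO term it carries, with b = 2 so that hits up to an E-value of 100 still count. The function name, the input layout, and the GO IDs in the example are chosen for illustration.

```python
import math
from collections import defaultdict

def rank_go_terms(hits, b=2.0):
    """Rank GO terms from a list of (e_value, set_of_go_terms) hits.

    Assumption: each hit adds -log10(E-value) + b to every term it carries,
    so frequent terms and terms from close matches both score higher.
    """
    scores = defaultdict(float)
    for e_value, terms in hits:
        # Guard against E-values reported as 0, which would break log10.
        weight = -math.log10(max(e_value, 1e-300)) + b
        for term in terms:
            scores[term] += weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Illustrative hits only; not taken from a real PSI-BLAST run.
example_hits = [
    (1e-30, {"GO:0003824", "GO:0008152"}),
    (1e-5, {"GO:0003824"}),
    (10.0, {"GO:0005515"}),
]
ranked = rank_go_terms(example_hits)
for term, score in ranked:
    print(f"{term}\t{score:.2f}")
```

A term seen in two hits outranks a term seen only in the best hit, and a term seen only in a very weak hit (E-value 10) still receives a small positive score.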
PFP method
• Score of a GO term, fa
  – s(fa) is the final score assigned to the GO term fa
  – N is the number of similar sequences retrieved by PSI-BLAST
  – E_value(i) is the E-value given to sequence i
  – b = 2 (or log10[100]) to allow the use of sequence matches up to an E-value of 100
• Function Association Matrix (FAM)
  – fj is a GO term assigned to sequence i
  – P(fa | fj) is the conditional probability that fa is associated with fj
  – c(fa, fj) is the number of times fa and fj are assigned simultaneously to each sequence in UniProt
  – c(fj) is the total number of times fj appeared in UniProt
  – μ is the size of one dimension of the FAM (i.e., the total number of unique GO terms)
  – ɛ is the pseudo-count
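The definitions above fit together as follows, under one reading of them. The sketch assumes s(fa) = sum over hits i and their terms fj of (-log10(E_value(i)) + b) * P(fa | fj), with P(fa | fj) = (c(fa, fj) + ɛ) / (c(fj) + μ·ɛ). The pseudo-count value 0.05, the function names, and the toy annotation sets are illustrative assumptions, not values from the PFP distribution.

```python
import math
from collections import Counter
from itertools import product

def build_fam(annotations, epsilon=0.05):
    """Return a function prob(fa, fj) estimating P(fa | fj).

    annotations: list of sets of GO terms, one set per annotated sequence.
    epsilon is an arbitrary illustrative pseudo-count.
    """
    term_counts = Counter()   # c(fj)
    pair_counts = Counter()   # c(fa, fj), counted once per sequence
    for terms in annotations:
        term_counts.update(terms)
        pair_counts.update(product(terms, repeat=2))
    mu = len(term_counts)     # number of unique GO terms

    def prob(fa, fj):
        return (pair_counts[(fa, fj)] + epsilon) / (term_counts[fj] + mu * epsilon)

    return prob

def pfp_score(fa, hits, prob, b=2.0):
    """Score term fa from (e_value, set_of_go_terms) hits using the FAM."""
    total = 0.0
    for e_value, terms in hits:
        weight = -math.log10(max(e_value, 1e-300)) + b
        for fj in terms:
            total += weight * prob(fa, fj)
    return total

# Toy database: "A" and "B" co-occur often, "C" never co-occurs with "A".
toy_annotations = [{"A", "B"}, {"A", "B"}, {"A"}, {"C"}]
prob = build_fam(toy_annotations)
toy_hits = [(1e-10, {"A"})]
score_b = pfp_score("B", toy_hits, prob)
score_c = pfp_score("C", toy_hits, prob)
print(f"s(B) = {score_b:.3f}, s(C) = {score_c:.3f}")
```

Even though no hit is annotated with "B", it receives a substantial score because it is strongly associated with "A"; "C" gets only the small contribution from the pseudo-count.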
PFP
• Web server http://dragon.bio.purdue.edu/pfp/queue/1168_kw.f.result.html
• Local installation
  – http://dragon.bio.purdue.edu/pfp/dist
  – Installed in /home/kwchoi/public_html/PFP
  – You need to specify the path of blastpgp
  – You also need BLOSUM62
PFP (Automated Protein Function Prediction Server)
• PFP output
  – /home/kwchoi/public_html/I529-09-lab/Lab5/Data/pfp_data
• Columns
  – 1: predicted GO term
  – 2: GO category (f/p/c)
  – 3: raw term score
  – 4: term p-value
  – 5: rank (by p-value)
  – 6: confidence to be exact match
  – 7: rank (by column 6)
  – 8: confidence within 2 edges on the GO DAG
  – 9: rank (by column 8)
  – 10: confidence within 4 edges on the GO DAG
  – 11: rank (by column 10)
  – 12: GO term short definition
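A small parser for the twelve columns listed above might look like this. It is a sketch under assumptions: the file is taken to be whitespace-delimited with no header line, the last column may contain spaces, and the sample line is made up for illustration rather than copied from a real PFP result. Field names are hypothetical.

```python
from typing import NamedTuple

class PfpPrediction(NamedTuple):
    go_term: str          # column 1
    category: str         # column 2: f, p, or c
    raw_score: float      # column 3
    p_value: float        # column 4
    rank_p_value: int     # column 5
    conf_exact: float     # column 6
    rank_exact: int       # column 7
    conf_2_edges: float   # column 8
    rank_2_edges: int     # column 9
    conf_4_edges: float   # column 10
    rank_4_edges: int     # column 11
    definition: str       # column 12 (may contain spaces)

def parse_pfp_line(line: str) -> PfpPrediction:
    # Split on whitespace at most 11 times so the definition stays intact.
    fields = line.strip().split(None, 11)
    if len(fields) != 12:
        raise ValueError(f"expected 12 columns, got {len(fields)}")
    return PfpPrediction(
        fields[0], fields[1],
        float(fields[2]), float(fields[3]), int(fields[4]),
        float(fields[5]), int(fields[6]),
        float(fields[7]), int(fields[8]),
        float(fields[9]), int(fields[10]),
        fields[11],
    )

# Made-up sample line in the assumed layout.
sample = "GO:0003824 f 1234.5 0.001 1 0.85 1 0.90 1 0.95 1 catalytic activity\n"
prediction = parse_pfp_line(sample)
print(prediction.go_term, prediction.category, prediction.definition)
```

If the real file turns out to be tab-delimited with empty fields, splitting on "\t" instead of arbitrary whitespace would be the safer choice.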
GOtcha
• The GOtcha method – Martin et al. BMC Bioinformatics (2004) 5:178.
• GOtcha assigns functional terms transitively based upon sequence similarity.
• These terms are ranked by probability and displayed graphically on a subtree of Gene Ontology.
• GOtcha performs a BLAST search of the query sequence against individual well annotated genomes.
• Annotations are transitively assigned from all hits, with a score corresponding to the E-value, individual GO-terms receiving cumulative scores from multiple sequence similarity matches.
• Cumulative scores are normalized and, for each term, two scores are obtained
  – the I-score, which is normalized to the root node,
  – the C-score, which is the cumulative score at the root node.
• For each GO-term a precomputed scoring table is used to establish the assignment likelihood for that term given that I-score and that C-score. This is represented as a probability.
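The cumulative scoring and the two derived scores can be illustrated on a toy ontology. This is a rough sketch, not the GOtcha code: it assumes each hit contributes max(-log10(E-value), 0) to its terms and all of their ancestors, takes the C-score as the raw cumulative score at the root, and omits the precomputed probability tables. The term names and the tiny DAG are hypothetical.

```python
import math
from collections import defaultdict

# Toy ontology fragment: term -> list of parent terms (hypothetical names).
PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalytic": ["root"],
    "kinase": ["catalytic"],
}

def ancestors_and_self(term, parents):
    """Collect a term and every term reachable by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen

def gotcha_like_scores(hits, parents, root="root"):
    """Return (I-scores per term, C-score) from (e_value, terms) hits."""
    cumulative = defaultdict(float)
    for e_value, terms in hits:
        weight = max(-math.log10(max(e_value, 1e-300)), 0.0)
        reached = set()
        for term in terms:
            reached |= ancestors_and_self(term, parents)
        # Each hit counts once per term, even if reached via several paths.
        for term in reached:
            cumulative[term] += weight
    c_score = cumulative[root]
    if c_score <= 0:
        return {}, c_score
    i_scores = {term: score / c_score for term, score in cumulative.items()}
    return i_scores, c_score

toy_hits = [(1e-20, {"kinase"}), (1e-10, {"binding"}), (1e-10, {"catalytic"})]
i_scores, c_score = gotcha_like_scores(toy_hits, PARENTS)
for term in sorted(i_scores, key=i_scores.get, reverse=True):
    print(f"{term}\tI-score={i_scores[term]:.2f}")
print(f"C-score={c_score:.1f}")
```

General terms accumulate support from all of their descendants, which is why the I-score of a term can never exceed that of its ancestors in this scheme.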
GOtcha method
Pathway mapping
• E.g. E. coli K-12 pathway (00300)
KAAS
• KAAS (KEGG Automatic Annotation Server) provides functional annotation of genes in a genome by BLAST comparisons against a manually curated set of ortholog groups in KEGG GENES.
• The result contains KO (KEGG Orthology) assignments and automatically generated KEGG pathways.
• Moriya, Y., Itoh, M., Okuda, S., Yoshizawa, A., and Kanehisa, M.; KAAS: an automatic genome annotation and pathway reconstruction server. Nucleic Acids Res. 35, W182-W185 (2007). [NAR]
KAAS
• Web server: http://www.genome.jp/kegg/kaas/
• KAAS works best when a complete set of genes in a genome is known. Prepare query amino acid sequences and use the BBH (bi-directional best hit) method to assign orthologs.
• KAAS can also be used for a limited number of genes. Prepare query amino acid sequences and use the SBH (single-directional best hit) method to assign orthologs.
• When ESTs are comprehensive enough, a set of consensus contigs can be generated by the EGassembler server and used as a gene set for KAAS with the BBH method. Otherwise, use ESTs as they are with the SBH method.
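The difference between the BBH and SBH methods can be shown with a generic best-hit computation. This is a conceptual sketch only: KAAS applies its own score thresholds and ortholog-group logic, which are not reproduced here. The input layout (dictionaries of bit scores keyed by query and subject) and all gene names are assumptions made for illustration.

```python
def best_hits(scores):
    """Map each query to its top-scoring subject.

    scores: dict mapping (query, subject) -> bit score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}

def bidirectional_best_hits(forward, reverse):
    """Keep only pairs that are each other's best hit in both directions."""
    fwd = best_hits(forward)
    rev = best_hits(reverse)
    return {q: s for q, s in fwd.items() if rev.get(s) == q}

# Hypothetical scores: query genes g1, g2 against reference genes kA, kB.
forward = {("g1", "kA"): 200, ("g1", "kB"): 150, ("g2", "kB"): 90}
reverse = {("kA", "g1"): 198, ("kB", "g1"): 140, ("kB", "g2"): 95}

sbh = best_hits(forward)                       # single-directional best hits
bbh = bidirectional_best_hits(forward, reverse)
print("SBH:", sbh)
print("BBH:", bbh)
```

SBH assigns a reference gene to every query with a hit, while BBH drops g2 because kB prefers g1 in the reverse search. That reverse search is only meaningful when the query set is a complete gene set, which matches the guidance above.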
KAAS workflow
Pathway mapping
• KAAS returns
  – KO list
  – KEGG Atlas Metabolism map [Create atlas]
  – Pathway maps [Create all maps]
  – Hierarchy files
• You can highlight KEGG maps using KEGG API
  – http://www.genome.jp/kegg/soap/doc/keggapi_manual.html
  – See: color_pathway_by_objects
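Before highlighting maps, it is often handy to load the KO list into a lookup table. The sketch below assumes the KO list is plain text with one gene per line, the gene identifier first and the K number second when one was assigned; this layout is an assumption about the download format, and the gene IDs and K numbers in the example are placeholders.

```python
from collections import defaultdict

def parse_ko_list(lines):
    """Map gene ID -> K number, or None when no KO was assigned."""
    assignments = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        assignments[parts[0]] = parts[1] if len(parts) > 1 else None
    return assignments

def genes_by_ko(assignments):
    """Invert the mapping: K number -> sorted list of genes carrying it."""
    grouped = defaultdict(list)
    for gene, ko in assignments.items():
        if ko is not None:
            grouped[ko].append(gene)
    return {ko: sorted(genes) for ko, genes in grouped.items()}

# Placeholder content in the assumed layout.
example_lines = ["gene1\tK00001\n", "gene2\n", "gene3\tK00001\n", "gene4\tK00002\n"]
assignments = parse_ko_list(example_lines)
grouped = genes_by_ko(assignments)
print(grouped)
```

The resulting K numbers are the objects one would pass to a pathway-coloring call; the call itself is left out here because it depends on the API version available.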