Assessment of Genome-wide Protein Function Classification for Drosophila melanogaster
Huaiyu Mi
[email protected]
Panther Protein Informatics group
Celera Genomics
How to classify proteins in a robust and accurate way?
Outline
1. Introduction to PANTHER
2. Comparison of functional classification of Drosophila proteins by FlyBase and PANTHER
What is PANTHER?
PANTHER library (PANTHER/LIB)
  a family tree
  a multisequence alignment
  an HMM
PANTHER index (PANTHER/X)
  Molecular function
  Biological process
Building the library
500,000 protein sequences (filtered GenBank NR)
2200+ protein family clusters
Biologist curation
40,000 subfamilies
Each family and subfamily was labeled with a name and classified by PANTHER/X categories
MSA, HMM, tree
PANTHER library (PANTHER/LIB)
PANTHER index (PANTHER/X)
PANTHER/X
RECEPTOR
=> G-protein coupled receptor
=> protein kinase receptor
=> => serine/threonine protein kinase receptor
=> => tyrosine protein kinase receptor

GO
transmembrane receptor protein kinase GO:0019199
=> transmembrane receptor protein serine/threonine kinase GO:0004675
=> => transforming growth factor alpha receptor GO:0005023
=> => transforming growth factor beta receptor GO:0005024
=> => => activin receptor GO:0017002
=> => => => type I activin receptor GO:0016361
=> => => => type II activin receptor GO:0016362
=> => => type I transforming growth factor beta receptor GO:0005025
=> => => => type I activin receptor GO:0016361
=> => => type II transforming growth factor beta receptor GO:0005026
=> => => => type II activin receptor GO:0016362
=> transmembrane receptor protein tyrosine kinase GO:0004714
=> => boss receptor GO:0008288
=> => ephrin receptor GO:0005003
=> => => GPI-linked ephrin receptor GO:0005004
=> => => transmembrane-ephrin receptor GO:0005005
=> => epidermal growth factor receptor GO:0005006
=> => => gurken receptor GO:0008313
=> => fibroblast growth factor receptor GO:0005007
=> => hepatocyte growth factor receptor GO:0005008
=> => insulin receptor GO:0005009
=> => insulin-like growth factor receptor GO:0005010
=> => macrophage colony stimulating factor receptor GO:0005011
=> => macrophage receptor GO:0008019
=> => Neu/ErbB-2 receptor GO:0005012
=> => neurotrophin TRK receptor GO:0005013
=> => => neurotrophin TRKA receptor GO:0005014
=> => => neurotrophin TRKB receptor GO:0005015
=> => => neurotrophin TRKC receptor GO:0005016
=> => platelet-derived growth factor receptor GO:0005017
=> => => platelet-derived growth factor, alpha-receptor GO:0005018
=> => => platelet-derived growth factor, beta-receptor GO:0005019
=> => stem cell factor receptor GO:0005020
=> => vascular endothelial growth factor receptor GO:0005021
=> vascular endothelial growth factor receptor GO:0005021

signal transducer GO:0004871
=> receptor GO:0004872
=> => transmembrane receptor GO:0004888
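The "=>" notation on this slide encodes depth in a hierarchy: each additional "=>" places a term one level below the most recent term at the previous level. A minimal sketch of reading that notation into (term, parent) pairs follows. The function name `parse_arrow_tree` and the one-term-per-line input layout are assumptions made for illustration, not part of PANTHER or GO tooling.

```python
def parse_arrow_tree(lines):
    """Return (term, parent) pairs; depth is the number of leading '=>' markers."""
    stack = []   # stack[d] holds the most recent term seen at depth d
    edges = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        depth = 0
        while line.startswith("=>"):
            depth += 1
            line = line[2:].lstrip()
        if depth > len(stack):
            raise ValueError(f"term skips a level: {raw!r}")
        parent = stack[depth - 1] if depth > 0 else None
        del stack[depth:]          # forget deeper terms from earlier branches
        stack.append(line)
        edges.append((line, parent))
    return edges


# A few lines taken from the GO branch shown on the slide.
example = [
    "transmembrane receptor protein kinase GO:0019199",
    "=> transmembrane receptor protein tyrosine kinase GO:0004714",
    "=> => ephrin receptor GO:0005003",
    "=> => => GPI-linked ephrin receptor GO:0005004",
    "=> => epidermal growth factor receptor GO:0005006",
]
edges = parse_arrow_tree(example)
for term, parent in edges:
    print(term, "<-", parent)
```

Because a term such as type I activin receptor appears under more than one parent on the slide, a flat list of pairs is used here rather than a single-parent dictionary.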
PANTHER Scoring
1. Input: a FASTA file
2. Score against the family and subfamily HMMs
3. Score above threshold?
4. If yes: classified (name, molecular function, biological process)
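The threshold decision in this flow can be sketched as below. This is only an illustration of the "score above threshold?" step: the model names, the score values, the threshold of 20.0, and the convention that a higher score is better are all assumptions, not PANTHER's actual scoring scheme or cutoffs.

```python
from typing import Dict, Optional, Tuple


def classify(scores: Dict[str, float], threshold: float) -> Optional[Tuple[str, float]]:
    """Return (best model, score) if the best score is above the threshold, else None.

    `scores` maps a family or subfamily HMM name to the score of one query
    sequence against that model (hypothetical values; higher is assumed better).
    """
    if not scores:
        return None
    best = max(scores, key=scores.get)
    if scores[best] > threshold:
        return best, scores[best]
    return None


# Hypothetical scores for one query sequence against three models.
hit = classify({"FAM_A": 12.0, "FAM_A:SF1": 35.5, "FAM_B": 3.2}, threshold=20.0)
miss = classify({"FAM_A": 12.0, "FAM_B": 3.2}, threshold=20.0)
print(hit)   # the classified case
print(miss)  # the unclassified ("not hit") case
```

A classified sequence would then inherit the name, molecular function, and biological process attached to the winning family or subfamily.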
How accurate is PANTHER?
FlyBase
A manually curated database for Drosophila genes
PANTHER
An automated annotation process
Assess the associations
Process for comparison
Fly protein sequences
FlyBase annotation with GO terms
PANTHER annotation by scoring against PANTHER
Automated comparison of FlyBase and PANTHER assignments
Match
Not Match
Manual review
Correct
Incorrect
Inconclusive
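The automated step of this comparison can be sketched as follows. The slides do not state the matching rule, so this sketch assumes two GO terms match when they are identical or when one is an ancestor of the other, and it sends everything else to manual review. The function names and the single-parent `parent_of` map (a simplification of the real GO graph, built from a few terms shown earlier) are hypothetical.

```python
def ancestors(term, parent_of):
    """Collect every ancestor of `term` by walking a child -> parent map."""
    found = set()
    while term in parent_of:
        term = parent_of[term]
        found.add(term)
    return found


def terms_match(a, b, parent_of):
    """Assumed rule: same term, or one term is an ancestor of the other."""
    return a == b or a in ancestors(b, parent_of) or b in ancestors(a, parent_of)


def triage(flybase_terms, panther_terms, parent_of):
    """Return 'match' if any FlyBase/PANTHER term pair matches, else 'manual review'."""
    for f in flybase_terms:
        for p in panther_terms:
            if terms_match(f, p, parent_of):
                return "match"
    return "manual review"


# Toy hierarchy using GO identifiers from the receptor kinase branch.
parent_of = {
    "GO:0004675": "GO:0019199",  # serine/threonine kinase receptor
    "GO:0004714": "GO:0019199",  # tyrosine kinase receptor
    "GO:0005003": "GO:0004714",  # ephrin receptor
    "GO:0005009": "GO:0004714",  # insulin receptor
}
print(triage({"GO:0005003"}, {"GO:0004714"}, parent_of))
print(triage({"GO:0005003"}, {"GO:0005009"}, parent_of))
```

Pairs routed to manual review are the ones a curator would then label correct, incorrect, or inconclusive.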
Coverage of Drosophila proteins classified by FlyBase and PANTHER.
[Figure: panels A-F, arranged in rows (molecular function, biological process) and columns (FlyBase, PANTHER, both). FlyBase panels split proteins into "classified to GO" and "not classified to GO" (counts shown: 6301 and 8031; 11538 and 2794). PANTHER panels split proteins into "HMM hits classified to GO", "HMM hits not classified to GO", and "not hit" (counts shown: 4862, 3265, 6205; 4469, 3658, 6205). Overlap panels: classified overlap 3283 (molecular function) and 1159 (biological process).]
Assessment of molecular function associations
                FlyBase   PANTHER
Auto match         2747      2737
Manual match        663       700
Correct             195       345
Incorrect            58        50
Inconclusive         37        35
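As a quick arithmetic check on the molecular function counts (FlyBase: 2747, 663, 195, 58, 37; PANTHER: 2737, 700, 345, 50, 35), the sketch below totals each method's associations and computes the share that matched automatically or after manual review. Treating "auto match" plus "manual match" as the match rate is an assumption about how the summary figure was derived.

```python
counts = {
    "FlyBase": {"auto match": 2747, "manual match": 663, "correct": 195,
                "incorrect": 58, "inconclusive": 37},
    "PANTHER": {"auto match": 2737, "manual match": 700, "correct": 345,
                "incorrect": 50, "inconclusive": 35},
}


def summarize(by_outcome):
    """Return (total associations, percent matched automatically or manually)."""
    total = sum(by_outcome.values())
    matched = by_outcome["auto match"] + by_outcome["manual match"]
    return total, 100.0 * matched / total


summary = {method: summarize(c) for method, c in counts.items()}
for method, (total, pct) in summary.items():
    print(f"{method}: {total} associations, {pct:.1f}% matched")
```

The totals come out to 3700 for FlyBase and 3867 for PANTHER, and the matched shares are roughly 92% and 89%.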
Types of errors
•Homology error: an error caused by an incorrect functional prediction based on sequence homology.
•Human error: an error on the part of the human curator.
•Evidence error: an error caused by relying on evidence that is itself incorrect.
Analysis of errors
                                         PANTHER   FlyBase
Number of homology errors                      8        35
Number of human errors                        40        23
Number of evidence errors                      2         0
Total number of incorrect associations        50        58
Association error rate (%)                   1.3       1.6
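The error rates in this table can be reproduced from the error counts and the total number of molecular function associations per method (3867 for PANTHER and 3700 for FlyBase, as given in the summary). A minimal sketch, assuming the rate is simply incorrect associations divided by total associations:

```python
errors = {
    "PANTHER": {"homology": 8, "human": 40, "evidence": 2},
    "FlyBase": {"homology": 35, "human": 23, "evidence": 0},
}
total_associations = {"PANTHER": 3867, "FlyBase": 3700}


def error_rate(method):
    """Return (incorrect associations, error rate in percent) for one method."""
    incorrect = sum(errors[method].values())
    return incorrect, 100.0 * incorrect / total_associations[method]


rates = {method: error_rate(method) for method in errors}
for method, (incorrect, pct) in rates.items():
    print(f"{method}: {incorrect} incorrect, {pct:.1f}%")
```

The per-type counts sum to the reported totals (50 and 58), and the rates round to 1.3% and 1.6%. Most PANTHER errors are human errors, while most FlyBase errors are homology errors.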
PANTHER function inference in the context of a protein sequence tree
FBgn0032382 (CG14934)
FlyBase: alpha glucosidase; neutral amino acid transporter
PANTHER: alpha glucosidase
Example of homology error
[Figure: protein sequence tree containing CG14934; tree labels: alpha glucosidase, neutral a.a. transporter, alpha amylase (labeled twice).]
Summary
•PANTHER is an automated method to classify proteins in a robust way.
•The accuracy of PANTHER was assessed by comparing its classification of Drosophila proteins with FlyBase’s.
•A total of 3283 Drosophila proteins were associated with at least one molecular function category by both FlyBase and PANTHER (3867 molecular function associations by PANTHER, and 3700 by FlyBase).
•About 90% of these associations by FlyBase and PANTHER match each other.
•Total error rate is < 2% for both methods.
Acknowledgements
Celera Genomics
Paul Thomas
Jody Vandergriff
Michael Campbell
Apurva Narechania
William Majoros
Karen Diemer
Olivier Doremieux
Nan Guo
Anish Kejariwal
Steven Ladunga
Betty Lazareva
Anushya Muruganujan
Steve Rabkin
FlyBase
Michael Ashburner
Susanna Lewis