Page 1: Assessment of Genome-wide Protein Function Classification for Drosophila melanogaster Huaiyu Mi mihn@fc.celera.com Panther Protein Informatics group Celera

Assessment of Genome-wide Protein Function Classification for Drosophila melanogaster

Huaiyu Mi
mihn@fc.celera.com

Panther Protein Informatics group
Celera Genomics

Page 2

How to classify proteins in a robust and accurate way?

Page 3

Outline

1. Introduction to PANTHER

2. Comparison of functional classification of Drosophila proteins by FlyBase and PANTHER

Page 4

What is PANTHER?

PANTHER library (PANTHER/LIB)

•a family tree

•a multisequence alignment

•an HMM

PANTHER index (PANTHER/X)

•Molecular function

•Biological process

Page 5

Building the library

500,000 protein sequences (filtered GenBank NR)
-> 2200+ protein family clusters
-> Biologist curation
-> 40,000 subfamilies

MSA & HMM & tree

Each family and subfamily was labeled with a name and classified by PANTHER/X categories.

Page 6

PANTHER library (PANTHER/LIB)

Page 7

PANTHER index (PANTHER/X)

PANTHER/X

RECEPTOR
=> G-protein coupled receptor
=> protein kinase receptor
=> => serine/threonine protein kinase receptor
=> => tyrosine protein kinase receptor

GO

transmembrane receptor protein kinase GO:0019199
=> transmembrane receptor protein serine/threonine kinase GO:0004675
=> => transforming growth factor alpha receptor GO:0005023
=> => transforming growth factor beta receptor GO:0005024
=> => => activin receptor GO:0017002
=> => => => type I activin receptor GO:0016361
=> => => => type II activin receptor GO:0016362
=> => => type I transforming growth factor beta receptor GO:0005025
=> => => => type I activin receptor GO:0016361
=> => => type II transforming growth factor beta receptor GO:0005026
=> => => => type II activin receptor GO:0016362
=> transmembrane receptor protein tyrosine kinase GO:0004714
=> => boss receptor GO:0008288
=> => ephrin receptor GO:0005003
=> => => GPI-linked ephrin receptor GO:0005004
=> => => transmembrane-ephrin receptor GO:0005005
=> => epidermal growth factor receptor GO:0005006
=> => => gurken receptor GO:0008313
=> => fibroblast growth factor receptor GO:0005007
=> => hepatocyte growth factor receptor GO:0005008
=> => insulin receptor GO:0005009
=> => insulin-like growth factor receptor GO:0005010
=> => macrophage colony stimulating factor receptor GO:0005011
=> => macrophage receptor GO:0008019
=> => Neu/ErbB-2 receptor GO:0005012
=> => neurotrophin TRK receptor GO:0005013
=> => => neurotrophin TRKA receptor GO:0005014
=> => => neurotrophin TRKB receptor GO:0005015
=> => => neurotrophin TRKC receptor GO:0005016
=> => platelet-derived growth factor receptor GO:0005017
=> => => platelet-derived growth factor, alpha-receptor GO:0005018
=> => => platelet-derived growth factor, beta-receptor GO:0005019
=> => stem cell factor receptor GO:0005020
=> => vascular endothelial growth factor receptor GO:0005021
=> vascular endothelial growth factor receptor GO:0005021

signal transducer GO:0004871
=> receptor GO:0004872
=> => transmembrane receptor GO:0004888
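The "=>" notation on this slide encodes depth in the hierarchy: each leading "=>" is one level below the nearest shallower term above it. As a rough illustration only, the sketch below reads a few lines copied from the GO column into a child-to-parents map. The function names parse_hierarchy and ancestors are made up for this example and are not part of PANTHER or GO tooling.

```python
def parse_hierarchy(lines):
    """Build {term: [parents]} from lines where each leading '=>' is one level of depth."""
    parents = {}
    stack = []  # stack[d] holds the most recent term seen at depth d
    for raw in lines:
        tokens = raw.split()
        depth = 0
        while depth < len(tokens) and tokens[depth] == "=>":
            depth += 1
        term = " ".join(tokens[depth:])
        if not term:
            continue
        del stack[depth:]  # drop terms at this depth or deeper
        parents.setdefault(term, [])
        if depth > 0 and stack and stack[-1] not in parents[term]:
            parents[term].append(stack[-1])
        stack.append(term)
    return parents


def ancestors(term, parents):
    """Collect every term reachable by following parent links upward."""
    seen = set()
    todo = list(parents.get(term, []))
    while todo:
        current = todo.pop()
        if current not in seen:
            seen.add(current)
            todo.extend(parents.get(current, []))
    return seen


# A few lines copied from the GO column of the slide.
LINES = [
    "transmembrane receptor protein kinase GO:0019199",
    "=> transmembrane receptor protein serine/threonine kinase GO:0004675",
    "=> => transforming growth factor beta receptor GO:0005024",
    "=> => => activin receptor GO:0017002",
    "=> => => => type I activin receptor GO:0016361",
    "=> => => type I transforming growth factor beta receptor GO:0005025",
    "=> => => => type I activin receptor GO:0016361",
]

PARENTS = parse_hierarchy(LINES)
print(PARENTS["type I activin receptor GO:0016361"])
```

Because "type I activin receptor" is listed under two different terms, it ends up with two parents, which is why the map stores a list rather than a single parent.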

Page 8

PANTHER Scoring

A fasta file
-> Family and subfamily HMMs
-> Score above threshold?
-> yes: Classified (Name, Molecular function, Biological process)
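The decision step in this flow can be sketched as follows. This is a minimal illustration, not the PANTHER implementation: it assumes each sequence already has one score per family or subfamily HMM, that a higher score is better, and that a single threshold applies to all HMMs. The HMM identifiers, scores, threshold and annotation entries are all hypothetical.

```python
from typing import Dict, Optional, Tuple

# Hypothetical annotation table: HMM id -> (name, molecular function, biological process).
ANNOTATIONS: Dict[str, Tuple[str, str, str]] = {
    "FAM1": ("example family", "receptor", "signal transduction"),
    "FAM1:SF2": ("example subfamily", "tyrosine protein kinase receptor", "signal transduction"),
}


def classify(scores: Dict[str, float], threshold: float) -> Optional[Tuple[str, str, str]]:
    """Return the annotation of the best-scoring HMM, or None if no score reaches the threshold."""
    if not scores:
        return None
    best = max(scores, key=lambda hmm: scores[hmm])  # assumption: higher score is better
    if scores[best] < threshold:
        return None  # sequence stays unclassified
    return ANNOTATIONS.get(best)


# Hypothetical scores for one sequence against one family HMM and one subfamily HMM.
result = classify({"FAM1": 55.0, "FAM1:SF2": 80.0}, threshold=50.0)
print(result)
```

A sequence whose best score falls below the threshold returns None, which corresponds to the "Not hit" group in the coverage figures later in the deck.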

Page 9

How accurate is PANTHER?

FlyBase

A manually curated database for Drosophila genes

PANTHER

An automated annotation process

Assess the associations

Page 10

Process for comparison

Fly protein sequences
-> FlyBase annotation with GO terms
-> PANTHER annotation by scoring against PANTHER
-> Automated comparison of FlyBase and PANTHER assignments
-> -> Match
-> -> Not match
-> -> -> Manual review
-> -> -> -> Correct
-> -> -> -> Incorrect
-> -> -> -> Inconclusive
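The slide does not say how the automated comparison decides that two assignments match. One plausible rule, shown here purely as an assumption, is to call a term a match when the other source assigns the same term or a term on the same lineage (an ancestor or a descendant). The small PARENTS map uses GO identifiers from the hierarchy slide; the function names are hypothetical.

```python
# Child -> parents, using a few GO ids from the hierarchy slide.
PARENTS = {
    "GO:0004675": ["GO:0019199"],  # serine/threonine kinase receptor -> receptor protein kinase
    "GO:0004714": ["GO:0019199"],  # tyrosine kinase receptor -> receptor protein kinase
    "GO:0005024": ["GO:0004675"],  # TGF beta receptor -> serine/threonine kinase receptor
    "GO:0005006": ["GO:0004714"],  # EGF receptor -> tyrosine kinase receptor
}


def ancestors(term):
    """Return all terms above `term` in the PARENTS map."""
    seen = set()
    todo = list(PARENTS.get(term, []))
    while todo:
        current = todo.pop()
        if current not in seen:
            seen.add(current)
            todo.extend(PARENTS.get(current, []))
    return seen


def related(a, b):
    """Assumed match rule: identical terms, or one term is an ancestor of the other."""
    return a == b or a in ancestors(b) or b in ancestors(a)


def auto_compare(own_terms, other_terms):
    """Label each of one source's terms as 'match' or 'manual review'."""
    return {
        term: "match" if any(related(term, other) for other in other_terms) else "manual review"
        for term in own_terms
    }


print(auto_compare(["GO:0005006"], ["GO:0004714"]))
print(auto_compare(["GO:0005024"], ["GO:0004714"]))
```

Under this rule the first pair matches because the EGF receptor term sits below the tyrosine kinase receptor term, while the second pair shares only a common ancestor and would go to manual review.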

Page 11

Coverage of Drosophila proteins classified by FlyBase and PANTHER.

[Figure, panels A-F, labeled by ontology (molecular function, biological process) and by source (FlyBase, PANTHER, both). FlyBase panels split proteins into "FlyBase classified to GO" and "FlyBase not classified to GO". PANTHER panels split proteins into "PANTHER HMM hits classified to GO", "PANTHER HMM hits not classified to GO" and "Not hit". Counts shown for FlyBase: 6301 and 8031; 11538 and 2794. Counts shown for PANTHER: 4862, 3265 and 6205; 4469, 3658 and 6205. Classified overlap between FlyBase and PANTHER: 3283 and 1159.]

Page 12

Assessment of molecular function associations

           Auto match   Manual match   Correct   Incorrect   Inconclusive
FlyBase    2747         663            195       58          37
PANTHER    2737         700            345       50          35
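The counts on this slide can be totaled to see how many associations each source contributed and what share of them matched. The sketch below treats "auto match" plus "manual match" as the matched share, which is an interpretation of the category names rather than something the slide states.

```python
# Counts per category, copied from the slide.
COUNTS = {
    "FlyBase": {"auto match": 2747, "manual match": 663, "correct": 195,
                "incorrect": 58, "inconclusive": 37},
    "PANTHER": {"auto match": 2737, "manual match": 700, "correct": 345,
                "incorrect": 50, "inconclusive": 35},
}


def match_summary(counts):
    """Return (total associations, percent matched automatically or manually)."""
    total = sum(counts.values())
    matched = counts["auto match"] + counts["manual match"]  # assumed definition of a match
    return total, round(100 * matched / total, 1)


for source, counts in COUNTS.items():
    total, percent = match_summary(counts)
    print(f"{source}: {total} associations, {percent}% matched")
```

The totals come out to 3700 for FlyBase and 3867 for PANTHER, with roughly 92% and 89% matched respectively.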

Page 13

Types of errors

•Homology error – an error caused by an incorrect functional prediction based on sequence homology.

•Human error – an error on the part of the human curator.

•Evidence error – an error caused by relying on evidence that is itself incorrect.

Page 14

Analysis of errors

                                          PANTHER   FlyBase
Number of homology errors                 8         35
Number of human errors                    40        23
Number of evidence errors                 2         0
Total number of incorrect associations    50        58
Association error rate (%)                1.3%      1.6%
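The totals and rates in this table can be checked with a few lines of arithmetic. The sketch assumes the error rate is the number of incorrect associations divided by all molecular function associations from that source (3867 for PANTHER and 3700 for FlyBase, the figures given on the summary slide). The slide does not spell out the denominator, but this choice reproduces the reported percentages.

```python
# Error counts by type, copied from the table.
ERRORS = {
    "PANTHER": {"homology": 8, "human": 40, "evidence": 2},
    "FlyBase": {"homology": 35, "human": 23, "evidence": 0},
}

# Total molecular function associations per source (from the summary slide).
ASSOCIATIONS = {"PANTHER": 3867, "FlyBase": 3700}


def error_rate(source):
    """Return (total incorrect associations, error rate in percent)."""
    incorrect = sum(ERRORS[source].values())
    # Assumed denominator: all associations made by this source.
    return incorrect, round(100 * incorrect / ASSOCIATIONS[source], 1)


for source in ERRORS:
    incorrect, rate = error_rate(source)
    print(f"{source}: {incorrect} incorrect, {rate}%")
```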

Page 15

PANTHER function inference in the context of a protein sequence tree

Example of homology error

FBgn0032382 (CG14934)

FlyBase: alpha glucosidase; neutral amino acid transporter

PANTHER: alpha glucosidase

[Tree figure: CG14934 shown in a protein sequence tree with branches labeled alpha glucosidase, neutral a.a. transporter and alpha amylase.]

Page 16

Summary

•PANTHER is an automated method to classify proteins in a robust way.

•The accuracy of PANTHER was assessed by comparing its classification of Drosophila proteins with FlyBase’s.

•A total of 3283 Drosophila proteins were associated with at least one molecular function category by both FlyBase and PANTHER (3867 molecular function associations by PANTHER, and 3700 by FlyBase).

•About 90% of these associations by FlyBase and PANTHER match each other.

•Total error rate is < 2% for both methods.

Page 17

Acknowledgements

Celera Genomics

Paul Thomas

Jody Vandergriff

Michael Campbell

Apurva Narechania

William Majoros

Karen Diemer

Olivier Doremieux

Nan Guo

Anish Kejariwal

Steven Ladunga

Betty Lazareva

Anushya Muruganujan

Steve Rabkin

FlyBase

Michael Ashburner

Susanna Lewis