Page 1

Exogean: an expert gene annotation framework based on directed acyclic coloured multigraphs

 

ENCODE Gene Prediction Workshop - EGASP/2005

Sarah Djebali, Franck Delaplace, Hugues Roest Crollius

Page 2

• Human experts generate reference gene annotations
  - Automating human expertise could provide highly specific gene models

• What do human experts do?
  - Human experts combine biological objects using heuristic rules
  - Both biological objects and heuristic rules evolve with time

Human experts generate high quality gene annotations

Page 3

• Exogean is a generic framework based on directed acyclic coloured multigraphs (DACMs), designed to integrate any set of heuristic rules with any set of resources

• In Exogean DACMs:
  - Nodes are biological objects (protein or mRNA alignments, etc.)
  - Multiple edges between nodes are relations between objects

• In terms of DACMs, the human expert annotation protocol corresponds to building, reading and reducing DACMs

Exogean: a highly flexible method that automates human expertise
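A minimal sketch of what a directed acyclic coloured multigraph could look like as a data structure. It is not Exogean's implementation: the class name `DACM`, its methods, and the colour names `same_molecule` and `colinear` are hypothetical, chosen only to illustrate building a graph and reading it relation by relation. Reduction is not covered here.

```python
from collections import defaultdict
from graphlib import CycleError, TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, List, Set, Tuple


class DACM:
    """Toy directed acyclic coloured multigraph (illustrative, not Exogean's code).

    Nodes stand for biological objects. Each ordered pair of nodes can carry
    several colours, one per relation, which is what makes it a multigraph.
    """

    def __init__(self) -> None:
        self.nodes: List[str] = []
        self.colours: Dict[Tuple[str, str], Set[str]] = defaultdict(set)

    def add_node(self, node: str) -> None:
        if node not in self.nodes:
            self.nodes.append(node)

    def add_edge(self, src: str, dst: str, colour: str) -> None:
        """Building: record that relation `colour` holds from src to dst."""
        self.add_node(src)
        self.add_node(dst)
        self.colours[(src, dst)].add(colour)

    def edges(self, colour: str) -> List[Tuple[str, str]]:
        """Reading: list the node pairs linked by one relation."""
        return [pair for pair, cols in self.colours.items() if colour in cols]

    def topological_order(self) -> List[str]:
        """Return the nodes in dependency order; raises CycleError on a cycle."""
        predecessors: Dict[str, Set[str]] = {node: set() for node in self.nodes}
        for src, dst in self.colours:
            predecessors[dst].add(src)
        return list(TopologicalSorter(predecessors).static_order())


# Hypothetical example: four alignments, two relations (colour names are made up).
g = DACM()
for src, dst in [("h1", "h2"), ("h2", "h3")]:
    g.add_edge(src, dst, "same_molecule")
    g.add_edge(src, dst, "colinear")
g.add_edge("h3", "h4", "colinear")

same_molecule_edges = g.edges("same_molecule")
colinear_edges = g.edges("colinear")
order = g.topological_order()
```

Storing a set of colours per node pair keeps each heuristic rule independent: a new rule only adds a new colour, without touching the edges produced by earlier rules.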

Page 4

Exogean main steps

[Flowchart]
1. Information collection (Blat, Spidey, Blast, etc.)
2. Filter: protein and mRNA alignments, called HSPs
3. Exogean core: DACM expert annotation
   - DACM1: Single Molecule Clustering, then reduction
   - DACM2: Single Type Multi Molecule Clustering, then reduction
   - DACM3: Multi Type Multi Molecule Clustering, then reduction
4. CDS identification
5. Filter output: final gene models with multiple transcripts

Page 5

Example: several mRNAs and proteins have been aligned to a specific locus

[Figure: HSPs aligned to the locus, one row per molecule]
rm1: h1 h2 h3 h4
rm2: h5 h6
rm3: h7 h8 h9
pm1: h10 h11 h12 h13
pm2: h14 h15 h16
pm3: h17 h18

• rmi = mRNA molecule
• pmj = protein molecule
• hk = mRNA or protein HSP
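The example locus can be written down as plain records, which makes the later clustering steps easier to follow. This is only a sketch of one possible representation: the `HSP` class, its field names, and the `group_by` helper are assumptions, and genomic coordinates are omitted because the slide does not give them.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class HSP:
    name: str      # h1 ... h18
    molecule: str  # rm1 ... rm3 or pm1 ... pm3
    kind: str      # "mRNA" or "protein"


# (molecule, kind, first HSP index, last HSP index), read off the example figure.
ROWS = [
    ("rm1", "mRNA", 1, 4),
    ("rm2", "mRNA", 5, 6),
    ("rm3", "mRNA", 7, 9),
    ("pm1", "protein", 10, 13),
    ("pm2", "protein", 14, 16),
    ("pm3", "protein", 17, 18),
]

hsps: List[HSP] = [
    HSP(f"h{k}", molecule, kind)
    for molecule, kind, first, last in ROWS
    for k in range(first, last + 1)
]


def group_by(records: List[HSP], attribute: str) -> Dict[str, List[str]]:
    """Group HSP names by one attribute ('molecule' or 'kind')."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for hsp in records:
        groups[getattr(hsp, attribute)].append(hsp.name)
    return dict(groups)


by_molecule = group_by(hsps, "molecule")
by_kind = group_by(hsps, "kind")
```

Grouping by molecule is only a starting point. In the level1 transcript models shown later, the HSPs of pm1 end up in two models (p1 and p2), so the Single Molecule Clustering applies more than this grouping.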

Page 6

Building and reducing DACM1 = the Single Molecule Clustering

Page 7

DACM expert annotation

[Flowchart]
mRNA and protein HSPs
→ DACM1 building + reduction → Level1 transcript models
→ DACM2 building + reduction → Level2 transcript models
→ DACM3 building + reduction → Level3 transcript models

Each DACM reduction produces more complex transcript models

Page 8

DACM3 reduction produces final transcript models

[Figure: final transcript models M1 and M2]

• Mi = final multi type multi molecule transcript model in which Exogean searches for a CDS

Page 9

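Evaluation method

[Figure: Method_X CDSs compared with HAVANA CDSs; in this example TP = 2, FP = 2, FN = 3]

$$\mathrm{Sp} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} = \frac{2}{4} = 50\%$$

$$\mathrm{Sn} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} = \frac{2}{5} = 40\%$$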

Page 10

Specificity on the 44 ENCODE regions

[Bar chart: specificity (%, axis from 0 to 100) per method. Methods, in plotted order: Exogean, Aceview, Augustus, Ensembl, Pairagon+N-SCAN, Exonhunter. Series: exon overlap, exon exact, CDS overlap, CDS exact.]

Page 11

Sensitivity on the 44 ENCODE regions

[Bar chart: sensitivity (%, axis from 0 to 100) per method. Methods, in plotted order: Aceview, Exogean, Augustus, Ensembl, Exonhunter, Pairagon+N-SCAN. Series: exon overlap, exon exact, CDS overlap, CDS exact.]

Page 12

Page 13

Evaluation method

TP = True Positive: each HAVANA CDS matched exactly by at least one method_X CDS is counted as TP

FP = False Positive: each method_X CDS that does not exactly match any HAVANA CDS is treated as a virtual HAVANA CDS and counted as FP

FN = False Negative: each HAVANA CDS that is not matched exactly by any method_X CDS is counted as FN

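$$\mathrm{Sp} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}$$

$$\mathrm{Sn} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$$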
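The definitions above translate directly into set operations. The sketch below is an illustration, not the EGASP evaluation code: the `evaluate` function, the representation of a CDS as a tuple of (start, end) exon coordinates, and the coordinates themselves are all assumptions. Identical method_X CDSs are counted once, which is also an assumption.

```python
from typing import Dict, Iterable, Tuple

Exon = Tuple[int, int]   # (start, end); hypothetical coordinate convention
CDS = Tuple[Exon, ...]   # a CDS is its ordered exon structure


def evaluate(havana: Iterable[CDS], method_x: Iterable[CDS]) -> Dict[str, float]:
    """Count exact CDS matches and derive sensitivity and specificity."""
    reference = set(havana)
    predicted = set(method_x)   # duplicates collapse (assumption)
    tp = len(reference & predicted)   # HAVANA CDSs matched exactly
    fn = len(reference - predicted)   # HAVANA CDSs never matched
    fp = len(predicted - reference)   # method_X CDSs matching no HAVANA CDS
    return {
        "TP": tp,
        "FP": fp,
        "FN": fn,
        "Sn": tp / (tp + fn) if tp + fn else 0.0,
        "Sp": tp / (tp + fp) if tp + fp else 0.0,
    }


# Made-up coordinates, chosen to reproduce the counts TP = 2, FP = 2, FN = 3.
havana_cds = [
    ((100, 200), (300, 400)),
    ((100, 200), (350, 400)),
    ((500, 600),),
    ((700, 800), (900, 950)),
    ((1000, 1100),),
]
method_x_cds = [
    ((100, 200), (300, 400)),   # exact match
    ((500, 600),),              # exact match
    ((700, 800), (900, 960)),   # one boundary differs: not an exact match
    ((1200, 1300),),            # no HAVANA counterpart
]

result = evaluate(havana_cds, method_x_cds)
```

Because a match must be exact, a prediction that differs from a HAVANA CDS by a single exon boundary costs both one FP and one FN.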

Page 14

DACM1 reduction produces level1 transcript models

[Figure: level1 transcript models and the HSPs each one groups]
r1: h1 h2 h3 h4
r2: h5 h6
r3: h7 h8 h9
p1: h10 h11
p2: h12 h13
p3: h14 h15 h16
p4: h17 h18

• ri = mRNA level1 transcript model
• pj = protein level1 transcript model

Page 15

Building and reducing DACM2 = the Single Type Multi Molecule Clustering

Page 16

DACM2 reduction produces level2 transcript models

[Figure: level2 transcript models and the level1 models each one groups]
R1 = (r1, r3)
R2 = (r2, r3)
P1 = (p1, p3)
P2 = (p2)
P3 = (p4)

• Ri = mRNA level2 transcript model
• Pj = protein level2 transcript model
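One way to read these groupings is as the maximal paths of a directed acyclic graph over the level1 models. The sketch below is an interpretation under stated assumptions, not Exogean's reduction rule: the slide does not list the DACM2 edges, so the three edges used here are assumed, and `maximal_paths` is a hypothetical helper.

```python
from typing import Dict, List, Set, Tuple


def maximal_paths(nodes: List[str], edges: List[Tuple[str, str]]) -> List[Tuple[str, ...]]:
    """List every path that starts at a node without predecessor and ends at a
    node without successor. The graph is assumed to be acyclic (not checked)."""
    successors: Dict[str, List[str]] = {node: [] for node in nodes}
    has_predecessor: Set[str] = set()
    for src, dst in edges:
        successors[src].append(dst)
        has_predecessor.add(dst)

    paths: List[Tuple[str, ...]] = []

    def walk(path: List[str]) -> None:
        following = successors[path[-1]]
        if not following:
            paths.append(tuple(path))   # reached a sink: the path is maximal
            return
        for nxt in following:
            walk(path + [nxt])

    for node in nodes:
        if node not in has_predecessor:   # start only from source nodes
            walk([node])
    return paths


level1_models = ["r1", "r2", "r3", "p1", "p2", "p3", "p4"]
# Assumed edges: the slide shows only the resulting groupings, not the edges.
assumed_edges = [("r1", "r3"), ("r2", "r3"), ("p1", "p3")]

level2_models = maximal_paths(level1_models, assumed_edges)
# [('r1', 'r3'), ('r2', 'r3'), ('p1', 'p3'), ('p2',), ('p4',)]
```

Under this reading a level1 model can contribute to several level2 models, as r3 does for R1 and R2, while a model with no edge, such as p2 or p4, forms a level2 model on its own.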

Page 17

Building and reducing DACM3 = the Multi Type Multi Molecule Clustering