Exogean: an expert gene annotation framework based on directed acyclic coloured multigraphs
ENCODE Gene Prediction Workshop - EGASP/2005
Sarah Djebali, Franck Delaplace, Hugues Roest Crollius
Human experts generate high quality gene annotations
• Human experts generate reference gene annotations, so automating human expertise could provide highly specific gene models.
• What do human experts do?
  - Human experts combine biological objects using heuristic rules.
  - Both biological objects and heuristic rules evolve with time.
Exogean: a highly flexible method that automates human expertise
• Exogean is a generic framework based on directed acyclic coloured multigraphs (DACMs), designed to allow any set of heuristic rules to be applied to any set of resources.
• In Exogean DACMs:
  - Nodes are biological objects (protein or mRNA alignments, …etc).
  - Multiple edges between nodes are relations between objects.
• In terms of DACMs, the human expert annotation protocol corresponds to building, reading and reducing DACMs.
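To make the DACM description more concrete, here is a minimal sketch of a directed acyclic coloured multigraph in Python. It is not Exogean's implementation: the class name `DACM`, its methods, and the colour names used in the usage note (`same_molecule`, `colinear`) are all assumptions made for illustration. Nodes stand for biological objects, and each coloured edge stands for one relation between two objects, so two nodes can be linked by several edges of different colours.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class DACM:
    """Illustrative directed acyclic coloured multigraph (hypothetical API)."""
    nodes: set = field(default_factory=set)
    # edges[(src, dst)] holds the set of colours (relations) from src to dst
    edges: dict = field(default_factory=lambda: defaultdict(set))

    def add_node(self, node):
        self.nodes.add(node)

    def add_edge(self, src, dst, colour):
        """Add one coloured edge, refusing any edge that would close a cycle."""
        if src == dst or self._reaches(dst, src):
            raise ValueError(f"edge {src}->{dst} would create a cycle")
        self.nodes.update((src, dst))
        self.edges[(src, dst)].add(colour)

    def colours(self, src, dst):
        """Return the colours of all edges going from src to dst."""
        return set(self.edges.get((src, dst), set()))

    def _reaches(self, start, goal):
        """Depth-first search: can goal be reached from start?"""
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == goal:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(d for (s, d) in self.edges if s == node)
        return False
```

For instance, `g = DACM(); g.add_edge("h1", "h2", "same_molecule"); g.add_edge("h1", "h2", "colinear")` stores two differently coloured edges between the same pair of HSP nodes, while a later `g.add_edge("h2", "h1", "colinear")` is rejected because it would break acyclicity.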
Exogean main steps
[Diagram: Exogean pipeline]
• Information Collection (Blat, Spidey, Blast… etc), producing protein and mRNA alignments called HSPs
• Exogean core: DACM expert annotation
  - DACM1: Single Molecule Clustering, then reduction
  - DACM2: Single Type Multi Molecule Clustering, then reduction
  - DACM3: Multi Type Multi Molecule Clustering, then reduction
• CDS Identification
• Filter
• Output: final gene models with multiple transcripts
Example: several mRNAs and proteins have been aligned to a specific locus
rm1: h1 h2 h3 h4
rm2: h5 h6
rm3: h7 h8 h9
pm1: h10 h11 h12 h13
pm2: h14 h15 h16
pm3: h17 h18
• rmi = mRNA molecule
• pmj = protein molecule
• hk = mRNA or protein HSP
Building and reducing DACM1 = the Single Molecule Clustering
DACM expert annotation
mRNA, protein HSPs → DACM1 building + reduction → Level1 transcript models → DACM2 building + reduction → Level2 transcript models → DACM3 building + reduction → Level3 transcript models
Each DACM reduction produces more complex transcript models
DACM3 reduction produces final transcript models
[Figure: final transcript models M1 and M2]
Mi = final multi type multi molecule transcript model in which Exogean searches for a CDS
Evaluation method
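[Figure: CDS predicted by Method_X compared with HAVANA CDS; in this example TP = 2, FP = 2, FN = 3]
$$\mathrm{Sp} = \frac{TP}{TP + FP} = \frac{2}{4} = 50\%$$
$$\mathrm{Sn} = \frac{TP}{TP + FN} = \frac{2}{5} = 40\%$$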
Specificity on the 44 ENCODE regions
[Bar chart. Y axis: Specificity, 0 to 100. Methods: Exogean, Aceview, Augustus, Ensembl, Paraigon+n-scan, Exonhunter. Series: exon overlap, exon exact, CDS overlap, CDS exact.]
Sensitivity on the 44 ENCODE regions
[Bar chart. Y axis: Sensitivity, 0 to 100. Methods: Aceview, Exogean, Augustus, Ensembl, Exonhunter, Paraigon+n-scan. Series: exon overlap, exon exact, CDS overlap, CDS exact.]
Evaluation method
TP = True Positive: each HAVANA CDS matched exactly by at least one method_X CDS is counted as TP
FP = False Positive: each method_X CDS that does not exactly match any HAVANA CDS is treated as a virtual HAVANA CDS and counted as FP
FN = False Negative: each HAVANA CDS that is not matched exactly by at least one method_X CDS is counted as FN
$$\mathrm{Sp} = \frac{TP}{TP + FP} \qquad \mathrm{Sn} = \frac{TP}{TP + FN}$$
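The definitions above translate directly into set operations. The sketch below is an illustration, not the EGASP evaluation code: the function name `evaluate` and the CDS labels in the usage note are hypothetical, and each CDS is assumed to be represented by any hashable value for which equality means an exact match.

```python
def evaluate(havana_cds, method_cds):
    """Count TP, FP, FN and derive Sp and Sn from two collections of CDS.

    Assumption: two CDS compare equal exactly when they match exactly.
    """
    havana, predicted = set(havana_cds), set(method_cds)
    tp = len(havana & predicted)   # HAVANA CDS matched exactly by the method
    fp = len(predicted - havana)   # method CDS matching no HAVANA CDS
    fn = len(havana - predicted)   # HAVANA CDS the method did not reproduce
    sp = tp / (tp + fp) if tp + fp else 0.0
    sn = tp / (tp + fn) if tp + fn else 0.0
    return {"TP": tp, "FP": fp, "FN": fn, "Sp": sp, "Sn": sn}
```

With made-up labels chosen to give TP = 2, FP = 2 and FN = 3, `evaluate({"A", "B", "C", "D", "E"}, {"A", "B", "X", "Y"})` yields Sp = 0.5 and Sn = 0.4, that is 50% and 40%.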
DACM1 reduction produces level1 transcript models
r1: h1 h2 h3 h4
r2: h5 h6
r3: h7 h8 h9
p1: h10 h11
p2: h12 h13
p3: h14 h15 h16
p4: h17 h18
• ri = mRNA level1 transcript model
• pj = protein level1 transcript model
Building and reducing DACM2 = the Single Type Multi Molecule Clustering
DACM2 reduction produces level2 transcript models
R1 = (r1, r3)
R2 = (r2, r3)
P1 = (p1, p3)
P2 = (p2)
P3 = (p4)
• Ri = mRNA level2 transcript model
• Pj = protein level2 transcript model
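One reading consistent with this example is that a reduction enumerates the maximal paths of the graph: r3 appears in both R1 and R2, as it would if the DACM held the edges r1→r3 and r2→r3. The sketch below follows that reading only. Both the maximal-path rule and the edges passed in the usage note are assumptions inferred from the groupings shown here, not something stated by the slides, and the function name `maximal_paths` is hypothetical.

```python
def maximal_paths(nodes, edges):
    """Enumerate paths running from a node with no incoming edge
    to a node with no outgoing edge in a directed acyclic graph.

    nodes: ordered list of node names
    edges: list of (src, dst) pairs; colours are ignored in this sketch
    """
    successors = {n: [] for n in nodes}
    has_predecessor = set()
    for src, dst in edges:
        successors[src].append(dst)
        has_predecessor.add(dst)

    paths = []

    def walk(path):
        following = successors[path[-1]]
        if not following:          # sink reached: the path is maximal
            paths.append(tuple(path))
        for nxt in following:
            walk(path + [nxt])

    for n in nodes:
        if n not in has_predecessor:   # start only from source nodes
            walk([n])
    return paths
```

Under these assumptions, `maximal_paths(["r1", "r2", "r3"], [("r1", "r3"), ("r2", "r3")])` gives the two mRNA groupings (r1, r3) and (r2, r3), and `maximal_paths(["p1", "p2", "p3", "p4"], [("p1", "p3")])` gives (p1, p3), (p2) and (p4).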
Building and reducing DACM3 = the Multi Type Multi Molecule Clustering