JIGSAW: a better way to combine predictions

J. E. Allen, W. H. Majoros, M. Pertea and S. L. Salzberg. JIGSAW, GeneZilla, and GlimmerHMM: puzzling out the features of human genes in the ENCODE regions. Genome Biology 7(Suppl 1): S9, 2006.

J. E. Allen and S. L. Salzberg. JIGSAW: integration of multiple sources of evidence for gene prediction. Bioinformatics 21(18): 3596-3603, 2005.

J. E. Allen, M. Pertea and S. L. Salzberg. Computational gene prediction using multiple sources of evidence. Genome Research 14(1), 2004.

Collecting gene structure evidence for JIGSAW

Figure 1. Evidence from the UCSC genome browser used as input to JIGSAW. Evidence includes: computational gene finders, alignments from gene expression evidence and evidence of cross-species sequence conservation.

Representing gene structure evidence in JIGSAW

• Each evidence source can predict up to six gene features:
– Start codon
– Stop codon
– Intron
– Protein coding nucleotides
– Donor site
– Acceptor site

Figure 3. Four evidence sources mapped to sequence S: gene prediction (GP1) with no confidence score, gene prediction with confidence score 0.65 (GP2), cDNA aligned with 86% identity and an EST aligned with 95% identity. Examples of the different feature vector types are shown: start codon (sta), stop codon (stp), donor site (don), acceptor site (acc), intron (inr) and amino acid codon (cod). Each element in the feature vector is an evidence source’s prediction for that feature type. The possible exon boundaries are k0, k1, …, k6.
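The feature vectors described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not JIGSAW's actual data structure: the `Evidence` class, the `feature_vector` helper, the coordinates, and the encoding (1.0 for a prediction without a confidence score, 0.0 for no prediction) are all assumptions made for the example. Only the scores 0.65, 86% and 95% come from the caption.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Feature type abbreviations as used in the Figure 3 caption.
FEATURE_TYPES = ("sta", "stp", "don", "acc", "inr", "cod")


@dataclass
class Evidence:
    name: str                 # label of the evidence source, e.g. "GP1"
    score: Optional[float]    # confidence or fractional identity; None if unscored
    features: Dict[str, set]  # feature type -> set of predicted positions


def feature_vector(sources: List[Evidence], ftype: str, key) -> List[float]:
    """One element per evidence source for a candidate feature `key`.

    Assumed encoding: the source's score if it predicts the feature,
    1.0 if it predicts it without a score, 0.0 if it does not predict it.
    """
    vec = []
    for src in sources:
        if key in src.features.get(ftype, set()):
            vec.append(1.0 if src.score is None else src.score)
        else:
            vec.append(0.0)
    return vec


# Hypothetical coordinates; only the scores are taken from the caption.
sources = [
    Evidence("GP1", None, {"sta": {100}, "don": {250}}),
    Evidence("GP2", 0.65, {"sta": {100}, "don": {262}}),
    Evidence("cDNA", 0.86, {"don": {250}}),
    Evidence("EST", 0.95, {"don": {250}}),
]

start_vec = feature_vector(sources, "sta", 100)
donor_vec = feature_vector(sources, "don", 250)
```

Each candidate feature gets one vector of this kind, so every vector has as many elements as there are evidence sources.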

[Figure: training sequences S1, S2, …, Sm annotated with single, initial, internal and terminal exons. Evidence rows: Gene pred. 1, Gene pred. 2, cDNA, EST alignment. Feature vectors extracted: start site, stop site, donor site, acceptor site, coding (example) and intron (example).]

Schematic of the JIGSAW training procedure. Known genes are used to evaluate the accuracy of the different combinations of evidence. Prediction accuracy for each feature type (start codon, stop codon, acceptor, donor, amino acid codon and intron) is measured separately.

Training

Figure 4a. The plot shows the accuracy of predictions based on alignments to non-human sequences that overlap a gene finder’s predictions. Each point is a pair of alignments observed in training and their percent identity to the genomic sequence. ‘+’ points are labeled ‘accurate’ and ‘x’ points are labeled ‘inaccurate.’ The two lines correspond to the non-leaf nodes in the decision tree.

Figure 4b. Decision tree used to partition the feature vector space from Figure 4a into three sub-regions. This decision tree indicates that non-human cDNA alignments with > 95% identity to the human sequence (region “V1”) are accurate protein coding predictors.
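A decision tree with two non-leaf nodes splits the identity plane into three regions, and the share of ‘accurate’ training points in each region estimates how reliable that region is. The sketch below illustrates the idea under stated assumptions: only the 95% cDNA threshold and the name “V1” come from the caption. The second threshold, the region names “V2” and “V3”, and the training points are hypothetical.

```python
from collections import defaultdict

CDNA_SPLIT = 95.0   # threshold stated in the Figure 4b caption
OTHER_SPLIT = 80.0  # hypothetical second threshold, for illustration only


def region(cdna_identity, other_identity):
    """Walk the two non-leaf nodes and return the leaf (sub-region) name."""
    if cdna_identity > CDNA_SPLIT:
        return "V1"
    if other_identity > OTHER_SPLIT:
        return "V2"  # hypothetical name
    return "V3"      # hypothetical name


def region_accuracy(points):
    """points: iterable of (cdna_identity, other_identity, is_accurate).

    Returns the fraction of accurate training points in each region.
    """
    counts = defaultdict(lambda: [0, 0])  # region -> [accurate, total]
    for cdna, other, ok in points:
        leaf = region(cdna, other)
        counts[leaf][0] += int(ok)
        counts[leaf][1] += 1
    return {leaf: acc / total for leaf, (acc, total) in counts.items()}


# Made-up training points: (cDNA % identity, other alignment % identity, label).
training = [
    (98.0, 70.0, True), (97.0, 90.0, True), (96.0, 60.0, True),
    (90.0, 85.0, True), (88.0, 92.0, False),
    (70.0, 60.0, False), (85.0, 75.0, False),
]
accuracy = region_accuracy(training)
```

In this toy data every point in “V1” is accurate, which mirrors the caption's statement about high-identity cDNA alignments.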

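Interval: $t_i = (b_i, e_i, q_i)$ assigns state $q_i$ to the subsequence from $b_i$ to $e_i$.

[Diagram: sequence S divided into consecutive intervals $t_0$, $t_1$, $t_2$ with states $q_0$, $q_1$, $q_2$; $b_0$ and $e_0$ mark the start and end of $t_0$.]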

JIGSAW dynamic programming

• Dynamic programming algorithm:
– At the end of each interval (e0, for example), store the score of the best parse ending at that location
– Modification: store scores for every parse “type” ending at e0

• Types are start, stop, coding, intron, donor, acceptor
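The per-type bookkeeping can be sketched as a small dynamic program over candidate intervals. This is a simplified illustration and not JIGSAW's implementation: it uses only two of the six types, additive scores, a hypothetical transition table `allowed`, and invented intervals. The key point is that the table is indexed by (end position, type) rather than by end position alone.

```python
def best_parses(intervals, allowed, start=0):
    """Best-scoring chain of adjacent intervals for every (end, type) pair.

    intervals: list of (b, e, q, score) with b < e, covering [b, e).
    allowed:   dict mapping a type q to the types that may precede it.
    Returns a dict: (e, q) -> (total score, list of (b, e, q) in the chain).
    """
    best = {}
    # Sorting by end position guarantees that every chain ending at b
    # is final before any interval beginning at b is considered.
    for b, e, q, score in sorted(intervals, key=lambda t: (t[1], t[0])):
        options = []
        if b == start:
            options.append((score, [(b, e, q)]))
        for prev_q in allowed.get(q, ()):
            if (b, prev_q) in best:
                prev_score, chain = best[(b, prev_q)]
                options.append((prev_score + score, chain + [(b, e, q)]))
        if options:
            top = max(options, key=lambda o: o[0])
            if (e, q) not in best or top[0] > best[(e, q)][0]:
                best[(e, q)] = top
    return best


# Invented candidate intervals and scores, for illustration only.
intervals = [
    (0, 10, "coding", 3.0),
    (0, 10, "intron", 0.5),
    (10, 20, "intron", 2.0),
    (10, 20, "coding", 1.0),
    (20, 30, "coding", 2.5),
]
# Hypothetical transition table: coding and intron must alternate.
allowed = {"coding": {"intron"}, "intron": {"coding"}}

table = best_parses(intervals, allowed)
final_score, final_chain = table[(30, "coding")]
```

Keeping one entry per type matters because the next interval may only follow certain types: a lower-scoring parse ending at e0 in one type can still be the only legal prefix for the best overall parse.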

JIGSAW GHMM gene model

Evidence types for JIGSAW experiments on human DNA

• cDNA from human genes

• UniGene transcripts

• GenBank cDNAs matching SwissProt proteins with at least 98% identity

• RefSeq genes from non-human species

• TIGR Gene Index (human and other)

• Ab initio gene finders

– Genscan, GeneID, GeneZilla, GlimmerHMM

– NOTE: JIGSAW allows you to use the same gene finder as multiple “lines” of evidence - e.g., GlimmerHMM with different parameter settings

• Alignment-based gene finders

– Twinscan

– SGP

• Predicted conserved elements from phylogenetic analysis (PhastCons)

[Bar chart: sensitivity, specificity and F-score (percent, 0-100) for the evidence combinations Gene finders, non-human EST, human mRNA, curated, all, and KnownGene (no JIGSAW).]

Effects of different evidence sources

Figure 6. JIGSAW prediction performance using different combinations of evidence. Gene finders = ab initio gene finders only; non-human EST = gene finders + non-human expression evidence; human mRNA = gene finders + human mRNA; curated cDNA = gene finders + KnownGene; all = all evidence. KnownGene = cDNA evidence from curated proteins (from UCSC) without using JIGSAW.

[Bar chart: sensitivity, specificity and F-score (percent, 0-100) for JIGSAW, Pairagon, Augustus, Fgenesh++, ExoGean, ExonHunter and Ensembl.]

Comparison of JIGSAW and other methods on human ENCODE regions

Sensitivity (Sn) = % of exons correctly predicted

Specificity (Sp) = % of exon predictions that are correct

F-score = (2 x Sn x Sp) / (Sn + Sp)
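The three measures can be computed from sets of predicted and annotated exons, as in the sketch below. It assumes an exon prediction counts as correct only when both of its boundaries match an annotated exon exactly. The function name `exon_accuracy` and the coordinates are hypothetical.

```python
def exon_accuracy(predicted, annotated):
    """Exon-level sensitivity, specificity and F-score, all in percent.

    Exons are (start, end) pairs; a prediction is counted as correct
    only if the identical pair appears in the annotation (assumption).
    """
    predicted, annotated = set(predicted), set(annotated)
    correct = len(predicted & annotated)
    sn = 100.0 * correct / len(annotated) if annotated else 0.0
    sp = 100.0 * correct / len(predicted) if predicted else 0.0
    f = 2 * sn * sp / (sn + sp) if sn + sp else 0.0
    return sn, sp, f


# Invented coordinates: 4 annotated exons, 5 predictions, 3 exact matches.
annotated = [(100, 200), (300, 400), (500, 600), (700, 800)]
predicted = [(100, 200), (300, 400), (500, 600), (710, 800), (900, 950)]

sn, sp, f = exon_accuracy(predicted, annotated)
```

Here Sn is 3 of 4 annotated exons and Sp is 3 of 5 predictions, so the F-score falls between the two and closer to the lower value.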

Gene Prediction Accuracy at the exon level: Sensitivity versus specificity. Top panel: dotplot for sensitivity versus specificity at the exon level for CDS evaluation. Each dot represents the overall value for each program on the 31 test sequences. Fig. 6 from Guigo et al., Genome Biology 2006, 7(Suppl 1):S2

Gene Prediction Accuracy at the exon level: Sensitivity versus specificity. Bottom panel: boxplots of the average sensitivity and specificity for each program. Each dot corresponds to the average in each of the test sequences for which GENCODE annotation existed. Fig. 6 from Guigo et al., Genome Biology 2006, 7(Suppl 1):S2.

EGASP results:

Gene level accuracy

JIGSAW on other species

http://cbcb.umd.edu/software/jigsaw/
