MotifBooster – A Boosting Approach for Constructing TF-DNA Binding Classifiers
Pengyu Hong
10/06/2005
Motivation
Understand transcriptional regulation.
[Diagram: TF binds upstream of Gene X → mRNA transcript]
Model transcriptional regulatory networks.
[Diagram: regulators → binding sites → genes]
Motivation
Previous works on motif finding:
• AlignACE (Hughes et al 2000)
• ANN-Spec (Workman et al 2000)
• BioProspector (Liu et al 2001)
• Consensus (Hertz et al 1999)
• Gibbs Motif Sampler (Lawrence et al 1993)
• LogicMotif (Keles et al 2004)
• MDScan (Liu et al 2002)
• MEME (Bailey and Elkan 1995)
• Motif Regressor (Conlon et al 2003)
• …
Motivation
A widely used model – Motif Weight Matrix (Stormo et al 1982):

Pos:   1     2     3     4     5     6     7     8
A    0.19  1.11 -0.17  1.65 -2.65 -2.66 -1.98  0.92
C   -0.14 -0.49  1.89 -1.81  1.70  2.32  2.14 -2.07
G   -1.39  0.25 -1.22 -1.07 -2.07 -2.07 -2.07  1.13
T    0.86 -1.39 -2.65 -2.65  0.41 -2.65 -1.16 -1.80

Site: • • • A A C A T C C G • • •
Score of the site = 0.19 + 1.11 + 1.89 + 1.65 + 0.41 + 2.32 + 2.14 + 1.13 = 10.84
A sequence is a target if it contains a binding site (score > threshold).
Computational (cost) << Molecular (cost)
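The weight-matrix scan above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the dictionary layout, function names, and the sliding-window scan are assumptions.

```python
# Weight matrix from the slide: one score per base per position.
WEIGHT_MATRIX = {
    "A": [0.19, 1.11, -0.17, 1.65, -2.65, -2.66, -1.98, 0.92],
    "C": [-0.14, -0.49, 1.89, -1.81, 1.70, 2.32, 2.14, -2.07],
    "G": [-1.39, 0.25, -1.22, -1.07, -2.07, -2.07, -2.07, 1.13],
    "T": [0.86, -1.39, -2.65, -2.65, 0.41, -2.65, -1.16, -1.80],
}

def score_site(site):
    """Sum the matrix entry for the observed base at each position."""
    return sum(WEIGHT_MATRIX[base][pos] for pos, base in enumerate(site))

def is_target(sequence, threshold, width=8):
    """A sequence is a target if any window scores above the threshold."""
    return any(score_site(sequence[k:k + width]) > threshold
               for k in range(len(sequence) - width + 1))

print(round(score_site("AACATCCG"), 2))  # 10.84, as on the slide
```

Scoring the slide's example site AACATCCG reproduces the 10.84 total shown above.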
Motivation
Non-linear binding effects, e.g., different binding modes.
Preferred binding:
  Mode 1: • • • CACACCCCCCAATACAT • • •
  Mode 2: • • • CACATTCCCCGGTACAT • • •
Non-preferred binding:
  Mode 3: • • • CACACCCCCCGGTACAT • • •
  Mode 4: • • • CACATTCCCCAATACAT • • •
Consensus: • • • CACA [C/T][C/T] CCCC [A/G][A/G] TACAT • • •
Modeling
Model a TF-DNA binding classifier as an ensemble model:
  Q(S_i) = Σ_m α_m q_m(S_i)
(Q: ensemble model; q_m: base classifier; α_m: base classifier weight)
  Label(S_i) = +1 if Q(S_i) > 0, −1 if Q(S_i) ≤ 0
Modeling
The m-th base classifier:
  q_m(S_i) = tanh(h_m(S_i))
Sequence scoring function:
  h_m(S_i) = log( Σ_{s_ik : f_m(s_ik) > 0} exp(f_m(s_ik)) )
f_m(s_ik) is a site scoring function (weight matrix + threshold).
The scoring function considers (a) the number of matching sites and (b) the degree of matching.
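The base classifier above can be sketched directly from the formulas: a log-sum-exp over the sites that pass the threshold, squashed by tanh. This is an illustrative sketch; the matrix, threshold, and helper names are assumptions, not the paper's fitted parameters.

```python
import math

def site_score(site, matrix, threshold):
    """f_m: weight-matrix score of one site minus a threshold."""
    return sum(matrix[b][p] for p, b in enumerate(site)) - threshold

def sequence_score(seq, matrix, threshold, width):
    """h_m: log-sum-exp over the sites that match (f_m > 0)."""
    scores = [site_score(seq[k:k + width], matrix, threshold)
              for k in range(len(seq) - width + 1)]
    positives = [f for f in scores if f > 0]
    if not positives:                 # no matching site at all
        return float("-inf")
    return math.log(sum(math.exp(f) for f in positives))

def base_classifier(seq, matrix, threshold, width):
    """q_m: squash the sequence score into (-1, 1) with tanh."""
    h = sequence_score(seq, matrix, threshold, width)
    return -1.0 if h == float("-inf") else math.tanh(h)
```

More matching sites, or stronger matches, push h_m (and hence q_m) up, which is exactly points (a) and (b) on the slide.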
Training – Boosting
Modify the confidence-rated boosting (CRB) algorithm (Schapire et al. 1999) to train ensemble models Q(S_i) = Σ_m α_m q_m(S_i):
(a) Decide the number of base classifiers.
(b) Learn the parameters of each base classifier and its weight.
Why Boosting?
Boosting is a Newton-like technique that iteratively adds base classifiers to minimize the upper bound on the training error.
Training error → margin of training samples → generalization error (Schapire et al. 1998)
Challenges
• Positive sequences – targets of a TF
• Negative sequences
1. Sequences are labeled, but not the sites in the sequences.
2. The classes cannot be well separated by the weight matrix model (linear).
3. Number of negative sequences >> number of positive sequences.
Boosting
Initialization
• Positive • Negative
Total weight of the positive samples == total weight of the negative samples.
Since the motif must be an enriched pattern in the positive sequences, use Motif Regressor to find a seed motif matrix W0.
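The initialization rule above (each class gets half of the total weight, uniform within a class) can be sketched as follows; the function and variable names are illustrative.

```python
def init_weights(labels):
    """labels: +1 for positive sequences, -1 for negative sequences.

    Each class receives total weight 0.5, spread uniformly, so the many
    negatives do not dominate the few positives at the start.
    """
    n_pos = sum(1 for y in labels if y > 0)
    n_neg = sum(1 for y in labels if y < 0)
    return [0.5 / n_pos if y > 0 else 0.5 / n_neg for y in labels]

# 2 positives and 4 negatives: positives get 0.25 each, negatives 0.125 each
d = init_weights([+1, +1, -1, -1, -1, -1])
```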
Boosting
Train a base classifier (BC)
Use the seed matrix W0 (plus a threshold) to initialize the m-th base classifier q_m(·) and let α_m = 1.
Refine α_m and the parameters of q_m(·) to minimize
  Σ_i d_i^m exp(−y_i α_m q_m(S_i))
where y_i is the label of S_i and d_i^m is the weight of S_i in the m-th round.
Negative information is explicitly used to train q_m(·) and α_m.
• Positive • Negative
[Figure: positive/negative samples separated by base classifier BC 1]
Boosting
Adjust sample weights, giving higher weights to previously misclassified samples:
  d_i^{m+1} = d_i^m exp(−y_i α_m q_m(S_i)) / Σ_j d_j^m exp(−y_j α_m q_m(S_j))
• y_i is the label of S_i
• d_i^m is the weight of S_i in the m-th round
• d_i^{m+1} is the new weight of S_i
• Positive • Negative
[Figure: reweighted positive/negative samples with BC 1]
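The weight update above can be sketched as a one-step function: multiply each weight by exp(−y_i α_m q_m(S_i)) and renormalize, so misclassified samples (where y_i and q_m(S_i) disagree in sign) gain weight. The names below are illustrative; q_values stands in for the base-classifier outputs q_m(S_i).

```python
import math

def update_weights(d, labels, q_values, alpha):
    """One boosting reweighting step: d_i <- d_i * exp(-y_i * a * q_i) / Z."""
    unnorm = [d_i * math.exp(-y * alpha * q)
              for d_i, y, q in zip(d, labels, q_values)]
    z = sum(unnorm)              # normalizer keeps the weights summing to 1
    return [w / z for w in unnorm]

# Sample 2 is negative (y = -1) but scored positive (q = 0.8),
# so its weight grows relative to the correctly classified sample 1.
d = update_weights([0.5, 0.5], [+1, -1], [0.9, 0.8], 1.0)
```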
Boosting (iterations, illustrated)
• Add a new base classifier (BC 2); the decision boundary updates.
• Adjust sample weights again.
• Add one more base classifier (BC 3); the decision boundary updates again.
• Stop if the result is perfect or the performance on the internal validation sequences drops.
[Figures: positive/negative samples with BC 1–3 and the evolving decision boundary]
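The rounds walked through above fit into one training loop. This is a hedged sketch of a confidence-rated boosting loop with abstract base learners: fit_base and validate are stand-ins for the paper's weight-matrix base classifiers and internal validation set, and α_m is chosen by a simple grid search rather than the paper's refinement step.

```python
import math

def boost(train, labels, fit_base, rounds=10, validate=None):
    """Confidence-rated boosting sketch (illustrative, not the paper's code).

    fit_base(train, labels, d) must return q, a function mapping a sample
    to a confidence in (-1, 1); validate(Q) returns a validation score.
    """
    n_pos = sum(1 for y in labels if y > 0)
    n_neg = len(labels) - n_pos
    d = [0.5 / n_pos if y > 0 else 0.5 / n_neg for y in labels]  # init
    ensemble = []                        # list of (alpha_m, q_m)
    best_val = -1.0
    for _ in range(rounds):
        q = fit_base(train, labels, d)
        # pick alpha_m to roughly minimize sum_i d_i exp(-y_i * a * q(S_i))
        alpha = min((a / 10 for a in range(1, 51)),
                    key=lambda a: sum(d_i * math.exp(-y * a * q(x))
                                      for d_i, y, x in zip(d, labels, train)))
        ensemble.append((alpha, q))
        # reweight: misclassified samples gain weight, then renormalize
        unnorm = [d_i * math.exp(-y * alpha * q(x))
                  for d_i, y, x in zip(d, labels, train)]
        z = sum(unnorm)
        d = [w / z for w in unnorm]
        Q = lambda x: sum(a * qm(x) for a, qm in ensemble)
        if all((1 if Q(x) > 0 else -1) == y for x, y in zip(train, labels)):
            break                        # perfect on the training data
        if validate is not None:
            acc = validate(Q)
            if acc < best_val:           # validation performance drops
                ensemble.pop()
                break
            best_val = acc
    return ensemble
```

With a toy 1-D base learner such as `lambda tr, lb, d: (lambda x: math.tanh(x))`, the loop stops after one round once the training data are perfectly separated.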
Results
Data: ChIP-chip data of Saccharomyces cerevisiae (Lee et al. 2002)
• Positive sequences: p-value < 0.001; number of positive sequences ≥ 25.
• Negative sequences: p-value ≥ 0.05 & ratio ≤ 1.
Got 40 TFs.
[Bar chart, redrawn as a table – improvement per TF, in ascending order:]
FKH1  0.00%   FKH2  0.00%   RLM1  0.00%   YAP6  0.00%
ABF1  2.43%   REB1  3.70%   FHL1  5.79%   CAD1  7.97%
NRG1  9.17%   MBP1 13.17%   CIN5 14.14%   GCN4 14.24%
SMP1 14.41%   SUM1 14.48%   HAP4 14.90%   DAL81 15.02%
SKN7 16.02%   BAS1 16.41%   ACE2 18.23%   MCM1 18.74%
SWI4 19.01%   STE12 19.44%  SWI6 21.06%   CBF1 24.44%
HSF1 25.18%   YAP5 25.28%   YAP1 26.30%   SWI5 27.66%
PDR1 30.25%   PHD1 34.24%   RAP1 38.96%
Results
Boosted models vs. seed weight matrices – leave-one-out test results.
Horizontal axis: TFs. Vertical axis: improvement in specificity,
  (FP_W − FP_BW) / FP_W
where FP_W and FP_BW are the false positives of the seed weight matrix (W) and the boosted model (BW).
Results
RAP1
[Sequence logos: seed weight matrix vs. base classifiers 1–3 of the boosted model]
The boosted model captures position correlation.
Results
REB1
[Sequence logos: seed weight matrix vs. base classifiers 1–2 of the boosted model]
The boosted model captures position correlation.