Upload
nguyennhu
View
218
Download
0
Embed Size (px)
Citation preview
Scikit-ribo - Accurate A-site prediction and robust modeling of translational control
October 29, 2015 Genome Informatics
Han Fang
Acknowledgments
Gholson Lyon Michael Schatz
Lyon Lab Max Doerfel Yiyang Wu Jonathan Crain Jason O’Rawe
Schatz Lab Fritz Sedlazeck Tyler Garvin Hayan Lee James Gurtowski Maria Nattestad Srividya Ramakrishnan
Cold Spring Harbor Laboratory: Yifei Huang Noah Dukler Melissa Kramer
Eric Antoniou Elena Ghiban Stephanie Muller
Stony Brook University:
Rob Patro
Central dogma of biology – Classic view
DNA RNA ProteinTranscription Translation
RNA Polymerase Ribosome
Replication
What is ribosome profiling (Riboseq)?
Less efficient translation Normal translation efficieny (TE) More efficient translation
log2(TE) < 0 log2(TE) = 0 log2(TE) > 0
T E = RP F rpkmRNA rpkm
Less efficient translation Normal translation efficieny (TE) More efficient translation
log2(TE) < 0 log2(TE) = 0 log2(TE) > 0
T E = RP F rpkmRNA rpkm
Less efficient translation Normal translation efficieny (TE) More efficient translation
log2(TE) < 0 log2(TE) = 0 log2(TE) > 0
T E = RP F rpkmRNA rpkm
Ingolia. Science. (2009) Ingolia. Nat Rev Genet. (2014)
Calculate translational efficiency (TE)
Less efficient translation Normal translation efficieny (TE) More efficient translation
log2(TE) < 0 log2(TE) = 0 log2(TE) > 0
T E = RP F rpkmRNA rpkm
Hypothesis: TE distribution could be skewed by ribosome pausing events.
TE =
Riboseq rpkm
RNAseq rpkm
log2(TE) < 0
log2(TE) = 0
log2(TE) > 0
1
TE =
Riboseq rpkm
RNAseq rpkm
log2(TE) < 0
log2(TE) = 0
log2(TE) > 0
1
TE =
Riboseq rpkm
RNAseq rpkm
log2(TE) < 0
log2(TE) = 0
log2(TE) > 0
1
TE =
Riboseq rpkm
RNAseq rpkm
log2(TE) < 0
log2(TE) = 0
log2(TE) > 0
1
With ribosome pausing sites
Simulated S. cerevisiae data - TE distribution are negatively-skewed by ribosome pausing events
Null distribution
Analytical Challenges
Actively translated codons Where does the A-site locate on Riboseq reads?
Assay specific characteristics/ biases
(e.g. ribosome pausing) How is Riboseq different from RNAseq?
Understand translational control How to accurately infer translation efficiency?
Align reads (eg. w/ STAR, HISAT)
Filter rRNA sequences(eg. w/ bowtie)
Identify & trim adapters (eg. w/ cutadapt)
Transcript quantification(eg. w/ cufflinks, salmon)
Joint analysisw/ scikit-ribo
Riboseq RNAseq
Ribosome A-site position predicition
A-site codonlocalization
Ribosome pausingsite calling
Translation efficiencyinference
Differential translation efficiency testing
Introducing scikit-ribo
What and where is the ribosome A-site?
UUAUUUCGGGGAAUGGUGCUGAGGAUAA
5’ 3’
E P A
UUAUUUCGGGGAAUGGUGCUGAGGAUAA
mRNA
Ribosome
M V L R I R L T
UUAUUUCGGGGAAUGGUGCUGAGGAUAA
UUAUUUCGGGGAAUGGUGCUGAGGAUAA
UUAUUUCGGGGAAUGGUGCUGAGGAUAA
5’ 3’
NNN . . . NNN AUG NNN . . . NNN N
60S
40S
0 12 15 28
P A
28nt
Figure adapted from Ingolia et al. Science (2009)
How to predict A-site?
Read lengthOffset
3*x
Training data and features: Classifier and model tuning:
• SVM with RBF kernel (scikit-learn)
• 10 fold cross-validation for grid search
• Make predictions on all reads genome-wide
A-site
Prediction performance by cross validation
Scikit-ribo has much higher accuracy of identifying A-site than the previous method (0.86 vs. 0.64, 10-fold CV).
Scikit-ribo accurately predicted codon usage fraction and codon normalized TE
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
● ●
●●
●
●
●
●●●
●
●
●● ●
●
● ●
●
●
● ●
0.00
0.25
0.50
0.75
1.00
0.00 0.25 0.50 0.75 1.00Published codon usage fraction
Rib
oseq
cod
on u
sage
frac
tion
●
●●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
● ●
●
●
●
●
●
●
●●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●●
●
●
0.00
0.25
0.50
0.75
1.00
0.00 0.25 0.50 0.75 1.00Published normalized TE
Rib
oseq
nor
mal
ized
TE
r = 0.9 r = 0.8
Evolutionary conservation of codon optimality reveals hidden signatures of cotranslational folding, Pechmann, Frydman (2013).
Finding ribosome pausing sites (peaks) is hard. But it is easier after knowing the A-site location.
Q: how to robustly identify ribosome pausing sites while accounting for over-dispersion?
RNAseq Riboseq
Yes
SIMPLE BUSINESS PROCESS
Given a gene
Likelihood ratio test
Bimodal distribution, i.e.
potential pausing sites
Unimodal distribution, i.e.
no pausing
Bonferroni adjusted p-value
< 0.05
Posterior probability >
cutoff
Not a pausing site
No
Peak/Pausing sitesYes
No
# genes # genes (rpkm > 100) # genes with pausing
# ribosome pausing sites identified
6664 1252 94 180
Yifei Huang P (Xi|⇡i, µi, ki, ri) =
Y
j
⇡iNB(Xij |µi, ri) + (1� ⇡i)NB(Xij |kiµi, ri),
for gene i at position j, where k � 5
H0 : ⇡ = 1
H1 : ⇡ 6= 1
1
P (Xi|⇡i, µi, ki, ri) =Y
j
⇡iNB(Xij |µi, ri) + (1� ⇡i)NB(Xij |kiµi, ri),
for gene i at position j, where k � 5
H0 : ⇡ = 1
H1 : ⇡ 6= 1
1
Ribosome pausing site identification by negative binomial mixture model
−4
0
4
0 100 200 300 400position
PAR
S sc
ores
−4 0 4PARS_score
0
500
1000
1500
2000
0 100 200 300 400position
ribos
ome
coun
ts
ribosome_pausing
FALSE
TRUE
mRNA with stronger secondary structure tend to have ribosome pausing events
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
0.0
0.5
1.0
1.5
No Yes
Seco
ndar
y st
ruct
ure
scor
e
factor(Pausing)NoYes
p-value< 2*10-16
Kertesz et al. Nature (2010)
Fisher exact test p-value = 0.001
TE distributions are negatively-skewed in many studies. Over-structured mRNA show inflated TE.
Improved ribosome-footprint and mRNA measurements provide insights into dynamics and regulation of yeast translation. Weinberg, Shah et al. (2015)
0
10
20
30
−6 −4 −2 0 2log2(TE)
count
Chi Square test p-value < 2*10-16
Summary
Discussed:
1) Introduce scikit-ribo for joint analysis of Riboseq & RNAseq data.
2) Learn from data itself to determine ribosome A-site location.
3) Reveal biases in Riboseq data due to ribosome pausing.
4) How Riboseq biases lead to issues with estimating TE.
Ongoing work:
1) Adjust for those biases and provide an unbiased estimate of TE.
2) Extend the ribosome pausing calling to a HMM based method.
3) Joint inference of translation initiation and elongation rates.
Align reads (eg. w/ STAR, HISAT)
Filter rRNA sequences(eg. w/ bowtie)
Identify & trim adapters (eg. w/ cutadapt)
Transcript quantification(eg. w/ cufflinks, salmon)
Joint analysisw/ scikit-ribo
Riboseq RNAseq
Ribosome A-site position predicition
A-site codonlocalization
Ribosome pausingsite calling
Translation efficiencyinference
Differential translation efficiency testing
https://github.com/hanfang/