17
Scikit-ribo - Accurate A-site prediction and robust modeling of translational control October 29, 2015 Genome Informatics Han Fang

Scikit-ribo - Accurate A-site prediction and robust modeling of

Embed Size (px)

Citation preview

Page 1: Scikit-ribo - Accurate A-site prediction and robust modeling of

Scikit-ribo - Accurate A-site prediction and robust modeling of translational control

October 29, 2015 Genome Informatics

Han Fang

Page 2: Scikit-ribo - Accurate A-site prediction and robust modeling of

Acknowledgments

Gholson Lyon Michael Schatz

Lyon Lab Max Doerfel Yiyang Wu Jonathan Crain Jason O’Rawe

Schatz Lab Fritz Sedlazeck Tyler Garvin Hayan Lee James Gurtowski Maria Nattestad Srividya Ramakrishnan

Cold Spring Harbor Laboratory: Yifei Huang Noah Dukler Melissa Kramer

Eric Antoniou Elena Ghiban Stephanie Muller

Stony Brook University:

Rob Patro

Page 3: Scikit-ribo - Accurate A-site prediction and robust modeling of

Central dogma of biology – Classic view

DNA RNA ProteinTranscription Translation

RNA Polymerase Ribosome

Replication

Page 4: Scikit-ribo - Accurate A-site prediction and robust modeling of

What is ribosome profiling (Riboseq)?

Less efficient translation Normal translation efficieny (TE) More efficient translation

log2(TE) < 0 log2(TE) = 0 log2(TE) > 0

T E = RP F rpkmRNA rpkm

Less efficient translation Normal translation efficieny (TE) More efficient translation

log2(TE) < 0 log2(TE) = 0 log2(TE) > 0

T E = RP F rpkmRNA rpkm

Less efficient translation Normal translation efficieny (TE) More efficient translation

log2(TE) < 0 log2(TE) = 0 log2(TE) > 0

T E = RP F rpkmRNA rpkm

Ingolia. Science. (2009) Ingolia. Nat Rev Genet. (2014)

Page 5: Scikit-ribo - Accurate A-site prediction and robust modeling of

Calculate translational efficiency (TE)

Less efficient translation Normal translation efficieny (TE) More efficient translation

log2(TE) < 0 log2(TE) = 0 log2(TE) > 0

T E = RP F rpkmRNA rpkm

Hypothesis: TE distribution could be skewed by ribosome pausing events.

TE =

Riboseq rpkm

RNAseq rpkm

log2(TE) < 0

log2(TE) = 0

log2(TE) > 0

1

TE =

Riboseq rpkm

RNAseq rpkm

log2(TE) < 0

log2(TE) = 0

log2(TE) > 0

1

TE =

Riboseq rpkm

RNAseq rpkm

log2(TE) < 0

log2(TE) = 0

log2(TE) > 0

1

TE =

Riboseq rpkm

RNAseq rpkm

log2(TE) < 0

log2(TE) = 0

log2(TE) > 0

1

Page 6: Scikit-ribo - Accurate A-site prediction and robust modeling of

With ribosome pausing sites

Simulated S. cerevisiae data - TE distribution are negatively-skewed by ribosome pausing events

Null distribution

Page 7: Scikit-ribo - Accurate A-site prediction and robust modeling of

Analytical Challenges

Actively translated codons Where does the A-site locate on Riboseq reads?

Assay specific characteristics/ biases

(e.g. ribosome pausing) How is Riboseq different from RNAseq?

Understand translational control How to accurately infer translation efficiency?

Page 8: Scikit-ribo - Accurate A-site prediction and robust modeling of

Align reads (eg. w/ STAR, HISAT)

Filter rRNA sequences(eg. w/ bowtie)

Identify & trim adapters (eg. w/ cutadapt)

Transcript quantification(eg. w/ cufflinks, salmon)

Joint analysisw/ scikit-ribo

Riboseq RNAseq

Ribosome A-site position predicition

A-site codonlocalization

Ribosome pausingsite calling

Translation efficiencyinference

Differential translation efficiency testing

Introducing scikit-ribo

Page 9: Scikit-ribo - Accurate A-site prediction and robust modeling of

What and where is the ribosome A-site?

UUAUUUCGGGGAAUGGUGCUGAGGAUAA

5’ 3’

E P A

UUAUUUCGGGGAAUGGUGCUGAGGAUAA

mRNA

Ribosome

M V L R I R L T

UUAUUUCGGGGAAUGGUGCUGAGGAUAA

UUAUUUCGGGGAAUGGUGCUGAGGAUAA

UUAUUUCGGGGAAUGGUGCUGAGGAUAA

5’ 3’

NNN . . . NNN AUG NNN . . . NNN N

60S

40S

0 12 15 28

P A

28nt

Figure adapted from Ingolia et al. Science (2009)

Page 10: Scikit-ribo - Accurate A-site prediction and robust modeling of

How to predict A-site?

Read lengthOffset

3*x

Training data and features: Classifier and model tuning:

•  SVM with RBF kernel (scikit-learn)

•  10 fold cross-validation for grid search

•  Make predictions on all reads genome-wide

A-site

Page 11: Scikit-ribo - Accurate A-site prediction and robust modeling of

Prediction performance by cross validation

Scikit-ribo has much higher accuracy of identifying A-site than the previous method (0.86 vs. 0.64, 10-fold CV).

Page 12: Scikit-ribo - Accurate A-site prediction and robust modeling of

Scikit-ribo accurately predicted codon usage fraction and codon normalized TE

●●

●●

●●

●●

● ●

●●

●●●

●● ●

● ●

● ●

0.00

0.25

0.50

0.75

1.00

0.00 0.25 0.50 0.75 1.00Published codon usage fraction

Rib

oseq

cod

on u

sage

frac

tion

●●

● ●

●●

● ●

●●●

●●

● ●

●●

0.00

0.25

0.50

0.75

1.00

0.00 0.25 0.50 0.75 1.00Published normalized TE

Rib

oseq

nor

mal

ized

TE

r = 0.9 r = 0.8

Evolutionary conservation of codon optimality reveals hidden signatures of cotranslational folding, Pechmann, Frydman (2013).

Page 13: Scikit-ribo - Accurate A-site prediction and robust modeling of

Finding ribosome pausing sites (peaks) is hard. But it is easier after knowing the A-site location.

Q: how to robustly identify ribosome pausing sites while accounting for over-dispersion?

RNAseq Riboseq

Page 14: Scikit-ribo - Accurate A-site prediction and robust modeling of

Yes

SIMPLE BUSINESS PROCESS

Given a gene

Likelihood ratio test

Bimodal distribution, i.e.

potential pausing sites

Unimodal distribution, i.e.

no pausing

Bonferroni adjusted p-value

< 0.05

Posterior probability >

cutoff

Not a pausing site

No

Peak/Pausing sitesYes

No

# genes # genes (rpkm > 100) # genes with pausing

# ribosome pausing sites identified

6664 1252 94 180

Yifei Huang P (Xi|⇡i, µi, ki, ri) =

Y

j

⇡iNB(Xij |µi, ri) + (1� ⇡i)NB(Xij |kiµi, ri),

for gene i at position j, where k � 5

H0 : ⇡ = 1

H1 : ⇡ 6= 1

1

P (Xi|⇡i, µi, ki, ri) =Y

j

⇡iNB(Xij |µi, ri) + (1� ⇡i)NB(Xij |kiµi, ri),

for gene i at position j, where k � 5

H0 : ⇡ = 1

H1 : ⇡ 6= 1

1

Ribosome pausing site identification by negative binomial mixture model

Page 15: Scikit-ribo - Accurate A-site prediction and robust modeling of

−4

0

4

0 100 200 300 400position

PAR

S sc

ores

−4 0 4PARS_score

0

500

1000

1500

2000

0 100 200 300 400position

ribos

ome

coun

ts

ribosome_pausing

FALSE

TRUE

mRNA with stronger secondary structure tend to have ribosome pausing events

●●

●●

●●

0.0

0.5

1.0

1.5

No Yes

Seco

ndar

y st

ruct

ure

scor

e

factor(Pausing)NoYes

p-value< 2*10-16

Kertesz et al. Nature (2010)

Fisher exact test p-value = 0.001

Page 16: Scikit-ribo - Accurate A-site prediction and robust modeling of

TE distributions are negatively-skewed in many studies. Over-structured mRNA show inflated TE.

Improved ribosome-footprint and mRNA measurements provide insights into dynamics and regulation of yeast translation. Weinberg, Shah et al. (2015)

0

10

20

30

−6 −4 −2 0 2log2(TE)

count

Chi Square test p-value < 2*10-16

Page 17: Scikit-ribo - Accurate A-site prediction and robust modeling of

Summary

Discussed:

1)  Introduce scikit-ribo for joint analysis of Riboseq & RNAseq data.

2)  Learn from data itself to determine ribosome A-site location.

3)  Reveal biases in Riboseq data due to ribosome pausing.

4)  How Riboseq biases lead to issues with estimating TE.

Ongoing work:

1)  Adjust for those biases and provide an unbiased estimate of TE.

2)  Extend the ribosome pausing calling to a HMM based method.

3)  Joint inference of translation initiation and elongation rates.

Align reads (eg. w/ STAR, HISAT)

Filter rRNA sequences(eg. w/ bowtie)

Identify & trim adapters (eg. w/ cutadapt)

Transcript quantification(eg. w/ cufflinks, salmon)

Joint analysisw/ scikit-ribo

Riboseq RNAseq

Ribosome A-site position predicition

A-site codonlocalization

Ribosome pausingsite calling

Translation efficiencyinference

Differential translation efficiency testing

https://github.com/hanfang/