Deep learning approaches to decode the human genome
Anshul Kundaje
Genetics, Computer Science
Stanford University
http://anshul.kundaje.net
TGCCAAGCAGCAAAGTTTTGCTGCTGTTTATTTTTGTAGCTCTTACTATATTCTACTTTTACCATTGAAAATATTGAGGAAGTTATTTATATTTCTATTTTTTATATATTATATATTTTATGTATTTTAATATTACTATTACACATAATTATTTTTTATATATATGAAGTACCAATGACTTCCTTTTCCAGAGCAATAATGAAATTTCACAGTATGAAAATGGAAGAAATCAATAAAATTATACGTGACCTGTGGCGAAGTACCTATCGTGGACAAGGTGAGTACCATGGTGTATCACAAATGCTCTTTCCAAAGCCCTCTCCGCAGCTCTTCCCCTTATGACCTCTCATCATGCCAGCATTACCTCCCTGGACCCCTTTCTAAGCATGTCTTTGAGATTTTCTAAGAATTCTTATCTTGGCAACATCTTGTAGCAAGAAAATGTAAAGTTTTCTGTTCCAGAGCCTAACAGGACTTACATATTTGACTGCAGTAGGCATTATATTTAGCTGATGACATAATAGGTTCTGTCATAGTGTAGATAGGGATAAGCCAAAATGCAATAAGAAAAACCATCCAGAGGAAACTCTTTTTTTTTTCTTTTTCTTTTTTTTTTTTCCAGATGGAGTCTCGCACTTCTCTGTCACCCGGGCTGGAGCGCAGTGGTGCAATCTTGGCTCACTGCAACCTCCACCTCCTGGGTTCAGGTGATTCTCCCACCTCAGCCTCCCGAGTAGTAGCTGGAATTACAGGTGCGCGCTCCCACACCTGGCTAATTTTTTGTATTCTTAGTAGAGATGGGGTTTCACCATGTTGGCCAGGCTGGTCTCAAACTCCTGCCCTCAGGTGATCTGCCCACCTTGGCCTCCCAGTGTTGGGTTTACAGGCGTGAGCCACCGCGCCTGGCCTGGAGGAAACTCTTAACAGGGAAACTAAGAAAGAGTTGAGGCTGAGGAACTGGGGCATCTGGGTTGCTTCTGGCCAGACCACCAGGCTCTTGAATCCTCCCAGCCAGAGAAAGAGTTTCCACACCAGCCATTGTTTTCCTCTGGTAATGTCAGCCTCATCTGTTGTTCCTAGGCTTACTTGATATGTTTGTAAATGACAAAAGGCTACAGAGCATAGGTTCCTCTAAAATATTCTTCTTCCTGTGTCAGATATTGAATACATAGAAATACGGTCTGATGCCGATGAAAATGTATCAGCTTCTGATAAAAGGCGGAATTATAACTACCGAGTGGTGATGCTGAAGGGAGACACAGCCTTGGATATGCGAGGACGATGCAGTGCTGGACAAAAGGCAGGTATCTCAAAAGCCTGGGGAGCCAACTCACCCAAGTAACTGAAAGAGAGAAACAAACATCAGTGCAGTGGAAGCACCCAAGGCTACACCTGAATGGTGGGAAGCTCTTTGCTGCTATATAAAATGAATCAGGCTCAGCTACTATTATT …………
2003
~ 3 billion nucleotides
The Human Genome Project
Population sequencing to identify disease-associated genetic variants
Statistically significant association?
Oxford Nanopore technology
~ 3 billion nucleotides
Function?
Decoding genome function
ACCAGTTACGACGG
TCAGGGTACTGATA
CCCCAAACCGTTGA
CCGCATTTACAGAC
GGGGTTTGGGTTTT
GCCCCACACAGGTA
CGTTAGCTACTGGT
TTAGCAATTTACCG
TTACAACGTTTACA
GGGTTACGGTTGGG
ATTTGAAAAAAAGT
TTGAGTTGGTTTTT
TCACGGTAGAACGT
ACCTTACAAA…………
One genome many cell types
http://www.roadmapepigenomics.org/
Biochemical markers of cell-type specific functional elements
Active gene / Repressed gene
Protein
https://www.broadinstitute.org/news/1504
Control elements
99% non-coding
1.5% protein coding
100s of cell types/tissues
NIH funded collaborative consortia
Machine learning, Probabilistic models,
Deep learning
Identifying tissue-specific control elements
Interpreting disease-associated genetic variation
Learning sequence code of control elements
Active control elements / Active genes / Repressed elements
• ~20,000 genes
• ~2 million novel putative control elements!
• cell-type specific activity
A comprehensive functional annotation of the human genome
2M control elements show highly modular tissue-specific activity
• ~20,000 genes
• ~2 million novel putative control elements!
• modular tissue-specific activity!
[Figure: activity (active vs. inactive) of 2M control elements across 100s of tissues]
Decoding DNA words and grammars that specify tissue-specific control elements
Regulatory proteins bind DNA words (landing pads) in control elements!
‘Motif Discovery’
Learning discriminative DNA words from tissue-specific control element sequences
Training
Input sequences (X) -> Classification function F(X) -> Output labels (Y)
Class = +1: sequences of control elements active in Tissue 1
Class = -1: sequences of control elements NOT active in Tissue 1 but active in other tissues
‘Training’ means learning the function F(X) from multiple input, output pairs (X,Y)
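To make "learning F(X) from (X,Y) pairs" concrete, here is a minimal sketch that counts 3-letter DNA words in each sequence and fits a perceptron on those counts. It is not the model used in the talk (which is a deep CNN); the word length, the toy sequences, and the function names are all assumptions made for illustration.

```python
from itertools import product

K = 3  # length of the DNA words (k-mers) used as features; an arbitrary choice
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def kmer_counts(seq):
    """Count every overlapping k-mer in seq (this is the feature vector X)."""
    counts = [0.0] * len(KMERS)
    for i in range(len(seq) - K + 1):
        counts[INDEX[seq[i:i + K]]] += 1.0
    return counts


def train_perceptron(seqs, labels, epochs=20):
    """Learn weights w and bias b so that sign(w.x + b) matches the labels."""
    w, b = [0.0] * len(KMERS), 0.0
    for _ in range(epochs):
        for seq, y in zip(seqs, labels):
            x = kmer_counts(seq)
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified: move the boundary toward y
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b


def predict(seq, w, b):
    """F(X): +1 if predicted active in the tissue, -1 otherwise."""
    score = sum(wi * xi for wi, xi in zip(w, kmer_counts(seq))) + b
    return 1 if score > 0 else -1


# Toy data (made up): positives contain the word GATA, negatives do not.
train_seqs = ["CCGATAAC", "TTGATATG", "AGATACCT",
              "CCGTTAAC", "TTGCTATG", "AGCTACCT"]
train_labels = [1, 1, 1, -1, -1, -1]
w, b = train_perceptron(train_seqs, train_labels)
```

After training, `predict(seq, w, b)` plays the role of F(X) for a new sequence. The largest learned weights sit on words shared by the positive examples, which is the simplest form of a discriminative DNA word.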
C G A T A A C C G A T A T
Learned pattern detectors
One-hot encoded input: DNA sequence represented as ones and zeros
Later layers build on patterns of previous layer
Binary Output: Active (1) vs Inactive (0)
Deep convolutional neural network (CNN) on DNA sequence inputs
ACGT
0100
0010
1000
0001
1000
1000
0100
0100
0010
1000
0001
1000
0001
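The one-hot matrix above can be reproduced with a few lines of code. This is a minimal sketch; the function name `one_hot` is an assumption, and the column order follows the ACGT header on the slide.

```python
BASES = "ACGT"  # column order, matching the ACGT header on the slide


def one_hot(seq):
    """Return one row per base, with a single 1 in that base's column."""
    return [[1 if base == b else 0 for b in BASES] for base in seq]


sequence = "CGATAACCGATAT"
rows = one_hot(sequence)
for base, row in zip(sequence, rows):
    print(base, "".join(str(v) for v in row))
```

Each printed row pairs a base with its four-digit code, so the 13-letter example sequence becomes a 13 x 4 matrix of ones and zeros.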
Is seq. active
in cell type 1?
Deeper conv. layers learn DNA word combinations (grammars)
Score sequence using filters
Convolutional layers: neurons learn DNA word pattern detectors
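Scoring a sequence with a filter amounts to sliding a small weight matrix along the one-hot input and summing the matching entries at each offset. The sketch below uses a hand-set filter for the word GATA as a stand-in; in a trained CNN the filter weights are learned, and the filter values and names here are assumptions.

```python
BASES = "ACGT"


def one_hot(seq):
    """One row per base, with a single 1.0 in that base's column."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq]


def scan(seq, filt):
    """Slide a (width x 4) filter along seq; return one score per window."""
    x = one_hot(seq)
    width = len(filt)
    scores = []
    for start in range(len(x) - width + 1):
        s = sum(filt[i][j] * x[start + i][j]
                for i in range(width) for j in range(4))
        scores.append(s)
    return scores


def relu(v):
    """Rectified linear activation applied to each filter score."""
    return max(0.0, v)


# Hypothetical filter: +1 for a base that matches GATA, -1 for a mismatch.
gata_filter = [[1.0 if b == c else -1.0 for b in BASES] for c in "GATA"]

scores = scan("CGATAACCGATAT", gata_filter)
activations = [relu(s) for s in scores]
hits = [i for i, s in enumerate(scores) if s == max(scores)]
```

Here `hits` lists the window offsets where the filter matches best. Deeper layers take such activation tracks as input, which is how combinations of words can be detected.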
Multi-task learning
Is seq. active in cell type 1?
Is seq. active in cell type 2?
Is seq. active in cell type 100?
Prediction accuracy: mean auROC = 0.82, mean auPRC = 0.65
Similar to Kelley et al. 2016 (Basset), Zhou et al. 2015 (DeepSEA)
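In a multi-task setup, one network emits one output per cell type, and each output answers its own active/inactive question. A common way to train this, assumed here rather than taken from the slides, is a sigmoid per task with binary cross-entropy averaged over tasks. The logits and labels below are made up.

```python
import math


def sigmoid(z):
    """Map a raw score (logit) to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def multitask_bce(logits, labels):
    """Mean binary cross-entropy over tasks for a single input sequence."""
    total = 0.0
    for z, y in zip(logits, labels):
        p = sigmoid(z)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(logits)


# Hypothetical outputs for three cell types, with 0/1 activity labels.
logits = [2.0, -1.0, 0.0]
labels = [1, 0, 1]
probs = [sigmoid(z) for z in logits]
loss = multitask_bce(logits, labels)
```

Because every task contributes to the same loss, the shared convolutional layers are pushed toward pattern detectors that are useful across cell types.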
Multi-task deep CNNs learn discriminative DNA word pattern detectors
Millions of input sequences of control elements
C G A T A A C C G A T A T
Is seq. active
in cell type 1?
Is seq. active
in cell type 2?
How can we identify important parts of the input sequences?
In-silico mutagenesis
• inefficient
• misleading results due to saturation/buffering
Alipanahi et al. 2015; Zhou & Troyanskaya 2015; Kelley et al. 2016
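A minimal sketch of in-silico mutagenesis shows both drawbacks listed above. The scoring function `toy_model` is a hypothetical stand-in for a trained network (a sigmoid of the number of GATA words), chosen only so that the saturation effect is easy to see.

```python
import math

BASES = "ACGT"


def toy_model(seq):
    """Hypothetical model: sigmoid of (4 * number of GATA words - 2)."""
    count = sum(1 for i in range(len(seq) - 3) if seq[i:i + 4] == "GATA")
    return 1.0 / (1.0 + math.exp(-(4.0 * count - 2.0)))


def in_silico_mutagenesis(seq, model):
    """For every position and alternative base: model(mutant) - model(seq)."""
    ref_score = model(seq)
    effects = []
    for pos, ref_base in enumerate(seq):
        row = {}
        for alt in BASES:
            if alt != ref_base:
                mutant = seq[:pos] + alt + seq[pos + 1:]
                row[alt] = model(mutant) - ref_score  # one forward pass each
        effects.append(row)
    return effects


one_site = in_silico_mutagenesis("CCGATACC", toy_model)
two_sites = in_silico_mutagenesis("CGATAACCGATAT", toy_model)
```

The procedure needs three forward passes per position, which is the inefficiency. Breaking the single GATA in the first sequence causes a large drop, while breaking one of the two copies in the second sequence changes the output far less because the other copy buffers it, so each copy looks unimportant on its own.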
C G A T A A C C G A T A T
Is seq. active
in cell type 1?
Is seq. active
in cell type 2?
Efficient “Backpropagation” based approaches
[Figure: one-hot encoded input sequence (columns A, C, G, T) and the output "Is seq. active in cell type 1?"]
Gradient based methods
• Saliency maps (Simonyan 2013)
• Deconv networks (Zeiler, Fergus 2013)
• Guided backprop (Springenberg 2014)
• Layerwise relevance propagation (Bach 2015)
• Integrated gradients (Sundararajan 2016)
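As a rough illustration of the simplest of these methods, the sketch below computes a saliency map and a gradient-times-input score for a tiny logistic model on one-hot input. The model, its weights, and all names are assumptions; a real network would obtain the gradient by automatic differentiation instead of the closed form used here.

```python
import math

BASES = "ACGT"
WORD, START = "GATA", 2  # hypothetical word and offset the toy model rewards


def one_hot(seq):
    """One row per base, with a single 1.0 in that base's column."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq]


def make_weights(length):
    """Hypothetical weights: +1 wherever the input matches WORD at START."""
    w = [[0.0] * 4 for _ in range(length)]
    for offset, base in enumerate(WORD):
        w[START + offset][BASES.index(base)] = 1.0
    return w


def forward(x, w, b):
    """Output probability of the toy logistic model."""
    z = b + sum(w[i][j] * x[i][j] for i in range(len(x)) for j in range(4))
    return 1.0 / (1.0 + math.exp(-z))


def saliency(x, w, b):
    """Gradient of the output w.r.t. every input entry: p * (1 - p) * w."""
    p = forward(x, w, b)
    return [[p * (1.0 - p) * w[i][j] for j in range(4)] for i in range(len(x))]


def gradient_times_input(x, w, b):
    """Keep only the gradient at the observed base: one score per position."""
    grad = saliency(x, w, b)
    return [sum(grad[i][j] * x[i][j] for j in range(4)) for i in range(len(x))]


x = one_hot("CCGATACC")
w = make_weights(len(x))
scores = gradient_times_input(x, w, b=-2.0)
```

All positions are scored from a single gradient computation, which is why these approaches are far cheaper than mutating every base. The factor p * (1 - p) shrinks toward zero as the output saturates, so gradient-based scores can understate the importance of a strongly matched word.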
Avanti Shrikumar, Peyton Greenside
DeepLIFT
Shrikumar et al. Learning Important Features Through Propagating Activation Differences
https://arxiv.org/abs/1704.02685
CODE: https://github.com/kundajelab/deeplift
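The idea of propagating activation differences can be worked by hand on a two-input toy network. The sketch below does not use the deeplift package or its API; it applies the linear and rescale rules described in the paper to a network that saturates, with an all-zero reference chosen as an assumption.

```python
def relu(v):
    """Rectified linear activation."""
    return max(0.0, v)


def toy_net(x1, x2):
    """y = 1 - ReLU(1 - x1 - x2): the output saturates at 1 once x1 + x2 >= 1."""
    return 1.0 - relu(1.0 - x1 - x2)


def gradient(x1, x2):
    """Analytic gradient of toy_net with respect to (x1, x2)."""
    slope = 1.0 if (1.0 - x1 - x2) > 0 else 0.0  # derivative of the ReLU
    return (slope, slope)


def deeplift_rescale(x, ref):
    """Contributions of x1 and x2 to toy_net(x) - toy_net(ref)."""
    z = 1.0 - x[0] - x[1]            # ReLU input for the actual input
    z_ref = 1.0 - ref[0] - ref[1]    # ReLU input for the reference
    dz = z - z_ref
    dh = relu(z) - relu(z_ref)
    # Rescale rule: multiplier = (difference in output) / (difference in input).
    # When dz is zero this sketch falls back to the ReLU slope (a simplification).
    multiplier = dh / dz if dz != 0 else (1.0 if z > 0 else 0.0)
    # Linear rule: each input feeds z with weight -1, and the final
    # "1 - h" step multiplies by -1 again.
    return tuple(-1.0 * (-1.0 * (xi - ri)) * multiplier
                 for xi, ri in zip(x, ref))


grads = gradient(1.0, 1.0)
contribs = deeplift_rescale((1.0, 1.0), (0.0, 0.0))
```

At the input (1, 1) the gradient is zero for both inputs because the output is saturated, yet the difference-from-reference contributions are 0.5 each and sum to the change in output. That sum-to-difference property is what lets the scores stay informative where gradients vanish.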
DeepLIFT identifies combinatorial grammars of DNA words defining tissue-specific control elements!
Distinct combinations of DNA words can activate the same control element in different tissues
Peyton Greenside
[Figure: DeepLIFT scores by position along a control element sequence. Panels: tissue: blood stem cells; tissue: red blood cells. Labeled DNA words: SPI1, Gata, Gata (Rc). Validation experiment results: SPI1 protein binding data, GATA1 protein binding data.]
Decoding tissue-specific combinatorial grammars in millions of genomic control elements!
Peyton Greenside
MoDISCO: Identifying recurring DNA words across control elements
Insight: filter contributions are resolved at the nucleotide level
[Figure: Δprob along Sequence 1, Sequence 2, and Sequence 3]
Avanti Shrikumar Peyton Greenside
We learn 1000s of known and novel DNA words defining tissue-specific control elements!
Can deep CNNs trained on control elements be useful for understanding disease-associated genetic variants?
> 1000 population sequencing studies of diverse diseases
> 90% of complex disease-associated variants are not in genes. Highly enriched in control elements!
Deep CNNs can predict and interpret effects of disease-associated genetic variants in relevant tissue context
Original prediction   Mutated prediction   Difference (percent)
0.558                 0.543                -1.5%
0.528                 0.583                +5.4%
0.554                 0.557                +0.3%
0.969                 0.926                -4.3%
0.960                 0.900                -5.9%
0.889                 0.756                -13.2%
Breaking the ‘C’ results in significant drop in probability of active control element!
Unstimulated coronary smooth muscle cells
• Breaks the ‘C’ in TGACTCA DNA word which is binding site for an important protein (AP1).
• Variant specifically manifests in stimulated cells
Stimulated coronary smooth muscle cells from patients
A genetic variant C -> T strongly associated with coronary heart disease
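The variant score on this slide is the change in predicted probability between the original and the mutated sequence, reported in percentage points. The sketch below recomputes it from the rounded predictions shown on the slide, so the last digit can differ slightly from the slide's values, which were presumably computed before rounding. The function name is an assumption.

```python
def variant_effects(original, mutated):
    """Change in predicted probability, in percentage points, per output."""
    return [round(100.0 * (m - o), 1) for o, m in zip(original, mutated)]


# Rounded predictions copied from the slide (six model outputs).
original = [0.558, 0.528, 0.554, 0.969, 0.960, 0.889]
mutated = [0.543, 0.583, 0.557, 0.926, 0.900, 0.756]

effects = variant_effects(original, mutated)
largest_drop = min(range(len(effects)), key=lambda i: effects[i])
```

Here `largest_drop` is the index of the output that loses the most probability after the C -> T change, which is the last of the six outputs on the slide.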
Future of personalized medicine
Personal genome sequences
Personal functional genomic data
Electronic medical records / Clinical data / biometrics / Literature mining
Longitudinal data
Domain-specific machine learning + AI
Rapid interpretation of personal genomes
Data-driven personal diagnosis (cause rather than symptoms)
Drug target identification and design
Optimal treatment regimens
How to train your DRAGONN
Deep RegulAtory GenOmic Neural Nets
http://kundajelab.github.io/dragonn/
Interactive Cloud based tutorials on deep learning on genomic sequence
Johnny Israeli
Acknowledgements
Will Greenleaf
Chuan Sheng Foo
Kundaje Lab members
Johnny Israeli
Avanti Shrikumar
Peyton Greenside
Chris Probert
Irene Kaplow
Funding
R01ES02500902
U41-HG007000-04S1
U01HG007919-02 (GGR)
Conflict of Interest: Deep Genomics (SAB), Epinomics (SAB)