A Content-Addressable DNA Database with Learned Sequence Encodings
Kendall Stewart1, Yuan-Jyue Chen2, David Ward1, Xiaomeng Liu1, Georg Seelig1, Karin Strauss2, and Luis Ceze1
1 University of Washington 2 Microsoft Research
Search by Image
[Figure: Database, CPU, Bus, Traditional Storage]
Von Neumann Bottleneck
DNA Data Storage
Full Data Encoding: Input File → Sequences
ATCGA…TCGAT…GATAC (Primer | Payload | Primer)
ATCGA…GATCT…GATAC
ATCGA…TGACA…GATAC
ATCGA…GTGTA…GATAC
Organick et al., Nat. Bio. 2018
Can we probe the payloads?
Robust encoding requires randomization,
redundancy, ECC, segmentation…
Content-Addressable Storage
Content-Based Encoding: Input File → Metadata Sequence
Metadata Sequence: ATCGA…GGACGGAATAC (regions: File ID (Payload), Content-Based Probe Region)
Input File → Feature Extraction → Feature Vector: [0.1, 0.2, -0.3, 0.8, …]
Feature Vector → Sequence Encoding → Metadata Sequence: ATCGA…GGACGGAATAC
Focus of this talk
“Address” in Feature Space
Feature Space
[Figure: Feature Space (2D Projection)]
Vector Quantization
[Figure: Feature Space (2D Projection); region labels: ATCGA…, GATCG…, TGTAT…, GCTAT…]
Reif & LaBean, DNA 6 (2000)
[Figure: Feature Space (2D Projection); point labels: GCTAT, GCTAT, GCTAT, GCTAT, CGATA, CGAGA, ATCGA, CGAGA, TGTAT, GATCG]
Neighbors in different clusters
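The problem named above can be sketched in a few lines. This is only an illustration under stated assumptions: the 2D centroids and the `CODEBOOK` mapping are hypothetical values chosen to echo the labels in the figure, not the codebook of any real system. Each point takes the sequence of its nearest centroid, so two nearby points on opposite sides of a boundary receive unrelated sequences.

```python
import math

# Hypothetical codebook: 2D centroid -> DNA label (illustrative values only).
CODEBOOK = {
    (0.0, 0.0): "ATCGA",
    (1.0, 0.0): "GATCG",
    (0.0, 1.0): "TGTAT",
    (1.0, 1.0): "GCTAT",
}

def quantize(point):
    """Return the label of the centroid nearest to `point` (Euclidean)."""
    nearest = min(CODEBOOK, key=lambda c: math.dist(c, point))
    return CODEBOOK[nearest]

def hamming(a, b):
    """Count positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

# Two points that are close in feature space but straddle a cluster boundary.
p, q = (0.49, 0.1), (0.51, 0.1)
print(math.dist(p, q))                     # small feature-space distance
print(quantize(p), quantize(q))            # two different cluster labels
print(hamming(quantize(p), quantize(q)))   # the labels share no positions
```

Points inside one cluster all get the same label, while these two neighbors get labels that differ at every position.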
Similarity Preserving Encoding
[Figure: point labels: GCTTT, GTTAT, GCTAA, CCTAT, CGATA, CGCGA, ATCGA, CGAGA, TGTAT, GATCG]
Naive Encoding: [0.1, 0.7, 0.3, 0.8, …]
[0.00, 0.25) = A, [0.25, 0.50) = T, [0.50, 0.75) = C, [0.75, 1.00] = G
ACTG?
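A minimal sketch of the naive encoding, assuming every feature lies in [0, 1] (the function name `naive_encode` is made up for illustration). It reproduces ACTG for the example vector and shows one plausible reason for the question mark: values just either side of a bin edge map to different bases, while values far apart inside one bin map to the same base.

```python
BINS = "ATCG"  # [0, .25) -> A, [.25, .5) -> T, [.5, .75) -> C, [.75, 1] -> G

def naive_encode(features):
    """Map each feature in [0, 1] to one base by quartile."""
    bases = []
    for x in features:
        if not 0.0 <= x <= 1.0:
            raise ValueError(f"feature {x} outside [0, 1]")
        index = min(int(x * 4), 3)  # x == 1.0 falls in the last bin
        bases.append(BINS[index])
    return "".join(bases)

print(naive_encode([0.1, 0.7, 0.3, 0.8]))    # ACTG, as in the example above
print(naive_encode([0.1, 0.7, 0.24, 0.8]))   # third feature just below 0.25
print(naive_encode([0.1, 0.7, 0.26, 0.8]))   # third feature just above 0.25
print(naive_encode([0.1, 0.7, 0.49, 0.8]))   # far from 0.26, same base
```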
Semantic Hashing
Adapted from Salakhutdinov et al. 2007
Images → Binary Addresses
Similar in Feature Space (Euclidean) → Similar in Address Space (Hamming)
Images → DNA Sequences
Similar in Feature Space (Euclidean) → Similar in Sequence Space (Hybridization Yield)
Training a neural network efficiently requires differentiable operations.
NUPACK calculation of hybridization yield is not differentiable!
Approximating Yield
[Figure: AGTC and AGAC, each drawn position by position over the channels A, T, C, G (indices 0 to 3)]
$\text{cosine distance}(u, v) = 1 - \dfrac{u \cdot v}{\|u\| \, \|v\|}$
Mean Cosine Distance = (0 + 0 + 1 + 0) / 4 = 0.25
Equal to Hamming Distance when representations are “one-hot”
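The arithmetic above can be checked with a short sketch. It assumes the channel order A, T, C, G from the figure and uses only the standard library; the helper names are made up for illustration.

```python
import math

ALPHABET = "ATCG"  # channel order A=0, T=1, C=2, G=3

def one_hot(seq):
    """Encode each base as a 4-element one-hot vector."""
    return [[1.0 if base == b else 0.0 for b in ALPHABET] for base in seq]

def cosine_distance(u, v):
    """1 - (u . v) / (||u|| ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def mean_cosine_distance(x, y):
    """Average the per-position cosine distances of two encodings."""
    return sum(cosine_distance(u, v) for u, v in zip(x, y)) / len(x)

def hamming_fraction(s, t):
    """Fraction of positions at which two sequences differ."""
    return sum(a != b for a, b in zip(s, t)) / len(s)

print(mean_cosine_distance(one_hot("AGTC"), one_hot("AGAC")))  # 0.25
print(hamming_fraction("AGTC", "AGAC"))                        # 0.25
```

With one-hot vectors each per-position distance is exactly 0 (same base) or 1 (different base), which is why the mean matches the Hamming fraction.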
[Figure: neural network output at each sequence position over the channels A, T, C, G, decoding to AGTC CAGC…]
Neural network outputs are not exactly one-hot
But the approximation is good enough!
[Figure: Random Pairs of Neural Net Outputs]
Neural Net Output: Mean Cosine Distance: 0.29
Sequence (One-Hot Encoding): Hamming Distance: 0.33, Mean Cosine Distance: 0.33
Sequences: A T G C C T A C G G C T
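To illustrate why soft outputs give a slightly different number than their discretized sequences, here is a sketch with made-up outputs. `soft_output` and its `confidence` value are hypothetical stand-ins for a network's per-position scores, and the two 6-nt sequences are example inputs that differ at 2 of 6 positions.

```python
import math

ALPHABET = "ATCG"

def soft_output(base, confidence=0.85):
    """Hypothetical output: `confidence` on one base, the rest spread evenly."""
    rest = (1.0 - confidence) / 3
    return [confidence if b == base else rest for b in ALPHABET]

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def mean_cosine_distance(x, y):
    return sum(cosine_distance(u, v) for u, v in zip(x, y)) / len(x)

def decode(outputs):
    """Pick the highest-scoring base at each position."""
    return "".join(ALPHABET[max(range(4), key=row.__getitem__)] for row in outputs)

x = [soft_output(b) for b in "ATGCCT"]
y = [soft_output(b) for b in "ACGGCT"]

soft_distance = mean_cosine_distance(x, y)
hard_distance = sum(a != b for a, b in zip(decode(x), decode(y))) / len(x)
print(soft_distance, hard_distance)
```

Because mismatched soft vectors still overlap a little, the soft distance comes out somewhat below the Hamming fraction of the decoded sequences, while staying close to it.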
Closing the Loop
[Figure: a pair of neural network outputs over the channels A, T, C, G, decoding to AGTC CAGC… and GTAC GGTA…]
Mean Cosine Distance
Approximate Yield
Images Similar? (Distance < 0.2)
Cross-Entropy Loss
Gradients
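The chain of steps above can be sketched as a loss computation. Several pieces here are assumptions rather than anything stated in the slides: the logistic form and parameters of `approximate_yield` are placeholders, the "Distance < 0.2" test is taken to be Euclidean distance between feature vectors, and the sketch only evaluates the loss (in training, an automatic-differentiation framework would supply the gradients).

```python
import math

def approximate_yield(mean_cos_dist, steepness=10.0, midpoint=0.5):
    """Hypothetical smooth map from sequence distance to a yield in (0, 1)."""
    return 1.0 / (1.0 + math.exp(steepness * (mean_cos_dist - midpoint)))

def similar(feature_a, feature_b, threshold=0.2):
    """Label: images count as similar if their feature distance is < 0.2."""
    return math.dist(feature_a, feature_b) < threshold

def cross_entropy(predicted_yield, label):
    """Binary cross-entropy between predicted yield and the similarity label."""
    eps = 1e-12  # keep log() away from zero
    p = min(max(predicted_yield, eps), 1.0 - eps)
    return -math.log(p) if label else -math.log(1.0 - p)

# A similar image pair (made-up 2D features).
label = similar((0.10, 0.20), (0.15, 0.25))

# Close sequences predict high yield, which agrees with the label.
loss_close = cross_entropy(approximate_yield(0.05), label)
# Distant sequences predict low yield, which disagrees with the label.
loss_far = cross_entropy(approximate_yield(0.70), label)
print(label, loss_close, loss_far)
```

The loss is small when similar images map to close sequences and large when they map to distant ones, which is the signal the encoder is trained on.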
Database Strand Design
[Figure: database strand FP | d(T) | IP | f(T) | RP; length labels: 18 nt, 18 nt, 19 nt, 30 nt, 5 nt; example: Image 209; callouts: Allows Sequencing (two regions), Allows Protection]
Double-stranded region prevents interference
Query Strand Design
Target: FP | d(T) | IP | f(T) | RP
Query: f(Q)* | RP[:6]* | B (f(Q)*: 30 nt, RP[:6]*: 6 nt)
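A minimal sketch of assembling the query strand, under two assumptions: `*` is read as the reverse complement, and the sequences used are short made-up stand-ins (the real f(Q) region is 30 nt). The trailing B is left out because the slides do not spell it out.

```python
COMPLEMENT = str.maketrans("ATCG", "TAGC")

def reverse_complement(seq):
    """Complement every base, then reverse the strand."""
    return seq.translate(COMPLEMENT)[::-1]

def query_strand(f_q, rp, overlap=6):
    """Strand that pairs with f(Q) followed by the first `overlap` nt of RP."""
    return reverse_complement(f_q + rp[:overlap])

# Hypothetical stand-ins for the encoded feature region and the reverse primer.
f_q = "ATCGGA"
rp = "GATACCTG"
print(query_strand(f_q, rp))
```

When f(Q) equals f(T), the result is fully complementary to the f(T) RP[:6] stretch of the database strand; the more the two encodings differ, the more mismatches the duplex carries.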
Dataset Construction
[Figure: Queries and Targets]
For each T out of 100 target images: FP | d(T) | IP | f(T) | RP
For each Q out of 10 query images: f(Q)* | RP[:6]* | B
Experimental Results
[Figure: Ideal Data]
[Figure: Chance Retrieval]
[Figure: results for a single Query Image]
[Figure: All Queries]
[Figure: two panels, R²: -0.30 and R²: 0.64]
Thank you!