A Content-Addressable DNA Database with Learned Sequence Encodings

Kendall Stewart1, Yuan-Jyue Chen2, David Ward1, Xiaomeng Liu1, Georg Seelig1, Karin Strauss2, and Luis Ceze1

1 University of Washington, 2 Microsoft Research

Search by Image

Bus

Traditional Storage

CPU

Database

Von Neumann Bottleneck

DNA Data Storage

Input File → Full Data Encoding → Sequences

ATCGA…TCGAT…GATAC (Primer, Payload, Primer)

ATCGA…GATCT…GATAC

ATCGA…TGACA…GATAC

ATCGA…GTGTA…GATAC

Organick et al., Nat. Bio. 2018

Can we probe the payloads?

Robust encoding requires randomization, redundancy, ECC, segmentation…

Content-Addressable Storage

Input File → Content-Based Encoding → Metadata Sequence

ATCGA…GGACGGAATAC

Regions: Content-Based Probe Region, File ID (Payload)

Input File → Feature Extraction → Feature Vector: [0.1, 0.2, -0.3, 0.8, …]

Feature Vector → Sequence Encoding → Metadata Sequence: ATCGA…GGACGGAATAC

Regions: Content-Based Probe Region, File ID (Payload)

Focus of this talk

“Address” in Feature Space

Feature Space

Feature Space (2D Projection)

Vector Quantization

Feature Space (2D Projection), labeled with sequences ATCGA…, GATCG…, TGTAT…, GCTAT…

Reif & LaBean, DNA 6 (2000)

Sequences shown: GCTAT, GCTAT, GCTAT, GCTAT, CGATA, CGAGA, ATCGA, CGAGA, TGTAT, GATCG

Neighbors in different clusters
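The boundary problem named above can be illustrated with a small sketch. The two centroids, their sequences, and the sample points are hypothetical values chosen for illustration; they are not taken from the talk.

```python
import math

# Hypothetical codebook: each cluster centroid (2D) is assigned an arbitrary sequence.
CODEBOOK = {
    "ATCGA": (0.0, 0.0),
    "GATCG": (1.0, 0.0),
}

def quantize(point):
    """Return the sequence of the nearest centroid (Euclidean distance)."""
    return min(CODEBOOK, key=lambda seq: math.dist(point, CODEBOOK[seq]))

def hamming(a, b):
    """Count the positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

# Two points that are close together but sit on opposite sides of the boundary.
p, q = (0.49, 0.0), (0.51, 0.0)
seq_p, seq_q = quantize(p), quantize(q)
print(seq_p, seq_q, hamming(seq_p, seq_q))  # ATCGA GATCG 5
```

The two points are 0.02 apart in feature space, yet their assigned sequences differ at every position, so closeness in feature space is lost at cluster boundaries.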

Similarity Preserving Encoding

Sequences shown: GCTTT, GTTAT, GCTAA, CCTAT, CGATA, CGCGA, ATCGA, CGAGA, TGTAT, GATCG

Naive Encoding: [0.1, 0.7, 0.3, 0.8, …]

[0.00, 0.25) = A, [0.25, 0.50) = T, [0.50, 0.75) = C, [0.75, 1.00] = G

ACTG?
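The naive binning above can be written out as a short sketch. The function name naive_encode is hypothetical, and the input is assumed to lie in [0, 1] as the bin table implies.

```python
def naive_encode(features):
    """Map each feature in [0, 1] to a base using four equal-width bins."""
    bases = "ATCG"  # [0, .25) = A, [.25, .5) = T, [.5, .75) = C, [.75, 1] = G
    encoded = []
    for x in features:
        if not 0.0 <= x <= 1.0:
            raise ValueError(f"feature {x} is outside [0, 1]")
        # int(x * 4) selects the bin; min() folds x == 1.0 into the last bin.
        encoded.append(bases[min(int(x * 4), 3)])
    return "".join(encoded)

print(naive_encode([0.1, 0.7, 0.3, 0.8]))  # ACTG
# Nearby values on either side of a bin boundary map to different bases.
print(naive_encode([0.24]), naive_encode([0.26]))  # A T
```

One plausible reading of the question mark on the slide is that this mapping does not preserve similarity: a small change in a feature can flip a base, and the identity of the base says nothing about how far apart the two values were.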

Semantic Hashing

Adapted from Salakhutdinov et al. 2007

Images → Binary Addresses

Similar in Feature Space (Euclidean) → Similar in Address Space (Hamming)

Images → DNA Sequences

Similar in Feature Space (Euclidean) → Similar in Sequence Space (Hybridization Yield)

Training a neural network efficiently requires differentiable operations

NUPACK calculation of hybridization yield is not differentiable!

Approximating Yield

AGTC

AGAC

Each sequence is drawn as a grid indexed by base (A, T, C, G) and position (0, 1, 2, 3)

cosine distance(u, v) = 1 − (u · v) / (||u|| ||v||)

Mean Cosine Distance = (0 + 0 + 1 + 0) / 4 = 0.25

Equal to Hamming Distance when representations are “one-hot”
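The AGTC / AGAC arithmetic can be checked with a minimal sketch. The helper names are hypothetical, and the Hamming distance is expressed as a fraction of positions so that it is comparable with the mean.

```python
import math

BASES = "ATCG"

def one_hot(seq):
    """Encode a sequence as one 4-element indicator vector per position."""
    return [[1.0 if b == base else 0.0 for base in BASES] for b in seq]

def cosine_distance(u, v):
    """1 - (u . v) / (||u|| ||v||) for two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def mean_cosine_distance(x, y):
    """Average the per-position cosine distance over the sequence length."""
    return sum(cosine_distance(u, v) for u, v in zip(x, y)) / len(x)

def hamming_fraction(a, b):
    """Fraction of positions at which two equal-length sequences differ."""
    return sum(p != q for p, q in zip(a, b)) / len(a)

d = mean_cosine_distance(one_hot("AGTC"), one_hot("AGAC"))
print(d)                                 # 0.25
print(hamming_fraction("AGTC", "AGAC"))  # 0.25
```

With one-hot vectors each position contributes exactly 0 (match) or 1 (mismatch), which is why the mean equals the fraction of mismatched positions.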

Grid of bases (A, T, C, G) by positions (0, 1, 2, 3 and 27, 28, 29, 30) for the sequence AGTC CAGC…

Neural network outputs are not exactly one-hot

But the approximation is good enough!

Random Pairs of Neural Net Outputs

Neural Net Output: Mean Cosine Distance: 0.29

One-Hot Encoding: Mean Cosine Distance: 0.33

Sequence (ATGCCT vs. ACGGCT): Hamming Distance: 0.33
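A toy sketch of why the approximation stays close when outputs are only nearly one-hot. The soft_output function is a hypothetical stand-in for a network output (a fixed 0.9 weight on the intended base, the rest spread evenly); it does not reproduce the values reported on the slide.

```python
import math

BASES = "ATCG"

def soft_output(seq, confidence=0.9):
    """Hypothetical near-one-hot output: `confidence` on the intended base."""
    rest = (1.0 - confidence) / 3
    return [[confidence if b == base else rest for base in BASES] for b in seq]

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def mean_cosine_distance(x, y):
    return sum(cosine_distance(u, v) for u, v in zip(x, y)) / len(x)

def hamming_fraction(a, b):
    return sum(p != q for p, q in zip(a, b)) / len(a)

a, b = "ATGCCT", "ACGGCT"  # the pair shown on the slide (2 of 6 positions differ)
soft = mean_cosine_distance(soft_output(a), soft_output(b))
hard = hamming_fraction(a, b)
print(round(soft, 2), round(hard, 2))
```

In this toy setting the soft distance lands slightly below the Hamming fraction, because mismatched positions contribute a little less than 1 when the vectors are not exactly one-hot.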

Closing the Loop

Two network outputs, each a grid of bases (A, T, C, G) by positions (0, 1, 2, 3 and 27, 28, 29, 30): AGTC CAGC… and GTAC GGTA…

Mean Cosine Distance

Approximate Yield

Images Similar? (Distance < 0.2)

Cross-Entropy Loss

Gradients
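The loss computation in this loop can be sketched in plain Python. Several pieces are assumptions: the slides do not say how mean cosine distance becomes an approximate yield (here it is simply 1 minus the distance), the 0.2 threshold is assumed to apply to Euclidean distance between image feature vectors, and all numbers are made up. Gradients are not computed here; in an automatic-differentiation framework every step below is differentiable with respect to the network outputs.

```python
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def mean_cosine_distance(x, y):
    return sum(cosine_distance(u, v) for u, v in zip(x, y)) / len(x)

def approx_yield(x, y, eps=1e-7):
    """ASSUMPTION: yield ~ 1 - mean cosine distance, clipped away from 0 and 1."""
    return min(max(1.0 - mean_cosine_distance(x, y), eps), 1.0 - eps)

def similarity_label(f1, f2, threshold=0.2):
    """ASSUMPTION: images count as similar if their features are < 0.2 apart."""
    return 1.0 if math.dist(f1, f2) < threshold else 0.0

def cross_entropy(label, predicted):
    """Binary cross-entropy between the similarity label and predicted yield."""
    return -(label * math.log(predicted) + (1 - label) * math.log(1 - predicted))

# Made-up two-position network outputs (columns ordered A, T, C, G).
out_a = [[0.9, 0.03, 0.03, 0.04], [0.05, 0.05, 0.1, 0.8]]
out_b = [[0.03, 0.9, 0.03, 0.04], [0.8, 0.05, 0.1, 0.05]]

label = similarity_label([0.1, 0.2], [0.15, 0.25])  # a similar image pair
loss_match = cross_entropy(label, approx_yield(out_a, out_a))
loss_mismatch = cross_entropy(label, approx_yield(out_a, out_b))
print(loss_match < loss_mismatch)  # True
```

For a similar image pair, the loss is low when the two outputs agree and high when they disagree, which is the signal that pushes similar images toward sequences that hybridize.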

Database Strand Design

FP d(T) IP f(T) RP

Length labels: 18 nt, 18 nt, 19 nt, 30 nt, 5 nt

Image 209

Allows Sequencing

Allows Sequencing

Allows Protection

Double-stranded region prevents interference

Query Strand Design

Database strand: FP d(T) IP f(T) RP

Query strand: f(Q)* RP[:6]* B, with f(Q)* 30 nt and RP[:6]* 6 nt
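A minimal sketch of how the two strands relate, assuming that "*" denotes the reverse complement and that strings are written 5' to 3'. The segment sequences are short hypothetical placeholders, not the lengths or sequences used in the talk, and the trailing B segment is left out because the slides do not spell it out.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq):
    """Return the Watson-Crick reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def database_strand(fp, d_t, ip, f_t, rp):
    """Concatenate the segments in the order FP d(T) IP f(T) RP."""
    return fp + d_t + ip + f_t + rp

def query_strand(f_q, rp):
    """Strand complementary to f(Q) followed by the first 6 nt of RP."""
    return reverse_complement(f_q + rp[:6])

# Hypothetical placeholder segments (much shorter than the real design).
fp, d_t, ip, f_t, rp = "AAAA", "CC", "GGGG", "ATCGAT", "TTGACCAA"
target = database_strand(fp, d_t, ip, f_t, rp)
query = query_strand(f_t, rp)  # a query whose encoding equals f(T)

# The query is fully complementary to the f(T) + RP[:6] region of the target.
print(reverse_complement(query) in target)  # True
```

When f(Q) differs from f(T), only the six RP positions are guaranteed to match, so the degree of complementarity in the f region is what distinguishes similar from dissimilar images.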

Dataset Construction

Targets: FP d(T) IP f(T) RP, for each T out of 100 target images

Queries: f(Q)* RP[:6]* B, for each Q out of 10 query images

Experimental Results

Figure panels: Ideal Data; Chance Retrieval; Query Image; All Queries

R²: -0.30, R²: 0.64

Thank you!
