A Content-Addressable DNA Database with Learned Sequence Encodings

Kendall Stewart¹, Yuan-Jyue Chen², David Ward¹, Xiaomeng Liu¹, Georg Seelig¹, Karin Strauss², and Luis Ceze¹

¹ University of Washington  ² Microsoft Research

Slides: ...kstwrt/pubs/dna24-slides.pdf


Search by Image

Traditional Storage

[Figure: CPU and Database connected by a Bus]

Von Neumann Bottleneck

DNA Data Storage

Input File → Full Data Encoding → Sequences

Strand layout: Primer | Payload | Primer

ATCGA…TCGAT…GATAC
ATCGA…GATCT…GATAC
ATCGA…TGACA…GATAC
ATCGA…GTGTA…GATAC

Can we probe the payloads?

Robust encoding requires randomization, redundancy, ECC, segmentation…

Organick et al., Nat. Bio. 2018

Content-Addressable Storage

Input File → Content-Based Encoding → Metadata Sequence

Content-Based Encoding in two steps:
Input File → Feature Extraction → Feature Vector [0.1, 0.2, -0.3, 0.8, …] → Sequence Encoding → ATCGA…GGACGGAATAC

Metadata Sequence regions: File ID (Payload), Content-Based Probe Region

“Address” in Feature Space

Focus of this talk

Feature Space

[Figure: Feature Space (2D Projection)]

Vector Quantization

[Figure: Feature Space (2D Projection) divided into clusters, each labeled with one sequence: ATCGA…, GATCG…, TGTAT…, GCTAT…]

Reif & LaBean, DNA 6 (2000)

Sequences shown next to the example images: GCTAT, GCTAT, GCTAT, GCTAT, CGATA, CGAGA, ATCGA, CGAGA, TGTAT, GATCG

Neighbors in different clusters
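
The problem named on this slide can be sketched in a few lines of Python. This is an illustration only, not the authors' code: the 2D centroid coordinates and the two feature vectors are hypothetical values chosen for the example, and only the four cluster sequences come from the slide.

```python
import math

# Hypothetical 2D centroids; the sequence labels are the ones shown on the slide.
CENTROIDS = {
    "ATCGA": (0.0, 0.0),
    "GATCG": (1.0, 0.0),
    "TGTAT": (0.0, 1.0),
    "GCTAT": (1.0, 1.0),
}

def quantize(point):
    """Return the sequence of the centroid nearest to `point` (Euclidean)."""
    return min(CENTROIDS, key=lambda seq: math.dist(point, CENTROIDS[seq]))

# Two hypothetical feature vectors that sit close together but on
# opposite sides of the boundary between two clusters.
a = (0.49, 0.10)
b = (0.51, 0.10)

print(math.dist(a, b))           # small distance in feature space
print(quantize(a), quantize(b))  # two different cluster sequences
```

The two points are near neighbors, yet they receive unrelated sequences, so a probe built for one cluster would not be expected to retrieve the other point.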

Similarity Preserving Encoding

Sequences shown next to the example images: GCTTT, GTTAT, GCTAA, CCTAT, CGATA, CGCGA, ATCGA, CGAGA, TGTAT, GATCG

Naive Encoding: [0.1, 0.7, 0.3, 0.8, …]

[0.00, 0.25) = A
[0.25, 0.50) = T
[0.50, 0.75) = C
[0.75, 1.00] = G

ACTG?
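
A minimal sketch of the naive binning above, assuming every feature value lies in [0, 1]. The function name `naive_encode` is made up for this example. The second call is an added illustration of one weakness of fixed bins, which may be what the question mark on the slide hints at.

```python
def naive_encode(features):
    """Map each feature value in [0, 1] to a base using the slide's four bins."""
    bases = "ATCG"  # bin order from the slide: A, T, C, G
    out = []
    for x in features:
        if not 0.0 <= x <= 1.0:
            raise ValueError(f"feature {x} is outside [0, 1]")
        # int(x * 4) gives the bin index; min() folds x == 1.0 into the last bin.
        out.append(bases[min(int(x * 4), 3)])
    return "".join(out)

print(naive_encode([0.1, 0.7, 0.3, 0.8]))   # ACTG
# Nearly identical first features fall on either side of the 0.25 boundary.
print(naive_encode([0.24, 0.7, 0.3, 0.8]))  # ACTG
print(naive_encode([0.26, 0.7, 0.3, 0.8]))  # TCTG
```

A change of 0.02 in one feature flips a base, while a change from 0.26 to 0.49 would flip nothing, so distance in feature space and mismatch count in sequence space are only loosely related under this scheme.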

Semantic Hashing

Adapted from Salakhutdinov et al. 2007

Original: Images → Binary Addresses
Similar in Feature Space (Euclidean) → Similar in Address Space (Hamming)

This work: Images → DNA Sequences
Similar in Feature Space (Euclidean) → Similar in Sequence Space (Hybridization Yield)

Training a neural network efficiently requires differentiable operations.

NUPACK calculation of hybridization yield is not differentiable!

Approximating Yield

[Figure: AGTC and AGAC drawn as one-hot grids with rows A, T, C, G and position columns 0, 1, 2, 3]

$$\text{cosine distance}(u, v) = 1 - \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$$

Mean Cosine Distance = (0 + 0 + 1 + 0) / 4 = 0.25

Equal to Hamming Distance (as a fraction of sequence length) when representations are “one-hot”
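
The worked example for AGTC and AGAC can be checked with a short pure-Python sketch. The helper names are invented for this illustration; the row order A, T, C, G follows the slides.

```python
import math

BASES = "ATCG"  # row order used on the slides

def one_hot(seq):
    """Encode a DNA string as a list of 4-element one-hot vectors."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq]

def cosine_distance(u, v):
    """1 - (u . v) / (||u|| ||v||) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)

def mean_cosine_distance(p, q):
    """Average the per-position cosine distance of two equal-length encodings."""
    if len(p) != len(q):
        raise ValueError("encodings must have the same length")
    return sum(cosine_distance(u, v) for u, v in zip(p, q)) / len(p)

def hamming_fraction(s, t):
    """Fraction of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, t)) / len(s)

print(mean_cosine_distance(one_hot("AGTC"), one_hot("AGAC")))  # 0.25
print(hamming_fraction("AGTC", "AGAC"))                        # 0.25
```

With one-hot rows, each position contributes 0 when the bases match and 1 when they differ, which is why the mean equals the fraction of mismatched positions.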

Approximating Yield

[Figure: neural network output drawn as a grid with rows A, T, C, G and position columns 0, 1, 2, 3, …, 27, 28, 29, 30; sequence label AGTC CAGC…]

Neural network outputs are not exactly one-hot

But the approximation is good enough!

Random Pairs of Neural Net Outputs

Example pair: ATGCCT and ACGGCT
Sequence: Hamming Distance: 0.33
One-Hot Encoding: Mean Cosine Distance: 0.33
Neural Net Output: Mean Cosine Distance: 0.29
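
To see why soft outputs shift the mean cosine distance away from the Hamming fraction, here is a small sketch using the 6-nt pair from the slide. The `soften` function is a hypothetical stand-in for a network output (a fixed probability on the intended base, the rest spread evenly); it is not the authors' model, and the numbers it produces are not meant to reproduce the values on the slide.

```python
import math

BASES = "ATCG"  # row order used on the slides

def soften(seq, peak=0.9):
    """Hypothetical stand-in for a network output: `peak` probability on the
    intended base, the remainder spread evenly over the other three."""
    rest = (1.0 - peak) / 3.0
    return [[peak if base == b else rest for b in BASES] for base in seq]

def mean_cosine_distance(p, q):
    """Average per-position cosine distance between two soft encodings."""
    total = 0.0
    for u, v in zip(p, q):
        dot = sum(x * y for x, y in zip(u, v))
        total += 1.0 - dot / (math.hypot(*u) * math.hypot(*v))
    return total / len(p)

def decode(output):
    """Pick the most probable base at each position."""
    return "".join(BASES[max(range(4), key=row.__getitem__)] for row in output)

s, t = "ATGCCT", "ACGGCT"
hard = mean_cosine_distance(soften(s, peak=1.0), soften(t, peak=1.0))  # one-hot case
soft = mean_cosine_distance(soften(s), soften(t))                      # softened case
print(decode(soften(s)), decode(soften(t)))
print(hard, soft)
```

In this toy setting the softened rows still decode to the same sequences, and the soft distance comes out somewhat below the one-hot value because mismatched rows are no longer orthogonal.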

Closing the Loop

[Figure: two neural network outputs drawn as grids with rows A, T, C, G and position columns 0, 1, 2, 3, …, 27, 28, 29, 30; sequence labels AGTC CAGC… and GTAC GGTA…]

Pair of outputs → Mean Cosine Distance → Approximate Yield

Approximate Yield and Images Similar? (Distance < 0.2) → Cross-Entropy Loss → Gradients
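
The forward pass of this loop might look like the following sketch. Several pieces are assumptions that the slides do not specify: the logistic form of `approximate_yield` and its two constants, and the reading of "Distance < 0.2" as Euclidean distance between feature vectors. The example inputs are hypothetical.

```python
import math

def approximate_yield(mean_cos_dist, midpoint=0.25, steepness=20.0):
    """Hypothetical smooth proxy: high yield for small distances, low for large.
    The logistic form and both constants are assumptions, not from the slides."""
    return 1.0 / (1.0 + math.exp(steepness * (mean_cos_dist - midpoint)))

def similarity_label(feature_a, feature_b, threshold=0.2):
    """1.0 if the images count as similar (feature distance < threshold), else 0.0."""
    return 1.0 if math.dist(feature_a, feature_b) < threshold else 0.0

def cross_entropy(label, predicted, eps=1e-12):
    """Binary cross-entropy between the similarity label and the predicted yield."""
    p = min(max(predicted, eps), 1.0 - eps)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))

# Hypothetical pair of similar images, scored with two candidate encodings:
# one where the sequences are far apart and one where they are close.
label = similarity_label([0.10, 0.20], [0.15, 0.25])
loss_far = cross_entropy(label, approximate_yield(0.60))
loss_near = cross_entropy(label, approximate_yield(0.05))
print(label, loss_far, loss_near)
```

The loss is large when similar images get distant sequences and small when they get close ones. In an actual training setup these operations would run inside an automatic differentiation framework so that gradients of the loss reach the encoder weights; the plain-Python version shows only the forward computation.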

Database Strand Design

Strand regions: FP, d(T), IP, f(T), RP
Length labels: 18 nt, 18 nt, 19 nt, 30 nt, 5 nt

Image 209

Annotations: Allows Sequencing (shown twice), Allows Protection

Double-stranded region prevents interference

Query Strand Design

Database strand regions: FP, d(T), IP, f(T), RP
Length labels: 18 nt, 18 nt, 19 nt, 30 nt, 5 nt

Query strand regions: f(Q)* (30 nt), RP[:6]* (6 nt), B
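
As a rough sketch of how the sequence-dependent part of a query strand could be derived, the snippet below assumes that `*` denotes the reverse complement (a common convention in DNA domain notation), that `RP[:6]` means the first six nucleotides of RP, and that the database strand reads f(T) followed by RP in the 5' to 3' direction. None of this is spelled out on the slides. The B label is not modeled, and the sequences are hypothetical.

```python
COMPLEMENT = str.maketrans("ATCG", "TAGC")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def query_binding_region(f_q, rp):
    """Strand complementary to f(Q) followed by the first 6 nt of RP.

    Assumes the database strand reads ... f(T) RP in the 5' to 3' direction
    and that `*` means reverse complement; the B label is not modeled.
    """
    if len(f_q) != 30:
        raise ValueError("the slide shows a 30 nt feature region")
    return reverse_complement(f_q + rp[:6])

# Hypothetical sequences, for illustration only.
f_q = "ATCGGATACCGTTAGCATCGGATACCGTTA"
rp = "GATACCAGTTGACGTCAA"
print(query_binding_region(f_q, rp))
```

Under these assumptions the result is 36 nt long and pairs fully with a database strand only when f(T) equals f(Q); partial matches in the feature region are what the learned encoding is meant to exploit.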

Dataset Construction

[Figure: Queries and Targets]

For each T out of 100 target images: FP, d(T), IP, f(T), RP

For each Q out of 10 query images: f(Q)*, RP[:6]*, B

Experimental Results

[Figure panels: Ideal Data; Chance Retrieval; Query Image; All Queries]

R²: -0.30    R²: 0.64


Thank you!
