A Content-Addressable DNA Database with Learned Sequence ...kstwrt/pubs/dna24-slides.pdf · A...

A Content-Addressable DNA Database with Learned Sequence Encodings

Kendall Stewart¹, Yuan-Jyue Chen², David Ward¹, Xiaomeng Liu¹, Georg Seelig¹, Karin Strauss², and Luis Ceze¹

¹ University of Washington, ² Microsoft Research

Search by Image

Traditional Storage

Database

Von Neumann Bottleneck

DNA Data Storage

Input File

Sequences

Full Data Encoding

Strand layout (Primer, Payload, Primer): ATCGA…TCGAT…GATAC

ATCGA…GATCT…GATAC

ATCGA…TGACA…GATAC

ATCGA…GTGTA…GATAC

Organick et al., Nat. Bio. 2018

Can we probe the payloads?

Robust encoding requires randomization, redundancy, ECC, segmentation…

Content-Addressable Storage

Input File

Content-Based Encoding

Metadata Sequence: ATCGA…GGACGGAATAC

Regions of the metadata sequence: File ID (Payload) and Content-Based Probe Region

Input File

Feature Extraction

Feature Vector: [0.1, 0.2, -0.3, 0.8, …]

Sequence Encoding

Metadata Sequence: ATCGA…GGACGGAATAC (File ID (Payload) and Content-Based Probe Region)

Focus of this talk

“Address” in Feature Space

Feature Space (2D Projection)

Vector Quantization

ATCGA…

GATCG…

TGTAT…

GCTAT…

Reif & LaBean, DNA 6 (2000)

ATCGACGAGA

TGTATGATCG

Neighbors in different clusters

Similarity Preserving Encoding

GTTATGCTAA

CGAGATGTAT

Naive Encoding: [0.1, 0.7, 0.3, 0.8, …]

[0.00, 0.25) = A, [0.25, 0.50) = T, [0.50, 0.75) = C, [0.75, 1.00] = G
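
The naive mapping above can be sketched in a few lines of Python. This is only an illustration of the bin table on the slide; the function name `naive_encode` is hypothetical, and the input is assumed to hold values already scaled to [0, 1].

```python
BASES = "ATCG"  # bin order taken from the slide: A, T, C, G


def naive_encode(features):
    """Map each feature value in [0, 1] to one base using four equal bins."""
    sequence = []
    for value in features:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"feature {value} is outside [0, 1]")
        # int(value * 4) gives the bin index; 1.0 is folded into the last bin.
        index = min(int(value * 4), 3)
        sequence.append(BASES[index])
    return "".join(sequence)


# First four values shown on the slide.
encoded = naive_encode([0.1, 0.7, 0.3, 0.8])
print(encoded)  # ACTG
```

Each feature dimension becomes exactly one nucleotide, so the sequence length equals the number of features.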

Semantic Hashing

Adapted from Salakhutdinov et al. 2007

Images

Binary Addresses

Similar in Feature Space (Euclidean)

Similar in Address Space (Hamming)

Images

DNA Sequences

Similar in Feature Space (Euclidean)

Similar in Sequence Space (Hybridization Yield)

Training a neural network efficiently requires differentiable operations

NUPACK calculation of hybridization yield is not differentiable!

Approximating Yield

[Figure: one-hot matrix for the sequence AGAC; rows A, T, C, G; columns are positions 0, 1, 2, 3]

cosine distance(u, v) = 1 - (u · v) / (||u|| ||v||)

Mean Cosine Distance = (0 + 0 + 1 + 0) / 4 = 0.25

Equal to Hamming Distance when representations are “one-hot”
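
A small check of this equality, written as a plain-Python sketch. The two 4-nt sequences are hypothetical, chosen to differ only at position 2 so that the per-position distances are 0, 0, 1, 0 as in the worked example above. The helper names are illustrative too.

```python
import math

BASES = "ATCG"  # row order used on the slides


def one_hot(seq):
    """One 4-element row per position, with a 1.0 at the base that is present."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq]


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_cosine_distance(x, y):
    """Average the cosine distance over aligned positions."""
    return sum(cosine_distance(u, v) for u, v in zip(x, y)) / len(x)


def hamming_fraction(s, t):
    """Fraction of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(s, t)) / len(s)


s, t = "AGAC", "AGTC"  # hypothetical pair, mismatch at position 2 only
distance = mean_cosine_distance(one_hot(s), one_hot(t))
print(distance, hamming_fraction(s, t))  # 0.25 0.25
```

With one-hot rows, each position contributes 0 when the bases match and 1 when they differ, which is why the mean reduces to the normalized Hamming distance.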

[Figure: neural network output matrix; rows A, T, C, G; columns are positions 0, 1, 2, 3 … 27, 28, 29, 30; sequence label AGTC CAGC…]

Neural network outputs are not exactly one-hot

But the approximation is good enough!

Random Pairs of Neural Net Outputs

Neural Net Output: Mean Cosine Distance: 0.29

Sequence: ATGCCT vs. ACGGCT, Hamming Distance: 0.33

One-Hot Encoding: Mean Cosine Distance: 0.33
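
The comparison can be reproduced in spirit with a short Python sketch. The two sequences are the ones on the slide. The "network outputs" are not: they are a hypothetical stand-in that puts 0.85 on the intended base and spreads the rest evenly, since the actual outputs are not in the slides.

```python
import math

BASES = "ATCG"


def soften(seq, peak=0.85):
    """Hypothetical near-one-hot output: `peak` on the true base, rest shared."""
    rest = (1.0 - peak) / 3
    return [[peak if base == b else rest for b in BASES] for base in seq]


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_cosine_distance(x, y):
    return sum(cosine_distance(u, v) for u, v in zip(x, y)) / len(x)


def decode(rows):
    """Pick the most probable base at each position."""
    return "".join(BASES[max(range(4), key=row.__getitem__)] for row in rows)


def hamming_fraction(s, t):
    return sum(a != b for a, b in zip(s, t)) / len(s)


x, y = soften("ATGCCT"), soften("ACGGCT")
soft = mean_cosine_distance(x, y)                # on the soft outputs
hard = hamming_fraction(decode(x), decode(y))    # on the decoded sequences
print(round(soft, 2), round(hard, 2))  # 0.29 0.33
```

The soft distance is slightly below the Hamming distance because mismatched rows still share a little probability mass, while matched rows contribute zero either way.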

Closing the Loop

[Figure: two neural network output matrices; rows A, T, C, G; columns are positions 0, 1, 2, 3 … 27, 28, 29, 30; sequence labels AGTC CAGC… and GTAC GGTA…]

Mean Cosine Distance

Approximate Yield

Images Similar? (Distance < 0.2)

Cross-Entropy Loss

Gradients
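
The forward pass of this loop might look like the plain-Python sketch below. Two things are assumptions rather than content from the slides: the mapping from mean cosine distance to approximate yield (taken here as 1 minus the distance), and the use of binary cross-entropy against a 0/1 similarity label. All function names and the example outputs are hypothetical. In training, an automatic-differentiation framework would supply the gradients; none is used here.

```python
import math


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def mean_cosine_distance(x, y):
    return sum(cosine_distance(u, v) for u, v in zip(x, y)) / len(x)


def approximate_yield(x, y):
    # ASSUMPTION: the slides do not give this mapping; 1 - distance is a stand-in.
    return 1.0 - mean_cosine_distance(x, y)


def images_similar(feature_distance, threshold=0.2):
    """Label from the slide: images count as similar when distance < 0.2."""
    return 1.0 if feature_distance < threshold else 0.0


def cross_entropy(predicted, label, eps=1e-7):
    """Binary cross-entropy, with the prediction clipped away from 0 and 1."""
    p = min(max(predicted, eps), 1.0 - eps)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))


# Hypothetical outputs for two images: rows are positions, columns are A, T, C, G.
out_a = [[0.90, 0.05, 0.03, 0.02], [0.10, 0.80, 0.05, 0.05]]
out_b = [[0.85, 0.05, 0.05, 0.05], [0.05, 0.85, 0.05, 0.05]]

predicted = approximate_yield(out_a, out_b)
loss_if_similar = cross_entropy(predicted, images_similar(0.1))
loss_if_dissimilar = cross_entropy(predicted, images_similar(0.5))
print(predicted, loss_if_similar, loss_if_dissimilar)
```

Because the two example outputs nearly agree, the predicted yield is high, so the loss is small when the images are labeled similar and large when they are not.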

Database Strand Design

FP d(T) IP f(T) RP

Domain lengths shown: 18 nt, 18 nt, 19 nt, 30 nt, 5 nt

Image 209

Allows Sequencing

Allows Protection

Double-stranded region prevents interference

Query Strand Design

f(Q)* RP[:6]* B

Domain lengths: f(Q)* 30 nt, RP[:6]* 6 nt

FP d(T) IP f(T) RP

f(Q)* RP[:6]* B
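
A rough Python sketch of how the two strands relate. It assumes that `*` denotes the reverse complement and that sequences are written 5' to 3'. The `B` label is left out because the slides do not define it. All domain sequences below are hypothetical placeholders, not the sequences used in the experiment, and their lengths (apart from the 30 nt f domain and the 6 nt RP prefix) are arbitrary.

```python
COMPLEMENT = str.maketrans("ATCG", "TAGC")


def reverse_complement(seq):
    """Complement each base, then reverse the strand."""
    return seq.translate(COMPLEMENT)[::-1]


def database_strand(fp, d_t, ip, f_t, rp):
    """Concatenate the domains in the order shown: FP d(T) IP f(T) RP."""
    return fp + d_t + ip + f_t + rp


def query_strand(f_q, rp, overlap=6):
    """Strand complementary to f(Q) followed by the first `overlap` nt of RP."""
    return reverse_complement(f_q + rp[:overlap])


# Hypothetical placeholder domains.
fp = "ACGTACGTAC"
d_t = "TTGCA"
ip = "CATGCATGCA"
f_t = "GTTATGCTAA" * 3  # 30 nt feature domain
rp = "GATACCGGTTAACC"

target = database_strand(fp, d_t, ip, f_t, rp)
query = query_strand(f_t, rp)  # query whose features equal the target's
binding_site = reverse_complement(query)
print(len(query), binding_site in target)  # 36 True
```

When f(Q) equals f(T), the query is fully complementary to the f(T) domain plus the first six bases of RP on the database strand.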

Dataset Construction

[Figure: query images and target images, labeled Queries and Targets]

FP d(T) IP f(T) RP

For each T out of 100 target images

f(Q)* RP[:6]* B

For each Q out of 10 query images

Experimental Results

[Figure: plots for individual query images and for all queries; labels: Ideal Data, Chance Retrieval; annotations: R²: -0.30 and R²: 0.64]

Thank you!
