APIs and Synthetic Biology

DESCRIPTION

This talk describes the API concept in engineering and why it is useful, particularly for working with genomics data. It closes with an analogy to the API concept in synthetic biology and a look at how evolution enables encapsulation.

1

The API, or how to make your computational collaborators love you, and also some perspectives on engineering biology and immunology

Uri Laserson | @laserson | laserson@cloudera.com
21 May 2014

4

5

NCBI Sequence Read Archive (SRA)

Today…1.14 petabytes

One year ago…609 terabytes

For every “-ome” there’s a “-seq”

Genome: DNA-seq
Transcriptome: RNA-seq, FRT-seq, NET-seq
Methylome: Bisulfite-seq
Immunome: Immune-seq
Proteome: PhIP-seq, Bind-n-seq

7

Crappy academic code

counts_dict = {}
for chain in vdj.parse_VDJXML(inhandle):
    try:
        counts_dict[chain.junction] += 1
    except KeyError:
        counts_dict[chain.junction] = 1

for count in counts_dict.itervalues():
    print >>outhandle, np.int_(count)

vs.

SELECT count(*) FROM antibodies GROUP BY junction
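A minimal sketch of the comparison, assuming a hypothetical `antibodies` table with a single `junction` column in an in-memory SQLite database. The sequences are made up for illustration, and the query selects `junction` alongside the count (not on the slide) so the two results can be compared.

```python
import sqlite3
from collections import Counter

# Hypothetical junction (CDR3) sequences; not data from the talk.
junctions = ["CARDY", "CARDY", "CAKGG", "CARDY", "CAKGG", "CTTWF"]

# Imperative version: count in application code.
counts_py = dict(Counter(junctions))

# Declarative version: let the database do the grouping.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE antibodies (junction TEXT)")
conn.executemany("INSERT INTO antibodies VALUES (?)", [(j,) for j in junctions])
rows = conn.execute(
    "SELECT junction, count(*) FROM antibodies GROUP BY junction"
).fetchall()
conn.close()

counts_sql = dict(rows)
print(counts_sql == counts_py)
```

Both versions produce the same counts; the SQL version states what is wanted and leaves the how to the engine.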

9

What is an API?

• Application Programming Interface
• Contract (between machines)
• Specifications for:
  1. Procedures and methods
  2. Data structures/messages

11

Stripe API


13

Java API

public interface List<E> {
    int size();
    boolean isEmpty();
    boolean contains(Object o);
    boolean add(E e);
    void add(int index, E element);
    boolean remove(Object o);
}

14

Python DB API v2.0 (PEP 249)

http://legacy.python.org/dev/peps/pep-0249/
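As a small illustration of what PEP 249 standardizes, the sketch below uses the standard-library `sqlite3` module, which implements the DB-API 2.0 interface. The `fetch_all` helper and the `receptors` table are hypothetical names invented for this example; only the connection and cursor calls come from the specification.

```python
import sqlite3

def fetch_all(conn, sql, params=()):
    """Run a query using only PEP 249 calls: cursor(), execute(), fetchall(), close()."""
    cur = conn.cursor()
    try:
        cur.execute(sql, params)
        return cur.fetchall()
    finally:
        cur.close()

# The connect() arguments are driver-specific; everything after that is the shared contract.
conn = sqlite3.connect(":memory:")
fetch_all(conn, "CREATE TABLE receptors (id TEXT, v_gene TEXT)")

cur = conn.cursor()
cur.executemany(
    "INSERT INTO receptors VALUES (?, ?)",
    [("r1", "IGHV4-39"), ("r2", "IGHV3-23")],
)
cur.close()
conn.commit()

# sqlite3 uses the "qmark" placeholder style; other drivers may declare a different paramstyle.
rows = fetch_all(conn, "SELECT id FROM receptors WHERE v_gene = ?", ("IGHV4-39",))
conn.close()

print(sqlite3.apilevel, sqlite3.paramstyle, rows)
```

Code written against `fetch_all` should need little more than a different `connect()` call and placeholder style to run on another compliant driver.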

15

Why use an API?

• Encapsulation/interfaces/abstraction
• Loose-coupling of components
• Reusable services
• Service-oriented architecture
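One way to picture encapsulation and loose coupling is a consumer that depends only on an interface. In this sketch every name (`JunctionSource`, `InMemorySource`, `TextSource`, `top_junction`) is hypothetical and chosen for illustration; it assumes Python 3.8+ for `typing.Protocol`.

```python
import io
from collections import Counter
from typing import Iterable, Protocol

class JunctionSource(Protocol):
    """The contract: anything that can yield junction strings."""
    def junctions(self) -> Iterable[str]:
        ...

class InMemorySource:
    """Serves junctions from a Python list."""
    def __init__(self, seqs):
        self._seqs = list(seqs)

    def junctions(self):
        return iter(self._seqs)

class TextSource:
    """Serves junctions from a file-like object, one per line."""
    def __init__(self, handle):
        self._handle = handle

    def junctions(self):
        for line in self._handle:
            line = line.strip()
            if line:
                yield line

def top_junction(source: JunctionSource) -> str:
    """Consumer code: knows the interface, not the storage behind it."""
    return Counter(source.junctions()).most_common(1)[0][0]

print(top_junction(InMemorySource(["CARDY", "CAKGG", "CARDY"])))
print(top_junction(TextSource(io.StringIO("CAKGG\nCAKGG\nCARDY\n"))))
```

Swapping the storage backend does not touch `top_junction`, which is the practical payoff of the bullets above.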

16

LinkedIn’s Loose Coupling Architecture

18

IFTTT (If This Then That): stitching APIs together

https://ifttt.com/recipes#popular

19

20

IMGT

21

IMGT “Spec”

http://www.imgt.org/IMGTScientificChart/

22

IMGT’s API is an FTP site

23

IMGT does not have an API

import urllib2      # Python 2 standard library
import ClientForm   # third-party Python 2 library for parsing HTML forms

def __initVQUESTform(self):
    # get form
    request = urllib2.Request(
        'http://imgt.cines.fr/IMGT_vquest/vquest?livret=0&Option=humanIg')
    response = urllib2.urlopen(request)
    forms = ClientForm.ParseResponse(
        response,
        form_parser_class=ClientForm.XHTMLCompatibleFormParser,
        backwards_compat=False)
    response.close()
    form = forms[0]

    # fill out base part of form - Synthesis view with no extra options - TEXT
    form['l01p01c03'] = ['inline']
    form['l01p01c07'] = ['2. Synthesis']
    form['l01p01c05'] = ['TEXT']  # may need to be 'TEXT'
    form['l01p01c09'] = ['60']
    form['l01p01c35'] = ['F+ORF+ in-frame P']
    form['l01p01c36'] = ['0']
    form['l01p01c40'] = ['1']  # ['1'] for searching with indels
    form['l01p01c25'] = ['default']
    ...

24

Haussler and genomics services

25

Google Genomics API


27

Flask/Bottle web server example

# Bottle-style routing; with Flask, use @app.route on a Flask app instead.
from bottle import route

@route("/receptor/<id>")
def lookup_receptor(id):
    # get the raw read
    pass

@route("/sample/<sample_id>")
def sample_summary(sample_id):
    # impl for getting sample information; can return:
    # * summary of repertoire information
    #   (num reads, VDJ distribution, etc.)
    # * demographic info
    pass

@route("/sample/<sample_id>/common_junctions")
def common_junctions(sample_id):
    # impl for getting the most common CDR3s
    pass
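To make the stubbed routes concrete without extra dependencies, here is a standard-library WSGI sketch of two of them. It is an illustration under stated assumptions, not the talk's implementation: the `SAMPLES` data, the JSON response shapes, and the `call` test helper are all invented for this example.

```python
import json
from collections import Counter
from wsgiref.util import setup_testing_defaults

# Hypothetical in-memory data standing in for a real repertoire store.
SAMPLES = {"s1": {"num_reads": 3, "junctions": ["CARDY", "CARDY", "CAKGG"]}}

def app(environ, start_response):
    """Serve /sample/<sample_id> and /sample/<sample_id>/common_junctions as JSON."""
    parts = environ["PATH_INFO"].strip("/").split("/")
    body = None
    if len(parts) >= 2 and parts[0] == "sample" and parts[1] in SAMPLES:
        sample = SAMPLES[parts[1]]
        if len(parts) == 2:
            body = {"sample_id": parts[1], "num_reads": sample["num_reads"]}
        elif parts[2:] == ["common_junctions"]:
            body = Counter(sample["junctions"]).most_common(10)
    status = "200 OK" if body is not None else "404 Not Found"
    payload = json.dumps(body if body is not None else {"error": "not found"})
    payload = payload.encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(payload)))])
    return [payload]

def call(path):
    """Invoke the app in-process, the way a WSGI server would."""
    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], json.loads(body)

print(call("/sample/s1"))
print(call("/sample/s1/common_junctions"))
```

To serve it over HTTP, the same `app` can be passed to `wsgiref.simple_server.make_server("", 8000, app)`.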

28

Genomics ETL has converged on standards

biochemistry => .fastq => short read alignment => .bam => genotype calling => .vcf => analysis

29

VCF

##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHR POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs605 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs604 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.6 GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
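Because the format is specified, a record can be unpacked with very little code. The sketch below parses one data line from the slide; `parse_vcf_line` is a hypothetical helper written for illustration. It splits on any whitespace because the slide shows spaces, whereas real VCF files are tab-delimited, and it leaves INFO values as strings rather than applying the header's type declarations. Real pipelines would normally use a dedicated VCF library.

```python
def parse_vcf_line(line, sample_names):
    """Split one VCF data line into fixed fields, INFO key/values, and per-sample calls."""
    fields = line.split()
    chrom, pos, vid, ref, alt, qual, flt, info, fmt = fields[:9]

    info_dict = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        # Flag entries such as DB or H2 carry no value.
        info_dict[key] = value if sep else True

    format_keys = fmt.split(":")
    samples = {
        name: dict(zip(format_keys, column.split(":")))
        for name, column in zip(sample_names, fields[9:])
    }
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": vid,
        "ref": ref,
        "alt": alt.split(","),
        "qual": None if qual == "." else float(qual),
        "filter": flt,
        "info": info_dict,
        "samples": samples,
    }

line = ("20 14370 rs605 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 "
        "GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.")
record = parse_vcf_line(line, ["NA00001", "NA00002", "NA00003"])
print(record["pos"], record["info"]["AF"], record["samples"]["NA00002"]["GT"])
```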

30

What about immune data?

biochemistry => .fastq => short read alignment => .bam => genotype calling => .vcf => analysis

.fastq => immune receptor alignment => .???

31

Multiple models for same types: VDJFasta

sub new {
    my ($class) = @_;
    my $self = {};
    $self->{filename} = "";
    $self->{headers} = [];
    $self->{sequence} = [];
    $self->{germline} = [];
    $self->{nseqs} = 0;
    $self->{mids} = {};

    $self->{accVsegQstart} = {}; # example: 124
    $self->{accVsegQend} = {}; # example: 417
    $self->{accJsegQstart} = {};
    $self->{accJsegQend} = {};
    $self->{accDsegQstart} = {};

32

Multiple models for same types: vdj

from Bio.SeqRecord import SeqRecord  # Biopython

class ImmuneChain(SeqRecord):
    def cdr3(self):
        return len(self.junction)

    def num_mutations(self):
        aln = self.letter_annotations['alignment']
        return aln.count('S') + aln.count('I')

    def v(self):
        return self.__getattribute__('V-REGION') \
                   .qualifiers['allele'][0]

    def v_seq(self):
        return self.__getattribute__('V-REGION') \
                   .extract(self.seq.tostring())

33

Interoperability/services depend on being able to communicate data

34

CSV

9 CCTG_PRCONS=IGHC1_R1_IGM unproductive Homsap IGHV5-51*01 F, or Homsap IGHV5-51*03 F Homsap IGHJ4*02 F Homsap
12 GGGG_PRCONS=IGHC3_R1_IGA productive Homsap IGHV3-11*01 F Homsap IGHJ1*01 F Homsap IGHD2-2*03 F .......
13 CTTC_PRCONS=IGHC5_R1_IGG unproductive Homsap IGHV1-2*02 F Homsap IGHJ5*02 F Homsap IGHD5-18*01 F .......
18 ACTT_PRCONS=IGHC3_R1_IGA productive Homsap IGKV3-15*01 F, or Homsap IGKV3D-15*01 F or Homsap IGKV3D-15*02 P Homsap
20 GGAC_PRCONS=IGHC5_R1_IGG productive Homsap IGHV4-61*02 F Homsap IGHJ4*02 F Homsap IGHD1-26*01 F .......
25 TCGT_PRCONS=IGHC2_R1_IGD productive Homsap IGHV3-23*01 F, or Homsap IGHV3-23*04 F or Homsap IGHV3-23D*01 F Homsap
26 GGTG_PRCONS=IGHC5_R1_IGG productive Homsap IGHV4-34*01 F, or Homsap IGHV4-34*02 F or Homsap IGHV4-34*08 F Homsap
28 GTGA_PRCONS=IGHC5_R1_IGG productive Homsap IGHV1-46*01 F, or Homsap IGHV1-46*02 F or Homsap IGHV1-46*03 F Homsap
31 ACCC_PRCONS=IGHC1_R1_IGM productive Homsap IGHV3-9*01 F, or Homsap IGHV3-9*02 F Homsap IGHJ3*02 F Homsap
36 GCAA_PRCONS=IGHC1_R1_IGM productive Homsap IGHV3-9*01 F, or Homsap IGHV3-9*02 F Homsap IGHJ2*01 F Homsap
39 GCAA_PRCONS=IGHC1_R1_IGM productive Homsap IGHV3-7*01 F Homsap IGHJ6*02 F Homsap IGHD1-7*01 F .......
40 GGGT_PRCONS=IGHC1_R1_IGM productive Homsap IGHV4-34*01 F, or Homsap IGHV4-34*02 F or Homsap IGHV4-34*08 F Homsap
42 TAGG_PRCONS=IGHC5_R1_IGG productive Homsap IGHV4-39*01 F, or Homsap IGHV4-39*05 F Homsap IGHJ4*02 F Homsap
47 CAAA_PRCONS=IGHC1_R1_IGM productive Homsap IGHV3-15*01 F, or Homsap IGHV3-15*02 F Homsap IGHJ6*02 F Homsap
48 AGAA_PRCONS=IGHC5_R1_IGG unproductive Homsap IGHV3-30*04 F, or Homsap IGHV3-30-3*01 F or Homsap IGHV3-30-3*02 F or Ho
52 GCAG_PRCONS=IGHC1_R1_IGM productive Homsap IGHV3-23*01 F, or Homsap IGHV3-23*04 F or Homsap IGHV3-23D*01 F Homsap
53 AACC_PRCONS=IGHC3_R1_IGA productive Homsap IGHV3-30*02 F Homsap IGHJ4*02 F Homsap IGHD5-18*01 F .......

35

XML

<ImmuneChain>
  <c>IGHD</c>
  <barcode>RL014</barcode>
  <j_start_idx>389</j_start_idx>
  <seq>TTGTGGCTATTTTAAA ... CTCGGACT</seq>
  <descr>003699_0091_0140</descr>
  <tag>coding</tag>
  <clone>IGHV3-43_IGHJ4|387</clone>
  <j>IGHJ4*02</j>
  <v_end_idx>314</v_end_idx>
  <v>IGHV3-43*01</v>
  <junction>TGTGCAAAAGATAATCT ... TCTTTGACTACTGG</junction>
  <d>IGHD5-24*01</d>
</ImmuneChain>

36

JSON

{
  "v": "IGHV4-39*02",
  "seq": "CCTATCCCCCTGTGTGCCTT ... CTCCACCAAG",
  "num_mutations": 43,
  "name": "HG2DXMN01CY8UH",
  "letter_annotations": {
    "alignment": "..............S....S....3333333333333333........S.."
  },
  "junction_nt": "GCGAGGGGCCGATGGGACTTTTATTACATGGACGTC",
  "j": "IGHJ6*03",
  "annotations": {
    "usearch_90_cluster": "6277",
    "experiment_date": "20120119",
    "donor": "17517",
    "sample_type": "memory_B_cells",
    "source": "SeqWright",
    "tags": ["revcomp", "coding"],
    "taxonomy": []
  },
  "d": "IGHD3-10*01",
  "features": [
    {
      "strand": 1,
      "type": "V-REGION",
      "location": [51, 356],
      "qualifiers": {
        "CDR_length": ["[10.7.2]"],
        "codon_start": ["1"],
        "gene": ["IGHV4-39"],
        "allele": ["IGHV4-39*02"]
      }
    },
    ...
  ]
}

http://www.json.org/
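Part of JSON's appeal is that any language can read such a record with its standard library. A minimal sketch, using a trimmed copy of the record on this slide (only a subset of its fields is kept; the derived values `v_gene` and `junction_length` are illustrative):

```python
import json

# Trimmed copy of the slide's record; elided fields are simply left out.
record_text = """
{
  "v": "IGHV4-39*02",
  "j": "IGHJ6*03",
  "d": "IGHD3-10*01",
  "num_mutations": 43,
  "junction_nt": "GCGAGGGGCCGATGGGACTTTTATTACATGGACGTC",
  "annotations": {
    "sample_type": "memory_B_cells",
    "tags": ["revcomp", "coding"]
  }
}
"""

record = json.loads(record_text)

# Nested objects and arrays arrive as plain dicts and lists.
v_gene = record["v"].split("*")[0]           # allele name -> gene name
junction_length = len(record["junction_nt"])
is_coding = "coding" in record["annotations"]["tags"]

# Serializing and parsing again gives back an equal structure.
round_trip = json.loads(json.dumps(record))

print(v_gene, junction_length, is_coding, round_trip == record)
```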

37

JSON

{ "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c6172308000000" }, "annotations" : { "D-REGION" : "IGHD3-10*01", "accessions" : "HG2DXMN01CY8
{ "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c6172308000001" }, "annotations" : { "D-REGION" : "IGHD3-9*01", "accessions" : "HG2DXMN01A3VH
{ "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c6172308000002" }, "annotations" : { "D-REGION" : "IGHD3-10*01", "accessions" : "HG2DXMN01BC6
{ "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c6172308000003" }, "annotations" : { "D-REGION" : "IGHD6-19*01", "accessions" : "HG2DXMN01DYU
{ "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c6172308000004" }, "annotations" : { "D-REGION" : "IGHD6-19*01", "accessions" : "HG2DXMN01A8F
{ "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c6172308000005" }, "annotations" : { "D-REGION" : "IGHD3-9*01", "accessions" : "HG2DXMN01BDI2
{ "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c6172308000006" }, "annotations" : { "D-REGION" : "IGHD6-19*01", "accessions" : "HG2DXMN01BS2
{ "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c6172308000007" }, "annotations" : { "D-REGION" : "IGHD6-19*01", "accessions" : "HG2DXMN01DLL
{ "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c6172308000008" }, "annotations" : { "D-REGION" : "IGHD6-25*01", "accessions" : "HG2DXMN01BLF
{ "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c6172308000009" }, "annotations" : { "D-REGION" : "IGHD3-3*01", "accessions" : "HG2DXMN01D4TL
{ "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c617230800000a" }, "annotations" : { "D-REGION" : "IGHD3-10*01", "accessions" : "HG2DXMN01BU6
{ "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c617230800000b" }, "annotations" : { "D-REGION" : "IGHD2-2*03", "accessions" : "HG2DXMN01BIMG
{ "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c617230800000c" }, "annotations" : { "D-REGION" : "IGHD3-3*01", "accessions" : "HG2DXMN01BM9Z
{ "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c617230800000d" }, "annotations" : { "D-REGION" : "IGHD2-2*03", "accessions" : "HG2DXMN01BH9Q
{ "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c617230800000e" }, "annotations" : { "D-REGION" : "IGHD6-19*01", "accessions" : "HG2DXMN01BR3

38

Binary formats

• Protobuf, Thrift, or Avro
• Flexible data model
  • All common primitive types (e.g. int, double, string)
  • Support nested types, including arrays and maps
• Efficient binary encoding
• Code generation for many languages (binary compatible)
• Support for schema evolution
• Support IDL for data types and services
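To see why a schema-driven binary encoding is compact, the sketch below packs three integer fields with the standard-library `struct` module and compares the size with JSON. This is not the Protobuf, Thrift, or Avro wire format; the fixed three-field schema and the field values are assumptions made only for illustration.

```python
import json
import struct

# Illustrative record with three small integer fields.
record = {"v_end_idx": 314, "j_start_idx": 389, "num_mutations": 43}

# Text encoding: field names travel with every record.
as_json = json.dumps(record).encode("utf-8")

# Binary encoding: the schema (field order and types) is agreed on separately,
# so only the values are written. "<HHH" = three little-endian unsigned 16-bit ints.
FIELDS = ("v_end_idx", "j_start_idx", "num_mutations")
SCHEMA = struct.Struct("<HHH")
as_binary = SCHEMA.pack(*(record[name] for name in FIELDS))

# Decoding needs the same schema on the reading side.
decoded = dict(zip(FIELDS, SCHEMA.unpack(as_binary)))

print(len(as_json), len(as_binary), decoded == record)
```

The real frameworks add what this toy lacks: optional fields, nested types, generated code, and rules for evolving the schema without breaking old readers.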

39

Thrift example: Twitter

service Twitter {
  void ping();
  bool postTweet(1:Tweet tweet);
  TweetSearchResult searchTweets(1:string query);
}

struct Tweet {
  1: required i32 userId;
  2: required string userName;
  3: required string text;
  4: optional Location loc;
  16: optional string language = "english"
}
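For readers who think in Python, the `Tweet` struct corresponds roughly to the dataclass below. This is a hand-written analogy, not the code that `thrift --gen py` emits; the `Location` fields are hypothetical because the slide does not define that type.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Location:
    # Hypothetical fields; Location is not defined on the slide.
    latitude: float
    longitude: float

@dataclass
class Tweet:
    # "required" fields: no default, so they must be supplied.
    userId: int
    userName: str
    text: str
    # "optional" fields: may be absent, with an optional default value.
    loc: Optional[Location] = None
    language: str = "english"

# A record written before `language` existed still loads: the default fills the gap.
old_record = {"userId": 1, "userName": "laserson", "text": "hello"}
tweet = Tweet(**old_record)
print(tweet.language, tweet.loc)
```

Defaults on optional fields are what let a schema gain fields over time without invalidating data already written.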

40

Thrift example: Immune receptor

cd ~/repos/kiwi
thrift --gen java kiwi-format/src/main/resources/thrift/kiwi.thrift
thrift --gen py:new_style kiwi-format/src/main/resources/thrift/kiwi.thrift

See: https://github.com/laserson/kiwi

41

Questions?

42

Biological parts specifications

• Library of parts with well-characterized input-output characteristics

• In total, similar to API spec

Canton, Nat. Biotech. 26: 787 (2008)

43

Engineering signaling pathways at inputs/outputs

Lim, Nat. Rev. Mol. Cell Biol. 11: 393 (2010)

44

Bottom-up genetic circuit design

Brophy, Nature Meth. 11: 508 (2014)


46

Predict composability of genetic elements

Kosuri, PNAS 110: 14024 (2013)

• 114 promoters x 111 RBS

“…rather than relying on prediction or standardization, we can screen synthetic libraries for desired behavior.”
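The screening idea can be sketched in a few lines: enumerate every promoter/RBS pairing, measure each, and keep the ones that meet a target. The part identifiers and the `toy_measure` function are invented placeholders; in the cited work the measurement is experimental, not computed.

```python
from itertools import product

# Placeholder part identifiers: 114 promoters and 111 ribosome binding sites.
promoters = [f"P{i:03d}" for i in range(1, 115)]
rbs_sites = [f"R{i:03d}" for i in range(1, 112)]

# Full combinatorial library: every promoter paired with every RBS.
library = list(product(promoters, rbs_sites))

def toy_measure(pair):
    """Stand-in for a measured expression level (arbitrary deterministic number)."""
    promoter, rbs = pair
    return (int(promoter[1:]) * int(rbs[1:])) % 100

def screen(candidates, measure, threshold):
    """Keep the constructs whose measured value meets the threshold."""
    return [pair for pair in candidates if measure(pair) >= threshold]

hits = screen(library, toy_measure, 95)
print(len(library), len(hits))
```

The library size is 114 × 111 = 12,654 constructs, which is why screening at this scale is feasible.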

47

ZFN (least addressable, most expensive to create) => TALEN => CRISPR/Cas (most addressable, cheapest to create)

48

Addressability for precision nanoscale engineering

Douglas, NAR 37: 5001 (2009)

49

Addressability for precision nanoscale engineering

Douglas, Nature 459: 414 (2009)

50

Evolution for encapsulation: an evolved electronic thermometer

http://www.genetic-programming.com/hc/thermometer.html

51

Lycopene synthesis optimization

Wang, Nature 460: 894 (2009)

52

Evolutionary encapsulation for signaling pathway engineering

Peisajovich, Science 328: 368 (2010)


54

Genetic isolation with Re.coli

Lajoie, Science 342: 357 (2013)

So far, we discussed antibody-only data analysis

Antigen-only data generation

Larman, Nat. Biotech. 29: 535 (2011)

Ben Larman

Steve Elledge

Agilent OLS array

59

Phage immunoprecipitation sequencing (PhIP-seq)

60

PhIP-seq proof-of-principle

[Figure: Patient A Replica 1 vs. Patient A Replica 2, axes in log10(-log10 P-value); labeled points: SAPK4, NOVA1, TGIF2LX]

61

‘Forward vaccinology’

62

‘Reverse vaccinology’

63

‘Immunization without vaccination’

64

Encapsulation for cancer immunotherapy through TMG processing

Tran, Science 344: 641 (2014)

65

Other examples?

66

Conclusions

• The API perspective helps organize and communicate data

• Use sane file formats if possible:
  • JSON for lightweight work
  • Thrift/Avro for heavyweight serialization/communication
• Decouple data modeling from implementation details
• Biological engineering: what abstractions are available?
• Evolution as nature’s encapsulator
