DESCRIPTION
Description of the API concept for engineering and how it can be useful. Particularly how it should be used with respect to genomics data. Finally, an analogy of the API concept in synthetic biology and how evolution allows encapsulation.
2
The API, or how to make your computational collaborators love you
Uri Laserson | @laserson | [email protected] May 2014
3
The API, or how to make your computational collaborators love you, and also some perspectives on engineering biology and immunology
Uri Laserson | @laserson | [email protected] May 2014
4
5
NCBI Sequence Read Archive (SRA)
Today…1.14 petabytes
One year ago…609 terabytes
For every “-ome” there’s a “-seq”
Genome: DNA-seq
Transcriptome: RNA-seq, FRT-seq, NET-seq
Methylome: Bisulfite-seq
Immunome: Immune-seq
Proteome: PhIP-seq, Bind-n-seq
7
Crappy academic code
counts_dict = {}
for chain in vdj.parse_VDJXML(inhandle):
    try:
        counts_dict[chain.junction] += 1
    except KeyError:
        counts_dict[chain.junction] = 1

for count in counts_dict.itervalues():
    print >>outhandle, np.int_(count)
vs.
SELECT count(*) FROM antibodies GROUP BY junction
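To make the comparison concrete, here is a minimal sketch that runs the slide's GROUP BY query with Python's built-in sqlite3 module and checks the result against a hand-rolled count. The `antibodies` table and `junction` column come from the slide; the junction strings and the in-memory database are assumptions for illustration only.

```python
import sqlite3
from collections import Counter

# Hypothetical junction sequences standing in for parsed chains.
junctions = ["TGTGCAAAA", "TGTGCGAGG", "TGTGCAAAA",
             "TGTGCAAAA", "TGTGCGAGG", "TGTACTACT"]

# Declarative version: load the data once, then ask the database to count.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE antibodies (junction TEXT)")
conn.executemany("INSERT INTO antibodies VALUES (?)",
                 [(j,) for j in junctions])
sql_counts = sorted(row[0] for row in conn.execute(
    "SELECT count(*) FROM antibodies GROUP BY junction"))
conn.close()

# Imperative version, with Counter replacing the try/except bookkeeping.
py_counts = sorted(Counter(junctions).values())

print(sql_counts == py_counts)
```

Both versions produce the same counts; the difference is that the query states what is wanted and leaves the looping to the database.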
9
What is an API?
• Application Programming Interface
• Contract (between machines)
• Specifications for:
  1. Procedures and methods
  2. Data structures/messages
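One way to picture "contract" in code is a small sketch using `typing.Protocol`: the protocol names the procedure and the data structure it returns, and client code depends only on that. The names `JunctionCounter`, `DictCounter`, and `most_common` are hypothetical and not taken from the talk.

```python
from typing import Dict, List, Protocol


class JunctionCounter(Protocol):
    """The contract: one procedure and the data structure it returns."""

    def count(self, junctions: List[str]) -> Dict[str, int]:
        ...


class DictCounter:
    """One implementation; callers never see that it uses a dict loop."""

    def count(self, junctions: List[str]) -> Dict[str, int]:
        counts: Dict[str, int] = {}
        for j in junctions:
            counts[j] = counts.get(j, 0) + 1
        return counts


def most_common(counter: JunctionCounter, junctions: List[str]) -> str:
    """Client code written only against the contract."""
    counts = counter.count(junctions)
    return max(counts, key=counts.get)


top = most_common(DictCounter(), ["AAA", "CCC", "AAA"])
print(top)
```

Any other class with a matching `count` method could be passed to `most_common` without changing the client.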
11
Stripe API
13
Java API
public interface List<E> {
    int size();
    boolean isEmpty();
    boolean contains(Object o);
    boolean add(E e);
    void add(int index, E element);
    boolean remove(Object o);
}
14
Python DB API v2.0 (PEP 249)
http://legacy.python.org/dev/peps/pep-0249/
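As a rough illustration of what PEP 249 standardizes, the sketch below writes a function against a connection object using only `cursor()`, `execute()`, `fetchall()`, and `close()`, which the specification defines. The built-in sqlite3 module is used as the driver; the table contents are made up, and the `?` placeholder is sqlite3's parameter style (other drivers may declare a different `paramstyle`).

```python
import sqlite3


def count_by_junction(conn):
    """Count rows per junction using only PEP 249 connection/cursor calls."""
    cur = conn.cursor()
    cur.execute("SELECT junction, count(*) FROM antibodies GROUP BY junction")
    rows = cur.fetchall()
    cur.close()
    return dict(rows)


# Set-up with hypothetical data; only this part is specific to sqlite3.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE antibodies (junction TEXT)")
cur.executemany("INSERT INTO antibodies VALUES (?)",
                [("AAA",), ("CCC",), ("AAA",)])
conn.commit()

result = count_by_junction(conn)
conn.close()
print(result)
```

Because `count_by_junction` touches only the standardized methods, swapping the database should mostly be a matter of changing how `conn` is created.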
15
Why use an API?
• Encapsulation/interfaces/abstraction
• Loose-coupling of components
• Reusable services
• Service-oriented architecture
16
LinkedIn’s Loose Coupling Architecture
18
(If This Then That)
Stitching APIs together
https://ifttt.com/recipes#popular
19
20
IMGT
22
IMGT’s API is an FTP site
23
IMGT does not have an API
def __initVQUESTform(self):
    # get form
    request = urllib2.Request(
        'http://imgt.cines.fr/IMGT_vquest/vquest?livret=0&Option=humanIg')
    response = urllib2.urlopen(request)
    forms = ClientForm.ParseResponse(
        response,
        form_parser_class=ClientForm.XHTMLCompatibleFormParser,
        backwards_compat=False)
    response.close()
    form = forms[0]

    # fill out base part of form - Synthesis view with no extra options - TEXT
    form['l01p01c03'] = ['inline']
    form['l01p01c07'] = ['2. Synthesis']
    form['l01p01c05'] = ['TEXT']  # may need to be 'TEXT'
    form['l01p01c09'] = ['60']
    form['l01p01c35'] = ['F+ORF+ in-frame P']
    form['l01p01c36'] = ['0']
    form['l01p01c40'] = ['1']  # ['1'] for searching with indels
    form['l01p01c25'] = ['default']
    ...
24
Haussler and genomics services
25
Google Genomics API
27
Flask/Bottle web server example
@route("/receptor/<id>")
def lookup_receptor(id):
    # get the raw read

@route("/sample/<sample_id>")
def sample_summary(sample_id):
    # impl for getting sample information; can return:
    # * summary of repertoire information
    #   (num reads, VDJ distribution, etc.)
    # * demographic info

@route("/sample/<sample_id>/common_junctions")
def common_junctions(sample_id):
    # impl for getting the most common CDR3s
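The route skeletons above leave the bodies empty. As a runnable sketch of the third route, here is a small WSGI application built only on the standard library rather than Flask or Bottle (a deliberate substitution so the example has no dependencies). The `SAMPLES` store, the JSON response shape, and the `call` helper are assumptions for illustration.

```python
import json
from collections import Counter

# Hypothetical in-memory store: sample id -> list of CDR3 junctions.
SAMPLES = {"s1": ["AAA", "CCC", "AAA", "GGG", "AAA", "CCC"]}


def app(environ, start_response):
    """Minimal WSGI app serving /sample/<sample_id>/common_junctions."""
    parts = environ.get("PATH_INFO", "").strip("/").split("/")
    is_route = (len(parts) == 3 and parts[0] == "sample"
                and parts[2] == "common_junctions")
    if is_route and parts[1] in SAMPLES:
        top = Counter(SAMPLES[parts[1]]).most_common(2)
        status, payload = "200 OK", {"sample_id": parts[1], "junctions": top}
    else:
        status, payload = "404 Not Found", {"error": "not found"}
    body = json.dumps(payload).encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(body)))])
    return [body]


def call(path):
    """Invoke the app directly, without opening a socket."""
    seen = {}

    def start_response(status, headers):
        seen["status"] = status

    body = b"".join(app({"PATH_INFO": path}, start_response))
    return seen["status"], json.loads(body)


print(call("/sample/s1/common_junctions"))
```

To serve it over HTTP, the same `app` object can be handed to `wsgiref.simple_server.make_server("", 8000, app).serve_forever()`.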
28
Genomics ETL has converged on standards
biochemistry => .fastq => short read alignment => .bam => genotype calling => .vcf => analysis
29
VCF
##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHR POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs605 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs604 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.6 GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
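Part of why a shared format like this helps is that a data line can be taken apart with a few string splits. The sketch below parses one line into a plain dict; it is not a full VCF parser. It assumes tab-delimited input (as real VCF files are), a numeric QUAL field, and sample names supplied by the caller; the function name `parse_vcf_line` is hypothetical.

```python
def parse_vcf_line(line, sample_names):
    """Split one tab-delimited VCF data line into a plain dict."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, filt, info, fmt = fields[:9]

    # INFO is semicolon-separated; bare keys such as DB are flags.
    info_dict = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        info_dict[key] = value if value else True

    # Each sample column follows the colon-separated FORMAT keys.
    keys = fmt.split(":")
    samples = {name: dict(zip(keys, col.split(":")))
               for name, col in zip(sample_names, fields[9:])}

    # Assumption: QUAL is numeric; a missing value "." is not handled here.
    return {"chrom": chrom, "pos": int(pos), "id": vid, "ref": ref,
            "alt": alt.split(","), "qual": float(qual), "filter": filt,
            "info": info_dict, "samples": samples}


# First data line from the slide, rebuilt with tab separators.
line = "\t".join(["20", "14370", "rs605", "G", "A", "29", "PASS",
                  "NS=3;DP=14;AF=0.5;DB;H2", "GT:GQ:DP:HQ",
                  "0|0:48:1:51,51", "1|0:48:8:51,51", "1/1:43:5:.,."])
record = parse_vcf_line(line, ["NA00001", "NA00002", "NA00003"])
print(record["info"], record["samples"]["NA00003"]["GT"])
```

Values are left as strings except for `pos` and `qual`; typing them properly would mean reading the `##INFO` and `##FORMAT` header lines, which this sketch skips.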
30
What about immune data?
biochemistry => .fastq => short read alignment => .bam => genotype calling => .vcf => analysis
immune receptor alignment => .???
31
Multiple models for same types: VDJFasta
sub new {
    my ($class) = @_;
    my $self = {};
    $self->{filename} = "";
    $self->{headers}  = [];
    $self->{sequence} = [];
    $self->{germline} = [];
    $self->{nseqs}    = 0;
    $self->{mids}     = {};

    $self->{accVsegQstart} = {}; # example: 124
    $self->{accVsegQend}   = {}; # example: 417
    $self->{accJsegQstart} = {};
    $self->{accJsegQend}   = {};
    $self->{accDsegQstart} = {};
32
Multiple models for same types: vdj
class ImmuneChain(SeqRecord):
    def cdr3(self):
        return len(self.junction)

    def num_mutations(self):
        aln = self.letter_annotations['alignment']
        return aln.count('S') + aln.count('I')

    def v(self):
        return self.__getattribute__('V-REGION') \
            .qualifiers['allele'][0]

    def v_seq(self):
        return self.__getattribute__('V-REGION') \
            .extract(self.seq.tostring())
33
Interoperability and services depend on being able to communicate data
34
CSV
9 CCTG_PRCONS=IGHC1_R1_IGM unproductive Homsap IGHV5-51*01 F, or Homsap IGHV5-51*03 F Homsap IGHJ4*02 F Homsap
12 GGGG_PRCONS=IGHC3_R1_IGA productive Homsap IGHV3-11*01 F Homsap IGHJ1*01 F Homsap IGHD2-2*03 F .......
13 CTTC_PRCONS=IGHC5_R1_IGG unproductive Homsap IGHV1-2*02 F Homsap IGHJ5*02 F Homsap IGHD5-18*01 F .......
18 ACTT_PRCONS=IGHC3_R1_IGA productive Homsap IGKV3-15*01 F, or Homsap IGKV3D-15*01 F or Homsap IGKV3D-15*02 P Homsap
20 GGAC_PRCONS=IGHC5_R1_IGG productive Homsap IGHV4-61*02 F Homsap IGHJ4*02 F Homsap IGHD1-26*01 F .......
25 TCGT_PRCONS=IGHC2_R1_IGD productive Homsap IGHV3-23*01 F, or Homsap IGHV3-23*04 F or Homsap IGHV3-23D*01 F Homsap
26 GGTG_PRCONS=IGHC5_R1_IGG productive Homsap IGHV4-34*01 F, or Homsap IGHV4-34*02 F or Homsap IGHV4-34*08 F Homsap
28 GTGA_PRCONS=IGHC5_R1_IGG productive Homsap IGHV1-46*01 F, or Homsap IGHV1-46*02 F or Homsap IGHV1-46*03 F Homsap
31 ACCC_PRCONS=IGHC1_R1_IGM productive Homsap IGHV3-9*01 F, or Homsap IGHV3-9*02 F Homsap IGHJ3*02 F Homsap
36 GCAA_PRCONS=IGHC1_R1_IGM productive Homsap IGHV3-9*01 F, or Homsap IGHV3-9*02 F Homsap IGHJ2*01 F Homsap
39 GCAA_PRCONS=IGHC1_R1_IGM productive Homsap IGHV3-7*01 F Homsap IGHJ6*02 F Homsap IGHD1-7*01 F .......
40 GGGT_PRCONS=IGHC1_R1_IGM productive Homsap IGHV4-34*01 F, or Homsap IGHV4-34*02 F or Homsap IGHV4-34*08 F Homsap
42 TAGG_PRCONS=IGHC5_R1_IGG productive Homsap IGHV4-39*01 F, or Homsap IGHV4-39*05 F Homsap IGHJ4*02 F Homsap
47 CAAA_PRCONS=IGHC1_R1_IGM productive Homsap IGHV3-15*01 F, or Homsap IGHV3-15*02 F Homsap IGHJ6*02 F Homsap
48 AGAA_PRCONS=IGHC5_R1_IGG unproductive Homsap IGHV3-30*04 F, or Homsap IGHV3-30-3*01 F or Homsap IGHV3-30-3*02 F or Ho
52 GCAG_PRCONS=IGHC1_R1_IGM productive Homsap IGHV3-23*01 F, or Homsap IGHV3-23*04 F or Homsap IGHV3-23D*01 F Homsap
53 AACC_PRCONS=IGHC3_R1_IGA productive Homsap IGHV3-30*02 F Homsap IGHJ4*02 F Homsap IGHD5-18*01 F .......
35
XML
<ImmuneChain>
  <c>IGHD</c>
  <barcode>RL014</barcode>
  <j_start_idx>389</j_start_idx>
  <seq>TTGTGGCTATTTTAAA ... CTCGGACT</seq>
  <descr>003699_0091_0140</descr>
  <tag>coding</tag>
  <clone>IGHV3-43_IGHJ4|387</clone>
  <j>IGHJ4*02</j>
  <v_end_idx>314</v_end_idx>
  <v>IGHV3-43*01</v>
  <junction>TGTGCAAAAGATAATCT ... TCTTTGACTACTGG</junction>
  <d>IGHD5-24*01</d>
</ImmuneChain>
36
JSON
{
  "v": "IGHV4-39*02",
  "seq": "CCTATCCCCCTGTGTGCCTT ... CTCCACCAAG",
  "num_mutations": 43,
  "name": "HG2DXMN01CY8UH",
  "letter_annotations": {
    "alignment": "..............S....S....3333333333333333........S.."
  },
  "junction_nt": "GCGAGGGGCCGATGGGACTTTTATTACATGGACGTC",
  "j": "IGHJ6*03",
  "annotations": {
    "usearch_90_cluster": "6277",
    "experiment_date": "20120119",
    "donor": "17517",
    "sample_type": "memory_B_cells",
    "source": "SeqWright",
    "tags": ["revcomp", "coding"],
    "taxonomy": []
  },
  "d": "IGHD3-10*01",
  "features": [
    {
      "strand": 1,
      "type": "V-REGION",
      "location": [51, 356],
      "qualifiers": {
        "CDR_length": ["[10.7.2]"],
        "codon_start": ["1"],
        "gene": ["IGHV4-39"],
        "allele": ["IGHV4-39*02"]
      }
    },
    ...
  ]
}
http://www.json.org/
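A short sketch of why JSON is convenient for lightweight work: nested records map directly onto Python dicts and lists, and the standard `json` module round-trips them without a schema. The record below reuses a few fields from the slide; writing one record per line is an assumed convention here, not something the talk prescribes.

```python
import json

# A trimmed-down record using field names from the slide.
record = {
    "v": "IGHV4-39*02",
    "j": "IGHJ6*03",
    "d": "IGHD3-10*01",
    "num_mutations": 43,
    "annotations": {"donor": "17517", "tags": ["revcomp", "coding"]},
}

# Serialize to a single line of text, then read it back.
line = json.dumps(record, sort_keys=True)
restored = json.loads(line)

print(restored["annotations"]["tags"])
```

The cost is that nothing checks the field names or types; that is the gap the schema-based binary formats aim to fill.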
37
JSON
{ "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c6172308000000" }, "annotations" : { "D-REGION" : "IGHD3-10*01", "accessions" : "HG2DXMN01CY8
{ "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c6172308000001" }, "annotations" : { "D-REGION" : "IGHD3-9*01", "accessions" : "HG2DXMN01A3VH
{ "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c6172308000002" }, "annotations" : { "D-REGION" : "IGHD3-10*01", "accessions" : "HG2DXMN01BC6
{ "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c6172308000003" }, "annotations" : { "D-REGION" : "IGHD6-19*01", "accessions" : "HG2DXMN01DYU
{ "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c6172308000004" }, "annotations" : { "D-REGION" : "IGHD6-19*01", "accessions" : "HG2DXMN01A8F
{ "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c6172308000005" }, "annotations" : { "D-REGION" : "IGHD3-9*01", "accessions" : "HG2DXMN01BDI2
{ "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c6172308000006" }, "annotations" : { "D-REGION" : "IGHD6-19*01", "accessions" : "HG2DXMN01BS2
{ "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c6172308000007" }, "annotations" : { "D-REGION" : "IGHD6-19*01", "accessions" : "HG2DXMN01DLL
{ "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c6172308000008" }, "annotations" : { "D-REGION" : "IGHD6-25*01", "accessions" : "HG2DXMN01BLF
{ "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c6172308000009" }, "annotations" : { "D-REGION" : "IGHD3-3*01", "accessions" : "HG2DXMN01D4TL
{ "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c617230800000a" }, "annotations" : { "D-REGION" : "IGHD3-10*01", "accessions" : "HG2DXMN01BU6
{ "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c617230800000b" }, "annotations" : { "D-REGION" : "IGHD2-2*03", "accessions" : "HG2DXMN01BIMG
{ "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c617230800000c" }, "annotations" : { "D-REGION" : "IGHD3-3*01", "accessions" : "HG2DXMN01BM9Z
{ "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c617230800000d" }, "annotations" : { "D-REGION" : "IGHD2-2*03", "accessions" : "HG2DXMN01BH9Q
{ "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c617230800000e" }, "annotations" : { "D-REGION" : "IGHD6-19*01", "accessions" : "HG2DXMN01BR3
38
Binary formats
• Protobuf, Thrift, or Avro
• Flexible data model
  • All common primitive types (e.g. int, double, string)
  • Support for nested types, including arrays and maps
• Efficient binary encoding
• Code generation for many languages (binary compatible)
• Support for schema evolution
• Support IDL for data types and services
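To give a feel for the "efficient binary encoding" point, the sketch below packs three integer fields with the standard `struct` module and compares the size with the same record as JSON. This is not Protobuf, Thrift, or Avro, and it offers none of their schema evolution or code generation; it only illustrates that when both sides share a schema, field names need not travel with every record. The fixed three-field layout is an assumption.

```python
import json
import struct

# Hypothetical shared schema: three little-endian unsigned 16-bit integers.
FIELDS = ("v_end_idx", "j_start_idx", "num_mutations")
LAYOUT = struct.Struct("<HHH")

record = {"v_end_idx": 314, "j_start_idx": 389, "num_mutations": 43}

# Binary encoding: values only, in schema order.
binary = LAYOUT.pack(*(record[name] for name in FIELDS))

# Text encoding: field names are repeated inside every record.
text = json.dumps(record).encode("utf-8")

# Decoding needs the same schema on the reading side.
decoded = dict(zip(FIELDS, LAYOUT.unpack(binary)))

print(len(binary), len(text))
```

The real formats add what this sketch lacks: tagged fields, optional values, nested types, and rules for reading data written under an older schema.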
39
Thrift example: Twitter
service Twitter {
  void ping();
  bool postTweet(1:Tweet tweet);
  TweetSearchResult searchTweets(1:string query);
}

struct Tweet {
  1: required i32 userId;
  2: required string userName;
  3: required string text;
  4: optional Location loc;
  16: optional string language = "english"
}
40
Thrift example: Immune receptor
cd ~/repos/kiwi
thrift --gen java kiwi-format/src/main/resources/thrift/kiwi.thrift
thrift --gen py:new_style kiwi-format/src/main/resources/thrift/kiwi.thrift
See: https://github.com/laserson/kiwi
41
Questions?
42
Biological parts specifications
• Library of parts with well-characterized input-output characteristics
• In total, similar to API spec
Canton, Nat. Biotech. 26: 787 (2008)
43
Engineering signaling pathways at inputs/outputs
Lim, Nat. Rev. Mol. Cell 11: 393 (2010)
44
Bottom-up genetic circuit design
Brophy, Nature Meth. 11: 508 (2014)
46
Predict composability of genetic elements
Kosuri, PNAS 110: 14024 (2013)
• 114 promoters x 111 RBS
“…rather than relying on prediction or standardization, we can screen synthetic libraries for desired behavior.”
47
ZFN => TALEN => CRISPR/Cas
ZFN end: least addressable, most expensive to create
CRISPR/Cas end: most addressable, cheapest to create
48
Addressability for precision nanoscale engineering
Douglas, NAR 37: 5001 (2009)
49
Addressability for precision nanoscale engineering
Douglas, Nature 459: 414 (2009)
50
Evolution for encapsulation: an evolved electronic thermometer
http://www.genetic-programming.com/hc/thermometer.html
51
Lycopene synthesis optimization
Wang, Nature 460: 894 (2009)
52
Evolutionary encapsulation for signaling pathway engineering
Peisajovich, Science 328: 368 (2010)
54
Genetic isolation with Re.coli
Lajoie, Science 342: 357 (2013)
So far, we discussed antibody-only data analysis
Antigen-only data generation
Larman, Nat. Biotech. 29: 535 (2011)
Ben Larman
Steve Elledge
Agilent OLS array
59
Phage immunoprecipitation sequencing (PhIP-seq)
60
[Figure: Patient A Replica 1 vs. Patient A Replica 2; axis label log10(-log10 P-value); labeled points: SAPK4, NOVA1, TGIF2LX]
PhIP-seq proof-of-principle
61
‘Forward vaccinology’
62
‘Reverse vaccinology’
63
‘Immunization without vaccination’
64
Encapsulation for cancer immunotherapy through TMG processing
Tran, Science 344: 641 (2014)
65
Other examples?
66
Conclusions
• The API perspective helps organize and communicate data
• Use sane file formats if possible:
  • JSON for lightweight work
  • Thrift/Avro for heavyweight serialization/communication
• Decouple data modeling from implementation details
• Biological engineering: what abstractions are available?
• Evolution as nature’s encapsulator
67