Python Tools in Computational Chemistry (and Biology) Andrew Dalke Dalke Scientific, AB Göteborg, Sweden EuroSciPy, 26-27 July, 2008


Python Tools in Computational Chemistry

(and Biology)

Andrew Dalke

Dalke Scientific, AB
Göteborg, Sweden

EuroSciPy, 26-27 July, 2008


“Why does ‘import numpy’ take 0.4 seconds? Does it need to import 228 libraries?”

- My first Numpy-discussion post (paraphrased)

“Your use case isn't so typical and so suffers on the import time end of the balance.”

- Response from Robert Kern

(Others did complain. Import time down to 0.28s.)
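One way to reproduce this kind of measurement is to time the import in a fresh interpreter, so that modules already cached in `sys.modules` do not hide the cost. The sketch below is only an illustration: the helper names `time_import` and `count_new_modules` are made up here, and `json` stands in for `numpy` so that it runs without third-party packages.

```python
import subprocess
import sys
import time


def time_import(module_name, repeats=3):
    """Best wall-clock time (seconds) to start Python and import a module.

    Every run uses a fresh interpreter, so the figure includes interpreter
    startup as well as the import itself.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run([sys.executable, "-c", f"import {module_name}"], check=True)
        best = min(best, time.perf_counter() - start)
    return best


def count_new_modules(module_name):
    """Number of entries the import adds to sys.modules in a fresh interpreter."""
    code = (
        "import sys; before = set(sys.modules); "
        f"import {module_name}; "
        "print(len(set(sys.modules) - before))"
    )
    result = subprocess.run(
        [sys.executable, "-c", code], check=True, capture_output=True, text=True
    )
    return int(result.stdout)


if __name__ == "__main__":
    # "json" is a stand-in; replace with "numpy" if it is installed.
    print(f"import json: {time_import('json'):.3f} s, "
          f"{count_new_modules('json')} new modules")
```

CPython 3.7 and later can also print a per-module breakdown with `python -X importtime -c "import numpy"`.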


HEADER PHOTORECEPTOR 23-MAY-90 1BRD 1BRD 2
COMPND BACTERIORHODOPSIN 1BRD 3
SOURCE (HALOBACTERIUM $HALOBIUM) 1BRD 4
EXPDTA ELECTRON DIFFRACTION 1BRD 5
AUTHOR R.HENDERSON,J.M.BALDWIN,T.A.CESKA,F.ZEMLIN,E.BECKMANN, 1BRD 6
AUTHOR 2 K.H.DOWNING 1BRD 7
REVDAT 3 15-JAN-93 1BRDB 1 SEQRES 1BRDB 1
REVDAT 2 15-JUL-91 1BRDA 1 REMARK 1BRDA 1
..
ATOM 54 N PRO 8 20.397 -15.569 -13.739 1.00 20.00 1BRD 136
ATOM 55 CA PRO 8 21.592 -15.444 -12.900 1.00 20.00 1BRD 137
ATOM 56 C PRO 8 21.359 -15.206 -11.424 1.00 20.00 1BRD 138
ATOM 57 O PRO 8 21.904 -15.930 -10.563 1.00 20.00 1BRD 139
ATOM 58 CB PRO 8 22.367 -14.319 -13.591 1.00 20.00 1BRD 140
ATOM 59 CG PRO 8 22.089 -14.564 -15.053 1.00 20.00 1BRD 141
ATOM 60 CD PRO 8 20.647 -15.054 -15.103 1.00 20.00 1BRD 142
ATOM 61 N GLU 9 20.562 -14.211 -11.095 1.00 20.00 1BRD 143
ATOM 62 CA GLU 9 20.192 -13.808 -9.737 1.00 20.00 1BRD 144
ATOM 63 C GLU 9 19.567 -14.935 -8.932 1.00 20.00 1BRD 145
ATOM 64 O GLU 9 19.815 -15.104 -7.724 1.00 20.00 1BRD 146
ATOM 65 CB GLU 9 19.248 -12.591 -9.820 1.00 99.00 1 1BRD 147
ATOM 66 CG GLU 9 19.902 -11.351 -10.387 1.00 99.00 1 1BRD 148
ATOM 67 CD GLU 9 19.243 -10.169 -10.980 1.00 99.00 1 1BRD 149
ATOM 68 OE1 GLU 9 18.323 -10.191 -11.782 1.00 99.00 1 1BRD 150
ATOM 69 OE2 GLU 9 19.760 -9.089 -10.597 1.00 99.00 1 1BRD 151
ATOM 70 N TRP 10 18.764 -15.737 -9.597 1.00 20.00 1BRD 152
ATOM 71 CA TRP 10 18.034 -16.884 -9.090 1.00 20.00 1BRD 153
ATOM 72 C TRP 10 18.843 -17.908 -8.318 1.00 20.00 1BRD 154
ATOM 73 O TRP 10 18.376 -18.310 -7.230 1.00 20.00 1BRD 155
..

PDB: 52,000 structures, doubles every 2½ years


Structure input

- Parse file into list of atoms
  (Format spec? What spec? Which spec?)
- Distance search to identify bonds
- Residue assignment
- Characterize molecules as protein, DNA, water
- Secondary structure assignment
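The first two steps can be sketched in a few lines. This is a rough illustration under stated assumptions, not how any particular viewer does it: it reads only ATOM records, uses the fixed column positions of the PDB format, and calls two atoms bonded when they are closer than a single 1.9 Å cutoff (an illustrative value; real programs use element-dependent distances). `make_atom_line` is a made-up helper that exists only to build correctly aligned sample input from four atoms of the 1BRD excerpt.

```python
import math
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    resname: str
    resseq: int
    x: float
    y: float
    z: float


def make_atom_line(serial, name, resname, resseq, x, y, z):
    """Build a column-aligned ATOM line. Helper for sample input only."""
    return (f"ATOM  {serial:5d}  {name:<3s} {resname:>3s}  {resseq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{20.0:6.2f}")


def parse_atoms(lines):
    """Parse ATOM records by fixed columns; all other records are skipped."""
    atoms = []
    for line in lines:
        if not line.startswith("ATOM"):
            continue
        atoms.append(Atom(
            serial=int(line[6:11]),
            name=line[12:16].strip(),
            resname=line[17:20].strip(),
            resseq=int(line[22:26]),
            x=float(line[30:38]),
            y=float(line[38:46]),
            z=float(line[46:54]),
        ))
    return atoms


def find_bonds(atoms, cutoff=1.9):
    """Naive O(n^2) distance search: index pairs closer than `cutoff` angstroms."""
    bonds = []
    for i, a in enumerate(atoms):
        for j in range(i + 1, len(atoms)):
            b = atoms[j]
            if math.dist((a.x, a.y, a.z), (b.x, b.y, b.z)) < cutoff:
                bonds.append((i, j))
    return bonds


# Backbone atoms of PRO 8 from the 1BRD excerpt.
SAMPLE_LINES = [
    make_atom_line(54, "N", "PRO", 8, 20.397, -15.569, -13.739),
    make_atom_line(55, "CA", "PRO", 8, 21.592, -15.444, -12.900),
    make_atom_line(56, "C", "PRO", 8, 21.359, -15.206, -11.424),
    make_atom_line(57, "O", "PRO", 8, 21.904, -15.930, -10.563),
]

if __name__ == "__main__":
    atoms = parse_atoms(SAMPLE_LINES)
    for i, j in find_bonds(atoms):
        print(atoms[i].name, "-", atoms[j].name)
```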


Structure visualization

Part science, part esthetics
(Pretty pictures get to be on journal covers.)

Spheres assume everything is equally important.

Specialized ways to visualize protein, DNA, even water.

Molecular surfaces, charge and density isosurfaces


Interactive use

Display the results with OpenGL

Atom/region selection (mouse and text)

Change representation style and color

GUIs to control all of this


Scriptability

Setup scripts, movies, demos, analysis

Display other items in the scene

Tcl is a great language for this!

VMD switches between Tcl and Python.
PyMol adds a command syntax to Python.

IPython?


Visualization programs are popular!

Python is popular! VMD, PyMol, Chimera, PMV, Vida, BALLView, Yasara

Tcl(ish): VMD, RasMol, gOpenMol

Java: JMol, MarvinView, OpenAstexViewer

Other/commercial: Sybyl, MOE (and 100+ more)


Molecular Dynamics

$$F = ma$$

$$U = U_{\text{bond}} + U_{\text{angle}} + U_{\text{dihedral}} + U_{\text{improper}} + U_{\text{Urey-Bradley}} + U_{\text{electrostatic}} + U_{\text{van der Waals}}$$

Numerically integrated with ~1 femtosecond timesteps

well studied - 1950s for gases, 1970s for biomolecules
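To make "numerically integrated" concrete, here is a minimal sketch of one common scheme, velocity Verlet, applied to a single particle on a harmonic "bond" in one dimension. It is not taken from any of the MD packages; the reduced units, the spring constant, and the function names are assumptions chosen for illustration.

```python
import math


def harmonic_force(x, k=1.0, x0=0.0):
    """F = -dU/dx for U = 0.5 * k * (x - x0)**2."""
    return -k * (x - x0)


def total_energy(x, v, mass=1.0, k=1.0, x0=0.0):
    """Kinetic plus potential energy of the harmonic oscillator."""
    return 0.5 * mass * v * v + 0.5 * k * (x - x0) ** 2


def velocity_verlet(x, v, mass, dt, n_steps, force=harmonic_force):
    """Integrate F = ma with the velocity Verlet scheme; returns [(x, v), ...]."""
    f = force(x)
    trajectory = [(x, v)]
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # second half-step velocity update
        trajectory.append((x, v))
    return trajectory


if __name__ == "__main__":
    traj = velocity_verlet(x=1.0, v=0.0, mass=1.0, dt=0.01, n_steps=1000)
    x_end, v_end = traj[-1]
    print("final position:", x_end, "analytic:", math.cos(10.0))
    print("energy drift:", total_energy(x_end, v_end) - total_energy(1.0, 0.0))
```

With these parameters the exact solution is x(t) = cos(t), so the final position can be compared against `math.cos(10.0)`, and the total energy should stay close to its starting value of 0.5.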


http://www-dsv.cea.fr/instituts/institut-de-recherches-en-technologies-et-sciences-pour-le-vivant-irtsv/unites-de-recherche/laboratoire-chimie-et-biologie-des-metaux-lcbm/equipe-modelisation-interactions-et-repliement/breve-introduction-a-la-mecanique-mm-et-a-la-dynamique-moleculaire-dm

O(n²); or O(n log n) using particle mesh Ewald

O(n) with cutoffs
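One standard way to get close to O(n) with a cutoff is a cell list: bin the atoms into cubes whose edge equals the cutoff, then compare each atom only with atoms in its own and the 26 neighbouring cells. The sketch below assumes no periodic boundaries and roughly uniform density; the function names are illustrative, and the brute-force version is included only as a reference to check against.

```python
import itertools
import math
import random
from collections import defaultdict


def pairs_brute_force(points, cutoff):
    """All index pairs (i, j), i < j, closer than cutoff. O(n^2)."""
    return {
        (i, j)
        for i, j in itertools.combinations(range(len(points)), 2)
        if math.dist(points[i], points[j]) < cutoff
    }


def pairs_cell_list(points, cutoff):
    """Same pairs, found by binning points into cubic cells of edge `cutoff`."""
    cells = defaultdict(list)
    for idx, (x, y, z) in enumerate(points):
        key = (math.floor(x / cutoff), math.floor(y / cutoff), math.floor(z / cutoff))
        cells[key].append(idx)

    offsets = list(itertools.product((-1, 0, 1), repeat=3))
    found = set()
    for (cx, cy, cz), members in cells.items():
        for dx, dy, dz in offsets:
            # .get() avoids creating empty cells while iterating.
            neighbours = cells.get((cx + dx, cy + dy, cz + dz))
            if not neighbours:
                continue
            for i in members:
                for j in neighbours:
                    if i < j and math.dist(points[i], points[j]) < cutoff:
                        found.add((i, j))
    return found


if __name__ == "__main__":
    random.seed(0)
    pts = [tuple(random.uniform(0.0, 10.0) for _ in range(3)) for _ in range(200)]
    print(len(pairs_cell_list(pts, 1.5)), "pairs within the cutoff")
```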


Plus:
- Choice of force fields
- Long-range cutoffs
- User-defined forces
- Boundary conditions
- Choice of integration methods
- Special integrators for hydrogen (SHAKE)
- Rigid body dynamics
- Dihedral dynamics
- Constant E/T/P/count
- Hybrid quantum/classical
- ...


NAMD (C++ with Tcl scripting)
DL_POLY (Fortran)
AMBER (Fortran, C, C++)
GROMACS (C rewrite of GROMOS/Fortran)
TINKER (Fortran and some C)
CHARMM (Fortran)
MOLDY (C, and GPLed)

Where’s Python? MMTK and nMOLDYN, BALLView, Molecular Dynamics Language (MDL).

Minority software


Bioinformatics

~3 billion base pairs

20-25,000 protein-coding genes


GenBank

82 million records
85 billion base pairs

data doubles every 18 months


LOCUS NM_052942 2431 bp mRNA linear PRI 25-MAY-2008
DEFINITION Homo sapiens guanylate binding protein 5 (GBP5), mRNA.
ACCESSION NM_052942
VERSION NM_052942.2 GI:31377630
KEYWORDS .
SOURCE Homo sapiens (human)
  ORGANISM Homo sapiens
    Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
    Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
    Catarrhini; Hominidae; Homo.
REFERENCE 1 (bases 1 to 2431)
  AUTHORS Ito,Y., Shibata-Watanabe,Y., Ushijima,Y., Kawada,J., Nishiyama,Y., Kojima,S. and Kimura,H.
  TITLE Oligonucleotide microarray analysis of gene expression profiles followed by real-time reverse-transcriptase polymerase chain reaction assay in chronic active Epstein-Barr virus infection
  JOURNAL J. Infect. Dis. 197 (5), 663-666 (2008)
  PUBMED 18260761
...
FEATURES Location/Qualifiers
  source 1..2431
    /organism="Homo sapiens"
    /mol_type="mRNA"
    /db_xref="taxon:9606"
    /chromosome="1"
    /map="1p22.2"
  gene 1..2431
    /gene="GBP5"
    /synonym="GBP-5"
    /note="guanylate binding protein 5"
    /db_xref="GeneID:115362"
    /db_xref="HGNC:19895"
    /db_xref="HPRD:13571"
    /db_xref="MIM:611467"
⋯
  CDS 525..2285
    /gene="GBP5"
    /codon_start=1
    /product="guanylate-binding protein 5"
    /protein_id="NP_443174.1"
    /db_xref="GI:16418425"
    /db_xref="CCDS:CCDS722.1"
    /db_xref="GeneID:115362"
    /db_xref="HGNC:19895"
    /db_xref="HPRD:13571"
    /db_xref="MIM:611467"
    /translation="MALEIHMSDPMCLIENFNEQLKVNQEALEILSAITQPVVVVAIV
    GLYRTGKSYLMNKLAGKNKGFSVASTVQSHTKGIWIWCVPHPNWPNHTLVLLDTEGLG
    DVEKADNKNDIQIFALALLLSSTFVYNTVNKIDQGAIDLLHNVTELTDLLKARNSPDL
    DRVEDPADSASFFPDLVWTLRDFCLGLEIDGQLVTPDEYLENSLRPKQGSDQRVQNFN
    LPRLCIQKFFPKKKCFIFDLPAHQKKLAQLETLPDDELEPEFVQQVTEFCSYIFSHSM
    TKTLPGGIMVNGSRLKNLVLTYVNAISSGDLPCIENAVLALAQRENSAAVQKAIAHYD
    QQMGQKVQLPMETLQELLDLHRTSEREAIEVFMKNSFKDVDQSFQKELETLLDAKQND
    ICKRNLEASSDYCSALLKDIFGPLEEAVKQGIYSKPGGHNLFIQKTEELKAKYYREPR
    KGIQAEEVLQKYLKSKESVSHAILQTDQALTETEKKKKEAQVKAEAEKAEAQRLAAIQ
    RQNEQMMQERERLHQEQVRQMEIAKQNWLAEQQKMQEQQMQEQAAQLSTTFQAQNRSL
    LSELQHAQRTVNNDDPCVLL"
...
ORIGIN
1 ctccaggctg tggaaccttt gttctttcac tctttgcaat aaatcttgct gctgctcact
61 ctttgggtcc acactgcctt tatgagctgt aacactcact gggaatgtct gcagcttcac
121 tcctgaagcc agcgagacca cgaacccacc aggaggaaca aacaactcca gacgcgcagc
181 cttaagagct gtaacactca ccgcgaaggt ctgcagcttc actcctgagc cagccagacc
241 acgaacccac cagaaggaag aaactccaaa cacatccgaa catcagaagg agcaaactcc
301 tgacacgcca cctttaagaa ccgtgacact caacgctagg gtccgcggct tcattcttga
361 agtcagtgag accaagaacc caccaattcc ggacacgcta attgttgtag atcatcactt
421 caaggtgccc atatctttct agtggaaaaa ttattctggc ctccgctgca tacaaatcag
481 gcaaccagaa ttctacatat ataaggcaaa gtaacatcct agacatggct ttagagatcc
541 acatgtcaga ccccatgtgc ctcatcgaga actttaatga gcagctgaag gttaatcagg
601 aagctttgga gatcctgtct gccattacgc aacctgtagt tgtggtagcg attgtgggcc
661 tctatcgcac tggcaaatcc tacctgatga acaagctggc tgggaagaac aagggcttct
721 ctgttgcatc tacggtgcag tctcacacca agggaatttg gatatggtgt gtgcctcatc
781 ccaactggcc aaatcacaca ttagttctgc ttgacaccga gggcctggga gatgtagaga
841 aggctgacaa caagaatgat atccagatct ttgcactggc actcttactg agcagcacct
901 ttgtgtacaa tactgtgaac aaaattgatc agggtgctat cgacctactg cacaatgtga
961 cagaactgac agatctgctc aaggcaagaa actcacccga ccttgacagg gttgaagatc
1021 ctgctgactc tgcgagcttc ttcccagact tagtgtggac tctgagagat ttctgcttag
1081 gcctggaaat agatgggcaa cttgtcacac cagatgaata cctggagaat tccctaaggc
1141 caaagcaagg tagtgatcaa agagttcaaa atttcaattt gccccgtctg tgtatacaga
1201 agttctttcc aaaaaagaaa tgctttatct ttgacttacc tgctcaccaa aaaaagcttg
1261 cccaacttga aacactgcct gatgatgagc tagagcctga atttgtgcaa caagtgacag
1321 aattctgttc ctacatcttt agccattcta tgaccaagac tcttccaggt ggcatcatgg
1381 tcaatggatc tcgtctaaag aacctggtgc tgacctatgt caatgccatc agcagtgggg
1441 atctgccttg catagagaat gcagtcctgg ccttggctca gagagagaac tcagctgcag
1501 tgcaaaaggc cattgcccac tatgaccagc aaatgggcca gaaagtgcag ctgcccatgg
1561 aaaccctcca ggagctgctg gacctgcaca ggaccagtga gagggaggcc attgaagtct
1621 tcatgaaaaa ctctttcaag gatgtagacc aaagtttcca gaaagaattg gagactctac
1681 tagatgcaaa acagaatgac atttgtaaac ggaacctgga agcatcctcg gattattgct
1741 cggctttact taaggatatt tttggtcctc tagaagaagc agtgaagcag ggaatttatt
1801 ctaagccagg aggccataat ctcttcattc agaaaacaga agaactgaag gcaaagtact
1861 atcgggagcc tcggaaagga atacaggctg aagaagttct gcagaaatat ttaaagtcca
1921 aggagtctgt gagtcatgca atattacaga ctgaccaggc tctcacagag acggaaaaaa
1981 agaagaaaga ggcacaagtg aaagcagaag ctgaaaaggc tgaagcgcaa aggttggcgg
2041 cgattcaaag gcagaacgag caaatgatgc aggagaggga gagactccat caggaacaag
2101 tgagacaaat ggagatagcc aaacaaaatt ggctggcaga gcaacagaaa atgcaggaac
2161 aacagatgca ggaacaggct gcacagctca gcacaacatt ccaagctcaa aatagaagcc
2221 ttctcagtga gctccagcac gcccagagga ctgttaataa cgatgatcca tgtgttttac
2281 tctaaagtgc taaatatggg agtttccttt ttttactctt tgtcactgat gacacaacag
2341 aaaagaaact gtagaccttg ggacaatcaa catttaaata aactttataa ttattttttc
2401 aaactttaaa aaaaaaaaaa aaaaaaaaaa a
//


How do they sequence a genome?


Is my newly sequenced DNA similar to existing DNA?

What does “similar” mean? How similar?


Mutation

CAT
COT
COG
DOG

Insertion/Deletion

T-REE
THREE

What about Levenshtein distance? (“edit distance”)
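For reference, a compact sketch of the textbook dynamic-programming algorithm for Levenshtein distance, in which every substitution, insertion, and deletion costs 1. The function name is illustrative.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits that turn string a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("CAT", "DOG"))     # three substitutions
    print(levenshtein("TREE", "THREE"))  # one insertion
```

Because every edit costs the same, `levenshtein("mind", "mine")` and `levenshtein("mind", "mini")` both come out as 1; plain edit distance cannot say that one substitution is more plausible than another.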


mind/mine are closer than mind/mini

3 bases encode one amino acid

Change in the third base (“silent” mutation within each pair):

CAU→Histidine
CAC→Histidine
CAA→Glutamine
CAG→Glutamine

Change in the first base:

UAU→Tyrosine
CAU→Histidine
AAU→Asparagine
GAU→Aspartic acid
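The same point as a small sketch: look up both codons and compare the amino acids. The table is deliberately partial (only the seven codons listed above, from the standard genetic code), and the function name is made up for this example.

```python
# Partial codon table: only the codons shown above (standard genetic code).
CODON_TABLE = {
    "CAU": "His", "CAC": "His",
    "CAA": "Gln", "CAG": "Gln",
    "UAU": "Tyr", "AAU": "Asn", "GAU": "Asp",
}


def classify_substitution(codon_before, codon_after):
    """Return 'silent' if both codons encode the same amino acid, else 'missense'."""
    before = CODON_TABLE[codon_before]
    after = CODON_TABLE[codon_after]
    return "silent" if before == after else "missense"


if __name__ == "__main__":
    for before, after in [("CAU", "CAC"), ("CAA", "CAG"), ("CAU", "UAU"), ("CAU", "GAU")]:
        print(before, "->", after, classify_substitution(before, after))
```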


BLOSUM62

Different gap scores for:
- creating a gap
- extending an existing gap
- leading/trailing gaps
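A sketch of how the three gap cases might enter the score of an already-aligned pair of rows. Everything numeric here is an assumption for illustration: a flat match/mismatch score stands in for BLOSUM62, the first column of a gap costs `gap_open`, each further column costs `gap_extend`, and leading/trailing gaps can optionally be made free.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1,
                    gap_open=-2, gap_extend=-1, free_end_gaps=False):
    """Score two aligned rows of equal length; '-' marks a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    n = len(row_a)

    def end_gap_columns(row):
        """Column indices covered by a leading or trailing run of '-'."""
        leading = len(row) - len(row.lstrip("-"))
        trailing = len(row) - len(row.rstrip("-"))
        return set(range(leading)) | set(range(n - trailing, n))

    unscored = set()
    if free_end_gaps:
        unscored = end_gap_columns(row_a) | end_gap_columns(row_b)

    score = 0
    previous_gap_row = None  # which row held a gap in the previous column
    for column, (a, b) in enumerate(zip(row_a, row_b)):
        if a == "-" and b == "-":
            raise ValueError("a column cannot be a gap in both rows")
        gap_row = "a" if a == "-" else "b" if b == "-" else None
        if gap_row is None:
            score += match if a == b else mismatch
        elif column not in unscored:
            # Continuing a gap in the same row is an extension; otherwise a new gap.
            score += gap_extend if gap_row == previous_gap_row else gap_open
        previous_gap_row = gap_row
    return score


if __name__ == "__main__":
    print(score_alignment("T-REE", "THREE"))
    print(score_alignment("--AC", "TTAC"))
    print(score_alignment("--AC", "TTAC", free_end_gaps=True))
```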


BLAST - heuristic approximation

Needleman-Wunsch: global alignments

Smith-Waterman: local alignments (FASTA)

implemented in C
variations on FPGAs, GPUs

Expectation values based on Gumbel extreme value distribution
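The two exact algorithms differ only slightly in their recurrences, which a score-only sketch shows well. It is deliberately simplified and is not how the C implementations work: a flat match/mismatch score replaces a substitution matrix, the gap penalty is linear rather than affine, and only the best score is returned, not the alignment itself.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score: both sequences are aligned end to end."""
    previous = [j * gap for j in range(len(b) + 1)]
    for i, char_a in enumerate(a, start=1):
        current = [i * gap]
        for j, char_b in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if char_a == char_b else mismatch)
            current.append(max(diagonal, previous[j] + gap, current[j - 1] + gap))
        previous = current
    return previous[-1]


def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score: cells are floored at 0, best cell wins."""
    best = 0
    previous = [0] * (len(b) + 1)
    for char_a in a:
        current = [0]
        for j, char_b in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if char_a == char_b else mismatch)
            cell = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


if __name__ == "__main__":
    print(needleman_wunsch_score("TREE", "THREE"))
    print(smith_waterman_score("AAATREE", "TREEGGG"))
```

Both run in O(len(a) × len(b)) time, which is the cost that heuristic tools such as BLAST try to avoid on database-sized searches.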


>P40620 HMG1/2-LIKE PROTEIN.
 Length = 149
 Score = 64.0 bits (153), Expect = 2e-10
 Identities = 36/93 (38%), Positives = 49/93 (51%), Gaps = 6/93 (6%)

Query: 79 PPKGETKKKFKDPNAPKRPPSAFFLFCSEYRPKIKGEHP-GLSIGDVAKKLGEMWNNTAA 137
          P KG   K+ KDPN PKRPPSAFF+F +++R + K +HP   S+  V K  GE W + +
Sbjct: 33 PAKG---KEPKDPNKPKRPPSAFFVFMADFREQYKKDHPNNKSVAAVGKACGEEWKSLSE 89

Query: 138 DDKQPXXXXXXXXXXXXXXDIAAYRAK--GKPD 168
           ++K P               + AY  K  GK D
Sbjct: 90  EEKAPYVDRALKKKEEYEITLQAYNKKLEGKDD 122


Molecular Phylogenetics of Mastodon and Tyrannosaurus rex

Organ et al., Science, 25 April 2008

We report a molecular phylogeny for a nonavian dinosaur, extending our knowledge of trait evolution within nonavian dinosaurs into the macromolecular level of biological organization. Fragments of collagen α1(I) and α2(I) proteins extracted from fossil bones of Tyrannosaurus rex and Mammut americanum (mastodon) ...


PRANK:

Probabilistic Alignment Kit


BioPython
- Sequence and Alignment formats and objects
- Interfaces to third-party programs
- Clients to NCBI’s web services
- Structural Bioinformatics
- Population genetics
- Clustering, SVM, ...

pygr - genome analysis, comparative genomics

Galaxy - bioinformatics analysis web application


Why is Perl the preferred language in bioinformatics?

Strings, Unix, databases, CGI


Cheminformatics

2519
-OEChem-07130818102D

24 25 0 0 0 0 0 0 0999 V2000
3.7321 2.0000 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
2.0000 -1.0000 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
3.7321 -1.0000 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
5.5443 0.8047 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
2.8660 0.5000 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
5.5443 -0.8047 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
4.5981 0.5000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
...
1.4631 1.3100 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
1.6900 0.4631 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
1 9 2 0 0 0 0
2 10 2 0 0 0 0
3 8 1 0 0 0 0
3 10 1 0 0 0 0
3 12 1 0 0 0 0
....
14 23 1 0 0 0 0
14 24 1 0 0 0 0
M  END
> <PUBCHEM_COMPOUND_CID>
2519

> <PUBCHEM_IUPAC_SYSTEMATIC_NAME>
1,3,7-trimethylpurine-2,6-dione

> <PUBCHEM_IUPAC_TRADITIONAL_NAME>
1,3,7-trimethylxanthine

> <PUBCHEM_NIST_INCHI>
InChI=1/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3

> <PUBCHEM_EXACT_MASS>
194.080376

CN1C(=O)N(C)C(=O)C(N(C)C=N2)=C12


19 million records/41 million substances

ChemSpider: 20+ million records

But number of records is mostly arbitrary.

40 million records


What do you know about this structure?

Common name: caffeine

IUPAC name: 1,3,7-trimethylxanthine

CAS #: 58-08-2


CN1C(=O)N(C)C(=O)C(N(C)C=N2)=C12
12 3  4 5 6 7  8 9 A B C D   E

[Figure: 2D structure of caffeine with its atoms labeled 1-9 and A-E to match the SMILES string]


Substructure search (subgraph isomorphism)

126,484 substructures in PubChem

caffeine

adenosine

2-methyl-3-hydroxybutyryl-CoA

purine substructure


Fingerprints

linear fragments    hash values
CH3                 132
CH3-N               932
CH3-N-C             109
CH3-N-C=O           571
CH3-N-C-N           430

[Figure: bit vector running from bit 0 to bit 1023, with an x at each hashed position]
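The hashing step can be sketched as follows. This is only an illustration of the idea: the fragments are passed in as ready-made strings (enumerating them from a molecular graph is left out), and `zlib.crc32` is an arbitrary stand-in hash, so the bit positions it produces will not match the values on the slide or those of any real toolkit.

```python
import zlib

FP_BITS = 1024  # bits are numbered 0 .. 1023, as on the slide


def fragment_bit(fragment, n_bits=FP_BITS):
    """Map a fragment string to one bit position (stand-in hash: CRC32)."""
    return zlib.crc32(fragment.encode("ascii")) % n_bits


def fingerprint(fragments, n_bits=FP_BITS):
    """Return the set of bit positions switched on by the given fragments."""
    return {fragment_bit(fragment, n_bits) for fragment in fragments}


if __name__ == "__main__":
    fragments = ["CH3", "CH3-N", "CH3-N-C", "CH3-N-C=O", "CH3-N-C-N"]
    for fragment in fragments:
        print(f"{fragment:10s} -> bit {fragment_bit(fragment)}")
    print(sorted(fingerprint(fragments)))
```

Two different fragments can hash to the same bit, so a fingerprint may have fewer bits set than there are fragments.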


[Figure: fingerprint bit rows (x = bit set) for 2-methyl-3-hydroxybutyryl-CoA, adenosine, and caffeine]

$$\text{Tanimoto score} = \frac{\#\text{ bits in } A \cap B}{\#\text{ bits in } A \cup B}$$
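In code the formula is a one-liner once each fingerprint is held as a set of on-bit positions. The sketch below makes one choice the formula leaves open: two empty fingerprints are given a score of 0.0 here, purely as a convention for this example.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto score of two fingerprints given as collections of set bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


if __name__ == "__main__":
    print(tanimoto({1, 2, 3}, {2, 3, 4}))  # 2 shared bits out of 4 distinct bits
```

The score runs from 0.0 (no bits in common) to 1.0 (identical bit sets).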


Lots of Python!

PyDaylight - Daylight toolkit via SWIG
Frowns - mostly Python, based on PyDaylight API
OEChem - bindings via SWIG
pybel - OpenBabel via SWIG
RDKit - C++/Python toolkit using Boost
CDK - using Jython

cinfony implements a common API!

JOELib is another Java toolkit
ABCD is an internal .Net system from J&J


Bad News - Sketchers

Native-code plugins: ChemDraw, Isis/Draw, MDLDraw

Java plugins: MarvinSketch, JChemPaint, ChemWriter

Nothing in Python!


Gene Expression

BioConductor - R


Quantum mechanics

Many in Fortran, some in C or C++: Jaguar, NWChem, Gaussian, GAMESS, Mopac, ...

I only know of one using Python: PyQuante


What do I want for the future?

- better documentation
- APIs meant for interactive use
- better GUI support for interactive use
- more training in how to program