Python Tools in Computational Chemistry
(and Biology)
Andrew Dalke
Dalke Scientific AB, Göteborg, Sweden
EuroSciPy, 26-27 July, 2008
“Why does ‘import numpy’ take 0.4 seconds? Does it need to import 228 libraries?”
- My first Numpy-discussion post (paraphrased)
“Your use case isn't so typical and so suffers on the import time end of the balance.”
- Response from Robert Kern
(Others did complain. Import time down to 0.28s.)
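One way to reproduce numbers like these is sketched below. The slide does not say how the 0.4 seconds or the 228 libraries were measured, so the helper `measure_import` is a hypothetical name, and `json` is used only as a stand-in so the example runs without NumPy installed.

```python
import importlib
import sys
import time


def measure_import(module_name):
    """Return (seconds, newly loaded module count) for one import.

    Hypothetical helper: counts the entries added to sys.modules,
    which is one plausible reading of "libraries imported".
    """
    before = set(sys.modules)
    start = time.perf_counter()
    importlib.import_module(module_name)
    elapsed = time.perf_counter() - start
    return elapsed, len(set(sys.modules) - before)


# "json" is a stand-in; substitute "numpy" if it is installed.
seconds, new_modules = measure_import("json")
print("import took %.4f s and loaded %d modules" % (seconds, new_modules))
```

Run it in a fresh interpreter: a module that is already loaded reports zero new modules and almost no time.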
HEADER PHOTORECEPTOR 23-MAY-90 1BRD 1BRD 2
COMPND BACTERIORHODOPSIN 1BRD 3
SOURCE (HALOBACTERIUM $HALOBIUM) 1BRD 4
EXPDTA ELECTRON DIFFRACTION 1BRD 5
AUTHOR R.HENDERSON,J.M.BALDWIN,T.A.CESKA,F.ZEMLIN,E.BECKMANN, 1BRD 6
AUTHOR 2 K.H.DOWNING 1BRD 7
REVDAT 3 15-JAN-93 1BRDB 1 SEQRES 1BRDB 1
REVDAT 2 15-JUL-91 1BRDA 1 REMARK 1BRDA 1
..
ATOM 54 N PRO 8 20.397 -15.569 -13.739 1.00 20.00 1BRD 136
ATOM 55 CA PRO 8 21.592 -15.444 -12.900 1.00 20.00 1BRD 137
ATOM 56 C PRO 8 21.359 -15.206 -11.424 1.00 20.00 1BRD 138
ATOM 57 O PRO 8 21.904 -15.930 -10.563 1.00 20.00 1BRD 139
ATOM 58 CB PRO 8 22.367 -14.319 -13.591 1.00 20.00 1BRD 140
ATOM 59 CG PRO 8 22.089 -14.564 -15.053 1.00 20.00 1BRD 141
ATOM 60 CD PRO 8 20.647 -15.054 -15.103 1.00 20.00 1BRD 142
ATOM 61 N GLU 9 20.562 -14.211 -11.095 1.00 20.00 1BRD 143
ATOM 62 CA GLU 9 20.192 -13.808 -9.737 1.00 20.00 1BRD 144
ATOM 63 C GLU 9 19.567 -14.935 -8.932 1.00 20.00 1BRD 145
ATOM 64 O GLU 9 19.815 -15.104 -7.724 1.00 20.00 1BRD 146
ATOM 65 CB GLU 9 19.248 -12.591 -9.820 1.00 99.00 1 1BRD 147
ATOM 66 CG GLU 9 19.902 -11.351 -10.387 1.00 99.00 1 1BRD 148
ATOM 67 CD GLU 9 19.243 -10.169 -10.980 1.00 99.00 1 1BRD 149
ATOM 68 OE1 GLU 9 18.323 -10.191 -11.782 1.00 99.00 1 1BRD 150
ATOM 69 OE2 GLU 9 19.760 -9.089 -10.597 1.00 99.00 1 1BRD 151
ATOM 70 N TRP 10 18.764 -15.737 -9.597 1.00 20.00 1BRD 152
ATOM 71 CA TRP 10 18.034 -16.884 -9.090 1.00 20.00 1BRD 153
ATOM 72 C TRP 10 18.843 -17.908 -8.318 1.00 20.00 1BRD 154
ATOM 73 O TRP 10 18.376 -18.310 -7.230 1.00 20.00 1BRD 155
..
PDB
52,000 structures
doubles every 2½ years
Structure input
Parse file into list of atoms
  Format spec? What spec? Which spec?
Distance search to identify bonds
Residue assignment
Characterize molecules as protein, DNA, water
Secondary structure assignment
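The first two steps (parse atoms, then a distance search for bonds) can be sketched as follows. This is a minimal illustration, not how any particular viewer does it: it splits on whitespace because the records are shown here without their column alignment (real PDB files are fixed-column), and the single 1.9 Å cutoff in `find_bonds` is an assumed value, where real programs typically use element-dependent radii.

```python
import itertools
import math

# The seven proline atoms from the 1BRD sample above.
SAMPLE_RECORDS = """\
ATOM 54 N PRO 8 20.397 -15.569 -13.739 1.00 20.00
ATOM 55 CA PRO 8 21.592 -15.444 -12.900 1.00 20.00
ATOM 56 C PRO 8 21.359 -15.206 -11.424 1.00 20.00
ATOM 57 O PRO 8 21.904 -15.930 -10.563 1.00 20.00
ATOM 58 CB PRO 8 22.367 -14.319 -13.591 1.00 20.00
ATOM 59 CG PRO 8 22.089 -14.564 -15.053 1.00 20.00
ATOM 60 CD PRO 8 20.647 -15.054 -15.103 1.00 20.00
""".splitlines()


def parse_atom(line):
    """Parse one whitespace-separated ATOM record (an assumption;
    the real format is defined by fixed columns)."""
    fields = line.split()
    return {
        "serial": int(fields[1]),
        "name": fields[2],
        "resname": fields[3],
        "resseq": int(fields[4]),
        "xyz": tuple(float(v) for v in fields[5:8]),
    }


def find_bonds(atoms, cutoff=1.9):
    """Return name pairs closer than `cutoff` angstroms (O(n^2) search)."""
    bonds = []
    for a, b in itertools.combinations(atoms, 2):
        if math.dist(a["xyz"], b["xyz"]) < cutoff:
            bonds.append((a["name"], b["name"]))
    return bonds


ATOMS = [parse_atom(line) for line in SAMPLE_RECORDS]
for pair in find_bonds(ATOMS):
    print("%s-%s" % pair)
```

On this sample the search recovers the proline ring, including the CD-N bond that closes it.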
Structure visualization
Part science, part esthetics
(Pretty pictures get to be on journal covers.)
Spheres assume everything is equally important.
Specialized ways to visualize protein, DNA, even water.
Molecular surfaces, charge and density isosurfaces
Display the results with OpenGL
Atom/region selection (mouse and text)
Change representation style and color
GUIs to control all of this
Interactive use
Scriptability
Setup scripts, movies, demos, analysis
Display other items in the scene
Tcl is a great language for this!
VMD switches between Tcl and Python.
PyMol adds a command syntax to Python.
IPython?
Python is popular!
VMD, PyMol, Chimera, PMV, Vida, BALLView, Yasara
Visualization programs are popular!
Tcl(ish): VMD, RasMol, gOpenMol
Java: JMol, MarvinView, OpenAstexViewer
Other/commercial: Sybyl, MOE
(and 100+ more)
Molecular Dynamics
$F = ma$
$U = U_{\text{bond}} + U_{\text{angle}} + U_{\text{dihedral}} + U_{\text{improper}} + U_{\text{Urey-Bradley}} + U_{\text{electrostatic}} + U_{\text{van der Waals}}$
Numerically integrated with ~1 femtosecond timesteps
well studied - 1950s for gases, 1970s for biomolecules
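To make "numerically integrated" concrete, here is a minimal sketch of one common scheme, velocity Verlet, applied to a single harmonic "bond". The integrator choice, the reduced units (mass and spring constant of 1) and every name in the block are assumptions for illustration. None of it is taken from a specific MD package.

```python
def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate F = ma for one particle in one dimension.

    Returns the list of (position, velocity) pairs after each step.
    """
    a = force(x) / mass
    trajectory = []
    for _ in range(n_steps):
        x = x + v * dt + 0.5 * a * dt * dt   # advance the position
        a_new = force(x) / mass              # force at the new position
        v = v + 0.5 * (a + a_new) * dt       # advance the velocity
        a = a_new
        trajectory.append((x, v))
    return trajectory


# Hypothetical one-term "force field": U = 1/2 k x^2, in reduced units.
K_SPRING = 1.0
MASS = 1.0


def harmonic_force(x):
    return -K_SPRING * x


def total_energy(x, v):
    return 0.5 * MASS * v * v + 0.5 * K_SPRING * x * x


trajectory = velocity_verlet(1.0, 0.0, harmonic_force, MASS, dt=0.01, n_steps=1000)
x_end, v_end = trajectory[-1]
print("energy drift:", total_energy(x_end, v_end) - total_energy(1.0, 0.0))
```

The timestep has to be small compared with the fastest motion in the system, which is why real simulations use steps of about a femtosecond.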
http://www-dsv.cea.fr/instituts/institut-de-recherches-en-technologies-et-sciences-pour-le-vivant-irtsv/unites-de-recherche/laboratoire-chimie-et-biologie-des-metaux-lcbm/equipe-modelisation-interactions-et-repliement/breve-introduction-a-la-mecanique-mm-et-a-la-dynamique-moleculaire-dm
O(n²); or O(n log n) using particle mesh Ewald
O(n) with cutoffs
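The difference between the all-pairs search and a cutoff-based one can be sketched with a cell list: bin the atoms into cubes one cutoff wide, then compare each atom only with atoms in the same or adjacent cubes. This is an illustrative sketch with hypothetical function names and no periodic boundaries. It does not cover particle mesh Ewald.

```python
import itertools
import random
from collections import defaultdict


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def pairs_all(points, cutoff):
    """O(n^2): test every pair against the cutoff."""
    limit = cutoff * cutoff
    return {(i, j)
            for i, j in itertools.combinations(range(len(points)), 2)
            if squared_distance(points[i], points[j]) <= limit}


def pairs_cell_list(points, cutoff):
    """Roughly O(n) at fixed density: only look in the 27 nearby cells."""
    limit = cutoff * cutoff
    cells = defaultdict(list)
    for index, point in enumerate(points):
        cells[tuple(int(c // cutoff) for c in point)].append(index)
    offsets = list(itertools.product((-1, 0, 1), repeat=3))
    found = set()
    for cell, members in cells.items():
        for offset in offsets:
            neighbor_cell = tuple(c + d for c, d in zip(cell, offset))
            for j in cells.get(neighbor_cell, ()):
                for i in members:
                    if i < j and squared_distance(points[i], points[j]) <= limit:
                        found.add((i, j))
    return found


# Made-up test data: 200 random points in a 20 x 20 x 20 box.
rng = random.Random(0)
POINTS = [tuple(rng.uniform(0, 20) for _ in range(3)) for _ in range(200)]
print(len(pairs_cell_list(POINTS, 3.0)), "pairs within the cutoff")
```

Both functions return the same pairs. The cell list just avoids looking at atoms that are too far away to matter.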
Plus:
Choice of force fields
Long-range cutoffs
User-defined forces
Boundary conditions
Choice of integration methods
Special integrators for hydrogen (SHAKE)
Rigid body dynamics
Dihedral dynamics
Constant E/T/P/count
Hybrid quantum/classical
...
NAMD (C++ with Tcl scripting)
DL_POLY (Fortran)
AMBER (Fortran, C, C++)
GROMACS (C rewrite of GROMOS/Fortran)
TINKER (Fortran and some C)
CHARMM (Fortran)
MOLDY (C, and GPLed)
Where’s Python? MMTK and nMOLDYN, BALLView, Molecular Dynamics Language (MDL).
Minority software
~3 billion base pairs
20-25,000 protein-coding genes
Bioinformatics
GenBank
82 million records
85 billion base pairs
data doubles every 18 months
LOCUS NM_052942 2431 bp mRNA linear PRI 25-MAY-2008
DEFINITION Homo sapiens guanylate binding protein 5 (GBP5), mRNA.
ACCESSION NM_052942
VERSION NM_052942.2 GI:31377630
KEYWORDS .
SOURCE Homo sapiens (human)
  ORGANISM Homo sapiens
    Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
    Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
    Catarrhini; Hominidae; Homo.
REFERENCE 1 (bases 1 to 2431)
  AUTHORS Ito,Y., Shibata-Watanabe,Y., Ushijima,Y., Kawada,J.,
    Nishiyama,Y., Kojima,S. and Kimura,H.
  TITLE Oligonucleotide microarray analysis of gene expression profiles
    followed by real-time reverse-transcriptase polymerase chain
    reaction assay in chronic active Epstein-Barr virus infection
  JOURNAL J. Infect. Dis. 197 (5), 663-666 (2008)
  PUBMED 18260761
...
FEATURES Location/Qualifiers
  source 1..2431
    /organism="Homo sapiens"
    /mol_type="mRNA"
    /db_xref="taxon:9606"
    /chromosome="1"
    /map="1p22.2"
  gene 1..2431
    /gene="GBP5"
    /synonym="GBP-5"
    /note="guanylate binding protein 5"
    /db_xref="GeneID:115362"
    /db_xref="HGNC:19895"
    /db_xref="HPRD:13571"
    /db_xref="MIM:611467"
⋯
⋯ CDS 525..2285 /gene="GBP5" /codon_start=1 /product="guanylate-binding protein 5" /protein_id="NP_443174.1" /db_xref="GI:16418425" /db_xref="CCDS:CCDS722.1" /db_xref="GeneID:115362" /db_xref="HGNC:19895" /db_xref="HPRD:13571" /db_xref="MIM:611467" /translation="MALEIHMSDPMCLIENFNEQLKVNQEALEILSAITQPVVVVAIV GLYRTGKSYLMNKLAGKNKGFSVASTVQSHTKGIWIWCVPHPNWPNHTLVLLDTEGLG DVEKADNKNDIQIFALALLLSSTFVYNTVNKIDQGAIDLLHNVTELTDLLKARNSPDL DRVEDPADSASFFPDLVWTLRDFCLGLEIDGQLVTPDEYLENSLRPKQGSDQRVQNFN LPRLCIQKFFPKKKCFIFDLPAHQKKLAQLETLPDDELEPEFVQQVTEFCSYIFSHSM TKTLPGGIMVNGSRLKNLVLTYVNAISSGDLPCIENAVLALAQRENSAAVQKAIAHYD QQMGQKVQLPMETLQELLDLHRTSEREAIEVFMKNSFKDVDQSFQKELETLLDAKQND ICKRNLEASSDYCSALLKDIFGPLEEAVKQGIYSKPGGHNLFIQKTEELKAKYYREPR KGIQAEEVLQKYLKSKESVSHAILQTDQALTETEKKKKEAQVKAEAEKAEAQRLAAIQ RQNEQMMQERERLHQEQVRQMEIAKQNWLAEQQKMQEQQMQEQAAQLSTTFQAQNRSL LSELQHAQRTVNNDDPCVLL" ...ORIGIN 1 ctccaggctg tggaaccttt gttctttcac tctttgcaat aaatcttgct gctgctcact 61 ctttgggtcc acactgcctt tatgagctgt aacactcact gggaatgtct gcagcttcac 121 tcctgaagcc agcgagacca cgaacccacc aggaggaaca aacaactcca gacgcgcagc 181 cttaagagct gtaacactca ccgcgaaggt ctgcagcttc actcctgagc cagccagacc 241 acgaacccac cagaaggaag aaactccaaa cacatccgaa catcagaagg agcaaactcc 301 tgacacgcca cctttaagaa ccgtgacact caacgctagg gtccgcggct tcattcttga 361 agtcagtgag accaagaacc caccaattcc ggacacgcta attgttgtag atcatcactt 421 caaggtgccc atatctttct agtggaaaaa ttattctggc ctccgctgca tacaaatcag 481 gcaaccagaa ttctacatat ataaggcaaa gtaacatcct agacatggct ttagagatcc 541 acatgtcaga ccccatgtgc ctcatcgaga actttaatga gcagctgaag gttaatcagg 601 aagctttgga gatcctgtct gccattacgc aacctgtagt tgtggtagcg attgtgggcc 661 tctatcgcac tggcaaatcc tacctgatga acaagctggc tgggaagaac aagggcttct 721 ctgttgcatc tacggtgcag tctcacacca agggaatttg gatatggtgt gtgcctcatc 781 ccaactggcc aaatcacaca ttagttctgc ttgacaccga gggcctggga gatgtagaga 841 aggctgacaa caagaatgat atccagatct ttgcactggc actcttactg agcagcacct 901 ttgtgtacaa tactgtgaac aaaattgatc agggtgctat cgacctactg cacaatgtga 961 
cagaactgac agatctgctc aaggcaagaa actcacccga ccttgacagg gttgaagatc 1021 ctgctgactc tgcgagcttc ttcccagact tagtgtggac tctgagagat ttctgcttag 1081 gcctggaaat agatgggcaa cttgtcacac cagatgaata cctggagaat tccctaaggc 1141 caaagcaagg tagtgatcaa agagttcaaa atttcaattt gccccgtctg tgtatacaga 1201 agttctttcc aaaaaagaaa tgctttatct ttgacttacc tgctcaccaa aaaaagcttg 1261 cccaacttga aacactgcct gatgatgagc tagagcctga atttgtgcaa caagtgacag 1321 aattctgttc ctacatcttt agccattcta tgaccaagac tcttccaggt ggcatcatgg 1381 tcaatggatc tcgtctaaag aacctggtgc tgacctatgt caatgccatc agcagtgggg 1441 atctgccttg catagagaat gcagtcctgg ccttggctca gagagagaac tcagctgcag 1501 tgcaaaaggc cattgcccac tatgaccagc aaatgggcca gaaagtgcag ctgcccatgg 1561 aaaccctcca ggagctgctg gacctgcaca ggaccagtga gagggaggcc attgaagtct 1621 tcatgaaaaa ctctttcaag gatgtagacc aaagtttcca gaaagaattg gagactctac 1681 tagatgcaaa acagaatgac atttgtaaac ggaacctgga agcatcctcg gattattgct 1741 cggctttact taaggatatt tttggtcctc tagaagaagc agtgaagcag ggaatttatt 1801 ctaagccagg aggccataat ctcttcattc agaaaacaga agaactgaag gcaaagtact 1861 atcgggagcc tcggaaagga atacaggctg aagaagttct gcagaaatat ttaaagtcca 1921 aggagtctgt gagtcatgca atattacaga ctgaccaggc tctcacagag acggaaaaaa 1981 agaagaaaga ggcacaagtg aaagcagaag ctgaaaaggc tgaagcgcaa aggttggcgg 2041 cgattcaaag gcagaacgag caaatgatgc aggagaggga gagactccat caggaacaag 2101 tgagacaaat ggagatagcc aaacaaaatt ggctggcaga gcaacagaaa atgcaggaac 2161 aacagatgca ggaacaggct gcacagctca gcacaacatt ccaagctcaa aatagaagcc 2221 ttctcagtga gctccagcac gcccagagga ctgttaataa cgatgatcca tgtgttttac 2281 tctaaagtgc taaatatggg agtttccttt ttttactctt tgtcactgat gacacaacag 2341 aaaagaaact gtagaccttg ggacaatcaa catttaaata aactttataa ttattttttc 2401 aaactttaaa aaaaaaaaaa aaaaaaaaaa a//
How do they sequence a genome?
Is my newly sequenced DNA similar to existing DNA?
What does “similar” mean?
How similar?
CAT → COT → COG → DOG
Mutation
T-REE
THREE
Insertion/Deletion
What about Levenshtein distance? (“edit distance”)
mind/mine are closer than mind/mini
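A plain Levenshtein distance, sketched below with the standard dynamic-programming recurrence, shows why it is not enough: it charges the same cost for every substitution, so it cannot tell mind/mine from mind/mini. The function name is just for illustration.

```python
def levenshtein(a, b):
    """Minimum number of single-character substitutions, insertions,
    and deletions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


for pair in [("CAT", "DOG"), ("TREE", "THREE"), ("mind", "mine"), ("mind", "mini")]:
    print(pair, levenshtein(*pair))
```

Both "mind" pairs score 1, which is the motivation for scoring schemes that weight each kind of change differently.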
3 bases encode one amino acid

Change in the third base:
CAU → Histidine
CAC → Histidine
CAA → Glutamine
CAG → Glutamine
“Silent” mutation: the amino acid is unchanged (for example, CAU and CAC both give histidine)

Change in the first base:
UAU → Tyrosine
CAU → Histidine
AAU → Asparagine
GAU → Aspartic acid
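The same idea as a lookup, using only the codons listed above (the full genetic code has 64 entries). `classify_mutation` is a hypothetical helper name.

```python
# Partial codon table: only the entries shown on the slide.
CODON_TABLE = {
    "CAU": "Histidine",
    "CAC": "Histidine",
    "CAA": "Glutamine",
    "CAG": "Glutamine",
    "UAU": "Tyrosine",
    "AAU": "Asparagine",
    "GAU": "Aspartic acid",
}


def classify_mutation(before, after):
    """Label a codon change as silent or as an amino acid change."""
    old, new = CODON_TABLE[before], CODON_TABLE[after]
    if old == new:
        return "silent"
    return "%s -> %s" % (old, new)


print(classify_mutation("CAU", "CAC"))  # third-base change
print(classify_mutation("CAU", "UAU"))  # first-base change
```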
BLOSUM62
Different gap scores for:
- creating a gap
- extending an existing gap
- leading/trailing gaps
BLAST - heuristic approximation
Needleman-Wunsch: global alignments
Smith-Waterman: local alignments (FASTA)
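A minimal sketch of Needleman-Wunsch global alignment follows. To stay short it makes two simplifying assumptions that differ from the slides: a flat match/mismatch score instead of BLOSUM62, and a single linear gap penalty instead of separate create/extend scores. The default score values are arbitrary.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def pair_score(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table: each cell is the best of diagonal, up, and left.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + pair_score(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


total, top, bottom = needleman_wunsch("TREE", "THREE")
print(total)
print(top)
print(bottom)
```

Smith-Waterman uses the same table but clamps each cell at zero and starts the traceback from the highest-scoring cell, which is what makes the alignment local.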
implemented in C
variations on FPGAs, GPUs, ...
Expectation values based on Gumbel extreme value distribution
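For reference, the expectation value is usually written in the Karlin-Altschul form below. This is general background rather than something stated on the slide: $m$ and $n$ are the query and database lengths, $S$ is the raw alignment score, and $K$ and $\lambda$ are constants that depend on the scoring system.

```latex
E = K \, m \, n \, e^{-\lambda S},
\qquad
S' = \frac{\lambda S - \ln K}{\ln 2},
\qquad
E = m \, n \, 2^{-S'}
```

Here $S'$ is the bit score, which is why search reports give both a score in bits and a raw score in parentheses.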
>P40620 HMG1/2-LIKE PROTEIN.
 Length = 149
 Score = 64.0 bits (153), Expect = 2e-10
 Identities = 36/93 (38%), Positives = 49/93 (51%), Gaps = 6/93 (6%)

Query: 79 PPKGETKKKFKDPNAPKRPPSAFFLFCSEYRPKIKGEHP-GLSIGDVAKKLGEMWNNTAA 137
          P KG   K+ KDPN PKRPPSAFF+F +++R + K +HP   S+  V K  GE W + + 
Sbjct: 33 PAKG---KEPKDPNKPKRPPSAFFVFMADFREQYKKDHPNNKSVAAVGKACGEEWKSLSE 89

Query: 138 DDKQPXXXXXXXXXXXXXXDIAAYRAK--GKPD 168
           ++K P               + AY  K  GK D
Sbjct: 90  EEKAPYVDRALKKKEEYEITLQAYNKKLEGKDD 122
Molecular Phylogenetics of Mastodon and Tyrannosaurus rex
Organ, et al., Science 25 April 2008
We report a molecular phylogeny for a nonavian dinosaur, extending our knowledge of trait evolution within nonavian dinosaurs into the macromolecular level of biological organization. Fragments of collagen α1(I) and α2(I) proteins extracted from fossil bones of Tyrannosaurus rex and Mammut americanum (mastodon) ...
PRANK: Probabilistic Alignment Kit
BioPython
Sequence and alignment formats and objects
Interfaces to third-party programs
Clients to NCBI’s web services
Structural bioinformatics
Population genetics
Clustering, SVM, ...
pygr - genome analysis, comparative genomics
Galaxy - bioinformatics analysis web application
Why is Perl the preferred language in bioinformatics?
Strings, Unix, databases, CGI
2519
-OEChem-07130818102D
24 25 0 0 0 0 0 0 0999 V2000
3.7321 2.0000 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
2.0000 -1.0000 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
3.7321 -1.0000 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
5.5443 0.8047 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
2.8660 0.5000 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
5.5443 -0.8047 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
4.5981 0.5000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
...
1.4631 1.3100 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
1.6900 0.4631 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
1 9 2 0 0 0 0
2 10 2 0 0 0 0
3 8 1 0 0 0 0
3 10 1 0 0 0 0
3 12 1 0 0 0 0
....
14 23 1 0 0 0 0
14 24 1 0 0 0 0
M END
> <PUBCHEM_COMPOUND_CID>
2519

> <PUBCHEM_IUPAC_SYSTEMATIC_NAME>
1,3,7-trimethylpurine-2,6-dione

> <PUBCHEM_IUPAC_TRADITIONAL_NAME>
1,3,7-trimethylxanthine

> <PUBCHEM_NIST_INCHI>
InChI=1/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3

> <PUBCHEM_EXACT_MASS>
194.080376
Cheminformatics
CN1C(=O)N(C)C(=O)C(N(C)C=N2)=C12
19 million records/41 million substances
ChemSpider: 20+ million records
But number of records is mostly arbitrary.
40 million records
What do you know about this structure?
Common name: caffeine
IUPAC name: 1,3,7-trimethylxanthine
CAS #: 58-08-2
CN1C(=O)N(C)C(=O)C(N(C)C=N2)=C12
Atom numbering: 1 2 3 4 5 6 7 8 9 A B C D E
[Figure: caffeine structure diagram with atoms labeled 1-9 and A-E]
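The atom labels can be recovered by walking the SMILES string and numbering each atom symbol in the order it appears. The sketch below is deliberately naive: it assumes only unbracketed atoms from the organic subset, which is enough for this caffeine string but is not a general SMILES parser. `label_atoms` is a hypothetical helper.

```python
import re

# Two-letter symbols first so "Cl" is not read as "C" followed by "l".
ATOM_PATTERN = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")
LABELS = "123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def label_atoms(smiles):
    """Return [(label, symbol), ...] for each atom in SMILES order."""
    symbols = ATOM_PATTERN.findall(smiles)
    return list(zip(LABELS, symbols))


CAFFEINE = "CN1C(=O)N(C)C(=O)C(N(C)C=N2)=C12"
for label, symbol in label_atoms(CAFFEINE):
    print(label, symbol)
```

The 14 labels (1 through E) are the 8 carbons, 4 nitrogens and 2 oxygens of C8H10N4O2. Hydrogens are implicit.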
Substructure search (subgraph isomorphism)
126,484 substructures in PubChem
caffeine
adenosine
2-methyl-3-hydroxybutyryl-CoA
purine substructure
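Substructure search can be sketched as a backtracking subgraph match: assign query atoms to target atoms one at a time and back up when a bond in the query has no counterpart in the target. This toy version compares element symbols only and ignores bond orders, aromaticity and hydrogens. The function names and the ethanol example are made up for illustration.

```python
def adjacency(n_atoms, bonds):
    """Build a list of neighbor sets from (atom_index, atom_index) pairs."""
    neighbors = [set() for _ in range(n_atoms)]
    for a, b in bonds:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return neighbors


def find_substructure(q_atoms, q_bonds, t_atoms, t_bonds):
    """Return one {query_index: target_index} mapping, or None."""
    q_adj = adjacency(len(q_atoms), q_bonds)
    t_adj = adjacency(len(t_atoms), t_bonds)

    def search(mapping):
        if len(mapping) == len(q_atoms):
            return dict(mapping)
        q = len(mapping)  # next query atom, taken in index order
        for t in range(len(t_atoms)):
            if t in mapping.values() or q_atoms[q] != t_atoms[t]:
                continue
            # Every already-mapped neighbor of q must be bonded to t.
            if all(mapping[n] in t_adj[t] for n in q_adj[q] if n in mapping):
                mapping[q] = t
                result = search(mapping)
                if result is not None:
                    return result
                del mapping[q]
        return None

    return search({})


# Hypothetical example: find a C-O fragment in ethanol (C-C-O).
ETHANOL_ATOMS = ["C", "C", "O"]
ETHANOL_BONDS = [(0, 1), (1, 2)]
print(find_substructure(["C", "O"], [(0, 1)], ETHANOL_ATOMS, ETHANOL_BONDS))
```

The worst case is exponential in the query size, which is one reason large databases pre-screen candidates with fingerprints before running the full match.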
Fingerprints
CH3
CH3-N
CH3-N-C
CH3-N-C=O
CH3-N-C-N
132 932 109 571 430
bit 0 bit 1023
linear fragments hash values
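A path-based fingerprint of this kind can be sketched as follows: enumerate the linear fragments starting from each atom, hash each fragment string, and set the corresponding bit. The hash function (`zlib.crc32`), the 4-atom path limit and the small `ATOMS`/`BONDS` example are assumptions chosen for illustration, so the bit positions will not match the numbers on the slide.

```python
import zlib


def linear_fragments(atoms, bonds, start, max_atoms=4):
    """List the linear paths of up to max_atoms atoms that begin at start."""
    neighbors = {}
    for (a, b), symbol in bonds.items():
        neighbors.setdefault(a, []).append((b, symbol))
        neighbors.setdefault(b, []).append((a, symbol))
    fragments = []

    def extend(path, label):
        fragments.append(label)
        if len(path) == max_atoms:
            return
        for nxt, symbol in neighbors.get(path[-1], []):
            if nxt not in path:
                extend(path + [nxt], label + symbol + atoms[nxt])

    extend([start], atoms[start])
    return fragments


def fingerprint(atoms, bonds, nbits=1024, max_atoms=4):
    """Hash every linear fragment to a bit and return the bits as an int."""
    bits = 0
    for start in range(len(atoms)):
        for fragment in linear_fragments(atoms, bonds, start, max_atoms):
            bits |= 1 << (zlib.crc32(fragment.encode("ascii")) % nbits)
    return bits


# Hypothetical piece of caffeine: a methyl on a nitrogen next to a carbonyl.
ATOMS = ["CH3", "N", "C", "O", "N"]
BONDS = {(0, 1): "-", (1, 2): "-", (2, 3): "=", (2, 4): "-"}

print(linear_fragments(ATOMS, BONDS, start=0))
print(bin(fingerprint(ATOMS, BONDS)).count("1"), "bits set")
```

Different fragments can hash to the same bit, so a fingerprint can rule a substructure out but cannot confirm one on its own.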
[Figure: fingerprint bits set (x) for 2-methyl-3-hydroxybutyryl-CoA, adenosine, and caffeine]
Tanimoto score = (# bits in A∩B) / (# bits in A∪B)
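With fingerprints stored as Python integers, the score is a few bit operations. This is a minimal sketch. Returning 0.0 when both fingerprints are empty is an assumed convention.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto score of two fingerprints stored as integers."""
    union = bin(fp_a | fp_b).count("1")
    if union == 0:
        return 0.0  # assumed convention for two empty fingerprints
    return bin(fp_a & fp_b).count("1") / union


# Two made-up 4-bit fingerprints sharing two of four set bits.
print(tanimoto(0b1110, 0b0111))
```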
Lots of Python!
PyDaylight - Daylight toolkit via SWIG
Frowns - mostly Python, based on PyDaylight API
OEChem - bindings via SWIG
pybel - OpenBabel via SWIG
RDKit - C++/Python toolkit using Boost
CDK - using Jython
cinfony implements a common API!
JOELib is another Java toolkit
ABCD is an internal .Net system from J&J
Bad News - Sketchers
Native-code plugins: ChemDraw, Isis/Draw, MDLDraw
Java plugins: MarvinSketch, JChemPaint, ChemWriter
Nothing in Python!
Gene Expression
BioConductor - R
Quantum mechanics
Many in Fortran, some in C or C++: Jaguar, NWChem, Gaussian, GAMESS, Mopac, ...
I only know of one using Python: PyQuante
What do I want for the future?
- better documentation
- APIs meant for interactive use
- better GUI support for interactive use
- more training in how to program