Review

Antibody informatics for drug discovery

Hiroki Shirai a, Catherine Prades b, Randi Vita c, Paolo Marcatili d, Bojana Popovic e, Jianqing Xu e, John P. Overington f, Kazunori Hirayama a, Shinji Soga a, Kazuhisa Tsunoyama a, Dominic Clark f, Marie-Paule Lefranc g, Kazuyoshi Ikeda f

a Molecular Medicine Research Laboratories, Drug Discovery Research, Astellas Pharma Inc., 21, Miyukigaoka, Tsukuba-shi, Ibaraki 305-8585, Japan
b Global Biotherapeutics, Bioinformatics, Sanofi-Aventis Recherche & Développement, Centre de recherche Vitry-sur-Seine, 13, quai Jules Guesde, BP 14, 94403 Vitry-sur-Seine Cedex, France
c Immune Epitope Database and Analysis Project, La Jolla Institute for Allergy & Immunology, 9420 Athena Circle, La Jolla, CA 92037, USA
d Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Anker Engelunds Vej 1, 2800 Lyngby, Denmark
e MedImmune Ltd, Milstein Building, Granta Park, Cambridge CB21 6GH, UK
f The EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
g IMGT®, the international ImMunoGeneTics information system®, Laboratoire d'ImmunoGénétique Moléculaire (LIGM), Université Montpellier 2, Institut de Génétique Humaine, UPR CNRS 1142, 141 rue de la Cardonille, 34396 Montpellier Cedex 5, France

Biochimica et Biophysica Acta xxx (2014) xxx–xxx
BBAPAP-39388; No. of pages: 14; 4C: 2, 4, 9, 10, 11
Contents lists available at ScienceDirect
Journal homepage: www.elsevier.com/locate/bbapap
http://dx.doi.org/10.1016/j.bbapap.2014.07.006
1570-9639/© 2014 Elsevier B.V. All rights reserved.
Please cite this article as: H. Shirai, et al., Antibody informatics for drug discovery, Biochim. Biophys. Acta (2014), http://dx.doi.org/10.1016/j.bbapap.2014.07.006

Article history: Received 27 March 2014; Received in revised form 4 July 2014; Accepted 11 July 2014; Available online xxxx

Keywords: Antibody informatics; Antibody modeling; Antibody database; Antibody numbering; Drug discovery

Abstract

More and more antibody therapeutics are being approved every year, mainly due to their high efficacy and antigen selectivity. However, it is still difficult to identify the antigen, and thereby the function, of an antibody if no other information is available. There are obstacles inherent to the antibody science in every project in antibody drug discovery. Recent experimental technologies allow for the rapid generation of large-scale data on antibody sequences, affinity, potency, structures, and biological functions; this should accelerate drug discovery research. Therefore, a robust bioinformatic infrastructure for these large data sets has become necessary. In this article, we first identify and discuss the typical obstacles faced during the antibody drug discovery process. We then summarize the current status of three sub-fields of antibody informatics as follows: (i) recent progress in technologies for antibody rational design using computational approaches to affinity and stability improvement, as well as ab-initio and homology-based antibody modeling; (ii) resources for antibody sequences, structures, and immune epitopes and open drug discovery resources for development of antibody drugs; and (iii) antibody numbering and IMGT. Here, we review "antibody informatics," which may integrate the above three fields so that bridging the gaps between industrial needs and academic solutions can be accelerated. This article is part of a Special Issue entitled: Recent advances in molecular engineering of antibody.

© 2014 Elsevier B.V. All rights reserved.

Abbreviations: 2D, two-dimensional; 3D, three-dimensional; CDR, complementarity determining region; CMC, chemistry, manufacturing and control; CPCA, composite protein for clinical applications; CS, canonical structure; EMBL-EBI, EMBL European Bioinformatics Institute; FPIA, fusion protein for immune applications; FR, framework region; HTS, high throughput sequencing; IG, immunoglobulin; IgSF, immunoglobulin superfamily; IMGT, the international ImMunoGeneTics information system; MH, major histocompatibility; MhSF, MH superfamily; NGS, next generation sequencing; PD, pharmacodynamics; PK, pharmacokinetics; SPR, surface plasmon resonance; RMSD, root mean square deviation; TR, T cell receptor; VH, variable region of heavy chain; VL, variable region of light chain

Corresponding author at: Level Five Co.,Ltd., Shiodome Shibarikyu Bldg., 1-2-3 Kaigan, Minato-ku, Tokyo 105-0022, Japan. Tel.: +81 3 5403 5917. E-mail address: [email protected] (K. Ikeda).

    1. Introduction

    Recent advances in experimental technologies allow researchers to rapidly generate an enormous amount of data using a variety of molecular biological methods. This data-driven science should be transformed into a model-based science. Pharmaceutical companies need to handle large biological data sets since molecular biology is significantly involved in drug discovery, development, and manufacturing. However, the expense involved in catching up with this rapid progress prevents any single company from adapting to these large biological data sets quickly and efficiently. Some efforts toward pre-competitive collaborations are underway. For instance, since sales of antibody therapeutics continue to rise, the EMBL European Bioinformatics Institute (EMBL-EBI) Industry Programme [1] has focused on antibody or biologics informatics for both academia and industry. Four of the ten highest selling drugs from October 2012 to September 2013 were biologics, and the launch of biosimilars will make this situation even more interesting. There are many more informatics resources available for the analysis of small molecule therapeutics than for the antibody drug discovery process. In this paper, we review "antibody informatics" to create a synergetic resource of related efforts. We summarize some of the obstacles for antibody drug discovery and approaches to overcome these obstacles using antibody informatics. We first map the obstacles faced and their relevant informatics tools for the workflow of antibody drug discovery such as antibody modeling tools, antibody databases, and accurate antibody numbering. We then discuss the current status of those three elements in detail.

    2. Obstacles for drug discovery to be tackled using antibody informatics

    Fig. 1 maps the obstacles faced during the workflow of antibody drug discovery and presents approaches used by antibody informatics tools. In the top box, a rough workflow is described: A host is immunized with a selected immunogen in order to obtain antigen-specific antibodies, whose affinity and in vitro activity are measured. The researchers select a lead antibody among them mainly based on the in vitro activity, and then proceed to engineer (e.g., through complementarity determining region (CDR) grafting) an optimized antibody. Next, pharmacokinetics (PK), pharmacodynamics (PD) and toxicological properties are measured for the selected antibody. Finally, mass production and chemistry, manufacturing, and control (CMC) are performed for clinical trials. The second and third boxes of Fig. 1 describe the obstacles and relevant informatics tools, respectively.

    The design of therapeutic antibodies is a very difficult problem. First, obtaining an antibody specific to a target molecule can often prove difficult, and for some antigens, no specific antibody can be generated. Even when many antibodies with high affinity to their antigens are generated, they still may not possess enough functional activity due to their non-ideal binding sites. In order to solve this problem, careful design of the immunogen used for raising antibodies is required, for example, for generation of a stable active form or dissecting the functional domain. Unfortunately, some antibodies may have lower affinity or activity, and shorter PK than the expected values. Poor physicochemical properties, such as lower thermal stability and aggregation tendency, may cause further problems. To deal with these problems, engineering of the antibody, such as reducing the physicochemical problems [2,3], enhancing the affinity, or elongating half-life [4,5] is needed. Many antibody or antigen designs can be performed by computational (in silico) approaches [6]. Knowledge, experience, and intuition can also be helpful in the antibody design process. In the latter case, the researchers, usually non-informatics researchers, need to be familiar with the antibody as well as its epitope. Antibody modeling and protein docking are often used to construct antibody–antigen tertiary structural models from amino acid sequences and play an important role both in the in silico design process and in understanding protein functions [6,7]. Although determination of the three-dimensional (3D) structure of a protein by X-ray crystallography has become easier, it still consumes a great deal of time and expense, and is not always successful. Now more than ever, as the number of available antibody sequences has been rapidly increasing, there is a demand for high-quality antibody modeling and protein docking as a rapid, suitable alternative for generating structural data.

    Fig. 1. Antibody informatics approaches to antibody drug discovery. t1/2: half-life and Ab: antibody.

    Even after successful modeling and design, a functional antibody that met the PK/PD and toxicity criteria may have problems in later stages, such as mass production or CMC, because of its poor physicochemical properties. Therefore, new antibody selection criteria that predict difficulties at later stages are needed. Methodologies that prioritize therapeutic antibodies, based on evaluation of their druggability or developability by considering the features of antibody sequence, structure, and its physicochemical nature, would be ideal [8,9].

    Administration of a therapeutic antibody is accompanied by the risk of developing an anti-antibody immune response. Methodological developments in CDR grafting and transgenic animals for generating humanized and human antibodies, respectively, have reduced the risk of immunogenicity in clinical trials. However, these advances are still not perfect. Prediction and elimination of T cell epitopes are two ways to tackle this problem, but the mechanisms of immunogenicity are complex and the causes are still unclear. Antibody aggregation can also be an important cause of immunogenicity. Thus, the ability to predict T cell epitopes and aggregation-prone regions early in the antibody design process would do a great deal to mitigate these problems [10,11].

    Another more practical problem is the existence of multiple antibody numbering schemes. Historical antibody numbering schemes contaminate antibody residue numbers with arbitrary ones. Thus, researchers in this field are at risk of misunderstanding and erroneously interpreting patents and scientific publications. As the number of antibody sequences available is rapidly increasing, thanks to deep sequencing technology [12,13], these numbering issues have become a serious problem. Standardization of antibody numbering must be universally adopted to solve this problem.

    The major users of antibody informatics are in silico informatics researchers themselves, experimental scientists, and project managers. All need to use large antibody data sets effectively. Antibody modeling software, publicly available databases, and accurate antibody numbering make up the backbone of antibody informatics. They are all useful for in silico researchers to carry out the rational design and prioritization of antibody drugs and for experimental researchers to understand antibody paratopes and epitopes to improve experimental design. We will discuss the current status of these three elements in detail.

    3. In silico antibody design

    For non-human antibody libraries, CDR grafting has been the central method for in silico design. It is widely used for reducing the risk of immunogenicity, and is described in more detail elsewhere [14]. Other types of in silico design, such as affinity improvement and removal of physicochemical problems, are described in a previous review [6], and here we will focus on the advances since then. In silico antibody design was used to make highly thermoresistant antibodies by mutating surface residues to charged amino acids [15], and it was used to improve crosslinking efficiencies when introducing non-canonical amino acids into antibody CDRs [16]. Reports of affinity improvement by computational methods have rapidly increased; one example is the structure-based computational design applied to antibody 11K2, an anti-monocyte chemoattractant protein 1 (MCP-1) antibody [17]. Based on the crystal structure of the 11K2/MCP-1 complex, each of the 62 residues belonging to the CDRs of 11K2 was subjected to systematic mutagenesis in silico with each of the 19 other natural amino acids (62 × 19 = 1178 mutations). For each mutation, 100 different models (total, 117,800 models) were generated. After the calculation of interaction energy between antigen and antibody in each of the models, only 12 mutants were selected from the in silico inspection of the structures for further examination. Researchers carried out the expression and purification of MCP-1, 11K2 and the mutants, and performed binding assays by surface plasmon resonance (SPR) to validate the model predictions. SPR showed that 5 of 11 mutants had increased affinity, with the largest improvement being a 4.6-fold enhancement. Antibody 11K2 already had high affinity (4.6 pM) for MCP-1, but rational design proved effective in further enhancing antibody affinity. The in silico design in this work was achieved by combining the commercial software packages MOE [18] and Discovery Studio [19]. Development of such tools is expanding the capabilities of in silico antibody design, and similar results were obtained by other systems [20]. From these studies, it was found that hot spot residues were not amenable to optimization and that introduction of long-range electrostatic interaction was effective.
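
    The size of the in silico scan described above follows from simple enumeration. The sketch below is illustrative only: the function names and the placeholder CDR residue list are assumptions, not part of the published 11K2 workflow, and it merely reproduces the counting (62 positions, 19 substitutions each, 100 models per mutation).

```python
# Illustrative sketch: enumerate single-point substitutions over a set of CDR
# positions and count the structural models a scan would need.
# The residue list below is a placeholder, not the real 11K2 CDR sequence.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 natural amino acids


def enumerate_point_mutations(cdr_residues):
    """Yield (position, wild_type, mutant) for every single substitution."""
    for position, wild_type in cdr_residues:
        for mutant in AMINO_ACIDS:
            if mutant != wild_type:
                yield position, wild_type, mutant


def count_models(n_positions, models_per_mutation=100):
    """Return (number of mutations, number of models) for a full scan."""
    n_mutations = n_positions * (len(AMINO_ACIDS) - 1)
    return n_mutations, n_mutations * models_per_mutation


# Hypothetical input: 62 CDR positions, all written as alanine for simplicity.
placeholder_cdr = [(index, "A") for index in range(1, 63)]
mutations = list(enumerate_point_mutations(placeholder_cdr))

print(len(mutations))      # 1178
print(count_models(62))    # (1178, 117800)
```

    In a real scan each enumerated substitution would be passed to a modeling package for structure generation and interaction-energy scoring; that step is deliberately left out here.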

    These methodologies require crystallization of the antibody/antigen complex. An alternative computational approach for affinity enhancement without the crystal structure has been proposed that uses the physicochemical features common to epitope–paratope interfaces, which have been extracted from known 3D structures [21]. This approach was applied to antibody 4E11, a cross-reactive neutralizing antibody to dengue virus (DV), without the crystal structure, and a 450-fold improvement in affinity was achieved. This increased affinity resulted in stronger neutralizing activity in vitro. It also resulted in potent antiviral activity in a mouse model of DV challenge. Precise and good quality antibody modeling is a key to success when using this approach.

    4. Antibody modeling

    Antibody modeling techniques have come a long way since their birth. The first insights into the antibody sequence–structure relationship can be attributed to seminal works by Wu and Kabat [22], who identified the six hypervariable regions on the heavy and light chains, correctly predicted that these regions, which arise from a relatively conserved framework, are in close spatial proximity and are responsible for the specific binding of the antigen, and named them complementarity determining regions (CDRs) in contrast to the surrounding framework regions (FRs). The first real milestone toward the possibility of building reliable antibody models comes from the work of Chothia and Lesk [23]. In their definition of the hypervariable loops they [24] determined the relationship between amino acid sequences and 3D structures in antigen binding sites. They discovered that five of six hypervariable regions (L1 to L3, H1 and H2) typically adopt a small number of discrete backbone conformations. Moreover, they identified relatively few residues within and outside the hypervariable regions that, through their hydrogen bonding, packing, or ability to assume specific conformations, are primarily responsible for the main-chain conformations of the hypervariable loops. These classes of recurring main-chain conformations of the hypervariable regions, identified by the length of the loop and by a small number of key residues, are named canonical structures (CSs).

    Since then, several investigations have extended the library of canonical structures. Recent research by North et al. updated and revised the definition of canonical structures with a systematic approach [25]. To date, it has been estimated [26–29] that approximately 80% of L1, L2, L3, H1, and H2 loops in solved structures adopt some type of canonical structure. The average root mean square deviation (RMSD) between the backbones of a target loop and a template loop with the same type of canonical structure is approximately 0.7 Å, and typically does not exceed 1.0–1.2 Å RMSD.
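
    For readers unfamiliar with the RMSD measure quoted above, a minimal sketch follows. It assumes the two loops have already been superposed (for example on the framework) and that their backbone atoms are paired one to one; the function name and the toy coordinates are hypothetical and are not taken from any of the cited tools.

```python
import math

# Minimal sketch of backbone RMSD between two loops, in angstroms.
# Assumption: the structures are already superposed, so no fitting is done.


def backbone_rmsd(coords_a, coords_b):
    """Root mean square deviation between two equal-length coordinate lists."""
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared_sum = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared_sum += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared_sum / len(coords_a))


# Toy data: a template loop displaced by 0.7 angstrom along y at every atom.
target_loop = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
template_loop = [(x, y + 0.7, z) for x, y, z in target_loop]

print(round(backbone_rmsd(target_loop, template_loop), 3))  # 0.7
```

    A production comparison would first read atom coordinates from structure files and perform an optimal superposition; both steps are omitted to keep the formula itself visible.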

    The repertoire of canonical structures has greatly increased over the years. Even though it seems that all the most common structures for human and murine antibodies have already been discovered [25], this is not the case for other organisms, some of which present immunoglobulins (IGs) with peculiar characteristics that can have a large practical and theoretical impact. Notable examples are camelid VHH antibodies and bovine IG. The Camelidae family has a special type of antibody in addition to conventional antibodies. These antibodies, called heavy chain antibodies, are characterized by the absence of light chains and of the first heavy chain C region. In contrast to conventional antibodies, heavy chain antibodies have been found to be stable and active at high temperatures and in high concentrations of denaturants. In addition, VHH has other advantages: it can recognize uncommon or hidden epitopes, it is expected to be orally administrable, and it is easy to engineer. On the basis of these features, some clinical trials are ongoing. Some preliminary studies on their structures and on the ability of the current methods to model them have already been performed, showing encouraging results [30,31]. Bovine IG sometimes has ultra-long VH CDR3 regions with an average length of 21 residues [32]. They are composed of a strand (stalk) and knob domain used to bind antigens [33]. These antibodies with ultra-long VH CDR3 are expected to bind epitopes in a different way than human or mouse antibodies.

    4.1. Automated antibody modeling protocols

    Building the 3D structure of an antibody from its amino-acid sequence, also known as antibody modeling, was described in detail in a recent review [6]. Nowadays, protein structure modeling is getting easier. Here, we focus on reviewing automatic antibody modeling, which is useful not only for computational scientists but for all types of researchers.

    Fig. 2. Comparison of antibody 3D model and the crystal structure. A: Tube representation of the crystal structure of Fab of anti-human RSV F protein mAb 101F (PDB ID: 3QQ9). The framework region, CDR-L1 to L3, CDR-H1 and H2, and CDR-H3 are colored gray, magenta, brown and red, respectively. 3D model conformation generated by the PIGS server is colored cyan. RMSD value between the model and the crystal structure is 0.34 Å. B: Tube representation of the crystal structure of anti-NGF antibody (PDB ID: 4M6O), and the framework region, CDR-L1 to L3, CDR-H1 and H2, and CDR-H3 are colored gray, magenta, brown and red, respectively. 3D model conformation generated by the PIGS server is colored cyan. RMSD value between the model and the crystal structure is 1.3 Å. This figure was created using MOE.

    Antibody modeling, engineering, and humanization partially or completely depend on the definition of canonical structures and on our ability to predict them. Given the sequence of an antibody of unknown structure, if there is a high probability that the L1, L2, L3, H1, and H2 hypervariable loops have a main chain conformation corresponding to a known canonical structure, and that H3 has a conformation similar to that of a known structure, then a reasonable model can be built for the framework regions and hypervariable loops using a homology modeling technique. The framework regions, being generally highly conserved in sequence and structure, can be modeled accurately with relatively little difficulty.

    The standard protocol adopted by methods that use variations of the CS method (such as PIGS [7], WAM [34], and commercial tools) involves the selection of up to eight templates: two for the light and heavy chain frameworks, selected by sequence similarity alone; five for the non-H3 loops, selected using the CS method; and one for the H3 loop, selected by relying only partially on the CS method (Fig. 2).
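
    The two selection criteria named in this protocol can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm of PIGS, WAM, or any commercial tool: framework sequences are assumed to be pre-aligned and of equal length, a canonical class is reduced to a loop length plus key residues at loop-internal positions, and every sequence, template name, and class definition is a made-up placeholder.

```python
# Simplified sketch of template selection for homology-based antibody modeling.
# All data below are hypothetical placeholders.


def sequence_identity(seq_a, seq_b):
    """Fraction of identical positions in two pre-aligned, equal-length sequences."""
    if not seq_a or len(seq_a) != len(seq_b):
        raise ValueError("sequences must be non-empty and equal in length")
    return sum(a == b for a, b in zip(seq_a, seq_b)) / len(seq_a)


def pick_framework_template(target_framework, framework_library):
    """Choose the framework template with the highest sequence identity."""
    return max(
        framework_library,
        key=lambda name: sequence_identity(target_framework, framework_library[name]),
    )


def matches_canonical_class(loop_sequence, canonical_class):
    """Check loop length and key residues (0-based positions within the loop)."""
    if len(loop_sequence) != canonical_class["length"]:
        return False
    return all(
        loop_sequence[position] in allowed
        for position, allowed in canonical_class["key_residues"].items()
    )


def pick_canonical_class(loop_sequence, canonical_classes):
    """Return the first matching class name, or None if no class fits."""
    for name, canonical_class in canonical_classes.items():
        if matches_canonical_class(loop_sequence, canonical_class):
            return name
    return None


framework_library = {"template_A": "QVQLVQSGAE", "template_B": "EVQLLESGGG"}
canonical_classes = {
    "hypothetical_L1_class_1": {"length": 16, "key_residues": {1: "S"}},
    "hypothetical_L1_class_2": {"length": 11, "key_residues": {1: "A", 7: "SG"}},
}

best_framework = pick_framework_template("EVQLVESGGG", framework_library)
best_class = pick_canonical_class("RASQSISSYLN", canonical_classes)
print(best_framework, best_class)  # template_B hypothetical_L1_class_2
```

    In practice key residues may also lie outside the loop, and the H3 template requires separate treatment, as discussed in the following sections.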

    In contrast to the canonical structure approach described above, ab initio methods do not rely on the presence of template loops in the database. Their major limitation is that, due to a still incomplete comprehension of the physicochemical principles that ultimately govern protein structures, the energy functions developed to generate and evaluate the different conformations often cannot differentiate between a correct prediction and an incorrect one. Approaches combining knowledge-based and ab initio methods have also been described [35,36]. RosettaAntibody [37,38] is a mixed homology/ab initio modeling tool built on the framework of the Rosetta software suite, which is a bioinformatics platform for protein structural prediction and design [39,40]. The method builds a preliminary homology model by selecting different templates for frameworks and non-H3 CDRs, followed by H3 loop modeling and VH/VL interface optimization. The loop modeling method employs advanced loop conformational sampling strategies, using either a cyclic coordinate descent loop closure [41] method combined with protein fragment insertion, or an ab initio kinematic closure [42] method implemented in the latest version. Prediction of CDR structures without the canonical classification is also worth mentioning: FREAD is a database loop prediction method for CDRs that does not use canonical definitions [43].

    4.2. H3 modeling

    As observed from solved antibody–antigen complex structures in PDB, most CDR-H3 loops have direct contacts with their associated antigens. Unfortunately, despite their primary role in the binding, they are really difficult to model because both their sequences and structures are highly variable due to the nature of VDJ gene recombination [44]. The analysis of sequence and structure enabled some of us to extract empirical rules [8,45–47] to describe a sequence–structure relationship for the main chain conformation at the end of the loop C-terminus region (the "torso" [45] or "base" [8,46,47] of the loop). Once we identify the type of torso, the structural features of the tip of the H3 loop can be assumed to some extent when a certain amino acid is located at a specific position [8].

    In H3 modeling, specific database search procedures can be used to predict the main chain conformation of the head regions [45]. Such methods usually give good results for short (up to 9 amino acids) and medium (10 to 15 amino acids) sized loops, but for longer loops the deviation usually exceeds 2 Å in backbone RMSD. It is to be noted that even in the unfavorable case of long loops, a good template, providing a final model of H3 with a backbone RMSD below 4 Å, might exist, but current template selection methods are not always able to identify such templates. On the other hand, ab initio methods display an equal if not worse decrease in accuracy for predicting the structure of long loops. Moreover, it has been observed that the binding of the antigen can introduce case-specific conformational changes in the whole antibody and especially in the H3 region for medium and long sized loops. This effect is in most cases hard if not impossible to model, and it sets an upper bound to the modeling accuracy of this region.
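
    The length classes mentioned above can be turned into a simple triage step that flags loops likely to need extra care. The sketch below uses the thresholds stated in the text (up to 9 residues, 10 to 15 residues, longer); the function name and example sequences are hypothetical, and the residue count depends on which CDR-H3 definition is used, which is an assumption left to the user.

```python
# Sketch: bin CDR-H3 loops by length using the thresholds given in the text.
# The loop boundaries (and hence the length) depend on the numbering scheme.


def classify_h3_length(h3_sequence):
    """Return 'short' (<= 9 residues), 'medium' (10-15), or 'long' (> 15)."""
    length = len(h3_sequence)
    if length == 0:
        raise ValueError("empty H3 sequence")
    if length <= 9:
        return "short"
    if length <= 15:
        return "medium"
    return "long"


# Hypothetical example loops of 8, 12, and 21 residues.
examples = ["ARDYGMDV", "ARGSYYGSGMDV", "A" * 21]
classes = [classify_h3_length(sequence) for sequence in examples]
print(classes)  # ['short', 'medium', 'long']
```

    Loops binned as "long" are the ones for which database search is reported to exceed 2 Å backbone RMSD, so they are natural candidates for expert review or additional refinement.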

    4.3. Open problems in antibody modeling

    According to the assessments of antibody modeling accuracy performed so far, it appears that antibody models built using the most recent tools, PIGS and RosettaAntibody, have a similar expected accuracy, which is typically below 1 Å RMSD for FRs and non-H3 loops and between 1.5 and 5 Å for the H3 loop. Four major sources of error in the process include the modeling of long H3 loops, the detection and prediction of possible conformational changes occurring upon antigen binding, the identification of proper templates, and the reconstruction of the correct packing between the VH and VL domains.

    The first of these problems, involving the modeling of the H3 loop, has been faced in many different ways [43,45,47,48] but still has much worse accuracy compared to the rest of the model, often impairing the use of such predictions for many practical applications. It is important to mention that this problem is crucial in modeling camelid VHH domains. They have a specific repertoire of H3 loops, often very long and with specific conformations, on which the current prediction methods hardly achieve reasonably good results [31].

    Another open problem is the effect of antigen binding on the CDR loop conformation. Even though only minor distortions are observed in most of the loops upon antigen binding, H3 loops (especially longer ones) can undergo quite drastic conformational changes when bound to their antigen [49]. It would therefore be important to include this kind of information in the modeling process; to date it is not taken into account by the automated procedures described above. On the other hand, prediction of short and medium length loops is considered to be tractable [50]. Recently, another assessment showed that a simulation-based approach improved the initial H3 models, which were generated by homology modeling, and refined them to within 1.5 Å RMSD on average if the X-ray structure without H3 is available [51]. This suggests that the ab initio method may contribute to predicting a more accurate H3 loop conformation, even if a homologous template for a given H3 sequence cannot be found in the PDB.

    The third point to be mentioned is that it is not straightforward to select the best template from the homologous structures, especially for antibody sequences that are difficult to model because little related information is available in the PDB. In those cases, an automatic modeling approach is not always successful. Therefore, an expert-guided procedure including a manual step to select and refine suitable templates is still needed for high accuracy modeling. User-friendly modeling tools developed and distributed by commercial software companies may be helpful not only for bioinformatics experts but also for non-computational scientists.

    The last point is about the correct packing between VH and VL domains. Several recent papers demonstrated the existence of different packing modalities and their strong impact on the shape of the antigen binding site and on antigen specificity [52–54]. This becomes relevant in the assembly of complete antibody models, especially when two different solved structures are used as templates for the FRs of the two chains. In this case, the regions copied from the two templates need to be packed together in order to obtain the final model, and this process may introduce deviations between the model and the real structure. The user should then decide whether to use, if possible, the same template (usually with a lower sequence similarity) for several loops and/or FRs and minimize the number of superpositions needed to build the final model, or to model the VH/VL packing using a different template. Such a choice is not trivial and depends on the existence of a suitable template with a good sequence identity.

    5. Antibody databases and resources

    There are many free resources that provide relevant antibody information. In recent years the number of biological databases has been rapidly growing. As of 2013, the NAR online molecular biology database collection (http://www.oxfordjournals.org/nar/database/a/) lists 1512 databases relevant to the field of molecular biology alone [55]. Here, we have summarized various antibody databases useful for drug discovery (Table 1). These many diverse resources catalog published experimental data, provide information on research reagents, and model molecular interactions with the overarching goal of helping to increase the speed of new discoveries.

    5.1. IEDB

    At the Immune Epitope Database (IEDB) [56], which freely presents published antibody and T cell epitope data, PhD-level scientists read and analyze all publications relevant to immune epitopes and extract the antibody information in order to make it available to researchers in an easy-to-search and consistent manner. The IEDB provides the host species and strain, gender, and age, as well as the immunogen, route, dose, and adjuvant(s) used to produce an antibody. It also describes all published experiments whereby an epitope-specific antibody binds to any tested antigen in any assay type, even if the outcome of the experiment was negative. The IEDB presents all published data, as stated by each author, allowing users to determine which results to weigh most heavily or to sum all the data on any given antibody and draw their own conclusions.

    This experimental antibody information aids antibody researchers by allowing them to quickly determine what antibodies currently exist, what their epitopes are, and in which scenarios these antibodies bind to different antigens. Additionally, quantitative binding constants, such as equilibrium constants and on or off rates, and antibody chain GenBank identifiers are also captured, when provided by the authors. This type of detailed information can be very helpful to researchers attempting to design new antibodies or just looking for an antibody that binds a specific antigen in order to perform an experiment. The researcher can easily determine which antibodies were previously shown to bind targets of interest at high affinity, as well as which antibodies did not bind well or were previously tested against the target of interest and did not bind at all. This allows one to easily look for differences in what immunogen was used or how the adjuvant, route, dose, carrier, etc. differ between successful antibodies and failed responses. Because the IEDB also captures T cell experimental data, one can compare the location of antibody reactivity across a target antigen to its experimentally established T cell reactivity. This can be useful for researchers wanting to raise specifically a B cell response without also generating a T cell response. Using the IEDB's visualization tool called the Immunome


    Table 1. Characterization of various antibody databases and resources.

    Database name: DIGIT
    Main contents: Sequence & structure
    Characteristic contents: Annotation on the type of antigen, the germline sequences and pairing information between light and heavy chains. User-submitted sequences can be searched with BLAST against the database and annotated with the Kabat–Chothia numbering scheme, the location and canonical structures of the CDRs and the mutations with respect to the germline.
    Update operation/frequency: By automated annotation. Regularly updated every 90 days.

    Database name: IMGT
    Main contents: Sequence & structure
    Characteristic contents: Global reference for genes, sequences and 3D structures of IGs (or antibodies), TR, MH, IgSF and MhSF. Annotation based on the concepts of IMGT-ONTOLOGY and on the IMGT Scientific chart rules (IMGT gene and allele nomenclature, IMGT unique numbering). Seven databases and 17 online tools. IMGT/mAb-DB contains therapeutic mAb, FPIA and CPCA, with links to IMGT/3Dstructure-DB and IMGT/2Dstructure-DB and WHO-INN.
    Update operation/frequency: By curation. Continuously updated, with weekly releases for IMGT/LIGM-DB and IMGT/GENE-DB, and releases approximately every 3–4 months for IMGT/3Dstructure-DB and IMGT/mAb-DB.

    Database name: abYsis
    Main contents: Sequence & structure
    Characteristic contents: Chothia and Kabat numbering, canonical class assignment, identification of unusual residues (distribution of amino acid on each position). Humanness assessment.
    Update operation/frequency: Updated several times per year. In-house commercial version is updated more frequently.

    Database name: SAbDab
    Main contents: Structure
    Characteristic contents: Antibodies registered in PDB. Structural properties such as CDR loop conformation and H/L orientation. Affinity values curated from literature.
    Update operation/frequency: By curation. Updated weekly.

    Database name: ChEMBL
    Main contents: Drug information (along with small-molecule drugs)
    Characteristic contents: Antibodies approved or in clinical phase. Synonyms, sequence, MOA, target information linking to small molecule data.
    Update operation/frequency: By curation. Updated regularly, with releases approximately every 3–4 months.

    Database name: PubChem
    Main contents: Drug information (along with small-molecule drugs)
    Characteristic contents: Synonyms, pharmacological action and link to PubMed citations.
    Update operation/frequency: By deposition.

    Database name: DrugBank
    Main contents: Drug information (along with small-molecule drugs)
    Characteristic contents: Identification (sequence, CAS number), pharmacology (PK, PD, dose, toxicity, ADME), pharmacoeconomics (price, patent dates), properties (Tm, pI), target information, risk of drug–drug interaction.
    Update operation/frequency: By curation. Continuously updated as new information on antibody drug becomes available.

    Database name: IEDB
    Main contents: B and T cell epitopes
    Characteristic contents: Comprehensive epitope database (MHC binding data, antibody binding affinities, T cell responses, 3D structure etc.). Published experimental details manually curated. Automated document categorization and extensive use of ontologies.
    Update operation/frequency: Manually curated with weekly updates.

    Database name: Antibody Registry (AR)
    Main contents: Publicly available antibodies
    Characteristic contents: Commercial and private antibodies from many sources, which have been assigned a unique identifier.
    Update operation/frequency: Both by automation and by deposition with weekly updates.

    Database name: AbMiner
    Main contents: Publicly available antibodies
    Characteristic contents: Commercially available antibodies screened by Western blot for recognition of NCI-60 cells. Includes species reactivity information.
    Update operation/frequency: By curation. Updated periodically as new antigen identifier information becomes available.

    Database name: Antibodypedia
    Main contents: Publicly available antibodies
    Characteristic contents: Publicly available antibodies against human protein targets. Reliability scores evaluated by users.
    Update operation/frequency: By deposition. Data registered after peer-review. Continually updated.

    Database name: The European Collection of Cell Cultures's Hybridoma Collection
    Main contents: Hybridoma
    Characteristic contents: All cell lines added to the collection undergo full quality control and authentication procedures. Cell lines can be supplied either as frozen or growing cultures.
    Update operation/frequency: By deposition. Continually updated.

    Database name: Monoclonal Antibody Index
    Main contents: Diagnostics & therapeutics
    Characteristic contents: Fully searchable biotechnology database of e-books with yearly updates of information on many antibodies produced for diagnosis and therapy.
    Update operation/frequency: Yearly updated.

    mAb: monoclonal antibody, TR: T cell receptor, MH: major histocompatibility, FPIA: fusion protein for immune applications, CPCA: composite protein for clinical applications, QC: quality control; WHO-INN: World Health Organization International Nonproprietary Name, and ADME: absorption, distribution, metabolism and excretion.


    Browser, one can view all published antibody reactivity displayed along the length of a protein molecule. These data can be presented either as the response frequency or as the number of positive or negative experiments that were performed. This tool gives antibody researchers a quick summary of not only which areas of a target protein have already been extensively studied, but also which areas of the target consistently showed experimental antibody reactivity. The IEDB also allows users to freely download all data, including information on the original publication where the data originated. Lastly, the IEDB hosts several antibody epitope prediction tools that are trained using experimental data. Thus, the antibody information presented by the IEDB can save antibody researchers a great deal of time during their planning stages as well as offer insights into the current literature regarding specific antibody targets. In general, there are many obstacles encountered by curators when extracting these details from publications. Many researchers will cite a previous publication, rather than describe how an antibody was generated. The IEDB curators often discover that the cited reference lacks detailed information on the antibody production, or even fails to mention the antibody at all. Different authors may be inconsistent when describing how the same antibody was made; for example, one will state that it was raised in mice and the other that it was raised in rabbits, or they may differ in whether they state that the immunogen was a human protein or a bovine protein. Such details are critical to elucidate when these data are to be used to design a highly specific antibody. The IEDB curators will often contact authors to clarify these details.

    5.2. DIGIT

    DIGIT [57] is a resource that, by retrieving and automatically annotating immunoglobulin sequences, complements the other manually curated databases described in this review. As mentioned above, annotations that include relevant antibody information, such as the type of antigen that an antibody binds, its germline sequences, and the correct pairing between light and heavy chains, are usually scattered in the literature and hard to collect. To gather them, taking advantage of bioinformatics and automatic learning tools together with the experience and suggestions of experimental researchers, we constructed DIGIT (Database of ImmunoGlobulin with Integrated Tools).

    Currently, this database houses 235,600 heavy chain sequences and 139,974 light chain sequences (95,553 kappa type and 44,421 lambda type) retrieved using isotype-specific HMM profiles developed by us, with assigned canonical structures for the hypervariable loops.

    The user can query the database using the antigen type, source organism, accession number, chain type (heavy, lambda, kappa, lambda + kappa) or free text (disease, process, etc.) with the option of selecting only complete immunoglobulins (VL + VH). Other annotations are computed on the fly (and therefore can also be obtained for user-submitted sequences), such as numbering of the sequence according to the Kabat numbering scheme, identification of the CDRs and of the FRs in the sequence, assignment of canonical structures for the CDRs, identification of mutations with respect to the germline, a direct link to the PIGS 3D modeling tool [7], and retrieval of the closest sequences (sorted according to e-value or % id) in the database.
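    Two of the on-the-fly annotations listed above, mutations with respect to the germline and percent identity, can be illustrated with a short Python sketch. It is not DIGIT's implementation: the function names are hypothetical, the two sequences are made-up toy strings, and the input is assumed to be already aligned column by column with "-" marking gaps.

```python
from typing import List, Tuple


def germline_mutations(query: str, germline: str) -> List[Tuple[int, str, str]]:
    """Return (alignment column, germline residue, query residue) per substitution.

    Columns are 1-based; columns with a gap in either sequence are skipped.
    """
    if len(query) != len(germline):
        raise ValueError("sequences must be pre-aligned to the same length")
    return [
        (col, g, q)
        for col, (g, q) in enumerate(zip(germline, query), start=1)
        if g != q and g != "-" and q != "-"
    ]


def percent_identity(query: str, germline: str) -> float:
    """Percent of identical residues over the ungapped alignment columns."""
    if len(query) != len(germline):
        raise ValueError("sequences must be pre-aligned to the same length")
    compared = [(g, q) for g, q in zip(germline, query) if g != "-" and q != "-"]
    if not compared:
        raise ValueError("no ungapped columns to compare")
    identical = sum(1 for g, q in compared if g == q)
    return 100.0 * identical / len(compared)


# Toy aligned sequences (made up for illustration, not real antibody data).
germline = "ACDEFGHIK"
query = "ACDQFGH-K"
print(germline_mutations(query, germline))
print(percent_identity(query, germline))
```

    A real pipeline would report the positions in a numbering scheme such as Kabat or IMGT rather than in raw alignment columns.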

    Approximately 4527 human annotated complete antibody sequences (i.e. with both light and heavy chains) are in the database. Among those, more than 1000 have some information on the antigen, more than 250 are annotated as autoimmune or autoantibodies, and more than 400 have been sequenced in lymphomas or leukemias, making possible a large number of biologically and clinically relevant analyses [58–60].

    5.3. Antibody information in open drug discovery databases

    Recently, open source drug discovery databases such as ChEMBL [61–63], PubChem [64] and DrugBank [65] have been widely used among academic and industry users. These databases are freely available and store a number of drug molecules, candidate compounds, targets, and medicinal chemistry data. Previous studies comprehensively surveyed the targets of small-molecule drugs and therapeutic antibodies [66], suggesting that the human genome contains drug targets shared by both small-molecule and antibody drugs as well as new targets that were first exploited by antibody drugs [67]. Combining antibody data with small-molecule data allows us to utilize known drug discovery data for further development of antibody drugs. DrugBank is a database of approved drug and drug-target information, which contains over 7000 drug entries including FDA-approved small molecule drugs and monoclonal antibody drugs. From the DrugBank interface, users can search specific approved antibody drug data, such as chemical, pharmacological, and pharmaceutical information, by inputting amino acid sequences or keywords. ChEMBL is a large-scale database of the structure–activity relationship (SAR) of small molecules, including not only approved drugs but also drug-like compounds at an early stage of drug discovery, which are abstracted from the medicinal chemistry literature over three decades. It covers 1.3 million compounds, 12 million bioactivities, and 9000 targets including more than 2800 human protein targets. ChEMBL also has bio-therapeutic data including over 150 antibodies and their sequences, including monoclonal antibody drugs and candidates in clinical trials. In the drug target browse function of the ChEMBL interface, each drug/compound entry is also annotated with its target information, and searchable by keywords, for instance, the USAN or INN stem of monoclonal antibodies (-mab). Users can easily search antibody drug targets as well as their known bioactive compounds by BLAST search. The ChEMBL database is regularly updated and shares data with the PubChem bioassay data. The latest FDA drug information is summarized in a table called Drug Approvals in the ChEMBL interface. In the table, new antibody drug data are updated with basic approval information, the ATC code of drug classification, and a drug icon which graphically shows the chemical, therapeutic and administration properties.
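    The keyword search by the -mab stem mentioned above can be mimicked offline on a list of drug names. The Python sketch below is only an illustration with hypothetical names; it does not call the ChEMBL or DrugBank interfaces. The substem table (-zumab, -ximab, -umab, -omab) reflects the INN naming convention in use when this review was written and is an assumption added here; only the -zumab suffix for humanized antibodies is mentioned elsewhere in this article.

```python
from typing import Optional

# Longer suffixes are listed first, because "zumab" also ends with "umab".
_SUBSTEMS = (
    ("zumab", "humanized"),
    ("ximab", "chimeric"),
    ("umab", "human"),
    ("omab", "mouse"),
)


def mab_source(name: str) -> Optional[str]:
    """Classify a drug name by its -mab stem; return None for non-mab names."""
    lowered = name.strip().lower()
    if not lowered.endswith("mab"):
        return None
    for suffix, source in _SUBSTEMS:
        if lowered.endswith(suffix):
            return source
    return "unclassified"


for drug in ("trastuzumab", "rituximab", "adalimumab", "aspirin"):
    print(drug, mab_source(drug))
```

    A plain suffix test of this kind is a convenience filter only; curated fields in the databases themselves remain the authoritative source for the molecule type.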

    5.4. Other databases

    The Neuroscience Information Framework (NIF) [68] began cataloging detailed antibody information relevant to neurosciences, but soon observed the many inconsistencies present in the published literature when it came to describing how antibodies were generated and functioned. In response, the Antibody Registry (AR) [69] was created to give antibody researchers consistent and accurate information on known antibodies by universally identifying antibodies used in publications. The registry lists commercial antibodies from more than 200 vendors, which have each been assigned a unique and stable identifier. The AR information includes the vendor, catalog number, clone id, antibody target, target sub-region (where available), target modification, target species, raised in species, target Entrez id, clonality, name and comments. Many more databases and resources exist to provide free access to antibody data, including the National Cancer Institute's (NCI) AbMiner [70]. AbMiner provides data on commercially available antibodies with their genomic identifiers after screening all entries by Western blot for recognition of NCI-60 cancer cell lines, comprising 60 different cell lines from 9 different tissues. Here are but a few of the many more online resources that aid the antibody researcher: Antibodypedia is a searchable database of antibodies against human proteins [71], The European Collection of Cell Cultures's Hybridoma Collection houses over 400 monoclonal antibody-secreting hybridomas [72], and the Monoclonal Antibody Index is a database with information on more than 9000 monoclonal antibodies produced for diagnosis and therapy [73].

    IMGT, the international ImMunoGeneTics information system [74,75], created by Marie-Paule Lefranc in 1989 at Montpellier, France, emerged at the interface between immunogenetics and bioinformatics, and is at the birth of immunoinformatics [76]. IMGT contains seven databases and seventeen online tools for the analysis of sequences, genes, and 3D structures of immunoglobulins (IGs) or antibodies,


    T cell receptors (TRs), major histocompatibility (MH), immunoglobulin superfamily (IgSF) proteins (other than IGs and TRs), and MH superfamily (MhSF) proteins (other than MH). Owing to IMGT-ONTOLOGY [77], IMGT has become the global reference in immunogenetics and immunoinformatics. More particularly, therapeutic monoclonal antibodies (mAbs), fusion proteins for immune applications (FPIA), and composite proteins for clinical applications (CPCA) are managed and annotated in IMGT databases, and online tools are provided for their analysis, using these concepts [78]. This will be described in detail in the following section, with a focus on V domain antibody numbering, which plays an important role when we have to specify the amino acid (AA) of interest in a simple and unique manner.

    6. Antibody numbering and IMGT

    6.1. Traditional numbering schemes

    Antibody numbering plays an important role in specifying the residue of interest in a simple and unique manner. Unfortunately, there has been no single thorough numbering scheme for immunoglobulins (IGs) or antibodies. In fact, there are multiple approaches available (based on sequences, structures, species, combination etc.) to characterize them.

    In the mid and late 1960s, it became possible to determine the amino acid sequences of fragments of purified antibodies. Observing the complexity and diversity of the variable domains of the IG heavy and light chains, Wu and Kabat made, in the 1970s, several assumptions about antibody diversification and its implications for the generation of variability, the description of CDRs versus FRs, structure–function relationships, and even evolutionary considerations that are still relevant today. Thus was born the first and most commonly used scheme, Kabat numbering, which was developed exclusively from a rather limited number of sequence alignments [22]. Owing to the growing number of antibody sequences, and to address the diversity of the sequences, additional insertions and gaps were necessary at different positions [79,80].

    To overcome these difficulties and improve on this first numbering scheme, Chothia introduced in 1987 a second numbering scheme, based on structural data and conformation analysis of CDRs. This scheme, modified in 1989 and revised back to the original in 1997, allowed for attributing a number to each residue of the CDRs, thanks to its topological position in the loop structure [23,24,26]. Whereas this numbering had the advantage of emphasizing the importance of the structures, it does not solve the problem of the heterogeneity of the numbering for a given position in CDRs of different lengths. Moreover, despite their high degree of sequence and structural homology, the IG variable domains of the heavy and light chains are treated separately by both Kabat and Chothia, and the CDR delimitations, gap positions, and lengths are heterogeneous, both within and between the two schemes, which is rather confusing. Abhinandan and Martin developed AbNum, an automatic numbering program using the Kabat, Chothia and modified-Chothia schemes [81]. They reported that errors or inconsistencies in the numbering were found in the Kabat database. Their numbering scheme is a corrected version of the Chothia scheme, which is structurally correct throughout the CDRs and FRs. The heterogeneity has been further increased by errors and/or modifications in the use of these numberings in the literature and patents. For the analysis of antibodies, in terms of immunology, the alignment between the variable domains of IGs and TRs is useful. The AHo numbering scheme was proposed by Honegger and Plückthun. It is a common residue numbering for variable domains of IGs and TRs, derived by analyzing the spatially aligned known 3D structures. In the AHo scheme, insertions and deletions are placed symmetrically around a key position [82]. AHo's amazing atlas of antibody anatomy (AAAAA) summarizes the scheme of AHo numbering with antibody sequences and structures. However, it focuses on amino acid sequences of an antibody and does not account for the DNA sequence.

    6.2. IMGT unique numbering

    In 1997, IMGT unique numbering was created for all IG and TR variable regions and domains of all jawed vertebrate species [83]. The first graphical two-dimensional (2D) representation or IMGT Collier de Perles was displayed on the IMGT web site in December of the same year [84]. Interestingly, this first IMGT Collier de Perles (the historical one, manually drawn, is still online) made it possible to detect a missing AA in the PDB structure [84], demonstrating that IMGT unique numbering has definitively bridged the gap between sequences and structures. This striking result was however expected because IMGT unique numbering was set up after aligning more than 5000 sequences while at the same time relying on the high conservation of the structure of the variable domain [83,85,86]. IMGT unique numbering takes into account and combines the definition of the FRs and CDRs [22,79,80], the characterization of the hypervariable loops [23,24,26], and structural data from X-ray diffraction studies [87]. The breakthrough was the standardized delimitation of the FR-IMGT and CDR-IMGT, defined with six anchors (26 and 39, 55 and 66, 104 and 118) (Fig. 3). The CDR-IMGT lengths became crucial information and were used as standards for establishing the correspondence between IMGT unique numbering and the other numbering systems [88]. These are used for the characterization of any V domain sequence or structure and, since 2007, have been included in the definition of monoclonal antibodies in the WHO-INN program [89].
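    The six anchors translate directly into a lookup from an IMGT position to its FR-IMGT or CDR-IMGT region. The following Python sketch is illustrative only: the function name is hypothetical, anchors are counted as framework positions (so that, as stated in the Fig. 3 caption, the CDR3-IMGT extends from 105 to 117), and the end of the FR4-IMGT at position 128 is an assumption taken from the general IMGT convention rather than from the text above. Additional positions used for long CDR3-IMGT (such as 111.1) are not handled.

```python
from typing import List, Tuple

# (label, first position, last position), with anchors assigned to the FR-IMGT.
IMGT_V_REGIONS: List[Tuple[str, int, int]] = [
    ("FR1-IMGT", 1, 26),
    ("CDR1-IMGT", 27, 38),
    ("FR2-IMGT", 39, 55),
    ("CDR2-IMGT", 56, 65),
    ("FR3-IMGT", 66, 104),
    ("CDR3-IMGT", 105, 117),
    ("FR4-IMGT", 118, 128),  # assumed upper bound, see the note above
]


def imgt_region(position: int) -> str:
    """Return the FR-IMGT or CDR-IMGT label for an integer IMGT position."""
    for label, first, last in IMGT_V_REGIONS:
        if first <= position <= last:
            return label
    raise ValueError(f"position {position} is outside the V domain numbering")


# Conserved positions named in the text: 23 (1st-CYS), 41 (CONSERVED-TRP),
# 104 (2nd-CYS) and 118 (J-TRP or J-PHE) all fall in framework regions.
for pos in (23, 41, 104, 118):
    print(pos, imgt_region(pos))
```

    Because the positions mean the same thing for heavy and light chains and for TR chains, a single table of this kind serves every V domain, which is the practical benefit of the unique numbering.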

    IMGT unique numbering for V domains unifies the V domain description whatever the receptor (IGs or TRs) or the chain type (heavy chains, kappa and lambda light chains for the IGs; alpha, beta, gamma and delta chains for the TRs) [85]; it was fully described in 2003 [86] and extended to the C domain in 2005 [90]. Thus, the IMGT unique numbering is the definitive system for numbering the V and C domains [91].

    In addition to the numbering scheme, IMGT-ONTOLOGY classification made it possible, for the first time, to classify the antigen receptor (IG and TR) genes [92,93] of any locus (e.g., immunoglobulin heavy (IGH), T cell receptor alpha (TRA)), of any gene configuration (germline, undefined, or rearranged) and of any species (from fish to human). The IG major loci (and corresponding genes and chains) include the immunoglobulin heavy (IGH), and for the light chains, the immunoglobulin kappa (IGK) and the immunoglobulin lambda (IGL) in higher vertebrates, and the immunoglobulin iota (IGI) in fish. The accuracy and the consistency of the IMGT data are based on this first, and so far unique, IMGT-ONTOLOGY [77,94,95] for immunogenetics and immunoinformatics. IMGT-ONTOLOGY has conceptualized the knowledge in these biological and interdisciplinary domains through diverse facets that rely on seven axioms: IDENTIFICATION, DESCRIPTION, CLASSIFICATION, NUMEROTATION, LOCALIZATION, ORIENTATION and OBTENTION. The concepts generated from these axioms led to the elaboration of the IMGT standards that constitute the IMGT Scientific chart: e.g., IMGT standardized keywords (IDENTIFICATION) [96], IMGT standardized labels (DESCRIPTION) [97], IMGT standardized gene and allele nomenclature (CLASSIFICATION) [98], IMGT unique numbering [91] and standardized graphical 2D representation or IMGT Colliers de Perles (Fig. 4) [99,100] (NUMEROTATION). IMGT-ONTOLOGY concepts of identification, description, classification and numerotation have become a necessity for a standardized description of the IG loci of newly sequenced genomes, antibody structure/function characterization, antibody engineering and humanization [76]. The CDR-IMGT lengths are now required for mAb INN applications and are included in the WHO-INN definitions [89], bringing a new level of standardized information to the comparative analysis of therapeutic antibodies. Moreover, the use of the IMGT standardized keywords (e.g., functionality), labels (e.g., V-REGION, CDR-IMGT), gene and allele nomenclature and IMGT


    Fig. 3. Variable domains forming the binding site of an immunoglobulin or antibody. The variable heavy (VH) and variable light (VL) domains are shown with the CDR1-IMGT, CDR2-IMGT and CDR3-IMGT, according to the IMGT color menu (red, orange and purple for VH; blue, green and dark green for VL (here V-kappa)). The VH domain corresponds to the V-D-J region and the VL to the V-J region (IMGT labels of description). In IMGT unique numbering, the conserved amino acids of a V domain always have the same position, whatever the antigen receptor, or the chain type or the species. The five amino acids of the V domain framework region that contribute to the hydrophobic structural core are shown with red letters, and comprise cysteine C23 (1st-CYS), tryptophan W41 (CONSERVED-TRP), hydrophobic amino acid (here, leucine L) 89, cysteine C104 (2nd-CYS) and tryptophan W118 (J-TRP) or phenylalanine F118 (J-PHE). The anchors of the CDR1-IMGT (26 and 39) and CDR2-IMGT (55 and 66) are shown with black letters. 104 and 118 are the anchors of the CDR3-IMGT. The JUNCTION includes the anchors and extends from 104 to 118, whereas the CDR3-IMGT extends from 105 to 117.

    Fig. 4. IMGT Collier de Perles of the VH and VL domains of trastuzumab. Trastuzumab VH (on the left): CDR-IMGT lengths [8.8.13], and FR-IMGT lengths [25.17.38.11] = 91 AAs. Trastuzumab V-kappa (on the right): CDR-IMGT lengths [6.3.9], and FR-IMGT lengths [26.17.36.10] = 89 AAs. Amino acids are shown as one-letter abbreviations. All proline (P) are shown highlighted in yellow. IMGT anchors are in squares. Hatched circles are IMGT gaps according to the IMGT unique numbering for V domain. Positions with bold (red) letters indicate the four conserved positions that are common to a V domain and to a C domain: 23 (1st-CYS), 41 (CONSERVED-TRP), 89 (hydrophobic), 104 (2nd-CYS), and the fifth conserved position that is specific to the IG and TR V-DOMAIN: 118 (J-TRP or J-PHE). The hydrophobic amino acids (hydropathy index with positive value: I, V, L, F, C, M, A) and tryptophan (W) found at a given position in more than 50% of analyzed sequences are shown (with a blue background color). The motif F/W-G-X-G of the FR4-IMGT characterizes the J region. Arrows indicate the direction of the beta strands and their designations in 3D structures. CDR1-IMGT, CDR2-IMGT, and CDR3-IMGT are colored according to the IMGT color menu: red, orange, and purple for VH, blue, green, and dark green for VL (here V-kappa). The trastuzumab IMGT Colliers de Perles representations are obtained from IMGT/3Dstructure-DB and IMGT/2Dstructure-DB, either by a direct query (e.g., typing trastuzumab in Molecule name (receptor or ligand)) or via the IMGT/mAb-DB interface (selecting trastuzumab in INN) (http://www.imgt.org). IMGT/3Dstructure-DB can also be queried using the entry code (PDB) 1n8z of the trastuzumab Fab/ERBB2 complex 3D structure. IMGT/3Dstructure-DB download (http://www.imgt.org/download/3Dstructure-DB/) provides IMGT renumbered coordinate flat files in a PDB-like format (IMGT3DFlatFiles.tgz). Another file (IMGT3DNumComp.tgz) provides the comparison between the PDB numbering, the IMGT renumbered coordinate file numbering, and the IMGT DOMAIN numbering.
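    The bracketed length notation used in the Fig. 4 caption can be checked mechanically. The Python sketch below uses hypothetical helper names; it parses the dotted notation, sums the FR-IMGT and CDR-IMGT lengths given in the caption, and derives the V domain length as their total. The totals it prints follow from the caption's numbers by simple addition and are not separate data.

```python
from typing import List


def parse_lengths(notation: str) -> List[int]:
    """Parse an IMGT length string such as "[8.8.13]" into a list of integers."""
    inner = notation.strip().strip("[]")
    return [int(part) for part in inner.split(".")]


def v_domain_length(cdr_notation: str, fr_notation: str) -> int:
    """Total number of amino acids: sum of CDR-IMGT plus FR-IMGT lengths."""
    return sum(parse_lengths(cdr_notation)) + sum(parse_lengths(fr_notation))


# Values quoted in the Fig. 4 caption for trastuzumab.
vh_cdr, vh_fr = "[8.8.13]", "[25.17.38.11]"
vk_cdr, vk_fr = "[6.3.9]", "[26.17.36.10]"

print("VH FR-IMGT total:", sum(parse_lengths(vh_fr)))
print("V-kappa FR-IMGT total:", sum(parse_lengths(vk_fr)))
print("VH length:", v_domain_length(vh_cdr, vh_fr))
print("V-kappa length:", v_domain_length(vk_cdr, vk_fr))
```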


    Please cite this article as: H. Shirai, et al., Antibody informatics for drug discovery, Biochim. Biophys. Acta (2014), http://dx.doi.org/10.1016/j.bbapap.2014.07.006


    unique numbering has allowed IG repertoire studies to move to novel high-throughput methodologies with the same high-quality criteria [76].

    Based on these foundations, major IMGT tools and databases have been developed and used for IG and TR repertoire analysis, antibody humanization, and IG/Ag and TR/peptide-MH (pMH) structures. Nucleotide sequences of V domains from deep sequencing (NGS or HTS) can be analyzed with IMGT/HighV-QUEST [101,102], the high throughput version of IMGT/V-QUEST [103,104] and IMGT/JunctionAnalysis [105,106]; the tools can be used for the online analysis of rearranged nucleotide sequences. Amino acid sequences of V and C domains can be represented with the IMGT/Collier-de-Perles tool [107] and analyzed with IMGT/DomainGapAlign [108–110], the tool for antibody humanization (Fig. 5). Lastly, IMGT/3Dstructure-DB, the IMGT database for 3D structures [110,111], and its extension, IMGT/2Dstructure-DB for 2D structures (e.g., antibodies and other proteins for which the 3D structure is not available), provide an identical description of the chains and domains, which enables an easy comparison between antibody 3D structures and sequences.

    Fig. 5. IMGT/DomainGapAlign results for trastuzumab VH. A. Alignment with the closest gene and allele from the IMGT V domain directory. Trastuzumab is a humanized antibody (INN suffix: -zumab) and the AA sequence is aligned against the Homo sapiens (human) V domain directory. The tool creates IMGT gaps, identifies the closest germline IGHV and IGHJ gene and allele (for trastuzumab VH, H. sapiens IGHV3-66*01 and IGHJ4*01), and determines the V region percent of identity (81.60% with IGHV3-66*01, results shown above the alignment). AA changes compared to the germline V and J reference sequences are shown in bold on the third line. Delimitations of the FR-IMGT and CDR-IMGT, and of the strands and loops, are given according to the IMGT unique numbering [85,90]. B. IMGT Collier de Perles. The IMGT Collier de Perles, generated automatically from the IMGT/DomainGapAlign results, shows pink circles indicating amino acid changes compared to the closest gene and allele (IGHV3-66*01 and IGHJ4*01) from the IMGT reference directory. C. Table with the AA changes in the CDR1-IMGT and CDR2-IMGT and in the four FR-IMGT, compared to the germline H. sapiens IGHV3-66*01 and IGHJ4*01. AA changes are described with their positions according to IMGT unique numbering (e.g., T29>N) and the modification (−) or conservation (+) of IMGT AA properties (e.g., (+), presented in the order: hydropathy, volume, and IMGT physicochemical classes) and the characterization of the AA change (e.g., dissimilar).

    IMGT unique numbering bridges the gap between sequences and 3D structures in these tools and databases, allowing standardized statistical analysis of antibody V-REGION amino acid physicochemical properties defined by the eleven IMGT amino acid physicochemical classes [112]

    discovery, Biochim. Biophys. Acta (2014), http://dx.doi.org/10.1016/

    http://dx.doi.org/10.1016/j.bbapap.2014.07.006http://dx.doi.org/10.1016/j.bbapap.2014.07.006

  • 11H. Shirai et al. / Biochimica et Biophysica Acta xxx (2014) xxxxxx(Fig. 5C), which can be correlated, if 3D structures are available in IMGT/3Dstructure-DB [110,111], with the display of hydrogen bonds in IMGTColliers de Perles on 2 layers, the results of contact analysis (Fig. 6), andthe paratope/epitope description. These IMGT tools and databases areregularly updated and run against IMGT reference directories built fromsequences annotated in IMGT/LIGM-DB [113], the IMGT nucleotidedatabase [175,406 sequences from 346 species in November 2013]and from IMGT/GENE-DB [114], the IMGT gene database (3117Fig. 6. IMGT/3Dstructure-DB contacts between the trastuzumab VH and ERBB2 in the IG/Ag comERBB2 (1n8z_C) display AAs for which contacts are identified in IMGT/3Dstructure-DB [109,110the AAs that interactwith the ligand essentially belong to the CDR1-IMGT, CDR2-IMGT and CDRtwo AAs of the FR-IMGT that interact with the ligand are the anchors 55 and 66 of the CDR2-IM(here, arginine). Amino acids that interact can be localized in the IMGT Collier de Perles (Fig. 4 aand Y113 (CDR3-IMGT). TwoAAs stand out by their number of contacts: Y64 (pink circle in Fig.53 atom pair contacts, respectively, out of the 160 contacts contributed by the CDR-IMGT.

    Please cite this article as: H. Shirai, et al., Antibody informatics for drugj.bbapap.2014.07.006genes and 4732 alleles from 17 species, containing 695 genes and1420 alleles for Homo sapiens and 868 genes and 1318 alleles for Musmusculus in November 2013). To bring this cutting edge discovery di-rectly to therapeutic development, an interface, IMGT/mAb-DB [115],has been set up to provide an easy access to therapeutic antibodyamino acid sequences (links to IMGT/2Dstructure-DB) and structures(links to IMGT/3Dstructure-DB), if 3D structures are available. IMGT/mAb-DB data include monoclonal antibodies (mAbs, INN suffix -mAbplex (PDB ID: 1n8z). Interactions between the trastuzumab VH (1n8z_B) and the ligand], and the number of atom pair contact types (polar, hydrogen, nonpolar). They show that3-IMGT (red, orange and purple, respectively, according to the IMGT colormenu). The onlyGT and these contacts are not unusual when these positions are occupied by large size AAsnd 5). They are Y38 (CDR1-IMGT), Y57, Y64, and T65 (CDR2-IMGT) andW107, G111, F112,4, indicating anAA change) and Y113 (near the top of the loop)which participate to 44 and

    discovery, Biochim. Biophys. Acta (2014), http://dx.doi.org/10.1016/

    http://dx.doi.org/10.1016/j.bbapap.2014.07.006http://dx.doi.org/10.1016/j.bbapap.2014.07.006

  • 12 H. Shirai et al. / Biochimica et Biophysica Acta xxx (2014) xxxxxxthat is defined by the presence of at least an IG variable domain) andfusion proteins for immune applications (FPIA, INN suffix -cept that isdefined by a receptor fused to a Fc) from the WHO-INN program [89].This database also includes a few composite proteins for clinical applica-tions (CPCA) (e.g., protein or peptide fused to a Fc for only increasingtheir half-life, identifiedby the INNprefix ef-) and some related proteinsof the immune system (RPI) used, unmodified, for clinical applications.Since 2007, IMGT gene and allele names have been used for the de-scription of the therapeutic monoclonal antibodies (mAbs) and FPIA ofthe WHO-INN program[89].
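The assignment of contact residues to CDR-IMGT and FR-IMGT described in the Fig. 6 caption can be illustrated with a minimal sketch. It assumes the standard IMGT unique numbering delimitations for V domains and only handles integer positions; the function names `imgt_region` and `group_by_region` are hypothetical helpers introduced here, not part of any IMGT tool, and the C-terminal end of FR4-IMGT is deliberately left open.

```python
# Delimitations of the IMGT unique numbering for V domains (inclusive).
# Assumption: integer positions only; additional positions in long CDR3
# loops (e.g. 111.1) are not handled in this sketch.
IMGT_V_REGIONS = [
    ("FR1-IMGT", 1, 26),
    ("CDR1-IMGT", 27, 38),
    ("FR2-IMGT", 39, 55),
    ("CDR2-IMGT", 56, 65),
    ("FR3-IMGT", 66, 104),
    ("CDR3-IMGT", 105, 117),
]


def imgt_region(position: int) -> str:
    """Return the FR-IMGT or CDR-IMGT label for an IMGT V-domain position."""
    if position < 1:
        raise ValueError(f"invalid IMGT position: {position}")
    for label, start, end in IMGT_V_REGIONS:
        if start <= position <= end:
            return label
    # Everything from position 118 onwards is treated as FR4-IMGT here.
    return "FR4-IMGT"


def group_by_region(residues):
    """Group residue labels such as 'Y64' by their IMGT region."""
    grouped = {}
    for residue in residues:
        position = int(residue[1:])  # one-letter AA code, then the position
        grouped.setdefault(imgt_region(position), []).append(residue)
    return grouped


# Contact residues listed in the Fig. 6 caption for the trastuzumab VH.
contacts = ["Y38", "Y57", "Y64", "T65", "W107", "G111", "F112", "Y113"]
print(group_by_region(contacts))
# The two CDR2-IMGT anchors fall in the framework regions.
print(imgt_region(55), imgt_region(66))
```

Because the region boundaries are fixed by the numbering rather than by the individual sequence, the same lookup applies unchanged to any V domain numbered in this way.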

    By facilitating comparisons between the sequences and the descriptions of alleles, mutations, AA changes and, for the C domains, allotypes [116], the IMGT unique numbering scheme represents a big step forward in the analysis of the IGs and TRs of all species. It offers the possibility of comparing genetic data from sequences and structures with the ultimate goal of providing information for personalized medicine [117]. This illustrates the possibility that when a numbering scheme is robust and consistent, it may allow the linking of genes to proteins, sequence to structure [78,118–120] and perhaps physicochemical properties of the antibody to its stability over time.
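The kind of comparison that a shared numbering makes possible, such as the percent identity and the position-by-position AA changes reported against a germline sequence in Fig. 5, can be sketched as follows. This is a simplified illustration and not the IMGT/DomainGapAlign algorithm: it assumes the two sequences are already aligned to the same length with '.' as the gap character, the function `compare_to_germline` is a hypothetical helper, and the sequences are invented toy fragments.

```python
def compare_to_germline(query: str, germline: str, first_position: int = 1):
    """Compare two pre-aligned, equal-length amino acid sequences.

    '.' marks a gap; positions where either sequence has a gap are skipped.
    Returns the percent identity over the compared positions and a list of
    changes as (position, germline AA, query AA) tuples.
    """
    if len(query) != len(germline):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    identical = 0
    changes = []
    for offset, (q, g) in enumerate(zip(query, germline)):
        if q == "." or g == ".":
            continue
        compared += 1
        if q == g:
            identical += 1
        else:
            changes.append((first_position + offset, g, q))
    if compared == 0:
        raise ValueError("no comparable positions")
    return 100.0 * identical / compared, changes


# Invented toy fragments (not real antibody or germline sequences).
toy_germline = "QVQL..ESGG"
toy_query = "EVQL..ESAG"
identity, changes = compare_to_germline(toy_query, toy_germline)
print(f"{identity:.2f}% identity")
for position, germline_aa, query_aa in changes:
    print(f"position {position}: {germline_aa} changed to {query_aa}")
```

Since both sequences carry the same position labels, each change can be reported by position without any further alignment step, which is the practical benefit of a unique numbering.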

    To conclude, the IMGT example shows how powerful a common, robust and unique numbering scheme can be: it makes experimental data easier for all scientists to share and interpret. The next generation of antibodies, which will benefit from this harmonization step, should then be easier to develop and perhaps better for patients. Using robust and common definitions and numbering schemes for antibody engineering can be of real value in knowledge acquisition [78,118–120]. In addition, CDR lengths should be taken into account when assigning the structural location of specific residues. This focus could be a next step in antibody engineering informatics. Recommendations for a standard numbering scheme, intended to make publications more consistent across the community, were put forward in July 2012 at the EMBL-EBI Industry Programme; IMGT would suit this purpose. This recommendation would still need to be accepted and endorsed.
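As a hedged illustration of how CDR lengths can be read directly from a numbered domain, the sketch below counts the occupied positions within the CDR-IMGT delimitations. It assumes a mapping from integer IMGT positions to one-letter amino acids in which gaps are either absent or marked '.'; the helpers `cdr_lengths` and `format_lengths` and the toy domain are hypothetical, and additional positions used for very long CDR3 loops are not handled.

```python
# CDR-IMGT delimitations of the IMGT unique numbering for V domains (inclusive).
CDR_IMGT_RANGES = {
    "CDR1-IMGT": (27, 38),
    "CDR2-IMGT": (56, 65),
    "CDR3-IMGT": (105, 117),
}


def cdr_lengths(numbered):
    """Count occupied positions in each CDR-IMGT.

    `numbered` maps an integer IMGT position to a one-letter amino acid;
    a missing key or '.' is treated as a gap.
    """
    lengths = {}
    for label, (start, end) in CDR_IMGT_RANGES.items():
        lengths[label] = sum(
            1 for pos in range(start, end + 1) if numbered.get(pos, ".") != "."
        )
    return lengths


def format_lengths(lengths):
    """Format the three lengths in the bracketed, dot-separated style."""
    return "[" + ".".join(str(lengths[label]) for label in CDR_IMGT_RANGES) + "]"


# Hypothetical domain: 'A' marks an occupied position, everything else is a gap.
occupied = (
    list(range(27, 31)) + list(range(35, 39))    # 8 occupied CDR1 positions
    + list(range(56, 60)) + list(range(62, 66))  # 8 occupied CDR2 positions
    + list(range(105, 118))                      # 13 occupied CDR3 positions
)
toy_domain = {pos: "A" for pos in occupied}
toy_lengths = cdr_lengths(toy_domain)
print(format_lengths(toy_lengths))  # prints [8.8.13]
```

Keeping the gaps explicit is what lets a residue label such as position 64 refer to the same structural location in loops of different lengths.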

    7. Conclusions

    The various in silico technologies, databases and infrastructure discussed here can be summarized as a top-level category of antibody informatics. Some antibody or antigen designs can be performed by an in silico approach, as well as by knowledge, experience and intuition. High quality antibody modeling is necessary for the rational design and affinity improvement of antibodies and their interactions with antigens. Current automatic modeling methods, such as RosettaAntibody and PIGS, generate models of sufficient quality for the hypervariable region, except for loop H3. The ab initio method may contribute to predicting a more accurate H3 loop conformation. However, we need to emphasize that expert knowledge is still needed, for example, to select suitable templates or for the rational design of antibodies when new antibody sequences emerge. We summarized various antibody data resources in terms of their contents and features, including DIGIT, IEDB, and IMGT. Compared to small molecule therapeutics, antibody drug discovery data are still poorly supported by informatics resources. Combining antibody data with small-molecule data in ChEMBL and DrugBank allows us to utilize known drug discovery data for the development of antibody drugs. Use of the IMGT-ONTOLOGY concepts, and in particular of the definitive IMGT unique numbering for V and C domains, would be one way to address some of the inconsistencies within the antibody informatics field. The analysis of the IGs and TRs of the adaptive immune repertoire using IMGT/HighV-QUEST (1.87 billion sequences, from different species and different IG and TR loci, as of January 2014) has unambiguously demonstrated that IMGT unique numbering is particularly well adapted for large data sets. The in silico technologies, databases and infrastructure discussed here give useful hints to all types of computational and experimental researchers and can have a real impact in aiding antibody drug discovery.

    References

    [1] http://www.ebi.ac.uk/industry.

    [2] X. Wang, T.K. Das, S.K. Singh, S. Kumar, Potential aggregation prone regions in biotherapeutics: a survey of commercial monoclonal antibodies, MAbs 1 (2009) 254–267.

    [3] N. Chennamsetty, V. Voynov, V. Kayser, B. Helk, B.L. Trout, Design of therapeutic proteins with enhanced stability, Proc. Natl. Acad. Sci. U. S. A. 106 (2009) 11937–11942.

    [4] P.R. Hinton, M.G. Johlfs, J.M. Xiong, K. Hanestad, K.C. Ong, C. Bullock, S. Keller, M.T. Tang, J.Y. Tso, M. Vasquez, N. Tsurushita, Engineered human IgG antibodies with longer serum half-lives in primates, J. Biol. Chem. 279 (2004) 6213–6216.

    [5] A. Datta-Mannan, D.R. Witcher, Y. Tang, J. Watkins, W. Jiang, V.J. Wroblewski, Humanized IgG1 variants with differential binding properties to the neonatal Fc receptor: relationship to pharmacokinetics in mice and primates, Drug Metab. Dispos. 35 (2007) 86–94.

    [6] D. Kuroda, H. Shirai, M.P. Jacobson, H. Nakamura, Computer-aided antibody design, Protein Eng. Des. Sel. 25 (2012) 507–521.

    [7] P. Marcatili, A. Rosi, A. Tramontano, PIGS: automatic prediction of antibody structures, Bioinformatics 24 (2008) 1953–1954.

    [8] D. Kuroda, H. Shirai, M. Kobori, H. Nakamura, Structural classification of CDR-H3 revisited: a lesson in antibody modeling, Proteins 73 (2008) 608–620.

    [9] X. Yang, W. Xu, S. Dukleska, S. Benchaar, S. Mengisen, V. Antochshuk, J. Cheung, L. Mann, Z. Babadjanova, J. Rowand, R. Gunawan, A. McCampbell, M. Beaumont, D. Meininger, D. Richardson, A. Ambrogelly, Developability studies before initiation of process development: improving manufacturability of monoclonal antibodies, MAbs 5 (2013) 787–794.

    [10] V. Jawa, L.P. Cousens, M. Awwad, E. Wakshull, H. Kropshofer, A.S. De Groot, T-cell dependent immunogenicity of protein therapeutics: preclinical assessment and mitigation, Clin. Immunol. 149 (2013) 534–555.

    [11] S. Paul, R.V. Kolla, J. Sidney, D. Weiskopf, W. Fleri, Y. Kim, B. Peters, A. Sette, Evaluating the immunogenicity of protein drugs by applying in vitro MHC binding data and the immune epitope database and analysis resource, Clin. Dev. Immunol. 2013 (2013) 467852.

    [12] S.T. Reddy, X. Ge, A.E. Miklos, R.A. Hughes, S.H. Kang, K.H. Hoi, C. Chrysostomou, S.P. Hunicke-Smith, B.L. Iverson, P.W. Tucker, A.D. Ellington, G. Georgiou, Monoclonal antibodies isolated without screening by analyzing the variable-gene repertoire of plasma cells, Nat. Biotechnol. 28 (2010) 965–969.

    [13] X. Wu, T. Zhou, J. Zhu, B. Zhang, I. Georgiev, C. Wang, X. Chen, N.S. Longo, M. Louder, K. McKee, S. O'Dell, S. Perfetto, S.D. Schmidt, W. Shi, L. Wu, Y. Yang, Z.Y. Yang, Z. Yang, Z. Zhang, M. Bonsignori, J.A. Crump, S.H. Kapiga, N.E. Sam, B.F. Haynes, M. Simek, D.R. Burton, W.C. Koff, N.A. Doria-Rose, M. Connors, J.C. Mullikin, G.J. Nabel, M. Roederer, L. Shapiro, P.D. Kwong, J.R. Mascola, Focused evolution of HIV-1 neutralizing antibodies revealed by structures and deep sequencing, Science 333 (2011) 1593–1602.

    [14] J.C. Almagro, J. Fransson, Humanization of antibodies, Front. Biosci. 13 (2008) 1619–1633.

    [15] A.E. Miklos, C. Kluwe, B.S. Der, S. Pai, A. Sircar, R.A. Hughes, M. Berrondo, J. Xu, V. Codrea, P.E. Buckley, A.M. Calm, H.S. Welsh, C.R. Warner, M.A. Zacharko, J.P. Carney, J.J. Gray, G. Georgiou, B. Kuhlman, A.D. Ellington, Structure-based design of supercharged, highly thermoresistant antibodies, Chem. Biol. 19 (2012) 449–455.

    [16] J. Xu, D. Tack, R.A. Hughes, A.D. Ellington, J.J. Gray, Structure-based non-canonical amino acid design to covalently crosslink an antibody–antigen complex, J. Struct. Biol. 185 (2014) 215–222.

    [17] M. Kiyoshi, J.M. Caaveiro, E. Miura, S. Nagatoishi, M. Nakakido, S. Soga, H. Shirai, S. Kawabata, K. Tsumoto, Affinity improvement of a therapeutic antibody by structure-based computational design: generation of electrostatic interactions in the transition state stabilizes the antibody–antigen complex, PLoS One 9 (2014) e87099.

    [18] http://www.chemcomp.com/index.htm.

    [19] http://accelrys.com/.

    [20] A. Fukunaga, K. Tsumoto, Improving the affinity of an antibody for its antigen via long-range electrostatic interactions, Protein Eng. Des. Sel. 26 (2013) 773–780.

    [21] K. Tharakaraman, L.N. Robinson, A. Hatas, Y.L. Chen, L. Siyue, S. Raguram, V. Sasisekharan, G.N. Wogan, R. Sasisekharan, Redesign of a cross-reactive antibody to dengue virus with broad-spectrum activity and increased in vivo potency, Proc. Natl. Acad. Sci. U. S. A. 110 (2013) E1555–E1564.

    [22] T.T. Wu, E.A. Kabat, An analysis of the sequences of the variable regions of Bence Jones proteins and myeloma light chains and their implications for antibody complementarity, J. Exp. Med. 132 (1970) 211–250.

    [23] C. Chothia, A.M. Lesk, Canonical structures for the hypervariable regions of immunoglobulins, J. Mol. Biol. 196 (1987) 901–917.

    [24] C. Chothia, A.M. Lesk, A. Tramontano, M. Levitt, S.J. Smith-Gill, G. Air, S. Sheriff, E.A. Padlan, D. Davies, W.R. Tulip, et al., Conformations of immunoglobulin hypervariable regions, Nature 342 (1989) 877–883.

    [25] B. North, A. Lehmann, R.L. Dunbrack, A new clustering of antibody CDR loop conformations, J. Mol. Biol. 406 (2011) 228–256.

    [26] B. Al-Lazikani, A.M. Lesk, C. Chothia, Standard conformations for the canonical structures of immunoglobulins, J. Mol. Biol. 273 (1997) 927–948.

    [27] A. Chailyan, P. Marcatili, D. Cirillo, A. Tramontano, Structural repertoire of immunoglobulin lambda light chains, Proteins 79 (2011) 1513–1524.

    [28] D. Kuroda, H. Shirai, M. Kobori, H. Nakamura, Systematic classification of CDR-L3 in antibodies: implications of the light chain subtypes and the VL–VH interface, Proteins 75 (2009) 139–146.

    [29] A.C. Martin, J.M. Thornton, Structural families in loops of homologous proteins: automatic classification, modelling and application to antibodies, J. Mol. Biol. 263 (1996) 800–815.


    [30] P. Verdino, D.A. Witherden, K. Podshivalova, S.E. Rieder, W.L. Havran, I.A. Wilson, cDNA sequence and Fab crystal structure of HL4E10, a hamster IgG lambda light chain antibody stimulatory for gammadelta T cells, PLoS One 6 (2011) e19828.

    [31] A. Sircar, K.A. Sanni, J. Shi, J.J. Gray, Analysis and modeling of the variable region of camelid single-domain antibodies, J. Immunol. 186 (2011) 6357–6367.

    [32] O. Lopez, C. Perez, D. Wylie, A single VH family and long CDR3s are the targets for hypermutation in bovine immunoglobulin heavy chains, Immunol. Rev. 162 (1998) 55–66.

    [33] F. Wang, D.C. Ekiert, I. Ahmad, W. Yu, Y. Zhang, O. Bazirgan, A. Torkamani, T. Raudsepp, W. Mwangi, M.F. Criscitiello, I.A. Wilson, P.G. Schultz, V.V. Smider, Reshaping antibody diversity, Cell 153 (2013) 1379–1393.

    [34] N.R. Whitelegg, A.R. Rees, WAM: an improved algorithm for modelling antibodies on the WEB, Protein Eng. 13 (2000) 819–824.

    [35] L. Holm, L. Laaksonen, M. Kaartinen, T.T. Teeri, J.K. Knowles, Molecular modelling study of antigen binding to oxazolone-specific antibodies: the Ox1 idiotypic IgG and its mature variant with increased affinity to 2-phenyloxazolone, Protein Eng. 3 (1990) 403–409.

    [36] A.C. Martin, J.C. Cheetham, A.R. Rees, Modeling antibody hypervariable loops: a combined algorithm, Proc. Natl. Acad. Sci. U. S. A. 86 (1989) 9268–9272.

    [37] A. Sircar, E.T. Kim, J.J. Gray, RosettaAntibody: antibody variable region homology modeling server, Nucleic Acids Res. 37 (2009) W474–W479.

    [38] B.D. Weitzner, D. Kuroda, N. Marze, J. Xu, J.J. Gray, Blind prediction performance of RosettaAntibody 3.0: grafting, relaxation, kinematic loop modeling, and full CDR optimization, Proteins 82 (2014) 1611–1623.

    [39] K.W. Kaufmann, G.H. Lemmon, S.L. Deluca, J.H. Sheehan, J. Meiler, Practically useful: what the Rosetta protein modeling suite can do for you, Biochemistry 49 (2010) 2987–2998.

    [40] A. Leaver-Fay, M. Tyka, S.M. Lewis, O.F. Lange, J. Thompson, R. Jacak, K. Kaufman, P.D. Renfrew, C.A. Smith, W. Sheffler, I.W. Davis, S. Cooper, A. Treuille, D.J. Mandell, F. Richter, Y.E. Ban, S.J. Fleishman, J.E. Corn, D.E. Kim, S. Lyskov, M. Berrondo, S. Mentzer, Z. Popovic, J.J. Havranek, J. Karanicolas, R. Das, J. Meiler, T. Kortemme, J.J. Gray, B. Kuhlman, D. Baker, P. Bradley, ROSETTA3: an object-oriented software suite for the simulation and design of macromolecules, Methods Enzymol. 487 (2011) 545–574.

    [41] A.A. Canutescu, R.L. Dunbrack Jr., Cyclic coordinate descent: a robotics algorithm for protein loop closure, Protein Sci. 12 (2003) 963–972.

    [42] D.J. Mandell, E.A. Coutsias, T. Kortemme, Sub-angstrom accuracy in protein loop reconstruction by robotics-inspired conformational sampling, Nat. Methods 6 (2009) 551–552.

    [43] Y. Choi, C.M. Deane, Predicting antibody complementarity determining region structures without classification, Mol. Biosyst. 7 (2011) 3327–3334.

    [44] P.T., C.A. Janeway Jr., M. Walport, et al., Immunobiology: The Immune System in Health and Disease, 5th edition, Garland Science, New York, 2001.

    [45] V. Morea, A. Tramontano, M. Rustici, C. Chothia, A.M. Lesk, Conformations of the third hypervariable region in the VH domain of immunoglobulins, J. Mol. Biol. 275 (1998) 269–294.

    [46] H. Shirai, A. Kidera, H. Nakamura, Structural classification of CDR-H3 in antibodies, FEBS Lett. 399 (1996) 1–8.

    [47] H. Shirai, A. Kidera, H. Nakamura, H3-rules: identification of CDR-H3 structures in antibodies, FEBS Lett. 455 (1999) 188–197.

    [48] S.T. Kim, H. Shirai, N. Nakajima, J. Higo, H. Nakamura, Enhanced conformational diversity search of CDR-H3 in antibodies: role of the first CDR-H3 residue, Proteins 37 (1999) 683–696.

    [49] I. Sela-Culang, S. Alon, Y. Ofran, A systematic comparison of free and bound antibodies reveals binding-related conformational changes, J. Immunol. 189 (2012) 4890–4899.

    [50] K. Zhu, D.L. Pincus, S. Zhao, R.A. Friesner, Long loop prediction using the protein local optimization program, Proteins 65 (2006) 438–452.

    [51] H. Shirai, High-resolution modeling of antibody structures by a combination of bioinformatics, expert knowledge, and molecular simulations, The Annual Meeting of the Antibody Society, Huntington Beach, CA, December 8–12, 2013.

    [52] K.R. Abhinandan, A.C. Martin, Analysis and prediction of VH/VL packing in antibodies, Protein Eng. Des. Sel. 23 (2010) 689–697.

    [53] A. Chailyan, P. Marcatili, A. Tramontano, The association of heavy and light chain variable domains in antibodies: implications for antigen specificity, FEBS J. 278 (2011) 2858–2866.

    [54] J. Dunbar, A. Fuchs, J. Shi, C.M. Deane, ABangle: characterising the VH–VL orientation in antibodies, Protein Eng. Des. Sel. 26 (2013) 611–620.

    [55] X.M. Fernandez-Suarez, D.J. Rigden, M.Y. Galperin, The 2014 Nucleic Acids Research Database Issue and an updated NAR online Molecular Biology Database Collection, Nucleic Acids Res. 42 (2014) D1–D6.

    [56] R. Vita, L. Zarebski, J.A. Greenbaum, H. Emami, I. Hoof, N. Salimi, R. Damle, A. Sette, B. Peters, The immune epitope database 2.0, Nucleic Acids Res. 38 (2010) D854–D862.

    [57] A. Chailyan, A. Tramontano, P. Marcatili, A database of immunoglobulins with integrated tools: DIGIT, Nucleic Acids Res. 40 (2012) D1230–D1234.

    [58] P. Marcatili, F. Ghiotto, C. Tenca, A. Chailyan, A.N. Mazzarello, X.J. Yan, M. Colombo, E. Albesiano, D. Bagnara, G. Cutrona, F. Morabito, S. Bruno, M. Ferrarini, N. Chiorazzi, A. Tramontano, F. Fais, Igs expressed by chronic lymphocytic leukemia B cells show limited binding-site structure variability, J. Immunol. 190 (2013) 5771–5778.

    [59] F. Ghiotto, P. Marcatili, C. Tenca, M.G. Calevo, X.J. Yan, E. Albesiano, D. Bagnara, M. Colombo, G. Cutrona, C.C. Chu, F. Morabito, S. Bruno, M. Ferrarini, A. Tramontano, F. Fais, N. Chiorazzi, Mutation pattern of paired immunoglobulin heavy and light variable domains in chronic lymphocytic leukemia B cells, Mol. Med. 17 (2011) 1188–1195.

    [60] S. Zibellini, D. Capello, F. Forconi, P. Marcatili, D. Rossi, S. Rattotti, S. Franceschetti, E. Sozzi, E. Cencini, R. Marasca, L. Baldini, A. Tucci, F. Bertoni, F. Passamonti, E. Orlandi, M. Varettoni, M. Merli, S. Rizzi, V. Gattei, A. Tramontano, M. Paulli, G. Gaidano, L. Arcaini, Stereotyped patterns of B-cell receptor in splenic marginal zone lymphoma, Haematologica 95 (2010) 1792–1796.

    [61] A. Gaulton, L.J. Bellis, A.P. Bento, J. Chambers, M. Davies, A. Hersey, Y. Light, S. Mcglinchey, D. Michalovich, B. Al-Lazikani, J.P. Overington, ChEMBL: a large-scale bioactivity database for drug discovery, Nucleic Acids Res. 40 (2012) D1100–D1107.

    [62] A.P. Bento, A. Gaulton, A. Hersey, L.J. Bellis, J. Chambers, M. Davies, F.A. Kruger, Y. Light, L. Mak, S. McGlinchey, M. Nowotka, G. Papadatos, R. Santos, J.P. Overington, The ChEMBL bioactivity database: an update, Nucleic Acids Res. 42 (2014) D1083–D1090.

    [63] L.J. Bellis, R. Akhtar, B. Al-Lazikani, F. Atkinson, A.P. Bento, J. Chambers, M. Davies, A. Gaulton, A. Hersey, K. Ikeda, F.A. Krüger, Y. Light, S. McGlinchey, R. Santos, B. Stauch, J.P. Overington, Collation and data-mining of literature bioactivity data for drug discovery, Biochem. Soc. Trans. 39 (2011) 1365–1370.

    [64] Q. Li, T. Cheng, Y. Wang, S.H. Bryant, PubChem as a public resource for drug discovery, Drug Discov. Today 15 (2010) 1052–1057.

    [65] C. Knox, V. Law, T. Jewison, P. Liu, S. Ly, A. Frolkis, A. Pon, K. Banco, C. Mak, V. Neveu, Y. Djoumbou, R. Eisner, A.C. Guo, D.S. Wishart, DrugBank 3.0: a comprehensive resource for omics research on drugs, Nucleic Acids Res. 39 (2011) D1035–D1041.

    [66] J.P. Overington, B. Al-Lazikani, A.L. Hopkins, How many drug targets are there? Nat. Rev. Drug Discov. 5 (2006) 993–996.

    [67] M. Rask-Andersen, M.S. Almen, H.B. Schioth, Trends in the exploitation of novel drug targets, Nat. Rev. Drug Discov. 10 (2011) 579–590.

    [68] D. Gardner, H. Akil, G.A. Ascoli, D.M. Bowden, W. Bug, D.E. Donohue, D.H. Goldberg, B. Grafstein, J.S. Grethe, A. Gupta, M. Halavi, D.N. Kennedy, L. Marenco, M.E. Martone, P.L. Miller, H.M. Muller, A. Robert, G.M. Shepherd, P.W. Sternberg, D.C. Van Essen, R.W. Williams, The neuroscience information framework: a data and knowledge environment for neuroscience, Neuroinformatics 6 (2008) 149–160.

    [69] http://antibodyregistry.org.

    [70] S.M. Major, S. Nishizuka, D. Morita, R. Rowland, M. Sunshine, U. Shankavaram, F. Washburn, D. Asin, H. Kouros-Mehr, D. Kane, J.N. Weinstein, AbMiner: a bioinformatic resource on available monoclonal antibodies and corresponding gene identifiers for genomic, proteomic, and immunologic studies, BMC Bioinformatics 7 (2006) 192.

    [71] http://www.antibodypedia.com.

    [72] https://www.phe-culturecollections.org.uk/products/celllines/hybridoma/search.jsp.

    [73] http://www.gallartinternet.com/mai/index.htm.

    [74] http://www.imgt.org.

    [75] M.-P. Lefranc, V. Giudicelli, C. Ginestoux, J. Jabado-Michaloud, G. Folch, F. Bellahcene, Y. Wu, E. Gemrot, X. Brochet, J. Lane, L. Regnier, F. Ehrenmann, G. Lefranc, P. Duroux, IMGT, the international ImMunoGeneTics information system, Nucleic Acids Res. 37 (2009) D1006–D1012.

    [76] M.-P. Lefranc, Immunoglobulin (IG) and T cell receptor genes (TR): IMGT and the birth and rise of immunoinformatics, Front. Immunol. 5 (2014) 22.

    [77] V. Giudicelli, M.-P. Lefranc, IMGT-ONTOLOGY, in: O.W.W. Dubitzky, K.-H. Cho, H. Yokota (Eds.), Encyclopedia of Systems Biology, Springer Science + Business Media, LLC012, New York, 2013, pp. 964–972.

    [78] M.-P. Lefranc, Antibody informatics: IMGT, the international ImMunoGeneTics information system, Microbiol. Spectr. 2 (2) (2014) (AID-0001).

    [79] E.A. Kabat, T.T. Wu, M. Reid-Miller, H. Perry, K. Gottesman, Sequences of Proteins of Immunological Interest, 4th ed. U. S. Govt., Bethesda, MD, 1987. Printing Off. No. 165–492.

    [80] E.A. Kabat, T.T. Wu, H. Perry, K. Gottesman, C. Foeller, Sequences of Proteins of Immunological Interest, 5th ed. NIH Publication, Bethesda, MD, 1991. No. 91–3242.

    [81] K.R. Abhinandan, A.C. Martin, Analysis and improvements to Kabat and structurally correct numbering of antibody variable domains, Mol. Immunol. 45 (2008) 3832–3839.

    [82] A. Honegger, A. Plückthun, Yet another numbering scheme for immunoglobulin variable domains: an automatic modeling and analysis tool, J. Mol. Biol. 309 (2001) 657–670.

    [83] M.-P. Lefranc, Unique database numbering system for immunogenetic analysis, Immunol. Today 18 (1997) 509.

    [84] http://www.imgt.org/IMGTrepertoire/2D-3Dstruct/2D-representations/mouse/IG/E5.2Fv/ighVD-J_E5_2Fv.html.

    [85] M.-P. Lefranc, T