Enzyme Technologies (Metagenomics, Evolution, Biocatalysis, and Biosynthesis) || Principles of Enzyme Optimization for the Rapid Creation of Industrial Biocatalysts

4 PRINCIPLES OF ENZYME OPTIMIZATION FOR THE RAPID CREATION OF INDUSTRIAL BIOCATALYSTS

Richard J. Fox and Lori Giver
Codexis, Inc., Redwood City, California

I. INTRODUCTION

Enzymes are incredibly proficient molecular machines, having evolved through several hundred million years of natural selection to catalyze thousands of biochemical reactions critical to all life on the planet. When operating on natural substrates and products under physiological conditions, they can accelerate reactions by a factor of up to $10^{17}$ over uncatalyzed reactions [1]. Unfortunately, they typically do not perform well in industrial applications, where pH, temperature, and solvent conditions, as well as the substrates and products they operate on, can deviate significantly from the environment in which they evolved. Thus, natural enzymes usually require some degree of optimization to function effectively as industrial biocatalysts. To address the limitations of natural enzymes, significant scientific and engineering efforts have been devoted to enzyme optimization over the last few decades, resulting in large advances in the speed and degree to which these proteins can be discovered and improved.

Although there are many ways to view the task of enzyme optimization, three critical aspects have emerged that offer a conceptual framework in which to approach the problem: (1) the fitness function, (2) diversity generation, and (3) the search algorithm. These aspects capture many of the important features of enzyme optimization and are all active targets of research aimed at accelerating the development of commercial processes. In the following sections we provide an overview of these aspects by highlighting their important features as well as discussing the current state of knowledge, existing challenges, and areas of opportunity.

Enzyme Technologies: Metagenomics, Evolution, Biocatalysis, and Biosynthesis, Edited by Wu-Kuang Yeh, Hsiu-Chiung Yang, and James R. McCarthy. Copyright © 2010 John Wiley & Sons, Inc.

Before discussing these aspects, however, it is useful to review some of the general principles at work in evolutionary enzyme optimization, both natural and artificial. As enzymes are products of natural selection, some understanding of the fitness landscapes on which they have evolved is worthwhile. The notion of a fitness landscape has been with us for many decades. Sewall Wright was one of the first to articulate the idea that organisms exposed to evolutionary pressures can be viewed as entities climbing mountain ranges in a high-dimensional space [2]. The dimensions consist of inputs (genotypes) and outputs (phenotypes or functions of interest). Further popularized by writers such as Richard Dawkins [3] and Stuart Kauffman [4,5], who expanded it to include molecular entities such as genes and proteins, the idea can be a useful way to visualize some of the important features of evolution. An example of such a sequence–function landscape is shown in Figure 1. The task of ascending the surface has been elegantly referred to by Dawkins as “climbing Mount Improbable.” The expression nicely encapsulates the notion that random moves have a low probability of increasing function, but that a steady ascent is possible, even likely, if future moves are built on past gains, even if the process is completely blind to the ultimate goal and to the mechanisms by which moves are generated.

One of the most important features to note about optimization on high-dimensional landscapes is that there are likely to be many paths upward and there are probably numerous acceptable solutions. One could envision a multitude of intertwined ridges with numerous intersections that allow for different paths up Mount Improbable. In the field of biocatalysis, obtaining the best enzyme is usually not the goal, and nature instructs us that there are often many ways to achieve solutions to a given problem. Thus, efforts to optimize enzymes usually benefit from an emphasis on the search for good prospects over a single-minded pursuit of the best. The multitude of acceptable solutions also entails that many of these solutions will not be obvious. Leslie Orgel’s second rule states that “evolution is cleverer than you are.” The beauty of evolutionary approaches to enzyme optimization is that they often lead to solutions that would have been difficult or impossible to discover through our limited understanding of the system.

Another important consideration of a fitness landscape is the degree to which nearby coordinates have correlated heights. When nearby regions in sequence space have similar heights, the landscape is smooth. Conversely, landscapes in which nearby points have very different heights are rugged. In general, the ease with which such landscapes are traversed by a given search algorithm increases with their smoothness. In the extreme case of complete ruggedness, there is no search algorithm more effective than exhaustive sampling of all nearby regions, an unrealistic task for even moderately complex problems [5,6].
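The contrast between smooth and rugged landscapes can be illustrated with a small toy model. The sketch below is only an illustration under stated assumptions: sequences are binary strings of a hypothetical length `L = 10`, the "smooth" landscape is strictly additive, and the "rugged" landscape assigns every genotype an unrelated random height. None of these choices come from the chapter. It counts how many genotypes are local optima, that is, fitter than all of their single-mutant neighbors.

```python
import itertools
import random

L = 10                  # hypothetical sequence length (binary alphabet keeps enumeration small)
rng = random.Random(0)  # fixed seed so the sketch is reproducible

genotypes = list(itertools.product((0, 1), repeat=L))

# Smooth landscape (assumption): each site contributes independently and positively.
site_effect = [rng.uniform(0.1, 1.0) for _ in range(L)]
smooth = {g: sum(e for e, bit in zip(site_effect, g) if bit) for g in genotypes}

# Rugged landscape (assumption): every genotype gets an unrelated random height.
rugged = {g: rng.random() for g in genotypes}


def neighbors(g):
    """Yield all genotypes exactly one point mutation away from g."""
    for i in range(len(g)):
        yield g[:i] + (1 - g[i],) + g[i + 1:]


def count_local_optima(landscape):
    """Count genotypes that are fitter than every single-mutant neighbor."""
    return sum(
        all(landscape[g] > landscape[n] for n in neighbors(g))
        for g in landscape
    )


n_smooth = count_local_optima(smooth)
n_rugged = count_local_optima(rugged)
print("local optima, smooth landscape:", n_smooth)
print("local optima, rugged landscape:", n_rugged)
```

On the additive landscape every uphill walk ends at the same single peak, whereas the uncorrelated landscape has many local optima at which a one-mutation-at-a-time search can stall.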

[Figure 1: panel (a), surface plot; panel (b), contour plot with steps labeled “Diversity generation” and “Search algorithm”]

FIGURE 1 Aspects of enzyme optimization: (a) and (b) represent a sequence–function landscape referred to as Mount Improbable. The horizontal axes in (a) correspond to different genotypes, while the vertical axis or height represents some phenotype or function of interest. Part (b) shows a contour plot of the surface shown in (a). Points 1, 2, and 3 correspond to the same location in sequence–function space for both figures. Part (b) also shows the generalized steps used to explore the landscape: Diversity generation is used to discover viable moves around the local landscape (most moves lead to no gain or loss of function); the search algorithm is used to move rapidly in the direction of increased function by exploring a variety of high gradient directions.


It is important to note that the following discussion is concerned primarily with optimization of the existing, measurable functions of an extant protein. The discovery of novel functions (such as catalyzing reactions not carried out by natural enzymes) is a separate field of inquiry and has its own set of challenges and opportunities [7]. Although the fields share some commonalities, as would be expected given that they both concern the engineering of proteins, they are also distinct fields in much the same way as the study of the origin of life is distinct from the study of the evolution of life.

II. FITNESS FUNCTION

A. Screening Conditions

The first law of directed evolution [8,9] states that “you get what you screen for.” Thus, the setting of appropriate screening conditions in order to obtain desired properties constitutes an indispensable aspect of enzyme optimization. If an enzyme must perform its reaction at a given pH and temperature, for example, screening under such conditions is almost always eventually necessary, at least for a subset of variants that are being considered for scale-up testing. Although it is often useful to screen enzymes under less stringent conditions during the optimization process, the target conditions for operation remain the ultimate goal. Such gradual increases in the stringency of assays can be used to improve enzymes toward a desired endpoint without discarding an unacceptably high number of candidates along the way.

Improving enzyme activity is usually one of the primary goals of the optimization program, and measures of catalytic efficiency are often required to select variants. While miniaturized biocatalytic reactions usually provide a good indication of an enzyme’s catalytic ability, traditional measures of enzyme performance may not be the best metrics for selecting improved variants. To see why, consider the simplest form of the reversible velocity equation:

$$
v = \frac{(V_{\max S}/K_{MS})[S] - (V_{\max P}/K_{MP})[P]}{1 + [S]/K_{MS} + [P]/K_{MP}} \tag{1}
$$

where $[S]$ and $[P]$ are the substrate and product concentrations, $V_{\max S}$ and $V_{\max P}$ the maximum velocities of the forward and reverse reactions, and $K_{MS}$ and $K_{MP}$ the Michaelis constants for the substrate and product, respectively. As discussed by Eisenthal and co-workers [10], the ratio $V_{\max S}/K_{MS}$, often used by enzymologists to characterize natural enzymes under physiological substrate/product concentrations, may not be applicable to industrial biotransformations [11]. To address this and other limitations, the reaction-averaged velocity can be used to account for the widely varying substrate/product concentrations over the course of an industrial transformation [12,13]. Under a typical reaction requiring 99% transformation of substrate to product, and assuming that the reaction is irreversible ($V_{\max P} = 0$), the velocity equation adopts a form familiar to analysts of traditional enzyme kinetics: namely,

$$
v = \frac{V_{\max S}}{1 + K_{MS}^{\mathrm{eff}}/[S_0]} \tag{2}
$$

where $v$ is the average velocity over the course of reaction and $[S_0]$ is the initial substrate concentration. However, the apparent Michaelis constant used in traditional enzyme kinetics, $K_{MS}^{\mathrm{app}} = K_{MS}(1 + [P]/K_{MP})$, is replaced by an effective Michaelis constant, $K_{MS}^{\mathrm{eff}} = 4.61\,K_{MS}(1 + 0.78[S_0]/K_{MP})$, which can be used to assess the expected performance over the course of the entire reaction [13]. Enzymes that begin the reaction at high velocity but encounter difficulty toward the end of the reaction, owing to $K_{MS}^{\mathrm{eff}} \geq [S_0]$, may not be preferable to enzymes with more moderate but consistent velocities throughout. The interplay of various kinetic parameters, such as $K_{MS}$, and product inhibition constants, such as $K_{MP} = K_i$, as well as the starting substrate concentration $[S_0]$, can play an important role in determining the time to completion. Thus, the use of single-condition parameters such as the ratio $V_{\max S}/K_{MS}$ may lead to incorrect selection of variants for scale-up testing and evaluation.
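To make the point about single-condition parameters concrete, the following minimal sketch evaluates Eq. (2) with the effective Michaelis constant quoted above for two made-up variants. The variant names, parameter values, and units are hypothetical and chosen only for illustration; the functions `k_eff` and `avg_velocity` are names introduced here, not taken from the chapter. The prefactor 4.61 is close to ln 100 ≈ 4.605, as one would expect for a reaction run to 99% conversion.

```python
def k_eff(kms, kmp, s0):
    """Effective Michaelis constant for 99% conversion, using the expression quoted in the text."""
    return 4.61 * kms * (1.0 + 0.78 * s0 / kmp)


def avg_velocity(vmax, kms, kmp, s0):
    """Reaction-averaged velocity from Eq. (2), irreversible case."""
    return vmax / (1.0 + k_eff(kms, kmp, s0) / s0)


# Hypothetical variants in arbitrary but consistent units (not data from the chapter).
s0 = 100.0
variants = {
    "A": dict(vmax=10.0, kms=1.0, kmp=0.5),   # high Vmax/KM, strong product inhibition
    "B": dict(vmax=5.0, kms=2.0, kmp=50.0),   # lower Vmax/KM, weak product inhibition
}

for name, p in variants.items():
    ratio = p["vmax"] / p["kms"]
    v_avg = avg_velocity(s0=s0, **p)
    print(f"{name}: Vmax/KM = {ratio:.2f}, average velocity = {v_avg:.2f}")

# Which variant would each metric select?
best_by_ratio = max(variants, key=lambda k: variants[k]["vmax"] / variants[k]["kms"])
best_by_avg = max(variants, key=lambda k: avg_velocity(s0=s0, **variants[k]))
print("selected by Vmax/KM:", best_by_ratio)
print("selected by average velocity:", best_by_avg)
```

With these assumed numbers the two metrics disagree: the variant with the larger $V_{\max S}/K_{MS}$ is penalized by product inhibition late in the reaction and has the lower reaction-averaged velocity.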

Implicit in the foregoing discussion is the notion that the kinetic parameters themselves are fixed values based on the underlying rate constants. Because industrial biotransformations often occur over widely varying substrate/product concentrations, it is likely that the parameters themselves change over the course of the reaction. Thus, parameters obtained from kinetic experiments on individual enzymes may not represent constants in a traditional biochemical sense. Such constants should be interpreted with care: They are empirical, best-fit estimates to a nonlinear least-squares function over a wide time and/or substrate and product range but may not be correct for any single condition of interest.

The substrate ranges encountered during industrial biotransformations may lead to evolutionary pressures that give rise to behaviors that could benefit or harm the ability of an enzyme to carry out the full reaction. For example, it is not widely appreciated that enzymes can catalyze forward reactions much more readily than the reverse reaction, even when the equilibrium constant is near unity [14,15]. Such behavior can occur at high substrate concentrations, $[S_0] \gg K_{MS}$, and when the binding affinity for the substrate is less than that for the product, $K_{MS} \gg K_{MP}$. The free energy available in large substrate concentrations can be used to enhance the destabilization of the enzyme–substrate complex, allowing easier access to the transition state and thereby promoting higher rates of catalysis in the forward reaction. From an enzyme engineering standpoint, this property can be desirable or detrimental, depending on the particular circumstances. Screening at high substrate concentrations may favor this behavior, but if it comes at the cost of difficulties in completing a reaction, the potential gains at the start of the reaction could be offset by increased product inhibition at the end.


B. Screening Throughput

The discussion above should not be construed as an indictment of miniaturized biocatalytic reactions that are used as surrogates for the large scale. Indeed, for most industrial applications, miniaturized, single-condition reactions are the mainstay of enzyme optimization programs and generally serve as adequate surrogates for scaled conditions [16,17]. Although some have advocated the use of very low throughput screens to reduce surrogacy errors [18], such strategies come at the expense of testing fewer variants. As discussed in the following sections, modern machine learning–based search algorithms are well suited to the task of sifting through large combinatorial spaces with a relatively small number of experiments. However, as diversity is the ultimate “fuel” on which the optimization depends, a reasonable medium- to high-throughput screen provides a distinct advantage with which very low throughput methods may find it difficult to compete. Moreover, it is worth noting that the two strategies are not mutually exclusive. Tiered screens that move from high to low throughput can serve to generate viable diversity as well as subject variants to more process-like conditions at later tiers, increasing the overall efficiency of the optimization program.

Because the differences between small- and large-scale reactions can result in less precise selective pressure for the most important industrial properties, methods to either rationally or empirically predict which properties are most sensitive to scale and/or environment would probably reduce complications owing to surrogacy. Ideally, such predictive power could be incorporated into the design or analysis of small-scale screening conditions to increase the probability of selecting variants with improved large-scale properties. Consequently, improving the predictive power of miniaturized biocatalytic reactions via the development of faster, more accurate analytical methods will always be a welcome advance in this regard.

C. Fitness Landscape

The shape of the fitness landscape also has important implications for the efficiency of optimization. For example, in the simplified picture given in Figure 1, steep cliffs on either side of a narrow ridge make ascending Mount Improbable a more risky proposition. Only a small number of directions will be tolerated and only a fraction of those will result in increased function. Conversely, smooth, broad landscapes like those of Mt. Fuji are generally easier to traverse, with reduced risk of incurring steep drops in function and a greater fraction of acceptable directions that lead uphill to a more optimal enzyme. Until recently, the shape of the fitness landscape was considered to be a fixed attribute. However, researchers have discovered that they can control certain features of the landscape by engineering more robust proteins. For example, Bershtein and co-workers showed that proteins can display a kind of stability threshold that, once crossed, results in higher rates of lost function [19]. Epistasis, the interaction between mutations, is usually regarded as a direct, local phenomenon. However, the stability threshold model suggests that epistasis can be a global phenomenon: Once a protein’s stability reserve is exhausted, additional mutations are more likely to reduce or destroy protein function than would be expected if they were introduced into more stable scaffolds [20]. In a compelling demonstration of this phenomenon, Bloom et al. were able to show that stabilized backbones served as better starting points for introduction of mutations that confer new or improved function [21]. Indeed, in at least two well-documented cases, mutations that conferred improved function could be tolerated only within a stabilized backbone. These same mutations resulted in misfolding when incorporated into a less stable protein. Bershtein et al. also showed that more robust proteins were more evolvable than their less stable counterparts [22]. By subjecting proteins to intense neutral drifts, they discovered numerous mutations conferring improved stability. Interestingly, many of the mutations were “back to consensus” mutations, indicating that evolutionary signals in protein alignments can be a valuable resource for purposes of enzyme optimization. These studies indicate that the relationship between stability and evolvability can have an important influence on the way enzyme engineers think about optimization. Clearly, the ability to manipulate the sequence–function landscape itself to make it more amenable to optimization represents a powerful addition to the toolbox of methods available to enzyme engineers. A deeper understanding of the sequence–fitness landscape may allow new methods to promote or identify evolvability, probably accelerating enzyme optimization by allowing engineers to operate in less rugged portions of Mount Improbable.
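The stability threshold idea can be sketched with a simple two-state folding model. This is an illustrative assumption rather than the model used in the cited studies: the fraction of folded protein is taken as $1/(1 + e^{\Delta G/RT})$, each mutation is assumed to destabilize the fold by a fixed, hypothetical 1 kcal/mol, and the two starting stabilities are invented values.

```python
import math

RT = 0.593  # kcal/mol at about 298 K


def fraction_folded(dg_fold):
    """Two-state model: fraction folded for a folding free energy dg_fold (kcal/mol)."""
    return 1.0 / (1.0 + math.exp(dg_fold / RT))


DDG_PER_MUTATION = 1.0  # assumed average destabilization per mutation (kcal/mol)
scaffolds = {"marginal": -3.0, "stabilized": -8.0}  # hypothetical folding free energies

for name, dg0 in scaffolds.items():
    # Fraction folded after 0, 1, ..., 8 accumulated destabilizing mutations.
    profile = [fraction_folded(dg0 + k * DDG_PER_MUTATION) for k in range(9)]
    print(f"{name:>10}: " + " ".join(f"{x:.2f}" for x in profile))
```

Under these assumptions the marginally stable scaffold stays largely folded for the first couple of mutations and then loses function abruptly, while the stabilized scaffold tolerates several more before reaching the same threshold. The apparent effect of any one mutation therefore depends on how much stability reserve remains.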

III. DIVERSITY GENERATION

A. Discovering Important Dimensions

In traditional optimization problems, the set of variables under consideration is usually finite and known a priori. Although efforts to identify the most important determinants of the response function are usually required, there is generally not a need to search through thousands of possible variables before optimization can begin. In contrast, enzyme engineers are usually faced with the formidable problem of identifying which of the thousands of possible variables (i.e., the amino acid mutations) they should seek to optimize over. Although the rational prediction of which amino acid mutations will improve protein function has long been the goal of computational approaches, our ignorance of protein sequence–function relationships based on first principles or semiempirical models remains profound. There have been a small number of successful efforts to redesign proteins based on in silico predictions [23–26], but the consensus within the community is that we are still far from being able to rely on strictly rational approaches to optimizing enzymes with improved properties for industrial applications.

Unfortunately, it is a brute empirical fact that most mutations one can make to an enzyme are neutral or deleterious. Viewing this fact from the perspective of the ridge on Mount Improbable in Figure 1, any random change will usually fail to lead upward and will often descend into lower function. Thus, all diversity generation methods are concerned with the essential problem of how to favor the generation of beneficial mutations [27]. Importantly, this does not necessarily mean that the probability of success on a per site or per mutation basis need be maximized; indeed, this attempt may be counterproductive. Instead, as discussed below, optimizing the number of beneficial mutations discovered for a fixed screening resource is the ultimate key to effective diversity generation.

B. Arational Methods

In the absence of methods to predict rationally which amino acid mutations to make in an enzyme, engineers have historically relied on arational or semirational methods to generate beneficial diversity. Arational methods for diversity generation are usually based on some form of random mutagenesis. Although simple to execute, random mutagenesis protocols suffer from several limitations, chief among them being their indiscriminate nature and their inability to access more than a fraction of the possible amino acid mutations. The genetic code is structured in such a way that only about three to seven of the 19 substitutions available at any one site are accessible through single base-pair changes, and roughly half of those substitutions are chemically similar [28]. Indiscriminate mutagenesis of the entire gene may be helpful when properties such as thermostability and activity are the focus of optimization. However, when properties such as specificity or enantioselectivity are of interest, studies have shown that approaches that preferentially target regions of the protein that tend to modulate such properties (e.g., active sites) may be more fruitful [29] (although not always). In addition, setting and controlling the optimal random mutation rate is still an open problem in enzyme engineering. High mutation rates will usually result in libraries with a large fraction of inactive variants, while lower mutation rates tend to yield libraries in which a large fraction of the sampled variants carry no mutations.
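The limited reach of single base-pair changes can be checked directly against the standard genetic code. The sketch below is a minimal illustration: it builds the standard codon table from its usual compact encoding and, for every sense codon, counts the distinct amino acid substitutions reachable by changing one nucleotide. It counts per codon and ignores mutational bias, so the range it reports may differ slightly from the approximate figure quoted in the text; the helper name `accessible_substitutions` is introduced here.

```python
from itertools import product

# Standard genetic code in compact form (base order T, C, A, G at each position).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def accessible_substitutions(codon):
    """Amino acids reachable from `codon` by changing exactly one nucleotide."""
    own = CODON_TABLE[codon]
    reachable = set()
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            aa = CODON_TABLE[codon[:pos] + base + codon[pos + 1:]]
            if aa not in ("*", own):  # skip stop codons and silent changes
                reachable.add(aa)
    return reachable


counts = {
    codon: len(accessible_substitutions(codon))
    for codon, aa in CODON_TABLE.items()
    if aa != "*"
}
print("fewest accessible substitutions per codon:", min(counts.values()))
print("most accessible substitutions per codon:", max(counts.values()))
print("ATG (Met) ->", sorted(accessible_substitutions("ATG")))
```

Whatever the codon, well under half of the 19 possible substitutions are reachable in a single step, which is the limitation that motivates the saturation and restricted-codon approaches discussed next.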

Given the limitations of random mutagenesis, researchers have sought more efficient ways of accessing new diversity. Completely arational methods that subject the entire gene to saturation mutagenesis at every site in the corresponding protein offer one way to explore the space of all single amino acid mutations, but they often require extensive molecular biology efforts to manufacture and manage the screening of hundreds of small libraries [30–34] or thousands of sequence-verified clones [35]. New developments for randomly accessing a wider array of mutations include random insertion and deletion mutagenesis (RID) [36] and trinucleotide exchange (TriNEx) protocols [37]. Although these alternative protocols may be less expensive to implement, they are not without difficulties or biases [38], so the search for ever more efficient mechanisms of random access to all diversity remains an open challenge. In response, protein engineers have begun to investigate the use of restricted codon sets to reduce the number of amino acid substitutions to fewer than the 20 possible residues [39,40]. Such approaches may help reduce the set of options to those more likely to confer improved function.
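As an illustration of what a restricted codon set buys, the sketch below expands three degenerate codons and reports how many codons, amino acids, and stop codons each encodes. NNK and NDT are commonly used degenerate codons chosen here as examples; they are an assumption of this sketch and not necessarily the sets used in the cited work. The `describe` helper and the partial `IUPAC` map are likewise introduced only for this example.

```python
from itertools import product

# Standard genetic code in compact form (base order T, C, A, G at each position).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Subset of the IUPAC nucleotide ambiguity codes needed for this example.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "N": "ACGT", "K": "GT", "D": "AGT"}


def describe(degenerate_codon):
    """Summarize what a three-letter degenerate codon encodes."""
    codons = ["".join(c) for c in product(*(IUPAC[s] for s in degenerate_codon))]
    encoded = [CODON_TABLE[c] for c in codons]
    return {
        "codons": len(codons),
        "amino_acids": len(set(encoded) - {"*"}),
        "stops": encoded.count("*"),
    }


for scheme in ("NNN", "NNK", "NDT"):
    print(scheme, describe(scheme))
```

A smaller codon set means fewer clones must be screened per site for the same coverage, at the cost of excluding some residues from consideration.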


C. Semirational Methods

Given the generally low success rate (per mutation) of identifying beneficial mutations through purely arational approaches, enzyme engineers have increasingly sought to incorporate evolutionary principles and information into the task of diversity generation. A now common procedure that collects diversity from related proteins has met with good success. Indeed, the original success of DNA family shuffling is attributed, at least in part, to the use of previously acceptable diversity [41–44]. While specific mutations from related proteins may not function well in a new context, they are far more likely to be tolerated than random amino acid mutations, given that nature has vetted them to some degree for structural and functional acceptance. This point is important to distinguish from the power of DNA shuffling as a search algorithm (discussed in detail below).

Along the same lines, more recent attention has been paid to incorporating specific mutations from related proteins based on their predicted ability to confer desired properties. For example, consensus mutagenesis [45–49], where an alignment of related proteins is used to identify mutations that exist predominantly in other proteins but are absent in the protein of interest, is now recognized as an effective method to improve the stability of enzymes. Individual mutations collected from related proteins can also be a useful source of diversity for conferring improved activity, as demonstrated powerfully by Castle and co-workers, who used numerous mutations identified in homologs to improve the catalytic efficiency of an enzyme about 10,000-fold [50,51]. Numerous other semirational methods based on phylogenetic or computational studies can be used to restrict the search for beneficial diversity into smaller, more manageable libraries. Chaparro-Riggers et al. nicely summarize the types of data-driven methods that have achieved various levels of success over the last 10 years [49].
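The consensus idea described above can be sketched in a few lines. The example assumes a gap-free toy alignment of equal-length sequences; the sequences, the 60% conservation cutoff, and the function name `consensus_suggestions` are all hypothetical choices made for illustration. Real alignments would need gap handling and sequence weighting.

```python
from collections import Counter

# Hypothetical target protein and aligned homologs (toy, gap-free, equal length).
target = "MKTLVA"
homologs = ["MKSLIA", "MRSLIA", "MKSLIG", "MKSFIA"]


def consensus_suggestions(target, homologs, min_fraction=0.6):
    """List substitutions where the target differs from a well-conserved consensus residue."""
    suggestions = []
    for pos, column in enumerate(zip(*homologs)):
        residue, count = Counter(column).most_common(1)[0]
        if residue != target[pos] and count / len(column) >= min_fraction:
            # Conventional notation: original residue, 1-based position, new residue.
            suggestions.append(f"{target[pos]}{pos + 1}{residue}")
    return suggestions


print(consensus_suggestions(target, homologs))
```

Each suggested substitution is a candidate "back to consensus" mutation that could be included in a library or tested individually.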

D. Rational Methods

As mentioned previously, purely rational approaches to enzyme optimization have met with little industrial success, owing largely to our poor understanding of sequence–function relationships. Nevertheless, semirational methods based on computational analysis are often a good source of hypotheses that can be tested experimentally as part of a diversity generation campaign. When crystal structures or homology models are available, analysis of the active site can usually point to one or more sites to target for site-directed or site-saturation mutagenesis. This method is particularly useful when modulating properties such as selectivity or specificity [29,52,53]. Along similar lines, there have been great strides recently in the ability of physics-based energy functions to predict mutations that are likely to confer increased stability. Two popular algorithms, FoldX [54,55] and RosettaDesign [56,57], have been used to perform in silico mutagenesis studies [58,59] and increasingly as engineering tools to stabilize proteins [23,60]. Recently, knowledge-based statistical potentials that make use of residue patterns observed in extant proteins have shown promise as an alternative to, or in conjunction with, physics-based approaches [61]. Although widespread examples of utilizing these algorithms to design a small number of stabilized, active variants are still lacking, they can serve as excellent starting points for diversity generation or as in silico filters that can be used in conjunction with other considerations, such as those based on evolutionary principles. The key idea is to utilize as much as possible of the information contained in evolutionary or computational analyses to construct libraries or to generate site-directed mutants as required to keep the pool of diversity well stocked. Until we can accurately predict the effects of mutations on the properties of interest, the fastest, most accurate computer in the universe for these types of calculations will continue to be nature itself [62].

E. Optimal Strategies

A key consideration that arises in any diversity-generation technique is the extent to which mutations are additive. Data from double-mutant experiments suggest that enzyme properties such as stability and activity tend to be additive [63], particularly when the side chains are separated by more than 4 Å. In the limit of strict free-energy additivity, beneficial mutations discovered in one context will confer the same effects in other contexts. Considering the case of double-mutant enzymes, there are three possible deviations from additivity [64]: (1) the resulting free-energy change is less than that predicted by the sum of the individual free-energy changes but greater than that of either mutation alone (partially additive), (2) the resulting free-energy change is greater than that predicted by the sum of the individual free-energy changes (synergistic), or (3) the resulting free-energy change is less than that of either mutation alone (antagonistic). Only the last class, antagonistic mutations, represents a problem for diversity-generation strategies that seek to identify individually beneficial mutations. The second class, synergistic mutations, could represent a problem for those strategies that do not intend to discover individually neutral or deleterious mutations that may be beneficial in combination with other mutations. However, it is not clear to what extent discovery of such synergistic combinations is necessary in order to achieve improved enzyme function. Often, the screening resources involved in searching double- or triple-mutant libraries of naive diversity are significantly greater than those necessary to search for partially additive mutations at many other sites. For example, while a double-mutant randomization (where each site contains all 20 amino acid substitutions) may be able to discover an optimal combination of mutations that could not otherwise be discovered by individual, stepwise acquisition, the same screening effort could be utilized to search 20 separate sites within the protein at the same level of coverage [65]. Similarly, the screening efforts required to sample a triple randomization could be used to scan nearly every site along a 400-residue protein. Although it is likely that many of the possible optimization paths contain beneficial, interacting mutations that could not be discovered through single amino acid substitutions, it is possible that strategies predicated on the discovery of individual mutations that are at least quasiadditive may be equally or more efficient [66–68].
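The arithmetic behind this comparison is short enough to spell out. The sketch below follows the text in counting 20 amino acids per randomized site; the function names are introduced here for illustration only.

```python
AMINO_ACIDS = 20  # residues considered per randomized site, as in the text


def combinatorial_library_size(n_sites):
    """Number of variants when n_sites are randomized simultaneously."""
    return AMINO_ACIDS ** n_sites


def equivalent_single_site_libraries(n_variants):
    """Number of separate single-site saturation libraries with the same total size."""
    return n_variants // AMINO_ACIDS


for k in (2, 3):
    size = combinatorial_library_size(k)
    print(f"{k} sites randomized together: {size} variants, "
          f"equivalent to {equivalent_single_site_libraries(size)} single-site libraries")
```

A double randomization spans 400 variants, the same as 20 single-site libraries, and a triple randomization spans 8000 variants, the same as single-site libraries at 400 positions.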

The decision to pursue the best mutations at a small number of sites versus a broader yet less exhaustive sweep over larger regions of the protein is of critical concern. Although properties such as specificity and selectivity are often (but not always) modulated by mutations at a smaller number of identifiable positions (e.g., an active site), mutations that improve properties such as activity ($V_{\max}$) and stability (i.e., thermal, pH, etc.) can be discovered across large regions of the protein, and casting a wider net is often preferable [16,29,35,50]. To approach this question more quantitatively, consider the following simple model. The number of unique variants, $f$, expected from sampling $t$ variants from a pool of size $n$ is given by

$$
f = n\left[1 - \exp\left(-\frac{t}{n}\right)\right] \tag{3}
$$

Equation (3) is shown graphically in Figure 2 for different pool sizes and screening efforts. Typically, positions identified for site saturation mutagenesis are screened deeply to obtain 95% or greater coverage (i.e., there is a 95% probability of observing a given variant within the pool). The strategy is predicated on being able to predict the locations that are most likely to yield beneficial mutations. However, one consideration, often overlooked in such a strategy, is that many of the same variants will be sampled multiple times to achieve 95% coverage, providing no additional information or chance to discover beneficial mutations at other, untargeted sites. For example, a pool of five sites with 20 amino acid mutations per site would contain $n = 100$ unique variants. Assuming that each variant is represented equally in the population, the pool could be screened to obtain 95% coverage using 300 assays. The effort would be expected to yield about 95 unique variants. Alternatively, one could target 20 sites having a total of $n = 400$ unique variants. Such a library could be screened with the same effort expended on the five-site design, but a coverage of about 53% would yield about 211 unique variants.
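The worked numbers in this paragraph follow directly from Eq. (3). A minimal sketch, assuming equal representation of every variant as the text does; the function names are introduced here for illustration:

```python
import math


def expected_unique(n, t):
    """Eq. (3): expected number of unique variants after t assays on a pool of size n."""
    return n * (1.0 - math.exp(-t / n))


def assays_for_coverage(n, coverage):
    """Assays needed so that any given variant is observed with probability `coverage`."""
    return -n * math.log(1.0 - coverage)


assays = 300
for sites in (5, 20):
    n = 20 * sites  # 20 amino acids per targeted site
    f = expected_unique(n, assays)
    print(f"{sites} sites: n = {n}, coverage = {f / n:.0%}, unique variants = {f:.0f}")

print(f"assays for 95% coverage of 100 variants: {assays_for_coverage(100, 0.95):.0f}")
```

Inverting Eq. (3) for a target coverage shows why roughly three assays per variant are needed for 95% coverage, since $-\ln(0.05) \approx 3$.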

If only a small number of sites are believed to be important for modulating a particular property, the smaller targeted approaches will probably yield greater improvements, despite the fact that much of the screening effort is consumed by replicate measurements. However, one disadvantage worth considering even for designs based on such good hypotheses is that any mutation discovered at a particular site cannot be recombined with any other mutation at that site. Thus, discovering a small number of the best mutations at just one or two sites by deep sampling can come at the expense of discovering perhaps less influential but more numerous mutations at other sites that can then be recombined in turn to achieve a greater net gain in function.

To use a betting analogy, library designs often consist of identifying a reasonable strategy that is predicted to yield good results on a per site basis and then going “all in” on those sites. However, the alternative analogy of investment diversification may be preferable, as certain complementary designs may yield additional, non-mutually exclusive mutations. Under these circumstances the respective designs can be used to optimally partition available screening resources. For example, libraries believed to be more inherently promising could be screened with greater efforts, while the remaining assay resources could be utilized on more speculative libraries. Indeed, a key principle behind diversity generation is the notion that screening resources and library designs should be well matched. An unused or underutilized screening resource is always opportunity lost. Even the most facile library designs, such as random mutagenesis, are preferable to no design at all, and parallel diversity generation techniques designed to feed recombination-based search algorithms continuously are more efficient than serial rounds of exploration, where diversity generation campaigns are often staggered inefficiently.

[Figure 2 plot: unique variants (f) on the vertical axis; screening size (t) and pool size (n) on the horizontal axes.]

FIGURE 2 Unique variants for different pool sizes. The plot shows the number of unique variants expected (f) on the vertical axis for different library pool sizes (n) and screening efforts (t) on the horizontal axes. For large screening resources relative to the pool size, for example, t > 3n, the expected number of unique variants per unit of screening effort is low (f/t < 0.3), resulting in many replicate observations of the same variant. For low screening resources relative to the pool size (i.e., t ≤ n), the expected number of unique variants per unit of screening effort is high (0.6 < f/t < 1), but the probability of observing all unique variants within the pool is low.
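The trade-off between pool size and screening effort summarized in Figure 2 can be explored numerically. The sketch below is illustrative only: it assumes that clones are drawn uniformly and with replacement from a pool of n equally likely variants, which gives the standard occupancy result f = n[1 − (1 − 1/n)^t]. That formula and the function name `expected_unique` are assumptions made for this example and are not taken from this section.

```python
def expected_unique(n: int, t: int) -> float:
    """Expected number of distinct variants observed when t clones are drawn
    uniformly, with replacement, from a pool of n equally likely variants.
    (Assumed sampling model; see the lead-in.)"""
    return n * (1.0 - (1.0 - 1.0 / n) ** t)


# Compare shallow and deep screening of the same hypothetical pool.
pool_size = 200
for screened in (100, 200, 400, 800):
    f = expected_unique(pool_size, screened)
    print(
        f"n={pool_size} t={screened} "
        f"unique={f:.1f} unique/screened={f / screened:.2f} "
        f"pool coverage={f / pool_size:.2f}"
    )
```

Under this assumed model, screening at or below the pool size keeps the fraction of unique observations high but leaves much of the pool unseen, while screening several times the pool size covers the pool at the cost of many replicates.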

IV. SEARCH ALGORITHM

A. Navigating Sequence–Function Space

Whereas the fitness function and diversity generation aspects serve to define the problem space for enzyme optimization, a third, equally important aspect is concerned with methods for searching optimally over the important dimensions. For years, enzyme engineering was accomplished through serial rounds of random mutagenesis [8,69]. Such an approach effectively combines diversity generation with a facile search algorithm that explores a single dimension at a time of the local sequence–function space. The advantage of the approach is that it does not rely on any structural or rational analysis and is extremely easy to implement, particularly from a molecular biology perspective. Unfortunately, the approach discards many beneficial mutations at each round of evolution [70]. Such beneficial mutations can only be rediscovered through additional mutagenesis on a new, usually improved protein.

As mentioned earlier, there is no guarantee that mutations identified individually will contribute additively to improved function; however, as a working strategy it is often the case that individual mutations will, when recombined, confer at least partial additivity toward improved function [63,68]. Thus, in general, algorithms that can exploit the tendency of independent variables to contribute substantial main effects when recombined will tend to explore a given space much more efficiently than serial exploration along a single dimension at a time. With the invention of DNA shuffling [71,72], Stemmer was the first to reduce this powerful concept to practice within the field of directed evolution, and the method is widely regarded to have revolutionized the field. Since then, the technique and its offshoots have been used in many protein engineering efforts to rapidly improve protein function [41,50,73–76]. Unlike serial rounds of random mutagenesis, recombination-based evolution can make full use of parallel diversity generation efforts and is much less likely to discard beneficial mutations.

The process of recombination-based directed evolution has been equated to a genetic algorithm (GA) [77] conducted in vivo or in vitro [78–80]. Interestingly, genetic algorithms, widely used within the field of computer science to attack complex combinatorial optimization problems, were originally inspired by their biological counterpart, natural selection. The canonical genetic algorithm consists of operators such as mutation, recombination, and selection, rendered as functions within a computer implementation of the problem. Generations of evolution are conducted in silico, and the resulting offspring solutions that are more “fit” are mated with each other to generate new populations continuously. An example of the process is shown in Figure 3. Typically, genetic algorithms are applied to difficult but well-defined optimization problems, where the relevant variables are available at the outset. Important variable changes usually get sampled through the process of mutation, although it can often take dozens or hundreds of generations in silico to gradually accumulate the optimal combinations. The mutation rate is usually kept low to avoid disrupting candidate solutions. Although the fitness function is often expensive to calculate in terms of CPU time, it is generally not a problem that these in silico algorithms take a gradual approach to variable identification and recombination. Unfortunately, optimization of real molecular entities such as enzymes often requires significant resources: it usually requires days to weeks to obtain values for the fitness function, and substantial human and financial resources are often required. Thus, in the lab it is usually not feasible to conduct dozens or hundreds of rounds of evolution, reinforcing the need for rapid identification of beneficial mutations described previously. Nevertheless, parallels between biological and computer-based evolution are still relevant, and a number of observations are worth recognizing toward the goal of obtaining maximally efficient search algorithms.

[Figure 3 plot: probability density (vertical axis) versus fitness in standard deviations (horizontal axis), with one curve per round (Round 1 to Round 4) and the population maximum marked for each round. Inset table:]

Round   Maximum   Effective screening size
1       3.02      7.2 × 10^2
2       5.03      3.9 × 10^6
3       5.91      5.7 × 10^8
4       7.02      8.9 × 10^11

FIGURE 3 Simulated molecular evolution. The y-axis shows the probability density fit to a normal distribution of different populations over the course of evolution. The x-axis shows the number of standard deviations above the initial population. The simulation consists of evolving a set of t = 1000 variants, generated randomly in silico on an NK fitness landscape [4,6] with N = 80 and K = 1. At each round of evolution the top 10 solutions are bred together with a uniform crossover operator, and a new population of t = 1000 progeny variants is generated. The solid circles show the maximum fitness obtained at each round of evolution. The inset table gives the maximum fitness obtained at a given round along with the effective screening size according to Eq. (9).
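A simulation of the kind described for Figure 3 can be sketched in a few lines. The version below is a minimal illustration, not the authors' code: the caption specifies N = 80, K = 1, t = 1000, the top 10 parents, and uniform crossover, but the choice of neighbor (the adjacent site, wrapping around), the uniform random contribution tables, the random pairing of parents, and all function names are assumptions made here. No new mutations are introduced, so the only diversity is that present in the starting population.

```python
import random
import statistics


def make_nk_tables(n_sites, k, rng):
    """One random contribution table per site, indexed by the states of the
    site and its k neighbours (assumed: the next k sites, wrapping around)."""
    return [[rng.random() for _ in range(2 ** (k + 1))] for _ in range(n_sites)]


def nk_fitness(genome, tables, k):
    """Mean of the per-site contributions for a binary genome."""
    n_sites = len(genome)
    total = 0.0
    for i in range(n_sites):
        index = 0
        for j in range(k + 1):
            index = (index << 1) | genome[(i + j) % n_sites]
        total += tables[i][index]
    return total / n_sites


def uniform_crossover(parent_a, parent_b, rng):
    """Each site is inherited from either parent with equal probability."""
    return [a if rng.random() < 0.5 else b for a, b in zip(parent_a, parent_b)]


def simulate(n_sites=80, k=1, t=1000, n_parents=10, rounds=4, seed=1):
    """Return the best fitness of each round, in standard deviations above
    the mean of the starting population."""
    rng = random.Random(seed)
    tables = make_nk_tables(n_sites, k, rng)
    population = [[rng.randint(0, 1) for _ in range(n_sites)] for _ in range(t)]
    mean0 = sd0 = None
    best_z = []
    for _ in range(rounds):
        scores = [nk_fitness(g, tables, k) for g in population]
        if mean0 is None:  # reference scale comes from the first population
            mean0, sd0 = statistics.mean(scores), statistics.pstdev(scores)
        ranked = sorted(zip(scores, population), key=lambda pair: pair[0], reverse=True)
        best_z.append((ranked[0][0] - mean0) / sd0)
        parents = [genome for _, genome in ranked[:n_parents]]
        # Breed t progeny from randomly chosen pairs of the top parents.
        population = [uniform_crossover(*rng.sample(parents, 2), rng) for _ in range(t)]
    return best_z


if __name__ == "__main__":
    for round_number, z in enumerate(simulate(), start=1):
        print(f"round {round_number}: best = {z:.2f} SD above the starting mean")
```

Because the landscape and the starting population are random, the exact values depend on the seed and should not be expected to reproduce the inset table of Figure 3.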

B. The Power of Recombination

One of the important features of recombination-based search algorithms is the relative unimportance of screening size. When numerous individual mutations are shuffled together in combinatorial libraries, the resulting fitness distribution can, to a first approximation, be modeled as a normal distribution [81,82]. The properties of normal distributions dictate that large deviations from the mean are increasingly difficult to obtain by deep sampling of the tail. For example, based on extreme value theory [see Eq. (6) in the Appendix], a screen of 500 variants would yield an expected fitness gain of 3.05 standard deviations above the mean. Screening 10,000 variants instead would raise the expected gain only modestly, to 3.86 standard deviations above the mean. Thus only about a 26% improvement in fitness would be obtained, despite the fact that 20 times the screening resources were expended. Yet recombination-based algorithms have been shown repeatedly to perform exceedingly well on many fitness landscapes. The key is to realize that more rounds of evolution are far more effective than deep screening within each round. A facile model of the process can be constructed to develop an expression for the effective screening size, t_eff, given a number of variants screened per round, t, and the number of rounds, r (see the Appendix):

$$t_{\mathrm{eff}} \approx 2\pi \left(\frac{t}{2\pi}\right)^{[(r+1)/2]^{2}} \tag{4}$$

The effective screening size, t_eff, is defined as the amount of screening that one would have to devote to sampling the original population distribution to obtain a given fitness expected from recombination-based evolution over r rounds. For example, screening t = 1000 variants per round over r = 4 rounds and recombining the best variants in each round yields t_eff ≈ 3.6 × 10^14. In other words, a space of about 10^14 can be effectively searched by examining only 4000 solutions. The strong dependency of the exponent of Eq. (4) on the number of rounds, r, is what confers on recombination-based evolution the ability to generate extremely high levels of improbability [83] in a very rapid manner.
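The worked example for Eq. (4) is easy to check numerically. The short sketch below simply evaluates the approximation for t = 1000 over one to four rounds; the function name is hypothetical.

```python
import math


def effective_screening_size(t: float, r: int) -> float:
    """Approximate t_eff from Eq. (4): 2*pi * (t / (2*pi)) ** (((r + 1) / 2) ** 2)."""
    exponent = ((r + 1) / 2) ** 2
    return 2 * math.pi * (t / (2 * math.pi)) ** exponent


# Actual number of variants screened versus the effective screening size.
for rounds in range(1, 5):
    print(
        f"rounds={rounds} screened={rounds * 1000} "
        f"t_eff={effective_screening_size(1000, rounds):.1e}"
    )
```

For a single round the exponent equals 1 and the expression reduces to t_eff = t, as it should; the rapid growth comes entirely from the quadratic dependence of the exponent on r.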

Another important feature of the algorithm is revealed in Figure 3 in the form of a decreasing phenotypic variance over the course of evolution. Because the simulation has a fixed number of variables to start with, as beneficial mutations become fixed in the population, the genetic and therefore phenotypic variance decreases. The only way to maintain phenotypic variance is to supply new mutations continually. This is the “fuel” by which all forms of evolution ultimately operate, and maintaining a plentiful stock of useful mutations is critical to achieving rapid, continual increases in enzyme function.

Although recombination-based molecular evolution is an extremely powerful search algorithm, as mentioned earlier, it is somewhat mismatched with the canonical form of the genetic algorithm, in that for laboratory-based evolution the fitness function is usually expensive and time consuming to evaluate. Thus, it is natural to ask if other search algorithms are available that are less blind and make more efficient use of the information present in the sequence–function measurements. By blind, we do not mean to imply that the goal of the process is blind; indeed, it is highly directed toward whatever goal(s) the enzyme engineer wishes to achieve. Instead, blind here refers to the fact that the algorithm does not explicitly represent or make direct use of the known sequence–function relationships that have usually been obtained at significant cost. Beneficial mutations percolate toward fixation in the population, but the process is often relatively gradual compared to the desired rate of achieving functional improvement.

C. Machine Learning–Guided Strategies

To address the limitations of directed-evolution efforts based on blind GA-type principles while still leveraging useful concepts from the field of optimization theory, methods based on ideas from design of experiments (DOE) and machine learning have recently garnered attention. Interestingly, the latter developments within the field of enzyme engineering are actually drawn from ideas in the statistics community that predate the development of GAs. The latest techniques for enzyme optimization have been enabled not so much by developments in established fields of computer science or statistics, but by innovation in the fields of high-throughput sequencing, molecular biology, biochemistry, and a deep appreciation for the need to maximize the amount of information extracted from each relatively valuable experiment.

In a DOE one is usually interested in maximizing the information content contained within a small number of experiments. This information can then be used to build statistical models that correlate inputs with one or more desired outputs. One central concern of a DOE is with the proper design of the input matrix that defines the variable settings for each experiment. A great deal of theory devoted to this goal has been developed over the last 70 years since R. A. Fisher first proposed methods for optimal construction of test matrices [84]. Whereas the proper design of test conditions so as to maximize information content is extremely valuable and has solidified much of what it means to conduct proper, controlled scientific and engineering studies, it is the application of the results to response optimization that is most relevant to the task of enzyme optimization. To that end, once a statistical model has been constructed, it can be interrogated to determine the optimal variable settings so as to maximize the predicted response [85]. This approach to response optimization has found widespread use within disparate fields, such as finance, agriculture, and engineering.

The type of data generated during an enzyme evolution program is ideally suited to the same type of analyses that have proved useful in DOE response optimization. Namely, once a combinatorial library has been screened, a subset of variants drawn from a range of activities and having different sequences can be used to construct models that correlate input (mutations) with a desired output [function(s) or response(s) of interest]. The resulting model can then be viewed as a map of the local sequence–function space that can be used to explore high gradient directions in the next round of evolution by suggesting which variables (mutations) to keep, which to retest, and which to discard [16,66,78]. Although classical linear regression can be used to construct such models, more advanced machine learning methods capable of capturing nonlinear features can be used provided that sufficient data exist to train the models reliably. Even without explicit representation of interaction terms, the linear models can still function well over even semirugged portions of the sequence–function landscape [66,86].

The linear approach to modeling sequence–function relationships consists of formulating an equation of the form

$$y = y_0 + c_1 x_1 + c_2 x_2 + \cdots + c_i x_i + \cdots + c_p x_p + \varepsilon \tag{5}$$

where y is the predicted response, y_0 an intercept term, c_i the regression coefficient for mutation i, x_i a dummy variable indicating the presence (x_i = 1) or absence (x_i = 0) of mutation i, p the number of mutations in the training set, and ε a random error term. The machine learning phase consists of adjusting the values for the regression coefficients so as to minimize the difference between measured and predicted responses. In practice, this phase is achieved through any number of techniques, the simplest of which is classical multiple linear regression. When the number of independent variables (mutations) exceeds the number of measurements, classical linear regression no longer yields a unique solution, and other techniques must be used to deal with such underdetermined systems. This is often the case when combinatorial libraries contain random mutations in addition to programmed mutations. The random mutations must either be ignored, which will contribute to model errors as well as miss potentially beneficial mutations, or techniques that reduce the dimensionality of the problem or control for the infinite number of possible solutions must be utilized. Nonlinear terms that explicitly model interaction effects can be added to Eq. (5) if sufficient data exist in the training set to estimate their regression coefficients robustly [86]. Such interaction terms can be chosen based on structural or other information in order to carefully control against the wanton addition of terms that spuriously correlate with the measured response but have no counterpart in reality.
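To make the fitting step concrete, the following sketch fits the main-effects model of Eq. (5) by ordinary least squares using only the Python standard library. It is a toy illustration under stated assumptions: the data are synthetic and noise-free, the four mutations and their "true" coefficients are invented for the example, and the normal equations are solved by plain Gaussian elimination. It covers only the well-determined case (more measurements than mutations); the underdetermined case discussed above would need a different technique.

```python
from itertools import product


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_main_effects(variants, responses):
    """Least-squares fit of Eq. (5). Each variant is a tuple of 0/1 dummy
    variables; returns (y0, [c_1, ..., c_p])."""
    rows = [[1.0] + [float(v) for v in variant] for variant in variants]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, responses)) for i in range(p)]
    beta = solve_linear_system(xtx, xty)
    return beta[0], beta[1:]


# Hypothetical toy data: 4 mutations, all 16 combinations, no noise.
true_y0, true_c = 1.0, [0.5, -0.3, 0.0, 0.8]
variants = list(product([0, 1], repeat=4))
responses = [true_y0 + sum(c * x for c, x in zip(true_c, v)) for v in variants]

y0, coeffs = fit_main_effects(variants, responses)
print("intercept:", round(y0, 6))
print("main effects:", [round(c, 6) for c in coeffs])
```

With real screening data the responses are noisy and the variants are a sparse sample of the combinatorial space, so a numerically robust least-squares routine or a regularized method would normally be preferred over hand-rolled normal equations.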

After the statistical model has been formulated and trained, the next phase of the search algorithm consists of evaluating the effects of different mutations in silico. For simple linear models it is sufficient to judge mutational effects by examining the value of their corresponding regression coefficients. Large positive regression coefficients indicate that the mutation has a beneficial main effect, large negative coefficients correspond to a deleterious main effect, and coefficients near zero correspond to neutral main effects. The main effect (the linear regression coefficient) for each mutation represents the average contribution to function over many contexts. Thus, nonlinear effects involving one or more other mutations in the training set may appear only indirectly, absorbed into this average.

Mutations with large, positive regression coefficients are good candidates for incorporation into subsequent combinatorial libraries if not already present in the DNA template(s) used to construct the library. Mutations with large, negative coefficients are usually discarded or targeted for reversion if they are present in the new template. While neutral regression coefficients may indicate that the mutation has little effect on protein function, it may also be possible that the mutation is interacting with other mutations and could provide further increases in function if taken forward into a subsequent library. Some judgment regarding the probability of whether a mutation may be beneficial in the proper context is usually required. Considerations such as whether the mutation is in proximity to another mutation and the reliability of the model often need to be taken into account. For example, linear models that exhibit a poor correlation between the measured and predicted responses may indicate the presence of nonlinear effects, high experimental noise, or both.
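The decision rule described above can be written down as a simple triage over fitted coefficients. This is only a sketch: the cutoff value, the mutation labels, and the three category names are hypothetical, and in practice the "retest" bin is where structural proximity and model reliability would inform the judgment call.

```python
def triage_mutations(coefficients, threshold):
    """Bin mutations by regression coefficient.

    coefficients: mapping of mutation label -> main-effect coefficient
    threshold:    hypothetical cutoff separating 'large' from 'near zero'
    """
    decisions = {}
    for mutation, coefficient in coefficients.items():
        if coefficient > threshold:
            decisions[mutation] = "keep"      # beneficial main effect
        elif coefficient < -threshold:
            decisions[mutation] = "discard"   # deleterious main effect
        else:
            decisions[mutation] = "retest"    # near neutral; may be context dependent
    return decisions


# Hypothetical coefficients from a fitted linear model.
example = {"mutation_1": 0.42, "mutation_2": -0.35, "mutation_3": 0.03}
print(triage_mutations(example, threshold=0.1))
```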

Once decisions have been made about which mutations to retain, which to incorporate, and which to discard, subsequent libraries can be constructed through any number of combinatorial library generation techniques, including fully [87] or semisynthetic DNA shuffling [16,88,89]. An advantage of the machine learning–guided approach to evolution is that combinatorial libraries devoted to searching high gradient directions of the sequence–function landscape can be constructed and screened in parallel with ongoing diversity generation efforts. In the limit of at least partial additivity, mutations discovered in one context will contribute to improved function when incorporated into new libraries. In addition, multiobjective optimization is readily amenable to this type of search algorithm, as statistical models for each objective can be constructed individually, and mutations that are beneficial for at least one objective and not significantly deleterious for the others can be recombined into new libraries.
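The multiobjective selection rule at the end of this paragraph can be expressed directly in code. In the sketch below each objective has its own fitted model, represented as a dictionary of coefficients; the two cutoffs (`benefit` and `tolerance`), the objective names, and the mutation labels are all hypothetical, and a mutation missing from a model is assumed to have a coefficient of zero for that objective.

```python
def select_for_recombination(models, benefit=0.2, tolerance=-0.1):
    """Pick mutations that help at least one objective and hurt none badly.

    models:    mapping of objective name -> {mutation label: coefficient}
    benefit:   minimum coefficient counted as beneficial (hypothetical cutoff)
    tolerance: most negative coefficient still accepted (hypothetical cutoff)
    """
    mutations = set().union(*(model.keys() for model in models.values()))
    chosen = []
    for mutation in sorted(mutations):
        # Assumption: a mutation absent from a model is treated as neutral.
        effects = [model.get(mutation, 0.0) for model in models.values()]
        if max(effects) >= benefit and min(effects) >= tolerance:
            chosen.append(mutation)
    return chosen


# Hypothetical coefficients for two objectives.
models = {
    "activity": {"M1": 0.50, "M2": 0.30, "M3": -0.40, "M4": 0.05, "M5": -0.05},
    "stability": {"M1": 0.00, "M2": -0.50, "M3": 0.60, "M4": 0.02, "M5": 0.40},
}
print(select_for_recombination(models))
```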

The machine learning–guided approaches to searching sequence space are equally amenable to the type of modest sampling requirements of GAs. In practice, only a few dozen to a few hundred variants across a range of activities are required to build predictive models [16,78]. In either case, a better use of screening resources is to search for new diversity that can be used to fill the pipeline with beneficial mutations. When screening resources are particularly constrained, for example in the rare case that a reliable surrogate assay or a primary screen is unavailable, a small number of specific variants (e.g., ten to several hundred) could be synthesized according to standard DOE techniques such as D-optimal design [85], thereby maximizing information content while allowing for the enforcement of design constraints such as a specified number of mutations per variant. This is a useful approach when the additional cost and time associated with de novo gene synthesis, although significant, are small compared to assay time and resources. Future advancements in the speed and cost of gene synthesis would be a welcome development in this regard. Ideally, when synthesis becomes as fast and inexpensive as combinatorial library generation, it will probably become the preferred method of generating variant sequences.

The key advantage of using statistical models to identify beneficial mutations is to accelerate the rate at which variants accumulate those mutations, resulting in a more rapid increase in desired enzyme function compared to traditional methods [16]. Although it may be tempting to aggressively recombine all mutations that may provide some benefit, experience has shown that such greedy extrapolation along the single highest gradient often leads to reduced function. Even when the local fitness landscape of enzymes tends to be additive, as the mutational load increases, the context in which the mutations were identified changes to such a degree that the predictions may begin to break down. This response corresponds conceptually to the ridge shown in Figure 1, where too large a move in sequence space can lead to falling down either side of the ridge. The advantage of the machine learning–guided algorithm described here is that it stochastically explores numerous high gradient directions, increasing the odds of reaching higher fitness on Mount Improbable.

V. CONCLUSIONS

We have described three major aspects of enzyme optimization that form a basis with which to view the problem: the fitness function, diversity generation, and the search algorithm. All three aspects are indispensable in the sense that careful attention should be paid to each to achieve efficient enzyme optimization. In the worst case, if no attention is given to one or more of these crucial aspects, enzyme optimization is likely to be severely limited or impossible. It is worth noting that while we have presented these aspects as more or less orthogonal, they should be considered jointly in order to foster efficient optimization. For example, certain search algorithms may contain within them varying degrees of diversity generation. The original format for DNA shuffling was a search algorithm used to sift through extant diversity as well as to provide for the introduction of new mutations. It is an open question whether it is more efficient to separate these efforts from each other or if, for reasons of ease of execution and/or screening capacity, it may be reasonable to combine them. Similarly, diversity generation and the fitness function may be inextricably linked in cases where evolvability itself is engineered via the discovery and introduction of new mutations.

More broadly, it should be appreciated that these aspects are not meant to be understood and processed in a vacuum or without respect to one another. Indeed, it is a deep understanding of their interplay that leads to efficient optimization. For example, diversity generation is necessary to keep the stockpile of diversity high enough for the search algorithm to proceed, but not so high as to overload it. Ultimately, it is important to monitor the size of the existing diversity stock (the second aspect) relative to the search algorithm’s capacity to sift through that stock (the third aspect) to create a good match with the available screening resources of the fitness function (the first aspect).

As enzyme engineering continues to advance, developments along the aspects described here will probably accelerate the speed of optimization. However, it is worth noting that while large advances along any one single aspect will always be welcome, the ability to confer improvements on the overall process will be constrained by the limitations of all aspects considered together. For example, the best search algorithm in the world is of marginal utility without an adequate source of diversity on which to operate. Thus, research efforts directed toward addressing certain aspects would do well to consider the overall impact of likely outcomes, however successful, and enzyme engineering teams that can seamlessly balance all aspects within an integrated environment will probably achieve the goal of obtaining the most industrially important results in the shortest span of time.

APPENDIX

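The expected gain in fitness, ⟨x⟩, based on extreme value theory [90] is given by

$$\langle x \rangle = \Phi^{-1}\!\left(1 - \frac{1}{t}\right) + 0.5772\left[\Phi^{-1}\!\left(1 - \frac{1}{t e}\right) - \Phi^{-1}\!\left(1 - \frac{1}{t}\right)\right] \tag{6}$$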

where Φ⁻¹ is the inverse cumulative distribution function for the standard normal, t is the number of variants screened, and e is the base of the natural logarithm. For large screening sizes Eq. (6) can be approximated by

$$\langle x \rangle \approx \Phi^{-1}\!\left(1 - \frac{1}{t}\right) \tag{7}$$
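Equations (6) and (7) can be evaluated with the inverse normal CDF from the Python standard library, which makes it straightforward to reproduce the screening examples quoted in the main text. In this sketch the numerical constant in Eq. (6) is taken to be the Euler–Mascheroni constant (an assumption on our part), and the function names are hypothetical.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649  # assumed identity of the constant in Eq. (6)
_inverse_cdf = NormalDist().inv_cdf  # standard normal, mean 0 and sd 1


def expected_gain(t: int) -> float:
    """Expected maximum of t standard-normal draws, following Eq. (6)."""
    a = _inverse_cdf(1 - 1 / t)
    b = _inverse_cdf(1 - 1 / (t * math.e))
    return a + EULER_GAMMA * (b - a)


def expected_gain_large_t(t: int) -> float:
    """Large-t approximation of Eq. (7)."""
    return _inverse_cdf(1 - 1 / t)


for screened in (500, 10_000):
    print(
        f"t={screened}: Eq. (6) gives {expected_gain(screened):.2f} SD, "
        f"Eq. (7) gives {expected_gain_large_t(screened):.2f} SD"
    )
```

Evaluating the two screening sizes side by side shows how slowly the expected gain grows with t, which is the point made in the discussion of deep screening.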

Furthermore, the cumulative distribution function of the standard normal for x > 2 is well approximated by

$$\Phi(x) \approx 1 - \frac{\exp\!\left(-x^{2}/2\right)}{x\sqrt{2\pi}} \tag{8}$$

Inserting Eq. (7) into (8) gives (suppressing expectation notation)

$$t \approx x\sqrt{2\pi}\,\exp\!\left(\frac{x^{2}}{2}\right) \tag{9}$$

The Lambert function W(z) can be used to solve explicitly for x, where W(z) is the solution to the equation z = W(z) exp[W(z)]:

$$x = \sqrt{W\!\left(\frac{t^{2}}{2\pi}\right)} \tag{10}$$

A simple model of the recursive power of evolution can then be obtained by making an additional, simplifying assumption that the mean of the recentered distribution after recombination of the top-performing variants is given by half the expected gain of the top-performing variant. If we ignore the loss of variance associated with exhaustion of the diversity supply, the total number of standard deviations traversed over the course of r rounds of evolution is given by

$$y = x\left(\frac{r+1}{2}\right) \tag{11}$$

Inserting (10) into (11) then gives

$$y = \sqrt{W\!\left(\frac{t^{2}}{2\pi}\right)}\;\frac{r+1}{2} \tag{12}$$

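Finally, inserting (12) back into a form similar to that given by (9) yields the effective screening size:

$$t_{\mathrm{eff}} = c(t, r)\left(\frac{t}{2\pi}\right)^{[(r+1)/2]^{2}} \tag{13}$$

where

$$c(t, r) = \sqrt{\frac{\pi}{2}}\,(r+1)\,\sqrt{W\!\left(\frac{t^{2}}{2\pi}\right)}\left(\frac{2\pi}{W\!\left(t^{2}/2\pi\right)}\right)^{(1/8)(r+1)^{2}} \tag{14}$$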

Although the function c(t, r) appears complex, it is not a particularly strong function of r and t. It varies by about one order of magnitude over the range 10^2 < t < 10^6 and 1 < r < 4 relative to its initial value of c(t, r = 1) = 2π. In comparison, the dominant factor is (t/2π)^{[(r+1)/2]^2}, which varies by some 30 orders of magnitude over the same range. Thus, a good approximation for the effective screening size is given by

$$t_{\mathrm{eff}} \approx 2\pi \left(\frac{t}{2\pi}\right)^{[(r+1)/2]^{2}} \tag{15}$$
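The chain from Eq. (9) to Eq. (15) can be checked numerically. The sketch below implements the principal branch of the Lambert W function with a simple Newton iteration (a convenience chosen here to keep the example free of third-party dependencies), then evaluates the exact effective screening size, the prefactor c(t, r), and the approximation. All function names are hypothetical.

```python
import math


def lambert_w(z: float, tol: float = 1e-12) -> float:
    """Principal branch of W for z > 0, by Newton iteration on w*exp(w) = z."""
    w = math.log(1.0 + z)  # starting guess that lies at or above the root
    for _ in range(100):
        ew = math.exp(w)
        step = (w * ew - z) / (ew * (w + 1.0))
        w -= step
        if abs(step) < tol * (1.0 + abs(w)):
            break
    return w


def gain_per_round(t: float) -> float:
    """Eq. (10): x = sqrt(W(t^2 / (2*pi)))."""
    return math.sqrt(lambert_w(t * t / (2 * math.pi)))


def teff_exact(t: float, r: int) -> float:
    """Eq. (12) inserted into the form of Eq. (9)."""
    y = gain_per_round(t) * (r + 1) / 2
    return y * math.sqrt(2 * math.pi) * math.exp(y * y / 2)


def prefactor(t: float, r: int) -> float:
    """Eq. (14): the slowly varying factor c(t, r)."""
    w = lambert_w(t * t / (2 * math.pi))
    return (
        math.sqrt(math.pi / 2) * (r + 1) * math.sqrt(w)
        * (2 * math.pi / w) ** ((r + 1) ** 2 / 8)
    )


def teff_approx(t: float, r: int) -> float:
    """Eq. (15): c(t, r) replaced by its r = 1 value, 2*pi."""
    return 2 * math.pi * (t / (2 * math.pi)) ** (((r + 1) / 2) ** 2)


for rounds in range(1, 5):
    print(
        f"r={rounds} c={prefactor(1000, rounds):.2f} "
        f"exact={teff_exact(1000, rounds):.2e} approx={teff_approx(1000, rounds):.2e}"
    )
```

For r = 1 the exact expression returns t itself and the prefactor equals 2π, which is a useful sanity check on the reconstruction of Eqs. (13) and (14).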

Acknowledgments

The authors would like to thank Michael D. Clay for his thorough, careful review of the manuscript and very helpful suggestions to strengthen its content and presentation. The authors also acknowledge the generous support of Codexis and its many illustrious scientists, who have helped shape and refine our perspectives on this topic.

REFERENCES

1. DL Nelson, MC Cox. Enzymes. In: DL Nelson, MC Cox, eds. Lehninger’s Principles of Biochemistry. New York: Worth Publishers, 2003, pp. 243–292.

2. S Wright. The roles of mutation, inbreeding, crossbreeding, and selection in evolution. Proceedings of the Sixth International Congress on Genetics, 1932, pp. 356–366.

3. R Dawkins. Climbing Mount Improbable. New York: W.W. Norton, 1996.

4. S Kauffman. The structure of rugged fitness landscapes. In: S Kauffman, ed. The Origins of Order. New York: Oxford University Press, 1993, pp. 33–67.

5. S Kauffman. Prolegomenon to a general biology. In: S Kauffman, ed. Investigations. New York: Oxford University Press, 2000, pp. 1–22.

6. SA Kauffman, ED Weinberger. The NK model of rugged fitness landscapes and its application to maturation of the immune response. J Theor Biol 141:211–245, 1989.

7. H Zhao. Directed evolution of novel protein functions. Biotechnol Bioeng 98:313–317, 2007.

8. FH Arnold. Design by directed evolution. Acc Chem Res 31:125–131, 1998.

9. C Schmidt-Dannert, FH Arnold. Directed evolution of industrial enzymes. Trends Biotechnol 17:135–136, 1999.

10. R Eisenthal, MJ Danson, DW Hough. Catalytic efficiency and kcat/Km: A useful comparator? Trends Biotechnol 25:247–249, 2007.

11. CY Chen, I Georgiev, AC Anderson, BR Donald. Computational structure-based redesign of enzyme activity. Proc Natl Acad Sci USA 106:3764–3769, 2009.

12. K Buchholz, V Kasche, UT Bornscheuer. Biocatalysts and Enzyme Technology. Weinheim, Germany: Wiley-VCH, 2005.

13. RJ Fox, MD Clay. Catalytic effectiveness, a measure of enzyme proficiency for industrial applications. Trends Biotechnol 27:137–140, 2009.


14. A Cornish-Bowden. One-way enzymes. In: A Cornish-Bowden, ed. Fundamentals of Enzyme Kinetics. London: Portland Press, 2004, pp. 53–55.

15. WP Jencks. Binding energy, specificity, and enzymic catalysis: the Circe effect. Adv Enzymol Relat Areas Mol Biol 43:219–410, 1975.

16. RJ Fox, SC Davis, EC Mundorff, LM Newman, V Gavrilovic, SK Ma, LM Chung, C Ching, S Tam, S Muley, et al. Improving catalytic function by ProSAR-driven enzyme evolution. Nat Biotechnol 25:338–344, 2007.

17. S Luetz, L Giver, J Lalonde. Engineered enzymes for chemical production. Biotechnol Bioeng 101:647–653, 2008.

18. C Gustafsson, S Govindarajan, J Minshull. Putting the engineering back into protein engineering: bioinformatic approaches to catalyst design. Curr Opin Biotechnol 14:1–5, 2003.

19. S Bershtein, M Segal, R Bekerman, N Tokuriki, DS Tawfik. Robustness–epistasis link shapes the fitness landscape of a randomly drifting protein. Nature 444:929–932, 2006.

20. JD Bloom, FH Arnold, CO Wilke. Breaking proteins with mutations: threads and thresholds in evolution. Mol Syst Biol 3:76, 2007.

21. JD Bloom, ST Labthavikul, CR Otey, FH Arnold. Protein stability promotes evolvability. Proc Natl Acad Sci USA 103:5869–5874, 2006.

22. S Bershtein, K Goldin, DS Tawfik. Intense neutral drifts yield robust and evolvable consensus proteins. J Mol Biol 379:1029–1044, 2008.

23. A Korkegian, ME Black, D Baker, BL Stoddard. Computational thermostabilization of an enzyme. Science 308:857–860, 2005.

24. HS Park, SH Nam, JK Lee, CN Yoon, B Mannervik, SJ Benkovic, HS Kim. Design and evolution of new catalytic activity with an existing protein scaffold. Science 311:535–538, 2006.

25. L Jiang, EA Althoff, FR Clemente, L Doyle, D Rothlisberger, A Zanghellini, JL Gallaher, JL Betker, F Tanaka, CF Barbas 3rd, et al. De novo computational design of retro-aldol enzymes. Science 319:1387–1391, 2008.

26. D Rothlisberger, O Khersonsky, AM Wollacott, L Jiang, J DeChancie, J Betker, JL Gallaher, EA Althoff, A Zanghellini, O Dym, et al. Kemp elimination catalysts by computational enzyme design. Nature 453:190–195, 2008.

27. AV Shivange, J Marienhagen, H Mundhada, A Schenk, U Schwaneberg. Advances in generating functional diversity for directed protein evolution. Curr Opin Chem Biol 13:19–25, 2009.

28. TS Wong, D Roccatano, M Zacharias, U Schwaneberg. A statistical analysis of random mutagenesis methods used for directed protein evolution. J Mol Biol 355:858–871, 2006.

29. KL Morley, RJ Kazlauskas. Improving enzyme properties: When are closer mutations better? Trends Biotechnol 23:231–237, 2005.

30. AI Solbak, TH Richardson, RT McCann, KA Kline, F Bartnek, G Tomlinson, X Tan, L Parra-Gessert, GJ Frey, M Podar, et al. Discovery of pectin-degrading enzymes and directed evolution of a novel pectate lyase for processing cotton fabric. J Biol Chem 280:9431–9438, 2005.

REFERENCES 121

31. KA Kretz, TH Richardson, KA Gray, DE Robertson, X Tan, JM Short. Gene sitesaturation mutagenesis: a comprehensive mutagenesis approach. Methods Enzymol388:3–11, 2004.

32. V Brissos, T Eggert, JM Cabral, KE Jaeger. Improving activity and stability of cuti-nase towards the anionic detergent AOT by complete saturation mutagenesis. ProteinEng Des Sel 21:387–393, 2008.

33. KA Gray, TH Richardson, K Kretz, JM Short, F Bartnek, R Knowles, L Kan,PE Swanson, DE Robertson. Rapid evolution of reversible denaturation and ele-vated melting temperature in a microbial haloalkane dehalogenase. Adv Synth Catal343:607–617, 2001.

34. G DeSantis, K Wong, B Farwell, K Chatman, Z Zhu, G Tomlinson, H Huang, X Tan,L Bibbs, P Chen, et al. Creation of a productive, highly enantioselective nitrilasethrough gene site saturation mutagenesis (gssm). J Am Chem Soc 125:11476–11477,2003.

35. DA Estell, A Wolfgang. Systematic evaluation of sequence and activity relationshipsusing site evaluation libraries for engineering multiple properties. US2008/0004186:Danisco US Inc., Genencor Division, USPTO, Rochester, NY, 2008.

36. H Murakami, T Hohsaka, M Sisido. Random insertion and deletion of arbitrary num-ber of bases for codon-based random mutation of DNAS. Nat Biotechnol 20:76–81,2002.

37. AJ Baldwin, K Busse, AM Simm, DD Jones. Expanded molecular diversity genera-tion during directed evolution by trinucleotide exchange (trinex). Nucleic Acids Res36:e77, 2008.

38. C Neylon. Chemical and biochemical strategies for the randomization of proteinencoding DNA sequences: library construction methods for directed evolution.Nucleic Acids Res 32:1448–1459, 2004.

39. MT Reetz, D Kahakeaw, R Lohmer. Addressing the numbers problem in directedevolution. ChemBioChem 9:1797–1804, 2008.

40. FA Fellouse, C Wiesmann, SS Sidhu. Synthetic antibodies from a four-amino-acidcode: a dominant role for tyrosine in antigen recognition. Proc Natl Acad Sci USA101:12467–12472, 2004.

41. A Crameri, SA Raillard, E Bermudez, WP Stemmer. DNA shuffling of a familyof genes from diverse species accelerates directed evolution. Nature 391:288–291,1998.

42. CC Chang, TT Chen, BW Cox, GN Dawes, WP Stemmer, J Punnonen, PA Patten.Evolution of a cytokine using DNA family shuffling. Nat Biotechnol 17:793–797,1999.

43. FC Christians, L Scapozza, A Crameri, G Folkers, WP Stemmer. Directed evolu-tion of thymidine kinase for AZT phosphorylation using DNA family shuffling. NatBiotechnol 17:259–264, 1999.

44. JR Cochran, YS Kim, SM Lippow, B Rao, KD Wittrup. Improved mutants fromdirected evolution are biased to orthologous substitutions. Protein Eng Des Sel19:245–253, 2006.

45. M Lehmann, C Loch, A Middendorf, D Studer, SF Lassen, L Pasamontes, AP van Loon, M Wyss. The consensus concept for thermostability engineering of proteins: further proof of concept. Protein Eng 15:403–411, 2002.


46. M Lehmann, L Pasamontes, SF Lassen, M Wyss. The consensus concept for thermostability engineering of proteins. Biochim Biophys Acta 1543:408–415, 2000.

47. M Lehmann, M Wyss. Engineering proteins for thermostability: the use of sequence alignments versus rational design and directed evolution. Curr Opin Biotechnol 12:371–375, 2001.

48. N Amin, AD Liu, S Ramer, W Aehle, D Meijer, M Metin, S Wong, P Gualfetti, V Schellenberger. Construction of stabilized proteins by combinatorial consensus mutagenesis. Protein Eng Des Sel 17:787–793, 2004.

49. JF Chaparro-Riggers, KM Polizzi, AS Bommarius. Better library design: data-driven protein engineering. Biotechnol J 2:180–191, 2007.

50. LA Castle, DL Siehl, R Gorton, PA Patten, YH Chen, S Bertain, HJ Cho, N Duck, J Wong, D Liu, MW Lassner. Discovery and directed evolution of a glyphosate tolerance gene. Science 304:1151–1154, 2004.

51. DL Siehl, LA Castle, R Gorton, RJ Keenan. The molecular basis of glyphosate resistance by an optimized microbial acetyltransferase. J Biol Chem 282:11446–11455, 2007.

52. MT Reetz, C Torre, A Eipper, R Lohmer, M Hermes, B Brunner, A Maichele, M Bocola, M Arand, A Cronin, Y Genzel, A Archelas, R Furstoss. Enhancing the enantioselectivity of an epoxide hydrolase by directed evolution. Org Lett 6:177–180, 2004.

53. MT Reetz, LW Wang, M Bocola. Directed evolution of enantioselective enzymes: iterative cycles of CASTing for probing protein-sequence space. Angew Chem Int Ed 45:1236–1241, 2006.

54. R Guerois, JE Nielsen, L Serrano. Predicting changes in the stability of proteins and protein complexes: a study of more than 1000 mutations. J Mol Biol 320:369–387, 2002.

55. J Schymkowitz, J Borg, F Stricher, R Nys, F Rousseau, L Serrano. The FoldX Web server: an online force field. Nucleic Acids Res 33:W382–W388, 2005.

56. Y Liu, B Kuhlman. Rosetta design server for protein design. Nucleic Acids Res 34:W235–W238, 2006.

57. CA Rohl, CE Strauss, KM Misura, D Baker. Protein structure prediction using Rosetta. Methods Enzymol 383:66–93, 2004.

58. N Tokuriki, F Stricher, L Serrano, DS Tawfik. How protein stability and new functions trade off. PLoS Comput Biol 4:e1000002, 2008.

59. N Tokuriki, F Stricher, J Schymkowitz, L Serrano, DS Tawfik. The stability effects of protein mutations appear to be universally distributed. J Mol Biol 369:1318–1332, 2007.

60. G Dantas, C Corrent, SL Reichow, JJ Havranek, ZM Eletr, NG Isern, B Kuhlman, G Varani, EA Merritt, D Baker. High-resolution structural and thermodynamic analysis of extreme stabilization of human procarboxypeptidase by computational protein design. J Mol Biol 366:1209–1221, 2007.

61. M Masso, II Vaisman. Accurate prediction of stability changes in protein mutants by combining machine learning with structure based computational mutagenesis. Bioinformatics 24:2002–2009, 2008.

62. S Lloyd. Computational capacity of the universe. Phys Rev Lett 88:237901–237904, 2002.

63. JA Wells. Additivity of mutational effects in proteins. Biochemistry 29:8509–8517, 1990.

64. AS Mildvan. Inverse thinking about double mutants of enzymes. Biochemistry 43:14517–14520, 2004.

65. R Kazlauskas. Biological chemistry: enzymes in focus. Nature 436:1096–1097, 2005.

66. RJ Fox, GW Huisman. Enzyme optimization: moving from blind evolution to statistical exploration of sequence–function space. Trends Biotechnol 26:132–138, 2008.

67. MP Styczynski, CR Fischer, GN Stephanopoulos. The intelligent design of evolution. Mol Syst Biol 2:2006–2020, 2006.

68. CA Tracewell, FH Arnold. Directed enzyme evolution: climbing fitness peaks one amino acid at a time. Curr Opin Chem Biol 13:3–9, 2009.

69. DW Leung, E Chen, DV Goeddel. A method for random mutagenesis of a defined DNA segment using a modified polymerase chain reaction. Technique 1:11–15, 1989.

70. L Giver, FH Arnold. Combinatorial protein design by in vitro recombination. Curr Opin Chem Biol 2:335–338, 1998.

71. WP Stemmer. Rapid evolution of a protein in vitro by DNA shuffling. Nature 370:389–391, 1994.

72. WP Stemmer. DNA shuffling by random fragmentation and reassembly: in vitro recombination for molecular evolution. Proc Natl Acad Sci USA 91:10747–10751, 1994.

73. A Crameri, G Dawes, E Rodriguez, Jr, S Silver, WP Stemmer. Molecular evolution of an arsenate detoxification pathway by DNA shuffling. Nat Biotechnol 15:436–438, 1997.

74. K Proba, A Worn, A Honegger, A Pluckthun. Antibody scFv fragments without disulfide bonds made by molecular evolution. J Mol Biol 275:245–253, 1998.

75. JE Ness, M Welch, L Giver, M Bueno, JR Cherry, TV Borchert, WP Stemmer, J Minshull. DNA shuffling of subgenomic sequences of subtilisin. Nat Biotechnol 17:893–896, 1999.

76. T Yano, S Oue, H Kagamiyama. Directed evolution of an aspartate aminotransferase with new substrate specificities. Proc Natl Acad Sci USA 95:5511–5515, 1998.

77. JH Holland. Adaptation in Natural and Artificial Systems. Cambridge, MA: MIT Press, 1975.

78. R Fox, A Roy, S Govindarajan, J Minshull, C Gustafsson, J Jones, R Emig. Optimizing the search algorithm for protein engineering by directed evolution. Protein Eng 16:589–597, 2003.

79. D Youvan. Searching sequence space. Nat Biotechnol 13:722–723, 1995.

80. C Gustafsson, S Govindarajan, R Emig. Exploration of sequence space for protein engineering. J Mol Recog 14:308–314, 2001.

81. W Peng, H Levine, T Hwa, DA Kessler. Analytical study of the effect of recombination on evolution via DNA shuffling. Phys Rev E 69:051911–051925, 2004.

82. H Muhlenbein, D Schlierkamp-Voosen. The science of breeding and its application to the breeder genetic algorithm (BGA). Evol Comput 1:335–360, 1993.

83. AW Edwards. The genetical theory of natural selection. Genetics 154:1419–1426, 2000.

84. RA Fisher. The Design of Experiments. Edinburgh, UK: Oliver & Boyd, 1937.


85. RH Myers, DC Montgomery. Response Surface Methodology: Process and Product Optimization Using Designed Experiments. Hoboken, NJ: Wiley, 1995.

86. R Fox. Directed molecular evolution by machine learning and the influence of non-linear interactions. J Theor Biol 234:187–199, 2005.

87. JE Ness, S Kim, A Gottman, R Pak, A Krebber, TV Borchert, S Govindarajan, EC Mundorff, J Minshull. Synthetic shuffling expands functional protein diversity by allowing amino acids to recombine independently. Nat Biotechnol 20:1251–1255, 2002.

88. K Stutzman-Engwall, S Conlon, R Fedechko, H McArthur, K Pekrun, Y Chen, S Jenne, C La, N Trinh, S Kim, et al. Semi-synthetic DNA shuffling of aveC leads to improved industrial scale production of doramectin by Streptomyces avermitilis. Metab Eng 7:27–37, 2005.

89. A Herman, DS Tawfik. Incorporating synthetic oligonucleotides via gene reassembly (ISOR): a versatile tool for generating targeted libraries. Protein Eng Des Sel 20:219–226, 2007.

90. E Castillo. Extreme Value Theory in Engineering. San Diego, CA: Academic Press, 1988.
