De Novo Protein Structure Prediction - Tulane University

Multiple Sequence Alignment

[Figure: three-dimensional dynamic programming lattice for aligning three sequences. Example aligned sequences: ACT--, ATT--, --TGG]

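• What about the simple extension from 2D?

• There are seven possible "endings" for an alignment of three sequences $u$, $v$, $w$ whose prefixes end at $u_\ell$, $v_m$, $w_n$, and $2^k - 1$ endings for $k$ sequences. Why?

The last column of the alignment is one of:

$(u_\ell, v_m, w_n)$, $(-, v_m, w_n)$, $(u_\ell, -, w_n)$, $(u_\ell, v_m, -)$, $(u_\ell, -, -)$, $(-, v_m, -)$, $(-, -, w_n)$

\[
s_{i,j,k} = \max
\begin{cases}
s_{i-1,j-1,k-1} + \delta(u_i, v_j, w_k) \\
s_{i-1,j-1,k} + \delta(u_i, v_j, -) \\
s_{i-1,j,k-1} + \delta(u_i, -, w_k) \\
s_{i,j-1,k-1} + \delta(-, v_j, w_k) \\
s_{i-1,j,k} + \delta(u_i, -, -) \\
s_{i,j-1,k} + \delta(-, v_j, -) \\
s_{i,j,k-1} + \delta(-, -, w_k)
\end{cases}
\]

$\delta(x, y, z)$ is an entry in the 3D scoring matrix.

Time and space grow exponentially with the number of sequences.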
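The recurrence for three sequences can be sketched in a few lines. This is a minimal illustration, not the course's implementation: the function names and the column score `delta` (a sum of pairwise match/mismatch/gap scores with made-up values) are assumptions chosen only to make the example runnable.

```python
from itertools import product

GAP = "-"

def delta(x, y, z, match=1, mismatch=-1, gap=-1):
    """Hypothetical column score: sum of the three pairwise scores in a column."""
    def pair(a, b):
        if a == GAP and b == GAP:
            return 0
        if a == GAP or b == GAP:
            return gap
        return match if a == b else mismatch
    return pair(x, y) + pair(x, z) + pair(y, z)

# The 7 = 2^3 - 1 endings: every non-empty subset of sequences consumes a letter.
MOVES = [m for m in product((0, 1), repeat=3) if any(m)]

def align3_score(u, v, w):
    """Optimal score of a global alignment of three sequences (O(n^3) cells)."""
    n1, n2, n3 = len(u), len(v), len(w)
    neg_inf = float("-inf")
    s = [[[neg_inf] * (n3 + 1) for _ in range(n2 + 1)] for _ in range(n1 + 1)]
    s[0][0][0] = 0
    for i in range(n1 + 1):
        for j in range(n2 + 1):
            for k in range(n3 + 1):
                for di, dj, dk in MOVES:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue  # predecessor cell lies outside the lattice
                    column = (u[pi] if di else GAP,
                              v[pj] if dj else GAP,
                              w[pk] if dk else GAP)
                    s[i][j][k] = max(s[i][j][k], s[pi][pj][pk] + delta(*column))
    return s[n1][n2][n3]
```

Each cell looks at seven predecessors, so the table for $k$ sequences of length $n$ has on the order of $n^k$ cells with $2^k - 1$ predecessors each, which is where the exponential growth comes from.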

Scoring

Sum-of-Pairs Scoring (SP):

Idea: A good multiple alignment should contain good pairwise alignments.

\[
S(A) = \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} S(s_i, s_j)
\]

• $A$: a multiple alignment
• $A_{s_i}$: "projection" of sequence $s_i$ (with gaps)
• $S(s_i, s_j)$: score of the pairwise alignment of $s_i$ and $s_j$
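As a small illustration of sum-of-pairs scoring, the sketch below scores a given multiple alignment by summing the scores of all pairwise alignments induced by its rows. The function names and the match/mismatch/gap values are assumptions for the example, and columns where both rows have a gap are scored as zero.

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score of the pairwise alignment induced by two rows of a multiple alignment."""
    score = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue            # gap-gap columns vanish in the projection
        if x == "-" or y == "-":
            score += gap
        else:
            score += match if x == y else mismatch
    return score

def sp_score(alignment):
    """Sum-of-pairs score: sum of induced pairwise scores over all row pairs."""
    return sum(pair_score(a, b) for a, b in combinations(alignment, 2))

rows = ["AC-T", "ACGT", "A-GT"]   # toy alignment, equal-length rows
print(sp_score(rows))
```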

Pruning the DP Matrix

The dynamic programming matrix is large, but we only want the best alignment, and most matrix elements are not on that path.

Can we 'direct' the search to avoid evaluating cells that are provably not on the best path?

For a cell $v$:

• $S_v$: score of the best path from the start to $v$
• $F_v$: bound on the best path from $v$ to the end
• $K$: score of the best known alignment

What if $S_v + F_v < K$?

Pruning the DP Matrix

Let $v = (3, 2, 2)$ for the sequences ARSTVK, ASVK, ARTR.

• $S_v$ is the score of the best alignment of: ARS, AS, AR
• $F_v$ is an upper bound on the score of aligning: TVK, VK, TR

If $S_v + F_v < K$, then mark $v$ as dead-ending (aka prune $v$).

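Pruning the DP Matrix

We know the alignment score is:

\[
S(A) = \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} S(s_i, s_j)
\]

Observation: the score of each pairwise alignment induced by $A$ is at most the score of the optimal pairwise alignment of the same two sequences, written $S^*(s_i, s_j)$ here:

\[
S(s_i, s_j) \le S^*(s_i, s_j)
\quad\Longrightarrow\quad
S(A) \le \sum_{k=1}^{m-1} \sum_{l=k+1}^{m} S^*(s_k, s_l)
\]

So our bound can be the sum of optimal pairwise scores over the remaining suffixes:

\[
F_v = \sum_{k=1}^{m-1} \sum_{l=k+1}^{m} S^*\!\left(s^k_{v_k+1 \ldots n_k},\; s^l_{v_l+1 \ldots n_l}\right)
\]

Runtime for computing $F$ (using dynamic programming): $O(n^2 m^2)$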
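The bound can be sketched as follows: precompute, for every pair of sequences, a table of optimal pairwise scores for all pairs of suffixes, then sum the relevant entries for a cell $v$. This is an illustrative sketch only; the function names and the scoring values are assumptions, and the example reuses the sequences ARSTVK, ASVK, ARTR with $v = (3, 2, 2)$.

```python
from itertools import combinations

def suffix_scores(a, b, match=1, mismatch=-1, gap=-1):
    """T[i][j] = optimal global alignment score of the suffixes a[i:] and b[j:]."""
    n, m = len(a), len(b)
    T = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        T[i][m] = T[i + 1][m] + gap
    for j in range(m - 1, -1, -1):
        T[n][j] = T[n][j + 1] + gap
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            sub = match if a[i] == b[j] else mismatch
            T[i][j] = max(T[i + 1][j + 1] + sub, T[i + 1][j] + gap, T[i][j + 1] + gap)
    return T

def build_tables(seqs):
    """One suffix table per pair of sequences: O(m^2) tables of O(n^2) cells."""
    return {(k, l): suffix_scores(seqs[k], seqs[l])
            for k, l in combinations(range(len(seqs)), 2)}

def forward_bound(tables, v):
    """F_v: sum over pairs of the optimal score for aligning the remaining suffixes."""
    return sum(T[v[k]][v[l]] for (k, l), T in tables.items())

def can_prune(S_v, F_v, K):
    """Cell v is dead-ending if even the optimistic completion cannot reach K."""
    return S_v + F_v < K

seqs = ["ARSTVK", "ASVK", "ARTR"]
tables = build_tables(seqs)
print(forward_bound(tables, (3, 2, 2)))   # bound for suffixes TVK, VK, TR
```

Because the tables are built once, each bound lookup afterwards costs only one table access per pair of sequences.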

Native State

[Figure: energy landscape over backbone conformations, with the native state marked]

A protein conformation can be represented by a vector of DOF choices, and the conformation with minimum (potential) energy is:

\[
\theta = (\ldots, r_i, \ldots), \qquad
\theta^* = \arg\min_{\theta} E(\theta), \qquad
E(\theta) = \sum_{i \ne j} E_{i,j} + \sum_{i} E_i
\]

Each point on the energy landscape defines a conformation and associated energy.

How many degrees of freedom should we have? How many do we want?

In order to apply a discrete optimization technique, we need a discretized search space!

[Diagram: inputs feed a Conformation Space and an Energy Function, which an Algorithm searches to produce a Best-Fit Model]

• Inputs, prior knowledge and observations: Primary Sequence, (Homologous) Backbone, X-ray Data, NMR Data, PDB Statistics
  (Sequence/Fold, Energy Function, PDB Statistics, Experimental Data)
• Output, Best-Fit Model:
  (3D Structure, Backbone, Sidechains, Docking, design)

Questions to ask of the Conformation Space, the Algorithm, and the Best-Fit Model:

• Fast (enough)? We want $O(n^c)$ and not $O(c^n)$.
• Accurate (enough) model? Correct objective function?
• Correct (enough) solution? Guarantees on solution quality?

Discretizing Sidechains

rotamer library for all sidechains through χ4. Janin et al., in 1978 [4], provided secondary-structure-dependent data summed over all sidechains (excluding short and β-branched sidechains). In 1983, James and Sielecki [5] used five higher resolution structures to produce dihedral angle distributions with lower variances than the larger sample used by Janin et al. and were therefore the first to emphasize using better, if fewer, structures for deriving rotamer libraries. Also in 1983, Benedetti et al. [6] presented data from 258 peptide crystal structures, with 321 sidechains available for analysis. In addition to backbone-independent dihedral angle distributions, they showed Ramachandran distributions for each χ1 rotamer, demonstrating the strong interdependence of backbone and sidechain conformations.

In 1987, Ponder and Richards [7] presented the first complete rotamer library: a list of all likely conformations of sidechains and their average dihedral angles, variances and frequencies. Their library was derived for use in determining what sequences would favor a known backbone conformation, essentially the procedure used nowadays in protein design. Also in 1987, McGregor et al. [8] used 61 high-resolution structures to examine the influence of secondary structure on rotamer populations, producing a secondary-structure-dependent rotamer library. In the context of sidechain conformation prediction, Tuffery et al. [9] derived a backbone-independent rotamer library based on 53 high-resolution structures available in 1991. In 1993, Dunbrack and Karplus [10] presented the first backbone-dependent rotamer library for use in sidechain conformation prediction. This library consisted of frequencies of the χ1-χ2 rotamers for each residue type in populated regions of the Ramachandran map, divided into 20° × 20° bins of the φ,ψ dihedral angles. Average sidechain dihedral angles were not given. At the same time, Schrauber et al. [11] examined 'nonrotamericity', giving average dihedral angles in a backbone-independent fashion, but frequencies dependent on secondary structure.

Kono and Doi [12] used cluster analysis in 1996 to derive a backbone-independent rotamer library including frequencies, average dihedral angles and variances. In 1997, De Maeyer et al. [13] expanded the Ponder and Richards library [7] by adding rotamers to fill out all possible staggered conformations for sp3−sp3 dihedral angles and to sample the broad distribution of amide and carboxylate dihedral angles. They also added rotamers by including conformations one standard deviation in each direction away from the Ponder and Richards averages, producing a "highly detailed" rotamer library for sidechain conformation prediction.

In 1997, Dunbrack and Cohen [14] used Bayesian statistics to estimate populations and dihedral angles for all rotamers of all sidechain types at all values of φ and ψ. This was accomplished by deriving an informative prior distribution based on the product of the φ and ψ dependencies, and using the Bayesian formalism to combine the fully φ,ψ-dependent data likelihood (a multinomial distribution) with the prior distributions expressed as Dirichlet functions. In populated parts of the Ramachandran map, the results of this calculation are populations and dihedral angles very close to the data values; in sparsely populated regions of the Ramachandran map, the informative prior distribution dominates the predicted populations and angles. The Bayesian mechanism allowed us to achieve a smooth transition between these two situations in a statistically sound manner.

In a significant recent development, Richardson and colleagues [15••,16] have used much stricter criteria for including sidechains in a data set used to build a backbone-independent rotamer library. In addition to using more highly resolved and refined structures than previous libraries, these criteria included eliminating sidechains with high B-factors for any atom, eliminating sidechains with clashes of any atom (including hydrogens built with the

Table 1. Published rotamer libraries.

| Authors | Year | Type of library | Number of proteins in library | Resolution (Å) |
|---|---|---|---|---|
| Chandrasekaran and Ramachandran [2] | 1970 | BBIND | 3 | NA |
| Janin et al. [4] | 1978 | BBIND, SSDEP | 19 | 2.5 |
| Bhat et al. [3] | 1979 | BBIND | 23 | NA |
| James and Sielecki [5] | 1983 | BBIND | 5 | 1.8, R-factor < 0.15 |
| Benedetti et al. [6] | 1983 | BBIND | 238 peptides | R-factor < 0.10 |
| Ponder and Richards [7] | 1987 | BBIND | 19 | 2.0 |
| McGregor et al. [8] | 1987 | SSDEP | 61 | 2.0 |
| Tuffery et al. [9] | 1991 | BBIND | 53 | 2.0 |
| Dunbrack and Karplus [10] | 1993 | BBIND, BBDEP | 132 | 2.0 |
| Schrauber et al. [11] | 1993 | BBIND, SSDEP | 70 | 2.0 |
| Kono and Doi [12] | 1996 | BBIND | 103 | NA |
| De Maeyer et al. [13] | 1995 | BBIND | 19 | 2.0 |
| Dunbrack and Cohen [14] | 1997–2002 | BBIND, BBDEP | 850* | 1.7 |
| Lovell et al. [15••] | 2000 | BBIND, SSDEP | 240 | 1.7 |

*Latest update, May 2002. NA, not available.

[Dunbrack, “Rotamer Libraries in the 21st Century”, 2002]

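Energy Functions

Standard approaches (e.g. Amber, CHARMM, GROMACS) model potential energy as:

\[
E_{\text{total}} = E_{\text{bonded}} + E_{\text{nonbonded}}
\]

where:

\[
E_{\text{bonded}} = E_{\text{bond}} + E_{\text{angle}} + E_{\text{dihedral}}
\]

\[
E_{\text{nonbonded}} = E_{\text{electrostatic}} + E_{\text{vdW}}
\]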

Conformation Space

[Figure: sidechain dihedral angles of Phenylalanine and Isoleucine] [Shapovalov, Dunbrack ’11]

For a protein with $n$ amino acids, the protein backbone has $2n - 2$ degrees of freedom.

Sidechain conformations are also defined by dihedral angles, but can be discretized by rotamers.

Native State

[Figure: energy landscape over sidechain conformations on a fixed backbone, with the native state marked]

A sidechain conformation can be represented by a vector of rotamer choices, and the conformation with minimum (potential) energy is:

\[
\theta = (\ldots, i_r, \ldots), \qquad \theta^* = \arg\min_{\theta} E(\theta)
\]

Each point on the energy landscape defines a conformation and associated energy.

For sidechain placement, we have $n$ degrees of freedom. Each amino acid has a number of states equal to the number of rotamers for that type.

Dead End Elimination

Original DEE

One of the only deterministic, non-trivial, and effective combinatorial optimization algorithms in Computational Structural Biology

Used For:

• Side-Chain Placement (tertiary structure prediction)
• Protein Design

Prunes rotamers that are provably NOT part of the GMEC

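Dead End Elimination

[Figure: total energy of two candidate rotamers $i_r$ and $i_t$ at position $i$, plotted over the rotamer choices at the other positions]

Dead End Elimination

Original DEE (Simplified): compare the minimum (best-case) energy of $i_r$ with the maximum (worst-case) energy of $i_t$.

[Pierce, Spriet, Desmet, Mayo, JCC, 2000]

Dead End Elimination

Original DEE: rotamer $i_r$ can be pruned if some other rotamer $i_t$ at the same position satisfies

\[
E(i_r) + \sum_{j \ne i} \min_{s} E(i_r, j_s) \;>\; E(i_t) + \sum_{j \ne i} \max_{s} E(i_t, j_s)
\]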


Dead End Elimination

Goldstein Criterion:

\[
E(i_r) - E(i_t) + \sum_{j \ne i} \min_{s} \left\{ E(i_r, j_s) - E(i_t, j_s) \right\} > 0
\]
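The Goldstein test translates almost directly into code. The sketch below is an illustration under assumed data structures (none of them come from the slides): `rotamers` maps each position to its surviving rotamer ids, `E_self[i][r]` holds the self energy, and `E_pair` maps `((i, r), (j, s))` with `i < j` to a pairwise energy, with missing entries treated as zero.

```python
def pair_energy(E_pair, i, r, j, s):
    """Symmetric lookup of E(i_r, j_s); absent pairs are assumed not to interact."""
    key = ((i, r), (j, s)) if i < j else ((j, s), (i, r))
    return E_pair.get(key, 0)

def goldstein_eliminates(i, r, t, rotamers, E_self, E_pair):
    """True if rotamer r at position i is eliminated by competitor t."""
    total = E_self[i][r] - E_self[i][t]
    for j, choices in rotamers.items():
        if j == i:
            continue
        total += min(pair_energy(E_pair, i, r, j, s) - pair_energy(E_pair, i, t, j, s)
                     for s in choices)
    return total > 0

def dee_prune(rotamers, E_self, E_pair):
    """Apply the Goldstein singles criterion repeatedly until nothing changes."""
    alive = {i: list(rs) for i, rs in rotamers.items()}
    changed = True
    while changed:
        changed = False
        for i in alive:
            for r in list(alive[i]):
                if any(t != r and goldstein_eliminates(i, r, t, alive, E_self, E_pair)
                       for t in alive[i]):
                    alive[i].remove(r)
                    changed = True
    return alive

# Toy problem with two positions and two rotamers each (made-up energies).
rotamers = {0: [0, 1], 1: [0, 1]}
E_self = {0: {0: 0, 1: 5}, 1: {0: 0, 1: 0}}
E_pair = {((0, 0), (1, 0)): -1, ((0, 1), (1, 1)): -1}
print(dee_prune(rotamers, E_self, E_pair))
```

Pruning one rotamer shrinks the sets over which the minimum is taken, so later passes can eliminate rotamers that survived earlier ones; that is why the loop repeats until no change occurs.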



FIGURE 2. Different dead-end elimination criteria for sample energy profiles. The abscissa represents all possible conformations of the protein and the ordinate describes the net energy contribution produced by interactions with specific rotamers at position $i$. (a) Original DEE: $i_r$ is eliminated by $i_{t_1}$ but not by $i_{t_2}$. (b) Simple Goldstein DEE: $i_r$ is eliminated by either $i_{t_1}$ or $i_{t_2}$. (c) General Goldstein DEE: $i_r$ cannot be eliminated by either $i_{t_1}$ or $i_{t_2}$, but can be eliminated by a weighted average of the two. (d) Bottom line DEE: theoretically, $i_r$ can be eliminated if the minimum of the $i_{t_1}$ and $i_{t_2}$ profiles always falls below the $i_r$ profile. (e) Simple split DEE: $i_r$ is eliminated by $i_{t_1}$ and $i_{t_2}$ in the partitions corresponding to splitting rotamers $k_{v_2}$ and $k_{v_1}$, respectively.

[Pierce et al., Vol. 21, No. 11, p. 1002]

Dead End Elimination

Generalized Goldstein Criterion:

\[
E(i_r) - \sum_{t=1}^{T} C_t E(i_t) + \sum_{j \ne i} \min_{s} \left\{ E(i_r, j_s) - \sum_{t=1}^{T} C_t E(i_t, j_s) \right\} > 0
\]

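Conformation Space

[Figure: conformation space partitioned by the rotamers $k_a$, $k_b$, $k_c$, $k_d$, $k_e$ of a splitting position $k$]

The idea behind bottom line DEE is that the conformation space can be partitioned to improve pruning.

If a particular rotamer can be eliminated in every partition (possibly by a different competing rotamer in each), then it is not in the GMEC.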

Dead End Elimination

Simple Split DEE (for each partition):

\[
E(i_r) - E(i_t) + \sum_{j,\; j \ne k \ne i} \min_{s} \left\{ E(i_r, j_s) - E(i_t, j_s) \right\} + \left[ E(i_r, k_v) - E(i_t, k_v) \right] > 0
\]

PIERCE ET AL.

TABLE II. CPU Minutes Consumed Using Goldstein (T = 1) DEE, Split (s = 1) DEE, and Split (s = 2mb) DEE for Each of Three Test Cases.

| Case | Method | (T = 1) time | (s = 1) time | (s = 2mb) time | Doubles time | Total time |
|---|---|---|---|---|---|---|
| 1 | Goldstein (T = 1) | 108.4 | - | - | 299.4 | 418.5 (a) |
| 1 | Split (s = 1) | 1.2 | 2.0 | - | 7.8 | 11.6 |
| 1 | Split (s = 2mb) | 1.1 | 2.2 | 0.9 | 7.1 | 11.8 |
| 2 | Goldstein (T = 1) | 660.2 | - | - | 1114.1 | 1793.4 (a) |
| 2 | Split (s = 1) | 7.4 | 26.0 | - | 198.0 | 234.3 |
| 2 | Split (s = 2mb) | 5.9 | 21.3 | 13.0 | 175.7 | 219.2 |
| 3 | Goldstein (T = 1) | 1981.3 | - | - | 978.2 | 3005.8 (a) |
| 3 | Split (s = 1) | 1338.4 | 3037.3 | - | 1062.0 | 5479.5 (a) |
| 3 | Split (s = 2mb) | 229.1 | 522.8 | 713.7 | 458.2 | 1939.0 |

(a) Failed to converge due to combinatorial explosion in the number of superrotamers created by unification.

BENCHMARK COMPUTATIONS

Timing results for the three benchmark design cases are provided in Table II with corresponding convergence histories shown in Figures 5–7. For the core design of case 1 (see Fig. 5), split (s = 1) and split (s = 2mb) DEE converge to the GMEC conformation in under 12 minutes. By contrast, Goldstein (T = 1) DEE reaches a plateau with $4.5 \times 10^{11}$ conformations remaining, and is eventually forced to terminate after 418 minutes when combinatorial buildup via unification causes the maximum allowable number of rotamers (npmax = $10^4$) to be surpassed.

For the core/boundary design of case 2 (see Fig. 6), the standard Goldstein (T = 1) DEE algorithm plateaus at $1.1 \times 10^{13}$ conformations before terminating due to combinatorial buildup (npmax = $10^4$) after 1793 minutes. By contrast, split (s = 1) DEE converges to the GMEC conformation in 234 minutes and split (s = 2mb) DEE converges slightly faster in 219 minutes.

For the surface design of case 3, the rotamers interact weakly relative to interactions in the core and boundary. Using the same maximum allowable number of rotamers as before (npmax = $10^4$), split (s = 2mb) DEE converges successfully in 2167 minutes whereas both Goldstein (T = 1) and split (s = 1) DEE quickly overrun the maximum rotamer limit (not shown). To observe a longer convergence path for these two algorithms, the maximum rotamer limit was increased (npmax = $2 \times 10^4$) and the results are shown in Figure 7. Using Goldstein (T = 1) DEE, a plateau is reached at $9.3 \times 10^{18}$ conformations and the calculation terminates due to rotamer buildup after 3006 minutes. Using split (s = 1) DEE, the number of conformations is reduced to $6.4 \times 10^{11}$ before the calculation is terminated after 5480 minutes. Convergence to the GMEC conformation is achieved only with split (s = 2mb) DEE, requiring 1939 minutes. For the hardest problems, which involve weak interactions between surface residues, the more powerful (s = 2mb) criterion can lead to substantial improvements in the overall performance of the algorithm, even relative to split (s = 1) DEE.

FIGURE 5. Plastocyanin core design (the two split methods are indistinguishable).

FIGURE 6. Protein G core/boundary design.

FIGURE 7. Protein G surface design.

[Pierce et al., Vol. 21, No. 11, p. 1008]

Conclusions

Conformational splitting criteria significantly increase the power of dead-end elimination algorithms for the purposes of sequence selection in computational protein design. For challenging design calculations, the two splitting methods (s = 1) and (s = 2mb) dramatically increase the efficiency of DEE relative to existing state-of-the-art methods based on Goldstein (T = 1) singles elimination. Although the two split DEE methods perform similarly for the design of core and boundary residues, the more powerful split (s = 2mb) algorithm can provide significant advantages for calculations involving weakly interacting surface residues. Using this improved method, it is now possible to perform protein design calculations that were previously intractable using reasonable time and memory allocations on current supercomputers.

Acknowledgments

N.A.P. thanks D. B. Gordon for many helpful discussions. J.A.S. and J.D. thank I. Lasters and M. De Maeyer for their interest and occasional help as peers in the common field.


Extensions and Results

Sidechain placement vs. design: is there a difference?

DEE can be an extremely powerful pruning strategy. What do we do in cases where the conformation space remains large?

Can we do better than looking at conformations exhaustively?

Conformation Space

[Figure: two residues that are far apart, each with rotamers 1, 2, 3]

Explicit: (1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3), (3, 1), (3, 2), (3, 3)

An explicit representation considers all possible conformations individually, leading to an exponentially sized conformation space.

"Factorized" Conformation Space

Implicit: (1, 2, 3), (1, 2, 3)

Rather than defining the conformation space explicitly, by considering only local interactions we can obtain a compact, "factored" representation of the conformation space.

For residues $i$ and $j$ that are far apart:

\[
E(i, j) \ll E(i, i'), \qquad E(i, j) \ll E(j, j')
\]



We can construct an interaction graph of residues in which edges are defined for residues that are “close enough.” These define pairwise energy terms for any chosen pair of rotamers.

Protein Interaction Graph

The sidechain placement problem is then to select rotamers at each position so as to minimize the sum over all edges of interaction energies.
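To make the objective concrete, the following sketch enumerates every rotamer assignment on a tiny interaction graph and returns the minimum-energy one. It is a brute-force baseline for illustration only; the names, the inclusion of per-rotamer self energies, and the toy energies are assumptions. Positions 0 and 2 share no edge, standing in for residues that are too far apart to interact.

```python
from itertools import product

def total_energy(assign, E_self, E_pair):
    """Self energies plus the energies of all edges whose two rotamers are chosen."""
    energy = sum(E_self[i][r] for i, r in assign.items())
    for ((i, r), (j, s)), w in E_pair.items():
        if assign[i] == r and assign[j] == s:
            energy += w
    return energy

def brute_force_gmec(rotamers, E_self, E_pair):
    """Exhaustively search all assignments; exponential in the number of positions."""
    positions = sorted(rotamers)
    best = None
    for choice in product(*(rotamers[i] for i in positions)):
        assign = dict(zip(positions, choice))
        energy = total_energy(assign, E_self, E_pair)
        if best is None or energy < best[0]:
            best = (energy, assign)
    return best

# Toy graph: edges only between positions (0, 1) and (1, 2).
rotamers = {0: [0, 1], 1: [0, 1], 2: [0, 1]}
E_self = {0: {0: 0, 1: 0}, 1: {0: 0, 1: 1}, 2: {0: 0, 1: 0}}
E_pair = {((0, 0), (1, 0)): 2, ((0, 1), (1, 0)): -1,
          ((1, 0), (2, 1)): -2, ((1, 1), (2, 0)): -1}
print(brute_force_gmec(rotamers, E_self, E_pair))
```

The search visits the product of all rotamer counts, which is exactly the exponential blow-up that DEE pruning and the LP/ILP formulation try to avoid.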


Linear Programming

Linear programming is a general-purpose tool for optimizing a linear objective function under linear constraints. The general problem of Linear Programming is polynomial-time solvable.

LP Solver: input $A, b, c$; output $x^*_1, x^*_2, \ldots, x^*_n$

\[
\text{Minimize } \sum_{i} c_i x_i \qquad \text{subject to: } A \cdot x \le b
\]

Integer Linear Programming

Integer Linear Programming adds integrality constraints on the variables. Integer Linear Programming is not known to be polynomial-time solvable.

ILP Solver: input $A, b, c$; output $x^*_1, x^*_2, \ldots, x^*_n$

\[
\text{Minimize } \sum_{i} c_i x_i \qquad \text{subject to: } A \cdot x \le b, \quad x_i \in \mathbb{Z}
\]

In the 0-1 case, the integrality constraint becomes:

\[
x_i \in \{0, 1\}
\]

The feasible region of a set of constraints can be viewed as the set of all points that satisfy the constraints.

All LP solvers search the space of solutions and try to find a point that optimizes the objective function.

[Figure: feasible region P of a small linear program] [Wikipedia]

Fact: Some vertex of the feasible region is optimal.
Fact: A vertex is optimal if there is no better neighboring vertex.

Dantzig (1947) came up with the simplex algorithm:

    Set v = any vertex
    While a neighbor vertex v' has better cost:
        v = v'
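The vertex-walking idea can be illustrated on a two-dimensional feasible region given directly as a convex polygon. This is only a toy sketch of the "move to a better neighbor" loop, not the simplex algorithm as implemented in real solvers (there is no tableau or pivoting); the function name and the polygon representation are assumptions for the example.

```python
def vertex_walk(vertices, c, start=0):
    """Minimize c.x over a convex polygon whose vertices are listed in order.

    Starting from one vertex, repeatedly step to the better of the two
    neighboring vertices until neither improves the cost.
    """
    def cost(p):
        return c[0] * p[0] + c[1] * p[1]

    n = len(vertices)
    k = start
    while True:
        neighbors = [(k - 1) % n, (k + 1) % n]
        best = min(neighbors, key=lambda t: cost(vertices[t]))
        if cost(vertices[best]) < cost(vertices[k]):
            k = best          # strictly better neighbor: move there
        else:
            return vertices[k]  # no better neighbor: this vertex is optimal

square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(vertex_walk(square, c=(-1, -1)))
```

Because the region is convex and the objective is linear, a vertex with no better neighbor is globally optimal, which is the second fact stated above.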

LP for Sidechain Placement

C.L.Kingsford et al.

the backbone and the chosen rotamer $i_r$ at position $i$ as well as the intrinsic energy of rotamer $i_r$, and $E(i_r j_s)$ accounts for the pairwise interaction energy between chosen rotamers $i_r$ and $j_s$. In this discretized setting, the placement of each side chain is reduced to finding an assignment of rotamers to positions that minimizes the overall energy of the system (the global minimum energy conformation).

It is convenient to reformulate the SCP problem in graph-theoretic terms. Let $G$ be an undirected $p$-partite graph with node set $V_1 \cup \cdots \cup V_p$, where $V_i$ includes a node $u$ for each rotamer $i_r$ at position $i$; the $V_i$'s may have varying sizes. Each node $u$ of $V_i$ is assigned a weight $E_{uu} = E(i_r)$; each pair of nodes $u \in V_i$ and $v \in V_j$ ($i \ne j$), corresponding to rotamers $i_r$ and $j_s$ respectively, is joined by an edge with a weight of $E_{uv} = E(i_r j_s)$. Zero-weight edges can be thought of as equivalent to the absence of an edge. The global minimum energy conformation is achieved by picking one node per $V_i$ to minimize the weight of the induced subgraph.

Integer linear programming formulation

We first formulate the SCP problem as an ILP, so that a solution to the ILP gives an optimal solution to the SCP problem. The ILP is based on the graph formulation of SCP discussed above. The vertex set of this graph is $V = V_1 \cup \cdots \cup V_p$, and its edge set $D = \{(u, v) : u \in V_i, v \in V_j, i \ne j\}$.

We introduce a {0, 1} decision variable $x_{uu}$ for each node $u$ in $V$, and a {0, 1} decision variable $x_{uv}$ for each edge in $D$. Setting $x_{uu}$ to 1 corresponds to choosing rotamer $u$, and similarly setting $x_{uv}$ to 1 corresponds to choosing to 'pay' the energy between rotamers $u$ and $v$. We constrain our optimization so that only one rotamer is chosen per residue, and so that we pay the cost for edge $\{u, v\}$ if and only if rotamers $u$ and $v$ are both chosen. The following integer program ensures these conditions:

\[
\begin{aligned}
\text{Minimize } \quad & E = \sum_{u \in V} E_{uu} x_{uu} + \sum_{\{u,v\} \in D} E_{uv} x_{uv} \\
\text{subject to } \quad & \sum_{u \in V_j} x_{uu} = 1 && \text{for } j = 1, \ldots, p \\
& \sum_{u \in V_j} x_{uv} = x_{vv} && \text{for } j = 1, \ldots, p \text{ and } v \in V \setminus V_j \\
& x_{uu}, x_{uv} \in \{0, 1\}.
\end{aligned}
\tag{IP1}
\]
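The constraints of (IP1) can be checked mechanically for any candidate solution. The sketch below, using made-up names and a toy two-position instance, builds the node and edge variables induced by a rotamer choice, verifies both families of equality constraints, and evaluates the objective. It does not solve the ILP; a real solver would search over the variables.

```python
from itertools import combinations

def induced_variables(V, choice):
    """Node and edge variables induced by choosing one node per position."""
    chosen = set(choice)
    x_node = {u: int(u in chosen) for part in V for u in part}
    x_edge = {}
    for a, b in combinations(range(len(V)), 2):
        for u in V[a]:
            for v in V[b]:
                x_edge[frozenset((u, v))] = x_node[u] * x_node[v]
    return x_node, x_edge

def satisfies_ip1(V, x_node, x_edge):
    """Check the two families of equality constraints in (IP1)."""
    for part in V:                       # exactly one rotamer per position
        if sum(x_node[u] for u in part) != 1:
            return False
    for j, part in enumerate(V):         # sum_{u in V_j} x_uv = x_vv, v outside V_j
        for i, other in enumerate(V):
            if i == j:
                continue
            for v in other:
                if sum(x_edge[frozenset((u, v))] for u in part) != x_node[v]:
                    return False
    return True

def objective(E_node, E_edge, x_node, x_edge):
    """Node energies plus edge energies of the selected variables."""
    return (sum(E_node[u] * x for u, x in x_node.items())
            + sum(E_edge.get(e, 0) * x for e, x in x_edge.items()))

V = [["a0", "a1"], ["b0", "b1"]]                      # toy 2-partite instance
E_node = {"a0": 0, "a1": 1, "b0": 2, "b1": 0}
E_edge = {frozenset(("a1", "b0")): -4}
x_node, x_edge = induced_variables(V, ("a1", "b0"))
print(satisfies_ip1(V, x_node, x_edge), objective(E_node, E_edge, x_node, x_edge))
```

Setting every edge variable to zero while keeping the node variables fixed violates the second constraint family, which is how the program forces the energy of a chosen pair to be paid.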

The first set of constraints ensures that we choose exactly one rotamer for each residue. The second set of constraints demands that we set the edge variables $x_{uv}$ to 1 for edges that are in the subgraph induced by the choice of rotamers: if $x_{vv} = 0$ then no adjacent edges can be chosen, and if $x_{vv} = 1$ then exactly one adjacent edge is chosen for each vertex set. This formulation is similar to the version of (Althaus et al., 2000) (without modifying the energies to be negative) and simpler than that of (Eriksson et al., 2001). Additionally, on the experimental side, Klepeis et al. (2003) use a similar integer programming formulation to design variants of the peptide Compstatin that are predicted to have improved inhibitory activity in complement pathways. However, this is a slightly different model in which side-chain positions are not explicitly represented.

In practice, the ILP given above can have many variables and constraints that do not affect the optimization, and the system can be pruned dramatically. In particular, if all the pairwise energies between rotamers in positions $i$ and $j$ are non-positive, then we can remove all variables $x_{uv}$ with $u \in V_i$ and $v \in V_j$ such that $E_{uv} = 0$, and modify the equality constraints in (IP1) that contain such an $x_{uv}$ by removing those variables and changing '=' to '≤'. Because we are minimizing and all the energies between $i$ and $j$ are zero or less, this change does not affect the optimal solution. A frequent special case has zero energies between all rotamers in two positions; this corresponds to residues that are too far apart in the structure to have any rotamers that interact with each other. The more general case involves residues that are far enough apart that only a subset of their rotamers have interactions with each other.

More formally, for each $V_j$, let $N^+(V_j)$ be the set union of the $V_i$ for which there exists some $v \in V_i$ and $u \in V_j$ with $E_{uv} > 0$. Let $D'$ be the set of pairs $\{u, v\}$ with $u \in V_j$ such that either $v \in N^+(V_j)$, or $v \notin N^+(V_j)$ but $E_{uv} < 0$. There will be edge variables $x_{uv}$ only for pairs in $D'$. Our modified ILP is as follows:

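\[
\begin{aligned}
\text{Minimize } \quad & E' = \sum_{u \in V} E_{uu} x_{uu} + \sum_{\{u,v\} \in D'} E_{uv} x_{uv} \\
\text{subject to } \quad & \sum_{u \in V_j} x_{uu} = 1 && \text{for } j = 1, \ldots, p \\
& \sum_{u \in V_j} x_{uv} = x_{vv} && \text{for } j = 1, \ldots, p \text{ and } v \in N^+(V_j) \\
& \sum_{u \in V_j : E_{uv} < 0} x_{uv} \le x_{vv} && \text{for } j = 1, \ldots, p \text{ and } v \notin N^+(V_j) \\
& x_{uu}, x_{uv} \in \{0, 1\}
\end{aligned}
\tag{IP2}
\]

An inequality constraint is not included if the sum on the left-hand side is empty. The simple modification of (IP1) given in (IP2) is crucial in practice, providing in some cases an order of magnitude speed up.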

Multiple solutions

Sometimes it is desirable to find several optimal and near-optimal solutions. In the present framework, the LP/ILP can be solved iteratively to find an ensemble of low-energy solutions. At iteration $m$, all previously discovered solutions are excluded by adding the constraints

\[
\sum_{u \in S_k} x_{uu} \le p - 1 \qquad \text{for } k = 1, \ldots, m - 1 \tag{1}
\]

to (IP2), where $S_k$ contains the optimal set of rotamers found in iteration $k$. This requires that the new solution differs from all previous ones in at least one position. As pointed out by an anonymous reviewer, it may be desirable to obtain successive solutions that differ more from each other, and this can be accomplished by replacing $p - 1$ in (1) by $p - q$, where $1 < q \le p$.

LP/ILP approach

The ILP formulation is as hard to solve as the original SCP problem. If we relax the integrality constraints $x_{uv} \in \{0, 1\}$ by replacing them with constraints $0 \le x_{uv} \le 1$ for $u, v \in V$, we obtain a linear program, which can be solved efficiently. If the optimal solution to the relaxed linear program is integral (all variables are set to either 0 or 1), then that solution is also an optimal solution to the ILP and SCP problem. So our LP/ILP approach to find optimal solutions is as follows: solve the problem of interest using the computationally easier LP formulation. If the solution returned is integral, then the problem instance was easy to solve, and we have the optimal solution to the original SCP problem. Otherwise, we run polynomial-time Goldstein dead-end elimination (DEE) (Goldstein, 1994) until no more rotamers can be eliminated and then solve the more difficult ILP.

The CPLEX package (ILOG CPLEX, 2000, http://www.ilog.com/products/cplex/) with AMPL (Fourer et al., 2002) was used to solve the linear and integer programs. All computation was done on a single Sparc 1200 MHz processor.

Dataset

The primary protein set (Table 1) consists of 25 proteins taken from Xiang and Honig (2001). The proteins vary in size, ranging from 50 to 221 residues with more than one possible rotamer. As in Xiang and Honig (2001), only the first chain in the Protein Data Bank (PDB) file is used for experiments.

For homology modeling, 33 homologs to the proteins of Table 1 are also used. These protein pairs share between 29 and 87% sequence identity (Table 2). Whereas for some proteins there are other more similar protein sequences present in the PDB, for evaluation purposes, the chosen homologs give a wider range of sequence identity. ClustalW (Thompson et al., 1994), with default settings, was used to align the protein pairs. For each pair, the protein in the original dataset was taken as the template

Integer linear programming (ILP) gives linear constraints on a set of variables, and a linear cost function.

The goal is to minimize cost (determined by variable choices) while satisfying the constraints.

ILP does not care about the energy function, or about the fact that the interaction graph comes from a protein structure.

[Kingsford et al. ’05]

One simple optimization is to only include rotamer pairs that can interact with a non-zero pairwise energy.

These pairs can be precomputed, and we can reduce the number of constraints.

[Kingsford et al. ’05]









What does it mean for the "integrality" constraints to be relaxed?

\[
x_{uu}, x_{uv} \le 1, \qquad x_{uu}, x_{uv} \ge 0
\]











Relaxing the integrality constraints allows the application of a polynomial-time algorithm for finding an optimal solution for the given set of constraints and objective. What does it mean to have a fractional solution?











We can also run this method iteratively, excluding previously identified minimum-energy conformations from being selected.










LP/ILP approaches for side-chain positioning

Table 3. Prediction of side-chain conformations on native backbones, with a comparison of the LP/ILP prediction with those of other methods and the crystal structure

| | Core residues | All residues |
|---|---|---|
| (a) LP/ILP χ1/χ1+2 | 87%/62% | 80%/51% |
| (b) Scwrl χ1/χ1+2 | 88%/60% | 80%/49% |
| (c) LP/ILP rmsd | 1.079 Å | 1.553 Å |
| (d) Scwrl rmsd | 1.170 Å | 1.649 Å |

All values are averaged over the 25 proteins of Table 1. (a) The percentage of residues over all proteins for which the LP/ILP predicted conformation has the χ1 and χ1+2 dihedral angles within 20° of the native structure; (b) these values for Scwrl; (c) the rmsd of the predicted side-chain conformations from those of the native side chains using the LP/ILP method; and (d) these values for Scwrl.

with respect to the energy function, and that these optimal solutions correspond to predicted structures of quality similar to that of other popular approaches.

Homology modeling

We next explore the combinatorial problems associated with homology modeling. The 33 pairs of homologous proteins considered, their percentage sequence identity, and the rmsd between their backbones are shown in Table 2.

We solved the resulting LP formulations for all 33 problems; this took under 12 min of CPU time. The LP found optimal solutions for 31 of the 33 pairs. For only two template/target pairs, 1d4t/1luk and 1qu9/1qd9, the optimal LP solutions were not integral. For these two problems, the optimal integral solution was found using DEE and the integer programming algorithm of CPLEX. A good measure of how close the LP relaxation objective is to the optimal solution is the relative gap, defined as:

\[ \frac{100\,|\mathrm{OPT} - \mathrm{lp}|}{|\mathrm{OPT}|} \qquad (2) \]

where OPT is the energy value of the optimal integral solution and lp is the optimal objective for the LP relaxation. The relative gaps for both 1d4t/1luk and 1qu9/1qd9 were fairly small (0.207 and 15.260, respectively), and the total time for solving these two integer linear programs was <1 min.

In order to show that the basic energy function is useful in the homology modeling scenario, we report the accuracies of our predicted structures. We computed the side-chain rmsd between the target and predicted structures, as well as the side-chain rmsd obtained by the Scwrl rotamer choices. The average side-chain rmsd obtained by the LP/ILP approach with the basic energy function is 3.230 Å, which is competitive with Scwrl’s performance of 3.260 Å when run on the same test set (Table 4).

For these tests, we did not optimize many important aspects of homology modeling, such as choosing the homolog with the most similar sequence or correcting alignments; hence the results should not be taken to be the best possible for any of the methods. However, the use of a simplified energy function results in predicted structures that are biologically reasonable. Additionally, optimal solutions with respect to this energy function are easily found using the LP/ILP approach.
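The relative gap of Equation (2) is a one-line computation. The sketch below is illustrative only: the function name `relative_gap` and the two energy values are hypothetical and are not taken from the paper.

```python
def relative_gap(opt: float, lp: float) -> float:
    """Relative gap of Equation (2): 100 * |OPT - lp| / |OPT|."""
    if opt == 0:
        raise ValueError("relative gap is undefined when OPT is zero")
    return 100.0 * abs(opt - lp) / abs(opt)


# Hypothetical energies. For a minimization problem the LP relaxation
# objective can never exceed the optimal integral objective.
opt_energy = -250.0   # energy of the optimal integral solution (assumed)
lp_energy = -250.5    # objective of the LP relaxation (assumed)
gap = relative_gap(opt_energy, lp_energy)
```

A gap of zero means the LP relaxation already attains the integral optimum.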

Table 4. Prediction of side-chain conformations using homology modeling, with a comparison of the LP/ILP prediction with those of other methods and the crystal structure

                      Core residues (Å)   All residues (Å)
(a) LP/ILP rmsd       2.177               3.230
(b) Scwrl rmsd        2.137               3.260
(c) Backbone rmsd     1.385               1.978

All values are averaged over the 33 problems of Table 2. (a) The rmsd between just side-chain atoms when comparing the LP/ILP predicted structure with the crystal structure; (b) this value when comparing the Scwrl predictions with the native structure; and (c) the rmsd between template and target structures when only considering backbone atoms.

Protein design

We considered the problem of designing novel sequences that fold into known backbones. We partitioned the amino acids into the following classes: AVILMF / HKR / DE / TQNS / WY / P / C / G. For each of the 25 proteins in our native test set (Table 1), we fixed the surface residues and the native backbone and allowed the core residues to assume any rotamer of any amino acid in the same class as the native residue. We focused on core residues since the basic energy function optimizes primarily van der Waals interactions. The sizes of the resulting problems are shown in Table 5.

When applying LP to the resulting problems with the basic energy function, only 6 out of 25 problems had integral solutions. Thus, from the perspective of this LP, the design problem is more difficult than fitting side chains on native and homologous backbones. CPU time for solving the 25 LP problems was approximately 20 h, with one protein (1qj4) taking ∼10.5 h.

To obtain optimal solutions for the 19 proteins with non-integral solutions, we apply DEE and then run the ILP solver of CPLEX. When solving the ILP, CPLEX, in addition to using many other heuristics, solves several linear programs that are subproblems of the ILP (these subproblems are referred to as the branch-and-bound nodes). The number of such subproblems is a very rough indication of the computational effort expended by CPLEX. The number used for the design problems is shown in the ‘N’ column of Table 5. For several of the problems, many branch-and-bound nodes were needed. CPLEX was able to find the optimal integral solutions to all the problems in ∼138 h. Nearly all of that time (125 h) was spent on the largest problem, 1qj4; the other 18 problems took only 13 h of computation.

The best way to test a designed sequence is to make the protein and confirm its structure and/or biological properties (e.g. Dahiyat and Mayo, 1997; Harbury et al., 1998; Malakauskas and Mayo, 1998; Looger et al., 2003; Klepeis et al., 2003; Lilien et al., 2004); this is beyond the scope of this paper. However, the basic energy function is reasonable for designing protein cores as it focuses on van der Waals interactions, and the use of other energy functions is not likely to make the problem easier (see below). Thus, while the LP/ILP approach found optimal solutions for these protein design problems, our analysis shows that protein design problems are likely to be considerably more difficult to solve than homology modeling problems.

Table 5. Proteins for which the core was redesigned

Prot.   Var len   Rot    Size   Time (ILP)       Rel gap    N
1aac    38        2153   62     3.3e2 (1.3e2)    2.630      4
1aho    18        668    22     4.4              Integral
1b9o    48        1842   69     2.4e2 (9.4)      1.099      0
1c5e    25        1369   42     5.8e1            Integral
1c9o    14        757    24     9.1e1 (4.6e1)    3.936      34
1cc7    18        866    29     9.5e1 (2.4)      0.272      0
1cex    78        3926   126    2.6e3 (7.0e2)    0.913      30
1cku    22        897    31     8.8              Integral
1ctj    24        1262   40     2.8e1            Integral
1cz9    53        2664   87     1.2e3 (3.2e2)    0.702      27
1czp    30        1475   47     4.4e2 (1.4e2)    1.202      39
1d4t    32        1691   52     1.8e2 (8.9e1)    1.039      33
1igd    11        552    18     3.4              Integral
1mfm    46        3215   80     6.5e3 (5.4e3)    3.234      233
1plc    33        1691   54     4.7e2 (1.3e2)    3.991      8
1qj4    124       6655   201    3.8e4 (4.5e5)    2.677      7293
1qq4    72        3500   115    1.5e3 (6.9e2)    4.272      38
1qtn    49        2181   74     2.6e2 (7.0e1)    0.558      8
1qu9    43        2057   70     2.3e2 (6.4)      0.162      2
1rcf    65        3189   105    2.7e3 (9.6e1)    0.053      0
1vfy    15        665    20     4.0              Integral
2pth    76        4395   127    1.1e4 (2.4e4)    2.115      1623
3lzt    48        1940   71     4.2e2 (3.9e2)    3.445      45
5p21    70        3624   114    4.1e3 (1.3e4)    2.259      1453
7rsa    46        1993   66     5.7e2 (1.4e1)    0.120      0

Var len gives the number of core positions that were allowed to vary, and Rot gives the total number of rotamers considered. Size is the log10 of the search space size. Time is the number of seconds CPLEX spent solving the LP and, given in parentheses, the time for solving the ILP. Rel gap gives the relative gap, as defined in Equation (2), and is a measure of how far the energy of the solution of the LP is from that of the optimal rotamer choice. N gives the number of subproblems CPLEX considered in finding the optimal choice of rotamers.

Other energy functions

We also investigated how changing the energy function affects the ability of LP to find optimal solutions. For five proteins from Table 1 (1c9o, 1czp, 1d4t, 1qtn and 1vfy), we fit side chains on their native backbones using two additional energy function variants.

In the first variant, the self-energies include the van der Waals interactions with the backbone (as before), but the statistical term is replaced by a torsion term as well as intra-side-chain van der Waals interactions. These self-energy terms are meant to measure the local favorability of a side-chain conformation. The pairwise interaction energies between rotamers consist of only van der Waals interactions.

The second variant is the same as the first, except that the self-energies include electrostatic interactions with the backbone and the pairwise energies include electrostatic interactions between side chains. In all cases, the electrostatic interactions were modeled using the distance-dependent electrostatic component (ϵ = r) of the AMBER96 force field.

In contrast to the basic energy function, for which 100% of the solutions were integral, the LP finds optimal solutions for only 60% (three out of five) of the proteins using either variant of the energy function. Thus, small changes in the energy function can influence the ease with which solutions are found. We note that ILP can still find optimal solutions for these problems, and additionally that the basic energy function gives the best accuracy over these proteins (1.634 Å average rmsd versus 2.069 and 2.409 Å for variants 1 and 2, respectively).

[Figure: relative gap (vertical axis, 0 to 0.5) plotted against instance number 1 to 10 for 1aac, 1aho, 1cex, 1ctj and 1igd; the inset covers instances 2 to 98 for 1aac.]

Fig. 2. Relative gap between the optimal solution (with value OPT) and the nine next lowest-energy solutions (where the i-th solution has value x_i). Inset shows relative gaps for the 100 lowest-energy solutions for 1aac. Relative gap at each iteration i is defined as 100(|OPT − x_i|/|OPT|).

Obtaining multiple solutions

By adding constraints (1) to the integer program, we can look at an ensemble of provably near-optimal solutions. Near-optimal solutions can be used to generate several candidates for protein design, as well as to analyze the energy landscape and gauge the difficulty of the global optimization problem. We found the 10 lowest-energy solutions for four proteins (1aho, 1cex, 1ctj and 1igd) and the 100 lowest-energy solutions for 1aac, using the basic energy function to fit each sequence onto its native backbone. Since at each step we are excluding all previously found solutions, each successive solution takes longer to find. The relative gap [Equation (2)] between each successive solution and the global optimum is plotted in Figure 2. These gaps are very small, and from the point of view of this energy function, any of several solutions perform similarly. This indicates that even though LP has no difficulty finding optimal solutions, no one choice of rotamers clearly stands out as the right one.
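To make the idea of ranking near-optimal solutions concrete, here is a minimal sketch on a toy instance. It is not the method used in the paper: instead of re-solving an integer program with exclusion constraints, it simply enumerates every rotamer assignment, which is feasible only for tiny problems. All energies, the residue count and the function names are hypothetical.

```python
from itertools import product

# Hypothetical toy instance: 3 residues with 2, 3 and 2 rotamers.
self_energy = [
    [-1.0, -0.5],
    [-2.0, -1.8, 0.3],
    [-0.7, -0.9],
]
# pair_energy[(i, j)][r][s]: rotamer r at residue i against rotamer s at residue j.
pair_energy = {
    (0, 1): [[-0.3, 0.2, 0.0], [0.1, -0.4, 0.0]],
    (1, 2): [[0.0, -0.2], [-0.5, 0.1], [0.0, 0.0]],
}


def total_energy(assignment):
    """Sum of self-energies and pairwise energies for one rotamer choice."""
    energy = sum(self_energy[i][r] for i, r in enumerate(assignment))
    energy += sum(table[assignment[i]][assignment[j]]
                  for (i, j), table in pair_energy.items())
    return energy


def k_lowest(k):
    """Brute-force stand-in for repeatedly excluding found solutions."""
    choices = product(*(range(len(rots)) for rots in self_energy))
    ranked = sorted((total_energy(a), a) for a in choices)
    return ranked[:k]


best = k_lowest(5)
opt = best[0][0]
# Relative gap of each solution to the global optimum, as plotted in Figure 2.
gaps = [100.0 * abs(opt - e) / abs(opt) for e, _ in best]
```

The first entry of `gaps` is zero by construction, and the list is non-decreasing because the solutions are sorted by energy.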

DISCUSSION

Our experiments suggest that mathematical programming should become a widely used technique for attacking SCP in the context of both homology modeling and protein design. The described approach exploits general, highly developed optimization machinery, and it is likely that problems much larger than those studied here can be solved by employing faster hardware and more effectively exploiting the CPLEX package (e.g. using parallelized versions of the software, or specifying alternate strategies for branching and node selection). The addition of valid inequalities in a branch and cut framework as in Althaus et al. (2000) might further speed up solution of the problems.

For even larger problems, further specialized optimizations may be necessary. As a first step, we have shown how to reduce the size of the ILP dramatically, without compromising optimality, by exploiting the fact that in protein structures amino acids do not interact with other amino acids that are far away in 3D. Furthermore, in practice, to solve large instances optimally, we would suggest first running basic DEE, and then following with either LP or ILP. We also note that some



[Figure: interaction diagram with positions labeled 1, 2 and 3. Implicit: (1, 2, 3), (1, 2, 3)]

E(i, j) ≪ E(i, i′),  E(i, j) ≪ E(j, j′)



We can construct an interaction graph of residues in which edges are defined for residues that are “close enough.” These define pairwise energy terms for any chosen pair of rotamers.


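Factor Graphs

Suppose we know that the likelihood function is:

\[ g(u, v, x, y, z) = f_1(u, v) \cdot f_2(v, x) \cdot f_3(x, y) \cdot f_4(y, z) \cdot f_5(z) \]

[Figure: factor graph over the variables u, v, x, y, z with factors f1, ..., f5]

Marginalization
Find the “marginal” value of g on a particular variable. For example:

\[ g_z(z) = \sum_{u,v,x,y} f_1(u, v) \cdot f_2(v, x) \cdot f_3(x, y) \cdot f_4(y, z) \cdot f_5(z) \]

MAP Configuration
Find the configuration of variables that maximizes g.

[Loeliger et al. ’01] [Pearl ’88, Jordan...]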
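The marginal over z can be computed either by summing g over every configuration or by pushing each sum inward along the chain of factors. The sketch below compares the two; the factor tables and the choice of binary variables are hypothetical, picked only so that the code runs.

```python
from itertools import product

D = 2  # assumed number of states per variable

# Hypothetical factor tables, indexed by the states of their arguments.
f1 = [[1.0, 2.0], [3.0, 1.0]]   # f1[u][v]
f2 = [[2.0, 1.0], [1.0, 4.0]]   # f2[v][x]
f3 = [[1.0, 3.0], [2.0, 1.0]]   # f3[x][y]
f4 = [[5.0, 1.0], [1.0, 2.0]]   # f4[y][z]
f5 = [1.0, 3.0]                 # f5[z]


def g(u, v, x, y, z):
    """Global function: product of all local factors."""
    return f1[u][v] * f2[v][x] * f3[x][y] * f4[y][z] * f5[z]


def marginal_z_brute_force():
    """Sum g over all (u, v, x, y) for each state of z."""
    return [sum(g(u, v, x, y, z) for u, v, x, y in product(range(D), repeat=4))
            for z in range(D)]


def marginal_z_elimination():
    """Same marginal, eliminating u, v, x, y one at a time."""
    m_v = [sum(f1[u][v] for u in range(D)) for v in range(D)]
    m_x = [sum(f2[v][x] * m_v[v] for v in range(D)) for x in range(D)]
    m_y = [sum(f3[x][y] * m_x[x] for x in range(D)) for y in range(D)]
    return [f5[z] * sum(f4[y][z] * m_y[y] for y in range(D)) for z in range(D)]


brute = marginal_z_brute_force()
fast = marginal_z_elimination()
```

The brute-force sum touches D^4 terms per state of z, while elimination needs only a handful of D × D sums. That saving is what message passing exploits.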


Here, variables take on a fixed number of states, and factors define local interactions.

MAP Configuration
Find the configuration of variables that maximizes g:

\[ \max_{\langle u,v,x,y,z \rangle} g(u, v, x, y, z) = \max_z f_5(z) \left( \max_y f_4(y, z) \left( \max_x f_3(x, y) \left( \max_v f_2(v, x) \max_u f_1(u, v) \right) \right) \right) \]

We can define likelihoods using the Boltzmann distribution:

\[ \Pr[\theta] \propto e^{-E(\theta)} \]

The model (with appropriate parameters) can be used to analyze protein energetics [Yanover/Weiss ’02, Xu ’05, Kamisetty et al ’07, ’11].

[Figure: 1UBQ - Ubiquitin, with a protein factor graph over variables x, y, z and factors f1, ..., f5]

To construct a “protein factor graph”, we take each amino acid in the primary sequence as a variable, and its side-chain conformations (rotamers) as states.

Univariate and bivariate factors are defined using self- and pairwise energies (i.e., probabilities).

A MAP configuration corresponds to a minimum-energy conformation.

Boltzmann distribution:

\[ \Pr[\theta] \propto e^{-E(\theta)} \]
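The link between energies and factors can be checked numerically. In the sketch below, each self- and pairwise energy is turned into a factor exp(−E), and the configuration with the largest product of factors is compared with the one of lowest total energy. The three residues, their rotamer counts and all energy values are hypothetical.

```python
import math
from itertools import product

names = ["x", "y", "z"]  # hypothetical residues, two rotamers each
self_energy = {"x": [0.5, 1.5], "y": [1.0, 0.2], "z": [0.3, 0.9]}
pair_energy = {
    ("x", "y"): [[0.0, 1.0], [0.4, 0.1]],
    ("y", "z"): [[0.2, 0.0], [0.6, 0.3]],
}


def energy(conf):
    """Total energy E(theta) of a configuration given as {residue: rotamer}."""
    total = sum(self_energy[v][conf[v]] for v in names)
    total += sum(t[conf[a]][conf[b]] for (a, b), t in pair_energy.items())
    return total


def likelihood(conf):
    """Unnormalized likelihood: product of univariate and bivariate factors."""
    value = 1.0
    for v in names:
        value *= math.exp(-self_energy[v][conf[v]])
    for (a, b), t in pair_energy.items():
        value *= math.exp(-t[conf[a]][conf[b]])
    return value


configs = [dict(zip(names, rots)) for rots in product(range(2), repeat=3)]
map_conf = max(configs, key=likelihood)
min_energy_conf = min(configs, key=energy)
```

Because exp(−E) is strictly decreasing in E, maximizing the product of factors and minimizing the summed energy select the same configuration.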

Max-Product Algorithm

Maximization can be computed by “message passing”:

\[ \mu_{f_j \to x_i}(x_i) = \max_{X_j \setminus x_i} f_j(X_j) \prod_{x \in X_j \setminus x_i} \mu_{x \to f_j}(x) \]

Once all messages have been passed, we can assign a maximizing configuration starting at leaf factors. [Pearl ’88]

[Figure: factor graph over the variables u, v, x, y, z with factors f1, ..., f5]

\[ \max_z f_5(z) \left( \max_y f_4(y, z) \left( \max_x f_3(x, y) \left( \max_v f_2(v, x) \max_u f_1(u, v) \right) \right) \right) \]
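The nested maximization can be evaluated from the inside out, keeping the maximizing argument at each step so that the configuration can be read back afterwards. The following is a sketch for the five-variable chain only, not a general max-product implementation; the factor tables and helper names are hypothetical.

```python
from itertools import product

D = 2  # assumed number of states per variable

# Hypothetical factor tables for the chain u - v - x - y - z.
f1 = [[1.0, 2.0], [3.0, 1.0]]   # f1[u][v]
f2 = [[2.0, 1.0], [1.0, 4.0]]   # f2[v][x]
f3 = [[1.0, 3.0], [2.0, 1.0]]   # f3[x][y]
f4 = [[5.0, 1.0], [1.0, 2.0]]   # f4[y][z]
f5 = [1.0, 3.0]                 # f5[z]


def g(u, v, x, y, z):
    return f1[u][v] * f2[v][x] * f3[x][y] * f4[y][z] * f5[z]


def max_message(table, incoming):
    """Maximize table[a][b] * incoming[a] over a, for every state b.

    Returns the outgoing message over b and the maximizing a for each b.
    """
    message, argmax = [], []
    for b in range(D):
        best_a = max(range(D), key=lambda a: table[a][b] * incoming[a])
        argmax.append(best_a)
        message.append(table[best_a][b] * incoming[best_a])
    return message, argmax


def map_configuration():
    # Forward pass: messages flow from u toward z.
    m_v, arg_u = max_message(f1, [1.0] * D)
    m_x, arg_v = max_message(f2, m_v)
    m_y, arg_x = max_message(f3, m_x)
    m_z, arg_y = max_message(f4, m_y)
    z = max(range(D), key=lambda s: f5[s] * m_z[s])
    # Backward pass: read off the maximizing states.
    y = arg_y[z]
    x = arg_x[y]
    v = arg_v[x]
    u = arg_u[v]
    return f5[z] * m_z[z], (u, v, x, y, z)


best_value, best_config = map_configuration()
brute_value = max(g(*c) for c in product(range(D), repeat=5))
```

Replacing `max` with a sum in `max_message` turns the same forward pass into the sum-product computation of a marginal.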

Dealing with Cycles

Computing marginals or MAP configurations exactly in a model with cycles is NP-hard. However, we can still use the sum-product algorithm in two ways:

• “Collapse” multiple variables into a single variable to eliminate cycles.

• Run sum-product as before, but until convergence.

The first method is exact, while the second is approximate. [Pearl ’88, Yedidia et al. ’05...]

[Figure: factor graph over the variables u, v, x, y, z with factors f1, ..., f5 and an added cycle]

Variable/factor grouping must be chosen carefully to avoid state-space explosion. [Pearl ’88, Yedidia et al. ’05...]

[Figure: the factor graph after collapsing, with a merged variable and a single factor combining f2 and f3]

Unfortunately, exact methods are prohibitively expensive if we consider longer-range interactions. We can approximate by stopping message passing near (or at) convergence.

Tree Decomposition

A tree decomposition of a graph G = (V, E) is a pair (T, X) satisfying the following conditions:

1. T = (I, F) is a tree with a node set I and an edge set F,

2. X = {X_i | i ∈ I, X_i ⊆ V} and ⋃_{i∈I} X_i = V. That is, each node in the tree T represents a subset of V and the union of all the subsets is V,

3. for every edge e = {v, w} ∈ E, there is at least one i ∈ I such that both v and w are in X_i, and

4. for all i, j, k ∈ I, if j is a node on the path from i to k in T, then X_i ∩ X_k ⊆ X_j.

The width of a tree decomposition is max_{i∈I}(|X_i| − 1). The tree width of a graph G, denoted by tw(G), is the minimum width over all the tree decompositions of G.

According to the above definition, the decomposition of a graph into biconnected components is also a tree decomposition. Each biconnected component corresponds to a node in T and any two biconnected components share at most one articulation vertex in G. However, the width of a biconnected-component decomposition could be O(|V|), which is much bigger than the tree width of a graph G if G is sparse. For example, when a graph is a cycle, this graph has only one biconnected component: itself. In contrast, the tree width of a cycle is only 2. Figures 1, 2 and 3 show an example of an interaction graph, its biconnected-component decomposition with width 6 and a tree decomposition with width 3. The width of a tree decomposition is a key factor determining the computational complexity of all the tree decomposition based algorithms. The smaller the width of a tree decomposition is, the more efficient the algorithm. Therefore, we need to optimize the tree decomposition of the residue interaction graph such that we can have a very small tree width. In the next subsection, we will describe a tree decomposition based side-chain assignment algorithm and analyze its computational complexity.
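The four conditions can be checked mechanically. The sketch below is an illustrative validator with hypothetical function names. It tests condition 4 in an equivalent form: for every vertex, the tree nodes whose subsets contain it must form a connected subtree. The example is the 4-cycle a-b-c-d, chosen because the text notes that a cycle has tree width 2.

```python
def connected(subset, adjacency):
    """True if the tree nodes in `subset` are connected using only each other."""
    start = next(iter(subset))
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for other in adjacency[node]:
            if other in subset and other not in seen:
                seen.add(other)
                stack.append(other)
    return seen == subset


def is_tree_decomposition(vertices, edges, bags, tree_edges):
    """bags maps a tree node to its vertex subset X_i; tree_edges is F."""
    nodes = set(bags)
    if not nodes or len(tree_edges) != len(nodes) - 1:
        return False
    adjacency = {i: set() for i in nodes}
    for i, j in tree_edges:
        adjacency[i].add(j)
        adjacency[j].add(i)
    if not connected(nodes, adjacency):                 # condition 1: T is a tree
        return False
    if set().union(*bags.values()) != set(vertices):    # condition 2: union is V
        return False
    for v, w in edges:                                  # condition 3: edges covered
        if not any(v in bag and w in bag for bag in bags.values()):
            return False
    for v in vertices:                                  # condition 4, per vertex
        holding = {i for i in nodes if v in bags[i]}
        if not connected(holding, adjacency):
            return False
    return True


def width(bags):
    return max(len(bag) for bag in bags.values()) - 1


cycle_vertices = ["a", "b", "c", "d"]
cycle_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
good_bags = {0: {"a", "b", "c"}, 1: {"a", "c", "d"}}
good_tree = [(0, 1)]
# Covers every edge, but vertex a sits in two bags that are not adjacent.
bad_bags = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"c", "d"}, 3: {"d", "a"}}
bad_tree = [(0, 1), (1, 2), (2, 3)]
```

Here `is_tree_decomposition(cycle_vertices, cycle_edges, good_bags, good_tree)` holds with `width(good_bags) == 2`, while the second candidate fails condition 4.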

Fig. 1. Example of a residue interaction graph (vertices a through m).

Fig. 2. Example of the biconnected-component decomposition of a graph, with components abcdefm, clk, eij, fg and fh. The width of this decomposition is 6.

Given a factor graph, we can actually reorganize it as long as we don’t lose any dependencies. But we don’t want to add too many unnecessary ones either. In general, which trees capture the original graph, and how can we measure how good a particular tree is?

Fig. 3. Example of a tree decomposition of a graph with width 3. The tree nodes are abd, acd, clk, fh, fg, eij, cdem and defm.

3.2 Tree Decomposition Based Side Chain Assignment Algorithm

In this subsection, we describe an algorithm to search for the optimal side-chain assignment based on a tree decomposition (T, X) of a residue interaction graph G. For simplicity, we assume that tree T has a root r and that each node is associated with a height. The height of a node is equal to the maximum height of its child nodes plus one. Our tree decomposition based side-chain assignment algorithm consists of two steps. One is the calculation of the optimal energy in a bottom-to-top way and the other is the extraction of the optimal assignment in a top-to-bottom way.

Bottom-to-Top. Suppose we start from a leaf node i in the tree T and node j is the parent of i. Let X_{i,j} denote the intersection between X_i and X_j. Given a side-chain assignment A(X_{i,j}) to the residues in X_{i,j}, we enumerate all the possible side-chain assignments to the residues in X_i − X_{i,j} and then find the best side-chain assignment such that the energy of the subgraph induced by X_i is minimized. We record this optimal energy as the multi-body score of X_{i,j}, which only depends on A(X_{i,j}). All the residues in X_{i,j} form a hyper edge, which is added into the subgraph induced by X_j. When the energy of the subgraph induced by X_j is calculated, the multi-body score corresponding to this hyper edge should be included. In addition, we also save the optimal assignment to all the residues in X_i − X_{i,j} for each A(X_{i,j}) since in the top-to-bottom step we need it for traceback. For example, in Figure 3, if we assume the node acd is the root, then node defm is an internal node with parent cdem. For each side-chain assignment to residues d, e and m, we can find the best side-chain assignment to residue f such that the energy of the subgraph induced by d, e, f and m is minimized. Then we add one hyper edge (d, e, m) to node cdem. In this bottom-to-top process, a tree node can be calculated only after all of its child nodes are calculated. When calculating the root node of T, we enumerate the side-chain assignments to all the residues in it and obtain the optimal assignment such that the energy is minimized. This minimized energy is also the optimal energy of the whole system.

Top-to-Bottom. After calculating the root node of tree T, we have the optimal assignment to all the residues in the root. Now we trace back from the parent node to its child nodes to extract the optimal assignment to all the residues in a child node. Assume that the optimal assignment to all the residues in node j is already known and node i is a child node of j. We can easily extract the optimal assignment to all the residues in X_i − X_{i,j} based on the assignment to the residues in X_{i,j} since we have already saved this assignment in the bottom-to-top step. Recursively, we can trace down to the leaf nodes of T to extract the optimal assignment to all the residues in G. If we want to reduce memory consumption, then we do not need to save the optimal assignments in the bottom-to-top step. Instead, in this step we can enumerate all the assignments to X_i − X_{i,j} to obtain the optimal assignment to all the residues in X_i − X_{i,j}. The computational effort for this enumeration is much smaller than that in the bottom-to-top step since there is only one side-chain assignment to X_{i,j}.
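The two passes can be sketched as one memoized recursion. This is a simplified illustration under stated assumptions, not the paper's implementation: every energy term is assigned to a single tree node so that nothing is counted twice, the table is keyed by the assignment A(X_{i,j}) to the shared residues, and the best assignment is stored with each table entry so that traceback is implicit. The 4-residue example, its energies and all function names are hypothetical.

```python
from itertools import product


def min_energy_assignment(bags, children, root, n_rot, self_e, pair_e):
    order = list(bags)

    def owner(residues):
        # Give each energy term to one bag that contains all of its residues.
        return next(i for i in order if set(residues) <= bags[i])

    terms = {i: [] for i in bags}
    for v in self_e:
        terms[owner((v,))].append((v,))
    for edge in pair_e:
        terms[owner(edge)].append(edge)

    def local_energy(i, a):
        total = 0.0
        for t in terms[i]:
            if len(t) == 1:
                total += self_e[t[0]][a[t[0]]]
            else:
                total += pair_e[t][a[t[0]]][a[t[1]]]
        return total

    table = {}  # (node, assignment to shared residues) -> (energy, assignment)

    def best_below(i, fixed):
        key = (i, tuple(sorted(fixed.items())))
        if key in table:
            return table[key]
        free = sorted(bags[i] - set(fixed))
        best = (float("inf"), {})
        for rots in product(*(range(n_rot[v]) for v in free)):
            chosen = dict(zip(free, rots))
            a = {**fixed, **chosen}
            e = local_energy(i, a)
            for c in children.get(i, []):
                shared = {v: a[v] for v in bags[c] & bags[i]}
                child_e, child_a = best_below(c, shared)
                e += child_e
                chosen.update(child_a)
            if e < best[0]:
                best = (e, chosen)
        table[key] = best
        return best

    return best_below(root, {})


def brute_force(n_rot, self_e, pair_e):
    names = sorted(n_rot)
    best = (float("inf"), {})
    for rots in product(*(range(n_rot[v]) for v in names)):
        a = dict(zip(names, rots))
        e = sum(self_e[v][a[v]] for v in names)
        e += sum(t[a[v]][a[w]] for (v, w), t in pair_e.items())
        if e < best[0]:
            best = (e, a)
    return best


# Hypothetical 4-cycle a-b-c-d with a width-2 decomposition rooted at node 0.
n_rot = {"a": 2, "b": 2, "c": 2, "d": 3}
self_e = {"a": [0.0, 0.4], "b": [0.3, 0.0], "c": [0.1, 0.2], "d": [0.5, 0.0, 0.2]}
pair_e = {
    ("a", "b"): [[0.0, -1.0], [-0.2, 0.3]],
    ("b", "c"): [[-0.5, 0.0], [0.2, -0.3]],
    ("c", "d"): [[0.0, -0.4, 0.1], [-0.6, 0.0, 0.0]],
    ("a", "d"): [[0.2, 0.0, -0.7], [0.0, -0.1, 0.3]],
}
bags = {0: {"a", "b", "c"}, 1: {"a", "c", "d"}}
children = {0: [1]}
dp_energy, dp_assignment = min_energy_assignment(bags, children, 0, n_rot, self_e, pair_e)
bf_energy, bf_assignment = brute_force(n_rot, self_e, pair_e)
```

Each table entry enumerates only the residues of one tree node, which is why a small width keeps the computation cheap.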

Tree Decomposition

A tree decomposition is a tree on vertex subsets X_i that satisfies the following:

1. The union of all vertex subsets X_i equals the original vertex set.

2. For any edge (u, v) in the original graph, there is some X_i with u, v ∈ X_i.

3. If v ∈ X_i and v ∈ X_j, then v ∈ X_k for all X_k on the path between X_i and X_j.

Stereo Vision

Signal Processing

[Sontag ’10]

Coding

[Söding ’05]

[McEliece et al. ’98]

“Standard” Applications

General Graphs

[Figure: a “loopy” graph on the variables u, v, x, y, a, b, c, d, e, f and its junction tree, obtained by tree decomposition (NP-hard)]

To deal with graphs with cycles, we “group” variables such that the original “likelihood” function is unchanged but we obtain a tree-structured model. If this “junction tree” has treewidth τ, sum-product requires O(n · d^τ) time.

Sum-Product is Fragile

[Figure: a tree-structured factor graph over u, v, x, y, z with factors f1, ..., f5, before and after an update]

Updating a tree-structured factor graph can change messages in an execution of the sum-product algorithm.

[Figure: the loopy graph and its junction tree before and after adding the edge (u, v); the variable u appears in many junction-tree nodes afterwards]

Adding a cycle to the input graph can change nodes in the junction tree (and associated factor graph).

Clustering in Factor Graphs

In each round of clustering, we “rake” all leaves and “compress” a maximal independent set of degree-two nodes [Miller/Reif ’84], while computing cluster functions.

Tree Contraction

[Figure: successive rake and compress rounds contracting the factor graph over u, v, x, y, z with factors f1, ..., f5, ending with a finalize step]

Cluster Functions

\[ \lambda_{f_1}(u, v) = f_1(u, v) \]

\[ \lambda_{f_2}(y) = \sum_{u,v,x} \lambda_{f_1}(u, v)\, \lambda_x(x, y)\, f_2(v, x, y) = \sum_{u,v,x} f_1(u, v)\, f_3(x, y)\, f_2(v, x, y) \]

How long do intermediate cluster function computations take? How many rounds until everything is eliminated?

Cluster Tree

[Figure: the factor graph over u, v, x, y, z with factors f1, ..., f5 and the corresponding cluster tree]

O(n · d^{3τ}) time

We also keep track of the “boundaries”, defined as the set of edges leaving a cluster at the time of its creation during contraction.

Computing Marginals

[Figure: cluster tree and factor graph annotated with the messages M_{f1}, M_{f2} and M_y and per-variable labels for v, x and z]

Any marginal can be computed in O(d^{2τ} · log n) time.

Dealing with Cycles

• Run max-product as before, but until convergence.

The focus of research in approximate methods is in improving convergence times. [Pearl ’88, Yedidia et al. ’05...]

Message Passing and Free Energy

We have been trying to minimize the potential energy of a protein conformation. But given that proteins exist in an ensemble of conformations, what do we minimize?

The free energy of a protein is defined as:

\[ G = H - TS \]

where H is the enthalpy of the system, and S is the entropy of the system.

How does this relate to graphical models? We can define (the final equality holds for T = 1, with p(θ) = e^{−E(θ)}/Z):

\[ G = \sum_{\theta} p(\theta) E(\theta) + T \sum_{\theta} p(\theta) \ln p(\theta) = -\ln Z \]

Here, Z is the normalizing constant, or “partition function.”
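The identity G = −ln Z can be verified numerically on a small state space. The sketch below assumes T = 1 and a Boltzmann distribution p(θ) = e^{−E(θ)}/Z; the four energy values are hypothetical.

```python
import math

energies = [0.0, 0.5, 1.2, 3.0]  # hypothetical E(theta) for four conformations
T = 1.0                          # energies assumed to be in units of kT

Z = sum(math.exp(-e) for e in energies)          # partition function
p = [math.exp(-e) / Z for e in energies]         # Boltzmann probabilities

average_energy = sum(pi * e for pi, e in zip(p, energies))   # enthalpy term
negative_entropy = sum(pi * math.log(pi) for pi in p)        # equals -S
G = average_energy + T * negative_entropy
```

With these definitions `G` agrees with `-math.log(Z)` up to floating-point error, so computing the free energy amounts to computing the partition function.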

Approximate Inference

Can we simplify the model in order to make it tractable? How do we do this? What can we say about the associated global likelihood? We’d like to relate our approximation b(θ) with the underlying global distribution p(θ). The Kullback-Leibler divergence between b(θ) and p(θ) is defined as:

\[ D(b; p) = \sum_{\theta} b(\theta) \ln \frac{b(\theta)}{p(\theta)} \]

Using that p(θ) = e^{−E(θ)}/Z we get that:

\[ D(b; p) = \sum_{\theta} b(\theta) \ln b(\theta) + \sum_{\theta} b(\theta) E(\theta) + \ln Z \]

which is minimized when b = p and we get:

\[ G = \sum_{\theta} p(\theta) E(\theta) + T \sum_{\theta} p(\theta) \ln p(\theta) = -\ln Z \]

Variational Inference

Now, the “fit” of our estimated b(θ) can be measured using:

\[ D(b; p) = \sum_{\theta} b(\theta) \ln b(\theta) + \sum_{\theta} b(\theta) E(\theta) + \ln Z \]

The “variational” approach to message-passing seeks to perform inference efficiently, while using bounds on ln Z to obtain a goodness of fit.

Can you think of a lower bound for ln Z? An upper bound? A key area of research is to develop bounds useful for performing inference.

How does all this relate back to protein structure?

[Figure credit: Lange Lab, TU-München]
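One bound follows directly from the expression for D(b; p): since the divergence is never negative, −(Σ b ln b + Σ b E) ≤ ln Z for any distribution b. The sketch below checks this on a small example. The energies and the uniform trial distribution are hypothetical, and T = 1 is assumed.

```python
import math

energies = [0.0, 0.5, 1.2, 3.0]          # hypothetical E(theta)
Z = sum(math.exp(-e) for e in energies)
p = [math.exp(-e) / Z for e in energies]


def kl_divergence(b, p):
    """D(b; p) = sum_theta b ln(b / p), skipping states with b = 0."""
    return sum(bi * math.log(bi / pi) for bi, pi in zip(b, p) if bi > 0)


def variational_free_energy(b):
    """sum_theta b ln b + sum_theta b E, the part of D(b; p) without ln Z."""
    neg_entropy = sum(bi * math.log(bi) for bi in b if bi > 0)
    avg_energy = sum(bi * e for bi, e in zip(b, energies))
    return neg_entropy + avg_energy


b = [0.25, 0.25, 0.25, 0.25]             # a crude trial distribution
divergence = kl_divergence(b, p)
lower_bound = -variational_free_energy(b)  # never exceeds ln Z
```

The bound is tight exactly when b = p, where the divergence vanishes.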

Free Energy

How does the potential-energy based view of protein design differ from the free-energy based view?

Is it easy to compute the free energy of a given protein sequence (with fixed backbone)?

Can we minimize the free energy for a particular choice of sequence for protein design?

How can we use graphical models?

Are there other (more efficient/accurate) approaches?
