Computational Approaches to Problems in
Protein Structure and Function
Carleton L. Kingsford
A Dissertation
Presented to the Faculty
of Princeton University
in Candidacy for the Degree
of Doctor of Philosophy
Recommended for Acceptance
by the Department of
Computer Science
November 2005
© Copyright by Carleton L. Kingsford, 2005. All rights reserved.
Abstract
We present computational approaches to solve several problems arising in protein structure
and function.
In the first part of this thesis, we develop a new method for finding the lowest energy
positions of side chains when given the backbone of a protein, a widely studied problem
that has applications in homology modeling and protein design. We present an integer
linear programming formulation of side-chain positioning and relax it to give a polynomial-
time linear programming heuristic that allows us to tackle large problems. We test the
integer and linear programming approach on native and homologous backbones, where we
show that optimal solutions can usually be found using linear programming, and in protein
redesign, where we find that instances often cannot be solved using linear programming
directly, but where optimal solutions for large instances can be found using the more
expensive integer programming procedure. We also present an alternative formulation of
the side-chain positioning problem as a semidefinite program, which provides a tighter
relaxation than the linear program. We introduce two novel rounding schemes to convert
fractional solutions of the semidefinite program into choices of rotamers and provide some
theoretical justifications for their effectiveness. We extensively test the semidefinite pro-
gramming formulation and rounding schemes on simulated data and on the redesign of
two naturally occurring protein cores and show that the approach finds good solutions.
The second part of this thesis considers the problem of finding transcription factor
binding sites by locating a collection of mutually similar subsequences within the upstream
DNA sequences of genes. Our approach to side-chain positioning can be recast to solve
this problem, and it has previously been shown that this is a promising direction to pursue.
We improve the mathematical programming formulation to find binding sites up to 45
times faster.
Finally, in the last part of the thesis, we investigate protein function more broadly and
give extensions to the popular phylogenetic profile method for predicting shared function
from cross-genomic evolutionary history. For many biological functions, our methods are
better able to identify functionally linked proteins than previously introduced methods.
Acknowledgments
The main content of several chapters has appeared as papers or submissions, and I thank
my co-authors for allowing our joint work to be included in this thesis.
The work on applying linear programming to side-chain positioning in Chapter 3
appeared in the journal Bioinformatics (Kingsford et al., 2005) and is joint work with
Bernard Chazelle and Mona Singh. I thank Amy Keating, Jessica Fong, Gevorg Grigoryan,
Elena Nabieva, Robert Osada, Elena Zaslavsky and the reviewers for insightful comments
on that paper.
Chapter 4 on applying semidefinite programming to side-chain positioning appeared
in the Special Issue in Computational Molecular Biology / Bioinformatics of the IN-
FORMS Journal on Computing (Chazelle et al., 2004). (A preliminary version appeared
in (Chazelle et al., 2003).) This work is also joint work with Bernard Chazelle and Mona
Singh, and I thank Tony Wirth for helpful comments on the manuscript.
Chapter 5 is joint work with Elena Zaslavsky and Mona Singh. I thank Elena Nabieva
for her comments on an earlier version of the manuscript that became this chapter.
I was in part supported by a DIMACS Graduate Student Summer Award while working
on Chapter 6, which is joint work with Mona Singh. Thanks to Jessica Fong for reading
this chapter.
Bernard Chazelle and Olga Troyanskaya each gave helpful and much appreciated com-
ments on the entire thesis in their capacity as readers on my thesis committee. I am
also grateful to Rob Schapire and Michael Hecht for serving on my thesis committee as
non-readers.
This work was supported financially through my advisor and Princeton University via
the National Science Foundation and the Defense Advanced Research Projects Agency.
For the work described here, at least 50,000 lines of software code were written. I
would like to thank my advisor, Mona Singh, for making these many lines of code seem
worthwhile and for her consistent insights.
Jessica Fong played a special role in this thesis, keeping the hours not spent with those
lines of code dependably enjoyable.
Finally, I especially want to thank my parents, Howard and Geraldine, and sister,
Carriann, for their support and encouragement.
Contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
1 Introduction 1
2 The Side-chain Positioning Problem 5
2.1 Brief Biological Background . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 The Side-chain Positioning Problem . . . . . . . . . . . . . . . . . . . . . 6
2.3 Formal Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.4 Applications and Successes . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.5 Previous Methods to Solve the SCP Problem . . . . . . . . . . . . . . . . 11
2.6 The Inapproximability of Side-chain Positioning . . . . . . . . . . . . . . . 12
3 Solving and Analyzing Side-chain Positioning Problems Using Linear
and Integer Programming 16
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.1.1 Biological Relevance . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.1.2 Practical Implications . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2.1 Integer Linear Programming Formulation . . . . . . . . . . . . . . 20
3.2.2 Multiple Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2.3 LP/ILP Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2.4 Integrality Gap of the LP Relaxation . . . . . . . . . . . . . . . . . 25
3.2.5 Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2.6 Rotamer Library and Structure Manipulation . . . . . . . . . . . . 30
3.2.7 Energy Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2.8 Evaluating Predicted Structures . . . . . . . . . . . . . . . . . . . 34
3.3 Computational Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3.1 Native Backbone Tests . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3.2 Homology Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.3.3 Protein Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.4 Other Energy Functions . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3.5 Obtaining Multiple Solutions . . . . . . . . . . . . . . . . . . . . . 45
3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4 A Semidefinite Programming Approach to Side-Chain Positioning with
New Rounding Strategies 50
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2 A Semidefinite Programming Heuristic . . . . . . . . . . . . . . . . . . . . 53
4.2.1 Projection Rounding . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.2.2 Perron-Frobenius Rounding . . . . . . . . . . . . . . . . . . . . . . 59
4.3 Computational Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.3.1 Cold Shock Protein . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.3.2 Triose Phosphate Isomerase . . . . . . . . . . . . . . . . . . . . . . 67
4.3.3 Uniform Random Graphs . . . . . . . . . . . . . . . . . . . . . . . 68
4.3.4 Neighborhood Random Graphs . . . . . . . . . . . . . . . . . . . . 70
4.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5 Improving a Mathematical Programming Approach for Motif Finding 75
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.2 Formal Problem Specification . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.3 Integer and Linear Programming Formulations . . . . . . . . . . . . . . . 79
5.3.1 Original Integer Linear Programming Formulation . . . . . . . . . 79
5.3.2 New Integer Linear Programming Formulation . . . . . . . . . . . 79
5.3.3 Advantages of IP2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.3.4 Linear Programming Relaxation . . . . . . . . . . . . . . . . . . . 82
5.3.5 Integrality Gap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.4 Computational Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.4.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.4.2 Test Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.4.3 Performance of the LP Relaxations . . . . . . . . . . . . . . . . . . 92
5.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6 Identifying Functionally Related Yeast Proteins Using Inferred Evolu-
tionary History 96
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.1.1 Our Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.1.2 Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.2.1 Computing the Phylogenetic Profiles . . . . . . . . . . . . . . . . . 102
6.2.2 Mutual Information . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.2.3 Definition of Pathways . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.2.4 Framework for Inferring Ancestral Gene State . . . . . . . . . . . . 106
6.2.5 Comparing Profiles Using Tree Labelings . . . . . . . . . . . . . . 113
6.2.6 Assessing Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
6.3 Computational Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
6.3.1 Limit on ROC Performance of Phylogenetic Profiles . . . . . . . . 116
6.3.2 Performance of Mutual Information for Finding Functional Linkages 121
6.3.3 Predicting Linkages From Shared Gains and Losses . . . . . . . . . 123
6.3.4 Predicting Linkages By Comparing Likelihoods . . . . . . . . . . . 124
6.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
6.5 List of Pathways and the Phylogenetic Tree . . . . . . . . . . . . . . . . . 133
7 Conclusion and Future Work 138
List of Figures
2.1 SCP graph formulation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Converting a 3-CNF formula to an SCP problem. . . . . . . . . . . . . . . 14
3.1 Interacting pairs of residues in two proteins. . . . . . . . . . . . . . . . . . 23
3.2 Flow chart of the LP/ILP approach. . . . . . . . . . . . . . . . . . . . . . 26
3.3 Bad integrality ratio example. . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.4 The distribution of rotamers in the rotamer library. . . . . . . . . . . . . . 31
3.5 The 6–12 Lennard-Jones van der Waals force. . . . . . . . . . . . . . . . . 32
3.6 The average rmsd over 5 proteins for various values of C. . . . . . . . . . 33
3.7 Native backbone running times. . . . . . . . . . . . . . . . . . . . . . . . . 35
3.8 Native test set χ1 angle errors. . . . . . . . . . . . . . . . . . . . . . . . . 37
3.9 Performance compared with exposed surface error. . . . . . . . . . . . . . 37
3.10 Distribution of mistakes broken down by amino acid. . . . . . . . . . . . . 38
3.11 Design problem solve times. . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.12 Design problem relative gaps. . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.13 Near-optimal solution relative gaps. . . . . . . . . . . . . . . . . . . . . . 46
4.1 Geometry of the solution vectors. . . . . . . . . . . . . . . . . . . . . . . . 58
4.2 Cold-shock protein. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.3 Triose phosphate isomerase. . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.4 Largest eigenvalues of uniform random graphs. . . . . . . . . . . . . . . . 70
4.5 Gaps and rounded energies for uniform random graphs. . . . . . . . . . . 71
4.6 Gaps and rounded energies for neighborhood random graphs. . . . . . . . 72
5.1 Schematic of the new ILP formulation. . . . . . . . . . . . . . . . . . . . . 81
5.2 Notation used in the faster ILP formulation. . . . . . . . . . . . . . . . . . 84
5.3 Graph used to show the new ILP can be made as tight as the old one. . . 84
5.4 Bad example for the heuristic set of constraints. . . . . . . . . . . . . . . . 87
5.5 Bad integrality ratio example for motif finding. . . . . . . . . . . . . . . . 90
5.6 Speed up and size reduction of new ILP formulation. . . . . . . . . . . . . 93
6.1 Transforming e-values to probabilities. . . . . . . . . . . . . . . . . . . . . 103
6.2 Fraction of S. cerevisiae genes that have homologs. . . . . . . . . . . . . . 104
6.3 Sizes of the KEGG pathways. . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.4 Gene gain / loss probability model. . . . . . . . . . . . . . . . . . . . . . . 108
6.5 Plot of the function maximized in the M-step. . . . . . . . . . . . . . . . . 112
6.6 Maximal ROC performance for KEGG pathways. . . . . . . . . . . . . . . 116
6.7 Colliding profiles. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
6.8 Colliding profiles at Hamming distance 10% . . . . . . . . . . . . . . . . . 120
6.9 ROC analysis of mutual information. . . . . . . . . . . . . . . . . . . . . . 121
6.10 Per-function analysis of mutual information. . . . . . . . . . . . . . . . . . 122
6.11 Comparison of SGL with binary MI . . . . . . . . . . . . . . . . . . . . . 123
6.12 Root label probabilities for LRATIO . . . . . . . . . . . . . . . . . . . . . 124
6.13 Transition probabilities for LRATIO . . . . . . . . . . . . . . . . . . . . . 125
6.14 Complementary evolution schematic. . . . . . . . . . . . . . . . . . . . . . 127
6.15 Comparing LRATIO to MI. . . . . . . . . . . . . . . . . . . . . . . . . . . 128
6.16 Pathways ranked the highest by LRATIO. . . . . . . . . . . . . . . . . . . 129
6.17 Comparison of LRATIO using real tree and a random tree. . . . . . . . . 130
6.18 Shared presence and shared absence edges. . . . . . . . . . . . . . . . . . . 132
List of Tables
3.1 Native backbone problem sizes. . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2 Homology modeling problems and their sizes. . . . . . . . . . . . . . . . . 29
3.3 Run time for homology modeling test set. . . . . . . . . . . . . . . . . . . 30
3.4 Average performance on native test set. . . . . . . . . . . . . . . . . . . . 36
3.5 Average performance on homology modeling problems. . . . . . . . . . . . 40
3.6 Redesign test set and solve times. . . . . . . . . . . . . . . . . . . . . . . . 42
5.1 Sizes for motif finding problems. . . . . . . . . . . . . . . . . . . . . . . . 92
Chapter 1
Introduction
In this thesis, we present computational approaches to solve several problems in protein
structure and function. Algorithms and computational methods are becoming essential
to make sense of the vast amount of data that is now available about the workings of the cell.
Our main task in this thesis is the development of algorithmic and analytical methods for
understanding proteins — the workhorse molecules of life — and their function. In partic-
ular, we address the problems of predicting and designing protein structures, discovering
protein binding sites in DNA, and assigning proteins to biological pathways.
In the first part of this thesis, we focus on protein structure. A central problem
in molecular biology is that of predicting a protein’s three-dimensional fold when given
only its one-dimensional amino acid sequence. The structure of a protein plays a critical
role in its function. While the number of known protein sequences is growing rapidly,
their corresponding protein structures are being determined at a significantly slower pace.
Despite decades of work, the problem of predicting the 3D structure of a protein from its
amino acid sequence remains unsolved. In the first part of this thesis, we consider the side-
chain positioning (SCP) problem, a challenging and important component of the general
protein structure prediction problem, with applications in homology modeling and protein
design. For side-chain positioning, the task is to find the lowest energy conformation of
a protein’s side chains on a given, fixed backbone. In Chapters 2, 3, and 4, we study
a widely-used version of the problem where the side-chain positioning procedure uses a
library of discrete side-chain conformations and an energy function that can be expressed
as a sum of pairwise terms. In practice, this problem is tackled by a variety of general
search techniques and specialized heuristics. Chapter 2 is an introduction to the SCP
problem, previous methods that have been used to solve it, and some successes obtained
with these methods. It also provides a brief overview of the biology behind the problem,
and we show that it is NP-hard to find even a reasonable approximate solution to the
problem.
In Chapter 3 we present an integer linear programming (ILP) formulation of side-chain
positioning. Formulating the SCP problem as an ILP requires that we represent the space
of possible side-chain conformations using a system of binary variables and express the
quality of a conformation as a linear function of those variables. A low-energy structure
will arise from a setting of these variables that minimizes this function. Because it is
difficult to enforce, we relax the constraint that the value of the variables be either 0 or
1 to give a polynomial-time linear programming (LP) heuristic that allows us to tackle
large problem sizes.
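As a rough sketch of this construction (a hypothetical toy instance, not the exact formulation of Chapter 3), the binary variables and the two families of constraints can be laid out explicitly; the objective then minimizes the sum of node energies times node variables plus pairwise energies times edge variables.

```python
# Hypothetical instance: 3 positions with 2, 3, and 2 candidate rotamers.
rotamers_per_position = [2, 3, 2]
p = len(rotamers_per_position)

# Binary node variables x[i, r]: rotamer r is chosen at position i.
node_vars = [(i, r) for i in range(p)
             for r in range(rotamers_per_position[i])]

# Binary edge variables y[i, r, j, s] (i < j): rotamers ir and js are
# both chosen, so their pairwise energy term is paid.
edge_vars = [(i, r, j, s) for i in range(p) for j in range(i + 1, p)
             for r in range(rotamers_per_position[i])
             for s in range(rotamers_per_position[j])]

def ekey(i, r, j, s):
    """Normalize an edge-variable key so positions appear in order."""
    return (i, r, j, s) if i < j else (j, s, i, r)

# Choice constraints: exactly one rotamer per position,
#   sum_r x[i, r] = 1.
choice_constraints = [[(i, r) for r in range(rotamers_per_position[i])]
                      for i in range(p)]

# Consistency constraints tying edges to nodes: for each node (i, r)
# and each other position j,  sum_s y[i, r, j, s] = x[i, r].
consistency = [((i, r), [ekey(i, r, j, s)
                         for s in range(rotamers_per_position[j])])
               for (i, r) in node_vars
               for j in range(p) if j != i]
```

Relaxing the requirement that these variables be 0 or 1 to the interval [0, 1] yields the linear program.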
To test the effectiveness of the heuristic, we apply it to place side chains on native and
homologous backbones and to choose amino acid types for protein design. Surprisingly,
when positioning side chains on native and homologous backbones, optimal solutions using
a biologically relevant energy function can usually be found using LP. On the other hand,
design problems often cannot be solved using LP directly, but optimal solutions for large
instances can still be found using the computationally more expensive ILP procedure. We
briefly explore how the choice of energy function used to evaluate the quality of a
structure affects the difficulty of the problem. While the ease with which solutions can be
found does vary with the choice of energy function, the LP/ILP approach described in
Chapter 3 is able to find optimal solutions for the energy function variants we considered.
Our analysis is the first large-scale demonstration that LP-based approaches are highly
effective in finding optimal (and near-optimal) solutions for the side-chain positioning
problem, and our success in finding optimal solutions puts the theoretical hardness results
into practical context.
Because solutions to the LP relaxation of design problems often have variables set to
values that are not 0 or 1, we also explore a tighter relaxation to the integer program.
In Chapter 4, we write the side-chain positioning problem as an instance of semidefinite
programming (SDP). We introduce two novel rounding schemes and provide theoretical
justifications for their effectiveness under various conditions. We extensively test the
SDP formulation and the new rounding schemes on simulated data, as well as on the
computational redesign of two naturally occurring protein cores. We show that our SDP
approach generally finds good solutions. The rounding schemes should be applicable
outside the context of side-chain positioning.
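As a rough illustration of eigenvector-based rounding (using a small hypothetical affinity matrix and a plain power iteration, not the actual SDP solution matrices or the precise schemes of Chapter 4), one can extract a dominant eigenvector, which Perron-Frobenius theory guarantees is entrywise nonnegative for a nonnegative matrix, and then keep the largest component within each position's block:

```python
import math

def power_iteration(M, iters=200):
    """Approximate the dominant eigenvector of a symmetric matrix M
    (given as a list of rows) by repeated multiplication."""
    n = len(M)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical affinity matrix over 2 positions x 2 rotamers, indexed
# (pos0 rot0, pos0 rot1, pos1 rot0, pos1 rot1); the strong 0.9 entry
# couples position 0's rotamer 1 with position 1's rotamer 0.
M = [[1.0, 0.0, 0.2, 0.1],
     [0.0, 1.0, 0.9, 0.1],
     [0.2, 0.9, 1.0, 0.0],
     [0.1, 0.1, 0.0, 1.0]]
v = power_iteration(M)

# Round: within each position's block of entries, keep the rotamer
# whose eigenvector component is largest in magnitude.
blocks = [(0, 2), (2, 4)]
choice = [max(range(lo, hi), key=lambda k: abs(v[k])) - lo
          for lo, hi in blocks]
```

Here the rounding recovers the strongly coupled pairing: rotamer 1 at position 0 and rotamer 0 at position 1.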
In the second part of this thesis, we study the problem of predicting a protein’s bind-
ing sites in DNA. One of the most important roles that proteins play in the cell is the
regulation of the expression of other proteins, and in Chapter 5 we consider the problem
of finding transcription factor binding sites in the regions of DNA upstream of genes. The
motif-finding problem is that of locating a collection of mutually similar subsequences
within a given set of DNA sequences; these subsequences often correspond to regulatory
elements, the discovery of which can help explain the circuitry of the cell. We study a com-
binatorial framework for the motif-finding problem, where the goal is to find a minimum
weighted clique in a k-partite graph. Previous approaches to find these cliques have relied
on graph pruning and divide-and-conquer techniques. Though the side-chain positioning
and motif-finding problems seem very different, the same combinatorial problem lies at
their heart. Recently, it has been shown that mathematical programming is a promising
approach for motif finding using an integer program identical to the one we apply to SCP
in Chapter 3. In Chapter 5, we describe a novel, faster integer linear programming formu-
lation for the problem. A key observation driving the improvement is that the weights on
the edges in the graph come from a small set of possibilities. By exploiting this property
we often solve these problems to optimality an order of magnitude (and up to 45 times)
faster than the existing mathematical programming approach. We show that our new
formulation leads to a method that is highly effective in practice on instances arising from
biological sequence data.
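A brute-force rendering of the underlying combinatorial problem (on made-up toy sequences; the thesis's formulations avoid this exhaustive search) makes the k-partite clique view concrete: each part holds the length-l substrings of one sequence, edges are weighted by Hamming distance, and we seek the lightest clique with one node per part.

```python
from itertools import product

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def find_motif(seqs, l):
    """Pick one length-l substring from each sequence so that the sum
    of pairwise Hamming distances (the clique weight) is minimized.
    Exhaustive, so only usable on toy inputs."""
    candidates = [[s[i:i + l] for i in range(len(s) - l + 1)]
                  for s in seqs]
    def clique_weight(choice):
        return sum(hamming(choice[i], choice[j])
                   for i in range(len(choice))
                   for j in range(i + 1, len(choice)))
    return min(product(*candidates), key=clique_weight)

# Toy instance with a planted motif ACGT (mutated to ACGA in seq 3).
seqs = ["TTACGTAA", "GACGTTTG", "CCCACGAT"]
motif = find_motif(seqs, 4)
```

On this instance the minimum-weight clique recovers the planted occurrences, including the mutated one in the third sequence.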
Though they arise from very different biological applications, the two problems of motif finding and
side-chain positioning are connected by their underlying graph problem, and our ap-
proaches to them share the philosophy of focusing on optimal (rather than simply “good”)
solutions while optimizing a clear objective function.
In the third part of the thesis, we investigate protein function more generally, and
develop several new methods for predicting the role a protein plays in the cell using infor-
mation from its evolutionary history by extending the widely used method of phylogenetic
profiles. A phylogenetic profile is a vector indicating whether a protein is present across a
variety of organisms. Similar profiles are taken to indicate similar function, but determin-
ing which profiles should be considered similar is not straightforward. In Chapter 6, we
show that if the measure of similarity incorporates the broad evolutionary relationships
between organisms, it is possible to make better predictions of functional links. We relate
the species involved in the profiles by a species tree and infer whether each protein was
present or absent in the non-extant, ancestor organisms implied by the internal nodes of
the tree. We give two new measures of profile similarity that use these inferred, ances-
tral gene states to better predict shared function. We also investigate the best way to
assess the quality of the predictions and present evidence that considering performance
on each function separately gives a more accurate picture of performance than lumping
all functions together, as is often done.
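The standard profile-comparison measure, which Chapter 6 uses as a baseline, is mutual information between the two presence/absence vectors; a minimal sketch over hypothetical profiles:

```python
from math import log2

def mutual_information(p, q):
    """Mutual information (in bits) between two binary phylogenetic
    profiles, treating each genome as one joint observation."""
    n = len(p)
    joint = {}
    for a, b in zip(p, q):
        joint[(a, b)] = joint.get((a, b), 0) + 1
    mi = 0.0
    for (a, b), count in joint.items():
        pxy = count / n
        px = sum(1 for x in p if x == a) / n
        py = sum(1 for y in q if y == b) / n
        mi += pxy * log2(pxy / (px * py))
    return mi

# Hypothetical profiles over 8 genomes (1 = homolog present).
prof_a = [1, 1, 0, 0, 1, 0, 1, 0]
prof_b = [1, 1, 0, 0, 1, 0, 1, 0]   # identical history to prof_a
prof_c = [1, 0, 1, 0, 1, 0, 1, 0]   # partly unrelated history
mi_ab = mutual_information(prof_a, prof_b)
mi_ac = mutual_information(prof_a, prof_c)
```

Identical profiles score the maximum (here 1 bit); the partly unrelated profile scores lower, which is the ordering a functional-linkage predictor wants. Note this measure ignores the tree structure relating the genomes, which is exactly what the methods of Chapter 6 add.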
In the next chapter, we begin the study of the first of these problems, the side-chain
positioning problem.
Chapter 2
The Side-chain Positioning
Problem
In this chapter, we start with some biological background and introduce the
side-chain positioning problem — the problem of determining a low-energy
configuration of the side-chain atoms of a protein. We describe some ap-
plications and conclude by showing that the combinatorial problem behind
side-chain positioning is hard to approximate.
2.1 Brief Biological Background
A protein molecule is formed from a chain of amino acids. Each amino acid consists
of a central carbon atom, and attached to this carbon are a hydrogen atom, an amino
group (NH2), a carboxyl group (COOH) and a side chain that characterizes the amino
acid. Side chains vary in composition; for example, the side chain for the amino acid
glycine consists of a single hydrogen atom, and the side chain for the amino acid alanine
consists of a carbon atom with three hydrogen atoms attached. The amino acids of a
protein are connected in sequence, with the carboxyl group of one amino acid forming
a peptide bond with the amino group of the next amino acid. This forms the protein
backbone, and the repeating amino acid units (also called residues) within the protein
consist of both the main-chain atoms that comprise the backbone as well as the side-
chain atoms. There are 20 commonly occurring amino acids, and each protein molecule
is specified by a sequence corresponding to the amino acids that make it up. Whereas a
protein’s sequence immediately reveals its chemical composition, its structure, specified
by the spatial coordinates of its main-chain and side-chain atoms, is significantly more
difficult to determine.
It is generally believed that a protein’s native structure corresponds to the conforma-
tion with the minimum global free energy. It follows that one approach to predict protein
structure computationally is to start with the protein’s amino acid sequence, specify an
appropriate energy function, and find the conformation that minimizes the energy. A sim-
ilar strategy can be used to design novel sequences by searching for a sequence that,
when placed onto a backbone, yields a low energy structure. Protein structures are diffi-
cult to predict and design due to inaccuracies in energy functions as well as the infeasibility
of computationally searching over all possible conformations; in practice, predictions are
often made by settling for less than optimal solutions when considering imperfect energy
functions.
2.2 The Side-chain Positioning Problem
In the side-chain positioning problem (SCP), one is given a fixed backbone and a protein
sequence, and the task is to predict the best conformation of the protein’s side chains on
this backbone. SCP is a key step in computational methods for predicting and designing
protein structures (see, e.g., (Summers and Karplus, 1989; Holm and Sander, 1991; Lee
and Subbiah, 1991; Ventura and Serrano, 2004; Park et al., 2004)).
The problem is made discrete by the observation that in actual protein structures side
chains tend to occupy one of a small number of conformations (Ponder and Richards,
1987), called rotamers. These rotamers are identified by finding frequently occurring side-
chain conformations in databases of protein structures. Common conformations for each
side chain are collected into rotamer libraries (e.g. (Dunbrack Jr and Karplus, 1993)).
The total energy of the molecule is expressed as a sum of pairwise energies between atoms
(i.e., when computing energies, only two residues are considered at a time). SCP can then
be formulated as a combinatorial optimization problem: choose a rotamer for each side
chain such that the overall energy of the molecule is minimized (see Section 2.3).
This formulation of SCP has been the basis of some of the more successful methods for
homology modeling (e.g., (Lee and Subbiah, 1991; Petrey et al., 2003; Xiang and Honig,
2001; Jones and Kleywegt, 1999; Bower et al., 1997)), and protein design (e.g., (Dahiyat
and Mayo, 1997; Malakauskas and Mayo, 1998; Looger et al., 2003)). In homology mod-
eling, the goal is to predict the structure for a protein that is homologous (operationally
defined as having similar sequence) to another one of known structure. When two pro-
teins have high sequence similarity, they almost always have a similar overall shape, and
thus the backbone of the protein of known structure can be used as a reasonable tem-
plate backbone for the protein under investigation. In the homology-modeling setting,
the rotamers for a single amino acid are considered at each position.
In protein design (or redesign), the goal is to find a sequence of amino acids that will
fold into a given shape. Though the goal in the design setting is very different from that in
homology modeling, the underlying formulation for both problems is identical. The design
problem is often reduced to SCP by the following method: rather than specifying exactly
the amino acid at each position, we allow the optimization problem to choose among
rotamers from several different types of amino acids at each position. The optimization
problem is solved and the amino acid that corresponds to the rotamer that was chosen at
position i is taken to be the ith amino acid in the sequence. This sequence is the one that
best fits this backbone, and thus, it is hoped, will fold into this shape. This approach has
led to some dramatic successes in protein design, including the design of a 28-residue zinc
finger domain that folds in the absence of zinc (Dahiyat and Mayo, 1997) and the design
of novel receptor proteins for several small molecules (Looger et al., 2003).
2.3 Formal Problem Description
The SCP problem can be stated as follows (Desmet et al., 1992): given a fixed backbone
of length p, each residue position i is associated with a set of possible candidate rotamers
{ir}. Once a single rotamer for each residue position has been chosen, the potential energy
of a protein system is given by the formula
E = E_0 + \sum_i E(i_r) + \sum_{i<j} E(i_r, j_s) ,    (2.1)
where E_0 is the self-energy of the backbone, E(i_r) is the energy resulting from the interac-
tion between the backbone and the chosen rotamer i_r at position i as well as the intrinsic
energy of rotamer i_r, and E(i_r, j_s) accounts for the pairwise interaction energy between
chosen rotamers i_r and j_s. In this discretized setting, the placement of each side chain
is reduced to finding an assignment of rotamers to positions that minimizes the overall
energy of the system (the global minimum energy conformation, or GMEC).
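A direct transcription of formula (2.1), together with the exhaustive search it implicitly defines (viable only for tiny, made-up instances), looks like this:

```python
from itertools import product

def total_energy(rot, E0, E1, E2):
    """Energy of formula (2.1) for one rotamer choice per position:
    E0 is the backbone self-energy, E1[i][r] the rotamer-backbone
    (plus intrinsic) term, and E2[(i, r, j, s)] the pairwise term
    for positions i < j (absent keys mean zero interaction)."""
    p = len(rot)
    return (E0
            + sum(E1[i][rot[i]] for i in range(p))
            + sum(E2.get((i, rot[i], j, rot[j]), 0.0)
                  for i in range(p) for j in range(i + 1, p)))

def gmec(num_rotamers, E0, E1, E2):
    """Global minimum energy conformation by exhaustive search; the
    search space is the product of the rotamer counts per position,
    so this is only usable on toy problems."""
    return min(product(*(range(n) for n in num_rotamers)),
               key=lambda rot: total_energy(rot, E0, E1, E2))

# Two positions, two candidate rotamers each (made-up energies): the
# favorable -2.0 pair term pulls the solution away from the per-
# position minima.
E1 = [[0.0, 1.0], [1.0, 0.0]]
E2 = {(0, 0, 1, 1): 5.0, (0, 1, 1, 0): -2.0}
best = gmec([2, 2], 0.0, E1, E2)
```

The exponential growth of this search space is what motivates the LP, ILP, and SDP formulations of the following chapters.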
A pairwise additive energy of the form (2.1) can capture most of the forces that are
commonly taken into account in energetic calculations of protein structure. Empirical
force fields that approximate the forces acting on a system of atoms are often used in
molecular dynamics simulations and other energetic calculations. Two common force fields
used in these simulations and in energy minimization are AMBER (Cornell et al., 1995)
and CHARMM (Brooks et al., 1983; MacKerell et al., 1998), and these force fields are often
set up to include terms that model van der Waals interactions, preferred dihedral angles,
bond angles, and bond lengths. Each of these terms is inherently pairwise. Electrostatic
effects are also important in the dynamics of the molecule and are modeled by distributing
fractional point charges to the atom centers in the molecule (taking into account only
Figure 2.1: The graph formulation. In this hypothetical example, there are four positions in the protein, one with two rotamer possibilities and the others with three rotamer possibilities each.
chemical, but not three-dimensional, structure), and summing the pairwise Coulombic
potential between these point charges. With this common approach, the electrostatic
energy term can be fit into the framework of (2.1). On the other hand, terms to account
for the effects of solvent usually depend on a molecule’s exposed surface area and do not
fit well into this framework. However, many approaches to protein structure prediction
and design consider proteins in a vacuum and do not explicitly model this term.
It is convenient to reformulate the SCP problem in graph-theoretic terms. Let G be an
undirected p-partite graph with node set V1∪· · ·∪Vp , where Vi includes a node u for each
rotamer ir at position i; the Vi’s may have varying sizes. Each node u of Vi is assigned
a weight Euu = E(ir); each pair of nodes u ∈ Vi and v ∈ Vj (i 6= j), corresponding to
rotamers ir and js respectively, is joined by an edge with a weight of Euv = E(irjs).
Zero-weight edges can be thought of as equivalent to the absence of an edge. The global
minimum energy conformation is achieved by picking one node per Vi to minimize the
weight of the induced subgraph.
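This induced-subgraph view can be made concrete with a small brute-force search. The sketch below is illustrative only: the names and toy energies are assumptions of this example, not the thesis's software, and the exhaustive enumeration is exponential in the number of positions.

```python
from itertools import product

def gmec_energy(self_E, pair_E):
    """Exhaustively find the global minimum energy conformation (GMEC).

    self_E[i][r]         -- self-energy E(i_r) of rotamer r at position i
    pair_E[(i, j)][r][s] -- pairwise energy E(i_r j_s) for i < j (absent => 0)
    Returns (best_energy, best_choice) where best_choice[i] is the rotamer
    index chosen at position i.
    """
    p = len(self_E)
    best = (float("inf"), None)
    for choice in product(*(range(len(rs)) for rs in self_E)):
        e = sum(self_E[i][choice[i]] for i in range(p))
        for (i, j), table in pair_E.items():
            e += table[choice[i]][choice[j]]
        if e < best[0]:
            best = (e, choice)
    return best

# Two positions with two rotamers each; the cross term makes (0, 1) optimal.
self_E = [[1.0, 2.0], [0.5, 0.0]]
pair_E = {(0, 1): [[4.0, 0.0], [0.0, 4.0]]}
print(gmec_energy(self_E, pair_E))  # -> (1.0, (0, 1))
```

The same enumeration would take time exponential in p, which is exactly why the DEE, LP, and ILP machinery discussed later is needed.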
When applied to protein structure prediction using either a native backbone or the
backbone of a homologous protein, because the sequence is known, the rotamers in each
set Vi are conformations of a single, particular amino acid. In contrast, in the design
scenario, it is the sequence itself that is the object of the search. This is handled by
putting rotamers from several amino acids into the sets Vi. The designed sequence is
taken to be the sequence of amino acids corresponding to the chosen rotamers in the
low-energy solution, under the assumption that if a sequence fits well onto a backbone, it
will fold into that shape.
2.4 Applications and Successes
The formulation of the SCP problem studied in this thesis has been at the heart of some
dramatic protein design successes. The first major success was the redesign of a 28-residue
zinc finger so that it folds into the native shape without the presence of zinc (Dahiyat and
Mayo, 1997). The Zif268 zinc finger domain consists of an alpha helix and a two-stranded
beta sheet (ββα motif). In the wild type, the beta-sheet and alpha-helix are held together
by interactions between a zinc ion and two cysteine and two histidine residues. A new
sequence was designed that folds into the same ββα motif without requiring coordination
by the zinc atom. Rather than allowing all amino acids at all positions, the redesign
allowed mutations to Ala, Val, Leu, Ile, Phe, Tyr, and Trp in the core and Ala, Ser, Thr,
His, Asp, Asn, Glu, Gln, Lys, and Arg for surface residues. Any of the preceding amino
acids were allowed in boundary positions. The total search space consisted of 1.1 × 10^62
possible choices of rotamers, and finding the optimal rotamer sequence via DEE required
37 CPU hours (on 3.9 GFLOPS processors), while the energy calculations required 53
CPU hours. The same formulation was used to redesign several boundary residues of
the β1 domain of protein G from Streptococcus in order to increase the stability of the
protein (Malakauskas and Mayo, 1998).
More recently, the SCP problem has also been used to redesign protein-protein inter-
actions as well as protein-ligand interactions. Novel receptor proteins have been designed
to bind to trinitrotoluene, L-lactate, and serotonin (Looger et al., 2003). Many ligand
orientations are docked to the protein that is to be redesigned. For each orientation of
the ligand, the ligand-contacting side chains (12–18 residues) are redesigned using DEE.
For each of the ligand-protein pairs, the design process takes about 60 CPU days. In (Kortemme et al., 2004), the E7/Im7 protein-protein interface was redesigned to create several
pairs of mutants that bind preferentially to each other but not with the wild type. Side-
chain repacking is used as a subroutine, applied to successive fixed backbones.
In general, the protein-protein interface is often mediated by interlocking side chains,
and thus side-chain positioning can be a valuable way to augment rigid-body transforma-
tions when trying to dock one protein onto another. See, for example, (Grey et al., 2003;
Wang et al., 2005).
Side-chain positioning finds its way into methods to design completely novel proteins.
The design of a 93-residue α/β protein with a novel fold (dubbed Top7) was reported
in (Kuhlman et al., 2003) using the method first described in (Kuhlman and Baker,
2000). There the method of design alternates between side-chain positioning on a fixed
backbone and continuous relaxation of the backbone given a designed sequence.
Homology modeling is another widely used application for this formulation of the SCP
problem. The popular Scwrl package allows rapid low-resolution predictions of structures
by using homologous proteins as backbones (Bower et al., 1997). When homologous
backbones are not available, SCP has been used with much success in ab initio structure
prediction — that is, determining the complete 3D coordinates of the atoms of a protein
given only its amino acid sequence. For example, the Rosetta program (Simons et al.,
1997) repacks side chains by choosing rotamers from a rotamer library as it performs a
Monte Carlo search for an optimal backbone shape. For speed, simulated annealing is
used to find a good choice of rotamers. See, e.g., (Aloy et al., 2003) for a discussion of
other recent successes in ab initio fold prediction.
2.5 Previous Methods to Solve the SCP Problem
There has been considerable progress in development of both exhaustive and heuristic
techniques for this problem. Within the past dozen years, a series of papers on dead-end
elimination have given rules for throwing out rotamers that cannot possibly be in the
optimal solution (e.g., (Desmet et al., 1992; Desmet et al., 1994; Goldstein, 1994; Lasters
et al., 1995; Gordon and Mayo, 1998; Looger and Hellinga, 2001; Gordon et al., 2002)).
In Chapter 3 we will use the Goldstein variant (Goldstein, 1994) of the dead-end
elimination condition to prune some instances. Following the notation of Section 2.3, this
condition says that a rotamer u ∈ Vi can be thrown out if there is some other rotamer
v ∈ Vi such that
    Euu − Evv + ∑_{j≠i} min_{w∈Vj} (Euw − Evw) > 0 .
The rotamers u are selected in sequence starting with an arbitrary rotamer. Every possible
v is tested to see if the above condition holds, indicating that rotamer u can be eliminated.
This process stops when a pass through the rotamers finds none that can be removed.
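A direct, illustrative implementation of this elimination loop might look as follows; the function and variable names, and the toy instance used below, are assumptions of this sketch rather than the thesis's code.

```python
def goldstein_dee(self_E, pair_E):
    """Iteratively eliminate rotamers via the Goldstein condition:
    rotamer u at position i is removed if some v at i satisfies
      E(u) - E(v) + sum_{j != i} min_w [E(u,w) - E(v,w)] > 0.
    self_E[i][r] gives self-energies; pair_E[(i, j)][r][s] gives pairwise
    energies (symmetric access handled below).  Returns the surviving
    rotamer index sets, one per position.
    """
    p = len(self_E)

    def pe(i, r, j, s):  # pairwise energy, either order of positions
        if (i, j) in pair_E:
            return pair_E[(i, j)][r][s]
        if (j, i) in pair_E:
            return pair_E[(j, i)][s][r]
        return 0.0

    alive = [set(range(len(rs))) for rs in self_E]
    changed = True
    while changed:              # stop after a pass eliminating nothing
        changed = False
        for i in range(p):
            for u in list(alive[i]):
                for v in list(alive[i]):
                    if v == u:
                        continue
                    gap = self_E[i][u] - self_E[i][v]
                    gap += sum(min(pe(i, u, j, w) - pe(i, v, j, w)
                                   for w in alive[j])
                               for j in range(p) if j != i)
                    if gap > 0:  # u can never beat v: eliminate u
                        alive[i].discard(u)
                        changed = True
                        break
    return alive

# Position 0's rotamer 0 and position 1's rotamer 1 are dominated.
print(goldstein_dee([[5.0, 0.0], [0.0, 1.0]],
                    {(0, 1): [[1.0, 1.0], [0.0, 0.0]]}))  # -> [{1}, {0}]
```

Note that the pruning is sound but not complete: for hard instances the loop may terminate with many rotamers per position still alive.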
Special-purpose heuristic search techniques for specific energy functions have been
successfully applied, as in the original Scwrl package (Bower et al., 1997). More gen-
eral search methods such as simulated annealing (e.g., (Lee and Subbiah, 1991; Holm
and Sander, 1991)), A∗ (Leach and Lemon, 1998), Monte Carlo search (e.g., (Xiang and
Honig, 2001)), and mean-field optimization (Lee, 1994) have also been used. Special-
ized graph-theoretic approaches have also been developed (Samudrala and Moult, 1998;
Canutescu et al., 2003; Bahadur et al., 2004). Among these previous methods, the ex-
haustive methods always find the optimal solution but are not efficient (i.e., may require
exponential search), whereas the heuristics are efficient but do not guarantee finding the
optimal solution.
2.6 The Inapproximability of Side-chain Positioning
It has been shown that SCP is NP-complete (Pierce and Winfree, 2002). Accordingly,
it is unlikely that there is any efficient algorithm for solving the problem optimally. For
many NP-complete problems, it is possible to develop approximation algorithms that can
efficiently find a suboptimal solution that is within a provable factor of the optimal one.
For SCP, we show below that it is unlikely that there is any approximation algorithm
with a reasonable performance guarantee.
Theorem 2.6.1 It is NP-complete to approximate the minimum energy of the global min-
imum energy conformation (GMEC) within a factor of cn, where c is a positive constant
and n is the total number of rotamers.
One detail is that complexity results are proved for yes/no decision questions. The
SCP problem is an optimization problem in which we are given an instance of a side-
chain positioning problem, and we seek the best conformation as well as its energy. It
is turned into a yes/no decision problem by providing as additional input an integer k,
and asking whether the GMEC of the instance has energy less than k. Note that this
modified problem is not harder than the original optimization version: if one could solve
the optimization version, one could easily solve this yes/no decision version.
Theorem 2.6.1 does not mean that good algorithms and methods for the side-chain
positioning problem cannot be shown to work well in practice. Indeed, several papers have
presented efficient algorithms that seem to work well in practice (e.g. (Xiang and Honig,
2001; Bower et al., 1997)), or algorithms that are designed to find optimal solutions but
complete quickly for some problems (e.g. (Goldstein, 1994; Lasters et al., 1995; Desmet
et al., 1992; Pierce et al., 2000; Kingsford et al., 2005)).
Proof of Theorem 2.6.1. We will prove this theorem by showing that if the SCP problem
has a good approximation algorithm then we would also have an efficient algorithm for a
problem for which it is likely that none exists, namely a problem involving satisfiability
of boolean formulas.
A 3-CNF (conjunctive normal form) formula is a conjunction of clauses, each one
consisting of the disjunction of three literals (not necessarily distinct). An example of
such a boolean formula is shown at the top of Figure 2.2. In that figure, letters a, b, . . .
represent variables and a, b, . . . represent the negation of those variables. ∨,∧ represent
Figure 2.2: Converting a 3-CNF formula to an SCP problem.
“or” and “and,” respectively. A formula is satisfiable if the variables can be assigned
values true or false such that the whole formula is true. Given a 3-CNF formula, it is
NP-complete to determine whether it is satisfiable.
The PCP theorem (Arora et al., 1998; Arora and Safra, 1998) asserts that, given any
3-CNF formula Φ on n variables, there exists another one, denoted by Ψ, which contains
nO(1) variables and is satisfiable if and only if Φ is satisfiable. Furthermore, if Ψ is not
satisfiable, then it is strongly unsatisfiable, meaning that no truth assignment can satisfy
more than a fraction α of its clauses, for some constant 0 < α < 1. Finally, Ψ can
be derived from Φ in polynomial time. Since 3-CNF satisfiability is NP-complete, it is
then also NP-complete to distinguish between formulas that are satisfiable and those that
are strongly unsatisfiable. (If there is an efficient algorithm that distinguishes between
such formulas, then any 3-CNF formula can be efficiently tested for satisfiability by first
converting it to a strongly unsatisfiable formula and then using this algorithm.)
Given a 3-CNF formula with p clauses that is either satisfiable or strongly unsatisfiable,
we create an SCP problem such that if the formula is satisfiable then the GMEC = 1,
but if the formula is not satisfiable then the GMEC will tell us how many clauses can be
satisfied in the original 3-CNF.
We build a (p + 1)-partite graph G as follows: each clause i corresponds to a set Vi of
4 vertices. In each Vi three vertices are associated with the literals of clause i. These
vertices have no self-weights. Two vertices in Vi and Vj are joined in G if and only if the
literal of one is the negation of the other. Each such edge is assigned weight 3. The 4th
vertex in each Vi is an “extra” vertex with no adjacent edges and vertex weight 1. We
add an additional position with a single node of weight 1. The total number of rotamers,
or nodes, in this SCP instance is n = 4p + 1. This reduction is depicted in Figure 2.2.
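The reduction can be sketched in code. The data layout below is a choice made for this example, not a construction taken from the thesis, and the brute-force check of the resulting GMEC is only feasible for tiny formulas.

```python
from itertools import product

def cnf_to_scp(clauses):
    """Build the SCP instance of the hardness reduction (a sketch).

    clauses: list of 3-tuples of literals; a literal is (var, is_positive).
    Returns (weights, edges): weights[i][k] is the self-weight of vertex k
    in part V_i; edges[(i, k, j, l)] = 3 joins complementary literals.
    Parts 0..p-1 hold three weight-0 literal vertices plus a weight-1
    "extra" vertex; part p is the mandatory single weight-1 vertex.
    """
    p = len(clauses)
    weights = [[0, 0, 0, 1] for _ in range(p)] + [[1]]
    edges = {}
    for i, ci in enumerate(clauses):
        for j, cj in enumerate(clauses):
            if i >= j:
                continue
            for k, (vu, su) in enumerate(ci):
                for l, (vv, sv) in enumerate(cj):
                    if vu == vv and su != sv:  # literal vs. its negation
                        edges[(i, k, j, l)] = 3
    return weights, edges

def gmec(weights, edges):
    """Brute-force GMEC of the constructed instance (tiny inputs only)."""
    return min(
        sum(w[c] for w, c in zip(weights, choice)) +
        sum(e for (i, k, j, l), e in edges.items()
            if choice[i] == k and choice[j] == l)
        for choice in product(*(range(len(w)) for w in weights)))

# (a or b or c) and (not-a or b or c): satisfiable, so the GMEC is 1.
clauses = [(("a", True), ("b", True), ("c", True)),
           (("a", False), ("b", True), ("c", True))]
print(gmec(*cnf_to_scp(clauses)))  # -> 1
```

Choosing the literal b in both clauses yields an independent set, so the only cost paid is the mandatory weight-1 vertex, matching the "satisfiable implies GMEC = 1" direction of the argument.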
If the CNF formula is satisfiable then for each Vi we select a literal set to true as the
GMEC vertex. These p vertices form an independent set (i.e., since one cannot set both
a variable and its negation to true, these vertices have no edges between them) and the
energy of the system is 1.
If the CNF formula is not satisfiable then the GMEC is formed by picking the largest
independent set among the vertices, including at most one vertex per Vi, and completing
the selection for the remaining Vi by choosing the fourth “extra” vertex for each. (Picking
any pair of adjacent vertices would be a mistake since that choice could be locally improved
by choosing an isolated vertex in each position of weight 1.) We can set to true the literals
corresponding to the vertices of the independent set. Therefore, the energy of the GMEC
is p − c + 1, where c is the maximum number of satisfiable clauses in the CNF formula.
Because the CNF formula is strongly unsatisfiable, the minimum number of unsatisfied
clauses p−c is at least (1−α)p, and the optimal GMEC of the corresponding SCP problem
is at least (1 − α)p + 1.
Thus, suppose we had an efficient algorithm that was guaranteed to find a solution
at most ((1 − α)/4) n times the optimal. Then for any satisfiable formula, this algorithm would
find a solution to the corresponding SCP problem of value at most ((1 − α)/4) n = (1 − α)p + (1 − α)/4. Since this is
less than (1 − α)p + 1, the minimum possible value of the GMEC corresponding to an
unsatisfiable formula, this algorithm could distinguish between satisfiable and strongly
unsatisfiable formulas, something the PCP Theorem implies we cannot do. �
Chapter 3
Solving and Analyzing Side-chain
Positioning Problems Using
Linear and Integer Programming
We describe an integer linear programming formulation of the side-chain po-
sitioning problem and show how to use integer programming and a linear
programming heuristic to find optimal solutions to large native-backbone, ho-
mology modeling, and design problems. We also show how to find several
near-optimal solutions, which are often useful in protein design. Finally, we
investigate how the method of evaluating the total energy of a choice of ro-
tamers affects the ease with which solutions are found.
3.1 Introduction
In this chapter, we give an integer linear programming (ILP) formulation of the side-chain
positioning problem described in Chapter 2 and show that it can tackle large problem
sizes and can obtain successive near-optimal solutions. Multiple near-optimal solutions
are especially useful in protein design, where it may be desirable to find several possible
sequences for a particular shape.
An optimal solution to the ILP we describe below gives an optimal solution to the
underlying SCP problem. By relaxing the integrality constraint of our ILP, we get a
polynomial-time linear programming (LP) heuristic, where the solution may or may not
correspond to a solution for the SCP problem. Our overall LP/ILP approach for SCP is as
follows. First, we apply LP to an instance of SCP. If the solution using LP is integral, then
that solution is provably the conformation with the global minimum energy, and we have
found, in polynomial-time, the optimal solution to the SCP instance. On the other hand,
if the solution using LP is fractional, we run the computationally more expensive (i.e.,
no longer polynomial-time) ILP procedure to find the optimal SCP solution, after first
preprocessing with Goldstein dead-end elimination (Goldstein, 1994) (see Section 2.5).
Using our LP/ILP approach, we evaluate instances arising when positioning side chains
on native and homologous backbones, and when choosing side chains for protein design.
We show that LP and ILP are highly effective methods for obtaining optimal solutions
for the SCP problem. The LP/ILP approach is shown to tackle problems of size up to
10^218 easily when packing side chains on native and homologous backbones, and of size
up to 10^201 when redesigning protein cores. As proof of principle, we also obtain multiple
(100) near-optimal solutions for a native-backbone SCP problem of size 10^79.
Although mathematical programming approaches to SCP (Althaus et al., 2000; Eriks-
son et al., 2001; Chazelle et al., 2004) have been suggested previously, they are extensively
tested for the first time here, improved to take advantage of the geometry of the protein,
as well as extended to handle larger problem sizes and to find near-optimal solutions.
In (Althaus et al., 2000), an LP relaxation similar to the one we study below is used in
a branch-and-cut framework to solve three small (50–54 side-chain) docking problems.
In (Eriksson et al., 2001), a different LP formulation is used in a branch-and-bound ap-
proach on a single backbone.
SCP methods are commonly evaluated using two scales. In the predictive scale, one
asks how well the side-chain conformations predicted by the method agree with those
that are found in the actual structure; or, in the case of protein design, whether the newly
designed sequence folds into the desired shape. In the combinatorial scale, one asks how
close the total energy resulting from the predicted side-chain conformations is to the lowest
possible minimum energy using the given rotamer library and energy function. Of course,
the predictive scale measures what we are ultimately interested in (i.e., the quality of the
end result). However, the combinatorial scale is useful for improving search algorithms and
energy functions, and such improvements are necessary to get higher-quality predictions
of side-chain conformations. Theoretical results argue that the SCP problem is difficult
on the combinatorial scale: the combinatorial problem underlying SCP is not just NP-
complete (Pierce and Winfree, 2002), but also inapproximable (Section 2.6). That is, it is
unlikely that there exists a polynomial-time method that can guarantee a good (let alone
optimal) solution to SCP for all instances of the problem. However, these are worst-case
results: they may not hold for the classes of problems and energy functions that occur in
practice. In this chapter, we hope to put these theoretical hardness results into practical
context.
We use our LP formulation to probe the difficulty of SCP instances arising in differ-
ent applications. We label an instance as “easy” if LP finds an integral (i.e., optimal)
solution. In contrast, if LP finds a fractional solution, we use this as evidence that the
instance is more difficult to solve. Our computational experiments on 25 native-backbone
problems and 33 homology-modeling problems show that LP can almost always find an
integral solution when using an energy function based on van der Waals interactions and
a statistical rotamer self-energy term. Similar, even simpler, energy functions have been
the basis for successful homology-modeling packages (Bower et al., 1997). Since SCP is
NP-complete, it is intriguing that integral solutions are found so readily, and in these
cases, since a polynomial-time procedure has provably found optimal solutions, it appears
that the theoretical hardness results do not apply in practice. On the other hand, when
using the same energy function on 25 protein design problems of approximately the same
size, the LP does not often find integral solutions. This suggests that the optimization
problems underlying protein design may be considerably more difficult to solve than those
arising in the native- or homology-modeling settings. We also explore how changing the
energy function affects the problem's hardness. The LP approach sometimes finds optimal
solutions under variant energy functions, but its ability to do so depends on the variant
chosen.
3.1.1 Biological Relevance
While our primary goal is to study the combinatorial nature of SCP, in order to verify
that the energy functions considered are appropriate for predicting protein structures for
native and homologous backbones, we compare side-chain conformations predicted by the
LP/ILP approach with those in the native structures. The solutions found for native
and homologous backbones give structures that are comparable in quality to those found
by other methods using the same rotamer library (Bower et al., 1997; Xiang and Honig,
2001).
3.1.2 Practical Implications
There are several immediate practical consequences of our analysis. First, our work argues
that attempts to improve search methods should be focused on protein design problems,
as they seem to be computationally more difficult to solve than homology modeling prob-
lems. Second, in our experience, even seemingly small differences in problem instances can
have a large impact on the ease with which solutions are obtained. This makes it hard to
compare different published benchmarks of SCP algorithms, as these algorithms are often
tested with differing energy functions and in different settings (e.g., design vs. homology
modeling). To facilitate comparisons and to encourage the use of LP/ILP approaches, we
are making our software for generating the LP/ILP publicly available. Third, our analysis
suggests that the choice of an energy function should depend on two factors: how bio-
logically meaningful it is and how it affects the ease with which optimal or near-optimal
solutions are found. For example, a combinatorially “easy” energy function may be useful
in finding a subset of reasonable predictions that can then be evaluated using the desired
energy function. Finally, and most importantly, our analysis includes the first large-scale
test of an LP/ILP approach, and we demonstrate that such an approach provides an effec-
tive and practical technique for solving the SCP problem for both homology modeling and
protein design applications. Decades of research on LP furnish us with highly-developed
machinery that we can exploit; the advantage of relying on this off-the-shelf technology
is that any subsequent progress in optimizing linear programs will translate into faster
running times for our method. While there are many fast heuristics for side-chain po-
sitioning, in many cases, optimal and successive near-optimal solutions are desired. In
these cases, LP-based approaches provide a general, state-of-the-art methodology.
3.2 Methods
3.2.1 Integer Linear Programming Formulation
We first formulate the side-chain positioning problem as an integer linear program (ILP),
so that a solution to the ILP gives an optimal solution to the SCP problem. The ILP is
based on the graph formulation of SCP discussed in Section 2.3. The vertex set of this
graph is V = V1 ∪ · · · ∪ Vp, and its edge set D = {(u, v) : u ∈ Vi, v ∈ Vj, i ≠ j}. Recall
that the nodes correspond to rotamers, and the edges to interactions between rotamers.
We introduce a {0, 1} decision variable xuu for each node u in V , and a {0, 1} decision
variable xuv for each edge in D. Setting xuu to 1 corresponds to choosing rotamer u, and
similarly setting xuv to 1 corresponds to choosing to “pay” the energy between rotamers u
and v. We constrain our optimization so that only one rotamer is chosen per residue, and
so that we pay the cost for edge {u, v} if and only if rotamers u and v are both chosen.
The following integer program ensures these conditions:
    Minimize  E = ∑_{u∈V} Euu xuu + ∑_{{u,v}∈D} Euv xuv
    subject to
        ∑_{u∈Vj} xuu = 1          for j = 1, . . . , p
        ∑_{u∈Vj} xuv = xvv        for j = 1, . . . , p and v ∈ V \ Vj
        xuu, xuv ∈ {0, 1}                                            (IP1)
The first set of constraints ensures that we choose exactly one rotamer for each residue.
The second set of constraints demands that we set the edge variables xuv to 1 for edges
that are in the subgraph induced by the choice of rotamers: if xvv = 0 then no adjacent
edges can be chosen, and if xvv = 1 then exactly one adjacent edge is chosen for each vertex
set. Though derived independently, this formulation is similar to the version of (Althaus
et al., 2000) (without modifying the energies to be negative) and simpler than that of
(Eriksson et al., 2001). Additionally, on the experimental side, (Klepeis et al., 2003) use
a similar integer programming formulation to design variants of the peptide Compstatin
that are predicted to have improved inhibitory activity in complement pathways. However,
theirs is a slightly different model in which side-chain positions are not explicitly represented.
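To see how the (IP1) constraints act, the following sketch evaluates the objective at the integral point induced by a rotamer choice. The names are hypothetical, and a real solver such as CPLEX would of course search over all feasible points rather than evaluate a single one; the point here is that feasible integral points correspond exactly to rotamer choices, with the edge variables forced to the induced subgraph.

```python
def ip1_objective(choice, parts, E_node, E_edge):
    """Evaluate the (IP1) objective at the integral point induced by
    `choice` (one node id per part), after checking its constraints.
    E_node[u] = E_uu; E_edge[frozenset((u, v))] = E_uv (absent => 0).
    """
    chosen = set(choice)
    # One-rotamer-per-position constraints: exactly one node per part.
    assert all(len(chosen & set(part)) == 1 for part in parts)
    # Edge variables are forced to the induced subgraph: here x_uv is 1
    # exactly when both endpoints are chosen (x_uv = x_uu * x_vv).
    pairs = {frozenset((u, v)) for u in choice for v in choice if u != v}
    obj = sum(E_node[u] for u in chosen)
    obj += sum(E_edge.get(pr, 0.0) for pr in pairs)
    return obj

parts = [[0, 1], [2, 3]]                  # V_1 = {0, 1}, V_2 = {2, 3}
E_node = {0: 1.0, 1: 2.0, 2: 0.5, 3: 0.0}
E_edge = {frozenset((0, 2)): 4.0, frozenset((1, 3)): 4.0}
print(ip1_objective((0, 3), parts, E_node, E_edge))  # -> 1.0
```

Because the objective at each feasible integral point equals the energy of the corresponding rotamer choice, minimizing (IP1) over integral points is the same problem as minimizing the induced-subgraph weight.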
In practice, the ILP given above can have many variables and constraints that do not
affect the optimization, and the system can be pruned dramatically. In particular, if all
the pairwise energies between rotamers in positions i and j are non-positive, then we can
remove all variables xuv with u ∈ Vi and v ∈ Vj such that Euv = 0, and modify the
equality constraints in (IP1) that contain such an xuv by removing those variables and
changing “=” to “≤.” Because we are minimizing and all the energies between i and j
are zero or less, this change does not affect the optimal solution. A frequent special case
has zero energies between all rotamers in two positions; this corresponds to residues that
are too far apart in the structure to have any rotamers that interact with each other. The
more general case involves residues that are far enough apart that only a subset of their
rotamers have interactions with each other.
More formally, for each Vj, let N+(Vj) be the set union of the Vi for which there exists
some v ∈ Vi and u ∈ Vj with Euv > 0. Let D′ be the set of pairs {u, v} with u ∈ Vj such
that either v ∈ N+(Vj), or v ∉ N+(Vj) but Euv < 0. There will be edge variables xuv
only for pairs in D′. Our modified ILP is as follows:
    Minimize  E′ = ∑_{u∈V} Euu xuu + ∑_{{u,v}∈D′} Euv xuv
    subject to
        ∑_{u∈Vj} xuu = 1                    for j = 1, . . . , p
        ∑_{u∈Vj} xuv = xvv                  for j = 1, . . . , p and v ∈ N+(Vj)
        ∑_{u∈Vj : Euv<0} xuv ≤ xvv          for j = 1, . . . , p and v ∉ N+(Vj)
        xuu, xuv ∈ {0, 1}                                            (IP2)
An inequality constraint is not included if the sum on the left-hand side is empty.
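The construction of N+(Vj) and D′ can be sketched as follows; the data layout and names are choices made for this example, not the thesis's implementation.

```python
def pruned_edge_vars(parts, E_edge):
    """Compute the reduced edge-variable set D' used in (IP2).

    n_plus[j] holds the indices of parts with at least one positive
    interaction with V_j.  An edge variable is kept when its endpoints'
    parts interact positively somewhere (all such pairs are needed for the
    equality constraints), or when the edge itself has negative energy.
    `parts` lists node ids per part; E_edge maps frozenset pairs to
    energies (absent pairs have energy 0).
    """
    part_of = {u: j for j, part in enumerate(parts) for u in part}
    n_plus = {j: set() for j in range(len(parts))}
    for pair, e in E_edge.items():
        u, v = tuple(pair)
        if e > 0:
            n_plus[part_of[u]].add(part_of[v])
            n_plus[part_of[v]].add(part_of[u])
    d_prime = set()
    for j, part in enumerate(parts):
        for i in n_plus[j]:
            if i > j:  # add every pair between positively-interacting parts
                for u in part:
                    for v in parts[i]:
                        d_prime.add(frozenset((u, v)))
    for pair, e in E_edge.items():
        if e < 0:      # negative edges are kept regardless
            d_prime.add(pair)
    return n_plus, d_prime

parts = [[0, 1], [2, 3], [4]]
E_edge = {frozenset((0, 2)): 2.0, frozenset((1, 3)): 0.0,
          frozenset((0, 4)): -1.0, frozenset((1, 4)): 0.0}
n_plus, d_prime = pruned_edge_vars(parts, E_edge)
print(len(d_prime))  # -> 5: all four V_1-V_2 pairs plus the negative {0,4}
```

In this toy instance the zero-weight edge {1, 4} is dropped, illustrating how positions without positive interactions shed their edge variables.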
The simple modification of (IP1) given in (IP2) is crucial in practice, providing in some
cases an order of magnitude speed-up. For example, when packing side chains, the principal
component of the energy function is the van der Waals force. Since this force
quickly becomes negative, and asymptotically goes to zero, each residue can have positive
interactions with only a few nearby residues. In other words, if p is the total number of
side chains, and if we treat the maximum number of rotamers in a position as a constant,
then the geometry of the problem can reduce the number of variables from O(p^2) to O(p),
where the constant is related to the radius of influence of the van der Waals force and how
many residues can be packed into a sphere of that radius. As an example, Figure 3.1 shows
two proteins in our test set. In Figures 3.1(a) and (b) lines are drawn between residues
with some non-zero energy. For these we must keep at least some variables. Though at
Figure 3.1: (a-b) Interacting residues in 1igd. (a) those pairs that have some non-zero interaction. (b) those pairs of residues that have some positive interaction. (c-d) The analogue of (a-b) for 1aac.
first glance these proteins look quite dense, on closer inspection it can be
seen that far fewer than the O(p^2) pairs are connected by an edge; for many positions no
edge variables are necessary. The effect is more pronounced for elongated proteins and
can be seen clearly at the “tail” of the protein 1igd in Figure 3.1(a). Figures 3.1(b) and
(d) show pairs of residues that have rotamers with some positive energy between them.
These are the only pairs of positions for which we cannot eliminate any variables using
the technique above. It is apparent visually that far fewer than the O(p2) pairs of residues
have positive interactions.
3.2.2 Multiple Solutions
Sometimes it is desirable to find several optimal and near-optimal solutions. In the present
framework, the LP/ILP can be solved iteratively to find an ensemble of low-energy so-
lutions. At iteration m, all previously discovered solutions are excluded by adding the
constraints

    ∑_{u∈Sk} xuu ≤ p − 1        for k = 1, . . . , m − 1            (3.1)

to (IP2), where Sk contains the optimal set of rotamers found in iteration k. This will
require that the new solution differs from all previous ones in at least one position. It
may be desirable to obtain successive solutions that differ more from each other, and this
can be accomplished by replacing p − 1 in (3.1) by p − q, where 1 < q ≤ p. Further, by
summing over only a subset of the positions, we can force, say, a core residue to assume
a different rotamer.
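The effect of the exclusion constraints (3.1) can be illustrated by enumeration on a toy instance. In this sketch exhaustive search stands in for the ILP solver, so it demonstrates the semantics of the constraints, not the method's efficiency; the names are assumptions of the example.

```python
from itertools import product

def k_best(self_E, pair_E, K, q=1):
    """Enumerate successive low-energy solutions, mimicking the exclusion
    constraints (3.1): each new solution must differ from every previously
    found one in at least q positions.  Brute force stands in for the ILP.
    """
    p = len(self_E)

    def energy(choice):
        e = sum(self_E[i][choice[i]] for i in range(p))
        for (i, j), table in pair_E.items():
            e += table[choice[i]][choice[j]]
        return e

    found = []
    for c in sorted(product(*(range(len(r)) for r in self_E)), key=energy):
        # Constraint (3.1): agree with each earlier solution on <= p - q
        # positions, i.e. differ from it in at least q positions.
        if all(sum(a == b for a, b in zip(c, s)) <= p - q for s in found):
            found.append(c)
            if len(found) == K:
                break
    return [(energy(c), c) for c in found]

print(k_best([[1.0, 2.0], [0.5, 0.0]],
             {(0, 1): [[4.0, 0.0], [0.0, 4.0]]}, K=3))
# -> [(1.0, (0, 1)), (2.5, (1, 0)), (5.5, (0, 0))]
```

Raising q, or restricting the agreement count to a subset of positions, enforces more diverse successive solutions, exactly as described for the ILP version.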
3.2.3 LP/ILP Approach
The ILP formulation is as hard to solve as the original SCP problem. If we relax the
integrality constraints xuv ∈ {0, 1} by replacing them with constraints 0 ≤ xuv ≤ 1 for
u, v ∈ V , we obtain a linear program, which can be solved efficiently. If the optimal
solution to the relaxed linear program is integral—all variables are set to either 0 or 1—
then that solution is also an optimal solution to the ILP and SCP problem. So our LP/ILP
approach to find optimal solutions is as follows (Figure 3.2): solve the problem of interest
using the computationally easier LP formulation. If the solution returned is integral, then
the problem instance was easy to solve, and we have the optimal solution to the original
SCP problem. Otherwise, we run polynomial-time Goldstein DEE (Goldstein, 1994) (See
Section 2.5) until no more rotamers can be eliminated and then solve the more difficult
ILP. None of the problems considered here converge when this simple DEE process is
applied.
We solve the LP relaxation before applying DEE principally so that we can test the
LP in isolation. We would like to know what size problems we can tackle using the LP
alone. Also, we will see below, somewhat surprisingly, that the solution to the LP is often
integral. It is interesting to see how often this happens even before the simplifying DEE
step. Finally, in some cases, the LP on the full graph can be solved more quickly than
running DEE to reduce the graph. This may be due to the highly optimized solver code
that is available. Using these off-the-shelf solvers leverages the many years of engineering
work that has been spent speeding up the code.
The CPLEX package (ILOG CPLEX, 2000) with AMPL (Fourer et al., 2002) was
used to solve the linear and integer programs. All computation was done on a single
Sparc 1200MHz processor.
3.2.4 Integrality Gap of the LP Relaxation
Restricting ourselves to non-negative energy functions for this section (the Scwrl energy
function is one such energy function, for example), the integrality gap is a measure of how
good a lower bound on the optimal solution can be guaranteed by an LP relaxation of a
particular ILP. The integrality gap for a minimization problem is defined (Vazirani, 2001,
pg. 102) as
    sup_I  OPT(I) / OPTf(I) ,                                        (3.2)
Figure 3.2: Flow chart of the LP/ILP approach which we test here.
Figure 3.3: Example for which the integrality ratio of the LP relaxation is very large. The drawn edges have positive weight (say weight 1), while the weight between pairs of nodes between which no edge is drawn is 0.
where I is an instance of the problem, OPT (I) is the true optimum, and OPTf (I) is the
optimal value of the LP relaxation.
Consider the graph of Figure 3.3, where the drawn edges have weight 1 and the remain-
ing edges have weight 0. If this is the underlying graph of an (IP2) formulation, any integral
solution must choose exactly one edge between positions, making OPT = 1. However, by
placing 0.5 on each of the nodes and 0.5 on the undrawn edges between positions, the LP
relaxation of (IP2) can achieve a value of 0. The integrality gap is therefore infinite. (If a
zero-energy solution is troublesome, weight the undrawn edges by some small ε.)
The small example of Figure 3.3 is not unrealistic in the SCP setting. A similar set of
weights can be realized by, say, a planar configuration of three residues, where one set of
rotamers forms an overlapping “tepee” on one side of the plane, while the other set forms
a mirror tepee on the other side. Therefore, the LP relaxation could perform very badly.
Perhaps surprisingly, we present experimental evidence below that, on the contrary, this
formulation is often very tight.
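An instance in the spirit of Figure 3.3 can be checked directly. The graph below is an analogous construction of this example (the exact figure is not reproduced here): three positions with two rotamers each, weight-1 "aligned" edges and zero-weight "crossing" edges, so every integral choice pays at least 1 while a half-integral point routes entirely through zero-weight edges.

```python
from itertools import product

# Three positions, two rotamers each (node ids 0..5); between each pair of
# positions the aligned edges (same rotamer index) carry weight 1 and the
# crossing edges carry weight 0.
parts = [(0, 1), (2, 3), (4, 5)]
weight = {}
for a, b in ((0, 1), (0, 2), (1, 2)):
    for r in (0, 1):
        weight[frozenset((parts[a][r], parts[b][r]))] = 1.0

def integral_opt():
    """Best integral solution: by pigeonhole, some pair of positions must
    pick the same rotamer index, so every choice pays at least one edge."""
    best = float("inf")
    for choice in product(*parts):
        e = sum(weight.get(frozenset((u, v)), 0.0)
                for u in choice for v in choice if u < v)
        best = min(best, e)
    return best

# Fractional point: 0.5 on every node, 0.5 on every crossing (weight-0)
# edge; each node's edges into any other part then sum to its own value,
# so the point is feasible for the LP relaxation.
x_node = {u: 0.5 for pt in parts for u in pt}
x_edge = {frozenset((parts[a][r], parts[b][1 - r])): 0.5
          for a, b in ((0, 1), (0, 2), (1, 2)) for r in (0, 1)}
lp_value = sum(weight.get(e, 0.0) * f for e, f in x_edge.items())
print(integral_opt(), lp_value)  # -> 1.0 0.0
```

Since the integral optimum is 1 while a feasible fractional point achieves 0, the ratio (3.2) is unbounded for this family of instances.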
Prot Len Var Rot Size Time Prot Len Var Rot Size Time
1aac† 105 85 1523 79 14 1mfm 153 118 2134 112 231aho† 64 54 981 49 7 1plc 99 82 1156 73 81b9o 123 112 2056 112 25 1qj4 256 221 4080 218 1001c5e 95 71 1108 61 8 1qq4 198 143 2045 121 291c9o 66 53 1130 56 9 1qtn 152 134 2516 132 331cc7 72 66 1396 66 17 1qu9 126 100 1817 94 201cex† 197 146 2556 136 36 1rcf 169 142 2396 139 431cku 85 60 1093 58 10 1vfy 67 63 939 56 71ctj† 89 61 1021 62 6 2pth 193 151 3077 151 681cz9 139 111 2332 111 56 3lzt 129 105 2074 102 281czp 98 83 1170 75 10 5p21 166 144 2874 146 781d4t 104 89 1636 84 19 7rsa 124 109 1958 100 261igd† 61 50 926 47 6
Table 3.1: The native backbone problem sizes. For each protein, Prot gives its PDB identifier, Len gives its length, Var indicates how many of its side chains have more than one possible rotamer, and Rot gives the total number of rotamers considered. Size gives the log10 of the search space size. Time gives the number of seconds for the solve phase of CPLEX.
†These proteins were used to determine the weight of the statistical potential in the basic energy function (see text).
Template/   Seq  Var   Rot  Size      Template/   Seq  Var   Rot  Size
target       id  len                  target       id  len
1aac/1id2    62   86  1608    82      1igd/1mi0    78   49   723    44
1aac/2b3i    29   87  1242    73      1mfm/1b4l    54  117  1978   105
1aho/1dq7    50   53   719    44      1mfm/1cob    80  119  1980   108
1b9o/1f6r    75  114  1999   111      1mfm/1xso    65  114  1826   104
1c9o/1csp    82   53  1076    56      1plc/1byo    71   79  1131    70
1c9o/1g6p    61   54  1409    60      1plc/1jxf    44   77  1093    64
1c9o/1mjc    57   52   862    48      1qj4/1e89    75  220  4154   218
1cc7/1fe4    37   62  1222    60      1qq4/1hpg    34  139  1514   105
1cku/1eyt    87   61  1095    58      1qu9/1j7h    75  101  1885    97
1cku/3hip    73   65  1079    59      1qu9/1qd9    49  104  1749    97
1ctj/1c6r    79   64  1030    62      1rcf/1czh    69  140  2151   135
1ctj/1cyj    64   66  1291    69      1vfy/1hyj    40   57  1060    53
1ctj/1f1f    46   64  1219    62      3lzt/2mef    59  105  2320   108
1czp/1doy    73   81   990    69      5p21/1kao    49  147  2977   148
1czp/4fxc    79   81   961    70      7rsa/1bsr    81  110  2242   104
1d4t/1luk    31   93  1877    91      7rsa/1rra    67  112  2111   104
1igd/1fcl    75   51   899    48
Table 3.2: Template gives the PDB identifier for the protein used as the template backbone, and Target gives the PDB identifier of the protein for which the structure is to be predicted. Seq ID gives percentage identity between template and target protein sequences, Var Len gives the number of side chains that are varied, and Rot gives the total number of rotamers considered. Size is the log10 of the search space size.
3.2.5 Data Set
The primary protein set (Table 3.1) consists of 25 proteins taken from (Xiang and Honig,
2001). The proteins vary in size, ranging from 50 to 221 residues with more than one
possible rotamer. As in (Xiang and Honig, 2001), only the first chain in the PDB file is
used for experiments.
For homology modeling, 33 homologs to the proteins of Table 3.1 are also used. These
protein pairs share between 29% and 87% sequence identity (Table 3.2). Whereas for some
proteins there are other more similar protein sequences present in the PDB, for evaluation
purposes, the chosen homologs give a wider range of sequence identity. ClustalW (Thomp-
son et al., 1994), with default settings, was used to align the protein pairs. For each pair,
Template/   Time      Template/   Time      Template/   Time
target      (ILP)     target      (ILP)     target      (ILP)
1aac/1id2    14       1ctj/1cyj    10       1plc/1jxf     8
1aac/2b3i    13       1ctj/1f1f     9       1qj4/1e89   120
1aho/1dq7     4       1czp/1doy     8       1qq4/1hpg    14
1b9o/1f6r    24       1czp/4fxc     6       1qu9/1j7h    27
1c9o/1csp     7       1d4t/1luk    26 (1)   1qu9/1qd9    19 (2)
1c9o/1g6p    13       1igd/1fcl     7       1rcf/1czh    38
1c9o/1mjc     3       1igd/1mi0     3       1vfy/1hyj     9
1cc7/1fe4    13       1mfm/1b4l    23       3lzt/2mef    52
1cku/1eyt    10       1mfm/1cob    19       5p21/1kao    71
1cku/3hip    12       1mfm/1xso    16       7rsa/1bsr    41
1ctj/1c6r     7       1plc/1byo     7       7rsa/1rra    42
Table 3.3: Solve times (in CPU seconds) for the LP relaxation for the 33 homology modeling problems. For the two problems which did not return integral solutions, the IP solve time is given in parentheses.
the protein in the original data set was taken as the template backbone, and its sequence
homolog was taken as the target protein to be predicted. If the ith residue of the target
sequence is aligned to the jth residue of the template sequence, then rotamers corre-
sponding to the ith residue were considered at the jth position in the template backbone.
Any gaps in the target sequence were handled by modeling the side chains of the native
residues of the template. Any gaps in the template sequence caused the corresponding
residues in the target sequence to be left out of the model.
3.2.6 Rotamer Library and Structure Manipulation
We used Dunbrack’s backbone-dependent rotamer library (Dunbrack Jr and Karplus,
1993). For each 10◦ range of φ, ψ backbone angles, this library has 320 rotamers, with
the largest number of rotamers, 81, belonging to arginine and lysine. Figure 3.4 shows
the distribution of rotamers for each (φ,ψ) bin. Backbones were held fixed, and missing
hydrogens were added using the BALL C++ library (Kohlbacher and Lenhof, 2000), which
was also used to manipulate rotamers. All non-protein atoms were ignored. Each choice
Figure 3.4: The distribution of rotamers in each (φ,ψ) bin of Dunbrack’s backbone-dependent rotamer library. Each bin has 320 rotamers, with longer residues getting more rotamers.
of rotamers was converted to a 3-dimensional structure using the given backbone atoms
and the stock side chains from (Kohlbacher and Lenhof, 2000). For all computations, the
backbone, alanines and glycines were held fixed.
3.2.7 Energy Function
All the energy functions considered consist of a rotamer self-energy term and a pairwise
rotamer interaction term. For the basic energy function, used for all computations unless
otherwise specified, pairwise rotamer energies are computed using van der Waals interac-
tions, and self-energies are computed using both statistical potentials and van der Waals
interactions. The basic energy function is similar to that of the Scwrl package (Bower
et al., 1997), though we use a more realistic van der Waals term.
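In code, this self-plus-pairwise decomposition is just a sum over the chosen rotamers and over interacting pairs. The sketch below shows the form of the objective shared by all the energy functions in this section; the energy tables are hypothetical values for illustration, not the actual AMBER-derived terms.

```python
def total_energy(choice, self_E, pair_E):
    """Energy of one rotamer assignment: self terms plus pairwise terms.

    choice  -- dict: position -> chosen rotamer index
    self_E  -- dict: position -> list of rotamer self-energies
    pair_E  -- dict: (i, j) with i < j -> 2-D table of rotamer pair energies
    """
    e = sum(self_E[i][r] for i, r in choice.items())
    e += sum(pair_E[(i, j)][choice[i]][choice[j]] for (i, j) in pair_E)
    return e

# Hypothetical two-position instance with two rotamers each.
self_E = {0: [0.3, 1.2], 1: [0.0, 0.5]}
pair_E = {(0, 1): [[2.0, 0.1], [0.4, 3.0]]}
print(total_energy({0: 0, 1: 1}, self_E, pair_E))  # 0.3 + 0.5 + 0.1
```

Minimizing this function over all rotamer choices is exactly the SCP objective that the LP/ILP formulations encode.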
Van der Waals interactions between rotamers. The pairwise van der Waals inter-
action energy between rotamers u and v is the sum of the van der Waals interactions
between the side-chain atoms of u and v. We use the 6–12 Lennard-Jones formulation of
the van der Waals force (Figure 3.5). The parameters used in the van der Waals force are
those of AMBER96 except the hydrogen radii are reduced by 50% to account for their
uncertain position. As in AMBER96, for atoms separated by 3 bonds (1–4 pairs), van
Figure 3.5: A schematic of the 6–12 Lennard-Jones approximation to the van der Waals force. The van der Waals force is asymptotically infinite as two atoms approach each other, and asymptotically zero as the atoms are separated. There is a small range of distances where the atoms are attracted to each other with a small negative energy.
der Waals interaction parameters are reduced by half, and there is no van der Waals con-
tribution between atoms separated by fewer than 3 bonds. Each atom-atom interaction
is capped at 100 kcal/mol. As an optimization, the van der Waals interactions are taken
to be zero at distances longer than 10 Å, and residues are assumed not to interact if their
Cβ atoms are farther apart than 8.0 Å plus the longest possible extensions of their side
chains. Any value less than 10^−6 is considered to be 0. These approximations generally
have insignificant effects on the calculated energies.
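A minimal sketch of the capped and truncated 6–12 term described above; the well depth and minimum-energy distance used in the example calls are illustrative placeholders, not the AMBER96 parameters.

```python
def lj_energy(r, eps, r_min, cap=100.0, cutoff=10.0, zero_tol=1e-6):
    """6-12 Lennard-Jones energy (kcal/mol) with the truncations used here.

    r     -- interatomic distance in angstroms
    eps   -- well depth; r_min -- distance of the energy minimum
    """
    if r > cutoff:                    # long-range truncation at 10 angstroms
        return 0.0
    x = (r_min / r) ** 6
    e = eps * (x * x - 2.0 * x)       # minimum value is -eps at r == r_min
    e = min(e, cap)                   # cap steric clashes at 100 kcal/mol
    return 0.0 if abs(e) < zero_tol else e

print(lj_energy(3.4, eps=0.1, r_min=3.4))   # -0.1 at the minimum
print(lj_energy(12.0, eps=0.1, r_min=3.4))  # 0.0 beyond the cutoff
print(lj_energy(1.0, eps=0.1, r_min=3.4))   # 100.0 (capped clash)
```

The cap keeps a single bad atom pair from dominating the objective, which matters for the LP since one enormous coefficient can swamp all other terms.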
Van der Waals interactions in self-energy terms. For each rotamer, the van der
Waals energy is computed (as described above) between each of its atoms and all the fixed
backbone atoms in the system except those corresponding to the current residue and the
residues on either side of it. The self-energy also includes the van der Waals interactions
with atoms in fixed residues.
Statistical self-energies. For each amino acid i in a particular backbone setting, let
piu be the fraction of times amino acid i is found in rotamer u, and pi0 be the fraction
Figure 3.6: The average rmsd over 5 proteins for various values of C.
of times amino acid i is in its most common rotamer. These values are obtained from
the rotamer library (Dunbrack Jr and Karplus, 1993). As in (Bower et al., 1997), the
statistical self-energy term for a particular rotamer u is given by − ln(piu/pi0), so that the
more common a rotamer, the lower the energy assigned to it.
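The −ln(piu/pi0) term can be sketched directly; the rotamer frequencies below are made up for illustration and are not taken from the Dunbrack library.

```python
import math

def statistical_self_energy(p_iu, p_i0):
    """Scwrl-style statistical term: zero for the most common rotamer,
    increasingly positive for rarer ones."""
    return -math.log(p_iu / p_i0)

# Hypothetical rotamer frequencies for one amino acid at fixed phi/psi,
# ordered so that p[0] is the most common rotamer.
p = [0.55, 0.30, 0.15]
energies = [statistical_self_energy(pu, p[0]) for pu in p]
print(energies)   # the most common rotamer gets energy 0
```

Because the term is a log-ratio against the most common rotamer, it is always nonnegative and grows smoothly as a rotamer becomes rarer.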
Combining the statistical self-energies with the van der Waals interactions. In
summing up the total energy of the system, the statistical self-energy term is weighted
by a constant C that is the relative weighting of it in comparison to the physical van der
Waals term. The choice of C can have a large effect on the accuracy of the solution and
the ease with which it can be found. C can be thought of as the inertia for a residue to
remain in a highly-favored side-chain conformation. To calibrate C, five proteins of varied
structure (1aac, 1aho, 1cex, 1ctj and 1igd) were selected from the test set. The LP/ILP
algorithm was applied to each for values of C ranging between 0.5 and 100. Figure 3.6
shows the average side-chain root mean squared deviation (rmsd) over the five proteins
for various values of C. It is best to set C to the smallest value that works well so as to
use as much information about the specific fold as possible. C = 10 was taken to be a
good choice.
3.2.8 Evaluating Predicted Structures
For each protein, we compare the predicted side-chain conformations with those found in
its crystal structure. We use two measures of accuracy. First, we compute the percent-
age of χ1 side-chain dihedral angles predicted within 20 degrees of the native structure,
and the percentage of both χ1 and χ2 side-chain dihedral angles predicted within 20 de-
grees of native. Second, we compute the rmsd between the predicted structure and the
crystal structure. When positioning side chains on native backbones, rmsd is computed
between corresponding side-chain atoms only. When positioning side chains of a tar-
get protein on a homologous backbone, the native backbone of the target protein and the
homologous backbone are first fit together using all the non-hydrogen atoms in both struc-
tures (McLachlan, 1982; Martin, 2001), and then rmsd is computed over the side-chain
atoms.
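Because dihedral angles are periodic, the χ1 accuracy must be measured on the circle rather than by naive subtraction. A minimal sketch of the within-20° metric (the tolerance follows the text; the sample angles are invented):

```python
def angle_diff(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def fraction_within(predicted, native, tol=20.0):
    """Fraction of dihedral angles predicted within tol degrees of native."""
    hits = sum(1 for p, n in zip(predicted, native) if angle_diff(p, n) <= tol)
    return hits / len(predicted)

print(angle_diff(-175.0, 175.0))  # 10.0, not 350.0
print(fraction_within([62.0, 180.0, -60.0], [55.0, -175.0, 65.0]))  # 2/3
```

The wrap-around in angle_diff is what makes a prediction of −175° count as close to a native value of 175°.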
Because performance can vary greatly depending on the location of the residue in the
protein, in addition to evaluating predictions over all residues, we report performance
over only core residues, defined to be those that have less than 10% of their possible
surface area exposed in the crystal structure. For each residue, exposed surface area is
determined as a percentage of the surface area of the residue in isolation using the Surfv
package (Nicholls et al., 1991).
3.3 Computational Results
We test the hardness of SCP instances and evaluate the LP/ILP approach on problems
resulting from three applications: predicting the conformations of a protein’s side chains
on its native backbone, predicting the structure of a protein using the backbone of a
homologous sequence as a template, and designing a protein sequence for a given backbone.
Figure 3.7: Running times (in CPU seconds) of the 25 native backbone problems.
3.3.1 Native Backbone Tests
For each of the 25 proteins in Table 3.1, we ran the LP/ILP approach to predict side-
chain conformations on native backbones. We used the native protein sequence from the
PDB file and allowed each residue to assume all the rotamers listed in the library for the
given amino acid and φ, ψ backbone angles. This resulted in search spaces with up to
10^218 possibilities. Using the basic energy function described in the previous section, all
problems returned optimal integral solutions using LP, and it was not necessary to use the
more computationally expensive ILP formulation. The total CPU time for solving the 25
LPs using formulation (IP2) was under 12 minutes; this is approximately 13 times faster
than when using the formulation (IP1). Figure 3.7 shows the number of CPU seconds
spent in the solve phase of CPLEX on each of the native backbone problems.
To ensure that the energy function produces meaningful structures, we compare the
side-chain conformations predicted by the LP with the side-chain conformations in the
crystal structure (Table 3.4). Over all the residues, we find that 80% of χ1 angles and
51% of the χ1 and χ2 angles are predicted within 20◦ of native. For just core residues,
our approach leads to 87% of χ1 angles and 62% of the χ1 and χ2 angles predicted
correctly. Additionally, our method obtains an average rmsd per protein of 1.553 Å. This
is a reasonable level of accuracy, as it is comparable with values obtained when running
                          Core residues   All residues
(a) LP/ILP χ1 / χ1+2      87% / 62%       80% / 51%
(b) Scwrl χ1 / χ1+2       88% / 60%       80% / 49%
(c) LP/ILP rmsd           1.079 Å         1.553 Å
(d) Scwrl rmsd            1.170 Å         1.649 Å
(e) Best rmsd             0.575 Å         0.640 Å
(f) Simple method rmsd    1.796 Å         1.954 Å
Table 3.4: Prediction of side-chain conformations on native backbones, with a comparison of the LP/ILP prediction with those of other methods and the crystal structure. All values are averaged over the 25 proteins of Table 3.1. (a) The percentage of residues over all proteins for which the LP/ILP predicted conformation has the χ1 and χ1+2 dihedral angles within 20 degrees of the native structure; (b) these values for Scwrl; (c) the rmsd of the predicted side-chain conformations from those of the native side chains using the LP/ILP method; (d) these values for Scwrl; (e) the best rmsd obtainable with the rotamer library used in these experiments; (f) the rmsd obtained by choosing the most common rotamer in the library.
the widely-used Scwrl package (version 2.9) (Bower et al., 1997) (see Table 3.4) and with
what is reported in (Xiang and Honig, 2001) when using the same rotamer library (on a
slightly different test set).
The majority of side chains have their χ1 angle predicted correctly within 20◦.
Figure 3.8 shows the fraction of χ1 angles predicted correctly within various tolerances.
The jump at about 120◦ is expected because many side chains prefer to be in a conforma-
tion in which their χ1 torsion angle (the rotation of the Cα–Cβ bond) is in one of three
evenly spaced values at 0◦, 120◦, and 240◦. Thus, if the native residue is near a rotameric
state, but the optimization chooses the wrong one, the correct one will be approximately
120◦ away.
As expected, prediction is more successful for core residues (Figure 3.9). This is
both because we do not model the effect of solvent, which plays an important role in
positioning of the side chains on the surface, and because the side chains in the core are
more constrained, resulting in fewer physically realizable packings. The positions of the
surface side chains are uncertain even in the crystal structures to which we are comparing,
Figure 3.8: The distribution of χ1 angle errors for the native test set.
Figure 3.9: Performance compared with the fraction of the residue exposed for the native test set. The fraction correct corresponds to χ1 angles predicted within 20◦ of the native conformations. Fraction exposed was computed with Surfv.
Figure 3.10: (Top) The number of core residues with their χ1 angles predicted correctly (within 20◦) and incorrectly (> 20◦) in the native backbone tests, broken down by amino acid. (Bottom) The same data as in the top graph, expressed as the percentage of core residues for which the χ1 angle was predicted correctly (within 20◦).
meaning that our standard of correctness is uncertain, further complicating the prediction
of the positions of these side chains.
Not all amino acids are equally easy to predict (see Figure 3.10). Cysteines are espe-
cially difficult. This is not surprising, as pairs of cysteines often interact via disulfide
bonds, which are not modeled at all by our packing energy function. Prolines are also difficult
to predict, again likely due to their unique structure and interaction with the backbone.
In our tests, serines were also more likely to be incorrectly positioned than other amino
acids, an effect also apparent in (Xu, 2005), where serines were predicted with the lowest
accuracy using either their tree decomposition method or Scwrl 3.0.
For comparison, Table 3.4 also shows the best rmsd obtainable with Dunbrack’s library
if the rotamer with the smallest rmsd with the crystal structure is chosen. This is a lower
bound on the rmsd that any algorithm can achieve using the given library. The results
are also shown for the simple method that chooses the rotamer that is most common for
each side-chain. This method uses no information about the global fold. Together, the
values for “best” and “simple” define the range any prediction algorithm should fall into.
Table 3.4 shows that though the use of a van der Waals energy function beats the
simple prediction scheme, it does not approach the theoretical limit of this rotamer library.
However, the percentage of torsional angles that are correct and the rmsd do not tell
the whole story. Over the 25 proteins, the simple scheme produces 782 clashes (the
center of one atom placed inside the van der Waals radius of another). Even the best
rotamers frequently clash: by choosing the rotamer with the lowest rmsd to the native
side chain, we introduce 56 clashes, though many (but not all) clashing pairs are between
cysteines that form disulfide bonds and are thus not sufficiently close to a rotamer state
in the library. The LP/ILP approach avoids the clashes almost entirely, allowing only two.
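Clash counting in this sense can be sketched as follows; the coordinates and radii are toy values, and we read "clash" as one atom's center lying within the other atom's van der Waals radius, as in the text.

```python
import math

def count_clashes(atoms):
    """atoms: list of (x, y, z, vdw_radius) tuples. Counts pairs whose
    centers are closer than the larger of the two van der Waals radii,
    i.e. one center lies inside the other atom's radius."""
    clashes = 0
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            xi, yi, zi, ri = atoms[i]
            xj, yj, zj, rj = atoms[j]
            d = math.dist((xi, yi, zi), (xj, yj, zj))
            if d < max(ri, rj):
                clashes += 1
    return clashes

# Two carbon-like atoms 1.0 apart clash; 4.0 apart they do not.
print(count_clashes([(0, 0, 0, 1.7), (1.0, 0, 0, 1.7)]))  # 1
print(count_clashes([(0, 0, 0, 1.7), (4.0, 0, 0, 1.7)]))  # 0
```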
Overall, our testing on native backbones shows that when using a simplified energy
function, LP can readily obtain optimal solutions with respect to the energy function, and
that these optimal solutions correspond to predicted structures of quality similar to that
of other popular approaches.
3.3.2 Homology Modeling
We next explore the combinatorial problems associated with homology modeling. The 33
pairs of homologous proteins considered, their percent sequence identity, and the rmsd
between their backbones are shown in Table 3.2.
We solved the resulting LP formulations for all 33 problems; this took under 12 min-
utes of CPU time. The LP found optimal solutions for 31 of the 33 pairs. For only two
template/target pairs, 1d4t/1luk and 1qu9/1qd9, the optimal LP solutions were not in-
tegral. For these two problems, the optimal integral solution was found using dead-end
                         Core residues   All residues
(a) LP/ILP rmsd          2.177 Å         3.230 Å
(b) Scwrl rmsd           2.137 Å         3.260 Å
(c) Backbone rmsd        1.385 Å         1.978 Å
(d) Simple method rmsd   2.504 Å         3.425 Å
Table 3.5: Prediction of side-chain conformations using homology modeling, with a comparison of the LP/ILP prediction with those of other methods and the crystal structure. All values are averaged over the 33 problems of Table 3.2. (a) The rmsd between just side-chain atoms when comparing the LP/ILP predicted structure with the crystal structure; (b) this value when comparing the Scwrl predictions with the native structure; (c) the rmsd between template and target structures when only considering backbone atoms; (d) the side-chain rmsd when choosing the most common rotamer at each position.
elimination and the integer programming algorithm of CPLEX. A good measure of how
close the LP relaxation objective is to the optimal solution is the relative gap, defined as:
relative gap = 100 |OPT − lp| / |OPT|                    (3.3)
where OPT is the energy value of the optimal integral solution and lp is the optimal
objective for the LP relaxation. The relative gaps for both 1d4t/1luk and 1qu9/1qd9
were fairly small (0.207 and 15.260, respectively), and the total time for solving these two
integer linear programs was less than one minute.
In order to show that the basic energy function is useful in the homology modeling
scenario, we report the accuracies of our predicted structures (Table 3.5). We computed
the side-chain rmsd between the target structures and predicted structures, as well as
the side-chain rmsd obtained by the Scwrl rotamer choices. The average side-chain rmsd
obtained by the LP/ILP approach with the basic energy function is 3.230 Å, which is
competitive with Scwrl’s performance of 3.260 Å when run on the same test set. We also
show the results for blindly choosing the most common rotamer in each position.
For these tests, we did not optimize many important aspects of homology modeling,
such as choosing the homolog with the most similar sequence or hand fixing alignments, so
the results should not be taken to be the best possible for any of the methods. However,
the use of a simplified energy function results in predicted structures that are biologically
reasonable. Additionally, optimal solutions with respect to this energy function are easily
found using the LP/ILP approach.
3.3.3 Protein Design
We considered the problem of designing novel sequences that fold into known backbones.
We partitioned the amino acids into the following classes: AVILMF / HKR / DE / TQNS
/ WY / P / C / G. For each of the 25 proteins in our native test set (Table 3.1), we fixed
the surface residues and the native backbone and allowed the core residues to assume any
rotamer of any amino acid in the same class as the native residue. We focused on core
residues since the basic energy function optimizes primarily van der Waals interactions.
The sizes of the resulting problems are shown in Table 3.6.
When applying LP to the resulting problems with the basic energy function, only 6 out
of 25 problems had integral solutions. Thus, from the perspective of this LP, the design
problem is more difficult than fitting side chains on native and homologous backbones.
CPU time for solving the 25 LP problems was approximately 20 hours, with one protein
(1qj4) taking about 10.5 of those hours.
To obtain optimal solutions for the 19 proteins with non-integral solutions, we apply
Goldstein dead-end elimination and then run the ILP solver of CPLEX. When solving the
ILP, CPLEX, in addition to using many other heuristics, solves several linear programs
that are subproblems of the ILP (these subproblems are referred to as the branch-and-
bound nodes). The number of such subproblems is a very rough indication of the compu-
tational effort expended by CPLEX. The number used for the design problems is shown
in the N column of Table 3.6. For several of the problems, many branch-and-bound nodes
were needed. CPLEX was able to find the optimal integral solutions to all the problems
in approximately 138 hours. Nearly all of that time (125 hours) was spent on the largest
Prot   Var len   Rot  Size  Time (ILP)      Rel gap      N
1aac     38     2153    62  3.3e2 (1.3e2)    2.630       4
1aho     18      668    22  4.4              Integral
1b9o     48     1842    69  2.4e2 (9.4)      1.099       0
1c5e     25     1369    42  5.8e1            Integral
1c9o     14      757    24  9.1e1 (4.6e1)    3.936      34
1cc7     18      866    29  9.5e1 (2.4)      0.272       0
1cex     78     3926   126  2.6e3 (7.0e2)    0.913      30
1cku     22      897    31  8.8              Integral
1ctj     24     1262    40  2.8e1            Integral
1cz9     53     2664    87  1.2e3 (3.2e2)    0.702      27
1czp     30     1475    47  4.4e2 (1.4e2)    1.202      39
1d4t     32     1691    52  1.8e2 (8.9e1)    1.039      33
1igd     11      552    18  3.4              Integral
1mfm     46     3215    80  6.5e3 (5.4e3)    3.234     233
1plc     33     1691    54  4.7e2 (1.3e2)    3.991       8
1qj4    124     6655   201  3.8e4 (4.5e5)    2.677    7293
1qq4     72     3500   115  1.5e3 (6.9e2)    4.272      38
1qtn     49     2181    74  2.6e2 (7.0e1)    0.558       8
1qu9     43     2057    70  2.3e2 (6.4)      0.162       2
1rcf     65     3189   105  2.7e3 (9.6e1)    0.053       0
1vfy     15      665    20  4                Integral
2pth     76     4395   127  1.1e4 (2.4e4)    2.115    1623
3lzt     48     1940    71  4.2e2 (3.9e2)    3.445      45
5p21     70     3624   114  4.1e3 (1.3e4)    2.259    1453
7rsa     46     1993    66  5.7e2 (1.4e1)    0.120       0
Table 3.6: Var Len gives the number of core positions that were allowed to vary, and Rot gives the total number of rotamers considered. Size is the log10 of the search space size. Time is the number of seconds CPLEX spent solving the LP; the time for solving the ILP is given in parentheses. Rel Gap gives the relative gap, as defined in Equation (3.3), and is a measure of how far the energy of the solution of the LP is from that of the optimal rotamer choice. N gives the number of subproblems CPLEX considered in finding the optimal choice of rotamers.
Figure 3.11: The time spent in the CPLEX solve phase (in CPU seconds) of the design problems considered here. (a) The time spent on the initial LP relaxation of the full problem. (b) The time spent on solving the IP to optimality for the problems after being reduced by Goldstein DEE. The y-axis is in log scale.
problem, 1qj4; the other 18 problems took only 13 hours of computation. The running
times for each of the design problems are shown in Figure 3.11.
While we do not have a way to convert a fractional solution to a choice of rotamers that
always ensures a low energy choice, empirically, the LP relaxation does give us an estimate
of the energy of the optimal solution. The integrality ratio of Section 3.2.4 suggests that
the energy of the LP may be very low, while the energy of the optimal solution may
remain high. How good an estimate of the optimal energy does the LP relaxation give us
in practice? The relative gaps, defined as in Equation (3.3), for these 25 design problems
are all less than 5% (Figure 3.12). One way to construe the design tests above is that
they are tests of fitting a sequence pattern onto a structure. The pattern is defined by the
groups of amino acids allowed at each position. The small integrality gaps suggest that
in practice, the LP may give a good estimate of how well a sequence pattern can be fit
onto a backbone. While in traditional design applications we require a choice of amino
acid for each designed position, there may be database search applications that require
Figure 3.12: The relative gaps (Eqn. 3.3) for the 25 design problems. The gaps are all less than 5%.
only an estimate of how well a sequence pattern can be fit onto a backbone. For example,
when redesigning an active site of a protein, there may be constraints on the sequence
to preserve functionality. The protein designer may wish to search a database of solved
structures for backbone shapes onto which a sequence matching a pattern can be fit with
low energy. Here, the LP may provide a fast estimate of the fit between sequence pattern
and backbone shape.
The best way to test a designed sequence is to make the protein and confirm the
predicted structure (e.g., (Dahiyat and Mayo, 1997; Harbury et al., 1998; Malakauskas
and Mayo, 1998; Looger et al., 2003; Klepeis et al., 2003; Lilien et al., 2004)); this is
beyond the scope of this thesis. However, the basic energy function is reasonable for
designing protein cores as it focuses on van der Waals interactions, and the use of other
energy functions is not likely to make the problem easier (see below). Thus, while the
LP/ILP approach found optimal solutions for these protein design problems, our analysis
shows that protein design problems are likely to be considerably more difficult to solve
than homology modeling problems.
3.3.4 Other Energy Functions
We also investigated how changing the energy function affects the ability of LP to find
optimal solutions. For five proteins from Table 3.1 (1c9o, 1czp, 1d4t, 1qtn, 1vfy), we fit
side chains on their native backbones using two additional energy function variants.
In the first variant, the self-energies include the van der Waals interactions with the
backbone (as before), but the statistical term is replaced by a torsion term as well as intra-
side-chain van der Waals interactions. These self-energy terms are meant to measure the
local favorability of a side-chain conformation. The pairwise interaction energies between
rotamers consist of only van der Waals interactions.
The second variant is the same as the first, except that the self-energies include elec-
trostatic interactions with the backbone and the pairwise energies include electrostatic
interactions between side-chains. In all cases, the electrostatic interactions were modeled
using the distance-dependent electrostatic component (ε = r) of the AMBER96 force field.
In contrast to the basic energy function, for which 100% of the solutions were integral,
the LP finds optimal solutions for only 60% (three out of five) of the proteins using
either variant of the energy function. Thus, small changes in the energy function can
influence the ease with which solutions are found. We note that ILP can still find optimal
solutions for these problems, and additionally that the basic energy function gives the
best accuracy over these proteins (1.634 Å average rmsd versus 2.069 Å and 2.409 Å for
variants 1 and 2, respectively).
3.3.5 Obtaining Multiple Solutions
By adding constraints (3.1) to the integer program, we can look at an ensemble of provably
near-optimal solutions. Near-optimal solutions can be used to generate several candidates
for protein design, as well as to analyze the energy landscape and gauge the difficulty
of the global optimization problem. We found the 10 lowest-energy solutions for four
proteins (1aho, 1cex, 1ctj and 1igd) and the 100 lowest-energy solutions for 1aac, using
Figure 3.13: Relative gap between the optimal solution (with value OPT) and the 9 next lowest-energy solutions (where the i-th solution has value xi). Inset shows relative gaps for the 100 lowest-energy solutions for 1aac. The relative gap at each iteration i is defined as 100 |OPT − xi| / |OPT|.
the basic energy function to fit each sequence onto its native backbone. Since at each
step we are excluding all previously found solutions, each successive solution takes longer
to find. The relative gap (Equation 3.3) between each successive solution and the global
optimum is plotted in Figure 3.13. These gaps are very small, and from the point of view
of this energy function, any of several solutions perform similarly. This indicates that even
though LP has no difficulty finding optimal solutions, no one choice of rotamers clearly
stands out as the right one.
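The successive-exclusion idea can be made concrete by exhaustively ranking a toy instance; the energies below are random numbers, not real rotamer data, and on real problems the repeated ILP solves with exclusion constraints (3.1) take the place of the brute force.

```python
import random
from itertools import product

random.seed(0)
n_pos, n_rot = 4, 3
self_E = [[random.uniform(-1, 1) for _ in range(n_rot)] for _ in range(n_pos)]
pair_E = {(i, j): [[random.uniform(-1, 1) for _ in range(n_rot)]
                   for _ in range(n_rot)]
          for i in range(n_pos) for j in range(i + 1, n_pos)}

def energy(a):
    """SCP objective for one rotamer assignment a (tuple of indices)."""
    return (sum(self_E[i][a[i]] for i in range(n_pos))
            + sum(pair_E[(i, j)][a[i]][a[j]] for (i, j) in pair_E))

# Rank every assignment; an ILP with exclusion constraints would recover
# this list one solution at a time without full enumeration.
ranked = sorted(product(range(n_rot), repeat=n_pos), key=energy)
opt = energy(ranked[0])
gaps = [100.0 * abs(opt - energy(a)) / abs(opt) for a in ranked[:10]]
print(gaps[0])   # 0.0 for the optimum itself
```

The gaps of the successive solutions are nondecreasing by construction, mirroring the curves plotted in Figure 3.13.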
3.4 Discussion
Our experiments suggest that mathematical programming should become a widely-used
technique for attacking SCP in the context of both homology modeling and protein de-
sign. The described approach exploits general, highly-developed optimization machinery,
and it is likely that problems much larger than those studied here can be solved by em-
ploying faster hardware and more effectively exploiting the CPLEX package (e.g., using
parallelized versions of the software, or specifying alternative strategies for branching and
node selection). The addition of valid inequalities in a branch and cut framework as in
(Althaus et al., 2000) might further speed up solution of the problems.
For even larger problems, further specialized optimizations may be necessary. As a first
step, we have shown how to reduce the size of the ILP dramatically, without compromising
optimality, by exploiting the fact that in protein structures amino acids do not interact
with other amino acids that are far away in 3D. Furthermore, in practice, to solve large
instances optimally, we would suggest first running basic DEE, and then following with
either LP or ILP. We also note that some of the techniques developed for DEE can be
incorporated directly into ILP if necessary. For example, we can disallow choosing a
certain pair of rotamers (between positions that have some positive pairwise rotamer
energy between them) by removing the corresponding edge variable from the objective
function and constraints. Alternatively, the LP/ILP approach can be applied in cases
where the DEE procedure does not converge to a single solution. Finally, as compared
with other methods, the LP/ILP approach is simple to model and flexible enough to
extend easily. For example, we have already shown how to use ILP to obtain successive
near-optimal solutions.
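The successive-solutions idea can be illustrated with a toy enumerator (a sketch, not the ILP procedure itself): find the optimum, forbid it, and re-solve. In the ILP setting, forbidding a previously found assignment S is typically done with a linear constraint of the form sum_{u in S} x_u <= p - 1.

```python
import itertools

def successive_solutions(positions, energy, k):
    """Return up to k lowest-energy rotamer choices, found by repeatedly
    excluding every previously found solution and re-solving.
    (The ILP analogue excludes a found solution S with the linear
    constraint sum_{u in S} x_u <= p - 1.)"""
    found = []
    for _ in range(k):
        # Enumerate all assignments not yet excluded (toy-scale only).
        candidates = (c for c in itertools.product(*positions) if c not in found)
        best = min(candidates, key=energy, default=None)
        if best is None:
            break
        found.append(best)
    return found
```

With two positions and a small made-up energy table, the function returns the optimum followed by the next-best assignment.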
Our analysis suggests that protein design problems are considerably more difficult to
solve than homology modeling problems. For native-backbone and homology modeling,
optimal, biologically realistic solutions can usually be found quickly using a simple LP
relaxation. For protein design, fewer solutions of the LP relaxation are integral, even
with the same energy function. As suggested by (Gordon et al., 2002), it is possible that
for repacking side chains on backbones, there are only a few good rotamer choices for
each side chain, whereas for protein design there are several amino acid choices for each
position, each with a few good rotamer choices. From a computational viewpoint, this
suggests that the design problems are those on which efforts to improve the optimization
scheme should be focused.
We also find that the choice of energy function affects the ease with which optimal
solutions are found using LP. For positioning side chains on native and homologous back-
bones, optimal solutions using the basic energy function are found quickly (typically in
polynomial time), and this energy function yields good solutions (better than the other
energy function variants considered in our tests). This suggests that even if alternative en-
ergy functions are required, it may be beneficial to use an energy function such as the one
considered here for which optimal solutions are readily found. These solutions can then be
used as starting points for an iterative procedure such as that given in (Xiang and Honig,
2001) or for heuristic search algorithms (e.g., as in the original Scwrl program (Bower
et al., 1997)).
Several other authors have considered the combinatorial difficulty of SCP in the con-
text of packing side-chains onto native backbones. An excellent, exhaustive study on side
chain positioning has used very different reasoning to argue that the associated combi-
natorial problem appears not to be that difficult (Xiang and Honig, 2001). This study
considers packing side chains on native backbones, and shows empirically that predicting
the conformation of a single side chain while fixing all others in their native conformations
is only slightly more accurate than the simultaneous prediction of all side chains. Unlike
when integral solutions are found using our approach, their computational approach can-
not guarantee that they have found a minimum energy solution according to their energy
function. Eriksson et al. (Eriksson et al., 2001) also use an ILP formulation to suggest
that the side-chain positioning problem is easy; they apply the method to a single pro-
tein (lambda repressor protein) and find that the solution of the relaxed linear program
always seems to be integral, even for artificial “nonsense” energy functions. The hardness
result (Pierce and Winfree, 2002; Chazelle et al., 2004) suggests this is unlikely to be true
for all energy functions and proteins, and indeed the LP approach does find non-integral
solutions for two of the homology modeling cases in our dataset. On the other hand, oth-
ers (Gordon et al., 2002) have argued that it is important to consider the precise energy
function being optimized; our results are consistent with this view.
In light of the hardness results (Pierce and Winfree, 2002; Chazelle et al., 2004), it is
clear that the frequent integrality of the LP formulation in our experiments is not a result
of the general structure of the problem but instead is a feature of the properties of the
proteins and energy functions studied. The NP-completeness result describes worst-case
behavior, and it may not hold for the classes of problems and energy functions that occur
in practice. In general, finding problems which are hard on average is a longstanding
open problem in theoretical computer science, and in particular cryptography where such
problems would be very useful. Hence, in that light, that our instances are easier than
the worst-case results makes sense.
It is well-known that if the constraint matrix for an LP is totally unimodular (e.g.,
as in formulations for shortest path or max-flow problems), the LP has integer optimal
solutions. This is not the case here, however, as changing only the energy function can
change whether integral solutions are found. It is also known that if the energy function
obeys the triangle inequality, it is possible to obtain a 2-approximation. However, such an
energy function is not realistic for either homology modeling or protein design problems.
Nevertheless, in our applications, the constraint matrices are sparse, and perhaps the LP
is exploiting some other type of underlying structure. An intriguing open question is to
uncover what features of side-chain positioning allow LP and ILP to find optimal solutions
quickly, and to determine whether these features suggest an alternative formulation of the
problem.
Chapter 4
A Semidefinite Programming
Approach to Side-Chain
Positioning with New Rounding
Strategies
Seeking a tighter formulation for side-chain positioning instances on which
the LP formulation of the last chapter gives non-integral solutions, we present
and test a semidefinite programming formulation of the problem. We also
introduce and test two rounding schemes that convert a fractional solution to
this semidefinite relaxation into a good choice of rotamers.
4.1 Introduction
This chapter describes a semidefinite programming (SDP) heuristic for SCP. Though in
the previous chapter we saw that an LP heuristic often performs well on native-backbone
and homology modeling problems, solving design problems frequently required resorting
to a branch and bound integer programming algorithm. As shown in the previous chapter,
modern ILP solvers can solve problems of realistic size, but as larger design problems
are attempted and more detailed rotamer libraries are used, a good polynomial heuristic
becomes attractive.
In this chapter we give an effective SDP-based heuristic. Our method works in three
steps: first, relax the SCP problem into an instance of SDP; next, solve it in polynomial
time by an interior-point method; finally, convert the solution into 0/1 form by randomized
rounding (Raghavan and Thompson, 1987; Rolim and Trevisan, 1998). This general
approach for approximation algorithms was pioneered by Lovasz’s ground-breaking work
on the ϑ function (Grotschel et al., 1993; Lovasz, 1979) and the ingenious Max-Cut
algorithm of (Goemans and Williamson, 1995), and it has been pursued further since
(e.g., (Frieze and Jerrum, 1997; Alon and Kahale, 1998; Feige and Kilian, 1998; Karger
et al., 1998; Bertsimas and Ye, 1998; Zwick, 1999)).
In order to convert the fractional SDP solutions into rotamer choices for the orig-
inal SCP problem, we introduce two new techniques for randomized rounding. These
are general techniques that may have applicability beyond the SCP problem. The first
technique, projection rounding, is based on the geometry of the solution vectors, and the
second, Perron-Frobenius rounding, is based on spectral properties of the solution ma-
trix. This second rounding scheme approximates the solution matrix by the eigenvector
corresponding to its highest eigenvalue. This is a standard trick (see, e.g., (Donath and
Hoffman, 1972; Boppana, 1987; Benson et al., 1999)); however, our Perron-Frobenius
rounding is different in that we rely crucially on non-negativity of the entries of the so-
lution matrix. Because every entry in the solution matrix is constrained to be ≥ 0, the
entries of its highest eigenvector will be non-negative and this will allow us to extract a
probability distribution. A further difference is that the matrix that we decompose does
not have a graph-theoretic interpretation.
The inapproximability of SCP means that no rounding scheme should have good per-
formance on all instances; however, we provide some theoretical justification for good
performance on some types of input. We argue that under various assumptions about
the statistical nature of the problem, the expected difference (the drift) between the total
energies given by the optimal fractional solution and our randomized rounding integral
solutions is small.
We have applied our method to redesign computationally the cores of two naturally
occurring proteins, the Bacillus caldolyticus cold shock protein and the TIM barrel triose
phosphate isomerase from chicken. We have also experimented successfully on general
random graphs as well as a class of random graphs that better capture the geometry of
actual proteins. Since LP-based approaches for SCP are effective in practice (Kingsford
et al., 2005; Althaus et al., 2000), as we saw in the previous chapter, we compare our
method to LP; this comparison highlights the benefits of SDP’s additional computational
machinery. Our empirical studies show that, in practice, good solutions to SCP are found
by our two randomized rounding schemes. Additionally, we note that since SDP provides
better lower bounds than LP for the underlying SCP problem, it is a more effective
bounding function for branch-and-bound or branch-and-cut (Eriksson et al., 2001; Althaus
et al., 2000) approaches.
Independently, (Lau, 2002) and (Lau and Watanabe, 1996) applied semidefinite pro-
gramming and randomized rounding to the more restricted problem of weighted constraint
satisfaction; this is a special case of the SCP problem considered here. They also give an
inapproximability result that is weaker than ours in the general case.
At present, SDP solvers are limited to solving small problems. However, as SDP
approaches are increasingly being applied to combinatorial optimization problems, SDP
solvers continue to improve. As larger proteins and rotamer libraries are considered, ex-
haustive techniques (such as branch-and-bound or A∗) may be limited by their potentially
exponential running time. In contrast, our semidefinite programming approach runs in
polynomial time, and the approaches developed in this chapter, which we show work
well on problems of interest, will have broad applicability. Finally, as opposed to other
heuristic techniques (such as simulated annealing), as more is discovered about the nature
of SCP applications in practice, our SDP formulation permits the development of other
rounding schemes that better exploit the real-world statistical properties of the problem.
4.2 A Semidefinite Programming Heuristic
In this section we present a formulation of the SCP problem as a semidefinite program.
Given a graph G with node set V = V_1 ∪ · · · ∪ V_p, assign to each u ∈ V a 0/1 variable
xu. The intuition is that xu will be 1 if rotamer u is selected, 0 otherwise. Computing the
GMEC is equivalent to solving the following integer quadratic programming problem:
    Minimize    ∑_{(u,v)∈G} E_{uv} x_u x_v                          (4.1)
    subject to  ∑_{u∈V_i} x_u = 1    for i = 1, ..., p              (4.1a)
                x_u ∈ {0, 1}.
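As a concrete (if exponential-time) baseline, the objective in (4.1) can be evaluated by exhaustively enumerating one rotamer per position. The sketch below, with made-up energies and names, is illustrative only; it is not the method developed in this chapter.

```python
import itertools

def gmec_bruteforce(positions, E):
    """Enumerate one rotamer per position and return the choice minimizing
    the sum of self and pairwise energies (the GMEC, by brute force).
    positions: list of lists of rotamer ids (the sets V_i)
    E: dict mapping (u, u) to self energies and (u, v) to pairwise energies;
       missing pairs default to 0."""
    def energy(choice):
        total = 0.0
        for i, u in enumerate(choice):
            total += E.get((u, u), 0.0)                         # self energy E_uu
            for v in choice[i + 1:]:
                total += E.get((u, v), E.get((v, u), 0.0))      # pairwise E_uv
        return total
    return min(itertools.product(*positions), key=energy)
```

For two positions with two rotamers each, the minimum is found by checking all four assignments; the SDP and LP relaxations below exist precisely because this enumeration is exponential in p.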
We rewrite the program (4.1) into a form that will be more convenient to relax into
a semidefinite program with as few constraints as possible. Add a new position with an
isolated vertex u_0 to G and define its singleton vertex set V_0 = {u_0}. The constraints (4.1a)
applied to this position imply x_{u_0} = 1. We square both sides of (4.1a) and, using the fact
that x_u ∈ {0, 1}, we add two new sets of constraints to obtain the equivalent program
    Minimize    ∑_{(u,v)∈G} E_{uv} x_u x_v                          (4.2)
    subject to  ∑_{u,v∈V_i} x_u x_v = 1       for i = 0, ..., p
                ∑_{u∈V_i} x_{u_0} x_u = 1     for i = 0, ..., p
                x_u x_u = x_{u_0} x_u and x_u ∈ {0, 1}    for all u.
The relaxation step lifts each x_u to IR^n, where n is the number of nodes in G (including the
dummy node), scalar multiplication is replaced by the dot product, and the requirement
x_u ∈ {0, 1} is replaced by 0 ≤ x_u^T x_v ≤ 1, for all u and v. Quadratic programming
is NP-hard in general, but this relaxed system is an instance of positive semidefinite
programming, and it can be solved efficiently. To see that, we linearize all the constraints
by introducing the variable x_{uv} to denote x_u^T x_v. To ensure that this linearization is not a
further relaxation, we require that the n-by-n matrix X = (x_{uv}) be positive semidefinite
(PSD). We also note that the constraints x_u^T x_v ≤ 1 are redundant since X is PSD and
the diagonal elements are ≤ 1. Thus, we get:
    Minimize    ∑_{(u,v)∈G} E_{uv} x_{uv}                           (4.3)
    subject to  x_{uu} = x_{u_0 u} and x_{uv} ≥ 0    for all u, v
                ∑_{u∈V_i} x_{u_0 u} = ∑_{u,v∈V_i} x_{uv} = 1    for i = 0, ..., p
                X is PSD.
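As a sanity check on the linearization, any integral assignment x yields a feasible X = x x^T whose linear objective reproduces the quadratic energy of (4.2). The numpy sketch below builds such an X for a tiny instance with arbitrary illustrative energies and verifies the constraints of (4.3).

```python
import numpy as np

# Two positions with two rotamers each, plus the dummy node u0 at index 0.
# Node order: [u0, a1, a2, b1, b2]; V_0 = {0}, V_1 = {1, 2}, V_2 = {3, 4}.
E = np.zeros((5, 5))
E[1, 3] = -5.0   # illustrative pairwise energy between rotamers a1 and b1
E[1, 1] = 1.0    # illustrative self energy of a1

x = np.array([1, 1, 0, 1, 0], dtype=float)   # choose a1 and b1 (and u0)
X = np.outer(x, x)                           # X = x x^T, entries x_uv

# Constraints of (4.3): block sums equal 1, diag equals first row, X is PSD.
for block in [[0], [1, 2], [3, 4]]:
    assert X[np.ix_(block, block)].sum() == 1   # sum_{u,v in V_i} x_uv = 1
    assert X[0, block].sum() == 1               # sum_{u in V_i} x_{u0,u} = 1
assert np.allclose(np.diag(X), X[0])            # x_uu = x_{u0,u}
assert np.all(np.linalg.eigvalsh(X) >= -1e-9)   # rank one, hence PSD

# The linear objective equals the quadratic energy of the chosen rotamers.
linear_obj = (E * X).sum()
quadratic_obj = x @ E @ x
assert np.isclose(linear_obj, quadratic_obj)
```

The relaxation then consists of allowing all PSD matrices satisfying these linear constraints, not just rank-one 0/1 matrices.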
We can solve the SDP system (4.3) in polynomial time to within any level of accuracy by
using the ellipsoid algorithm (see e.g. (Grotschel et al., 1993)) or, preferably, an interior-
point method (see e.g. (Alizadeh, 1995; Nesterov and Nemirovskii, 1993; Vandenberghe
and Boyd, 1996)).
Next, we must map each vector x_u to x_u ∈ {0, 1} so that ∑_{u∈V_i} x_u = 1 and so that
ideally the expected increase in the value of the objective function, the drift, is small. We
discuss two rounding schemes, both of which fit the basic format of randomized round-
ing (Raghavan and Thompson, 1987). The idea is to specify a probability distribution
for each position and home in on a solution by sampling from it. We describe two dis-
tributions, one based on projection, the other on spectral approximation. The first one
is very simple and easy to implement; the second one requires slightly more work. See
Section 4.3 for some empirical comparisons of the two rounding methods. The following
characterization of the geometry of the solution vectors will be useful when we discuss the
rounding schemes.
Lemma 4.2.1 If X = (x_{uv}) is a solution to (4.3) where x_{uv} = x_u^T x_v for vectors x_u, x_v,
then all the vectors ∑_{u∈V_i} x_u are equal to x_{u_0}, and each x_u belongs to the unit-diameter
sphere with antipodes O (the origin) and x_{u_0}.

Proof: Fix i ≥ 0 and let y = ∑_{u∈V_i} x_u. The constraints imply that ‖y‖_2^2 = ∑_{u,v∈V_i} x_{uv} = 1.
Meanwhile, the inner product of y and x_{u_0} is equal to ∑_{u∈V_i} x_{u_0 u} = 1. We also have
x_{u_0}^T x_{u_0} = 1. Since their lengths are the same and equal to the projections onto one
another, it follows that y = x_{u_0} for all i. Now, take any node u ∈ V_i. Observe that

    ‖x_u − x_{u_0}/2‖_2^2 = x_{uu} − x_{u_0 u} + 1/4 = 1/4,

where x_{uu} = x_{u_0 u} follows directly from the constraints. Therefore, x_u belongs to the
sphere centered at x_{u_0}/2 of radius 1/2. This sphere passes through the two points O and
x_{u_0}, which are antipodal. □
We will compare our semidefinite program to the LP relaxation of the following integer
program (IP), which was introduced in the previous chapter:
    Minimize    ∑_{(u,v)∈G} E_{uv} x_{uv}                           (4.4)
    subject to  ∑_{u∈V_i} x_{uu} = 1          for i = 1, ..., p
                ∑_{u∈V_i} x_{uv} = x_{vv}     for i = 1, ..., p and any v
                x_{uv} ∈ {0, 1}.
The benefit of an SDP formulation over an LP formulation when dealing with fractional
solutions is two-fold: first, the relaxation is more constrained so its solution is closer to
that of the integer program. The SDP formulation generates second moments between the
nodes (Bertsimas and Ye, 1998), and our Perron-Frobenius rounding scheme will implicitly
make use of them. Second, the solutions are vectors and not scalars. This gives us much
more freedom in the rounding phase of the algorithm and allows for effective use of the
“geometry” of the problem. We will compare our semidefinite program to this LP and
the optimal solution in Section 4.3.
4.2.1 Projection Rounding
This scheme is based on the fact that the constraints guarantee that ∑_{u∈V_i} ‖x_u‖_2^2 = 1 for
any 1 ≤ i ≤ p, so that the quantities q_u = ‖x_u‖_2^2 associated with the nodes u of V_i form a
valid probability distribution from which we can sample effectively.

• The rounding rule: For each 1 ≤ i ≤ p, choose u ∈ V_i at random with
probability q_u.
Note that only one u is chosen per V_i. This is called projection rounding because the
probability of choosing u is equal to x_{uu} = x_{u_0 u}, which is the length of the projection
of x_u onto x_{u_0}. By looking at the geometry of the SDP formulation in a manner similar
to (Alon and Kahale, 1998; Feige and Kilian, 1998; Karger et al., 1998), we can provide
a measure of theoretical justification for our rounding strategy.
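Given the diagonal of the SDP solution matrix, projection rounding is a single weighted draw per position. The sketch below uses a hand-made illustrative distribution in place of an actual SDP solution.

```python
import random

def projection_round(diag, positions, rng=random):
    """Sample one rotamer per position with probability x_uu = ||x_u||^2.
    diag: dict node -> x_uu (the constraints force these to sum to 1
          within each position V_i)
    positions: list of lists of node ids (the sets V_i)"""
    choice = []
    for Vi in positions:
        weights = [diag[u] for u in Vi]
        # random.choices performs the weighted draw for this position
        choice.append(rng.choices(Vi, weights=weights, k=1)[0])
    return choice
```

When the distribution within a position is sharply concentrated, the favored rotamer is chosen almost surely, which is exactly the regime analyzed below.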
We first provide some intuition behind this rounding scheme. The solution vectors
are constrained to lie on a sphere and the projection rounding rule favors choosing long
vectors. If a single vector xu is dominant within its Vi — a common occurrence — then
simple geometry (Figure 4.1a) shows the dot products of these vectors should also be big.
Because the solution matrix is positive semidefinite, an off-diagonal element xuv is the
dot product of the two vectors xu and xv, and for long vectors xu, xv we can expect xuv
to be large as well. We can thus hope to avoid the most damaging situation: where the
rounding scheme chooses nodes u and v but the fractional solution has put low or zero
weight on the edge between them. The intuition holds in the opposite case as well: two
low probability vectors (that is, short vectors), are likely to have a small dot product,
and we would like to avoid choosing the edge that corresponds to that dot product. We
develop this argument more formally below when we give an upper bound on the drift,
defined as the expected difference between the post- and pre-rounding objective function
value.
Let x_u be 1 if u is chosen in the rounding stage and 0 otherwise. As usual, {x_u}
denotes the solution of the (relaxed) SDP system. The expected value of the objective
function, post-rounding, is equal to

    E[ ∑_{(u,v)∈G} E_{uv} x_u x_v ] = ∑_u E_{uu} ‖x_u‖_2^2 + ∑_{(u,v)∈G°} E_{uv} ‖x_u‖_2^2 ‖x_v‖_2^2,

where G° denotes the set of nonloop edges in G. Thus, the drift ∆ is

    ∆ = ∑_{(u,v)∈G°} E_{uv} ( ‖x_u‖_2^2 ‖x_v‖_2^2 − x_u^T x_v ).
Observe that the drift originates exclusively from the off-diagonal entries. Let y_u denote
the projection of x_u on the orthogonal complement x_{u_0}^⊥ of x_{u_0}. We rewrite the drift in
terms of these y_u. Because x_{u_0} is of unit length, we have

    x_u^T x_v = ((x_u^T x_{u_0}) x_{u_0} + y_u)^T ((x_v^T x_{u_0}) x_{u_0} + y_v)
              = (x_u^T x_{u_0})(x_v^T x_{u_0}) + y_u^T y_v = ‖x_u‖_2^2 ‖x_v‖_2^2 + y_u^T y_v ;

therefore,

    ∆ = − ∑_{(u,v)∈G°} E_{uv} y_u^T y_v.                            (4.5)
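The algebraic step behind (4.5) — splitting each vector into its component along x_{u_0} and an orthogonal residual y_u — is a general identity that can be checked numerically for arbitrary vectors (the SDP constraints additionally force x_u^T x_{u_0} = ‖x_u‖_2^2, which is not assumed here):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
x_u0 = np.zeros(n)
x_u0[0] = 1.0                             # unit vector playing the role of x_{u0}
xu, xv = rng.normal(size=n), rng.normal(size=n)

# Projections onto the orthogonal complement of x_{u0}
yu = xu - (xu @ x_u0) * x_u0
yv = xv - (xv @ x_u0) * x_u0

# x_u^T x_v = (x_u^T x_{u0})(x_v^T x_{u0}) + y_u^T y_v
lhs = xu @ xv
rhs = (xu @ x_u0) * (xv @ x_u0) + yu @ yv
assert np.isclose(lhs, rhs)
```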
In the special case where all the energies are non-negative, it is also possible to relate the
drift directly to the lengths of the xu’s. (While this is not true for all energy functions,
the popular side-chain positioning package SCWRL (Bower et al., 1997) only has non-negative energies.)

Figure 4.1: The geometry of the solution vectors.

By Lemma 4.2.1 and the Pythagorean theorem applied to the right
triangle in Figure 4.1b,

    ‖y_u‖_2^2 + ( ‖x_u‖_2^2 − 1/2 )^2 = ‖y_u‖_2^2 + ( x_u^T x_{u_0} − 1/2 )^2 = 1/4.
It follows that ‖y_u‖_2 = ‖x_u‖_2 √(1 − ‖x_u‖_2^2). Assuming nonnegative energies, by Cauchy-
Schwarz,

    ∆ ≤ ∑_{(u,v)∈G°} E_{uv} |y_u^T y_v| ≤ ∑_{(u,v)∈G°} E_{uv} ‖x_u‖_2 ‖x_v‖_2 √((1 − ‖x_u‖_2^2)(1 − ‖x_v‖_2^2)).    (4.6)
The Sharp-Concentration Case. Our algorithm is expected to do very well when, within
each V_i, the probability distribution is sharply concentrated. In other words, if the prob-
ability ‖x_u‖_2^2 of picking u greatly exceeds that of selecting the other vertices v ∈ V_i,
then projection rounding does the right thing. Indeed, if within each V_i one ‖x_u‖_2 is
near 1, then the other ‖x_v‖_2's (v ∈ V_i) must be small. This implies that the product
‖x_u‖_2 √(1 − ‖x_u‖_2^2) is always small and, by (4.6), so is the drift.
4.2.2 Perron-Frobenius Rounding
Algebraically, projection rounding entails approximating X = W^T W by the rank-one
matrix X̃ = W^T x_{u_0} x_{u_0}^T W. Are there better low-rank approximation matrices? To answer
this question, we return to the SDP formulation (4.3), which ensures that the matrix X
is nonnegative. Because X is also positive semidefinite, a spectral approach suggests an
alternative way: approximate X by a rank-one matrix so that the difference has minimum
L_2 norm.

To simplify the notation, we move all the energies over to the edges by defining F_{uv} =
E_{uv} + (1/(p−1))(E_{uu} + E_{vv}) if u < v, and 0 otherwise. The objective function of the SDP system
can now be expressed as E = tr(FX), where F = (F_{uv}) is upper-triangular. A vector
q = (q_u) ∈ IR^n is called G-stochastic if it is nonnegative and forms a valid probability
distribution over each V_i (i.e., ∑_{u∈V_i} q_u = 1). Randomized rounding with respect to q
produces an expected energy of tr(F qq^T), and so, the drift is ∆ = tr F(qq^T − X). The
problem, of course, is to find a suitable vector q. The next lemma provides a convenient
criterion to test whether a given q provides a valid distribution.
Lemma 4.2.2 Any nonnegative vector with L_1-norm p in the image space of X is G-
stochastic.

Proof: Recall that X = W^T W, where W is the matrix of column vectors (x_u). Let 1_i
be the 0/1 characteristic vector of V_i. Assuming that q = Xy for some y, then

    ∑_{u∈V_i} q_u = 1_i^T q = 1_i^T (W^T W) y = (W 1_i)^T W y = x_{u_0}^T W y,

where W 1_i = x_{u_0} by Lemma 4.2.1. The quantity x_{u_0}^T W y is independent of i, and since
by assumption ‖q‖_1 = p, it follows that ∑_{u∈V_i} q_u = 1 for any i. □
By the Perron-Frobenius theorem for nonnegative matrices (Seneta, 1981), the unit
eigenvector z_1 corresponding to the largest eigenvalue λ_1 of X is nonnegative. We approximate X by

    X̃ = λ_1 z_1 z_1^T.                                              (4.7)

Let s = p/‖z_1‖_1 be the factor needed to scale z_1 to length p in the L_1 norm. Since z_1
is in the image space of X, the vector q = s z_1 is G-stochastic by Lemma 4.2.2. Perron-
Frobenius rounding refers to the standard rounding rule applied now with respect to the
distribution q. That is,

• Perron-Frobenius rounding rule: For each 1 ≤ i ≤ p, choose u ∈ V_i at random
with probability given by the u-th entry of z_1 scaled by s.
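A minimal numpy sketch of this rule follows, run on an illustrative nonnegative PSD matrix rather than an actual SDP solution; because an arbitrary test matrix need not satisfy the constraints of (4.3), the per-position probabilities are renormalized defensively (for a true solution matrix, Lemma 4.2.2 makes that renormalization a no-op).

```python
import numpy as np

def perron_frobenius_round(X, positions, rng=None):
    """Round a nonnegative PSD matrix X by sampling, within each position,
    from the entries of its top eigenvector scaled to a distribution."""
    rng = rng or np.random.default_rng()
    vals, vecs = np.linalg.eigh(X)     # eigenvalues in ascending order
    z1 = np.abs(vecs[:, -1])           # top eigenvector; fix the overall sign
    choice = []
    for Vi in positions:
        probs = z1[Vi] / z1[Vi].sum()  # per-position renormalization (exact
        choice.append(rng.choice(Vi, p=probs))  # scaling by s for feasible X)
    return choice
```

On the "uniform" matrix discussed later in this section, z_1 is proportional to the all-ones vector and the rule degenerates into uniform sampling per position.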
We can express the drift under this rounding scheme as

    ∆ = tr F(qq^T − X) = tr F( s^2 z_1 z_1^T − X )
      = (s^2/λ_1) tr F( λ_1 z_1 z_1^T − X ) + ( s^2/λ_1 − 1 ) tr FX,      (4.8)

and upper bound it as follows:
Lemma 4.2.3 Let 1 denote the column vector of n ones and U the n-by-n matrix of ones,
and let {z_k} be an orthonormal eigenbasis of X with λ_k ≥ 0 the eigenvalue associated with
z_k. Then,

    ∆ ≤ (1 + δ) ‖F‖_2 √(tr X^2 − λ_1^2) + δ tr FX,

where F = (F_{uv}) is the energy matrix, X is the solution matrix returned by the SDP
system, and

    δ = tr UX / tr UX̃ − 1 = ∑_{k>1} λ_k (1^T z_k)^2 / ( p^2 − ∑_{k>1} λ_k (1^T z_k)^2 ).
Proof: We have δ = s^2/λ_1 − 1 because

    tr UX = ∑_{u,v} x_u^T x_v = ( ∑_u x_u )^T ( ∑_v x_v ) = p^2 ‖x_{u_0}‖_2^2 = p^2,  and
    tr UX̃ = λ_1 (1^T z_1)^2 = λ_1 ‖z_1‖_1^2,

where the first follows because ∑_{u∈V_i} x_u = x_{u_0} and there are p such positions i, and
the second follows from the construction of X̃. Substituting δ into (4.8) and applying
Cauchy-Schwarz gives

    ∆ ≤ (1 + δ) ‖F‖_2 ‖X − X̃‖_2 + δ tr FX.
Note the dependence on the L_2 distance between X and its approximation X̃. We can
express this distance in terms of the spectral weight placed on eigenvectors z_k for k > 1.
The diagonalization of the matrix X gives the decomposition X = ∑_k λ_k z_k z_k^T; therefore,
since X − X̃ is symmetric and the z_k's are orthonormal,

    ‖X − X̃‖_2^2 = tr (X − X̃)^2 = tr ( ∑_{k>1} λ_k z_k z_k^T )^2
                = ∑_{k>1} λ_k^2 tr (z_k z_k^T)^2 + 2 ∑_{k>l>1} λ_k λ_l tr (z_k z_k^T)(z_l z_l^T)
                = ∑_{k>1} λ_k^2 = tr X^2 − λ_1^2.
Finally, using p^2 − tr UX = 0, we have

    δ ≡ tr UX / tr UX̃ − 1 = tr U(X − X̃) / ( p^2 − tr U(X − X̃) )
      = ∑_{k>1} λ_k (1^T z_k)^2 / ( p^2 − ∑_{k>1} λ_k (1^T z_k)^2 ). □
Note that δ is quite small if the largest eigenvalue λ1 carries most of the spectrum or
z1 is close to the vector 1, which makes the terms 1T zk small for k > 1. The empirical
results in Section 4.3 suggest that λ1 may be much larger than the other eigenvalues in
realistic situations.
The Uniform Case. Projection rounding is expected to do well when the solution concentrates
weight on a single node per position. What if the weights are nearly uniformly
distributed? We can use Lemma 4.2.3 to argue that in this case even the strategy of
uniform guessing has low drift.
Assume that (i) x_{uu} = 1/|V_i| for all u ∈ V_i, (ii) x_{uv} = 1/(|V_i||V_j|) for any (u, v) ∈ V_i × V_j
(i < j), and (iii) x_{uv} = 0 for distinct u, v ∈ V_i. (Note that although these assumptions are
themselves unrealistic, the robustness of our arguments below makes them representative
of the "uniform" end of the spectrum.) It is easy to construct an orthogonal eigenbasis
for X:

Lemma 4.2.4 The largest eigenvalue λ_1 of X is equal to ∑_{i=1}^p |V_i|^{-1} and corresponds to
the eigenvector

    z_1 = ( 1/√(∑_i |V_i|^{-1}) ) ∑_{i=1}^p |V_i|^{-1} 1_i ,

where 1_i is the 0/1 characteristic vector of V_i. None of the n − 1 other eigenvectors are
nonnegative: p − 1 of them are of the form 1_1 − 1_i and span the kernel of X, while, for
each i, |V_i| − 1 of them are associated with the eigenvalue |V_i|^{-1}.
Proof: If X̃ = λ_1 z_1 z_1^T, then X − X̃ is the n-by-n matrix made of blocks B_1, ..., B_p
along the diagonal and 0 everywhere else: each B_i is a |V_i|-by-|V_i| circulant matrix with
|V_i|^{-1} − |V_i|^{-2} along the diagonal and −|V_i|^{-2} elsewhere. The eigenvectors of an m-by-m
circulant matrix consist of the rows of the matrix of the Fourier transform over the additive
group Z/mZ: this gives us an eigenvector (1, e^{2πik/m}, ..., e^{2πik(m−1)/m}) for each
0 < k < m. The corresponding eigenvalue for B_i is |V_i|^{-1} (hence, both its algebraic and
geometric multiplicities are |V_i| − 1). The corresponding eigenvectors of X are derived
trivially by padding with zeroes at the appropriate places. Note that we must skip the
case k = 0, because the eigenvector of B_i that gets padded into 1_i is not an eigenvector of
X. To complete the diagonalization of X, we must resolve its kernel. Going back to the
relations x_{uu} = ∑_{v∈V_j} x_{uv}, we easily verify that Ker X is spanned by the p − 1 vectors
1_1 − 1_i, for 1 < i ≤ p. □
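The spectrum claimed by Lemma 4.2.4 can be verified numerically for a small uniform instance; the sketch below builds the uniform X for p positions of equal size m and checks the top eigenvalue, the kernel dimension, and the multiplicity of the eigenvalue 1/m.

```python
import numpy as np

p, m = 3, 4                      # p positions, each with m rotamers; n = p*m
n = p * m
X = np.full((n, n), 1.0 / m**2)  # cross-position entries 1/(|V_i||V_j|)
for i in range(p):               # within a position: x_uu = 1/m, x_uv = 0 (u != v)
    blk = slice(i * m, (i + 1) * m)
    X[blk, blk] = np.eye(m) / m

vals = np.linalg.eigvalsh(X)     # eigenvalues in ascending order
lam1 = vals[-1]
assert np.isclose(lam1, p / m)                            # lambda_1 = sum_i |V_i|^{-1}
assert np.sum(np.isclose(vals, 0.0)) == p - 1             # kernel of dimension p-1
assert np.sum(np.isclose(vals, 1.0 / m)) == p * (m - 1)   # eigenvalue 1/m, mult. p(m-1)
```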
Assume now that all the V_i's are of equal size n/p. Then z_1 = (1/√n) 1 and Perron-
Frobenius rounding degenerates into choosing solutions uniformly at random. Since all
other eigenvectors are normal to z_1, by Lemma 4.2.3, we know that δ = 0. Also, by
Lemma 4.2.4,

    tr X^2 − λ_1^2 = ∑_{k>1} λ_k^2 = (n − p) p^2/n^2.

If each F_{uv} is 0/1 and each node is connected to 2d neighbors, then ‖F‖_2 = √(nd) and,
by Lemma 4.2.3, ∆ ≤ p√d. This shows that, measured against the energy (p/n)^2 · dn =
dp^2/n of the random solution, the relative drift is at most (n/p)/√d, which is typically
much less than one since each rotamer typically interacts with many rotamers in several
positions.
Hence, if the solution matrix is sharply concentrated, we have shown projection round-
ing is expected to work well, at least under the assumption that the edge weights are
nonnegative. In the opposite situation of a uniform solution matrix we have shown that
for an unweighted, regular graph the drift is small if solutions are chosen uniformly at
random.
4.3 Computational Results
We used the SDP formulation to design computationally the cores (i.e., the solvent inac-
cessible portions) of two proteins. Both proteins have a core β-barrel, a region where the
backbone wraps around to form a structure reminiscent of the slats of a wooden barrel.
Our computational work focuses on protein cores because (1) the energetics most
important to the core residues are easier to model than those of solvent-exposed residues,
and (2) the cores are small, making them tractable for SDP. In particular, since we are
focusing on hydrophobic core interactions, we use an energy function — similar to the
one in Chapter 3 — that focuses on obtaining well-packed structures. More specifically,
the interaction energies between rotamers (that is, the edge weights of our graph) are
calculated using the 6–12 Lennard-Jones approximation to the van der Waals force. Self-
energies are calculated as the sum of the van der Waals interaction between the rotamer
and the backbone, plus a statistical term derived from the empirical probabilities listed
in the rotamer library (Dunbrack Jr and Karplus, 1993). Interactions between the side
chain and the backbone of flanking positions are ignored to account for some backbone
flexibility. The statistical energy term for rotamer u is computed as − ln(pu/p0), where pu
is the probability of seeing rotamer u and p0 is the probability of seeing the most common
rotamer for that amino acid (Bower et al., 1997). (In the notation of Section 3.2.7,
here C = 1.) For all calculations, atom radii and interaction parameters are taken from
AMBER96 (Cornell et al., 1995), a commonly used package for evaluating the energy
of protein conformations, with the radii of hydrogens reduced by 50% because of their
uncertain position. The BALL C++ library (Kohlbacher and Lenhof, 2000) was used to
manipulate the rotamers.
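The two energy components just described can be sketched as follows; the constants and rotamer probabilities are illustrative only (the actual radii and well depths come from AMBER96, and the probabilities from the Dunbrack rotamer library).

```python
import math

def lennard_jones_6_12(r, r_min, eps):
    """6-12 Lennard-Jones energy at distance r for an atom pair with
    equilibrium separation r_min and well depth eps (illustrative form)."""
    q = (r_min / r) ** 6
    return eps * (q * q - 2.0 * q)   # minimum value -eps at r = r_min

def statistical_self_energy(p_u, p_0):
    """Statistical term -ln(p_u / p_0) from rotamer-library frequencies,
    where p_0 is the probability of the amino acid's most common rotamer."""
    return -math.log(p_u / p_0)

# The most common rotamer gets statistical energy 0; rarer ones are penalized.
assert statistical_self_energy(0.4, 0.4) == 0.0
assert statistical_self_energy(0.1, 0.4) > 0.0
# The LJ term is minimized at the equilibrium separation.
assert math.isclose(lennard_jones_6_12(3.4, 3.4, 0.1), -0.1)
```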
The ultimate test of protein design is to make the protein and confirm the predicted
structure. Obviously, experimental work is beyond the scope of this thesis, but the solu-
tions to the design problems can be at least initially evaluated in several ways. First, we
may expect the designed sequence to be similar to the native sequence since evolution has
likely chosen a favorable sequence. (Note, however, that novel protein sequences that are
considerably more stable than native protein sequences have been designed using fixed
backbone approaches (Malakauskas and Mayo, 1998).) Second, since we are using an en-
ergy function that focuses on packing, we expect the designed structure to avoid clashes
between atoms and to pack the available space tightly. Third, we want the rounded solu-
tion to have energy near the optimum solution. We show below that our computationally
designed cores generally fulfill these criteria.
In order to investigate the performance of the rounding schemes in a more controlled
setting, we also experimented with two types of random graphs. We first consider uniform
random graphs and then consider a family of randomly generated graphs that better model
the interaction graphs observed in proteins.
The semidefinite programs were solved using version 6.0 of the SDPA (Fujisawa et al.,
1997) package, an implementation of an infeasible primal-dual interior-point method. The
linear programs were solved using the dual simplex method with AMPL (Fourer et al.,
2002) and CPLEX 7.1 (ILOG CPLEX, 2000). The SDP solutions were rounded using the
projection and Perron-Frobenius methods described above.
We compare our SDP with the LP obtained by relaxing the integrality constraints
from (4.4). The LP solutions were rounded by choosing node u with probability xuu.
For problems of this size, optimal integral solutions (denoted by OPT) can be found
using formulation (4.4) and the integer programming option of CPLEX. This allows us to
compute the relative gap of each rounded solution, computed as |(x − OPT)/OPT|,
where x is the value of the solution. For both the protein design problems and the
simulated data, the SDP rounding schemes perform well, with significantly better average
relative gaps.
4.3.1 Cold Shock Protein
We applied the SDP method to the problem of redesigning the core of the Bacillus cal-
dolyticus cold shock protein (Mueller et al., 2000) (PDB code: 1c9o). Core residues were
defined as having less than 1% of their surface area exposed to solvent in the native struc-
ture as determined by the program Surfv (Nicholls et al., 1991), which rolls a probe sphere
of radius 1.4 Å along the van der Waals surface. The following eight residues were found:
Val6, Gly16, Ile18, Val28, Leu41, Val47, Phe49, Val63.
The hydrophobic core positions (i.e., all positions listed above except position 16 with Gly)
were varied, and allowed to assume any rotamer of the hydrophobic amino acids Ala, Val,
Ile, Leu, Met, and Phe that occurred in the backbone-dependent rotamer library (Dun-
brack Jr and Karplus, 1993). This yields 55 rotamers per position. The protein is shown
in Figure 4.2a, with variable atoms shown as black spheres and the axis of the β-barrel
vertical. The native positions of the side chains are shown in Figure 4.2b, where the
protein is rotated so that we are looking down the axis of the barrel.
The resulting problem had 385 nodes, 7 positions, and 63,313 nonzero cost matrix
entries. The simple pairwise DEE rule of (Goldstein, 1994), a polynomial-time criterion for
throwing out rotamers that cannot possibly be in the optimal solution (see Section 2.5),
was applied to the problem until no more nodes could be eliminated. This reduced the problem to
137 nodes, 7 positions, and 7,865 nonzero cost matrix entries.
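The Goldstein criterion referenced here eliminates rotamer u at position i if some alternative v at i satisfies E(u) − E(v) + ∑_{j≠i} min_w [E(u,w) − E(v,w)] > 0, since then u can never be part of the optimum. A minimal single-pass sketch, with illustrative data structures:

```python
def goldstein_eliminate(positions, E_self, E_pair):
    """One pass of the Goldstein DEE rule: drop rotamer u at position i if
    some v at i satisfies E(u)-E(v) + sum_j min_w [E(u,w)-E(v,w)] > 0.
    E_pair[(u, w)] defaults to 0 for non-interacting pairs."""
    def pair(u, w):
        return E_pair.get((u, w), E_pair.get((w, u), 0.0))
    pruned = []
    for i, Vi in enumerate(positions):
        keep = []
        for u in Vi:
            dominated = False
            for v in Vi:
                if v == u:
                    continue
                gap = E_self[u] - E_self[v]
                for j, Vj in enumerate(positions):
                    if j != i:
                        gap += min(pair(u, w) - pair(v, w) for w in Vj)
                if gap > 0:          # u can never beat v: eliminate u
                    dominated = True
                    break
            if not dominated:
                keep.append(u)
        pruned.append(keep)
    return pruned
```

In practice the rule is iterated until no further rotamers can be removed, as described above.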
The LP solution was rounded 1,000 times using the simple LP rounding scheme. The
SDP solution was rounded 1,000 times with both the projection and Perron-Frobenius
rounding schemes. The minimum energy solution found is a good measure of how well
one would do in practice, but this minimum energy may be influenced by the moderate
search space size. The average energy of a rounded solution is a better indicator of the
distribution obtained from rounding the relaxations. The best value over 1,000 roundings
and the empirical average objective value in the limit are:
Rounding Method     Best         Average
LP                  -217.2880    4058.6651
Projection          -238.4218    -102.2822
Perron-Frobenius    -238.4218    -209.3617
The optimum solution (determined by the integer programming option of CPLEX)
is −238.4218. Both SDP rounding schemes find the optimum solution; this appears to
correspond to a well-packed and plausible structure, as shown in Figure 4.2c. The average
energy of the rounded solutions suggests that, as expected, the LP rounding scheme is a
poor one. In fact, the average relative gap of the solutions found by the LP rounding
scheme is 18.02, versus an average relative gap of 0.57 for projection rounding and 0.12
for Perron-Frobenius rounding. We expected the Perron-Frobenius rounding scheme to
perform well, as the solution returned by the SDP has most of its spectral weight placed on
the largest eigenvalue (7.725 versus less than 0.05 for all the other eigenvalues).
The optimal choice has 57% sequence identity with the native sequence. Additionally,
Figures 4.2b and 4.2c show that the designed sequence packs more atoms into the core of
the protein than the native structure. This is one indication that this sequence might be
a good fit for this backbone, as more tightly packed cores tend to be favored.

Figure 4.2: Cold-shock protein (1c9o). (a) The full protein; (b) the positioning of the
side chains in nature (Leu41, Val6, Val47, Val28, Ile18, Phe49, Val63); (c) the solution
returned by the SDP rounding schemes (the optimal: Leu41, Ile6, Ile47, Ile28, Ile18,
Phe49, Val63). [structure images omitted]
4.3.2 Triose Phosphate Isomerase
We applied the same procedure to the protein triose phosphate isomerase from chicken
muscle (Banner et al., 1976) (PDB code: 1tim). This protein is an α/β-barrel, where
the β-barrel core is surrounded by α-helical structures. We focused on the computational
redesign of residues in the core of the β-barrel, as identified by (Lesk et al., 1989), and
shown in Figure 4.3a as black spheres. The 9 non-glycine core residues are:
Val40, Ala62, Trp90, Ile92, Ile124, Val161, Ala163, Ile207, Leu230.
The native positions of these side chains are shown in Figure 4.3b. Trp90 was allowed
to assume any rotamer of the aromatic amino acids Phe, His, Trp, and Tyr. The other
residues were allowed to assume any rotamer of the hydrophobic amino acids Ala, Val,
Ile, Leu, Met, and Phe. The same energy function was used as above. This resulted in
467 nodes and 91,737 nonzero edges. As with the cold shock protein, DEE was performed
to throw out rotamers that cannot possibly be in the optimal solution; this reduced the
problem to 141 nodes and 8,264 edges.
The optimal solution has objective value -208.5702. In this case, all methods find the
optimal solution (shown in Figure 4.3c) within 1,000 roundings. The average objective
values are:
Rounding Method     Average
LP                  251.0156
Projection           93.1529
Perron-Frobenius    -36.9177
The average energy of the rounded solutions demonstrates that the Perron-Frobenius
rounding again performs best for this problem. In fact, the SDP solution returned is close
to a rank 1 matrix: the largest eigenvalue is 8.7563 out of a total spectral weight of 10;
the second largest is 0.375. The average relative gap of the solutions found by Perron-Frobenius
rounding is 0.82, versus 1.44 for projection rounding and 2.20 for the LP
rounding scheme.
The optimal solution avoids clashes, and, as can be seen from Figures 4.3b and 4.3c,
packs the available space well. It has only 33% sequence identity with the native sequence.
This is not necessarily unexpected: (Dahiyat and Mayo, 1997) designed a sequence with
only 21% identity to the native sequence that folds to the same shape.
4.3.3 Uniform Random Graphs
We consider the random graphs GU (n, p, r) parameterized by the number of nodes n,
number of positions p, and edge probability r. Each position contains n/p nodes. Two
nodes in different positions are connected by an edge with probability r. Each chosen
edge is weighted by drawing a weight uniformly from [0, 1]. There are no self-edges.
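A generator for GU(n, p, r) follows directly from this definition; a sketch (identifiers are ours):

```python
import random

def uniform_random_graph(n, p, r, seed=None):
    """Generate GU(n, p, r): n nodes split into p equal positions;
    each pair of nodes in different positions is joined with
    probability r, with a weight drawn uniformly from [0, 1]."""
    rng = random.Random(seed)
    part = [v * p // n for v in range(n)]  # position of each node
    edges = {}
    for u in range(n):
        for v in range(u + 1, n):
            if part[u] != part[v] and rng.random() < r:
                edges[(u, v)] = rng.random()
    return part, edges
```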
We solved 30 instances of uniform random graphs with 60 nodes, 15 positions, and
edge probability 0.5 using SDP. Figure 4.4 shows the 25 largest eigenvalues in descending
order for each of the uniform random graphs shown in Figure 4.5. The eigenvalues sum to
16 because tr X equals the number of positions, and there are 16 positions including the
dummy position V0 in (4.3).
Figure 4.3: Triose phosphate isomerase (1tim). (a) The full protein; (b) the positioning
of the side chains in nature (Trp90, Val161, Val40, Ile124, Ile207, Ala62, Ala163, Ile92,
Leu230); (c) the solution returned by the SDP rounding schemes (the optimal: Trp90,
Met161, Val40, Phe124, Met207, Met62, Met163, Met92, Leu230). [structure images
omitted]
Most of the spectrum is concentrated in the first few eigenvalues, and so one would expect
the approximation defined in (4.7) to be close to the solution X and thus that Perron-Frobenius
rounding would perform well.
Figure 4.5a compares the fractional objective of the semidefinite program with that
of the linear program. The SDP provides a tighter lower bound on the minimum energy,
typically within 10% of the optimum. In contrast, the fractional LP solution is never
within 60% of the optimum. As a side benefit, then, SDP provides a more
effective bounding function than LP for branch-and-bound frameworks.
Figure 4.5b shows the best rounded solution found over 10,000 roundings. For these
30 graphs, both semidefinite rounding schemes outperform the LP in all cases, generally
finding a solution within 10% of the optimum and only once finding a solution more
than 20% above the optimum. This means, in a practical sense, that SDP allows us to
find lower energy conformations. The two semidefinite rounding schemes are comparable,
though Perron-Frobenius finds a lower energy solution in 11 out of 30 instances. In
one case, projection rounding finds a better solution. The average rounded energy is
shown in Figure 4.5c. Perron-Frobenius gives a slightly better distribution than projection
rounding. Both SDP rounding schemes again outperform the LP one.
Figure 4.4: The largest eigenvalues of the SDP solution matrices for the uniform random
graphs. [30 eigenvalue plots omitted; each shows the 25 largest eigenvalues in descending
order]
These results remain qualitatively the same for other values of p ≥ 10 and edge
probabilities ≥ 0.3. For very sparse graphs, the SDP and LP methods yield similar
results.
4.3.4 Neighborhood Random Graphs
The uniform random graphs do not capture several properties of real protein interaction
graphs. Side-chains that are far apart in the folded protein structure typically do not
interact. On the other hand, if two residues are near each other in the folded structure,
most of their rotamers will interact.
We consider neighborhood random graphs GN (n, p, d) that capture some of these
properties. They are again parameterized by the number of nodes n and number of
positions p, where each position has n/p nodes. Given parameter d, edges are defined as
follows: For each position j a point bj is chosen uniformly at random in the 3D unit cube.
If the Euclidean distance between bi and bj is ≤ d, then the rotamers in positions i and j
are connected by the complete bipartite graph; if the distance is > d, there are no edges
between i and j. Edges are weighted by choosing a number uniformly from [0, 1].
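A generator for GN(n, p, d) can be sketched the same way (identifiers are ours; `math.dist` requires Python 3.8+):

```python
import itertools, math, random

def neighborhood_random_graph(n, p, d, seed=None):
    """Generate GN(n, p, d): each position gets a random point in the
    3D unit cube; two positions closer than d are joined by a complete
    bipartite graph between their n/p nodes, with uniform weights."""
    rng = random.Random(seed)
    centers = [(rng.random(), rng.random(), rng.random()) for _ in range(p)]
    nodes_of = [range(i * n // p, (i + 1) * n // p) for i in range(p)]
    edges = {}
    for i, j in itertools.combinations(range(p), 2):
        if math.dist(centers[i], centers[j]) <= d:
            for u in nodes_of[i]:
                for v in nodes_of[j]:
                    edges[(u, v)] = rng.random()
    return edges
```

Note that since the cube's diameter is √3, any d ≥ √3 produces a complete p-partite graph.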
Figure 4.6 shows the results for neighborhood random graphs with various values for
d. For sparse graphs (small d), the SDP and LP approaches yield similar results.

Figure 4.5: Results for 30 uniform random graphs. (a) Fractional objectives; (b) best
rounded objectives; (c) average rounded objectives. Relative gaps are computed as in
Equation 3.3. [plots omitted]

Figure 4.6: Results for neighborhood random graphs with various connection distances d.
(a) Fractional objectives; (b) best rounded objectives; (c) average rounded objectives.
[plots omitted]

Figure 4.6a shows the fractional objective values and Figure 4.6b shows the best rounded
objective values. As d increases, positions are more likely to be connected, the optimum
objective grows, and the SDP’s advantage in lower bounding the optimum solution in-
creases. Both projection rounding and Perron-Frobenius rounding can find the optimum
solutions for most of these graphs within 10,000 roundings, whereas this is not the case for
the LP rounding. The average rounded energy is shown in Figure 4.6c, and again Perron-
Frobenius gives a slightly better distribution than projection rounding. The spectra for
neighborhood random graphs of low connection distance are also very concentrated in the
highest eigenvalue. As the connection distance increases, the spectrum generally becomes
more spread out (data not shown).
4.4 Conclusions
In this chapter, we formulated the side-chain positioning problem as an instance of
semidefinite programming and introduced two new rounding schemes for converting frac-
tional solutions into integral ones. Our rounding schemes are quite general and should be
applicable to other problems.
We have applied our method to the problem of computationally redesigning the cores
of two naturally occurring proteins. In addition, we have investigated how the rounding
formulations behave on two classes of random graphs. While the hardness of the SCP
problem argues that no method will do well in general, our computational experiments
confirm the effectiveness of our methods. We provide a measure of theoretical justification
for this.
We have shown that semidefinite programming can be applied to biological problems of
realistic, albeit small, size. Though non-polynomial search heuristics are more practical at
present for larger biological problems (Chapter 3), as semidefinite programming algorithms
and solvers improve, our approach will become more attractive, particularly for design
problems, where the LP approach tested in Chapter 3 requires more computation. In
fact, recent progress has been made in finding approximate solutions to SDP instances.
Along with other applications, (Arora et al., 2005) give an algorithm for finding an ε-
additive approximation for our semidefinite program (4.3) in time O(n^1.5 N / ε^4.5),
where N is the number of non-zero energy terms, and n is the total number of rotamers.
Chapter 5

Improving a Mathematical Programming Approach for Motif Finding
Moving from structure to protein function, we investigate a particular essential
function of proteins — the regulation of other proteins. Transcription factor
proteins bind to DNA to regulate the transcription of protein-coding genes.
The SCP LP/ILP approach given in Chapter 3 can be recast for the task of
finding the binding sites of these transcription factors. In this chapter, we
speed up the linear and integer programming method of Chapter 3 for this
application.
5.1 Introduction
The central dogma of biology is the basic mechanism of genetic information processing:
genes (subsequences of DNA) are transcribed into mRNA molecules, which are then trans-
lated into proteins, the building blocks of cellular structure and process. One mechanism
by which the abundance of particular proteins can be controlled is by accelerating or
75
reducing the rate of transcription. To this end, most genes have, in nearby regions of
DNA, binding sites for regulatory proteins called transcription factors. The process of
transcription may be enhanced or slowed by the binding of transcription factors to their
binding sites.
In this chapter, we describe computational methods for automated discovery of these
regulatory elements, the binding sites of transcription factors in DNA. A commonly stud-
ied paradigm starts with a set of DNA sequences that contain binding sites for a regulatory
protein, and then finds shared (or similar) subsequences in each. These subsequences, or
motifs, are putative binding sites for the same transcription factor. The effectiveness of
identifying regulatory elements in this manner has been demonstrated by considering sets
of sequences identified via shared co-expression (Tavazoie et al., 1999), orthology (Cliften
et al., 2003; Kellis et al., 2003), and genome-wide location analysis (Lee et al., 2002).
From a computational point of view, while many motif-finding methods work reason-
ably well, a recent comprehensive study by (Tompa et al., 2005) shows that no single
motif finding method exhibits a high absolute measure of correctness. Broadly speaking,
the methods are either probabilistic or combinatorial. Probabilistic approaches estimate
the parameters of a motif model using maximum likelihood or maximum a posteriori
estimation (Lawrence and Reilly, 1990; Bailey and Elkan,
1995; Lawrence et al., 1993; Liu et al., 2001; Frith et al., 2004). Combinatorial approaches
either enumerate all allowed motifs (e.g., (Tompa, 1999; Sinha and Tompa, 2003;
Marsan and Sagot, 2000; van Helden et al., 2000; Eskin and Pevzner, 2002; Pavesi et al.,
2004)), or attempt to maximize some measure based on sequence similarity, or minimize
some measure based on distance (e.g., (Hertz and Stormo, 1999; Rigoutsos and Floratos,
1998; Pevzner and Sze, 2000; Sze et al., 2004)).
We take the combinatorial approach and formulate the motif finding problem as that
of finding the best gap-less local multiple sequence alignment using the sum-of-pairs (SP)
scoring scheme. The SP-score is one of many reasonable schemes for assessing motif conser-
76
vation (Osada et al., 2004; Schuler et al., 1991). The combinatorial problem is equivalent
to that of finding a minimum weight clique of size p in a p-partite graph (e.g. (Reinert
et al., 1997; Pevzner and Sze, 2000; Sze et al., 2004)).
For general notions of distance, this problem is NP-hard to approximate within any
reasonable factor (Section 2.6). In the motif finding setting, where the distances obey the
triangle inequality, the problem remains NP-hard (Akutsu et al., 2000). While constant-
factor approximation algorithms exist (Gusfield, 1993; Bafna et al., 1997), the ability to
find the optimal solution in practice is preferable.
Our approach follows that of (Zaslavsky and Singh, 2005), which introduced the integer
linear programming (ILP) formulation of the motif finding problem. Their testing on
identifying known DNA binding sites of E. coli transcription factors (Robison et al.,
1998) shows that the approach performs well for motif finding, identifying either known
motifs or motifs of higher conservation. They apply the ILP formulation to a variety
of types of motif-finding problems, including DNA motifs, protein motifs, and artificial
motifs embedded in random sequences (so-called subtle motifs (Pevzner and Sze, 2000)).
A difficulty mentioned in (Zaslavsky and Singh, 2005), however, is the size of the
integer linear programs, which can have millions of variables for interesting biological
problems. In that work, the authors tackle the ILPs by preprocessing with graph prun-
ing and decomposition techniques. They give DEE-like rules for throwing out nodes and
experiment with a depth-1 branch-and-bound-like procedure where each node in an arbi-
trary position is assumed to be in the solution, and the graph is reduced using the pruning
procedures assuming the chosen node for the arbitrary position is the correct one. If the
graph remains a feasible instance, the smaller ILP is solved for this reduced graph.
Here, we take an alternative direction and propose a novel, more compact ILP that
uses the discrete nature of the distance metric imposed on pairs of subsequences. We
present an exponentially-sized class of constraints to make the linear programming (LP)
relaxation of the new formulation provably as tight as that given in (Zaslavsky and Singh,
2005), and we give a separation algorithm so that solving the new relaxation remains
polynomial-time despite the large number of constraints.
Rather than using the separation algorithm explicitly, we describe and test a heuristic
approach to solve the LP relaxation of our novel ILP formulation that, in all observed
cases, finds a solution of the same objective value as the LP relaxation of (Zaslavsky and
Singh, 2005), often an order of magnitude faster. Moreover, we show that in practice,
the LP relaxations for both of the ILP formulations often have integral optimal solutions,
making solving the LP relaxations sufficient for solving the original ILP. Even if this were
not the case, the ability to find faster solutions to the relaxations may translate into
significant speed-ups in branch-and-bound approaches for solving the original ILP.
5.2 Formal Problem Specification
In the motif-finding problem, we are given p sequences, which are assumed without loss
of generality to each have length N ′, and a motif length ℓ. The goal is to find a substring
si of length ℓ in each sequence i, such that the sum of the pairwise distances between the
substrings (i.e., Σ_{i<j} distance(si, sj)) is minimized. The distance between substrings
may be defined in several ways. The simplest measure, and the one we restrict ourselves
to in this chapter, is the Hamming distance.
This motif-finding problem can be expressed in the same graph theoretic terms (Rein-
ert et al., 1997) as the side-chain positioning problem we have considered earlier. For
a problem with p sequences, we define a complete, weighted p-partite graph, with a
part Vi for each sequence. In Vi, there is a node for every possible window of length
ℓ in sequence i. Thus there are N := N ′ − ℓ + 1 nodes in each Vi, and the vertex set
V = V1 ∪ · · · ∪ Vp has size Np. For every pair u and v in different parts there is an
edge (u, v) ∈ E . Letting seq(u) denote the subsequence corresponding to node u, the
weight wuv on edge (u, v) equals distance(seq(u), seq(v)). Let the part of a node be
denoted as part(u) := i such that u ∈ Vi. The goal in motif finding is to choose a node
from each part so as to minimize the weight of the induced subgraph. We note that the
combinatorial formulation of the “subtle motifs” problem is similar (Pevzner and Sze,
2000), though in that case edges exist only between nodes corresponding to subsequences
that are within a certain distance of one another. The approach we outline below can be
extended to that context as well.
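The construction above translates directly into code; a sketch using Hamming distance (identifiers are ours):

```python
def motif_graph(seqs, ell):
    """Build the p-partite motif-finding graph: one node per length-ell
    window of each sequence, with edge weights given by the Hamming
    distance between windows in different sequences."""
    windows = [[s[k:k + ell] for k in range(len(s) - ell + 1)] for s in seqs]

    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    edges = {}
    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            for a, wa in enumerate(windows[i]):
                for b, wb in enumerate(windows[j]):
                    edges[(i, a), (j, b)] = hamming(wa, wb)
    return windows, edges
```

Here a node is a (sequence, offset) pair, so part((i, a)) = i, matching the definition above.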
5.3 Integer and Linear Programming Formulations
5.3.1 Original Integer Linear Programming Formulation
Given the graph-theoretic formulation above, the ILP presented in Chapter 3 can be used
for motif finding (Zaslavsky and Singh, 2005):
Minimize Σ_{{u,v}∈E} wuv · Xuv    (IP1)

subject to    Σ_{u∈Vi} Xu = 1    for i = 1, . . . , p

              Σ_{u∈Vi} Xuv = Xv    for i = 1, . . . , p and v ∈ V \ Vi

              Xu, Xuv ∈ {0, 1}
5.3.2 New Integer Linear Programming Formulation
Since the alphabet and the length of the sequences are finite, there are only a finite
number of possible pairwise distances. For example, in the case of Hamming distances,
edge weights can only take on ℓ + 1 different values. We take advantage of the small
number of possible weights and the fact that the edge variables of IP1 are only used to
ensure that if two nodes u and v are chosen in the optimal solution then wuv is added to the
cost of the clique. We introduce a second ILP in which we no longer have edge variables
Xuv. Instead, in addition to the node variables Xu, we have a variable Yujc for each node
u, each position j such that u is not in Vj , and each possible edge weight c. These Y
variables model groupings of the edges by weight into cost bins. The intuition is that Yujc
is 1 if node u and some node v ∈ Vj are chosen such that wuv = c. Formally, let D be the
set of possible edge weights and let W = {(u, j, c) : c ∈ D, u ∈ V, j ∈ {1, . . . , p}, and u ∉ Vj}
be the set of triples over which the Yujc variables are indexed. Then the following ILP
models the motif-finding graph problem:
Minimize Σ_{(u,j,c)∈W : part(u)<j} c · Yujc    (IP2)

subject to    Σ_{u∈Vi} Xu = 1    for i = 1, . . . , p    (IP2a)

              Σ_{c∈D} Yujc = Xu    for j = 1, . . . , p and u ∈ V \ Vj    (IP2b)

              Σ_{v∈Vj : wuv=c} Yvic ≥ Yujc    for (u, j, c) ∈ W s.t. part(u) < j, i = part(u)    (IP2c)

              Xu, Yujc ∈ {0, 1}
As in the previous formulation, the first set of constraints forces a single node to be
chosen in each part. The second set of constraints says that if a node u is chosen, for each
position j, one of its “adjacent” cost bins must also be chosen (Figure 5.1). The third
set of constraints ensures that Yujc can be chosen only if some node v ∈ Vj is also chosen
such that wuv = c. We discard variables Yujc if there is no v ∈ Vj such that wuv = c.
Figure 5.1 gives a schematic drawing of these constraints.
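The bookkeeping behind the cost bins can be sketched as follows; this is illustrative indexing only, not the ILP model itself, and the identifiers are ours:

```python
from collections import defaultdict

def cost_bins(edges):
    """Group the edges of the motif graph into the cost bins indexing
    the Y variables.  Nodes are (part, index) pairs; edges maps (u, v)
    with part(u) < part(v) to an integer weight c.  The returned dict
    maps (u, j, c) to the nodes v in part j at distance c from u;
    only nonempty bins (the retained Yujc variables) appear."""
    bins = defaultdict(list)
    for (u, v), c in edges.items():
        bins[(u, v[0], c)].append(v)   # bin for Y_{u, part(v), c}
        bins[(v, u[0], c)].append(u)   # bin for Y_{v, part(u), c}
    return dict(bins)
```

The left-hand sums of constraints (IP2c) range exactly over the node lists these bins hold.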
It is straightforward to see that IP2 correctly models the motif-finding problem when the
variables are restricted to {0, 1}. For any choice of p-clique {u_1, . . . , u_p} of weight
µ = Σ_{i<j} w_{u_i u_j}, a solution of cost µ to IP2 can be found by taking X_{u_i} = 1
for i = 1, . . . , p, and taking Y_{u_i jc} = 1 for all j ≠ i such that w_{u_i u_j} = c. This
solution is easily seen to be feasible, and between any pair of positions i, j it contributes
cost w_{u_i u_j}; therefore, the total cost is µ. Conversely, consider any solution (X, Y)
to IP2 of objective value µ, and consider the clique formed by the nodes u such that
Xu = 1. Between every two positions i < j, the constraints (IP2a) and (IP2b) imply that
exactly one Yujc and one Yvid are set to 1 for some u ∈ Vi and v ∈ Vj and costs c, d. The
constraint (IP2c) corresponding to (u, j, c) with Yujc on its right-hand side can only be
satisfied if the sum on its left-hand side is 1, which implies c = d = wuv. Thus, a clique
of weight µ exists in the motif-finding graph problem.

Figure 5.1: Schematic of IP2. Adjacent to each node u there are at most |D| cost bins,
each associated with a variable Yujc. Associated with each cost c are the nodes v ∈ Vj
for which wuv = c (represented in the figure by stars). Constraints (IP2b) say that a
total weight of Xu must be apportioned over the bins, while constraints (IP2c) limit us
to choosing cost-bin variables for which some node v ∈ Vj is chosen such that wuv = c.
5.3.3 Advantages of IP2
In practice, IP2 has many fewer variables than IP1. IP1 has Np(N(p−1)/2+1) variables
and p+Np(p−1) constraints (ignoring the binary constraints). If none of the Yujc variables
can be thrown out, IP2 has Np((p−1)d+1) variables and p+Np(p−1)(d/2+1) constraints,
where d = |D|, the number of kinds of weights. If d < N/2, the second IP will have fewer
variables. In practice, d is expected to be much smaller than N , and while N could
reasonably be expected to grow large as longer and longer sequences become practical to
study, d is constrained by the geometry of transcription factor binding and will remain
small. Also, in practice, it is likely that many Yujc variables are removed because seq(u)
does not have matches of every possible weight in each of the other sequences. IP2, on
the other hand, will have O(d) times more constraints than IP1.
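To make this size comparison concrete, the two counts can be evaluated for illustrative parameters (the example values are ours, not taken from the experiments below):

```python
def ip1_variables(N, p):
    """Variable count for IP1: Np node variables plus
    N^2 * p(p-1)/2 edge variables, i.e., Np(N(p-1)/2 + 1)."""
    return N * p * (N * (p - 1) // 2 + 1)

def ip2_variables(N, p, d):
    """Variable count for IP2 (assuming no Yujc variables are
    discarded): Np node variables plus (p-1)d bin variables each."""
    return N * p * ((p - 1) * d + 1)

# E.g., p = 20 sequences, N = 500 windows each, motif length 10,
# Hamming distances, so d = 11 possible edge weights:
ip1 = ip1_variables(500, 20)     # 47,510,000 variables
ip2 = ip2_variables(500, 20, 11) #  2,100,000 variables
```

With these illustrative values, IP1 has roughly 47.5 million variables versus 2.1 million for IP2.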
There are additional practical considerations that may make the number of possible
edge weights small. It is often the case that one is interested in an optimal solution only
if it is of high enough quality, meaning that no motif instance in the solution is further
than α away from any other (the diameter of the solution is ≤ α). If this is the case,
edges of weight > α can be deleted. Such a requirement reduces d and makes IP2 still
smaller. In many applications, even if large diameter solutions are acceptable, there is an
expectation that the diameter is likely small, and a solution with a small diameter may
be preferred to one with a lower sum-of-pairs score but more outliers. In such a case,
the IP2 formulation may allow one to check for low diameter solutions quickly.
While the space requirement for the simplex algorithm is related to the number of
constraints and variables, running time is not necessarily directly related. Smaller in-
teger programs with weaker LP relaxations are often less useful for branch-and-bound
approaches to IP solving. Thus, we seek the tightest, smallest IP possible. Towards this
end, experiments must be performed to gauge the efficacy of various formulations on prac-
tical problems. We present experiments below which suggest IP2 can be more than an
order of magnitude faster than IP1.
5.3.4 Linear Programming Relaxation
The typical approach to solving an ILP is to solve the linear program derived from the
ILP by dropping the requirement that the variables be in {0, 1}, and instead requiring
only that the variables lie in the continuous range [0, 1], as we did in Chapter 3. This
modified problem is called the linear programming (LP) relaxation. An ILP formulation
is weaker than another if the corresponding LP relaxation of the first admits solutions
of lower objective values than are possible with the second. Weaker relaxations are often
less useful in solving the corresponding ILP.
The LP relaxation of IP2, which we refer to as LP2, is weaker than the LP relaxation
of IP1, which we refer to as LP1. (Note that in our case the constraints Xu ≤ 1 and Yujc ≤ 1
are implied and not included explicitly.) In this section, we present a fairly natural class
of constraints that, if added to LP2, will make it as strong as LP1. In the subsequent
section, we show that in practice we can focus on just two types of these constraints, and
we are able to solve the original ILP iteratively by adding cutting planes corresponding
to violated constraints of these types.
Additional constraints. Focus on a single pair of positions i and j. In IP1 the edge
variables between Vi and Vj explicitly model the bipartite graph between those two posi-
tions. In IP2, however, the bipartite graph is only implicitly modeled by an understanding
of which Y variables are compatible to be chosen together. We study this implicit rep-
resentation by considering the bipartite compatibility graph Cij between two positions i
and j. Intuitively, we have a node in this compatibility graph for each Yujc and Yvic, and
there is an edge between the nodes corresponding to Yujc and Yvic if wuv = c. These
two Y variables are compatible in that they can both be set to 1 in IP2. More formally,
Cij = (Aij ∪ Aji, F ), where Aij = {(u, j, c) : u ∈ Vi, c ∈ D} is the set of indices of Y
variables adjacent to a node in Vi, going to position j, and Aji is defined analogously,
going in the opposite direction. The edge set F is defined in terms of the neighbors of a
triple (u, j, c) as follows. For u in position i, let N (u, j, c) = {(v, i, c) ∈ Aji : wuv = c} be
the neighbors of (u, j, c). They are the indices of the Yvic variables adjacent to position
j going to position i so that the edge {u, v} has weight c. There is an edge in F going
between (u, j, c) and each of its neighbors. We call c the cost of triple (u, j, c). All this
notation is summarized in Figure 5.2.
In any feasible integral solution, if Yujc = 1, then some Yvic for which (v, i, c) ∈
N (u, j, c) must also be 1. Extending this insight to subsets of the Yujc variables yields a
class of constraints which will ensure that the resulting linear programming formulation
is as tight as LP1. If Qij is a subset of Aij , then let N (Qij) =⋃
(u,j,c)∈QijN (u, j, c) be
the set of indices that are neighbors to any vertex in Qij . The following constraint is
83
Figure 5.2: Notation used in the faster ILP formulation. N(u, j, c) is shown assuming
that v and w are the only nodes in Vj that have cost c with u. The two columns of circles
represent parts of the graph Vi and Vj, with each circle representing a node. The solid
lines adjacent to each circle represent the Yujc or Yvic variables associated with the node.
Aij and Aji (dotted boxes) are the sets of these variables associated with the pair of graph
parts i and j. Finally, the function N(u, j, c) maps a variable Yujc to a set of compatible
Yvic variables (squiggly lines).
Figure 5.3: Graph Cij used to show constraints (5.1) are sufficient. Nodes r and s are a
source and sink, respectively, added in the proof of Theorem 5.3.1. Each solid node
corresponds to a Y variable. The edges between Aji and Aij have infinite capacity, while
those entering s or leaving r have capacity equal to the value of the Y variable to which
they are adjacent. The shading gives an r–s cut.
true in IP2 for any such Qij:
Σ_{(u,j,c)∈Qij} Yujc ≤ Σ_{(v,i,c)∈N(Qij)} Yvic .    (5.1)
That is, choose any set of Yujc variables adjacent to position i that go to position j; their
sum must be at most the sum of the Y variables of their neighbors. Notice that
the third set of constraints in IP2 is of the form (5.1), taking Qij to be the singleton set
{(u, j, c)}.
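The check encoded by inequality (5.1) is easy to state programmatically. Below is a minimal sketch in Python, using hypothetical data structures (a dict Y of fractional values indexed by triples, and a neighbors function returning N(u, j, c)); it illustrates the inequality only and is not the formulation used in our experiments.

```python
def constraint_satisfied(Q, Y, neighbors, tol=1e-9):
    """Check one constraint of the form (5.1): the total Y weight on the
    triples in Q must not exceed the total Y weight on N(Q), the union of
    the neighbor sets of the triples in Q."""
    lhs = sum(Y[q] for q in Q)
    n_of_q = set()
    for q in Q:
        n_of_q |= neighbors(q)  # N(Q) is the union of per-triple neighborhoods
    rhs = sum(Y[t] for t in n_of_q)
    return lhs <= rhs + tol  # small tolerance for floating point
```

For a singleton Q = {(u, j, c)}, this reduces to the third set of constraints already present in IP2.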
Theorem 5.3.1 If, for every pair i < j, constraints of the form (5.1) are added to IP2
for each Q ⊆ Aij such that all triples in Q have the same cost, then the resulting LP
relaxation LP2′ is as strong as the relaxation LP1 of IP1.
Proof. For any feasible solution to LP2′, we will show that there is a feasible solution
to LP1 with the same objective value, thereby demonstrating that the optimal value of
LP2′ is no smaller than that of LP1. In particular, fix a solution (X, Y)
to LP2′ with objective value γ. We need to show that, for any feasible distribution of
weights on the Y variables, a solution to LP1 can be found with objective value γ.

In order to reconstruct a solution to LP1 of objective value γ, we reuse the values of
the node variables Xu from the solution to LP2′; it remains to assign values to the edge
variables Xuv to complete the solution. Recall the compatibility graph Cij described
above. Because all edges in Cij are between nodes of the same cost, Cij is really |D|
disjoint bipartite graphs C^c_ij, one for each cost. Let A^c_ij ∪ A^c_ji be the node set of the
subgraph C^c_ij for cost c. Each edge in a subgraph C^c_ij corresponds to one edge in the graph
G underlying LP1. Conversely, each edge in G corresponds to exactly one edge in one of
the C^c_ij graphs (if edge {u, v} has cost c1, it corresponds to an edge in C^{c1}_ij). We will thus
proceed by assigning values to the edges of the various C^c_ij, and this will yield values for
the Xuv.
If y(A) := ∑_{(u,j,c) ∈ A} Yujc, then by the sets of constraints (IP2a) and (IP2b), y(Aij) =
y(Aji) = 1. Since the constraints (5.1) are included with Q = A^c_ij for each cost c, by the
pigeonhole principle, y(A^c_ij) = y(A^c_ji) for every cost c. Thus, for each subgraph C^c_ij, the
weight placed on the left half equals the weight placed on the right half. We will consider
each induced subgraph C^c_ij separately.
We modify C^c_ij as follows to make it a directed, capacitated graph. Direct the edges
of C^c_ij so that they go from A^c_ji to A^c_ij, and set the capacities of these edges to be infinite.
Add two dummy nodes {r, s}, edges directed from r to each node in A^c_ji, and edges
from each node in A^c_ij to s. Every edge adjacent to r or s is also adjacent to some
node representing a Y variable; give each such edge capacity equal to the value of the Y
variable associated with that node. See Figure 5.3.
The desired solution to LP1 can be found if the weight on the nodes (Y variables)
of each compatibility subgraph can be spread over the edges. In other words, a solution
to LP1 of weight γ can be found if, for each pair (i, j) and each cost c, there is a flow of weight
y(A^c_ij) from r to s in the graph constructed above. The assignment to Xuv is then the flow
crossing the corresponding edge in the C^c_ij of the appropriate cost. In the following lemma,
we show that the set of constraints described in the theorem ensures that the minimum
cut in the constructed graph is ≥ y(A^c_ij), and thus that there is a flow of the required
weight. The proof of this fact can be found in (Cook et al., 1997), pages 54–55, and is
reproduced in our notation below. □
Lemma 5.3.2 The minimum cut of the flow graph described in the proof of Theorem 5.3.1
(and shown in Figure 5.3) is y(A^c_ij).
Proof. (Adapted from (Cook et al., 1997).) Recall that the capacities of the edges leaving
r are the Yvic and those entering s are the Yujc, that the total capacity leaving r equals the
total capacity entering s, and that this total capacity equals y(A^c_ij). We want to show that the minimum
r – s cut in this graph is ≥ y(A^c_ij).
Consider an r – s cut {r} ∪ A ∪ B, where A ⊆ A^c_ji and B ⊆ A^c_ij. (Such a cut is shaded
in Figure 5.3.) Define Ā = A^c_ji \ A and B̄ = A^c_ij \ B.

Figure 5.4: Example graph C^c_ij where the constraints added in our heuristic approach are insufficient. All the constraints with |Q| = 1 or |Q| = |Aij| are satisfied, but a flow of value 1 does not exist from the right side to the left side in the augmented graph (as in Figure 5.3).

If any edges go between A and B̄, then the cut has infinite capacity, and we are done.
Otherwise, the value of the cut is the sum of the capacities of the edges leaving r and going
to Ā plus the sum of the capacities of the edges entering s from a node in B. We will now
show that

y(Ā) ≥ y(B̄) ,  (5.2)

which implies that the value of the cut, y(Ā) + y(B), is ≥ y(A^c_ij).
Assume for a moment that every node in Ā has a neighbor in B̄. Then N(B̄) = Ā,
because there are no edges between A and B̄. By (5.1), y(B̄) ≤ y(N(B̄)) = y(Ā). On
the other hand, if there is a node in Ā that has no neighbor in B̄, then we can
move that node into A to form A′ (without increasing the capacity of the cut), and the above
argument shows that y(A^c_ji \ A′) ≥ y(B̄), which implies y(Ā) ≥ y(B̄) since A^c_ji \ A′ ⊆ Ā.
□
It is also clear that the linear relaxation LP2′ described in Theorem 5.3.1 is no stronger
than LP1, as any solution to LP1 can be converted into a solution to LP2′ by transferring the
weight on each edge variable Xuv onto Yujc and Yvic, where wuv = c. This solution to LP2′
satisfies all the constraints in the theorem.
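This conversion is mechanical. A minimal sketch (the dict-based inputs are hypothetical: X_edge maps node pairs to LP1 edge values, cost gives the weight wuv, and position_of maps each node to its graph part):

```python
from collections import defaultdict

def lp1_to_lp2(X_edge, cost, position_of):
    """Transfer the weight of each LP1 edge variable X_uv onto the pair of
    LP2' variables Y_ujc and Y_vic, where c = w_uv."""
    Y = defaultdict(float)
    for (u, v), x in X_edge.items():
        c = cost[(u, v)]
        i, j = position_of[u], position_of[v]
        Y[(u, j, c)] += x  # u's variable toward v's position, at cost c
        Y[(v, i, c)] += x  # v's variable toward u's position, at cost c
    return dict(Y)
```

Since each edge contributes its full weight to both of its endpoint variables, the constructed Y inherits the feasibility of the original X.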
There are exponentially many constraints considered in Theorem 5.3.1. Such an
LP can nonetheless be solved in polynomial time (by the ellipsoid algorithm (Grotschel et al., 1993))
if a polynomial-time separation algorithm exists. A separation algorithm must find a
violated constraint, if one exists, or report that no constraint is violated. The next
lemma gives such a separation algorithm, which simply formalizes the intuition in the
proof that constraints (5.1) are sufficient: all constraints are satisfied only if the maximum
flow is large enough, which is true only if the minimum cut is large enough; so if the minimum cut is
small, that cut will give us a violated constraint.
Lemma 5.3.3 There is a polynomial-time algorithm that can find a violated constraint
in LP2′ or report that none exists.
Proof. Because each constraint in (5.1) involves variables of a single cost, if (5.1) is
violated for some set Q, then Q is a subset of some A^c_ji for a triple i, j, c, and so we can
consider each subgraph C^c_ij independently. The proof of Theorem 5.3.1 shows that there is
a violated constraint of the form (5.1) between i and j involving variables of cost c if and only
if the maximum flow in C^c_ij is less than y(A^c_ij). Thus, the minimum cut can be found for
each triple i, j, c, and, if a triple i, j, c is found where the minimum cut is less than y(A^c_ij),
one knows that a violated constraint exists between positions i and j with Q ⊂ A^c_ji.
The minimum cut can then be examined to determine the violated constraint explicitly.
Let {r} ∪ A ∪ B be the minimum r – s cut in C^c_ij, with A ⊆ A^c_ji and B ⊆ A^c_ij, and let
Ā = A^c_ji \ A and B̄ = A^c_ij \ B (following the notation of Lemma 5.3.2). Such a cut
is shaded in Figure 5.3. Let m be the capacity of this cut, and assume, because we
are considering a triple i, j, c that was identified as having a violated constraint, that
y(A^c_ij) > m. Because m < ∞, there are no edges going from A to B̄, and hence two things
hold: (1) m = y(B) + y(Ā), and (2) N(A) ⊆ B, and therefore y(N(A)) ≤ y(B). Chaining
these facts together, we have

y(A) = y(A^c_ij) − y(Ā) > m − y(Ā) = y(B) ≥ y(N(A)) .

Thus, the set A is a set for which a constraint of the form (5.1) is violated. □
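The separation routine can be sketched directly from this proof: build the capacitated graph for one subgraph C^c_ij, compute a maximum flow, and, if it falls short of y(A^c_ij), read a violated set off the source side of the minimum cut. The sketch below (a plain Edmonds–Karp max flow, with hypothetical dict-based inputs) is illustrative only; in practice we rely on the heuristic constraint selection described later.

```python
from collections import deque

INF = float("inf")

def max_flow(cap, s, t):
    """Edmonds-Karp on a capacity dict-of-dicts. Returns (flow value, set of
    nodes reachable from s in the final residual graph, i.e. the source side
    of a minimum cut)."""
    for u in list(cap):                      # add zero-capacity reverse edges
        for v in list(cap[u]):
            cap.setdefault(v, {}).setdefault(u, 0.0)
    flow = {u: {v: 0.0 for v in cap[u]} for u in cap}
    value = 0.0
    while True:
        parent = {s: None}                   # BFS for an augmenting path
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in cap[u]:
                if v not in parent and cap[u][v] - flow[u][v] > 1e-9:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return value, set(parent)        # no augmenting path remains
        bottleneck, v = INF, t               # augment along the path found
        while parent[v] is not None:
            u = parent[v]
            bottleneck = min(bottleneck, cap[u][v] - flow[u][v])
            v = u
        v = t
        while parent[v] is not None:
            u = parent[v]
            flow[u][v] += bottleneck
            flow[v][u] -= bottleneck
            v = u
        value += bottleneck

def find_violated_set(Y_left, Y_right, edges):
    """Separation for one subgraph: Y_left holds the values of the variables
    in A^c_ji, Y_right those in A^c_ij, and edges lists compatible
    (left, right) pairs. Returns a subset A of the left triples whose
    constraint (5.1) is violated, or None if the flow saturates y(A^c_ij)."""
    cap = {"r": dict(Y_left)}                # r -> left, capacity = Y value
    for left, right in edges:
        cap.setdefault(left, {})[right] = INF
    for right, y in Y_right.items():
        cap.setdefault(right, {})["s"] = y   # right -> s, capacity = Y value
    value, reachable = max_flow(cap, "r", "s")
    if value >= sum(Y_right.values()) - 1e-9:
        return None
    return {left for left in Y_left if left in reachable}
```

In the toy instance below, two left triples of weight 0.5 share a single right neighbor of weight 0.5, so at most 0.5 units of flow can cross and the pair of left triples is returned as a violated set.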
In practice, the ellipsoid algorithm is often not as fast as the (theoretically slower)
simplex algorithm. We can use the faster simplex algorithm here because not all of these
constraints are necessary for real problems. Some particular choices of Qij yield constraints
that are intuitively very useful and usually suffice in practice. The constraints with
the largest Qij, that is, Qij = Aij, were used in the proof of Theorem 5.3.1; we have
found them useful in practice, and they are included in all the LP relaxations solved in
Section 5.4. In fact, for this choice of Qij, inequality (5.1) holds with equality. LP2 already includes all
the constraints with Qij equal to a singleton set {(u, j, c)} ⊂ A^c_ij. Rather than including
constraints with 1 < |Qij| < |A^c_ij|, we include the constraints with i and j reversed, taking
Qji to be a singleton {(v, i, c)} ⊆ A^c_ji for v ∈ Vj. More detail about our approach to and experience
with real problems can be found in Section 5.4. Examples can be constructed for which
these particular constraints are insufficient to make LP2 as tight as LP1. For example,
Figure 5.4 gives a graph C^c_ij for a single cost for which all these constraints hold but for
which no solution to LP1 can be constructed. However, we have not encountered such
pathological cases in practice, and so we do not explore adding constraints with |Q| > 1.
5.3.5 Integrality Gap
As discussed in Section 3.2.4, if arbitrary positive weights are allowed on the edges, the
integrality gap of IP1 and IP2 can be made as large as desired. Here, however, the possible
weights are constrained in several ways. First, they are integers, a fact we have exploited
in our reformulation of the problem. Second, when using a distance metric to compare
windows, the weights on the edges of the graph satisfy the triangle inequality; this
is the key property that makes a reasonable approximation algorithm possible. Finally,
the weights are not independent: they are derived from overlapping sliding windows, a
property that we have not yet been able to exploit. There is thus hope, in the motif-finding
case, that it may be possible to prove that the LP relaxation gives a good lower bound on
the optimum.
Figure 5.5: Example for which the integrality ratio of the LP relaxation is at least 4/3. The drawn edges have cost 2, while the undrawn edges have cost 1. Any integral solution will have objective value 4, but placing 0.5 on each node yields cost 3.
In the setting where the distance function is the Hamming distance, the gap is at least
4/3. An example can be seen in Figure 5.5: one triangle is labeled with length-ℓ strings
each consisting of a single type of nucleotide (ℓ = 2 in the figure). The other is labeled
with length-ℓ strings consisting of 1/2 of one base and 1/2 of another. That is, the drawn
edges have weight ℓ, while the other edges have weight ℓ/2, so the cost of any integral
solution is ℓ + (1/2)ℓ + (1/2)ℓ = 2ℓ. As in Section 3.2.4, by putting 0.5 on each node and
on each undrawn edge, the fractional solution obtains a value of 6 × (0.5)(1/2)ℓ = 1.5ℓ. The
integrality gap is thus at least 4/3.
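The arithmetic of this example is easily verified mechanically; a quick check with exact rationals (the name ell below stands for the window length ℓ):

```python
from fractions import Fraction

ell = Fraction(2)  # window length ℓ, as in Figure 5.5

# Cost of any integral solution: one drawn edge of weight ℓ plus two
# undrawn edges of weight ℓ/2 each.
integral = ell + Fraction(1, 2) * ell + Fraction(1, 2) * ell

# Fractional solution: weight 1/2 on each of the 6 undrawn edges,
# each of cost ℓ/2.
fractional = 6 * Fraction(1, 2) * (ell / 2)

ratio = integral / fractional  # integrality gap of this instance
```

With ℓ = 2 this gives 4 versus 3, a ratio of exactly 4/3.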
5.4 Computational Results
5.4.1 Methodology
We have found the following methodology to work well in practice. We first solve the
LP relaxation of IP2. If the solution is integral, we are finished. Otherwise, we add
any violated constraints of the form (5.1) with i and j reversed and with |Qji| = 1, and
re-solve. We have never encountered a problem for which more constraints than these were
necessary to make LP2 as tight as LP1. We use the optimal basis of the previous iteration
as a starting point for the next, setting the dual variables for the added constraints to be
basic.
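Abstractly, the loop just described is a standard cutting-plane iteration. A generic sketch (solve and separate are placeholders for the LP solver and the violated-constraint search; the warm-starting of the simplex basis described above is not modeled):

```python
def cutting_plane(solve, separate):
    """Solve a relaxation, search for violated constraints, add them, and
    re-solve until no violated constraint remains.

    solve(constraints) -> solution of the relaxation with the constraints added;
    separate(solution) -> list of violated constraints (empty when done)."""
    constraints = []
    while True:
        solution = solve(constraints)
        cuts = separate(solution)
        if not cuts:
            return solution, len(constraints)
        constraints.extend(cuts)
```

As a toy stand-in for an LP, minimizing a scalar x over lazily revealed lower bounds (solve returns the largest bound seen, separate reveals x ≥ 3 and then x ≥ 5) terminates after two rounds with x = 5.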
Because the choice of simplex variant can make a large difference in running
time in practice, LP1 was solved using two different variants. In the first (primal
dualopt), the primal problem was solved using the dual simplex algorithm. In the second
(dual primalopt), the dual problem was solved using the primal simplex algorithm.
While, in theory, these two variants should perform similarly, in practice their running times
can differ significantly. Applying the dual simplex method to the dual problem or the
primal simplex method to the primal problem is not expected to perform as well, and small-scale
testing confirms this intuition (data not shown). LP2 was always solved using the dual
simplex method applied to the primal problem, to make adding constraints faster.
The linear and integer programs were specified with AMPL and solved using CPLEX
7.1. All experiments were run on a public 1.2 GHz SPARC workstation shared by many
researchers, using a single processor. All the timings reported are in CPU seconds on this
machine. Any problem taking longer than 5 hours was aborted.
5.4.2 Test Sets
We present results on identifying the binding sites of 50 transcription factor families. We
construct our data set from the data of (Robison et al., 1998; McGuire et al., 2000) in
a fashion similar to (Osada et al., 2004). In short, we remove all sites for sigma-factors,
duplicate sites, and sites that could not be unambiguously located in the genome.
For each transcription factor under consideration, we extract the proteins it regulates
and gather at least 300 base pairs of genomic sequence upstream of their transcription
start sites. In those cases where the binding site is located further upstream, we extend
the sequence to include the binding site. The window size for each family was chosen
based on the length of the consensus binding site, determined from other biological
Table 5.1: Sizes for the 50 problems considered: number of sequences (p), motif length (ℓ), and total number of nodes in the underlying graph (n).

TF      ℓ   p   n       TF      ℓ   p   n       TF      ℓ   p   n
ada     31  3   810     fruR    16  11  4082    metR    15  8   3312
araC    48  6   1715    fur     18  9   3182    modE    24  3   934
arcA    15  13  4790    galR    16  7   2188    nagC    23  6   1870
argR    18  17  5960    gcvA    20  4   1234    narL    16  10  3301
carP    25  2   552     glpR    20  11  3829    ntrC    17  5   1516
cpxR    15  9   2614    hipB    30  4   1084    ompR    20  9   3057
cspA    20  4   1410    hns     11  5   1485    oxyR    39  4   1048
cynR    21  2   854     hu      16  2   571     pdhR    17  2   568
cysB    40  3   783     iclR    15  2   588     phoB    22  14  4618
cytR    18  5   1695    ilvY    27  2   1079    purR    26  20  5856
dnaA    15  8   2381    lacI    21  2   560     rhaS    50  2   502
fadR    17  7   2122    lexA    20  19  5554    soxS    35  13  4004
farR    10  3   873     lrp     25  14  4090    torR    10  4   2198
fhlA    27  2   731     malT    10  10  3410    trpR    24  4   1108
fis     35  18  5371    marR    24  2   813     tus     23  5   1390
flhCD   31  3   810     melR    18  2   717     tyrR    22  17  5258
fnr     22  12  3705    metJ    16  15  5754
studies. The families, their sizes and the length of the binding site are shown in Table 5.1.
5.4.3 Performance of the LP Relaxations
We solved LP1 and LP2 relaxations for the 50 problems in Table 5.1. The running times
of LP2 are shown in Figure 5.6(b). Generally, the initial set of constraints is sufficient to
get a tight solution. Six problems required adding constraints to LP2 in order to make it
as tight as LP1. The problems flhCD, torR, and hu required 2 cutting plane iterations,
ompR required 3, oxyR required 4, and nagC required 5. Running times reported in
Figure 5.6(b) are the sum of the solve times of all cutting plane iterations.
Of the 45 problems solvable in < 5 hours, only 3 were not integral. This is somewhat
surprising. Of course, there is much structure to real problems, which may make them less
susceptible to the worst-case analysis. The success of the LP relaxations in finding integral
Figure 5.6: (a) Speed-up factor of LP2 over LP1 as defined in (5.3), for problems for which some method was able to complete in < 5 hours. Shaded bars correspond to problems for which LP1 did not finish in < 5 hours. The doubly shaded bar (far right) marks the problem for which LP2 did not finish in < 5 hours, but LP1 did. (b) Running times in seconds for LP2; the y-axis is in log scale. (c) Matrix size for LP2 divided by that for LP1.
solutions suggests that handling non-integral cases may not be as pressing a problem as
one would think.
We compared the running time of LP2 with that of LP1 by taking as the running time
of LP1 that of the simplex variant that performed best. In other words, we take the
speed-up factor to be:

min(time(primal dualopt), time(dual primalopt)) / time(LP2) .  (5.3)
This gives LP1 the benefit of the doubt: in practice, always achieving the runtime used
in the numerator would require running each variant in parallel using 2 processors. The
speed-up factors for these problems are shown in Figure 5.6(a). For 10 problems, neither
simplex variant completed in < 5 hours when applied to LP1, whereas LP2 did. For these
problems, the numerator of (5.3) was taken to be 5 hours; this gives a lower bound on
the speed up. For one problem, cytR, the reverse is true and LP2 could not finish within 5
hours, while both simplex variants successfully solved it using LP1. For this problem, the
denominator was taken to be 5 hours, and (5.3) gives an upper bound. For 5 problems, no
method found a solution in < 5 hours; these are omitted from Figure 5.6. Only cytR was
slower using LP2, and an order of magnitude increase in speed is common when using LP2
compared with LP1.
As expected, the size of the constraint matrix (defined as the number of constraints
times the number of variables) is often smaller for LP2. Figure 5.6(c) plots the ratio

(size for LP2) / (size for LP1) .  (5.4)

While in 5 cases the matrix for LP2 is larger, in many cases it is less than 50% of the size of the
matrix for LP1. A smaller constraint matrix can often lead to faster iterations of the
simplex algorithm.
We also compared the motifs found by our approach to the set of known transcription
factor binding sites. In all cases, we found motifs that are at least as well conserved as
the actual binding sites (measured by average information content). Since our test data
are real genomic sequences, co-regulated genes may in fact have multiple shared binding
sites for different transcription factors.
5.5 Conclusions
In this chapter, we introduced a novel ILP formulation of the motif finding problem that
works well in practice. In particular, it finds optimal solutions to motif-finding instances
significantly faster than a previous ILP formulation introduced by (Zaslavsky and Singh,
2005). We note that a variety of graph pruning and decomposition techniques have been
introduced for motif finding (e.g., (Reinert et al., 1997; Pevzner and Sze, 2000; Zaslavsky
and Singh, 2005)). It is likely that, in conjunction with those techniques, our formulation
will be able to tackle problems of significantly larger sizes.
Our work opens many interesting avenues for future work. While the underlying graph
problem for motif finding is essentially identical to that of Chapters 2–4, one central
difference is that when minimizing distance based on nucleotide matches and mismatches,
the triangle inequality is satisfied. The current ILP formulations do not exploit this, and
as a result, work in its absence. Another feature commonly present in motif finding that
is not used here is that the edge weights in the graph are not independent, as each node
represents a subsequence from a window sliding along the DNA. Incorporating either the
triangle inequality or the correlation between edge weights into the ILP or its analysis may
lead to further advances in computational methods for motif finding. Finally, it would be
useful to extend the basic formulation presented here to other motif finding applications,
for example, to find multiple co-occurring or repeated motifs.
Chapter 6
Identifying Functionally Related
Yeast Proteins Using Inferred
Evolutionary History
We now consider protein function more generally and focus on using cross-
genomic evolutionary information in the form of the widely used method of
phylogenetic profiles to predict functional linkages between proteins. In this
chapter, we give two methods for making use of the relationships between the
organisms to improve predictions. Along the way, we develop some insight into
what kinds of evolutionary patterns are most indicative of shared function.
6.1 Introduction
The inheritance patterns of interacting or functionally related proteins tend to be similar
due to shared evolutionary pressures. If we knew the evolutionary history of two pro-
teins — a description of when the proteins were gained, lost, or transferred over time
— we could use the assumption that similar histories indicate similar function to decide
whether the proteins are functionally related. Of course, we do not generally know the
complete evolutionary history. As a proxy, the method of phylogenetic profiles, introduced
in (Tatusov et al., 1997; Gasterland and Ragan, 1998; Pellegrini et al., 1999), uses the
presence or absence of similar proteins across many present-day organisms.
In the phylogenetic profile method, each protein in the organism under study is rep-
resented by a vector with one dimension for each of many currently extant species. In
the general case, an entry in the vector represents the degree to which a similar protein
(or homolog) exists in the species corresponding to that dimension. In some applications,
entries are simply 1 if a homolog is believed to exist and 0 otherwise, giving bit vectors
representing the presence and absence of the protein across different species. We refer
to these vectors (either 0/1 or real-valued) as profiles. Combining the assumption that
shared evolution implies shared function with this representation of a protein’s evolution-
ary history we have
shared function ⇐⇒ similar evolution ⇐⇒ related profiles. (∗)
Neither of the ⇐⇒ above is a mathematical absolute; rather, they are simply observations
that tend to hold. Much of the progress in applying phylogenetic profiles has been
in refining the definition of the relation at the right end of the above chain of reasoning
so that the implications hold as best as possible. As originally proposed (Pellegrini et al.,
1999), the profiles were binary vectors and simple Hamming distance was used to define
the relation. The Pearson correlation coefficient (Wu et al., 2003), hypergeometric distri-
bution (Wu et al., 2003), and mutual information (Wu et al., 2003; Date and Marcotte,
2003) were later suggested as more refined ways of measuring the relationship between
profiles. We give two new measures in this chapter that more fully exploit the connections
among the organisms from which the profiles are constructed.
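For binary profiles, the first and last of these measures are simple to compute. A minimal sketch (profiles are 0/1 lists; the mutual information is the plug-in estimate, in bits, from the empirical joint counts):

```python
from math import log

def hamming(p, q):
    """Number of positions where two binary profiles disagree."""
    return sum(a != b for a, b in zip(p, q))

def mutual_information(p, q):
    """Mutual information (in bits) of the joint presence/absence
    distribution of two binary profiles, estimated from empirical counts."""
    n = len(p)
    joint = {}
    for a, b in zip(p, q):
        joint[(a, b)] = joint.get((a, b), 0) + 1
    mi = 0.0
    for (a, b), count in joint.items():
        pxy = count / n                       # joint frequency (always > 0 here)
        px = sum(x == a for x in p) / n       # marginal frequency of a in p
        py = sum(y == b for y in q) / n       # marginal frequency of b in q
        mi += pxy * log(pxy / (px * py), 2)
    return mi
```

Identical profiles with balanced presence/absence yield 1 bit of mutual information, while profiles whose entries are empirically independent yield 0.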
The task of identifying proteins with similar functions can be formalized as follows:
Functional-linkage problem: Given a collection of functions {Fi}, each
represented by the set of proteins it contains, find those pairs of proteins
(g1, g2) that are in Fi × Fi for some Fi.
Solving this problem is a potentially fruitful way to begin to make sense of the complex
web of interactions in the cell. It has been proposed (Hartwell et al., 1999) that the
workings of the cell are organized into modules, where there are many interactions and
dependencies within a module and fewer between modules. Detecting that two proteins
are participants in the same pathways is useful for the discovery of such modules. For
example, (Date and Marcotte, 2003) use a network derived from comparing phylogenetic
profiles to uncover clusters of proteins of related function for several organisms.
Even weak evidence that two proteins are functionally related can be beneficial when
incorporated into meta-methods, such as Bayesian networks, that combine information
from many weak classifiers to make very accurate predictions. Such methods have been
useful recently in predicting shared function (Troyanskaya et al., 2003; von Mering et al.,
2003) and protein interactions (Jansen et al., 2003), and phylogenetic profiles have been
incorporated into such methods (e.g. (Marcotte et al., 1999b; Bowers et al., 2004b; Lee
et al., 2004; Lu et al., 2005)). Thus, while the imperfect nature of the reasoning (∗) limits
the accuracy of phylogenetic profiles in isolation, the method can be very useful when
combined with several indicators of shared function.
In this chapter, we improve the ability of phylogenetic profiles to identify proteins
involved in the same pathways by using the evolutionary relationships (phylogenetic in-
formation) between the species used to create the profiles. Vector-based distance measures
between profiles do not take into account such relationships. Here, we introduce a new
probabilistic description of evolutionary history by assigning probabilities to gene pres-
ence at hypothetical ancestral nodes of a species tree relating the extant organisms in the
profile. We also investigate informative patterns of gain and loss in such evolutionary his-
tories and give two methods for using them to predict whether two proteins are involved
in the same pathway.
6.1.1 Our Contributions
We begin our study by investigating to what extent the chain of implications (∗) fails.
To do this, we give a per-function upper bound on the ROC performance of any classifier
that uses the cross-genomic occurrence of proteins to separate the proteins involved in
a particular function from other proteins. This analysis leads us to several classes of
profiles that are common sources of errors and points out those functions that tend to be
more difficult to handle. We show that, in general, phylogenetic profiles contain sufficient
information to map a protein to a function.
We then probe the ability of mutual information, as proposed for the current appli-
cation by (Date and Marcotte, 2003), to identify functional linkages. We show the first
ROC analysis of the performance of mutual information on identifying proteins involved
in the same KEGG (Kanehisa, 1997; Kanehisa and Goto, 2000) pathways and discuss
when certain variations in the application of mutual information are to be preferred over
others. We also suggest that ROC analysis, though helpful, can sometimes be misleading;
we give an alternative per-function assessment of performance. Mutual information is a
reasonable similarity measure between profiles, and as such it will be our baseline method
against which to compare.
Our experiments suggest that prediction of shared function can benefit from the ad-
dition of information about the relationships between species. We augment the profile
vector with a phylogenetic tree with leaves corresponding to the organisms used to create
the profiles and describe a method for inferring whether a gene is present or absent at
the ancestral nodes. We show that taking advantage of these inferred states to compare
two proteins by simply counting contemporaneous gene gains or losses can exceed the
performance of mutual information for many pathways.
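Counting contemporaneous events is straightforward once per-edge events have been inferred. A hypothetical sketch (each protein's inferred history is represented here as a dict mapping tree edges to +1 for a gain, -1 for a loss, and 0 for no event; these structures are illustrative, not our implementation):

```python
def shared_events(events1, events2):
    """Count the tree edges on which two proteins underwent the same event:
    contemporaneous gains and contemporaneous losses."""
    shared_gains = sum(1 for e in events1 if events1[e] == events2.get(e) == 1)
    shared_losses = sum(1 for e in events1 if events1[e] == events2.get(e) == -1)
    return shared_gains, shared_losses
```

The pair of counts (or their sum) can then serve as a similarity score between the two proteins.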
Finally, we give a scoring scheme that supplements simple counting of events and
show that, if we are permitted to extract some evidence about which evolutionary events
are most indicative of shared function from a set of training examples, we can again
detect functionally linked proteins better than mutual information for most pathways.
An analysis of the parameters extracted from the training examples will be instructive
about what local patterns are most suggestive of shared function.
6.1.2 Previous Work
In addition to phylogenetic profiles, several other methods that do not depend on sequence
similarity for detecting shared function have been proposed. For example, proteins can be
predicted to be functionally related if they have fused in one or more genomes (Enright
et al., 1999; Marcotte et al., 1999a), or are located near one another in a genome (Overbeek
et al., 1999).
Mutual information applied to phylogenetic profiles has been used to identify proteins
involved in similar pathways in (Date and Marcotte, 2003), where they show that actual
pairs of profiles often have much higher mutual information than would be expected by
chance as modeled by shuffled profiles. Further, they show that the higher the mutual
information between profiles, the more likely the proteins take part in a shared KEGG
pathway. This establishes mutual information as a reasonable way to compare profiles.
Recently, there has been some work on incorporating relationships between species
into the prediction of shared function or protein interactions. In (Goh et al., 2000; Pazos
and Valencia, 2001), the authors propose representing the evolutionary history of a pro-
tein by an n× n matrix, with one row and one column for each of n species. The entries
of this matrix are estimates of evolutionary distance between the corresponding homologs
in two organisms. Two matrices representing proteins are compared using the linear cor-
relation coefficient over their entries. The method has been applied to detecting domains
that are in the same protein, predicting protein-protein interactions in E. coli (Pazos and
Valencia, 2001), comparing chemokines and chemokine receptors, and comparing inter-
acting domains in phosphoglycerate kinase proteins (Goh et al., 2000). This method is
more computationally expensive than the phylogenetic profile method as it requires com-
putation of, potentially, n×n alignments for each pair of proteins considered. As pointed
out in (Pazos and Valencia, 2001), successful large-scale application can depend on the
quality of the alignments used to populate the distance matrix.
In the above case, as in our setting, the mapping between leaves of the two trees to
be compared is known. In a slightly different application, two groups (Gertz et al., 2003;
Ramani and Marcotte, 2003) use simulated annealing to rearrange the rows and columns
of two distance matrices to match the leaves of two trees so as to maximize the correlation
between the matrices and discover this mapping. This allows the identification of specific
interacting partners among the profiles of two families known to interact generally. It has
been successfully applied to match chemokines and tgfβ ligands with their receptors (Gertz
et al., 2003) and histidine kinase sensors and their regulators as well as several other
families (Ramani and Marcotte, 2003). A topology-based method of similar spirit was
recently introduced in (Jothi et al., 2005). While potentially very useful for small-scale
investigation, such a method is not yet applicable to genome-wide analysis.
(Liberles et al., 2002) suggest using a presence / absence labeling of the internal nodes
of a species tree derived from a minimum parsimony criterion. They note that those pairs
of profiles that have low parsimony scores are less useful in predicting shared keyword
annotations. Their preliminary testing provides some evidence that shared evolutionary
events may be informative.
For the different problem of classifying profiles into known functions using SVMs, (Vert,
2002) gives a tree kernel that makes use of a species tree.
Independent of the work in this chapter, (Barker and Pagel, 2005) recently reported a
method similar in spirit to our LRATIO method (see below). They find a maximum like-
lihood set of gene gain / loss rate parameters assuming correlated evolution and another
set assuming independent evolution. If the likelihood of a pair of profiles is much higher
under the correlated-evolution model than in the independent model, the pair is predicted
to interact. They test their method on MIPS (Mewes et al., 2002) protein complex data,
with cross-genomic information from 15 eukaryota. In their testing, genes that have more
than two or three shared gains or losses are usually linked, echoing our result that many
shared gains or losses are very indicative of shared pathways. In contrast to their ap-
proach, we learn the shared- and independent-evolution models from training examples.
In addition, they use a tree derived from the sequences of several universal genes, while
we use a taxonomy tree with edge lengths derived directly from our assumed gene gain /
loss model. Finally, in our tests we use profiles containing 215 organisms spanning the three
branches of life, while they test only on eukaryota.
6.2 Methods
6.2.1 Computing the Phylogenetic Profiles
For each of the 5,839 S. cerevisiae proteins, we created a phylogenetic profile with a
dimension for each of 215 organisms. (See Listing 6.1.) Among these organisms there are
7 eukaryota, 20 archaea, and 188 bacteria. In the profile for each S. cerevisiae protein g,
the e-value for organism S is set to the lowest e-value found among the BLASTP (Altschul
et al., 1990) hits against a database containing the protein sequences of organism S for
which the alignment covers at least 40% of the query protein g. If no such match is found
within the first 500 BLASTP hits, the e-value is set to 10. All protein sequences were
downloaded from the NCBI FTP site (NCBI, 2005).
(Date and Marcotte, 2003) suggest transforming each e-value x into a number between
0 and 1 as follows:
    x ↦ 1 − min{ −1/ln(x), 1 } .   (6.1)
We apply this transformation, which is shown in Figure 6.1, to the e-values we computed
to get the entries of the profiles. The transformation can be thought of as a heuristic way
to associate a probability that a homolog is present in each organism: e-values less than
about 10^−4 are likely real homologs and are given value ≥ 0.9 by the transformation. The
probability drops rapidly, however, between 10^−2 and 10^0. E-values larger than about 0.37
are given zero probability of being real homologs.

Figure 6.1: Transforming e-values to probabilities. The transformation is the same as that used in (Date and Marcotte, 2003). See Equation (6.1).
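In code, transformation (6.1) might be implemented as follows (a minimal sketch; the function name is ours, and e-values of at least 1/e ≈ 0.37 are clamped to probability 0, which also covers the e-value-10 placeholder used for missing hits):

```python
import math

def evalue_to_probability(x):
    """Heuristic probability that a homolog is present, given a BLAST
    e-value x, following Equation (6.1): x -> 1 - min(-1/ln(x), 1)."""
    if x >= 1.0 / math.e:   # here -1/ln(x) >= 1 (or ln(x) >= 0), so the min is 1
        return 0.0
    return 1.0 - (-1.0 / math.log(x))
```

For example, an e-value of 10^−8 maps to about 0.95, while anything above about 0.37 maps to 0.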
For some of our analysis, we will work with the binary versions of the above computed
profiles — each entry is set to 0 if it is less than 0.5, or to 1 if it is greater than 0.5.
Under this thresholding, the percentage of S. cerevisiae proteins that have homologs in a
particular genome varies from 85% for the closely related yeast E. gossypii down to 9% for
the bacteria T. pallidum. As expected, other eukaryota share many more proteins with
S. cerevisiae than do bacteria (Figure 6.2).
Figure 6.2: Fraction of S. cerevisiae genes that are predicted to have homologs in each of the 215 organisms. Organisms are sorted by decreasing number of proteins shared with S. cerevisiae. For clarity, organism names are shown for only a subset of the 215 organisms used to construct the profiles. A protein is predicted to have a homolog in an organism if the profile entry for that organism is ≥ 0.5.

6.2.2 Mutual Information

In (Date and Marcotte, 2003; Wu et al., 2003), the authors propose the use of mutual
information to compare phylogenetic profiles. To compute the mutual information between
two real-valued profiles, (Date and Marcotte, 2003) bin each real-valued entry to the
nearest multiple of 0.1. Let p^g_b be the fraction of the entries of the profile for protein
g that are in bin b. Then the entropy of the profile is defined as H(g) = −Σ_b p^g_b log p^g_b.
If p^{g1,g2}_{b1,b2} is the joint probability that a dimension is in bin b1 in the profile for g1 while in
bin b2 in the profile for g2, then the joint entropy of the two profiles for proteins g1, g2 is
H(g1, g2) = −Σ_{b1,b2} p^{g1,g2}_{b1,b2} log p^{g1,g2}_{b1,b2}. The mutual information between the two profiles is
MI(g1, g2) = H(g1) + H(g2) − H(g1, g2). We will also consider mutual information where
the profiles are binned into two bins {0, 1}, where an entry is rounded to 0 or 1 with a
threshold of 0.5. See e.g. (Cover and Thomas, 1991) for an introduction to entropy and
mutual information.
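A minimal sketch of this computation (helper names are ours; entropies are in bits):

```python
import math
from collections import Counter

def entropy(bins):
    """Entropy (in bits) of the empirical distribution of binned entries."""
    n = len(bins)
    return -sum((c / n) * math.log2(c / n) for c in Counter(bins).values())

def mutual_information(profile1, profile2, bin_width=0.1):
    """MI(g1, g2) = H(g1) + H(g2) - H(g1, g2), with each real-valued
    profile entry binned to the nearest multiple of bin_width."""
    b1 = [round(x / bin_width) for x in profile1]
    b2 = [round(x / bin_width) for x in profile2]
    return entropy(b1) + entropy(b2) - entropy(list(zip(b1, b2)))
```

With `bin_width=1.0`, entries in [0, 1] are rounded to {0, 1}, approximating the binary variant discussed above.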
6.2.3 Definition of Pathways
We seek to identify pairs of proteins that are involved in the same biological pathway.
The KEGG (Kanehisa, 1997; Kanehisa and Goto, 2000) database of pathways has 95
pathways annotated with at least two S. cerevisiae proteins. These form our groups of
positive examples: every pair of proteins both annotated to the same KEGG pathway as of
February 25, 2005 is considered a positive example. Because the division of the workings of
the cell into disjoint pathways is somewhat arbitrary, we must be more careful in defining
the pairs that are assumed to be involved in different pathways — the negative pairs.
We are fortunate in that the KEGG pathways are connected into a higher-order pathway
graph, where nodes in the graph represent KEGG pathways, and edges connect pathways
that are related (for example, because a signal is propagated between them).
Negative pairs will be those pairs of proteins that are annotated only with pathways
that do not share an edge. Formally, let N(F) be the neighbors of pathway F in this
graph, and if Q is a set of pathways, let N(Q) = ∪_{F∈Q} N(F). Further, let pw(g) be the
set of pathways in which protein g is involved. A pair of proteins (g1, g2) is a negative
example if N (pw(g1)) ∩ N (pw(g2)) = ∅.
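The negative-example criterion is a simple set computation; a sketch with hypothetical data structures (we include each pathway in its own neighborhood, so that proteins sharing a pathway can never be counted as negatives, which we take to be the intent of the definition):

```python
def is_negative_pair(g1, g2, pw, neighbors):
    """A pair (g1, g2) is a negative example if N(pw(g1)) and N(pw(g2))
    are disjoint.  pw: dict protein -> set of pathways; neighbors: dict
    pathway -> set of neighboring pathways in the higher-order pathway
    graph (assumption: each pathway F is included in neighbors[F])."""
    n1 = set().union(*(neighbors[F] for F in pw[g1]))
    n2 = set().union(*(neighbors[F] for F in pw[g2]))
    return not (n1 & n2)
```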
Under this definition, 1,120 proteins are involved in the examples. There are 36,948
positive pairs and 523,450 negative pairs. The number of yeast proteins annotated as
involved in each KEGG pathway are shown in Figure 6.3. (For clarity, only pathways
containing at least 20 proteins are shown.) A significant percentage (30%) of the positive
pairs come from the ‘ribosome’ KEGG pathway because it is the largest (149 proteins).
Purine and pyrimidine metabolism, cell cycle, and oxidative phosphorylation are other
large pathways.

Figure 6.3: Number of proteins annotated to each of the KEGG pathways. Only pathways that involve at least 20 proteins are shown, though we consider smaller pathways in our analysis as well.
6.2.4 Framework for Inferring Ancestral Gene State
To model the relationships between the species, the 215 species are related by the non-
binary tree T (Listing 6.1) taken from the NCBI taxonomy database. If we knew the
present / absent state at the internal nodes, we could look for contemporaneous gains and
losses along the edges. However, while we “know” the present/absent state at the leaves
(given by the profiles), we must infer the state for ancestral nodes.
Using the framework described below, we will be able to compute, for each organism i
(extant or ancestral), the probability that the protein is present. In other words, if L_g(i)
is the true, hidden gene state of node i for protein g, our framework will allow us to
estimate the probabilities

    Pr[L_g(i) = Present] .   (6.2)
This probabilistic assignment of absent / present to every node in the tree will be our
representation of the protein’s evolutionary history. We consider a gene to be present at
a node if the probability (6.2) is ≥ 0.5; otherwise, we assume the gene is absent.
To calculate these probabilities, we first estimate lengths (times) for the edges in the
tree using expectation maximization (EM), coupled with a simple model of gene gain and
loss (described in the next subsection), so that the edge lengths maximize the likelihood
of generating the complete set of observed S. cerevisiae profiles. Once we have the edge
lengths, we can find the maximum likelihood probabilistic labeling of the internal nodes
given a profile using a dynamic programming algorithm described in (Friedman et al.,
2002; Felsenstein, 1981).
The EM procedure to find the edge lengths iterates between two steps: in the ‘E’ step,
expected counts of edge labelings are computed using a dynamic programming algorithm
and the current edge lengths (which are set uniformly to 1 at the start). In the ‘M’ step,
edge lengths are found for each edge so that the likelihood of those expected counts is
maximized. The first part of the ‘E’ step, which is taken from (Friedman et al., 2002), is
described in the ‘Computing the Likelihood of Data Given Edge Lengths’ section below,
and we refer the reader to (Friedman et al., 2002) for a description of the full ‘E’ step.
The ‘M’ step is described in the ‘Finding Edge Lengths That Maximize Likelihood’ section.
Once edge lengths are known, the maximum likelihood labeling for a given profile can be
computed.
Throughout this chapter α, β, σ, δ will stand for gene states, either ‘present’ or ‘absent’,
which we will sometimes abbreviate P, A.
Figure 6.4: Gene gain/loss probability model. “P” stands for Present, “A” for Absent, g is the gain probability over a short time step, ℓ is the probability of a loss.
The Gene Gain / Loss Model
To assign edge lengths to the branches we use a probabilistic model of gene gain and loss.
We use the Markov process shown in Figure 6.4, where the probability of gaining a
gene over a small time step is g, and the probability of losing a gene over a small time
step is ℓ. The probability that a gene is absent at time step t can be written in terms of
the probability that it is absent at time step t− 1:
    PA(t) = (1 − g) PA(t − 1) + ℓ (1 − PA(t − 1)) ;

so,

    ∆PA = PA(t + 1) − PA(t) = ℓ − (g + ℓ) PA(t) .
Taking the limit as the time step becomes infinitesimal, we have
    dPA(t)/dt + (g + ℓ) PA(t) = ℓ .
We can solve this linear differential equation by multiplying by the integrating factor e^{(g+ℓ)t} (see, e.g., (Betz et al., 1954)):

    dPA(t) · e^{(g+ℓ)t} + (g + ℓ) PA(t) e^{(g+ℓ)t} dt = ℓ e^{(g+ℓ)t} dt ,

that is,

    d[PA(t) e^{(g+ℓ)t}] = ℓ e^{(g+ℓ)t} dt .
Integrating both sides, we obtain

    PA(t) = e^{−(g+ℓ)t} [ ∫ ℓ e^{(g+ℓ)t} dt + C ]
          = e^{−(g+ℓ)t} [ (ℓ/(g+ℓ)) e^{(g+ℓ)t} + C ]
          = ℓ/(g+ℓ) + C e^{−(g+ℓ)t} .
If PA(0) = 1 (the gene started absent), then C = 1 − ℓ/(g+ℓ), while if PA(0) = 0 (the gene
started present), then C = −ℓ/(g+ℓ). So, we get the following transition probabilities:

    Pr[A | A, t] = ℓ/(g+ℓ) + (1 − ℓ/(g+ℓ)) e^{−(g+ℓ)t}   (6.3)
    Pr[A | P, t] = ℓ/(g+ℓ) − (ℓ/(g+ℓ)) e^{−(g+ℓ)t}   (6.4)

Similarly, we can derive PP(t):

    Pr[P | P, t] = g/(g+ℓ) + (1 − g/(g+ℓ)) e^{−(g+ℓ)t}   (6.5)
    Pr[P | A, t] = g/(g+ℓ) − (g/(g+ℓ)) e^{−(g+ℓ)t}   (6.6)
Because ℓ/(g + ℓ) = 1/(1 + g/ℓ), the transition probabilities can be written in terms of
the ratio g/ℓ and the sum g + ℓ. The g + ℓ term only appears as a coefficient of t, and
so changing g + ℓ will only scale the branch lengths, meaning that the real dependency
is only on the ratio between g and ℓ. Though some believe that gene loss is easier than
gene gain, for the experiments described here we have taken g = ℓ = 0.5, as there is little
evidence for a different choice. We have taken the prior probability of gene presence at
the root to be 0.5 (similar results are obtained with 0.25).
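The four transition probabilities (6.3)–(6.6) are easy to tabulate in code; a minimal sketch (function and parameter names are ours; the defaults reflect the choice g = ℓ = 0.5 used in the experiments):

```python
import math

def transition_probs(t, gain=0.5, loss=0.5):
    """Transition probabilities of the two-state gain/loss Markov process
    after time t (Equations 6.3-6.6); gain and loss are the rates g and l.
    Returns {(parent_state, child_state): probability} over {'P', 'A'}."""
    s = gain + loss
    decay = math.exp(-s * t)
    p_aa = loss / s + (1 - loss / s) * decay   # Pr[A | A, t]
    p_pp = gain / s + (1 - gain / s) * decay   # Pr[P | P, t]
    return {("A", "A"): p_aa, ("A", "P"): 1 - p_aa,
            ("P", "P"): p_pp, ("P", "A"): 1 - p_pp}
```

At t = 0 the process is the identity, and as t grows the probabilities approach the stationary distribution (uniform when g = ℓ).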
Computing the Likelihood of Data Given Edge Lengths (E step)
We will optimize the edge lengths {tij} to maximize the likelihood of generating the
observed S. cerevisiae profiles with the given tree and our gain/loss model. We first
present an expression for the likelihood of a labeling. This presentation is based on that
of (Friedman et al., 2002), which we have adapted to work with our gene gain/loss model.
Let Θ represent the complete hypothesis or model — the gene gain/loss model, the tree
topology, and the tree edge lengths. If the tree is rooted at node R, and π(j) denotes the
parent of node j, then the likelihood of a labeling is
    LC = ∏_g Pr[x^g_R] ∏_{j≠R} Pr[x^g_j | x^g_{π(j)}, Θ] ,

where x^g_i is the state (present/absent) of node i in the tree with data given by gene g.
Taking the log, and reorganizing the summations, we can rewrite this as
    log LC = Σ_α Σ_{g : x^g_R = α} log Pr[α] + Σ_{(i,j)} Σ_{(α,β)} Σ_{g : (x^g_i, x^g_j) = (α,β)} log P_{α→β}(t_ij) ,   (6.7)

where α and β range over the alphabet of labelings {absent, present}, P_{α→β}(t_ij) is
shorthand for the probability of going from state α to state β over edge (i, j) given its
current length t_ij, and Pr[α] is the prior probability at the root. Let S_R(α) = |{g : x^g_R = α}|
and let S_ij(α, β) = |{g : x^g_i = α and x^g_j = β}|. Then we can rewrite (6.7) as

    log LC = Σ_α S_R(α) log Pr[α] + Σ_{(i,j)} LL(i, j) ,

where

    LL(i, j) = Σ_{α,β} S_ij(α, β) log P_{α→β}(t_ij) .   (6.8)
Only LL depends on the edge lengths.
For the leaves, as an estimate of Pr[leaf is labeled with e-value x|Present], we use the
curve of Figure 6.1; we use one minus that value for an estimate of the probability of
observing an e-value at a leaf assuming the gene is absent.
We do not know the internal labelings. Instead, we use the expected value of S_ij(α, β),
which equals Σ_g Pr[i = α ∧ j = β | leaf labels for profile g, Θ]. These probabilities can be
computed with an efficient dynamic programming algorithm that is described in (Friedman
et al., 2002; Felsenstein, 1981). Thus, we can compute estimates Ŝ_ij(α, β) for S_ij(α, β).
Once we have the expected counts, we must find an edge length tij for each edge (i, j)
that maximizes the likelihood of those counts.
Finding Edge Lengths That Maximize Likelihood (M step)
Let s = g + ℓ and γ = ℓ/s, and, for notational convenience, drop the subscripts ij.
Combining (6.8) with the transition probabilities developed above, we want to find a t to
maximize
    LL = S(A,A) log[γ + (1 − γ)e^{−st}] + S(P,A) log[γ − γe^{−st}]
       + S(P,P) log[1 − γ + γe^{−st}] + S(A,P) log[1 − γ − (1 − γ)e^{−st}] .

Let x = e^{−st}. Then maximizing LL is equivalent to maximizing

    [γ + (1 − γ)x]^{S(A,A)} · [γ − γx]^{S(P,A)} · [1 − γ + γx]^{S(P,P)} · [1 − γ − (1 − γ)x]^{S(A,P)} .

Noting that γ − γx = γ(1 − x) and 1 − γ − (1 − γ)x = (1 − γ)(1 − x), we see that maximizing
the above is in turn equivalent to maximizing

    [γ + (1 − γ)x]^{S(A,A)} · [1 − γ + γx]^{S(P,P)} · (1 − x)^{S(A,P)+S(P,A)} .   (6.9)
Figure 6.5: Plot of Ψ(x) = (1 + x)^q (1 − x)^c for a few settings of c and q. When q > c, the maximum of Ψ(x) is found at (q − c)/(q + c). When q = c, or c > q, the maximum is at 0, giving an infinite edge length.
If γ = 0.5, then we can simplify (6.9). Let c = S(A,P) + S(P,A) and q = S(A,A) + S(P,P).
Then (6.9) is proportional to Ψ(x) = (1 + x)^q (1 − x)^c. (See Figure 6.5.) Because t ≥ 0,
x ∈ [0, 1]. Setting the derivative of Ψ to 0 gives −(1 + x)^q c(1 − x)^{c−1} + (1 − x)^c q(1 + x)^{q−1} = 0.
So, x = (q − c)/(q + c) maximizes (6.9), and the maximum likelihood edge length is
t = −(1/(g + ℓ)) ln[(q − c)/(q + c)]. If changes are more common than not along an edge (c > q), then the maximum
edge length is infinite. If this occurs in practice during the iterative EM algorithm, we
simply increase the current edge length by some small amount. This rarely happens
in practice, as the maximum likelihood reconstruction tries to minimize the number of
changes.
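For γ = 0.5, this closed-form M-step update can be sketched as follows (names are ours; the `cap` argument is a stand-in for the text's "increase the current edge length by some small amount" when the optimum is infinite):

```python
import math

def ml_edge_length(S_AA, S_PP, S_AP, S_PA, gain=0.5, loss=0.5, cap=100.0):
    """Maximum likelihood edge length when gamma = loss/(gain+loss) = 0.5.
    q counts agreeing (expected) parent/child labels, c disagreements;
    the likelihood (6.9) is maximized at x = (q-c)/(q+c), so
    t = -ln(x)/(gain+loss).  If c >= q the optimum is an infinite edge
    length; we return the placeholder 'cap' instead."""
    q = S_AA + S_PP
    c = S_AP + S_PA
    if c >= q:
        return cap
    x = (q - c) / (q + c)
    return -math.log(x) / (gain + loss)
```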
If γ ≠ 0.5, then a maximizing x can be found using Newton’s method to find a zero of
the derivative of (6.9). Mathematica was used to compute the derivative D(x) of (6.9):

    fp := 1 + γ(x − 1)
    fa := γ + x − γx
    fc := 1 − x
    fq := γ S(P,P)/fp + c/(x − 1) + (1 − γ) S(A,A)/fa
    D(x) = fp^{S(P,P)} · fc^{S(P,A)+S(A,P)} · fa^{S(A,A)} · fq ,

with c = S(P,A) + S(A,P) as above.
6.2.5 Comparing Profiles Using Tree Labelings
Counting Shared Gain/Loss
If two genes co-evolve, one expects them to be gained or lost from organisms at the same
time. We label a node present if the probability (6.2) is ≥ 0.5. Looking at the labeled
trees, we would expect correlated genes to contain more edges (i, j) where both genes are
present at i, and both absent at j, or vice versa. Symbolically, we expect many AA −→ PP
and PP −→ AA transitions. (We use the notation αβ −→ σδ to denote the situation where,
when comparing tree labelings for proteins g1 and g2, a parent node is labeled α in the
reconstruction for gene g1 and β in g2 while a child is labeled σ in g1 and δ in g2.) Ranking
examples by the number of these AA −→ PP and PP −→ AA transitions gives our shared
gain/loss method, or SGL for short. An advantage of counting shared gains and losses is
that only profiles that have a fair number of both present and absent entries — and are
thus ‘interesting,’ at least by some measure — can have many gains and losses.
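Given most-probable labelings for two genes, the SGL count can be sketched as follows (the data structures are ours):

```python
def sgl_score(labels1, labels2, edges):
    """Shared gain/loss (SGL) score: the number of tree edges (parent,
    child) on which both genes change state together, i.e. AA -> PP
    (shared gain) or PP -> AA (shared loss).
    labels1/labels2: dict node -> 'P' or 'A' (most-probable labelings);
    edges: list of (parent, child) pairs of the species tree."""
    count = 0
    for parent, child in edges:
        pair_parent = labels1[parent] + labels2[parent]
        pair_child = labels1[child] + labels2[child]
        if (pair_parent, pair_child) in {("AA", "PP"), ("PP", "AA")}:
            count += 1
    return count
```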
Comparing Likelihoods
The other possible transitions contain information too. For example, a PP −→ PP edge is
likely more indicative of shared evolution than a PA −→ PA edge. If we are permitted to
look at some training positives and
negatives, we can learn how to take advantage of the whole range of possible edge la-
belings by estimating joint transition probabilities such as Pr[child = AA | parent =
AA,positive example] and Pr[child = AA | parent = AA,negative example]. One hopes
that the empirical joint transition probabilities computed from positive examples cap-
ture the characteristics of correlated evolution, while those for negative examples capture
independent evolution.
In other words, we can think of labeling a tree from an alphabet of four “characters”
{AA, PP, AP, PA}, and we will derive these transition probabilities between these “charac-
ters” from a set of training examples as follows. Let F be the training set of protein pairs
that are involved in the same pathway. We estimate the joint transition probabilities for,
say, proteins with shared function using the training examples as
    Pr[children are labeled σδ | parents are labeled αβ, +] =
        |{(g1, g2) ∈ F, (i, j) ∈ T : L_{g1}(i) = α ∧ L_{g1}(j) = σ ∧ L_{g2}(i) = β ∧ L_{g2}(j) = δ}|
      / |{(g1, g2) ∈ F, (i, j) ∈ T : L_{g1}(i) = α ∧ L_{g2}(i) = β}| .   (6.10)
If the probability for two transitions should be equal because they are symmetric with
respect to the order of the tree labelings (e.g., Pr[PA | AA,+] should equal Pr[AP | AA,+]),
we compute both, as in Equation (6.10), and use the average of these estimates. Transition
probabilities for unrelated proteins can be derived similarly. We also derive an estimate
for root labelings as
    Pr[Root labeled αβ | +] = |{(g1, g2) ∈ F : L_{g1}(R) = α ∧ L_{g2}(R) = β}| / |F| ,   (6.11)

where R is the root node of the tree. (In our experiments, we take the root to be the
node where the three main branches of life — eukaryota, archaea, bacteria — meet.) See
Section 6.3.4 below for a discussion of the probabilities so derived.
Given these probabilities, we can compute the likelihood of the pair of observed pro-
files assuming that they have correlated evolution using the transition probabilities derived
from the positive examples. Similarly, we can compute the likelihood of the profiles as-
suming independent evolution using the transition probabilities derived from the negative
examples. We score the pair of profiles (g1, g2) by the factor by which joint evolution
appears more likely:
    lscore(g1, g2) = Pr[g1, g2 | +] / Pr[g1, g2 | −] .   (6.12)
To compute Pr[g1, g2 | +] we use a standard likelihood computation, summing over all
possible internal states using an efficient dynamic programming algorithm. We will refer
to the scoring scheme of Equation (6.12) as LRATIO.
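The likelihoods in (6.12) can be computed with the standard pruning recursion run over the four-"character" alphabet; a minimal sketch (data structures and names are ours, and for simplicity we assume a single transition matrix shared by all edges, whereas the text uses per-edge lengths):

```python
def pair_likelihood(tree, leaf_pair_probs, trans, root_prior, root):
    """Likelihood of a pair of profiles by Felsenstein-style pruning over
    the alphabet {AA, PP, AP, PA}.
    tree: dict node -> list of children ([] for leaves);
    leaf_pair_probs: dict leaf -> {state: Pr[observed e-values | state]};
    trans: dict (parent_state, child_state) -> probability;
    root_prior: dict state -> prior probability at the root."""
    states = ("AA", "PP", "AP", "PA")

    def up(node):
        # up(node)[s] = Pr[observations below node | node has state s]
        if not tree[node]:                       # leaf
            return leaf_pair_probs[node]
        probs = {}
        for s in states:
            p = 1.0
            for child in tree[node]:
                cp = up(child)
                p *= sum(trans[(s, c)] * cp[c] for c in states)
            probs[s] = p
        return probs

    root_probs = up(root)
    return sum(root_prior[s] * root_probs[s] for s in states)
```

Running this once with parameters estimated from positive pairs and once with parameters from negative pairs, and taking the ratio, gives the LRATIO score.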
6.2.6 Assessing Results
We will consider two means for assessing how well a method identifies proteins with shared
function. First, we will use ROC (receiver operating characteristic) curves in which the
number of true positives scored above a threshold is plotted against the number of
false positives scored above that threshold. As a useful summary statistic, in Section 6.3.1,
we will use the area under such a curve when allowing up to 10% of the negatives to be
predicted as positives (denoted ROC10%).
For most of the experiments discussed here, we assess performance on a per-function
basis. Each method will assign a score to every example indicating to what extent the
method considers the proteins in the pair to be functionally linked. To determine how
well positive examples for each function are distinguished from the negatives we use the
average rank:

    avgrank(F_i) = (1/|F_i^×|) Σ_{(g1,g2) ∈ F_i^×} rank(g1, g2) ,   (6.13)

where F_i^× is the set of positive pairs arising from function F_i, and rank(g1, g2) is defined
as the number of negative examples ranked better than (g1, g2). When there are ties —
several negative and positive examples with the same score — we take the rank to be the
expected number of negatives ranked better if the examples were randomly sorted within
groups of the same score. Formally, if U is the set of negatives,
    rank(g1, g2) = |{(g′1, g′2) ∈ U : score(g′1, g′2) > score(g1, g2)}|
                 + (1/2) |{(g′1, g′2) ∈ U : score(g′1, g′2) = score(g1, g2)}| .
A low average rank indicates that the method performs well. The average ranks (6.13)
are divided by the maximum possible rank to give the fractional average rank. When
comparing the scores assigned by two methods we plot the difference in fractional average
rank assigned to each function.
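The rank and average-rank computations can be sketched as follows (names are ours; higher scores are assumed better):

```python
def rank(pair_score, negative_scores):
    """Rank of a positive pair: the number of negatives scoring strictly
    better, plus half the negatives tied with it."""
    better = sum(1 for s in negative_scores if s > pair_score)
    tied = sum(1 for s in negative_scores if s == pair_score)
    return better + 0.5 * tied

def avgrank(positive_scores, negative_scores):
    """Average rank (Equation 6.13) of a function's positive pairs; divide
    by len(negative_scores) to get the fractional average rank."""
    return sum(rank(p, negative_scores) for p in positive_scores) / len(positive_scores)
```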
Figure 6.6: Estimate of maximal ROC performance for the KEGG pathways for various values of the ball radius r. Circles give the absolute maximum ROC10% upper bound. The gray bars give the estimate for when r is 10% of the length of the profiles, and the other symbols give the estimated ROC10% upper bound for larger values of r.
6.3 Computational Results
6.3.1 Limit on ROC Performance of Phylogenetic Profiles
Because some negative examples may have the same profiles as positive examples in a
pathway, it may be that no classifier can achieve perfect predictions. We are interested
in quantifying how such confusions affect the ability of phylogenetic profiles to associate
proteins with functions. To do this, for this section, we consider a variation on the
functional-linkage problem:
Functional Separation: Given a collection {Fi} of functions, for each func-
tion Fi, distinguish those proteins in Fi from those not in Fi.
From each Fi arises a separate classification problem with the positives taken to be those
proteins in Fi and the negatives taken to be those proteins in some Fj with j 6= i. (For this
section, we consider only those functions with at least 5 annotated proteins.) The known
functional groupings play a much more central role in this variant since each function is
treated independently. In this section, we measure the performance of a classifier for this
problem using ROC curve analysis, taking the area under the curve up to 10% of the false
positives (which we denote ROC10%) as a measure of success. For ease of analysis, for
this section only we use binary profiles — thresholding the real-value entries at 0.5.
When measuring performance using ROC10% area only the ranking of examples is
important. We can compute the theoretical upper bound on the ROC10% area achievable
by any classifier by grouping proteins with the same profile together and ordering the
groups in decreasing order of the ratio of positives to negatives that they contain. In
other words, we take the classifier to be an algorithm that can rank points of the Hamming
cube. The optimal ROC10% performance of such a classifier is an absolute upper bound
on any classifier that sees only the vectors because such classifiers must treat equal profiles
equally.
This upper bound endows the classifier with a lot of power: it can distinguish between
two profiles perfectly even if they differ by only one bit. In practical cases, the space
of possible profiles is very large (here, 2^215), while the number of actual profiles is much
smaller (here, 1,000–2,500), so it may not be surprising, even though the dimensions
are not independent, if few clashes between positives and negatives occur. It is thus
interesting to put the above upper bounds into context by assessing the performance of
theoretically weaker classifiers.
To do so, we must assume a computational model for the classifier. Above, the algo-
rithm was allowed to distinguish among vectors that differ in any dimension by ranking
points of the Hamming cube. Equivalently, we may think of the algorithm as allowed to
rank balls of radius ≥ 0 centered at an input point. We introduce a series of weaker clas-
sifiers that are limited to ranking balls of radius ≥ r (measured in the fraction of differing
entries) centered at an input point. In this weaker model of classifier, the rank of an input
example is taken to be the lowest rank of a ball that contains it, and points can still be
Figure 6.7: Profiles which, for some function F, describe both a protein in F and a protein not in F. White indicates the presence of a homolog. The first profile indicates the breakdown into the three main branches of life: eukaryota (white) are at the left, followed by the archaea (gray), and then the bacteria (black). Numbers next to each profile give the number of positive examples which collide with the profile. Only profiles for which at least 5 examples have collisions are shown.
chosen as ball centers after they have been covered by another ball. The parameter r is
the resolution, and a larger r implies a less powerful classifier. While under this model a
classifier can distinguish between some pairs of examples at distance < r, it cannot do
so “too often.” We would like to find the ranking of balls that maximizes the ROC10% of
the ordering. We use the greedy algorithm that exhaustively searches over balls and radii
for the ball with the best positive to negative ratio and then removes the points in the
ball as a way to find a good ranking of balls. The greedy algorithm does not guarantee
that the optimal ranking is found, however.
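The greedy ball-ranking heuristic can be sketched as follows (a simplification with names of our own; we rank balls by their fraction of positives, which orders them the same way as the positive-to-negative ratio, and break ties toward larger balls):

```python
def hamming(p, q):
    """Hamming distance between two binary profiles."""
    return sum(a != b for a, b in zip(p, q))

def greedy_ball_ranking(examples, min_radius):
    """examples: list of (profile, is_positive) with profiles as 0/1 tuples.
    Greedily pick the Hamming ball of radius >= min_radius (centered at a
    remaining profile) with the best positive fraction, output the examples
    it covers, remove them, and repeat.  Returns the ranked list of balls."""
    remaining = list(examples)
    ranked = []
    n = len(examples[0][0])
    while remaining:
        best_key, best_center, best_r = None, None, None
        for center, _ in remaining:
            for r in range(min_radius, n + 1):
                ball = [e for e in remaining if hamming(center, e[0]) <= r]
                pos = sum(1 for _, is_pos in ball if is_pos)
                key = (pos / len(ball), len(ball))
                if best_key is None or key > best_key:
                    best_key, best_center, best_r = key, center, r
        ranked.append([e for e in remaining if hamming(best_center, e[0]) <= best_r])
        remaining = [e for e in remaining if hamming(best_center, e[0]) > best_r]
    return ranked
```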
With a classifier with perfect resolution, the highest achievable ROC10% areas for
classification problems arising from the KEGG pathways are shown in Figure 6.6 (circles).
In most cases, a near-perfect ROC10% area is achievable, and no collisions are observed
in 20 of the pathways. In no case is the achievable ROC10% area less than 0.09, but three
pathways have maximum ROC10% area ≤ 0.095: the MAPK signaling pathway (KEGG
ID 04010; ROC10% 0.093), the phosphatidylinositol signaling system (04070; 0.091), and
the cell cycle (04110; 0.095).
Profiles involved in some collision between a positive and a negative for KEGG path-
ways are shown in Figure 6.7. These are the profiles stopping the ROC10% area given
by the circles from being 0.1. Even with profiles constructed from 215 organisms and a
relatively permissive definition of homolog, the main source of error is profiles that
have few homologs outside the eukaryota. The all-present profile also accounts for many
of the errors. The most interesting class of collisions, those involving the profiles of several
dehydrogenases, is shown as cluster G in Figure 6.7. Profile K belongs to several proteins
related to aconitate hydratase and at least one that is related to leucine biosynthesis.
If r is increased to 10% of the length of the profiles, 32 pathways have maximum
achievable ROC10% area ≤ 0.09, while 20 have maximum ROC10% area ≤ 0.08. The
cell cycle (04110) and the basal transcription factors (03022) pathways drop especially
precipitously, both to below ROC10% 0.051. Figure 6.8 shows the profiles responsible for
mistakes with this value of r. For presentation, the 599 unique profiles involved in such
collisions were clustered using a greedy heuristic. (Colliding profiles were considered in
arbitrary order and added to the cluster that had a center to which the profile had lowest
Hamming distance; if no center was within Hamming distance 10, the profile became the
center of a new cluster.)
Several classes of profiles stand out. Again, profiles for which a signal is present only
for the relatively few eukaryota (clusters A, C, E, M, T, GG in Figure 6.8) or present for
nearly all organisms (cluster B) are a top source of errors, and the dehydrogenase profile
(cluster D) continues to be involved in many collisions. Two additional common types of
colliding profiles are of interest. The first are proteins which have homologs in most of
the eukaryota and archaea, but not in the bacteria (clusters N, O, Q, R, S, CC). Such
profiles are common among ribosomal proteins (Lecompte et al., 2002), as well as RNA
polymerase and the proteasome proteins. Secondly, the profiles of clusters W and Y give
profiles that are common among some kinases. From these examples, we can see that
there are two general classes of mistakes: proteins are often confused if their profiles can
be explained by broad speciation events, or they are confused if they are involved in cross-
pathway functionality such as phosphorylation or dehydrogenation. That the former class
Figure 6.8: Profiles involved in reducing the best possible ROC area when r is 10% of the profile lengths. For each function, profiles in that function were found for which there exists a profile with Hamming distance ≤ 22 outside the function. These colliding profiles were greedily clustered into groups of diameter no more than Hamming distance 20. Numbers indicate the number of positive examples that collided with some profile in the cluster. For brevity, only clusters containing at least 5 examples are shown. Gray-scale values indicate the average value of the profiles in the cluster, where white indicates the presence of a homolog. The first profile indicates the breakdown into the three main branches of life: eukaryota (white) are at the left, followed by the archaea (gray), and then the bacteria (black).
Figure 6.9: ROC curves showing the performance of mutual information on the KEGG test set. Lines labeled ‘real MI’ use the real values of the profiles, rounded to the nearest 0.1, while those labeled ‘binary MI’ round the values of the profiles to the nearest number in {0, 1}. Lines marked with ‘full’ use all pathways in KEGG, while those marked with ‘no ribosome’ show the performance if the ribosome positive pairs are removed.
is involved in many collisions is consistent with observations made in (Wu et al., 2003) on
profiles of 41 organisms.
6.3.2 Performance of Mutual Information for Finding Functional Linkages
We test the performance of mutual information on the KEGG data set described above.
How the entries of each profile are discretized can have a large effect on performance,
so we test both the binning scheme suggested by (Date and Marcotte, 2003) and the application of mutual information to binary profiles. The results are shown in Figure 6.9 in the form of ROC curves. Comparing the performance of the real-valued profiles (red) with binary profiles (green) on the full data set shows that MI applied to binary profiles performs considerably worse than when applied to real-valued profiles. However, much of
the advantage of the real-valued profiles comes in detecting pairs of ribosomal proteins.
When the ribosomal positives are removed while keeping the set of negatives the same,
[Figure 6.10 appears here: bar chart of average rank difference (−0.3 to 0.4) across KEGG pathways; bars above 0 mean binary MI does better, bars below 0 mean real-valued MI does better; the ribosome pathway is marked.]
Figure 6.10: For each pathway the difference in average rank (Section 6.2.6) between the binary MI scheme and the real-valued MI scheme (as suggested by (Date and Marcotte, 2003)) is plotted. Values greater than 0 indicate that the binary scheme ranks examples from that function better. The numbers inside the bars give the number of positive pairs arising from the function. Function numbers (x-axis) are KEGG ID numbers. A mapping from KEGG IDs to function names can be found in Table 6.1.
real-valued mutual information performs badly (blue curve in Figure 6.9). Binary mutual
information performs much better (yellow curve), though still not at the level that can
be achieved when the ribosome pairs are included. Real-valued mutual information seems
to be useful, then, primarily to identify ribosomal proteins, and mutual information in
general seems to find ribosomal pairs easy to distinguish from unrelated pairs.
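To make the discretization concrete, here is a small self-contained sketch (not the code used in this chapter) of mutual information computed from empirical counts over two discretized profiles; the toy profiles are invented for illustration:

```python
from collections import Counter
from math import log2

def mutual_information(p1, p2):
    """Mutual information between two discretized phylogenetic profiles.

    p1, p2: equal-length sequences of discrete states (e.g. 0/1 for
    absence/presence of a homolog in each organism).
    """
    n = len(p1)
    assert n == len(p2) and n > 0
    joint = Counter(zip(p1, p2))
    m1, m2 = Counter(p1), Counter(p2)
    mi = 0.0
    for (a, b), c in joint.items():
        p_ab = c / n  # empirical joint probability of state pair (a, b)
        mi += p_ab * log2(p_ab / ((m1[a] / n) * (m2[b] / n)))
    return mi

# Identical balanced presence/absence patterns give MI = 1 bit here.
g1 = [1, 1, 0, 0, 1, 0, 1, 0]
g2 = [1, 1, 0, 0, 1, 0, 1, 0]
print(mutual_information(g1, g2))  # 1.0
```

For the binned ‘real MI’ variant, the inputs would simply be the profile values rounded to the nearest 0.1 rather than to {0, 1}.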
ROC analysis with all functions lumped together (as in Figure 6.9) assumes that the
importance of a pathway is proportional to the number of examples that come from it.
With a more complete data set that might be true, but at this stage the opposite likely
holds: those pathways for which few proteins are known are the best ones for which to
make new predictions. This motivates the use of a per-function analysis of the performance
of various methods, such as that shown in Figure 6.10. For each KEGG pathway, the difference
in the average fractional rank (Section 6.2.6) achieved by binary MI compared with real-
valued MI for examples from that function is plotted. Bars above zero indicate that
examples of that function are scored more favorably using binary MI. Most functions (63
[Figure 6.11 appears here: bar chart of average rank difference (−0.4 to 0.3) across KEGG pathways; bars above 0 mean SGL does better, bars below 0 mean binary MI does better; pathways with fewer than 20% ‘present’ entries are highlighted.]
Figure 6.11: Advantage of SGL over binary MI on the 80 KEGG pathways with at least 10 pairs. Bars indicate, as in Figure 6.10, the difference between the average rank assigned to pairs of each function by the two methods. For the 50 pathways with bars above 0, SGL outperforms mutual information.
out of 80 functions with at least 10 pairs) are handled more effectively by the binary MI
scheme than the real-valued scheme. The most important exception is the largest pathway,
the ribosome, as discussed above. Because it is more effective on most pathways, we will
compare with the binary MI scheme for the rest of this chapter.
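The per-function comparison underlying Figures 6.10 and 6.11 can be sketched as follows; the pair names and scores are hypothetical, and we assume the convention of Section 6.2.6 that a fractional rank of 0 is best:

```python
def average_fractional_rank(scores, members):
    """Average fractional rank of a pathway's positive pairs.

    scores: dict pair -> score (higher = more confidently linked).
    members: the pairs belonging to one pathway. Fractional rank 0.0
    is the best-scored pair, 1.0 the worst, so lower averages mean the
    method favors the pathway's pairs.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    pos = {p: i / (len(ranked) - 1) for i, p in enumerate(ranked)}
    return sum(pos[p] for p in members) / len(members)

# Hypothetical scores from two methods over the same four pairs.
scores_a = {"p1": 0.9, "p2": 0.8, "p3": 0.1, "p4": 0.05}
scores_b = {"p1": 0.2, "p2": 0.9, "p3": 0.8, "p4": 0.1}
pathway = ["p1", "p2"]
# Positive difference: method B ranks this pathway's pairs worse than A.
diff = average_fractional_rank(scores_b, pathway) - average_fractional_rank(scores_a, pathway)
```

Plotting this difference per pathway, with the pathway's pair count on each bar, reproduces the structure of the bar charts in this section.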
6.3.3 Predicting Linkages From Shared Gains and Losses
Counting the number of edges (i, j) for which both proteins are inferred to be present in
the ancestor i and both absent in the descendant j, or vice versa (PP −→ AA or AA −→
PP), performs better than MI on 50 out of 80 pathways that have at least 10 pairs of
proteins in them (Figure 6.11). MI outperforms SGL for the ribosome pathway, along
with the pathways that include many proteins that contain few homologs (such pathways
are highlighted in yellow in the figure). These pathways tend to have low ROC upper
bounds (Section 6.3.1) and have few gain or loss events, and thus the SGL method is not
suited for predicting shared function of this type.
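A minimal sketch of this counting scheme, assuming the inferred ancestral labelings are available as per-node ‘P’/‘A’ dictionaries (the data structures here are illustrative, not the dissertation's implementation):

```python
def sgl_score(edges, labels1, labels2):
    """Shared gains/losses (SGL) score for a pair of proteins.

    edges: list of (parent, child) node pairs in the species tree.
    labels1, labels2: dict node -> 'P' (present) or 'A' (absent), the
    inferred ancestral state of each protein at each node.
    Counts edges on which both proteins are gained together (AA -> PP)
    or lost together (PP -> AA).
    """
    count = 0
    for i, j in edges:
        parent = labels1[i] + labels2[i]
        child = labels1[j] + labels2[j]
        if (parent, child) in {("AA", "PP"), ("PP", "AA")}:
            count += 1
    return count

# Toy tree: root -> a, root -> b; both proteins are gained on the edge to a.
edges = [("root", "a"), ("root", "b")]
l1 = {"root": "A", "a": "P", "b": "A"}
l2 = {"root": "A", "a": "P", "b": "A"}
print(sgl_score(edges, l1, l2))  # 1
```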
The ability of such a straightforward method to rank examples from more than half
[Figure 6.12 appears here: (a) bar chart of empirical root-label probabilities (0–0.5) for the states PP, AP, and AA, for positives and negatives; (b) relative increases per root label: PP 0.270, AP −0.178, AA −0.123.]
Figure 6.12: (a) Empirical root probabilities Pr[R = αβ | same pathway] (blue) and Pr[R = αβ | different pathways] (red). (b) Relative increases for each root label probability for proteins with shared functions compared with those that do not have shared functions. See Equation (6.14).
of the pathways better than MI is encouraging, especially because SGL does not consider
where in the tree the shared events took place or their relation to one another. It also
does not account for how many organisms are labeled with the same state in the two
trees — it is simple to find examples with several shared gains and losses but large areas
of disagreement in the tree. While such pairs may indeed be functionally linked, it is
reasonable that they should score less favorably than a pair with several shared gains and
large areas of agreement.
6.3.4 Predicting Linkages By Comparing Likelihoods
Estimating the Transition Probabilities
In order to test our LRATIO method, we randomly divide the positives and negatives into
three parts and compute transition probabilities (as outlined in Section 6.2.4) for both
positive and negative examples using two-thirds of the examples. There is certainly a large
[Figure 6.13 appears here: (a) bar chart of empirical transition probabilities (0–1) for the ten parent→child label pairs, for positives and negatives; (b) relative increases per transition: PPPP 0.002, PPAP −0.088, PPAA 0.325, APPP 0.310, APPA 0.224, APAP −0.019, APAA 0.086, AAPP 0.574, AAAP 0.008, AAAA −0.003.]
Figure 6.13: (a) Empirical transition probabilities for edges (i, j) in bacteria: Pr[j = σδ | i = αβ, same pathway] (blue) and Pr[j = σδ | i = αβ, different pathways] (red). Here, αβ are labels of a parent node, and σδ are labels of a child node. The distributions are very similar in this view; their differences are more apparent in panel (b). (b) Relative increases for each transition probability for proteins with shared functions compared with those that do not have shared functions (Equation (6.14)).
class of pairs of proteins that share a pathway for which the assumption of shared evolution
does not hold. These pairs, which are unlikely to be identified by any method geared
towards exploiting shared evolution, may obscure the true differences in the probability
distributions between proteins with correlated and uncorrelated evolution. Accordingly,
when computing the probabilities, we do not use any pair (positive or negative) that
involves a protein with a profile that can primarily be explained by speciation events.
These are considered to be those profiles for which there is a node in the tree such that
every leaf under the node has an e-value < 10^−5 while all other leaves have an e-value
> 10^−5, or vice versa. (If either of those two groups contains more than 5 organisms, we
allow a single exception.)
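This filter can be sketched as follows, assuming the set of leaves under each internal node and the set of organisms with a significant hit have already been extracted (the names and data structures are illustrative):

```python
def explained_by_speciation(clades, present, universe, big=5):
    """Heuristic filter for profiles explained by a single speciation event.

    clades: iterable of sets, each the set of organisms (leaves) under
        one internal node of the species tree.
    present: the set of organisms whose best hit has e-value < 10^-5.
    universe: the set of all organisms in the profile.
    A profile passes if some clade coincides exactly with the present
    set, or with the absent set (the 'vice versa' case), allowing one
    mismatch when the group on that side has more than `big` members.
    """
    for clade in clades:
        outside = universe - clade
        for group in (present, universe - present):
            miss_in = len(clade - group)     # leaves under the node outside the group
            miss_out = len(outside & group)  # leaves elsewhere inside the group
            ok_in = miss_in == 0 or (len(clade) > big and miss_in == 1)
            ok_out = miss_out == 0 or (len(outside) > big and miss_out == 1)
            if ok_in and ok_out:
                return True
    return False
```

A profile present only in, say, a eukaryotic clade and absent everywhere else passes the filter and is excluded from training.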
Additionally, because the ribosome pathway is so large and we do not want to train
the method to recognize only this well-studied pathway, we leave out the positive pairs
coming from the ribosome pathway.
The probabilities for the root labels (Equation 6.11) for one of the cross-validation tests
are shown in Figure 6.12(a). Immediately, we see the difference in the distributions of root
labels between positives and negatives: a larger fraction of the roots of trees for proteins
with shared function are both labeled with the ‘present’ state, while the disagreeing state
AP is more common in unrelated proteins. Below the distributions, for each root state,
we plot the relative increase in probability of the positives:
(p+ − p−) / p+ .    (6.14)
So, if Equation (6.14) is greater than zero, a labeling is more common in positive examples.
The ability to handle distantly related organisms differently is one of the advantages
of having a tree connecting the organisms. Accordingly, we estimate a separate set of
probabilities for each of the three main branches of life, eukaryota, archaea, and bacteria.
The computed transition probabilities for the bacteria edges, the largest class of edges,
Figure 6.14: Schematic of anti-correlated evolution AP −→ PA. The two rectangles represent profiles, and the triangle represents a subtree which would be primarily labeled AP. If there is a loss of one protein and a compensating gain by another, there will be an AP −→ PA edge at the top of the subtree for the organisms in which the complementary gain occurred.
are shown in Figure 6.13(a). They make sense at a high level: the most probable transi-
tions are those in which no change is made in the gene state of either protein, and those
transitions in which one protein maintains its state are slightly more probable than having
both genes change their state. This is expected since the reconstruction of ancestor label-
ings sought to minimize changes. While the transition probabilities for positive examples
appear very similar to those for negatives, there are important differences as seen by the
plot of Equation (6.14) under each probability (Figure 6.13(b)). Shared gains or losses
(PP −→ AA or AA −→ PP) are more likely to occur in positive examples. Transitions where
disagreeing gene states are brought into agreement (AP −→ PP and AP −→ AA) also have a
higher probability in positives, while transitions in which the states diverge have the same
or lower probability in positives. Finally, and perhaps most interestingly, anti-correlated
edges (AP −→ PA) are more frequently seen between positive pairs. These edges would
occur in blocks such as those depicted in Figure 6.14, which may arise if a protein is lost
because another can assume its role. ((Bowers et al., 2004a) and (Morett et al., 2003) have
investigated such complementary profiles. An advantage of incorporating a species tree
is that we can detect local complementarity naturally, while previous methods typically
require the complementarity to hold to a large extent across the entire profile.)
[Figure 6.15 appears here: bar chart of average rank difference (−0.4 to 0.6) across KEGG pathways; bars above 0 mean LRATIO does better, bars below 0 mean binary MI does better; pathways with fewer than 20% ‘present’ entries are highlighted.]
Figure 6.15: Improvement in average fractional rank achieved by LRATIO compared with MI for each KEGG pathway that contains at least 10 pairs of proteins. Values above 0 indicate that LRATIO scored the pairs in the pathway more favorably than MI. Only those pathways that had on average at least 10 pairs of proteins in the three tests are shown. The numbers in the bars give the average number of examples arising from that pathway. Yellow marks those pathways with many proteins that have few homologs across the species.
Assessing Performance of LRATIO
We test LRATIO using three-way cross-validation, computing probabilities like those
shown in Figures 6.12 and 6.13 using two-thirds of the examples, and assessing per-
formance on the remaining third. The average fractional ranks (Section 6.2.6) for each
of the pathways are themselves averaged over the three tests. The differences between
these averaged average fractional ranks are shown in Figure 6.15 for each function. Out
of the 71 pathways that had on average at least 10 pairs, LRATIO ranks the examples
from 51 of them better than MI does. MI, as expected, outperforms LRATIO on those pathways with
few present entries in their profiles (yellow in Figure 6.15).
The pathway with the largest improvement in performance is aminoacyl-tRNA biosynthesis, which contains about 222 pairs in each testing slice. The average fractional rank
is reduced from 0.77 to 0.20 when using LRATIO because many proteins in this pathway
are present in most sampled organisms, and such all-present profiles are scored very low
[Figure 6.16 appears here: two panels of presence/absence profiles, rows labeled by ORF name.]
Figure 6.16: Profiles for the proteins in pathways with the lowest average rank using LRATIO. (a) nitrogen metabolism (KEGG ID 00910). (b) glyoxylate and dicarboxylate metabolism (00630). White indicates the presence of a homolog.
by MI. In contrast, a distance measure such as Hamming distance would give such pairs
the highest score possible. LRATIO may take the middle path — rewarding such pairs
in the amount warranted. Not all improvements arise from such constant profiles. The
profiles of the proteins involved in the top average ranked pathways (nitrogen metabolism
and glyoxylate and dicarboxylate metabolism, KEGG IDs 00910 and 00630, respectively)
are shown in Figure 6.16. These pathways contain few constant profiles.
To test whether LRATIO is using real evolutionary events or merely fitting a small set
of parameters to the data, we repeat the training and testing but shuffle where the extant
organisms appear in the tree; this amounts to the same thing as shuffling the dimensions
of all the profiles using the same permutation. If LRATIO takes its advantage simply
from its ability to set parameters from a set of training examples, one would expect that
the random tree will do nearly as well. This is not the case, as shown in Figure 6.17 —
the real tree outperforms the shuffled tree in 51 out of 71 pathways. The pathways on which
the random tree scores better are those that have many proteins that have homologs
[Figure 6.17 appears here: bar chart of average rank difference (−0.4 to 0.2) across KEGG pathways; bars above 0 mean the real tree does better, bars below 0 mean the random tree does better; pathways with fewer than 20% ‘present’ entries are highlighted.]
Figure 6.17: Improvement in average fractional rank achieved by LRATIO using the real tree compared with using a tree with the leaves shuffled. Values above 0 indicate that using LRATIO with the real tree was more successful at identifying positive pairs. Numbers in the bars give the average number of examples in that pathway. Yellow marks those pathways with many proteins that have few homologs across the species.
only among the eukaryota, or only among the eukaryota and archaea. This makes sense:
if the leaves are randomized for these pathways, the single shared gene gain at the top of
the eukaryota subtree is transformed into six or seven shared gains spread over the tree —
a pattern that would be much more indicative of shared evolution. This results in better
scores for these pathways.
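The shuffling control can be sketched as a single shared permutation applied to every profile (a toy illustration, not the experiment's code):

```python
import random

def shuffle_profiles(profiles, seed=0):
    """Randomization control for LRATIO-style tests.

    Shuffling where the extant organisms appear at the leaves of the
    species tree is equivalent to permuting the organism dimensions of
    every profile with the same permutation, which is what we do here.

    profiles: dict protein -> list of presence/absence states, one
    entry per organism, all lists the same length.
    """
    rng = random.Random(seed)
    n = len(next(iter(profiles.values())))
    perm = list(range(n))
    rng.shuffle(perm)  # one permutation shared by all profiles
    return {prot: [prof[i] for i in perm] for prot, prof in profiles.items()}

profiles = {"x": [1, 0, 1], "y": [0, 0, 1]}
shuffled = shuffle_profiles(profiles)
```

Because the same permutation is applied everywhere, pairwise agreement between profiles is preserved; only their placement relative to the tree topology is destroyed.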
6.4 Discussion
We assessed the relative merits of two protocols for using mutual information to predict
functional linkages, finding binary MI to perform best for most pathways. One issue that is
brought to the forefront by this research is the need to consider performance on functions
separately. If each pathway’s importance is taken to be proportional to the number
of examples that are derived from it (as in standard ROC analysis), the improvements
on well-studied functions such as the ribosome and cell cycle will obscure successes in
identifying less obvious functional linkages. Our per-function analysis does not suffer
from this problem and also makes it clear that there are several classes of functions that
should be attacked through different methods.
Mutual information can be divided into the sum of two components. Informally,
the first, H(g1) + H(g2), measures how “interesting” the pairs of profiles (g1, g2) are
— all-present or all-absent profiles may simply indicate we have not yet sampled enough
organisms to detect a signal. The second component, −H(g1, g2), measures the correlation
between profiles. That MI captures both of these aspects is one reason it is such an
intuitively pleasing measure. An attractive feature of incorporating the phylogenetic tree
is to promote a more delicate definition of “interesting” than mutual information permits.
For example, if two proteins are both present in all firmicutes save one, for which they
are both absent, this may be more “interesting” than if the same pattern is distributed
randomly over the tree. If the right balance between requiring interesting profiles and
agreeing profiles can be struck, predictive ability may be improved. We have begun to
gain some insight about what patterns are most characteristic of shared function and are
thus informative. For example, the presence of several correlated gains and losses is
very indicative of shared function, and counting these events is a useful way to identify
proteins involved in the same pathways.
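This decomposition is just the standard identity MI(g1, g2) = H(g1) + H(g2) − H(g1, g2), which a few lines of code can verify on toy profiles:

```python
from collections import Counter
from math import log2

def entropy(xs):
    """Empirical Shannon entropy (in bits) of a sequence of discrete states."""
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())

def mi_decomposed(g1, g2):
    """MI(g1, g2) = H(g1) + H(g2) - H(g1, g2): the marginal terms reward
    'interesting' (high-entropy) profiles, the joint term rewards agreement."""
    return entropy(g1) + entropy(g2) - entropy(list(zip(g1, g2)))

# Balanced but independent-looking profiles: H = 1 bit each, joint H = 2, MI = 0.
g1 = [1, 1, 0, 0]
g2 = [1, 0, 1, 0]
print(mi_decomposed(g1, g2))  # 0.0
```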
Not all edges in the tree should be treated the same. Shared gene absence is more
indicative of shared function when it occurs on edges between organisms closely related
to yeast, while shared gene presence is most indicative in the opposite case. This is seen
in Figure 6.18, in which for each tree edge we plot the number of AA −→ AA transitions seen on that edge
in positive examples divided by the number of such edge labelings seen in negative examples,
against the edge's tree distance from the yeast E. gossypii, a close relative of S. cerevisiae.
Similar values are plotted for PP −→ PP. It may be informative to attempt to derive
correlations between other patterns and their environment in the tree (e.g. height, node
degree, distance to yeast) and use such insights to continue to improve our understanding
of which evolutionary events are informative.
[Figure 6.18 appears here: plot of #pos/#neg (0–0.4) against tree distance to E. gossypii (0–4) for shared-presence (PP −→ PP) and shared-absence (AA −→ AA) edges.]
Figure 6.18: How indicative AA −→ AA and PP −→ PP edges are of a pair being a positive example, plotted against that edge's proximity to the yeast E. gossypii. Ribosomal examples are not included. By random chance, one expects a value of total # of positives / total # of negatives = 0.049. Edges where both proteins are labeled absent at each end point are more indicative of shared function when they are near yeast in the tree; the opposite is true for edges where each protein is present at both end points.
Looking at the variation between empirical transition probabilities computed for pairs
with shared functions versus those without, we get an idea of which transitions are more
prominent in positive examples. The absolute differences between the positive and neg-
ative distributions are very small, and, despite the filtering we perform on the training
examples, there undoubtedly remain positive training examples for which the assumption
of shared evolution does not hold. Better filtering schemes to identify those pairs that are
most promising to train on will likely improve the performance of LRATIO on many path-
ways. Or, perhaps, given the values in Figures 6.12 and 6.13, we can concoct transition
distributions that maintain the relative order of transition probabilities but enlarge the
difference between the positive and negative distributions. Additionally, to avoid giving
undue importance to large pathways, our LRATIO training method may also benefit if
examples are weighted so that the contribution from each function is the same.
Another interesting path for future research is to investigate additional non-learning
scoring schemes. That SGL does so well suggests that if it could be augmented in a
natural way to account for agreeing or disagreeing subtrees, or for events such as AP −→ PA,
it may be able to perform even better. Perhaps the quest to improve SGL will give some
further appreciation of which larger-scale evolutionary features are important for detecting
co-evolution.
In this chapter, we have made some significant first steps toward understanding how
cross-genomic evolution can be exploited to identify proteins that are involved in the same
pathways in S. cerevisiae. We presented a framework for taking relationships between
organisms into account when comparing profiles, and we gave several methods for using
patterns of inferred gene presence and absence across a tree connecting extant species to
predict whether two proteins are involved in the same pathways. Our per-function analysis
of the methods suggests that there is benefit in considering relationships between the
species, and that the methods may be useful additions to the repertoire of approaches to
find proteins with correlated evolution and provide a starting point for continued research
in the area.
6.5 List of Pathways and the Phylogenetic Tree
List of KEGG Pathways and Their Identifiers
ID    Pathway
0010  Glycolysis / Gluconeogenesis
0020  Citrate cycle (TCA cycle)
0030  Pentose phosphate pathway
0040  Pentose and glucuronate interconversions
0051  Fructose and mannose metabolism
0052  Galactose metabolism
0053  Ascorbate and aldarate metabolism
0061  Fatty acid biosynthesis (path 1)
0062  Fatty acid biosynthesis (path 2)
0071  Fatty acid metabolism
0072  Synthesis and degradation of ketone bodies
0100  Biosynthesis of steroids
0120  Bile acid biosynthesis
0130  Ubiquinone biosynthesis
0150  Androgen and estrogen metabolism
0190  Oxidative phosphorylation
0193  ATP synthesis
0220  Urea cycle and metabolism of amino groups
0230  Purine metabolism
0240  Pyrimidine metabolism
0251  Glutamate metabolism
0252  Alanine and aspartate metabolism
0260  Glycine, serine and threonine metabolism
0271  Methionine metabolism
0272  Cysteine metabolism
0280  Valine, leucine and isoleucine degradation
0290  Valine, leucine and isoleucine biosynthesis
0300  Lysine biosynthesis
0310  Lysine degradation
0330  Arginine and proline metabolism
0340  Histidine metabolism
0350  Tyrosine metabolism
0360  Phenylalanine metabolism
0361  gamma-Hexachlorocyclohexane degradation
0362  Benzoate degradation via hydroxylation
0380  Tryptophan metabolism
0400  Phenylalanine, tyrosine and tryptophan biosynthesis
0401  Novobiocin biosynthesis
0410  beta-Alanine metabolism
0430  Taurine and hypotaurine metabolism
0440  Aminophosphonate metabolism
0450  Selenoamino acid metabolism
0460  Cyanoamino acid metabolism
0480  Glutathione metabolism
0500  Starch and sucrose metabolism
0510  N-Glycan biosynthesis
0512  O-Glycan biosynthesis
0513  High-mannose type N-glycan biosynthesis
0520  Nucleotide sugars metabolism
0521  Streptomycin biosynthesis
0530  Aminosugars metabolism
0561  Glycerolipid metabolism
0562  Inositol phosphate metabolism
0563  Glycosylphosphatidylinositol (GPI)-anchor biosynthesis
0564  Glycerophospholipid metabolism
0590  Prostaglandin and leukotriene metabolism
0600  Glycosphingolipid metabolism
0602  Blood group glycolipid biosynthesis - neolactoseries
0603  Globoside metabolism
0604  Ganglioside biosynthesis
0620  Pyruvate metabolism
0625  Tetrachloroethene degradation
0630  Glyoxylate and dicarboxylate metabolism
0632  Benzoate degradation via CoA ligation
0640  Propanoate metabolism
0650  Butanoate metabolism
0670  One carbon pool by folate
0680  Methane metabolism
0710  Carbon fixation
0720  Reductive carboxylate cycle (CO2 fixation)
0730  Thiamine metabolism
0740  Riboflavin metabolism
0750  Vitamin B6 metabolism
0760  Nicotinate and nicotinamide metabolism
0770  Pantothenate and CoA biosynthesis
0780  Biotin metabolism
0790  Folate biosynthesis
0860  Porphyrin and chlorophyll metabolism
0900  Terpenoid biosynthesis
0903  Limonene and pinene degradation
0910  Nitrogen metabolism
0920  Sulfur metabolism
0960  Alkaloid biosynthesis II
0970  Aminoacyl-tRNA biosynthesis
2020  Two-component system
3010  Ribosome
3020  RNA polymerase
3022  Basal transcription factors
3030  DNA polymerase
3050  Proteasome
3060  Protein export
4010  MAPK signaling pathway
4070  Phosphatidylinositol signaling system
4110  Cell cycle
4120  Ubiquitin mediated proteolysis
Listing 6.1: Phylogenetic tree relating the 215 organisms. Tree is given in Newick/NH format.
[ Eukaryota ]
(((((Eremothecium_gossypii, spombe), Encephalitozoon_cuniculi), (celegans,
dmelanogaster)), athaliana_all, pfalciparum),
[ Archaea ]
((Aeropyrum_pernix, Pyrobaculum_aerophilum, (Sulfolobus_solfataricus,
Sulfolobus_tokodaii)), Nanoarchaeum_equitans, (Archaeoglobus_fulgidus,
(Haloarcula_marismortui_ATCC_43049, Halobacterium_sp),
Methanobacterium_thermoautotrophicum, Methanopyrus_kandleri,
(Methanococcus_jannaschii, Methanococcus_maripaludis_S2),
(Methanosarcina_acetivorans, Methanosarcina_mazei),
(Picrophilus_torridus_DSM_9790, (Thermoplasma_acidophilum,
Thermoplasma_volcanium)), (Pyrococcus_abyssi, Pyrococcus_furiosus,
Pyrococcus_horikoshii))),
[ Bacteria ]
(Thermotoga_maritima, Pirellula_sp, Aquifex_aeolicus, Fusobacterium_nucleatum,
[ Firmicutes ]
((Mesoplasma_florum_L1, Onion_yellows_phytoplasma,
(Ureaplasma_urealyticum, (Mycoplasma_gallisepticum, Mycoplasma_genitalium,
Mycoplasma_hyopneumoniae_232, Mycoplasma_mobile_163K, Mycoplasma_mycoides,
Mycoplasma_penetrans, Mycoplasma_pneumoniae, Mycoplasma_pulmonis))),
(Thermoanaerobacter_tengcongensis, (Clostridium_acetobutylicum,
Clostridium_perfringens, Clostridium_tetani_E88)),
((Enterococcus_faecalis_V583, (Lactobacillus_johnsonii_NCC_533,
Lactobacillus_plantarum), (Lactococcus_lactis, (Streptococcus_mutans,
(Streptococcus_pneumoniae_R6, Streptococcus_pneumoniae_TIGR4),
(Streptococcus_agalactiae_NEM316, Streptococcus_agalactiae_2603),
(Streptococcus_thermophilus_CNRZ1066, Streptococcus_thermophilus_LMG_18311),
(Streptococcus_pyogenes, Streptococcus_pyogenes_MGAS10394,
Streptococcus_pyogenes_MGAS8232), (Streptococcus_pyogenes_MGAS315,
Streptococcus_pyogenes_SSI-1)))), ((Staphylococcus_epidermidis_ATCC_12228,
(Staphylococcus_aureus_aureus_MRSA252, Staphylococcus_aureus_aureus_MSSA476,
Staphylococcus_aureus_Mu50, Staphylococcus_aureus_MW2,
Staphylococcus_aureus_N315)), (Listeria_innocua, (Listeria_monocytogenes,
Listeria_monocytogenes_4b_F2365)), (Oceanobacillus_iheyensis,
Geobacillus_kaustophilus_HTA426, (Bacillus_halodurans, Bacillus_subtilis,
(Bacillus_licheniformis_ATCC_14580, Bacillus_licheniformis_DSM_13),
(Bacillus_thuringiensis_konkukian, (Bacillus_anthracis_Ames_0581,
Bacillus_anthracis_str_Sterne, Bacillus_anthracis_A2012,
Bacillus_anthracis_Ames), (Bacillus_cereus_ATCC_10987,
Bacillus_cereus_ATCC14579, Bacillus_cereus_ZK))))))),
[ Proteobacteria ]
((Zymomonas_mobilis_ZM4, Silicibacter_pomeroyi_DSS-3,
(Anaplasma_marginale_St_Maries,
(Wolbachia_endosymbiont_of_Drosophila_melanogaster, (Rickettsia_conorii,
(Rickettsia_typhi_wilmington, Rickettsia_prowazekii)))), Mesorhizobium_loti,
Caulobacter_crescentus, ((Sinorhizobium_meliloti,
(Agrobacterium_tumefaciens_C58_Cereon, Agrobacterium_tumefaciens_C58_UWash)),
(Bartonella_henselae_Houston-1, Bartonella_quintana_Toulouse),
(Bradyrhizobium_japonicum, Rhodopseudomonas_palustris_CGA009),
(Brucella_melitensis, Brucella_suis_1330))), (Azoarcus_sp_EbN1,
Nitrosomonas_europaea, (Chromobacterium_violaceum,
(Neisseria_meningitidis_MC58, Neisseria_meningitidis_Z2491)),
((Ralstonia_solanacearum, (Burkholderia_mallei_ATCC_23344,
Burkholderia_pseudomallei_K96243)), (Bordetella_bronchiseptica,
Bordetella_parapertussis, Bordetella_pertussis))),
(Methylococcus_capsulatus_Bath, Francisella_tularensis_tularensis,
Wigglesworthia_brevipalpis, (Coxiella_burnetii, (Legionella_pneumophila_Lens,
Legionella_pneumophila_Paris, Legionella_pneumophila_Philadelphia_1)),
(Idiomarina_loihiensis_L2TR, Shewanella_oneidensis), (Acinetobacter_sp_ADP1,
(Pseudomonas_aeruginosa, Pseudomonas_putida_KT2440, Pseudomonas_syringae)),
(Blochmannia_floridanus, Photorhabdus_luminescens, (Buchnera_sp,
(Buchnera_aphidicola_Sg, Buchnera_aphidicola)),
Erwinia_carotovora_atroseptica_SCRI1043, (Shigella_flexneri_2a_2457T,
Shigella_flexneri_2a), (Salmonella_typhi_Ty2, Salmonella_typhimurium_LT2,
Salmonella_enterica_Paratypi_ATCC_9150, Salmonella_typhi),
(Escherichia_coli_CFT073, Escherichia_coli_K12, Escherichia_coli_O157H7,
Escherichia_coli_O157H7_EDL933), (Yersinia_pseudotuberculosis_IP32953,
(Yersinia_pestis_CO92, Yersinia_pestis_KIM,
Yersinia_pestis_biovar_Mediaevails))), (Photobacterium_profundum_SS9,
(Vibrio_cholerae, Vibrio_parahaemolyticus, (Vibrio_vulnificus_CMCP6,
Vibrio_vulnificus_YJ016))), ((Xanthomonas_campestris, Xanthomonas_citri),
(Xylella_fastidiosa, Xylella_fastidiosa_Temecula1)), (Pasteurella_multocida,
Mannheimia_succiniciproducens_MBEL55E, (Haemophilus_ducreyi_35000HP,
Haemophilus_influenzae))), ((Bdellovibrio_bacteriovorus,
Desulfotalea_psychrophila_LSv54, Desulfovibrio_vulgaris_Hildenborough,
Geobacter_sulfurreducens), (Campylobacter_jejuni, (Wolinella_succinogenes,
(Helicobacter_hepaticus, (Helicobacter_pylori_26695,
Helicobacter_pylori_J99)))))),
[ Actinobacteria ]
(Symbiobacterium_thermophilum_IAM14863, (Bifidobacterium_longum,
(Propionibacterium_acnes_KPA171202, (Streptomyces_avermitilis,
Streptomyces_coelicolor), (Leifsonia_xyli_xyli_CTCB0,
(Tropheryma_whipplei_Twist, Tropheryma_whipplei_TW08_27)),
(Nocardia_farcinica_IFM10152, (Corynebacterium_diphtheriae,
Corynebacterium_efficiens_YS-314, Corynebacterium_glutamicum),
(Mycobacterium_avium_paratuberculosis, Mycobacterium_leprae,
(Mycobacterium_bovis, (Mycobacterium_tuberculosis_CDC1551,
Mycobacterium_tuberculosis_H37Rv))))))),
[ Deinococci ]
(Deinococcus_radiodurans, (Thermus_thermophilus_HB27,
Thermus_thermophilus_HB8)),
[ Chlamydiales ]
(Parachlamydia_sp_UWE25, ((Chlamydia_muridarum, Chlamydia_trachomatis),
(Chlamydophila_caviae, Chlamydophila_pneumoniae_AR39,
Chlamydophila_pneumoniae_CWL029, Chlamydophila_pneumoniae_J138,
Chlamydophila_pneumoniae_TW_183))),
[ Bacteroidetes / Chlorobi group ]
(Chlorobium_tepidum_TLS, (Porphyromonas_gingivalis_W83,
(Bacteroides_fragilis_YCH46, Bacteroides_thetaiotaomicron_VPI-5482))),
[ Spirochaetales ]
((Leptospira_interrogans_serovar_Copenhageni,
Leptospira_interrogans_serovar_Lai), ((Borrelia_burgdorferi,
Borrelia_garinii_PBi), (Treponema_denticola_ATCC_35405, Treponema_pallidum))),
[ Cyanobacteria ]
(Gloeobacter_violaceus, Nostoc_sp, (Prochlorococcus_marinus_MED4,
Prochlorococcus_marinus_MIT9313, Prochlorococcus_marinus_CCMP1375),
(Synechococcus_sp_WH8102, Synechocystis_PCC6803,
Thermosynechococcus_elongatus))));
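The species groupings in this appendix are written in Newick notation: nested parentheses denote clades, commas separate sister taxa, and a semicolon terminates each tree. As an illustrative aside (the parser below is a sketch written for this discussion, not code used in this thesis), such label-only trees can be read into nested tuples with a few lines of Python:

```python
def parse_newick(text):
    """Parse a label-only Newick string into nested tuples of leaf names.

    Handles parentheses, commas, whitespace, and a trailing semicolon;
    branch lengths and quoted labels are out of scope for this sketch.
    """
    s = text.strip().rstrip(';')
    pos = 0

    def skip_ws():
        nonlocal pos
        while pos < len(s) and s[pos].isspace():
            pos += 1

    def parse():
        nonlocal pos
        skip_ws()
        if s[pos] == '(':                      # internal node: parse children
            pos += 1
            children = [parse()]
            skip_ws()
            while s[pos] == ',':
                pos += 1
                children.append(parse())
                skip_ws()
            assert s[pos] == ')', "unbalanced parentheses"
            pos += 1
            return tuple(children)
        start = pos                            # leaf: read a taxon name
        while pos < len(s) and s[pos] not in ',()':
            pos += 1
        return s[start:pos].strip()

    return parse()

print(parse_newick("(A, (B, C));"))  # → ('A', ('B', 'C'))
```

Note that the bracketed group names above (e.g. [ Actinobacteria ]) are treated in standard Newick as comments; stripping them before parsing would be an easy extension, but this sketch does not handle them.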
Chapter 7
Conclusion and Future Work
In this thesis, we have made progress toward solving several problems essential to the
computational exploration of the processes of life. Few problems are more central to
understanding the workings of the cell than protein structure, as proteins are the building
blocks of cellular structures and mechanisms. Equally important is understanding the
regulatory network responsible for modulating the creation of these building blocks.
More generally, determining the role each protein plays in the cell is one of the basic steps
in making sense of life. Looking further into the future, once the workings of the cell are
better understood, protein design will allow humans to tinker with the exquisite machinery
of life to cure disease.
Chapters 3 and 4 tackled a subproblem of protein structure prediction and design
using mathematical programming techniques. Our methods have the advantage of focusing
on provably optimal and near-optimal solutions, rather than merely empirically good
ones. Chapter 5 presented a new way of discovering the binding sites of regulatory
proteins in DNA, and Chapter 6 discussed assigning functions to proteins using
evolutionary information. Though our contributions make significant progress, many
avenues of future work remain.
For our SCP work, it would be useful to extend our methodologies to handle backbone
flexibility, whether fully flexible backbones or smaller movements. Though packages are
currently available that approach structure prediction and design by allowing backbone
motions, backbone flexibility and side-chain flexibility are usually treated independently.
Putting both into a single optimization framework may allow for better solutions.
Continued speed improvements would be welcome as well.
A significant open question in our side-chain positioning work is why integral solutions
are so often observed for native-backbone and homology-modeling problems, but rarely
for design problems. One reasonable hypothesis is that for the former only a small set
of rotamer choices is particularly good, while for the latter, because several amino acids
have very similar side chains and because of the added flexibility afforded by designing
several positions at once, there are many more “good” choices of rotamers. It would be
interesting to test this hypothesis.
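As an illustrative aside, the hypothesis can be made concrete on a toy instance. The energy tables below are invented for illustration and do not come from any real rotamer library: brute-force enumeration shows that sharply peaked self-energies (a "native-like" landscape) yield a single near-optimal assignment, while nearly flat energies (a "design-like" landscape) yield many.

```python
import itertools

def assignment_energy(choice, e_self, e_pair):
    """Energy of one rotamer assignment: self terms plus pairwise terms."""
    total = sum(e_self[i][r] for i, r in enumerate(choice))
    for i, j in itertools.combinations(range(len(choice)), 2):
        total += e_pair[(i, j)][choice[i]][choice[j]]
    return total

def count_near_optimal(e_self, e_pair, window):
    """Count assignments whose energy is within `window` of the optimum."""
    domains = [range(len(s)) for s in e_self]
    energies = [assignment_energy(c, e_self, e_pair)
                for c in itertools.product(*domains)]
    best = min(energies)
    return sum(e <= best + window for e in energies)

# "Native-like" toy instance: one rotamer strongly preferred at each of
# three positions (invented numbers).
native_self = [[0.0, 5.0, 5.0]] * 3
# "Design-like" toy instance: nearly flat energies, so many choices look good.
design_self = [[0.0, 0.1, 0.1]] * 3
# Pairwise terms set to zero to keep the example transparent.
flat_pair = {(i, j): [[0.0] * 3 for _ in range(3)]
             for i, j in itertools.combinations(range(3), 2)}

print(count_near_optimal(native_self, flat_pair, 1.0))   # → 1
print(count_near_optimal(design_self, flat_pair, 1.0))   # → 27
```

In this caricature, a fractional LP optimum has little to gain in the peaked landscape but can spread weight across the many equally good assignments of the flat one, which is consistent with the observed difference between native-backbone and design instances.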
As SDP solvers are improved, it may become useful to apply them to larger instances
of design problems to see how they scale. The SCP design problems make a nice real-
world test case for emerging SDP solver packages. It may also be interesting to apply our
rounding schemes to other optimization problems.
For the problem of motif finding, our ILP should be combined with existing graph-pruning
techniques to determine just how large the problems are that can be solved to optimality.
Casting the problem so that it depends explicitly on the number of possible distances
between motifs, as we have done, may suggest practical improvements in which, for
example, successive ILPs are solved, increasing the number of allowed edge weights until
we can be assured of having found a good solution. It would also be useful to extend the
approach to handle cases where each sequence can contain zero motifs or more than one
motif. Further, a rigorous treatment of the statistical significance of the predicted motifs
may help to weed out spurious matches.
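The successive-ILP idea might be organized as in the following sketch. Here solve_restricted is a hypothetical oracle, standing in for a real ILP solver, that returns both the optimum over the currently allowed edge weights and a lower bound for the unrestricted problem; the stub solver and its numbers are invented purely for illustration.

```python
def successive_ilp(solve_restricted, weights, gap_tol=0.0):
    """Sketch of a successive-ILP strategy: re-solve a restricted problem with
    a growing set of allowed edge weights, stopping once the restricted optimum
    is provably within gap_tol of a lower bound for the full (minimization)
    problem."""
    allowed = []
    value = None
    for w in weights:
        allowed.append(w)
        value, lower_bound = solve_restricted(list(allowed))
        if value - lower_bound <= gap_tol:  # certified near-optimal; stop early
            break
    return value, allowed

# Stub oracle: the restricted optimum improves as more edge weights are
# allowed, and 10.0 is a valid lower bound for the full problem (toy numbers).
def stub_solver(allowed):
    return 16 - 2 * len(allowed), 10.0

print(successive_ilp(stub_solver, [1, 2, 3, 4, 5], gap_tol=0.0))
# → (10, [1, 2, 3]): only three of the five weights were needed
```

The appeal of this scheme is that early, heavily restricted ILPs are small and fast, and the lower bound lets the loop certify optimality without ever solving the full problem.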
Incorporating the improved phylogenetic profile comparison methods presented in
Chapter 6 into meta-methods that aggregate data from many sources should be useful.
It would also be interesting to investigate which organisms are most useful for identifying
the functional linkages arising from which pathways. More generally, one wonders how
the organismal composition of the profiles affects which methods are useful for predicting
shared function. A natural next step is to apply the methods to predict functional linkages
in other organisms.
This thesis has provided new computational methods for attacking several important
problems, along with careful testing of those methods. The methods presented here should
be an excellent starting place from which to continue tackling some of these central
problems in computational biology.
Bibliography
Akutsu, T., Arimura, H., and Shimozono, S. (2000). On approximation algorithms for
local multiple alignment. In Proc. Annual Internat. Conf. on Computat. Mol. Biol.,
pages 1–7.
Alizadeh, F. (1995). Interior point methods in semidefinite programming with applications
to combinatorial optimization. SIAM J. Optim., 5(1):13–51.
Alon, N. and Kahale, N. (1998). Approximating the independence number via the θ-
function. Math. Programming, 80:253–264.
Aloy, P., Stark, A., Hadley, C., and Russell, R. B. (2003). Predictions without templates:
new folds, secondary structure, and contacts in CASP5. PROTEINS: Struct. Funct.
Genet., 53:436–456.
Althaus, E., Kohlbacher, O., Lenhof, H.-P., and Muller, P. (2000). A combinatorial
approach to protein docking with flexible side-chains. In Proc. 4th Annual Internat.
Conf. on Computat. Mol. Biol., pages 15–24, New York, NY. ACM Press.
Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990). Basic
local alignment search tool. J. Mol. Biol., 215:403–410.
Arora, S., Hazan, E., and Kale, S. (2005). Fast algorithms for approximate semidefinite
programming using the multiplicative weights update method. Proc. of the 45th
Annual IEEE Sympos. Found. Comput. Sci., in press.
Arora, S., Lund, C., Motwani, R., Sudan, M., and Szegedy, M. (1998). Proof verification
and hardness of approximation problems. J. ACM, 45(3):501–555.
Arora, S. and Safra, M. (1998). Probabilistic checking of proofs: A new characterization
of NP. J. ACM, 45(1):70–122.
Bafna, V., Lawler, E., and Pevzner, P. A. (1997). Approximation algorithms for multiple
sequence alignment. Theoretical Computer Science, 182:233–244.
Bahadur, K. C. D., Akutsu, T., Tomita, E., and Seki, T. (2004). Protein side-chain packing
problem: a maximum edge-weight clique algorithmic approach. In Proceedings of
the Second Conference on Asia-Pacific Bioinformatics, pages 191–200, Darlinghurst,
Australia. Australian Computer Society, Inc.
Bailey, T. and Elkan, C. (1995). Unsupervised learning of multiple motifs in biopolymers
using expectation maximization. Machine Learning, 21:51–80.
Banner, D. W., Bloomer, A., Petsko, G. A., Phillips, D. C., and Wilson, I. A. (1976).
Atomic coordinates for triose phosphate isomerase from chicken muscle. Biochem.
Biophys. Res. Commun., 72(1):146–55.
Barker, D. and Pagel, M. (2005). Predicting functional gene links from phylogenetic-
statistical analyses of whole genomes. PLOS Comp. Biology, 1(1):24–31.
Benson, S. J., Ye, Y., and Zhang, X. (1999). Mixed linear and semidefinite programming
for combinatorial and quadratic optimization. Optim. Methods and Software, 11:515–
544.
Bertsimas, D. and Ye, Y. (1998). Semidefinite relaxations, multivariate normal distribu-
tions, and order statistics. In Du, D.-Z. and Pardalos, P. M., editors, Handbook of
Combinatorial Optimization, volume 3, pages 1–19. Kluwer Academic Publishers.
Betz, H., Burcham, P. B., and Ewing, G. M. (1954). Differential Equations with
Applications. Harper & Brothers, New York.
Boppana, R. B. (1987). Eigenvalues and graph bisection: An average-case analysis. In
Proc. of the 28th Annual Sympos. on Found. of Comput. Sci., pages 280–285, Wash-
ington, D.C. IEEE Computer Society Press.
Bower, M. J., Cohen, F. E., and Dunbrack, Jr, R. L. (1997). Prediction of protein side-
chain rotamers from a backbone-dependent rotamer library: A homology modeling
tool. J. Mol. Biol., 267:1268–1282.
Bowers, P. M., Cokus, S. J., Eisenberg, D., and Yeates, T. O. (2004a). Use of logic
relationships to decipher protein network organization. Science, 306:2246–2249.
Bowers, P. M., Pellegrini, M., Thompson, M. J., Fierro, J., Yeates, T. O., and Eisen-
berg, D. (2004b). Prolinks: a database of protein functional linkages derived from
coevolution. Genome Biol., 5(5):R35.
Brooks, B. R., Bruccoleri, R. E., Olafson, B. D., States, D. J., Swaminathan, S., and
Karplus, M. (1983). CHARMM: A program for macromolecular energy, minimization,
and dynamics calculations. J. Comp. Chem., 4:187–217.
Canutescu, A. A., Shelenkov, A. A., and Dunbrack Jr., R. L. (2003). A graph-theory
algorithm for rapid protein side-chain prediction. Protein Sci., 12:2001–2014.
Chazelle, B., Kingsford, C., and Singh, M. (2003). The side-chain positioning problem:
a semidefinite programming formulation with new rounding schemes. In Proceed-
ings of the Paris C. Kanellakis memorial workshop on Principles of computing and
knowledge, pages 86–94.
Chazelle, B., Kingsford, C., and Singh, M. (2004). A semidefinite-programming approach
to side-chain positioning with new rounding strategies. INFORMS J. on Comput.,
16:380–392.
Cliften, P., Sudarsanam, P., Desikan, A., Fulton, L., Fulton, B., Majors, J., et al. (2003).
Finding functional features in Saccharomyces genomes by phylogenetic footprinting.
Science, 301:71–76.
Cook, W., Cunningham, W., Pulleyblank, W., and Schrijver, A. (1997). Combinatorial
Optimization. Wiley-Interscience, New York.
Cornell, W. D., Cieplak, P., Bayly, C. I., Gould, I. R., Merz, Jr., K. M., Ferguson,
D. M., Spellmeyer, D. C., Fox, T., Caldwell, J. W., and Kollman, P. A. (1995). A
second generation force field for the simulation of proteins, nucleic acids, and organic
molecules. J. Am. Chem. Soc., 117:5179–5197.
Cover, T. M. and Thomas, J. A. (1991). Elements of Information Theory. Wiley-
Interscience.
Dahiyat, B. I. and Mayo, S. L. (1997). De novo protein design: Fully automated sequence
selection. Science, 278:82–87.
Date, S. V. and Marcotte, E. M. (2003). Discovery of uncharacterized cellular systems by
genome-wide analysis of functional linkages. Nat. Biotechnol., 21(9):1055–1062.
Desmet, J., De Maeyer, M., Hazes, B., and Lasters, I. (1992). The dead-end elimination
theorem and its use in protein side-chain positioning. Nature, 356:539–542.
Desmet, J., De Maeyer, M., and Lasters, I. (1994). The “dead end elimination” theorem
as a new approach to the side-chain packing problem. In Merz, K. and LeGrand,
S., editors, The Protein Folding Problem and Tertiary Structure Prediction, pages
307–337. Birkhauser, Boston, MA.
Donath, W. E. and Hoffman, A. (1972). Algorithms for partitioning of graphs and com-
puter logic based on eigenvectors of connection matrices. IBM Tech. Disclosure Bull.,
15:938–944.
Dunbrack Jr, R. L. and Karplus, M. (1993). Backbone-dependent rotamer library for
proteins: application to side-chain prediction. J. Mol. Biol., 230:543–574.
Enright, A. J., Iliopoulos, I., Kyrpides, N. C., and Ouzounis, C. A. (1999). Protein
interaction maps for complete genomes based on gene fusion events. Nature, 402:86–
90.
Eriksson, O., Zhou, Y., and Elofsson, A. (2001). Side chain-positioning as an integer
programming problem. In Proc. of 1st Workshop on Algorithms in BioInformatics,
pages 129–141, BRICS, University of Aarhus, Denmark.
Eskin, E. and Pevzner, P. (2002). Finding composite regulatory patterns in DNA sequences.
Bioinformatics, 18:S354–S363.
Feige, U. and Kilian, J. (1998). Heuristics for finding large independent sets, with appli-
cations to coloring semi-random graphs. In Proc. of the 39th Annual Sympos. Found.
Comput. Sci., pages 674–683, Los Alamitos, CA. IEEE Computer Society.
Felsenstein, J. (1981). Evolutionary trees from DNA sequences: A maximum likelihood
approach. J. Mol. Evol., 17:368–376.
Fourer, R., Gay, D. M., and Kernighan, B. W. (2002). AMPL: A Modeling Language for
Mathematical Programming. Brooks/Cole Publishing Company, Pacific Grove, CA.
Friedman, N., Ninio, M., Pe’er, I., and Pupko, T. (2002). A structural EM algorithm for
phylogenetic inference. J. Comp. Biol., 9(2):331–353.
Frieze, A. and Jerrum, M. (1997). Improved approximation algorithms for MAX k-CUT
and MAX BISECTION. Algorithmica, 18(1):61–77.
Frith, M., Hansen, U., Spouge, J., and Weng, Z. (2004). Finding functional sequence elements
by multiple local alignment. Nucleic Acids Res., 32:189–200.
Fujisawa, K., Fukuda, M., Kojima, M., and Nakata, K. (1997). Numerical evaluation
of SDPA. Technical Report B-330, Department of Mathematical and Computing
Sciences, Tokyo Institute of Technology, Oh-Okayama, Meguro-ku, Tokyo 152, Japan.
Gasterland, T. and Ragan, M. A. (1998). Microbial genescapes: phyletic and functional
patterns of ORF distribution among prokaryotes. Microb. Comp. Genomics, 3:199–
217.
Gertz, J., Elfond, G., Shustrova, A., Weisinger, M., Pellegrini, M., Cokus, S., and Roth-
schild, B. (2003). Inferring protein interactions from phylogenetic distance matrices.
Bioinformatics, 19(16):2039–2045.
Goemans, M. X. and Williamson, D. P. (1995). Improved approximation algorithms for
maximum cut and satisfiability problems using semidefinite programming. J. ACM,
42:1115–1145.
Goh, C.-S., Bogan, A. A., Joachimiak, M., Walther, D., and Cohen, F. E. (2000). Co-
evolution of proteins with their interaction partners. J. Mol. Biol., 299:283–293.
Goldstein, R. F. (1994). Efficient rotamer elimination applied to protein side-chains and
related spin glasses. Biophys. J., 66:1335–1340.
Gordon, D. B., Hom, G., Mayo, S., and Pierce, N. (2002). Exact rotamer optimization
for protein design. J. Comput. Chemistry, 24:232–243.
Gordon, D. B. and Mayo, S. L. (1998). Radical performance enhancements for combinato-
rial optimization algorithms based on the dead-end elimination theorem. J. Comput.
Chem., 19(13):1505–1514.
Gray, J. J., Moughon, S., Wang, C., Schueler-Furman, O., Kuhlman, B., Rohl, C. A.,
and Baker, D. (2003). Protein-protein docking with simultaneous optimization of
rigid-body displacement and side-chain conformations. J. Mol. Biol., 331:281–299.
Grotschel, M., Lovasz, L., and Schrijver, A. (1993). Geometric Algorithms and Combina-
torial Optimization. Springer-Verlag, Berlin, Germany, 2nd edition.
Gusfield, D. (1993). Efficient methods for multiple sequence alignment with guaranteed
error bounds. Bull. Math. Biol., 55:141–154.
Harbury, P. B., Plecs, J. J., Tidor, B., Alber, T., and Kim, P. S. (1998). High-resolution
protein design with backbone freedom. Science, 282:1462–1467.
Hartwell, L. H., Hopfield, J. J., Leibler, S., and Murray, A. W. (1999). From molecular
to modular cell biology. Nature, 402:C47–C52.
Hertz, G. and Stormo, G. (1999). Identifying DNA and protein patterns with statistically
significant alignments of multiple sequences. Bioinformatics, 15:563–577.
Holm, L. and Sander, C. (1991). Database algorithm for generating protein backbone and
sidechain coordinates from a Cα trace: Application to model building and detection
of coordinate errors. J. Mol. Biol., 218:183–194.
ILOG CPLEX (2000). ILOG CPLEX 7.1. http://www.ilog.com/products/cplex/.
Jansen, R., Yu, H., Greenbaum, D., Kluger, Y., Krogan, N. J., Chung, S., Emili, A., Sny-
der, M., Greenblatt, J. F., and Gerstein, M. (2003). A Bayesian networks approach for
predicting protein-protein interactions from genomic data. Science, 302(5644):449–
453.
Jones, T. A. and Kleywegt, G. J. (1999). CASP3 comparative modeling evaluation.
Proteins, 37:30–46.
Jothi, R., Kann, M., and Przytycka, T. (2005). Predicting protein-protein interaction by
searching evolutionary tree automorphism space. In Proceedings of the 13th Annual
International Conference on Intelligent Systems for Molecular Biology.
Kanehisa, M. (1997). A database for post-genome analysis. Trends Genet., 13:375–376.
Kanehisa, M. and Goto, S. (2000). KEGG: Kyoto encyclopedia of genes and genomes.
Nuc. Acids Res., 28:27–30.
Karger, D., Motwani, R., and Sudan, M. (1998). Approximate graph coloring by semidef-
inite programming. J. ACM, 45(2):246–265.
Kellis, M., Patterson, N., Endrizzi, M., Birren, B., and Lander, E. (2003). Sequencing
and comparison of yeast species to identify genes and regulatory elements. Nature,
423:241–254.
Kingsford, C., Chazelle, B., and Singh, M. (2005). Solving and analyzing side-chain posi-
tioning problems using linear and integer programming. Bioinformatics, 21(7):1028–
1039.
Klepeis, J. L., Floudas, C. A., Morikis, D., Tsokos, C. G., Argyropoulos, E., Spruce, L.,
and Lambris, J. D. (2003). Integrated computational and experimental approach for
lead optimization and design of compstatin variants with improved activity. J. Am.
Chem. Soc., 125:8422–8423.
Kohlbacher, O. and Lenhof, H.-P. (2000). BALL — rapid software prototyping in com-
putational molecular biology. Bioinformatics, 16(9):815–824.
Kortemme, T., Joachimiak, L. A., Bullock, A. N., Schuler, A. D., Stoddard, B. L., and
Baker, D. (2004). Computational redesign of protein-protein interaction specificity.
Nature Struct. Mol. Biol., 11(4):371–379.
Kuhlman, B. and Baker, D. (2000). Native protein sequences are close to optimal for their
structures. Proc. Natl. Acad. Sci. USA, 97(19):10383–10388.
Kuhlman, B., Dantas, G., Ireton, G. C., Varani, G., Stoddard, B. L., and Baker, D.
(2003). Design of a novel globular protein fold with atomic-level accuracy. Science,
302:1364–1368.
Lasters, I., De Maeyer, M., and Desmet, J. (1995). Enhanced dead-end elimination in
the search for the global minimum energy conformation of a collection of protein side
chains. Prot. Eng., 8:815–822.
Lau, H. C. (2002). A new approach for weighted constraint satisfaction. Constraints,
7:151–165.
Lau, H. C. and Watanabe, O. (1996). Randomized approximation of the constraint satis-
faction problem. In Karlsson, R. and Lingas, A., editors, Proc. of the 5th Scandinavian
Workshop on Algorithm Theory, pages 76–87, Berlin, Germany. Springer.
Lawrence, C., Altschul, S., Boguski, M., Liu, J., Neuwald, A., and Wootton, J. (1993).
Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment.
Science, 262:208–214.
Lawrence, C. and Reilly, A. (1990). An expectation maximization (EM) algorithm for
the identification and characterization of common sites in unaligned biopolymer se-
quences. Proteins: Structure, Function, and Genetics, 7:41–51.
Leach, A. R. and Lemon, A. P. (1998). Exploring the conformational space of protein side
chains using dead-end elimination and the A* algorithm. Proteins, 33:227–239.
Lecompte, O., Ripp, R., Thierry, J.-C., Moras, D., and Poch, O. (2002). Comparative
analysis of ribosomal proteins in complete genomes: an example of reductive evolution
at the domain scale. Nuc. Acids Res., pages 5382–5390.
Lee, C. (1994). Predicting protein mutant energetics by self-consistent ensemble optimiza-
tion. J. Mol. Biol., 236(3):918–939.
Lee, C. and Subbiah, S. (1991). Prediction of protein side-chain conformation by packing
optimization. J. Mol. Biol., 217(2):373–388.
Lee, I., Date, S. V., Adai, A. T., and Marcotte, E. M. (2004). A probabilistic functional
network of yeast genes. Science, 306:1555–1558.
Lee, T., Rinaldi, N., Robert, F., Odom, D., Bar-Joseph, Z., Gerber, G. K., et al. (2002).
Transcriptional regulatory networks in Saccharomyces cerevisiae. Science, 298:799–804.
Lesk, A. M., Branden, C.-I., and Chothia, C. (1989). Structural principles of α/β barrel
proteins: The packing of the interior of the sheet. Proteins, 5:139–148.
Liberles, D. A., Thoren, A., von Heijne, G., and Elofsson, A. (2002). The use of phyloge-
netic profiles for gene predictions. Current Genomics, 3:131–137.
Lilien, R. H., Stevens, B. W., Anderson, A. C., and Donald, B. R. (2004). A novel
ensemble-based scoring and search algorithm for protein redesign, and its applica-
tion to modify the substrate specificity of the gramicidin synthetase a phenylalanine
adenylation enzyme. In Proc. 8th Annual Internat. Conf. on Computat. Mol. Biol.,
pages 46–57, New York, NY. ACM Press.
Liu, X., Brutlag, D., and Liu, J. (2001). Bioprospector: discovering conserved DNA motifs
in upstream regulatory regions of co-expressed genes. In Pac. Symp. Biocomp., pages
127–138.
Looger, L. L., Dwyer, M. A., Smith, J. J., and Hellinga, H. W. (2003). Computational
design of receptor and sensor proteins with novel functions. Nature, 423:185–190.
Looger, L. L. and Hellinga, H. W. (2001). Generalized dead-end elimination algorithms
make large-scale protein side-chain structure prediction tractable: implications for
protein design and structural genomics. J. Mol. Biol., 307(1):429–445.
Lovasz, L. (1979). On the Shannon capacity of a graph. IEEE Trans. Inform. Theory,
25:1–7.
Lu, L., Xia, Y., Paccanaro, A., Yu, H., and Gerstein, M. (2005). Assessing the limits of
genomic data integration for predicting protein networks. Genome Res., 15:945–953.
MacKerell, Jr., A. D., Brooks, B., Brooks, III, C. L., Nilsson, L., Roux, B., Won, Y., and
Karplus, M. (1998). CHARMM: The energy function and its parameterization with
an overview of the program. In Schleyer, P. v. R., et al., editors, The Encyclopedia of
Computational Chemistry, volume 1, pages 271–277. John Wiley & Sons, Chichester.
Malakauskas, S. M. and Mayo, S. L. (1998). Design, structure and stability of a hyper-
thermophilic protein variant. Nat. Struct. Biol., 5(6):470–475.
Marcotte, E. M., Pellegrini, M., Ng, H. L., Rice, D. W., Yeates, T. O., and Eisenberg,
D. (1999a). Detecting protein function and protein-protein interactions from genome
sequences. Science, 285(5428):751–753.
Marcotte, E. M., Pellegrini, M., Thompson, M. J., Yeates, T. O., and Eisenberg, D.
(1999b). A combined algorithm for genome-wide prediction of protein function. Na-
ture, 402:83–86.
Marsan, L. and Sagot, M. F. (2000). Algorithms for extracting structured motifs using a
suffix tree with an application to promoter and regulatory site consensus identifica-
tion. J. Comp. Bio., 7:345–362.
Martin, A. C. R. (2001). ProFit, version 2.2.
http://www.bioinf.org.uk/software/profit.
McGuire, A., Hughes, J., and Church, G. (2000). Conservation of DNA regulatory motifs
and discovery of new motifs in microbial genomes. Genome Res., 10:744–757.
McLachlan, A. D. (1982). Rapid comparison of protein structures. Acta Cryst, A38:871–
873.
Mewes, H. W., Frishman, D., Guldener, U., Hannhaupt, G., Mayer, K., Mokrejs, M.,
Morgenstern, B., Munsterkotter, M., Rudd, S., and Weil, B. (2002). MIPS: a database
for genomes and protein sequences. Nuc. Acids Res., 30(1):31–34.
Morett, E., Korbel, J. O., Rajan, E., Saab-Rincon, G., Olvera, L., Olvera, M., Schmidt,
S., Snel, B., and Bork, P. (2003). Systematic discovery of analogous enzymes in
thiamin biosynthesis. Nat. Biotechnol., 21(7):790–795.
Mueller, U., Perl, D., Schmid, F. X., and Heinemann, U. (2000). Thermal stability and
atomic-resolution crystal structure of the Bacillus caldolyticus cold shock protein. J.
Mol. Biol., 297(4):975–988.
NCBI (2005). National Center for Biotechnology Information.
ftp://ftp.ncbi.nih.gov/genomes/.
Nesterov, Y. and Nemirovskii, A. (1993). Interior Point Polynomial Methods in Convex
Programming: Theory and Algorithms. SIAM, Philadelphia, PA.
Nicholls, A., Sharp, K. A., and Honig, B. (1991). Protein folding and association: In-
sights from the interfacial and thermodynamic properties of hydrocarbons. Proteins,
11(4):281–296.
Osada, R., Zaslavsky, E., and Singh, M. (2004). Comparative analysis of methods for
representing and searching for transcription factor binding sites. Bioinformatics,
20:3516–3525.
Overbeek, R., Fonstein, M., D’Souza, M., Pusch, G. D., and Maltsev, N. (1999). The use
of gene clusters to infer functional coupling. Proc. Natl. Acad. Sci., 96(6):2896–2901.
Park, S., Yang, X., and Saven, J. G. (2004). Advances in computational protein design.
Curr. Opinion Struct. Biol., 14(4):487–494.
Pavesi, G., Mereghetti, P., Mauri, G., and Pesole, G. (2004). Weeder Web: discovery of
transcription factor binding sites in a set of sequences from co-regulated genes. Nucl.
Acids Res., 32:W199–W203.
Pazos, F. and Valencia, A. (2001). Similarity of phylogenetic trees as indicator of protein-
protein interaction. Prot. Eng., 14(9):609–614.
Pellegrini, M., Marcotte, E. M., Thompson, M. J., Eisenberg, D., and Yeates, T. O. (1999).
Assigning protein functions by comparative genome analysis: protein phylogenetic
profiles. Proc. Natl. Acad. Sci., 96:4285–4288.
Petrey, D., Xiang, Z., Tang, C., Xie, L., Gimpelev, M., Mitros, T., Soto, C., Goldsmith-
Fischman, S., Kernytsky, A., Schlessinger, A., Koh, I., Alexov, E., and Honig, B.
(2003). Using multiple structure alignments, fast model building and energetic anal-
ysis in fold recognition and homology modeling. Proteins, 53:430–435.
Pevzner, P. and Sze, S. (2000). Combinatorial approaches to finding subtle signals in DNA
sequences. In Proc. Intell. Syst. in Mol. Biol., pages 269–278.
Pierce, N. A., Spriet, J. A., Desmet, J., and Mayo, S. L. (2000). Conformational splitting:
A more powerful criterion for dead-end elimination. J. Comput. Chem, 21(11):999–
1009.
Pierce, N. A. and Winfree, E. (2002). Protein design is NP-hard. Prot. Eng., 15(10):779–
782.
Ponder, J. W. and Richards, F. M. (1987). Tertiary templates for proteins. Use of packing
criteria in the enumeration of allowed sequences for different structural classes. J.
Mol. Biol., 193(4):775–791.
Raghavan, P. and Thompson, C. (1987). Randomized rounding: a technique for provably
good algorithms and algorithmic proofs. Combinatorica, 7(4):365–374.
Ramani, A. K. and Marcotte, E. M. (2003). Exploiting the co-evolution of interacting
proteins to discover interaction specificity. J. Mol. Biol., 327:273–284.
Reinert, K., Lenhof, H., Mutzel, P., Mehlhorn, K., and Kececioglu, J. (1997). A branch-
and-cut algorithm for multiple sequence alignment. In Proc. Annual Internat. Conf.
on Computat. Mol. Biol., pages 241–249.
Rigoutsos, I. and Floratos, A. (1998). Combinatorial pattern discovery in biological se-
quences: The TEIRESIAS algorithm. Bioinformatics, 14:55–67.
Robison, K., McGuire, A. M., and Church, G. M. (1998). A comprehensive library of
DNA-binding site matrices for 55 proteins applied to the complete Escherichia coli
K-12 genome. J. Mol. Biol., 284:241–254.
Rolim, J. D. P. and Trevisan, L. (1998). A case study of de-randomization methods for
combinatorial approximation problems. J. of Combin. Optim., 2(3):219–236.
Samudrala, R. and Moult, J. (1998). A graph-theoretic algorithm for comparative mod-
eling of protein structure. J. Mol. Biol., 279(1):287–302.
Schuler, G., Altschul, S., and Lipman, D. (1991). A workbench for multiple alignment
construction and analysis. Proteins, 9(3):180–190.
Seneta, E. (1981). Non-negative Matrices and Markov Chains. Springer-Verlag, New York,
NY, 2nd edition.
Simons, K. T., Kooperberg, C., Huang, E., and Baker, D. (1997). Assembly of pro-
tein tertiary structures from fragments with similar local sequences using simulated
annealing and Bayesian scoring functions. J. Mol. Biol., 268:209–225.
Sinha, S. and Tompa, M. (2003). YMF: A program for discovery of novel transcription
factor binding sites by statistical overrepresentation. Nucl. Acids Res., 31:3586–3588.
Summers, N. and Karplus, M. (1989). Construction of side-chains in homology modeling.
Application to the C-terminal lobe of rhizopuspepsin. J. Mol. Biol., 210:785–811.
Sze, S.-H., Lu, S., and Chen, J. (2004). Integrating sample-driven and pattern-driven
approaches in motif finding. In Proc. Workshop on Algo. in Biocomp., pages 438–
449.
Tatusov, R. L., Koonin, E. V., and Lipman, D. J. (1997). A genomic perspective on
protein families. Science, 278(5338):631–637.
Tavazoie, S., Hughes, J. D., Campbell, M. J., Cho, R. J., and Church, G. (1999). System-
atic determination of genetic network architecture. Nature Genetics, 22(3):281–285.
Thompson, J. D., Higgins, D. G., and Gibson, T. J. (1994). Clustal W: improving
the sensitivity of progressive multiple sequence alignment through sequence weight-
ing, position-specific gap penalties and weight matrix choice. Nucleic Acids Res.,
22:4673–4680.
Tompa, M. (1999). An exact method for finding short motifs in sequences, with application
to the ribosome binding site problem. In Proc. Intell. Syst. in Mol. Biol., pages 262–
271.
Tompa, M., Li, N., Bailey, T. L., Church, G., De Moor, B., Eskin, E., et al. (2005).
Assessing computational tools for the discovery of transcription factor binding sites.
Nature Biotech., 23:137–144.
Troyanskaya, O., Dolinski, K., Owen, A. B., Altman, R. B., and Botstein, D. (2003).
A Bayesian framework for combining heterogeneous data sources for gene function
prediction (in Saccharomyces cerevisiae). Proc. Natl. Acad. Sci., 100(14):8348–8353.
van Helden, J., Rios, A., and Collado-Vides, J. (2000). Discovering regulatory elements in
non-coding sequences by analysis of spaced dyads. Nucleic Acids Res., 28:1808–1818.
Vandenberghe, L. and Boyd, S. (1996). Semidefinite programming. SIAM Rev., 38(1):49–
95.
Vazirani, V. (2001). Approximation Algorithms. Springer-Verlag, Berlin.
Ventura, S. and Serrano, L. (2004). Designing proteins from the inside out. Proteins,
56:1–10.
Vert, J.-P. (2002). A tree kernel to analyse phylogenetic profiles. Bioinformatics, 18:S275–
S284.
von Mering, C., Huynen, M., Jaeggi, D., Schmidt, S., Bork, P., and Snel, B. (2003).
STRING: a database of predicted functional associations between proteins. Nuc.
Acids Res., 31:258–261.
Wang, C., Schueler-Furman, O., and Baker, D. (2005). Improved side-chain modeling for
protein-protein docking. Protein Sci., 14:1328–1339.
Wu, J., Kasif, S., and DeLisi, C. (2003). Identification of functional links between genes
using phylogenetic profiles. Bioinformatics, 19(12):1524–1530.
Xiang, Z. and Honig, B. (2001). Extending the accuracy limits of prediction for side-chain
conformations. J. Mol. Biol., pages 421–430.
Xu, J. (2005). Rapid protein side-chain packing via tree decomposition. In Miyano, S.,
Mesirov, J., Kasif, S., Istrail, S., Pevzner, P., and Waterman, M., editors, 9th Annual
Internat. Conf. Res. In Comp. Mol. Biol., pages 423–439. Springer.
Zaslavsky, E. and Singh, M. (2005). Combinatorial optimization approaches to motif
finding. Manuscript, submitted for publication.
Zwick, U. (1999). Outward rotations: a tool for rounding solutions of semidefinite pro-
gramming relaxations, with applications to MAX CUT and other problems. In Proc.
of the 31st Annual ACM Sympos. on Theory of Comput., pages 679–687, New York,
NY. ACM Press.