Computational Approaches to Problems in
Protein Structure and Function
Carleton L. Kingsford
A Dissertation
Presented to the Faculty
of Princeton University
in Candidacy for the Degree
of Doctor of Philosophy
Recommended for Acceptance
by the Department of
Computer Science
November 2005
© Copyright by Carleton L. Kingsford, 2005. All rights reserved.
Abstract
We present computational approaches to solve several problems arising in protein structure
and function.
In the first part of this thesis, we develop a new method for finding the lowest energy
positions of side chains when given the backbone of a protein, a widely studied problem
that has applications in homology modeling and protein design. We present an integer
linear programming formulation of side-chain positioning and relax it to give a polynomial-
time linear programming heuristic that allows us to tackle large problems. We test the
integer and linear programming approach on native and homologous backbones, where we
show that optimal solutions can usually be found using linear programming, and in protein
redesign, where we find that instances often cannot be solved using linear programming
directly, but where optimal solutions for large instances can be found using the more
expensive integer programming procedure. We also present an alternative formulation of
the side-chain positioning problem as a semidefinite program, which provides a tighter
relaxation than the linear program. We introduce two novel rounding schemes to convert
fractional solutions of the semidefinite program into choices of rotamers and provide some
theoretical justifications for their effectiveness. We extensively test the semidefinite pro-
gramming formulation and rounding schemes on simulated data and on the redesign of
two naturally occurring protein cores and show that the approach finds good solutions.
The second part of this thesis considers the problem of finding transcription factor
binding sites by locating a collection of mutually similar subsequences within the upstream
DNA sequences of genes. Our approach to side-chain positioning can be recast to solve
this problem, and it has previously been shown that this is a promising direction to pursue.
We improve the mathematical programming formulation to find binding sites up to 45
times faster.
Finally, in the last part of the thesis, we investigate protein function more broadly and
give extensions to the popular phylogenetic profile method for predicting shared function
from cross-genomic evolutionary history. For many biological functions, our methods are
better able to identify functionally linked proteins than previously introduced methods.
Acknowledgments
The main content of several chapters has appeared as papers or submissions, and I thank
my co-authors for allowing our joint work to be included in this thesis.
The work on applying linear programming to side-chain positioning in Chapter 3
appeared in the journal Bioinformatics (Kingsford et al., 2005) and is joint work with
Bernard Chazelle and Mona Singh. I thank Amy Keating, Jessica Fong, Gevorg Grigoryan,
Elena Nabieva, Robert Osada, Elena Zaslavsky and the reviewers for insightful comments
on that paper.
Chapter 4 on applying semidefinite programming to side-chain positioning appeared
in the Special Issue in Computational Molecular Biology / Bioinformatics of the IN-
FORMS Journal on Computing (Chazelle et al., 2004). (A preliminary version appeared
in (Chazelle et al., 2003).) This work is also joint work with Bernard Chazelle and Mona
Singh, and I thank Tony Wirth for helpful comments on the manuscript.
Chapter 5 is joint work with Elena Zaslavsky and Mona Singh. I thank Elena Nabieva
for her comments on an earlier version of the manuscript that became this chapter.
I was in part supported by a DIMACS Graduate Student Summer Award while working
on Chapter 6, which is joint work with Mona Singh. Thanks to Jessica Fong for reading
this chapter.
Bernard Chazelle and Olga Troyanskaya each gave helpful and much appreciated com-
ments on the entire thesis in their capacity as readers on my thesis committee. I am
also grateful to Rob Schapire and Michael Hecht for serving on my thesis committee as
non-readers.
This work was supported financially through my advisor and Princeton University via
the National Science Foundation and the Defense Advanced Research Projects Agency.
For the work described here, at least 50,000 lines of software code were written. I
would like to thank my advisor, Mona Singh, for making these many lines of code seem
worthwhile and for her consistent insights.
Jessica Fong played a special role in this thesis, keeping the hours not spent with those
lines of code dependably enjoyable.
Finally, I especially want to thank my parents, Howard and Geraldine, and sister,
Carriann, for their support and encouragement.
Contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
1 Introduction 1
2 The Side-chain Positioning Problem 5
2.1 Brief Biological Background . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 The Side-chain Positioning Problem . . . . . . . . . . . . . . . . . . . . . 6
2.3 Formal Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.4 Applications and Successes . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.5 Previous Methods to Solve the SCP Problem . . . . . . . . . . . . . . . . 11
2.6 The Inapproximability of Side-chain Positioning . . . . . . . . . . . . . . . 12
3 Solving and Analyzing Side-chain Positioning Problems Using Linear
and Integer Programming 16
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.1.1 Biological Relevance . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.1.2 Practical Implications . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2.1 Integer Linear Programming Formulation . . . . . . . . . . . . . . 20
3.2.2 Multiple Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2.3 LP/ILP Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2.4 Integrality Gap of the LP Relaxation . . . . . . . . . . . . . . . . . 25
3.2.5 Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2.6 Rotamer Library and Structure Manipulation . . . . . . . . . . . . 30
3.2.7 Energy Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2.8 Evaluating Predicted Structures . . . . . . . . . . . . . . . . . . . 34
3.3 Computational Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3.1 Native Backbone Tests . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3.2 Homology Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.3.3 Protein Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.4 Other Energy Functions . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3.5 Obtaining Multiple Solutions . . . . . . . . . . . . . . . . . . . . . 45
3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4 A Semidefinite Programming Approach to Side-Chain Positioning with
New Rounding Strategies 50
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2 A Semidefinite Programming Heuristic . . . . . . . . . . . . . . . . . . . . 53
4.2.1 Projection Rounding . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.2.2 Perron-Frobenius Rounding . . . . . . . . . . . . . . . . . . . . . . 59
4.3 Computational Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.3.1 Cold Shock Protein . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.3.2 Triose Phosphate Isomerase . . . . . . . . . . . . . . . . . . . . . . 67
4.3.3 Uniform Random Graphs . . . . . . . . . . . . . . . . . . . . . . . 68
4.3.4 Neighborhood Random Graphs . . . . . . . . . . . . . . . . . . . . 70
4.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5 Improving a Mathematical Programming Approach for Motif Finding 75
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.2 Formal Problem Specification . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.3 Integer and Linear Programming Formulations . . . . . . . . . . . . . . . 79
5.3.1 Original Integer Linear Programming Formulation . . . . . . . . . 79
5.3.2 New Integer Linear Programming Formulation . . . . . . . . . . . 79
5.3.3 Advantages of IP2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.3.4 Linear Programming Relaxation . . . . . . . . . . . . . . . . . . . 82
5.3.5 Integrality Gap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.4 Computational Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.4.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.4.2 Test Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.4.3 Performance of the LP Relaxations . . . . . . . . . . . . . . . . . . 92
5.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6 Identifying Functionally Related Yeast Proteins Using Inferred Evolu-
tionary History 96
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.1.1 Our Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.1.2 Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.2.1 Computing the Phylogenetic Profiles . . . . . . . . . . . . . . . . . 102
6.2.2 Mutual Information . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.2.3 Definition of Pathways . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.2.4 Framework for Inferring Ancestral Gene State . . . . . . . . . . . . 106
6.2.5 Comparing Profiles Using Tree Labelings . . . . . . . . . . . . . . 113
6.2.6 Assessing Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
6.3 Computational Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
6.3.1 Limit on ROC Performance of Phylogenetic Profiles . . . . . . . . 116
6.3.2 Performance of Mutual Information for Finding Functional Linkages 121
6.3.3 Predicting Linkages From Shared Gains and Losses . . . . . . . . . 123
6.3.4 Predicting Linkages By Comparing Likelihoods . . . . . . . . . . . 124
6.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
6.5 List of Pathways and the Phylogenetic Tree . . . . . . . . . . . . . . . . . 133
7 Conclusion and Future Work 138
List of Figures
2.1 SCP graph formulation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Converting a 3-CNF formula to an SCP problem. . . . . . . . . . . . . . . 14
3.1 Interacting pairs of residues in two proteins. . . . . . . . . . . . . . . . . . 23
3.2 Flow chart of the LP/ILP approach. . . . . . . . . . . . . . . . . . . . . . 26
3.3 Bad integrality ratio example. . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.4 The distribution of rotamers in the rotamer library. . . . . . . . . . . . . . 31
3.5 The 6–12 Lennard-Jones van der Waals force. . . . . . . . . . . . . . . . . 32
3.6 The average rmsd over 5 proteins for various values of C. . . . . . . . . . 33
3.7 Native backbone running times. . . . . . . . . . . . . . . . . . . . . . . . . 35
3.8 Native test set χ1 angle errors. . . . . . . . . . . . . . . . . . . . . . . . . 37
3.9 Performance compared with exposed surface error. . . . . . . . . . . . . . 37
3.10 Distribution of mistakes broken down by amino acid. . . . . . . . . . . . . 38
3.11 Design problem solve times. . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.12 Design problem relative gaps. . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.13 Near-optimal solution relative gaps. . . . . . . . . . . . . . . . . . . . . . 46
4.1 Geometry of the solution vectors. . . . . . . . . . . . . . . . . . . . . . . . 58
4.2 Cold-shock protein. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.3 Triose phosphate isomerase. . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.4 Largest eigenvalues of uniform random graphs. . . . . . . . . . . . . . . . 70
4.5 Gaps and rounded energies for uniform random graphs. . . . . . . . . . . 71
4.6 Gaps and rounded energies for neighborhood random graphs. . . . . . . . 72
5.1 Schematic of the new ILP formulation. . . . . . . . . . . . . . . . . . . . . 81
5.2 Notation used in the faster ILP formulation. . . . . . . . . . . . . . . . . . 84
5.3 Graph used to show the new ILP can be made as tight as the old one. . . 84
5.4 Bad example for the heuristic set of constraints. . . . . . . . . . . . . . . . 87
5.5 Bad integrality ratio example for motif finding. . . . . . . . . . . . . . . . 90
5.6 Speed up and size reduction of new ILP formulation. . . . . . . . . . . . . 93
6.1 Transforming e-values to probabilities. . . . . . . . . . . . . . . . . . . . . 103
6.2 Fraction of S. cerevisiae genes that have homologs. . . . . . . . . . . . . . 104
6.3 Sizes of the KEGG pathways. . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.4 Gene gain / loss probability model. . . . . . . . . . . . . . . . . . . . . . . 108
6.5 Plot of the function maximized in the M-step. . . . . . . . . . . . . . . . . 112
6.6 Maximal ROC performance for KEGG pathways. . . . . . . . . . . . . . . 116
6.7 Colliding profiles. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
6.8 Colliding profiles at Hamming distance 10% . . . . . . . . . . . . . . . . . 120
6.9 ROC analysis of mutual information. . . . . . . . . . . . . . . . . . . . . . 121
6.10 Per-function analysis of mutual information. . . . . . . . . . . . . . . . . . 122
6.11 Comparison of SGL with binary MI . . . . . . . . . . . . . . . . . . . . . 123
6.12 Root label probabilities for LRATIO . . . . . . . . . . . . . . . . . . . . . 124
6.13 Transition probabilities for LRATIO . . . . . . . . . . . . . . . . . . . . . 125
6.14 Complementary evolution schematic. . . . . . . . . . . . . . . . . . . . . . 127
6.15 Comparing LRATIO to MI. . . . . . . . . . . . . . . . . . . . . . . . . . . 128
6.16 Pathways ranked the highest by LRATIO. . . . . . . . . . . . . . . . . . . 129
6.17 Comparison of LRATIO using real tree and a random tree. . . . . . . . . 130
6.18 Shared presence and shared absence edges. . . . . . . . . . . . . . . . . . . 132
List of Tables
3.1 Native backbone problem sizes. . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2 Homology modeling problems and their sizes. . . . . . . . . . . . . . . . . 29
3.3 Run time for homology modeling test set. . . . . . . . . . . . . . . . . . . 30
3.4 Average performance on native test set. . . . . . . . . . . . . . . . . . . . 36
3.5 Average performance on homology modeling problems. . . . . . . . . . . . 40
3.6 Redesign test set and solve times. . . . . . . . . . . . . . . . . . . . . . . . 42
5.1 Sizes for motif finding problems. . . . . . . . . . . . . . . . . . . . . . . . 92
Chapter 1
Introduction
In this thesis, we present computational approaches to solve several problems in protein
structure and function. Algorithms and computational methods are becoming essential
to make sense of the vast amount of data that is now available about the workings of the cell.
Our main task in this thesis is the development of algorithmic and analytical methods for
understanding proteins — the workhorse molecules of life — and their function. In partic-
ular, we address the problems of predicting and designing protein structures, discovering
protein binding sites in DNA, and assigning proteins to biological pathways.
In the first part of this thesis, we focus on protein structure. A central problem
in molecular biology is that of predicting a protein’s three-dimensional fold when given
only its one-dimensional amino acid sequence. The structure of a protein plays a critical
role in its function. While the number of known protein sequences is growing rapidly,
their corresponding protein structures are being determined at a significantly slower pace.
Despite decades of work, the problem of predicting the 3D structure of a protein from its
amino acid sequence remains unsolved. In the first part of this thesis, we consider the side-
chain positioning (SCP) problem, a challenging and important component of the general
protein structure prediction problem, with applications in homology modeling and protein
design. For side-chain positioning, the task is to find the lowest energy conformation of
a protein’s side chains on a given, fixed backbone. In Chapters 2, 3, and 4, we study
a widely-used version of the problem where the side-chain positioning procedure uses a
library of discrete side-chain conformations and an energy function that can be expressed
as a sum of pairwise terms. In practice, this problem is tackled by a variety of general
search techniques and specialized heuristics. Chapter 2 is an introduction to the SCP
problem, previous methods that have been used to solve it, and some successes obtained
with these methods. It also provides a brief overview of the biology behind the problem,
and we show that it is NP-hard to find even a reasonable approximate solution to the
problem.
In Chapter 3 we present an integer linear programming (ILP) formulation of side-chain
positioning. Formulating the SCP problem as an ILP requires that we represent the space
of possible side-chain conformations using a system of binary variables and express the
quality of a conformation as a linear function of those variables. A low-energy structure
will arise from a setting of these variables that minimizes this function. Because it is
difficult to enforce, we relax the constraint that the value of the variables be either 0 or
1 to give a polynomial-time linear programming (LP) heuristic that allows us to tackle
large problem sizes.
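As a rough sketch of this construction (a hypothetical toy instance, not the exact formulation of Chapter 3), the binary variables and the two families of constraints can be laid out explicitly; the objective then minimizes the sum of node energies times node variables plus pairwise energies times edge variables.

```python
# Hypothetical instance: 3 positions with 2, 3, and 2 candidate rotamers.
rotamers_per_position = [2, 3, 2]
p = len(rotamers_per_position)

# Binary node variables x[i, r]: rotamer r is chosen at position i.
node_vars = [(i, r) for i in range(p)
             for r in range(rotamers_per_position[i])]

# Binary edge variables y[i, r, j, s] (i < j): rotamers ir and js are
# both chosen, so their pairwise energy term is paid.
edge_vars = [(i, r, j, s) for i in range(p) for j in range(i + 1, p)
             for r in range(rotamers_per_position[i])
             for s in range(rotamers_per_position[j])]

def ekey(i, r, j, s):
    """Normalize an edge-variable key so positions appear in order."""
    return (i, r, j, s) if i < j else (j, s, i, r)

# Choice constraints: exactly one rotamer per position,
#   sum_r x[i, r] = 1.
choice_constraints = [[(i, r) for r in range(rotamers_per_position[i])]
                      for i in range(p)]

# Consistency constraints tying edges to nodes: for each node (i, r)
# and each other position j,  sum_s y[i, r, j, s] = x[i, r].
consistency = [((i, r), [ekey(i, r, j, s)
                         for s in range(rotamers_per_position[j])])
               for (i, r) in node_vars
               for j in range(p) if j != i]
```

Relaxing the requirement that these variables be 0 or 1 to the interval [0, 1] yields the linear program.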
To test the effectiveness of the heuristic, we apply it to place side chains on native and
homologous backbones and to choose amino acid types for protein design. Surprisingly,
when positioning side chains on native and homologous backbones, optimal solutions using
a biologically relevant energy function can usually be found using LP. On the other hand,
design problems often cannot be solved using LP directly, but optimal solutions for large
instances can still be found using the computationally more expensive ILP procedure. We
briefly explore how the choice of energy function used to evaluate the quality of a
structure affects the difficulty of the problem. While the ease with which solutions can be
found does vary with the choice of energy function, the LP/ILP approach described in
Chapter 3 is able to find optimal solutions for the energy function variants we considered.
Our analysis is the first large-scale demonstration that LP-based approaches are highly
effective in finding optimal (and near-optimal) solutions for the side-chain positioning
problem, and our success in finding optimal solutions puts the theoretical hardness results
into practical context.
Because solutions to the LP relaxation of design problems often have variables set to
values that are not 0 or 1, we also explore a tighter relaxation to the integer program.
In Chapter 4, we write the side-chain positioning problem as an instance of semidefinite
programming (SDP). We introduce two novel rounding schemes and provide theoretical
justifications for their effectiveness under various conditions. We extensively test the
SDP formulation and the new rounding schemes on simulated data, as well as on the
computational redesign of two naturally occurring protein cores. We show that our SDP
approach generally finds good solutions. The rounding schemes should be applicable
outside the context of side-chain positioning.
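As a rough illustration of eigenvector-based rounding (using a small hypothetical affinity matrix and a plain power iteration, not the actual SDP solution matrices or the precise schemes of Chapter 4), one can extract a dominant eigenvector, which Perron-Frobenius theory guarantees is entrywise nonnegative for a nonnegative matrix, and then keep the largest component within each position's block:

```python
import math

def power_iteration(M, iters=200):
    """Approximate the dominant eigenvector of a symmetric matrix M
    (given as a list of rows) by repeated multiplication."""
    n = len(M)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical affinity matrix over 2 positions x 2 rotamers, indexed
# (pos0 rot0, pos0 rot1, pos1 rot0, pos1 rot1); the strong 0.9 entry
# couples position 0's rotamer 1 with position 1's rotamer 0.
M = [[1.0, 0.0, 0.2, 0.1],
     [0.0, 1.0, 0.9, 0.1],
     [0.2, 0.9, 1.0, 0.0],
     [0.1, 0.1, 0.0, 1.0]]
v = power_iteration(M)

# Round: within each position's block of entries, keep the rotamer
# whose eigenvector component is largest in magnitude.
blocks = [(0, 2), (2, 4)]
choice = [max(range(lo, hi), key=lambda k: abs(v[k])) - lo
          for lo, hi in blocks]
```

Here the rounding recovers the strongly coupled pairing: rotamer 1 at position 0 and rotamer 0 at position 1.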
In the second part of this thesis, we study the problem of predicting a protein’s bind-
ing sites in DNA. One of the most important roles that proteins play in the cell is the
regulation of the expression of other proteins, and in Chapter 5 we consider the problem
of finding transcription factor binding sites in the regions of DNA upstream of genes. The
motif-finding problem is that of locating a collection of mutually similar subsequences
within a given set of DNA sequences; these subsequences often correspond to regulatory
elements, the discovery of which can help explain the circuitry of the cell. We study a com-
binatorial framework for the motif-finding problem, where the goal is to find a minimum
weighted clique in a k-partite graph. Previous approaches to find these cliques have relied
on graph pruning and divide-and-conquer techniques. Though the side-chain positioning
and motif-finding problems seem very different, the same combinatorial problem lies at
their heart. Recently, it has been shown that mathematical programming is a promising
approach for motif finding using an integer program identical to the one we apply to SCP
in Chapter 3. In Chapter 5, we describe a novel, faster integer linear programming formu-
lation for the problem. A key observation driving the improvement is that the weights on
the edges in the graph come from a small set of possibilities. By exploiting this property
we often solve these problems to optimality an order of magnitude (and up to 45 times)
faster than the existing mathematical programming approach. We show that our new
formulation leads to a method that is highly effective in practice on instances arising from
biological sequence data.
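A brute-force rendering of the underlying combinatorial problem (on made-up toy sequences; the thesis's formulations avoid this exhaustive search) makes the k-partite clique view concrete: each part holds the length-l substrings of one sequence, edges are weighted by Hamming distance, and we seek the lightest clique with one node per part.

```python
from itertools import product

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def find_motif(seqs, l):
    """Pick one length-l substring from each sequence so that the sum
    of pairwise Hamming distances (the clique weight) is minimized.
    Exhaustive, so only usable on toy inputs."""
    candidates = [[s[i:i + l] for i in range(len(s) - l + 1)]
                  for s in seqs]
    def clique_weight(choice):
        return sum(hamming(choice[i], choice[j])
                   for i in range(len(choice))
                   for j in range(i + 1, len(choice)))
    return min(product(*candidates), key=clique_weight)

# Toy instance with a planted motif ACGT (mutated to ACGA in seq 3).
seqs = ["TTACGTAA", "GACGTTTG", "CCCACGAT"]
motif = find_motif(seqs, 4)
```

On this instance the minimum-weight clique recovers the planted occurrences, including the mutated one in the third sequence.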
Though they arise from very different biological applications, the two problems of motif finding and
side-chain positioning are connected by their underlying graph problem, and our ap-
proaches to them share the philosophy of focusing on optimal (rather than simply “good”)
solutions while optimizing a clear objective function.
In the third part of the thesis, we investigate protein function more generally, and
develop several new methods for predicting the role a protein plays in the cell using infor-
mation from its evolutionary history by extending the widely used method of phylogenetic
profiles. A phylogenetic profile is a vector indicating whether a protein is present across a
variety of organisms. Similar profiles are taken to indicate similar function, but determin-
ing which profiles should be considered similar is not straightforward. In Chapter 6, we
show that if the measure of similarity incorporates the broad evolutionary relationships
between organisms, it is possible to make better predictions of functional links. We relate
the species involved in the profiles by a species tree and infer whether each protein was
present or absent in the non-extant, ancestor organisms implied by the internal nodes of
the tree. We give two new measures of profile similarity that use these inferred, ances-
tral gene states to better predict shared function. We also investigate the best way to
assess the quality of the predictions and present evidence that considering performance
on each function separately gives a more accurate picture of performance than lumping
all functions together, as is often done.
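The standard profile-comparison measure, which Chapter 6 uses as a baseline, is mutual information between the two presence/absence vectors; a minimal sketch over hypothetical profiles:

```python
from math import log2

def mutual_information(p, q):
    """Mutual information (in bits) between two binary phylogenetic
    profiles, treating each genome as one joint observation."""
    n = len(p)
    joint = {}
    for a, b in zip(p, q):
        joint[(a, b)] = joint.get((a, b), 0) + 1
    mi = 0.0
    for (a, b), count in joint.items():
        pxy = count / n
        px = sum(1 for x in p if x == a) / n
        py = sum(1 for y in q if y == b) / n
        mi += pxy * log2(pxy / (px * py))
    return mi

# Hypothetical profiles over 8 genomes (1 = homolog present).
prof_a = [1, 1, 0, 0, 1, 0, 1, 0]
prof_b = [1, 1, 0, 0, 1, 0, 1, 0]   # identical history to prof_a
prof_c = [1, 0, 1, 0, 1, 0, 1, 0]   # partly unrelated history
mi_ab = mutual_information(prof_a, prof_b)
mi_ac = mutual_information(prof_a, prof_c)
```

Identical profiles score the maximum (here 1 bit); the partly unrelated profile scores lower, which is the ordering a functional-linkage predictor wants. Note this measure ignores the tree structure relating the genomes, which is exactly what the methods of Chapter 6 add.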
In the next chapter, we begin the study of the first of these problems, the side-chain
positioning problem.
Chapter 2
The Side-chain Positioning
Problem
In this chapter, we start with some biological background and introduce the
side-chain positioning problem — the problem of determining a low-energy
configuration of the side-chain atoms of a protein. We describe some ap-
plications and conclude by showing that the combinatorial problem behind
side-chain positioning is hard to approximate.
2.1 Brief Biological Background
A protein molecule is formed from a chain of amino acids. Each amino acid consists
of a central carbon atom, and attached to this carbon are a hydrogen atom, an amino
group (NH2), a carboxyl group (COOH) and a side chain that characterizes the amino
acid. Side chains vary in composition; for example, the side chain for the amino acid
glycine consists of a single hydrogen atom, and the side chain for the amino acid alanine
consists of a carbon atom with three hydrogen atoms attached. The amino acids of a
protein are connected in sequence, with the carboxyl group of one amino acid forming
a peptide bond with the amino group of the next amino acid. This forms the protein
backbone, and the repeating amino acid units (also called residues) within the protein
consist of both the main-chain atoms that comprise the backbone as well as the side-
chain atoms. There are 20 commonly occurring amino acids, and each protein molecule
is specified by a sequence corresponding to the amino acids that make it up. Whereas a
protein’s sequence immediately reveals its chemical composition, its structure, specified
by the spatial coordinates of its main-chain and side-chain atoms, is significantly more
difficult to determine.
It is generally believed that a protein’s native structure corresponds to the conforma-
tion with the minimum global free energy. It follows that one approach to predict protein
structure computationally is to start with the protein’s amino acid sequence, specify an
appropriate energy function, and find the conformation that minimizes the energy. A sim-
ilar strategy can be used to design novel sequences by searching for a sequence that,
when placed onto a backbone, yields a low energy structure. Protein structures are diffi-
cult to predict and design due to inaccuracies in energy functions as well as the infeasibility
of computationally searching over all possible conformations; in practice, predictions are
often made by settling for less than optimal solutions when considering imperfect energy
functions.
2.2 The Side-chain Positioning Problem
In the side-chain positioning problem (SCP), one is given a fixed backbone and a protein
sequence, and the task is to predict the best conformation of the protein’s side chains on
this backbone. SCP is a key step in computational methods for predicting and designing
protein structures (see, e.g., (Summers and Karplus, 1989; Holm and Sander, 1991; Lee
and Subbiah, 1991; Ventura and Serrano, 2004; Park et al., 2004)).
The problem is made discrete by the observation that in actual protein structures side
chains tend to occupy one of a small number of conformations (Ponder and Richards,
1987), called rotamers. These rotamers are identified by finding frequently occurring side-
chain conformations in databases of protein structures. Common conformations for each
side chain are collected into rotamer libraries (e.g. (Dunbrack Jr and Karplus, 1993)).
The total energy of the molecule is expressed as a sum of pairwise energies between atoms
(i.e., when computing energies, only two residues are considered at a time). SCP can then
be formulated as a combinatorial optimization problem: choose a rotamer for each side
chain such that the overall energy of the molecule is minimized (see Section 2.3).
This formulation of SCP has been the basis of some of the more successful methods for
homology modeling (e.g., (Lee and Subbiah, 1991; Petrey et al., 2003; Xiang and Honig,
2001; Jones and Kleywegt, 1999; Bower et al., 1997)), and protein design (e.g., (Dahiyat
and Mayo, 1997; Malakauskas and Mayo, 1998; Looger et al., 2003)). In homology mod-
eling, the goal is to predict the structure for a protein that is homologous (operationally
defined as having similar sequence) to another one of known structure. When two pro-
teins have high sequence similarity, they almost always have a similar overall shape, and
thus the backbone of the protein of known structure can be used as a reasonable tem-
plate backbone for the protein under investigation. In the homology-modeling setting,
the rotamers for a single amino acid are considered at each position.
In protein design (or redesign), the goal is to find a sequence of amino acids that will
fold into a given shape. Though the goal in the design setting is very different from that in
homology modeling, the underlying formulation for both problems is identical. The design
problem is often reduced to SCP by the following method: rather than specifying exactly
the amino acid at each position, we allow the optimization problem to choose among
rotamers from several different types of amino acids at each position. The optimization
problem is solved and the amino acid that corresponds to the rotamer that was chosen at
position i is taken to be the ith amino acid in the sequence. This sequence is the one that
best fits this backbone, and thus, it is hoped, will fold into this shape. This approach has
led to some dramatic successes in protein design, including the design of a 28-residue zinc
finger domain that folds in the absence of zinc (Dahiyat and Mayo, 1997) and the design
of novel receptor proteins for several small molecules (Looger et al., 2003).
2.3 Formal Problem Description
The SCP problem can be stated as follows (Desmet et al., 1992): given a fixed backbone
of length p, each residue position i is associated with a set of possible candidate rotamers
{ir}. Once a single rotamer for each residue position has been chosen, the potential energy
of a protein system is given by the formula
E = E_0 + \sum_i E(i_r) + \sum_{i<j} E(i_r, j_s) ,    (2.1)
where E_0 is the self-energy of the backbone, E(i_r) is the energy resulting from the interac-
tion between the backbone and the chosen rotamer i_r at position i as well as the intrinsic
energy of rotamer i_r, and E(i_r, j_s) accounts for the pairwise interaction energy between
chosen rotamers i_r and j_s. In this discretized setting, the placement of each side chain
is reduced to finding an assignment of rotamers to positions that minimizes the overall
energy of the system (the global minimum energy conformation, or GMEC).
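A direct transcription of formula (2.1), together with the exhaustive search it implicitly defines (viable only for tiny, made-up instances), looks like this:

```python
from itertools import product

def total_energy(rot, E0, E1, E2):
    """Energy of formula (2.1) for one rotamer choice per position:
    E0 is the backbone self-energy, E1[i][r] the rotamer-backbone
    (plus intrinsic) term, and E2[(i, r, j, s)] the pairwise term
    for positions i < j (absent keys mean zero interaction)."""
    p = len(rot)
    return (E0
            + sum(E1[i][rot[i]] for i in range(p))
            + sum(E2.get((i, rot[i], j, rot[j]), 0.0)
                  for i in range(p) for j in range(i + 1, p)))

def gmec(num_rotamers, E0, E1, E2):
    """Global minimum energy conformation by exhaustive search; the
    search space is the product of the rotamer counts per position,
    so this is only usable on toy problems."""
    return min(product(*(range(n) for n in num_rotamers)),
               key=lambda rot: total_energy(rot, E0, E1, E2))

# Two positions, two candidate rotamers each (made-up energies): the
# favorable -2.0 pair term pulls the solution away from the per-
# position minima.
E1 = [[0.0, 1.0], [1.0, 0.0]]
E2 = {(0, 0, 1, 1): 5.0, (0, 1, 1, 0): -2.0}
best = gmec([2, 2], 0.0, E1, E2)
```

The exponential growth of this search space is what motivates the LP, ILP, and SDP formulations of the following chapters.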
A pairwise additive energy of the form (2.1) can capture most of the forces that are
commonly taken into account in energetic calculations of protein structure. Empirical
force fields that approximate the forces acting on a system of atoms are often used in
molecular dynamics simulations and other energetic calculations. Two common force fields
used in these simulations and in energy minimization are AMBER (Cornell et al., 1995)
and CHARMM (Brooks et al., 1983; MacKerell et al., 1998), and these force fields are often
set up to include terms that model van der Waals interactions, preferred dihedral angles,
bond angles, and bond lengths. Each of these terms is inherently pairwise. Electrostatic
effects are also important in the dynamics of the molecule and are modeled by distributing
fractional point charges to the atom centers in the molecule (taking into account only
Figure 2.1: The graph formulation. In this hypothetical example, there are four positions in the protein, one with two rotamer possibilities and the others with three rotamer possibilities each.
chemical, but not three-dimensional, structure), and summing the pairwise Coulombic
potential between these point charges. With this common approach, the electrostatic
energy term can be fit into the framework of (2.1). On the other hand, terms to account
for the effects of solvent usually depend on a molecule’s exposed surface area and do not
fit well into this framework. However, many approaches to protein structure prediction
and design consider proteins in a vacuum and do not explicitly model this term.
It is convenient to reformulate the SCP problem in graph-theoretic terms. Let G be an
undirected p-partite graph with node set V1∪· · ·∪Vp , where Vi includes a node u for each
rotamer ir at position i; the Vi’s may have varying sizes. Each node u of Vi is assigned
a weight Euu = E(ir); each pair of nodes u ∈ Vi and v ∈ Vj (i 6= j), corresponding to
rotamers ir and js respectively, is joined by an edge with a weight of Euv = E(irjs).
Zero-weight edges can be thought of as equivalent to the absence of an edge. The global
minimum energy conformation is achieved by picking one node per Vi to minimize the
weight of the induced subgraph.
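This induced-subgraph view can be made concrete with a small brute-force search. The sketch below is illustrative only: the names and toy energies are assumptions of this example, not the thesis's software, and the exhaustive enumeration is exponential in the number of positions.

```python
from itertools import product

def gmec_energy(self_E, pair_E):
    """Exhaustively find the global minimum energy conformation (GMEC).

    self_E[i][r]         -- self-energy E(i_r) of rotamer r at position i
    pair_E[(i, j)][r][s] -- pairwise energy E(i_r j_s) for i < j (absent => 0)
    Returns (best_energy, best_choice) where best_choice[i] is the rotamer
    index chosen at position i.
    """
    p = len(self_E)
    best = (float("inf"), None)
    for choice in product(*(range(len(rs)) for rs in self_E)):
        e = sum(self_E[i][choice[i]] for i in range(p))
        for (i, j), table in pair_E.items():
            e += table[choice[i]][choice[j]]
        if e < best[0]:
            best = (e, choice)
    return best

# Two positions with two rotamers each; the cross term makes (0, 1) optimal.
self_E = [[1.0, 2.0], [0.5, 0.0]]
pair_E = {(0, 1): [[4.0, 0.0], [0.0, 4.0]]}
print(gmec_energy(self_E, pair_E))  # -> (1.0, (0, 1))
```

The same enumeration would take time exponential in p, which is exactly why the DEE, LP, and ILP machinery discussed later is needed.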
When applied to protein structure prediction using either a native backbone or the
backbone of a homologous protein, because the sequence is known, the rotamers in each
set Vi are conformations of a single, particular amino acid. In contrast, in the design
scenario, it is the sequence itself that is the object of the search. This is handled by
putting rotamers from several amino acids into the sets Vi. The designed sequence is
taken to be the sequence of amino acids corresponding to the chosen rotamers in the
low-energy solution, under the assumption that if a sequence fits well onto a backbone, it
will fold into that shape.
2.4 Applications and Successes
The formulation of the SCP problem studied in this thesis has been at the heart of some
dramatic protein design successes. The first major success was the redesign of a 28-residue
zinc finger so that it folds into the native shape without the presence of zinc (Dahiyat and
Mayo, 1997). The Zif268 zinc finger domain consists of an alpha helix and a two-stranded
beta sheet (ββα motif). In the wild type, the beta-sheet and alpha-helix are held together
by interactions between a zinc ion and two cysteine and two histidine residues. A new
sequence was designed that folds into the same ββα motif without requiring coordination
by the zinc atom. Rather than allowing all amino acids at all positions, the redesign
allowed mutations to Ala, Val, Leu, Ile, Phe, Tyr, and Trp in the core and Ala, Ser, Thr,
His, Asp, Asn, Glu, Gln, Lys, and Arg for surface residues. Any of the preceding amino
acids were allowed in boundary positions. The total search space consisted of 1.1 × 10^62
possible choices of rotamers, and finding the optimal rotamer sequence via DEE required
37 CPU hours (on 3.9 GFLOPS processors), while the energy calculations required 53
CPU hours. The same formulation was used to redesign several boundary residues of
the β1 domain of protein G from Streptococcus in order to increase the stability of the
protein (Malakauskas and Mayo, 1998).
More recently, the SCP problem has also been used to redesign protein-protein inter-
actions as well as protein-ligand interactions. Novel receptor proteins have been designed
to bind to trinitrotoluene, L-lactate, and serotonin (Looger et al., 2003). Many ligand
orientations are docked to the protein that is to be redesigned. For each orientation of
the ligand, the ligand-contacting side chains (12–18 residues) are redesigned using DEE.
For each of the ligand-protein pairs, the design process takes about 60 CPU days. In (Kortemme et al., 2004), the E7/Im7 protein-protein interface was redesigned to create several
pairs of mutants that bind preferentially to each other but not with the wild type. Side-
chain repacking is used as a subroutine, applied to successive fixed backbones.
In general, the protein-protein interface is often mediated by interlocking side chains,
and thus side-chain positioning can be a valuable way to augment rigid-body transforma-
tions when trying to dock one protein onto another. See, for example, (Grey et al., 2003;
Wang et al., 2005).
Side-chain positioning finds its way into methods to design completely novel proteins.
The design of a 93-residue α/β protein with a novel fold (dubbed Top7) was reported
in (Kuhlman et al., 2003) using the method first described in (Kuhlman and Baker,
2000). There the method of design alternates between side-chain positioning on a fixed
backbone and continuous relaxation of the backbone given a designed sequence.
Homology modeling is another widely used application for this formulation of the SCP
problem. The popular Scwrl package allows rapid low-resolution predictions of structures
by using homologous proteins as backbones (Bower et al., 1997). When homologous
backbones are not available, SCP has been used with much success in ab initio structure
prediction — that is, determining the complete 3D coordinates of the atoms of a protein
given only its amino acid sequence. For example, the Rosetta program (Simons et al.,
1997) repacks side chains by choosing rotamers from a rotamer library as it performs a
Monte Carlo search for an optimal backbone shape. For speed, simulated annealing is
used to find a good choice of rotamers. See, e.g., (Aloy et al., 2003) for a discussion of
other recent successes in ab initio fold prediction.
2.5 Previous Methods to Solve the SCP Problem
There has been considerable progress in development of both exhaustive and heuristic
techniques for this problem. Within the past dozen years, a series of papers on dead-end
elimination have given rules for throwing out rotamers that cannot possibly be in the
optimal solution (e.g., (Desmet et al., 1992; Desmet et al., 1994; Goldstein, 1994; Lasters
et al., 1995; Gordon and Mayo, 1998; Looger and Hellinga, 2001; Gordon et al., 2002)).
In Chapter 3 we will use the Goldstein variant (Goldstein, 1994) of the dead-end
elimination condition to prune some instances. Following the notation of Section 2.3, this
condition says that a rotamer u ∈ Vi can be thrown out if there is some other rotamer
v ∈ Vi such that
    Euu − Evv + ∑_{j≠i} min_{w∈Vj} (Euw − Evw) > 0 .
The rotamers u are selected in sequence starting with an arbitrary rotamer. Every possible
v is tested to see if the above condition holds, indicating that rotamer u can be eliminated.
This process stops when a pass through the rotamers finds none that can be removed.
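A direct, illustrative implementation of this elimination loop might look as follows; the function and variable names, and the toy instance used below, are assumptions of this sketch rather than the thesis's code.

```python
def goldstein_dee(self_E, pair_E):
    """Iteratively eliminate rotamers via the Goldstein condition:
    rotamer u at position i is removed if some v at i satisfies
      E(u) - E(v) + sum_{j != i} min_w [E(u,w) - E(v,w)] > 0.
    self_E[i][r] gives self-energies; pair_E[(i, j)][r][s] gives pairwise
    energies (symmetric access handled below).  Returns the surviving
    rotamer index sets, one per position.
    """
    p = len(self_E)

    def pe(i, r, j, s):  # pairwise energy, either order of positions
        if (i, j) in pair_E:
            return pair_E[(i, j)][r][s]
        if (j, i) in pair_E:
            return pair_E[(j, i)][s][r]
        return 0.0

    alive = [set(range(len(rs))) for rs in self_E]
    changed = True
    while changed:              # stop after a pass eliminating nothing
        changed = False
        for i in range(p):
            for u in list(alive[i]):
                for v in list(alive[i]):
                    if v == u:
                        continue
                    gap = self_E[i][u] - self_E[i][v]
                    gap += sum(min(pe(i, u, j, w) - pe(i, v, j, w)
                                   for w in alive[j])
                               for j in range(p) if j != i)
                    if gap > 0:  # u can never beat v: eliminate u
                        alive[i].discard(u)
                        changed = True
                        break
    return alive

# Position 0's rotamer 0 and position 1's rotamer 1 are dominated.
print(goldstein_dee([[5.0, 0.0], [0.0, 1.0]],
                    {(0, 1): [[1.0, 1.0], [0.0, 0.0]]}))  # -> [{1}, {0}]
```

Note that the pruning is sound but not complete: for hard instances the loop may terminate with many rotamers per position still alive.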
Special-purpose heuristic search techniques for specific energy functions have been
successfully applied, as in the original Scwrl package (Bower et al., 1997). More gen-
eral search methods such as simulated annealing (e.g., (Lee and Subbiah, 1991; Holm
and Sander, 1991)), A∗ (Leach and Lemon, 1998), Monte Carlo search (e.g., (Xiang and
Honig, 2001)), and mean-field optimization (Lee, 1994) have also been used. Special-
ized graph-theoretic approaches have also been developed (Samudrala and Moult, 1998;
Canutescu et al., 2003; Bahadur et al., 2004). Among these previous methods, the ex-
haustive methods always find the optimal solution but are not efficient (i.e., may require
exponential search), whereas the heuristics are efficient but do not guarantee finding the
optimal solution.
2.6 The Inapproximability of Side-chain Positioning
It has been shown that SCP is NP-complete (Pierce and Winfree, 2002). Accordingly,
it is unlikely that there is any efficient algorithm for solving the problem optimally. For
many NP-complete problems, it is possible to develop approximation algorithms that can
efficiently find a suboptimal solution that is within a provable factor of the optimal one.
For SCP, we show below that it is unlikely that there is any approximation algorithm
with a reasonable performance guarantee.
Theorem 2.6.1 It is NP-complete to approximate the minimum energy of the global min-
imum energy conformation (GMEC) within a factor of cn, where c is a positive constant
and n is the total number of rotamers.
One detail is that complexity results are proved for yes/no decision questions. The
SCP problem is an optimization problem in which we are given an instance of a side-
chain positioning problem, and we seek the best conformation as well as its energy. It
is turned into a yes/no decision problem by providing as additional input an integer k,
and asking whether the GMEC of the instance has energy less than k. Note that this
modified problem is not harder than the original optimization version: if one could solve
the optimization version, one could easily solve this yes/no decision version.
Theorem 2.6.1 does not mean that good algorithms and methods for the side-chain
positioning problem cannot be shown to work well in practice. Indeed, several papers have
presented efficient algorithms that seem to work well in practice (e.g. (Xiang and Honig,
2001; Bower et al., 1997)), or algorithms that are designed to find optimal solutions but
complete quickly for some problems (e.g. (Goldstein, 1994; Lasters et al., 1995; Desmet
et al., 1992; Pierce et al., 2000; Kingsford et al., 2005)).
Proof of Theorem 2.6.1. We will prove this theorem by showing that if the SCP problem
has a good approximation algorithm then we would also have an efficient algorithm for a
problem for which it is likely that none exists, namely a problem involving satisfiability
of boolean formulas.
A 3-CNF (conjunctive normal form) formula is a conjunction of clauses, each one
consisting of the disjunction of three literals (not necessarily distinct). An example of
such a boolean formula is shown at the top of Figure 2.2. In that figure, letters a, b, . . .
represent variables and a, b, . . . represent the negation of those variables. ∨,∧ represent
Figure 2.2: Converting a 3-CNF formula to an SCP problem.
“or” and “and,” respectively. A formula is satisfiable if the variables can be assigned
values true or false such that the whole formula is true. Given a 3-CNF formula, it is
NP-complete to determine whether it is satisfiable.
The PCP theorem (Arora et al., 1998; Arora and Safra, 1998) asserts that, given any
3-CNF formula Φ on n variables, there exists another one, denoted by Ψ, which contains
nO(1) variables and is satisfiable if and only if Φ is satisfiable. Furthermore, if Ψ is not
satisfiable, then it is strongly unsatisfiable, meaning that no truth assignment can satisfy
more than a fraction α of its clauses, for some constant 0 < α < 1. Finally, Ψ can
be derived from Φ in polynomial time. Since 3-CNF satisfiability is NP-complete, it is
then also NP-complete to distinguish between formulas that are satisfiable and those that
are strongly unsatisfiable. (If there is an efficient algorithm that distinguishes between
such formulas, then any 3-CNF formula can be efficiently tested for satisfiability by first
converting it to a strongly unsatisfiable formula and then using this algorithm.)
Given a 3-CNF formula with p clauses that is either satisfiable or strongly unsatisfiable,
we create an SCP problem such that if the formula is satisfiable then the GMEC = 1,
but if the formula is not satisfiable then the GMEC will tell us how many clauses can be
satisfied in the original 3-CNF.
We build a (p + 1)-partite graph G as follows: each clause i corresponds to a set Vi of
4 vertices. In each Vi three vertices are associated with the literals of clause i. These
vertices have no self-weights. Two vertices in Vi and Vj are joined in G if and only if the
literal of one is the negation of the other. Each such edge is assigned weight 3. The 4th
vertex in each Vi is an “extra” vertex with no adjacent edges and vertex weight 1. We
add an additional position with a single node of weight 1. The total number of rotamers,
or nodes, in this SCP instance is n = 4p + 1. This reduction is depicted in Figure 2.2.
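The reduction can be sketched in code. The data layout below is a choice made for this example, not a construction taken from the thesis, and the brute-force check of the resulting GMEC is only feasible for tiny formulas.

```python
from itertools import product

def cnf_to_scp(clauses):
    """Build the SCP instance of the hardness reduction (a sketch).

    clauses: list of 3-tuples of literals; a literal is (var, is_positive).
    Returns (weights, edges): weights[i][k] is the self-weight of vertex k
    in part V_i; edges[(i, k, j, l)] = 3 joins complementary literals.
    Parts 0..p-1 hold three weight-0 literal vertices plus a weight-1
    "extra" vertex; part p is the mandatory single weight-1 vertex.
    """
    p = len(clauses)
    weights = [[0, 0, 0, 1] for _ in range(p)] + [[1]]
    edges = {}
    for i, ci in enumerate(clauses):
        for j, cj in enumerate(clauses):
            if i >= j:
                continue
            for k, (vu, su) in enumerate(ci):
                for l, (vv, sv) in enumerate(cj):
                    if vu == vv and su != sv:  # literal vs. its negation
                        edges[(i, k, j, l)] = 3
    return weights, edges

def gmec(weights, edges):
    """Brute-force GMEC of the constructed instance (tiny inputs only)."""
    return min(
        sum(w[c] for w, c in zip(weights, choice)) +
        sum(e for (i, k, j, l), e in edges.items()
            if choice[i] == k and choice[j] == l)
        for choice in product(*(range(len(w)) for w in weights)))

# (a or b or c) and (not-a or b or c): satisfiable, so the GMEC is 1.
clauses = [(("a", True), ("b", True), ("c", True)),
           (("a", False), ("b", True), ("c", True))]
print(gmec(*cnf_to_scp(clauses)))  # -> 1
```

Choosing the literal b in both clauses yields an independent set, so the only cost paid is the mandatory weight-1 vertex, matching the "satisfiable implies GMEC = 1" direction of the argument.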
If the CNF formula is satisfiable then for each Vi we select a literal set to true as the
GMEC vertex. These p vertices form an independent set (i.e., since one cannot set both
a variable and its negation to true, these vertices have no edges between them) and the
energy of the system is 1.
If the CNF formula is not satisfiable then the GMEC is formed by picking the largest
independent set among the vertices, including at most one vertex per Vi, and completing
the selection for the remaining Vi by choosing the fourth “extra” vertex for each. (Picking
any pair of adjacent vertices would be a mistake since that choice could be locally improved
by choosing an isolated vertex in each position of weight 1.) We can set to true the literals
corresponding to the vertices of the independent set. Therefore, the energy of the GMEC
is p − c + 1, where c is the maximum number of satisfiable clauses in the CNF formula.
Because the CNF formula is strongly unsatisfiable, the minimum number of unsatisfied
clauses p−c is at least (1−α)p, and the optimal GMEC of the corresponding SCP problem
is at least (1 − α)p + 1.
Thus, suppose we had an efficient algorithm that was guaranteed to find a solution
at most ((1 − α)/4) n times the optimal. Then for any satisfiable formula, this algorithm would
find a solution to the corresponding SCP problem of value at most ((1 − α)/4) n = (1 − α)p + (1 − α)/4. Since this is
less than (1 − α)p + 1, the minimum possible value of the GMEC corresponding to an
unsatisfiable formula, this algorithm could distinguish between satisfiable and strongly
unsatisfiable formulas, something the PCP Theorem implies we cannot do. �
Chapter 3
Solving and Analyzing Side-chain
Positioning Problems Using
Linear and Integer Programming
We describe an integer linear programming formulation of the side-chain po-
sitioning problem and show how to use integer programming and a linear
programming heuristic to find optimal solutions to large native-backbone, ho-
mology modeling, and design problems. We also show how to find several
near-optimal solutions, which are often useful in protein design. Finally, we
investigate how the method of evaluating the total energy of a choice of ro-
tamers affects the ease with which solutions are found.
3.1 Introduction
In this chapter, we give an integer linear programming (ILP) formulation of the side-chain
positioning problem described in Chapter 2 and show that it can tackle large problem
sizes and can obtain successive near-optimal solutions. Multiple near-optimal solutions
are especially useful in protein design, where it may be desirable to find several possible
sequences for a particular shape.
An optimal solution to the ILP we describe below gives an optimal solution to the
underlying SCP problem. By relaxing the integrality constraint of our ILP, we get a
polynomial-time linear programming (LP) heuristic, where the solution may or may not
correspond to a solution for the SCP problem. Our overall LP/ILP approach for SCP is as
follows. First, we apply LP to an instance of SCP. If the solution using LP is integral, then
that solution is provably the conformation with the global minimum energy, and we have
found, in polynomial-time, the optimal solution to the SCP instance. On the other hand,
if the solution using LP is fractional, we run the computationally more expensive (i.e.,
no longer polynomial-time) ILP procedure to find the optimal SCP solution, after first
preprocessing with Goldstein dead-end elimination (Goldstein, 1994) (see Section 2.5).
Using our LP/ILP approach, we evaluate instances arising when positioning side chains
on native and homologous backbones, and when choosing side chains for protein design.
We show that LP and ILP are highly effective methods for obtaining optimal solutions
for the SCP problem. The LP/ILP approach is shown to tackle problems of size up to
10^218 easily when packing side chains on native and homologous backbones, and of size
up to 10^201 when redesigning protein cores. As proof of principle, we also obtain multiple
(100) near-optimal solutions for a native-backbone SCP problem of size 10^79.
Although mathematical programming approaches to SCP (Althaus et al., 2000; Eriks-
son et al., 2001; Chazelle et al., 2004) have been suggested previously, they are extensively
tested for the first time here, improved to take advantage of the geometry of the protein,
as well as extended to handle larger problem sizes and to find near-optimal solutions.
In (Althaus et al., 2000), an LP relaxation similar to the one we study below is used in
a branch-and-cut framework to solve three small (50–54 side-chain) docking problems.
In (Eriksson et al., 2001), a different LP formulation is used in a branch-and-bound ap-
proach on a single backbone.
SCP methods are commonly evaluated using two scales. In the predictive scale, one
asks how well the side-chain conformations predicted by the method agree with those
that are found in the actual structure; or, in the case of protein design, whether the newly
designed sequence folds into the desired shape. In the combinatorial scale, one asks how
close the total energy resulting from the predicted side-chain conformations is to the lowest
possible minimum energy using the given rotamer library and energy function. Of course,
the predictive scale measures what we are ultimately interested in (i.e., the quality of the
end result). However, the combinatorial scale is useful for improving search algorithms and
energy functions, and such improvements are necessary to get higher-quality predictions
of side-chain conformations. Theoretical results argue that the SCP problem is difficult
on the combinatorial scale: the combinatorial problem underlying SCP is not just NP-
complete (Pierce and Winfree, 2002), but also inapproximable (Section 2.6). That is, it is
unlikely that there exists a polynomial-time method that can guarantee a good (let alone
optimal) solution to SCP for all instances of the problem. However, these are worst-case
results: they may not hold for the classes of problems and energy functions that occur in
practice. In this chapter, we hope to put these theoretical hardness results into practical
context.
We use our LP formulation to probe the difficulty of SCP instances arising in differ-
ent applications. We label an instance as “easy” if LP finds an integral (i.e., optimal)
solution. In contrast, if LP finds a fractional solution, we use this as evidence that the
instance is more difficult to solve. Our computational experiments on 25 native-backbone
problems and 33 homology-modeling problems show that LP can almost always find an
integral solution when using an energy function based on van der Waals interactions and
a statistical rotamer self-energy term. Similar, even simpler, energy functions have been
the basis for successful homology-modeling packages (Bower et al., 1997). Since SCP is
NP-complete, it is intriguing that integral solutions are found so readily, and in these
cases, since a polynomial-time procedure has provably found optimal solutions, it appears
that the theoretical hardness results do not apply in practice. On the other hand, when
using the same energy function on 25 protein design problems of approximately the same
size, the LP does not often find integral solutions. This suggests that the optimization
problems underlying protein design may be considerably more difficult to solve than those
arising in the native- or homology-modeling settings. We also explore how changing the
energy function affects the problem's hardness. The LP approach sometimes finds optimal
solutions under variant energy functions, but its ability to do so depends on the variant
chosen.
3.1.1 Biological Relevance
While our primary goal is to study the combinatorial nature of SCP, in order to verify
that the energy functions considered are appropriate for predicting protein structures for
native and homologous backbones, we compare side-chain conformations predicted by the
LP/ILP approach with those in the native structures. The solutions found for native
and homologous backbones give structures that are comparable in quality to those found
by other methods using the same rotamer library (Bower et al., 1997; Xiang and Honig,
2001).
3.1.2 Practical Implications
There are several immediate practical consequences of our analysis. First, our work argues
that attempts to improve search methods should be focused on protein design problems,
as they seem to be computationally more difficult to solve than homology modeling prob-
lems. Second, in our experience, even seemingly small differences in problem instances can
have a large impact on the ease with which solutions are obtained. This makes it hard to
compare different published benchmarks of SCP algorithms, as these algorithms are often
tested with differing energy functions and in different settings (e.g., design vs. homology
modeling). To facilitate comparisons and to encourage the use of LP/ILP approaches, we
are making our software for generating the LP/ILP publicly available. Third, our analysis
suggests that the choice of an energy function should depend on two factors: how bio-
logically meaningful it is and how it affects the ease with which optimal or near-optimal
solutions are found. For example, a combinatorially “easy” energy function may be useful
in finding a subset of reasonable predictions that can then be evaluated using the desired
energy function. Finally, and most importantly, our analysis includes the first large-scale
test of an LP/ILP approach, and we demonstrate that such an approach provides an effec-
tive and practical technique for solving the SCP problem for both homology modeling and
protein design applications. Decades of research on LP furnish us with highly-developed
machinery that we can exploit; the advantage of relying on this off-the-shelf technology
is that any subsequent progress in optimizing linear programs will translate into faster
running times for our method. While there are many fast heuristics for side-chain po-
sitioning, in many cases, optimal and successive near-optimal solutions are desired. In
these cases, LP-based approaches provide a general, state-of-the-art methodology.
3.2 Methods
3.2.1 Integer Linear Programming Formulation
We first formulate the side-chain positioning problem as an integer linear program (ILP),
so that a solution to the ILP gives an optimal solution to the SCP problem. The ILP is
based on the graph formulation of SCP discussed in Section 2.3. The vertex set of this
graph is V = V1 ∪ · · · ∪ Vp, and its edge set D = {(u, v) : u ∈ Vi, v ∈ Vj, i ≠ j}. Recall
that the nodes correspond to rotamers, and the edges to interactions between rotamers.
We introduce a {0, 1} decision variable xuu for each node u in V , and a {0, 1} decision
variable xuv for each edge in D. Setting xuu to 1 corresponds to choosing rotamer u, and
similarly setting xuv to 1 corresponds to choosing to “pay” the energy between rotamers u
and v. We constrain our optimization so that only one rotamer is chosen per residue, and
so that we pay the cost for edge {u, v} if and only if rotamers u and v are both chosen.
The following integer program ensures these conditions:
    Minimize  E = ∑_{u∈V} Euu xuu + ∑_{{u,v}∈D} Euv xuv
    subject to
        ∑_{u∈Vj} xuu = 1          for j = 1, . . . , p
        ∑_{u∈Vj} xuv = xvv        for j = 1, . . . , p and v ∈ V \ Vj
        xuu, xuv ∈ {0, 1}                                            (IP1)
The first set of constraints ensures that we choose exactly one rotamer for each residue.
The second set of constraints demands that we set the edge variables xuv to 1 for edges
that are in the subgraph induced by the choice of rotamers: if xvv = 0 then no adjacent
edges can be chosen, and if xvv = 1 then exactly one adjacent edge is chosen for each vertex
set. Though derived independently, this formulation is similar to the version of (Althaus
et al., 2000) (without modifying the energies to be negative) and simpler than that of
(Eriksson et al., 2001). Additionally, on the experimental side, (Klepeis et al., 2003) use
a similar integer programming formulation to design variants of the peptide Compstatin
that are predicted to have improved inhibitory activity in complement pathways. However,
theirs is a slightly different model in which side-chain positions are not explicitly represented.
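To see how the (IP1) constraints act, the following sketch evaluates the objective at the integral point induced by a rotamer choice. The names are hypothetical, and a real solver such as CPLEX would of course search over all feasible points rather than evaluate a single one; the point here is that feasible integral points correspond exactly to rotamer choices, with the edge variables forced to the induced subgraph.

```python
def ip1_objective(choice, parts, E_node, E_edge):
    """Evaluate the (IP1) objective at the integral point induced by
    `choice` (one node id per part), after checking its constraints.
    E_node[u] = E_uu; E_edge[frozenset((u, v))] = E_uv (absent => 0).
    """
    chosen = set(choice)
    # One-rotamer-per-position constraints: exactly one node per part.
    assert all(len(chosen & set(part)) == 1 for part in parts)
    # Edge variables are forced to the induced subgraph: here x_uv is 1
    # exactly when both endpoints are chosen (x_uv = x_uu * x_vv).
    pairs = {frozenset((u, v)) for u in choice for v in choice if u != v}
    obj = sum(E_node[u] for u in chosen)
    obj += sum(E_edge.get(pr, 0.0) for pr in pairs)
    return obj

parts = [[0, 1], [2, 3]]                  # V_1 = {0, 1}, V_2 = {2, 3}
E_node = {0: 1.0, 1: 2.0, 2: 0.5, 3: 0.0}
E_edge = {frozenset((0, 2)): 4.0, frozenset((1, 3)): 4.0}
print(ip1_objective((0, 3), parts, E_node, E_edge))  # -> 1.0
```

Because the objective at each feasible integral point equals the energy of the corresponding rotamer choice, minimizing (IP1) over integral points is the same problem as minimizing the induced-subgraph weight.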
In practice, the ILP given above can have many variables and constraints that do not
affect the optimization, and the system can be pruned dramatically. In particular, if all
the pairwise energies between rotamers in positions i and j are non-positive, then we can
remove all variables xuv with u ∈ Vi and v ∈ Vj such that Euv = 0, and modify the
equality constraints in (IP1) that contain such an xuv by removing those variables and
changing “=” to “≤.” Because we are minimizing and all the energies between i and j
are zero or less, this change does not affect the optimal solution. A frequent special case
has zero energies between all rotamers in two positions; this corresponds to residues that
are too far apart in the structure to have any rotamers that interact with each other. The
more general case involves residues that are far enough apart that only a subset of their
rotamers have interactions with each other.
More formally, for each Vj, let N+(Vj) be the set union of the Vi for which there exists
some v ∈ Vi and u ∈ Vj with Euv > 0. Let D′ be the set of pairs {u, v} with u ∈ Vj such
that either v ∈ N+(Vj), or v ∉ N+(Vj) but Euv < 0. There will be edge variables xuv
only for pairs in D′. Our modified ILP is as follows:
    Minimize  E′ = ∑_{u∈V} Euu xuu + ∑_{{u,v}∈D′} Euv xuv
    subject to
        ∑_{u∈Vj} xuu = 1                    for j = 1, . . . , p
        ∑_{u∈Vj} xuv = xvv                  for j = 1, . . . , p and v ∈ N+(Vj)
        ∑_{u∈Vj : Euv<0} xuv ≤ xvv          for j = 1, . . . , p and v ∉ N+(Vj)
        xuu, xuv ∈ {0, 1}                                            (IP2)
An inequality constraint is not included if the sum on the left-hand side is empty.
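The construction of N+(Vj) and D′ can be sketched as follows; the data layout and names are choices made for this example, not the thesis's implementation.

```python
def pruned_edge_vars(parts, E_edge):
    """Compute the reduced edge-variable set D' used in (IP2).

    n_plus[j] holds the indices of parts with at least one positive
    interaction with V_j.  An edge variable is kept when its endpoints'
    parts interact positively somewhere (all such pairs are needed for the
    equality constraints), or when the edge itself has negative energy.
    `parts` lists node ids per part; E_edge maps frozenset pairs to
    energies (absent pairs have energy 0).
    """
    part_of = {u: j for j, part in enumerate(parts) for u in part}
    n_plus = {j: set() for j in range(len(parts))}
    for pair, e in E_edge.items():
        u, v = tuple(pair)
        if e > 0:
            n_plus[part_of[u]].add(part_of[v])
            n_plus[part_of[v]].add(part_of[u])
    d_prime = set()
    for j, part in enumerate(parts):
        for i in n_plus[j]:
            if i > j:  # add every pair between positively-interacting parts
                for u in part:
                    for v in parts[i]:
                        d_prime.add(frozenset((u, v)))
    for pair, e in E_edge.items():
        if e < 0:      # negative edges are kept regardless
            d_prime.add(pair)
    return n_plus, d_prime

parts = [[0, 1], [2, 3], [4]]
E_edge = {frozenset((0, 2)): 2.0, frozenset((1, 3)): 0.0,
          frozenset((0, 4)): -1.0, frozenset((1, 4)): 0.0}
n_plus, d_prime = pruned_edge_vars(parts, E_edge)
print(len(d_prime))  # -> 5: all four V_1-V_2 pairs plus the negative {0,4}
```

In this toy instance the zero-weight edge {1, 4} is dropped, illustrating how positions without positive interactions shed their edge variables.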
The simple modification of (IP1) given in (IP2) is crucial in practice, providing in some
cases an order of magnitude speed-up. For example, when packing side chains, the principal
component of the energy function is the van der Waals force. Since this force
quickly becomes negative, and asymptotically goes to zero, each residue can have positive
interactions with only a few nearby residues. In other words, if p is the total number of
side chains, and if we treat the maximum number of rotamers in a position as a constant,
then the geometry of the problem can reduce the number of variables from O(p^2) to O(p),
where the constant is related to the radius of influence of the van der Waals force and how
many residues can be packed into a sphere of that radius. As an example, Figure 3.1 shows
two proteins in our test set. In Figures 3.1(a) and (b) lines are drawn between residues
with some non-zero energy. For these we must keep at least some variables. Though at
Figure 3.1: (a-b) Interacting residues in 1igd. (a) those pairs that have some non-zero interaction. (b) those pairs of residues that have some positive interaction. (c-d) The analogue of (a-b) for 1aac.
first glance these proteins look quite dense, on closer inspection it can be
seen that far fewer than the O(p^2) pairs are connected by an edge; for many positions no
edge variables are necessary. The effect is more pronounced for elongated proteins and
can be seen clearly at the “tail” of the protein 1igd in Figure 3.1(a). Figures 3.1(b) and
(d) show pairs of residues that have rotamers with some positive energy between them.
These are the only pairs of positions for which we cannot eliminate any variables using
the technique above. It is apparent visually that far fewer than the O(p2) pairs of residues
have positive interactions.
3.2.2 Multiple Solutions
Sometimes it is desirable to find several optimal and near-optimal solutions. In the present
framework, the LP/ILP can be solved iteratively to find an ensemble of low-energy so-
lutions. At iteration m, all previously discovered solutions are excluded by adding the
constraints

    ∑_{u∈Sk} xuu ≤ p − 1        for k = 1, . . . , m − 1            (3.1)

to (IP2), where Sk contains the optimal set of rotamers found in iteration k. This will
require that the new solution differs from all previous ones in at least one position. It
may be desirable to obtain successive solutions that differ more from each other, and this
can be accomplished by replacing p − 1 in (3.1) by p − q, where 1 < q ≤ p. Further, by
summing over only a subset of the positions, we can force, say, a core residue to assume
a different rotamer.
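The effect of the exclusion constraints (3.1) can be illustrated by enumeration on a toy instance. In this sketch exhaustive search stands in for the ILP solver, so it demonstrates the semantics of the constraints, not the method's efficiency; the names are assumptions of the example.

```python
from itertools import product

def k_best(self_E, pair_E, K, q=1):
    """Enumerate successive low-energy solutions, mimicking the exclusion
    constraints (3.1): each new solution must differ from every previously
    found one in at least q positions.  Brute force stands in for the ILP.
    """
    p = len(self_E)

    def energy(choice):
        e = sum(self_E[i][choice[i]] for i in range(p))
        for (i, j), table in pair_E.items():
            e += table[choice[i]][choice[j]]
        return e

    found = []
    for c in sorted(product(*(range(len(r)) for r in self_E)), key=energy):
        # Constraint (3.1): agree with each earlier solution on <= p - q
        # positions, i.e. differ from it in at least q positions.
        if all(sum(a == b for a, b in zip(c, s)) <= p - q for s in found):
            found.append(c)
            if len(found) == K:
                break
    return [(energy(c), c) for c in found]

print(k_best([[1.0, 2.0], [0.5, 0.0]],
             {(0, 1): [[4.0, 0.0], [0.0, 4.0]]}, K=3))
# -> [(1.0, (0, 1)), (2.5, (1, 0)), (5.5, (0, 0))]
```

Raising q, or restricting the agreement count to a subset of positions, enforces more diverse successive solutions, exactly as described for the ILP version.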
3.2.3 LP/ILP Approach
The ILP formulation is as hard to solve as the original SCP problem. If we relax the
integrality constraints xuv ∈ {0, 1} by replacing them with constraints 0 ≤ xuv ≤ 1 for
u, v ∈ V , we obtain a linear program, which can be solved efficiently. If the optimal
solution to the relaxed linear program is integral—all variables are set to either 0 or 1—
then that solution is also an optimal solution to the ILP and SCP problem. So our LP/ILP
approach to find optimal solutions is as follows (Figure 3.2): solve the problem of interest
using the computationally easier LP formulation. If the solution returned is integral, then
the problem instance was easy to solve, and we have the optimal solution to the original
SCP problem. Otherwise, we run polynomial-time Goldstein DEE (Goldstein, 1994) (See
Section 2.5) until no more rotamers can be eliminated and then solve the more difficult
ILP. None of the problems considered here converge when this simple DEE process is
applied.
We solve the LP relaxation before applying DEE principally so that we can test the
LP in isolation. We would like to know what size problems we can tackle using the LP
alone. Also, we will see below, somewhat surprisingly, that the solution to the LP is often
integral. It is interesting to see how often this happens even before the simplifying DEE
step. Finally, in some cases, the LP on the full graph can be solved more quickly than
running DEE to reduce the graph. This may be due to the highly optimized solver code
that is available. Using these off-the-shelf solvers leverages the many years of engineering
work that has been spent speeding up the code.
The CPLEX package (ILOG CPLEX, 2000) with AMPL (Fourer et al., 2002) was
used to solve the linear and integer programs. All computation was done on a single
Sparc 1200MHz processor.
3.2.4 Integrality Gap of the LP Relaxation
Restricting ourselves to non-negative energy functions for this section (the Scwrl energy
function is one such energy function, for example), the integrality gap is a measure of how
good a lower bound on the optimal solution can be guaranteed by an LP relaxation of a
particular ILP. The integrality gap for a minimization problem is defined (Vazirani, 2001,
pg. 102) as
    sup_I  OPT(I) / OPTf(I) ,                                        (3.2)
Figure 3.2: Flow chart of the LP/ILP approach which we test here.
Figure 3.3: Example for which the integrality ratio of the LP relaxation is very large. The drawn edges have positive weight (say weight 1), while the weight between pairs of nodes between which no edge is drawn is 0.
where I is an instance of the problem, OPT (I) is the true optimum, and OPTf (I) is the
optimal value of the LP relaxation.
Consider the graph of Figure 3.3, where the drawn edges have weight 1 and the remain-
ing edges have weight 0. If this is the underlying graph of an (IP2) formulation, any integral
solution must choose exactly one edge between positions, making OPT = 1. However, by
placing 0.5 on each of the nodes and 0.5 on the undrawn edges between positions, the LP
relaxation of (IP2) can achieve a value of 0. The integrality gap is therefore infinite. (If a
zero-energy solution is troublesome, weight the undrawn edges by some small ε.)
The small example of Figure 3.3 is not unrealistic in the SCP setting. A similar set of
weights can be realized by, say, a planar configuration of three residues, where one set of
rotamers forms an overlapping “tepee” on one side of the plane, while the other set forms
a mirror tepee on the other side. Therefore, the LP relaxation could perform very badly.
Perhaps surprisingly, we present experimental evidence below that, on the contrary, this
formulation is often very tight.
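An instance in the spirit of Figure 3.3 can be checked directly. The graph below is an analogous construction of this example (the exact figure is not reproduced here): three positions with two rotamers each, weight-1 "aligned" edges and zero-weight "crossing" edges, so every integral choice pays at least 1 while a half-integral point routes entirely through zero-weight edges.

```python
from itertools import product

# Three positions, two rotamers each (node ids 0..5); between each pair of
# positions the aligned edges (same rotamer index) carry weight 1 and the
# crossing edges carry weight 0.
parts = [(0, 1), (2, 3), (4, 5)]
weight = {}
for a, b in ((0, 1), (0, 2), (1, 2)):
    for r in (0, 1):
        weight[frozenset((parts[a][r], parts[b][r]))] = 1.0

def integral_opt():
    """Best integral solution: by pigeonhole, some pair of positions must
    pick the same rotamer index, so every choice pays at least one edge."""
    best = float("inf")
    for choice in product(*parts):
        e = sum(weight.get(frozenset((u, v)), 0.0)
                for u in choice for v in choice if u < v)
        best = min(best, e)
    return best

# Fractional point: 0.5 on every node, 0.5 on every crossing (weight-0)
# edge; each node's edges into any other part then sum to its own value,
# so the point is feasible for the LP relaxation.
x_node = {u: 0.5 for pt in parts for u in pt}
x_edge = {frozenset((parts[a][r], parts[b][1 - r])): 0.5
          for a, b in ((0, 1), (0, 2), (1, 2)) for r in (0, 1)}
lp_value = sum(weight.get(e, 0.0) * f for e, f in x_edge.items())
print(integral_opt(), lp_value)  # -> 1.0 0.0
```

Since the integral optimum is 1 while a feasible fractional point achieves 0, the ratio (3.2) is unbounded for this family of instances.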
Prot Len Var Rot Size Time Prot Len Var Rot Size Time
1aac† 105 85 1523 79 14 1mfm 153 118 2134 112 231aho† 64 54 981 49 7 1plc 99 82 1156 73 81b9o 123 112 2056 112 25 1qj4 256 221 4080 218 1001c5e 95 71 1108 61 8 1qq4 198 143 2045 121 291c9o 66 53 1130 56 9 1qtn 152 134 2516 132 331cc7 72 66 1396 66 17 1qu9 126 100 1817 94 201cex† 197 146 2556 136 36 1rcf 169 142 2396 139 431cku 85 60 1093 58 10 1vfy 67 63 939 56 71ctj† 89 61 1021 62 6 2pth 193 151 3077 151 681cz9 139 111 2332 111 56 3lzt 129 105 2074 102 281czp 98 83 1170 75 10 5p21 166 144 2874 146 781d4t 104 89 1636 84 19 7rsa 124 109 1958 100 261igd† 61 50 926 47 6
Table 3.1: The native backbone problem sizes. For each protein, Prot gives its PDB identifier, Len gives its length, Var indicates how many of its side chains have more than one possible rotamer, and Rot gives the total number of rotamers considered. Size gives the log10 of the search space size. Time gives the number of seconds for the solve phase of CPLEX.
†These proteins were used to determine the weight of the statistical potential in the basic energy function (see text).
Template/   Seq  Var   Rot  Size      Template/   Seq  Var   Rot  Size
target       id  len                  target       id  len
1aac/1id2    62   86  1608    82      1igd/1mi0    78   49   723    44
1aac/2b3i    29   87  1242    73      1mfm/1b4l    54  117  1978   105
1aho/1dq7    50   53   719    44      1mfm/1cob    80  119  1980   108
1b9o/1f6r    75  114  1999   111      1mfm/1xso    65  114  1826   104
1c9o/1csp    82   53  1076    56      1plc/1byo    71   79  1131    70
1c9o/1g6p    61   54  1409    60      1plc/1jxf    44   77  1093    64
1c9o/1mjc    57   52   862    48      1qj4/1e89    75  220  4154   218
1cc7/1fe4    37   62  1222    60      1qq4/1hpg    34  139  1514   105
1cku/1eyt    87   61  1095    58      1qu9/1j7h    75  101  1885    97
1cku/3hip    73   65  1079    59      1qu9/1qd9    49  104  1749    97
1ctj/1c6r    79   64  1030    62      1rcf/1czh    69  140  2151   135
1ctj/1cyj    64   66  1291    69      1vfy/1hyj    40   57  1060    53
1ctj/1f1f    46   64  1219    62      3lzt/2mef    59  105  2320   108
1czp/1doy    73   81   990    69      5p21/1kao    49  147  2977   148
1czp/4fxc    79   81   961    70      7rsa/1bsr    81  110  2242   104
1d4t/1luk    31   93  1877    91      7rsa/1rra    67  112  2111   104
1igd/1fcl    75   51   899    48
Table 3.2: Template gives the PDB identifier for the protein used as the template backbone, and Target gives the PDB identifier of the protein for which the structure is to be predicted. Seq ID gives percentage identity between template and target protein sequences, Var Len gives the number of side chains that are varied, and Rot gives the total number of rotamers considered. Size is the log10 of the search space size.
3.2.5 Data Set
The primary protein set (Table 3.1) consists of 25 proteins taken from (Xiang and Honig,
2001). The proteins vary in size, ranging from 50 to 221 residues with more than one
possible rotamer. As in (Xiang and Honig, 2001), only the first chain in the PDB file is
used for experiments.
For homology modeling, 33 homologs to the proteins of Table 3.1 are also used. These
protein pairs share between 29% and 87% sequence identity (Table 3.2). Whereas for some
proteins there are other more similar protein sequences present in the PDB, for evaluation
purposes, the chosen homologs give a wider range of sequence identity. ClustalW (Thomp-
son et al., 1994), with default settings, was used to align the protein pairs. For each pair,
Template/   Time      Template/   Time      Template/   Time
target      (ILP)     target      (ILP)     target      (ILP)
1aac/1id2    14       1ctj/1cyj    10       1plc/1jxf     8
1aac/2b3i    13       1ctj/1f1f     9       1qj4/1e89   120
1aho/1dq7     4       1czp/1doy     8       1qq4/1hpg    14
1b9o/1f6r    24       1czp/4fxc     6       1qu9/1j7h    27
1c9o/1csp     7       1d4t/1luk    26 (1)   1qu9/1qd9    19 (2)
1c9o/1g6p    13       1igd/1fcl     7       1rcf/1czh    38
1c9o/1mjc     3       1igd/1mi0     3       1vfy/1hyj     9
1cc7/1fe4    13       1mfm/1b4l    23       3lzt/2mef    52
1cku/1eyt    10       1mfm/1cob    19       5p21/1kao    71
1cku/3hip    12       1mfm/1xso    16       7rsa/1bsr    41
1ctj/1c6r     7       1plc/1byo     7       7rsa/1rra    42
Table 3.3: Solve times (in CPU seconds) for the LP relaxation for the 33 homology modeling problems. For the two problems which did not return integral solutions, the IP solve time is given in parentheses.
the protein in the original data set was taken as the template backbone, and its sequence
homolog was taken as the target protein to be predicted. If the ith residue of the target
sequence is aligned to the jth residue of the template sequence, then rotamers corre-
sponding to the ith residue were considered at the jth position in the template backbone.
Any gaps in the target sequence were handled by modeling the side chains of the native
residues of the template. Any gaps in the template sequence caused the corresponding
residues in the target sequence to be left out of the model.
3.2.6 Rotamer Library and Structure Manipulation
We used Dunbrack’s backbone-dependent rotamer library (Dunbrack Jr and Karplus,
1993). For each 10◦ range of φ, ψ backbone angles, this library has 320 rotamers, with
the largest number of rotamers, 81, belonging to arginine and lysine. Figure 3.4 shows
the distribution of rotamers for each (φ,ψ) bin. Backbones were held fixed, and missing
hydrogens were added using the BALL C++ library (Kohlbacher and Lenhof, 2000), which
was also used to manipulate rotamers. All non-protein atoms were ignored. Each choice
Figure 3.4: The distribution of rotamers in each (φ,ψ) bin of Dunbrack’s backbone-dependent rotamer library. Each bin has 320 rotamers, with longer residues getting more rotamers.
of rotamers was converted to a 3-dimensional structure using the given backbone atoms
and the stock side chains from (Kohlbacher and Lenhof, 2000). For all computations, the
backbone, alanines and glycines were held fixed.
3.2.7 Energy Function
All the energy functions considered consist of a rotamer self-energy term and a pairwise
rotamer interaction term. For the basic energy function, used for all computations unless
otherwise specified, pairwise rotamer energies are computed using van der Waals interac-
tions, and self-energies are computed using both statistical potentials and van der Waals
interactions. The basic energy function is similar to that of the Scwrl package (Bower
et al., 1997), though we use a more realistic van der Waals term.
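In code, this self-plus-pairwise decomposition is just a sum over the chosen rotamers and over interacting pairs. The sketch below shows the form of the objective shared by all the energy functions in this section; the energy tables are hypothetical values for illustration, not the actual AMBER-derived terms.

```python
def total_energy(choice, self_E, pair_E):
    """Energy of one rotamer assignment: self terms plus pairwise terms.

    choice  -- dict: position -> chosen rotamer index
    self_E  -- dict: position -> list of rotamer self-energies
    pair_E  -- dict: (i, j) with i < j -> 2-D table of rotamer pair energies
    """
    e = sum(self_E[i][r] for i, r in choice.items())
    e += sum(pair_E[(i, j)][choice[i]][choice[j]] for (i, j) in pair_E)
    return e

# Hypothetical two-position instance with two rotamers each.
self_E = {0: [0.3, 1.2], 1: [0.0, 0.5]}
pair_E = {(0, 1): [[2.0, 0.1], [0.4, 3.0]]}
print(total_energy({0: 0, 1: 1}, self_E, pair_E))  # 0.3 + 0.5 + 0.1
```

Minimizing this function over all rotamer choices is exactly the SCP objective that the LP/ILP formulations encode.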
Van der Waals interactions between rotamers. The pairwise van der Waals inter-
action energy between rotamers u and v is the sum of the van der Waals interactions
between the side-chain atoms of u and v. We use the 6–12 Lennard-Jones formulation of
the van der Waals force (Figure 3.5). The parameters used in the van der Waals force are
those of AMBER96 except the hydrogen radii are reduced by 50% to account for their
uncertain position. As in AMBER96, for atoms separated by 3 bonds (1–4 pairs), van
Figure 3.5: A schematic of the 6–12 Lennard-Jones approximation to the van der Waals force. The van der Waals force is asymptotically infinite as two atoms approach each other, and asymptotically zero as the atoms are separated. There is a small range of distances where the atoms are attracted to each other with a small negative energy.
der Waals interaction parameters are reduced by half, and there is no van der Waals con-
tribution between atoms separated by fewer than 3 bonds. Each atom-atom interaction
is capped at 100 kcal/mol. As an optimization, the van der Waals interactions are taken
to be zero at distances longer than 10 Å, and residues are assumed not to interact if their
Cβ atoms are farther apart than 8.0 Å plus the longest possible extensions of their side
chains. Any value less than 10^−6 is considered to be 0. These approximations generally
have insignificant effects on the calculated energies.
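A minimal sketch of the capped and truncated 6–12 term described above; the well depth and minimum-energy distance used in the example calls are illustrative placeholders, not the AMBER96 parameters.

```python
def lj_energy(r, eps, r_min, cap=100.0, cutoff=10.0, zero_tol=1e-6):
    """6-12 Lennard-Jones energy (kcal/mol) with the truncations used here.

    r     -- interatomic distance in angstroms
    eps   -- well depth; r_min -- distance of the energy minimum
    """
    if r > cutoff:                    # long-range truncation at 10 angstroms
        return 0.0
    x = (r_min / r) ** 6
    e = eps * (x * x - 2.0 * x)       # minimum value is -eps at r == r_min
    e = min(e, cap)                   # cap steric clashes at 100 kcal/mol
    return 0.0 if abs(e) < zero_tol else e

print(lj_energy(3.4, eps=0.1, r_min=3.4))   # -0.1 at the minimum
print(lj_energy(12.0, eps=0.1, r_min=3.4))  # 0.0 beyond the cutoff
print(lj_energy(1.0, eps=0.1, r_min=3.4))   # 100.0 (capped clash)
```

The cap keeps a single bad atom pair from dominating the objective, which matters for the LP since one enormous coefficient can swamp all other terms.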
Van der Waals interactions in self-energy terms. For each rotamer, the van der
Waals energy is computed (as described above) between each of its atoms and all the fixed
backbone atoms in the system except those corresponding to the current residue and the
residues on either side of it. The self-energy also includes the van der Waals interactions
with atoms in fixed residues.
Statistical self-energies. For each amino acid i in a particular backbone setting, let
piu be the fraction of times amino acid i is found in rotamer u, and pi0 be the fraction
Figure 3.6: The average rmsd over 5 proteins for various values of C.
of times amino acid i is in its most common rotamer. These values are obtained from
the rotamer library (Dunbrack Jr and Karplus, 1993). As in (Bower et al., 1997), the
statistical self-energy term for a particular rotamer u is given by − ln(piu/pi0), so that the
more common a rotamer, the lower the energy assigned to it.
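The −ln(piu/pi0) term can be sketched directly; the rotamer frequencies below are made up for illustration and are not taken from the Dunbrack library.

```python
import math

def statistical_self_energy(p_iu, p_i0):
    """Scwrl-style statistical term: zero for the most common rotamer,
    increasingly positive for rarer ones."""
    return -math.log(p_iu / p_i0)

# Hypothetical rotamer frequencies for one amino acid at fixed phi/psi,
# ordered so that p[0] is the most common rotamer.
p = [0.55, 0.30, 0.15]
energies = [statistical_self_energy(pu, p[0]) for pu in p]
print(energies)   # the most common rotamer gets energy 0
```

Because the term is a log-ratio against the most common rotamer, it is always nonnegative and grows smoothly as a rotamer becomes rarer.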
Combining the statistical self-energies with the van der Waals interactions. In
summing up the total energy of the system, the statistical self-energy term is weighted
by a constant C that is the relative weighting of it in comparison to the physical van der
Waals term. The choice of C can have a large effect on the accuracy of the solution and
the ease with which it can be found. C can be thought of as the inertia for a residue to
remain in a highly-favored side-chain conformation. To calibrate C, five proteins of varied
structure (1aac, 1aho, 1cex, 1ctj and 1igd) were selected from the test set. The LP/ILP
algorithm was applied to each for values of C ranging between 0.5 and 100. Figure 3.6
shows the average side-chain root mean squared deviation (rmsd) over the five proteins
for various values of C. It is best to set C to the smallest value that works well so as to
use as much information about the specific fold as possible. C = 10 was taken to be a
good choice.
3.2.8 Evaluating Predicted Structures
For each protein, we compare the predicted side-chain conformations with those found in
its crystal structure. We use two measures of accuracy. First, we compute the percent-
age of χ1 side-chain dihedral angles predicted within 20 degrees of the native structure,
and the percentage of both χ1 and χ2 side-chain dihedral angles predicted within 20 de-
grees of native. Second, we compute the rmsd between the predicted structure and the
crystal structure. When positioning side chains on native backbones, rmsd is computed
between corresponding side-chain atoms only. When positioning side chains of a tar-
get protein on a homologous backbone, the native backbone of the target protein and the
homologous backbone are first fit together using all the non-hydrogen atoms in both struc-
tures (McLachlan, 1982; Martin, 2001), and then rmsd is computed over the side-chain
atoms.
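Because dihedral angles are periodic, the χ1 accuracy must be measured on the circle rather than by naive subtraction. A minimal sketch of the within-20° metric (the tolerance follows the text; the sample angles are invented):

```python
def angle_diff(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def fraction_within(predicted, native, tol=20.0):
    """Fraction of dihedral angles predicted within tol degrees of native."""
    hits = sum(1 for p, n in zip(predicted, native) if angle_diff(p, n) <= tol)
    return hits / len(predicted)

print(angle_diff(-175.0, 175.0))  # 10.0, not 350.0
print(fraction_within([62.0, 180.0, -60.0], [55.0, -175.0, 65.0]))  # 2/3
```

The wrap-around in angle_diff is what makes a prediction of −175° count as close to a native value of 175°.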
Because performance can vary greatly depending on the location of the residue in the
protein, in addition to evaluating predictions over all residues, we report performance
over only core residues, defined to be those that have less than 10% of their possible
surface area exposed in the crystal structure. For each residue, exposed surface area is
determined as a percentage of the surface area of the residue in isolation using the Surfv
package (Nicholls et al., 1991).
3.3 Computational Results
We test the hardness of SCP instances and evaluate the LP/ILP approach on problems
resulting from three applications: predicting the conformations of a protein’s side chains
on its native backbone, predicting the structure of a protein using the backbone of a
homologous sequence as a template, and designing a protein sequence for a given backbone.
Figure 3.7: Running times (in CPU seconds) of the 25 native backbone problems.
3.3.1 Native Backbone Tests
For each of the 25 proteins in Table 3.1, we ran the LP/ILP approach to predict side-
chain conformations on native backbones. We used the native protein sequence from the
PDB file and allowed each residue to assume all the rotamers listed in the library for the
given amino acid and φ, ψ backbone angles. This resulted in search spaces with up to
10^218 possibilities. Using the basic energy function described in the previous section, all
problems returned optimal integral solutions using LP, and it was not necessary to use the
more computationally expensive ILP formulation. The total CPU time for solving the 25
LPs using formulation (IP2) was under 12 minutes; this is approximately 13 times faster
than when using the formulation (IP1). Figure 3.7 shows the number of CPU seconds
spent in the solve phase of CPLEX on each of the native backbone problems.
To ensure that the energy function produces meaningful structures, we compare the
side-chain conformations predicted by the LP with the side-chain conformations in the
crystal structure (Table 3.4). Over all the residues, we find that 80% of χ1 angles and
51% of the χ1 and χ2 angles are predicted within 20◦ of native. For just core residues,
our approach leads to 87% of χ1 angles and 62% of the χ1 and χ2 angles predicted
correctly. Additionally, our method obtains an average rmsd per protein of 1.553 Å. This
is a reasonable level of accuracy, as it is comparable with values obtained when running
                          Core residues   All residues
(a) LP/ILP χ1 / χ1+2      87% / 62%       80% / 51%
(b) Scwrl χ1 / χ1+2       88% / 60%       80% / 49%
(c) LP/ILP rmsd           1.079 Å         1.553 Å
(d) Scwrl rmsd            1.170 Å         1.649 Å
(e) Best rmsd             0.575 Å         0.640 Å
(f) Simple method rmsd    1.796 Å         1.954 Å
Table 3.4: Prediction of side-chain conformations on native backbones, with a comparison of the LP/ILP prediction with those of other methods and the crystal structure. All values are averaged over the 25 proteins of Table 3.1. (a) The percentage of residues over all proteins for which the LP/ILP predicted conformation has the χ1 and χ1+2 dihedral angles within 20 degrees of the native structure; (b) these values for Scwrl; (c) the rmsd of the predicted side-chain conformations from those of the native side chains using the LP/ILP method; (d) these values for Scwrl; (e) the best rmsd obtainable with the rotamer library used in these experiments; (f) the rmsd obtained by choosing the most common rotamer in the library.
the widely-used Scwrl package (version 2.9) (Bower et al., 1997) (see Table 3.4) and with
what is reported in (Xiang and Honig, 2001) when using the same rotamer library (on a
slightly different test set).
The majority of side chains have their χ1 angle predicted correctly within 20◦.
Figure 3.8 shows the fraction of χ1 angles predicted correctly within various tolerances.
The jump at about 120◦ is expected because many side chains prefer to be in a conforma-
tion in which their χ1 torsion angle (the rotation of the Cα–Cβ bond) is in one of three
evenly spaced values at 0◦, 120◦, and 240◦. Thus, if the native residue is near a rotameric
state, but the optimization chooses the wrong one, the correct one will be approximately
120◦ away.
As expected, prediction is more successful for core residues (Figure 3.9). This is
both because we do not model the effect of solvent, which plays an important role in
positioning of the side chains on the surface, and because the side chains in the core are
more constrained, resulting in fewer physically realizable packings. The positions of the
surface side chains are uncertain even in the crystal structures to which we are comparing,
Figure 3.8: The distribution of χ1 angle errors for the native test set.
Figure 3.9: Performance compared with the fraction of the residue exposed for the native test set. The fraction correct corresponds to χ1 angles predicted within 20◦ of the native conformations. Fraction exposed was computed with Surfv.
Figure 3.10: (Top) The number of core residues with their χ1 angles predicted correctly (within 20◦) and incorrectly (> 20◦) in the native backbone tests, broken down by amino acid. (Bottom) The same data as in the top graph, expressed as the percentage of core residues for which the χ1 angle was predicted correctly (within 20◦).
meaning that our standard of correctness is uncertain, further complicating the prediction
of the positions of these side chains.
Not all amino acids are equally easy to predict (see Figure 3.10). Cysteines are espe-
cially difficult. This is not surprising, as pairs of cysteines often interact via disulfide
bonds, which are not modeled at all by our packing energy function. Prolines are also difficult
to predict, again likely due to their unique structure and interaction with the backbone.
In our tests, serines were also more likely to be incorrectly positioned than other amino
acids, an effect also apparent in (Xu, 2005), where serines were predicted with the lowest
accuracy using either their tree decomposition method or Scwrl 3.0.
For comparison, Table 3.4 also shows the best rmsd obtainable with Dunbrack’s library
if the rotamer with the smallest rmsd with the crystal structure is chosen. This is a lower
bound on the rmsd that any algorithm can achieve using the given library. The results
are also shown for the simple method that chooses the rotamer that is most common for
each side-chain. This method uses no information about the global fold. Together, the
values for “best” and “simple” define the range any prediction algorithm should fall into.
Table 3.4 shows that though the use of a van der Waals energy function beats the
simple prediction scheme, it does not approach the theoretical limit of this rotamer library.
However, the percentage of torsional angles that are correct and the rmsd do not tell
the whole story. Over the 25 proteins, the simple scheme produces 782 clashes (the
center of one atom placed inside the van der Waals radius of another). Even the best
rotamers frequently clash: by choosing the rotamer with the lowest rmsd to the native
side chain, we introduce 56 clashes, though many (but not all) clashing pairs are between
cysteines that form disulfide bonds and are thus not sufficiently close to a rotamer state
in the library. The LP/ILP approach avoids the clashes almost entirely, allowing only two.
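Clash counting in this sense can be sketched as follows; the coordinates and radii are toy values, and we read "clash" as one atom's center lying within the other atom's van der Waals radius, as in the text.

```python
import math

def count_clashes(atoms):
    """atoms: list of (x, y, z, vdw_radius) tuples. Counts pairs whose
    centers are closer than the larger of the two van der Waals radii,
    i.e. one center lies inside the other atom's radius."""
    clashes = 0
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            xi, yi, zi, ri = atoms[i]
            xj, yj, zj, rj = atoms[j]
            d = math.dist((xi, yi, zi), (xj, yj, zj))
            if d < max(ri, rj):
                clashes += 1
    return clashes

# Two carbon-like atoms 1.0 apart clash; 4.0 apart they do not.
print(count_clashes([(0, 0, 0, 1.7), (1.0, 0, 0, 1.7)]))  # 1
print(count_clashes([(0, 0, 0, 1.7), (4.0, 0, 0, 1.7)]))  # 0
```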
Overall, our testing on native backbones shows that when using a simplified energy
function, LP can readily obtain optimal solutions with respect to the energy function, and
that these optimal solutions correspond to predicted structures of quality similar to that
of other popular approaches.
3.3.2 Homology Modeling
We next explore the combinatorial problems associated with homology modeling. The 33
pairs of homologous proteins considered, their percent sequence identity, and the rmsd
between their backbones are shown in Table 3.2.
We solved the resulting LP formulations for all 33 problems; this took under 12 min-
utes of CPU time. The LP found optimal solutions for 31 of the 33 pairs. For only two
template/target pairs, 1d4t/1luk and 1qu9/1qd9, the optimal LP solutions were not in-
tegral. For these two problems, the optimal integral solution was found using dead-end
                         Core residues   All residues
(a) LP/ILP rmsd          2.177 Å         3.230 Å
(b) Scwrl rmsd           2.137 Å         3.260 Å
(c) Backbone rmsd        1.385 Å         1.978 Å
(d) Simple method rmsd   2.504 Å         3.425 Å
Table 3.5: Prediction of side-chain conformations using homology modeling, with a comparison of the LP/ILP prediction with those of other methods and the crystal structure. All values are averaged over the 33 problems of Table 3.2. (a) The rmsd between just side-chain atoms when comparing the LP/ILP predicted structure with the crystal structure; (b) this value when comparing the Scwrl predictions with the native structure; (c) the rmsd between template and target structures when only considering backbone atoms; (d) the side-chain rmsd when choosing the most common rotamer at each position.
elimination and the integer programming algorithm of CPLEX. A good measure of how
close the LP relaxation objective is to the optimal solution is the relative gap, defined as:
relative gap = 100 |OPT − lp| / |OPT|                    (3.3)
where OPT is the energy value of the optimal integral solution and lp is the optimal
objective for the LP relaxation. The relative gaps for both 1d4t/1luk and 1qu9/1qd9
were fairly small (0.207 and 15.260, respectively), and the total time for solving these two
integer linear programs was less than one minute.
In order to show that the basic energy function is useful in the homology modeling
scenario, we report the accuracies of our predicted structures (Table 3.5). We computed
the side-chain rmsd between the target structures and predicted structures, as well as
the side-chain rmsd obtained by the Scwrl rotamer choices. The average side-chain rmsd
obtained by the LP/ILP approach with the basic energy function is 3.230 Å, which is
competitive with Scwrl’s performance of 3.260 Å when run on the same test set. We also
show the results for blindly choosing the most common rotamer in each position.
For these tests, we did not optimize many important aspects of homology modeling,
such as choosing the homolog with the most similar sequence or hand fixing alignments, so
the results should not be taken to be the best possible for any of the methods. However,
the use of a simplified energy function results in predicted structures that are biologically
reasonable. Additionally, optimal solutions with respect to this energy function are easily
found using the LP/ILP approach.
3.3.3 Protein Design
We considered the problem of designing novel sequences that fold into known backbones.
We partitioned the amino acids into the following classes: AVILMF / HKR / DE / TQNS
/ WY / P / C / G. For each of the 25 proteins in our native test set (Table 3.1), we fixed
the surface residues and the native backbone and allowed the core residues to assume any
rotamer of any amino acid in the same class as the native residue. We focused on core
residues since the basic energy function optimizes primarily van der Waals interactions.
The sizes of the resulting problems are shown in Table 3.6.
When applying LP to the resulting problems with the basic energy function, only 6 out
of 25 problems had integral solutions. Thus, from the perspective of this LP, the design
problem is more difficult than fitting side chains on native and homologous backbones.
CPU time for solving the 25 LP problems was approximately 20 hours, with one protein
(1qj4) taking about 10.5 of those hours.
To obtain optimal solutions for the 19 proteins with non-integral solutions, we apply
Goldstein dead-end elimination and then run the ILP solver of CPLEX. When solving the
ILP, CPLEX, in addition to using many other heuristics, solves several linear programs
that are subproblems of the ILP (these subproblems are referred to as the branch-and-
bound nodes). The number of such subproblems is a very rough indication of the compu-
tational effort expended by CPLEX. The number used for the design problems is shown
in the N column of Table 3.6. For several of the problems, many branch-and-bound nodes
were needed. CPLEX was able to find the optimal integral solutions to all the problems
in approximately 138 hours. Nearly all of that time (125 hours) was spent on the largest
Prot   Var len   Rot  Size  Time (ILP)      Rel gap      N
1aac     38     2153    62  3.3e2 (1.3e2)    2.630       4
1aho     18      668    22  4.4              Integral
1b9o     48     1842    69  2.4e2 (9.4)      1.099       0
1c5e     25     1369    42  5.8e1            Integral
1c9o     14      757    24  9.1e1 (4.6e1)    3.936      34
1cc7     18      866    29  9.5e1 (2.4)      0.272       0
1cex     78     3926   126  2.6e3 (7.0e2)    0.913      30
1cku     22      897    31  8.8              Integral
1ctj     24     1262    40  2.8e1            Integral
1cz9     53     2664    87  1.2e3 (3.2e2)    0.702      27
1czp     30     1475    47  4.4e2 (1.4e2)    1.202      39
1d4t     32     1691    52  1.8e2 (8.9e1)    1.039      33
1igd     11      552    18  3.4              Integral
1mfm     46     3215    80  6.5e3 (5.4e3)    3.234     233
1plc     33     1691    54  4.7e2 (1.3e2)    3.991       8
1qj4    124     6655   201  3.8e4 (4.5e5)    2.677    7293
1qq4     72     3500   115  1.5e3 (6.9e2)    4.272      38
1qtn     49     2181    74  2.6e2 (7.0e1)    0.558       8
1qu9     43     2057    70  2.3e2 (6.4)      0.162       2
1rcf     65     3189   105  2.7e3 (9.6e1)    0.053       0
1vfy     15      665    20  4                Integral
2pth     76     4395   127  1.1e4 (2.4e4)    2.115    1623
3lzt     48     1940    71  4.2e2 (3.9e2)    3.445      45
5p21     70     3624   114  4.1e3 (1.3e4)    2.259    1453
7rsa     46     1993    66  5.7e2 (1.4e1)    0.120       0
Table 3.6: Var Len gives the number of core positions that were allowed to vary, and Rot gives the total number of rotamers considered. Size is the log10 of the search space size. Time is the number of seconds CPLEX spent solving the LP; the time for solving the ILP is given in parentheses. Rel Gap gives the relative gap, as defined in Equation (3.3), and is a measure of how far the energy of the solution of the LP is from that of the optimal rotamer choice. N gives the number of subproblems CPLEX considered in finding the optimal choice of rotamers.
Figure 3.11: The time spent in the CPLEX solve phase (in CPU seconds) of the design problems considered here. (a) The time spent on the initial LP relaxation of the full problem. (b) The time spent on solving the IP to optimality for the problems after being reduced by Goldstein DEE. The y-axis is in log scale.
problem, 1qj4; the other 18 problems took only 13 hours of computation. The running
times for each of the design problems are shown in Figure 3.11.
While we do not have a way to convert a fractional solution to a choice of rotamers that
always ensures a low energy choice, empirically, the LP relaxation does give us an estimate
of the energy of the optimal solution. The integrality ratio of Section 3.2.4 suggests that
the energy of the LP may be very low, while the energy of the optimal solution may
remain high. How good an estimate of the optimal energy does the LP relaxation give us
in practice? The relative gaps, defined as in Equation (3.3), for these 25 design problems
are all less than 5% (Figure 3.12). One way to construe the design tests above is that
they are tests of fitting a sequence pattern onto a structure. The pattern is defined by the
groups of amino acids allowed at each position. The small integrality gaps suggest that
in practice, the LP may give a good estimate of how well a sequence pattern can be fit
onto a backbone. While in traditional design applications we require a choice of amino
acid for each designed position, there may be database search applications that require
Figure 3.12: The relative gaps (Eqn. 3.3) for the 25 design problems. The gaps are all less than 5%.
only an estimate of how well a sequence pattern can be fit onto a backbone. For example,
when redesigning an active site of a protein, there may be constraints on the sequence
to preserve functionality. The protein designer may wish to search a database of solved
structures for backbone shapes onto which a sequence matching a pattern can be fit with
low energy. Here, the LP may provide a fast estimate of the fit between sequence pattern
and backbone shape.
The best way to test a designed sequence is to make the protein and confirm the
predicted structure (e.g., (Dahiyat and Mayo, 1997; Harbury et al., 1998; Malakauskas
and Mayo, 1998; Looger et al., 2003; Klepeis et al., 2003; Lilien et al., 2004)); this is
beyond the scope of this thesis. However, the basic energy function is reasonable for
designing protein cores as it focuses on van der Waals interactions, and the use of other
energy functions is not likely to make the problem easier (see below). Thus, while the
LP/ILP approach found optimal solutions for these protein design problems, our analysis
shows that protein design problems are likely to be considerably more difficult to solve
than homology modeling problems.
3.3.4 Other Energy Functions
We also investigated how changing the energy function affects the ability of LP to find
optimal solutions. For five proteins from Table 3.1 (1c9o, 1czp, 1d4t, 1qtn, 1vfy), we fit
side chains on their native backbones using two additional energy function variants.
In the first variant, the self-energies include the van der Waals interactions with the
backbone (as before), but the statistical term is replaced by a torsion term as well as intra-
side-chain van der Waals interactions. These self-energy terms are meant to measure the
local favorability of a side-chain conformation. The pairwise interaction energies between
rotamers consist of only van der Waals interactions.
The second variant is the same as the first, except that the self-energies include elec-
trostatic interactions with the backbone and the pairwise energies include electrostatic
interactions between side-chains. In all cases, the electrostatic interactions were modeled
using the distance-dependent electrostatic component (ε = r) of the AMBER96 force field.
In contrast to the basic energy function, for which 100% of the solutions were integral,
the LP finds optimal solutions for only 60% (three out of five) of the proteins using
either variant of the energy function. Thus, small changes in the energy function can
influence the ease with which solutions are found. We note that ILP can still find optimal
solutions for these problems, and additionally that the basic energy function gives the
best accuracy over these proteins (1.634 Å average rmsd versus 2.069 Å and 2.409 Å for
variants 1 and 2, respectively).
3.3.5 Obtaining Multiple Solutions
By adding constraints (3.1) to the integer program, we can look at an ensemble of provably
near-optimal solutions. Near-optimal solutions can be used to generate several candidates
for protein design, as well as to analyze the energy landscape and gauge the difficulty
of the global optimization problem. We found the 10 lowest-energy solutions for four
proteins (1aho, 1cex, 1ctj and 1igd) and the 100 lowest-energy solutions for 1aac, using
Figure 3.13: Relative gap between the optimal solution (with value OPT) and the 9 next lowest-energy solutions (where the i-th solution has value xi). Inset shows relative gaps for the 100 lowest-energy solutions for 1aac. The relative gap at each iteration i is defined as 100 |OPT − xi| / |OPT|.
the basic energy function to fit each sequence onto its native backbone. Since at each
step we are excluding all previously found solutions, each successive solution takes longer
to find. The relative gap (Equation 3.3) between each successive solution and the global
optimum is plotted in Figure 3.13. These gaps are very small, and from the point of view
of this energy function, any of several solutions perform similarly. This indicates that even
though LP has no difficulty finding optimal solutions, no one choice of rotamers clearly
stands out as the right one.
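The successive-exclusion idea can be made concrete by exhaustively ranking a toy instance; the energies below are random numbers, not real rotamer data, and on real problems the repeated ILP solves with exclusion constraints (3.1) take the place of the brute force.

```python
import random
from itertools import product

random.seed(0)
n_pos, n_rot = 4, 3
self_E = [[random.uniform(-1, 1) for _ in range(n_rot)] for _ in range(n_pos)]
pair_E = {(i, j): [[random.uniform(-1, 1) for _ in range(n_rot)]
                   for _ in range(n_rot)]
          for i in range(n_pos) for j in range(i + 1, n_pos)}

def energy(a):
    """SCP objective for one rotamer assignment a (tuple of indices)."""
    return (sum(self_E[i][a[i]] for i in range(n_pos))
            + sum(pair_E[(i, j)][a[i]][a[j]] for (i, j) in pair_E))

# Rank every assignment; an ILP with exclusion constraints would recover
# this list one solution at a time without full enumeration.
ranked = sorted(product(range(n_rot), repeat=n_pos), key=energy)
opt = energy(ranked[0])
gaps = [100.0 * abs(opt - energy(a)) / abs(opt) for a in ranked[:10]]
print(gaps[0])   # 0.0 for the optimum itself
```

The gaps of the successive solutions are nondecreasing by construction, mirroring the curves plotted in Figure 3.13.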
3.4 Discussion
Our experiments suggest that mathematical programming should become a widely-used
technique for attacking SCP in the context of both homology modeling and protein de-
sign. The described approach exploits general, highly-developed optimization machinery,
and it is likely that problems much larger than those studied here can be solved by em-
ploying faster hardware and more effectively exploiting the CPLEX package (e.g., using
parallelized versions of the software, or specifying alternative strategies for branching and
node selection). The addition of valid inequalities in a branch and cut framework as in
(Althaus et al., 2000) might further speed up solution of the problems.
For even larger problems, further specialized optimizations may be necessary. As a first
step, we have shown how to reduce the size of the ILP dramatically, without compromising
optimality, by exploiting the fact that in protein structures amino acids do not interact
with other amino acids that are far away in 3D. Furthermore, in practice, to solve large
instances optimally, we would suggest first running basic DEE, and then following with
either LP or ILP. We also note that some of the techniques developed for DEE can be
incorporated directly into ILP if necessary. For example, we can disallow choosing a
certain pair of rotamers (between positions that have some positive pairwise rotamer
energy between them) by removing the corresponding edge variable from the objective
function and constraints. Alternatively, the LP/ILP approach can be applied in cases
where the DEE procedure does not converge to a single solution. Finally, as compared
with other methods, the LP/ILP approach is simple to model and flexible enough to
extend easily. For example, we have already shown how to use ILP to obtain successive
near-optimal solutions.
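The successive-solutions idea can be illustrated with a toy enumerator (a sketch, not the ILP procedure itself): find the optimum, forbid it, and re-solve. In the ILP setting, forbidding a previously found assignment S is typically done with a linear constraint of the form sum_{u in S} x_u <= p - 1.

```python
import itertools

def successive_solutions(positions, energy, k):
    """Return up to k lowest-energy rotamer choices, found by repeatedly
    excluding every previously found solution and re-solving.
    (The ILP analogue excludes a found solution S with the linear
    constraint sum_{u in S} x_u <= p - 1.)"""
    found = []
    for _ in range(k):
        # Enumerate all assignments not yet excluded (toy-scale only).
        candidates = (c for c in itertools.product(*positions) if c not in found)
        best = min(candidates, key=energy, default=None)
        if best is None:
            break
        found.append(best)
    return found
```

With two positions and a small made-up energy table, the function returns the optimum followed by the next-best assignment.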
Our analysis suggests that protein design problems are considerably more difficult to
solve than homology modeling problems. For native-backbone and homology modeling,
optimal, biologically realistic solutions can usually be found quickly using a simple LP
relaxation. For protein design, fewer solutions of the LP relaxation are integral, even
with the same energy function. As suggested by (Gordon et al., 2002), it is possible that
for repacking side chains on backbones, there are only a few good rotamer choices for
each side chain, whereas for protein design there are several amino acid choices for each
position, each with a few good rotamer choices. From a computational viewpoint, this
suggests that the design problems are those on which efforts to improve the optimization
scheme should be focused.
We also find that the choice of energy function affects the ease with which optimal
solutions are found using LP. For positioning side chains on native and homologous back-
bones, optimal solutions using the basic energy function are found quickly (typically in
polynomial time), and this energy function yields good solutions (better than the other
energy function variants considered in our tests). This suggests that even if alternative en-
ergy functions are required, it may be beneficial to use an energy function such as the one
considered here for which optimal solutions are readily found. These solutions can then be
used as starting points for an iterative procedure such as that given in (Xiang and Honig,
2001) or for heuristic search algorithms (e.g., as in the original Scwrl program (Bower
et al., 1997)).
Several other authors have considered the combinatorial difficulty of SCP in the con-
text of packing side-chains onto native backbones. An excellent, exhaustive study on side
chain positioning has used very different reasoning to argue that the associated combi-
natorial problem appears not to be that difficult (Xiang and Honig, 2001). This study
considers packing side chains on native backbones, and shows empirically that predicting
the conformation of a single side chain while fixing all others in their native conformations
is only slightly more accurate than the simultaneous prediction of all side chains. Unlike
when integral solutions are found using our approach, their computational approach can-
not guarantee that they have found a minimum energy solution according to their energy
function. Eriksson et al. (Eriksson et al., 2001) also use an ILP formulation to suggest
that the side-chain positioning problem is easy; they apply the method to a single pro-
tein (lambda repressor protein) and find that the solution of the relaxed linear program
always seems to be integral, even for artificial “nonsense” energy functions. The hardness
result (Pierce and Winfree, 2002; Chazelle et al., 2004) suggests this is unlikely to be true
for all energy functions and proteins, and indeed the LP approach does find non-integral
solutions for two of the homology modeling cases in our dataset. On the other hand, oth-
ers (Gordon et al., 2002) have argued that it is important to consider the precise energy
function being optimized; our results are consistent with this view.
In light of the hardness results (Pierce and Winfree, 2002; Chazelle et al., 2004), it is
clear that the frequent integrality of the LP formulation in our experiments is not a result
of the general structure of the problem but instead is a feature of the properties of the
proteins and energy functions studied. The NP-completeness result describes worst-case
behavior, and it may not hold for the classes of problems and energy functions that occur
in practice. In general, finding problems which are hard on average is a longstanding
open problem in theoretical computer science, and in particular cryptography where such
problems would be very useful. Hence, in that light, that our instances are easier than
the worst-case results makes sense.
It is well-known that if the constraint matrix for an LP is totally unimodular (e.g.,
as in formulations for shortest path or max-flow problems), the LP has integer optimal
solutions. This is not the case here, however, as changing only the energy function can
change whether integral solutions are found. It is also known that if the energy function
obeys the triangle inequality, it is possible to obtain a 2-approximation. However, such an
energy function is not realistic for either homology modeling or protein design problems.
Nevertheless, in our applications, the constraint matrices are sparse, and perhaps the LP
is exploiting some other type of underlying structure. An intriguing open question is to
uncover what features of side-chain positioning allow LP and ILP to find optimal solutions
quickly, and to determine whether these features suggest an alternative formulation of the
problem.
Chapter 4
A Semidefinite Programming
Approach to Side-Chain
Positioning with New Rounding
Strategies
Seeking a tighter formulation for side-chain positioning instances on which
the LP formulation of the last chapter gives non-integral solutions, we present
and test a semidefinite programming formulation of the problem. We also
introduce and test two rounding schemes that convert a fractional solution to
this semidefinite relaxation into a good choice of rotamers.
4.1 Introduction
This chapter describes a semidefinite programming (SDP) heuristic for SCP. Though in
the previous chapter we saw that an LP heuristic often performs well on native-backbone
and homology modeling problems, solving design problems frequently required resorting
to a branch and bound integer programming algorithm. As shown in the previous chapter,
modern ILP solvers can solve problems of realistic size, but as larger design problems
are attempted and more detailed rotamer libraries are used, a good polynomial heuristic
becomes attractive.
In this chapter we give an effective SDP-based heuristic. Our method works in three
steps: first, relax the SCP problem into an instance of SDP; next, solve it in polynomial
time by an interior-point method; finally, convert the solution into 0/1 form by randomized
rounding (Raghavan and Thompson, 1987; Rolim and Trevisan, 1998). This general
approach for approximation algorithms was pioneered by Lovasz’s ground-breaking work
on the ϑ function (Grotschel et al., 1993; Lovasz, 1979) and the ingenious Max-Cut
algorithm of (Goemans and Williamson, 1995), and it has been pursued further since
(e.g., (Frieze and Jerrum, 1997; Alon and Kahale, 1998; Feige and Kilian, 1998; Karger
et al., 1998; Bertsimas and Ye, 1998; Zwick, 1999)).
In order to convert the fractional SDP solutions into rotamer choices for the orig-
inal SCP problem, we introduce two new techniques for randomized rounding. These
are general techniques that may have applicability beyond the SCP problem. The first
technique, projection rounding, is based on the geometry of the solution vectors, and the
second, Perron-Frobenius rounding, is based on spectral properties of the solution ma-
trix. This second rounding scheme approximates the solution matrix by the eigenvector
corresponding to its highest eigenvalue. This is a standard trick (see, e.g., (Donath and
Hoffman, 1972; Boppana, 1987; Benson et al., 1999)); however, our Perron-Frobenius
rounding is different in that we rely crucially on non-negativity of the entries of the so-
lution matrix. Because every entry in the solution matrix is constrained to be ≥ 0, the
entries of its highest eigenvector will be non-negative and this will allow us to extract a
probability distribution. A further difference is that the matrix that we decompose does
not have a graph-theoretic interpretation.
The inapproximability of SCP means that no rounding scheme should have good per-
formance on all instances; however, we provide some theoretical justification for good
performance on some types of input. We argue that under various assumptions about
the statistical nature of the problem, the expected difference (the drift) between the total
energies given by the optimal fractional solution and our randomized rounding integral
solutions is small.
We have applied our method to redesign computationally the cores of two naturally
occurring proteins, the Bacillus caldolyticus cold shock protein and the TIM barrel triose
phosphate isomerase from chicken. We have also experimented successfully on general
random graphs as well as a class of random graphs that better capture the geometry of
actual proteins. Since LP-based approaches for SCP are effective in practice (Kingsford
et al., 2005; Althaus et al., 2000), as we saw in the previous chapter, we compare our
method to LP; this comparison highlights the benefits of SDP’s additional computational
machinery. Our empirical studies show that, in practice, good solutions to SCP are found
by our two randomized rounding schemes. Additionally, we note that since SDP provides
better lower bounds than LP for the underlying SCP problem, it is a more effective
bounding function for branch-and-bound or branch-and-cut (Eriksson et al., 2001; Althaus
et al., 2000) approaches.
Independently, (Lau, 2002) and (Lau and Watanabe, 1996) applied semidefinite pro-
gramming and randomized rounding to the more restricted problem of weighted constraint
satisfaction; this is a special case of the SCP problem considered here. They also give an
inapproximability result that is weaker than ours in the general case.
At present, SDP solvers are limited to solving small problems. However, as SDP
approaches are increasingly being applied to combinatorial optimization problems, SDP
solvers continue to improve. As larger proteins and rotamer libraries are considered, ex-
haustive techniques (such as branch-and-bound or A∗) may be limited by their potentially
exponential running time. In contrast, our semidefinite programming approach runs in
polynomial time, and the approaches developed in this chapter, which we show work
well on problems of interest, will have broad applicability. Finally, as opposed to other
heuristic techniques (such as simulated annealing), as more is discovered about the nature
of SCP applications in practice, our SDP formulation permits the development of other
rounding schemes that better exploit the real-world statistical properties of the problem.
4.2 A Semidefinite Programming Heuristic
In this section we present a formulation of the SCP problem as a semidefinite program.
Given a graph G with node set V = V_1 ∪ · · · ∪ V_p, assign to each u ∈ V a 0/1 variable
xu. The intuition is that xu will be 1 if rotamer u is selected, 0 otherwise. Computing the
GMEC is equivalent to solving the following integer quadratic programming problem:
    Minimize    ∑_{(u,v)∈G} E_{uv} x_u x_v                          (4.1)
    subject to  ∑_{u∈V_i} x_u = 1    for i = 1, ..., p              (4.1a)
                x_u ∈ {0, 1}.
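As a concrete (if exponential-time) baseline, the objective in (4.1) can be evaluated by exhaustively enumerating one rotamer per position. The sketch below, with made-up energies and names, is illustrative only; it is not the method developed in this chapter.

```python
import itertools

def gmec_bruteforce(positions, E):
    """Enumerate one rotamer per position and return the choice minimizing
    the sum of self and pairwise energies (the GMEC, by brute force).
    positions: list of lists of rotamer ids (the sets V_i)
    E: dict mapping (u, u) to self energies and (u, v) to pairwise energies;
       missing pairs default to 0."""
    def energy(choice):
        total = 0.0
        for i, u in enumerate(choice):
            total += E.get((u, u), 0.0)                         # self energy E_uu
            for v in choice[i + 1:]:
                total += E.get((u, v), E.get((v, u), 0.0))      # pairwise E_uv
        return total
    return min(itertools.product(*positions), key=energy)
```

For two positions with two rotamers each, the minimum is found by checking all four assignments; the SDP and LP relaxations below exist precisely because this enumeration is exponential in p.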
We rewrite the program (4.1) into a form that will be more convenient to relax into
a semidefinite program with as few constraints as possible. Add a new position with an
isolated vertex u_0 to G and define its singleton vertex set V_0 = {u_0}. The constraints (4.1a)
applied to this position imply x_{u_0} = 1. We square both sides of (4.1a) and, using the fact
that x_u ∈ {0, 1}, we add two new sets of constraints to obtain the equivalent program
    Minimize    ∑_{(u,v)∈G} E_{uv} x_u x_v                          (4.2)
    subject to  ∑_{u,v∈V_i} x_u x_v = 1       for i = 0, ..., p
                ∑_{u∈V_i} x_{u_0} x_u = 1     for i = 0, ..., p
                x_u x_u = x_{u_0} x_u and x_u ∈ {0, 1}    for all u.
The relaxation step lifts each x_u to IR^n, where n is the number of nodes in G (including the
dummy node), scalar multiplication is replaced by the dot product, and the requirement
x_u ∈ {0, 1} is replaced by 0 ≤ x_u^T x_v ≤ 1, for all u and v. Quadratic programming
is NP-hard in general, but this relaxed system is an instance of positive semidefinite
programming, and it can be solved efficiently. To see that, we linearize all the constraints
by introducing the variable x_{uv} to denote x_u^T x_v. To ensure that this linearization is not a
further relaxation, we require that the n-by-n matrix X = (x_{uv}) be positive semidefinite
(PSD). We also note that the constraints x_u^T x_v ≤ 1 are redundant since X is PSD and
the diagonal elements are ≤ 1. Thus, we get:
    Minimize    ∑_{(u,v)∈G} E_{uv} x_{uv}                           (4.3)
    subject to  x_{uu} = x_{u_0 u} and x_{uv} ≥ 0    for all u, v
                ∑_{u∈V_i} x_{u_0 u} = ∑_{u,v∈V_i} x_{uv} = 1    for i = 0, ..., p
                X is PSD.
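As a sanity check on the linearization, any integral assignment x yields a feasible X = x x^T whose linear objective reproduces the quadratic energy of (4.2). The numpy sketch below builds such an X for a tiny instance with arbitrary illustrative energies and verifies the constraints of (4.3).

```python
import numpy as np

# Two positions with two rotamers each, plus the dummy node u0 at index 0.
# Node order: [u0, a1, a2, b1, b2]; V_0 = {0}, V_1 = {1, 2}, V_2 = {3, 4}.
E = np.zeros((5, 5))
E[1, 3] = -5.0   # illustrative pairwise energy between rotamers a1 and b1
E[1, 1] = 1.0    # illustrative self energy of a1

x = np.array([1, 1, 0, 1, 0], dtype=float)   # choose a1 and b1 (and u0)
X = np.outer(x, x)                           # X = x x^T, entries x_uv

# Constraints of (4.3): block sums equal 1, diag equals first row, X is PSD.
for block in [[0], [1, 2], [3, 4]]:
    assert X[np.ix_(block, block)].sum() == 1   # sum_{u,v in V_i} x_uv = 1
    assert X[0, block].sum() == 1               # sum_{u in V_i} x_{u0,u} = 1
assert np.allclose(np.diag(X), X[0])            # x_uu = x_{u0,u}
assert np.all(np.linalg.eigvalsh(X) >= -1e-9)   # rank one, hence PSD

# The linear objective equals the quadratic energy of the chosen rotamers.
linear_obj = (E * X).sum()
quadratic_obj = x @ E @ x
assert np.isclose(linear_obj, quadratic_obj)
```

The relaxation then consists of allowing all PSD matrices satisfying these linear constraints, not just rank-one 0/1 matrices.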
We can solve the SDP system (4.3) in polynomial time to within any level of accuracy by
using the ellipsoid algorithm (see e.g. (Grotschel et al., 1993)) or, preferably, an interior-
point method (see e.g. (Alizadeh, 1995; Nesterov and Nemirovskii, 1993; Vandenberghe
and Boyd, 1996)).
Next, we must map each vector x_u to x_u ∈ {0, 1} so that ∑_{u∈V_i} x_u = 1 and so that
ideally the expected increase in the value of the objective function, the drift, is small. We
discuss two rounding schemes, both of which fit the basic format of randomized round-
ing (Raghavan and Thompson, 1987). The idea is to specify a probability distribution
for each position and home in on a solution by sampling from it. We describe two dis-
tributions, one based on projection, the other on spectral approximation. The first one
is very simple and easy to implement; the second one requires slightly more work. See
Section 4.3 for some empirical comparisons of the two rounding methods. The following
characterization of the geometry of the solution vectors will be useful when we discuss the
rounding schemes.
Lemma 4.2.1 If X = (x_{uv}) is a solution to (4.3) where x_{uv} = x_u^T x_v for vectors x_u, x_v,
then all the vectors ∑_{u∈V_i} x_u are equal to x_{u_0}, and each x_u belongs to the unit-diameter
sphere with antipodes O (the origin) and x_{u_0}.

Proof: Fix i ≥ 0 and let y = ∑_{u∈V_i} x_u. The constraints imply that ‖y‖_2^2 = ∑_{u,v∈V_i} x_{uv} = 1.
Meanwhile, the inner product of y and x_{u_0} is equal to ∑_{u∈V_i} x_{u_0 u} = 1. We also have
x_{u_0}^T x_{u_0} = 1. Since their lengths are the same and equal to the projections onto one
another, it follows that y = x_{u_0} for all i. Now, take any node u ∈ V_i. Observe that

    ‖x_u − x_{u_0}/2‖_2^2 = x_{uu} − x_{u_0 u} + 1/4 = 1/4,

where x_{uu} = x_{u_0 u} follows directly from the constraints. Therefore, x_u belongs to the
sphere centered at x_{u_0}/2 of radius 1/2. This sphere passes through the two points O and
x_{u_0}, which are antipodal. □
We will compare our semidefinite program to the LP relaxation of the following integer
program (IP), which was introduced in the previous chapter:
    Minimize    ∑_{(u,v)∈G} E_{uv} x_{uv}                           (4.4)
    subject to  ∑_{u∈V_i} x_{uu} = 1          for i = 1, ..., p
                ∑_{u∈V_i} x_{uv} = x_{vv}     for i = 1, ..., p and any v
                x_{uv} ∈ {0, 1}.
The benefit of an SDP formulation over an LP formulation when dealing with fractional
solutions is two-fold: first, the relaxation is more constrained so its solution is closer to
that of the integer program. The SDP formulation generates second moments between the
nodes (Bertsimas and Ye, 1998), and our Perron-Frobenius rounding scheme will implicitly
make use of them. Second, the solutions are vectors and not scalars. This gives us much
more freedom in the rounding phase of the algorithm and allows for effective use of the
“geometry” of the problem. We will compare our semidefinite program to this LP and
the optimal solution in Section 4.3.
4.2.1 Projection Rounding
This scheme is based on the fact that the constraints guarantee that ∑_{u∈V_i} ‖x_u‖_2^2 = 1 for
any 1 ≤ i ≤ p, so that the quantities q_u = ‖x_u‖_2^2 associated with the nodes u of V_i form a
valid probability distribution from which we can sample effectively.

• The rounding rule: For each 1 ≤ i ≤ p, choose u ∈ V_i at random with
probability q_u.
Note that only one u is chosen per V_i. This is called projection rounding because the
probability of choosing u is equal to x_{uu} = x_{u_0 u}, which is the length of the projection
of x_u onto x_{u_0}. By looking at the geometry of the SDP formulation in a manner similar
to (Alon and Kahale, 1998; Feige and Kilian, 1998; Karger et al., 1998), we can provide
a measure of theoretical justification for our rounding strategy.
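Given the diagonal of the SDP solution matrix, projection rounding is a single weighted draw per position. The sketch below uses a hand-made illustrative distribution in place of an actual SDP solution.

```python
import random

def projection_round(diag, positions, rng=random):
    """Sample one rotamer per position with probability x_uu = ||x_u||^2.
    diag: dict node -> x_uu (the constraints force these to sum to 1
          within each position V_i)
    positions: list of lists of node ids (the sets V_i)"""
    choice = []
    for Vi in positions:
        weights = [diag[u] for u in Vi]
        # random.choices performs the weighted draw for this position
        choice.append(rng.choices(Vi, weights=weights, k=1)[0])
    return choice
```

When the distribution within a position is sharply concentrated, the favored rotamer is chosen almost surely, which is exactly the regime analyzed below.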
We first provide some intuition behind this rounding scheme. The solution vectors
are constrained to lie on a sphere and the projection rounding rule favors choosing long
vectors. If a single vector xu is dominant within its Vi — a common occurrence — then
simple geometry (Figure 4.1a) shows the dot products of these vectors should also be big.
Because the solution matrix is positive semidefinite, an off-diagonal element xuv is the
dot product of the two vectors xu and xv, and for long vectors xu, xv we can expect xuv
to be large as well. We can thus hope to avoid the most damaging situation: where the
rounding scheme chooses nodes u and v but the fractional solution has put low or zero
weight on the edge between them. The intuition holds in the opposite case as well: two
low probability vectors (that is, short vectors), are likely to have a small dot product,
and we would like to avoid choosing the edge that corresponds to that dot product. We
develop this argument more formally below when we give an upper bound on the drift,
defined as the expected difference between the post- and pre-rounding objective function
value.
Let x_u be 1 if u is chosen in the rounding stage and 0 otherwise. As usual, {x_u}
denotes the solution of the (relaxed) SDP system. The expected value of the objective
function, post-rounding, is equal to

    E[ ∑_{(u,v)∈G} E_{uv} x_u x_v ] = ∑_u E_{uu} ‖x_u‖_2^2 + ∑_{(u,v)∈G°} E_{uv} ‖x_u‖_2^2 ‖x_v‖_2^2,

where G° denotes the set of nonloop edges in G. Thus, the drift ∆ is

    ∆ = ∑_{(u,v)∈G°} E_{uv} ( ‖x_u‖_2^2 ‖x_v‖_2^2 − x_u^T x_v ).
Observe that the drift originates exclusively from the off-diagonal entries. Let y_u denote
the projection of x_u on the orthogonal complement x_{u_0}^⊥ of x_{u_0}. We rewrite the drift in
terms of these y_u. Because x_{u_0} is of unit length, we have

    x_u^T x_v = ((x_u^T x_{u_0}) x_{u_0} + y_u)^T ((x_v^T x_{u_0}) x_{u_0} + y_v)
              = (x_u^T x_{u_0})(x_v^T x_{u_0}) + y_u^T y_v = ‖x_u‖_2^2 ‖x_v‖_2^2 + y_u^T y_v ;

therefore,

    ∆ = − ∑_{(u,v)∈G°} E_{uv} y_u^T y_v.                            (4.5)
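The algebraic step behind (4.5) — splitting each vector into its component along x_{u_0} and an orthogonal residual y_u — is a general identity that can be checked numerically for arbitrary vectors (the SDP constraints additionally force x_u^T x_{u_0} = ‖x_u‖_2^2, which is not assumed here):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
x_u0 = np.zeros(n)
x_u0[0] = 1.0                             # unit vector playing the role of x_{u0}
xu, xv = rng.normal(size=n), rng.normal(size=n)

# Projections onto the orthogonal complement of x_{u0}
yu = xu - (xu @ x_u0) * x_u0
yv = xv - (xv @ x_u0) * x_u0

# x_u^T x_v = (x_u^T x_{u0})(x_v^T x_{u0}) + y_u^T y_v
lhs = xu @ xv
rhs = (xu @ x_u0) * (xv @ x_u0) + yu @ yv
assert np.isclose(lhs, rhs)
```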
In the special case where all the energies are non-negative, it is also possible to relate the
drift directly to the lengths of the xu’s. (While this is not true for all energy functions,
the popular side-chain positioning package SCWRL (Bower et al., 1997) only has non-negative energies.)

Figure 4.1: The geometry of the solution vectors.

By Lemma 4.2.1 and the Pythagorean theorem applied to the right
triangle in Figure 4.1b,

    ‖y_u‖_2^2 + ( ‖x_u‖_2^2 − 1/2 )^2 = ‖y_u‖_2^2 + ( x_u^T x_{u_0} − 1/2 )^2 = 1/4.
It follows that ‖y_u‖_2 = ‖x_u‖_2 √(1 − ‖x_u‖_2^2). Assuming nonnegative energies, by Cauchy-
Schwarz,

    ∆ ≤ ∑_{(u,v)∈G°} E_{uv} |y_u^T y_v| ≤ ∑_{(u,v)∈G°} E_{uv} ‖x_u‖_2 ‖x_v‖_2 √((1 − ‖x_u‖_2^2)(1 − ‖x_v‖_2^2)).    (4.6)
The Sharp-Concentration Case. Our algorithm is expected to do very well when, within
each V_i, the probability distribution is sharply concentrated. In other words, if the prob-
ability ‖x_u‖_2^2 of picking u greatly exceeds that of selecting the other vertices v ∈ V_i,
then projection rounding does the right thing. Indeed, if within each V_i one ‖x_u‖_2 is
near 1, then the other ‖x_v‖_2's (v ∈ V_i) must be small. This implies that the product
‖x_u‖_2 √(1 − ‖x_u‖_2^2) is always small and, by (4.6), so is the drift.
4.2.2 Perron-Frobenius Rounding
Algebraically, projection rounding entails approximating X = W^T W by the rank-one
matrix X̃ = W^T x_{u_0} x_{u_0}^T W. Are there better low-rank approximation matrices? To answer
this question, we return to the SDP formulation (4.3), which ensures that the matrix X
is nonnegative. Because X is also positive semidefinite, a spectral approach suggests an
alternative way: approximate X by a rank-one matrix so that the difference has minimum
L_2 norm.

To simplify the notation, we move all the energies over to the edges by defining F_{uv} =
E_{uv} + (1/(p−1))(E_{uu} + E_{vv}) if u < v, and 0 otherwise. The objective function of the SDP system
can now be expressed as E = tr(FX), where F = (F_{uv}) is upper-triangular. A vector
q = (q_u) ∈ IR^n is called G-stochastic if it is nonnegative and forms a valid probability
distribution over each V_i (i.e., ∑_{u∈V_i} q_u = 1). Randomized rounding with respect to q
produces an expected energy of tr(F qq^T), and so, the drift is ∆ = tr F(qq^T − X). The
problem, of course, is to find a suitable vector q. The next lemma provides a convenient
criterion to test whether a given q provides a valid distribution.
Lemma 4.2.2 Any nonnegative vector with L_1-norm p in the image space of X is G-
stochastic.

Proof: Recall that X = W^T W, where W is the matrix of column vectors (x_u). Let 1_i
be the 0/1 characteristic vector of V_i. Assuming that q = Xy for some y, then

    ∑_{u∈V_i} q_u = 1_i^T q = 1_i^T (W^T W) y = (W 1_i)^T W y = x_{u_0}^T W y,

where W 1_i = x_{u_0} by Lemma 4.2.1. The quantity x_{u_0}^T W y is independent of i, and since
by assumption ‖q‖_1 = p, it follows that ∑_{u∈V_i} q_u = 1 for any i. □
By the Perron-Frobenius theorem for nonnegative matrices (Seneta, 1981), the unit
eigenvector z_1 corresponding to the largest eigenvalue λ_1 of X is nonnegative. We approximate X by

    X̃ = λ_1 z_1 z_1^T.                                              (4.7)

Let s = p/‖z_1‖_1 be the factor needed to scale z_1 to length p in the L_1 norm. Since z_1
is in the image space of X, the vector q = s z_1 is G-stochastic by Lemma 4.2.2. Perron-
Frobenius rounding refers to the standard rounding rule applied now with respect to the
distribution q. That is,

• Perron-Frobenius rounding rule: For each 1 ≤ i ≤ p, choose u ∈ V_i at random
with probability given by the u-th entry of z_1 scaled by s.
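A minimal numpy sketch of this rule follows, run on an illustrative nonnegative PSD matrix rather than an actual SDP solution; because an arbitrary test matrix need not satisfy the constraints of (4.3), the per-position probabilities are renormalized defensively (for a true solution matrix, Lemma 4.2.2 makes that renormalization a no-op).

```python
import numpy as np

def perron_frobenius_round(X, positions, rng=None):
    """Round a nonnegative PSD matrix X by sampling, within each position,
    from the entries of its top eigenvector scaled to a distribution."""
    rng = rng or np.random.default_rng()
    vals, vecs = np.linalg.eigh(X)     # eigenvalues in ascending order
    z1 = np.abs(vecs[:, -1])           # top eigenvector; fix the overall sign
    choice = []
    for Vi in positions:
        probs = z1[Vi] / z1[Vi].sum()  # per-position renormalization (exact
        choice.append(rng.choice(Vi, p=probs))  # scaling by s for feasible X)
    return choice
```

On the "uniform" matrix discussed later in this section, z_1 is proportional to the all-ones vector and the rule degenerates into uniform sampling per position.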
We can express the drift under this rounding scheme as

    ∆ = tr F(qq^T − X) = tr F( s^2 z_1 z_1^T − X )
      = (s^2/λ_1) tr F( λ_1 z_1 z_1^T − X ) + ( s^2/λ_1 − 1 ) tr FX,      (4.8)

and upper bound it as follows:
Lemma 4.2.3 Let 1 denote the column vector of n ones and U the n-by-n matrix of ones,
and let {z_k} be an orthonormal eigenbasis of X with λ_k ≥ 0 the eigenvalue associated with
z_k. Then,

    ∆ ≤ (1 + δ) ‖F‖_2 √(tr X^2 − λ_1^2) + δ tr FX,

where F = (F_{uv}) is the energy matrix, X is the solution matrix returned by the SDP
system, and

    δ = tr UX / tr UX̃ − 1 = ∑_{k>1} λ_k (1^T z_k)^2 / ( p^2 − ∑_{k>1} λ_k (1^T z_k)^2 ).
Proof: We have δ = s^2/λ_1 − 1 because

    tr UX = ∑_{u,v} x_u^T x_v = ( ∑_u x_u )^T ( ∑_v x_v ) = p^2 ‖x_{u_0}‖_2^2 = p^2,  and
    tr UX̃ = λ_1 (1^T z_1)^2 = λ_1 ‖z_1‖_1^2,

where the first follows because ∑_{u∈V_i} x_u = x_{u_0} and there are p such positions i, and
the second follows from the construction of X̃. Substituting δ into (4.8) and applying
Cauchy-Schwarz gives

    ∆ ≤ (1 + δ) ‖F‖_2 ‖X − X̃‖_2 + δ tr FX.
Note the dependence on the L_2 distance between X and its approximation X̃. We can
express this distance in terms of the spectral weight placed on eigenvectors z_k for k > 1.
The diagonalization of the matrix X gives the decomposition X = ∑_k λ_k z_k z_k^T; therefore,
since X − X̃ is symmetric and the z_k's are orthonormal,

    ‖X − X̃‖_2^2 = tr (X − X̃)^2 = tr ( ∑_{k>1} λ_k z_k z_k^T )^2
                = ∑_{k>1} λ_k^2 tr (z_k z_k^T)^2 + 2 ∑_{k>l>1} λ_k λ_l tr (z_k z_k^T)(z_l z_l^T)
                = ∑_{k>1} λ_k^2 = tr X^2 − λ_1^2.
Finally, using p^2 − tr UX = 0, we have

    δ ≡ tr UX / tr UX̃ − 1 = tr U(X − X̃) / ( p^2 − tr U(X − X̃) )
      = ∑_{k>1} λ_k (1^T z_k)^2 / ( p^2 − ∑_{k>1} λ_k (1^T z_k)^2 ). □
Note that δ is quite small if the largest eigenvalue λ1 carries most of the spectrum or
z1 is close to the vector 1, which makes the terms 1T zk small for k > 1. The empirical
results in Section 4.3 suggest that λ1 may be much larger than the other eigenvalues in
realistic situations.
The Uniform Case. Projection rounding is expected to do well when the solution concentrates
weight on a single node per position. What if the weights are nearly uniformly
distributed? We can use Lemma 4.2.3 to argue that in this case even the strategy of
uniform guessing has low drift.
Assume that (i) x_{uu} = 1/|V_i| for all u ∈ V_i, (ii) x_{uv} = 1/(|V_i||V_j|) for any (u, v) ∈ V_i × V_j
(i < j), and (iii) x_{uv} = 0 for distinct u, v ∈ V_i. (Note that although these assumptions are
themselves unrealistic, the robustness of our arguments below makes them representative
of the "uniform" end of the spectrum.) It is easy to construct an orthogonal eigenbasis
for X:

Lemma 4.2.4 The largest eigenvalue λ_1 of X is equal to ∑_{i=1}^p |V_i|^{-1} and corresponds to
the eigenvector

    z_1 = ( 1/√(∑_i |V_i|^{-1}) ) ∑_{i=1}^p |V_i|^{-1} 1_i ,

where 1_i is the 0/1 characteristic vector of V_i. None of the n − 1 other eigenvectors are
nonnegative: p − 1 of them are of the form 1_1 − 1_i and span the kernel of X, while, for
each i, |V_i| − 1 of them are associated with the eigenvalue |V_i|^{-1}.
Proof: If X̃ = λ_1 z_1 z_1^T, then X − X̃ is the n-by-n matrix made of blocks B_1, ..., B_p
along the diagonal and 0 everywhere else: each B_i is a |V_i|-by-|V_i| circulant matrix with
|V_i|^{-1} − |V_i|^{-2} along the diagonal and −|V_i|^{-2} elsewhere. The eigenvectors of an m-by-m
circulant matrix consist of the rows of the matrix of the Fourier transform over the additive
group Z/mZ: this gives us an eigenvector (1, e^{2πik/m}, ..., e^{2πik(m−1)/m}) for each
0 < k < m. The corresponding eigenvalue for B_i is |V_i|^{-1} (hence, both its algebraic and
geometric multiplicities are |V_i| − 1). The corresponding eigenvectors of X are derived
trivially by padding with zeroes at the appropriate places. Note that we must skip the
case k = 0, because the eigenvector of B_i that gets padded into 1_i is not an eigenvector of
X. To complete the diagonalization of X, we must resolve its kernel. Going back to the
relations x_{uu} = ∑_{v∈V_j} x_{uv}, we easily verify that Ker X is spanned by the p − 1 vectors
1_1 − 1_i, for 1 < i ≤ p. □
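The spectrum claimed by Lemma 4.2.4 can be verified numerically for a small uniform instance; the sketch below builds the uniform X for p positions of equal size m and checks the top eigenvalue, the kernel dimension, and the multiplicity of the eigenvalue 1/m.

```python
import numpy as np

p, m = 3, 4                      # p positions, each with m rotamers; n = p*m
n = p * m
X = np.full((n, n), 1.0 / m**2)  # cross-position entries 1/(|V_i||V_j|)
for i in range(p):               # within a position: x_uu = 1/m, x_uv = 0 (u != v)
    blk = slice(i * m, (i + 1) * m)
    X[blk, blk] = np.eye(m) / m

vals = np.linalg.eigvalsh(X)     # eigenvalues in ascending order
lam1 = vals[-1]
assert np.isclose(lam1, p / m)                            # lambda_1 = sum_i |V_i|^{-1}
assert np.sum(np.isclose(vals, 0.0)) == p - 1             # kernel of dimension p-1
assert np.sum(np.isclose(vals, 1.0 / m)) == p * (m - 1)   # eigenvalue 1/m, mult. p(m-1)
```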
Assume now that all the V_i's are of equal size n/p. Then z_1 = (1/√n) 1 and Perron-
Frobenius rounding degenerates into choosing solutions uniformly at random. Since all
other eigenvectors are normal to z_1, by Lemma 4.2.3, we know that δ = 0. Also, by
Lemma 4.2.4,

    tr X^2 − λ_1^2 = ∑_{k>1} λ_k^2 = (n − p) p^2/n^2.

If each F_{uv} is 0/1 and each node is connected to 2d neighbors, then ‖F‖_2 = √(nd) and,
by Lemma 4.2.3, ∆ ≤ p√d. This shows that, measured against the energy (p/n)^2 · dn =
dp^2/n of the random solution, the relative drift is at most (n/p)/√d, which is typically
much less than one since each rotamer typically interacts with many rotamers in several
positions.
Hence, if the solution matrix is sharply concentrated, we have shown projection round-
ing is expected to work well, at least under the assumption that the edge weights are
nonnegative. In the opposite situation of a uniform solution matrix we have shown that
for an unweighted, regular graph the drift is small if solutions are chosen uniformly at
random.
4.3 Computational Results
We used the SDP formulation to design computationally the cores (i.e., the solvent inac-
cessible portions) of two proteins. Both proteins have a core β-barrel, a region where the
backbone wraps around to form a structure reminiscent of the slats of a wooden barrel.
Our computational work focuses on protein cores because (1) the energetics most
important to the core residues are easier to model than those of solvent-exposed residues,
and (2) the cores are small, making them tractable for SDP. In particular, since we are
focusing on hydrophobic core interactions, we use an energy function — similar to the
one in Chapter 3 — that focuses on obtaining well-packed structures. More specifically,
the interaction energies between rotamers (that is, the edge weights of our graph) are
calculated using the 6–12 Lennard-Jones approximation to the van der Waals force. Self-
energies are calculated as the sum of the van der Waals interaction between the rotamer
and the backbone, plus a statistical term derived from the empirical probabilities listed
in the rotamer library (Dunbrack Jr and Karplus, 1993). Interactions between the side
chain and the backbone of flanking positions are ignored to account for some backbone
flexibility. The statistical energy term for rotamer u is computed as − ln(pu/p0), where pu
is the probability of seeing rotamer u and p0 is the probability of seeing the most common
rotamer for that amino acid (Bower et al., 1997). (In the notation of Section 3.2.7,
here C = 1.) For all calculations, atom radii and interaction parameters are taken from
AMBER96 (Cornell et al., 1995), a commonly used package for evaluating the energy
of protein conformations, with the radii of hydrogens reduced by 50% because of their
uncertain position. The BALL C++ library (Kohlbacher and Lenhof, 2000) was used to
manipulate the rotamers.
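The two energy components just described can be sketched as follows; the constants and rotamer probabilities are illustrative only (the actual radii and well depths come from AMBER96, and the probabilities from the Dunbrack rotamer library).

```python
import math

def lennard_jones_6_12(r, r_min, eps):
    """6-12 Lennard-Jones energy at distance r for an atom pair with
    equilibrium separation r_min and well depth eps (illustrative form)."""
    q = (r_min / r) ** 6
    return eps * (q * q - 2.0 * q)   # minimum value -eps at r = r_min

def statistical_self_energy(p_u, p_0):
    """Statistical term -ln(p_u / p_0) from rotamer-library frequencies,
    where p_0 is the probability of the amino acid's most common rotamer."""
    return -math.log(p_u / p_0)

# The most common rotamer gets statistical energy 0; rarer ones are penalized.
assert statistical_self_energy(0.4, 0.4) == 0.0
assert statistical_self_energy(0.1, 0.4) > 0.0
# The LJ term is minimized at the equilibrium separation.
assert math.isclose(lennard_jones_6_12(3.4, 3.4, 0.1), -0.1)
```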
The ultimate test of protein design is to make the protein and confirm the predicted
structure. Obviously, experimental work is beyond the scope of this thesis, but the solu-
tions to the design problems can be at least initially evaluated in several ways. First, we
may expect the designed sequence to be similar to the native sequence since evolution has
likely chosen a favorable sequence. (Note, however, that novel protein sequences that are
considerably more stable than native protein sequences have been designed using fixed
backbone approaches (Malakauskas and Mayo, 1998).) Second, since we are using an en-
ergy function that focuses on packing, we expect the designed structure to avoid clashes
between atoms and to pack the available space tightly. Third, we want the rounded solu-
tion to have energy near the optimum solution. We show below that our computationally
designed cores generally fulfill these criteria.
In order to investigate the performance of the rounding schemes in a more controlled
setting, we also experimented with two types of random graphs. We first consider uniform
random graphs and then consider a family of randomly generated graphs that better model
the interaction graphs observed in proteins.
The semidefinite programs were solved using version 6.0 of the SDPA (Fujisawa et al.,
1997) package, an implementation of an infeasible primal-dual interior-point method. The
linear programs were solved using the dual simplex method with AMPL (Fourer et al.,
2002) and CPLEX 7.1 (ILOG CPLEX, 2000). The SDP solutions were rounded using the
projection and Perron-Frobenius methods described above.
We compare our SDP with the LP obtained by relaxing the integrality constraints
from (4.4). The LP solutions were rounded by choosing node u with probability xuu.
For problems of this size, optimal integral solutions (denoted by OPT) can be found
using formulation (4.4) and the integer programming option of CPLEX. This allows us to
compute the relative gap of each rounded solution, computed as |(x − OPT)/OPT|,
where x is the value of the solution. For both the protein design problems and the
simulated data, the SDP rounding schemes perform well, with significantly better average
relative gaps.
4.3.1 Cold Shock Protein
We applied the SDP method to the problem of redesigning the core of the Bacillus cal-
dolyticus cold shock protein (Mueller et al., 2000) (PDB code: 1c9o). Core residues were
defined as having less than 1% of their surface area exposed to solvent in the native struc-
ture as determined by the program Surfv (Nicholls et al., 1991), which rolls a probe sphere
of radius 1.4 Å along the van der Waals surface. The following eight residues were found:
Val6, Gly16, Ile18, Val28, Leu41, Val47, Phe49, Val63.
The hydrophobic core positions (i.e., all positions listed above except position 16 with Gly)
were varied, and allowed to assume any rotamer of the hydrophobic amino acids Ala, Val,
Ile, Leu, Met, and Phe that occurred in the backbone-dependent rotamer library (Dun-
brack Jr and Karplus, 1993). This yields 55 rotamers per position. The protein is shown
in Figure 4.2a, with variable atoms shown as black spheres and the axis of the β-barrel
vertical. The native positions of the side chains are shown in Figure 4.2b, where the
protein is rotated so that we are looking down the axis of the barrel.
The resulting problem had 385 nodes, 7 positions, and 63,313 nonzero cost matrix
entries. The simple pairwise DEE rule of (Goldstein, 1994), a polynomial-time criterion for
throwing out rotamers that cannot possibly be in the optimal solution (see Section 2.5),
was applied to the problem until no more nodes could be eliminated. This reduced the problem to
137 nodes, 7 positions, and 7,865 nonzero cost matrix entries.
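The Goldstein criterion referenced here eliminates rotamer u at position i if some alternative v at i satisfies E(u) − E(v) + ∑_{j≠i} min_w [E(u,w) − E(v,w)] > 0, since then u can never be part of the optimum. A minimal single-pass sketch, with illustrative data structures:

```python
def goldstein_eliminate(positions, E_self, E_pair):
    """One pass of the Goldstein DEE rule: drop rotamer u at position i if
    some v at i satisfies E(u)-E(v) + sum_j min_w [E(u,w)-E(v,w)] > 0.
    E_pair[(u, w)] defaults to 0 for non-interacting pairs."""
    def pair(u, w):
        return E_pair.get((u, w), E_pair.get((w, u), 0.0))
    pruned = []
    for i, Vi in enumerate(positions):
        keep = []
        for u in Vi:
            dominated = False
            for v in Vi:
                if v == u:
                    continue
                gap = E_self[u] - E_self[v]
                for j, Vj in enumerate(positions):
                    if j != i:
                        gap += min(pair(u, w) - pair(v, w) for w in Vj)
                if gap > 0:          # u can never beat v: eliminate u
                    dominated = True
                    break
            if not dominated:
                keep.append(u)
        pruned.append(keep)
    return pruned
```

In practice the rule is iterated until no further rotamers can be removed, as described above.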
The LP solution was rounded 1,000 times using the simple LP rounding scheme. The
SDP solution was rounded 1,000 times with both the projection and Perron-Frobenius
rounding schemes. The minimum energy solution found is a good measure of how well
one would do in practice, but this minimum energy may be influenced by the moderate
search space size. The average energy of a rounded solution is a better indicator of the
distribution obtained from rounding the relaxations. The best value over 1,000 roundings
and the empirical average objective value in the limit are:
Rounding Method     Best         Average
LP                  -217.2880    4058.6651
Projection          -238.4218    -102.2822
Perron-Frobenius    -238.4218    -209.3617
The optimum solution (determined by the integer programming option of CPLEX)
is −238.4218. Both SDP rounding schemes find the optimum solution; this appears to
correspond to a well-packed and plausible structure, as shown in Figure 4.2c. The average
energy of the rounded solutions suggests that, as expected, the LP rounding scheme is a
poor one. In fact, the average relative gap of the solutions found by the LP rounding
scheme is 18.02, versus an average relative gap of 0.57 for projection rounding and 0.12
for Perron-Frobenius rounding. We expected the Perron-Frobenius rounding scheme to
perform well, as the solution returned by the SDP has most of its spectral weight placed on
the largest eigenvalue (7.725 versus less than 0.05 for all the other eigenvalues).
The optimal choice has 57% sequence identity with the native sequence. Additionally,
Figures 4.2b and 4.2c show that the designed sequence packs more atoms into the core of
the protein than the native structure. This is one indication that this sequence might be
a good fit for this backbone, as more tightly packed cores tend to be favored.

Figure 4.2: Cold-shock protein (1c9o). (a) The full protein; (b) the positioning of the
side chains in nature (Leu41, Val6, Val47, Val28, Ile18, Phe49, Val63); (c) the solution
returned by the SDP rounding schemes (the optimal: Leu41, Ile6, Ile47, Ile28, Ile18,
Phe49, Val63). [structure images omitted]
4.3.2 Triose Phosphate Isomerase
We applied the same procedure to the protein triose phosphate isomerase from chicken
muscle (Banner et al., 1976) (PDB code: 1tim). This protein is an α/β-barrel, where
the β-barrel core is surrounded by α-helical structures. We focused on the computational
redesign of residues in the core of the β-barrel, as identified by (Lesk et al., 1989), and
shown in Figure 4.3a as black spheres. The 9 non-glycine core residues are:
Val40, Ala62, Trp90, Ile92, Ile124, Val161, Ala163, Ile207, Leu230.
The native positions of these side chains are shown in Figure 4.3b. Trp90 was allowed
to assume any rotamer of the aromatic amino acids Phe, His, Trp, and Tyr. The other
residues were allowed to assume any rotamer of the hydrophobic amino acids Ala, Val,
Ile, Leu, Met, and Phe. The same energy function was used as above. This resulted in
467 nodes and 91,737 nonzero edges. As with the cold shock protein, DEE was performed
to throw out rotamers that cannot possibly be in the optimal solution; this reduced the
problem to 141 nodes and 8,264 edges.
The optimal solution has objective value -208.5702. In this case, all methods find the
optimal solution (shown in Figure 4.3c) within 1,000 roundings. The average objective
values are:
Rounding Method     Average
LP                  251.0156
Projection           93.1529
Perron-Frobenius    -36.9177
The average energy of the rounded solutions demonstrates that the Perron-Frobenius
rounding again performs best for this problem. In fact, the SDP solution returned is close
to a rank 1 matrix: the largest eigenvalue is 8.7563 out of a total spectral weight of 10;
the second largest is 0.375. The average relative gap of the solutions found by Perron-Frobenius
rounding is 0.82, versus 1.44 for projection rounding and 2.20 for the LP
rounding scheme.
The optimal solution avoids clashes, and, as can be seen from Figures 4.3b and 4.3c,
packs the available space well. It has only 33% sequence identity with the native sequence.
This is not necessarily unexpected: (Dahiyat and Mayo, 1997) designed a sequence with
only 21% identity to the native sequence that folds to the same shape.
4.3.3 Uniform Random Graphs
We consider the random graphs GU (n, p, r) parameterized by the number of nodes n,
number of positions p, and edge probability r. Each position contains n/p nodes. Two
nodes in different positions are connected by an edge with probability r. Each chosen
edge is weighted by drawing a weight uniformly from [0, 1]. There are no self-edges.
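A generator for GU(n, p, r) follows directly from this definition; a sketch (identifiers are ours):

```python
import random

def uniform_random_graph(n, p, r, seed=None):
    """Generate GU(n, p, r): n nodes split into p equal positions;
    each pair of nodes in different positions is joined with
    probability r, with a weight drawn uniformly from [0, 1]."""
    rng = random.Random(seed)
    part = [v * p // n for v in range(n)]  # position of each node
    edges = {}
    for u in range(n):
        for v in range(u + 1, n):
            if part[u] != part[v] and rng.random() < r:
                edges[(u, v)] = rng.random()
    return part, edges
```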
We solved 30 instances of uniform random graphs with 60 nodes, 15 positions, and
edge probability 0.5 using SDP. Figure 4.4 shows the 25 largest eigenvalues in descending
order for each of the uniform random graphs shown in Figure 4.5. The eigenvalues sum to
16 because tr X equals the number of positions, and there are 16 positions including the
dummy position V0 in (4.3).
Figure 4.3: Triose phosphate isomerase (1tim). (a) The full protein; (b) the positioning
of the side chains in nature (Trp90, Val161, Val40, Ile124, Ile207, Ala62, Ala163, Ile92,
Leu230); (c) the solution returned by the SDP rounding schemes (the optimal: Trp90,
Met161, Val40, Phe124, Met207, Met62, Met163, Met92, Leu230). [structure images
omitted]
Most of the spectrum is concentrated in the first few eigenvalues, and so one would expect
the approximation defined in (4.7) to be close to the solution X and thus that Perron-Frobenius
rounding would perform well.
Figure 4.5a compares the fractional objective of the semidefinite program with that
of the linear program. The SDP provides a tighter lower bound on the minimum energy,
typically within 10% of the optimum. In contrast, the fractional LP solution is never
within 60% of the optimum. As a side benefit, then, SDP provides a more
effective bounding function than LP for branch-and-bound frameworks.
Figure 4.5b shows the best rounded solution found over 10,000 roundings. For these
30 graphs, both semidefinite rounding schemes outperform the LP in all cases, generally
finding a solution within 10% of the optimum and only once finding a solution more
than 20% above the optimum. This means, in a practical sense, that SDP allows us to
find lower energy conformations. The two semidefinite rounding schemes are comparable,
though Perron-Frobenius finds a lower energy solution in 11 out of 30 instances. In
one case, projection rounding finds a better solution. The average rounded energy is
shown in Figure 4.5c. Perron-Frobenius gives a slightly better distribution than projection
rounding. Both SDP rounding schemes again outperform the LP one.
Figure 4.4: The largest eigenvalues of the SDP solution matrices for the uniform random
graphs. [30 eigenvalue plots omitted; each shows the 25 largest eigenvalues in descending
order]
These results remain qualitatively the same for other values of p ≥ 10 and edge
probabilities ≥ 0.3. For very sparse graphs, the SDP and LP methods yield similar
results.
4.3.4 Neighborhood Random Graphs
The uniform random graphs do not capture several properties of real protein interaction
graphs. Side-chains that are far apart in the folded protein structure typically do not
interact. On the other hand, if two residues are near each other in the folded structure,
most of their rotamers will interact.
We consider neighborhood random graphs GN (n, p, d) that capture some of these
properties. They are again parameterized by the number of nodes n and number of
positions p, where each position has n/p nodes. Given parameter d, edges are defined as
follows: For each position j a point bj is chosen uniformly at random in the 3D unit cube.
If the Euclidean distance between bi and bj is ≤ d, then the rotamers in positions i and j
are connected by the complete bipartite graph; if the distance is > d, there are no edges
between i and j. Edges are weighted by choosing a number uniformly from [0, 1].
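A generator for GN(n, p, d) can be sketched the same way (identifiers are ours; `math.dist` requires Python 3.8+):

```python
import itertools, math, random

def neighborhood_random_graph(n, p, d, seed=None):
    """Generate GN(n, p, d): each position gets a random point in the
    3D unit cube; two positions closer than d are joined by a complete
    bipartite graph between their n/p nodes, with uniform weights."""
    rng = random.Random(seed)
    centers = [(rng.random(), rng.random(), rng.random()) for _ in range(p)]
    nodes_of = [range(i * n // p, (i + 1) * n // p) for i in range(p)]
    edges = {}
    for i, j in itertools.combinations(range(p), 2):
        if math.dist(centers[i], centers[j]) <= d:
            for u in nodes_of[i]:
                for v in nodes_of[j]:
                    edges[(u, v)] = rng.random()
    return edges
```

Note that since the cube's diameter is √3, any d ≥ √3 produces a complete p-partite graph.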
Figure 4.6 shows the results for neighborhood random graphs with various values for
d. For sparse graphs (small d), the SDP and LP approaches yield similar results.

Figure 4.5: Results for 30 uniform random graphs. (a) Fractional objectives; (b) best
rounded objectives; (c) average rounded objectives. Relative gaps are computed as in
Equation 3.3. [plots omitted]

Figure 4.6: Results for neighborhood random graphs with various connection distances d.
(a) Fractional objectives; (b) best rounded objectives; (c) average rounded objectives.
[plots omitted]

Figure 4.6a shows the fractional objective values and Figure 4.6b shows the best rounded
objective values. As d increases, positions are more likely to be connected, the optimum
objective grows, and the SDP’s advantage in lower bounding the optimum solution in-
creases. Both projection rounding and Perron-Frobenius rounding can find the optimum
solutions for most of these graphs within 10,000 roundings, whereas this is not the case for
the LP rounding. The average rounded energy is shown in Figure 4.6c, and again Perron-
Frobenius gives a slightly better distribution than projection rounding. The spectra for
neighborhood random graphs of low connection distance are also very concentrated in the
highest eigenvalue. As the connection distance increases, the spectrum generally becomes
more spread out (data not shown).
4.4 Conclusions
In this chapter, we formulated the side-chain positioning problem as an instance of
semidefinite programming and introduced two new rounding schemes for converting frac-
tional solutions into integral ones. Our rounding schemes are quite general and should be
applicable to other problems.
We have applied our method to the problem of computationally redesigning the cores
of two naturally occurring proteins. In addition, we have investigated how the rounding
formulations behave on two classes of random graphs. While the hardness of the SCP
problem argues that no method will do well in general, our computational experiments
confirm the effectiveness of our methods. We provide a measure of theoretical justification
for this.
We have shown that semidefinite programming can be applied to biological problems of
realistic, albeit small, size. Though non-polynomial search heuristics are more practical at
present for larger biological problems (Chapter 3), as semidefinite programming algorithms
and solvers improve, our approach will become more attractive, particularly for design
problems, where the LP approach tested in Chapter 3 requires more computation. In
fact, recent progress has been made in finding approximate solutions to SDP instances.
Along with other applications, (Arora et al., 2005) give an algorithm for finding an ε-
additive approximation for our semidefinite program (4.3) in time O(n^1.5 N / ε^4.5),
where N is the number of non-zero energy terms, and n is the total number of rotamers.
Chapter 5

Improving a Mathematical Programming Approach for Motif Finding
Moving from structure to protein function, we investigate a particular essential
function of proteins — the regulation of other proteins. Transcription factor
proteins bind to DNA to regulate the transcription of protein-coding genes.
The SCP LP/ILP approach given in Chapter 3 can be recast for the task of
finding the binding sites of these transcription factors. In this chapter, we
speed up the linear and integer programming method of Chapter 3 for this
application.
5.1 Introduction
The central dogma of biology is the basic mechanism of genetic information processing:
genes (subsequences of DNA) are transcribed into mRNA molecules, which are then trans-
lated into proteins, the building blocks of cellular structure and process. One mechanism
by which the abundance of particular proteins can be controlled is by accelerating or
75
reducing the rate of transcription. To this end, most genes have, in nearby regions of
DNA, binding sites for regulatory proteins called transcription factors. The process of
transcription may be enhanced or slowed by the binding of transcription factors to their
binding sites.
In this chapter, we describe computational methods for automated discovery of these
regulatory elements, the binding sites of transcription factors in DNA. A commonly stud-
ied paradigm starts with a set of DNA sequences that contain binding sites for a regulatory
protein, and then finds shared (or similar) subsequences in each. These subsequences, or
motifs, are putative binding sites for the same transcription factor. The effectiveness of
identifying regulatory elements in this manner has been demonstrated by considering sets
of sequences identified via shared co-expression (Tavazoie et al., 1999), orthology (Cliften
et al., 2003; Kellis et al., 2003), and genome-wide location analysis (Lee et al., 2002).
From a computational point of view, while many motif-finding methods work reason-
ably well, a recent comprehensive study by (Tompa et al., 2005) shows that no single
motif finding method exhibits a high absolute measure of correctness. Broadly speaking,
the methods are either probabilistic or combinatorial. Probabilistic approaches estimate
the parameters of a motif model using maximum likelihood or maximum a posteriori
estimation (Lawrence and Reilly, 1990; Bailey and Elkan,
1995; Lawrence et al., 1993; Liu et al., 2001; Frith et al., 2004). Combinatorial approaches
either enumerate all allowed motifs (e.g., (Tompa, 1999; Sinha and Tompa, 2003;
Marsan and Sagot, 2000; van Helden et al., 2000; Eskin and Pevzner, 2002; Pavesi et al.,
2004)), or attempt to maximize some measure based on sequence similarity, or minimize
some measure based on distance (e.g., (Hertz and Stormo, 1999; Rigoutsos and Floratos,
1998; Pevzner and Sze, 2000; Sze et al., 2004)).
We take the combinatorial approach and formulate the motif finding problem as that
of finding the best gap-less local multiple sequence alignment using the sum-of-pairs (SP)
scoring scheme. The SP-score is one of many reasonable schemes for assessing motif conser-
76
vation (Osada et al., 2004; Schuler et al., 1991). The combinatorial problem is equivalent
to that of finding a minimum weight clique of size p in a p-partite graph (e.g. (Reinert
et al., 1997; Pevzner and Sze, 2000; Sze et al., 2004)).
For general notions of distance, this problem is NP-hard to approximate within any
reasonable factor (Section 2.6). In the motif finding setting, where the distances obey the
triangle inequality, the problem remains NP-hard (Akutsu et al., 2000). While constant-
factor approximation algorithms exist (Gusfield, 1993; Bafna et al., 1997), the ability to
find the optimal solution in practice is preferable.
Our approach follows that of (Zaslavsky and Singh, 2005), which introduced the integer
linear programming (ILP) formulation of the motif finding problem. Their testing on
identifying known DNA binding sites of E. coli transcription factors (Robison et al.,
1998) shows that the approach performs well for motif finding, identifying either known
motifs or motifs of higher conservation. They apply the ILP formulation to a variety
of types of motif-finding problems, including DNA motifs, protein motifs, and artificial
motifs embedded in random sequences (so-called subtle motifs (Pevzner and Sze, 2000)).
A difficulty mentioned in (Zaslavsky and Singh, 2005), however, is the size of the
integer linear programs, which can have millions of variables for interesting biological
problems. In that work, the authors tackle the ILPs by preprocessing with graph prun-
ing and decomposition techniques. They give DEE-like rules for throwing out nodes and
experiment with a depth-1 branch-and-bound-like procedure where each node in an arbi-
trary position is assumed to be in the solution, and the graph is reduced using the pruning
procedures assuming the chosen node for the arbitrary position is the correct one. If the
graph remains a feasible instance, the smaller ILP is solved for this reduced graph.
Here, we take an alternative direction and propose a novel, more compact ILP that
uses the discrete nature of the distance metric imposed on pairs of subsequences. We
present an exponentially-sized class of constraints to make the linear programming (LP)
relaxation of the new formulation provably as tight as that given in (Zaslavsky and Singh,
2005), and we give a separation algorithm so that solving the new relaxation remains
polynomial-time despite the large number of constraints.
Rather than using the separation algorithm explicitly, we describe and test a heuristic
approach to solve the LP relaxation of our novel ILP formulation that, in all observed
cases, finds a solution of the same objective value as the LP relaxation of (Zaslavsky and
Singh, 2005), often an order of magnitude faster. Moreover, we show that in practice,
the LP relaxations for both of the ILP formulations often have integral optimal solutions,
making solving the LP relaxations sufficient for solving the original ILP. Even if this were
not the case, the ability to find faster solutions to the relaxations may translate into
significant speed-ups in branch-and-bound approaches for solving the original ILP.
5.2 Formal Problem Specification
In the motif-finding problem, we are given p sequences, which are assumed without loss
of generality to each have length N ′, and a motif length ℓ. The goal is to find a substring
si of length ℓ in each sequence i, such that the sum of the pairwise distances between the
substrings (i.e., Σ_{i<j} distance(si, sj)) is minimized. The distance between substrings
may be defined in several ways. The simplest measure, and the one we restrict ourselves
to in this chapter, is the Hamming distance.
This motif-finding problem can be expressed in the same graph theoretic terms (Rein-
ert et al., 1997) as the side-chain positioning problem we have considered earlier. For
a problem with p sequences, we define a complete, weighted p-partite graph, with a
part Vi for each sequence. In Vi, there is a node for every possible window of length
ℓ in sequence i. Thus there are N := N ′ − ℓ + 1 nodes in each Vi, and the vertex set
V = V1 ∪ · · · ∪ Vp has size Np. For every pair u and v in different parts there is an
edge (u, v) ∈ E . Letting seq(u) denote the subsequence corresponding to node u, the
weight wuv on edge (u, v) equals distance(seq(u), seq(v)). Let the part of a node be
denoted as part(u) := i such that u ∈ Vi. The goal in motif finding is to choose a node
from each part so as to minimize the weight of the induced subgraph. We note that the
combinatorial formulation of the “subtle motifs” problem is similar (Pevzner and Sze,
2000), though in that case edges exist only between nodes corresponding to subsequences
that are within a certain distance of one another. The approach we outline below can be
extended to that context as well.
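The construction above translates directly into code; a sketch using Hamming distance (identifiers are ours):

```python
def motif_graph(seqs, ell):
    """Build the p-partite motif-finding graph: one node per length-ell
    window of each sequence, with edge weights given by the Hamming
    distance between windows in different sequences."""
    windows = [[s[k:k + ell] for k in range(len(s) - ell + 1)] for s in seqs]

    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    edges = {}
    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            for a, wa in enumerate(windows[i]):
                for b, wb in enumerate(windows[j]):
                    edges[(i, a), (j, b)] = hamming(wa, wb)
    return windows, edges
```

Here a node is a (sequence, offset) pair, so part((i, a)) = i, matching the definition above.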
5.3 Integer and Linear Programming Formulations
5.3.1 Original Integer Linear Programming Formulation
Given the graph-theoretic formulation above, the ILP presented in Chapter 3 can be used
for motif finding (Zaslavsky and Singh, 2005):
Minimize Σ_{{u,v}∈E} wuv · Xuv    (IP1)

subject to    Σ_{u∈Vi} Xu = 1    for i = 1, . . . , p

              Σ_{u∈Vi} Xuv = Xv    for i = 1, . . . , p and v ∈ V \ Vi

              Xu, Xuv ∈ {0, 1}
5.3.2 New Integer Linear Programming Formulation
Since the alphabet and the length of the sequences are finite, there are only a finite
number of possible pairwise distances. For example, in the case of Hamming distances,
edge weights can only take on ℓ + 1 different values. We take advantage of the small
number of possible weights and the fact that the edge variables of IP1 are only used to
ensure that if two nodes u and v are chosen in the optimal solution then wuv is added to the
cost of the clique. We introduce a second ILP in which we no longer have edge variables
Xuv. Instead, in addition to the node variables Xu, we have a variable Yujc for each node
u, each position j such that u is not in Vj , and each possible edge weight c. These Y
variables model groupings of the edges by weight into cost bins. The intuition is that Yujc
is 1 if node u and some node v ∈ Vj are chosen such that wuv = c. Formally, let D be the
set of possible edge weights and let W = {(u, j, c) : c ∈ D, u ∈ V, j ∈ {1, . . . , p}, and u ∉ Vj}
be the set of triples over which the Yujc variables are indexed. Then the following ILP
models the motif-finding graph problem:
Minimize Σ_{(u,j,c)∈W : part(u)<j} c · Yujc    (IP2)

subject to    Σ_{u∈Vi} Xu = 1    for i = 1, . . . , p    (IP2a)

              Σ_{c∈D} Yujc = Xu    for j = 1, . . . , p and u ∈ V \ Vj    (IP2b)

              Σ_{v∈Vj : wuv=c} Yvic ≥ Yujc    for (u, j, c) ∈ W s.t. part(u) < j, i = part(u)    (IP2c)

              Xu, Yujc ∈ {0, 1}
As in the previous formulation, the first set of constraints forces a single node to be
chosen in each part. The second set of constraints says that if a node u is chosen, for each
position j, one of its “adjacent” cost bins must also be chosen (Figure 5.1). The third
set of constraints ensures that Yujc can be chosen only if some node v ∈ Vj is also chosen
such that wuv = c. We discard variables Yujc if there is no v ∈ Vj such that wuv = c.
Figure 5.1 gives a schematic drawing of these constraints.
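The bookkeeping behind the cost bins can be sketched as follows; this is illustrative indexing only, not the ILP model itself, and the identifiers are ours:

```python
from collections import defaultdict

def cost_bins(edges):
    """Group the edges of the motif graph into the cost bins indexing
    the Y variables.  Nodes are (part, index) pairs; edges maps (u, v)
    with part(u) < part(v) to an integer weight c.  The returned dict
    maps (u, j, c) to the nodes v in part j at distance c from u;
    only nonempty bins (the retained Yujc variables) appear."""
    bins = defaultdict(list)
    for (u, v), c in edges.items():
        bins[(u, v[0], c)].append(v)   # bin for Y_{u, part(v), c}
        bins[(v, u[0], c)].append(u)   # bin for Y_{v, part(u), c}
    return dict(bins)
```

The left-hand sums of constraints (IP2c) range exactly over the node lists these bins hold.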
It is straightforward to see that IP2 correctly models the motif-finding problem when the
variables are restricted to {0, 1}. For any choice of p-clique {u_1, . . . , u_p} of weight
µ = Σ_{i<j} w_{u_i u_j}, a solution of cost µ to IP2 can be found by taking X_{u_i} = 1
for i = 1, . . . , p, and taking Y_{u_i jc} = 1 for all j ≠ i such that w_{u_i u_j} = c. This
solution is easily seen to be feasible, and between any pair of positions i, j it contributes
cost w_{u_i u_j}; therefore, the total cost is µ. Conversely, consider any solution (X, Y)
to IP2 of objective value µ, and consider the clique formed by the nodes u such that
Xu = 1. Between every two positions i < j, the constraints (IP2a) and (IP2b) imply that
exactly one Yujc and one Yvid are set to 1 for some u ∈ Vi and v ∈ Vj and costs c, d. The
constraint (IP2c) corresponding to (u, j, c) with Yujc on its right-hand side can only be
satisfied if the sum on its left-hand side is 1, which implies c = d = wuv. Thus, a clique
of weight µ exists in the motif-finding graph problem.

Figure 5.1: Schematic of IP2. Adjacent to each node u there are at most |D| cost bins,
each associated with a variable Yujc. Associated with each cost c are the nodes v ∈ Vj
for which wuv = c (represented in the figure by stars). Constraints (IP2b) say that a
total weight of Xu must be apportioned over the bins, while constraints (IP2c) limit us
to choosing cost-bin variables for which some node v ∈ Vj is chosen such that wuv = c.
5.3.3 Advantages of IP2
In practice, IP2 has many fewer variables than IP1. IP1 has Np(N(p−1)/2+1) variables
and p+Np(p−1) constraints (ignoring the binary constraints). If none of the Yujc variables
can be thrown out, IP2 has Np((p−1)d+1) variables and p+Np(p−1)(d/2+1) constraints,
where d = |D|, the number of kinds of weights. If d < N/2, the second IP will have fewer
variables. In practice, d is expected to be much smaller than N , and while N could
reasonably be expected to grow large as longer and longer sequences become practical to
study, d is constrained by the geometry of transcription factor binding and will remain
small. Also, in practice, it is likely that many Yujc variables are removed because seq(u)
does not have matches of every possible weight in each of the other sequences. IP2, on
the other hand, will have O(d) times more constraints than IP1.
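To make this size comparison concrete, the two counts can be evaluated for illustrative parameters (the example values are ours, not taken from the experiments below):

```python
def ip1_variables(N, p):
    """Variable count for IP1: Np node variables plus
    N^2 * p(p-1)/2 edge variables, i.e., Np(N(p-1)/2 + 1)."""
    return N * p * (N * (p - 1) // 2 + 1)

def ip2_variables(N, p, d):
    """Variable count for IP2 (assuming no Yujc variables are
    discarded): Np node variables plus (p-1)d bin variables each."""
    return N * p * ((p - 1) * d + 1)

# E.g., p = 20 sequences, N = 500 windows each, motif length 10,
# Hamming distances, so d = 11 possible edge weights:
ip1 = ip1_variables(500, 20)     # 47,510,000 variables
ip2 = ip2_variables(500, 20, 11) #  2,100,000 variables
```

With these illustrative values, IP1 has roughly 47.5 million variables versus 2.1 million for IP2.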
There are additional practical considerations that may make the number of possible
edge weights small. It is often the case that one is interested in an optimal solution only
if it is of high enough quality, meaning that no motif instance in the solution is further
than α away from any other (the diameter of the solution is ≤ α). If this is the case,
edges of weight > α can be deleted. Such a requirement reduces d and makes IP2 still
smaller. In many applications, even if large diameter solutions are acceptable, there is an
expectation that the diameter is likely small, and a solution with a small diameter may
be preferred to one with a lower sum-of-pairs score but more outliers. In such a case,
the IP2 formulation may allow one to check for low diameter solutions quickly.
While the space requirement for the simplex algorithm is related to the number of
constraints and variables, running time is not necessarily directly related. Smaller in-
teger programs with weaker LP relaxations are often less useful for branch-and-bound
approaches to IP solving. Thus, we seek the tightest, smallest IP possible. Towards this
end, experiments must be performed to gauge the efficacy of various formulations on prac-
tical problems. We present experiments below which suggest IP2 can be more than an
order of magnitude faster than IP1.
5.3.4 Linear Programming Relaxation
The typical approach to solving an ILP is to solve the linear program derived from the
ILP by dropping the requirement that the variables be in {0, 1}, and instead requiring
only that the variables lie in the continuous range [0, 1], as we did in Chapter 3. This
modified problem is called the linear programming (LP) relaxation. An ILP formulation
is weaker than another if the corresponding LP relaxation of the first admits solutions
of lower objective values than are possible with the second. Weaker relaxations are often
less useful in solving the corresponding ILP.
The LP relaxation of IP2, which we refer to as LP2, is weaker than the LP relaxation
of IP1, which we refer to as LP1. (Note that in our case the constraints Xu ≤ 1 and Yujc ≤ 1
are implied and not included explicitly.) In this section, we present a fairly natural class
of constraints that, if added to LP2, will make it as strong as LP1. In the subsequent
section, we show that in practice we can focus on just two types of these constraints, and
we are able to solve the original ILP iteratively by adding cutting planes corresponding
to violated constraints of these types.
Additional constraints. Focus on a single pair of positions i and j. In IP1 the edge
variables between Vi and Vj explicitly model the bipartite graph between those two posi-
tions. In IP2, however, the bipartite graph is only implicitly modeled by an understanding
of which Y variables are compatible to be chosen together. We study this implicit rep-
resentation by considering the bipartite compatibility graph Cij between two positions i
and j. Intuitively, we have a node in this compatibility graph for each Yujc and Yvic, and
there is an edge between the nodes corresponding to Yujc and Yvic if wuv = c. These
two Y variables are compatible in that they can both be set to 1 in IP2. More formally,
Cij = (Aij ∪ Aji, F ), where Aij = {(u, j, c) : u ∈ Vi, c ∈ D} is the set of indices of Y
variables adjacent to a node in Vi, going to position j, and Aji is defined analogously,
going in the opposite direction. The edge set F is defined in terms of the neighbors of a
triple (u, j, c) as follows. For u in position i, let N (u, j, c) = {(v, i, c) ∈ Aji : wuv = c} be
the neighbors of (u, j, c). They are the indices of the Yvic variables adjacent to position
j going to position i so that the edge {u, v} has weight c. There is an edge in F going
between (u, j, c) and each of its neighbors. We call c the cost of triple (u, j, c). All this
notation is summarized in Figure 5.2.
In any feasible integral solution, if Yujc = 1, then some Yvic for which (v, i, c) ∈
N (u, j, c) must also be 1. Extending this insight to subsets of the Yujc variables yields a
class of constraints which will ensure that the resulting linear programming formulation
is as tight as LP1. If Qij is a subset of Aij , then let N (Qij) =⋃
(u,j,c)∈QijN (u, j, c) be
the set of indices that are neighbors to any vertex in Qij . The following constraint is
83
Figure 5.2: Notation used in the faster ILP formulation. N(u, j, c) is shown assuming
that v and w are the only nodes in Vj that have cost c with u. The two columns of circles
represent parts of the graph Vi and Vj, with each circle representing a node. The solid
lines adjacent to each circle represent the Yujc or Yvic variables associated with the node.
Aij and Aji (dotted boxes) are the sets of these variables associated with the pair of graph
parts i and j. Finally, the function N(u, j, c) maps a variable Yujc to a set of compatible
Yvic variables (squiggly lines).
Figure 5.3: Graph Cij used to show constraints (5.1) are sufficient. Nodes r and s are a
source and sink, respectively, added in the proof of Theorem 5.3.1. Each solid node
corresponds to a Y variable. The edges between Aji and Aij have infinite capacity, while
those entering s or leaving r have capacity equal to the value of the Y variable to which
they are adjacent. The shading gives an r–s cut.
true in IP2 for any such Qij:
Σ_{(u,j,c)∈Qij} Yujc ≤ Σ_{(v,i,c)∈N(Qij)} Yvic .    (5.1)
That is, choose any set of Yujc variables adjacent to position i that go to position j; their
sum must be at most the sum of the Y variables of their neighbors. Notice that
the third set of constraints in IP2 is of the form (5.1), taking Qij to be the singleton set
{(u, j, c)}.
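The check encoded by inequality (5.1) is easy to state programmatically. Below is a minimal sketch in Python, using hypothetical data structures (a dict Y of fractional values indexed by triples, and a neighbors function returning N(u, j, c)); it illustrates the inequality only and is not the formulation used in our experiments.

```python
def constraint_satisfied(Q, Y, neighbors, tol=1e-9):
    """Check one constraint of the form (5.1): the total Y weight on the
    triples in Q must not exceed the total Y weight on N(Q), the union of
    the neighbor sets of the triples in Q."""
    lhs = sum(Y[q] for q in Q)
    n_of_q = set()
    for q in Q:
        n_of_q |= neighbors(q)  # N(Q) is the union of per-triple neighborhoods
    rhs = sum(Y[t] for t in n_of_q)
    return lhs <= rhs + tol  # small tolerance for floating point
```

For a singleton Q = {(u, j, c)}, this reduces to the third set of constraints already present in IP2.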
Theorem 5.3.1 If, for every pair i < j, constraints of the form (5.1) are added to IP2
for each Q ⊆ Aij such that all triples in Q have the same cost, then the resulting LP
relaxation LP2′ is as strong as the relaxation LP1 of IP1.
Proof. For any feasible solution to LP2′, we will show that there is a feasible solution
to LP1 with the same objective value, thereby demonstrating that the optimal value of
LP2′ is no smaller than that of LP1. In particular, fix a solution (X, Y)
to LP2′ with objective value γ. We need to show that, for any feasible distribution of
weights on the Y variables, a solution to LP1 can be found with objective value γ.

In order to reconstruct a solution to LP1 of objective value γ, we reuse the values of
the node variables Xu from the solution to LP2′; it remains to assign values to the edge
variables Xuv to complete the solution. Recall the compatibility graph Cij described
above. Because all edges in Cij are between nodes of the same cost, Cij is really |D|
disjoint bipartite graphs C^c_ij, one for each cost. Let A^c_ij ∪ A^c_ji be the node set of the
subgraph C^c_ij for cost c. Each edge in a subgraph C^c_ij corresponds to one edge in the graph
G underlying LP1. Conversely, each edge in G corresponds to exactly one edge in one of
the C^c_ij graphs (if edge {u, v} has cost c1, it corresponds to an edge in C^{c1}_ij). We will thus
proceed by assigning values to the edges of the various C^c_ij, and this will yield values for
the Xuv.
If y(A) := ∑_{(u,j,c) ∈ A} Yujc, then by the sets of constraints (IP2a) and (IP2b), y(Aij) =
y(Aji) = 1. Since the constraints (5.1) are included with Q = A^c_ij for each cost c, by the
pigeonhole principle, y(A^c_ij) = y(A^c_ji) for every cost c. Thus, for each subgraph C^c_ij, the
weight placed on the left half equals the weight placed on the right half. We will consider
each induced subgraph C^c_ij separately.
We modify C^c_ij as follows to make it a directed, capacitated graph. Direct the edges
of C^c_ij so that they go from A^c_ji to A^c_ij, and set the capacities of these edges to be infinite.
Add two dummy nodes {r, s}, edges directed from r to each node in A^c_ji, and edges
from each node in A^c_ij to s. Every edge adjacent to r or s is also adjacent to some
node representing a Y variable; give each such edge capacity equal to the value of the Y
variable associated with that node. See Figure 5.3.
The desired solution to LP1 can be found if the weight on the nodes (Y variables)
of each compatibility subgraph can be spread over the edges. In other words, a solution
to LP1 of weight γ can be found if, for each pair (i, j) and each cost c, there is a flow of weight
y(A^c_ij) from r to s in the graph constructed above. The assignment to Xuv is then the flow
crossing the corresponding edge in the C^c_ij of the appropriate cost. In the following lemma,
we show that the set of constraints described in the theorem ensures that the minimum
cut in the constructed graph is ≥ y(A^c_ij), and thus that there is a flow of the required
weight. The proof of this fact can be found in (Cook et al., 1997), pages 54–55, and is
reproduced in our notation below. □
Lemma 5.3.2 The minimum cut of the flow graph described in the proof of Theorem 5.3.1
(and shown in Figure 5.3) is y(A^c_ij).
Proof. (Adapted from (Cook et al., 1997).) Recall that the capacities of the edges leaving
r are the Yvic and those entering s are the Yujc, that the total capacity leaving r equals the
total capacity entering s, and that this total capacity equals y(A^c_ij). We want to show that the minimum
r – s cut in this graph is ≥ y(A^c_ij).
Consider an r – s cut {r} ∪ A ∪ B, where A ⊆ A^c_ji and B ⊆ A^c_ij. (Such a cut is shaded
in Figure 5.3.) Define Ā = A^c_ji \ A and B̄ = A^c_ij \ B.

Figure 5.4: Example graph C^c_ij where the constraints added in our heuristic approach are insufficient. All the constraints with |Q| = 1 or |Q| = |Aij| are satisfied, but a flow of value 1 does not exist from the right side to the left side in the augmented graph (as in Figure 5.3).

If any edges go between A and B̄, then the cut has infinite capacity, and we are done.
Otherwise, the value of the cut is the sum of the capacities of the edges leaving r and going
to Ā plus the sum of the capacities of the edges entering s from a node in B. We will now
show that

y(Ā) ≥ y(B̄) ,  (5.2)

which implies that the value of the cut, y(Ā) + y(B), is ≥ y(A^c_ij).
Assume for a moment that every node in Ā has a neighbor in B̄. Then N(B̄) = Ā,
because there are no edges between A and B̄. By (5.1), y(B̄) ≤ y(N(B̄)) = y(Ā). On
the other hand, if there is a node in Ā that has no neighbor in B̄, then we can
move that node into A to form A′ (without increasing the capacity of the cut), and the above
argument shows that y(A^c_ji \ A′) ≥ y(B̄), which implies y(Ā) ≥ y(B̄) since A^c_ji \ A′ ⊆ Ā.
□
It is also clear that the linear relaxation LP2′ described in Theorem 5.3.1 is no stronger
than LP1, as any solution to LP1 can be converted into a solution to LP2′ by transferring the
weight on each edge variable Xuv onto Yujc and Yvic, where wuv = c. This solution to LP2′
satisfies all the constraints in the theorem.
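This conversion is mechanical. A minimal sketch (the dict-based inputs are hypothetical: X_edge maps node pairs to LP1 edge values, cost gives the weight wuv, and position_of maps each node to its graph part):

```python
from collections import defaultdict

def lp1_to_lp2(X_edge, cost, position_of):
    """Transfer the weight of each LP1 edge variable X_uv onto the pair of
    LP2' variables Y_ujc and Y_vic, where c = w_uv."""
    Y = defaultdict(float)
    for (u, v), x in X_edge.items():
        c = cost[(u, v)]
        i, j = position_of[u], position_of[v]
        Y[(u, j, c)] += x  # u's variable toward v's position, at cost c
        Y[(v, i, c)] += x  # v's variable toward u's position, at cost c
    return dict(Y)
```

Since each edge contributes its full weight to both of its endpoint variables, the constructed Y inherits the feasibility of the original X.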
There are exponentially many constraints considered in Theorem 5.3.1. Such an
LP can nonetheless be solved in polynomial time (by the ellipsoid algorithm (Grotschel et al., 1993))
if a polynomial-time separation algorithm exists. A separation algorithm must find a
violated constraint, if one exists, or report that no constraint is violated. The next
lemma gives such a separation algorithm, which simply formalizes the intuition in the
proof that constraints (5.1) are sufficient: all constraints are satisfied only if the maximum
flow is large enough, which is true only if the minimum cut is large enough; so if the minimum cut is
small, that cut will give us a violated constraint.
Lemma 5.3.3 There is a polynomial-time algorithm that can find a violated constraint
in LP2′ or report that none exists.
Proof. Because each constraint in (5.1) involves variables of a single cost, if (5.1) is
violated for some set Q, then Q is a subset of some A^c_ji for a triple i, j, c, and so we can
consider each subgraph C^c_ij independently. The proof of Theorem 5.3.1 shows that there is
a violated constraint of the form (5.1) between i and j involving variables of cost c if and only
if the maximum flow in C^c_ij is less than y(A^c_ij). Thus, the minimum cut can be found for
each triple i, j, c, and, if a triple i, j, c is found where the minimum cut is less than y(A^c_ij),
one knows that a violated constraint exists between positions i and j with Q ⊂ A^c_ji.
The minimum cut can then be examined to determine the violated constraint explicitly.
Let {r} ∪ A ∪ B be the minimum r – s cut in C^c_ij, with A ⊆ A^c_ji and B ⊆ A^c_ij, and let
Ā = A^c_ji \ A and B̄ = A^c_ij \ B (following the notation of Lemma 5.3.2). Such a cut
is shaded in Figure 5.3. Let m be the capacity of this cut, and assume, because we
are considering a triple i, j, c that was identified as having a violated constraint, that
y(A^c_ij) > m. Because m < ∞, there are no edges going from A to B̄, and hence two things
hold: (1) m = y(B) + y(Ā), and (2) N(A) ⊆ B, and therefore y(N(A)) ≤ y(B). Chaining
these facts together, we have

y(A) = y(A^c_ij) − y(Ā) > m − y(Ā) = y(B) ≥ y(N(A)) .

Thus, the set A is a set for which a constraint of the form (5.1) is violated. □
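The separation routine can be sketched directly from this proof: build the capacitated graph for one subgraph C^c_ij, compute a maximum flow, and, if it falls short of y(A^c_ij), read a violated set off the source side of the minimum cut. The sketch below (a plain Edmonds–Karp max flow, with hypothetical dict-based inputs) is illustrative only; in practice we rely on the heuristic constraint selection described later.

```python
from collections import deque

INF = float("inf")

def max_flow(cap, s, t):
    """Edmonds-Karp on a capacity dict-of-dicts. Returns (flow value, set of
    nodes reachable from s in the final residual graph, i.e. the source side
    of a minimum cut)."""
    for u in list(cap):                      # add zero-capacity reverse edges
        for v in list(cap[u]):
            cap.setdefault(v, {}).setdefault(u, 0.0)
    flow = {u: {v: 0.0 for v in cap[u]} for u in cap}
    value = 0.0
    while True:
        parent = {s: None}                   # BFS for an augmenting path
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in cap[u]:
                if v not in parent and cap[u][v] - flow[u][v] > 1e-9:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return value, set(parent)        # no augmenting path remains
        bottleneck, v = INF, t               # augment along the path found
        while parent[v] is not None:
            u = parent[v]
            bottleneck = min(bottleneck, cap[u][v] - flow[u][v])
            v = u
        v = t
        while parent[v] is not None:
            u = parent[v]
            flow[u][v] += bottleneck
            flow[v][u] -= bottleneck
            v = u
        value += bottleneck

def find_violated_set(Y_left, Y_right, edges):
    """Separation for one subgraph: Y_left holds the values of the variables
    in A^c_ji, Y_right those in A^c_ij, and edges lists compatible
    (left, right) pairs. Returns a subset A of the left triples whose
    constraint (5.1) is violated, or None if the flow saturates y(A^c_ij)."""
    cap = {"r": dict(Y_left)}                # r -> left, capacity = Y value
    for left, right in edges:
        cap.setdefault(left, {})[right] = INF
    for right, y in Y_right.items():
        cap.setdefault(right, {})["s"] = y   # right -> s, capacity = Y value
    value, reachable = max_flow(cap, "r", "s")
    if value >= sum(Y_right.values()) - 1e-9:
        return None
    return {left for left in Y_left if left in reachable}
```

In the toy instance below, two left triples of weight 0.5 share a single right neighbor of weight 0.5, so at most 0.5 units of flow can cross and the pair of left triples is returned as a violated set.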
In practice, the ellipsoid algorithm is often not as fast as the (theoretically slower)
simplex algorithm. We can use the faster simplex algorithm here because not all of these
constraints are necessary for real problems. Some particular choices of Qij yield constraints
that are intuitively very useful and usually suffice in practice. The constraints with
the largest Qij, that is, Qij = Aij, were used in the proof of Theorem 5.3.1; we have
found them useful in practice, and they are included in all the LP relaxations solved in
Section 5.4. In fact, for this choice of Qij, inequality (5.1) holds with equality. LP2 already includes all
the constraints with Qij equal to a singleton set {(u, j, c)} ⊂ A^c_ij. Rather than including
constraints with 1 < |Qij| < |A^c_ij|, we include the constraints with i and j reversed, taking
Qji to be a singleton {(v, i, c)} ⊆ A^c_ji for v ∈ Vj. More detail about our approach to and experience
with real problems can be found in Section 5.4. Examples can be constructed for which
these particular constraints are insufficient to make LP2 as tight as LP1. For example,
Figure 5.4 gives a graph C^c_ij for a single cost for which all these constraints hold but for
which no solution to LP1 can be constructed. However, we have not encountered such
pathological cases in practice, and so we do not explore adding constraints with |Q| > 1.
5.3.5 Integrality Gap
As discussed in Section 3.2.4, if arbitrary positive weights are allowed on the edges, the
integrality gap of IP1 and IP2 can be made as large as desired. Here, however, the possible
weights are constrained in several ways. First, they are integers, a fact we have exploited
in our reformulation of the problem. Second, when using a distance metric to compare
windows, the weights on the edges of the graph satisfy the triangle inequality; this
is the key property that makes a reasonable approximation algorithm possible. Finally,
the weights are not independent: they are derived from overlapping sliding windows, a
property that we have not yet been able to exploit. There is thus hope, in the motif-finding
case, that it may be possible to prove that the LP relaxation gives a good lower bound on
the optimum.
Figure 5.5: Example for which the integrality ratio of the LP relaxation is at least 4/3. The drawn edges have cost 2, while the undrawn edges have cost 1. Any integral solution will have objective value 4, but placing 0.5 on each node yields cost 3.
In the setting where the distance function is the Hamming distance, the gap is at least
4/3. An example can be seen in Figure 5.5: one triangle is labeled with length-ℓ strings
each consisting of a single type of nucleotide (ℓ = 2 in the figure). The other is labeled
with length-ℓ strings consisting of 1/2 of one base and 1/2 of another. That is, the drawn
edges have weight ℓ, while the other edges have weight ℓ/2, so the cost of any integral
solution is ℓ + (1/2)ℓ + (1/2)ℓ = 2ℓ. As in Section 3.2.4, by putting 0.5 on each node and
on each undrawn edge, the fractional solution obtains a value of 6 × (0.5)(1/2)ℓ = 1.5ℓ. The
integrality gap is thus at least 4/3.
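The arithmetic of this example is easily verified mechanically; a quick check with exact rationals (the name ell below stands for the window length ℓ):

```python
from fractions import Fraction

ell = Fraction(2)  # window length ℓ, as in Figure 5.5

# Cost of any integral solution: one drawn edge of weight ℓ plus two
# undrawn edges of weight ℓ/2 each.
integral = ell + Fraction(1, 2) * ell + Fraction(1, 2) * ell

# Fractional solution: weight 1/2 on each of the 6 undrawn edges,
# each of cost ℓ/2.
fractional = 6 * Fraction(1, 2) * (ell / 2)

ratio = integral / fractional  # integrality gap of this instance
```

With ℓ = 2 this gives 4 versus 3, a ratio of exactly 4/3.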
5.4 Computational Results
5.4.1 Methodology
We have found the following methodology to work well in practice. We first solve the
LP relaxation of IP2. If the solution is integral, we are finished. Otherwise, we add
any violated constraints of the form (5.1) with i and j reversed and with |Qji| = 1, and
re-solve. We have never encountered a problem for which more constraints than these were
necessary to make LP2 as tight as LP1. We use the optimal basis of the previous iteration
as a starting point for the next, setting the dual variables for the added constraints to be
basic.
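Abstractly, the loop just described is a standard cutting-plane iteration. A generic sketch (solve and separate are placeholders for the LP solver and the violated-constraint search; the warm-starting of the simplex basis described above is not modeled):

```python
def cutting_plane(solve, separate):
    """Solve a relaxation, search for violated constraints, add them, and
    re-solve until no violated constraint remains.

    solve(constraints) -> solution of the relaxation with the constraints added;
    separate(solution) -> list of violated constraints (empty when done)."""
    constraints = []
    while True:
        solution = solve(constraints)
        cuts = separate(solution)
        if not cuts:
            return solution, len(constraints)
        constraints.extend(cuts)
```

As a toy stand-in for an LP, minimizing a scalar x over lazily revealed lower bounds (solve returns the largest bound seen, separate reveals x ≥ 3 and then x ≥ 5) terminates after two rounds with x = 5.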
Because the choice of simplex variant can make a large difference in running
time in practice, LP1 was solved using two different variants. In the first (primal
dualopt), the primal problem was solved using the dual simplex algorithm. In the second
(dual primalopt), the dual problem was solved using the primal simplex algorithm.
While, in theory, these two variants should perform similarly, in practice their running times
can differ significantly. Applying the dual simplex method to the dual problem or the
primal simplex method to the primal problem is not expected to perform as well, and small-scale
testing confirms this intuition (data not shown). LP2 was always solved using the dual
simplex method applied to the primal problem, to make adding constraints faster.
The linear and integer programs were specified with AMPL and solved using CPLEX
7.1. All experiments were run on a public 1.2 GHz SPARC workstation shared by many
researchers, using a single processor. All the timings reported are in CPU seconds on this
machine. Any problem taking longer than 5 hours was aborted.
5.4.2 Test Sets
We present results on identifying the binding sites of 50 transcription factor families. We
construct our data set from the data of (Robison et al., 1998; McGuire et al., 2000) in
a fashion similar to (Osada et al., 2004). In short, we remove all sites for sigma-factors,
duplicate sites, and sites that could not be unambiguously located in the genome.
For each transcription factor under consideration, we extract the proteins it regulates
and gather at least 300 base pairs of genomic sequence upstream of their transcription
start sites. In those cases where the binding site is located further upstream, we extend
the sequence to include the binding site. The window size for each family was chosen
based on the length of the consensus binding site, determined from other biological
Table 5.1: Sizes for the 50 problems considered: number of sequences (p), motif length (ℓ), and total number of nodes in the underlying graph (n).

TF      ℓ   p   n       TF      ℓ   p   n       TF      ℓ   p   n
ada     31  3   810     fruR    16  11  4082    metR    15  8   3312
araC    48  6   1715    fur     18  9   3182    modE    24  3   934
arcA    15  13  4790    galR    16  7   2188    nagC    23  6   1870
argR    18  17  5960    gcvA    20  4   1234    narL    16  10  3301
carP    25  2   552     glpR    20  11  3829    ntrC    17  5   1516
cpxR    15  9   2614    hipB    30  4   1084    ompR    20  9   3057
cspA    20  4   1410    hns     11  5   1485    oxyR    39  4   1048
cynR    21  2   854     hu      16  2   571     pdhR    17  2   568
cysB    40  3   783     iclR    15  2   588     phoB    22  14  4618
cytR    18  5   1695    ilvY    27  2   1079    purR    26  20  5856
dnaA    15  8   2381    lacI    21  2   560     rhaS    50  2   502
fadR    17  7   2122    lexA    20  19  5554    soxS    35  13  4004
farR    10  3   873     lrp     25  14  4090    torR    10  4   2198
fhlA    27  2   731     malT    10  10  3410    trpR    24  4   1108
fis     35  18  5371    marR    24  2   813     tus     23  5   1390
flhCD   31  3   810     melR    18  2   717     tyrR    22  17  5258
fnr     22  12  3705    metJ    16  15  5754
studies. The families, their sizes and the length of the binding site are shown in Table 5.1.
5.4.3 Performance of the LP Relaxations
We solved LP1 and LP2 relaxations for the 50 problems in Table 5.1. The running times
of LP2 are shown in Figure 5.6(b). Generally, the initial set of constraints is sufficient to
get a tight solution. Six problems required adding constraints to LP2 in order to make it
as tight as LP1. The problems flhCD, torR, and hu required 2 cutting plane iterations,
ompR required 3, oxyR required 4, and nagC required 5. Running times reported in
Figure 5.6(b) are the sum of the solve times of all cutting plane iterations.
Of the 45 problems solvable in < 5 hours, only 3 were not integral. This is somewhat
surprising. Of course, there is much structure to real problems, which may make them less
susceptible to the worst-case analysis. The success of the LP relaxations in finding integral
Figure 5.6: (a) Speed-up factor of LP2 over LP1 as defined in (5.3), for problems for which some method was able to complete in < 5 hours. Shaded bars correspond to problems for which LP1 did not finish in < 5 hours. The doubly shaded bar (far right) marks the problem for which LP2 did not finish in < 5 hours, but LP1 did. (b) Running times in seconds for LP2; the y-axis is in log scale. (c) Matrix size for LP2 divided by that for LP1.
solutions suggests that handling non-integral cases may not be as pressing a problem as
one would think.
We compared the running time of LP2 with that of LP1 by taking as the running time
of LP1 that of the simplex variant that performed best. In other words, we take the
speed-up factor to be:

min(time(primal dualopt), time(dual primalopt)) / time(LP2) .  (5.3)
This gives LP1 the benefit of the doubt: in practice, always achieving the runtime used
in the numerator would require running each variant in parallel using 2 processors. The
speed-up factors for these problems are shown in Figure 5.6(a). For 10 problems, neither
simplex variant completed in < 5 hours when applied to LP1, whereas LP2 did. For these
problems, the numerator of (5.3) was taken to be 5 hours; this gives a lower bound on
the speed up. For one problem, cytR, the reverse is true and LP2 could not finish within 5
hours, while both simplex variants successfully solved it using LP1. For this problem, the
denominator was taken to be 5 hours, and (5.3) gives an upper bound. For 5 problems, no
method found a solution in < 5 hours; these are omitted from Figure 5.6. Only cytR was
slower using LP2, and an order of magnitude increase in speed is common when using LP2
compared with LP1.
As expected, the size of the constraint matrix (defined as the number of constraints
times the number of variables) is often smaller for LP2. Figure 5.6(c) plots the ratio

(size for LP2) / (size for LP1) .  (5.4)

While in 5 cases the matrix for LP2 is larger, in many cases it is less than 50% of the size of the
matrix for LP1. A smaller constraint matrix can often lead to faster iterations of the
simplex algorithm.
We also compared the motifs found by our approach to the set of known transcription
factor binding sites. In all cases, we found motifs that are at least as well conserved as
the actual binding sites (measured by average information content). Since our test data
are real genomic sequences, co-regulated genes may in fact have multiple shared binding
sites for different transcription factors.
5.5 Conclusions
In this chapter, we introduced a novel ILP formulation of the motif finding problem that
works well in practice. In particular, it finds optimal solutions to motif-finding instances
significantly faster than a previous ILP formulation introduced by (Zaslavsky and Singh,
2005). We note that a variety of graph pruning and decomposition techniques have been
introduced for motif finding (e.g., (Reinert et al., 1997; Pevzner and Sze, 2000; Zaslavsky
and Singh, 2005)). It is likely that, in conjunction with those techniques, our formulation
will be able to tackle problems of significantly larger sizes.
Our work opens many interesting avenues for future work. While the underlying graph
problem for motif finding is essentially identical to that of Chapters 2–4, one central
difference is that when minimizing distance based on nucleotide matches and mismatches,
the triangle inequality is satisfied. The current ILP formulations do not exploit this, and
as a result, work in its absence. Another feature commonly present in motif finding that
is not used here is that the edge weights in the graph are not independent, as each node
represents a subsequence from a window sliding along the DNA. Incorporating either the
triangle inequality or the correlation between edge weights into the ILP or its analysis may
lead to further advances in computational methods for motif finding. Finally, it would be
useful to extend the basic formulation presented here to other motif finding applications,
for example, to find multiple co-occurring or repeated motifs.
Chapter 6
Identifying Functionally Related
Yeast Proteins Using Inferred
Evolutionary History
We now consider protein function more generally and focus on using cross-
genomic evolutionary information in the form of the widely used method of
phylogenetic profiles to predict functional linkages between proteins. In this
chapter, we give two methods for making use of the relationships between the
organisms to improve predictions. Along the way, we develop some insight into
what kinds of evolutionary patterns are most indicative of shared function.
6.1 Introduction
The inheritance patterns of interacting or functionally related proteins tend to be similar
due to shared evolutionary pressures. If we knew the evolutionary history of two pro-
teins — a description of when the proteins were gained, lost, or transferred over time
— we could use the assumption that similar histories indicate similar function to decide
whether the proteins are functionally related. Of course, we do not generally know the
complete evolutionary history. As a proxy, the method of phylogenetic profiles, introduced
in (Tatusov et al., 1997; Gasterland and Ragan, 1998; Pellegrini et al., 1999), uses the
presence or absence of similar proteins across many present-day organisms.
In the phylogenetic profile method, each protein in the organism under study is rep-
resented by a vector with one dimension for each of many currently extant species. In
the general case, an entry in the vector represents the degree to which a similar protein
(or homolog) exists in the species corresponding to that dimension. In some applications,
entries are simply 1 if a homolog is believed to exist and 0 otherwise, giving bit vectors
representing the presence and absence of the protein across different species. We refer
to these vectors (either 0/1 or real-valued) as profiles. Combining the assumption that
shared evolution implies shared function with this representation of a protein’s evolution-
ary history we have
shared function ⇐⇒ similar evolution ⇐⇒ related profiles. (∗)
Neither of the ⇐⇒ above is a mathematical absolute; rather, they are simply observations
that tend to hold. Much of the progress in applying phylogenetic profiles has been
in refining the definition of the relation at the right end of the above chain of reasoning
so that the implications hold as best as possible. As originally proposed (Pellegrini et al.,
1999), the profiles were binary vectors and simple Hamming distance was used to define
the relation. The Pearson correlation coefficient (Wu et al., 2003), hypergeometric distri-
bution (Wu et al., 2003), and mutual information (Wu et al., 2003; Date and Marcotte,
2003) were later suggested as more refined ways of measuring the relationship between
profiles. We give two new measures in this chapter that more fully exploit the connections
among the organisms from which the profiles are constructed.
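For binary profiles, the first and last of these measures are simple to compute. A minimal sketch (profiles are 0/1 lists; the mutual information is the plug-in estimate, in bits, from the empirical joint counts):

```python
from math import log

def hamming(p, q):
    """Number of positions where two binary profiles disagree."""
    return sum(a != b for a, b in zip(p, q))

def mutual_information(p, q):
    """Mutual information (in bits) of the joint presence/absence
    distribution of two binary profiles, estimated from empirical counts."""
    n = len(p)
    joint = {}
    for a, b in zip(p, q):
        joint[(a, b)] = joint.get((a, b), 0) + 1
    mi = 0.0
    for (a, b), count in joint.items():
        pxy = count / n                       # joint frequency (always > 0 here)
        px = sum(x == a for x in p) / n       # marginal frequency of a in p
        py = sum(y == b for y in q) / n       # marginal frequency of b in q
        mi += pxy * log(pxy / (px * py), 2)
    return mi
```

Identical profiles with balanced presence/absence yield 1 bit of mutual information, while profiles whose entries are empirically independent yield 0.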
The task of identifying proteins with similar functions can be formalized as follows:
Functional-linkage problem: Given a collection of functions {Fi}, each
represented by the set of proteins it contains, find those pairs of proteins
(g1, g2) that are in Fi × Fi for some Fi.
Solving this problem is a potentially fruitful way to begin to make sense of the complex
web of interactions in the cell. It has been proposed (Hartwell et al., 1999) that the
workings of the cell are organized into modules, where there are many interactions and
dependencies within a module and fewer between modules. Detecting that two proteins
are participants in the same pathways is useful for the discovery of such modules. For
example, (Date and Marcotte, 2003) use a network derived from comparing phylogenetic
profiles to uncover clusters of proteins of related function for several organisms.
Even weak evidence that two proteins are functionally related can be beneficial when
incorporated into meta-methods, such as Bayesian networks, that combine information
from many weak classifiers to make very accurate predictions. Such methods have been
useful recently in predicting shared function (Troyanskaya et al., 2003; von Mering et al.,
2003) and protein interactions (Jansen et al., 2003), and phylogenetic profiles have been
incorporated into such methods (e.g. (Marcotte et al., 1999b; Bowers et al., 2004b; Lee
et al., 2004; Lu et al., 2005)). Thus, while the imperfect nature of the reasoning (∗) limits
the accuracy of phylogenetic profiles in isolation, the method can be very useful when
combined with several indicators of shared function.
In this chapter, we improve the ability of phylogenetic profiles to identify proteins
involved in the same pathways by using the evolutionary relationships (phylogenetic in-
formation) between the species used to create the profiles. Vector-based distance measures
between profiles do not take into account such relationships. Here, we introduce a new
probabilistic description of evolutionary history by assigning probabilities to gene pres-
ence at hypothetical ancestral nodes of a species tree relating the extant organisms in the
profile. We also investigate informative patterns of gain and loss in such evolutionary his-
tories and give two methods for using them to predict whether two proteins are involved
in the same pathway.
6.1.1 Our Contributions
We begin our study by investigating to what extent the chain of implications (∗) fails.
To do this, we give a per-function upper bound on the ROC performance of any classifier
that uses the cross-genomic occurrence of proteins to separate the proteins involved in
a particular function from other proteins. This analysis leads us to several classes of
profiles that are common sources of errors and points out those functions that tend to be
more difficult to handle. We show that, in general, phylogenetic profiles contain sufficient
information to map a protein to a function.
We then probe the ability of mutual information, as proposed for the current appli-
cation by (Date and Marcotte, 2003), to identify functional linkages. We show the first
ROC analysis of the performance of mutual information on identifying proteins involved
in the same KEGG (Kanehisa, 1997; Kanehisa and Goto, 2000) pathways and discuss
when certain variations in the application of mutual information are to be preferred over
others. We also suggest that ROC analysis, though helpful, can sometimes be misleading;
we give an alternative per-function assessment of performance. Mutual information is a
reasonable similarity measure between profiles, and as such it will be our baseline method
against which to compare.
Our experiments suggest that prediction of shared function can benefit from the ad-
dition of information about the relationships between species. We augment the profile
vector with a phylogenetic tree with leaves corresponding to the organisms used to create
the profiles and describe a method for inferring whether a gene is present or absent at
the ancestral nodes. We show that taking advantage of these inferred states to compare
two proteins by simply counting contemporaneous gene gains or losses can exceed the
performance of mutual information for many pathways.
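Counting contemporaneous events is straightforward once per-edge events have been inferred. A hypothetical sketch (each protein's inferred history is represented here as a dict mapping tree edges to +1 for a gain, -1 for a loss, and 0 for no event; these structures are illustrative, not our implementation):

```python
def shared_events(events1, events2):
    """Count the tree edges on which two proteins underwent the same event:
    contemporaneous gains and contemporaneous losses."""
    shared_gains = sum(1 for e in events1 if events1[e] == events2.get(e) == 1)
    shared_losses = sum(1 for e in events1 if events1[e] == events2.get(e) == -1)
    return shared_gains, shared_losses
```

The pair of counts (or their sum) can then serve as a similarity score between the two proteins.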
Finally, we give a scoring scheme that supplements simple counting of events and
show that, if we are permitted to extract some evidence about which evolutionary events
are most indicative of shared function from a set of training examples, we can again
detect functionally linked proteins better than mutual information for most pathways.
An analysis of the parameters extracted from the training examples will be instructive
about what local patterns are most suggestive of shared function.
6.1.2 Previous Work
In addition to phylogenetic profiles, several other methods that do not depend on sequence
similarity for detecting shared function have been proposed. For example, proteins can be
predicted to be functionally related if they have fused in one or more genomes (Enright
et al., 1999; Marcotte et al., 1999a), or are located near one another in a genome (Overbeek
et al., 1999).
Mutual information applied to phylogenetic profiles has been used to identify proteins
involved in similar pathways in (Date and Marcotte, 2003), where they show that actual
pairs of profiles often have much higher mutual information than would be expected by
chance as modeled by shuffled profiles. Further, they show that the higher the mutual
information between profiles, the more likely the proteins take part in a shared KEGG
pathway. This establishes mutual information as a reasonable way to compare profiles.
Recently, there has been some work on incorporating relationships between species
into the prediction of shared function or protein interactions. In (Goh et al., 2000; Pazos
and Valencia, 2001), the authors propose representing the evolutionary history of a pro-
tein by an n× n matrix, with one row and one column for each of n species. The entries
of this matrix are estimates of evolutionary distance between the corresponding homologs
in two organisms. Two matrices representing proteins are compared using the linear cor-
relation coefficient over their entries. The method has been applied to detecting domains
that are in the same protein, predicting protein-protein interactions in E. coli (Pazos and
Valencia, 2001), comparing chemokines and chemokine receptors, and comparing inter-
acting domains in phosphoglycerate kinase proteins (Goh et al., 2000). This method is
more computationally expensive than the phylogenetic profile method as it requires com-
putation of, potentially, n×n alignments for each pair of proteins considered. As pointed
out in (Pazos and Valencia, 2001), successful large-scale application can depend on the
quality of the alignments used to populate the distance matrix.
In the above case, as in our setting, the mapping between leaves of the two trees to
be compared is known. In a slightly different application, two groups (Gertz et al., 2003;
Ramani and Marcotte, 2003) use simulated annealing to rearrange the rows and columns
of two distance matrices to match the leaves of two trees so as to maximize the correlation
between the matrices and discover this mapping. This allows the identification of specific
interacting partners among the profiles of two families known to interact generally. It has
been successfully applied to match chemokines and tgfβ ligands with their receptors (Gertz
et al., 2003) and histidine kinase sensors and their regulators as well as several other
families (Ramani and Marcotte, 2003). A topology-based method of similar spirit was
recently introduced in (Jothi et al., 2005). While potentially very useful for small-scale
investigation, such a method is not yet applicable to genome-wide analysis.
(Liberles et al., 2002) suggest using a presence / absence labeling of the internal nodes
of a species tree derived from a minimum parsimony criterion. They note that those pairs
of profiles that have low parsimony scores are less useful in predicting shared keyword
annotations. Their preliminary testing provides some evidence that shared evolutionary
events may be informative.
For the different problem of classifying profiles into known functions using SVMs, (Vert,
2002) gives a tree kernel that makes use of a species tree.
Independent of the work in this chapter, (Barker and Pagel, 2005) recently reported a
method similar in spirit to our LRATIO method (see below). They find a maximum like-
lihood set of gene gain / loss rate parameters assuming correlated evolution and another
set assuming independent evolution. If the likelihood of a pair of profiles is much higher
under the correlated-evolution model than in the independent model, the pair is predicted
to interact. They test their method on MIPS (Mewes et al., 2002) protein complex data,
with cross-genomic information from 15 eukaryota. In their testing, genes that have more
than two or three shared gains or losses are usually linked, echoing our result that many
shared gains or losses are very indicative of shared pathways. In contrast to their ap-
proach, we learn the shared- and independent-evolution models from training examples.
In addition, they use a tree derived from the sequences of several universal genes, while
we use a taxonomy tree with edge lengths derived directly from our assumed gene gain /
loss model. Finally, in our tests we use profiles containing 215 organisms spanning the three
branches of life, while they test only on eukaryota.
6.2 Methods
6.2.1 Computing the Phylogenetic Profiles
For each of the 5,839 S. cerevisiae proteins, we created a phylogenetic profile with a
dimension for each of 215 organisms. (See Listing 6.1.) Among these organisms there are
7 eukaryota, 20 archaea, and 188 bacteria. In the profile for each S. cerevisiae protein g,
the e-value for organism S is set to the lowest e-value found among the BLASTP (Altschul
et al., 1990) hits against a database containing the protein sequences of organism S for
which the alignment covers at least 40% of the query protein g. If no such match is found
within the first 500 BLASTP hits, the e-value is set to 10. All protein sequences were
downloaded from the NCBI FTP site (NCBI, 2005).
(Date and Marcotte, 2003) suggest transforming each e-value x into a number between
0 and 1 as follows:
    x ↦ 1 − min{ −1/ln(x), 1 } .   (6.1)
We apply this transformation, which is shown in Figure 6.1, to the e-values we computed
to get the entries of the profiles. The transformation can be thought of as a heuristic way
to associate a probability that a homolog is present in each organism: e-values less than
about 10^−4 are likely real homologs and are given value ≥ 0.9 by the transformation. The
probability drops rapidly, however, between 10^−2 and 10^0. E-values larger than about 0.37
are given zero probability of being real homologs.

Figure 6.1: Transforming e-values to probabilities. The transformation is the same as that used in (Date and Marcotte, 2003). See Equation (6.1).
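In code, transformation (6.1) might be implemented as follows (a minimal sketch; the function name is ours, and e-values of at least 1/e ≈ 0.37 are clamped to probability 0, which also covers the e-value-10 placeholder used for missing hits):

```python
import math

def evalue_to_probability(x):
    """Heuristic probability that a homolog is present, given a BLAST
    e-value x, following Equation (6.1): x -> 1 - min(-1/ln(x), 1)."""
    if x >= 1.0 / math.e:   # here -1/ln(x) >= 1 (or ln(x) >= 0), so the min is 1
        return 0.0
    return 1.0 - (-1.0 / math.log(x))
```

For example, an e-value of 10^−8 maps to about 0.95, while anything above about 0.37 maps to 0.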
For some of our analysis, we will work with the binary versions of the above computed
profiles — each entry is set to 0 if it is less than 0.5, or to 1 if it is greater than 0.5.
Under this thresholding, the percentage of S. cerevisiae proteins that have homologs in a
particular genome varies from 85% for the closely related yeast E. gossypii down to 9% for
the bacteria T. pallidum. As expected, other eukaryota share many more proteins with
S. cerevisiae than do bacteria (Figure 6.2).
Figure 6.2: Fraction of S. cerevisiae genes that are predicted to have homologs in each of the 215 organisms. Organisms are sorted by decreasing number of proteins shared with S. cerevisiae. For clarity, organism names are shown for only a subset of the 215 organisms used to construct the profiles. A protein is predicted to have a homolog in an organism if the profile entry for that organism is ≥ 0.5.

6.2.2 Mutual Information

In (Date and Marcotte, 2003; Wu et al., 2003), the authors propose the use of mutual
information to compare phylogenetic profiles. To compute the mutual information between
two real-valued profiles, (Date and Marcotte, 2003) bin each real-valued entry to the
nearest multiple of 0.1. Let p^g_b be the fraction of the entries of the profile for protein
g that are in bin b. Then the entropy of the profile is defined as H(g) = −Σ_b p^g_b log p^g_b.
If p^{g1,g2}_{b1,b2} is the joint probability that a dimension is in bin b1 in the profile for g1 while in
bin b2 in the profile for g2, then the joint entropy of the two profiles for proteins g1, g2 is
H(g1, g2) = −Σ_{b1,b2} p^{g1,g2}_{b1,b2} log p^{g1,g2}_{b1,b2}. The mutual information between the two profiles is
MI(g1, g2) = H(g1) + H(g2) − H(g1, g2). We will also consider mutual information where
the profiles are binned into two bins {0, 1}, where an entry is rounded to 0 or 1 with a
threshold of 0.5. See e.g. (Cover and Thomas, 1991) for an introduction to entropy and
mutual information.
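A minimal sketch of this computation (helper names are ours; entropies are in bits):

```python
import math
from collections import Counter

def entropy(bins):
    """Entropy (in bits) of the empirical distribution of binned entries."""
    n = len(bins)
    return -sum((c / n) * math.log2(c / n) for c in Counter(bins).values())

def mutual_information(profile1, profile2, bin_width=0.1):
    """MI(g1, g2) = H(g1) + H(g2) - H(g1, g2), with each real-valued
    profile entry binned to the nearest multiple of bin_width."""
    b1 = [round(x / bin_width) for x in profile1]
    b2 = [round(x / bin_width) for x in profile2]
    return entropy(b1) + entropy(b2) - entropy(list(zip(b1, b2)))
```

With `bin_width=1.0`, entries in [0, 1] are rounded to {0, 1}, approximating the binary variant discussed above.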
6.2.3 Definition of Pathways
We seek to identify pairs of proteins that are involved in the same biological pathway.
The KEGG (Kanehisa, 1997; Kanehisa and Goto, 2000) database of pathways has 95
pathways annotated with at least two S. cerevisiae proteins. These form our groups of
positive examples: every pair of proteins both annotated to the same KEGG pathway as of
February 25, 2005 is considered a positive example. Because the division of the workings of
the cell into disjoint pathways is somewhat arbitrary, we must be more careful in defining
the pairs that are assumed to be involved in different pathways — the negative pairs.
We are fortunate in that the KEGG pathways are connected into a higher-order pathway
graph, where nodes in the graph represent KEGG pathways, and edges connect pathways
that are related (for example, because a signal is propagated between them).
Negative pairs will be those pairs of proteins that are annotated only with pathways
that do not share an edge. Formally, let N(F) be the neighbors of pathway F in this
graph, and if Q is a set of pathways, let N(Q) = ∪_{F∈Q} N(F). Further, let pw(g) be the
set of pathways in which protein g is involved. A pair of proteins (g1, g2) is a negative
example if N (pw(g1)) ∩ N (pw(g2)) = ∅.
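The negative-example criterion is a simple set computation; a sketch with hypothetical data structures (we include each pathway in its own neighborhood, so that proteins sharing a pathway can never be counted as negatives, which we take to be the intent of the definition):

```python
def is_negative_pair(g1, g2, pw, neighbors):
    """A pair (g1, g2) is a negative example if N(pw(g1)) and N(pw(g2))
    are disjoint.  pw: dict protein -> set of pathways; neighbors: dict
    pathway -> set of neighboring pathways in the higher-order pathway
    graph (assumption: each pathway F is included in neighbors[F])."""
    n1 = set().union(*(neighbors[F] for F in pw[g1]))
    n2 = set().union(*(neighbors[F] for F in pw[g2]))
    return not (n1 & n2)
```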
Under this definition, 1,120 proteins are involved in the examples. There are 36,948
positive pairs and 523,450 negative pairs. The number of yeast proteins annotated as
involved in each KEGG pathway are shown in Figure 6.3. (For clarity, only pathways
containing at least 20 proteins are shown.) A significant percentage (30%) of the positive
pairs come from the ‘ribosome’ KEGG pathway because it is the largest (149 proteins).
Purine and pyrimidine metabolism, cell cycle, and oxidative phosphorylation are other
large pathways.

Figure 6.3: Number of proteins annotated to each of the KEGG pathways. Only pathways that involve at least 20 proteins are shown, though we consider smaller pathways in our analysis as well.
6.2.4 Framework for Inferring Ancestral Gene State
To model the relationships between the species, the 215 species are related by the non-
binary tree T (Listing 6.1) taken from the NCBI taxonomy database. If we knew the
present / absent state at the internal nodes, we could look for contemporaneous gains and
losses along the edges. However, while we “know” the present/absent state at the leaves
(given by the profiles), we must infer the state for ancestral nodes.
Using the framework described below, we will be able to compute, for each organism i
(extant or ancestral), the probability that the protein is present. In other words, if L_g(i)
is the true, hidden gene state of node i for protein g, our framework will allow us to
estimate the probabilities

    Pr[L_g(i) = Present] .   (6.2)
This probabilistic assignment of absent / present to every node in the tree will be our
representation of the protein’s evolutionary history. We consider a gene to be present at
a node if the probability (6.2) is ≥ 0.5; otherwise, we assume the gene is absent.
To calculate these probabilities, we first estimate lengths (times) for the edges in the
tree using expectation maximization (EM), coupled with a simple model of gene gain and
loss (described in the next subsection), so that the edge lengths maximize the likelihood
of generating the complete set of observed S. cerevisiae profiles. Once we have the edge
lengths, we can find the maximum likelihood probabilistic labeling of the internal nodes
given a profile using a dynamic programming algorithm described in (Friedman et al.,
2002; Felsenstein, 1981).
The EM procedure to find the edge lengths iterates between two steps: in the ‘E’ step,
expected counts of edge labelings are computed using a dynamic programming algorithm
and the current edge lengths (which are set uniformly to 1 at the start). In the ‘M’ step,
edge lengths are found for each edge so that the likelihood of those expected counts is
maximized. The first part of the ‘E’ step, which is taken from (Friedman et al., 2002), is
described in the ‘Computing the Likelihood of Data Given Edge Lengths’ section below,
and we refer the reader to (Friedman et al., 2002) for a description of the full ‘E’ step.
The ‘M’ step is described in the ‘Finding Edge Lengths That Maximize Likelihood’ section.
Once edge lengths are known, the maximum likelihood labeling for a given profile can be
computed.
Throughout this chapter α, β, σ, δ will stand for gene states, either ‘present’ or ‘absent’,
which we will sometimes abbreviate P, A.
Figure 6.4: Gene gain/loss probability model. “P” stands for Present, “A” for Absent, g is the gain probability over a short time step, ℓ is the probability of a loss.
The Gene Gain / Loss Model
To assign edge lengths to the branches we use a probabilistic model of gene gain and loss.
We use the Markov process shown in Figure 6.4, where the probability of gaining a
gene over a small time step is g, and the probability of losing a gene over a small time
step is ℓ. The probability that a gene is absent at time step t can be written in terms of
the probability that it is absent at time step t− 1:
    PA(t) = (1 − g) PA(t − 1) + ℓ (1 − PA(t − 1)) ;

so,

    ∆PA = PA(t + 1) − PA(t) = ℓ − (g + ℓ) PA(t) .
Taking the limit as the time step becomes infinitesimal, we have
    dPA(t)/dt + (g + ℓ) PA(t) = ℓ .
We can solve this linear differential equation by multiplying by the integrating factor e^{(g+ℓ)t} (see, e.g., (Betz et al., 1954)):

    dPA(t) · e^{(g+ℓ)t} + (g + ℓ) PA(t) e^{(g+ℓ)t} dt = ℓ e^{(g+ℓ)t} dt ,

that is,

    d[PA(t) e^{(g+ℓ)t}] = ℓ e^{(g+ℓ)t} dt .
Integrating both sides, we obtain

    PA(t) = e^{−(g+ℓ)t} [ ∫ ℓ e^{(g+ℓ)t} dt + C ]
          = e^{−(g+ℓ)t} [ (ℓ/(g+ℓ)) e^{(g+ℓ)t} + C ]
          = ℓ/(g+ℓ) + C e^{−(g+ℓ)t} .
If PA(0) = 1 (the gene started absent), then C = 1 − ℓ/(g+ℓ), while if PA(0) = 0 (the gene
started present), then C = −ℓ/(g+ℓ). So, we get the following transition probabilities:

    Pr[A | A, t] = ℓ/(g+ℓ) + (1 − ℓ/(g+ℓ)) e^{−(g+ℓ)t}   (6.3)
    Pr[A | P, t] = ℓ/(g+ℓ) − (ℓ/(g+ℓ)) e^{−(g+ℓ)t}   (6.4)

Similarly, we can derive PP(t):

    Pr[P | P, t] = g/(g+ℓ) + (1 − g/(g+ℓ)) e^{−(g+ℓ)t}   (6.5)
    Pr[P | A, t] = g/(g+ℓ) − (g/(g+ℓ)) e^{−(g+ℓ)t}   (6.6)
Because ℓ/(g + ℓ) = 1/(1 + g/ℓ), the transition probabilities can be written in terms of
the ratio g/ℓ and the sum g + ℓ. The g + ℓ term only appears as a coefficient of t, and
so changing g + ℓ will only scale the branch lengths, meaning that the real dependency
is only on the ratio between g and ℓ. Though some believe that gene loss is easier than
gene gain, for the experiments described here we have taken g = ℓ = 0.5, as there is little
evidence for a different choice. We have taken the prior probability of gene presence at
the root to be 0.5 (similar results are obtained with 0.25).
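The four transition probabilities (6.3)–(6.6) are easy to tabulate in code; a minimal sketch (function and parameter names are ours; the defaults reflect the choice g = ℓ = 0.5 used in the experiments):

```python
import math

def transition_probs(t, gain=0.5, loss=0.5):
    """Transition probabilities of the two-state gain/loss Markov process
    after time t (Equations 6.3-6.6); gain and loss are the rates g and l.
    Returns {(parent_state, child_state): probability} over {'P', 'A'}."""
    s = gain + loss
    decay = math.exp(-s * t)
    p_aa = loss / s + (1 - loss / s) * decay   # Pr[A | A, t]
    p_pp = gain / s + (1 - gain / s) * decay   # Pr[P | P, t]
    return {("A", "A"): p_aa, ("A", "P"): 1 - p_aa,
            ("P", "P"): p_pp, ("P", "A"): 1 - p_pp}
```

At t = 0 the process is the identity, and as t grows the probabilities approach the stationary distribution (uniform when g = ℓ).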
Computing the Likelihood of Data Given Edge Lengths (E step)
We will optimize the edge lengths {tij} to maximize the likelihood of generating the
observed S. cerevisiae profiles with the given tree and our gain/loss model. We first
present an expression for the likelihood of a labeling. This presentation is based on that
of (Friedman et al., 2002), which we have adapted to work with our gene gain/loss model.
Let Θ represent the complete hypothesis or model — the gene gain/loss model, the tree
topology, and the tree edge lengths. If the tree is rooted at node R, and π(j) denotes the
parent of node j, then the likelihood of a labeling is
    LC = ∏_g Pr[x^g_R] ∏_{j≠R} Pr[x^g_j | x^g_{π(j)}, Θ] ,

where x^g_i is the state (present/absent) of node i in the tree with data given by gene g.
Taking the log, and reorganizing the summations, we can rewrite this as
    log LC = Σ_α Σ_{g : x^g_R = α} log Pr[α] + Σ_{(i,j)} Σ_{(α,β)} Σ_{g : (x^g_i, x^g_j) = (α,β)} log P_{α→β}(t_ij) ,   (6.7)

where α and β range over the alphabet of labelings {absent, present}, P_{α→β}(t_ij) is
shorthand for the probability of going from state α to state β over edge (i, j) given its
current length t_ij, and Pr[α] is the prior probability at the root. Let S_R(α) = |{g : x^g_R = α}|
and let S_ij(α, β) = |{g : x^g_i = α and x^g_j = β}|. Then we can rewrite (6.7) as

    log LC = Σ_α S_R(α) log Pr[α] + Σ_{(i,j)} LL(i, j) ,

where

    LL(i, j) = Σ_{α,β} S_ij(α, β) log P_{α→β}(t_ij) .   (6.8)
Only LL depends on the edge lengths.
For the leaves, as an estimate of Pr[leaf is labeled with e-value x|Present], we use the
curve of Figure 6.1; we use one minus that value for an estimate of the probability of
observing an e-value at a leaf assuming the gene is absent.
We do not know the internal labelings. Instead, we use the expected value of S_ij(α, β),
which equals Σ_g Pr[i = α ∧ j = β | leaf labels for profile g, Θ]. These probabilities can be
computed with an efficient dynamic programming algorithm that is described in (Friedman
et al., 2002; Felsenstein, 1981). Thus, we can compute estimates Ŝ_ij(α, β) for S_ij(α, β).
Once we have the expected counts, we must find an edge length tij for each edge (i, j)
that maximizes the likelihood of those counts.
Finding Edge Lengths That Maximize Likelihood (M step)
Let s = g + ℓ and γ = ℓ/s, and, for notational convenience, drop the subscripts ij.
Combining (6.8) with the transition probabilities developed above, we want to find a t to
maximize
    LL = S(A,A) log[γ + (1 − γ)e^{−st}] + S(P,A) log[γ − γe^{−st}]
       + S(P,P) log[1 − γ + γe^{−st}] + S(A,P) log[1 − γ − (1 − γ)e^{−st}] .

Let x = e^{−st}. Then maximizing LL is equivalent to maximizing

    [γ + (1 − γ)x]^{S(A,A)} · [γ − γx]^{S(P,A)} · [1 − γ + γx]^{S(P,P)} · [1 − γ − (1 − γ)x]^{S(A,P)} .

Noting that γ − γx = γ(1 − x) and 1 − γ − (1 − γ)x = (1 − γ)(1 − x), we see that maximizing
the above is in turn equivalent to maximizing

    [γ + (1 − γ)x]^{S(A,A)} · [1 − γ + γx]^{S(P,P)} · (1 − x)^{S(A,P)+S(P,A)} .   (6.9)
Figure 6.5: Plot of Ψ(x) = (1 + x)^q (1 − x)^c for a few settings of c and q. When q > c, the maximum of Ψ(x) is found at (q − c)/(q + c). When q = c, or c > q, the maximum is at 0, giving an infinite edge length.
If γ = 0.5, then we can simplify (6.9). Let c = S(A,P) + S(P,A) and q = S(A,A) + S(P,P).
Then (6.9) is proportional to Ψ(x) = (1 + x)^q (1 − x)^c. (See Figure 6.5.) Because t ≥ 0,
x ∈ [0, 1]. Setting the derivative of Ψ to 0 gives −(1 + x)^q c(1 − x)^{c−1} + (1 − x)^c q(1 + x)^{q−1} = 0.
So, x = (q − c)/(q + c) maximizes (6.9), and the maximum likelihood edge length is
t = −(1/(g + ℓ)) ln[(q − c)/(q + c)]. If changes are more common than not along an edge (c > q), then the maximum
edge length is infinite. If this occurs in practice during the iterative EM algorithm, we
simply increase the current edge length by some small amount. This rarely happens
in practice, as the maximum likelihood reconstruction tries to minimize the number of
changes.
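For γ = 0.5, this closed-form M-step update can be sketched as follows (names are ours; the `cap` argument is a stand-in for the text's "increase the current edge length by some small amount" when the optimum is infinite):

```python
import math

def ml_edge_length(S_AA, S_PP, S_AP, S_PA, gain=0.5, loss=0.5, cap=100.0):
    """Maximum likelihood edge length when gamma = loss/(gain+loss) = 0.5.
    q counts agreeing (expected) parent/child labels, c disagreements;
    the likelihood (6.9) is maximized at x = (q-c)/(q+c), so
    t = -ln(x)/(gain+loss).  If c >= q the optimum is an infinite edge
    length; we return the placeholder 'cap' instead."""
    q = S_AA + S_PP
    c = S_AP + S_PA
    if c >= q:
        return cap
    x = (q - c) / (q + c)
    return -math.log(x) / (gain + loss)
```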
If γ ≠ 0.5, then a maximizing x can be found using Newton’s method to find a zero of
the derivative of (6.9). Mathematica was used to compute the derivative D(x) of (6.9):

    fp := 1 + γ(x − 1)
    fa := γ + x − γx
    fc := 1 − x
    fq := γ S(P,P)/fp + c/(x − 1) + (1 − γ) S(A,A)/fa
    D(x) = fp^{S(P,P)} · fc^{S(P,A)+S(A,P)} · fa^{S(A,A)} · fq ,

with c = S(P,A) + S(A,P) as above.
6.2.5 Comparing Profiles Using Tree Labelings
Counting Shared Gain/Loss
If two genes co-evolve, one expects them to be gained or lost from organisms at the same
time. We label a node present if the probability (6.2) is ≥ 0.5. Looking at the labeled
trees, we would expect correlated genes to contain more edges (i, j) where both genes are
present at i, and both absent at j, or vice versa. Symbolically, we expect many AA −→ PP
and PP −→ AA transitions. (We use the notation αβ −→ σδ to denote the situation where,
when comparing tree labelings for proteins g1 and g2, a parent node is labeled α in the
reconstruction for gene g1 and β in g2 while a child is labeled σ in g1 and δ in g2.) Ranking
examples by the number of these AA −→ PP and PP −→ AA transitions gives our shared
gain/loss method, or SGL for short. An advantage of counting shared gains and losses is
that only profiles that have a fair number of both present and absent entries — and are
thus ‘interesting,’ at least by some measure — can have many gains and losses.
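Given most-probable labelings for two genes, the SGL count can be sketched as follows (the data structures are ours):

```python
def sgl_score(labels1, labels2, edges):
    """Shared gain/loss (SGL) score: the number of tree edges (parent,
    child) on which both genes change state together, i.e. AA -> PP
    (shared gain) or PP -> AA (shared loss).
    labels1/labels2: dict node -> 'P' or 'A' (most-probable labelings);
    edges: list of (parent, child) pairs of the species tree."""
    count = 0
    for parent, child in edges:
        pair_parent = labels1[parent] + labels2[parent]
        pair_child = labels1[child] + labels2[child]
        if (pair_parent, pair_child) in {("AA", "PP"), ("PP", "AA")}:
            count += 1
    return count
```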
Comparing Likelihoods
The other possible transitions contain information too. For example, a PP −→ PP edge is
likely more indicative of shared evolution than a PA −→ PA edge. If we are permitted to
look at some training positives and
negatives, we can learn how to take advantage of the whole range of possible edge la-
belings by estimating joint transition probabilities such as Pr[child = AA | parent =
AA,positive example] and Pr[child = AA | parent = AA,negative example]. One hopes
that the empirical joint transition probabilities computed from positive examples cap-
ture the characteristics of correlated evolution, while those for negative examples capture
independent evolution.
In other words, we can think of labeling a tree from an alphabet of four “characters”
{AA, PP, AP, PA}, and we will derive these transition probabilities between these “charac-
ters” from a set of training examples as follows. Let F be the training set of protein pairs
that are involved in the same pathway. We estimate the joint transition probabilities for,
say, proteins with shared function using the training examples as
    Pr[children are labeled σδ | parents are labeled αβ, +] =
        |{(g1, g2) ∈ F, (i, j) ∈ T : L_{g1}(i) = α ∧ L_{g1}(j) = σ ∧ L_{g2}(i) = β ∧ L_{g2}(j) = δ}|
      / |{(g1, g2) ∈ F, (i, j) ∈ T : L_{g1}(i) = α ∧ L_{g2}(i) = β}| .   (6.10)
If the probability for two transitions should be equal because they are symmetric with
respect to the order of the tree labelings (e.g., Pr[PA | AA,+] should equal Pr[AP | AA,+]),
we compute both, as in Equation (6.10), and use the average of these estimates. Transition
probabilities for unrelated proteins can be derived similarly. We also derive an estimate
for root labelings as
    Pr[Root labeled αβ | +] = |{(g1, g2) ∈ F : L_{g1}(R) = α ∧ L_{g2}(R) = β}| / |F| ,   (6.11)

where R is the root node of the tree. (In our experiments, we take the root to be the
node where the three main branches of life — eukaryota, archaea, bacteria — meet.) See
Section 6.3.4 below for a discussion of the probabilities so derived.
Given these probabilities, we can compute the likelihood of the pair of observed pro-
files assuming that they have correlated evolution using the transition probabilities derived
from the positive examples. Similarly, we can compute the likelihood of the profiles as-
suming independent evolution using the transition probabilities derived from the negative
examples. We score the pair of profiles (g1, g2) by the factor by which joint evolution
appears more likely:
    lscore(g1, g2) = Pr[g1, g2 | +] / Pr[g1, g2 | −] .   (6.12)
To compute Pr[g1, g2 | +] we use a standard likelihood computation, summing over all
possible internal states using an efficient dynamic programming algorithm. We will refer
to the scoring scheme of Equation (6.12) as LRATIO.
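The likelihoods in (6.12) can be computed with the standard pruning recursion run over the four-"character" alphabet; a minimal sketch (data structures and names are ours, and for simplicity we assume a single transition matrix shared by all edges, whereas the text uses per-edge lengths):

```python
def pair_likelihood(tree, leaf_pair_probs, trans, root_prior, root):
    """Likelihood of a pair of profiles by Felsenstein-style pruning over
    the alphabet {AA, PP, AP, PA}.
    tree: dict node -> list of children ([] for leaves);
    leaf_pair_probs: dict leaf -> {state: Pr[observed e-values | state]};
    trans: dict (parent_state, child_state) -> probability;
    root_prior: dict state -> prior probability at the root."""
    states = ("AA", "PP", "AP", "PA")

    def up(node):
        # up(node)[s] = Pr[observations below node | node has state s]
        if not tree[node]:                       # leaf
            return leaf_pair_probs[node]
        probs = {}
        for s in states:
            p = 1.0
            for child in tree[node]:
                cp = up(child)
                p *= sum(trans[(s, c)] * cp[c] for c in states)
            probs[s] = p
        return probs

    root_probs = up(root)
    return sum(root_prior[s] * root_probs[s] for s in states)
```

Running this once with parameters estimated from positive pairs and once with parameters from negative pairs, and taking the ratio, gives the LRATIO score.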
6.2.6 Assessing Results
We will consider two means for assessing how well a method identifies proteins with shared
function. First, we will use ROC (receiver operating characteristic) curves in which the
number of true positives scored above a threshold is plotted against the number of
false positives scored above that threshold. As a useful summary statistic, in Section 6.3.1,
we will use the area under such a curve when allowing up to 10% of the negatives to be
predicted as positives (denoted ROC10%).
For most of the experiments discussed here, we assess performance on a per-function
basis. Each method will assign a score to every example indicating to what extent the
method considers the proteins in the pair to be functionally linked. To determine how
well positive examples for each function are distinguished from the negatives we use the
average rank:

    avgrank(F_i) = (1/|F_i^×|) Σ_{(g1,g2) ∈ F_i^×} rank(g1, g2) ,   (6.13)

where F_i^× is the set of positive pairs arising from function F_i, and rank(g1, g2) is defined
as the number of negative examples ranked better than (g1, g2). When there are ties —
several negative and positive examples with the same score — we take the rank to be the
expected number of negatives ranked better if the examples were randomly sorted within
groups of the same score. Formally, if U is the set of negatives,
    rank(g1, g2) = |{(g′1, g′2) ∈ U : score(g′1, g′2) > score(g1, g2)}|
                 + (1/2) |{(g′1, g′2) ∈ U : score(g′1, g′2) = score(g1, g2)}| .
A low average rank indicates that the method performs well. The average ranks (6.13)
are divided by the maximum possible rank to give the fractional average rank. When
comparing the scores assigned by two methods we plot the difference in fractional average
rank assigned to each function.
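The rank and average-rank computations can be sketched as follows (names are ours; higher scores are assumed better):

```python
def rank(pair_score, negative_scores):
    """Rank of a positive pair: the number of negatives scoring strictly
    better, plus half the negatives tied with it."""
    better = sum(1 for s in negative_scores if s > pair_score)
    tied = sum(1 for s in negative_scores if s == pair_score)
    return better + 0.5 * tied

def avgrank(positive_scores, negative_scores):
    """Average rank (Equation 6.13) of a function's positive pairs; divide
    by len(negative_scores) to get the fractional average rank."""
    return sum(rank(p, negative_scores) for p in positive_scores) / len(positive_scores)
```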
Figure 6.6: Estimate of maximal ROC performance for the KEGG pathways for various values of the ball radius r. Circles give the absolute maximum ROC10% upper bound. The gray bars give the estimate for when r is 10% of the length of the profiles, and the other symbols give the estimated ROC10% upper bound for larger values of r.
6.3 Computational Results
6.3.1 Limit on ROC Performance of Phylogenetic Profiles
Because some negative examples may have the same profiles as positive examples in a
pathway, it may be that no classifier can achieve perfect predictions. We are interested
in quantifying how such confusions affect the ability of phylogenetic profiles to associate
proteins with functions. To do this, for this section, we consider a variation on the
functional-linkage problem:
Functional Separation: Given a collection {Fi} of functions, for each func-
tion Fi, distinguish those proteins in Fi from those not in Fi.
From each Fi arises a separate classification problem with the positives taken to be those
proteins in Fi and the negatives taken to be those proteins in some Fj with j 6= i. (For this
section, we consider only those functions with at least 5 annotated proteins.) The known
functional groupings play a much more central role in this variant since each function is
treated independently. In this section, we measure the performance of a classifier for this
problem using ROC curve analysis, taking the area under the curve up to 10% of the false
positives (which we denote ROC10%) as a measure of success. For ease of analysis, for
this section only we use binary profiles — thresholding the real-value entries at 0.5.
When measuring performance using ROC10% area only the ranking of examples is
important. We can compute the theoretical upper bound on the ROC10% area achievable
by any classifier by grouping proteins with the same profile together and ordering the
groups in decreasing order of the ratio of positives to negatives that they contain. In
other words, we take the classifier to be an algorithm that can rank points of the Hamming
cube. The optimal ROC10% performance of such a classifier is an absolute upper bound
on any classifier that sees only the vectors because such classifiers must treat equal profiles
equally.
This upper bound endows the classifier with a lot of power: it can distinguish between
two profiles perfectly even if they differ by only one bit. In practical cases, the space
of possible profiles is very large (here, 2^215), while the number of actual profiles is much
smaller (here, 1,000–2,500), so it may not be surprising, even though the dimensions
are not independent, if few clashes between positives and negatives occur. It is thus
interesting to put the above upper bounds into context by assessing the performance of
theoretically weaker classifiers.
To do so, we must assume a computational model for the classifier. Above, the algo-
rithm was allowed to distinguish among vectors that differ in any dimension by ranking
points of the Hamming cube. Equivalently, we may think of the algorithm as allowed to
rank balls of radius ≥ 0 centered at an input point. We introduce a series of weaker clas-
sifiers that are limited to ranking balls of radius ≥ r (measured in the fraction of differing
entries) centered at an input point. In this weaker model of classifier, the rank of an input
example is taken to be the lowest rank of a ball that contains it, and points can still be
Figure 6.7: Profiles which, for some function F, describe both a protein in F and a protein not in F. White indicates the presence of a homolog. The first profile indicates the breakdown into the three main branches of life: eukaryota (white) are at the left, followed by the archaea (gray), and then the bacteria (black). Numbers next to each profile give the number of positive examples which collide with the profile. Only profiles for which at least 5 examples have collisions are shown.
chosen as ball centers after they have been covered by another ball. The parameter r is
the resolution, and a larger r implies a less powerful classifier. While under this model a
classifier can distinguish between some pairs of examples at distance < r, it cannot do
so “too often.” We would like to find the ranking of balls that maximizes the ROC10% of
the ordering. We use the greedy algorithm that exhaustively searches over balls and radii
for the ball with the best positive to negative ratio and then removes the points in the
ball as a way to find a good ranking of balls. The greedy algorithm does not guarantee
that the optimal ranking is found, however.
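The greedy ball-ranking heuristic can be sketched as follows (a simplification with names of our own; we rank balls by their fraction of positives, which orders them the same way as the positive-to-negative ratio, and break ties toward larger balls):

```python
def hamming(p, q):
    """Hamming distance between two binary profiles."""
    return sum(a != b for a, b in zip(p, q))

def greedy_ball_ranking(examples, min_radius):
    """examples: list of (profile, is_positive) with profiles as 0/1 tuples.
    Greedily pick the Hamming ball of radius >= min_radius (centered at a
    remaining profile) with the best positive fraction, output the examples
    it covers, remove them, and repeat.  Returns the ranked list of balls."""
    remaining = list(examples)
    ranked = []
    n = len(examples[0][0])
    while remaining:
        best_key, best_center, best_r = None, None, None
        for center, _ in remaining:
            for r in range(min_radius, n + 1):
                ball = [e for e in remaining if hamming(center, e[0]) <= r]
                pos = sum(1 for _, is_pos in ball if is_pos)
                key = (pos / len(ball), len(ball))
                if best_key is None or key > best_key:
                    best_key, best_center, best_r = key, center, r
        ranked.append([e for e in remaining if hamming(best_center, e[0]) <= best_r])
        remaining = [e for e in remaining if hamming(best_center, e[0]) > best_r]
    return ranked
```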
With a classifier with perfect resolution, the highest achievable ROC10% areas for
classification problems arising from the KEGG pathways are shown in Figure 6.6 (circles).
In most cases, a near-perfect ROC10% area is achievable, and no collisions are observed
in 20 of the pathways. In no case is the achievable ROC10% area less than 0.09, but three
pathways have maximum ROC10% area ≤ 0.095: the MAPK signaling pathway (KEGG
ID 04010; ROC10% 0.093), the phosphatidylinositol signaling system (04070; 0.091), and
the cell cycle (04110; 0.095).
Profiles involved in some collision between a positive and a negative for KEGG path-
ways are shown in Figure 6.7. These are the profiles stopping the ROC10% area given
by the circles from being 0.1. Even with profiles constructed from 215 organisms and a
relatively permissive definition of homolog, the main source of error is profiles that
have few homologs outside the eukaryota. The all-present profile also accounts for many
of the errors. The most interesting class of collisions, those involving the profiles of several
dehydrogenases, is shown as cluster G in Figure 6.7. Profile K belongs to several proteins
related to aconitate hydratase and at least one that is related to leucine biosynthesis.
If r is increased to 10% of the length of the profiles, 32 pathways have maximum
achievable ROC10% area ≤ 0.09, while 20 have maximum ROC10% area ≤ 0.08. The
cell cycle (04110) and the basal transcription factors (03022) pathways drop especially
precipitously, both to below ROC10% 0.051. Figure 6.8 shows the profiles responsible for
mistakes with this value of r. For presentation, the 599 unique profiles involved in such
collisions were clustered using a greedy heuristic. (Colliding profiles were considered in
arbitrary order and added to the cluster that had a center to which the profile had lowest
Hamming distance; if no center was within Hamming distance 10, the profile became the
center of a new cluster.)
Several classes of profiles stand out. Again, profiles for which a signal is present only
for the relatively few eukaryota (clusters A, C, E, M, T, GG in Figure 6.8) or present for
nearly all organisms (cluster B) are a top source of errors, and the dehydrogenase profile
(cluster D) continues to be involved in many collisions. Two additional common types of
colliding profiles are of interest. The first are proteins which have homologs in most of
the eukaryota and archaea, but not in the bacteria (clusters N, O, Q, R, S, CC). Such
profiles are common among ribosomal proteins (Lecompte et al., 2002), as well as RNA
polymerase and the proteasome proteins. Secondly, the profiles of clusters W and Y give
profiles that are common among some kinases. From these examples, we can see that
there are two general classes of mistakes: proteins are often confused if their profiles can
be explained by broad speciation events, or they are confused if they are involved in cross-
pathway functionality such as phosphorylation or dehydrogenation. That the former class
Figure 6.8: Profiles involved in reducing the best possible ROC area when r is 10% of the profile lengths. For each function, profiles in that function were found for which there exists a profile with Hamming distance ≤ 22 outside the function. These colliding profiles were greedily clustered into groups of diameter no more than Hamming distance 20. Numbers indicate the number of positive examples that collided with some profile in the cluster. For brevity, only clusters containing at least 5 examples are shown. Gray-scale values indicate the average value of the profiles in the cluster, where white indicates the presence of a homolog. The first profile indicates the breakdown into the three main branches of life: eukaryota (white) are at the left, followed by the archaea (gray), and then the bacteria (black).
Figure 6.9: ROC curves showing the performance of mutual information on the KEGG test set. Lines labeled ‘real MI’ use the real values of the profiles, rounded to the nearest 0.1, while those labeled ‘binary MI’ round the values of the profiles to the nearest number in {0, 1}. Lines marked with ‘full’ use all pathways in KEGG, while those marked with ‘no ribosome’ show the performance if the ribosome positive pairs are removed.
is involved in many collisions is consistent with observations made in (Wu et al., 2003) on
profiles of 41 organisms.
6.3.2 Performance of Mutual Information for Finding Functional Linkages
We test the performance of mutual information on the KEGG data set described above.
How the entries of each profile are discretized can have a large effect on performance,
so we test both the binning scheme suggested by (Date and Marcotte, 2003) and the application of mutual information to binary profiles. The results are shown in Figure 6.9 in the form of ROC curves. Comparing the performance of the real-valued profiles (red) with binary profiles (green) on the full data set shows that MI applied to binary profiles performs considerably worse than when applied to real-valued profiles. However, much of
the advantage of the real-valued profiles comes in detecting pairs of ribosomal proteins.
When the ribosomal positives are removed while keeping the set of negatives the same,
[Figure 6.10 appears here: bar chart of average rank difference (−0.3 to 0.4) across KEGG pathways; bars above 0 mean binary MI does better, bars below 0 mean real-valued MI does better; the ribosome pathway is marked.]
Figure 6.10: For each pathway the difference in average rank (Section 6.2.6) between the binary MI scheme and the real-valued MI scheme (as suggested by (Date and Marcotte, 2003)) is plotted. Values greater than 0 indicate that the binary scheme ranks examples from that function better. The numbers inside the bars give the number of positive pairs arising from the function. Function numbers (x-axis) are KEGG ID numbers. A mapping from KEGG IDs to function names can be found in Table 6.1.
real-valued mutual information performs badly (blue curve in Figure 6.9). Binary mutual
information performs much better (yellow curve), though still not at the level that can
be achieved when the ribosome pairs are included. Real-valued mutual information seems
to be useful, then, primarily to identify ribosomal proteins, and mutual information in
general seems to find ribosomal pairs easy to distinguish from unrelated pairs.
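To make the discretization concrete, here is a small self-contained sketch (not the code used in this chapter) of mutual information computed from empirical counts over two discretized profiles; the toy profiles are invented for illustration:

```python
from collections import Counter
from math import log2

def mutual_information(p1, p2):
    """Mutual information between two discretized phylogenetic profiles.

    p1, p2: equal-length sequences of discrete states (e.g. 0/1 for
    absence/presence of a homolog in each organism).
    """
    n = len(p1)
    assert n == len(p2) and n > 0
    joint = Counter(zip(p1, p2))
    m1, m2 = Counter(p1), Counter(p2)
    mi = 0.0
    for (a, b), c in joint.items():
        p_ab = c / n  # empirical joint probability of state pair (a, b)
        mi += p_ab * log2(p_ab / ((m1[a] / n) * (m2[b] / n)))
    return mi

# Identical balanced presence/absence patterns give MI = 1 bit here.
g1 = [1, 1, 0, 0, 1, 0, 1, 0]
g2 = [1, 1, 0, 0, 1, 0, 1, 0]
print(mutual_information(g1, g2))  # 1.0
```

For the binned ‘real MI’ variant, the inputs would simply be the profile values rounded to the nearest 0.1 rather than to {0, 1}.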
ROC analysis with all functions lumped together (as in Figure 6.9) assumes that the
importance of a pathway is proportional to the number of examples that come from it.
With a more complete data set that might be true, but at this stage the opposite likely
holds: those pathways for which few proteins are known are the best ones for which to
make new predictions. This motivates the use of a per-function analysis of the performance
of various methods, such as that shown in Figure 6.10. For each KEGG pathway, the difference
in the average fractional rank (Section 6.2.6) achieved by binary MI compared with real-
valued MI for examples from that function is plotted. Bars above zero indicate that
examples of that function are scored more favorably using binary MI. Most functions (63
[Figure 6.11 appears here: bar chart of average rank difference (−0.4 to 0.3) across KEGG pathways; bars above 0 mean SGL does better, bars below 0 mean binary MI does better; pathways with fewer than 20% ‘present’ entries are highlighted.]
Figure 6.11: Advantage of SGL over binary MI on the 80 KEGG pathways with at least 10 pairs. Bars indicate, as in Figure 6.10, the difference between the average rank assigned to pairs of each function by the two methods. For the 50 pathways with bars above 0, SGL outperforms mutual information.
out of 80 functions with at least 10 pairs) are handled more effectively by the binary MI
scheme than the real-valued scheme. The most important exception is the largest pathway,
the ribosome, as discussed above. Because it is more effective on most pathways, we will
compare with the binary MI scheme for the rest of this chapter.
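The per-function comparison underlying Figures 6.10 and 6.11 can be sketched as follows; the pair names and scores are hypothetical, and we assume the convention of Section 6.2.6 that a fractional rank of 0 is best:

```python
def average_fractional_rank(scores, members):
    """Average fractional rank of a pathway's positive pairs.

    scores: dict pair -> score (higher = more confidently linked).
    members: the pairs belonging to one pathway. Fractional rank 0.0
    is the best-scored pair, 1.0 the worst, so lower averages mean the
    method favors the pathway's pairs.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    pos = {p: i / (len(ranked) - 1) for i, p in enumerate(ranked)}
    return sum(pos[p] for p in members) / len(members)

# Hypothetical scores from two methods over the same four pairs.
scores_a = {"p1": 0.9, "p2": 0.8, "p3": 0.1, "p4": 0.05}
scores_b = {"p1": 0.2, "p2": 0.9, "p3": 0.8, "p4": 0.1}
pathway = ["p1", "p2"]
# Positive difference: method B ranks this pathway's pairs worse than A.
diff = average_fractional_rank(scores_b, pathway) - average_fractional_rank(scores_a, pathway)
```

Plotting this difference per pathway, with the pathway's pair count on each bar, reproduces the structure of the bar charts in this section.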
6.3.3 Predicting Linkages From Shared Gains and Losses
Counting the number of edges (i, j) for which both proteins are inferred to be present in
the ancestor i and both absent in the descendant j, or vice versa (PP −→ AA or AA −→
PP), performs better than MI on 50 out of 80 pathways that have at least 10 pairs of
proteins in them (Figure 6.11). MI outperforms SGL for the ribosome pathway, along
with the pathways that include many proteins that contain few homologs (such pathways
are highlighted in yellow in the figure). These pathways tend to have low ROC upper
bounds (Section 6.3.1) and have few gain or loss events, and thus the SGL method is not
suited for predicting shared function of this type.
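A minimal sketch of this counting scheme, assuming the inferred ancestral labelings are available as per-node ‘P’/‘A’ dictionaries (the data structures here are illustrative, not the dissertation's implementation):

```python
def sgl_score(edges, labels1, labels2):
    """Shared gains/losses (SGL) score for a pair of proteins.

    edges: list of (parent, child) node pairs in the species tree.
    labels1, labels2: dict node -> 'P' (present) or 'A' (absent), the
    inferred ancestral state of each protein at each node.
    Counts edges on which both proteins are gained together (AA -> PP)
    or lost together (PP -> AA).
    """
    count = 0
    for i, j in edges:
        parent = labels1[i] + labels2[i]
        child = labels1[j] + labels2[j]
        if (parent, child) in {("AA", "PP"), ("PP", "AA")}:
            count += 1
    return count

# Toy tree: root -> a, root -> b; both proteins are gained on the edge to a.
edges = [("root", "a"), ("root", "b")]
l1 = {"root": "A", "a": "P", "b": "A"}
l2 = {"root": "A", "a": "P", "b": "A"}
print(sgl_score(edges, l1, l2))  # 1
```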
The ability of such a straightforward method to rank examples from more than half
[Figure 6.12 appears here: (a) bar chart of empirical root-label probabilities (0–0.5) for the states PP, AP, and AA, for positives and negatives; (b) relative increases per root label: PP 0.270, AP −0.178, AA −0.123.]
Figure 6.12: (a) Empirical root probabilities Pr[R = αβ | same pathway] (blue) and Pr[R = αβ | different pathways] (red). (b) Relative increases for each root label probability for proteins with shared functions compared with those that do not have shared functions. See Equation (6.14).
of the pathways better than MI is encouraging, especially because SGL does not consider
where in the tree the shared events took place or their relation to one another. It also
does not account for how many organisms are labeled with the same state in the two
trees — it is simple to find examples with several shared gains and losses but large areas
of disagreement in the tree. While such pairs may indeed be functionally linked, it is
reasonable that they should score less favorably than a pair with several shared gains and
large areas of agreement.
6.3.4 Predicting Linkages By Comparing Likelihoods
Estimating the Transition Probabilities
In order to test our LRATIO method, we randomly divide the positives and negatives into
three parts and compute transition probabilities (as outlined in Section 6.2.4) for both
positive and negative examples using two-thirds of the examples. There is certainly a large
[Figure 6.13 appears here: (a) bar chart of empirical transition probabilities (0–1) for the ten parent→child label pairs, for positives and negatives; (b) relative increases per transition: PPPP 0.002, PPAP −0.088, PPAA 0.325, APPP 0.310, APPA 0.224, APAP −0.019, APAA 0.086, AAPP 0.574, AAAP 0.008, AAAA −0.003.]
Figure 6.13: (a) Empirical transition probabilities for edges (i, j) in bacteria: Pr[j = σδ | i = αβ, same pathway] (blue) and Pr[j = σδ | i = αβ, different pathways] (red). Here, αβ are labels of a parent node, and σδ are labels of a child node. The distributions are very similar in this view; their differences are more apparent in panel (b). (b) Relative increases for each transition probability for proteins with shared functions compared with those that do not have shared functions (Equation (6.14)).
class of pairs of proteins that share a pathway for which the assumption of shared evolution
does not hold. These pairs, which are unlikely to be identified by any method geared
towards exploiting shared evolution, may obscure the true differences in the probability
distributions between proteins with correlated and uncorrelated evolution. Accordingly,
when computing the probabilities, we do not use any pair (positive or negative) that
involves a protein with a profile that can primarily be explained by speciation events.
These are considered to be those profiles for which there is a node in the tree such that
every leaf under the node has an e-value < 10^−5 while all other leaves have an e-value
> 10^−5, or vice versa. (If either of those two groups contains more than 5 organisms, we
allow a single exception.)
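This filter can be sketched as follows, assuming the set of leaves under each internal node and the set of organisms with a significant hit have already been extracted (the names and data structures are illustrative):

```python
def explained_by_speciation(clades, present, universe, big=5):
    """Heuristic filter for profiles explained by a single speciation event.

    clades: iterable of sets, each the set of organisms (leaves) under
        one internal node of the species tree.
    present: the set of organisms whose best hit has e-value < 10^-5.
    universe: the set of all organisms in the profile.
    A profile passes if some clade coincides exactly with the present
    set, or with the absent set (the 'vice versa' case), allowing one
    mismatch when the group on that side has more than `big` members.
    """
    for clade in clades:
        outside = universe - clade
        for group in (present, universe - present):
            miss_in = len(clade - group)     # leaves under the node outside the group
            miss_out = len(outside & group)  # leaves elsewhere inside the group
            ok_in = miss_in == 0 or (len(clade) > big and miss_in == 1)
            ok_out = miss_out == 0 or (len(outside) > big and miss_out == 1)
            if ok_in and ok_out:
                return True
    return False
```

A profile present only in, say, a eukaryotic clade and absent everywhere else passes the filter and is excluded from training.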
Additionally, because the ribosome pathway is so large and we do not want to train
the method to recognize only this well-studied pathway, we leave out the positive pairs
coming from the ribosome pathway.
The probabilities for the root labels (Equation 6.11) for one of the cross-validation tests
are shown in Figure 6.12(a). Immediately, we see the difference in the distributions of root
labels between positives and negatives: a larger fraction of the roots of trees for proteins
with shared function are both labeled with the ‘present’ state, while the disagreeing state
AP is more common in unrelated proteins. Below the distributions, for each root state,
we plot the relative increase in probability of the positives:
(p+ − p−) / p+ .    (6.14)
So, if Equation (6.14) is greater than zero, a labeling is more common in positive examples.
The ability to handle distantly related organisms differently is one of the advantages
of having a tree connecting the organisms. Accordingly, we estimate a separate set of
probabilities for each of the three main branches of life, eukaryota, archaea, and bacteria.
The computed transition probabilities for the bacteria edges, the largest class of edges,
Figure 6.14: Schematic of anti-correlated evolution AP −→ PA. The two rectangles represent profiles, and the triangle represents a subtree which would be primarily labeled AP. If there is a loss of one protein and a compensating gain by another, there will be an AP −→ PA edge at the top of the subtree for the organisms in which the complementary gain occurred.
are shown in Figure 6.13(a). They make sense at a high level: the most probable transi-
tions are those in which no change is made in the gene state of either protein, and those
transitions in which one protein maintains its state are slightly more probable than having
both genes change their state. This is expected since the reconstruction of ancestor label-
ings sought to minimize changes. While the transition probabilities for positive examples
appear very similar to those for negatives, there are important differences as seen by the
plot of Equation (6.14) under each probability (Figure 6.13(b)). Shared gains or losses
(PP −→ AA or AA −→ PP) are more likely to occur in positive examples. Transitions where
disagreeing gene states are brought into agreement (AP −→ PP and AP −→ AA) also have a
higher probability in positives, while transitions in which the states diverge have the same
or lower probability in positives. Finally, and perhaps most interestingly, anti-correlated
edges (AP −→ PA) are more frequently seen between positive pairs. These edges would
occur in blocks such as those depicted in Figure 6.14, which may arise if a protein is lost
because another can assume its role. ((Bowers et al., 2004a) and (Morett et al., 2003) have
investigated such complementary profiles. An advantage of incorporating a species tree
is that we can detect local complementarity naturally, while previous methods typically
require the complementarity to hold to a large extent across the entire profile.)
[Figure 6.15 appears here: bar chart of average rank difference (−0.4 to 0.6) across KEGG pathways; bars above 0 mean LRATIO does better, bars below 0 mean binary MI does better; pathways with fewer than 20% ‘present’ entries are highlighted.]
Figure 6.15: Improvement in average fractional rank achieved by LRATIO compared with MI for each KEGG pathway that contains at least 10 pairs of proteins. Values above 0 indicate that LRATIO scored the pairs in the pathway more favorably than MI. Only those pathways that had on average at least 10 pairs of proteins in the three tests are shown. The numbers in the bars give the average number of examples arising from that pathway. Yellow marks those pathways with many proteins that have few homologs across the species.
Assessing Performance of LRATIO
We test LRATIO using three-way cross-validation, computing probabilities like those
shown in Figures 6.12 and 6.13 using two-thirds of the examples, and assessing per-
formance on the remaining third. The average fractional ranks (Section 6.2.6) for each
of the pathways are themselves averaged over the three tests. The differences between
these averaged average fractional ranks are shown in Figure 6.15 for each function. Out
of the 71 pathways that had on average at least 10 pairs, LRATIO ranks the examples
from 51 of them better than MI does. MI, as expected, outperforms LRATIO on those pathways with
few present entries in their profiles (yellow in Figure 6.15).
The pathway with the largest improvement in performance is aminoacyl-tRNA biosynthesis, which contains about 222 pairs in each testing slice. The average fractional rank
is reduced from 0.77 to 0.20 when using LRATIO because many proteins in this pathway
are present in most sampled organisms, and such all-present profiles are scored very low
[Figure 6.16 appears here: two panels of presence/absence profiles, rows labeled by ORF name.]
Figure 6.16: Profiles for the proteins in pathways with the lowest average rank using LRATIO. (a) nitrogen metabolism (KEGG ID 00910). (b) glyoxylate and dicarboxylate metabolism (00630). White indicates the presence of a homolog.
by MI. In contrast, a distance measure such as Hamming distance would give such pairs
the highest score possible. LRATIO may take the middle path — rewarding such pairs
in the amount warranted. Not all improvements arise from such constant profiles. The
profiles of the proteins involved in the top average ranked pathways (nitrogen metabolism
and glyoxylate and dicarboxylate metabolism, KEGG IDs 00910 and 00630, respectively)
are shown in Figure 6.16. These pathways contain few constant profiles.
To test whether LRATIO is using real evolutionary events or merely fitting a small set
of parameters to the data, we repeat the training and testing but shuffle where the extant
organisms appear in the tree; this amounts to the same thing as shuffling the dimensions
of all the profiles using the same permutation. If LRATIO takes its advantage simply
from its ability to set parameters from a set of training examples, one would expect that
the random tree will do nearly as well. This is not the case, as shown in Figure 6.17 —
the real tree outperforms the shuffled tree in 51 out of 71 pathways. The pathways on which
the random tree scores better are those that have many proteins that have homologs
[Figure 6.17 appears here: bar chart of average rank difference (−0.4 to 0.2) across KEGG pathways; bars above 0 mean the real tree does better, bars below 0 mean the random tree does better; pathways with fewer than 20% ‘present’ entries are highlighted.]
Figure 6.17: Improvement in average fractional rank achieved by LRATIO using the real tree compared with using a tree with the leaves shuffled. Values above 0 indicate that using LRATIO with the real tree was more successful at identifying positive pairs. Numbers in the bars give the average number of examples in that pathway. Yellow marks those pathways with many proteins that have few homologs across the species.
only among the eukaryota, or only among the eukaryota and archaea. This makes sense:
if the leaves are randomized for these pathways, the single shared gene gain at the top of
the eukaryota subtree is transformed into six or seven shared gains spread over the tree —
a pattern that would be much more indicative of shared evolution. This results in better
scores for these pathways.
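The shuffling control can be sketched as a single shared permutation applied to every profile (a toy illustration, not the experiment's code):

```python
import random

def shuffle_profiles(profiles, seed=0):
    """Randomization control for LRATIO-style tests.

    Shuffling where the extant organisms appear at the leaves of the
    species tree is equivalent to permuting the organism dimensions of
    every profile with the same permutation, which is what we do here.

    profiles: dict protein -> list of presence/absence states, one
    entry per organism, all lists the same length.
    """
    rng = random.Random(seed)
    n = len(next(iter(profiles.values())))
    perm = list(range(n))
    rng.shuffle(perm)  # one permutation shared by all profiles
    return {prot: [prof[i] for i in perm] for prot, prof in profiles.items()}

profiles = {"x": [1, 0, 1], "y": [0, 0, 1]}
shuffled = shuffle_profiles(profiles)
```

Because the same permutation is applied everywhere, pairwise agreement between profiles is preserved; only their placement relative to the tree topology is destroyed.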
6.4 Discussion
We assessed the relative merits of two protocols for using mutual information to predict
functional linkages, finding binary MI to perform best for most pathways. One issue that is
brought to the forefront by this research is the need to consider performance on functions
separately. If each pathway’s importance is taken to be proportional to the number
of examples that are derived from it (as in standard ROC analysis), the improvements
on well-studied functions such as the ribosome and cell cycle will obscure successes in
identifying less obvious functional linkages. Our per-function analysis does not suffer
from this problem and also makes it clear that there are several classes of functions that
should be attacked through different methods.
Mutual information can be divided into the sum of two components. Informally,
the first, H(g1) + H(g2), measures how “interesting” the pairs of profiles (g1, g2) are
— all-present or all-absent profiles may simply indicate we have not yet sampled enough
organisms to detect a signal. The second component, −H(g1, g2), measures the correlation
between profiles. That MI captures both of these aspects is one reason it is such an
intuitively pleasing measure. An attractive feature of incorporating the phylogenetic tree
is to promote a more delicate definition of “interesting” than mutual information permits.
For example, if two proteins are both present in all firmicutes save one, for which they
are both absent, this may be more “interesting” than if the same pattern is distributed
randomly over the tree. If the right balance between requiring interesting profiles and
agreeing profiles can be struck, predictive ability may be improved. We have begun to
gain some insight about what patterns are most characteristic of shared function and are
thus informative. For example, the presence of several correlated gains and losses is
very indicative of shared function, and counting these events is a useful way to identify
proteins involved in the same pathways.
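This decomposition is just the standard identity MI(g1, g2) = H(g1) + H(g2) − H(g1, g2), which a few lines of code can verify on toy profiles:

```python
from collections import Counter
from math import log2

def entropy(xs):
    """Empirical Shannon entropy (in bits) of a sequence of discrete states."""
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())

def mi_decomposed(g1, g2):
    """MI(g1, g2) = H(g1) + H(g2) - H(g1, g2): the marginal terms reward
    'interesting' (high-entropy) profiles, the joint term rewards agreement."""
    return entropy(g1) + entropy(g2) - entropy(list(zip(g1, g2)))

# Balanced but independent-looking profiles: H = 1 bit each, joint H = 2, MI = 0.
g1 = [1, 1, 0, 0]
g2 = [1, 0, 1, 0]
print(mi_decomposed(g1, g2))  # 0.0
```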
Not all edges in the tree should be treated the same. Shared gene absence is more
indicative of shared function when it occurs on edges between organisms closely related
to yeast, while shared gene presence is most indicative in the opposite case. This is seen
in Figure 6.18, in which for each tree edge we plot the number of AA −→ AA transitions seen on that edge
in positive examples divided by the number of such edge labelings seen in negative examples,
against the edge's tree distance from the yeast E. gossypii, a close relative of S. cerevisiae.
Similar values are plotted for PP −→ PP. It may be informative to attempt to derive
correlations between other patterns and their environment in the tree (e.g. height, node
degree, distance to yeast) and use such insights to continue to improve our understanding
of which evolutionary events are informative.
[Figure 6.18 appears here: plot of #pos/#neg (0–0.4) against tree distance to E. gossypii (0–4) for shared-presence (PP −→ PP) and shared-absence (AA −→ AA) edges.]
Figure 6.18: How indicative AA −→ AA and PP −→ PP edges are of a pair being a positive example, plotted against that edge's proximity to the yeast E. gossypii. Ribosomal examples are not included. By random chance, one expects a value of total # of positives / total # of negatives = 0.049. Edges where both proteins are labeled absent at each end point are more indicative of shared function when they are near yeast in the tree; the opposite is true for edges where each protein is present at both end points.
Looking at the variation between empirical transition probabilities computed for pairs
with shared functions versus those without, we get an idea of which transitions are more
prominent in positive examples. The absolute differences between the positive and neg-
ative distributions are very small, and, despite the filtering we perform on the training
examples, there undoubtedly remain positive training examples for which the assumption
of shared evolution does not hold. Better filtering schemes to identify those pairs that are
most promising to train on will likely improve the performance of LRATIO on many path-
ways. Or, perhaps, given the values in Figures 6.12 and 6.13, we can concoct transition
distributions that maintain the relative order of transition probabilities but enlarge the
difference between the positive and negative distributions. Additionally, to avoid giving
undue importance to large pathways, our LRATIO training method may also benefit if
examples are weighted so that the contribution from each function is the same.
Another interesting path for future research is to investigate additional non-learning
scoring schemes. That SGL does so well suggests that if it could be augmented in a
natural way to account for agreeing or disagreeing subtrees, or for events such as AP −→ PA,
it may be able to perform even better. Perhaps the quest to improve SGL will give some
further appreciation of which larger-scale evolutionary features are important for detecting
co-evolution.
In this chapter, we have made some significant first steps toward understanding how
cross-genomic evolution can be exploited to identify proteins that are involved in the same
pathways in S. cerevisiae. We presented a framework for taking relationships between
organisms into account when comparing profiles, and we gave several methods for using
patterns of inferred gene presence and absence across a tree connecting extant species to
predict whether two proteins are involved in the same pathways. Our per-function analysis
of the methods suggests that there is benefit in considering relationships between the
species, and that the methods may be useful additions to the repertoire of approaches to
find proteins with correlated evolution and provide a starting point for continued research
in the area.
6.5 List of Pathways and the Phylogenetic Tree
List of KEGG Pathways and Their Identifiers
ID    Pathway
0010  Glycolysis / Gluconeogenesis
0020  Citrate cycle (TCA cycle)
0030  Pentose phosphate pathway
0040  Pentose and glucuronate interconversions
0051  Fructose and mannose metabolism
0052  Galactose metabolism
0053  Ascorbate and aldarate metabolism
0061  Fatty acid biosynthesis (path 1)
0062  Fatty acid biosynthesis (path 2)
0071  Fatty acid metabolism
0072  Synthesis and degradation of ketone bodies
0100  Biosynthesis of steroids
0120  Bile acid biosynthesis
0130  Ubiquinone biosynthesis
0150  Androgen and estrogen metabolism
0190  Oxidative phosphorylation
0193  ATP synthesis
0220  Urea cycle and metabolism of amino groups
0230  Purine metabolism
0240  Pyrimidine metabolism
0251  Glutamate metabolism
0252  Alanine and aspartate metabolism
0260  Glycine, serine and threonine metabolism
0271  Methionine metabolism
0272  Cysteine metabolism
0280  Valine, leucine and isoleucine degradation
0290  Valine, leucine and isoleucine biosynthesis
0300  Lysine biosynthesis
0310  Lysine degradation
0330  Arginine and proline metabolism
0340  Histidine metabolism
0350  Tyrosine metabolism
0360  Phenylalanine metabolism
0361  gamma-Hexachlorocyclohexane degradation
0362  Benzoate degradation via hydroxylation
0380  Tryptophan metabolism
0400  Phenylalanine, tyrosine and tryptophan biosynthesis
0401  Novobiocin biosynthesis
0410  beta-Alanine metabolism
0430  Taurine and hypotaurine metabolism
0440  Aminophosphonate metabolism
0450  Selenoamino acid metabolism
0460  Cyanoamino acid metabolism
0480  Glutathione metabolism
0500  Starch and sucrose metabolism
0510  N-Glycan biosynthesis
0512  O-Glycan biosynthesis
0513  High-mannose type N-glycan biosynthesis
0520  Nucleotide sugars metabolism
0521  Streptomycin biosynthesis
0530  Aminosugars metabolism
0561  Glycerolipid metabolism
0562  Inositol phosphate metabolism
0563  Glycosylphosphatidylinositol (GPI)-anchor biosynthesis
0564  Glycerophospholipid metabolism
0590  Prostaglandin and leukotriene metabolism
0600  Glycosphingolipid metabolism
0602  Blood group glycolipid biosynthesis - neolactoseries
0603  Globoside metabolism
0604  Ganglioside biosynthesis
0620  Pyruvate metabolism
0625  Tetrachloroethene degradation
0630  Glyoxylate and dicarboxylate metabolism
0632  Benzoate degradation via CoA ligation
0640  Propanoate metabolism
0650  Butanoate metabolism
0670  One carbon pool by folate
0680  Methane metabolism
0710  Carbon fixation
0720  Reductive carboxylate cycle (CO2 fixation)
0730  Thiamine metabolism
0740  Riboflavin metabolism
0750  Vitamin B6 metabolism
0760  Nicotinate and nicotinamide metabolism
0770  Pantothenate and CoA biosynthesis
0780  Biotin metabolism
0790  Folate biosynthesis
0860  Porphyrin and chlorophyll metabolism
0900  Terpenoid biosynthesis
0903  Limonene and pinene degradation
0910  Nitrogen metabolism
0920  Sulfur metabolism
0960  Alkaloid biosynthesis II
0970  Aminoacyl-tRNA biosynthesis
2020  Two-component system
3010  Ribosome
3020  RNA polymerase
3022  Basal transcription factors
3030  DNA polymerase
3050  Proteasome
3060  Protein export
4010  MAPK signaling pathway
4070  Phosphatidylinositol signaling system
4110  Cell cycle
4120  Ubiquitin mediated proteolysis
Listing 6.1: Phylogenetic tree relating the 215 organisms. Tree is given in Newick/NH format.
[ Eukaryota ]
(((((Eremothecium_gossypii, spombe), Encephalitozoon_cuniculi), (celegans,
dmelanogaster)), athaliana_all, pfalciparum),
[ Archaea ]
((Aeropyrum_pernix, Pyrobaculum_aerophilum, (Sulfolobus_solfataricus,
Sulfolobus_tokodaii)), Nanoarchaeum_equitans, (Archaeoglobus_fulgidus,
(Haloarcula_marismortui_ATCC_43049, Halobacterium_sp),
Methanobacterium_thermoautotrophicum, Methanopyrus_kandleri,
(Methanococcus_jannaschii, Methanococcus_maripaludis_S2),
(Methanosarcina_acetivorans, Methanosarcina_mazei),
(Picrophilus_torridus_DSM_9790, (Thermoplasma_acidophilum,
Thermoplasma_volcanium)), (Pyrococcus_abyssi, Pyrococcus_furiosus,
Pyrococcus_horikoshii))),
[ Bacteria ]
(Thermotoga_maritima, Pirellula_sp, Aquifex_aeolicus, Fusobacterium_nucleatum,
[ Firmicutes ]
((Mesoplasma_florum_L1, Onion_yellows_phytoplasma,
(Ureaplasma_urealyticum, (Mycoplasma_gallisepticum, Mycoplasma_genitalium,
Mycoplasma_hyopneumoniae_232, Mycoplasma_mobile_163K, Mycoplasma_mycoides,
Mycoplasma_penetrans, Mycoplasma_pneumoniae, Mycoplasma_pulmonis))),
(Thermoanaerobacter_tengcongensis, (Clostridium_acetobutylicum,
Clostridium_perfringens, Clostridium_tetani_E88)),
((Enterococcus_faecalis_V583, (Lactobacillus_johnsonii_NCC_533,
Lactobacillus_plantarum), (Lactococcus_lactis, (Streptococcus_mutans,
(Streptococcus_pneumoniae_R6, Streptococcus_pneumoniae_TIGR4),
(Streptococcus_agalactiae_NEM316, Streptococcus_agalactiae_2603),
(Streptococcus_thermophilus_CNRZ1066, Streptococcus_thermophilus_LMG_18311),
(Streptococcus_pyogenes, Streptococcus_pyogenes_MGAS10394,
Streptococcus_pyogenes_MGAS8232), (Streptococcus_pyogenes_MGAS315,
Streptococcus_pyogenes_SSI-1)))), ((Staphylococcus_epidermidis_ATCC_12228,
(Staphylococcus_aureus_aureus_MRSA252, Staphylococcus_aureus_aureus_MSSA476,
Staphylococcus_aureus_Mu50, Staphylococcus_aureus_MW2,
Staphylococcus_aureus_N315)), (Listeria_innocua, (Listeria_monocytogenes,
Listeria_monocytogenes_4b_F2365)), (Oceanobacillus_iheyensis,
Geobacillus_kaustophilus_HTA426, (Bacillus_halodurans, Bacillus_subtilis,
(Bacillus_licheniformis_ATCC_14580, Bacillus_licheniformis_DSM_13),
(Bacillus_thuringiensis_konkukian, (Bacillus_anthracis_Ames_0581,
Bacillus_anthracis_str_Sterne, Bacillus_anthracis_A2012,
Bacillus_anthracis_Ames), (Bacillus_cereus_ATCC_10987,
Bacillus_cereus_ATCC14579, Bacillus_cereus_ZK))))))),
[ Proteobacteria ]
((Zymomonas_mobilis_ZM4, Silicibacter_pomeroyi_DSS-3,
(Anaplasma_marginale_St_Maries,
(Wolbachia_endosymbiont_of_Drosophila_melanogaster, (Rickettsia_conorii,
(Rickettsia_typhi_wilmington, Rickettsia_prowazekii)))), Mesorhizobium_loti,
Caulobacter_crescentus, ((Sinorhizobium_meliloti,
(Agrobacterium_tumefaciens_C58_Cereon, Agrobacterium_tumefaciens_C58_UWash)),
(Bartonella_henselae_Houston-1, Bartonella_quintana_Toulouse),
(Bradyrhizobium_japonicum, Rhodopseudomonas_palustris_CGA009),
(Brucella_melitensis, Brucella_suis_1330))), (Azoarcus_sp_EbN1,
Nitrosomonas_europaea, (Chromobacterium_violaceum,
(Neisseria_meningitidis_MC58, Neisseria_meningitidis_Z2491)),
((Ralstonia_solanacearum, (Burkholderia_mallei_ATCC_23344,
Burkholderia_pseudomallei_K96243)), (Bordetella_bronchiseptica,
Bordetella_parapertussis, Bordetella_pertussis))),
(Methylococcus_capsulatus_Bath, Francisella_tularensis_tularensis,
Wigglesworthia_brevipalpis, (Coxiella_burnetii, (Legionella_pneumophila_Lens,
Legionella_pneumophila_Paris, Legionella_pneumophila_Philadelphia_1)),
(Idiomarina_loihiensis_L2TR, Shewanella_oneidensis), (Acinetobacter_sp_ADP1,
(Pseudomonas_aeruginosa, Pseudomonas_putida_KT2440, Pseudomonas_syringae)),
(Blochmannia_floridanus, Photorhabdus_luminescens, (Buchnera_sp,
(Buchnera_aphidicola_Sg, Buchnera_aphidicola)),
Erwinia_carotovora_atroseptica_SCRI1043, (Shigella_flexneri_2a_2457T,
Shigella_flexneri_2a), (Salmonella_typhi_Ty2, Salmonella_typhimurium_LT2,
Salmonella_enterica_Paratypi_ATCC_9150, Salmonella_typhi),
(Escherichia_coli_CFT073, Escherichia_coli_K12, Escherichia_coli_O157H7,
Escherichia_coli_O157H7_EDL933), (Yersinia_pseudotuberculosis_IP32953,
(Yersinia_pestis_CO92, Yersinia_pestis_KIM,
Yersinia_pestis_biovar_Mediaevails))), (Photobacterium_profundum_SS9,
(Vibrio_cholerae, Vibrio_parahaemolyticus, (Vibrio_vulnificus_CMCP6,
Vibrio_vulnificus_YJ016))), ((Xanthomonas_campestris, Xanthomonas_citri),
(Xylella_fastidiosa, Xylella_fastidiosa_Temecula1)), (Pasteurella_multocida,
Mannheimia_succiniciproducens_MBEL55E, (Haemophilus_ducreyi_35000HP,
Haemophilus_influenzae))), ((Bdellovibrio_bacteriovorus,
Desulfotalea_psychrophila_LSv54, Desulfovibrio_vulgaris_Hildenborough,
Geobacter_sulfurreducens), (Campylobacter_jejuni, (Wolinella_succinogenes,
(Helicobacter_hepaticus, (Helicobacter_pylori_26695,
Helicobacter_pylori_J99)))))),
[ Actinobacteria ]
(Symbiobacterium_thermophilum_IAM14863, (Bifidobacterium_longum,
(Propionibacterium_acnes_KPA171202, (Streptomyces_avermitilis,
Streptomyces_coelicolor), (Leifsonia_xyli_xyli_CTCB0,
(Tropheryma_whipplei_Twist, Tropheryma_whipplei_TW08_27)),
(Nocardia_farcinica_IFM10152, (Corynebacterium_diphtheriae,
Corynebacterium_efficiens_YS-314, Corynebacterium_glutamicum),
(Mycobacterium_avium_paratuberculosis, Mycobacterium_leprae,
(Mycobacterium_bovis, (Mycobacterium_tuberculosis_CDC1551,
Mycobacterium_tuberculosis_H37Rv))))))),
[ Deinococci ]
(Deinococcus_radiodurans, (Thermus_thermophilus_HB27,
Thermus_thermophilus_HB8)),
[ Chlamydiales ]
(Parachlamydia_sp_UWE25, ((Chlamydia_muridarum, Chlamydia_trachomatis),
(Chlamydophila_caviae, Chlamydophila_pneumoniae_AR39,
Chlamydophila_pneumoniae_CWL029, Chlamydophila_pneumoniae_J138,
Chlamydophila_pneumoniae_TW_183))),
[ Bacteroidetes / Chlorobi group ]
(Chlorobium_tepidum_TLS, (Porphyromonas_gingivalis_W83,
(Bacteroides_fragilis_YCH46, Bacteroides_thetaiotaomicron_VPI-5482))),
[ Spirochaetales ]
((Leptospira_interrogans_serovar_Copenhageni,
Leptospira_interrogans_serovar_Lai), ((Borrelia_burgdorferi,
Borrelia_garinii_PBi), (Treponema_denticola_ATCC_35405, Treponema_pallidum))),
[ Cyanobacteria ]
(Gloeobacter_violaceus, Nostoc_sp, (Prochlorococcus_marinus_MED4,
Prochlorococcus_marinus_MIT9313, Prochlorococcus_marinus_CCMP1375),
(Synechococcus_sp_WH8102, Synechocystis_PCC6803,
Thermosynechococcus_elongatus))));
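The species groupings in this appendix are written in Newick notation: nested parentheses denote clades, commas separate sister taxa, and a semicolon terminates each tree. As an illustrative aside (the parser below is a sketch written for this discussion, not code used in this thesis), such label-only trees can be read into nested tuples with a few lines of Python:

```python
def parse_newick(text):
    """Parse a label-only Newick string into nested tuples of leaf names.

    Handles parentheses, commas, whitespace, and a trailing semicolon;
    branch lengths and quoted labels are out of scope for this sketch.
    """
    s = text.strip().rstrip(';')
    pos = 0

    def skip_ws():
        nonlocal pos
        while pos < len(s) and s[pos].isspace():
            pos += 1

    def parse():
        nonlocal pos
        skip_ws()
        if s[pos] == '(':                      # internal node: parse children
            pos += 1
            children = [parse()]
            skip_ws()
            while s[pos] == ',':
                pos += 1
                children.append(parse())
                skip_ws()
            assert s[pos] == ')', "unbalanced parentheses"
            pos += 1
            return tuple(children)
        start = pos                            # leaf: read a taxon name
        while pos < len(s) and s[pos] not in ',()':
            pos += 1
        return s[start:pos].strip()

    return parse()

print(parse_newick("(A, (B, C));"))  # → ('A', ('B', 'C'))
```

Note that the bracketed group names above (e.g. [ Actinobacteria ]) are treated in standard Newick as comments; stripping them before parsing would be an easy extension, but this sketch does not handle them.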
Chapter 7
Conclusion and Future Work
In this thesis, we have made progress toward solving several problems essential to the
computational exploration of the processes of life. Few problems are more central to
understanding the workings of the cell than protein structure, as proteins are the building
blocks of cellular structures and mechanisms. Equally important is understanding the
regulatory network responsible for modulating the creation of these building blocks.
More generally, determining the role each protein plays in the cell is one of the basic steps
in making sense of life. Looking further into the future, once the workings of the cell are
better understood, protein design will allow humans to tinker with the exquisite machinery
of life to cure disease.
Chapters 3 and 4 tackled a subproblem of protein structure prediction and design
using mathematical programming techniques. Our methods have the advantage of focusing
on provably optimal and near-optimal solutions, rather than merely empirically good
ones. Chapter 5 presented a new way of discovering the binding sites of regulatory
proteins in DNA, and Chapter 6 discussed assigning functions to proteins using
evolutionary information. Though our contributions make significant progress, many
avenues of future work remain.
For our SCP work, it would be useful to extend our methodologies to handle backbone
flexibility, whether fully flexible backbones or smaller movements. Though packages are
currently available that approach structure prediction and design by allowing backbone
motions, backbone flexibility and side-chain flexibility are usually treated independently.
Putting both into a single optimization framework may allow for better solutions.
Continued speed improvements would be welcome as well.
A significant open question in our side-chain positioning work is why integral solutions
are so often observed for native-backbone and homology-modeling problems, but rarely
for design problems. One reasonable hypothesis is that for the former only a small set
of rotamer choices is particularly good, while for the latter, because several amino acids
have very similar side chains and because of the added flexibility afforded by designing
several positions at once, there are many more “good” choices of rotamers. It would be
interesting to test this hypothesis.
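As an illustrative aside, the hypothesis can be made concrete on a toy instance. The energy tables below are invented for illustration and do not come from any real rotamer library: brute-force enumeration shows that sharply peaked self-energies (a "native-like" landscape) yield a single near-optimal assignment, while nearly flat energies (a "design-like" landscape) yield many.

```python
import itertools

def assignment_energy(choice, e_self, e_pair):
    """Energy of one rotamer assignment: self terms plus pairwise terms."""
    total = sum(e_self[i][r] for i, r in enumerate(choice))
    for i, j in itertools.combinations(range(len(choice)), 2):
        total += e_pair[(i, j)][choice[i]][choice[j]]
    return total

def count_near_optimal(e_self, e_pair, window):
    """Count assignments whose energy is within `window` of the optimum."""
    domains = [range(len(s)) for s in e_self]
    energies = [assignment_energy(c, e_self, e_pair)
                for c in itertools.product(*domains)]
    best = min(energies)
    return sum(e <= best + window for e in energies)

# "Native-like" toy instance: one rotamer strongly preferred at each of
# three positions (invented numbers).
native_self = [[0.0, 5.0, 5.0]] * 3
# "Design-like" toy instance: nearly flat energies, so many choices look good.
design_self = [[0.0, 0.1, 0.1]] * 3
# Pairwise terms set to zero to keep the example transparent.
flat_pair = {(i, j): [[0.0] * 3 for _ in range(3)]
             for i, j in itertools.combinations(range(3), 2)}

print(count_near_optimal(native_self, flat_pair, 1.0))   # → 1
print(count_near_optimal(design_self, flat_pair, 1.0))   # → 27
```

In this caricature, a fractional LP optimum has little to gain in the peaked landscape but can spread weight across the many equally good assignments of the flat one, which is consistent with the observed difference between native-backbone and design instances.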
As SDP solvers are improved, it may become useful to apply them to larger instances
of design problems to see how they scale. The SCP design problems make a nice real-
world test case for emerging SDP solver packages. It may also be interesting to apply our
rounding schemes to other optimization problems.
For the problem of motif finding, our ILP should be combined with existing graph-pruning
techniques to determine just how large the problems are that can be solved to optimality.
Casting the problem so that it depends explicitly on the number of possible distances
between motifs, as we have done, may suggest practical improvements in which, for
example, successive ILPs are solved, increasing the number of allowed edge weights until
we can be assured of having found a good solution. It would also be useful to extend the
approach to handle cases where each sequence can contain zero motifs or more than one
motif. Further, a rigorous treatment of the statistical significance of the predicted motifs
may help to weed out spurious matches.
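The successive-ILP idea might be organized as in the following sketch. Here solve_restricted is a hypothetical oracle, standing in for a real ILP solver, that returns both the optimum over the currently allowed edge weights and a lower bound for the unrestricted problem; the stub solver and its numbers are invented purely for illustration.

```python
def successive_ilp(solve_restricted, weights, gap_tol=0.0):
    """Sketch of a successive-ILP strategy: re-solve a restricted problem with
    a growing set of allowed edge weights, stopping once the restricted optimum
    is provably within gap_tol of a lower bound for the full (minimization)
    problem."""
    allowed = []
    value = None
    for w in weights:
        allowed.append(w)
        value, lower_bound = solve_restricted(list(allowed))
        if value - lower_bound <= gap_tol:  # certified near-optimal; stop early
            break
    return value, allowed

# Stub oracle: the restricted optimum improves as more edge weights are
# allowed, and 10.0 is a valid lower bound for the full problem (toy numbers).
def stub_solver(allowed):
    return 16 - 2 * len(allowed), 10.0

print(successive_ilp(stub_solver, [1, 2, 3, 4, 5], gap_tol=0.0))
# → (10, [1, 2, 3]): only three of the five weights were needed
```

The appeal of this scheme is that early, heavily restricted ILPs are small and fast, and the lower bound lets the loop certify optimality without ever solving the full problem.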
Incorporating the improved phylogenetic profile comparison methods presented in
Chapter 6 into meta-methods that aggregate data from many sources should be useful.
It would also be interesting to investigate which organisms are most useful for identifying
the functional linkages arising from which pathways. More generally, one wonders how
the organismal composition of the profiles affects which methods are useful for predicting
shared function. A natural next step is to apply the methods to predict functional linkages
in other organisms.
This thesis has provided new computational methods for attacking several important
problems, along with careful testing of those methods. The methods presented here should
be an excellent starting place from which to continue tackling some of these central
problems in computational biology.
Bibliography
Akutsu, T., Arimura, H., and Shimozono, S. (2000). On approximation algorithms for
local multiple alignment. In Proc. Annual Internat. Conf. on Computat. Mol. Biol.,
pages 1–7.
Alizadeh, F. (1995). Interior point methods in semidefinite programming with applications
to combinatorial optimization. SIAM J. Optim., 5(1):13–51.
Alon, N. and Kahale, N. (1998). Approximating the independence number via the θ-
function. Math. Programming, 80:253–264.
Aloy, P., Stark, A., Hadley, C., and Russell, R. B. (2003). Predictions without templates:
new folds, secondary structure, and contacts in CASP5. PROTEINS: Struct. Funct.
Genet., 53:436–456.
Althaus, E., Kohlbacher, O., Lenhof, H.-P., and Muller, P. (2000). A combinatorial
approach to protein docking with flexible side-chains. In Proc. 4th Annual Internat.
Conf. on Computat. Mol. Biol., pages 15–24, New York, NY. ACM Press.
Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990). Basic
local alignment search tool. J. Mol. Biol., 215:403–410.
Arora, S., Hazan, E., and Kale, S. (2005). Fast algorithms for approximate semidefinite
programming using the multiplicative weights update method. Proc. of the 45th
Annual IEEE Sympos. Found. Comput. Sci., in press.
Arora, S., Lund, C., Motwani, R., Sudan, M., and Szegedy, M. (1998). Proof verification
and hardness of approximation problems. J. ACM, 45(3):501–555.
Arora, S. and Safra, M. (1998). Probabilistic checking of proofs: A new characterization
of NP. J. ACM, 45(1):70–122.
Bafna, V., Lawler, E., and Pevzner, P. A. (1997). Approximation algorithms for multiple
sequence alignment. Theoretical Computer Science, 182:233–244.
Bahadur, K. C. D., Akutsu, T., Tomita, E., and Seki, T. (2004). Protein side-chain packing
problem: a maximum edge-weight clique algorithmic approach. In Proceedings of
the Second Conference on Asia-Pacific Bioinformatics, pages 191–200, Darlinghurst,
Australia. Australian Computer Society, Inc.
Bailey, T. and Elkan, C. (1995). Unsupervised learning of multiple motifs in biopolymers
using expectation maximization. Machine Learning, 21:51–80.
Banner, D. W., Bloomer, A., Petsko, G. A., Phillips, D. C., and Wilson, I. A. (1976).
Atomic coordinates for triose phosphate isomerase from chicken muscle. Biochem.
Biophys. Res. Commun., 72(1):146–55.
Barker, D. and Pagel, M. (2005). Predicting functional gene links from phylogenetic-
statistical analyses of whole genomes. PLOS Comp. Biology, 1(1):24–31.
Benson, S. J., Ye, Y., and Zhang, X. (1999). Mixed linear and semidefinite programming
for combinatorial and quadratic optimization. Optim. Methods and Software, 11:515–
544.
Bertsimas, D. and Ye, Y. (1998). Semidefinite relaxations, multivariate normal distribu-
tions, and order statistics. In Du, D.-Z. and Pardalos, P. M., editors, Handbook of
Combinatorial Optimization, volume 3, pages 1–19. Kluwer Academic Publishers.
Betz, H., Burcham, P. B., and Ewing, G. M. (1954). Differential Equations with
Applications. Harper & Brothers, New York.
Boppana, R. B. (1987). Eigenvalues and graph bisection: An average-case analysis. In
Proc. of the 28th Annual Sympos. on Found. of Comput. Sci., pages 280–285, Wash-
ington, D.C. IEEE Computer Society Press.
Bower, M. J., Cohen, F. E., and Dunbrack, Jr, R. L. (1997). Prediction of protein side-
chain rotamers from a backbone-dependent rotamer library: A homology modeling
tool. J. Mol. Biol., 267:1268–1282.
Bowers, P. M., Cokus, S. J., Eisenberg, D., and Yeates, T. O. (2004a). Use of logic
relationships to decipher protein network organization. Science, 306:2246–2249.
Bowers, P. M., Pellegrini, M., Thompson, M. J., Fierro, J., Yeates, T. O., and Eisen-
berg, D. (2004b). Prolinks: a database of protein functional linkages derived from
coevolution. Genome Biol., 5(5):R35.
Brooks, B. R., Bruccoleri, R. E., Olafson, B. D., States, D. J., Swaminathan, S., and
Karplus, M. (1983). CHARMM: A program for macromolecular energy, minimization,
and dynamics calculations. J. Comp. Chem., 4:187–217.
Canutescu, A. A., Shelenkov, A. A., and Dunbrack Jr., R. L. (2003). A graph-theory
algorithm for rapid protein side-chain prediction. Protein Sci., 12:2001–2014.
Chazelle, B., Kingsford, C., and Singh, M. (2003). The side-chain positioning problem:
a semidefinite programming formulation with new rounding schemes. In Proceed-
ings of the Paris C. Kanellakis memorial workshop on Principles of computing and
knowledge, pages 86–94.
Chazelle, B., Kingsford, C., and Singh, M. (2004). A semidefinite-programming approach
to side-chain positioning with new rounding strategies. INFORMS J. on Comput.,
16:380–392.
Cliften, P., Sudarsanam, P., Desikan, A., Fulton, L., Fulton, B., Majors, J., et al. (2003).
Finding functional features in Saccharomyces genomes by phylogenetic footprinting.
Science, 301:71–76.
Cook, W., Cunningham, W., Pulleyblank, W., and Schrijver, A. (1997). Combinatorial
Optimization. Wiley-Interscience, New York.
Cornell, W. D., Cieplak, P., Bayly, C. I., Gould, I. R., Merz, Jr., K. M., Ferguson,
D. M., Spellmeyer, D. C., Fox, T., Caldwell, J. W., and Kollman, P. A. (1995). A
second generation force field for the simulation of proteins, nucleic acids, and organic
molecules. J. Am. Chem. Soc., 117:5179–5197.
Cover, T. M. and Thomas, J. A. (1991). Elements of Information Theory. Wiley-
Interscience.
Dahiyat, B. I. and Mayo, S. L. (1997). De novo protein design: Fully automated sequence
selection. Science, 278:82–87.
Date, S. V. and Marcotte, E. M. (2003). Discovery of uncharacterized cellular systems by
genome-wide analysis of functional linkages. Nat. Biotechnol., 21(9):1055–1062.
Desmet, J., De Maeyer, M., Hazes, B., and Lasters, I. (1992). The dead-end elimination
theorem and its use in protein side-chain positioning. Nature, 356:539–542.
Desmet, J., De Maeyer, M., and Lasters, I. (1994). The “dead end elimination” theorem
as a new approach to the side-chain packing problem. In Merz, K. and LeGrand,
S., editors, The Protein Folding Problem and Tertiary Structure Prediction, pages
307–337. Birkhauser, Boston, MA.
Donath, W. E. and Hoffman, A. (1972). Algorithms for partitioning of graphs and com-
puter logic based on eigenvectors of connection matrices. IBM Tech. Disclosure Bull.,
15:938–944.
Dunbrack Jr, R. L. and Karplus, M. (1993). Backbone-dependent rotamer library for
proteins: application to side-chain prediction. J. Mol. Biol., 230:543–574.
Enright, A. J., Iliopoulos, I., Kyrpides, N. C., and Ouzounis, C. A. (1999). Protein
interaction maps for complete genomes based on gene fusion events. Nature, 402:86–
90.
Eriksson, O., Zhou, Y., and Elofsson, A. (2001). Side chain-positioning as an integer
programming problem. In Proc. of 1st Workshop on Algorithms in BioInformatics,
pages 129–141, BRICS, University of Aarhus, Denmark.
Eskin, E. and Pevzner, P. (2002). Finding composite regulatory patterns in DNA sequences.
Bioinformatics, 18:S354–S363.
Feige, U. and Kilian, J. (1998). Heuristics for finding large independent sets, with appli-
cations to coloring semi-random graphs. In Proc. of the 39th Annual Sympos. Found.
Comput. Sci., pages 674–683, Los Alamitos, CA. IEEE Computer Society.
Felsenstein, J. (1981). Evolutionary trees from DNA sequences: A maximum likelihood
approach. J. Mol. Evol., 17:368–376.
Fourer, R., Gay, D. M., and Kernighan, B. W. (2002). AMPL: A Modeling Language for
Mathematical Programming. Brooks/Cole Publishing Company, Pacific Grove, CA.
Friedman, N., Ninio, M., Pe’er, I., and Pupko, T. (2002). A structural EM algorithm for
phylogenetic inference. J. Comp. Biol., 9(2):331–353.
Frieze, A. and Jerrum, M. (1997). Improved approximation algorithms for MAX k-CUT
and MAX BISECTION. Algorithmica, 18(1):61–77.
Frith, M., Hansen, U., Spouge, J., and Weng, Z. (2004). Finding functional sequence elements
by multiple local alignment. Nucleic Acids Res., 32:189–200.
Fujisawa, K., Fukuda, M., Kojima, M., and Nakata, K. (1997). Numerical evaluation
of SDPA. Technical Report B-330, Department of Mathematical and Computing
Sciences, Tokyo Institute of Technology, Oh-Okayama, Meguro-ku, Tokyo 152, Japan.
Gasterland, T. and Ragan, M. A. (1998). Microbial genescapes: phyletic and functional
patterns of ORF distribution among prokaryotes. Microb. Comp. Genomics, 3:199–
217.
Gertz, J., Elfond, G., Shustrova, A., Weisinger, M., Pellegrini, M., Cokus, S., and Roth-
schild, B. (2003). Inferring protein interactions from phylogenetic distance matrices.
Bioinformatics, 19(16):2039–2045.
Goemans, M. X. and Williamson, D. P. (1995). Improved approximation algorithms for
maximum cut and satisfiability problems using semidefinite programming. J. ACM,
42:1115–1145.
Goh, C.-S., Bogan, A. A., Joachimiak, M., Walther, D., and Cohen, F. E. (2000). Co-
evolution of proteins with their interaction partners. J. Mol. Biol., 299:283–293.
Goldstein, R. F. (1994). Efficient rotamer elimination applied to protein side-chains and
related spin glasses. Biophys. J., 66:1335–1340.
Gordon, D. B., Hom, G., Mayo, S., and Pierce, N. (2002). Exact rotamer optimization
for protein design. J. Comput. Chemistry, 24:232–243.
Gordon, D. B. and Mayo, S. L. (1998). Radical performance enhancements for combinato-
rial optimization algorithms based on the dead-end elimination theorem. J. Comput.
Chem., 19(13):1505–1514.
Gray, J. J., Moughon, S., Wang, C., Schueler-Furman, O., Kuhlman, B., Rohl, C. A.,
and Baker, D. (2003). Protein-protein docking with simultaneous optimization of
rigid-body displacement and side-chain conformations. J. Mol. Biol., 331:281–299.
Grotschel, M., Lovasz, L., and Schrijver, A. (1993). Geometric Algorithms and Combina-
torial Optimization. Springer-Verlag, Berlin, Germany, 2nd edition.
Gusfield, D. (1993). Efficient methods for multiple sequence alignment with guaranteed
error bounds. Bull. Math. Biol., 55:141–154.
Harbury, P. B., Plecs, J. J., Tidor, B., Alber, T., and Kim, P. S. (1998). High-resolution
protein design with backbone freedom. Science, 282:1462–1467.
Hartwell, L. H., Hopfield, J. J., Leibler, S., and Murray, A. W. (1999). From molecular
to modular cell biology. Nature, 402:C47–C52.
Hertz, G. and Stormo, G. (1999). Identifying DNA and protein patterns with statistically
significant alignments of multiple sequences. Bioinformatics, 15:563–577.
Holm, L. and Sander, C. (1991). Database algorithm for generating protein backbone and
sidechain coordinates from a Cα trace: Application to model building and detection
of coordinate errors. J. Mol. Biol., 218:183–194.
ILOG CPLEX (2000). ILOG CPLEX 7.1. http://www.ilog.com/products/cplex/.
Jansen, R., Yu, H., Greenbaum, D., Kluger, Y., Krogan, N. J., Chung, S., Emili, A., Sny-
der, M., Greenblatt, J. F., and Gerstein, M. (2003). A Bayesian networks approach for
predicting protein-protein interactions from genomic data. Science, 302(5644):449–
453.
Jones, T. A. and Kleywegt, G. J. (1999). CASP3 comparative modeling evaluation.
Proteins, 37:30–46.
Jothi, R., Kann, M., and Przytycka, T. (2005). Predicting protein-protein interaction by
searching evolutionary tree automorphism space. In Proceedings of the 13th Annual
International Conference on Intelligent Systems for Molecular Biology.
Kanehisa, M. (1997). A database for post-genome analysis. Trends Genet., 13:375–376.
Kanehisa, M. and Goto, S. (2000). KEGG: Kyoto encyclopedia of genes and genomes.
Nuc. Acids Res., 28:27–30.
Karger, D., Motwani, R., and Sudan, M. (1998). Approximate graph coloring by semidef-
inite programming. J. ACM, 45(2):246–265.
Kellis, M., Patterson, N., Endrizzi, M., Birren, B., and Lander, E. (2003). Sequencing
and comparison of yeast species to identify genes and regulatory elements. Nature,
423:241–254.
Kingsford, C., Chazelle, B., and Singh, M. (2005). Solving and analyzing side-chain posi-
tioning problems using linear and integer programming. Bioinformatics, 21(7):1028–
1039.
Klepeis, J. L., Floudas, C. A., Morikis, D., Tsokos, C. G., Argyropoulos, E., Spruce, L.,
and Lambris, J. D. (2003). Integrated computational and experimental approach for
lead optimization and design of compstatin variants with improved activity. J. Am.
Chem. Soc., 125:8422–8423.
Kohlbacher, O. and Lenhof, H.-P. (2000). BALL — rapid software prototyping in com-
putational molecular biology. Bioinformatics, 16(9):815–824.
Kortemme, T., Joachimiak, L. A., Bullock, A. N., Schuler, A. D., Stoddard, B. L., and
Baker, D. (2004). Computational redesign of protein-protein interaction specificity.
Nature Struct. Mol. Biol., 11(4):371–379.
Kuhlman, B. and Baker, D. (2000). Native protein sequences are close to optimal for their
structures. Proc. Natl. Acad. Sci. USA, 97(19):10383–10388.
Kuhlman, B., Dantas, G., Ireton, G. C., Varani, G., Stoddard, B. L., and Baker, D.
(2003). Design of a novel globular protein fold with atomic-level accuracy. Science,
302:1364–1368.
Lasters, I., De Maeyer, M., and Desmet, J. (1995). Enhanced dead-end elimination in
the search for the global minimum energy conformation of a collection of protein side
chains. Prot. Eng., 8:815–822.
Lau, H. C. (2002). A new approach for weighted constraint satisfaction. Constraints,
7:151–165.
Lau, H. C. and Watanabe, O. (1996). Randomized approximation of the constraint satis-
faction problem. In Karlsson, R. and Lingas, A., editors, Proc. of the 5th Scandinavian
Workshop on Algorithm Theory, pages 76–87, Berlin, Germany. Springer.
Lawrence, C., Altschul, S., Boguski, M., Liu, J., Neuwald, A., and Wootton, J. (1993).
Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment.
Science, 262:208–214.
Lawrence, C. and Reilly, A. (1990). An expectation maximization (EM) algorithm for
the identification and characterization of common sites in unaligned biopolymer se-
quences. Proteins: Structure, Function, and Genetics, 7:41–51.
Leach, A. R. and Lemon, A. P. (1998). Exploring the conformational space of protein side
chains using dead-end elimination and the A* algorithm. Proteins, 33:227–239.
Lecompte, O., Ripp, R., Thierry, J.-C., Moras, D., and Poch, O. (2002). Comparative
analysis of ribosomal proteins in complete genomes: an example of reductive evolution
at the domain scale. Nuc. Acids Res., pages 5382–5390.
Lee, C. (1994). Predicting protein mutant energetics by self-consistent ensemble optimiza-
tion. J. Mol. Biol., 236(3):918–939.
Lee, C. and Subbiah, S. (1991). Prediction of protein side-chain conformation by packing
optimization. J. Mol. Biol., 217(2):373–388.
Lee, I., Date, S. V., Adai, A. T., and Marcotte, E. M. (2004). A probabilistic functional
network of yeast genes. Science, 306:1555–1558.
Lee, T., Rinaldi, N., Robert, F., Odom, D., Bar-Joseph, Z., Gerber, G. K., et al. (2002).
Transcriptional regulatory networks in Saccharomyces cerevisiae. Science, 298:799–804.
Lesk, A. M., Branden, C.-I., and Chothia, C. (1989). Structural principles of α/β barrel
proteins: The packing of the interior of the sheet. Proteins, 5:139–148.
Liberles, D. A., Thoren, A., von Heijne, G., and Elofsson, A. (2002). The use of phyloge-
netic profiles for gene predictions. Current Genomics, 3:131–137.
Lilien, R. H., Stevens, B. W., Anderson, A. C., and Donald, B. R. (2004). A novel
ensemble-based scoring and search algorithm for protein redesign, and its applica-
tion to modify the substrate specificity of the gramicidin synthetase a phenylalanine
adenylation enzyme. In Proc. 8th Annual Internat. Conf. on Computat. Mol. Biol.,
pages 46–57, New York, NY. ACM Press.
Liu, X., Brutlag, D., and Liu, J. (2001). Bioprospector: discovering conserved DNA motifs
in upstream regulatory regions of co-expressed genes. In Pac. Symp. Biocomp., pages
127–138.
Looger, L. L., Dwyer, M. A., Smith, J. J., and Hellinga, H. W. (2003). Computational
design of receptor and sensor proteins with novel functions. Nature, 423:185–190.
Looger, L. L. and Hellinga, H. W. (2001). Generalized dead-end elimination algorithms
make large-scale protein side-chain structure prediction tractable: implications for
protein design and structural genomics. J. Mol. Biol., 307(1):429–445.
Lovasz, L. (1979). On the Shannon capacity of a graph. IEEE Trans. Inform. Theory,
25:1–7.
Lu, L., Xia, Y., Paccanaro, A., Yu, H., and Gerstein, M. (2005). Assessing the limits of
genomic data integration for predicting protein networks. Genome Res., 15:945–953.
MacKerell, Jr., A. D., Brooks, B., Brooks, III, C. L., Nilsson, L., Roux, B., Won, Y., and
Karplus, M. (1998). CHARMM: The energy function and its parameterization with
an overview of the program. In Schleyer, P. v. R., et al., editors, The Encyclopedia of
Computational Chemistry, volume 1, pages 271–277. John Wiley & Sons, Chichester.
Malakauskas, S. M. and Mayo, S. L. (1998). Design, structure and stability of a hyper-
thermophilic protein variant. Nat. Struct. Biol., 5(6):470–475.
Marcotte, E. M., Pellegrini, M., Ng, H. L., Rice, D. W., Yeates, T. O., and Eisenberg,
D. (1999a). Detecting protein function and protein-protein interactions from genome
sequences. Science, 285(5428):751–753.
Marcotte, E. M., Pellegrini, M., Thompson, M. J., Yeates, T. O., and Eisenberg, D.
(1999b). A combined algorithm for genome-wide prediction of protein function. Na-
ture, 402:83–86.
Marsan, L. and Sagot, M. F. (2000). Algorithms for extracting structured motifs using a
suffix tree with an application to promoter and regulatory site consensus identifica-
tion. J. Comp. Bio., 7:345–362.
Martin, A. C. R. (2001). ProFit, version 2.2.
http://www.bioinf.org.uk/software/profit.
McGuire, A., Hughes, J., and Church, G. (2000). Conservation of DNA regulatory motifs
and discovery of new motifs in microbial genomes. Genome Res., 10:744–757.
McLachlan, A. D. (1982). Rapid comparison of protein structures. Acta Cryst, A38:871–
873.
Mewes, H. W., Frishman, D., Guldener, U., Hannhaupt, G., Mayer, K., Mokrejs, M.,
Morgenstern, B., Munsterkotter, M., Rudd, S., and Weil, B. (2002). MIPS: a database
for genomes and protein sequences. Nuc. Acids Res., 30(1):31–34.
Morett, E., Korbel, J. O., Rajan, E., Saab-Rincon, G., Olvera, L., Olvera, M., Schmidt,
S., Snel, B., and Bork, P. (2003). Systematic discovery of analogous enzymes in
thiamin biosynthesis. Nat. Biotechnol., 21(7):790–795.
Mueller, U., Perl, D., Schmid, F. X., and Heinemann, U. (2000). Thermal stability and
atomic-resolution crystal structure of the Bacillus caldolyticus cold shock protein. J.
Mol. Biol., 297(4):975–988.
NCBI (2005). National Center for Biotechnology Information.
ftp://ftp.ncbi.nih.gov/genomes/.
Nesterov, Y. and Nemirovskii, A. (1993). Interior Point Polynomial Methods in Convex
Programming: Theory and Algorithms. SIAM, Philadelphia, PA.
Nicholls, A., Sharp, K. A., and Honig, B. (1991). Protein folding and association: In-
sights from the interfacial and thermodynamic properties of hydrocarbons. Proteins,
11(4):281–296.
Osada, R., Zaslavsky, E., and Singh, M. (2004). Comparative analysis of methods for
representing and searching for transcription factor binding sites. Bioinformatics,
20:3516–3525.
Overbeek, R., Fonstein, M., D’Souza, M., Pusch, G. D., and Maltsev, N. (1999). The use
of gene clusters to infer functional coupling. Proc. Natl. Acad. Sci., 96(6):2896–2901.
Park, S., Yang, X., and Saven, J. G. (2004). Advances in computational protein design.
Curr. Opinion Struct. Biol., 14(4):487–494.
Pavesi, G., Mereghetti, P., Mauri, G., and Pesole, G. (2004). Weeder Web: discovery of
transcription factor binding sites in a set of sequences from co-regulated genes. Nucl.
Acids Res., 32:W199–W203.
Pazos, F. and Valencia, A. (2001). Similarity of phylogenetic trees as indicator of protein-
protein interaction. Prot. Eng., 14(9):609–614.
Pellegrini, M., Marcotte, E. M., Thompson, M. J., Eisenberg, D., and Yeates, T. O. (1999).
Assigning protein functions by comparative genome analysis: protein phylogenetic
profiles. Proc. Natl. Acad. Sci., 96:4285–4288.
Petrey, D., Xiang, Z., Tang, C., Xie, L., Gimpelev, M., Mitros, T., Soto, C., Goldsmith-
Fischman, S., Kernytsky, A., Schlessinger, A., Koh, I., Alexov, E., and Honig, B.
(2003). Using multiple structure alignments, fast model building and energetic anal-
ysis in fold recognition and homology modeling. Proteins, 53:430–435.
Pevzner, P. and Sze, S. (2000). Combinatorial approaches to finding subtle signals in DNA
sequences. In Proc. Intell. Syst. in Mol. Biol., pages 269–278.
Pierce, N. A., Spriet, J. A., Desmet, J., and Mayo, S. L. (2000). Conformational splitting:
A more powerful criterion for dead-end elimination. J. Comput. Chem, 21(11):999–
1009.
Pierce, N. A. and Winfree, E. (2002). Protein design is NP-hard. Prot. Eng., 15(10):779–
782.
Ponder, J. W. and Richards, F. M. (1987). Tertiary templates for proteins. Use of packing
criteria in the enumeration of allowed sequences for different structural classes. J.
Mol. Biol., 193(4):775–791.
Raghavan, P. and Thompson, C. (1987). Randomized rounding: a technique for provably
good algorithms and algorithmic proofs. Combinatorica, 7(4):365–374.
Ramani, A. K. and Marcotte, E. M. (2003). Exploiting the co-evolution of interacting
proteins to discover interaction specificity. J. Mol. Biol., 327:273–284.
Reinert, K., Lenhof, H., Mutzel, P., Mehlhorn, K., and Kececioglu, J. (1997). A branch-
and-cut algorithm for multiple sequence alignment. In Proc. Annual Internat. Conf.
on Computat. Mol. Biol., pages 241–249.
Rigoutsos, I. and Floratos, A. (1998). Combinatorial pattern discovery in biological se-
quences: The TEIRESIAS algorithm. Bioinformatics, 14:55–67.
Robison, K., McGuire, A. M., and Church, G. M. (1998). A comprehensive library of
DNA-binding site matrices for 55 proteins applied to the complete Escherichia coli
K-12 genome. J. Mol. Biol., 284:241–254.
Rolim, J. D. P. and Trevisan, L. (1998). A case study of de-randomization methods for
combinatorial approximation problems. J. of Combin. Optim., 2(3):219–236.
Samudrala, R. and Moult, J. (1998). A graph-theoretic algorithm for comparative mod-
eling of protein structure. J. Mol. Biol., 279(1):287–302.
Schuler, G., Altschul, S., and Lipman, D. (1991). A workbench for multiple alignment
construction and analysis. Proteins, 9(3):180–190.
Seneta, E. (1981). Non-negative Matrices and Markov Chains. Springer-Verlag, New York,
NY, 2nd edition.
Simons, K. T., Kooperberg, C., Huang, E., and Baker, D. (1997). Assembly of pro-
tein tertiary structures from fragments with similar local sequences using simulated
annealing and Bayesian scoring functions. J. Mol. Biol., 268:209–225.
Sinha, S. and Tompa, M. (2003). YMF: A program for discovery of novel transcription
factor binding sites by statistical overrepresentation. Nucl. Acids Res., 31:3586–3588.
Summers, N. and Karplus, M. (1989). Construction of side-chains in homology modeling.
Application to the C-terminal lobe of rhizopuspepsin. J. Mol. Biol., 210:785–811.
Sze, S.-H., Lu, S., and Chen, J. (2004). Integrating sample-driven and pattern-driven
approaches in motif finding. In Proc. Workshop on Algo. in Biocomp., pages 438–
449.
Tatusov, R. L., Koonin, E. V., and Lipman, D. J. (1997). A genomic perspective on
protein families. Science, 278(5338):631–637.
Tavazoie, S., Hughes, J. D., Campbell, M. J., Cho, R. J., and Church, G. (1999). System-
atic determination of genetic network architecture. Nature Genetics, 22(3):281–285.
Thompson, J. D., Higgins, D. G., and Gibson, T. J. (1994). Clustal W: improving
the sensitivity of progressive multiple sequence alignment through sequence weight-
ing, position-specific gap penalties and weight matrix choice. Nucleic Acids Res.,
22:4673–4680.
Tompa, M. (1999). An exact method for finding short motifs in sequences, with application
to the ribosome binding site problem. In Proc. Intell. Syst. in Mol. Biol., pages 262–
271.
Tompa, M., Li, N., Bailey, T. L., Church, G., De Moor, B., Eskin, E., et al. (2005).
Assessing computational tools for the discovery of transcription factor binding sites.
Nature Biotech., 23:137–144.
Troyanskaya, O., Dolinski, K., Owen, A. B., Altman, R. B., and Botstein, D. (2003).
A Bayesian framework for combining heterogeneous data sources for gene function
prediction (in Saccharomyces cerevisiae). Proc. Natl. Acad. Sci., 100(14):8348–8353.
van Helden, J., Rios, A., and Collado-Vides, J. (2000). Discovering regulatory elements in
non-coding sequences by analysis of spaced dyads. Nucleic Acids Res., 28:1808–1818.
Vandenberghe, L. and Boyd, S. (1996). Semidefinite programming. SIAM Rev., 38(1):49–
95.
Vazirani, V. (2001). Approximation Algorithms. Springer-Verlag, Berlin.
Ventura, S. and Serrano, L. (2004). Designing proteins from the inside out. Proteins,
56:1–10.
Vert, J.-P. (2002). A tree kernel to analyse phylogenetic profiles. Bioinformatics, 18:S275–
S284.
von Mering, C., Huynen, M., Jaeggi, D., Schmidt, S., Bork, P., and Snel, B. (2003).
STRING: a database of predicted functional associations between proteins. Nuc.
Acids Res., 31:258–261.
Wang, C., Schueler-Furman, O., and Baker, D. (2005). Improved side-chain modeling for
protein-protein docking. Protein Sci., 14:1328–1339.
Wu, J., Kasif, S., and DeLisi, C. (2003). Identification of functional links between genes
using phylogenetic profiles. Bioinformatics, 19(12):1524–1530.
Xiang, Z. and Honig, B. (2001). Extending the accuracy limits of prediction for side-chain
conformations. J. Mol. Biol., pages 421–430.
Xu, J. (2005). Rapid protein side-chain packing via tree decomposition. In Miyano, S.,
Mesirov, J., Kasif, S., Istrail, S., Pevzner, P., and Waterman, M., editors, 9th Annual
Internat. Conf. Res. In Comp. Mol. Biol., pages 423–439. Springer.
Zaslavsky, E. and Singh, M. (2005). Combinatorial optimization approaches to motif
finding. Manuscript, submitted for publication.
Zwick, U. (1999). Outward rotations: a tool for rounding solutions of semidefinite pro-
gramming relaxations, with applications to MAX CUT and other problems. In Proc.
of the 31st Annual ACM Sympos. on Theory of Comput., pages 679–687, New York,
NY. ACM Press.