Carnegie Mellon School of Computer Science Copyright © 2004, Carnegie Mellon. All Rights Reserved. Biological Language Modeling Project Segmentation Conditional


Segmentation Conditional Random Fields (SCRFs)

A New Approach for Protein Fold Recognition

Yan Liu1, Jaime Carbonell1, Peter Weigele2, Vanathi Gopalakrishnan3

1. School of Computer Science, Carnegie Mellon University

2. Biology Department, Massachusetts Institute of Technology

3. Center for Biomedical Informatics, University of Pittsburgh



Structural motif recognition

• Structural motif
  – Regular arrangement of secondary structural elements, which commonly appears in a variety of protein families
  – Super-secondary structure, or protein fold
  – Examples: β-α-β (2CMD), Leucine-rich repeats (1A4Y)
• Structural motif recognition
  – Given a structural motif and a protein sequence, predict the presence of the motif and its exact location in the protein, based on sequence only



Previous work on structural motif recognition

• General approaches for structural motif recognition
  – Sequence similarity searches, e.g. PSI-BLAST [Altschul et al, 1997]
  – Profile HMM, e.g. HMMER [Durbin et al, 1998] and SAM [Karplus et al, 1998]
  – Homology modeling or threading, e.g. Threader [Jones, 1998]
• Methods of careful design for specific structure motifs
  – Example: αα- and ββ-hairpins, β-turn and β-helix

Our goal is to have a general probabilistic framework to address all these problems for structural motif prediction

Major challenges:
  – Structural similarity without clear sequence similarity
  – Long-range interactions, such as β-sheets

Hard to generalize



Outline

• Introduction
• Conditional random fields
• Segmentation conditional random fields
• Case study on β-helix fold recognition
• Conclusion



Graphical models for protein structure prediction

• Graphical models for protein structure prediction

– Probabilistic causal networks [Delcher et al, 1993]

– Markov random fields [White et al, 1994]

– Hidden Markov model [Bystroff et al, 2000]

– Bayesian segmentation model [Schmidler and Liu, 2000]

• Protein structure prediction can be generalized as learning problems for structured data

– Structured data: observation with internal or external structures

– Conditional graphical models are successful in various applications



Conditional random fields

• Conditional random fields (CRFs) [Lafferty et al, 2001]
  – A conditional undirected graphical model
  – The conditional probability is defined as

$$P(y \mid x) = \frac{1}{Z} \exp\left( \sum_{i=1}^{N} \sum_{k=1}^{K} \lambda_k f_k(x, i, y_{i-1}, y_i) \right)$$

  – Flexible feature definition
  – Convex optimization function guarantees the globally optimal solution
  – Efficient inference algorithms
  – Kernel CRFs permit the use of implicit feature spaces via kernels [Lafferty et al, 2004]

[Figure: graphical structures of HMMs and CRFs]
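The CRF definition above can be made concrete with a small sketch. The code below is illustrative only: the label set, the two feature functions, and the names `features`, `score`, and `conditional_probability` are assumptions and do not come from the slides. It computes the normalizer Z by brute-force enumeration, which is feasible only for very short sequences; practical CRF implementations use dynamic programming instead.

```python
import itertools
import math

LABELS = ["H", "E"]  # hypothetical label set, for illustration only


def features(x, i, y_prev, y_cur):
    """Return the K feature values f_k(x, i, y_{i-1}, y_i) (illustrative features)."""
    return [
        1.0 if x[i] == "V" and y_cur == "E" else 0.0,  # observation-dependent feature
        1.0 if y_prev == y_cur else 0.0,               # label-transition feature
    ]


def score(x, y, weights):
    """Unnormalized log-score: sum_i sum_k lambda_k * f_k(x, i, y_{i-1}, y_i)."""
    total = 0.0
    for i in range(len(x)):
        y_prev = y[i - 1] if i > 0 else None  # no previous label at the first position
        total += sum(w * f for w, f in zip(weights, features(x, i, y_prev, y[i])))
    return total


def conditional_probability(x, y, weights):
    """P(y | x) = exp(score) / Z, with Z summed over every possible labeling."""
    z = sum(
        math.exp(score(x, list(candidate), weights))
        for candidate in itertools.product(LABELS, repeat=len(x))
    )
    return math.exp(score(x, y, weights)) / z
```

With all weights set to zero every labeling is equally likely, so a sequence of length 3 with two labels gives each labeling probability 1/8.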



Graphical models for Structural Motif detection

• Structural motif detection
  – Structural components
    • Secondary structural elements instead of individual residues
  – Informative features
    • Indicator for conserved regions
    • Length of each component
    • Propensities to form hydrogen bonds in β-sheets

Segmented Markov Models



Segmentation conditional random fields (I)

• Protein structural graph G = <V, E>
  – V: nodes for the secondary structural elements of variable lengths
  – E1: edges between adjacent nodes for peptide bonds
  – E2: edges between distant nodes for hydrogen bonds or disulfide bonds

• Example: β-α-β motif

• Tradeoff between fidelity of the model and graph complexity
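As a rough illustration of the graph definition above, the sketch below encodes a β-α-β motif as nodes plus E1 and E2 edge lists. The class name `StructureGraph`, the node names, and the helper `beta_alpha_beta` are hypothetical; the single E2 edge assumes that the two β-strands of the motif are hydrogen-bonded to each other.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class StructureGraph:
    """Protein structural graph G = <V, E> with E split into E1 and E2."""
    nodes: List[str]
    e1: List[Tuple[str, str]] = field(default_factory=list)  # adjacent nodes (peptide bonds)
    e2: List[Tuple[str, str]] = field(default_factory=list)  # distant nodes (long-range bonds)

    def neighbors(self, node: str) -> Set[str]:
        """All nodes connected to `node` by either kind of edge."""
        out: Set[str] = set()
        for a, b in self.e1 + self.e2:
            if a == node:
                out.add(b)
            elif b == node:
                out.add(a)
        return out


def beta_alpha_beta() -> StructureGraph:
    """Illustrative graph for a beta-alpha-beta motif (node names are made up)."""
    nodes = ["beta1", "alpha", "beta2"]
    e1 = [("beta1", "alpha"), ("alpha", "beta2")]  # sequence-adjacent elements
    e2 = [("beta1", "beta2")]                      # assumed strand pairing
    return StructureGraph(nodes, e1, e2)
```

Adding more E2 edges makes the graph a more faithful picture of the structure but also makes inference harder, which is the tradeoff the last bullet refers to.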



Segmentation conditional random fields (II)

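• Segmentation conditional random fields (SCRFs)
  – Given a protein structure graph G, we define a segmentation of the sequence W = (M, S), where M is the number of segments and S_i = <p_i, q_i, y_i> gives the start position, end position, and state label of segment i
  – The conditional probability of the segmentation W given the observation x is defined as

$$P(W \mid x) = \frac{1}{Z} \exp\left( \sum_{c \in C_G} \sum_{k=1}^{K} \lambda_k f_k(x, w_c) \right)$$

    where C_G is the set of cliques of G and w_c denotes the segments in clique c
  – If each subgraph of the resulting graph is a tree or a chain, we can simplify the model to be

$$P(W \mid x) = \frac{1}{Z} \exp\left( \sum_{i=1}^{M} \sum_{k=1}^{K} \lambda_k f_k(x, S_i, S_{\pi_i}) \right)$$

    where S_{\pi_i} denotes the parent of segment S_i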
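To make the segment notation concrete, here is a minimal sketch of a segmentation and its unnormalized log-score under the simplified (tree or chain) form of the model. Everything named here is an assumption for illustration: the `Segment` type, the two toy features, and the `parents` list that stores, for each segment, the index of its parent segment (or `None`).

```python
from typing import List, NamedTuple, Optional, Sequence


class Segment(NamedTuple):
    p: int  # start position (inclusive)
    q: int  # end position (inclusive)
    y: str  # state label


def covers_sequence(n: int, segs: Sequence[Segment]) -> bool:
    """True if the segments tile positions 0..n-1 with no gaps or overlaps."""
    expected_start = 0
    for seg in segs:
        if seg.p != expected_start or seg.q < seg.p:
            return False
        expected_start = seg.q + 1
    return expected_start == n


def segment_features(x: str, seg: Segment, parent: Optional[Segment]) -> List[float]:
    """Toy features f_k(x, S_i, S_parent): segment length and label change."""
    length = seg.q - seg.p + 1
    changed = parent is not None and parent.y != seg.y
    return [float(length), 1.0 if changed else 0.0]


def log_score(x: str, segs: Sequence[Segment],
              parents: Sequence[Optional[int]], weights: Sequence[float]) -> float:
    """sum_i sum_k lambda_k * f_k(x, S_i, S_parent(i)), i.e. log P(W|x) + log Z."""
    total = 0.0
    for seg, parent_idx in zip(segs, parents):
        parent = segs[parent_idx] if parent_idx is not None else None
        feats = segment_features(x, seg, parent)
        total += sum(w * f for w, f in zip(weights, feats))
    return total
```

Dividing exp(log_score) by the normalizer Z, taken over all valid segmentations, would give P(W|x); the sketch stops at the unnormalized score because computing Z requires an inference algorithm.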



Training and Testing for SCRFs

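• Training phase: learn the model parameters
  – Minimize the regularized log loss

$$L = \sum_{i=1}^{M} \sum_{k=1}^{K} \lambda_k f_k(x, s_i, s_{\pi_i}) - \log Z + \text{(regularization term)}$$

  – Seek the direction in which the empirical feature values agree with their expectations

$$\frac{\partial L}{\partial \lambda_k} = \sum_{i=1}^{M} \left( f_k(x, s_i, s_{\pi_i}) - E_{P(S \mid x)}\left[ f_k(x, S_i, S_{\pi_i}) \right] \right) + \text{(regularization term)} = 0$$

    (the symbols of the regularization term did not survive extraction of the slide)
  – An iterative search algorithm has to be applied
• Testing phase: search for the segmentation that maximizes P(W|x)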
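The "empirical minus expected" form of the gradient can be sketched on a toy problem in which every candidate segmentation is already summarized by its total feature vector. This is an assumption-heavy illustration, not the training procedure used by the authors: the candidates are enumerated explicitly, the regularizer is assumed to be a Gaussian (L2) penalty with variance `sigma2`, and plain gradient ascent stands in for whatever iterative search was actually used.

```python
import math


def model_probs(feature_vectors, weights):
    """P(W | x) for each candidate, from its total feature vector."""
    scores = [sum(w * f for w, f in zip(weights, fv)) for fv in feature_vectors]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def gradient(feature_vectors, observed, weights, sigma2=10.0):
    """Empirical features minus expected features, minus the assumed L2 penalty."""
    probs = model_probs(feature_vectors, weights)
    grad = []
    for k in range(len(weights)):
        empirical = feature_vectors[observed][k]
        expected = sum(p * fv[k] for p, fv in zip(probs, feature_vectors))
        grad.append(empirical - expected - weights[k] / sigma2)
    return grad


def fit(feature_vectors, observed, n_weights, step=0.5, iters=500, sigma2=10.0):
    """Gradient ascent on the regularized conditional log-likelihood."""
    weights = [0.0] * n_weights
    for _ in range(iters):
        g = gradient(feature_vectors, observed, weights, sigma2)
        weights = [w + step * gk for w, gk in zip(weights, g)]
    return weights
```

At convergence the gradient is close to zero, meaning the expected feature values under the model match the empirical ones up to the pull of the regularizer.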



Inference algorithm

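• Backward-forward algorithm*

[Equation not fully recoverable from the extracted slide text. The forward value for a segment ending at position r with state y_r is a sum, over candidate segment boundaries and states, of a product of two previously computed forward values (one of them evaluated at position p − 1) multiplied by $\exp\left( \sum_{k=1}^{K} \lambda_k f_k(x, s, s') \right)$.]

• Viterbi algorithm*

[Equation not fully recoverable from the extracted slide text. It has the same form as the forward recursion, with the sum replaced by a max.]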
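For intuition, the sketch below shows a segment-level Viterbi search for the simplest case only: a chain in which each segment depends on the segment immediately before it, with no long-range (E2) edges. This is the standard semi-Markov recursion, swapped in here for illustration; it is not the SCRF recursion from the slide. The names `semi_markov_viterbi`, `seg_score`, and `max_len` are hypothetical, and `seg_score` stands for the weighted feature sum of one segment.

```python
import math


def semi_markov_viterbi(x, states, max_len, seg_score):
    """Best segmentation of x when each segment depends only on the previous one.

    seg_score(x, p, q, y, y_prev) returns the weighted feature sum for a segment
    covering x[p:q] (q exclusive) with label y, following a segment labeled y_prev
    (None for the first segment). max_len[y] bounds the length of y-segments.
    """
    n = len(x)
    best = [{y: -math.inf for y in states} for _ in range(n + 1)]
    back = [{y: None for y in states} for _ in range(n + 1)]
    for r in range(1, n + 1):              # r = exclusive end of the current segment
        for y in states:
            for d in range(1, min(max_len[y], r) + 1):
                p = r - d                  # start of the current segment
                if p == 0:
                    candidates = [(None, 0.0)]
                else:
                    candidates = [(yp, best[p][yp]) for yp in states]
                for y_prev, prev in candidates:
                    if prev == -math.inf:
                        continue
                    s = prev + seg_score(x, p, r, y, y_prev)
                    if s > best[r][y]:
                        best[r][y] = s
                        back[r][y] = (p, y_prev)
    y = max(states, key=lambda state: best[n][state])
    total = best[n][y]
    if total == -math.inf:
        raise ValueError("no segmentation satisfies the length constraints")
    segments = []
    r = n
    while r > 0:                           # follow back-pointers to recover segments
        p, y_prev = back[r][y]
        segments.append((p, r, y))
        r, y = p, y_prev
    segments.reverse()
    return total, segments
```

Replacing the max with a sum (in log space, a log-sum-exp) turns the same recursion into the forward pass that yields the normalizer Z for this chain-only case.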



SCRFs for β-helix fold recognition (I)

• Right-handed β-helix fold
  – A regular structural fold with an elongated helix-like structure whose successive rungs are composed of three parallel β-strands (B1, B2, B3 strands)
  – T2 turn: a unique two-residue turn
  – Performs important functions such as bacterial infection of plants, binding of the O-antigen, etc.
• Computational challenges
  – Long insertions in T1 and T3 turns
  – Structural similarity with low sequence similarity
• Previous work
  – BetaWrap [Bradley et al 2001, Bradley et al 2001, Cowen et al 2002]
  – BetaWrapPro [McDonnell et al]

[Figure: Pectate Lyase C (Yoder et al. 1993)]



SCRFs for β-helix fold recognition (II)

• Protein structure graph
  – 5 states: B1, B23, T1, T3, I
  – Length constraints
    • B1, B23: fixed lengths of 3 and 9
    • T1, T3: 1 to 80
  – Long-range interactions between B23 segments
• Prediction scores
  – Log-ratio scores

$$\log \frac{\max_{W} P(W \mid x)}{P(\text{null} \mid x)}$$
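One practical consequence of the log-ratio score is that the normalizer Z appears in both the numerator and the denominator and therefore cancels, so the score is simply the difference of two unnormalized log-scores. The sketch below checks this numerically; the function names and the example numbers are hypothetical, and the cutoff is left as a parameter.

```python
import math


def log_ratio_score(best_score, null_score):
    """log(max_W P(W|x) / P(null|x)) from unnormalized log-scores; Z cancels."""
    return best_score - null_score


def log_ratio_from_probabilities(best_score, null_score, z):
    """Same quantity computed the long way, with an explicit normalizer z."""
    p_best = math.exp(best_score) / z
    p_null = math.exp(null_score) / z
    return math.log(p_best / p_null)


def passes_cutoff(score, cutoff):
    """A sequence is reported as a candidate when its score exceeds the cutoff."""
    return score > cutoff
```

Here `best_score` would be the log-score of the best segmentation found by Viterbi-style inference, and `null_score` the log-score of labeling the whole sequence as containing no motif.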



Features

• Node features
  – Regular expression template, HMM profiles
  – Secondary structure prediction scores
  – Segment length
• Inter-node features
  – β-strand side-chain alignment scores
  – Preferences for parallel alignment scores [Steward & Thornton, 2002]
  – Distance between adjacent B23 segments

• Features are general and easy to extend
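A node feature function along these lines might look like the following sketch. The length ranges are the ones stated on the β-helix slide (B1 = 3, B23 = 9, T1 and T3 = 1 to 80); everything else is an assumption for illustration. In particular, the regular expression `B1_TEMPLATE` is a made-up pattern and is not the template used by the authors, and the function name `node_features` is hypothetical.

```python
import re

# Length constraints as stated on the slide; state I has no stated constraint.
LENGTH_RANGE = {"B1": (3, 3), "B23": (9, 9), "T1": (1, 80), "T3": (1, 80)}

# HYPOTHETICAL template for illustration only; not the authors' actual pattern.
B1_TEMPLATE = re.compile(r"[VILF].[VILF]")


def node_features(x, p, q, y):
    """Features of one segment x[p:q] (q exclusive) labeled with state y."""
    seg = x[p:q]
    lo, hi = LENGTH_RANGE.get(y, (1, len(x)))  # unconstrained states accept any length
    matches = y == "B1" and B1_TEMPLATE.fullmatch(seg) is not None
    return {
        "length": float(len(seg)),
        "length_ok": 1.0 if lo <= len(seg) <= hi else 0.0,
        "template_match": 1.0 if matches else 0.0,
    }
```

Because each feature is just a function of the segment and its label, further node features (for example a profile HMM score or a secondary structure prediction score) could be added as extra dictionary entries without changing the rest of the model.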



Experiments (I)

• Cross-family validation for known β-helix proteins
  – PDB-select dataset: non-homologous proteins in the PDB, with β-helix proteins removed
  – SCRFs can score all known β-helices higher than non-β-helices



Experiments (II)

• Predicted Segmentation for known Beta-helices



Experiments (III)

• Histograms for known β-helices against PDB-minus dataset
  – 18 non-β-helix proteins have a score higher than 0
  – 13 from β-class and 5 from α/β class
  – Most confusing proteins: β-sandwiches and left-handed β-helix



Discovery of potential β-helices

• Hypothesize on UniProt reference databases with less than 50% identity (UniRef50)

– 93 sequences were returned with scores above a cutoff of 5

– 48 proteins are homologous with proteins known to be β-helices

• Full list can be accessed at www.cs.cmu.edu/~yanliu/SCRF.html

• Verification on recently crystallized structures

– Successfully identified gp14 of Shigella bacteriophage as a β-helix protein, with a score of 15.63



Conclusion

• Segmentation conditional random fields (SCRFs) for protein structural motif detection
  – Consider the structural characteristics in a general probabilistic framework
  – Conditional graphical models that consider the long-range interactions directly and conveniently
  – A case study for β-helix fold recognition
• Future work
  – Computational complexity: O(N^2)
    • Chain graph model: localized SCRFs model
  – Generality of the model
    • Leucine-rich repeats, Ankyrin proteins and some virus-spike folds



Further Exploration (I)

• Chain graph model
  – A combination of directed and undirected graphs
  – Locally normalized version of segmentation CRFs
  – Reduces the computational complexity to O(N)
• Experiment on β-helix fold and leucine-rich repeats
  – Achieves approximately the same results as SCRFs, with only slight differences

[Figure: structures 1A4Y and 1OGQ]



Further Exploration (II)

• Cross-family validation for known LRR (leucine-rich repeat) proteins by chain graph model
  – 41 LRR proteins with known structures
  – 2 superfamilies and 11 families



SCRFs for general graph

• For any graph G = <V, E>, the conditional probability of the segmentation W given the observation x is defined as

$$P(W \mid x) = \frac{1}{Z} \exp\left( \sum_{c \in C_G} \sum_{k=1}^{K} \lambda_k f_k(x, w_c) \right)$$

  – If there are no E2 edges (long-range interactions)
    • Semi-Markov conditional random fields (Sarawagi & Cohen, 2004)
  – If the state transition is deterministic and the resulting graph consists of trees or chains
    • Efficient algorithms for inference
  – If the state transition is not deterministic or the graph is complex
    • Approximation methods have to be applied, such as variational methods or sampling



Acknowledgement

Jonathan King @ MIT

Bonnie Berger @ MIT

Robert E. Steward and Janet Thornton @ EMBL-EBI

John Lafferty @ CMU
