Carnegie Mellon School of Computer Science
Biological Language Modeling Project
Copyright © 2004, Carnegie Mellon. All Rights Reserved.

Segmentation Conditional Random Fields (SCRFs)
A New Approach for Protein Fold Recognition

Yan Liu¹, Jaime Carbonell¹, Peter Weigele², Vanathi Gopalakrishnan³

1. School of Computer Science, Carnegie Mellon University
2. Biology Department, Massachusetts Institute of Technology
3. Center for Biomedical Informatics, University of Pittsburgh




Structural motif recognition

• Structural motif
– Regular arrangement of secondary structural elements, which commonly appears in a variety of protein families
– Super-secondary structure, or protein fold
– Examples: β-α-β (2CMD), leucine-rich repeats (1A4Y)

• Structural motif recognition
– Given a structural motif and a protein sequence, predict the presence of the motif and its exact location in the protein, based on sequence only


Previous work on structural motif recognition

• General approaches for structural motif recognition
– Sequence similarity searches, e.g. PSI-BLAST [Altschul et al., 1997]
– Profile HMMs, e.g. HMMER [Durbin et al., 1998] and SAM [Karplus et al., 1998]
– Homology modeling or threading, e.g. Threader [Jones, 1998]
– Major challenges: structural similarity without clear sequence similarity; long-range interactions, such as β-sheets

• Methods carefully designed for specific structural motifs
– Examples: αα- and ββ-hairpins, β-turn and β-helix
– Hard to generalize

Our goal is a general probabilistic framework that addresses all of these problems in structural motif prediction.


Outline

• Introduction
• Conditional random fields
• Segmentation conditional random fields
• Case study on β-helix fold recognition
• Conclusion


Graphical models for protein structure prediction

• Graphical models for protein structure prediction
– Probabilistic causal networks [Delcher et al., 1993]
– Markov random fields [White et al., 1994]
– Hidden Markov models [Bystroff et al., 2000]
– Bayesian segmentation model [Schmidler and Liu, 2000]

• Protein structure prediction can be generalized as a learning problem for structured data
– Structured data: observations with internal or external structure
– Conditional graphical models have been successful in various applications


Conditional random fields

• Conditional random fields (CRFs) [Lafferty et al., 2001]
– A conditional undirected graphical model
– The conditional probability is defined as

$$P(y \mid x) = \frac{1}{Z} \exp\Big(\sum_{i=1}^{N} \sum_{k=1}^{K} \lambda_k f_k(x, i, y_{i-1}, y_i)\Big)$$

– Flexible feature definition
– The optimization objective is convex, which guarantees a globally optimal solution
– Efficient inference algorithms
– Kernel CRFs permit the use of implicit feature spaces via kernels [Lafferty et al., 2004]

[Figure: graphical structures of HMMs and CRFs]
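To make the CRF definition concrete, here is a minimal Python sketch that evaluates the conditional probability of a label sequence as an exponentiated weighted sum of feature functions, normalized by Z. The label set, the two toy features and the weights are hypothetical choices made for illustration only, and Z is computed by brute-force enumeration, which is feasible only for very short sequences.

```python
import itertools
import math

LABELS = ["H", "E", "C"]   # hypothetical label set (not from the slides)
WEIGHTS = [0.8, 1.5]       # hypothetical weights lambda_k


def features(x, i, y_prev, y_cur):
    """Return [f_1, f_2] evaluated at position i (toy features)."""
    same_label = 1.0 if y_prev == y_cur else 0.0
    valine_in_strand = 1.0 if (x[i] == "V" and y_cur == "E") else 0.0
    return [same_label, valine_in_strand]


def score(x, y):
    """Sum over positions i and features k of lambda_k * f_k(x, i, y_{i-1}, y_i)."""
    total = 0.0
    for i in range(len(x)):
        y_prev = y[i - 1] if i > 0 else "<start>"
        feats = features(x, i, y_prev, y[i])
        total += sum(w * f for w, f in zip(WEIGHTS, feats))
    return total


def conditional_probability(x, y):
    """P(y | x) = exp(score(x, y)) / Z, with Z summed over all labelings."""
    z = sum(math.exp(score(x, labels))
            for labels in itertools.product(LABELS, repeat=len(x)))
    return math.exp(score(x, y)) / z


x = "AVV"
all_labelings = list(itertools.product(LABELS, repeat=len(x)))
best = max(all_labelings, key=lambda y: conditional_probability(x, y))
print(best, round(conditional_probability(x, best), 3))
```

In practice Z is computed with dynamic programming rather than enumeration; the brute-force sum is used here only to keep the normalization explicit.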


Graphical models for Structural Motif detection

• Structural motif detection
– Structural components
  • Secondary structural elements instead of individual residues
– Informative features
  • Indicator for conserved regions
  • Length of each component
  • Propensities to form hydrogen bonds in β-sheets

Segmented Markov Models


Segmentation conditional random fields (I)

• Protein structural graph G = <V, E>
– V: nodes for the secondary structural elements of variable lengths
– E1: edges between adjacent nodes for peptide bonds
– E2: edges between distant nodes for hydrogen bonds or disulfide bonds

• Example: β-α-β motif

• Tradeoff between fidelity of the model and graph complexity


Segmentation conditional random fields (II)

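• Segmentation conditional random fields (SCRFs)
– Given a protein structure graph G, we define a segmentation of the sequence W = (M, S), where S_i = <p_i, q_i, y_i>
– The conditional probability of the segmentation W given the observation x is defined as

$$P(W \mid x) = \frac{1}{Z} \exp\Big(\sum_{c \in C_G} \sum_{k=1}^{K} \lambda_k f_k(x, w_c)\Big)$$

– If each subgraph of the resulting graph is a tree or a chain, we can simplify the model to be

$$P(W \mid x) = \frac{1}{Z} \exp\Big(\sum_{i=1}^{M} \sum_{k=1}^{K} \lambda_k f_k(x, S_i, S_{\pi_i})\Big)$$

  where S_{π_i} denotes the segment that S_i is connected to in the tree or chain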
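As a rough illustration of the segmentation representation, the sketch below stores each segment as a (p, q, y) triple and sums weighted feature values over segments and their linked segments, giving the exponent of the simplified model before normalization. It assumes p and q are the first and last positions of a segment and y its state label; the `parents` list, both feature functions and the weights are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    p: int   # first position of the segment (assumed meaning of p_i)
    q: int   # last position of the segment, inclusive (assumed meaning of q_i)
    y: str   # state label y_i


def is_valid_segmentation(segments, n):
    """Segments must tile positions 0..n-1 with no gaps or overlaps."""
    expected_start = 0
    for seg in segments:
        if seg.p != expected_start or seg.q < seg.p:
            return False
        expected_start = seg.q + 1
    return expected_start == n


def unnormalized_log_score(x, segments, parents, feature_fns, weights):
    """Sum over segments i and features k of lambda_k * f_k(x, S_i, S_parent(i))."""
    total = 0.0
    for i, seg in enumerate(segments):
        parent = segments[parents[i]] if parents[i] is not None else None
        total += sum(w * f(x, seg, parent) for w, f in zip(weights, feature_fns))
    return total


def strand_length_is_three(x, seg, parent):
    """Toy node feature: a B segment of exactly three residues."""
    return 1.0 if seg.y == "B" and seg.q - seg.p + 1 == 3 else 0.0


def paired_strands_match(x, seg, parent):
    """Toy pairwise feature: two linked B segments cover identical residues."""
    if parent is None or seg.y != "B" or parent.y != "B":
        return 0.0
    return 1.0 if x[seg.p:seg.q + 1] == x[parent.p:parent.q + 1] else 0.0


x = "AAAVVVAAAVVV"
segments = [Segment(0, 2, "I"), Segment(3, 5, "B"),
            Segment(6, 8, "T"), Segment(9, 11, "B")]
parents = [None, 0, 1, 1]   # hypothetical tree: the last B links back to the first B
log_score = unnormalized_log_score(
    x, segments, parents,
    [strand_length_is_three, paired_strands_match], [1.0, 2.0])
print(is_valid_segmentation(segments, len(x)), log_score)
```

Linking the last B segment to the earlier B segment, rather than to its sequence neighbor, is how a long-range edge would enter such a score.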


Training and Testing for SCRFs

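• Training phase: learn the model parameters
– Minimize the regularized log loss, i.e. the negative of the conditional log-likelihood

$$L = \sum_{i=1}^{M} \sum_{k=1}^{K} \lambda_k f_k(x, s_i, s_{\pi_i}) - \log Z$$

  combined with a regularization term on λ
– Seek the parameters at which the empirical feature values agree with their expectations under the model (up to the gradient of the regularization term)

$$\frac{\partial L}{\partial \lambda_k} = \sum_{i=1}^{M} \Big( f_k(x, s_i, s_{\pi_i}) - E_{P(S \mid x)}\big[f_k(x, S_i, S_{\pi_i})\big] \Big) = 0$$

– An iterative search algorithm has to be applied

• Testing phase: search for the segmentation that maximizes P(W|x)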
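The training condition (empirical feature values matching their model expectations) can be illustrated with a small log-linear model over an explicit list of candidate structures. This is only a sketch: the candidate feature vectors are invented, the Gaussian-style penalty with variance `sigma2` is an assumed form of regularization, and plain gradient ascent stands in for whatever iterative optimizer the authors used.

```python
import math


def log_likelihood_and_gradient(weights, candidates, observed, sigma2=10.0):
    """Regularized conditional log-likelihood and its gradient.

    candidates: one feature vector per candidate structure (toy data).
    observed:   index of the candidate that matches the training label.
    """
    scores = [sum(w * f for w, f in zip(weights, feats)) for feats in candidates]
    top = max(scores)
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    probs = [math.exp(s - log_z) for s in scores]
    k = len(weights)
    expected = [sum(p * feats[j] for p, feats in zip(probs, candidates))
                for j in range(k)]
    penalty = sum(w * w for w in weights) / (2.0 * sigma2)
    log_lik = scores[observed] - log_z - penalty
    # Empirical feature value minus expected value, minus the penalty gradient.
    grad = [candidates[observed][j] - expected[j] - weights[j] / sigma2
            for j in range(k)]
    return log_lik, grad


def train(candidates, observed, steps=200, learning_rate=0.5):
    """Plain gradient ascent on the regularized log-likelihood."""
    weights = [0.0] * len(candidates[0])
    for _ in range(steps):
        _, grad = log_likelihood_and_gradient(weights, candidates, observed)
        weights = [w + learning_rate * g for w, g in zip(weights, grad)]
    return weights


candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
observed = 0
weights = train(candidates, observed)
print([round(w, 3) for w in weights])
```

At convergence the gradient is close to zero, which is the point where the observed feature values and their expectations differ only by the regularization term.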


Inference algorithm

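• Backward-forward algorithm*
– [Equation: the forward value at position r with state y_r is a sum, over the possible boundaries (p, q) of the preceding segments, of products of earlier forward values and the factor $\exp\big(\sum_{k=1}^{K} \lambda_k f_k(x, s, s')\big)$]

• Viterbi algorithm*
– [Equation: the same recursion, with the sum replaced by a max]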
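The Viterbi search over segmentations can be sketched with a simplified semi-Markov dynamic program. This version only lets a segment's score depend on the previous segment's state, so it does not handle the long-range edges of the full SCRF; the function names, the toy scoring function and the half-open `(p, r, state)` output convention are all assumptions made for illustration.

```python
def viterbi_segmentation(x, states, max_len, segment_score):
    """Best-scoring segmentation of x into labeled segments.

    best[r][y] is the best score of any segmentation of x[:r] whose last
    segment has state y; segments are half-open intervals [p, r).
    """
    n = len(x)
    neg_inf = float("-inf")
    best = [{y: neg_inf for y in states} for _ in range(n + 1)]
    back = [{y: None for y in states} for _ in range(n + 1)]

    for r in range(1, n + 1):
        for y in states:
            for length in range(1, min(max_len, r) + 1):
                p = r - length
                # The first segment has no predecessor; later ones try each previous state.
                if p == 0:
                    prev_options = [None]
                else:
                    prev_options = [s for s in states if best[p][s] > neg_inf]
                for y_prev in prev_options:
                    base = 0.0 if y_prev is None else best[p][y_prev]
                    cand = base + segment_score(x, p, r, y, y_prev)
                    if cand > best[r][y]:
                        best[r][y] = cand
                        back[r][y] = (p, y_prev)

    # Trace back from the best final state.
    y = max(states, key=lambda s: best[n][s])
    total = best[n][y]
    segments, r = [], n
    while r > 0:
        p, y_prev = back[r][y]
        segments.append((p, r, y))
        r, y = p, y_prev
    segments.reverse()
    return segments, total


def toy_score(x, p, r, y, y_prev):
    """Toy score: +1 per residue agreeing with the state, -1 otherwise, -0.5 per segment."""
    matches = sum(1 for ch in x[p:r] if (ch == "V") == (y == "B"))
    return matches - ((r - p) - matches) - 0.5


result = viterbi_segmentation("AAVVVAA", ["T", "B"], 7, toy_score)
print(result)
```

Replacing the `max` update with a log-sum over the same candidates would give the corresponding forward pass for computing Z in this simplified setting.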


SCRFs for β-helix fold recognition (I)

• Right-handed β-helix fold
– A regular structural fold with an elongated helix-like structure whose successive rungs are composed of three parallel β-strands (B1, B2, B3 strands)
– T2 turn: a unique two-residue turn
– Performs important functions such as the bacterial infection of plants and binding the O-antigen

• Computational challenges
– Long insertions in T1 and T3 turns
– Structural similarity with low sequence similarity

• Previous work
– BetaWrap [Bradley et al., 2001; Bradley et al., 2001; Cowen et al., 2002]
– BetaWrapPro [McDonnell et al.]

[Figure: Pectate Lyase C (Yoder et al., 1993)]


SCRFs for β-helix fold recognition (II)

• Protein structure graph
– 5 states: B1, B23, T1, T3, I
– Length constraints
  • B1, B23: fixed lengths of 3 and 9, respectively
  • T1, T3: 1 – 80
– Long-range interactions between B23 segments

• Prediction scores
– Log-ratio score: $\log \dfrac{\max_W P(W \mid x)}{P(\text{null} \mid x)}$
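Two pieces of this slide translate directly into small helpers, sketched here under stated assumptions. The length table copies the constraints listed for B1, B23, T1 and T3; treating state I as unconstrained is an assumption, since the slide gives no range for it. The log-ratio helper relies on the fact that both probabilities share the same normalizer Z, so the ratio reduces to a difference of unnormalized log scores; the function names are hypothetical.

```python
# Length constraints as listed on the slide (units: residues).
LENGTH_RANGE = {"B1": (3, 3), "B23": (9, 9), "T1": (1, 80), "T3": (1, 80)}


def length_allowed(state, length):
    """True if a segment of this state may have this length.

    States without an entry (e.g. I) are treated as unconstrained,
    which is an assumption rather than something the slide states.
    """
    low, high = LENGTH_RANGE.get(state, (1, float("inf")))
    return low <= length <= high


def log_ratio_score(best_log_score, null_log_score):
    """log( max_W P(W|x) / P(null|x) ) from unnormalized log scores.

    Both probabilities are divided by the same Z, so Z cancels and the
    log-ratio is simply the difference of the two exponents.
    """
    return best_log_score - null_log_score


print(length_allowed("B23", 9), length_allowed("T1", 81), log_ratio_score(12.5, 7.5))
```

Because Z cancels, scoring a sequence this way needs only the best segmentation's exponent and the exponent of the all-null segmentation, not the full partition function.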


Features

• Node features
– Regular expression template, HMM profiles
– Secondary structure prediction scores
– Segment length

• Inter-node features
– β-strand side-chain alignment scores
– Preferences for parallel alignment scores [Steward & Thornton, 2002]
– Distance between adjacent B23 segments

• Features are general and easy to extend


Experiments (I)

• Cross-family validation for known β-helix proteins
– PDB select dataset: non-homologous proteins in PDB, with β-helix proteins removed
– SCRFs can score all known β-helices higher than non-β-helices


Experiments (II)

• Predicted Segmentation for known Beta-helices


Experiments (III)

• Histograms for known β-helices against the PDB-minus dataset
– 18 non-β-helix proteins have a score higher than 0
– 13 from the β class and 5 from the α/β class
– Most confusing proteins: β-sandwiches and left-handed β-helices


Discovery of potential β-helices

• Hypothesize on UniProt reference databases with less than 50% identity (UniRef50)
– 93 sequences were returned with scores above a cutoff of 5
– 48 proteins are homologous to proteins known to be β-helices

• Full list can be accessed at www.cs.cmu.edu/~yanliu/SCRF.html

• Verification on recently crystallized structures
– Successfully identified gp14 of Shigella bacteriophage as a β-helix protein, with a score of 15.63


Conclusion

• Segmentation conditional random fields (SCRFs) for protein structural motif detection
– Consider the structural characteristics in a general probabilistic framework
– Conditional graphical models that consider the long-range interactions directly and conveniently
– A case study for β-helix fold recognition

• Future work
– Computational complexity: O(N²)
  • Chain graph model: localized SCRFs model
– Generality of the model
  • Leucine-rich repeats, ankyrin proteins and some virus-spike folds


Further Exploration (I)

• Chain graph model
– A combination of directed and undirected graphs
– Locally normalized version of segmentation CRFs
– Reduces the computational complexity to O(N)

• Experiments on β-helix fold and leucine-rich repeats
– Achieves results close to those of SCRFs, with only slight differences

[Figure: structures 1A4Y and 1OGQ]


Further Exploration (II)

• Cross-family validation for known LRR proteins by the chain graph model
– 41 LRR proteins with known structures
– 2 superfamilies and 11 families



SCRFs for general graph

• For any graph G = <V, E>, the conditional probability of the segmentation W given the observation x is defined as

$$P(W \mid x) = \frac{1}{Z} \exp\Big(\sum_{c \in C_G} \sum_{k=1}^{K} \lambda_k f_k(x, w_c)\Big)$$

– If there are no E2 edges (long-range interactions)
  • Semi-Markov conditional random fields (Sarawagi & Cohen, 2004)
– If the state transition is deterministic and the resulting graph consists of trees or chains
  • Efficient algorithms for inference
– If the state transition is not deterministic or the graph is complex
  • Approximation methods have to be applied, such as variational methods or sampling


Acknowledgement

Jonathan King @ MIT

Bonnie Berger @ MIT

Robert E. Steward and Janet Thornton @ EMBL-EBI

John Lafferty @ CMU