Machine Learning in the Study of Protein Structure
Rui Kuang, Columbia University
Candidacy Exam Talk, May 5th, 2004
Committee: Christina S. Leslie (advisor), Yoav Freund, Tony Jebara


Page 1: Machine Learning in the Study of Protein Structure

Machine Learning in the Study of Protein Structure

Rui Kuang
Columbia University
Candidacy Exam Talk, May 5th, 2004

Committee: Christina S. Leslie (advisor), Yoav Freund, Tony Jebara

Page 2: Machine Learning in the Study of Protein Structure

Table of contents

1. Introduction to protein structure and its prediction
2. HMM, SVM and string kernels
3. Machine learning in the study of protein structure
• Protein ranking
• Protein structural classification
• Protein secondary structure and conformational state prediction
• Protein domain segmentation
4. Conclusion and future work

Page 3: Machine Learning in the Study of Protein Structure

Thanks to Carl-Ivar Branden and John Tooze

Part 1: Introduction to Protein Structure and Its Prediction

1. Introduction | 2. HMM, SVM and string kernels | 3. Topics | 4. Conclusion and future work

Page 4: Machine Learning in the Study of Protein Structure

Why study protein structure

• Protein: derived from the Greek word proteios, meaning “of the first rank”; coined in 1838 by Jöns J. Berzelius
• Crucial in all biological processes
• Function depends on structure, so structure can help us understand function

Page 5: Machine Learning in the Study of Protein Structure

How to Describe Protein Structure

• Primary: amino acid sequence
• Secondary structure
• Tertiary structure
• Quaternary: arrangement of several polypeptide chains

Page 6: Machine Learning in the Study of Protein Structure

Secondary Structure: Alpha Helix

Hydrogen bonds form between C’=O at position n and N-H at position n+i (i = 3, 4, 5)

Page 7: Machine Learning in the Study of Protein Structure

Secondary Structure: Beta Sheet

Parallel Beta Sheet

Antiparallel Beta Sheet

We can also have a mix of both.

Page 8: Machine Learning in the Study of Protein Structure

Secondary Structure: Loop Regions

– Less conserved structure
– Insertions and deletions are more frequent
– Conformations are flexible

Page 9: Machine Learning in the Study of Protein Structure

Tertiary Structure

• Phi (φ): rotation about the N–Cα bond
• Psi (ψ): rotation about the Cα–C’ bond

Page 10: Machine Learning in the Study of Protein Structure

Phi-Psi angle distribution

Page 11: Machine Learning in the Study of Protein Structure

Protein Domains

• A polypeptide chain or a part of a polypeptide chain that can fold independently into a stable tertiary structure.

Page 12: Machine Learning in the Study of Protein Structure

Determination of Protein Structures

• Experimental determination (time consuming and expensive)
– X-ray crystallography
– Nuclear magnetic resonance (NMR)
• Computational determination [Schonbrun 2002 (B2)]
– Comparative modeling
– Fold recognition (“threading”)
– Ab initio structure prediction (“de novo”)

Page 13: Machine Learning in the Study of Protein Structure

Sequence, Structure and Function [Domingues 2000 (B1)]

Picture due to Michal Linial

Sequence (1,000,000)
• >30% sequence similarity suggests strong structure similarity
• Remote homologous proteins can also share similar structure

Structure (24,000): discrete groups of folds with unclear boundaries

Function (ill-defined)
• The same function can be associated with different structures
• A superfamily with the same fold can evolve into distinct functions
• 66% of proteins having a similar fold also have a similar function

Page 14: Machine Learning in the Study of Protein Structure

Thanks to Nello Cristianini

Part 2: Hidden Markov Model, Support Vector Machine and String Kernels


Page 15: Machine Learning in the Study of Protein Structure

Hidden Markov Models for Modeling Protein [Krogh 1993 (B3)]

Alignment → HMM, estimated by maximum likelihood or maximum a posteriori

If we don’t know the alignment, use EM to train the HMM.

Page 16: Machine Learning in the Study of Protein Structure

Hidden Markov Models for Modeling Protein [Krogh 1993 (B3)]

• Probability of sequence x through path q

• Viterbi algorithm for finding the best path

• Can be used for sequence clustering, database search…
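
The Viterbi step above can be sketched in a few lines. This is a minimal illustration in log space, not the profile-HMM implementation from [Krogh 1993 (B3)]; the two-state model, its state names ("A", "B"), symbols ("x", "y", "z") and all probabilities are hypothetical values chosen only to make the example checkable by hand.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for obs and its probability."""
    # V[t][s]: best log-probability of any path that ends in state s at time t
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            # pick the best predecessor state for s
            prev = max(states, key=lambda p: V[t - 1][p] + math.log(trans_p[p][s]))
            V[t][s] = (V[t - 1][prev] + math.log(trans_p[prev][s])
                       + math.log(emit_p[s][obs[t]]))
            back[t][s] = prev
    # trace back from the best final state
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, math.exp(V[-1][last])

# Hypothetical toy model (assumption, for illustration only)
states = ["A", "B"]
start_p = {"A": 0.6, "B": 0.4}
trans_p = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
emit_p = {"A": {"x": 0.5, "y": 0.4, "z": 0.1},
          "B": {"x": 0.1, "y": 0.3, "z": 0.6}}

path, prob = viterbi(["x", "y", "z"], states, start_p, trans_p, emit_p)
print(path, prob)
```

Working in log space avoids numerical underflow on long protein sequences, where products of many small probabilities would otherwise round to zero.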

Page 17: Machine Learning in the Study of Protein Structure

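Support Vector Machine [Burges 1998 (B4)]

• Related to structural risk minimization
• Linearly separable case
– Primal QP problem: minimize $\frac{1}{2}\|w\|^2$ subject to $y_i(w \cdot x_i + b) \ge 1$ for all $i$
– Dual convex problem: minimize $\frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j - \sum_i \alpha_i$ subject to $\alpha_i \ge 0$ & $\sum_i \alpha_i y_i = 0$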

Page 18: Machine Learning in the Study of Protein Structure

Support Vector Machine [Burges 1998 (B4)]

• Kernel: one nice property of the dual QP problem is that it involves only inner products between feature vectors, so we can define a kernel function to compute them more efficiently

• Example:
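
A minimal sketch of the kernel idea, using the degree-2 polynomial kernel on 2-D inputs as an assumed example (not necessarily the one shown on the slide). The function names `poly_kernel` and `phi` are illustrative. The kernel value equals the inner product of the explicit feature vectors, but is computed without ever building them.

```python
import math

def poly_kernel(x, z):
    """K(x, z) = (x . z)^2, computed directly in input space."""
    return sum(a * b for a, b in zip(x, z)) ** 2

def phi(x):
    """Explicit feature map for 2-D input whose inner product equals K."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

x, z = (1.0, 2.0), (3.0, 4.0)
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))
print(poly_kernel(x, z), explicit)
```

For higher degrees or dimensions the explicit feature space grows quickly, while the kernel evaluation stays a single inner product followed by a power.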

Page 19: Machine Learning in the Study of Protein Structure

String Kernels for Text Classification [Lodhi 2002 (M2)]

• String subsequence kernel (SSK):

$$\phi_u(s) = \sum_{\mathbf{i}:\, u = s[\mathbf{i}]} \lambda^{\, i_{|u|} - i_1 + 1}, \qquad u \in \Sigma^n$$

• A recursive computation of the SSK has complexity O(n|s||t|). This is quadratic in the length of the input sequences, so it is not practical.
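
To make the SSK feature map concrete, here is a brute-force sketch that enumerates every length-n subsequence and weights it by λ raised to the span it covers. It is exponential in n and is not the recursive O(n|s||t|) computation from [Lodhi 2002 (M2)]; the names `ssk_features` and `ssk` are illustrative.

```python
from collections import defaultdict
from itertools import combinations

def ssk_features(s, n, lam):
    """Map s to {subsequence u: sum over occurrences of lam ** span}."""
    feats = defaultdict(float)
    for idx in combinations(range(len(s)), n):
        u = "".join(s[i] for i in idx)
        span = idx[-1] - idx[0] + 1  # length of s covered by this occurrence
        feats[u] += lam ** span
    return feats

def ssk(s, t, n, lam):
    """Inner product of the two subsequence feature vectors."""
    fs, ft = ssk_features(s, n, lam), ssk_features(t, n, lam)
    return sum(v * ft[u] for u, v in fs.items() if u in ft)

# "cat" and "car" share only the subsequence "ca", which spans 2 letters in each
print(ssk("cat", "car", n=2, lam=0.5))
```

With λ = 0.5 the only shared coordinate contributes λ² · λ² = λ⁴, so gapped occurrences are down-weighted relative to contiguous ones.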

Page 20: Machine Learning in the Study of Protein Structure

Part 3

Machine learning in the study of protein structure

3.1 Protein ranking
3.2 Protein structural classification
3.3 Protein secondary structure and conformational state prediction
3.4 Protein domain segmentation


Page 21: Machine Learning in the Study of Protein Structure

Part 3.1 Protein Ranking

Please!!! Stand in order

• Smith-Waterman
• BLAST/PSI-BLAST
• SAM-T98
• Rank Propagation

Page 22: Machine Learning in the Study of Protein Structure

Thanks to Jean Philippe

Local alignment: Smith-Waterman algorithm

• For two strings x and y, a local alignment with gaps is:

• The score is:

• Smith-Waterman score:
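
A minimal sketch of the Smith-Waterman score via dynamic programming. It assumes a simple match/mismatch score and a linear gap penalty (hypothetical default values); real protein searches would typically use a substitution matrix and affine gap costs instead.

```python
def smith_waterman(x, y, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between strings x and y."""
    rows, cols = len(x) + 1, len(y) + 1
    # H[i][j]: best score of a local alignment ending at x[i-1], y[j-1]
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if x[i - 1] == y[j - 1] else mismatch
            H[i][j] = max(0,                    # start a new local alignment
                          H[i - 1][j - 1] + s,  # align x[i-1] with y[j-1]
                          H[i - 1][j] + gap,    # gap in y
                          H[i][j - 1] + gap)    # gap in x
            best = max(best, H[i][j])
    return best

print(smith_waterman("AKQDYYYYE", "DYY"))
```

The clamp at 0 is what makes the alignment local: any prefix with a negative score is dropped rather than carried forward.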

Page 23: Machine Learning in the Study of Protein Structure

BLAST [Altschul 1997 (R1)]: a heuristic algorithm for matching DNA/protein sequences

• Idea: true matches are likely to contain a short stretch of identity
• Cut the query into words: AKQDYYYYE… → AKQ KQD QDY DYY YYY …
• Neighbor mapping (substitution score > T): AKQ SKQ .. KQD AQD .. QDY .. DYY .. YYY …
• Search the protein database for word matches:
  Query:  ………DYY………………
  Target: …ASDDYYQQEYY…
• Extend each match in both directions
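
The word-cutting and seed-matching steps can be sketched as follows. This is a simplification that only finds exact word matches; it omits the neighbor mapping with threshold T and the extension step, and the helper names `words` and `seed_hits` are illustrative.

```python
def words(seq, k=3):
    """All overlapping length-k words of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def seed_hits(query, target, k=3):
    """(query_pos, target_pos) pairs where a query word occurs in target."""
    index = {}
    for j, w in enumerate(words(target, k)):
        index.setdefault(w, []).append(j)
    return [(i, j) for i, w in enumerate(words(query, k))
            for j in index.get(w, [])]

print(words("AKQDYYYYE")[:5])
print(seed_hits("AKQDYYYYE", "ASDDYYQQEYY"))
```

Indexing the target words in a dictionary means each query word is looked up in constant time, which is the reason the seeding stage is fast compared with full dynamic programming.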

Page 24: Machine Learning in the Study of Protein Structure

PSI-BLAST: Position-Specific Iterated BLAST [Altschul 1997 (R1)]

• Only extend those double hits within a certain range.
• A gapped alignment uses dynamic programming to extend a central pair of aligned residues in both directions.
• PSI-BLAST can take a PSSM as input to search the database

Page 25: Machine Learning in the Study of Protein Structure

SAM-T98 [Karplus 1999 (C3)]

Query sequence → BLAST search of the NR protein database → build alignment with hits → profile/alignment → HMM → search the database again

Iterate 4 rounds

Page 26: Machine Learning in the Study of Protein Structure

Local and Global Consistency [Zhou 2003 (M1)]

• Affinity matrix: $W_{ij} = \exp(-\|x_i - x_j\|^2 / 2\sigma^2)$
• D is a diagonal matrix whose i-th diagonal entry is the sum of the i-th row of W: $S = D^{-1/2} W D^{-1/2}$
• Iterate: $F(t+1) = \alpha S F(t) + (1 - \alpha) Y$
• F* is the limit of the sequence {F(t)}: $y_i = \arg\max_{j \le c} F^*_{ij}$
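
The iteration can be sketched in plain Python. This is a toy version, assuming the update F(t+1) = αSF(t) + (1−α)Y with a Gaussian affinity and a zero diagonal in W; the inputs are 1-D points, and the parameter values (`sigma`, `alpha`, `iters`) are hypothetical choices for illustration.

```python
import math

def propagate_labels(points, labels, n_classes, sigma=1.0, alpha=0.9, iters=200):
    """points: 1-D floats; labels: class index, or None if unlabeled."""
    n = len(points)
    # Gaussian affinity with zero diagonal
    W = [[0.0 if i == j else
          math.exp(-(points[i] - points[j]) ** 2 / (2 * sigma ** 2))
          for j in range(n)] for i in range(n)]
    d = [sum(row) for row in W]
    # Symmetric normalization S = D^{-1/2} W D^{-1/2}
    S = [[W[i][j] / math.sqrt(d[i] * d[j]) for j in range(n)] for i in range(n)]
    # Y holds the initial labels as one-hot rows (all zero if unlabeled)
    Y = [[1.0 if labels[i] == c else 0.0 for c in range(n_classes)]
         for i in range(n)]
    F = [row[:] for row in Y]
    for _ in range(iters):
        F = [[alpha * sum(S[i][k] * F[k][c] for k in range(n))
              + (1 - alpha) * Y[i][c]
              for c in range(n_classes)] for i in range(n)]
    # Each point takes the class with the largest score
    return [max(range(n_classes), key=lambda c: F[i][c]) for i in range(n)]

points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
labels = [0, None, None, 1, None, None]
print(propagate_labels(points, labels, n_classes=2))
```

With one labeled point per cluster, the labels spread to the unlabeled neighbors because affinities within a cluster are far larger than those across clusters.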

Page 27: Machine Learning in the Study of Protein Structure

Rank propagation [Weston 2004 (R2)]

• Protein similarity network:
– Graph nodes: protein sequences in the database
– Directed edges: an exponential function of the PSI-BLAST e-value (destination node as query)
– Activation value at each node: the similarity to the query sequence
• Exploit the structure of the protein similarity network:

$$Y_{t+1} = K_q + \alpha K Y_t$$
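
A toy sketch of propagating activation through a similarity network, assuming an update of the form Y(t+1) = K_q + αKY(t), where K_q holds each node's direct similarity to the query. The 4-node matrix and the value of `alpha` are hypothetical; the iteration only converges when α is small relative to the spectral radius of K.

```python
def rank_propagation(K, query, alpha=0.5, iters=50):
    """K[i][j]: edge weight from node i to node j; returns activation per node."""
    n = len(K)
    k_q = K[query]          # direct similarity of every node to the query
    y = list(k_q)
    for _ in range(iters):
        y = [k_q[j] + alpha * sum(K[k][j] * y[k] for k in range(n))
             for j in range(n)]
    return y

# Hypothetical network: node 0 is the query, node 1 is similar to the query,
# node 2 is similar only to node 1, node 3 is isolated.
K = [[0.0, 0.8, 0.0, 0.0],
     [0.8, 0.0, 0.7, 0.0],
     [0.0, 0.7, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]]
y = rank_propagation(K, query=0)
print(y)
```

Node 2 has no direct similarity to the query, yet it ends up with a positive activation through node 1, which is the effect the slide refers to as exploiting the network structure.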

Page 28: Machine Learning in the Study of Protein Structure

Result [Weston 2004 (R2)]

Page 29: Machine Learning in the Study of Protein Structure

Part 3.2 Protein structural classification

Where are my relatives?

• Fisher Kernel
• SVM-Pairwise
• Mismatch Kernel
• EMOTIF Kernel
• I-SITE Kernel
• Cluster Kernels

Page 30: Machine Learning in the Study of Protein Structure

SCOP [Murzin 1995 (C1)]

SCOP hierarchy: Fold → Superfamily → Family

Diagram labels: Positive Training Set, Positive Test Set, Negative Training Set, Negative Test Set

• Family: sequence identity > 30%, or functions and structures are very similar
• Superfamily: low sequence similarity, but functional features suggest a probable common evolutionary origin
• Common fold: same major secondary structures in the same arrangement with the same topological connections

Page 31: Machine Learning in the Study of Protein Structure

CATH [Orengo 1997 (C2)]

• Class: secondary structure composition and contacts
• Architecture: gross arrangement of secondary structure
• Topology: similar number and arrangement of secondary structures, with the same connectivity linking them
• Homologous superfamily
• Sequence family

Page 32: Machine Learning in the Study of Protein Structure

Fisher Kernel [Jaakkola 2000 (C4)]

• An HMM (or more than one) is built for each family
• Derive the feature mapping from the Fisher scores of each sequence given an HMM $H_1$:

$$U_X = \nabla_\theta \log P(X \mid H_1, \theta)$$

$$U_{ij} = \frac{E_j(i)}{e_j(i)} - \sum_k E_j(k)$$

Page 33: Machine Learning in the Study of Protein Structure

SVM-pairwise [Liao 2002 (C5)]

• Represent sequence P as a vector of pairwise similarity scores with all training sequences

• The similarity score could be a Smith-Waterman score or PSI-BLAST e-value.
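
The pairwise representation can be sketched with any similarity function plugged in. Here the number of shared 3-mers is used as a stand-in (an assumption for illustration); in SVM-pairwise the score would be a Smith-Waterman score or PSI-BLAST e-value. The names `shared_kmers` and `pairwise_features` are illustrative.

```python
def shared_kmers(a, b, k=3):
    """Stand-in similarity: number of distinct k-mers common to a and b."""
    ka = {a[i:i + k] for i in range(len(a) - k + 1)}
    kb = {b[i:i + k] for i in range(len(b) - k + 1)}
    return len(ka & kb)

def pairwise_features(seq, training_seqs, similarity=shared_kmers):
    """One coordinate per training sequence: similarity(seq, training_seq)."""
    return [similarity(seq, t) for t in training_seqs]

train = ["AKQDY", "YYYYY"]
print(pairwise_features("AKQDY", train))
```

The feature dimension equals the number of training sequences, so the cost of building the representation grows with the training set size times the cost of one similarity computation.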

Page 34: Machine Learning in the Study of Protein Structure

Mismatch Kernel [Leslie 2002 (C6)]

• Cut the sequence into k-mers: AKQDYYYYE… → AKQ KQD QDY DYY YYY …
• Map each k-mer to its mismatch neighborhood: AKQ → CKQ, DKQ, AAQ, AKY, …
• Feature vector for AKQ: ( 0 , … , 1 , … , 1 , … , 1 , … , 1 , … , 0 ), with 1s at coordinates such as AAQ, AKQ, DKQ, EKQ
• Implementation with a suffix tree achieves linear time complexity $O(k^{m+1}|\Sigma|^m(|x|+|y|))$
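
A naive sketch of the (k, m)-mismatch feature map and kernel. It enumerates each neighborhood explicitly rather than using the tree traversal from [Leslie 2002 (C6)], so it only illustrates the definition, not the efficient algorithm. The 20-letter alphabet constant and the function names are assumptions.

```python
from collections import Counter
from itertools import combinations, product

AMINO = "ACDEFGHIKLMNPQRSTVWY"  # standard 20 amino acids

def neighborhood(kmer, m, alphabet=AMINO):
    """All strings within m mismatches of kmer (including kmer itself)."""
    out = set()
    for r in range(m + 1):
        for pos in combinations(range(len(kmer)), r):
            for letters in product(alphabet, repeat=r):
                chars = list(kmer)
                for p, c in zip(pos, letters):
                    chars[p] = c
                out.add("".join(chars))
    return out

def mismatch_features(seq, k=3, m=1):
    """Count, for every k-mer coordinate, how many k-mers of seq are near it."""
    feats = Counter()
    for i in range(len(seq) - k + 1):
        feats.update(neighborhood(seq[i:i + k], m))
    return feats

def mismatch_kernel(x, y, k=3, m=1):
    fx, fy = mismatch_features(x, k, m), mismatch_features(y, k, m)
    return sum(v * fy[u] for u, v in fx.items())

print(len(neighborhood("AKQ", 1)), mismatch_kernel("AKQ", "AKY"))
```

For k = 3 and m = 1 each neighborhood has 1 + 3 × 19 = 58 members, and AKQ and AKY share the 20 strings of the form AK followed by any residue.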

Page 35: Machine Learning in the Study of Protein Structure

EMOTIF Kernel [Ben-Hur 2003 (C8)]

• EMOTIF TRIE built from eBLOCKS [Nevill-Manning 1998 (C7)]
• EMOTIF feature vector: $\Phi(x) = (\phi_m(x))_{m \in M}$, where $\phi_m(x)$ is the number of occurrences of the motif m in x

Page 36: Machine Learning in the Study of Protein Structure

I-SITE Kernel [Hou 2003 (C10)]

• Similar to the EMOTIF kernel, the I-SITE kernel encodes protein sequences as a vector of confidence levels against structural motifs in the I-SITES library [Bystroff 1998 (C9)]

Page 37: Machine Learning in the Study of Protein Structure

Cluster kernels [Weston 2004 (C11)]

• Neighborhood kernels: implicitly average the feature vectors for sequences in the PSI-BLAST neighborhood of the input sequence (dependent on the size of the neighborhood and the total length of unlabeled sequences)
• Bagged kernels: run bagged k-means to estimate p(x,y), the empirical probability that x and y are in the same cluster. The new kernel is the product of p(x,y) and the base kernel K(x,y)

Page 38: Machine Learning in the Study of Protein Structure

Results

Page 39: Machine Learning in the Study of Protein Structure

Part 3.3: Protein secondary structure and conformational state prediction

Can we really do that?

• PHD
• PSI-PRED
• PrISM
• HMMSTR

Page 40: Machine Learning in the Study of Protein Structure

PHD: Profile network from HeiDelberg [Rost 1993 (P1)]

Accuracy: 70.8%

Page 41: Machine Learning in the Study of Protein Structure

PSIPRED [Jones 1999 (P2)]

Accuracy: 76.0%

Page 42: Machine Learning in the Study of Protein Structure

Conformational State Prediction

Page 43: Machine Learning in the Study of Protein Structure

PrISM [Yang 2003 (P3)]

Prediction with this conformation library is based on sequence and secondary structure similarity. Accuracy: 74.6%

Page 44: Machine Learning in the Study of Protein Structure

HMMSTR [Bystroff 2000 (P4)]: a Hidden Markov Model for Local Sequence-Structure Correlations in Proteins

• I-sites motifs are modeled as Markov chains and merged into one compact HMM to capture grammatical structure
• The HMM can be used for gene finding, secondary structure or conformational state prediction, sequence alignment…
• Accuracy:
– Secondary structure prediction: 74.5%
– Conformational state prediction: 74.0%

Page 45: Machine Learning in the Study of Protein Structure

Part 3.4: Protein domain segmentation

Cut? Where???

• DOMAINATION
• Pfam Database
• Multi-experts

Page 46: Machine Learning in the Study of Protein Structure

DOMAINATION [George 2002 (D1)]

• Get the distribution of both the N- and C-termini of PSI-BLAST alignments at each position; positions with Z-score > 2 are potential domain boundaries
• Accuracy: 50% over 452 multi-domain proteins
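
The Z-score thresholding step can be sketched as below. This is only an illustration of flagging positions whose termini count is unusually high, not the DOMAINATION procedure itself; the input counts are made up, and `boundary_candidates` is a hypothetical helper name.

```python
from statistics import mean, pstdev

def boundary_candidates(termini_counts, threshold=2.0):
    """Positions whose count of alignment termini has Z-score > threshold."""
    mu = mean(termini_counts)
    sd = pstdev(termini_counts)
    if sd == 0:
        return []  # flat profile: no position stands out
    return [i for i, c in enumerate(termini_counts) if (c - mu) / sd > threshold]

# Hypothetical profile: many alignments start or end at position 10
counts = [1] * 20
counts[10] = 30
print(boundary_candidates(counts))
```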

Page 47: Machine Learning in the Study of Protein Structure

Pfam [Sonnhammer 1997 (D2)]

• A database of HMMs of domain families
• Pfam-A: high-quality alignments and HMMs built from known domains
• Pfam-B: domains built by the Domainer algorithm from the remaining protein sequences, after removal of Pfam-A domains

Page 48: Machine Learning in the Study of Protein Structure

A multi-expert system from sequence information [Nagarajan 2003 (D3)]

Seed sequence → BLAST search → multiple alignment

Inputs to the neural network:
• Correlation
• Entropy
• Sequence participation
• Contact profile
• Secondary structure
• Physio-chemical properties
• Intron boundaries (from DNA data)

Neural network → final predictions

Page 49: Machine Learning in the Study of Protein Structure

Results [Nagarajan 2003 (D3)]

Page 50: Machine Learning in the Study of Protein Structure

Part 4: Conclusion and Future Work

Mars is not too far!?


Page 51: Machine Learning in the Study of Protein Structure

[Figure: “Distribution of Paper Year”, a bar chart of paper count (y-axis, 0 to 8) by year (x-axis: <1997, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004)]

Page 52: Machine Learning in the Study of Protein Structure

Conclusion

• Structural genomics plays an important role in understanding life
• Protein structure can be studied from different perspectives with different methods
• Machine learning is one of the most important tools for understanding genome data
• Protein structure prediction is a challenging task given the data we have now

Page 53: Machine Learning in the Study of Protein Structure

Future Work

• Rank propagation with domain activation regions
• Profile kernel with secondary structure information for protein classification
• Rank propagation for domain segmentation
• Specialist algorithm for protein conformational state prediction

Page 54: Machine Learning in the Study of Protein Structure

The End

Page 55: Machine Learning in the Study of Protein Structure

Determination of Protein Structures

• X-ray crystallography: the interaction of X-rays with electrons arranged in a crystal produces an electron-density map, which can be interpreted as an atomic model. Crystals are very hard to grow.
• Nuclear magnetic resonance (NMR): some atomic nuclei have a magnetic spin. The molecule is probed with radio-frequency radiation to obtain the distances between atoms. Only applicable to small molecules.

Page 56: Machine Learning in the Study of Protein Structure

Hidden Markov Models for Modeling Protein [Krogh 1993 (B3)]

Build an HMM from unaligned sequences with the EM algorithm:
1. Choose initial length and parameters
2. Iterate until the change in likelihood is small
– Calculate the expected number of times each transition or emission is used
– Maximize the likelihood to get new parameters

Page 57: Machine Learning in the Study of Protein Structure

Thanks to Tony Jebara

Support Vector Machine [Burges 1998 (B4)]

• With probability 1−η the bound holds:

$$R(\alpha) \le J(\alpha, h) = R_{emp}(\alpha) + \sqrt{\frac{h\left(\log(2l/h) + 1\right) - \log(\eta/4)}{l}}$$

– l is the number of data points
– h is the VC dimension
• Structural Risk Minimization
– For each $h_i$, get the best $\alpha^* = \arg\min_\alpha R_{emp}(\alpha)$
– Choose the model with minimum $J(\alpha^*, h_i)$
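
A small numeric sketch of structural risk minimization, assuming the VC bound from [Burges 1998 (B4)]: empirical risk plus sqrt((h(log(2l/h) + 1) − log(η/4)) / l). The candidate models, their VC dimensions and their empirical risks are hypothetical numbers chosen only to show the trade-off.

```python
import math

def vc_confidence(h, l, eta=0.05):
    """Capacity term of the bound for VC dimension h and l data points."""
    return math.sqrt((h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l)

def risk_bound(r_emp, h, l, eta=0.05):
    """Upper bound J on the true risk, holding with probability 1 - eta."""
    return r_emp + vc_confidence(h, l, eta)

def select_model(candidates, l, eta=0.05):
    """candidates: (name, h, r_emp) tuples; return the one minimizing J."""
    return min(candidates, key=lambda c: risk_bound(c[2], c[1], l, eta))

# Hypothetical candidates: the complex model fits better but has larger h
candidates = [("simple", 10, 0.10), ("complex", 100, 0.02)]
best = select_model(candidates, l=1000)
print(best[0], round(risk_bound(best[2], best[1], 1000), 4))
```

The capacity term grows with h, so a model with lower training error can still lose once its larger VC dimension is accounted for.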

Page 58: Machine Learning in the Study of Protein Structure

EMOTIF Database [Nevill-Manning 1998 (C7)]

• A motif database of protein families
• Substitution groups from separation score

Page 59: Machine Learning in the Study of Protein Structure

EMOTIF Database [Nevill-Manning 1998 (C7)]

• All possible motifs are enumerated from sequence alignments

Page 60: Machine Learning in the Study of Protein Structure

I-SITE Motif Library [Bystroff 1998 (C9)]

• Sequence segments (3-15 amino acids long) are clustered via K-means
• Within each cluster, structure similarity is calculated in terms of dme and mda:

$$dme = \sqrt{\frac{\sum_{i=1}^{L-5} \sum_{j=i+5}^{L} \left(\alpha^{s_1}_{i \to j} - \alpha^{s_2}_{i \to j}\right)^2}{N}}$$

$$mda(L) = \max_{i=1,\dots,L-1} \left(\Delta\phi_{i+1}, \Delta\psi_i\right)$$

• Only those clusters with good dme and mda are refined and considered motifs afterwards

Page 61: Machine Learning in the Study of Protein Structure

PrISM [Yang 2003 (P3)]

Page 62: Machine Learning in the Study of Protein Structure

Pfam [Sonnhammer 1997 (D2)]

• Construction of Pfam-A:
– Pick seed sequences from several sources and build a seed alignment
– Build an HMM from the seed alignment, use it to pull in new members, and align them to the HMM to get the full alignment

Page 63: Machine Learning in the Study of Protein Structure

Sonnhammer, 1997

Pfam [Sonnhammer 1997 (D2)]

• Construction of Pfam-B:
– The Domainer program merges homologous segment pairs into homologous segment sets together with links. This graph is partitioned into domains
– Use the Domainer program to build alignments from all protein segments not covered by Pfam-A
• Incremental updating
– New sequences are added to the full alignment of existing models if they score above a threshold
– If a new sequence causes problems, the seed alignment is altered and Pfam-B is regenerated afterwards