Inferring Protein Structure with Discriminative Learning and Network Diffusion
Rui Kuang
Department of Computer Science
Center for Computational Learning Systems
Columbia University
Thesis defense, August 16th 2006
Agenda
Biological Background
Protein Classification with String Kernels
Domain Identification and Boundary Detection
Protein Ranking with Network Diffusion
Conserved Motifs between Remote Homologs
Conclusion & Future Research
What are proteins?
Proteins are encoded by genes.
A protein (polypeptide chain) is a sequence of amino acid residues.
The name is derived from the Greek word proteios, meaning "of the first rank", and was coined by Jöns J. Berzelius in 1838.
(Picture courtesy of Branden & Tooze )
Amino Acid Polypeptide Chain
Background | Structural classification | Domain segmentation | Protein ranking | Motif discovery | Conclusion & Future Research
Why study proteins?
Proteins play crucial functional roles in all biological processes: enzymatic catalysis, signaling messengers, structural elements…
Function depends on the unique 3-D structure.
Protein sequences are easy to obtain; structures are difficult to determine.
[Figure: Sequence → (fold) → Structure → (determine) → Function. Picture courtesy of Branden & Tooze.]
Sequence: NLAFALSELDRITAQLKLPRHVEEEAARLYREAVRKGLIRGRSIESVMAACVYAACRLLKVPRTLDEIADIARVDKKEIGRSYRFIARNLNLTPKKLF…
Sequence Space and Structure Space
Structure (38,086 known in PDB as of 8/14/2006): discrete groups of folds with unclear boundaries
Sequence (>2,500,000)
• Homologous proteins share >30% sequence identity, which suggests strong structural similarity.
• Remote homologous proteins share similar structure but low sequence similarity.
Remote Homology Detection
Remote homology: a remote evolutionary relationship with conserved structure/function but low sequence similarity.
It is often not possible to detect a statistically significant sequence alignment between remote homologs.
ADTIVAVELDTYPNTDIGDPSYPHIGIDIKSVRSKKTAKWNMQNGKVGTAHIIYNSVDKRLSAVVSYPNADSATVSYDVDLDNVLPEWVRVGLSASTGLYKETNTILSWSFTSKLKSNSTHETNALHFMFNQFSKDQKDLILQGDATTGTDGNLELTRVSSNGPQGSSVGRALFYAPVHIWESSAVVASFEATFTFLIKSPDSHPADGIAFFISNIDSSIPSGSTGRLLGLFPDAN
MSLLPVPYTEAASLSTGSTVTIKGRPLVCFLNEPYLQVDFHTEMKEESDIVFHFQVCFGRRVVMNSREYGAWKQQVESKNMPFQDGQEFLSISVLPDKYQVMVNGQSSYTFDHRIKPEAVKMVQVWRDISLTKFNVSYLKR
<10% sequence identity
DSSYYWEIEASEVMLSTRIGSGSFGTVYKGKWHG-DVAVKILKVVDPTPEQFQAFRNEVA
D + WEI+ +++ + ++ SGS+G +++G + +VA+K LK E + F EV
DGTDEWEIDVTQLKIEKKVASGSYGDLHRGTYCSQEVAIKFLKPDRVNNEMLREFSQEVF
Protein Domains
Proteins often consist of several independent domains, which:
  fold autonomously
  often function differently
  represent fundamental structural, functional and evolutionary units
Example: a two-domain protein
  a 3-layer (aba) sandwich at the N-terminus
  a mainly-alpha orthogonal bundle at the C-terminus
Inferring protein structure/function from sequence similarity
For newly sequenced genomes, homology detection can often identify fewer than half of the genes.
Remote homology detection and domain segmentation are crucial steps for studying genes with no close homologs.
[Figure: a query sequence is matched by sequence alignment against a domain database (Domain Family 1, 2, 3, …, n-2, n-1, n) and split into Domain 1, Domain 2 and Domain 3. Panel labels: Boundary identification, Remote homology detection.]
We want to correctly segment protein sequences into domains (domain boundary identification) and associate them with their corresponding structural/functional class (remote homology detection).
Protein Structural Classification
Protein classification is the prediction of the structural or functional class of a protein from its primary sequence.
SCOP: Structural Classification of Proteins
Known domain structures are organized in a hierarchy: family, superfamily and fold.
  Family: sequence identity > 30%, or very similar functions and structures
  Superfamily: low sequence similarity, but functional features suggest a probable common evolutionary origin
Remote Homology Detection in Protein Classification
Remote homologs: sequences that belong to the same super-family but not the same family.
[Figure: SCOP hierarchy (Fold, Super-family, Family) marking the positive training set, positive test set, negative training set and negative test set.]
Support Vector Machine (SVM) Classifiers
SVM: a large-margin discriminative learning approach.
Find a hyperplane that separates positive data from negative data and also maximizes the margin.
Training is a quadratic programming problem that depends only on the inner products between data points.
Kernels for Discrete Objects
Kernel trick: to train an SVM, one can use a kernel rather than an explicit feature map.
Kernels can be defined for sequences, graphs and other discrete objects:
$$\Phi : \{ \text{sequences} \} \to \mathbb{R}^N$$
The kernel value is the inner product in feature space:
$$K(x, y) = \langle \Phi(x), \Phi(y) \rangle$$
Original string kernels (Watkins, Haussler, Lodhi et al.) require quadratic time in sequence length, O(|x| |y|), to compute each kernel value K(x, y).
We introduce fast novel string kernels with linear time complexity.
Profile Kernel and its Family Tree
Three generations: spectrum kernels, mismatch kernels, profile kernels.
Faster computation, i.e., linear computation time in sequence length.
Profile-based string kernels take advantage of abundant unlabeled data to capture homologous/evolutionary information for remote homology detection.
Leslie, Eskin and Noble, PSB 2002
Spectrum Kernel
Feature map indexed by all possible k-length subsequences ("k-mers") from the alphabet Σ of amino acids, |Σ| = 20
Q1: AKQDYYYYE → AKQ KQD QDY DYY YYY YYY YYE
Q2: DYYEIAKQE → DYY YYE YEI EIA IAK AKQ KQE
Feature space (AAA to YYY), k-mer counts:
k-mer  Q1  Q2
AKQ    1   1
DYY    1   1
EIA    0   1
IAK    0   1
KQD    1   0
KQE    0   1
QDY    1   0
YEI    0   1
YYE    1   1
YYY    2   0
K(Q1,Q2) = <(…1…1…0…0…1…0…1…0…1…2), (…1…1…1…1…0…1…0…1…1…0)> = 3
K-mers capture some position-independent local similarity, but they don’t effectively model evolutionary divergence.
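The spectrum kernel example on this slide can be checked with a short sketch. This is a minimal illustration using plain k-mer counting, not the trie-based implementation from the thesis; the function names `spectrum_features` and `spectrum_kernel` are hypothetical.

```python
from collections import Counter


def spectrum_features(seq, k):
    """Count every contiguous k-mer in seq (the spectrum feature map)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(x, y, k):
    """Inner product of the two k-mer count vectors."""
    fx, fy = spectrum_features(x, k), spectrum_features(y, k)
    # Counter returns 0 for k-mers that do not occur in y.
    return sum(count * fy[kmer] for kmer, count in fx.items())


k_value = spectrum_kernel("AKQDYYYYE", "DYYEIAKQE", 3)
print(k_value)  # 3: the shared 3-mers are AKQ, DYY and YYE
```

Only k-mers that actually occur are stored, so the cost is linear in the sequence lengths rather than in the size of the 20^k feature space.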
Leslie, Eskin, Weston and Noble, NIPS 2002
Mismatch Kernel
For a k-mer s, the mismatch neighborhood N(k,m)(s) is the set of all k-mers t within m mismatches from s.
The size of the mismatch neighborhood is O(|Σ|^m k^m).
[Figure: the 3-mer AKQ is mapped to its neighbors CKQ, DKQ, AAQ, AKY, …, giving the feature vector ( 0 , … , 1 , … , 1 , … , 1 , … , 1 , … , 0 ) with 1s at AAQ, AKY, CKQ, DKQ.]
Arbitrary mismatch does not model the mutation probability between amino acids.
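To make the mismatch neighborhood concrete, the sketch below enumerates it by brute force. It is an illustration only, assuming the standard 20-letter amino acid alphabet; `mismatch_neighborhood` is a hypothetical helper, and the kernels in the talk use a trie traversal rather than explicit enumeration.

```python
from itertools import combinations, product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20-letter alphabet


def mismatch_neighborhood(s, m, alphabet=AMINO_ACIDS):
    """Return all k-mers that differ from s in at most m positions."""
    neighbors = {s}
    for n_mismatches in range(1, m + 1):
        for positions in combinations(range(len(s)), n_mismatches):
            # At each chosen position, substitute any *other* letter.
            choices = [[c for c in alphabet if c != s[p]] for p in positions]
            for substitution in product(*choices):
                t = list(s)
                for p, c in zip(positions, substitution):
                    t[p] = c
                neighbors.add("".join(t))
    return neighbors


neighborhood = mismatch_neighborhood("AKQ", 1)
print(len(neighborhood))  # 58 = 1 + 3 * 19
```

For k=3 and m=1 the neighborhood holds the k-mer itself plus 19 substitutions at each of the 3 positions, which matches the O(|Σ|^m k^m) growth stated on the slide.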
Profile Kernel
Profile kernel: specialized to protein sequences; uses probabilistic profiles to capture homology information.
Semi-supervised approach: profiles are estimated using unlabeled data (about 2.5 million proteins).
E.g. PSI-BLAST profiles: estimated by iteratively aligning database homologs to the query sequence.
Profiles are built from a multiple sequence alignment to model the positional mutation probability.
     L   K   L   …
A    3  -2   1   …
C   -1   0   2   …
D   -1   0   0   …
…    …   …   …   …
Y    2  -3  -3   …
QUERY LKLLRFLGSGAFGEVYEGQLKTE....DSEEPQRVAIKSLRK.......
HOMOLOG1 IIMHNKLGGGQYGDVYEGYWK........RHDCTIAVKALK........
HOMOLOG2 LTLGKPLGEGCFGQVVMAEAVGIDK.DKPKEAVTVAVKMLKDD......
HOMOLOG3 IVLKWELGEGAFGKVFLAECHNLL...PEQDKMLVAVKALK........
[Kuang & Leslie, JBCB 2005 and CSB 2004]
Profile-based k-mer Map
Use the profile to define position-dependent mutation neighborhoods:
$$P(x) = \{ p_j(b),\ b \in \Sigma \}_{j=1}^{|x|}$$
$$M_{(k,\sigma)}(P(x[j+1 : j+k])) = \{ \beta = b_1 b_2 \ldots b_k : -\sum_{i=1}^{k} \log p_{j+i}(b_i) < \sigma \}$$
E.g. k=3, σ=5 and a profile of negative log probabilities:
     A   K   Q   …
A    1   3   4   …
C    5   4   1   …
D    4   4   4   …
…    …   …   …   …
K    4   1   4   …
…    …   …   …   …
Q    3   4   1   …
…    …   …   …   …
Y    2   4   3   …
AKQ → YKQ (2+1+1 < σ), AKQ (1+1+1 < σ), AKC (1+1+2 < σ), YKC (2+1+1 < σ)
Feature vector: ( 0 , … , 1 , … , 1 , … , 1 , … , 1 , … , 0 ), with 1s at AKC, AKQ, YKC, YKQ
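The position-dependent neighborhood can be sketched by brute force over the rows that are visible in the example profile. This is an illustration under stated assumptions: only the six residues printed on the slide (A, C, D, K, Q, Y) are included, the elided rows are left out rather than guessed, and the sums are computed from the table values as printed. `PROFILE` and `profile_neighborhood` are hypothetical names.

```python
from itertools import product

# Negative log probabilities for the three profile positions (columns A, K, Q).
# Only the rows shown on the slide are listed; elided rows are omitted.
PROFILE = [
    {"A": 1, "C": 5, "D": 4, "K": 4, "Q": 3, "Y": 2},  # position 1
    {"A": 3, "C": 4, "D": 4, "K": 1, "Q": 4, "Y": 4},  # position 2
    {"A": 4, "C": 1, "D": 4, "K": 4, "Q": 1, "Y": 3},  # position 3
]


def profile_neighborhood(profile, sigma):
    """All k-mers whose summed negative log probability stays below sigma."""
    letters = sorted(profile[0])
    return {
        "".join(kmer)
        for kmer in product(letters, repeat=len(profile))
        if sum(column[b] for column, b in zip(profile, kmer)) < sigma
    }


print(sorted(profile_neighborhood(PROFILE, 5)))
# ['AKC', 'AKQ', 'YKC', 'YKQ']
```

With σ=5 the four surviving k-mers are the same features marked with 1 in the slide's feature vector.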
Efficient Computing with Trie
Use trie data structure to organize lexical traversal of all instances of k-mers in the training profiles.
Scales linearly with length, O(k^(m_max+1) |Σ|^(m_max) (|x|+|y|)), where m_max is the maximum number of mismatches that occur in any mutation neighborhood.
E.g. k=3, σ=5. Trie path: A → Q → C (sibling branch: D)
Profile of a 3-mer in sequence x:
     A   Q   K   …
A    1   3   2   …
C    3   2   1   …
D    3   2   1   …
…    …   …   …   …
Q    3   1   2   …
…    …   …   …   …
Y    2   1   3   …
Profile of a 3-mer in sequence y:
     A   Q   Y   …
A    .5  2   1   …
C    2   1   2   …
D    2   1   4   …
…    …   …   …   …
Q    2   .6  2   …
…    …   …   …   …
Y    3   3   3   …
AQC: x: 1+1+1 < σ, y: .5+.6+2 < σ
AQD: x: 1+1+1 < σ, y: .5+.6+4 > σ
Update K(x, y) by adding the contribution for feature AQC but not AQD.
Inexact Matching Kernels
[Leslie & Kuang, JMLR 2004, KMCB 2004 & COLT 2003]
Gappy kernels
  For a g-mer s, g > k, the gapped match set G(g,k)(s) consists of all k-mers t that occur in s with up to (g - k) gaps.
Wildcard kernels
  Introduce a wildcard character "*" and define the feature space indexed by k-mers from Σ ∪ {*}, allowing up to m wildcards.
Substitution kernels
  Use substitution matrices to obtain P(a|b), the substitution probabilities for residues a, b.
  The mutation neighborhood M(k,σ)(s) is the set of all k-mers t such that
  $$-\sum_{i=1}^{k} \log P(s_i | t_i) < \sigma$$
Experiments
SCOP 1.59 benchmark with 54 experiments.
PSI-BLAST profiles are trained on the NR database.
Comparison against PSI-BLAST and recent SVM-based methods:
  PSI-BLAST rank: use the training sequence as query and rank the testing sequences by PSI-BLAST e-value.
  eMotif kernel (Ben-Hur et al., 2003): features are known protein motifs, stored using a trie.
  SVM-pairwise (Liao & Noble, 2002): feature vectors of pairwise alignment scores (e.g. PSI-BLAST scores).
  Cluster kernel (Weston et al., 2003): implicitly averages the feature vectors of the sequences in the PSI-BLAST neighborhood of the input sequence.
Results
Results (Cont.)
Kernels ROC ROC50
PSI-BLAST 0.743 0.293
eMotif 0.711 0.247
Mismatch(5,1) 0.875 0.416
Gappy(6,4) 0.851 0.387
Substitution(4,6.0) 0.876 0.441
Wildcard(5,2,1.0) 0.881 0.447
SVM-Pairwise 0.866 0.533
Cluster 0.923 0.699
Profile(5,7.5)-5 Iteration 0.984 0.874
Identify Protein Domains and Domain Boundaries
SVM-based remote homology detection methods do not rely on sequence alignment.
To learn the domain segmentation, we use our SVM classifiers as domain recognizers and find the optimal segmentation giving the maximum sum of the classification scores.
[Figure: SCOP super-families each provide a domain recognizer (SVM1, SVM2, SVM3, SVM4); the recognizers are applied to the query sequence to place the domain boundaries.]
Algorithms for finding optimal segmentation
Assuming we know the number of domains in a protein, we use dynamic programming to find the segmentation with the maximum sum of classification scores.
No gaps:
$$T_{(1,j)} = F(S_{1,j}) \quad \text{for } 1 \le j \le N$$
$$T_{(c,j)} = \begin{cases} -\infty, & \text{when } j < c \\ \max_{k < j} \left( F(S_{k+1,j}) + T_{(c-1,k)} \right), & \text{otherwise} \end{cases}$$
Allowing gaps (can also be solved as an LP problem):
$$T_{(1,j)} = \max_{1 \le k \le l \le j} F(S_{k,l}) \quad \text{for all } 1 \le j \le N$$
$$T_{(c,j)} = \begin{cases} -\infty, & \text{when } j < c \\ \max_{k < l \le j} \left( F(S_{k+1,l}) + T_{(c-1,k)} \right), & \text{otherwise} \end{cases}$$
where
  $S_{i,j}$: segment between position i and position j of sequence S
  $F(s)$: the best classification score of segment s
  $T_{(c,j)}$: the maximum sum of classification scores from c segments on $S_{1,j}$
Toy Example of Dynamic Programming
c=1:   1     4     2     2     1     1
c=2:  -INF   2     5     8     6     7
c=3:  -INF  -INF   3     6     9    13
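The no-gap recurrence for a known number of domains can be sketched as follows. This is a minimal illustration, not the thesis implementation: `best_segmentation` and `toy_score` are hypothetical names, and the toy scorer merely stands in for the best SVM classification score F(s) of a segment. The sketch assumes the sequence length is at least the number of segments.

```python
def best_segmentation(n, num_segments, score):
    """Split positions 1..n into num_segments contiguous segments
    maximizing the sum of score(start, end) over the segments."""
    neg_inf = float("-inf")
    # T[c][j]: best total score for c segments covering positions 1..j.
    T = [[neg_inf] * (n + 1) for _ in range(num_segments + 1)]
    back = [[0] * (n + 1) for _ in range(num_segments + 1)]
    for j in range(1, n + 1):
        T[1][j] = score(1, j)
    for c in range(2, num_segments + 1):
        for j in range(c, n + 1):          # T stays -inf when j < c
            for k in range(c - 1, j):      # last segment is k+1..j
                candidate = T[c - 1][k] + score(k + 1, j)
                if candidate > T[c][j]:
                    T[c][j], back[c][j] = candidate, k
    # Trace the boundaries back from the full sequence.
    segments, j = [], n
    for c in range(num_segments, 1, -1):
        k = back[c][j]
        segments.append((k + 1, j))
        j = k
    segments.append((1, j))
    return T[num_segments][n], segments[::-1]


def toy_score(labels):
    """Hypothetical scorer: +length for a homogeneous segment, -length otherwise."""
    def score(i, j):
        segment = labels[i - 1:j]
        return len(segment) if len(set(segment)) == 1 else -len(segment)
    return score


total, segments = best_segmentation(5, 2, toy_score("AAABB"))
print(total, segments)  # 5 [(1, 3), (4, 5)]
```

The triple loop costs O(c N^2) score evaluations, so in practice the segment scores would be cached or restricted to candidate boundaries.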
Algorithms for finding optimal segmentation (unconstrained number of domains)
The regions across boundaries are less classifiable than other regions within one domain.
Use dynamic programming to alternate between domain regions and boundary regions.
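$$P_0 = 0, \quad G_0 = 0$$
$$P_j = \max_{0 \le i \le j-1} \left( l_{seg}(S_{i+1,j}) + G_i \right)$$
$$G_j = \max_{0 \le i \le j-1} \left( l_{gap}(S_{i+1,j}) + P_i \right)$$
where
  $S_{i,j}$: segment between position i and position j of sequence S
  $l_{seg}(s)$: confidence score of assigning s as a domain region
  $l_{gap}(s)$: confidence score of assigning s as a boundary region
[Figure: a sequence alternating between domain regions and boundary regions.]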
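The alternation between domain and boundary regions can be sketched with two coupled tables, P for prefixes ending in a domain region and G for prefixes ending in a boundary region. This is an illustration under assumptions: taking the larger of the two final table entries as the answer is an assumption not stated on the slide, and `alternate_regions` and `toy_scorers` are hypothetical names with toy scores standing in for the SVM confidence scores.

```python
def alternate_regions(n, l_seg, l_gap):
    """Label positions 1..n with alternating domain/boundary regions."""
    neg_inf = float("-inf")
    P = [0.0] + [neg_inf] * n   # P[j]: best score, last region is a domain
    G = [0.0] + [neg_inf] * n   # G[j]: best score, last region is a boundary
    back_p = [0] * (n + 1)
    back_g = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            cand_p = l_seg(i + 1, j) + G[i]
            if cand_p > P[j]:
                P[j], back_p[j] = cand_p, i
            cand_g = l_gap(i + 1, j) + P[i]
            if cand_g > G[j]:
                G[j], back_g[j] = cand_g, i
    # Assumption: the final answer is the better of the two end states.
    kind = "domain" if P[n] >= G[n] else "boundary"
    regions, j = [], n
    while j > 0:
        i = back_p[j] if kind == "domain" else back_g[j]
        regions.append((kind, i + 1, j))
        kind = "boundary" if kind == "domain" else "domain"
        j = i
    return max(P[n], G[n]), regions[::-1]


def toy_scorers(track):
    """Hypothetical scores: 'D' marks domain residues, 'b' boundary residues."""
    def l_seg(i, j):
        part = track[i - 1:j]
        return part.count("D") - part.count("b")
    def l_gap(i, j):
        part = track[i - 1:j]
        return part.count("b") - part.count("D")
    return l_seg, l_gap


total, regions = alternate_regions(8, *toy_scorers("DDDbbDDD"))
print(total, regions)
```

On the toy track the sketch recovers a domain at positions 1 to 3, a boundary at 4 to 5 and a domain at 6 to 8, without being told the number of domains in advance.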
Experiments
Datasets:
  25 SCOP folds: 2678 training domains and 471 test chains (189 multi-domain proteins).
  40 SCOP super-families: 1917 training domains and 375 test chains (131 multi-domain proteins).
Baseline approach:
  Align test proteins against PSI-BLAST profiles of the training domains, and use the best aligned regions as domain regions.
Evaluation:
  Domain label: significant overlap between the true domain and the predicted domain.
  Boundaries: both the predicted start and end positions should be close to the true ones.
Experiments (Cont.)
Dataset             Fold dataset (414 domains)      Super-family dataset (288 domains)
Boundary distance   25             50               25             50
PSI-BLAST           8.7% (36)      11.4% (47)       5.6% (16)      15.3% (44)
DP_nogap            16.4% (68)     26.1% (108)      21.5% (62)     26.7% (77)
DP_gap              30.7% (127)    35.8% (148)      25.0% (72)     30.2% (87)
DP_alter            18.8% (78)     25.6% (106)      17.7% (51)     26.4% (76)
A prediction counts as correct only with at least 75% positional overlap between the true domain and the prediction.
Protein Ranking
Protein ranking: search a protein database for sequences that share an evolutionary or functional relationship with a given query sequence.
Standard protein ranking algorithm: the pairwise alignment-based algorithm PSI-BLAST, which can easily detect close homologs.
Pairwise alignment-based algorithms are not effective for remote homolog detection.
[Figure legend: Query, Homologous protein, Remote homolog, Other labeled protein]
From Local Similarity to Global Structure (RankProp)
[Figure: protein similarity network around the Query with nodes 1 to 8. Ranking based on local similarity: 1 2 3 6 7 8 4 5. Legend: Homologous protein, Remote homolog, Other labeled protein, Unlabeled protein.]
Cluster assumption: proteins with a structural or functional relation tend to be in the same cluster in the network.
Diffusion on the protein similarity network captures the cluster structure.
[Figure: the same network labelled Correct Ranking, with node groups 123, 678 and 45.]
Weston, Elisseeff, Zhou, Leslie, Noble, PNAS 2004
Noble, Kuang, Leslie, Weston, FEBS 2005
Weston, Kuang, Leslie, Noble, BMC Bioinformatics 2005
RankProp (Cont.)
Capture global structure with diffusion.
Protein similarity network:
  Graph nodes: protein sequences in the database
  Directed edges: weighted by PSI-BLAST e-value
  Initial ranking score at each node: the similarity to the query sequence
Iterative diffusion operation:
$$Y_{t+1} = Y_0 + \alpha K Y_t$$
  $Y_0$: initial ranking score
  $Y_t$: ranking score at step t
  K: normalized connectivity matrix
  α: a parameter for balancing the initial ranking score and propagation
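The diffusion update Y_{t+1} = Y_0 + αKY_t can be sketched on a toy network. This is an illustration only: the slide does not say how K is normalized, so row normalization is assumed here, and the weights, the initial scores and the name `rankprop` are all hypothetical.

```python
def rankprop(weights, y0, alpha=0.5, n_iter=100):
    """Iterate Y <- Y0 + alpha * K * Y, with K the row-normalized weights."""
    n = len(y0)
    # Assumption: K is obtained by row-normalizing the edge weights.
    K = []
    for row in weights:
        total = sum(row)
        K.append([w / total if total > 0 else 0.0 for w in row])
    y = list(y0)
    for _ in range(n_iter):
        y = [y0[i] + alpha * sum(K[i][j] * y[j] for j in range(n))
             for i in range(n)]
    return y


# Toy network: 0 = close homolog, 1 = intermediate, 2 = remote homolog
# reachable only through 1, 3 = unrelated protein with no edges.
weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
y0 = [1.0, 0.5, 0.0, 0.1]  # initial similarity to the query
scores = rankprop(weights, y0)
print(scores)
```

Before diffusion the remote homolog (node 2) scores below the unrelated protein (node 3); after diffusion it overtakes it because score flows in through the shared neighbor, which is the cluster assumption at work.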
MotifProp
Motivated by the HITS algorithm for page ranking and by NLP algorithms.
Protein-motif network
  Nodes: proteins and motifs
  Edges: whether a motif is contained in a protein
Motifs: patterns/models built on protein segments conserved during evolution.
  Often characterize structural/functional properties of a protein.
  Examples: eMOTIF, PROSITE, K-mers, BLOCKS…
[Kuang, Weston, Noble, Leslie, Bioinformatics, 2005]
…FYPGKGHTEDNIVVWLPQYNILVGGCLVKSTSAKDLGNVADAYVNEWSTSIENVLKRYRNINAVVPGHGEVG…
Motif Database
Query
MotifProp (Cont.)
MotifProp can identify motif-rich regions, derived from the motif ranking, which help interpret the diffusion algorithm.
Low computational cost: the protein-motif network is fast to build.
Motifs serve as bridges connecting homologous and remote homologous proteins.
[Figure: bipartite network with motif vertices on one side and protein vertices, including the Query, on the other.]
In MotifProp, protein nodes and motif nodes reinforce their similarity to the query sequence through propagation.
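Diffusion in Protein-motif Network
MotifProp:
  Normalize the affinity matrix H to $\tilde{H}$ (and its transpose H' to $\tilde{H}'$)
  Initialize P and F with the initial activation values $P^0$ and $F^0$
  Iterate until convergence ($\alpha \in (0,1)$):
    For all i: $P_i = (1-\alpha) P_i^0 + \alpha \sum_j \tilde{H}_{ij} F_j$
    For all i: $F_i = (1-\alpha) F_i^0 + \alpha \sum_j \tilde{H}'_{ij} P_j$
[Figure: bipartite network with motif vertices, protein vertices and the Query.]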
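The alternating protein and motif updates can be sketched on a toy bipartite network. This is an illustration under assumptions: the normalization of H is not specified on the slide, so each row of H and of its transpose is normalized to sum to one; the update weights (1-α) on the initial activation and α on the propagated term follow the closed-form solution given later in the talk; the toy network and the name `motifprop` are hypothetical.

```python
def row_normalize(matrix):
    """Scale each row to sum to 1 (rows of zeros stay zero)."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([v / total if total > 0 else 0.0 for v in row])
    return normalized


def motifprop(H, p0, f0, alpha=0.5, n_iter=100):
    """Propagate activation between protein nodes (P) and motif nodes (F)."""
    H_pm = row_normalize(H)                               # proteins x motifs
    H_mp = row_normalize([list(col) for col in zip(*H)])  # motifs x proteins
    p, f = list(p0), list(f0)
    for _ in range(n_iter):
        p = [(1 - alpha) * p0[i] + alpha * sum(h * fj for h, fj in zip(H_pm[i], f))
             for i in range(len(p0))]
        f = [(1 - alpha) * f0[i] + alpha * sum(h * pj for h, pj in zip(H_mp[i], p))
             for i in range(len(f0))]
    return p, f


# Rows: proteins (0 = query); columns: motifs. H[i][j] = 1 if motif j occurs in protein i.
H = [
    [1, 0, 0],  # query contains motif 0
    [1, 1, 0],  # homolog shares motif 0 and also contains motif 1
    [0, 1, 0],  # remote homolog shares only motif 1 with the homolog
    [0, 0, 1],  # unrelated protein with its own motif
]
p, f = motifprop(H, p0=[1.0, 0.0, 0.0, 0.0], f0=[1.0, 0.0, 0.0])
print(p)
```

The remote homolog receives activation through the chain query → motif 0 → homolog → motif 1, while the unrelated protein stays at zero, which illustrates motifs acting as bridges.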
Experiments
7329 sequences (4246 for training and 3083 for testing) of <95% identity from SCOP 1.59 plus 100,000 proteins from Swiss-Prot.
Motif sets: 4-mers, PROSITE and eMOTIF.
Experiments (Cont.)
Algorithm ROC1 ROC10 ROC50
Sequential MotifProp 0.640 0.663 0.688
k-mer MotifProp 0.621 0.648 0.679
RankProp 0.592 0.667 0.725
PROSITE MotifProp 0.600 0.643 0.664
eMOTIF MotifProp 0.527 0.612 0.666
PSI-BLAST 0.594 0.616 0.641
Conserved Motifs between Remote Homologs
We can derive weights of k-mer features from SVM classifiers trained with the profile kernel.
MotifProp provides activation values on the k-mer features after propagation.
Both the SVM weights and the motif activation values can be mapped back to protein sequences to identify conserved structural/functional motifs.
Positional contribution to classification score:
$$S(x[j+1 : j+k]) = \langle \Phi(P(x[j+1 : j+k])), \Delta \rangle$$
where Δ is the SVM weights or MotifProp activation values on k-mer features.
Mapping Discriminative Regions to Structure (Profile Kernel)
In examined examples, discriminative motif regions correspond to conserved structural features of the protein superfamily
Example: Homeodomain-like protein superfamily.
Ecoli MarA protein (1bl0)
Motif Rich Regions (MotifProp)
Motif-rich regions on chain B of the arsenite oxidase protein from the ISP protein super-family.
The PDB annotation and motif-rich regions are given.
The 3D protein structure is shown with motif-rich regions in yellow.
Conclusions & Contributions
Profile-based string kernels exploit a compact representation of homology information for better detection of remote homologs.
The dynamic programming-based approach improves multi-label domain classification and domain boundary detection over the PSI-BLAST alignment-based approach.
MotifProp improves protein ranking over PSI-BLAST by network diffusion on the protein-motif network.
Interpretation of the profile-SVM classifier and MotifProp by motif regions: conserved structural components.
Fast kernels for inexact string matching.
Classifiers for protein backbone angle prediction (not presented).
SVM-FOLD Web Server
Future Research
Protein function inference by structural genomics and proteomics
  Identify functional properties of protein structures with kernel methods, e.g. prediction of protein functional sites and structure-based identification of protein-protein interaction sites.
  Protein function inference from proteomics, e.g. protein function prediction based on protein-protein interaction patterns and protein structures.
Protein structure prediction
  Unified prediction of protein backbone and side chain positions (Phi-Psi angles and rotamers) with an energy-based cost function.
Acknowledgements: Committee
Tony Jebara, Dept. of Computer Science, Columbia University
Christina Leslie (advisor), Center for Computational Learning Systems & C2B2, Columbia University
Kathleen McKeown (chair), Dept. of Computer Science, Columbia University
William Stafford Noble, Dept. of Genome Sciences & Dept. of Computer Science, University of Washington
Rocco Servedio, Dept. of Computer Science, Columbia University
Jason Weston, Machine Learning Group, NEC Labs (USA)
Acknowledgements: Collaborators
An-Suei Yang (Genome Research Center, Academia Sinica of Taiwan)
Dengyong Zhou (Machine Learning Group, Microsoft)
Yoav Freund (Computer Science Department, UCSD)
Eugene Ie (Computer Science Department, UCSD)
Ke Wang (Computer Science Department, Columbia University)
Wei Chu (Center For Computational Learning Systems, Columbia University)
Kai Wang (Biomedical Informatics Department, Columbia University)
Iain Melvin (Machine Learning Group, NEC)
Girish Yao (Computer Science Department, Columbia University)
Lan Xu (Molecular Biology Department, The Scripps Research Institute)
Publications
Structural classification:
  Profile kernels (JBCB 2005 and CSB 2004)
  Inexact matching kernels (JMLR 2004, COLT 2003 & KMCB 2004)
Protein ranking:
  RankProp (FEBS 2005 and BMC Bioinformatics 2005)
  MotifProp (Bioinformatics 2005)
Protein local structure prediction:
  Kernel methods based on sliding-window (Bioinformatics 2004)
  Structured output learning (ongoing research)
Protein domain segmentation (in preparation)
Protein backbone angle prediction
[Figure: 3-D structure → Phi-Psi angles … (Φ1,Ψ1), (Φ2,Ψ2), (Φ3,Ψ3), (Φ4,Ψ4), (Φ5,Ψ5), (Φ6,Ψ6), (Φ7,Ψ7), (Φ8,Ψ8) … → discretization of Phi-Psi angles → conformational states A A A G B B B B B …]
Sliding-window SVM approach [Kuang, Leslie & Yang 2004]
Encode each position independently with sequence information within a length-k window.
[Figure: conformational states A A A B B B B G G E B B B B B along the sequence; each position is encoded as a labelled window vector and passed to the SVM, for example:
A: -3 -4 -4 -4 -3 -4 …
A: 0 -1 -1 3 -4 3 4 1 …
B: 0 -1 2 1 -3 4 0 -1 …
B: -2 -3 -4 -5 -2 4 …
B: 0 -3 -1 -2 -4 -1 …
…]
Smoothing: use the predictions to train a second set of SVMs.
Experiments
Datasets: 697 sequences of 97,365 amino acids with sequence identity < 25 % from PDB (PDB_SELECT25).
Comparison against: LSBSP1: query against local structure-based sequence profile database.
HMMSTR: Hidden Markov Model based on local structural motifs.
RankProp in Genome Browser
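Closed form solution of MotifProp
$$Y^* = (1-\alpha)(I - \alpha W)^{-1} Y^0$$
where
• Initial ranking: $Y^0 = \begin{bmatrix} P^0 \\ F^0 \end{bmatrix}$. Final ranking: $Y^* = \begin{bmatrix} P^* \\ F^* \end{bmatrix}$
• Normalized affinity matrix: $W = \begin{bmatrix} 0 & \tilde{H} \\ \tilde{H}' & 0 \end{bmatrix}$
Regularization Framework
Related to the regularization framework in Zhou et al., NIPS 2003:
$$Q(Y^*) = \frac{1}{2} \sum_{i,j=1}^{n} W_{ij} \| Y_i^* - Y_j^* \|^2 + \left( \frac{1}{\alpha} - 1 \right) \sum_{i=1}^{n} \| Y_i^* - Y_i^0 \|^2$$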
Discussion: Alignment vs. K-mers
Rangwala and Karypis, 2005, achieved further improvement on a previous benchmark dataset on SCOP 1.53 with kernels defined on profile-profile alignment.
Proteins are documents with no punctuation, and there is no dictionary!
Alignment:
  Optimal local alignment detects the most conserved paragraph.
  Sensitive for detecting homologous proteins.
  Good for remote homologs with one relatively long conserved region.
  Alignment is easily interpretable.
K-mers:
  Length-k subsequences summarize local matches.
  Can detect discontinuous and disordered conserved regions between remote homologs.
  Can achieve fast computation.
  Well-defined k-mer feature space for applying learning algorithms.
Extracting Discriminative Motif Regions
SVM training determines support vector sequence profiles and their weights: $(P(x_i), \alpha_i)$
SVM decision hyperplane normal vector:
$$w = \sum_i y_i \alpha_i \Phi(P(x_i))$$
Positional contribution to classification score:
$$S(x[j+1 : j+k]) = \langle \Phi(P(x[j+1 : j+k])), w \rangle$$
Averaged positional score for positive sequences:
$$S_{avg}(x[j]) = \frac{1}{k} \sum_{q=1}^{k} S(x[j-k+q : j-1+q])$$
Map Motif Rich Regions
Map final motif activation values to query sequence to find conserved structural components between remote homologs.
Determination of Protein Structures
X-ray crystallography: the interaction of x-rays with electrons arranged in a crystal produces an electron-density map, which can be interpreted as an atomic model. Crystals are very hard to grow.
Nuclear magnetic resonance (NMR): some atomic nuclei have a magnetic spin. The molecule is probed with radio-frequency pulses to obtain the distances between atoms. Only applicable to small molecules.
Protein structure prediction
Comparative modeling: where there is a clear sequence relationship between the target structure and one or more known structures.
Fold recognition ('threading'): no sequence homology with known structures. Find consistent folds (remote homology detection).
Ab initio structure prediction ('de novo'): deriving structures, approximate or otherwise, from sequence.
RankProp
Protein similarity network:
  Graph nodes: protein sequences in the database
  Directed edges: exp(-Sij/σ), where Sij is the PSI-BLAST e-value between the ith protein and the jth protein.
  Initial ranking score at each node: the similarity to the query sequence
Iterative diffusion operation: Yt+1 = Y0 + αKYt
  Y0: initial ranking score
  Yt: ranking score at step t
  K: normalized connectivity matrix
  α: a parameter in (0,1) for balancing the initial ranking score and propagation
Weston, Elisseeff, Zhou, Leslie, Noble, PNAS 2004
Noble, Kuang, Leslie, Weston, FEBS 2005
Weston, Kuang, Leslie, Noble, BMC Bioinformatics 2005
Diffusion in protein similarity network
Remote homology can be detected by diffusion from common neighbors in the cluster.
The protein network is expensive to build.
The ranking is hard to interpret.
Extracting Discriminative Motif Regions
Sort positional scores: about 40%-50% of positions in positive training sequences contribute 90% of the classification score.
Peaky positional plots → discriminative motifs
Inferring Protein Structure with Machine Learning
Machine learning builds statistical models from data to learn underlying principles automatically.
Machine learning techniques are promising for inferring protein structure: there is a large amount of data but little theory.
Solved structures in the Protein Data Bank (PDB) provide valuable knowledge about protein folding patterns.
[Figure: PDB growth chart showing the total number of structures and the new structures added each year.]
Sequence Alignment for Matching Proteins
The Smith-Waterman algorithm uses dynamic programming to find the optimal local alignment, with maximum substitution scores, between two sequences.
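The dynamic programming behind Smith-Waterman can be sketched in a few lines. This is a simplified illustration: it assumes a flat match/mismatch score and a linear gap penalty instead of an amino acid substitution matrix, and it returns only the best score, not the alignment itself.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j]: best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                                  # start a new local alignment
                H[i - 1][j - 1] + substitution,     # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,                  # gap in b
                H[i][j - 1] + gap,                  # gap in a
            )
            best = max(best, H[i][j])
    return best


print(smith_waterman("ACACACTA", "AGCACACA"))  # 12
```

Filling the table takes O(|a| |b|) time, the same quadratic cost that heuristics such as BLAST avoid by extending only from short seed matches.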
Alignment-based Algorithms
Smith-Waterman algorithm: find optimal local alignments by dynamic programming
BLAST & PSI-BLAST [Altschul 1997]: fast approximations of the Smith-Waterman algorithm. Alignments are only extended from short identical stretches potentially contained in true matches. Profiles are built to search the database iteratively.
SAM-98 [Karplus 1999]: HMM based approach. Build HMM for the query and target sequences. Rank sequences by likelihood computed from HMMs.
HITS Algorithm for Page Ranking
Kleinberg, 1998
Good hubs: web pages with many pointers to related pages.
Good authorities: web pages pointed to by hubs.
Recursive updating reinforces good hubs and good authorities.
[Figure: Hubs pointing to Authorities]
Let N be the set of edges
Initialize Hub[A] and Aut[A] to 1
Iterate until convergence:
  For all A, Aut[A] = ∑_{(B,A) ∈ N} Hub[B]
  For all A, Hub[A] = ∑_{(A,B) ∈ N} Aut[B]
  Normalize Aut and Hub
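The HITS pseudocode above can be turned into a small runnable sketch. This is an illustration only: the slide does not say which norm is used, so L2 normalization is assumed, a fixed iteration count replaces the convergence test, and the toy graph and the function names are hypothetical.

```python
import math


def normalize(scores):
    """Scale scores to unit L2 norm (assumed; the slide leaves the norm open)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return {k: v / norm for k, v in scores.items()} if norm > 0 else scores


def hits(edges, n_iter=50):
    """edges: list of (source, target) pairs. Returns (hub, authority) scores."""
    nodes = sorted({v for edge in edges for v in edge})
    hub = {v: 1.0 for v in nodes}
    aut = {v: 1.0 for v in nodes}
    for _ in range(n_iter):
        # Authority of A: sum of hub scores of pages pointing to A.
        new_aut = {v: 0.0 for v in nodes}
        for source, target in edges:
            new_aut[target] += hub[source]
        # Hub of A: sum of authority scores of pages A points to.
        new_hub = {v: 0.0 for v in nodes}
        for source, target in edges:
            new_hub[source] += new_aut[target]
        aut, hub = normalize(new_aut), normalize(new_hub)
    return hub, aut


# Toy graph: h1 and h2 both point to a1; only h1 also points to a2.
hub, aut = hits([("h1", "a1"), ("h2", "a1"), ("h1", "a2")])
print(aut["a1"] > aut["a2"], hub["h1"] > hub["h2"])  # True True
```

MotifProp applies the same mutual-reinforcement idea to a bipartite graph, with proteins and motifs in the roles of hubs and authorities.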