January, 2009
Jaime Carbonell et al., Carnegie Mellon University
Data-Intensive Scalability in Machine Learning and Computational Proteomics
January, 2009 © 2009, Jaime G. Carbonell 2
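Active and Proactive Learning
• Training data: $\{x_i, y_i\}_{i=1,\dots,k} \cup \{x_i\}_{i=k+1,\dots,n}$, with $O: x_i \to y_i$
• Objective: learn the decision function with minimal training (sampling)
• Functional space: $\{f_j \times p_l\}$
• Fitness criterion (a.k.a. loss function): $\arg\min_{j,l} \sum_i \left| y_i - f_{j,p_l}(x_i) \right| + \xi(f_j, p_l)$
• Sampling strategy: $\arg\min_{x_i \in \{x_{k+1},\dots,x_n\}} L\big(\hat{f}(x_{test}), y_{test} \mid \{x_1,\dots,x_k\} \cup x_i\big)$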
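The sampling strategy on this slide chooses which unlabeled instance to send to the oracle next. As a rough illustration only, the sketch below runs a pool-based loop that swaps in uncertainty sampling (query the pool point closest to the current decision boundary) for the expected-loss criterion on the slide, which would require retraining once per candidate. The 1-D threshold learner, the `oracle` callable, and every name here are hypothetical and not taken from the talk.

```python
def fit_threshold(xs, ys):
    """Toy 1-D learner for separable data: midpoint between the largest
    labeled negative and the smallest labeled positive."""
    lo = max(x for x, y in zip(xs, ys) if y == 0)
    hi = min(x for x, y in zip(xs, ys) if y == 1)
    return 0.5 * (lo + hi)


def active_learn(pool, oracle, seed_idx, budget):
    """Pool-based active learning with uncertainty sampling.

    pool     -- list of unlabeled 1-D instances
    oracle   -- callable returning the 0/1 label of an instance
    seed_idx -- initially labeled indices (must cover both classes)
    budget   -- number of additional oracle queries allowed
    """
    labels = {i: oracle(pool[i]) for i in seed_idx}
    for _ in range(budget):
        idx = list(labels)
        t = fit_threshold([pool[i] for i in idx], [labels[i] for i in idx])
        unlabeled = [i for i in range(len(pool)) if i not in labels]
        if not unlabeled:
            break
        # Query the unlabeled point nearest the current boundary.
        q = min(unlabeled, key=lambda i: abs(pool[i] - t))
        labels[q] = oracle(pool[q])
    idx = list(labels)
    t = fit_threshold([pool[i] for i in idx], [labels[i] for i in idx])
    return t, len(labels)


# Hypothetical setup: 101 evenly spaced points, true boundary at 0.615.
pool = [i / 100 for i in range(101)]
threshold, n_labels = active_learn(pool, lambda v: int(v > 0.615), [0, 100], budget=10)
print(threshold, n_labels)
```

In this toy setting the loop behaves like a binary search over the pool, which is why a dozen labels suffice where random sampling would typically need many more.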
Computational Challenge
• True decision functions lie on non-linear, high-dimensional manifolds; only simplified functional forms (e.g. decision trees, hyperplanes) can be tractably explored today
• Requires global optimization and a shared model, 3-5 orders of magnitude beyond current workstations
• Non-Euclidean manifolds
• Optimal cost-sensitive sampling requires full model sharing (clouds are not the best computational model)
$d(x_i, x_j) = \frac{1}{\rho} \ln\Big(1 + \min_{p \in P_{ij}} \sum_{k=1}^{|p|-1} \big(e^{\rho \lVert p_k - p_{k+1} \rVert} - 1\big)\Big)$
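A path-based manifold distance of this kind can be computed with a shortest-path search. The sketch below assumes the distance has the form (1/rho) * ln(1 + min over paths p from x_i to x_j of the sum of exp(rho * ||p_k - p_{k+1}||) - 1), where the density parameter `rho`, the use of Euclidean hop lengths, and the complete graph over the data points are all assumptions made for illustration.

```python
import heapq
import math


def manifold_distance(points, i, j, rho=1.0):
    """Path-based distance between points[i] and points[j].

    Each hop (a, b) costs exp(rho * ||a - b||) - 1, so long hops are
    penalized exponentially and paths through dense regions win.
    Dijkstra applies because every hop cost is non-negative.
    """
    def hop_cost(a, b):
        return math.exp(rho * math.dist(points[a], points[b])) - 1.0

    best = {i: 0.0}
    heap = [(0.0, i)]
    done = set()
    while heap:
        cost, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == j:
            break
        for v in range(len(points)):
            if v in done:
                continue
            c = cost + hop_cost(u, v)
            if c < best.get(v, math.inf):
                best[v] = c
                heapq.heappush(heap, (c, v))
    return math.log1p(best[j]) / rho


# Three collinear points: stepping through the middle point is cheaper
# than the direct hop, so the distance falls below the Euclidean 2.0.
line = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
print(manifold_distance(line, 0, 2))
```

With only two points the expression collapses to the ordinary Euclidean distance, which gives a quick sanity check on any implementation.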
Predicting Quaternary Protein Folds by Structural Homology & First Principles
Triple beta-spirals [van Raaij et al. Nature 1999]
Virus fibers in adenovirus, reovirus and PRD1
Double barrel trimer [Benson et al, 2004]
Coat protein of adenovirus, PRD1, STIV, PBCV
Linked Segmentation Conditional Random Fields [Liu & Carbonell]
• Goal: predict how a protein complex will fold
• Nodes: secondary protein structures and/or simple folds
• Edges: local interactions and long-range inter-chain and intra-chain interactions
• L-SCRF: the conditional probability of y given x is defined as
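$P(y_1, \dots, y_R \mid x_1, \dots, x_R) = \frac{1}{Z} \prod_{y_{i,j} \in V_G} \exp\Big(\sum_k \lambda_k f_k(x_i, y_{i,j})\Big) \prod_{\langle y_{i,j}, y_{a,b} \rangle \in E_G} \exp\Big(\sum_l \mu_l g_l(x_i, x_a, y_{i,j}, y_{a,b})\Big)$
Joint labels: $y_1, \dots, y_R$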
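To make the normalized product of node and edge potentials concrete, here is a minimal sketch that evaluates such a conditional probability by brute-force enumeration on a three-node graph. The graph, the observations, and the `node_score` and `edge_score` functions (standing in for the weighted feature sums) are hypothetical; enumerating every joint labeling is feasible only at toy scale, which is the computational point of these slides.

```python
import itertools
import math


def crf_probability(y, x, nodes, edges, node_score, edge_score, labels):
    """P(y | x) = exp(score(y, x)) / Z for a pairwise conditional model.

    score sums a node potential over every node and an edge potential
    over every edge; Z sums exp(score) over all joint labelings.
    """
    def score(assign):
        s = sum(node_score(x[v], assign[v]) for v in nodes)
        s += sum(edge_score(x[a], x[b], assign[a], assign[b]) for a, b in edges)
        return s

    z = sum(
        math.exp(score(dict(zip(nodes, combo))))
        for combo in itertools.product(labels, repeat=len(nodes))
    )
    return math.exp(score(y)) / z


# Hypothetical toy graph: three structural segments, two interactions.
nodes = ["s1", "s2", "s3"]
edges = [("s1", "s2"), ("s1", "s3")]
labels = (0, 1)
x = {"s1": 0.8, "s2": -0.3, "s3": 0.5}


def node_score(xv, yv):
    # Reward label 1 for positive observations, label 0 otherwise.
    return 1.0 if (xv > 0) == (yv == 1) else 0.0


def edge_score(xa, xb, ya, yb):
    # Mildly reward linked segments that share a label.
    return 0.5 if ya == yb else 0.0


p = crf_probability({"s1": 1, "s2": 0, "s3": 1}, x, nodes, edges,
                    node_score, edge_score, labels)
print(p)
```

The partition sum visits |labels|^n labelings, so the cost grows exponentially with the number of nodes unless the graph structure permits dynamic programming or sampling.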
Computational Challenges
• Classification: $y^* = \arg\max_{Y} \sum_{c \in C_G} \sum_{k=1}^{K} \lambda_k f_k(x, Y_c)$
• Training: learn the model parameters λ
  • Minimize the regularized negative log-likelihood, i.e. maximize $L = \sum_{c \in C_G} \sum_{k=1}^{K} \lambda_k f_k(x, y_c) - \log Z - \Omega(\lVert \lambda \rVert^2)$
  • Iterative search algorithms seek the direction in which the empirical feature values agree with their expectation: $\frac{\partial L}{\partial \lambda_k} = \sum_{c \in C_G} \big( f_k(x, y_c) - E_{p(y \mid x)}[f_k(x, y_c)] \big) - \frac{\partial \Omega(\lVert \lambda \rVert^2)}{\partial \lambda_k} = 0$
• Complex graphs result in huge computational complexity
• Ideal case: co-train a multiverse of models
  • Exploit large common substructures
  • Immediately propagate constraints among variants
  • Requires complex computation on co-resident models
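The training condition on this slide says that, at the optimum, empirical feature values match their model expectations up to the regularizer. The sketch below computes a regularized conditional log-likelihood and its gradient for a tiny chain model by brute force. The two global features, the binary labels, and the choice of an L2 penalty with variance `sigma2` for the regularizer are assumptions made for illustration, not details from the talk.

```python
import itertools
import math


def features(x, y):
    """Hypothetical global features summed over the chain:
    F_0 counts positions where the label equals the observation,
    F_1 counts adjacent positions that share a label."""
    agree = sum(1.0 for xi, yi in zip(x, y) if xi == yi)
    smooth = sum(1.0 for a, b in zip(y, y[1:]) if a == b)
    return [agree, smooth]


def objective_and_grad(lam, x, y, sigma2=10.0):
    """Return (L, dL/dlam) with
    L = sum_k lam_k F_k(x, y) - log Z - ||lam||^2 / (2 * sigma2)."""
    labelings = list(itertools.product((0, 1), repeat=len(x)))
    feats = [features(x, yy) for yy in labelings]
    scores = [sum(l * f for l, f in zip(lam, fv)) for fv in feats]
    log_z = math.log(sum(math.exp(s) for s in scores))
    probs = [math.exp(s - log_z) for s in scores]
    # Model expectation of each feature under p(y | x).
    expected = [
        sum(p * fv[k] for p, fv in zip(probs, feats))
        for k in range(len(lam))
    ]
    empirical = features(x, y)
    obj = (sum(l * f for l, f in zip(lam, empirical)) - log_z
           - sum(l * l for l in lam) / (2.0 * sigma2))
    # Empirical minus expected feature value, minus the penalty gradient.
    grad = [empirical[k] - expected[k] - lam[k] / sigma2
            for k in range(len(lam))]
    return obj, grad


lam = [0.3, -0.2]
x_obs = (1, 0, 1, 1)
y_obs = (1, 0, 0, 1)
obj, grad = objective_and_grad(lam, x_obs, y_obs)
print(obj, grad)
```

Computing the expectation is the expensive step: here it enumerates all 2^n labelings, and on complex graphs it has to be approximated, which is where the computational burden noted on the slide comes from.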
Learning Protein Interaction Networks Intra- and Inter-Organism [Qi, Klein-Seetharaman, Tastan, Carbonell]
[Figure: PPI network overview. Panel labels: Pairwise Interactions; Protein Complex; Pathway; Domain/Motif Interactions; Function Implication (Func A, Func ?). Network labels: Human-PPI (Revise 08); HIV-Human PPI (Revise). Publication labels: PSB 05; PROTEINS 06; BMC Bioinfo 07; CCR 08; ISMB 08; Genome Biology 08; (Preparation).]
HIV-Human Protein Interactions
HIV-1 depends on the cellular machinery in every aspect of its life cycle.
[Figure: HIV-1 life cycle stages: Fusion; Reverse transcription; Transcription; Budding; Maturation. Source: Peterlin and Trono, Nature Rev. Immunol. 2003.]
Computational Challenges in Inducing the Interactome
• Degree distribution / hub analysis / pair-wise coupling checking
• Graph module analysis (from bi-clustering study)
• Protein-family based graph patterns (receptors / subclasses / ligands)
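As a small illustration of the first analysis named on this slide, the sketch below derives a degree distribution and a hub list from an interaction edge list. The protein names are placeholders and the `hub_min_degree` cutoff is a hypothetical parameter; real hub definitions vary between studies.

```python
from collections import Counter


def degree_analysis(edges, hub_min_degree):
    """Return (degree per protein, degree distribution, hub list) for an
    undirected interaction network given as (protein_a, protein_b) pairs."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    # Number of proteins observed at each degree.
    distribution = Counter(degree.values())
    hubs = sorted(p for p, d in degree.items() if d >= hub_min_degree)
    return degree, distribution, hubs


# Placeholder network, not real interaction data.
toy_edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3")]
degree, distribution, hubs = degree_analysis(toy_edges, hub_min_degree=3)
print(dict(degree), dict(distribution), hubs)
```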
• O(10^6) different proteins
• O(10^4) proteins in the largest network induced to date (at right)
• Want to learn interactions from induced structural fold models (previous slides)
• Requires O(10^(2+3)) more memory and computation [100X for the full interactome, 1000X for a high-fidelity model]
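The exponent in the last bullet can be read as the product of the two bracketed factors. A worked version follows, assuming the 100X factor is simply the ratio of the protein counts in the first two bullets (an interpretation, since the slide does not spell it out):

```latex
\frac{10^{6}}{10^{4}} = 10^{2} \quad \text{(full interactome)}, \qquad
10^{2} \times 10^{3} = 10^{2+3} = 10^{5} \quad \text{(full interactome with a high-fidelity model)}
```

Under that reading, the requirement sits at the upper end of the 3-5 orders of magnitude quoted on the earlier Computational Challenge slide.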