January, 2009 Jaime Carbonell et al. Carnegie Mellon University jgc@cs.cmu.edu Data-Intensive Scalability in Machine Learning and Computational Proteomics


Page 1: Data-Intensive Scalability in Machine Learning and Computational Proteomics

Jaime Carbonell et al.
Carnegie Mellon University
jgc@cs.cmu.edu

January, 2009

Page 2: Active and Proactive Learning

January, 2009 © 2009, Jaime G. Carbonell

Training data: $\{x_i, y_i\}_{i=1,\dots,k} \cup \{x_i\}_{i=k+1,\dots,n}$, with labeling oracle $O: x_i \to y_i$

Objective: learn the decision function with minimal training (sampling)

Functional space: $\{f_j \times p_l\}$

Fitness criterion (a.k.a. loss function):

$$\arg\min_{j,l} \sum_i \left| y_i - f_{j,p_l}(x_i) \right| + a(f_j, p_l)$$

Sampling strategy:

$$\arg\min_{\{x_1,\dots,x_k\} \subseteq \{x_1,\dots,x_n\}} L\big(\hat{f}(x_{test}), y_{test}\big) \,\Big|\, \{x_1,\dots,x_k\}$$
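As a rough illustration of why the sampling strategy matters, the sketch below runs pool-based active learning on a toy problem that is not from the slides: points on a line, a hidden threshold, and an oracle that labels one point per query. The function name `active_learn_threshold`, the threshold value, and the bisection-style query rule are all assumptions chosen for brevity; real samplers score candidates with a learned model.

```python
from typing import Callable, List, Tuple


def active_learn_threshold(pool: List[float],
                           oracle: Callable[[float], int],
                           budget: int) -> Tuple[float, int]:
    """Estimate a 1-D decision boundary with at most `budget` oracle calls.

    Assumes (for illustration only) that the true labels are 0 below some
    hidden threshold and 1 at or above it.
    """
    xs = sorted(pool)
    lo, hi = 0, len(xs)          # index of the first positive lies in [lo, hi]
    used = 0
    while lo < hi and used < budget:
        mid = (lo + hi) // 2     # middle of the region still consistent with the labels
        if oracle(xs[mid]) == 1:
            hi = mid             # boundary is at or left of xs[mid]
        else:
            lo = mid + 1         # boundary is right of xs[mid]
        used += 1
    # Place the boundary in the middle of the remaining uncertainty interval.
    left = xs[max(lo - 1, 0)]
    right = xs[min(hi, len(xs) - 1)]
    return (left + right) / 2.0, used


HIDDEN_THRESHOLD = 0.365         # hypothetical value, unknown to the learner


def oracle(x: float) -> int:
    """Plays the role of O: x_i -> y_i."""
    return int(x >= HIDDEN_THRESHOLD)


pool = [i / 100.0 for i in range(100)]
estimate, labels_used = active_learn_threshold(pool, oracle, budget=10)
print(estimate, labels_used)
```

On this pool of 100 points the learner needs only a handful of labels to localize the boundary, whereas labeling points in random order would typically need many more to reach the same precision.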

Page 3: Computational Challenge

• True decision functions lie in non-linear, high-dimensional manifolds. Only simplified functional forms (e.g. decision trees, hyperplanes) can be tractably explored today.

• Requires global optimization and a shared model, 3-5 orders of magnitude beyond current workstations.

• Non-Euclidean manifolds:

$$d(x_i, x_j) = \frac{1}{\rho} \ln\Big(1 + \min_{p \in P_{ij}} \sum_{k=1}^{|p|-1} \big(e^{\rho \, \| p_k - p_{k+1} \|} - 1\big)\Big)$$

• Optimal cost-sensitive sampling requires full model sharing (clouds are not the best computational model).
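The manifold distance on this slide takes a minimum over paths between two points, which can be computed as a shortest-path problem. The sketch below is one way to do that under stated assumptions: the scale parameter is called `rho`, segment lengths are Euclidean, every pair of points is connected, and the function name `density_sensitive_distance` is hypothetical. Each edge is weighted by `exp(rho * length) - 1`, so Dijkstra's algorithm yields the inner minimum directly.

```python
import heapq
import math
from typing import List, Sequence


def density_sensitive_distance(points: Sequence[Sequence[float]],
                               i: int, j: int, rho: float = 1.0) -> float:
    """(1/rho) * ln(1 + min over paths of sum(exp(rho * segment) - 1))."""
    n = len(points)

    def edge(a: int, b: int) -> float:
        # Non-negative edge weight, so Dijkstra's algorithm applies.
        return math.exp(rho * math.dist(points[a], points[b])) - 1.0

    best: List[float] = [math.inf] * n
    best[i] = 0.0
    heap = [(0.0, i)]
    while heap:
        cost, u = heapq.heappop(heap)
        if cost > best[u]:
            continue             # stale queue entry
        if u == j:
            break                # cheapest path to the target is final
        for v in range(n):
            if v != u:
                cand = cost + edge(u, v)
                if cand < best[v]:
                    best[v] = cand
                    heapq.heappush(heap, (cand, v))
    return math.log(1.0 + best[j]) / rho


# Four points on a line: hopping through the two middle points is much
# cheaper than the single long jump, so the distance shrinks below 3.
chain = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
print(density_sensitive_distance(chain, 0, 3, rho=2.0))
```

With only two points there is a single path, and the formula collapses to the ordinary Euclidean distance; the exponential weighting only changes the result when intermediate points offer a chain of short hops through a dense region.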

Page 4: Predicting Quaternary Protein Folds by Structural Homology & First Principles

Triple beta-spirals [van Raaij et al. Nature 1999]

Virus fibers in adenovirus, reovirus and PRD1

Double barrel trimer [Benson et al, 2004]

Coat protein of adenovirus, PRD1, STIV, PBCV

Page 5: Linked Segmentation Conditional Random Fields [Liu & Carbonell]

• Goal: predict how a protein complex will fold
• Nodes: secondary protein structures and/or simple folds
• Edges: local interactions and long-range inter-chain and intra-chain interactions
• L-SCRF: the conditional probability of the joint labels $\mathbf{y}$ given $\mathbf{x}$ is defined as

$$P(\mathbf{y}_1,\dots,\mathbf{y}_R \mid \mathbf{x}_1,\dots,\mathbf{x}_R) = \frac{1}{Z} \exp\Big(\sum_{y_{i,j} \in V_G} \sum_k \lambda_k f_k(\mathbf{x}_i, y_{i,j})\Big) \exp\Big(\sum_{\langle y_{i,j},\, y_{a,b} \rangle \in E_G} \sum_l \lambda_l g_l(\mathbf{x}_i, \mathbf{x}_a, y_{i,j}, y_{a,b})\Big)$$
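To make the structure of this conditional probability concrete, the sketch below evaluates a log-linear model of the same shape on a tiny hand-made graph, computing the normalizer Z by brute-force enumeration. Everything specific here is an assumption for illustration: the three nodes, the two labels, the scoring functions `node_score` and `edge_score` (standing in for the weighted feature sums over f and g), and their weights. Enumeration is only feasible at toy size, which is the scalability problem the slides go on to discuss.

```python
import itertools
import math
from typing import Callable, Sequence, Tuple

LABELS = ("helix", "strand")                 # hypothetical node states
EDGES = [(0, 1), (1, 2), (0, 2)]             # (0, 2) plays a long-range link
HINTS = ("strand", "strand", "helix")        # made-up per-node evidence


def node_score(v: int, label: str) -> float:
    # Stands in for the weighted sum of node features f_k.
    return 1.0 if label == HINTS[v] else 0.0


def edge_score(a: int, b: int, la: str, lb: str) -> float:
    # Stands in for the weighted sum of edge features g_l: linked nodes
    # get a small bonus for agreeing.
    return 0.3 if la == lb else 0.0


def total_score(assign: Sequence[str]) -> float:
    s = sum(node_score(v, assign[v]) for v in range(len(assign)))
    s += sum(edge_score(a, b, assign[a], assign[b]) for a, b in EDGES)
    return s


def probability(assign: Tuple[str, ...]) -> float:
    """P(y | x) = exp(score(y)) / Z, with Z summed over every labeling."""
    z = sum(math.exp(total_score(cand))
            for cand in itertools.product(LABELS, repeat=len(assign)))
    return math.exp(total_score(assign)) / z


all_assignments = list(itertools.product(LABELS, repeat=3))
best = max(all_assignments, key=probability)
print(best, probability(best))
```

The number of joint labelings grows exponentially with the number of nodes, so practical inference on graphs of this kind has to rely on approximation or on exploiting graph structure rather than enumeration.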

Page 6: Computational Challenges

• Training: learn the model parameters $\lambda$
  • Minimize the regularized negative log loss:

$$L = \sum_{c \in C_G} \sum_{k=1}^{K} \lambda_k f_k(\mathbf{x}, \mathbf{y}_c) - \log Z + \Omega(\|\lambda\|^2)$$

  • Iterative search algorithms seek the direction in which the empirical feature values agree with their expectation:

$$\frac{\partial L}{\partial \lambda_k} = \sum_{c \in C_G} \Big( f_k(\mathbf{x}, \mathbf{y}_c) - E_{p(\mathbf{y} \mid \mathbf{x})}\big[ f_k(\mathbf{x}, \mathbf{Y}_c) \big] \Big) + \frac{\partial\, \Omega(\|\lambda\|^2)}{\partial \lambda_k} = 0$$

• Classification:

$$\mathbf{y}^* = \arg\max_{\mathbf{Y}} \sum_{c \in C_G} \sum_{k=1}^{K} \lambda_k f_k(\mathbf{x}, \mathbf{Y}_c)$$

• Complex graphs result in huge computational complexity

• Ideal case: co-train a multiverse of models
  • Exploit large common substructures
  • Immediately propagate constraints among variants
  • Requires complex computation on co-resident models
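The training condition on this slide says that, at the optimum, each feature's empirical value matches its expected value under the model, up to the regularization term. The sketch below shows that gradient on a deliberately simplified stand-in: a flat two-label log-linear classifier rather than a graph-structured model, with a Gaussian prior as the regularizer and plain gradient ascent as the search. The feature functions, the data, `sigma2`, the step size, and the iteration count are all assumptions.

```python
import math
from typing import List, Sequence, Tuple

LABELS = (0, 1)


def features(x: float, y: int) -> List[float]:
    # Hypothetical feature functions f_k(x, y): a bias and the input,
    # both active only for label 1.
    return [1.0 if y == 1 else 0.0, x if y == 1 else 0.0]


def posterior(lam: Sequence[float], x: float) -> List[float]:
    """p(y | x) for each label, proportional to exp(sum_k lam_k f_k(x, y))."""
    scores = [sum(l * f for l, f in zip(lam, features(x, y))) for y in LABELS]
    top = max(scores)                        # subtract max for stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def gradient(lam: Sequence[float], data: Sequence[Tuple[float, int]],
             sigma2: float = 10.0) -> List[float]:
    """Empirical minus expected feature values, plus the prior's gradient."""
    grad = [-l / sigma2 for l in lam]        # assumed Gaussian-prior term
    for x, y in data:
        p = posterior(lam, x)
        empirical = features(x, y)
        for k in range(len(lam)):
            expected = sum(p[yy] * features(x, yy)[k] for yy in LABELS)
            grad[k] += empirical[k] - expected
    return grad


def train(data: Sequence[Tuple[float, int]], steps: int = 500,
          lr: float = 0.1) -> List[float]:
    lam = [0.0, 0.0]
    for _ in range(steps):
        lam = [l + lr * g for l, g in zip(lam, gradient(lam, data))]
    return lam


data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
lam = train(data)
print(lam, gradient(lam, data))
```

In this flat model the expectation is a two-term sum; in a graph-structured model the same expectation ranges over joint labelings of each clique, which is where the computational cost named on the slide comes from.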

Page 7: Learning Protein Interaction Networks Intra- and Inter-Organism [Qi, Klein-Seetharaman, Tastan, Carbonell]

[Figure: overview of PPI network learning, with panels labeled Pairwise Interactions, Domain/Motif Interactions, Protein Complex, Pathway, PPI Network, and Function Implication (Func A, Func ?). Publication labels on the figure: PSB 05, PROTEINS 06, BMC Bioinfo 07, CCR 08, ISMB 08, Genome Biology 08, Human-PPI (Revise 08), HIV-Human PPI (Revise), and one item marked (Preparation).]

Page 8: HIV-Human Protein Interactions

HIV-1 depends on the cellular machinery in every aspect of its life cycle.

[Figure: HIV-1 life cycle, with stages labeled Fusion, Reverse transcription, Transcription, Budding, and Maturation. Source: Peterlin and Trono, Nature Reviews Immunology, 2003.]

Page 9: Computational Challenges in Inducing the Interactome

[Figure: largest induced network to date. Analyses listed: degree distribution / hub analysis / pair-wise coupling checking; graph modules analysis (from bi-clustering study); protein-family based graph patterns (receptors / subclasses / ligands).]

• $O(10^6)$ different proteins

• $O(10^4)$ proteins in the largest network induced to date (shown in the figure)

• Want to learn interactions from induced structural fold models (previous slides)

• Requires $O(10^{2+3})$ times the memory and computation [100X for the full interactome, 1000X for a high-fidelity model]
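The scale factors in the last bullet can be checked with a few lines of arithmetic. The sketch below takes the slide's orders of magnitude at face value; the variable names and the `pair_count` helper are illustrative, and the candidate-pair totals are an aside the slide does not state.

```python
def pair_count(n: int) -> int:
    """Number of unordered protein pairs among n proteins."""
    return n * (n - 1) // 2


proteins_full = 10**6        # O(10^6) different proteins
proteins_largest = 10**4     # O(10^4) in the largest induced network

interactome_factor = proteins_full // proteins_largest   # the slide's 100X
fidelity_factor = 10**3                                   # the slide's 1000X
combined_factor = interactome_factor * fidelity_factor    # 10^(2+3)

print(interactome_factor, combined_factor)
# Aside (not on the slide): candidate pairs grow quadratically in n.
print(pair_count(proteins_largest), pair_count(proteins_full))
```

The 100X figure matches the ratio of protein counts; if cost instead scaled with the number of candidate pairs, the gap between the two network sizes would be closer to four orders of magnitude.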