Statistical Machine Learning and Computational Biology
Michael I. Jordan, University of California, Berkeley
November 5, 2007
Statistical Modeling in Biology
• Motivated by underlying stochastic phenomena: thermodynamics, recombination, mutation, environment
• Motivated by our ignorance: evolution of molecular function, protein folding, molecular concentrations, incomplete fossil record
• Motivated by the need to fuse disparate sources of data
Outline
• Graphical models: phylogenomics
• Nonparametric Bayesian models: protein backbone modeling, multi-population haplotypes
• Sparse regression: protein folding
Part 1: Graphical Models
Probabilistic Graphical Models
• Given a graph G = (V, E), where each node v ∈ V is associated with a random variable X_v
• The joint distribution on (X1, X2, …, XN) factors according to the “parent-of” relation defined by the edges E:
p(x1, x2, x3, x4, x5, x6) = p(x1) p(x2 | x1) p(x3 | x2) p(x4 | x1) p(x5 | x4) p(x6 | x2, x5)
[Figure: directed graph on nodes X1, …, X6, each node annotated with its conditional probability]
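The six-node factorization above can be checked numerically. The following is a minimal sketch, assuming binary variables and made-up conditional probability tables (the numbers and the names `P_X1` … `P_X6` are illustrative, not from the talk). It multiplies the six local factors and confirms that the resulting joint sums to one.

```python
from itertools import product

# Hypothetical conditional probability tables for binary variables.
# Each entry is P(child = 1 | parents); the numbers are illustrative only.
P_X1 = 0.6
P_X2 = {0: 0.3, 1: 0.8}             # keyed by x1
P_X3 = {0: 0.5, 1: 0.1}             # keyed by x2
P_X4 = {0: 0.2, 1: 0.7}             # keyed by x1
P_X5 = {0: 0.4, 1: 0.9}             # keyed by x4
P_X6 = {(0, 0): 0.1, (0, 1): 0.5,   # keyed by (x2, x5)
        (1, 0): 0.6, (1, 1): 0.95}


def bernoulli(p_one, value):
    """Probability that a binary variable takes `value` when P(1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one


def joint(x1, x2, x3, x4, x5, x6):
    """p(x1, ..., x6) as the product of the six local conditionals."""
    return (bernoulli(P_X1, x1)
            * bernoulli(P_X2[x1], x2)
            * bernoulli(P_X3[x2], x3)
            * bernoulli(P_X4[x1], x4)
            * bernoulli(P_X5[x4], x5)
            * bernoulli(P_X6[(x2, x5)], x6))


def marginal_x6():
    """P(X6 = 1), by brute-force summation over x1..x5 (marginalization)."""
    return sum(joint(*xs, 1) for xs in product((0, 1), repeat=5))


total = sum(joint(*xs) for xs in product((0, 1), repeat=6))
print(f"sum of joint = {total:.6f}, P(X6 = 1) = {marginal_x6():.4f}")
```

Brute-force summation is exponential in the number of variables; it is used here only because the example is tiny.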
Inference
• Conditioning
• Marginalization
• Posterior probabilities
Inference Algorithms
• Exact algorithms: sum-product, junction tree
• Sampling algorithms: Metropolis-Hastings, Gibbs sampling
• Variational algorithms: mean-field; Bethe, Kikuchi; convex relaxations
Hidden Markov Models
• Widely used in computational biology to parse strings of various kinds (nucleotides, markers, amino acids)
• Sum-product algorithm yields [formula missing in source]
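On a chain-structured model such as an HMM, the sum-product algorithm reduces to the forward (and backward) recursions. The sketch below shows only the forward pass, which computes the probability of an observed string. The two-state nucleotide model (`INIT`, `TRANS`, `EMIT`) is a hypothetical toy, not a model from the talk.

```python
def forward(obs, init, trans, emit):
    """Return P(obs) under an HMM via the forward (sum-product) recursion.

    init[s]     : P(z_1 = s)
    trans[s][t] : P(z_{k+1} = t | z_k = s)
    emit[s][o]  : P(x_k = o | z_k = s)
    """
    states = list(init)
    # alpha[s] = P(x_1..x_k, z_k = s), updated one observation at a time.
    alpha = {s: init[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {t: emit[t][o] * sum(alpha[s] * trans[s][t] for s in states)
                 for t in states}
    return sum(alpha.values())


# Hypothetical two-state model: an AT-rich and a GC-rich hidden state.
INIT = {"AT": 0.5, "GC": 0.5}
TRANS = {"AT": {"AT": 0.9, "GC": 0.1},
         "GC": {"AT": 0.2, "GC": 0.8}}
EMIT = {"AT": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
        "GC": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}}

print(forward("ACGGT", INIT, TRANS, EMIT))
```

The cost is linear in the string length and quadratic in the number of states. For long sequences one would normally work in log space or rescale `alpha` to avoid underflow.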
Hidden Markov Model Variations
Phylogenies
• The shaded nodes represent the observed nucleotides at a given site for a set of organisms
• Site independence model (note the plate)
• The unshaded nodes represent putative ancestral nucleotides
• Computing the likelihood involves summing over the unshaded nodes
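The sum over unshaded (ancestral) nodes can be organized as a recursion from the leaves to the root. The slides do not name an algorithm; the sketch below uses the standard pruning recursion on a toy tree. The substitution model (`substitution_prob`, the same on every branch), the uniform root prior, and the species names are all assumptions made for illustration.

```python
import math

NUCLEOTIDES = "ACGT"


def substitution_prob(parent, child, p_same=0.9):
    """Hypothetical per-branch model: keep the nucleotide with probability
    p_same, otherwise switch to one of the other three uniformly."""
    return p_same if parent == child else (1.0 - p_same) / 3.0


def partial_likelihood(node, site):
    """Return {a: P(observed leaves below `node` | node has nucleotide a)}."""
    if isinstance(node, str):                  # leaf (shaded node): observed
        return {a: 1.0 if a == site[node] else 0.0 for a in NUCLEOTIDES}
    result = {a: 1.0 for a in NUCLEOTIDES}
    for child in node:                         # internal (unshaded) node
        below = partial_likelihood(child, site)
        for a in NUCLEOTIDES:
            result[a] *= sum(substitution_prob(a, b) * below[b]
                             for b in NUCLEOTIDES)
    return result


def site_likelihood(tree, site):
    """Sum over all ancestral assignments, with a uniform root prior."""
    root = partial_likelihood(tree, site)
    return sum(0.25 * root[a] for a in NUCLEOTIDES)


def log_likelihood(tree, sites):
    """Site independence: the alignment log-likelihood is a sum over sites."""
    return sum(math.log(site_likelihood(tree, site)) for site in sites)


TREE = (("human", "chimp"), "mouse")           # hypothetical topology
SITES = [{"human": "A", "chimp": "A", "mouse": "G"},
         {"human": "C", "chimp": "C", "mouse": "C"}]
print(log_likelihood(TREE, SITES))
```

Each internal node is visited once per site, so the cost grows linearly with the number of species rather than exponentially with the number of ancestral nodes.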
Hidden Markov Phylogeny
• This yields a gene finder that exploits evolutionary constraints
  • Evolutionary rate is state-dependent (edges from state to nodes in phylogeny are omitted for simplicity)
• Based on sequence data from 12-15 primate species, we obtain a nucleotide sensitivity of 100%, with a specificity of 89%
  • GENSCAN yields a sensitivity of 45%, with a specificity of 34%
Annotating new genomes
>Q8X1T6 (hypothetical protein) MCPPNTPYQSQWHAFLHSLPKCEHHVHLEGCLEPPLIFSMARKNNVSLPSPSSNPAYTSVETLSKRYGHFSSLDDFLSFYFIGMTVLKTQSDFAELAWTYFKRAHAEGVHHTEVFFDPQVHMERGLEYRVIVDGYVDGCKRAEKELGISTRLIMCFLKHLPLESAQRLYDTALNEGDLGLDGRNPVIHGLGASSSEVGPPKDLFRPIYLGAKEKSINLTAHAGEEGDASYIAAALDMGATRIDHGIRLGEDPELMERVAREEVLLTVCPVSNLQLKCVKSVAEVPIRKFLDAGVRFSINSDDPAYFGAYILECYCAVQEAFNLSVADWRLIAENGVKGSWIGEERKNELLWRIDECVKRF
What molecular function does protein Q8X1T6 have?
Species: Aspergillus nidulans (Fungal organism)
Images courtesy of Broad Institute, MIT
Annotation Transfer
BLAST Search: Q8X1T6 (Aspergillus nidulans)
Species Name                 Molecular Function               Score  E-value
Schizosaccharomyces pombe    adenosine deaminase              390    e-107
Gibberella zeae              hypothetical protein FG01567.1   345    7e-94
Saccharomyces cerevisiae     adenine deaminase                308    1e-82
Wolinella succinogenes       putative adenosine deaminase     268    1e-70
Rhodospirillum rubrum        adenosine deaminase              266    6e-70
Azotobacter vinelandii       adenosine deaminase              260    4e-68
Streptomyces coelicolor      probable adenosine deaminase     254    2e-68
Caulobacter crescentus CB1   adenosine deaminase              253    5e-66
Streptomyces avermitilis     putative adenosine deaminase     251    2e-65
Ralstonia solanacearum       adenosine deaminase              251    2e-65
environmental sequence       unknown                          246    5e-64
Pseudomonas aeruginosa       probable adenosine deaminase     245    1e-63
Pseudomonas aeruginosa       adenosine deaminase              245    1e-63
environmental sequence       unknown                          244    3e-63
Pseudomonas fluorescens      adenosine deaminase              243    7e-63
Pseudomonas putida KT2440    adenosine deaminase              243    7e-63
Methodology to System: SIFTER
[Figure: SIFTER pipeline; panel labels: set of homologous proteins (Pfam), gene tree (MP), species tree, Gene Ontology annotations (adenosine, adenine), SIFTER]
Functional diversity problem
• 1887 Pfam-A families with more than two experimentally characterized functions
[Figure: families grouped by number of functions: 2-5, 6-10, 11-20, 21-50, >51 different functions]
Available methods for comparison
• Sequence similarity methods
  • BLAST [Altschul 1990]: sequence similarity search, transfer annotation from sequence with most significant similarity
  • Runs against largest curated protein database in world
  • GOtcha [Martin 2004]: BLAST search on seven genomes with GO functional annotations
  • GOtcha runs use all available annotations
  • GOtcha-exp runs use only available experimental annotations
• Sequence similarity plus bootstrap orthology
  • Orthostrapper [Storm 2002]: transfer annotation when query protein is in statistically supported orthologous cluster with annotated protein
AMP/adenosine deaminase
• 251 member proteins in Pfam v. 18.0
• 13 proteins with experimental evidence in GOA
• 20 proteins with experimental annotations from manual literature search
• 129 proteins with electronic annotations from GOA
• Molecular function: remove amine group from base of substrate
• Alignment from Pfam family seed alignment
• Phylogeny built with PAUP* parsimony, BLOSUM50 matrix
Mouse adenosine deaminase, courtesy PDB
AMP/adenosine deaminase
SIFTER Errors
• Leave-one-out cross-validation: 93.9% accuracy (31 of 33)
• BLAST: 66.7% accuracy (22 of 33)
AMP/adenosine deaminase
Multifunction families: can choose numerical cutoff for posterior probability prediction using this type of plot
Note: x-axis is on log scale
Sulfotransferases: ROC curve
• SIFTER (no truncation): 70.0% accuracy (21 of 30)
• BLAST: 50.0% accuracy (15 of 30)
Note: x-axis is on log scale
Nudix Protein Family
• 3703 proteins in the family
• 97 proteins with molecular functions characterized
• 66 different candidate molecular functions
Nudix: SIFTER vs BLAST
• SIFTER truncation level 1: 47.4% accuracy (46 of 97)
• BLAST: 34.0% accuracy (33 of 97); 23.3% of terms at all in search results
Trade specificity for accuracy
• Leave-one-out cross-validation, truncation at 1: 47.4% accuracy (66 candidate functions)
• Leave-one-out cross-validation, truncation at 1,2: 78.4% accuracy (15 candidate functions)
Fungal genomes
[Figure: fungal phylogeny; clade labels: Archeascomycota, Basidiomycota, Hemiascomycota, Euascomycota, Zygomycota]
Work with Jason Stajich; Images courtesy of Broad Institute
Fungal Genomes Methods
• Gene finding in all 46 genomes
• hmmsearch for all 427,324 genes
• Aligned hits with hmmalign to 2,883 Pfam v. 20 families
• Built trees using PAUP* maximum parsimony for 2,883 Pfam v. 20 families; reconciled with Forester
• BLASTed each protein against Swiss-Prot/TrEMBL for exact match; used ID to search for GOA annotations
• Ran SIFTER with (a) experimental annotations only and (b) experimental and electronic annotations
SIFTER Predictions by Species
Part 2: Nonparametric Bayesian Models
Clustering
• There are many, many methodologies for clustering
• Heuristic methods: hierarchical clustering
• M-estimation: K-means, spectral clustering
• Model-based methods: finite mixture models, Dirichlet process mixture models
Nonparametric Bayesian Clustering
• Dirichlet process mixture models are a nonparametric Bayesian approach to clustering
• They have the major advantage that we don’t have to assume that we know the number of clusters a priori
Chinese Restaurant Process (CRP)
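• Customers sit down in a Chinese restaurant with an infinite number of tables
  • the first customer sits at the first table
  • the m-th subsequent customer sits at a table drawn from the following distribution:
      P(previously occupied table i) ∝ n_i
      P(next unoccupied table) ∝ α_0
    where n_i is the number of occupants of table i and α_0 > 0 is a concentration parameter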
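The seating rule can be simulated directly: an occupied table is chosen with probability proportional to its number of occupants, and a new table with probability proportional to a concentration parameter. The sketch below is a minimal simulation; the function name `sample_crp` and the parameter name `alpha` are choices made here, not notation from the slides.

```python
import random


def sample_crp(num_customers, alpha, rng):
    """Seat customers one at a time; return the list of table occupancies."""
    counts = []                                # counts[i] = occupants of table i
    for _ in range(num_customers):
        # Occupied table i has weight counts[i]; a new table has weight alpha.
        weights = counts + [alpha]
        table = rng.choices(range(len(weights)), weights=weights)[0]
        if table == len(counts):               # the "next unoccupied table"
            counts.append(1)
        else:
            counts[table] += 1
    return counts


rng = random.Random(0)
for alpha in (0.5, 2.0, 10.0):
    tables = sample_crp(500, alpha, rng)
    print(f"alpha = {alpha}: {len(tables)} tables for 500 customers")
```

Larger values of `alpha` tend to produce more tables, and the number of tables is not fixed in advance, which is the property that makes the process useful for clustering.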
The CRP and Mixture Models
• The customers around a table form a cluster
  • associate a mixture component with each table
  • the first customer at a table chooses that component's parameters from the prior
  • e.g., for Gaussian mixtures, choose a mean and covariance
• It turns out that the (marginal) distribution that this induces on the theta’s is exchangeable
Example: Mixture of Gaussians
Dirichlet Process
• Exchangeability implies an underlying stochastic process; that process is known as a Dirichlet process
Dirichlet Process Mixture Models
• Given observations x_1, …, x_n, we model each x_i with a latent factor θ_i:
    x_i | θ_i ~ F(θ_i),   θ_i | G ~ G
• We put a Dirichlet process prior on G:
    G ~ DP(α_0, G_0)
Connection to the Chinese Restaurant
• The marginal distribution on the theta’s obtained by marginalizing out the Dirichlet process is the Chinese restaurant process
• Let’s now consider how to build on these ideas and solve the multiple clustering problem
Multiple Clustering Problems
• In many statistical estimation problems, we have not one data analysis problem, but rather we have groups of related problems
• Naive approaches either treat the problems separately, lump them together, or merge in some ad hoc way; in statistics we have a better sense of how to proceed: shrinkage, empirical Bayes, hierarchical Bayes
• Does this multiple group problem arise in clustering? I’ll argue “yes!”
• If so, how do we “shrink” in clustering?
Multiple Data Analysis Problems
• Consider a set of data which is subdivided into groups, and where each group is characterized by a Gaussian distribution with unknown mean:
    x_ij ~ N(θ_i, σ²)   (observation j in group i)
• Maximum likelihood estimates of the group means θ_i are obtained independently
• This often isn’t what we want (on theoretical and practical grounds)
Hierarchical Bayesian Models
• Multiple Gaussian distributions linked by a shared hyperparameter
• Yields shrinkage estimators for the group means
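To make the shrinkage effect concrete, the sketch below assumes a specific two-level Gaussian model that the slides do not spell out: observations in group i are Normal(theta_i, sigma2), and the group means theta_i are Normal(mu, tau2). It treats mu, sigma2 and tau2 as known, which is a simplification. Under those assumptions, the posterior mean of each theta_i is a weighted average of the group's sample mean and the shared mean mu.

```python
from statistics import mean


def shrinkage_estimates(groups, mu, sigma2, tau2):
    """Posterior means of the group means theta_i.

    Assumed model (not spelled out in the slides):
        x_ij ~ Normal(theta_i, sigma2),   theta_i ~ Normal(mu, tau2).
    """
    estimates = []
    for xs in groups:
        n = len(xs)
        # Weight on the group's own data; grows with the group size n.
        weight = (n / sigma2) / (n / sigma2 + 1.0 / tau2)
        estimates.append(weight * mean(xs) + (1.0 - weight) * mu)
    return estimates


groups = [[1.0, 3.0], [10.0]]
print("maximum likelihood:", [mean(xs) for xs in groups])
print("shrinkage:         ", shrinkage_estimates(groups, mu=5.0, sigma2=1.0, tau2=1.0))
```

Small groups are pulled more strongly toward the shared mean than large ones, which is the behaviour the independent maximum likelihood estimates lack.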
Protein Backbone Modeling
• An important contribution to the energy of a protein structure is the set of angles linking neighboring amino acids
• For each amino acid, it turns out that two angles suffice; traditionally called φ and ψ
• A plot of φ and ψ angles across some ensemble of amino acids is called a Ramachandran plot
A Ramachandran Plot
• This can be (usefully) approached as a mixture modeling problem
• Doing so is much better than the “state-of-the-art,” in which the plot is binned into a three-by-three grid
Ramachandran Plots
• But that plot is an overlay of 400 different plots, one for each combination of 20 amino acids on the left and 20 amino acids on the right
• Shouldn’t we be treating this as a multiple clustering problem?
Haplotype Modeling
• A haplotype is the pattern of alleles along a single chromosome
• Data comes in the form of genotypes, which lose the information as to which allele is associated to which member of a pair of homologous chromosomes:
• Need to restore haplotypes from genotypes
• A genotype is well modeled as a mixture model, where a mixture component is a pair of haplotypes (the real difficulty is that we don’t know how many mixture components there are)
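The loss of phase information can be illustrated by enumerating the haplotype pairs that are consistent with a single genotype. The sketch assumes biallelic sites and encodes a genotype as the number of copies of allele 1 at each site (0, 1 or 2). That encoding and the function name are choices made here for illustration.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """List unordered haplotype pairs whose allele counts match `genotype`.

    genotype[i] is the number of copies of allele 1 at site i (0, 1 or 2).
    """
    per_site = []
    for g in genotype:
        if g == 0:
            per_site.append([(0, 0)])
        elif g == 2:
            per_site.append([(1, 1)])
        else:                                  # heterozygous: phase is unknown
            per_site.append([(0, 1), (1, 0)])
    pairs = set()
    for choice in product(*per_site):
        h1 = tuple(a for a, _ in choice)
        h2 = tuple(b for _, b in choice)
        pairs.add(tuple(sorted((h1, h2))))     # (h1, h2) and (h2, h1) coincide
    return sorted(pairs)


for pair in compatible_haplotype_pairs((1, 1, 0)):
    print(pair)
```

A genotype with k heterozygous sites (k ≥ 1) is consistent with 2^(k-1) unordered haplotype pairs, so the ambiguity grows quickly with the number of sites.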
Multiple Population Haplotype Modeling
• When we have multiple populations (e.g., ethnic groups) we have multiple mixture models
• How should we analyze these data (which are now available, e.g., from the HapMap project)?
• Analyze them separately? Lump them together?
Scenes, Objects, Parts and Features
Shared Parts
Hidden Markov Models
• An HMM is a discrete state space model
• The discrete state can be viewed as a cluster indicator
• We thus have a set of clustering problems, one for each value of the previous state (i.e., for each row of the transition matrix)
Solving the Multiple Clustering Problem
• It’s natural to take a hierarchical Bayesian approach
• It’s natural to take a nonparametric Bayesian approach in which the number of clusters is not known a priori
• How do we do this?
Hierarchical Bayesian Models
• Multiple Gaussian distributions linked by a shared hyperparameter
• Yields shrinkage estimators for the group means
Hierarchical DP Mixture Model?
• Let us try to model each group of data with a Dirichlet process mixture model
  • let the groups share an underlying hyperparameter
• But each group is generated independently
  • different groups cannot share the same components if the base measure is continuous
[Figure annotation: spikes do not match up]
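The mismatch can be seen in a small simulation. The sketch below seats each group with its own Chinese restaurant process and draws a parameter for each new table from a base distribution. With a continuous base (a standard normal here), two groups essentially never draw the same parameter. With a discrete base they can. The finite list of shared atoms is a simplified stand-in for a discrete shared base measure, not the full hierarchical construction, and all names are hypothetical.

```python
import random


def dp_atoms(num_points, alpha, base_sampler, rng):
    """Distinct component parameters used by one group, seated via a CRP."""
    counts, atoms = [], []
    for _ in range(num_points):
        weights = counts + [alpha]
        table = rng.choices(range(len(weights)), weights=weights)[0]
        if table == len(counts):       # new table: draw its parameter from the base
            counts.append(1)
            atoms.append(base_sampler(rng))
        else:
            counts[table] += 1
    return set(atoms)


def continuous_base(rng):
    """Continuous base distribution: repeated draws almost surely differ."""
    return rng.gauss(0.0, 1.0)


def make_discrete_base(rng, num_atoms=5):
    """Simplified discrete base: a few atoms drawn once and then shared."""
    shared = [rng.gauss(0.0, 1.0) for _ in range(num_atoms)]
    return lambda r: r.choice(shared)


rng = random.Random(0)
g1 = dp_atoms(200, 2.0, continuous_base, rng)
g2 = dp_atoms(200, 2.0, continuous_base, rng)
print("shared components, continuous base:", len(g1 & g2))

discrete_base = make_discrete_base(rng)
h1 = dp_atoms(200, 2.0, discrete_base, rng)
h2 = dp_atoms(200, 2.0, discrete_base, rng)
print("shared components, discrete base:  ", len(h1 & h2))
```

Only when the groups draw from a common discrete set of atoms can the "spikes" line up across groups.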
Hierarchical Dirichlet Process Mixtures
The Chinese Restaurant Franchise
HDP Model of Ramachandran Plots
• We would like to solve 400 different related clustering problems, one for each combination of 20 amino acids on the left and 20 amino acids on the right
HDP Model of Ramachandran Plots
Some HDP Success Stories
• New backbone model for Rosetta
• New method for multi-population haplotype phasing
• Solution to problem of choosing number of states in HMMs
• State-of-the-art method for statistical parsing
• Competitive method for image denoising
• Competitive method for scene categorization
• State-of-the-art method for object recognition
Part 3: Sparse Regression
Rosetta Ab Initio Search
• Very successful method from David Baker’s lab (UW)
  • Consistent top performer at CASP
  • Already used in real-world problems (e.g. HIV vaccine design)
• Monte Carlo procedure
  • Treat energy (actually e^(−energy)) as a probability density, sample from it
• Primary move set: fragment insertion
  • Fragments come from library of solved structures with similar residue subsequences
  • Energetically plausible local solutions
  • Like a coordinate descent move: jump to new local minimum in a few coordinates
Rosetta Ab Initio Search
• Start from fully extended chain.
• Repeat:
  • Propose fragment insertion (or other) move
  • Do local gradient descent to evaluate proposal
  • Accept or reject by a Metropolis criterion
• Throw away all but lowest energy sample from previous round
• Switch to high resolution (full-atom) energy function, perform further search (relaxation)
• Return single lowest energy conformation seen in sampling.
Run many, many times.
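The accept/reject step can be sketched on a toy problem. The code below is not Rosetta: it runs a Metropolis random walk on a hypothetical one-dimensional energy with two minima, with Gaussian perturbations standing in for fragment insertions. It omits the local gradient descent and full-atom relaxation stages. The function names, the step size and the temperature are all assumptions chosen for illustration.

```python
import math
import random


def energy(x):
    """Hypothetical 1-D energy with two minima; stands in for a protein energy."""
    return (x * x - 1.0) ** 2 + 0.3 * x


def metropolis_search(num_steps, step_size, temperature, rng, start=0.0):
    """Return the lowest-energy state seen during a Metropolis random walk."""
    x, e = start, energy(start)
    best_x, best_e = x, e
    for _ in range(num_steps):
        proposal = x + rng.gauss(0.0, step_size)
        e_new = energy(proposal)
        # Metropolis criterion: always accept downhill moves; accept uphill
        # moves with probability exp(-(E_new - E_old) / T).
        if e_new <= e or rng.random() < math.exp(-(e_new - e) / temperature):
            x, e = proposal, e_new
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e


def repeated_search(num_runs, seed=0):
    """Run the search several times and keep the single best result."""
    rng = random.Random(seed)
    runs = [metropolis_search(2000, 0.3, 0.2, rng) for _ in range(num_runs)]
    return min(runs, key=lambda pair: pair[1])


best_x, best_e = repeated_search(5)
print(f"lowest energy {best_e:.3f} at x = {best_x:.3f}")
```

Each run is independent of the others, which is the "blind repetition" that the resampling idea is meant to improve on.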
Our idea: resampling
• No more blind repetition: learn where to search from previous searches
• An initial round of sampling gives us lots of information about the search space
  • Areas of conformation space that always have very poor energy
  • Structural elements that are predicted very consistently
• How can we use this information to guide further sampling (without getting too greedy)?
• General approach:
  • Define a reduced search space
  • Learn a smoothed energy function from the decoys
  • Minimize smoothed energy to find new decoys
  • Repeat
Step 2: Response surface minimization
• Restrict attention to local minima (decoys)
• Fit a smoothed response surface to the decoys
• Minimize response surface to find new candidate
• Full-atom relax candidate
• Add candidate to decoy pool
• Re-fit surface, repeat
Features
• Torsion angle features, e.g., from the HDP model of the Ramachandran plot
• Secondary structure features: sheet, loop, helix
[Figure: Ramachandran plot with axes φ and ψ running from (-180, -180) to (180, 180); regions labeled G, E, A, B]
More Features
• Sidechain rotamer features
• Burial features: buried, exposed
• Register shift features
Sparse Regression Models
• Lasso regression: penalize large weights
• L1 regularization leads to sparse solutions
• LARS (Efron et al. 2004): find estimates for all values of the regularization constant C simultaneously, as efficiently as least-squares
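The sparsity effect of the L1 penalty can be seen in a few lines. The sketch below is not LARS: it swaps in plain coordinate descent with soft-thresholding, and solves the penalized form of the lasso for one penalty value at a time. The penalty name `lam`, the objective scaling, and the toy data are assumptions made here. Every column of `X` must be non-zero.

```python
def soft_threshold(value, threshold):
    """Shrink `value` toward zero by `threshold`, clipping at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, num_sweeps=200):
    """Minimize 0.5 * sum_i (y_i - x_i . beta)^2 + lam * sum_j |beta_j|."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(num_sweeps):
        for j in range(p):
            # Correlation of feature j with the residual that excludes
            # feature j's own current contribution.
            rho = sum(X[i][j] * (y[i] - sum(X[i][k] * beta[k]
                                            for k in range(p) if k != j))
                      for i in range(n))
            z = sum(X[i][j] ** 2 for i in range(n))
            beta[j] = soft_threshold(rho, lam) / z
    return beta


# Toy data with two orthogonal features; the second has a weak effect.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
y = [3.0, 0.2, 3.0, 0.2]
print("lam = 0:", lasso_coordinate_descent(X, y, 0.0))
print("lam = 1:", lasso_coordinate_descent(X, y, 1.0))
```

With the penalty switched on, the weak coefficient is set exactly to zero rather than merely shrunk, which is what makes the fitted model sparse.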
Results: 1ogw
Results: 1n0u
Results: 1di2
Collaborators
• David Baker
• Ben Blum
• Steven Brenner
• Roland Dunbrack
• Barbara Engelhardt
• Guillaume Obozinski
• Yee Whye Teh
• Daniel Ting
• Eric Xing
Finis
• For more information (papers, slides, tutorials, software): http://www.cs.berkeley.edu/~jordan