Statistical Machine Learning and Computational Biology

Statistical Machine Learning and Computational Biology

Michael I. JordanUniversity of California, Berkeley

November 5, 2007

Statistical Modeling in Biology

• Motivated by underlying stochastic phenomena thermodynamics recombination mutation environment

• Motivated by our ignorance evolution of molecular function protein folding molecular concentrations incomplete fossil record

• Motivated by the need to fuse disparate sources of data

Outline

• Graphical models phylogenomics

• Nonparametric Bayesian models protein backbone modeling multi-population haplotypes

• Sparse regression protein folding

Part 1: Graphical Models

5

Probabilistic Graphical Models

X1

X2

X3

X4 X5

X6

• Given a graph G = (V,E), where each node vV is associated with a random variable Xv

p(x1, x2, x3, x4, x5, x6) = p(x1) p(x2| x1)p(x3| x2) p(x4| x1)p(x5| x4)p(x6| x2, x5)

• The joint distribution on (X1, X2,…, XN) factors according to the “parent-of” relation defined by the edges E :

p(x6| x2, x5)

p(x1)

p(x5| x4)p(x4| x1)

p(x2| x1)

p(x3| x2)

6

Inference• Conditioning

• Marginalization

• Posterior probabilities

Inference Algorithms

• Exact algorithms sum-product junction tree

• Sampling algorithms Metropolis-Hastings Gibbs sampling

• Variational algorithms mean-field Bethe, Kikuchi convex relaxations

8

Hidden Markov Models

• Widely used in computational biology to parse strings of various kinds (nucleotides, markers, amino acids) • Sum-product algorithm yields

9

Hidden Markov Model Variations

10

Phylogenies

• The shaded nodes represent the observed nucleotides at a given site for a set of organisms

• Site independence model (note the plate)• The unshaded nodes represent putative ancestral

nucleotides• Computing the likelihood involves summing over the

unshaded nodes

11

Hidden Markov Phylogeny

• This yields a gene finder that exploits evolutionary constraints

Evolutionary rate is state-dependent (edges from state to nodes in phylogeny are omitted for

simplicity)• Based on sequence data from 12-15 primate species, we

obtain a nucleotide sensitivity of 100%, with a specificity of 89%

GENSCAN yields a sensitivity of 45%, with a specificity of 34%

Annotating new genomes

>Q8X1T6 (hypothetical protein) MCPPNTPYQSQWHAFLHSLPKCEHHVHLEGCLEPPLIFSMARKNNVSLPSPSSNPAYTSVETLSKRYGHFSSLDDFLSFYFIGMTVLKTQSDFAELAWTYFKRAHAEGVHHTEVFFDPQVHMERGLEYRVIVDGYVDGCKRAEKELGISTRLIMCFLKHLPLESAQRLYDTALNEGDLGLDGRNPVIHGLGASSSEVGPPKDLFRPIYLGAKEKSINLTAHAGEEGDASYIAAALDMGATRIDHGIRLGEDPELMERVAREEVLLTVCPVSNLQLKCVKSVAEVPIRKFLDAGVRFSINSDDPAYFGAYILECYCAVQEAFNLSVADWRLIAENGVKGSWIGEERKNELLWRIDECVKRF

What molecular function does protein Q8X1T6 have?

Species: Aspergillus nidulans (Fungal organism)

Images courtesy of Broad Institute, MIT

Annotation Transfer

• Species Name Molecular Function Score E-value• Schizosaccharomyces pomb adenosine deaminase 390 e-107• Gibberella zeae hypothetical protein FG01567.1 345 7e-94• Saccharomyces cerevisiae adenine deaminase 308 1e-82• Wolinella succinogenes putative adenosine deaminase 268 1e-70• Rhodospirillum rubrum adenosine deaminase 266 6e-70• Azotobacter vinelandii adenosine deaminase 260 4e-68• Streptomyces coelicolor probable adenosine deaminase 254 2e-68• Caulobacter crescentus CB1 adenosine deaminase 253 5e-66• Streptomyces avermitilis putative adenosine deaminase 251 2e-65• Ralstonia solanacearum adenosine deaminase 251 2e-65• environmental sequence unknown 246 5e-64• Pseudomonas aeruginosa probable adenosine deaminase 245 1e-63• Pseudomonas aeruginosa adenosine deaminase 245 1e-63• environmental sequence unknown 244 3e-63• Pseudomonas fluorescens adenosine deaminase 243 7e-63• Pseudomonas putida KT2440 adenosine deaminase 243 7e-63

BLAST Search: Q8X1T6 (Aspergillus nidulans)

Methodology to System: SIFTER

Species Name Molecular Function Score E-valueSchizosaccharomyces pombe adenosine deaminase 390 e-107Gibberella zeae hypothetical protein FG01567.1 345 7e-94Saccharomyces cerevisiae adenine deaminase 308 1e-82Wolinella succinogenes putative adenosine deaminase 268 1e-70Rhodospirillum rubrum adenosine deaminase 266 6e-70Azotobacter vinelandii adenosine deaminase 260 4e-68Streptomyces coelicolor probable adenosine deaminase 254 2e-68Caulobacter crescentus adenosine deaminase 253 5e-66Streptomyces avermitilisputative adenosine deaminase 251 2e-65Ralstonia solanacearum adenosine deaminase 251 2e-65environmental sequenceunknown 246 5e-64Pseudomonas aeruginosa probable adenosine deaminase 245 1e-63Pseudomonas aeruginosa adenosine deaminase 245 1e-63environmental sequenceunknown 244 3e-63Pseudomonas fluorescens adenosine deaminase 243 7e-63Pseudomonas putida KT adenosine deaminase 243 7e-63

MP

Gene Tree

Species Tree

Set of homologous proteins (Pfam)

adenosine

adenosine

adenosine

adenine

adenineGene Ontology SIFTER

Functional diversity problem1887 Pfam-A families with more than two experimentally

characterized functions

2-5 different functions




>51 different functions

Available methods for comparison

• Sequence similarity methods• BLAST [Altschul 1990]: sequence similarity search,

transfer annotation from sequence with most significant similarity

• Runs against largest curated protein database in world• GOtcha [Martin 2004]: BLAST search on seven genomes

with GO functional annotations • GOtcha runs use all available annotations• GOtcha-exp runs use only available experimental annotations

• Sequence similarity plus bootstrap orthology• Orthostrapper [Storm 2002]: transfer annotation when

query protein is in statistically supported orthologous cluster with annotated protein

AMP/adenosine deaminase • 251 member proteins in Pfam v. 18.0• 13 proteins with experimental evidence

GOA• 20 proteins with experimental annotations

from manual literature search• 129 proteins with electronic annotations

from GOA• Molecular function: remove amine group

from base of substrate• Alignment from Pfam family seed

alignment• Phylogeny built with PAUP* parsimony,

BLOSUM50 matrix

Mouse adenosine deaminase, courtesy PDB

AMP/adenosine deaminase

SIFTER Errors

Leave-one-out cross-validation: 93.9% accuracy (31 of 33)BLAST: 66.7% accuracy (22 of 33)

AMP/adenosine deaminase

Multifunction families: can choose numerical cutoff for posterior probability prediction using this type of plot

Note: x-axis is on log scale

Sulfotransferases: ROC curve

• SIFTER (no truncation): 70.0% accuracy (21 of 30)• BLAST: 50.0% accuracy (15 of 30)

Note: x-axis is on log scale

Nudix Protein Family

• 3703 proteins in the family

• 97 proteins with molecular functions characterized

• 66 different candidate molecular functions

Nudix: SIFTER vs BLAST•SIFTER truncation level 1: 47.4% accuracy (46 of 97)•BLAST: 34.0% accuracy (33 of 97); 23.3% of terms at all in search results

Trade specificity for accuracy• Leave-one-out cross-validation, truncation at 1: 47.4% accuracy

66 candidate functions

15 candidate functions

Leave-one-out cross-validation, truncation at 1,2: 78.4% accuracy

Fungal genomes

Archeascomycota

Basidiomycota

Hem

iasc

omyc

ota

Work with Jason Stajich; Images courtesy of Broad Institute

Euas

com

ycot

a

Zygomycota

Fungal Genomes Methods

• Gene finding in all 46 genomes• hmmsearch for all 427,324 genes• Aligned hits with hmmalign to 2,883 Pfam v. 20 families• Built trees using PAUP* maximum parsimony for 2,883 Pfam

v. 20 families; reconciled with Forester • BLASTed each protein against Swiss-Prot/TrEMBL for exact

match; used ID to search for GOA annotations • Ran SIFTER with (a) experimental annotations only and (b)

experimental and electronic annotations

SIFTER Predictions by Species

Part 2: Nonparametric Bayesian Models

Clustering

• There are many, many methodologies for clustering

• Heuristic methods hierarchical clustering

• M estimation K means spectral clustering

• Model-based methods finite mixture models Dirichlet process mixture models

Nonparametric Bayesian Clustering

• Dirichlet process mixture models are a nonparametric Bayesian approach to clustering

• They have the major advantage that we don’t have to assume that we know the number of clusters a priori

Chinese Restaurant Process (CRP)

• Customers sit down in a Chinese restaurant with an infinite number of tables first customer sits at the first table th subsequent customer sits at a table

drawn from the following distribution:

where is the number of occupants of table

The CRP and Mixture Models

• The customers around a table form a cluster associate a mixture component with each

table the first customer at a table chooses from

the prior e.g., for Gaussian mixtures, choose

• It turns out that the (marginal) distribution that this induces on the theta’s is exchangeable

1 2 3 4

Example: Mixture of Gaussians

Dirichlet Process

0

• Exchangeability implies an underlying stochastic process; that process is known as a Dirichlet process

1

34

• Given observations , we model each with a latent factor :

• We put a Dirichlet process prior on :

Dirichlet Process Mixture Models

Connection to the Chinese Restaurant

• The marginal distribution on the theta’s obtained by marginalizing out the Dirichlet process is the Chinese restaurant process

• Let’s now consider how to build on these ideas and solve the multiple clustering problem

Multiple Clustering Problems

• In many statistical estimation problems, we have not one data analysis problem, but rather we have groups of related problems

• Naive approaches either treat the problems separately, lump them together, or merge in some adhoc way; in statistics we have a better sense of how to proceed: shrinkage empirical Bayes hierarchical Bayes

• Does this multiple group problem arise in clustering? I’ll argue “yes!”

• If so, how do we “shrink” in clustering?

Multiple Data Analysis Problems

• Consider a set of data which is subdivided into groups, and where each group is characterized by a Gaussian distribution with unknown mean:

• Maximum likelihood estimates of are obtained independently

• This often isn’t what we want (on theoretical and practical grounds)

38

Hierarchical Bayesian Models• Multiple Gaussian distributions linked by a shared hyperparameter

• Yields shrinkage estimators for the

Protein Backbone Modeling

• An important contribution to the energy of a protein structure is the set of angles linking neighboring amino acids

• For each amino acid, it turns out that two angles suffice; traditionally called φ and ψ

• A plot of φ and ψangles across some ensemble of amino acids is called a Ramachandran plot

A Ramachandran Plot

• This can be (usefully) approached as a mixture modeling problem

• Doing so is much better than the “state-of-the-art,” in which the plot is binned into a three-by-three grid

Ramachandran Plots• But that plot is an overlay of 400 different plots, one for

each combination of 20 amino acids on the left and 20 amino acids on the right

• Shouldn’t we be treating this as a multiple clustering problem?

Haplotype Modeling

• A haplotype is the pattern of alleles along a single chromosome

• Data comes in the form of genotypes, which lose the information as to which allele is associated to which member of a pair of homologous chromosomes:

• Need to restore haplotypes from genotypes• A genotype is well modeled as a mixture model,

where a mixture component is a pair of haplotypes (the real difficulty is that we don’t know how many mixture components there are)

Multiple Population Haplotype Modeling

• When we have multiple populations (e.g., ethnic groups) we have multiple mixture models

• How should we analyze these data (which are now available, e.g., from the HapMap project)?

• Analyze them separately? Lump them together?

Scenes, Objects, Parts and Features

Shared Parts

Hidden Markov Models

• An HMM is a discrete state space model• The discrete state can be viewed as a cluster

indicator• We thus have a set of clustering problems, one

for each value of the previous state (i.e., for each row of the transition matrix)

Solving the Multiple Clustering Problem

• It’s natural to take a hierarchical Bayesian approach

• It’s natural to take a nonparametric Bayesian in which the number of clusters is not known a priori

• How do we do this?

48

Hierarchical Bayesian Models• Multiple Gaussian distributions linked by a shared hyperparameter

• Yields shrinkage estimators for the

49

• Let us try to model each group of data with a Dirichlet process mixture model let the groups share an underlying

hyperparameter • But each group is generated

independently different groups cannot share the

same components if is continuous.

Hierarchical DP Mixture Model?

spikes do notmatch up

Hierarchical Dirichlet Process Mixtures

The Chinese Restaurant Franchise

HDP Model of Ramachandran Plots• We would like to solve 400 different related clustering problems,

one for each combination of 20 amino acids on the left and 20 amino acids on the right

53

HDP Model of Ramachandran Plots

Some HDP Success Stories

• New backbone model for Rosetta• New method for multi-population haplotype phasing• Solution to problem of choosing number of states in HMMs• State-of-the-art method for statistical parsing• Competitive method for image denoising• Competitive method for scene categorization• State-of-the-art method for object recognition

Part 3: Sparse Regression

Rosetta Ab Initio Search• Very successful method from David Baker’s lab (UW)

Consistent top performer at CASP Already used in real-world problems (e.g. HIV vaccine

design)• Monte Carlo procedure

Treat energy (actually ) as a probability density, sample from it

• Primary move set: fragment insertion

Fragments come from library of solved structures with similar residue subsequences

Energetically plausible local solutions

Like a coordinate descent move—jump to new local minimum in a few coordinates

Rosetta Ab Initio Search• Start from fully extended chain.• Repeat:

• Propose fragment insertion (or other) move• Do local gradient descent to evaluate proposal• Accept or reject by a Metropolis criterion

• Throw away all but lowest energy sample from previous round

• Switch to high resolution (full-atom) energy function, perform further search (relaxation)

• Return single lowest energy conformation seen in sampling.

Run many, many times.

Our idea: resampling• No more blind repetition—learn where to search from

previous searches• An initial round of sampling gives us lots of information

about the search space Areas of conformation space that always have very

poor energy Structural elements that are predicted very consistently

• How can we use this information to guide further sampling (without getting too greedy?)

• General approach: • Define a reduced search space• Learn a smoothed energy function from the decoys• Minimize smoothed energy to find new decoys• Repeat

• Restrict attention to local minima (decoys)• Fit a smoothed response surface to the decoys• Minimize response surface to find new candidate• Full-atom relax candidate• Add candidate to decoy pool• Re-fit surface, repeat

Step 2: Response surface minimization

Features

• Torsion angle features e.g., from the HDP model of the

Ramachandran plot

• Secondary structure features

Sheet Loop Helix

φ

ψ

(180, 180)

(-180, -180)

G

E

E

A

B

B

More Features• Sidechain rotamer features

• Burial features

Buried Exposed

• Register shift features

Sparse Regression Models• Lasso regression: penalize large weights

• L1 regularization leads to sparse solutions• LARS (Efron et al. 2004) : find estimates for all C

simultaneously, as efficiently as least-squares

Results: 1ogw

Results: 1n0u

Results: 1di2

Collaborators

• David Baker• Ben Blum• Steven Brenner• Roland Dunbrack• Barbara Engelhardt• Guillaume Obozinski• Yee Whye Teh• Daniel Ting• Eric Xing

http://www.cs.berkeley.edu/~jordan


Finis


• For more information (papers, slides, tutorials, software:


Documents

Statistical Machine Learning and Computational Biology