28
Do the tools & methods of theoretical physics have anything useful to contribute to cellular biology? Or did Delbrück and Szilard have all the fun? Curtis Callan (Princeton University) eoretical Physics and Cellular Biolo U. Kentucky, 12/8/2006

Do the tools & methods of theoretical physics have anything useful to contribute to cellular biology? Or did Delbrück and Szilard have all the fun? Curtis

Embed Size (px)

Citation preview

Page 1: Do the tools & methods of theoretical physics have anything useful to contribute to cellular biology? Or did Delbrück and Szilard have all the fun? Curtis

Do the tools & methods of theoretical physics have anything useful to contribute to cellular biology? Or did Delbrück and Szilard have all the fun?

Curtis Callan (Princeton University)

Theoretical Physics and Cellular Biology

U. Kentucky, 12/8/2006

Page 2: Do the tools & methods of theoretical physics have anything useful to contribute to cellular biology? Or did Delbrück and Szilard have all the fun? Curtis

Overview

How biology is constrained by physical principles has always been a legitimate subject of study for theoretical physics. The recent explosion of quantitative data, due to whole genome sequencing, expression profiling, etc. has placed this question in a qualitatively new context -- one in which the making of quantitative models, and even theories, will be essential for understanding biological data. This provides an opportunity for theoretical physics (and physicists) to contribute to the advance of biological science.

Gene regulation presents an instructive particular case. Each cell in our body contains the same genetic information, coded in a few DNA molecules. Each cell controls how much of each protein is made, and therefore what the cell actually `does‘, via the binding of a special class of proteins (transcription factors) to short segments of DNA lying near protein-coding DNA regions (the genes). Despite the relentless advance of molecular biology over the last fifty years, deep physical questions about the working of this mechanism (in particular regarding specificity, kinetics and noise) remain imperfectly understood.

Page 3: Do the tools & methods of theoretical physics have anything useful to contribute to cellular biology? Or did Delbrück and Szilard have all the fun? Curtis

Biology as it is practiced today looks more and more like physics: quantitative experiments, large volumes of data, sophisticated data analysis, models, ...

Physics teaches us that models and data analysis must be guided by formal theory: qualitatively striking phenomena often demand new mathematical structures

Physics is not just a methodological model: Cells often operate in a regime where physical constraints are important - limits to specificity, precision, noise, ....

The genomic revolution (organism sequencing, expression profiling, …) has brought these issues into sharper focus …. we need much more than “bio-informatics” to extract meaning from the mass of data being produced.

Theoretical physics provides a reservoir of people and ideas well-suited to taking up the challenges of the new biology (that’s my opinion anyway).

To make this more concrete, I will describe a few cases where physics-motivated theoretical approaches are being used to address problems in gene regulation.

Some reasons why the time is ripe:

Page 4: Do the tools & methods of theoretical physics have anything useful to contribute to cellular biology? Or did Delbrück and Szilard have all the fun? Curtis

Cartoon Overview of Gene Expression

Transcription factor proteins (TFs) bind to promoter to help (hinder) RNAP copy gene to mRNA.

Sequence-dependent TF-DNA binding thermodynamics controls which TF binds to which gene. Specificity is governed by energy.

RNAP protein complex makes an mRNA copy of the gene. Ribosome translates triplets of bases into amino acids via the “genetic code”. Coding Problem: Same TF binds to many different sequences. No

analog of 3bp codons. Sites are statistically defined at best (next slide)

Binding Energy: Generic non-specific binding is weak; sequence-dep-endent binding provides a random energy landscape; strongest binding sites are probably the biologically significant ones.

TF protein binds here to a short (~20bp) DNA sequence

TF

Page 5: Do the tools & methods of theoretical physics have anything useful to contribute to cellular biology? Or did Delbrück and Szilard have all the fun? Curtis

TF Binding Sites: Statistical At Best

Page 6: Do the tools & methods of theoretical physics have anything useful to contribute to cellular biology? Or did Delbrück and Szilard have all the fun? Curtis

o Inferring models from data as an exercise in statistical inference: the transcription factor binding energy problem meets micro-arrays.

o Evolutionary comparisons as a tool for learning about transcription factor binding (and vice-versa): toward a stat mech of

evolution

o Noise, small numbers and stochastic aspects of gene expression dynamics: how the fruit fly

gets its shape.

Physical Variations on a Theme of Gene Regulation by TF Binding:

Page 7: Do the tools & methods of theoretical physics have anything useful to contribute to cellular biology? Or did Delbrück and Szilard have all the fun? Curtis

TF-DNA Energy Models from Binding Assays

Transcription factors (TFs) are DNA-binding proteins which regulate gene transcription: key mechanism in all organisms.

A quantitative understanding of gene regulation, especially its evolution, requires a quantitative understanding of TF-DNA interaction, i.e. sequence-dependent binding energy (SDBE).

High-throughput experiments can give massive amounts of (somewhat noisy) information on TF-DNA binding:

PBM: protein binding microarrays (in vitro)

ChIP-chip: chromatin immuno-precipitation microarrays (in vivo)

Goal: infer quantitative SDBE models from this noisy data. We will take the statistical inference approach used by physicists to deal with WMAP/HEP data.

Page 8: Do the tools & methods of theoretical physics have anything useful to contribute to cellular biology? Or did Delbrück and Szilard have all the fun? Curtis

Experiment (Mukherjee et al 04)

• Use dsDNA microarrays to probe TF binding to all intergenic regions of S. cerevisiae. Sequence of every (~1000bp) region is known.

• Fluorescence of any well is (roughly) proportional to number of TF molecules bound to DNA sequence s in that well.

• After some normalization, etc., each DNA sequence s is assigned a numerical fluorescence value z (TF specific):

bind a TF

antibody label

scan

Page 9: Do the tools & methods of theoretical physics have anything useful to contribute to cellular biology? Or did Delbrück and Szilard have all the fun? Curtis

Binding Energy (and Binding) Model• Bases within a site (length L)

contribute additively to the binding energy. Model is a 4xL “energy matrix” M.

• A stretch of DNA is “bound” if it contains a site with E<1 (other-wise “unbound”). Site occup-ancy modeled by step function.

• A matrix model M predicts if any DNA sequence s is bound (x=1) unbound (x=0).

• How does this compare with what is seen in the expt? Does any choice of M fit the data?

C GTA AGT TCC

1.2

3.7

2.9

0.0

1.8

0.0

0.1

3.0

5.6

0.5

0.8

0.0

2.5

1.2

0.6

0.0

0.0

5.2

1.3

3.1

6.0

0.0

1.4

3.2

A

C

G

T

L = 6Site = TGTGACEnergy = 0.7 < 1

Region is “bound”

• Model predicts x (binary), but exp’t measures z (continuous)!

Page 10: Do the tools & methods of theoretical physics have anything useful to contribute to cellular biology? Or did Delbrück and Szilard have all the fun? Curtis

Connecting “Theory” and Experiment:

• Fluorescence z of a bound (unbound) region is probabilistic (chemistry etc.); need ``error models’’ for the two states:

• Experiment sees only the histogram of net fluorescence N0p(z|0) +N1p(z|1) of N=N0+N1 genes (discriminate by cut?).

• How well does model M predict the data? Likelihood measure:

• Baye’s rule then gives probability of a model, given the data:

• Small hitch: we have no clue about the actual error model!

Page 11: Do the tools & methods of theoretical physics have anything useful to contribute to cellular biology? Or did Delbrück and Szilard have all the fun? Curtis

Quenching the Error Model

• In ignorance of the true error model, we average data likelihood over all error models. Get error model averaged (EMA) likelihood:

• To evaluate this weird expression, some binning is needed:

• Bin each region si according to fluorescence; find prediction xi of model , record counts czx per bin.

• EMA likelihood evaluates to

• Use MCMC to collect models distri-buted according to EMA likelihood:

Mutual information is the key!

Page 12: Do the tools & methods of theoretical physics have anything useful to contribute to cellular biology? Or did Delbrück and Szilard have all the fun? Curtis

Results for TF Abf1p (Yeast)• Use MCMC to sample

40,000 matrices M from p(M|{zi})using EMA likelihood. Get well-sampled distributions for all 80 matrix elements.

• Most RMSDs ≤ 5% of functional range.

• Significant structure, even in the middle of the binding site where there is little specificity.

• Well-behaved prob dist’n in high-dimension model space is a big surprise..

Page 13: Do the tools & methods of theoretical physics have anything useful to contribute to cellular biology? Or did Delbrück and Szilard have all the fun? Curtis

Evidence From Evolution

Direct experimental evidence about TF-DNA binding energy is limited. But energy is the focus of our method. We need direct hi-throughput energy measurements…

Functional properties of binding sites are governed in large part by energy: energy, not sequence, should be conserved in evolution if function is to be maintained

Suggests that comparing “orthologous” binding sites between species is an indirect way of assessing whether our energy model is doing the right thing.

If it passes the test, our hundreds of binding sites are the raw material for a new kind of quantitative study of the dynamics of evolution.

Page 14: Do the tools & methods of theoretical physics have anything useful to contribute to cellular biology? Or did Delbrück and Szilard have all the fun? Curtis

Orthology and Alignment of Genomes + Sites

Alignment of related sequences amounts to finding the most parsimonious way of mutating one into the other (including possibility of creating gaps). Powerful software readily available: ClustalW used here. Can also search for “most likely” ancestor sequence of the descendants (well-studied subject in comp-bio).

Example of intergenic region with predicted ecoli binding site for Crp:

Page 15: Do the tools & methods of theoretical physics have anything useful to contribute to cellular biology? Or did Delbrück and Szilard have all the fun? Curtis

Binding Energy is Conserved (Yeast ABF1)

676 (!) intergenic sites in S. cerevisiae with Abf1p binding energy < 1 have identifiable orthologs in S. bayanus

~75% of these orthologs also have energy < 1. Conservation this strong is very non-generic!

Sites with energy > 1 have no correlation between energies in the two genomes: mutation randomizes energy.

Clear evidence of selection pressure on bound site energy; pressure on sequence is indirect.

Precise genotype-phenotype maps just what we need for quantitative understanding of how TF binding sites and regulatory networks evolved.

Use the energy model derived by likelihood analysis from S.cerevisiae assays to assign energies to sites (an a priori genotype-phenotype map):

Page 16: Do the tools & methods of theoretical physics have anything useful to contribute to cellular biology? Or did Delbrück and Szilard have all the fun? Curtis

Energy Imprints Itself on Sequence Evolution

ABF1

We have a rich yeast phylogenetic tree. Plus a large collection of orthologous site pairs. Ask population average questions about the likelihood of base changes at diff’t locations in thte ABF1 site. Look for selection on the basis of energy ….not site-by-site, but on average in the population.

Substitution probability pattern matches structure of energy matrix (bigger E’s disfavored).

Pattern evolves with time (from last common ancestor) as if under control of common `Hamiltonian’.

Page 17: Do the tools & methods of theoretical physics have anything useful to contribute to cellular biology? Or did Delbrück and Szilard have all the fun? Curtis

“Equilibrium Thermodynamics” of Site Evolution

Assume: a) site s1 evolves over time into an ensemble {s1} thermal with respect to a fitness function F(E(s)) b) many equivalent sites [si] of a TF in one species can be treated as an equivalent ensemble (ergodicity?).

Population genetics (Kimura): raw mutation rate is modified by fitness change F

Infer the actual fitness function F(E) from ABF1 site energy statistics in S.cerevisiae. Known background determines dist’n P0(E) of non-functional sites. Difference from true dist’n Ptrue(E) is Pfunc(E). Stat mech says F(E) = ln(Pfunc(E)) !!

F(E) is the “Hamiltonian” governing stochastic evolution of ABF1 binding sites. Do MC simulations agree with data provided by real genomes?

ABF1 site statsF(E))

Pfunc(E)

Pbkgd(E)

Page 18: Do the tools & methods of theoretical physics have anything useful to contribute to cellular biology? Or did Delbrück and Szilard have all the fun? Curtis

Fluctuations, Noise & Genetic Switches

Many cellular processes depend on presence (absence) of a small number of actors (TFs, signaling molecules, photons, ..)

The associated fluctuations and noise have critical influence on how things work (not always fully appreciated):

Sensing chemical gradients (chemotaxis)

Flipping of genetic switches (development)

Can a gene do more than just be on or off?

Remarkably little is known about this: the stochastic properties of cellular events are just beginning to be explored quantitatively (its not just Poisson).

Page 19: Do the tools & methods of theoretical physics have anything useful to contribute to cellular biology? Or did Delbrück and Szilard have all the fun? Curtis

If sensors are of size a, and molecules are at concentration c, sensor will “count” N ~ ca3 molecules, make fractional error

N/N ~ 1/N 1/2

If we average for a time T, we can make K ~ T/t measurements, where t ~ a2 /D is time to “clear” via diffusion ... fractional error reduced by 1/K1/2

Implied limiting precision of concentration sensing:

Physical limits to biochemical signaling precision

δcc__ 1

(cDaT)_______

1/2~

This old Berg+Purcell argument has recently been made rigorous and generalized to much broader context by fluctuation-dissipation arguments (Bialek+Setayeshgar). You get the and much more!

Page 20: Do the tools & methods of theoretical physics have anything useful to contribute to cellular biology? Or did Delbrück and Szilard have all the fun? Curtis

Cell Fate in Fruit Fly Embryo Development

Nuclei need to know `where they are’ in order to make cell fate choice (thorax v. abdomen). Use the concentration of Bicoid protein diffusing from the head of the embryo: if [Bcd] is big enough, they express Hb (or not) and choose their future.

[Bcd]

Maternal Bcd RNA

Thorax ([Hb] high) Abdomen ([Hb] low)

Until generation 13, the nuclei live in a `syncitium’ without cell walls

Page 21: Do the tools & methods of theoretical physics have anything useful to contribute to cellular biology? Or did Delbrück and Szilard have all the fun? Curtis

Accurate Pattern Formation in Drosophila Development?

Maternal Bcd has longitudinal gradient: Hb transcription reads out [Bcd], turns on at [c] > 5 molecules3

Hb expression threshold is sharp, separates neighboring nuclei (at cycle 14, x~6m), implying 10% readout of [Bcd].

But readout by transcription factor binding to small DNA sites has accuracy limited by thermodynamics

For a~3nm, D~5m2/s, [c]~2nM, T~60s: we find that c/c > 50% !

Noise can be reduced by time averaging. But, to achieve 10% accuracy, promoter would have to average for 1 hour!

Not plausible: perhaps [Bcd] is sensed as a spatial average? Then [Hb] fluctuations in nearby nuclei must be correlated.

This prediction can be subjected to experimental test ….

Page 22: Do the tools & methods of theoretical physics have anything useful to contribute to cellular biology? Or did Delbrück and Szilard have all the fun? Curtis

Hb

Bcd

DNA

Observations on Triply Immuno-stained Drosophila Embryos (Gregor et al):

Cycle 14 embryo with ~2300 nuclei:

Positions of nuclei in embryo identified by DNA signal (blue)

Infer [Hb] and [Bcd] within each nucleus from R/G signals

103s of nuclear data points then give the input/output relation for expression control system:Powerful tool for studying issues of noise and fluct-uations in gene expression

Observe diffusion gradient of primary morphogen [Bcd]: above some threshold, it turns on [Hb]

Quantitative study of regulation of expression of Hb by [Bcd]

Page 23: Do the tools & methods of theoretical physics have anything useful to contribute to cellular biology? Or did Delbrück and Szilard have all the fun? Curtis

Readout Noise in Gene Expression:Gregor’s Experiment Speaks

Normalized noise in inferred readout of [Bcd] from [Hb] level. Note approx 10% accuracy in decision region.

Spatial correlation function of noise in [Hb]: correlated over ~6 nuclei (noise reduction by spatial averaging?)

This Drosophila gene expression module seems to be working at or near the limits imposed by physics/thermodynamics. But mechanism for implied spatial correlation of Hb expression is totally unknown.

Page 24: Do the tools & methods of theoretical physics have anything useful to contribute to cellular biology? Or did Delbrück and Szilard have all the fun? Curtis

Summary and Some Conclusions

Page 25: Do the tools & methods of theoretical physics have anything useful to contribute to cellular biology? Or did Delbrück and Szilard have all the fun? Curtis

PBM vs. ChIP-chip

PBM and ChIP-chip data give similar matrices (but with the ChIP-chip cutoff set to .75 instead of 1).

Simple 2 test used to asess the overlap between the PBM and ChIP-chip distributions for each matrix element, i.e. test for consistency.

Most elements have overlapping distributions. Only 3 don’t, and those are outside of the main site. Do they matter?

Page 26: Do the tools & methods of theoretical physics have anything useful to contribute to cellular biology? Or did Delbrück and Szilard have all the fun? Curtis

PBM vs. ChIP-chip

We declare sites flagged by > 50% of energy matrices to be putative binding sites.

Putative ChIP-chip sites are almost all predicted by the PBM matrices.

Mukherjee et al. used a direct comparison of putatively bound regions to validate their results. Resulting agreement is not nearly as good as ours. Such comparisons can only be improved by decreasing experimental noise.

Noise is not nearly as problematic when data is bottlenecked through a parameterized model.

Page 27: Do the tools & methods of theoretical physics have anything useful to contribute to cellular biology? Or did Delbrück and Szilard have all the fun? Curtis

Transcription Factors: Proteinsthat Regulate Gene Expression

Basic Mechanism: TFs bind to short (noncoding) DNA sequences tomodify expression level of nearby genes. Complex circuits are made.

Coding Problem: Same TF binds to many different sequences. Noanalog of 3bp codons. Sites are statistically defined at best.

Binding Energy: Generic non-specific binding is weak; sequence-dep-endent binding provides a random energy landscape; strongest binding sites are probably the biologically significant ones.

Modeling: To analyze gene regulatory networks, need good quantitative models for TF-DNA binding and dynamics. Stat mech problem governed by sequence to energy map; problem of false positives;

Page 28: Do the tools & methods of theoretical physics have anything useful to contribute to cellular biology? Or did Delbrück and Szilard have all the fun? Curtis

theoretical physics ideas new fields of experimentcoding of sensory signals in neural spike trains should be an efficient - perhaps even optimal - code ... must be matched to the distribution of inputs

(Bialek et al)

information in spike timing; neural code adapts to input statistics; info adaptation as fast as physics permits!

(de Ruyter, Meister, Berry, et al)

electrical dynamics of neurons determined by ion channels, but overly sensitive to numbers of different channel types ... "self tuning” mechanisms needed for robustness

(Abbott et al)

neurons do "remodel" their channel numbers; novel homeostatic regulation mechanisms are observed

(Marder, Turrigiano, et al)

computing with attractors ... network dynamics for memory, recall, optimization, ...

(Hopfield, Seung, et al)

eye position stabilization as the proto-typical example of short-term memory

(Tank et al)

Neuroscience as “existence proof” that theoretical physics has something to say

about biology:

Curiously, this story is a good two decades old. The obvious complexity of neural systems made it clear early on that `theory’ would be needed to make sense of experiment. Cell biology is making the same transition two decades later.