Final Submission for Arvind Thiagarajan


Measuring Dynamic Phosphorylation of Spo0A in Bacillus subtilis

Arvind Thiagarajan, Joseph H. Levine, Michael B. Elowitz

Departments of Biology, Biological Engineering, and Applied Physics, California Institute of Technology

(Dated: November 1, 2011)

Sporulation is an intricately controlled process in Bacillus subtilis, promising a potential wealth of novel network motifs. It has previously been shown that the key regulator involved in the induction of sporulation is the phosphorylated form of the transcription factor Spo0A. For this reason, we developed an experimental paradigm to measure Spo0AP levels dynamically in vivo. A strain of B. subtilis was engineered in which the Spo0A gene had been replaced by a Spo0A-GFP fusion and into which an array containing 256 copies of a Spo0AP binding site had been inserted. Imaging of this strain revealed clusters of localized fluorescence amidst diffuse background fluorescence, corresponding respectively to bound Spo0AP and to unbound Spo0AP and Spo0A. In order to determine relative levels of Spo0AP and Spo0A from such images, we developed a machine learning algorithm for identifying localized fluorescence and quantifying the corresponding intensities. Subsequently, we constructed a probabilistic model to estimate intracellular deviations in cluster intensities and consequently determine the ratio between observed fluorescence and number of molecules present. A microfluidic chemostat was optimized for the study of Spo0AP dynamics during sporulation in B. subtilis, and to facilitate experiments in this device, we optimized a medium to induce sporulation.

I. INTRODUCTION

A. Background

Systems biology is a burgeoning field that encompasses and impinges on many other biological sciences. The subject deals with the quantitative analysis of system architectures in biological networks that lead to emergent phenomena. Such analysis draws heavily on electrical engineering, physics, and control theory, and consequently appeals to scientists from a variety of fields. Systems biology is most interesting, however, because biological systems are complex: a qualitative description of the parts does not lead to a unique qualitative description of the whole, and consequently quantitative data must be used to determine the regimes between which a system has qualitatively different behavior.

One particularly interesting subtopic in systems biology is the control of cellular differentiation programs. While there are known network designs that facilitate controlled differentiation, it would seem, a priori, that these networks, like all biological networks, ought to operate on a time scale shorter than the length of the cell cycle. The constant dilution of chemical concentrations accompanying cell growth superimposes a constitutive negative feedback on any system, and thus makes it difficult for networks to operate over multiple cell cycles. It is particularly intriguing, then, to study the differentiation of Bacillus subtilis into spores. B. subtilis responds to a lack of nutrients by first undergoing several rounds of growth and then differentiating into spores. This delayed process is of great interest because, unlike similar processes mediated by quorum sensing, it has been shown to occur in a cell-autonomous fashion, independent of the medium in which the cells are grown.

It is known that the master regulator for this differentiation process is the phosphorylated form of the transcription factor Spo0A. There are multiple regulatory pathways that both control and are controlled by the phosphorylation level of Spo0A, and the Elowitz Lab has analyzed many of these using genetic knockdowns and knockouts of various key factors in the networks. However, any analysis of such mutations and their effects must necessarily use as its readout the level of phosphorylation of Spo0A. To this end, the Elowitz Lab had devised a relatively simple readout system to probe Spo0A phosphorylation dynamics. Under this readout system, a promoter sensitive to phosphorylated Spo0A drives production of yellow fluorescent protein (YFP), while a constitutive promoter drives production of green fluorescent protein (GFP) to serve as a control for the level of fluorescence and promoter activity. The promoter activated by phosphorylated Spo0A is bound more often as the concentration of Spo0AP increases, and this leads to increased production of the fluorescent readout. Since the concentration of fluorescent protein in each cell is constantly being diluted due to cell growth and division, a dynamic model must then be used to extract the relative Spo0AP level as a function of time based on the overall fluorescence level.
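The report does not spell out the dynamic model used. A minimal sketch of one common form is given below, under the assumption (not stated in the text) that the fluorescent protein is stable and lost only by dilution at a constant growth rate μ, so that dF/dt = P(t) − μF(t) and the promoter activity is P(t) = dF/dt + μF(t). The function name `promoter_activity` and all parameter values are hypothetical.

```python
import math

def promoter_activity(times, fluorescence, growth_rate):
    """Estimate P(t) = dF/dt + mu * F(t) at the interior time points.

    Assumption (hypothetical): the reporter is stable and is removed only
    by dilution at a constant growth rate `growth_rate` (mu).
    """
    activity = []
    for k in range(1, len(times) - 1):
        dt = times[k + 1] - times[k - 1]
        # central finite difference for dF/dt
        dF_dt = (fluorescence[k + 1] - fluorescence[k - 1]) / dt
        activity.append(dF_dt + growth_rate * fluorescence[k])
    return activity

# Synthetic check: constant production p, diluted at rate mu,
# gives F(t) = (p / mu) * (1 - exp(-mu * t)).
mu, p = 0.5, 2.0
ts = [0.01 * k for k in range(1000)]
Fs = [(p / mu) * (1.0 - math.exp(-mu * t)) for t in ts]
estimated = promoter_activity(ts, Fs, mu)
```

On noisy data the finite difference would amplify noise, so some smoothing of the fluorescence trace would likely be needed before applying such a formula.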

This readout allows for comparative analyses of Spo0AP levels, and even estimates of ratios between different levels, but it does not provide any information about the actual number of phosphorylated Spo0A transcription factors. Furthermore, the readout system does not respond linearly to Spo0AP concentration, as the binding of transcription factor to promoter follows a saturation curve. Finally, the system has low time resolution due to the discrepancy in time scales between binding kinetics and transcription/translation.

Using this readout system, it was found that Spo0A phosphorylation is pulsed once every cell cycle in the natural course of sporulation. This discovery implicated several potential mechanisms by which these pulsed dynamics could enable cells to robustly defer sporulation. However, the lab was unable to experimentally distinguish between these different mechanisms because of the low time resolution of the readout system.

I have conducted research this summer to construct and employ a new readout system, also based on time-lapse microscopy, which addresses these issues and, in particular, introduces higher time resolution. Since only phosphorylated Spo0A can bind DNA, my project investigated whether direct observation of Spo0AP DNA binding dynamics can give a more precise readout of Spo0AP dynamics in individual cells.

B. Binding Array Based Readout System

Under the modified system that I employed this summer, the chromosomal copy of the Spo0A gene was replaced with a fusion of Spo0A and the red fluorescent protein mCherry. It has been shown that this fusion protein does not interfere with the structural properties of native Spo0A, and in particular does not inhibit the binding of Spo0AP to the appropriate binding site. Furthermore, a series of short identical DNA sequences, each isolated from the PSpo0F promoter and capable of binding only Spo0AP, were inserted into the B. subtilis genome. The sequences together form a binding array for Spo0AP molecules. Since any Spo0AP-mCherry molecules bound to this array are localized within a small area, their fluorescent intensity is much more concentrated spatially than the intensity of unbound Spo0A diffusing around the cell. As such, this modified readout system would present concentrated spots of fluorescence, the intensities of which would correspond to the number of Spo0AP molecules present in the cell.

In order to employ this novel readout system, several different aspects of the research process were carried out simultaneously. We developed an algorithm for identifying spots of localized fluorescence and quantifying the relative intensities of these spots as well as the background fluorescence in each cell. Furthermore, a probabilistic model was developed in which these intensities were used to estimate the fluorescence of single bound and unbound Spo0A molecules, and consequently to estimate intracellular concentrations of Spo0AP and Spo0A. We also completed three subprojects in an effort to create the ideal experimental system. First, we utilized a system in which cells and intracellular fluorescent spots could be observed under chemostatic conditions over long time scales. Second, we developed a media solution that would induce cells to sporulate, so as to study sporulation consistently across a population of cells. Finally, we performed experiments to determine the level of Spo0A for which the dynamic range of localized fluorescence corresponding to Spo0AP is maximized. All of these components will be used by continuing members of the Elowitz Lab to study the fine time course of Spo0AP dynamics during sporulation in B. subtilis.

II. METHODS AND RESULTS

Prior to discussing the work conducted this summer, we would like to relate the general method by which cells were imaged. Typically, cells were imaged on agarose pads. These pads were prepared as follows: an appropriate weight of agarose powder was first mixed into SMS solution by heating. If the cells were to be imaged for extended lengths of time, this solution was then mixed with the desired growth media. The final agarose solution was then applied uniformly on the surface of a cover slip and sealed from above with another cover slip. Once the sealed solution had cooled, the cover slips were removed and the agarose slab in between was sliced into smaller square pads. Cells were then applied on the surface of these pads and, after allowing fifteen minutes for excess liquid to evaporate, the pads were placed face down in petri dishes and imaged.

A. Identification and Quantification of Localized Fluorescence

The problem that had to be solved in order to determine the intensities of spots of localized fluorescence was twofold. First, given an image with many cells, we had to be able to identify, within a given cell, which regions did or did not constitute a spot. Then, given such a spot, an intensity value had to be assigned to the spot in a meaningful way.

Prior to this summer, a validation strain of B. subtilis was created in which a TetR-GFP fusion protein was expressed constitutively, and in which each cell possessed at least one binding array for this fusion protein. This strain was cultured simply because it afforded greater clarity in spot identification than did the experimental strain involving Spo0AP binding arrays. This clarity would ensure an identification algorithm with parameters less susceptible to noise. Furthermore, since all the fusion proteins in the validation strain were capable of binding to the arrays, we expected the ratio of calculated spot fluorescence to background cellular fluorescence to be nearly constant from cell to cell. It is for these two reasons that this strain was used to optimize the algorithm by which spots were identified and their intensities quantified.

Initially, we identified the best conditions, with respect to growth media and progress along the growth curve, in which to image this validation strain of B. subtilis so as to observe clear and distinct spots. We sampled three types of growth media, namely SMS media with glucose, CH media, and LB media. We also sampled cells hourly until they reached stationary phase. It was eventually found that the cells grown in SMS media, two hours into the exponential phase, presented the most desirable spots for imaging. Consequently, we prepared the validation strain in SMS media and after two hours of growth imaged the strain repeatedly, both in brightfield and under a 395 nm laser for GFP excitation, collecting data from several hundred cells. The resulting images were then used to optimize a computational algorithm for spot detection and quantification.

The underlying structure of the algorithm is fairly straightforward. The motivating theme is to select potential candidates for spots by some initial sieve, and then to identify which candidates are true spots using machine learning. Boundaries of cells are determined from the brightfield images using standard edge detection algorithms. Each cell is then handled in an iterative fashion, using the portion of fluorescence data contained within the newly determined boundaries of the cell. First, the maximum intensity pixel in the cell is selected. If this pixel is below some pre-specified intensity threshold, then it is determined that all potential spots in the cell have been identified. Otherwise, the pixels around this maximum intensity pixel are examined in order to determine the point at which the difference in intensity between adjacent pixels falls below a preset threshold of noise, marking the boundary of the spot. Using the pixels within the spot, a number of informative features are calculated for the spot.
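The candidate-selection loop described above can be sketched as follows. This is an illustrative reading of the description, not the code used in the study: the function name `find_candidate_spots`, the 4-connected region growing, the `max_spots` cap, and the threshold values are all assumptions. The background replacement step follows the running-average idea described in the text.

```python
def find_candidate_spots(cell, peak_threshold, noise_threshold, max_spots=10):
    """Return candidate spots in one cell's fluorescence image (list of rows)."""
    img = [list(row) for row in cell]      # working copy that we may overwrite
    h, w = len(img), len(img[0])
    claimed = set()                        # pixels already assigned to a candidate
    spots = []
    for _ in range(max_spots):
        free = [(r, c) for r in range(h) for c in range(w) if (r, c) not in claimed]
        if not free:
            break
        peak = max(free, key=lambda p: img[p[0]][p[1]])
        if img[peak[0]][peak[1]] < peak_threshold:
            break                          # no remaining candidates in this cell
        region, frontier = {peak}, [peak]
        while frontier:
            r, c = frontier.pop()
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                q = (r + dr, c + dc)
                if not (0 <= q[0] < h and 0 <= q[1] < w):
                    continue
                if q in region or q in claimed:
                    continue
                # keep descending while the drop between neighbours exceeds noise
                if img[r][c] - img[q[0]][q[1]] >= noise_threshold:
                    region.add(q)
                    frontier.append(q)
        spots.append({"peak": peak, "pixels": region,
                      "total": sum(img[r][c] for r, c in region)})
        claimed |= region
        rest = [img[r][c] for r in range(h) for c in range(w) if (r, c) not in claimed]
        background = sum(rest) / len(rest) if rest else 0.0
        for r, c in region:
            img[r][c] = background         # erase the candidate before the next pass
    return spots
```

Each returned candidate carries its pixel set, from which features can be computed before a classifier decides whether it is a true spot.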

We decided to use as features the total intensity of the spot (i.e. the sum of pixel intensities for all pixels within the spot, which is also the final assignment of spot intensity), the intensity of the central pixel, the characteristic radius of the spot (i.e. the distance between the central pixel and the boundary), the standard deviation of pixel intensities within the spot, and a measure of the correlation between the pixels in the spot and a Gaussian distribution. Following the calculation of these features, the intensities of the pixels in the potential spot are replaced by a running estimate of the background intensity of the cell, based on an average over pixels that have not yet been identified as being within a potential spot. The entire process is then repeated. Finally, we provided annotations (i.e. whether a potential spot is truly a spot or not) for approximately 100 potential spots, and using these labels and the features of the associated potential spots, we used Linear Discriminant Analysis to compute a general model for identifying real spots among potential spots. The model indicated that Gaussian correlation was by far the greatest indicator of true localization, with ten times greater predictive value than the second best indicator, the characteristic radius. Both of these features showed positive correlation with true localization of fluorescence.
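To make the classification step concrete, the following is a minimal two-class Fisher discriminant restricted to two features, such as Gaussian correlation and characteristic radius. It is a sketch only: the study used five features, and the function names, the small ridge term, the equal-prior threshold, and the toy data are assumptions introduced here for illustration.

```python
def fit_lda_2d(features, labels, ridge=1e-9):
    """Two-class Fisher LDA on 2-D feature vectors; returns weights w and offset b."""
    groups = {0: [], 1: []}
    for x, y in zip(features, labels):
        groups[int(y)].append(x)
    means = {k: (sum(p[0] for p in pts) / len(pts),
                 sum(p[1] for p in pts) / len(pts))
             for k, pts in groups.items()}
    # pooled within-class scatter matrix [[sxx, sxy], [sxy, syy]]
    sxx = sxy = syy = 0.0
    for k, pts in groups.items():
        mx, my = means[k]
        for px, py in pts:
            sxx += (px - mx) ** 2
            sxy += (px - mx) * (py - my)
            syy += (py - my) ** 2
    sxx += ridge                      # tiny ridge keeps the matrix invertible
    syy += ridge
    det = sxx * syy - sxy * sxy
    dx = means[1][0] - means[0][0]
    dy = means[1][1] - means[0][1]
    # w = S^{-1} (mean_1 - mean_0), using the closed-form 2x2 inverse
    w = ((syy * dx - sxy * dy) / det, (-sxy * dx + sxx * dy) / det)
    mid = ((means[0][0] + means[1][0]) / 2, (means[0][1] + means[1][1]) / 2)
    b = -(w[0] * mid[0] + w[1] * mid[1])  # boundary through the midpoint (equal priors)
    return w, b

def is_true_spot(w, b, x):
    """Classify a candidate from its (gaussian_correlation, radius) features."""
    return w[0] * x[0] + w[1] * x[1] + b > 0

# Made-up annotated candidates: (gaussian_correlation, radius), label 1 = true spot
X = [(0.2, 1.0), (0.3, 0.9), (0.1, 1.2), (0.25, 1.3),
     (0.9, 2.0), (0.8, 2.3), (0.95, 1.8), (0.85, 2.2)]
y = [0, 0, 0, 0, 1, 1, 1, 1]
w, b = fit_lda_2d(X, y)
```

The relative magnitudes of the weights, once features are put on a common scale, are what would indicate which feature carries more predictive value.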

Cross validation tests with the annotated data set indicate that this computed model has a very low false negative rate, and a false positive rate of approximately 0.05. However, this is somewhat misleading. It is conceivable that while this model has very good predictive value among spots within cells of the validation strain, the features that are most important over a larger population of cells, or more specifically over the cells of the experimental strain, might not be the same features identified from the validation strain alone. Furthermore, Linear Discriminant Analysis assumes a linear dependence on the features being used, and for our choice of features this assumption is quite likely to be erroneous. As a simple example, consider that spots with either abnormally small or abnormally large characteristic radii might in fact be nothing more than slight perturbations from the background fluorescence caused by stochastic diffusion of the fusion proteins. This is a highly nonlinear effect, and it arises in a feature identified as being particularly informative under the linear approximation. Thus, as we proceed it would be worthwhile to consider employing nonlinear machine learning algorithms. Furthermore, it would also be useful to iteratively optimize the weights determined for each feature by employing the algorithm over mixed populations of cells, using the output of the previous iteration as the starting point for the next iteration, and eventually determining feature weights solely over the cells of the experimental strain.

B. Probabilistic Model for Determining Molecule Counts

The fluorescent intensities computed in the model generated by Linear Discriminant Analysis, while certainly informative, leave much to be desired on a number of fronts. While a difference in intensity between different spots necessarily corresponds to a difference of the same sign between the occupancies of the two corresponding binding arrays, these differences do not share a linear relation, or an easily discernible relation for that matter. Furthermore, in some cells of both the validation and experimental strains, the original array insertion process was so successful that not one, but two arrays were transformed, and in these cells the distribution of total localized fluorescence among the two corresponding spots offers additional information.

To extract this information, in particular the conversion between intensities and molecule counts, a probabilistic model was developed to describe the system at the level of chemical kinetics and statistical mechanics. We begin with a simplified model, in which we consider only the distribution of molecules between the two binding arrays, assuming a fixed number of total bound molecules. We then proceed to consider both bound and unbound molecules, considering at this stage what information might not be available to us at present. Finally, we discuss how to infer the fluorescence of a single fusion protein from these models. We leave the mathematical details of these models for the appendices, including only the results of our calculations here.

In considering the simplified model, let there be $N_B$ copies of the fusion protein X bound in total and M binding sites for X in each of two binding arrays. Note that $2M \ge N_B$. We assume no cooperativity in the binding of different sites, and that each binding site is identical and has equal probability of being bound. Label the two binding arrays 1 and 2, and let $N_i$ be the number of X bound to array i. Since $N_2 = N_B - N_1$, only $N_1$ is needed to characterize the macrostate of the system. From these assumptions, we calculated that

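$$\langle N_1 \rangle = \frac{N}{2}$$

$$\langle N_1^2 \rangle = \frac{N}{2}\left[1 + \frac{(N-1)(M-1)}{2M-1}\right]$$

$$\left\langle \left(\frac{N_1 - N_2}{2}\right)^2 \right\rangle = \sigma_{N_1}^2 = \langle N_1^2 \rangle - \langle N_1 \rangle^2 = \frac{N(2M-N)}{4(2M-1)}$$

In these three expressions, $N$ abbreviates the number of bound copies $N_B$.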
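The closed forms above can be checked by exact enumeration over the occupancy of array 1. The sketch below uses only the standard library and exact rational arithmetic. The function name `occupancy_moments` is introduced here for illustration, M = 256 matches the array size quoted in the abstract, and N = 100 is an arbitrary choice.

```python
from fractions import Fraction
from math import comb

def occupancy_moments(M, N):
    """Exact mean, second moment, and variance of N_1 when N bound copies
    occupy 2M equivalent sites split evenly between two arrays."""
    total = comb(2 * M, N)
    mean = Fraction(0)
    second = Fraction(0)
    for n in range(max(0, N - M), min(M, N) + 1):
        # probability that exactly n copies sit on array 1
        p = Fraction(comb(M, n) * comb(M, N - n), total)
        mean += n * p
        second += n * n * p
    return mean, second, second - mean ** 2

M, N = 256, 100
mean, second, var = occupancy_moments(M, N)
```

Because the arithmetic is exact, agreement with the closed forms is an equality rather than an approximation.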

Now we extend our analysis to unbound X. Let $N$ be the total number of copies of X present, $L$ the number of free 'lattice' sites available in the cytoplasm for X, $-E$ the energy change due to a single molecule of X binding to a binding site on either array, $T$ the temperature of the solution in kelvin, and $x$ the Boltzmann factor $e^{E/kT}$. From these assumptions and statistical mechanics, we calculated the averages and variances of $N_1$ and $N_B$ under this model. However, because the parameters $E$ and $L$ are not known at present, the results cannot be used directly to infer the single-molecule fluorescent intensity. Furthermore, neither parameter can be inferred individually from the data, though a joint function of both might be.

Thus, in order to infer $\nu$, the fluorescent intensity of a single X, we must rely solely on the distribution of intensity between the two spots within cells, and not on the background fluorescence levels. For some arbitrary numbering of the cells from which data was collected, let the values of $N, N_B, N_1, N_2$ for cell $i$ be $N_i, N_{i,B}, N_{i,1}, N_{i,2}$. Furthermore, let the actual fluorescence measurements from each of these cells be denoted by replacing the $N$ in the corresponding molecule count with $Y$. Thus, for any set of subscripts $j$, we have that $Y_j = \nu N_j$. Furthermore, define $F = \nu M$. Given these definitions and the model we have described, we pick as our estimate of $\nu$ the value for which $p(\nu \mid Y_{i,1}, Y_{i,2}\ \forall i)$ is maximized. This distribution can be determined using Bayes' law and a uniform but restricted prior $p(\nu)$ over possible values of the proportionality constant $\nu$. Carrying out this calculation gives

$$\nu \approx 2F \left\langle \frac{(Y_{i,1} - Y_{i,2})^2}{Y_i\,(2F - Y_i)} \right\rangle$$

where the average is taken over cells $i$.

All that remains, then, is to determine the value of $F$ in this expression. A heuristic method that we found to be surprisingly effective was simply to note that the variance $\sigma_{N_1}^2$ for fixed $N_B$ is maximized when $N_B = M$, and thus to select $F$ as the value of $Y_B$ for which $\sigma_{Y_1}^2$ is maximized. We are currently in the process of solving the joint inference problem for $F$ and $\nu$ simultaneously.
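The estimator for ν can be exercised on synthetic data. The sketch below is an illustration under stated assumptions, not the inference code used in the study: it takes F as known, treats the Y in the denominator as the total bound fluorescence Y1 + Y2 of each cell, averages the per-cell ratio over cells, and draws site occupancy uniformly at random as in the simplified model. All names and parameter values are hypothetical.

```python
import random

def simulate_cells(nu, M, n_cells, rng):
    """Return (Y1, Y2) spot intensities for cells with two arrays of M sites."""
    cells = []
    for _ in range(n_cells):
        n_bound = rng.randint(1, 2 * M - 1)           # total bound copies, N_B
        occupied = rng.sample(range(2 * M), n_bound)  # which sites are occupied
        n1 = sum(1 for site in occupied if site < M)  # copies bound to array 1
        cells.append((nu * n1, nu * (n_bound - n1)))
    return cells

def estimate_nu(cells, F):
    """nu ~ 2F * mean over cells of (Y1 - Y2)^2 / (Y (2F - Y)), Y = Y1 + Y2."""
    ratios = []
    for y1, y2 in cells:
        y = y1 + y2
        ratios.append((y1 - y2) ** 2 / (y * (2 * F - y)))
    return 2 * F * sum(ratios) / len(ratios)

rng = random.Random(0)
true_nu, M = 3.0, 64
cells = simulate_cells(true_nu, M, 4000, rng)
nu_hat = estimate_nu(cells, F=true_nu * M)
```

Under this model the estimate carries a small upward bias of 2M/(2M − 1), which is negligible for large arrays. Measurement noise in real intensities is not modeled here.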

C. Experimental Optimization

It is crucial that the proper conditions are maintained for growth of B. subtilis during experiments. In particular, we decided to conduct all experiments in a bacterial microfluidic chemostat. While prior experiments had been performed in non-chemostatic environments, these experiments presented many difficulties. Firstly, we had been unable to track the growth of the B. subtilis cells for more than ten generations; indeed, while we are very interested in the dynamics of Spo0A phosphorylation over long time intervals, the large number of progeny generated from ten generations of division simply frustrated our attempts to measure these dynamics. Furthermore, it is effectively impossible to vary environmental conditions in a controlled way without using a chemostat.

In light of these advantages, we opted to use a chemostatic device in which only one lineage of B. subtilis cells is actively maintained. Such a device was recently developed by the Jun lab at Harvard. Their work utilized a device in which a narrow growth channel, 1 µm wide, abuts a larger trench through which a chemostatic nutrient solution flows. The growth channel initially contains a single cell. As this cell divides, its progeny are pushed out from the growth channel. At the same time, the mother cell, remaining in the growth channel, is exposed to a chemostatic solution via diffusion of nutrients from the much larger adjacent trench. In this way, then, only one lineage of the cells is retained, and this lineage is kept in a chemostatic environment which can be controlled and altered at will. We utilized this device, termed the 'mother machine', to study Spo0A phosphorylation dynamics in B. subtilis cells. We manufactured this device and optimized its design parameters by making liberal use of the Kavli Nanoscience Institute's (KNI) cleanroom facilities. In particular, the different mixtures of photoresist used, the speed at which the photoresist is spun, and finally the amount of time for which the photoresist is exposed to ultraviolet light affect both the height and the definition of the channels in the device. As such, by iterating over these parameters in a systematic fashion and assuming a linear dependence of both height and definition on these parameters, we optimized channel height and channel definition.

Concurrently, we worked to develop a conditioned media solution to induce sporulation in B. subtilis cells as follows. First, wild type B. subtilis grown in nutrient rich media were isolated by centrifugation and resuspended in nutrient deprived resuspension media. After some period of time, the cells were filtered out of the resuspension media solution, and what remained was taken to be the conditioned media. In an effort to test the efficacy of the conditioned media, we grew the cells, either in liquid or solid phase, in four conditions, namely without glucose, with glucose, in conditioned media without glucose, and in conditioned media with glucose. For each test, we expected the first group to sporulate, albeit slowly, the second group not to sporulate, and the third group to sporulate relatively quickly. Sporulation of the fourth group would have constituted conclusive validation of the efficacy of the conditioned media, but a lack of sporulation would not have been conclusive. We optimized over these different preparation protocols by varying the initial growth media, the stage of growth at which cells were removed from this media, and the amount of time spent in resuspension media before filtration. We found that the most effective conditioned media was produced by taking cells grown to optical density 1 in CH media and culturing them in resuspension media for 2 hours. This media was subsequently tested on cells within the mother machine, and successfully induced sporulation in this setting.

Finally, we also attempted to optimize the dynamic range of Spo0AP visualization by varying Spo0A production. To do this, we replaced the Spo0AP inducible promoter controlling production of Spo0A with a xylose inducible promoter. We then proceeded to image cells on media pads treated with xylose, under 587 nm laser excitation to visualize mCherry signal and determine the level of xylose induction required to achieve the greatest ratio between localized fluorescence and background fluorescence. After much experimentation with pre-imaging growth routines, we finally determined that the basal rate of activity of the xylose inducible promoter produced the greatest dynamic range of Spo0AP visualization. Thus, as we proceed we will likely need to choose a more tightly regulated promoter to control Spo0A production.

In conclusion, then, we were able to optimize a number of components required to analyze Spo0AP dynamics during sporulation in B. subtilis. We optimized the design for a microfluidic system for analyzing individual cell lineages and produced the optimized device. We also developed a protocol for producing conditioned media that induces sporulation in B. subtilis. Furthermore, we determined the level of induction of Spo0A production required for maximum dynamic range of Spo0AP visualization. Finally, we developed an algorithm for identifying and quantifying spots of localized fluorescence in cells, as well as a probabilistic model for determining actual molecule counts for Spo0A and Spo0AP from these quantified fluorescence levels. As we move forward, we intend to replace the xylose inducible promoter with an IPTG inducible promoter for tighter regulation, and to implement the dual inference of both binding array size and single molecule fluorescence intensity. Finally, with all of these components now available to us, we wish to examine the sporulation system and answer two particular questions. We will determine bounds on the time scale of Spo0A phosphopulses, and we will determine whether the amplitude of these pulses has any effect on the overall timescale of sporulation.

Acknowledgments

The author would like to thank Joseph H. Levine for his patience and guidance throughout the research process and to thank Maria Hernandez for constructing all of the strains used in experiments this summer. Furthermore, the author is indebted to Professor Michael B. Elowitz for allowing him the privilege to perform research in the Elowitz Lab at the California Institute of Technology.

[1] Grossman, A.D., Genetic networks controlling the initiation of sporulation and the development of genetic competence in Bacillus subtilis. Annu Rev Genet, 1995. 29: p. 477-508.

[2] Molle, V., et al., The Spo0A regulon of Bacillus subtilis. Mol Microbiol, 2003. 50(5): p. 1683-1701.

[3] Newport, J. and M. Kirschner, A major developmental transition in early Xenopus embryos: I. characterization and timing of cellular changes at the midblastula stage. Cell, 1982. 30(3): p. 675-86.

[4] Raff, M., Intracellular developmental timers. Cold Spring Harb Symp Quant Biol, 2007. 72: p. 431-5.

[5] Sonenshein, A.L., Control of sporulation initiation in Bacillus subtilis. Curr Opin Microbiol, 2000. 3(6): p. 561-6.

[6] Wang, P., L. Robert, J. Pelletier, W.L. Dang, F. Taddei, A. Wright, and S. Jun, Robust growth of Escherichia coli. Curr Biol, 2010. 20: p. 1099-1103.

[7] Waters, C.M. and B.L. Bassler, Quorum sensing: cell-to-cell communication in bacteria. Annu Rev Cell Dev Biol, 2005. 21: p. 319-46.



Figure 1. Shown above is a schematic diagram of the experimental system. Each image on the left depicts a molecular scenario, while the corresponding image on the right illustrates the observable fluorescence pattern associated with the molecular scenario. When the fusion proteins are bound, their fluorescence is much more localized and intense.

Figure 2. Shown to the left is a schematic of the mother machine, followed by a depiction of the way in which the device maintains single cell lineages. As can be observed, there is one inlet channel and one outlet channel connecting to the central trench, which supplies nutrients to and draws excess cells from the side channels.



Figure 5. Gillespie simulations were used to simulate the two-array system, producing the following data (labeled in blue) for some fixed value of $\nu$. The values labeled in red denote the standard deviation of all data points with $Y_T$ within $10^4$ of the denoted $Y_T$ value. These depict the process by which $Y_M = F$ was determined as the value of $Y_T$ for which this standard deviation is maximized. Finally, the black curve depicts the expected value of the standard deviation in single-array fluorescence.

Figure 3. Induction of sporulation by conditioned media in the mother machine.

Figure 4. Visualization of Spo0AP localized fluorescence during xylose induction of Spo0A production.



Appendix: Probabilistic Model

1 Binomial Treatment

The problem statement here is as follows. Let there be $N_B$ copies of the fusion protein X bound in total and M binding sites for X in each of two binding arrays. Note that $2M \ge N_B$. We assume no cooperativity in the binding of different sites, and that each binding site is identical and has equal probability of being bound. Label the two binding arrays 1 and 2, and let $N_i$ be the number of X bound to array i. Since $N_2 = N_B - N_1$, only $N_1$ is needed to characterize the macrostate of the system. Now, there exist 2M binding sites and $N$ indistinguishable copies of X (writing $N$ for $N_B$ throughout this section), and so there exist $\binom{2M}{N}$ total binding configurations. Of these, $\binom{M}{n}\binom{M}{N-n}$ configurations have $N_1 = n$ copies of X bound to array 1. Thus, the probability that $N_1 = n$ is given by

$$P(N_1 = n) = \frac{\binom{M}{n}\binom{M}{N-n}}{\binom{2M}{N}}.$$

Let us consider now the function $f(x, y) = (1 + xy)^M (1 + y)^M$. Expanding this expression gives

$$f(x, y) = \left[\sum_n \binom{M}{n} x^n y^n\right]\left[\sum_n \binom{M}{n} y^n\right] = \sum_N \sum_n \binom{M}{n} x^n y^n \binom{M}{N-n} y^{N-n} = \sum_N \sum_n \binom{M}{n}\binom{M}{N-n} x^n y^N$$

The coefficient of $y^N$ in this polynomial (denoted by $[y^N](f(x, y))$) is $\binom{2M}{N}$ times the generating function for our probability distribution. From this, we have that

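$$\begin{aligned}
\binom{2M}{N}\langle N_1 \rangle &= \sum_n n \binom{M}{n}\binom{M}{N-n} = [y^N]\left(\sum_N \sum_n n \binom{M}{n}\binom{M}{N-n} x^n y^N\right)\Bigg|_{x=1} \\
&= [y^N]\left(x \frac{\partial}{\partial x} f(x, y)\right)\Bigg|_{x=1} = [y^N]\left(M x y (1 + xy)^{M-1}(1 + y)^M\right)\Big|_{x=1} \\
&= [y^{N-1}]\left(M (1 + y)^{2M-1}\right) = M \binom{2M-1}{N-1}
\end{aligned}$$

$$\begin{aligned}
\binom{2M}{N}\langle N_1^2 \rangle &= \sum_n n^2 \binom{M}{n}\binom{M}{N-n} = [y^N]\left(\sum_N \sum_n n^2 \binom{M}{n}\binom{M}{N-n} x^n y^N\right)\Bigg|_{x=1} \\
&= [y^N]\left(x \frac{\partial}{\partial x}\, x \frac{\partial}{\partial x} f(x, y)\right)\Bigg|_{x=1} \\
&= [y^N]\left(M x y (1 + xy)^{M-1}(1 + y)^M + M(M-1) x^2 y^2 (1 + xy)^{M-2}(1 + y)^M\right)\Big|_{x=1} \\
&= [y^{N-1}]\left(M (1 + y)^{2M-1}\right) + [y^{N-2}]\left(M(M-1)(1 + y)^{2M-2}\right) \\
&= M \binom{2M-1}{N-1} + M(M-1)\binom{2M-2}{N-2}
\end{aligned}$$

Dividing through by $\binom{2M}{N}$ gives

$$\langle N_1 \rangle = \frac{N}{2}$$

$$\langle N_1^2 \rangle = \frac{N}{2}\left[1 + \frac{(N-1)(M-1)}{2M-1}\right]$$

$$\frac{\langle (N_1 - N_2)^2 \rangle}{4} = \sigma_{N_1}^2 = \langle N_1^2 \rangle - \langle N_1 \rangle^2 = \frac{N(2M-N)}{4(2M-1)}$$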
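As a quick sanity check of these closed forms, consider the smallest nontrivial case, $M = 2$ sites per array and $N = 2$ bound copies. This case is chosen here purely for illustration and does not come from the experiments. Enumerating the $\binom{4}{2} = 6$ configurations directly reproduces each formula:

```latex
\begin{align*}
P(N_1 = 0) &= \tfrac{1}{6}, \quad P(N_1 = 1) = \tfrac{4}{6}, \quad P(N_1 = 2) = \tfrac{1}{6}, \\
\langle N_1 \rangle &= \tfrac{0 \cdot 1 + 1 \cdot 4 + 2 \cdot 1}{6} = 1 = \tfrac{N}{2}, \\
\langle N_1^2 \rangle &= \tfrac{0 \cdot 1 + 1 \cdot 4 + 4 \cdot 1}{6} = \tfrac{4}{3}
  = \tfrac{N}{2}\left[1 + \tfrac{(N-1)(M-1)}{2M-1}\right] = 1 + \tfrac{1}{3}, \\
\sigma_{N_1}^2 &= \tfrac{4}{3} - 1 = \tfrac{1}{3}
  = \tfrac{N(2M-N)}{4(2M-1)} = \tfrac{2 \cdot 2}{4 \cdot 3}.
\end{align*}
```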



2 Generalized Treatment

Consider now a more general system, in which two binding arrays for protein X and a number of copies of protein X are present in the cytoplasm. Let $N$ be the total number of copies of X present, $N_B$ the number of X molecules bound to either array, $M$ the number of binding sites in each array, $N_1$ and $N_2$ the numbers of X molecules bound to the first and second array respectively, $L$ the number of free 'lattice' sites available in the cytoplasm for X, $-E$ the energy change due to a single molecule of X binding to a binding site on either array, $T$ the temperature of the solution in kelvin, and $x = e^{E/kT}$.

Now, for any given value of $N_B$, the treatment in the previous section gives the expected statistics for $N_1$. In particular, for any analytic function $g(N_1)$, the treatment in the previous section allows us to determine an analytic function $h(N_B) = \langle g(N_1) \rangle$, where the average is taken over all possible $N_1$ for fixed $N_B$. In the case of variable $N_B$ then, as we are discussing here, $\langle g(N_1) \rangle = \langle h(N_B) \rangle$, where both averages are taken over all possible configurations of the system. Thus, we need only consider in this section determining averages of the form $\langle h(N_B) \rangle$, for analytic $h$. Now, from statistical mechanics we have that the probability of obtaining a particular $N_B$ is given by

$$P(N_B) = \frac{1}{Z}\binom{L}{N - N_B}\binom{2M}{N_B} x^{N_B}, \quad \text{where the normalization factor is } Z = \sum_{N_B}\binom{L}{N - N_B}\binom{2M}{N_B} x^{N_B}.$$

Employing the same argument as used in the first section, we obtain that

\[
Z = [w^{N}]\,(1+w)^{L}(1+xw)^{2M} = \binom{L}{N}\sum_{i=0}^{\infty}\frac{(2M)_i\,(N)_i\,x^{i}}{i!\,(L-N+1)^{(i)}},
\]

where $(a)_i = a(a-1)\cdots(a-i+1)$ and $a^{(i)} = a(a+1)\cdots(a+i-1)$ are the falling and rising factorials respectively. Here we introduce our first assumption, namely that $L > N$. We must note, of course, that the cytoplasm does not actually behave as a lattice with a finite number of sites, but even in such a model as ours, for which the cytoplasmic space is discretized, it is ludicrous to suggest that the number of copies of $X$ present in any functioning cell would outnumber the total number of cytoplasmic sites available. Given this assumption, then, it is clear that every term in this series is well defined. Furthermore, the series must eventually terminate, as $(a)_i = 0$ for $i \geq a + 1$ when $a$ is a non-negative integer.
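The equality between the coefficient extraction and the factorial series can be confirmed numerically. The sketch below is only an illustration: the helper names (`falling`, `rising`, `z_direct`, `z_series`) and the parameter values are assumptions chosen for the example, and $L > N$ is assumed as in the text.

```python
from fractions import Fraction
from math import comb, factorial


def falling(a, i):
    """Falling factorial (a)_i = a (a-1) ... (a-i+1)."""
    out = 1
    for k in range(i):
        out *= a - k
    return out


def rising(a, i):
    """Rising factorial a^(i) = a (a+1) ... (a+i-1)."""
    out = 1
    for k in range(i):
        out *= a + k
    return out


def z_direct(L, M, N, x):
    """Coefficient of w^N in (1+w)^L (1+xw)^(2M), summed term by term."""
    return sum(
        comb(L, N - i) * comb(2 * M, i) * x ** i
        for i in range(min(N, 2 * M) + 1)
    )


def z_series(L, M, N, x):
    """C(L, N) times the factorial series; terms with i > min(N, 2M) vanish."""
    total = Fraction(0)
    for i in range(N + 1):
        coeff = Fraction(
            falling(2 * M, i) * falling(N, i),
            factorial(i) * rising(L - N + 1, i),
        )
        total += coeff * x ** i
    return comb(L, N) * total


# Illustrative values only; x is kept rational so the comparison is exact.
x = Fraction(5, 2)
print(z_direct(50, 3, 4, x) == z_series(50, 3, 4, x))
print(z_direct(30, 2, 10, x) == z_series(30, 2, 10, x))
```

The second call uses $N > 2M$, where the series stops because $(2M)_i$ vanishes rather than $(N)_i$.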

Using exact methods this calculation can be taken no further, as this series does not yield a closed form (it is in fact a hypergeometric function). Thus, to proceed we must enforce further assumptions and examine different regimes of behavior. We assume, then, that $L \gg 2M, N$. Again, this is a reasonable assumption because of the vast and continuous nature of the cytoplasm in comparison to the size of individual proteins.

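Under this assumption, we have that $(L-N+1)^{(i)} \approx L^{i}$. Finally, we shall consider two different regimes of behavior, namely $N \ll 2M$ and $N \gg 2M$. In the former case $(2M)_i \approx (2M)^{i}$ for every contributing $i \leq N$, and the binomial theorem reduces our expression to

\[
Z \approx \binom{L}{N}\left(1 + \frac{2Mx}{L}\right)^{N},
\]

while in the latter case $(N)_i \approx N^{i}$ for every contributing $i \leq 2M$, and our expression reduces to

\[
Z \approx \binom{L}{N}\left(1 + \frac{Nx}{L}\right)^{2M}.
\]

From these two expressions, we can calculate the quantities of interest in each of the two regimes. In particular, we now have that

\[
\langle h(N_B)\rangle = \frac{1}{Z}\, h\!\left(x\frac{d}{dx}\right) Z.
\]

Applying this in conjunction with the expressions obtained in the first section gives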

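\[
\langle N_B\rangle =
\begin{cases}
\dfrac{2MNx}{L+2Mx}, & \text{if } N \ll 2M \\[2ex]
\dfrac{2MNx}{L+Nx}, & \text{if } N \gg 2M
\end{cases}
\]

\[
\langle N_B^2\rangle =
\begin{cases}
\dfrac{2MNx}{L+2Mx} + \dfrac{(2Mx)^2 N(N-1)}{(L+2Mx)^2}, & \text{if } N \ll 2M \\[2ex]
\dfrac{2MNx}{L+Nx} + \dfrac{(Nx)^2 (2M)(2M-1)}{(L+Nx)^2}, & \text{if } N \gg 2M
\end{cases}
\]

\[
\sigma_{N_B}^2 =
\begin{cases}
\dfrac{2MNx}{L+2Mx} - \dfrac{(2Mx)^2 N}{(L+2Mx)^2}, & \text{if } N \ll 2M \\[2ex]
\dfrac{2MNx}{L+Nx} - \dfrac{(Nx)^2 (2M)}{(L+Nx)^2}, & \text{if } N \gg 2M
\end{cases}
\]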
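As a rough numerical check of the two regimes, the approximate mean and variance of $N_B$ can be compared against the exact values computed from the weights $\binom{L}{N-N_B}\binom{2M}{N_B}x^{N_B}$. This is a sketch under assumed parameter values: the numbers for $L$, $M$, $N$ and $x$ are illustrative and are not measurements, and the function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def exact_nb_stats(L, M, N, x):
    """Exact mean and variance of N_B from weights C(L, N-k) C(2M, k) x^k."""
    x = Fraction(x)  # rational arithmetic avoids overflow for large binomials
    ks = range(min(N, 2 * M) + 1)
    weights = [comb(L, N - k) * comb(2 * M, k) * x ** k for k in ks]
    Z = sum(weights)
    mean = sum(k * w for k, w in zip(ks, weights)) / Z
    second = sum(k * k * w for k, w in zip(ks, weights)) / Z
    return float(mean), float(second - mean ** 2)


def approx_nb_stats(L, M, N, x, regime):
    """Approximate mean and variance of N_B in the regime named by `regime`."""
    if regime == "N<<2M":
        n, p = N, 2 * M * x / (L + 2 * M * x)
    elif regime == "N>>2M":
        n, p = 2 * M, N * x / (L + N * x)
    else:
        raise ValueError("regime must be 'N<<2M' or 'N>>2M'")
    # n*p and n*p*(1-p) are the mean and variance expressions in the text.
    return n * p, n * p * (1 - p)


# Illustrative parameters only, chosen so that L >> 2M, N in both cases.
cases = [
    (10 ** 5, 200, 4, 20, "N<<2M"),
    (10 ** 5, 2, 400, 20, "N>>2M"),
]
for L, M, N, x, regime in cases:
    exact = exact_nb_stats(L, M, N, x)
    approx = approx_nb_stats(L, M, N, x, regime)
    print(regime, "exact:", exact, "approx:", approx)
```

In both regimes the approximate expressions have the form of a binomial mean and variance, which is why the helper returns `n * p` and `n * p * (1 - p)`.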




\[
\langle N_1\rangle =
\begin{cases}
\dfrac{MNx}{L+2Mx}, & \text{if } N \ll 2M \\[2ex]
\dfrac{MNx}{L+Nx}, & \text{if } N \gg 2M
\end{cases}
\]

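Averaging the fixed-$N_B$ result of the first section, $\langle N_1^2\rangle = \tfrac{1}{2}N_B + \tfrac{M-1}{2(2M-1)}N_B(N_B-1)$, over $N_B$ gives

\[
\langle N_1^2\rangle =
\begin{cases}
\dfrac{MNx}{L+2Mx} + \dfrac{2(M-1)N(N-1)(Mx)^2}{(2M-1)(L+2Mx)^2}, & \text{if } N \ll 2M \\[2ex]
\dfrac{MNx}{L+Nx} + \dfrac{M(M-1)(Nx)^2}{(L+Nx)^2}, & \text{if } N \gg 2M
\end{cases}
\]

\[
\sigma_{N_1}^2 = \langle N_1^2\rangle - \langle N_1\rangle^2 =
\begin{cases}
\dfrac{MNx}{L+2Mx} - \dfrac{(Mx)^2}{(L+2Mx)^2}\left[\dfrac{1}{2M-1}N^2 + \dfrac{2(M-1)}{2M-1}N\right], & \text{if } N \ll 2M \\[2ex]
\dfrac{MNxL}{(L+Nx)^2}, & \text{if } N \gg 2M
\end{cases}
\]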

3 Inference of Proportionality Constants

We now consider a system in which each copy of protein $X$ is replaced with a fusion protein, X-GFP. Now, whether bound to an array or floating free in the cytoplasm, each $X$ molecule will produce a fluorescence signal with amplitude $\nu$. Only measurements of fluorescence can be taken on this system, but if $\nu$ were known, these fluorescence measurements could be converted into exact molecule counts. We will use the probabilistic variations, described by the quantities calculated in the previous two sections, to calculate $\nu$. However, since the parameter values for $L$ and $E$ in the generalized treatment are not known, we will only perform the inference using the binomial model.

3.1 Independent Dependencies

We would like to select the value of $\nu$ such that the probability $p(\nu \mid d) \propto p(d \mid \nu)$ is maximized for dataset $d$. Suppose each cell in our data set is labeled with an index $i$. Then for cell $i$, the data points $Y_{i,1}, Y_{i,2}$ are collected, corresponding to the fluorescence measurements from the first and second binding arrays respectively. Furthermore, $Y_{i,1} = \nu N_{i,1}$ and $Y_{i,2} = \nu N_{i,2}$ by definition, where $N_{i,1}, N_{i,2}$ are the numbers of $X$ molecules present within cell $i$ and bound to array 1 and array 2 respectively. Now, in section 1 we calculated the first two moments of $N_1$ as a function of $N_B$. For a sufficiently large number of cells, the central limit theorem allows us to asymptotically determine that

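\[
p(Y_{i,1} \mid Y_{i,2}, \nu) \approx
\sqrt{\frac{4F}{\pi\nu\,(Y_{i,1}+Y_{i,2})\bigl(2F-(Y_{i,1}+Y_{i,2})\bigr)}}\;
\exp\!\left[-\frac{4F\bigl(Y_{i,1}-\tfrac{1}{2}(Y_{i,1}+Y_{i,2})\bigr)^{2}}{\nu\,(Y_{i,1}+Y_{i,2})\bigl(2F-(Y_{i,1}+Y_{i,2})\bigr)}\right]
\]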

Furthermore, since each cell is independent, we have that $p(Y_{i,1}\,\forall i \mid \nu, Y_{i,2}\,\forall i) = \prod_i p(Y_{i,1} \mid \nu, Y_{i,2})$. Since maximizing this distribution with respect to $\nu$ is equivalent to maximizing the logarithm of this distribution with respect to $\nu$, we differentiate $\ln p$ with respect to $\nu$ and solve for the $\nu$ at which the resulting expression is 0. This gives the result established in the main text, namely that

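\[
\nu \approx 2F \left\langle \frac{(Y_{i,1}-Y_{i,2})^{2}}{Y_i\,(2F-Y_i)} \right\rangle,
\]

where $Y_i = Y_{i,1} + Y_{i,2}$ and the average is taken over cells $i$.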
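The estimator can be tried on synthetic data. The sketch below rests on several assumptions that are not stated in this appendix: $F$ is taken to be the fluorescence of a fully occupied array, $F = \nu M$; the bound molecules in each cell are split between the arrays as in section 1; and the number bound per cell is drawn uniformly from an arbitrary range. The function names and all numerical values are hypothetical.

```python
import random


def simulate_cell(M, n_bound, nu, rng):
    """Return (Y1, Y2) for one cell with n_bound molecules on the 2M sites.

    Occupied sites are drawn uniformly without replacement; sites
    0..M-1 are taken to belong to array 1 and the rest to array 2.
    """
    occupied = rng.sample(range(2 * M), n_bound)
    n1 = sum(1 for site in occupied if site < M)
    return nu * n1, nu * (n_bound - n1)


def estimate_nu(pairs, F):
    """Average of 2F (Y1 - Y2)^2 / (Y (2F - Y)) over cells, with Y = Y1 + Y2."""
    terms = [
        (y1 - y2) ** 2 / ((y1 + y2) * (2 * F - (y1 + y2)))
        for y1, y2 in pairs
    ]
    return 2 * F * sum(terms) / len(terms)


rng = random.Random(0)
M = 256          # binding sites per array
nu_true = 3.0    # hypothetical fluorescence per molecule
F = nu_true * M  # ASSUMPTION: F is the fluorescence of a saturated array

# Hypothetical spread of bound molecules per cell, kept strictly
# between 0 and 2M so that every denominator is positive.
pairs = [
    simulate_cell(M, rng.randint(50, 400), nu_true, rng)
    for _ in range(5000)
]
nu_hat = estimate_nu(pairs, F)
print("true nu:", nu_true, "estimated nu:", nu_hat)
```

Under these assumptions the estimate should land close to the value of `nu_true` used to generate the data, with scatter that shrinks as the number of simulated cells grows.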
