
University of Turin

Doctoral school in complex systems for Life Sciences

A combination of transcriptional and microRNA regulation improves the stability of the relative concentrations of target genes

Candidate: Andrea Riba

Supervisor: Prof. Michele Caselle

PhD thesis

November 29, 2014

Abstract

The cell is a computational device made of several thousands of interactors. It monitors its environment and evaluates the precise amount of each protein that has to be produced. This complex task of information processing is performed by complex regulatory networks, among which a crucial role is played by the network of gene expression regulation. This work focuses on the discovery of subunits of this specific network that are relevant for the cellular response to environmental signals as well as for its robustness to perturbations (i.e., for homeostasis). In fact, the cell has to function reliably in the presence of internal and external stochastic fluctuations, but at the same time has to adapt to environmental changes. This twofold task is partially implemented at the level of gene expression regulation. More specifically, gene expression is controlled both at the level of transcription by transcription factors, and at the level of translation by microRNAs. These two layers of regulation combine in a mixed network of interactions, composed of small modules or genetic circuits performing specific functions. In this dissertation, the role of microRNA regulation in these circuits is analyzed using computational and analytical tools from the theory of stochastic processes, and bioinformatic data analysis. The main result of this analysis is that different circuit architectures involving microRNA regulation can fine-tune the relative expression of sets of genes while ensuring the robustness of their expression to stochastic fluctuations. The specific role played by the physical mechanism of microRNA regulation in implementing these functions is analyzed in detail.

Contents

1 Introduction on gene expression
  1.1 Gene promoter
  1.2 Transcription
  1.3 Translation
  1.4 Post-transcriptional regulation
  1.5 Regulatory network

2 Introduction to gene expression model
  2.1 Markov processes and the Chapman-Kolmogorov equation (CK)
  2.2 Chemical Master equation (CME)
  2.3 Generating function
  2.4 Linear noise approximation
  2.5 Numerical solution of the Master Equation
    2.5.1 Gillespie algorithm (SSA)
  2.6 Cases of study
    2.6.1 Production-degradation process
    2.6.2 Expression of a constitutive gene
    2.6.3 Transcriptional regulation

3 miRNA coherent feed forward loop
  3.1 Putative functions of micFFLs
  3.2 Deterministic analysis
  3.3 Steady state analysis with the logic approximation
  3.4 Models in detail
  3.5 Master equations
  3.6 Stochastic Analysis
  3.7 Comparison between NM4 and micFFL
  3.8 A prototypical example: the micFFL involving E2F1 and RB1 as targets and a set of miRNAs as master regulators

4 Analysis of microRNA binding sites
  4.1 Introduction
  4.2 Model
  4.3 Results

5 Dynamics of mobile elements
  5.1 Introduction
  5.2 Model
  5.3 Source estimation
  5.4 Conclusion

Bibliography

Chapter 1

Introduction on gene expression

Cells are the basic structural, functional, and biological units of all known living organisms, viruses excepted. To produce what the organism it belongs to needs, every cell relies on the expression of the genes encoded within its genome. Gene expression is a multi-step process governed by the central dogma of molecular biology. This principle assigns a direction to the flow of genetic information among the most elementary, and perhaps the oldest, polymers that carry information in a cell: DNA, RNA and proteins. It was first stated by Francis Crick in 1956 and restated in a Nature paper published in 1970 [28]:

“... The central dogma of molecular biology deals with the detailed residue-by-residue transfer of sequential information. It states that such information cannot be transferred back from protein to either protein or nucleic acid. ...”

Figure 1.1: Central dogma scheme for the information flow in a cell, among DNA, RNA and protein.

This has also been described as “DNA makes RNA and RNA makes protein”. However, this oversimplification does not make clear that the central dogma as stated by Crick does not preclude the reverse flow of information from RNA to DNA; it only rules out the flow from protein to RNA or DNA. In fact,

“... The principal problem could then be stated as the formulation of the general rules for information transfer from one polymer with a defined alphabet to another. ...”

In his paper [28], information transfer stood for the modification or production of a polymer chain starting from another one (DNA, RNA or protein); see Table 1.1 and Fig. 1.1. Stated in this way, the central dogma still holds.

Class     Transfer             Process or reference
General   DNA → DNA            replication
          DNA → RNA            transcription
          RNA → protein        translation
Special   RNA → DNA            retrotranscription
          RNA → RNA            RNA-dependent transcription
          DNA → protein        [73, 112]
Unknown   protein → DNA
          protein → RNA
          protein → protein

Table 1.1: Available links between polymer chains, with the related well-known processes or references.

To study this flow, we can start by looking at what stores the information. A DNA molecule consists of two long polynucleotide chains. Each chain is composed of a sequence of four nucleotides: two purines, adenine and guanine, and two pyrimidines, cytosine and thymine (uracil in RNA). Hydrogen bonds between the nucleotides of the two chains hold together the well-known double helix structure. More precisely, pyrimidines bind to purines: A = T via two and G ≡ C via three hydrogen bonds. All the nucleotides are attached to a backbone made of sugar (deoxyribose) and phosphate, as Figure 1.2 shows. In eukaryotic cells DNA is enclosed in the nucleus, and the complete set of information in an organism’s DNA is called its genome. It stores the instructions to make all the proteins and RNA molecules synthesized by the cell.

Figure 1.2: DNA structure


Each turn of the DNA helix is made up of 10.4 nucleotide pairs, and the coiling creates two grooves: the wider major groove and the narrower minor groove. In bacteria, all genes are on a single DNA molecule, often circular. In eukaryotes, instead, DNA is distributed among chromosomes, and each chromosome is made up of a single DNA molecule packed with proteins, most of which are histones, that fold the DNA into a complex structure called chromatin. Packed chromosomes are visible only during the last steps of cell division (mitosis); for the rest of the cell cycle (interphase) they fill the nucleus in their open state, see Fig. 1.3.

Figure 1.3: Chromosome arrangement during mitosis, by Speicher et al. [102], and in interphase, by Schrock et al. [96].

DNA is wrapped around nucleosomes, each built from 8 histone proteins, and this represents the first level of packaging. During division, thanks to nucleosome-nucleosome interactions, DNA can form a fiber with a diameter of 10 nm [47], giving rise to highly compacted chromosomes. In a cell during interphase, two chromatin states exist according to a perhaps dated classification: heterochromatin, the highly condensed phase, and euchromatin, the less condensed one. Today we know that chromatin has multiple states [115], depending on the non-histone proteins bound to DNA and on epigenetic markers. The chromatin state plays a role in DNA accessibility and is a main regulator of gene expression. DNase I hypersensitive sites (DHSs) in chromatin were identified for the first time over 30 years ago and have since been used to map regulatory DNA regions, including enhancers, promoters, silencers, insulators and locus control regions; they reveal the open chromatin state along the DNA sequence. In [109] the ENCODE project analyzes DHSs genome-wide in 125 human cell types. They found that 970100 DHSs are specific to a single cell type, 1920642 are shared by two or more cell types and 3692 are present in all cell types. The result is that few open chromatin regions are common to all cell types, so the regulatory network could be quite variable among different human cell lines.

1.1 Gene promoter

A gene promoter has no clear-cut definition, because of the very complicated interactions of DNA with itself inside the nucleus. As explored in depth by the ENCODE project [39], transcription factors (TFs) are able to bind the genome at specific sites and to condition the expression of genes even at distances of hundreds of kilobases. What really matters is the interplay among the set of transcription factors bound to DNA and their arrangement in three-dimensional space. Promoters bound by TFs may increase or decrease the affinity of RNA polymerase for the transcription start site (TSS).

[Figure: schematic of a gene and its TSS, with proximal TF binding sites (BS) and a distal TF binding site more than 10 kb away.]

As pointed out in [109], within human cells only 3% of DHSs localize to transcription start sites (TSSs) and 5% lie within 2.5 kb of a TSS. The remaining DHSs are more distal and are evenly divided between intronic and intergenic regions. Furthermore, promoter accessibility is high across all 29 cell types, whereas distal DHSs are usually cell specific. Curiously, long terminal repeats are enriched in DHSs.

In a cell, many genes need to be transcribed continuously to maintain cell functionality, while others are necessary only in specific processes or responses and for a restricted period of time. The former are called constitutive genes and the latter facultative genes, transcribed only when needed.

TF binding sites. New technologies allow researchers to identify with good precision the sites where transcription factors bind. These sites may differ between cell lines, mainly because of different chromatin states. ChIP-sequencing, also known as ChIP-seq, is a method used to analyze protein interactions with DNA. ChIP-seq combines chromatin immunoprecipitation (ChIP) with massively parallel DNA sequencing to identify the binding sites of DNA-associated proteins. It can be used to map global binding sites precisely for any protein of interest. A common way to describe the interaction sites of TFs with DNA is through Position Weight Matrices (PWMs): for a binding site of length n, the matrix gives, for every position, the probability of finding each nucleotide. These matrices can be converted into the information content that each position carries within the binding site by means of the Kullback-Leibler divergence, as done in [121]. A large dataset of PWMs estimated by the ENCODE project is stored at FactorBook (http://www.factorbook.org/) [116]. An example, the E2F1 PWM, is shown in Fig. 1.4.

Figure 1.4: Example of Positional Weight Matrix for E2F1 transcription factor.
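The conversion from a PWM to information content can be illustrated with a minimal sketch. The matrix below is a made-up four-position example, not the E2F1 matrix, and the uniform background is an assumption; the procedure used in [121] may differ in its details.

```python
import math

# Hypothetical 4-position PWM (not the E2F1 matrix): each entry gives the
# probabilities of A, C, G, T at one position of the binding site.
pwm = [
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},  # uninformative position
    {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},
    {"A": 0.00, "C": 0.00, "G": 1.00, "T": 0.00},  # fully conserved position
    {"A": 0.05, "C": 0.45, "G": 0.45, "T": 0.05},
]

# Assumed uniform background; a genome-specific one could be used instead.
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def information_content(position, bg):
    """Kullback-Leibler divergence (bits) between a PWM column and the background."""
    return sum(p * math.log2(p / bg[nt]) for nt, p in position.items() if p > 0)


ic_per_position = [information_content(pos, background) for pos in pwm]
total_ic = sum(ic_per_position)

for i, ic in enumerate(ic_per_position, start=1):
    print(f"position {i}: {ic:.3f} bits")
print(f"total: {total_ic:.3f} bits")
```

With a uniform background, a position carries between 0 bits (all nucleotides equally likely) and 2 bits (a single nucleotide always present), which is the scale usually shown in sequence logos.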

1.2 Transcription

The DNA → RNA link copies information from DNA to RNA molecules. It is called transcription because the language does not change: one starts from a nucleotide sequence and gets a new nucleotide sequence. This new sequence is a single-stranded RNA chain called a transcript. The sequence is exactly complementary to the strand of DNA used as template, but uracils substitute thymines. The cell produces several families of RNA with different tasks; the major ones are listed in the following table.

Type of RNA   Function
mRNA          messenger RNAs
rRNA          ribosomal RNAs
tRNA          transfer RNAs
snRNA         small nuclear RNAs
snoRNA        small nucleolar RNAs
scaRNA        small Cajal RNAs
miRNA         microRNAs
siRNA         small interfering RNAs
other

The process of transcription is carried out by RNA polymerases together with specific transcription factors (e.g. the σ-factor in bacteria).
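The complementarity rule of transcription can be written as a short sketch. The template sequence and the function name `transcribe` are illustrative choices, not taken from any dataset or library.

```python
# Watson-Crick complement of each DNA base, written directly in the RNA
# alphabet (A in the template pairs with U in the transcript).
DNA_TO_RNA_COMPLEMENT = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3to5):
    """Return the transcript (5'->3') of a DNA template strand given 3'->5'."""
    return "".join(DNA_TO_RNA_COMPLEMENT[base] for base in template_3to5.upper())


# Hypothetical template strand, written 3'->5' so the transcript reads 5'->3'.
template = "TACGGCATT"
transcript = transcribe(template)
print(transcript)  # AUGCCGUAA
```

Writing the template 3'->5' avoids an explicit reversal; a template given 5'->3' would have to be reversed first.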

RNA molecules are less stable than DNA because of their chemical structure. The 2’ OH group of RNA can react with the molecule’s backbone in flexible regions, causing the molecule to cleave. For instance, this effect is exploited by self-splicing introns. Since long strands of RNA are therefore chemically less stable, organisms that evolved to use DNA instead of RNA to protect their biological code probably had a selective advantage. This may explain why almost all life forms use DNA to store their genetic information.

Once an RNA polymerase (RNAp) is bound to the transcription start site, the transcription process can start. Dedicated transcription factors are necessary for the functioning of each RNAp: a set of transcription factors helps in recognizing the promoter region typical of each polymerase.

RNA polymerase type I transcribes rRNA genes [88].

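[Figure: rDNA gene with the Upstream Control Element (UCE) and the core element near the TSS, followed by the 18S, 5.8S and 28S rRNA regions.]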

Regions near the TSS of a single ribosomal RNA gene have two binding sequences, the Upstream Control Element (UCE) and the core element. Transcription of rDNAs yields ribosomal RNAs (rRNAs), which constitute about 80% of the total RNA in a cell. Indeed, the total amount of RNA in a cell is used in several works as a proxy for ribosome concentration. RNA polymerase I lacks the C-terminal tail, which explains why its transcripts are neither capped nor polyadenylated. A growing mammalian cell must synthesize 10 million copies of rRNA.

RNA polymerase type II transcribes all protein-coding genes, snoRNA genes, siRNA genes, microRNA genes [67] and snRNA genes.

[Figure: core promoter elements of an RNA polymerase II gene (BRE, TATA, INR, DPE), spanning roughly positions -35 to +30 around the TSS.]

This TSS region is typical of all coding genes and is well explained in Alberts’ book [1].

RNA polymerase type III transcribes tRNA genes, 5S rRNA genes, 7SL RNA genes, microRNA genes [17] and retrotransposons.

[Figure: the two types of RNA polymerase III promoter, type I and type II, with the A-box and B-box elements and the PSE and TATA elements around the TSS.]

For this polymerase, two different types of binding region around the TSS are available. For example, type I is used by 7SL RNA and by retrotransposons derived from 7SL RNA, such as the Alu family. A review on RNA pol III can be found in [120].

We now focus on RNAp II transcripts, which include the coding transcripts. Once the polymerase binds the TSS on DNA, it starts producing a single-stranded RNA (pre-mRNA) complementary to the DNA template. As elongation approaches the end of the gene, the transcription stop signal, encoded in the genome, is read by specific proteins that recognize, cleave and polyadenylate the tail of the pre-mRNA. In particular, the CPSF complex binds to the AAUAAA sequence and the CstF complex, together with other proteins, cuts the RNA within 10-30 nucleotides of the CPSF binding site. Then poly-A polymerase adds a poly-A tail of variable length, between 150 and 250 A nucleotides. At the end of pre-mRNA processing the result is an mRNA ready to be exported from the nucleus and translated into protein. Before export the mRNA has to be 5’ capped, which helps the ribosomes in the cytoplasm recognize which messengers have to be translated. The proteins that turn the pre-mRNA into a mature mRNA are usually bound to the C-terminal domain of RNA polymerase II.

[Figure: structure of a mature mRNA, from the 5' cap through the 5' UTR, the CDS and the 3' UTR to the poly-A tail (An).]

Figure 1.5: Transcription factories.

An mRNA molecule has two untranslated regions (UTRs) and the coding sequence (CDS). Eukaryotic pre-mRNAs undergo some modifications, namely capping at the 5’ end, polyadenylation at the 3’ end and splicing, before a set of proteins helps their export from the nucleus for translation.

Using genome sequences from different species (Ensembl [35]), it is possible to observe evolutionary changes in the nucleotide composition of multicellular organisms. Following the evolutionary tree, the mean GC content of the 5’ UTR changes going towards mammals, as in Figure 1.6.

Figure 1.6: Evolution of CDS and UTRs in different genomes. [Panels: C. elegans, fruit fly, zebrafish, platypus, human; each panel compares genome, 5' UTR, 3' UTR and CDS on a probability axis from 0.0 to 1.0.]

In some papers [24, 25] there is evidence of transcription factories, in which RNA polymerases are bound to a scaffold and transcribe the same DNA sequence at the same time. Replication and transcription are often depicted with polymerases that track like locomotives along their DNA templates. However, recent works support an alternative model in which DNA and RNA polymerases are immobilized by attachment to larger structures, where they reel in their templates and extrude newly made nucleic acids. These polymerases do not act independently; they are concentrated in discrete “factories”, where they work together on many different templates.

1.3 Translation

The RNA → protein link is called translation because it involves a change of language. The genetic code gives the rules to map each RNA triplet onto one of the twenty amino acids. So each mRNA exported from the nucleus contains the information about the amino acid chain it will produce. The ribosome starts decoding the RNA molecule at a specific start codon (AUG), corresponding to methionine, so there is only one way to read the coding sequence of the mRNA.
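The decoding rule just described, reading triplets from the first AUG until a stop codon, can be sketched as follows. The mRNA sequence is invented for illustration, and the sketch ignores everything except the standard genetic code (no initiation factors, no alternative start sites).

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, listed in the order UUU, UUC, UUA, UUG, UCU, ..., GGG;
# "*" marks the three stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Translate from the first AUG to the first in-frame stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon: nothing is translated
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Hypothetical mRNA: short 5' UTR, coding sequence, stop codon, short 3' UTR.
mrna = "GGCAUGGCCUUUAAGUGAGCAAA"
print(translate(mrna))  # MAFK
```

Because the start codon fixes the reading frame, the same sequence shifted by one or two nucleotides would in general encode a completely different chain.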

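[Figure: ribosome binding sites on a bacterial mRNA, which carries several AUG start codons, and on a eukaryotic mRNA, which carries the eIF4G/E proteins at its 5' end, a single AUG and the poly-A tail (An).]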

In eukaryotes, the initiator tRNA-methionine complex is first loaded into the small ribosomal subunit (40S), along with additional proteins called eukaryotic initiation factors. Next, the small subunit binds to the 5’ end of an mRNA molecule, which is recognized thanks to its 5’ cap and the two initiation factors bound to it, eIF4E and eIF4G. Then the subunit moves forward along the mRNA, looking for the AUG codon. In most cases (90%), translation begins at the first AUG encountered; here the initiation factors dissociate and the large subunit joins to complete the ribosome. A bacterial mRNA often carries several coding sequences arranged one after the other; in this way, proteins that the cell needs at the same time can be produced together from the same messenger. Besides this, a bacterium is able to control protein stoichiometry by placing several AUG start codons near the TSS. Bacterial mRNAs have no 5’ cap, so they use a different mechanism: they contain a specific sequence, called the Shine-Dalgarno (SD) sequence. SD sequences are binding sites for the prokaryotic ribosome, generally located around 8 bases upstream of the start codon AUG. Since many bacterial mRNAs are polycistronic, a typical mRNA has an SD sequence for each protein encoded in it.

Figure 1.7: tRNA and ribosome structures. [Labels: anticodon, amino acid, tRNA, ribosome.]

Transfer RNAs (tRNAs) have the task of carrying amino acids to the ribosome, which synthesizes the chain. Each amino acid added to the growing end of a polypeptide chain is selected by complementary base-pairing between the anticodon on its attached tRNA molecule and the next codon on the mRNA chain. Because only one of the many types of tRNA molecules in a cell can base-pair with each codon, the codon determines the specific amino acid to be added to the growing polypeptide chain. Ribosomes have three sites for tRNA. The anticodon of an incoming tRNA binds at the A-site; if it is the right tRNA, its amino acid is joined to the polypeptide chain and the tRNA, now carrying the whole chain, is moved to the P-site. Then a new tRNA arrives at the A-site; if it is the correct one, it receives the chain and the tRNA in the P-site is shifted to the E-site. From the E-site the tRNA can dissociate from the ribosome and the mRNA template.

Figure 1.8: Polyribosome

The elongation process stops, and the two subunits of the ribosome are released, when the ribosome reaches one of the three stop codons (UAA, UAG, UGA). Stop codons in the A-site of a ribosome are bound by release factors, which catalyze the addition of a water molecule to the peptide chain; this breaks the bond between the tRNA in the P-site and the polypeptide. Once the protein chain is completed it has to fold. Most proteins can do that on their own, but sometimes they need help because of incorrect folding. Proteins called molecular chaperones carry out the refolding and verify the final conformation of proteins. In case of misfolding or structural problems, proteins can also be directly degraded. Because a new ribosome can start translating an mRNA before the previous one has finished, the mRNA molecules being translated are usually found in the form of polyribosomes. Polyribosomes, also named polysomes, are large cytoplasmic assemblies made up of several ribosomes spaced as close as 80 nucleotides apart along a single mRNA molecule.

Proteins are the hands of the cell, able to accomplish many functions, also by interacting with each other to form protein complexes, or with DNA molecules, behaving e.g. as transcription factors. Many protein-protein interactions have been reconstructed, and it is possible to build a network of interacting proteins. A good dataset to start with is given by PrePPI [124]. It is a collection of annotations from other protein-protein interaction databases (BioGrid, DIP, IntAct, HPRD, ...) with the addition of new interactions, predicted by an algorithm developed on those annotations. PrePPI contains 31402 high confidence predicted PPIs for yeast and 317813 PPIs for human.

1.4 Post-transcriptional regulation

MicroRNAs were first observed in 1993 in C. elegans [66], but they were not recognized as a distinct class of biological regulators with conserved functions until the early 2000s, as PubMed trends show. The GENOME project also drew attention to non-coding DNA [64], and other papers from 2001 [65, 6] discovered a great number of these small RNA regulators. Their action has been widely explored over the last ten years, and their role in development (let-7), growth control (mir-17 cluster), tumorigenesis (mir-200 family) and as possible biomarkers is becoming ever clearer.

[Figure: PubMed trends. Number of articles per year (in units of 10^5) from 1950 to 2010, and number of microRNA articles per year from 1990 to 2010.]

MicroRNAs are endogenous small non-coding RNAs, about 23 nucleotides long [10, 11]. They have a peculiar biogenesis: they derive from transcripts that fold back on themselves and form a hairpin structure, called pre-miRNA (Fig. 1.9). The hairpin is then cleaved by Dicer, and either strand of the resulting duplex may give a mature miRNA, called -5p or -3p according to whether it comes from the arm nearer the 5’ or the 3’ end of the hairpin.

Figure 1.9: hsa-miR-17 hairpin with its two mature miRNA sequences, hsa-miR-17-5p and hsa-miR-17-3p.

Figure 1.10: Argonaute protein structure. The RNA strand loaded into Argonaute (red) interacts with the MID and PIWI domains.

The miRNA can now be loaded into the Argonaute protein (AGO), as in Figure 1.10. Of the two miRNAs originating from the same hairpin, usually only one is loaded with high frequency into AGO; so in a cell one should find only the -5p or the -3p form expressed, although there are exceptions. Once in its scaffold, a microRNA can pair with the 3’ UTR of mRNAs, regulating their level post-transcriptionally. Depending on the extent of the complementary region between mRNA and microRNA, the latter can behave in different ways. With extensive pairing, the microRNA can direct cleavage of the mRNA, but that is quite uncommon; usually microRNAs direct translational repression [11, 118]. A major goal has been to identify microRNA-target relationships. In the early 2000s, a series of articles found that requiring conserved Watson-Crick pairing to the region of the miRNA centered on nucleotides 2-7, called the “seed”, improves the prediction of miRNA targets. So a simple procedure to locate the binding sites of a microRNA is:

(1) identify 6-8 nt matches to the seed region at the 5’ end of the microRNA sequence, taken from miRBase [63];

(2) look for conserved occurrences of each match within orthologous 3’ UTRs, e.g. from Ensembl [35].
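Step (1) of the procedure above can be sketched in a few lines. The miRNA sequence is assumed to be that of hsa-miR-17-5p and should be checked against miRBase; the UTR is a made-up toy sequence. The site names (6mer, 7mer-m8, 7mer-A1, 8mer) follow the nomenclature commonly used by seed-based predictors, and the conservation filter of step (2) is not implemented here, since it would require aligned orthologous UTRs.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    return "".join(COMPLEMENT[b] for b in reversed(rna))


def seed_sites(mirna, utr):
    """Return (position, site type) for each seed match of a miRNA in a UTR.

    The position is the 0-based index in the UTR where the 6mer match to
    miRNA nucleotides 2-7 begins.
    """
    six_mer = reverse_complement(mirna[1:7])  # pairs with miRNA nt 2-7
    m8 = COMPLEMENT[mirna[7]]                 # pairs with miRNA nt 8
    sites = []
    i = utr.find(six_mer)
    while i != -1:
        has_m8 = i > 0 and utr[i - 1] == m8
        has_a1 = i + 6 < len(utr) and utr[i + 6] == "A"  # A opposite miRNA nt 1
        if has_m8 and has_a1:
            kind = "8mer"
        elif has_m8:
            kind = "7mer-m8"
        elif has_a1:
            kind = "7mer-A1"
        else:
            kind = "6mer"
        sites.append((i, kind))
        i = utr.find(six_mer, i + 1)
    return sites


mirna = "CAAAGUGCUUACAGUGCAGGUAG"      # assumed hsa-miR-17-5p sequence
utr = "AAUUGCACUUUAGGCCCACUUUCC"      # hypothetical 3' UTR fragment
print(seed_sites(mirna, utr))          # [(5, '8mer'), (16, '6mer')]
```

In a real analysis the same scan would be repeated on the orthologous UTRs of other species, keeping only the sites found at aligned positions.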

Sometimes there is a supplementary contribution to microRNA-mRNA pairing due to 3’ pairing. This 3’ pairing centers on microRNA nucleotides 13-16 and the UTR region opposite to them, although this kind of pairing is atypical. Another important feature for target recognition is target-site accessibility. Folded UTRs are harder to bind, and in fact the authors of [58] demonstrate that target accessibility is a critical factor in microRNA function. In [118] it was found that Argonaute binding alters the properties of an RNA guide; in particular mouse AGO2, which mainly mediates miRNA-directed repression, dissociates rapidly and with similar rates from fully paired and seed-matched targets, so a target mRNA will hardly be cleaved. In recent years more precise models of microRNA-target interaction have come out, for example MIRZA from the Zavolan and van Nimwegen groups [59]. This tool considers the stabilizing effect of pairing at each position of the microRNA sequence.


Figure 1.11: RNA strand loaded into Argonaute (red) against its double-strand structure (blue).

The crystal structure of full-length human Argonaute 2 (Ago2) (Fig. 1.10) is known and was determined to a resolution of 2.3 Å in [93]. Schirle et al. compared the single-stranded RNA molecule loaded into Ago2 with a double-stranded RNA, obtaining a close overlap of nucleotides between positions 2 and 6/7 (Fig. 1.11). Thus, Argonaute 2 pre-arranges the microRNA strand into an A-form helix, preparing it to complete the helix together with its target. Furthermore, in [94] Schirle et al. determined crystal structures of human Argonaute-2 (Ago2) bound to a defined guide RNA with and without target RNAs. This experiment elucidates the mechanism of microRNA action. In a first step, Ago2 looks for pairing of nt 2 to 5 with target RNAs and, when it is found, promotes conformational changes that expose nt 2 to 8 and 13 to 16 for further target recognition. Since pairing is central in the microRNA-target interaction, it is useful to estimate the Gibbs free energy of RNA duplexes; for this purpose the Turner group developed an empirical model in 1998. In [122] they defined a nearest-neighbour model (INN-HB), which takes into account pairs of adjacent nucleotides. The chemical stability of any RNA duplex can be calculated through formula 1.1, where the parameters are estimated by empirical fits.

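\[
\Delta G(\mathrm{duplex}) = \Delta G_{\mathrm{init}} + \sum_j n_j \, \Delta G_j(\mathrm{NN}) + m_{\mathrm{term\text{-}AU}} \, \Delta G_{\mathrm{term\text{-}AU}} + \Delta G_{\mathrm{sym}} \tag{1.1}
\]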

Table 1.2: Thermodynamic parameters of the INN-HB model, 1 M NaCl, pH 7. Uncertainties are given in parentheses; ∆G37 and ∆H are in kcal/mol, ∆S in cal/(K mol).

Parameter                  ∆G37           ∆H              ∆S
5′AA3′ / 3′UU5′            −0.93(0.03)    −6.82(0.79)     −19.0(2.5)
5′AU3′ / 3′UA5′            −1.10(0.08)    −9.38(1.68)     −26.7(5.2)
5′UA3′ / 3′AU5′            −1.33(0.09)    −7.69(2.02)     −20.5(6.3)
5′CU3′ / 3′GA5′            −2.08(0.06)    −10.48(1.24)    −27.1(3.8)
5′CA3′ / 3′GU5′            −2.11(0.07)    −10.44(1.28)    −26.9(3.9)
5′GU3′ / 3′CA5′            −2.24(0.06)    −11.40(1.23)    −29.5(3.9)
5′GA3′ / 3′CU5′            −2.35(0.06)    −12.44(1.20)    −32.5(3.7)
5′CG3′ / 3′GC5′            −2.36(0.09)    −10.64(1.65)    −26.7(5.0)
5′GG3′ / 3′CC5′            −3.26(0.07)    −13.39(1.24)    −32.7(3.8)
5′GC3′ / 3′CG5′            −3.42(0.08)    −14.88(1.58)    −36.9(4.9)
initiation                 4.09(0.22)     3.61(4.12)      −1.5(12.7)
per terminal AU            0.45(0.04)     3.72(0.83)      10.5(2.6)
self-complementary         0.43           0               −1.4
non-self-complementary     0              0               0

Each $\Delta G_j(\mathrm{NN})$ term is the free energy contribution of the j-th nearest neighbor, with $n_j$ occurrences in the sequence. $m_{\mathrm{term\text{-}AU}}$ and $\Delta G_{\mathrm{term\text{-}AU}}$ are the number of terminal AU pairs and the associated free-energy parameter, respectively. $\Delta G_{\mathrm{init}}$ is the free energy of initiation, which includes the translational and rotational entropy loss for converting two particles into one (theoretically it must depend on sequence length). The $\Delta G_{\mathrm{sym}}$ term is due to the 2-fold rotational symmetry of self-complementary duplexes. To consider bulges, loops and mismatches as well, several tools are available, for instance RNAhybrid [85], which is based on a recursive algorithm. RNAhybrid finds the minimum free energy hybridization of a pair of RNA sequences.
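As an illustration of formula 1.1, the following sketch computes ∆G37 for a perfectly paired duplex from the values of Table 1.2. It is a minimal example for fully Watson-Crick paired strands only (no bulges, loops or mismatches); the function name and the test sequences are illustrative choices.

```python
# Nearest-neighbour free energies at 37 C (kcal/mol) from Table 1.2, keyed by
# the 5'->3' dinucleotide on one strand. Each stack can be read from either
# strand, so e.g. 5'AA3'/3'UU5' appears under both "AA" and "UU".
NN_DG37 = {
    "AA": -0.93, "UU": -0.93,
    "AU": -1.10,
    "UA": -1.33,
    "CU": -2.08, "AG": -2.08,
    "CA": -2.11, "UG": -2.11,
    "GU": -2.24, "AC": -2.24,
    "GA": -2.35, "UC": -2.35,
    "CG": -2.36,
    "GG": -3.26, "CC": -3.26,
    "GC": -3.42,
}
DG_INIT = 4.09      # initiation
DG_TERM_AU = 0.45   # per terminal AU pair
DG_SYM = 0.43       # self-complementary duplexes only
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def duplex_dg37(strand):
    """Free energy (kcal/mol) of a strand fully paired with its complement."""
    stacks = sum(NN_DG37[strand[i:i + 2]] for i in range(len(strand) - 1))
    terminal_au = sum(1 for end in (strand[0], strand[-1]) if end in "AU")
    reverse_complement = "".join(COMPLEMENT[b] for b in reversed(strand))
    symmetry = DG_SYM if strand == reverse_complement else 0.0
    return DG_INIT + stacks + terminal_au * DG_TERM_AU + symmetry


# Self-complementary example: five stacks, no terminal AU, symmetry term added.
print(round(duplex_dg37("GCAUGC"), 2))   # -7.64
# Seven nucleotides paired with their complement: one terminal AU pair.
print(round(duplex_dg37("AAAGUGC"), 2))  # -7.17
```

For GCAUGC the stacks GC, CA, AU, UG and GC sum to −12.16 kcal/mol, to which the initiation (4.09) and symmetry (0.43) terms are added.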

The main databases for the identification of microRNA targets are the following.

TargetScan [67] TargetScan predicts biological targets of miRNAs by searching for the presence of conserved 8mer and 7mer sites that match the seed region of each miRNA. As an option, nonconserved sites are also predicted. Also identified are sites with mismatches in the seed region that are compensated by conserved 3’ pairing. As an option, predictions are also ranked by their probability of conserved targeting (PCT). TargetScanHuman considers matches to annotated human UTRs and their orthologs, as defined by UCSC whole-genome alignments. Conserved targeting has also been detected within open reading frames (ORFs).

microRNA.org [13] The microRNA.org website is a comprehensive resource of microRNA target predictions and expression profiles. Target predictions are based on a development of the miRanda algorithm which incorporates current biological knowledge on target rules and on the use of an up-to-date compendium of mammalian microRNAs. The target sites predicted by miRanda are scored for likelihood of mRNA downregulation using mirSVR, a regression model that is trained on sequence and contextual features of the predicted miRNA::mRNA duplex. Expression profiles are derived from a comprehensive sequencing project of a large set of mammalian tissues and cell lines of normal and disease origin.

miRTarBase [53] miRTarBase has accumulated more than fifty thousand miRNA-target interactions, which are collected by manually surveying pertinent literature after data mining of the text systematically to filter research articles related to functional studies of miRNAs. Generally, the collected miRNA-target interactions are validated experimentally by reporter assay, western blot, microarray and next-generation sequencing experiments.

microRNA BS identification. CLIP and PAR-CLIP experiments make it possible to identify the sites where RNA-binding proteins (RBPs) interact with RNA. The Argonaute family belongs to the RBPs, so with an AGO CLIP experiment it is possible to locate microRNA binding sites. Examples are the Hafner and Kishore datasets, [45] and [60].

1.5 Regulatory network

Many of the complex networks that occur in nature have been shown to share global statistical features. A series of works in the late ’90s shed light on peculiar properties of biological networks. Given a network with N nodes and E edges, different quantities can be useful for understanding its structure and behavior. In [117] Watts and Strogatz (WS) focused on the small-world phenomenon. It is also known as six degrees of separation because of an experiment from the ’60s published in Psychology Today [74], which addressed a simple question: given any two people in the world, X and Y, how many intermediate acquaintance links are needed before X and Y are connected?

Figure 1.12: Six degrees of separation: results of the Kansas to Massachusetts experiment published in [74] in 1967.

The answer, shown in Fig. 1.12, is a mean number of intermediates between 5 and 6. WS wanted to formalize this small-world effect. In a set of networks they found a combination of high clustering, measured by the clustering coefficient, and low mean distance between two nodes. The clustering coefficient C is defined as the probability that two nodes connected to the same third node are also connected to each other by an edge. The mean distance L is the length of the shortest path between two vertices, averaged over all pairs of vertices. Their analysis starts from a highly clustered network and increases its randomness through random rewiring. With few rewiring steps the mean shortest path drops rapidly, because new links between distant nodes create shorter paths. On the contrary, the clustering coefficient remains high, because the growing randomness affects C only linearly. This peculiar combination of high C and low L is found in some empirical networks, taken as examples of small worlds: film actors, the US power grid and the C. elegans neural network.
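The rewiring experiment can be reproduced with a small sketch. The network size, the number of neighbours, the rewiring probability and the random seed below are arbitrary choices, C is computed as the average of the local clustering coefficients, and L is averaged over reachable pairs only; this is a simplified version of the WS construction, not the code used in [117].

```python
import random
from collections import deque


def ring_lattice(n, k):
    """Ring of n nodes, each linked to its k nearest neighbours (k even)."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for step in range(1, k // 2 + 1):
            j = (i + step) % n
            adj[i].add(j)
            adj[j].add(i)
    return adj


def rewire(adj, k, p, rng):
    """Redirect each lattice edge (i, i+step) with probability p to a random node."""
    n = len(adj)
    for step in range(1, k // 2 + 1):
        for i in range(n):
            j = (i + step) % n
            if j in adj[i] and rng.random() < p:
                candidates = [v for v in range(n) if v != i and v not in adj[i]]
                if candidates:
                    new = rng.choice(candidates)
                    adj[i].discard(j)
                    adj[j].discard(i)
                    adj[i].add(new)
                    adj[new].add(i)
    return adj


def clustering_coefficient(adj):
    """Average over nodes of the fraction of neighbour pairs that are linked."""
    total = 0.0
    for neigh in adj.values():
        d = len(neigh)
        if d < 2:
            continue
        links = sum(1 for u in neigh for v in neigh if u < v and v in adj[u])
        total += 2.0 * links / (d * (d - 1))
    return total / len(adj)


def mean_path_length(adj):
    """Mean shortest-path length over all reachable ordered pairs (BFS)."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


rng = random.Random(0)
n, k = 100, 4
lattice = ring_lattice(n, k)
c0, l0 = clustering_coefficient(lattice), mean_path_length(lattice)
small_world = rewire(ring_lattice(n, k), k, 0.1, rng)
c1, l1 = clustering_coefficient(small_world), mean_path_length(small_world)
print(f"lattice:  C = {c0:.3f}, L = {l0:.2f}")
print(f"rewired:  C = {c1:.3f}, L = {l1:.2f}")
```

For the regular lattice with k = 4 the clustering coefficient is exactly 0.5; after rewiring about one edge in ten, L is expected to fall sharply while C decreases only moderately.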

A random graph model is frequently needed, and the simplest and most widely used one is due to Erdős and Rényi (ER) [32], first introduced in 1959. To generate a realization of an ER network, one starts with N vertices and connects each pair of vertices with probability p. The degree distribution is binomial

\[ P(k) = \binom{N-1}{k} p^{k} (1-p)^{N-1-k} \tag{1.2} \]

and thus, for a large number of vertices, it becomes a Poisson distribution. A common property of many large networks is that vertex connectivities follow a scale-free power-law distribution,

\[ P(k) \sim k^{-\gamma} \tag{1.3} \]

and neither ER nor WS networks reproduce this typical feature of real networks.

The power-law tail in P(k) indicates that highly connected vertices have a large chance of occurring and dominate the connectivity. Barabasi focused on defining a process that generates this scale-free behavior. In [9] he considered the actor, World Wide Web, US power grid and paper citation networks; all of them show a power-law distribution, each with a different γ. The scale-invariant nature of real networks can be reproduced by two simultaneous processes, growth and preferential attachment. Both ingredients are necessary, as Barabasi tested in his paper. A random network with this property is easily produced: at each round a new node with m edges is added to the network, and the probability of connecting to node i increases linearly with its degree k_i.
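The growth and preferential attachment recipe just described can be sketched in a few lines. This is a minimal illustration, not the procedure of [9]: the function names, the fully connected starting core of m + 1 nodes and the "stub list" sampling trick are assumptions made here for the example.

```python
import random

def preferential_attachment(n_nodes, m, seed=0):
    """Grow a graph in which each new node attaches to m distinct existing
    nodes chosen with probability proportional to their degree."""
    rng = random.Random(seed)
    # Assumption: start from a fully connected core of m + 1 nodes.
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    # Each node appears in `stubs` once per edge end, so a uniform draw
    # from this list picks node i with probability k_i / sum_j k_j.
    stubs = [v for e in edges for v in e]
    for new in range(m + 1, n_nodes):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        for t in targets:
            edges.append((new, t))
            stubs.extend((new, t))
    return edges

def degrees(edges):
    """Degree of every node that appears in the edge list."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg

edges = preferential_attachment(2000, m=2)
deg = degrees(edges)
```

A histogram of `deg.values()` on log-log axes is the natural way to inspect the tail of P(k) for the generated graph.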

Figure 1.13: Genome size [kb] versus number of proteins in different species from three groups (eukaryotes, prokaryotes, viruses). Data source at ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/, updated in 2014.

Figure 1.14: Analysis of the information content in motifs of TF binding sites for bacteria and eukaryotes.

A simple but meaningful way to look at genome evolution from viruses to eukaryotes is the correlation between genome size and gene content for eukaryotes and non-eukaryotes. When the number of proteins is plotted against genome size (Fig. 1.13), two distinct relations appear, one for eukaryotes and one for non-eukaryotes, with markedly different slopes emerging from an initial linear regression. For non-eukaryotes the linear regression model is the best fit, while for eukaryotes a logarithmic one is [51]. The gene-coding fraction of the genome declines from 81.6% in bacteria to 1.2% in eukaryotes [51]. Thus, in all eukaryotes the non-coding genome plays a central role, as already suggested by the GENOME project [64], the ENCODE project and many other papers.

Another main difference between bacterial and eukaryotic regulatory networks lies in the information content of binding sites and, consequently, in the combinatorial behaviour of eukaryotic transcription factors, as studied by Wunderlich and Mirny (WM) in [121]. Coordinated regulation of gene expression relies on transcription factors (TFs) binding to specific DNA sites. The information analysis of > 950 TF-binding motifs made by WM demonstrates that prokaryotes and eukaryotes use strikingly different strategies to target TFs to specific genome locations. While bacterial TFs can recognize a specific DNA site in the genomic background, eukaryotic TFs exhibit widespread, nonfunctional binding and require clustering of sites to achieve specificity. Their systematic characterization of binding motifs provides a quantitative assessment of the differences in transcription regulation between prokaryotes and eukaryotes. From information theory, it is known that finding a unique object among N alternatives requires I_min = log2 N bits of information. For bacteria N = 10^6 − 10^7 bp, which means I_min = 20 − 23 bits, while for eukaryotes N = 10^8 − 10^9 bp and I_min = 27 − 33 bits (24 bits for yeast and 31 bits for human). The information content of a motif of length L, with p_i(b) the frequency of base b at position i and q(b) its background frequency, is

\[ I = \sum_{i=1}^{L} \sum_{b \in \{A,C,G,T\}} p_i(b) \log_2 \left( \frac{p_i(b)}{q(b)} \right) \tag{1.4} \]
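As a small numerical illustration of I_min and of Eq. (1.4), the sketch below evaluates both quantities. The function names, the uniform background q(b) = 0.25 and the perfectly conserved 10-bp motif are assumptions made for the example, not data from [121].

```python
import math

def min_information(genome_size_bp):
    """Bits needed to single out one position among genome_size_bp."""
    return math.log2(genome_size_bp)

def motif_information(pwm, background=None):
    """Information content of a motif, Eq. (1.4).
    pwm: list of dicts mapping base -> p_i(b), one dict per position.
    background: dict base -> q(b); uniform 0.25 by default (an assumption)."""
    if background is None:
        background = {b: 0.25 for b in "ACGT"}
    total = 0.0
    for column in pwm:
        for base, p in column.items():
            if p > 0.0:  # the term 0 * log(0) is taken as 0
                total += p * math.log2(p / background[base])
    return total

# Hypothetical 10-bp motif with every position fully conserved.
perfect = [{"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}] * 10
print(motif_information(perfect))       # 20.0 bits, i.e. 2 bits per position
print(round(min_information(1e6), 1))   # 19.9
print(round(min_information(1e9), 1))   # 29.9
```

Under these assumptions, even a perfectly conserved 10-bp site only just reaches the I_min of a 10^6 bp genome and falls short of the requirement for a 10^9 bp genome.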

Figure 1.14 shows the properties of the binding motifs for bacteria, yeast and multicellular eukaryotes found in [121]. The genome size defines the minimum information content (I_min, red bands in the figure), to be compared with the distribution of information contents (in blue). The chart demonstrates that bacterial TF-binding motifs are informative enough to make spurious hits in the genomic background unlikely, in contrast to yeast and multicellular eukaryotic motifs. Most bacterial motifs have I > I_min, whereas almost all eukaryotic motifs do not. This result suggests that in eukaryotes the binding sites work in a combinatorial way, so that more than one transcription factor is needed to activate or inhibit the transcription of a target gene.

Going beyond these global features requires an understanding of the basic structural elements particular to each class of networks. To this end, Alon's group detected recurring and significant patterns in the node connections, called motifs [76]. The presence of enriched n-node subgraphs (n = 3, 4) was recorded for a set of real networks: gene regulation networks of E. coli and S. cerevisiae, neurons in C. elegans, food webs, electronic circuits, protein-protein interactions, the World Wide Web and Internet routers.

Figure 1.15: Enriched motifs from Milo et al. [76]. Gene regulation panel: feed-forward loop (X, Y, Z) and bi-fan (X, Y, Z, W). Protein interaction panel: three-node motif (X, Y, Z) and four-node motif (X, Y, Z, W). As shown by the presence or absence of arrows, the gene regulation (transcription) graph is directed while the protein interaction graph is undirected, due to the nature of the interactions.

We focus on cellular regulatory networks. In that case the typical enriched motifs found by Milo et al. [76] are feed-forward loops and bi-fans; for the protein interaction graph they are the mutual interaction of three proteins and the interaction of one protein with three others that do not interact with each other, as shown in Fig. 1.15.


To find enriched circuits, the procedure followed by Alon was to compare the number of circuits expected in an ER random graph with the number found in the real network under consideration. If the real network has N nodes and E edges, one can build an ER graph with the same number of nodes and edges and then count the motifs of interest. From the random expectation and the real count, a Z-score is obtained. The simplest circuit is undoubtedly auto-regulation. In a random network with N nodes the total number of possible edges is N · N = N^2, of which N are self-edges, so the probability that an edge is a self-edge is p_self = N/N^2 = 1/N. Since E edges are placed at random, the probability of k self-edges is binomial

\[ P(k) = \binom{E}{k} p_{self}^{k} (1 - p_{self})^{E-k} \tag{1.5} \]

For a binomial distribution \langle k \rangle_{random} = E p_{self} = E/N and \sigma_{random} = \sqrt{(E/N)(1 - 1/N)} \sim \sqrt{E/N}. Alon's analysis found that the number of self-loops in E. coli exceeds the number expected in the ER graph, with a Z-score of 32 [5]. The same approach extends to n-node subgraphs, as mentioned above [76].
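The self-loop Z-score follows directly from the binomial mean and standard deviation of Eq. (1.5). The sketch below shows the arithmetic. The function name and the network size used in the call are illustrative assumptions, not the E. coli figures of [5].

```python
import math

def self_loop_zscore(n_nodes, n_edges, observed_self_loops):
    """Z-score of the observed number of self-edges against an ER graph
    with the same number of nodes and edges (binomial statistics)."""
    p_self = 1.0 / n_nodes
    mean = n_edges * p_self
    std = math.sqrt(n_edges * p_self * (1.0 - p_self))
    return mean, std, (observed_self_loops - mean) / std

# Illustrative numbers only (hypothetical network, not taken from the text).
mean, std, z = self_loop_zscore(n_nodes=400, n_edges=500, observed_self_loops=40)
print(mean, std, z)
```

Because the expected count is of order one for sparse networks, a few tens of observed self-loops already give a Z-score of several tens.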

Alon's investigation went further and explored why these motifs have been selected by evolution. He found the answer in the dynamical properties of the enriched circuits, such as the response time.


Figure 1.16: Example of the E. coli transcriptional network (source: RegulonDB), with the HITS algorithm used to identify hubs and authorities. Graph analysis was performed with Gephi 0.8.2.


Chapter 2

Introduction to gene expression model

2.1 Markov processes and the Chapman-Kolmogorov equation (CK)

Clonal populations of cells exhibit substantial phenotypic variation. Stochastic effects in gene expression may account for the large cell-to-cell variation observed in isogenic populations. Here the theoretical foundation of gene expression modeling is introduced, along with some applications. Most important processes in Physics and Chemistry are described by a subclass of stochastic processes having the Markov property, see [114]. Markov processes are defined as stochastic processes that, for any set of n successive times t_1 < \dots < t_n, have the following property

\[ P(y_n, t_n | y_1, t_1; \dots; y_{n-1}, t_{n-1}) = P_T(y_n, t_n | y_{n-1}, t_{n-1}) \tag{2.1} \]

Thus, the conditional probability of the state y_n at time t_n is uniquely determined by the state y_{n-1} at time t_{n-1} and is not affected by any knowledge of the state of the system at earlier times; P_T is called the transition probability. A Markov process is therefore fully determined by the two functions P(y_1, t_1) and P_T, from which the whole process can be constructed through successive iterations.

\[ \begin{aligned} P(y_1, t_1; y_2, t_2; y_3, t_3) &= P_T(y_3, t_3 | y_2, t_2)\, P(y_2, t_2; y_1, t_1) \\ &= P_T(y_3, t_3 | y_2, t_2)\, P_T(y_2, t_2 | y_1, t_1)\, P(y_1, t_1) \end{aligned} \tag{2.2} \]

Following this procedure one finds P(y_n, t_n; \dots; y_1, t_1) for every n. Clearly, the process y_n may also be multidimensional, with several components.

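Integrating Eq. 2.2 over the intermediate state y_2 for ordered times t_1 < t_2 < t_3 gives

\[ \int P(y_1, t_1; y_2, t_2; y_3, t_3)\, dy_2 = \int P_T(y_3, t_3 | y_2, t_2)\, P_T(y_2, t_2 | y_1, t_1)\, P(y_1, t_1)\, dy_2 \tag{2.3} \]

The left-hand side is the joint probability P(y_1, t_1; y_3, t_3) = P_T(y_3, t_3 | y_1, t_1) P(y_1, t_1). On the right-hand side P(y_1, t_1) comes out of the integral, so both sides can be divided by it. At this stage, the equation becomes

\[ P_T(y_3, t_3 | y_1, t_1) = \int P_T(y_3, t_3 | y_2, t_2)\, P_T(y_2, t_2 | y_1, t_1)\, dy_2 \tag{2.4} \]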

and is called the Chapman-Kolmogorov equation (CK). It was derived independently by the British mathematician Sydney Chapman [21] and the Russian mathematician Andrey Kolmogorov [61]. A Markov process requires two non-negative functions, P(y_1, t_1) and P_T, which must obey two identities:

(i) the Chapman-Kolmogorov equation (2.4);

(ii) the relation

\[ P(y_2, t_2) = \int P_T(y_2, t_2 | y_1, t_1)\, P(y_1, t_1)\, dy_1 \tag{2.5} \]

Any two functions satisfying these relations define a unique Markov process. If P_T does not depend on the two times separately but only on the time interval, the process is called stationary and

\[ P_T(y_2, t_2 | y_1, t_1) = T_\tau(y_2 | y_1) \tag{2.6} \]

where \tau = t_2 - t_1, and the CK eq. (2.4) becomes

\[ T_{\tau'+\tau}(y_3 | y_1) = \int T_{\tau'}(y_3 | y_2)\, T_\tau(y_2 | y_1)\, dy_2 \tag{2.7} \]

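The best-known example of a stationary, Gaussian Markov process is the Ornstein-Uhlenbeck process, defined by the following functions

\[ \begin{aligned} P(y_1) &= \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2} y_1^2} \\ T_\tau(y_2 | y_1) &= \frac{1}{\sqrt{2\pi (1 - e^{-2\tau})}} \exp\left[ -\frac{(y_2 - y_1 e^{-\tau})^2}{2 (1 - e^{-2\tau})} \right] \end{aligned} \tag{2.8} \]

It was first set up to describe the velocity of Brownian particles.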
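That the Ornstein-Uhlenbeck transition density satisfies the stationary CK equation (2.7) can be checked numerically. The sketch below does so with a plain trapezoidal rule. The function names, the integration window [-10, 10], the grid size and the test point are arbitrary choices made for this illustration.

```python
import math

def ou_transition(y2, y1, tau):
    """Transition density T_tau(y2|y1) of Eq. (2.8)."""
    var = 1.0 - math.exp(-2.0 * tau)
    mean = y1 * math.exp(-tau)
    return math.exp(-(y2 - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def ck_rhs(y3, y1, tau, tau_prime, lo=-10.0, hi=10.0, steps=4000):
    """Right-hand side of Eq. (2.7): trapezoidal integral over y2."""
    dy = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        y2 = lo + k * dy
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * ou_transition(y3, y2, tau_prime) * ou_transition(y2, y1, tau)
    return total * dy

# Left-hand side: a single step of length tau + tau'.
lhs = ou_transition(0.3, 1.0, 0.5 + 0.2)
rhs = ck_rhs(0.3, 1.0, tau=0.5, tau_prime=0.2)
print(lhs, rhs)
```

The two printed numbers should agree to within the quadrature error, for any choice of y_1, y_3 and of the two time intervals.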

2.2 Chemical Master equation (CME)

The master equation is a reformulation of the CK equation (2.4) that is easier to handle. The first step in its derivation is the series expansion of the transition probability appearing in Eq. (2.7) for vanishing time difference \tau',

\[ T_{\tau'}(y_2 | y_1) = (1 - a_0(y_1)\tau')\, \delta(y_2 - y_1) + \tau' W(y_2 | y_1) + o(\tau') \tag{2.9} \]

with the condition

\[ a_0(y_1) = \int W(y_2 | y_1)\, dy_2 \tag{2.10} \]

which ensures \int T_{\tau'}(y_2 | y_1)\, dy_2 = 1, because the integral over all possible final states must have probability 1. For y_2 \neq y_1, W(y_2 | y_1) = \partial T_{\tau'}(y_2 | y_1) / \partial\tau' evaluated at \tau' = 0 is the transition probability per unit time from state y_1 to state y_2. Substituting the series expansion for small \tau' into Eq. (2.7),


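\[ \begin{aligned} T_{\tau+\tau'}(y_3 | y_1) &= \int T_{\tau'}(y_3 | y_2)\, T_\tau(y_2 | y_1)\, dy_2 \\ &= \int \left[ (1 - a_0(y_2)\tau')\, \delta(y_3 - y_2) + \tau' W(y_3 | y_2) \right] T_\tau(y_2 | y_1)\, dy_2 \\ &= \int \delta(y_3 - y_2)\, T_\tau(y_2 | y_1)\, dy_2 - \tau' \int a_0(y_2)\, \delta(y_3 - y_2)\, T_\tau(y_2 | y_1)\, dy_2 + \tau' \int W(y_3 | y_2)\, T_\tau(y_2 | y_1)\, dy_2 \\ &= T_\tau(y_3 | y_1) + \tau' \left[ \int W(y_3 | y_2)\, T_\tau(y_2 | y_1)\, dy_2 - \int W(y | y_3)\, T_\tau(y_3 | y_1)\, dy \right] \end{aligned} \tag{2.11} \]

and, finally,

\[ \frac{T_{\tau+\tau'}(y_3 | y_1) - T_\tau(y_3 | y_1)}{\tau'} = \int \left[ W(y_3 | y_2)\, T_\tau(y_2 | y_1) - W(y_2 | y_3)\, T_\tau(y_3 | y_1) \right] dy_2 \tag{2.12} \]

In the limit \tau' \to 0

\[ \frac{\partial}{\partial\tau} T_\tau(y_3 | y_1) = \int \left[ W(y_3 | y_2)\, T_\tau(y_2 | y_1) - W(y_2 | y_3)\, T_\tau(y_3 | y_1) \right] dy_2 \tag{2.13} \]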

In this form the CK equation is called the Master Equation (ME) and can be written in the following simplified and more intuitive form

\[ \frac{\partial P(y, t)}{\partial t} = \int \left[ W(y | y')\, P(y', t) - W(y' | y)\, P(y, t) \right] dy' \tag{2.14} \]

If the states n are discrete, the integral is replaced by a sum and the ME becomes

\[ \frac{d p_n(t)}{dt} = \sum_{n'} \left[ W_{nn'}\, p_{n'}(t) - W_{n'n}\, p_n(t) \right] \tag{2.15} \]

In a steady-state condition the left-hand side of the ME is zero, d p_n(t)/dt = 0, and thus

\[ \sum_{n'} W_{nn'}\, p^{eq}_{n'} = \sum_{n'} W_{n'n}\, p^{eq}_{n} \tag{2.16} \]

must hold. A stronger assumption than the steady state is detailed balance. It states that each pair n, n' separately satisfies

\[ W_{nn'}\, p^{eq}_{n'} = W_{n'n}\, p^{eq}_{n} \tag{2.17} \]

The detailed balance condition is strictly related to entropy and, in particular, defines the condition without entropy production (equilibrium) [83].

Step-operator To manage the ME it is useful to define the step operator E^k_{n_i}, which acts on a generic function f in this way:

\[ f(\dots, n_i + k, \dots) = E^k_{n_i} f(\dots, n_i, \dots) \tag{2.18} \]

Taylor expanding the function on the left-hand side of the equation, one obtains

\[ E^k_{n_i} = \sum_{l=0}^{\infty} \frac{k^l}{l!}\, \partial^l_{n_i} \tag{2.19} \]

and, by the definition of the step operator, the relation

\[ E^k_{n_i}\, n_i\, f(\dots, n_i, \dots) = (n_i + k)\, f(\dots, n_i + k, \dots) \tag{2.20} \]

follows naturally. Thanks to E^k_{n_i}, complex master equations become compact and easier to treat.

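System of chemical reactions Considering a set of chemical reactions

\[ \rho : \sum_j s^\rho_j X_j \xrightarrow{\;k_\rho\;} \sum_j r^\rho_j X_j \tag{2.21} \]

we define the Chemical Master Equation (CME), see van Kampen's book [114].

\[ \frac{\partial P(\{n_i\}, t)}{\partial t} = \sum_\rho k_\rho \Omega \left( \prod_i E_i^{s^\rho_i - r^\rho_i} - 1 \right) \prod_j \frac{n_j!}{(n_j - s^\rho_j)!\, \Omega^{s^\rho_j}}\, P(\{n_i\}, t) \tag{2.22} \]

In the particular case in which the detailed balance condition holds, the reactions must have the form

\[ \rho : \sum_j s^\rho_j X_j \underset{k^-_\rho}{\overset{k^+_\rho}{\rightleftharpoons}} \sum_j r^\rho_j X_j \tag{2.23} \]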

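and the CME is rewritten as

\[ \frac{\partial P(\{n_i\}, t)}{\partial t} = \sum_\rho \left[ k^+_\rho \Omega \left( \prod_i E_i^{s^\rho_i - r^\rho_i} - 1 \right) \prod_j \frac{n_j!}{(n_j - s^\rho_j)!\, \Omega^{s^\rho_j}} + k^-_\rho \Omega \left( \prod_i E_i^{r^\rho_i - s^\rho_i} - 1 \right) \prod_j \frac{n_j!}{(n_j - r^\rho_j)!\, \Omega^{r^\rho_j}} \right] P(\{n_i\}, t) \tag{2.24} \]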

Its solutions can be obtained through the Grand Canonical Ensemble (GCE) and they are

\[ P^g(\{n_i\}) = C \prod_j \frac{(\Omega z_j)^{n_j}}{n_j!}\, e^{-\Omega z_j}\, \Delta(\{n_i\}, \{n^0_i\}) \tag{2.25} \]

with a constraint on the reaction rates,

\[ \frac{k^+_\rho}{k^-_\rho} = \prod_j z_j^{\,r^\rho_j - s^\rho_j} \tag{2.26} \]

where \Omega is the system volume, z_j = n_j/\Omega is the concentration of species j, \Delta(\{n_i\}, \{n^0_i\}) selects the accessible states and C is a normalization constant. Eq. (2.26) is known as the law of mass action and must hold for all coupled reactions. In the following, some analytical and numerical methods to solve a ME are investigated.

2.3 Generating function

When feasible, a way to obtain the solution, or at least the first moments of the distribution, is the generating function, i.e. the Z-transform (or discrete Laplace transform) of the function P(n, t),

\[ F(\{z_i\}, t) = \sum_{\{n_i\}=0}^{\infty} \prod_i z_i^{n_i}\, P(\{n_i\}, t). \tag{2.27} \]


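Applying this transform to the ME gives

\[ \frac{\partial F(\{z_i\}, t)}{\partial t} = A(1, \partial_{z_i}, \partial^2_{z_i}, \dots)\, F(\{z_i\}, t) \tag{2.28} \]

where A is a differential operator in the z_i. Choosing the steady state, that is \partial F(\{z_i\}, t)/\partial t = 0, one may be able to calculate the moments of the distribution exactly. The following system can be solved analytically if the hierarchy of moment equations closes at some order. Assuming Gaussian distributions, only the first two moments are needed to completely define the concentrations of the reactants at the steady state.

\[ \begin{array}{ll} \text{1st order} & \lim_{z_l \to 1} \dfrac{\partial}{\partial z_j} \left( \dfrac{\partial F(\{z_i\}, t)}{\partial t} = 0 \right) \\[2ex] \text{2nd order} & \lim_{z_l \to 1} \dfrac{\partial^2}{\partial z_j \partial z_k} \left( \dfrac{\partial F(\{z_i\}, t)}{\partial t} = 0 \right) \\ \;\;\vdots & \;\;\vdots \end{array} \tag{2.29} \]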

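The moments are given by these relations, with all the z_k set to 1 after differentiation:

\[ \begin{aligned} \langle n_i(t) \rangle &= \left. \frac{\partial F(\{z_k\}, t)}{\partial z_i} \right|_{z_k = 1} \\ \langle n_i^2(t) \rangle - \langle n_i(t) \rangle &= \left. \frac{\partial^2 F(\{z_k\}, t)}{\partial z_i^2} \right|_{z_k = 1} \\ \langle n_i(t)\, n_j(t) \rangle \big|_{i \neq j} &= \left. \frac{\partial^2 F(\{z_k\}, t)}{\partial z_i \partial z_j} \right|_{z_k = 1} \end{aligned} \tag{2.30} \]

Variances and correlations can be obtained from equations (2.30) with the well-known identities

\[ \begin{aligned} \sigma_i^2 &= \langle n_i^2 \rangle - \langle n_i \rangle^2 \\ r_{ij} &= \frac{\langle n_i n_j \rangle - \langle n_i \rangle \langle n_j \rangle}{\sigma_i \sigma_j} \quad \text{(Pearson's linear correlation)} \end{aligned} \tag{2.31} \]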

2.4 Linear noise approximation

Since the master equation can be solved directly only in rare cases, van Kampen developed a systematic approximation. The Linear Noise Approximation (LNA), also called system size expansion, is based on an expansion in powers of the system volume \Omega, because for large \Omega the relative size of the fluctuations is expected to shrink. Given R reactions and N molecular species, the chemical master equation has the form

\[ \frac{dP(n, t)}{dt} = \Omega \sum_{i=1}^{R} \left( \prod_{j=1}^{N} E_j^{-v_{ij}} - 1 \right) f_i(n\, \Omega^{-1}, \Omega)\, P(n, t) \tag{2.32} \]

where \Omega is the volume of the system, v_{ij} is a matrix with dimensions (reactions)·(species) in which each element is the change in the number of molecules of species j due to reaction i, and E is the step operator. Through the substitution

\[ n = \Omega\, x + \Omega^{\frac{1}{2}}\, \xi \tag{2.33} \]

the master equation becomes

\[ \frac{\partial \Pi}{\partial t} - \Omega^{\frac{1}{2}} \sum_{i=1}^{N} \frac{d x_i}{dt} \frac{\partial \Pi}{\partial \xi_i} = \Omega \sum_{i=1}^{R} \left( \prod_{j=1}^{N} \sum_{l=0}^{\infty} \frac{(-v_{ij})^l}{l!}\, \Omega^{-\frac{l}{2}} \frac{\partial^l}{\partial \xi_j^l} - 1 \right) f_i(x + \Omega^{-\frac{1}{2}} \xi)\, \Pi(\xi, t). \tag{2.34} \]


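x is the mean concentration per unit of volume and \xi the fluctuation term. The equation collects different orders of \Omega: from the highest one (\Omega^{1/2}) we obtain the equations for the mean concentrations per unit of volume. From the order \Omega^0, under the assumption that \Omega is large enough (so that lower orders are negligible), we get the equations for the fluctuations. Expanding the functions f_i in powers of \Omega^{-1/2}\xi and keeping only the terms of order \Omega^0,

\[ \frac{\partial \Pi}{\partial t} = -\sum_{j,k=1}^{N} \left( \sum_{i=1}^{R} v_{ij} \frac{\partial f_i}{\partial x_k} \right) \frac{\partial}{\partial \xi_j} (\xi_k \Pi) + \frac{1}{2} \sum_{j,k=1}^{N} \left( \sum_{i=1}^{R} f_i(x)\, v_{ij} v_{ik} \right) \frac{\partial^2 \Pi}{\partial \xi_j \partial \xi_k}. \tag{2.35} \]

By defining the two matrices

\[ A_{jk} = \sum_{i=1}^{R} v_{ij} \frac{\partial f_i}{\partial x_k} \qquad B_{jk} = \sum_{i=1}^{R} f_i(x)\, v_{ij} v_{ik} \tag{2.36} \]

and using the solution of the multivariate Fokker-Planck equation for the first two moments (next paragraph), the following relationships are obtained.

\[ \partial_t \langle \xi \rangle = A \langle \xi \rangle \quad \text{with} \quad \langle \xi \rangle = (\langle \xi_1 \rangle, \dots, \langle \xi_N \rangle)^t ; \tag{2.37} \]

\[ \partial_t \Xi = A\, \Xi + \Xi\, A^t + B \quad \text{with} \quad \Xi_{ij} = \langle \xi_i \xi_j \rangle . \tag{2.38} \]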

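Linear multivariate Fokker-Planck equation To get the moments of eq. (2.35), one starts from a differential equation of this type

\[ \frac{\partial P(x, t)}{\partial t} = -\sum_{i,j} A_{ij} \frac{\partial}{\partial x_i} (x_j P(x, t)) + \frac{1}{2} \sum_{i,j} B_{ij} \frac{\partial^2}{\partial x_i \partial x_j} P(x, t) \tag{2.39} \]

where A_{ij} and B_{ij} are constant matrices and, in addition, B_{ij} is symmetric and positive semi-definite. Under the hypothesis that the solution is Gaussian, it is fully determined by its first two moments. Multiplying (2.39) by x_k and integrating over all x_i,

\[ \int_0^\infty dx\, \frac{\partial}{\partial t} (x_k P(x, t)) = \int_0^\infty dx \left[ -\sum_{i,j} A_{ij}\, x_k \frac{\partial}{\partial x_i} (x_j P(x, t)) + \frac{1}{2} \sum_{i,j} B_{ij}\, x_k \frac{\partial^2}{\partial x_i \partial x_j} P(x, t) \right] \]

and, integrating by parts with P(0, t) = 0 and P(\infty, t) = 0,

\[ \partial_t \langle x_k \rangle = \sum_j A_{kj} \langle x_j \rangle \tag{2.40} \]

Multiplying (2.39) by x_k x_l and integrating,

\[ \int_0^\infty dx\, \frac{\partial}{\partial t} (x_k x_l P(x, t)) = \int_0^\infty dx \left[ -\sum_{i,j} A_{ij}\, x_k x_l \frac{\partial}{\partial x_i} (x_j P(x, t)) + \frac{1}{2} \sum_{i,j} B_{ij}\, x_k x_l \frac{\partial^2}{\partial x_i \partial x_j} P(x, t) \right] \]

and, again by parts with P(0, t) = 0 and P(\infty, t) = 0,

\[ \partial_t \langle x_k x_l \rangle = \sum_j A_{kj} \langle x_l x_j \rangle + \sum_j A_{lj} \langle x_k x_j \rangle + B_{kl} \tag{2.41} \]

At the steady state the time derivatives vanish and the mean values x_i(t) \to x^s_i are constant. Solving the linear system

\[ \begin{cases} \sum_j A_{kj} \langle x_j \rangle = 0 \\ \sum_j A_{kj} \langle x_l x_j \rangle + \sum_j A_{lj} \langle x_k x_j \rangle + B_{kl} = 0 \end{cases} \tag{2.42} \]

one obtains the first and second moments at the stationary state.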

2.5 Numerical solution of the Master Equation

2.5.1 Gillespie algorithm (SSA)

It was first introduced by Gillespie in 1976 [41, 42]. It provides a stochastic formulation of chemical kinetics and, in particular, of the time evolution of any spatially homogeneous mixture of molecular species interacting through a defined set of chemical reactions.

Let us consider a chemical mixture with N molecular species and M reactions. We define a vector n with N components, containing the number of molecules of each species,

\[ n = (n_1, \cdots, n_N) \tag{2.43} \]

and, for each reaction \mu, a vector r_\mu accounting for the molecular changes due to reaction \mu. The vector of molecule numbers after reaction \mu occurs is therefore the sum

\[ n' = n + r_\mu \tag{2.44} \]

Now we define k_\mu dt as the probability that reaction \mu occurs in the next time interval dt, so that the probability that the next reaction is \mu and occurs in [t + \tau, t + \tau + d\tau] is

\[ P(\tau, \mu)\, d\tau = P_0(\tau)\, k_\mu\, d\tau \tag{2.45} \]

with P_0(\tau) the probability that no reaction occurs in [t, t + \tau]. The rate k_\mu is the product of two terms, k_\mu = h_\mu c_\mu, where c_\mu dt is the average probability that a particular combination of reactant molecules reacts in the time interval dt and h_\mu is a combinatorial factor counting all distinct reactant combinations present in the volume V at time t (Tab. 2.1).

Reaction               h_µ
∗ → products           1
X → products           X
X + Y → products       X · Y
2X → products          X(X−1)/2

Table 2.1: Examples of h_µ for typical reactions.

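By this assumption, the probability that no reaction occurs in the time step \varepsilon = \tau/K is P_0(\varepsilon) = \prod_\mu (1 - k_\mu \varepsilon) = 1 - \sum_\mu k_\mu \varepsilon + o(\varepsilon). Considering K such time steps, in the limit of a large number of steps one gets P_0(\tau),

\[ P_0(\tau) = \lim_{K \to \infty} \left( 1 - \sum_\nu k_\nu \varepsilon \right)^K = \lim_{K \to \infty} \left( 1 - \frac{\sum_\nu k_\nu \tau}{K} \right)^K = \exp\left( -\sum_\nu k_\nu \tau \right) \tag{2.46} \]

and therefore

\[ P(\tau, \mu) = k_\mu \exp\left( -\sum_{\nu=1}^{M} k_\nu \tau \right) \tag{2.47} \]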

Thanks to eq. (2.47), one can define an algorithm that simulates the reactions in a mixture of chemical species. The implementation proceeds through the following steps:

(1) Initialization: set n to the initial number of molecules of each species, set the time t = 0 and calculate the reaction rates k_\nu;

(2) Monte Carlo step: generate a pair (\tau, \mu) with one of the methods described in the next numbered list;

(3) Update: set n(t + \tau) = n(t) + r_\mu, recalculate the reaction rates k_\nu, set the time t = t + \tau and go to (2).

The Monte Carlo step is the key to managing the computational time of the simulation. It is essential to choose the right method for the purpose:

i. Direct method: P(\tau, \mu) can be decomposed into the product of a conditional probability and the probability of the time interval \tau

\[ P(\tau, \mu) = P(\mu | \tau)\, P(\tau) \tag{2.48} \]

\[ P(\tau) = \sum_\mu P(\tau, \mu) = \sum_{\mu=1}^{M} k_\mu \exp\left( -\sum_{\nu=1}^{M} k_\nu \tau \right) \tag{2.49} \]

P(\tau) gives the distribution of the time step for the whole set of reactions and P(\mu | \tau) selects the reaction that will occur

\[ P(\mu | \tau) = \frac{P(\tau, \mu)}{P(\tau)} = \frac{k_\mu}{\sum_\nu k_\nu} \tag{2.50} \]

Computational cost: (2 · simulation events) random numbers generated.

ii. First reaction method: P(\tau, \mu) can be used directly to draw a reaction time for each reaction, after which the reaction that happens first is chosen. In this case M random numbers must be generated at each step, one time step per reaction.

Computational cost: (reactions · simulation events) random numbers generated.

iii. Next reaction method: introduced by Gibson and Bruck in 2000 [40]. It improves on the first reaction method, nearly halving the simulation time, and is based on the observation that usually only a small subset of the k_\mu must be updated at each step and, consequently, only the related reaction times need to be regenerated. The algorithm defines a Dependency Graph, which records the k_\mu to be recalculated, while the \tau_\mu of each reaction are stored in the so-called Indexed Priority Queue. At each step the reaction with the minimum \tau_\mu happens, only the necessary k_\mu are updated and new \tau_\mu are calculated for the updated reactions. The authors also proved that it is correct to re-use the \tau_\mu that are not updated.

Computational cost: (reactions + simulation events) random numbers generated.

Thus, if there is time to invest in the implementation, the fastest algorithm is the Next reaction method. For common purposes, however, the Direct method is sufficient. I advise against the First reaction method, because it is slower than the Direct one with a similar implementation effort.
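The steps (1)-(3) combined with the Direct method can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the simulations in this work: the function name `gillespie_direct`, its signature, and the production-degradation rates used in the example call are hypothetical.

```python
import math
import random

def gillespie_direct(n0, stoich, propensities, t_end, seed=0):
    """Direct-method SSA.
    n0: initial copy numbers; stoich: list of change vectors r_mu;
    propensities: function n -> list of rates k_mu.
    Returns the jump times and the state after each jump."""
    rng = random.Random(seed)
    t, n = 0.0, list(n0)
    times, states = [t], [tuple(n)]
    while True:
        k = propensities(n)
        k_tot = sum(k)
        if k_tot <= 0.0:
            break  # no reaction can fire any more
        # Waiting time tau ~ Exp(k_tot), by inversion of Eq. (2.49).
        tau = -math.log(1.0 - rng.random()) / k_tot
        if t + tau > t_end:
            break
        # Reaction index mu with probability k_mu / k_tot, Eq. (2.50).
        threshold, acc, mu = rng.random() * k_tot, 0.0, 0
        for mu, k_mu in enumerate(k):
            acc += k_mu
            if threshold < acc:
                break
        t += tau
        n = [x + dx for x, dx in zip(n, stoich[mu])]
        times.append(t)
        states.append(tuple(n))
    return times, states

# Example: production (rate alpha) and degradation (rate beta * n).
alpha, beta = 10.0, 1.0   # hypothetical rates
times, states = gillespie_direct(
    n0=[0], stoich=[[+1], [-1]],
    propensities=lambda n: [alpha, beta * n[0]], t_end=2000.0)
# Time-weighted mean copy number, to be compared with alpha / beta.
durations = [t2 - t1 for t1, t2 in zip(times, times[1:])]
mean_n = sum(s[0] * d for s, d in zip(states, durations)) / sum(durations)
print(mean_n)
```

Two random numbers are drawn per event, one for the waiting time and one for the reaction index, which is the cost quoted for the Direct method.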

Numerical integration of the Master Equation

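Starting from the definition of the master equation, a direct way to solve it is numerical integration. Indeed,

\[ \dot{p}_n = \sum_{n'} \left( W_{nn'}\, p_{n'} - W_{n'n}\, p_n \right) \tag{2.51} \]

\[ \frac{p_n(t + \Delta t) - p_n(t)}{\Delta t} = \sum_{n'} \left( W_{nn'}\, p_{n'}(t) - W_{n'n}\, p_n(t) \right) \tag{2.52} \]

\[ p_n(t + \Delta t) = p_n(t) + \Delta t \sum_{n'} \left( W_{nn'}\, p_{n'}(t) - W_{n'n}\, p_n(t) \right) \tag{2.53} \]

The last equation may be used to develop a finite-difference method for the numerical integration of the master equation, as we will see with an example.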
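A forward-difference update of this kind can be sketched for a one-species production-degradation process on a truncated state space. The function name, the truncation at `n_max` (with production switched off at the cap so that probability is conserved), the time step and the rates are assumptions chosen for this illustration.

```python
import math

def integrate_birth_death(alpha, beta, n_max, dt, n_steps):
    """Forward-Euler integration of the master equation for a
    production-degradation process on the states 0..n_max."""
    p = [0.0] * (n_max + 1)
    p[0] = 1.0  # start with zero molecules
    for _ in range(n_steps):
        new_p = p[:]
        for n in range(n_max + 1):
            birth = alpha if n < n_max else 0.0   # no production at the cap
            flux_out = (birth + beta * n) * p[n]
            flux_in = 0.0
            if n > 0:
                flux_in += alpha * p[n - 1]
            if n < n_max:
                flux_in += beta * (n + 1) * p[n + 1]
            new_p[n] = p[n] + dt * (flux_in - flux_out)
        p = new_p
    return p

alpha, beta = 5.0, 1.0   # hypothetical rates
p = integrate_birth_death(alpha, beta, n_max=40, dt=0.005, n_steps=4000)
mean = sum(n * pn for n, pn in enumerate(p))
# Poisson distribution with mean alpha / beta, for comparison.
poisson = [math.exp(-alpha / beta) * (alpha / beta) ** n / math.factorial(n)
           for n in range(41)]
print(mean, max(abs(a - b) for a, b in zip(p, poisson)))
```

The time step must satisfy dt · (largest total exit rate) < 1, here 0.005 · 45, otherwise the explicit scheme can produce negative probabilities.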

2.6 Case studies

2.6.1 Production-degradation process

As a first example, we look at the birth-death process, in which a molecule is produced with rate \alpha and degraded with rate \beta, as in figure 2.1. The related ME is

\[ \frac{\partial P(m, t)}{\partial t} = \alpha \left[ P(m-1, t) - P(m, t) \right] + \beta \left[ (m+1) P(m+1, t) - m P(m, t) \right] \tag{2.54} \]


Figure 2.1: Birth-death process.

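Applying the generating function approach (Z-transform) to this ME, one gets

\[ \partial_t F(z, t) = \alpha (z - 1) F(z, t) + \beta (1 - z)\, \partial_z F(z, t) \tag{2.55} \]

At the steady state \partial_t F = 0, which gives the analytical form of F(z) for large t

\[ F^{ss}(z) = e^{\frac{\alpha}{\beta}(z-1)} = e^{\langle m \rangle (z-1)} \tag{2.56} \]

Since F(z, t) = \sum_{n=0}^{\infty} z^n P(n, t), the probability distribution P^{ss}(n) follows from expanding F^{ss}(z) in powers of z

\[ F^{ss}(z) = e^{-\langle m \rangle} \sum_{n=0}^{\infty} \frac{(\langle m \rangle z)^n}{n!} \;\Rightarrow\; P^{ss}(n) = e^{-\langle m \rangle} \frac{\langle m \rangle^n}{n!} \tag{2.57} \]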
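The step from Eq. (2.55) to Eq. (2.56) can be spelled out as a short worked derivation. It only uses the normalization F(1, t) = \sum_n P(n, t) = 1 and the relation between the first moment and the first derivative of F given in (2.30).

```latex
\begin{aligned}
0 &= \alpha (z-1) F^{ss}(z) + \beta (1-z)\, \partial_z F^{ss}(z)
  \;\Rightarrow\; \partial_z F^{ss}(z) = \frac{\alpha}{\beta}\, F^{ss}(z), \\
F^{ss}(z) &= F^{ss}(1)\, e^{\frac{\alpha}{\beta}(z-1)} = e^{\frac{\alpha}{\beta}(z-1)},
\qquad
\langle m \rangle = \partial_z F^{ss}(z) \big|_{z=1} = \frac{\alpha}{\beta}.
\end{aligned}
```

The second derivative at z = 1 gives \langle m^2 \rangle - \langle m \rangle = (\alpha/\beta)^2, so the variance equals the mean, as expected for a Poisson distribution.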

This process gives a Poisson distribution at the steady state; simulations and the theoretical distribution are in perfect agreement (Fig. 2.2). Unfortunately, experiments show that proteins follow distributions with larger variances, similar to those of a Γ distribution, so this simple process is unable to capture the right behaviour. To describe this effect, a two-step process is introduced, mirroring the transcription and translation typical of gene expression.

Figure 2.2: Birth-death process simulation (Gillespie) and expected Poisson distribution.

2.6.2 Expression of a constitutive gene

Let us focus on the dynamics of gene expression. As is well known, protein production is divided into two processes, transcription (DNA → mRNA) and translation (mRNA → protein). Transcription can be modeled as the production of mRNA with rate \alpha_m, balanced by degradation and, for growing cells, dilution, \beta_m = \beta_{deg} + \beta_{dil}. Translation likewise has a degradation/dilution rate, but the translation rate is per single mRNA molecule, so the production term is expected to be proportional to the mRNA concentration (Fig. 2.3).

Figure 2.3: m-p circuit (DNA → mRNA → protein).

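Master equation Looking at the reactions in Fig. 2.3, the related master equation is

\[ \begin{aligned} \frac{\partial P(m, p, t)}{\partial t} = {} & \alpha_m \left[ P(m-1, p, t) - P(m, p, t) \right] + \beta_m \left[ (m+1) P(m+1, p, t) - m P(m, p, t) \right] \\ & + \alpha_p m \left[ P(m, p-1, t) - P(m, p, t) \right] + \beta_p \left[ (p+1) P(m, p+1, t) - p P(m, p, t) \right] \end{aligned} \tag{2.58} \]

and through the step operator it can be written as

\[ \frac{\partial P(m, p, t)}{\partial t} = \left[ \alpha_m (E_m^{-1} - 1) + \beta_m (E_m^{1} - 1)\, m + \alpha_p m\, (E_p^{-1} - 1) + \beta_p (E_p^{1} - 1)\, p \right] P(m, p, t) \tag{2.59} \]

Through the Z-transform the master equation becomes

\[ \partial_t F = \alpha_m (u - 1) F + \beta_m (1 - u)\, \partial_u F + \alpha_p (v - 1)\, u\, \partial_u F + \beta_p (1 - v)\, \partial_v F \tag{2.60} \]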

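From the generating function approach shown before, the estimated average, variance and correlation values are:

• mean values

\[ \langle m \rangle = \frac{\alpha_m}{\beta_m} \qquad \langle p \rangle = \frac{\alpha_m \alpha_p}{\beta_m \beta_p} \tag{2.61} \]

• variances

\[ \sigma_m^2 = \frac{\alpha_m}{\beta_m} \qquad \sigma_p^2 = \frac{\alpha_m \alpha_p (\alpha_p + \beta_m + \beta_p)}{\beta_m \beta_p (\beta_m + \beta_p)} \tag{2.62} \]

• correlations

\[ r_{mp} = \sqrt{\frac{\alpha_p \beta_p}{(\beta_m + \beta_p)(\alpha_p + \beta_m + \beta_p)}} \tag{2.63} \]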


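Alternatively, following the linear noise approximation, the change of variables is

\[ m = \Omega\, x + \sqrt{\Omega}\, \xi_m \qquad p = \Omega\, y + \sqrt{\Omega}\, \xi_p \tag{2.64} \]

and the expanded equation is

\[ \begin{aligned} \frac{\partial \Pi}{\partial t} - \Omega^{\frac{1}{2}} \frac{dx}{dt} \frac{\partial \Pi}{\partial \xi_m} - \Omega^{\frac{1}{2}} \frac{dy}{dt} \frac{\partial \Pi}{\partial \xi_p} = {} & \Omega^{\frac{1}{2}} \left[ (-\alpha_m + \beta_m x) \frac{\partial \Pi}{\partial \xi_m} + (-\alpha_p x + \beta_p y) \frac{\partial \Pi}{\partial \xi_p} \right] \\ & + \Omega^{0} \left[ \frac{\partial}{\partial \xi_m} (\beta_m \xi_m \Pi) + \frac{1}{2} (\alpha_m + \beta_m x) \frac{\partial^2 \Pi}{\partial \xi_m^2} \right. \\ & \left. + \frac{\partial}{\partial \xi_p} \left[ (\beta_p \xi_p - \alpha_p \xi_m) \Pi \right] + \frac{1}{2} (\alpha_p x + \beta_p y) \frac{\partial^2 \Pi}{\partial \xi_p^2} \right] + o(\Omega^0) \end{aligned} \tag{2.65} \]

In the limit of large volume, \Omega \to \infty, the surviving orders are:

\Omega^{1/2}: this order vanishes, and we recover the mean-field equations

\[ \left( \frac{dx}{dt} - \alpha_m + \beta_m x \right) \frac{\partial \Pi}{\partial \xi_m} = 0 \qquad \left( \frac{dy}{dt} - \alpha_p x + \beta_p y \right) \frac{\partial \Pi}{\partial \xi_p} = 0 \tag{2.66} \]

\Omega^{0}: this is the order of the fluctuations

\[ \frac{\partial \Pi}{\partial t} = \frac{\partial}{\partial \xi_m} (\beta_m \xi_m \Pi) + \frac{1}{2} (\alpha_m + \beta_m x) \frac{\partial^2 \Pi}{\partial \xi_m^2} + \frac{\partial}{\partial \xi_p} \left[ (\beta_p \xi_p - \alpha_p \xi_m) \Pi \right] + \frac{1}{2} (\alpha_p x + \beta_p y) \frac{\partial^2 \Pi}{\partial \xi_p^2} \tag{2.67} \]

From the order \Omega^{1/2} we get the steady-state mean values, which coincide with those of the generating function method.

\[ x = \frac{\alpha_m}{\beta_m} \qquad y = \frac{\alpha_m \alpha_p}{\beta_m \beta_p} \tag{2.68} \]

The A and B matrices are

\[ A = \begin{pmatrix} -\beta_m & 0 \\ \alpha_p & -\beta_p \end{pmatrix} \qquad B = \begin{pmatrix} \alpha_m + \beta_m x_s & 0 \\ 0 & \alpha_p x_s + \beta_p y_s \end{pmatrix} \tag{2.69} \]

From A we get \langle \xi_m \rangle = \langle \xi_p \rangle = 0, and from B

• variances

\[ \sigma_m^2 = \frac{\alpha_m}{\beta_m} \qquad \sigma_p^2 = \frac{\alpha_m \alpha_p (\alpha_p + \beta_m + \beta_p)}{\beta_m \beta_p (\beta_m + \beta_p)} \tag{2.70} \]

• correlations

\[ r_{mp} = \sqrt{\frac{\alpha_p \beta_p}{(\beta_m + \beta_p)(\alpha_p + \beta_m + \beta_p)}} \tag{2.71} \]

In this simple case, the generating function approach and the linear noise approximation give identical results.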

Gillespie simulation Numerical simulations of systems of chemical reactions can be performed with the Gillespie algorithm. The chemical reactions, with their rates, are

∗ → m          α_m
m → m + p      α_p
m → ∗          β_m
p → ∗          β_p          (2.72)


Numerical integration of ME The transition probability per unit of time is

\[ W_{(m p),(m' p')} = a(m', p')\, \delta_{m, m'+1} + b(m', p')\, \delta_{m, m'-1} + c(m', p')\, \delta_{p, p'+1} + d(m', p')\, \delta_{p, p'-1} \tag{2.73} \]

with the coefficients defined as

\[ a(m, p) = \alpha_m \qquad b(m, p) = \beta_m m \qquad c(m, p) = \alpha_p m \qquad d(m, p) = \beta_p p \tag{2.74} \]

Eq. (2.51) becomes

\[ \begin{aligned} \frac{d\, p(m, p)}{dt} = {} & a(m-1, p)\, p(m-1, p) + b(m+1, p)\, p(m+1, p) + c(m, p-1)\, p(m, p-1) + d(m, p+1)\, p(m, p+1) \\ & - \left[ a(m, p) + b(m, p) + c(m, p) + d(m, p) \right] p(m, p) \end{aligned} \tag{2.75} \]

and, using the finite-difference expansion of the first derivative, one gets the evolution of the probability p_t(m, p) over a time step

\[ \begin{aligned} p_{t+\Delta t}(m, p) = {} & \Delta t \left[ a(m-1, p)\, p_t(m-1, p) + b(m+1, p)\, p_t(m+1, p) + c(m, p-1)\, p_t(m, p-1) + d(m, p+1)\, p_t(m, p+1) \right] \\ & + \left( 1 - \Delta t \left[ a(m, p) + b(m, p) + c(m, p) + d(m, p) \right] \right) p_t(m, p) \end{aligned} \tag{2.76} \]

As boundary conditions, we have a reflecting lower bound at 0, because the number of molecules cannot be negative. At the upper bound we set the production rate to 0, so that the system remains confined and the total probability is conserved. The upper boundary creates a small distortion, which is negligible if the bound is located at a sufficiently large value. An example can be found in the density plot of figure 2.4.

Summing over m or p, we get the marginal probabilities P(m) = \sum_p P(m, p) and P(p) = \sum_m P(m, p). We then compare the SSA sample with the numerically integrated distribution, see Fig. 2.5.

Comparison between methods With the set of parameters α_m = 0.001, α_p = 0.04, β_m = 0.00033, β_p = 0.00017, t = 100000 s, we evaluate the steady-state values of the concentrations, noises and correlations obtained from the different approaches in the table below. The agreement is good for the steady-state values.

Steady state ex.
                     Mean value                 Noise                        Corr.
                     m            p             m             p              r_mp
Gillespie            3.01-3.06    711.3-714.1   0.569-0.580   0.331-0.339    0.568-0.592
Gen. Func. & LNA     3.03         713.02        0.574456      0.33705        0.579485
Num. integr.         3.02         711           0.57          0.333          0.576
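The "Gen. Func. & LNA" row can be reproduced by solving the steady-state covariance equation A Ξ + Ξ A^t + B = 0 of Eq. (2.38) entry by entry for the 2×2 matrices of Eq. (2.69). The sketch below is one way to do this. The function name and the dictionary layout are choices made for the example, and noise is taken here as standard deviation divided by mean.

```python
import math

def two_stage_moments(alpha_m, alpha_p, beta_m, beta_p):
    """Steady-state moments of the mRNA-protein model from
    A*Xi + Xi*A^t + B = 0, solved entry by entry for the 2x2 case."""
    x = alpha_m / beta_m                 # mean mRNA, Eq. (2.68)
    y = alpha_p * x / beta_p             # mean protein, Eq. (2.68)
    b_mm = alpha_m + beta_m * x          # diagonal of B, Eq. (2.69)
    b_pp = alpha_p * x + beta_p * y
    # (m,m) entry: -2*beta_m*xi_mm + b_mm = 0
    xi_mm = b_mm / (2.0 * beta_m)
    # (m,p) entry: alpha_p*xi_mm - (beta_m + beta_p)*xi_mp = 0
    xi_mp = alpha_p * xi_mm / (beta_m + beta_p)
    # (p,p) entry: 2*alpha_p*xi_mp - 2*beta_p*xi_pp + b_pp = 0
    xi_pp = (2.0 * alpha_p * xi_mp + b_pp) / (2.0 * beta_p)
    return {
        "mean_m": x, "mean_p": y,
        "noise_m": math.sqrt(xi_mm) / x,
        "noise_p": math.sqrt(xi_pp) / y,
        "r_mp": xi_mp / math.sqrt(xi_mm * xi_pp),
    }

res = two_stage_moments(alpha_m=0.001, alpha_p=0.04,
                        beta_m=0.00033, beta_p=0.00017)
for name, value in res.items():
    print(name, value)
```

Solving the three scalar equations in this order works because A is lower triangular, so each entry of Ξ depends only on entries already computed.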

Approximating the generating function approach for discrete values In [98] the case of proteins much more stable than messengers (\beta_m / \beta_p \gg 1) is studied, and the expected distribution for the protein becomes a negative binomial

\[ P(p) = \frac{\Gamma(a + p)}{p!\, \Gamma(a)} \frac{b^p}{(1 + b)^{a+p}} \tag{2.77} \]

where a = \alpha_m / \beta_p and b = \alpha_p / \beta_m are, respectively, the burst frequency and the burst size of protein production.


Figure 2.4: Numerical integration of the master equation. Note that the lattice discreteness is visible on the m axis.

Figure 2.5: Comparison between the numerical integration of the master equation and the Gillespie algorithm results. The average over 1000 trajectories is shown.

Gamma distribution

Another interesting approach, which gives further insight into gene expression, is the one followed by Friedman et al. [36]. Given the reaction scheme in Fig. 2.3, the distribution of proteins in a population of cells satisfies the continuous master equation (2.78). This equation integrates out the messenger step and defines a process in which proteins are produced in bursts, which occur with the mRNA production rate \alpha_m, and are degraded with their own rate \beta_p.


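\[ \frac{\partial p(x)}{\partial t} = \frac{\partial}{\partial x} (\beta_p\, x\, p(x)) + \alpha_m \int_0^x w(x, x')\, p(x')\, dx' \tag{2.78} \]

where x is the concentration of a protein and w(x, x') = w(x | x') - \delta(x - x'). Now we suppose that w(x | x') depends only on the distance from the arrival state, w(x | x') = v(x - x') \Rightarrow w(x, x') = w(x - x') = v(x - x') - \delta(x - x'). At the steady state the following equation holds

\[ -\frac{\partial}{\partial x} (x\, p(x)) = a \int_0^x w(x - x')\, p(x')\, dx' , \qquad a = \frac{\alpha_m}{\beta_p} \tag{2.79} \]

and through the Laplace transform one gets

\[ s \frac{\partial \hat{p}}{\partial s} = a\, \hat{w}\, \hat{p} \tag{2.80} \]

Setting the burst distribution v(x) = \frac{1}{b} \exp\left( -\frac{x}{b} \right), where b denotes the average burst size, one finds, with the normalization \hat{p}(0) = 1,

\[ \hat{p}(s) = (1 + b s)^{-a} = b^{-a} \left( s + \frac{1}{b} \right)^{-a} \tag{2.81} \]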

Through the inverse transform, the solution for the protein concentration is a Gamma distribution

\[ p(x) = \frac{1}{b^a\, \Gamma(a)}\, x^{a-1} e^{-\frac{x}{b}} \tag{2.82} \]

where \Gamma is the gamma function. The Gamma distribution has been verified in E. coli in [108], with a series of interesting results about mRNA and protein dynamics. Assuming that a and b vary slowly, so that they are described by the distributions f(a) and g(b), the protein distribution becomes

\[ p(x) = \int_0^\infty \int_0^\infty \frac{x^{a-1} e^{-\frac{x}{b}}}{\Gamma(a)\, b^a}\, f(a)\, g(b)\, da\, db \tag{2.83} \]

It is then possible to find a relation for intrinsic and extrinsic noise. Extrinsic noise is related to fluctuations in cellular components such as ribosome and polymerase concentrations, while intrinsic noise is due to the stochastic nature of the transcription and translation processes.

\[ \eta_p^2 = \overbrace{\frac{1 + \langle b \rangle + \langle b \rangle \eta_b^2}{\mu_p}}^{\text{intrinsic}} + \overbrace{\eta_a^2 + \eta_a^2 \eta_b^2 + \eta_b^2}^{\text{extrinsic}} \tag{2.84} \]

Figure 2.6: Intrinsic and extrinsic noise found in [36] and explained by relation (2.84).


2.6.3 Transcriptional regulation

In this section, a transcription factor (TF) regulates a target protein. This circuit is an example of TF-promoter interaction, which can be of two kinds, activatory or repressive. As studied in several papers, e.g. [5, 68, 49], the strength of the effect of a transcription factor on the transcription rate of its target gene is described by an input function. This function should be monotonic and depends on the concentration of the transcription factor and on the type of regulation. A common function describing many real gene input functions is the Hill function [5].

activatory:   H(x) = k \frac{x^n}{x^n + h^n}
repressive:   H(x) = \frac{k}{1 + x^n / h^n}

The Hill function has three parameters: k, the maximum transcription rate; h, the activation or repression coefficient; and n, which is related to the steepness. For large n the Hill function becomes step-like; typical values are in the range 1-4 [5]. Some genes have a basal expression level, and the typical way to implement this behaviour is to define a basal transcription rate plus a Hill function

\[ H_{basal}(x) = k_0 + H(x) \tag{2.85} \]

where k_0 is the basal transcription rate and the maximum transcription rate becomes k_0 + k. For this motif we set k_0 = 0.

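The associated master equation is

$$
\begin{aligned}
\frac{\partial P(n_i,t)}{\partial t} ={}& \alpha_{mtf}\,[P(n_1-1) - P(n_1)] + \beta_{mtf}\,[(n_1+1)P(n_1+1) - n_1 P(n_1)] \\
&+ \alpha_{tf}\, n_1\,[P(n_2-1) - P(n_2)] + \beta_{tf}\,[(n_2+1)P(n_2+1) - n_2 P(n_2)] \\
&+ \alpha_{mt}\,\frac{n_2^{\,n}}{n_2^{\,n} + h^n}\,[P(n_3-1) - P(n_3)] + \beta_{mt}\,[(n_3+1)P(n_3+1) - n_3 P(n_3)] \\
&+ \alpha_{t}\, n_3\,[P(n_4-1) - P(n_4)] + \beta_{t}\,[(n_4+1)P(n_4+1) - n_4 P(n_4)]
\end{aligned}
\tag{2.86}
$$

with $n_i = (mtf, tf, mt, t)$. Writing the Hill function as a power series $\sum_l C_l n_2^l$ and exploiting the step operator, it takes the form

$$
\begin{aligned}
\frac{\partial P}{\partial t} = \Big[\,& \alpha_{mtf}\,(E_1^{-1} - 1) + \beta_{mtf}\,(E_1^{1} - 1)\,n_1 + \alpha_{tf}\,n_1\,(E_2^{-1} - 1) + \beta_{tf}\,(E_2^{1}-1)\,n_2 \\
&+ \alpha_{mt}\Big(\sum_l C_l\, n_2^{\,l}\Big)(E_3^{-1} - 1) + \beta_{mt}\,(E_3^{1} - 1)\,n_3 \\
&+ \alpha_t\, n_3\,(E_4^{-1} - 1) + \beta_t\,(E_4^{1} - 1)\, n_4 \,\Big]\, P
\end{aligned}
\tag{2.87}
$$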

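The mean field equations for this model are

$$
\begin{aligned}
\frac{d[mtf]}{dt} &= \alpha_{mtf} - \beta_{mtf}\,[mtf] \\
\frac{d[tf]}{dt} &= \alpha_{tf}\,[mtf] - \beta_{tf}\,[tf] \\
\frac{d[mt]}{dt} &= \alpha_{mt}\,\frac{[tf]^n}{[tf]^n + h^n} - \beta_{mt}\,[mt] \\
\frac{d[t]}{dt} &= \alpha_{t}\,[mt] - \beta_{t}\,[t]
\end{aligned}
\tag{2.88}
$$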

Hill function expansion To make calculations feasible, the Hill function must be expanded. Since we are usually interested in steady state values, we approximate the Hill function

$$ H(x) = \frac{x^n}{x^n + h^n} \tag{2.89} $$

around the expected steady state value of the transcription factor concentration. We therefore expand it near the steady state value $x_s$ of $x$ (in our case $x \equiv tf$),

$$ x(t) \xrightarrow[t\to\infty]{} x_s, \qquad H(x) = \sum_{n=0}^{\infty} a_n\,(x - x_s)^n, \tag{2.90} $$

then collect the terms of each order in the expanded variable and define a new series expansion (in these sums $n$ is a summation index, not the Hill coefficient)

$$ H(x) = \sum_{n=0}^{\infty} a_n\,(x - x_s)^n = \sum_{n=0}^{\infty} C_n\, x^n \tag{2.91} $$
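As an illustration of the expansion above, the following Python sketch computes the two coefficients $C_0$ and $C_1$ obtained when the series (2.91) is truncated at first order. The function names and the numerical values are illustrative assumptions, not taken from the thesis code; the maximum rate $k$ is set to 1.

```python
def hill(x, h, n):
    """Activatory Hill function H(x) = x^n / (x^n + h^n), with k = 1."""
    return x**n / (x**n + h**n)


def hill_derivative(x, h, n):
    """Analytical derivative dH/dx = n h^n x^(n-1) / (x^n + h^n)^2."""
    return n * h**n * x ** (n - 1) / (x**n + h**n) ** 2


def first_order_coefficients(x_s, h, n):
    """Return (C0, C1) such that H(x) is approximated by C0 + C1 * x near x_s."""
    slope = hill_derivative(x_s, h, n)
    c1 = slope
    c0 = hill(x_s, h, n) - slope * x_s
    return c0, c1


# Illustrative values: expansion point, activation coefficient, Hill coefficient.
x_s, h, n = 8912.66, 200.0, 1
c0, c1 = first_order_coefficients(x_s, h, n)

# Compare the linearized and the exact Hill function slightly away from x_s.
x_test = x_s + 100.0
approx = c0 + c1 * x_test
exact = hill(x_test, h, n)
print(c0, c1, approx, exact)
```

The linearized form coincides with the Hill function at $x_s$ by construction, and the error grows quadratically with the distance from the expansion point.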

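Generating function The transformed master equation, with the Hill function expanded to first order, takes this form

$$
\begin{aligned}
\frac{\partial F(z_i,t)}{\partial t} ={}& \big[\alpha_{mtf}\,(z_1 - 1) + \alpha_{mt}\,C_0\,(z_3 - 1)\big]\,F(z_i,t) \\
&+ \big[\beta_{mtf}\,(1 - z_1) + \alpha_{tf}\,z_1\,(z_2 - 1)\big]\,\frac{\partial F(z_i,t)}{\partial z_1} \\
&+ \big[\beta_{tf}\,(1 - z_2) + \alpha_{mt}\,C_1\,z_2\,(z_3 - 1)\big]\,\frac{\partial F(z_i,t)}{\partial z_2} \\
&+ \big[\beta_{mt}\,(1 - z_3) + \alpha_{t}\,z_3\,(z_4 - 1)\big]\,\frac{\partial F(z_i,t)}{\partial z_3} + \beta_t\,(1 - z_4)\,\frac{\partial F(z_i,t)}{\partial z_4}
\end{aligned}
\tag{2.92}
$$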

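Linear noise approximation With the system size $\Omega$ made explicit, the master equation reads

$$
\begin{aligned}
\frac{\partial P}{\partial t} = \Big[\,& \Omega\,\alpha_{mtf}\,(E_1^{-1} - 1) + \beta_{mtf}\,(E_1^{1} - 1)\,n_1 + \alpha_{tf}\,n_1\,(E_2^{-1} - 1) + \beta_{tf}\,(E_2^{1}-1)\,n_2 \\
&+ \alpha_{mt}\Big(\sum_l \Omega^{1-l}\, C_l\, n_2^{\,l}\Big)(E_3^{-1} - 1) + \beta_{mt}\,(E_3^{1} - 1)\,n_3 \\
&+ \alpha_t\, n_3\,(E_4^{-1} - 1) + \beta_t\,(E_4^{1} - 1)\, n_4 \,\Big]\, P
\end{aligned}
\tag{2.93}
$$

Note that in the full expansion the leading order is $\Omega^{1/2}$ and the next one is $\Omega^{0}$. Assuming that $\Omega$ is large enough, the subsequent orders ($\Omega^{-n/2}$ with $n > 0$) can be neglected. Each operator $E_i^k - 1$ brings two terms, one proportional to $\Omega^{-1/2}$, containing $\partial_{\xi_i}$, and the other proportional to $\Omega^{-1}$, containing $\partial^2_{\xi_i}$. In the master equation the Hill function is multiplied by a term $E_i^k - 1$. Because the order $\Omega^{1/2}$ gives the equation for $x_i(t)$ (see (2.88)), it cancels out and we need the terms proportional to $\Omega^{0}$. These considerations are useful to keep the right orders of the Hill series expansion.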

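• substituting $n = \Omega x + \sqrt{\Omega}\,\xi$ into the Hill function

$$
\begin{aligned}
H(n_i) = \sum_{l=0}^{\infty} C_l\, n_i^{\,l} &= \sum_{l} C_l\, \Omega^{1-l}\,\big(\Omega x_i + \Omega^{1/2}\xi_i\big)^{l} \\
&= \sum_{l=0}^{\infty} C_l\, \Omega\,\big(x_i + \Omega^{-1/2}\xi_i\big)^{l} \\
&= \sum_{l=0}^{\infty} C_l\, \Omega \sum_{k=0}^{l} \binom{l}{k}\, x_i^{\,l-k}\,\big(\Omega^{-1/2}\xi_i\big)^{k} \\
&= \sum_{l=0}^{\infty} C_l \sum_{k=0}^{l} \binom{l}{k}\, x_i^{\,l-k}\,\xi_i^{\,k}\;\Omega^{1-\frac{k}{2}}
\end{aligned}
\tag{2.94}
$$

and, with the previous consideration, we collect

$$
\begin{aligned}
1 - \tfrac{k}{2} = \tfrac{1}{2} \;&\Rightarrow\; k = 1 \quad \text{for terms} \propto \Omega^{-1/2} \\
1 - \tfrac{k}{2} = 1 \;&\Rightarrow\; k = 0 \quad \text{for terms} \propto \Omega^{-1}
\end{aligned}
\tag{2.95}
$$

thus, we keep two different series

$$
H(x) \sim
\begin{cases}
\sum_n C_n\, n\, x^{n-1}\,\xi & \text{for terms} \propto \Omega^{-1/2} \\[4pt]
\sum_n C_n\, x^{n} & \text{for terms} \propto \Omega^{-1}
\end{cases}
\tag{2.96}
$$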

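Now we write down the order $\Omega^{0}$ of the LNA for the TF-T circuit, with the change of notation $\partial/\partial\xi_i = \partial_i$:

$$
\begin{aligned}
\frac{\partial \Pi}{\partial t} ={}& \beta_{mtf}\,\partial_1(\xi_1 \Pi) + \tfrac{1}{2}\,(\alpha_{mtf} + x_1\beta_{mtf})\,\partial_1^2 \Pi \\
&+ \partial_2\big[(\beta_{tf}\,\xi_2 - \alpha_{tf}\,\xi_1)\,\Pi\big] + \tfrac{1}{2}\,(\beta_{tf}\,x_2 + \alpha_{tf}\,x_1)\,\partial_2^2\Pi \\
&+ \partial_3\Big[\Big(\beta_{mt}\,\xi_3 - \alpha_{mt}\sum_{l=0}^{\infty} l\,C_l\, x_2^{\,l-1}\,\xi_2\Big)\Pi\Big] + \tfrac{1}{2}\Big(\alpha_{mt}\sum_{l=0}^{\infty} C_l\, x_2^{\,l} + \beta_{mt}\,x_3\Big)\partial_3^2\Pi \\
&+ \partial_4\big[(\beta_{t}\,\xi_4 - \alpha_{t}\,\xi_3)\,\Pi\big] + \tfrac{1}{2}\,(\beta_{t}\,x_4 + \alpha_{t}\,x_3)\,\partial_4^2\Pi
\end{aligned}
\tag{2.97}
$$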

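Example TF-T In our case, the matrices A and B are

$$
A = \begin{pmatrix}
-\beta_{mtf} & 0 & 0 & 0 \\
\alpha_{tf} & -\beta_{tf} & 0 & 0 \\
0 & \alpha_{mt}\Big(\sum_n n\,C_n\,x_2^{\,n-1}\Big) & -\beta_{mt} & 0 \\
0 & 0 & \alpha_t & -\beta_t
\end{pmatrix}
\tag{2.98}
$$

$$
B = \begin{pmatrix}
\alpha_{mtf} + x_1\beta_{mtf} & 0 & 0 & 0 \\
0 & \beta_{tf}\,x_2 + \alpha_{tf}\,x_1 & 0 & 0 \\
0 & 0 & \alpha_{mt}\Big(\sum_n C_n\,x_2^{\,n}\Big) + \beta_{mt}\,x_3 & 0 \\
0 & 0 & 0 & \beta_t\,x_4 + \alpha_t\,x_3
\end{pmatrix}
\tag{2.99}
$$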

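Gillespie simulation For our case study, the set of chemical reactions is the following:

$$
\begin{array}{ll}
\emptyset \to mTF & \alpha_{mtf} \\
mTF \to mTF + TF & \alpha_{tf} \\
TF \to TF + mT & \alpha_{mt}\,\dfrac{TF^{\,n}}{TF^{\,n} + h^{n}} \\
mT \to mT + T & \alpha_{t} \\
mTF \to \emptyset & \beta_{mtf} \\
TF \to \emptyset & \beta_{tf} \\
mT \to \emptyset & \beta_{mt} \\
T \to \emptyset & \beta_{t}
\end{array}
\tag{2.100}
$$

Note that the rate of the third reaction depends on the concentration of the transcription factor (TF) through the Hill function.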
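The reaction set (2.100) can be simulated with Gillespie's direct method. The following Python sketch is a minimal illustration, not the code used for the thesis: the function name, the dictionary layout and the short simulation time are assumptions made here, while the rate values are those quoted in the comparison between methods.

```python
import random


def hill(x, h, n):
    """Activatory Hill function x^n / (x^n + h^n)."""
    return x**n / (x**n + h**n)


def gillespie_tf_t(params, t_end, seed=0):
    """Gillespie direct method for the TF-T circuit, reactions (2.100).

    Returns the copy numbers [mTF, TF, mT, T] at time t_end,
    starting from an empty system.
    """
    rng = random.Random(seed)
    a_mtf, a_tf, a_mt, a_t = params["alpha"]
    b_mtf, b_tf, b_mt, b_t = params["beta"]
    h, n = params["h"], params["n"]

    state = [0, 0, 0, 0]  # mTF, TF, mT, T
    # (species index, change) for each reaction, in the order of (2.100).
    changes = [(0, +1), (1, +1), (2, +1), (3, +1),
               (0, -1), (1, -1), (2, -1), (3, -1)]
    t = 0.0
    while True:
        mtf, tf, mt, target = state
        propensities = [
            a_mtf,                    # production of mTF
            a_tf * mtf,               # translation of TF
            a_mt * hill(tf, h, n),    # TF-dependent transcription of mT
            a_t * mt,                 # translation of T
            b_mtf * mtf,              # degradation of mTF
            b_tf * tf,                # degradation of TF
            b_mt * mt,                # degradation of mT
            b_t * target,             # degradation of T
        ]
        total = sum(propensities)
        t += rng.expovariate(total)   # waiting time to the next reaction
        if t > t_end:
            return state
        # Choose the reaction with probability proportional to its propensity.
        threshold = rng.random() * total
        cumulative = 0.0
        for (index, delta), propensity in zip(changes, propensities):
            cumulative += propensity
            if threshold < cumulative:
                state[index] += delta
                break


PARAMS = {
    "alpha": (0.01, 0.05, 0.01, 0.05),              # mtf, tf, mt, t
    "beta": (0.00033, 0.00017, 0.00033, 0.00017),   # mtf, tf, mt, t
    "h": 200.0,
    "n": 1,
}

final_state = gillespie_tf_t(PARAMS, t_end=5000.0, seed=1)
print(final_state)
```

Means, noise and correlations such as those reported below would require much longer runs (the text quotes $t = 100000$ s) and an average over many trajectories or over time.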


Comparison between methods We compare the methods for the following set of parameters: $\alpha_{mtf} = 0.01$, $\alpha_{tf} = 0.05$, $\alpha_{mt} = 0.01$, $h = 200$, $n = 1$, $\alpha_t = 0.05$, $\beta_{mtf} = 0.00033$, $\beta_{tf} = 0.00017$, $\beta_{mt} = 0.00033$, $\beta_t = 0.00017$, $t = 100000$ s.

Mean value    mTF          TF            mT            T
Gillespie     30.1-30.7    8894-8983     29.1-29.9     8674-8760
Gen. Func.    30.303       8912.66       29.638        8717.05
LNA           30.303       8912.66       29.638        8717.05

Noise         mTF          TF            mT            T
Gillespie     0.17-0.18    0.099-0.106   0.174-0.185   0.104-0.108
Gen. Func.    0.181659     0.106453      0.183698      0.107656
LNA           0.181659     0.106453      0.183698      0.107656

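Correlation matrices are the following:

$$
\rho_{gillespie} = \begin{pmatrix}
1 & 0.52/0.60 & -0.08/0.04 & -0.07/0.04 \\
0.52/0.60 & 1 & -0.06/0.06 & -0.04/0.02 \\
-0.08/0.04 & -0.06/0.06 & 1 & 0.53/0.63 \\
-0.07/0.04 & -0.04/0.02 & 0.53/0.63 & 1
\end{pmatrix}
\tag{2.101}
$$

$$
\rho_{genfunc} = \begin{pmatrix}
1 & 0.580201 & 0.00368966 & 0.00214058 \\
0.580201 & 1 & 0.010535 & 0.0108146 \\
0.00368966 & 0.010535 & 1 & 0.580248 \\
0.00214058 & 0.0108146 & 0.580248 & 1
\end{pmatrix}
\tag{2.102}
$$

$$
\rho_{lna} = \begin{pmatrix}
1 & 0.580201 & 0.00368966 & 0.00214058 \\
0.580201 & 1 & 0.010535 & 0.0108146 \\
0.00368966 & 0.010535 & 1 & 0.580248 \\
0.00214058 & 0.0108146 & 0.580248 & 1
\end{pmatrix}
\tag{2.103}
$$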
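The LNA values quoted above can be checked with a few lines of Python. The sketch below assumes the standard stationary LNA result that the covariance matrix $C$ solves the Lyapunov equation $A C + C A^{T} + B = 0$, with $A$ and $B$ as in (2.98) and (2.99) evaluated at the mean field steady state of (2.88). The small linear solver and all function names are illustrative choices, not the thesis implementation.

```python
import math


def hill(x, h, n):
    return x**n / (x**n + h**n)


def hill_prime(x, h, n):
    return n * h**n * x ** (n - 1) / (x**n + h**n) ** 2


def solve_linear(matrix, rhs):
    """Gaussian elimination with partial pivoting for a dense system."""
    size = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, size):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, size + 1):
                aug[r][c] -= factor * aug[col][c]
    sol = [0.0] * size
    for r in range(size - 1, -1, -1):
        partial = sum(aug[r][c] * sol[c] for c in range(r + 1, size))
        sol[r] = (aug[r][size] - partial) / aug[r][r]
    return sol


def stationary_covariance(A, B):
    """Solve A C + C A^T + B = 0 for C, written as a d^2 x d^2 linear system."""
    d = len(A)
    K = [[0.0] * (d * d) for _ in range(d * d)]
    for i in range(d):
        for j in range(d):
            row = i * d + j                      # equation for entry (i, j)
            for k in range(d):
                K[row][k * d + j] += A[i][k]     # (A C)_ij   = sum_k A_ik C_kj
                K[row][i * d + k] += A[j][k]     # (C A^T)_ij = sum_k C_ik A_jk
    rhs = [-B[i][j] for i in range(d) for j in range(d)]
    flat = solve_linear(K, rhs)
    return [flat[i * d:(i + 1) * d] for i in range(d)]


# Parameters quoted in the comparison between methods.
a_mtf, a_tf, a_mt, a_t = 0.01, 0.05, 0.01, 0.05
b_mtf, b_tf, b_mt, b_t = 0.00033, 0.00017, 0.00033, 0.00017
h, n = 200.0, 1

# Mean field steady state of (2.88).
x1 = a_mtf / b_mtf
x2 = a_tf * x1 / b_tf
x3 = a_mt * hill(x2, h, n) / b_mt
x4 = a_t * x3 / b_t
means = [x1, x2, x3, x4]

A = [[-b_mtf, 0.0, 0.0, 0.0],
     [a_tf, -b_tf, 0.0, 0.0],
     [0.0, a_mt * hill_prime(x2, h, n), -b_mt, 0.0],
     [0.0, 0.0, a_t, -b_t]]
B = [[a_mtf + b_mtf * x1, 0.0, 0.0, 0.0],
     [0.0, b_tf * x2 + a_tf * x1, 0.0, 0.0],
     [0.0, 0.0, a_mt * hill(x2, h, n) + b_mt * x3, 0.0],
     [0.0, 0.0, 0.0, b_t * x4 + a_t * x3]]

C = stationary_covariance(A, B)
noise = [math.sqrt(C[i][i]) / means[i] for i in range(4)]
corr_mtf_tf = C[0][1] / math.sqrt(C[0][0] * C[1][1])
print(means, noise, corr_mtf_tf)
```

Here the noise is taken to be the coefficient of variation $\sigma_i/\langle n_i\rangle$ and the system size is set to $\Omega = 1$, so that concentrations and copy numbers coincide; both are assumptions of this sketch.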

We can conclude that there is a very good agreement among all these methods, and we will apply them to a set of more complex motifs in the following chapter.

Chapter 3

miRNA coherent feedforward loop

The interplay between transcriptional and post-transcriptional regulation has attracted much interest in the past few years [72]. As in the purely transcriptional regulatory network [4], motifs belonging to such a mixed layer of interactions have been identified [84, 99, 111, 123] and mathematically characterized [99, 111, 50, 81, 18]. MicroRNAs (miRNAs), small non-coding RNAs which post-transcriptionally regulate gene expression, play a pivotal role in these circuitries. So far the attention has mainly been devoted to circuits in which miRNAs have only an auxiliary role. This is the case, for instance, of the miRNA-mediated Feed Forward Loop (FFL) [99, 111, 50, 81] or the miRNA-mediated self-loop [18]. However, several important biological processes are actually controlled by miRNAs which themselves play the role of master regulators. The corresponding network motifs show a remarkable degree of topological enrichment in the mixed regulatory network [39, 106]. A major reason of interest in this type of circuits is the so-called "sponge effect" [90, 105], i.e. the appearance of indirect interactions among targets due to competition for miRNA binding.

In [39] the analysis of data from the Encyclopedia of DNA Elements (ENCODE) project revealed that two distinct classes of miRNA-controlled circuits were particularly enriched in the network. In the first class, miRNAs target two interacting genes (which, for example, can dimerize). MiRNAs belonging to the second class target two transcription factors (TFs) which both regulate the same gene, one as proximal and one as distal regulator. This same topology was found to be over-represented in human glioblastoma by combining bioinformatic analysis and expression data [106]. Both these examples suggest a role of miRNAs in ensuring the stability and fine-tuning of the relative concentration of their targets. The topological enrichment is further magnified if one selects those motifs in which the two targets are linked by a transcriptional regulation (see Fig. 3.1). The resulting network motif is an FFL in which a miRNA regulates a TF and, together with it, one or more target (T) genes. In the following we shall denote these circuitries as "miRNA-controlled FeedForward loops" (micFFL).

An interesting feature of the micFFL is that it is the simplest motif in which a TF regulates its target simultaneously through direct (transcriptional) and indirect (mediated by the sponge effect) regulatory interactions. Depending on the sign of the transcriptional regulation, this combination can be coherent or incoherent and may have very interesting functional roles. The transcriptional version of this circuit has been analyzed by several authors [71, 55] and, because of its topology, it is classified as a coherent feed-forward loop of type 2 (FFLC2). The circuit is able to perform a few important functions that enhance the coordination of the targets. At the same time, target coordination may represent too strong a linkage, thus decreasing the overall flexibility of the network. This non-trivial behavior could be the reason for the quite peculiar pattern of topological enrichment we observe. Our main goal will be to quantitatively study these functions, to fix the range of parameters in which they occur and, possibly, to understand their role within the regulatory network as a whole.

[Figure: topology of the transcriptional FFLC2 (nodes X, Y, Z) and of the micFFL (nodes miRNA, TF, T).]

3.1 Putative functions of micFFLs

It has been recently shown that microRNAs can generate thresholds in target gene expression [78], which in turn may induce non-linear relations between protein and transcript concentrations. In the same paper it was also pointed out that gene expression shows large cell-to-cell fluctuations in a population of identically prepared cells. We find that similar threshold effects are also present in the TF and T of micFFLs, whose relative concentrations can be fine-tuned to any desired value as a function of the miRNA concentration. In particular, the peculiar topology ensures a tight control of the stochastic fluctuations of this ratio, and the noise reduction is maximal exactly in proximity of the threshold region. We perform the analysis of the circuit in two main steps (deterministic and stochastic), concentrating on the behavior of the ratio $p_1/p_2$ of the concentrations of the two targets. The robustness of this ratio against stochastic fluctuations is one of the main reasons of interest in this circuit and will be the main issue of the stochastic analysis.

In order to discuss the functional properties of the micFFL we compare it with five "null models" obtained by eliminating the miRNA-TF and/or miRNA-T interactions, see Figure 3.1. We can thus identify which properties are direct consequences of the miRNA interaction (such as the threshold effect) and which are a peculiar consequence of the micFFL topology (such as the noise reduction). The simplest null model is the direct regulation TF → T without miRNAs (NM1). Comparison with NM1 shows the effect of switching on the miRNA in our circuit. Two other important null models are the circuits in which we only keep the miRNA-TF interaction (NM2) or the miRNA-T interaction (NM3).

Finally, we analyze the circuit with one miRNA separately regulating the two targets T1 and T2 (NM4) and the open circuit in which two independent miRNAs regulate TF and T respectively (NM5). These circuits are themselves very interesting. In particular, NM4 has been widely studied in the past few years to model the bacterial small RNA (sRNA)/target interaction [69, 77]. More recently it was also discussed in the framework of a miRNA/target interaction network [19, 33, 80] as an example of the sponge effect. A byproduct of our analysis will be the discussion of a few interesting features of these null models.

Figure 3.1: All circuits used as null models against micFFL.

3.2 Deterministic analysis

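The micFFL is described by the following set of equations:

$$
\begin{aligned}
\frac{dm_1}{dt} &= k^*_{m_1} - \gamma^*_{m_1} m_1 - k^{*,on}_1 m_1 M_{free} + k^{*,off}_1 c_1 \\
\frac{dp_1}{dt} &= k^*_{p_1} m_1 - \gamma^*_{p_1} p_1 \\
\frac{dm_2}{dt} &= k^*_{m_2} f(p_1) - \gamma^*_{m_2} m_2 - k^{*,on}_2 m_2 M_{free} + k^{*,off}_2 c_2 \\
\frac{dp_2}{dt} &= k^*_{p_2} m_2 - \gamma^*_{p_2} p_2 \\
\frac{dM_{free}}{dt} &= k^*_{s} - \gamma^*_{s} M_{free} - k^{*,on}_1 m_1 M_{free} + (k^{*,off}_1 + \gamma^*_{c_1})\, c_1 - k^{*,on}_2 m_2 M_{free} + (k^{*,off}_2 + \gamma^*_{c_2})\, c_2 \\
\frac{dc_1}{dt} &= k^{*,on}_1 m_1 M_{free} - (k^{*,off}_1 + \gamma^*_{c_1})\, c_1 \\
\frac{dc_2}{dt} &= k^{*,on}_2 m_2 M_{free} - (k^{*,off}_2 + \gamma^*_{c_2})\, c_2
\end{aligned}
\tag{3.1}
$$

where $\gamma^*_x$ denotes the degradation constant of the molecular species $x$ and $k^*_x$ the corresponding production rate, $m_1$ and $p_1$ the concentrations of mRNA and protein for the TF, and $m_2$, $p_2$ those for the target. We then redefine the parameters, dividing them by the target protein degradation rate $\gamma^*_{p_2}$ in order to have dimensionless values. The system thus becomes: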

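$$
\begin{aligned}
\frac{dm_1}{d\tau} &= k_{m_1} - \gamma_{m_1} m_1 - k^{on}_1 m_1 M_{free} + k^{off}_1 c_1 \\
\frac{dp_1}{d\tau} &= k_{p_1} m_1 - \gamma_{p_1} p_1 \\
\frac{dm_2}{d\tau} &= k_{m_2} f(p_1) - \gamma_{m_2} m_2 - k^{on}_2 m_2 M_{free} + k^{off}_2 c_2 \\
\frac{dp_2}{d\tau} &= k_{p_2} m_2 - p_2 \\
\frac{dM_{free}}{d\tau} &= k_{s} - \gamma_{s} M_{free} - k^{on}_1 m_1 M_{free} + (k^{off}_1 + \gamma_{c_1})\, c_1 - k^{on}_2 m_2 M_{free} + (k^{off}_2 + \gamma_{c_2})\, c_2 \\
\frac{dc_1}{d\tau} &= k^{on}_1 m_1 M_{free} - (k^{off}_1 + \gamma_{c_1})\, c_1 \\
\frac{dc_2}{d\tau} &= k^{on}_2 m_2 M_{free} - (k^{off}_2 + \gamma_{c_2})\, c_2
\end{aligned}
\tag{3.2}
$$

where $k_x = k^*_x/\gamma^*_{p_2}$ are the rescaled transcription or translation rates, $\gamma_x = \gamma^*_x/\gamma^*_{p_2}$ the rescaled degradation rates and $\tau = t\,\gamma^*_{p_2}$ the rescaled time. Following [78] we assumed that the miRNA can interact with the target mRNA $m_i$ by forming a complex $c_i$ with it. The stability of $c_i$ is determined by the constants $k^{on}_i$, $k^{off}_i$ and by the concentration of unbound miRNA $M_{free}$. $M_{free}$ is related to the total concentration of miRNA $M_{tot}$ by the relation: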

Mtot = Mfree + c1 + c2 . (3.3)

In the following $M_{tot}$ is an external input of the circuit. The transcriptional regulation of $m_2$ is described by the activatory Hill function

$$ f(p_1) = \frac{p_1^{\,n}}{p_1^{\,n} + h^{n}}, \tag{3.4} $$

with Hill coefficient $n$ and activation coefficient $h$. A further section is devoted to the explicit introduction of the promoter state dynamics for the target gene. The equations describing the null models introduced above can be easily obtained from Eqs. (3.2) by eliminating some of the molecular species and/or interactions.
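A minimal Python sketch of how the rescaled system (3.2) can be integrated numerically is given here. The fixed-step fourth-order Runge-Kutta scheme, the dictionary keys and the parameter values are illustrative assumptions (the values are loosely inspired by the physiological ranges used later in the chapter), not the thesis implementation.

```python
def hill(p, h, n):
    """Activatory Hill function f(p1) of Eq. (3.4)."""
    return p**n / (p**n + h**n)


def rhs(y, k):
    """Right-hand side of the rescaled micFFL equations (3.2)."""
    m1, p1, m2, p2, m_free, c1, c2 = y
    bind1 = k["kon1"] * m1 * m_free            # formation of complex c1
    bind2 = k["kon2"] * m2 * m_free            # formation of complex c2
    release1 = (k["koff1"] + k["gc1"]) * c1    # miRNA returned to the free pool
    release2 = (k["koff2"] + k["gc2"]) * c2
    return [
        k["km1"] - k["gm1"] * m1 - bind1 + k["koff1"] * c1,
        k["kp1"] * m1 - k["gp1"] * p1,
        k["km2"] * hill(p1, k["h"], k["n"]) - k["gm2"] * m2 - bind2 + k["koff2"] * c2,
        k["kp2"] * m2 - p2,                    # gamma_p2 = 1 after rescaling
        k["ks"] - k["gs"] * m_free - bind1 + release1 - bind2 + release2,
        bind1 - release1,
        bind2 - release2,
    ]


def rk4_step(y, dt, k):
    """One classical Runge-Kutta step of size dt."""
    s1 = rhs(y, k)
    s2 = rhs([v + 0.5 * dt * s for v, s in zip(y, s1)], k)
    s3 = rhs([v + 0.5 * dt * s for v, s in zip(y, s2)], k)
    s4 = rhs([v + dt * s for v, s in zip(y, s3)], k)
    return [v + dt / 6.0 * (a + 2 * b + 2 * c + d)
            for v, a, b, c, d in zip(y, s1, s2, s3, s4)]


def integrate(y0, k, tau_end, dt=0.005):
    """Integrate from tau = 0 to tau_end and return the final state."""
    y = list(y0)
    for _ in range(int(round(tau_end / dt))):
        y = rk4_step(y, dt, k)
    return y


# Illustrative (assumed) dimensionless parameters.
PARAMS = {
    "km1": 23.5, "km2": 41.0, "kp1": 117.0, "kp2": 117.0,
    "gm1": 2.0, "gm2": 2.0, "gp1": 1.0, "gs": 1.0, "ks": 20.0,
    "kon1": 1.5, "kon2": 1.5, "koff1": 0.5, "koff2": 0.5,
    "gc1": 1.0, "gc2": 1.0, "h": 200.0, "n": 1,
}

# State order: m1, p1, m2, p2, M_free, c1, c2; start from an empty system.
steady = integrate([0.0] * 7, PARAMS, tau_end=80.0)
print(steady)
```

For stiffer parameter sets a smaller time step or an adaptive integrator would be a safer choice than the fixed step used in this sketch.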

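The steady state solution of Eqs. (3.2) can be written in a simple way as a function of $M_{free}$. Introducing

$$
\theta^{free}_i \equiv \frac{\gamma_{c_i}}{\gamma_{m_i}}\, M_{free}, \qquad
\lambda_i \equiv \frac{k^{off}_i + \gamma_{c_i}}{k^{on}_i} \quad (i = 1, 2), \qquad
\eta \equiv \frac{h}{p_1^0}\left(1 + \frac{\theta^{free}_1}{\lambda_1}\right)
\tag{3.5}
$$

we can write

$$
p_1 = p_1^0\, \frac{1}{1 + \theta^{free}_1/\lambda_1}, \qquad
p_2 = p_2^0\, \frac{1}{1 + \theta^{free}_2/\lambda_2}\; \frac{1}{1 + \eta^{n}},
\tag{3.6}
$$

where $p_1^0$ and $p_2^0$ denote the asymptotic values of $p_1$ and $p_2$ in the absence of miRNAs when the Hill function is at saturation, i.e. $f(p_1) = 1$ (similarly for $m_1^0$ and $m_2^0$), so that $p_1^0 = k_{p_1} k_{m_1}/(\gamma_{m_1}\gamma_{p_1})$ and $p_2^0 = k_{p_2} k_{m_2}/\gamma_{m_2}$. From these equations we obtain the ratio $R \equiv p_2/p_1$ as a function of $M_{free}$:

$$
R \equiv \frac{p_2}{p_1} = \frac{p_2^0}{p_1^0}\; \frac{1}{1 + \eta^{n}}\; \frac{1 + \theta^{free}_1/\lambda_1}{1 + \theta^{free}_2/\lambda_2}\,.
\tag{3.7}
$$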
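The closed-form expressions (3.5)-(3.7) are straightforward to evaluate. The Python sketch below does so for a given free miRNA concentration; the function names, the dictionary keys and the parameter values are assumptions made for illustration only.

```python
def steady_state_proteins(m_free, k):
    """Steady state p1 and p2 from Eqs. (3.5)-(3.6) for a given M_free."""
    theta1 = k["gc1"] / k["gm1"] * m_free
    theta2 = k["gc2"] / k["gm2"] * m_free
    lam1 = (k["koff1"] + k["gc1"]) / k["kon1"]
    lam2 = (k["koff2"] + k["gc2"]) / k["kon2"]
    p1_0 = k["kp1"] * k["km1"] / (k["gm1"] * k["gp1"])   # no miRNA
    p2_0 = k["kp2"] * k["km2"] / k["gm2"]                # no miRNA, f = 1
    eta = k["h"] / p1_0 * (1.0 + theta1 / lam1)
    p1 = p1_0 / (1.0 + theta1 / lam1)
    p2 = p2_0 / (1.0 + theta2 / lam2) / (1.0 + eta ** k["n"])
    return p1, p2


def ratio(m_free, k):
    """Ratio R = p2 / p1 of Eq. (3.7)."""
    p1, p2 = steady_state_proteins(m_free, k)
    return p2 / p1


# Illustrative (assumed) dimensionless parameters.
PARAMS = {
    "km1": 23.5, "km2": 41.0, "kp1": 117.0, "kp2": 117.0,
    "gm1": 2.0, "gm2": 2.0, "gp1": 1.0,
    "kon1": 1.5, "kon2": 1.5, "koff1": 0.5, "koff2": 0.5,
    "gc1": 1.0, "gc2": 1.0, "h": 200.0, "n": 1,
}

for m_free in (0.0, 5.0, 20.0, 80.0):
    print(m_free, ratio(m_free, PARAMS))
```

With $\lambda_1 = \lambda_2$ and $\theta^{free}_1 = \theta^{free}_2$, as in this parameter set, the last factor of (3.7) equals one and $R$ decreases with $M_{free}$ only through $\eta$.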

It would be interesting to obtain the same ratio as a function of $M_{tot}$ instead of $M_{free}$. $M_{tot}$ can be obtained from $M_{free}$, $m_1$ and $m_2$:

$$
M_{tot} = M_{free}\left(1 + \frac{m_1}{\lambda_1} + \frac{m_2}{\lambda_2} + \frac{\gamma_s}{\alpha}\right) + \frac{k_s}{\alpha}, \qquad
\alpha = k^{off}_1 + k^{off}_2 + \gamma_{c_1} + \gamma_{c_2}.
\tag{3.8}
$$

The dependence on $m_1$ and $m_2$ makes it difficult to write the ratio explicitly in terms of $M_{tot}$, but it can be easily obtained numerically. We plot $R$ as a function of $M_{tot}$ in Figure 3.2 in the limit $\theta^{free}_1 = \theta^{free}_2 \equiv \theta^{free}$ and $\lambda_1 = \lambda_2 = \lambda$ for $n = 1$, 2 and 3. For comparison, we plot the same ratio for the null models NM2 and NM3. The shadowed portions of the plots denote the regions in which either $p_1/p_1^0$ or $p_2/p_2^0$ is less than 0.05, i.e. where the miRNA concentration is so high that one of the proteins (or both) is almost absent. As the miRNA concentration increases, $R$ can be tuned from $p_2^0/p_1^0$ down to less than 20% of its original value. The shape of the $M_{tot}$ dependence and the minimum value of $R$ strongly depend on the Hill coefficient. It is interesting to observe that NM2 and NM3 also allow $R$ to be fine-tuned essentially to any desired value. These two models represent the limiting situations which one would obtain when $\lambda_1 \gg \lambda_2$ or $\lambda_1 \ll \lambda_2$.

Figure 3.2: Reachable values of the ratio between TF and T, comparing the NM2, NM3 and micFFL topologies for different values of the Hill coefficient in the TF-T interaction.

3.3 Steady state analysis with the logic approximation

We discuss here the steady state analysis of Eqs. (3.1) in the framework of the logic approximation, in which the Hill function is approximated with the Heaviside step function $f(p_1) = H(p_1 - h)$. Even if this approximation is very crude, it may help to build an intuitive picture of the behaviour of the various players of the circuit as a function of its parameters. The equations for $p_1$ and $p_2$ can be solved immediately, leading to the steady state values $p_1^0 = k_{p_1} m_1/\gamma_{p_1}$ and $p_2^0 = k_{p_2} m_2/\gamma_{p_2}$. We can rescale the activation coefficient, $h_s \equiv h\,\gamma_{p_1}/k_{p_1}$, then write the step function as a function of $m_1$ and eliminate $p_1$ from the equations. Following [16] we introduce the quantities ($i = 1, 2$)

$$
\lambda_i \equiv \frac{k^{off}_i + \gamma_{c_i}}{k^{on}_i}, \qquad
\theta_i \equiv \frac{\gamma_{c_i}}{\gamma_{m_i}}\, M_{tot},
\tag{3.9}
$$

which have an immediate physical interpretation: $\theta_i$ is the (suitably rescaled) amount of miRNA acting on $m_i$ and $1/\lambda_i$ measures the strength of this interaction, i.e. the lifetime of the complex $c_i$. These will be in the following the only external parameters of the micFFL.

Figure 3.3: Steady state analysis of the micFFL with the logic approximation. Plots A and B show the mRNA concentrations of the transcription factor ($m_1$) and of the target ($m_2$), respectively, as a function of the microRNA concentration ($\theta$) in the limit $\lambda \to 0$. $H_s$ represents the activation threshold of the Heaviside function.

Finally, we assume for simplicity $\lambda_1 = \lambda_2 = \lambda$, $\theta_1 = \theta_2 = \theta$, and denote by $m_i^0$ the steady state value $m_i$ would reach if $M_{tot} = 0$ (i.e. $m_1^0 \equiv k_{m_1}/\gamma_{m_1}$, and $m_2^0 \equiv k_{m_2}/\gamma_{m_2}$ if $m_1^0 > h_s$ and $m_2^0 = 0$ otherwise). Then it is easy to obtain the steady state values of $m_1$ and $m_2$ as a function of $\theta$ and $\lambda$.

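$$
\begin{aligned}
m_1 &= m_1^0\; \frac{m_1^0 + m_2^0 - \theta - \lambda + \sqrt{(m_1^0 + m_2^0 - \theta - \lambda)^2 + 4\,(m_1^0 + m_2^0)\,\lambda}}{2\,(m_1^0 + m_2^0)} \\[6pt]
m_2 &= m_2^0\; \frac{m_1^0 + m_2^0 - \theta - \lambda + \sqrt{(m_1^0 + m_2^0 - \theta - \lambda)^2 + 4\,(m_1^0 + m_2^0)\,\lambda}}{2\,(m_1^0 + m_2^0)}
\end{aligned}
\tag{3.10}
$$

The implications of this result can be better appreciated if we take the $\lambda \to 0$ limit:

$$
\begin{aligned}
m_1 &= \begin{cases}
0 & m_1^0 + m_2^0 \le \theta \\[4pt]
m_1^0\left(1 - \dfrac{\theta}{m_1^0 + m_2^0}\right) & m_1^0 + m_2^0 > \theta
\end{cases} \\[8pt]
m_2 &= \begin{cases}
0 & m_1^0 + m_2^0 \le \theta \;\lor\; m_1 < h_s \\[4pt]
m_2^0\left(1 - \dfrac{\theta}{m_1^0 + m_2^0}\right) & m_1^0 + m_2^0 > \theta \;\land\; m_1 > h_s
\end{cases} \\[8pt]
\text{with}\quad m_2^0 &= \begin{cases}
0 & m_1 < h_s \\
m_2^0 & m_1 > h_s
\end{cases}
\end{aligned}
\tag{3.11}
$$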
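The relation between the full expression (3.10) and its $\lambda \to 0$ limit can be checked numerically. The Python sketch below evaluates the factor that multiplies $m_1^0$ and $m_2^0$ in (3.10) and compares it with the linear threshold law of (3.11); the function names and the numerical values are illustrative assumptions, and the self-consistent dependence of $m_2^0$ on $m_1$ is not included.

```python
import math


def titration_factor(m_sum, theta, lam):
    """Common factor of Eq. (3.10), with m_sum = m1_0 + m2_0."""
    d = m_sum - theta - lam
    return (d + math.sqrt(d * d + 4.0 * m_sum * lam)) / (2.0 * m_sum)


def mrna_steady_state(m1_0, m2_0, theta, lam):
    """Steady state (m1, m2) of Eq. (3.10) for given m1_0 and m2_0."""
    factor = titration_factor(m1_0 + m2_0, theta, lam)
    return m1_0 * factor, m2_0 * factor


def small_lambda_limit(m_sum, theta):
    """Threshold behaviour of Eq. (3.11): zero above the threshold theta = m_sum."""
    return max(0.0, 1.0 - theta / m_sum)


# Illustrative values: below and above the threshold theta = m1_0 + m2_0.
m1_0, m2_0 = 10.0, 20.0
for theta in (12.0, 40.0):
    full = titration_factor(m1_0 + m2_0, theta, lam=1e-9)
    limit = small_lambda_limit(m1_0 + m2_0, theta)
    print(theta, full, limit)
```

For finite $\lambda$ the sharp corner at $\theta = m_1^0 + m_2^0$ is smoothed out, which is the usual signature of a titrative interaction of finite strength.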

We plot the values of $m_1$ and $m_2$ as a function of $\theta$ (i.e. of the miRNA concentration) in Figs. 3.3A-B. Looking at these figures we see a few interesting and non-trivial features:

• In the $\lambda \to 0$ limit we find for the transcription factor $m_1$ the same threshold behaviour, as a function of the miRNA concentration, discussed earlier. The same effect should be present also in the target concentration $m_2$, but it is hidden by the fictitious step behaviour due to the logic approximation. It is easy to understand the origin of this threshold behaviour: if the number of free miRNA molecules greatly exceeds the number of transcripts mTF and mT, then these will be almost all bound in complexes and the corresponding proteins will not be expressed. On the opposite side, if the number of mTF and mT molecules exceeds the miRNA amount, then nearly all miRNAs will be bound in complexes but there will be a sufficient amount of free mTFs and mTs to be translated.

• As the total miRNA concentration decreases, the TF concentration increases. When the TF concentration reaches the threshold $H_s$ for the activation of $m_2$, we observe a sudden enhancement in the TF concentration due to the sponge interaction between $m_1$ and $m_2$. When $m_2$ is also present, the two mRNAs start to compete for the same miRNAs and, as a net effect, there is a smaller amount of miRNA available to downregulate $m_1$. This non-linear behavior of the TF concentration as the miRNA concentration increases is in our opinion one of the most effective ways to detect sponge-like interactions.

• The ratio $m_2/m_1$ (and thus $p_2/p_1$) can only take two possible values: $m_2/m_1 = m_2^0/m_1^0$ for $m_1 > H_s$ and $m_2/m_1 = 0$ for $m_1 < H_s$. However, this is clearly an artifact of the logic approximation.

3.4 Models in detail

Gillespie reactions In the following we briefly introduce the reactions involved in the circuits in Fig. 3.1 that we used to simulate the models through the implementation of Gillespie's direct algorithm. As discussed previously, we assume a titrative interaction of the miRNA with both its targets, while the interaction between transcription factor (TF) and target (T) is modelled as a Hill function.

Set of reactions for the NM1 (TF-T):

$$
\begin{array}{ll}
\text{Reaction} & \text{Rate} \\
\emptyset \to m_1 & k^*_{m_1} \\
m_1 \to m_1 + p_1 & k^*_{p_1}\, m_1 \\
p_1 \to p_1 + m_2 & k^*_{m_2}\, \dfrac{(p_1)^n}{(p_1)^n + h^n} \\
m_2 \to m_2 + p_2 & k^*_{p_2}\, m_2 \\
m_1 \to \emptyset & \gamma^*_{m_1}\, m_1 \\
p_1 \to \emptyset & \gamma^*_{p_1}\, p_1 \\
m_2 \to \emptyset & \gamma^*_{m_2}\, m_2 \\
p_2 \to \emptyset & \gamma^*_{p_2}\, p_2
\end{array}
\tag{3.12}
$$


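Set of reactions for the NM2:

$$
\begin{array}{ll}
\text{Reaction} & \text{Rate} \\
\emptyset \to s & k^*_{s} \\
\emptyset \to m_1 & k^*_{m_1} \\
m_1 \to m_1 + p_1 & k^*_{p_1}\, m_1 \\
s + m_1 \to c_1 & k^{*,on}_{c_1}\, s \cdot m_1 \\
c_1 \to s & \alpha\,\gamma^*_{c_1}\, c_1 \\
p_1 \to p_1 + m_2 & k^*_{m_2}\, \dfrac{(p_1)^n}{(p_1)^n + h^n} \\
m_2 \to m_2 + p_2 & k^*_{p_2}\, m_2 \\
s \to \emptyset & \gamma^*_{s}\, s \\
m_1 \to \emptyset & \gamma^*_{m_1}\, m_1 \\
c_1 \to \emptyset & (1-\alpha)\,\gamma^*_{c_1}\, c_1 \\
p_1 \to \emptyset & \gamma^*_{p_1}\, p_1 \\
m_2 \to \emptyset & \gamma^*_{m_2}\, m_2 \\
p_2 \to \emptyset & \gamma^*_{p_2}\, p_2
\end{array}
\tag{3.13}
$$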



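Set of reactions for the NM3:

$$
\begin{array}{ll}
\text{Reaction} & \text{Rate} \\
\emptyset \to s & k^*_{s} \\
\emptyset \to m_1 & k^*_{m_1} \\
m_1 \to m_1 + p_1 & k^*_{p_1}\, m_1 \\
p_1 \to p_1 + m_2 & k^*_{m_2}\, \dfrac{(p_1)^n}{(p_1)^n + h^n} \\
m_2 \to m_2 + p_2 & k^*_{p_2}\, m_2 \\
s + m_2 \to c_2 & k^{*,on}_{c_2}\, s \cdot m_2 \\
c_2 \to s & \alpha\,\gamma^*_{c_2}\, c_2 \\
s \to \emptyset & \gamma^*_{s}\, s \\
m_1 \to \emptyset & \gamma^*_{m_1}\, m_1 \\
p_1 \to \emptyset & \gamma^*_{p_1}\, p_1 \\
m_2 \to \emptyset & \gamma^*_{m_2}\, m_2 \\
c_2 \to \emptyset & (1-\alpha)\,\gamma^*_{c_2}\, c_2 \\
p_2 \to \emptyset & \gamma^*_{p_2}\, p_2
\end{array}
\tag{3.14}
$$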


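Set of reactions for the maimed NM4:

$$
\begin{array}{ll}
\text{Reaction} & \text{Rate} \\
\emptyset \to s & k^*_{s} \\
\emptyset \to m_1 & k^*_{m_1} \\
m_1 \to m_1 + p_1 & k^*_{p_1}\, m_1 \\
s + m_1 \to c_1 & k^{*,on}_{c_1}\, s \cdot m_1 \\
c_1 \to s & \alpha\,\gamma^*_{c_1}\, c_1 \\
\emptyset \to m_2 & k^*_{m_2} \\
m_2 \to m_2 + p_2 & k^*_{p_2}\, m_2 \\
s + m_2 \to c_2 & k^{*,on}_{c_2}\, s \cdot m_2 \\
c_2 \to s & \alpha\,\gamma^*_{c_2}\, c_2 \\
s \to \emptyset & \gamma^*_{s}\, s \\
m_1 \to \emptyset & \gamma^*_{m_1}\, m_1 \\
c_1 \to \emptyset & (1-\alpha)\,\gamma^*_{c_1}\, c_1 \\
p_1 \to \emptyset & \gamma^*_{p_1}\, p_1 \\
m_2 \to \emptyset & \gamma^*_{m_2}\, m_2 \\
c_2 \to \emptyset & (1-\alpha)\,\gamma^*_{c_2}\, c_2 \\
p_2 \to \emptyset & \gamma^*_{p_2}\, p_2
\end{array}
\tag{3.15}
$$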


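Set of reactions for the NM5:

$$
\begin{array}{ll}
\text{Reaction} & \text{Rate} \\
\emptyset \to s_1 & k^*_{s} \\
\emptyset \to s_2 & k^*_{s} \\
\emptyset \to m_1 & k^*_{m_1} \\
s_1 + m_1 \to c_1 & k^{*,on}_{c_1}\, s_1 \cdot m_1 \\
c_1 \to s_1 & \alpha\,\gamma^*_{c_1}\, c_1 \\
m_1 \to m_1 + p_1 & k^*_{p_1}\, m_1 \\
p_1 \to p_1 + m_2 & k^*_{m_2}\, \dfrac{(p_1)^n}{(p_1)^n + h^n} \\
m_2 \to m_2 + p_2 & k^*_{p_2}\, m_2 \\
s_2 + m_2 \to c_2 & k^{*,on}_{c_2}\, s_2 \cdot m_2 \\
c_2 \to s_2 & \alpha\,\gamma^*_{c_2}\, c_2 \\
s_1 \to \emptyset & \gamma^*_{s}\, s_1 \\
s_2 \to \emptyset & \gamma^*_{s}\, s_2 \\
m_1 \to \emptyset & \gamma^*_{m_1}\, m_1 \\
c_1 \to \emptyset & (1-\alpha)\,\gamma^*_{c_1}\, c_1 \\
p_1 \to \emptyset & \gamma^*_{p_1}\, p_1 \\
m_2 \to \emptyset & \gamma^*_{m_2}\, m_2 \\
c_2 \to \emptyset & (1-\alpha)\,\gamma^*_{c_2}\, c_2 \\
p_2 \to \emptyset & \gamma^*_{p_2}\, p_2
\end{array}
\tag{3.16}
$$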


Figure 3.4: Through Gillespie simulations, we compared the feed-forward loop having a microRNA as master regulator (micFFL in the figure) with the direct regulation (TF-T in the figure). Protein levels for NM1 (TF-T) and micFFL are shown as a function of time. After a certain time, i.e. at the steady state, the protein concentrations are not correlated in the direct regulation, whereas they fluctuate together once the microRNA interaction is added.

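Set of reactions for the micFFL:

$$
\begin{array}{ll}
\text{Reaction} & \text{Rate} \\
\emptyset \to s & k^*_{s} \\
\emptyset \to m_1 & k^*_{m_1} \\
s + m_1 \to c_1 & k^{*,on}_{c_1}\, s \cdot m_1 \\
c_1 \to s & \alpha\,\gamma^*_{c_1}\, c_1 \\
m_1 \to m_1 + p_1 & k^*_{p_1}\, m_1 \\
p_1 \to p_1 + m_2 & k^*_{m_2}\, \dfrac{(p_1)^n}{(p_1)^n + h^n} \\
m_2 \to m_2 + p_2 & k^*_{p_2}\, m_2 \\
s + m_2 \to c_2 & k^{*,on}_{c_2}\, s \cdot m_2 \\
c_2 \to s & \alpha\,\gamma^*_{c_2}\, c_2 \\
s \to \emptyset & \gamma^*_{s}\, s \\
m_1 \to \emptyset & \gamma^*_{m_1}\, m_1 \\
c_1 \to \emptyset & (1-\alpha)\,\gamma^*_{c_1}\, c_1 \\
p_1 \to \emptyset & \gamma^*_{p_1}\, p_1 \\
m_2 \to \emptyset & \gamma^*_{m_2}\, m_2 \\
c_2 \to \emptyset & (1-\alpha)\,\gamma^*_{c_2}\, c_2 \\
p_2 \to \emptyset & \gamma^*_{p_2}\, p_2
\end{array}
\tag{3.17}
$$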

3.5 Master equations

In the following we report the master equations for the various circuits analyzed, with $s \to n_1$, $m_1 \to n_2$, $c_1 \to n_3$, $p_1 \to n_4$, $m_2 \to n_5$, $c_2 \to n_6$, $p_2 \to n_7$ and the step operator $E_j^k = \sum_{l=0}^{\infty} \frac{k^l}{l!} \frac{\partial^l}{\partial n_j^l}$ as defined in van Kampen [114]. The rates are rescaled with the degradation rate of the T protein, $\gamma_{p_2}$.

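NM1

$$
\begin{aligned}
\frac{\partial P(n_i,\tau)}{\partial\tau} = \Big[\,& k_{m_1}(E_2^{-1}-1) + \gamma_{m_1}(E_2^{1}-1)\,n_2 + k_{p_1}\, n_2\, (E_4^{-1}-1) \\
&+ \gamma_{p_1}(E_4^{1} - 1)\,n_4 + k_{m_2}\sum_n C_n\, n_4^{\,n}\,(E_5^{-1}-1) \\
&+ \gamma_{m_2}(E_5^{1}-1)\,n_5 + k_{p_2}\, n_5\, (E_7^{-1}-1) \\
&+ (E_7^{1} - 1)\,n_7 \,\Big]\, P(n_i,\tau)\,;
\end{aligned}
\tag{3.18}
$$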

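NM3

$$
\begin{aligned}
\frac{\partial P(n_i,\tau)}{\partial\tau} = \Big[\,& k_{s}(E_1^{-1}-1) + \gamma_{s}(E_1^{1}-1)\,n_1 + k_{m_1}(E_2^{-1}-1) \\
&+ \gamma_{m_1}(E_2^{1}-1)\,n_2 + k_{p_1}\, n_2\, (E_4^{-1}-1) + \gamma_{p_1}(E_4^{1} - 1)\,n_4 \\
&+ k_{m_2}\sum_n C_n\, n_4^{\,n}\,(E_5^{-1}-1) + \gamma_{m_2}(E_5^{1}-1)\,n_5 \\
&+ k^{on}_{c_2}(E_1^{1}E_5^{1}E_6^{-1}-1)\,n_1 n_5 + \alpha\gamma_{c_2}(E_1^{-1}E_6^{1}-1)\,n_6 \\
&+ k_{p_2}\, n_5\, (E_7^{-1}-1) + (E_7^{1} - 1)\,n_7 + (1-\alpha)\gamma_{c_2}(E_6^{1}-1)\,n_6 \,\Big]\, P(n_i,\tau)\,;
\end{aligned}
\tag{3.19}
$$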

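NM4

$$
\begin{aligned}
\frac{\partial P(n_i,\tau)}{\partial\tau} = \Big[\,& k_{s}(E_1^{-1}-1) + \gamma_{s}(E_1^{1}-1)\,n_1 + k_{m_1}(E_2^{-1}-1) \\
&+ \gamma_{m_1}(E_2^{1}-1)\,n_2 + k^{on}_{c_1}(E_1^{1}E_2^{1}E_3^{-1}-1)\,n_1 n_2 \\
&+ \alpha\gamma_{c_1}(E_1^{-1}E_3^{1}-1)\,n_3 + k_{p_1}\, n_2\, (E_4^{-1}-1) + \gamma_{p_1}(E_4^{1} - 1)\,n_4 \\
&+ k_{m_2}(E_5^{-1}-1) + \gamma_{m_2}(E_5^{1}-1)\,n_5 \\
&+ k^{on}_{c_2}(E_1^{1}E_5^{1}E_6^{-1}-1)\,n_1 n_5 + \alpha\gamma_{c_2}(E_1^{-1}E_6^{1}-1)\,n_6 \\
&+ k_{p_2}\, n_5\, (E_7^{-1}-1) + (E_7^{1} - 1)\,n_7 + (1-\alpha)\gamma_{c_1}(E_3^{1}-1)\,n_3 \\
&+ (1-\alpha)\gamma_{c_2}(E_6^{1}-1)\,n_6 \,\Big]\, P(n_i,\tau)\,;
\end{aligned}
\tag{3.20}
$$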

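NM5 (with the two miRNAs mapped as $s_1 \to n_1$ and $s_2 \to n_8$)

$$
\begin{aligned}
\frac{\partial P(n_i,\tau)}{\partial\tau} = \Big[\,& k_{s_1}(E_1^{-1}-1) + \gamma_{s_1}(E_1^{1}-1)\,n_1 \\
&+ k_{s_2}(E_8^{-1}-1) + \gamma_{s_2}(E_8^{1}-1)\,n_8 + k_{m_1}(E_2^{-1}-1) \\
&+ \gamma_{m_1}(E_2^{1}-1)\,n_2 + k^{on}_{c_1}(E_1^{1}E_2^{1}E_3^{-1}-1)\,n_1 n_2 \\
&+ \alpha\gamma_{c_1}(E_1^{-1}E_3^{1}-1)\,n_3 + k_{p_1}\, n_2\, (E_4^{-1}-1) + \gamma_{p_1}(E_4^{1} - 1)\,n_4 \\
&+ k_{m_2}\sum_n C_n\, n_4^{\,n}\,(E_5^{-1}-1) + \gamma_{m_2}(E_5^{1}-1)\,n_5 \\
&+ k^{on}_{c_2}(E_8^{1}E_5^{1}E_6^{-1}-1)\,n_8 n_5 + \alpha\gamma_{c_2}(E_8^{-1}E_6^{1}-1)\,n_6 \\
&+ k_{p_2}\, n_5\, (E_7^{-1}-1) + (E_7^{1} - 1)\,n_7 + (1-\alpha)\gamma_{c_1}(E_3^{1}-1)\,n_3 \\
&+ (1-\alpha)\gamma_{c_2}(E_6^{1}-1)\,n_6 \,\Big]\, P(n_i,\tau)\,;
\end{aligned}
\tag{3.21}
$$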

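micFFL

$$
\begin{aligned}
\frac{\partial P(n_i,\tau)}{\partial\tau} = \Big[\,& k_{s}(E_1^{-1}-1) + \gamma_{s}(E_1^{1}-1)\,n_1 + k_{m_1}(E_2^{-1}-1) \\
&+ \gamma_{m_1}(E_2^{1}-1)\,n_2 + k^{on}_{c_1}(E_1^{1}E_2^{1}E_3^{-1}-1)\,n_1 n_2 \\
&+ \alpha\gamma_{c_1}(E_1^{-1}E_3^{1}-1)\,n_3 + k_{p_1}\, n_2\, (E_4^{-1}-1) + \gamma_{p_1}(E_4^{1} - 1)\,n_4 \\
&+ k_{m_2}\sum_n C_n\, n_4^{\,n}\,(E_5^{-1}-1) + \gamma_{m_2}(E_5^{1}-1)\,n_5 \\
&+ k^{on}_{c_2}(E_1^{1}E_5^{1}E_6^{-1}-1)\,n_1 n_5 + \alpha\gamma_{c_2}(E_1^{-1}E_6^{1}-1)\,n_6 \\
&+ k_{p_2}\, n_5\, (E_7^{-1}-1) + (E_7^{1} - 1)\,n_7 + (1-\alpha)\gamma_{c_1}(E_3^{1}-1)\,n_3 \\
&+ (1-\alpha)\gamma_{c_2}(E_6^{1}-1)\,n_6 \,\Big]\, P(n_i,\tau)\,.
\end{aligned}
\tag{3.22}
$$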

From each master equation we obtained the first two moments of the various distributions through the linear noise approximation (LNA). From these equations it is straightforward to calculate the approximated correlations $r_{x,y}$ and variances $\sigma_x^2$ for each concentration. In Figure 3.5 we show the comparison between simulations and approximated analytical results.

Figure 3.5: Examples of the comparison between the linear noise approximation and Gillespie simulations, measuring the correlations $r_{m_1,m_2}$, $r_{s,m_1}$ and $r_{s,m_2}$ among $s$, $m_1$ and $m_2$ in the micFFL as a function of the miRNA production rate $k_s$. Points correspond to stochastic simulations and continuous lines to LNA solutions. (a) and (b) correspond to two different sets of parameters. (a) Parameters: $k_{m_1} = 23.5$, $k_{m_2} = 41$, $k^{on}_{c_1} = k^{on}_{c_2} = 1.5$, $k_{p_1} = k_{p_2} = 117$, $\gamma_{m_1} = \gamma_{m_2} = 2$, $\gamma_s = \gamma_{p_1} = 1$, $\alpha\gamma_{c_1} = \alpha\gamma_{c_2} = 1.5$, $(1-\alpha)\gamma_{c_1} = (1-\alpha)\gamma_{c_2} = 1$, $h = 200$, $n = 1$ and $k_s$ varies. (b) Parameters: $k_{m_1} = 235$, $k_{m_2} = 410$, $k^{on}_{c_1} = k^{on}_{c_2} = 1.5$, $k_{p_1} = k_{p_2} = 117$, $\gamma_{m_1} = \gamma_{m_2} = 2$, $\gamma_s = \gamma_{p_1} = 1$, $\alpha\gamma_{c_1} = \alpha\gamma_{c_2} = 1.5$, $(1-\alpha)\gamma_{c_1} = (1-\alpha)\gamma_{c_2} = 1$, $h = 200$, $n = 1$ and $k_s$ varies.

3.6 Stochastic Analysis

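As in the previous section, we assume a titrative miRNA-target interaction and an activatory Hill function for the TF-dependent target transcription rate. The molecular species we considered are the transcripts of the miRNA ($s$), of the transcription factor ($m_1$) and of the target ($m_2$), the proteins of the transcription factor ($p_1$) and of the target ($p_2$), and the complexes the miRNA can form when bound to $m_1$ or $m_2$ ($c_1$ and $c_2$ respectively). The corresponding master equation, setting $s \to n_1$, $m_1 \to n_2$, $c_1 \to n_3$, $p_1 \to n_4$, $m_2 \to n_5$, $c_2 \to n_6$, $p_2 \to n_7$, is

$$
\begin{aligned}
\frac{\partial P(n_i,\tau)}{\partial\tau} = \Big[\,& k_{s}(E_1^{-1}-1) + \gamma_{s}(E_1^{1}-1)\,n_1 + k_{m_1}(E_2^{-1}-1) \\
&+ \gamma_{m_1}(E_2^{1}-1)\,n_2 + k^{on}_{1}(E_1^{1}E_2^{1}E_3^{-1}-1)\,n_1 n_2 \\
&+ \alpha\gamma_{c_1}(E_1^{-1}E_3^{1}-1)\,n_3 + k_{p_1}\, n_2\, (E_4^{-1}-1) + \gamma_{p_1}(E_4^{1} - 1)\,n_4 \\
&+ k_{m_2}\left( \frac{(TF^{ss})^n}{(TF^{ss})^n + h^n} - \frac{n\,\big(\tfrac{TF^{ss}}{h}\big)^n}{\big(1 + \big(\tfrac{TF^{ss}}{h}\big)^n\big)^2} + \frac{n}{TF^{ss}}\, \frac{\big(\tfrac{TF^{ss}}{h}\big)^n}{\big(1 + \big(\tfrac{TF^{ss}}{h}\big)^n\big)^2}\; n_4 \right)(E_5^{-1}-1) \\
&+ \gamma_{m_2}(E_5^{1}-1)\,n_5 + k^{on}_{2}(E_1^{1}E_5^{1}E_6^{-1}-1)\,n_1 n_5 + \alpha\gamma_{c_2}(E_1^{-1}E_6^{1}-1)\,n_6 \\
&+ k_{p_2}\, n_5\, (E_7^{-1}-1) + (E_7^{1} - 1)\,n_7 + (1-\alpha)\gamma_{c_1}(E_3^{1}-1)\,n_3 \\
&+ (1-\alpha)\gamma_{c_2}(E_6^{1}-1)\,n_6 \,\Big]\, P(n_i,\tau)\,,
\end{aligned}
\tag{3.23}
$$

where $\alpha$ denotes the probability of miRNA recycling and $E$ is the step operator $E_j^k = \sum_{l=0}^{\infty} \frac{k^l}{l!} \frac{\partial^l}{\partial n_j^l}$. As in [81, 18] we linearized the Hill function around the steady state value $TF^{ss}$, as presented in the previous chapter. We are interested in evaluating the linear correlation coefficient $r_{xy} = \frac{\langle xy\rangle - \langle x\rangle\langle y\rangle}{\sigma_x \sigma_y}$, which measures how much two variables are linearly dependent.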

Figure 3.6: Heat maps of the correlation between protein levels for micFFL, NM3, NM4 and NM5, as a function of the miRNA amount and of the interaction strength F (colour bar: correlation, from 0 to 1). Only NM4 and micFFL show a region with a high degree of coupled fluctuations.

This quantity can be evaluated in general for any pair of molecular species, but we are in particular interested in the correlation between T and TF. To estimate it we need the first two moments of the probability distribution $P(n_i, \tau)$. Due to the complexity of the master equation this cannot be done analytically, not even by linearizing the target transcription rate; we therefore decided to approach the problem in the framework of the linear noise approximation [113]. In this framework it is straightforward to obtain the covariance matrix of the system directly from its macroscopic description [31] and thus to have approximate expressions for the first two moments of $P(n_i, \tau)$. We performed a set of Gillespie simulations of the model in order to quantify the error due to the linear noise approximation (see Fig. 3.5).

We made an effort to present all the results in terms of potentially measurable parameters, such as the number of miRNA molecules and the miRNA-target interaction strength $F = \frac{k^{on}_i}{\gamma_s \gamma_{m_i}}$ [69] (where $k^{on}_i$, $\gamma_s$, $\gamma_{p_i}$ and $\gamma_{m_i}$ are defined as above). The other parameters take physiological values. We estimate the order of magnitude of the parameters via the transcription, translation and degradation rates found in [1] and in the Bionumbers database [75]. To test our choice, we checked whether the steady state concentrations have realistic values. In order to understand the peculiar properties of the micFFL we compared it with the null models NM3, NM4 and NM5 (Fig. 3.6). Given the large number of free parameters, such a comparison is not straightforward. Our strategy was to keep all the corresponding parameters equal in the four models.

3.7 Comparison between NM4 and micFFL

To better investigate the increase in correlation due to the transcriptional link, we compare NM4 and micFFL (i.e. the two circuits which show the best results in terms of correlation between TF and T). We fix the amounts of mRNA, protein, microRNA and complexes, and then evaluate the ratio between the $p_1$-$p_2$ correlations in micFFL and NM4. The constraint is thus the following:

$$ k^{NM4}_{m_2} = k^{micFFL}_{m_2}\, \frac{p_1^{\,n}}{p_1^{\,n} + h^{n}}\,. \tag{3.24} $$

Thanks to this constraint it is possible to compare directly the gain of correlation due to micFFL with respect to NM4. Our results are shown in Figure 3.7. The circuit with the transcriptional link between TF and T proteins always reaches higher values of correlation.

3.8 A prototypical example: the micFFL involving E2F1 and RB1 as targets and a set of miRNAs as master regulators

Within a list of candidates with experimentally validated interactions (miRTarBase [52] for microRNA interactions and a manually curated dataset [29, 86] for TF-T links), we selected, as an example, the micFFLs involving E2F1 and RB1 as targets and a set of miRNAs (miR-106a, miR-106b, miR-17, miR-20a and miR-23b) as master regulators. The network involving these genes is reported in Figure 3.8. The experimental support for these circuits is very strong (see [39] for the transcriptional regulation and [110] for those involving the miRNAs). E2F1 and RB1 are known to physically interact [1, 43] and are in fact included in the PrePPi database. The E2F1-RB1 system is a well-known important switch in the cell cycle. E2F1 belongs to the family of E2F genes, which control the transition from G0/G1 to S phase in the cell (the quiescent phase and the first checkpoint phase respectively). In the absence of mitogenic stimulation, E2F-dependent gene expression is inhibited by the interaction between E2F and members of the retinoblastoma protein family RB (composed of RB1, RBL1 and RBL2) [43]. When mitogens stimulate cells to divide, RB family members are phosphorylated, which reduces their binding to E2F. The E2F proteins thus freed in turn activate the expression of their target genes and trigger the cell cycle. In G0 phase almost all cells have E2F1 and RB1 proteins bound in complexes [1, 43]. In this state RB inhibits E2F functions and consequently the cell cycle. It is clear that the stability of the relative concentration of the two genes against stochastic fluctuations is of crucial importance for the correct functioning of this checkpoint. Our analysis suggests that this stability is guaranteed by the five miRNAs listed above and by the peculiar topology of the micFFLs they form with their targets. These micFFLs allow a rapid reaction of RB1 in case of bursts of E2F1 production, thus avoiding a dangerous erroneous activation of the E2F1 pathway. The fact that the E2F1-RB1 pair is targeted simultaneously by five miRNAs is likely to reinforce the stabilization function. In our databases there are several other instances of TF-T pairs targeted by more than one miRNA. These are most probably the best candidates for further theoretical and experimental studies.

Figure 3.7: Correlation analysis of NM4 and micFFL fixing the amount of all the molecular species (constraint: fixed mRNA and protein abundances). Correlation values are calculated between $p_1$ and $p_2$. A: Ratio between the correlations of micFFL and NM4 as a function of the $p_1$ amount. Each curve corresponds to a different value of the interaction strength F (legend values: 0.51, 0.85, 1.19, 1.53, 1.87, 2.21, 2.55, 2.89). Varying the amount of TF, the micFFL always keeps the correlation higher than NM4. B: Correlation in micFFL varying $p_1$. C: Correlation in NM4 varying $p_1$.

Figure 3.8: micFFLs (in green) involving the E2F1 transcription factor and the Retinoblastoma protein. These are two key proteins in the regulation of the cell cycle: the former activates it through a transcriptional cascade and the latter inhibits cell cycle progression by stopping the E2F family cascade. miRNA regulation becomes crucial in this case to keep the fluctuations of RB-like proteins and E2F1 coupled. NM4 circuits, which could also accomplish this function, are shown in red. Node labels in the figure: E2F1, RB1, RBL1, RBL2 (RB-like proteins); mir-17~92 cluster and other miRs: hsa-miR-17-5p, hsa-miR-20a-5p, hsa-miR-106a-5p, hsa-miR-106b-5p, hsa-miR-21-5p, hsa-mir-519a, hsa-miR-23b-3p, hsa-mir-519d, hsa-miR-93-5p, hsa-mir-330-3p, hsa-miR-130b-3p.


Chapter 4

Analysis of microRNA binding sites

4.1 Introduction

MicroRNAs are important players in genetic regulatory networks. They are small RNA molecules, 21-23 nt long, and their action is mediated by the Argonaute protein family (AGO1-4). At present, for human alone, at least 1872 precursors and 2578 mature microRNAs are annotated in miRBase [62]. When loaded on the Argonaute frame, a microRNA gains the ability to regulate specific sequences on transcripts. When these target sites are perfectly complementary to the miRNA sequence, the target transcript can be cleaved [101]. In mammals, the vast majority of target sites are not perfectly complementary over the whole length (21 nt): only 7 nucleotides of the microRNA are required to bind the targets. These regions are called seeds. Within this theoretical frame, regulation mediated by microRNAs occurs mainly by inhibiting translation or promoting degradation of target transcripts. The MID and PIWI domains of the Argonaute2 complex organize the seed region of the miRNA in an A-form helix through extensive interactions with the phosphate backbone [93]; this shape helps the formation of the A-form dsRNA with the site located on the 3’ UTR of the target mRNA. The authors of [79] claimed that microRNAs below ∼ 100 copies/cell have little regulatory capacity and that almost 60% of the miRNAs detected by deep sequencing had no discernible suppressive activity. These results could be explained by the great number of targets, which dilutes the post-transcriptional effect of microRNAs, as we will see later. The authors of [119] also found that only a small fraction of microRNAs are expressed in high copy number, criticizing the so-called ceRNA hypothesis [91] and its derived works, characterized by stochastic correlation and threshold-like behaviour due to titrative interaction, as in [69, 78] and in the previous chapter. Hence, a global view of the microRNA regulation layer should be taken into account, as suggested by their spread, the great number of target sites and their role in several biological processes [12, 11, 79, 37].
MicroRNAs help homeostasis and maintain cell differentiation, and a single microRNA has up to thousands of target transcripts [14, 79, 70, 7]. Hence, it is hard to imagine that they carry out a specific function without implying a heavy reorganization of the regulatory network in the cell. However, a little-investigated subject is the number of binding sites (BSs) on the whole set of target transcripts for a


microRNA. In the following, our purpose is to set up a model accounting for the target transcripts, the number of BSs on each target transcript and the binding energy between a microRNA and its targets.

4.2 Model

MicroRNAs act through a titrative interaction, so we first define the Simplest Titration Model (STM), in which two molecular species can bind and form a complex (the RISC complex). Usually such an interaction happens when a molecule (s) binds to another one (m), activating or deactivating its function, e.g. in enzyme catalysis. Since microRNAs inhibit their targets, the analysis focuses on the deactivating case [12, 11]. The STM consists of two reactions, complex association and dissociation, with rates $k_{on}$ and $k_{off}$ respectively, eq. (4.1).

\[ m + s \underset{k_{off}}{\overset{k_{on}}{\rightleftharpoons}} c \tag{4.1} \]

In the deactivating titration, species s binds to m and inhibits it, so the quantity of interest is the amount of free m, the molecules that are still capable of carrying out their task. Setting $s_{tot} = s + c$, $m_{tot} = m + c$ and $K_d = k_{off}/k_{on}$, one has

\[ m = \frac{1}{2}\left( m_{tot} - s_{tot} - K_d + \sqrt{(m_{tot} - s_{tot} - K_d)^2 + 4 K_d m_{tot}} \right) \tag{4.2} \]

This well-known solution, eq. (4.2), shows the typical threshold effect, where $K_d$ defines the threshold sharpness and $s_{tot}/m_{tot} = 1$ its position [69, 78], Fig. 4.1B.
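The threshold behaviour of eq. (4.2) can be explored numerically. The following is a minimal sketch; the function name `free_m` and the parameter values are illustrative assumptions, not taken from the thesis.

```python
import math


def free_m(m_tot, s_tot, k_d):
    """Free amount of m from eq. (4.2) of the Simplest Titration Model."""
    b = m_tot - s_tot - k_d
    return 0.5 * (b + math.sqrt(b * b + 4.0 * k_d * m_tot))


if __name__ == "__main__":
    s_tot = 100.0  # illustrative titrant amount
    for k_d in (0.1, 10.0):  # a small K_d gives a sharper threshold
        curve = [free_m(x * s_tot, s_tot, k_d) for x in (0.5, 1.0, 1.5, 2.0)]
        print(f"K_d = {k_d}: " + ", ".join(f"{v:.2f}" for v in curve))
```

For small $K_d$ the free amount stays close to zero until $m_{tot}$ exceeds $s_{tot}$ and then grows almost linearly; a larger $K_d$ smooths the transition.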

In [15, 16, 89, 20], the authors (Rob Phillips’ group) worked out the probability of finding RNA polymerase bound to the TSS with a statistical model. In the two papers from 2005 [15, 16], they analyzed the promoter region of a single gene, with a number of specific BSs of order 1 and a much greater number of transcription factors. In 2014, with two other papers [89, 20], the group also studied the possibility of a titration effect on TF efficacy due to the genome-wide number of binding sites and the global concentration of transcription factors, in other words the regime in which the number of specific BSs and the number of transcription factors are comparable. They investigated three different cases in which a TF has to repress its target genes (simple repression, looping and exclusive looping) and created a framework for a set of thermodynamic models. Within this framework we applied the same approach to the microRNA titrative interaction, focusing on the situation where a transcript carries several BSs of a single microRNA, and obtained relations useful for experimental design and evaluation. Following the idea in [15], we introduce unspecific BSs competing for the same titrating molecules. Unspecific BSs emulate the cell environment, where many weakly interacting species live. So our titrant (s) reacts preferentially with m (lower Gibbs free energy $\Delta G$), but its amount is much lower than that of the other molecular species, e.g. all possible sites on RNA molecules. To mimic this environment, we introduced a minimal model made up of four reactions.


[Figure 4.1, panels A and B. Recoverable labels: free mRNA, total mRNA, total miRNA; legend: model, approximation, exact.]

Figure 4.1: A. Specific and unspecific reaction scheme for our titration model. B. Validity of the two limits $M \ll N_{BS}$ and $N_{BS} \ll M$. In each graph the red points identify the exact resummation of the partition function, the blue line the two approximations, and the orange line the titration ODE solution with the parameters $K_d$ and $s_{tot}$ estimated through the limits.


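\[ m + s \underset{k^{S}_{off}}{\overset{k^{S}_{on}}{\rightleftharpoons}} c_m, \qquad n + s \underset{k^{NS}_{off}}{\overset{k^{NS}_{on}}{\rightleftharpoons}} c_n \tag{4.3} \]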

In eq. (4.3), the first reaction accounts for binding to specific BSs, while the second one accounts for binding to unspecific BSs (n). Within this setting we take $m_{tot} = N_{BS}$, the number of specific BSs, $s_{tot} = M$, the number of titrating molecules, and $n + c_n = N_{NS}$, the number of unspecific BSs. In [15, 16, 89, 20] the hypothesis is made that all transcription factors are always bound, specifically or unspecifically, in order to obtain some approximations. This assumption may be avoided by using the Grand Canonical Ensemble to solve the associated Chemical Master Equation, even if the final result does not change, as we will see.

Going into details, we consider a single mature microRNA and the following parameters:

• m = target mRNA;

• $n_i$ = number of binding sites of the microRNA on the i-th mRNA;

• $\sum_{l=1}^{m} n_l = N^{S}_{BS}$, total number of specific binding sites;

• M = microRNA;

• $N_{NS}$ = unspecific binding sites.

For this case we make the same hypotheses as in the model by Phillips:

i. microRNAs and BSs are at kinetic equilibrium at each instant;

ii. all microRNAs are bound, specifically or not.

Starting from these assumptions, each miRNA and each binding site can be found in two different states, bound or unbound. The first step is to write the partition function for M microRNAs bound unspecifically:

\[ Z(M) = \binom{N_{NS}}{M} e^{-M \frac{\varepsilon_{NS}}{k_B T}} \tag{4.4} \]

and the total partition function for M microRNAs, bound to the $N_{BS}$ specific or the $N_{NS}$ unspecific BSs, is

\[ Z_{tot}(M) = \sum_{i=0}^{N_{BS}} Z(M-i) \binom{N_{BS}}{i} e^{-i \frac{\varepsilon_S}{k_B T}}. \tag{4.5} \]

Then, our goal is the determination of the mean number of bound specific binding sites, $N^{bound}_{BS}$. This can be obtained through the common relation from statistical physics

\[ N^{bound}_{BS} = -\frac{1}{\beta} \frac{\partial}{\partial \varepsilon_S} \log Z_{tot}(M) \tag{4.6} \]
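Before taking any limit, eq. (4.6) can be evaluated by direct summation, since the derivative of $\log Z_{tot}$ with respect to $\varepsilon_S$ is the mean of i under the weights of eq. (4.5). The following is a sketch of such an exact resummation, assuming $N_{NS} \geq M$; the function names and the use of log-gamma functions to avoid overflow are choices made here, not a description of the code used in the thesis.

```python
import math


def log_binom(n, k):
    """Logarithm of the binomial coefficient C(n, k), for 0 <= k <= n."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def mean_bound_exact(n_bs, n_ns, m, beta_delta_eps):
    """Mean number of bound specific sites from eqs. (4.4)-(4.6).

    The weight of having i microRNAs on specific sites is
    C(N_NS, M - i) * C(N_BS, i) * exp(i * beta * delta_eps), where
    delta_eps = eps_NS - eps_S; the common factor exp(-M * beta * eps_NS)
    cancels in the average.
    """
    i_max = min(n_bs, m)
    log_w = [
        log_binom(n_ns, m - i) + log_binom(n_bs, i) + i * beta_delta_eps
        for i in range(i_max + 1)
    ]
    shift = max(log_w)  # subtract the maximum before exponentiating
    w = [math.exp(x - shift) for x in log_w]
    return sum(i * wi for i, wi in enumerate(w)) / sum(w)


if __name__ == "__main__":
    # Illustrative values: N_BS = 5, N_NS = 10**6, M = 200.
    print(mean_bound_exact(5, 10**6, 200, math.log(10**6 / 24)))
```

With $\beta\Delta\varepsilon = 0$ the weights reduce to a hypergeometric distribution, which gives a simple sanity check of the summation.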

In the following we derive an approximate solution in two different limits, $N_{BS} \ll M$ and $N_{BS} \gg M$, that is, below and above the threshold ($N_{BS} = M$). A


useful limit involving the ratio of Γ functions will be exploited and marked by overbraces in the calculations, eq. (4.7).

\[ \frac{\Gamma(n+k)}{\Gamma(n)} \xrightarrow{\text{large } n} n^k \tag{4.7} \]

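$N_{BS} \ll M$ limit.

\[ Z_{tot} = \sum_{i=0}^{N_{BS}} \overbrace{\binom{N_{NS}}{M-i}}^{\text{approx}} \binom{N_{BS}}{i} e^{-i\beta\varepsilon_S} e^{-(M-i)\beta\varepsilon_{NS}} = \dots \tag{4.8} \]

Using the approximation (4.7) we estimate the binomial coefficient $\binom{N_{NS}}{M-i}$, and we rewrite the exponential $e^{-(M-i)\beta\varepsilon_{NS}}$ in terms of $e^{-(N_{BS}-i)\beta\varepsilon_{NS}}$ by adding and subtracting $N_{BS}$ in the exponent:

\[ \dots = \frac{e^{-(M-N_{BS})\beta\varepsilon_{NS}}\, N_{NS}^{M}}{M!} \sum_{i=0}^{N_{BS}} \binom{N_{BS}}{i} \left( \frac{M}{N_{NS}} e^{-\beta\varepsilon_S} \right)^{i} \left( e^{-\beta\varepsilon_{NS}} \right)^{N_{BS}-i} = \dots \tag{4.9} \]

The well-known binomial formula $(x+y)^n = \sum_{k=0}^{n} \binom{n}{k} x^k y^{n-k}$ gives us

\[ Z_{tot} = \frac{e^{-(M-N_{BS})\beta\varepsilon_{NS}}\, N_{NS}^{M}}{M!} \left( \frac{M}{N_{NS}} e^{-\beta\varepsilon_S} + e^{-\beta\varepsilon_{NS}} \right)^{N_{BS}} \tag{4.10} \]

and, finally, through the relation (4.6),

\[ N^{free}_{BS} = N_{BS} - N^{bound}_{BS} = N_{BS} \frac{N_{NS} e^{-\beta\Delta\varepsilon}}{M + N_{NS} e^{-\beta\Delta\varepsilon}} \tag{4.11} \]

with $\Delta\varepsilon = \varepsilon_{NS} - \varepsilon_S > 0$.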

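$N_{BS} \gg M$ limit.

\[ Z_{tot} = \sum_{i=0}^{N_{BS}} \binom{N_{NS}}{M-i} \binom{N_{BS}}{i} e^{-i\beta\varepsilon_S} e^{-(M-i)\beta\varepsilon_{NS}} \tag{4.12} \]

Decomposing the binomial coefficients as follows

\[ \dots = \sum_{i=0}^{N_{BS}} \frac{N_{NS}!}{(M-i)!\,(N_{NS}-M+i)!}\, \frac{N_{BS}!}{i!\,(N_{BS}-i)!}\, e^{-i\beta\varepsilon_S} e^{-(M-i)\beta\varepsilon_{NS}} = \dots \tag{4.13} \]

\[ \dots = \frac{1}{M!} \sum_{i=0}^{N_{BS}} \overbrace{\frac{N_{NS}!\, N_{BS}!}{(N_{NS}-M+i)!\,(N_{BS}-i)!}}^{\text{approx}}\, \frac{M!}{i!\,(M-i)!}\, e^{-i\beta\varepsilon_S} e^{-(M-i)\beta\varepsilon_{NS}} = \dots \tag{4.14} \]

and, thanks to the binomial coefficient $\frac{M!}{i!\,(M-i)!}$, the upper limit of the summation is set to M, because all further terms ($i > M$) vanish.

\[ \dots = \frac{1}{M!} \sum_{i=0}^{M} \binom{M}{i} \left(N_{BS} e^{-\beta\varepsilon_S}\right)^{i} \left(N_{NS} e^{-\beta\varepsilon_{NS}}\right)^{M-i} = \dots \tag{4.15} \]

As for the previous case, one gets

\[ Z_{tot} = \frac{1}{M!} \left(N_{BS} e^{-\beta\varepsilon_S} + N_{NS} e^{-\beta\varepsilon_{NS}}\right)^{M} \tag{4.16} \]

and, finally,

\[ N^{free}_{BS} = N_{BS} - N^{bound}_{BS} = N_{BS} \left(1 - \frac{M}{N_{BS} + N_{NS} e^{-\beta\Delta\varepsilon}}\right) \tag{4.17} \]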

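Merging together the two limits:

\[ N^{free}_{BS} = \begin{cases} N_{BS}\, \dfrac{N_{NS} e^{-\beta\Delta\varepsilon}}{M + N_{NS} e^{-\beta\Delta\varepsilon}} & N_{BS} \ll M \\[2ex] N_{BS} \left(1 - \dfrac{M}{N_{BS} + N_{NS} e^{-\beta\Delta\varepsilon}}\right) & N_{BS} \gg M \end{cases} \tag{4.18} \]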

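Kinetic model comparison. By comparing the two limits with the STM, we mapped the parameters of our model onto those of the STM. The number of free specific BSs should be equal to

\[ m = \frac{1}{2}\left( m_{tot} - s_{tot} - K_d + \sqrt{(m_{tot} - s_{tot} - K_d)^2 + 4 K_d m_{tot}} \right) \tag{4.19} \]

and, consequently,

\[ s_{tot} = M, \qquad K_d = \frac{k_{off}}{k_{on}} = N_{NS} e^{-\beta\Delta\varepsilon} \qquad \text{for both } m_{tot} \ll s_{tot} \text{ and } m_{tot} \gg s_{tot} \tag{4.20} \]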
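The mapping can be checked by expanding the STM solution in the two regimes. The following is a sketch of that expansion to leading order; the intermediate symbol A is introduced here only for convenience. For $m_{tot} \ll s_{tot}$, setting $A = s_{tot} + K_d - m_{tot} > 0$ and assuming $4 K_d m_{tot} \ll A^2$,

\[ m = \frac{1}{2}\left(-A + \sqrt{A^2 + 4 K_d m_{tot}}\right) \simeq \frac{K_d\, m_{tot}}{A} \simeq m_{tot} \frac{K_d}{s_{tot} + K_d}, \]

which has the form of eq. (4.11) with $s_{tot} = M$ and $K_d = N_{NS} e^{-\beta\Delta\varepsilon}$. For $m_{tot} \gg s_{tot}$, the identity $(m_{tot} - s_{tot} - K_d)^2 + 4 K_d m_{tot} = (m_{tot} - s_{tot} + K_d)^2 + 4 K_d s_{tot}$ gives

\[ m \simeq m_{tot} - s_{tot} + \frac{K_d\, s_{tot}}{m_{tot} - s_{tot} + K_d} \simeq m_{tot}\left(1 - \frac{s_{tot}}{m_{tot} + K_d}\right), \]

which has the form of eq. (4.17) with the same identification of the parameters.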

Grand canonical approach

Another way to obtain the same results while avoiding the all-bound hypothesis is the grand canonical approach. As previously done, we define a set of chemical reactions, which in turn defines a Chemical Master Equation (CME); see van Kampen’s book [114].

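\[ \frac{\partial P(n_i,t)}{\partial t} = \sum_{\rho} \left[ k_\rho^{+}\,\Omega \left( \prod_i E_i^{\,s_i^\rho - r_i^\rho} - 1 \right) \prod_j \frac{((n_j))^{s_j^\rho}}{\Omega^{s_j^\rho}} + k_\rho^{-}\,\Omega \left( \prod_i E_i^{\,r_i^\rho - s_i^\rho} - 1 \right) \prod_j \frac{((n_j))^{r_j^\rho}}{\Omega^{r_j^\rho}} \right] P(n_i,t) \tag{4.21} \]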

Since the reactions (4.3) are consistent with the detailed balance hypothesis, the CME can be written as (4.21). In this particular case, its solutions are obtained through the Grand Canonical Ensemble (GCE) and take the form

\[ P^{g}(n_i) = C \prod_j \frac{(\Omega z_j)^{n_j}}{n_j!}\, e^{-\Omega z_j}\, \Delta(n_i, n_i^0), \tag{4.22} \]

with constraints on the rate ratios

\[ \frac{k_\rho^{+}}{k_\rho^{-}} = \prod_j z_j^{\,r_j^\rho - s_j^\rho}, \tag{4.23} \]

where Ω is the system volume, $z_j = n_j/\Omega$ is the concentration of species j, $\Delta(n_i, n_i^0)$ determines the accessible states and C is a normalization constant. Eq. (4.23) is known as the law of mass action and must hold for all coupled reactions; it is equivalent to detailed balance.

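Then, only one hypothesis is left:

i. microRNAs and binding sites are at kinetic equilibrium at each instant.

So, for our four-reaction model,

\[ \begin{aligned} P(m,n,s,c_m,c_n) = C\, & \frac{(\Omega z_m)^{m}}{m!} e^{-\Omega z_m}\, \frac{(\Omega z_n)^{n}}{n!} e^{-\Omega z_n}\, \frac{(\Omega z_s)^{s}}{s!} e^{-\Omega z_s}\, \frac{(\Omega z_{c_m})^{c_m}}{c_m!} e^{-\Omega z_{c_m}} \\ & \cdot \frac{(\Omega z_{c_n})^{c_n}}{c_n!} e^{-\Omega z_{c_n}}\, \Delta(c_m + c_n + s, M)\, \Delta(m + c_m, N_{BS})\, \Delta(n + c_n, N_{NS}) \end{aligned} \tag{4.24} \]

must hold. To get the normalization C, one sums over all the possible configurations and requires the sum to be equal to 1. The Δs remove 3 of the 5 summations, and the relations

\[ k^{S}_d = \frac{k^{S}_{on}}{k^{S}_{off}} = \frac{z_{c_m}}{z_m z_s}, \qquad k^{NS}_d = \frac{k^{NS}_{on}}{k^{NS}_{off}} = \frac{z_{c_n}}{z_n z_s} \tag{4.25} \]

from the law of mass action lead to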

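\[ P(c_m, c_n) = \frac{ \dfrac{\left(k^{S}_d/\Omega\right)^{c_m} \left(k^{NS}_d/\Omega\right)^{c_n}}{(N_{BS}-c_m)!\,(N_{NS}-c_n)!\,(M-c_m-c_n)!\,c_m!\,c_n!} }{ \displaystyle\sum_{c_m=0}^{N_{BS}} \sum_{c_n=0}^{M} \dfrac{\left(k^{S}_d/\Omega\right)^{c_m} \left(k^{NS}_d/\Omega\right)^{c_n}}{(N_{BS}-c_m)!\,(N_{NS}-c_n)!\,(M-c_m-c_n)!\,c_m!\,c_n!} } \tag{4.26} \]

Under the hypothesis that $N_{NS}$ is much greater than all the other variables involved ($N_{BS}$, M), one has

\[ P(c_m, c_n) = \frac{ \dfrac{\left(k^{S}_d/\Omega\right)^{c_m} \left(N_{NS}\, k^{NS}_d/\Omega\right)^{c_n}}{(N_{BS}-c_m)!\,(M-c_m-c_n)!\,c_m!\,c_n!} }{ \displaystyle\sum_{c_m=0}^{N_{BS}} \sum_{c_n=0}^{M} \dfrac{\left(k^{S}_d/\Omega\right)^{c_m} \left(N_{NS}\, k^{NS}_d/\Omega\right)^{c_n}}{(N_{BS}-c_m)!\,(M-c_m-c_n)!\,c_m!\,c_n!} } \tag{4.27} \]

and so the partition function $Z(k^{S}_d, k^{NS}_d)$ comes out naturally:

\[ Z(k^{S}_d, k^{NS}_d) = \sum_{c_m=0}^{N_{BS}} \sum_{c_n=0}^{M} \frac{\left(k^{S}_d/\Omega\right)^{c_m} \left(N_{NS}\, k^{NS}_d/\Omega\right)^{c_n}}{(N_{BS}-c_m)!\,(M-c_m-c_n)!\,c_m!\,c_n!} \tag{4.28} \]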

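Through the usual derivatives one gets

\[ \langle c_m \rangle = k^{S}_d \frac{\partial}{\partial k^{S}_d} \log Z(k^{S}_d, k^{NS}_d), \qquad \langle c_n \rangle = k^{NS}_d \frac{\partial}{\partial k^{NS}_d} \log Z(k^{S}_d, k^{NS}_d) \tag{4.29} \]

the mean numbers of bound BSs. A change of variable leads to a form analogous to $Z_{tot}$:

\[ Z(k^{S}_d, k^{NS}_d) = \sum_{i=0}^{N_{BS}} \frac{\left(k^{S}_d/\Omega\right)^{i} \left(1 + N_{NS}\, k^{NS}_d/\Omega\right)^{M-i}}{(N_{BS}-i)!\,(M-i)!\,i!} \tag{4.30} \]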

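and, as in the previous case, two limits have to be considered:

i. $N_{BS} \gg M$

\[ Z_{above} = \frac{1}{N_{BS}!\, M!} \left(1 + N_{NS} \frac{k^{NS}_d}{\Omega} + N_{BS} \frac{k^{S}_d}{\Omega}\right)^{M} \tag{4.31} \]

ii. $N_{BS} \ll M$

\[ Z_{below} = \frac{1}{N_{BS}!\, M!} \left(1 + N_{NS} \frac{k^{NS}_d}{\Omega}\right)^{M-N_{BS}} \left(1 + N_{NS} \frac{k^{NS}_d}{\Omega} + M \frac{k^{S}_d}{\Omega}\right)^{N_{BS}} \tag{4.32} \]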

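Thanks to relations (4.29),

\[ N^{free}_{BS} = \begin{cases} N_{BS} \left(1 - \dfrac{M}{\dfrac{\Omega + N_{NS} k^{NS}_d}{k^{S}_d} + N_{BS}}\right) & N_{BS} \gg M \\[4ex] N_{BS} \left(1 - \dfrac{M}{\dfrac{\Omega + N_{NS} k^{NS}_d}{k^{S}_d} + M}\right) & N_{BS} \ll M \end{cases} \tag{4.33} \]

Through the comparison with the STM,

\[ s_{tot} = M, \qquad K_d = \frac{\Omega + N_{NS} k^{NS}_d}{k^{S}_d} \tag{4.34} \]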

At this point it is easy to see that the first model is the limit of the GCE approach for the volume Ω tending to 0. A small Ω means a low number of free states able to contain free microRNAs. Thus, this finding is coherent with the all-bound microRNA hypothesis when Ω → 0.
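As a sketch of this limit, under the additional assumption (not stated explicitly in the text) that the two association constants are Boltzmann factors with a common prefactor, $k^{S}_d \propto e^{-\beta\varepsilon_S}$ and $k^{NS}_d \propto e^{-\beta\varepsilon_{NS}}$, the $K_d$ of eq. (4.34) becomes

\[ K_d = \frac{\Omega + N_{NS} k^{NS}_d}{k^{S}_d} \xrightarrow{\ \Omega \to 0\ } N_{NS} \frac{k^{NS}_d}{k^{S}_d} = N_{NS}\, e^{-\beta(\varepsilon_{NS} - \varepsilon_S)} = N_{NS}\, e^{-\beta\Delta\varepsilon}, \]

which coincides with the $K_d$ of eq. (4.20).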

To exploit the power of the approach, we address some interesting issues, such as the presence of alternative specific targets (the Pool) and transcripts with several BSs. For both cases, to account for other possible specific bonds with other molecules (the Pool), we need two new reactions; see also Fig. 4.11A.

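\[ m + s \underset{k^{S/BS}_{off}}{\overset{k^{S/BS}_{on}}{\rightleftharpoons}} c_m, \qquad p + s \underset{k^{S/Pool}_{off}}{\overset{k^{S/Pool}_{on}}{\rightleftharpoons}} c_p, \qquad n + s \underset{k^{NS}_{off}}{\overset{k^{NS}_{on}}{\rightleftharpoons}} c_n \tag{4.35} \]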

When several BSs are on the same transcript, we want to know how many mRNAs are completely unbound. To do that, we consider a group of transcripts, each carrying $n_{bs}$ BSs. The species m identifies each single BS, so $m_{tot} = N_{BS} = \mathrm{mRNA} \cdot n_{bs}$. The distribution of M microRNAs on the $N_{BS}$ BSs is described by a subset of the integer partitions of M, where the largest value is $n_{bs}$. The set of integer partitions with maximum integer $n_{bs}$ is defined as follows:

\[ \pi(i, \mathrm{mRNA}, n_{bs}) = \text{set of partitions of length mRNA that sum to } i \text{, with maximum integer per partition } n_{bs} \tag{4.36} \]


To generate these partitions, we rely on Ruskey’s RecDesc algorithm; for the technical specification see [87, 57].
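To make the object in eq. (4.36) concrete, the sketch below enumerates occupancy vectors with one entry per transcript, each entry between 0 and $n_{bs}$, summing to i. It is a plain recursive enumeration written for illustration, not Ruskey's RecDesc algorithm, and treating the transcripts as distinguishable (ordered vectors) is an assumption made here.

```python
def occupancy_vectors(total, length, max_part):
    """Yield tuples of `length` integers in [0, max_part] that sum to `total`."""
    if length == 0:
        if total == 0:
            yield ()
        return
    # The remaining entries can absorb at most (length - 1) * max_part units,
    # so the first entry cannot be smaller than `lowest`.
    lowest = max(0, total - (length - 1) * max_part)
    highest = min(max_part, total)
    for first in range(lowest, highest + 1):
        for rest in occupancy_vectors(total - first, length - 1, max_part):
            yield (first,) + rest


if __name__ == "__main__":
    # Two bound sites distributed over three transcripts with two BSs each.
    for vector in occupancy_vectors(2, 3, 2):
        print(vector)
```

The number of vectors grows quickly with the number of transcripts, so this direct enumeration is only practical for small systems.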

From the model with the Pool, see reactions (4.35), we are able to calculate the probability $p_{bound}(i)$ that i microRNAs are arranged on the specific BSs. So i represents the number of bound BSs within the $N_{BS}$ subset, m the number of messengers and $n_{bs}$ the number of BSs per messenger. All messengers have the same number of BSs, and each partition is associated with a list of occupied BSs. From that list one evaluates the probability of having mRNAs unbound by any microRNA.

\[ p(n_{ij}) = \frac{\prod_{i=1}^{m} \binom{n_{bs}}{n_i}}{\sum_j \prod_{i=1}^{m} \binom{n_{bs}}{n_i}} \tag{4.37} \]

To get the integer partitions, the RecDesc algorithm by Ruskey has been implemented. The probability of having k free mRNAs is

\[ p_{free}(k \mid \pi(i, m, n_{bs})) = \sum_{free(n_{ij}, k)} p(n_{ij}) \tag{4.38} \]

where the function $free(n_{ij}, k)$ returns the partitions $n_{ij}$ with k free mRNAs. For each element of the partition set, we calculated its probability and the number of free mRNAs. So, summing over i, the probability of having k free mRNAs becomes

\[ p_{free}(k) = \sum_{i=0}^{M} p_{bound}(i) \cdot p_{free}(k \mid \pi(i, \mathrm{mRNA}, n_{bs})) \tag{4.39} \]
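Equations (4.37)-(4.39) can be illustrated on a small system by brute force. The sketch below assumes that the occupancy lists are ordered (one entry per transcript) and takes $p_{bound}(i)$ as an input list, since in the thesis it comes from the model with the Pool; the function names are hypothetical.

```python
from itertools import product
from math import comb


def p_free_given_i(i, n_mrna, n_bs_per_mrna):
    """Distribution of the number of fully unbound transcripts given i bound BSs.

    Every occupancy list (n_1, ..., n_m) summing to i is weighted by
    prod_l C(n_bs, n_l), as in eq. (4.37), and the weights are grouped by the
    number of transcripts with no bound site, as in eq. (4.38).
    """
    weights = [0] * (n_mrna + 1)
    for occ in product(range(n_bs_per_mrna + 1), repeat=n_mrna):
        if sum(occ) != i:
            continue
        w = 1
        for n_l in occ:
            w *= comb(n_bs_per_mrna, n_l)
        weights[occ.count(0)] += w
    total = sum(weights)
    if total == 0:
        raise ValueError("i exceeds the total number of binding sites")
    return [w / total for w in weights]


def p_free(p_bound, n_mrna, n_bs_per_mrna):
    """Eq. (4.39): mix the conditional distributions with the weights p_bound[i]."""
    out = [0.0] * (n_mrna + 1)
    for i, p_i in enumerate(p_bound):
        if p_i == 0.0:
            continue
        for k, p_k in enumerate(p_free_given_i(i, n_mrna, n_bs_per_mrna)):
            out[k] += p_i * p_k
    return out


if __name__ == "__main__":
    # Two transcripts with two BSs each; p_bound is an illustrative input.
    print(p_free([0.5, 0.0, 0.5], 2, 2))
```

The brute-force loop visits $(n_{bs}+1)^{m}$ lists, which is why a dedicated partition generator is preferable for realistic sizes.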

Data

In the models presented we considered each BS of a specific microRNA. From the PAR-CLIP data of the Hafner [45] and Kishore [60] articles we reconstructed the microRNA regulation layer for the HEK293 cell line. Then, using the predicted microRNA BSs in microRNA.org [14], we cross-checked the two datasets to get a prediction of functional BSs in HEK293 cells and to list, for each microRNA, the number of BSs per transcript. In [59], miRanda is the second best algorithm, immediately after MIRZA, e.g. [44], and it also recognizes non-canonical BSs. Thus this intersection should give us a good estimate of the physiological BSs in HEK293 cells.

To evaluate how $N_{BS}$ scales for each microRNA, we extracted the expression values from microarray experiments in HEK293. We took only the controls from the chosen datasets and used them as a measure of the transcript copy number. All datasets are based on the GeneChip Human Genome U133 Plus 2.0 Array and are linked to the works [30, 95, 46].

4.3 Results

A journey in miRNA world

MicroRNAs reside in the cytoplasm and live in this overcrowded world, full of water and interacting molecules, looking for target mRNAs. RNA strands are their main interactors, even if only a small fraction has to be bound by a specific


[Figure 4.2, panels A-C. Recoverable labels: Density, Interaction energy, specific energy, unspecific energy, hsa-miR-1, target sequence, seed sequence, RNAhybrid Gibbs free energy, for all microRNAs, Counts, random 7-word, seed, z-score = -10.4.]

Figure 4.2: To estimate the free energy that should be used in our model, each seed was hybridized with RNAhybrid against a sample of 20000 random words, with the nucleotide density typical of 3’ UTRs. A. An example of the distribution obtained through these hybridizations for hsa-miR-1. B. Distribution of ∆ε estimated for all microRNAs. ∆ε was computed as the difference between the specific and the unspecific energy. The specific energy was calculated through direct pairing of the seed with its complementary word, and the unspecific one through the distribution in A. C. Using the nearest-neighbour model of [122], we found that seeds forming more stable duplexes are selected by evolution with respect to other words of length 7. MiRNA seeds are GC-rich sequences with respect to background UTRs.


microRNA. These two species interact mainly through base pairing between the microRNA seed and its target binding sites on 3’ UTR sequences [12, 11]. To explore this interaction, we started from a list of microRNAs; the seed region of each one corresponds to the mature sequence from nucleotide 2 to 8 (7-mer), while other possible BSs are due to different couplings, such as 6-mers and 8-mers, see [12, 11]. A first subject is the bias of seeds with respect to all possible 7-words. So, we computed the Gibbs free energy of the duplexes formed by each seed and its complementary RNA. Drawing random samples from the set of all RNA sequences of length seven, and calculating the Gibbs free energy of the words in many random samples, we found that real seeds form significantly more stable perfect duplexes, Fig. 4.2C. These energies, due to complementary pairing, were calculated through the nearest-neighbour model by Turner, INN-HB [122]. This bias towards more stable words could be explained in different ways: e.g. it helps to form a tight bond between microRNA and target, or it is due to microRNA biogenesis, when the microRNA is in the hairpin state [12], or perhaps both together. As shown later, it could also be generated by a global bias in nucleotide density in the 3’ UTR regions of transcripts. To be sure that we did not find a typical pattern due to molecules interacting with RNAs, we took the binding motifs of RNA binding proteins from the RBPDB database [23]. It assigns to each domain a Position Frequency Matrix (PFM), and we tested the nucleotide composition of each matrix, finding an average AU content equal to 0.51 and CG content equal to 0.49. However, our asymmetric density could be a special feature of RBPs interacting with 3’ UTRs but, unfortunately, we are not able to pinpoint them.

The world seen by a seed is composed of other small RNA words. In order to evaluate the effect of this “world of words”, we hybridized each seed with a set of 20000 random words of length seven. These words are produced using the nucleotide probabilities of human 3’ UTRs, and each set of interactions gave us a stability spectrum; an example for hsa-miR-1 is shown in Fig. 4.2A. In this case the energies have been obtained through the RNAhybrid tool [85], to allow also non-perfect matches between seed and target sequences. Thanks to these distributions we defined $\Delta G_{NS}$, the non-specific free energy, to evaluate how stable the specific miRNA-mRNA interaction is with respect to unspecific ones. The mean free energy of each distribution defines $\Delta G_{NS}$ for each microRNA. The difference between the specific (perfect pairing) and the unspecific (other possible words) free energy is a measure of duplex stability, and the distribution of these values is plotted in Fig. 4.2B. The mean stability of miRNA-mRNA duplexes against the chemical pool of mRNA is ∼ 9 kcal mol$^{-1}$, with values from -15 to -2 kcal mol$^{-1}$. Similar values are found for example in [48] through the Vienna package.

As in [38, 15], we want to estimate the fraction of bound microRNAs. To do that, we considered the equilibrium of the specific and of the unspecific interaction: both free energies are smaller than 0, suggesting that the equilibrium is shifted towards bound microRNAs. All microRNAs end up bound, specifically or unspecifically. This agrees with our assumption that the number of free microRNAs tends to 0.

It is well known that a single mutation in the BS/seed motif can disrupt the repressive action of a miRNA [48, 119]; we want to quantify the effect of a mutation or mismatch in the target sequence. To do that, we hybridized all seeds with all words having one, two or three mismatches with respect to the perfectly pairing sequence. The results are shown in Fig. 4.3, and it is clear that a single mismatch generates a mean $\Delta G_{loss}$ of 3 kcal mol$^{-1}$, capable of completely abolishing the microRNA regulation.
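To give a sense of scale, in the model above $K_d = N_{NS} e^{-\beta\Delta\varepsilon}$, so a mismatch that lowers $\Delta\varepsilon$ by $\Delta G_{loss}$ multiplies $K_d$ by $e^{\beta \Delta G_{loss}}$. The sketch below evaluates this factor; the temperature of 37 °C and the function name are assumptions made here for illustration.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal / (mol K)


def kd_fold_change(delta_g_loss, temperature_k=310.15):
    """Factor by which K_d grows when the specific duplex loses
    delta_g_loss (kcal/mol) of stability, i.e. exp(beta * delta_g_loss)."""
    return math.exp(delta_g_loss / (R_KCAL * temperature_k))


if __name__ == "__main__":
    # Mean loss quoted in the text for one mismatch: about 3 kcal/mol.
    print(f"K_d increases by a factor of about {kd_fold_change(3.0):.0f}")
```

Under these assumptions a single mismatch raises $K_d$ by roughly two orders of magnitude, which is consistent with the loss of regulation described above.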


[Figure 4.3, panels A and B. Recoverable labels: Density, Number of mismatches, Mismatches (1, 2, 3).]

Figure 4.3: Distribution of $\Delta G_{loss}$ (A) and mean $\Delta G_{loss}$ with its standard error (B) for 1, 2 and 3 mismatches in the seed-target pairing.

[Figure 4.4. Recoverable labels: free mRNA, total mRNA; curves for $N_{NS} = 10^{8}$, $10^{9}$ and $10^{10}$.]

Figure 4.4: The number of unspecific binding sites ($N_{NS}$) changes the titration curves; in particular, increasing $N_{NS}$ weakens the repression strength and, consequently, thresholds tend to disappear.

Unspecific sites on RNAs dilute the repression strength of a microRNA.

Each RNA is made of targetable words that on average have free energy $\Delta G_{NS}$. A microRNA can target unspecific sites, which include all positions on an mRNA multiplied by the number of mRNAs. In BioNumbers [75] we looked for an estimate of the mRNA amount per cell, and got ∼ 300000 units in human. Considering a mean length of $10^{3}$-$10^{4}$ nt for each transcript, the number of unspecific binding sites ($N_{NS}$) takes values of ∼ $10^{8}$-$10^{9}$ in eukaryotes. To investigate the effect of a change in the bulk of unspecific words, three different values of $N_{NS}$ are used in Fig. 4.4. Clearly, $N_{NS}$ modifies $K_d$ but leaves the threshold at the same position (Fig. 4.4A). An experiment could test whether the model is based on physiological hypotheses: one could rescale the amount of RNA transcripts in a cell and look for a change in the titration curves of some microRNA targets, keeping in mind that


Figure 4.5: Data on the increase of repression due to microRNA transfection as a function of the concentration of its targets, from [7].

$K_d$ also depends on the volume Ω of the system considered. Obviously, the cell volume and the unspecific BSs have a similar dilution effect on titration curves.
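The dilution effect can be made quantitative with the mapping $K_d = N_{NS} e^{-\beta\Delta\varepsilon}$ of eq. (4.20). The sketch below is only illustrative: it assumes $\Delta\varepsilon = 9$ kcal/mol (the mean stability quoted earlier), a temperature of 37 °C and M = 200 microRNA copies, and it reports the fraction of specific sites left free in the $N_{BS} \ll M$ limit of eq. (4.11).

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal / (mol K)


def k_d(n_ns, delta_eps, temperature_k=310.15):
    """K_d = N_NS * exp(-beta * delta_eps), with delta_eps in kcal/mol."""
    return n_ns * math.exp(-delta_eps / (R_KCAL * temperature_k))


def free_fraction_below_threshold(m, kd):
    """N_free / N_BS in the N_BS << M limit, eq. (4.11)."""
    return kd / (m + kd)


if __name__ == "__main__":
    m = 200          # microRNA copies (assumed)
    delta_eps = 9.0  # kcal/mol (assumed)
    for n_ns in (1e8, 1e9, 1e10):
        kd = k_d(n_ns, delta_eps)
        frac = free_fraction_below_threshold(m, kd)
        print(f"N_NS = {n_ns:.0e}: K_d = {kd:.0f}, free fraction = {frac:.2f}")
```

With these assumed values, each tenfold increase of $N_{NS}$ raises $K_d$ tenfold and leaves a larger fraction of the specific sites unbound, in line with the weakening of repression shown in Fig. 4.4.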

Expression of microRNA targets influences its repression strength.

From our model, it is easy to infer how much the repression strength of a microRNA drops as its specific targets increase. This effect was already studied in [7], whose authors found that the downregulation by transfected microRNAs is a function of target abundance and showed a significant rank correlation between target abundance and mean log expression ratio (ρ = 0.62). They measured the ratio of expression of chosen microRNA targets before and after the transfection and stated that

\[ \log_2(\text{expression ratio}) = \log_2\left(1 - \frac{a}{RPN + b}\right) \tag{4.40} \]

is the best fitting function. In equation (4.40), RPN is reads per nucleotide and measures the target abundance, while a and b are two parameters. Inferring the same quantity from our model, we get two behaviours for $\log_2\left( N^{free}_{BS}(M') / N^{free}_{BS}(M) \right)$.

Below threshold the expression ratio does not depend on the target abundance but only on the transfection size. Thus, below threshold, the ratio is equal to

\[ \log_2\left(\frac{N^{free}_{BS}(M')}{N^{free}_{BS}(M)}\right) = \log_2\left(1 - \frac{\Delta M}{M' + K_d}\right) \tag{4.41} \]

and, above threshold,

\[ \log_2\left(\frac{N^{free}_{BS}(M')}{N^{free}_{BS}(M)}\right) = \log_2\left(1 - \frac{\Delta M}{N_{BS} + K_d - M}\right) \tag{4.42} \]


Our suggestion is that, in [7], equation (4.40) does not fit the data well (Fig. 4.5) because the two parameters a and b depend on unknown varying parameters. Moreover, the functional dependence changes according to the regime (above/below threshold).

\[ a = \Delta M = M' - M, \qquad b = K_d - M \tag{4.43} \]

In order to fully understand the curve and explain the variation of the repression strength, we need to know at least the microRNA concentration before and after the transfection, the regime (above or below threshold) and the total concentration of targets.
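The two regimes of eqs. (4.41) and (4.42) can be compared directly. The sketch below uses hypothetical function names and illustrative values; it only shows that the below-threshold ratio is independent of $N_{BS}$, while the above-threshold one approaches zero as the target pool grows.

```python
import math


def log2_ratio_below(m_before, m_after, k_d):
    """Eq. (4.41): log2 expression ratio below threshold (N_BS << M)."""
    return math.log2(1.0 - (m_after - m_before) / (m_after + k_d))


def log2_ratio_above(m_before, m_after, n_bs, k_d):
    """Eq. (4.42): log2 expression ratio above threshold (N_BS >> M).

    Requires n_bs + k_d > m_after so that the argument of the log is positive.
    """
    return math.log2(1.0 - (m_after - m_before) / (n_bs + k_d - m_before))


if __name__ == "__main__":
    m_before, m_after, k_d = 200, 400, 24  # illustrative transfection
    print("below threshold:", log2_ratio_below(m_before, m_after, k_d))
    for n_bs in (1000, 5000):
        print(f"above threshold, N_BS = {n_bs}:",
              log2_ratio_above(m_before, m_after, n_bs, k_d))
```

Comparing eq. (4.42) with eq. (4.40) reproduces the identification $a = \Delta M$ and $b = K_d - M$ of eq. (4.43), with $N_{BS}$ playing the role of RPN.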

The efficacy of binding sites on a single transcript depends on target pool size.

A transcript may carry several BSs (Fig. 4.6A). How does the number of BSs on a transcript change the probability for the transcript to be bound specifically? The increase of probability follows equation (4.54), where n is the number of BSs on the transcript. We obtain that relation through the following calculation.

$Z_n$ is the partition function of having at least one bound BS among n.

\[ Z_n = \sum_{i=1}^{N_{BS}} Z(M-i)\, e^{-i \frac{\varepsilon_S}{k_B T}} \sum_{j=1}^{i} \binom{n}{j} \binom{N_{BS}-n}{i-j} \tag{4.44} \]

We introduce the partition function related to a messenger with n binding sites, chosen from the $N_{BS}$, and at least 1 bound:

\[ Z_n = \sum_{i=1}^{N_{BS}} Z(M-i)\, e^{-i \frac{\varepsilon_S}{k_B T}} \sum_{j=1}^{i} \binom{n}{j} \binom{N_{BS}-n}{i-j} \tag{4.45} \]

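Resumming $Z_n$ over n is too hard, so we proceeded by evaluating the BS effect through the ratio $F_n$. This quantity evaluates the increase of the binding probability of a microRNA obtained by adding new binding sites on the target transcript while leaving the total number of specific BSs constant.

\[ F_n = \frac{Z_n}{Z_1} = \dots \tag{4.46} \]

Using the Chu-Vandermonde identity $\sum_{j=0}^{k} \binom{m}{j} \binom{n-m}{k-j} = \binom{n}{k}$,

\[ \dots = \frac{\sum_{i=0}^{N_{BS}} Z(M-i)\, e^{-i\beta\varepsilon_S} \left[ \binom{N_{BS}}{i} - \binom{n}{0} \binom{N_{BS}-n}{i} \right]}{\sum_{i=0}^{N_{BS}} Z(M-i)\, e^{-i\beta\varepsilon_S} \left[ \binom{N_{BS}}{i} - \binom{1}{0} \binom{N_{BS}-1}{i} \right]} = \dots \tag{4.47} \]

\[ \dots = \frac{\sum_{i=0}^{N_{BS}} Z(M-i)\, e^{-i\beta\varepsilon_S} \binom{N_{BS}}{i} \left( 1 - \frac{(N_{BS}-n)!\,(N_{BS}-i)!}{(N_{BS}-n-i)!\,N_{BS}!} \right)}{\sum_{i=0}^{N_{BS}} Z(M-i)\, e^{-i\beta\varepsilon_S} \binom{N_{BS}}{i} \left( 1 - \frac{N_{BS}-i}{N_{BS}} \right)} = \dots \tag{4.48} \]

With the Stirling formula and with $i, n \ll N_{BS}$,

\[ F_n = \frac{\sum_{i=0}^{N_{BS}} Z(M-i)\, e^{-i\beta\varepsilon_S} \binom{N_{BS}}{i} \left( 1 - e^{-\frac{n \cdot i}{N_{BS}}} \right)}{\sum_{i=0}^{N_{BS}} Z(M-i)\, e^{-i\beta\varepsilon_S} \binom{N_{BS}}{i} \frac{i}{N_{BS}}} \tag{4.49} \]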


[Figure 4.6, panels A-C. Recoverable labels: target BS, BS on transcript, Fold repression gain, Effective BS, Threshold.]

Figure 4.6: Transcripts can carry more than a single binding site for a specific microRNA. A. From the pool of target binding sites, n are located on a single transcript. B. We look for the ratio between the probability of finding at least one bound binding site among the n and the same probability for a transcript with a single BS, n = 1. The fold repression gain due to n measures how much the number of binding sites n affects the probability of being bound by at least one microRNA. C. It is possible to define an effective number of BSs that depends on the size of the target pool, $N_{BS}$, and on the microRNA expression, M: $n_{eff} \propto N_{BS}/M$. In the plot the number of effective BSs remains constant below threshold, but grows linearly with the pool size of target transcripts above threshold.


[Figure 4.7: nine panels of F versus n (n from 0 to 12), one for each value of N_BS. The fitted parameters printed in the panels are:]

N_BS    Fmax (ex.)   Rate (ex.)   Fmax (th.)   Rate (th.)
12      1.34         1.382        1.12         0.892
50      1.38         1.286        1.12         0.892
100     1.43         1.198        1.12         0.892
150     1.50         1.106        1.12         0.892
200     1.58         1.005        1.12         0.892
300     1.83         0.793        1.62         0.617
500     2.62         0.481        2.62         0.381
1000    5.05         0.221        5.12         0.195
1500    7.54         0.142        7.62         0.131

Figure 4.7: Comparison between the theoretical solution (green dashed) in eq. 4.54 and the exact resummation (red points) with M = 200 and $N_{NS} = 10^{9}$.

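\[ F_n = \begin{cases} \left( 1 + \dfrac{N_{NS} e^{-\beta\Delta\varepsilon}}{M} \right) \left( 1 - \displaystyle\sum_{k=0}^{\infty} (-1)^k \dfrac{\langle i^k \rangle\, n^k}{N_{BS}^{k}\, k!} \right) & N_{BS} \ll M \\[4ex] \left( \dfrac{N_{BS} + N_{NS} e^{-\beta\Delta\varepsilon}}{M} \right) \left( 1 - \displaystyle\sum_{k=0}^{\infty} (-1)^k \dfrac{\langle i^k \rangle\, n^k}{N_{BS}^{k}\, k!} \right) & N_{BS} \gg M \end{cases} \tag{4.50} \]

When $\langle i^k \rangle = \langle i \rangle^k$,

\[ F_n = F_{max} \left( 1 - e^{-r \cdot n} \right) \tag{4.51} \]

holds.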

This equation can explain why saturating exponentials fit the numerical resummation well, see Fig. 4.7. Besides, it works for large values of $N_{BS}$; indeed, in Figure 4.7 the theoretical lines are green dashed and the exact black. The plots in Figs. 4.6B and 4.7 show that having more binding sites is effective only above threshold, and that on increasing the pool $N_{BS}$ the effect of the BSs becomes greater. Above threshold, it is possible to define a number of effective BSs, $n_{eff}$:

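\[ \frac{F_n}{F_{max}} = \left( 1 - \frac{1}{e} \right) = 0.63 \;\Rightarrow\; n_{eff} = \frac{1}{r} = F_{max} = \frac{N_{BS} + K_d}{M} \tag{4.52} \]

\[ n_{eff} = \frac{N_{BS} + K_d}{M} \sim \frac{N_{BS}}{M} \tag{4.53} \]

For the below-threshold case, $n_{eff} \sim 1$ holds.

\[ F_n = \frac{Z_n}{Z_1} = F_{max} \left( 1 - e^{-r \cdot n} \right) \tag{4.54} \]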

In this way we worked out the relation between the probabilities of being bound with n BSs or with 1 BS. This saturating exponential fits the exact numerical resummation well (Fig. 4.6B), but only in the above-threshold limit. Still within this limit, eq. (4.55) holds and a number of effective BSs ($n_{eff}$) can be defined and linked to physiological parameters. This calculation was done in the above-threshold limit; for the below-threshold case one can easily imagine that the number of BSs does not affect the probability of being bound, $F_n \sim 1$.

\[ n_{eff} \sim \frac{N_{BS} + K_d}{M} \tag{4.55} \]

For the below-threshold case, having more BSs does not significantly affect the repression strength. In Fig. 4.6C we plot how the number of effective BSs depends on the pool of microRNA targets, $N_{BS}$.
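The saturating form of eq. (4.54) is easy to tabulate. The sketch below takes $r = 1/F_{max}$ as in eq. (4.52) and uses the prefactors of eq. (4.50) for $F_{max}$; the value $K_d = 24$ is an assumption, chosen here because it reproduces the theoretical entries printed in Fig. 4.7 for M = 200, and the switch between regimes at $N_{BS} = M$ is likewise a simplification.

```python
import math


def f_max(n_bs, m, k_d):
    """Saturation value of F_n from the prefactors of eq. (4.50)."""
    if n_bs <= m:  # below threshold
        return 1.0 + k_d / m
    return (n_bs + k_d) / m  # above threshold


def fold_repression_gain(n, n_bs, m, k_d):
    """F_n = F_max * (1 - exp(-r * n)) with r = 1 / F_max, eqs. (4.52), (4.54)."""
    fm = f_max(n_bs, m, k_d)
    return fm * (1.0 - math.exp(-n / fm))


if __name__ == "__main__":
    m, k_d = 200, 24  # k_d is an assumed value, see the lead-in
    for n_bs in (100, 300, 1000):
        fm = f_max(n_bs, m, k_d)
        gains = [fold_repression_gain(n, n_bs, m, k_d) for n in (1, 4, 12)]
        print(f"N_BS = {n_bs}: F_max = {fm:.2f}, rate = {1 / fm:.3f}, "
              + ", ".join(f"{g:.2f}" for g in gains))
```

Under these assumptions the gain from additional binding sites saturates after about $n_{eff} = F_{max}$ sites, so extra sites matter only when the target pool is large compared with the microRNA amount.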

Number of binding sites of a microRNA on a single transcript and specific nucleotide bias in 3’ UTR.

To further investigate BSs on a single transcript, we evaluated the mean number of BSs per microRNA among its targets from the PAR-CLIP data by Kishore and Hafner, cross-checked with the microRNA.org database. We suppose that $N_{BS}$ is proportional to the number of targets; indeed, the correlation between the total expression of microRNA targets and the number of targets is close to 1, see Fig. 4.9. Expression values were extracted from microarray data in the HEK293 cell line (data from [30, 95, 46]). From these data, we found a correlation between the number of targets and the mean number of BSs on the targets of each microRNA (see Fig. 4.8A). It seems that a microRNA has more BSs per transcript when it has more targets; this may be due to the number of effective BSs, which grows linearly with the number of target BSs. In Fig. 4.8B, microRNAs with less stable duplexes have more BSs; in some way this may balance the lower binding strength with the number of BSs. However, these effects may also originate from the nucleotide bias in 3’ UTRs. We have found that microRNAs with more stable duplexes have less chance to meet a word complementary to their seed. Figure 4.8C shows the AU bias in 3’ UTRs and, in particular, that microRNA duplex stability correlates with the probability of finding the complementary word; this correlation is close to 1. The asymmetric density of AU/CG nucleotides has been estimated from Ensembl [34] genome sequences. This asymmetry generates a world where microRNAs that form less stable duplexes (AU rich) meet more complementary words than more stable microRNAs.

Pool of competitors changes titration parameters.

Now, we focus on a subset of target transcripts, called $N_{BS}$. The remaining specific binding sites are called the Pool, as in Fig. 4.11A. The presence of the Pool affects


[Figure 4.8, panels A-C. Recoverable labels: mean BS, Target transcripts, Duplex free energy, word probability, Seed energy, 3’ UTR probability (A, U, C, G).]

Figure 4.8: Cross-checking data from microrna.org and PAR-CLIP experiments, a list of validated binding sites for the HEK293 cell line was obtained. In addition, we used expression microarray data to evaluate the global expression of target transcripts. A shows the correlation found between the number of targets and the mean number of BSs on the targets of each microRNA, perhaps due to the number of effective BSs, which grows linearly with the number of target BSs ($N_{BS}$). In B, the microRNAs with less stable duplexes have a greater mean number of BSs on target transcripts; this may balance the weaker interaction energy. C suggests that this correlation could also be caused by the asymmetric density of AU nucleotides in 3’ UTRs; the estimation has been done through Ensembl [34]. This asymmetry generates a world where microRNAs that form less stable duplexes (AU rich) meet more complementary words than more stable microRNAs. In the figure, $\rho_P$ is the Pearson and $\rho_S$ the Spearman correlation coefficient.

72

1200

0

600

6 1053 105

-1 0 1

Total expression

Num

ber

ofta

rget

tran

scrip

ts

Correlation for 6 samples in HEK293

Figure 4.9: Correlation between number of targets and total expression of targetsfor each microRNA.

titration curves for the considered subset of NBS binding sites. Suppose ∆εPand ∆ε are respective equilibrium energies for the binding with Pool and NBS .The approximation made for above and below threshold case, works only for theabove case. So we are able to achieve an estimation of stot and Kd parameters,eq. (4.56) for that state.

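$$ s_{tot} = M \tag{4.56} $$

$$ K_d = N_{NS}\, e^{-\beta \Delta\varepsilon} + \mathrm{Pool}\; e^{\beta(\Delta\varepsilon_P - \Delta\varepsilon)} \tag{4.57} $$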

The simplest titration model with these parameters partially approximates the exact numerical resummation of the partition function (see Fig. 4.10A). As a rule of thumb it should capture the right behaviour, so we used this expression to explain Hwa's data in [48]. In [48] the authors claimed that the thermodynamic constant kB must be changed to explain the observed repression. However, they performed hybridization experiments between sodB and RyhB, modifying the complementary sequences to study how RNA duplex stability affects the repression strength. In doing so, the Pool changes with every seed mutation, and the total number of targets varies among the samples. We therefore suggest that the presence of other target transcripts could explain the unexpected behaviour (see Fig. 4.10B-C) without invoking any change in β.

Experimental observation: more binding sites on the same transcript affect typical titration curves.

In some works, e.g. [78], transcripts with several BSs are studied. We try to reconstruct the experimental setting in which a transcript with a variable number of BSs is induced while the cell has an endogenous microRNA with its own targets. To simulate this setting we use the model with Pool and integer partitions presented in the Model section. It is hard to find an approximation of the partition function that is useful in our case, so all the results are numerical resummations of probabilities calculated from the partition function. Using this setting, sketched in Fig. 4.11A, we make some considerations.

[Figure 4.10: three panels (A, B, C). Axis labels recovered from the plots: fold repression, total BS, free BS.]

Figure 4.10: Within our model it is possible to obtain information about the pool of NBS competitors and, in particular, an approximate relation for the titration parameters M and Kd. A. Titration approximation (lines) with Pool and exact resummation (points). B and C. Data from [48] fitted with our model with Pool; the fits were obtained with the usual value of the thermodynamic parameter β. The unusual dependence on the Gibbs free energy could therefore be generated by the presence of other specific targets, the Pool.

[Figure 4.11: panel A, cartoon of the "BS world" with specific and unspecific sites; panel B, fold repression vs. mRNA for the STM and the BS model at several values of n, and fold repression vs. eYFP at several values of n, with the peak marked.]

Figure 4.11: A. Cartoon of the model with a pool of competitors for microRNA binding. MicroRNAs can bind specifically both to NBS and to the Pool. The NBS are located on m mRNAs, each with n BSs. B. The simplest titration function does not apply well to our case: for low mRNA amounts the dashed lines (STM) and the continuous lines (BS model) do not overlap. Van Oudenaarden's data [78] show a peak that, as expected, becomes clearer as the number of BSs on the mRNA (n) increases.

Usually, to fit data from these experiments, authors use a model that takes into account neither the other targets nor the size of the Pool relative to the number of transcripts (NBS). The known kinetic titration model [69] fits the data with the common function (eq. 4.2), which does not work for low mRNA amounts, as the dashed lines of Fig. 4.11B show. Indeed, the fold repression behaves unexpectedly: at low target expression it first increases, in contrast with the usual model. This is caused by a combinatorial effect between the two sets of specific BSs, Pool and NBS, and it creates peaks in the fold-repression plots. The peaks become clearer as the number of BSs on the transcript increases. How can the peaks be explained? From the definition of fold repression, Fold = (total mRNA)/(free mRNA), one can imagine that with more BSs the free mRNA stays low while the total mRNA increases; that is, several BSs keep the probability that a transcript is bound high over a wider range than fewer BSs do.

Chapter 5

Dynamics of mobile elements

5.1 Introduction

Since the completion of the first human genome sequence, it has been known that nearly half of our genome originated from transposable elements [64, 56, 26]. The majority are non-long terminal repeat (non-LTR) retrotransposons, such as LINE-1 (L1), Alu and SVA elements (see Fig. 5.1).

Figure 5.1: Genome composition of transposable elements from [26].

L1 elements are the most widespread within the human genome and belong to the LINE group (Long INterspersed Elements). Their full length is ∼5 kb, even though most of them are broken into smaller sequences. An intact L1 consists of two open reading frames, and each ORF possesses an internal RNA polymerase II promoter. In particular, ORF1 encodes an RNA-binding protein and ORF2 a protein with endonuclease and reverse-transcriptase activities. The second most widespread class is the Alu family, which belongs to the SINE group (Short INterspersed Elements). The Alu progenitor derived from the fusion of two SINEs, FLAM and FRAM, both monomers that originated from the loss of a central sequence of 7SL RNA copies in the genome. The two segments are separated by an A-rich linker region, and the typical length is ∼300 bp. Alu elements are transcribed by RNA polymerase III; indeed, the left monomer contains the A and B boxes typical of an RNA polymerase III promoter. The element can end with an A-rich tail up to 100 bp long, perhaps the remnant of the poly-A tail. A peculiar feature of SINEs is that they rely on other mobile elements for transposition, because they do not encode a reverse transcriptase.

As in previous works [97, 82], our analysis focuses on the distances between insertions of a single Alu subfamily. Although L1 is the biggest family, its elements are often broken into multiple pieces, so single insertions are much harder to identify. For this reason we chose to analyze only the Alu family. The data are drawn from RepeatMasker [100]. Given the mechanism of retrotransposition [64, 56, 26] and the number of possible insertion sequences for the Alu family [54], it is natural to consider a random insertion model for each subfamily at the time of its spread. Under this assumption, the distance distribution should be the one generated by a set of random points on a segment. In polymer theory this process is known as stick-breaking, and it was solved analytically by Ziff and McGrady in 1985 (see [125]). It was conceived to obtain the distribution of lengths produced by the random fragmentation of a polymer chain, under the assumption that all bonds in the chain break with equal probability. For the retrotransposon insertion process, this assumption means that the sugar-phosphate backbone breaks with equal probability at every position. Retrotransposition with random insertion sites is therefore equivalent to the fragmentation process, and agreement with the stick-breaking distribution, eq. (5.1), would be evidence of a retrotransposition process acting on the genome. The stick-breaking process gives a distribution of fragment lengths without free parameters, eq. (5.1), because it depends only on B, the number of breaks (in our case the number of Alus belonging to a selected subfamily), and on L, the polymer length (in our case the genome length). This idea was already tested through numerical simulations by Sellis et al. [97], who found that this simple model cannot explain the data. In Fig. 5.2 we compare the distance distribution of each Alu subfamily with the stick-breaking prediction and find no agreement between the stick-breaking and the empirical distributions, as expected [97].

$$ p(l, B, L) = \frac{1}{1+B}\left( e^{-B}\,\delta(l-L) + \left( \frac{2B}{L} + \left(\frac{B}{L}\right)^{2}(L-l) \right) e^{-\frac{B}{L} l} \right) = SB(l, L, B) \tag{5.1} $$
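The stick-breaking distribution of eq. (5.1) can be checked with a short simulation. The sketch below is only illustrative and rests on stated assumptions: breaks are drawn from a Poisson process with mean number B on [0, L] (the random-scission picture of Ziff and McGrady), the closed-form tail `sb_tail` is obtained by integrating eq. (5.1) from x to L, and all function names are hypothetical.

```python
import math
import random

def sb_density(l, B, L):
    """Continuous part of eq. (5.1) for 0 <= l < L (the delta term at l = L is excluded)."""
    b = B / L
    return (2.0 * b + b * b * (L - l)) * math.exp(-b * l) / (1.0 + B)

def sb_tail(x, B, L):
    """Fraction of fragments longer than x (0 <= x < L), obtained by integrating eq. (5.1)."""
    b = B / L
    return (1.0 + B - b * x) * math.exp(-b * x) / (1.0 + B)

def simulate_fragments(B, L, rng):
    """Break [0, L] at the points of a Poisson process with mean B breaks (assumption)."""
    fragments, pos = [], 0.0
    while True:
        gap = rng.expovariate(B / L)
        if pos + gap >= L:
            fragments.append(L - pos)  # last fragment ends at the chain end
            return fragments
        fragments.append(gap)
        pos += gap

def empirical_tail(x, B, L, realizations=2000, seed=0):
    """Pooled fraction of simulated fragments longer than x."""
    rng = random.Random(seed)
    total = longer = 0
    for _ in range(realizations):
        frags = simulate_fragments(B, L, rng)
        total += len(frags)
        longer += sum(1 for f in frags if f > x)
    return longer / total
```

Comparing `empirical_tail(x, B, L)` with `sb_tail(x, B, L)` for a few values of x gives a quick consistency check of the null model before it is applied to real insertion distances.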

As Fig. 5.2 shows, the distributions cannot be explained using the stick-breaking as a null model. Sellis et al. [97] proposed to take into account the evolution of the human genome through successive insertions of new sequences, with a stick-breaking distribution as the initial condition. Their model considers first the possible expansion of segments due to new insertions and then the disappearance of some "breaks" (recognizable Alus) because of sequence mutations over time. Their analysis relied only on simulations; here we solve their model exactly and test it, showing that it fails for a set of realistic genomic parameters. The discrete equation for the expansion model is

$$ \frac{\partial p_l(t)}{\partial t} = \gamma\,\frac{l-\lambda}{L(t)}\,p_{l-\lambda}(t) - \gamma\,\frac{l}{L(t)}\,p_l(t) \tag{5.2} $$

and, with a few manipulations, we obtain the corresponding continuous equation

[Figure 5.2: grid of log-log panels titled "stick-breaking test" (probability vs. distance), one panel per Alu subfamily: AluJb, AluJo, AluJr, AluJr4, AluSc, AluSc5, AluSc8, AluSg, AluSg4, AluSg7, AluSp, AluSq, AluSq10, AluSq2, AluSq4, AluSx, AluSx1, AluSx3, AluSx4, AluSz, AluSz6, AluY, AluYa5, AluYb8, AluYc, AluYe5, AluYf1, AluYj4, AluYk2, AluYk3, AluYk4, AluYm1.]

Figure 5.2: Testing the stick-breaking hypothesis for each Alu subfamily.

[Figure 5.3: log-log plots of probability vs. distance [base] for the expansion and the expansion + degradation simulations. Labels recovered from the plots: B = 50k; Li = 1 Gb, Lf = 2 Gb; Bi = 100k, Bf = 50k.]

Figure 5.3: A. Expansion simulation. B. Expansion-degradation simulation.

$$ \frac{\partial p(l,t)}{\partial t} + \frac{\gamma\lambda}{L(t)}\,\frac{\partial}{\partial l}\big(l\,p(l,t)\big) = 0 \tag{5.3} $$

The traditional approach to solving a first-order partial differential equation is the method of characteristics.

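$$ \begin{cases} \dfrac{dt}{dk} = L(t) \\[4pt] \dfrac{dl}{dk} = \gamma l \\[4pt] \dfrac{dp}{dk} = -\gamma p \end{cases} \;\Rightarrow\; \begin{cases} \displaystyle\int_0^t \frac{ds}{L(s)} = k \\[4pt] l(k) = l(0)\,e^{\gamma k} \\[4pt] p(k) = p(0)\,e^{-\gamma k} \end{cases} \tag{5.4} $$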

Substituting into the stick-breaking distribution gives

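$$ p(l, B, L(0), k) = \frac{1}{1+B}\left( e^{-B}\,\delta\!\left(l e^{-\gamma k} - L(0)\right) + \left( \frac{2B}{L(0)} + \left(\frac{B}{L(0)}\right)^{2}\left(L(0) - l e^{-\gamma k}\right) \right) e^{-\frac{B}{L(0)}\, l e^{-\gamma k}} \right) e^{-\gamma k} \tag{5.5} $$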

and, since the expansion process does not create any new segment, we have the trivial constraint

$$ \int_0^{L(k)} p(l, B, L(0), k)\, dl = \text{const.} \tag{5.6} $$

In our case the integral equals 1 because we chose a normalized p. Consequently, the following relation must hold:

$$ \gamma k = \ln\frac{L(t)}{L(0)} \;\Rightarrow\; \gamma\int_0^t \frac{ds}{L(s)} = \ln L(t) - \ln L(0) \;\Rightarrow\; L(t) = \gamma t + L(0) \tag{5.7} $$

Through this substitution and a few manipulations we obtain

$$ p(l, B, L(t)) = \frac{1}{1+B}\left( e^{-B}\,\delta(l - L(t)) + \left( \frac{2B}{L(t)} + \left(\frac{B}{L(t)}\right)^{2}(L(t) - l) \right) e^{-\frac{B}{L(t)} l} \right) \tag{5.8} $$

Therefore, thanks to the method of characteristics, we found an analytical solution for the expansion process. After the expansion the distribution remains a stick-breaking, but with a different total length L(t) = γλt + L(0).

$$ p(l, t) = SB(l, L(t), B) \tag{5.9} $$

This proves that an expanded stick-breaking remains a stick-breaking. The analytical solution has also been tested with simulations, and the agreement is perfect (see Figure 5.3A). Unfortunately, it is quite hard to solve the model for the retrotransposon degradation process, so our conclusions are obtained from simulations and from a simple discussion of what we expect from the process. As the simulations show, the degradation process is again described by a stick-breaking, but with fewer breaks, B → B′, where B′ is the new number of breaks (see Figure 5.3B). This happens because the random degradation of some breaks of a stick-breaking is the reverse of the process defined by Ziff and McGrady [125]. The fusion of consecutive segments of a stick-breaking is therefore a stick-breaking too, and it is again described by the number of breaks B and the total length L. Our conclusion is that the Expansion+Degradation model always returns a stick-breaking distribution.
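The statement that random degradation of breaks returns a stick-breaking with fewer breaks can be illustrated numerically. The following is a minimal sketch, not the simulation used for Fig. 5.3: it assumes Poisson-distributed breaks with mean B, removes each break independently with probability 1 - keep, and compares the resulting fragment lengths with the stick-breaking tail for B′ = keep · B. The tail formula is the integral of eq. (5.1), and the function names are hypothetical.

```python
import math
import random

def sb_tail(x, B, L):
    """Fraction of fragments longer than x (0 <= x < L), from integrating eq. (5.1)."""
    b = B / L
    return (1.0 + B - b * x) * math.exp(-b * x) / (1.0 + B)

def poisson_breaks(B, L, rng):
    """Positions of breaks on [0, L] from a Poisson process with mean B breaks."""
    breaks, pos = [], rng.expovariate(B / L)
    while pos < L:
        breaks.append(pos)
        pos += rng.expovariate(B / L)
    return breaks

def degraded_tail(x, B, L, keep, realizations=2000, seed=1):
    """Pooled fraction of fragments longer than x after random loss of breaks."""
    rng = random.Random(seed)
    total = longer = 0
    for _ in range(realizations):
        # each break survives degradation independently with probability `keep`
        kept = [p for p in poisson_breaks(B, L, rng) if rng.random() < keep]
        edges = [0.0] + kept + [L]
        gaps = [right - left for left, right in zip(edges, edges[1:])]
        total += len(gaps)
        longer += sum(1 for g in gaps if g > x)
    return longer / total
```

With B = 100 and keep = 0.5, for instance, `degraded_tail(x, 100, L, 0.5)` should follow `sb_tail(x, 50, L)` within sampling noise.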

5.2 Model

A potential explanation for the deviations in Fig. 5.2 is the presence of a local source of Alu duplication in addition to the expansion process. To test this hypothesis, we added to equation (5.3) a source term given by a distribution of duplication distances, q(l). A local source of duplication leaves the largest distances unchanged but increases the Alu density at small distances.

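$$ \frac{\partial p}{\partial t} + \frac{\gamma}{L(t)}\,\frac{\partial}{\partial l}(l\,p) = q(l) \tag{5.10} $$

$$ \begin{cases} L(t)\,\dfrac{\partial p}{\partial t} + \gamma\,\dfrac{\partial}{\partial l}(l\,p) = q(l)\,L(t) \\[4pt] p(l, 0) = SB(l, B, L) \end{cases} \tag{5.11} $$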

Let us take L(t) ∼ L(0) = L; for a realistic choice of parameters this assumption simplifies the calculations without affecting the results (Fig. 5.4). The equation becomes

$$ \partial_t p(l,t) + \gamma\,\partial_l\big(l\,p(l,t)\big) = \mu\,q(l) \tag{5.12} $$

for a source with distribution q(l) and rate µ, where for brevity γ stands for the constant ratio γ/L of eq. (5.11). Expanding the l-derivative gives

$$ \partial_t p + \gamma l\,\partial_l p = -\gamma p + \mu\,q(l) \tag{5.13} $$

and, as for the expansion PDE, the method of characteristics provides a convenient way to obtain the solution.

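$$ \begin{cases} \dfrac{dt}{dk} = 1 \\[4pt] \dfrac{dl}{dk} = \gamma l \\[4pt] \dfrac{dp}{dk} = -\gamma p(k) + \mu q(k) \end{cases} \;\Rightarrow\; \begin{cases} t(k) = k \\[4pt] l(k) = l(0)\exp(\gamma k) \\[4pt] \dfrac{dp}{dk} = -\gamma p(k) + \mu q(k) \end{cases} \tag{5.14} $$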

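The third equation can be solved with the standard integrating-factor argument. In eqs. (5.15)-(5.18), p(x), q(x) and µ(x) denote a generic coefficient, a generic source and the integrating factor, not the quantities defined above.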

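$$ \frac{dy}{dx} + p(x)\,y = q(x) \tag{5.15} $$

$$ \frac{d(\mu y)}{dx} = \mu\,\frac{dy}{dx} + y\,\frac{d\mu}{dx} = \mu\,q(x) \tag{5.16} $$

$$ \frac{dy}{dx} + \frac{1}{\mu}\frac{d\mu}{dx}\,y = q(x) \;\Rightarrow\; \mu = \exp\left(\int^{x} p(x')\,dx' + c\right) \tag{5.17} $$

$$ y = \frac{\int^{x} \mu(x')\,q(x')\,dx' + c}{\mu} \tag{5.18} $$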

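Thus, with an exponential source term, µ q(l) = µ e^{−l/δ}, which will be estimated from the data in the next section, the complete solution becomes

$$ p(l, t) = \frac{ SB\!\left(l e^{-\frac{\gamma}{L}t}, B, L\right) e^{-\frac{\gamma}{L}t} + \dfrac{\mu L \delta}{\gamma l}\left( e^{-\frac{l}{\delta} e^{-\frac{\gamma}{L}t}} - e^{-\frac{l}{\delta}} \right) }{ 1 + \mu\delta t } \tag{5.19} $$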
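The denominator 1 + µδt of eq. (5.19) can be checked numerically: the stick-breaking term carries unit weight, so the source term in the numerator should integrate to µδt. The sketch below assumes the exponential source q(l) = e^{−l/δ} with rate µ and integrates on a logarithmic grid; the parameter values in the usage note and the function names are hypothetical.

```python
import math

def source_term(l, t, mu, delta, gamma, L):
    """Source contribution in the numerator of eq. (5.19)."""
    g = gamma / L
    shrunk = (l / delta) * math.exp(-g * t)
    return mu * L * delta / (gamma * l) * (math.exp(-shrunk) - math.exp(-l / delta))

def integrate_log(f, lo, hi, n=20000):
    """Midpoint rule on a logarithmic grid, suited to integrands spanning many decades."""
    log_lo = math.log(lo)
    h = (math.log(hi) - log_lo) / n
    total = 0.0
    for i in range(n):
        x = math.exp(log_lo + (i + 0.5) * h)
        total += f(x) * x * h  # dl = l d(ln l)
    return total
```

For example, with µ = 1e-3, δ = 2000, γ/L = 0.01 and t = 50, integrating `source_term` from a small fraction of δ to a few hundred δ should return a value close to µδt = 100.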

Figure 5.4 shows simulations of the expansion-duplication model overlaid on the analytical solution.

As the model with source suggests, the drift from the original stick-breaking should depend on the time elapsed since the Alu subfamily spread. We followed two different approaches to evaluate these deviations. In the first, the distance from the original stick-breaking is computed as the difference between the empirical cumulative distribution and the corresponding expected stick-breaking. When comparing several subfamilies, the distributions must be normalized by the average distance of the respective subfamily, because the scale of integration differs. Then, for each subfamily, it is straightforward to compute the absolute value of the integrated difference between the two curves. The normalized areas can be associated with the age of the subfamilies, and these ages can be determined from the average divergence of each subfamily. One can use the average sequence divergence directly (see Figure 5.6) or, if a quantity proportional to time is desired, the Jukes-Cantor model is sufficient for this purpose.
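As an illustration of the last step, the Jukes-Cantor correction converts an observed fraction p of differing sites into an expected number of substitutions per site, d = −(3/4) ln(1 − 4p/3), which grows linearly with time under that model. The sketch below assumes the divergence is given as a fraction (a value quoted in percent must be divided by 100 first); the function name is hypothetical, and converting d into years would need a substitution rate that is not given here.

```python
import math

def jukes_cantor_distance(p):
    """Expected substitutions per site for an observed fraction p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("Jukes-Cantor is defined only for 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

For small divergences the correction is minor, while for older subfamilies it becomes noticeably larger than the raw divergence.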

[Figure 5.4: log-log plots titled "Stick-breaking expansion duplication".]

Figure 5.4: Stick-breaking evolving under the effect of an exponential source; points are simulations and lines are the analytical solutions at different times.

[Figure 5.5: log-log plots of probability vs. distance [base]; panel A, simulation; panel B, data + analytical solution.]

Figure 5.5: Source effect. A. Simulation. B. Overlap between the AluY subfamily and the analytical solution.

5.3 Source estimation

We would like to know how the sources appear and which fraction of duplicated elements we should expect. Working under the hypothesis that the sources are local, and hence do not affect the original stick-breaking distribution beyond a certain distance, we adapted a method developed in [22] to extract the underlying old stick-breaking. In [22] Clauset et al. evaluate the presence of power-law behaviour in the tail of a generic empirical distribution. In the proposed extension we look not for power-law tails but for stick-breaking ones.

The hypothesis that new segments are produced by a source of short distances justifies fitting the tail of the empirical distribution of distances. Hence, the current distributions are the outcome of an initial stick-breaking plus a source:

$$ p_{current}(l) = SB_{frac}\cdot SB(l, B, L) + (1 - SB_{frac})\cdot p_{source}(l) \tag{5.20} $$

The same hypothesis allows us to consider only the occurrences beyond a certain threshold lmin and to fit their distribution with a stick-breaking normalized on the domain l ∈ [lmin, ∞), as in eq. (5.21).

[Figure 5.6: two scatter plots of the distance from the stick-breaking vs. mean divergence (%), with one labelled point per subfamily and a fitted line. Panel A: Alu family (Y, S and J subfamilies), r = 0.74, annotated value 0.038. Panel B: L1 family, r = 0.59, annotated value 0.011.]

Figure 5.6: Distances from the stick-breaking distribution grow with subfamily age; data for the Alu (A) and L1 (B) families. The two different slopes of the fitted lines are probably due to the different affinity for unequal crossing over. As suggested in [8], Alu subfamilies appear to have been affected by a strong duplication effect.

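$$ p(l, l_{min}, B, L) = \frac{ \left( \dfrac{2B}{L} + \left(\dfrac{B}{L}\right)^{2}(L - l) \right) e^{-\frac{B}{L} l} }{ (1+B)\left( e^{-\frac{B}{L} l_{min}} - e^{-B} \right) + B\left( e^{-B} - \dfrac{l_{min}}{L}\, e^{-\frac{B}{L} l_{min}} \right) } \tag{5.21} $$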
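A quick numerical check confirms that eq. (5.21) is a properly normalized density. The sketch below assumes that the support is lmin ≤ l ≤ L, since the continuous part of eq. (5.1) has no fragments longer than L; the function names and the parameter values in the usage note are hypothetical.

```python
import math

def sb_tail_norm(lmin, B, L):
    """Denominator of eq. (5.21): integral of the continuous part of eq. (5.1) over [lmin, L]."""
    b = B / L
    return ((1.0 + B) * (math.exp(-b * lmin) - math.exp(-B))
            + B * (math.exp(-B) - (lmin / L) * math.exp(-b * lmin)))

def sb_tail_pdf(l, lmin, B, L):
    """Stick-breaking density restricted to lmin <= l <= L, eq. (5.21)."""
    if l < lmin or l > L:
        return 0.0
    b = B / L
    return (2.0 * b + b * b * (L - l)) * math.exp(-b * l) / sb_tail_norm(lmin, B, L)

def integrate(f, lo, hi, n=50000):
    """Plain midpoint rule on a uniform grid."""
    h = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * h) for i in range(n)) * h
```

For instance, with B = 20, L = 1e6 and lmin = 5e4, integrating `sb_tail_pdf` between lmin and L should return a value very close to 1.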

Therefore, only the distances belonging to this domain are used in the fits, as done in [22]. In that work the maximum-likelihood approach could be carried out analytically because of the simple form of the power-law function. Unfortunately this is not possible here, so we set up a numerical procedure to maximize the likelihood function.

Maximum likelihood. For reviews of the history and the technical details of the method see, for example, [92, 2]. If we have a sample of independent and identically distributed variables li, the likelihood function L is defined as

$$ \mathcal{L}(\theta \mid l_i) = \prod_{i=1}^{n} p(\theta \mid l_i) \tag{5.22} $$

where θ = (B, L) is the vector of parameters needed to define the distribution. This makes the likelihood a function of the parameters θ. In practice it is often more convenient to work with the logarithm of the likelihood function, called the log-likelihood,

[Figure 5.7: sketch of p(l) vs. l with the threshold lmin marked.]

Figure 5.7: Tail fitting scheme.

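$$ \ln\mathcal{L} = \sum_{i=1}^{n} \ln p(\theta \mid l_i) \tag{5.23} $$

or with the average log-likelihood

$$ \ell = \frac{1}{n}\ln\mathcal{L}. \tag{5.24} $$

We chose the latter, ℓ, so the function to maximize is

$$ \ell = \frac{1}{n}\sum_{i=1}^{n} \ln p(l_i, l_{min}, B, L) \tag{5.25} $$

which, written out explicitly, reads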

$$ \ell(l_i) = -\log\!\left( (1+B)\left( e^{-\frac{B}{L} l_{min}} - e^{-B} \right) + B\left( e^{-B} - \frac{l_{min}}{L}\, e^{-\frac{B}{L} l_{min}} \right) \right) + \frac{1}{n}\sum_i \log\!\left( \frac{2B}{L} + \left(\frac{B}{L}\right)^{2}(L - l_i) \right) - \frac{1}{n}\sum_i \frac{B}{L}\, l_i \tag{5.26} $$

After a first maximization step it becomes clear that the significant parameter for the fit is the density of breaks, B/L. Hence a new variable b = B/L is defined; the total length L becomes irrelevant for the shape of the distribution and only fixes the maximum achievable distance, acting as a cut-off. We maximized with respect to both b and L, but only the values of b turned out to be crucial. The numerical maximization is done in three steps, sketched in Fig. 5.8:

(1) grid maximization: we define a coarse grid wide enough to cover all possible values of the parameters, compute the average log-likelihood at each grid point, and select the point with the maximum value;

(2) grid maximization: around the maximum identified in step (1), a new grid with a smaller step is created, and we select the new maximum on this grid;

(3) steepest ascent: finally, we use the derivatives ∂b l(b, L) and ∂L l(b, L) to move a little closer to the true maximum.
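The three steps above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the procedure actually used for Table 5.1: L is kept fixed and only b is optimized, the grids are logarithmic in b, the ascent is performed on ln b with a central-difference derivative, and the data are synthetic. All names are hypothetical.

```python
import math
import random

def avg_loglik(b, L, data, lmin):
    """Average log-likelihood of eq. (5.26), written in terms of b = B/L."""
    B = b * L
    norm = ((1.0 + B) * (math.exp(-b * lmin) - math.exp(-B))
            + B * (math.exp(-B) - (lmin / L) * math.exp(-b * lmin)))
    s = sum(math.log(2.0 * b + b * b * (L - l)) - b * l for l in data)
    return -math.log(norm) + s / len(data)

def fit_b(data, lmin, L, b_lo=1e-7, b_hi=1e-2, n_grid=40, n_ascent=50, rate=0.5):
    """Two grid searches followed by steepest ascent, all in u = ln b."""
    def f(u):
        return avg_loglik(math.exp(u), L, data, lmin)

    # (1) coarse grid covering the whole admissible range of b
    lo, hi = math.log(b_lo), math.log(b_hi)
    step = (hi - lo) / n_grid
    u = max((lo + i * step for i in range(n_grid + 1)), key=f)
    # (2) finer grid spanning one coarse step on each side of the maximum
    fine = 2.0 * step / n_grid
    u = max((u - step + i * fine for i in range(n_grid + 1)), key=f)
    # (3) steepest ascent with a numerical derivative
    eps = 1e-5
    for _ in range(n_ascent):
        grad = (f(u + eps) - f(u - eps)) / (2.0 * eps)
        u += rate * grad
    return math.exp(u)
```

Working in ln b keeps the ascent step well scaled, because near the maximum the curvature of the average log-likelihood with respect to ln b is of order one. On synthetic exponential tails with L much larger than the data, `fit_b` should recover the generating density of breaks to within sampling error.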

This method, combined with a moving threshold lmin, makes it possible to identify the lmin and the associated pair (b, L) that minimize the Kolmogorov-Smirnov distance, DKS, as in [22]. An example of the method applied to the Alu Y subfamily is shown in Figure 5.8.

[Figure 5.8: panel A, sketch of the search in the (b, L) plane (1-grid, 2-grid, steepest ascent); panel B, SB estimation for the Alu Y subfamily, showing b, L and DKS as functions of lmin, with b = 2.04 x 10^-5, L = 2.35 x 10^7, DKS = 0.0106, SB frac = 0.4585.]

Figure 5.8: A. Maximization procedure through two steps of grid maximization followed by a steepest ascent. B. Example of the maximization procedure combined with the moving threshold for the Alu Y subfamily.
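The threshold scan can be sketched as follows. The code is an illustration under assumptions: the model CDF is obtained by integrating eq. (5.21) over [lmin, x], the fitting routine is passed in as a hypothetical callable `fit(tail, lmin)` returning (B, L), and the function names are not taken from the original analysis.

```python
import math
import random

def sb_tail_cdf(x, lmin, B, L):
    """CDF of the tail-normalized stick-breaking of eq. (5.21) on [lmin, L]."""
    b = B / L

    def upper(y):
        # integral of the continuous part of eq. (5.1) from y to L
        return (1.0 + B - b * y) * math.exp(-b * y) - math.exp(-B)

    return 1.0 - upper(x) / upper(lmin)

def ks_distance(tail, lmin, B, L):
    """Kolmogorov-Smirnov distance between the empirical tail and the model."""
    tail = sorted(tail)
    n = len(tail)
    d = 0.0
    for i, x in enumerate(tail):
        F = sb_tail_cdf(x, lmin, B, L)
        d = max(d, abs(F - i / n), abs(F - (i + 1) / n))
    return d

def scan_lmin(data, candidates, fit):
    """Return (D_KS, lmin) for the threshold with the smallest KS distance."""
    best = None
    for lmin in candidates:
        tail = [l for l in data if l >= lmin]
        B, L = fit(tail, lmin)
        d = ks_distance(tail, lmin, B, L)
        if best is None or d < best[0]:
            best = (d, lmin)
    return best
```

When the data contain an excess of short distances, thresholds below the range of that excess give a large DKS, while thresholds above it leave a tail that the stick-breaking describes well.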

We are now ready to infer the parameters for all the Alu subfamilies; the results are listed in Table 5.1.

| Subfamily | Size | b | DKS | SB fraction | Divergence | Cut-off |
|---|---|---|---|---|---|---|
| AluJb | 120796 | 1.22E-5 | 0.0113992 | 0.150025 | 16.8 | 1000000 |
| AluJo | 47084 | 4.87E-6 | 0.0122988 | 0.172475 | 16.14 | 2000000 |
| AluJr4 | 12538 | 1.92E-6 | 0.0159476 | 0.290061 | 16.97 | 4000000 |
| AluJr | 117142 | 1.17E-5 | 0.011796 | 0.136319 | 16.8 | 1000000 |
| AluSc5 | 4186 | 1.10E-6 | 0.0203795 | 0.697921 | 8.94 | 6000000 |
| AluSc8 | 20910 | 3.79E-6 | 0.00761504 | 0.358724 | 9.6 | 3000000 |
| AluSc | 38495 | 8.57E-6 | 0.00846116 | 0.537885 | 9.9 | 1200000 |
| AluSg4 | 12406 | 2.47E-6 | 0.0174922 | 0.49801 | 11.06 | 4000000 |
| AluSg7 | 5937 | 1.30E-6 | 0.0214966 | 0.523416 | 8.9 | 5000000 |
| AluSg | 44126 | 7.25E-6 | 0.00941408 | 0.342975 | 10.52 | 1200000 |
| AluSp | 66950 | 9.85E-6 | 0.0124053 | 0.316627 | 10.08 | 1100000 |
| AluSq2 | 53148 | 7.29E-6 | 0.0138612 | 0.203711 | 10.61 | 2000000 |
| AluSq | 12578 | 2.02E-6 | 0.0229612 | 0.339543 | 10.29 | 5000000 |
| AluSx1 | 110641 | 1.22E-5 | 0.00897877 | 0.155105 | 11.24 | 1200000 |
| AluSx3 | 39745 | 6.68E-6 | 0.011441 | 0.33726 | 11.8 | 1200000 |
| AluSx4 | 11996 | 2.64E-6 | 0.00781842 | 0.542291 | 10.5 | 3000000 |
| AluSx | 129223 | 1.44E-5 | 0.0104415 | 0.169736 | 11.69 | 900000 |
| AluSz6 | 42011 | 5.84E-6 | 0.0160614 | 0.255997 | 13.47 | 2000000 |
| AluSz | 121797 | 1.32E-5 | 0.0136124 | 0.147549 | 12.42 | 1000000 |
| AluYa5 | 3792 | 1.27E-6 | 0.0225233 | 0.927471 | 1.33 | 6000000 |
| AluYc | 6781 | 1.89E-6 | 0.0108716 | 0.751359 | 7.54 | 4000000 |
| AluY | 101265 | 2.04E-5 | 0.010568 | 0.458571 | 6.97 | 500000 |
| AluYk2 | 6802 | 1.64E-6 | 0.0171428 | 0.603215 | 8.07 | 5000000 |
| AluYk3 | 6108 | 1.63E-6 | 0.0105437 | 0.69835 | 7.72 | 4000000 |
| AluYm1 | 4714 | 1.39E-6 | 0.0105621 | 0.775896 | 7 | 4000000 |

Table 5.1: Estimated parameters by the tail fitting process.

As for their shape, an exponential distribution fits every source almost perfectly (Figure 5.9) and might explain the observed data (Fig. 5.10A). Most of the sources, however, are not completely explained by a single exponential source: they show small deviations at short distances, as if an extra source of small distances were present. To test this effect we used, for the Jr family, the oldest and the one with the greatest duplicated fraction, a double exponential source, and the agreement is quite good (Fig. 5.10B).

$$ p(l, t) = \frac{ SB\!\left(l e^{-\frac{\gamma}{L}t}, B, L\right) e^{-\frac{\gamma}{L}t} + \dfrac{\mu_1 L \delta_1}{\gamma l}\left( e^{-\frac{l}{\delta_1} e^{-\frac{\gamma}{L}t}} - e^{-\frac{l}{\delta_1}} \right) + \dfrac{\mu_2 L \delta_2}{\gamma l}\left( e^{-\frac{l}{\delta_2} e^{-\frac{\gamma}{L}t}} - e^{-\frac{l}{\delta_2}} \right) }{ 1 + \mu_1\delta_1 t + \mu_2\delta_2 t } \tag{5.27} $$

Note that we are not assuming exponential sources: their form is unknown. The youngest subfamilies show a small deviation, with a small peak at short distances, so a single exponential distribution is sufficient; that peak grows in the oldest subfamilies, which leads us to prefer a double exponential distribution to capture their behaviour. Obviously, the youngest subfamilies would also be fitted better by a double exponential. This growing peak may be the consequence of several duplication processes with different rates.

[Figure 5.9: panels titled "Sources" (probability vs. distances), one per subfamily.]

Figure 5.9: Estimated sources for each subfamily of sufficient size.

[Figure 5.10: panel A, Alu Y, and panel B, Alu Jr: log-log plots of probability vs. distance [bp]; insets show the source probability vs. length with exponential and double-exponential fits.]

Figure 5.10: A. Alu Y: superimposition of the model (red) on the data (orange). B. Alu Jr: superimposition of the model (red) on the data (orange).

5.4 Conclusion

Retrotransposons fill about 42% of the human genome and have played a central role in the evolution of mammals since their origin ∼100 Mya, in particular after the last great mass extinction. The transcription and reverse transcription of specific sequences generates many insertions of the same DNA fragment, so that today we observe within the genome the distributions of a great number of segments with common sequences. The stick-breaking process, taken from polymer theory, is able to model the retrotransposition mechanism, and the comprehensive study of the most numerous families of retrotransposons gives a deep insight into the processes affecting the genome at the scale of thousands of bases.

[Figure 5.11: duplicated fraction vs. age [MYA] for the Y, S and J subfamilies; r = 0.81, annotated value 0.016.]

Figure 5.11: Fraction of duplicated Alu for each family as a function of age, estimated through the Jukes-Cantor model.

For each subfamily it is possible to estimate the age from the Jukes-Cantor model and the mean sequence divergence of that subfamily. Using the Kimura model to approximate the ages, we did not observe any significant change in the estimates. Thus, even if Jukes-Cantor is the simplest model of mutation accumulation, it is sufficient to verify the correlation for both the Alu and L1 families (see Fig. 5.11). Thanks to the tail fitting, we can approximate another essential parameter: the fraction of duplicated Alu for each family. This ratio is computed by dividing the fraction of distances greater than the best lmin by the one expected from the estimated stick-breaking. In this way the maximum likelihood also returns the fraction of retrotransposed insertions that come from the duplication process. Finally, the duplicated fraction should increase for older families, and indeed we observe this pattern in Figure 5.11.

This wide study of the most numerous families of retrotransposons gives a deep insight into the processes affecting the genome at the scale of thousands of bases and shows that a simple duplication model describes the distribution of distances between retrotransposon insertions. We can state that local duplication is the main process driving the evolution of the human genome. Although many processes may be responsible for this effect, some evidence points to unequal crossing over. It has been known since 1923 thanks to Sturtevant's work on Drosophila [104, 103], and it operates exactly at the scale we observe, also generating duplications of long segments. Moreover, the authors of [8] suggested that Alu insertions are closely involved in segmental duplications, supporting our hypothesis.

Bibliography

[1] Bruce Alberts, Alexander Johnson, Julian Lewis, Martin Raff, Keith Roberts, and Peter Walter. Molecular Biology of the Cell. Garland Science, 5 edition, 2008.

[2] John Aldrich et al. RA Fisher and the making of maximum likelihood 1912-1922. Statistical Science, 12(3):162–176, 1997.

[3] Rosalind J Allen, Patrick B Warren, and Pieter Rein Ten Wolde. Sampling rare switching events in biochemical networks. Physical Review Letters, 94(1):018104, 2005.

[4] U. Alon. Network motifs: theory and experimental approaches. Nature Rev Genet, 8:450–461, 2007.

[5] Uri Alon. An introduction to systems biology: design principles of biological circuits. CRC press, 2006.

[6] Victor Ambros. microRNAs: tiny regulators with great potential. Cell, 107(7):823–826, 2001.

[7] Aaron Arvey, Erik Larsson, Chris Sander, Christina S Leslie, and Debora S Marks. Target mRNA abundance dilutes microRNA and siRNA activity. Molecular Systems Biology, 6(1), 2010.

[8] Jeffrey A Bailey, Ge Liu, and Evan E Eichler. An Alu Transposition Model for the Origin and Expansion of Human Segmental Duplications. The American Journal of Human Genetics, 73(4):823–834, 2003.

[9] Albert-Laszlo Barabasi and Reka Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.

[10] David P Bartel. MicroRNAs: genomics, biogenesis, mechanism, and function. Cell, 116(2):281–297, 2004.

[11] David P Bartel. MicroRNAs: target recognition and regulatory functions. Cell, 136(2):215–233, 2009.

[12] D.P. Bartel. MicroRNAs: genomics, biogenesis, mechanism, and function. Cell, 116(2):281–297, 2004.

[13] Doron Betel, Anjali Koppal, Phaedra Agius, Chris Sander, and Christina Leslie. Comprehensive modeling of microRNA targets predicts functional non-conserved and non-canonical sites. Genome biology, 11(8):R90, 2010.

89

[14] Doron Betel, Manda Wilson, Aaron Gabow, Debora S. Marks, and ChrisSander. The microRNA.org resource: targets and expression. NucleicAcids Research, 36(SI):D149–D153, JAN 2008.

[15] L. Bintu, N.E. Buchler, H.G. Garcia, U. Gerland, T. Hwa, J. Kondev, andR. Phillips. Transcriptional regulation by the numbers: models. Currentopinion in genetics & development, 15(2):116–124, 2005.

[16] Lacramioara Bintu, Nicolas E Buchler, Hernan G Garcia, Ulrich Gerland,Terence Hwa, Jane Kondev, Thomas Kuhlman, and Rob Phillips. Tran-scriptional regulation by the numbers: applications. Current opinion ingenetics & development, 15(2):125–135, 2005.

[17] Glen M Borchert, William Lanier, and Beverly L Davidson. RNA poly-merase III transcribes human microRNAs. Nature structural & molecularbiology, 13(12):1097–1101, 2006.

[18] C. Bosia, M. Osella, M. El Baroudi, D. Cora, and M. Caselle. Geneautoregulation via intronic microRNAs and its functions. BMC SystemsBiology, 6:131, 2012.

[19] C. Bosia, A. Pagnani, and R. Zecchina. Modelling competing endogenousRNA networks. PLoS ONE, 8(6):e66609, 2013.

[20] Robert C Brewster, Franz M Weinert, Hernan G Garcia, Dan Song, Mat-tias Rydenfelt, and Rob Phillips. The Transcription Factor Titration Ef-fect Dictates Level of Gene Expression. Cell, 156(6):1312–1323, 2014.

[21] Sydney Chapman. On the brownian displacements and thermal diffu-sion of grains suspended in a non-uniform fluid. Proceedings of the RoyalSociety of London. Series A, 119(781):34–54, 1928.

[22] Aaron Clauset, Cosma Rohilla Shalizi, and Mark EJ Newman. Power-lawdistributions in empirical data. SIAM review, 51(4):661–703, 2009.

[23] Kate B Cook, Hilal Kazan, Khalid Zuberi, Quaid Morris, and Timothy RHughes. RBPDB: a database of RNA-binding specificities. Nucleic acidsresearch, 39(suppl 1):D301–D308, 2011.

[24] Peter R Cook. The organization of replication and transcription. Science,284(5421):1790–1795, 1999.

[25] Peter R Cook. A model for all genomes: the role of transcription factories. Journal of molecular biology, 395(1):1–10, 2010.

[26] Richard Cordaux and Mark A Batzer. The impact of retrotransposons on human genome evolution. Nature Reviews Genetics, 10(10):691–703, 2009.

[27] Alexandru Dan Corlan. Medline trend: automated yearly statistics of PubMed results for any query. http://dan.corlan.net/medline-trend.html. Accessed: 2012-02-14.

[28] Francis Crick et al. Central dogma of molecular biology. Nature, 227(5258):561–563, 1970.

[29] Mariama El Baroudi, Davide Cora, Carla Bosia, Matteo Osella, and Michele Caselle. A curated database of miRNA mediated feed-forward loops involving MYC as master regulator. PLoS ONE, 6(3):e14742, 2011.

[30] Carlos El Hader, Sandra Tremblay, Nicolas Solban, Denis Gingras, Richard Beliveau, Sergei N. Orlov, Pavel Hamet, and Johanne Tremblay. HCaRG increases renal cell migration by a TGF-α autocrine loop mechanism. American Journal of Physiology - Renal Physiology, pages 1273–1280, 2005.

[31] J. Elf, J. Paulsson, O.G. Berg, and M. Ehrenberg. Near-critical phenomena in intracellular metabolite pools. Biophysical Journal, 84(1):154–170, 2003.

[32] P Erdos and A Renyi. On random graphs. I. Publ. Math. Debrecen, 6:290–297, 1959.

[33] M. Figliuzzi, A. De Martino, and E. Marinari. MicroRNAs as a selective, post-transcriptional channel of communication between ceRNAs: a steady-state theory. Biophysical Journal, 104(5):1203–1213, 2013.

[34] Paul Flicek, Ikhlak Ahmed, M Ridwan Amode, Daniel Barrell, Kathryn Beal, Simon Brent, Denise Carvalho-Silva, Peter Clapham, Guy Coates, Susan Fairley, et al. Ensembl 2013. Nucleic acids research, 41(D1):D48–D55, 2013.

[35] Paul Flicek, M Ridwan Amode, Daniel Barrell, Kathryn Beal, Konstantinos Billis, Simon Brent, Denise Carvalho-Silva, Peter Clapham, Guy Coates, Stephen Fitzgerald, et al. Ensembl 2014. Nucleic acids research, page gkt1196, 2013.

[36] Nir Friedman, Long Cai, and X Sunney Xie. Linking stochastic dynamics to population distribution: an analytical framework of gene expression. Physical Review Letters, 97(16):168302, 2006.

[37] Robin C Friedman, Kyle Kai-How Farh, Christopher B Burge, and David P Bartel. Most mammalian mRNAs are conserved targets of microRNAs. Genome research, 19(1):92–105, 2009.

[38] Ulrich Gerland, J David Moroz, and Terence Hwa. Physical constraints and functional characteristics of transcription factor–DNA interaction. Proceedings of the National Academy of Sciences, 99(19):12015–12020, 2002.

[39] Mark B Gerstein, Anshul Kundaje, Manoj Hariharan, Stephen G Landt, Koon-Kiu Yan, Chao Cheng, Xinmeng Jasmine Mu, Ekta Khurana, Joel Rozowsky, Roger Alexander, et al. Architecture of the human regulatory network derived from ENCODE data. Nature, 489(7414):91–100, 2012.

[40] Michael A Gibson and Jehoshua Bruck. Efficient exact stochastic simulation of chemical systems with many species and many channels. The journal of physical chemistry A, 104(9):1876–1889, 2000.

[41] Daniel T Gillespie. A general method for numerically simulating the stochastic time evolution of coupled chemical reactions. Journal of computational physics, 22(4):403–434, 1976.

[42] Daniel T Gillespie. Exact stochastic simulation of coupled chemical reactions. The journal of physical chemistry, 81(25):2340–2361, 1977.

[43] D.W. Goodrich. The retinoblastoma tumor-suppressor gene, the exception that proves the rule. Oncogene, 25(38):5233–5243, 2006.

[44] Andrew Grimson, Kyle Kai-How Farh, Wendy K Johnston, Philip Garrett-Engele, Lee P Lim, and David P Bartel. MicroRNA targeting specificity in mammals: determinants beyond seed pairing. Molecular cell, 27(1):91–105, 2007.

[45] Markus Hafner, Markus Landthaler, Lukas Burger, Mohsen Khorshid, Jean Hausser, Philipp Berninger, Andrea Rothballer, Manuel Ascano Jr, Anna-Carina Jungkamp, Mathias Munschauer, et al. Transcriptome-wide identification of RNA-binding protein and microRNA target sites by PAR-CLIP. Cell, 141(1):129–141, 2010.

[46] Markus Hafner, Klaas EA Max, Pradeep Bandaru, Pavel Morozov, Stefanie Gerstberger, Miguel Brown, Henrik Molina, and Thomas Tuschl. Identification of mRNAs bound and regulated by human LIN28 proteins and molecular requirements for RNA recognition. RNA, 19(5):613–626, 2013.

[47] Jeffrey C Hansen. Human mitotic chromosome structure: what happened to the 30-nm fibre? The EMBO journal, 31(7):1621–1623, 2012.

[48] Yue Hao, Zhongge J Zhang, David W Erickson, Min Huang, Yingwu Huang, Junbai Li, Terence Hwa, and Hualin Shi. Quantifying the sequence–function relation in gene silencing by bacterial small RNAs. Proceedings of the National Academy of Sciences, 108(30):12473–12478, 2011.

[49] Sara Hooshangi, Stephan Thiberge, and Ron Weiss. Ultrasensitivity and noise propagation in a synthetic transcriptional cascade. Proceedings of the National Academy of Sciences of the United States of America, 102(10):3581–3586, 2005.

[50] E. Hornstein and N. Shomron. Canalization of development by microRNAs. Nature Genetics, 38:S20–S24, 2006.

[51] Yubo Hou and Senjie Lin. Distinct gene number-genome size relationships for eukaryotes and non-eukaryotes: gene content estimation for dinoflagellate genomes. PLoS One, 4(9):e6978, 2009.

[52] S.D. Hsu, F.M. Lin, W.Y. Wu, C. Liang, W.C. Huang, W.L. Chan, W.T. Tsai, G.Z. Chen, C.J. Lee, C.M. Chiu, et al. miRTarBase: a database curates experimentally validated microRNA–target interactions. Nucleic Acids Research, 39(suppl 1):D163–D169, 2011.

[53] Sheng-Da Hsu, Yu-Ting Tseng, Sirjana Shrestha, Yu-Ling Lin, Anas Khaleel, Chih-Hung Chou, Chao-Fang Chu, Hsi-Yuan Huang, Ching-Min Lin, Shu-Yi Ho, et al. miRTarBase update 2014: an information resource for experimentally validated miRNA-target interactions. Nucleic acids research, 42(D1):D78–D85, 2014.

[54] Jerzy Jurka. Sequence patterns indicate an enzymatic involvement in integration of mammalian retroposons. Proceedings of the National Academy of Sciences, 94(5):1872–1877, 1997.

[55] S. Kalir, S. Mangan, and U. Alon. A coherent feed-forward loop with a sum input function prolongs flagella expression in Escherichia coli. Molecular Systems Biology, 1:2005.0006, 2005.

[56] Haig H Kazazian. Mobile elements: drivers of genome evolution. Science, 303(5664):1626–1632, 2004.

[57] Jerome Kelleher and Barry O’Sullivan. Generating all partitions: a comparison of two encodings. arXiv preprint arXiv:0909.2331, 2009.

[58] Michael Kertesz, Nicola Iovino, Ulrich Unnerstall, Ulrike Gaul, and Eran Segal. The role of site accessibility in microRNA target recognition. Nature genetics, 39(10):1278–1284, 2007.

[59] Mohsen Khorshid, Jean Hausser, Mihaela Zavolan, and Erik van Nimwegen. A biophysical miRNA-mRNA interaction model infers canonical and noncanonical targets. Nature methods, 10(3):253–255, 2013.

[60] Shivendra Kishore, Lukasz Jaskiewicz, Lukas Burger, Jean Hausser, Mohsen Khorshid, and Mihaela Zavolan. A quantitative analysis of CLIP methods for identifying binding sites of RNA-binding proteins. Nature methods, 8(7):559–564, 2011.

[61] A. Kolmogoroff. Über die analytischen Methoden in der Wahrscheinlichkeitsrechnung. Mathematische Annalen, 104:415–458, 1931.

[62] Ana Kozomara and Sam Griffiths-Jones. miRBase: integrating microRNA annotation and deep-sequencing data. Nucleic acids research, 39(suppl 1):D152–D157, 2011.

[63] Ana Kozomara and Sam Griffiths-Jones. miRBase: annotating high confidence microRNAs using deep sequencing data. Nucleic acids research, page gkt1181, 2013.

[64] Eric S Lander, Lauren M Linton, Bruce Birren, Chad Nusbaum, Michael C Zody, Jennifer Baldwin, Keri Devon, Ken Dewar, Michael Doyle, William FitzHugh, et al. Initial sequencing and analysis of the human genome. Nature, 409(6822):860–921, 2001.

[65] Rosalind C Lee and Victor Ambros. An extensive class of small RNAs in Caenorhabditis elegans. Science, 294(5543):862–864, 2001.

[66] Rosalind C Lee, Rhonda L Feinbaum, and Victor Ambros. The C. elegans heterochronic gene lin-4 encodes small RNAs with antisense complementarity to lin-14. Cell, 75(5):843–854, 1993.

[67] Yoontae Lee, Minju Kim, Jinju Han, Kyu-Hyun Yeom, Sanghyuk Lee, Sung Hee Baek, and V Narry Kim. MicroRNA genes are transcribed by RNA polymerase II. The EMBO journal, 23(20):4051–4060, 2004.

[68] Stefan Legewie, Nils Bluthgen, and Hanspeter Herzel. Quantitative analysis of ultrasensitive responses. FEBS Journal, 272(16):4071–4079, 2005.

[69] E. Levine, Z. Zhang, T. Kuhlman, and T. Hwa. Quantitative characteristics of gene regulation by small RNA. PLoS Biology, 5(9):e229, 2007.

[70] Lee P Lim, Nelson C Lau, Philip Garrett-Engele, Andrew Grimson, Janell M Schelter, John Castle, David P Bartel, Peter S Linsley, and Jason M Johnson. Microarray analysis shows that some microRNAs downregulate large numbers of target mRNAs. Nature, 433(7027):769–773, 2005.

[71] S. Mangan, A. Zaslaver, and U. Alon. The coherent feedforward loop serves as a sign-sensitive delay element in transcription networks. J Mol Biol., 334(2):197–204, 2003.

[72] N. J. Martinez and A. J. M. Walhout. The interplay between transcription factors and microRNAs in genome-scale regulatory networks. BioEssays, 31:435–445, 2009.

[73] BJ McCarthy and JJ Holland. Denatured DNA as a direct template for in vitro protein synthesis. Proceedings of the National Academy of Sciences of the United States of America, 54(3):880, 1965.

[74] Stanley Milgram. The small world problem. Psychology Today, 2(1):60–67, 1967.

[75] Ron Milo, Paul Jorgensen, Uri Moran, Griffin Weber, and Michael Springer. Bionumbers - the database of key numbers in molecular and cell biology. Nucleic acids research, 38(suppl 1):D750–D753, 2010.

[76] Ron Milo, Shai Shen-Orr, Shalev Itzkovitz, Nadav Kashtan, Dmitri Chklovskii, and Uri Alon. Network motifs: simple building blocks of complex networks. Science, 298(5594):824–827, 2002.

[77] Namiko Mitarai, Anna M C Andersson, Sandeep Krishna, Szabolcs Semsey, and Kim Sneppen. Efficient degradation and expression prioritization with small RNAs. Physical Biology, 4:164, 2007.

[78] S. Mukherji, M.S. Ebert, G.X.Y. Zheng, J.S. Tsang, P.A. Sharp, and A. van Oudenaarden. MicroRNAs can generate thresholds in target gene expression. Nature Genetics, 43(9):854–859, 2011.

[79] Gavriel Mullokandov, Alessia Baccarini, Albert Ruzo, Anitha D Jayaprakash, Navpreet Tung, Benjamin Israelow, Matthew J Evans, Ravi Sachidanandam, and Brian D Brown. High-throughput assessment of microRNA activity and function using microRNA sensor and decoy libraries. Nature methods, 9(8):840–846, 2012.

[80] J. Noorbakhsh, A.H. Lang, and P. Mehta. Intrinsic Noise of microRNA-Regulated Genes and the ceRNA Hypothesis. PLoS ONE, 8(8):e72676, 2013.

[81] M. Osella, C. Bosia, D. Cora, and M. Caselle. The role of incoherent microRNA-mediated feedforward loops in noise buffering. PLoS Computational Biology, 7(3):e1001101, 2011.

[82] Dimitris Polychronopoulos, Diamantis Sellis, and Yannis Almirantis. Conserved Noncoding Elements Follow Power-Law-Like Distributions in Several Genomes as a Result of Genome Dynamics. PloS one, 9(5):e95437, 2014.

[83] Ilya Prigogine. Thermodynamics of irreversible processes. Thomas, 1955.

[84] A. Re, D. Cora, D. Taverna, and M. Caselle. Genome-wide survey of microRNA-transcription factor feed-forward regulatory circuits in human. Molecular BioSystems, 5:854–867, 2009.

[85] Marc Rehmsmeier, Peter Steffen, Matthias Hochsmann, and Robert Giegerich. Fast and effective prediction of microRNA/target duplexes. RNA, 10(10):1507–1517, 2004.

[86] A. Riba, C. Bosia, L. Ollino, M. El Baroudi, and M. Caselle. Data from: A combination of transcriptional and microRNA regulation improves the stability of the relative concentrations of target genes, 2014.

[87] Frank Ruskey. Combinatorial generation. Working Version (1j-CSC425/520), pages 96–97, 2003.

[88] Jackie Russell and Joost CBM Zomerdijk. RNA-polymerase-I-directed rDNA transcription, life and works. Trends in biochemical sciences, 30(2):87–96, 2005.

[89] Mattias Rydenfelt, Robert Sidney Cox III, Hernan Garcia, and Rob Phillips. Statistical mechanical model of coupled transcription from multiple promoters due to transcription factor titration. Physical Review E, 89(1):012702, 2014.

[90] Leonardo Salmena, Laura Poliseno, Yvonne Tay, Lev Kats, and Pier Paolo Pandolfi. A ceRNA Hypothesis: The Rosetta Stone of a Hidden RNA Language? Cell, 146(3):353–358, 2011.

[91] Leonardo Salmena, Laura Poliseno, Yvonne Tay, Lev Kats, and Pier Paolo Pandolfi. A ceRNA Hypothesis: The Rosetta Stone of a Hidden RNA Language? Cell, 146(3):353–358, 2011.

[92] Leonard J Savage. On rereading RA Fisher. The Annals of Statistics, pages 441–500, 1976.

[93] Nicole T Schirle and Ian J MacRae. The crystal structure of human Argonaute2. Science, 336(6084):1037–1040, 2012.

[94] Nicole T Schirle, Jessica Sheu-Gruttadauria, and Ian J MacRae. Structural basis for microRNA targeting. Science, 346(6209):608–613, 2014.

[95] Daniela Schmitter, Jody Filkowski, Alain Sewer, Ramesh S Pillai, Edward J Oakeley, Mihaela Zavolan, Petr Svoboda, and Witold Filipowicz. Effects of Dicer and Argonaute down-regulation on mRNA levels in human HEK293 cells. Nucleic Acids Research, 34(17):4801–4815, 2006.

[96] E Schrock, S Du Manoir, T Veldman, B Schoell, J Wienberg, MA Ferguson-Smith, Y Ning, DH Ledbetter, I Bar-Am, D Soenksen, et al. Multicolor spectral karyotyping of human chromosomes. Science, 273(5274):494–497, 1996.

[97] Diamantis Sellis, Astero Provata, and Yannis Almirantis. Alu and LINE1 distributions in the human chromosomes: evidence of global genomic organization expressed in the form of power laws. Molecular biology and evolution, 24(11):2385–2399, 2007.

[98] Vahid Shahrezaei and Peter S Swain. Analytical distributions for stochastic gene expression. Proceedings of the National Academy of Sciences, 105(45):17256–17261, 2008.

[99] R. Shalgi, D. Lieber, M. Oren, and Y. Pilpel. Global and local architecture of the mammalian microRNA-transcription factor regulatory network. PLoS Computational Biology, 3(7):e131, 2007.

[100] A.F.A. Smit, R. Hubley, and P. Green. RepeatMasker. http://www.repeatmasker.org/, 2014.

[101] Ji-Joon Song, Stephanie K Smith, Gregory J Hannon, and Leemor Joshua-Tor. Crystal structure of Argonaute and its implications for RISC slicer activity. Science, 305(5689):1434–1437, 2004.

[102] Michael R Speicher and Nigel P Carter. The new cytogenetics: blurring the boundaries with molecular biology. Nature Reviews Genetics, 6(10):782–792, 2005.

[103] Alfred H Sturtevant. The effects of unequal crossing over at the Bar locus in Drosophila. Genetics, 10(2):117, 1925.

[104] Alfred H Sturtevant and Thomas Hunt Morgan. Reverse mutation of the bar gene correlated with crossing over. Science, 57:746–747, 1923.

[105] Pavel Sumazin, Xuerui Yang, Hua-Sheng Chiu, Wei-Jen Chung, Archana Iyer, David Llobet-Navas, Presha Rajbhandari, Mukesh Bansal, Paolo Guarnieri, Jose Silva, and Andrea Califano. An Extensive MicroRNA-Mediated Network of RNA-RNA Interactions Regulates Established Oncogenic Pathways in Glioblastoma. Cell, 147(2):370–381, 2011.

[106] J. Sun, X. Gong, B. Purow, and Z. Zhao. Uncovering MicroRNA and Transcription Factor Mediated Regulatory Networks in Glioblastoma. PLoS Comput Biol, 8(7):e1002488, 2012.

[107] Jack W Szostak and Ray Wu. Unequal crossing over in the ribosomal DNA of Saccharomyces cerevisiae. Nature, 284(5755):426–430, 1980.

[108] Yuichi Taniguchi, Paul J Choi, Gene-Wei Li, Huiyi Chen, Mohan Babu, Jeremy Hearn, Andrew Emili, and X Sunney Xie. Quantifying E. coli proteome and transcriptome with single-molecule sensitivity in single cells. Science, 329(5991):533–538, 2010.

[109] Robert E Thurman, Eric Rynes, Richard Humbert, Jeff Vierstra, Matthew T Maurano, Eric Haugen, Nathan C Sheffield, Andrew B Stergachis, Hao Wang, Benjamin Vernot, et al. The accessible chromatin landscape of the human genome. Nature, 489(7414):75–82, 2012.

[110] Hans-Ingo Trompeter, Hassane Abbad, Katharina M Iwaniuk, Markus Hafner, Neil Renwick, Thomas Tuschl, Jessica Schira, Hans Werner Muller, and Peter Wernet. MicroRNAs MiR-17, MiR-20a, and MiR-106b Act in Concert to Modulate E2F Activity on Cell Cycle Arrest during Neuronal Lineage Differentiation of USSC. PLoS ONE, 6(1):e16138, 2011.

[111] J. Tsang, J. Zhu, and A. van Oudenaarden. MicroRNA-mediated feedback and feedforward loops are recurrent network motifs in mammals. Mol Cell, 26:753–767, 2007.

[112] Taketoshi Uzawa, Akihiko Yamagishi, and Tairo Oshima. Polypeptide synthesis directed by DNA as a messenger in cell-free polypeptide synthesis by extreme thermophiles, Thermus thermophilus HB27 and Sulfolobus tokodaii strain 7. Journal of biochemistry, 131(6):849–853, 2002.

[113] N. G. Van Kampen. Stochastic processes in physics and chemistry. Elsevier Science & Technology Books, 3rd edition, 2007.

[114] Nicolaas Godfried Van Kampen. Stochastic processes in physics and chemistry, volume 1. Elsevier, 2007.

[115] Bas van Steensel. Chromatin: constructing the big picture. The EMBO journal, 30(10):1885–1895, 2011.

[116] Jie Wang, Jiali Zhuang, Sowmya Iyer, XinYing Lin, Troy W Whitfield, Melissa C Greven, Brian G Pierce, Xianjun Dong, Anshul Kundaje, Yong Cheng, et al. Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors. Genome research, 22(9):1798–1812, 2012.

[117] Duncan J Watts and Steven H Strogatz. Collective dynamics of ‘small-world’ networks. Nature, 393(6684):440–442, 1998.

[118] Liang Meng Wee, C Fabian Flores-Jasso, William E Salomon, and Phillip D Zamore. Argonaute divides its RNA guide into domains with distinct functions and RNA-binding properties. Cell, 151(5):1055–1067, 2012.

[119] Liang Meng Wee, C Fabian Flores-Jasso, William E Salomon, and Phillip D Zamore. Argonaute divides its RNA guide into domains with distinct functions and RNA-binding properties. Cell, 151(5):1055–1067, 2012.

[120] Robert J White. Transcription by RNA polymerase III: more complex than we thought. Nature Reviews Genetics, 12(7):459–463, 2011.

[121] Zeba Wunderlich and Leonid A Mirny. Different gene regulation strategies revealed by analysis of binding motifs. Trends in genetics, 25(10):434–440, 2009.

[122] Tianbing Xia, John SantaLucia, Mark E Burkard, Ryszard Kierzek, Susan J Schroeder, Xiaoqi Jiao, Christopher Cox, and Douglas H Turner. Thermodynamic parameters for an expanded nearest-neighbor model for formation of RNA duplexes with Watson-Crick base pairs. Biochemistry, 37(42):14719–14735, 1998.

[123] X. Yu, J. Lin, D. J. Zack, J. T. Mendell, and J. Qian. Analysis of regulatory network topology reveals functionally distinct classes of microRNAs. Nucleic Acids Research, 36:6494–6503, 2008.

[124] Qiangfeng Cliff Zhang, Donald Petrey, Jose Ignacio Garzon, Lei Deng, and Barry Honig. PrePPI: a structure-informed database of protein–protein interactions. Nucleic acids research, page gks1231, 2012.

[125] Robert M Ziff and ED McGrady. The kinetics of cluster fragmentation and depolymerisation. Journal of Physics A: Mathematical and General, 18(15):3027, 1985.
