
Short Title: Repli-Seq of the Arabidopsis DNA replication program

Corresponding Author: Linda Hanley-Bowdoin, [email protected]

Genome-Wide Analysis of the Arabidopsis thaliana Replication Timing Program (1)

Lorenzo Concia (a,2), Ashley M. Brooks (a), Emily Wheeler (a), Gregory J. Zynda (b), Emily E. Wear (a), Chantal LeBlanc (c,3), Jawon Song (b), Tae-Jin Lee (a,4), Pete E. Pascuzzi (d), Robert A. Martienssen (c), Matthew W. Vaughn (b), William F. Thompson (a) and Linda Hanley-Bowdoin (a,5)

(a) Department of Plant and Microbial Biology, North Carolina State University, Raleigh, NC 27695

(b) Texas Advanced Computing Center, University of Texas at Austin, Austin, TX 78758

(c) Howard Hughes Medical Institute, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724

(d) Purdue University Libraries, Purdue University, West Lafayette, IN 47907

Summary Sentence: The Arabidopsis thaliana genome replicates in two non-interacting compartments during early/mid and late S phase.

Author Contributions: Experiments were conceived by LC, RAM, MWV, WFT, and LH-B. Experiments were performed by LC, AMB, EW, EEW, CL and T-JL. Repli-Seq data were analyzed by LC, PP, GJZ, JS, MWV, WFT and LH-B. LC, WFT and LH-B wrote the manuscript with contributions from all authors. All authors read and approved the final manuscript.

(1) Funding Information: This work was supported by a grant (IOS-1025830) from the Plant Genome Research Program of the National Science Foundation to LH-B, WFT, RAM and MWV.

Current addresses: (2) Institute of Plant Sciences Paris-Saclay, Bâtiment 630, Rue Noetzlin, 91190 Gif-sur-Yvette, France; (3) Department of Molecular, Cellular & Developmental Biology, Yale University, New Haven, CT 06511; (4) Syngenta Crop Protection, LLC, Research Triangle Park, NC, 27709

(5) Address correspondence to [email protected]

Plant Physiology Preview. Published on January 4, 2018, as DOI:10.1104/pp.17.01537

Copyright 2018 by the American Society of Plant Biologists

www.plantphysiol.orgon June 29, 2018 - Published by Downloaded from Copyright 2018 American Society of Plant Biologists. All rights reserved.

http://www.plantphysiol.org

ABSTRACT

Eukaryotes use a temporally regulated process, known as the replication timing program, to ensure that their genomes are fully and accurately duplicated during S phase. Replication timing programs are predictive of genomic features and activity, and considered to be functional readouts of chromatin organization. Although replication timing programs have been described for yeast and animal systems, much less is known about the temporal regulation of plant DNA replication or its relationship to genome sequence and chromatin structure. We used the thymidine analog, 5-ethynyl-2'-deoxyuridine, in combination with flow sorting and Repli-Seq to describe, at high-resolution, the genome-wide replication timing program for Arabidopsis thaliana Col-0 suspension cells. We identified genomic regions that replicate predominantly during early, mid and late S phase, and correlated these regions with genomic features and with data for chromatin state, accessibility and long-distance interaction. Arabidopsis chromosome arms tend to replicate early while pericentromeric regions replicate late. Early and mid-replicating regions are gene-rich and predominantly euchromatic, while late regions are rich in transposable elements and primarily heterochromatic. However, the distribution of chromatin states across the different times is complex, with each replication time corresponding to a mixture of states. Early and mid-replicating sequences interact with each other and not with late sequences, but early regions are more accessible than mid regions. The replication timing program in Arabidopsis reflects a bipartite genomic organization with early/mid replicating regions and late regions forming separate, non-interacting compartments. The temporal order of DNA replication within the early/mid compartment may be modulated largely by chromatin accessibility.



Concia et al.


INTRODUCTION

In each cell cycle, a cell must produce two identical copies of its genome during S phase. Most of our knowledge about genome replication in higher eukaryotes comes from studies in animals. These studies have indicated that replication is a temporally ordered process (Gilbert, 2010) that occurs in large domains of coordinate replication (replication domains) with multiple origins firing in concert during S phase (MacAlpine et al., 2004; Desprat et al., 2009; Schwaiger et al., 2009; Farkash-Amar and Simon, 2010). The replication timing programs of several metazoan genomes have been characterized (Schübeler et al., 2002; Woodfine et al., 2004; Hiratani et al., 2008; Schwaiger et al., 2009; Hansen et al., 2010). These studies revealed that early replicating chromatin is rich in genes, transcriptionally active, and contains euchromatic histone modifications (Schübeler et al., 2002; Woodfine et al., 2004; Hiratani and Gilbert, 2009; Hansen et al., 2010; Eaton et al., 2011; Lubelsky et al., 2014). Conversely, late replicating chromatin is enriched for heterochromatin and repetitive elements (Gilbert, 2002; Woodfine et al., 2004). Early and late replication domains correlate strongly with the open and closed compartments identified by chromatin conformation capture experiments (Ryba et al., 2010; Yaffe et al., 2010; Pope et al., 2014). These compartments, which are megabases in size, differ widely with respect to nuclease accessibility, gene density, transcriptional activity and epigenetic marks (Lieberman-Aiden et al., 2009; Sexton et al., 2012). Hence, metazoan replication timing programs are predictive of important genomic features and can be considered functional readouts of chromatin organization (Rivera-Mulia et al., 2015).

Much less is known about how DNA replication occurs temporally and spatially across plant genomes. Although the DNA replication machinery and many aspects of chromatin biology are conserved between plants and animals, there are significant differences, such as the absence in plants of lamins and geminin (Shultz et al., 2007; Thorpe and Charpentier, 2017), which play key roles in chromatin organization and origin function in metazoans. In addition, fundamental processes such as transcriptional regulation have been shown to differ between plants and animals (Meyerowitz, 2002; Hetzel et al., 2016). There is also evidence that the spatiotemporal distribution of replicating DNA is different in plant nuclei than in metazoan cells (Bass et al., 2015). Hence, we cannot assume that DNA replication programs in plants mirror those in animals (Savadel and Bass, 2017).

Arabidopsis thaliana is an important plant model system because of its small genome, which has been fully sequenced and is well annotated, and the broad range of genomic resources (Arabidopsis Genome Initiative, 2000; Provart et al., 2016). There are genome-wide data available for Arabidopsis chromatin accessibility, histone modifications and chromatin interactions. Because of these resources, Arabidopsis is an ideal system for examining DNA replication programs in plants.

Our group previously published a description of the replication timing program for Arabidopsis chromosome 4 (Lee et al., 2010). In that study, Arabidopsis suspension cells were pulse-labeled with 5-bromo-2'-deoxyuridine (BrdU) for 1 h followed by nuclei separation based on DNA content using flow cytometry. Replication was examined in three nuclei populations corresponding to early, mid and late S phase, using a 1-kb tiling microarray platform. While both the spatial resolution and labeling pulse length were comparable to similar studies with metazoans (Schübeler et al., 2002; Hiratani et al., 2008), no major differences were observed between the early and mid S-phase replication profiles for Arabidopsis. This finding led us to conclude that, in contrast to animals, the order of origin activation in Arabidopsis in early and mid S phase is stochastic and replication of euchromatin does not follow a strict temporal pattern. Unlike the Arabidopsis chromosome 4 replication timing profiles, we recently observed differences between the early and mid S-phase profiles during replication of the maize genome (Wear et al., 2017). To address these conflicting results, we reexamined the Arabidopsis replication program, focusing more closely on sequences replicating in early and mid S phase. In the process, we adapted our flow cytometry strategy and the Repli-seq methodology to better distinguish between early and mid S replication. We generated a high-resolution replication timing map for the entire Arabidopsis genome, and correlated the replication program with chromatin state, accessibility and interaction data.




RESULTS

Improving Resolution of the Replication Timing Protocol

We examined several factors that might improve our ability to resolve differences in replication timing. These included the analysis platform used to detect newly synthesized DNA, the thymidine analog used to pulse label nascent DNA, the length of the labeling period, and the flow cytometry strategy for separating nuclei in different stages of S phase.

Initially, we sought to improve our ability to distinguish sequences replicating in early versus mid S phase by using a more advanced NimbleGen microarray platform with shorter, more closely spaced probes to better resolve replicating DNA sequences. In this experiment, we used the same protocol as our previous study, with the exception of the array platform. The replication timing profiles generated using the NimbleGen arrays show more fine structure than those obtained from the tiling arrays (Supplemental Fig. S1). However, the overall replication profiles are very similar for the two array platforms with early and mid S-phase signals showing very high correlations on both platforms (Supplemental Fig. S2). Thus, we concluded that probe resolution was not a major factor in our ability to distinguish early and mid S-phase replication.

We then focused on reducing the labeling time and obtaining better separation of early and mid-replicating nuclei (Fig. 1, A and B) (Bass et al., 2014; Wear et al., 2016). Arabidopsis cultured cells were pulse labeled with the thymidine analog, 5-ethynyl-2'-deoxyuridine (EdU), for 10 minutes. After formaldehyde fixation and nuclei isolation, the incorporated EdU was conjugated with Alexa Fluor 488 (AF488) azide using Click chemistry (Salic and Mitchison, 2008). Nuclei were then stained with DAPI and fractionated by flow cytometry using a two-color sort strategy based on EdU incorporation (AF488) and DNA content (DAPI). EdU-labeled nuclei were fractionated into early, mid and late S-phase populations (Fig. 1B, upper panel), while non-replicating G1 and G2 nuclei were excluded based on the absence of EdU. The S-phase gates were assigned by dividing the EdU arc into 5 equal sections based on DNA content, with the first, third and fifth sections defined as early, mid and late S phase. This resulted in narrower, better separated sorting gates than in our previous experiments, reducing the range of total DNA content within each S-phase fraction and minimizing cross contamination between fractions.




Reanalysis of a sample from each fraction by flow cytometry showed minimal overlap (

distribution. Importantly, the read distributions of the mid S-phase samples were clearly different from the early samples (r2 = -0.02 - 0.11).

Replication profiles were created using the Repliscan pipeline (Zynda et al., 2017). Read counts were averaged over non-overlapping 1-kb bins, and the total number of reads per sample (sequencing depth) was normalized to 1X genome coverage using the RPGC method (Ramírez et al., 2016). Given their high correlations, the biological replicates were combined and sequencing depth was normalized again prior to further analysis. To account for local variation in sequenceability, the normalized read densities were divided by the corresponding densities in the non-replicating G1 reference DNA (Supplemental Fig. S6). Additional low-amplitude variations were removed using Haar transform wavelets level-3 (Percival and Walden, 2000) to produce smoothed, normalized read density profiles for early, mid and late S phase.
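The normalization steps described above can be sketched roughly as follows. This is an illustrative simplification and not the Repliscan implementation: the function names are hypothetical, the 1X scaling is approximated by dividing by the mean bin value, and the level-3 Haar smoothing is approximated by keeping only the level-3 approximation, which for the Haar wavelet equals the mean over blocks of 2^3 = 8 adjacent bins.

```python
import numpy as np

def normalize_to_1x(bin_counts):
    """Scale per-bin read counts so the mean bin value is 1 (a stand-in for RPGC)."""
    counts = np.asarray(bin_counts, dtype=float)
    return counts / counts.mean()

def ratio_to_g1(s_phase, g1, min_g1=1e-9):
    """Divide S-phase signal by the G1 reference; bins without G1 signal become NaN."""
    s = np.asarray(s_phase, dtype=float)
    g = np.asarray(g1, dtype=float)
    out = np.full_like(s, np.nan)
    ok = g > min_g1
    out[ok] = s[ok] / g[ok]
    return out

def haar_smooth(signal, level=3):
    """Keep only the Haar approximation at `level`: block means over 2**level bins."""
    x = np.asarray(signal, dtype=float)
    block = 2 ** level
    n_full = (len(x) // block) * block
    smoothed = x.copy()
    if n_full:
        means = x[:n_full].reshape(-1, block).mean(axis=1)
        smoothed[:n_full] = np.repeat(means, block)
    if n_full < len(x):
        # Assumption: a trailing partial block is replaced by its own mean.
        smoothed[n_full:] = x[n_full:].mean()
    return smoothed
```

A profile for one S-phase fraction would then be obtained as `haar_smooth(ratio_to_g1(normalize_to_1x(early), normalize_to_1x(g1)))`, where `early` and `g1` are hypothetical arrays of per-bin read counts.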

We chose not to represent the data as a "log ratio," as is often done in replication timing studies (Lee et al., 2010; Ryba et al., 2011; Pope et al., 2012), because low intensity replication activity transformed to a log ratio would have resulted in a negative number. This creates problems for downstream analyses both computationally and conceptually. Moreover, the ability of log ratio plots to compress extreme values is not necessary here because Repli-Seq profiles cover a limited range of values.

Distribution of Replication Activity within Chromosomes

As illustrated for Arabidopsis chromosome 1 (Fig. 2B, upper panel; Supplemental Fig. S7), visualization of the replication activity at the whole chromosome scale shows a temporal pattern along the chromosome. Early replication intensity is stronger in the distal arms and decreases progressively toward the centromeres. Conversely, late replication is concentrated near the centromere. In contrast, replication during mid S phase is more evenly distributed. The trend is consistent across all five chromosomes, although the early replication signal is less intense in the short arms of the acrocentric chromosomes 2 and 4 (Supplemental Fig. S7).

Visualization on a smaller scale confirmed that the distal arms replicate mainly in early S and centromeric regions replicate in late S. It also revealed that the proximal arms tend to replicate predominantly in mid S phase, further supporting the trend described above (Fig. 2B, lower panels). We quantified the fraction of replication at each time as a function of the distance from the centromere for all ten Arabidopsis chromosome arms. Because the chromosome arms vary in length, each arm was partitioned into 10 equal size bins and the fraction of total replication in each bin was determined at each time (Fig. 2D). When the results were plotted as a function of relative distance from the centromere, it was clear that early replication increases as the distance from the centromere increases (Fig. 2D, left panel). In contrast, nearly half of late replication occurs in the three bins closest to the centromere (Fig. 2D, right panel). Mid S replication is more uniformly distributed and clearly different from early replication (Fig. 2D, middle panel).
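The arm-wise quantification can be illustrated with a minimal sketch. It assumes that the replication signal of one arm at one time point is available as an array of 1-kb bin values ordered from the centromere to the telomere; the function name is hypothetical and the code is not taken from the authors' analysis.

```python
import numpy as np

def fraction_by_relative_distance(signal, n_bins=10):
    """Fraction of an arm's total replication signal in each of n_bins partitions.

    `signal` holds per-bin values ordered centromere -> telomere, so the
    first output entry is the partition closest to the centromere.
    """
    x = np.asarray(signal, dtype=float)
    parts = np.array_split(x, n_bins)  # near-equal partitions of the arm
    totals = np.array([p.sum() for p in parts])
    return totals / totals.sum()
```

Because each arm is rescaled to the same ten relative positions, arms of different lengths can be averaged or overlaid directly.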

Early and mid S phase also have distinct features when examined on a fine scale. The differences were especially evident in regions where replication intensities were similar for both time points. Overlaying early and mid-replication profiles in those regions often produced a pattern of alternate early and mid local maxima (Fig. 2C, alternating blue and green line in the top panel), suggestive of replication activity spreading over time from early replicating regions to surrounding mid replicating sequences.

Segmentation Analysis

To facilitate more detailed analysis, we partitioned the genome into segments with similar replication times using the Repliscan pipeline (Zynda et al., 2017). This method allows for the possibility that replication of a given locus occurs in more than one time window. Our data showed that no sequence replicated exclusively in a single time window (Fig. 2B; Supplemental Fig. S4). Hence, for a given sequence, we will refer to the "prevalent" time of replication, in which the replication signal is stronger than at the other times. The Repliscan pipeline uses a two-step process to assign a prevalent replication time (RT) to a 1-kb bin based on its replication intensity in early, mid and late S phase. First, 1-kb bins were classified either as replicating or non-replicating based on a threshold established for each chromosome, and only bins with replication intensity above the threshold were used for segmentation analysis. Second, replication signals for each 1-kb bin were divided by the maximum value for that bin, scaling the largest value to 1 and all others between 0 and 1. The bin was then labeled as replicating predominantly at the time with a normalized signal above 0.5. If the bin contained one or more signals within 50% of the highest signal, they were included in the classification. Adjacent bins with the same RT were merged into larger segments. With this approach, we identified segments replicating predominantly in early S phase (E), in both early and mid S phase (EM), only in mid S phase (M), in both mid and late S phase (ML), only in late S phase (L), in early and late S phase (EL), and at all the three times (EML) (Fig. 3, A and B). Regions with replication signal below the threshold in all time points were not classified or included in our statistical analyses.
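The two-step classification described above can be sketched as follows. This is an illustration under stated assumptions rather than the Repliscan code: the function names are hypothetical, and a bin is treated as replicating when its largest signal exceeds the chromosome threshold, which may differ from how the pipeline applies that threshold.

```python
from itertools import groupby

TIMES = ("E", "M", "L")

def classify_bin(early, mid, late, threshold):
    """Return the RT class of one 1-kb bin, or None if it is non-replicating."""
    signals = (early, mid, late)
    peak = max(signals)
    if peak <= threshold:  # assumption: threshold is applied to the strongest signal
        return None
    # Scale by the bin maximum; keep every time whose scaled signal is above 0.5.
    return "".join(t for t, s in zip(TIMES, signals) if s / peak > 0.5)

def merge_segments(labels, bin_size=1000):
    """Merge adjacent bins with the same RT class into (start, end, label) segments."""
    segments, start = [], 0
    for label, run in groupby(labels):
        length = len(list(run)) * bin_size
        if label is not None:  # unclassified bins are skipped but still advance position
            segments.append((start, start + length, label))
        start += length
    return segments
```

For example, a bin with signals (1.0, 0.6, 0.1) is labeled EM, because the mid signal is within 50% of the early maximum while the late signal is not.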

The cumulative genomic coverage of each RT class is shown in Fig. 3C. A single prevalent time of replication was identified for more than half of the genome (31% E + 20% M + 7% L = 58%), while most of the rest of the genome was evenly split between EM (21%) and ML (20%). The EL and EML segment classes together constituted about 1% of the genome, and 2.5% of the genome could not be classified. Given the clear separation of the sorting gates used to generate the early, mid and late populations (Fig. 1B, upper panel), it is noteworthy that 41% of the Arabidopsis genome replicates in the intermediate EM and ML classes. The timing heterogeneity may reflect the presence of subpopulations of cells with related but distinct replication programs and/or allelic heterogeneity that may have arisen during prolonged cell culture (Wang and Wang, 2012). The low coverage of L segments relative to E and M is also noteworthy because the width (range of DNA content) of the three sorting gates was equivalent (Fig. 1B, upper panel).

The distribution of replication timing segments is similar for the five Arabidopsis chromosomes with the exception of the short arms of chromosomes 2 and 4, which have very few early segments (Fig. 3B). The distal portions of longer chromosome arms are covered with large E segments (>50-100 kb) interspersed with small EM and M segments (

detected several EL and EML segments, but due to their small size and low frequency, we did not include them in subsequent analyses (Fig. 3D).

Replication Time and Genomic Features

To explore the relationship between replication timing and major genomic features, we queried the Repli-Seq data using Araport11 genome annotations (Cheng et al., 2017). A visual comparison of the RT segmentation data with genes and transposable elements (Fig. 4A; Supplemental Fig. S9) showed that the gene-rich chromosome arms replicate in E, EM and M while the TE-rich pericentromeric region replicates in ML and L, as described above and reported previously (Lee et al., 2010). To obtain a more detailed picture, we computed the cumulative overlaps of genes, pseudogenes, TEs and unannotated sequences with RT classes for the entire Arabidopsis genome. The overlaps were expressed as a percent of total genomic coverage for a given feature to adjust for abundance differences (Fig. 4B). This analysis gave results similar to the visual inspection of chromosome 1, with segment coverage of genes highest in E and EM and TEs highest in ML and L.

To assess if the distributions of the genomic features across the RT classes are statistically different from the distribution across the whole genome, we built a contingency table with the absolute overlaps expressed as the number of 1-kb bins (Table 2) and applied a chi-square test for homogeneity. Differences in the overlaps showed high statistical significance (p-value < 2.2E-16, χ2 = 25,561, df=12). However, when analyzing a large population (N=116,063), small differences between observed and expected values almost always generate a statistically significant p-value (Sullivan and Feinn, 2012). For this reason, we estimated the "effect size" of the test, defined as the "magnitude of association between categorical variables" (Kotrlik et al., 2011), by calculating Cramer's V statistic. The Cramer's V value for our data was 0.27, within the 0.2 - 0.4 range for a "moderate association" (Rea and Parker, 2005), indicating a nonrandom distribution of genomic features in the RT classes.
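As a check on the reported effect size, Cramer's V can be recomputed from the quantities given in the text using its standard definition, V = sqrt(chi2 / (N * min(r - 1, c - 1))). The sketch below assumes a contingency table of 4 genomic features by 5 RT classes; these dimensions are not stated explicitly here and are inferred from df = 12 = (4 - 1)(5 - 1) and from the features and classes named in the analysis. The function name is hypothetical.

```python
import math

def cramers_v(chi2, n, n_rows, n_cols):
    """Cramer's V effect size for an n_rows x n_cols contingency table."""
    return math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))

# Values reported in the text; the 4 x 5 table shape is an assumption.
v = cramers_v(chi2=25561, n=116063, n_rows=4, n_cols=5)
print(round(v, 2))  # prints 0.27
```

Unlike the p-value, this statistic does not grow with sample size, which is why it is the more informative number for a table built from more than 100,000 bins.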

Next, we identified which genomic features and replication timing segments overlapped either more or less than expected by examining the sign and value of the chi-square adjusted residuals (Agresti, 2007). We split the adjusted residuals into tertiles and classified the relevant combinations as overrepresented (highest tertile), underrepresented (lowest tertile), or similar to expected (central tertile). The arrows and dots in Fig. 4B indicate the assigned category.
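A minimal sketch of this residual analysis is given below, using the standard definition of adjusted residuals for a contingency table, (O - E) / sqrt(E (1 - row share)(1 - column share)). The function names and the tertile cut via `np.quantile` are illustrative assumptions; the authors' exact procedure for assigning tertiles is not described beyond the text above.

```python
import numpy as np

def adjusted_residuals(table):
    """Chi-square adjusted (standardized) residuals for a contingency table."""
    obs = np.asarray(table, dtype=float)
    n = obs.sum()
    row = obs.sum(axis=1, keepdims=True)
    col = obs.sum(axis=0, keepdims=True)
    expected = row * col / n
    return (obs - expected) / np.sqrt(expected * (1 - row / n) * (1 - col / n))

def tertile_labels(residuals):
    """Label each cell 'over', 'under' or 'expected' by residual tertile."""
    lo, hi = np.quantile(residuals, [1 / 3, 2 / 3])
    labels = np.full(residuals.shape, "expected", dtype=object)
    labels[residuals <= lo] = "under"
    labels[residuals >= hi] = "over"
    return labels
```

Positive residuals mark feature and RT class combinations that overlap more than expected under independence, and negative residuals mark those that overlap less.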

The statistical analysis confirmed that genes are over-represented in E, EM and M and under-represented in ML and L segments (Fig. 4B). Pseudogenes are enriched in ML segments. This may be due in part to the association of "processed pseudogenes," the products of retrotransposition events (Zheng et al., 2007), with TE-rich pericentromeric regions that replicate in ML (Fig. 4A). Unannotated regions overlap more with E and less with M and ML segments relative to the total genome. The enrichment of unannotated regions in E segments may reflect the fact that the distances between genes in the distal arms are generally much longer than the spaces between TEs or between genes and TEs in the pericentromeres (Fig. 4A). It is worth noting that depletion of unannotated regions in L segments is not statistically significant and, instead, is most likely due to poor annotation of the centromeric regions.

We then determined the number of protein coding genes, pseudogenes and TE genes in each RT class (Fig. 4C). To control for differences between RT segment coverage, the counts were normalized over the genomic coverage for each RT class and expressed as the number of elements per Mb. The densities of protein coding genes in E (287/Mb), EM (308/Mb) and M (291/Mb) are very similar, then drop in ML (173/Mb) and L (58/Mb) segments. Conversely, TE genes are very sparse in E (4/Mb), EM (10/Mb) and M (18/Mb) but densely packed in ML (79/Mb) and L (155/Mb) segments. The density of pseudogenes across the RT classes is low due to their low number in the Arabidopsis genome.

We also computed the fraction of each segment covered by genes, TEs and unannotated sequences and generated boxplots showing the range of coverage within each RT class (Fig. 4D). Consistent with the other analyses, E, EM and M segments are gene-rich (left panel) and depleted in TEs (center panel), while ML and L segments are TE-rich and have lower gene content. The unannotated region content of different RT classes is more uniform (right panel), with slightly higher content in E and EM.

Together, our results indicated that the genomic features associated with the M segments are more similar to those in E and EM segments than in ML and L segments. This was true even though the M segments replicate at a distinct stage of S phase and are more likely to be located in the proximal regions of the chromosome arms, while the E and EM segments are predominantly located in the distal regions.

The above analyses only used sequence tags that mapped uniquely to the Arabidopsis genome and, as such, did not address replication timing of repetitive sequences. To analyze replication timing of repeats, we queried all the reads after initial processing with TEL, CEN, 45S and 5S repeat sequences from the Plant Repeat Databases (Ouyang and Buell, 2004). Arabidopsis telomeric sequences consist of 2-5 kb stretches of 5'-CCCTAAA-3' repeat units (TEL) (Richards and Ausubel, 1988), while centromeres and pericentromeres contain about 20,000 copies of a 180-bp satellite repeat (CEN) in long arrays extending for several megabases (Lermontova et al., 2015). The 570-750 copies per haploid genome of 45S rRNA genes (45S rDNA) form two 4-Mbp arrays in nucleolar organizing regions located at the ends of the short arms of chromosomes 2 and 4 (Copenhaver and Pikaard, 1996; Havlová et al., 2016). The pericentromeres of chromosomes 3, 4 and 5 also contain heterogeneous arrays including about 1000 copies of the 5S rRNA genes (5S rDNA) (Vaillant et al., 2007).

For each S-phase dataset, we computed the fraction of reads aligning to each repeat consensus and normalized it to the fraction of reads in the G1 control that aligned to the same consensus (Fig. 4E). The resulting ratio is a measure of enrichment or depletion of a given repeat in reads from early, mid or late S phase. CEN sequences are strongly enriched in late S phase and depleted in early and mid, in agreement with the late replication timing of the centromeres (Fig. 2A; Supplemental Fig. S7). TEL sequences replicate preferentially in early and mid S phase but replication activity is also detectable in late S phase. The lack of a single predominant replication time is likely due to asynchrony between telomeres. In human cells, the telomere replication program is chromosome-specific and influenced by sequences in sub-telomeric regions (Arnoult et al., 2010). Replication of both 5S and 45S rDNA occurs primarily in late S phase, consistent with sequestration and silencing of most 5S and 45S rDNA gene copies by repressive heterochromatin (Layat et al., 2012). However, some 5S and 45S rDNA genes are transcriptionally active and packaged into permissive euchromatin (Douet and Tourmente, 2007; Hamperl et al., 2013; Dvorackova et al., 2017). These active fractions may be the source of the 5S and 45S rDNA reads in the early and mid S-phase datasets.
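The repeat enrichment ratio described at the start of this paragraph amounts to a ratio of two read fractions. A small sketch, with a hypothetical function name and made-up read counts used only to show the arithmetic:

```python
def repeat_enrichment(repeat_reads_s, total_reads_s, repeat_reads_g1, total_reads_g1):
    """Fraction of S-phase reads hitting a repeat, relative to the same fraction in G1.

    Values above 1 indicate enrichment in that S-phase fraction; values below 1
    indicate depletion.
    """
    frac_s = repeat_reads_s / total_reads_s
    frac_g1 = repeat_reads_g1 / total_reads_g1
    return frac_s / frac_g1

# Hypothetical counts: the repeat accounts for 2% of S-phase reads and 1% of G1 reads.
ratio = repeat_enrichment(200, 10_000, 100, 10_000)
```

Normalizing to G1 corrects for the copy number of each repeat family, so that abundant repeats such as CEN are not scored as enriched merely because they contribute many reads.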

Replication Time and Chromatin States

Chromatin structure influences the replication program (Hiratani et al., 2008; Schwaiger et al., 2009; Picard et al., 2014), with early replication associated with euchromatin and late replication associated with heterochromatin (Ding and MacAlpine, 2011). Some combinations of epigenetic marks occur together more frequently than others (Kharchenko et al., 2011; Roudier et al., 2011; Sequeira-Mendes and Gutierrez, 2016). These combinations define chromatin states that describe the local chromatin environment more accurately than the traditional binary classification and may correlate better with replication timing programs.

Arabidopsis chromatin has been classified into 6 different states (CS) using 16 epigenetic marks by Wang et al. (2015). We chose this classification because it is biologically compatible with the large size of replication timing segments compared to other functional regions like transcription units. The classification described two euchromatic states (CS1 and CS5), two heterochromatic states (CS6 and CS3), and two intermediate states (CS2 and CS4). Chromatin void of any of the 16 histone marks was defined as "unclear" or CS0.

We used these chromatin states to examine the relationship between chromatin structure and replication timing. First, we calculated the overlap between each CS and RT class (Fig. 5A). Applying the same procedure as for genomic features, we built a contingency table (Supplemental Table S3) and performed a chi-square test (p-value < 2.2E-16, χ2 = 44,932, df=24). The associated Cramer's V statistic is equal to 0.31, indicating a non-random distribution of chromatin states in RT classes. The adjusted residuals for each combination of RT class and CS were classified into three tertiles indicated by the black arrows and dots in Fig. 5A.

Inspection of the overlap between chromatin states and RT classes revealed that the heterochromatic CS6 and CS3 are more abundantly represented in late replicating regions. However, there is no simple relationship between chromatin states and the replication timing segments (Fig. 5A). All of the chromatin states except for CS6 and CS3 include readily discernible amounts of DNA replicating in each portion of S phase except for late. There is no clear difference in the distribution of RT classes for the euchromatic states, CS1 and CS5, and the intermediate states, CS2 and CS4. While there are small differences in the amount of early replication associated with CS1, CS5, CS2 and CS4, none of these non-heterochromatic states display a strong preference for any particular replication time (cf. the % RT class coverage in Fig. 5B and Table 3).




Each RT class also contains multiple chromatin states (Fig. 5C). The most striking differences in CS content are found between the three early to mid RT classes and the ML and L classes. E, EM and M have substantial amounts of CS1, CS5, CS2 and CS4, while the L is primarily the heterochromatic states, CS6 and CS3. The ML class includes a similar amount of CS6 but is greatly reduced for CS3, which is characterized by the canonical heterochromatin marks H3K27me1 and H3K9me2 (Luo et al., 2013). Instead, the ML class has a large fraction of CS4 and smaller amounts of CS1, CS5 and CS2, and appears transitional between the early to mid RT classes and the L class. This idea is supported by the pairwise Spearman correlation coefficients in the similarity matrix (Fig. 5D) showing that the chromatin composition of E, EM and M are similar, while L has a distinctive heterochromatic signature and ML is in between.




Replication Timing and Chromatin Accessibility

Replication timing also correlates with chromatin accessibility (Farkash-Amar and Simon, 2010; Hansen et al., 2010; Yaffe et al., 2010; Takebayashi et al., 2012). In plants, open chromatin has been associated with higher gene density and higher levels of transcription (Zhang et al., 2012; Vera et al., 2014), but these studies did not examine the relationship between chromatin accessibility and replication timing. Hence, we compared our replication timing data with the genome-wide mapping of 34,254 DNase I hypersensitive sites (DHS) by Sullivan et al. (2014). We calculated the number of DHS per kb for each replication timing segment and plotted the distribution of the DHS densities for each RT class (Fig. 6A). The number of DHS/kb progressively decreases from E to L segments. Interestingly, only E and EM show a median DHS density above the genome average (0.28 DHS/kb). Only about 25% of M segments contain more DHS than the average, while 25% of ML and 50% of L segments do not contain any DHS.
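The per-segment DHS density calculation described above can be sketched as follows. This is an illustration only: it assumes that each DHS is represented by a single coordinate (for example its midpoint), and the function names and toy data are hypothetical rather than taken from the scripts used for Fig. 6A.

```python
from statistics import median

def dhs_per_kb(segment, dhs_positions):
    """Number of DHS per kb within one replication timing segment.

    segment: (chrom, start, end, rt_class), half-open coordinates in bp
    dhs_positions: list of (chrom, position); one coordinate per DHS (assumption)
    """
    chrom, start, end, _rt_class = segment
    n = sum(1 for c, pos in dhs_positions if c == chrom and start <= pos < end)
    return n / ((end - start) / 1000)

def median_density_by_class(segments, dhs_positions):
    """Median DHS/kb across the segments of each RT class."""
    by_class = {}
    for seg in segments:
        by_class.setdefault(seg[3], []).append(dhs_per_kb(seg, dhs_positions))
    return {rt: median(vals) for rt, vals in by_class.items()}

# Toy data (hypothetical)
segments = [("Chr1", 0, 4000, "E"), ("Chr1", 4000, 8000, "E"), ("Chr1", 8000, 10000, "L")]
dhs = [("Chr1", p) for p in (500, 1500, 2500, 3500, 5000, 6000)]
medians = median_density_by_class(segments, dhs)
print(medians)
```

Comparing each class median with a genome-wide average (total DHS divided by genome length in kb) reproduces the kind of comparison reported in the text.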

To gain further insight into the relationship between DHS density and replication timing, regions of high DHS density were compared with regions showing high local replication activity in early, mid or late S (Fig. 6B). There is an association between DHS site density and local maxima for replication in early S. In contrast, mid replication activity tends to decline around the regions of highest DHS density. There are many fewer DHS sites in centromeric and pericentromeric regions (Fig. 6C), and the DHS sites that are present in these regions do not overlap with local maxima of late replication. Instead, the peaks of DHS density in these regions are often associated with small peaks of early replication interspersed among the much stronger regions of late replication (Fig. 6C).

The DHS analysis indicated that an open chromatin structure is associated with early replication activity, whereas chromatin replicating in mid S phase, although still classified as euchromatic, is less accessible. This behavior suggests a sequential model for euchromatin replication, starting in regions that can be accessed readily by the replication machinery and then spreading to less accessible regions. In contrast, late replication activity appears unaffected by short-range variations in DHS density, raising the possibility that a different mechanism regulates replication timing within heterochromatin, possibly involving long-range, subnuclear topology similar to what has been suggested for larger genomes (Pope et al., 2014).

Replication Timing and Long-Range Chromosome Interactions

Chromosome conformation capture (Hi-C) techniques, which characterize long distance interactions and reveal large scale spatial patterns of chromatin, have uncovered two distinct subnuclear compartments in animals (Lieberman-Aiden et al., 2009; Hou et al., 2012; Zhang et al., 2012). These compartments, which differ widely in nuclease accessibility, gene density, transcriptional activity and epigenetic marks, correlate with early and late replicating domains that span 0.1-2 Mbp (Ryba et al., 2010; Ryba et al., 2011).

Hi-C analysis of the Arabidopsis genome has indicated that its spatial organization is much simpler. Arabidopsis telomeres interact more frequently with other telomeres and with the distal regions of their adjacent chromosome arms, while pericentromeres interact with the adjacent proximal regions of their chromosome arms as well as with other pericentromeres (Feng et al., 2014; Grob et al., 2014). This bipartite configuration recalls the overall distribution of replication activity in early, mid and late S phase (Fig. 4A). To examine the relationship between three-dimensional proximity and replication timing patterns, we compared the RT classes to the chromosome conformation capture datasets described by Liu et al. (2016). We chose this dataset because its reproducibility was established by an earlier study (Wang et al., 2015).

We aligned the Hi-C reads to the TAIR10 reference genome and identified significant interactions (p-value < 0.001) at 100-kb resolution. To focus attention on long range interactions, we imposed a minimum 1-Mbp separation between interacting loci because of the strong bias toward local interactions (Dekker et al., 2002; Lieberman-Aiden et al., 2009). We also did not consider inter-chromosomal interactions because the in-solution ligation method used to generate this dataset is known to inflate the number of trans interactions (Nagano et al., 2015). Finally, we excluded sequences within 1 Mbp of telomeres because telomeres tend to interact with very high frequency compared to the rest of the genome (Supplemental Fig. S10) (Feng et al., 2014).
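The four filters described above can be expressed as a single predicate, sketched here under stated assumptions: each interaction is a pair of 100-kb bins identified by their start coordinates, separation is measured between bin starts, and the chromosome length used in the example is a round toy value rather than the TAIR10 length. The names `Interaction` and `keep_interaction` are hypothetical.

```python
from typing import NamedTuple

class Interaction(NamedTuple):
    chrom1: str
    start1: int      # start of the first 100-kb bin (bp)
    chrom2: str
    start2: int      # start of the second 100-kb bin (bp)
    p_value: float

BIN_SIZE = 100_000           # Hi-C resolution
MIN_SEPARATION = 1_000_000   # minimum distance between interacting loci
TELOMERE_MARGIN = 1_000_000  # excluded region at each chromosome end
P_CUTOFF = 0.001             # significance threshold

def keep_interaction(ix, chrom_lengths):
    """Return True if the interaction passes all four filters."""
    if ix.p_value >= P_CUTOFF:
        return False                      # not significant
    if ix.chrom1 != ix.chrom2:
        return False                      # inter-chromosomal (trans)
    if abs(ix.start2 - ix.start1) < MIN_SEPARATION:
        return False                      # too local
    length = chrom_lengths[ix.chrom1]
    for start in (ix.start1, ix.start2):
        if start < TELOMERE_MARGIN or start + BIN_SIZE > length - TELOMERE_MARGIN:
            return False                  # bin lies within 1 Mbp of a telomere
    return True

# Toy chromosome length (hypothetical round number)
lengths = {"Chr1": 30_000_000, "Chr2": 20_000_000}
example = Interaction("Chr1", 5_000_000, "Chr1", 9_000_000, 0.0001)
print(keep_interaction(example, lengths))
```

Applying the predicate to every candidate pair leaves only significant, long-range, intra-chromosomal interactions away from the chromosome ends.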




Significant Hi-C interactions and associated RT classes are shown for Arabidopsis chromosome 1 in Fig. 7A. Three main groups of interactions are apparent: interactions within the pericentromere (Mbp 13.5-16.5), within each chromosome arm, and between the distal parts of the two arms. This pattern agrees well with the large-scale pattern of early-replicating arms and late-replicating pericentromeres (Fig. 3B; Fig. 4A). Interestingly, while pericentromeric sequences mainly interact among themselves, the distal arms contact other early replicating regions on both chromosome arms. All chromosomes show a similar organization (Supplemental Fig. S11), except the short arms of the acrocentric chromosomes 2 and 4. These results suggested that sequences in spatial proximity within the nucleus tend to replicate at the same time during S phase, irrespective of their map positions along the chromosome.

We then analyzed the pairs of interacting bins identified by Hi-C to determine the interaction profile for each RT class. The resolution of our replication data is much higher than the Hi-C data, so each Hi-C bin can contain multiple RT classes. To address this, we analyzed separately all the interacting pairs of Hi-C bins in which the first bin included a given RT class. Next, we summarized the RT segment classes in the second bin in the pair (Fig. 7B). Some pairs were assigned multiple times corresponding to each RT segment class included in the first bin. We performed the analysis in both directions with similar results, confirming that the choice of the first and second bins in each interacting pair did not influence the outcome (Supplemental Fig. S12). The E, EM and M segment classes have nearly identical interaction profiles, with a slight increase of ML and L segments in bins interacting with EM and M segments relative to E. The L segments interact preferentially with ML and L bins, while the ML segments interact with all RT classes. The ML and L groups are smaller than the E, EM and M groups due to the reduced genomic coverage of these classes (Fig. 3C). To account for this disparity, we expressed the interaction profiles as percent of total coverage for each interaction group (Fig. 7C; Table 4). The cumulative interaction profile of all the groups taken together is also shown for reference.

Downloaded from www.plantphysiol.org on June 29, 2018.
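The construction of interaction profiles described above can be sketched as follows. The sketch assumes that the RT composition of each Hi-C bin is available as a mapping from RT class to covered base pairs; the function names, bin labels and numbers are hypothetical. A pair contributes to the group of every RT class present in its first bin, which is why some pairs are counted more than once.

```python
from collections import defaultdict

def interaction_profiles(pairs, bin_rt):
    """Sum the RT coverage of partner bins for each RT class in the first bin.

    pairs: iterable of (first_bin, second_bin) identifiers
    bin_rt: bin identifier -> {rt_class: covered bp}
    """
    profiles = defaultdict(lambda: defaultdict(int))
    for first, second in pairs:
        for rt_class, bp in bin_rt[first].items():
            if bp == 0:
                continue  # class absent from the first bin
            for partner_class, partner_bp in bin_rt[second].items():
                profiles[rt_class][partner_class] += partner_bp
    return profiles

def as_percent(profile):
    """Express one group's profile as percent of its total coverage."""
    total = sum(profile.values())
    return {rt: 100.0 * bp / total for rt, bp in profile.items()}

# Toy bins and pairs (hypothetical)
bin_rt = {
    "A": {"E": 60_000, "EM": 40_000},
    "B": {"E": 100_000},
    "C": {"L": 80_000, "ML": 20_000},
    "D": {"L": 100_000},
}
pairs = [("A", "B"), ("C", "D"), ("A", "C")]
profiles = interaction_profiles(pairs, bin_rt)
print(as_percent(profiles["E"]))
```

Swapping the order of the bins in each pair and rerunning the function corresponds to the analysis "in both directions" mentioned in the text.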

We calculated a Pearson correlation matrix for the overlap of the RT classes with interacting partners of each group (Supplemental Table S5) and plotted the results as a heat map (Fig. 7D). The interaction profiles of the E, EM and M groups are strongly correlated, while the L group has a distinct and opposite interaction profile. The interactions of the ML group are intermediate, reinforcing the transitional nature of this RT class. The two interaction clusters related to replication timing (the E/EM/M cluster and the L cluster) correlate with the large-scale organization of chromosomes into early replicating arms and late replicating pericentromeres.
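A Pearson correlation matrix of this kind can be computed with a few lines of code. The sketch below assumes that each group's interaction profile is a vector of percentages ordered E, EM, M, ML, L. The numbers are invented for illustration and are not the values in Table 4 or Supplemental Table S5.

```python
from math import sqrt

RT_ORDER = ["E", "EM", "M", "ML", "L"]  # order of values in each profile vector

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors.

    Raises ZeroDivisionError if either vector has zero variance.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def correlation_matrix(profiles):
    """Pairwise Pearson correlations between group profiles."""
    return {a: {b: pearson(profiles[a], profiles[b]) for b in profiles}
            for a in profiles}

# Toy percent profiles (hypothetical values)
toy_profiles = {
    "E": [30, 25, 25, 15, 5],
    "M": [28, 25, 25, 16, 6],
    "L": [5, 10, 15, 30, 40],
}
matrix = correlation_matrix(toy_profiles)
for group, row in matrix.items():
    print(group, {other: round(r, 2) for other, r in row.items()})
```

In this toy example the E and M profiles correlate strongly and both correlate negatively with L, the same qualitative pattern the heat map is described as showing.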


DISCUSSION

The Genome-Wide Arabidopsis Replication Program at High Resolution

We used a new high-resolution strategy to characterize the replication timing program of Arabidopsis suspension cells at the whole genome level. Nearly 60% of the Arabidopsis genome was classified as replicating principally in either early, mid or late S phase. Unlike our earlier study (Lee et al., 2010), clear differences were observed between the sequence populations replicating in early and mid S phase. However, 41% of the genome showed strong replication activity in more than one portion of S phase, indicative of heterogeneity in replication timing.

Several factors contributed to the increased resolution of our new strategy. Potentially most important, we shortened the labeling time from 1 hour to 10 min after determining that the duration of S phase is only 1.5-1.9 hours for our Arabidopsis cultured cells (Mickelson-Young et al., 2016). We also reduced the widths of the sorting gates and increased the distance between them to minimize cross contamination between nuclei in early, mid and late S phase (Fig. 1). Finally, EdU conjugation to AF488 allowed us to use a two-way sorting strategy to resolve replicating from non-replicating nuclei and reduce contamination of EdU-labeled DNA by unlabeled DNA in the immunoprecipitates.

The increased resolution is apparent in maps of the raw sequencing reads, which show distinct replication profiles for early, mid and late S phase across 3 highly reproducible biological replicates (Fig. 1C; Supplemental Fig. S5). Although the narrow sorting gates only captured about 50% of the S-phase nuclei, the entire Arabidopsis genome was represented in the read profiles. This may reflect heterogeneity in replication time among genome sequences and/or technical limitations associated with the sensitivity of the flow cytometer, as demonstrated by a study in human cells that used six sorting gates (Hansen et al., 2010).




The increased resolution is also evident in visual comparisons between the replication profiles for Arabidopsis chromosome 4 generated using a 1-h BrdU pulse versus the 10-min EdU pulse (Supplemental Fig. S13). The profiles for early S phase are very similar, but there are major differences in the mid and late S profiles obtained using the two protocols. These differences correspond to regions that overlap between early and mid or mid and late in the BrdU profiles. The overlap between adjacent time points most likely reflects the inclusion of regions that incorporated BrdU as cells moved from earlier to later S phase during the 1-h pulse, which represents ca. 50% of the length of S phase in the Arabidopsis cultured cells (Mickelson-Young et al., 2016). Notably, there is less overlap between the profiles generated using a 10-min EdU pulse, indicating that the Arabidopsis replication timing program is less stochastic than proposed previously (Lee et al., 2010).

We presented the EdU replication profiles separately for each time point, rather than assign a unique replication time to each locus based on the ratio between early and late, as is often described in the literature (Hiratani et al., 2008; Schwaiger et al., 2009; Gilbert, 2010). By doing so, we highlighted the fact that some sequences replicate with high intensity in more than one portion of S phase (Fig. 2B; Supplemental Fig. S4). This almost always happens in consecutive time points, like early-mid or mid-late S phase. However, because of the short pulse length, wide separation between the gates, and sharp separation between populations of sorted nuclei, the heterogeneity is unlikely to be a technical artifact. Given that a sequence can replicate only once in a single cell, this heterogeneity is most likely due to variation between cells in the suspension culture. However, differences between alleles at a locus, often generated in cell cultures by somaclonal variation (Wang and Wang, 2012), may also contribute to the observed heterogeneity.

Segmentation Analysis

To reduce the complexity of the data and assign replication times to regions across the Arabidopsis genome, we used the Repliscan pipeline (Zynda et al., 2017) to assign a predominant replication time based on the relative intensity of normalized signal in all three time points. This analysis allowed us to score replication that occurs in more than one time window at a given locus, better representing heterogeneous replication.

The segmentation analysis assigned a single prevalent replication time (E, M or L) to more than half of the Arabidopsis genome, with the rest divided between EM and ML. Only 1% of the genome was not assigned to a single time or two adjacent times, underscoring the robustness of the segmentation analysis. The shorter labeling time and placement of gates to minimize overlap and emphasize mid-replicating sequences (Fig. 1) led to significant differences in segmentation from our earlier analysis of Arabidopsis chromosome 4 (Lee et al., 2010) (Supplemental Fig. S8). In the current study, 17% of chromosome 4 was classified as EM compared to 37% in the previous study. Concomitantly, sequences classified as E increased to 26% from less than 1% and as M to 22% from 4%. Coverage of L was reduced to 9% from 44%, with most of the late replicating segments located in a few megabases near the centromere. This reduction may reflect the narrower late gate and a shift in its placement to improve resolution. However, sequences classified as ML increased to 23% from 6% and included regions previously regarded as late replicating (Fig. 1B), consistent with the shorter labeling time increasing resolution. EL and EML declined from 8% to 1%. The sizes of the segments identified in this study (Fig. 3D) are comparable to the putative replicons described for Arabidopsis chromosome 4 (Lee et al., 2010) and some animal systems (MacAlpine et al., 2004; Lebofsky et al., 2006; Schwaiger et al., 2009). However, our analysis did not uncover evidence of the larger replication domains that have been described in mammals (Hiratani et al., 2008; Ryba et al., 2010).




The Replication Program and Genome Organization

All five Arabidopsis chromosomes showed the same general pattern of replication timing (Fig. 2B; Supplemental Fig. S7). At a macroscopic level, the distal portions of the chromosome arms replicate earlier than proximal regions, while pericentromeric and centromeric regions replicate last. The short arms of chromosomes 2 and 4 are exceptions because they replicate mainly in M and ML, perhaps because of their proximity to pericentromeric regions. This organization agrees generally with the biphasic model of replication that we proposed previously for Arabidopsis (Lee et al., 2010). Analysis of RT classes in relation to genomic features (Fig. 4D) suggested that E, EM and M segments are predominantly euchromatic, and ML and L segments are primarily heterochromatic. However, the distribution of chromatin states across the RT classes is more complex, with each RT class including multiple chromatin states and each chromatin state including several RT classes. This diversity suggests that, particularly in the portion of the genome classically regarded as euchromatic, replication timing may be determined to a large extent by factors that are independent of local chromatin states or by epigenetic features not included in the chromatin state analysis.

Replication timing data is thought to integrate transcriptional, epigenetic and spatial information across the genome (Hiratani and Gilbert, 2009), and its inclusion in modeling can inform chromatin state assignments. Wang et al. (2015) classified CS2 and CS4 as intermediate between euchromatin and heterochromatin. These assignments were based in part on the lack of transcription of CS2 and CS4 and no enrichment for histone marks associated with active transcription. However, the large amount of CS2 and CS4 in the E, EM and M RT classes indicates that major fractions of these states are in an open, accessible conformation characteristic of euchromatin. Thus, CS2 and CS4 may include nontranscribed euchromatin that replicates with transcribed euchromatin (CS1 and CS5) during early to mid S phase. This idea is supported by the near absence of CS2 and only a small fraction of CS4 replicating with heterochromatin in late S phase. ML segments, which include both euchromatic (CS1, CS5, CS2, and CS4) and heterochromatic (CS6 and CS3) chromatin states, represent a transition from replicating euchromatin to replicating heterochromatin.

Comparison of replication timing and chromosome conformation data showed that E, EM and M segments interact with each other with equal frequency within and between the arms of a chromosome, L segments interact predominantly with Hi-C bins located in the pericentromeres that encompass ML and L RT classes, while ML segments interact with all RT classes (Fig. 7). This pattern of interaction is consistent with the Arabidopsis genome consisting of two main genomic compartments, one that replicates during early to mid S phase and another that replicates in late S phase. This bipartite chromosomal architecture is reminiscent of the "open" and "closed" compartments identified in the human genome (Lieberman-Aiden et al., 2009). The two compartments have distinctive epigenomic and expression features and correlate with replication time (Hansen et al., 1996; Ryba et al., 2010). It has been proposed that because of the compact nature of the Arabidopsis genome and differences in chromatin organization between plants and metazoans, the pericentromeric regions and chromosome arms may correspond functionally to the closed and open compartments in mammalian genomes (Grob et al., 2014; Feng et al., 2014).

The datasets used for chromatin state and the long-range interaction studies (and the DHS data discussed below) were generated from Arabidopsis seedlings. During plant development, actively proliferating cells are localized primarily to meristematic regions and primordia and include all cell cycle stages. As a consequence, only a small fraction of the cells used to create the seedling datasets were in S phase. For this reason, future studies that use chromatin data from mitotic cells may uncover relationships between replication timing and chromatin that were not apparent in the comparisons here.

Nature of Mid S-Phase Replication

Replication in mid S phase may reflect spreading from regions that initiate during early S phase and/or initiation and elongation events specific to mid S phase. In our data, the distributions of read densities are sharply different in early and mid S phase, with early reads displaying high local maxima separated by deep troughs, while mid reads are more evenly distributed with smaller peaks and dips. These profiles are consistent with models postulating firing of low efficiency origins during mid S phase (Guilbaud et al., 2011), as well as with other models involving replication of regions lacking origins by unidirectional fork progression (Desprat et al., 2009; Ryba et al., 2010). Both of these mechanisms can be incorporated into a model in which origins are not distributed uniformly across a genome (Rhind, 2014; Kaykov et al., 2016) and compete for replication factors (Mantiero et al., 2011), with the likelihood of replication initiating in a given region depending primarily on its origin.

According to the above models, early replicating regions of the Arabidopsis genome would have more origins and origin clusters, and mid replicating regions would have fewer, more dispersed origins but would not differ dramatically with respect to sequence composition or global chromatin features. The only genome-wide study describing putative origin sequences in Arabidopsis is biased for early replication due to the use of sucrose starvation to arrest cells in G1 before release into BrdU in the presence of hydroxyurea to deplete nucleotide pools (Costas et al., 2011), and thus cannot provide insight into whether origins are enriched in early versus mid-replicating regions. However, our model is supported by the observation that Arabidopsis sequences replicating in early or mid S phase overlap similar genomic features (Fig. 4D) and display similar chromatin state (Fig. 5, C and D) and chromatin interaction profiles (Fig. 7, B, C, and D). Nevertheless, these regions have different sensitivities to DNase I digestion, with early regions, but not mid regions, enriched for DHS sites (Fig. 6, B and C). Local maxima in early regions are DHS-rich, while local maxima in mid regions are DHS-depleted, suggesting that early replication is associated with a higher degree of chromatin accessibility than mid replication. In this context, it is interesting that the replication program of the human genome can be accurately simulated by a model in which an initiation probability landscape is determined by the locations of DHS sites (Gindin et al., 2014).

Comparison to the Maize Replication Timing Program

We recently characterized replication timing in maize root tips labeled with EdU (Wear et al., 2017). The global distributions of the replication timing signals in maize and Arabidopsis are similar, with chromosome arms replicating earlier and pericentromeric and centromeric regions replicating later. Like Arabidopsis, maize replication is distributed across the RT classes. However, there are more early replicating regions and fewer late regions in Arabidopsis than maize. This difference likely reflects the very different genic and nongenic (TEs and noncoding sequences) content of the two genomes (Arabidopsis: 51% genic and 49% nongenic; maize: 8% genic and 92% nongenic), with genic sequences tending to replicate earlier. In addition, there are more dispersed blocks of ML and L replicating DNA in maize chromosome arms, which are typically organized into genic regions separated by TE clusters. Maize TEs (81% of the genome) are very abundant in all RT classes, with those closer to genes replicating earlier. In contrast, Arabidopsis TEs (20% of the genome) are located primarily in pericentromeric regions and enriched in ML and L classes.




There are other important similarities between the Arabidopsis and maize replication timing programs. Strikingly, the sizes of the RT segments are similar even though the maize genome is ca. 20-fold larger than that of Arabidopsis. Some loci show heterogeneity with respect to replication timing in both plant species. Moreover, early replicating regions are more accessible than mid replicating regions. This comparison underscores the role of genome structure in replication timing and highlights common features that are independent of genome organization.

CONCLUSION

We developed a high-resolution approach to study the replication program of eukaryotic genomes and applied it to the model plant Arabidopsis thaliana, extending our previous analysis of chromosome 4 (Lee et al., 2010) to the entire genome. Our results confirmed the basic observation that euchromatin replicates during early and mid S phase and heterochromatin replicates in late S phase, similar to most other eukaryotes (Hiratani et al., 2008; Schwaiger et al., 2009; Ryba et al., 2010). However, in this study, we better resolved early and mid-replication patterns within euchromatin. Although very similar in their association with most genomic features and chromatin marks, early and mid-replicating sequences differ strikingly in chromatin accessibility as measured by DHS density. This finding is of particular interest in connection with a recent model proposing that origin accessibility to replication factors is one of the primary determinants of replication programs (Rhind, 2014). The model, which integrates sequential activation of origins with stochastic firing, efficiently predicted the human replication program (Gindin et al., 2014).


MATERIALS AND METHODS

Arabidopsis Cell Culture and Nuclei Isolation

The Arabidopsis thaliana cell line (Col-0, ecotype Columbia) was maintained as described by Lee et al. (2010). Labeling followed the 7-d split protocol, in which 25 mL of fresh medium and 25 mL of a 7-day culture are mixed and grown for 16 h. At 16 h, the cells were labeled with 10 µM 5-ethynyl-2'-deoxyuridine (EdU, Life Technologies) for 10 min. Labeling was terminated by fixing the cells in 1% paraformaldehyde with gentle agitation for 10 min, followed by quenching the formaldehyde with 0.125 M glycine. Fixed cells were filtered through two layers of Miracloth mesh and transferred to 1X phosphate buffered saline (PBS). They were washed in PBS three times and snap frozen in liquid nitrogen. Cells from eight cultures were combined for each of three biological replicates.

Nuclei were isolated as described previously (Lee et al., 2010; Wear et al., 2016) with the addition of a Percoll gradient step. The frozen cell pellet was ground at 4°C in 40 mL of cell lysis buffer (15 mM Tris-HCl pH 7.5, 2 mM EDTA, 80 mM KCl, 20 mM NaCl, 15 mM β-mercaptoethanol, and 0.1% Triton X-100) using a commercial blender. The ground cell suspension was incubated for 5 min at 4°C, filtered through two layers of Miracloth, and centrifuged at 400 x g for 5 min at 4°C. Nuclei were enriched using a Percoll step gradient as described by Folta and Kaufman (2006) with minor modifications. The nuclei pellet was resuspended in 25 mL of extraction buffer (2 M hexylene glycol, 20 mM PIPES-KOH pH 7.0, 10 mM MgCl2, 5 mM β-mercaptoethanol) and centrifuged at 1500 x g over a discontinuous density gradient (30% and 80% v/v Percoll in gradient buffer: 0.5 M hexylene glycol, 10 mM MgCl2, 5 mM PIPES-KOH pH 7.0, 5 mM β-mercaptoethanol and 1% w/v Triton X-100) for 30 min at 4°C. The nuclei recovered from the 30:80% Percoll interface were resuspended in 15 mL of gradient buffer and centrifuged at 1500 x g over a cushion of 30% Percoll (v/v) in gradient buffer for 10 min at 4°C. After washing the nuclei pellet in modified cell lysis buffer (15 mM Tris-HCl pH 7.5, 2 mM EDTA, 80 mM KCl, 20 mM NaCl, and 0.1% Triton X-100), the incorporated EdU was conjugated with Alexa Fluor 488 (AF488) using a Click-iT EdU Alexa Fluor 488 Imaging kit (Life Technologies) as described previously (Wear et al., 2016). Finally, the nuclei were resuspended in the original cell lysis buffer containing 2 µg/mL DAPI and filtered through a CellTrics 20-µm nylon mesh filter (Partec) just before flow cytometry and sorting.

Flow Cytometry and Sorting

An InFlux flow cytometer (BD Biosciences) equipped with UV (355 nm) and blue (488 nm) lasers was used to sort nuclei by DNA content (DAPI fluorescence) and EdU incorporation (fluorescence of the conjugated AF488). Events were triggered on forward-angle light scatter (FSC), and data were collected using 90° side scatter (SSC) and 460/50 nm and 530/40 nm bandpass filters (Bass et al., 2014; Wear et al., 2016). Plots of SSC vs. 460/50 nm (DAPI) were used to set analysis and sorting gates that excluded cellular debris.

Sub-stage gates were used to sort labeled nuclei into pools representing early, mid and late S phase as well as unlabeled nuclei in G1 phase as a source of non-replicating reference DNA. The sorting gates were separated from each other to minimize overlap between the sorted populations (Fig. 1B). For each biological replicate, between 90,000 and 160,000 nuclei for each S phase fraction and 1 million unlabeled G1 nuclei were collected in tubes containing STE buffer (100 mM NaCl, 10 mM Tris-HCl pH 7.5, 1 mM EDTA). A small sample of nuclei (~12,000-16,000) was also sorted from each gate into cell lysis buffer augmented with 2 µg/mL DAPI and reanalyzed to determine the sort purity (Supplemental Fig. S3). Flow cytometry data were analyzed using FlowJo software (Tree Star Inc.).

Genomic DNA Extraction and Immunoprecipitation of EdU/AF488-Labeled DNA

Genomic DNA was extracted as described previously (Lee et al., 2010) with minor modifications. After overnight incubation with proteinase K, the samples were incubated with RNase A (50 µg/mL) for 1 h at 37°C prior to addition of PMSF (0.7 mg/mL). The DNA was extracted once with phenol/chloroform/isoamyl alcohol (25:24:1) and twice with chloroform, and precipitated with 0.6 volumes of ice-cold isopropanol overnight at -20°C. The DNA was pelleted by centrifugation, washed twice with 1 mL of 70% ethanol and resuspended in 130 µL of IP dilution buffer (167 mM NaCl, 16.7 mM Tris-HCl pH 8, 1.2 mM EDTA and 1.1% (v/v) Triton X-100). A Covaris S220 ultrasonicator was used to shear the DNA to an average size of 300 bp (parameters: intensity 5, duty cycle 10%, cycles per burst 200, treatment time 180 s). After shearing, 370 µL of IP dilution buffer (Gendrel et al., 2005) was added, and the sheared DNA solution was precleared by gentle agitation in 20 µL of magnetic protein G beads (Dynabeads, Life Technologies) pre-equilibrated with IP dilution buffer at 4°C for 1 h. The beads were removed with a magnet and newly synthesized DNA was immunoprecipitated by incubating with a 1:200 dilution of anti-Alexa Fluor 488 antibody (Molecular Probes, #A-11094) at 4°C overnight. The DNA-antibody complex was captured with 25 µL of pre-equilibrated protein G beads at 4°C for 2 h, followed by washing the beads as described by Gendrel et al. (2005). Bound DNA was eluted from the beads in 250 µL of elution buffer (1% (w/v) SDS, 100 mM sodium bicarbonate) at 65°C for 15 min, transferring the supernatant to a new tube and repeating the elution for a final volume of 500 µL. Eluted DNA was purified with QIAquick PCR Purification Kit (Qiagen) according to the manufacturer's directions. To maximize DNA recovery, pre-warmed (50°C) TE was used for the elution step.

Library Construction, Sequencing and Analysis of Repli-Seq Data

Immunoprecipitated DNA was used to construct sequencing libraries with the NEXTflex Illumina ChIP-Seq Library Prep Kit (Bioo Scientific) using the ultra-low input protocol. After adapter ligation, the libraries were amplified with 18 cycles of PCR with the Expand High FidelityPLUS PCR System (Roche). For each experiment, individual samples were barcoded and pooled. The libraries were sequenced with an Illumina Hi-Seq 2000 platform.

Raw sequencing data were processed using Trim Galore! (v0.3.7) to remove 3′ universal adapters from the paired reads, trim 5′ ends with fastq quality scores below 20, and remove trimmed reads shorter than 40 bp. The quality-controlled reads were then aligned to the Arabidopsis TAIR10 genome with BWA mem (v0.7.4) using default parameters (Li, 2013). After alignment, reads with multiple alignments were discarded using samtools 1.3 (Li et al., 2009). For mapping statistics and total sequence coverage, see Supplemental Table S1.

Data were then analyzed as described by Zynda et al. (2017). The scripts can be found at https://github.com/zyndagj/repliscan. Read densities were scored in 1-kb bins across the genome and normalized using sequence depth scaling (Ramírez et al., 2016). The correlation between biological replicates was assessed using multiBigwigSummary and plotted as a heatmap using plotCorrelation in the deepTools 2.0 suite (Ramírez et al., 2016). Replicates were highly correlated (Supplemental Fig. S5).
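
As an illustration of the binning and depth-scaling step, the following minimal Python sketch counts reads in fixed 1-kb bins and rescales the counts to a common depth. It is not the repliscan or deepTools code; the function names, the toy read positions and the target depth are assumptions made only for this example.

```python
from collections import Counter

BIN_SIZE = 1000  # 1-kb bins, as in the text


def bin_counts(read_starts, chrom_length, bin_size=BIN_SIZE):
    """Count reads per fixed-width bin from 0-based read start positions."""
    n_bins = (chrom_length + bin_size - 1) // bin_size
    counts = [0] * n_bins
    for bin_index, n in Counter(p // bin_size for p in read_starts).items():
        counts[bin_index] += n
    return counts


def depth_scale(counts, target_depth):
    """Rescale bin counts so the sample sums to target_depth (assumed value)."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    factor = target_depth / total
    return [c * factor for c in counts]


# Hypothetical read start positions on a 5-kb toy chromosome.
reads = [10, 950, 1001, 1500, 1999, 4200]
raw = bin_counts(reads, chrom_length=5000)
scaled = depth_scale(raw, target_depth=600)
print(raw)
print(scaled)
```

Scaling every library to the same total makes bins comparable between samples that were sequenced to different depths.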

Biological replicates were aggregated by taking the median value in each 1-kb bin. Bins with coverage in the upper and lower 0.1% tails of a calculated normal distribution were removed. Values for each of the S-phase samples were divided by the value for the non-replicating G1 reference in the corresponding bin to normalize for sequencing bias. To reduce noise, Haar wavelet smoothing was performed using the software package wavelets from Percival and Walden (2000). The Haar wavelet method was chosen because, unlike kernel smoothing methods, it reduces differential noise without spreading peak boundaries.

Classifying Predominant Replication Time

The method used to assign a predominant time of replication to each 1-kb bin across the genome is described by Zynda et al. (2017). Each bin was classified as replicating at a given time point if its normalized replication intensity was above a chromosome-specific threshold value, as calculated by the following procedure. Total coverage, defined as the fraction of the chromosome with a signal greater than the threshold in at least one replication time window, was computed as a continuous function of the threshold value using a cubic spline interpolation across the replication values. The first derivative of the coverage function was then calculated using the central difference formula to show the rate of coverage change.
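
The coverage function and its derivative can be sketched as follows. This is a simplified illustration, not the implementation of Zynda et al. (2017): coverage is evaluated directly on a coarse grid of thresholds instead of through a cubic spline, and the signal values and grid are hypothetical.

```python
def coverage(signals, threshold):
    """Fraction of bins whose signal exceeds the threshold in at least one window.

    signals: list of per-bin tuples, e.g. (early, mid, late).
    """
    covered = sum(1 for bin_values in signals if max(bin_values) > threshold)
    return covered / len(signals)


def central_difference(xs, ys):
    """First derivative at the interior grid points (central difference formula)."""
    return [
        (ys[i + 1] - ys[i - 1]) / (xs[i + 1] - xs[i - 1])
        for i in range(1, len(xs) - 1)
    ]


# Hypothetical normalized (early, mid, late) signals for eight 1-kb bins.
signals = [
    (2.0, 0.5, 0.1), (1.5, 1.2, 0.2), (0.3, 1.8, 0.4), (0.2, 0.6, 2.5),
    (0.1, 0.2, 0.3), (0.9, 0.8, 0.7), (0.4, 0.4, 1.1), (0.0, 0.1, 0.2),
]
thresholds = [0.0, 0.5, 1.0, 1.5, 2.0]  # assumed grid, no spline
cov = [coverage(signals, t) for t in thresholds]
slope = central_difference(thresholds, cov)
print(cov)
print(slope)
```

Coverage can only fall as the threshold rises, so the derivative is never positive; its magnitude shows how quickly bins are gained or lost around a given threshold.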

Starting from the point with the highest rate of coverage change (maximum first derivative), the threshold was lowered until the first derivative of the coverage vs. threshold curve effectively flattened out. Below this point any additional signals were uninformative because those regions had already been classified as replicating in other time points. The predominant replication time for a given 1-kb bin was then assigned by considering the relative amounts of total replication signal in early, mid and late S phase. For each 1-kb bin, the three signals were divided by the maximum value, scaling the largest value to 1 and the others to between 0 and 1. The bin was labeled as the combination of times with a normalized signal above 0.5. This strategy allowed single prevalent time and combinatorial time classifications to be assigned to a given 1-kb bin. Bins were classified as undetermined if none of the signals in any of the three time samples reached the threshold value.
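
The labeling rule for a single bin can be illustrated with the short Python sketch below. The function name, the labels E, M and L, and the example threshold are hypothetical. The sketch also assumes that signals at or below the chromosome-specific threshold are treated as zero before scaling, which is one possible reading of the procedure and not a statement about the published code.

```python
TIMES = ("E", "M", "L")  # early, mid and late S phase


def classify_bin(signals, threshold, cutoff=0.5):
    """Return a label such as 'E', 'EM' or 'EML', or None if undetermined.

    signals: (early, mid, late) normalized replication signals for one bin.
    threshold: chromosome-specific threshold, supplied by the caller.
    """
    # Assumption for this sketch: sub-threshold signals count as zero.
    kept = [s if s > threshold else 0.0 for s in signals]
    peak = max(kept)
    if peak == 0.0:
        return None  # no time sample reached the threshold
    # Scale to the maximum and keep every time above the 0.5 cutoff.
    return "".join(t for t, s in zip(TIMES, kept) if s / peak > cutoff)


print(classify_bin((2.0, 0.5, 0.1), threshold=0.4))
print(classify_bin((1.5, 1.2, 0.2), threshold=0.4))
print(classify_bin((0.1, 0.2, 0.3), threshold=0.4))
```

Because the largest retained signal always scales to 1, every bin that passes the threshold receives at least one time label.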

Replication Intensity and Relative Distance from the Centromere

Centromere positions in each chromosome were identified with the bedtools 2.25.0 genomecov utility (Quinlan and Hall, 2010) as 1-kb bins with the maximum coverage of 180-bp repeats (Nagaki et al., 2003). Using normalized replication intensity in early, middle and late S phase, the percent of total replication occurring in bins representing successive 10% portions of a given chromosome arm was calculated with a custom R script (R Development Core Team, 2016). Replication within each interval, expressed as percentage of total replication activity for that chromosome arm in that portion of S phase, was plotted as a function of the relative distance from the centromere (Fig. 2D) using the R package ggplot2 (Wickham, 2009).
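
The calculation was done with a custom R script that is not reproduced here. The Python sketch below only illustrates the idea of summing replication signal in successive 10% portions of a chromosome arm; the function name, the coordinates and the signal values are hypothetical.

```python
def percent_by_distance(bin_starts, signal, centromere, arm_end, n_intervals=10):
    """Percent of an arm's total signal in successive intervals from the centromere.

    bin_starts: integer start coordinate of each bin on one chromosome arm.
    signal: normalized replication intensity per bin for one S-phase fraction.
    centromere, arm_end: integer coordinates bounding the arm.
    """
    arm_length = abs(arm_end - centromere)
    totals = [0.0] * n_intervals
    for start, value in zip(bin_starts, signal):
        # Integer arithmetic avoids floating-point errors at interval edges.
        idx = abs(start - centromere) * n_intervals // arm_length
        totals[min(idx, n_intervals - 1)] += value
    grand_total = sum(totals)
    if grand_total == 0:
        return [0.0] * n_intervals
    return [100.0 * t / grand_total for t in totals]


# Toy arm: centromere at 0, arm end at 10 kb, ten 1-kb bins.
starts = list(range(0, 10000, 1000))
late_signal = [1, 1, 1, 1, 1, 1, 1, 1, 1, 11]  # hypothetical values
percents = percent_by_distance(starts, late_signal, centromere=0, arm_end=10000)
print(percents)
```

Expressing each interval as a percentage of the arm total allows arms of different lengths and different S-phase fractions to be plotted on the same scale.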

Association of Replication Timing with Genomic Features and Repeat Sequences

Genomic annotations of genes, pseudogenes and transposable elements (TEs) were obtained from the Araport11 database (TAIR10_GFF3_genes_transposons.AIP.gff.gz at https://www.araport.org/downloads/TAIR10_genome_release/annotation). Unannotated regions were defined as the difference between the genome and all the annotated features. For viewing in IGV 2.3.60 and comparison with Repli-Seq data, the coverage of genes and TEs was defined as the percentage of bases in a specified portion of the genome that overlap with that feature. Gene and TE coverage was scored in 1-kb bins with bedtools v2.25.0 genomecov and map utilities. For visualization in IGV, the data were smoothed using a 50-kb moving average with the R package zoo (Zeileis and Grothendieck, 2005). A custom script is available upon request.

Associations of genomic features with RT segmentation classes were computed with bedtools v2.25.0 intersect, and their statistical significance was assessed with a chi-squared test. The adjusted residuals (Agresti, 2007) were used to measure the relative contribution of each combination of genomic feature and RT class to the statistical significance of the associations.
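
For reference, the adjusted residual of a cell with observed count O, expected count E, row proportion p_row and column proportion p_col is (O - E) / sqrt(E (1 - p_row)(1 - p_col)). The Python sketch below applies this standard formula to a small hypothetical contingency table; the counts are invented and do not come from this study.

```python
import math


def adjusted_residuals(table):
    """Adjusted (standardized) residuals for an r x c contingency table."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    residuals = []
    for i, row in enumerate(table):
        out_row = []
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            se = math.sqrt(
                expected * (1 - row_tot[i] / n) * (1 - col_tot[j] / n)
            )
            out_row.append((observed - expected) / se)
        residuals.append(out_row)
    return residuals


# Hypothetical counts: rows are two feature types, columns are two RT classes.
table = [[30, 10],
         [20, 40]]
res = adjusted_residuals(table)
print(res)
```

Under independence the adjusted residuals are approximately standard normal, so cells with large absolute values are the ones that contribute most to a significant chi-squared statistic.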

Telomere-related (TEL), centromere-related (CEN), 45S and 5S ribosomal DNA sequences were obtained from Plant Repeat Databases (Ouyang and Buell, 2004). The replication timing of each group of repeats was assessed as described by Gent et al. (2014). Reads from individual biological replicates of G1, early, mid and late S phase samples were aligned to consensus sequences for each group using BLAST software (parameter -e 1e-8) (Camacho et al., 2009). For each sample and biological replicate, the number of reads that aligned to each repeat family was normalized to the total number of reads present in the sample. Finally, the relative abundance of each family in the early, mid or late reads was normalized to the relative abundance of the same family in the G1 reference.
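
The two normalization steps for one repeat family and one biological replicate can be written as the short Python sketch below. The function name, sample names and read counts are hypothetical and serve only to show the arithmetic.

```python
def relative_enrichment(hits, totals, reference="G1"):
    """Read fraction of one repeat family in each sample, relative to the reference.

    hits: reads aligning to the repeat family, keyed by sample name.
    totals: total reads in each sample, keyed by the same names.
    """
    # Step 1: normalize to the total number of reads in each sample.
    fraction = {name: hits[name] / totals[name] for name in hits}
    # Step 2: normalize to the same family's fraction in the G1 reference.
    ref = fraction[reference]
    return {name: f / ref for name, f in fraction.items() if name != reference}


# Hypothetical read counts for a single repeat family.
hits = {"G1": 1000, "Early": 500, "Mid": 1000, "Late": 3000}
totals = {"G1": 1_000_000, "Early": 1_000_000, "Mid": 500_000, "Late": 1_000_000}
enrichment = relative_enrichment(hits, totals)
print(enrichment)
```

A value above 1 indicates that the family is over-represented in that S-phase fraction relative to non-replicating G1 DNA, and a value below 1 indicates under-representation.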

Association of Replication Timing with Chromatin States, DNase I Hypersensitivity Sites and Chromosome Conformation

Repli-Seq data were compared with the chromatin state dataset produced by Wang et al. (2015). The overlaps in bp between each chromatin state (CS) and the five major RT segment classes were calculated using bedtools v2.25.0 intersect, and plotted as absolute and relative coverage. Statistical significance was assessed with a chi-squared test. We used the chi-squared adjusted residuals (Agresti, 2007) to identify which RT classes were most different from the expected value in each chromatin state group of features, compared to the genome. The absolute coverage of each chromatin state in each RT class was used to compute the Spearman correlation coefficient between RT classes using the function cor in R, and subsequently plotted as a heat map with the package corrplot (Wei and Simko, 2016).
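
The analysis itself used the function cor in R. As a language-neutral illustration of what a Spearman coefficient measures, the Python sketch below ranks two coverage vectors (tied values share their average rank) and computes the Pearson correlation of the ranks. The vectors are hypothetical coverage values of five chromatin states in two RT classes.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical coverage (bp) of five chromatin states in three RT classes.
early = [10, 20, 30, 40, 50]
mid = [12, 18, 33, 39, 70]
late = [50, 40, 30, 20, 10]
print(spearman(early, mid), spearman(early, late))
```

Because only ranks enter the calculation, the coefficient is insensitive to the large differences in absolute coverage between chromatin states.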




To compare replication timing profiles with DNase I hypersensitivity sites (DHSs), we used the dataset (GEO accession PRJNA231710) described by Sullivan et al. (2014). The density of DHSs in each RT class (Fig. 6A; Supplemental Fig. S4) was determined using data from control experiments. The number of DNase cleavages from signal files (Accessions GSM1289359 and GSM1289363) was averaged at 1-kb steps across the genome and smoothed using a 5-kb moving average. The resulting DHS density distribution was plotted as a heat map and overlaid with the early, mid and late replication intensity signals. The DNaseI read density files (Col-0.7d_Seedling.NA.NA.DS19992.signal.bw and Col-0.7d_Seedling.NA.NA.DS21094.signal.bw) are at http://plantregulome.org/public/dnase/other/all-reads/signal/. The DNaseI hypersensitive peak files (Col-0.7d_Seedling.NA.NA.DS19992.peaks.bed.gz and Col-0.7d_Seedling.NA.NA.DS21094.peaks.bed.gz) are at http://plantregulome.org/public/dnase/other/all-reads/peaks/.

We used the dataset (Accession number SRR2626429) described by Liu et al. (2016) for chromosome conformation analysis. Sequencing reads were aligned to the TAIR10 reference genome, and experimental artifacts, such as circularized fragments, PCR duplicates, re-ligated adjacent sequences and wrong-size fragments, were removed using HiCUP with the default parameters (Wingett et al., 2015). Significant interactions, defined as pairs of loci that have a greater number of Hi-C reads than expected by chance (p-value < 0.001), were identified at 100-kb resolution using HOMER (Heinz et al., 2010) and visualized using the CIRCOS tool (Krzywinski et al., 2009) together with the genome segmentation in RT classes. Within each interacting pair of 100-kb bins, we randomized the first and second bins and split the interactions into groups based on the content of RT classes in the first bin. The absolute and relative overlaps of the second bins with RT classes were computed with bedtools v2.25.0 intersect. A Pearson correlation matrix was computed using the function cor in R, and subsequently plotted as a heat map with the package corrplot (Wei and Simko, 2016).

Accession Numbers

Repli-Seq data from this study are in the NCBI Sequence Read Archive (SRA) under the umbrella accession number PRJNA330547. The SRA numbers are: G1 SAMN05417671, Early SAMN05417674, Mid SAMN05417672, and Late SAMN05417673. Processed data files (E_ratio_3.smooth.bedgraph; M_ratio_3.smooth.bedgraph; L_ratio_3.smooth.bedgraph; ratio_segmentation.gff3) are available from the CyVerse (previously iPlant Collaborative; Merchant et al., 2016) Data Store. The Nimblegen microarray data for Arabidopsis chromosome 4 replication timing are at Gene Expression Omnibus under accession number GSE103321. The tiling microarray data for Arabidopsis chromosome 4 replication timing can be found at ArrayExpress under accession number E-GEOD-30433.

Supplemental Data

The following supplemental materials are available.

Supplemental Table S1. Statistics for sequenced libraries

Supplemental Table S2. Adjusted residuals for chi-square test on contingency Table 2 describing the overlaps between genomic features and RT classes

Supplemental Table S3. Overlap between chromatin states (CS) and RT classes

Supplemental Table S4. Adjusted residuals relative to the chi-square test on the contingency Table S3 describing the overlaps between chromatin states (CS) and RT classes

Supplemental Table S5. Coverage of RT classes of genomic bins establishing significant long-range interactions

Supplemental Figure S1. Comparison of replication timing profiles generated using tiling and Nimblegen arrays

Supplemental Figure S2. Spearman correlation matrix for tiling (TL) and Nimblegen (NG) array platforms

Supplemental Figure S3. Sorting gates and reanalysis of sorted fractions

Supplemental Figure S4. Distribution of read density for each sequencing library in representative 1 Mb regions of Arabidopsis chromosomes 1, 3 and 5

Supplemental Figure S5. Spearman correlation matrix of read densities of sequenced samples

Supplemental Figure S6. Comparison of linear ratio versus Log2 ratio

Supplemental Figure S7. Large-scale distribution of read density on the five Arabidopsis chromosomes

Supplemental Figure S8. Comparison of the distribution of RT classes on Arabidopsis chromosome 4

Supplemental Figure S9. Replication timing and genomic features

Supplemental Figure S10. Hi-C background models generated with HOMER for Arabidopsis chromosome 1
