Short Title: Repli-Seq of the Arabidopsis DNA replication program
Corresponding Author: Linda Hanley-Bowdoin, [email protected]
Genome-Wide Analysis of the Arabidopsis thaliana Replication Timing Program (1)
Lorenzo Concia (a,2), Ashley M. Brooks (a), Emily Wheeler (a), Gregory J. Zynda (b), Emily E. Wear (a), Chantal LeBlanc (c,3), Jawon Song (b), Tae-Jin Lee (a,4), Pete E. Pascuzzi (d), Robert A. Martienssen (c), Matthew W. Vaughn (b), William F. Thompson (a) and Linda Hanley-Bowdoin (a,5)
(a) Department of Plant and Microbial Biology, North Carolina State University, Raleigh, NC 27695
(b) Texas Advanced Computing Center, University of Texas at Austin, Austin, TX 78758
(c) Howard Hughes Medical Institute, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724
(d) Purdue University Libraries, Purdue University, West Lafayette, IN 47907
Summary Sentence: The Arabidopsis thaliana genome replicates in two non-interacting
compartments during early/mid and late S phase.
Author Contributions: Experiments were conceived by LC, RAM, MWV, WFT, and LH-B.
Experiments were performed by LC, AMB, EW, EEW, CL and T-JL. Repli-Seq data were
analyzed by LC, PP, GJZ, JS, MWV, WFT and LH-B. LC, WFT and LH-B wrote the manuscript
with contributions from all authors. All authors read and approved the final manuscript.
(1) Funding Information: This work was supported by a grant (IOS-1025830) from the Plant
Genome Research Program of the National Science Foundation to LH-B, WFT, RAM and
MWV.
Current addresses: (2) Institute of Plant Sciences Paris-Saclay, Bâtiment 630, Rue Noetzlin, 91190
Gif-sur-Yvette, France; (3) Department of Molecular, Cellular & Developmental Biology, Yale
University, New Haven, CT 06511; (4) Syngenta Crop Protection, LLC, Research Triangle Park,
NC, 27709
(5) Address correspondence to [email protected]
Plant Physiology Preview. Published on January 4, 2018, as DOI:10.1104/pp.17.01537
Copyright 2018 by the American Society of Plant Biologists
www.plantphysiol.orgon June 29, 2018 - Published by Downloaded from Copyright 2018 American Society of Plant Biologists. All rights reserved.
http://www.plantphysiol.org
ABSTRACT
Eukaryotes use a temporally regulated process, known as the replication timing program, to
ensure that their genomes are fully and accurately duplicated during S phase. Replication timing
programs are predictive of genomic features and activity, and considered to be functional
readouts of chromatin organization. Although replication timing programs have been described
for yeast and animal systems, much less is known about the temporal regulation of plant DNA
replication or its relationship to genome sequence and chromatin structure. We used the
thymidine analog, 5-ethynyl-2'-deoxyuridine, in combination with flow sorting and Repli-Seq to
describe, at high resolution, the genome-wide replication timing program for Arabidopsis
thaliana Col-0 suspension cells. We identified genomic regions that replicate predominantly
during early, mid and late S phase, and correlated these regions with genomic features and with
data for chromatin state, accessibility and long-distance interaction. Arabidopsis chromosome
arms tend to replicate early while pericentromeric regions replicate late. Early and
mid-replicating regions are gene-rich and predominantly euchromatic, while late regions are rich in
transposable elements and primarily heterochromatic. However, the distribution of chromatin
states across the different times is complex, with each replication time corresponding to a
mixture of states. Early and mid-replicating sequences interact with each other and not with late
sequences, but early regions are more accessible than mid regions. The replication timing
program in Arabidopsis reflects a bipartite genomic organization with early/mid replicating
regions and late regions forming separate, non-interacting compartments. The temporal order of
DNA replication within the early/mid compartment may be modulated largely by chromatin
accessibility.
INTRODUCTION
In each cell cycle, a cell must produce two identical copies of its genome during S phase.
Most of our knowledge about genome replication in higher eukaryotes comes from studies in
animals. These studies have indicated that replication is a temporally ordered process (Gilbert,
2010) that occurs in large domains of coordinate replication (replication domains) with
multiple origins firing in concert during S phase (MacAlpine et al., 2004; Desprat et al., 2009;
Schwaiger et al., 2009; Farkash-Amar and Simon, 2010). The replication timing programs of
several metazoan genomes have been characterized (Schübeler et al., 2002; Woodfine et al.,
2004; Hiratani et al., 2008; Schwaiger et al., 2009; Hansen et al., 2010). These studies revealed
that early replicating chromatin is rich in genes, is transcriptionally active, and contains
euchromatic histone modifications (Schübeler et al., 2002; Woodfine et al., 2004; Hiratani and
Gilbert, 2009; Hansen et al., 2010; Eaton et al., 2011; Lubelsky et al., 2014). Conversely, late
replicating chromatin is enriched for heterochromatin and repetitive elements (Gilbert, 2002;
Woodfine et al., 2004). Early and late replication domains correlate strongly with the open and
closed compartments identified by chromatin conformation capture experiments (Ryba et al.,
2010; Yaffe et al., 2010; Pope et al., 2014). These compartments, which are megabases in size,
differ widely with respect to nuclease accessibility, gene density, transcriptional activity and
epigenetic marks (Lieberman-Aiden et al., 2009; Sexton et al., 2012). Hence, metazoan
replication timing programs are predictive of important genomic features and can be considered
functional readouts of chromatin organization (Rivera-Mulia et al., 2015).
Much less is known about how DNA replication occurs temporally and spatially across plant
genomes. Although the DNA replication machinery and many aspects of chromatin biology are
conserved between plants and animals, there are significant differences, such as the absence in plants
of lamins and geminin (Shultz et al., 2007; Thorpe and Charpentier, 2017), which play key roles
in chromatin organization and origin function in metazoans. In addition, fundamental processes
such as transcriptional regulation have been shown to differ between plants and animals
(Meyerowitz, 2002; Hetzel et al., 2016). There is also evidence that the spatiotemporal
distribution of replicating DNA is different in plant nuclei than in metazoan cells (Bass et al.,
2015). Hence, we cannot assume that DNA replication programs in plants mirror those in
animals (Savadel and Bass, 2017).
Arabidopsis thaliana is an important plant model system because of its small genome, which
has been fully sequenced and is well annotated, and the broad range of genomic resources
(Arabidopsis Genome Initiative, 2000; Provart et al., 2016). There are genome-wide data
available for Arabidopsis chromatin accessibility, histone modifications and chromatin
interactions. Because of these resources, Arabidopsis is an ideal system for examining DNA
replication programs in plants.
Our group previously published a description of the replication timing program for
Arabidopsis chromosome 4 (Lee et al., 2010). In that study, Arabidopsis suspension cells were
pulse-labeled with 5-bromo-2'-deoxyuridine (BrdU) for 1 h followed by nuclei separation based
on DNA content using flow cytometry. Replication was examined in three nuclei populations
corresponding to early, mid and late S phase, using a 1-kb tiling microarray platform. While both
the spatial resolution and labeling pulse length were comparable to similar studies with
metazoans (Schübeler et al., 2002; Hiratani et al., 2008), no major differences were observed
between the early and mid S-phase replication profiles for Arabidopsis. This finding led us to
conclude that, unlike in animals, the order of origin activation in Arabidopsis in early and
mid S phase is stochastic and replication of euchromatin does not follow a strict temporal
pattern. In contrast to the Arabidopsis chromosome 4 results, we recently observed
differences between the early and mid S-phase profiles during replication of the maize genome
(Wear et al., 2017). To address these conflicting results, we reexamined the Arabidopsis
replication program, focusing more closely on sequences replicating in early and mid S phase. In
the process, we adapted our flow cytometry strategy and the Repli-Seq methodology to better
distinguish between early and mid S replication. We generated a high-resolution replication
timing map for the entire Arabidopsis genome, and correlated the replication program with
chromatin state, accessibility and interaction data.
RESULTS
Improving Resolution of the Replication Timing Protocol
We examined several factors that might improve our ability to resolve differences in
replication timing. These included the analysis platform used to detect newly synthesized DNA,
the thymidine analog used to pulse label nascent DNA, the length of the labeling period, and the
flow cytometry strategy for separating nuclei in different stages of S phase.
Initially, we sought to improve our ability to distinguish sequences replicating in early versus
mid S phase by using a more advanced NimbleGen microarray platform with shorter, more
closely spaced probes to better resolve replicating DNA sequences. In this experiment, we used
the same protocol as our previous study, with the exception of the array platform. The replication
timing profiles generated using the NimbleGen arrays show more fine structure than those
obtained from the tiling arrays (Supplemental Fig. S1). However, the overall replication profiles
are very similar for the two array platforms with early and mid S-phase signals showing very
high correlations on both platforms (Supplemental Fig. S2). Thus, we concluded that probe
resolution was not a major factor in our ability to distinguish early and mid S-phase replication.
We then focused on reducing the labeling time and obtaining better separation of early and
mid-replicating nuclei (Fig. 1, A and B) (Bass et al., 2014; Wear et al., 2016). Arabidopsis
cultured cells were pulse labeled with the thymidine analog, 5-ethynyl-2'-deoxyuridine (EdU),
for 10 minutes. After formaldehyde fixation and nuclei isolation, the incorporated EdU was
conjugated with Alexa Fluor 488 (AF488) azide using Click chemistry (Salic and Mitchison,
2008). Nuclei were then stained with DAPI and fractionated by flow cytometry using a two-color
sort strategy based on EdU incorporation (AF488) and DNA content (DAPI). EdU-labeled nuclei
were fractionated into early, mid and late S-phase populations (Fig. 1B, upper panel), while
non-replicating G1 and G2 nuclei were excluded based on the absence of EdU. The S-phase gates
were assigned by dividing the EdU arc into 5 equal sections based on DNA content, with the
first, third and fifth sections defined as early, mid and late S phase. This resulted in narrower,
better separated sorting gates than in our previous experiments, reducing the range of total DNA
content within each S-phase fraction, and minimizing cross contamination between fractions.
Reanalysis of a sample from each fraction by flow cytometry showed minimal overlap (
distribution. Importantly, the read distributions of the mid S-phase samples were clearly different
from the early samples (r2 = -0.02 - 0.11).
Replication profiles were created using the Repliscan pipeline (Zynda et al., 2017). Read
counts were averaged over non-overlapping 1-kb bins, and the total number of reads per sample
(sequencing depth) was normalized to 1X genome coverage using the RPGC method (Ramírez et
al., 2016). Given their high correlations, the biological replicates were combined and sequencing
depth was normalized again prior to further analysis. To account for local variation in
sequenceability, the normalized read densities were divided by the corresponding densities in the
non-replicating G1 reference DNA (Supplemental Fig. S6). Additional low-amplitude variations
were removed using a level-3 Haar wavelet transform (Percival and Walden, 2000) to produce
smoothed, normalized read density profiles for early, mid and late S phase.
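The normalization steps described above can be illustrated with a minimal Python sketch. It is not the Repliscan implementation: the function names and toy read counts are hypothetical, scaling each sample to a mean bin value of 1 is used as a stand-in for 1X (RPGC) normalization, and the wavelet step is approximated by keeping only the level-3 Haar approximation, which is equivalent to averaging blocks of 8 bins. The actual pipeline may treat the wavelet coefficients differently.

```python
def scale_to_1x(counts):
    """Scale per-bin read counts so that the mean bin value is 1 (assumed 1X proxy)."""
    mean = sum(counts) / len(counts)
    return [c / mean for c in counts]


def ratio_to_reference(sample, reference):
    """Divide each S-phase bin by the matching G1 bin; empty G1 bins yield 0."""
    return [s / r if r > 0 else 0.0 for s, r in zip(sample, reference)]


def haar_approximation(values, level=3):
    """Keep only the Haar approximation at `level`: block means over 2**level bins."""
    block = 2 ** level
    smoothed = []
    for start in range(0, len(values), block):
        chunk = values[start:start + block]
        smoothed.extend([sum(chunk) / len(chunk)] * len(chunk))
    return smoothed


# Hypothetical read counts for sixteen 1-kb bins (not data from this study).
early = [12, 8, 10, 14, 6, 9, 11, 10, 2, 3, 1, 2, 4, 2, 1, 1]
g1 = [10] * 16

profile = haar_approximation(
    ratio_to_reference(scale_to_1x(early), scale_to_1x(g1))
)
```

In this toy case the first eight bins end up with a smoothed value above 1 (stronger early replication than the genome-wide average) and the last eight with a value below 1.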
We chose not to represent the data as a "log ratio," as is often done in replication timing
studies (Lee et al., 2010; Ryba et al., 2011; Pope et al., 2012), because low intensity replication
activity transformed to a log ratio would have resulted in a negative number. This creates
problems for downstream analyses both computationally and conceptually. Moreover, the ability
of log ratio plots to compress extreme values is not necessary here because Repli-Seq profiles
cover a limited range of values.
Distribution of Replication Activity within Chromosomes
As illustrated for Arabidopsis chromosome 1 (Fig. 2B, upper panel; Supplemental Fig. S7),
visualization of the replication activity at the whole chromosome scale shows a temporal pattern
along the chromosome. Early replication intensity is stronger in the distal arms and decreases
progressively toward the centromeres. Conversely, late replication is concentrated near the
centromere. In contrast, replication during mid S phase is more evenly distributed. The trend is
consistent across all five chromosomes, although the early replication signal is less intense in the
short arms of the acrocentric chromosomes 2 and 4 (Supplemental Fig. S7).
Visualization on a smaller scale confirmed that the distal arms replicate mainly in early S and
centromeric regions replicate in late S. It also revealed that the proximal arms tend to replicate
predominantly in mid S phase, further supporting the trend described above (Fig. 2B, lower
panels). We quantified the fraction of replication at each time as a function of the distance from
the centromere for all ten Arabidopsis chromosome arms. Because the chromosome arms vary in
length, each arm was partitioned into 10 equal size bins and the fraction of total replication in
each bin was determined at each time (Fig. 2D). When the results were plotted as a function of
relative distance from the centromere, it was clear that early replication increases as the distance
from the centromere increases (Fig. 2D, left panel). In contrast, nearly half of late replication
occurs in the three bins closest to the centromere (Fig. 2D, right panel). Mid S replication is more
uniformly distributed and clearly different from early replication (Fig. 2D, middle panel).
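As a rough illustration of the arm-partitioning step, the sketch below splits one chromosome arm into 10 equal-size bins and reports the fraction of the arm's total replication signal in each. The function name and the toy signal are hypothetical; the input is assumed to be a list of per-1-kb replication values for one S-phase fraction, ordered from the centromere outward, with a nonzero total.

```python
def fraction_by_relative_distance(signal, n_bins=10):
    """Fraction of an arm's total signal in each of n_bins equal-size bins.

    `signal` holds per-1-kb replication values ordered from the centromere
    toward the telomere (an assumption of this sketch).
    """
    total = sum(signal)
    n = len(signal)
    fractions = []
    for i in range(n_bins):
        start = i * n // n_bins
        end = (i + 1) * n // n_bins
        fractions.append(sum(signal[start:end]) / total)
    return fractions


# Hypothetical early S-phase signal that rises with distance from the centromere.
toy_early = list(range(1, 21))
early_fractions = fraction_by_relative_distance(toy_early)
```

Because each arm is reduced to the same 10 relative positions, arms of different lengths can be compared or pooled directly.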
Early and mid S phase also have distinct features when examined on a fine scale. The
differences were especially evident in regions where replication intensities were similar for both
time points. Overlaying early and mid-replication profiles in those regions often produced a
pattern of alternate early and mid local maxima (Fig. 2C, alternating blue and green line in the
top panel), suggestive of replication activity spreading over time from early replicating regions to
surrounding mid replicating sequences.
Segmentation Analysis
To facilitate more detailed analysis, we partitioned the genome into segments with similar
replication times using the Repliscan pipeline (Zynda et al., 2017). This method allows for the
possibility that replication of a given locus occurs in more than one time window. Our data
showed that no sequence replicated exclusively in a single time window (Fig. 2B; Supplemental
Fig. S4). Hence, for a given sequence, we will refer to the "prevalent" time of replication, in
which the replication signal is stronger than at the other times. The Repliscan pipeline uses a
two-step process to assign a prevalent replication time (RT) to a 1-kb bin based on its replication
intensity in early, mid and late S phase. First, 1-kb bins were classified either as replicating or
non-replicating based on a threshold established for each chromosome, and only bins with
replication intensity above the threshold were used for segmentation analysis. Second,
replication signals for each 1-kb bin were divided by the maximum value for that bin, scaling the
largest value to 1 and all others between 0 and 1. The bin was then labeled as replicating
predominantly at the time with a normalized signal above 0.5. If the bin contained one or more
signals within 50% of the highest signal, they were included in the classification. Adjacent bins
with the same RT were merged into larger segments. With this approach, we identified segments
replicating predominantly in early S phase (E), in both early and mid S phase (EM), only in mid
S phase (M), in both mid and late S phase (ML), only in late S phase (L), in early and late S
phase (EL), and at all the three times (EML) (Fig. 3, A and B). Regions with replication signal
below the threshold in all time points were not classified or included in our statistical analyses.
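The two-step assignment can be sketched in a few lines of Python. This is an illustration of the logic as described in the text, not the Repliscan code: the function names, the default threshold value, and the choice to compare the summed signal of a bin against the threshold are all assumptions of the sketch.

```python
TIMES = ("E", "M", "L")


def classify_bin(early, mid, late, threshold=1.0, cutoff=0.5):
    """Return the RT class of one 1-kb bin, or None if it is not classified.

    Step 1: bins whose summed signal does not exceed `threshold` are skipped
            (hypothetical rule; the real threshold is set per chromosome).
    Step 2: signals are scaled by the bin maximum, and every time point whose
            scaled signal is above `cutoff` contributes to the label.
    """
    signals = (early, mid, late)
    if sum(signals) <= threshold:
        return None
    top = max(signals)
    return "".join(t for t, s in zip(TIMES, signals) if s / top > cutoff)


def merge_segments(labels):
    """Merge adjacent bins with the same label into (start, end_exclusive, label)."""
    segments = []
    for i, label in enumerate(labels):
        if segments and segments[-1][2] == label:
            segments[-1] = (segments[-1][0], i + 1, label)
        else:
            segments.append((i, i + 1, label))
    return segments
```

For example, a bin with signals (2.0, 1.5, 0.1) scales to (1.0, 0.75, 0.05) and is labeled EM, while (2.0, 0.5, 0.1) is labeled E.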
The cumulative genomic coverage of each RT class is shown in Fig. 3C. A single prevalent
time of replication was identified for more than half of the genome (31% E + 20% M + 7% L =
58%), while most of the rest of the genome was evenly split between EM (21%) and ML (20%).
The EL and EML segment classes together constituted about 1% of the genome, and 2.5% of
the genome could not be classified. Given the clear separation of the sorting gates used to generate
the early, mid and late populations (Fig. 1B, upper panel), it is noteworthy that 41% of the
Arabidopsis genome replicates in the intermediate EM and ML classes. The timing heterogeneity
may reflect the presence of subpopulations of cells with related but distinct replication programs
and/or allelic heterogeneity that may have arisen during prolonged cell culture (Wang and Wang,
2012). The low coverage of L segments relative to E and M is also noteworthy because the width
(range of DNA content) of the three sorting gates was equivalent (Fig. 1B, upper panel).
The distribution of replication timing segments is similar for the five Arabidopsis
chromosomes with the exception of the short arms of chromosomes 2 and 4, which have very
few early segments (Fig. 3B). The distal portions of longer chromosome arms are covered with
large E segments (>50-100 kb) interspersed with small EM and M segments (
detected several EL and EML segments, but due to their small size and low frequency, we did
not include them in subsequent analyses (Fig. 3D).
Replication Time and Genomic Features
To explore the relationship between replication timing and major genomic features, we
queried the Repli-Seq data using Araport11 genome annotations (Cheng et al., 2017). A visual
comparison of the RT segmentation data with genes and transposable elements (Fig. 4A;
Supplemental Fig. S9) showed that the gene-rich chromosome arms replicate in E, EM and M
while the TE-rich pericentromeric region replicates in ML and L, as described above and
reported previously (Lee et al., 2010). To obtain a more detailed picture, we computed the
cumulative overlaps of genes, pseudogenes, TEs and unannotated sequences with RT classes for
the entire Arabidopsis genome. The overlaps were expressed as a percent of total genomic
coverage for a given feature to adjust for abundance differences (Fig. 4B). This analysis gave
results similar to the visual inspection of chromosome 1, with segment coverage of genes highest
in E and EM and TEs highest in ML and L.
To assess if the distributions of the genomic features across the RT classes are statistically
different from the distribution across the whole genome, we built a contingency table with the
absolute overlaps expressed as the number of 1-kb bins (Table 2) and applied a chi-square test
for homogeneity. Differences in the overlaps showed high statistical significance (p-value < 2.2E-16,
χ2 = 25,561, df = 12). However, when analyzing a large population (N = 116,063), small
differences between observed and expected values almost always generate a statistically
significant p-value (Sullivan and Feinn, 2012). For this reason, we estimated the "effect size" of
the test, defined as the "magnitude of association between categorical variables" (Kotrlik et al.,
2011), by calculating Cramér's V statistic. The Cramér's V value for our data was 0.27, within the
0.2 - 0.4 range for a "moderate association" (Rea and Parker, 2005), indicating a nonrandom
distribution of genomic features in the RT classes.
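The reported effect size can be checked from the quoted statistics using the standard definition V = sqrt(χ2 / (N × (min(rows, columns) − 1))). The sketch below assumes a 4 × 5 contingency table (four feature categories by five RT classes), which is inferred here from df = 12 = (4 − 1)(5 − 1) and is not stated explicitly in the text; the function name is hypothetical.

```python
import math


def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V from a chi-square statistic, sample size and table shape."""
    return math.sqrt(chi2 / (n * (min(n_rows, n_cols) - 1)))


# Values quoted in the text; the 4 x 5 table shape is an assumption based on df = 12.
v_features = cramers_v(chi2=25561, n=116063, n_rows=4, n_cols=5)
print(round(v_features, 2))
```

Under that assumption the result rounds to 0.27, in agreement with the value given above.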
Next, we identified which genomic features and replication timing segments overlapped
either more or less than expected by examining the sign and value of the chi-square adjusted
residuals (Agresti, 2007). We split the adjusted residuals into tertiles and classified the relevant
combinations as overrepresented (highest tertile), underrepresented (lowest tertile), or similar to
expected (central tertile). The arrows and dots in Fig. 4B indicate the assigned category.
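For readers unfamiliar with adjusted residuals, the following sketch computes them for an arbitrary contingency table using the standard formula (observed − expected) / sqrt(expected × (1 − row share) × (1 − column share)). The function name and the 2 × 2 toy table are hypothetical and are not taken from Table 2.

```python
import math


def adjusted_residuals(table):
    """Adjusted (standardized) residuals for a contingency table given as a list of rows."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    residuals = []
    for i, row in enumerate(table):
        out = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            se = math.sqrt(
                expected * (1 - row_totals[i] / n) * (1 - col_totals[j] / n)
            )
            out.append((observed - expected) / se)
        residuals.append(out)
    return residuals


# Hypothetical counts of 1-kb bins: rows are two features, columns are two RT classes.
toy_table = [[30, 10], [10, 30]]
toy_residuals = adjusted_residuals(toy_table)
```

A positive residual marks a feature and RT class combination that overlaps more than expected under independence, and a negative residual marks one that overlaps less.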
The statistical analysis confirmed that genes are over-represented in E, EM and M and
under-represented in ML and L segments (Fig. 4B). Pseudogenes are enriched in ML segments. This
may be due in part to the association of "processed pseudogenes," the products of
retrotransposition events (Zheng et al., 2007), with TE-rich pericentromeric regions that replicate
in ML (Fig. 4A). Unannotated regions overlap more with E and less with M and ML segments
relative to the total genome. The enrichment of unannotated regions in E segments may reflect the
fact that the distances between genes in the distal arms are generally much longer than the spaces
between TEs or between genes and TEs in the pericentromeres (Fig. 4A). It is worth noting that
depletion of unannotated regions in L segments is not statistically significant and, instead, is
most likely due to poor annotation of the centromeric regions.
We then determined the number of protein coding genes, pseudogenes and TE genes in each
RT class (Fig. 4C). To control for differences between RT segment coverage, the counts were
normalized over the genomic coverage for each RT class and expressed as the number of
elements per Mb. The densities of protein coding genes in E (287/Mb), EM (308/Mb) and M
(291/Mb) are very similar, then drop in ML (173/Mb) and L (58/Mb) segments. Conversely, TE
genes are very sparse in E (4/Mb), EM (10/Mb) and M (18/Mb) but densely packed in ML
(79/Mb) and L (155/Mb) segments. The density of pseudogenes across the RT classes is low due
to their low number in the Arabidopsis genome.
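The per-Mb normalization is a simple ratio of element count to class coverage. A minimal sketch follows, with a hypothetical function name and made-up inputs chosen only to show the arithmetic; the actual counts and coverages are those summarized in Fig. 4C.

```python
def density_per_mb(element_count, coverage_bp):
    """Number of elements per megabase of genomic coverage."""
    return element_count / (coverage_bp / 1_000_000)


# Hypothetical example: 574 genes in an RT class covering 2 Mb gives 287 genes/Mb.
example_density = density_per_mb(574, 2_000_000)
```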
We also computed the fraction of each segment covered by genes, TEs and unannotated
sequences and generated boxplots showing the range of coverage within each RT class (Fig. 4D).
Consistent with the other analyses, E, EM and M segments are gene-rich (left panel) and
depleted in TEs (center panel), while ML and L segments are TE-rich and have lower gene
content. The unannotated region content of different RT classes is more uniform (right panel),
with slightly higher content in E and EM.
Together, our results indicated that the genomic features associated with the M segments are
more similar to those in E and EM segments than in ML and L segments. This was true even
though the M segments replicate at a distinct stage of S phase and are more likely to be located
in the proximal regions of the chromosome arms, while the E and EM segments are
predominantly located in the distal regions.
The above analyses only used sequence tags that mapped uniquely to the Arabidopsis
genome and, as such, did not address replication timing of repetitive sequences. To analyze
replication timing of repeats, we queried all the reads after initial processing with TEL, CEN,
45S and 5S repeat sequences from the Plant Repeat Databases (Ouyang and Buell, 2004).
Arabidopsis telomeric sequences consist of 2-5 kb stretches of 5'-CCCTAAA-3' repeat units
(TEL) (Richards and Ausubel, 1988), while centromeres and pericentromeres contain about
20,000 copies of a 180-bp satellite repeat (CEN) in long arrays extending for several
megabases (Lermontova et al., 2015). The 570-750 copies per haploid genome of 45S rRNA
genes (45S rDNA) form two 4-Mbp arrays in nucleolar organizing regions located at the
ends of the short arms of chromosomes 2 and 4 (Copenhaver and Pikaard, 1996; Havlová et
al., 2016). The pericentromeres of chromosomes 3, 4 and 5 also contain heterogeneous arrays
including about 1000 copies of the 5S rRNA genes (5S rDNA) (Vaillant et al., 2007).
For each S-phase dataset, we computed the fraction of reads aligning to each repeat
consensus and normalized it to the fraction of reads in the G1 control that aligned to the same
consensus (Fig. 4E). The resulting ratio is a measure of enrichment or depletion of a given repeat
in reads from early, mid or late S phase. CEN sequences are strongly enriched in late S phase and
depleted in early and mid, in agreement with the late replication timing of the centromeres (Fig.
2A; Supplemental Fig. S7). TEL sequences replicate preferentially in early and mid S phase but
replication activity is also detectable in late S phase. The lack of a single predominant replication
time is likely due to asynchrony between telomeres. In human cells, the telomere replication
program is chromosome-specific and influenced by sequences in sub-telomeric regions (Arnoult
et al., 2010). Replication of both 5S and 45S rDNA occurs primarily in late S phase, consistent
with sequestration and silencing of most 5S and 45S rDNA gene copies by repressive
heterochromatin (Layat et al., 2012). However, some 5S and 45S rDNA genes are
transcriptionally active and packaged into permissive euchromatin (Douet and Tourmente, 2007;
Hamperl et al., 2013; Dvorackova et al., 2017). These active fractions may be the source of the
5S and 45S rDNA reads in the early and mid S-phase datasets.
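The repeat enrichment ratio described at the start of this paragraph is the fraction of S-phase reads matching a repeat consensus divided by the corresponding fraction in the G1 control. A minimal sketch, with a hypothetical function name and invented read counts:

```python
def repeat_enrichment(sample_hits, sample_total, g1_hits, g1_total):
    """Ratio of the repeat read fraction in an S-phase sample to that in G1.

    Values above 1 indicate enrichment of the repeat in that S-phase fraction,
    values below 1 indicate depletion.
    """
    return (sample_hits / sample_total) / (g1_hits / g1_total)


# Hypothetical counts: 3% of late S-phase reads versus 1% of G1 reads match a repeat.
late_ratio = repeat_enrichment(300, 10_000, 100, 10_000)
# Hypothetical counts: 0.5% of early S-phase reads versus 1% of G1 reads.
early_ratio = repeat_enrichment(50, 10_000, 100, 10_000)
```

Normalizing to G1 corrects for copy number, since a high-copy repeat attracts many reads in every sample regardless of when it replicates.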
Replication Time and Chromatin States
Chromatin structure influences the replication program (Hiratani et al., 2008; Schwaiger et
al., 2009; Picard et al., 2014), with early replication associated with euchromatin and late
replication associated with heterochromatin (Ding and MacAlpine, 2011). Some combinations of
epigenetic marks occur together more frequently than others (Kharchenko et al., 2011; Roudier
et al., 2011; Sequeira-Mendes and Gutierrez, 2016). These combinations define chromatin states
that describe the local chromatin environment more accurately than the traditional binary
classification and may correlate better with replication timing programs.
Arabidopsis chromatin has been classified into 6 different states (CS) using 16 epigenetic
marks by Wang et al. (2015). We chose this classification because it is biologically compatible
with the large size of replication timing segments compared to other functional regions like
transcription units. The classification described two euchromatic states (CS1 and CS5), two
heterochromatic states (CS6 and CS3), and two intermediate states (CS2 and CS4). Chromatin
void of any of the 16 histone marks was defined as "unclear" or CS0.
We used these chromatin states to examine the relationship between chromatin structure and
replication timing. First, we calculated the overlap between each CS and RT class (Fig. 5A).
Applying the same procedure as for genomic features, we built a contingency table
(Supplemental Table S3) and performed a chi-square test (p-value < 2.2E-16, χ2 = 44,932,
df = 24). The associated Cramér's V statistic is equal to 0.31, indicating a non-random distribution
of chromatin states in RT classes. The adjusted residuals for each combination of RT class and
CS were classified into tertiles, indicated by the black arrows and dots in Fig. 5A.
Inspection of the overlap between chromatin states and RT classes revealed that the
heterochromatic CS6 and CS3 are more abundantly represented in late replicating regions.
However, there is no simple relationship between chromatin states and the replication timing
segments (Fig. 5A). All of the chromatin states except for CS6 and CS3 include readily
discernible amounts of DNA replicating in each portion of S phase except for late. There is no
clear difference in the distribution of RT classes for the euchromatic states, CS1 and CS5, and
the intermediate states, CS2 and CS4. While there are small differences in the amount of early
replication associated with CS1, CS5, CS2 and CS4, none of these non-heterochromatic states
display a strong preference for any particular replication time (cf. the % RT class coverage in
Fig. 5B and Table 3). 371
Each RT class also contains multiple chromatin states (Fig. 5C). The most striking 372
differences in CS content are found between the three early to mid RT classes and the ML and 373
L classes. E, EM and M have substantial amounts of CS1, CS5, CS2 and CS4, while the L class
consists primarily of the heterochromatic states, CS6 and CS3. The ML class includes a similar amount of
CS6 but is greatly reduced for CS3, which is characterized by the canonical heterochromatin 376
marks H3K27me1 and H3K9me2 (Luo et al., 2013). Instead, the ML class has a large fraction of 377
CS4 and smaller amounts of CS1, CS5 and CS2, and appears transitional between the early to 378
mid RT classes and the L class. This idea is supported by the pairwise Spearman correlation 379
coefficients in the similarity matrix (Fig. 5D) showing that the chromatin composition of E, EM 380
and M are similar, while L has a distinctive heterochromatic signature and ML is in between. 381
Replication Timing and Chromatin Accessibility 382
Replication timing also correlates with chromatin accessibility (Farkash-Amar and Simon, 383
2010; Hansen et al., 2010; Yaffe et al., 2010; Takebayashi et al., 2012). In plants, open 384
chromatin has been associated with higher gene density and higher levels of transcription (Zhang 385
et al., 2012; Vera et al., 2014), but these studies did not examine the relationship between 386
chromatin accessibility and replication timing. Hence, we compared our replication timing data 387
with the genome-wide mapping of 34,254 DNase I hypersensitive sites (DHS) by Sullivan et al. 388
(2014). We calculated the number of DHS per kb for each replication timing segment and plotted 389
the distribution of the DHS densities for each RT class (Fig. 6A). The number of DHS/kb 390
progressively decreases from E to L segments. Interestingly, only E and EM show a median 391
DHS density above the genome average (0.28 DHS/kb). Only about 25% of M segments contain 392
more DHS than the average, while 25% of ML and 50% of L segments do not contain any DHS. 393
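The per-segment DHS density described above can be sketched as follows. This is only an illustration under stated assumptions: each DHS is reduced to a single midpoint coordinate, segments are half-open intervals on one chromosome, and the function name and all coordinates are hypothetical. How the original analysis assigned DHS to segment boundaries is not stated in the text.

```python
from bisect import bisect_left
from statistics import median

def dhs_density_by_class(segments, dhs_positions):
    """segments: list of (start, end, rt_class) in bp, half-open, one chromosome.
    dhs_positions: DHS midpoints in bp (assumed representation).
    Returns {rt_class: [DHS per kb for each segment of that class]}."""
    sites = sorted(dhs_positions)
    densities = {}
    for start, end, rt_class in segments:
        # Number of sites with start <= position < end.
        count = bisect_left(sites, end) - bisect_left(sites, start)
        per_kb = count / ((end - start) / 1000)
        densities.setdefault(rt_class, []).append(per_kb)
    return densities

# Hypothetical segments and DHS midpoints.
toy_segments = [
    (0, 10_000, "E"),
    (10_000, 14_000, "M"),
    (14_000, 30_000, "L"),
    (30_000, 35_000, "E"),
]
toy_dhs = [500, 2_500, 7_000, 9_900, 12_000, 31_000, 33_000]

densities = dhs_density_by_class(toy_segments, toy_dhs)
medians = {rt: median(values) for rt, values in densities.items()}
print(medians)  # {'E': 0.4, 'M': 0.25, 'L': 0.0}
```

Comparing each class median with a genome-wide average density then gives the kind of summary reported for Fig. 6A.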
To gain further insight into the relationship between DHS density and replication timing, 394
regions of high DHS density were compared with regions showing high local replication activity 395
in early, mid or late S (Fig. 6B). There is an association between DHS site density and local 396
maxima for replication in early S. In contrast, mid replication activity tends to decline around the 397
regions of highest DHS density. There are many fewer DHS sites in centromeric and 398
pericentromeric regions (Fig. 6C), and the DHS sites that are present in these regions do not 399
overlap with local maxima of late replication. Instead, the peaks of DHS density in these regions 400
are often associated with small peaks of early replication interspersed among the much stronger 401
regions of late replication (Fig. 6C). 402
The DHS analysis indicated that an open chromatin structure is associated with early 403
replication activity, whereas chromatin replicating in mid S phase, although still classified as 404
euchromatic, is less accessible. This behavior suggests a sequential model for euchromatin 405
replication, starting in regions that can be accessed readily by the replication machinery and then 406
spreading to less accessible regions. In contrast, late replication activity appears unaffected by 407
short-range variations in DHS density, raising the possibility that a different mechanism 408
regulates replication timing within heterochromatin, possibly involving long-range, subnuclear 409
topology similar to what has been suggested for larger genomes (Pope et al., 2014). 410
Replication Timing and Long-Range Chromosome Interactions 411
Chromosome conformation capture (Hi-C) techniques, which characterize long distance 412
interactions and reveal large-scale spatial patterns of chromatin, have uncovered two distinct
subnuclear compartments in animals (Lieberman-Aiden et al., 2009; Hou et al., 2012; Zhang et al.,
2012). These compartments, which differ widely in nuclease accessibility, gene density, 415
transcriptional activity and epigenetic marks, correlate with early and late replicating domains 416
that span 0.1-2 Mbp (Ryba et al., 2010; Ryba et al., 2011). 417
Hi-C analysis of the Arabidopsis genome has indicated that its spatial organization is much 418
simpler. Arabidopsis telomeres interact more frequently with other telomeres and with the distal 419
regions of their adjacent chromosome arms, while pericentromeres interact with the adjacent 420
proximal regions of their chromosome arms as well as with other pericentromeres (Feng et al., 421
2014; Grob et al., 2014). This bipartite configuration recalls the overall distribution of replication 422
activity in early, mid and late S phase (Fig. 4A). To examine the relationship between
three-dimensional proximity and replication timing patterns, we compared the RT classes to the
chromosome conformation capture datasets described by Liu et al. (2016). We chose this dataset 425
because its reproducibility was established by an earlier study (Wang et al., 2015).
We aligned the Hi-C reads to the TAIR10 reference genome and identified significant 427
interactions (p-value < 0.001) at 100-kb resolution. To focus attention on long range interactions, 428
we imposed a minimum 1-Mbp separation between interacting loci because of the strong bias 429
toward local interactions (Dekker et al., 2002; Lieberman-Aiden et al., 2009). We also did not 430
consider inter-chromosomal interactions because the in-solution ligation method used to generate 431
this dataset is known to inflate the number of trans interactions (Nagano et al., 2015). Finally, we 432
excluded sequences within 1 Mbp of telomeres because telomeres tend to interact with very high 433
frequency compared to the rest of the genome (Supplemental Fig. S10) (Feng et al., 2014). 434
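The filtering criteria in this paragraph can be expressed as a single predicate. The sketch below is an assumption-laden illustration, not the pipeline used in the study: the record layout, the function name, the use of bin start coordinates to measure separation, the treatment of chromosome ends as a proxy for telomere positions, and the chromosome lengths are all hypothetical.

```python
BIN_SIZE = 100_000  # Hi-C resolution stated in the text

# Hypothetical chromosome lengths, not TAIR10 values.
TOY_CHROM_LENGTHS = {"Chr1": 30_000_000, "Chr2": 20_000_000}

def keep_interaction(pair, chrom_lengths, p_cutoff=0.001,
                     min_separation=1_000_000, telomere_margin=1_000_000):
    """Return True if a Hi-C bin pair passes all four filters.
    pair: dict with chrom1, start1, chrom2, start2 (bin starts, bp), p_value."""
    if pair["p_value"] >= p_cutoff:
        return False  # not a significant interaction
    if pair["chrom1"] != pair["chrom2"]:
        return False  # inter-chromosomal (trans) interaction
    if abs(pair["start2"] - pair["start1"]) < min_separation:
        return False  # short-range interaction
    length = chrom_lengths[pair["chrom1"]]
    for start in (pair["start1"], pair["start2"]):
        end = start + BIN_SIZE
        if start < telomere_margin or end > length - telomere_margin:
            return False  # bin lies within the margin at a chromosome end
    return True

toy_pairs = [
    {"chrom1": "Chr1", "start1": 5_000_000, "chrom2": "Chr1",
     "start2": 20_000_000, "p_value": 1e-5},   # passes every filter
    {"chrom1": "Chr1", "start1": 5_000_000, "chrom2": "Chr1",
     "start2": 5_500_000, "p_value": 1e-5},    # loci too close together
    {"chrom1": "Chr1", "start1": 5_000_000, "chrom2": "Chr2",
     "start2": 9_000_000, "p_value": 1e-5},    # trans interaction
    {"chrom1": "Chr1", "start1": 300_000, "chrom2": "Chr1",
     "start2": 20_000_000, "p_value": 1e-5},   # too near a chromosome end
    {"chrom1": "Chr1", "start1": 5_000_000, "chrom2": "Chr1",
     "start2": 20_000_000, "p_value": 0.01},   # not significant
]
kept = [p for p in toy_pairs if keep_interaction(p, TOY_CHROM_LENGTHS)]
print(len(kept))  # 1
```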
Significant Hi-C interactions and associated RT classes are shown for Arabidopsis 435
chromosome 1 in Fig. 7A. Three main groups of interactions are apparent: interactions
within the pericentromere (Mbp 13.5-16.5), within each chromosome arm, and between the distal 437
parts of the two arms. This pattern agrees well with the large-scale pattern of early-replicating 438
arms and late-replicating pericentromeres (Fig. 3B; Fig. 4A). Interestingly, while pericentromeric 439
sequences mainly interact between themselves, the distal arms contact other early replicating 440
regions on both chromosome arms. All chromosomes show a similar organization (Supplemental 441
Fig. S11), except the short arms of the acrocentric chromosomes 2 and 4. These results suggested 442
that sequences in spatial proximity within the nucleus tend to replicate at the same time during S 443
phase, irrespective of their map positions along the chromosome. 444
We then analyzed the pairs of interacting bins identified by Hi-C to determine the interaction 445
profile for each RT class. The resolution of our replication data is much higher than the Hi-C 446
data, so each Hi-C bin can contain multiple RT classes. To address this, we analyzed separately 447
all the interacting pairs of Hi-C bins in which the first bin included a given RT class. Next, we 448
summarized the RT segment classes in the second bin in the pair (Fig. 7B). Some pairs were 449
assigned multiple times corresponding to each RT segment class included in the first bin. We 450
performed the analysis in both directions with similar results, confirming that the choice of the 451
first and second bins in each interacting pair did not influence the outcome (Supplemental Fig. 452
S12). The E, EM and M segment classes have nearly identical interaction profiles, with a slight 453
increase of ML and L segments in bins interacting with EM and M segments relative to E. The L 454
segments interact preferentially with ML and L bins, while the ML segments interact with all RT 455
classes. The ML and L groups are smaller than the E, EM and M groups due to the reduced 456
genomic coverage of these classes (Fig. 3C). To account for this disparity, we expressed the 457
interaction profiles as percent of total coverage for each interaction group (Fig. 7C; Table 4). The
cumulative interaction profile of all the groups taken together is also shown for reference. 459
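The grouping procedure described in this paragraph can be sketched in a few lines. The sketch assumes that each Hi-C bin is summarized by the number of base pairs covered by each RT class and that the partner bin contributes its coverage to every group whose RT class is present in the first bin; these data structures, the function names and the toy values are illustrative assumptions rather than the implementation used for Fig. 7.

```python
from collections import defaultdict

def interaction_profiles(pairs, bin_composition):
    """pairs: list of (first_bin, second_bin) identifiers.
    bin_composition: {bin_id: {rt_class: bp covered by that class}}.
    Returns {group_class: {partner_class: total bp in partner bins}}."""
    profiles = defaultdict(lambda: defaultdict(int))
    for first, second in pairs:
        for group_class, covered in bin_composition[first].items():
            if covered == 0:
                continue
            # A pair is counted once for every RT class present in the first
            # bin, so some pairs contribute to several groups.
            for partner_class, bp in bin_composition[second].items():
                profiles[group_class][partner_class] += bp
    return profiles

def as_percent(profile):
    """Express one interaction profile as percent of its total coverage."""
    total = sum(profile.values())
    return {rt: 100 * bp / total for rt, bp in profile.items()}

# Hypothetical 100-kb bins and interacting pairs.
toy_bins = {
    "b1": {"E": 60_000, "EM": 40_000},
    "b2": {"E": 100_000},
    "b3": {"L": 70_000, "ML": 30_000},
    "b4": {"L": 100_000},
}
toy_pairs = [("b1", "b2"), ("b3", "b4"), ("b1", "b3")]

profiles = interaction_profiles(toy_pairs, toy_bins)
print(as_percent(profiles["E"]))  # {'E': 50.0, 'L': 35.0, 'ML': 15.0}
```

Running the same function on the pairs with the two bins swapped corresponds to the reverse-direction check mentioned in the text.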
We calculated a Pearson correlation matrix for the overlap of the RT classes with interacting 460
partners of each group (Supplemental Table S5) and plotted the results as a heat map (Fig. 7D). 461
The interaction profiles of the E, EM and M groups are strongly correlated, while the L group 462
has a distinct and opposite interaction profile. The interactions of the ML group are intermediate, 463
reinforcing the transitional nature of this RT class. The two interaction clusters related to 464
replication timing, the E/EM/M cluster and the L cluster, correlate with the large-scale
organization of chromosomes into early replicating arms and late replicating pericentromeres. 466
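A pairwise Pearson correlation matrix of this kind can be computed directly from the percent-coverage profiles. The sketch below uses invented profiles (they are not the values in Table 4 or Supplemental Table S5) chosen only to mimic the qualitative pattern described: similar E, EM and M profiles, an opposite L profile and an intermediate ML profile.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

RT_CLASSES = ["E", "EM", "M", "ML", "L"]

# Hypothetical interaction profiles: percent of partner coverage in
# E, EM, M, ML and L for each interaction group.
toy_profiles = {
    "E":  [35, 25, 25, 10, 5],
    "EM": [33, 25, 26, 11, 5],
    "M":  [32, 24, 26, 12, 6],
    "ML": [22, 18, 22, 22, 16],
    "L":  [5, 5, 10, 35, 45],
}

matrix = {
    a: {b: pearson(toy_profiles[a], toy_profiles[b]) for b in RT_CLASSES}
    for a in RT_CLASSES
}
for a in RT_CLASSES:
    print(a, [round(matrix[a][b], 2) for b in RT_CLASSES])
```

With these toy values the E, EM and M rows are strongly positively correlated with one another and negatively correlated with L, which is the pattern a heat map such as Fig. 7D would display.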
DISCUSSION 469
The Genome-Wide Arabidopsis Replication Program at High Resolution 470
We used a new high-resolution strategy to characterize the replication timing program of 471
Arabidopsis suspension cells at the whole genome level. Nearly 60% of the Arabidopsis genome 472
was classified as replicating principally in either early, mid or late S phase. Unlike our earlier 473
study (Lee et al., 2010), clear differences were observed between the sequence populations 474
replicating in early and mid S phase. However, 41% of the genome showed strong replication 475
activity in more than one portion of S phase, indicative of heterogeneity in replication timing. 476
Several factors contributed to the increased resolution of our new strategy. Potentially most 477
important, we shortened the labeling time from 1 hour to 10 min after determining that the 478
duration of S phase is only 1.5-1.9 hours for our Arabidopsis cultured cells (Mickelson-Young et 479
al., 2016). We also reduced the widths of the sorting gates and increased the distance between 480
them to minimize cross contamination between nuclei in early, mid and late S phase (Fig. 1). 481
Finally, EdU conjugation to AF488 allowed us to use a two-way sorting strategy to resolve 482
replicating from non-replicating nuclei and reduce contamination of EdU-labeled DNA by 483
unlabeled DNA in the immunoprecipitates. 484
The increased resolution is apparent in maps of the raw sequencing reads, which show 485
distinct replication profiles for early, mid and late S phase across 3 highly reproducible 486
biological replicates (Fig. 1C; Supplemental Fig. S5). Although the narrow sorting gates only 487
captured about 50% of the S-phase nuclei, the entire Arabidopsis genome was represented in the 488
read profiles. This may reflect heterogeneity in replication time among genome sequences and/or 489
technical limitations associated with the sensitivity of the flow cytometer, as demonstrated by a 490
study in human cells that used six sorting gates (Hansen et al., 2010). 491
The increased resolution is also evident in visual comparisons between the replication 492
profiles for Arabidopsis chromosome 4 generated using a 1-h BrdU pulse versus the 10-min EdU 493
pulse (Supplemental Fig. S13). The profiles for early S phase are very similar, but there are 494
major differences in the mid and late S profiles obtained using the two protocols. These 495
differences correspond to regions that overlap between early and mid or mid and late in the BrdU 496
profiles. The overlap between adjacent time points most likely reflects the inclusion of regions 497
that incorporated BrdU as cells moved from earlier to later S phase during the 1-h pulse, which 498
represents ca. 50% of the length of S phase in the Arabidopsis cultured cells (Mickelson-Young 499
et al., 2016). Notably, there is less overlap between the profiles generated using a 10-min EdU 500
pulse, indicating that the Arabidopsis replication timing program is less stochastic than proposed 501
previously (Lee et al., 2010). 502
We presented the EdU replication profiles separately for each time point, rather than assign a 503
unique replication time to each locus based on the ratio between early and late, as is often 504
described in the literature (Hiratani et al., 2008; Schwaiger et al., 2009; Gilbert, 2010). By doing 505
so, we highlighted the fact that some sequences replicate with high intensity in more than one 506
portion of S phase (Fig. 2B; Supplemental Fig. S4). This almost always happens in consecutive 507
time points, like early-mid or mid-late S phase. However, because of the short pulse length, wide 508
separation between the gates, and sharp separation between populations of sorted nuclei, the 509
heterogeneity is unlikely to be a technical artifact. Given that a sequence can replicate only once 510
in a single cell, this heterogeneity is most likely due to variation between cells in the suspension 511
culture. However, differences between alleles at a locus, often generated in cell cultures by 512
somaclonal variation (Wang and Wang, 2012) may also contribute to the observed heterogeneity. 513
Segmentation Analysis 514
To reduce the complexity of the data and assign replication times to regions across the 515
Arabidopsis genome, we used the Repliscan pipeline (Zynda et al., 2017) to assign a 516
predominant replication time based on the relative intensity of normalized signal in all three time 517
points. This analysis allowed us to score replication that occurs in more than one time window at 518
a given locus, better representing heterogeneous replication. 519
The segmentation analysis assigned a single prevalent replication time (E, M or L) to more 520
than half of the Arabidopsis genome, with the rest divided between EM and ML. Only 1% of the 521
genome was not assigned to a single time or two adjacent times, underscoring the robustness of 522
the segmentation analysis. The shorter labeling time and placement of gates to minimize overlap 523
and emphasize mid-replicating sequences (Fig. 1) led to significant differences in segmentation 524
from our earlier analysis of Arabidopsis chromosome 4 (Lee et al., 2010) (Supplemental Fig. 525
S8). In the current study, 17% of chromosome 4 was classified as EM compared to 37% in the 526
previous study. Concomitantly, sequences classified as E increased to 26% from less than 1% 527
and as M to 22% from 4%. Coverage of L was reduced to 9% from 44%, with most of the late 528
replicating segments located in a few megabases near the centromere. This reduction may reflect 529
the narrower late gate and a shift in its placement to improve resolution. However, sequences 530
classified as ML increased to 23% from 6% and included regions previously regarded as late 531
replicating (Fig. 1B), consistent with the shorter labeling time increasing resolution. EL and 532
EML declined from 8% to 1%. The sizes of the segments identified in this study (Fig. 3D) are
comparable to the putative replicons described for Arabidopsis chromosome 4 (Lee et al., 2010) 534
and some animal systems (MacAlpine et al., 2004; Lebofsky et al., 2006; Schwaiger et al., 2009). 535
However, our analysis did not uncover evidence of the larger replication domains that have been 536
described in mammals (Hiratani et al., 2008; Ryba et al., 2010). 537
The Replication Program and Genome Organization 538
All five Arabidopsis chromosomes showed the same general pattern of replication timing 539
(Fig. 2B; Supplemental Fig. S7). At a macroscopic level, the distal portions of the chromosome 540
arms replicate earlier than proximal regions, while pericentromeric and centromeric regions 541
replicate last. The short arms of chromosomes 2 and 4 are exceptions because they replicate 542
mainly in M and ML, perhaps because of their proximity to pericentromeric regions. This 543
organization agrees generally with the biphasic model of replication that we proposed previously 544
for Arabidopsis (Lee et al., 2010). Analysis of RT classes in relation to genomic features (Fig. 545
4D) suggested that E, EM and M segments are predominantly euchromatic, and ML and L 546
segments are primarily heterochromatic. However, the distribution of chromatin states across the 547
RT classes is more complex, with each RT class including multiple chromatin states and each 548
chromatin state including several RT classes. This diversity suggests that, particularly in the 549
portion of the genome classically regarded as euchromatic, replication timing may be determined 550
to a large extent by factors that are independent of local chromatin states or by epigenetic 551
features not included in the chromatin state analysis. 552
Replication timing data is thought to integrate transcriptional, epigenetic and spatial 553
information across the genome (Hiratani and Gilbert, 2009), and its inclusion in modeling can 554
inform chromatin state assignments. Wang et al. (2015) classified CS2 and CS4 as intermediate 555
between euchromatin and heterochromatin. These assignments were based in part on the lack of
transcription of CS2 and CS4 and no enrichment for histone marks associated with active 557
transcription. However, the large amount of CS2 and CS4 in the E, EM and M RT classes 558
indicates that major fractions of these states are in an open, accessible conformation 559
characteristic of euchromatin. Thus, CS2 and CS4 may include nontranscribed euchromatin that 560
replicates with transcribed euchromatin (CS1 and CS5) during early to mid S phase. This idea is 561
supported by the near absence of CS2 and only a small fraction of CS4 replicating with 562
heterochromatin in late S phase. ML segments, which include both euchromatic (CS1, CS5, CS2, 563
and CS4) and heterochromatic (CS6 and CS3) chromatin states, represent a transition from 564
replicating euchromatin to replicating heterochromatin. 565
Comparison of replication timing and chromosome conformation data showed that E, EM 566
and M segments interact with each other with equal frequency within and between the arms of a 567
chromosome, L segments interact predominantly with Hi-C bins located in the pericentromeres 568
that encompass ML and L RT classes, while ML segments interact with all RT classes (Fig. 7). 569
This pattern of interaction is consistent with the Arabidopsis genome consisting of two main 570
genomic compartments, one that replicates during early to mid S phase and another that
replicates in late S phase. This bipartite chromosomal architecture is reminiscent of the "open" 572
and "closed" compartments identified in the human genome (Lieberman-Aiden et al., 2009). The 573
two compartments have distinctive epigenomic and expression features and correlate with 574
replication time (Hansen et al., 1996; Ryba et al., 2010). It has been proposed that because of the 575
compact nature of the Arabidopsis genome and differences in chromatin organization between 576
plants and metazoans, the pericentromeric regions and chromosome arms may correspond 577
functionally to the closed and open compartments in mammalian genomes (Grob et al., 2014;
Feng et al., 2014). 579
The datasets used for chromatin state and the long-range interaction studies (and the DHS 580
data discussed below) were generated from Arabidopsis seedlings. During plant development, 581
actively proliferating cells are localized primarily to meristematic regions and primordia and 582
include all cell cycle stages. As a consequence, only a small fraction of the cells used to create 583
the seedling datasets were in S phase. For this reason, future studies that use chromatin data from 584
mitotic cells may uncover relationships between replication timing and chromatin that were not 585
apparent in the comparisons here. 586
Nature of Mid S-Phase Replication 587
Replication in mid S phase may reflect spreading from regions that initiate during early S 588
phase and/or initiation and elongation events specific to mid S phase. In our data, the 589
distributions of read densities are sharply different in early and mid S phase, with early reads 590
displaying high local maxima separated by deep troughs, while mid reads are more evenly 591
distributed with smaller peaks and dips. These profiles are consistent with models postulating 592
firing of low efficiency origins during mid S phase (Guilbaud et al., 2011), as well as with other 593
models involving replication of regions lacking origins by unidirectional fork progression 594
(Desprat et al., 2009; Ryba et al., 2010). Both of these mechanisms can be incorporated into a 595
model in which origins are not distributed uniformly across a genome (Rhind, 2014; Kaykov et 596
al., 2016) and compete for replication factors (Mantiero et al., 2011), with the likelihood of 597
replication initiating in a given region depending primarily on its origin. 598
According to the above models, early replicating regions of the Arabidopsis genome would 599
have more origins and origin clusters, and mid replicating regions would have fewer, more 600
dispersed origins but would not differ dramatically with respect to sequence composition or 601
global chromatin features. The only genome-wide study describing putative origin sequences in 602
Arabidopsis is biased for early replication due to the use of sucrose starvation to arrest cells in 603
G1 before release into BrdU in the presence of hydroxyurea to deplete nucleotide pools (Costas 604
et al., 2011), and thus cannot provide insight into whether origins are enriched in early versus 605
mid-replicating regions. However, our model is supported by the observation that Arabidopsis 606
sequences replicating in early or mid S phase overlap similar genomic features (Fig. 4D) and 607
display similar chromatin state (Fig. 5, C and D) and chromatin interaction profiles (Fig. 7, B, C, 608
and D). However, these regions have different sensitivities to DNase I digestion, with early
regions, but not mid regions, enriched for DHS sites (Fig. 6, B and C). Local maxima in early
regions are DHS-rich, while local maxima in mid regions are DHS-depleted, suggesting that 611
early replication is associated with a higher degree of chromatin accessibility than mid 612
replication. In this context, it is interesting that the replication program of the human genome can 613
be accurately simulated by a model in which an initiation probability landscape is determined by 614
the locations of DHS sites (Gindin et al., 2014). 615
Comparison to the Maize Replication Timing Program 616
We recently characterized replication timing in maize root tips labeled with EdU (Wear et al.,
2017). The global distributions of the replication timing signals in maize and Arabidopsis are
similar, with chromosome arms replicating earlier and pericentromeric and centromeric regions 619
replicating later. Like Arabidopsis, maize replication is distributed across the RT classes. 620
However, there are more early replicating regions and fewer late regions in Arabidopsis than 621
maize. This difference likely reflects the very different genic and nongenic (TEs and noncoding 622
sequences) content of the two genomes (Arabidopsis - 51% genic and 49% nongenic; maize - 8% 623
genic and 92% nongenic), with genic sequences tending to replicate earlier. In addition, there are 624
more dispersed blocks of ML and L replicating DNA in maize chromosome arms, which are 625
typically organized into genic regions separated by TE clusters. Maize TEs (81% of the genome) 626
are very abundant in all RT classes, with those closer to genes replicating earlier. In contrast, 627
Arabidopsis TEs (20% of the genome) are located primarily in pericentromeric regions and
enriched in ML and L classes. 629
There are other important similarities between the Arabidopsis and maize replication timing 630
programs. Strikingly, the sizes of the RT segments are similar even though the maize genome is 631
ca. 20-fold larger than Arabidopsis. Some loci show heterogeneity with respect to replication 632
timing in both plant species. Moreover, early replicating regions are more accessible than mid 633
replicating regions. This comparison underscores the role of genome structure in replication 634
timing and highlights common features that are independent of genome organization. 635
CONCLUSION 636
We developed a high-resolution approach to study the replication program of eukaryotic 637
genomes and applied it to the model plant Arabidopsis thaliana, extending our previous analysis 638
of chromosome 4 (Lee et al., 2010) to the entire genome. Our results confirmed the basic 639
observation that euchromatin replicates during early and mid S phase and heterochromatin 640
replicates in late S phase, similar to most other eukaryotes (Hiratani et al., 2008; Schwaiger et 641
al., 2009; Ryba et al., 2010). However, in this study, we better resolved early and mid-replication
patterns within euchromatin. Although very similar in their association with most
genomic features and chromatin marks, early and mid-replicating sequences differ strikingly in 644
chromatin accessibility as measured by DHS density. This finding is of particular interest in 645
connection with a recent model proposing that origin accessibility to replication factors is one of 646
the primary determinants of replication programs (Rhind, 2014). The model, which integrates 647
sequential activation of origins with stochastic firing, efficiently predicted the human replication 648
program (Gindin et al., 2014). 649
MATERIALS AND METHODS 651
Arabidopsis Cell Culture and Nuclei Isolation 652
The Arabidopsis thaliana cell line (Col-0, ecotype Columbia) was maintained as described 653
by Lee et al. (2010). Labeling followed the 7-d split protocol, in which 25 mL of fresh medium 654
and 25 mL of a 7-day culture are mixed and grown for 16 h. At 16 h, the cells were labeled with 655
10 µM 5-ethynyl-2'-deoxyuridine (EdU, Life Technologies) for 10 min. Labeling was
terminated by fixing the cells in 1% paraformaldehyde with gentle agitation for 10 min, followed 657
by quenching the formaldehyde with 0.125 M glycine. Fixed cells were filtered through two 658
layers of Miracloth mesh and transferred to 1X phosphate buffered saline (PBS). They were 659
washed in PBS three times and snap frozen in liquid nitrogen. Cells from eight cultures were 660
combined for each of three biological replicates. 661
Nuclei were isolated as described previously (Lee et al., 2010; Wear et al., 2016) with the 662
addition of a Percoll gradient step. The frozen cell pellet was ground at 4°C in 40 mL of cell lysis
buffer (15 mM Tris-HCl pH 7.5, 2 mM EDTA, 80 mM KCl, 20 mM NaCl, 15 mM
β-mercaptoethanol, and 0.1% Triton X-100) using a commercial blender. The ground cell
suspension was incubated for 5 min at 4°C, filtered through two layers of Miracloth, and
centrifuged at 400 × g for 5 min at 4°C. Nuclei were enriched using a Percoll step gradient as
described by Folta and Kaufman (2006) with minor modifications. The nuclei pellet was
resuspended in 25 mL of extraction buffer (2 M hexylene glycol, 20 mM PIPES-KOH pH 7.0, 10
mM MgCl2, 5 mM β-mercaptoethanol) and centrifuged at 1500 × g over a discontinuous density
gradient (30% and 80% v/v Percoll in gradient buffer: 0.5 M hexylene glycol, 10 mM MgCl2, 5
mM PIPES-KOH pH 7.0, 5 mM β-mercaptoethanol and 1% w/v Triton X-100) for 30 min at
4°C. The nuclei recovered from the 30:80% Percoll interface were resuspended in 15 mL of
gradient buffer and centrifuged at 1500 × g over a cushion of 30% Percoll (v/v) in gradient buffer
for 10 min at 4°C.
After washing the nuclei pellet in modified cell lysis buffer (15 mM Tris-HCl pH 7.5, 2 mM
EDTA, 80 mM KCl, 20 mM NaCl, and 0.1% Triton X-100), the incorporated EdU was
conjugated with Alexa Fluor 488 (AF488) using a Click-iT EdU Alexa Fluor 488 Imaging kit
(Life Technologies) as described previously (Wear et al., 2016). Finally, the nuclei were
resuspended in the original cell lysis buffer containing 2 µg/mL DAPI and filtered through a
CellTrics 20-µm nylon mesh filter (Partec) just before flow cytometry and sorting.
Flow Cytometry and Sorting
An InFlux flow cytometer (BD Biosciences) equipped with UV (535 nm) and blue (488
nm) lasers was used to sort nuclei by DNA content (DAPI fluorescence) and EdU incorporation
(fluorescence of the conjugated AF488). Events were triggered on forward-angle light scatter
(FSC), and data were collected using 90° side scatter (SSC) and 460/50 nm and 530/40 nm
bandpass filters (Bass et al., 2014; Wear et al., 2016). Plots of SSC vs. 460/50 nm (DAPI) were
used to set analysis and sorting gates that excluded cellular debris.
Sub-stage gates were used to sort labeled nuclei into pools representing early, mid and late
S phase, as well as unlabeled nuclei in G1 phase as a source of non-replicating reference DNA. The
sorting gates were separated from each other to minimize overlap between the sorted populations
(Fig. 1B). For each biological replicate, between 90,000 and 160,000 nuclei for each S phase fraction
and 1 million unlabeled G1 nuclei were collected in tubes containing STE buffer (100 mM NaCl,
10 mM Tris-HCl pH 7.5, 1 mM EDTA). A small sample of nuclei (~12,000-16,000) was also
sorted from each gate into cell lysis buffer augmented with 2 µg/mL DAPI and reanalyzed to
determine the sort purity (Supplemental Fig. S3). Flow cytometry data were analyzed using
FlowJo software (Tree Star Inc.).
Genomic DNA Extraction and Immunoprecipitation of EdU/AF488-Labeled DNA
Genomic DNA was extracted as described previously (Lee et al., 2010) with minor
modifications. After overnight incubation with proteinase K, the samples were incubated with
RNase A (50 µg/mL) for 1 h at 37°C prior to addition of PMSF (0.7 mg/mL). The DNA was
extracted once with phenol/chloroform/isoamyl alcohol (25:24:1) and twice with chloroform,
and precipitated with 0.6 volumes of ice-cold isopropanol overnight at −20°C. The DNA was
pelleted by centrifugation, washed twice with 1 mL of 70% ethanol and resuspended in 130 µL
of IP dilution buffer (167 mM NaCl, 16.7 mM Tris-HCl pH 8, 1.2 mM EDTA and 1.1% (v/v)
Triton X-100). A Covaris S220 ultrasonicator was used to shear the DNA to an average size of
300 bp (parameters: intensity 5, duty cycle 10%, cycles per burst 200, treatment time 180 s).
After shearing, 370 µL of IP dilution buffer (Gendrel et al., 2005) was added, and the sheared
DNA solution was precleared by gentle agitation in 20 µL of magnetic protein G beads
(Dynabeads, Life Technologies) pre-equilibrated with IP dilution buffer at 4°C for 1 h. The
beads were removed with a magnet and newly synthesized DNA was immunoprecipitated by
incubating with a 1:200 dilution of anti-Alexa Fluor 488 antibody (Molecular Probes,
#A-11094) at 4°C overnight. The DNA-antibody complex was captured with 25 µL of
pre-equilibrated protein G beads at 4°C for 2 h, followed by washing the beads as described by
Gendrel et al. (2005). Bound DNA was eluted from the beads in 250 µL of elution buffer (1%
(w/v) SDS, 100 mM sodium bicarbonate) at 65°C for 15 min. The supernatant was transferred to a
new tube and the elution was repeated, for a final volume of 500 µL. Eluted DNA was purified with
a QIAquick PCR Purification Kit (Qiagen) according to the manufacturer's directions. To
maximize DNA recovery, pre-warmed (50°C) TE was used for the elution step.
Library Construction, Sequencing and Analysis of Repli-Seq Data
Immunoprecipitated DNA was used to construct sequencing libraries with the NEXTflex
Illumina ChIP-Seq Library Prep Kit (Bioo Scientific) using the ultra-low input protocol. After
adapter ligation, the libraries were amplified with 18 cycles of PCR with the Expand High
FidelityPLUS PCR System (Roche). For each experiment, individual samples were barcoded and
pooled. The libraries were sequenced on an Illumina HiSeq 2000 platform.
Raw sequencing data were processed using Trim Galore! (v0.3.7) to remove 3′ universal
adapters from the paired reads, trim 5′ ends with fastq quality scores below 20, and remove
trimmed reads shorter than 40 bp. The quality-controlled reads were then aligned to the
Arabidopsis TAIR10 genome with BWA mem (v0.7.4) using default parameters (Li, 2013).
After alignment, reads with multiple alignments were discarded using samtools 1.3 (Li et al.,
2009). For mapping statistics and total sequence coverage, see Supplemental Table S1.
Data were then analyzed as described by Zynda et al. (2017). The scripts can be found at
https://github.com/zyndagj/repliscan. Read densities were scored in 1-kb bins across the genome
and normalized using sequence depth scaling (Ramírez et al., 2016). The correlation between
biological replicates was assessed using multiBigwigSummary and plotted as a heatmap using
plotCorrelation in the deepTools 2.0 suite (Ramírez et al., 2016). Replicates were highly correlated
(Supplemental Fig. S5).
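The binning and depth-scaling step described above can be illustrated with a minimal Python sketch. It is not the repliscan implementation: the function names, the use of read start positions, and the choice of a common target total (here one million reads) are assumptions made only for illustration, and the exact scaling used in the pipeline may differ.

```python
def bin_counts(read_starts, chrom_length, bin_size=1000):
    """Count reads per fixed-width bin from 0-based read start positions."""
    n_bins = (chrom_length + bin_size - 1) // bin_size
    counts = [0] * n_bins
    for pos in read_starts:
        counts[pos // bin_size] += 1
    return counts


def depth_scale(counts, target_total=1_000_000):
    """Rescale counts so that every sample sums to the same total.

    Assumption: scaling to a shared total is used here as one simple form
    of sequence depth scaling.
    """
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    factor = target_total / total
    return [c * factor for c in counts]
```

For example, four hypothetical reads starting at positions 0, 999, 1000 and 2500 on a 3-kb sequence fall into three 1-kb bins with counts 2, 1 and 1, which are then rescaled by a single factor per sample.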
Biological replicates were aggregated by taking the median value in each 1-kb bin. Bins with
coverage in the upper and lower 0.1% tails of a calculated normal distribution were removed.
Values for each of the S-phase samples were divided by the value for the non-replicating G1
reference in the corresponding bin to normalize for sequencing bias. To reduce noise, Haar
wavelet smoothing was performed using the software package wavelets from Percival and
Walden (2000). The Haar wavelet method was chosen because, unlike kernel smoothing
methods, it reduces differential noise without spreading peak boundaries.
Classifying Predominant Replication Time
The method used to assign a predominant time of replication to each 1-kb bin across the
genome is described by Zynda et al. (2017). Each bin was classified as replicating at a given time
point if its normalized replication intensity was above a chromosome-specific threshold value, as
calculated by the following procedure. Total coverage, defined as the fraction of the
chromosome with a signal greater than the threshold in at least one replication time window, was
computed as a continuous function of the threshold value using a cubic spline interpolation
across the replication values. The first derivative of the coverage function was then calculated
using the central difference formula to show the rate of coverage change.
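The coverage function and its first derivative can be sketched in Python as follows. This is a simplified illustration under stated assumptions: coverage is evaluated on a fixed grid of thresholds rather than through the cubic spline interpolation used in the analysis, and the function names and input layout (one tuple of early, mid and late signals per bin) are hypothetical.

```python
def coverage(bins, threshold):
    """Fraction of bins whose signal exceeds the threshold in at least one
    replication time window. Each bin is a tuple (early, mid, late)."""
    covered = sum(1 for signals in bins if max(signals) > threshold)
    return covered / len(bins)


def coverage_curve(bins, thresholds):
    """Coverage evaluated at each threshold on a grid (no spline here)."""
    return [coverage(bins, t) for t in thresholds]


def central_differences(xs, ys):
    """Central-difference estimate of dy/dx at the interior grid points."""
    return [
        (ys[i + 1] - ys[i - 1]) / (xs[i + 1] - xs[i - 1])
        for i in range(1, len(xs) - 1)
    ]
```

With such a curve in hand, the threshold at which coverage changes fastest is the grid point where the derivative has its largest magnitude.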
Starting from the point with the highest rate of coverage change (maximum first derivative),
the threshold was lowered until the first derivative of the coverage vs. threshold curve effectively
flattened out. Below this point any additional signals were uninformative because those regions
had already been classified as replicating in other time points. The predominant replication time
for a given 1-kb bin was then assigned by considering the relative amounts of total replication
signal in early, mid and late S phase. For each 1-kb bin, the three signals were divided by the
maximum value, scaling the largest value to 1 and others between 0 and 1. The bin was labeled
as the combination of times with a normalized signal above 0.5. This strategy allowed single
prevalent time and combinatorial time classifications to be assigned to a given 1-kb bin. Bins
were classified as undetermined if none of the signals in any of the three time samples reached the
threshold value.
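The per-bin steps described in this and the preceding section (median aggregation of replicates, division by the G1 reference, scaling to the maximum signal, and labeling with the 0.5 cutoff) can be sketched in Python. This is an illustrative sketch, not the code of Zynda et al. (2017): the function names, the single-letter labels E, M and L, the handling of bins with no G1 signal, and the treatment of a signal exactly equal to the threshold are all assumptions.

```python
from statistics import median


def g1_ratio(s_phase_replicates, g1_replicates):
    """Median across replicates, then S-phase value divided by G1 value."""
    g1 = median(g1_replicates)
    if g1 <= 0:
        return 0.0  # assumption: bins without G1 signal are treated as empty
    return median(s_phase_replicates) / g1


def classify_bin(early, mid, late, threshold, cutoff=0.5):
    """Label a bin with every time whose scaled signal exceeds the cutoff."""
    signals = [("E", early), ("M", mid), ("L", late)]
    peak = max(value for _, value in signals)
    # No signal above the chromosome-specific threshold: undetermined.
    if peak <= threshold or peak <= 0:
        return "undetermined"
    # Divide by the maximum so the largest signal becomes 1.
    return "".join(name for name, value in signals if value / peak > cutoff)
```

A bin with hypothetical signals of 2.0, 1.2 and 0.4 scales to 1, 0.6 and 0.2 and is therefore labeled as replicating in both early and mid S phase, whereas a bin dominated by its late signal receives a single label.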
Replication Intensity and Relative Distance from the Centromere
Centromere positions in each chromosome were identified with the bedtools 2.25.0
genomecov utility (Quinlan and Hall, 2010) as 1-kb bins with the maximum coverage of 180-bp
repeats (Nagaki et al., 2003). Using normalized replication intensity in early, middle and late S
phase, the percent of total replication occurring in bins representing successive 10% portions of a
given chromosome arm was calculated with a custom R script (R Development Core Team,
2016). Replication within each interval, expressed as percentage of total replication activity for
that chromosome arm in that portion of S phase, was plotted as a function of the relative distance
from the centromere (Fig. 2D) using the R package ggplot2 (Wickham, 2009).
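The interval calculation described above was done with a custom R script that is not reproduced here. The following Python sketch shows one way the same quantity could be computed for a single chromosome arm and a single portion of S phase; the function name, the argument layout, and the assumption that all bin positions lie on the arm between the centromere and the arm end are illustrative only.

```python
def percent_by_interval(positions, intensities, centromere, arm_end,
                        n_intervals=10):
    """Percent of an arm's total replication signal in successive intervals
    of relative distance from the centromere (10% portions by default)."""
    arm_length = abs(arm_end - centromere)
    totals = [0.0] * n_intervals
    for pos, value in zip(positions, intensities):
        relative = abs(pos - centromere) / arm_length
        # Clamp so a bin at the very end of the arm lands in the last interval.
        index = min(int(relative * n_intervals), n_intervals - 1)
        totals[index] += value
    grand_total = sum(totals)
    if grand_total == 0:
        return totals
    return [100.0 * t / grand_total for t in totals]
```

The returned list has one value per interval, ordered from the centromere outward, and sums to 100 for any arm with nonzero signal.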
Association of Replication Timing with Genomic Features and Repeat Sequences
Genomic annotations of genes, pseudogenes and transposable elements (TEs) were obtained
from the Araport11 database (TAIR10_GFF3_genes_transposons.AIP.gff.gz at
https://www.araport.org/downloads/TAIR10_genome_release/annotation). Unannotated regions
were defined as the difference between the genome and all the annotated features. For viewing in
IGV 2.3.60 and comparison with Repli-Seq data, the coverage of genes and TEs was defined as
the percentage of bases in a specified portion of the genome that overlap with that feature. Gene
and TE coverage was scored in 1-kb bins with bedtools v2.25.0 genomecov and map utilities. For
visualization in IGV, the data were smoothed using a 50-kb moving average with the R package
zoo (Zeileis and Grothendieck, 2005). A custom script is available upon request.
Associations of genomic features with RT segmentation classes were computed with
bedtools v2.25.0 intersect, and their statistical significance assessed with a chi-squared test. The
adjusted residuals (Agresti, 2007) were used to measure how strongly each
combination of genomic feature and RT class contributed to the statistical significance of the
associations.
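For readers unfamiliar with adjusted residuals, the calculation can be sketched in Python for a contingency table of overlaps (rows as genomic features, columns as RT classes). The formula follows the standard definition, in which each residual is the observed minus expected count divided by sqrt(expected × (1 − row proportion) × (1 − column proportion)). The function name is hypothetical and the sketch assumes that no row or column total is zero.

```python
from math import sqrt


def adjusted_residuals(table):
    """Return the chi-squared statistic and the matrix of adjusted residuals
    for a contingency table given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    residuals = []
    for i, row in enumerate(table):
        residual_row = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
            scale = sqrt(expected
                         * (1 - row_totals[i] / n)
                         * (1 - col_totals[j] / n))
            residual_row.append((observed - expected) / scale)
        residuals.append(residual_row)
    return chi2, residuals
```

Cells with large positive residuals are over-represented relative to independence, and cells with large negative residuals are under-represented.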
Telomere-related (TEL), centromere-related (CEN), 45S and 5S ribosomal DNA sequences
were obtained from Plant Repeat Databases (Ouyang and Buell, 2004). The replication timing of
each group of repeats was assessed as described by Gent et al. (2014). Reads from individual
biological replicates of G1, early, mid and late S phase samples were aligned to consensus
sequences for each group using BLAST software (parameter -e 1e-8) (Camacho et al., 2009). For
each sample and biological replicate, the number of reads that aligned to each repeat family was
normalized to the total number of reads present in the sample. Finally, the relative abundance of
each family in the early, mid or late reads was normalized to the relative abundance of the same
family in the G1 reference.
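The two normalization steps for repeat families reduce to a ratio of proportions, as in this minimal Python sketch. The function names and the read counts in the example are hypothetical and are not taken from the data.

```python
def relative_abundance(reads_on_family, total_reads):
    """Reads aligned to a repeat family as a fraction of all reads."""
    return reads_on_family / total_reads


def enrichment_vs_g1(s_hits, s_total, g1_hits, g1_total):
    """Relative abundance in an S-phase sample divided by that in G1."""
    return (relative_abundance(s_hits, s_total)
            / relative_abundance(g1_hits, g1_total))
```

For instance, a family that accounts for 200 reads per million in a late S phase sample and 100 reads per million in G1 would have a late enrichment of 2.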
Association of Replication Timing with Chromatin States, DNase I Hypersensitivity Sites and Chromosome Conformation
Repli-Seq data were compared with the chromatin state dataset produced by Wang et al.
(2015). The overlaps in bp between each chromatin state (CS) and the five major RT segment
classes were calculated using bedtools v2.25.0 intersect, and plotted as absolute and relative
coverage. Statistical significance was assessed with a chi-squared test. We used the chi-square
adjusted residuals (Agresti, 2007) to identify which RT classes were most different from the
expected value in each chromatin state group of features, compared to the genome. The absolute
coverage of each chromatin state in each RT class was used to compute the Spearman correlation
coefficient between RT classes using the function cor in R, and subsequently plotted as a heat
map with the package corrplot (Wei and Simko, 2016).
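The analysis above used the function cor in R. As a language-neutral illustration of what a Spearman coefficient between two RT classes measures, the following Python sketch ranks the per-chromatin-state coverage values of each class (using average ranks for ties) and computes the Pearson correlation of the ranks. The function names are hypothetical, and the sketch assumes that neither input is constant.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Here each input would be the vector of absolute coverage values of one RT class across the chromatin states, and the pairwise coefficients fill the correlation matrix.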
To compare replication timing profiles with DNase I hypersensitivity sites (DHSs), we used
the dataset (GEO accession PRJNA231710) described by Sullivan et al. (2014). The density of
DHSs in each RT class (Fig. 6A; Supplemental Fig. S4) was determined using data from control
experiments. The number of DNase cleavages from signal files (Accessions GSM1289359 and
GSM1289363) was averaged at 1-kb steps across the genome and smoothed using a 5-kb moving
average. The resulting DHS density distribution was plotted as a heat map and overlaid with the
early, mid and late replication intensity signals. The DNase I read density files
(Col-0.7d_Seedling.NA.NA.DS19992.signal.bw and Col-0.7d_Seedling.NA.NA.DS21094.signal.bw)
are at http://plantregulome.org/public/dnase/other/all-reads/signal/. The DNase I hypersensitive
peak files (Col-0.7d_Seedling.NA.NA.DS19992.peaks.bed.gz and
Col-0.7d_Seedling.NA.NA.DS21094.peaks.bed.gz) are at
http://plantregulome.org/public/dnase/other/all-reads/peaks/.
We used the dataset (Accession number SRR2626429) described by Liu et al. (2016) for
chromosome conformation analysis. Sequencing reads were aligned to the TAIR10 reference
genome, and experimental artifacts such as circularized fragments, PCR duplicates, re-ligated
adjacent sequences and wrong-size fragments were removed using HiCUP with the default
parameters (Wingett et al., 2015). Significant interactions, defined as pairs of loci that have a
greater number of Hi-C reads than expected by chance (p-value < 0.001), were identified at
100-kb resolution using HOMER (Heinz et al., 2010) and visualized using the CIRCOS tool
(Krzywinski et al., 2009) together with the genome segmentation in RT classes. Within each
interacting pair of 100-kb bins, we randomized the order of the first and second bins and split the
interactions into groups based on the content of RT classes in the first bin. The absolute and relative
overlaps of the second bins with RT classes were computed with bedtools v2.25.0 intersect. A Pearson
correlation matrix was computed using the function cor in R, and subsequently plotted as a heat
map with the package corrplot (Wei and Simko, 2016).
Accession Numbers
Repli-Seq data from this study are in the NCBI Sequence Read Archive (SRA) under the
umbrella accession number PRJNA330547. The SRA numbers are: G1 SAMN05417671, Early
SAMN05417674, Mid SAMN05417672, and Late SAMN05417673. Processed data files
(E_ratio_3.smooth.bedgraph; M_ratio_3.smooth.bedgraph; L_ratio_3.smooth.bedgraph;
ratio_segmentation.gff3) are available from the CyVerse (previously iPlant Collaborative;
Merchant et al., 2016) Data Store. The Nimblegen microarray data for Arabidopsis
chromosome 4 replication timing are at Gene Expression Omnibus under accession number
GSE103321. The tiling microarray data for Arabidopsis chromosome 4 replication timing can be
found at Array Express under accession number E-GEOD-30433.
Supplemental Data
The following supplemental materials are available.
Supplemental Table S1. Statistics for sequenced libraries
Supplemental Table S2. Adjusted residuals for chi-square test on contingency Table 2 describing the overlaps between genomic features and RT classes
Supplemental Table S3. Overlap between chromatin states (CS) and RT classes
Supplemental Table S4. Adjusted residuals relative to the chi-square test on the contingency Table S3 describing the overlaps between chromatin states (CS) and RT classes
Supplemental Table S5. Coverage of RT classes of genomic bins establishing significant long-range interactions
Supplemental Figure S1. Comparison of replication timing profiles generated using tiling and Nimblegen arrays
Supplemental Figure S2. Spearman correlation matrix for tiling (TL) and Nimblegen (NG) array platforms
Supplemental Figure S3. Sorting gates and reanalysis of sorted fractions
Supplemental Figure S4. Distribution of read density for each sequencing library in representative 1 Mb regions of Arabidopsis chromosomes 1, 3 and 5
Supplemental Figure S5. Spearman correlation matrix of read densities of sequenced samples
Supplemental Figure S6. Comparison of linear ratio versus Log2 ratio
Supplemental Figure S7. Large-scale distribution of read density on the five Arabidopsis chromosomes
Supplemental Figure S8. Comparison of the distribution of RT classes on Arabidopsis chromosome 4
Supplemental Figure S9. Replication timing and genomic features
Supplemental Figure S10. Hi-C background models generated with HOMER for Arabidopsis chromosome 1