23
1 1 HiCORE: Hi-C analysis for identification of core chromatin loops with 2 higher resolution and reliability 3 4 Hongwoo Lee 1 and Pil Joon Seo 1,2,* 5 6 1 Department of Chemistry, Seoul National University, Seoul 08826, Korea 7 2 Plant Genomics and Breeding Institute, Seoul National University, Seoul 08826, Korea 8 9 * To whom correspondence should be addressed: [email protected] 10 11 . CC-BY 4.0 International license made available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprint this version posted October 1, 2020. ; https://doi.org/10.1101/2020.10.01.322024 doi: bioRxiv preprint

HiCORE: Hi-C analysis for identification of core chromatin ......2020/10/01  · 3 34 Introduction 35 Hi-C analysis maps all possible genomic interactions in a manner dependent on

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: HiCORE: Hi-C analysis for identification of core chromatin ......2020/10/01  · 3 34 Introduction 35 Hi-C analysis maps all possible genomic interactions in a manner dependent on

1

1 HiCORE: Hi-C analysis for identification of core chromatin loops with

2 higher resolution and reliability

3

4 Hongwoo Lee1 and Pil Joon Seo1,2,*

5

6 1Department of Chemistry, Seoul National University, Seoul 08826, Korea

7 2Plant Genomics and Breeding Institute, Seoul National University, Seoul 08826, Korea

8

9 *To whom correspondence should be addressed: [email protected]

10

11

.CC-BY 4.0 International licensemade available under a(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is

The copyright holder for this preprintthis version posted October 1, 2020. ; https://doi.org/10.1101/2020.10.01.322024doi: bioRxiv preprint

Page 2: HiCORE: Hi-C analysis for identification of core chromatin ......2020/10/01  · 3 34 Introduction 35 Hi-C analysis maps all possible genomic interactions in a manner dependent on

2

12 Abstract

13 Genome-wide chromosome conformation capture (3C)-based high-throughput sequencing

14 (Hi-C) has enabled identification of genome-wide chromatin loops. Because the Hi-C map

15 with restriction fragment resolution is intrinsically associated with sparsity and stochastic

16 noise, Hi-C data are usually binned at particular intervals; however, the binning method has

17 limited reliability, especially at high resolution. Here, we describe a new method called

18 HiCORE, which provides simple pipelines and algorithms to overcome the limitations of

19 single-layered binning and predict core chromatin regions with 3D physical interactions. In

20 this approach, multiple layers of binning with slightly shifted genome coverage are generated,

21 and interacting bins at each layer are integrated to infer narrower regions of chromatin

22 interactions. HiCORE predicts chromatin looping regions with higher resolution and

23 contributes to the identification of the precise positions of potential genomic elements.

24

25 Author Summary

26 The Hi-C analysis has enabled to obtain information on 3D interaction of genomes. While

27 various approaches have been developed for the identification of reliable chromatin loops,

28 binning methods have been limitedly improved. We here developed HiCORE algorithm that

29 generates multiple layers of bin-array and specifies core chromatin regions with 3D

30 interactions. We validated our algorithm and provided advantages over conventional binning

31 method. Overall, HiCORE facilitates to predict chromatin loops with higher resolution and

32 reliability, which is particularly relevant in analysis of small genomes.

33

.CC-BY 4.0 International licensemade available under a(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is

The copyright holder for this preprintthis version posted October 1, 2020. ; https://doi.org/10.1101/2020.10.01.322024doi: bioRxiv preprint

Page 3: HiCORE: Hi-C analysis for identification of core chromatin ......2020/10/01  · 3 34 Introduction 35 Hi-C analysis maps all possible genomic interactions in a manner dependent on

3

34 Introduction

35 Hi-C analysis maps all possible genomic interactions in a manner dependent on three-

36 dimensional (3D) distance. In this method, cross-linked chromatin is digested into fragments

37 by a restriction enzyme [1]. Then, restriction fragments are proximity-ligated to generate a

38 library of chimeric circular DNA. Paired-end sequencing and mapping of reads to a reference

39 genome identifies interacting restriction fragments and measures their interaction frequencies

40 [2]. Unfortunately, despite the high sequencing depth of typical Hi-C experiments, single

41 fragment–resolution contact matrices are too sparse to analyze. Hence, to ensure efficient and

42 reliable data processing, Hi-C data are usually binned with fixed intervals at low resolution

43 (>5 kb) [3-5].

44 Because the resolution of the Hi-C map is a critical issue for 3D genome

45 conformation studies, multiple studies have focused on improving the resolution of raw Hi-C

46 matrices [6-10]. Although many cutting edge technologies have been implemented, including

47 machine learning-based computational methods, the binning strategy is still an important

48 determinant of Hi-C resolution. Two major types of binning methods have been used in

49 practice: fixed interval binning and fragment unit binning. Fixed interval binning allows

50 intervals to have a regular size (e.g., 5 kb) without considering restriction fragment size,

51 whereas fragment unit binning has an interval defined by the size of restriction fragment(s)

52 [5]. Although fixed interval binning is a practical approach that facilitates efficient analysis of

53 Hi-C data, fixed intervals mismatch restriction fragments of variable sizes, especially at a

54 high resolution, creating bias in chromatin loop detection [5]. Thus, fragment unit binning

55 methods have also been improved in parallel for high-resolution Hi-C analysis. Because

56 single fragment resolution analysis is limited due to contact matrix sparsity, as an alternative

57 method, several restriction fragments are assembled to generate a single bin (here called

.CC-BY 4.0 International licensemade available under a(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is

The copyright holder for this preprintthis version posted October 1, 2020. ; https://doi.org/10.1101/2020.10.01.322024doi: bioRxiv preprint

Page 4: HiCORE: Hi-C analysis for identification of core chromatin ......2020/10/01  · 3 34 Introduction 35 Hi-C analysis maps all possible genomic interactions in a manner dependent on

4

58 multi-fragment binning) [6, 11].

59 Following the binning step, genomic looping regions are identified by various

60 methods, which can be roughly divided into two categories depending on the methodology

61 used for chromatin loop detection. The first category, including HiCCUPS [12] and cLoop

62 [13], primarily identifies chromatin loops located at topologically associating domain (TAD)

63 border regions that are associated with enhancer–promoter interactions [14, 15]. The other

64 category, including Fit-HiC2 [16], HOMER (http://homer.ucsd.edu/homer/), and GOTHiC

65 [17], is well suited for the identification of smaller, local chromatin loops independently of

66 TAD structures [16, 18]. Because the methods of the latter category do not account for a high-

67 order 3D folding structure, they require a more robust post-analysis to determine the

68 reliability of chromatin loop prediction.

69 In this study, we developed the HiCORE algorithm, which includes an advanced

70 binning method and a reliable loop prediction strategy. HiCORE predicts fine-scale

71 interactions between genomic elements. Independently of the binning method, single-layered

72 binning (conventional method) is insufficient for defining chromatin looping regions,

73 especially at high resolution. To further improve resolution and reliability of detecting 3D

74 chromatin interactions, HiCORE generates multiple layers of bin arrays with different

75 genome coverages. Significantly interacting chromatin regions at each layer are identified

76 [16], and then this information from multiple layers is integrated to draw partially overlapping

77 and narrower chromatin regions with 3D interactions, which represent chromatin loops with

78 higher resolution. Prediction of high-resolution chromatin loops allows the estimation of

79 precise positions of genomic elements. We also validated the reliability of the specified

80 chromatin-interacting regions based on their relative statistical significances.

81

.CC-BY 4.0 International licensemade available under a(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is

The copyright holder for this preprintthis version posted October 1, 2020. ; https://doi.org/10.1101/2020.10.01.322024doi: bioRxiv preprint

Page 5: HiCORE: Hi-C analysis for identification of core chromatin ......2020/10/01  · 3 34 Introduction 35 Hi-C analysis maps all possible genomic interactions in a manner dependent on

5

82 Results and Discussion

83 Framework of HiCORE

84 HiCORE generated multiple layers of bin arrays, in which a single bin was defined through

85 serial assembly of fragments (multi-fragment binning) to have a size above a certain

86 threshold. The first layer was generated by forward multi-fragment binning initiated from the

87 starting point (5’-end) of the genome (forward binning), and the second layer was created by

88 reverse multi-fragment binning initiated from the end point (3’-end) of the genome (reverse

89 binning). Additional layers were generated by bidirectional binning from randomly selected

90 positions of the genome (random binning). Bin arrays at each layer could be shifted a bit

91 relative to one another, and thus have different genome coverages (Fig 1). An interaction-

92 frequency matrix of fragment–resolution Hi-C data was assigned to bin arrays at each layer,

93 and bin pairs with statistically significant interactions were identified using the Fit-HiC2

94 package [16]. Consequently, overlapping regions of interacting bins from multiple layers

95 could be drawn, enabling us to specify reliably interacting chromatin regions at higher

96 resolution (Fig 1).

97

98

99 Fig 1. HiCORE features.

100 Schematic diagram showing the working process of HiCORE. Multiple layers of multi-

101 fragment binning are produced, and chromatin looping regions at each layer are predicted

102 based on raw 3D contact matrices. HiCORE identifies overlapped regions of interacting bins

103 from multiple layers, specifying chromatin regions with 3D interaction at higher resolution.

104 Dashed lines indicate restriction sites. Solid lines indicate multi-fragment bin units larger than

105 a given cutoff bin size. Colored regions in each layer indicate chromatin looping regions.

.CC-BY 4.0 International licensemade available under a(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is

The copyright holder for this preprintthis version posted October 1, 2020. ; https://doi.org/10.1101/2020.10.01.322024doi: bioRxiv preprint

Page 6: HiCORE: Hi-C analysis for identification of core chromatin ......2020/10/01  · 3 34 Introduction 35 Hi-C analysis maps all possible genomic interactions in a manner dependent on

6

106 HiCORE suggests not only overlapped regions (hatched regions), but also expanded regions

107 (blue-shaded regions).

108

109

110 Dataset description

111 As a proof of concept, we used Hi-C data from the model plant Arabidopsis, which has a

112 relatively small genome [11]. Arabidopsis Hi-C analysis suggested that no clear 3D chromatin

113 conformation exists in the absence of TAD structures [19, 20]. Moreover, genes encoding

114 CCCTC-BINDING FACTOR (CTCF) insulator proteins have not been identified in the

115 Arabidopsis genome [21]. However, several studies raised the possibility that chromatin

116 looping is crucial for gene regulation in Arabidopsis [11, 22, 23]. Hence, we wanted to

117 implement a stringent method for 3D genome analysis that identifies reliable chromatin

118 looping regions in the model plant genome. We developed a pipeline consisting of HiCORE

119 (for binning) followed by the Fit-HiC2 packages, which can identify the chromatin loops

120 independently of TAD structures [16]. The output of Fit-HiC2 was further post-processed by

121 HiCORE (for loop detection with higher resolution and reliability).

122

123 Increased resolution and reliability in prediction of chromatin loops by HiCORE

124 To validate the relevance of HiCORE in specifying chromatin looping regions, we generated

125 multiple layers of bin arrays, in which a single bin was defined through multi-fragment

126 binning to have a size above 700bp. The threshold bin size was determined by the raw matrix

127 resolution of the original Hi-C data [14]. We integrated information about 3D chromatin

128 interactions from multiple layers and analyzed the average sizes of specified chromatin

129 regions with 3D contacts in units of fragments. Analysis with a single layer of binning

.CC-BY 4.0 International licensemade available under a(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is

The copyright holder for this preprintthis version posted October 1, 2020. ; https://doi.org/10.1101/2020.10.01.322024doi: bioRxiv preprint

Page 7: HiCORE: Hi-C analysis for identification of core chromatin ......2020/10/01  · 3 34 Introduction 35 Hi-C analysis maps all possible genomic interactions in a manner dependent on

7

130 (conventional method) revealed that the average size of all specified chromatin regions with

131 3D contacts was 4.6 fragments (Fig 2A). However, the HiCORE-generated multiple layers

132 enabled further specification of chromatin looping regions. Compared with single-layered

133 binning, the average size of all specified chromatin regions with 3D contacts was reduced by

134 ~50% (to 2.2 fragments) after integration of 20 layers (Fig 2A), demonstrating that chromatin

135 looping regions are better specified after HiCORE analysis. Surprisingly, when 20 layers of

136 bin arrays were overlapped, 44% of all predicted chromatin regions with 3D interactions were

137 specified at single-fragment resolution (S1 and S2 Fig). We further tested our HiCORE

138 algorithm with various cutoff values in the threshold bin size from 400 to 2000 bp, and

139 confirmed that chromatin looping regions could be further specified by HiCORE in all

140 threshold size examined (S3 Fig). In addition, the HiCORE-generated loops had higher

141 reliability, which was supported by lower q-values in Fit-HiC2 analysis (Fig 2B).

142

143

144 Fig 2. HiCORE test results.

145 (A) Average number of fragments in each specified chromatin-interacting region. (B)

146 Average q-values (minimum false discovery rate, FDR) of specified interacting chromatin

147 regions. The q-values were estimated by Fit-HiC2 packages [16]. In (A) and (B), for binning

148 at each layer, restriction fragments smaller than 700 bp are merged with neighboring

149 fragments until the bin is larger than 700 bp. One bin layer was added at a time, repeatedly,

150 and chromatin looping regions were specified. The x-axis indicates the number of layers

151 integrated for HiCORE analysis.

152

153

.CC-BY 4.0 International licensemade available under a(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is

The copyright holder for this preprintthis version posted October 1, 2020. ; https://doi.org/10.1101/2020.10.01.322024doi: bioRxiv preprint

Page 8: HiCORE: Hi-C analysis for identification of core chromatin ......2020/10/01  · 3 34 Introduction 35 Hi-C analysis maps all possible genomic interactions in a manner dependent on

8

154 In support, chromatin looping regions were better specified throughout the genome, compared

155 with conventional single-layered Fit-HiC2 analysis. For example, while three chromatin loops

156 were predicted by the conventional Fit-HiC2 analysis within Chr1:590 kb – 599 kb regions

157 (Fig 3A), HiCORE selected a reliable chromatin loop (shown in blue) and defined the

158 chromatin looping region with higher resolution (Fig 3B): one side specified at a single-

159 fragment unit and the other at a 2-fragment unit.

160

161

162 Fig 3. Comparison between conventional and HiCORE analysis in specification of

163 chromatin looping regions.

164 (A) Chromatin loops predicted by conventional binning method. Chromatin regions with 3D

165 interactions are indicated by black boxes. (B) Chromatin loops predicted by HiCORE

166 analysis. Chromatin regions with 3D interactions are indicated by blue boxes. The interaction

167 frequency was shown at single fragment scale.

168

169

170 Notably, HiCORE results could not be obtained by single-layered single fragment binning.

171 HiCORE analysis predicted a total of 1987 chromatin loops, 27% of which (541 chromatin

172 loops) were specified at single fragment resolution (S4 Fig). However, the single-layered Fit-

173 HiC2 method identified 185 chromatin loops at single fragment resolution (Fig 4A).

174 Furthermore, the average size of HiCORE-specified chromatin regions with 3D contacts (for

175 1987 chromatin loops) was much smaller than that by the conventional single-layered Fit-

176 HiC2 analysis (for 185 chromatin loops) (Fig 4B). The chromatin loops predicted by HiCORE

177 had also lower q-value (Fig 4A), indicating that HiCORE enables the successful identification

.CC-BY 4.0 International licensemade available under a(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is

The copyright holder for this preprintthis version posted October 1, 2020. ; https://doi.org/10.1101/2020.10.01.322024doi: bioRxiv preprint

Page 9: HiCORE: Hi-C analysis for identification of core chromatin ......2020/10/01  · 3 34 Introduction 35 Hi-C analysis maps all possible genomic interactions in a manner dependent on

9

178 of chromatin looping regions at higher resolution.

179

180

181 Fig 4. Comparison between conventional single-fragment binning method and HiCORE

182 analysis.

183 (A) The box plot of q-value distribution. (B) Average size (bp) of specified interacting

184 chromatin region. For HiCORE analysis, restriction fragments smaller than 700bp are merged

185 with neighboring fragments until the bin is larger than 700bp. Twenty bin layers were

186 integrated, and chromatin looping regions were specified.

187

188

189 We also showed the results in units of base-pair length (S5 Fig). However, at extremely high

190 resolution (cutoff value < matrix resolution), the average size of all specified chromatin

191 regions with 3D contacts in units of base-pair length did not gradually decrease as binning

192 layers were integrated (S5 Fig), although the average fragment number was reduced (Fig 2A).

193 The inconsistency was possibly due to the sparsity of 3D contact matrices in smaller bins,

194 which caused low reproducibility at each layer and thereby excluded during HiCORE

195 analysis. Thus, HiCORE can specify the core chromatin looping regions, but optimal analysis

196 can usually be performed above the matrix resolution.

197

198 Additional applications of HiCORE

199 The HiCORE method can also be used to profile all possible chromatin loop information

200 predicted by multiple binning layers. The single-layered binning can introduce significant bias

201 into the identification of chromatin loops. For instance, a chromatin loop in the Arabidopsis

.CC-BY 4.0 International licensemade available under a(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is

The copyright holder for this preprintthis version posted October 1, 2020. ; https://doi.org/10.1101/2020.10.01.322024doi: bioRxiv preprint

Page 10: HiCORE: Hi-C analysis for identification of core chromatin ......2020/10/01  · 3 34 Introduction 35 Hi-C analysis maps all possible genomic interactions in a manner dependent on

10

202 FLC locus was empirically validated [11, 23], but this loop was predicted by only a few

203 binning layers (Fig 5). This observation indicates that there is no best single binning layer that

204 sufficiently covers genomic interactions, and that comprehensive integration of chromatin

205 looping information predicted by multi-layered binning with different genome coverage is

206 required. HiCORE can provide a platform for data integration and optionally suggest a wide

207 range for possible chromatin looping regions (Fig 1). Furthermore, using HiCORE, chromatin

208 looping regions predicted by only a few layers could be selectively integrated to specify

209 narrower overlapping chromatin regions. Indeed, the empirically-validated FLC loop was

210 precisely specified by HiCORE analysis with integration of selected binning layers that had

211 statistically significant 3D-interaction values within the FLC locus (Fig 5).

212

213

214 Fig 5. Chromatin loops within the FLC locus.

215 For binning at each layer, restriction fragments smaller than 700bp were merged with

216 neighboring fragments until the bin is larger than 700bp. Twenty layers of binning were

217 produced. For each binning layer, chromatin loops in the genomic coordinates between

218 (<Chr5:3179000-3181500> : <Chr5:3172000-3174000>) are predicted (Green). Then, bin

219 layers with the FLC loop prediction were selected, and the overlapped regions were specified

220 by HiCORE (Yellow). The empirically-proven interaction around the FLC locus

221 (<Chr5:3172571-3173171> : <Chr5:3178619-3181368> is shown in red.

222

223

224 Hi-C is a commonly used technique for mapping 3D chromatin organization. Despite

225 significant advances in the accuracy and resolution of genome contacts, identification of fine-

.CC-BY 4.0 International licensemade available under a(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is

The copyright holder for this preprintthis version posted October 1, 2020. ; https://doi.org/10.1101/2020.10.01.322024doi: bioRxiv preprint

Page 11: HiCORE: Hi-C analysis for identification of core chromatin ......2020/10/01  · 3 34 Introduction 35 Hi-C analysis maps all possible genomic interactions in a manner dependent on

11

226 scale chromatin loops remains challenging, mostly because of low sequencing depth and

227 single-layered binning. The HiCORE algorithm takes advantage of an advanced binning

228 method, and also predicts higher-resolution chromatin looping by generating multiple layers

229 of bin arrays, which are slightly shifted relative to each other. Extraction of partially

230 overlapping, more narrowly interacting regions allows higher-resolution analysis of chromatin

231 looping from Hi-C data. HiCORE can be incorporated into current Hi-C data analysis

232 pipelines, improving the resolution in predicting chromatin looping regions. HiCORE analysis

233 will facilitate better separation of genomic contacts and the identification of potential genomic

234 elements.

235

236 Method

237 Hi-C data and their pre-processing

238 The high resolution Hi-C data of Arabidopsis were obtained from NCBI Sequence Read

239 Archive (SRA; http://www.ncbi.nlm.gov/sra/) under accession number SRP064711. The 4

240 replicates were mapped to Arabidopsis genome (TAIR10) and further processed by Juicer

241 packages [12]. We extracted the raw interaction matrix from the ‘.hic’ file that was derived

242 from reads with mapping quality > 30.

243

244 Binning strategy

245 Fragments with a size below a given cutoff length are merged with neighboring fragments.

246 The merged fragments are assembled into a single bin (multi-fragment bin). This binning

247 strategy is applied to all fragments except for the end fragment. For ‘forward binning’, from

248 the 5’ end of the chromosome, a restriction fragment shorter than the cutoff length is merged

249 with the following fragment (in the 3’-direction) until the merged fragment is longer than

.CC-BY 4.0 International licensemade available under a(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is

The copyright holder for this preprintthis version posted October 1, 2020. ; https://doi.org/10.1101/2020.10.01.322024doi: bioRxiv preprint

Page 12: HiCORE: Hi-C analysis for identification of core chromatin ......2020/10/01  · 3 34 Introduction 35 Hi-C analysis maps all possible genomic interactions in a manner dependent on

12

250 cutoff length. For ‘reverse binning’, from the 3’ end of the chromosome, a restriction

251 fragment shorter than the cutoff length is merged with the preceding fragment (in the 5’-

252 direction) until the merged fragment is longer than the cutoff length. For ‘random binning’,

253 HiCORE randomly selects restriction fragments for construction of a binning layer.

254 Randomly selected fragments shorter than the cutoff length are first merged with the

255 following fragment in the 3’-direction. In the middle of fragment assembly, if the next

256 fragment in the 3’-position was already merged with other fragments, the bin being generated

257 is merged with the fragment in the 5’-direction until the bin being generated is longer than the

258 cutoff length. If the neighboring fragments in both directions are already taken by other

259 completed bins, the bin being generated is divided into two units. The first unit (5’-

260 neighboring fragments of the random fragment) and the second unit (the random fragment

261 plus its 3’-neighboring fragments) of the bin being generated are separately merged to

262 already-completed neighboring bins in the 5’ and 3’ positions, respectively.

263

264 Single-layered Fit-HiC2 analysis

265 Using the extracted raw fragment matrix file, we generated ‘.interaction.gz’ and

266 ‘fragments.gz’ files that are necessary for subsequent Fit-HiC2 analysis. We normalized the

267 raw matrix using ‘HiCKRy.py’ in Fit-HiC2 packages [16]. Fit-HiC2 analysis was performed

268 with the following parameters : -r 0 -L 2000 -U 25000 -x intraOnly. The output loops were

269 filtered by a statistical cutoff value (q < 0.01)

270

271 Profiling of core chromatin looping regions using HiCORE

272 The raw fragment matrix file was assigned to each bin layer, and then single-layered Fit-HiC2

273 analysis was performed in each bin layer. After defining the interacting chromatin regions

.CC-BY 4.0 International licensemade available under a(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is

The copyright holder for this preprintthis version posted October 1, 2020. ; https://doi.org/10.1101/2020.10.01.322024doi: bioRxiv preprint

Page 13: HiCORE: Hi-C analysis for identification of core chromatin ......2020/10/01  · 3 34 Introduction 35 Hi-C analysis maps all possible genomic interactions in a manner dependent on

13

274 using Fit-HiC2 packages, a statistical cutoff value (q < 0.01) is used to obtain statistically

275 significant chromatin regions with 3D interactions. Fit-HiC2 output files, which contain 3D

276 interaction data in units of multi-fragment bins, are converted to data in units of single-

277 fragment bins for each binning layer. Then, the output files are further processed by HiCORE

278 to provide either overlapped core interacting chromatin regions or expanded interacting

279 chromatin regions using single-fragment data from multiple binning layers. The HiCORE-

280 processed data of single-fragment interactions are shown in units of the original multi-

281 fragment bins for comparison of specification of chromatin looping regions.

282

283 Software availability

284 HiCORE is implemented as a Python3 package. The most current stable version is available

285 at: https://github.com/CDL-HongwooLee/HiCORE

286

287 Acknowledgements

288 We thank Dr. Sangrea Shim (Seoul National University, Korea) for fruitful discussion.

289

290 Funding

291 This work was supported by the Basic Science Research (NRF-2019R1A2C2006915) and

292 Basic Research Laboratory (2020R1A4A2002901) programs provided by the National

293 Research Foundation of Korea and by the Next-Generation BioGreen 21 Program

294 (PJ01314501) provided by the Rural Development Administration.

295

296 Conflict of Interest: none declared.

297

.CC-BY 4.0 International licensemade available under a(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is

The copyright holder for this preprintthis version posted October 1, 2020. ; https://doi.org/10.1101/2020.10.01.322024doi: bioRxiv preprint

Page 14: HiCORE: Hi-C analysis for identification of core chromatin ......2020/10/01  · 3 34 Introduction 35 Hi-C analysis maps all possible genomic interactions in a manner dependent on

14

298 References

299 1. Dekker J, Rippe K, Dekker M, Kleckner N. Capturing Chromosome Conformation.

300 Science. 2002;295(5558):1306. doi: 10.1126/science.1067799.

301 2. Lieberman-Aiden E, van Berkum NL, Williams L, Imakaev M, Ragoczy T, Telling

302 A, et al. Comprehensive Mapping of Long-Range Interactions Reveals Folding Principles of

303 the Human Genome. Science. 2009;326(5950):289. doi: 10.1126/science.1181369.

304 3. Fernandez-Albert J, Lipinski M, Lopez-Cascales MT, Rowley MJ, Martin-Gonzalez

305 AM, del Blanco B, et al. Immediate and deferred epigenomic signatures of in vivo neuronal

306 activation in mouse hippocampus. Nature Neuroscience. 2019;22(10):1718-30. doi:

307 10.1038/s41593-019-0476-2.

308 4. Sima J, Chakraborty A, Dileep V, Michalski M, Klein KN, Holcomb NP, et al.

309 Identifying cis Elements for Spatiotemporal Control of Mammalian DNA Replication. Cell.

310 2019;176(4):816-30.e18. doi: https://doi.org/10.1016/j.cell.2018.11.036.

311 5. Lajoie BR, Dekker J, Kaplan N. The Hitchhiker’s guide to Hi-C analysis: Practical

312 guidelines. Methods. 2015;72:65-75. doi: https://doi.org/10.1016/j.ymeth.2014.10.031.

313 6. Cameron CJF, Dostie J, Blanchette M. HIFI: estimating DNA-DNA interaction

314 frequency from Hi-C data at restriction-fragment resolution. Genome Biology. 2020;21(1):11.

315 doi: 10.1186/s13059-019-1913-y.

316 7. Liu Q, Lv H, Jiang R. hicGAN infers super resolution Hi-C data with generative

317 adversarial networks. Bioinformatics. 2019;35(14):i99-i107. doi:

318 10.1093/bioinformatics/btz317.

319 8. Liu T, Wang Z. HiCNN: a very deep convolutional neural network to better enhance

320 the resolution of Hi-C data. Bioinformatics. 2019;35(21):4222-8. doi:

321 10.1093/bioinformatics/btz251.

.CC-BY 4.0 International licensemade available under a(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is

The copyright holder for this preprintthis version posted October 1, 2020. ; https://doi.org/10.1101/2020.10.01.322024doi: bioRxiv preprint

Page 15: HiCORE: Hi-C analysis for identification of core chromatin ......2020/10/01  · 3 34 Introduction 35 Hi-C analysis maps all possible genomic interactions in a manner dependent on

15

322 9. Zhang Y, An L, Xu J, Zhang B, Zheng WJ, Hu M, et al. Enhancing Hi-C data

323 resolution with deep convolutional neural network HiCPlus. Nature Communications.

324 2018;9(1):750. doi: 10.1038/s41467-018-03113-2.

325 10. Zhu B, Zhang W, Zhang T, Liu B, Jiang J. Genome-Wide Prediction and Validation

326 of Intergenic Enhancers in Arabidopsis Using Open Chromatin Signatures. The Plant Cell.

327 2015;27(9):2415. doi: 10.1105/tpc.15.00537.

328 11. Liu C, Wang C, Wang G, Becker C, Zaidem M, Weigel D. Genome-wide analysis of

329 chromatin packing in Arabidopsis thaliana at single-gene resolution. Genome research.

330 2016;26(8):1057-68. doi: 10.1101/gr.204032.116. PubMed PMID: 27225844.

331 12. Durand NC, Shamim MS, Machol I, Rao SSP, Huntley MH, Lander ES, et al. Juicer

332 Provides a One-Click System for Analyzing Loop-Resolution Hi-C Experiments. Cell

333 Systems. 2016;3(1):95-8. doi: 10.1016/j.cels.2016.07.002.

334 13. Cao Y, Chen Z, Chen X, Ai D, Chen G, McDermott J, et al. Accurate loop calling for

335 3D genomic data with cLoops. Bioinformatics. 2019. doi: 10.1093/bioinformatics/btz651.

336 14. Rao Suhas SP, Huntley Miriam H, Durand Neva C, Stamenova Elena K, Bochkov

337 Ivan D, Robinson James T, et al. A 3D Map of the Human Genome at Kilobase Resolution

338 Reveals Principles of Chromatin Looping. Cell. 2014;159(7):1665-80. doi:

339 10.1016/j.cell.2014.11.021.

340 15. Luo Z, Wang X, Jiang H, Wang R, Chen J, Chen Y, et al. Reorganized 3D Genome

341 Structures Support Transcriptional Regulation in Mouse Spermatogenesis. iScience.

342 2020;23(4):101034. doi: https://doi.org/10.1016/j.isci.2020.101034.

343 16. Kaul A, Bhattacharyya S, Ay F. Identifying statistically significant chromatin

344 contacts from Hi-C data with FitHiC2. Nature Protocols. 2020;15(3):991-1012. doi:

345 10.1038/s41596-019-0273-0.

.CC-BY 4.0 International licensemade available under a(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is

The copyright holder for this preprintthis version posted October 1, 2020. ; https://doi.org/10.1101/2020.10.01.322024doi: bioRxiv preprint

Page 16: HiCORE: Hi-C analysis for identification of core chromatin ......2020/10/01  · 3 34 Introduction 35 Hi-C analysis maps all possible genomic interactions in a manner dependent on

16

346 17. Mifsud B, Martincorena I, Darbo E, Sugar R, Schoenfelder S, Fraser P, et al.

347 GOTHiC, a probabilistic model to resolve complex biases and to identify real interactions in

348 Hi-C data. PLOS ONE. 2017;12(4):e0174744. doi: 10.1371/journal.pone.0174744.

349 18. Forcato M, Nicoletti C, Pal K, Livi CM, Ferrari F, Bicciato S. Comparison of

350 computational methods for Hi-C data analysis. Nature Methods. 2017;14(7):679-85. doi:

351 10.1038/nmeth.4325.

352 19. Wang C, Liu C, Roqueiro D, Grimm D, Schwab R, Becker C, et al. Genome-wide

353 analysis of local chromatin packing in Arabidopsis thaliana. Genome research.

354 2015;25(2):246-56. doi: 10.1101/gr.170332.113. PubMed PMID: 25367294.

355 20. Feng S, Cokus Shawn J, Schubert V, Zhai J, Pellegrini M, Jacobsen Steven E.

356 Genome-wide Hi-C Analyses in Wild-Type and Mutants Reveal High-Resolution Chromatin

357 Interactions in Arabidopsis. Molecular Cell. 2014;55(5):694-707. doi:

358 https://doi.org/10.1016/j.molcel.2014.07.008.

359 21. Ong C-T, Corces VG. Insulators as mediators of intra- and inter-chromosomal

360 interactions: a common evolutionary theme. Journal of Biology. 2009;8(8):73. doi:

361 10.1186/jbiol165.

362 22. Sun L, Jing Y, Liu X, Li Q, Xue Z, Cheng Z, et al. Heat stress-induced transposon

363 activation correlates with 3D chromatin organization rearrangement in Arabidopsis. Nature

364 Communications. 2020;11(1):1886. doi: 10.1038/s41467-020-15809-5.

365 23. Crevillén P, Sonmez C, Wu Z, Dean C. A gene loop containing the floral repressor

366 <em>FLC</em> is disrupted in the early phase of vernalization. The EMBO Journal.

367 2013;32(1):140. doi: 10.1038/emboj.2012.324.

368

369 Supporting information

.CC-BY 4.0 International licensemade available under a(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is

The copyright holder for this preprintthis version posted October 1, 2020. ; https://doi.org/10.1101/2020.10.01.322024doi: bioRxiv preprint

Page 17: HiCORE: Hi-C analysis for identification of core chromatin ......2020/10/01  · 3 34 Introduction 35 Hi-C analysis maps all possible genomic interactions in a manner dependent on

17

370 S1 Text. The information for HiCORE implementation.

371

372 Supplemental Figure 1. Box plot for average number of fragments in each specified

373 chromatin-interacting region.

374 Restriction fragments smaller than the cutoff values (700bp) were merged with neighboring

375 fragments until the bin is larger than the cutoff size. One bin layer was added at a time,

376 repeatedly, and chromatin looping regions were specified. The x-axis indicates the number of

377 layers integrated for HiCORE analysis. See also Figure 2A.

378

379 Supplemental Figure 2. Size distribution of HiCORE-specified looping regions.

380 Restriction fragments smaller than the cutoff values (700bp) were merged with neighboring

381 fragments until the bin is larger than the cutoff size. Twenty bin layers were integrated by

382 HiCORE and the overlapped looping regions were identified. Specified regions in each side

383 of a chromatin loop was counted independently. The number of interacting regions was

384 shown based on the number of consisting fragment(s) in a HiCORE-specified region.

385

386 Supplemental Figure 3. The average number of fragments in each specified interacting

387 chromatin region, with various cutoff values.

388 Restriction fragments smaller than the cutoff values were merged with neighboring fragments

389 until the bin is larger than the cutoff size. One bin layer was added at a time, repeatedly, and

390 chromatin looping regions were specified. The x-axis indicates the number of layers

391 integrated for HiCORE analysis.

392

.CC-BY 4.0 International licensemade available under a(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is

The copyright holder for this preprintthis version posted October 1, 2020. ; https://doi.org/10.1101/2020.10.01.322024doi: bioRxiv preprint

Page 18: HiCORE: Hi-C analysis for identification of core chromatin ......2020/10/01  · 3 34 Introduction 35 Hi-C analysis maps all possible genomic interactions in a manner dependent on

18

393 Supplemental Figure 4. The fragment number distribution of HiCORE-specified

394 chromatin loops.

395 Restriction fragments smaller than the cutoff values (700bp) were merged with neighboring

396 fragments until the bin is larger than the cutoff size. Twenty bin layers were integrated by

397 HiCORE, and the overlapped looping regions were identified. Reliable chromatin loops were

398 collected, and they were categorized based on the number of consisting fragment(s) in a pair

399 of HiCORE-specified chromatin looping regions. 1F, 1-fragment unit; 2F, 1~2-fragment unit;

400 3F, 1~3-fragment unit; 4F, 1~4-fragment unit; >4F, >4-fragment unit.

401

402 Supplemental Figure 5. The average size of each specified interacting chromatin region.

403 Restriction fragments smaller than the cutoff values were merged with neighboring fragments

404 until the bin is larger than the cutoff size. One bin layer was added at a time, repeatedly, and

405 chromatin looping regions were specified. The x-axis indicates the number of layers

406 integrated for HiCORE analysis.

407

.CC-BY 4.0 International licensemade available under a(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is

The copyright holder for this preprintthis version posted October 1, 2020. ; https://doi.org/10.1101/2020.10.01.322024doi: bioRxiv preprint

Page 19: HiCORE: Hi-C analysis for identification of core chromatin ......2020/10/01  · 3 34 Introduction 35 Hi-C analysis maps all possible genomic interactions in a manner dependent on

.CC-BY 4.0 International licensemade available under a(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is

The copyright holder for this preprintthis version posted October 1, 2020. ; https://doi.org/10.1101/2020.10.01.322024doi: bioRxiv preprint

Page 20: HiCORE: Hi-C analysis for identification of core chromatin ......2020/10/01  · 3 34 Introduction 35 Hi-C analysis maps all possible genomic interactions in a manner dependent on

.CC-BY 4.0 International licensemade available under a(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is

The copyright holder for this preprintthis version posted October 1, 2020. ; https://doi.org/10.1101/2020.10.01.322024doi: bioRxiv preprint

Page 21: HiCORE: Hi-C analysis for identification of core chromatin ......2020/10/01  · 3 34 Introduction 35 Hi-C analysis maps all possible genomic interactions in a manner dependent on

.CC-BY 4.0 International licensemade available under a(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is

The copyright holder for this preprintthis version posted October 1, 2020. ; https://doi.org/10.1101/2020.10.01.322024doi: bioRxiv preprint

Page 22: HiCORE: Hi-C analysis for identification of core chromatin ......2020/10/01  · 3 34 Introduction 35 Hi-C analysis maps all possible genomic interactions in a manner dependent on

.CC-BY 4.0 International licensemade available under a(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is

The copyright holder for this preprintthis version posted October 1, 2020. ; https://doi.org/10.1101/2020.10.01.322024doi: bioRxiv preprint

Page 23: HiCORE: Hi-C analysis for identification of core chromatin ......2020/10/01  · 3 34 Introduction 35 Hi-C analysis maps all possible genomic interactions in a manner dependent on

.CC-BY 4.0 International licensemade available under a(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is

The copyright holder for this preprintthis version posted October 1, 2020. ; https://doi.org/10.1101/2020.10.01.322024doi: bioRxiv preprint