
SnpReady for Rice (SR4R) Database

Jun Yan1,a#, Dong Zou2,b#, Chen Li3,c, Zhang Zhang2,d, Shuhui Song2,e*, Xiangfeng Wang1,f*

1 Department of Crop Genomics and Bioinformatics, College of Agronomy and Biotechnology, China Agricultural University, Beijing 100094, China

2 National Genomics Data Center & BIG Data Center & CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China

3 Rice Research Institute, Guangdong Academy of Agricultural Sciences, Guangzhou 510640, China

# Equal contribution

* Correspondence should be addressed to [email protected] (Wang XF) and [email protected] (Song SH)

Emails of authors: [email protected] (Yan J); [email protected] (Zou D); [email protected] (Li C); [email protected] (Zhang Z)

a ORCID: 0000-0002-3806-6457

b ORCID: 0000-0002-7169-4965

c ORCID: 0000-0001-6702-6860

d ORCID: 0000-0001-6603-5060

e ORCID: 0000-0003-2409-8770

f ORCID: 0000-0002-6406-5597

Number of words: 5421

Number of figures: 7

Number of tables: 0

Number of supplementary figures: 4

Number of supplementary tables: 3

Running title: Jun Yan et al. / SNP Ready for Rice database

bioRxiv preprint doi: https://doi.org/10.1101/2020.01.11.902999; this version posted January 14, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Abstract

The Information Commons for Rice (IC4R) database is a collection of ~18 million SNPs (single nucleotide polymorphisms) identified by the resequencing of 5,152 rice accessions. Although IC4R offers an ultra-high-density rice variation map, these raw SNPs are not readily usable by the public. To serve the different uses of SNPs in population genetics, evolutionary analysis, association studies and genomic breeding in rice, the raw genotypic data of the 18 million SNPs were processed by unified bioinformatics pipelines. The outcomes were used to develop a daughter database of IC4R, SnpReady for Rice (SR4R). SR4R presents four reference SNP panels: 2,097,405 hapmapSNPs obtained after data filtration and genotype imputation, 156,502 tagSNPs selected by linkage disequilibrium (LD)-based redundancy removal, 1,180 fixedSNPs selected from genes exhibiting selective sweep signatures, and 38 barcodeSNPs selected by DNA fingerprinting simulation. SR4R thus offers a highly efficient rice variation map that combines reduced SNP redundancy with extensive data describing the genetic diversity of rice populations. In addition, SR4R provides rice researchers with a web interface that enables them to browse all four SNP panels, use online toolkits, and retrieve the original data and scripts for a variety of population genetics analyses on local computers. SR4R is freely available to academic users at http://sr4r.ic4r.org/.

Keywords: Rice; SNP; Database; Hapmap

Introduction

Oryza sativa, or rice, was the first crop to have its genome sequenced. In the past decade, thousands of rice accessions in germplasm banks worldwide have been genotyped [1] and numerous rice variation databases have been constructed. One of these is the rice variation database (RVD), a daughter database of the Information Commons for Rice consortium (IC4R) [2]. RVD is a collection of over eighteen million SNPs (single nucleotide polymorphisms) identified from 5,152 rice accessions based on whole-genome resequencing data, and offers an ultra-high-density rice variation map of about one SNP per twenty bases on average. The information contained in this high volume of raw SNPs is not ready for use until it has been processed to remove low-quality SNPs, such as those with missing or low-frequency genotypes, and SNPs made redundant by linkage disequilibrium (LD). In addition, different types of research require different numbers of SNPs to ensure efficient computing and accurate results; for example, the requirements differ among evolutionary studies using comparative genomics and pan-genome analysis, gene mapping by quantitative trait loci (QTL) analysis and genome-wide association studies (GWAS), molecular breeding by marker-assisted selection (MAS) and genomic selection (GS), and variety protection by DNA fingerprint barcoding.

Construction of a reference haplotype map (HapMap) that represents the maximal population diversity of a species is the first step. The ~18 million raw SNPs in RVD provide an initial variation dataset from which to generate a reference HapMap for rice. According to the international human HapMap database, which contains over 3.1 million high-quality SNPs, a density of one SNP per 100 bases is sufficient for performing genotype imputation, GWAS analysis and mapping of causal variations [3]. Because the genome size of rice is ~400 Mbp, about two million high-quality SNPs would offer a density of one SNP per 200 bases. A reference rice HapMap of this density is especially useful for molecular breeders who perform genotype imputation to fill in missing genotypes or increase SNP density, as low-density genotyping platforms are mostly used in rice to lower genotyping expense.

For population genetics studies in which thousands of individual samples are assessed, the millions of SNPs in an entire HapMap are excessive. The redundant SNPs in a HapMap greatly increase computing costs and may also reduce the accuracy of results. To circumvent these challenges, a subset of SNPs whose genotypes significantly correlate with those of other SNPs in the same linkage disequilibrium (LD) region is selected; these are known as tagging SNPs. The number of tagging SNPs may vary between species and populations, depending upon the lengths of LD regions in each group [4]. Based on the data in RVD, LD length in rice ranges from 100 to 500 Kb; thus 100,000 SNPs, which yield a density of one tagging SNP per 3 to 5 Kb, are sufficient for various genetic diversity analyses.

The expense of genotyping is an important factor to consider in crop molecular breeding, which typically requires the rapid genotyping of thousands of samples, often within days or even hours. Therefore, low-density genotyping technologies, such as SNP chips or KASP-based platforms, are usually preferred by industrial seed companies; these methods offer great flexibility by combining the rapid identification of low numbers of SNPs (several to a few dozen) with the ability to multiplex hundreds to thousands of DNA samples. However, these methods lack precision.

Modern breeding methods demand the efficiency and stability of a highly concise marker panel containing ~1K SNPs. SNPs used to select plants for breeding typically occur in genes or genomic regions that are associated with agronomic traits believed to be subject to selective pressures [5]. Genes with variations exhibiting selectively fixed signatures can be identified based on the θπ and Fst values computed by selective sweep analysis [6]. A panel of this size is suitable for synthesis on low-density SNP chips, which are then used for certain types of molecular analysis, such as marker-assisted selection, seed purity or heterozygosity testing, genetic component analysis, and subpopulation classification.

For intellectual property protection of commercial rice varieties, DNA fingerprinting typically uses only 12 to 36 SNPs to generate a combination of barcodes with maximal resolution to distinguish commercial varieties in the seed industry or germplasm accessions in gene banks. All possible combinations of a set of candidate SNPs have to be simulated and tested in a large germplasm population to ensure maximal resolution with the fewest markers, for example with the MinimalMarker algorithm [7].

To enhance the ability of researchers to use the RVD in IC4R effectively, we developed a daughter database called SnpReady for Rice, or SR4R. SR4R enables researchers to readily retrieve SNPs that are relevant to their own research, thus saving time and computational resources. In SR4R, the ~18 million SNPs have been condensed into four categories: hapmapSNPs, tagSNPs, fixedSNPs, and barcodeSNPs (Figure 1). SR4R allows users to browse the information associated with each SNP panel and to download each set of genotype files for local use. SR4R also offers 18 bioinformatics tools and pipeline scripts that users can run locally to perform genotype imputation, basic statistical analysis, genotype file format conversion, SNP filtration and extraction, population structure analysis, genetic diversity analysis, rice subpopulation classification, DNA fingerprinting analysis, and other functions.

Database contents and analytical modules

The hapmapSNP panel

The IC4R rice variation database (RVD; http://variation.ic4r.org/) is a collection of over 18 million SNPs with related annotation information, identified from previously published whole-genome resequencing of 5,152 rice accessions [2]. Such a high-density rice variation map, with an average of one SNP per twenty bases, offers the possibility of generating a high-density HapMap for the rice research community; creating such a HapMap was the first step in building the SnpReady for Rice (SR4R) database described here.

To ensure the quality of the HapMap, we performed an initial filtration of samples and SNPs on the raw dataset of 5,152 accessions (Materials and Methods). First, a total of 2,556 accessions with a genotype missing rate of less than 20% were selected; each selected accession has been documented with explicit subpopulation classification and origin (Table S1). Then, SNPs with genotype missing rate ≥ 0.1 and minor allele frequency (MAF) ≤ 0.05 were removed. Genotype imputation with the software Beagle [8] on the resulting 2,883,623 SNPs in the selected 2,556 accessions yielded a high-quality HapMap containing 2,097,405 SNPs without any missing genotypes. These 2,097,405 SNPs were regarded as the hapmapSNP panel and were used as the initial dataset for generating the other three SNP panels (Figure 2A and 2D).
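
The two filtering steps can be sketched as follows. This is a minimal illustration, not the pipeline used for SR4R: genotypes are assumed to be coded as alternate-allele counts (0/1/2) with `None` for a missing call, the function names are hypothetical, and the sketch assumes that a SNP is dropped when it fails either the missing-rate or the MAF threshold.

```python
def missing_rate(calls):
    """Fraction of missing calls (None); an empty list counts as fully missing."""
    if not calls:
        return 1.0
    return sum(c is None for c in calls) / len(calls)


def minor_allele_frequency(calls):
    """MAF from genotypes coded as alternate-allele counts 0/1/2 (None = missing)."""
    observed = [c for c in calls if c is not None]
    if not observed:
        return 0.0
    alt_freq = sum(observed) / (2 * len(observed))
    return min(alt_freq, 1 - alt_freq)


def filter_panel(genotypes, max_sample_missing=0.2, max_snp_missing=0.1, min_maf=0.05):
    """genotypes: dict mapping accession -> list of calls, one per SNP, same SNP order.

    Step 1 keeps accessions with missing rate < max_sample_missing.
    Step 2 keeps SNPs with missing rate < max_snp_missing and MAF > min_maf
    (assumption: failing either threshold removes the SNP).
    Returns (kept accession names, kept SNP indices).
    """
    kept_samples = [name for name, calls in genotypes.items()
                    if missing_rate(calls) < max_sample_missing]
    n_snps = len(next(iter(genotypes.values())))
    kept_snps = []
    for j in range(n_snps):
        column = [genotypes[name][j] for name in kept_samples]
        if missing_rate(column) < max_snp_missing and minor_allele_frequency(column) > min_maf:
            kept_snps.append(j)
    return kept_samples, kept_snps
```

Filtering accessions before SNPs matters here: per-SNP missing rates and allele frequencies are computed only over the accessions that survive the first step, which mirrors the order described in the text.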

The generated reference HapMap of rice has an average density of five SNPs per Kb with a heterozygosity rate of 3.75% (Figure 2B and 2D). Genome-wide distribution statistics showed that 58.4% of the hapmapSNPs are located in intergenic regions, 12.5% in intronic regions, 11.8% in exonic regions, 0.02% at splicing sites, and 10.6% and 6.8% in the upstream and downstream regions of genes, respectively (1 Kb from the transcription start site or transcription end site) (Figure 2F). The 2,097,405 hapmapSNPs with genotypes of the 2,556 accessions are available for download, enabling users to perform genotype imputation on local genotype data to increase the density of SNPs generated from low-density genotyping platforms.
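
As a quick check of the quoted density, the arithmetic can be worked through with the approximate genome size of ~400 Mbp given in the Introduction. The helper names below are illustrative only.

```python
GENOME_SIZE_BP = 400_000_000  # approximate rice genome size quoted in the text
HAPMAP_SNPS = 2_097_405       # size of the hapmapSNP panel


def mean_spacing_bp(genome_size_bp, n_snps):
    """Average number of bases per SNP."""
    return genome_size_bp / n_snps


def snps_per_kb(genome_size_bp, n_snps):
    """Average number of SNPs per kilobase."""
    return n_snps / (genome_size_bp / 1000)


spacing = mean_spacing_bp(GENOME_SIZE_BP, HAPMAP_SNPS)
density = snps_per_kb(GENOME_SIZE_BP, HAPMAP_SNPS)
print(f"about one SNP per {spacing:.0f} bp, or {density:.1f} SNPs per Kb")
```

Under this assumed genome size the panel averages roughly one SNP per 191 bases, which agrees with both the "one SNP per 200 bases" target and the "five SNPs per Kb" figure.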

The tagSNP panel

High SNP density usually benefits the precise mapping of trait-related genes by GWAS, but it is not suitable for population genetic analysis because SNP redundancy may add unnecessary computation costs and introduce bias into the results [9]. Since SNPs within the same LD region possess correlated genotypes that form one haplotype block, a representative SNP is usually selected as a tag to resolve the redundancy. We adopted an LD-based SNP pruning procedure to infer haplotype tagging SNPs (tagSNPs) from the hapmapSNPs (Materials and Methods). As a result, 156,502 tagSNPs were identified (Figure 1). To verify whether the tagSNP panel properly represents the genetic diversity of the population, phylogenetic analysis using the 156,502 tagSNPs was performed on the 2,556 rice accessions, which are explicitly documented with subpopulation classification and origin. As shown in Figure 3A, the resulting phylogenetic tree clearly exhibited six major clades representing the five cultivated rice subpopulations and one wild rice subpopulation. The five cultivated rice subpopulations are indica rice (Ind) with 1,655 accessions, Aus rice (Aus) with 182 accessions, aromatic rice (Aro) with 56 accessions, tropical japonica rice (TrJ) with 318 accessions, and temperate japonica rice (TeJ) with 327 accessions, whilst the wild rice subpopulation contains 18 O. rufipogon (Oru) accessions. In addition, PCA-based (Figure 3B) and admixture-based (Figure 3C) analyses showed the same subpopulation classification as the phylogenetic tree. For population admixture structure analysis, a predefined parameter K specifies the assumed number of ancestral subpopulations, and each ancestral subpopulation is shown in a different colour. Because the optimal number of ancestral subpopulations is usually unknown, a common approach is to test a series of K values to estimate the optimal K. It is worth noting that the japonica, indica and Aus subpopulations were explicitly separated when K was set to 3, while the six subpopulations were not clearly separated until K was set to 8. In addition, for K between 4 and 7, the indica subpopulation showed a clear structure divided into six groups (indica g1 to g6), as indicated by both PCA and admixture analysis (Figure 3D and Figure S1). The genetic structures of the six rice subpopulations and the six indica subgroups are consistent with multiple previous reports [10].
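
The idea behind LD-based pruning can be illustrated with a small greedy sketch. It is not the procedure from Materials and Methods: the window size (counted in SNPs), the r² threshold, and the function names are all hypothetical, and genotypes are assumed to be complete and coded 0/1/2.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 coding)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def prune_by_ld(snps, window=50, r2_threshold=0.2):
    """Greedy pruning over SNPs ordered by genomic position.

    snps: list of genotype vectors, one per SNP.
    A SNP is kept only if its r^2 with every already-kept SNP that lies
    within `window` positions upstream is below `r2_threshold`.
    Returns the indices of the kept (tag) SNPs.
    """
    kept = []
    for i, genotype in enumerate(snps):
        nearby = [k for k in kept if i - k <= window]
        if all(r_squared(genotype, snps[k]) < r2_threshold for k in nearby):
            kept.append(i)
    return kept
```

For example, with `snps = [[0, 1, 2, 0, 1, 2], [0, 1, 2, 0, 1, 2], [2, 1, 0, 2, 1, 0], [0, 2, 0, 2, 1, 1]]`, the second and third SNPs are perfectly correlated (or anti-correlated) with the first and are dropped, while the fourth is nearly independent and is retained as a second tag.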

Genetic diversity analysis with the tagSNP panel

The tagSNP panel represents a subset of the hapmapSNPs after approximately 92.5% of the genetic redundancy was removed (Figure 1). To test the effectiveness of the 156,502 tagSNPs, we performed another series of standard genetic diversity analyses and examined whether the results agreed with previously reported conclusions. First, we found that the count of homozygous SNPs and the heterozygosity rate of the accessions in the six subpopulations showed opposite trends: while the accessions in the TeJ subpopulation had the highest count of homozygous SNPs and the lowest heterozygosity rate, the accessions in the indica subpopulation had the lowest count of homozygous SNPs and the highest heterozygosity rate (Figure 4A and 4B). Identity by state (IBS) analysis is a commonly used method to measure the similarity of alleles in a designated population, which may reflect the genetic diversity of the whole population and its subpopulations. Comparing IBS values among subpopulations may help in understanding their degree of genetic differentiation. To validate whether the IBS results generated from the tagSNPs are consistent with previous reports on genetic diversity in the different subpopulations, IBS values were computed between each pair of accessions within the same subpopulation; the results showed that temperate japonica rice has the highest IBS values, while indica rice has the lowest (Figure 4C). In addition, runs of homozygosity (ROH) analysis indicated that temperate japonica rice has the most and longest ROH regions, while indica rice has the fewest and shortest (Figure 4D). This pattern agreed with the result from LD decay analysis, which showed that temperate japonica rice has the slowest LD decay rate while indica rice has the fastest (Figure 4E). Computations of θπ and Fst are commonly used to measure genetic diversity within a population and between populations, respectively (Materials and Methods). The within-subpopulation diversities of the six rice subpopulations are Oru (θπ=0.218), Ind (θπ=0.216), Aus (θπ=0.182), Aro (θπ=0.145), TrJ (θπ=0.116) and TeJ (θπ=0.068). Using the wild rice subpopulation as the reference, the genetic distances of the five types of cultivated rice from wild rice are TeJ (Fst=0.476), TrJ (Fst=0.419), Aus (Fst=0.299), Ind (Fst=0.266) and Aro (Fst=0.241), suggesting that japonica rice has the highest domestication level among cultivated rice (Figure 4F and 4G). Taken together, the results of these standard genetic diversity analyses were consistent with previous reports that indica rice has a more complicated genetic composition and origin than the other five subpopulations [11].
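
The pairwise IBS comparison can be sketched as below. The sketch assumes genotypes coded as allele counts (0/1/2, `None` for missing) and uses the common definition in which a site contributes 1, 0.5 or 0 when two accessions share two, one or no alleles by state; the function names are illustrative and this is not necessarily the exact statistic computed for Figure 4C.

```python
from itertools import combinations


def ibs_similarity(g1, g2):
    """Mean proportion of alleles shared identical-by-state between two accessions.

    Sites missing (None) in either accession are skipped.
    """
    shared = [1 - abs(a - b) / 2
              for a, b in zip(g1, g2)
              if a is not None and b is not None]
    return sum(shared) / len(shared) if shared else float("nan")


def mean_within_group_ibs(genotypes_by_accession):
    """Average IBS over all pairs of accessions in one subpopulation."""
    pairs = list(combinations(genotypes_by_accession.values(), 2))
    return sum(ibs_similarity(a, b) for a, b in pairs) / len(pairs)
```

A subpopulation with low diversity, such as temperate japonica in the text, would yield a mean close to 1 because most pairs of accessions carry the same alleles at most sites.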

Genomic selection analysis with the tagSNP panel

Genomic selection (GS) has been widely used in industrial animal and crop breeding programs [12]. GS is essentially a best linear unbiased prediction (BLUP) model that is first trained with the known genotypes and phenotypes of reference population individuals, which usually account for 20% to 50% of a breeding population, and then used to predict the unknown phenotypes of the remaining genotyped individuals (the candidate population). The predicted phenotypes, known as genomic estimated breeding values (GEBVs), are ranked from high to low and can be used to assist in deciding upon a hybridization plan. Although GS may significantly shorten the breeding cycle, genotyping cost is a vital factor because the GS model takes genome-wide SNP markers as input, especially in crop breeding, in which thousands to hundreds of thousands of individuals need to be genotyped. To lower genotyping cost, it is of great importance to compile a set of thousands of SNPs that best represents the overall genetic background of a breeding population.

Because the 156,502 tagSNPs form a high-quality marker set with most redundancy removed while preserving maximal genetic diversity, they may be considered a marker pool from which to select high-efficiency SNPs for genomic selection. To test their effectiveness, we analysed a previously published dataset containing 414 rice parental lines with non-missing genotypes of 29,434 SNPs profiled by the 44K rice SNP chip, and nine phenotypic traits (flowering time, panicle fertility, seed width, seed volume, seed surface area, plant height, flag leaf length, flag leaf width, and florets per panicle). The GS model was built with the ridge regression best linear unbiased prediction (rrBLUP) algorithm [13], and prediction accuracy was evaluated as the Pearson correlation between observed and predicted traits under five-fold cross-validation. The evaluation was performed using five different SNP sets: Set-1, the original 29,434 SNPs on the 44K chip; Set-2, the 1,090 SNPs shared between the 156,502 tagSNPs and the 29,434 SNPs; Set-3, 1,090 SNPs randomly selected from the 29,434 SNPs; Set-4, 1,090 SNPs evenly distributed across the genome (one SNP per 350 Kb) selected from the 29,434 SNPs; and Set-5, 1,090 consecutive SNPs located within a randomly selected genomic region, taken from the 29,434 SNPs. rrBLUP prediction was then performed on the nine phenotypes using the five sets of SNPs to compare prediction accuracies (Figure 5). Although prediction accuracies varied greatly among the nine traits, ranging from 0.23 to 0.90, owing to the different heritability of each trait, the trend among the five SNP sets within the same trait was generally consistent. Except for panicle fertility, for which the Set-2 SNPs exhibited the highest prediction accuracy, the full 29,434 SNPs showed the highest prediction accuracy for the other eight traits, followed by the 1,090 tagSNPs in second position. We further performed pairwise Student's t-tests on the Pearson correlations of the 1,090 tagSNPs (Set-2) and the other four sets; the results show that the 1,090 tagSNPs significantly outperform the randomly selected SNP sets for most traits (Figure S2). These results indicate that selecting about one thousand tagSNPs from the tagSNP pool might be a feasible option to lower the genotyping budget; for example, these SNPs could inform the synthesis of a new low-density SNP chip to be used in place of a high-density SNP chip.
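
The evaluation scheme (ridge regression on markers, scored by Pearson correlation under five-fold cross-validation) can be sketched in plain Python. This is a simplified stand-in rather than the rrBLUP package: the penalty `lam` is a fixed, hypothetical value, whereas rrBLUP estimates the shrinkage from variance components, and the correlation here is pooled over all held-out predictions. All function names are illustrative.

```python
import random


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def ridge_fit(X, y, lam=1.0):
    """Marker effects from (Z'Z + lam*I) beta = Z'(y - mean(y)), Z = centred X; lam > 0."""
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    Z = [[row[j] - col_means[j] for j in range(p)] for row in X]
    y_mean = sum(y) / n
    ztz = [[sum(z[i] * z[j] for z in Z) + (lam if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    zty = [sum(z[i] * (yi - y_mean) for z, yi in zip(Z, y)) for i in range(p)]
    return y_mean, col_means, solve_linear(ztz, zty)


def ridge_predict(model, X):
    y_mean, col_means, beta = model
    return [y_mean + sum(b * (v - m) for b, v, m in zip(beta, row, col_means))
            for row in X]


def pearson(u, v):
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    sd_u = sum((a - mean_u) ** 2 for a in u) ** 0.5
    sd_v = sum((b - mean_v) ** 2 for b in v) ** 0.5
    return cov / (sd_u * sd_v)


def cross_validated_accuracy(X, y, folds=5, lam=1.0, seed=0):
    """Pearson correlation between observed traits and held-out predictions."""
    order = list(range(len(y)))
    random.Random(seed).shuffle(order)
    predicted = [0.0] * len(y)
    for k in range(folds):
        test = order[k::folds]
        held_out = set(test)
        train = [i for i in order if i not in held_out]
        model = ridge_fit([X[i] for i in train], [y[i] for i in train], lam)
        for i, value in zip(test, ridge_predict(model, [X[i] for i in test])):
            predicted[i] = value
    return pearson(y, predicted)
```

Running `cross_validated_accuracy` once per SNP set on the same lines and trait would give the kind of side-by-side comparison described for Set-1 to Set-5. The direct solve above scales with the cube of the marker count, so it only suits small panels.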

The fixedSNP panel

In the crop breeding industry, genotyping cost per sample is a top-priority factor, since hundreds to thousands of samples are often genotyped in a single day. The data then assist a variety of molecular breeding practices, including genomic selection-assisted phenotype prediction, marker-assisted backcrossing, seed purity or genotype heterozygosity analysis, and subpopulation identification. Cost reduction is usually achieved by compiling a highly effective marker panel containing only dozens to hundreds of SNPs suitable for high-throughput genotyping platforms, such as Douglas ArrayTape and LGC Omega-F equipment, using a PCR-based KASP™ genotyping assay. These systems allow users to flexibly combine different numbers of SNPs and DNA samples using multiple plates with 96 or 384 wells per run. To meet the industrial demand, further compression of the tagSNP panel must consider not only the genetic relationships between subpopulations and accessions, but also the evolutionary and/or functional significance of SNPs with high diagnostic effectiveness and stability.

The Fst and θπ values are commonly used indicators of genomic regions showing signatures of selective sweeps caused by domestication, artificial selection and environmental adaptation. SNPs in selective-sweep regions are usually evolutionarily fixed with strong positive selection signals. To generate the fixedSNP panel, we first identified the selective sweep regions that are specific to each subpopulation and those that are common to the six subpopulations by combining the ratio of Fst versus θπ based on the comparison of the cultivated subpopulation against the wild rice population (Materials and Methods). Using 100 Kb and 10 Kb windows, large and small genomic regions showing selective sweep signals were identified, respectively. In total, 227 (cultivated vs. wild), 381 (Ind vs. wild), 333 (Aus vs. wild), 296 (Aro vs. wild), 256 (TrJ vs. wild) and 269 (TeJ vs. wild) regions were identified, and these showed significantly smaller Tajima's D values than other regions (Figure 6A). Subsequently, genes located in the selective sweep regions and their corresponding GSEA (Gene Set Enrichment Analysis) terms were identified for each subpopulation; ~50% of them were specific to a single subpopulation, whilst only 27 GSEA terms were shared by the five cultivated rice subpopulations (Figure 6B). Finally, a total of 1,180 SNPs located within the genes in the selective sweep regions were selected to generate the fixedSNP panel.
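
A windowed scan of this kind can be sketched as follows. The estimators are deliberately simple (per-site θπ as 2p(1-p) with a small-sample correction, and Fst as a ratio of summed heterozygosities for two equally weighted populations), and the rule that flags a window by an Fst cutoff plus a wild-to-cultivated θπ ratio is an assumption made for illustration; the exact statistics and thresholds are those given in Materials and Methods, not the ones below.

```python
def pi_per_site(alt_freq, n_chromosomes):
    """Nucleotide diversity at one biallelic SNP, with small-sample correction."""
    return 2 * alt_freq * (1 - alt_freq) * n_chromosomes / (n_chromosomes - 1)


def window_theta_pi(alt_freqs, n_chromosomes, window_bp):
    """Per-base nucleotide diversity of a window, given its SNP allele frequencies."""
    return sum(pi_per_site(p, n_chromosomes) for p in alt_freqs) / window_bp


def window_fst(freqs_a, freqs_b):
    """Fst = (Ht - Hs) / Ht, summed over the SNPs of a window, for two populations."""
    numerator = denominator = 0.0
    for pa, pb in zip(freqs_a, freqs_b):
        p_bar = (pa + pb) / 2
        h_total = 2 * p_bar * (1 - p_bar)
        h_within = (2 * pa * (1 - pa) + 2 * pb * (1 - pb)) / 2
        numerator += h_total - h_within
        denominator += h_total
    return numerator / denominator if denominator else 0.0


def flag_sweep_windows(windows, fst_cutoff, pi_ratio_cutoff):
    """windows: iterable of (name, fst, pi_wild, pi_cultivated).

    Hypothetical rule: flag windows that are strongly differentiated from wild
    rice and have lost diversity in the cultivated group.
    """
    flagged = []
    for name, fst, pi_wild, pi_cultivated in windows:
        if pi_cultivated > 0 and fst >= fst_cutoff and pi_wild / pi_cultivated >= pi_ratio_cutoff:
            flagged.append(name)
    return flagged
```

In practice the cutoffs would be taken from the empirical genome-wide distribution of each statistic, and the scan would be repeated for each window size (100 Kb and 10 Kb in the text).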

Subpopulation classification analysis with the fixedSNP panel

To evaluate the fixedSNP panel, subpopulation classification with phylogenetic tree analysis was performed using the 1,180 fixedSNPs, and the results were compared with those generated from the 156,502 tagSNPs on the same population of 2,556 accessions. All of the accessions were assigned to the correct subpopulations with the fixedSNPs, and the phylogenetic tree was consistent in structure with the tree constructed from the tagSNPs (Figure 6C). To further evaluate the universality of the fixedSNP panel, we performed subpopulation classification on two external populations genotyped by SNP chips [11] [14]. One chip dataset contained 880 cultivated rice accessions genotyped by the Affymetrix 700K SNP chip, while the other contained 351 cultivated accessions genotyped by the Illumina 44K SNP chip. Both external chip datasets have been documented with clear subpopulation classification and origins, and possess relatively high genetic diversity. Only 314 and 63 SNPs from the 700K and 44K chips, respectively, were found in the 1,180 fixedSNP panel. For the chip dataset containing 880 accessions, 877 accessions were correctly assigned to their documented subpopulations; three TeJ accessions (IRGC121549, IRGC121520 and IRGC121535) were incorrectly assigned to the TrJ subpopulation (Figure 6D). For the chip dataset containing 351 accessions, 348 were assigned to the correct subpopulation; three TeJ accessions (NSFTV134, NSFTV204 and NSFTV283) were mistakenly assigned to the TrJ subpopulation (Figure 6E). Overall, 99.8% of the rice accessions examined were assigned to their previously documented subpopulations using markers from the fixedSNP panel, indicating that the fixedSNP panel is an efficient and accurate new tool for subpopulation classification.
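
The overall figure can be reproduced from the counts given above. One reading that yields 99.8% pools the 2,556 resequenced accessions (all reported as correctly assigned) with the two chip datasets; this pooling is an assumption, since the text does not state how the percentage was computed.

```python
# (correctly assigned, total) for each evaluated population, as reported in the text
results = {
    "2,556 resequenced accessions": (2556, 2556),
    "Affymetrix 700K chip": (877, 880),
    "Illumina 44K chip": (348, 351),
}


def pooled_accuracy(pairs):
    """Fraction of correctly assigned accessions over several datasets combined."""
    pairs = list(pairs)
    correct = sum(c for c, _ in pairs)
    total = sum(t for _, t in pairs)
    return correct / total


chips_only = pooled_accuracy([results["Affymetrix 700K chip"],
                              results["Illumina 44K chip"]])
all_sets = pooled_accuracy(results.values())
print(f"chips only: {chips_only:.1%}; all three populations: {all_sets:.1%}")
```

The two external chip datasets alone give 1,225 of 1,231 accessions (99.5%), while adding the resequenced population gives 3,781 of 3,787 (99.8%).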

The barcodeSNP panel 318

DNA fingerprinting technology using a small set of SNPs to generate a series of genotype 319

combinations, referred to as “barcodes,” has become an economical means to protect 320

commercialized varieties. Thus, the barcodeSNP panel must be able to uniquely identify 321

these barcodes to distinguish between each of the rice varieties on the market. To ensure 322

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 14, 2020. . https://doi.org/10.1101/2020.01.11.902999doi: bioRxiv preprint

Page 12: SnpReady for Rice (SR4R) DatabaseOryza sativa, or rice, was the first crop genome to be sequenced. In the past decade, thousands of rice accessions in the germplasm banks worldwide

highest uniqueness but lowest count of barcodeSNPs, we applied the MinimalMarker 323

algorithm on the fixedSNP panel to exhaustively traverse all possible genotype combinations 324

that would distinguish the 2,556 accessions (Materials and Methods). The MinimalMarker 325

algorithm generate three sets of minimum marker combinations, in which each set contains 326

28 SNPs. After merging the three sets, 38 barcodeSNPs were finally selected to generate the 327

panel (Figure S2A). In addition, up- and down-stream flanking sequences were also provided 328

for users to design primers for PCR-based KASP™ genotyping assays. The SR4R also offers 329

a web interface that allows users to identify corresponding accessions or varieties when rice 330

varieties are submitted for genotyping with any number of barcodeSNPs between 8 to 38. 331

The SR4R returns a list of the top 10 best-matched accessions/variety in the database, and 332

displays associated information including the accession/variety IDs, number of mismatched 333

bases, genomic position of the barcode, genotype heterozygosity, and documented 334

subpopulation and origin. Among the top 10 hits, if multiple best-matched varieties with 335

100% identity are returned using a certain number of barcodes, the users may genotype 336

additional barcodeSNPs until a unique best matched variety is identified. It is worth noting 337

that because the SR4R does not have a complete list of the barcodes for all commercial rice 338

varieties in the database, the 38 barcodeSNPs is considered as an initial panel for users to test 339

the best combinations with the most optimal sensitivity and specificity using flexible numbers 340

of markers. 341
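
The matching step described above can be illustrated with a minimal Python sketch. It is not the SR4R implementation: the function names, the use of `N` for an ungenotyped site, and the toy accession IDs are all assumptions made for illustration. Each barcode is treated as a string with one base per barcodeSNP, and candidates are ranked by the number of mismatched bases.

```python
from typing import Dict, List, Tuple


def count_mismatches(query: str, reference: str) -> int:
    """Count differing positions, skipping sites marked 'N' (assumed: not genotyped)."""
    if len(query) != len(reference):
        raise ValueError("barcodes must cover the same SNP positions")
    return sum(
        1 for q, r in zip(query, reference) if q != "N" and r != "N" and q != r
    )


def best_matches(
    query: str, database: Dict[str, str], top_n: int = 10
) -> List[Tuple[str, int]]:
    """Return up to top_n (accession ID, mismatch count) pairs, fewest mismatches first."""
    scored = [(acc, count_mismatches(query, code)) for acc, code in database.items()]
    return sorted(scored, key=lambda item: (item[1], item[0]))[:top_n]


# Toy database with hypothetical accession IDs and 8-SNP barcodes.
toy_db = {
    "ACC_A": "AGCTTGCA",
    "ACC_B": "AGCTTGCC",
    "ACC_C": "TGCATGCA",
}
hits = best_matches("AGCTNGCA", toy_db, top_n=2)
print(hits)
```

In this toy case the query matches ACC_A with no mismatches and ACC_B with one. With real data, several hits at zero mismatches would indicate that more barcodeSNPs should be genotyped.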

Machine learning analysis with the barcodeSNP panel

If a new variety genotyped with barcodeSNPs is not found in the database, SR4R will perform subpopulation classification. The traditional method of subpopulation classification first integrates the genotype of the submitted variety with the genotypes of all the varieties in the database, then performs phylogenetic analyses to determine the best assigned subpopulation. This procedure is tedious and computationally inefficient since the database contains thousands of accessions. To simplify the procedure so that it may be implemented through a web interface, we adopted an alternative method that utilizes machine learning-based subpopulation classification models with the 38 barcodeSNPs as features. We used all of the 2,556 rice accessions to evaluate seven commonly used machine learning algorithms for subpopulation classification, including decision tree, k-nearest neighbors, naïve Bayes, artificial neural network, random forest, multinomial logistic regression and one-vs-rest logistic regression, followed by ten-fold cross-validation assessment (Materials and Methods). A series of assessments of the classification precision in the five cultivated rice subpopulations indicated that the best of the seven methods is the multinomial logistic regression model, whose AUC (area under the curve) values were ≥ 0.99 for all subpopulations (Figure S3B-F). The one-vs-rest logistic regression and random forest models yielded classification precision similar to that of the multinomial logistic regression model. We then used an independent dataset containing 880 rice accessions profiled by the 700K rice SNP chip for independent validation. The multinomial logistic regression model was trained on the 2,556 rice accessions and then used to predict the subpopulation classifications of the 880 samples. The AUC values were ≥ 0.99 for all subpopulations in this independent dataset, indicating the robustness of the model. Moreover, when the original label of each sample was compared with the predicted label with the maximum probability, the true positive rate (TPR) and false positive rate (FPR) were also reasonable (Figure S4). The pre-trained classification models for the seven machine learning algorithms have been implemented on the SR4R server and provided as a web tool for users to perform subpopulation classification when the genotype information of the 38 barcodeSNPs is submitted.
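
As a reminder of what a per-subpopulation AUC measures, the sketch below computes a one-vs-rest AUC directly from predicted probabilities, using its rank interpretation: the probability that a randomly chosen member of the subpopulation receives a higher score than a randomly chosen non-member, with ties counted as one half. The function name and the toy probabilities are illustrative assumptions and are not taken from the SR4R code or results.

```python
from typing import Sequence


def auc_one_vs_rest(scores: Sequence[float], is_positive: Sequence[bool]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, flag in zip(scores, is_positive) if flag]
    neg = [s for s, flag in zip(scores, is_positive) if not flag]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: documented labels and the model's (made-up) probability of "Ind".
labels = ["Ind", "Ind", "TeJ", "TrJ", "Aus"]
prob_ind = [0.95, 0.60, 0.70, 0.10, 0.05]
auc_ind = auc_one_vs_rest(prob_ind, [lab == "Ind" for lab in labels])
print(round(auc_ind, 3))
```

Here one non-Ind sample (0.70) outranks one Ind sample (0.60), so five of the six positive-negative pairs are ordered correctly and the AUC is 5/6.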

The barcodeInDel panel

InDels (insertions and deletions) are another form of genomic variation (usually less than 50 bp in length) that can be used as molecular markers for a variety of population analyses. From the 5,152 rice accessions, a total of 4,217,174 raw InDel variations were identified using the IC4R variation calling pipeline [2]. After filtering low-quality InDels, 109,898 high-confidence InDels with a missing rate less than 0.01 and MAF ≥ 0.05 within the 2,556 rice accessions were retained. Among the 109,898 high-confidence InDels, we further identified 62 subpopulation-specific InDels that can be used as barcodeInDels to differentiate the six rice subpopulations TeJ, TrJ, Aro, Aus, Ind and Oru, and the six subgroups of indica rice S1-S6 (Table S3). The 109,898 high-confidence InDels can be downloaded from SR4R for users’ customized analysis.

Web interface of SR4R database

Using unified bioinformatics pipelines, the genotype data of the 18 million raw SNPs identified from 5,152 rice accessions were processed to construct four reference panels of SNPs for different uses. Because genotype data processing is a complicated and computationally intensive procedure, the four SNP panels, which are readily usable for a variety of analyses, simplify this task for rice researchers. For better sharing of SNPs and improvement of the rice variation map utility, we developed the SnpReady for Rice (SR4R) database. Through the SR4R web interface, users may directly browse the four panels and retrieve detailed information related to the 2,097,405 hapmapSNPs, 156,502 tagSNPs, 1,180 fixedSNPs and 38 barcodeSNPs. In addition, the protein-coding genes exhibiting strong selection signatures associated with the 1,180 fixedSNPs are also included in the SR4R database with detailed functional annotations (Figure 7A). When users retrieve a SNP such as the first SNP “OSA01S00001362”, the genomic location and the adjacent gene or the gene containing the queried SNP are displayed. Users may also retrieve a visualized allele frequency map in the six major subpopulations and the six subgroups of indica rice (Figure 7B).

Users may also download the four panels of SNPs along with the original genotype files for local analysis via http://sr4r.ic4r.org/download. In addition, the “Tools” module presents 18 handy scripts and pipelines that users may install on their local computers for a variety of analyses, including basic genotype processing, population diversity analysis, rice variety identification and subpopulation classification. For example, a user who wants to perform genotype imputation for the 44K rice SNP chip may first download the file “hapmapSNPs-genotype.tar.gz (892 MB)” containing the genotypes of the 2,097,405 hapmapSNPs in 2,556 rice accessions. Then, the user may use the pipeline and scripts demonstrated in Figure 7C to perform imputation on a local server. SR4R also offers two modules of online analysis. The first module uses a machine learning-based method to assign the subpopulation type based on a user-submitted genotype file including no more than 20 samples. The model returns the probability of the subpopulation type assigned to each sample (Figure 7D). The second module performs DNA fingerprint analysis. When the user submits a genotype file containing no more than 20 samples, the module searches the accession database and returns the top three matches of existing varieties, with the number of mismatched nucleotides and the heterozygosity rate displayed (Figure 7E). The programs and scripts for these two modules, along with demo input and output files, are also available to download for local analysis of genotypes with large sample numbers.

Conclusions

The IC4R Rice Variation Database collects over 18 million raw SNPs identified from resequencing of 5,152 accessions. To meet the different demands of the rice research community and breeding industry, we further generated four panels of 2,097,405 hapmapSNPs, 156,502 tagSNPs, 1,180 fixedSNPs and 38 barcodeSNPs with standard processing pipelines and uniform analytical parameters (Table S2). The four panels of SNPs can be either accessed online or downloaded for local use from SnpReady for Rice (SR4R), a daughter database of RVD. The hapmapSNP panel, which contains non-missing genotypes at 2 million SNPs for 2,556 accessions, offers a reference HapMap for genotype imputation and high-resolution GWAS analysis. The non-redundant 150K tagSNP panel is of an ideal magnitude for population genetics and evolutionary analysis in research, as well as an ideal marker pool for genomic selection-assisted breeding in rice. For a breeding population with about 500 F1 hybrids, 1,500 to 15,000 markers selected from the tagSNP panel can be used to build a GS model that reaches a satisfactory genotype-to-phenotype prediction accuracy. The fixedSNP panel, with its high effectiveness and stability, can be regarded as a marker pool for various molecular breeding practices on low-budget, flexible genotyping platforms, such as subpopulation classification, seed purity analysis and genetic background analysis. The 38 barcodeSNPs selected by the MinimalMarker algorithm are an initial marker set for generating DNA fingerprints for commercial rice varieties. Along with the barcodeSNP panel, two web-based tools, one for variety identification and another for subpopulation classification, are offered in SR4R. In addition, the SR4R database also offers a series of standard pipelines used to construct the four sets of SNPs, and handy local tools to perform rice variety classification, barcode development, and other types of genetic and breeding research. With the incremental accumulation of population genotype data in the BIGD center, these bioinformatics tools can be applied to other animal or plant species, such as corn, wheat and soybean, to build centralized reference HapMap and SNP panel databases for plants.

Materials and Methods

Construction of hapmapSNP and tagSNP panels

The 18 million raw SNPs with genotype information for 5,152 rice accessions were obtained from the IC4R rice variation database (http://variation.ic4r.org). Accession filtration, SNP filtration and basic statistics of homozygous SNPs and accession heterozygosity were performed using in-house scripts. Genotype imputation of missing sites and phasing were performed using Beagle [8]. A SNP site with a missing genotype was removed if the posterior probability of the inferred genotype was smaller than 0.5. Genomic annotation of hapmapSNPs was performed using ANNOVAR (version 20160201) against the International Rice Genome Sequencing Project (IRGSP) gene annotation. Based on the reported LD length of rice, which ranges from 40 to 500 Kb, an LD-based SNP pruning method was used to construct the tagSNP panel using PLINK with the --indep command [15] [16]. SNPs were pruned recursively based on the variance inflation factor (VIF) within a sliding window of 50 SNPs, which was shifted with a step size of 5 SNPs.

Tools for subpopulation structure analysis

The tagSNPs for 2,556 rice accessions were concatenated as input sequences for constructing the phylogenetic tree using the neighbor-joining algorithm implemented in MEGA-CC with pairwise gap deletion and 100 bootstrap replications [17]. The output tree file for all 2,556 rice accessions and the subtree file of indica rice accessions were visualized in MEGA7 [18]. Principal component analysis of the 2,556 rice accessions was performed with flashPCA [9]. Population admixture structure analysis was performed with fastSTRUCTURE using the variational Bayesian framework, with k set from 2 to 8 to infer the admixture of ancestors for the accessions.

Tools for genetic diversity analysis

Genetic diversity analyses were mostly performed using PLINK [16]. Genome-wide pairwise IBS calculations were performed between each pair of accessions within the same subpopulation to deduce genetic affinity, and an IBS pairwise distance matrix was generated for each subpopulation. The ROH analysis for each subpopulation used a sliding-window method to scan each accession’s genotype at each marker position to detect homozygous segments. The parameters and thresholds applied to define an ROH were a minimum ROH length of 200 Kbp and a minimum of 1,000 consecutive SNPs included in an ROH. The correlation coefficient (r²) between SNPs was calculated to measure the LD level for each subpopulation. The average r² value was calculated for each distance from 0 to 500 Kbp, and LD decay figures were drawn for each subpopulation using an R script. Population diversity of rice varieties was measured by two indexes: θπ and Fst. Nucleotide diversity θπ was used to measure the degree of genotype variability within each subpopulation, while subpopulation differentiation was evaluated by the fixation index Fst for each of the cultivated subpopulations against the wild rice population and for the cultivated subpopulations compared with each other. Values of θπ and Fst were calculated using the sites mode implemented in VCFtools [19].
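
To make the IBS distance matrix concrete, the following sketch computes pairwise IBS similarity as the average proportion of alleles shared across sites. The analysis in this study was run with PLINK; this pure-Python version is only an illustration, and the 0/1/2 dosage coding, the use of `None` for missing calls and the function names are assumptions.

```python
from typing import List, Optional, Sequence

Genotype = Optional[int]  # 0, 1 or 2 copies of the alternate allele; None = missing


def ibs_similarity(a: Sequence[Genotype], b: Sequence[Genotype]) -> float:
    """Mean proportion of alleles shared identical-by-state over non-missing sites."""
    shared = 0
    sites = 0
    for ga, gb in zip(a, b):
        if ga is None or gb is None:
            continue  # skip sites missing in either accession
        shared += 2 - abs(ga - gb)  # 2, 1 or 0 alleles shared at this site
        sites += 1
    if sites == 0:
        raise ValueError("no shared non-missing sites")
    return shared / (2 * sites)


def ibs_distance_matrix(genotypes: Sequence[Sequence[Genotype]]) -> List[List[float]]:
    """Symmetric matrix of 1 - IBS similarity for every pair of accessions."""
    n = len(genotypes)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = 1.0 - ibs_similarity(genotypes[i], genotypes[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


# Three toy accessions genotyped at four SNPs.
toy = [
    [0, 0, 2, 2],
    [0, 0, 2, 0],
    [2, 2, 0, None],
]
dist = ibs_distance_matrix(toy)
print(dist)
```

The first two accessions differ at one homozygous site out of four, giving a distance of 0.25, while the third shares no alleles with either at the sites where it was genotyped.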

Tools for genomic selection analysis

Genotype and phenotype datasets of the 44K rice chip were downloaded from the Rice Diversity website (http://www.ricediversity.org/). Genotype imputation and phasing were then performed using Beagle (version 3.3.2), and a site was filtered out if the posterior probability of the inferred genotype was smaller than 0.5. Genomic selection analysis was performed using the RR-BLUP mixed model implemented in the R package rrBLUP [13] for nine well-measured traits (flowering time, panicle fertility, seed width, seed volume, seed surface area, plant height, flag leaf length, flag leaf width, and florets per panicle) with five different feature combinations. The prediction accuracy under each feature combination was evaluated by five-fold cross-validation and the Pearson correlation coefficient. The process is as follows: the original samples were randomly partitioned into five subsets; of the five subsets, a single subset was retained as the validation data, and the remaining four subsets were used as training data. This process was repeated five times, with each of the five subsets used exactly once as the validation data. The Pearson correlation coefficients between the predicted breeding values and the real phenotype values were calculated for each fold.
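
The five-fold procedure can be sketched in plain Python as follows. This is a schematic under stated assumptions rather than the pipeline used here: the study fitted RR-BLUP models with the R package rrBLUP, whereas the stand-in predictor below is an ordinary least-squares fit on a single numeric score, and all function names and toy values are hypothetical.

```python
import math
import random
from typing import Callable, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def k_fold_indices(n_samples: int, k: int = 5, seed: int = 0) -> List[List[int]]:
    """Randomly partition sample indices into k disjoint folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(x, y, fit_predict: Callable, k: int = 5, seed: int = 0) -> List[float]:
    """Hold out each fold once, train on the rest, return one correlation per fold."""
    accuracies = []
    for held_out in k_fold_indices(len(x), k, seed):
        held = set(held_out)
        train = [i for i in range(len(x)) if i not in held]
        preds = fit_predict(
            [x[i] for i in train], [y[i] for i in train], [x[i] for i in held_out]
        )
        accuracies.append(pearson(preds, [y[i] for i in held_out]))
    return accuracies


def simple_linear_fit_predict(train_x, train_y, test_x):
    """Stand-in for RR-BLUP (assumption): least squares on one numeric score."""
    n = len(train_x)
    mx, my = sum(train_x) / n, sum(train_y) / n
    sxx = sum((a - mx) ** 2 for a in train_x)
    slope = sum((a - mx) * (b - my) for a, b in zip(train_x, train_y)) / sxx
    return [my + slope * (t - mx) for t in test_x]


# Toy data: 20 samples whose phenotype follows the score with a small offset.
scores = [float(i) for i in range(20)]
phenotypes = [2.0 * s + (0.5 if i % 2 else -0.5) for i, s in enumerate(scores)]
fold_r = cross_validate(scores, phenotypes, simple_linear_fit_predict, k=5)
print([round(r, 3) for r in fold_r])
```

Each sample is used for validation exactly once, and the reported accuracy is one Pearson correlation per fold, which can then be averaged.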

Construction of the fixedSNP panel

θπ and Tajima’s D values were calculated for the six rice subpopulations (TeJ, TrJ, Aro, Aus, Ind, Oru) in a sliding-window fashion across the genome using in-house scripts. Fst values were calculated for the five cultivated subpopulations against the wild Oru subpopulation, as well as for the five cultivated subpopulations against each other. For each pairwise comparison, the windows falling in both the top 5% of windowed θπ ratios (wild subpopulation vs. cultivated subpopulation) and the top 5% of the corresponding windowed Fst values were selected as strong selective sweep signals. Window sizes of both 100 Kbp and 10 Kbp were used to detect large and small selective sweep regions, and the results were merged as the candidate selective sweep regions for each subpopulation. The Tajima’s D distribution was also drawn for the candidate selective sweep regions against the whole genome for each pairwise comparison. Genes located within the candidate selective sweep regions were extracted for each comparison, and Gene Set Enrichment Analysis (GSEA) was performed for each gene list using the PlantGSEA web tool [20]. Genic SNPs located in the candidate selective sweep regions identified from the above-mentioned pairwise comparisons were merged as fixedSNPs.
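
The window selection rule (top 5% of θπ ratios intersected with top 5% of Fst values) can be written down compactly. The sketch below is an illustration only and does not reproduce the in-house scripts: the dictionary-based input format, the function names and the toy values are assumptions.

```python
from typing import Dict, List, Set


def top_fraction_threshold(values: List[float], fraction: float = 0.05) -> float:
    """Smallest value that still falls in the top `fraction` of the empirical distribution."""
    ranked = sorted(values, reverse=True)
    n_top = max(1, int(len(ranked) * fraction))
    return ranked[n_top - 1]


def candidate_sweep_windows(
    pi_wild: Dict[str, float],
    pi_cultivated: Dict[str, float],
    fst: Dict[str, float],
    fraction: float = 0.05,
) -> Set[str]:
    """Windows in both the top `fraction` of pi ratios (wild/cultivated) and of Fst."""
    ratios = {
        w: pi_wild[w] / pi_cultivated[w]
        for w in fst
        if w in pi_wild and pi_cultivated.get(w, 0.0) > 0.0
    }
    ratio_cut = top_fraction_threshold(list(ratios.values()), fraction)
    fst_cut = top_fraction_threshold([fst[w] for w in ratios], fraction)
    return {w for w in ratios if ratios[w] >= ratio_cut and fst[w] >= fst_cut}


# Toy genome of 40 windows; three windows deviate from the background.
windows = [f"win_{i}" for i in range(40)]
pi_wild = {w: 0.010 for w in windows}
pi_cult = {w: 0.010 for w in windows}
fst = {w: 0.10 for w in windows}
pi_cult["win_3"], fst["win_3"] = 0.001, 0.80  # low diversity and high Fst
pi_cult["win_7"], fst["win_7"] = 0.002, 0.20  # low diversity only
pi_cult["win_9"], fst["win_9"] = 0.004, 0.70  # high Fst only
sweeps = candidate_sweep_windows(pi_wild, pi_cult, fst)
print(sweeps)
```

Only `win_3` is in the top 5% on both statistics, so it is the only candidate sweep window in this toy example.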

Construction of the barcodeSNP panel

The 1,180 fixedSNPs were used as the initial marker set to select the minimal number of barcodeSNPs that can maximally distinguish the 2,556 rice accessions using the heuristic mode implemented in MinimalMarker [7]. Three minimal sets, each containing 28 SNPs, were generated, and after merging the three sets, 38 unique SNPs were selected as barcodeSNPs for generating DNA fingerprints for each accession.
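
The idea of selecting a small marker set that separates all accessions can be illustrated with a simple greedy heuristic. This is explicitly not the MinimalMarker algorithm [7], whose internals are not described here; it is a generic set-cover style sketch, and the function name and toy genotypes are assumptions.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def greedy_marker_set(genotypes: Dict[str, str]) -> List[int]:
    """Greedily pick SNP columns until every pair of accessions differs at a chosen SNP."""
    accessions = sorted(genotypes)
    n_snps = len(genotypes[accessions[0]])
    unresolved: Set[Tuple[str, str]] = set(combinations(accessions, 2))
    chosen: List[int] = []
    while unresolved:
        best_snp, best_split = -1, set()
        for snp in range(n_snps):
            if snp in chosen:
                continue
            # Pairs that this SNP would newly tell apart.
            split = {
                pair
                for pair in unresolved
                if genotypes[pair[0]][snp] != genotypes[pair[1]][snp]
            }
            if len(split) > len(best_split):
                best_snp, best_split = snp, split
        if best_snp < 0:
            break  # remaining pairs are identical at every remaining SNP
        chosen.append(best_snp)
        unresolved -= best_split
    return chosen


# Four toy accessions genotyped at five SNPs (one character per SNP).
toy_genotypes = {
    "A": "AACCG",
    "B": "AATCG",
    "C": "GACCG",
    "D": "GATCG",
}
markers = greedy_marker_set(toy_genotypes)
print(markers)
```

In the toy data only SNPs 0 and 2 vary, and together they give each of the four accessions a unique two-base barcode.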

To identify commercialized rice varieties using the combination of 38 barcodeSNPs, seven machine learning-based methods were used: decision tree, k-nearest neighbors, naïve Bayes, artificial neural network, random forest, multinomial logistic regression, and one-vs-rest logistic regression, as implemented in the Python sklearn library (https://scikit-learn.org/stable/). The precision of each model was assessed using the ten-fold cross-validation method. Specifically, the original sample set was randomly partitioned into ten subsets, of which nine were used for training the model and the remaining subset was used for testing; this procedure was repeated ten times and an average prediction accuracy was computed from the overall performance of the tested models. Five one-hot codes (10000, 01000, 00100, 00010, 00001) were used to label the five subpopulations for classification using the machine learning models. Then, the predicted label with the maximum probability was compared with the original label for each sample. If the predicted label was identical to the original label, the prediction was regarded as correct. Then, the true positive and false positive rates were computed to plot ROC curves and compute AUC values.
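
The labelling and scoring steps can be sketched as follows. The assignment of the five one-hot codes to specific subpopulations is not stated above, so the order used here is an assumption, as are the function names and the toy probabilities.

```python
from typing import List, Sequence, Tuple

# Assumed order of the five cultivated subpopulations for the one-hot codes.
SUBPOPULATIONS = ["TeJ", "TrJ", "Aro", "Aus", "Ind"]


def one_hot(label: str) -> List[int]:
    """Encode a subpopulation label as a five-element one-hot vector."""
    return [1 if label == name else 0 for name in SUBPOPULATIONS]


def predicted_label(probabilities: Sequence[float]) -> str:
    """Return the subpopulation with the maximum predicted probability."""
    best = max(range(len(SUBPOPULATIONS)), key=lambda i: probabilities[i])
    return SUBPOPULATIONS[best]


def tpr_fpr(
    true_labels: Sequence[str], pred_labels: Sequence[str], target: str
) -> Tuple[float, float]:
    """True positive rate and false positive rate for one subpopulation vs. the rest."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(true_labels, pred_labels):
        if truth == target and pred == target:
            tp += 1
        elif truth == target:
            fn += 1
        elif pred == target:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), fp / (fp + tn)


# Toy predictions for four samples (probabilities are made up).
truth = ["TeJ", "TeJ", "TrJ", "Ind"]
probs = [
    [0.90, 0.05, 0.02, 0.02, 0.01],
    [0.30, 0.60, 0.05, 0.03, 0.02],
    [0.10, 0.80, 0.05, 0.03, 0.02],
    [0.00, 0.10, 0.00, 0.10, 0.80],
]
preds = [predicted_label(p) for p in probs]
print(preds, tpr_fpr(truth, preds, "TeJ"))
```

The second TeJ sample is assigned to TrJ, so the TeJ true positive rate is 0.5 with no false positives, while TrJ picks up one false positive.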

Construction of the barcodeInDel panel

Raw InDels were identified from the original 5,152 rice accessions using the IC4R variation calling pipeline [2]. Then, the InDels from the 2,556 rice accessions with high sequencing coverage (depth ≥ 5) present in the SR4R database were extracted using customized Python scripts, and VCFtools [19] was used to filter them into a high-confidence InDel dataset, with parameters of missing rate less than 0.01 and MAF ≥ 0.05. Finally, using customized Python scripts, InDels that have the same sequence type within each subpopulation were retained to generate the subpopulation-specific barcodeInDel panel.
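
The two filtering thresholds can be expressed in a few lines of Python. The filtering in this study was done with VCFtools; the sketch below only illustrates the criteria, and the 0/1/2 dosage coding, the use of `None` for missing calls and the function names are assumptions.

```python
from typing import Optional, Sequence

Call = Optional[int]  # 0, 1 or 2 copies of the alternate allele; None = missing


def missing_rate(calls: Sequence[Call]) -> float:
    """Fraction of accessions with no genotype call at this site."""
    return sum(1 for c in calls if c is None) / len(calls)


def minor_allele_frequency(calls: Sequence[Call]) -> float:
    """Frequency of the rarer allele among the called genotypes."""
    observed = [c for c in calls if c is not None]
    if not observed:
        return 0.0
    alt_freq = sum(observed) / (2 * len(observed))
    return min(alt_freq, 1.0 - alt_freq)


def passes_filter(
    calls: Sequence[Call], max_missing: float = 0.01, min_maf: float = 0.05
) -> bool:
    """Keep a site if its missing rate is below max_missing and its MAF is at least min_maf."""
    return missing_rate(calls) < max_missing and minor_allele_frequency(calls) >= min_maf


# Three toy sites across 200 accessions.
common = [2] * 40 + [0] * 160              # MAF 0.20, no missing calls
rare = [2] * 2 + [0] * 198                 # MAF 0.01
gappy = [None] * 4 + [2] * 40 + [0] * 156  # missing rate 0.02
print(passes_filter(common), passes_filter(rare), passes_filter(gappy))
```

Only the first site passes: the second fails the MAF threshold and the third fails the missing-rate threshold.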

Data availability

All data are freely available and downloadable at http://sr4r.ic4r.org/.

Authors’ contributions

XFW, SHS and ZZ conceived the project; JY and CL collected the samples; JY conducted the data analysis; DZ developed the database; JY, XFW, SHS and ZZ wrote the manuscript.

Competing interests

The authors declare no competing interests.

Acknowledgments

We are grateful to a number of users for reporting bugs and providing suggestions for improving SR4R. This work was supported by the National Science Foundation of China [31871706], by the Department of Agriculture of Guangdong Province (2018-36), by the Science and Technology Program of Guangdong Province (2019B030316006) and by the Youth Innovation Promotion Association of the Chinese Academy of Sciences [2017141].

References

[1] Li Z, Fu BY, Gao YM, Wang WS, Xu JL, Zhang F, et al. The 3,000 rice genomes project. GigaScience 2014;3.

[2] Zhang Z, Hu S, He H, Zhang H, Chen F, Zhao W, et al. Information Commons for Rice (IC4R). Nucleic Acids Research 2016;44:D1172–80.

[3] Frazer KA, Ballinger DG, Cox DR, Hinds DA, Stuve LL, Gibbs RA, et al. A second generation human haplotype map of over 3.1 million SNPs. Nature 2007;449:851–61.

[4] Flint-Garcia SA, Thornsberry JM, Buckler ES. Structure of Linkage Disequilibrium in Plants. Annual Review of Plant Biology 2003;54:357–74.

[5] Nielsen R. Molecular Signatures of Natural Selection. Annual Review of Genetics 2005;39:197–218.

[6] Lam HM, Xu X, Liu X, Chen W, Yang G, Wong FL, et al. Resequencing of 31 wild and cultivated soybean genomes identifies patterns of genetic diversity and selection. Nature Genetics 2010;42:1053–9.

[7] Fujii H, Ogata T, Shimada T, Endo T, Iketani H, Shimizu T, et al. Minimal marker: An algorithm and computer program for the identification of minimal sets of discriminating DNA markers for efficient variety identification. Journal of Bioinformatics and Computational Biology 2013;11:1250022.

[8] Browning SR, Browning BL. Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. American Journal of Human Genetics 2007;81:1084–97.

[9] Abraham G, Inouye M. Fast principal component analysis of large-scale genome-wide data. PLoS ONE 2014;9.

[10] Wang W, Mauleon R, Hu Z, Chebotarov D, Tai S, Wu Z, et al. Genomic variation in 3,010 diverse accessions of Asian cultivated rice. Nature 2018;557:43–9.

[11] McCouch SR, Wright MH, Tung CW, Maron LG, McNally KL, Fitzgerald M, et al. Open access resources for genome-wide association mapping in rice. Nature Communications 2016;7.

[12] Spindel J, Begum H, Akdemir D, Virk P, Collard B, Redoña E, et al. Genomic Selection and Association Mapping in Rice (Oryza sativa): Effect of Trait Genetic Architecture, Training Population Composition, Marker Number and Statistical Model on Accuracy of Rice Genomic Selection in Elite, Tropical Rice Breeding Lines. PLoS Genetics 2015;11:1–25.

[13] Endelman JB. Ridge Regression and Other Kernels for Genomic Selection with R Package rrBLUP. The Plant Genome Journal 2011;4:250.

[14] Zhao K, Tung CW, Eizenga GC, Wright MH, Ali ML, Price AH, et al. Genome-wide association mapping reveals a rich genetic architecture of complex traits in Oryza sativa. Nature Communications 2011;2.

[15] Mather KA, Caicedo AL, Polato NR, Olsen KM, McCouch S, Purugganan MD. The extent of linkage disequilibrium in rice (Oryza sativa L.). Genetics 2007;177:2223–32.

[16] Chang CC, Chow CC, Tellier LCAM, Vattikuti S, Purcell SM, Lee JJ. Second-generation PLINK: Rising to the challenge of larger and richer datasets. GigaScience 2015;4:7.

[17] Kumar S, Stecher G, Peterson D, Tamura K. MEGA-CC: Computing core of molecular evolutionary genetics analysis program for automated and iterative data analysis. Bioinformatics 2012;28:2685–6.

[18] Kumar S, Stecher G, Tamura K. MEGA7: Molecular Evolutionary Genetics Analysis Version 7.0 for Bigger Datasets. Molecular Biology and Evolution 2016;33:1870–4.

[19] Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, et al. The variant call format and VCFtools. Bioinformatics 2011;27:2156–8.

[20] Yi X, Du Z, Su Z. PlantGSEA: a gene set enrichment analysis toolkit for plant community. Nucleic Acids Research 2013;41:W98-103.

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 14, 2020. . https://doi.org/10.1101/2020.01.11.902999doi: bioRxiv preprint

Page 22: SnpReady for Rice (SR4R) DatabaseOryza sativa, or rice, was the first crop genome to be sequenced. In the past decade, thousands of rice accessions in the germplasm banks worldwide

Figure legends 623

Figure 1 An overview of the four SNP panels of the SR4R database 624

The flow chart describes procedures on how the four SNP panels were generated. 625

626

Figure 2 Basic statistics of the rice hapmapSNPs after four steps of genotype processing 627

After four steps of genotype processing, a series of basic statistical analyses was performed at 628

each step to exhibit the characteristics of the SNPs, including: A. statistics of individual 629

missing rate, B. statistics of individual heterozygote rate, C. statistics of minor allele 630

frequency, D. statistics of site missing rate, and E. statistics of site heterozygote rate. F. After 631

ARNOVAR analysis to annotated the hapmapSNPs, distribution of the hapmapSNPs in 632

different genomic regions were analyzed. 633

634

Figure 3 Population structure analysis of the 2,556 rice accessions using tagSNPs 635

To test whether the 150K tagSNPs can generate the population structures consistent with 636

previous reports, we performed a series of population structure analyses to generate: A. the 637

phylogenetic tree, B. the PCA map, C. the admixture structure of 2,556 rice accessions, D. 638

the phylogenetic tree of the six subgroups of indica rice. The tagSNPs effectively and 639

accurately classified the 2,556 rice accessions to corresponding populations. 640

641

Figure 4 Genetic diversity analysis of rice accessions using tagSNPs 642

The 150K tagSNPs were used in a series of population genetic analysis to show the 643

effectiveness of tagSNPs including: A. statistics of homozygous SNPs, B. statistics of 644

individual heterozygosity, C. pairwise IBS values distribution, D. statistics of ROH regions, 645

E. LD decay analysis, in the five major rice subpopulations. F. Genetic diversity (θπ) and 646

population differentiation (Fst) between cultivated and wild subpopulations. G. Population 647

differentiation (Fst) of cultivated subpopulations. 648

649

Figure 5 Genomic selection-based phenotype prediction using tagSNPs

Nine agronomical phenotypes were predicted with rrBLUP models to evaluate the effectiveness of tagSNPs. Five sets of SNPs were compared, with Set-2 to Set-5 containing equal numbers of SNPs. Set-1: the original 29,434 SNPs on the 44K chip; Set-2: the 1,090 SNPs shared between the 156,502 tagSNPs and the 29,434 SNPs; Set-3: 1,090 SNPs randomly selected from the 29,434 SNPs; Set-4: 1,090 SNPs evenly distributed in the genome (350 Kb per SNP) selected from the 29,434 SNPs; Set-5: 1,090 SNPs localized within a genomic region from the 29,434 SNPs.
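The subsampling schemes behind Set-3 to Set-5 can be illustrated with a minimal Python sketch. It assumes SNPs are given as (chromosome, position) tuples; the function names, the fixed random seed, the rule of keeping the first SNP in each bin, and the choice of a contiguous block for the localized set are all illustrative assumptions, not the authors' actual procedure.

```python
import random


def random_subset(snps, k, seed=0):
    """Set-3 style: draw k SNPs uniformly at random without replacement."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    return sorted(rng.sample(snps, k))


def evenly_spaced_subset(snps, spacing):
    """Set-4 style: keep at most one SNP per `spacing`-bp bin on each chromosome."""
    chosen = {}
    for chrom, pos in sorted(snps):
        # The first SNP encountered in a bin is kept (an arbitrary tie-break).
        chosen.setdefault((chrom, pos // spacing), (chrom, pos))
    return sorted(chosen.values())


def localized_subset(snps, k, start=0):
    """Set-5 style: take k consecutive SNPs, i.e. one contiguous block."""
    ordered = sorted(snps)
    return ordered[start:start + k]


# Toy data: 100 SNPs on one chromosome, one every 10 bp.
toy_snps = [("chr01", p) for p in range(0, 1000, 10)]
set3_like = random_subset(toy_snps, 10)
set4_like = evenly_spaced_subset(toy_snps, 100)
set5_like = localized_subset(toy_snps, 5)
```

With real coordinates, calling `evenly_spaced_subset(snps, 350_000)` would keep at most one SNP per 350 Kb bin, matching the density stated for Set-4.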

Figure 6 Screening and validation of FixedSNPs

A. Distribution of θπ ratios (wild vs. cultivar) and the corresponding Fst values, calculated in 100-kb windows. Data points to the right of the vertical dashed line and above the horizontal dashed line (red points, corresponding to the 5% right tails of the empirical distributions of θπ ratio and Fst values, respectively) are potential strong selective sweep signals. The distributions of Tajima’s D values for the potential selective sweep signals and for the whole genome are shown within the plot. Other comparisons for the screening of subgroup-specific selective sweep signals are not shown here but demonstrate similar trends. B. Common and specific selective signals among cultivar subgroups (numbers of genes and GSEA terms are shown outside and inside the brackets, respectively). C. Phylogenetic tree of 2,538 rice cultivars in the fixedSNPs data set. D. Phylogenetic tree of 880 rice cultivars in the 700K chip data set. E. Phylogenetic tree of 351 rice cultivars in the 44K chip data set.
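The joint-threshold screening described for panel A can be sketched as follows. This is a minimal illustration, assuming each 100-kb window is a dictionary with the hypothetical keys `pi_ratio` (θπ of wild divided by θπ of cultivar) and `fst`; the function names and the integer-percent cutoff rule are assumptions, not the authors' pipeline.

```python
def upper_tail_cutoff(values, tail_percent=5):
    """Return the smallest value that falls inside the top `tail_percent` percent."""
    ordered = sorted(values)
    # Size of the upper tail (at least one value), computed with integer arithmetic.
    k = max(1, len(ordered) * tail_percent // 100)
    return ordered[len(ordered) - k]


def candidate_sweep_windows(windows, tail_percent=5):
    """Keep windows lying in the upper tail of BOTH the theta-pi ratio and Fst."""
    if not windows:
        return []
    ratio_cut = upper_tail_cutoff([w["pi_ratio"] for w in windows], tail_percent)
    fst_cut = upper_tail_cutoff([w["fst"] for w in windows], tail_percent)
    return [
        w for w in windows
        if w["pi_ratio"] >= ratio_cut and w["fst"] >= fst_cut
    ]


# Toy data: 100 windows in which both statistics increase together.
concordant = [{"window": i, "pi_ratio": float(i), "fst": float(i)} for i in range(100)]
hits = candidate_sweep_windows(concordant)

# Toy data: the statistics are anti-correlated, so no window passes both cutoffs.
discordant = [{"window": i, "pi_ratio": float(i), "fst": float(99 - i)} for i in range(100)]
no_hits = candidate_sweep_windows(discordant)
```

Requiring both statistics to exceed their own empirical cutoff corresponds to the upper-right quadrant delimited by the two dashed lines in the plot.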

Figure 7 Representative functional modules in the SR4R database

A. Genes exhibiting significant selection signatures in the corresponding subpopulations are listed in the “Selected Genes” module in the Browser. B. Allele frequencies in different subpopulations of the first hapmapSNP (SNPID: OSA01S00001362, associated gene: Os01g0100100, position: chr01-1362, allele: Alt-A, Ref-G). C. One example of the script and pipeline for population diversity analysis. D. The online analysis module for subpopulation classification using machine learning algorithms. E. The online analysis module for rice variety identification using the 38 barcodeSNPs.

Supplementary material

Figure S1 Population structure of 1,655 varieties in indica

A. PCA classification for 1,655 varieties in the indica subgroup. B. Structure analysis for 1,655 varieties in the indica subgroup.

Figure S2 T-test for Pearson correlations between the selected 1,090 tagSNPs set and the other four SNP sets

Set-1: the original 29,434 SNPs on the 44K chip; Set-2: the 1,090 SNPs shared between the 156,502 tagSNPs and the 29,434 SNPs; Set-3: 1,090 SNPs randomly selected from the 29,434 SNPs; Set-4: 1,090 SNPs evenly distributed in the genome (350 Kb per SNP) selected from the 29,434 SNPs; Set-5: 1,090 SNPs localized within a genomic region from the 29,434 SNPs. Different colors represent different P values. * P value < 0.05; ** P value < 0.01.

Figure S3 BarcodeSNPs and machine learning models for the classification of rice varieties

A. Heat-map of BarcodeSNPs of 2,538 rice cultivars (Red: A, Yellow: T, Blue: G, Green: C). B. Decision model. C. KNN model. D. Multinomial logistic regression model. E. Naive Bayesian model. F. Neural network model. G. Random forest model. H. One-vs-rest logistic regression model. AUC curves were drawn using the mean values of ten cross validations for B-H.

Figure S4 Independent validation of the machine learning model

A. ROC curve for the 770 Kb rice SNP chip dataset using the pre-built multinomial logistic regression model. B. The true positive rate (TPR) and false positive rate (FPR) statistics for each subpopulation of the 770 Kb rice SNP chip dataset.

Table S1 Summary of 2,556 rice accessions with subpopulation classification and origins

Table S2 Summary of SNP annotation for the SR4R database

Table S3 The barcodeInDel panel in the SR4R database

Figure 1

Figure 2

Figure 3

Figure 4

Figure 5

Figure 6

Figure 7

Figure S1

Figure S2

Figure S3

Figure S4
