INTEGRATED GENOTYPING SERVICE AND …hpc.ilri.cgiar.org/beca/training/data_mgt_2017/DArTseq...DArTseq – Genotyping by Sequencing Developed since 1998 Complexity reduction optimised

INTEGRATED GENOTYPING SERVICE AND SUPPORT (IGSS)

DArTSeq: Wet Lab Components

Martin Kanyeki

DArTseq™ – Genotyping by Sequencing

  Developed since 1998

  Complexity reduction optimised for each organism

  Targeting 100,000-200,000 mostly low copy sequences (methyl filtration)

  Most assays read at >10X

    High call rate

    High data quality

  Extensive quality control including individual libraries

  Technical replication -> selection of best markers

  Sequencing platform independent

DArTSeq: features

  Detects all types of polymorphism (mostly SNP but also InDels, CNVs and methylation variation)

  Whole genome coverage allows detection of structural rearrangements

  Unbiased for the location of polymorphism, but highly enriched for genetically active regions of the genome

  Profiles the whole genome (thousands of loci) in a single assay on automated platforms

  Fast and inexpensive technology development

    –  no need for assay development or sequence information

Robustness of DArTSeq

  Methylation filtration effect

    enriches for hypomethylated, expressed genome regions and low copy sequences

    makes DArT robust with respect to genome size

  A high proportion (~50%) of markers are low copy with high homology to “genes”

  Thanks to this feature, DArT and DArTseq have been deployed in many genome sequencing projects, mostly for linking high-density genetic maps with sequence assemblies

Library Preparation

DNA fragment

Barcode Adapter (rare-cutter specific), Common Adapter (frequent-cutter specific)

Genomic DNA sample 1

Digestion (rare and frequent cutter)

Ligation (T4 ligase)

PCR amplification

Post-PCR

Library Pooling

Kit-based Purification

Library Quantification

cBOT Cluster Generation
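
The complexity-reduction idea behind the digestion step can be illustrated with a small in-silico double digest. This is only a sketch under stated assumptions: the slides do not name the enzymes, so the PstI (CTGCAG) and MseI (TTAA) recognition sequences, the function names and the size limits below are hypothetical choices for illustration. For simplicity the cut is placed at the start of each recognition site rather than at the true cleavage position. Only fragments with a rare-cutter site at one end and a frequent-cutter site at the other are kept, since those are the fragments that would receive both the barcode adapter and the common adapter.

```python
import re

# Assumed enzymes for illustration only; the slides do not name them.
RARE_SITE = "CTGCAG"      # PstI recognition sequence
FREQUENT_SITE = "TTAA"    # MseI recognition sequence


def cut_positions(seq, site):
    """Return the start index of every (possibly overlapping) occurrence of site."""
    return [m.start() for m in re.finditer(f"(?={site})", seq)]


def mixed_end_fragments(seq, rare=RARE_SITE, frequent=FREQUENT_SITE,
                        min_len=20, max_len=400):
    """Return fragments bounded by one rare-cutter and one frequent-cutter site.

    Simplification: each fragment runs from the start of one recognition
    site to the start of the next, ignoring the real cleavage offsets.
    """
    cuts = sorted(
        [(pos, "rare") for pos in cut_positions(seq, rare)]
        + [(pos, "frequent") for pos in cut_positions(seq, frequent)]
    )
    fragments = []
    # Walk over neighbouring cut sites and keep only mixed-end fragments
    # whose length falls inside the (hypothetical) size window.
    for (start, kind_a), (end, kind_b) in zip(cuts, cuts[1:]):
        if kind_a != kind_b and min_len <= end - start <= max_len:
            fragments.append(seq[start:end])
    return fragments


if __name__ == "__main__":
    toy_genome = "A" * 10 + "CTGCAG" + "G" * 30 + "TTAA" + "C" * 30 + "TTAA" + "G" * 10
    for fragment in mixed_end_fragments(toy_genome):
        print(len(fragment), fragment)
```

In the toy sequence, the fragment between the two frequent-cutter sites is discarded because both of its ends are of the same kind; only the rare-to-frequent fragment survives.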

Cluster Generation

  Cluster generation is carried out on the cBOT (Illumina) according to the procedures described by the manufacturer.

  Briefly: 10 nM DNA of each library is denatured, diluted in hybridization buffer and loaded into the machine; clusters are then generated in the flow cell by the cBOT using the cBOT reagent set (bridge amplification).

  During cluster generation the molecules of each library are attached to the flow cell surface and amplified to form clonal clusters.

Sequencing

  Sequencing is carried out on the HiSeq 2500 sequencer (Illumina) using the methodology provided by the manufacturer.

  Briefly:

    The flow cell with clusters generated in the previous step (cBOT) is loaded into the HiSeq 2500 together with the sequencing reagents.

    The HiSeq 2500 performs sequencing according to user-selected sequencing parameters.

    We sequence our libraries from one end, performing single-read sequencing runs of 77 bases.

Real Time Analysis (RTA)

  RTA runs concurrently with the sequencing run, and the RTA data are written to a server specified by the user.

  Currently for DArTSeq, one HiSeq 2500 sequencing run results in less than 1 TB of compressed data, comprising result files, log files and a number of quality control files.

  The main sequence output files are base calling (*.bcl) files. These files are the input for downstream data conversion.

INTEGRATED GENOTYPING SERVICE AND SUPPORT (IGSS)

DArTSeq: Analytical Components

Leonard Kiche

Primary Workflow

  The primary workflow is custom-built software for downstream processing of *.bcl files.

  The first step converts *.bcl files into *.fastq files, using the Illumina bcl2fastq software embedded in the primary workflow.

  The second step performs two functions at the same time:

    First, using the target definition from DArTdb, the software splits the sequencing reads according to their barcode sequence (de-multiplexing).

    Second, it removes reads that fail the quality filters.

  Two filters are applied: a more stringent one for the barcode sequence and a less stringent one for the remaining part of the sequencing read.

  Finally, the sequence tags are compressed ten-fold and copied to DArTdb for permanent storage.

  From DArTdb we extract the compressed sequence tags and load them into DArTsoft14 for marker data extraction.
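
The de-multiplexing and two-level quality filtering described in this workflow can be sketched as follows. This is a minimal illustration, not the actual primary workflow software: the function names, the in-memory read format, the Phred+33 quality encoding and both thresholds (30 for the barcode, 10 for the rest of the read) are assumptions chosen only to show the idea of a stricter filter on the barcode than on the remaining bases.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string (Phred+33 assumed) into integer scores."""
    return [ord(ch) - offset for ch in quality_string]


def demultiplex(reads, barcodes, barcode_min_q=30, read_min_q=10):
    """Assign reads to samples by barcode and drop low-quality reads.

    reads    -- iterable of (sequence, quality_string) pairs
    barcodes -- dict mapping barcode sequence -> sample name
    The thresholds are hypothetical; the barcode filter is the stricter one.
    """
    by_sample = {sample: [] for sample in barcodes.values()}
    for seq, qual in reads:
        for barcode, sample in barcodes.items():
            if not seq.startswith(barcode):
                continue
            scores = phred_scores(qual)
            n = len(barcode)
            # Stricter filter on the barcode bases, looser on the rest.
            barcode_ok = min(scores[:n]) >= barcode_min_q
            rest_ok = all(q >= read_min_q for q in scores[n:])
            if barcode_ok and rest_ok:
                # Store the read with the barcode trimmed off.
                by_sample[sample].append(seq[n:])
            break  # a read is tested against at most one matching barcode
    return by_sample


if __name__ == "__main__":
    sample_barcodes = {"ACGT": "sample_1", "TGCA": "sample_2"}
    example_reads = [
        ("ACGTGGGG", "IIIIIIII"),  # good barcode, good read
        ("ACGTGGGG", "I#IIIIII"),  # low-quality base inside the barcode
        ("TGCACCCC", "IIII#III"),  # low-quality base after the barcode
    ]
    print(demultiplex(example_reads, sample_barcodes))
```

Reads whose sequence starts with no known barcode are simply left unassigned in this sketch; a production pipeline would normally also report them.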

DArTdb: LIMS/database for GBS data

  Stores all data (including raw)

  All alignment and “counts” data from GBS pipeline

  Fully configurable workflow (any assay, chemistry or sequencer)

  Easy connection with KDDart database storing processed marker data + field + environment/GIS data

DArTdb – High level view

DArTsoft14: Secondary pipeline in KDCompute plug-in platform

  Novel marker calling algorithm

  DArTsoft14 extracts two types of marker data: SNPs and SilicoDArTs

  SNPs, SilicoDArTs and 20+ metadata for final marker selection

  Alignment to the model genome if available

  Stable framework but “evolvable” algorithmically

  Superfast and efficient: tens of thousands of samples analysed within 24 hours

DArTseq implementation: Basic facts

  Launched when cost of sequencing became affordable

  Established for nearly 200 organisms

  Typical number of fragments in representation 200,000-300,000

  Fully scalable technology

  Most markers (>70%) in genic regions

  Balance of SNPs versus DArTs depends on the level of sequence polymorphism

Markername SNP     Callrate  REF OneRatio  REF Rowsum  Count rowsums  Scores (23 samples; sample names shown as #### in the source)
13005|F|0  26:T>G  0.913043  0.285714      115.397     117            1 0 1 1 1 1 0 1 1 0 1 - 1 1 0 1 1 0 - - 1 1 0
13005|F|0          0.913043  0.285714      115.397     121            0 1 0 0 0 0 1 0 0 1 0 - 0 0 1 0 0 1 - 0 0 0 1
10786|F|0  6:C>G   1         0.434783      408.516     408            - 0 1 0 1 1 0 1 0 1 1 0 1 1 1 0 0 1 1 - 1 1 0
10786|F|0          1         0.434783      408.516     510            1 1 0 1 0 0 1 0 1 0 0 1 0 0 0 1 1 0 0 1 0 0 1
1220|F|0   43:T>C  0.913043  0.428571      99.60309    318            1 1 1 1 1 0 1 0 1 0 1 1 0 0 1 1 1 1 0 1 0 0 0
1220|F|0           0.913043  0.428571      99.60309    330            0 0 0 0 0 1 - 1 0 1 0 - 1 1 0 0 0 0 1 0 1 1 1
11445|F|0  12:C>T  0.913043  0.428571      274.0484    274            0 0 1 1 1 1 1 1 0 0 0 - 1 1 - - 0 1 0 0 1 1 1
11445|F|0          0.913043  0.428571      274.0484    298            1 1 0 0 0 0 0 0 1 1 1 - 0 0 1 - 1 0 1 1 0 0 0
11304|F|0  26:T>G  0.913043  0.47619       331.8471    332            0 - - 1 1 0 1 1 0 1 - - 0 0 1 1 0 1 1 1 0 0 1
11304|F|0          0.913043  0.47619       331.8471    332            1 1 - 0 0 1 0 0 1 0 - 1 1 1 0 0 1 0 0 0 1 1 0
1120|F|0   57:T>C  0.869565  0.5           61.19317    327            0 1 0 1 - 1 0 0 0 1 - 0 1 1 0 0 0 1 1 1 1 1 1
1120|F|0           0.869565  0.5           61.19317    316            1 - 1 0 1 0 1 1 1 0 - 1 0 0 1 1 1 0 - 0 0 0 0
3947|F|0   8:T>C   0.956522  0.409091      55.54602    141            1 1 1 0 0 0 1 1 1 - 1 - 0 0 0 1 1 1 1 0 0 0 1
3947|F|0           0.956522  0.409091      55.54602    150            0 0 0 1 1 1 0 0 0 0 0 - 1 1 1 0 0 0 0 1 1 1 0
4226|F|0   28:A>T  0.956522  0.409091      13.88985    120            0 0 1 - 0 1 1 0 0 - 1 0 1 1 1 1 0 1 0 0 1 1 1
4226|F|0           0.956522  0.409091      13.88985    133            1 1 0 - 1 0 0 1 1 0 0 1 0 0 0 0 1 0 1 1 0 0 0
758|F|0    62:G>A  1         0.434783      159.9895    352            1 1 1 1 1 0 1 0 1 1 1 0 0 0 0 0 1 1 0 1 0 0 1
758|F|0            1         0.434783      159.9895    386            0 0 0 0 0 1 0 1 0 0 0 1 1 1 1 1 0 0 1 0 1 1 0
11790|F|0  48:G>A  0.913043  0.47619       212.1765    211            0 0 1 1 0 1 0 0 0 1 1 - 1 1 - 0 0 1 0 - 1 1 1
11790|F|0          0.913043  0.47619       212.1765    230            1 1 0 0 1 0 1 1 1 0 0 - 0 0 1 1 1 0 1 - 0 0 0
1746|F|0   27:T>G  0.869565  0.45          49.3698     195            0 - 0 1 0 1 1 1 0 0 0 1 1 1 1 1 0 0 - - 1 1 0
1746|F|0           0.869565  0.45          49.3698     270            1 - - 0 1 0 0 0 1 1 1 0 0 0 0 0 1 1 - 1 0 0 1
2167|F|0   25:A>G  0.913043  0.428571      144.8196    200            0 1 1 0 - 1 1 1 0 - 1 0 1 1 1 0 0 - 1 0 1 0 0
2167|F|0           0.913043  0.428571      144.8196    242            1 0 0 1 - 0 0 0 1 0 0 1 0 0 0 1 1 - 0 1 0 1 1
4384|F|0   56:G>C  0.869565  0.35          23.8538     112            1 1 0 1 1 0 0 1 1 1 - 0 0 0 1 1 - - 0 1 0 0 1
4384|F|0           0.869565  0.35          23.8538     130            0 0 1 0 0 1 1 0 0 0 - 1 1 1 0 0 0 0 - 0 1 - 0
899|F|0    61:G>A  0.956522  0.5           32.2733     255            1 1 1 1 1 0 0 - 1 1 0 1 0 0 0 1 1 1 0 1 0 0 0
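
The Callrate and OneRatio columns in the table above appear to be simple proportions over a marker's score row. This reading is inferred from the numbers in the table, not stated in the slides: for each marker pair, the printed values match the second (unlabelled) row when call rate is taken as the fraction of samples with a score other than "-" and one-ratio as the fraction of scored samples with value 1. The function name below is hypothetical, and the sketch only reproduces those two columns.

```python
def marker_stats(scores):
    """Return (call_rate, one_ratio) for one row of '0' / '1' / '-' scores.

    call_rate -- fraction of samples that received a score (not '-')
    one_ratio -- fraction of scored samples with score '1'
    These definitions are inferred from the table, not taken from DArT documentation.
    """
    called = [s for s in scores if s != "-"]
    call_rate = len(called) / len(scores)
    one_ratio = called.count("1") / len(called) if called else float("nan")
    return call_rate, one_ratio


# Second row of marker 13005|F|0, copied from the table above.
row = "0 1 0 0 0 0 1 0 0 1 0 - 0 0 1 0 0 1 - 0 0 0 1".split()
call_rate, one_ratio = marker_stats(row)
print(f"{call_rate:.6f} {one_ratio:.6f}")  # 0.913043 0.285714
```

With 23 samples, 2 missing scores and 6 ones, this gives 21/23 and 6/21, matching the Callrate and OneRatio printed for that marker.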

Distribution of counts in wheat. Current wheat reference = 564,000; validated HQ SNPs > 100,000

SNP scoring table in maize. Similar size of representation as in wheat, but much higher marker frequency

Conclusion

  DArTseq developed for nearly 200 organisms representing a wide range of breeding systems, genome sizes and ploidy levels

  Addressing the need for technologies and services providing genetic fingerprints at a density appropriate for the application

  Scalable technology working from hundreds of markers to >100,000

  Best genotyping by sequencing technology on the market according to our clients

  Information technologies properly developed to deal with huge volumes of data

  Permanent storage of raw data from the beginning of the company's existence – safety of data assured for all clients
