INTEGRATED GENOTYPING SERVICE AND SUPPORT (IGSS)
DArTSeq: Wet Lab Components
Martin Kanyeki
DArTseq™ – Genotyping by Sequencing
Developed since 1998
Complexity reduction optimised for each organism
Targeting 100,000-200,000 mostly low-copy sequences
(methyl filtration)
Most assays read at >10X
High call rate
High data quality
Extensive quality control including individual libraries
Technical replication -> selection of best markers
Sequencing platform independent
DArTSeq: features
Detects all types of polymorphism (mostly SNPs, but also
InDels, CNVs and methylation variation)
Whole genome coverage allows detection of structural
rearrangements
Unbiased for the location of polymorphism, but highly
enriched for genetically active regions of the genome
Profiles the whole genome (thousands of loci) in a single
assay on automated platforms
Fast and inexpensive technology development
– no need for assay development or sequence
information
Robustness of DArTSeq
Methylation filtration effect
enriches for hypomethylated, expressed genome
regions and low-copy sequences
makes DArT robust with respect to genome size
A high proportion (~50%) of markers are low copy and
highly homologous to “genes”
Thanks to this feature, DArT and DArTseq have been deployed in
many genome sequencing projects, mostly for linking
high-density genetic maps with sequence assemblies
Library Preparation
DNA fragment
Barcode adapter (rare-cutter specific)
Common adapter (frequent-cutter specific)
Genomic DNA sample 1
Digestion (rare and frequent cutter)
Ligation (T4 ligase)
PCR amplification
Post-PCR
Library pooling
Kit-based purification
Library quantification
cBot cluster generation
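The digestion and adapter steps above can be illustrated with a small in-silico sketch. It keeps only fragments with one rare-cutter end and one frequent-cutter end, which are the fragments that can carry both the barcode adapter and the common adapter. The enzymes are an assumption for illustration only (PstI as the rare cutter, MseI as the frequent cutter); the slides do not name the enzymes, and the function names are hypothetical.

```python
import re

# Assumed enzymes, not taken from the slides:
RARE_SITE, RARE_OFFSET = "CTGCAG", 5   # PstI cuts CTGCA^G
FREQ_SITE, FREQ_OFFSET = "TTAA", 1     # MseI cuts T^TAA

def cut_positions(seq, site, offset):
    """Return cut coordinates for every (possibly overlapping) site match."""
    return [m.start() + offset for m in re.finditer(f"(?={site})", seq)]

def mixed_end_fragments(seq):
    """Fragments bounded by one rare-cutter cut and one frequent-cutter cut."""
    cuts = [(p, "rare") for p in cut_positions(seq, RARE_SITE, RARE_OFFSET)]
    cuts += [(p, "freq") for p in cut_positions(seq, FREQ_SITE, FREQ_OFFSET)]
    cuts.sort()
    fragments = []
    for (start, end_a), (stop, end_b) in zip(cuts, cuts[1:]):
        if end_a != end_b:  # the two ends come from different enzymes
            fragments.append(seq[start:stop])
    return fragments

example = "AAACTGCAGGGGGTTAACCCCTTAAGG"
fragments = mixed_end_fragments(example)
```

Only the top strand is considered here, and size selection by PCR is ignored, so this is a simplification of the wet-lab representation.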
Cluster Generation
• Cluster generation is carried out on the cBot (Illumina)
according to the procedures described by the
manufacturer.
• Briefly: 10 nM DNA of each library is denatured, diluted
in hybridization buffer and loaded into the machine;
clusters are then generated in the flow cell by the cBot using
the cBot reagent set (bridge amplification).
• During cluster generation the molecules of each library
are attached to the flow cell surface and amplified to
form clonal clusters.
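Reaching the 10 nM loading concentration means converting a measured mass concentration into molarity. A minimal sketch, assuming double-stranded DNA at roughly 660 g/mol per base pair; the 10 ng/µL reading, the 300 bp mean fragment length and the 20 µL final volume are hypothetical example values, not figures from the slides.

```python
AVG_BP_MASS = 660.0  # g/mol per base pair of dsDNA (standard approximation)

def library_molarity_nm(conc_ng_per_ul, mean_fragment_bp):
    """Convert a dsDNA concentration in ng/uL to nM."""
    return conc_ng_per_ul / (AVG_BP_MASS * mean_fragment_bp) * 1e6

def dilution_volumes(stock_nm, target_nm, final_volume_ul):
    """Return (stock_ul, diluent_ul) from C1 * V1 = C2 * V2."""
    if stock_nm < target_nm:
        raise ValueError("stock is more dilute than the target")
    stock_ul = target_nm * final_volume_ul / stock_nm
    return stock_ul, final_volume_ul - stock_ul

# Hypothetical library: 10 ng/uL, mean fragment length 300 bp
stock_nm = library_molarity_nm(10.0, 300)
stock_ul, diluent_ul = dilution_volumes(stock_nm, 10.0, 20.0)
```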
Sequencing
• Sequencing is carried out on the HiSeq 2500 sequencer
(Illumina) using the methodology provided by the
manufacturer.
• Briefly:
➢ The flow cell with the clusters generated in the previous
step (cBot) is loaded into the HiSeq 2500 together with the
sequencing reagents.
➢ The HiSeq 2500 performs sequencing according to
user-selected sequencing parameters.
➢ We sequence our libraries from one end, performing
single-read runs of 77 bases.
Real Time Analysis (RTA)
• RTA runs concurrently with the sequencing run, and
the RTA data are written to a server specified by the
user.
• Currently for DArTSeq, one HiSeq 2500 sequencing run
produces less than 1 TB of compressed data, comprising
result files, log files and a number of quality control
files.
• The main sequence output files are the base-calling
(*.bcl) files. These files are the input for downstream
data conversion.
INTEGRATED GENOTYPING SERVICE AND SUPPORT (IGSS)
DArTSeq: Analytical Components
Leonard Kiche
Primary Workflow
• The primary workflow is custom-built software for downstream processing of *.bcl files.
• The first step converts *.bcl files into *.fastq files using the Illumina bcl2fastq software embedded in the primary workflow.
• The second step performs two functions at the same time:
➢ First, using target definitions from DArTdb, the software splits the sequencing reads according to their barcode sequence (de-multiplexing).
➢ Second, it removes reads that fall below the quality filters.
• Two filters are applied: a more stringent one for the barcode sequence and a less stringent one for the remaining part of the sequencing read.
• Finally, the sequence tags are compressed about tenfold and copied to DArTdb for permanent storage.
• From DArTdb, the compressed sequence tags are extracted and loaded into DArTsoft14 for marker data extraction.
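The de-multiplexing and two-level quality filtering described above can be sketched as follows. This is an illustration under stated assumptions, not the DArT primary workflow itself: the barcodes, sample names and Phred thresholds (30 for the barcode, 10 for the rest of the read) are hypothetical, qualities are assumed to be Phred+33 encoded, and no barcode is assumed to be a prefix of another.

```python
def phred_scores(quality_string, offset=33):
    """Decode a Phred+33 quality string into integer scores."""
    return [ord(ch) - offset for ch in quality_string]

def demultiplex(reads, barcode_to_sample, barcode_min_q=30, read_min_q=10):
    """Split reads by leading barcode, applying a stricter filter to the barcode.

    reads: iterable of (sequence, quality_string) pairs.
    Returns {sample: [barcode-trimmed sequences]}.
    """
    by_sample = {sample: [] for sample in barcode_to_sample.values()}
    for seq, qual in reads:
        for barcode, sample in barcode_to_sample.items():
            if not seq.startswith(barcode):
                continue
            n = len(barcode)
            scores = phred_scores(qual)
            barcode_ok = min(scores[:n]) >= barcode_min_q
            # default covers the edge case of a read that is only a barcode
            rest_ok = min(scores[n:], default=read_min_q) >= read_min_q
            if barcode_ok and rest_ok:
                by_sample[sample].append(seq[n:])
            break  # a read belongs to at most one barcode
    return by_sample

# Hypothetical barcodes and reads ('I' = Q40, '#' = Q2)
barcodes = {"ACGT": "sample_1", "TTAG": "sample_2"}
reads = [
    ("ACGTTTGCA", "IIIIIIIII"),  # passes both filters
    ("ACGTTTGCA", "I#IIIIIII"),  # low quality inside the barcode
    ("TTAGCCGTA", "IIII#IIII"),  # low quality in the rest of the read
    ("GGGGCCGTA", "IIIIIIIII"),  # no known barcode
]
result = demultiplex(reads, barcodes)
```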
DArTdb: LIMS/database for GBS data
Stores all data (including raw)
All alignment and “counts” data from GBS pipeline
Fully configurable workflow (any assay, chemistry or sequencer)
Easy connection with KDDart database storing processed marker data +
field + environment/GIS data
DArTdb – High level view
DArTsoft14: Secondary pipeline in KDCompute plug-in platform
Novel marker calling algorithm
DArTsoft14 extracts two types of marker data: SNPs and SilicoDArTs
SNPs, SilicoDArTs and 20+ metadata for final marker selection
Alignment to the model genome if available
Stable framework but “evolvable” algorithmically
Superfast and efficient: tens of thousands of samples analysed within 24 hours
DArTseq implementation: Basic facts
Launched when cost of sequencing
became affordable
Established for nearly 200
organisms
Typical number of fragments in
representation 200,000-300,000
Fully scalable technology
Most markers (>70%) in genic
regions
Balance of SNPs versus DArTs
depends on the level of sequence
polymorphism
Markername  SNP     CallrateREF  OneRatioREF  Rowsum    Countrowsums  #### (sample columns, 23 calls per row)
13005|F|0   26:T>G  0.913043     0.285714     115.397   117           1 0 1 1 1 1 0 1 1 0 1 - 1 1 0 1 1 0 - - 1 1 0
13005|F|0           0.913043     0.285714     115.397   121           0 1 0 0 0 0 1 0 0 1 0 - 0 0 1 0 0 1 - 0 0 0 1
10786|F|0   6:C>G   1            0.434783     408.516   408           - 0 1 0 1 1 0 1 0 1 1 0 1 1 1 0 0 1 1 - 1 1 0
10786|F|0           1            0.434783     408.516   510           1 1 0 1 0 0 1 0 1 0 0 1 0 0 0 1 1 0 0 1 0 0 1
1220|F|0    43:T>C  0.913043     0.428571     99.60309  318           1 1 1 1 1 0 1 0 1 0 1 1 0 0 1 1 1 1 0 1 0 0 0
1220|F|0            0.913043     0.428571     99.60309  330           0 0 0 0 0 1 - 1 0 1 0 - 1 1 0 0 0 0 1 0 1 1 1
11445|F|0   12:C>T  0.913043     0.428571     274.0484  274           0 0 1 1 1 1 1 1 0 0 0 - 1 1 - - 0 1 0 0 1 1 1
11445|F|0           0.913043     0.428571     274.0484  298           1 1 0 0 0 0 0 0 1 1 1 - 0 0 1 - 1 0 1 1 0 0 0
11304|F|0   26:T>G  0.913043     0.47619      331.8471  332           0 - - 1 1 0 1 1 0 1 - - 0 0 1 1 0 1 1 1 0 0 1
11304|F|0           0.913043     0.47619      331.8471  332           1 1 - 0 0 1 0 0 1 0 - 1 1 1 0 0 1 0 0 0 1 1 0
1120|F|0    57:T>C  0.869565     0.5          61.19317  327           0 1 0 1 - 1 0 0 0 1 - 0 1 1 0 0 0 1 1 1 1 1 1
1120|F|0            0.869565     0.5          61.19317  316           1 - 1 0 1 0 1 1 1 0 - 1 0 0 1 1 1 0 - 0 0 0 0
3947|F|0    8:T>C   0.956522     0.409091     55.54602  141           1 1 1 0 0 0 1 1 1 - 1 - 0 0 0 1 1 1 1 0 0 0 1
3947|F|0            0.956522     0.409091     55.54602  150           0 0 0 1 1 1 0 0 0 0 0 - 1 1 1 0 0 0 0 1 1 1 0
4226|F|0    28:A>T  0.956522     0.409091     13.88985  120           0 0 1 - 0 1 1 0 0 - 1 0 1 1 1 1 0 1 0 0 1 1 1
4226|F|0            0.956522     0.409091     13.88985  133           1 1 0 - 1 0 0 1 1 0 0 1 0 0 0 0 1 0 1 1 0 0 0
758|F|0     62:G>A  1            0.434783     159.9895  352           1 1 1 1 1 0 1 0 1 1 1 0 0 0 0 0 1 1 0 1 0 0 1
758|F|0             1            0.434783     159.9895  386           0 0 0 0 0 1 0 1 0 0 0 1 1 1 1 1 0 0 1 0 1 1 0
11790|F|0   48:G>A  0.913043     0.47619      212.1765  211           0 0 1 1 0 1 0 0 0 1 1 - 1 1 - 0 0 1 0 - 1 1 1
11790|F|0           0.913043     0.47619      212.1765  230           1 1 0 0 1 0 1 1 1 0 0 - 0 0 1 1 1 0 1 - 0 0 0
1746|F|0    27:T>G  0.869565     0.45         49.3698   195           0 - 0 1 0 1 1 1 0 0 0 1 1 1 1 1 0 0 - - 1 1 0
1746|F|0            0.869565     0.45         49.3698   270           1 - - 0 1 0 0 0 1 1 1 0 0 0 0 0 1 1 - 1 0 0 1
2167|F|0    25:A>G  0.913043     0.428571     144.8196  200           0 1 1 0 - 1 1 1 0 - 1 0 1 1 1 0 0 - 1 0 1 0 0
2167|F|0            0.913043     0.428571     144.8196  242           1 0 0 1 - 0 0 0 1 0 0 1 0 0 0 1 1 - 0 1 0 1 1
4384|F|0    56:G>C  0.869565     0.35         23.8538   112           1 1 0 1 1 0 0 1 1 1 - 0 0 0 1 1 - - 0 1 0 0 1
4384|F|0            0.869565     0.35         23.8538   130           0 0 1 0 0 1 1 0 0 0 - 1 1 1 0 0 0 0 - 0 1 - 0
899|F|0     61:G>A  0.956522     0.5          32.2733   255           1 1 1 1 1 0 0 - 1 1 0 1 0 0 0 1 1 1 0 1 0 0 0
Distribution of counts in wheat. Current wheat reference = 564,000; validated HQ SNPs > 100,000
SNP scoring table in maize. Similar size of representation as in wheat, but much higher marker frequency
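The summary columns of the scoring table can be reproduced from the 0/1/- calls. A minimal sketch, assuming "-" marks a missing call, call rate is the fraction of samples with a call, and one ratio is the fraction of called samples scored 1. These definitions are inferred from the values in the table rather than stated in the slides; in the rows checked, they match the second (no-SNP) row of each marker pair. The function name is hypothetical.

```python
def call_rate_and_one_ratio(calls):
    """Return (call_rate, one_ratio) for a list of '0', '1' or '-' calls."""
    scored = [c for c in calls if c != "-"]
    call_rate = len(scored) / len(calls)
    one_ratio = scored.count("1") / len(scored) if scored else float("nan")
    return call_rate, one_ratio

# Second row of marker 13005|F|0 from the table
row = "0 1 0 0 0 0 1 0 0 1 0 - 0 0 1 0 0 1 - 0 0 0 1".split()
call_rate, one_ratio = call_rate_and_one_ratio(row)
print(round(call_rate, 6), round(one_ratio, 6))  # 0.913043 0.285714
```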
Conclusion
DArTseq has been developed for nearly 200 organisms representing a wide range of
breeding systems, genome sizes and ploidy levels
Addressing the need for technologies and services providing genetic
fingerprints at the density appropriate for application
Scalable technology working from hundreds of markers to >100,000
Best genotyping by sequencing technology on the market according to our
clients
Information technologies properly developed to deal with huge volume of
data
Permanent storage of raw data from the beginning of company existence –
safety of data assured for all clients