Towards your own genome. Designing your Sequencing Run Sequencing strategy Genome size and genome

Towards your own genome
Designing your Sequencing Run
https://genohub.com/next-generation-sequencing-guide/

Sequencing strategy
Genome size and genome complexity?!
Related organism, PFGE, flow cytometry

Noncoding DNA in genomes

Repetitive DNA in the human genome

Approximately 50% of the human genome is composed of repeats. The table in panel a shows various named classes of repeat in the human genome, along with their pattern of occurrence (shown as 'repeat type' in the table; this is taken from the RepeatMasker annotation). The table also gives the number of repeats of each class found in the human genome, the percentage of the genome that is covered by the repeat class (Cvg), and the approximate upper and lower bounds on the repeat length (bp). The graph in panel b shows the percentage of each chromosome, based on release hg19 of the genome, covered by repetitive DNA as reported by RepeatMasker. The colours of the graph in panel b correspond to the colours of the repeat classes in the table in panel a.

Microsatellites constitute a class of repetitive DNA comprising tandem repeats that are 2–10 bp in length, whereas minisatellites are 10–60 bp in length, and satellites are up to 100 bp in length and are often associated with centromeric or pericentromeric regions of the genome. DNA transposons are full-length autonomous elements that encode a protein, transposase, by which an element can be removed from one position and inserted at another. Transposons typically have short inverted repeats at each end. Long terminal repeat (LTR) elements (which are often referred to as retrovirus-like elements) are characterized by the LTRs (200–5000 bp) that are harboured at each end of the retrotransposon. LINE, long interspersed nuclear element; rDNA, ribosomal DNA; SINE, short interspersed nuclear element.

Sequencing strategy

Template and library prep: Fragment (SE), Paired-end (PE) or Mate pair (MP)
BAC clones, fosmids....

Sequencing Platform

Mate pair: Following DNA fragmentation, 2-5 kb fragments are end-repaired with biotin-labeled dNTPs. The DNA fragments are circularized, and non-circularized DNA is removed by digestion. The circular DNA is then fragmented, and the fragments carrying the biotin label (corresponding to the ends of the original DNA, now ligated together) are affinity purified. Purified fragments are end-repaired and ligated to Illumina paired-end sequencing adapters. Additional sequences complementary to the flow cell oligonucleotides are added to the adapter sequence with tailed PCR primers. The final libraries consist of short fragments made up of two DNA segments that were originally separated by several kilobases. These libraries are ready for paired-end cluster generation followed by paired-end sequencing.
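The circularization step is the part of the mate-pair protocol that is easiest to lose track of, so here is a minimal Python sketch of it. It is a toy model only: the function name `mate_pair_junction` and the 40-base example fragment are hypothetical, and the sketch ignores strand orientation, the biotin selection and adapter ligation. It only shows why the purified piece carries the two ends of the original fragment side by side.

```python
def mate_pair_junction(fragment: str, flank: int) -> str:
    """Return the junction-spanning piece of a circularized fragment.

    Circularization joins the last base of the fragment to its first base,
    so a piece cut across that junction holds the last `flank` bases
    followed by the first `flank` bases of the original fragment.
    """
    if flank <= 0 or 2 * flank > len(fragment):
        raise ValueError("flank must be positive and at most half the fragment length")
    return fragment[-flank:] + fragment[:flank]


# Toy 40-base "fragment"; real mate-pair inserts are 2-5 kb.
fragment = "AAAAACCCCC" + "N" * 20 + "GGGGGTTTTT"
junction = mate_pair_junction(fragment, flank=10)

# The two halves of the junction piece were 20 bases apart in the original.
print(junction)
```

Sequencing both ends of such a junction piece yields a read pair whose members map several kilobases apart on the genome, which is what makes mate pairs useful for scaffolding.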

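Single-molecule real-time sequencing (Pacific Bio)
  Read length: 2900 bp average [38]
  Accuracy: 87% - 99%
  Reads per run: 35-75 thousand
  Time per run: 30 minutes to 2 hours
  Cost per 1 million bases: $2
  Advantages: Longest read length. Fast. Detects 4mC, 5mC, 6mA. [41]
  Disadvantages: Low yield at high accuracy. Equipment can be very expensive.

Ion semiconductor (Ion Torrent sequencing)
  Read length: 200 bp
  Accuracy: 98%
  Reads per run: up to 5 million
  Time per run: 2 hours
  Cost per 1 million bases: $1
  Advantages: Less expensive equipment. Fast.
  Disadvantages: Homopolymer errors.

Pyrosequencing (454)
  Read length: 700 bp
  Accuracy: 99.9%
  Reads per run: 1 million
  Time per run: 24 hours
  Cost per 1 million bases: $10
  Advantages: Long read size. Fast.
  Disadvantages: Runs are expensive. Homopolymer errors.

Sequencing by synthesis (Illumina)
  Read length: 50 to 250 bp
  Accuracy: 98%
  Reads per run: up to 3 billion
  Time per run: 1 to 10 days, depending upon sequencer
  Cost per 1 million bases: $0.05 to $0.15
  Advantages: Potential for high sequence yield, depending upon sequencer model and desired application.
  Disadvantages: Short reads.

Sequencing by ligation (SOLiD sequencing)
  Read length: 50+35 or 50+50 bp
  Accuracy: 99.9%
  Reads per run: 1.2 to 1.4 billion
  Time per run: 1 to 2 weeks
  Cost per 1 million bases: $0.13
  Advantages: Low cost per base.
  Disadvantages: Slower than other methods.

Chain termination (Sanger sequencing)
  Read length: 400 to 900 bp
  Accuracy: 99.9%
  Reads per run: N/A
  Time per run: 20 minutes to 3 hours
  Cost per 1 million bases: $2400
  Advantages: Long individual reads. Useful for many applications.
  Disadvantages: More expensive and impractical for larger sequencing projects.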
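The read lengths and per-million-base costs in the comparison above can be turned into a rough run estimate with the usual coverage relation, coverage = number of reads x read length / genome size. The Python sketch below is illustrative only. The 5 Mb genome and the 50x target are hypothetical values, and the function names are made up for this example. The 250 bp read length and the $0.15 and $2400 per million bases are the Illumina upper bound and the Sanger figure from the comparison.

```python
import math


def reads_for_coverage(genome_size_bp: int, read_length_bp: int, coverage: float) -> int:
    """Number of reads needed so that reads * read_length / genome_size >= coverage."""
    return math.ceil(coverage * genome_size_bp / read_length_bp)


def cost_for_coverage(genome_size_bp: int, coverage: float, usd_per_million_bases: float) -> float:
    """Raw sequencing cost in USD for the requested fold coverage."""
    sequenced_megabases = coverage * genome_size_bp / 1_000_000
    return sequenced_megabases * usd_per_million_bases


genome_size = 5_000_000  # hypothetical 5 Mb bacterial genome
target_coverage = 50     # hypothetical target depth

n_reads = reads_for_coverage(genome_size, 250, target_coverage)
illumina_cost = cost_for_coverage(genome_size, target_coverage, 0.15)
sanger_cost = cost_for_coverage(genome_size, target_coverage, 2400)

print(f"reads needed at 250 bp: {n_reads:,}")
print(f"Illumina estimate: ${illumina_cost:,.2f}")
print(f"Sanger estimate:   ${sanger_cost:,.2f}")
```

The estimate covers raw bases only. Library preparation, instrument minimums and uneven coverage are left out, so real quotes will differ.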

Genome sequencing: Comparison of NGS methods

Application: de novo assemblies

454 GS Jr.
  BACs, plastids, & microbial genomes: B - good but expensive
  Transcriptome: C - need multiple runs, expensive
  Plant & animal genome: D - cost prohibitive

454 FLX+
  BACs, plastids, & microbial genomes: A - good, need to multiplex to be economical
  Transcriptome: B - good but expensive, libraries usually normalized, not best for short RNAs
  Plant & animal genome: C - OK as part of a mixed platform strategy, prohibitive to use alone

MiSeq v2
  BACs, plastids, & microbial genomes: A - good, need to multiplex for best economics
  Transcriptome: A/B - expensive for rare transcripts (compared to HiSeq), but reads are longer for better assembly
  Plant & animal genome: B - expensive relative to HiSeq, but additional read length can be valuable

HiSeq 2000/2500, standard run
  BACs, plastids, & microbial genomes: B/C - more data than needed unless highly indexed; assembly more challenging than 454 or MiSeq
  Transcriptome: A - good, assembly more challenging than 454 but much more data available for analyses
  Plant & animal genome: A - primary data type in many current projects; requires mate-pair libraries

HiSeq 2500, rapid run (projected)
  BACs, plastids, & microbial genomes: B - more data than needed unless highly indexed; assembly more challenging than 454
  Transcriptome: A - good, assembly more challenging than 454 but much more data available for analyses
  Plant & animal genome: A - will probably be more expensive than HiSeq 2000, but increased read length may be worth it

Ion Torrent 314
  BACs, plastids, & microbial genomes: B/C - OK, lowest experimental cost but reads are shorter & more expensive than Illumina
  Transcriptome: C - OK, but reads are shorter & more expensive than Illumina
  Plant & animal genome: D - cost prohibitive, reads shorter than alternatives

Ion Torrent 318
  BACs, plastids, & microbial genomes: B/A - good, less data than MiSeq
  Transcriptome: B/A - good, less data than MiSeq, reads similar to 454 titanium but less expensive
  Plant & animal genome: C - high cost relative to Proton or Illumina, more economical than 454 for mixed platform strategy

Ion Torrent Proton I
  BACs, plastids, & microbial genomes: B - more data than needed unless indexed; assembly more challenging than 454 or Illumina
  Transcriptome: B/A - assembly currently more challenging than Illumina or 454
  Plant & animal genome: B - expensive relative to HiSeq or Proton II/III

Ion Torrent Proton II (projected)
  BACs, plastids, & microbial genomes: B/C - more data than needed unless highly indexed; assembly more challenging than 454 or Illumina
  Transcriptome: B/A - assembly currently more challenging than Illumina or 454
  Plant & animal genome: A/B - should be similar to HiSeq

Ion Torrent Proton III (forecast)
  BACs, plastids, & microbial genomes: C - more data than needed unless highly indexed
  Transcriptome: B/A - need assembly pipelines
  Plant & animal genome: A - cost per MB could make it the best

SOLiD 5500
  BACs, plastids, & microbial genomes: C - more data than needed unless highly indexed; assembly more challenging than 454 or Illumina
  Transcriptome: C/D - short reads make assembly challenging or impossible
  Plant & animal genome: C/D - short reads make assembly challenging or impossible

PacBio RS
  BACs, plastids, & microbial genomes: B - good for hybrid assemblies; not economical for solo assemblies; requires high coverage due to high error rates
  Transcriptome: B/D - good for hybrid assemblies; too expensive for solo use; short RNA is challenging
  Plant & animal genome: B/D - good for hybrid assemblies & scaffolding (mixed platform strategy); cost prohibitive for solo use

Application: resequencing

454 GS Jr.
  Targeted loci: B/C - good but expensive, need to limit loci
  Transcript counting: D - cost prohibitive
  Genome resequencing: D - cost prohibitive for large genomes

454 FLX+
  Targeted loci: B - good but expensive, should limit loci
  Transcript counting: D - cost prohibitive
  Genome resequencing: D - cost prohibitive for large genomes

MiSeq
  Targeted loci: A/B - good, fewer and higher cost reads than HiSeq
  Transcript counting: B - more expensive than HiSeq or SOLiD or Proton II+
  Genome resequencing: B/C - expensive for large genomes

HiSeq 2000/2500 standard run
  Targeted loci: A - primary data type in many current projects; best for many loci
  Transcript counting: A - primary data type in many current projects
  Genome resequencing: A - primary data type in many current projects

HiSeq 2500 rapid run (projected)
  Targeted loci: A - faster path to leading data type
  Transcript counting: A/B - likely to be slightly more expensive than with standard flow cell
  Genome resequencing: A - faster path to leading data type

Ion Torrent 314
  Targeted loci: C - OK but expensive, need to limit loci
  Transcript counting: D - cost prohibitive
  Genome resequencing: D - cost prohibitive

Ion Torrent 318
  Targeted loci: B - good, slightly less data per run than MiSeq
  Transcript counting: B/C - more expensive than HiSeq or SOLiD; new informatics pipelines needed; new error profile
  Genome resequencing: C - expensive for large genomes

Ion Torrent Proton I
  Targeted loci: A/B - similar to MiSeq, but different error profile will inhibit switching
  Transcript counting: B - more expensive than Illumina or SOLiD; new informatics pipelines needed (different error profile than Illumina)
  Genome resequencing: B - expensive relative to HiSeq or Proton II+

Ion Torrent Proton II (projected)
  Targeted loci: A/B - similar to HiSeq, but different error profile will inhibit switching
  Transcript counting: A/B - new informatics pipelines needed
  Genome resequencing: A - supposed to set new pricing standard, could become leading shorter-read platform

Ion Torrent Proton III (forecast)
  Targeted loci: A/B - costs projected to be better than HiSeq; error profile different than Illumina
  Transcript counting: A/B - new informatics pipelines needed
  Genome resequencing: A - supposed to set new pricing standard, could become leading shorter-read platform

SOLiD 5500xl
  Targeted loci: B - harder to assemble than Illumina
  Transcript counting: A/B - used much less than HiSeq
  Genome resequencing: A/B - used much less than HiSeq

PacBio RS
  Targeted loci: C/D - expensive but can sequence difficult regions
  Transcript counting: D - cost prohibitive
  Genome resequencing: C/D - cost prohibitive except for structural variants

Bacterial genomes

Noncoding DNA in genomes

Bacterial genomes

Complex Bacterial Genomes

Fosmid and plasmid library; Sanger

Simplified Bacterial Genomes

MDA for 16 h on one lysed cell
3 kb Sanger libraries plus 454
15 gaps (chimeric clones): Sanger finishing
Polishing by Illumina reads
37 regions: Sanger polishing
454 (average read length 225 bp)
Illumina (33 bp)

Eukaryotic Genomes

Eukaryotic Genomes: Fish genomes
Template: a female fish was chosen because of its XX sex chromosome constitution
Roche 454 Titanium (3 and 20 kb libraries)
Illumina PE: insert size 200 bp and 75 bp reads
Physical map: fingerprints with ABI3730 from the WLC-1247 BAC library (insert size of 160 kb; 10x genome coverage with a total of 43,192 clones available)
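The BAC library figures above are linked by simple arithmetic: clones x insert size / fold coverage gives the genome size the library was designed around. The short Python check below works that out. It assumes that the coverage figure on the slide means tenfold (10x) clone coverage and that 160 kb is the average insert size. The function name is hypothetical.

```python
def implied_genome_size_bp(n_clones: int, insert_size_bp: int, fold_coverage: float) -> float:
    """Genome size implied by a clone library: total cloned bases / fold coverage."""
    return n_clones * insert_size_bp / fold_coverage


# Values from the slide: 43,192 clones, 160 kb inserts, 10x coverage (assumed meaning).
size_bp = implied_genome_size_bp(43_192, 160_000, 10)
print(f"implied genome size: {size_bp / 1e6:.0f} Mb")
```

The same relation can be run the other way when planning a library: pick the genome size and the coverage you want, then solve for the number of clones.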

Bird genomes

Mammalian genomes
HiSeq 2000

DNA isolated from blood

Extremely large genomes
Loblolly pine (Pinus taeda)
The largest genome assembled to date
DNA template: a single megagametophyte, the haploid tissue of a single pine seed
quantity
long-fragment mate pair libraries from the parental diploid DNA

Novel fosmid DiTag libra