Course outline
• Goal: Learn basic programming and bioinformatics skills to complete a project using available NGS data
• Structure: Lectures (4), Journal club (4), Workshops (4)
• Grading: Problem sets (3), Class participation (journal club), Project report (oral and written)
Introduction to genome sequencing: Approaches and Platforms Bio472- Spring 2014 Amanda Larracuente
Outline
1. History
2. Basic assembly approaches
3. First generation technology
4. Second generation technology
5. Third generation technology
6. Challenges
Progress in genome sequencing
NHGRI at genome.gov
1. History
History: Sanger sequencing
• Chain-termination method published in 1977 (Sanger's earlier plus/minus method dates to 1975)
• 1982: Bacteriophage lambda
• 1995: H. influenzae
• 1996: Yeast
• 1998: C. elegans
• 2000: Drosophila melanogaster
• 2000: Arabidopsis
• 2001: Human
Sequence reads
• Reads
  • Sequence output from a DNA fragment
  • Base qualities
• Paired-end reads
  • Reads from both ends of a DNA fragment
  • Similar to or the same as mate pairs (depending on platform)
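Base qualities are usually reported as Phred scores, Q = -10·log10(P), where P is the estimated probability that the base call is wrong. The sketch below converts between the two and decodes a FASTQ quality string. It assumes the Phred+33 ASCII encoding (common in current FASTQ files, but not stated in these slides); the function names are illustrative.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred quality score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability P to a Phred quality score Q = -10*log10(P)."""
    return -10 * math.log10(p)

def decode_fastq_qualities(qual_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    offset=33 is an assumption (Phred+33 encoding); some older files use 64.
    """
    return [ord(ch) - offset for ch in qual_string]

# Example: a made-up quality string for a 5-base read
quals = decode_fastq_qualities("II5+!")
for q in quals:
    print(f"Q{q}: error probability {phred_to_error_prob(q):.4f}")
```

A Q20 base has a 1 in 100 chance of being wrong and a Q30 base 1 in 1000, which is why low-quality bases are often trimmed before assembly.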
2. Basic Assembly Approaches
[Figure: a DNA fragment with paired-end reads at both ends]
Genome assemblies
Human male karyotype (http://www.genome.gov)
10^9 short sequencing reads; 3 Gb whole genome
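To get a feel for these numbers, mean coverage is (number of reads × read length) / genome size. The sketch below uses about 10^9 reads and a 3 Gb genome; the 100 bp read length is an assumption, since the slide does not give one. The uncovered-fraction estimate uses the Lander-Waterman (Poisson) approximation, which ignores repeats and sequencing bias.

```python
import math

def mean_coverage(n_reads, read_length, genome_size):
    """Average number of reads covering each base."""
    return n_reads * read_length / genome_size

def expected_fraction_uncovered(coverage):
    """Poisson approximation: probability that a given base is hit by no read."""
    return math.exp(-coverage)

# read_length=100 is a hypothetical value chosen for illustration
c = mean_coverage(n_reads=1e9, read_length=100, genome_size=3e9)
print(f"mean coverage: {c:.1f}x")
print(f"expected uncovered fraction: {expected_fraction_uncovered(c):.2e}")
```

Under these assumptions the coverage comes out at roughly 33x.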
Whole Genome Shotgun (WGS) approach
1. Shear genome into 3-5 kb fragments, clone into plasmids and sequence
2. Find overlapping reads
3. Assemble overlapping reads into contigs
4. Assemble contigs into scaffolds
5. Link scaffolds into “finished” sequence corresponding to chromosomes
[Figure: overlapping reads → contig → scaffold (contigs linked by mate pairs) → chromosomes → finished assembly, e.g. GATCGTGTCCCATTGTCAGATCGTG]
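Steps 2 and 3 (find overlapping reads, merge them into contigs) can be sketched with a toy greedy assembler. This is a teaching sketch only: it assumes error-free reads from one strand with no read contained inside another, and it is not how production assemblers work. The reads are cut from the first 18 bases of the example sequence on the slide; `overlap` and `greedy_assemble` are illustrative names.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            return contigs  # nothing left to merge
        merged = contigs[i] + contigs[j][k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (i, j)]
        contigs.append(merged)

# Four overlapping 8 bp reads, given in shuffled order
reads = ["CCCATTGT", "GATCGTGT", "CATTGTCA", "GTGTCCCA"]
contigs = greedy_assemble(reads)
print(contigs)
```

Greedy merging is easy to follow but is fooled by repeats, which is one reason the later slides on repetitive DNA matter.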
Hierarchical Approach
1. Shear genome into 150 kb fragments and put in BACs
2. Create map of BACs to genome and create a tiling path
3. Shotgun sequence individual BACs from tiling path
4. Assemble BAC sequences
5. Use sequenced tiling path to reconstruct genome
[Figure: BACs (100-150 kb inserts) → tiling path → scaffold (mate pairs) → chromosomes → finished assembly, e.g. GATCGTGTCCCATTGTCAGATCGTG]
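Choosing a tiling path (step 2) can be viewed as an interval-cover problem: given mapped BAC positions, pick a small set of overlapping BACs that spans the region. The sketch below uses a greedy rule under simplifying assumptions: BAC coordinates are known exactly, and any overlap is enough to join neighbors. The BAC names and coordinates are made up for illustration.

```python
def minimal_tiling_path(bacs, region_start, region_end):
    """Greedy interval cover over (name, start, end) tuples.

    At each step, pick the BAC that starts at or before the position covered
    so far and extends furthest to the right.
    """
    path = []
    covered = region_start
    while covered < region_end:
        candidates = [b for b in bacs if b[1] <= covered and b[2] > covered]
        if not candidates:
            raise ValueError(f"gap in BAC map at position {covered}")
        best = max(candidates, key=lambda b: b[2])
        path.append(best)
        covered = best[2]
    return path

# Hypothetical BAC map for a 400 kb region
bacs = [
    ("BAC_A", 0, 150_000),
    ("BAC_B", 60_000, 200_000),
    ("BAC_C", 120_000, 270_000),
    ("BAC_D", 250_000, 400_000),
    ("BAC_E", 260_000, 380_000),
]
path = minimal_tiling_path(bacs, 0, 400_000)
print([name for name, start, end in path])
```

Only the BACs on the path are then shotgun sequenced, which is what keeps each assembly problem small and local.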
Comparing assembly approaches
• Whole Genome Shotgun
  • Faster
  • Assembly is a huge computational effort
  • Celera Genomics approach to human genome
• Hierarchical
  • Slower
  • Labor-intensive
  • Higher quality assembly in difficult-to-assemble regions
  • Publicly funded Human Genome Project: took >10 years and cost $3 billion
First generation sequencing technology
Shear genomic DNA → Subclone into vectors → Bacterial replication → Isolate amplified clones → Capillary sequencing
3. First generation technology
Sanger sequencing
• Chain termination
• Fluorescently labeled, modified nucleotides
• Capillary gel electrophoresis
[Figure: DNA polymerase extends a primer along the template DNA; the chain-terminated fragments of the example read TCTAGAGCCTGCAATACGATC form a ladder, each one base shorter than the last, which capillary gel electrophoresis separates by fragment size to give the sequence.]
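The logic of chain termination can be sketched in a few lines: synthesis stops wherever a labeled terminator is incorporated, so the reaction yields fragments of every length, and sorting them by size while reading the label on each terminal base recovers the sequence. This is an idealized model (one fragment per length, perfect size separation), and the example sequence and function names are illustrative.

```python
import random

def chain_terminated_fragments(synthesized_strand):
    """One fragment per position where a labeled terminator could be incorporated."""
    return [synthesized_strand[:i] for i in range(1, len(synthesized_strand) + 1)]

def read_from_gel(fragments):
    """Sort fragments by size (as electrophoresis does) and read each terminal base."""
    return "".join(frag[-1] for frag in sorted(fragments, key=len))

seq = "TCTAGAGCCTGCAATACGATC"
fragments = chain_terminated_fragments(seq)
random.shuffle(fragments)  # fragments are mixed together in the reaction tube
recovered = read_from_gel(fragments)
print(recovered == seq)
```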
Applications
• Sequencing PCR fragments
• Sequencing off plasmids
• Sequencing genomes
• Sequencing cDNA libraries
Second generation sequencing technology
Shear genomic DNA → Solid support fixation → Amplification → Base detection (wash and scan)
4. Second generation technology
454 pyrosequencing
a. Isolate gDNA, fragment and ligate adapters
b. Bind to beads and carry out emulsion PCR (emPCR: 1 fragment/bead)
c. Break emulsion and add beads to fiber-optic slide
d. Pyrosequencing reaction, 1 nt added at a time (peak height corresponds to the number of nucleotides incorporated)
Rothberg and Leamon 2008
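Step d can be illustrated with a toy flowgram: nucleotides are flowed one at a time, and the signal for each flow is the number of bases incorporated, so a homopolymer run gives one tall peak. The sketch assumes a noise-free signal and a cyclic flow order of T, A, C, G (an assumption, not stated on the slide); the same one-nucleotide-at-a-time idea applies to Ion Torrent, which detects H+ instead of light.

```python
def flowgram(sequence, flow_order="TACG", max_flows=400):
    """Return a list of (nucleotide, signal) pairs, one per flow.

    signal = number of consecutive bases incorporated during that flow.
    """
    flows = []
    pos = 0
    i = 0
    while pos < len(sequence) and i < max_flows:
        nuc = flow_order[i % len(flow_order)]
        run = 0
        while pos < len(sequence) and sequence[pos] == nuc:
            run += 1
            pos += 1
        flows.append((nuc, run))
        i += 1
    return flows

def basecall(flows):
    """Turn ideal flow signals back into a sequence."""
    return "".join(nuc * signal for nuc, signal in flows)

flows = flowgram("TTAGGGC")
print(flows)
print(basecall(flows))
```

With real, noisy signals it is hard to tell a peak of height 5 from one of height 6, which is why homopolymers produce indel errors on these platforms.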
Illumina
• Fragment gDNA
• Ligate adapters
• Fix fragments on solid surface
• Bridge amplification to generate clusters
• Sequence one end (using reversible terminators)
• If paired-end, regenerate cluster and sequence the other end
Figure from Mardis 2013
Ion Torrent*
1. Shear DNA, ligate adapters
2. Attach fragments to beads and amplify with emPCR
3. Place beads in wells on plate
4. Flow nucleotides over wells, one at a time
5. DNA polymerase incorporates bases and gives off H+
6. Miniature semiconductor sensor reads the pH change
http://www.lifetechnologies.com
*more like 2.5-generation technology
Applications
• Genome re-sequencing (reference-based assembly)
• Genome sequencing (de novo assembly)
• Transcriptome sequencing (RNA-seq)
• Sequencing DNA associated with proteins (ChIP-seq)
Third generation sequencing technology
Shear genomic DNA → Solid support fixation → Base detection (no amplification)
5. Third generation technology
Single-molecule sequencing, e.g. Pacific Biosciences (PacBio)
• Single-molecule real-time (SMRT) sequencing
• Fluorescent nucleotides detected in real time
• Some reads >10 kb
• High error rate
Eid et al. 2009
Applications
• Low-depth: Scaffolding contigs (de novo assembly)
• High-depth: Genome sequencing of repetitive regions or structural rearrangements
Comparison of NGS technologies (non-exhaustive)
Method            Strategy                   Read length            Error type     Error rate    Output per run
454               Synthesis/pyrosequencing   Up to 700 bp           Indels         1%            400-600 Mbp
SOLiD             DNA ligase                 75 bp                  AT bias        >0.01-0.06%   20-30 Gbp
Illumina (HiSeq)  Synthesis/DNA polymerase   150 bp                 Substitutions  >0.1%         600 Gbp
Ion Torrent       H+ detection               90 bp                  Indels         1.5%          1 Gbp
PacBio            Single molecule/synthesis  >2.5 kb (up to 10 kb)  Insertions     15%           75-100 Mbp (5-10 Mbp usable)
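One way to read the table is to combine read length and error rate into the expected number of errors per read. The sketch below does this for four of the platforms. It assumes errors are independent and uniform along the read (a simplification), and it uses the boundary values where the table gives a bound (">0.1%", ">2.5 kb"), so the results are rough.

```python
# name: (read length in bp, per-base error rate as a fraction), taken from the table
platforms = {
    "454": (700, 0.01),
    "Illumina (HiSeq)": (150, 0.001),
    "Ion Torrent": (90, 0.015),
    "PacBio": (2500, 0.15),
}

def expected_errors(read_length, error_rate):
    """Expected number of erroneous bases in one read."""
    return read_length * error_rate

def prob_error_free(read_length, error_rate):
    """Probability that a read contains no errors, assuming independent errors."""
    return (1 - error_rate) ** read_length

for name, (length, rate) in platforms.items():
    print(f"{name}: {expected_errors(length, rate):.2f} expected errors per read, "
          f"P(error-free) = {prob_error_free(length, rate):.3g}")
```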
6. Challenges
The $1000 genome: Illumina!
Reported January 14, 2014 (http://investor.illumina.com/):
“The HiSeq X™ Ten, composed of 10 HiSeq X Systems, is the first sequencing platform that breaks the $1000 barrier for a 30x human genome. The HiSeq X Ten System is ideal for population-scale projects focused on the discovery of genotypic variation to understand and improve human health”
Summary of technology
• Point:
  • Sequencing is cheap and easy
  • Within reach of individual labs
• Current challenge
  • Computational
  • Data management
NHGRI at genome.gov
Repetitive DNA
• Interspersed repeats, e.g. transposable elements
• Tandem repeats, e.g. satellites, CNVs
Challenges for repetitive DNA
• Repeat unit longer than read length (e.g. transposable elements)
• Repeat unit longer than insert sizes (e.g. transposable elements)
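Why repeats are hard can be shown with a toy region modeled on the tandem-repeat example in these slides: unique flanks around copies of a 7 bp unit. A read that lies entirely inside the repeat matches at many positions, and error-free short reads alone cannot tell how many copies are present. The copy numbers (8 and 3) and the 14 bp read length are assumptions chosen for illustration.

```python
LEFT, UNIT, RIGHT = "CCTGCGAT", "AATATGG", "TGTACCC"

def make_region(copies):
    """Unique left flank + tandem repeat array + unique right flank."""
    return LEFT + UNIT * copies + RIGHT

def find_all(genome, read):
    """All 0-based start positions where read matches genome exactly."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)
    return positions

def kmers(seq, k):
    """Set of all substrings of length k (every possible error-free read)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

true_genome = make_region(8)
collapsed = make_region(3)

print(find_all(true_genome, "CCTGCGATAATATG"))  # read anchored in the left flank
print(find_all(true_genome, "AATATGGAATATGG"))  # read entirely inside the repeat
print(kmers(true_genome, 14) == kmers(collapsed, 14))
```

The last line compares every possible 14 bp read from the 8-copy and 3-copy versions: the sets are identical, so read content alone cannot reveal the copy number. Longer reads or paired reads with known insert sizes are needed to span the array.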
Challenges for repetitive DNA
True genomic sequence:
CCTGCGATAATATGGAATATGGAATATGGAATATGGAATATGGAATATGGAATATGGAATATGGAATATGGTGTACCC
[Figure: short reads tiled beneath the true genomic sequence; many fall entirely within the AATATGG tandem repeat. The figure also shows two versions of this region with fewer repeat copies: CCTGCGATAATATGGAATATGGAATATGGAATATGGAATATGGTGTACCC and CCTGCGATAATATGGAATATGGTGTACCC.]
Challenges for repetitive DNA: single-end libraries
[Figure: the true genomic sequence shown above, compared with the assembly built from single-end reads.]
Challenges for repetitive DNA: paired-end libraries
[Figure: the true genomic sequence shown above, compared with the assembly built from paired-end reads.]
Challenges for repetitive DNA: paired-end + mate-pair libraries
[Figure: the true genomic sequence shown above, compared with the assembly built from paired-end and mate-pair reads.]
Repeats cause
• Misassemblies
• Complex rearrangements
• Gaps
Next gen applications and repeats
• WGS with Sanger
  • Repetitive DNA unstable in cloning vectors
  • Paired end/mate pairs help with assembly
• 454 pyrosequencing
  • Problems with homopolymers
  • Paired end/mate pairs help with assembly
• Illumina
  • Repetitive elements longer than read length
  • Deep coverage and mate pairs help with assembly
• PacBio
  • Very high error rate: error correction requires deep PacBio coverage or short reads
  • Read length plows through repeats
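The reason deep coverage helps with a high per-read error rate can be sketched with a majority-vote consensus. This is a strong simplification and is labeled as such: it models substitution errors only and assumes the reads are already aligned and of equal length, whereas real PacBio errors are mostly insertions and deletions and need alignment first. The 15% error rate matches the comparison table; everything else (sequence, depths, seed) is made up.

```python
import random
from collections import Counter

BASES = "ACGT"

def noisy_copy(seq, error_rate, rng):
    """Copy seq, replacing each base with a different one at the given rate."""
    out = []
    for base in seq:
        if rng.random() < error_rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def consensus(reads):
    """Per-position majority vote across equal-length, pre-aligned reads."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*reads))

def mismatches(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

rng = random.Random(0)
truth = "".join(rng.choice(BASES) for _ in range(200))
for depth in (1, 5, 31):
    reads = [noisy_copy(truth, 0.15, rng) for _ in range(depth)]
    print(f"depth {depth}: {mismatches(consensus(reads), truth)} consensus errors")
```

Because errors fall at different positions in different reads, the correct base usually wins the vote once enough reads cover each position.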
Further reading:
• Metzker. 2010. Sequencing technologies - the next generation. Nature Reviews Genetics 11:31-46.
• Mardis. 2013. Next-Generation Sequencing Platforms. Ann. Rev. Anal. Chem. 6:287-303.
• Treangen and Salzberg. 2012. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nature Reviews Genetics 13:36-46.
Project background reading
• Brennecke, J, AA Aravin, A Stark, M Dus, M Kellis, R Sachidanandam, GJ Hannon. 2007. Discrete small RNA-generating loci as master regulators of transposon activity in Drosophila. Cell 128:1089-1103.
• Lemos, B, LO Araripe, DL Hartl. 2008. Polymorphic Y chromosomes harbor cryptic variation with manifold functional consequences. Science 319:91-93.
• Nagao, A, T Mituyama, H Huang, D Chen, MC Siomi, H Siomi. 2010. Biogenesis pathways of piRNAs loaded onto AGO3 in the Drosophila testis. RNA 16:2503-2515.
• Filion, GJ, JG van Bemmel, U Braunschweig, et al. 2010. Systematic protein location mapping reveals five principal chromatin types in Drosophila cells. Cell 143:212-224.
Papers
• Akbari, OS, I Antoshechkin, BA Hay, PM Ferree. 2013. Transcriptome profiling of Nasonia vitripennis testis reveals novel transcripts expressed from the selfish B chromosome, paternal sex ratio. G3 (Bethesda) 3:1597-1605.
• Blumenstiel, JP, X Chen, M He, CM Bergman. 2014. An Age-of-Allele Test of Neutrality for Transposable Element Insertions. Genetics 196:523-538.
• Rogers, RL, JM Cridland, L Shao, TT Hu, P Andolfatto, KR Thornton. 2014. Landscape of standing variation for tandem duplications in Drosophila yakuba and Drosophila simulans. ArXiv preprint.
• Kelleher, ES, DA Barbash. 2013. Analysis of piRNA-mediated silencing of active TEs in Drosophila melanogaster suggests limits on the evolution of host genome defense. Molecular Biology and Evolution 30:1816-1819.
Getting set up to run graphical software on BlueHive
• Please go to: https://www.circ.rochester.edu/wiki/index.php/Getting_Started and https://www.circ.rochester.edu/wiki/index.php/NX_Cluster
• Install an X11 application if needed