
  • Course outline

    •  Goal: Learn basic programming and bioinformatic skills to complete a project using available NGS data

    •  Structure: Lectures (4), Journal club (4), Workshops (4)

    •  Grading: Problem sets (3), Class participation (journal club), Project report (oral and written)

  • Introduction to genome sequencing: Approaches and Platforms

    Bio472, Spring 2014

    Amanda Larracuente

  • Outline

    1.  History

    2.  Basic assembly approaches

    3.  First generation technology

    4.  Second generation technology

    5.  Third generation technology

    6.  Challenges

  • Progress in genome sequencing

    NHGRI at genome.gov

    1. History

  • History: Sanger sequencing

    •  Introduced in 1975

    •  1982: Bacteriophage lambda

    •  1995: H. influenzae

    •  1996: Yeast

    •  1998: C. elegans

    •  2000: Drosophila melanogaster

    •  2000: Arabidopsis

    •  2001: Human

  • Sequence reads

    • Reads

      •  Sequence output from a DNA fragment

      •  Base qualities

    • Paired-end reads

      •  Reads from both ends of a DNA fragment

      •  Similar to or same as mate pairs (depending on platform)

    Figure: a DNA fragment with paired-end reads at both ends

    2. Basic Assembly Approaches
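    A minimal sketch of what "reads from both ends of a DNA fragment" means in practice. The function names and the read length of 8 are illustrative assumptions, not part of the slides; the fragment is the example sequence shown on the assembly slides. The second read is reported as the reverse complement because it is sequenced from the opposite strand.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def paired_end_reads(fragment: str, read_length: int) -> tuple[str, str]:
    """Read `read_length` bases inward from each end of a fragment."""
    if read_length > len(fragment):
        raise ValueError("read_length cannot exceed fragment length")
    forward = fragment[:read_length]
    # The mate is read from the other strand, so it is reverse-complemented.
    reverse = reverse_complement(fragment[-read_length:])
    return forward, reverse


fragment = "GATCGTGTCCCATTGTCAGATCGTG"  # example sequence from the assembly slides
read1, read2 = paired_end_reads(fragment, read_length=8)  # read length is an assumption
print(read1, read2)
```

    The unsequenced middle of the fragment is what later lets a pair bridge regions that a single read cannot.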

  • Genome assemblies

    Figure: human male karyotype (http://www.genome.gov)

    10^9 short sequencing reads; 3 Gb whole genome
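    To put the read count and genome size side by side, average depth of coverage can be estimated as C = N × L / G (number of reads times read length, divided by genome size). The sketch below assumes a read length of 100 bp, which the slide does not state.

```python
def expected_coverage(num_reads: float, read_length: int, genome_size: float) -> float:
    """Average depth of coverage: C = N * L / G."""
    return num_reads * read_length / genome_size


# 10^9 reads and a 3 Gb genome come from the slide; 100 bp reads are an assumption.
coverage = expected_coverage(num_reads=1e9, read_length=100, genome_size=3e9)
print(f"{coverage:.1f}x")
```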

  • Whole Genome Shotgun (WGS) approach

    1.  Shear genome into 3-5 kb fragments, clone into plasmids and sequence

    2.  Find overlapping reads

    3.  Assemble overlapping reads into contigs

    4.  Assemble contigs into scaffolds

    5.  Link scaffolds into “finished” sequence corresponding to chromosomes

    Figure labels: overlapping reads, contig, mate pairs, scaffold, chromosomes, finished assembly (GATCGTGTCCCATTGTCAGATCGTG)
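    Steps 2 and 3 (find overlapping reads, merge them into contigs) can be sketched with a toy greedy assembler. This is an illustration under simplifying assumptions (error-free reads, a single strand, no repeats), not how production assemblers work; the function names, the toy reads, and the minimum overlap of 3 are all hypothetical.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:  # nothing left to merge
            return contigs
        merged = contigs[i] + contigs[j][size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]


# Three overlapping toy reads sampled from TTACGGATCCAGTCA, given out of order.
reads = ["TCCAGTCA", "TTACGGAT", "GGATCCAG"]
print(greedy_assemble(reads))
```

    With repeats in the input, the "longest overlap" choice can join reads that are not truly adjacent, which is the problem the later slides on repetitive DNA describe.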

  • Hierarchical Approach

    1.  Shear genome into 150 kb fragments and put in BACs

    2.  Create map of BACs to genome and create a tiling path

    3.  Shotgun sequence individual BACs from tiling path

    4.  Assemble BAC sequences

    5.  Use sequenced tiling path to reconstruct genome

    Figure labels: BACs (100-150 kb inserts), tiling path, mate pairs, scaffold, chromosomes, finished assembly (GATCGTGTCCCATTGTCAGATCGTG)
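    One way to picture step 2 is as an interval-cover problem: given mapped BAC positions, pick a small set of clones that still covers the whole region. The sketch below uses a greedy rule (always take the clone that reaches furthest right). The clone names and coordinates (in kb, half-open intervals) are made up for illustration, and real tiling-path selection also weighs map confidence and clone quality.

```python
def minimal_tiling_path(clones: dict[str, tuple[int, int]], region_length: int) -> list[str]:
    """Greedy interval cover: at each step pick the clone that overlaps the
    first uncovered position and extends furthest to the right."""
    path = []
    covered = 0
    while covered < region_length:
        candidates = {name: span for name, span in clones.items()
                      if span[0] <= covered < span[1]}
        if not candidates:
            raise ValueError(f"no clone covers position {covered}")
        name = max(candidates, key=lambda n: candidates[n][1])
        path.append(name)
        covered = candidates[name][1]
    return path


# Hypothetical BAC map positions in kb (start, end).
bac_map = {
    "bac_A": (0, 150),
    "bac_B": (40, 180),
    "bac_C": (120, 270),
    "bac_D": (160, 300),
    "bac_E": (250, 400),
    "bac_F": (290, 400),
}
print(minimal_tiling_path(bac_map, region_length=400))
```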

  • Comparing assembly approaches

    • Whole Genome Shotgun

      •  Faster

      •  Assembly is a huge computational effort

      •  Celera Genomics approach to human genome

    • Hierarchical

      •  Slower

      •  Labor-intensive

      •  Higher quality assembly in difficult-to-assemble regions

      •  Publicly funded Human Genome Project (took >10 years and cost $3 billion)

  • First generation sequencing technology

    Shear genomic DNA → Subclone into vectors → Bacterial replication → Isolate amplified clones → Capillary sequencing

    3. First generation technology

  • Sanger sequencing

    • Chain termination

    •  Fluorescently labeled, modified nucleotides

    • Capillary gel electrophoresis

    Figure: DNA polymerase extends a primer along template DNA (example sequence TCTAGAGCCTGCAATACGATC), producing chain-terminated fragments of every length; capillary gel electrophoresis orders them by fragment size to give the sequence
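    The logic of chain termination can be sketched in a few lines: synthesis stops at every possible position, the fragments are sorted by size, and the labeled last base of each fragment spells out the sequence. This is an idealized illustration (one fragment per length, no noise); the function names and the short example sequence are assumptions.

```python
def chain_termination_fragments(synthesized: str) -> list[str]:
    """All products of chain termination: one fragment ending at each base."""
    return [synthesized[:end] for end in range(1, len(synthesized) + 1)]


def read_ladder(fragments: list[str]) -> str:
    """Sort fragments by size (as electrophoresis does) and read the
    labeled terminal base of each one."""
    return "".join(frag[-1] for frag in sorted(fragments, key=len))


sequence = "TCTAGAGC"  # short example sequence
fragments = chain_termination_fragments(sequence)
# Fragment order in the tube is arbitrary; size separation restores it.
print(read_ladder(list(reversed(fragments))))
```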

  • Applications

    • Sequencing PCR fragments

    • Sequencing off plasmids

    • Sequencing genomes

    • Sequencing cDNA libraries

  • Second generation sequencing technology

    Shear genomic DNA → Solid support fixation → Amplification → Base detection → Wash and scan

    4. Second generation technology

  • 454 pyrosequencing

    a.  Isolate gDNA, fragment and ligate adapters

    b.  Bind to beads and carry out emulsion PCR (emPCR: 1 fragment/bead)

    c.  Break emulsion and add beads to fiber-optic slide

    d.  Pyrosequencing reaction, 1 nt added at a time (peak height corresponds to number of nucleotides incorporated)

    Figure from Rothberg and Leamon 2008
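    Step d can be illustrated with an idealized flowgram: nucleotides are flowed one at a time in a fixed cycle, and each flow reports how many identical bases were incorporated in a row. The sketch assumes a noise-free signal and a T, A, C, G flow cycle; the function names are hypothetical. In real data the signal for long homopolymers is hard to tell apart (for example 7 versus 8 bases), which is why indels dominate the errors of this kind of chemistry.

```python
from itertools import groupby


def flowgram(sequence: str, flow_order: str = "TACG") -> list[tuple[str, int]]:
    """Simulate ideal pyrosequencing signals: each flow of one nucleotide
    reports how many consecutive copies of that base were incorporated."""
    if set(sequence) - set(flow_order):
        raise ValueError("sequence contains bases missing from flow_order")
    runs = [(base, len(list(group))) for base, group in groupby(sequence)]
    flows = []
    run_index = 0
    flow_index = 0
    while run_index < len(runs):
        nucleotide = flow_order[flow_index % len(flow_order)]
        base, length = runs[run_index]
        if nucleotide == base:
            flows.append((nucleotide, length))  # peak height = homopolymer length
            run_index += 1
        else:
            flows.append((nucleotide, 0))  # no incorporation, no light
        flow_index += 1
    return flows


def decode(flows: list[tuple[str, int]]) -> str:
    """Rebuild the sequence from (nucleotide, peak height) pairs."""
    return "".join(nucleotide * height for nucleotide, height in flows)


flows = flowgram("GGATTTC")
print(flows)
print(decode(flows))
```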

  • Illumina

    •  Fragment gDNA

    •  Ligate adapters

    •  Fix fragments on solid surface

    •  Bridge amplification to generate clusters

    •  Sequence one end (using reversible terminators)

    •  If paired-end, regenerate cluster and sequence the other end

    Figure from Mardis 2013

  • Ion Torrent

    1. Shear DNA, ligate adapters

    2. Attach fragments to beads and amplify with emPCR

    3. Place beads in wells on plate

    4. Flow nucleotides over wells, one at a time

    5. DNA polymerase incorporates bases and gives off H+

    6. Miniature semiconductor sensor reads the pH change

    http://www.lifetechnologies.com

    *more like 2.5-generation technology

  • Applications

    • Genome re-sequencing (reference-based assembly)

    • Genome sequencing (de novo assembly)

    • Sequencing transcriptome (RNA-seq)

    • Sequencing DNA associated with proteins (ChIP-seq)

  • Third generation sequencing technology

    Shear genomic DNA → Solid support fixation → No amplification → Base detection

    Single-molecule sequencing

    5. Third generation technology

  • Single molecule sequencing, e.g. Pacific Biosciences (PacBio)

    •  Single-molecule real-time (SMRT) sequencing

    •  Real time fluorescent nucleotides

    •  Some reads >10 kb

    •  High error rate

    Eid et al. 2009

  • Applications

    •  Low-depth: Scaffolding contigs (de novo assembly)

    •  High-depth: Genome sequencing of repetitive regions or structural rearrangements

  • Comparison of NGS technologies (non-exhaustive)

    Method            | Strategy                   | Read length           | Error type    | Error rate   | Output per run
    454               | Synthesis/pyrosequencing   | Up to 700 bp          | indels        | 1%           | 400-600 Mbp
    SOLiD             | DNA ligase                 | 75 bp                 | AT bias       | >0.01-0.06%  | 20-30 Gbp
    Illumina (HiSeq)  | Synthesis/DNA polymerase   | 150 bp                | substitutions | >0.1%        | 600 Gbp
    Ion Torrent       | H+ detection               | 90 bp                 | indels        | 1.5%         | 1 Gbp
    PacBio            | Single molecule/synthesis  | >2.5 kb (up to 10 kb) | insertions    | 15%          | 75-100 Mbp (5-10 Mbp usable)
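    Per-base error rates like those in the table are often expressed as Phred quality scores, Q = -10 × log10(P), which is also how base qualities are reported alongside reads. The sketch below converts approximate error rates from the table; the Illumina value uses the table's lower bound of 0.1% as an assumption.

```python
import math


def phred_quality(error_rate: float) -> float:
    """Phred score for a per-base error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(error_rate)


# Approximate per-base error rates taken from the comparison table.
error_rates = {
    "454": 0.01,
    "Illumina (HiSeq)": 0.001,  # table gives ">0.1%"; lower bound used here
    "Ion Torrent": 0.015,
    "PacBio": 0.15,
}
for platform, rate in error_rates.items():
    print(f"{platform}: Q{phred_quality(rate):.1f}")
```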

    6. Challenges

  • The $1000 genome: Illumina!

    Reported January 14 2014:

    “The HiSeq X™ Ten, composed of 10 HiSeq X Systems, is the first sequencing platform that breaks the $1000 barrier for a 30x human genome. The HiSeq X Ten System is ideal for population-scale projects focused on the discovery of genotypic variation to understand and improve human health”

    http://investor.illumina.com/

  • Summary of technology

    • Point:

      •  Sequencing is cheap and easy

      •  Within reach of individual labs

    • Current challenge

      •  Computational

      •  Data management

    Figure credit: NHGRI at genome.gov

  • Repetitive DNA

    •  Interspersed repeats, e.g. transposable elements

    •  Tandem repeats, e.g. satellites, CNVs

  • Challenges for repetitive DNA

    • Repeat unit longer than read length (e.g. transposable elements)

    • Repeat unit longer than insert sizes (e.g. transposable elements)
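    The underlying problem is that a read lying entirely inside a repeat matches more than one place, so its true origin is ambiguous. A small sketch using a tandem repeat of the same form as the example on the following slides (AATATGG units between unique flanks); the copy number of 9 and the helper name are assumptions for illustration.

```python
def find_all(genome: str, read: str) -> list[int]:
    """Start positions of every exact (possibly overlapping) match."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)
    return positions


unit = "AATATGG"
genome = "CCTGCGAT" + unit * 9 + "TGTACCC"  # copy number 9 is an assumption

repeat_read = "AATATGGAATATGG"   # lies entirely within the repeat array
anchored_read = "CGATAATATGGAA"  # starts in the unique left flank

print(find_all(genome, repeat_read))    # many equally good placements
print(find_all(genome, anchored_read))  # a single placement
```

    Only reads (or read pairs) that reach unique flanking sequence can be placed with confidence.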

  • Challenges for repetitive DNA

    True genomic sequence:

    CCTGCGATAATATGGAATATGGAATATGGAATATGGAATATGGAATATGGAATATGGAATATGGAATATGGTGTACCC

    Figure: short reads drawn from this sequence (e.g. AATATGGAATATGG, CGATAATATGGAA, GAATATGGTGTA); many fall entirely within the AATATGG tandem repeat. The figure layers also include two shorter contigs in which the repeat array is collapsed:

    CCTGCGATAATATGGAATATGGTGTACCC

    CCTGCGATAATATGGAATATGGAATATGGAATATGGAATATGGTGTACCC

  • Challenges for repetitive DNA

    Single end libraries

    Figure: the true genomic sequence (top) compared with the assembly (bottom); the sequence and read layers are identical to those on the preceding slide

  • Challenges for repetitive DNA

    Paired end libraries

    Figure: the true genomic sequence (top) compared with the assembly (bottom); the sequence and read layers are identical to those on the preceding slide

  • Challenges for repetitive DNA

    Paired end + Mate pair libraries

    Figure: the true genomic sequence (top) compared with the assembly (bottom); the assembly layers include a contig with the full-length repeat array:

    CCTGCGATAATATGGAATATGGAATATGGAATATGGAATATGGAATATGGAATATGGAATATGGAATATGGTGTACCC
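    Why long-insert pairs help: if both reads of a pair sit in unique flanking sequence and the insert size is known, the distance between them constrains how long the repeat array in the middle must be. The sketch below is a back-of-the-envelope version with one idealized pair whose outer ends coincide with the ends of the locus; the function name, the copy number of 9, and the exact insert size are assumptions, and real libraries have a distribution of insert sizes rather than a single value.

```python
def estimate_copy_number(insert_size: int, left_unique: int,
                         right_unique: int, unit_length: int) -> float:
    """Estimate tandem repeat copy number from one spanning pair.

    insert_size:  distance between the outer ends of the two reads
    left_unique:  unique bases between the left outer end and the repeat array
    right_unique: unique bases between the repeat array and the right outer end
    """
    array_length = insert_size - left_unique - right_unique
    if array_length < 0:
        raise ValueError("flanks are longer than the insert")
    return array_length / unit_length


unit = "AATATGG"
left_flank, right_flank = "CCTGCGAT", "TGTACCC"
true_locus = left_flank + unit * 9 + right_flank  # copy number 9 is an assumption

copies = estimate_copy_number(
    insert_size=len(true_locus),  # idealized pair spanning the whole locus
    left_unique=len(left_flank),
    right_unique=len(right_flank),
    unit_length=len(unit),
)
print(copies)
```

    A contig that collapses the array to one or two copies would imply a much shorter distance between the flanks than the library's insert size, which flags the misassembly.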

  • Repeats cause

    •  Misassemblies

    •  Complex rearrangements

    •  Gaps

  • Next gen applications and repeats

    •  WGS with Sanger:

      •  Repetitive DNA unstable in cloning vectors

      •  Paired end/Mate pairs help with assembly

    •  454 pyrosequencing

      •  Problems with homopolymers

      •  Paired end/Mate pairs help with assembly

    •  Illumina

      •  Repetitive elements longer than read length

      •  Deep coverage and mate pairs help with assembly

    •  PacBio

      •  Problem is very high error rate: correction requires deep PacBio coverage or short reads

      •  Read length plows through repeats

  • Further reading:

    •  Metzker. 2010. Sequencing technologies - the next generation. Nature Reviews Genetics 11:31-46.

    •  Mardis. 2013. Next-Generation Sequencing Platforms. Ann. Rev. Anal. Chem. 6:287-303.

    •  Treangen and Salzberg. 2012. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nature Reviews Genetics 13:36-46.

  • Project background reading

    •  Brennecke, J, AA Aravin, A Stark, M Dus, M Kellis, R Sachidanandam, GJ Hannon. 2007. Discrete small RNA-generating loci as master regulators of transposon activity in Drosophila. Cell 128:1089-1103.

    •  Lemos, B, LO Araripe, DL Hartl. 2008. Polymorphic Y chromosomes harbor cryptic variation with manifold functional consequences. Science 319:91-93.

    •  Nagao, A, T Mituyama, H Huang, D Chen, MC Siomi, H Siomi. 2010. Biogenesis pathways of piRNAs loaded onto AGO3 in the Drosophila testis. RNA 16:2503-2515.

    •  Filion, GJ, JG van Bemmel, U Braunschweig, et al. 2010. Systematic protein location mapping reveals five principal chromatin types in Drosophila cells. Cell 143:212-224.

  • Papers

    •  Akbari, OS, I Antoshechkin, BA Hay, PM Ferree. 2013. Transcriptome profiling of Nasonia vitripennis testis reveals novel transcripts expressed from the selfish B chromosome, paternal sex ratio. G3 (Bethesda) 3:1597-1605.

    •  Blumenstiel, JP, X Chen, M He, CM Bergman. 2014. An Age-of-Allele Test of Neutrality for Transposable Element Insertions. Genetics 196:523-538.

    •  Rogers, RL, JM Cridland, L Shao, TT Hu, P Andolfatto, KR Thornton. 2014. Landscape of standing variation for tandem duplications in Drosophila yakuba and Drosophila simulans. ArXiv preprint.

    •  Kelleher, ES, DA Barbash. 2013. Analysis of piRNA-mediated silencing of active TEs in Drosophila melanogaster suggests limits on the evolution of host genome defense. Molecular Biology and Evolution 30:1816-1819.

  • Getting set up to run graphical software on BlueHive

    •  Please go to: https://www.circ.rochester.edu/wiki/index.php/Getting_Started and https://www.circ.rochester.edu/wiki/index.php/NX_Cluster

    •  Install an X11 application if needed