Genetics 211 - 2014 Lecture 2
High Throughput Sequencing
Gavin Sherlock
January 14th 2014
Applications of Next-Gen Sequencing
• transcript identity
• transcript abundance
• RNA editing
• SNPs
• Allele specific expression
• Regulation
• Nucleosome positioning
• 3D genome architecture
• Active promoters
• interactions between nucleic acids and proteins
• chromatin modifications
• genome variability
• metagenomics
• genome modifications
• detection of mutations
• association studies
• phylogeny
• evolution
[Figure: workflows for genome, chromatin and transcriptome]
• de novo sequencing, assembly, annotation
• mapping, resequencing, detection of variants
• mapping, Hi-C, 3D reconstruction
• mapping, ChIP-Seq, detection of binding sites
• mapping, RNA-Seq, transcript detection and quantification
• mapping, ATAC-Seq, identify open chromatin
How do we make an Illumina Genomic DNA library?
1. Genomic DNA
2. Fragment (Covaris)
3. Polish, add dA overhang
4. Add adaptors, size select
5. Sequence
Making fragments asymmetric
Fragmented, end polished, phosphorylated, dA ligated DNA sample:
5'-pNNNN.........NNNNA-3'
3'-ANNNN.........NNNNp-5'
Genomic Y-adapter:
5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCT-3'
3'-GTTCGTCTTCTGCCGTATGCTCGAGAAGGCTAGp-5'
Ligate:
5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3'
3'-GTTCGTCTTCTGCCGTATGCTCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCGCAGCACATCCCTTTCTCACA-5'
[Ligation product is gel purified, selecting only those products in a certain size range]
Making our genomic DNA library asymmetric
Round 1 of PCR, template (ligation product):
5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3'
3'-GTTCGTCTTCTGCCGTATGCTCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCGCAGCACATCCCTTTCTCACA-5'
Primer, shown 3' to 5' (as annealed) and 5' to 3':
3'-TCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'
5'-CAAGCAGAAGACGGCATACGAGCTCTTCCGATCT-3'
Products of first round:
5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3'
3'-TGTGAGAAAGGGATGTGCTGCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'
5'-CAAGCAGAAGACGGCATACGAGCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT-3'
3'-GTTCGTCTTCTGCCGTATGCTCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCGCAGCACATCCCTTTCTCACA-5'
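The primer and adapter sequences in this step can be checked against each other by reverse complementing. The following is a minimal sketch in Python; the helper name `reverse_complement` and the constant names are illustrative, and the two sequences are copied from the slide with the insert (NNNN...NNNN) left out.

```python
# Translation table for complementing DNA bases (N stays N).
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return seq.translate(COMPLEMENT)[::-1]


# 3' end of the top strand of the ligation product (adapter portion only).
ADAPTER_TOP_3PRIME = "AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG"
# Round 1 PCR primer, written 5' to 3'.
ROUND1_PRIMER = "CAAGCAGAAGACGGCATACGAGCTCTTCCGATCT"

# The primer can anneal to the top strand if its reverse complement
# equals the adapter sequence at the 3' end of that strand.
primer_anneals = reverse_complement(ROUND1_PRIMER) == ADAPTER_TOP_3PRIME
print(primer_anneals)
```

The same check can be repeated for the longer primer used in rounds 2-18 and for the sequencing primer.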
Finishing and Sequencing the Library
Rounds 2-18 of PCR, template:
5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3'
3'-TGTGAGAAAGGGATGTGCTGCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'
Primers:
3'-TCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'
5'-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT-3'
Product of PCR amplification:
5'-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3'
3'-TTACTATGCCGCTGGTGGCTCTAGATGTGAGAAAGGGATGTGCTGCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'
[Anneal to flow cell. Perform cluster generation]
Genomic DNA Sequencing Primer:
5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCT-3'
How Much Sequence?
• HiSeq 2500 can give ~250 million paired-end 100 bp reads per lane (2 × 100 bp per cluster)
• This is 50 Gb of sequence
• This is ~4000x coverage of yeast (12 Mb)
• This is an obvious waste of resources (it's also ~500x C. elegans, and ~500x D. melanogaster)
• How can we sequence on a HiSeq and not waste all these resources when sequencing smaller genomes?
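The arithmetic behind these numbers can be written out as a short Python sketch. The function names and the 50x target coverage are assumptions chosen for illustration; the lane yield, read length and yeast genome size are the figures from the slide.

```python
def coverage(clusters: float, read_length_bp: int, genome_size_bp: float,
             paired: bool = True) -> float:
    """Average fold coverage: total sequenced bases divided by genome size."""
    reads_per_cluster = 2 if paired else 1
    total_bases = clusters * read_length_bp * reads_per_cluster
    return total_bases / genome_size_bp


def samples_per_lane(lane_coverage: float, target_coverage: float) -> int:
    """How many equally sized samples fit in one lane at a target coverage."""
    return int(lane_coverage // target_coverage)


lane_bases = 250e6 * 100 * 2                 # 250 million clusters, 2 x 100 bp
yeast_coverage = coverage(250e6, 100, 12e6)  # yeast genome: 12 Mb

print(f"{lane_bases / 1e9:.0f} Gb per lane")
print(f"{yeast_coverage:.0f}x coverage of yeast")
# Hypothetical target of 50x per sample (an assumption, not from the slide).
print(samples_per_lane(yeast_coverage, 50), "yeast samples per lane at 50x")
```

This ignores reads lost to quality filtering, duplicates and uneven pooling, so the real number of samples per lane would be somewhat lower.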
Barcode Sequencing
• Two ways to perform barcode sequencing
– In-line barcodes
  • Barcode is read as part of the normal sequencing read
– Multiplex barcodes
  • Barcode is read as a third, short sequencing run (also known as index reads)
• Can be used to run multiple samples from any particular origin on the same lane of a HiSeq, with the barcodes allowing the samples to be de-convoluted afterwards.
• Barcodes should be designed so that they are balanced in GC content, and as dissimilar as possible.
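The two design criteria (balanced GC content, maximal dissimilarity) and the de-convolution step can be sketched in Python as follows. The barcode set, the function names and the one-mismatch tolerance are all hypothetical choices for illustration, not a recommended design.

```python
from itertools import combinations
from typing import Optional, Sequence


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length barcodes differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def gc_fraction(seq: str) -> float:
    """Fraction of bases that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)


def min_pairwise_distance(barcodes: Sequence[str]) -> int:
    """Smallest Hamming distance between any two barcodes in the set."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))


def assign_sample(read: str, barcodes: Sequence[str],
                  max_mismatches: int = 1) -> Optional[str]:
    """Assign an in-line barcoded read to a barcode, or None if unclear."""
    length = len(barcodes[0])
    prefix = read[:length]
    matches = [bc for bc in barcodes if hamming(prefix, bc) <= max_mismatches]
    # Reject reads that match no barcode or more than one barcode.
    return matches[0] if len(matches) == 1 else None


# Hypothetical 6 bp barcodes, each with 50% GC.
barcodes = ["ACGTAC", "TGCATG", "GATCCA", "CTAGGT"]
print([gc_fraction(bc) for bc in barcodes])
print(min_pairwise_distance(barcodes))
print(assign_sample("ACGTAGTTTTGGGA", barcodes))  # one mismatch in the barcode
```

A larger minimum distance between barcodes allows more sequencing errors to be tolerated before two samples can be confused.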
In-line Barcode Sequencing
Multiplex, or Index barcoding
Random barcoding
• During the PCR step, each template gets amplified many times
• If your library is of insufficient complexity, or you overamplify, you may have PCR duplicates
• You want to make independent observations, not redundant observations
• When sequencing to high coverage, you may have identical, but non-redundant observations.
• Want to be able to distinguish these.
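One way to use random barcodes for this is to treat reads as redundant only when both the mapping position and the random barcode agree. The Python sketch below assumes each aligned read has already been reduced to a (chromosome, position, strand, barcode) tuple; that tuple layout and the function name are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Alignment = Tuple[str, int, str, str]  # (chromosome, position, strand, barcode)


def collapse_duplicates(alignments: Iterable[Alignment]) -> Dict[Alignment, int]:
    """Group reads by position and random barcode.

    Each key is one independent observation; the value is the number of
    reads (PCR copies) supporting it.
    """
    counts: Dict[Alignment, int] = defaultdict(int)
    for alignment in alignments:
        counts[alignment] += 1
    return dict(counts)


reads = [
    ("chrI", 100, "+", "ACGT"),
    ("chrI", 100, "+", "ACGT"),  # same position, same barcode: PCR duplicate
    ("chrI", 100, "+", "TTGA"),  # same position, new barcode: independent
    ("chrI", 250, "-", "ACGT"),
]
observations = collapse_duplicates(reads)
print(len(observations), "independent observations from", len(reads), "reads")
```

This sketch requires exact barcode matches; sequencing errors within the barcode itself would need additional handling.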
Random Barcoding
What are the data?
• Illumina produces data in fastq format.
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
• Line 1: '@' followed by a sequence identifier
• Line 2: the sequence
• Line 3: '+', optionally followed by a sequence identifier
• Line 4: the quality scores
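The four-line record structure can be read with a few lines of Python. This is a minimal sketch that assumes unwrapped records (exactly four lines each) and a Phred+33 quality encoding; the offset is an assumption, since some older Illumina pipelines used an offset of 64. The names `FastqRecord`, `read_fastq` and `phred_scores` are illustrative.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ file with exactly four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string into Phred scores (offset 33 assumed)."""
    return [ord(char) - offset for char in quality]


example = io.StringIO(
    "@SEQ_ID\n"
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT\n"
    "+\n"
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65\n"
)
records = list(read_fastq(example))
scores = phred_scores(records[0].quality)
print(records[0].identifier, len(records[0].sequence), min(scores), max(scores))
```

For real data, an established parser (for example the one in Biopython) is a safer choice than a hand-written reader.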
Example of Illumina SeqID
@HWUSI-EAS100R:6:73:941:1973#0/1
• HWUSI-EAS100R: the unique instrument name
• 6: flowcell lane
• 73: tile number within the flowcell
• 941: 'x'-coordinate of the cluster within the tile
• 1973: 'y'-coordinate of the cluster within the tile
• #0: index number for a multiplexed sample (0 for no indexing)
• /1: the member of a pair, /1 or /2 (paired-end or mate-pair reads only)
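The fields described above can be pulled apart with a regular expression. This Python sketch only handles the identifier layout shown on this slide (newer Illumina software writes a different header format), and the field names in the pattern are illustrative.

```python
import re
from typing import Dict

# Pattern for identifiers of the form instrument:lane:tile:x:y#index/member.
SEQ_ID_PATTERN = re.compile(
    r"^@?(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+)"
    r":(?P<x>\d+):(?P<y>\d+)#(?P<index>[^/]+)/(?P<member>[12])$"
)


def parse_seq_id(seq_id: str) -> Dict[str, object]:
    """Split an identifier of the form shown on the slide into its fields."""
    match = SEQ_ID_PATTERN.match(seq_id)
    if match is None:
        raise ValueError(f"unrecognised sequence identifier: {seq_id!r}")
    fields: Dict[str, object] = match.groupdict()
    # Convert the purely numeric fields to integers.
    for name in ("lane", "tile", "x", "y", "member"):
        fields[name] = int(match.group(name))
    return fields


parsed = parse_seq_id("@HWUSI-EAS100R:6:73:941:1973#0/1")
print(parsed)
```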
Assessing Quality
FastQC
http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/
HTQC
[Figure: HTQC output, panels A-G]
https://sourceforge.net/projects/htqc
De novo assembly
• Several methods available
• Short reads require long overlaps
– e.g., 33 bp reads must overlap by 20 bp
– end-trimming helps, to remove low quality bases.
• Most de novo short read assemblers use a k-mer hashing based approach.
• The central challenge of genome assembly is resolving repeat regions.
De novo Assembly Strategies
• Many, many different algorithms and open source (as well as closed source) software for short read sequence assembly.
• Choice of tool depends on exactly what you are trying to assemble:
– Genome size
– Genome complexity
– Level of polymorphism
– Genome vs. transcriptome vs.
– Sequence coverage you have (more is generally better)
– Paired-end vs. single end (you should really have paired-end data)
• E.g.
– SSAKE (Warren et al, 2007)
  • Uses DNA prefix tree to find k-mer matches
– Edena (Hernandez et al, 2008)
  • Overlap layout algorithm plus error correction
– Velvet (Zerbino and Birney, 2008)
  • Uses de Bruijn graph algorithm plus error correction
Example of Velvet de novo Assembly
Genome: TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG
Sequence (7bp reads):
AGTCGAG CTTTAGA CGATGAG CTTTAGA GTCGGG TTAGATC ATGAGGC GAGACAG GAGGCTC ATCCGAT AGGCTTT GAGACAG AGTCGAG TAGATCC ATGAGGC TAGAGAA
TAGTCGA CTTTAGA CCGATGA TTAGAGA CGAGGCT AGATCCG TGAGGCT AGAGACA
TAGTCGA GCTTTAG TCCGATG GCTCTAG TCGACGC GATCCGA GAGGCTT AGAGACA TAGTCGA TTAGATC GATGAGG TTTAGAG
GTCGAGG TCTAGAT ATGAGGC TAGAGAC AGGCTTT ATCCGAT AGGCTTT GAGACAG AGTCGAG TTAGATT ATGAGGC AGAGACA
GGCTTTA TCCGATG TTTAGAG CGAGGCT TAGATCC TGAGGCT GAGACAG AGTCGAG TTTAGATC ATGAGGC TTAGAGA GAGGCTT GATCCGA GAGGCTT GAGACAG
Hashing (k = 4), with the number of times each k-mer is observed:
TAGT (3x), AGTC (7x), GTCG (9x), TCGA (10x), CGAG (8x), GAGG (16x), AGGC (16x), GGCT (16x), GCTT (11x), AGAG (9x), GAGA (12x), AGAC (9x), GACA (8x), ACAG (5x)
CTTC (1x), TTCA (2x), TCAG (2x), CAGA (1x)
CTTT (8x), TTTA (8x), TTAG (12x), TAGA (16x)
CGAC (1x), GACG (1x), ACGC (1x)
TGAG (9x), ATGA (8x), GATG (5x), CGAT (6x), CCGA (7x), TCCG (7x), ATCC (7x), GATC (8x), AGAT (8x)
GATT (1x)
AGAA (1x)
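The hashing step amounts to counting every k-mer across all reads. A minimal Python sketch, using three of the 7 bp reads from the slide; the function name `count_kmers` is illustrative, and reverse-complement k-mers (which Velvet also tracks) are ignored here for simplicity.

```python
from collections import Counter
from typing import Iterable


def count_kmers(reads: Iterable[str], k: int) -> Counter:
    """Count how often each k-mer occurs across a set of reads."""
    counts: Counter = Counter()
    for read in reads:
        for start in range(len(read) - k + 1):
            counts[read[start:start + k]] += 1
    return counts


# Three overlapping 7 bp reads taken from the example above.
kmer_counts = count_kmers(["TAGTCGA", "AGTCGAG", "GTCGAGG"], k=4)
for kmer, count in sorted(kmer_counts.items()):
    print(kmer, f"({count}x)")
```

K-mers seen only once or twice, against a background of much higher counts, are candidates for sequencing errors.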
Graph Building
Genome: TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG
[Figure: the hashed k-mers joined into a graph with nodes TAGTCGA, CGAG, CGACGC, GCTCTAG, GCTTTAG, GAGGCT, TAGA, AGAT, GATCCGATGAG, GATT, AGAGA, AGAA, AGACAG]
Simplification of Linear Stretches
Genome: TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG
[Figure: simplified graph with nodes TAGTCGAG, GAGGCTTTAGA, AGATCCGATGAG, AGAGACAG]
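Graph building and the merging of linear stretches can be sketched in Python as below. This is a simplified illustration and not Velvet's implementation: nodes are plain k-mers, reverse complements and coverage are ignored, no error removal is done, and isolated cycles are not reported. The function names are hypothetical. Run on the error-free example genome with k = 4, it yields the four merged nodes shown on the slide.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Graph = Tuple[Set[str], Dict[str, Set[str]], Dict[str, Set[str]]]


def build_de_bruijn(reads: Iterable[str], k: int) -> Graph:
    """Nodes are k-mers; an edge joins k-mers that are adjacent in a read."""
    nodes: Set[str] = set()
    successors: Dict[str, Set[str]] = defaultdict(set)
    predecessors: Dict[str, Set[str]] = defaultdict(set)
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        nodes.update(kmers)
        for left, right in zip(kmers, kmers[1:]):
            successors[left].add(right)
            predecessors[right].add(left)
    return nodes, successors, predecessors


def simplify_linear_stretches(graph: Graph) -> List[str]:
    """Merge chains of nodes that have a single way in and a single way out."""
    nodes, successors, predecessors = graph

    def starts_path(node: str) -> bool:
        preds = predecessors.get(node, set())
        if len(preds) != 1:
            return True
        (pred,) = preds
        return len(successors.get(pred, set())) != 1

    merged = []
    for node in sorted(nodes):
        if not starts_path(node):
            continue
        path, current, seen = [node], node, {node}
        while True:
            succs = successors.get(current, set())
            if len(succs) != 1:
                break  # dead end or branch point
            (nxt,) = succs
            if len(predecessors.get(nxt, set())) != 1 or nxt in seen:
                break  # converging paths, or a cycle
            path.append(nxt)
            seen.add(nxt)
            current = nxt
        # Consecutive k-mers overlap by k-1 bases, so append one base each.
        merged.append(path[0] + "".join(kmer[-1] for kmer in path[1:]))
    return merged


genome = "TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG"
contigs = simplify_linear_stretches(build_de_bruijn([genome], k=4))
print(contigs)
```

The repeat GAGGCTTTAGA appears as a single node with two ways in and two ways out, which is why repeats are the central difficulty: the graph alone does not say which exit follows which entry.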
Error (tip and bubble) removal
Genome: TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG
[Figure: the graph with nodes TAGTCGA, CGAG, CGACGC, GCTCTAG, GCTTTAG, GAGGCT, TAGA, AGAT, GATCCGATGAG, GATT, AGAGA, AGAA, AGACAG, with the tips and the bubble labelled]
De novo Assembler Performance
• All three programs run with default parameters on the same dataset
– Input: 8.6 million reads
– Platform: 64-bit Opteron, 4 CPUs, 32 GB memory
Program   Version   CPU time   Wall clock
SSAKE     3.0       2:24:59    5:08:59
Edena     2.11      0:28:31    28:58
Velvet    0.5       0:08:48    10:36
De novo assemblies
Program   # Contigs >200 bp   N50 (bp)   Sum (bp)     Singletons
SSAKE     12,532              549        6,090,567    3,164,495
Edena     8,316               902        5,759,209    3,955,865
Velvet    7,382               1,252      6,474,426    1,273,164
Program   # Contigs   N50 (bp)   Sum (bp)     Max contig
SSAKE     185,030     87         14,287,079   5,490
Edena     11,180      837        6,175,460    11,300
Velvet    10,684      1,184      6,841,458    16,239
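The N50 values in these tables summarize contig length distributions: N50 is the length of the contig at which, going from longest to shortest, the running total first reaches half of the summed contig length. A minimal Python sketch (the contig lengths in the example are made up for illustration):

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Contig length at which the running total reaches half the assembly."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contig lengths given")


# Hypothetical contig lengths: total is 30, so half is 15.
# Running totals are 10, then 18, so the N50 is 8.
print(n50([10, 8, 6, 4, 2]))
```

N50 rewards contiguity only; it says nothing about whether the contigs are correctly assembled.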
Assembly Limitations
• Common repeat regions are typically missing/collapsed
– Han Chinese genome missing ~420Mbp of repeats
• Same is true for segmental duplications
– Han Chinese genome only contains ~10Mbp of ~150Mbp of segmental duplications.
• You typically get very large numbers of contigs, which range in size from very small, to sometimes quite large.
Recent Assembler Comparisons
• Earl et al (2011). Assemblathon 1: A competitive assessment of de novo short read assembly methods. Genome Research 21: 2224-2241.
– Used a simulated dataset for all competitors to assemble
• Salzberg et al (2012). GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Research 22(3):557-67.
– Applied several assembly algorithms to their own datasets, for several different sized genomes
• Bradnam et al. (2013). Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. Gigascience 2(1):10.
– See http://assemblathon.org/
• If you have an assembly problem, you should read these papers to gain some insights into strengths and weaknesses of different assemblers
Improving de novo Assemblies
• Need to generate additional long range continuity to be able to orient and order contigs
– Mate pairs
– Hybrid approach using PacBio Reads
– Synthetic long reads (aka Moleculo)
Mate-pair libraries
• Goal is to have the equivalent of 2-5kb insert libraries.
• However, technology is limited to ~700 bp fragments
– Means you have to use some molecular biology to accomplish the equivalent.
1. Genomic DNA
2. Fragment
3. Size Select (2-5kb)
4. Biotinylate
5. Circularize
6. Fragment (400-600bp)
7. Enrich Biotinylated fragments
8. Standard Paired End Illumina Sequencing
9. Incorporate Mate-pair information into assembly
Leveraging Multiple Technologies
• Illumina is great, because you can get a ton of data
– BUT read length is short
• PacBio is great, because read lengths are long
– BUT the data quality is terrible
• Two approaches have been used:
1. Hybrid error correction
  • Use short reads to perform correction of long PacBio reads, and then assemble those
2. Use PacBio reads to improve existing (short read or Sanger based) assemblies
  • E.g. With 24× mapped coverage of PacBio long-reads applied to a D. pseudoobscura assembly, 99% of gaps were addressed, with 69% being closed and a further 12% improved.
Long “Synthetic Reads” aka Moleculo
1. Genomic DNA
2. Fragment
3. Size Select (10kb)
4. Polish, ligate amplification adaptors (~10 kb DNA)
5. Dilute to 500 molecules per well
6. Amplify, fragment, add sequencing adaptors
7. Pool
8. Sequence
9. Separate, based on bar code
10. Remove barcodes, assemble 10kb fragments
11. Assemble genome from 10kb fragments
Mapping Short Reads
• Many options; often a trade off between speed, resources and sensitivity.
• Several open source projects to solve this problem, continually improving in speed and memory requirements.
• New features being added all the time.
• When dealing with short read data, make sure you have the very latest versions of the software you’re using, as some are updated frequently.
Alignment
[Figure: example read @HWI-EAS412_4:1:1:1376:380, sequence AATGATACGGCGACCACCGAGATCTA]
Approaches to Short Read Alignment
• Hash-Based mapping
– Hashing of reads (E.g. Maq, Eland, SHRiMP)
– Hashing of genome (E.g. novoalign, SHRiMP2)
• Indexing using Suffix Array/Burrows-Wheeler Transform (BWT) (E.g. bowtie, bwa)
SHRiMP2
• Uses hash of genome to find alignment seeds, then performs Smith-Waterman
– SW, while slow, is accelerated; requires x86_64 processor (which most macs have nowadays)
• Can detect indels, as well as mismatches
• As of v2.2, takes into account quality scores
• Takes longer than bowtie and bwa, but is more sensitive than both
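The Smith-Waterman step can be illustrated with a plain dynamic-programming version in Python. This is a textbook sketch with a linear gap penalty and arbitrary scoring values chosen for illustration; it is not SHRiMP2's accelerated implementation, and it returns only the best local alignment score.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (linear gap penalty)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                                 # start a new local alignment
                previous[j - 1] + substitution,    # align a[i-1] with b[j-1]
                previous[j] + gap,                 # gap in b
                current[j - 1] + gap,              # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best


# A read with a one-base deletion relative to the reference still scores
# well, because the alignment can open a gap (4 matches, 1 gap).
print(smith_waterman_score("ACGTT", "ACTT"))
```

The cost is proportional to the product of the two sequence lengths, which is why seeds from the hash are used to limit where Smith-Waterman is run.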
MAQ
• Much faster than SHRiMP, at the cost of accuracy (cannot find indels)
• Uses hashing technique to index the reads
• Guaranteed to find alignments with up to 2 mismatches
• Can take advantage of paired end reads
• Uses sequence quality scores to determine best alignments
• Generally no longer used
http://sourceforge.net/projects/maq/
How does “hashing” work?
• A hash function simply converts a string (“key”) to an integer (“value”).
• The integer is then used as an index in an array, for fast look up.
• In MAQ, the reads are “hashed”, using 6 different permutations of the first 28bp.
• The genome is then looked through, in 28bp chunks, to see if they match, via the hash, to reads.
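The read-hashing idea can be sketched in Python with a dictionary, which is itself a hash table. This illustration indexes each read by an exact seed taken from its start and then scans the genome window by window; it uses a 4 bp seed on a toy genome rather than 28 bp, and it omits the six seed permutations MAQ uses to tolerate mismatches. The function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def index_reads(reads: Dict[str, str], seed_length: int) -> Dict[str, List[str]]:
    """Hash each read by the first seed_length bases of its sequence."""
    index: Dict[str, List[str]] = defaultdict(list)
    for name, sequence in reads.items():
        index[sequence[:seed_length]].append(name)
    return index


def scan_genome(genome: str, index: Dict[str, List[str]],
                seed_length: int) -> Dict[str, List[int]]:
    """Slide along the genome and look each window up in the read index."""
    hits: Dict[str, List[int]] = defaultdict(list)
    for position in range(len(genome) - seed_length + 1):
        window = genome[position:position + seed_length]
        for name in index.get(window, ()):
            hits[name].append(position)
    return dict(hits)


genome = "TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG"
reads = {"r1": "GATCCGA", "r2": "GGCTTTA"}
seed_hits = scan_genome(genome, index_reads(reads, seed_length=4), seed_length=4)
print(seed_hits)
```

Each seed hit is only a candidate location; the rest of the read still has to be compared with the genome at that position. Read r2 falls in a repeat and therefore gets two candidates.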
Bowtie
• Similar to MAQ, in that it uses quality scores to find best alignments.
• Uses “Burrows-Wheeler index” to keep its memory footprint small.
• Can find alignments with up to 3 mismatches in the first L bases of the read.
• Only ungapped alignments
• Also supports paired end reads.
• Bowtie2 supports gapped alignments too.
http://bowtie-bio.sourceforge.net/
Bowtie Algorithm
[Figure sequence: step-by-step alignment of read @HWI-EAS412_4:1:1:1376:380, sequence AATGATACGGCGACCACCGAGATCTA; the final steps show the variant sequence AATAATACGGCGACCACCGAGATCTA]
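The core of a Burrows-Wheeler based search is matching the read one base at a time, from its last base to its first, while narrowing a range of rows in the sorted suffix list. The Python sketch below shows exact matching only, with a naively built index; it is an assumption-laden illustration, not Bowtie's code, which adds backtracking for mismatches, quality awareness and a compressed index. The function names are hypothetical.

```python
from typing import Dict, List, Tuple

FMIndex = Tuple[List[int], Dict[str, int], Dict[str, List[int]]]


def build_fm_index(text: str) -> FMIndex:
    """Build a suffix array, first-column offsets and occurrence counts."""
    text += "$"  # terminator, sorts before A, C, G, T
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)
    alphabet = sorted(set(text))
    # first_row[c]: number of characters in the text smaller than c.
    first_row: Dict[str, int] = {}
    total = 0
    for char in alphabet:
        first_row[char] = total
        total += text.count(char)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ: Dict[str, List[int]] = {char: [0] for char in alphabet}
    for symbol in bwt:
        for char in alphabet:
            occ[char].append(occ[char][-1] + (symbol == char))
    return suffix_array, first_row, occ


def find_exact(pattern: str, index: FMIndex) -> List[int]:
    """Return all start positions of pattern, via backward search."""
    suffix_array, first_row, occ = index
    low, high = 0, len(suffix_array)  # half-open range of matching rows
    for char in reversed(pattern):
        if char not in first_row:
            return []
        low = first_row[char] + occ[char][low]
        high = first_row[char] + occ[char][high]
        if low >= high:
            return []
    return sorted(suffix_array[row] for row in range(low, high))


index = build_fm_index("TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG")
print(find_exact("GGCT", index), find_exact("GATC", index))
```

The work per read depends on the read length rather than the genome length, which is one reason this family of aligners is fast.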
BWA
• From the author of Maq, but now uses the Burrows-Wheeler transform to significantly speed it up.
• Can also find small indels, in contrast to both Maq and Bowtie.
• Is slightly slower than bowtie, but its ability to find indels makes it more useful if SNVs are important to you.
Comparison
• PC: 2.4 GHz Intel Core 2, 2 GB RAM
• Server: 2.4 GHz AMD Opteron, 32 GB RAM
• Bowtie v0.9.6, Maq v0.6.6, SOAP v1.10
• SOAP not run on PC due to memory constraints
• Reads: FASTQ 8.84 M reads from 1000 Genomes (Acc: SRR001115)
• Reference: Human (NCBI 36.3, contigs)
Program                CPU time      Wall clock time   Reads per hour   Peak virtual memory footprint   Bowtie speedup   Reads aligned (%)
Bowtie -v 2 (server)   15m:07s       15m:41s           33.8 M           1,149 MB                        -                67.4
SOAP (server)          91h:57m:35s   91h:47m:46s       0.08 M           13,619 MB                       351x             67.3
Bowtie (PC)            16m:41s       17m:57s           29.5 M           1,353 MB                        -                71.9
Maq (PC)               17h:46m:35s   17h:53m:07s       0.49 M           804 MB                          59.8x            74.7
Bowtie (server)        17m:58s       18m:26s           28.8 M           1,353 MB                        -                71.9
Maq (server)           32h:56m:53s   32h:58m:39s       0.27 M           804 MB                          107x             74.7
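The derived columns of this table can be reproduced from the wall clock times and the number of input reads (8.84 M). The Python sketch below assumes, based on the numbers agreeing, that "Reads per hour" and "Bowtie speedup" are both computed from wall clock time; the helper names are illustrative.

```python
import re

UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def to_seconds(text: str) -> int:
    """Convert a time such as '91h:47m:46s' or '15m:41s' to seconds."""
    return sum(int(value) * UNIT_SECONDS[unit]
               for value, unit in re.findall(r"(\d+)([hms])", text))


def reads_per_hour(total_reads: float, wall_clock: str) -> float:
    """Throughput in reads per hour of wall clock time."""
    return total_reads / to_seconds(wall_clock) * 3600


def speedup(other_wall_clock: str, bowtie_wall_clock: str) -> float:
    """How many times faster Bowtie finished than the other program."""
    return to_seconds(other_wall_clock) / to_seconds(bowtie_wall_clock)


TOTAL_READS = 8.84e6
print(f"{reads_per_hour(TOTAL_READS, '15m:41s') / 1e6:.1f} M reads per hour")
print(f"SOAP (server): {speedup('91h:47m:46s', '15m:41s'):.0f}x")
print(f"Maq (PC): {speedup('17h:53m:07s', '17m:57s'):.1f}x")
print(f"Maq (server): {speedup('32h:58m:39s', '18m:26s'):.0f}x")
```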
Comparison
• Bowtie delivers about 30 million alignments per CPU hour
Comparison
• Bowtie and Maq have memory footprints compatible with a typical workstation with 2 GB of RAM
• SOAP requires a computer with >13 GB of RAM
• SOAP2 claims to be “super-fast”, and to require less RAM (also uses BW Transform).
• Your choice will be dictated by your needs (sensitivity, genome size, number of reads) and your computing resources, and may change over time.
More Comparison Data
[Figure: Precision and recall by amount of variation for 4 datasets, by polymorphism: (number of SNPs, Indel size).]
David M, Dzamba M, Lister D, Ilie L, Brudno M. (2011). SHRiMP2: sensitive yet practical SHort Read Mapping. Bioinformatics 27(7):1011-2.
Current Practice
• Most people use bwa for mapping their short read data if they want to discover variants
• If not interested in variants, people use bowtie for speed
• Most people don’t determine whether the tool they are using is the best for their purpose
• There is no standard benchmark dataset, though see:
– Holtgrewe et al. (2011). A novel and well-defined benchmarking method for second generation read mapping. BMC Bioinformatics 12:210.
• It doesn’t hurt to experiment
Recommended Reading
Mapping
• Li, H., Ruan, J. and Durbin, R. (2008). Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18(11):1851-8. (MAQ)
• David, M., Dzamba, M., Lister, D., Ilie, L. and Brudno, M. (2011). SHRiMP2: sensitive yet practical SHort Read Mapping. Bioinformatics 27(7):1011-2.
• Langmead, B., Trapnell, C., Pop, M. and Salzberg, S.L. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10(3):R25. (Bowtie)
• Langmead, B. and Salzberg, S. (2012). Fast gapped-read alignment with Bowtie 2. Nature Methods 9:357-359.
• Li, R., Yu, C., Li, Y., Lam, T.W., Yiu, S.M., Kristiansen, K. and Wang, J. (2009). SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 25(15):1966-7.
• Li, H. and Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25(14):1754-60. (BWA)
Assembly
• Zerbino, D.R. and Birney, E. (2008). Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18(5):821-9.
• Zerbino, D.R., McEwen, G.K., Margulies, E.H. and Birney, E. (2009). Pebble and rock band: heuristic resolution of repeats and scaffolding in the velvet short-read de novo assembler. PLoS One 4(12):e8407.
• Simpson, J.T. and Durbin, R. (2010). Efficient construction of an assembly string graph using the FM-index. Bioinformatics 26(12):i367-73.
• Simpson, J.T. and Durbin, R. (2012). Efficient de novo assembly of large genomes using compressed data structures. Genome Research 22(3):549-56. (SGA)
• Earl, D. et al. (2011). Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Research 21(12):2224-41.
• Bradnam, K.R. et al. (2013). Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. Gigascience 2(1):10.
• English, A.C., Richards, S., Han, Y., Wang, M., Vee, V., Qu, J., Qin, X., Muzny, D.M., Reid, J.G., Worley, K.C., Gibbs, R.A. (2012). Mind the gap: upgrading genomes with Pacific Biosciences RS long-read sequencing technology. PLoS One 7(11):e47768.
Moleculo
• Voskoboynik, A., et al. (2013). The genome sequence of the colonial chordate, Botryllus schlosseri. Elife 2:e00569.