Bioinformatics for next-generation DNA sequencing
Gabor T. Marth
Boston College Biology Department
BC Biology new graduate student orientation
September 2, 2008
Genetic code (DNA)
AGCGTGGTAGCGCGAGTTTGCGAGCTAGCTAGGCTCCGGATGCGACCAGCTTTGATAGATGAATATAGTGTGCGCGACTAGCTGTGTGTTGAATATATAGTGTGTCTCTCGATATGTAGTCTGGATCTAGTGTTGGTGTAGATGGAGATCGCGTAGCGTGGTAGCGCGAGTTTGCGAGCTAGCTAGGCTCCGGATGCGACCAGCTTTGATAGATGAATATAGTGTGCGCGACTAGCTGTGTGTTGAATATATAGTGTGTCTCTCGATATGTAGTCTGGATCTAGTGTTGGTGTAGATGGAGATCGCGTGCTTGAGTCGTTCGTTTTTTTATGCTGATGATATAAATATATAGTGTTGGTGGGGGGTACTCTACTCTCTCTAGAGAGAGCCTCTCAAAAAAAAAGCTCGGGGATCGGGTTCGAAGAAGTGAGATGTACGCGCTAGXTAGTATATCTCTTTCTCTGTCGTGCTGCTTGAGATCGTTCGTTTTTTTATGCTGATGATATAAATATATAGTGTTGGTGGGGGGTACTCTACTCTCTCTAGAGAGAGCCTCTCAAAAAAAAAGCTCGGGGATCGGGTTCGAAGAAGTGAGATGTACGCGCTAGXTAGTATATCTCTTTCTCTGTCGTGCT
The genome
Genome sequencing
Genome sizes: ~1 Mb, ~100 Mb, >100 Mb, ~3,000 Mb
Next-generation sequencing machines
[Figure: bases per machine run (1 Mb to 1 Gb) plotted against read length (10 bp to 1,000 bp)]
• Illumina, AB/SOLiD short-read sequencers (1 Gb in 25-50 bp reads)
• 454 pyrosequencer (20-100 Mb in 100-250 bp reads)
• ABI capillary sequencer
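As a rough illustration of the throughput figures above, the number of reads in a run is approximately the bases per run divided by the read length. This is a minimal Python sketch; the function name is hypothetical, and the inputs are the approximate values quoted on the slide.

```python
def reads_per_run(bases_per_run, read_length_bp):
    """Approximate number of reads produced in one machine run."""
    return bases_per_run // read_length_bp


# Short-read sequencers: ~1 Gb per run in 25-50 bp reads
short_read_low = reads_per_run(1_000_000_000, 50)
short_read_high = reads_per_run(1_000_000_000, 25)

# 454 pyrosequencer: 20-100 Mb per run in 100-250 bp reads
pyro_low = reads_per_run(20_000_000, 250)
pyro_high = reads_per_run(100_000_000, 100)

print(f"short-read run: {short_read_low:,} to {short_read_high:,} reads")
print(f"454 run: {pyro_low:,} to {pyro_high:,} reads")
```

Under these assumptions a short-read run yields tens of millions of reads, which is why the downstream tools have to be fast.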
Individual human resequencing
Variations at every scale of genome organization
Single-base substitutions (SNPs)
Insertion-deletion polymorphisms
Structural variations including large-scale chromosomal rearrangements
Epigenetic variations (e.g. changes in methylation / chromatin structure)
We care about genetic variations because…
… they underlie phenotypic differences
… cause heritable diseases and determine responses to drugs
… allow tracking ancestral human history
Individual resequencing / SNP discovery
[Figure: workflow diagram with reference (REF) and individual (IND) sequences]
(i) base calling
(ii) micro-repeat analysis
(iii) read mapping
(iv) read assembly
(v) SNP and short INDEL calling
(vii) data validation, hypothesis generation
Tools
The variation discovery “toolbox”
• base callers
• read mappers
• SNP callers
• SV callers
• assembly viewers
GigaBayes
Base calling
Quinlan et al. Nature Methods 2008
Read mapping
Read mapping is like doing a jigsaw puzzle…
…you get the pieces…
…and they give you the picture on the box
Problem is, some pieces are easier to place than others…
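To make the jigsaw analogy concrete, here is a toy exact-match mapper in Python. It is only a sketch: the function name and the sequences are made up for illustration, and real read mappers use indexed, error-tolerant alignment rather than a plain string search. A read that falls in a repeat matches several positions, which is the "hard to place" piece.

```python
def map_read(reference, read):
    """Return every 0-based position where the read matches the reference exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        # Continue one base further so overlapping matches are also found
        start = reference.find(read, start + 1)
    return positions


# Hypothetical reference containing a short ATAT... repeat
reference = "ACGTTTGACCATATATATGGCA"

unique_hits = map_read(reference, "TTGACC")  # one placement
repeat_hits = map_read(reference, "ATAT")    # several placements inside the repeat
no_hits = map_read(reference, "GGGG")        # read does not map

print(unique_hits, repeat_hits, no_hits)
```

A read with exactly one hit can be placed with confidence; a read with several hits needs extra information (longer reads, paired ends) to be placed.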
Read mapping
Michael Stromberg, in prep.
SNP discovery
GigaBayes
Marth et al. Nature Genetics 1999
Quinlan et al. in prep.
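The idea behind Bayesian SNP calling can be sketched with a toy haploid model. This is not the GigaBayes algorithm; the function name, the uniform error model, and the default `error_rate` and `prior_snp` values are all assumptions chosen for illustration. Each candidate allele gets a prior and a likelihood from the bases observed at one position, and the result is the posterior probability that the site differs from the reference.

```python
from math import prod


def snp_posterior(ref_base, observed_bases, error_rate=0.01, prior_snp=0.001):
    """Posterior probability that the sampled allele differs from ref_base.

    Toy haploid model: each observed base is correct with probability
    1 - error_rate and is one of the three other bases otherwise.
    """
    weights = {}
    for allele in "ACGT":
        # The reference allele gets most of the prior mass
        prior = 1.0 - prior_snp if allele == ref_base else prior_snp / 3.0
        likelihood = prod(
            1.0 - error_rate if base == allele else error_rate / 3.0
            for base in observed_bases
        )
        weights[allele] = prior * likelihood
    total = sum(weights.values())
    return 1.0 - weights[ref_base] / total


consistent_variant = snp_posterior("A", "GGGGG")  # five reads agree on G
single_mismatch = snp_posterior("A", "AAAAG")     # one mismatch among five reads

print(consistent_variant, single_mismatch)
```

Five concordant non-reference bases give a posterior close to 1, while a single mismatch among reference bases is explained as a sequencing error.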
Structural variation discovery
Navigation bar
Fragment lengths in selected region
Depth of coverage in selected region
Stewart et al. in prep.
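Depth of coverage in a region can be computed directly from the mapped read positions. The sketch below assumes reads are given as hypothetical 0-based, half-open (start, end) intervals; it is an illustration, not the tool shown on the slide. Regions where depth drops or rises sharply relative to the surrounding sequence are candidates for deletions or amplifications.

```python
def depth_of_coverage(read_intervals, region_start, region_end):
    """Per-base read depth over the half-open region [region_start, region_end).

    read_intervals holds (start, end) pairs, 0-based and half-open.
    """
    depth = [0] * (region_end - region_start)
    for start, end in read_intervals:
        # Clip each read to the region before counting
        for pos in range(max(start, region_start), min(end, region_end)):
            depth[pos - region_start] += 1
    return depth


# Three made-up overlapping reads on a 10 bp region
reads = [(0, 5), (2, 7), (4, 9)]
depth = depth_of_coverage(reads, 0, 10)
print(depth)
```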
Assembly viewers
Huang and Marth. Genome Research 2008
Data mining
SNP calling in single-read 454 coverage
• collaborative project with Andy Clark (Cornell) and Elaine Mardis (Wash. U.)
• goal was to assess polymorphism rates between 10 different African and American Drosophila melanogaster isolates
• 10 runs of 454 reads (~300,000 reads per isolate) were collected
DNA courtesy of Chuck Langley, UC Davis
Mutational profiling in deep 454 data
• collaboration with Doug Smith at Agencourt
• Pichia stipitis is a yeast that efficiently converts xylose to ethanol (bio-fuel production)
• one specific mutagenized strain had especially high conversion efficiency
• goal was to determine where the mutations were that caused this phenotype
• we analyzed 10 runs of 454 reads (~3 million reads; ~20x coverage of the 15 Mb genome)
Pichia stipitis reference sequence
• processed the sequences with our 454 pipeline
• found 39 mutations (in as many reads as yielded 650K SNPs in melanogaster)
• informatics analysis in < 24 hours (including manual checking of all candidates)
Image from JGI web site
Smith et al. Genome Research 2008
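The coverage figure quoted for the Pichia stipitis data (~3 million reads, ~20x coverage of a 15 Mb genome) follows from the usual relation: coverage = number of reads × mean read length / genome size. The Python sketch below uses hypothetical function names; the mean read length is not stated on the slides and is only inferred here from the other three numbers.

```python
def fold_coverage(n_reads, mean_read_length_bp, genome_size_bp):
    """Average fold coverage of a genome by a set of reads."""
    return n_reads * mean_read_length_bp / genome_size_bp


def implied_read_length(n_reads, coverage, genome_size_bp):
    """Mean read length implied by a stated coverage (same relation, rearranged)."""
    return coverage * genome_size_bp / n_reads


# Approximate values from the slide
n_reads = 3_000_000
genome_size_bp = 15_000_000

read_length = implied_read_length(n_reads, 20, genome_size_bp)
coverage = fold_coverage(n_reads, read_length, genome_size_bp)
print(read_length, coverage)
```

The implied mean read length of about 100 bp is in line with the 100-250 bp range quoted earlier for 454 reads.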
SNP calling in short-read coverage
C. elegans reference genome (Bristol, N2 strain)
Pasadena, CB4858 (1 ½ machine runs)
Bristol, N2 strain (3 ½ machine runs)
• goal was to evaluate the Solexa/Illumina technology for the complete resequencing of large model-organism genomes
• 5 runs (~120 million Illumina reads) from the Washington University Genome Center, as part of a collaborative project led by Elaine Mardis
SNP
• we found 45,000 SNPs with a very high validation rate
Hillier et al. Nature Methods 2008
Current focus
1000 Genomes Project
• data quality assessment
• project design (# samples, depth of read coverage)
• read mapping
• SNP calling
• structural variation discovery
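One way to reason about the depth-of-coverage side of project design is a simple Poisson model: if reads land uniformly at random, the depth at any one site is approximately Poisson-distributed around the mean depth. The sketch below is a textbook approximation, not the project's actual design method, and the function name and example depths are made up for illustration.

```python
from math import exp, factorial


def prob_covered_at_least(k, mean_depth):
    """P(depth >= k) at a site when depth is Poisson with the given mean."""
    p_fewer = sum(
        exp(-mean_depth) * mean_depth**i / factorial(i) for i in range(k)
    )
    return 1.0 - p_fewer


# Fraction of sites expected to have at least 2 reads at several mean depths
for mean_depth in (2, 4, 8):
    fraction = prob_covered_at_least(2, mean_depth)
    print(f"mean depth {mean_depth}x: {fraction:.3f} of sites with >= 2 reads")
```

For a fixed sequencing budget, adding samples lowers the mean depth per sample, so this fraction falls; that is the trade-off behind choosing the number of samples and the depth together.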
SV discovery in autism
deletion
amplification
Transcriptome sequencing
(from: Mortazavi et al. Nature Methods 2008)
Lab
The team
Derek Barnett
Eric Tsung
Aaron Quinlan
Damien Croteau-Chonka
Weichun Huang
Michael Stromberg
Chip Stewart
Michele Busby
Resources
• computer cluster
• 128 GB RAM server
• 20 TB disk space
• 2 large R01 grants from the NIH
• a BC RIG grant
Collaborations
Baylor HGSC
Wash. U. GSC
Genome Canada
UBC GSC
Cornell
UC Davis
UCSF
NCBI @ NIH
NCI @ NIH
Marshfield Clinic
UCLA
Pfizer
Graduate student rotations
• Looking for new graduate students
• Spots are available for all three rotations
• Lots of projects
• Caveat: you need to be able to program…
• Check us out at: http://bioinformatics.bc.edu/marthlab/
• If you are interested, please talk to me