Bioinformatics for next-generation DNA sequencing. Gabor T. Marth, Boston College Biology Department. BC Biology new graduate student orientation, September 2, 2008.

Page 1: Bioinformatics for next-generation DNA sequencing

Bioinformatics for next-generation DNA sequencing

Gabor T. Marth, Boston College Biology Department

BC Biology new graduate student orientation, September 2, 2008

Page 2: Bioinformatics for next-generation DNA sequencing

Genetic code (DNA)

AGCGTGGTAGCGCGAGTTTGCGAGCTAGCTAGGCTCCGGATGCGACCAGCTTTGATAGATGAATATAGTGTGCGCGACTAGCTGTGTGTTGAATATATAGTGTGTCTCTCGATATGTAGTCTGGATCTAGTGTTGGTGTAGATGGAGATCGCGTAGCGTGGTAGCGCGAGTTTGCGAGCTAGCTAGGCTCCGGATGCGACCAGCTTTGATAGATGAATATAGTGTGCGCGACTAGCTGTGTGTTGAATATATAGTGTGTCTCTCGATATGTAGTCTGGATCTAGTGTTGGTGTAGATGGAGATCGCGTGCTTGAGTCGTTCGTTTTTTTATGCTGATGATATAAATATATAGTGTTGGTGGGGGGTACTCTACTCTCTCTAGAGAGAGCCTCTCAAAAAAAAAGCTCGGGGATCGGGTTCGAAGAAGTGAGATGTACGCGCTAGXTAGTATATCTCTTTCTCTGTCGTGCTGCTTGAGATCGTTCGTTTTTTTATGCTGATGATATAAATATATAGTGTTGGTGGGGGGTACTCTACTCTCTCTAGAGAGAGCCTCTCAAAAAAAAAGCTCGGGGATCGGGTTCGAAGAAGTGAGATGTACGCGCTAGXTAGTATATCTCTTTCTCTGTCGTGCT

Page 3: Bioinformatics for next-generation DNA sequencing

The genome

Page 4: Bioinformatics for next-generation DNA sequencing

Genome sequencing

Genome sizes shown: ~1 Mb, ~100 Mb, >100 Mb, ~3,000 Mb

Page 5: Bioinformatics for next-generation DNA sequencing

Next-generation sequencing machines

[Figure: bases per machine run (1 Mb to 1 Gb) plotted against read length (10 bp to 1,000 bp)]

• Illumina, AB/SOLiD short-read sequencers (1 Gb in 25-50 bp reads)

• 454 pyrosequencer (20-100 Mb in 100-250 bp reads)

• ABI capillary sequencer
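
The throughput figures above imply a rough number of reads per machine run: total bases divided by read length. The following is a minimal sketch of that arithmetic; the function name `reads_per_run` is hypothetical, and the calculation assumes every read has the same length.

```python
def reads_per_run(total_bases: float, read_length_bp: float) -> float:
    """Approximate number of reads in one run: total bases / read length."""
    if read_length_bp <= 0:
        raise ValueError("read length must be positive")
    return total_bases / read_length_bp

# Short-read figures quoted on the slide: 1 Gb per run in 25-50 bp reads.
fewest_reads = reads_per_run(1e9, 50)  # longest reads give the fewest reads
most_reads = reads_per_run(1e9, 25)    # shortest reads give the most reads
print(f"short-read run: {fewest_reads:,.0f} to {most_reads:,.0f} reads")
```

Under these assumptions a short-read run yields on the order of tens of millions of reads, which is why read mapping speed matters for these machines.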

Page 6: Bioinformatics for next-generation DNA sequencing

Individual human resequencing

Page 7: Bioinformatics for next-generation DNA sequencing

Variations at every scale of genome organization

Single-base substitutions (SNPs)

Insertion-deletion polymorphisms

Structural variations including large-scale chromosomal rearrangements

Epigenetic variations (e.g. changes in methylation / chromatin structure)

Page 8: Bioinformatics for next-generation DNA sequencing

We care about genetic variations because…

… they underlie phenotypic differences

… cause heritable diseases and determine responses to drugs

… allow tracking ancestral human history

Page 9: Bioinformatics for next-generation DNA sequencing

Individual resequencing / SNP discovery

[Figure labels: REF (reference), IND (individual)]

(i) base calling

(ii) micro-repeat analysis

(iii) read mapping

(iv) read assembly

(v) SNP and short INDEL calling

(vii) data validation, hypothesis generation

Page 10: Bioinformatics for next-generation DNA sequencing

Tools

Page 11: Bioinformatics for next-generation DNA sequencing

The variation discovery “toolbox”

• base callers

• read mappers

• SNP callers

• SV callers

• assembly viewers

GigaBayes

Page 12: Bioinformatics for next-generation DNA sequencing

Base calling

Quinlan et al. Nature Methods 2008

Page 13: Bioinformatics for next-generation DNA sequencing

Read mapping

Read mapping is like doing a jigsaw puzzle…

… you get the pieces…

… and they give you the picture on the box

Problem is, some pieces are easier to place than others…

Page 14: Bioinformatics for next-generation DNA sequencing

Read mapping

Michael Stromberg, in prep.
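
The jigsaw analogy can be made concrete with a toy example. The sketch below is not the lab's read mapper; it is a minimal illustration that assumes exact matching only, with hypothetical function names (`build_kmer_index`, `map_read`) and a made-up reference sequence. It shows why a read from a unique region has one placement while a read from a repeated element has several.

```python
from collections import defaultdict

def build_kmer_index(reference: str, k: int) -> dict:
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def map_read(read: str, reference: str, index: dict, k: int) -> list:
    """Return all reference positions where the read matches exactly.

    The first k bases of the read serve as a seed for the index lookup;
    each candidate position is then verified against the full read.
    """
    if len(read) < k:
        return []
    hits = []
    for pos in index.get(read[:k], []):
        if reference[pos:pos + len(read)] == read:
            hits.append(pos)
    return hits

# Made-up reference: unique sequence with the element GATTACA present twice.
reference = "ACGTTGCA" + "GATTACA" + "CCGGAATT" + "GATTACA" + "TTAGGC"
K = 4
index = build_kmer_index(reference, K)

unique_read = "TTGCAGAT"  # overlaps unique sequence, so it is easy to place
repeat_read = "GATTACA"   # lies entirely inside the repeated element

print("unique read placements:", map_read(unique_read, reference, index, K))
print("repeat read placements:", map_read(repeat_read, reference, index, K))
```

The first read has a single placement and the second has two. A real mapper also has to tolerate sequencing errors and true variants, which this sketch deliberately ignores.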

Page 15: Bioinformatics for next-generation DNA sequencing

SNP discovery

GigaBayes

Marth et al. Nature Genetics 1999

Quinlan et al. in prep.
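
To give a feel for probabilistic SNP calling, here is a generic Bayesian sketch. It is not GigaBayes and does not reproduce its model. It assumes a haploid sample, independent base errors derived from Phred quality values, and a hypothetical prior probability of a variant (`prior_alt`); all function names are illustrative.

```python
import math

def phred_to_error(quality: int) -> float:
    """Convert a Phred-scaled base quality to an error probability."""
    return 10 ** (-quality / 10)

def posterior_alt(observations, ref: str, alt: str, prior_alt: float = 0.001) -> float:
    """Posterior probability that a haploid sample carries `alt`, not `ref`.

    observations: (base, phred_quality) pairs covering one site.
    Assumes independent errors, with a miscalled base equally likely to be
    any of the three other nucleotides. `prior_alt` is an illustrative value.
    """
    log_post = {ref: math.log(1 - prior_alt), alt: math.log(prior_alt)}
    for base, quality in observations:
        # Cap the error at 0.75, the point where a base call is uninformative.
        err = min(phred_to_error(quality), 0.75)
        for allele in (ref, alt):
            p = 1 - err if base == allele else err / 3
            log_post[allele] += math.log(p)
    # Normalize in log space for numerical stability.
    top = max(log_post.values())
    w_ref = math.exp(log_post[ref] - top)
    w_alt = math.exp(log_post[alt] - top)
    return w_alt / (w_ref + w_alt)

# Five high-quality reads all showing T where the reference has A.
pileup_variant = [("T", 30)] * 5
# Four high-quality A calls plus one low-quality T call.
pileup_noise = [("A", 30)] * 4 + [("T", 10)]

print(posterior_alt(pileup_variant, ref="A", alt="T"))
print(posterior_alt(pileup_noise, ref="A", alt="T"))
```

Consistent high-quality evidence drives the posterior toward 1, while a single low-quality mismatch against several matching reads leaves it near 0.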

Page 16: Bioinformatics for next-generation DNA sequencing

Structural variation discovery

Navigation bar

Fragment lengths in selected region

Depth of coverage in selected region

Stewart et al. in prep.
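
Fragment lengths are informative because a read pair that spans a deletion in the sequenced individual maps farther apart on the reference than the library's typical fragment size. The sketch below is an assumption-laden illustration, not the method of Stewart et al.: it flags outlying fragment lengths with a median and median-absolute-deviation rule, and both the function name and the threshold `n_mads` are hypothetical.

```python
from statistics import median

def flag_anomalous_fragments(fragment_lengths, n_mads: float = 5.0):
    """Return indices of read pairs whose mapped fragment length is an outlier.

    Uses the median and the median absolute deviation (MAD), which are robust
    to the very outliers being sought. `n_mads` is an arbitrary illustrative
    threshold.
    """
    center = median(fragment_lengths)
    mad = median([abs(x - center) for x in fragment_lengths])
    if mad == 0:
        mad = 1.0  # avoid a zero-width band when all lengths are identical
    return [i for i, x in enumerate(fragment_lengths)
            if abs(x - center) > n_mads * mad]

# Made-up mapped fragment lengths (bp) for read pairs in one region.
lengths = [198, 202, 200, 205, 195, 201, 199, 1200, 203, 197]
print(flag_anomalous_fragments(lengths))
```

Here only the pair mapping 1,200 bp apart is flagged. In practice one would look for clusters of such pairs, together with a change in depth of coverage, before proposing a structural variant.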

Page 17: Bioinformatics for next-generation DNA sequencing

Assembly viewers

Huang and Marth, Genome Research 2008

Page 18: Bioinformatics for next-generation DNA sequencing

Data mining

Page 19: Bioinformatics for next-generation DNA sequencing

SNP calling in single-read 454 coverage

• collaborative project with Andy Clark (Cornell) and Elaine Mardis (Wash. U.)

• goal was to assess polymorphism rates between 10 different African and American Drosophila melanogaster isolates

• 10 runs of 454 reads (~300,000 reads per isolate) were collected

DNA courtesy of Chuck Langley, UC Davis

Page 20: Bioinformatics for next-generation DNA sequencing

Mutational profiling in deep 454 data

• collaboration with Doug Smith at Agencourt

• Pichia stipitis is a yeast that efficiently converts xylose to ethanol (bio-fuel production)

• one specific mutagenized strain had especially high conversion efficiency

• goal was to determine where the mutations were that caused this phenotype

• we analyzed 10 runs (~3 million reads) of 454 reads (~20x coverage of the 15 Mb genome)

Pichia stipitis reference sequence

• processed the sequences with our 454 pipeline

• found 39 mutations (in as many reads as yielded 650K SNPs in D. melanogaster)

• informatics analysis in < 24 hours (including manual checking of all candidates)

Image from JGI web site

Smith et al. Genome Research 2008
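
The coverage figure quoted for Pichia stipitis follows from simple arithmetic: fold coverage is the number of reads times the mean read length, divided by the genome size. The sketch below uses hypothetical function names. The slide does not state a mean read length for this data set, so the value of about 100 bp is inferred from the quoted numbers rather than taken from the source.

```python
def fold_coverage(n_reads: float, mean_read_length_bp: float, genome_size_bp: float) -> float:
    """Average number of times each genome position is covered by a read."""
    return n_reads * mean_read_length_bp / genome_size_bp

def implied_read_length(coverage: float, genome_size_bp: float, n_reads: float) -> float:
    """Mean read length implied by a stated coverage, genome size, and read count."""
    return coverage * genome_size_bp / n_reads

# Figures from the slide: ~3 million reads, ~20x coverage, 15 Mb genome.
mean_len = implied_read_length(coverage=20, genome_size_bp=15e6, n_reads=3e6)
print(f"implied mean read length: {mean_len:.0f} bp")
print(f"coverage check: {fold_coverage(3e6, mean_len, 15e6):.0f}x")
```

The implied read length of roughly 100 bp sits at the lower end of the 100-250 bp range quoted earlier for the 454 pyrosequencer, so the slide's numbers are mutually consistent.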

Page 21: Bioinformatics for next-generation DNA sequencing

SNP calling in short-read coverage

C. elegans reference genome (Bristol, N2 strain)

Pasadena, CB4858 (1 ½ machine runs)

Bristol, N2 strain (3 ½ machine runs)

• goal was to evaluate the Solexa/Illumina technology for the complete resequencing of large model-organism genomes

• 5 runs (~120 million Illumina reads) from the Wash. U. Genome Center, as part of a collaborative project led by Elaine Mardis

SNP

• we found 45,000 SNPs with a very high validation rate

Hillier et al. Nature Methods 2008

Page 22: Bioinformatics for next-generation DNA sequencing

Current focus

Page 23: Bioinformatics for next-generation DNA sequencing

1000 Genomes Project

• data quality assessment

• project design (# samples, depth of read coverage)

• read mapping

• SNP calling

• structural variation discovery

Page 24: Bioinformatics for next-generation DNA sequencing

SV discovery in autism

deletion

amplification

Page 25: Bioinformatics for next-generation DNA sequencing

Transcriptome sequencing

(from: Mortazavi et al. Nature Methods 2008)

Page 26: Bioinformatics for next-generation DNA sequencing

Lab

Page 27: Bioinformatics for next-generation DNA sequencing

The team

Derek Barnett

Eric Tsung

Aaron Quinlan

Damien Croteau-Chonka

Weichun Huang

Michael Stromberg

Chip Stewart

Michele Busby

Page 28: Bioinformatics for next-generation DNA sequencing

Resources

• computer cluster

• 128 GB RAM server

• 20 TB disk space

• 2 large R01 grants from the NIH

• a BC RIG grant

Page 29: Bioinformatics for next-generation DNA sequencing

Collaborations

Baylor HGSC

Wash. U. GSC

Genome Canada

UBC GSC

Cornell

UC Davis

UCSF

NCBI @ NIH

NCI @ NIH

Marshfield Clinic

UCLA

Pfizer

Page 30: Bioinformatics for next-generation DNA sequencing

Graduate student rotations

• Looking for new graduate students

• Spots are available for all three rotations

• Lots of projects

• Caveat: you need to be able to program…

• Check us out at: http://bioinformatics.bc.edu/marthlab/

• If you are interested, please talk to me