Bioinformatics for high-throughput DNA sequencing. Gabor Marth, Boston College Biology. New grad student orientation, Boston College, September 8, 2009


Page 1: Bioinformatics for high-throughput DNA sequencing

Bioinformatics for high-throughput DNA sequencing

Gabor Marth
Boston College Biology

New grad student orientation
Boston College
September 8, 2009

Page 2: Bioinformatics for high-throughput DNA sequencing

DNA sequence variations

The Human Genome Project has determined a reference sequence of the human genome

However, every individual is unique, and is different from others at millions of nucleotide locations

Page 3: Bioinformatics for high-throughput DNA sequencing

Why do we care about variations?

underlie phenotypic differences

cause inherited diseases

allow tracking ancestral human history

Page 4: Bioinformatics for high-throughput DNA sequencing


Human genetic variation

Page 5: Bioinformatics for high-throughput DNA sequencing

The first “famous” genomes

Page 6: Bioinformatics for high-throughput DNA sequencing

Genome sequencing

~1 Mb ~100 Mb >100 Mb ~3,000 Mb

Page 7: Bioinformatics for high-throughput DNA sequencing

New sequencing technologies…

Page 8: Bioinformatics for high-throughput DNA sequencing

Next-gen sequencing – a revolution

[Figure: bases per machine run (axis from 1 Mb to 100 Gb) versus read length (axis from 10 bp to 1,000 bp)]

ABI capillary sequencer

Roche/454 pyrosequencer (100-400 Mb in 200-450 bp reads)

Illumina/Solexa, AB/SOLiD sequencers (10-30 Gb in 25-100 bp reads)

Page 9: Bioinformatics for high-throughput DNA sequencing

The re-sequencing informatics pipeline

(i) base calling

(ii) read mapping

(iii) SNP and short INDEL calling

(iv) SV calling

(v) data viewing, hypothesis generation

[Figure labels: REF, IND, GigaBayes]

Page 10: Bioinformatics for high-throughput DNA sequencing

Tools

Page 11: Bioinformatics for high-throughput DNA sequencing

2. Read mapping

Read mapping is like a jigsaw puzzle…

…you get the pieces…

…and they give you the picture on the box

Big and unique pieces are easier to place than others…

Page 12: Bioinformatics for high-throughput DNA sequencing

The MOSAIK read mapping program

• Reads from repeats cannot be uniquely mapped back to their true region of origin

Michael Strömberg (Wan-Ping Lee)
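The repeat problem above can be illustrated with a toy exact-match mapper. This is only a sketch under stated assumptions: it is not how MOSAIK works, and the reference string, the seed length `K`, and the function names `build_index`, `map_read`, and `classify` are all hypothetical choices made for illustration.

```python
from collections import defaultdict

K = 4  # seed length (hypothetical choice for this toy example)

# Toy reference: the 9-base segment GATTACAGA occurs twice (a "repeat").
REFERENCE = "CCTAGG" + "GATTACAGA" + "TTCGCA" + "GATTACAGA" + "AGCTGT"


def build_index(reference, k):
    """Map each k-mer to the list of reference positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read, reference, index, k):
    """Return every position where the whole read matches the reference exactly."""
    if len(read) < k:
        return []
    # Look up the first k bases as a seed, then verify the full read at each hit.
    candidates = index.get(read[:k], [])
    return [p for p in candidates if reference[p:p + len(read)] == read]


def classify(hits):
    """Label a read by how many exact placements it has."""
    if not hits:
        return "unmapped"
    return "unique" if len(hits) == 1 else "repeat (ambiguous)"


INDEX = build_index(REFERENCE, K)

for read in ("AGATTCGC", "GATTACAGA", "AAAAAAAA"):
    hits = map_read(read, REFERENCE, INDEX, K)
    print(read, hits, classify(hits))
```

The read that lies entirely inside the repeated segment has two equally good placements, so its true origin cannot be recovered from the sequence alone. The read that extends into the unique flanking sequence has only one, which is the "big and unique pieces" point of the jigsaw analogy.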

Page 13: Bioinformatics for high-throughput DNA sequencing

SNP discovery

GigaBayes

Marth et al. Nature Genetics 1999
Quinlan et al. in prep.
(Amit Indap, Wen Fung Leong)
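As a rough illustration of what Bayesian SNP calling means, the sketch below computes genotype posteriors at a single site of a diploid sample from counts of reference and alternate bases. It is a toy model and not the GigaBayes algorithm: the binomial read model, the error rate, the prior values, and the function name `genotype_posteriors` are all assumptions made for this example.

```python
from math import comb


def genotype_posteriors(n_ref, n_alt, error_rate=0.01,
                        prior_het=1e-3, prior_hom_alt=5e-4):
    """Posterior probability of genotypes RR, RA, AA given base counts.

    Assumed model: each read independently shows the alternate base with
    probability error_rate (RR), 0.5 (RA), or 1 - error_rate (AA).
    The priors are illustrative values, not estimates from real data.
    """
    p_alt = {"RR": error_rate, "RA": 0.5, "AA": 1.0 - error_rate}
    priors = {
        "RR": 1.0 - prior_het - prior_hom_alt,
        "RA": prior_het,
        "AA": prior_hom_alt,
    }
    n = n_ref + n_alt
    # Bayes' rule: prior times binomial likelihood, then normalize.
    unnormalized = {
        g: priors[g] * comb(n, n_alt) * p ** n_alt * (1.0 - p) ** n_ref
        for g, p in p_alt.items()
    }
    total = sum(unnormalized.values())
    return {g: v / total for g, v in unnormalized.items()}


def call_genotype(n_ref, n_alt):
    """Return the genotype with the highest posterior probability."""
    post = genotype_posteriors(n_ref, n_alt)
    return max(post, key=post.get)


for counts in ((20, 0), (19, 1), (10, 10), (0, 20)):
    print(counts, call_genotype(*counts))
```

Under these assumptions a single mismatching read out of 20 is explained as a sequencing error, because the prior for a variant is small, while an even split of reference and alternate bases is called heterozygous.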

Page 14: Bioinformatics for high-throughput DNA sequencing

Structural variation discovery

Navigation bar

Fragment lengths in selected region

Depth of coverage in selected region

Stewart et al. in prep.
(Deniz Kural, Jiantao Wu)

Page 15: Bioinformatics for high-throughput DNA sequencing

Sequence alignment viewers

Huang et al. Genome Research 2008
(Derek Barnett)

Page 16: Bioinformatics for high-throughput DNA sequencing

Data mining

Page 17: Bioinformatics for high-throughput DNA sequencing

Mutational profiling in deep 454 data

• Pichia stipitis is a yeast that efficiently converts xylose to ethanol (bio-fuel production)

• one specific mutagenized strain had especially high conversion efficiency

• goal was to determine where the mutations were that caused this phenotype

• we analyzed 10 runs (~3 million reads) of 454 data (~20x coverage of the 15 Mb genome)

Pichia stipitis reference sequence

• found 39 mutations

• informatics analysis in < 24 hours (including manual checking of all candidates)

Image from JGI web site

Smith et al. Genome Research 2008
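The coverage figure quoted for the Pichia stipitis data follows from the usual relation coverage = (number of reads × mean read length) / genome size. The sketch below applies it to the numbers on this slide. The mean read length is not stated in the slides, so the code derives the value implied by ~3 million reads, ~20x coverage, and a 15 Mb genome; the function names are hypothetical.

```python
def mean_coverage(n_reads, mean_read_length, genome_size):
    """Average number of reads covering each base of the genome."""
    return n_reads * mean_read_length / genome_size


def implied_read_length(coverage, n_reads, genome_size):
    """Mean read length needed to reach a given coverage."""
    return coverage * genome_size / n_reads


N_READS = 3_000_000       # ~3 million 454 reads (from the slide)
GENOME_SIZE = 15_000_000  # 15 Mb genome (from the slide)
COVERAGE = 20             # ~20x (from the slide)

read_length = implied_read_length(COVERAGE, N_READS, GENOME_SIZE)
print(f"implied mean read length: {read_length:.0f} bp")
print(f"coverage check: {mean_coverage(N_READS, read_length, GENOME_SIZE):.0f}x")
```

With these rounded inputs the implied mean read length is about 100 bp. Since the slide's figures are approximate, this is an order-of-magnitude check rather than a measured value.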

Page 18: Bioinformatics for high-throughput DNA sequencing

SNP calling in short-read coverage

C. elegans reference genome (Bristol, N2 strain)

Pasadena, CB4858 (1 ½ machine runs)

Bristol, N2 strain (3 ½ machine runs)

• goal was to evaluate the Solexa/Illumina technology for the complete resequencing of large model-organism genomes

• 5 runs (~120 million Illumina reads) were collected by Washington Univ.

SNP

• we found 45,000 SNPs with a very high validation rate

Hillier et al. Nature Methods 2008

Page 19: Bioinformatics for high-throughput DNA sequencing

Current focus

Page 20: Bioinformatics for high-throughput DNA sequencing

1000 Genomes Project

• data quality assessment

• project design (# samples, depth of read coverage)

• read mapping

• SNP calling

• structural variation discovery

Page 21: Bioinformatics for high-throughput DNA sequencing

SV discovery in autism

deletion

amplification

Page 22: Bioinformatics for high-throughput DNA sequencing

Lab

Page 23: Bioinformatics for high-throughput DNA sequencing

People

Page 24: Bioinformatics for high-throughput DNA sequencing

Resources

• computer cluster (72 servers)

• 128 GB RAM server

• ~200 TB disk space

• 2 R01 grants (NHGRI/NIH)

• 1 R21 grant (NIAID/NIH)

• a BC RIG grant

• 2 RC2 grants (NHGRI/NIH) starting September 2009

Page 25: Bioinformatics for high-throughput DNA sequencing

Collaborations

Baylor HGSC

Wash. U. GSC

Genome Canada

UBC GSC

Cornell

UC Davis

UCSF

NCBI @ NIH

NCI @ NIH

Marshfield Clinic

UCLA

Pfizer

Page 26: Bioinformatics for high-throughput DNA sequencing

Graduate student rotations

• Looking for new graduate students

• Spots are available for all three rotations

• Lots of projects

• Caveat: you need to be able to program…

• Check us out at: http://bioinformatics.bc.edu/marthlab/

• If you are interested, please talk to me