60
Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing Technologies for Antibody Clone Screening April 6, 2009

Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

  • View
    233

  • Download
    2

Embed Size (px)

Citation preview

Page 1: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis

Gabor MarthBoston College Biology

Next Generation Sequencing Technologies for Antibody Clone ScreeningApril 6, 2009

Page 2: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

New sequencing technologies…

Page 3: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

… offer vast throughput

read length

base

s per

mach

ine r

un

10 bp 1,000 bp100 bp

1 Gb

100 Mb

10 Mb

10 Gb

Illumina/Solexa, AB/SOLiD sequencers

ABI capillary sequencer

Roche/454 pyrosequencer

(100-400 Mb in 200-450 bp reads)

(10-30Gb in 25-100 bp reads)

1 Mb

100 Gb

Page 4: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

Roche / 454

• pyrosequencing technology• variable read-length• the only new technology with >100bp reads

Page 5: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

Illumina / Solexa

• fixed-length short-read sequencer• very high throughput• read properties are very close to traditional capillary sequences

Page 6: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

AB / SOLiD

A C G T

A

C

G

T

2nd Base

1st B

ase

0

0

0

0

1

1

1

1

2

2

2

2

3

3

3

3

• fixed-length short-reads• very high throughput• 2-base encoding system• color-space informatics

Page 7: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

Helicos / Heliscope

• short-read sequencer• single molecule sequencing• no amplification• variable read-length

Page 8: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

Many applications

• organismal resequencing & de novo sequencing

Ruby et al. Cell, 2006

Jones-Rhoades et al. PLoS Genetics, 2007

• transcriptome sequencing for transcript discovery and expression profiling

Meissner et al. Nature 2008

• epigenetic analysis (e.g. DNA methylation)

Page 9: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

Data characteristics

Page 10: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

Read length

read length [bp]

0 100 200 300

~200-450 (variable)

25-100(fixed)

25-50 (fixed)

25-60 (variable)

400

Page 11: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

Error characteristics (Illumina)

Insertions1.43%

Deletions3.23%

Substitutions95.34%

Page 12: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

Error characteristics (454)

Page 13: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

Coverage bias

~2X read genome read coverage

~20X read genome read coverage

Page 14: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

Genome re-sequencing

Page 15: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

Complete human genomes

Page 16: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

The re-sequencing informatics pipeline

REF

(ii) read mapping

IND

(i) base calling

IND(iii) SNP and short INDEL calling

(v) data viewing, hypothesis generation

(iv) SV callingGigaBayesGigaBayes

Page 17: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

Read mapping

Page 18: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

… is like a jigsaw puzzle

… and they give you the picture on the box

2. Read mapping

…you get the pieces…

Big and Unique pieces are easier to place than others…

Page 19: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

Challenge: non-uniqueness

• Reads from repeats cannot be uniquely mapped back to their true region of origin

• RepeatMasker does not capture all micro-repeats, i.e. repeats at the scale of the read length

Page 20: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

Non-unique mapping

Page 21: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

SE short-read alignments are error-prone

0.35%

Page 22: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

Paired-end (PE) reads

fragment length: 100 – 600bp

Korbel et al. Science 2007

fragment length: 1 – 10kb

Page 23: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

PE alignment statistics (simulated data)

0.00%7.6%

0.09%

0.35%

0.03%

Page 24: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

The MOSAIK read mapper/aligner

Michael Strömberg

Page 25: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

Gapped alignments

Page 26: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

Aligning multiple read types together

ABI/capillary

454 FLX

454 GS20

Illumina

Page 27: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

SNP / short-INDEL discovery

Page 28: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

Polymorphism detection

sequencing error polymorphism

Page 29: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

Allele calling in multi-individual data

P(G1=aa|B1=aacc; Bi=aaaac; Bn= cccc)P(G1=cc|B1=aacc; Bi=aaaac; Bn= cccc)P(G1=ac|B1=aacc; Bi=aaaac; Bn= cccc)

P(Gi=aa|B1=aacc; Bi=aaaac; Bn= cccc)P(Gi=cc|B1=aacc; Bi=aaaac; Bn= cccc)P(Gi=ac|B1=aacc; Bi=aaaac; Bn= cccc)

P(Gn=aa|B1=aacc; Bi=aaaac; Bn= cccc)P(Gn=cc|B1=aacc; Bi=aaaac; Bn= cccc)P(Gn=ac|B1=aacc; Bi=aaaac; Bn= cccc)

P(SNP)

“genotype probabilities”

P(B1=aacc|G1=aa)P(B1=aacc|G1=cc)P(B1=aacc|G1=ac)

P(Bi=aaaac|Gi=aa)P(Bi=aaaac|Gi=cc)P(Bi=aaaac|Gi=ac)

P(Bn=cccc|Gn=aa)P(Bn=cccc|Gn=cc)P(Bn=cccc|Gn=ac)

“genotype likelihoods”

Pri

or(

G1,.

.,G

i,..,

Gn)

-----a----------a----------c----------c-----

-----a----------a----------a----------a----------c-----

-----c----------c----------c----------c-----

Page 30: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

SNP calling in deep sample sets

Population Samples Reads Allele detection

Page 31: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

Capturing the allele in the samples

0.00

01

0.00

02

0.00

050.

001

0.00

20.

005

0.01

0.02

0.05 0.

10.

20.

50

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

n=100

n=200

n=400

n=800

n=1600

Population AF

Pro

b(al

lele

cap

ture

d in

sam

ple)

Page 32: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

The ability to call rare alleles

reads Q30 Q40 Q50 Q60

1 0.01 0.01 0.1 0.5

2 0.82 1.0 1.0 1.0

3 1.0 1.0 1.0 1.0

aatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaCgtacctacaatgtagtaCgtacctac

aatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaAgtacctac

aatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaAgtacctac

GigaBayesGigaBayes

Page 33: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

Allele calling in 400 samples

Page 34: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

Detecting de novo mutations

2

2

2 22 2

2

2

2

2 2

2

11 12 22

1 111: 1 1

2 2 11: 111: 11 1

11 12 : 2 1 12 : 2 1 1 12 : 12 2

22 : 22 : 11 122 : 1

2 2

1 1 111: 1 1 11:

2 2 4Pr | , 1 1

12 12 : 2 1 12 2

1 122 : 1

2 2

M M M

F

C M F

F

G G G

G

G G GG

2 2 2

2 22 2

2 22

2

2 22 2

1 1 1 11 1 11: 1

2 4 2 21 1 1 1 1

12 : 2 1 1 2 1 12 : 1 2 14 2 4 2 2

1 1 1 1 122 : 1 1 22 : 1 1

4 2 4 2 2

1 111: 1

2 211: 11 1

22 12 : 1 12 : 12

22 : 1FG

2

2

2

11:

2 1 12 : 2 12

22 : 11 122 : 1 1

2 2

• the child inherits one chromosome from each parent• there is a small probability for a de novo (germ-line or somatic) mutation in the child

Page 35: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

Capture sequencing

Page 36: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

Targeted mammalian re-sequencing

• Deep sequencing of complete human genomes is still too expensive

• There is a need to sequence target regions, typically genes, to follow up on GWAS studies

• Targeted re-sequencing with DNA fragment capture offers apotentially cost-effective alternative

• Solid phase or liquid phase capture• 454 or Illumina sequencing

• Informatics pipeline must accountfor the peculiarities of capture data

Page 37: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

On/off target capture

ref allele*:45%

non-ref allele*: 54%

Target region

SNP(outside target region)

Page 38: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

Reference allele bias

(*) measured at 450 het HapMap 3 sites overlapping capture target regions in sample NA07346

ref allele*:54%

non-ref allele*: 45%

Page 39: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

SNP example

Amit Indap

Page 40: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

Structural Variation discovery

Page 41: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

Structural variations

Page 42: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

SV/CNV detection – SNP chips

• Tiling arrays and SNP-chips made whole-genome CNV scans possible

• Probe density and placement limits resolution

• Balanced events cannot be detected

Page 43: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

SV/CNV detection – resolution

Expected CNVsKaryotype

Micro-arraySequencing

Rela

tive n

um

bers

of

even

ts

CNV event length [bp]

Page 44: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

44

Read depth

Page 45: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

Chromosome 2 Position [Mb]

CNV events found using RD

Page 46: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

PE read mapping positions

Deletion

DNA reference

LM ~ LF+Ldel & depth: low

pattern

LMLF

Ldel

Tandemduplication

LM ~ LF-Ldup & depth: highLdup

Inversion LM ~ +Linv & ends flipped LM ~ -Linv depth: normalLinv

Translocation

LM ~ LF+LT1 LM ~ LF+LT2 & depth: normal LM ~ LF-LT1-LT2

LT2 LT1

LM LM

LM

InsertionLins

un-paired read clusters & depth normal

Chromosomaltranslocation

LT

LM ~LF+LT & depth: normal& cross-paired read clusters

Page 47: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

47

The SV/CNV “event display”

Chip Stewart

Page 48: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

Spanner – specificity

Page 49: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

Data standards

Page 50: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

Data types with standard formats

SRF/FASTQ

SAM/BAM

GLF

Page 51: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

Transcriptome sequencing

Page 52: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

Data highly reproducible

Michele Busby

Page 53: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

Comparative data

Michele Busby

Page 54: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

Biological questions

Michele Busby

Page 55: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

Our software tools for next-gen data

http://bioinformatics.bc.edu/marthlab/Software_Release

Page 56: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

CreditsElaine Mardis

Andy Clark

Aravinda Chakravarti

Doug Smith

Michael Egholm

Scott Kahn

Francisco de la Vega

Patrice MilosJohn Thompson

Page 57: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

Lab

Several postdoc positions are available!

Page 58: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

Mutational profiling

Page 59: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

Chemical mutagenesis

Page 60: Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

Mutational profiling: deep 454/Illumina/SOLiD data

• Pichia stipitis converts xylose to ethanol (bio-fuel production)• one mutagenized strain had high conversion efficiency• determine which mutations caused this phenotype• 15MB genome: 454, Illumina, and SOLiD reads• 14 true point mutations in the entire genome

Pichia stipitis reference sequence

Image from JGI web site

10-15X genome coverage required