Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing

Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis

Gabor MarthBoston College Biology

Next Generation Sequencing Technologies for Antibody Clone ScreeningApril 6, 2009

New sequencing technologies…

… offer vast throughput

read length

base

s per

mach

ine r

un

10 bp 1,000 bp100 bp

1 Gb

100 Mb

10 Mb

10 Gb

Illumina/Solexa, AB/SOLiD sequencers

ABI capillary sequencer

Roche/454 pyrosequencer

(100-400 Mb in 200-450 bp reads)

(10-30Gb in 25-100 bp reads)

1 Mb

100 Gb

Roche / 454

• pyrosequencing technology• variable read-length• the only new technology with >100bp reads

Illumina / Solexa

• fixed-length short-read sequencer• very high throughput• read properties are very close to traditional capillary sequences

AB / SOLiD

A C G T

A

C

G

T

2nd Base

1st B

ase

0

0

0

0

1

1

1

1

2

2

2

2

3

3

3

3

• fixed-length short-reads• very high throughput• 2-base encoding system• color-space informatics

Helicos / Heliscope

• short-read sequencer• single molecule sequencing• no amplification• variable read-length

Many applications

• organismal resequencing & de novo sequencing

Ruby et al. Cell, 2006

Jones-Rhoades et al. PLoS Genetics, 2007

• transcriptome sequencing for transcript discovery and expression profiling

Meissner et al. Nature 2008

• epigenetic analysis (e.g. DNA methylation)

Data characteristics

Read length

read length [bp]

0 100 200 300

~200-450 (variable)

25-100(fixed)

25-50 (fixed)

25-60 (variable)

400

Error characteristics (Illumina)

Insertions1.43%

Deletions3.23%

Substitutions95.34%

Error characteristics (454)

Coverage bias

~2X read genome read coverage

~20X read genome read coverage

Genome re-sequencing

Complete human genomes

The re-sequencing informatics pipeline

REF

(ii) read mapping

IND

(i) base calling

IND(iii) SNP and short INDEL calling

(v) data viewing, hypothesis generation

(iv) SV callingGigaBayesGigaBayes

Read mapping

… is like a jigsaw puzzle

… and they give you the picture on the box

2. Read mapping

…you get the pieces…

Big and Unique pieces are easier to place than others…

Challenge: non-uniqueness

• Reads from repeats cannot be uniquely mapped back to their true region of origin

• RepeatMasker does not capture all micro-repeats, i.e. repeats at the scale of the read length

Non-unique mapping

SE short-read alignments are error-prone

0.35%

Paired-end (PE) reads

fragment length: 100 – 600bp

Korbel et al. Science 2007

fragment length: 1 – 10kb

PE alignment statistics (simulated data)

0.00%7.6%

0.09%

0.35%

0.03%

The MOSAIK read mapper/aligner

Michael Strömberg

Gapped alignments

Aligning multiple read types together

ABI/capillary

454 FLX

454 GS20

Illumina

SNP / short-INDEL discovery

Polymorphism detection

sequencing error polymorphism

Allele calling in multi-individual data

P(G1=aa|B1=aacc; Bi=aaaac; Bn= cccc)P(G1=cc|B1=aacc; Bi=aaaac; Bn= cccc)P(G1=ac|B1=aacc; Bi=aaaac; Bn= cccc)

P(Gi=aa|B1=aacc; Bi=aaaac; Bn= cccc)P(Gi=cc|B1=aacc; Bi=aaaac; Bn= cccc)P(Gi=ac|B1=aacc; Bi=aaaac; Bn= cccc)

P(Gn=aa|B1=aacc; Bi=aaaac; Bn= cccc)P(Gn=cc|B1=aacc; Bi=aaaac; Bn= cccc)P(Gn=ac|B1=aacc; Bi=aaaac; Bn= cccc)

P(SNP)

“genotype probabilities”

P(B1=aacc|G1=aa)P(B1=aacc|G1=cc)P(B1=aacc|G1=ac)

P(Bi=aaaac|Gi=aa)P(Bi=aaaac|Gi=cc)P(Bi=aaaac|Gi=ac)

P(Bn=cccc|Gn=aa)P(Bn=cccc|Gn=cc)P(Bn=cccc|Gn=ac)

“genotype likelihoods”

Pri

or(

G1,.

.,G

i,..,

Gn)

-----a----------a----------c----------c-----

-----a----------a----------a----------a----------c-----

-----c----------c----------c----------c-----

SNP calling in deep sample sets

Population Samples Reads Allele detection

Capturing the allele in the samples

0.00

01

0.00

02

0.00

050.

001

0.00

20.

005

0.01

0.02

0.05 0.

10.

20.

50

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

n=100

n=200

n=400

n=800

n=1600

Population AF

Pro

b(al

lele

cap

ture

d in

sam

ple)

The ability to call rare alleles

reads Q30 Q40 Q50 Q60

1 0.01 0.01 0.1 0.5

2 0.82 1.0 1.0 1.0

3 1.0 1.0 1.0 1.0

aatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaCgtacctacaatgtagtaCgtacctac

aatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaAgtacctac

aatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaAgtacctac

GigaBayesGigaBayes

Allele calling in 400 samples

Detecting de novo mutations

2

2

2 22 2

2

2

2

2 2

2

11 12 22

1 111: 1 1

2 2 11: 111: 11 1

11 12 : 2 1 12 : 2 1 1 12 : 12 2

22 : 22 : 11 122 : 1

2 2

1 1 111: 1 1 11:

2 2 4Pr | , 1 1

12 12 : 2 1 12 2

1 122 : 1

2 2

M M M

F

C M F

F

G G G

G

G G GG

2 2 2

2 22 2

2 22

2

2 22 2

1 1 1 11 1 11: 1

2 4 2 21 1 1 1 1

12 : 2 1 1 2 1 12 : 1 2 14 2 4 2 2

1 1 1 1 122 : 1 1 22 : 1 1

4 2 4 2 2

1 111: 1

2 211: 11 1

22 12 : 1 12 : 12

22 : 1FG

2

2

2

11:

2 1 12 : 2 12

22 : 11 122 : 1 1

2 2

• the child inherits one chromosome from each parent• there is a small probability for a de novo (germ-line or somatic) mutation in the child

Capture sequencing

Targeted mammalian re-sequencing

• Deep sequencing of complete human genomes is still too expensive

• There is a need to sequence target regions, typically genes, to follow up on GWAS studies

• Targeted re-sequencing with DNA fragment capture offers apotentially cost-effective alternative

• Solid phase or liquid phase capture• 454 or Illumina sequencing

• Informatics pipeline must accountfor the peculiarities of capture data

On/off target capture

ref allele*:45%

non-ref allele*: 54%

Target region

SNP(outside target region)

Reference allele bias

(*) measured at 450 het HapMap 3 sites overlapping capture target regions in sample NA07346

ref allele*:54%

non-ref allele*: 45%

SNP example

Amit Indap

Structural Variation discovery

Structural variations

SV/CNV detection – SNP chips

• Tiling arrays and SNP-chips made whole-genome CNV scans possible

• Probe density and placement limits resolution

• Balanced events cannot be detected

SV/CNV detection – resolution

Expected CNVsKaryotype

Micro-arraySequencing

Rela

tive n

um

bers

of

even

ts

CNV event length [bp]

44

Read depth

Chromosome 2 Position [Mb]

CNV events found using RD

PE read mapping positions

Deletion

DNA reference

LM ~ LF+Ldel & depth: low

pattern

LMLF

Ldel

Tandemduplication

LM ~ LF-Ldup & depth: highLdup

Inversion LM ~ +Linv & ends flipped LM ~ -Linv depth: normalLinv

Translocation

LM ~ LF+LT1 LM ~ LF+LT2 & depth: normal LM ~ LF-LT1-LT2

LT2 LT1

LM LM

LM

InsertionLins

un-paired read clusters & depth normal

Chromosomaltranslocation

LT

LM ~LF+LT & depth: normal& cross-paired read clusters

47

The SV/CNV “event display”

Chip Stewart

Spanner – specificity

Data standards

Data types with standard formats

SRF/FASTQ

SAM/BAM

GLF

Transcriptome sequencing

Data highly reproducible

Michele Busby

Comparative data

Michele Busby

Biological questions

Michele Busby

Our software tools for next-gen data

http://bioinformatics.bc.edu/marthlab/Software_Release

CreditsElaine Mardis

Andy Clark

Aravinda Chakravarti

Doug Smith

Michael Egholm

Scott Kahn

Francisco de la Vega

Patrice MilosJohn Thompson

Lab

Several postdoc positions are available!

Mutational profiling

Chemical mutagenesis

Mutational profiling: deep 454/Illumina/SOLiD data

• Pichia stipitis converts xylose to ethanol (bio-fuel production)• one mutagenized strain had high conversion efficiency• determine which mutations caused this phenotype• 15MB genome: 454, Illumina, and SOLiD reads• 14 true point mutations in the entire genome

Pichia stipitis reference sequence

Image from JGI web site

10-15X genome coverage required

Documents

Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing