Informatics challenges and computer tools for sequencing 1000s of human genomes Gabor T. Marth Boston College Biology Department Cold Spring Harbor Laboratory Personal Genomes meeting October 9-12, 2008


Page 1: Informatics challenges and computer tools for sequencing 1000s of human genomes

Informatics challenges and computer tools for sequencing 1000s of human genomes

Gabor T. Marth
Boston College Biology Department

Cold Spring Harbor Laboratory
Personal Genomes meeting
October 9-12, 2008

Page 2: Informatics challenges and computer tools for sequencing 1000s of human genomes

Large-scale individual human resequencing

Page 3: Informatics challenges and computer tools for sequencing 1000s of human genomes

Next-gen sequencers offer vast throughput…

[Figure: bases per machine run (1 Mb to 10 Gb) plotted against read length (10 bp to 1,000 bp)]

• ABI capillary sequencer

• 454 pyrosequencer (100-400 Mb in 200-450 bp reads)

• Illumina, AB/SOLiD short-read sequencers (5-15 Gb in 25-70 bp reads)

Page 4: Informatics challenges and computer tools for sequencing 1000s of human genomes

The resequencing informatics pipeline

(i) base calling

(ii) read mapping

(iii) read assembly

(iv) SNP and short INDEL calling

(v) SV calling

(vi) data validation, hypothesis generation

[Figure labels: REF, IND]

Page 5: Informatics challenges and computer tools for sequencing 1000s of human genomes

The variation discovery “toolbox”

• base callers

• read mappers

• SNP callers

• SV callers

• assembly viewers

GigaBayes

Page 6: Informatics challenges and computer tools for sequencing 1000s of human genomes

1. Base calling

base sequence

base quality (Q-value) sequence

• early manufacturer-supplied base callers were imperfect

• third party software made substantial improvements

• machine manufacturers are now focusing more on base calling
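The Q-values mentioned on this slide can be related to error probabilities with a small sketch. It assumes the Q-values follow the standard Phred convention, Q = -10 log10(p_error); the slide does not state which scale each base caller uses, and the function names are illustrative only.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred-scaled base quality into an error probability."""
    return 10.0 ** (-q / 10.0)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability (0 < p <= 1) into a Phred-scaled quality."""
    if not 0.0 < p <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(p)
```

Under this convention a Q20 base call is wrong about once in 100 calls and a Q30 call about once in 1,000.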

Page 7: Informatics challenges and computer tools for sequencing 1000s of human genomes

2. Read mapping

Read mapping is like doing a jigsaw puzzle…

…you get the pieces…

… and they give you the picture on the box

Larger, more unique pieces are easier to place than others…

Page 8: Informatics challenges and computer tools for sequencing 1000s of human genomes

Next-gen reads are generally short

[Figure: read length ranges per platform on a 0-400 bp axis]

• ~200-450 bp (variable)

• 25-70 bp (fixed)

• 25-50 bp (fixed)

• 20-60 bp (variable)

Page 9: Informatics challenges and computer tools for sequencing 1000s of human genomes

Base error rates are low

[Figure panels: Illumina, 454]

Page 10: Informatics challenges and computer tools for sequencing 1000s of human genomes

Strategies to deal with non-unique mapping

Page 11: Informatics challenges and computer tools for sequencing 1000s of human genomes

Mapping probabilities (qualities)

[Figure: one read with three candidate placements, with mapping probabilities 0.8, 0.19 and 0.01]
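One way to arrive at per-location mapping probabilities like the 0.8 / 0.19 / 0.01 shown on this slide is to normalise the alignment likelihoods of all candidate placements. The sketch below assumes a uniform prior over candidate locations and reports a Phred-scaled mapping quality for the best placement; the slide does not say how the mapper actually computes these values, and the function names and the cap of 60 are illustrative assumptions.

```python
import math


def mapping_probabilities(likelihoods):
    """Normalise per-location alignment likelihoods into probabilities.

    Assumes every candidate location is equally likely a priori.
    """
    total = sum(likelihoods)
    if total <= 0:
        raise ValueError("at least one likelihood must be positive")
    return [x / total for x in likelihoods]


def mapping_quality(p_best, cap=60.0):
    """Phred-scale the probability that the best placement is wrong."""
    p_wrong = 1.0 - p_best
    if p_wrong <= 0.0:
        return cap
    return min(cap, -10.0 * math.log10(p_wrong))


# Hypothetical relative likelihoods for three candidate placements.
probs = mapping_probabilities([80.0, 19.0, 1.0])
best_quality = mapping_quality(max(probs))
```

A read whose best placement has probability 0.8 gets a low mapping quality (about 7 on this scale), which is a signal to down-weight it in later SNP calling.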

Page 12: Informatics challenges and computer tools for sequencing 1000s of human genomes

Error types are very different

[Figure panels: Illumina, 454]

Page 13: Informatics challenges and computer tools for sequencing 1000s of human genomes

Gapped alignments

Page 14: Informatics challenges and computer tools for sequencing 1000s of human genomes

MOSAIK

• fast

• accurate

• gapped

• versatile (short + long reads)

Page 15: Informatics challenges and computer tools for sequencing 1000s of human genomes

3. SNP and short-INDEL calling

• deep alignments of 100s / 1000s of individuals

• trio sequences

Page 16: Informatics challenges and computer tools for sequencing 1000s of human genomes

Allele discovery is a multi-step sampling process

Population → Samples → Reads

Page 17: Informatics challenges and computer tools for sequencing 1000s of human genomes

Capturing the allele in the sample

[Figure: Prob(allele captured in sample), 0 to 1, plotted against population allele frequency (1E-04 to 0.5), with one curve per sample size: n=100, n=200, n=400, n=800, n=1600]
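The capture curves can be approximated with a simple binomial argument: an allele at population frequency f is missed only if none of the sampled chromosomes carries it. The sketch below assumes n counts diploid individuals (so 2n chromosomes are sampled) and that chromosomes are drawn independently; the slide does not say whether n refers to individuals or chromosomes, and the function name is illustrative.

```python
def prob_allele_captured(allele_freq, n_individuals, ploidy=2):
    """Probability that at least one sampled chromosome carries the allele.

    Assumes independent sampling of ploidy * n_individuals chromosomes
    from a population in which the allele has frequency allele_freq.
    """
    if not 0.0 <= allele_freq <= 1.0:
        raise ValueError("allele frequency must be in [0, 1]")
    n_chromosomes = ploidy * n_individuals
    return 1.0 - (1.0 - allele_freq) ** n_chromosomes


# Capture probability across the sample sizes shown in the figure legend.
capture_table = {
    n: [prob_allele_captured(f, n) for f in (1e-4, 1e-3, 1e-2, 1e-1)]
    for n in (100, 200, 400, 800, 1600)
}
```

Under these assumptions rare alleles (frequency near 1E-04) are unlikely to be present in 100 samples at all, whatever the read depth, which is why sample size matters as much as coverage.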

Page 18: Informatics challenges and computer tools for sequencing 1000s of human genomes

Allele calling in the reads

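\[
\Pr\left(G_1^{k}, G_2^{k}, \ldots, G_n^{k} \mid B\right) =
\frac{\left[\prod_{i=1}^{n} \Pr(B_i \mid T_i)\,\Pr(T_i \mid G_i^{k})\right] \Pr\left(G_1^{k}, G_2^{k}, \ldots, G_n^{k}\right)}
{\sum_{l} \left[\prod_{i=1}^{n} \Pr(B_i \mid T_i)\,\Pr(T_i \mid G_i^{l})\right] \Pr\left(G_1^{l}, G_2^{l}, \ldots, G_n^{l}\right)}
\]

Slide annotations: base quality, allele call in read, number of individuals

GigaBayes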
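A heavily simplified, single-individual version of this kind of Bayesian allele call can be sketched as follows. This is not the GigaBayes implementation: it assumes one diploid individual, a biallelic site, a uniform genotype prior by default, Phred-scaled base qualities, and that a miscalled base is equally likely to be any of the three other nucleotides. All names are illustrative.

```python
def genotype_posteriors(bases, quals, ref, alt, priors=None):
    """Posterior probability of each diploid genotype given the read bases.

    bases: observed base in each read at the site
    quals: Phred-scaled base quality for each observed base
    """
    genotypes = {ref + ref: (ref, ref), ref + alt: (ref, alt), alt + alt: (alt, alt)}
    if priors is None:
        priors = {name: 1.0 / len(genotypes) for name in genotypes}

    joint = {}
    for name, alleles in genotypes.items():
        likelihood = 1.0
        for base, qual in zip(bases, quals):
            err = 10.0 ** (-qual / 10.0)
            # Each read samples one of the two chromosomes with probability 1/2.
            likelihood *= sum(
                (1.0 - err) if base == allele else err / 3.0 for allele in alleles
            ) / 2.0
        joint[name] = priors[name] * likelihood

    total = sum(joint.values())
    return {name: value / total for name, value in joint.items()}
```

For example, `genotype_posteriors("AAAACC", [30] * 6, "A", "C")` puts nearly all of the posterior mass on the heterozygous genotype, because two high-quality C reads are very unlikely to both be errors.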

Page 19: Informatics challenges and computer tools for sequencing 1000s of human genomes

How many reads needed to call an allele?

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaCgtacctac
aatgtagtaCgtacctac

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac

Reads   Q30    Q40    Q50    Q60
1       0.01   0.01   0.1    0.5
2       0.82   1.0    1.0    1.0
3       1.0    1.0    1.0    1.0

Page 20: Informatics challenges and computer tools for sequencing 1000s of human genomes

The need for accurate data…

Page 21: Informatics challenges and computer tools for sequencing 1000s of human genomes

… and realistic base quality values

Page 22: Informatics challenges and computer tools for sequencing 1000s of human genomes

Recalibrated base quality values (Illumina)

Page 23: Informatics challenges and computer tools for sequencing 1000s of human genomes

More samples or deeper coverage / sample?

Shallower read coverage from more individuals …

…or deeper coverage from fewer samples?

simulation analysis by Aaron Quinlan

Page 24: Informatics challenges and computer tools for sequencing 1000s of human genomes

Analysis indicates a balance

Page 25: Informatics challenges and computer tools for sequencing 1000s of human genomes

SNP calling in trios

[Equation: Pr(G_C | G_M, G_F), the probability of each child genotype (11, 12, 22) tabulated for every combination of maternal genotype G_M and paternal genotype G_F (11, 12, 22)]

• the child inherits one chromosome from each parent

• there is a small probability for a mutation in the child
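The two rules on this slide (one chromosome from each parent, plus a small mutation probability) are enough to build a transmission table Pr(G_C | G_M, G_F). The sketch below is an assumption-laden illustration, not the formula from the slide: it assumes a biallelic site with alleles 1 and 2, that each parent transmits either of its alleles with probability 1/2, and that the transmitted allele flips to the other allele with probability `mu`. The default value of `mu` is a placeholder.

```python
from itertools import product

GENOTYPES = ("11", "12", "22")


def transmitted_allele_probs(parent, mu):
    """Distribution of the allele a parent passes on, allowing for mutation."""
    dist = {"1": 0.0, "2": 0.0}
    for allele in parent:
        other = "2" if allele == "1" else "1"
        dist[allele] += 0.5 * (1.0 - mu)  # transmitted unchanged
        dist[other] += 0.5 * mu           # transmitted, then mutated
    return dist


def child_genotype_probs(mother, father, mu=1e-8):
    """Pr(child genotype | mother genotype, father genotype)."""
    from_mother = transmitted_allele_probs(mother, mu)
    from_father = transmitted_allele_probs(father, mu)
    probs = {g: 0.0 for g in GENOTYPES}
    for a, b in product("12", repeat=2):
        probs["".join(sorted(a + b))] += from_mother[a] * from_father[b]
    return probs
```

With `mu=0` two heterozygous parents give the familiar 1/4, 1/2, 1/4 split; with a non-zero `mu`, a heterozygous child of two 11 parents has a small but non-zero probability, which is what lets a trio-aware caller weigh a de novo mutation against a sequencing error.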

Page 26: Informatics challenges and computer tools for sequencing 1000s of human genomes

SNP calling in trios

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaCgtacctac

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaCgtacctac

mother, father

child

P=0.79

P=0.86

Page 27: Informatics challenges and computer tools for sequencing 1000s of human genomes

4. Structural variation discovery

Read pair mapping pattern (breakpoint detection)

• Deletion: LM ~ LF+Ldel & depth: low

• Tandem duplication: LM ~ LF-Ldup & depth: high

• Inversion: LM ~ +Linv & ends flipped; LM ~ -Linv; depth: normal

• Translocation: LM ~ LF+LT1; LM ~ LF+LT2 & depth: normal; LM ~ LF-LT1-LT2

• Insertion: un-paired read clusters & depth normal

• Chromosomal translocation: LM ~ LF+LT & depth: normal & cross-paired read clusters

[Figure: each event drawn on the sample DNA and on the reference, with the lengths LM, LF, Ldel, Ldup, Linv, Lins, LT, LT1 and LT2 marked]
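The read pair patterns on this slide suggest a first-pass classifier for individual read pairs. The sketch below is a rough illustration under stated assumptions, not the caller used by the author: it treats LM as a signed mapped distance, flags pairs more than `n_sd` standard deviations above the expected fragment length as deletion candidates, treats negative distances as tandem duplication candidates, and ignores read depth and read clustering entirely. The `ReadPair` type, the labels and the 3-SD threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    chrom1: str
    chrom2: str
    mapped_distance: int      # LM: signed distance between the mapped ends
    proper_orientation: bool  # False when one end is flipped


def classify_pair(pair, frag_mean, frag_sd, n_sd=3.0):
    """Assign a candidate event type to a single read pair."""
    if pair.chrom1 != pair.chrom2:
        return "translocation"       # cross-paired: ends on different chromosomes
    if not pair.proper_orientation:
        return "inversion"           # ends flipped
    if pair.mapped_distance < 0:
        return "tandem_duplication"  # negative mapping distance
    if pair.mapped_distance > frag_mean + n_sd * frag_sd:
        return "deletion"            # aberrantly large positive mapping distance
    return "concordant"
```

In practice a single aberrant pair is weak evidence; candidate events would normally require a cluster of pairs with the same signature plus a matching change in depth of coverage.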

Page 28: Informatics challenges and computer tools for sequencing 1000s of human genomes

Copy number estimation

Depth of read coverage
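Copy number estimation from depth of read coverage can be sketched by scaling each window's depth against a baseline. The example below assumes that most windows are at the normal diploid copy number, so the median window depth is a reasonable baseline; it applies no correction for GC content or mappability, and the function name and window values are illustrative.

```python
from statistics import median


def copy_number_from_depth(window_depths, normal_copies=2):
    """Estimate copy number per window from read depth.

    The median window depth is taken to represent normal_copies copies.
    """
    baseline = median(window_depths)
    if baseline <= 0:
        raise ValueError("baseline depth must be positive")
    return [normal_copies * depth / baseline for depth in window_depths]


# Hypothetical per-window read depths along a region.
estimates = copy_number_from_depth([30, 31, 29, 15, 30, 60])
rounded = [round(x) for x in estimates]
```

In this toy input the window at half the baseline depth comes out at one copy (a heterozygous deletion) and the window at twice the baseline at four copies.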

Page 29: Informatics challenges and computer tools for sequencing 1000s of human genomes

Deletion: Aberrant positive mapping distance

Page 30: Informatics challenges and computer tools for sequencing 1000s of human genomes

Tandem duplication: negative mapping distance

Page 31: Informatics challenges and computer tools for sequencing 1000s of human genomes

Het deletion “revealed” by normalization

Chip Stewart
Saturday poster session

Page 32: Informatics challenges and computer tools for sequencing 1000s of human genomes

5. Data visualization

• software development

• data validation

• hypothesis generation

Page 33: Informatics challenges and computer tools for sequencing 1000s of human genomes

Summary

• Next-generation sequencing is a boon for large-scale individual human resequencing

• Basic data mining tools are being applied and tested in the 1000 Genomes Project

• There is still a lot of fine-tuning to do

• A different set of tools is needed for comparative analysis and effective visualization of 100s/1000s of genomes

Page 34: Informatics challenges and computer tools for sequencing 1000s of human genomes

Credits

Derek Barnett

Eric Tsung

Aaron Quinlan

Damien Croteau-Chonka

Weichun Huang

Michael Stromberg

Chip Stewart

Michele Busby

Several postdoc positions are available… mail [email protected]

Page 35: Informatics challenges and computer tools for sequencing 1000s of human genomes

Software tools for next-gen data

http://bioinformatics.bc.edu/marthlab/Beta_Release

Page 36: Informatics challenges and computer tools for sequencing 1000s of human genomes

Positions

Several postdoc positions are available… mail [email protected]

Page 37: Informatics challenges and computer tools for sequencing 1000s of human genomes

Individual genotype directly from sequence

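[Figure: aligned reads from three individuals (individual 1, individual 2, individual 3), each group labelled with the genotype read from it]

A/C
AACGTTAGCATA
AACGTTAGCATA
AACGTTCGCATA
AACGTTCGCATA

C/C
AACGTTCGCATA
AACGTTCGCATA
AACGTTCGCATA
AACGTTCGCATA

A/A
AACGTTAGCATA
AACGTTAGCATA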

Page 38: Informatics challenges and computer tools for sequencing 1000s of human genomes

Genotyping from primary sequence data

[Figure: fraction of confident genotypes (0 to 1) plotted against SNP position (0 to 50,000); a second axis runs from 0 to 1600. Legend:]

100 @ 16x: 0.975 +/- 0.121

200 @ 8x: 0.968 +/- 0.129

400 @ 4x: 0.924 +/- 0.151

800 @ 2x: 0.769 +/- 0.154
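The legend entries appear to hold the total sequencing effort fixed (100 × 16x = 200 × 8x = 400 × 4x = 800 × 2x) while trading individuals against coverage. One rough way to see why confidence drops at low coverage is to ask how often a heterozygous site gets enough reads of the non-reference allele. The sketch below assumes read counts at a site are Poisson distributed with mean equal to the coverage, that each allele is sampled half the time, and that `k` supporting reads are required; these are modelling assumptions for illustration, not the method behind the figure.

```python
import math


def prob_at_least_k_alt_reads(coverage, k=2, alt_fraction=0.5):
    """P(at least k reads carry the alternate allele) at a heterozygous site.

    Alternate-allele read count is modelled as Poisson(coverage * alt_fraction).
    """
    lam = coverage * alt_fraction
    p_fewer = sum(math.exp(-lam) * lam ** j / math.factorial(j) for j in range(k))
    return 1.0 - p_fewer


# Same coverages as in the figure legend.
by_coverage = {c: prob_at_least_k_alt_reads(c) for c in (2, 4, 8, 16)}
```

Under these assumptions, requiring two supporting reads succeeds only about a quarter of the time at 2x but almost always at 16x.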

Page 39: Informatics challenges and computer tools for sequencing 1000s of human genomes

Most reads contain no or few errors

Page 40: Informatics challenges and computer tools for sequencing 1000s of human genomes

Paired-end reads help unique read placement

PE (paired-end)

• fragment amplification: fragment length 100 - 600 bp

• fragment length limited by amplification efficiency

MP (mate-pair)

• circularization: 500bp - 10kb (sweet spot ~3kb)

• fragment length limited by library complexity

Korbel et al. Science 2007

Page 41: Informatics challenges and computer tools for sequencing 1000s of human genomes

How many reads needed to call an allele?

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaCgtacctac
aatgtagtaCgtacctac

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaCgtacctac

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaCgtacctac

P=0.82

P=0.08