Informatics tools for next-generation sequence analysis
Gabor Marth
Boston College Biology
Next-Generation Sequencing MiniSymposium
CHOP, Philadelphia, PA
April 6, 2009
New sequencing technologies…
… offer vast throughput
[Figure: bases per machine run (1 Mb to 100 Gb) versus read length (10 bp to 1,000 bp)]
ABI capillary sequencer
Roche/454 pyrosequencer (100-400 Mb in 200-450 bp reads)
Illumina/Solexa, AB/SOLiD sequencers (10-30 Gb in 25-100 bp reads)
Roche / 454
• pyrosequencing technology
• variable read-length
• the only new technology with >100 bp reads
Illumina / Solexa
• fixed-length short-read sequencer
• very high throughput
• read properties are very close to traditional capillary sequences
AB / SOLiD
[Figure: 2-base encoding matrix; rows: 1st base (A, C, G, T); columns: 2nd base (A, C, G, T); cells: color codes 0-3]
• fixed-length short-reads
• very high throughput
• 2-base encoding system
• color-space informatics
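The 2-base encoding idea can be sketched in a few lines of Python. This is a minimal illustration, assuming the commonly documented SOLiD scheme in which each color is the XOR of 2-bit base codes (A=0, C=1, G=2, T=3); the function names are hypothetical and the slide itself does not spell out the matrix.

```python
# Sketch of 2-base (color-space) encoding, ASSUMING the commonly documented
# SOLiD scheme: color = XOR of 2-bit base codes (A=0, C=1, G=2, T=3).
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"


def encode_colors(seq):
    """Return the first base plus one color (0-3) per adjacent base pair."""
    colors = [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(seq, seq[1:])]
    return seq[0], colors


def decode_colors(first_base, colors):
    """Rebuild the base sequence from the first base and the color calls."""
    bases = [first_base]
    for color in colors:
        # Each color, combined with the previous base, fixes the next base.
        bases.append(CODE_BASE[BASE_CODE[bases[-1]] ^ color])
    return "".join(bases)


first, colors = encode_colors("ATGGC")
print(first, colors)                 # A [3, 1, 0, 3]
print(decode_colors(first, colors))  # ATGGC
```

Because every base after the first is decoded from the previous base, a single wrong color call changes all downstream bases, which is one reason color-space reads are usually aligned in color space rather than decoded first.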
Helicos / Heliscope
• short-read sequencer
• single molecule sequencing
• no amplification
• variable read-length
Many applications
• organismal resequencing & de novo sequencing
Ruby et al. Cell, 2006
Jones-Rhoades et al. PLoS Genetics, 2007
• transcriptome sequencing for transcript discovery and expression profiling
Meissner et al. Nature 2008
• epigenetic analysis (e.g. DNA methylation)
Data characteristics
Read length
[Figure: read length [bp], axis from 0 to 400]
~200-450 (variable)
25-100 (fixed)
25-50 (fixed)
25-60 (variable)
Error characteristics (Illumina)
Insertions: 1.43%
Deletions: 3.23%
Substitutions: 95.34%
Error characteristics (454)
Coverage bias
~2X genome read coverage
~20X genome read coverage
Genome re-sequencing
Complete human genomes
The re-sequencing informatics pipeline
(i) base calling
(ii) read mapping
(iii) SNP and short INDEL calling
(iv) SV calling
(v) data viewing, hypothesis generation
[Diagram labels: REF, IND, GigaBayes]
Read mapping
2. Read mapping
… is like a jigsaw puzzle
… you get the pieces…
… and they give you the picture on the box
Big and unique pieces are easier to place than others…
Challenge: non-uniqueness
• Reads from repeats cannot be uniquely mapped back to their true region of origin
• RepeatMasker does not capture all micro-repeats, i.e. repeats at the scale of the read length
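The non-uniqueness problem can be illustrated with a toy exact-match mapper. This is only a sketch: the reference string and the reads are made up, and real mappers use indexed, error-tolerant alignment rather than a plain string search.

```python
def find_exact_hits(reference, read):
    """Return every 0-based position where the read matches the reference exactly."""
    hits = []
    start = reference.find(read)
    while start != -1:
        hits.append(start)
        # Continue one base further so overlapping hits are also found.
        start = reference.find(read, start + 1)
    return hits


# Toy reference (hypothetical) that contains the micro-repeat "ACGTACGT" twice.
reference = "TTGACGTACGTCCATAGGACGTACGTAAC"

for read in ("ACGTACGT", "CCATAGG"):
    hits = find_exact_hits(reference, read)
    status = "unique" if len(hits) == 1 else "non-unique"
    print(read, hits, status)
```

The first read falls inside the repeat and is placed at two positions, so its true origin cannot be decided from the sequence alone; the second read has a single placement.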
Non-unique mapping
SE short-read alignments are error-prone
0.35%
Paired-end (PE) reads
fragment length: 100 – 600bp
Korbel et al. Science 2007
fragment length: 1 – 10kb
PE alignment statistics (simulated data)
[Figure: PE alignment statistics; values shown: 0.00%, 7.6%, 0.09%, 0.35%, 0.03%]
The MOSAIK read mapper/aligner
Michael Strömberg
Gapped alignments
Aligning multiple read types together
ABI/capillary
454 FLX
454 GS20
Illumina
SNP / short-INDEL discovery
Polymorphism detection
sequencing error vs. polymorphism
Allele calling in multi-individual data
“genotype probabilities”
P(G1=aa|B1=aacc; Bi=aaaac; Bn=cccc)
P(G1=cc|B1=aacc; Bi=aaaac; Bn=cccc)
P(G1=ac|B1=aacc; Bi=aaaac; Bn=cccc)
P(Gi=aa|B1=aacc; Bi=aaaac; Bn=cccc)
P(Gi=cc|B1=aacc; Bi=aaaac; Bn=cccc)
P(Gi=ac|B1=aacc; Bi=aaaac; Bn=cccc)
P(Gn=aa|B1=aacc; Bi=aaaac; Bn=cccc)
P(Gn=cc|B1=aacc; Bi=aaaac; Bn=cccc)
P(Gn=ac|B1=aacc; Bi=aaaac; Bn=cccc)
P(SNP)
“genotype likelihoods”
P(B1=aacc|G1=aa)
P(B1=aacc|G1=cc)
P(B1=aacc|G1=ac)
P(Bi=aaaac|Gi=aa)
P(Bi=aaaac|Gi=cc)
P(Bi=aaaac|Gi=ac)
P(Bn=cccc|Gn=aa)
P(Bn=cccc|Gn=cc)
P(Bn=cccc|Gn=ac)
Prior(G1,..,Gi,..,Gn)
B1: -----a----------a----------c----------c-----
Bi: -----a----------a----------a----------a----------c-----
Bn: -----c----------c----------c----------c-----
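The step from genotype likelihoods to genotype probabilities and P(SNP) can be sketched as follows. This is a simplified illustration, not the GigaBayes implementation: it assumes a biallelic a/c site, a flat per-base error rate, and independent per-individual Hardy-Weinberg priors standing in for the joint Prior(G1,..,Gn). The error rate and allele frequency values are hypothetical.

```python
from math import prod

GENOTYPES = ("aa", "ac", "cc")


def genotype_likelihoods(reads, error_rate=0.01):
    """P(reads | genotype) at a biallelic a/c site with a flat error rate (assumed)."""
    likelihoods = {}
    for genotype in GENOTYPES:
        frac_a = genotype.count("a") / 2
        # Probability that a single read shows 'a' given this genotype.
        p_a = frac_a * (1 - error_rate) + (1 - frac_a) * error_rate
        likelihoods[genotype] = prod(p_a if base == "a" else 1 - p_a for base in reads)
    return likelihoods


def hardy_weinberg_prior(alt_freq):
    """Per-individual prior; a simplification of the joint Prior(G1,..,Gn)."""
    ref_freq = 1 - alt_freq
    return {"aa": ref_freq ** 2, "ac": 2 * ref_freq * alt_freq, "cc": alt_freq ** 2}


def genotype_posteriors(reads, prior, error_rate=0.01):
    """Bayes' rule: posterior is proportional to likelihood times prior."""
    likelihoods = genotype_likelihoods(reads, error_rate)
    joint = {g: likelihoods[g] * prior[g] for g in GENOTYPES}
    total = sum(joint.values())
    return {g: joint[g] / total for g in GENOTYPES}


def snp_probability(posteriors_per_sample):
    """P(SNP) = 1 - P(all samples aa) - P(all samples cc), samples treated as independent."""
    all_aa = prod(p["aa"] for p in posteriors_per_sample)
    all_cc = prod(p["cc"] for p in posteriors_per_sample)
    return 1 - all_aa - all_cc


samples = {"B1": "aacc", "Bi": "aaaac", "Bn": "cccc"}
prior = hardy_weinberg_prior(alt_freq=0.1)  # hypothetical allele frequency
posteriors = {name: genotype_posteriors(reads, prior) for name, reads in samples.items()}
for name, post in posteriors.items():
    print(name, {g: round(p, 4) for g, p in post.items()})
print("P(SNP) =", snp_probability(list(posteriors.values())))
```

With a joint prior the individual posteriors would no longer factorize, so a full multi-sample caller has to sum over genotype combinations instead of multiplying per-sample terms.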
SNP calling in deep sample sets
Population → Samples → Reads → Allele detection
Capturing the allele in the samples
[Figure: Prob(allele captured in sample), from 0 to 1, versus population AF, from 0.0001 to 0.5; curves for n=100, n=200, n=400, n=800, n=1600]
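The capture probability plotted here can be approximated with a short calculation. This is a sketch under stated assumptions: n is read as the number of diploid individuals, and the 2n sampled chromosomes are treated as independent draws from the population; the slide does not state either.

```python
def prob_allele_captured(allele_freq, n_samples, ploidy=2):
    """P(at least one copy among the sampled chromosomes) = 1 - (1 - f)^(ploidy * n)."""
    return 1 - (1 - allele_freq) ** (ploidy * n_samples)


sample_sizes = (100, 200, 400, 800, 1600)
allele_freqs = (0.0001, 0.001, 0.01, 0.1)

print("AF      " + "".join(f"n={n:<7}" for n in sample_sizes))
for freq in allele_freqs:
    row = "".join(f"{prob_allele_captured(freq, n):<9.3f}" for n in sample_sizes)
    print(f"{freq:<8}{row}")
```

Under these assumptions an allele at 0.1% population frequency is present in a sample of 400 individuals only about 55% of the time, which is why rare-allele discovery needs deep sample sets before read-level detection even becomes relevant.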
The ability to call rare alleles
reads   Q30    Q40    Q50    Q60
1       0.01   0.01   0.1    0.5
2       0.82   1.0    1.0    1.0
3       1.0    1.0    1.0    1.0

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaCgtacctac
aatgtagtaCgtacctac

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
GigaBayes
Allele calling in 400 samples
Detecting de novo mutations
[Equation: Pr(G_C | G_M, G_F), the probability of each child genotype (11, 12, 22) given the genotypes of the mother (M) and father (F)]
• the child inherits one chromosome from each parent
• there is a small probability for a de novo (germ-line or somatic) mutation in the child
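The two points above can be turned into a small transmission model. This is a hedged sketch, not the formula from the slide: it assumes a biallelic site with alleles 1 and 2, that each parent passes on either of its alleles with probability 1/2, and that the transmitted allele flips to the other allele with a small probability. The mutation rate used is purely illustrative.

```python
from itertools import product


def transmit(parent_genotype, mutation_rate):
    """Distribution over the allele (1 or 2) that a parent passes on, allowing mutation."""
    dist = {1: 0.0, 2: 0.0}
    for allele in parent_genotype:  # each parental allele is transmitted with prob 1/2
        other = 2 if allele == 1 else 1
        dist[allele] += 0.5 * (1 - mutation_rate)
        dist[other] += 0.5 * mutation_rate
    return dist


def child_genotype_probs(mother, father, mutation_rate):
    """Pr(child genotype | mother, father); genotypes are sorted allele tuples."""
    probs = {(1, 1): 0.0, (1, 2): 0.0, (2, 2): 0.0}
    from_mother = transmit(mother, mutation_rate)
    from_father = transmit(father, mutation_rate)
    for a, b in product((1, 2), repeat=2):
        probs[tuple(sorted((a, b)))] += from_mother[a] * from_father[b]
    return probs


MU = 1e-8  # illustrative per-site mutation rate (assumption)
print(child_genotype_probs((1, 1), (1, 1), MU))  # a heterozygous child needs a de novo event
print(child_genotype_probs((1, 2), (1, 2), MU))  # ordinary Mendelian segregation
```

In a de novo caller these transmission probabilities would serve as the prior that links the child's genotype to the parents' genotypes, so that an apparent mutation is only called when the read evidence outweighs its very small prior probability.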
Capture sequencing
Targeted mammalian re-sequencing
• Deep sequencing of complete human genomes is still too expensive
• There is a need to sequence target regions, typically genes, to follow up on GWAS studies
• Targeted re-sequencing with DNA fragment capture offers a potentially cost-effective alternative
• Solid phase or liquid phase capture
• 454 or Illumina sequencing
• Informatics pipeline must account for the peculiarities of capture data
On/off target capture
Target region
SNP (outside target region)
Reference allele bias
ref allele*: 54%
non-ref allele*: 45%
(*) measured at 450 het HapMap 3 sites overlapping capture target regions in sample NA07346
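One way to measure such a bias is to pool read counts over known heterozygous sites, where an unbiased capture would give a reference fraction near 50%. The sketch below assumes pooled counts; whether the figures above are pooled or averaged per site is not stated, and the counts used here are hypothetical.

```python
def ref_allele_fraction(site_counts):
    """Pooled reference-allele fraction over (ref_reads, non_ref_reads) pairs at het sites."""
    ref_total = sum(ref for ref, _ in site_counts)
    all_total = sum(ref + non_ref for ref, non_ref in site_counts)
    if all_total == 0:
        raise ValueError("no reads at the supplied sites")
    return ref_total / all_total


# Hypothetical read counts at three known heterozygous sites.
het_site_counts = [(12, 9), (20, 18), (22, 19)]
print(f"ref allele fraction: {ref_allele_fraction(het_site_counts):.2%}")
```

A caller that assumes a 50/50 allele ratio at heterozygous sites will under-call non-reference alleles on such data, so the measured fraction can be fed into the genotype likelihood model instead.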
SNP example
Amit Indap
Structural Variation discovery
Structural variations
SV/CNV detection – SNP chips
• Tiling arrays and SNP-chips made whole-genome CNV scans possible
• Probe density and placement limit resolution
• Balanced events cannot be detected
SV/CNV detection – resolution
[Figure: relative numbers of events versus CNV event length [bp]; series: expected CNVs, karyotype, micro-array, sequencing]
Read depth
Chromosome 2 Position [Mb]
CNV events found using RD
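A read-depth (RD) scan can be sketched as a comparison of per-window depth against a genome-wide baseline. This is a minimal illustration with hypothetical window depths and thresholds; real RD callers also correct for GC content and mappability and merge adjacent windows into events.

```python
from statistics import median


def call_depth_outliers(window_depths, low=0.5, high=1.5):
    """Label each window by its depth relative to the median window depth.

    The 0.5x and 1.5x thresholds are illustrative assumptions.
    """
    baseline = median(window_depths)
    if baseline == 0:
        raise ValueError("median depth is zero; cannot normalize")
    calls = []
    for depth in window_depths:
        ratio = depth / baseline
        if ratio <= low:
            calls.append("loss")
        elif ratio >= high:
            calls.append("gain")
        else:
            calls.append("normal")
    return calls


# Hypothetical read depth in consecutive windows along a chromosome.
depths = [20, 21, 19, 10, 9, 20, 31, 30, 20]
print(list(zip(depths, call_depth_outliers(depths))))
```

Read depth detects copy-number changes but says nothing about balanced events such as inversions, which is where paired-end mapping patterns come in.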
PE read mapping positions
[Diagram columns: DNA, reference, pattern; labels: LM, LF]
Deletion (Ldel): LM ~ LF+Ldel & depth: low
Tandem duplication (Ldup): LM ~ LF-Ldup & depth: high
Inversion (Linv): LM ~ +Linv & ends flipped; LM ~ -Linv; depth: normal
Translocation (LT1, LT2): LM ~ LF+LT1; LM ~ LF+LT2; LM ~ LF-LT1-LT2; depth: normal
Insertion (Lins): un-paired read clusters & depth normal
Chromosomal translocation (LT): LM ~ LF+LT & depth: normal & cross-paired read clusters
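The mapping patterns above suggest a simple per-pair classification. The sketch below is an assumption-laden illustration, not the Spanner algorithm: LM is read as the mapped distance between mates and LF as the expected fragment length, the `ReadPair` type and the three-standard-deviation cutoff are hypothetical, and read depth (which the patterns also use) is not modeled.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    chrom1: str               # chromosome of the first mate
    chrom2: str               # chromosome of the second mate
    mapped_length: int        # LM: distance spanned by the two mapped mates
    proper_orientation: bool  # False when one end is flipped


def classify_pair(pair, frag_mean, frag_sd, n_sd=3.0):
    """Assign a read pair to the SV signature it is most consistent with."""
    if pair.chrom1 != pair.chrom2:
        return "chromosomal translocation candidate"  # cross-paired mates
    if not pair.proper_orientation:
        return "inversion candidate"                   # ends flipped
    if pair.mapped_length > frag_mean + n_sd * frag_sd:
        return "deletion candidate"                    # LM ~ LF + Ldel
    if pair.mapped_length < frag_mean - n_sd * frag_sd:
        return "tandem duplication candidate"          # LM ~ LF - Ldup
    return "concordant"


pairs = [
    ReadPair("chr2", "chr2", 410, True),
    ReadPair("chr2", "chr2", 1500, True),
    ReadPair("chr2", "chr2", 120, True),
    ReadPair("chr2", "chr2", 400, False),
    ReadPair("chr2", "chr7", 0, True),
]
for pair in pairs:
    print(pair, "->", classify_pair(pair, frag_mean=400, frag_sd=40))
```

A single discordant pair is weak evidence; in practice candidates are clustered, and an event is reported only when several pairs with the same signature agree on the breakpoints.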
The SV/CNV “event display”
Chip Stewart
Spanner – specificity
Data standards
Data types with standard formats
SRF/FASTQ
SAM/BAM
GLF
Transcriptome sequencing
Data highly reproducible
Michele Busby
Comparative data
Michele Busby
Biological questions
Michele Busby
Our software tools for next-gen data
http://bioinformatics.bc.edu/marthlab/Software_Release
Credits
Elaine Mardis
Andy Clark
Aravinda Chakravarti
Doug Smith
Michael Egholm
Scott Kahn
Francisco de la Vega
Patrice Milos
John Thompson
Lab
Several postdoc positions are available!
Mutational profiling
Chemical mutagenesis
Mutational profiling: deep 454/Illumina/SOLiD data
• Pichia stipitis converts xylose to ethanol (bio-fuel production)
• one mutagenized strain had high conversion efficiency
• determine which mutations caused this phenotype
• 15 Mb genome: 454, Illumina, and SOLiD reads
• 14 true point mutations in the entire genome
Pichia stipitis reference sequence
Image from JGI web site
10-15X genome coverage required
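The sequencing effort implied by this coverage requirement follows from coverage = (number of reads × read length) / genome size. The sketch below applies it to the 15 Mb genome; the read lengths are illustrative assumptions, and the calculation ignores unmappable and low-quality reads.

```python
import math


def reads_needed(genome_size_bp, coverage, read_length_bp):
    """Number of reads required for a target average coverage."""
    return math.ceil(genome_size_bp * coverage / read_length_bp)


GENOME_SIZE_BP = 15_000_000  # 15 Mb genome, as stated on the slide

for coverage in (10, 15):
    total_bases = GENOME_SIZE_BP * coverage
    print(f"{coverage}X coverage: {total_bases / 1e6:.0f} Mb of sequence")
    for read_length in (36, 250):  # hypothetical short-read and 454-like lengths
        print(f"  {read_length} bp reads: {reads_needed(GENOME_SIZE_BP, coverage, read_length):,}")
```

At 10-15X the genome requires 150-225 Mb of sequence, which is a small fraction of a single run on the higher-throughput instruments described earlier.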