Genome Informatics I (2015 Spring)
MES7594-01 Genome Informatics I
- Lecture V. Short Read Alignment
Sangwoo Kim, Ph.D.
Assistant Professor,
Severance Biomedical Research Institute, Yonsei University College of Medicine
Overview
• Goal of this lecture
  – You will learn the principle of mapping NGS short reads to a reference genome and practice with alignment tools
• Short Read Alignment Theory
  – Why do we need a special algorithm?
  – The Burrows-Wheeler Transformation (BWT)
    • BWT indexing
    • LF search
    • Examples
• Practice with BWA
  • with NA18507 sequences
• Understanding alignment information
  – Viewing/Converting SAM/BAM format
  – Interpreting alignment information
SHORT READ ALIGNMENT THEORY
RAW NGS DATA (FASTQ)
@SRR764745.4352210/1
TAACACCTGGGAAATTCATCACAAAAAGATCTTAGCCTAGGCACATTGTCATTAGGTTATCCAAAGTTAAGACAAAGGAAAGAATCTTAAGAGCTGTGAGA
+
5FIFEFHFGHHEFFEEIFFIFHFGGGGKGFJHFEKJJIFKKJGHGGGJFKHGGGLLFGGHLKHJJMGGGJNJKIJJLLIIIKJIHIKJEGFACGEEEDC>
[email protected]/1
ATATATGAAGGAAAGATACAGTCATTTTCAGACAAACAAATGCTGACAGAATTTGCCATTACCAAGCCAGGACTCTAAGAACTGCTAAAAGGAGCTCTAAA
+
6FFDBDGDEGFEEEGEDBEEFDFEEDEEFFGEEFGFFFGFEHGGHEFFGFGEFFHGGFFFDGGGGHGGGHHGFHGGEGHGHFGIIGCFFFED?ADC>B<>>
@SRR764746.2695391/1
TAAAAGAGACAAAGAGAGACAGTATATCATCTGTCATCTGACAGTCTCATCCAACAGAAAAATATGACAATCCTAAACATATGTGAACCTAACACTGGAGC
+
6FIEEFDFEEEFEFEFEFEEEFDBECEFFEFFGEFFEFGHEFFGDGGFFEEGFGFFHFGGGGEDFHFFGHFGFHFGGGFFEFIGJFGGIHBDECCCD?;>
[email protected]/1
TTAAATAACCTGCTCCTGAATGAGCATTGGGTGAAAAACGAAATCAAGATGGAAATGTAAAAAATTTCTTCGAACTGGATGACACAACCTATCAAGACCTC
+
5FBCC@A*CHDFDDDDEFBDDGADFCBDFFEEGEGADEEAE4DEFFEGBEHE8;ADHD@DGGFCGDEDGFB==B?GNG@FMC@JFF>:FG=DDED=&>@A#
@SRR764746.5506495/1
CACAACCTATCAAGACCTCTGGGATACAGCAAAGGCAGTGCTAAGAGGAAAGTTTATAGCACTAAACACCTACGTCGAAAAGTCTGAAAGAGCACAGACAA
+
5HIDDDEEBDEEEFEEEFEFGFFEECFFGFFFFGFFFGDHGGCFGFGGFGGHDEFDFDHGGFGDGGFGFGFDFAEFBCFFFFJDIKCEEFACFBCA?;A@
[email protected]/1
CCATAGAAAGGAATGAATTAACAGCATTTCCTGTGACCTGGACGAGATTGGAGACTATTGTTCTAAGTGATGTAACCCAGGAATGGAAAACTCAACATTGT
+
5IHCBE@EEFFDEDGDEDDCFEEGFEEEDFDFGEHEFFFHEBHABHDEDHGDGFFGDFFHEEGGDGHFIFFIEDGFGHGHHCJCIGCEEEHFAB?B@<
[email protected]/1
TGTCCTTTCCAGGGACATGGATGAAGCTGGAAACCATCATTCTCAGCAAACTAACACAAGAAAAGAAAACCAGGCCAGGAGCAGTGGCTCATGCCTGTAGT
+
5JIAIHEDHHDHGGFFFEIJFFHDCIHHHKFGHIIGGFGGGGHIGDGGIIIIGGJGFGGIIFHHKHIJIJKHLKILGCIIHMHKDKMLKFJBHHHBGFABB
@SRR764745.944258/1
GAGAACACATGGACACAGGGAGGGGAACATCACACACTGGGGCCTGTCAAAGGGTGGGAGGCTGGGGGAGGAACAGCATTAGGAGAAATACCTAATGTAGA
+
5FFDEFEFEDIH?CECEHEHCHIJI>BCCCIDFFFFIHIBHBHFAAFEGGFHMM8FDCDGIEHGAGG@BGAAFKH?6>DKDDNIK?9<FHGBICDBG@<<
[email protected]/1
TGGGGAAAAAAAACATTCTCTGAAATTTGCTTTTATACCATTAAAGACTTATTTTTTATTACCAGCAATACAGGGCAACTCATTCAGGTTGAATCTTGAAG
+
6NMHHFBGGFFEGHEEEIHIDIFGFDFFHFFEFEEGFIJGGGEHHLHIJEFHGHGHFFGGFJKHJJHHFFMHKNBEIFMMGLEIGJHMJCM@CA?FCD;GB
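Each FASTQ record in the raw data spans four lines: a header starting with "@", the read sequence, a separator starting with "+", and a quality string with one character per base. A minimal Python sketch of reading that layout could look as follows. The names FastqRecord and parse_fastq and the two short reads are hypothetical, made up for illustration, and are not part of the course material or the NA18507 data.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one record per four non-empty lines of FASTQ text."""
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of four lines")
    for i in range(0, len(rows), 4):
        header, sequence, separator, quality = rows[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record number {i // 4 + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        # Drop the leading "@" so the name matches what aligners report.
        yield FastqRecord(header[1:], sequence, quality)


# Hypothetical two-read example (not taken from the NA18507 data).
example = [
    "@read1/1\n", "GATTCAAA\n", "+\n", "IIIIHHHG\n",
    "@read2/1\n", "TGACGTGT\n", "+\n", "FFFFEEED\n",
]
records = list(parse_fastq(example))
for record in records:
    print(record.name, record.sequence, len(record.quality))
```

The same function accepts an open file handle, since a text file iterates line by line.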
Mapping back to genome
TAACACCTGGGAAATTCATCACAAAAAGATCTTAGCCTAGGCACATTGTCATTAGGTTATCCAAAGTTAAGACAAAGGAAAGAATCTTAAGAGCTGTGAGA
Where is this sequence in the human genome?
Do this as fast as possible!
brute force way
T G A C G T G T G A T T C A A A A A A G C
The reference genome (chr1, start)
G A T T C A A A Your query
G A T T C A A A
G A T T C A A A
G A T T C A A A
Find “GATTCAAA” in the human genome
The reference is very long (3 billion bases)
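The brute-force idea on this slide, sliding the query along the reference one position at a time, can be sketched in a few lines of Python. The function name naive_find_all is hypothetical; the reference and query strings are the ones shown on the slide.

```python
from typing import List


def naive_find_all(reference: str, query: str) -> List[int]:
    """Return every 0-based offset where query occurs exactly in reference."""
    hits = []
    # Try each possible start position and compare the whole window.
    for start in range(len(reference) - len(query) + 1):
        if reference[start:start + len(query)] == query:
            hits.append(start)
    return hits


reference = "TGACGTGTGATTCAAAAAAGC"  # the first bases shown on the slide
hits = naive_find_all(reference, "GATTCAAA")
print(hits)  # [8]
```

The work grows with (reference length) x (query length) for every read, which is why this approach does not scale to a 3-billion-base genome.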
How fast should it be?
                     time per 1 read (sec)   time per 80x WGS (sec)   is equal to
eyeballing           3x10^9                  3.6x10^18                1x10^11 yrs
naïve matching       2400                    1.2x10^9                 7,608 yrs
improved algorithm   3                       3.6x10^8                 10 yrs
minimum required     0.01                    1.2x10^7                 11.5 days
desired              0.001                   1.2x10^6                 1.2 days
Based on 200 bp read length, 80x single-end WGS
Searching with index
• Assume you’re searching for “genome” in an English dictionary
  – You don’t search every line on every page
  – You first find the page range of “g” in the dictionary
  – In the above range (of ‘g’), you find the page range of “ge” in the dictionary
  – In the above range (of ‘ge’), you find the page range of “gen” in the dictionary
  – ...
  – until you find “genome”
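The dictionary analogy, narrowing a range one letter at a time, can be sketched with Python's standard bisect module on a sorted word list. The word list and the helper prefix_range are hypothetical, and the "\uffff" sentinel assumes no word contains that character.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple


def prefix_range(sorted_words: List[str], prefix: str,
                 lo: int = 0, hi: Optional[int] = None) -> Tuple[int, int]:
    """Return the half-open range [start, end) of words beginning with prefix,
    searching only inside sorted_words[lo:hi]."""
    if hi is None:
        hi = len(sorted_words)
    start = bisect_left(sorted_words, prefix, lo, hi)
    # Any word starting with prefix sorts before prefix + a very large character.
    end = bisect_left(sorted_words, prefix + "\uffff", lo, hi)
    return start, end


# Hypothetical miniature dictionary; it must be sorted.
words = sorted(["zebra", "gene", "apple", "genome", "giant", "gate", "genus"])

lo, hi = 0, len(words)
query = "genome"
for length in range(1, len(query) + 1):
    # Each step searches only inside the range found for the shorter prefix.
    lo, hi = prefix_range(words, query[:length], lo, hi)
    print(query[:length], words[lo:hi])
```

Each step looks only at the range left by the previous one, which is the property the BWT index provides for a genome.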
Indexing genome
• We are going to make an index for the genome
  – to make it possible to search for a read sequence as we would search for a word in an English dictionary
Burrows-Wheeler Transformation
• Start with the string: BANANA
• Append “$”, which is lexicographically smallest: BANANA$
• Write all rotations, numbered by rotation count:
0 BANANA$
1 ANANA$B
2 NANA$BA
3 ANA$BAN
4 NA$BANA
5 A$BANAN
6 $BANANA
• Sort the rotations (columns: sorted row, original rotation number, rotation):
0 6 $BANANA
1 5 A$BANAN
2 3 ANA$BAN
3 1 ANANA$B
4 0 BANANA$
5 4 NA$BANA
6 2 NANA$BA
• Read the last column from top to bottom: ANNB$AA
BWT(“BANANA$”) = “ANNB$AA”
1. BWT just changes the order of the string
2. BWT tends to collect similar characters together
3. With only the transformed string, we can easily get the original string
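The construction above (append “$”, list all rotations, sort them, read the last column) can be written directly in Python. This is a small teaching sketch with a hypothetical function name bwt. Real aligners build the index with suffix-array algorithms rather than by materializing every rotation.

```python
from typing import List, Tuple


def bwt(text: str) -> Tuple[str, List[int]]:
    """Return (last column, rotation numbers in sorted order) for text + '$'."""
    s = text + "$"  # '$' sorts before A-Z in ASCII, so it is the smallest symbol
    rotations = [s[i:] + s[:i] for i in range(len(s))]
    # Sort rotation numbers by the rotation they produce.
    order = sorted(range(len(s)), key=lambda i: rotations[i])
    last_column = "".join(rotations[i][-1] for i in order)
    return last_column, order


transformed, order = bwt("BANANA")
print(transformed)  # ANNB$AA
print(order)        # [6, 5, 3, 1, 0, 4, 2]
```

The second return value is the "original rotation number" column of the sorted table, which later turns a matching row into a position in the original string.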
Inverse BWT
We are given “ANNB$AA”
• Write the given string as a column; it is the last column of the sorted rotations:
A N N B $ A A
• Sort its characters to get the first column:
$ A A A B N N
• Attach the last column in front of the sorted column:
A$ NA NA BA $B AN AN
• Sort:
$B A$ AN AN BA NA NA
• Attach the last column again:
A$B NA$ NAN BAN $BA ANA ANA
• Repeat sorting and attaching until every row has 7 characters; after a final sort the rows are the sorted rotations, and the row that ends with “$” is the original string, BANANA$
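The sort-and-attach procedure for the inverse transform can be sketched in Python as below. The function name inverse_bwt is hypothetical, and this naive version re-sorts all rows in every round, so it is meant only to mirror the slides, not to be efficient.

```python
def inverse_bwt(last_column: str) -> str:
    """Rebuild the original string (including '$') from its BWT."""
    rows = [""] * len(last_column)
    for _ in range(len(last_column)):
        # Attach the last column in front of the current rows, then sort.
        rows = sorted(ch + row for ch, row in zip(last_column, rows))
    # All rows are now the sorted rotations; the original ends with '$'.
    for row in rows:
        if row.endswith("$"):
            return row
    raise ValueError("input does not contain the '$' terminator")


original = inverse_bwt("ANNB$AA")
print(original)  # BANANA$
```

After the final round the list rows holds the full sorted rotation table, so the original string can simply be picked out by its “$” ending.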
LF Search
Question: Find “NAN” in BANANA
Sorted rotations of BANANA$ (columns: sorted row, original rotation number, rotation); the last column is ANNB$AA:
0 6 $BANANA
1 5 A$BANAN
2 3 ANA$BAN
3 1 ANANA$B
4 0 BANANA$
5 4 NA$BANA
6 2 NANA$BA
The query is matched backward, from its last character to its first: “N”, then “AN”, then “NAN”.
The range of rows that start with “N” can be calculated from:
• the number of symbols that are lexicographically less than ‘N’, to determine the start point: = 5
• the number of ‘N’, to determine the end point: = 2
• so the range is rows 5 to 6
The range of rows that start with “AN”:
• the number of symbols that are lexicographically less than ‘A’ (= 1) and the number of ‘A’ (= 3) give rows 1 to 3
• This is a range for ‘A’, not ‘AN’!
• rows “Ax” that are not “AN” and are less than “AN” must be skipped, so we count ‘A’ in the last column
• start point = number of symbols less than ‘A’ + number of ‘A’ in the last column before the current start point (row 5) = 1 + 1 = 2
• end point = number of symbols less than ‘A’ + number of ‘A’ in the last column up to and including the current end point (row 6) - 1 = 1 + 3 - 1 = 3
• so the range is rows 2 to 3
The range of rows that start with “NAN”:
• start point = number of symbols less than ‘N’ + number of ‘N’ in the last column before the current start point (row 2) = 5 + 1 = 6
• end point = number of symbols less than ‘N’ + number of ‘N’ in the last column up to and including the current end point (row 3) - 1 = 5 + 2 - 1 = 6
• so the range is row 6 only (start = end)
Sorted row 6 is rotation 2 of the original string. The rotation number is the offset in the original string, so “NAN” starts at the 3rd position of “BANANA”.
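The backward search walked through on these slides can be sketched in Python as follows. The names build_index, occ and backward_search are hypothetical, ranges are kept half-open ([lo, hi)) instead of the inclusive start/end rows of the slides, and the counts are recomputed by scanning the last column each time. Real FM-index implementations precompute these counts.

```python
from collections import Counter
from typing import Dict, List, Tuple


def build_index(text: str) -> Tuple[List[int], str, Dict[str, int]]:
    """Return (rotation numbers in sorted order, last column, less-than counts)."""
    s = text + "$"
    order = sorted(range(len(s)), key=lambda i: s[i:] + s[:i])
    last = "".join(s[i - 1] for i in order)  # character preceding each rotation
    less_than, total = {}, 0
    counts = Counter(s)
    for symbol in sorted(counts):
        less_than[symbol] = total  # number of symbols smaller than this one
        total += counts[symbol]
    return order, last, less_than


def occ(last: str, symbol: str, row: int) -> int:
    """Number of occurrences of symbol in the last column before row."""
    return last[:row].count(symbol)


def backward_search(query: str, text: str) -> List[int]:
    """Return sorted 0-based offsets of exact matches of query in text."""
    order, last, less_than = build_index(text)
    lo, hi = 0, len(last)  # half-open range of sorted rows
    for symbol in reversed(query):
        if symbol not in less_than:
            return []
        lo = less_than[symbol] + occ(last, symbol, lo)
        hi = less_than[symbol] + occ(last, symbol, hi)
        if lo >= hi:
            return []
    return sorted(order[row] for row in range(lo, hi))


print(backward_search("NAN", "BANANA"))  # [2]
print(backward_search("ANA", "BANANA"))  # [1, 3]
```

For “NAN” the half-open ranges are [5, 7), then [2, 4), then [6, 7), which correspond to rows 5 to 6, rows 2 to 3 and row 6 in the walkthrough.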
Genome query
Imported from Mike Schatz’s slides: http://schatzlab.cshl.edu/teaching/2010/Lecture%202%20-%20Sequence%20Alignment.pdf
Inexact matching
T G A C G T G T G A T T C A A A A A A G C
G A T T G A A A
When an exact match does not exist:
• continue with the other possible candidates (G -> A, C, T) and increase the mismatch count
• if another mismatch occurs, branch out again
• so the allowed edit distance is critical to alignment speed
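The branching described here can be sketched in Python as a backward search that, at each step, also tries the other symbols and spends one unit of a mismatch budget. This is an illustrative sketch only: it handles substitutions but not insertions or deletions, the function names are hypothetical, and it is not BWA's actual implementation.

```python
from collections import Counter
from typing import Dict, List, Tuple


def build_index(text: str) -> Tuple[List[int], str, Dict[str, int]]:
    """Return (rotation numbers in sorted order, last column, less-than counts)."""
    s = text + "$"
    order = sorted(range(len(s)), key=lambda i: s[i:] + s[:i])
    last = "".join(s[i - 1] for i in order)
    less_than, total = {}, 0
    counts = Counter(s)
    for symbol in sorted(counts):
        less_than[symbol] = total
        total += counts[symbol]
    return order, last, less_than


def inexact_search(query: str, text: str, max_mismatches: int) -> List[Tuple[int, int]]:
    """Return sorted (offset, mismatches) pairs with at most max_mismatches substitutions."""
    order, last, less_than = build_index(text)
    alphabet = [symbol for symbol in less_than if symbol != "$"]
    hits = set()

    def extend(i: int, lo: int, hi: int, budget: int) -> None:
        if i < 0:  # the whole query has been matched
            for row in range(lo, hi):
                hits.add((order[row], max_mismatches - budget))
            return
        for symbol in alphabet:
            cost = 0 if symbol == query[i] else 1  # branching costs one mismatch
            if cost > budget:
                continue
            new_lo = less_than[symbol] + last[:lo].count(symbol)
            new_hi = less_than[symbol] + last[:hi].count(symbol)
            if new_lo < new_hi:
                extend(i - 1, new_lo, new_hi, budget - cost)

    extend(len(query) - 1, 0, len(last), max_mismatches)
    return sorted(hits)


reference = "TGACGTGTGATTCAAAAAAGC"
print(inexact_search("GATTGAAA", reference, 0))  # []
print(inexact_search("GATTGAAA", reference, 1))  # [(8, 1)]
```

Every extra allowed mismatch multiplies the number of branches that may be explored, which is why the allowed edit distance dominates the running time.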
Goal achieved
                     time per 1 read (sec)   time per 80x WGS (sec)   is equal to
eyeballing           3x10^9                  3.6x10^18                1x10^11 yrs
naïve matching       2400                    1.2x10^9                 7,608 yrs
improved algorithm   3                       3.6x10^8                 10 yrs
minimum required     0.01                    1.2x10^7                 11.5 days
desired              0.001                   1.2x10^6                 1.2 days
PRACTICE WITH BWA
BWA
bwa practice
• In the cluster
  – >bwa
bwa process
• bwa index
  – to index the reference genome (one-time process)
    • = to create the BWT for the reference genome
• bwa aln
  – will calculate suffix array (SA) coordinates
• bwa samse (or bwa sampe for paired-end sequencing)
  – will convert the SA coordinates to chromosomal locations
• Input for bwa
  – reference genome
  – fastq file (the raw NGS data)
reference data
“bwa index” will index the reference genome (so that the reference is ready). It is already done here; do not try to do it again.
sequence data
- Pick one chromosome for yourself
- Copy the fastq file to your directory
- Use the “cp” command to do it
- Example (copying chr8 NGS data to the rachmani directory)
>cp NA18507_chr8.* /scratch/2015_GenomeInformatics/rachmani/
run bwa aln
>bwa aln reference yourdata.fastq > yourdata.sai
example
>bwa aln /data/resources/reference/human/UCSC/hg19/BWAIndex/genome.fa NA18507_chr8.01.fastq > NA18507_chr8.01.sai
write a job script: runbwaaln.sh
submit to the cluster:
>qsub runbwaaln.sh
run bwa samse
>bwa samse reference yourdata.sai yourdata.fastq > yourdata.sam
example
>bwa samse /data/resources/reference/human/UCSC/hg19/BWAIndex/genome.fa NA18507_chr8.01.sai NA18507_chr8.01.fastq > NA18507_chr8.01.sam
write a job script: runbwasamse.sh
submit to the cluster:
>qsub runbwasamse.sh
the output
>less NA18507_chr8.01.sam
This is your first alignment with real NGS data
break
• Please ask us any questions if you have problems (do not give up)
• If possible, try mapping in paired-end mode
  – bwa sampe reference data01.sai data02.sai data01.fastq data02.fastq > output.sam
The SAM Format
For more details about the SAM format, please refer to: https://samtools.github.io/hts-specs/SAMv1.pdf
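As a rough orientation before reading the specification: each alignment line in a SAM file is tab-separated and starts with eleven mandatory fields, optionally followed by tags. The Python sketch below splits one such line and decodes two FLAG bits. The function name parse_sam_line and the example line are hypothetical, made up for illustration, and not taken from the NA18507 output.

```python
from typing import Any, Dict

# The eleven mandatory SAM columns, in order.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INTEGER_FIELDS = ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN")


def parse_sam_line(line: str) -> Dict[str, Any]:
    """Split one alignment line (not a '@' header line) into named fields."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError("an alignment line needs at least 11 columns")
    record: Dict[str, Any] = dict(zip(SAM_FIELDS, columns))
    for name in INTEGER_FIELDS:
        record[name] = int(record[name])
    record["TAGS"] = columns[len(SAM_FIELDS):]           # optional fields
    record["is_unmapped"] = bool(record["FLAG"] & 0x4)   # bit 0x4: read unmapped
    record["is_reverse"] = bool(record["FLAG"] & 0x10)   # bit 0x10: reverse strand
    return record


# Hypothetical alignment line for illustration only.
example_line = "read1\t16\tchr8\t1000\t37\t8M\t*\t0\t0\tGATTCAAA\tIIIIIIII\tNM:i:0"
record = parse_sam_line(example_line)
print(record["RNAME"], record["POS"], record["CIGAR"], record["is_reverse"])
```

POS is 1-based in SAM, so it can be compared directly with the coordinates shown by genome browsers.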
SAM/BAM
• SAM and BAM are interconvertible (they hold exactly the same information)
• SAM file
  – human-readable text file
• BAM file (binary)
  – binary file, not human-readable
  – compressed (much smaller size)
  – can be indexed (for random access)
Converting SAM to BAM
• >samtools view yourdata.sam -Sb > yourdata.bam
  – -S option means input is SAM format
  – -b option means output is BAM format
Sorting and Indexing BAM
• samtools sort yourdata.bam yourdata.sorted
  – will create yourdata.sorted.bam
• samtools index yourdata.sorted.bam
  – will create yourdata.sorted.bam.bai
• Now everything’s ready
Visualizing alignment
• IGV (Integrative Genomics Viewer)
• samtools tview yourdata.bam reference
  – example:
    • >samtools tview NA18507_chr8.01.sorted.bam /data/resource/reference/human/UCSC/hg19/BWAIndex/genome.fa