MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei

Genome Informatics I (2015 Spring)

MES7594-01 Genome Informatics I

- Lecture V. Short Read Alignment

Sangwoo Kim, Ph.D.
Assistant Professor, Severance Biomedical Research Institute, Yonsei University College of Medicine


Overview
• Goal of this lecture
  – You will learn the principle of mapping NGS short reads to a reference genome and practice with alignment tools
• Short Read Alignment Theory
  – Why do we need a special algorithm?
  – The Burrows-Wheeler Transformation (BWT)
    • BWT indexing
    • LF search
    • Examples
• Practice with BWA
  – with NA18507 sequences
• Understanding alignment information
  – Viewing/Converting SAM/BAM format
  – Interpreting alignment information


SHORT READ ALIGNMENT THEORY


RAW NGS DATA (FASTQ)

@SRR764745.4352210/1
TAACACCTGGGAAATTCATCACAAAAAGATCTTAGCCTAGGCACATTGTCATTAGGTTATCCAAAGTTAAGACAAAGGAAAGAATCTTAAGAGCTGTGAGA
+
5FIFEFHFGHHEFFEEIFFIFHFGGGGKGFJHFEKJJIFKKJGHGGGJFKHGGGLLFGGHLKHJJMGGGJNJKIJJLLIIIKJIHIKJEGFACGEEEDC>
[email protected]/1
ATATATGAAGGAAAGATACAGTCATTTTCAGACAAACAAATGCTGACAGAATTTGCCATTACCAAGCCAGGACTCTAAGAACTGCTAAAAGGAGCTCTAAA
+
6FFDBDGDEGFEEEGEDBEEFDFEEDEEFFGEEFGFFFGFEHGGHEFFGFGEFFHGGFFFDGGGGHGGGHHGFHGGEGHGHFGIIGCFFFED?ADC>B<>>
@SRR764746.2695391/1
TAAAAGAGACAAAGAGAGACAGTATATCATCTGTCATCTGACAGTCTCATCCAACAGAAAAATATGACAATCCTAAACATATGTGAACCTAACACTGGAGC
+
6FIEEFDFEEEFEFEFEFEEEFDBECEFFEFFGEFFEFGHEFFGDGGFFEEGFGFFHFGGGGEDFHFFGHFGFHFGGGFFEFIGJFGGIHBDECCCD?;>
[email protected]/1
TTAAATAACCTGCTCCTGAATGAGCATTGGGTGAAAAACGAAATCAAGATGGAAATGTAAAAAATTTCTTCGAACTGGATGACACAACCTATCAAGACCTC
+
5FBCC@A*CHDFDDDDEFBDDGADFCBDFFEEGEGADEEAE4DEFFEGBEHE8;ADHD@DGGFCGDEDGFB==B?GNG@FMC@JFF>:FG=DDED=&>@A#
@SRR764746.5506495/1
CACAACCTATCAAGACCTCTGGGATACAGCAAAGGCAGTGCTAAGAGGAAAGTTTATAGCACTAAACACCTACGTCGAAAAGTCTGAAAGAGCACAGACAA
+
5HIDDDEEBDEEEFEEEFEFGFFEECFFGFFFFGFFFGDHGGCFGFGGFGGHDEFDFDHGGFGDGGFGFGFDFAEFBCFFFFJDIKCEEFACFBCA?;A@
[email protected]/1
CCATAGAAAGGAATGAATTAACAGCATTTCCTGTGACCTGGACGAGATTGGAGACTATTGTTCTAAGTGATGTAACCCAGGAATGGAAAACTCAACATTGT
+
5IHCBE@EEFFDEDGDEDDCFEEGFEEEDFDFGEHEFFFHEBHABHDEDHGDGFFGDFFHEEGGDGHFIFFIEDGFGHGHHCJCIGCEEEHFAB?B@<
[email protected]/1
TGTCCTTTCCAGGGACATGGATGAAGCTGGAAACCATCATTCTCAGCAAACTAACACAAGAAAAGAAAACCAGGCCAGGAGCAGTGGCTCATGCCTGTAGT
+
5JIAIHEDHHDHGGFFFEIJFFHDCIHHHKFGHIIGGFGGGGHIGDGGIIIIGGJGFGGIIFHHKHIJIJKHLKILGCIIHMHKDKMLKFJBHHHBGFABB
@SRR764745.944258/1
GAGAACACATGGACACAGGGAGGGGAACATCACACACTGGGGCCTGTCAAAGGGTGGGAGGCTGGGGGAGGAACAGCATTAGGAGAAATACCTAATGTAGA
+
5FFDEFEFEDIH?CECEHEHCHIJI>BCCCIDFFFFIHIBHBHFAAFEGGFHMM8FDCDGIEHGAGG@BGAAFKH?6>DKDDNIK?9<FHGBICDBG@<<
[email protected]/1
TGGGGAAAAAAAACATTCTCTGAAATTTGCTTTTATACCATTAAAGACTTATTTTTTATTACCAGCAATACAGGGCAACTCATTCAGGTTGAATCTTGAAG
+
6NMHHFBGGFFEGHEEEIHIDIFGFDFFHFFEFEEGFIJGGGEHHLHIJEFHGHGHFFGGFJKHJJHHFFMHKNBEIFMMGLEIGJHMJCM@CA?FCD;GB
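Each FASTQ record takes four lines: the read name after "@", the bases, a "+" separator, and one quality character per base. A minimal reading sketch in Python follows. The toy record and the Phred offset of 33 are assumptions made for illustration; they are not taken from the NA18507 data.

```python
import io

def read_fastq(handle, phred_offset=33):
    """Yield (name, sequence, qualities) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed record: {header!r}")
        # Assumed encoding: quality score = ASCII code minus phred_offset.
        yield header[1:], sequence, [ord(c) - phred_offset for c in quality]

# Hypothetical toy record, not real sequencing data.
toy = "@read1/1\nGATTCAAA\n+\nIIIIFFFF\n"
records = list(read_fastq(io.StringIO(toy)))
for name, sequence, qualities in records:
    print(name, sequence, qualities)
```

For a real file, the same function should work on `open("yourdata.fastq")` in place of the `io.StringIO` object.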


Mapping back to genome

TAACACCTGGGAAATTCATCACAAAAAGATCTTAGCCTAGGCACATTGTCATTAGGTTATCCAAAGTTAAGACAAAGGAAAGAATCTTAAGAGCTGTGAGA

Where is this sequence in the human genome?


Do this as fast as possible!


Brute-force way

Find “GATTCAAA” in the human genome

T G A C G T G T G A T T C A A A A A A G C ...   The reference genome (chr1, start). This is very long (3 billion bases)
G A T T C A A A                                 Your query
  G A T T C A A A
    G A T T C A A A
      G A T T C A A A
        ...
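The sliding comparison above can be sketched in a few lines of Python. This is only an illustration on the 21-base toy reference from the slide; the function name `naive_find` is made up here.

```python
def naive_find(reference, query):
    """Return every 0-based offset where query matches reference exactly."""
    hits = []
    # Slide the query along the reference one base at a time.
    for offset in range(len(reference) - len(query) + 1):
        if reference[offset:offset + len(query)] == query:
            hits.append(offset)
    return hits

reference = "TGACGTGTGATTCAAAAAAGC"  # toy reference from the slide
query = "GATTCAAA"
print(naive_find(reference, query))
```

Every offset costs up to one comparison per query base, so the work grows with (reference length) x (read length). That is why this approach does not scale to 3 billion bases and about a billion reads.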


How fast should it be?

                     time per 1 read (sec)   time per 80x WGS (sec)   is equal to
eyeballing           3 x 10^9                3.6 x 10^18              1 x 10^11 yrs
naïve matching       2400                    1.2 x 10^9               7,608 yrs
improved algorithm   3                       3.6 x 10^8               10 yrs
minimum required     0.01                    1.2 x 10^7               11.5 days
desired              0.001                   1.2 x 10^6               1.2 days

Based on 200 bp read length, 80x single-end WGS.
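The arithmetic behind such a table can be sketched as follows, using only the stated assumptions (3 billion bases, 80x coverage, 200 bp single-end reads). Only the "eyeballing" row is reproduced here; the other rows may rest on assumptions that are not spelled out on the slide.

```python
GENOME_SIZE = 3e9                   # bases in the human genome
COVERAGE = 80                       # 80x whole-genome sequencing
READ_LENGTH = 200                   # bp, single-end
SECONDS_PER_YEAR = 365 * 24 * 3600

# Number of reads needed to cover the genome 80 times.
n_reads = COVERAGE * GENOME_SIZE / READ_LENGTH

def total_seconds(seconds_per_read):
    """Time to map every read at a given per-read cost."""
    return seconds_per_read * n_reads

eyeballing = total_seconds(3e9)     # 3 x 10^9 sec per read
print(f"reads needed: {n_reads:.1e}")
print(f"eyeballing: {eyeballing:.1e} sec = {eyeballing / SECONDS_PER_YEAR:.1e} years")
```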


Searching with index
• Assume you’re searching for “genome” in an English dictionary
  – You don’t search every line on every page
  – You first find the page range of “g” in the dictionary
  – in the above range (of ‘g’), you find the page range of “ge” in the dictionary
  – in the above range (of ‘ge’), you find the page range of “gen” in the dictionary
  – ...
  – until you find “genome”
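The dictionary lookup above can be sketched with a sorted word list and binary search. The word list and the function name `narrow_by_prefix` are invented for this illustration.

```python
from bisect import bisect_left, bisect_right

def narrow_by_prefix(sorted_words, word):
    """Shrink a [lo, hi) range of sorted_words one prefix letter at a time."""
    lo, hi = 0, len(sorted_words)
    for length in range(1, len(word) + 1):
        prefix = word[:length]
        # Entries starting with `prefix` sort between prefix and prefix + a very high character.
        lo = bisect_left(sorted_words, prefix, lo, hi)
        hi = bisect_right(sorted_words, prefix + "\uffff", lo, hi)
        if lo == hi:
            print(f"{prefix!r}: no entries")
            break
        print(f"{prefix!r}: entries {lo}..{hi - 1}")
    return lo, hi

# Hypothetical mini dictionary (must be sorted).
words = ["apple", "gate", "gene", "genome", "genus", "giraffe", "zebra"]
result = narrow_by_prefix(words, "genome")
```

Each step only looks inside the range left by the previous step, which is the same idea the genome index uses.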


Indexing genome

• We are going to make an index for the genome
  – to make it possible to search for a read sequence as we do in an English dictionary

Burrows-Wheeler Transformation

BANANA


BANANA$
($ is the end marker, lexicographically the smallest symbol)

All rotations of BANANA$:
0 BANANA$
1 ANANA$B
2 NANA$BA
3 ANA$BAN
4 NA$BANA
5 A$BANAN
6 $BANANA

After sorting (sorted row, original rotation number, rotation):
0 6 $BANANA
1 5 A$BANAN
2 3 ANA$BAN
3 1 ANANA$B
4 0 BANANA$
5 4 NA$BANA
6 2 NANA$BA




Last column of the sorted rotations: ANNB$AA

BWT(“BANANA$”) = “ANNB$AA”
1. BWT only changes the order of the characters in the string
2. BWT tends to collect similar characters together
3. With only the transformed string, we can easily get the original string back
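The rotate, sort, and take-last-column recipe can be sketched directly in Python. This version builds every rotation explicitly, so it is only suitable for toy strings; real indexers use far more memory-efficient constructions.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (toy strings only)."""
    assert text.endswith("$"), "append the end marker '$' first"
    rotations = [text[i:] + text[:i] for i in range(len(text))]
    # Sort the rotation numbers by rotation; this ordering is the suffix array.
    order = sorted(range(len(text)), key=lambda i: rotations[i])
    last_column = "".join(rotations[i][-1] for i in order)
    return last_column, order

transformed, suffix_array = bwt("BANANA$")
print(transformed)    # last column of the sorted rotations
print(suffix_array)   # original rotation number of each sorted row
```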

Inverse BWT

We are given “ANNB$AA”

1. Write the BWT as a column (this is the last column): A N N B $ A A
2. Sort it to get the first column: $ A A A B N N
3. Attach the last column in front of the sorted rows: A$ NA NA BA $B AN AN
4. Sort: $B A$ AN AN BA NA NA
5. Attach the last column (ANNB$AA) again and sort again
6. Repeat until every row is as long as the input; the row that ends with “$” is the original string, BANANA$
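The attach-and-sort loop can be written as a short Python sketch. It assumes "$" is the end marker and sorts before every other symbol, as in the BANANA example.

```python
def inverse_bwt(last_column):
    """Rebuild the original string by repeatedly prepending the BWT and sorting."""
    rows = [""] * len(last_column)
    for _ in range(len(last_column)):
        # Attach the last column in front of every row, then sort the rows.
        rows = sorted(ch + row for ch, row in zip(last_column, rows))
    # After len(last_column) rounds the rows are the sorted rotations;
    # the one ending with the end marker is the original string.
    return next(row for row in rows if row.endswith("$"))

original = inverse_bwt("ANNB$AA")
print(original)
```

This naive version re-sorts the whole table in every round; it is meant to mirror the slides, not to be efficient.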


LF Search

Question: Find “NAN” in BANANA

Match the query from its last character backward: “N”, then “AN”, then “NAN”.

The range of strings that start with “N” can be calculated from:
• the number of symbols that are lexicographically less than ‘N’
  – to determine the start point
  – = 5
• the number of ‘N’
  – to determine the end point
  – = 2, so the range is rows 5-6

The range of strings that start with “AN” can be calculated from:
• the number of symbols that are lexicographically less than ‘A’ (= 1) and the number of ‘A’
  – This is a range for ‘A’, not ‘AN’!!
• the number of symbols that are lexicographically less than ‘A’ + the number of ‘A’ in the last column before the start point
  – to determine the start point
  – count of ‘A’ before start point = 1 (this “Ax” is not “AN” and is less than “AN”)
  – = 1 + 1 = 2
• the number of symbols that are lexicographically less than ‘A’ + the number of ‘A’ in the last column before the end point
  – to determine the end point
  – = 1 + 3 = 4, so the range is rows 2-3

The range of strings that start with “NAN” can be calculated from:
• the number of symbols that are lexicographically less than ‘N’ + the number of ‘N’ in the last column before the start point
  – to determine the start point
  – = 5 + 1 = 6
• the number of symbols that are lexicographically less than ‘N’ + the number of ‘N’ in the last column before the end point
  – to determine the end point
  – = 5 + 2 = 7, so the range is row 6 only

Row 6 of the sorted table is rotation 2 of the original string (NANA$BA)
= the number of rotations of the original string
= “NAN” exists at the 3rd position of “BANANA”

BANANA
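The range updates above can be sketched as a backward search in Python. The names `build_index`, `backward_search`, and `find` are made up for this illustration, and the counts in the last column are recomputed by scanning; a real index precomputes them.

```python
def build_index(text):
    """Return (last_column, suffix_array, C) for a '$'-terminated toy string."""
    order = sorted(range(len(text)), key=lambda i: text[i:] + text[:i])
    last = "".join(text[i - 1] for i in order)  # character preceding each rotation
    # C[ch] = number of symbols in text that are lexicographically less than ch.
    C, total = {}, 0
    for ch in sorted(set(text)):
        C[ch] = total
        total += text.count(ch)
    return last, order, C

def backward_search(query, last, C):
    """Return the half-open row range [start, end) of rotations starting with query."""
    start, end = 0, len(last)
    for ch in reversed(query):
        if ch not in C:
            return 0, 0
        # Count of ch in the last column before the start / end point.
        start = C[ch] + last[:start].count(ch)
        end = C[ch] + last[:end].count(ch)
        if start >= end:
            return 0, 0
    return start, end

def find(query, text):
    """Return sorted 0-based positions of query in text (text must end with '$')."""
    last, order, C = build_index(text)
    start, end = backward_search(query, last, C)
    return sorted(order[start:end])

print(find("NAN", "BANANA$"))  # 0-based, so the 3rd position is reported as 2
```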


Genome query

imported from Mike Schatz’s slides: http://schatzlab.cshl.edu/teaching/2010/Lecture%202%20-%20Sequence%20Alignment.pdf


Inexact matching

T G A C G T G T G A T T C A A A A A A G C

G A T T G A A A

When an exact match does not exist:
• continue with the other possible candidates (G -> A, C, T) and increase the mismatch count
• If another mismatch occurs, branch out again.
• So the allowed edit distance is critical to alignment speed
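The branching idea can be sketched as a backtracking version of the backward search. This is a simplified illustration that handles substitutions only (no insertions or deletions) and is not how BWA itself is implemented; all function names are made up here.

```python
def build_index(text):
    """Return (last_column, suffix_array, C) for a '$'-terminated toy string."""
    order = sorted(range(len(text)), key=lambda i: text[i:] + text[:i])
    last = "".join(text[i - 1] for i in order)
    C, total = {}, 0
    for ch in sorted(set(text)):
        C[ch] = total
        total += text.count(ch)
    return last, order, C

def search_with_mismatches(query, text, max_mismatches):
    """Return sorted 0-based positions where query matches with at most max_mismatches substitutions."""
    last, order, C = build_index(text)
    hits = set()

    def extend(i, start, end, budget):
        if i < 0:                       # whole query consumed: report the rows
            hits.update(order[start:end])
            return
        for ch in C:                    # branch over every symbol in the reference
            if ch == "$":
                continue
            cost = 0 if ch == query[i] else 1
            if cost > budget:
                continue                # no mismatches left for this branch
            new_start = C[ch] + last[:start].count(ch)
            new_end = C[ch] + last[:end].count(ch)
            if new_start < new_end:
                extend(i - 1, new_start, new_end, budget - cost)

    extend(len(query) - 1, 0, len(last), max_mismatches)
    return sorted(hits)

reference = "TGACGTGTGATTCAAAAAAGC$"   # toy reference from the slide plus end marker
print(search_with_mismatches("GATTGAAA", reference, 0))
print(search_with_mismatches("GATTGAAA", reference, 1))
```

Every extra allowed mismatch multiplies the number of branches to explore, which is why the mismatch limit has such a strong effect on speed.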


Goal achieved

                     time per 1 read (sec)   time per 80x WGS (sec)   is equal to
eyeballing           3 x 10^9                3.6 x 10^18              1 x 10^11 yrs
naïve matching       2400                    1.2 x 10^9               7,608 yrs
improved algorithm   3                       3.6 x 10^8               10 yrs
minimum required     0.01                    1.2 x 10^7               11.5 days
desired              0.001                   1.2 x 10^6               1.2 days


PRACTICE WITH BWA


BWA


bwa practice

• In the cluster
  – >bwa


bwa process
• bwa index
  – to index the reference genome (one-time process)
  – = to create the BWT for the reference genome
• bwa aln
  – will calculate suffix array (SA) coordinates
• bwa samse (or bwa sampe for paired-end sequencing)
  – will convert the SA coordinates to chromosomal locations
• Input for bwa
  – reference genome
  – fastq file (the raw NGS data)
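The three steps can be laid out as command lines with a small Python helper. This sketch only builds and prints the commands; it does not run bwa. The helper name and the file names (genome.fa, yourdata.fastq) are placeholders, not paths on the cluster.

```python
def bwa_single_end_commands(reference, fastq, prefix):
    """Return the bwa steps as (argv, output_file) pairs, in the order they are run."""
    sai = prefix + ".sai"
    sam = prefix + ".sam"
    return [
        (["bwa", "index", reference], None),             # one-time: build the BWT index
        (["bwa", "aln", reference, fastq], sai),         # find SA coordinates -> .sai
        (["bwa", "samse", reference, sai, fastq], sam),  # SA coordinates -> SAM
    ]

commands = bwa_single_end_commands("genome.fa", "yourdata.fastq", "yourdata")
for argv, output_file in commands:
    print(" ".join(argv) + (f" > {output_file}" if output_file else ""))
```

To actually run a step from Python, one option would be `subprocess.run(argv, stdout=open(output_file, "w"), check=True)`; on the class cluster the commands are submitted through job scripts instead.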


reference data

“bwa index” will index the reference genome (so that the reference is ready). It is already done here; do not try to do it again.


sequence data

- Pick one chromosome for yourself
- copy the fastq file to your directory
- use the “cp” command to do it
- example (copying chr8 NGS data to the rachmani directory)

>cp NA18507_chr8.* /scratch/2015_GenomeInformatics/rachmani/


run bwa aln

>bwa aln reference yourdata.fastq > yourdata.sai

example
>bwa aln /data/resources/reference/human/UCSC/hg19/BWAIndex/genome.fa NA18507_chr8.01.fastq > NA18507_chr8.01.sai

write a job script: runbwaaln.sh

submit to cluster: >qsub runbwaaln.sh


run bwa samse

>bwa samse reference yourdata.sai yourdata.fastq > yourdata.sam

example
>bwa samse /data/resources/reference/human/UCSC/hg19/BWAIndex/genome.fa NA18507_chr8.01.sai NA18507_chr8.01.fastq > NA18507_chr8.01.sam

write a job script: runbwasamse.sh

submit to cluster: >qsub runbwasamse.sh


the output

>less NA18507_chr8.01.sam

This is your first alignment with real NGS data


break

• Please ask us if you have any problems (do not give up)

• If possible, try mapping in paired-end mode
  – bwa sampe reference data01.sai data02.sai data01.fastq data02.fastq > output.sam


The SAM Format

For more details about the SAM format, please refer to: https://samtools.github.io/hts-specs/SAMv1.pdf
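Each alignment line in a SAM file has 11 mandatory tab-separated fields followed by optional tags. A minimal parsing sketch follows; the example line is invented for illustration and is not taken from the NA18507 output.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

def parse_sam_line(line):
    """Split one SAM alignment line into a dict of the 11 mandatory fields plus tags."""
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    record["tags"] = columns[11:]
    # Two commonly checked FLAG bits.
    record["is_unmapped"] = bool(record["FLAG"] & 0x4)
    record["is_reverse"] = bool(record["FLAG"] & 0x10)
    return record

# Hypothetical alignment line: read1 mapped to the reverse strand of chr8 at position 1000.
example = "read1\t16\tchr8\t1000\t37\t8M\t*\t0\t0\tGATTCAAA\tIIIIIIII\tNM:i:0"
record = parse_sam_line(example)
print(record["RNAME"], record["POS"], record["CIGAR"], record["is_reverse"])
```

Header lines, which start with "@", are not alignment lines and should be skipped before calling the function.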


SAM/BAM

• SAM and BAM are interconvertible (exactly the same information)
• SAM file
  – human-readable text file
• BAM file (binary)
  – not human-readable
  – compressed (much smaller size)
  – can be indexed (for random access)


Converting SAM to BAM

• >samtools view -Sb yourdata.sam > yourdata.bam
  – -S option means input is SAM format
  – -b option means output is BAM format


Sorting and Indexing BAM

• samtools sort yourdata.bam yourdata.sorted
  – will create yourdata.sorted.bam
• samtools index yourdata.sorted.bam
  – will create yourdata.sorted.bam.bai
• Now everything’s ready


Visualizing alignment

• IGV (Integrative Genomics Viewer)


Visualizing alignment

• samtools tview yourdata.bam reference
  – example:
    >samtools tview NA18507_chr8.01.sorted.bam /data/resource/reference/human/UCSC/hg19/BWAIndex/genome.fa

