From Sequences to Systems: Accelerating BAM to SAM Parsing on FPGA
Safa Messaoud
Overview
1. Genome Sequencing and Variant Calling
2. SAM to BAM Parsing is a computational bottleneck in the Variant Calling pipeline
3. Propose SAM2BAM, an FPGA-based Accelerator for SAM to BAM parsing
• Developed during a summer internship at IBM Research Tokyo
• In patenting process
• 10x speed up compared to the SW baseline
4. Conclusion and Future Work
Genome Sequencing
• DNA Molecule: the instruction molecule that contains the code for transforming an organism from a single cell to a complex system, like a human being.
• The code is a sequence of 4 types of nucleotides/bases: A, T, C and G
• Human Genome Length: ~ 3.1 billion base pairs
• Genome Sequencing: Process of determining the order of A, T, C and G in a DNA molecule
• Read Length: ~ 200 nucleotides
ATCGATTTTCGTAACCTAGCTGCTAGA
ATCGAT…TGCTGTAGCTA…ACGAC
Sequencing Machine
Variant Calling
Reference Genome
Sample Genome
A C A G G T - T G T A T A … A T G C
A C G G G T A - G T G G T … A T G C
• Goal: Identify variations in the genome that could be associated with certain diseases, given the sequencing data
• Variants are detected by comparing a sample genome to a reference genome.
• Variants are of 3 types:
  • Single Nucleotide Polymorphism (SNP)
  • Insertion/Deletion (INDEL)
  • Structural Variant (SV)
• Best Practice Workflow: The Genome Analysis Toolkit (GATK) Variant Calling pipeline
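To make the comparison on this slide concrete, the sketch below labels each differing column of the two gapped sequences shown above as a SNP, an insertion or a deletion. It assumes the first sequence is the reference and the second is the sample, and the function name `classify_columns` is hypothetical. It is only an illustration of the idea: it does not detect structural variants, and real variant callers reason statistically over many aligned reads rather than a single pair of strings.

```python
from collections import Counter


def classify_columns(ref_aln: str, sample_aln: str):
    """Label each differing column of two gapped, equal-length alignments."""
    if len(ref_aln) != len(sample_aln):
        raise ValueError("aligned sequences must have equal length")
    calls = []
    for col, (r, s) in enumerate(zip(ref_aln, sample_aln)):
        if r == s:
            continue  # identical column: no variant
        if r == "-":
            calls.append((col, "insertion", s))   # base present only in the sample
        elif s == "-":
            calls.append((col, "deletion", r))    # base present only in the reference
        else:
            calls.append((col, "SNP", f"{r}>{s}"))  # single-base substitution
    return calls


# Sequences copied from the slide (assumed order: reference, then sample).
ref = "ACAGGT-TGTATA"
sample = "ACGGGTA-GTGGT"
calls = classify_columns(ref, sample)
counts = Counter(kind for _, kind, _ in calls)
for call in calls:
    print(call)
print(dict(counts))
```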
Variant Calling Pipeline -Alignment-
Reference Genome A C A G G T G A A T G C
Sample Genome
A C A G T G C
C A G G G A A T
A G G T
Goal: Map each short read to a position in the reference sequence
Alignment
Reference .FASTA (~3GB)
Short Reads.FASTQ (~180GB)
.SAM (~1TB)
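The alignment goal stated above (map each short read to a position in the reference) can be sketched with a brute-force search that tries every offset and keeps the one with the fewest mismatches. This is an illustrative assumption, not the aligner used in the pipeline: production aligners rely on indexed, gap-aware algorithms. The helper name `best_position` is hypothetical, and the reference string is the one drawn on the slide.

```python
def best_position(reference: str, read: str):
    """Return (position, mismatches) for the offset with the fewest mismatches."""
    if not read or len(read) > len(reference):
        raise ValueError("read must be non-empty and no longer than the reference")
    best = None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        # Ungapped comparison: count positions where the bases differ.
        mismatches = sum(a != b for a, b in zip(window, read))
        if best is None or mismatches < best[1]:
            best = (pos, mismatches)
    return best


reference = "ACAGGTGAATGC"
for read in ("AGGT", "GAAT", "AGCT"):
    print(read, best_position(reference, read))
```

The cost of this search grows with the product of reference length and read length, which is one reason it is only usable on toy inputs.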
Variant Calling Pipeline -SAM to BAM Conversion-
Goal: Compress SAM file for more efficient storage and communication
• SAM
  • Sequence Alignment/Map format
  • ASCII File
  • Stores Alignment Data
• BAM
  • Binary and Compressed SAM
  • Compression ratio: ~4
Alignment
Reference .FASTA (~3GB)
Short Reads.FASTQ (~180GB)
.SAM (~1TB)
SAM 2 BAM Conversion(SAMTools)
.BAM (~250GB)
Variant Calling Pipeline -Data Cleaning-
Goal: Compensate for inaccuracies in the sequencing and alignment
Alignment
Reference .FASTA (~3GB)
Short Reads.FASTQ (~180GB)
.SAM (~1TB)
SAM 2 BAM Conversion(SAMTools)
.BAM (~250GB)
Data Cleaning
.BAM (~250GB)
Variant Calling Pipeline -Variant Calling-
Goal: Determine the most likely base when there are mismatches between the aligned short reads and the corresponding base in the reference sequence.
Alignment
Reference .FASTA (~3GB)
Short Reads.FASTQ (~180GB)
.SAM (~1TB)
SAM 2 BAM Conversion(SAMTools)
.BAM (~250GB)
Data Cleaning
Variant Calling
Variants.VCF (
Accelerating the Variant Calling Pipeline
Alignment
Reference .FASTA (~3GB)
Short Reads.FASTQ (~180GB)
~4h
.SAM (~1TB)
SAM 2 BAM Conversion(SAMTools) ~4h
.BAM (~250GB)
Data Cleaning ~13h
Variant Calling ~4h
Variants.VCF (
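The stage timings above can be summed to see how much of the pipeline each stage accounts for. The short calculation below is a sketch under two assumptions that the slide does not state explicitly: the first "~4h" label belongs to the Alignment stage, and the four stages run strictly one after another.

```python
# Approximate stage durations in hours, as read from the slide.
stage_hours = {
    "Alignment": 4,               # assumption: the first "~4h" label is Alignment
    "SAM 2 BAM Conversion": 4,
    "Data Cleaning": 13,
    "Variant Calling": 4,
}

total_hours = sum(stage_hours.values())
shares = {stage: hours / total_hours for stage, hours in stage_hours.items()}

for stage, hours in stage_hours.items():
    print(f"{stage:22s} ~{hours:2d} h  {shares[stage]:6.1%}")
print(f"Total ~{total_hours} h")
```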
SAM2BAM Conversion
SAM and BAM Files Alignment
[Figure: short reads R01 and R02 aligned to the reference REF (sequences shown: GCATGTTAGATAAGATAGCTTATCGAT and GCATGTTTAGCTTATCG). In the SAM file each read becomes one record (Record1, Record2, …) made of 11 mandatory fields F1 to F11 followed by N optional fields.]
QNAME: Short Read Name
RNAME: Reference Name
POS: Mapping Position
QUAL: Quality Score
SEQ: Short Read Sequence
…
TAG 1 - TAG N (Additional Sequencing/Alignment Information)
SAM and BAM Files Alignment
[Figure: short reads R01 and R02 aligned to the reference REF (sequences shown: GCATGTTAGATAAGATAGCTTATCGAT and GCATGTTTAGCTTATCG). Each SAM record (F1 … F11, …, FN) is shown next to its BAM counterpart.]
SAM FORMAT: F1 F2 F3 F4 F5 F6 F7 F8 F9 F10 F11 … FN
BAM FORMAT
1. After Parsing: g1(F1) g2(F2) g3(F3) g4(F4) g5(F5) g6(F6) g7(F7) g8(F8) g9(F9) g10(F10) g11(F11) … gN(FN)
2. After Compression: 0111011100010100110110111110111110101010101001101001010101001…
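The two steps on this slide (field-wise parsing g1 … gN, then compression) can be sketched as follows. This is a deliberately simplified illustration and not the real BAM record layout: it only converts a few mandatory fields, the byte layout chosen by `parse_record` is an assumption made for this example, and plain `zlib` stands in for the block compression used by real tools. The 4-bit base codes follow the `=ACMGRSVTWYHKDBN` table from the SAM/BAM specification.

```python
import struct
import zlib

# 4-bit code for each base symbol, as defined in the SAM/BAM specification.
NIBBLE = {base: code for code, base in enumerate("=ACMGRSVTWYHKDBN")}


def pack_seq(seq: str) -> bytes:
    """Pack bases two per byte (high nibble first); odd lengths are padded with 0."""
    codes = [NIBBLE[b] for b in seq.upper()]
    if len(codes) % 2:
        codes.append(0)
    return bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))


def parse_record(line: str) -> bytes:
    """Toy stand-in for g1..g11: turn one SAM text line into a binary blob."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM record has 11 mandatory fields")
    qname, flag, rname, pos, mapq = fields[:5]
    seq = fields[9]
    # Numeric text fields become fixed-width little-endian integers.
    header = struct.pack("<HiBI", int(flag), int(pos), int(mapq), len(seq))
    return header + qname.encode() + b"\0" + rname.encode() + b"\0" + pack_seq(seq)


line = "R01\t0\tREF\t8\t60\t4M\t*\t0\t0\tACGT\tIIII\n"
blob = parse_record(line)        # step 1: after parsing
packed = zlib.compress(blob)     # step 2: after compression
print(len(line), len(blob), len(packed))
```

On a single short record the compressed output is not expected to be smaller than the input; compression pays off only on large batches of records.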
SAM to BAM Parsing
Parsing is a Bottleneck in the SAM 2 BAM Conversion Stage
The single-threaded SAM parser cannot provide sufficient records to multiple compression threads.
• Parsing Stage is single-threaded
• Gzip Compression Stage is multi-threaded
• Profiling Experiment
  • Platform: IBM Power System S822L (two 12-core 3.02 GHz POWER8 cards, 512 GB memory) → Up to 192 threads are supported
  • Benchmark: Cancer cell line G15512.HCC1954.1 (variable length records)
• Profiling Results
  • On average, only 5.72% (11 out of 192) of the hardware threads are used
[Figure: the SAM input feeds a single Parser, which feeds the compression threads Gzip_0, Gzip_1, …, Gzip_N.]
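The thread count and utilization quoted in the profiling experiment can be checked with a few lines of arithmetic. The sketch assumes 8-way simultaneous multithreading (SMT8) per core, which is what makes two 12-core cards add up to 192 hardware threads.

```python
sockets = 2            # two POWER8 cards
cores_per_socket = 12
smt = 8                # assumption: SMT8, i.e. 8 hardware threads per core

hardware_threads = sockets * cores_per_socket * smt
busy_threads = 11      # average number of threads kept busy, from the profiling result
utilization = busy_threads / hardware_threads

print(f"{hardware_threads} hardware threads, utilization about {utilization:.1%}")
```

With a single parser as the only producer, adding more compression threads does not raise this figure; the remaining threads simply wait for records.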
Accelerating SAM2BAM Parsing
• Opportunities for parallelization
  • Independent fields/records can be processed in parallel
• Challenges
  • Each field is processed by a different parsing function
  • Each parsing function operates on variable-size fields/records
  • Parsing functions perform few operations on small chunks of data
• A versatile computation platform is desired
  • Must be able to achieve fine-grained parallelism
  • Must be able to efficiently execute serial routines
FPGA can do all!
Accelerator Design
Design Space Exploration
[Figure: input data arrives from the cache at 64 B/cycle (bytes 1 to 64), e.g. F8 \t F9 \t F10 \t F11 \n F1 … F8. An FSM with one state per field (F1, F2, …, F11, O1, …, ON) advances on each \t and returns to Record Start on \n; each state asserts its enable signal (e.g. en_F11=1, en_O1=1).]
• Use an FSM for control flow
  • States encode the different fields in one record
  • Transition to the next state is triggered by a tab/newline
  • Output of each state is an enable signal to the corresponding Processing Unit
• Advantage
  • Simple design
• Disadvantage
  • Low throughput: 1 Byte/cycle → Processing Units are under-utilized
Can we do better?
Desired Throughput: 64 Bytes/cycle
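The FSM option described above can be modeled in software as a loop that consumes one byte per "cycle". The sketch below is a behavioral illustration only, with the hypothetical name `fsm_field_tracker`; states are numbered 1 to 11 for the mandatory fields and 12 and up for optional fields, which is a numbering chosen for this example.

```python
TAB, NEWLINE = 0x09, 0x0A


def fsm_field_tracker(stream: bytes):
    """Yield (cycle, field_state, byte) for every payload byte, one byte per cycle."""
    field = 1  # Record Start: the first field of a record
    for cycle, byte in enumerate(stream):
        if byte == TAB:
            field += 1        # tab: move to the next field state
        elif byte == NEWLINE:
            field = 1         # newline: return to Record Start
        else:
            # The current state acts as the enable signal for one processing unit.
            yield cycle, field, byte


stream = b"ab\tc\nd"
for cycle, field, byte in fsm_field_tracker(stream):
    print(f"cycle {cycle}: en_F{field}=1 byte={chr(byte)!r}")
```

Because exactly one byte is examined per cycle, a 64-byte input line takes 64 cycles to consume, which is the throughput gap the slide points out.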
SAM2BAM Accelerator
Deeply pipelined design
Throughput: 64 Bytes/cycle
Bit Level Parallelism
Data Level Parallelism
Instruction Level Parallelism
Data Level Parallelism
Dispatcher Design Challenges
Challenges
• An input buffer might contain data from different records
• The number of fields varies across different records (11 mandatory + N optional fields)
• The size of every field varies across different records
• One record can be fragmented across different cycles (first field type is unknown)
[Figure: a 64-byte input buffer (bytes 1 to 64) in which tab and newline delimiters mark the end of Record1 and the start of Record2.]
Goal: Determine types and starting positions of the different fields in the input buffer in one clock cycle
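[Figure: a record fragmented across two consecutive 64-byte input buffers (Cycle1, Cycle2): the first buffer ends in the middle of field F3 and the next buffer begins with the rest of F3.]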
Statistical Estimation of Design Parameters
• Analyzed 34 SAM file records
• Derived tight confidence intervals on critical design parameters
• In case of violation of one of these assumptions, an Error Signal is sent to the application

Design Parameter                     Minimum   Maximum
NB optional fields per record        1         11
NB fields in the Input buffer        1         11
NB of records in the Input buffer    1         2
Size of FLAG                         1 Byte    4 Bytes
Size of POS/PNEXT                    1 Byte    9 Bytes
MAPQ                                 1 Byte    3 Bytes
TLEN                                 1 Byte    10 Bytes
TAG_VALUE                            1 Byte    5 Bytes
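A software analogue of the error signal mentioned above is a check that a record stays within the per-record bounds from the table. The sketch below is an assumption-laden illustration: the function name `violations` is hypothetical, field sizes are measured as ASCII text length, and the TAG_VALUE bound is applied to the part of each optional field after the second colon (the `TAG:TYPE:VALUE` layout of SAM optional fields). The buffer-level limits (fields and records per input buffer) are not checked here.

```python
# Order of the 11 mandatory SAM fields.
MANDATORY = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
             "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

# (minimum, maximum) size in bytes, copied from the design-parameter table.
FIELD_BYTES = {"FLAG": (1, 4), "POS": (1, 9), "PNEXT": (1, 9),
               "MAPQ": (1, 3), "TLEN": (1, 10)}
OPTIONAL_FIELDS = (1, 11)
TAG_VALUE_BYTES = (1, 5)


def violations(line: str) -> list:
    """Return the list of design assumptions that this SAM record violates."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(MANDATORY):
        return ["fewer than 11 mandatory fields"]
    found = []
    named = dict(zip(MANDATORY, fields))
    for name, (lo, hi) in FIELD_BYTES.items():
        if not lo <= len(named[name]) <= hi:
            found.append(f"{name} is {len(named[name])} bytes")
    optional = fields[len(MANDATORY):]
    lo, hi = OPTIONAL_FIELDS
    if not lo <= len(optional) <= hi:
        found.append(f"{len(optional)} optional fields")
    lo, hi = TAG_VALUE_BYTES
    for tag in optional:
        value = tag.split(":", 2)[-1]  # text after TAG:TYPE:
        if not lo <= len(value) <= hi:
            found.append(f"TAG_VALUE {value!r} is {len(value)} bytes")
    return found


good = "R01\t0\tREF\t8\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0"
bad = "R01\t12345\tREF\t8\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0"
print(violations(good))
print(violations(bad))
```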
Dispatcher Design
[Figure: every byte of the 64-byte Input Buffer is compared against the delimiters, producing flags t1-t64 and n1-n64. The PosDeterminePipeline derives the start positions of the fields (pos1-pos11); the StateDeterminePipeline derives the type/number of the fields (en1-en11). The Data Router receives Start1_11, length1_11, en1_11 and done1_11.]
10 stages
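The dispatcher's job (find the type and start position of every field in one input buffer) can be modeled behaviorally as below. This is a sequential software sketch, not the pipelined hardware: the hardware evaluates all comparators in parallel, whereas the loop here walks the flags one by one. The function name `dispatch`, the dictionary keys and the convention of carrying the field state to the next buffer are assumptions made for this illustration; `t` and `n` mirror the t1-t64 and n1-n64 flags in the figure.

```python
TAB, NEWLINE = 0x09, 0x0A


def dispatch(buffer: bytes, first_field: int = 1):
    """Model one cycle: locate every field piece in the buffer.

    Returns (pieces, next_field), where next_field is the field state to
    carry into the following buffer.
    """
    t = [b == TAB for b in buffer]      # tab comparator outputs
    n = [b == NEWLINE for b in buffer]  # newline comparator outputs
    pieces = []
    field, start = first_field, 0
    for i in range(len(buffer)):
        if t[i] or n[i]:
            # A delimiter closes the current field piece.
            pieces.append({"field": field, "start": start,
                           "length": i - start, "done": True})
            field = 1 if n[i] else field + 1
            start = i + 1
    if start < len(buffer):
        # The buffer ends mid-field: the field continues in the next cycle.
        pieces.append({"field": field, "start": start,
                       "length": len(buffer) - start, "done": False})
    return pieces, field


pieces, carry = dispatch(b"AB\tCD\nEF", first_field=3)
for piece in pieces:
    print(piece)
print("carry field state:", carry)
```

Carrying the field state between calls is what handles a record fragmented across two buffers: the next call starts with `first_field=carry`.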
Results
Metrics Values
Speedup 10X
Latency 20 cycles
Resource Utilization 8%
Conclusion
• Customized Hardware to enable Precision Medicine
• FPGA implementation of the SAM to BAM parsing stage that exploits:
• Bit-level parallelism
• Instruction-level parallelism
• Data-level parallelism
• ~10x acceleration of SAM to BAM parsing, compared to the baseline SW.
• Estimated Speed-up of the SAM to BAM conversion
Personalized Medicine
ATCGATTGTTCGTAACCTAGCTGCTAGA
Variant 1
Medical Records Family History
Medical Tests
Variants
NGS Machine
Variant 2
Personalized Diagnosis
THANK YOU!
Fast and Accurate Mutation Detection with Dr. Takeshi Ogasawara at IBM Research Tokyo
Photo taken for the IBM Instagram Campaign #InTheLab
SAM2BAM Multithreaded Implementation
1. First run through the SAM file to determine the records' start positions
2. Each record is assigned to a thread
• Challenges
  • Fields are processed differently
  • Fields are of different size across records
  → High branch misprediction rate
  → SIMD is not efficient
Method1
Record    Start Address
1         0x1524562
…         …
10000     0x4521682
[Figure: Record1, Record2, … are assigned to Thread_1 … Thread_N.]
While (Not EndOfRecord)
    Read (c)
    While (c != tab)
        case (field)
            F1: S1 ← [S1 c]
            …
            FN: SN ← [SN c]
    field ← field+1
    case (field-1)
        F1: B1 = foo1(S1)
        F2: B2 = foo2(S2)
        …
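Method1 (a first pass to find record start positions, then one record per thread) can be sketched in Python as below. The function names are hypothetical, and the per-record work is reduced to splitting on tabs instead of the field-specific functions foo1 … fooN. Note that threads in standard CPython do not execute Python bytecode in parallel, so this shows the structure of the method rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def record_offsets(data: bytes):
    """First pass: byte offset at which every record starts."""
    offsets, pos = [], 0
    while pos < len(data):
        offsets.append(pos)
        end = data.find(b"\n", pos)
        pos = len(data) if end == -1 else end + 1
    return offsets


def parse_at(data: bytes, start: int):
    """Second pass, run per record: split the record at `start` into fields."""
    end = data.find(b"\n", start)
    end = len(data) if end == -1 else end
    return data[start:end].split(b"\t")


def parse_parallel(data: bytes, workers: int = 4):
    """Assign each record to a worker thread; results keep the record order."""
    offsets = record_offsets(data)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda start: parse_at(data, start), offsets))


data = b"r1\t0\tREF\nr2\t16\tREF\n"
print(record_offsets(data))
print(parse_parallel(data))
```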
SAM2BAM Multithreaded Implementation
1. Read a serial stream from a SAM file
2. Dispatch each field to a different thread
• Challenges
  • Parsing functions do few computations on small chunks of data → Communication overhead
  • Threads are idle until the whole field is read
Method2
[Figure: the Dispatcher reads the SAM stream and hands Field1 … Field_N to separate threads.]
Profiling of the Alignment Stage
System: Power8 (20 cores, SMT8 enabled)
[Figure: % CPU cycles]
Jasper, M,J. “Acceleration of Read Alignment with Coherent Attached FPGA coprocessor”. Master Thesis. TU Delft (2015)
Performance does not scale with the number of threads
Power8 Specifications
Produced                        2013
Designed by                     IBM
Max. CPU clock rate             2.5 GHz to 5 GHz
Min. feature size               22 nm
Instruction Set Architecture    PowerISA v.2.07
Cores                           Up to 12 SMT8
L1 cache                        64 (D) + 32 (I) KB per core
L2 cache                        512 KB per chiplet
L3 cache                        96 MB
L4 cache                        128 MB
Other features                  Coherent Accelerator Processor Interface (CAPI)
Coherent Accelerator Processor Interface
CAPI (Coherent Accelerator Processor Interface)
• Enables a client-defined hybrid-computing engine to act as a peer to the multiple POWER8 cores
• The accelerator uses the application's virtual address space
  → It becomes another thread of the application (shared memory/PT/…).
• The FPGA cache is fully coherent with the caches in the POWER8 cores.
  → Use a hardware-managed 256 KB cache in the FPGA.
Typical I/O Model Flow is inefficient
http://www.nallatech.com/wp-content/uploads/Ent2014-CAPI-on-Power8.pdf
Coherent Accelerator Processor Interface
• CAPP
  • provides coherency by snooping the bus on the POWER8 CPU on behalf of the accelerator
• PHB
  • provides connectivity between the CAPP and the PCIe link
• PSL
  • provides a number of independent interfaces to the AFU
  • handles virtual-to-physical memory translations
  • contains a 256 KB resident cache on behalf of the accelerator
http://www.nallatech.com/wp-content/uploads/Ent2014-CAPI-on-Power8.pdf