From Sequences to Systems: Accelerating BAM to SAM Parsing … · 2020. 3. 23. · 11 Mandatory Fields N Optional Fields REF R01 R02 SAM File GCATGTTAGATAAGATAGCTTATCGAT GCATGTT TAGCTTATCG

From Sequences to Systems: Accelerating BAM to SAM Parsing on FPGA

Safa Messaoud

Overview

1. Genome Sequencing and Variant Calling

2. SAM to BAM Parsing is a computational bottleneck in the Variant Calling pipeline

3. Propose SAM2BAM, an FPGA-based Accelerator for SAM to BAM parsing
  • Developed during a summer internship at IBM Research Tokyo
  • In patenting process
  • 10x speed up compared to the SW baseline

4. Conclusion and Future Work


Genome Sequencing

• DNA Molecule: the instruction molecule that contains the code for transforming an organism from a single cell into a complex system, such as a human being.

• The code is a sequence of 4 types of nucleotides/bases: A, T, C and G

• Human Genome Length: ~ 3.1 billion base pairs

• Genome Sequencing: Process of determining the order of A, T, C and G in a DNA molecule

• Read Length: ~ 200 nucleotides

ATCGATTTTCGTAACCTAGCTGCTAGA

ATCGAT…TGCTGTAGCTA…ACGAC

Sequencing Machine


Variant Calling

Reference Genome

Sample Genome

A C A G G T - T G T A T A … A T G C

A C G G G T A - G T G G T … A T G C

• Goal: Identify variations in the genome that could be associated with certain diseases, given the sequencing data

• Variants are detected by comparing a sample genome to a reference genome.
• Variants are of 3 types:
  • Single Nucleotide Polymorphism (SNP)
  • Insertion/Deletion (INDEL)
  • Structural Variant (SV)

• Best Practice Workflow: The Genome Analysis Toolkit (GATK) Variant Calling pipeline


Variant Calling Pipeline -Alignment-

Reference Genome A C A G G T G A A T G C

Sample Genome

A C A G T G C

C A G G G A A T

A G G T

Goal: Map each short read to a position in the reference sequence

Alignment

Reference .FASTA (~3GB)

Short Reads.FASTQ (~180GB)

.SAM (~1TB)


Variant Calling Pipeline -SAM to BAM Conversion-

Goal: Compress SAM file for more efficient storage and communication

• SAM
  • Sequence Alignment/Map format
  • ASCII File
  • Stores Alignment Data

• BAM
  • Binary and Compressed SAM
  • Compression ratio: ~4

Alignment



.SAM (~1TB)

SAM 2 BAM Conversion(SAMTools)

.BAM (~250GB)


Variant Calling Pipeline -Data Cleaning-

Goal: Compensate for inaccuracies in the sequencing and alignment

Alignment



.SAM (~1TB)


.BAM (~250GB)

Data Cleaning

.BAM (~250GB)


Variant Calling Pipeline -Variant Calling-

Goal: Determine the most likely base when there are mismatches between the aligned short reads and the corresponding base in the reference sequence.

Alignment



.SAM (~1TB)


.BAM (~250GB)

Data Cleaning

Variant Calling

Variants.VCF (

Accelerating the Variant Calling Pipeline

Alignment ~4h

.SAM (~1TB)

SAM 2 BAM Conversion (SAMTools) ~4h

.BAM (~250GB)

Data Cleaning ~13h

Variant Calling ~4h

Variants.VCF (

SAM2BAM Conversion


SAM and BAM Files Alignment

[Figure: short reads R01 and R02 aligned to the reference REF, and the SAM file that stores one record per read.]

ALIGNMENT
REF  GCATGTTAGATAAGATAGCTTATCGAT
R01  GCATGTT
R02  TAGCTTATCG

SAM FORMAT (11 Mandatory Fields + N Optional Fields per record)
R01 (Record1)  F1 F2 F3 F4 F5 F6 F7 F8 F9 F10 F11 … FN
R02 (Record2)  F1 F2 F3 F4 F5 F6 F7 F8 F9 F10 F11 … FN

Fields
QNAME          Short Read Name
RNAME          Reference Name
POS            Mapping Position
QUAL           Quality Score
SEQ            Short Read Sequence
….
TAG 1 - TAG N  (Additional Sequencing/Alignment Information)
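As a rough illustration of the record layout above, the following Python sketch splits one SAM line into its 11 mandatory fields and its optional tags. The field names follow the SAM specification; the function name `split_sam_record` and the example record (FLAG, MAPQ, CIGAR, quality string and the `NM` tag) are made up for illustration and are not taken from the talk.

```python
# Names of the 11 mandatory SAM fields, in file order (per the SAM specification).
MANDATORY = ("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
             "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL")


def split_sam_record(line: str):
    """Split one SAM alignment line into mandatory fields and optional tags.

    Hypothetical helper: fields are tab-separated, records end with a newline.
    """
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(MANDATORY):
        raise ValueError("a SAM record needs at least 11 tab-separated fields")
    mandatory = dict(zip(MANDATORY, columns[:11]))
    # Optional fields have the form TAG:TYPE:VALUE.
    optional = [tuple(col.split(":", 2)) for col in columns[11:]]
    return mandatory, optional


# Illustrative record for read R01; all values except SEQ are assumptions.
record = "R01\t0\tREF\t1\t60\t7M\t*\t0\t0\tGCATGTT\tIIIIIII\tNM:i:0\n"
mandatory, optional = split_sam_record(record)
```

Every field is plain ASCII text of variable length, which is what makes the later binary conversion a field-by-field parsing problem.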


SAM and BAM Files Alignment

[Figure: the same alignment and SAM records as on the previous slide, shown next to the corresponding BAM records.]

ALIGNMENT
REF  GCATGTTAGATAAGATAGCTTATCGAT
R01  GCATGTT
R02  TAGCTTATCG

SAM FORMAT
R01  F1 F2 F3 F4 F5 F6 F7 F8 F9 F10 F11 … FN
R02  F1 F2 F3 F4 F5 F6 F7 F8 F9 F10 F11 … FN

BAM FORMAT

1. After Parsing
R01  g1(F1) g2(F2) g3(F3) g4(F4) g5(F5) g6(F6) g7(F7) g8(F8) g9(F9) g10(F10) g11(F11) … gN(FN)
R02  g1(F1) g2(F2) g3(F3) g4(F4) g5(F5) g6(F6) g7(F7) g8(F8) g9(F9) g10(F10) g11(F11) … gN(FN)

2. After Compression
0111011100010100110110111110111110101010101001101001010101001…
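To make the per-field conversions g1 … gN more concrete, here is a small Python sketch of three of them, following the BAM specification: POS becomes a 0-based little-endian 32-bit integer, SEQ is packed at 4 bits per base, and QUAL is stored as raw Phred scores. The function names are hypothetical, and the sketch covers only these three fields, not a complete BAM record.

```python
import struct

# 4-bit nucleotide codes defined by the BAM specification.
SEQ_CODES = "=ACMGRSVTWYHKDBN"


def encode_pos(pos_text: str) -> bytes:
    """SAM POS is 1-based ASCII; BAM stores a 0-based little-endian int32."""
    return struct.pack("<i", int(pos_text) - 1)


def encode_seq(seq: str) -> bytes:
    """Pack two bases per byte, first base in the high nibble."""
    codes = [SEQ_CODES.index(base) for base in seq.upper()]
    if len(codes) % 2:
        codes.append(0)  # pad the unused low nibble of the last byte
    return bytes((hi << 4) | lo for hi, lo in zip(codes[0::2], codes[1::2]))


def encode_qual(qual: str) -> bytes:
    """SAM QUAL is Phred+33 ASCII; BAM stores the Phred score itself."""
    return bytes(ord(ch) - 33 for ch in qual)


packed = encode_seq("GCATGTT")  # 7 bases fit in 4 bytes
```

Each conversion is short and touches only a few bytes, which matches the observation later in the talk that the parsing functions perform few operations on small chunks of data.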


SAM to BAM Parsing


Parsing is a Bottleneck in the SAM 2 BAM Conversion Stage

• Parsing Stage is single-threaded
• Gzip Compression Stage is multi-threaded

• Profiling Experiment
  • Platform: IBM Power System S822L (two 12-core 3.02 GHz POWER8 cards, 512 GB memory) → Up to 192 threads are supported
  • Benchmark: Cancer cell line G15512.HCC1954.1 (variable length records)

• Profiling Results
  • On average, only 5.72% (11 out of 192) of the supported hardware threads are used

The single-threaded SAM parser cannot provide sufficient records to multiple compression threads

[Figure: SAM → Parser → Gzip_0, Gzip_1, …, Gzip_N]
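The structure of this stage can be sketched in Python as one serial parsing loop that feeds a pool of compression threads. This is only a structural model under stated assumptions: `parse` is a hypothetical stand-in for the real field conversions, `zlib.compress` stands in for the compression step, and the batch size and thread count are arbitrary.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def parse(line: bytes) -> bytes:
    """Hypothetical stand-in for the serial parsing step."""
    return b"\x00".join(line.rstrip(b"\n").split(b"\t")) + b"\n"


def convert(lines, n_compress_threads=4, batch_size=2):
    """One parsing loop produces batches; a thread pool compresses them."""
    with ThreadPoolExecutor(max_workers=n_compress_threads) as pool:
        futures, batch = [], []
        for line in lines:  # serial loop: the compressors can only be as busy as this allows
            batch.append(parse(line))
            if len(batch) == batch_size:
                futures.append(pool.submit(zlib.compress, b"".join(batch)))
                batch = []
        if batch:  # flush the last, partially filled batch
            futures.append(pool.submit(zlib.compress, b"".join(batch)))
        return [f.result() for f in futures]


blocks = convert([b"a\tb\n", b"c\td\n", b"e\tf\n"])
```

However many compression threads are configured, they receive work only as fast as the single parsing loop can produce batches, which is the imbalance the profiling result points to.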


Accelerating SAM2BAM Parsing

• Opportunities for parallelization
  • Independent fields/records can be processed in parallel

• Challenges
  • Each field is processed by a different parsing function
  • Each parsing function operates on variable-size fields/records
  • Parsing functions perform few operations on small chunks of data

• Desired is a versatile computation platform
  • Must be able to achieve fine-grained parallelism
  • Must be able to efficiently execute serial routines

FPGA can do all!

Accelerator Design


Design Space Exploration

[Figure: the input stream (F8 \t F9 \t F10 \t F11 \n F1 … F8) arrives from the cache at 64 B/cycle. An FSM with one state per field (F1, F2, …, F11, O1, …, ON) begins at Record Start, advances on every \t and returns to F1 on \n. Each state asserts the enable signal of its Processing Unit, e.g. en_F11=1 or en_O1=1.]

• Use an FSM for control flow
  • States code for the different fields in one record
  • Transition to next state is triggered by a tab/newline
  • Output of each state is an enable signal to the corresponding Processing Unit
• Advantage
  • Simple Design
• Disadvantage
  • Low throughput: 1 Byte/cycle → Processing Units are under-utilized

Can we do better? Desired Throughput: 64 Byte/Cycle
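A software model of this byte-at-a-time FSM might look like the sketch below. It assumes that a tab advances to the next field and a newline returns to the first field; the function name `fsm_parse` and the way enable signals are reported (as `(name, byte)` pairs) are illustrative choices, not the actual hardware interface.

```python
def fsm_parse(stream: bytes, n_mandatory: int = 11):
    """Consume one byte per step and report which enable signal is active.

    The state is the index of the current field: 1..11 are the mandatory
    fields (en_F1..en_F11), anything beyond is an optional field (en_O1...).
    """
    field = 1  # state after "Record Start"
    events = []
    for byte in stream:
        ch = bytes([byte])
        if ch == b"\t":      # tab: move to the next field
            field += 1
        elif ch == b"\n":    # newline: a new record starts
            field = 1
        elif field <= n_mandatory:
            events.append((f"en_F{field}", ch))
        else:
            events.append((f"en_O{field - n_mandatory}", ch))
    return events


events = fsm_parse(b"ab\tc\n")
```

Because the loop handles exactly one byte per iteration, only one Processing Unit is enabled at a time, which mirrors the 1 Byte/cycle limit noted above.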


SAM2BAM Accelerator

Deeply pipelined design
Throughput: 64 Byte/cycle

[Figure: SAM2BAM Accelerator block diagram, with parts labeled Bit Level Parallelism, Data Level Parallelism, Instruction Level Parallelism and Data Level Parallelism.]

Dispatcher Design Challenges

Goal: Determine types and starting positions of the different fields in the input buffer in one clock cycle

Challenges
• An input buffer might contain data from different records
• The number of fields varies across different records (11 mandatory + N optional fields)
• The size of every field varies across different records
• One record can be fragmented across different cycles (first field type is unknown)

[Figure: a 64-byte input buffer (bytes 1 to 64) in which \t separates fields and \n separates Record1 from Record2. In the Cycle1/Cycle2 example, field F3 begins at the end of one buffer and continues at the start of the next.]

Statistical Estimation of Design Parameters

• Analyzed 34 SAM file records
• Derived tight confidence intervals on critical design parameters
• In case of violation of one of these assumptions, an Error Signal is sent to the application

Design Parameter                     Minimum   Maximum
NB optional fields per record        1         11
NB fields in the Input buffer        1         11
NB of records in the Input buffer    1         2
Size of FLAG                         1 Byte    4 Bytes
Size of POS/PNEXT                    1 Byte    9 Bytes
MAPQ                                 1 Byte    3 Bytes
TLEN                                 1 Byte    10 Bytes
TAG_VALUE                            1 Byte    5 Bytes
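A software check of these bounds could look like the following sketch, which returns an error flag when a record falls outside the table. It assumes that the sizes are measured in ASCII bytes of the SAM text and that optional fields use the TAG:TYPE:VALUE form; the names `LIMITS` and `error_signal` are hypothetical.

```python
# 0-based column index -> (min bytes, max bytes), taken from the table above.
LIMITS = {
    1: (1, 4),    # FLAG
    3: (1, 9),    # POS
    4: (1, 3),    # MAPQ
    7: (1, 9),    # PNEXT
    8: (1, 10),   # TLEN
}
OPTIONAL_COUNT = (1, 11)  # optional fields per record
TAG_VALUE = (1, 5)        # bytes in the value part of an optional field


def error_signal(line: bytes) -> bool:
    """Return True when a record violates one of the assumed bounds."""
    cols = line.rstrip(b"\n").split(b"\t")
    if len(cols) < 11:
        return True
    for idx, (lo, hi) in LIMITS.items():
        if not lo <= len(cols[idx]) <= hi:
            return True
    optional = cols[11:]
    if not OPTIONAL_COUNT[0] <= len(optional) <= OPTIONAL_COUNT[1]:
        return True
    for tag in optional:
        parts = tag.split(b":", 2)
        if len(parts) != 3 or not TAG_VALUE[0] <= len(parts[2]) <= TAG_VALUE[1]:
            return True
    return False


# Illustrative records; the field values are assumptions, not benchmark data.
ok = b"R01\t0\tREF\t1\t60\t7M\t*\t0\t0\tGCATGTT\tIIIIIII\tNM:i:0\n"
bad = b"R01\t12345\tREF\t1\t60\t7M\t*\t0\t0\tGCATGTT\tIIIIIII\tNM:i:0\n"
```

In this sketch a record that trips the check would be handed back to the application instead of being processed by the fixed-size datapath.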


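Dispatcher Design

[Figure: dispatcher block diagram, 10 stages.
• Input Buffer: bytes 1 to 64, each compared (=) against the delimiters to produce the flags t1-t64 and n1-n64.
• PosDetermine Pipeline: takes t1-t64 and n1-n64 and outputs the start positions of the fields, pos1-pos11.
• StateDetermine Pipeline: takes t1-t64 and n1-n64 and outputs the type/number of the fields, en1-en11.
• Data Router: signals Start1_11, length1_11, en1_11 and done1_11.]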
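The dispatcher's job can be modeled in software as follows. The sketch computes the per-byte tab and newline flags, then derives the type, start, length and done status of every field fragment in one buffer, carrying the current field type over to the next buffer. The hardware evaluates this in parallel across pipeline stages; the sequential loop, the function name `dispatch` and the tuple layout are assumptions made for illustration.

```python
MAX_FIELDS = 11  # upper bound on fields per input buffer, from the design parameters


def dispatch(buffer: bytes, field: int):
    """Model one dispatcher cycle on an input buffer of up to 64 bytes.

    `field` is the 1-based type of the field in progress at byte 0, carried
    over from the previous cycle. Returns (segments, next_field), where each
    segment is (field_type, start, length, done).
    """
    if len(buffer) > 64:
        raise ValueError("input buffer holds at most 64 bytes")
    # Comparator stage: one tab flag and one newline flag per byte.
    t = [b == 0x09 for b in buffer]
    n = [b == 0x0A for b in buffer]
    segments, start = [], 0
    for i in range(len(buffer)):
        if t[i] or n[i]:
            segments.append((field, start, i - start, True))  # field ends here
            field = 1 if n[i] else field + 1
            start = i + 1
    if start < len(buffer):
        # The field continues in the next buffer, so it is not done yet.
        segments.append((field, start, len(buffer) - start, False))
    if len(segments) > MAX_FIELDS:
        raise RuntimeError("error signal: too many fields in one buffer")
    return segments, field


cycle1, carry = dispatch(b"AB\tC\nDE", field=10)
cycle2, carry2 = dispatch(b"F\tG", field=carry)
```

Returning the carried field type is how this model handles a record that is fragmented across cycles: the second call resumes in the field where the first one stopped.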

Results

Metrics Values

Speedup 10X

Latency 20 cycles

Resource Utilization 8%

Conclusion

• Customized Hardware to enable Precision Medicine

• FPGA implementation of the SAM to BAM parsing stage, that exploits:

• Bit-level parallelism

• Instruction-level parallelism

• Data-level parallelism

• ~10x acceleration of SAM to BAM parsing, compared to the baseline SW.

• Estimated Speed-up of the SAM to BAM conversion

Personalized Medicine

[Figure: an NGS Machine produces reads (ATCGATTGTTCGTAACCTAGCTGCTAGA) from which Variants (Variant 1, Variant 2) are called; Variants, Medical Records, Family History and Medical Tests feed a Personalized Diagnosis.]

THANK YOU!

Fast and Accurate Mutation Detection with Dr. Takeshi Ogasawara at IBM Research Tokyo

Photo taken for the IBM Instagram Campaign #InTheLab

SAM2BAM Multithreaded Implementation

1. First run through the SAM file to determine the records’ start positions
2. Each record is assigned to a thread

• Challenges
  • Fields are processed differently
  • Fields are of different size across records
  → High branch misprediction rate
  → SIMD is not efficient

Record    Start Address
1         0x1524562
…
10000     0x4521682

[Figure: Method1. Record1, Record2, …. of the SAM file are assigned to Thread_1 … Thread_N.]

While (Not EndOfRecord)
    Read (c)
    While (c != tab)
        case (field)
            F1: S1 ← [S1 c]
            …
            FN: SN ← [SN c]
        Read (c)
    field ← field+1
    case (field-1)
        F1: B1 = foo1(S1)
        F2: B2 = foo2(S2)
        …..
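Method 1 can be sketched in Python as a first pass that records where each record starts, followed by a thread pool that parses one record per task. The function names are hypothetical and the per-record work is reduced to splitting on tabs. In standard CPython builds the global interpreter lock keeps these threads from running Python code in parallel, so the sketch shows the structure of the method rather than its performance.

```python
from concurrent.futures import ThreadPoolExecutor


def record_starts(data: bytes):
    """First pass: byte offset at which every record begins."""
    starts, pos = [], 0
    while pos < len(data):
        starts.append(pos)
        newline = data.find(b"\n", pos)
        pos = len(data) if newline == -1 else newline + 1
    return starts


def parse_record(data: bytes, start: int):
    """Parse the record that begins at `start` (here: just split its fields)."""
    end = data.find(b"\n", start)
    if end == -1:
        end = len(data)
    return data[start:end].split(b"\t")


def parse_all(data: bytes, n_threads: int = 4):
    """Second pass: hand each record to a worker thread."""
    starts = record_starts(data)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(lambda s: parse_record(data, s), starts))


sam = b"a\tb\nc\td\n"
parsed = parse_all(sam)
```

The first pass still reads the whole file serially, and each worker then branches on field type and length, which is where the misprediction and SIMD concerns listed above come from.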

SAM2BAM Multithreaded Implementation

1. Read a serial stream from a SAM file
2. Dispatch each field to a different thread

• Challenges
  • Parsing functions perform little computation on small chunks of data → Communication overhead
  • Threads are idle until the whole field is read

[Figure: Method2. SAM → Dispatcher → Field1 ….. Field_N]
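Method 2 can be modeled with one queue and one worker thread per field type, as in the sketch below. The dispatcher loop, the function name `run_method2` and the use of `upper()` as a stand-in for the per-field parsing functions are assumptions for illustration only.

```python
import queue
import threading


def run_method2(stream: bytes, n_fields: int):
    """A serial dispatcher sends field i of every record to worker thread i."""
    queues = [queue.Queue() for _ in range(n_fields)]
    results = [[] for _ in range(n_fields)]

    def worker(i: int):
        while True:
            item = queues[i].get()  # the thread idles here until a whole field arrives
            if item is None:        # sentinel: no more fields of this type
                break
            results[i].append(item.upper())  # stand-in for the parsing function

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_fields)]
    for th in threads:
        th.start()
    for line in stream.splitlines():  # serial read of the SAM stream
        for i, value in enumerate(line.split(b"\t")[:n_fields]):
            queues[i].put(value)
    for q in queues:
        q.put(None)
    for th in threads:
        th.join()
    return results


out = run_method2(b"a\tb\nc\td\n", 2)
```

Each queue hand-off costs more than the tiny amount of work done per field, which is the communication overhead named in the challenges.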

Profiling of the Alignment Stage

System: Power8 (20 cores, SMT8 enabled)

[Figure: % CPU cycles]

Jasper, M,J. “Acceleration of Read Alignment with Coherent Attached FPGA coprocessor”. Master Thesis. TU Delft (2015)

Performance does not scale with the number of threads

Power8 Specifications

Produced                        2013
Designed by                     IBM
Max. CPU Clock rate             2.5 GHz to 5 GHz
Min. feature size               22 nm
Instruction Set Architecture    PowerISA v.2.07
Cores                           Up to 12, SMT8
L1 cache                        64 (D) + 32 (I) KB per core
L2 cache                        512 KB per chiplet
L3 cache                        96 MB
L4 cache                        128 MB
Other Features                  Coherent Accelerator Processor Interface (CAPI)

Coherent Accelerator Processor Interface

CAPI (Coherent Accelerator Processor Interface)
• Enables a client-defined hybrid-computing engine to act as a peer to the multiple POWER8 cores
• The accelerator uses the application’s virtual address space
  → It becomes another thread of the application (shared memory/PT/…).
• The FPGA cache is fully coherent with the caches in the POWER8 cores.
  → Use a hardware-managed 256 KB cache in the FPGA.

Typical I/O Model Flow is inefficient

http://www.nallatech.com/wp-content/uploads/Ent2014-CAPI-on-Power8.pdf

Coherent Accelerator Processor Interface

• CAPP
  • Provides coherency by snooping the bus on the POWER8 CPU on behalf of the accelerator

• PHB
  • Provides connectivity between the CAPP and the PCIe link

• PSL
  • Provides a number of independent interfaces to the AFU
  • Handles virtual-to-physical memory translations
  • Contains a 256 KB resident cache on behalf of the accelerator

http://www.nallatech.com/wp-content/uploads/Ent2014-CAPI-on-Power8.pdf
