From Sequences to Systems: Accelerating BAM to SAM Parsing on FPGA Safa Messaoud

From Sequences to Systems: Accelerating BAM to SAM Parsing … · 2020. 3. 23. · 11 Mandatory Fields N Optional Fields REF R01 R02 SAM File GCATGTTAGATAAGATAGCTTATCGAT GCATGTT TAGCTTATCG


  • From Sequences to Systems: Accelerating BAM to SAM Parsing on FPGA, Safa Messaoud

  • Overview

    1. Genome Sequencing and Variant Calling

    2. SAM to BAM parsing is a computational bottleneck in the Variant Calling pipeline

    3. Propose SAM2BAM, an FPGA-based accelerator for SAM to BAM parsing
       • Developed during a summer internship at IBM Research Tokyo
       • In patenting process
       • 10x speedup compared to the SW baseline

    4. Conclusion and Future Work

  • Genome Sequencing

    • DNA Molecule: the instruction molecule that contains the code for transforming an organism from a single cell into a complex system, like a human being.

    • The code is a sequence of 4 types of nucleotides/bases: A, T, C and G

    • Human Genome Length: ~ 3.1 billion base pairs

    • Genome Sequencing: Process of determining the order of A, T, C and G in a DNA molecule

    • Read Length: ~ 200 nucleotides

    ATCGATTTTCGTAACCTAGCTGCTAGA

    ATCGAT…TGCTGTAGCTA…ACGAC

    Sequencing Machine


  • Variant Calling

    Reference Genome: A C A G G T - T G T A T A … A T G C

    Sample Genome:    A C G G G T A - G T G G T … A T G C

    • Goal: Identify variations in the genome that could be associated with certain diseases, given the sequencing data

    • Variants are detected by comparing a sample genome to a reference genome.

    • Variants are of 3 types:
      • Single Nucleotide Polymorphism (SNP)
      • Insertion/Deletion (INDEL)
      • Structural Variant (SV)

    • Best Practice Workflow: The Genome Analysis Toolkit (GATK) Variant Calling pipeline
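The comparison of a sample genome against a reference can be sketched in a few lines of Python. This is only an illustration, assuming both sequences are already aligned column by column with `-` marking a gap; the function name `classify_variants` is hypothetical, and structural variants are out of scope for such a column-wise scan.

```python
def classify_variants(ref_aligned: str, sample_aligned: str):
    """Walk two gap-aligned sequences and label every differing column."""
    variants = []
    for column, (r, s) in enumerate(zip(ref_aligned, sample_aligned)):
        if r == s:
            continue  # identical base, no variant in this column
        if r == "-":
            kind = "insertion"  # base present only in the sample
        elif s == "-":
            kind = "deletion"   # base present only in the reference
        else:
            kind = "SNP"        # single-base substitution
        variants.append((column, kind, r, s))
    return variants


# The visible prefix of the two sequences on the slide (0-based columns).
reference = "ACAGGT-TGTATA"
sample = "ACGGGTA-GTGGT"
for variant in classify_variants(reference, sample):
    print(variant)
```

Production variant callers such as GATK work on many overlapping reads with quality scores rather than on a single pairwise alignment, so this only shows the classification idea.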


  • Variant Calling Pipeline -Alignment-

    Reference Genome: A C A G G T G A A T G C

    Sample Genome (short reads): A C A G T G C, C A G G G A A T, A G G T

    Goal: Map each short read to a position in the reference sequence

    Pipeline: Reference .FASTA (~3GB) + Short Reads .FASTQ (~180GB) → Alignment → .SAM (~1TB)
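Mapping a short read to a reference position can be illustrated with a brute-force search. The sketch below is an assumption-laden toy: it tries every ungapped placement and keeps the one with the fewest mismatches, whereas real aligners use index structures and gapped alignment. The names `hamming` and `best_position` are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching characters between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_position(reference: str, read: str):
    """Return (0-based position, mismatches) of the best ungapped placement."""
    candidates = range(len(reference) - len(read) + 1)
    scored = (
        (pos, hamming(reference[pos:pos + len(read)], read))
        for pos in candidates
    )
    # min() keeps the first placement with the lowest mismatch count
    return min(scored, key=lambda item: item[1])


reference = "ACAGGTGAATGC"  # reference sequence from the slide
print(best_position(reference, "AGGT"))
```

The two longer reads on the slide do not match the reference exactly, which is why practical aligners must tolerate mismatches and gaps.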

  • Variant Calling Pipeline -SAM to BAM Conversion-

    Goal: Compress SAM file for more efficient storage and communication

    • SAM
      • Sequence Alignment/Map format
      • ASCII file
      • Stores alignment data

    • BAM
      • Binary and compressed SAM

    • Compression ratio: ~4

    Pipeline: Reference .FASTA (~3GB) + Short Reads .FASTQ (~180GB) → Alignment → .SAM (~1TB) → SAM 2 BAM Conversion (SAMTools) → .BAM (~250GB)

  • Variant Calling Pipeline -Data Cleaning-

    Goal: Compensate for inaccuracies in the sequencing and alignment

    Pipeline: Reference .FASTA (~3GB) + Short Reads .FASTQ (~180GB) → Alignment → .SAM (~1TB) → SAM 2 BAM Conversion (SAMTools) → .BAM (~250GB) → Data Cleaning → .BAM (~250GB)

  • Variant Calling Pipeline -Variant Calling-

    Goal: Determine the most likely base when there are mismatches between the aligned short reads and the corresponding base in the reference sequence.

    Pipeline: Reference .FASTA (~3GB) + Short Reads .FASTQ (~180GB) → Alignment → .SAM (~1TB) → SAM 2 BAM Conversion (SAMTools) → .BAM (~250GB) → Data Cleaning → Variant Calling → Variants.VCF (

  • Accelerating the Variant Calling Pipeline

    Pipeline with approximate run time per stage:

    Reference .FASTA (~3GB) + Short Reads .FASTQ (~180GB)
    → Alignment (~4h) → .SAM (~1TB)
    → SAM 2 BAM Conversion (SAMTools, ~4h) → .BAM (~250GB)
    → Data Cleaning (~13h)
    → Variant Calling (~4h) → Variants.VCF (
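The stage times on this slide allow a quick Amdahl-style estimate of what accelerating one stage buys for the whole pipeline. The sketch assumes the unlabeled ~4h belongs to the Alignment stage and that stages run strictly one after another; the speedup factor passed in is a free parameter, not a measured result, and `pipeline_speedup` is a hypothetical helper.

```python
# Approximate stage run times in hours, as read off the slide.
STAGE_HOURS = {
    "Alignment": 4.0,             # assumption: the unlabeled ~4h is Alignment
    "SAM 2 BAM Conversion": 4.0,
    "Data Cleaning": 13.0,
    "Variant Calling": 4.0,
}


def stage_share(stage: str, hours=STAGE_HOURS) -> float:
    """Fraction of the total pipeline time spent in one stage."""
    return hours[stage] / sum(hours.values())


def pipeline_speedup(stage: str, factor: float, hours=STAGE_HOURS) -> float:
    """Overall speedup if a single stage becomes `factor` times faster."""
    before = sum(hours.values())
    after = before - hours[stage] + hours[stage] / factor
    return before / after


print(f"share of SAM 2 BAM: {stage_share('SAM 2 BAM Conversion'):.0%}")
print(f"pipeline speedup at 2x: {pipeline_speedup('SAM 2 BAM Conversion', 2.0):.3f}")
```

Because the conversion stage is only part of the total, even a large speedup of that stage yields a bounded end-to-end gain.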


  • SAM2BAM Conversion


  • SAM and BAM Files

    [Figure: ALIGNMENT of the short reads R01 (GCATGTT) and R02 (TAGCTTATCG) against the reference REF (GCATGTTAGATAAGATAGCTTATCGAT); each aligned read becomes one record of the SAM file.]

    SAM FORMAT: each record (Record1, Record2, …) has 11 mandatory fields followed by N optional fields

    R01: F1 F2 F3 F4 F5 F6 F7 F8 F9 F10 F11 … FN
    R02: F1 F2 F3 F4 F5 F6 F7 F8 F9 F10 F11 … FN

    Mandatory fields include:
    • QNAME: Short Read Name
    • RNAME: Reference Name
    • POS: Mapping Position
    • QUAL: Quality Score
    • SEQ: Short Read Sequence
    • ….

    Optional fields: TAG 1 - TAG N (Additional Sequencing/Alignment Information)
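The record layout above can be made concrete with a small parser. This is a minimal sketch: the 11 column names follow the public SAM specification, while the example record (read name, MAPQ, CIGAR, quality string, tag) is invented for illustration and `parse_sam_record` is a hypothetical helper.

```python
# Names of the 11 mandatory SAM columns, in file order.
MANDATORY = (
    "QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL",
)


def parse_sam_record(line: str):
    """Split one tab-separated SAM record into mandatory and optional fields."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(MANDATORY):
        raise ValueError("a SAM record needs at least 11 fields")
    mandatory = dict(zip(MANDATORY, columns[:11]))
    # Optional fields have the form TAG:TYPE:VALUE
    optional = [tuple(tag.split(":", 2)) for tag in columns[11:]]
    return mandatory, optional


# Illustrative record for read R01; all values except SEQ are made up.
record = "R01\t0\tREF\t1\t60\t7M\t*\t0\t0\tGCATGTT\tIIIIIII\tNM:i:0\n"
fields, tags = parse_sam_record(record)
print(fields["QNAME"], fields["POS"], fields["SEQ"], tags)
```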

  • SAM and BAM Files

    [Figure: ALIGNMENT of the short reads R01 and R02 against the reference REF (GCATGTTAGATAAGATAGCTTATCGAT), stored as one SAM record per read.]

    SAM FORMAT

    R01: F1 F2 F3 F4 F5 F6 F7 F8 F9 F10 F11 … FN
    R02: F1 F2 F3 F4 F5 F6 F7 F8 F9 F10 F11 … FN

    BAM FORMAT

    1. After Parsing: every field Fi is transformed by its own function gi

    R01: g1(F1) g2(F2) g3(F3) g4(F4) g5(F5) g6(F6) g7(F7) g8(F8) g9(F9) g10(F10) g11(F11) … gN(FN)
    R02: g1(F1) g2(F2) g3(F3) g4(F4) g5(F5) g6(F6) g7(F7) g8(F8) g9(F9) g10(F10) g11(F11) … gN(FN)

    2. After Compression: 0111011100010100110110111110111110101010101001101001010101001…
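To give a feel for what the per-field functions gi do, the sketch below encodes three fields the way the public SAM/BAM specification describes them: POS becomes a 0-based little-endian 32-bit integer, SEQ is packed at 4 bits per base, and QUAL drops its ASCII offset of 33. The function names are hypothetical, and the slides do not state that the accelerator implements exactly these routines.

```python
import struct

# 4-bit base codes from the BAM specification: index in this string = code.
SEQ_CODES = "=ACMGRSVTWYHKDBN"


def encode_pos(pos_text: str) -> bytes:
    """SAM POS is 1-based text; BAM stores it 0-based as little-endian int32."""
    return struct.pack("<i", int(pos_text) - 1)


def encode_seq(seq: str) -> bytes:
    """Pack two bases per byte, high nibble first, zero-padding odd lengths."""
    codes = [SEQ_CODES.index(base) for base in seq.upper()]
    if len(codes) % 2:
        codes.append(0)
    return bytes((hi << 4) | lo for hi, lo in zip(codes[0::2], codes[1::2]))


def encode_qual(qual: str) -> bytes:
    """SAM QUAL is Phred+33 ASCII; BAM stores the raw Phred values."""
    return bytes(ord(ch) - 33 for ch in qual)


print(encode_pos("1").hex(), encode_seq("GCATGTT").hex(), list(encode_qual("II")))
```

Each function touches only a few bytes and applies a different rule, which matches the slides' observation that parsing consists of many small, heterogeneous routines.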

  • SAM to BAM Parsing


  • The single-threaded SAM parser cannot provide sufficient records to multiple compression threads

    Parsing is a Bottleneck in the SAM 2 BAM Conversion Stage

    • Parsing Stage is single-threaded
    • Gzip Compression Stage is multi-threaded

    • Profiling Experiment
      • Platform: IBM Power System S822L (two 12-core 3.02 GHz POWER8 cards, 512 GB memory) → Up to 192 threads are supported
      • Benchmark: Cancer cell line G15512.HCC1954.1 (variable length records)

    • Profiling Results
      • On average, only 5.72% (11 out of 192) of the hardware threads are used

    [Figure: SAM → Parser → Gzip_0, Gzip_1, …, Gzip_N]
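The structure of this stage, one serial parser feeding a pool of compression workers, can be mimicked in Python. The sketch below is not SAMTools and does not produce real BAM: it only shows why the workers can never be busier than the single parser allows. Block size, worker count and function names are illustrative assumptions.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def parse_records(sam_text: str):
    """Single-threaded stage: yield one list of fields per record."""
    for line in sam_text.splitlines():
        if line and not line.startswith("@"):  # skip header lines
            yield line.split("\t")


def compress_block(records) -> bytes:
    """Multi-threaded stage: compress one block of parsed records."""
    payload = "\n".join("\t".join(rec) for rec in records).encode("ascii")
    return zlib.compress(payload)


def convert(sam_text: str, block_size: int = 2, workers: int = 4):
    """Feed blocks from the serial parser to a pool of compression threads."""
    futures, current = [], []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for record in parse_records(sam_text):  # the serial bottleneck
            current.append(record)
            if len(current) == block_size:
                futures.append(pool.submit(compress_block, current))
                current = []
        if current:
            futures.append(pool.submit(compress_block, current))
        return [future.result() for future in futures]


blocks = convert("r1\t0\tchr\nr2\t16\tchr\nr3\t0\tchr\n")
print(len(blocks), zlib.decompress(blocks[0]))
```

However many workers are configured, a new block is only submitted when the parser has produced `block_size` records, so a slow parser leaves most workers idle.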

  • Accelerating SAM2BAM Parsing

    • Opportunities for parallelization
      • Independent fields/records can be processed in parallel

    • Challenges
      • Each field is processed by a different parsing function
      • Each parsing function operates on variable-size fields/records
      • Parsing functions perform few operations on small chunks of data

    • Desired: a versatile computation platform
      • Must be able to achieve fine-grained parallelism
      • Must be able to efficiently execute serial routines

    FPGA can do all!

  • Accelerator Design


  • Design Space Exploration

    [Figure: input data arrives from the cache at 64 B/cycle (bytes 1 … 64), for example "F8 \t F9 \t F10 \t F11 \n F1 … F8". The FSM has the states Record Start, F1, F2, …, F11, O1, …, ON, with transitions labeled \t and \n; each state drives an enable signal such as en_F11=1 or en_O1=1.]

    • Use an FSM for control flow
      • States code for the different fields in one record
      • Transition to next state is triggered by a tab/newline
      • Output of each state is an enable signal to the corresponding Processing Unit

    • Advantage
      • Simple design

    • Disadvantage
      • Low throughput: 1 Byte/cycle → Processing Units are under-utilized

    Can we do better? Desired throughput: 64 Byte/cycle
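The FSM-based control flow can be simulated in software to see why it is limited to one byte per cycle. The sketch below treats each loop iteration as one clock cycle and records which enable signal is asserted; the signal names follow the slide (en_F1 … en_F11, en_O1 …), while `enable_name` and `run_fsm` are hypothetical helpers.

```python
MANDATORY_FIELDS = 11


def enable_name(state: int) -> str:
    """Map an FSM state (0-based field index) to its enable signal name."""
    if state < MANDATORY_FIELDS:
        return f"en_F{state + 1}"
    return f"en_O{state - MANDATORY_FIELDS + 1}"


def run_fsm(data: bytes):
    """Consume one byte per cycle; return the enable asserted in each cycle."""
    state = 0  # Record Start: the next byte belongs to field F1
    trace = []
    for byte in data:
        if byte == 0x09:    # tab: move to the next field
            state += 1
            trace.append(None)
        elif byte == 0x0A:  # newline: the next record starts with F1
            state = 0
            trace.append(None)
        else:               # payload byte: enable the matching unit
            trace.append(enable_name(state))
    return trace


trace = run_fsm(b"r1\t0\tchr1\n")
print(len(trace), trace)
```

The trace has exactly one entry per input byte, so a 64-byte buffer takes 64 cycles, and in every cycle at most one processing unit is enabled.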

  • SAM2BAM Accelerator: deeply pipelined design, throughput: 64 Byte/cycle

  • SAM2BAM Accelerator

    Bit Level Parallelism

    Data Level Parallelism

    Instruction Level Parallelism

    Data Level Parallelism

  • Dispatcher Design Challenges

    Goal: Determine types and starting positions of the different fields in the input buffer in one clock cycle

    Challenges
    • An input buffer might contain data from different records
    • The number of fields varies across different records (11 mandatory + N optional fields)
    • The size of every field varies across different records
    • One record can be fragmented across different cycles (first field type is unknown)

    [Figure: a 64-byte input buffer (bytes 1 … 64) containing the \t-separated fields of Record1, a \n, and the first fields of Record2. A second panel shows two consecutive buffers (Cycle1, Cycle2) in which field F3 is split across the buffer boundary.]

  • Statistical Estimation of Design Parameters

    • Analyzed 34 SAM file records
    • Derived tight confidence intervals on critical design parameters
    • In case of violation of one of these assumptions, an Error Signal is sent to the application

    Design Parameter                     Minimum   Maximum
    NB optional fields per record        1         11
    NB fields in the input buffer        1         11
    NB of records in the input buffer    1         2
    Size of FLAG                         1 Byte    4 Bytes
    Size of POS/PNEXT                    1 Byte    9 Bytes
    MAPQ                                 1 Byte    3 Bytes
    TLEN                                 1 Byte    10 Bytes
    TAG_VALUE                            1 Byte    5 Bytes
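The bounds in this table translate directly into a per-record check. The sketch below is a software stand-in for the error signal, assuming the standard SAM column order; only the per-record limits are checked, the buffer-level limits are omitted, and `error_signal` is a hypothetical name.

```python
# (minimum, maximum) field sizes in bytes, copied from the table above.
SIZE_BOUNDS = {
    "FLAG": (1, 4), "POS": (1, 9), "PNEXT": (1, 9),
    "MAPQ": (1, 3), "TLEN": (1, 10),
}
# 0-based column of each checked field in a SAM record (SAM column order).
COLUMN = {"FLAG": 1, "POS": 3, "MAPQ": 4, "PNEXT": 7, "TLEN": 8}
OPTIONAL_COUNT = (1, 11)
TAG_VALUE_SIZE = (1, 5)


def error_signal(sam_line: str) -> bool:
    """Return True if the record violates one of the design assumptions."""
    fields = sam_line.rstrip("\n").split("\t")
    if len(fields) < 11:
        return True
    optional = fields[11:]
    if not OPTIONAL_COUNT[0] <= len(optional) <= OPTIONAL_COUNT[1]:
        return True
    for name, column in COLUMN.items():
        low, high = SIZE_BOUNDS[name]
        if not low <= len(fields[column]) <= high:
            return True
    for tag in optional:
        parts = tag.split(":", 2)  # TAG:TYPE:VALUE
        if len(parts) != 3:
            return True
        if not TAG_VALUE_SIZE[0] <= len(parts[2]) <= TAG_VALUE_SIZE[1]:
            return True
    return False


print(error_signal("r1\t0\tchr1\t100\t60\t7M\t*\t0\t0\tGCATGTT\tIIIIIII\tNM:i:0"))
```

Fixing such bounds up front is what lets the hardware use fixed-width datapaths; records outside the bounds have to be flagged and handled elsewhere.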

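  • Dispatcher Design

    [Figure: the 64-byte Input Buffer (bytes 1 … 64) feeds a bank of parallel comparators (=) that produce the tab flags t1-t64 and the newline flags n1-n64. The PosDetermine Pipeline derives the start positions of the fields (pos1-pos11) from these flags; the StateDetermine Pipeline derives the type/number of the fields (en1-en11). The Data Router works with the signals start1_11, length1_11, en1_11 and done1_11. Annotation: 10 stages.]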
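What the dispatcher computes for one input buffer can be written down in software. This is a sequential stand-in: the hardware evaluates all 64 comparisons at once and pipelines the rest, while the loop below visits the flags one by one. The field index carried in from the previous buffer (`first_field`) models a record fragmented across cycles; all names are hypothetical.

```python
def dispatch(buffer: bytes, first_field: int):
    """Return ([(start, length, field_index), ...], field index for next buffer)."""
    tabs = [byte == 0x09 for byte in buffer]      # t1 .. t64
    newlines = [byte == 0x0A for byte in buffer]  # n1 .. n64

    fields = []
    field = first_field  # type of the (possibly continued) first field
    start = 0
    for i in range(len(buffer)):
        if tabs[i] or newlines[i]:
            fields.append((start, i - start, field))
            # a newline starts a new record, a tab moves to the next field
            field = 0 if newlines[i] else field + 1
            start = i + 1
    if start < len(buffer):
        # trailing fragment: this field continues in the next buffer
        fields.append((start, len(buffer) - start, field))
    return fields, field


# A short buffer that begins in the middle of a record (third field, index 2).
fields, carry = dispatch(b"AB\tCD\nEF\tG", first_field=2)
print(fields, carry)
```

The returned field index is what the next cycle needs in order to know the type of its first, possibly partial, field.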

  • Results

    Metric                  Value
    Speedup                 10X
    Latency                 20 cycles
    Resource Utilization    8%

  • Conclusion

    • Customized Hardware to enable Precision Medicine

    • FPGA implementation of the SAM to BAM parsing stage, that exploits:

    • Bit-level parallelism

    • Instruction-level parallelism

    • Data-level parallelism

    • ~10x acceleration of SAM to BAM parsing, compared to the baseline SW.

    • Estimated Speed-up of the SAM to BAM conversion

  • Personalized Medicine

    ATCGATTGTTCGTAACCTAGCTGCTAGA

    Variant 1

    Medical Records Family History

    Medical Tests

    Variants

    NGS Machine

    Variant 2

    Personalized Diagnosis

  • THANK YOU!

    Fast and Accurate Mutation Detection with Dr. Takeshi Ogasawara at IBM Research Tokyo

    Photo taken for the IBM Instagram Campaign #InTheLab


  • SAM2BAM Multithreaded Implementation

    Method1

    1. First run through the SAM file to determine the records' start positions
    2. Each record is assigned to a thread

    • Challenges
      • Fields are processed differently
      • Fields are of different size across records
      → High branch misprediction rate
      → SIMD is not efficient

    Record    Start Address
    1         0x1524562
    10000     0x4521682

    [Figure: SAM → Record1, Record2, …. → Thread_1 … Thread_N]

    Per-thread parsing loop (pseudocode):

    While (Not EndOfRecord)
        Read(c)
        While (c != tab)
            case (field)
                F1: S1 ← [S1 c]
                …
                FN: SN ← [SN c]
        field ← field + 1
        case (field - 1)
            F1: B1 = foo1(S1)
            F2: B2 = foo2(S2)
            …..
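Method1 can be sketched as two passes: an indexing pass that records where each record starts, and a parallel pass that hands each start offset to a worker. The code below is an illustration with hypothetical names; in CPython the threads mainly demonstrate the work partitioning, since pure-Python parsing does not run in parallel under the interpreter lock.

```python
from concurrent.futures import ThreadPoolExecutor


def record_offsets(data: bytes):
    """Pass 1: scan the whole file once and collect each record's start offset."""
    offsets, pos = [], 0
    while pos < len(data):
        offsets.append(pos)
        newline = data.find(b"\n", pos)
        pos = len(data) if newline == -1 else newline + 1
    return offsets


def parse_record(data: bytes, start: int):
    """Pass 2 worker: split the record beginning at `start` into its fields."""
    end = data.find(b"\n", start)
    if end == -1:
        end = len(data)
    return data[start:end].decode("ascii").split("\t")


def parse_parallel(data: bytes, workers: int = 4):
    """Assign every record to a thread from the pool, keeping file order."""
    offsets = record_offsets(data)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda start: parse_record(data, start), offsets))


print(parse_parallel(b"r1\t0\tchr1\nr2\t16\tchr1\n"))
```

The indexing pass is itself serial and reads the entire file, and each worker still runs the branch-heavy per-field loop, which is where the challenges listed above come from.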

  • SAM2BAM Multithreaded Implementation

    Method2

    1. Read a serial stream from a SAM file
    2. Dispatch each field to a different thread

    • Challenges
      • Parsing functions do few computations on small chunks of data → Communication overhead
      • Threads are idle until the whole field is read

    [Figure: SAM → Dispatcher → Field1, ….., Field_N]

  • Profiling of the Alignment Stage

    System: Power8 (20 cores, SMT8 enabled)

    [Figure: profile of the alignment stage; axis label: % CPU cycles]

    Jasper, M,J. “Acceleration of Read Alignment with Coherent Attached FPGA coprocessor”. Master Thesis. TU Delft (2015)

    Performance does not scale with the number of threads

  • Power8 Specifications

    Produced                        2013
    Designed by                     IBM
    Max. CPU clock rate             2.5 GHz to 5 GHz
    Min. feature size               22 nm
    Instruction Set Architecture    Power ISA v.2.07
    Cores                           Up to 12 (SMT8)
    L1 cache                        64 (D) + 32 (I) KB per core
    L2 cache                        512 KB per chiplet
    L3 cache                        96 MB
    L4 cache                        128 MB
    Other features                  Coherent Accelerator Processor Interface (CAPI)

  • Coherent Accelerator Processor Interface

    CAPI (Coherent Accelerator Processor Interface)

    • Enables a client-defined hybrid-computing engine to act as a peer to the multiple POWER8 cores
    • The accelerator uses the application's virtual address space
      → It becomes another thread of the application (shared memory/PT/…).
    • The FPGA cache is fully coherent with the caches in the POWER8 cores.
      → Use a hardware-managed 256 KB cache in the FPGA.

    Typical I/O Model Flow is inefficient

    http://www.nallatech.com/wp-content/uploads/Ent2014-CAPI-on-Power8.pdf

  • Coherent Accelerator Processor Interface

    • CAPP
      • Provides coherency by snooping the bus on the POWER8 CPU on behalf of the accelerator

    • PHB
      • Provides connectivity between the CAPP and the PCIe link

    • PSL
      • Provides a number of independent interfaces to the AFU
      • Handles virtual-to-physical memory translations
      • Contains a 256 KB resident cache on behalf of the accelerator

    http://www.nallatech.com/wp-content/uploads/Ent2014-CAPI-on-Power8.pdf