Biscuit: AFrameworkforNear-Data ProcessingofBigD .Biscuit: AFrameworkforNear-Data ProcessingofBigDataWorkloads

  • View
    219

  • Download
    0

Embed Size (px)

Text of Biscuit: AFrameworkforNear-Data ProcessingofBigD .Biscuit: AFrameworkforNear-Data...

  • Biscuit: A Framework for Near-DataProcessing of Big Data Workloads

    Boncheol GuAndre S. Yoon, Duck-Ho Bae, Insoon Jo, Jinyoung Lee, Jonghyun Yoon,

    Jeong-Uk Kang, Moonsang Kwon, Chanho Yoon, Sangyeun Cho, Jaeheon Jeong, Duckhyun Chang

    Memory Business, Samsung Electronics

  • Near-Data Processing (NDP)

    Storage

    Processing unit

    Data

    Traditional data processing

    Storage

    Processing unit

    Data

    Near-data processing

    Processing

    ResultHost interface

    / Network /

    Near-data processing moves computation to data.

    Computation is performed right at the data source.

    Efficient when the cost of moving data is very high

    Most prior work focuses on proving the concept of NDP.

    Little attention to designing and realizing a practical framework

    Realistic large application studies were omitted.

    Boncheol Gu et al. 1/16

  • Near-Data Processing (NDP)

    Storage

    Processing unit

    Data

    Traditional data processing

    Storage

    Processing unit

    Data

    Near-data processing

    Processing

    ResultHost interface

    / Network /

    Near-data processing moves computation to data.

    Computation is performed right at the data source.

    Efficient when the cost of moving data is very high

    Most prior work focuses on proving the concept of NDP.

    Little attention to designing and realizing a practical framework

    Realistic large application studies were omitted.

    Boncheol Gu et al. 1/16

  • Samsung NVMe SSD (PM1725)

    Boncheol Gu et al. 2/16

  • Biscuit

    A user-programmable NDP framework for SSDs and data-intensiveapplications

    C++ language support (C++ standard library)

    Dynamic loading of user programs

    Mutithreading, multi-core support

    The 1st reported product-strength NDP system

    Boncheol Gu et al. 3/16

  • Biscuit

    A user-programmable NDP framework for SSDs and data-intensiveapplications

    C++ language support (C++ standard library)

    Dynamic loading of user programs

    Mutithreading, multi-core support

    The 1st reported product-strength NDP system

    Boncheol Gu et al. 3/16

  • SSD Hardware

    DRAMDRAM

    NAND flash memory

    NAND flash memory

    PCIe

    inte

    rfac

    e

    NDP-enabledSSD controller

    [Inside of PM1725]

    Item DescriptionHost interface PCIe Gen.3 4 (3.2 GB/s)Protocol NVMe 1.1Device density 1 TBSSD architecture Multiple channels/ways/coresStorage medium Multi-bit NAND flash memoryCompute resources Two ARM Cortex R7 coresfor Biscuit @750MHz with MPU

    On-chip SRAM < 1 MiBDRAM 1 GiB

    Hardware pattern matcher on each flash memory channel.

    To search given bit patterns in the flash memory

    Limitations

    Low compute power, no cache coherence, a small amount of fast memory, noMMU, and restrictive synchronization primitives

    Boncheol Gu et al. 4/16

  • SSD Hardware

    DRAMDRAM

    NAND flash memory

    NAND flash memory

    PCIe

    inte

    rfac

    e

    NDP-enabledSSD controller

    [Inside of PM1725]

    Item DescriptionHost interface PCIe Gen.3 4 (3.2 GB/s)Protocol NVMe 1.1Device density 1 TBSSD architecture Multiple channels/ways/coresStorage medium Multi-bit NAND flash memoryCompute resources Two ARM Cortex R7 coresfor Biscuit @750MHz with MPU

    On-chip SRAM < 1 MiBDRAM 1 GiB

    Hardware pattern matcher on each flash memory channel.

    To search given bit patterns in the flash memory

    Limitations

    Low compute power, no cache coherence, a small amount of fast memory, noMMU, and restrictive synchronization primitives

    Boncheol Gu et al. 4/16

  • Biscuit Runtime

    Cooperative multithreading

    A limited form of multithreading (fiber as a scheduling unit)

    Less context switching overhead

    Safe resource sharing without locking

    Shared nothing architecture

    All data transmission among threads through I/O ports

    Enforced by the programming model and APIs

    C++11 move semantics supported

    Dynamic loader for user programs

    User program as position-independent code (PIC)

    Symbol relocation to locate each program in a separate address space

    Boncheol Gu et al. 5/16

  • Biscuit Runtime

    Cooperative multithreading

    A limited form of multithreading (fiber as a scheduling unit)

    Less context switching overhead

    Safe resource sharing without locking

    Shared nothing architecture

    All data transmission among threads through I/O ports

    Enforced by the programming model and APIs

    C++11 move semantics supported

    Dynamic loader for user programs

    User program as position-independent code (PIC)

    Symbol relocation to locate each program in a separate address space

    Boncheol Gu et al. 5/16

  • Biscuit Runtime

    Cooperative multithreading

    A limited form of multithreading (fiber as a scheduling unit)

    Less context switching overhead

    Safe resource sharing without locking

    Shared nothing architecture

    All data transmission among threads through I/O ports

    Enforced by the programming model and APIs

    C++11 move semantics supported

    Dynamic loader for user programs

    User program as position-independent code (PIC)

    Symbol relocation to locate each program in a separate address space

    Boncheol Gu et al. 5/16

  • Biscuit System Architecture

    Host system

    libsisc

    NVMe driver

    Biscuit-enabled SSD

    Biscuit runtime

    SSD-side module

    SSD-sideTask

    SSD-sideTask

    SSD-sideTask

    fiber fiber fiber

    SSD firmware

    libslethost-side program

    I/O requests

    host-sideTask

    host-sideTask

    CPUs CPUs SRAM

    DRAM

    Hos

    t i

    nterf

    ace c

    ontro

    ller

    LLCNAND NAND

    Channel #0FMC

    NAND NAND

    Channel #(nch-1)FMC

    mainmemory

    systemagent

    PCIe I/F

    hardwarepattern matcher

    Boncheol Gu et al. 6/16

  • Development Process

    Run the host program5

    ARM cross compile

    X86compile

    Copy the module into /var/isc/slets (/dev/nvme0n1)

    HostComputer

    HostComputer

    Host program

    SSD-sidemodule

    2 3

    4

    Write the codes1

    SSD-side taskHost-side task

    Boncheol Gu et al. 7/16

  • Development Process

    Copy the module into /var/isc/slets (/dev/nvme0n1)

    Run the host program HostComputer

    HostComputer

    4

    5

    Write the codes1

    ARM cross compile

    X86compile

    SSD-side taskHost-side task

    Host program

    SSD-sidemodule

    2 3

    Boncheol Gu et al. 7/16

  • Development Process

    Run the host program HostComputer

    HostComputer

    5

    Write the codes1

    ARM cross compile

    X86compile

    Copy the module into /var/isc/slets (/dev/nvme0n1)

    SSD-side taskHost-side task

    Host program

    SSD-sidemodule

    2 3

    4

    Boncheol Gu et al. 7/16

  • Development Process

    Write the codes1

    ARM cross compile

    X86compile

    Copy the module into /var/isc/slets (/dev/nvme0n1)

    Run the host program

    SSD-side taskHost-side task

    HostComputer

    HostComputer

    Host program

    SSD-sidemodule

    2 3

    4

    5

    Boncheol Gu et al. 7/16

  • Experimental Setup

    H/W setup

    System Dell PowerEdge R720 serverCPU 2 Intel Xeon(R) CPU E5-2640

    (12 threads per socket) @2.50GHzMemory 64 GiB DRAMOS 64-bit Ubuntu 15.04

    Basic performance results

    Communication latency, data read bandwidth and latency

    Application level results

    String search, pointer chasing, and DB scan/filtering

    Notations

    Conv: system configuration with a default conventional SSD

    Biscuit: system configuration with the Biscuit framework on the SSD

    Boncheol Gu et al. 8/16

  • Experimental Setup

    H/W setup

    System Dell PowerEdge R720 serverCPU 2 Intel Xeon(R) CPU E5-2640

    (12 threads per socket) @2.50GHzMemory 64 GiB DRAMOS 64-bit Ubuntu 15.04

    Basic performance results

    Communication latency, data read bandwidth and latency

    Application level results

    String search, pointer chasing, and DB scan/filtering

    Notations

    Conv: system configuration with a default conventional SSD

    Biscuit: system configuration with the Biscuit framework on the SSD

    Boncheol Gu et al. 8/16

  • Basic Performance Results - Data Read Bandwidth

    Conv: transfer data to the host-side program

    Biscuit: transfer data to the SSD-side module (i.e., internal read)

    0 200 400 600 800 1000 1200Request size [KiB]

    0

    1

    2

    3

    4

    5

    Band

    widt

    h [G

    iB/s

    ]

    Biscuit

    Biscuit (pattern-matching)

    Conv

    PCIe Gen.3 x4

    Config. overhead

    Biscuit exploits the underutilized internal bandwidth.

    Boncheol Gu et al. 9/16

  • Application Level Results

    BiscuitSSD

    Biscuit-awareDatabaseEngine

    Earlyfiltering

    Data analytics with a real DB engine

    MariaDB 5.5.42 (XtraDB)

    TPC-H dataset at a scale factor of 100 (160 GiB)

    We modified the query planner to

    1. identify a candidate table amenable foroffloading;

    2. estimate its selectivity1 using a samplingmethod;

    3. determine whether the table is indeed agood target (based on a selectivitythreshold);

    4. and finally offload the identified filter to theSSD.

    1a fraction of pages that satisfy filter conditions from the to