Biscuit: A Framework for Near-Data Processing of Big Data Workloads

Boncheol Gu, Andre S. Yoon, Duck-Ho Bae, Insoon Jo, Jinyoung Lee, Jonghyun Yoon, Jeong-Uk Kang, Moonsang Kwon, Chanho Yoon, Sangyeun Cho, Jaeheon Jeong, Duckhyun Chang

Memory Business, Samsung Electronics

Source: isca2016.eecs.umich.edu/wp-content/uploads/2016/07/3A-1.pdf


Near-Data Processing (NDP)

[Figure: traditional data processing (storage sends data over the host interface / network / … to the processing unit) versus near-data processing (processing runs at the storage and only the result is sent).]

Near-data processing moves computation to data.

• Computation is performed right at the data source.

• Efficient when the cost of moving data is very high

Most prior work focuses on proving the concept of NDP.

• Little attention to designing and realizing a practical framework

• Realistic large application studies were omitted.

Boncheol Gu et al. 1/16


Samsung NVMe SSD (PM1725)


Biscuit

A user-programmable NDP framework for SSDs and data-intensive applications

• C++ language support (C++ standard library)

• Dynamic loading of user programs

• Multithreading, multi-core support

• The first reported product-strength NDP system


SSD Hardware

[Figure: inside of the PM1725: an NDP-enabled SSD controller connected to the PCIe interface, DRAM, and NAND flash memory.]

Item                            Description
Host interface                  PCIe Gen.3 ×4 (3.2 GB/s)
Protocol                        NVMe 1.1
Device density                  1 TB
SSD architecture                Multiple channels/ways/cores
Storage medium                  Multi-bit NAND flash memory
Compute resources for Biscuit   Two ARM Cortex R7 cores @750MHz with MPU
On-chip SRAM                    < 1 MiB
DRAM                            ≥ 1 GiB

Hardware pattern matcher on each flash memory channel.

• To search given bit patterns in the flash memory

Limitations

• Low compute power, no cache coherence, a small amount of fast memory, no MMU, and restrictive synchronization primitives


Biscuit Runtime

Cooperative multithreading

• A limited form of multithreading (fiber as a scheduling unit)

• Less context switching overhead

• Safe resource sharing without locking

Shared nothing architecture

• All data transmission among threads through I/O ports

• Enforced by the programming model and APIs

• C++11 move semantics supported

Dynamic loader for user programs

• User program as position-independent code (PIC)

• Symbol relocation to locate each program in a separate address space
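
The runtime ideas above (cooperative scheduling, no shared state, data handed over through ports with move semantics) can be illustrated with a small host-side sketch. This is not Biscuit code: `Port`, `Fiber`, `runAll`, and `demoPipeline` are hypothetical names, and a "fiber" is modeled as a step function that returns when it would yield.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an I/O port: a FIFO that takes ownership of values.
template <typename T>
class Port {
public:
    void put(T value) { items_.push_back(std::move(value)); }  // moved in, never shared
    bool get(T& out) {
        if (items_.empty()) return false;
        out = std::move(items_.front());
        items_.pop_front();
        return true;
    }
private:
    std::deque<T> items_;
};

// A "fiber" is modeled as a step function: it does a bounded amount of work,
// then returns true to be scheduled again (a yield) or false when finished.
using Fiber = std::function<bool()>;

// Round-robin cooperative scheduler. Only one fiber runs at a time, so no locks
// are needed. Returns the number of steps executed.
inline std::size_t runAll(std::vector<Fiber> fibers) {
    std::size_t steps = 0;
    while (!fibers.empty()) {
        for (std::size_t i = 0; i < fibers.size();) {
            ++steps;
            if (fibers[i]()) {
                ++i;
            } else {
                fibers.erase(fibers.begin() + static_cast<std::ptrdiff_t>(i));
            }
        }
    }
    return steps;
}

// Producer and consumer exchange strings only through the port.
inline std::vector<std::string> demoPipeline(std::vector<std::string> words) {
    Port<std::string> port;
    std::vector<std::string> received;
    std::size_t next = 0;
    bool producerDone = false;

    Fiber producer = [&]() -> bool {
        if (next == words.size()) { producerDone = true; return false; }
        port.put(std::move(words[next++]));
        return true;
    };
    Fiber consumer = [&]() -> bool {
        std::string w;
        if (port.get(w)) { received.push_back(std::move(w)); return true; }
        return !producerDone;  // keep polling until the producer has finished
    };
    runAll({producer, consumer});
    return received;
}
```

Calling `demoPipeline({"a", "b"})` interleaves the two fibers one step at a time and returns the words in the order they were produced.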


Biscuit System Architecture

[Figure: Biscuit system architecture. Host system: host-side program with host-side tasks, libsisc, NVMe driver, CPUs, LLC, main memory, system agent, PCIe I/F. Biscuit-enabled SSD: SSD-side modules with SSD-side tasks and libslet, Biscuit runtime (fibers), SSD firmware, host interface controller, CPUs, SRAM, DRAM, hardware pattern matcher, and an FMC with NAND for each of channels #0 to #(nch-1). The host sends I/O requests to the SSD.]


Development Process

1. Write the code: an SSD-side task and a host-side task.

2. ARM cross compile the SSD-side task into an SSD-side module.

3. x86 compile the host-side task into the host program.

4. Copy the module into /var/isc/slets (/dev/nvme0n1).

5. Run the host program.


Experimental Setup

H/W setup

System   Dell PowerEdge R720 server
CPU      2 Intel Xeon(R) CPU E5-2640 (12 threads per socket) @2.50GHz
Memory   64 GiB DRAM
OS       64-bit Ubuntu 15.04

Basic performance results

• Communication latency, data read bandwidth and latency

Application level results

• String search, pointer chasing, and DB scan/filtering

Notations

• Conv: system configuration with a default conventional SSD

• Biscuit: system configuration with the Biscuit framework on the SSD


Basic Performance Results - Data Read Bandwidth

• Conv: transfer data to the host-side program

• Biscuit: transfer data to the SSD-side module (i.e., "internal read")

[Figure: bandwidth [GiB/s] (0 to 5) versus request size [KiB] (0 to 1200) for Biscuit, Biscuit (pattern-matching), and Conv, with the PCIe Gen.3 x4 limit and the config. overhead marked.]

Biscuit exploits the underutilized internal bandwidth.


Application Level Results

[Figure: a Biscuit-aware database engine performing early filtering on the Biscuit SSD.]

Data analytics with a real DB engine

• MariaDB 5.5.42 (XtraDB)

• TPC-H dataset at a scale factor of 100 (160 GiB)

• We modified the query planner to

1. identify a candidate table amenable for offloading;

2. estimate its selectivity¹ using a sampling method;

3. determine whether the table is indeed a good target (based on a selectivity threshold);

4. and finally offload the identified filter to the SSD.

¹ A fraction of pages that satisfy filter conditions from the total number of pages in the table
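
The planner steps above can be sketched as follows. This is only an illustration under stated assumptions: pages are modeled as vectors of integer keys, every `stride`-th page is sampled, and a table is treated as a good target when the estimated selectivity is at or below the threshold (the slides give neither the sampling scheme nor the threshold value). `Page`, `estimateSelectivity`, and `shouldOffload` are hypothetical names, not part of MariaDB or Biscuit.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Assumption: a page is modeled as a list of integer row keys.
using Page = std::vector<int>;

// Estimate selectivity as defined in the footnote: the fraction of pages that
// contain at least one row satisfying the filter. Every `stride`-th page is sampled.
inline double estimateSelectivity(const std::vector<Page>& table,
                                  const std::function<bool(int)>& filter,
                                  std::size_t stride) {
    if (table.empty() || stride == 0) return 1.0;  // no evidence: assume every page matches
    std::size_t sampled = 0;
    std::size_t matching = 0;
    for (std::size_t i = 0; i < table.size(); i += stride) {
        ++sampled;
        for (int row : table[i]) {
            if (filter(row)) { ++matching; break; }  // one hit is enough for this page
        }
    }
    return static_cast<double>(matching) / static_cast<double>(sampled);
}

// Assumption: few matching pages means the SSD can discard most data early,
// so low selectivity favors offloading.
inline bool shouldOffload(double selectivity, double threshold) {
    return selectivity <= threshold;
}
```

With ten single-row pages holding keys 0 to 9 and a filter `row == 0`, sampling every page gives a selectivity of 0.1, which would pass a hypothetical threshold of 0.25.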


Sample Queries

Query 1¹ (single filter predicate, selectivity for shipdate: 0.02)

SELECT l_orderkey, l_shipdate, l_linenumber
FROM lineitem
WHERE l_shipdate = '1995-1-17'

Query 2¹ (more complex WHERE clause, selectivity for shipdate: 0.04)

SELECT l_orderkey, l_shipdate, l_linenumber
FROM lineitem
WHERE (l_shipdate = '1995-1-17' OR l_shipdate = '1995-1-18')
  AND (l_linenumber = 1 OR l_linenumber = 2)

High filtering ratio with a low computational complexity

¹ "Ibex - An Intelligent Storage Engine with Support for Advanced SQL Off-loading", VLDB 2014


Application Level Results - Execution Time

[Figure: execution time [sec] (0 to 500) of Biscuit and Conv for Query 1 and Query 2.]

• Biscuit achieves speed-ups of about 11× and 10×.

• Large internal bandwidth of Biscuit

• Rapid scanning of the dataset by the hardware pattern matcher

• Execution times on Biscuit were very consistent.


Application Level Results - Power Consumption (Query 1)

[Figure: power [Watts] (100 to 160) over time [sec] (0 to 500) for Conv and Biscuit.]

• Some extra work, such as synchronizing the buffer cache, is performed after the query execution (included in the results).

• Biscuit consumes more power¹ during query processing.

• It keeps the target SSD busy and exploits nearly full bandwidth.

¹ The power is of the whole system, including the server and the target SSD.


Application Level Results - Power Consumption

                        Biscuit   Conv
Average power (Watts)   136       122
Total energy (kJ)       12.2      60.5

• Biscuit achieves significantly lower energy consumption thanks to its reduced execution time.
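
As a rough check of this claim, energy is average power times elapsed time, so the measurement window implied by each pair of numbers can be recovered. The arithmetic below assumes that 136 W and 12.2 kJ belong to Biscuit and 122 W and 60.5 kJ to Conv, which is the assignment consistent with Biscuit drawing more power but using less energy; the derived times are estimates, not values stated on the slides.

```latex
\begin{align*}
E &= \bar{P}\, t \quad\Longrightarrow\quad t = \frac{E}{\bar{P}} \\
t_{\text{Biscuit}} &\approx \frac{12.2\ \text{kJ}}{136\ \text{W}} \approx 89.7\ \text{s},
\qquad
t_{\text{Conv}} \approx \frac{60.5\ \text{kJ}}{122\ \text{W}} \approx 495.9\ \text{s} \\
\frac{E_{\text{Conv}}}{E_{\text{Biscuit}}} &= \frac{60.5}{12.2} \approx 4.96,
\qquad
\frac{\bar{P}_{\text{Biscuit}}}{\bar{P}_{\text{Conv}}} = \frac{136}{122} \approx 1.11
\end{align*}
```

Under these assumptions, roughly 11% higher average power is outweighed by a measurement window about 5.5 times shorter, giving about 5 times less energy.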


TPC-H Execution Results

Query   I/O ratio   Speed-up
Q14     315.4x      166.8x
Q2      113.0x      43.7x
Q10     19.6x       9.1x
Q8      21.0x       4.6x
Q9      2.7x        2.8x
Q17     3.8x        1.6x
Q12     1.0x        1.2x
Q5      1.4x        1.1x
GM      10.9x       6.1x

• Running all queries, Conv takes nearly two days, while Biscuit takes about 13 hours (3.6x speed-up).

• Top 5 queries take 70+% of total execution time.
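
The GM entries in the results above can be cross-checked with a short geometric-mean computation over the eight per-query labels. This is a sketch for verification only; `geometricMean`, `speedUps`, and `ioRatios` are names introduced here, and the inputs are the rounded values printed on the chart, so the output should only approximately match the reported 6.1x and 10.9x.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Geometric mean computed in log space to avoid overflow on large products.
inline double geometricMean(const std::vector<double>& xs) {
    assert(!xs.empty());
    double logSum = 0.0;
    for (double x : xs) {
        assert(x > 0.0);  // the geometric mean is defined for positive values only
        logSum += std::log(x);
    }
    return std::exp(logSum / static_cast<double>(xs.size()));
}

// Per-query labels read off the chart, in chart order (Q14, Q2, Q10, Q8, Q9, Q17, Q12, Q5).
inline std::vector<double> speedUps() {
    return {166.8, 43.7, 9.1, 4.6, 2.8, 1.6, 1.2, 1.1};
}
inline std::vector<double> ioRatios() {
    return {315.4, 113.0, 19.6, 21.0, 2.7, 3.8, 1.0, 1.4};
}
```

Note that a geometric mean damps the effect of the two outliers (Q14 and Q2); an arithmetic mean of the same speed-ups would be several times larger.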


Conclusions

• We presented the design and implementation of Biscuit, an NDP framework built for high-speed SSDs.

• With Biscuit, we pursued high programmability on distributed resources, including the processing units of SSDs as well as host CPUs.

• Biscuit is the first reported product-strength NDP system implementation.

• We successfully ported small and large data-intensive applications, including MariaDB, to Biscuit.

• Biscuit improved performance by up to 166x for TPC-H queries (6.1x on average).


Programming Model

[Figure: a host-side program (coordinator) uses libsisc and proxies to drive SSDlet instances, each with in and out ports, inside an SSD-side module (computation units) built on libslet and running on the Biscuit runtime. App. 1 and App. 2 are shown.]

Computation model

• Specifying computation using its input and output data

• Libslet to define SSD-side tasks and perform file I/O

Coordination model

• Creating and managing tasks

• Establishing producer/consumer relationship

• Libsisc to invoke and coordinate execution of SSD-side tasks


SSDlet

• A simple C++ program written with Biscuit APIs.

• A unit of execution independently scheduled, represented by the SSDLet class

Example (User-defined SSDlet)

class UserTask
    : public SSDLet<IN_TYPE<int32_t>,
                    OUT_TYPE<int32_t, bool>,
                    ARG_TYPE<double>> {
public:
    void run() override {
        auto in = getInputPort<0>();
        auto out0 = getOutputPort<0>();
        auto out1 = getOutputPort<1>();
        double& value = getArgument<0>();
        // do some computation
    }
};

• A user task is derived from the SSDLet class.

• Its run method is overridden to describe the execution of the task.

• The type information is used to perform strict type checking at compile/run time.


Wordcount Example

[Figure: wordcount module. The host-side program passes a filename to the Mappers; each Mapper sends word to the Shuffler; the Shuffler sends <word, vec> to the Reducers; the Reducers return <word, frequency> to the host-side program.]


Wordcount Example

class Mapper : public SSDLet<OUT_TYPE<std::pair<std::string, uint32_t>>,
                             ARG_TYPE<File>> {
public:
    void run() {
        auto& file = getArgument<0>();
        FileStream fs(std::move(file));
        auto output = getOutputPort<0>();
        while (true) {
            ...
            if (!readline(fs, line)) break;
            line.tokenize();
            while ((word = line.next_token()) != line.cend()) {
                // put output (i.e., each word) to the output port
                if (!output.put({std::string(word), 1})) return;
            }
        }
    }
};


Wordcount Example

int main(int argc, char *argv[]) {
    SSD ssd("/dev/nvme0n1");
    auto mid = ssd.loadModule(File(ssd, "/var/isc/slets/wordcount.slet"));

    // create an Application instance and proxy SSDLet instances
    Application wc(ssd);
    SSDLet mapper1(wc, mid, "idMapper", make_tuple(File(ssd, filename)));
    SSDLet shuffler(wc, mid, "idShuffler");
    SSDLet reducer1(wc, mid, "idReducer");
    ...

    // make connections between SSDlets and from Reducers back to the host
    wc.connect(mapper1.out(0), shuffler.in(0));
    wc.connect(shuffler.out(0), reducer1.in(0));
    auto port1 = wc.connectTo<pair<string, uint32_t>>(reducer1.out(0));


Wordcount Example

    ...

    // start application so that all SSDlets would begin execution
    wc.start();

    pair<string, uint32_t> value;
    while (port1.get(value) || port2.get(value)) // print out <word,freq> pairs
        cout << value.first << "\t" << value.second << endl;

    // wait until all SSDlets stop execution and unload the wordcount module
    wc.wait();
    ssd.unloadModule(mid);
    return 0;
}
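
To make the dataflow of the wordcount example concrete without an SSD, the three stages can be mimicked with standard containers. This is a plain C++ sketch of the same Mapper, Shuffler, and Reducer roles, not Biscuit code: it runs the stages one after another on the host instead of as concurrently scheduled SSDlets, and `mapWords`, `shuffle`, `reduceCounts`, and `wordcount` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using WordOne = std::pair<std::string, std::uint32_t>;
using Groups = std::map<std::string, std::vector<std::uint32_t>>;

// Mapper role: split the text on whitespace and emit <word, 1> for each word.
inline std::vector<WordOne> mapWords(const std::string& text) {
    std::vector<WordOne> out;
    std::istringstream in(text);
    std::string word;
    while (in >> word) out.emplace_back(word, 1u);
    return out;
}

// Shuffler role: group the emitted counts by word, producing <word, vec>.
inline Groups shuffle(const std::vector<WordOne>& pairs) {
    Groups groups;
    for (const auto& p : pairs) groups[p.first].push_back(p.second);
    return groups;
}

// Reducer role: sum each vector, producing <word, frequency>.
inline std::map<std::string, std::uint32_t> reduceCounts(const Groups& groups) {
    std::map<std::string, std::uint32_t> freq;
    for (const auto& g : groups) {
        std::uint32_t sum = 0;
        for (std::uint32_t v : g.second) sum += v;
        freq[g.first] = sum;
    }
    return freq;
}

// Whole pipeline, run sequentially for illustration.
inline std::map<std::string, std::uint32_t> wordcount(const std::string& text) {
    return reduceCounts(shuffle(mapWords(text)));
}
```

For the input `"a b a"`, the pipeline yields a frequency of 2 for `a` and 1 for `b`. In Biscuit the stages would instead be connected by ports and could run on separate fibers.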


I/O Ports

Communication through ports

• (a) Inter-SSDlet ports: among SSDlet instances belonging to a single Application instance

• (b) Host-to-device ports: between an SSDlet instance and a host program

• (c) Inter-application ports: between two SSDlets from different Application instances

• The APIs are strongly typed and implicit type conversion is not allowed.
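
One way such a rule could be enforced at compile time is sketched below. This is an assumption about technique, not the Biscuit implementation: `TypedPort` and `isExactPortType` are hypothetical names, and the check simply rejects any argument whose decayed type differs from the port's element type.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <type_traits>
#include <utility>

// True only when U (ignoring references and cv-qualifiers) is exactly T.
template <typename T, typename U>
constexpr bool isExactPortType() {
    return std::is_same<T, typename std::decay<U>::type>::value;
}

// Hypothetical strongly typed port: put() refuses implicitly convertible types.
template <typename T>
class TypedPort {
public:
    template <typename U>
    bool put(U&& value) {
        static_assert(isExactPortType<T, U>(),
                      "port element type mismatch: implicit conversion is not allowed");
        queue_.push_back(std::forward<U>(value));
        return true;
    }
    bool get(T& out) {
        if (queue_.empty()) return false;
        out = std::move(queue_.front());
        queue_.pop_front();
        return true;
    }
private:
    std::deque<T> queue_;
};
```

With this sketch, putting an `int32_t` variable into a `TypedPort<std::int32_t>` compiles, while passing a `double` would fail at the `static_assert` instead of being silently narrowed.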

[Figure: port types among a host-side program, the host I/F, and SSDlets of App. 1 and App. 2 on the Biscuit runtime, labeled (a), (b), and (c), with input and output ports marked.]


Dynamic module loading

• Libslet bridges the Biscuit runtime and SSDlet instances.

• An SSDlet instance obtains a function table from the runtime.

• The table holds key functions for memory management, scheduling, and file I/O.

• Biscuit receives the functions necessary to create and manage SSDlet instances.

• Symbol relocation is performed to locate each SSDlet in a separate address space.
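
The function-table handshake described above can be pictured with plain function pointers. This is a simplified sketch under stated assumptions: `RuntimeTable`, `ModuleShim`, and the `runtime` namespace are hypothetical, the table covers only allocation and yielding, and ordinary calls stand in for whatever mechanism the real runtime uses to enter its services.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical table of runtime services handed to a module at load time.
struct RuntimeTable {
    void* (*allocate)(std::size_t);
    void (*release)(void*);
    void (*yield)();
};

// Module side: a stand-in for the bridging library. It stores the table and
// forwards calls, so module code never links against the runtime directly.
class ModuleShim {
public:
    explicit ModuleShim(const RuntimeTable& table) : table_(table) {}
    void* alloc(std::size_t n) const { return table_.allocate(n); }
    void release(void* p) const { table_.release(p); }
    void yield() const { table_.yield(); }
private:
    RuntimeTable table_;
};

// Runtime side: toy implementations of the services behind the table.
namespace runtime {
inline int& yieldCount() { static int n = 0; return n; }
inline void* allocate(std::size_t n) { return std::malloc(n); }
inline void release(void* p) { std::free(p); }
inline void yield() { ++yieldCount(); }  // a real scheduler would switch fibers here
inline RuntimeTable makeTable() { return RuntimeTable{&allocate, &release, &yield}; }
}  // namespace runtime
```

A loader in this sketch would call `runtime::makeTable()` once and pass the result to each module's shim; the indirection is what lets a module be built separately and loaded later.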

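[Figure: a module containing SSDlets and libslet. Calls such as malloc(), free(), yield(), and read() go through libslet to the Biscuit runtime (swi #n, swi #m, __yield(), __read()); createSSDLet() is shown between the runtime and the module.]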
