
Chronicle: Capture and Analysis of NFS Workloads at Line Rate

Ardalan Kangarlou, Sandip Shete, and John Strunk
Advanced Technology Group

© 2015 NetApp, Inc. All rights reserved.

Motivation

Goal: To gather insights from customer workloads via trace capture and real-time analysis.


•  Design: evaluating new algorithms

•  Test: designing representative benchmarks

•  Support: diagnosing problems and misconfigurations

•  Sales: identifying appropriate storage platforms

Use Case 1: Research


An I/O-by-I/O view of workloads is preferred for research in:

•  Data caching, prefetching, and tiering techniques

•  Data deduplication analysis

•  Creation of representative benchmarks for real-world workloads

•  Studying data access and growth patterns over time


Use Case 2: Sales


Which platform?

Workload Sizing Questionnaire:
•  I/O rate?
•  Number of clients?
•  Random read working set size?
•  Ratio of random reads vs. random writes?

Chronicle Framework


•  Capture and real-time analysis of NFSv3 workloads

•  Rates above 10Gb/s

•  Commodity hardware

[Diagram: a Chronicle appliance passively taps the network path between NFS clients and the NFS server]

•  Passive network monitoring

•  Runs for days to weeks

•  Programmer-friendly and extensible API

Background – NFS Traffic


[Diagram: the OSI reference model (Physical, Link, Network, Transport, Session, Presentation, Application) aligned with the layout of NFS traffic on the wire: the Ethernet, IP, and TCP headers are examined via deep packet inspection (DPI), and TCP reassembly reconstructs each RPC protocol data unit (PDU), which in turn carries an NFS PDU]
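To make the DPI step concrete, here is a minimal header-walk sketch in C++ that locates the TCP payload (the bytes carrying RPC record marks) inside a raw Ethernet frame. It assumes plain IPv4 over Ethernet with no VLAN tags or IP fragmentation, and it is an illustration, not Chronicle's parser:

```cpp
#include <cstddef>
#include <cstdint>

// Walk Ethernet -> IPv4 -> TCP headers and return a pointer to the TCP
// payload, writing its length to *out_len. Returns nullptr for frames the
// sketch does not handle (non-IPv4, non-TCP, or truncated).
const uint8_t *tcpPayload(const uint8_t *frame, size_t len, size_t *out_len) {
    if (len < 14 + 20 + 20) return nullptr;            // Ethernet + min IP + min TCP
    if (((frame[12] << 8) | frame[13]) != 0x0800) return nullptr; // IPv4 only
    const uint8_t *ip = frame + 14;
    size_t ihl = (ip[0] & 0x0f) * 4;                   // IPv4 header length
    if (ip[9] != 6) return nullptr;                    // protocol 6 == TCP
    if (len < 14 + ihl + 20) return nullptr;           // room for the TCP header
    const uint8_t *tcp = ip + ihl;
    size_t doff = (tcp[12] >> 4) * 4;                  // TCP data offset
    size_t ip_total = ((size_t)ip[2] << 8) | ip[3];    // IP total length
    if (ip_total < ihl + doff || len < 14 + ip_total) return nullptr;
    *out_len = ip_total - ihl - doff;                  // RPC bytes in this segment
    return tcp + doff;
}
```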

Background – NAS Tracing

•  [Ellard-FAST’03] and [Leung-ATC’08]: NFS and CIFS workload analysis based on pcap traces

•  Driverdump [Anderson-FAST’09]: the fastest software-only capture solution, operating @ 1.4Gb/s; the network driver stores packets directly in the pcap format

•  Main limitations for our use case:
   •  Packet capture through the lossy pcap interface
   •  High storage bandwidth and capacity requirements: trace capture @ 10Gb/s for a week requires 750TB of storage (10Gb/s ≈ 1.25GB/s, and 1.25GB/s × 604,800s/week ≈ 756TB)!
   •  pcap traces require offline parsing of data; stateless parsing cannot recover fields that span packets


Our Approach – Efficient Trace Storage


•  Instead of storing raw packets (i.e., the pcap format):
   •  Use DPI to identify fields of interest in packets
   •  Checksum read and write data for data deduplication analysis

•  Leverage DataSeries [Anderson-OSR’09] as the trace format:
   •  Efficient storage of structured, serial data
   •  Inline compression, non-blocking I/O, and delta encoding
   •  Extents for storing RPC-, NFS-, and network-level information as well as read/write data checksums

•  With the above techniques, a single standard disk can handle the storage bandwidth requirements for tracing at line rate (at a ~40x reduction, 10Gb/s ≈ 1.25GB/s of raw traffic shrinks to roughly 31MB/s of trace data, well within one disk's sequential write bandwidth)!

[Diagram: simplified RPC extent with fields record_id, transaction_id, operation, client, server, request_ts, and reply_ts]
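As a rough illustration of what one extent row holds, the following hypothetical C++ struct mirrors the fields named above; the types are guesses for illustration, not Chronicle's actual DataSeries schema:

```cpp
#include <cstdint>

// Hypothetical flat record mirroring the simplified RPC extent shown above.
// Field names come from the slide; the types are illustrative assumptions.
struct RpcExtentRecord {
    uint64_t record_id;      // monotonically increasing record number
    uint32_t transaction_id; // RPC XID, used to match a reply to its call
    uint32_t operation;      // NFSv3 procedure number (READ, WRITE, ...)
    uint32_t client;         // client IP address
    uint32_t server;         // server IP address
    uint64_t request_ts;     // capture timestamp of the RPC call
    uint64_t reply_ts;       // capture timestamp of the matching reply
};
```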

Background – Packet Processing

•  Active area of research in software routing and network security:
   •  Common techniques: partitioning and pipelining work across cores, judicious placement and scheduling of threads, minimizing synchronization overhead, batch processing, recycling allocated memory, zero-copy parsing, and bypassing the kernel

•  Some examples:
   •  RouteBricks [Dobrescu-SOSP’09]: packet forwarding @ 10Gb/s, packet routing @ 6.4Gb/s, and IPsec @ 1.4Gb/s; required “tedious manual tuning”
   •  NetSlices [Marian-ANCS’12]: fixed mapping between packets and cores to support 9.7Gb/s routing throughput
   •  Click [Kohler-TOCS’00, Chen-USENIXATC’01]: kernel-mode (3-4 Mpps per core) and user-mode (490 Kpps)
   •  netmap [Rizzo-USENIXATC’12, Rizzo-INFOCOM’12]: send/receive packets at line rate (14.88 Mpps @ 10Gb/s); 20ns/pkt vs. 500-1000ns/pkt for sockets; user-space Click on netmap matched kernel-mode Click’s throughput (3.9 Mpps)


Packet Processing Frameworks cont.

•  Major limitations for our use case:
   •  A network-centric view of packet processing:
      •  No DPI, TCP reassembly, or stateful parsing across packets
      •  A fixed, small per-packet processing cost is assumed
      •  Maintaining low latency is treated as being as important as high throughput
   •  Manual tuning for specific hardware platforms
   •  Manual management of shared resources and state (e.g., locks, thread-safe queues)
   •  Kernel implementations are hard to extend with custom libraries

Main Challenge:

To extend proven packet processing techniques to the application layer, for a more CPU-intensive use case, and in a programmer-friendly manner!


Our Approach – Packet Processing

•  Libtask: a user-space actor model library
   •  Performance:
      •  Seamless scalability to many cores
      •  Implicit batching of work to support high throughput
   •  Flexibility and usability:
      •  A pluggable, pipelined architecture
      •  Portable software that hides the hardware configuration from users
      •  Unburdens application programmers from concurrency bugs

•  Leverage netmap instead of libpcap for reading packets (see the sketch after this list)
   •  An efficient kernel-bypass framework based on modified network drivers
   •  We extended netmap to support jumbo frames


Background – Actor Model Programming


Actor: A computation agent that processes tasks

Message: Information to be shared with a target actor about a task or tasks

Libtask


•  A light-weight actor model library written in C++

•  Three constructs: Scheduler, Process, and Message

[Diagram: a Scheduler runs on each core, executing Processes that exchange Messages]
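To make the three constructs concrete, here is a heavily simplified, hypothetical single-threaded sketch; the names mirror the slide, but the member functions are invented for illustration and are not Libtask's real API:

```cpp
#include <deque>
#include <functional>
#include <memory>
#include <vector>

struct Message {                       // information sent to a target actor
    std::function<void()> work;        // the task the message describes
};

class Process {                        // an actor: handles one task at a time
public:
    void enqueue(Message m) { mailbox_.push_back(std::move(m)); }
    bool runOne() {                    // run a single queued task, if any
        if (mailbox_.empty()) return false;
        Message m = std::move(mailbox_.front());
        mailbox_.pop_front();
        m.work();
        return true;
    }
private:
    std::deque<Message> mailbox_;      // per-actor queue: no shared-state locks
};

class Scheduler {                      // drives a set of Processes on one core
public:
    void add(std::shared_ptr<Process> p) { procs_.push_back(std::move(p)); }
    void runLoop() {                   // round-robin until all mailboxes drain
        bool busy = true;
        while (busy) {
            busy = false;
            for (auto &p : procs_) busy |= p->runOne();
        }
    }
private:
    std::vector<std::shared_ptr<Process>> procs_;
};
```

Because each Process touches only its own mailbox, programmers reason about sequential task handlers instead of locks, which is the usability point the previous slide makes.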

Libtask cont.


•  Load balancing and seamless scalability

•  Two versions of Libtask: NUMA-aware and NUMA-agnostic

[Diagram: NUMA-aware vs. NUMA-agnostic load balancing of Processes across Schedulers on the cores of CPU 1 and CPU 2]
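One ingredient of NUMA awareness is pinning each Scheduler thread to a chosen core; a minimal Linux sketch (generic, not Libtask's code) might look like:

```cpp
#include <pthread.h>  // pthread_setaffinity_np (GNU extension)
#include <sched.h>    // cpu_set_t, CPU_ZERO, CPU_SET

// Pin the calling scheduler thread to one core. A NUMA-aware build would
// pick cores (and allocate memory) from a single socket, while a
// NUMA-agnostic build could use any core in the machine.
bool pinToCore(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}
```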

Chronicle Architecture


[Diagram: Chronicle pipelines 1 through n, each fed by its own Packet Reader and Network Parser; pipeline output answers the workload sizing questionnaire (I/O rate, number of clients, random read working set size, ratio of random reads vs. random writes)]

Chronicle Pipelines


•  Trace capture pipeline: RPC Parser → NFS Parser → Checksum Module → DataSeriesWriter

•  Workload sizer pipeline: RPC Parser → NFS Parser → Workload Sizer (see the wiring sketch below)
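A wiring sketch of such a pluggable pipeline, with invented interfaces that only mirror the module names above (Chronicle's real module API differs):

```cpp
struct Pdu { /* a parsed protocol data unit */ };

class Module {
public:
    virtual ~Module() = default;
    virtual void process(const Pdu &pdu) = 0;     // consume one unit of work
    void setNext(Module *next) { next_ = next; }  // plug in the next stage
protected:
    void forward(const Pdu &pdu) { if (next_) next_->process(pdu); }
private:
    Module *next_ = nullptr;
};

class RpcParser : public Module {
public:
    void process(const Pdu &p) override { /* parse RPC header */ forward(p); }
};
class NfsParser : public Module {
public:
    void process(const Pdu &p) override { /* decode NFSv3 op */ forward(p); }
};
class WorkloadSizer : public Module {
public:
    void process(const Pdu &) override { /* update I/O-rate statistics */ }
};

int main() {
    // Wire the workload sizer pipeline: RPC Parser -> NFS Parser -> Sizer.
    RpcParser rpc; NfsParser nfs; WorkloadSizer sizer;
    rpc.setNext(&nfs);
    nfs.setNext(&sizer);
    rpc.process(Pdu{});  // in Chronicle, PDUs arrive from the network parser
}
```

Swapping the tail stage (Checksum Module + DataSeriesWriter vs. Workload Sizer) is what makes the same front end serve both pipelines.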

Chronicle Modules – RPC Parser

•  Reassembly of TCP segments

•  Construction of RPC PDUs

•  Two modes of operation: fast mode and slow mode (see the diagram below)

•  Filtering of TCP and RPC traffic

•  Detection and parsing of the RPC header

•  Matching RPC replies with the corresponding calls (see the sketch below)

[Diagram: TCP reassembly of Ethernet/IP/TCP packets into an RPC protocol data unit (PDU) carrying an NFS PDU; the RPC header is located in either fast mode or slow mode]
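As an illustration of the matching step, the sketch below pairs replies with calls by the RPC transaction ID (XID); the types and interface are assumptions for illustration, not Chronicle's RPC parser:

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical record of an in-flight RPC call awaiting its reply.
struct PendingCall {
    uint64_t request_ts;   // capture timestamp of the call
    uint32_t procedure;    // NFSv3 procedure number
};

class RpcMatcher {
public:
    void onCall(uint32_t xid, uint64_t ts, uint32_t proc) {
        pending_[xid] = PendingCall{ts, proc};
    }
    // Returns true if the reply matched a recorded call; *latency_out gets
    // reply_ts - request_ts, the per-operation latency.
    bool onReply(uint32_t xid, uint64_t reply_ts, uint64_t *latency_out) {
        auto it = pending_.find(xid);
        if (it == pending_.end()) return false;   // unmatched (e.g., loss)
        *latency_out = reply_ts - it->second.request_ts;
        pending_.erase(it);
        return true;
    }
private:
    std::unordered_map<uint32_t, PendingCall> pending_;  // keyed by XID
};
```

A matched pair is also what supplies both request_ts and reply_ts for the simplified RPC extent shown earlier.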

More information on Chronicle

Please refer to the paper for more information on the following:

•  The functions of each module in the pipeline

•  The messages passed between modules

•  Chronicle’s novel, zero-copy application layer parsing approach

•  A comprehensive comparison with other packet processing frameworks

•  Insights from Chronicle that helped our customers


Evaluation Setup


•  Chronicle server (total hardware cost: ~$10,000):
   •  Two Intel Xeon E5-2690 2.90GHz CPUs (8 physical cores/16 logical cores per CPU)
   •  128GB of 1600MHz DDR3 DRAM (64GB per CPU)
   •  Two dual-port Intel 82599EB 10GbE NICs
   •  Ten 3TB SATA disks
   •  Linux kernel 3.2.32

•  A NetApp FAS6280 as the NFS server

Libtask Evaluation – Message Ring Benchmark

[Chart: messages/s (×10⁶) vs. number of Schedulers (1-32) for NUMA-aware Libtask, NUMA-agnostic Libtask, Erlang (R15B01), and Go (1.0.2)]


•  1000 Processes pass ~100M Messages in a ring

•  100 outstanding Messages at a given time

•  Averages of 10 runs

Libtask Evaluation – All-to-All Benchmark

[Chart: messages/s (×10⁶) vs. number of Schedulers (1-32) for NUMA-aware Libtask, NUMA-agnostic Libtask, Erlang (R15B01), and Go (1.0.2)]


•  100 Processes pass ~100M Messages randomly

•  1000 outstanding Messages at a given time

•  Averages of 10 runs

Chronicle Evaluation – Maximum Sustained Throughput

[Chart: maximum sustained throughput (max/min/avg, Gb/s) vs. number of cores (1-32), with regions marked 1 CPU, 1 CPU + hyperthreading, and 2 CPUs; the NFS server maxes out at 14Gb/s]

•  One client issuing 64KB sequential R/W ops across two 10Gb links using the fio workload generator

•  Less than 2.5GB of RAM used

•  0.0002% op loss at 32 cores

Chronicle Evaluation – Maximum Sustained IOPS

[Chart: maximum sustained IOPS (max/min/avg, operations/s ×10³) vs. number of cores (1-32), with regions marked 1 CPU, 1 CPU + hyperthreading, and 2 CPUs; the NFS server maxes out at 106 KIOPS]

•  One client issuing 1-byte sequential R/W ops across two 10Gb links using the fio workload generator

•  Less than 100MB of RAM used

•  Less than 0.0001% op loss for 8-32 cores

•  150 KIOPS sustained for a metadata-intensive workload generated by 3,000 clients

Chronicle Evaluation – Packet Loss


•  Controlled experiment to study the impact of packet loss at 10Gb/s

[Chart: controlled packet-loss experiment at 10.1Gb/s; annotation marks 4% loss]

Chronicle Evaluation – Trace Storage Efficiency

•  7-hour-long trace of a production workload

•  40x reduction in trace size over pcap traces (1.8TB → 44.6GB)!


[Chart: extent size (GB) for pcap vs. DataSeries (uncompressed and compressed), broken down into total, network, RPC, NFS, and checksum extents; pcap total: 1.8TB, DS uncompressed: 321.3GB, DS compressed: 44.6GB]

Conclusions

•  Chronicle is an efficient framework for trace capture and real-time analysis of NFS workloads:
   •  Operates at 14Gb/s using general-purpose CPUs, disks, and NICs
   •  Based on actor model programming
   •  Seamless scalability and a pluggable, pipelined architecture
   •  Programmer-friendly API
   •  Performs CPU-intensive operations like stateful parsing, pattern matching, data checksumming, and compression
   •  Extensible to support other network storage protocols (e.g., SMB/CIFS, iSCSI, RESTful key-value store protocols)


Questions?

Thank You!

Chronicle’s source code is available under an academic, non-commercial license:

https://github.com/NTAP/chronicle


Use Case 3: Data-Driven Management


Libtask Evaluation – Message Ring Benchmark

[Chart: messages/s (×10⁶) vs. number of threads (1-32) for NUMA-aware and NUMA-agnostic Libtask, each with and without background load, Erlang (R15B01), and Go (1.0.2)]


Libtask Evaluation – All-to-All Benchmark

[Chart: messages/s (×10⁶) vs. number of threads (1-32) for NUMA-aware and NUMA-agnostic Libtask, each with and without background load, Erlang (R15B01), and Go (1.0.2)]


Chronicle Evaluation – Maximum Sustained Throughput

[Charts: throughput (Gb/s), normalized CPU usage (%), maximum memory usage (GB, up to 2.5), and loss of NFS calls/replies (%, log scale from 0.00001 to 1) vs. number of cores (1-32); the NFS server maximum is marked on the throughput chart]

Chronicle Evaluation – Maximum Sustained IOPS

[Charts: operations/s (×10³), normalized CPU usage (%), maximum memory usage (GB, up to 0.1), and loss of NFS calls/replies (%, log scale from 0.00001 to 1) vs. number of cores (1-32); the NFS server maximum is marked on the operations chart]