23
LPS Lightweight Provenance Service for High-Performance Computing Dong Dai*, Yong Chen, Philip Carns, John Jenkins, and Robert Ross 9/11/17 2017 PACT Portland, Oregon 1

Lightweight Provenance Service for High-Performance …In general, provenance is documented history of an object 9/11/17 2017 PACT Portland, Oregon 2 Little Dancer Aged Fourteen 1

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Lightweight Provenance Service for High-Performance …In general, provenance is documented history of an object 9/11/17 2017 PACT Portland, Oregon 2 Little Dancer Aged Fourteen 1

LPS

Lightweight Provenance Service for

High-Performance Computing

Dong Dai*, Yong Chen, Philip Carns, John Jenkins, and Robert Ross

9/11/17 2017PACTPortland,Oregon 1

Page 2: Lightweight Provenance Service for High-Performance …In general, provenance is documented history of an object 9/11/17 2017 PACT Portland, Oregon 2 Little Dancer Aged Fourteen 1

Ingeneral,provenanceisdocumentedhistoryofanobject

9/11/17 2017PACTPortland,Oregon 2

Little Dancer Aged Fourteen1. Degas, Edgar (created1878-1881) 2. René De Gas (heritage 1917) 3. Adrien-Aurélien Hébrard (a contract 5/3/1918)4. Nelly Hébrard (heritage 1937)5. M. Knoedler & Company, Inc. (cosigned 1955)6. Paul Mellon (purchased 5/1956)7. National Gallery of Art (bequest 1999).

- From National Gallery of Art website

In computer science, provenance means the lineage of data, including• processes that act on data • agents that are responsible for those processes.

Page 3: Lightweight Provenance Service for High-Performance …In general, provenance is documented history of an object 9/11/17 2017 PACT Portland, Oregon 2 Little Dancer Aged Fourteen 1

9/11/17 2017PACTPortland,Oregon 3

Open Provenance Model (OPM)• a graph-based provenance representation model

Nodes

Edges

A

Artifact Process Agent

A

P Ag

P

A P

Ag P

PP...

Example

Prj-1

Joe

catvi

mpi-ior -np 3

/scratch/joe/ior.confVersion: 1

rank 0

rank 2

/scratch/joe/ior.confVersion: 0

Page 4: Lightweight Provenance Service for High-Performance …In general, provenance is documented history of an object 9/11/17 2017 PACT Portland, Oregon 2 Little Dancer Aged Fourteen 1

emacs

/scratch/joe/ior.confVersion: 1

/scratch/joe/ior.confVersion: 0

iorior

iorior

iorior

Shared Libs

• Evaluateanewsystem• repeatedlyrunthesamebenchmark(typicallytimeconsuming)• calculateavg andstd forcomparing

• Questions• Ifunexpectedvariationsoccur,howtoensuretheyarefromyoursystemorfromyourevaluations?• Canothereasilyrepeatthesameevaluations?• ...

9/11/17 2017PACTPortland,Oregon 4

HPC provenance is useful in the simulate-analyze-publish science discovery cycle

Provenance Helps

Page 5: Lightweight Provenance Service for High-Performance …In general, provenance is documented history of an object 9/11/17 2017 PACT Portland, Oregon 2 Little Dancer Aged Fourteen 1

• Performance Requirements: • HPC users are performance sensitive. • Managing overhead should be less than 1% slowdown and less than

1MB memory footprint per core.• Coverage Requirements:• Provenance generated from multiple physical locations.• Provenance could have various granularities.

• Transparency Requirements:• Users should not change or recompile their codes for provenance.• More aggressively, users should not disable it when provenance is

used in critical missions.

9/11/17 2017PACTPortland,Oregon 5

Requirements on managing provenance in HPC

Page 6: Lightweight Provenance Service for High-Performance …In general, provenance is documented history of an object 9/11/17 2017 PACT Portland, Oregon 2 Little Dancer Aged Fourteen 1

9/11/17 2017PACTPortland,Oregon 6

However, in many cases, these metrics are conflicting• To cover more details, one has to collect more fine-grain events,

introducing higher probing overheads• To be transparent, one has to rely on noisy system events,

introducing higher processing overheads

Existing solutions make a fixed tradeoff among overhead, coverage, and transparency during design.• PASSv2 [ATC’06], SPADEv2 [Middleware’12]: transparent and

detailed, with 23% and 10% overhead respectively• Zoom [VLDB’07], VisTrails [CC’08]: low overhead, but only

covers workflows• ...

Page 7: Lightweight Provenance Service for High-Performance …In general, provenance is documented history of an object 9/11/17 2017 PACT Portland, Oregon 2 Little Dancer Aged Fourteen 1

9/11/17 2017PACTPortland,Oregon 7

LPS – Adapt with Flexible Provenance Granularity

File Atimeline

Execution

Open Close

Read

WriteRead

Write

Open/Close Granularity

First/Last Granularity (Read)

Read/Write Gran.

...

• Fix the primitive of provenance Node• Agent: user• Process: local process• Artifact: whole file (local or distributed)

• Adapt the primitive of provenance Edge• Mainly “Process” to “Artifact”• Also important for causality inference• Open/Close• First/Last• Read/Write

Page 8: Lightweight Provenance Service for High-Performance …In general, provenance is documented history of an object 9/11/17 2017 PACT Portland, Oregon 2 Little Dancer Aged Fourteen 1

9/11/17 2017PACTPortland,Oregon 8

LPS – Adapt with Flexible Provenance Granularity• Open/Close

• 2 events per {process, file} pair• No need to know all read/write operations• Low Overheads, Coarse Granularity

• First/Last• 2 events per {process, file} pair• Need to know all read/write operations• High Overheads, Fine Granularity

• Read/Write• N events per {process, file} pair• Need to know all read/write operations• Resource consuming, Not practical in HPC

File Atimeline

Execution

Open Close

Read

WriteRead

Write

Open/Close Granularity

First/Last Granularity (Read)

Read/Write Gran.

...

Flexibly Change the granularity according to current workloads

Page 9: Lightweight Provenance Service for High-Performance …In general, provenance is documented history of an object 9/11/17 2017 PACT Portland, Oregon 2 Little Dancer Aged Fourteen 1

9/11/17 2017PACTPortland,Oregon 9

LPS – Unified Model for Flexible Granularities• Flexibly changing the granularities means uncomplete event

pairs, like Open and LastAccess;• A unified granularity model• [Tevent_start, Tevent_end] to represent a unified granularity• allows any two events to be paired to form a granularity• rule table determines the legal granularity

Page 10: Lightweight Provenance Service for High-Performance …In general, provenance is documented history of an object 9/11/17 2017 PACT Portland, Oregon 2 Little Dancer Aged Fourteen 1

9/11/17 2017PACTPortland,Oregon 10

LPS Overall Architecture in HPC

Provenance

Distributed LPS Cluster

Local LPS

Logi

n N

odes

Compute Nodes

Parallel File System

ProvenanceData

Distributed LPS Cluster

Kernel Space

System Call Layer

LPS Aggregatoruser processes

/Proc

A Single Server

LPS Tracer

User SpaceLPS Aggregator

A Single Server

Server

LPS Builder LPS Builder LPS Builder LPS Builder

Server Server Server...,...

LPS Design and Implementation

coveredinDonget.al., GraphMeta:AGraph-BasedEngineforManagingLarge-ScaleHPCRichMetadata,IEEECluster,2016

Page 11: Lightweight Provenance Service for High-Performance …In general, provenance is documented history of an object 9/11/17 2017 PACT Portland, Oregon 2 Little Dancer Aged Fourteen 1

9/11/17 2017PACTPortland,Oregon 11

LPS Tracer• LPS leverages kernel instrument to collect detailed runtime

events to build provenance [among three methods]

• To support flexible granularity, it needs to enable/disable probing read/write events• Dynamic Probe• Two kernel instrument scripts (Systemtap)• The second one only probes read/write events• Can be disabled/enabled accordingly in runtime

Page 12: Lightweight Provenance Service for High-Performance …In general, provenance is documented history of an object 9/11/17 2017 PACT Portland, Oregon 2 Little Dancer Aged Fourteen 1

9/11/17 2017PACTPortland,Oregon 12

LPS Aggregator1. Monitoring overhead and direct granularity change2. Pruning noisy events to improve performance

• Instrumentation introduces overheads• Instrument Read/Write towards an

application issuing 1M 1-byte writes

• The aggregator monitors read/write frequency• a counter records the events• a timer that resets the counter• notify and change granularity

Page 13: Lightweight Provenance Service for High-Performance …In general, provenance is documented history of an object 9/11/17 2017 PACT Portland, Oregon 2 Little Dancer Aged Fourteen 1

9/11/17 13

LPS Aggregator1. Monitoring overhead and direct granularity change2. Pruning noisy events to improve performance

bash

14879+0

ls

/usr/lib64/ld-2.21.so

/etc/ld

libselinux.so.1

libcap.so.2.24

libacl.so.1.1.0

libc-2.21.so

libpcre.so.1.2.5

libdl-2.21.so

libattr.so.1.1.0

libpthread-2.21.so

locale-archive

/home/daidong

UNDEFINED

UNDEFINED

bash

/usr/bin/sed

gconv-modules.cache

UNDEFINED

UNDEFINED

UNDEFINED

bash

UNDEFINED

UNDEFINED

bash

UNDEFINED

UNDEFINED

UNDEFINED

UNDEFINED

UNDEFINED

UNDEFINED

UNDEFINED

bash

UNDEFINED

UNDEFINED

UNDEFINED

UNDEFINED

1952+0

firefox

15453+1441600853121328

basename

firefox

uname

UNDEFINED

/proc/15458/task/15459/stat

/proc/15458/stat

UTF-16.so

/home/daidong/.mozilla/firefox/Crash Reports/InstallTime20150518070114

/usr/share/locale/locale.alias

gtk30.mo

/run/user/1000/gdm/Xauthorityfirefox

mkdir

UNDEFINED

UNDEFINED

firefox

UNDEFINED

UNDEFINED

firefox

run

run

dirname

run

UNDEFINED

15453+1441600856998997

/dev/null

swrast_dri.so

libdrm_nouveau.so.2.0.0

libdrm_intel.so.1.0.0

libdrm_radeon.so.1.0.1

libelf-0.161.so

libLLVM-3.5.so

libpciaccess.so.0.11.1

libedit.so.0.0.53

libtinfo.so.5.9

/sys/devices/system/cpu/online

/etc/drirc

UNDEFINED

UNDEFINED

UNDEFINED

UNDEFINED

UNDEFINED

bash

UNDEFINED

UNDEFINED

UNDEFINED

UNDEFINED

/home/daidong/.cache/mozilla/firefox/0vphsm5b.default/cache2/entries/85222C3CBB346ADAAB5A6E9B7FB98BAB19631C19

/proc/15453/mountinfo

/home/daidong/.cache/mozilla/firefox/0vphsm5b.default/cache2/entries/F18D85F52EBBBA2AB081EF739ED0D6E8A76D497C

/home/daidong/.cache/mozilla/firefox/0vphsm5b.default/cache2/index.log

/home/daidong/.cache/mozilla/firefox/0vphsm5b.default/cache2/index

UNDEFINED

UNDEFINED

UNDEFINED

UNDEFINED

bash

UNDEFINED

UNDEFINED

UNDEFINED

UNDEFINED

UNDEFINED

UNDEFINED

/home/daidong/DocumentsUNDEFINED

UNDEFINED

UNDEFINED

UNDEFINED

UNDEFINED

UNDEFINED

UNDEFINED

UNDEFINED

UNDEFINED

UNDEFINED

UNDEFINED

UNDEFINED

UNDEFINED

UNDEFINED

UNDEFINED

UNDEFINED

UNDEFINED

UNDEFINEDUNDEFINED

UNDEFINED

UNDEFINED

UNDEFINED

/home/daidong/.ICEauthority

/usr/share/emacs/24.5/etc/charsets/MULE-is13194.map

/usr/share/emacs/24.5/etc/charsets/symbol.map

DejaVuSansMono.ttf

/usr/share/icons/Adwaita/cursors/xterm

/usr/share/icons/Adwaita/cursors/left_ptr

/usr/share/icons/Adwaita/cursors/watch

/usr/share/icons/Adwaita/cursors/hand2

/usr/share/icons/Adwaita/cursors/sb_h_double_arrow

/usr/share/icons/Adwaita/cursors/sb_v_double_arrow

/usr/share/icons/Adwaita/index.theme/usr/share/icons/default/index.theme

/usr/share/emacs/24.5/etc/images/icons/hicolor/scalable/apps/emacs.svg

mime.cache

libpixbufloader-svg.so

/home/daidong/.emacs.d/auto-save-list

/usr/lib64/pango/1.8.0/modules.cache

Cantarell-Regular.otf

DejaVuSansMono-Oblique.ttf

/usr/share/emacs/24.5/lisp/calendar/time-date.elc/usr/share/emacs/site-lisp/site-start.el

/usr/share/emacs/site-lisp/site-start.d

/usr/share/emacs/site-lisp/site-start.d/autoconf-init.el

DejaVuSansMono-Bold.ttf

/usr/share/emacs/site-lisp/site-start.d/desktop-entry-mode-init.el

/usr/share/emacs/24.5/etc/images/new.xpm

/usr/share/emacs/24.5/etc/images/open.xpm

/usr/share/emacs/24.5/etc/images/diropen.xpm

/usr/share/emacs/24.5/etc/images/close.xpm

/usr/share/emacs/24.5/etc/images/save.xpm

/usr/share/emacs/24.5/etc/images/undo.xpm

/usr/share/emacs/24.5/etc/images/cut.xpm

/usr/share/emacs/24.5/etc/images/copy.xpm

/usr/share/emacs/24.5/etc/images/paste.xpm

/usr/share/emacs/24.5/etc/images/search.xpm

icon-theme.cache

/usr/share/icons/hicolor/index.theme

icon-theme.cache

/usr/share/icons/gnome/scalable/places

/usr/share/icons

/usr/share/pixmaps

/usr/share/icons/Adwaita/24x24/actions/document-new.png

/usr/share/icons/Adwaita/24x24/actions/document-open.png

/usr/share/icons/Adwaita/24x24/apps/system-file-manager.png

/usr/share/icons/Adwaita/24x24/actions/window-close.png

/usr/share/icons/Adwaita/24x24/actions/document-save.png

/usr/share/icons/Adwaita/24x24/actions/edit-undo.png

/usr/share/icons/Adwaita/24x24/actions/edit-cut.png

/usr/share/icons/Adwaita/24x24/actions/edit-copy.png

/usr/share/icons/Adwaita/24x24/actions/edit-paste.png

/usr/share/icons/Adwaita/24x24/actions/edit-find.png

/usr/share/emacs/site-lisp/site-start.d/systemtap-init.el

/home/daidong/.emacs

/usr/share/emacs/site-lisp/default.el

/home/daidong/Documents/fs.stp

/usr/share/emacs/site-lisp/systemtap-mode.el

/usr/share/emacs/24.5/lisp/progmodes/cc-mode.elc

/usr/share/emacs/24.5/lisp/progmodes/cc-defs.elc

/usr/share/emacs/24.5/lisp/progmodes/cc-vars.elc

/usr/share/emacs/24.5/lisp/progmodes/cc-engine.elc

/usr/share/emacs/24.5/lisp/progmodes/cc-styles.elc

/usr/share/emacs/24.5/lisp/progmodes/cc-align.elc

/usr/share/emacs/24.5/lisp/progmodes/cc-cmds.elc

/usr/share/emacs/24.5/lisp/progmodes/cc-menus.elc

/usr/share/emacs/24.5/lisp/progmodes/cc-guess.elc

/usr/share/emacs/24.5/lisp/emacs-lisp/easymenu.elc

/usr/share/emacs/24.5/lisp/progmodes/cc-fonts.elc

/usr/share/emacs/24.5/lisp/progmodes/cc-awk.elc

/usr/share/emacs/24.5/lisp/progmodes/cc-langs.elc/usr/share/emacs/24.5/lisp/emacs-lisp/cl-lib.elc

/usr/share/zoneinfo/America/Chicago

/usr/share/emacs/24.5/lisp/emacs-lisp/cl-loaddefs.el

/usr/share/emacs/24.5/lisp/emacs-lisp/cl.elc

/usr/share/emacs/24.5/lisp/emacs-lisp/gv.elc

/usr/share/emacs/24.5/lisp/emacs-lisp/bytecomp.elc

/usr/share/emacs/24.5/lisp/emacs-lisp/cconv.elc

/usr/share/emacs/24.5/lisp/emacs-lisp/cl-extra.elc

/usr/share/emacs/24.5/lisp/emacs-lisp/byte-opt.elc

/usr/share/emacs/24.5/lisp/emacs-lisp/cl-seq.elc

/usr/share/emacs/24.5/lisp/emacs-lisp/derived.elc

/usr/share/emacs/24.5/lisp/emacs-lisp/cl-macs.elc

/usr/share/emacs/24.5/etc/images/splash.svg

/usr/share/emacs/24.5/etc/tutorials/TUTORIAL

/usr/share/emacs/24.5/etc/images/checked.xpm

/usr/share/emacs/24.5/etc/images/unchecked.xpm

DejaVuSans.ttf

DejaVuSans-Oblique.ttf

/home/daidong/.emacs.d/auto-save-list/.saves-15582-localhost.localdomain~

UNDEFINED

15589+1441600998209413

bash

Raw system events from kernel

instrumentation

Page 14: Lightweight Provenance Service for High-Performance …In general, provenance is documented history of an object 9/11/17 2017 PACT Portland, Oregon 2 Little Dancer Aged Fourteen 1

9/11/17 14

LPS Aggregator1. Monitoring overhead and direct granularity change2. Pruning noisy events to improve performance

Bash (pid: 2322)

(pid: 6444) (pid: 6447)mpiexec

(pid: 6439)

(pid: 6445)sed (pid: 6446)

(pid: 6448)

(pid: 6449)

hydra_pmi_proxy (pid: 6440)

./mpitest (pid: 6441)

./mpitest(pid: 6442)

./mpitest (pid: 6443)

• Representative Executions• Executions that users care

the most• Eliminate unimportant child

processes• Eliminate helper child

processes• Events of non-R executions

are counted to their ancestor R executions

Page 15: Lightweight Provenance Service for High-Performance …In general, provenance is documented history of an object 9/11/17 2017 PACT Portland, Oregon 2 Little Dancer Aged Fourteen 1

9/11/17 2017PACTPortland,Oregon 15

LPS Builder1. Fusing with environmental variables2. Building provenance with versioning

• Local aggregators generate isolated provenance events• Workflows or jobs that are across multiple servers

• A fatal challenge• To match identities in different machines needs a unique ID• Unique IDs are generated by specific software, no transparency

• A compromise solution• LPS relies on specific environmental variables to match identities• LPS should be notified about the name of these env variables

Page 16: Lightweight Provenance Service for High-Performance …In general, provenance is documented history of an object 9/11/17 2017 PACT Portland, Oregon 2 Little Dancer Aged Fourteen 1

9/11/17 2017PACTPortland,Oregon 16

LPS Builder1. Fusing with environmental variables2. Building provenance with versioning

w0 start w1 start w1 end w0 end

r0 start

w0

List w0

w1w0Av1 v2

r0 end

v3

w0w1A v2

w0A v3 A v4

• Events on the same file will be sent to the same builder• RULE: If events are overlapped, read always depend on the

newest version that the overlapped writes create

Page 17: Lightweight Provenance Service for High-Performance …In general, provenance is documented history of an object 9/11/17 2017 PACT Portland, Oregon 2 Little Dancer Aged Fourteen 1

9/11/17 2017PACTPortland,Oregon 17

Evaluation Results – LPS on Benchmarks

• All evaluations done on CloudLab• 45 servers are used to build the HPC

environment

• HPCG Benchmark• Network- and CPU-intensive workloads• Report performance in GFlops• Runs on 8 – 196 processes

• Results• Less than 0.1% performance degradation

Page 18: Lightweight Provenance Service for High-Performance …In general, provenance is documented history of an object 9/11/17 2017 PACT Portland, Oregon 2 Little Dancer Aged Fourteen 1

9/11/17 2017PACTPortland,Oregon 18

Evaluation Results – LPS on Benchmarks

• IOR Benchmark• Data-intensive workloads• 256 processes, each running on a core• issues 50K rounds of random 4K writes• Runs on 8 – 256 processes

• Results• Open/Close barely introduces overhead

(around 0.1%)• First/Last introduces noticeable

overheads, but they are largely covered by the variations and still less than 1%

Page 19: Lightweight Provenance Service for High-Performance …In general, provenance is documented history of an object 9/11/17 2017 PACT Portland, Oregon 2 Little Dancer Aged Fourteen 1

9/11/17 2017PACTPortland,Oregon 19

Evaluation Results – LPS on Benchmarks

• MDTest Benchmark• metadata-intensive

workloads• report 4 representative

operations• Results• Less than 1% overhead

in all ops except file read• MDTest file reads hit the

I/O buffer a lot.• Should switch to

open/close in this case

Page 20: Lightweight Provenance Service for High-Performance …In general, provenance is documented history of an object 9/11/17 2017 PACT Portland, Oregon 2 Little Dancer Aged Fourteen 1

9/11/17 2017PACTPortland,Oregon 20

Evaluation Results – LPS Component by Component

LPS Tracer Overhead LPS Aggregator Monitoring

LPS Aggregator Pruning

LPS BuilderScalability

Page 21: Lightweight Provenance Service for High-Performance …In general, provenance is documented history of an object 9/11/17 2017 PACT Portland, Oregon 2 Little Dancer Aged Fourteen 1

• Summary• It is always needed to balance between performance and

accuracy

• Using flexible provenance granularity, LPS is able to capture full provenance in HPC environment with low overhead

• Future Work• Quantitatively analysis on the uncertainty introduced by

flexible provenance granularity

9/11/17 2017PACTPortland,Oregon 21

Summary and Future Work

Page 22: Lightweight Provenance Service for High-Performance …In general, provenance is documented history of an object 9/11/17 2017 PACT Portland, Oregon 2 Little Dancer Aged Fourteen 1

Thanks&Questions

9/11/17 2017PACTPortland,Oregon 22

Page 23: Lightweight Provenance Service for High-Performance …In general, provenance is documented history of an object 9/11/17 2017 PACT Portland, Oregon 2 Little Dancer Aged Fourteen 1

Provenance

9/11/17 2017PACTPortland,Oregon 23