20
DRepl Optimizing Access to Application Data for Analysis and Visualization Latchesar Ionkov Michael Lang LANL Carlos Maltzahn UCSC

Optimizing Access to Application Data for Analysis …Syntax Similar to C, C++, Java Dataset define data types (structs, arrays) define named data of the types View(s) define substructs

  • Upload
    others

  • View
    11

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Optimizing Access to Application Data for Analysis …Syntax Similar to C, C++, Java Dataset define data types (structs, arrays) define named data of the types View(s) define substructs

DReplOptimizing Access to Application Data for

Analysis and Visualization

L a t c h e s a r I o n k o vM i c h a e l L a n g

L A N L

C a r l o s M a l t z a h nU C S C

Page 2: Optimizing Access to Application Data for Analysis …Syntax Similar to C, C++, Java Dataset define data types (structs, arrays) define named data of the types View(s) define substructs

HPC Cluster

Head Node

CN1 CN2 CNk...

CNk+1 CNk+2 CNl...

CNl+1 CNl+2 CNm...

CNm+1 CNm+2 CNn...

IO1

IO2

IO3

IOp

FS

FS

FS

FS

FS

Desktop Desktop Desktop Desktop

Page 3: Optimizing Access to Application Data for Analysis …Syntax Similar to C, C++, Java Dataset define data types (structs, arrays) define named data of the types View(s) define substructs

Data Storage

Data stored in files

Many applications use legacy formats

Data is stored in format, convenient for the producer

In-situ and in-transit data analysis slow

Page 4: Optimizing Access to Application Data for Analysis …Syntax Similar to C, C++, Java Dataset define data types (structs, arrays) define named data of the types View(s) define substructs

Objective

Decouple storage data layout from application data layout(s)

Make replicas with different data layouts

Each application working with the data can use a layout that is optimized for it

Allow both materialized (on-storage) and on-the-fly data layouts

Page 5: Optimizing Access to Application Data for Analysis …Syntax Similar to C, C++, Java Dataset define data types (structs, arrays) define named data of the types View(s) define substructs

DesignDefinitions

Dataset -- abstract data model

Views -- how applications see the data

Replicas -- how the data is stored

Provision of an easy way to express how data is used by the applications

View View

Dataset

Replica Replica Replica

Page 6: Optimizing Access to Application Data for Analysis …Syntax Similar to C, C++, Java Dataset define data types (structs, arrays) define named data of the types View(s) define substructs

Example

pressure=5.1temp=33.1density=0.4

N.........

......

Mpsim

Dataset

pressure=5.1temp=33.1density=0.4

N.........

......

Mpsim

N.........

......

Mpressure

N.........

......

Mtemp

N.........

......

Mdensity

N.........

......

M

pressure

N.........

......

M

tempN

...

...

......

...

Mdensity

pressure=5.1density=0.4N

...

...

......

...

Mpsim

View 1 View 2

Replica 1

Replica 2

Page 7: Optimizing Access to Application Data for Analysis …Syntax Similar to C, C++, Java Dataset define data types (structs, arrays) define named data of the types View(s) define substructs

DRepl

Dataset Language

Parser

Replication Engine

File Server

DRepl

dataset { var p struct { a, b, c float32 }}

view default { var p = p }view viz { var pa { a } = p var pba { b, a } = p}

replica default { view default }replica viz { view viz }

Parser

Replication EngineDReplFS

ReplicasReplicas

ReplicasOS

SimulationVizualization

viz

default

S viz:000000

S default:000004

dest

S viz:000004

S default:000000

dest

S viz:000000dest

destT viz:000004

field

field

T viz:000000dest

T default:000000

dest

fielddest

dest

S default:000008dest

dest

field

field

field

DatasetDescription

ConversionMap

Page 8: Optimizing Access to Application Data for Analysis …Syntax Similar to C, C++, Java Dataset define data types (structs, arrays) define named data of the types View(s) define substructs

Configurations

Burst Buffer Node

Parallel FS

Replica 1

Replica 2

DRepl

...

...

Parallel FS

Replica 1

Replica 2Replica 3

ApplicationDRepl

Node 2

ApplicationDRepl

Node 1

ApplicationDRepl

Node N

ApplicationNode 1

ApplicationNode 2

ApplicationNode N

Application

DR

epl

Node 1

Application

DR

epl

Node 2

Application

DR

epl

Node N

Parallel FS

Replica 1

Replica 2...

Embedded Separate

Burst Buffer

Page 9: Optimizing Access to Application Data for Analysis …Syntax Similar to C, C++, Java Dataset define data types (structs, arrays) define named data of the types View(s) define substructs

Dataset Language

Syntax Similar to C, C++, Java

Dataset

define data types (structs, arrays)

define named data of the types

View(s)

define substructs and subarrays

define named data based on the dataset data

Replica(s)

Page 10: Optimizing Access to Application Data for Analysis …Syntax Similar to C, C++, Java Dataset define data types (structs, arrays) define named data of the types View(s) define substructs

Dataset LanguagePrimary types - int8, int16, int32, int64, float32, float64, stringN

Structsstruct { a, b, c float64}

Multidimensional arrays[50,40,21] Point

Custom typestype int64 Point

Arithmetic expressions in the subarray definitionsa[i*3, j + 2] = aa[j, i - 1]

Support for different array orders -- row-major, row-minor, in future Hilbert and z-order

Page 11: Optimizing Access to Application Data for Analysis …Syntax Similar to C, C++, Java Dataset define data types (structs, arrays) define named data of the types View(s) define substructs

Language Exampledataset {const N = 500

type Data struct {a, b, c float32

}

var data [N]Data}

view array-of-structs { var ds = data}

view struct-of-arrays { var a[i]{a} = data[i] var b[i]{b} = data[i] var c[i]{c} = data[i]}

view ab rowmajor { var ab[i]{a,b} = data[i]}

replica array-of-structs { view array-of-structs}

replica struct-of-arrays { view struct-of-arrays}

replica other { view array-of-structs view ab}

Page 12: Optimizing Access to Application Data for Analysis …Syntax Similar to C, C++, Java Dataset define data types (structs, arrays) define named data of the types View(s) define substructs

Subarray Examplesdataset {const N = 500const M = 200

var data [N, M]float32}

view v { // flip dimensions var flip[i,j] = data[j,i]

// middle row var mr[i] = data[N/2, i]

// each third element var te[i, j] = data[i*3, j*3]}

Page 13: Optimizing Access to Application Data for Analysis …Syntax Similar to C, C++, Java Dataset define data types (structs, arrays) define named data of the types View(s) define substructs

DReplFS

Represent the application data formats (views) as virtual files

Stored data formats (replicas) -- collection of replicas

DReplFS

ABC

BAC

B

CD

ABCD AB CD

Sim2

Sim1

Viz1 A1

Page 14: Optimizing Access to Application Data for Analysis …Syntax Similar to C, C++, Java Dataset define data types (structs, arrays) define named data of the types View(s) define substructs

Transformation Rules

dataset { var p struct { a, b, c float32 }}

view default { var p = p

}

view viz { var pa { a } = p var pba { b, a } = p

}

default

viz

T 0004pba

T 0000pa S 0000

S 0008

S 0004

T 0000p

S 0000

S 0004

S 0008

field a

field b

field a

field a

field b

field c

destdest

dest

dest

dest

dest

dest

Page 15: Optimizing Access to Application Data for Analysis …Syntax Similar to C, C++, Java Dataset define data types (structs, arrays) define named data of the types View(s) define substructs

DReplFS -- Parser, Replication Engine, File Server in Go

KDreplFS -- Parser in Go, Replication Engine and File Server in the Linux kernel

Implementation

Page 16: Optimizing Access to Application Data for Analysis …Syntax Similar to C, C++, Java Dataset define data types (structs, arrays) define named data of the types View(s) define substructs

ExperimentsDataset

const N = 176160768 type Data struct {

a, b, c float32 } var data [N]Data

Views

array of structs (AOS)

struct of arrays (SOA)

partial (only b)

Replicas

three replicas (AOS, SOA, b)

two replicas (AOS, b)

one replica (AOS)

Each replica on separate SSD

File Servers

pass-through (POSIX)

kdreplfs

Page 17: Optimizing Access to Application Data for Analysis …Syntax Similar to C, C++, Java Dataset define data types (structs, arrays) define named data of the types View(s) define substructs

Results:Read

0

200

400

600

800

1000

Array of Structs

3 Replicas

Struct of Arrays

3 Replicas

Partial3 Replicas

Array of Structs

2 Replicas

Struct of Arrays

2 Replicas

Partial2 Replicas

Array of Structs

1 Replica

Struct of Arrays

1 Replica

Partial1 Replica

Band

wid

th (M

B/s)

kdreplfsPOSIX

Page 18: Optimizing Access to Application Data for Analysis …Syntax Similar to C, C++, Java Dataset define data types (structs, arrays) define named data of the types View(s) define substructs

Results:Write

0

200

400

600

800

1000

Array of Structs

3 Replicas

Struct of Arrays

3 Replicas

Partial3 Replicas

Array of Structs

2 Replicas

Struct of Arrays

2 Replicas

Partial2 Replicas

Array of Structs

1 Replica

Struct of Arrays

1 Replica

Partial1 Replica

Band

wid

th (M

B/s)

kdreplfs-synckdreplfs-async

POSIX

Page 19: Optimizing Access to Application Data for Analysis …Syntax Similar to C, C++, Java Dataset define data types (structs, arrays) define named data of the types View(s) define substructs

Results:Combined

0

200

400

600

800

1000

1200

1400

kdreplfs 1 replica kdreplfs 2 replicas POSIX

Band

wid

th (M

B/s)

ReadWrite

Page 20: Optimizing Access to Application Data for Analysis …Syntax Similar to C, C++, Java Dataset define data types (structs, arrays) define named data of the types View(s) define substructs

Future Work

Variable-sized arrays

More array element orders (z-order, Hilbert)

Optimizations

Endianness for primary types

Support for HDF5 replicas

Implementation that doesn’t use file servers

Automatic generation of dataset definition from standard data formats (HDF5, NetCDF)