


Performance Optimization of Component-based Data Intensive Applications

Alan Sussman, Michael Beynon, Tahsin Kurc, Umit Catalyurek, Joel Saltz
University of Maryland
http://www.cs.umd.edu/projects/adr

Alan Sussman ([email protected]) 2

Outline

- Motivation
- Approach
- Optimization – Group Instances
- Optimization – Transparent Copies
- Ongoing Work

Alan Sussman ([email protected]) 3

Targeted Applications

- Pathology
- Volume Rendering
- Surface Groundwater Modeling
- Satellite Data Analysis

Alan Sussman ([email protected]) 4

Runtime Environment

Heterogeneous Shared Resources:
- Host level: machine, CPUs, memory, disk storage
- Network connectivity

Many Remote Datasets:
- Inexpensive archival storage clusters (1TB ~ $10k)
- Islands of useful data
- Too large for replication

[Architecture diagram: clients send a range query through the Client Interface Service; the DataCutter Indexing Service returns segment info as (File, Offset, Size) triples; the Data Access and Filtering Services run filters that retrieve and process segment data from the archival storage systems.]

Alan Sussman ([email protected]) 6

DataCutter

Indexing Service
- Multilevel hierarchical indexes based on spatial indexing methods, e.g., R-trees
  – Relies on an underlying multi-dimensional space
  – User can add new indexing methods

Filtering Service
- Distributed C++ component framework
- Transparent tuning and adaptation for heterogeneity
- Filters implemented as threads – 1 process per host

Versions of both services are integrated into the SDSC Storage Resource Broker (SRB).


Alan Sussman ([email protected]) 7

Indexing - Subsetting

Datasets are partitioned into segments
- Segments are used to index the dataset and are the unit of retrieval
- Spatial indexes are built from the bounding boxes of all elements in a segment

Indexing very large datasets
- Multi-level hierarchical indexing scheme
- Summary index files – for a collection of segments or detailed index files
- Detailed index files – to index the individual segments
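The summary/detailed scheme above can be sketched as a two-level bounding-box index. This is a minimal illustration only, not the DataCutter implementation; all names here (`SummaryEntry`, `rangeQuery`, etc.) are hypothetical:

```cpp
#include <cstddef>
#include <vector>

// One 2-D bounding box per segment; a summary entry covers a group of
// segments (standing in for a detailed index file).
struct BBox {
    double lo[2], hi[2];
    bool intersects(const BBox& o) const {
        for (int d = 0; d < 2; ++d)
            if (hi[d] < o.lo[d] || o.hi[d] < lo[d]) return false;
        return true;
    }
};

struct Segment {          // unit of retrieval: (file, offset, size)
    int file;
    std::size_t offset, size;
    BBox box;             // bounds of all elements in the segment
};

struct SummaryEntry {     // one entry per group of segments
    BBox box;             // union of the group's segment boxes
    std::vector<Segment> segments;
};

// Range query: consult the summary level first; descend into a group's
// detailed entries only when the group's box overlaps the query.
std::vector<Segment> rangeQuery(const std::vector<SummaryEntry>& summary,
                                const BBox& q) {
    std::vector<Segment> hits;
    for (const auto& group : summary) {
        if (!group.box.intersects(q)) continue;   // prune whole group
        for (const auto& s : group.segments)
            if (s.box.intersects(q)) hits.push_back(s);
    }
    return hits;
}
```

The two-level structure means a query that misses a whole collection of segments never touches that collection's detailed entries.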

Alan Sussman ([email protected]) 8

Filter-Stream Programming (FSP)

Purpose: specialized components for processing data

- Based on Active Disks research [Acharya, Uysal, Saltz: ASPLOS '98], dataflow, functional parallelism, and message passing
- Filters – logical unit of computation
  – high-level tasks
  – init, process, finalize interface
- Streams – how filters communicate
  – unidirectional buffer pipes
  – use fixed-size buffers (min, good)
- Filter connectivity and filter-level characteristics are specified manually

[Example filter group: Extract ref and Extract raw filters read from the Reference DB and the Raw Dataset, feeding a 3D reconstruction filter and then a View result filter.]
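The init/process/finalize interface and buffer-pipe streams described above can be mirrored in a small sketch. This is a hypothetical rendering, not the real DataCutter API; the class and method names are illustrative:

```cpp
#include <deque>
#include <vector>

// process() reports one of the two statuses named on this slide.
enum class FilterStatus { EndOfWork, EndOfFilter };

struct Buffer { std::vector<unsigned char> data; };

struct Stream {                       // unidirectional buffer pipe
    std::deque<Buffer> q;
    void write(Buffer b) { q.push_back(std::move(b)); }
    bool read(Buffer& b) {
        if (q.empty()) return false;
        b = std::move(q.front());
        q.pop_front();
        return true;
    }
};

class Filter {                        // logical unit of computation
public:
    virtual ~Filter() = default;
    virtual void init() {}
    virtual FilterStatus process(Stream& in, Stream& out) = 0;
    virtual void finalize() {}
};

// Example filter: doubles every byte in one buffer per process() call.
class DoubleFilter : public Filter {
public:
    FilterStatus process(Stream& in, Stream& out) override {
        Buffer b;
        if (!in.read(b)) return FilterStatus::EndOfWork;
        for (auto& v : b.data) v = static_cast<unsigned char>(v * 2);
        out.write(std::move(b));
        return FilterStatus::EndOfWork;
    }
};
```

A runtime would wire such filters together along the manually specified connectivity and drive each one through its interface.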

Alan Sussman ([email protected]) 9

Placement

- Placement (or mapping) is the dynamic assignment of filters to particular hosts for execution
- Optimization criteria:
  – Communication
    - leverage filter affinity to a dataset
    - minimize communication volume on slower connections
    - co-locate filters with large communication volume
  – Computation
    - run expensive computation on faster, less loaded hosts
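The criteria above suggest a simple additive cost model for scoring a candidate placement. The sketch below is illustrative only (a real planner would also search over assignments); the function and parameter names are assumptions, not DataCutter's:

```cpp
#include <cstddef>
#include <vector>

struct Edge { int from, to; double volumeMB; };  // filter-to-filter stream

// Score one assignment of filters to hosts: computation cost favors fast
// hosts for expensive filters; communication cost charges each remote
// edge volume / bandwidth, so slow links dominate, and co-located
// filters pay nothing.
double placementCost(const std::vector<int>& hostOf,       // filter -> host
                     const std::vector<double>& cpuDemand, // per filter
                     const std::vector<double>& hostSpeed, // per host
                     const std::vector<std::vector<double>>& bwMBps,
                     const std::vector<Edge>& edges) {
    double cost = 0.0;
    for (std::size_t f = 0; f < cpuDemand.size(); ++f)
        cost += cpuDemand[f] / hostSpeed[hostOf[f]];
    for (const Edge& e : edges) {
        int ha = hostOf[e.from], hb = hostOf[e.to];
        if (ha != hb) cost += e.volumeMB / bwMBps[ha][hb];
    }
    return cost;
}
```

With such a score, co-locating a heavily communicating filter pair beats splitting them across a slow link, matching the criteria listed above.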

Alan Sussman ([email protected]) 10

FSP: Abstractions

Filter Group
- logical collection of filters to use together
- application starts filter group instances

Unit-of-work cycle
- "work" is application defined (e.g., a query)
- work is appended to running instances
- init(), process(), finalize() called for each unit of work
- process() returns { EndOfWork | EndOfFilter }
- allows for adaptivity

[Diagram: units of work (uow 0, uow 1, uow 2) flow as fixed-size buffers along a stream S between filters A and B.]
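The unit-of-work cycle above can be sketched as a driver loop. The names and the per-work call pattern here are a hypothetical simplification (the real process() also sees streams and can return EndOfFilter):

```cpp
#include <queue>

struct Work { int queryId; };         // "work" is application defined

class UowFilter {
public:
    int initCalls = 0, processCalls = 0, finalizeCalls = 0;
    void init(const Work&) { ++initCalls; }
    bool process(const Work&) {       // returns false at EndOfWork
        ++processCalls;
        return processCalls % 3 != 0; // pretend each work takes 3 calls
    }
    void finalize(const Work&) { ++finalizeCalls; }
};

// Each appended unit of work gets the full init/process.../finalize
// cycle before the instance moves on to the next one.
void runInstance(UowFilter& f, std::queue<Work>& pending) {
    while (!pending.empty()) {
        Work w = pending.front();
        pending.pop();
        f.init(w);
        while (f.process(w)) {}       // until process() reports done
        f.finalize(w);
    }
}
```

Because work arrives one unit at a time, the runtime can adapt (e.g., re-place filters) between units, which is the adaptivity hook mentioned above.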

Alan Sussman ([email protected]) 11

Optimization - Group Instances

[Diagram: work is distributed to two filter group instances (P0–F0–C0 and P1–F1–C1) spread across host1, host2, and host3, each with 2 CPUs.]

Match the number of instances to the environment (CPU capacity, network).
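One naive way to match the instance count to the environment, shown purely as an illustration (this heuristic is an assumption, not the paper's method), is to divide the total CPU slots by the number of filters per group:

```cpp
#include <algorithm>
#include <vector>

// Pick a group-instance count so that every instance can keep all of
// its filters busy on some CPU, capped by a user-supplied maximum.
int chooseInstances(const std::vector<int>& cpusPerHost,
                    int filtersPerGroup, int maxInstances) {
    int slots = 0;
    for (int c : cpusPerHost) slots += c;
    int byCapacity = std::max(1, slots / filtersPerGroup);
    return std::min(byCapacity, maxInstances);
}
```

For the three 2-CPU hosts in the diagram and a three-filter group (P–F–C), this yields two instances, matching the picture; the experiments that follow show the real optimum also depends on whether the application is I/O- or CPU-bound.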

Alan Sussman ([email protected]) 12

Experiment - Application Emulator

Parameterized dataflow filter
- consume from all inputs
- compute
- produce to all outputs

Application emulated:
- process 64 units of work
- single batch

[Diagram: a filter with in/out streams (consume, process, produce) instantiated as a P–F–C pipeline.]


Alan Sussman ([email protected]) 13

Instances: Vary Number, Application

Setup: UMD Red Linux cluster (2-processor PII-450 nodes)
Point: the best number of instances depends on the application and the environment

[Plots: response time (sec) vs. number of filter group instances (1–64), showing min, mean, and max, for an I/O-intensive workload (left, 0–450 sec) and a CPU-intensive workload (right, 0–40 sec).]

Alan Sussman ([email protected]) 14

Group Instances (Batches)

Work is issued in instance batches until all complete.
Match the number of instances to the environment (CPU capacity).

[Diagram: batches 0, 1, and 2 of work assigned to filter group instances P0–F0–C0 and P1–F1–C1 across host1, host2, and host3 (2 CPUs each).]
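Issuing work in instance batches can be sketched as a simple partitioning of unit-of-work ids; the next batch goes out only after the current one completes everywhere. The helper below is a hypothetical illustration of that bookkeeping:

```cpp
#include <vector>

// Split totalWork unit-of-work ids into consecutive batches of at most
// batchSize units each; the runtime would dispatch one batch of group
// instances at a time.
std::vector<std::vector<int>> makeBatches(int totalWork, int batchSize) {
    std::vector<std::vector<int>> batches;
    for (int w = 0; w < totalWork; ) {
        std::vector<int> batch;
        for (int i = 0; i < batchSize && w < totalWork; ++i, ++w)
            batch.push_back(w);      // w is a unit-of-work id
        batches.push_back(batch);
    }
    return batches;
}
```

For the emulator's 64 units of work, a batch size of 16 yields four sequential batches; the experiments below vary both the batch size and the number of instances per batch.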

Alan Sussman ([email protected]) 15

Instances: Vary Number, Batch Size

Setup: (Optimal) is the configuration with the lowest "all" time
Point: the best number of instances depends on the application and the environment

[Plots, CPU-intensive workload: response time (sec, 0–450) vs. number of filter group instances (1–16), and vs. batch size with the optimal number of instances in parentheses – 1 (8), 2 (32), 4 (8), 8 (4), 16 (4), 32 (2), 64 (1) – showing min, mean, max, and all.]

Alan Sussman ([email protected]) 16

Adding Heterogeneity

[Diagram: work distributed over a heterogeneous setup – filter group instances P0–F0–C0 through P3–F3–C3 on an 8-CPU SMP host, and P4–F4–C4 through P7–F7–C7 spread across host1 and host2 (2 CPUs each).]

Alan Sussman ([email protected]) 17

Optimization - Transparent Copies

FSP abstraction
- replicate individual filters
- transparent to the application
  – work is balanced among the copies
  – resources can be better tuned to actual filter needs
- Provide a "single-stream" illusion
  – multiple producers and consumers, deadlock, flow control
  – Invariant: UOW i is delivered before UOW i+1
- Problem: filter state

[Diagram: the R–E–Ra–M pipeline with the Ra filter replicated as transparent copies Ra0 … Rak feeding M.]
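The single-stream illusion's ordering invariant can be sketched with a small reorder stage downstream of the copies. This is a hypothetical mechanism for illustration, not DataCutter's actual implementation: copies may finish units of work out of order, so results are held until the next expected unit-of-work id arrives:

```cpp
#include <map>
#include <vector>

// Holds out-of-order results from transparent copies and releases them
// strictly in unit-of-work order, so the consumer sees a single stream
// with uow i delivered before uow i+1.
class OrderedMerge {
    std::map<int, int> held;   // uowId -> result, waiting for its turn
    int next = 0;              // next uowId the consumer may see
public:
    // Record one copy's result; return whatever becomes deliverable.
    std::vector<int> arrive(int uowId, int result) {
        held[uowId] = result;
        std::vector<int> out;
        while (!held.empty() && held.begin()->first == next) {
            out.push_back(held.begin()->second);
            held.erase(held.begin());
            ++next;
        }
        return out;
    }
};
```

The held map is where the flow-control and deadlock concerns listed above show up: a slow copy can force arbitrarily many completed results to wait, so buffering must be bounded in practice.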

Alan Sussman ([email protected]) 18

Runtime Workload Balancing

Use local information:
- queue size, send time / receiver acks

- Adjust the number of transparent copies
- Demand-based dataflow (choice of consumer)
  – Within a host – perfect shared queue among copies
  – Across hosts:
    - Round Robin
    - Weighted Round Robin
    - Demand-driven sliding window (based on buffer consumption rate)
    - User-defined
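Two of the consumer-choice policies above can be sketched side by side. This is an illustrative implementation under assumptions of my own (tracking in-flight buffers as the demand signal), not the DataCutter code: round robin cycles through the copies, while a demand-driven sender picks the copy with the fewest outstanding (sent-but-unacknowledged) buffers, a proxy for its consumption rate:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

class Chooser {
    std::vector<int> outstanding;   // per-copy buffers in flight
    std::size_t rr = 0;             // round-robin cursor
public:
    explicit Chooser(std::size_t copies) : outstanding(copies, 0) {}

    std::size_t roundRobin() {      // ignore load, just cycle
        std::size_t c = rr;
        rr = (rr + 1) % outstanding.size();
        ++outstanding[c];
        return c;
    }
    std::size_t demandDriven() {    // favor the fastest consumer
        std::size_t c = std::min_element(outstanding.begin(),
                                         outstanding.end())
                        - outstanding.begin();
        ++outstanding[c];
        return c;
    }
    void ack(std::size_t c) { --outstanding[c]; }  // copy consumed one
};
```

Only local information (counters and acks) is consulted, matching the slide: no global load monitor is needed for either policy.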


Alan Sussman ([email protected]) 19

Experiment – Virtual Microscope

- Client-server system for interactively visualizing digital slides
  – Image dataset (100MB to 5GB per focal plane)
  – Rectangular region queries, answered with multiple data chunks
- Hopkins Linux cluster – 4 1-processor and 1 2-processor PIII-800 nodes, 2 80GB IDE disks each, 100Mbit Ethernet
- The Decompress filter is the most expensive, so it is a good candidate for replication
- 50 queries at various magnifications, 512x512-pixel output

Pipeline: read data → decompress → clip → zoom → view

Alan Sussman ([email protected]) 20

Virtual Microscope Results

Response time (seconds) by magnification for host placements of the R-D-C-Z-V (read-decompress-clip-zoom-view) pipeline; replicated copies are shown in parentheses:

Placement (R-D-C-Z-V)   50x    100x   200x   400x   Average
h-g-g-g-g               4.60   1.27   0.62   0.37   1.49
g-g(2)-g-g-g            3.24   0.92   0.45   0.33   1.08
g-g-g-g-g               4.46   1.26   0.58   0.33   1.44
h-g(2)-g(2)-b-b         5.34   1.27   0.68   0.45   1.68
h-g(4)-g(2)-g-g         3.50   0.96   0.50   0.39   1.17
h-g(2)-g(2)-g-g         3.43   0.95   0.49   0.37   1.15
h-g(2)-g-g-g            3.41   0.95   0.50   0.39   1.15
h-h-h-h-h               6.95   1.73   0.73   0.38   2.10

Alan Sussman ([email protected]) 21

Experiment - Isosurface Rendering

- UT Austin ParSSim species transport simulation
  – single time step visualization, reads all the data
- Setup: UMD Red Linux cluster (2-processor PII-450 nodes)

Pipeline: read dataset (R) → isosurface extraction (E) → shade + rasterize (Ra) → merge/view (M)
Data volumes: 1.5 GB, 38.6 MB, 11.8 MB, 28.5 MB

Per-filter times: R 0.64s (4.3%), E 1.64s (11.2%), Ra 11.67s (79.5%), M 0.73s (5.0%)
Sum = 14.68s; elapsed time = 12.65s

Alan Sussman ([email protected]) 22

Sample Isosurface Visualization

[Images: rendered isosurfaces at V = 0.35 and V = 0.7.]

Alan Sussman ([email protected]) 23

Transparent Copies: Replicate Raster

Response time by node count (improvement over 1 copy in parentheses):

Copies of Raster   1 node        2 nodes       4 nodes      8 nodes
1 copy             12.18s        7.32s         4.17s        3.00s
2 copies           8.16s (33%)   5.70s (22%)   3.88s (7%)   3.24s (–8%)

Setup: SPMD style, partitioned input dataset per node
Point: copies of the bottleneck filter are enough to balance the flow

Alan Sussman ([email protected]) 24

Experiment – Resource Heterogeneity

- Isosurface rendering on the Red, Blue, and Rogue Linux clusters at Maryland
  – Red – 16 2-processor PII-450, 256MB, 18GB SCSI disk
  – Blue – 8 2-processor PIII-550, 1GB, 2 8GB SCSI disks + 1 8-processor PIII-450, 4GB, 2 18GB SCSI disks
  – Rogue – 8 1-processor PIII-650, 128MB, 2 75GB IDE disks
  – Red and Blue connected via Gigabit Ethernet, Rogue via 100Mbit Ethernet
- Two implementations of the Raster filter – z-buffer and active pixels (the one used in the previous experiment)


Alan Sussman ([email protected]) 25

Experimental setup

Pipeline: read dataset (R) → isosurface extraction (E) → shade + rasterize (Ra) → merge/view (M)
Data volumes: 1.5 GB, 38.6 MB, 11.8 MB (Active Pixel) / 32.0 MB (Z-buffer), 28.5 MB

Per-filter times:

Raster variant   R       E       Ra       M       Sum
Active Pixel     0.64s   1.64s   11.67s   0.73s   14.68s
Z-buffer         0.68s   1.65s   9.43s    0.90s   12.66s

The experiment to follow combines the R and E filters, since that showed the best performance in experiments not shown.

Alan Sussman ([email protected]) 26

Varying number of nodes

[Table: response times (sec) for Z-buffer and Active Pixel rendering in the RE-Ra-M configuration on 1, 2, 4, and 8 nodes, for #Ra configurations 1(0), 1(1), 2(0), and 2(2).]

Only Red nodes are used – each one runs 1 RE and 0, 1, or 2 Ra copies, and one runs M.

Alan Sussman ([email protected]) 27

Heterogeneous nodes

Active Pixel algorithm on the 8-processor Blue node + Red data nodes. The Blue node runs 7 Ra or ERa copies and M; each Red node runs 1 of each filter except M. Response times (sec) for the DD, WRR, and RR consumer-choice policies:

Conf.      1 data node      2 data nodes     4 data nodes     8 data nodes
           DD   WRR  RR     DD   WRR  RR     DD   WRR  RR     DD   WRR  RR
R-ERa-M    4.2  4.0  8.2    4.0  4.1  6.5    4.1  4.1  5.1    4.2  3.7  4.0
RE-Ra-M    3.0  3.0  7.3    3.0  2.6  5.7    3.5  3.0  4.4    3.8  2.9  3.8

Alan Sussman ([email protected]) 28

Skewed data distribution

- Experimental setup
  – 25GB dataset from UT ParSSim (bigger grid than in the earlier experiments)
  – Hilbert curve declustering onto the disks of 2 Blue and 2 Rogue nodes
  – Skew moves part of the data from the Blue to the Rogue nodes

Alan Sussman ([email protected]) 29

Skewed data distribution

[Bar charts: execution time (seconds, 0–30) for filter configurations RERa-M, R-ERa-M, and RE-Ra-M under the RR, WRR, and DD policies, with Active Pixel rendering, for a balanced distribution and for 25%, 50%, and 75% skew.]

Alan Sussman ([email protected]) 30

Experimental Implications

Multiple group instances
- Higher utilization of under-used resources
- Reduced application response time
- but … requires work co-execution tolerance

Transparent copies can help
- Most filter decompositions are unbalanced
- Heterogeneous CPU capacity / load
- but … requires buffer out-of-order tolerance


Alan Sussman ([email protected]) 31

Ongoing and Future Work

- Automated placement, instances, transparent copies
  – predictive (cost models)
  – adaptive (work feedback)
- Filter accumulator support (partitioning, replication)
- Java filters
- CCA-compliant filters
- Very large datasets – including HDF5 format
- Using storage clusters at UMD and OSU, then a testbed at LLNL