Performance Optimization of Component-based Data Intensive Applications

Alan Sussman, Michael Beynon, Tahsin Kurc, Umit Catalyurek, Joel Saltz
University of Maryland
http://www.cs.umd.edu/projects/adr
Outline

- Motivation
- Approach
- Optimization – Group Instances
- Optimization – Transparent Copies
- Ongoing Work
Targeted Applications

- Pathology
- Volume Rendering
- Surface Groundwater Modeling
- Satellite Data Analysis
Runtime Environment

Heterogeneous shared resources:
- Host level: machine, CPUs, memory, disk storage
- Network connectivity

Many remote datasets:
- Inexpensive archival storage clusters (1 TB ~ $10K)
- Islands of useful data
- Too large for replication
DataCutter

[Figure: DataCutter architecture. Clients issue range queries to the Client Interface Service; the Indexing Service resolves a query to segment info as (File, Offset, Size) tuples; the Data Access Service retrieves segment data from the Archival Storage System; the Filtering Service runs filters over the retrieved data.]
DataCutter

Indexing Service
- Multilevel hierarchical indexes based on spatial indexing methods, e.g., R-trees
  - Relies on an underlying multi-dimensional space
  - User can add new indexing methods

Filtering Service
- Distributed C++ component framework
- Transparent tuning and adaptation for heterogeneity
- Filters implemented as threads – 1 process per host

Versions of both services are integrated into SDSC SRB.
Indexing - Subsetting

Datasets are partitioned into segments
- A segment is the unit of retrieval and is used to index the dataset
- Spatial indexes are built from the bounding boxes of all elements in a segment

Indexing very large datasets
- Multi-level hierarchical indexing scheme
- Summary index files – index a collection of segments or detailed index files
- Detailed index files – index the individual segments
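To make the two-level scheme concrete, here is a minimal sketch of a range lookup over summary and detailed index files. The types (BoundingBox, SummaryEntry, Segment) and the linear scans are illustrative assumptions; the actual indexing service uses R-tree-style spatial indexes.

```cpp
#include <string>
#include <vector>

// Hypothetical types for illustration; not the actual DataCutter API.
struct BoundingBox {
    double lo[2], hi[2];                          // 2-D for simplicity
    bool intersects(const BoundingBox& o) const {
        for (int d = 0; d < 2; d++)
            if (o.hi[d] < lo[d] || hi[d] < o.lo[d]) return false;
        return true;
    }
};

struct Segment { std::string file; long offset, size; BoundingBox mbr; };

// A detailed index file indexes individual segments.
struct DetailedIndex { std::vector<Segment> segments; };

// A summary index entry covers one detailed index file.
struct SummaryEntry { BoundingBox mbr; DetailedIndex* detail; };

// Range query: consult the summary index first, then visit only the
// detailed index files whose bounding boxes intersect the query.
std::vector<Segment> rangeQuery(const std::vector<SummaryEntry>& summary,
                                const BoundingBox& query) {
    std::vector<Segment> result;
    for (const SummaryEntry& e : summary) {
        if (!e.mbr.intersects(query)) continue;   // prune whole detail file
        for (const Segment& s : e.detail->segments)
            if (s.mbr.intersects(query)) result.push_back(s);
    }
    return result;
}
```

The summary level lets a query skip entire detailed index files, which is what makes the scheme viable for very large datasets.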
Filter-Stream Programming (FSP)

Purpose: specialized components for processing data
- Based on Active Disks research [Acharya, Uysal, Saltz: ASPLOS'98], dataflow, functional parallelism, message passing
- Filters – the logical unit of computation
  - High-level tasks
  - init, process, finalize interface
- Streams – how filters communicate
  - Unidirectional buffer pipes
  - Use fixed-size buffers (min, good)
- Filter connectivity and filter-level characteristics are specified manually

[Figure: example filter group – Extract raw (from the Raw Dataset) and Extract ref (from the Reference DB) feed 3D reconstruction, which feeds View result.]
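The init/process/finalize interface can be pictured with a minimal sketch. The names below (Filter, Stream, StreamBuffer, FilterStatus) are illustrative stand-ins, not the actual DataCutter API.

```cpp
#include <cstddef>

// Illustrative stand-ins for the framework types.
struct StreamBuffer { void* data; std::size_t size; };

struct Stream {                        // unidirectional, fixed-size buffer pipe
    virtual bool read(StreamBuffer& buf) = 0;   // blocks; false at end of work
    virtual void write(const StreamBuffer& buf) = 0;
    virtual ~Stream() {}
};

enum FilterStatus { EndOfWork, EndOfFilter };

// A filter implements three callbacks; the framework runs it as a
// thread and calls process() once per unit of work.
class Filter {
public:
    virtual void init() {}                       // per-work setup
    virtual FilterStatus process(Stream& in, Stream& out) = 0;
    virtual void finalize() {}                   // per-work teardown
    virtual ~Filter() {}
};

class Uppercase : public Filter {                // toy example filter
public:
    FilterStatus process(Stream& in, Stream& out) override {
        StreamBuffer buf;
        while (in.read(buf)) {                   // consume a buffer
            char* p = static_cast<char*>(buf.data);
            for (std::size_t i = 0; i < buf.size; i++)
                if (p[i] >= 'a' && p[i] <= 'z') p[i] -= 32;
            out.write(buf);                      // produce downstream
        }
        return EndOfWork;                        // more work may follow
    }
};
```

Because a filter sees only its streams, the runtime is free to place it on any host, which is what the placement and replication optimizations below exploit.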
Placement

- Placement (or mapping) is the dynamic assignment of filters to particular hosts for execution
- Optimization criteria:
  - Communication
    - Leverage filter affinity to datasets
    - Minimize communication volume on slower connections
    - Co-locate filters with large communication volume
  - Computation
    - Run expensive computation on faster, less loaded hosts
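One way to picture these criteria is a greedy scoring heuristic. This is a hypothetical sketch (the Host and FilterSpec fields and the cost formula are all assumptions), not DataCutter's placement algorithm.

```cpp
#include <limits>
#include <vector>

// Hypothetical descriptions of hosts and filters for illustration.
struct Host       { double speed, load, bandwidthToData; };
struct FilterSpec { double computeCost, volumeToData; };

// Greedy heuristic: pick the host minimizing estimated compute time
// plus estimated transfer time for the filter's data traffic.
int placeFilter(const FilterSpec& f, const std::vector<Host>& hosts) {
    int best = -1;
    double bestCost = std::numeric_limits<double>::max();
    for (int i = 0; i < (int)hosts.size(); i++) {
        const Host& h = hosts[i];
        double compute  = f.computeCost * (1.0 + h.load) / h.speed;
        double transfer = f.volumeToData / h.bandwidthToData;
        double cost = compute + transfer;        // lower is better
        if (cost < bestCost) { bestCost = cost; best = i; }
    }
    return best;                                 // index of chosen host
}
```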
FSP: Abstractions

Filter Group
- Logical collection of filters to use together
- The application starts filter group instances

Unit-of-work cycle
- "Work" is application defined (e.g., a query)
- Work is appended to running instances
- init(), process(), finalize() are called for each unit of work
- process() returns { EndOfWork | EndOfFilter }
- Allows for adaptivity

[Figure: filter A streams buffers to filter B over stream S; successive units of work (uow 0, uow 1, uow 2) flow through as sequences of buffers.]
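The runtime's side of the unit-of-work cycle might look like the loop below. This is a speculative sketch reusing the illustrative Filter/Stream types from the FSP example above; the Work type is a hypothetical placeholder.

```cpp
#include <queue>

struct Work {};                        // hypothetical unit-of-work descriptor

// Drive one filter through a queue of appended work.
void runFilter(Filter& f, std::queue<Work>& workQueue,
               Stream& in, Stream& out) {
    while (!workQueue.empty()) {
        workQueue.pop();               // take the next unit of work
        f.init();                      // per-work setup
        FilterStatus st = f.process(in, out);
        f.finalize();                  // per-work teardown
        if (st == EndOfFilter)         // filter asked to stop early;
            break;                     // EndOfWork: continue with next work
    }
}
```

Calling the three callbacks once per unit of work is what gives the runtime its adaptivity hook: between units it can observe performance and adjust.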
Optimization - Group Instances

[Figure: work distributed across two filter group instances, P0-F0-C0 and P1-F1-C1, mapped onto host1, host2, and host3 (2 CPUs each).]

Match the number of instances to the environment (CPU capacity, network).
Experiment - Application Emulator

Parameterized dataflow filter
- Consume from all inputs
- Compute
- Produce to all outputs

Application emulated:
- Process 64 units of work
- Single batch

[Figure: a generic emulator filter with input and output streams around a process step; the emulated pipeline is P - F - C.]
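A parameterized emulator filter of this shape could be sketched as follows, again using the illustrative types from the FSP example (simplified to one input and one output stream); the compute knob is a hypothetical parameter.

```cpp
// Emulator filter: consume a buffer, spin for a parameterized amount
// of compute, then produce the buffer downstream.
class EmulatorFilter : public Filter {
    long computeIters;                           // tunable compute cost
public:
    explicit EmulatorFilter(long iters) : computeIters(iters) {}
    FilterStatus process(Stream& in, Stream& out) override {
        StreamBuffer buf;
        while (in.read(buf)) {                   // consume
            volatile double x = 0;
            for (long i = 0; i < computeIters; i++)
                x += i * 0.5;                    // emulated computation
            out.write(buf);                      // produce
        }
        return EndOfWork;
    }
};
```

Varying the compute knob relative to buffer traffic is what makes the emulated application I/O-intensive or CPU-intensive in the experiments below.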
Instances: Vary Number, Application

Setup: UMD Red Linux cluster (2-processor PII-450 nodes)
Point: the best number of instances depends on the application and the environment

[Figure: two plots of response time (sec) vs. number of filter group instances (1, 2, 4, 8, 16, 32, 64), showing min, mean, and max, for an I/O-intensive and a CPU-intensive emulated application.]
Group Instances (Batches)

Work is issued in instance batches until all work completes.
Matching the number of instances to the environment (CPU capacity).

[Figure: successive batches of filter group instances (Batch 0, Batch 1, Batch 2), each with instances P0-F0-C0 and P1-F1-C1 mapped onto host1, host2, and host3 (2 CPUs each).]
Instances: Vary Number, Batch Size

Setup: (Optimal) is the lowest "all" time
Point: the best number of instances depends on the application and the environment

[Figure: CPU-intensive application. Left: response time (sec) vs. number of filter group instances (1-16), showing min, mean, max, and all. Right: response time vs. batch size, with the optimal number of instances in parentheses – 1 (8), 2 (32), 4 (8), 8 (4), 16 (4), 32 (2), 64 (1).]
Adding Heterogeneity

[Figure: work distributed across filter group instances P0-F0-C0 through P3-F3-C3 on host1 and host2 (2 CPUs each), and P4-F4-C4 through P7-F7-C7 on hostSMP (8 CPUs).]
Optimization - Transparent Copies

FSP abstraction
- Replicate individual filters
- Transparent to the application
  - Balance work among copies
  - Better tune resources to actual filter needs
- Provide a "single-stream" illusion
  - Multiple producers and consumers, deadlock, flow control
  - Invariant: UOW i < UOW i+1
- Problem: filter state

[Figure: pipeline R - E - Ra - M with the Ra filter replicated as transparent copies Ra0 through Rak.]
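One way to maintain the UOW-ordering invariant at a consumer fed by several transparent copies is a small reordering stage. The sketch below is speculative: the UOW tags and the release rule are assumptions, not DataCutter's mechanism.

```cpp
#include <map>
#include <utility>
#include <vector>

// Buffers arrive tagged with their unit-of-work number and are
// released strictly in UOW order (all of UOW i before any of UOW i+1).
struct TaggedBuffer { int uow; std::vector<char> data; };

class Reorderer {
    std::multimap<int, TaggedBuffer> pending;    // keyed by UOW number
    int next = 0;                                // next UOW to release
public:
    void arrive(TaggedBuffer b) { pending.insert({b.uow, std::move(b)}); }

    // Release a buffer of the current UOW, if one is available.
    bool release(TaggedBuffer& out) {
        auto it = pending.find(next);
        if (it == pending.end()) return false;   // nothing ready yet
        out = std::move(it->second);
        pending.erase(it);
        return true;
    }
    void advance() { next++; }   // call when producers finish UOW 'next'
}
;
```

Within a unit of work the copies may interleave freely; only the boundary between units is ordered, which keeps the illusion cheap.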
Runtime Workload Balancing

Use local information:
- Queue size, send time / receiver acks

- Adjust the number of transparent copies
- Demand-based dataflow (choice of consumer)
  - Within a host – perfect shared queue among copies
  - Across hosts (a sliding-window sketch follows this list):
    - Round Robin
    - Weighted Round Robin
    - Demand-Driven sliding window (based on buffer consumption rate)
    - User-defined
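As a rough illustration of the demand-driven policy, the sketch below routes each buffer to the consumer copy with the most unused window, where the window tracks buffers sent but not yet acknowledged. The types and the crediting scheme are assumptions, not DataCutter's implementation.

```cpp
#include <vector>

// Per-consumer sliding window: 'inFlight' counts buffers sent but not
// yet acknowledged; 'window' grows with the copy's consumption rate.
struct ConsumerWindow {
    int window   = 4;    // current credit (buffers allowed in flight)
    int inFlight = 0;
};

// Pick the consumer copy with the most remaining credit, i.e. the one
// draining buffers fastest relative to what we have sent it.
int chooseConsumer(const std::vector<ConsumerWindow>& copies) {
    int best = -1, bestSlack = 0;
    for (int i = 0; i < (int)copies.size(); i++) {
        int slack = copies[i].window - copies[i].inFlight;
        if (slack > bestSlack) { bestSlack = slack; best = i; }
    }
    return best;   // -1 means all windows full: block until an ack
}

// On acknowledgment from copy i: free a slot and reward fast consumers.
void onAck(std::vector<ConsumerWindow>& copies, int i) {
    copies[i].inFlight--;
    copies[i].window++;            // consuming quickly earns more credit
}
```

Round Robin and Weighted Round Robin fall out as degenerate cases: fixed rotation, or rotation weighted by a static capacity estimate instead of observed consumption.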
Experiment – Virtual Microscope

- Client-server system for interactively visualizing digital slides
  - Image dataset (100 MB to 5 GB per focal plane)
  - Rectangular region queries, multiple data chunk reply
- Hopkins Linux cluster – 4 1-processor and 1 2-processor PIII-800 nodes, 2 80 GB IDE disks, 100 Mbit Ethernet
- The decompress filter is the most expensive, so it is a good candidate for replication
- 50 queries at various magnifications, 512x512 pixel output

Pipeline: read data → decompress → clip → zoom → view
Virtual Microscope Results

Response time (seconds) by filter placement; R-D-C-Z-V gives the host for each of the read, decompress, clip, zoom, and view filters, and columns are query magnification:

R-D-C-Z-V          50x    100x   200x   400x   Average
h-g-g-g-g          4.60   1.27   0.62   0.37   1.49
g-g(2)-g-g-g       3.24   0.92   0.45   0.33   1.08
g-g-g-g-g          4.46   1.26   0.58   0.33   1.44
h-g(2)-g(2)-b-b    5.34   1.27   0.68   0.45   1.68
h-g(4)-g(2)-g-g    3.50   0.96   0.50   0.39   1.17
h-g(2)-g(2)-g-g    3.43   0.95   0.49   0.37   1.15
h-g(2)-g-g-g       3.41   0.95   0.50   0.39   1.15
h-h-h-h-h          6.95   1.73   0.73   0.38   2.10
Experiment - Isosurface Rendering

- UT Austin ParSSim species transport simulation
  - Single time step visualization, read all data
- Setup: UMD Red Linux cluster (2-processor PII-450 nodes)

[Figure: pipeline R (read dataset) → E (isosurface extraction) → Ra (shade + rasterize) → M (merge/view). Data volumes between stages: 1.5 GB, 38.6 MB, 11.8 MB, 28.5 MB. Per-filter times: R 0.64 s (4.3%), E 1.64 s (11.2%), Ra 11.67 s (79.5%), M 0.73 s (5.0%); sum of filter times = 14.68 s, elapsed time = 12.65 s.]
Sample Isosurface Visualization

[Figure: sample isosurface renderings for isovalues V = 0.35 and V = 0.7.]
Transparent Copies: Replicate Raster

Setup: SPMD style, partitioned input dataset per node
Point: copies of the bottleneck filter are enough to balance flow

Execution time (improvement over 1 copy in parentheses):

                     1 node        2 nodes       4 nodes      8 nodes
1 copy of Raster     12.18s        7.32s         4.17s        3.00s
2 copies of Raster   8.16s (33%)   5.70s (22%)   3.88s (7%)   3.24s (–8%)
Experiment – Resource Heterogeneity

- Isosurface rendering on the Red, Blue, and Rogue Linux clusters at Maryland
  - Red – 16 2-processor PII-450 nodes, 256 MB memory, 18 GB SCSI disk
  - Blue – 8 2-processor PIII-550 nodes, 1 GB, 2 8 GB SCSI disks + 1 8-processor PIII-450 node, 4 GB, 2 18 GB SCSI disks
  - Rogue – 8 1-processor PIII-650 nodes, 128 MB, 2 75 GB IDE disks
  - Red and Blue connected via Gigabit Ethernet, Rogue via 100 Mbit Ethernet
- Two implementations of the Raster filter – z-buffer and active pixels (the one used in the previous experiment)
Experimental setup

[Figure: pipeline R (read dataset) → E (isosurface extraction) → Ra (shade + rasterize) → M (merge/view). Data volumes between stages: 1.5 GB, 38.6 MB, 11.8 MB (active pixel) or 32.0 MB (z-buffer), 28.5 MB. Per-filter times (active pixel / z-buffer): R 0.64 s / 0.68 s, E 1.64 s / 1.65 s, Ra 11.67 s / 9.43 s, M 0.73 s / 0.90 s; sums = 14.68 s / 12.66 s.]

The experiment that follows combines the R and E filters, since that showed the best performance in experiments not shown.
Varying number of nodes

[Table: execution time (seconds) for Active Pixel and Z-buffer rendering with the RE-Ra-M configuration on 1, 2, 4, and 8 nodes, for #Ra settings 1(0), 1(1), 2(0), and 2(2); the two-copy settings are n/a on 1 node.]

Only Red nodes used – each one runs 1 RE and 0, 1, or 2 Ra copies, and one node runs M.
Heterogeneous nodes

Active Pixel algorithm on the 8-processor Blue node + Red data nodes.
The Blue node runs 7 Ra or ERa copies and M; Red nodes each run 1 of each filter except M.

Execution time (seconds); DD = Demand-Driven, WRR = Weighted Round Robin, RR = Round Robin:

            8 data nodes     4 data nodes     2 data nodes     1 data node
Conf.       DD   WRR  RR     DD   WRR  RR     DD   WRR  RR     DD   WRR  RR
R-ERa-M     4.2  3.7  4.0    4.1  4.1  5.1    4.0  4.1  6.5    4.2  4.0  8.2
RE-Ra-M     3.8  2.9  3.8    3.5  3.0  4.4    3.0  2.6  5.7    3.0  3.0  7.3
Skewed data distribution

- Experimental setup
  - 25 GB dataset from UT ParSSim (bigger grid than in earlier experiments)
  - Hilbert curve declustering onto the disks of 2 Blue and 2 Rogue nodes
  - Skew moves part of the data from Blue nodes to Rogue nodes
Skewed data distribution

[Figure: four plots (Balanced, Skewed 25%, Skewed 50%, Skewed 75%) of Active Pixel rendering execution time (seconds, 0-30) for filter configurations RERa-M, R-ERa-M, and RE-Ra-M, comparing the RR, WRR, and DD workload balancing policies.]
Experimental Implications

Multiple group instances
- Higher utilization of under-used resources
- Reduced application response time
- but … requires tolerance of co-executing work

Transparent copies can help
- Most filter decompositions are unbalanced
- Heterogeneous CPU capacity / load
- but … requires tolerance of out-of-order buffers
Ongoing and Future Work

- Automated placement, instances, transparent copies
  - Predictive (cost models)
  - Adaptive (work feedback)
- Filter accumulator support (partitioning, replication)
- Java filters
- CCA-compliant filters
- Very large datasets – including HDF5 format
  - Using storage clusters at UMD and OSU, then a testbed at LLNL