
Processing large-scale multi-dimensional data in parallel and distributed environments

Michael Beynon a, Chialin Chang a, Umit Catalyurek b, Tahsin Kurc b,*, Alan Sussman a, Henrique Andrade a,

Renato Ferreira a, Joel Saltz b

a Department of Computer Science, University of Maryland, College Park, MD 20742, USA

b Department of Biomedical Informatics, The Ohio State University, Columbus, OH 43210, USA

Received 11 March 2001; received in revised form 20 November 2001

Abstract

Analysis of data is an important step in understanding and solving a scientific problem. Analysis involves extracting the data of interest from all the available raw data in a dataset and processing it into a data product. However, in many areas of science and engineering, a scientist's ability to analyze information is increasingly hindered by dataset sizes. The vast amount of data in scientific datasets makes it a difficult task to efficiently access the data of interest, and to manage potentially heterogeneous system resources to process the data. Subsetting and aggregation are common operations executed in a wide range of data-intensive applications. We argue that common runtime and programming support can be developed for applications that query and manipulate large datasets. This paper presents a compendium of frameworks and methods we have developed to support efficient execution of subsetting and aggregation operations in applications that query and manipulate large, multi-dimensional datasets in parallel and distributed computing environments.

© 2002 Elsevier Science B.V. All rights reserved.


This research was supported by the National Science Foundation under Grants #ACI-9619020 (UC Subcontract #10152408) and #ACI-9982087, the Office of Naval Research under Grant #N6600197C8534, Lawrence Livermore National Laboratory under Grant #B500288 (UC Subcontract #10184497), and the Department of Defense Advanced Research Projects Agency, USAF, AFMC, through Science Applications International Corporation under Grant #F30602-00-C-0009 (SAIC Subcontract #4400025559).

* Corresponding author.

E-mail addresses: [email protected] (M. Beynon), [email protected] (C. Chang), [email protected] (U. Catalyurek), [email protected] (T. Kurc), [email protected] (A. Sussman), [email protected] (H. Andrade), [email protected] (R. Ferreira), [email protected] (J. Saltz).

0167-8191/02/$ - see front matter © 2002 Elsevier Science B.V. All rights reserved.

PII: S0167-8191(02)00097-2


Keywords: Data-intensive applications; Multi-dimensional datasets; Parallel processing; Distributed computing; Runtime systems

1. Introduction

There is a large body of research devoted to developing high-performance architectures and algorithms for efficient execution of large-scale scientific applications. Moreover, it is becoming increasingly cost-effective to use collections of high-performance machines for application execution, because of the availability of faster networks and of tools for discovery, allocation, and management of distributed resources. As a result, long-running, large-scale simulations [20,46,56,58] are producing unprecedented amounts of data. In addition, advanced sensors attached to instruments, such as earth-orbiting satellites and medical instruments [3,62], are generating very large datasets that must be made available to a wider audience.

Looking at available technology, disk space has become plentiful and relatively inexpensive. Using off-the-shelf components, it is currently possible to build a disk-based storage cluster with about 1 Terabyte of storage space, consisting of six Pentium III PCs, each with two 80 GB EIDE disks, for about $10,000. The availability of such low-cost systems, built from networks of commodity computers and high-capacity disks, has greatly enhanced a scientist's ability to store large-scale scientific data. However, the primary goal of gathering data is better understanding of the scientific problem at hand, and data analysis is key to this understanding. The vast amount of data available in scientific datasets makes it an onerous task for a scientist both to efficiently access the data, and to manage the system resources required to process it.

A growing set of data-intensive applications query and analyze collections of very large multi-dimensional datasets. Examples of such applications include satellite data processing [24,27,62], full-scale water contamination studies and surface/subsurface petroleum reservoir simulations [44,66], visualization and processing of digitized microscopy images [3], visualization of large-scale data [5,8,29,42,61], and data mining [4,7,34,68]. Although the datasets used for analysis and the data products generated by applications that manipulate those datasets may differ in many ways, a close look at many data-intensive applications [17,21,31,42,44] reveals that there exist commonalities in their data access patterns and processing structures. Analysis requires extracting the data of interest from the dataset, and processing and transforming it into a new data product that can be more efficiently consumed by another program or analyzed by a human. Subsetting of data is often done through range queries, and aggregation (reduction) operations are commonly executed in the data processing step of a wide range of applications.

We argue that frameworks and methods can be developed that will provide common programming and runtime support for a wide range of applications that make use of large scientific datasets. In this paper, we present an overview of the methods and frameworks we have developed for efficient execution of applications that query and manipulate large, multi-dimensional datasets. The algorithms and runtime systems presented in this paper target architectures that range from tightly coupled distributed-memory parallel machines with attached disk farms to heterogeneous collections of high-performance machines and storage systems in a distributed computing environment.

2. Overview

In this section we briefly describe several data-intensive applications that have motivated the design and implementation of the algorithms and frameworks presented in this paper. We also discuss data access and processing patterns commonly observed in these applications.

2.1. Motivating applications

2.1.1. Satellite data processing

Earth scientists study the earth by processing remotely sensed data continuously acquired from sensors attached to satellites. A typical analysis processes satellite data for ten days to a year (for the AVHRR sensor, ten days of data is about 4 GB) and generates one or more composite images of the area under study [1,24]. Generating a composite image requires projection of the region of interest onto a two-dimensional (2D) grid; each pixel in the composite image is computed by selecting the "best" sensor value that maps to the associated grid point. An earth scientist specifies the projection that best suits her needs.
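The composite computation just described is a simple per-pixel reduction. The sketch below is illustrative only: it keeps the largest sensor value that maps to each output grid cell, whereas real codes use application-specific "best value" criteria and map projections; the Reading type and the precomputed grid_index are assumptions.

#include <cstddef>
#include <vector>

// Hypothetical sensor reading: a value plus the output grid cell it projects to
// (the projection itself is assumed to have been applied already).
struct Reading {
    float value;
    std::size_t grid_index;   // assumed to be < the number of grid cells
};

// Build a composite image by keeping the "best" (here simply the largest)
// value that maps to each grid cell.
std::vector<float> composite(const std::vector<Reading>& readings,
                             std::size_t grid_cells) {
    std::vector<float> image(grid_cells, -1.0f);   // -1 marks "no data yet"
    for (const Reading& r : readings)
        if (r.value > image[r.grid_index])         // application-specific criterion
            image[r.grid_index] = r.value;
    return image;
}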

2.1.2. Analysis of microscopy data: Virtual Microscope

The Virtual Microscope (VM) [3,17] provides a realistic digital emulation of a high power light microscope. The raw data for such a system can be captured by digitally scanning collections of full microscope slides under high power. The size of a slide with a single focal plane can be up to several gigabytes, uncompressed. Hundreds of such digitized slides can be produced in a single day in a large hospital. The processing for the VM requires projecting high resolution data in the region of interest on the slide onto a grid of suitable resolution (governed by the desired magnification) and appropriately compositing the pixels mapping onto a single grid point, to avoid introducing spurious artifacts into the displayed image.

2.1.3. Coupling of environmental codes: water contamination studies

Powerful simulation tools are crucial to understand and predict the transport and reaction of chemicals in bays and estuaries [44]. Such tools include a hydrodynamics simulator [46], which simulates the flow of water in the domain of interest, and a chemical transport simulator [20], which simulates the reactions between chemicals in the bay and the transport of these chemicals. For a complete simulation system, the hydrodynamics simulator needs to be coupled to the chemical transport simulator, since the latter uses the output of the former to simulate the transport of chemicals within the domain. As the chemical reactions have little effect on the circulation patterns, the fluid velocity data can be generated once and used for many contamination studies. The output data from a large grid at a single time step may be several megabytes, and thousands of time steps may need to be simulated for a particular scenario. The grids used by the chemical simulator may be different from the grids the hydrodynamics simulator employs, and the chemical simulator usually uses coarser time steps. Therefore, running a chemical transport simulation requires retrieving the hydrodynamics output in the region of interest (i.e., a region of the grid over a specified time period) from the appropriate hydrodynamics dataset, averaging the hydrodynamics outputs over time, and projecting them onto the grid used by the chemical transport simulator.

2.1.4. Visualization of simulation datasets: iso-surface rendering

The study and characterization of ground waterways and oil reservoirs involve simulation of the transport and reaction of various chemicals over many time steps on a three-dimensional (3D) grid that represents the region of interest. In a typical analysis of datasets generated by a simulation, a scientist examines the transport of one or more chemicals in the region being studied over several time steps [42]. Visualization is key to understanding the results of the simulation, and iso-surface rendering is a well-suited method to visualize the density distributions of chemicals in a region. Given a 3D grid with scalar values at grid points and a user-defined scalar value, called the iso-surface value, an iso-surface rendering algorithm extracts the surface on which the scalar value is equal to the iso-surface value. The extracted surface (iso-surface) is rendered to generate an image. In general, the iso-surface is approximated by a list of polygons [45], and a polygon rendering algorithm (e.g., the z-buffer algorithm) is employed to produce the output image [65].

2.1.5. Mining interesting patterns: decision tree construction

The goal of data mining is to discover interesting and useful, but a priori unknown, patterns from large databases. Classification is one of the important problems in data mining, and has applications in many fields, such as financial analysis and medical diagnosis [55,68]. In classification, we are given a subset of all records in the dataset, called the training set, in which each record consists of several fields, referred to as attributes. An attribute can be either a numerical attribute or a categorical attribute. If the values of an attribute belong to an ordered domain, the attribute is called a numerical attribute (e.g., income, age). A categorical attribute, on the other hand, has values from an unordered domain (e.g., type of car, house, job, or department name). One of the categorical attributes is designated as the classification attribute; its values are called class labels. The goal of classification is to create a concise model of the classification attribute based on the other attributes. Once such a model is constructed, future records, which are not in the training set, can be classified using the model. A decision-tree classifier builds a tree by dividing the training set into partitions so that all or most of the records in a partition (leaf node) have the same class label [55]. A tree is grown by splitting each leaf node at the current level, starting from the root node, which contains the entire training set, into child nodes. The records associated with a tree node are split into partitions based on the split condition. For a numerical attribute, two histograms showing the class distribution of its values are computed; for each categorical attribute, a count table is created that stores the class distribution of the records for that attribute. The histograms are used to find the best split point at a tree node [55]. Computing the histograms and count tables involves counting the class values of records and adding the counts to the respective tables and histograms.
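The counting step for a single categorical attribute amounts to building a small two-dimensional table. A minimal sketch, assuming records carry integer-coded attribute values and class labels (the Record layout is illustrative, not the cited classifiers' actual data structures):

#include <vector>

// Hypothetical training record: one integer-coded categorical attribute value
// and an integer-coded class label.
struct Record {
    int attr_value;    // assumed in [0, num_values)
    int class_label;   // assumed in [0, num_classes)
};

// Count table for one categorical attribute: rows indexed by attribute value,
// columns by class label, holding the class distribution of the records.
std::vector<std::vector<long>> build_count_table(const std::vector<Record>& records,
                                                 int num_values, int num_classes) {
    std::vector<std::vector<long>> counts(num_values,
                                          std::vector<long>(num_classes, 0));
    for (const Record& r : records)
        ++counts[r.attr_value][r.class_label];
    return counts;
}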

2.2. Data access and processing structure: range queries and reduction operations

Most of the datasets accessed and manipulated by the applications described in the previous sections are multi-dimensional. That is, data items are associated with points in a multi-dimensional attribute space. The data dimensions can be spatial coordinates and time, or varying conditions such as temperature, velocity or chemical concentration values in an environmental simulation application, or income and age in a data mining application. Oftentimes, the data of interest can be described by a range query. A range query defines a multi-dimensional bounding box in the underlying multi-dimensional attribute space of the dataset(s). Only the data items whose associated coordinates fall within the multi-dimensional box are retrieved.
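One natural representation of such a query is an axis-aligned bounding box with a point-containment test; the box-box test is what is later used to match chunk bounding rectangles against a query. The sketch below is illustrative and not the API of any of the systems described in this paper.

#include <cstddef>
#include <vector>

// A range query (or a chunk's minimum bounding rectangle): an axis-aligned
// box in d-dimensional attribute space.
struct Box {
    std::vector<double> lo, hi;   // lower and upper corner, one entry per dimension

    // Does a data item's coordinate fall inside the box?
    bool contains(const std::vector<double>& p) const {
        for (std::size_t d = 0; d < lo.size(); ++d)
            if (p[d] < lo[d] || p[d] > hi[d]) return false;
        return true;
    }

    // Do two boxes overlap? (Used to test chunk MBRs against a query.)
    bool intersects(const Box& other) const {
        for (std::size_t d = 0; d < lo.size(); ++d)
            if (hi[d] < other.lo[d] || other.hi[d] < lo[d]) return false;
        return true;
    }
};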

Performing reduction-type operations is one of the common processing patterns observed in applications that analyze large datasets. Fig. 1 shows the high-level pseudo-code of the basic data processing loop. The function Select(...) identifies the set of data items in a dataset that intersect a given range query. An intermediate data structure, referred to here as an accumulator, can be used to hold intermediate results during processing. For example, a z-buffer can be used as an accumulator to hold color and distance values in an iso-surface rendering application [54]. In data mining, histograms and count tables can be used to hold the distribution of attribute values for building decision trees [55]. Accumulator items are allocated and initialized during the initialization phase (steps 3–6). The processing steps consist of retrieving data items that intersect the range query (step 8), mapping the retrieved input items to the corresponding output items (step 9), and aggregating, in some application-specific way, all the input items that map to the same output data item (steps 10–11). The mapping function, Map(i_e), is an application-specific function, which may map an input item to a set of output items. The aggregation function, Aggregate(i_e, a_e), aggregates the value(s) of an input item i_e with the intermediate results stored in the accumulator item a_e that corresponds to one of the output items that i_e maps to. Finally, the intermediate results stored in the accumulator are post-processed to produce the final results for the output dataset (steps 13–16). Steps 1 and 4 are needed when the processing of data updates an already existing dataset, and data items are needed to initialize accumulator elements. The output can be stored on disks in the system (step 15) or can be consumed by another program (e.g., displayed by a client program in a visualization application). The output dataset is usually much smaller than the input dataset, hence steps 7–12 are called the reduction phase of the processing. Aggregation functions in the reduction phase are usually commutative and associative, i.e., the correctness of the output data values does not depend on the order in which input data items are aggregated.

Fig. 1. The basic data processing loop.
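Fig. 1 itself is not reproduced here; the loop it describes can be reconstructed from the step numbers referenced above. The sketch below is such a reconstruction, with placeholder types and trivially stubbed Select, Map, Initialize, Aggregate and Finalize routines standing in for the application-specific functions.

#include <cstddef>
#include <vector>

// Placeholder types; real applications supply their own.
struct Query {};  struct InputItem {};  struct OutputItem {};  struct AccumItem {};

// Application-specific routines (stubbed here only so the sketch is self-contained).
std::vector<InputItem> Select(const Query&)      { return {}; }   // items intersecting the query
std::vector<std::size_t> Map(const InputItem&)   { return {}; }   // output items an input maps to
void Initialize(AccumItem&, const OutputItem&)   {}
void Aggregate(const InputItem&, AccumItem&)     {}
OutputItem Finalize(const AccumItem&)            { return {}; }

void process_query(const Query& q, std::vector<OutputItem>& output) {
    // Initialization phase: one accumulator element per output element.
    std::vector<AccumItem> accum(output.size());
    for (std::size_t i = 0; i < output.size(); ++i)
        Initialize(accum[i], output[i]);

    // Reduction phase: retrieve intersecting items, map them, and aggregate.
    for (const InputItem& ie : Select(q))
        for (std::size_t oi : Map(ie))
            Aggregate(ie, accum[oi]);

    // Output handling: post-process accumulator values into final results.
    for (std::size_t i = 0; i < output.size(); ++i)
        output[i] = Finalize(accum[i]);
}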

3. Supporting reduction operations on distributed-memory parallel machines

The implementation of aggregation operations on a parallel machine requires distribution of data and computations among disks and processors to make efficient use of aggregate storage space and computing power, and careful scheduling of data retrieval, computation and network operations to keep all resources (i.e., disks, processor memory, network, and CPU) busy without overloading any of them. We have developed a framework, called the Active Data Repository (ADR) [21,31], that provides support for applications that perform range queries with user-defined aggregation operations on multi-dimensional datasets, to be executed on a distributed-memory parallel machine with an attached disk farm. In this section, we briefly describe the framework, and present algorithms and optimization techniques developed in the ADR framework.

3.1. Active Data Repository framework

3.1.1. Storing datasets

A dataset is partitioned into and stored as a set of data chunks. A data chunk contains a subset of the data items in the dataset. The dataset is partitioned into data chunks by the application developer, and data chunks in a dataset can have different sizes. Since data is accessed through range queries, it is desirable to have data items that are close to each other in the multi-dimensional space placed in the same data chunk. A data chunk is the unit of data retrieval; that is, it is retrieved as a whole during processing. Retrieving data in chunks instead of as individual data items reduces I/O overheads (e.g., seek time), resulting in higher application-level I/O bandwidth. As every data item is associated with a point in a multi-dimensional attribute space, every data chunk is associated with a minimum bounding rectangle (MBR). The MBR of a data chunk is the smallest box in the underlying multi-dimensional space that encompasses the coordinates of all the items in the data chunk.

Data chunks are distributed across the disks in the system to fully utilize the aggregate storage space and disk bandwidth. To take advantage of the data access patterns exhibited by range queries, data chunks that are close to each other in the underlying attribute space should be assigned to different disks. In the ADR framework, we employ a Hilbert curve-based declustering algorithm [28] to distribute the chunks across the disks. Hilbert curve algorithms are fast and exhibit good clustering and declustering properties. Other declustering algorithms, such as those based on graph partitioning [47], can also be used. Each chunk is assigned to a single disk, and is read and written only by the local processor to which the disk is attached. If a chunk is required for processing by one or more remote processors, it is sent to those processors as a whole by the local processor via interprocessor communication. After data chunks are assigned to disks, a multi-dimensional index is constructed using the MBRs of the chunks. The index on each processor is used to quickly locate the chunks with MBRs that intersect a given range query. Efficient spatial data structures, such as R-trees and their variants [9], can be used for indexing and accessing multi-dimensional datasets.
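A much-simplified sketch of Hilbert-based declustering: chunks are ordered by the Hilbert index of their MBR midpoints and dealt out round-robin, so chunks that are adjacent along the curve (and hence close in the attribute space) land on different disks. The hilbert field is assumed to be precomputed, and the actual algorithm in [28] differs in detail.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Chunk {
    std::uint64_t hilbert;   // Hilbert index of the MBR midpoint (assumed precomputed)
    int disk;                // disk the chunk is assigned to
};

// Sort chunks along the Hilbert curve and deal them out round-robin, so that
// chunks close to each other in the attribute space land on different disks.
void decluster(std::vector<Chunk>& chunks, int num_disks) {
    std::sort(chunks.begin(), chunks.end(),
              [](const Chunk& a, const Chunk& b) { return a.hilbert < b.hilbert; });
    for (std::size_t i = 0; i < chunks.size(); ++i)
        chunks[i].disk = static_cast<int>(i % num_disks);
}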

3.1.2. Query processing

The processing of a range query is accomplished in two steps: a query plan is computed in the query planning step, and the actual data retrieval and processing is carried out in the query execution step according to the query plan.

Query planning is carried out in three phases: index lookup, tiling and workload partitioning. In the index lookup phase, the indices associated with the datasets are used to identify all the chunks that intersect the query. If the accumulator data structure is too large to fit entirely in memory, it must be partitioned into output tiles, each of which contains a disjoint subset of accumulator elements. Partitioning is done in the tiling phase so that the size of a tile is less than the amount of memory available for the accumulator. A tiling of the accumulator implicitly results in a tiling of the input dataset: each input tile contains the input chunks that map to the corresponding output tile. Since an input element may map to multiple accumulator elements, the corresponding input chunk may appear in more than one input tile if the accumulator chunks are assigned to different tiles. During query execution, input chunks placed in multiple input tiles are retrieved multiple times, once per output tile. Therefore, care should be taken to minimize the boundaries of output tiles so as to reduce the number of such input chunks. In the workload partitioning phase, the workload associated with a tile is partitioned among processors. In the query execution step, the processing of an output tile is carried out according to the query plan. A tile is processed in four phases; a query iterates through these phases repeatedly until all tiles have been processed and the entire output has been computed (a skeleton of this loop is sketched after the list below).

(1) Initialization: Accumulator elements for the current tile are allocated space in memory and initialized.

(2) Reduction: Each processor retrieves the data chunks stored on its local disks. Data items in a data chunk are aggregated into the accumulator elements allocated in each processor's memory during phase 1.

(3) Global combine: If necessary, partial results computed by each processor in phase 2 are combined across the processors via inter-processor communication to compute the final results for the accumulator.

(4) Output handling: The final output for the current tile is computed from the corresponding accumulator values computed in phase 3. The output is either sent back to a client or stored back to the disks.
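A skeleton of the per-tile loop, with hypothetical helper names standing in for the ADR runtime's actual operations (the stub bodies exist only to keep the sketch self-contained):

#include <vector>

// Illustrative stand-ins for runtime-provided operations.
struct Tile {};  struct Chunk {};  struct Accumulator {};

Accumulator initialize(const Tile&)                 { return {}; }  // phase 1
std::vector<Chunk> local_chunks(const Tile&)        { return {}; }  // chunks on local disks
void aggregate(const Chunk&, Accumulator&)          {}              // phase 2
void global_combine(Accumulator&)                   {}              // phase 3
void handle_output(const Tile&, const Accumulator&) {}              // phase 4

void execute_query(const std::vector<Tile>& tiles) {
    for (const Tile& t : tiles) {
        Accumulator acc = initialize(t);         // (1) allocate and initialize accumulator
        for (const Chunk& c : local_chunks(t))   // (2) retrieve local chunks and aggregate
            aggregate(c, acc);
        global_combine(acc);                     // (3) combine partial results across processors
        handle_output(t, acc);                   // (4) produce and store/send the tile's output
    }
}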

3.2. An implementation of the ADR framework

We have developed an implementation of the ADR framework as a set of modular services, implemented as a C++ class library and a runtime system [21,31]; the software and user's manual can be downloaded from http://www.cs.umd.edu/projects/adr. Several of the services allow customization for user-defined processing. A unified interface is provided for customizing these services via C++ class inheritance and virtual functions. An application developer has to provide accumulator data structures and functions that operate on in-core data, in order to implement application-specific processing of out-of-core data with ADR.
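The class and method names in the following sketch are invented for illustration and are not ADR's actual interfaces; the point is the customization style, in which the developer subclasses a framework base class and overrides virtual hooks that the runtime calls on in-core data chunks.

#include <cstddef>

// Illustrative stand-ins; the real ADR base classes, method names and
// signatures differ.
struct Chunk {
    const unsigned char* data;
    std::size_t size;
};

class AggregationBase {                            // hypothetical customization point
public:
    virtual ~AggregationBase() {}
    virtual void init_accumulator() = 0;           // before a tile is processed
    virtual void aggregate(const Chunk& c) = 0;    // for each retrieved in-core chunk
    virtual void output() = 0;                     // after the global combine
};

class MaxIntensity : public AggregationBase {      // user-defined processing
    unsigned char max_value = 0;
public:
    void init_accumulator() override { max_value = 0; }
    void aggregate(const Chunk& c) override {
        // application-specific in-core processing of one data chunk
        for (std::size_t i = 0; i < c.size; ++i)
            if (c.data[i] > max_value) max_value = c.data[i];
    }
    void output() override { /* write or send max_value */ }
};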

An ADR application consists of one or more clients, a front-end process, and a customized back-end. The front-end interacts with clients, translates client requests into queries and sends one or more queries to the parallel back-end. Since clients can connect and generate queries in an asynchronous manner, the existence of a front-end relieves the back-end from being interrupted by clients during the processing of queries. The back-end is responsible for storing datasets and carrying out application-specific processing of the data on the parallel machine.

The back-end runtime system provides support for common operations such as index lookup, management of system memory, and scheduling of data retrieval and processing operations across a parallel machine. During the processing of a query, the runtime system tries to overlap disk operations, network operations and processing as much as possible. Overlap is achieved by maintaining explicit queues for each kind of operation (data retrieval, message sends and receives, data processing) and switching between queued operations as required. Pending asynchronous I/O and communication operations in the operation queues are polled and, upon their completion, new asynchronous operations are initiated when more work is required and memory buffer space is available. Data chunks are therefore retrieved and processed in a pipelined fashion.

We have developed several applications [18,21,42] using the ADR framework implementation. In the following section, we describe the implementation of the VM (see Section 2) as an example application and present experimental performance results. Fig. 2 illustrates the VM client graphical user interface.

3.2.1. The Virtual Microscope using ADR

In the VM application, the digitized image from a slide is essentially a 3D dataset, because each slide may consist of multiple focal planes. In other words, each digitized slide consists of several stacked 2D images. However, the portion of the entire image that must be retrieved to provide a view into the slide for any given set of query parameters (i.e., area of interest, magnification and focal plane) is 2D.

Fig. 2. The VM client.


Therefore, to optimize performance, each 2D image (a focal plane) can be considered separately for partitioning into chunks and for declustering chunks across disks. Most queries require processing only a small portion of the image. Hence the size of the chunks must be big enough to use the disk subsystem efficiently, but not so big that too much unneeded data is retrieved and processed. An image chunk is used as the unit of data storage and retrieval in the implementation of VM using ADR; that is, a chunk and its associated metadata (the position of the chunk in the whole image and its size) are stored as a single chunk in a data file. The chunks are distributed across the system disks using a Hilbert curve-based algorithm [28].

It is also clear that the images should be stored in a compressed form. In the current implementation we selected JPEG compression as the default compression method because of the availability of fast and stable compression/decompression libraries [37]. Therefore, the aggregation function in the VM implementation first decompresses a retrieved chunk, then carries out clipping and subsampling operations on the uncompressed data chunk to produce low resolution images as needed (data is only stored at the highest resolution). The index implementation exploits the fact that the chunks are non-overlapping and that the slides are fully rectangular images without holes. The chunks are numbered in row-major order, and their location and size in the corresponding data file are stored in a 2D matrix, each element of which corresponds to a data chunk in a focal plane. Given the location of a data chunk in the overall image, the index maps it to the corresponding entry in the matrix, so each chunk that intersects the query can be located very quickly. If it becomes necessary to store datasets with holes in the images, or to store images that are not rectangular, an R-tree based indexing method can be employed.
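That matrix lookup reduces to simple index arithmetic. The sketch below assumes square chunks of a fixed edge length laid out row-major over one focal plane; the entry layout and names are illustrative, not the VM implementation's actual data structures.

#include <cstddef>
#include <vector>

// One entry per chunk: where the (compressed) chunk lives in the data file.
struct ChunkEntry { long file_offset; long length; };

// Row-major chunk index for one focal plane of a rectangular image.
struct PlaneIndex {
    std::size_t chunks_x, chunks_y;   // number of chunks in each dimension
    std::size_t chunk_size;           // chunk edge length in pixels
    std::vector<ChunkEntry> entries;  // chunks_x * chunks_y entries, row-major

    // Collect the entries for every chunk intersecting the query rectangle
    // [x0, x1) x [y0, y1), given in pixel coordinates.
    std::vector<ChunkEntry> lookup(std::size_t x0, std::size_t y0,
                                   std::size_t x1, std::size_t y1) const {
        std::vector<ChunkEntry> hits;
        for (std::size_t cy = y0 / chunk_size; cy * chunk_size < y1 && cy < chunks_y; ++cy)
            for (std::size_t cx = x0 / chunk_size; cx * chunk_size < x1 && cx < chunks_x; ++cx)
                hits.push_back(entries[cy * chunks_x + cx]);
        return hits;
    }
};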

3.2.2. Experimental results

We present experimental performance results on a Linux PC cluster. The PC cluster consists of one front-end node and five processing nodes, with a total of 800 GB of disk storage. Each processing node has an 800 MHz Pentium III CPU, 128 MB main memory, and two 5400 RPM Maxtor 80 GB EIDE disks. The processing nodes are interconnected via 100 Mbps switched Ethernet. The front-end node is also connected to the same switch. We have used the driver program described in [10] to emulate the behavior of multiple simultaneous end users (clients). The implementation of the client driver is based on a workload model that was statistically generated from traces collected from experienced real users. Interesting regions are modeled as points in the slide, and provided as an input file to the driver program. When a user pans near an interesting region, there is a high probability that a request will be generated. The driver adds noise to requests to avoid multiple clients asking for the same region. In addition, the driver avoids having all the clients scan the slide in the same manner; the slide is swept through in either an up-down fashion or a left-right fashion. In the experiments we use a slide consisting of 32,336×27,840 3-byte pixels (that slide, and 22 others, can be accessed from the Johns Hopkins Medical Institutions Virtual Microscope web page located at http://vmscope.jhmi.edu).

Performance results for the VM data server using different chunk sizes are displayed in Fig. 3(a). In this figure, the 400×, 200×, 100× and 50× bars show the average response time of the VM data server to queries at different resolutions, where 400× is the highest resolution actually stored in the server. The overall bar displays the average response time of the VM system to the queries at all resolutions. As seen in the figure, a chunk size of 256×256 produces the best response time at each resolution, and therefore the best overall average, for a 512×512 output image. Both 128×128 and 512×512 chunk sizes result in response times that are approximately 33% higher. Increasing the chunk size decreases system performance because with too large a chunk size not all of the processing nodes in the data server can be efficiently utilized, especially for queries requesting a relatively small output image. As chunk size increases, the number of chunks that intersect with a fixed-size user request decreases. For example, with chunk size 2048×2048 a query requesting an output of size 512×512 at the highest resolution intersects with 1, 2 or 4 chunks. It is highly probable that most such queries will intersect only one chunk because of the large chunk size. In that case four out of the five processors in the data server will be idle.

Fig. 3(b) displays the average response time for queries generated by multiple concurrently running clients. Since a 256×256 chunk size gave the best response time for a single client query, we selected 256×256 as the default chunk size for this experiment. Each client is an instance of the driver program described earlier in this section and generates 100 queries (note that each client will generate a somewhat different set of queries due to the design of the client driver). The generated query set contains queries at different resolutions, hence some of the queries (those at lower resolutions) require processing more data at the VM data server, since the stored data is at the highest resolution. For example, a query at 50× magnification requires processing 64 times more data than a query requesting an output image at 400× magnification. The response times shown in these figures are the average response time for a single query. As is seen in the figure, the performance of the ADR version of the VM server scales well as the number of clients increases. For example, with five clients, the speedup for five server processors is 3.6 compared to a one-processor server.

Fig. 3. (a) Performance results for the ADR VM server running on five processors for varying image chunk sizes: average response time of the server for queries that produce a 512×512 output image. (b) Performance figures for the server on varying numbers of processors to produce 512×512 output images. Each client submits 100 queries to the server.

3.3. Query processing strategies

Workload partitioning and tiling have significant effects on the performance of an application implemented using the ADR framework. We have evaluated several potential strategies [22,23,43] that use different workload partitioning and tiling schemes. To simplify the presentation, we assume that the target range query involves only one input and one output dataset. Both the input and output datasets are assumed to be already partitioned into data chunks and declustered across the disks in the system. In the following discussion we assume that an accumulator chunk is allocated in memory for each output chunk to hold the partial results, and that an accumulator chunk is the same as an output chunk. Therefore, output chunk and accumulator chunk are used interchangeably in this section.

In all of the algorithms discussed in this section, we employ Hilbert space-filling curves [28] in the tiling phase. Our goal is to minimize the total length of the boundaries of the tiles, by assigning spatially close chunks in the multi-dimensional attribute space to the same tile, to reduce the number of input chunks crossing one or more boundaries. The advantage of using Hilbert curves is that they have good clustering properties [47], since they preserve locality. In our implementation, the midpoint of the bounding box of each output chunk is used to generate a Hilbert curve index. The chunks are sorted with respect to this index, and selected in this order for tiling. The current implementation, however, does not take into account the distribution of input chunks in the output attribute space, so for some distributions of the input data in its attribute space there can still be many input chunks intersecting multiple tiles, despite a small boundary length.
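A sketch of this tiling pass under the stated assumptions: output chunks carry a precomputed Hilbert index and a known accumulator memory footprint, and are packed greedily, in Hilbert order, into tiles that respect the memory limit.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct OutputChunk {
    std::uint64_t hilbert;   // Hilbert index of the MBR midpoint (assumed precomputed)
    std::size_t bytes;       // memory needed for its accumulator chunk
};

// Greedily pack output chunks into tiles in Hilbert order, so each tile fits
// within the memory available for the accumulator.
std::vector<std::vector<OutputChunk>> tile(std::vector<OutputChunk> chunks,
                                           std::size_t memory_limit) {
    std::sort(chunks.begin(), chunks.end(),
              [](const OutputChunk& a, const OutputChunk& b) { return a.hilbert < b.hilbert; });
    std::vector<std::vector<OutputChunk>> tiles(1);
    std::size_t used = 0;
    for (const OutputChunk& c : chunks) {
        if (!tiles.back().empty() && used + c.bytes > memory_limit) {
            tiles.emplace_back();      // start a new tile when the current one is full
            used = 0;
        }
        tiles.back().push_back(c);
        used += c.bytes;
    }
    return tiles;
}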

3.3.1. Fully replicated accumulator strategy

In the fully replicated accumulator (FRA) strategy, each processor performs the processing associated with its local input chunks. The accumulator is partitioned into tiles, each of which fits into the available local memory of a single processor. This scheme effectively replicates all of the accumulator chunks in a tile on each processor. During the reduction phase, each processor generates partial results for the accumulator chunks using only its local input chunks. Replicated accumulator chunks are then forwarded to the processors that own the corresponding accumulator chunks during the global combine phase to produce the complete intermediate result.

3.3.2. Sparsely replicated accumulator strategy

The FRA strategy replicates each accumulator chunk on every processor, even if no input chunks on some processors will be aggregated into those accumulator chunks. This results in unnecessary overhead in the initialization phase of query execution, and extra communication and computation in the global combine phase. The available memory in the system is also not employed efficiently, because of unnecessary replication. Such replication may result in more tiles being created than necessary, which may cause a large number of input chunks to be retrieved from disk more than once. In the sparsely replicated accumulator (SRA) strategy, a replicated chunk is allocated only on processors owning at least one input chunk that maps to the corresponding accumulator chunk.

3.3.3. Distributed accumulator strategy

In this scheme, every processor is responsible for all processing associated with its local output chunks. Tiling is done by selecting, for each processor, local output chunks from that processor until the memory space allocated for the corresponding accumulator chunks in the processor is filled. Since no accumulator chunks are replicated by the distributed accumulator (DA) strategy, no replicated chunks are allocated. This allows DA to make more effective use of memory and produce fewer tiles than the other two schemes. As a result, fewer input chunks are likely to be retrieved for multiple tiles. Furthermore, DA avoids interprocessor communication for accumulator chunks during the initialization phase and for replicated chunks during the global combine phase, and also requires no computation in the global combine phase. On the other hand, it introduces communication in the reduction phase for input chunks: all the remote input chunks that map to the same output chunk must be forwarded to the processor that owns the output chunk. Since a projection function may map an input chunk to multiple output chunks, an input chunk may be forwarded to multiple processors.

3.3.4. A hypergraph-based strategy

In this strategy [22], workload partitioning is formulated as a hypergraph partitioning problem. A hypergraph is a graph where each hyperedge (also called a net) can connect more than two vertices in the graph. We first describe the tiling algorithm, and then the workload partitioning algorithm.

3.3.4.1. Tiling. The memory requirement of an output tile, which determines how many output chunks can fit in an output tile, depends on how many output chunks are replicated on processors, and that information is only available after a workload partitioning is computed for the output tile. To circumvent the dependency between the tiling process and the workload partitioning process, we employ the following tiling algorithm.

(1) Create output tiles, assuming each tile is replicated on all processors. This conservative approach guarantees that for any possible workload partitioning solution computed by the workload partitioning algorithm, no single output tile can use more processor memory than is available, although more tiles than necessary may be generated. We refer to each tile generated in this step as a small tile.

(2) Apply the workload partitioning algorithm to each small tile, and compute the actual memory requirement for each small tile.

(3) Merge the small tiles to form the final tiles, based on the actual memory requirements of the small tiles, without violating the memory constraint on any processor.

3.3.4.2. Workload partitioning. For a given output tile, the hypergraph-based algorithm (HG) uses a hypergraph, referred to as an aggregation hypergraph, to model the aggregation operations involving pairs of corresponding input and output chunks. In an aggregation hypergraph, an aggregation vertex is created for each input–output chunk pair. One processor vertex is also created for each processor in the target machine. Every input and every output chunk is represented as a net that connects the vertices corresponding to the aggregation operations that need the data chunk. The net also connects to the processor vertex for the processor that owns the input or output chunk. Fig. 4(b) shows the aggregation hypergraph for the example mapping in Fig. 4(a), with circles representing vertices and lines representing nets. In the figure, each aggregation vertex is labeled with an input–output pair, each processor vertex is labeled with a processor id, and each net is labeled with a data chunk.

HG assigns weights to the vertices and nets of an aggregation hypergraph to model the I/O, communication and computation time for the aggregation operations. The weight of a processor vertex is the total time required for the processor to read its local input and output chunks. The weight of an aggregation vertex is the time required to perform the aggregation operation involving the input–output chunk pair. For a net that corresponds to an input chunk, the weight is the time to send the input chunk to a remote processor. For a net that corresponds to an output chunk, the weight is the time to (1) send the output chunk from its owner to a remote processor, (2) initialize the replicated output chunk on the remote processor, (3) send the replicated output chunk back to the owner processor and (4) combine the replicated output chunk with the local output chunk on the owner processor.

Fig. 4. (a) An example mapping between input chunks a–e and output chunks x and y. (b) The aggregation hypergraph for the example mapping in (a).


A P-way cut of a hypergraph partitions the vertices into P disjoint partitions. The weight of a partition is defined as the sum of the weights of all vertices assigned to that partition. The connectivity of a net for a cut is defined to be the number of partitions that are connected by the net. The HG algorithm computes a workload partitioning that minimizes execution time by solving the following optimization problem:

Given a threshold d and an aggregation hypergraph with P processors, compute a P-way cut C such that (1) each partition contains exactly one processor vertex, (2) the difference between the weights of any two partitions does not exceed d, and (3) the following cost function is minimized:

$$\sum_{\text{each net } e} \big[ (\mathrm{connectivity}(e, C) - 1) \cdot \mathrm{weight}(e) \big] \qquad (1)$$

where connectivity(e, C) returns the connectivity of the net e for the cut C. A P-way cut that satisfies the first two constraints of the problem definition corresponds to a partitioning of the workload such that the aggregation operations assigned to a partition are performed by the processor in that partition. The first constraint ensures that each partition is assigned to only one processor. The second ensures that the computational load imbalance between any two processors does not exceed d. For example, Fig. 5 shows a cut for the hypergraph in Fig. 4. The connectivity of a net e corresponds to the number of processors that either own or require the data chunk e. Therefore, connectivity(e, C) − 1 is the number of remote processors that the data chunk e must be sent to. For example, net d in Fig. 5 spans two partitions, thus the input chunk d must be sent to processor P2. Net x in Fig. 5 spans all three partitions; the output chunk x must therefore be replicated on all three processors. The cost function in Eq. (1) computes the total overhead incurred from sending input chunks to remote processors and from replicating output chunks on multiple processors, as required by the workload partitioning induced by a cut. Minimizing the cost function yields the cut that incurs the minimum communication overhead.

Fig. 5. A cut for the aggregation hypergraph shown in Fig. 4.
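Evaluating Eq. (1) for a candidate cut is straightforward: for each net, count the distinct partitions it touches and charge (connectivity − 1) times the net's weight. The sketch below does exactly that; the data layout is illustrative rather than PaToH's.

#include <set>
#include <vector>

// One net per data chunk: its estimated transfer/replication cost and the
// vertices (aggregation operations plus the owning processor) it connects.
struct Net {
    double weight;
    std::vector<int> vertices;
};

// Cost of a P-way cut according to Eq. (1); partition[v] is the partition
// (processor) that vertex v is assigned to.
double cut_cost(const std::vector<Net>& nets, const std::vector<int>& partition) {
    double cost = 0.0;
    for (const Net& e : nets) {
        std::set<int> parts;                          // distinct partitions spanned by net e
        for (int v : e.vertices) parts.insert(partition[v]);
        cost += (static_cast<double>(parts.size()) - 1.0) * e.weight;
    }
    return cost;
}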

3.3.5. Experimental results

We compare the performance of HG to that of the DA and the replicated accumulator (FRA and SRA) strategies on a 48-node PC cluster running Linux. We present experimental results using datasets derived from the VM application [43,64].

Each node in the cluster has two 450 MHz Pentium II processors, 500 MB of memory, and one local disk. The nodes are interconnected via both Myrinet (120 MB/s max.) and Fast Ethernet (100 Mb/s max.) networks. In the experiments, only one process was executed on each node. For HG, we use a hypergraph partitioning tool, called PaToH [16], which has been shown to generate good partitions quickly. In the figures, F, D, S, and P stand for the fully replicated accumulator, distributed accumulator, sparsely replicated accumulator, and PaToH hypergraph partitioning strategies, respectively. In the experiments, we assume the output is divided into regular rectangular regions and distributed across the disks. The assignment of both input and output chunks to the disks was done using a Hilbert curve-based declustering algorithm.

Fig. 6 shows the performance of the strategies for VM using Myrinet and Fast Ethernet. The execution times shown in the figures are the processing times for queries at the server running on the parallel machine. The goal of this experiment is to show the performance of the strategies with different communication bandwidth capabilities. The number of input chunks, each of which is 192 KB, is fixed at 8192. The number of output chunks is 256, and the size of each chunk is 192 KB. On average, each input chunk maps to one output chunk and each output chunk is mapped to by 32 input chunks. As is seen in the figure, when Fast Ethernet is used, the RA strategies perform better than DA for small numbers of processors, whereas DA achieves better performance on larger numbers of processors. In all cases, the overall execution time of HG is close to that of the better of the DA and RA strategies. HG results in low interprocessor communication volume by sending some of the input chunks, as in the DA strategy, and replicating some of the output chunks, as in the RA strategies. As the number of processors increases, HG switches from replicating output chunks to sending a mixture of input and output chunks.

Fig. 6. The performance of the strategies for the VM dataset. F, D, S and P stand for the fully replicated accumulator, distributed accumulator, sparsely replicated accumulator, and PaToH hypergraph partitioning strategies, respectively.

4. Supporting reduction operations in distributed, heterogeneous environments

In the previous section, we presented a framework and algorithms for efficient execution of data subsetting and reduction operations on tightly coupled parallel computer systems. With the help of faster networks and tools to discover and allocate distributed resources, it is increasingly becoming cost-effective to use collections of archival storage and computing systems in a distributed environment to store and manipulate large datasets. A networked collection of storage and computing systems provides a powerful environment, yet introduces many unique challenges for applications. Such a setting requires access to and processing of data in a distributed, heterogeneous environment. Both computational and storage resources can be at locations distributed across the network. Also, the overall system may present a heterogeneous environment to the application: (1) the characteristics, capacity and power of resources, including storage, computation, and network, can vary widely; (2) space availability may require suboptimal placement of datasets within a system (i.e., across the disks in the system) and across systems, causing non-uniform data access costs; (3) the distributed resources may be shared by other applications, which results in varying resource availability. These characteristics have several implications for developing efficient applications. An application should be flexible enough to accommodate the heterogeneous nature of the environment. Moreover, the application should be optimized in its use of shared resources and be adaptive to changes in their availability. For instance, it may not be efficient or feasible to perform all processing at a data server when its load becomes high. In that case, the efficiency of the application depends on its ability to perform application processing on the data as it progresses from the data source(s) to the client, and on its ability to move all or part of its computation to other machines that are well suited for the task.

There is a large body of research on building computational grids and providing support for enabling the execution of applications in a Grid environment [32]. There is also hardware and software research on archival storage systems, including distributed parallel storage systems [39], file systems [59], and remote I/O [57]. However, providing support for efficient subsetting and processing of very large scientific datasets stored in archival storage systems in a distributed environment remains a challenging research issue. Component-based programming models are becoming widely accepted [19,33,38,50,53] for developing applications in distributed, heterogeneous environments. In this model, the processing structure of an application is represented as multiple objects that interact with each other by moving data and control information.

We have developed a component-based framework, called DataCutter [11,12], for implementing data-intensive applications in a distributed environment. The framework is built upon prior work in our active disks [2,63] and Active Data Repository projects. As was described in the previous section, the ADR framework aims to realize performance gains by executing application-specific data subsetting and aggregation operations at the server where the data is stored. The active disks project investigated the potential performance benefits of pushing application-specific data processing to disks, turning these passive system components into active devices, and described a stream-based programming model for programming active disks. The filter-stream programming model employed in DataCutter adapts and extends the programming model of active disks to heterogeneous distributed environments. Both ADR and active disks target homogeneous, tightly coupled systems. DataCutter is designed to support subsetting and reduction operations (the query processing loop presented in Section 3.1.2), as does ADR. However, with DataCutter, we are extending the functionality of ADR to distributed, heterogeneous environments by allowing decomposition of application-specific reduction operations into a set of interacting components, which we refer to as filters. The goal is to achieve performance improvements by providing the flexibility to (1) place components among storage and compute nodes in a system [11], and (2) instantiate and run multiple copies of a group of components, or copies of individual components, in parallel [13]. The middleware we have developed provides two core services: an indexing service for subsetting of datasets via range queries, and a filtering service for instantiating and executing application components. In the following sections we briefly describe the framework and middleware, and present experimental results for the VM application.


4.1. Multi-level indexing for subsetting very large datasets

One of our goals is to provide support for subsetting very large datasets (with sizes up to petabytes). We require that a scientific dataset contain both a set of data files and a set of index files. Data files contain the data elements of a dataset, and can be distributed across multiple storage systems. As in the ADR framework, each data file is viewed as consisting of a set of data chunks, each of which contains a subset of all the data items in the dataset and is associated with an MBR in the underlying multi-dimensional space.

Efficient spatial data structures have been developed for indexing and accessing multi-dimensional datasets, such as R-trees and their variants [9]; ADR uses the R-tree as its default indexing method. However, storing very large datasets may result in a large set of data files, each of which may itself be very large. Therefore a single index for an entire dataset could be very large. Thus, it may be expensive, both in terms of memory space and CPU cycles, to manage the index, and to perform a search to find intersecting data chunks using a single index file. Assigning an index file to each data file in a dataset could also be expensive, because it is then necessary to access all the index files for a given search. To alleviate some of these problems, we have developed a multi-level hierarchical indexing scheme implemented via summary index files and detailed index files. The elements of a summary index file associate metadata (i.e., an MBR) with one or more data chunks and/or detailed index files. Detailed index file entries themselves specify one or more data chunks. Each detailed index file is associated with some set of data files, and stores the index and other metadata for all data chunks in those data files. There are no restrictions on which data files are associated with a particular detailed index file for a dataset. Data files can be organized in an application-specific way into logical groups, and each group can be associated with a detailed index file for better performance. For example, in satellite datasets, each data file may store data for one week. A detailed index file can be associated with data files grouped by month, and a summary index file can contain pointers to detailed index files for the entire range of data in the dataset. An R-tree is used as the indexing method for the summary and detailed index files.
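A sketch of the two-level search: consult the summary index first, then visit only the detailed index files whose MBRs intersect the query. The entry layout below is illustrative (the real summary and detailed indices are R-trees, and summary entries may also point directly at chunks).

#include <string>
#include <vector>

// Minimal 2-D MBR with an intersection test (real indices are R-trees).
struct MBR {
    double lo[2], hi[2];
    bool intersects(const MBR& q) const {
        for (int d = 0; d < 2; ++d)
            if (hi[d] < q.lo[d] || q.hi[d] < lo[d]) return false;
        return true;
    }
};

struct ChunkRef    { std::string data_file; long offset, length; };
struct DetailEntry { MBR mbr; ChunkRef chunk; };
struct DetailIndex { std::vector<DetailEntry> entries; };

// A summary index entry points at one detailed index file and carries the MBR
// covering everything that detailed index describes.
struct SummaryEntry { MBR mbr; const DetailIndex* detail; };

// Two-level search: consult the summary index first, then only the detailed
// index files whose MBRs intersect the query.
std::vector<ChunkRef> search(const std::vector<SummaryEntry>& summary, const MBR& query) {
    std::vector<ChunkRef> result;
    for (const SummaryEntry& s : summary) {
        if (!s.mbr.intersects(query)) continue;
        for (const DetailEntry& d : s.detail->entries)
            if (d.mbr.intersects(query)) result.push_back(d.chunk);
    }
    return result;
}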

4.2. Processing of data: filters and streams

In the filter-stream programming model, the processing structure of a data-intensive application is represented as a set of interacting components, called filters. Data exchange between any two filters is described via streams, which are uni-directional pipes that deliver data in fixed size buffers. Filters are location-independent, because stream names are used to specify filter-to-filter connectivity rather than endpoint locations on specific hosts. This allows the placement of filters on different hosts in a distributed environment, so processing, network and data copying overheads can be minimized by placing filters appropriately. A filter is a user-defined object with methods to carry out application-specific processing on data. Currently, filter code is expressed using a C++ language binding by subclassing a filter base class. This provides a well-defined interface between the filter code and the filtering service. The interface for filters consists of an initialization function, a processing function, and a finalization function.

class ApplicationFilter : public DC_Filter_Base_t {
public:
  int init(int argc, char *argv[]) { ... };
  int process(stream_t st[]) { ... };
  int finalize(void) { ... };
};

A stream is an abstraction used for all filter communication, and specifies how filters are logically connected. A stream is the means of uni-directional data flow between two filters, from upstream filter to downstream filter. Bi-directional data exchange is achieved by creating two streams in opposite directions. All transfers to and from streams are through a provided buffer abstraction. A buffer represents a contiguous memory region containing useful data. Streams transfer data in fixed size buffers. The size of a buffer is determined in the init call; a filter discloses a minimum and an optional maximum value for each of its streams. The actual size of the buffer allocated by the filtering service is guaranteed to be at least the minimum value. The optional maximum value is a preferred buffer size hint to the filtering service. The size of the data in a buffer can be smaller than the size of the buffer. Therefore, the buffer contains a pointer to the start, the length of the portion containing useful data, and the maximum size of the buffer. In the current prototype implementation we use TCP for stream communication, but any point-to-point communication library could be added.
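For concreteness, the buffer abstraction described above might look like the following sketch (the type and field names are illustrative, not the actual DataCutter declarations):

#include <cstddef>

// Illustrative layout of the buffer abstraction: a contiguous memory region
// in which the useful data may occupy only part of the allocated capacity.
struct buffer_t {
  unsigned char* data;  // pointer to the start of the memory region
  std::size_t    size;  // length of the portion containing useful data
  std::size_t    max;   // maximum (allocated) size of the buffer
};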

Filter operations progress as a sequence of cycles, with each cycle handling a single application-defined unit-of-work. An example of a unit-of-work would be a spatial query for an image processing application that describes a region within an image to retrieve and process. A work cycle starts when the filtering service calls the filter init function, which is where any required resources such as memory or disk scratch space are pre-allocated. Next the process function is called to continually read data arriving on the input streams in buffers from the sending filters. A special marker is sent after the last buffer to mark the end for the current unit-of-work. The finalize function is called after all processing is finished for the current unit-of-work, to allow release of allocated resources such as scratch space. When a work cycle is completed, these interface functions may be called again by the runtime system to process another unit-of-work.
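A minimal filter skeleton following this cycle might look like the sketch below. It builds on the interface class shown earlier; the read_buffer and write_buffer helpers and the end-of-work convention are assumptions for illustration, not the actual filtering service calls.

// Assumed helpers provided by the filtering service (illustrative only):
extern bool read_buffer(stream_t& in, buffer_t* buf);    // returns false at the end-of-work marker
extern bool write_buffer(stream_t& out, buffer_t* buf);

class ExampleFilter : public DC_Filter_Base_t {
public:
  int init(int argc, char* argv[]) {
    // pre-allocate scratch resources for the coming unit-of-work
    return 0;
  }
  int process(stream_t st[]) {
    buffer_t buf;
    // consume buffers from input stream 0 until the end-of-work marker,
    // apply per-buffer processing, and forward the result downstream
    while (read_buffer(st[0], &buf)) {
      /* application-specific processing on buf.data, buf.size bytes */
      write_buffer(st[1], &buf);
    }
    return 0;
  }
  int finalize(void) {
    // release scratch resources allocated in init
    return 0;
  }
};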

4.3. Parallel filters

Parallel filters target reduction operations in a distributed environment [13]. A reduction operation can be realized by a filter group that implements transformation, mapping, and aggregation operations and encapsulates the accumulator data structure. We are developing support for parallel filters from two classes, 1-filter and n-filter, which are differentiated based on the granularity of mapping between application tasks and filters.

1-filter parallel filters represent an entire parallel program as a single filter instance. The goal is to allow the use of optimized parallel implementations of user-defined mapping, transformation, filtering and aggregation functions on a particular machine configuration (i.e., as in ADR). For instance, coupling of a hydrodynamics simulator to a chemical transport simulator would require a series of transformation and aggregation operations on the output of the hydrodynamics simulator to create the input for the chemical transport simulator. A common transformation operation is the projection of flow values computed at points on one grid to flux values at faces of another mesh for chemical transport calculations. The projection requires solving linear systems of equations. Efficient parallel solvers for linear systems of equations have been implemented for distributed-memory and shared-memory platforms. In this case, a parallel implementation of the projection operation can be a 1-filter parallel filter in the group of filters that implement the operations needed for coupling the two simulators.

The experimental results in this section concentrate on the execution of n-filter parallel filters, which are represented as concurrent instances of the same filter. For this class, the filter code itself is the unit of parallelism, and is replicated across a set of host machines. The runtime performance optimizations target the combined use of ensembles of distributed-memory systems and SMP machines. Note that pipelining works well when all stages are balanced, both in terms of relative processing time of the stages, as well as the time of each stage compared to the communication cost between stages. Oftentimes, the processing of filter-based applications is not well balanced, which results in bottlenecks that cause other filters before and after a bottleneck filter to become idle. This imbalance and the resulting performance penalty are addressed by using transparent copies, in which the filter is unaware of the concurrent filter replication. We define a copy set to be all transparent copies of a given filter that are executing on a particular host. The filter runtime system maintains the illusion of a single logical point-to-point stream for communication between a logical producer and a logical consumer in the filter group. When the logical producer and/or logical consumer has transparent copies, the system must decide for each producer which consumer copy set to send a stream buffer to. For example, in Fig. 7, if P1 issues a buffer write operation to the logical stream that connects P to F, the choice is to send it to the copy set on host3 or host4.

Each copy set shares a single buffer queue, so there is perfect demand-based balance between copies within a single host. For distribution between copy sets (different hosts), we have designed and implemented several policies: (1) round robin (RR) distribution of buffers among copy sets, (2) weighted round robin (WRR) among copy sets based on the number of copies on that host, and (3) a demand driven (DD) sliding window mechanism based on buffer consumption rate.
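The following sketch illustrates, under assumed data structures, how a producer might pick a consumer copy set for one buffer under the three policies. The CopySet fields and the demand signal used for DD here (pending buffers per copy) are only an approximation of the sliding-window mechanism described above, not the actual runtime interface.

#include <string>
#include <vector>

struct CopySet {
  int copies;             // transparent copies on that host (the WRR weight)
  int buffers_in_flight;  // buffers sent but not yet consumed (DD demand proxy)
};

int choose_copy_set(const std::vector<CopySet>& sets, const std::string& policy,
                    int& rr_state) {
  if (policy == "RR")                          // plain round robin over copy sets
    return rr_state++ % (int)sets.size();
  if (policy == "WRR") {                       // weight each host by its copy count
    int total = 0;
    for (const CopySet& s : sets) total += s.copies;
    int slot = rr_state++ % total;
    for (int i = 0; i < (int)sets.size(); ++i) {
      if (slot < sets[i].copies) return i;
      slot -= sets[i].copies;
    }
  }
  // DD (approximation): send to the copy set with the fewest pending buffers
  // per copy, i.e., the one currently signalling the most demand.
  int best = 0;
  double best_load = 1e300;
  for (int i = 0; i < (int)sets.size(); ++i) {
    double load = (double)sets[i].buffers_in_flight / sets[i].copies;
    if (load < best_load) { best_load = load; best = i; }
  }
  return best;
}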

Not all filters will operate correctly in parallel as transparent copies, because of internal filter state. For example, a filter that attempts to compute the average size of all buffers processed for a unit of work will not arrive at the correct answer, because only a subset of the total set of buffers for the unit-of-work is visible at any one copy.


Such filters can be annotated to prevent the filter service from utilizing transparent copies. In the applications we have implemented, very few filters exhibit this type of behavior, and for some of those cases an additional filter can be inserted to combine the distributed filter state from the transparent copies into a single coherent state.

4.4. Application: the Virtual Microscope

In the original VM system, the processing of a query is carried out entirely at the parallel server. During query processing, the chunks that intersect the query region are read from local disks in each node. As a data chunk is stored in compressed form (JPEG format), the data chunk must first be decompressed. Then, it is clipped to the query region. Afterwards, each clipped chunk is subsampled to achieve the zoom level (magnification) specified in the query. The resulting image blocks are assembled and displayed at the client.

Fig. 7. P, F, C filter group instantiated using parallel filters.

Fig. 8. VM decomposition.

The filter decomposition used for the VM system is shown in Fig. 8. The figure only depicts the main dataflow path of image data through the system; other low-volume streams related to the client-server protocol are not shown for clarity. The thickness of the stream arrows indicates the relative volume of data that flows on the different streams. In this implementation each of the main processing steps in the server is a filter:

• read_data (R): Full-resolution data chunks that intersect the query region are read from disk, and written to the output stream.
• decompress (D): Image blocks are read individually from the input stream. The block is decompressed using JPEG decompression and converted into a 3-byte RGB format. The image block is then written to the output stream.
• clip (C): Uncompressed image blocks are read from the input stream. Portions of the block that lie outside the query region are removed, and the clipped image block is written to the output stream.
• zoom (Z): Image blocks are read from the input stream, subsampled to achieve the magnification requested in the query (a sketch of this subsampling step appears after the list), and then written to the output stream.
• view (V): Image blocks are received for a given query, collected into a single reply, and sent to the client using the standard VM client/server protocol.
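As an illustration of the zoom step referenced in the list above, the following fragment subsamples a decompressed 3-byte RGB block by keeping every factor-th pixel in each dimension. The buffer layout and the integer zoom factor are assumptions for illustration, not the actual VM implementation.

// Reduce a full-resolution RGB block to the requested magnification by
// keeping every factor-th pixel in each dimension.
void subsample_rgb(const unsigned char* src, int src_w, int src_h,
                   unsigned char* dst, int factor) {
  const int dst_w = src_w / factor;
  const int dst_h = src_h / factor;   // dst_h * factor <= src_h, so reads stay in bounds
  for (int y = 0; y < dst_h; ++y)
    for (int x = 0; x < dst_w; ++x)
      for (int c = 0; c < 3; ++c)     // R, G, B components
        dst[(y * dst_w + x) * 3 + c] =
            src[((y * factor) * src_w + (x * factor)) * 3 + c];
}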

4.4.1. Experimental results

4.4.1.1. Multi-level indexing. The first experiment isolates the impact of organizing the dataset into multiple files and using the multi-level indexing scheme. In this experiment we use a 4 GB 2D compressed JPEG image dataset (90 GB uncompressed), created by stitching together smaller digitized microscopy images. This dataset is equivalent to a digitized slide with a single focal plane that has 180K × 180K RGB pixels. The 2D image is regularly partitioned into 200 × 200 data chunks and stored in a set of data files in the IBM HPSS archival storage system at the University of Maryland [11]. The HPSS setup has 10 TB of tape storage space, 500 GB of disk cache, and is accessed through a 10-node IBM SP. One node of the IBM SP is used to run the filter that carries out index lookup, and the client was run on a SUN workstation connected to the SP node through the department Ethernet. The server host is where the read_data filter is run, which is the machine containing the dataset. While these experiments were conducted using HPSS, we only use HPSS as an example of an archival, high-capacity storage system.

We defined five possible queries, each of which covers 5 × 5 chunks of the image (see Fig. 9(a)). The execution times we will show are response times seen by the visualization client averaged over five repeated runs. Fig. 9(b) shows the results when the 2D image is partitioned into 1 × 1, 2 × 2, 4 × 4 and 10 × 10 rectangular regions, and all data chunks in each region are stored in a data file. Fig. 9(a) illustrates the partitioning of the dataset into 1 × 1 (entire rectangle), 2 × 2 (solid lines), and 4 × 4 (dashed lines) files. Each data file is associated with a detailed index file, and there is one summary index file for all the detailed index files for each partitioning. As is seen in the figure, the load time decreases as the number of files is increased. This is because HPSS loads the entire file onto disks used as the HPSS cache when a file is opened. When there is a single file, the entire 4 GB file is accessed from HPSS for each of the queries––in these experiments, all data files are purged from disk cache after each query is processed. When the number of files increases, only a subset of the detailed index files and data files are accessed using the multi-level hierarchical indexing scheme, decreasing the time to access data chunks. Note that the load time for query 5 for the 2 × 2 case is substantially larger than that of other queries, because query 5 intersects data chunks from each of the four files (Fig. 9(a)), hence the same volume of data is loaded into the disk cache as in the 1 × 1 case. The load time for that query is also larger than that in the 1 × 1 case because of the overhead of seeking/loading/opening four files instead of a single file. The computation time, on the other hand, remains almost the same, except for the 10 × 10 case, where it slightly increases, due to the overhead from opening many files. These results demonstrate that applications can take advantage of the multi-level hierarchical indexing scheme by organizing a dataset into an appropriate set of files. However, having too many files may increase computation time, potentially decreasing overall efficiency when multiple similar queries are executed on the same dataset. Overall, the conclusions to be drawn are that organization of data chunks into files can significantly affect performance, and that the use of hierarchical indexing techniques can greatly improve overall performance.

Fig. 9. (a) 2D dataset and query regions for multi-level indexing experiments. (b) Query execution time with the dataset organized into 1 × 1, 2 × 2, 4 × 4 and 10 × 10 files. Load shows the time to open and access the files, which contain data chunks that intersect a query. Computation shows the sum of the execution time for searching data chunks that intersect a query, and for processing the retrieved data via filters.

4.4.1.2. Placement of filters and parallel filters. These experiments address the performance implications of filter placement, and of replicating particular filters, to better utilize multiple processors on a multi-processor node and to utilize processors on multiple nodes. All the experiments were performed on a Linux PC cluster with five hosts, consisting of four single-processor and one dual-processor 800 MHz Pentium III machines, interconnected via 100 Mb Ethernet. The performance results are shown in Table 1. For each configuration, the same 50 queries at various magnifications were processed, with all queries producing the same size output image (512 × 512 pixels). The column labeled Average shows the average response time over all 50 queries, while the columns labeled with magnifications show the average response times for the subset of queries at that magnification. Note that queries at lower magnifications retrieve more data than those at higher magnifications, because the data is only stored at the highest magnification (400×). For this set of filters, the decompress filter (D) is the most computationally expensive, and is therefore a good candidate for replication. In all configurations, the read filter (R) reads image data from a local disk. For configuration 8, the DD writer policy was used to distribute buffers to consumer filter copies.

Table 1
Performance of 50 VM queries on a PC cluster, using various configurations

Configuration   R–D–C–Z–V             Response time (s)
                                      Average   400×     200×     100×     50×
 1              h–h–h–h–h             2.096     0.382    0.725    1.734    6.952
 2              h–g–g–g–g             1.489     0.367    0.621    1.271    4.600
 3              h–g(2)–g–g–g          1.153     0.393    0.501    0.953    3.413
 4              h–g(2)–g(2)–g–g       1.145     0.369    0.491    0.947    3.432
 5              h–g(4)–g(2)–g–g       1.171     0.394    0.500    0.957    3.500
 6              h–g–g–b–b             1.874     0.437    0.740    1.501    5.996
 7              h–g(2)–g(2)–b–b       1.679     0.454    0.680    1.271    5.341
 8              h–g(2)–g(2)–b,l,m–b   1.659     0.518    0.727    1.254    5.101
 9              g–g–g–g–g             1.436     0.333    0.575    1.263    4.464
10              g–g(2)–g–g–g          1.076     0.325    0.451    0.920    3.236

Each configuration is described by the placement of its read (R), decompress (D), clip (C), zoom (Z) and view (V) filters on hosts g, h, b, l and m. Host g has two processors, and in some configurations ran two copies of a filter, denoted g(2). In configuration 8, the Z filter was replicated on three hosts (b, l and m).

Several conclusions can be drawn from the results. First, running the read filter on a different host from all the processing filters, at least for a single-processor host, results in a significant performance increase, as seen by comparing configurations 1 and 2. This is mainly a result of asynchronous disk I/O resulting from the configuration structure. Second, replicating the bottleneck filter on a relatively unloaded machine also results in performance gains, as seen by comparing configurations 2 and 3, where the decompression filter was replicated on the dual-processor host. Also note that replicating a non-bottleneck filter does not increase performance, by comparing configurations 3 and 4. Excessive replication of even a bottleneck filter is not effective, as is shown by adding two more decompression copies in configuration 5 as compared to configuration 4. Configurations 6–8 show that for this set of filters distributing the computational load across multiple hosts does not improve performance, because the computations are not expensive enough to overcome the additional communication costs for moving stream data between hosts. Finally, configurations 9 and 10 show the benefits of running filters on a powerful host. Host g is a dual-processor machine with a large amount of memory that can easily accommodate the processing requirements of all the filters, but to take full advantage of the host in terms of response time, the decompression filter should be replicated. Another overall conclusion is that the overhead introduced by placing filters on multiple hosts is not very large. In particular, the performance difference between configurations 2 and 9 is very small, showing that it is feasible to read data on one host and process it elsewhere, so long as the required communication bandwidth can be supported (which holds in this case for the local area network used).

4.4.1.3. Dynamic buffer scheduling policies. To explore the performance effect of different policies for scheduling buffers among copies of a filter, we have implemented an emulator-based extended version of the VM application. The VM emulator includes an extended processing stage after the desired image data has been constructed (i.e., after decompress, clip and zoom). For example, such processing may perform a complex content-based classification of the cells present in the slide image. Such algorithms are computationally expensive, and provide a more complex and heterogeneous application for experimentation. The real application was executed on various nodes of the Linux cluster in isolation (each filter was executed on a separate node with no other user processes running on the node), and detailed timings were collected. These timings have been used to parameterize a generic filter emulator. Based on the results in the previous section, we decided to combine the functionality of the decompress, clip, and zoom filters into a single emulated dcz filter. In addition, we created a new chew filter that performs significant processing in comparison to the other filters. Fig. 10 shows the resulting emulator-based application.

Fig. 10. The emulated filters in the VM with additional computationally intensive processing.

The filter emulator abstracts the processing and data handling we have seen through implementing various data intensive filter-based applications. The advantage of using a filter emulator is that we can easily adjust application characteristics to fully explore the large space of potential application filters. The emulator itself assumes a simple dataflow model of filter operation––the filter (1) blocks to read sufficient input on all its input streams, (2) performs computation on the input in proportion to the size of the input and (3) generates some amount of output data to write to all its output streams. All input and output operations are performed using fixed size buffers. The sizes of the input and output data, the computation time required per unit of input, and the amount of scratch working memory needed, are the parameters that must be set to emulate a real filter. Note that settings for the computation time parameter values were collected from experiments on each of the various host types in the Linux cluster, and the appropriate values are used for each of the following experiments.
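A rough sketch of this dataflow model for a single unit-of-work is given below; the parameter names and the CPU spin loop are illustrative placeholders, not the actual emulator code.

#include <chrono>
#include <cstddef>
#include <vector>

// Illustrative parameters for one emulated filter.
struct EmulatedFilterParams {
  std::size_t input_bytes;             // input consumed per unit-of-work
  std::size_t output_bytes;            // output produced per unit-of-work
  double      seconds_per_input_byte;  // measured compute cost per byte of input
};

std::vector<unsigned char> emulate_cycle(const EmulatedFilterParams& p) {
  // (2) burn CPU time in proportion to the size of the input
  const double budget = p.seconds_per_input_byte * (double)p.input_bytes;
  const auto start = std::chrono::steady_clock::now();
  while (std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count() < budget) {
    // spin
  }
  // (3) produce the configured amount of output data
  return std::vector<unsigned char>(p.output_bytes, 0);
}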

The experimental setup is a heterogeneous Linux cluster with four classes of nodes: rogue––single-processor Pentium III 650 MHz nodes with 128 MB memory and multiple large attached EIDE disks, blue––dual-processor Pentium III 550 MHz nodes with 1 GB memory, red––dual-processor Pentium II 450 MHz nodes with 256 MB memory, and one 8-processor Pentium III 550 MHz node with 4 GB memory. The stored data for all experiments is assumed to be local to one of the storage-class (rogue) nodes. The interconnect shared by all nodes is switched 100 Mb Ethernet. All results shown are for a single 512 × 512 VM query at 100× zoom. Table 2 shows the basic behavior of the filters, including computational requirements and data transfer times, when each of the four filters is run on its own rogue node.

Table 2
Behavior of emulated filters on isolated rogue nodes

Filter       Compute time (s)   Write time (s)   Output
read_data      0.18               128.61          2.1 GB
dcz            2.41               218.13          853 KB
chew         231.01                 0.02          853 KB
view           0.03                 n/a            n/a

The chew filter is by far the most computationally expensive, and is a good candidate for multiple transparent copies. The goal is to allow the copies to execute in parallel to offset the computational imbalance between filters. With one copy of chew, the filters earlier in the pipeline spend most of their time stalled trying to perform stream write operations. As described in Section 4.3, multiple copies imply a decision for each producer to choose which copy set to send to. In Table 3, we compare two cases designed to illustrate the difference in the write policies. The first configuration is designed so that the weighted round robin (WRR) policy performs best, because sending one buffer to each of the three red nodes, and six buffers to the 8 cpu node, should create a reasonably balanced workload. The DD policy performs slightly worse for that configuration, since the acknowledgment messages required by the algorithm are an added overhead that results in a very similar write distribution as WRR. The RR policy is the worst, since filter copies on the 8 cpu node end up mostly idle. The second configuration places three copies on a single-processor rogue node, and again six copies on the 8 cpu node. In this case, the three copies on the rogue node contend for the single processor, and effectively perform at 1/3 the rate of a single copy in the first configuration. In this case, WRR is not the correct choice, and DD performs the best, because the rate of acknowledgment messages from the rogue node is reduced, hence fewer buffers are sent there. Overall, DD performs well in both cases, provided the additional acknowledgment traffic is not a problem. The results from this experiment can also be applied to situations where the load on a host increases at runtime, effectively reducing the number of processors available for transparent copies of a filter. In such situations DD should perform well, outperforming the other write policies.

Table 3
Write policy impact on performance under two specific configurations

Configuration                  Response time (s)
                               RR         WRR        DD
3 × red (1), 1 × 8 cpu (6)      89.596     41.014     43.157
1 × rogue (3), 1 × 8 cpu (6)   117.242     76.787     50.813

N × host(C) denotes that N nodes from the host class are used to execute C copies of the chew filter.

5. Related work

Reduction operations have long been recognized as an important source of parallelism for many scientific applications [26,35,36,67]. Most techniques for optimizing parallel reductions have been developed for scenarios where data can fit into processor memory, and the main goal is to partition the iterations among processors to achieve good load balance with low induced interprocessor communication overhead. Brezany et al. [14] have extended the inspector–executor approach [51] for out-of-core irregular applications. Recently, Yu and Rauchwerger [67] developed a decision-tree based system for shared-memory machines that selects a reduction algorithm from a library of algorithms according to the measured characteristics of the program's data reference pattern. The strategies presented in Section 3.3 are inspired by the various techniques developed previously. DA adopts the "owner computes" rule, while FRA and SRA make use of replicated buffers. These strategies extend those approaches to out-of-core multi-dimensional datasets and provide a unified framework for efficient execution. We also developed a hypergraph-based strategy, which takes into account the pre-existing distribution of input and output datasets across processors.

Several run-time support libraries and file systems have been developed to support efficient I/O in a parallel environment [25,41,48,60]. These systems mainly focus on supporting regular strided access to uniformly distributed datasets, such as images, maps, and dense multi-dimensional arrays. Our work, however, has focused on efficiently supporting parallel aggregation operations over subsets of irregular spatially indexed datasets specified by range queries. User-defined computation is an integral part of the frameworks presented in this paper. Some similar work was independently done by Goil and Choudhary [34]. They developed an infrastructure, called PARSIMONY, that provides support for online analytical processing (OLAP) and data mining operations.

Several researchers have concurrently and independently explored the concept of active disks (alternate names include intelligent disks and programmable disks), which allows processing to be performed within the disk subsystem. Research in this area can be roughly divided into two categories: application processing in active disks and system-level processing in active disks. Riedel et al. [52] investigated the performance of active disks for data mining and multi-media algorithms and developed an analytical performance model. The ISTORE project [15] uses the IDISK [40] architecture as a building block to create a meta-appliance: a storage infrastructure that can be tailored for specific applications. Acharya et al. [2] introduced a stream-based programming model for disklets and their interaction with host-resident peers. Restructured versions of a wide range of data-intensive applications have also been developed in that work.

There are also a number of research projects that focus on component-based models for developing applications in a distributed environment. The ABACUS framework [6] addresses the automatic and dynamic placement of functions in data-intensive applications between clients and storage servers. This work is closely related to DataCutter in that application components are placed to improve performance, but ABACUS only supports applications that are structured as a chain of function calls, and the only possibilities for placement are the client or server. MOCHA [53] is a database middleware system designed to interconnect data sources distributed over a wide area network. MOCHA operates in the highly structured relational database world, and can automatically deploy implementations of new data types to hosts for execution of queries. The work shows how an optimizer customized to deal with "data-inflating" and "data-reducing" operators can improve performance. MOCHA can leverage total knowledge about query selectivities stored in its catalog, whereas DataCutter deals with arbitrary application code with no such useful information. Armada [49] is a flexible parallel file system framework being developed to enable access to remote and distributed datasets through a network (armada) of application objects, called ships. The system provides authorization and authentication services, and runtime support for the collection of application-specific ships to run on I/O nodes, compute nodes, and other nodes on the network.

6. Conclusions and future work

We have presented an overview of frameworks and methods we have developed to provide support for applications that analyze and explore large multi-dimensional scientific datasets. The ADR framework targets optimized execution of data intensive applications on distributed-memory architectures with a disk farm. The DataCutter and filter-stream programming framework extend the work on tightly-coupled, homogeneous systems to distributed, heterogeneous collections of computational and storage systems. ADR enables execution of user-defined functions at the storage system where the data is stored. The filter-stream programming model provides flexibility by allowing applications to be composed from interacting components. This allows applications to achieve good performance on many platforms and under varying resource availability.

We are now examining new strategies and algorithms within these frameworks to further improve the performance of data intensive applications. A compiler frontend is also being developed for the ADR framework [30]. In that work, application developers can implement application-specific processing using a Java dialect or XML. The compiler then creates a customized instance of ADR from the Java or XML code. We have also initiated an effort to develop a framework for optimizing multiple simultaneous queries for analysis and exploration of large scientific datasets. This work is motivated by the fact that in some cases data analysis can be employed in a collaborative environment, where co-located clients access the same datasets and perform similar processing of the datasets. In such an environment, commonalities among data, data access patterns, and processing functions on data can be exploited to provide significant performance benefits in executing multiple simultaneous queries.

Acknowledgements

We are grateful to the Albuquerque High Performance Computing Center for providing access to their Linux clusters and providing all the necessary support for some of the ADR experiments.

References

[1] A. Acharya, M. Uysal, R. Bennett, A. Mendelson, M. Beynon, J. Hollingsworth, J. Saltz, A. Sussman, Tuning the performance of I/O-intensive parallel applications, in: Proceedings of the Fourth ACM Workshop on I/O in Parallel and Distributed Systems, May 1996.
[2] A. Acharya, M. Uysal, J. Saltz, Active disks: programming model, algorithms and evaluation, in: Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VIII), vol. 33(11), ACM SIGPLAN Notices, ACM Press, October 1998, pp. 81–91.
[3] A. Afework, M.D. Beynon, F. Bustamante, A. Demarzo, R. Ferreira, R. Miller, M. Silberman, J. Saltz, A. Sussman, H. Tsang, Digital dynamic telepathology––the Virtual Microscope, in: Proceedings of the 1998 AMIA Annual Fall Symposium, American Medical Informatics Association, November 1998.
[4] R. Agrawal, T. Imielinski, A. Swami, Database mining: A performance perspective, IEEE Transactions on Knowledge and Data Engineering 5 (6) (December 1993) 914–925.
[5] J. Ahrens, K. Brislawn, K. Martin, B. Geveci, C.C. Law, M. Papka, Large-scale data visualization using parallel data streaming, IEEE Computer Graphics and Applications 21 (4) (July/August 2001) 34–41.
[6] K. Amiri, D. Petrou, G. Ganger, G. Gibson, Dynamic function placement in active storage clusters, Technical Report CMU-CS-99-140, Carnegie Mellon University, Pittsburgh, PA, June 1999.
[7] H. Andrade, T. Kurc, A. Sussman, J. Saltz, Decision tree construction for data mining on clusters of shared-memory multiprocessors, Technical Report CS-TR-4203 and UMIACS-TR-2000-78, University of Maryland, Department of Computer Science and UMIACS, December 2000.
[8] C.L. Bajaj, V. Pascucci, D. Thompson, X.Y. Zhang, Parallel accelerated isocontouring for out-of-core visualization, in: Proceedings of the 1999 IEEE Symposium on Parallel Visualization and Graphics, San Francisco, CA, USA, October 1999, pp. 97–104.
[9] N. Beckmann, H.-P. Kriegel, R. Schneider, B. Seeger, The R*-tree: An efficient and robust access method for points and rectangles, in: Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data (SIGMOD90), Atlantic City, NJ, May 1990, pp. 322–331.
[10] M. Beynon, A. Sussman, J. Saltz, Performance impact of proxies in data intensive client-server applications, in: Proceedings of the 1999 International Conference on Supercomputing, ACM Press, June 1999.
[11] M.D. Beynon, R. Ferreira, T. Kurc, A. Sussman, J. Saltz, DataCutter: Middleware for filtering very large scientific datasets on archival storage systems, in: Proceedings of the Eighth Goddard Conference on Mass Storage Systems and Technologies/17th IEEE Symposium on Mass Storage Systems, National Aeronautics and Space Administration, NASA/CP 2000-209888, March 2000, pp. 119–133.
[12] M.D. Beynon, T. Kurc, A. Sussman, J. Saltz, Optimizing execution of component-based applications using group instances, in: Proceedings of CCGrid2001: IEEE International Symposium on Cluster Computing and the Grid, IEEE Computer Society Press, May 2001, pp. 56–63.
[13] M.D. Beynon, A. Sussman, U. Catalyurek, T. Kurc, J. Saltz, Performance optimization for data intensive grid applications, in: Proceedings of the Third Annual International Workshop on Active Middleware Services (AMS2001), August 2001.
[14] P. Brezany, A. Choudhary, M. Dang, Parallelization of irregular codes including out-of-core data and index arrays, in: Proceedings of Parallel Computing 1997––PARCO'97, Elsevier, Amsterdam, September 1997, pp. 132–140.
[15] A. Brown, D. Oppenheimer, K. Keeton, R. Thomas, J. Kubiatowicz, D. Patterson, ISTORE: Introspective storage for data-intensive network services, in: Proceedings of the 7th Workshop on Hot Topics in Operating Systems (HotOS-VII), March 1999.
[16] U. Catalyurek, C. Aykanat, Hypergraph-partitioning based decomposition for parallel sparse-matrix vector multiplication, IEEE Transactions on Parallel and Distributed Systems 10 (7) (1999) 673–693.
[17] U. Catalyurek, T. Kurc, A. Sussman, J. Saltz, Improving the performance and functionality of the Virtual Microscope, Archives of Pathology and Laboratory Medicine 125 (8) (August 2001).
[18] U. Catalyurek, T. Kurc, A. Sussman, J. Saltz, Improving the performance and functionality of the Virtual Microscope, Archives of Pathology and Laboratory Medicine 125 (8) (August 2001).
[19] Common Component Architecture Forum. http://www.cca-forum.org.
[20] C.F. Cerco, T. Cole, User's guide to the CE-QUAL-ICM three-dimensional eutrophication model, release version 1.0, Technical Report EL-95-15, US Army Corps of Engineers Water Experiment Station, Vicksburg, MS, 1995.
[21] C. Chang, R. Ferreira, A. Sussman, J. Saltz, Infrastructure for building parallel database systems for multi-dimensional data, in: Proceedings of the Second Merged IPPS/SPDP Symposiums, IEEE Computer Society Press, April 1999.
[22] C. Chang, T. Kurc, A. Sussman, U. Catalyurek, J. Saltz, A hypergraph-based workload partitioning strategy for parallel data aggregation, in: Proceedings of the Eleventh SIAM Conference on Parallel Processing for Scientific Computing, SIAM, March 2001.
[23] C. Chang, T. Kurc, A. Sussman, J. Saltz, Optimizing retrieval and processing of multi-dimensional scientific datasets, in: Proceedings of the Third Merged IPPS/SPDP (14th International Parallel Processing Symposium and 11th Symposium on Parallel and Distributed Processing), IEEE Computer Society Press, May 2000.
[24] C. Chang, B. Moon, A. Acharya, C. Shock, A. Sussman, J. Saltz, Titan: A high performance remote-sensing database, in: Proceedings of the 1997 International Conference on Data Engineering, IEEE Computer Society Press, April 1997, pp. 375–384.
[25] P.F. Corbett, D.G. Feitelson, The Vesta parallel file system, ACM Transactions on Computer Systems 14 (3) (1996) 225–264.
[26] R. Das, M. Uysal, J. Saltz, Y.-S. Hwang, Communication optimizations for irregular scientific computations on distributed memory architectures, Journal of Parallel and Distributed Computing 22 (3) (1994) 462–479.
[27] H. Fallah-Adl, J. JáJá, S. Liang, J. Townshend, Y.J. Kaufman, Fast algorithms for removing atmospheric effects from satellite images, IEEE Computational Science and Engineering 3 (2) (Summer 1996) 66–77.
[28] C. Faloutsos, P. Bhagwat, Declustering using fractals, in: Proceedings of the 2nd International Conference on Parallel and Distributed Information Systems, January 1993, pp. 18–25.
[29] R. Farias, C.T. Silva, Out-of-core rendering of large, unstructured grids, IEEE Computer Graphics and Applications 21 (4) (2001) 42–50.
[30] R. Ferreira, G. Agrawal, J. Saltz, Compiling object-oriented data intensive applications, in: Proceedings of the 2000 International Conference on Supercomputing, ACM Press, May 2000, pp. 11–21.
[31] R. Ferreira, T. Kurc, M. Beynon, C. Chang, A. Sussman, J. Saltz, Object-relational queries into multi-dimensional databases with the Active Data Repository, Parallel Processing Letters 9 (2) (1999) 173–195.
[32] I. Foster, C. Kesselman, The GRID: Blueprint for a New Computing Infrastructure, Morgan Kaufmann, 1999.
[33] Global Grid Forum. http://www.gridforum.org.
[34] S. Goil, A. Choudhary, PARSIMONY: An infrastructure for parallel multidimensional analysis and data mining, Journal of Parallel and Distributed Computing 61 (3) (2001) 285–321.
[35] M. Hall, S. Amarasinghe, B. Murphy, S. Liao, M. Lam, Detecting coarse-grain parallelism using an interprocedural parallelizing compiler, in: Proceedings of Supercomputing'95, San Diego, CA, December 1995.
[36] H. Han, C.-W. Tseng, Improving compiler and run-time support for irregular reductions, in: Proceedings of the 11th Workshop on Languages and Compilers for Parallel Computing, August 1998.
[37] The Independent JPEG Group's JPEG software, March 1998. http://www.ijg.org.
[38] C. Isert, K. Schwan, ACDS: Adapting computational data streams for high performance, in: 14th International Parallel and Distributed Processing Symposium (IPDPS 2000), Cancun, Mexico, IEEE Computer Society Press, May 2000, pp. 641–646.
[39] W.E. Johnston, B. Tierney, A distributed parallel storage architecture and its potential application within EOSDIS, NASA Mass Storage Symposium, March 1995.
[40] K. Keeton, D.A. Patterson, J.M. Hellerstein, A case for intelligent disks (IDISKs), ACM SIGMOD Record 27 (3) (September 1998) 42–52.
[41] D. Kotz, Disk-directed I/O for MIMD multiprocessors, in: Proceedings of the 1994 Symposium on Operating Systems Design and Implementation, ACM Press, November 1994, pp. 61–74.
[42] T. Kurc, U. Catalyurek, C. Chang, A. Sussman, J. Saltz, Visualization of large datasets with the Active Data Repository, IEEE Computer Graphics and Applications 21 (4) (2001) 24–33.
[43] T. Kurc, C. Chang, R. Ferreira, A. Sussman, J. Saltz, Querying very large multi-dimensional datasets in ADR, in: Proceedings of the 1999 ACM/IEEE SC99 Conference, ACM Press, November 1999.
[44] T.M. Kurc, A. Sussman, J. Saltz, Coupling multiple simulations via a high performance customizable database system, in: Proceedings of the Ninth SIAM Conference on Parallel Processing for Scientific Computing, SIAM, March 1999.
[45] W. Lorensen, H. Cline, Marching cubes: a high resolution surface reconstruction algorithm, Computer Graphics 21 (4) (1987) 163–169.
[46] R.A. Luettich, J.J. Westerink, N.W. Scheffner, ADCIRC: An advanced three-dimensional circulation model for shelves, coasts, and estuaries, Technical Report 1, Department of the Army, US Army Corps of Engineers, Washington DC 20314-1000, December 1991.
[47] B. Moon, J.H. Saltz, Scalability analysis of declustering methods for multidimensional range queries, IEEE Transactions on Knowledge and Data Engineering 10 (2) (1998) 310–327.
[48] N. Nieuwejaar, D. Kotz, The Galley parallel file system, in: Proceedings of the 1996 International Conference on Supercomputing, ACM Press, May 1996, pp. 374–381.
[49] R. Oldfield, D. Kotz, Armada: A parallel file system for computational grids, in: Proceedings of CCGrid2001: IEEE International Symposium on Cluster Computing and the Grid, Brisbane, Australia, IEEE Computer Society Press, May 2001.
[50] B. Plale, K. Schwan, dQUOB: Managing large data flows using dynamic embedded queries, in: IEEE International High Performance Distributed Computing Conference (HPDC), August 2000.
[51] R. Ponnusamy, J. Saltz, A. Choudhary, Y.-S. Hwang, G. Fox, Runtime support and compilation methods for user-specified irregular data distributions, IEEE Transactions on Parallel and Distributed Systems 6 (8) (1995) 815–831.
[52] E. Riedel, C. Faloutsos, G. Gibson, Active storage for large-scale data mining and multimedia applications, in: Proceedings of VLDB'98, 1998.
[53] M. Rodríguez-Martínez, N. Roussopoulos, MOCHA: A self-extensible database middleware system for distributed data sources, in: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD00), ACM SIGMOD Record, vol. 29(2), ACM Press, May 2000, pp. 213–224.
[54] W. Schroeder, K. Martin, B. Lorensen, The Visualization Toolkit: An Object-Oriented Approach To 3D Graphics, second ed., Prentice Hall, Englewood Cliffs, NJ, 1997.
[55] J. Shafer, R. Agrawal, M. Mehta, SPRINT: A scalable parallel classifier for data mining, in: The 22nd VLDB Conference, Bombay, India, September 1996, pp. 544–555.
[56] P.H. Smith, J. van Rosendale (Eds.), Data and Visualization Corridors: Report on the 1998 DVC Workshop Series, Technical Report CACR-164, California Institute of Technology, September 1998.
[57] SRB: The Storage Resource Broker. http://www.npaci.edu/DICE/SRB/index.html.
[58] T. Tanaka, Configurations of the solar wind flow and magnetic field around the planets with no magnetic field: calculation by a new MHD, Journal of Geophysical Research 98 (A10) (1993) 17251–17262.
[59] M. Teller, P. Rutherford, Petabyte file systems based on tertiary storage, in: The Sixth NASA Goddard Space Flight Center Conference on Mass Storage Systems and Technologies, Fifteenth IEEE Symposium on Mass Storage Systems, 1998.
[60] R. Thakur, A. Choudhary, R. Bordawekar, S. More, S. Kuditipudi, Passion: Optimized I/O for parallel applications, IEEE Computer 29 (6) (1996) 70–78.
[61] S.-K. Ueng, K. Sikorski, K.-L. Ma, Out-of-core streamline visualization on large unstructured meshes, IEEE Transactions on Visualization and Computer Graphics 3 (4) (1997) 370–380.
[62] U.S. Geological Survey, Land satellite (LANDSAT) thematic mapper (TM). http://edcwww.cr.usgs.gov/nsdi/html/landsat_tm/landsat_tm.
[63] M. Uysal, A. Acharya, J. Saltz, Evaluation of active disks for decision support databases, in: Proceedings of the 6th International Symposium on High-Performance Computer Architecture, IEEE Computer Society Press, January 2000.
[64] M. Uysal, T.M. Kurc, A. Sussman, J. Saltz, A performance prediction framework for data intensive applications on large scale parallel machines, in: Proceedings of the Fourth Workshop on Languages, Compilers and Run-time Systems for Scalable Computers, Springer-Verlag, May 1998, pp. 243–258.
[65] A. Watt, Fundamentals of Three-dimensional Computer Graphics, Addison Wesley, 1989.
[66] M.F. Wheeler, W. Lee, C.N. Dawson, D.C. Arnold, T. Kurc, M. Parashar, J. Saltz, A. Sussman, Parallel computing in environment and energy, in: J. Dongarra, I. Foster, G. Fox, K. Kennedy, L. Torczon, A. White (Eds.), CRPC Handbook of Parallel Computing, Morgan Kaufmann Publishers, 2001.
[67] H. Yu, L. Rauchwerger, Adaptive reduction parallelization techniques, in: Proceedings of the 14th ACM International Conference on Supercomputing, Santa Fe, New Mexico, May 2000, pp. 66–77.
[68] M.J. Zaki, C.-T. Ho, R. Agrawal, Parallel classification for data mining on shared-memory multiprocessors, in: IEEE International Conference on Data Engineering, Sydney, Australia, March 1999, pp. 198–205.