[IEEE Distributed Processing, Workshops and Phd Forum (IPDPSW) - Atlanta, GA, USA (2010.04.19-2010.04.23)] 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops


Parallel Discrete Wavelet Transform using the Open Computing Language: a performance and portability study

Bharatkumar Sharma and Naga Vydyanathan
Siemens Corporate Technology

Bangalore, India
{bharatkumar.sharma, nagavijayalakshmi.vydyanathan}@siemens.com

Abstract—The discrete wavelet transform (DWT) is a powerful signal processing technique used in the JPEG 2000 image compression standard. The multi-resolution sub-band encoding provided by DWT allows for higher compression ratios, avoids blocking artifacts and enables progressive transmission of images. However, these advantages come at the expense of additional computational complexity. Achieving real-time or interactive compression/de-compression speeds, therefore, requires a fast implementation of DWT that leverages emerging parallel hardware systems. In this paper, we develop an optimized parallel implementation of the lifting-based DWT algorithm using the recently proposed Open Computing Language (OpenCL). OpenCL is a standard for cross-platform parallel programming of heterogeneous systems comprising multi-core CPUs, GPUs and other accelerators. We explore the potential of OpenCL in accelerating the DWT computation and analyze the programmability, portability and performance aspects of this language. Our experimental analysis is done using NVIDIA's and AMD's drivers that support OpenCL.

I. INTRODUCTION

JPEG 2000 is a wavelet-based image compression standard and coding system that succeeds the original discrete cosine transform based JPEG standard. JPEG 2000 offers a number of advantages over traditional JPEG compression, such as superior performance at low bit rates, progressive transmission by pixel accuracy and resolution, region-of-interest coding, lossless and lossy compression, and robustness to bit errors. The core component of the JPEG 2000 image coding system that facilitates the above advantages is the discrete wavelet transform (DWT) computation, which allows for multi-level decomposition of the image data into sub-bands. DWT has been shown to account for 35%-75% of the overall JPEG 2000 encoding/decoding time [1].

DWT can be computed using two approaches: the convolution-based approach and the lifting scheme. In comparison to convolution-based DWT, the lifting approach has been shown to result in better computational efficiency and memory savings [2]. However, in spite of these savings, DWT computation in software still has a high computational complexity, and achieving real-time compression/de-compression speeds requires efficiently leveraging the raw performance of the underlying computing systems.

Recent years have seen the emergence of multi-core processors and accelerators such as graphics processing units (GPUs) as sources of massive computing power. Intel's and AMD's roadmaps for the next 5 years reveal the trend towards systems with many cores. NVIDIA's next-generation Fermi architecture boasts 16 multi-processing units, each having 32 cores, i.e., a total of 512 processing cores. To program these heterogeneous parallel computing systems, the Khronos group, in partnership with several industry-leading companies like Apple, AMD, NVIDIA, Intel and IBM, has developed a unified programming model called the Open Computing Language (OpenCL) [3]. OpenCL is a standard for cross-platform parallel programming of heterogeneous systems comprising multi-core CPUs, GPUs and other accelerators.

In this paper, using the Open Computing Language, we leverage the compute capabilities of modern parallel computing systems for accelerating the discrete wavelet transform computation. We design and develop an optimized parallel implementation of the lifting-based DWT algorithm using OpenCL. Through experiments using NVIDIA's and AMD's OpenCL drivers and comparisons with native CUDA (Compute Unified Device Architecture) [4] implementations, we analyze the programmability, portability and performance aspects of this language. To the best of our knowledge, this is the first work to record practical experiences of using the Open Computing Language for developing platform-independent applications.

The rest of this paper is organized as follows. The next section outlines the related work. Section III gives a brief overview of DWT in JPEG 2000 compression and the Open Computing Language. Section IV describes our OpenCL-based parallel DWT implementation. The results of our performance and portability study and our practical experiences and insights are presented in Section V, and Section VI summarizes our observations and concludes the paper.

II. RELATED WORK

DWT is a signal processing technique that has emerged as a key part of various image and video coding frameworks including JPEG 2000. Due to its high computational complexity, the incorporation of DWT in real-time, interactive image visualization applications that process large amounts of data has been limited by the availability of compute resources. In the last few years, the proliferation of low-cost, high-performance computing devices and accelerators has led to a significant body of work on the efficient implementation of DWT on emerging parallel hardware systems.

978-1-4244-6534-7/10/$26.00 ©2010 IEEE

Figure 1. Original image and transformed images at two levels of decomposition.

Wong et al. [5] have proposed a fast convolution-based DWT shader that runs on consumer-level GPUs. Simek and Rakesh [6] outline their initial investigation on accelerating 2D DWT image compression using CUDA and MATLAB. A recent work by Franco et al. [7], which describes a parallel CUDA-based implementation of DWT, shows impressive speedups of approximately 20x for the CUDA-based code (run on an NVIDIA Tesla C870) over a sequential implementation.

The above works aim to provide optimized, fast implementations of DWT on specific parallel hardware systems and hence are not portable solutions. To encourage the development of platform-independent parallel code, the Khronos group has very recently proposed the Open Computing Language [3]. OpenCL enables the development of parallel code that is executable across heterogeneous devices such as multi-core CPUs as well as GPUs. The aim of this paper is to design an optimized implementation of the DWT computation using OpenCL and to evaluate the usefulness of this language in terms of the performance achieved and ease of portability.

A recent work by Nottingham et al. [8] shows the applicability of the OpenCL architecture to packet analysis. However, an OpenCL-based packet analyzer was not implemented or evaluated. As the OpenCL standard and its support by industry vendors have only very recently become available, to the best of our knowledge, this paper is the first work to record practical experiences of using the Open Computing Language for developing platform-independent applications.

III. BACKGROUND

This section gives a brief overview of the discrete wavelet transform computation in JPEG 2000 compression, followed by a description of the Open Computing Language and its platform, memory and execution models.

A. Discrete Wavelet Transform in JPEG 2000

The discrete wavelet transform is a powerful image processing technique used to transform image pixels into wavelets, which are then used for wavelet-based compression and coding. DWT forms a core component of the JPEG 2000 image compression standard [9]. In JPEG 2000 compression, an image is divided into equally-sized rectangular regions called tiles. Each tile is transformed and encoded independently. DWT in JPEG 2000 transforms each tile into a set of two-dimensional sub-bands, each representing the activity of the image signal in a different frequency band, at a different spatial resolution.

Figure 1 shows the original image and the outputs of the discrete wavelet transform at two levels of decomposition. For simplicity, assume that the entire image is composed of a single tile. At the first level of decomposition (Figure 1(b)), the original image is decomposed into 4 sub-bands, each at half the original resolution. These sub-bands are obtained by successive application of a low-pass and a high-pass filter and down-sampling by a factor of 2. The lower left sub-band is the LH sub-band, obtained by applying high-pass filtering to the columns and low-pass filtering to the rows of the image tile. The lower right sub-band is the HH sub-band, obtained by applying high-pass filtering to both the rows and columns of the image tile. The top right sub-band is the HL sub-band, obtained by applying a low-pass filter to the columns and a high-pass filter to the rows. The top left sub-band is the LL sub-band, obtained by applying low-pass filters to both the rows and columns of the image tile. The low-pass filter preserves the low frequencies of a signal while attenuating or eliminating the high frequencies, thus resulting in a blurred version of the original signal. Conversely, the high-pass filter preserves the high frequencies in a signal, such as edges, texture and detail, while removing or attenuating the low frequencies. At the second level of decomposition (Figure 1(c)), the lowest frequency sub-band, LL, of the first level is further decomposed into 4 sub-bands. Thus, at level n, there are 3×n + 1 sub-bands; for example, two levels of decomposition yield 7 sub-bands.

Figure 2. OpenCL: Programming Heterogeneous Platforms (courtesy: Khronos).

There are two ways of performing the DWT computation: the traditional convolution-based approach and the lifting-based approach. In the convolution-based approach, the image signals are convolved with the high-pass and low-pass filters and down-sampled to obtain the sub-bands. The lifting scheme, on the other hand, involves computing a lazy transform by splitting the original signal into odd- and even-indexed sub-sequences and modifying these using predict and update steps. In comparison to convolution-based DWT, the lifting approach has been shown to result in better computational efficiency (by about 50%) and memory savings [2], [9]. Therefore, in this paper, we employ the lifting-based DWT computation.

The multi-resolution sub-band analysis offered by DWT provides several advantages. It enables higher compression ratios, avoids blocking artifacts and enables progressive transmission of images. DWT also allows for a flexible random-access codestream that enhances transmission scalability and enables region-of-interest encoding. However, these advantages come at the cost of increased computational complexity. As DWT accounts for a major part of the computation in JPEG 2000 encoding/decoding, it is necessary to accelerate this computation to achieve fast compression/de-compression.

B. Open Computing Language

The Open Computing Language is an open, royalty-free standard for developing platform-independent parallel applications on multi-core CPUs, GPUs and other accelerators like the Cell BE (refer to Figure 2). It is developed by the Khronos group in partnership with several industry-leading companies and institutions such as Apple, AMD, NVIDIA, Intel and IBM. The vision of this initiative is to create a foundation layer for a parallel computing ecosystem of platform-independent tools, middleware and applications.

The OpenCL platform model assumes the target computing system to comprise a host and a set of compute devices (see Figure 3). Each compute device consists of one or more compute units, and each compute unit has a set of processing elements. In the case of a graphics card, the compute device maps to the GPU, the compute units are the multi-processor units and the processing elements are the SIMD cores.

Figure 3. The OpenCL Platform Model (courtesy: Khronos).

Figure 4. The OpenCL memory model (courtesy: Khronos).

The OpenCL execution model maps very closely to the CUDA programming model. An OpenCL application runs on the host and submits work to be executed on the compute device. This work is organized as work-items and work-groups. A work-item is akin to a thread within a thread block in the CUDA language, and executes on the processing elements of a compute unit. A work-group is a collection of work-items and is similar to a CUDA thread block. The work-items and work-groups can be organized in one, two or three dimensions. The code for a work-item is given by an OpenCL kernel, and a program comprises a set of kernels and other functions. Kernel execution instances are queued to specific compute devices and executed either in-order or out-of-order. Each kernel execution instance is launched as a set (grid) of work-groups. Work-items within a work-group can synchronize, while no synchronization is possible between work-items in different work-groups.

The OpenCL memory model is shown in Figure 4. The private memory is typically mapped to registers and is per work-item. The local memory is shared among the work-items belonging to a work-group; this is akin to the shared memory in CUDA. Global and constant memory are visible across work-groups, while the host memory is on the CPU.

The main advantage of OpenCL is its portability across multiple devices and its capability for heterogeneous computing. NVIDIA has publicly launched OpenCL drivers for its CUDA-enabled GPUs, while AMD has released a beta version of its Stream SDK that supports ATI GPUs and SSE3-capable CPUs. Apple's latest version of Snow Leopard supports OpenCL. In this paper, we use OpenCL to accelerate the DWT computation in JPEG 2000 compression and study the potential of this new standard in terms of its performance as well as portability.

Figure 5. Mapping of the DWT processing to the OpenCL execution model.

IV. PARALLEL DISCRETE WAVELET TRANSFORM

This section presents our optimized parallel implementation of the DWT computation using OpenCL. As mentioned in Section III-A, DWT in JPEG 2000 compression is applied to rectangular, equally-sized regions of an image, called tiles. The DWT processing of each tile is independent of the others.

The first step in parallelizing the DWT computation requires the algorithm to be organized as work-items and work-groups, which form the core of the OpenCL execution model (refer to Section III-B). Figure 5 shows how the processing of the pixels of an input image is mapped to OpenCL work-items and work-groups. The input image is divided into rectangular regions, or tiles, of Nx × Ny pixels each. The processing of the pixels within a tile is mapped to a 2D collection of Nx × Ny work-items. Each such collection forms a work-group. The total number of work-groups thus corresponds to the number of tiles in the input image. The work-groups process different image tiles in parallel, while the work-items within a work-group process the pixels of a tile in parallel.

The DWT kernel performs the 2D sub-band decomposition of an image tile through iterative 1D sub-band decompositions applied to the column and row pixels of the tile. The 1D sub-band decomposition is done through two main steps: a) extension, which is the periodic symmetric extension of the input signal, and b) filtering, which is either convolution-based filtering or lifting-based filtering. The rest of this section describes our parallel design for these two steps.

A. Extension

Prior to applying the filtering operation, the 1D signal, which is a row or column of an image tile, is periodically extended at both ends in a symmetric manner. Figure 6 depicts the periodic symmetric extension of a sample signal. This step is necessary to ensure smooth filtering at the signal boundaries. The length of the extension at the ends depends on the length of the filter used. For example, the use of the 5/3 filter for the reversible transform requires extension by one or two samples.

Figure 6. Periodic symmetric extension of a finite length signal, ABCDEFG.

Figure 7. Extension in parallel. (a) shows the extended tile view; (b) shows the horizontal extension of a row performed by work-items in parallel; (c) shows the vertical extension of a column.

Figure 7 illustrates how the extension is done in parallel by the OpenCL work-items within a work-group. Figure 7(a) shows the tile view after row-wise and column-wise extension by two samples at each end, while Figures 7(b) and (c) depict the horizontal (row-wise) and vertical (column-wise) extension of a signal, respectively. In horizontal extension, the rows of the image tile are extended in parallel, while in vertical extension the columns are extended in parallel. The number of active work-items is always equal to the number of rows or columns multiplied by the number of elements to be extended at each end of the signal. In the forward transform, the vertical extension and filtering are done first, followed by horizontal extension and filtering. The work-items within each work-group first fetch their image tile from textures into the local memory, and the subsequent extension and filtering operations are done on local memory. As the local memory is on-chip and fast, this enhances the performance.

B. Lifting-based filtering

As mentioned in Section III-A, the filtering step in DWT can be performed by two techniques: convolution-based filtering and lifting-based filtering. In comparison to convolution-based filtering, the lifting approach has been shown to result in better computational efficiency and memory savings [2]. Lifting-based filtering consists of a sequence of three steps: a) split, b) predict and c) update. In the splitting step, the input signal is divided into odd and even samples. The predict step predicts the odd samples using a weighted sum of the even sample values and replaces the odd values with the difference between the actual and the predicted values. The even samples are left unchanged in this step. In the update step, the even samples are updated using a weighted sum of the odd samples, to smooth out the approximations.

Figure 8. Lifting-based filtering.

Figure 8 depicts the lifting-based filtering approach for the reversible 5/3 filter. As shown in the figure, the odd work-items are active in the predict step, while the even work-items are active in the update step. Once the 1D vertical and horizontal sub-band decompositions are performed, the coefficients obtained are de-interleaved into the 4 sub-bands: HH, HL, LH and LL. This sub-band decomposition is repeated iteratively on the LL sub-band, depending on the number of levels of decomposition required.

In the next section, we describe the performance and portability study of our parallel DWT implementation using NVIDIA's and AMD's drivers for OpenCL.

V. PERFORMANCE AND PORTABILITY ANALYSIS

The performance and portability analysis of our OpenCL-accelerated DWT computation is done using NVIDIA's and AMD's OpenCL drivers. For the performance analysis, the memory access and kernel performance of NVIDIA's OpenCL-accelerated DWT is compared against that of a native CUDA implementation. AMD's OpenCL drivers are used to test the portability of the developed OpenCL code on SSE3-capable multi-core CPUs. We also study the performance degradation, if any, that has to be accepted to allow for portability. The first part of this section describes our performance analysis experiments, while the second outlines our portability tests using AMD's drivers. All our experiments were done on an Intel(R) Xeon(R) 3.33 GHz quad-core system with 3.25 GB RAM (Dell Precision T7400) having an NVIDIA GeForce GTX 285 GPU. This GPU features 30 multi-processors with a total of 240 processing cores, 16 KB of shared memory per multi-processor and 1 GB of device memory.

A. Performance Analysis using NVIDIA’s Drivers

To study the performance efficiency of our OpenCL-accelerated DWT algorithm on GPUs, we compared the memory access and kernel performance of our OpenCL-based DWT against a native CUDA implementation. The parallel design of both implementations was identical. NVIDIA's OpenCL v1.0 conformant drivers and CUDA 2.3 drivers were used in our analysis.

1) Memory Access Performance: In our parallel DWT implementation, the original image is accessed by the DWT kernel through a 2D texture, and the computed transformed image is copied back to the host memory from the global memory. 2D texture creation in OpenCL involves a call to clCreateImage2D(), while in CUDA it involves calls to cudaMallocArray() and cudaMemcpyToArray() (host to device memory transfer), plus calls for creating channels and binding textures. In OpenCL, there are two alternatives to create and populate 2D textures, as shown in Figure 9. 2D texture creation is done using the clCreateImage2D() call. Populating the texture with data from host memory can be done either by setting the CL_MEM_COPY_HOST_PTR flag, or by using clEnqueueWriteImage().

Figure 10 compares the time taken to create and populate a 2D texture in native CUDA against the two OpenCL alternatives. The performance using the first alternative is given in Figure 10(a), where we notice that the OpenCL texture performance is up to 6 times worse than the CUDA texture performance. As seen in Figure 10(b), the second approach resulted in texture performance equivalent to that offered by CUDA. This indicates that NVIDIA's clEnqueueWriteImage() API is better optimized for data transfers between the host memory and textures than using the CL_MEM_COPY_HOST_PTR flag. We have posted a query on NVIDIA's OpenCL forum to gain insight into the implementation difference between these two approaches. Please note that in both cases, the correctness of the algorithm was ensured.

Our next experiment involved benchmarking the device-to-host memory transfer bandwidth using the clCreateBuffer() and clEnqueueReadBuffer() calls against their CUDA counterparts, i.e., cudaMalloc() and cudaMemcpy(). In the CUDA implementation, the host buffer was allocated using cudaMallocHost() to enable the use of non-pageable memory. For OpenCL, we had two variants: one where the host buffer is allocated using malloc(), and the other where the host buffer is allocated using clCreateBuffer() and mapped using clEnqueueMapBuffer(). The performance trends for these approaches are given in Figure 11. Using pinned (non-pageable) memory through OpenCL calls leads to better memory bandwidth than using a buffer allocated with malloc(). This has also been mentioned in the OpenCL Best Practices Guide released by NVIDIA. Figure 12 shows the code snippet for using non-pageable host memory in OpenCL.

Figure 9. Two alternatives for creating and populating a 2D texture in OpenCL.

Figure 10. Performance of creating and populating 2D texture memory in DWT: CUDA vs. OpenCL, by (a) setting the CL_MEM_COPY_HOST_PTR flag in clCreateImage2D() and (b) using clEnqueueWriteImage() for texture population.

Figure 11. Device-to-host memory transfer performance in DWT: CUDA vs. OpenCL, with the OpenCL host buffer allocated using (a) malloc() and (b) clCreateBuffer() and mapped using clEnqueueMapBuffer().

Figure 12. Using non-pageable/pinned host memory in OpenCL.

Figure 13. CUDA vs. OpenCL performance of the DWT kernel. In (a), 2D texture creation and population is done by setting the CL_MEM_COPY_HOST_PTR flag in clCreateImage2D(), while in (b), clEnqueueWriteImage() is used.

An anomaly we notice in both the CUDA and OpenCL memory bandwidths (texture population involves copying data from host to device) is that the device-to-host transfer bandwidth is almost half the host-to-device transfer bandwidth. The bandwidth tests provided in the CUDA and OpenCL SDKs also yielded similar behavior. According to the responses to our queries [10], [11] about this anomaly in the NVIDIA forums, it is probably a motherboard issue.

2) Kernel Performance: In this section, we compare the performance of our OpenCL-based DWT kernel with that of a native CUDA implementation. In the ideal case, using OpenCL should provide portability and heterogeneous computing without compromising the performance achieved. Figure 13 plots the performance comparison between the OpenCL- and CUDA-based DWT kernels. We measure the impact of the different alternatives for texture population (refer to Section V-A1) on the texture access performance in the kernel. Figure 13(a) plots the performance of the OpenCL-based DWT kernel against that of CUDA when texture population is done using the CL_MEM_COPY_HOST_PTR flag in clCreateImage2D(), while Figure 13(b) plots the performance when the texture population is done using the clEnqueueWriteImage() call. In both cases, the texture is accessed within the kernel using read_image calls. Populating the texture using clEnqueueWriteImage() enhances the kernel performance by at least 3 times as compared to using the CL_MEM_COPY_HOST_PTR flag. This indicates that texture access is much faster when the texture is populated using clEnqueueWriteImage(), implying that the 2D spatial locality is probably better. We have communicated this observation on NVIDIA's OpenCL forum to gain insight into the reason behind this behavior.

In spite of the performance enhancement from using the clEnqueueWriteImage() API to populate the texture, the OpenCL kernel performance is up to 72% slower than that of the native CUDA kernel. This performance discrepancy has been raised in the forums, and we hope that these concerns will be addressed in subsequent releases of the OpenCL drivers.

3) Portability Analysis using AMD's drivers: The main intent of the Open Computing Language is to lay the foundation for a programming model for developing platform-independent parallel programs. To evaluate the ease of portability of OpenCL code developed for the GPU platform to multi-core CPUs, we tried to compile our OpenCL-based DWT implementation using the OpenCL drivers provided by AMD in the beta version of their ATI Stream SDK (ATI Stream SDK v2.0 Beta 4). Based on our experiences, we have made a few observations on programming practices to be followed to develop portable and efficient OpenCL code.

Application execution on the GPU requires transfer of data between the host and the device memory. This is typically done by allocating buffers on the device and copying the data from host memory using the clCreateBuffer() API with the CL_MEM_COPY_HOST_PTR flag set. If, for the sake of readability, the same semantics are maintained for the multi-core CPU as well, care should be taken to replace the CL_MEM_COPY_HOST_PTR flag with CL_MEM_USE_HOST_PTR to avoid unnecessary memory copies in host memory.
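A minimal sketch of this flag substitution (a hypothetical host-side fragment; `ctx`, `nbytes` and `host_data` are assumed, and an OpenCL runtime is required):

```
cl_int err;

/* On the GPU: copy the host data into a device-side allocation. */
cl_mem buf_gpu = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                nbytes, host_data, &err);

/* On a multi-core CPU: let the runtime use the host allocation directly,
 * avoiding a redundant copy within host memory. */
cl_mem buf_cpu = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
                                nbytes, host_data, &err);
```

With CL_MEM_USE_HOST_PTR, the application must keep `host_data` valid for the lifetime of the buffer; this is the price of eliminating the extra copy.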

Page 8: [IEEE Distributed Processing, Workshops and Phd Forum (IPDPSW) - Atlanta, GA, USA (2010.04.19-2010.04.23)] 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops

Figure 14. DWT kernel performance on a quad-core using AMD's OpenCL drivers.

Some OpenCL APIs are supported only on specific hardware. For example, APIs for creating, modifying and accessing image objects are supported on GPUs but not on multi-core CPUs. For an OpenCL kernel to be portable, we should either refrain from using device-specific APIs, or the drivers should support default implementations. For example, a call related to texture memory should map to a global memory access when textures are not supported by the device. Yet another option is to branch the kernel code based on the device type. This might, however, come with some performance penalty. In our experiments using AMD's OpenCL drivers, image access qualifiers were not supported, as texture access was not supported in the beta version. So, to make the OpenCL-based DWT portable, we replaced all texture (image) calls with accesses to global memory. However, such a change can impact kernel performance. In a sample kernel with random accesses to texture memory, we noticed a ≈4X increase in kernel execution time when the texture calls were replaced by calls to global memory. This is because of non-coalesced accesses to global memory [4].
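One way to branch the kernel without duplicating it is a build-time switch passed to clBuildProgram() (an illustrative OpenCL C sketch; the macro name IMG_SUPPORT and the trivial copy body are our own, standing in for the actual DWT computation):

```
/* Built with the option "-D IMG_SUPPORT" on devices that support images. */
#ifdef IMG_SUPPORT
__constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;

__kernel void dwt_row(__read_only image2d_t src, __global float *dst, int w)
{
    int2 p = (int2)(get_global_id(0), get_global_id(1));
    dst[p.y * w + p.x] = read_imagef(src, smp, p).x;   /* texture path */
}
#else
__kernel void dwt_row(__global const float *src, __global float *dst, int w)
{
    int x = get_global_id(0), y = get_global_id(1);
    dst[y * w + x] = src[y * w + x];                   /* global-memory path */
}
#endif
```

The host queries the device type (or CL_DEVICE_IMAGE_SUPPORT) and selects the build option accordingly, so a single kernel source serves both platforms.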

Finally, to ensure portability of an OpenCL application, we should refrain from accessing vendor-specific utility libraries. Making these changes to our OpenCL-based DWT code, we were able to successfully execute the application on our Dell Precision T7400 quad-core system using AMD's OpenCL drivers. Figure 14 plots the performance of the DWT kernel using AMD's drivers.

VI. DISCUSSION AND CONCLUSIONS

This paper presents a parallel implementation of the discrete wavelet transform using the recently proposed Open Computing Language. DWT is a powerful, compute-intensive signal processing technique used in the JPEG 2000 image compression codec. Using OpenCL, we leverage the compute capabilities of modern parallel computing systems to accelerate the DWT computation.

OpenCL provides a unified programming model that enables cross-platform parallel programming of heterogeneous systems like multi-core CPUs and GPUs. The main strength of OpenCL is its support for portable heterogeneous parallel programming. Using the recently released OpenCL drivers from NVIDIA and AMD, we explore the potential of this language in boosting the DWT performance and analyze its programmability and portability aspects.

Through our experimental results and analysis, we show that our parallel implementation of DWT is able to deliver real-time performance on NVIDIA GPUs (GTX 285). However, comparisons with native CUDA implementations reveal that there is a kernel performance lag between the two languages. Also, through methodical experiments, we identify how the OpenCL texture and memory APIs should be used to ensure good memory bandwidth. Finally, using AMD's OpenCL drivers, we test the ease of portability of this language. Based on our observations, we outline the programming practices that need to be followed in order to develop portable and efficient OpenCL applications.

OpenCL is a good effort in promoting platform-independent parallel program development. However, its success largely depends on widespread support from industry vendors. To leverage the heterogeneous computing capability of OpenCL, vendors need to cooperate and develop OpenCL drivers that work across different computing platforms.

To the best of our knowledge, this paper is the first work to record practical experiences in developing portable OpenCL applications. We hope that the observations outlined in this paper will help the parallel computing user community better understand this language.

REFERENCES

[1] M. D. Adams and F. Kossentini, "JasPer: A software-based JPEG-2000 codec implementation," in Intl. Conf. on Image Processing (ICIP), 2000.

[2] T. Acharya and A. K. Ray, Image Processing - Principles and Applications. Wiley-Interscience, 2005.

[3] Khronos, "OpenCL - the open standard for parallel programming of heterogeneous systems," http://www.khronos.org/opencl/, 2008.

[4] NVIDIA, "NVIDIA Compute Unified Device Architecture," http://www.nvidia.com/object/cuda.html, 2008.

[5] T.-T. Wong, C.-S. Leung, P.-A. Heng, and J. Wang, "Discrete wavelet transform on consumer-level graphics hardware," IEEE Trans. on Multimedia, vol. 9, no. 3, pp. 668-673, 2007.

[6] V. Simek and R. R. Asn, "GPU acceleration of 2D-DWT image compression in MATLAB with CUDA," in Euro. Symp. on Computer Modeling and Simulation, 2008, pp. 274-277.

[7] J. Franco, G. Bernabe, J. Fernandez, and M. E. Acacio, "A parallel implementation of the 2D wavelet transform using CUDA," in Euromicro Intl. Conf. on Parallel, Distributed and Network-based Processing, 2009, pp. 111-118.

[8] A. Nottingham and B. Irwin, "GPU packet classification using OpenCL: a consideration of viable classification methods," in Annual Research Conf. of the South African Inst. of Computer Scientists and Information Technologists, 2009, pp. 160-169.

[9] D. S. Taubman and M. W. Marcellin, JPEG2000: Image Compression Fundamentals, Standards and Practice. Kluwer Academic Publishers, Boston, 2002.

[10] "Asymmetric pinned memory bandwidth on Dell Precision 7400 with GTX 285 card," http://forums.nvidia.com/index.php?showtopic=150656.

[11] "Memory read and write to device gives different timing," http://forums.nvidia.com/index.php?showtopic=149591.