


Multimed Tools Appl (2014) 70:2347–2368
DOI 10.1007/s11042-013-1440-x

Object oriented framework for real-time image processing on GPU

Nicolas Seiller · Williem · Nitin Singhal · In Kyu Park

Published online: 3 April 2013
© Springer Science+Business Media New York 2013

Abstract General purpose computation on graphics processing unit (GPGPU) provides a significant gain in processing time compared with the CPU. Images are particularly good subjects for massively parallel implementations on the GPU. Thus, the processing time can be improved for computer vision and image/video processing algorithms. However, GPGPU interfaces have a fairly complex integration process, and they evolve very rapidly. In this paper, we present a framework that provides all the desired primitives related to GPGPU-based image processing algorithms, which makes it easy and straightforward for the user to exploit. The proposed framework is object-oriented, and it utilizes design patterns. The user can benefit from all the advantages of object-oriented programming, such as code reusability/extensibility, flexibility, information hiding, and complexity hiding. This makes it possible to rapidly integrate new technologies and functionality as they appear.

Keywords GPGPU · Computer vision · Image/video processing · Object-oriented · Design patterns

N. Seiller
SmartCo, Paris 75008, France
e-mail: [email protected]

N. Singhal
Biomedical Signal Analysis Lab., GE Global Research, Bangalore 560066, India
e-mail: [email protected]

Williem · I. K. Park (B)
School of Information and Communication Engineering, Inha University, Incheon 402-751, Korea
e-mail: [email protected]

Williem
e-mail: [email protected]


    1 Introduction

The development of general purpose computation on graphics processing unit (GPGPU) [8] has created a wealth of opportunities for developers to offload computationally intensive tasks onto the graphics processing unit (GPU), which provides significant speedup compared with the central processing unit (CPU). Among the diverse fields related to GPGPU, image processing and computer vision have gained considerable attention because they usually perform computations over a massive number of pixels or features. Therefore, they can exploit the single instruction multiple data (SIMD) GPU architecture and be effectively implemented on the GPU in many cases. In contrast, a modern multi-core CPU supports MIMD processing, which can execute different instructions at the same time.

When a GPU is programmed using conventional shading languages such as the OpenGL Shading Language (GLSL) [23], an in-depth knowledge of computer graphics concepts is required. This problem was alleviated by the development of recent interfaces such as NVIDIA's Compute Unified Device Architecture (CUDA) [12] and OpenCL [11]. However, these interfaces have a rather complex integration framework and require much more than individual kernel programming. Note that several criteria affecting the performance of image processing algorithms implemented on the GPU were addressed in our previous work [21].

In this paper, we present an object-oriented framework for GPGPU-based image processing using GLSL, CUDA, and OpenCL. We present a class hierarchy based on object-oriented programming (OOP). The design and programming advantages of OOP paradigms strongly enhance the proposed framework in terms of code reusability/extensibility, flexibility, information hiding, and complexity hiding. We incorporate shader (GLSL) and kernel (CUDA, OpenCL) programming into the proposed framework to provide full functionality. The performance is evaluated in terms of the programming effort, execution overheads, and speedup factors compared with a CPU. To the best of our knowledge, the proposed framework is the first object-oriented framework that supports GLSL, CUDA, and OpenCL simultaneously.

This paper is organized as follows. Section 2 provides a survey of previous studies. Section 3 describes the proposed object-oriented framework, and Section 4 explains the framework integration in a GPGPU background. Section 5 describes additional features incorporated in the framework. Section 6 presents our experimental results, and finally, Section 7 concludes the paper.

    2 Previous work

The development of GPGPU interfaces such as GLSL [23], CUDA [12], and OpenCL [11] on the PC platform means that the GPU can be used to process data in a massively parallel way and deal with computationally intensive tasks. An extensive survey of GPGPU was published by Owens et al. [20]. GPU-based libraries for image processing and computer vision were developed in the GpuCV [1] and OpenVIDIA [6] projects. OpenVIDIA provides a framework for video input, display, and GPU processing, as well as implementations of feature detection and tracking, skin tone tracking, and projective panoramas. An active project within OpenVIDIA is the CUDA Vision Workbench (CVWB). GpuCV provides seamless acceleration with familiar OpenCV [3] interfaces. Recently, OpenCV GPU has been released, which utilizes the NVIDIA CUDA Runtime API to develop functions from low-level primitives into high-level algorithms [19]. NVIDIA released an open source image processing library known as NPP [18] that exploits the GPU architecture to accelerate common image processing algorithms. It provides low-level API support for the development of higher-level algorithms. However, all of these image processing libraries are implemented using procedural programming, and they lack the benefits of an object-oriented framework.

MinGPU [2] proposed a general purpose computation library based on an object-oriented framework. However, its class hierarchy is of limited use in general image processing or computer vision. In addition, MinGPU does not support any language other than Cg. Kuck et al. [13] presented an interesting example of how to build a class structure around OpenGL and GLSL using OOP. They wrapped each stage of the rendering pipeline in classes and provided useful facilities for integrating new shader code. However, their work was directed specifically at 3D graphics rendering. Most modern GPGPU technologies require the separation of the CPU code and GPU code into different files. Jansen presented a novel approach where OOP and ad-hoc polymorphism permit the definition of CPU- and GPU-targeted programs in the same files [10]. This also facilitates automatic code optimization by the development system.

One of the first real attempts to seriously apply design patterns and OOP paradigms to image processing was presented by Raspe [22]. However, this was a rather specialized medical imaging application that lacked standard code extensibility procedures.

Hou et al. presented an innovative approach where a new GPU-oriented programming language was implemented [9]. Implicit data flow control and many basic operations such as fork, kill, or barriers were provided. These operations are absent from current GPU processing technologies. The features provided advantages in terms of code reusability, maintenance, and ease of use. McCool et al. proposed a C++-based object-oriented approach to GPU programming in Sh [14], where the shader algebra resembles the composite processor proposed in the present paper. Membarth et al. proposed a GPU code generation framework for dealing with image processing kernels [15]. The framework generates CUDA and OpenCL code from C++ code automatically. However, it is limited to simple image processing kernels such as blur, dilate, erode, Gaussian, and others. In a commercial approach, CAPS released the HMPP workbench, which can parallelize sequential code and distribute it into CPU and GPU programming models [4]. The Portland Group introduced PGI Accelerator Compilers that allow programmers to accelerate x64 applications from distinct operating systems using NVIDIA CUDA [24].

Table 1 summarizes the features of existing frameworks and the proposed object-oriented framework. Note that the proposed framework is superior to the others in terms of its GPU language support, adaptive implementation, reusability, and extensibility. Each of the existing frameworks has its own method by which users implement image processing routines. In NPP, algorithms are implemented by calling built-in functions in the library.

Table 1 Comparison between existing image processing frameworks on GPU

Framework        GpuCV       OpenVIDIA   OpenCV GPU  NPP         MinGPU    Proposed
Strategy         Procedural  Procedural  Procedural  Procedural  Object    Object
                                                                 oriented  oriented
Language         GLSL,       CUDA        CUDA        CUDA        Cg        CUDA, GLSL,
support          CUDA                                                      OpenCL
Adaptive         ×           ×           ×           ×           ×         ◦
implementation
Reusability and  ×           ×           ×           ×           ◦         ◦
extensibility

It is not possible to inherit a new class and modify the algorithms. In terms of programming complexity, NPP is easier to use because it hides the intricacies of CUDA, which is similar to the C language. Algorithm implementations in GpuCV and OpenCV GPU are similar to that in NPP because they simply provide implementations of image processing and computer vision algorithms. Unlike NPP, GpuCV accommodates GLSL and CUDA. NPP and GpuCV are both procedural. A slightly different approach is OpenVIDIA, which is an open source project, so we can use it as a template for adding new and more specific algorithms. In addition, none of these frameworks separates the process from the image, which makes it difficult for the user to understand their shared complexity. By contrast, MinGPU allows the separation of the process and the image. However, it does not provide sufficient algorithms for use, and users need to learn a specific language, i.e., Cg, before they can add algorithms themselves. In contrast, the proposed framework provides separation, algorithms, and code reusability, which makes it easy to use and superior to the other frameworks.

    3 Object oriented framework for image processing

    3.1 Framework overview

The proposed object-oriented structure is based on two central concepts: images and processes. The main motivation is to develop a framework that is extensible in terms of the algorithm integration and implementation strategy. The key concept is to separate the process from the image so that the target algorithm is isolated from the programming context.

The proposed framework structure is shown in Fig. 1. The highest level is the ImageRegister class, which serves as a container for images. This class allows the user to centralize multiple images in a single set. The Image class contains all the information about the image data, such as the pixel color, spatial resolution, and image format. Next in the hierarchy is the IImageProcessor interface, which allows the abstraction of the available processes and receives information from the ImageRegister class to process the target algorithm. Each of the processing algorithms is implemented as a subclass of the IImageProcessor.
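The Image / IImageProcessor / ImageRegister triad described above can be sketched as a minimal C++ stand-in. This is a hypothetical simplification, not the framework's actual code: the real classes carry far more state, and NegativeProcessor here is an illustrative CPU implementation standing in for a GPU-backed one.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Minimal stand-in for the framework's image data model.
struct Image {
    int width = 0, height = 0;
    std::vector<unsigned char> pixels;      // interleaved channel data
};

// Abstraction over processing algorithms (the strategy interface).
struct IImageProcessor {
    virtual ~IImageProcessor() = default;
    virtual void process(const Image& src, Image& dst) = 0;
};

// Illustrative CPU-side algorithm; in the framework this would be a
// GLSL/CUDA/OpenCL-backed subclass.
struct NegativeProcessor : IImageProcessor {
    void process(const Image& src, Image& dst) override {
        dst = src;
        for (auto& p : dst.pixels) p = 255 - p;
    }
};

// Container that holds named images and delegates work to the
// currently selected processor (swappable at run time).
class ImageRegister {
public:
    void addImage(const std::string& name, Image img) { images_[name] = std::move(img); }
    void setProcessor(IImageProcessor* p) { processor_ = p; }
    void process(const std::string& src, const std::string& dst) {
        processor_->process(images_.at(src), images_.at(dst));
    }
    Image& get(const std::string& name) { return images_.at(name); }
private:
    std::map<std::string, Image> images_;
    IImageProcessor* processor_ = nullptr;  // the strategy in use
};
```

Because the processor is a plain member pointer, switching algorithms is a single setProcessor call, which is the decoupling the strategy pattern buys here.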


    Fig. 1 Core structure of the proposed framework in UML (Unified Modeling Language)

In the proposed framework, we employ the strategy design pattern [7] to isolate the algorithms from the rest of the structure. This algorithm decoupling provides significant benefits, e.g., the inheritance of an entire algorithm hierarchy and the construction of categories such as edge detection, blurring, and color conversion. It is also possible to define GPGPU-specific routines that can be reused in different algorithms. Other data management routines, such as image data updating and feature information creation, can also be factorized. The strategy design pattern facilitates the interchange of algorithms that can be used independently. Furthermore, complex algorithms are reified into objects, which can be switched dynamically with no perturbation in the target image set.

    3.2 Class details

    3.2.1 Image class

The Image class accommodates the data model for a 2D image, with the width, height, channel count, data format, and pixel data as its parameter set. The pixel data is implemented using a C++ template that supports multiple data types, such as 8-bit integers, single precision floating point, and higher order integers. The channel count is always hidden from the user. The number of channels is determined from the data format of the image, which needs to be specified during image construction. The Image class provides the following functionalities.

– Image construction from image files, predefined pixel values, and other existing images.

– Getters and setters for all members.
– Saving the image on the file system in different formats.

OOP provides advantages to users who do not know all the internal details of the Image class. For each change performed in the Image class data, all the coherence checks and necessary changes are performed in the methods. This provides the user with a safer means of manipulating images. At the same time, a user who is more comfortable with the framework can access most of the data structure and manipulate it freely to increase efficiency.
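A compact sketch of the templated Image with a format-derived channel count may clarify the "hidden channel count" point. The Format names and the at() accessor are assumptions for illustration; the real class additionally handles file I/O and format detection.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Format { Gray, RGB, RGBA };      // channel count is derived, never user-set

template <typename T>                       // pixel type: 8-bit integer, float, ...
class Image {
public:
    Image(int w, int h, Format f)
        : width_(w), height_(h), format_(f),
          data_(static_cast<std::size_t>(w) * h * channels()) {}

    // Hidden from construction: the count follows from the data format.
    int channels() const {
        switch (format_) {
            case Format::Gray: return 1;
            case Format::RGB:  return 3;
            default:           return 4;    // RGBA, the most generic default
        }
    }
    int width() const  { return width_; }
    int height() const { return height_; }

    // Checked, coherent access path instead of raw buffer manipulation.
    T& at(int x, int y, int c) {
        return data_[(static_cast<std::size_t>(y) * width_ + x) * channels() + c];
    }

private:
    int width_, height_;
    Format format_;
    std::vector<T> data_;
};
```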


    3.2.2 IImageProcessor interface

This interface has the role of the abstraction of the image processing algorithms. It is the base interface for integrating any type of image processing algorithm in the framework and exploiting it. The implementation methods are as follows.

– processAffect(Image): processes an image and overrides the original image data.

– process(src, dst): uses src as the source image for processing and places the resulting data in dst.

– process(src): uses src as the source image for processing, and creates and returns a new image containing the resulting data.

The interface provides the desired decoupling between the process and the data structure that the Image class represents. Therefore, it is possible to add new processes and modify existing processes without disturbing the rest of the framework classes.
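The three entry points above can be sketched so that a subclass only implements process(src, dst) and inherits the other two. This layering is an assumption about how the framework factors the methods; DoubleProcessor is a toy algorithm used purely for illustration.

```cpp
#include <cassert>
#include <vector>

struct Image { std::vector<int> data; };    // minimal stand-in

struct IImageProcessor {
    virtual ~IImageProcessor() = default;

    // The one method a concrete algorithm must provide.
    virtual void process(const Image& src, Image& dst) = 0;

    // In-place variant: the result overrides the original image data.
    void processAffect(Image& img) {
        Image tmp;
        process(img, tmp);
        img = std::move(tmp);
    }

    // Allocating variant: creates and returns a new result image.
    Image process(const Image& src) {
        Image dst;
        process(src, dst);
        return dst;
    }
};

// Hypothetical toy algorithm: doubles every value.
struct DoubleProcessor : IImageProcessor {
    void process(const Image& src, Image& dst) override {
        dst.data.clear();
        for (int v : src.data) dst.data.push_back(2 * v);
    }
};
```

Callers use the interface through a base reference, so all three calls dispatch to the same overridden core method.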

    3.2.3 ImageRegister class

It is common for multiple images to be used jointly during image processing. To meet this need, the ImageRegister class is implemented, which is a container for multiple images. This class implements methods for adding images to an existing ImageRegister object based on a filename or on parameters such as the width, height, data format, and binary data.

The ImageRegister class has a key role in linking the Image class and the IImageProcessor interface. The process or processAffect message is sent initially to the ImageRegister, which delegates the instruction to the IImageProcessor, a member of the ImageRegister. The processor is selected by calling the setProcessor method beforehand. The Image objects that need to be processed are then sent from the ImageRegister to the IImageProcessor object.

A simple example of the use of the basic framework is shown in Algorithm 1 in the Appendix. In this example, we create a register and add an image to it, which is read from the file system.

No data format has been specified, so the framework tries to detect it automatically from the header information. If the format cannot be recognized, it is set to the RGBA format by default, which is the most generic format. Next, the GLSLNegativeProcessor is selected as the image processor for the ImageRegister. Thus, any subsequent process messages will be delegated to it. We use the process method for processing the register's first image and storing the result in the register's second image. We then switch the processor to GLSLSepiaProcessor. We process the register's second image and replace the image data with the newly processed data using processAffect. Finally, we save this image in the file system as a JPEG image.

    3.3 Video processing

The proposed framework provides functionalities for capturing and processing video frames in real time. Figure 2 shows the basic structure, which is similar to the general structure shown in Fig. 1.


Fig. 2 Classes for video processing

The Video class is the root class in the hierarchy, and it uses a single Image as the current frame of the Video. It also has an IImageProcessor for processing an input frame in real time.

The source of the video is made abstract so that the Video class can use different video sources to capture frames independently. A USB camera interface is implemented using the OpenCV library, and an IEEE 1394 camera interface is implemented using a proprietary library. The video source is dynamic and abstract, so other types of video source can be implemented and integrated in the framework. The user has to implement the methods that are used to capture a video frame and set the current frame data. More specifically, the user implements the IVideoSource interface and its getFrame method. Different video sources, such as cameras, video files, and network streams, can be added through this interface.
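The IVideoSource contract can be sketched as follows. The getFrame signature and the ReplaySource class are assumptions for illustration; ReplaySource simply replays a fixed list of frames the way a file- or camera-backed implementation would produce them.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Image { std::vector<unsigned char> pixels; };   // minimal stand-in

// Abstract video source: cameras, video files, and network streams
// all sit behind this interface.
struct IVideoSource {
    virtual ~IVideoSource() = default;
    virtual bool getFrame(Image& frame) = 0;  // fills the current frame; false at end
};

// Hypothetical source that replays pre-recorded frames in order.
class ReplaySource : public IVideoSource {
public:
    explicit ReplaySource(std::vector<Image> frames) : frames_(std::move(frames)) {}
    bool getFrame(Image& frame) override {
        if (next_ >= frames_.size()) return false;     // stream exhausted
        frame = frames_[next_++];
        return true;
    }
private:
    std::vector<Image> frames_;
    std::size_t next_ = 0;
};
```

A Video object would pull frames through the interface in a loop and hand each one to its IImageProcessor, never knowing which concrete source it is reading.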

    4 GPGPU integration

The framework proposed in Fig. 1 is implemented using the strategy design pattern, so it is straightforward to create different categories and processing groups. Thus, it is possible to integrate different GPU interfaces in the same framework, including GLSL, CUDA, and OpenCL. Note that this does not mean that the user needs to implement image processing algorithms to support all the GPU interfaces simultaneously. The user only needs to implement one, depending on the actual GPGPU platform they are using.

    4.1 GLSL integration

The main class in GLSL integration is the GLSLProcessor class. Individual image processing algorithms are implemented as its subclasses. All the GLSL compilation and execution operations are implemented in the abstract class, which reduces the programming burden significantly. GLSL integration is structured as shown in Fig. 3a. The GLSLProcessor class has a GLSLShader object as a member. The GLSL shader program is implemented in the GLSLShader class.


    Fig. 3 Classes for GPGPU integration. a Using GLSL. b Using CUDA. c Using OpenCL

The GLSLProcessor uses the OpenGLGraphicContext object. All OpenGL primitive operations needed to perform GLSL processing are centralized here. These include loading images to the texture memory, rendering, and reading the GPU memory. An example of the implementation of a basic GLSL processor is shown in Algorithm 2 in the Appendix. We use the term basic GLSL processor to refer to algorithms such as negative and sepia effects, and Sobel and Prewitt edge filters, which do not need any external parameters. For these algorithms, it is sufficient to implement the constructor and the path to the file containing the shader code. The setAttributes method in the GLSLProcessor class is used to set the values of the external parameters of the algorithm, if needed. This method does not set any attributes by default. The user adds the necessary parameters as members and sets them by overriding the setAttributes method. Furthermore, Algorithm 3 shows an implementation of an algorithm using external parameters.

Note that the user has to redefine all of the processing routines by overriding the process method if the algorithm works in a manner other than the standard behavior (single pass rendering).
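Since Algorithms 2 and 3 are in the Appendix and not reproduced here, the pattern can be sketched with the GPU plumbing stubbed out: the base class owns the compile/render/read-back flow, a "basic" subclass only supplies the shader path in its constructor, and setAttributes() is overridden when the shader takes external parameters. GainProcessor and its gain uniform are hypothetical; the loop stands in for the actual shader execution.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Image { std::vector<float> pixels; };    // minimal stand-in

class GLSLProcessor {
public:
    explicit GLSLProcessor(std::string shaderPath)
        : shaderPath_(std::move(shaderPath)) {}
    virtual ~GLSLProcessor() = default;

    // Fixed flow: set uniforms, then run the shader (stubbed here).
    void process(const Image& src, Image& dst) {
        setAttributes();                        // hook: uploads parameters, if any
        runShader(src, dst);                    // stub for compile + render + read-back
    }

protected:
    virtual void setAttributes() {}             // no-op by default
    virtual void runShader(const Image& src, Image& dst) = 0;
    std::string shaderPath_;
};

// Parameterized example: a hypothetical gain filter whose factor would be
// passed to the shader as a uniform; here a CPU loop applies it directly.
class GainProcessor : public GLSLProcessor {
public:
    GainProcessor(std::string path, float gain)
        : GLSLProcessor(std::move(path)), gain_(gain) {}
protected:
    void setAttributes() override { uniformGain_ = gain_; }
    void runShader(const Image& src, Image& dst) override {
        dst.pixels.clear();
        for (float p : src.pixels) dst.pixels.push_back(p * uniformGain_);
    }
private:
    float gain_;
    float uniformGain_ = 1.0f;                  // stands in for the shader uniform
};
```

A parameterless processor (negative, sepia, Sobel) would simply skip the setAttributes override, exactly as the text describes.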


    4.2 CUDA integration

The proposed class structure for CUDA integration is shown in Fig. 3b. The highest level is the CUDAProcessor class. Just as the GLSLProcessor class does for GLSL algorithm integration, the CUDAProcessor provides an integration framework for CUDA-based algorithms.

The CUDAProcessor class incorporates the process and processAffect methods. The CUDAProcessor class also implements a cudaCall abstract method. This method provides the entry point for calling the CUDA kernel functions. Appropriate CUDA memory management (i.e., allocation and transfer) should also be implemented in the cudaCall routine. If the user wants to add a new CUDA algorithm, the tasks are as follows.

– Implement the constructor
– Implement the call to the appropriate CUDA function
– Release the texture memory
– Update the output image

The user typically implements an appropriate CUDA kernel before the external C functions. All the kernel code settings are set up in the function from which the kernel is called. An example of the implementation of a CUDA processor is presented in Algorithm 4 in the Appendix. This structure allows the integration of CUDA image processing algorithms with a minimum of overhead code and maximum code reuse. Thus, attention can be focused on the CUDA algorithm without perturbing the integration framework.
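The template-method shape of the CUDAProcessor can be sketched in plain C++ with the device side stubbed: process() fixes the overall flow, while a subclass implements cudaCall(), which in the real framework would allocate device memory, launch the kernel, and copy the result back. The negative filter and the CPU loop standing in for the kernel launch are illustrative assumptions, not Algorithm 4 itself.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Image { std::vector<unsigned char> pixels; };   // minimal stand-in

class CUDAProcessor {
public:
    virtual ~CUDAProcessor() = default;

    // Fixed flow: size the output image, then hand off to the kernel entry point.
    void process(const Image& src, Image& dst) {
        dst.pixels.resize(src.pixels.size());          // update the output image
        cudaCall(src.pixels.data(), dst.pixels.data(), src.pixels.size());
    }

protected:
    // In the real framework: cudaMalloc/cudaMemcpy, kernel launch, copy-back, free.
    virtual void cudaCall(const unsigned char* in, unsigned char* out,
                          std::size_t n) = 0;
};

// Hypothetical negative filter; the loop stands in for the kernel launch.
struct CUDANegativeProcessor : CUDAProcessor {
protected:
    void cudaCall(const unsigned char* in, unsigned char* out,
                  std::size_t n) override {
        for (std::size_t i = 0; i < n; ++i) out[i] = 255 - in[i];
    }
};
```

The point of the structure is that only cudaCall() touches CUDA specifics, so the four user tasks listed above stay confined to one override.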

    4.3 OpenCL integration

Figure 3c shows the structure of OpenCL integration. The OCLProcessor class is at the top of this structure, similar to the GLSLProcessor and CUDAProcessor classes in GLSL and CUDA integration, respectively. The OCLProcessor class provides a base class for all IImageProcessor subclasses. These subclasses implement OpenCL kernel functions.

The OCLProcessor class has an OCLContext object that handles the context and the initialization of the execution queue. The OCLContext object intrinsically has an instance of the OCLDevice class. All OpenCL device objects are wrapped in this class. The OCLGPU is a subclass of OCLDevice, which is specialized for implementing GPU platform-specific routines. This subclassing of the OCLDevice main class allows the extension of the structure and the incorporation of new devices. The OCLContext class also handles the OpenCL memory allocation and transfer operations. The OCLProcessor class incorporates an OCLProgram object that handles the source compilation and kernel execution routines. The overall structure facilitates OpenCL function implementation by a programmer.

The process and processAffect methods are implemented in the OCLProcessor class. Similar to the GLSLProcessor class, the implementation of the constructor is sufficient if the algorithm is simple. The kernel name and the path to the file containing the kernel code are required by the constructor function. The setAttributes method is used to set the attributes of the kernel function (if any). Finally, the processImage method is used to define the execution logic. The execution flow is similar to the GLSL case. The framework allows the programmer to specify custom execution flows by overriding the processImage method.

    4.4 Adaptive implementation scenario

As discussed, the proposed framework incorporates different technologies and allows the addition of new technologies by the user. Several criteria affect the choice of implementation technology.

First, the choice of CPU or GPU depends mainly on the amount of parallelism in a given algorithm. If the parallel fraction of a specific algorithm is high, it is well suited to a GPU implementation. Several criteria that affect the performance of implementations on a GPU are addressed by Park et al. [21].

A choice also has to be made for the GPU programming API even when the GPU has been selected. In the proposed framework, we deal with CUDA, GLSL, and OpenCL. High performance, strong innovation, and user friendliness have made CUDA the de facto standard in GPU parallel programming. However, GLSL is still a widely used standard for the GPU. OpenCL is a new standard API for parallel computing on different platforms. It was designed with the purpose of providing platform-independent parallel processing operations.

It is still hard for users to choose between the different interfaces on GPUs that support multiple programming interfaces. In this study, we extend the object-oriented framework to include adaptive selection of the most suitable algorithm implementation. The AdaptiveProcessor class is the central class for this adaptive option, and it implements the IImageProcessor as a decorator design pattern [7]. As shown in Fig. 4, this design pattern facilitates the dynamic selection of the processor used when the process method is called. If the user sends a process message to the ImageRegister, the AdaptiveProcessor instance calls the adapt method to choose the most suitable processor. It then assigns the process to the selected processor. The criterion used to select the processor can be implemented by subclassing the AdaptiveProcessor and overriding the adapt method.

Fig. 4 Adaptive implementation selection mechanism
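The adapt-then-delegate mechanism can be sketched as follows. Doubler and Copier are toy stand-ins for GPU- and CPU-backed processors, and the size-based criterion is an invented example of what an adapt() override might look like; the framework leaves the actual criterion to the subclass.

```cpp
#include <cassert>
#include <vector>

struct Image { std::vector<int> data; };    // minimal stand-in

struct IImageProcessor {
    virtual ~IImageProcessor() = default;
    virtual void process(const Image& src, Image& dst) = 0;
};

struct Doubler : IImageProcessor {          // hypothetical "GPU" implementation
    void process(const Image& src, Image& dst) override {
        dst.data.clear();
        for (int v : src.data) dst.data.push_back(2 * v);
    }
};

struct Copier : IImageProcessor {           // hypothetical "CPU" fallback
    void process(const Image& src, Image& dst) override { dst = src; }
};

// Decorator: itself an IImageProcessor, it selects a backend in adapt()
// and forwards the process call to it.
class AdaptiveProcessor : public IImageProcessor {
public:
    AdaptiveProcessor(IImageProcessor& gpu, IImageProcessor& cpu)
        : gpu_(gpu), cpu_(cpu) {}
    void process(const Image& src, Image& dst) override {
        adapt(src).process(src, dst);       // choose, then delegate
    }
protected:
    // Toy criterion: large inputs go to the "GPU" path; subclasses override this.
    virtual IImageProcessor& adapt(const Image& src) {
        return src.data.size() >= 4 ? gpu_ : cpu_;
    }
private:
    IImageProcessor& gpu_;
    IImageProcessor& cpu_;
};
```

Because the decorator has the same interface as the processors it wraps, the ImageRegister needs no changes to gain adaptive selection.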


    5 Additional features

A thorough image processing and computer vision library requires more than images and algorithms for applications. Feature detection is a key aspect of computer vision algorithms and needs to be addressed. The proposed framework also provides additional features for building processes from others and checking the processing time.

    5.1 Feature detection

Feature detection is a common method in computer vision where features are extracted to analyze the contents of an image. Thus, we provide a programming interface for feature detection as part of the proposed framework.

Image features are intrinsically very broad concepts that contain various entities. For example, corners or blobs are usually represented by their coordinates. Edges are another type of feature that can be described based on the position, local normal vector, and edge magnitude. Image histograms can also be considered as a global feature, although their characteristics are different from the other features discussed. It is difficult to find any consistency in terms of data representation for different image features. Therefore, we provide a simple and basic interface for modeling image features using the ImageFeatures interface. The ImageFeatures is a superclass used for all feature types. This interface facilitates the integration of custom image features in the proposed framework.

The ImageFeatures objects are generated by IImageProcessor objects that are specialized for this task. These IImageProcessor objects apply the algorithm and add the feature data to the Image.

OOP facilitates the modeling of concepts under a unified data type using inheritance. Thus, we take advantage of OOP for image features and model them under the same ImageFeatures superclass. However, image features are entities that are very different from one another in terms of their data representation. Thus, most of their methods must be declared and implemented in the subclasses.

It is therefore essential to isolate all the subtypes from one another. One approach may be to force the user to determine the order of the features stored in the image object, although this approach may be too intricate for the user. Thus, the FeatureIterator class is designed for this purpose.

It is designed based on the common iterator pattern. The most significant difference is that it filters the ImageFeatures objects to iterate over only those belonging to a specific subclass. We define a getFeatureTypeName method in the interface; it returns a string that identifies the class of the object. Subclasses implement this method, so each class automatically provides a standard discrimination mechanism. The object-oriented paradigm allows us to force each ImageFeatures class to identify itself.
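The self-identification plus filtered iteration described above can be sketched as follows. The Corner and Edge feature subclasses and the next() iteration style are illustrative assumptions; the real FeatureIterator may expose a different traversal API.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Superclass for all feature types; each subclass identifies itself by name
// so that heterogeneous features can share one container and still be filtered.
struct ImageFeatures {
    virtual ~ImageFeatures() = default;
    virtual std::string getFeatureTypeName() const = 0;
};

struct Corner : ImageFeatures {             // point-like feature: coordinates
    int x, y;
    Corner(int x, int y) : x(x), y(y) {}
    std::string getFeatureTypeName() const override { return "Corner"; }
};

struct Edge : ImageFeatures {               // a differently shaped feature type
    std::string getFeatureTypeName() const override { return "Edge"; }
};

// Iterator-pattern sketch that yields only features of one requested type.
class FeatureIterator {
public:
    FeatureIterator(const std::vector<std::unique_ptr<ImageFeatures>>& all,
                    std::string type)
        : all_(all), type_(std::move(type)) {}

    const ImageFeatures* next() {           // nullptr when exhausted
        while (pos_ < all_.size()) {
            const ImageFeatures* f = all_[pos_++].get();
            if (f->getFeatureTypeName() == type_) return f;
        }
        return nullptr;
    }

private:
    const std::vector<std::unique_ptr<ImageFeatures>>& all_;
    std::string type_;
    std::size_t pos_ = 0;
};
```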

    5.2 Composite processor

The proposed framework provides the ability to create cascading algorithms composed of multiple IImageProcessor objects. This operation is handled by the CompositeProcessor class. The user can instantiate a CompositeProcessor object and add all of the desired IImageProcessor objects, before using it as a single IImageProcessor object that performs multiple algorithms. This implementation is based on the composite design pattern [7]. The ability to group different objects that collaborate to solve a problem is a remarkable benefit of OOP compared with procedural programming.
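The cascading behavior can be sketched as follows. AddOne and Triple are toy stand-ins for real filters; the chaining logic, where each child's output feeds the next child, is the essence of the composite described above.

```cpp
#include <cassert>
#include <vector>

struct Image { std::vector<int> data; };    // minimal stand-in

struct IImageProcessor {
    virtual ~IImageProcessor() = default;
    virtual void process(const Image& src, Image& dst) = 0;
};

struct AddOne : IImageProcessor {           // toy stage 1
    void process(const Image& src, Image& dst) override {
        dst.data.clear();
        for (int v : src.data) dst.data.push_back(v + 1);
    }
};

struct Triple : IImageProcessor {           // toy stage 2
    void process(const Image& src, Image& dst) override {
        dst.data.clear();
        for (int v : src.data) dst.data.push_back(v * 3);
    }
};

// Composite: chains child processors while itself remaining usable
// anywhere a single IImageProcessor is expected.
class CompositeProcessor : public IImageProcessor {
public:
    void add(IImageProcessor* p) { children_.push_back(p); }
    void process(const Image& src, Image& dst) override {
        Image current = src;
        for (IImageProcessor* p : children_) {
            Image next;
            p->process(current, next);      // each stage consumes the previous output
            current = std::move(next);
        }
        dst = std::move(current);
    }
private:
    std::vector<IImageProcessor*> children_;
};
```

Because the composite implements the same interface as its children, it can be set on an ImageRegister, or even nested inside another composite, without special handling.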

    5.3 Performance checking and logging

The ability to measure the execution time of algorithms is indispensable in the image processing field. We implement a structure that keeps track of every process performed in an ImageRegister. The PerformanceLogger is added as a member of the ImageRegister class, and it provides the user with methods to handle and control the process log. The following functionalities are provided.

– enableProcessLogging: enables logging on the ImageRegister. After calling this method, all the processes that go through the ImageRegister are entered in the log.

– disableProcessLogging: disables logging on the ImageRegister. After calling this method, the processes that go through the ImageRegister are no longer entered in the log.

    – toggleProcessLogging: toggles the process logging state.
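The three toggles above can be sketched as a single flag guarding a log; the record() method and entries() accessor are assumptions for illustration, standing in for whatever the ImageRegister calls internally when a process runs.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Sketch of the logging toggle: when enabled, every process that goes
// through the register is recorded; disable/toggle flip the same flag.
class PerformanceLogger {
public:
    void enableProcessLogging()  { enabled_ = true; }
    void disableProcessLogging() { enabled_ = false; }
    void toggleProcessLogging()  { enabled_ = !enabled_; }

    // Hypothetical hook the register would call after timing a process.
    void record(const std::string& name, double ms) {
        if (enabled_) log_.push_back({name, ms});
    }
    std::size_t entries() const { return log_.size(); }

private:
    struct Entry { std::string name; double ms; };
    bool enabled_ = false;                  // logging is off until enabled
    std::vector<Entry> log_;
};
```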

    6 Experimental results and discussion

We test two experimental scenarios. In the first case, we select two popular image processing algorithms and evaluate their performance using multi-core CPU programming, GPU procedural programming, and the proposed framework. We use GLSL (version 4.1), CUDA (version 4.0), and OpenCL (version 1.1) to implement the algorithms on the GPU. In the second case, we select two use case scenarios to demonstrate the extensibility of the proposed framework. Our experimental setup uses an Intel CPU (i5 CPU 750 @ 2.67 GHz (quad core) with 4096 MB RAM) and an NVIDIA GPU (GTX 570 with 2.5 GB RAM). The GTX 570 has 15 multiprocessors with 480 processing cores that operate at a peak performance of 1405.4 GFLOPS, while the i5 CPU 750 operates at 42.56 GFLOPS. All the CPU implementations in the experiment run on four cores using OpenMP.

    6.1 Performance evaluation

We select bilateral filtering as a basic case study because the algorithm structure is well suited to massively parallel computation with a reasonable amount of floating point computation. The proposed framework is aimed at image processing and computer vision, so we do not focus on more general purpose algorithms (e.g., parallel reduction or sorting algorithms). Instead, we focus on algorithms that are typically used for image processing. Our goal is also to compare and measure OOP and procedural execution. These measurements are independent of specific algorithms.


The algorithms are implemented on the GPU both with the proposed framework and as a stand-alone procedural application. The OpenCV framework is employed on the CPU. The execution time is calculated as the mean over multiple iterations.
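Averaging over repeated runs can be done with a small harness like the following; this is an illustrative std::chrono sketch, not the framework's PerformanceLogger:

```cpp
#include <chrono>
#include <functional>

// Mean wall-clock time, in milliseconds, of `fn` over `iters` runs.
// Illustrative measurement harness; not part of the OOPGPU framework.
double mean_time_ms(const std::function<void()> &fn, int iters) {
    using clock = std::chrono::steady_clock;
    const auto t0 = clock::now();
    for (int i = 0; i < iters; ++i)
        fn();
    const auto t1 = clock::now();
    const std::chrono::duration<double, std::milli> total = t1 - t0;
    return total.count() / iters;  // amortize timer overhead over iterations
}
```

Using a steady (monotonic) clock avoids distortions from system clock adjustments during long measurement runs.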

Figure 5 shows the execution time in milliseconds at various image resolutions. The GPU implementations outperform the multi-core CPU implementation. However, GLSL is slower in the read-back operation (glReadPixels()) compared with CUDA and OpenCL, so the GLSL implementation does not outperform the others. The OpenCL and CUDA implementations deliver comparable performance owing to their similar thread batch structure.

    6.1.1 OOP overhead analysis

The main advantages of the proposed framework are code reusability/extensibility, information hiding, and data abstraction. However, these advantages add minor overheads to the OOP-based execution scheme. Object-oriented frameworks typically incur such overheads owing to frequent function calls in the class hierarchy and the data hiding mechanism. To evaluate these overheads, we compare the performance of the proposed object-oriented framework with procedural programming in terms of the total execution time.

Figure 5 shows that the proposed framework is slower than the procedural implementation, which is primarily due to the overheads of the object-oriented framework. The flexible, extensible, and reusable code style of object-oriented programming separates the code into objects in an inheritance hierarchy. Each object communicates with others by sending and receiving messages, which incurs additional overheads. In addition, each class member access requires calling set() and get() functions with data validity checking. Furthermore, memory allocation and deallocation are performed multiple times if the same method is called more than once. In most cases, however, the overheads are negligible compared with the speedup achieved over the CPU multi-core implementation. The gap increases slightly with the image size because the overheads grow in proportion to the amount of data access, which is itself proportional to the image size. However, the performance gain over the multi-core CPU grows faster than the overheads. Therefore, the overheads due to OOP in the proposed framework do not have a major impact on performance.
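For illustration, a guarded accessor pair of the kind responsible for the per-access overhead might look like this; the class and member names are hypothetical, not the framework's actual API:

```cpp
#include <stdexcept>

// Hypothetical guarded accessors illustrating the per-access validity
// checking that contributes to the OOP overhead discussed above.
class ImageHeader {
public:
    void setWidth(int w) {
        // Validity check runs on every write, unlike a raw struct field.
        if (w <= 0)
            throw std::invalid_argument("width must be positive");
        width_ = w;
    }
    int getWidth() const { return width_; }

private:
    int width_ = 1;
};
```

Each read or write goes through a function call plus a branch, which is negligible per access but accumulates when access counts scale with the image size.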

Fig. 5 Performance comparison for the bilateral filtering algorithm. a Execution time for different image resolutions. b Comparison at 4000 × 4000 resolution


Performance is known to be important in the image processing field. However, it is difficult to develop and maintain image processing software that is based on a procedural structure. Object-oriented software is also better at avoiding code bugs than a procedural structure. Therefore, it might take longer to develop a procedural framework than an object-oriented one, especially when building a framework at a larger scale. A procedural framework is suitable for small-sized projects that do not require the advantages of the object-oriented approach.

    6.1.2 Code complexity analysis

The proposed framework has clear advantages in terms of complexity hiding compared with existing frameworks such as NPP, GpuCV, and MinGPU. To compare the code complexity required to interface with each framework, we evaluate the logical source lines of code (SLOC) [17] of the interface code. To ensure a fair comparison, the function body is not counted when evaluating the SLOC. The SLOC counts for the Sobel and Canny edge detection algorithms are shown in Table 2. The NPP and GpuCV libraries require more lines of code because they are procedural libraries in which several housekeeping routines must run before invoking functions; for example, they require parameter initialization, memory allocation, and memory transfers between the CPU and the GPU. By contrast, MinGPU and the proposed framework have comparable complexity, which is much lower than that of NPP and GpuCV. Note that both of these frameworks are object-oriented. However, MinGPU does not allow the use of GPU programming languages other than Cg, whereas the proposed framework supports GLSL, CUDA, and OpenCL, which are the mainstream languages for GPGPU.

    6.2 Framework extensibility

To add new functionality, a new class is appended to the framework. This new class inherits from a base class such as CUDAProcessor or GLSLProcessor. The new algorithm is then implemented in the inherited method (e.g., cudaCall in CUDAProcessor) and in any newly added methods. Similarly, new classes have to be appended to support more complex operations such as image segmentation and stereo matching. To demonstrate the extensibility of the proposed object-oriented framework, we integrate two advanced algorithms, i.e., Nevatia and Babu's linear feature extraction [16] and multi-view stereo matching [5], as use case scenarios.
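The extension route just described can be sketched as follows; CUDAProcessor and cudaCall are the paper's names, but the base class shown here is a simplified CPU stand-in (the real one manages GPU memory and kernel launches):

```cpp
#include <cstddef>

// Simplified stand-in for the framework's CUDAProcessor base class.
// In the real framework cudaCall launches a CUDA kernel; here it is a
// plain virtual hook so the extension pattern can be shown end to end.
template <typename T>
class CUDAProcessor {
public:
    virtual ~CUDAProcessor() = default;
    void process(const T *in, T *out, std::size_t n) { cudaCall(in, out, n); }

protected:
    virtual void cudaCall(const T *in, T *out, std::size_t n) = 0;
};

// Adding a new functionality = one new subclass overriding cudaCall.
// The negation algorithm is an illustrative choice.
template <typename T>
class CUDANegateProcessor : public CUDAProcessor<T> {
protected:
    void cudaCall(const T *in, T *out, std::size_t n) override {
        for (std::size_t i = 0; i < n; ++i)
            out[i] = static_cast<T>(255) - in[i];  // CPU stand-in for the kernel
    }
};
```

Callers only ever touch the public process method, so new algorithms drop into the hierarchy without changing any existing client code.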

In particular, we demonstrate how a complex algorithm with a number of parameters can be integrated into the framework and how an entire feature hierarchy can be built and integrated easily. The line detection algorithm requires a single type

Table 2 Logical SLOC comparison based on the C/C++ CodeCount Counting Standard [17]

Algorithm   NPP   GpuCV   MinGPU   Proposed
Sobel        23      22        4          5
Canny        33      18        4          5


of feature, which makes it easy to appreciate our approach. The multi-view stereo matching algorithm is particularly suitable as a more elaborate example because it requires many parameters, such as projection matrices, background information, and an image index.

The proposed framework provides basic image processing routines (e.g., the Sobel operator, Canny detector, line detector, and bilateral filter), but many other routines remain to be implemented. Note that the proposed framework is developed as an open-source project, so that people can download it and add suitable routines for their own projects. The address of the open source project website is http://image.inha.ac.kr/oopgpu/.

    6.2.1 Linear feature extraction

The linear feature extraction algorithm involves Canny edge detection followed by edge thinning, linking, and iterative line fitting procedures. To include linear feature extraction, the proposed framework was appended with the LineFeatures and CUDALineDetector classes. The LineFeatures class contains the extracted features and implements the ImageFeatures interface. The objects belonging to ImageFeatures are generated by the IImageProcessor objects, which apply the algorithm and add feature data to the Image. The overall structure is shown in Fig. 6a. An integer array contains all the line coordinates, and a static member string is added to contain the type name, which is the string returned by the getFeatureTypeName method of the ImageFeatures interface. The getFeatureTypeName method is used by the FeatureIterator to retrieve the desired data structure from the Image. The print method is used to represent the features on an Image. Methods are also implemented to get the data, to add or remove a line, and to clear the contained data.

The CUDAFeatureDetector is the entry point for the feature detection algorithm when using the CUDA programming interface. The CUDALineDetector implements the linear feature extraction algorithm. An abstract generateFeatures method is provided to add the extracted features to the Image. The algorithm flow is as follows.

– Allocate the device memory for the input and output data, and upload the input data to the allocated device memory.

Fig. 6 Class structure for the use case scenarios. a Linear feature extraction. b Multi-view stereo matching


  • 2362 Multimed Tools Appl (2014) 70:2347–2368

Fig. 7 Visual results of the use case scenarios. a Line segment extraction on the Pentagon data (512 × 512). b Multi-view stereo matching on the templeRing data (47 images at 480 × 640)


– Call the main CUDA kernel function that applies the algorithm.
– Rearrange the data downloaded from the device memory to adapt it to the LineFeatures class.
– Create a LineFeatures object.
– Add the newly created object to the processed Image object.
– Release the device (CUDA) memory.
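Host-side, this flow could be sketched as follows; device memory and the CUDA kernel are modeled here by a std::vector and a stub function, and every name except LineFeatures is illustrative:

```cpp
#include <vector>

// Illustrative host-side sketch of the generateFeatures flow.
// Device buffers are modeled by std::vector; the kernel by a stub.
struct LineFeatures {
    std::vector<int> coords;  // x1, y1, x2, y2 per detected line
};

// Stub standing in for the CUDA line-detection kernel launch.
static void detectLinesKernel(const std::vector<unsigned char> &edges,
                              std::vector<int> &raw_out) {
    if (!edges.empty())
        raw_out = {0, 0, 3, 0};  // one dummy horizontal line segment
}

LineFeatures generateFeatures(const std::vector<unsigned char> &edge_image) {
    // 1. "Allocate device memory and upload" -> modeled by host copies.
    std::vector<unsigned char> d_in = edge_image;
    std::vector<int> d_out;
    // 2. Call the main kernel function that applies the algorithm.
    detectLinesKernel(d_in, d_out);
    // 3. Rearrange the downloaded data into the LineFeatures layout.
    LineFeatures lf;
    lf.coords = d_out;
    // 4-6. The object would then be added to the processed Image and the
    //      device memory released; modeled here by returning the features.
    return lf;
}
```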

Figure 7a shows the extracted line segments. The execution time in Table 3 shows that the overheads are small with the object-oriented framework. Note that the efficient CPU algorithm cannot be parallelized directly; thus, a modified algorithm is used for the implementation on the GPU. Consequently, the CPU processing time in Table 3 is within the same order of magnitude as the GPU time.

    6.2.2 Multi-view stereo matching

The multi-view stereo matching algorithm involves sum of absolute differences (SAD) and normalized cross correlation (NCC) calculations. For each pixel position in the input image, we consider 3D points sampled along the line of sight. We project

Table 3 Processing times on different frameworks (in milliseconds)

Computing framework          Line feature extraction   Multi-view stereo matching
Multi-core CPU                         66                      9,909
CUDA in procedural                     10                        127
CUDA in object oriented                12                        149
OpenCL in procedural                   12                        121
OpenCL in object oriented              17                        143


these 3D points onto the neighboring images and store those with the minimum SAD or maximum NCC. This procedure is repeated for the other neighboring images. The 3D points are kept as correct depth points if their total local best match score is above a given threshold.
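Both matching costs are standard measures; a minimal patch-based version (an illustrative sketch, not the paper's GPU kernel) is:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sum of absolute differences over two equally sized patches (lower = better).
double sad(const std::vector<double> &a, const std::vector<double> &b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        s += std::fabs(a[i] - b[i]);
    return s;
}

// Normalized cross correlation (higher = better match, range [-1, 1]).
// Mean subtraction and normalization make it robust to brightness and
// contrast differences between views, unlike SAD.
double ncc(const std::vector<double> &a, const std::vector<double> &b) {
    const std::size_t n = a.size();
    double ma = 0.0, mb = 0.0;
    for (std::size_t i = 0; i < n; ++i) { ma += a[i]; mb += b[i]; }
    ma /= n; mb /= n;
    double num = 0.0, da = 0.0, db = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        num += (a[i] - ma) * (b[i] - mb);
        da  += (a[i] - ma) * (a[i] - ma);
        db  += (b[i] - mb) * (b[i] - mb);
    }
    return num / (std::sqrt(da * db) + 1e-12);  // epsilon guards flat patches
}
```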

We need several image parameters to perform this algorithm, which we implement as ImageFeatures subclasses. The implemented structure is shown in Fig. 6b. The BackgroundFeatures class models the background information used as a mask during the depth calculation. The DepthFeatures class models the output depth map. Both of these classes reuse an Image instance. The MatrixParameter class is implemented to model any type of matrix. We implement ProjectionMatrix and ProjectionPseudoInverse to differentiate the role of each matrix. The IndexFeature class is implemented to discriminate the index of an image in the collection, which is an important parameter during stereo matching.
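A skeletal version of this parameter hierarchy might look like the following; the interface and class names follow the paper, while the bodies are assumptions:

```cpp
#include <string>

// Skeletal ImageFeatures hierarchy for the stereo-matching parameters.
// The interface name and getFeatureTypeName follow the paper; the method
// bodies shown here are illustrative assumptions.
class ImageFeatures {
public:
    virtual ~ImageFeatures() = default;
    // Type name used by the FeatureIterator to locate a feature on an Image.
    virtual std::string getFeatureTypeName() const = 0;
};

class BackgroundFeatures : public ImageFeatures {
public:
    std::string getFeatureTypeName() const override { return "Background"; }
};

class DepthFeatures : public ImageFeatures {
public:
    std::string getFeatureTypeName() const override { return "Depth"; }
};

class IndexFeature : public ImageFeatures {
public:
    explicit IndexFeature(int idx) : idx_(idx) {}
    std::string getFeatureTypeName() const override { return "Index"; }
    int index() const { return idx_; }

private:
    int idx_;  // position of the image in the multi-view collection
};
```

Because every parameter implements the same interface, the stereo matcher can pull what it needs from an Image by type name without knowing the concrete classes.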

The CUDADepthComputer class integrates stereo matching into the framework. The result is shown in Fig. 7b. The processing times with different frameworks are compared in Table 3. The performance differences between the object-oriented and procedural frameworks are negligible compared with the gain over the multi-core CPU framework. The OpenCL implementation is slightly faster than the CUDA implementation for multi-view stereo matching. This is because the CUDA implementation uses more registers and has lower occupancy than the OpenCL implementation. Each kernel has an identical computation flow, but the distinct compilers generate different machine code and register usage: OpenCL uses Clang as its compiler, while CUDA uses nvcc. Note that the optimization performance of the compilers also varies depending on the release version.

    6.3 Limitation

Developers can extend the proposed framework simply by adding new functionality. Although the framework supports multiple GPU programming languages, it does not provide a mechanism to reduce the developers' effort when implementing new routines. Therefore, they have to implement the same algorithm for each GPU programming language (CUDA, OpenCL, GLSL). From the perspective of framework extensibility, this is the limitation of the current framework.

    7 Conclusion

In this paper, we proposed a GPU-based image processing framework using the object-oriented paradigm and design patterns. The developed framework provides increased flexibility and ease of use for GPGPU-based algorithm implementation. The code developed on our framework can be readily reused at many levels and in different contexts. The experimental results showed that the overheads were negligible compared with a procedural implementation when considering the speedup achieved over CPU multi-core implementations.

Acknowledgements This work was supported by the Industrial Strategic Technology Development Program (10041664, The Development of Fusion Processor based on Multi-Shader GPU) funded by the Ministry of Knowledge Economy (MKE, Korea).


    Appendix: Algorithm listings

Algorithm 1 Basic example

ImageRegister iregister;
iregister.addImage("lena.jpg");
iregister.addImage(200, 200, RGB);
iregister.setProcessor(new GLSLNegativeProcessor(gc));
iregister.process(0, 1);
iregister.setProcessor(new GLSLSepiaProcessor(gc));
iregister.processAffect(1);
iregister.saveToImageFile(1, "new.jpg", OOPGPU::jpeg);

Algorithm 2 Basic GLSLProcessor implementation

template <typename T>
class GLSLSepiaProcessor : public OOPGPU::GLSLProcessor<T> {
public:
    GLSLSepiaProcessor(OpenGLGraphicContext *);
};

template <typename T>
GLSLSepiaProcessor<T>::GLSLSepiaProcessor(OpenGLGraphicContext *gc)
    : GLSLProcessor<T>("sepia.frag", gc) {}

Algorithm 3 GLSLProcessor implementation with external parameters

template <typename T>
class GLSLRadialBlurer : public OOPGPU::GLSLProcessor<T> {
public:
    GLSLRadialBlurer(OpenGLGraphicContext *, float = 0.0, float = 0.0, float = 0.0);
protected:
    virtual void setAttributes() const;
private:
    float cx, cy, radius;
};

template <typename T>
GLSLRadialBlurer<T>::GLSLRadialBlurer(OpenGLGraphicContext *gc,
                                      float _cx, float _cy, float _radius)
    : GLSLProcessor<T>("radialblur.frag", gc), cx(_cx), cy(_cy), radius(_radius) {}

template <typename T>
void GLSLRadialBlurer<T>::setAttributes() const {
    setAttribute("cx", cx);
    setAttribute("cy", cy);
    setAttribute("radiusSize", radius);
}

  • Multimed Tools Appl (2014) 70:2347–2368 2365

Algorithm 4 CUDA implementation

template <typename T>
class CUDACannyProcessor : public OOPGPU::CUDAProcessor<T> {
public:
    CUDACannyProcessor(const int _win_size);
protected:
    virtual void cudaCall(const OOPGPU::Image<T> *, OOPGPU::Image<T> *);
private:
    int win_size;
};

extern "C" {
void GPUCanny(unsigned char *dst, const int width, const int height, const int win_size);
void ReleaseNormalizedUcharTexture();
}

template <typename T>
CUDACannyProcessor<T>::CUDACannyProcessor(const int _win_size) : win_size(_win_size) {}

template <typename T>
void CUDACannyProcessor<T>::cudaCall(const OOPGPU::Image<T> *im, OOPGPU::Image<T> *result) {
    int width = im->getWidth();
    int height = im->getHeight();
    int channel_count = im->getChannelCount();
    DataFormat f = im->getDataFormat();
    T *data = new T[width * height * channel_count];
    transferToTextureMemory(im);
    GPUCanny(data, width, height, win_size);
    ReleaseNormalizedUcharTexture();
    result->setDatas(data, width, height, f);
}

References

1. Allusse Y, Horain P, Agarwal A, Saipriyadarshan C (2008) GpuCV: an open-source GPU-accelerated framework for image processing and computer vision. In: Proc. of the 16th ACM International Conference on Multimedia, pp 1089–1092
2. Babenko P, Shah M (2008) MinGPU: a minimum GPU library for computer vision. J Real-Time Image Process 3(4):255–268
3. Bradski G, Kaehler A (2008) Learning OpenCV: computer vision with the OpenCV library. O'Reilly
4. CAPS Entreprise: HMPP Workbench. http://www.caps-entreprise.com/index.php
5. Chang JY, Park H, Park IK, Lee KM, Lee SU (2011) GPU-friendly multi-view stereo reconstruction using surfel representation and graph cuts. Comput Vis Image Underst 115(5):620–634
6. Fung J, Mann S, Aimone C (2005) OpenVIDIA: parallel GPU computer vision. In: Proc. of the 13th Annual ACM International Conference on Multimedia, pp 849–852
7. Gamma E, Helm R, Johnson R, Vlissides J (1994) Design patterns: elements of reusable object-oriented software. Addison-Wesley, Reading
8. General Purpose GPU Programming (GPGPU) Website. http://www.gpgpu.org
9. Hou Q, Zhou K, Guo B (2008) BSGP: bulk-synchronous GPU programming. ACM Trans Graph 27(3):1–12
10. Jansen T (2007) GPU++, an embedded GPU development system for general-purpose computations. Ph.D. Thesis, Technical University Munich
11. Khronos Group: Open Computing Language. http://www.khronos.org/opencl/
12. Kirk D, Hwu W (2010) Programming massively parallel processors: a hands-on approach. Morgan Kaufmann, San Mateo
13. Kuck R, Wesche G (2009) A framework for object-oriented shader design. In: Proc. International Symposium on Advances in Visual Computing, pp 1019–1030
14. McCool M, Toit SD, Popa T, Chan B, Moule K (2004) Shader algebra. ACM Trans Graph 23(3):784–792
15. Membarth R, Lokhmotov A, Teich J (2011) Generating GPU code from a high-level representation for image processing kernels. In: HPPC 2011, p 28
16. Nevatia R, Babu KR (1980) Linear feature extraction and description. Comput Graph Image Process 13(3):257–269
17. Nguyen V, Deeds-Rubin S, Tan T, Boehm B (2007) A SLOC counting standard. In: Proc. International Annual Forum on COCOMO and Systems/Software Cost Modeling, pp 1–16
18. NVIDIA NPP Library. http://www.nvidia.com/object/npp.html
19. OpenCV GPU. http://opencv.willowgarage.com/wiki/OpenCV_GPU
20. Owens JD, Houston M, Luebke D, Green S, Stone JE, Phillips JC (2008) GPU computing. Proc IEEE 96(5):879–899
21. Park IK, Singhal N, Lee MH, Cho S, Kim CW (2011) Design and performance evaluation of image processing algorithms on GPUs. IEEE Trans Parallel Distrib Syst 22(1):91–104
22. Raspe M (2009) GPU-assisted diagnosis and visualization of medical volume data. Ph.D. Thesis, University of Koblenz and Landau
23. Rost R (2006) OpenGL shading language. Addison-Wesley, Reading
24. The Portland Group: PGI accelerator compilers. http://www.pgroup.com/resources/accel.htm

Nicolas Seiller received the B.S. degree in computer science from Université de Haute Alsace (UHA), Mulhouse, France, in 2006 and the M.S. degree in information engineering from Inha University, Korea, in 2010. He is currently a member of the research and development team of SmartCo, where he develops data management software.



Williem received the B.S. degree in computer science from Bina Nusantara University, Indonesia, in 2011. He is currently working toward the M.S. degree at Inha University, Korea. His research interests include GPU computing, computer vision, and computational photography.

Nitin Singhal received the M.S. degree in electrical engineering and computer science from Seoul National University, South Korea, in 2008 and the B.S. degree in electronics and communication engineering from the Indian Institute of Technology, Guwahati, India, in 2006. He is presently working at GE Global Research, Bangalore, India, and is affiliated with the Biomedical Signal Analysis Laboratory. He is an active member of IEEE. His research interests include computer vision, computational photography, GPU computing, and digital rights management.


In Kyu Park received the B.S., M.S., and Ph.D. degrees from Seoul National University (SNU) in 1995, 1997, and 2001, respectively, all in electrical engineering and computer science. From September 2001 to March 2004, he was a member of Technical Staff at Samsung Advanced Institute of Technology (SAIT). Since March 2004, he has been with the School of Information and Communication Engineering, Inha University, where he is an associate professor. From January 2007 to February 2008, he was an exchange scholar at Mitsubishi Electric Research Laboratories (MERL). Dr. Park's research interests include the joint area of computer graphics and vision, including 3D shape reconstruction from multiple views, image-based rendering, computational photography, and GPGPU for image processing and computer vision. He is a member of IEEE and ACM.
