
Comparative analysis of multi-threading on different operating systems

applied on digital image processing

Dulcinéia O. da Penha* João B. T. Corrêa Luís F. W. Góes { [email protected] } { [email protected] } { [email protected] } Luiz E. S. Ramos Christiane V. Pousa Carlos A. P. S. Martins { [email protected] } {[email protected]} { [email protected] }

Informatics Institute / Post-Graduation Program in Electrical Engineering

Computational and Digital Systems Laboratory Pontifical Catholic University of Minas Gerais

Av. Dom José Gaspar 500, 30535-610 Belo Horizonte, MG, Brazil

Telephone/Fax: 55-31-33194305

Topic area: Parallel and Distributed Systems *primary contact person

Comparative analysis of multi-threading on different operating systems

applied on digital image processing

Abstract:

This work presents a comparative analysis of parallel image convolution implementations based on the shared-variable programming model. The implementations use explicit calls to multi-thread support libraries. The comparison was done on the Windows and Linux operating systems and considered both performance and programmability. Performance was analyzed through the execution response times of the implementations. The analysis showed that the sequential implementation ran faster on Windows, while the parallel implementation produced better results on Linux in most tests. All parallel implementations showed significant performance gains over the sequential ones, on both the Windows and Linux O.S. The programmability analysis showed that it is simpler for the programmer to develop pthread-based applications, since this library is more portable than winthread: the former is compatible with most GNU gcc compilers provided with Linux, while the latter varies from one compiler or O.S. version to another.

The objective of this work is to compare and analyze the application of a multiprocessor programming support mechanism (the shared-variable programming model) to image processing operations, using different operating systems. The mechanisms used were the standard multi-thread support libraries winthread and pthread, for the Windows and Linux operating systems respectively. The analysis considered performance and programmability: response-time gains, programming methods, and the simplicity and transparency that those mechanisms provide to the programmer.

Keywords:

parallel programming; shared-variable model; multi-thread; operating system; image

processing; image convolution; programmability; performance analysis

1. Introduction

Nowadays, a number of applications in many areas of knowledge (scientific, commercial, industrial, and so on) demand very short response times. One possible solution to this problem is the use of high performance computing [1]. Digital image processing (DIP) operations are examples of applications that demand a considerable amount of computational resources, mainly because images are usually stored in matrices and the computational cost of manipulating them is high [5]. Nevertheless, DIP operations have a parallel nature [3], because they perform independent actions over independent data (the image pixels, which are the elements that compose the image representation matrix) [6]. Thus, in many cases the use of general-purpose parallel architectures with shared memory shows satisfactory results in terms of performance gain [4]. DIP operations are used in many applications, such as computer vision and the medical and meteorological areas. Some of those operations are image enhancement, restoration, addition, multiplication, matrix operations, filtering and so on [1].

The objective of this work is to compare and analyze the application of a multiprocessor programming support mechanism (the shared-variable programming model) to image processing operations, using different operating systems. The mechanisms used were the standard multi-thread support libraries winthread and pthread, for the Windows and Linux operating systems respectively. The analysis considered performance and programmability: response-time gains, programming methods, and the simplicity and transparency that those mechanisms provide to the programmer. The main products of this work are the convolution implementations.

The DIP operation used was the image convolution. It was chosen because it is one of the most important operations in the image processing area, and it is simple and highly parallel. In that operation, a convolution mask is applied to an input image, generating a convolved (or filtered) output image. The masks may have different coefficients and sizes. Depending on the applied convolution mask coefficients, the convolution results in smoothing, noise elimination/reduction or another output effect [1]. Some typical applications of image convolution are edge detection, image enhancement, image blurring, morphological image processing, feature extraction, template matching, regularization theory, and so on [6] [9].

The effective use of parallel systems is a very difficult task because it involves the design of correct and efficient parallel applications. In those systems, programming transparency is an important issue for the developer of parallel applications, and there are several ways of providing this transparency to the programmer. Explicit calls to multi-thread support libraries are commonly used in operating systems that provide support for multiprocessor systems.

2. The Convolution Operation

A filtering operation in the space domain is called convolution. The term space domain refers to the aggregation of pixels that compose an image; operations in the space domain are procedures applied directly to those pixels [6].

Equation 1 describes the convolution operation. The convolution is carried out for each pixel (P[line][column]) of an NxN-sized image I, with a KxK-sized mask M. The convolution mask is applied to each pixel of the input image, resulting in a convolved (filtered) output image [6]. In this work, tests and comparisons were done using high-pass and low-pass spatial filters.

P[x][y] = \sum_{u=-K/2}^{K/2} \sum_{v=-K/2}^{K/2} I[x+u][y+v] \times M[u][v]

Equation 1. The convolution equation.

The convolution mask that characterizes a high-pass filter is composed of positive coefficients in its center (or next to it) and negative coefficients in the surroundings [6]. The high-pass filtering operation produces a highlighting effect on the edges of the original image. This happens because applying a high-pass mask to a constant area (or one with a small gray-level variation) makes the output zero or near zero [6]. This result significantly reduces the global contrast of the image [1] [8]. Figure 1 (a) presents a 3x3 high-pass mask [1].

Figure 1 (a). 3x3 high-pass convolution mask.

Figure 1 (b). 3x3 low-pass convolution mask.

The low-pass image filtering operation produces an image blurring (smoothing) effect. The smoothing effect is produced by attenuating the high-frequency components of the image: high-frequency components are “filtered out”, while information in the low-frequency range “passes” without attenuation [6]. Figure 1 (b) presents the 3x3 low-pass mask [8].
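For reference, commonly used 3x3 masks of these two kinds (standard textbook choices, which may differ from the exact coefficients shown in Figure 1) are:

M_{hp} = \frac{1}{9}\begin{bmatrix} -1 & -1 & -1 \\ -1 & 8 & -1 \\ -1 & -1 & -1 \end{bmatrix} \qquad M_{lp} = \frac{1}{9}\begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}

Applying M_{hp} to a perfectly constant 3x3 neighborhood with gray level c yields (8c - 8c)/9 = 0, which illustrates the near-zero output on constant areas described above.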

Figure 2. Original image for high-pass convolution.


Figure 3. Image of Figure 2 convolved with a high-pass filter (a),

and its negative image (with edge) (b).

Figure 3 presents the result of the convolution with a high-pass filter on the original image shown in Figure 2. The negative image is shown to validate the convolution operation; the border around the negative image was only introduced to show the actual image dimensions and is not a part of the image. Figure 5 presents the result of the convolution with a low-pass filter on the original image shown in Figure 4.

Figure 4. Original Image for low-pass convolution.

Figure 5. Image of Figure 4 convolved with a low-pass filter.

3. Parallel programming with shared-variables

Shared-memory parallel architectures can use shared variables for communication between application processes. The effective use of parallel systems is a very difficult task because it involves the design of correct and efficient parallel applications. This results in several complex problems, such as process synchronization, data coherence and event ordering. In an ideal parallel system, the user would not be responsible for controlling the parallel execution of processes, and there are ways of exploiting parallelism that provide some transparency to the programmer [1]. Modern operating systems that support multiprocessor systems, such as Unix, Linux and Windows, usually activate parallel execution through the creation of multiple threads that run in parallel. The number of threads is usually independent of the number of physical processors available [7].

A thread (or thread of control) is a sequence of instructions being executed. Each process has one or more threads. The threads of a process share its address space, its code, most of its data, and most of its process descriptor information. The use of threads makes it easier for the programmer to write concurrent applications in a transparent manner. Those applications may run on machines with one or more processors (multiprocessors), taking advantage of the additional processors when they exist [1].

Creating and managing threads is much simpler than creating processes, since there is no need to create and manipulate different address spaces. Nevertheless, multithread programming is still very difficult for the programmer, because the programmer must explicitly manage system features and care about several details, such as thread creation, thread starting, synchronization management and ordering.

There are two main multithread programming libraries: the pthread standard for Unix/Linux platforms, and winthread for Windows platforms. A minimal sketch contrasting thread creation and joining in the two libraries is shown below.
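The following C fragment is an illustrative sketch (not the paper's code; worker bodies are placeholders and error handling is omitted). The pthread path compiles on Linux with gcc -pthread, the winthread path with a Windows compiler:

    /* Creating and joining one thread with each library. */
    #ifdef _WIN32
    #include <windows.h>

    DWORD WINAPI worker(LPVOID arg) {        /* winthread entry point */
        (void)arg;
        return 0;
    }

    int main(void) {
        HANDLE h = CreateThread(NULL, 0, worker, NULL, 0, NULL);
        WaitForSingleObject(h, INFINITE);    /* wait for the thread */
        CloseHandle(h);
        return 0;
    }
    #else
    #include <pthread.h>

    void *worker(void *arg) {                /* pthread entry point */
        (void)arg;
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, worker, NULL);
        pthread_join(t, NULL);               /* wait for the thread */
        return 0;
    }
    #endif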

4. Implementations of the image convolution operation

In order to carry out a convolution, the pixels of a source image are changed by means of an operation involving their neighbor pixels and the coefficients of the convolution mask. All pixels produced by that operation belong to the output (convolved) image (see Equation 1). The edge pixels of the input image were discarded to eliminate edge effects.

The sequential algorithm that carries out the convolution operation [8] was implemented as shown in Figure 6. It is composed of four nested loops. The sequential implementations were made using the Borland C++ Builder 5.0 compiler for Windows NT 4.0, and the GNU gcc compiler for Mandrake 9.0 (kernel version 2.4.19-16mdksmp). The program that implements the sequential algorithm for the Windows platform was called KMT-IPS, and is described in [8].

In the sequential algorithm (Figure 6), the first couple of loops (the outer ones) cover the input image pixels, while the last couple of loops (the inner ones) cover the convolution mask. Each pixel of the output image is calculated as the sum of the input image pixels multiplied by the corresponding convolution mask coefficients, divided by a division factor.

Some adjustments to the computed values are necessary when the result of the convolution falls outside the limits of the grayscale representation (8 bits, i.e. values between 0 and 255 per pixel). If the computed value is negative, the convolved pixel receives the value 0 (zero); if the value exceeds 255, the pixel receives the value 255.

This implementation is very simple. There are three matrices: the first stores the input image; the second stores the filter (convolution mask); and the third is a temporary matrix that stores the output image pixels as they are convolved. When the convolution has been done on all input image pixels, the temporary matrix that stores the convolved pixels is copied back to the first matrix. Thus, the convolved (output) image replaces the original (input) image.

Figure 6. Basic convolution algorithm with four loops
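A minimal C sketch of the four-loop algorithm just described (names, dimensions and the division factor are illustrative assumptions, not the original KMT-IPS source):

    #define N 2048              /* image dimension (assumption)     */
    #define K 3                 /* mask dimension (assumption)      */

    int image[N][N];            /* input image, replaced at the end */
    int temp[N][N];             /* temporary matrix for the output  */
    int mask[K][K];             /* convolution mask                 */
    int div_factor = 1;         /* division factor of the mask      */

    void convolve(void) {
        /* outer loops: image pixels (edge pixels discarded) */
        for (int x = K/2; x < N - K/2; x++)
            for (int y = K/2; y < N - K/2; y++) {
                int sum = 0;
                /* inner loops: convolution mask */
                for (int u = -(K/2); u <= K/2; u++)
                    for (int v = -(K/2); v <= K/2; v++)
                        sum += image[x+u][y+v] * mask[u + K/2][v + K/2];
                sum /= div_factor;
                /* clamp to the 8-bit grayscale range */
                if (sum < 0)   sum = 0;
                if (sum > 255) sum = 255;
                temp[x][y] = sum;
            }
        /* copy the convolved pixels back over the input image */
        for (int x = 0; x < N; x++)
            for (int y = 0; y < N; y++)
                image[x][y] = temp[x][y];
    }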

4.1. Parallel implementations

The image convolution operation is naturally parallel. In a convolution implementation using shared variables, the input image and the convolution mask can be shared between the threads (or processes). The task of convolving an image is divided equally between all threads by distributing the input image lines among them. Thus, each thread is responsible for calculating a number of output image lines, based on the source image lines assigned to it.

Two parallel convolution implementations were developed, both using explicit multi-thread programming. The first implementation is based on the winthread standard and was compiled with Borland C++ Builder 5.0 (libraries) [8], for Windows platforms. The second implementation uses the pthread standard and was compiled with the GNU gcc libraries for Linux platforms.

In order to implement the parallel convolution using both winthread and pthread, it was necessary to analyze the operation and the code portions that would be implemented in parallel, considering the shared-variable programming model. The input image matrix, the temporary matrix and the convolution filter were shared between all threads. On its creation, each thread receives a parameter indicating the line where it should start performing the convolution (its initial line). From this parameter, each thread derives the number of the last line that it must convolve (its final line). Each thread then carries out the convolution operation on the image portion between the initial and final lines assigned to it, as sketched below.
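A hedged pthread sketch of this line-band decomposition follows; the names, image height, thread count and the convolve_lines() kernel (the four-loop algorithm restricted to lines [first, last)) are illustrative assumptions:

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NUM_THREADS 4           /* illustrative thread count */
    #define IMAGE_LINES 2048        /* illustrative image height */

    extern void convolve_lines(int first, int last);

    typedef struct { int first, last; } band_t;

    /* Each worker convolves only the band of lines assigned to it. */
    static void *worker(void *arg) {
        band_t *b = (band_t *)arg;
        convolve_lines(b->first, b->last);
        return NULL;
    }

    int main(void) {
        pthread_t tid[NUM_THREADS];   /* thread vector, also used to join */
        band_t band[NUM_THREADS];
        int step = IMAGE_LINES / NUM_THREADS;

        for (int i = 0; i < NUM_THREADS; i++) {
            band[i].first = i * step;                 /* initial line */
            band[i].last  = (i == NUM_THREADS - 1)    /* final line   */
                          ? IMAGE_LINES : (i + 1) * step;
            if (pthread_create(&tid[i], NULL, worker, &band[i]) != 0) {
                perror("pthread_create");
                return EXIT_FAILURE;
            }
        }
        for (int i = 0; i < NUM_THREADS; i++)
            pthread_join(tid[i], NULL);               /* wait for all bands */
        return EXIT_SUCCESS;
    }

A winthread version would follow the same structure, replacing pthread_create and pthread_join with CreateThread and WaitForMultipleObjects over the same thread vector.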

In both implementations, some difficulties were found. One of them is the fact that the programmer must know the shared-memory programming model. The programmer must also know how to analyze the sequential operation in order to find the parallel portions and to develop a correct application. Moreover, the programmer must explicitly take care of issues such as variable creation and management; thread creation, initialization and synchronization (using a thread vector); critical section access control; and so on.
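As an illustration of explicit critical-section control (an assumption for illustration only; the convolution itself needs none, since each thread writes a disjoint band of lines), a shared counter protected by a pthread mutex would look like this:

    #include <pthread.h>

    static int lines_done = 0;        /* shared progress counter */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    void report_line_done(void) {
        pthread_mutex_lock(&lock);    /* enter critical section */
        lines_done++;
        pthread_mutex_unlock(&lock);  /* leave critical section */
    }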

5. Tests and experimental results

The computer used to execute all tests was an Intel dual Pentium III 933MHz with 1024MB of primary memory. In each test, the image was loaded beforehand and the time of the subsequent computation was then measured. The operating systems used to execute the tests were Windows NT 4.0 and Linux Mandrake 9.0 (kernel version 2.4.19-16mdksmp).

The tests for the Windows environment were executed using Prober, a functional and performance analysis tool [2]. The tests for the Linux platform were executed manually.

The pthread implementation always presented a larger sequential response time (one thread) than the winthread implementation, and the same was observed when a few threads were used. Table 1 shows the response times and speedups of the tests with 2048x2048-sized images, different mask sizes, and 1 to 10 threads on the Linux and Windows operating systems. Graphic 1 shows the Mandrake 9.0 speedups and Graphic 2 shows the Windows NT 4.0 speedups.

Mask size:            3x3                          5x5                          7x7
O.S.:          Mandrake      Win NT 4.0     Mandrake      Win NT 4.0     Mandrake       Win NT 4.0
Threads      resp. t.  sp.  resp. t.  sp.  resp. t.  sp.  resp. t.  sp.  resp. t.  sp.  resp. t.  sp.
 1            7221.00 1.00  4023.70 1.00   7172.00 1.00  6203.10 1.00  11306.00 1.00  8879.60 1.00
 2            4178.00 1.73  2596.80 1.55   4126.00 1.74  3674.90 1.69   6208.00 1.82  5039.20 1.76
 3            2999.00 2.41  2626.70 1.53   3033.00 2.36  3728.20 1.66   4380.00 2.58  5081.00 1.75
 4            2552.00 2.83  2613.80 1.54   2545.00 2.82  3683.00 1.68   3596.00 3.14  5060.90 1.75
 5            2190.00 3.30  2601.70 1.55   2163.00 3.32  3679.70 1.69   2930.00 3.86  5045.30 1.76
 6            1986.00 3.64  2601.40 1.55   1950.00 3.68  3679.80 1.69   2691.00 4.20  5048.70 1.76
 7            1813.00 3.98  2597.20 1.55   1763.00 4.07  3679.60 1.69   2334.00 4.84  5048.20 1.76
 8            1679.00 4.30  2601.40 1.55   1662.00 4.32  3671.90 1.69   2183.00 5.18  5029.80 1.77
 9            1594.00 4.53  2595.30 1.55   1616.00 4.44  3684.30 1.68   2039.00 5.54  5043.90 1.76
10            1526.00 4.73  2597.00 1.55   1509.00 4.75  3660.70 1.69   1928.00 5.86  5028.20 1.77

Table 1. Response times (resp. t.) and speedups (sp.) of the tests with the 2048x2048 image size, different mask sizes, and 1 to 10 threads, on the Linux (Mandrake 9.0) and Windows (NT 4.0) operating systems.
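The speedup values in Table 1 follow the usual definition, where T(n) is the response time with n threads:

S(n) = \frac{T(1)}{T(n)}

For example, for the 3x3 mask on Mandrake, S(2) = 7221.00 / 4178.00 \approx 1.73, matching the table.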

[Graphic 1: line chart of speedup (y-axis, 0.00 to 7.00) versus number of threads (x-axis, 1 to 10), with one curve per mask size: 3x3, 5x5 and 7x7 Mandrake speedups.]

Graphic 1. Speedup on the Mandrake 9.0 operating system, for the 2048x2048-sized image and different mask sizes.

[Graphic 2: line chart of speedup (y-axis, 0.00 to 7.00) versus number of threads (x-axis, 1 to 10), with one curve per mask size: 3x3, 5x5 and 7x7 Windows speedups.]

Graphic 2. Speedup on the Windows NT 4.0 operating system, for the 2048x2048-sized image and different mask sizes.

A comparative analysis was made between the results obtained in this work and the results obtained in [1]. In order to make this comparison valid, we used the same computer (Intel dual Pentium III 933MHz with 1024MB of primary memory) and the same sequential implementation source code (KMT-IPS [8]). The only change was the operating system version: in [1], Windows 2000 was used to perform the tests, while in this work Windows NT 4.0 was utilized. The former provided better response times than the latter. The speedup achieved with Windows 2000 with regard to NT 4.0 was 1.0803 on the sequential execution, using 2048x2048-sized images and 7x7-sized masks. Likewise, the speedup of the parallel execution (2 threads) in Windows 2000 with regard to NT 4.0 was 1.0969, using the same image and mask sizes. Those results can be calculated from Table 3. Windows 2000 also provided better results than NT 4.0 for the remaining parallel execution times.

The same comparison was made for the Linux platforms, between the results obtained in [1] and the ones obtained in this work. Likewise, the computer and the sequential implementation source code were maintained. However, the parallel results in [1] were obtained with a parallel convolution implementation using the OpenMP API (Application Program Interface) [7]. In [1], the Conectiva Linux 7.0 operating system was used, while in this work Linux Mandrake 9.0 was utilized. The latter provided significantly better response times than the former. The speedup of the sequential execution in Mandrake 9.0 with regard to Conectiva Linux 7.0 was 1.6664, using 2048x2048-sized images and 7x7-sized masks. Likewise, the speedup of the parallel execution (with 4 threads) in Mandrake 9.0 with regard to Conectiva 7.0 was 1.4831, with the same image and mask sizes. Those results can be calculated from Table 2. The parallel times obtained with Mandrake 9.0 for the 5x5 and 7x7-sized masks were significantly better than the ones obtained with Conectiva Linux 7.0.

Table 2 and Table 3 show the sequential and parallel response times and speedups (using 2 threads for Windows and 4 threads for Linux) of the tests with the different operating system versions, 2048x2048-sized images and different mask sizes. In all cases, the speedup obtained with the parallel implementations with regard to the sequential ones increased when the image and mask sizes increased.

Image 2048x2048              Mask size
                           3x3        5x5        7x7
Sequential     Conectiva   5113.3    10880      18840
resp. time     Mandrake    7221       7172      11306
Parallel       Conectiva   1733.33    3286.67    5333.33
resp. time     Mandrake    2552       2545       3596
Speedup        Conectiva   2.95       3.3103     3.5325
               Mandrake    2.8295     2.818      3.144

Table 2. Sequential and parallel (4 threads) response times and speedups of the tests on the Linux operating systems, with the 2048x2048 image size and different mask sizes.

Image 2048x2048              Mask size
                           3x3        5x5        7x7
Sequential     Win 2000    3734.33    5739.67    8219
resp. time     Win NT      4023.7     6203.1     8879.6
Parallel       Win 2000    2427       3391       4594
resp. time     Win NT      2596.8     3674.9     5039.2
Speedup        Win 2000    1.54       1.6926     1.79
               Win NT      1.55       1.6879     1.76

Table 3. Sequential and parallel (2 threads) response times and speedups of the tests on the Windows operating systems, with the 2048x2048 image size and different mask sizes.

6. Conclusions

For most of the tests with different mask and image sizes, considering tests with a few threads, the response times obtained with Windows were smaller (better) than those obtained with Linux. Nevertheless, as the number of threads increased, the response times on Linux showed a significant reduction with regard to those obtained on Windows. With more threads, the Linux tests reached shorter times than the Windows tests for all image and mask sizes.

Considering all tests performed in this work, Linux Mandrake 9.0 showed the greatest advantages regarding concurrency, while Windows NT 4.0 showed the worst concurrency results. The reasons for those results are not discussed here; usually, such reasons are related to process and memory management.

For all parallel implementations, both with pthread and winthread, the combination of parallelism and concurrency showed satisfactory results. Unexpectedly, in all cases the response time decreased significantly as the number of threads increased, in spite of the consequent increase in concurrency between the threads. To explain this behavior, further analysis must be done. One probable cause is the existence of operating system processes running concurrently with user processes and thus consuming processing time. Considering that the processor schedules threads instead of processes, the larger the number of threads an application has, the more processing time it receives. In this case, the time wasted in context switching (between the threads) can be insignificant.

The decrease in response times brought by the increase in concurrency was greater on Linux than on Windows. The use of concurrency provided even better speedup with regard to the sequential and purely parallel implementations (one thread per processor). Considering all the analyses and comparisons carried out in this work, it was concluded that, in the case of these tests, the combined use of parallelism and concurrency is better than the use of pure parallelism.

7. Future Works

As future work, we intend to investigate the performance improvements due to the use of many threads; to compare and analyze the results obtained in this work (implementations based on the shared-variable model, using multi-thread libraries) with message-passing based implementations; and to analyze the cooperative use of multi-threading and message-passing in the same implementation.

8. Acknowledgment

We would like to acknowledge ProPPG (Pró-reitoria de Pesquisa e de Pós-Graduação da PUC-Minas), PPGEE (Programa de Pós-Graduação em Engenharia Elétrica), LSDC (Laboratório de Sistemas Digitais e Computacionais), CAPES (Coordenação de Aperfeiçoamento de Pessoal de Nível Superior) and PUC-Minas for supporting our research and for providing the infrastructure for the experiments.

9. References

[1] D.O. Penha, J.B.T. Corrêa, C.A.P.S. Martins, “Análise Comparativa do Uso de Multi-Thread e OpenMP Aplicados a Operações de Convolução de Imagem”, III Workshop de Sistemas Computacionais de Alto Desempenho (WSCAD), 2002.

[2] L.F.W. Góes, L.E.S. Ramos, C.A.P.S. Martins, “Performance Analysis of Parallel Programs

using Prober as a Single Aid Tool”. 14th Symposium on Computer Architecture and High

Performance Computing (SBAC-PAD), 2002.

[3] G.S. Almasi and A. Gottlieb, “Highly Parallel Computing”, 2nd ed., Benjamin/Cummings, 1994.

[4] K. Hwang and Z. Xu, “Scalable Parallel Computing: Technology, Architecture,

Programming”, McGraw-Hill, 1998.

[5] C.A.P.S. Martins, “Subsistema de exibição de imagens digitais com desacoplamento de resolução - SEID-DR”, Doctoral thesis, Universidade de São Paulo, SP, 1998. (in Portuguese)

[6] R.C. Gonzalez and R.E. Woods, “Processamento de Imagens Digitais”, 3rd ed., Nova Iorque, Ed. Edgard Blucher, 2000.

[7] “Introduction to OpenMP”, Advanced Computational Research Laboratory, Faculty of

Computer Science, UNB Fredericton, New Brunswick.

[8] J. B. T. Corrêa, C. A. P. S. Martins, “Performance Optimization on Digital Image Filtering”,

International Conference on Computer Science, Software Engineering, Information Technology,

e-Business, and Applications (CSITeA), 2002.

[9] N.K. Ratha, A.K. Jain, and D.T. Rover, “Convolution on Splash 2”, Michigan State University, IEEE, 1995.