Performance Analysis of Engineering Applications for Complex Processing Architectures
Kevin Sgroi
Masters Project
SUNYIT

Introduction
In the real world, few people use their multi-core workstations exclusively to run parallelized applications all day long.
Typically, a mix of legacy apps (built for a single-core CPU), scripts, and background services are run concurrently with parallel code on the same workstation.
It is possible for single-threaded applications to perform worse on a multi-core processor than on a single-core processor. This is usually due to the lower per-core clock speed of multi-core processors compared to single-core parts (single-threaded apps normally use only one of the cores).
The above bullet items represent a "real-world" use case, which is what the tests presented here are targeting.
Introduction (continued)
A matrix-multiply application has been customized to generate a multi-threaded load for testing.
This matrix-multiply application was executed concurrently with the UNIXBENCH benchmark test suite in this study to model a mixed (threaded/non-threaded) environment.
This work was published and presented at the ASME International Design Engineering Technical Conference (IDETC) in Montreal, Canada, Aug. 15-18, 2010.
Introduction (continued)
Goal of the study: to conduct experiments designed to reveal problem areas that should be considered when
1. Implementing applications that combine parallel and non-parallel operations on modern parallel computing architectures.
2. Running traditional non-parallel applications or operations (designed for a single-CPU architecture) alongside parallel, multi-threaded applications on a multi-core workstation.
Direction of this talk
Describe the test environment: hardware, software, and benchmark coverage.
Describe the installation process and tools used for load generation and benchmarking.
Present results with a detailed explanation for each.
Summary and Conclusion
Test environment: Benchmark
The benchmark suite used in this analysis is called UNIXBENCH.
Originally started at Monash University in 1983 as a simple, synthetic benchmarking application, UnixBench was enhanced and expanded by Byte Magazine® and the original authors [2].
UNIXBENCH covers five types of operations:
1. CPU intensive
2. Inter-process communication with pipes
3. Shell script execution
4. File I/O
5. System call overhead
Test environment: Components
The test environment consists of:
One multi-core workstation
Matrix multiply load application
UNIXBENCH benchmark test suite
OProfile CPU & kernel profiling tool
Various support scripts
Test environment: Hardware
System: rumbutan, GNU/Linux
OS: GNU/Linux 2.6.31.2
Machine: i686 (i386)
Language: en_US.utf8 (charmap="UTF-8", collate="UTF-8")
Memory: 2 GB, 8 MB L2 cache
CPUs*:
0: Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz (4794.9 bogomips), Hyper-Threading, x86-64, MMX, Physical Address Ext, SYSENTER/SYSEXIT, Intel virtualization
1: Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz (4795.2 bogomips), Hyper-Threading, x86-64, MMX, Physical Address Ext, SYSENTER/SYSEXIT, Intel virtualization
*Both CPU cores were enabled and utilized for the tests performed for this paper.
OS and Installed Applications
Fedora 9 with kernel upgraded to 2.6.31.2
Compiled as a statically linked kernel image (vmlinux) which, when made bootable, allows access to the kernel symbol table.
PThreads (POSIX threads libraries)
OProfile, a system-wide profiler for Linux
No applications other than the OS and the GNOME desktop were active during testing.
A reboot was performed between each test run to ensure a consistent environment.
Test Scripts and Benchmark Tools
Load generator
matrixmult.c: Multiplies two randomized matrices continuously in a loop.
The size of the matrix is configurable via a command-line argument. Uses pthreads for row/column multiplication.
Benchmark tests
UnixBench benchmark test package
Measures system throughput & performance.
Made up of several individual tests which target specific areas, such as CPU and I/O throughput, process creation and communication, system call overhead, and context switching.
Test harness
loopit.sh:
Launches n instances of matrixmult as background tasks.
Continuously monitors the running instances of matrixmult and, if fewer than n remain (i.e., if one dies for some reason), starts a new instance.
startbm.sh:
1. Resets OProfile logs and runs the opcontrol command.
2. Launches loopit.sh (described above).
3. Runs the UNIXBENCH test suite concurrently with matrixmult.
4. After UNIXBENCH completes, runs endtest.sh, stops OProfile, and dumps the collected data to a log file.
endtest.sh: a shell script that ends any running test components and scripts.
Test Name / Purpose / Operations

Dhrystone: Measures and compares computer performance. Focuses on string handling; no floating point operations.
Whetstone: Measures efficiency and speed of floating point operations. Various C functions such as sin, cos, sqrt, exp, and log; integer and floating point ops; array accesses, procedure calls, and conditional branches are also done.
Execl Throughput [1]: Measures the number of execl calls per second. "Replaces the current process image with a new process image" [4].
File Copy: Measures file-to-file data transfer rate at various buffer sizes. File read, write, and copy.
Pipe Throughput: Measures the speed at which (times per second) a process can write 512 bytes to a pipe and read them back. Communication between processes using pipes.
Pipe-based Context Switching: Measures the rate at which an increasing integer can be exchanged between two processes through a pipe. Carries out a bi-directional pipe conversation between a parent and the child process that it spawns.
Process Creation: Measures the rate at which a process can "fork and reap a child that immediately exits" [4]. Process creation is applicable to memory bandwidth since it involves allocation of memory and control blocks.
Shell Scripts: Rate (per minute) at which a set of 1, 2, 4, and 8 concurrent copies of a shell script can be started and reaped by a process. "Applies a series of transformation to a data file" [4].
System Call Overhead: Estimates system call overhead, which consists of entering and leaving the operating system kernel. Repeated calls to getpid.
Graphical Tests: 2D and 3D graphics performance. Consists of the "ubgears" program; dependent on hardware and drivers [4].

[1] Part of the execl family of functions.
Test Description
Seven sets of tests were performed. The first test is a baseline with UNIXBENCH running alone. In each of the remaining six tests, the given number of matrix multiply operations were performed while simultaneously executing the UNIXBENCH test suite:
UNIXBENCH alone (baseline)
UNIXBENCH with 250 instances of matrixmult with a matrix size of 36x36
UNIXBENCH with 250 instances of matrixmult with a matrix size of 72x72
UNIXBENCH with 500 instances of matrixmult with a matrix size of 36x36
UNIXBENCH with 500 instances of matrixmult with a matrix size of 72x72
UNIXBENCH with 1000 instances of matrixmult with a matrix size of 36x36
UNIXBENCH with 1000 instances of matrixmult with a matrix size of 72x72
Test Results
Results were gathered in two parts. First, opreport was executed to generate a report of kernel activity collected during the OProfile session.
Second, UNIXBENCH generates its own set of results, consisting of runtimes for each of the tests described in table 2.
Kernel samples are presented in upcoming slides but are not used in the final test analysis. UnixBench runtimes are presented in graph format.
OProfile Results
OProfile was chosen because of its low overhead and ability to profile all code, including applications, shared libraries, kernel modules, hardware and software interrupt handlers, as well as the kernel itself [1].
OProfile queries hardware registers to provide useful machine statistics, like cache misses, page faults, number of hardware interrupts received, etc.
More detailed information on OProfile can be found at: http://oprofile.sourceforge.net/about/
An OProfile setup command must be issued prior to testing to enable OProfile and configure the sample rate. The OProfile commands used in this project are described in detail in the writeup.
OProfile supports the collection of samples on many of the Core 2 CPU's performance counter event types [5]. OProfile doesn't provide low-level access to the Core 2 hardware, so it uses synthesized events instead [5]. A complete list can be found at: http://oprofile.sourceforge.net/docs/intel-core2-events.php
OProfile Analysis

samples | % (during cache miss samples) | samples | % (during cache request samples) | image name | symbol name
7966 | 85.1523 | 25478 | 48.9406 | vmlinux-2.6.31.2 | find_vma
560 | 5.9861 | 12684 | 24.3647 | vmlinux-2.6.31.2 | arch_get_unmapped_area
63 | 0.6734 | 1149 | 2.2071 | vmlinux-2.6.31.2 | arch_get_unmapped_area_topdown
Interpreting the OProfile Report
The OProfile results are presented here as a sample and were not used in the final analysis, which considers only the UNIXBENCH test results and matrix multiply load.
As an example of how to interpret the results, the symbol with the highest sample percentage, find_vma, was called in about 85% of the total L2 cache miss samples taken and in 48% of the L2 cache request samples taken. These sample distributions did not vary much across all test runs.
Brief description of top three symbols:
find_vma: finds the closest region to a given address
arch_get_unmapped_area: searches the process address space to find an available linear address space [3]
arch_get_unmapped_area_topdown: implements arch_get_unmapped_area
UNIXBENCH Results
One graph was generated for each test type covered in the suite, including the baseline. Only selected graphs were used in the project presentation and write-up. Each graph compares each of seven separate test runs side-by-side.
They are labeled in the key as follows:
250 - 36 (read as "250 instances of matrixmult using a 36x36 matrix")
250 - 72
500 - 36
500 - 72
1000 - 36
1000 - 72
Baseline
The baseline is UNIXBENCH executed by itself, with no matrix multiply load present. The baseline is important for comparison purposes as it represents the performance of the tests under normal conditions.
UNIXBENCH Results (Cont.)
The following key applies to the graphs contained in upcoming slides:
X-axis: name of UNIXBENCH test component
Y-axis units are defined as follows:
MWIPS: millions of Whetstone instructions per second
KBps: kilobytes per second
lps: loops per second
lpm: loops per minute
UNIXBENCH Graphs: Double-Precision Whetstone
[Graph: MWIPS (0 to 2500) for each of the seven runs (250-36 through 1000-72, plus baseline); roughly 10% degradation from baseline under load]
Array processing operations in Whetstone compete directly with matrix multiply for processor and cache access.
UNIXBENCH Graphs: Pipe Throughput
[Graph: lps (0 to 600000) for each of the seven runs, plus baseline; roughly 17% change from baseline]
An anomaly occurred here. Likely cause: a communication bottleneck at 1000 matrix multiplies caused many threads to become idle while waiting on a response from other threads. This freed up resources, which were snatched up by the Pipe Throughput test.
UNIXBENCH Graphs: Pipe-based Context Switching
[Graph: lps (0 to 200000) for each of the seven runs, plus baseline; roughly 17% change from baseline]
Little degradation, as expected, since the matrix multiply load runs mainly in user space and therefore performs little context switching (and therefore competes little with pipe-based context switching). A similar anomaly occurred here as in the pipe throughput test results.
UNIXBENCH Graphs: File Copy 1024 bufsize 2000 maxblocks
[Graph: KBps (0 to 350000) for each of the seven runs, plus baseline; roughly 40% degradation from baseline]
UNIXBENCH Graphs: File Copy 256 bufsize 500 maxblocks
[Graph: KBps (0 to 100000) for each of the seven runs, plus baseline; roughly 40% degradation from baseline]
Both the 1024 and 256 buffer size tests are affected by the throughput of the I/O subsystem and the disk itself, so contention for resources most likely explains the drop in performance.
UNIXBENCH Graphs: System Call Overhead
[Graph: lps (0 to 1200000) for each of the seven runs, plus baseline; roughly 40% degradation from baseline]
This test uses considerable CPU time in single-threaded kernel code. It makes repeated calls to getpid (to return a process ID), which takes longer as the list of active matrixmult processes grows.
UNIXBENCH Graphs: Shell Scripts (1 concurrent)
[Graph: lpm (0 to 2000) for each of the seven runs, plus baseline; roughly 56% degradation from baseline]
UNIXBENCH Graphs: Shell Scripts (8 concurrent)
[Graph: lpm (0 to 640) for each of the seven runs, plus baseline; roughly 45% degradation from baseline]
The 8 (concurrent) shell script test performed better than the 1 shell script test. Why? The 8 (concurrent) shell script test runs in parallel and therefore uses both cores.
Summary
Combining the UNIXBENCH set of tests with a purely multi-threaded load provided insight across a broad array of operations.
System Call Overhead, File Copy, and Shell Script tests were most influenced by the matrix multiply load.
Pipe-based communication (directly between processes) tended to scale well as the load increased.
Interrupt handling is expensive and doesn't respond well to the increased load because it can delay, and be delayed by, a CPU that is busy handling other processes and threads.
Summary (Cont.)
CPU intensive mathematical and floating point operations scale well (little context switching since they run mostly in user space)
Most communication (with whetstone-like operations) stays on the CPU(s)/cores or occurs between threads and therefore incurs lower kernel overhead
Engineers, developers and others should be mindful of these results when purchasing, building or even just running applications on modern parallel computing architectures.
The manner in which multi-threaded operations are combined with traditional non-threaded operations (such as the UNIXBENCH examples used here) can impact performance and throughput.
Future Work
Running the tests on a quad-core or greater processor.
Modifying the matrix multiply code to control which core(s) the threads will execute on, a practice sometimes referred to as processor affinity.
Verifying that anomalies in the test results are not influenced by the state or size of the cache. One way to do this is to modify the workload of matrixmult.c so that the work of each thread fits within the L2 cache, or just clear the cache upon each use.
Acknowledgements
Special thanks to the UNIX lab administrators at SUNYIT for their help in acquiring equipment and providing remote lab access.

References
[1] "About (OProfile)." OProfile. 2009. Sourceforge.net. 15 Jan. 2010. http://oprofile.sourceforge.net/about/
[2] "byte-unixbench, A Unix benchmark suite." Project Hosting on Google Code. 2010. Google. 15 Jan. 2010. http://code.google.com/p/byte-unixbench/
[3] "Re: [PATCH v5] Unified trace buffer." Linux Kernel Archive. 2008. Indiana University. 9 Mar. 2010. http://lkml.indiana.edu/hypermail/linux/kernel/0809.3/1326.html
[4] "byte-unixbench, A Unix benchmark suite." Project Hosting on Google Code. 2010. Google. 15 Jan. 2010. http://code.google.com/p/byte-unixbench/
[5] "Intel Core 2 events." OProfile. 2009. Sourceforge.net. 24 Nov. 2009. http://oprofile.sourceforge.net/docs/intel-core2-events.php