Performance Analysis of Engineering Applications for Complex Processing Architectures
Kevin Sgroi
Masters Project
SUNYIT

Introduction
In the real world, few people use their multi-core workstations exclusively to run parallelized applications all day long.
Typically, a mix of legacy apps (built for a single-core CPU), scripts, and background services are run concurrently with parallel code on the same workstation.
It is possible for single-threaded applications to perform worse on a multi-core processor than on a single-core processor. This is usually due to the lower per-core clock speed of multi-core processors compared to single-core parts (single-threaded apps normally use only one of the cores).
The above bullet items represent a "real-world" use case, which is what the tests presented here are targeting.
Introduction (continued)
A matrix-multiply application has been customized to generate a multi-threaded load for testing.
This matrix-multiply application was executed concurrently with the UNIXBENCH benchmark test suite in this study to model a mixed (threaded/non-threaded) environment.
This work was published and presented at the ASME International Design Engineering Technical Conference (IDETC) in Montreal, Canada, Aug. 15-18, 2010.
Introduction (continued)
Goal of the study: to conduct experiments designed to reveal problem areas that should be considered when
1. Implementing applications that combine parallel and non-parallel operations on modern parallel computing architectures.
2. Running traditional non-parallel applications or operations (designed for a single-CPU architecture) alongside parallel, multi-threaded applications on a multi-core workstation.
Direction of this talk
Describe the test environment: hardware, software, and benchmark coverage.
Describe the installation process and tools used for load generation and benchmarking.
Present results with a detailed explanation for each.
Summary and Conclusion
Test environment: Benchmark
The benchmark suite used in this analysis is called UNIXBENCH.
Originally started at Monash University in 1983 as a simple, synthetic benchmarking application, UnixBench was enhanced and expanded by Byte Magazine® and the original authors [2].
UNIXBENCH covers five types of operations:
1. CPU intensive
2. Inter-process communication with pipes
3. Shell script execution
4. File I/O
5. System call overhead
Test environment: Components
The test environment consists of:
One multi-core workstation
Matrix multiply load application
UNIXBENCH benchmark test suite
OProfile CPU & kernel profiling tool
Various support scripts
Test environment: Hardware
System: rumbutan, GNU/Linux
OS: GNU/Linux 2.6.31.2
Machine: i686 (i386)
Language: en_US.utf8 (charmap="UTF-8", collate="UTF-8")
Memory: 2 GB, 8 MB L2 cache
CPUs*:
0: Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz (4794.9 bogomips), Hyper-Threading, x86-64, MMX, Physical Address Ext, SYSENTER/SYSEXIT, Intel virtualization
1: Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz (4795.2 bogomips), Hyper-Threading, x86-64, MMX, Physical Address Ext, SYSENTER/SYSEXIT, Intel virtualization
*Both CPU cores were enabled and utilized for the tests performed for this paper.
OS and Installed Applications
Fedora 9 with kernel upgraded to 2.6.31.2
Compiled as a statically linked kernel image (vmlinux) which, when made bootable, allows access to the kernel symbol table.
PThreads (POSIX threads libraries)
OProfile, a system-wide profiler for Linux
No applications other than the OS and the GNOME desktop were active during testing.
A reboot was performed between each test run to ensure a consistent environment.
Test Scripts and Benchmark Tools
Load generator
matrixmult.c: Multiplies two randomized matrices continuously in a loop.
The size of the matrix is configurable via a command-line argument. Uses pthreads for row/column multiplication.
Benchmark tests
UnixBench benchmark test package
Measures system throughput & performance.
Made up of several individual tests which target specific areas, such as CPU and I/O throughput, process creation and communication, system call overhead, and context switching.
Test harness
loopit.sh:
Launches n instances of matrixmult as background tasks.
Continuously monitors the running instances of matrixmult and, if fewer than n remain (i.e., if one dies for some reason), starts a new instance.
startbm.sh:
1. Resets OProfile logs and runs the opcontrol command.
2. Launches loopit.sh (described above).
3. Runs the UNIXBENCH test suite concurrently with matrixmult.
4. After UNIXBENCH completes, runs endtest.sh, stops OProfile, and dumps the collected data to a log file.
endtest.sh: a shell script that ends any running test components and scripts.
Test Name / Purpose / Operations

Dhrystone: Measures and compares computer performance. Focuses on string handling; no floating point operations.
Whetstone: Measures efficiency and speed of floating point operations. Various C functions such as sin, cos, sqrt, exp, and log; integer and floating point ops; array accesses, procedure calls, and conditional branches are also done.
Execl Throughput [1]: Measures the number of execl calls per second. "Replaces the current process image with a new process image" [4].
File Copy: Measures file-to-file data transfer rate at various buffer sizes. File read, write, and copy.
Pipe Throughput: Measures the speed at which (times per second) a process can write 512 bytes to a pipe and read them back. Communication between processes using pipes.
Pipe-based Context Switching: Measures the rate at which an increasing integer can be exchanged between two processes through a pipe. Carries out a bi-directional pipe conversation between a parent and the child process that it spawns.
Process Creation: Measures the rate at which a process can "fork and reap a child that immediately exits" [4]. Process creation is applicable to memory bandwidth since it involves allocation of memory and control blocks.
Shell Scripts: Rate (per minute) at which a set of 1, 2, 4, and 8 concurrent copies of a shell script can be started and reaped by a process. "Applies a series of transformation to a data file" [4].
System Call Overhead: Estimates system call overhead, which consists of entering and leaving the operating system kernel. Repeated calls to getpid.
Graphical Tests: 2D and 3D graphics performance. Consists of the "ubgears" program; dependent on hardware and drivers [4].

[1] Part of the execl family of functions.
Test Description
Seven sets of tests were performed. The first test is a baseline with UNIXBENCH running alone. In each of the remaining six tests, the given number of matrix multiply operations were performed while simultaneously executing the UNIXBENCH test suite:
UNIXBENCH alone (baseline)
UNIXBENCH with 250 instances of matrixmult with a matrix size of 36x36
UNIXBENCH with 250 instances of matrixmult with a matrix size of 72x72
UNIXBENCH with 500 instances of matrixmult with a matrix size of 36x36
UNIXBENCH with 500 instances of matrixmult with a matrix size of 72x72
UNIXBENCH with 1000 instances of matrixmult with a matrix size of 36x36
UNIXBENCH with 1000 instances of matrixmult with a matrix size of 72x72
Test Results
Results were gathered in two parts. First, opreport was executed to generate a report of kernel activity collected during the OProfile session.
Second, UNIXBENCH generates its own set of results, consisting of runtimes for each of the tests described in table 2.
Kernel samples are presented in upcoming slides but are not used in the final test analysis. UnixBench runtimes are presented in graph format.
OProfile Results
OProfile was chosen because of its low overhead and ability to profile all code, including applications, shared libraries, kernel modules, hardware and software interrupt handlers, as well as the kernel itself [1].
OProfile queries hardware registers to provide useful machine statistics, like cache misses, page faults, number of hardware interrupts received, etc.
More detailed information on OProfile can be found at: http://oprofile.sourceforge.net/about/
An OProfile setup command must be issued prior to testing to enable OProfile and configure the sample rate. The OProfile commands used in this project are described in detail in the writeup.
OProfile supports the collection of samples on many of the Core 2 CPU's performance counter event types [5]. OProfile doesn't provide low-level access to the Core 2 hardware, so it uses synthesized events instead [5]. A complete list can be found at: http://oprofile.sourceforge.net/docs/intel-core2-events.php
OProfile Analysis

samples | % (during cache miss samples) | samples | % (during cache request samples) | image name | symbol name
7966 | 85.1523 | 25478 | 48.9406 | vmlinux-2.6.31.2 | find_vma
560 | 5.9861 | 12684 | 24.3647 | vmlinux-2.6.31.2 | arch_get_unmapped_area
63 | 0.6734 | 1149 | 2.2071 | vmlinux-2.6.31.2 | arch_get_unmapped_area_topdown
Interpreting the OProfile Report
The OProfile results are presented here as a sample and were not used in the final analysis, which considers only the UNIXBENCH test results and matrix multiply load.
As an example of how to interpret the results, the symbol with the highest sample percentage, find_vma, was called in about 85% of the total L2 cache miss samples taken and in 48% of the L2 cache request samples taken. These sample distributions did not vary much across all test runs.
Brief description of top three symbols:
find_vma: finds the closest region to a given address
arch_get_unmapped_area: searches the process address space to find an available linear address space [3]
arch_get_unmapped_area_topdown: implements arch_get_unmapped_area
UNIXBENCH Results
One graph was generated for each test type covered in the suite, including the baseline. Only selected graphs were used in the project presentation and write-up. Each graph compares each of seven separate test runs side-by-side.
They are labeled in the key as follows:
250 - 36 (read as "250 instances of matrixmult using a 36x36 matrix")
250 - 72
500 - 36
500 - 72
1000 - 36
1000 - 72
Baseline
The baseline is UNIXBENCH executed by itself, with no matrix multiply load present. The baseline is important for comparison purposes as it represents the performance of the tests under normal conditions.
UNIXBENCH Results (Cont.)
The following key applies to the graphs contained in upcoming slides:
X-axis: name of UNIXBENCH test component
Y-axis units are defined as follows:
MWIPS: millions of Whetstone instructions per second
KBps: kilobytes per second
lps: loops per second
lpm: loops per minute
UNIXBENCH Graphs: Double-Precision Whetstone
[Graph: MWIPS (0 to 2500) for each of the seven runs (250-36 through 1000-72, plus baseline); roughly 10% degradation from baseline under load]
Array processing operations in Whetstone compete directly with matrix multiply for processor and cache access.
UNIXBENCH Graphs: Pipe Throughput
[Graph: lps (0 to 600000) for each of the seven runs, plus baseline; roughly 17% change from baseline]
An anomaly occurred here. Likely cause: a communication bottleneck at 1000 matrix multiplies caused many threads to become idle while waiting on a response from other threads. This freed up resources, which were snatched up by the Pipe Throughput test.
UNIXBENCH Graphs: Pipe-based Context Switching
[Graph: lps (0 to 200000) for each of the seven runs, plus baseline; roughly 17% change from baseline]
Little degradation, as expected, since the matrix multiply load runs mainly in user space and therefore performs little context switching (and therefore competes little with pipe-based context switching). A similar anomaly occurred here as in the pipe throughput test results.
UNIXBENCH Graphs: File Copy 1024 bufsize 2000 maxblocks
[Graph: KBps (0 to 350000) for each of the seven runs, plus baseline; roughly 40% degradation from baseline]
UNIXBENCH Graphs: File Copy 256 bufsize 500 maxblocks
[Graph: KBps (0 to 100000) for each of the seven runs, plus baseline; roughly 40% degradation from baseline]
Both the 1024 and 256 buffer size tests are affected by the throughput of the I/O subsystem and the disk itself, so contention for resources most likely explains the drop in performance.
UNIXBENCH Graphs: System Call Overhead
[Graph: lps (0 to 1200000) for each of the seven runs, plus baseline; roughly 40% degradation from baseline]
This test uses considerable CPU time in single-threaded kernel code. It makes repeated calls to getpid (to return a process ID), which takes longer as the list of active matrixmult processes grows.
UNIXBENCH Graphs: Shell Scripts (1 concurrent)
[Graph: lpm (0 to 2000) for each of the seven runs, plus baseline; roughly 56% degradation from baseline]
UNIXBENCH Graphs: Shell Scripts (8 concurrent)
[Graph: lpm (0 to 640) for each of the seven runs, plus baseline; roughly 45% degradation from baseline]
The 8 (concurrent) shell script test performed better than the 1 shell script test. Why? The 8 (concurrent) shell script test runs in parallel and therefore uses both cores.
Summary
Combining the UNIXBENCH set of tests with a purely multi-threaded load provided insight across a broad array of operations.
System Call Overhead, File Copy, and Shell Script tests were most influenced by the matrix multiply load.
Pipe-based communication (directly between processes) tended to scale well as the load increased.
Interrupt handling is expensive and doesn't respond well to the increased load because it can delay, and be delayed by, a CPU that is busy handling other processes and threads.
Summary (Cont.)
CPU intensive mathematical and floating point operations scale well (little context switching since they run mostly in user space)
Most communication (with whetstone-like operations) stays on the CPU(s)/cores or occurs between threads and therefore incurs lower kernel overhead
Engineers, developers and others should be mindful of these results when purchasing, building or even just running applications on modern parallel computing architectures.
The manner in which multi-threaded operations are combined with traditional non-threaded operations (such as the UNIXBENCH examples used here) can impact performance and throughput.
Future Work
Running the tests on a quad-core or greater processor.
Modifying the matrix multiply code to control which core(s) the threads will execute on, a practice sometimes referred to as processor affinity.
Verifying that anomalies in the test results are not influenced by the state or size of the cache. One way to do this is to modify the workload of matrixmult.c so that the work of each thread fits within the L2 cache, or just clear the cache upon each use.
Acknowledgements
Special thanks to the UNIX lab administrators at SUNYIT for their help in acquiring equipment and providing remote lab access.

References
[1] "About (OProfile)." OProfile. 2009. Sourceforge.net. 15 Jan. 2010. http://oprofile.sourceforge.net/about/
[2] "byte-unixbench, A Unix benchmark suite." Project Hosting on Google Code. 2010. Google. 15 Jan. 2010. http://code.google.com/p/byte-unixbench/
[3] "Re: [PATCH v5] Unified trace buffer." Linux Kernel Archive. 2008. Indiana University. 9 Mar. 2010. http://lkml.indiana.edu/hypermail/linux/kernel/0809.3/1326.html
[4] "byte-unixbench, A Unix benchmark suite." Project Hosting on Google Code. 2010. Google. 15 Jan. 2010. http://code.google.com/p/byte-unixbench/
[5] "Intel Core 2 events." OProfile. 2009. Sourceforge.net. 24 Nov. 2009. http://oprofile.sourceforge.net/docs/intel-core2-events.php