Instrumenting Parsec's raytrace

In this presentation I present the performance metrics and results of running the raytrace application from the Parsec benchmark on UPC's Boada server. My blog: http://www.marioalmeida.eu/

Instrumenting a benchmark application
Tools and Measurements Techniques
Project by Mário Almeida (EMDC)

Barcelona, 25 April 2012

Index (1/2)
Tools and configuration
● Parsec
  ○ Overview
  ○ Benchmark programs
● Extrae
● Paraver
● Configuration

Index (2/2)
Measurements
● Raytrace
  ○ Overview
  ○ Code
  ○ Inputs
  ○ Traces
  ○ Load Balancing
  ○ Cache misses and instructions
  ○ Execution time
  ○ Configuration comparisons
  ○ Extrae overhead

Conclusions

Tools and configuration

Parsec
Overview
● Benchmark with the following characteristics:
  ○ Multithreaded
  ○ Emerging workloads
  ○ Diverse
  ○ Not HPC-focused
  ○ Research

Parsec
Benchmark programs
● blackscholes
● bodytrack
● canneal
● dedup
● facesim
● ferret
● fluidanimate
● freqmine
● raytrace
● ...

Extrae
● Instrumentation package to trace programs compiled and run with the shared memory model and the message passing programming model.

Paraver
● Detailed quantitative analysis of program performance.
● Concurrent comparative analysis of several traces.
● Support for mixed message passing and shared memory.
● Building of derived metrics.

Configuration (1/4)
Boada server:
● Dual CPU, six cores each, with Hyper-Threading.
● Kills applications after a few minutes.
● 24 GB of RAM.
● Used cpulimit to limit the CPU usage to at most four cores.

Configuration (2/4)
Installed and/or configured:
● Parsec 2.1 with the raytrace package only.
● Extrae 2.2.1.
● Paraver 4.3.0 (on my laptop).
● cpulimit.
● Minor configurations in .bashrc.
● Multiple scripts to clean, build and run.

Configuration (3/4)

Configuration (4/4)

Measurements

Raytrace
Overview
● Physical simulation for visualization.
● Computer animation.
● Input is a complex object of many triangles.

Raytrace
Code
For every pixel in the image
    calculate trajectory of ray striking pixel
    find closest intersection point of ray with scene geometry
    calculate contribution of all lights at intersection point
    recursively trace specularly reflected ray
end for
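The per-pixel loop above can be sketched in a few lines of Python. This is only an illustration of the four steps, not Parsec's implementation: it uses spheres instead of triangle meshes, and the scene, the light position, the reflectivity values and the recursion limit are all made-up assumptions.

```python
import math

# Hypothetical scene: (center, radius, reflectivity) spheres and one point light.
SPHERES = [((0.0, 0.0, 3.0), 1.0, 0.5), ((2.0, 0.0, 4.0), 1.0, 0.0)]
LIGHTS = [(5.0, 5.0, 0.0)]
MAX_DEPTH = 2  # assumed limit on recursive reflections

def add(a, b): return tuple(x + y for x, y in zip(a, b))
def sub(a, b): return tuple(x - y for x, y in zip(a, b))
def scale(a, s): return tuple(x * s for x in a)
def dot(a, b): return sum(x * y for x, y in zip(a, b))
def norm(a): return scale(a, 1.0 / math.sqrt(dot(a, a)))

def closest_hit(origin, direction):
    """Find the closest intersection of a unit-direction ray with the scene."""
    best = None
    for sphere in SPHERES:
        center, radius, _ = sphere
        oc = sub(origin, center)
        b = 2.0 * dot(oc, direction)
        c = dot(oc, oc) - radius * radius
        disc = b * b - 4.0 * c
        if disc < 0.0:
            continue  # the ray misses this sphere
        t = (-b - math.sqrt(disc)) / 2.0
        if t > 1e-4 and (best is None or t < best[0]):
            best = (t, sphere)
    return best

def trace(origin, direction, depth=0):
    """Return a scalar brightness for one ray."""
    hit = closest_hit(origin, direction)
    if hit is None:
        return 0.0
    t, (center, radius, reflectivity) = hit
    point = add(origin, scale(direction, t))
    normal = norm(sub(point, center))
    # Contribution of all lights at the intersection point (diffuse only).
    brightness = sum(max(0.0, dot(normal, norm(sub(light, point))))
                     for light in LIGHTS)
    # Recursively trace the specularly reflected ray.
    if reflectivity > 0.0 and depth < MAX_DEPTH:
        reflected = sub(direction, scale(normal, 2.0 * dot(direction, normal)))
        brightness += reflectivity * trace(point, reflected, depth + 1)
    return brightness

def render(width, height):
    """For every pixel: compute the ray through it and trace it."""
    image = []
    for y in range(height):
        row = []
        for x in range(width):
            px = (x + 0.5) / width * 2.0 - 1.0
            py = 1.0 - (y + 0.5) / height * 2.0
            row.append(trace((0.0, 0.0, 0.0), norm((px, py, 1.0))))
        image.append(row)
    return image
```

Each pixel is computed independently of the others, which is what makes this kind of loop a natural candidate for splitting across threads.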

Raytrace
Inputs
● simsmall - 1 million polygons (480x270)
● simmedium - 1 million polygons (960x540)
● simlarge - 1 million polygons (1920x1080)
● native - 10 million polygons (1920x1080)
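To get a feeling for how the work grows between inputs, the resolutions listed above can be turned into pixel counts. The sketch assumes one primary ray per pixel, following the per-pixel loop of the code slide; reflected rays and the number of frames are not counted.

```python
# Resolutions taken from the input list; one primary ray per pixel is an assumption.
INPUTS = {
    "simsmall": (480, 270),
    "simmedium": (960, 540),
    "simlarge": (1920, 1080),
    "native": (1920, 1080),
}

def primary_rays_per_frame(width, height):
    """Number of pixels, i.e. primary rays, in one frame."""
    return width * height

for name, (width, height) in INPUTS.items():
    print(name, primary_rays_per_frame(width, height))
```

Under that assumption each step from simsmall to simmedium to simlarge multiplies the per-frame work by four, while native keeps the resolution of simlarge and grows the scene instead.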

Raytrace
Trace (1/2)
Only 10% of the execution time is parallel!
Trace legend: Not created, Running.
Render time is proportional to the number of frames!
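The 10% parallel fraction limits what extra threads can do for the whole run. A small sketch using Amdahl's law shows the bound. The function name is hypothetical, and the law assumes the parallel part scales perfectly, which is optimistic.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Upper bound on whole-program speedup according to Amdahl's law."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)

# With 10% of the time parallel, even many threads barely help the total run.
for workers in (1, 4, 12, 64):
    print(workers, round(amdahl_speedup(0.10, workers), 3))
```

With a parallel fraction of 0.10 the bound stays below 1/0.9, about 1.11, no matter how many threads are used. Since render time grows with the number of frames, runs with more frames would have a larger parallel fraction and a higher bound.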

Raytrace
Trace (2/2)
Figure labels: Render; Init and adding object; Build Context.

Raytrace
Load balancing (1/2)
Figure labels: Not created; Barrier; Create Threads Task; Wait for all threads.
Good load balancing between the slave threads.
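One common way to put a number on load balancing is the ratio between the average and the maximum busy time of the threads. The sketch below assumes per-thread running times have already been read from the trace; the values are invented and are not measurements from this project.

```python
def load_balance(busy_times):
    """Average busy time divided by maximum busy time (1.0 means perfectly balanced)."""
    average = sum(busy_times) / len(busy_times)
    return average / max(busy_times)

# Hypothetical per-thread running times in seconds.
balanced = [2.0, 2.0, 2.0, 2.0]
unbalanced = [4.0, 2.0, 2.0, 0.0]
print(load_balance(balanced))    # 1.0
print(load_balance(unbalanced))  # 0.5
```

A value close to 1.0 matches the observation above: no thread spends much longer working than the others, so little time is lost waiting at the barrier.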

Raytrace
Load balancing (2/2)

Raytrace
Cache and instructions
Figure labels: high number of cache misses; very low number of cache misses.
There were no significant differences in IPC between threads.
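IPC and cache-miss rates are derived from hardware counters. The sketch below assumes each thread's counters are available as a dictionary keyed by the PAPI preset names PAPI_TOT_INS, PAPI_TOT_CYC and PAPI_L2_TCM. Which counters were actually recorded in this project is not stated, and the numbers are invented.

```python
def ipc(counters):
    """Instructions per cycle for one thread."""
    return counters["PAPI_TOT_INS"] / counters["PAPI_TOT_CYC"]

def mpki(counters):
    """Cache misses per thousand instructions for one thread."""
    return 1000.0 * counters["PAPI_L2_TCM"] / counters["PAPI_TOT_INS"]

def relative_spread(values):
    """(max - min) / mean: a rough measure of how much threads differ."""
    mean = sum(values) / len(values)
    return (max(values) - min(values)) / mean

# Hypothetical counter values for two threads.
threads = [
    {"PAPI_TOT_INS": 2000, "PAPI_TOT_CYC": 1000, "PAPI_L2_TCM": 10},
    {"PAPI_TOT_INS": 1900, "PAPI_TOT_CYC": 1000, "PAPI_L2_TCM": 19},
]
ipcs = [ipc(t) for t in threads]
print(ipcs, relative_spread(ipcs), [mpki(t) for t in threads])
```

A small relative spread of IPC is one way to express "no significant differences between threads" as a number.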

Raytrace
Execution time (1/3)
These are average times from multiple executions of the parallel code only, without Extrae overhead.
There was a high average deviation of 0.3 seconds in the experiments.
Measurements with bigger inputs showed less variation.
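The averages and deviations quoted on these slides can be computed as follows. The sketch assumes that "average deviation" means the mean absolute deviation from the average, which the slides do not state explicitly, and the sample times are invented.

```python
def mean(values):
    """Arithmetic mean of a list of timings."""
    return sum(values) / len(values)

def average_deviation(values):
    """Mean absolute deviation from the mean (assumed meaning of 'average deviation')."""
    center = mean(values)
    return sum(abs(v - center) for v in values) / len(values)

# Hypothetical execution times of the parallel section, in seconds.
runs = [1.0, 1.5, 0.5, 1.0]
print(mean(runs), average_deviation(runs))  # 1.0 0.25
```

When the deviation is of the same order as the times being compared, as with the smallest input, differences between thread counts should be read with care.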

Raytrace
Execution time (2/3)
There was a smaller average deviation of 0.03 seconds.
With 64 threads it runs almost three times faster!

Raytrace
Execution time (3/3)
There was an even smaller average deviation of 0.02 seconds.
With 64 threads it runs almost three times faster!
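"Almost three times faster" is a speedup, and dividing it by the number of threads gives the parallel efficiency. The sketch assumes the comparison is against a single-thread run. The times are placeholders chosen to give a speedup of exactly 3, not measured values.

```python
def speedup(baseline_time, parallel_time):
    """How many times faster the parallel run is than the baseline."""
    return baseline_time / parallel_time

def efficiency(baseline_time, parallel_time, threads):
    """Speedup per thread; 1.0 would be ideal scaling."""
    return speedup(baseline_time, parallel_time) / threads

# Placeholder times in seconds: one thread versus 64 threads.
t1, t64 = 3.0, 1.0
print(speedup(t1, t64))         # 3.0
print(efficiency(t1, t64, 64))  # 0.046875
```

A speedup of about 3 with 64 threads corresponds to an efficiency below 5%, which is expected when the thread count is far above the 12 physical cores of the server.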

Raytrace
Configuration comparison
In the limited configuration, performance doesn't seem to degrade, but the execution time stops improving beyond 8 threads.

Raytrace
Extrae overhead

Conclusions

Conclusions (1/3)
● The system seemed to perform worse when the number of threads was a multiple of the total number of physical cores.
● The program has good load balancing.
● Fine-grained parallelism.

Conclusions (2/3)
● Although it wasn't possible to verify, increasing the input should cause more cache misses, because the big working sets won't fit in the cache memory.
● Memory bandwidth should be the main obstacle to good speedups.
● Boada killed almost all the native input executions.

Conclusions (3/3)
● Paraver simplifies the process of analyzing an application's performance.
● Better knowledge of the system's architecture would be needed to further analyse the performance of the application.

Questions
