
Tools for analysis and evaluation of CPU Performance


Processors are increasingly complex and microarchitecture design is becoming harder; to understand instruction behaviour during its execution in the processor, architects use simulators.


Page 1: Tools for analysis and evaluation of CPU Performance

Analysis tools for Evaluation and Performance

Mourad Bouache, PhD, Computer Architecture

[email protected]

Oracle, Nov 14, 2011

Page 2: Tools for analysis and evaluation of CPU Performance

Introduction

Processors are increasingly complex

• More difficult microarchitecture design.

Simulator : a very important tool

• Understand instruction behavior during its execution in the processor.

Complex simulator :

• Time needed for preparation and modification.


Page 3: Tools for analysis and evaluation of CPU Performance

Introduction

Simulation


Page 5: Tools for analysis and evaluation of CPU Performance

Simulator

Tool

• Simulator : a very important tool

• Test new concepts

Three characteristics


Page 6: Tools for analysis and evaluation of CPU Performance

Complexity of microarchitectures


Page 7: Tools for analysis and evaluation of CPU Performance

Modular Simulation


Page 8: Tools for analysis and evaluation of CPU Performance

Speed decreases as complexity increases


Page 9: Tools for analysis and evaluation of CPU Performance

Contribution : vectorization methodology


Page 11: Tools for analysis and evaluation of CPU Performance

Monolithic Simulation

SimpleScalar is the most widely used simulator (in 70% of articles). This simulator, like most others, has a serious drawback : it is monolithic.

Advantage

• simulation speed

Disadvantages

• Difficult to update.

• Difficult to extract and compare the simulator components.


Page 12: Tools for analysis and evaluation of CPU Performance

Monolithic vs Modular


Page 16: Tools for analysis and evaluation of CPU Performance

Modular simulation

Advantages

• Reuse/exchange and compare simulator modules,

• Better confidence in simulation (closer to HW),

• Easier to read.

Main drawback :

• Simulation speed slowdown


Page 17: Tools for analysis and evaluation of CPU Performance

Outline

1 Modular simulation environment

2 Acceleration techniques

3 Vectorization of Simulator Modules

4 Experimental framework

5 Results

6 Scheduling process in SystemC

7 Conclusion & future work.


Page 18: Tools for analysis and evaluation of CPU Performance

Modular simulation environments

• A modular simulation environment describes the system to simulate hierarchically and structurally. To simulate the entire system, the environment includes a scheduler controlling the execution of the different components.
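As a rough sketch of this idea (generic C++ with hypothetical names, not the actual UNISIM or SystemC engine), a modular simulation boils down to modules that each expose a per-cycle process and a scheduler that drives all of them :

#include <vector>

// Hypothetical sketch : each hardware block is a module exposing a
// per-cycle process ; the scheduler wakes every module each cycle.
struct Module {
    virtual void process_cycle() = 0;   // simulate one clock cycle
    virtual ~Module() = default;
};

struct Scheduler {
    std::vector<Module*> modules;       // the structural description
    void run(long cycles) {
        for (long c = 0; c < cycles; ++c)
            for (Module* m : modules)   // wake up each component
                m->process_cycle();
    }
};

Real engines are more elaborate (signal propagation, sensitivity lists, delta cycles), but this is the core loop the scheduler implements.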


Page 19: Tools for analysis and evaluation of CPU Performance

Modular simulation environments


• Key benefits :


Page 20: Tools for analysis and evaluation of CPU Performance

Modular simulation environments

Reuse ...


Page 21: Tools for analysis and evaluation of CPU Performance

Modular simulation environments

Compare ...


Page 22: Tools for analysis and evaluation of CPU Performance

Modular simulation environments

Share ...


Page 23: Tools for analysis and evaluation of CPU Performance

Simulation models


Page 24: Tools for analysis and evaluation of CPU Performance

acceleration techniques

Acceleration techniques

• reduction of inputs and simulation programs : MinneSPEC,

• simulation engine optimization : FastSysC 1 (speed ×2),

• distribution of simulation : DisT,

• sampling techniques : representative, periodic and random sampling,

• transition to TTLM modeling : Timed Transaction Level Modeling.

1. Daniel Gracia Perez et al. FastSysC : a fast SystemC engine

Page 26: Tools for analysis and evaluation of CPU Performance

acceleration techniques

Acceleration techniques

• Compromise between accuracy and simulation speed,

• Vectorization 2 is a methodology that can be used with one of these acceleration techniques.

2. David Parello, Mourad Bouache, and Bernard Goossens. Improving cycle-level modular simulation by vectorization. In Rapid Simulation and Performance Evaluation : Methods and Tools (RAPIDO'09)


Page 28: Tools for analysis and evaluation of CPU Performance

Modular simulation environment

UNISIM 3 : A modular simulation framework

• UNISIM is a modular simulation framework : each simulator is divided into several modules, each module corresponding to a hardware block.

• A module is composed of two parts : state and processes.

3. http://www.unisim.org/

Page 29: Tools for analysis and evaluation of CPU Performance

Modular simulation environment

UNISIM : A modular simulation framework

• A process is defined in a .sim file as a C++ class


Page 31: Tools for analysis and evaluation of CPU Performance

UNISIM : Communication protocol

Communication protocol

• Ports : inports and outports

• Signals

3 signals :

• Processes can be sensitive to the data, accept and enable signals.


Page 32: Tools for analysis and evaluation of CPU Performance

Communication protocol

UNISIM : signals

• The simulation engine (SystemC) wakes up the module processes.


Page 33: Tools for analysis and evaluation of CPU Performance

UNISIM : Communication protocol

Communication between modules


Page 42: Tools for analysis and evaluation of CPU Performance

Communication protocol

Communication between modules

Scalability is difficult with modular simulation, for two reasons :

• Communication costs between the simulator modules.

• A process wake-up for each communicating module.


Page 44: Tools for analysis and evaluation of CPU Performance

Communication costs

Monolithic Simulator

• Write/read a variable.

Modular Simulator

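As a hedged sketch of this contrast (illustrative names only, not the UNISIM protocol), a monolithic simulator communicates by writing a shared variable, whereas a modular simulator routes the same value through a signal object and must wake the process that is sensitive to it :

#include <functional>

// Monolithic style : producer and consumer share a plain variable.
struct MonolithicCore {
    int fetch_to_decode = 0;             // written by fetch, read by decode
    void fetch()  { fetch_to_decode = 42; }
    void decode() { int instr = fetch_to_decode; (void)instr; }
};

// Modular style (hypothetical) : the value goes through a signal and
// the simulation engine wakes the process sensitive to that signal.
template <typename T>
struct Signal {
    T value{};
    std::function<void()> wake_sensitive_process;
    void write(const T& v) { value = v; if (wake_sensitive_process) wake_sensitive_process(); }
    T read() const { return value; }
};

Each modular communication therefore pays for a call into the engine and a process wake-up on top of the data transfer, which is the overhead described on this slide.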

Page 45: Tools for analysis and evaluation of CPU Performance

A New Communication Protocol

Signals Array

• Reduce the number of signals,

• Several values of data, accept and enable are temporarily stored in a signals array.


Page 46: Tools for analysis and evaluation of CPU Performance

A New Communication Protocol

Signals Array

• Extending the communication protocol between modules is one way to accelerate the simulation.

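A minimal sketch of the signals-array idea (assumed names and layout, not the actual UNISIM implementation) : one vector signal buffers the values of all NBCFG simulated configurations and notifies the sensitive process once per send(), instead of once per configuration :

#include <array>
#include <functional>

constexpr int NBCFG = 8;                 // number of simulated configurations

// Hypothetical vector signal : NBCFG values share one signal object,
// and the sensitive process is woken only once per send().
template <typename T>
struct VectorSignal {
    std::array<T, NBCFG> values{};
    std::function<void()> wake_sensitive_process;
    T&       operator[](int cfg)       { return values[cfg]; }
    const T& operator[](int cfg) const { return values[cfg]; }
    void send() { if (wake_sensitive_process) wake_sensitive_process(); }
};

This mirrors the vectorized module code shown later, where out.data[cfg] is filled in a loop and a single out.data.send() notifies the consumer.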

Page 47: Tools for analysis and evaluation of CPU Performance

Module Vectorization

A simple and systematic procedure

1 vectorize module state and ports,

2 add a loop around the process,

3 add method calls to send() following the addition of for loops.


Page 48: Tools for analysis and evaluation of CPU Performance

Example : Functional Unit

class FunctionalUnit : public module
{ public:
    inclock clock;
    inport<instr> in;
    outport<instr> out;

    FunctionalUnit(const char *name) : module(name)
    { sensitive_pos_method(start_of_cycle) << clock;
      sensitive_neg_method(end_of_cycle) << clock;
      sensitive_method(on_data_accept) << in.data << out.accept;
    }

    void start_of_cycle()
    { if (pipeline.is_ready())
        out.data = pipeline.get();
      else out.data.nothing();
    }

    void on_data_accept()
    { if (in.data.know() && out.accept.know())
      { if (!pipeline.is_full() || out.accept)
          in.accept = true;
        else in.accept = false;
        out.enable = out.accept;
      }
    }

    void end_of_cycle()
    { if (out.accept) pipeline.pop();
      if (in.enable) pipeline.push(in.data);
      pipeline.run();
    }

  private:
    Fifo<instr> pipeline;
};


Page 49: Tools for analysis and evaluation of CPU Performance

Module Vectorization

Vectorization procedure

1. Vectorize module state and ports.

Before :

class FunctionalUnit : public module
{ public:
    inclock clock;
    inport<instr> in;
    outport<instr> out;
    ...
  private:
    Fifo<instr> pipeline;

After :

class FunctionalUnit : public module
{ public:
    inclock clock;
    inport<instr, NBCFG> in;
    outport<instr, NBCFG> out;
    ...
  private:
    Fifo<instr> pipeline[NBCFG];


Page 50: Tools for analysis and evaluation of CPU Performance

Module Vectorization

Vectorization procedure

2. Add a loop around the process.

Before :

...
void start_of_cycle()
{ if (pipeline.is_ready())
    out.data = pipeline.get();
  else out.data.nothing();
}
void on_data_accept()
{ if (in.data.know() && out.accept.know())
  { if (!pipeline.is_full() || out.accept)
      in.accept = true;
    else in.accept = false;
    out.enable = out.accept;
  }
}
...

After :

...
void start_of_cycle()
{ for (int cfg = 0; cfg < NBCFG; cfg++)
  {
    if (pipeline[cfg].is_ready())
      out.data[cfg] = pipeline[cfg].get();
    else out.data[cfg].nothing();
    ...
  }
}
void on_data_accept()
{ if (in.data.know() && out.accept.know())
  { for (int cfg = 0; cfg < NBCFG; cfg++)
    { if (!pipeline[cfg].is_full()
          || out.accept[cfg])
        in.accept[cfg] = true;
      else in.accept[cfg] = false;
      out.enable[cfg] = out.accept[cfg];
      ...
    }
  }
}
...


Page 51: Tools for analysis and evaluation of CPU Performance

Module Vectorization

Vectorization procedure

3. Add method calls to send() following the addition of the for loops.

Before : the original scalar code (as shown in step 2).

After :

...
void start_of_cycle()
{ for (int cfg = 0; cfg < NBCFG; cfg++)
  {
    if (pipeline[cfg].is_ready())
      out.data[cfg] = pipeline[cfg].get();
    else out.data[cfg].nothing();
  }
  out.data.send();
}
void on_data_accept()
{ if (in.data.know() && out.accept.know())
  { for (int cfg = 0; cfg < NBCFG; cfg++)
    { if (!pipeline[cfg].is_full()
          || out.accept[cfg])
        in.accept[cfg] = true;
      else in.accept[cfg] = false;
      out.enable[cfg] = out.accept[cfg];
    }
    in.accept.send();
    out.enable.send();
  }
}
...


Page 52: Tools for analysis and evaluation of CPU Performance

Example : Vectorized Functional Unit

class FunctionalUnit : public module
{ public:
    inclock clock;
    inport<instr, NBCFG> in;
    outport<instr, NBCFG> out;

    FunctionalUnit(const char *name) : module(name)
    { // sensitive list
      sensitive_pos_method(start_of_cycle) << clock;
      sensitive_neg_method(end_of_cycle) << clock;
      sensitive_method(on_data_accept) << in.data << out.accept;
    }

    void start_of_cycle()
    { for (int cfg = 0; cfg < NBCFG; cfg++)
      {
        if (pipeline[cfg].is_ready())
          out.data[cfg] = pipeline[cfg].get();
        else out.data[cfg].nothing();
      }
      out.data.send();
    }

    void on_data_accept()
    { if (in.data.know() && out.accept.know())
      { for (int cfg = 0; cfg < NBCFG; cfg++)
        { if (!pipeline[cfg].is_full() || out.accept[cfg])
            in.accept[cfg] = true;
          else in.accept[cfg] = false;
          out.enable[cfg] = out.accept[cfg];
        }
        in.accept.send();
        out.enable.send();
      }
    }

    void end_of_cycle()
    { for (int cfg = 0; cfg < NBCFG; cfg++)
      { if (out.accept[cfg]) pipeline[cfg].pop();
        if (in.enable[cfg]) pipeline[cfg].push(in.data[cfg]);
        pipeline[cfg].run();
      }
    }

  private:
    Fifo<instr> pipeline[NBCFG];
};


Page 53: Tools for analysis and evaluation of CPU Performance

Simulator Vectorization

Multi-cores Simulation

• In our study, we performed simulations of multi-cores with 2, 4, 8, 16, 32 and 64 cores.


Page 54: Tools for analysis and evaluation of CPU Performance

OoOSim : Out of Order Simulator

OoOSim 4 models a generic superscalar out-of-order processor. The baseline simulator includes a 4-way superscalar core with an L1 instruction cache, an L1 write-back data cache, a bus and a DRAM.

4. Mourad Bouache, David Parello, Bernard Goossens. Acceleration of Modular Simulation. In International Supercomputing Conference (ISC09), Hamburg, Germany, June 2009.


Page 55: Tools for analysis and evaluation of CPU Performance

OoOSim : Out of Order Simulator

OoOSim : 12 modules

1 Fetcher,

2 AllocatorRenamer,

3 Dispatcher,

4 Scheduler,

5 RegisterFile,

6 Ret-Broadcast and CDBA : Common Data Bus Arbiter,

7 IntegerUnit, FloatingPointUnit and AddressGenerationUnit,

8 LoadStoreQueue,

9 Data caches L1 and L2,

10 Instruction cache L1,

11 Memory DRAM,

12 Reorder Buffer.


Page 56: Tools for analysis and evaluation of CPU Performance

OoOSim : Out of Order Simulator

More than 15,000 lines of code; 12 modules connected through 187 signals.

Page 57: Tools for analysis and evaluation of CPU Performance

Benchmarks

Benchmarks : MiBench

• Simulations were carried out with MiBench, which is divided into six suites, each targeting a specific embedded-application market segment : Automotive, Network, Security, Consumer Devices, Office Automation, and Telecommunications.

Auto./Industrial     Consumer   Office         Network    Security   Telecomm.
susan (edges)        jpeg       stringsearch   dijkstra   sha        FFT
susan (corners)      -          -              -          rijndael   -
susan (smoothing)    -          -              -          -          -


Page 58: Tools for analysis and evaluation of CPU Performance

Performance evaluation

Simulation machine

• Performance evaluation was carried out on a cluster of 30 Intel Xeon 5148 dual-core processors clocked at 2.33 GHz with a 4 MB L2 cache.


Page 59: Tools for analysis and evaluation of CPU Performance

Results : simulation speed (without vectorization)


Page 60: Tools for analysis and evaluation of CPU Performance

simulation speed (with vectorization)


Page 61: Tools for analysis and evaluation of CPU Performance

Results : speedup


Page 62: Tools for analysis and evaluation of CPU Performance

Why ... ?

Instrumentation of the FastSysC code (program)

• Cycle counters (RDTSC : Read Time Stamp Counter) :

1 The FastSysC scheduler transit time.
2 The process time.

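A hedged sketch of this kind of instrumentation (generic x86 code, not the actual FastSysC patch) : read the time-stamp counter around each region of interest and accumulate scheduler time and process time in separate counters :

#include <cstdint>
#include <x86intrin.h>                   // __rdtsc() with GCC/Clang on x86

static uint64_t scheduler_cycles = 0;    // time spent in the scheduler
static uint64_t process_cycles   = 0;    // time spent in module processes

// Wrap a call with RDTSC and add the elapsed cycles to a counter.
template <typename F>
void timed(uint64_t& counter, F&& f) {
    uint64_t start = __rdtsc();
    f();
    counter += __rdtsc() - start;
}

// Usage sketch inside a (hypothetical) simulation loop :
//   timed(scheduler_cycles, [&] { engine.evaluate_signals(); });
//   timed(process_cycles,   [&] { module->process_cycle();   });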

Page 63: Tools for analysis and evaluation of CPU Performance

FastSysC transit time (without/with vectorization)

[Plots : FastSysC transit time, without vectorization and with vectorization.]

Page 64: Tools for analysis and evaluation of CPU Performance

Conclusion

Results

• To address the need to improve simulation speed, we proposed a methodology for developing modules in a modular simulator.

• This methodology is based on a new communication-signal protocol.

Vectorized simulation improves scalability.


Page 65: Tools for analysis and evaluation of CPU Performance

Results Discussion

Vectorization ...

• improves the simulation speedup.

• allows resources to be duplicated while limiting the scheduler's share of the simulation time.

• can be used in conjunction with other speed-improvement techniques, such as sampling or reduction of the test programs.


Page 66: Tools for analysis and evaluation of CPU Performance

Results Discussion

Vectorization ...


Page 67: Tools for analysis and evaluation of CPU Performance

Conclusion

Our contribution aims to improve simulation speed in modular simulators by offering a simple and systematic development methodology based on the vectorization of simulator modules.


Page 68: Tools for analysis and evaluation of CPU Performance

Conclusion

Simplescalar is not a multi-core simulator


Page 70: Tools for analysis and evaluation of CPU Performance

In focus

Other idea ...

• Vectorization : we wish to compare the results of this methodology with TTLM modeling (Timed Transaction Level Modeling).


Page 71: Tools for analysis and evaluation of CPU Performance

Merci, Thank you, Tack

QUESTIONS ?


Page 72: Tools for analysis and evaluation of CPU Performance

Back-up slides

Post-doc research work

• Instruction Level Parallelism : ILP
Goal : understand the general structure of an execution and the parallelism it offers.

• PerPi : A Tool to Measure Instruction Level Parallelism
  • http://kenny.univ-perp.fr/PerPi/
  • A Pin tool (Pin is Intel's free, programmable instrumentation tool),
  • computes the instruction dependency graph,
  • computes, for each instruction in the run, its instruction cycle in the ideal machine (see the sketch below),
  • analysis of the structure of instruction-level parallelism,
  • parallelism on loops,
  • local and global parallelism,
  • parallelism on function ”CALL”.
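A hedged sketch of the underlying computation (not PerPi's actual code, and considering only register data dependences) : on an ideal machine, an instruction executes one cycle after the latest of its producers, and the ILP is the instruction count divided by the resulting number of cycles :

#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <vector>

// One dynamic instruction of the run : registers it reads and writes.
struct Instr {
    std::vector<int> srcs;   // source register ids
    std::vector<int> dsts;   // destination register ids
};

// Ideal-machine schedule : cycle(i) = 1 + max cycle of its producers.
double ideal_ilp(const std::vector<Instr>& trace) {
    std::unordered_map<int, uint64_t> last_writer_cycle;   // reg -> cycle
    uint64_t max_cycle = 0;
    for (const Instr& i : trace) {
        uint64_t c = 1;                           // no producer : cycle 1
        for (int r : i.srcs) {
            auto it = last_writer_cycle.find(r);
            if (it != last_writer_cycle.end())
                c = std::max(c, it->second + 1);  // wait for the producer
        }
        for (int r : i.dsts) last_writer_cycle[r] = c;
        max_cycle = std::max(max_cycle, c);
    }
    return max_cycle ? double(trace.size()) / double(max_cycle) : 0.0;
}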


Page 73: Tools for analysis and evaluation of CPU Performance

Back-up slides

Pin Tool


Page 74: Tools for analysis and evaluation of CPU Performance

Back-up slides

TTLM


Page 75: Tools for analysis and evaluation of CPU Performance

Back-up slides

SystemC and FastSysC

SystemC contains a scheduler which manages the signals and decides which processes to run. It handles sequential processes (sensitive to the clock) and combinational processes (sensitive to input ports). FastSysC uses a mixture of static and dynamic scheduling to avoid waking processes unnecessarily, thus optimizing the simulation engine.

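As a small illustration of the two kinds of processes (standard SystemC, heavily simplified) : a method sensitive to the clock edge behaves as a sequential process, while a method sensitive to its input ports behaves as a combinational one :

#include <systemc.h>

SC_MODULE(Stage) {
    sc_in<bool> clk;
    sc_in<int>  in;
    sc_out<int> out;
    int counter = 0;

    // Sequential process : woken on the rising clock edge.
    void clocked() { ++counter; }

    // Combinational process : woken whenever the input port changes.
    void combinational() { out.write(in.read() + 1); }

    SC_CTOR(Stage) {
        SC_METHOD(clocked);
        sensitive << clk.pos();
        SC_METHOD(combinational);
        sensitive << in;
    }
};

FastSysC keeps the same programming model but schedules such processes more selectively, which is how it avoids unnecessary wake-ups.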

Page 76: Tools for analysis and evaluation of CPU Performance

Back-up slides

Monolithic


Page 77: Tools for analysis and evaluation of CPU Performance

Back-up slides

Modular


Page 78: Tools for analysis and evaluation of CPU Performance

Back-up slides

Parallel Simulation


Page 79: Tools for analysis and evaluation of CPU Performance

Back-up slides

Sampling I


Page 80: Tools for analysis and evaluation of CPU Performance

Back-up slides

Sampling II


Page 81: Tools for analysis and evaluation of CPU Performance

Back-up slides

MiBench


Page 82: Tools for analysis and evaluation of CPU Performance

Back-up slides

Use of OoOSim


Page 83: Tools for analysis and evaluation of CPU Performance

Back-up slides

Stringsearch


Page 84: Tools for analysis and evaluation of CPU Performance

Back-up slides

flight-trace simulation


Page 85: Tools for analysis and evaluation of CPU Performance

Back-up slides

execution-driven simulation


Page 86: Tools for analysis and evaluation of CPU Performance

Back-up slides

trace-driven simulation


Page 87: Tools for analysis and evaluation of CPU Performance

Back-up slides

Unisim Example


Page 88: Tools for analysis and evaluation of CPU Performance

Back-up slides

UNISIM History
