The gem5 Simulator ISCA 2011 Brad Beckmann 1 Nathan Binkert 2 Ali Saidi 3 Joel Hestness 4 Gabe Black 5 Korey Sewell 6 Derek Hower 7 1 AMD Research 2 HP Labs 3 ARM, Inc. 4 University of Texas, Austin 5 Google, Inc. 6 University of Michigan, Ann Arbor 7 University of Wisconsin, Madison June 5th, 2011 1

  • The gem5 Simulator
    ISCA 2011

    Brad Beckmann¹, Nathan Binkert², Ali Saidi³, Joel Hestness⁴, Gabe Black⁵, Korey Sewell⁶, Derek Hower⁷

    ¹ AMD Research  ² HP Labs  ³ ARM, Inc.  ⁴ University of Texas, Austin  ⁵ Google, Inc.  ⁶ University of Michigan, Ann Arbor  ⁷ University of Wisconsin, Madison

    June 5th, 2011

  • Welcome!

    • We’re glad you’re here!
      • The gem5 simulator has been a multi-year effort
      • A wide variety of institutions have participated
    • This tutorial is for you
      • Please ask questions! Don’t save them for the break!
      • We intend the focus to be audience driven

  • Tutorial Goals and Timeline

    • Tutorial goals
      • Introduce you to the gem5 simulator
      • Answer your development questions
    • Two halves
      • 8:30-noon: Overview of the simulator, features, components, and simple examples
      • After lunch: Birds of a feather sessions and informal discussions of simulator internals

  • Outline

    1 Introduction to gem5

    2 Basics

    3 Debugging

    4 Checkpointing and Fastforwarding

    5 Break

    6 Multiple Architecture Support

    7 CPU Modeling

    8 Ruby Memory System

    9 Wrap-Up

  • Introduction to gem5

    Brad Beckmann
    AMD Research

  • Introduction to gem5

    • What is gem5?
      • The best parts of M5
      • The best parts of GEMS
    • Overall goals, design principles, and capabilities

  • What is gem5?

    • The combination of M5 and GEMS into a new simulator
    • Google scholar statistics
      • M5 (IEEE Micro, CAECW): 440 citations
      • GEMS (CAN): 588 citations
    • Best aspects of both glued together
      • M5: CPU models, ISAs, I/O devices, infrastructure
      • GEMS (essentially Ruby): cache coherence protocols, interconnect models

  • What else is new?

    • Many other things have changed since previous tutorials, beyond GEMS+M5
    • Some of the highlights:
      • The world’s most popular ISAs: ARM and x86
      • The In-order CPU model
      • New documentation
    • Overall, gem5 offers a broad set of capabilities

  • Android on ARM FS

    [Video demo: android.mp4]

  • 64 Processor Linux on x86 FS

  • What gem5 is Not

    • A hardware design language
      • Higher level for design space exploration, simulation speed
    • A restrictive environment
      • Just C++ and Python with an event queue and a bunch of APIs you can choose to ignore
    • Finished!
      • Always room for improvement . . .

  • What We Would Like gem5 to Be

    • Something that spares you the pain we’ve been through
    • A community resource
      • Modular enough to localize changes
      • Contribute back, and spare others some pain
    • A path to reproducible/comparable results
      • A common platform for evaluating ideas
    • Let us know how we can help you contribute
      • Public wiki is up at http://www.gem5.org
      • Please submit patches and additional features
      • Ability to add modules with EXTRAS=
      • The more active the community is, the more successful gem5 will be!

  • Two Views of gem5

    View #1
    • A framework for event-driven simulation
      • Events, objects, statistics, configuration

    View #2
    • A collection of predefined object models
      • CPUs, caches, busses, devices, etc.

    • This tutorial focuses on #2
    • You may find #1 useful even if #2 is not
      • At least three other “simulators” have been created using #1

  • Main Goals

    Overall Goal: Open source community tool focused on architectural modeling

    • Flexibility
      • Multiple CPU models across the speed vs. accuracy spectrum
      • Two execution modes: System-call Emulation & Full-system
      • Two memory system models: Classic & Ruby
      • Once you learn it, you can apply it to a wide range of investigations
    • Availability
      • For both academic and corporate researchers
      • No dependence on proprietary code
      • BSD license
    • Collaboration
      • Combined effort of many with different specialties
      • Active community leveraging collaborative technologies

  • Key Features

    • Pervasive object-oriented design
      • Provides modularity, flexibility
      • Significantly leverages inheritance, e.g. SimObject
    • Python integration
      • Powerful front-end interface
      • Provides initialization, configuration, & simulation control
    • Domain-Specific Languages
      • ISA DSL: defines ISA semantics
      • Cache Coherence DSL (a.k.a. SLICC): defines coherence logic
    • Standard interfaces: Ports and MessageBuffers

  • Capabilities

    • Execution modes: System-call Emulation (SE) & Full-System (FS)
    • ISAs: Alpha, ARM, MIPS, Power, SPARC, x86
    • CPU models: AtomicSimple, TimingSimple, InOrder, and O3
    • Cache coherence protocols: broadcast-based, directories, etc.
    • Interconnection networks: Simple & Garnet (Princeton, MIT)
    • Devices: NICs, IDE controller, etc.
    • Multiple systems: communicate over TCP/IP

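  • Cross-Product Matrix

    Rows (Processor): CPU Model × System Mode
    • Atomic Simple: SE, FS
    • Timing Simple: SE, FS
    • InOrder: SE, FS
    • O3: SE, FS

    Columns (Memory System): Classic, Ruby with Simple network, Ruby with Garnet network

    [Table: the per-cell support markings did not survive extraction]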

  • Basics

    Nate Binkert
    HP Labs

  • Basics

    • Compiling gem5
    • Running gem5
    • Very brief overview of a few key concepts:
      • Objects
      • Events
      • Modes
      • Ports
      • Stats

  • Building Executables

    • Platforms
      • Linux, BSD, MacOS, Solaris, etc.
      • Little endian machines
        • Some architectures support big endian
      • 64-bit machines help a lot
    • Tools
      • GCC/G++ 3.4.6+
        • Most frequently tested with 4.2-4.5
      • Python 2.4+
      • SCons 0.98.1+
        • We generally test versions 0.98.5 and 1.2.0
        • http://www.scons.org
      • SWIG 1.3.31+
        • http://www.swig.org

  • Compile Targets

    • build/<config>/<binary>
    • configs
      • By convention, usually <ISA>_<mode>
      • ALPHA_SE (Alpha syscall emulation)
      • ALPHA_FS (Alpha full system)
      • Other ISAs: ARM, MIPS, POWER, SPARC, X86
      • Sometimes followed by Ruby protocol:
        • ALPHA_SE_MOESI_hammer
      • You can define your own configs
    • binary
      • gem5.debug – debug build, symbols, tracing, assert
      • gem5.opt – optimized build, symbols, tracing, assert
      • gem5.fast – optimized build, no debugging, no symbols, no tracing, no assertions
      • gem5.prof – gem5.fast + profiling support
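The target naming convention on this slide can be sketched as a small helper. This is only an illustration: `scons_target` and `VARIANTS` are hypothetical names, not part of gem5 or SCons, and the function merely assembles the path string that would be passed to `scons`.

```python
# Hypothetical helper (not part of gem5): compose a build target path
# from an ISA, a mode, an optional Ruby protocol, and a binary variant.
VARIANTS = ("debug", "opt", "fast", "prof")


def scons_target(isa, mode, variant="opt", protocol=None):
    """Return a path of the form build/<ISA>_<mode>[_<protocol>]/gem5.<variant>."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown binary variant: {variant}")
    config = f"{isa.upper()}_{mode.upper()}"
    if protocol:
        # The Ruby protocol name is appended verbatim, as in ALPHA_SE_MOESI_hammer.
        config += f"_{protocol}"
    return f"build/{config}/gem5.{variant}"


print(scons_target("x86", "fs"))
print(scons_target("alpha", "se", variant="debug", protocol="MOESI_hammer"))
```

The first call yields the same target used in the sample compile, `build/X86_FS/gem5.opt`.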

  • Sample Compile

    blue% scons build/X86_FS/gem5.opt
    scons: Reading SConscript files ...
    Checking for leading underscore in global variables...no
    Checking for C header file Python.h... yes
    Checking for C library pthread... yes

    Reading /n/blue/z/binkert/work/m5/incoming/src/mem/ruby/SConsopts
    Reading /n/blue/z/binkert/work/m5/incoming/src/mem/protocol/SConsopts
    Reading /n/blue/z/binkert/work/m5/incoming/src/arch/arm/SConsopts

    Building in /n/blue/z/binkert/work/m5/incoming/build/X86_FS
    Variables file /n/blue/z/binkert/work/m5/incoming/build/variables/X86_FS not found,
      using defaults in /n/blue/z/binkert/work/m5/incoming/build_opts/X86_FS
    scons: done reading SConscript files.
    scons: Building targets ...
    [ CXX] X86_FS/sim/main.cc -> .o
    [ CXX] X86_FS/sim/async.cc -> .o
    [ CXX] X86_FS/sim/core.cc -> .o
    [ TRACING] -> X86_FS/debug/Event.hh
    Defining FAST_ALLOC_STATS as 0 in build/X86_FS/config/fast_alloc_stats.hh.
    Defining FORCE_FAST_ALLOC as 0 in build/X86_FS/config/force_fast_alloc.hh.
    Defining NO_FAST_ALLOC as 0 in build/X86_FS/config/no_fast_alloc.hh.
    [ CXX] X86_FS/sim/debug.cc -> .o
    [ TRACING] -> X86_FS/debug/Config.hh
    [ CXX] X86_FS/sim/eventq.cc -> .o
    [ CXX] X86_FS/sim/init.cc -> .o
    [ TRACING] -> X86_FS/debug/TimeSync.hh
    [SO PARAM] Root -> X86_FS/params/Root.hh
    [SO PARAM] SimObject -> X86_FS/params/SimObject.hh
    ...

  • Running Simulations

    maize% ./build/ARM_FS/gem5.opt --help
    Usage
    =====
      gem5.opt [gem5 options] script.py [script options]

    gem5 is copyrighted software; use the --copyright option for details.

    Options
    =======
    --version               show program’s version number and exit
    --help, -h              show this help message and exit
    --build-info, -B        Show build information
    --copyright, -C         Show full copyright information
    --readme, -R            Show the readme
    --outdir=DIR, -d DIR    Set the output directory to DIR [Default: /tmp/m5out]
    --redirect-stdout, -r   Redirect stdout (& stderr, without -e) to file
    --redirect-stderr, -e   Redirect stderr to file
    --stdout-file=FILE      Filename for -r redirection [Default: simout]
    --stderr-file=FILE      Filename for -e redirection [Default: simerr]
    --interactive, -i       Invoke the interactive interpreter after running the script
    --pdb                   Invoke the python debugger before running the script
    --path=PATH[:PATH], -p PATH[:PATH]
                            Prepend PATH to the system path when invoking the script
    --quiet, -q             Reduce verbosity
    --verbose, -v           Increase verbosity

  • Running Simulations (cont)

    Statistics Options
    ------------------
    --stats-file=FILE       Sets the output file for statistics [Default: stats.txt]

    Configuration Options
    ---------------------
    --dump-config=FILE      Dump configuration output file [Default: config.ini]

    Debugging Options
    -----------------
    --debug-break=TIME[,TIME]
                            Cycle to create a breakpoint
    --debug-help            Print help on trace flags
    --debug-flags=FLAG[,FLAG]
                            Sets the flags for tracing (-FLAG disables a flag)
    --remote-gdb-port=REMOTE_GDB_PORT
                            Remote gdb base port (set to 0 to disable listening)

    Trace Options
    -------------
    --trace-start=TIME      Start tracing at TIME (must be in ticks)
    --trace-file=FILE       Sets the output file for tracing [Default: cout]
    --trace-ignore=EXPR     Ignore EXPR sim objects

    Help Options
    ------------
    --list-sim-objects      List all built-in SimObjects, their params and default values

  • Sample Run

    maize% ./build/ARM_SE/gem5.opt configs/example/se.py
    gem5 Simulator System. http://gem5.org
    gem5 is copyrighted software; use the --copyright option for details.

    gem5 compiled Jun 2 2011 17:39:30
    gem5 started Jun 3 2011 14:48:20
    gem5 executing on maize
    command line: ./build/ARM_SE/gem5.opt configs/example/se.py
    Global frequency set at 1000000000000 ticks per second
    0: system.remote_gdb.listener: listening for remote gdb #0 on port 7000

    **** REAL SIMULATION ****
    info: Entering event queue @ 0. Starting simulation...
    Hello world!
    hack: be nice to actually delete the event here
    Exiting @ tick 3350000 because target called exit()

  • Modes

    • gem5 has two fundamental modes
    • Full system (FS)
      • For booting operating systems
      • Models bare hardware, including devices
      • Interrupts, exceptions, privileged instructions, fault handlers
    • Syscall emulation (SE)
      • For running individual applications, or set of applications on MP/SMT
      • Models user-visible ISA plus common system calls
      • System calls emulated, typ. by calling host OS
      • Simplified address translation model, no scheduling
    • Selected via compile-time option
      • Vast majority of code is unchanged, though

  • Objects

    • Everything you care about is an object (C++/Python)
      • Derived from SimObject base class
        • Common code for creation, configuration parameters, naming, checkpointing, etc.
    • Uniform method-based APIs for object types
      • CPUs, caches, memory, etc.
      • Plug-compatibility across implementations
        • Functional vs. detailed CPU
        • Conventional vs. indirect-index cache
    • Easy replication: cores, multiple systems, . . .

  • Events

    • Standard event queue timing model
      • Global logical time in “ticks”
      • No fixed relation to real time
        • Normally picoseconds in our examples
    • Objects schedule their own events
      • Flexibility for detail vs. performance trade-offs
      • E.g., a CPU typically schedules event at regular intervals
        • Every cycle or every n picoseconds
        • Won’t schedule self if stalled/idle
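The event model described on this slide can be illustrated with a toy discrete-event queue. This is a minimal sketch and not gem5's implementation: `EventQueue`, `ToyCPU`, and the 500-tick period are assumptions made for the example. It shows global time advancing in ticks and an object that reschedules itself each cycle until it goes idle.

```python
import heapq


class EventQueue:
    """Toy event queue; time is an integer tick count (illustrative only)."""

    def __init__(self):
        self.cur_tick = 0
        self._heap = []
        self._seq = 0  # tie-breaker so same-tick events run in FIFO order

    def schedule(self, when, callback):
        if when < self.cur_tick:
            raise ValueError("cannot schedule an event in the past")
        heapq.heappush(self._heap, (when, self._seq, callback))
        self._seq += 1

    def run(self, until):
        # Pop events in tick order; time jumps straight to the next event.
        while self._heap and self._heap[0][0] <= until:
            when, _, callback = heapq.heappop(self._heap)
            self.cur_tick = when
            callback()


class ToyCPU:
    """Schedules its own tick event every `period` ticks while it has work."""

    def __init__(self, eq, period, work):
        self.eq = eq
        self.period = period
        self.remaining = work
        self.ticks_seen = []
        eq.schedule(eq.cur_tick, self.tick)

    def tick(self):
        self.ticks_seen.append(self.eq.cur_tick)
        self.remaining -= 1
        if self.remaining > 0:  # an idle CPU does not reschedule itself
            self.eq.schedule(self.eq.cur_tick + self.period, self.tick)


eq = EventQueue()
cpu = ToyCPU(eq, period=500, work=3)
eq.run(until=10_000)
print(cpu.ticks_seen)  # [0, 500, 1000]
```

Because the CPU stops rescheduling once its work runs out, the queue drains at tick 1000 even though the run limit is 10,000 ticks.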

  • Ports (src/mem/port.{hh,cc})

    • Method for connecting MemObjects together
    • Each MemObject subclass has its own Port subclass(es)
      • Specialized to forward packets to appropriate methods of MemObject subclass
    • Each pair of MemObjects is connected via a pair of Ports (“peers”)
    • Function pairs pass packets across ports
      • sendTiming() on one port calls recvTiming() on peer
    • Result: class-specific handling with arbitrary connections and only a single virtual function call

  • Access Modes

    • Three access modes: Functional, Atomic, Timing
    • Selected by choosing function on initial Port:
      • sendFunctional(), sendAtomic(), sendTiming()
    • Functional mode:
      • Just “make it happen”
      • Used for loading binaries, debugging, etc.
      • Accesses happen instantaneously, updating data everywhere in the hierarchy
      • If devices contain queues of packets they must be scanned and updated as well

  • Access Modes (cont’d)

    • Atomic mode:
      • Requests complete before sendAtomic() returns
      • Models state changes (cache fills, coherence, etc.)
      • Returns approx. latency w/o contention or queuing delay
      • Used for fast simulation, fast forwarding, or warming caches
    • Timing mode:
      • Models all timing/queuing in the memory system
      • Split transaction
        • sendTiming() just initiates send of request to target
        • Target later calls sendTiming() to send response packet
    • Atomic and Timing accesses cannot coexist in system
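The difference between atomic and timing accesses over peer ports can be sketched in a few lines. This is a toy model under stated assumptions: `Port`, `ToyMemory`, `ToyRequester`, the dictionary packets, and the 100-tick latency are all invented for illustration, and the method names only loosely mirror sendAtomic()/sendTiming() from the slides.

```python
class Port:
    """One end of a connection; forwards calls to the owner of its peer."""

    def __init__(self, owner):
        self.owner = owner
        self.peer = None

    def send_atomic(self, pkt):
        return self.peer.owner.recv_atomic(pkt)

    def send_timing(self, pkt):
        return self.peer.owner.recv_timing(pkt)


def connect(a, b):
    a.peer, b.peer = b, a


class ToyMemory:
    LATENCY = 100  # ticks; arbitrary illustrative value

    def __init__(self):
        self.port = Port(self)
        self.data = {}
        self.pending = []

    def recv_atomic(self, pkt):
        # Atomic: the access completes before the call returns,
        # and an approximate latency is handed back.
        self._access(pkt)
        return self.LATENCY

    def recv_timing(self, pkt):
        # Timing: only accept the request; the response comes later.
        self.pending.append(pkt)
        return True

    def respond_all(self):
        # Stands in for a later event firing in the memory model.
        while self.pending:
            pkt = self.pending.pop(0)
            self._access(pkt)
            self.port.send_timing(pkt)  # response travels back to the peer

    def _access(self, pkt):
        if pkt["cmd"] == "write":
            self.data[pkt["addr"]] = pkt["value"]
        else:
            pkt["value"] = self.data.get(pkt["addr"], 0)


class ToyRequester:
    def __init__(self):
        self.port = Port(self)
        self.responses = []

    def recv_timing(self, pkt):
        self.responses.append(pkt)
        return True


cpu, mem = ToyRequester(), ToyMemory()
connect(cpu.port, mem.port)

latency = cpu.port.send_atomic({"cmd": "write", "addr": 0x40, "value": 7})
cpu.port.send_timing({"cmd": "read", "addr": 0x40})
responses_before = len(cpu.responses)  # still 0: request accepted, not answered
mem.respond_all()
print(latency, responses_before, cpu.responses[0]["value"])  # 100 0 7
```

The atomic write returns its latency immediately, while the timing read only produces a response once the memory later calls back through the peer port.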

  • Statistics

    • Scalar
    • Average
    • Vector
    • Formula
    • Histogram
    • Distribution
    • Vector Distribution

  • Statistics Example – hh file

    class MySimObject : public SimObject
    {
      private:
        Stats::Scalar txBytes;
        Stats::Formula txBandwidth;
        Stats::Vector syscall;

      public:
        void regStats();
    };

  • Statistics Example – cc file

    txBytes
        .name(name() + ".txBytes")
        .desc("Bytes Transmitted")
        .prereq(txBytes);

    txBandwidth
        .name(name() + ".txBandwidth")
        .desc("Transmit Bandwidth (bits/s)")
        .precision(0);

    txBandwidth = txBytes * Stats::constant(8) / simSeconds;

    syscall
        .init(SystemCalls ::Number)
        .name(name() + ".syscall")
        .desc("number of syscalls executed")
        .flags(total | pdf | nozero | nonan);

  • Statistics Output

    client.tsunami.etherdev.txBandwidth       4302720
    client.tsunami.etherdev.txBytes             13446
    server.tsunami.etherdev.txBandwidth    4684921600
    server.tsunami.etherdev.txBytes          14640380
    sim_seconds                              0.025000
    server.cpu.kern.syscall                       492
    server.cpu.kern.syscall_1                     189    38.41%    38.41%
    server.cpu.kern.syscall_2                     249    50.61%    89.02%
    server.cpu.kern.syscall_3                      54    10.98%   100.00%
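The numbers in this output can be checked by hand. The sketch below recomputes the bandwidth formula (bytes × 8 / simulated seconds) and the percentage and cumulative-percentage columns of the syscall vector from the values shown on the slide. The function names `tx_bandwidth` and `pdf_rows` are hypothetical, and this is plain Python arithmetic, not the gem5 statistics package.

```python
def tx_bandwidth(tx_bytes, sim_seconds):
    """Bits per second, mirroring txBytes * 8 / simSeconds."""
    return tx_bytes * 8 / sim_seconds


def pdf_rows(counts):
    """Return (count, percent, cumulative percent) for each vector entry."""
    total = sum(counts)
    running = 0
    rows = []
    for count in counts:
        running += count
        rows.append((count,
                     round(100 * count / total, 2),
                     round(100 * running / total, 2)))
    return rows


print(round(tx_bandwidth(13446, 0.025)))     # 4302720 (client)
print(round(tx_bandwidth(14640380, 0.025)))  # 4684921600 (server)
for row in pdf_rows([189, 249, 54]):
    print(row)
```

The three syscall counts sum to 492, which is the total reported on the `server.cpu.kern.syscall` line.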

  • Debugging

    Ali Saidi
    ARM Research & Development

  • Debugging Facilities

    • Tracing
      • Instruction Tracing
      • Diffing Traces
    • Using gdb to debug gem5
      • Debugging C++ and gdb-callable functions
      • Remote Debugging
    • Python Debugging
    • Pipeline Viewer

  • Tracing/Debugging (src/base/trace.*)

    • printf() is a nice debugging tool
    • Keep good printfs for tracing
    • Lots of debug output is a very good thing
    • Example flags:
      • Fetch, Decode, Ethernet, Exec, TLB, DMA, Bus, Cache, Loader, O3CPUAll, etc.
      • Print out all flags with --debug-help option

  • Enabling Tracing

    • Selecting flags:
      • --debug-flags=Cache,Bus
      • --debug-flags=Exec,-ExecTicks
    • Selecting destination:
      • --trace-file=my_trace.out
      • --trace-file=my_trace.out.gz
    • Selecting start:
      • --trace-start=23000000

    ./build/ARM_FS/gem5.opt --debug-flags=Cache,Bus --trace-start=2400 configs/example/fs.py
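When scripting many traced runs, it can help to assemble these options programmatically. The helpers below are a hedged sketch: `parse_debug_flags` and `trace_command` are hypothetical names, and they only build strings using the options shown on this slide (`--debug-flags`, `--trace-start`, `--trace-file`) without invoking gem5.

```python
def parse_debug_flags(spec):
    """Split a --debug-flags value into (enabled, disabled) flag lists.

    A leading '-' marks a flag as disabled, as in 'Exec,-ExecTicks'.
    """
    enabled, disabled = [], []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        if item.startswith("-"):
            disabled.append(item[1:])
        else:
            enabled.append(item)
    return enabled, disabled


def trace_command(binary, script, flags, trace_start=None, trace_file=None):
    """Build an argv list; simulator options go before the script path."""
    argv = [binary, "--debug-flags=" + ",".join(flags)]
    if trace_start is not None:
        argv.append(f"--trace-start={trace_start}")
    if trace_file is not None:
        argv.append(f"--trace-file={trace_file}")
    argv.append(script)
    return argv


print(parse_debug_flags("Exec,-ExecTicks"))
print(" ".join(trace_command("./build/ARM_FS/gem5.opt",
                             "configs/example/fs.py",
                             ["Cache", "Bus"], trace_start=2400)))
```

The second call reproduces the command line shown at the bottom of the slide.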

  • Adding Debugging

    • Print statement put in source code
      • Encourage you to add ones to your models or contribute ones you find particularly useful
    • Macros remove them for gem5.fast or gem5.prof binaries
      • So you must be using gem5.debug or gem5.opt to get any output
    • Adding an extra tracing statement:
      • DPRINTF(Flag, "normal printf %s\n", "arguments");
    • Adding a new debug flag (in a SConscript):
      • DebugFlag('MyNewFlag')

  • Instruction Tracing (src/sim/insttracer.hh)

    • Separate from the general debug/trace facility
      • But both are enabled the same way
    • Per-instruction records populated as instruction executes
      • Start with PC and mnemonic
      • Add argument and result values as they become known
    • Printed to trace when instruction completes
    • Flags for printing cycle, symbolic addresses, etc.

    4000: sys.cpu : @sym+776 : add r3, r3, #8 : IntAlu : D=0x00008358
    4500: sys.cpu : @sym+780 : sub r3, r3, r7 : IntAlu : D=0x40000000
    5000: sys.cpu : @sym+784 : add r5, r5, r3 : IntAlu : D=0x000173cc
    5500: sys.cpu : @sym+788 : add r6, r6, r3 : IntAlu : D=0x00017400
    6000: sys.cpu : @sym+792.0 : addi_uop r34, r5, #0 : IntAlu : D=0x000173cc
    6500: sys.cpu : @sym+792.1 : ldr_uop r3, [r34, #0] : MemRead : D=0x000f0000 A=0x173cc
    7000: sys.cpu : @sym+792.2 : ldr_uop r4, [r34, #4] : MemRead : D=0x000f0000 A=0x173d0
    7500: sys.cpu : @sym+796 : and r4, r4, r9 : IntAlu : D=0x000f0000
    8000: sys.cpu : @sym+800 : teqs r3, r4 : IntAlu : D=0x00000001
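Trace records like the ones above are easy to post-process. The parser below is a sketch that assumes exactly the five colon-separated fields shown in this sample (tick and CPU, PC, disassembly, op class, result values); `parse_exec_line` is a hypothetical name, and real traces printed with other flag combinations may need a different split.

```python
def parse_exec_line(line):
    """Parse one sample-format Exec trace line into a dict.

    Assumes fields are separated by ' : ' as in the slide's example;
    raises ValueError if the line does not have five fields.
    """
    head, pc, disasm, opclass, results = [p.strip() for p in line.split(" : ")]
    tick, cpu = head.split(":", 1)
    # Result values look like 'D=0x...' (data) and optionally 'A=0x...' (address).
    values = dict(item.split("=", 1) for item in results.split())
    return {
        "tick": int(tick),
        "cpu": cpu.strip(),
        "pc": pc.lstrip("@"),
        "disasm": disasm,
        "opclass": opclass,
        "data": int(values["D"], 16) if "D" in values else None,
        "addr": int(values["A"], 16) if "A" in values else None,
    }


sample = "6500: sys.cpu : @sym+792.1 : ldr_uop r3, [r34, #0] : MemRead : D=0x000f0000 A=0x173cc"
record = parse_exec_line(sample)
print(record["tick"], record["opclass"], hex(record["addr"]))  # 6500 MemRead 0x173cc
```

Splitting on the spaced separator keeps the commas and brackets inside the disassembly field intact.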

  • Using GDB with gem5

    • Several gem5 functions designed to be called from GDB:
      • schedBreakCycle() – also with --debug-break
      • setDebugFlag()/clearDebugFlag()
      • dumpDebugStatus()
      • eventqDump()
      • SimObject::find()
      • takeCheckpoint()

  • Using GDB with gem5

    gdb --args ./build/ARM_FS/gem5.opt configs/example/fs.py
    GNU gdb Fedora (6.8-37.el5)
    ...

    (gdb) b main
    Breakpoint 1 at 0x4090b0: file build/ARM_FS/sim/main.cc, line 40.
    (gdb) run

    Breakpoint 1, main (argc=2, argv=0x7fffa59725f8) at build/ARM_FS/sim/main.cc:40
    40 main(int argc, char **argv)

    (gdb) call schedBreakCycle(1000000)
    (gdb) continue
    Continuing.

    gem5 Simulator System
    ...
    0: system.remote_gdb.listener: listening for remote gdb #0 on port 7000

    **** REAL SIMULATION ****
    info: Entering event queue @ 0. Starting simulation...

    Program received signal SIGTRAP, Trace/breakpoint trap.
    0x0000003ccb6306f7 in kill () from /lib64/libc.so.6

    (gdb) p _curTick
    $1 = 1000000

  • Using GDB with gem5

    (gdb) call setDebugFlag("Exec")
    (gdb) call schedBreakCycle(1001000)
    (gdb) continue
    Continuing.

    1000000: system.cpu T0 : @_stext+148.1 : addi_uop r0, r0, #4 : IntAlu : D=0x0000000000004c30
    1000500: system.cpu T0 : @_stext+152 : teqs r0, r6 : IntAlu : D=0x0000000000000000

    Program received signal SIGTRAP, Trace/breakpoint trap.
    0x0000003ccb6306f7 in kill () from /lib64/libc.so.6

    (gdb) print SimObject::find("system.cpu")
    $2 = (SimObject *) 0x19cba130
    (gdb) print (BaseCPU*)SimObject::find("system.cpu")
    $3 = (BaseCPU *) 0x19cba130
    (gdb) p $3->instCnt
    $4 = 431

    (gdb) call clearDebugFlag("Exec")
    (gdb) call takeCheckpoint(0)
    (gdb) call schedBreakCycle(1001500)
    (gdb) continue
    Continuing.
    Writing checkpoint
    info: Entering event queue @ 1001001. Starting simulation...

    Program received signal SIGTRAP, Trace/breakpoint trap.
    0x0000003ccb6306f7 in kill () from /lib64/libc.so.6
    (gdb)

  • Diffing Traces (util/{rundiff,tracediff})

    • Often useful to compare traces from two simulations
      • Find where known good and modified simulators diverge
    • Standard diff works only on files (not pipes)
      • ...but you really don’t want to run to completion
    • util/rundiff
      • Perl script for diffing two pipes on the fly
    • util/tracediff
      • Handy wrapper for using rundiff to compare gem5 outputs
      • tracediff "a/gem5.opt|b/gem5.opt" --debug-flags=Exec compares instruction traces from two builds of gem5
      • See comments for details
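The core idea, comparing two streams as they are produced and stopping at the first mismatch, can be sketched as follows. This is not how util/rundiff is implemented (that is a Perl script with its own logic); `first_divergence` is a hypothetical helper that works on any two line iterators, such as open pipes or generators.

```python
from itertools import zip_longest


def first_divergence(trace_a, trace_b):
    """Return (line_number, a_line, b_line) at the first mismatch, or None.

    Lines are consumed lazily, so nothing past the divergence point is read.
    A stream that ends early shows up as None on that side.
    """
    pairs = zip_longest(trace_a, trace_b)
    for number, (a, b) in enumerate(pairs, start=1):
        if a != b:
            return number, a, b
    return None


good = ["4000: add r3, r3, #8", "4500: sub r3, r3, r7", "5000: add r5, r5, r3"]
bad = ["4000: add r3, r3, #8", "4500: sub r3, r3, r8", "5000: add r5, r5, r3"]
print(first_divergence(iter(good), iter(bad)))
```

Because the comparison returns at the first differing line, neither simulation has to run to completion before the divergence is reported.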

  • Advanced Trace Diffing

    • Sometimes if you run into a nasty bug it’s hard to compare apples-to-apples traces
      • Different cycle counts, different code paths from interrupts/timers
    • Some mechanisms that can help:
      • -ExecTicks: don’t print out ticks
      • -ExecKernel: don’t print out kernel code
      • -ExecUser: don’t print out user code
      • ExecAsid: print out ASID of currently running process
    • State trace
      • PTRACE program that runs binary on real system and compares cycle-by-cycle to gem5
      • Supports ARM, x86, SPARC
      • See wiki for more information

  • Remote Debugging

    ./build/ARM_FS/gem5.opt configs/example/fs.py
    gem5 Simulator System
    ...
    command line: ./build/ARM_FS/gem5.opt configs/example/fs.py
    Global frequency set at 1000000000000 ticks per second
    info: kernel located at: /chips/pd/randd/dist/binaries/vmlinux.arm
    Listening for system connection on port 5900
    Listening for system connection on port 3456
    0: system.remote_gdb.listener: listening for remote gdb #0 on port 7000
    info: Entering event queue @ 0. Starting simulation...

    • Remote gdb connection listening on port 7000

  • Remote Debugging

    GNU gdb (Sourcery G++ Lite 2010.09-50) 7.2.50.20100908-cvs
    Copyright (C) 2010 Free Software Foundation, Inc.
    ...
    (gdb) symbol-file /dist/binaries/vmlinux.arm
    Reading symbols from //dist/binaries/vmlinux.arm...done.
    (gdb) set remote Z-packet on
    (gdb) set tdesc filename arm-with-neon.xml
    (gdb) target remote 127.0.0.1:7000
    Remote debugging using 127.0.0.1:7000
    cache_init_objs (cachep=0xc7c00240, flags=3351249472) at mm/slab.c:2658
    (gdb) step
    sighand_ctor (data=0xc7ead060) at kernel/fork.c:1467
    (gdb) info registers
    r0    0xc7ead060   -940912544
    r1    0x520        1312
    r2    0xc002f1e4   -1073548828
    r3    0xc7ead060   -940912544
    r4    0x0          0
    r5    0xc7ead020   -940912608
    r6    0x0          0
    r7    0xc7ead03c   -940912580
    r8    0xc7c034a0   -943704928
    r9    0x100100     1048832
    r10   0xc7c0cee0   -943665440
    r11   0x200200     2097664
    r12   0xc0000000   -1073741824
    sp    0xc7c29e28   0xc7c29e28
    lr    0xc008ed98   -1073156712
    pc    0xc002f1e4   0xc002f1e4
    cpsr  0x13         19

  • Python Debugging

    • It is possible to drop into the python interpreter (-i flag)
      • This currently happens after the script file is run
      • If you want to do this before objects are instantiated, remove them from script
    • It is possible to drop into the python debugger (--pdb flag)
      • Occurs just before your script is invoked
      • Lets you use the debugger to debug your script code
    • Code that enables this stuff is in src/python/m5/main.py
      • At the bottom of the main function
      • Can copy the mechanism directly into your scripts, if it is in the wrong place for your needs
        • import pdb
        • pdb.set_trace()

  • O3 Pipeline Viewer

    Use --debug-flags=O3PipeView and util/o3-pipeview.py

  • Checkpointing and Fastforwarding

    Checkpointing and Fastforwarding

    Joel Hestness

    University of Texas, Austin

    54

  • Checkpointing and Fastforwarding

    • Idea is simple:• Snapshot of relevant system state• Restore it later and/or in different CPUs, configurations

    • Provides flexibility:• Test numerous different systems configurations• Exact same point in the benchmark• Avoid re-simulating up to that point• Avoid non-determinism inherent with different configurations

    55

  • Checkpointing and Fastforwarding

    • Outline:
      • Constraints
      • Checkpointing demo
      • Checkpointing internals
      • Instrumenting a benchmark
      • Fastforwarding internals
      • Fastforwarding demo

    56

  • Checkpointing and Fastforwarding

    • Constraints:
      • Original simulation and test simulations must have
        • Same ISA
        • Same number of cores
        • Same memory size
      • Usually run original sim with atomic (functional) CPUs

    57

  • Checkpointing DEMO!

    [Screenshot sequence; only the captions survive]

    1 Starting simulation
    2 Simulated system is running
    3 Another terminal to control simulated system
    4 Attach to simulated system
    5 Simulated system has booted to shell
    6 Run a quick application
    7 Drop a checkpoint
    8 Exit simulation
    9 Restore from checkpoint into different simulated system
    10 Simulated system is running
    11 Attach to simulated system
    12 Run a quick application
    13 Slower execution: detailed v. functional simulation
    14 Exit simulation

    58-74

  • Checkpointing Output

    • cpt.6967183789500/
      • m5.cpt: State of system components
      • system.disk?.image.cow: Modified state of disk(s)
      • system.physmem.physmem: State of memory

    75

  • Specifying State to Checkpoint

    • To checkpoint a piece of state, serialize it
    • To restore that state, unserialize it

    void
    serialize(std::ostream &os)
    {
        SERIALIZE_ARRAY(interrupts, NumInterruptLevels);
        SERIALIZE_SCALAR(intstatus);
    }

    void
    unserialize(Checkpoint *cp, const std::string &section)
    {
        UNSERIALIZE_ARRAY(interrupts, NumInterruptLevels);
        UNSERIALIZE_SCALAR(intstatus);
    }
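
    The serialize/unserialize pairing above can be illustrated with a small stand-alone Python sketch. It is not gem5 code: the Interrupts class, the NUM_INTERRUPT_LEVELS value, the section name, and the INI-style text layout are all assumptions chosen for illustration. It only shows the round trip of writing named fields into a section and reading them back.

```python
import configparser
import io

NUM_INTERRUPT_LEVELS = 4  # hypothetical size, for illustration only


class Interrupts:
    """Stand-in for an object holding the two fields named on the slide."""

    def __init__(self):
        self.interrupts = [False] * NUM_INTERRUPT_LEVELS
        self.intstatus = 0

    def serialize(self, cp, section):
        # Write each piece of state under a named key in this object's section.
        cp[section] = {
            "interrupts": " ".join(str(int(b)) for b in self.interrupts),
            "intstatus": str(self.intstatus),
        }

    def unserialize(self, cp, section):
        # Read the same keys back and rebuild the in-memory state.
        tokens = cp[section]["interrupts"].split()
        self.interrupts = [bool(int(tok)) for tok in tokens]
        self.intstatus = int(cp[section]["intstatus"])


def round_trip(obj, section="system.cpu.interrupts"):
    """Serialize obj to text, then restore a fresh object from that text."""
    cp = configparser.ConfigParser()
    obj.serialize(cp, section)
    text = io.StringIO()
    cp.write(text)

    restored_cp = configparser.ConfigParser()
    restored_cp.read_string(text.getvalue())
    restored = Interrupts()
    restored.unserialize(restored_cp, section)
    return restored
```

    Any state that is not written in serialize() is simply absent after a restore, which is why each object has to list its own fields.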

    76

  • Checkpointing functionality status

    • Classic memory model:
      • Does not save state of caches
    • Ruby memory model:
      • Can save state of caches

    77

  • Instrumenting a Benchmark

    • Copy files from ./util/m5/ into source tree:
      • m5op.h
      • m5ops.h
      • Appropriate assembly file: m5op_.S
    • Include m5op.h in source code that should take a checkpoint

    #include "m5op.h"

    ...

    // Take checkpoint in code
    m5_checkpoint(0, 0);

      • 1st param: no. ticks in future to schedule the checkpoint
      • 2nd param: no. ticks between checkpoints (periodic)
    • Compile and link against assembly file
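
    To make the two parameters concrete, the following Python sketch lists the ticks at which checkpoints would be taken. It is a model of the description above, not gem5 code: the function name and the end_tick bound are hypothetical, and treating a period of 0 as "take a single checkpoint" is an assumption consistent with the m5_checkpoint(0, 0) call shown.

```python
def checkpoint_ticks(now, delay, period, end_tick):
    """Return the ticks (up to and including end_tick) at which a
    checkpoint would fire for a call made at tick `now`.

    delay  -- ticks in the future to schedule the first checkpoint
    period -- ticks between later checkpoints; 0 is assumed to mean one-shot
    """
    first = now + delay
    if first > end_tick:
        return []
    if period == 0:
        return [first]
    return list(range(first, end_tick + 1, period))
```

    For example, checkpoint_ticks(1000, 500, 1000, 4000) gives one checkpoint every 1000 ticks starting at tick 1500.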

    78

  • Checkpointing Functionality in Progress

    Current limitation: cache warm-up

    1 Take periodic checkpoints throughout execution
    2 Inspect statistics for interesting sections (think Simpoints)
    3 Choose interesting sections
    4 Create memory access traces for cache warm-up
    5 Restore from checkpoint:
      1 Start simulated system
      2 Warm up caches from trace
      3 Restore the rest of state
      4 Begin execution
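
    The "warm up caches from trace" step can be sketched in plain Python. This is only an illustration of the idea: the LRUCache class, its fully associative LRU policy, and the 64-byte line size are assumptions, not the gem5 cache model.

```python
from collections import OrderedDict


class LRUCache:
    """Toy fully associative cache with LRU replacement (an assumption)."""

    def __init__(self, num_lines, line_bytes=64):
        self.num_lines = num_lines
        self.line_bytes = line_bytes
        self.lines = OrderedDict()  # line address -> present, oldest first
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        line = addr // self.line_bytes
        if line in self.lines:
            self.lines.move_to_end(line)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        self.lines[line] = True
        if len(self.lines) > self.num_lines:
            self.lines.popitem(last=False)  # evict least recently used
        return False


def warm_up(cache, trace):
    """Replay a list of addresses, then clear the counters so that
    warm-up accesses do not pollute the measured statistics."""
    for addr in trace:
        cache.access(addr)
    cache.hits = 0
    cache.misses = 0
```

    After warm_up(), the cache contents reflect the trace, so the first measured accesses see realistic hit rates instead of a cold cache.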

    79

  • Fastforwarding

    Setup:
    • Specify sets of CPUs

    cpu_class = AtomicSimpleCPU
    switch_cpu_class = DerivO3CPU
    test_sys.cpu = [cpu_class(cpu_id=i) for i in xrange(np)]
    switch_cpus = [switch_cpu_class(defer_registration=True, cpu_id=(np+i))
                   for i in xrange(np)]
    switch_cpu_list = [(test_sys.cpu[i], switch_cpus[i]) for i in xrange(np)]

    80

  • Fastforwarding DEMO!

    [Screenshot sequence; only the captions survive]

    1 Starting simulation
    2 Simulated system is running
    3 Another terminal to control simulated system
    4 Attach to simulated system
    5 Simulated system has booted to shell
    6 Run a quick application
    7 Switch from functional to detailed CPUs
    8 Run a quick application
    9 Slower execution: detailed v. functional simulation
    10 Exit simulation

    81-93

  • Break

    Break

    95

  • Multiple Architecture Support

    Multiple Architecture Support

    Gabe Black

    Google, Inc.

    97

  • Overview

    • Tour of the ISAs
    • Parts of an ISA
    • Decoding and instructions

    98

  • ISA Support

    • Full-System & Syscall Emulation
      • Alpha
      • ARM
      • SPARC
      • x86
    • Syscall Emulation
      • MIPS
      • POWER

    99

  • Alpha

    • Alpha 21264 including the BWX, MVI, FIX, and CIX extensions
    • 21164 PAL code
    • Syscall Emulation
      • Linux or Tru64 binaries
      • Simple Atomic, Simple Timing, In-Order, Out-of-Order CPU models
    • Full system
      • Linux or FreeBSD
      • Simple Atomic, Simple Timing, In-Order, Out-of-Order CPU models
      • Four cores in a normal Tsunami system
      • Also, gem5's big Tsunami supports 64 cores
        • Custom PAL code and kernel patches required

    100

  • ARM

    • ARMv7-A, Thumb, Thumb2, MP, VFPv3, NEON
      • Doesn’t (yet) include TrustZone, ThumbEE, Virtualization, LPAE
    • Syscall Emulation
      • EABI Linux binaries - no OABI
      • Simple Atomic, Simple Timing, Out-of-Order CPU models
    • Full system
      • Linux or Android
      • Simple Atomic, Simple Timing, Out-of-Order CPU models
      • Four cores in a normal ARM RealView system
      • No kernel patches required
      • Also supports frame buffer, and control via VNC
        • Can run X11, Android, Web browsers, etc.

    101

  • MIPS

    • 32 bit little endian
    • Syscall Emulation
      • Linux binaries
      • Simple Atomic, Simple Timing, In-Order, Out-of-Order CPU models
    • Full system
      • Significant progress, but not actively developed

    102

  • POWER

    • POWER ISA v2.06 B Book, 32-bit, little endian
    • Most instructions available, but some FP missing; no vector support
    • Syscall Emulation
      • Linux binaries
      • Simple Atomic, Simple Timing, Out-of-Order CPU models
    • Full system
      • No current plans

    103

  • SPARC

    • UltraSPARC Architecture 2005
    • Syscall Emulation
      • Linux or Solaris binaries
      • Simple Atomic, Simple Timing, Out-of-Order CPU models
    • Full system
      • Solaris
      • Single core of an UltraSPARC T1 (Niagara) processor
      • Simple Atomic CPU model only
      • Significant progress on MP, but not actively developed

    104

  • x86

    • Generic x86 CPU w/ 64 bit, 3DNow, & SSE extensions
      • Effort focused on modern features
      • No x87 floating point. Compile 32 bit with -msse2.
      • No Windows support any time soon.
    • Syscall Emulation
      • Linux binaries
      • Simple Atomic, Simple Timing, Out-of-Order CPU models
    • Full system
      • Linux
      • Simple Atomic, Simple Timing CPU models
      • MP support

    105

  • Parts of an ISA

    • Parameterization
      • Number of registers
      • Endianness
      • Page size
    • Specialized objects
      • TLBs
      • Faults
      • Control state
      • Interrupt controller
    • Instructions
      • Instructions themselves
      • Decoding mechanism

    106

  • Instruction decode process

    [Figure: Memory supplies bytes to the Predecoder, which (together with Context) produces an ExtMachInst; the Decoder turns the ExtMachInst into either a StaticInst or a Macroop made up of Microops]
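
    The flow in the figure (bytes, then ExtMachInst, then StaticInst or Macroop) can be sketched in stand-alone Python. All names below are stand-ins that borrow the slide's terminology; the fixed 4-byte little-endian encoding, the decode cache, and the toy_table opcode values are assumptions made for illustration and do not reproduce gem5's classes.

```python
class StaticInst:
    """Stand-in for a decoded instruction."""

    def __init__(self, mnemonic):
        self.mnemonic = mnemonic

    def microops(self):
        return [self]  # a plain instruction is its own single step


class Macroop(StaticInst):
    """Stand-in for an instruction that expands into microops."""

    def __init__(self, mnemonic, uops):
        super().__init__(mnemonic)
        self._uops = list(uops)

    def microops(self):
        return list(self._uops)


class Predecoder:
    """Collects raw bytes and emits whole machine instructions
    (assumed here to be fixed 4-byte little-endian words)."""

    def __init__(self):
        self.buf = bytearray()

    def more_bytes(self, data):
        self.buf.extend(data)

    def ext_mach_insts(self):
        while len(self.buf) >= 4:
            word = int.from_bytes(self.buf[:4], "little")
            del self.buf[:4]
            yield word


class Decoder:
    """Maps a machine instruction to a StaticInst, caching the result."""

    def __init__(self, table):
        self.table = table
        self.cache = {}

    def decode(self, emi):
        if emi not in self.cache:
            self.cache[emi] = self.table(emi)
        return self.cache[emi]


def toy_table(emi):
    # Hypothetical encodings, chosen only for this example.
    if emi == 0x1:
        return StaticInst("add")
    if emi == 0x2:
        return Macroop("movs", [StaticInst("ld"), StaticInst("st")])
    return StaticInst("unknown")
```

    The predecoder holds partial instructions until enough bytes arrive, which is the piece that matters for variable-length ISAs such as x86.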

    107

  • ISA Description Language src/arch/isa_parser.py, src/arch/*/isa/*

    • Custom domain-specific language
    • Defines decoding & behavior of ISA
    • Generates C++ code
      • Scads of StaticInst subclasses
      • decodeInst() function
        • Maps machine instruction to StaticInst instance
      • Multiple scads of execute() methods
        • Cross-product of CPU models and StaticInst subclasses

    108

  • Definitions etc.

    def bitfield OPCODE ;
    def bitfield RA ;
    def bitfield RB ;
    def bitfield INTFUNC ; // function code
    def bitfield RC < 4: 0>; // dest reg

    def operands {{
        'Ra': ('IntReg', 'uq', 'PALMODE ? AlphaISA::reg_redir[RA] : RA',
               'IsInteger', 1),
        'Rb': ('IntReg', 'uq', 'PALMODE ? AlphaISA::reg_redir[RB] : RB',
               'IsInteger', 2),
        'Rc': ('IntReg', 'uq', 'PALMODE ? AlphaISA::reg_redir[RC] : RC',
               'IsInteger', 3),
        'Fa': ('FloatReg', 'df', 'FA', 'IsFloating', 1),
        'Fb': ('FloatReg', 'df', 'FB', 'IsFloating', 2),
        'Fc': ('FloatReg', 'df', 'FC', 'IsFloating', 3),
    }}

    def format LoadAddress(code) {{
        // Python code here...
    }}

    def format IntegerOperate(code) {{
        // Python code here...
    }}

    109

  • Instruction Decode and Semantics

    decode OPCODE {
        format LoadAddress {
            0x08: lda ({{ Ra = Rb + disp; }});
            0x09: ldah ({{ Ra = Rb + (disp
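
    What the bitfield and decode definitions express can be mimicked in a few lines of Python. This is a hand-written sketch, not output of the ISA parser. The field positions used here (opcode in bits 31:26, Ra in 25:21, Rb in 20:16, a signed 16-bit displacement in 15:0) follow the Alpha memory-format instruction layout; the helper names are hypothetical, only lda is modelled, and special cases such as the zero register are ignored.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def bits(word, hi, lo):
    """Extract the inclusive bit range hi:lo from a machine word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def sext16(value):
    """Sign-extend a 16-bit field to a Python int."""
    return value - 0x10000 if value & 0x8000 else value


def encode_mem(opcode, ra, rb, disp):
    """Build a memory-format word (helper for this example only)."""
    return (opcode << 26) | (ra << 21) | (rb << 16) | (disp & 0xFFFF)


def decode(word):
    """Pick the instruction from the OPCODE field, like `decode OPCODE`."""
    opcode = bits(word, 31, 26)
    ra = bits(word, 25, 21)
    rb = bits(word, 20, 16)
    disp = sext16(bits(word, 15, 0))
    if opcode == 0x08:
        return ("lda", ra, rb, disp)
    return ("unknown", ra, rb, disp)


def execute(inst, regs):
    """Apply the instruction's semantics to a register list."""
    name, ra, rb, disp = inst
    if name != "lda":
        raise ValueError("instruction not modelled in this sketch")
    regs[ra] = (regs[rb] + disp) & MASK64  # Ra = Rb + disp
```

    The one-line semantic body on the slide corresponds to the single assignment in execute(); everything else here is what the ISA description language generates for you.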

  • Microcode

    def macroop MOVS_E_M_M {
        and t0, rcx, rcx, flags=(EZF,), dataSize=asz
        br label("end"), flags=(CEZF,)
        # Find the constant we need to either add or subtract from rdi
        ruflag t0, 10
        movi t3, t3, dsz, flags=(CEZF,), dataSize=asz
        subi t4, t0, dsz, dataSize=asz
        mov t3, t3, t4, flags=(nCEZF,), dataSize=asz

    topOfLoop:
        ld t1, seg, [1, t0, rsi]
        st t1, es, [1, t0, rdi]

        subi rcx, rcx, 1, flags=(EZF,), dataSize=asz
        add rdi, rdi, t3, dataSize=asz
        add rsi, rsi, t3, dataSize=asz
        br label("topOfLoop"), flags=(nCEZF,)
    end:
        fault "NoFault"
    };

    111

  • Key Features

    • Very compact representation
      • Most instructions take 1 line of C code
      • Alpha: 3437 lines of isa description → 39K lines of C++
        • ∼15K generic decode, ∼12K for each of 2 CPU models
    • Characteristics auto-extracted from C
      • source, dest regs; func unit class; etc.
      • execute() code customized for CPU models
    • Thoroughly documented (for us, anyway)
      • See wiki pages

    112

  • CPU Modeling

    CPU Modeling

    Korey Sewell

    University of Michigan, Ann Arbor

    114

  • Overview

    • High Level View
    • Supported CPU Models
      • AtomicSimpleCPU
      • TimingSimpleCPU
      • InOrderCPU
      • O3CPU
    • CPU Model Internals
      • Parameters
      • Time Buffers
      • Key Interfaces

    115

  • CPU Models - System Level View

    CPU Models are designed to be “hot pluggable” with arbitrary ISAs and Memory Systems

    116

  • Supported CPU Models src/cpu/*.hh,cc

    • Simple CPUs
      • Models Single-Thread 1 CPI Machine
      • Two Types: AtomicSimpleCPU and TimingSimpleCPU
      • Common Uses:
        • Fast, Functional Simulation: 2.9 million and 1.2 million instructions per second on the “twolf” benchmark
        • Warming Up Caches
        • Studies that do not require detailed CPU modeling
    • Detailed CPUs
      • Parameterizable Pipeline Models w/SMT support
      • Two Types: InOrderCPU and O3CPU
      • “Execute in Execute”, detailed modeling
      • Slower than SimpleCPUs: 200K instructions per second on the “twolf” benchmark
      • Models the timing for each pipeline stage
      • Forces both timing and execution of simulation to be accurate
      • Important for Coherence, I/O, Multiprocessor Studies, etc.

    117

  • AtomicSimpleCPU src/cpu/simple/atomic/*.hh,cc

    • On every CPU “tick()”, perform all necessary operations for an instruction
    • Memory accesses are atomic
    • Fastest functional simulation

    118

  • TimingSimpleCPU src/cpu/simple/timing/*.hh,cc

    • Memory accesses use timing path
    • CPU waits until memory access returns
    • Fast, provides some level of timing
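
    The difference between the atomic and timing access styles can be sketched with a tiny event queue in Python. This is a conceptual model only: EventQueue, Memory, send_atomic, send_timing, and the fixed latency are stand-ins invented for the example, not gem5's port interface.

```python
import heapq


class EventQueue:
    """Minimal discrete-event queue ordered by tick."""

    def __init__(self):
        self.now = 0
        self.processed = 0
        self._events = []
        self._seq = 0  # tie-breaker so callbacks are never compared

    def schedule(self, when, callback):
        heapq.heappush(self._events, (when, self._seq, callback))
        self._seq += 1

    def run(self):
        while self._events:
            self.now, _, callback = heapq.heappop(self._events)
            self.processed += 1
            callback()


class Memory:
    def __init__(self, eq, latency):
        self.eq = eq
        self.latency = latency

    def send_atomic(self, addr):
        # Atomic style: the access completes in the call and just
        # reports an estimated latency.
        return self.latency

    def send_timing(self, addr, on_response):
        # Timing style: the response arrives later as an event.
        self.eq.schedule(self.eq.now + self.latency, on_response)


def run_atomic(mem, num_loads):
    """All work for each load happens immediately; no events are needed."""
    return sum(mem.send_atomic(0) for _ in range(num_loads))


class TimingCPU:
    """Issues one load at a time and waits for each response."""

    def __init__(self, eq, mem, num_loads):
        self.eq = eq
        self.mem = mem
        self.remaining = num_loads
        self.finish_tick = None

    def start(self):
        self._issue()

    def _issue(self):
        if self.remaining == 0:
            self.finish_tick = self.eq.now
            return
        self.remaining -= 1
        self.mem.send_timing(0, self._issue)
```

    In this toy setup both styles account for the same total latency, but the timing version pays for one scheduled event per access, which hints at why the atomic model simulates faster.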

    119

  • InOrder CPU Model src/cpu/inorder/*.hh,cc

    • Detailed in-order CPU
    • InOrder is a new feature to the gem5 Simulator
    • Default 5-stage pipeline
      • Fetch, Decode, Execute, Memory, Writeback

    120

  • InOrder CPU Model src/cpu/inorder/*.hh,cc

    • Detailed in-order CPU
    • Default 5-stage pipeline
      • Fetch, Decode, Execute, Memory, Writeback
    • Key Resources
      • CacheUnit, ExecutionUnit, BranchPredictor, etc.
    • Key Parameters
      • Pipeline Stages, Hardware Threads
    • Implementation: Customizable Set of Pipeline Components
      • Pipeline stages interact with Resource Pool
      • Pipeline defined through Instruction Schedules
        • Each instruction type defines what resources it needs in a particular stage
        • If an instruction can’t complete all its resource requests in one stage, it blocks the pipeline

    121

  • O3 CPU Model src/cpu/o3/*.hh,cc

    • Detailed out-of-order CPU
    • Default 7-stage pipeline
      • Fetch, Decode, Rename, IEW, Commit
      • IEW: Issue, Execute, and Writeback
      • Model varying numbers of pipeline stages by changing delays between pipeline stages (e.g. fetchToDecodeDelay)
    • Key Resources
      • Physical Register (PR) File, IQ, LSQ, ROB, Functional Unit (FU) Pool
    • Key Parameters
      • Interstage pipeline delays, Hardware threads, IQ/LSQ/ROB/PR entries, FU Delays
    • Other Key Features
      • Support for CISC decoding (e.g. x86)
      • Renaming with a Physical Register (PR) File
      • Functional units with varying latencies
      • Branch Prediction
      • Memory dependence prediction

    122

  • CPU Model Internals src/cpu/*

    • A key reason that the CPU Models are “hot pluggable” into gem5 is that the CPUs share common components and interfaces within the simulator
      • Parameter Definition
      • Shared Components
        • Branch Predictors, TLBs, ISA decoding, Interrupt Handlers
      • TimeBuffer-Based Communication
      • External Interfaces
        • System: ThreadContext
        • ISA: StaticInst and DynInst
        • Memory: Ports, {send/recv}Timing

    123

  • CPU Internals - Parameters src/cpu/{simple/inorder/o3}*.py

    • Parameters are defined in a *.py in each CPU’s directory
      • e.g. The contents of src/cpu/inorder/InOrderCPU.py are shown below:

    class InOrderCPU(BaseCPU):
        type = 'InOrderCPU'
        ...
        cachePorts = Param.Unsigned(2, "Cache Ports")
        stageWidth = Param.Unsigned(4, "Stage width")
        ...
        icache_port = Port("Instruction Port")
        dcache_port = Port("Data Port")
        ...
        predType = Param.String("tournament", "Branch predictor type ('local', 'tournament')")

    • Use in your configuration scripts

    ...
    cpu = InOrderCPU()
    cpu.stageWidth = 2
    ...
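
    The declare-with-default, override-in-script pattern above can be imitated without gem5. The sketch below is a stand-alone approximation: the Param and SimObject classes here are simplified stand-ins written for this example (gem5's real classes do type checking and much more), and rejecting unknown names is shown as an assumption about the intended behaviour.

```python
class Param:
    """Stand-in for a declared parameter: a default plus a description."""

    def __init__(self, default, desc):
        self.default = default
        self.desc = desc


class SimObject:
    """Stand-in base class that turns Param declarations into attributes."""

    def __init__(self, **overrides):
        for name, param in self._params().items():
            setattr(self, name, param.default)
        for name, value in overrides.items():
            setattr(self, name, value)

    @classmethod
    def _params(cls):
        # Collect declarations from base classes first so that
        # subclasses can redefine a default.
        found = {}
        for klass in reversed(cls.__mro__):
            for name, value in vars(klass).items():
                if isinstance(value, Param):
                    found[name] = value
        return found

    def __setattr__(self, name, value):
        if name not in self._params():
            raise AttributeError("unknown parameter %r" % name)
        object.__setattr__(self, name, value)


class BaseCPU(SimObject):
    cpu_id = Param(0, "CPU identifier")


class InOrderCPU(BaseCPU):
    cachePorts = Param(2, "Cache Ports")
    stageWidth = Param(4, "Stage width")
    predType = Param("tournament", "Branch predictor type")
```

    With these stand-ins, cpu = InOrderCPU() followed by cpu.stageWidth = 2 behaves like the configuration snippet on the slide, and a misspelled parameter name fails immediately instead of being silently ignored.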

    124

  • CPU Internals - Time Buffers src/base/timebuf.hh

    • Similar to queues
    • Are advance()’d each CPU cycle
      • Each pipeline stage places information into time buffer
      • Next stage reads from time buffer by indexing into appropriate cycle
    • Used for both forwards and backwards communication
      • Avoids unrealistic interaction between pipeline stages
    • Time buffer class is templated
      • Its template parameter is the communication struct between stages
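
    A time buffer along these lines can be sketched in Python as a small ring of per-cycle slots. This is an illustrative model, not a port of src/base/timebuf.hh: the constructor arguments, the indexing by relative cycle offset, and the factory callable (standing in for the template parameter) are assumptions made for the example.

```python
class TimeBuffer:
    """Ring of per-cycle slots, indexed relative to the current cycle.

    Offset 0 is the current cycle, -1 is one cycle ago, +1 is next cycle.
    """

    def __init__(self, past, future, factory):
        self.past = past
        self.future = future
        self.factory = factory  # builds an empty communication struct
        self.slots = [factory() for _ in range(past + future + 1)]
        self.base = 0  # ring index of the current cycle

    def _index(self, offset):
        if not -self.past <= offset <= self.future:
            raise IndexError("offset outside the buffer's time window")
        return (self.base + offset) % len(self.slots)

    def __getitem__(self, offset):
        return self.slots[self._index(offset)]

    def advance(self):
        """Move to the next cycle; the oldest slot is recycled as the
        newest one and starts out empty."""
        self.base = (self.base + 1) % len(self.slots)
        self.slots[self._index(self.future)] = self.factory()
```

    A producer stage writes into buf[0] in its cycle; a consumer that is n cycles downstream reads buf[-n] after n calls to advance(), so a stage can never see data earlier than the modelled delay allows.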

    125

  • Time Buffer Communication

    • Demonstrated on out-of-order pipeline ...
    • Red is a time buffer

    [Figure: pipeline stages Fetch, Decode, Rename, Issue, Execute, Writeback, Commit connected by time buffers, with backwards communication between stages]

    126

  • CPU Interfaces - ThreadContext src/cpu/thread_context.hh

    • Interface for accessing total architectural state of a single thread
      • PC, register values, etc.
    • Used to obtain pointers to key classes
      • CPU, process, system, ITB, DTB, etc.
    • Abstract base class
      • Each CPU model must implement its own derived ThreadContext

    127

  • CPU Interfaces - StaticInst Class src/cpu/static_inst.{hh,cc}

    • Represents a decoded instruction
      • Has classifications of the inst
      • Corresponds to the binary machine inst
      • Only has static information
    • Has all the methods needed to execute an instruction
      • Tells which regs are source and dest
      • Contains the execute() function
      • ISA parser generates execute() for all insts

    128

  • CPU Interfaces - DynInst Class src/cpu/base_dyn_inst.{hh,cc}

    • Dynamic version of StaticInst
      • Used to hold extra information for detailed CPU models
    • BaseDynInst
      • Holds PC, Results, Branch Prediction Status
      • Interface for TLB translations
    • InOrderDynInst - src/cpu/inorder/dyn_inst.{hh,cc}
      • Holds current status of an instruction’s request to a resource
      • Manages each instruction’s pipeline schedule
    • O3DynInst - src/cpu/o3/dyn_inst.{hh,cc}
      • Holds Status of Renamed Registers
      • Interfaces to the IQ, LSQ and ROB

    129

  • Ruby Memory System

    Ruby Memory System

    Derek Hower

    University of Wisconsin, Madison

    131

  • Outline

    • Feature Overview
    • Rich Configuration
    • Rapid Prototyping
      • SLICC
      • Modular & Detailed Components
    • Lifetime of a Ruby memory request

    132

  • Feature Overview

    • Flexible Memory System
      • Rich configuration - Just run it
        • Simulate combinations of caches, coherence, interconnect, etc...
      • Rapid prototyping - Just create it
        • Domain-Specific Language (SLICC) for coherence protocols
        • Modular components
    • Detailed statistics
      • e.g., Request size/type distribution, state transition frequencies, etc...
    • Detailed component simulation
      • Network (fixed/flexible pipeline and simple)
      • Caches (Pluggable replacement policies)
      • Memory (DDR2)

    133

  • Rich Configuration - Just run it

    • Can build many different memory systems
      • CMPs, SMPs, SCMPs
      • 1/2/3 level caches
      • Pt2Pt/Torus/Mesh topologies
      • MESI/MOESI coherence
    • Each component is individually configurable
      • Build heterogeneous cache architectures (new)
      • Adjust cache sizes, bandwidth, link latencies, etc.
    • Get research started without modifying code!

    134

  • Configuration Examples

    1 8-core CMP, 2-level, MESI protocol, 32K L1s, 8MB 8-banked L2s, crossbar interconnect
      • scons build/ALPHA_FS/gem5.opt PROTOCOL=MESI_CMP_directory RUBY=True
      • ./build/ALPHA_FS/gem5.opt configs/example/ruby_fs.py -n 8 --l1i_size=32kB --l1d_size=32kB --l2_size=8MB --num-l2caches=8 --topology=Crossbar --timing

    2 64-socket SMP, 2-level on-chip caches, MOESI protocol, 32K L1s, 8MB L2 per chip, mesh interconnect
      • scons build/ALPHA_FS/gem5.opt PROTOCOL=MOESI_CMP_directory RUBY=True
      • ./build/ALPHA_FS/gem5.opt configs/example/ruby_fs.py -n 64 --l1i_size=32kB --l1d_size=32kB --l2_size=512MB --num-l2caches=64 --topology=Mesh --timing

    • Many other configuration options
    • Protocols only work with specific architectures (see wiki)
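
When sweeping several configurations, it can help to assemble the command lines above programmatically. The following is a minimal sketch: the helper name `ruby_fs_command` and its parameter names are hypothetical, and it only uses the flags that appear in the examples above.

```python
def ruby_fs_command(binary, cores, l1_size, l2_size, num_l2, topology):
    """Return the argv list for a ruby_fs.py run (hypothetical helper)."""
    return [
        binary, "configs/example/ruby_fs.py",
        "-n", str(cores),
        "--l1i_size=" + l1_size,
        "--l1d_size=" + l1_size,
        "--l2_size=" + l2_size,
        "--num-l2caches=" + str(num_l2),
        "--topology=" + topology,
        "--timing",
    ]

# Rebuild the command line of configuration example 1.
cmd = ruby_fs_command("./build/ALPHA_FS/gem5.opt", 8, "32kB", "8MB", 8, "Crossbar")
print(" ".join(cmd))
```

The list form can be passed directly to `subprocess.run` without shell quoting.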

    135

  • Rapid Prototyping - Just create it

    • Modular construction
      • Coherence controller (SLICC)
      • Cache (C++)
        • Replacement Policy (C++)
      • DRAM (C++)
      • Topology (Python)
      • Network implementation (C++)
    • Debugging support

    136

  • SLICC: Specification Language for Implementing Cache Coherence

    • Domain-Specific Language
      • Syntactically similar to C/C++
      • Like HDLs, constrains operations to be hardware-like (e.g., no loops)
    • Two generation targets
      • C++ for simulation
        • Coherence controller object
      • HTML for documentation
        • Table-driven specification (State x Event -> Actions & next state)
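
The table-driven idea (State x Event -> Actions & next state) can be illustrated with a small Python model. This is only a sketch of the concept, not SLICC-generated code; the states, events, and action names in the table are illustrative.

```python
# Table-driven controller: (state, event) -> (actions, next_state).
# Entries are illustrative, not taken from generated SLICC output.
TABLE = {
    ("I", "Load"): (["issue_request"], "IS"),
    ("IS", "Data"): (["write_data", "load_hit"], "M"),
    ("M", "Load"): (["load_hit"], "M"),
}

def step(state, event, log):
    """Look up the transition, record its actions in order, return the next state."""
    if (state, event) not in TABLE:
        raise KeyError("no transition for %s x %s" % (state, event))
    actions, next_state = TABLE[(state, event)]
    log.extend(actions)
    return next_state

log = []
state = "I"
for event in ["Load", "Data", "Load"]:
    state = step(state, event, log)
print(state, log)
```

A missing (state, event) pair raises an error here, which mirrors how an unspecified transition is treated as a protocol bug rather than silently ignored.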

    137

  • SLICC Protocol Structure

    • Collection of Machines, e.g.
      • L1 Controller
      • L2 Controller
      • DRAM Controller
    • Machines are connected through network ports (different than MemPorts)
    • Network can be an arbitrary topology

    138

  • Machine Structure

    • Machines are (logically) per-block
    • Consist of:
      • Ports - Interface to the world
      • States - Both stable and transient
      • Events - Triggered by incoming messages
      • Transitions - Old state x Event -> New state
      • Actions - Occur atomically during transition, e.g., send/receive messages from network

    139

  • MI Example

    [Figure: CPUs attached to L1 Caches, which connect to Directories through a Pt-to-Pt Interconnect]

    • Single-level coherence protocol
      • 2 controller types – Cache + Directory
    • Cache Controller
      • 2 stable states: Modified (a.k.a. Valid), Invalid
    • Directory Controller [Not Shown]
      • 2 stable states: Modified (Present in cache), Valid
    • 3 virtual networks (request, response, forward)
    • See src/mem/ruby/protocols/MI_example.*

    140

  • MI Example - L1 Cache Controller: Machine structure

    Machine Pseudo-Code

    machine(L1Cache, "MI Example L1 Cache")
    : Sequencer * sequencer,  // parameters to the machine object (set at initialization)
      CacheMemory * cacheMemory,
      int cache_response_latency = 12,
      int issue_latency = 2
    {
      // 3 virtual channels to/from network + connection to CPU
      // M, I, & Load, Store, etc.
      // e.g., RequestMessage::GETX -> Fwd_GETX
      // e.g., issueRequest
      // e.g., I x Store -> M
    }

    141

  • MI Example - L1 Cache Controller: Defining a machine interface

    Interface to the network

    // MessageBuffers - opaque C++ communication queues
    MessageBuffer requestFromCache, network="To", virtual_network="0", ordered="true";
    MessageBuffer responseFromCache, network="To", virtual_network="1", ordered="true";
    MessageBuffer forwardToCache, network="From", virtual_network="2", ordered="true";
    MessageBuffer responseToCache, network="From", virtual_network="1", ordered="true";

    // out_port - map request type to outgoing message buffer
    out_port(requestNetwork_out, RequestMsg, requestFromCache);
    out_port(responseNetwork_out, ResponseMsg, responseFromCache);

    // in_port - map request type to incoming message buffer
    // and produce code to accept incoming messages
    in_port(forwardRequestNetwork_in, RequestMsg, forwardToCache) { ... }
    in_port(responseNetwork_in, ResponseMsg, responseToCache) { ... }

    Interface to a CPU

    // The other end of mandatoryQueue attaches to Sequencer
    MessageBuffer mandatoryQueue, ordered="false";
    in_port(mandatoryQueue_in, RubyRequest, mandatoryQueue, desc="...") { ... }
    // There is no corresponding out_port - handled with hitCallback

    142

  • MI Example - L1 Cache Controller: Declaring States

    State Declaration

    // STATES
    state_declaration(State, desc="Cache states") {
      // Stable States
      I, AccessPermission:Invalid, desc="Not Present/Invalid";
      M, AccessPermission:Read_Write, desc="Modified";

      // Transient States
      II, AccessPermission:Busy, desc="Not Present/Invalid, issued PUT";
      MI, AccessPermission:Busy, desc="Modified, issued PUT";
      MII, AccessPermission:Busy, desc="Modified, issued PUTX, received nack";
      IS, AccessPermission:Busy, desc="Issued request for LOAD/IFETCH";
      IM, AccessPermission:Busy, desc="Issued request for STORE/ATOMIC";
    }

    143

  • MI Example - L1 Cache Controller: Declaring Events

    Event Declaration

    // EVENTS
    enumeration(Event, desc="Cache events") {
      // From processor
      Load, desc="Load request from processor";
      Ifetch, desc="Ifetch request from processor";
      Store, desc="Store request from processor";

      // From network (directory)
      Data, desc="Data from network";
      Fwd_GETX, desc="Forward from network";
      Inv, desc="Invalidate request from dir";
      Writeback_Ack, desc="Ack from the directory for a writeback";
      Writeback_Nack, desc="Nack from the directory for a writeback";

      // Internally generated
      Replacement, desc="Replace a block";
    }

    144

  • MI Example - L1 Cache Controller: Mapping messages to events

    • Mapping occurs in in_port declaration.
      • peek(in_port, message_type)
        • Sets variable in_msg to head of in_port queue.
      • trigger(Event, address)

    Event mapping

    in_port(forwardRequestNetwork_in, RequestMsg, forwardToCache) {
      if (forwardRequestNetwork_in.isReady()) {
        peek(forwardRequestNetwork_in, RequestMsg) {
          if (in_msg.Type == CoherenceRequestType:GETX) {
            trigger(Event:Fwd_GETX, in_msg.Address);
          }
          ...
        }
      }
    }
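
The isReady / peek / trigger pattern can be mimicked in plain Python to show the control flow. This is a sketch under stated assumptions: `InPort`, `forward_request_in`, and the dictionary message format are hypothetical stand-ins, not Ruby classes.

```python
from collections import deque

class InPort:
    """Stand-in for a SLICC in_port backed by a message queue."""
    def __init__(self):
        self.queue = deque()

    def is_ready(self):
        return len(self.queue) > 0

    def peek(self):
        # Return the head of the queue without removing it.
        return self.queue[0]

def forward_request_in(port, trigger):
    """Map the message at the head of the port to an event, if one is waiting."""
    if port.is_ready():
        in_msg = port.peek()
        if in_msg["Type"] == "GETX":
            trigger("Fwd_GETX", in_msg["Address"])

fired = []
record = lambda event, address: fired.append((event, address))

port = InPort()
forward_request_in(port, record)                       # empty port: nothing fires
port.queue.append({"Type": "GETX", "Address": 0x40})
forward_request_in(port, record)                       # GETX maps to Fwd_GETX
print(fired)
```

Note that peek leaves the message in the queue; in the protocol, removing it is the job of an explicit pop action in the triggered transition.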

    145

  • MI Example - L1 Cache Controller: Defining Transitions

    • transition(Starting State(s), Event, [Ending State]) [ { Actions } ]

    Transition sequence for new Store request

    transition(I, Store, IM) {
      v_allocateTBE;  // allocate TBE (a.k.a. MSHR) on transition to transient state
      i_allocateL1CacheBlock;
      a_issueRequest;
      m_popMandatoryQueue;
    }

    transition(IM, Data, M) {
      u_writeDataToCache;
      s_store_hit;
      w_deallocateTBE;  // deallocate TBE on transition back to stable state
      n_popResponseQueue;
    }
    ...

    146

  • MI Example - L1 Cache Controller: Defining Actions

    • action(name, abbrev, [desc]) { implementation }
    • Two special functions available in action
      • peek(in_port, message_type) { use in_msg }
        • assigns in_msg to message at head of port
      • enqueue(out_port, message_type, [options]) { set out_msg }
        • enqueues out_msg on out_port
    • Special variable address is available inside an action block
      • Set to the address associated with the event that caused the calling transition

    Example Action Definition

    action(e_sendData, "e", desc="Send data from cache to requestor") {
      peek(forwardRequestNetwork_in, RequestMsg) {
        enqueue(responseNetwork_out, ResponseMsg, latency=cache_response_latency) {
          out_msg.Address := address;
          out_msg.Type := CoherenceResponseType:DATA;
          out_msg.Sender := machineID;
          out_msg.Destination.add(in_msg.Requestor);  // uses in_msg set by peek
          out_msg.DataBlk := cacheMemory[address].DataBlk;
          out_msg.MessageSize := MessageSizeType:Response_Data;
        }
      }
    }
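
The following Python sketch mirrors the shape of e_sendData: read the requestor from the peeked message, build a response, and enqueue it with a latency. The `OutPort` class and the dictionary messages are hypothetical, and treating latency as "cycles until the message becomes visible" is an assumption made for illustration. The default of 12 comes from cache_response_latency in the machine pseudo-code.

```python
import heapq

class OutPort:
    """Stand-in for an out_port: messages become visible after a latency."""
    def __init__(self):
        self.pending = []   # heap of (ready_time, sequence_number, message)
        self.seq = 0

    def enqueue(self, now, latency, msg):
        heapq.heappush(self.pending, (now + latency, self.seq, msg))
        self.seq += 1

def e_send_data(now, address, in_msg, cache, out_port, machine_id, latency=12):
    """Send the cached block at `address` to whoever sent `in_msg`."""
    out_msg = {
        "Address": address,
        "Type": "DATA",
        "Sender": machine_id,
        "Destination": [in_msg["Requestor"]],   # uses in_msg, as set by peek
        "DataBlk": cache[address],
    }
    out_port.enqueue(now, latency, out_msg)

cache = {0x40: b"\x2a" * 64}
port = OutPort()
e_send_data(100, 0x40, {"Requestor": "L1Cache-1"}, cache, port, "L1Cache-0")
ready_time, _, msg = port.pending[0]
print(ready_time, msg["Destination"])
```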

    147

  • MI Example - L1 Cache ControllerTransition Table

    148

  • MI Example: Connecting SLICC Machines with a Topology

    Creating the Topology – Not In SLICC
    src/mem/ruby/network/topologies/Pt2Pt.py

    # returns a SimObject for a Pt2Pt Topology
    def makeTopology(nodes, options, IntLink, ExtLink, Router):
        # Create an individual router for each controller (node),
        # and connect them (ext_links)
        routers = [Router(router_id=i) for i in range(len(nodes))]
        ext_links = [ExtLink(link_id=i, ext_node=n, int_node=routers[i])
                     for (i, n) in enumerate(nodes)]
        link_count = len(nodes)

        # Connect routers all-to-all (int_links)
        int_links = []
        for i in range(len(nodes)):
            for j in range(len(nodes)):
                if (i != j):
                    link_count += 1
                    int_links.append(IntLink(link_id=link_count,
                                             node_a=routers[i],
                                             node_b=routers[j]))

        # Return Pt2Pt Topology SimObject
        return Pt2Pt(ext_links=ext_links,
                     int_links=int_links,
                     routers=routers)
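
To see what the all-to-all construction produces without running gem5, the same loops can be exercised with plain stand-in types. In this sketch the namedtuples replace the real Router, ExtLink, and IntLink SimObjects, and `make_pt2pt` is a hypothetical name; only the link-building logic is carried over.

```python
from collections import namedtuple

# Stand-ins for the gem5 SimObject classes (assumption: only these fields matter here).
Router = namedtuple("Router", "router_id")
ExtLink = namedtuple("ExtLink", "link_id ext_node int_node")
IntLink = namedtuple("IntLink", "link_id node_a node_b")

def make_pt2pt(nodes):
    # One router per controller, joined to it by an external link.
    routers = [Router(router_id=i) for i in range(len(nodes))]
    ext_links = [ExtLink(link_id=i, ext_node=n, int_node=routers[i])
                 for (i, n) in enumerate(nodes)]
    link_count = len(nodes)

    # One internal link per ordered pair of distinct routers.
    int_links = []
    for i in range(len(nodes)):
        for j in range(len(nodes)):
            if i != j:
                link_count += 1
                int_links.append(IntLink(link_id=link_count,
                                         node_a=routers[i],
                                         node_b=routers[j]))
    return routers, ext_links, int_links

routers, ext_links, int_links = make_pt2pt(["L1-0", "L1-1", "Dir-0", "Dir-1"])
print(len(routers), len(ext_links), len(int_links))
```

For n controllers this gives n routers, n external links, and n(n-1) internal links, so the internal link count grows quadratically with system size.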

    149

  • Using C++ Objects in SLICC

    • SLICC can be arbitrarily extended with C++ objects
      • e.g., Interface with a new message filter
    • Steps:
      • Create class in C++
      • Declare interface in SLICC with structure, external="yes"
      • Initialize object in machine
      • Use!

    Extending SLICC

    // MessageFilter.h
    class MessageFilter {
      public:
        MessageFilter(int param1);

        // returns 1 if message should be filtered
        int filter(RequestMsg msg);
    };

    // MessageFilter.cc
    int MessageFilter::filter(RequestMsg msg)
    {
        ...
        return 0;
    }

    // MI_example-cache.sm
    structure(MessageFilter, external="yes") {
      int filter(RequestMsg);
    };

    MessageFilter requestFilter, constructor_hack="param";

    action(af_allocateUnlessFiltered, "af") {
      if (requestFilter.filter(in_msg) != 1) {
        cacheMemory.allocate(address, new Entry);
      }
    }

    150

  • Detailed Component Simulation: Caches

    • Set-Associative Caches
      • Each CacheMemory object represents one bank of cache
      • Configurable bit select for indexing
      • Modular replacement policy
        • Tree-based pseudo-LRU
        • LRU
    • See src/mem/ruby/system/CacheMemory.hh
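
Bit-select indexing and LRU replacement can be sketched in a few lines of Python. This is an illustrative model only, not the CacheMemory implementation; the class name and the parameters `block_bits` and `start_index_bit` are assumptions chosen for the example.

```python
from collections import OrderedDict

class SetAssocCache:
    """Toy set-associative cache bank with bit-select indexing and LRU."""
    def __init__(self, num_sets, assoc, block_bits=6, start_index_bit=6):
        self.num_sets = num_sets
        self.assoc = assoc
        self.block_bits = block_bits            # 6 bits -> 64-byte lines
        self.start_index_bit = start_index_bit  # lowest address bit used for the index
        # Each set keeps its lines ordered from least to most recently used.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def line_address(self, addr):
        return (addr >> self.block_bits) << self.block_bits

    def set_index(self, addr):
        return (addr >> self.start_index_bit) % self.num_sets

    def access(self, addr):
        """Return True on a hit; on a miss, allocate and evict the LRU line if full."""
        line = self.line_address(addr)
        ways = self.sets[self.set_index(addr)]
        if line in ways:
            ways.move_to_end(line)      # mark as most recently used
            return True
        if len(ways) == self.assoc:
            ways.popitem(last=False)    # evict least recently used
        ways[line] = True
        return False

cache = SetAssocCache(num_sets=2, assoc=2)
# 0x000, 0x080 and 0x100 all map to set 0 with these parameters.
results = [cache.access(a) for a in [0x000, 0x080, 0x000, 0x100, 0x080]]
print(results)
```

The re-access of 0x000 makes 0x080 the least recently used line, so the later fill of 0x100 evicts 0x080 and the final access to it misses.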

    151

  • Detailed Component Simulation: Memory

    • Memory controller models a single-channel DDR2 controller
    • Implements closed-page policy
    • Can configure ranks, tCAS, refresh, etc.
    • See src/mem/ruby/system/MemoryController.hh

    152

  • Detailed Component Simulation: Network

    • Simple Network
      • Idealized routers - fixed latency, no internal resources
      • Does model link bandwidth
    • Garnet Network
      • Detailed routers - both fixed and flexible pipeline model
      • From Princeton, MIT
    • See src/mem/ruby/network/*

    153

  • Ruby Debugging Support

    • Random testing support
      • Stresses protocol by inserting random timing delays
    • Support for coherence transition tracing
    • Frequent assertions
    • Deadlock detection
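
One common way to implement deadlock detection is a watchdog that flags any request outstanding for longer than a threshold. The sketch below assumes that approach; the function name, the dictionary of issue times, and the threshold value are all hypothetical and are not taken from the Ruby source.

```python
def find_stuck_requests(outstanding, now, threshold):
    """Return addresses whose requests have been outstanding longer than threshold.

    outstanding maps address -> cycle at which the request was issued.
    """
    return sorted(addr for addr, issued in outstanding.items()
                  if now - issued > threshold)

outstanding = {0x40: 100, 0x80: 4000}
stuck = find_stuck_requests(outstanding, now=5200, threshold=5000)
print([hex(a) for a in stuck])
```

A periodic check like this cannot prove a deadlock, but a request that has waited far longer than any legitimate miss latency is a strong hint that the protocol has stalled.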

    154

  • Lifetime of a Ruby Memory Request

    1 Request enters through RubyPort::recvTiming, is converted to RubyRequest, and passed to Sequencer.
    2 Request enters SLICC controllers through Sequencer::makeRequest via mandatoryQueue.
    3 Message on mandatoryQueue triggers an event in L1 Controller.
    4 Until request is completed:
      1 (Event, State) is matched to a transition.
      2 Actions