
Page 1

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

SST Overview

Genie Hsieh, Arun Rodrigues
Sandia National Labs

Page 2

Why SST?

Page 3

View of the Simulation Problem

Multiple audiences: application writers, purchasers, designers; system procurement, algorithm co-design, architecture research, language research; network, processor, and system views; present systems and future systems.

Scale: many cores + memory, many many nodes, many many many threads.

Complexity: multi-physics apps, informatics apps, communication libraries, run-times, OS effects, existing languages and new languages.

Constraints: performance, cost, power, reliability, cooling, usability, risk, size.

Page 4

Worldwide Impact

"Total power used by servers [in 2005] represented ... an amount comparable to that for color televisions." -- ESTIMATING TOTAL POWER CONSUMPTION BY SERVERS IN THE U.S. AND THE WORLD, Jonathan G. Koomey

3741e9 kW-hrs  Total US power consumption
* 3-4%         Used by computers (>2% servers, >1% household computer use)
= 112-150e9 kW-hrs  US computer power consumption
* $0.10/kW-hr  Retail cost, US average 2009
= $11-$15 billion US$ in compute power
* 3-5          (in 2005 the US was roughly 1/3 of world servers, by power)
= $33-$75 billion US$ in worldwide computer power
* 15-35%       DRAM memory power
= $5-$25 billion US$ in DRAM power

Page 5

Major Simulation Challenges
• Multiple Objectives
  – Performance used to be the only criterion
  – Now: energy, cost, power, reliability, etc.
• Scale & Detail
  – Many system characteristics require detail to measure
  – Detailed simulation takes too long (10^4-10^5x slower than real time)
• Accuracy
  – Systems more complex
  – Vendors don't reveal necessary details


Page 7

What Is the SST?

Page 8

SST Simulation Project Overview

Goals
• Become the standard architectural simulation framework for HPC
• Be able to evaluate future systems on DOE workloads
• Use supercomputers to design supercomputers

Technical Approach
• Parallel: parallel discrete event core with conservative optimization over MPI
• Holistic: integrated technology models for power (McPAT, Sim-Panalyzer)
• Multiscale: detailed and simple models for processor, network, and memory
• Open: open core, non-viral, modular

Consortium
• "Best of breed" simulation suite
• Combine lab, academic, & industry

Status
• Current release (2.1) at code.google.com/p/sst-simulator/
• Includes parallel simulation core, configuration, power models, basic network and processor models, and an interface to a detailed memory model

Page 9

Parallel Implementation
• Implemented over MPI
• Configuration, partitioning, initialization handled by core
• Conservative, distance-based optimization (sketched below)
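To make the conservative, distance-based scheme concrete, here is a minimal sketch (not the SST core code) of one synchronization round: the "distance" is the minimum latency of any link that crosses a rank boundary, and an MPI_Allreduce establishes how far every rank may safely simulate. The function and parameter names are illustrative assumptions.

#include <mpi.h>

// Hypothetical sketch: compute the end of the next safe window.
// All times are in the core's base time units.
unsigned long long safeWindowEnd(unsigned long long nextLocalEventTime,      // earliest pending event on this rank
                                 unsigned long long minCrossRankLinkLatency) // smallest latency of any inter-rank link
{
    // Nothing this rank does can affect another rank earlier than this bound.
    unsigned long long localBound  = nextLocalEventTime + minCrossRankLinkLatency;
    unsigned long long globalBound = 0;
    MPI_Allreduce(&localBound, &globalBound, 1,
                  MPI_UNSIGNED_LONG_LONG, MPI_MIN, MPI_COMM_WORLD);

    // Every rank may execute all events with timestamps below globalBound
    // without waiting for messages from other ranks.
    return globalBound;
}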


Page 11

Message Handling
• SST core transparently handles message delivery
• Detects whether the destination is local or remote
• Local messages delivered to local queues
• Remote messages stored for later serialization and remote delivery
  – Boost Serialization Library used for message serialization
  – MPI used for transfer
• Ranks synchronize based on partitioning (a sketch of the local/remote dispatch follows)
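A minimal sketch of that local/remote split, assuming a per-destination send queue that is flushed over MPI at the next synchronization point; the Event type and queue names are illustrative, and the Boost usage follows the standard binary_oarchive idiom rather than the exact SST core code.

#include <boost/archive/binary_oarchive.hpp>
#include <map>
#include <queue>
#include <sstream>
#include <string>
#include <vector>

struct Event {                                  // stand-in for SST::Event
    int payload = 0;
    template <class Archive>
    void serialize(Archive& ar, const unsigned int) { ar & payload; }
};

void deliver(const Event& ev, int destRank, int myRank,
             std::queue<Event>& localQueue,
             std::map<int, std::vector<std::string>>& remoteSendQueues)
{
    if (destRank == myRank) {
        localQueue.push(ev);                    // local: straight onto the local event queue
    } else {
        std::ostringstream os;                  // remote: serialize now with Boost,
        boost::archive::binary_oarchive oa(os); // buffer, and ship via MPI at the
        oa << ev;                               // next synchronization point
        remoteSendQueues[destRank].push_back(os.str());
    }
}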

Page 12

Multi-Scale
• Goal: enable tradeoffs between accuracy, flexibility, and simulation speed
  – No single "right" way to simulate
  – Support multiple audiences
• High- & low-level interfaces
  – Allows multiple input types
  – Allows multiple input sources: traces, stochastic, state-machines, execution...

Multiscale Parameters

                     High-Level                           Low-Level
Detail               Message                              Instruction
Fundamental Objects  Message, Compute block, Process      Instruction, Thread
Static Generation    MPI Traces, MA Traces                Instruction Trace
Dynamic Generation   State Machine                        Execution

Page 13

Simulator Structure

[Diagram: the Simulator Core provides Parallel DES, MPI, Checkpointing, Statistics, Power/Area/Cost, Configuration, and Services; vendor and open components plug into the core.]

Page 14

SST/GeM5 Integration
• Goals: run GeM5 in parallel, connect with other SST components, with high parallel efficiency
• Changes
  – Replaced Python-based configuration with an XML- or C++-based system
  – Encapsulated GeM5 as an SST Component; GeM5 event queue driven by the SST clock
  – Created translator SimObjects to connect to SST links
  – Changed the GeM5 loader to avoid use of async (untimed, out-of-band) messages

[Block diagram: a gem5 SST Component containing CPU, L1 caches, bus, L2, I/O bridge, PhysMem, I/O bus, and syscall handler; a Translator connects it to a Portals NIC and SS Router, and to a MemBus + Memory backed by DRAMSim.]

Page 15

GeM5
• M5: modular platform for computer system architecture research, encompassing system-level architecture as well as processor microarchitecture
• Provides detailed, full-system CPU models for x86, ARM, SPARC, Alpha
• Integrated as an SST Component; allows interaction with SST models and parallel execution
• Currently tested up to 256 nodes

[Diagram: gem5 SimObjects and Ports run inside an SST::Component with their own gem5 event queue; SST::Links connect that component to other SST::Components through the SST queue.]

Page 16

Diversity!

[Chart: SST components (genericProc, Macro, M5 O3, RS Router, DRAMSim, Stochastic, simpleRouter, ResiliencySim, scheduleSim, Zesto, MacSim) arranged by complexity/detail versus execution time.]

Commonalities
• Discrete event or time stepped
• Amenable to event counting for power modeling

Page 17

Simulation Process

Step  Action                               Example
1     Input problem description            Skeleton app, mini app, compact app
2     Input hardware description           (Vendor) roadmap, novel architecture
3     Run model                            Detailed model, abstract model, vendor model
4     Read results                         Performance, energy, cost, reliability, etc.
5     Refine problem/hardware description  Rewrite proxy app, use more detailed model
6     Goto 1                               Increase level of detail & repeat


Page 19

Component Library
• Parallel Core v2
  – Parallel DES layered on MPI
  – Partitioning
  – Configuration & checkpointing
  – Power modeling
• Technology Models
  – McPAT, Sim-Panalyzer, IntSim, Orion, and custom power/energy models
  – HotSpot thermal model
  – Working on reliability models
• Components
  – Processor: Macro Applications, Macro Network, NMSU, genericProc, state-machine, Zesto, GeM5, GPGPU
  – Network: Red Storm, simpleRouter, GeM5
  – Memory: DRAMSim II, Adv. Memory, Flash, SSD, DiskSim

[Diagram: SST Simulator Core (Parallel DES, MPI, Checkpointing, Statistics, Power/Area/Cost, Configuration, Services) surrounded by vendor and open components.]


Page 21

Holistic Simulation
• Design space includes much more than simple performance
• Create a common interface to multiple technology libraries
  – Power/energy
  – Area/timing estimation
• Make it easier for components to model technology parameters

[Diagram: a Component accesses McPAT, Cacti, Sim-Panalyzer, and others through a common technology API.]

Page 22

Open Simulator Framework

[Diagram: Simulator Core with Parallel DES, MPI, Checkpointing, Statistics, Power/Area/Cost, Configuration, and Services; vendor and open components plug in.]

• Simulator Core will provide...
  – Power, area, cost modeling
  – Checkpointing
  – Configuration
  – Parallel component-based discrete event simulation
• Components
  – Ships with a basic set of open components
  – Industry can plug in their own models (under no obligation to share)
• Open source (BSD-like) license
• SVN hosted on Google Code

Page 23

How To Use the SST

Page 24

Where to Get the SST
• Google Code: http://code.google.com/p/sst-simulator/
  – Anonymous checkout available
• BSD-like license
• Directory structure:

/sst-simulator  : Top level
  /deps         : Scripts to help build dependencies
  /doc          : Doxygen
  /sst          : Source code
    /core       : Simulator core (DES)
    /elements   : Simulation models
      /DRAMSimC   : DRAM simulator
      /M5         : GeM5 simulator
      /SS_Router  : Cray SeaStar simulator
      /zesto      : Detailed x86
      /iris       : NoC simulator
      /Many others...

Page 25

Dependencies
• System
  – 64-bit Linux (Debian or RedHat)
  – MacOS > 10.4
• Libraries
  – MPI distribution (Open MPI or MPICH2)
  – Boost 1.43.x
  – ParMETIS 3.1.1 (optional)
  – Zoltan 3.2 (optional)
• Optional components
  – DRAMSim2
  – Disksim (64-bit Linux only)
  – McPAT
  – IntSim
  – ORION
  – HotSpot
  – GeM5 (64-bit Linux only)
• Detailed directions at: http://code.google.com/p/sst-simulator/w/list

Page 26

SST Deps Tarball
• Fast-path dependencies
• Download sstComponents_2012-FEB-23.tar.gz from:
  http://code.google.com/p/sst-simulator/downloads/list
• Untar in $HOME
• cd sst-simulator/deps/bin
• ./sstDependencies.sh cleanBuild
• Uncompresses, patches, and builds:
  – Boost
  – DiskSim
  – DRAMSim 2
  – HotSpot
  – IntSim
  – McPAT
  – Orion
  – GeM5

Page 27

Building the SST
• From top level
• ./autogen.sh
• ./configure <options>
  – ./configure --help for explanations
  – --prefix=DIR
  – --with-gem5=DIR
  – --with-disksim=DIR
• make
• make install

Page 28

Building the SST With MacSim
• Unpack the MacSim source into the SST elements dir
  – cd $SST_TOP/sst/elements
  – svn co https://svn.research.cc.gatech.edu/macsim/trunk macsim
  – cd macsim
• Patch MacSim
  – patch -p0 -i macsim-sst.patch
• Build the SST
  – cd $SST_TOP
  – ./autogen.sh
  – ./configure

Page 29

Running the SST
• Serial
  – sst.x <SDL File>
• Parallel
  – mpirun -n <ranks> sst.x <SDL File>

Page 30

Sample XML Format
• Simple component-based format
• Programmatic construction (C++ or scripted interface) in development

<component id="nic1" weight=0.5>
  <nicmodel>
    <params debug=1 />
    <links>
      <link id="cpu1nicmodel">
        <params lat="1" name="CPU" />
      </link>
      <link id="Router1LocalPort">
        <params lat="1" name="NETWORK" />
      </link>
    </links>
  </nicmodel>
</component>

<component id="router1" weight=0.3>
  <routermodel>
    <params hop_delay="2000" debug=1 />
    <links>
      <link id="Router1LocalPort">
        <params lat="1" name="local" />
      </link>
      <link id="H0">
        <params lat="2" name="west" />
      </link>
    </links>
  </routermodel>
</component>

(Callouts on the slide mark the component, its parameters, and the link specifications.)

Page 31

Sample XML Format
• Named links
  – Endpoints specified
  – Minimum latency specified
• Used in partitioning & optimization
• Variables & parameter blocks

(The NIC/router XML is the same as on the previous slide; the shared link is referenced by name, e.g. <link name="Router1LocalPort">, in both components, while the name attribute inside <params> gives the link's local name within each component.)

<variables>
  <nic_link_lat> 200ns </nic_link_lat>
  <rtr_link_lat> 10ns </rtr_link_lat>
</variables>

<link name=0.cpu2nic port=cpu latency=$nic_link_lat/>

Page 32

GeM5 Example
• Two-level configuration
  – SST configuration
    • One M5 SST::Component per rank
    • Simulation variables (stop time, etc.)
  – M5 configuration
    • M5 sub-components (processors, caches, busses)
    • Uses the same format as SST XML
    • Subcomponent config parameters same as GeM5

<sst>
  <component name=system type=m5C.M5 rank=0 >
    <params>
      <configFile>exampleM5.xml</configFile>
      <debug> 0 </debug>
      <M5debug> none </M5debug>
      <info>yes</info>
      <registerExit>yes</registerExit>
    </params>
  </component>
</sst>

<component name=nid0.cpu0 type=O3Cpu >
  <params include=cpuParams>
    <base.process.cmd>${M5_EXE}</base.process.cmd>
    <base.process.env.0>RT_RANK=0</base.process.env.0>
  </params>
  <link name=nid0.cpu-dcache port=dcache_port/>
  <link name=nid0.cpu-icache port=icache_port/>
</component>

<component name=nid0.cpu0.dcache type=BaseCache >
  <params include=cacheParams>
    <size>65536</size>
  </params>
  <link name=nid0.cpu-dcache port=cpu_side/>
  <link name=nid0.dcache-bus port=mem_side/>
</component>

Page 33

Portals NIC Example
• Tests a large-scale HPC network
  – Processor: state-machine
  – NIC: simplified packet-based
  – Router: detailed flit-based router (Cray SeaStar)
• Tested up to 3M components
• Includes SDL generator

<component name=0.cpu type=portals4_sm.trig_cpu rank=0 >
  <params include=cpu_params>
    <id> 0 </id>
  </params>
  <link name=0.cpu2nic port=nic/>
</component>

<component name=0.nic type=portals4_sm.trig_nic rank=0 >
  <params include=nic_params1,nic_params2>
    <id> 0 </id>
  </params>
  <link name=0.cpu2nic port=cpu/>
  <link name=0.nic2rtr port=rtr/>
</component>

<component name=0.rtr type=SS_router.SS_router rank=0 >
  <params include=rtr_params>
    <id> 0 </id>
  </params>
  <link name=0.nic2rtr port=nic/>
  <link name=xr2r.0.0.1 port=xPos/>
  <link name=xr2r.0.0.0 port=xNeg/>
</component>

(Callouts on the slide mark the library and component in the type attribute and the manual partition via the rank attribute.)

Page 34

MacSim Example
• Each MacSim instance is an SST::Component
• Clock speed, paths to traces, and configuration specified in SDL
• Two modes
  – Standalone
  – w/ DRAMSim2

Standalone MacSim in SST:

<sst>
  <component name=gpu0 type=macsimComponent.macsimComponent>
    <params>
      <paramPath>PARAM_PATH</paramPath>
      <tracePath>TRACE_PATH</tracePath>
      <outputPath>./results/</outputPath>
      <clock>1.4Ghz</clock>
    </params>
  </component>
</sst>

Page 35


MacSim+DRAMSim in SST:

<component name=gpu0 type=macsimComponent.macsimComponent>
  <params>
    <paramPath></paramPath>
    <tracePath></tracePath>
    <outputPath></outputPath>
    <clock>1.4Ghz</clock>
  </params>
  <link name=membus port=bus latency=1ns />
</component>

<component name=mem0 type=DRAMSimC >
  <params>
    <clock> 1066 Mhz </clock>
    <systemini> system-2.ini </systemini>
    <deviceini>GDDR5_hynix_1Gb.ini</deviceini>
  </params>
  <link name=membus port=bus latency=1ns />
</component>

Page 36

SST Internals

Page 37

Key Objects & Interfaces
• Goal: simplicity
• Objects
  – SST::Component: a model of a hardware component
  – SST::Link: a connection between two components
  – SST::Event: a discrete event
  – SST::EventHandler: function to handle an incoming event or clock tick
• Interfaces
  – SST::Component::ConfigureLink(): registers a link and (optionally) a handler
  – SST::Link::Recv(): pull an event from a link
  – SST::Link::Send(): send an event down a link
  – SST::Component::registerClock(): register a clock and handler

[Diagram: two Components, each with an EventHandler, exchanging Events over a Link.]

(A minimal component sketch follows.)
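A minimal component sketch in the style of the Routermodel and Xbar code shown on the next few slides (LinkAdd, EventHandler, Send); the class name Echo, the port name "port0", the Link* type, and the header path are illustrative assumptions, not code from the SST release.

#include <sst/core/component.h>   // assumed header location

class Echo : public Component {
public:
    Echo(ComponentId_t id, Clock* clock, Params_t& params) :
        Component(id, clock)
    {
        // Register one handler for the single port; it is invoked
        // whenever an event arrives on that link.
        handler = new EventHandler<Echo, bool, Time_t, Event*>
                      (this, &Echo::handleEvent);
        port = LinkAdd("port0", handler);
    }

    bool handleEvent(Time_t time, Event* ev)
    {
        port->Send(ev);            // echo the event straight back down the link
        return false;
    }

private:
    Link* port;                    // assumed return type of LinkAdd()
    EventHandler<Echo, bool, Time_t, Event*>* handler;
};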

Page 38

Sample XML Format (recap)
• Named links
• Endpoints specified
• Minimum latency specified
  – Used in partitioning

(Same NIC/router component XML as shown earlier; callouts mark the link name and the link's local name.)

Page 39

Routermodel Constructor
• Gather parameters
• Create event handler
• Add links & register handler to links

Routermodel(ComponentId_t id, Clock* clock, Params_t& params) :
    Component(id, clock), params(params)
{
    Params_t::iterator it = params.begin();
    while (it != params.end()) {
        if (!it->first.compare("hop_delay")) {
            int delay;
            sscanf(it->second.c_str(), "%d", &delay);
            hop_delay = delay * 0.000000000001;
        }
        ...;
        ++it;
    }

    // One handler handles all ports
    Handler = new EventHandler<Routermodel, bool, Time_t, Event *>
                  (this, &Routermodel::handle_events);

    local_port = LinkAdd("local", Handler);  // "local", "west", "east" are the
    west_port  = LinkAdd("west", Handler);   // links' local names from the SDL
    east_port  = LinkAdd("east", Handler);
    ...;
}

Page 40

Simple Event Handler

bool Routermodel::handle_events(Time_t time, Event *event)
{
    NicEvent *e = static_cast<NicEvent *>(event);
    e->router_delay += hop_delay;

    if (e->routeX > 0) {             // Keep going East
        e->routeX--;
        east_port->Send(e);
    } else if (e->routeX == 0) {
        if (e->routeY > 0) {         // Go South
            e->routeY--;
            south_port->Send(e);
        } else if (e->routeY == 0) { // We have arrived!
            local_port->Send(e);
        } else /* e->routeY < 0 */ { // Go North
            e->routeY++;
            north_port->Send(e);
        }
    } else /* e->routeX < 0 */ {     // Keep going West
        e->routeX++;
        west_port->Send(e);
    }

    return false;
}

Page 41

Multiple Ways of Handling Events
• "Poll" a link
  – Link::Recv()
  – Usually called from a clock handler (sketched after the code below)
  – Allows multiple events to be pulled in the same function
• Register an event handler with a link
  – Handler called whenever an event arrives
• Register a clock
  – Called at a regular interval

Constructor {
    cpu = LinkAdd( "port0" );                       // polled link (no handler)

    eventHandler = new EventHandler< Xbar, bool, Time_t, Event* >
                       ( this, &Xbar::processEvent );
    nic = LinkAdd( "port1", eventHandler );         // event-driven link

    clockHandler = new EventHandler< Xbar, bool, Cycle_t, Time_t >
                       ( this, &Xbar::clock );
    ClockRegister( frequency, clockHandler );       // periodic clock
}

....

CompEvent *e = cpu->Recv();                         // poll the link
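The polling path normally lives in the clock handler; the slide does not show the body of Xbar::clock, so the following is a sketch under the assumption that Recv() returns NULL when the link's queue is empty.

bool Xbar::clock(Cycle_t cycle, Time_t time)
{
    // Drain every event that arrived on the polled link since the last tick.
    while (CompEvent *e = cpu->Recv()) {
        processEvent(time, e);     // reuse the handler used by the event-driven link
    }
    return false;
}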

Page 42

Time & Checkpointing
• Time
  – Represented internally as a 64-bit int (default: femtoseconds)
  – All time intervals in configuration are specified in SI units (e.g. 1.5 ns, 3 GHz, etc.)
    • Example: registerClock( "1 GHz", clockHandler );
  – TimeConverter objects provided to convert between different bases
• Checkpointing
  – Failures inevitable at scale
  – Checkpointing must be built in as a "first class" function
  – Uses Boost Serialization Library (load() example below; a matching save() sketch follows)
  – Components dump state to a binary file at a specified interval

Example Checkpoint Function

template<class Archive>
void load(Archive & ar, const unsigned int version)
{
    // serialize base
    ar & BOOST_SERIALIZATION_BASE_OBJECT_NVP(Component);
    // serialize members
    ar & BOOST_SERIALIZATION_NVP(workPerCycle);
    ar & BOOST_SERIALIZATION_NVP(commFreq);
    ar & BOOST_SERIALIZATION_NVP(commSize);
    ar & BOOST_SERIALIZATION_NVP(neighbor);
    ar & BOOST_SERIALIZATION_NVP(N);
    ar & BOOST_SERIALIZATION_NVP(S);
    ar & BOOST_SERIALIZATION_NVP(E);
    ar & BOOST_SERIALIZATION_NVP(W);
    // restore links
    N->setFunctor(eventHandler);
    S->setFunctor(eventHandler);
    E->setFunctor(eventHandler);
    W->setFunctor(eventHandler);
}
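Boost's split load/save idiom needs a matching save(); the following is a sketch assuming this component uses BOOST_SERIALIZATION_SPLIT_MEMBER() and the same members as the load() above (the link functors are re-bound in load(), so save() only writes serializable state).

template<class Archive>
void save(Archive & ar, const unsigned int version) const
{
    // serialize base
    ar & BOOST_SERIALIZATION_BASE_OBJECT_NVP(Component);
    // serialize members
    ar & BOOST_SERIALIZATION_NVP(workPerCycle);
    ar & BOOST_SERIALIZATION_NVP(commFreq);
    ar & BOOST_SERIALIZATION_NVP(commSize);
    ar & BOOST_SERIALIZATION_NVP(neighbor);
    ar & BOOST_SERIALIZATION_NVP(N);
    ar & BOOST_SERIALIZATION_NVP(S);
    ar & BOOST_SERIALIZATION_NVP(E);
    ar & BOOST_SERIALIZATION_NVP(W);
}
BOOST_SERIALIZATION_SPLIT_MEMBER()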


Page 44

Case Studies

Page 45

Sample Results & Uses

[Charts: MiniMD memory power breakdown among NoC, DRAM, L2, and MC (5%, 16%, 51%, 28%); GUPS memory power breakdown (1%, 15%, 59%, 26%); LSQ occupancy in entries for GUPS 35.99, PageRank 9.59, MiniMD 30.44, HPCCG 50.03; average memory latency in nanoseconds for GUPS 414.01, PageRank 60.7, MiniMD 200.32, HPCCG 2206.8.]

Power analysis helps prioritize technology investments.

SST simulation of an MD code shows diminishing returns for threading on small data sets.

Detailed component simulation highlights bottlenecks.

Page 46

Prototyping
• Algorithm/application presented as a skeleton code
• Express communication or memory access pattern
• Easy to change / ignores many details
• Allows exploration of a single system feature
  – E.g. collective performance in the presence of noise (ignore processor & memory, focus on router & NIC)

[Plot: Allreduce time (ns) vs. node count (64-32768) for Host Tree (radix-8), Triggered Tree (radix-16), Recursive Doubling, and Triggered Recursive Doubling, each with and without noise; 1000 ns latency, 10 Mmsgs/s.]

Page 47

Design Space Exploration

Page 48

Design Space Exploration Example
• Design space exploration
  – Inputs
    • Memory technology (DDR2, DDR3, GDDR5)
    • Core width (1, 2, 4, 8-wide issue)
    • Cache size (32/32/1M or 64/64/2M)
  – Outputs: power, performance, cost
• Methodology
  – Performance models: GeM5/x86, DRAMSim2
  – Energy models: DRAMSim2, McPAT
  – Cost models: IC Knowledge
• Key questions
  – What is a good cache size? Core width?
  – Which DRAM technology?
• Example of the sorts of questions simulation can answer

[Diagram: execution-based processor model (small or large cores, small or large caches) with memory controller and DIMMs using a detailed DRAM model (DDR2, DDR3, GDDR5); target apps: HPCCG, Lulesh.]

Page 49

Cache Size
• Larger caches increase processor size, power
• Avg. power increase: 6.75%
• Avg. cost increase: 3.76%
• Avg. performance improvement
  – Lulesh: 1.40%
  – HPCCG: 6.73%
• Conclusion: Lulesh probably wouldn't benefit; HPCCG sees marginal benefit

[Bar chart: effect of larger caches on power, system cost, and performance (Lulesh and HPCCG), in percent.]

Page 50

Which Memory System
• Options
  – DDR2: cheap, low power, antiquated
  – DDR3: higher performance, reasonable power
  – GDDR5: expensive, high power, very fast
• Pure performance:
  – GDDR 26-47% faster than DDR3 (Lulesh)
  – GDDR 32-41% faster than DDR3 (HPCCG)
  – GDDR wins?

[Charts: normalized Lulesh and HPCCG performance vs. processor width (1, 2, 4, 8, average) for DDR2, DDR3, and GDDR5; higher is better.]

Page 51

Energy & Cost
• GDDR: better performance
• DDR3 generally does better on perf/Watt
  – Lulesh: -3% to 107%
  – HPCCG: 0 to 100%
  – GDDR does well at higher processor widths
  – DDR2 sometimes slightly better than DDR3 on HPCCG
• perf/$
  – DDR3 better for narrow cores, GDDR better for wide
  – DDR slightly better overall

[Charts: normalized Lulesh and HPCCG performance per Watt and performance per dollar vs. processor width for DDR2, DDR3, and GDDR5; higher is better.]

Page 52

Processor Width
• A wider processor can issue more instructions/cycle
• Consumes more area, power
  – Cost is super-linear w.r.t. area increase
• Power often increases faster than performance
  – E.g. an 8-wide processor is 78% faster on Lulesh but uses 123% more power
• 1-2 wide cores most power efficient
• 2-4 wide most cost efficient

[Charts: effect of processor width on normalized power and cost (Lulesh and HPCCG).]

Page 53

Design Space Exploration Results
• Fastest memory technology not always best (DDR beats GDDR) due to power, cost
• No "best" processor: depends on the tradeoff between cost, performance, power
• Can provide better understanding of which configurations are best for a given application
• Can be used as a basis for application optimization, vendor guidance

Lulesh Pareto Optimal Designs

Width  Memory  Cache  Power  Performance  Cost
1      DDR3    Small  1.00   1.00         1.0
2      DDR3    Small  1.43   1.65         1.3
2      GDDR5   Small  3.00   2.28         2.0
4      GDDR5   Small  3.57   2.92         2.3
8      GDDR5   Small  5.29   3.62         3.4

Page 54

Future Directions

Page 55

New Components & Increased Integration
• New components in development
  – NVRAM
  – New HPC NIC
  – Interface to Palacios hypervisor
• Increased integration & improvements
  – Parallel GeM5
  – GeM5/IRIS/DRAMSim integration
  – SST/Meso scale
  – Improved power/area/cost models

[Diagram: meso-scale simulation: skeleton apps and traces feed a common interface (memory refs, traces, stochastic blocks, comm. events, memoization, exec. blocks) shared by the Runnemede simulator, NMSU stochastic simulator, M5, MacSim, IRIS network models, and DRAMSim.]

Page 56

Power/Energy Modeling
• Parameterized models turn architectural parameters into energy/event and static dissipation (e.g. leakage energy)
• During simulation, count key events
• Event counts * energy/event = dynamic energy
• Dynamic + static energy = total energy
• Energy / time = power
• ThermalModel(power) = temperature
• Temperature adjusts leakage power

[Diagram: a cache model takes size, associativity, and temperature and produces J/read, J/write, and static power; simulation counts reads, writes, and time; multiplying counts by per-event energy gives energy, which feeds the thermal model.]

(A toy sketch of this flow follows.)
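A toy sketch of that flow (not SST's power interface); the per-event energies, the leakage value, and thermalModel() are illustrative assumptions.

struct CacheEnergyModel {
    double jPerRead  = 0.1e-9;    // assumed J per read
    double jPerWrite = 0.15e-9;   // assumed J per write
    double staticW   = 0.05;      // assumed leakage power at nominal temperature
};

// Hypothetical thermal model: maps average power to a temperature (deg C).
double thermalModel(double watts) { return 40.0 + 30.0 * watts; }

double totalEnergy(const CacheEnergyModel& m,
                   unsigned long long reads, unsigned long long writes,
                   double seconds)
{
    // Event counts * energy/event = dynamic energy
    double dynamicJ = reads * m.jPerRead + writes * m.jPerWrite;
    // Dynamic + static energy = total energy
    double totalJ = dynamicJ + m.staticW * seconds;
    // Energy / time = power; the thermal model maps power to temperature
    double watts = totalJ / seconds;
    double tempC = thermalModel(watts);
    (void)tempC; // a real model would feed this back to adjust leakage (staticW)
    return totalJ;
}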

Page 57

Case Study: Reliability vs. Power - Hidden Cost of DVFS
• Dynamic voltage/frequency scaling reduces power
• ➔ Reduces temperature
• ➔ Causes thermal cycling
• ➔ Reduces reliability
• Need
  – Algorithms to balance temperature, lower power, & maintain performance
  – Arch: sensors and feedback
  – Runtime: scheduler changes
  – App: awareness

[Figure (Coskun 2011): breakdown of failures: dynamic power management, sleep state, accelerated thermal cycling.]


Page 59

Bonus Slides

Page 60

Component Validation
• Strategy: component validation in parallel with system-level validation
• Current components validated at different levels, with different methodologies
• Validation in isolation
• What is needed
  – Uniform validation methodology (apps)
  – System (multi-component) level validation

Component     Method                                          Error
DRAMSim       RTL-level validation against Micron             Cycle
Generic Proc  SimpleScalar SPEC92 validation                  ~5%
NMSU          Comparison vs. existing processors on SPEC      <7%
RS Network    Latency/BW against SeaStar 1.2, 2.1             <5%
MacSim        Comparison vs. existing GPUs                    Ongoing, <10% expected
Zesto         Comparison vs. several processors, benchmarks   4-5%
McPAT         Comparisons against existing processors         10-23%

Page 61

Power is the Problem
• 2018 exascale machine
  – 1 exaop/sec
  – 100s of petabytes/sec memory bandwidth
  – 100s of petabytes/sec interconnect bandwidth
  – No major architecture changes
• Consider power (checked in the snippet below)
  – 1 pJ/op * 1 exaop/sec = 1 MW
  – 1 MW-year ≈ $1 M
  – $200-400 M / year power bill

From Jensen, "Embedded systems and exascale computing," CiSE 2010:

              Energy         Conventional
Processor     62.5 pJ/op     62.5 MW
Memory        31.25 pJ/bit   125 MW
Interconnect  6 pJ/bit       24 MW
Total                        211.5 MW

Table 2: Effects of Power Reduction Techniques on an Exascale System

              2018 Estimate  Reduction Technique            Reduced Power  Reduction
Processing    224 MW         Simpler processor, reduce FP   11.2 MW        95%
Memory        125 MW         Closer proximity               37.5 MW        70%
Interconnect  24 MW          Message overlap                12 MW          50%
Total         373 MW                                        60.7 MW        84%

Applying these techniques yields a power reduction of 84%, turning a wholly impractical machine consuming hundreds of megawatts into a more attainable machine consuming tens of megawatts.
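A quick sanity check of the arithmetic on this slide, using the $0.10/kWh retail figure quoted on the "Worldwide Impact" slide; the program and its constants are only an illustration of the estimate.

#include <cstdio>

int main()
{
    const double pjPerOp     = 1e-12;                      // 1 pJ per operation
    const double opsPerSec   = 1e18;                       // 1 exaop/sec
    const double watts       = pjPerOp * opsPerSec;        // = 1e6 W = 1 MW
    const double kwhPerYear  = watts / 1000.0 * 24 * 365;  // ~8.76e6 kWh per MW-year
    const double dollarsYear = kwhPerYear * 0.10;          // ~$0.9 M per MW-year
    std::printf("%.0f W, %.3e kWh/yr, $%.3e/yr per pJ/op of machine\n",
                watts, kwhPerYear, dollarsYear);
    return 0;
}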

Page 62

X to Solution
• Wider processors provide shorter time to solution
• They require much more energy to solution

[Charts: effect of processor width on normalized energy to solution and time to solution (Lulesh and HPCCG).]