ABACUS: A Hardware-Based Software Profiler for Modern Processors
Eric Matthews • Lesley Shannon, School of Engineering Science
Sergey Blagodurov • Sergey Zhuravlev • Alexandra Fedorova, School of Computing Science
Simon Fraser University, Vancouver, BC, Canada
Overview
- Legendary Introduction to ABACUS
- Delicious Profiling Units
- Epic Conclusion
Introduction to ABACUS
ABACUS
Performance comparison
Memory Reuse Profile
- ABACUS avg runtime: 48.5 seconds
- Simics avg runtime: 1 hour 6 minutes
[Figure: memory reuse histograms for namd and hmmer, ABACUS vs. Simics; x-axis: miss, Reuse 0, Reuse 1; y-axis: Counts (in Millions)]
Conclusion
- ABACUS is a generic profiler that can be easily integrated into modern processors
- It can be used by the O/S to obtain runtime information about a thread’s behaviour to make better thread assignments
Thank you! Questions?
Motivation
- Future systems will be multi-core and heterogeneous
- How does the OS place threads on this architecture?
- Characterize thread behaviour
  - Instruction Mix
  - Memory Reuse Profile
  - Effectiveness of pre-fetching
  - Memory bandwidth utilization
Motivation (cont'd)
- How are these metrics collected?
  - Offline analysis
  - Code Instrumentation
  - Simulation (e.g., Simics)
    - Software-based instruction set simulator
    - Models systems with full OS support
Motivation (cont'd)
- Why not use current hardware counters?
  - Architecture-specific
  - Not all desired metrics provided
  - Help detect symptoms, not causes
  - Limited in number and in concurrent use
Goal
- Create a hardware profiler to collect thread characteristics at runtime
- Imposed constraints
  - External to processor
  - Minimally invasive
  - Cycle accurate
  - OS controllable
ABACUS
- hArdware-Based Analyzer for the Characterization of User Software
- A collection of runtime configurable profiling units
- Collects metrics useful for thread placement
- Controllable through the O/S
Hardware Platform
- Proof-of-concept System
  - LEON3, Sparc v8 Instruction Set Architecture
  - Single core, single threaded
- Test System
  - OpenSparc Niagara T1 soft processor
  - 1 to 4 hardware threads
  - Multi-core, multi-board support
Hardware Platform (cont'd)
ABACUS
External Interface
- Bus slave and master modules
- Processing required on processor signals
- Designed such that only external interface changes with different processor/system
Portability
- Previously integrated with a LEON3 (Sparc v8 ISA) based system
- Differences:
  - AMBA Advanced High-performance Bus (AHB) vs Processor Local Bus (PLB)
  - Processor internals
Controller
- Starts or stops profiling
- Can limit profiling to a specific address range
- DMA interface for retrieving collected data
- Linux device driver support
Profiling Units
- Operate on one or more processor signals:
  - Instruction
  - PC
  - Cache Reuse Distance
  - etc.
- Store data in a collection of counters
Profiling Units (cont'd)
- Focus on two-dimensional metrics
  - Gives bigger picture / greater insight
- Aim to be as architecture independent as possible
Profile Unit
- Behaves like a traditional software profiler
- Operates on Program Counter
[Figure: code-space diagram; labels: Trace, Range, Overlap, Non-Overlap]
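As a rough software analogue of the Profile Unit, the sketch below counts program-counter samples that fall inside configured address ranges and those that fall outside. The class name `RangeProfiler`, the half-open `(start, end)` range format, and the single `outside` counter are assumptions made for illustration. The slides do not describe the actual counter layout.

```python
from bisect import bisect_right


class RangeProfiler:
    """Counts PC samples per configured address range (hypothetical model)."""

    def __init__(self, ranges):
        # ranges: non-overlapping half-open (start, end) address pairs.
        self.ranges = sorted(ranges)
        self.starts = [start for start, _ in self.ranges]
        self.counts = [0] * len(self.ranges)
        self.outside = 0  # samples that hit no configured range

    def observe(self, pc):
        # Find the last range whose start is <= pc, then check its end.
        i = bisect_right(self.starts, pc) - 1
        if i >= 0 and pc < self.ranges[i][1]:
            self.counts[i] += 1
        else:
            self.outside += 1


if __name__ == "__main__":
    prof = RangeProfiler([(0x1000, 0x2000), (0x4000, 0x4100)])
    for pc in (0x1000, 0x1FFC, 0x2000, 0x4004):
        prof.observe(pc)
    print(prof.counts, prof.outside)
```

A hardware unit would do the range comparison in parallel comparators rather than a binary search. The search here only keeps the software model short.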
Memory Reuse Unit
- Collects a measure of code or data reuse
- Utilizes Least Recently Used (LRU) stack
- Reuse distance is movement in the LRU stack or a miss
- Has uses in cache contention management
Memory Reuse Unit
- Creates histogram of cache reuse pattern
- Range: [0, set associativity – 1] or cache miss
[Figure: 4-way set-associative reuse profile; x-axis: Reuse Distance]
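The reuse histogram described above can be modelled in software. The following is a minimal sketch, assuming a set-associative cache with true LRU replacement. The function name `reuse_profile` and the default geometry (64 sets, 4 ways, 32-byte lines) are illustrative choices, not parameters taken from ABACUS.

```python
from collections import Counter


def reuse_profile(addresses, num_sets=64, ways=4, line_bytes=32):
    """Histogram of LRU stack depth on each hit, plus a 'miss' bucket."""
    # One LRU stack per set; index 0 is the most recently used line.
    stacks = [[] for _ in range(num_sets)]
    hist = Counter()
    for addr in addresses:
        line = addr // line_bytes
        stack = stacks[line % num_sets]
        if line in stack:
            depth = stack.index(line)  # reuse distance in [0, ways - 1]
            hist[depth] += 1
            stack.pop(depth)
        else:
            hist["miss"] += 1
            if len(stack) == ways:
                stack.pop()  # evict the least recently used line
        stack.insert(0, line)  # the accessed line becomes most recent
    return hist


if __name__ == "__main__":
    trace = [0x100, 0x100, 0x900, 0x100, 0x1100]
    print(reuse_profile(trace))
```

A depth of 0 means the line was already the most recently used one in its set. Larger depths indicate that more distinct lines in the same set were touched between the two accesses.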
Instruction Mix
- Identify current instruction subset in use
- Divide instructions into logical categories
  - Load/Store
  - Floating Point
  - Control Flow
- Opcode-based table lookup
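An opcode-based lookup of this kind might look as follows for SPARC v8 instruction words. This is a simplified sketch: the category names, the `OTHER` bucket, and the choice to count floating-point loads and stores as Load/Store are assumptions, and the tables cover only the main opcode fields (`op`, `op2`, `op3`).

```python
from collections import Counter

LOAD_STORE = "load/store"
FLOAT = "floating point"
CONTROL = "control flow"
OTHER = "other"

# Format 2 (op == 0): op2 field, bits 24:22.
OP2_TABLE = {0b010: CONTROL, 0b110: CONTROL, 0b111: CONTROL}  # Bicc, FBfcc, CBccc

# Format 3 (op == 2): op3 field, bits 24:19.
OP3_TABLE = {
    0x34: FLOAT,    # FPop1
    0x35: FLOAT,    # FPop2
    0x38: CONTROL,  # JMPL
    0x39: CONTROL,  # RETT
    0x3A: CONTROL,  # Ticc
}


def classify(word):
    """Map a 32-bit SPARC v8 instruction word to a coarse category."""
    op = (word >> 30) & 0x3
    if op == 3:
        return LOAD_STORE  # all memory instructions, including FP loads/stores
    if op == 1:
        return CONTROL     # CALL
    if op == 0:
        return OP2_TABLE.get((word >> 22) & 0x7, OTHER)
    return OP3_TABLE.get((word >> 19) & 0x3F, OTHER)


def instruction_mix(words):
    """Count how many instruction words fall into each category."""
    return Counter(classify(w) for w in words)
```

For example, `classify(0x01000000)` (a `nop`, encoded as `sethi 0, %g0`) falls into the `OTHER` bucket, while `classify(0x40000000)` (a `call`) is counted as control flow.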
Latency Unit
- Break down miss latency into constituent sources
  - Bus contention
  - DRAM latency
  - etc.
- For each category create a histogram of latency in cycles
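The per-category latency histogram can be sketched as a set of fixed-width buckets per source. The class name `LatencyHistogram`, the bucket width, the bucket count, and the saturating last bucket are all assumptions for illustration. The slides do not state how ABACUS sizes its bins.

```python
from collections import defaultdict


class LatencyHistogram:
    """Per-source histogram of latencies in cycles (hypothetical layout)."""

    def __init__(self, bucket_cycles=8, num_buckets=16):
        self.bucket_cycles = bucket_cycles
        self.num_buckets = num_buckets
        # One row of counters per latency source, created on first use.
        self.bins = defaultdict(lambda: [0] * num_buckets)

    def record(self, source, cycles):
        # Latencies beyond the last bucket saturate into it.
        index = min(cycles // self.bucket_cycles, self.num_buckets - 1)
        self.bins[source][index] += 1


if __name__ == "__main__":
    hist = LatencyHistogram(bucket_cycles=10, num_buckets=4)
    for source, cycles in (("bus contention", 3), ("DRAM latency", 35)):
        hist.record(source, cycles)
    print(dict(hist.bins))
```

Keeping one row of counters per source yields a two-dimensional result (source by latency bucket) rather than a single aggregate number.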
Stall Unit
- Break down Cycles Per Instruction
- Attribute cycles to their sources
  - Cache miss
  - Translation Lookaside Buffer (TLB) miss
  - Floating Point busy stalls
  - etc.
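Given per-source stall cycle counts, a CPI breakdown is plain arithmetic: each source contributes its stall cycles divided by the number of instructions, and the remainder is the non-stalled base. The sketch below assumes the counters are already available as a dictionary. The function name `cpi_breakdown` and the `"base"` key are illustrative, and the numbers in the usage example are made up.

```python
def cpi_breakdown(instructions, total_cycles, stall_cycles):
    """Split overall CPI into per-source components plus a base component.

    stall_cycles maps a source name to the cycles attributed to it.
    """
    stalled = sum(stall_cycles.values())
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    if stalled > total_cycles:
        raise ValueError("stall cycles exceed total cycles")
    breakdown = {src: cyc / instructions for src, cyc in stall_cycles.items()}
    # Cycles not attributed to any stall source form the base CPI.
    breakdown["base"] = (total_cycles - stalled) / instructions
    return breakdown


if __name__ == "__main__":
    parts = cpi_breakdown(
        instructions=1000,
        total_cycles=2500,
        stall_cycles={"cache miss": 900, "TLB miss": 100, "FP busy": 500},
    )
    for source, cpi in parts.items():
        print(f"{source}: {cpi:.2f}")
```

The components sum to the overall CPI (`total_cycles / instructions`), which gives a quick consistency check on the collected counters.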
Verification
- Run a subset of the SPEC CPU2006 benchmarks
  - Those with memory usage within board specs
- Collect metrics with ABACUS and Simics
- Profile for a few billion instructions
  - Limited by Simics performance
Test Platform
- Proof-of-concept System
- Single core, single threaded
- XUP V2Pro: 90% slice utilization
Processor: LEON3 (SPARC v8 ISA) (50 MHz)
Memory: 256 MB DDR RAM
OS: Debian Etch (4.0)
Simulation Platform
- Simics System:
Processor: UltraSparc II (SPARC v9 ISA)
Memory: 256 MB DDR RAM
OS: Debian Etch (4.0)
- Differences:
  - SPARC v9 ISA (64-bit processor)
  - Local filesystem vs NFS
LEON3 Comparison
[Figure: reuse histograms for namd and hmmer, ABACUS vs. Simics; x-axis: miss, Reuse 0, Reuse 1; y-axis: Counts (in Millions)]
LEON3 Comparison (cont'd)
[Figure: DC Memory Reuse Profile for namd and hmmer, ABACUS vs. Simics; x-axis: miss, Reuse 0, Reuse 1; y-axis: Counts (in Millions)]
Resource Usage
- Default:
  - 2-way LRU Instruction Cache
  - 2-way LRU Data Cache
  - 5 Instruction Types
[Figure: bar chart of resource usage, axis 0 to 1600; series: LUT (V2p), LUT (V5), FF; configurations: 32-bit counters, 40-bit counters, 32-bit counters with Profile Unit added]
Future Plans
- Move to multi-core/multi-threaded system
- Memory reuse distance independent of existing cache implementation
- Process tracking
- Integrate results into OS scheduler
Questions?