SmartCore system for Dependable Many-core Processor with Multifunction Routers
Shinya Takamaeda†, Shimpei Sato†, Takefumi Miyoshi‡, Kenji Kise†
†Tokyo Institute of Technology, Japan ‡The University of Electro-communications, Japan
10-11-18 ICNC’10 @Hiroshima Regular Paper Hardware Design and Implementation 14:50-15:20
Contents Motivation Proposal: SmartCore system Preliminary Evaluation Hardware Implementation on FPGAs Related Work Conclusion
Many-core Processors appear!
Intel Single Chip Cloud Computer 48 cores (x86)
TILERA TILE-Gx100 100 cores (MIPS)
Inter-connection for Many-core processors NoC (Network on Chip)
Data transmission via on-chip-routers
[Figure: NoC of 16 processing elements (PE), each attached to an on-chip router (R)]
Low Dependability on Many-core
Process technology scaling provides more transistors, but it also increases:
• Soft errors (e.g. bit inversion), caused by cosmic radiation
• Timing errors, caused by variations in transistor characteristics or wire delay
How to create a reliable Many-core processor?
Circuit
Micro-architecture
Architecture
Software
Assurance of the reliability on each layer
• Razor-FF
• Lock-step
• Check-pointing / Re-execution
• Inter-connection: SmartCore system
• Canary-FF
• ECC in DRAM Memory
• Architectural Core Salvaging
• Slip Stream Processor
We propose the SmartCore system
SmartCore system = Smart many-core system with redundant cores and multifunction routers
Key: NoC-based DMR (dual modular redundancy)
• To detect an error, compare the output packets from the pair
The on-chip router has 3 special functions:
• Copy a packet
• Change the destination
• Wait for and compare 2 packets
[Figure: PE/Router pairs running the same thread (DMR); the routers handle the same packets by packet copying, sharing a packet and comparing 2 packets]
Base many-core architecture: M-Core [1]
• A 2D mesh network connects the Nodes
• Each Node's memory is independent
• Inter-Node communication: DMA via packets, addressed by Node ID
• A packet is a series of flits (flow control units)
  – Only the head flit of a packet contains the destination
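The packet format described above can be sketched as follows. This is only an illustrative model, not the M-Core implementation: the `Flit` class, its field names, and the `make_packet` helper are assumptions introduced here, and a Node ID is assumed to be an `(x, y)` coordinate pair.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

NodeID = Tuple[int, int]  # assumed representation: (x, y) mesh coordinate


@dataclass(frozen=True)
class Flit:
    is_head: bool
    dest: Optional[NodeID]  # destination, present only on the head flit
    data: int               # payload word (unused on the head flit in this sketch)


def make_packet(dest: NodeID, payload: List[int]) -> List[Flit]:
    """Build a packet: one head flit with the destination, then body flits."""
    head = Flit(is_head=True, dest=dest, data=0)
    body = [Flit(is_head=False, dest=None, data=w & 0xFFFFFFFF) for w in payload]
    return [head] + body


# Hypothetical example: a two-word packet sent to Node (2, 1).
packet = make_packet((2, 1), [0xDEADBEEF, 42])
```

Because only the head flit carries the destination, a router that wants to redirect a packet only needs to rewrite that one flit.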
[Figure: M-Core many-core processor chip. An 8×8 mesh of Computation Nodes (1,1) to (8,8), an Operation Node (0,0) with conventional I/O, and Memory Nodes (1,0) to (8,0) connected to off-chip memory modules and switch. Each Node consists of a Core, Node memory, an INCC and a Router.]
DMR on two nodes by using SmartCore
• The same program binary is executed on a pair: Master Node and Mirror Node
  – If the packets they generate differ, one of the two Nodes is faulty
• Packet copying on the Router of the Master, so that the Mirror uses the same data as the Master
• Packet comparison on the Router of the Master
  – If the two packets differ, the Router detects an error
[Figure: 4×2 mesh of Nodes (1,1) to (4,2), each with a PE and a Router; a Master Node and a Mirror Node form a pair that is logically Node (1,1)]
1. Copying a packet to the Mirror Node
• The Router on the Master Node copies each incoming packet to the Mirror Node
  – The destination is changed to the Mirror Node's ID
• The original program performs several DMA communications
  – The copy lets both Nodes reliably continue executing the same program
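A minimal sketch of the copy step, assuming a packet is modeled as a list whose first element is the destination Node ID (the head flit) followed by payload flits. The function names and the list representation are illustrative assumptions, not the router's actual hardware interface.

```python
from typing import List, Tuple, Union

NodeID = Tuple[int, int]
Packet = List[Union[NodeID, int]]  # packet[0] = destination, rest = payload flits


def copy_for_mirror(packet: Packet, mirror_id: NodeID) -> Packet:
    """Return a copy whose head flit targets the Mirror Node; payload is unchanged."""
    return [mirror_id] + list(packet[1:])


def master_router_receive(packet: Packet, mirror_id: NodeID):
    """Deliver the incoming packet locally and produce the redirected copy."""
    to_master = list(packet)
    to_mirror = copy_for_mirror(packet, mirror_id)
    return to_master, to_mirror


# Hypothetical pair: Master at (1, 1), Mirror at (2, 1).
incoming = [(1, 1), 7, 8]
to_master, to_mirror = master_router_receive(incoming, (2, 1))
```

Only the head of the copy differs, so both Nodes observe identical payload data.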
[Figure: Master and Mirror Nodes, each with P, INCC and Router (R); an incoming packet is copied from the Master's Router to the Mirror]
2. Wait for a packet from the Mirror Node
3. Compare the contents of two packets
• The Router on the Master Node waits for a packet from the Master Node and a packet from the Mirror Node
• When the Router on the Master has received the head flits from both Nodes, it starts to compare the flits of the 2 packets in order
• If the contents of the flits differ, an error exists in either the Master Node or the Mirror Node
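The wait-and-compare behavior can be modeled roughly as below. This is a software sketch under assumptions: the `FlitComparator` class, its two queues, and the `compared` and `error` fields are hypothetical names, and flits are modeled as plain integers.

```python
from collections import deque


class FlitComparator:
    """Buffers flits from Master and Mirror and compares them pairwise, in order."""

    def __init__(self):
        self.queues = {"master": deque(), "mirror": deque()}
        self.compared = 0   # number of flit pairs compared so far
        self.error = False  # set once any pair differs

    def push(self, source: str, flit: int) -> None:
        """Buffer one flit; compare as long as both sides have a flit waiting."""
        self.queues[source].append(flit)
        while self.queues["master"] and self.queues["mirror"]:
            a = self.queues["master"].popleft()
            b = self.queues["mirror"].popleft()
            self.compared += 1
            if a != b:
                self.error = True  # fault is in either the Master or the Mirror


cmp_unit = FlitComparator()
cmp_unit.push("master", 10)   # Mirror's flit has not arrived: the router waits
waiting = cmp_unit.compared   # still 0
cmp_unit.push("mirror", 10)   # both present: compared, contents match
cmp_unit.push("mirror", 5)
cmp_unit.push("master", 6)    # contents differ: error detected
```

Note that a mismatch only shows that one of the two Nodes is faulty; DMR alone cannot tell which one.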
[Figure: Master and Mirror Nodes, each with P, INCC and Router (R); the packets from both Nodes are compared at the Master's Router]
Base router
• 5 inputs with input buffers / 5 outputs
• X-Y dimension-order routing
• Wormhole switching, Xon/Xoff flow control
• 1 hop / 1 cycle (single cycle), no virtual channels
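X-Y dimension-order routing can be illustrated with a short sketch: a packet first travels along X until its column matches the destination, then along Y. The port names follow the router figure (X+, X-, Y+, Y-, DMAC); the assumption that X+ and Y+ point toward larger coordinates, and the function names, are introduced here for illustration.

```python
from typing import List, Tuple

NodeID = Tuple[int, int]


def xy_route(cur: NodeID, dst: NodeID) -> str:
    """Pick the output port at router `cur` for a packet headed to `dst`."""
    cx, cy = cur
    dx, dy = dst
    if dx > cx:
        return "X+"
    if dx < cx:
        return "X-"
    if dy > cy:
        return "Y+"
    if dy < cy:
        return "Y-"
    return "DMAC"  # arrived: hand the packet to the local port


def route_path(src: NodeID, dst: NodeID) -> List[str]:
    """List the output ports taken hop by hop, ending at the local port."""
    step = {"X+": (1, 0), "X-": (-1, 0), "Y+": (0, 1), "Y-": (0, -1)}
    cur, ports = src, []
    while True:
        port = xy_route(cur, dst)
        ports.append(port)
        if port == "DMAC":
            return ports
        cur = (cur[0] + step[port][0], cur[1] + step[port][1])


ports = route_path((1, 1), (3, 2))
```

Because the X dimension is always resolved before Y, every source and destination pair has exactly one path.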
[Figure: Base router. Input ports X+, X-, Y+, Y- and DMAC feed an XBAR switch controlled by an Arbiter, which drives output ports X+, X-, Y+, Y- and DMAC.]
Multifunction router for SmartCore system
• Additional buffer for copying packets to the Mirror Node (a)
• ID translator to change the destination (b)
• Flit comparator to verify (c)
• Node type, Master/Mirror Node ID
  – Configured by system software
[Figure: Multifunction router. The base router (input/output ports X+, X-, Y+, Y-, INCC, XBAR switch, Arbiter) extended with ID translation, a Verify unit, and node type / Master / Mirror ID settings; the additions are labeled (a), (b), (c).]
Advantages of SmartCore system
• Adaptable to any kind of hardware module that generates packets
  – e.g. Cache, DSP, Processor core
• Because the error detection mechanism is independent of the Node structure
  – Core-granularity redundant execution / packet-level error detection
Preliminary Evaluation of SmartCore system
• 2 evaluations
  – Performance overhead on DMR
  – Packet rendezvous time
• Environment: SimMc 1.0
  – 64 (8×8) threads on 128 (16×8) Nodes
  – Core: MIPS32 single issue / single cycle processor
  – Router: 1 hop / 1 cycle, no virtual channels, flit size: 4 bytes
  – INCC (Network Interface): up to 1 flit / cycle receive/send from/to router
• Benchmark: 4 apps from NAS Parallel Benchmarks
  – cg, ft, is, lu; Size: S
[Figure: Node (X, Y) consisting of a Core, Node memory, an INCC and a Router]
3 configurations of thread mapping
(a) Base allocation: threads (1,1) to (8,8) on 8 × 8 Nodes
(b) Redundant space allocation (Area 2x): the same threads spread over 16 × 8 Nodes, to see the effect on #hops
(c) Redundant execution with SmartCore system: each thread runs on a Master/Mirror pair on 16 × 8 Nodes, to see the effect of SmartCore
Legend: Proper thread (Master Node) / Redundant thread (Mirror Node) / Not working
Evaluation: Performance overhead on DMR
• Only a small slowdown
  – Redundant space (Area 2x): up to 1% slowdown
  – Redundant execution (SmartCore): up to 4% slowdown (in cg of NPB)
Evaluation: Packet rendezvous time
• Cumulative distribution of the # of cycles that the Router on the Master Node waits for a packet from the Mirror Node
• Almost all communications complete with only a short rendezvous time
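For readers unfamiliar with the metric, the sketch below shows one way a cumulative distribution of wait cycles could be computed from per-packet measurements. The function name and the sample values are invented for illustration only; they are not results from the evaluation.

```python
from collections import Counter
from typing import List, Tuple


def cumulative_distribution(wait_cycles: List[int]) -> List[Tuple[int, float]]:
    """Return (cycles, fraction of packets that waited at most that many cycles)."""
    total = len(wait_cycles)
    counts = Counter(wait_cycles)
    cdf, running = [], 0
    for cycles in sorted(counts):
        running += counts[cycles]
        cdf.append((cycles, running / total))
    return cdf


# Made-up wait times (in cycles) for eight packets, NOT measured data.
samples = [0, 0, 1, 0, 2, 0, 1, 8]
cdf = cumulative_distribution(samples)
```

A curve that rises to nearly 1.0 within a few cycles corresponds to the observation that most packets wait only briefly for their counterpart.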
[Figure: cumulative distributions of rendezvous time for cg, ft, is and lu]
Hardware Implementation on FPGAs
• Dependable many-core processor on an FPGA-based prototyping system, using the ScalableCore system [8]
  – Connected FPGA boards
  – Variable # of FPGA boards
• 2 execution modes
  – Normal Mode: standard M-Core
  – SmartCore Mode: each pair executes the same thread
[Figure: array of FPGA boards with SD card and power connections: Loader (0,1), Path (0,2), Path (0,3), and Nodes with PhysicalIDs (1,1) to (4,3). Each Master/Mirror pair of physical Nodes forms one logical Node, giving LogicalIDs (1,1) to (2,3).]
Overview of 15 Nodes ScalableCore system with SmartCore system
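The ID labels in this overview suggest that two physically adjacent Nodes share one logical ID. A minimal sketch of such a mapping follows, assuming that the Master sits at the odd X coordinate and the Mirror at the next even one; this layout and the function names are inferred from the figure labels and are not confirmed by the slides.

```python
from typing import Tuple

NodeID = Tuple[int, int]


def physical_pair(logical: NodeID) -> Tuple[NodeID, NodeID]:
    """Map a logical ID to the assumed (Master, Mirror) physical IDs."""
    lx, ly = logical
    return (2 * lx - 1, ly), (2 * lx, ly)


def logical_of(physical: NodeID) -> Tuple[NodeID, str]:
    """Map a physical ID back to its logical ID and its assumed role."""
    px, py = physical
    role = "master" if px % 2 == 1 else "mirror"
    return ((px + 1) // 2, py), role


master, mirror = physical_pair((2, 3))
```

Under this assumption, the 12 computation Nodes with PhysicalIDs (1,1) to (4,3) form the 6 logical Nodes (1,1) to (2,3).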
[Figure: FPGA boards labeled Logical ID (1,1), (1,2), (1,3), (2,1), (2,2), (2,3), each a Master/Mirror pair, and the Program Loader ID (0,1)]
SmartCore system detects an artificial fault
Related work
• Slipstream Processor [9, Karthik, ASPLOS2000]
  – Improving ILP and dependability by using two tightly coupled cores
  – 2 threads: proper sequence and shorter sequence
• Loose Lock-stepped system [10, Nidhi, ISCA2007]
  – Dividing cores, cache, main memory into two groups
  – I/O level error detection
• Lockstep [11, IBM]
  – Redundant execution on synchronized processors
  – I/O level error detection
Conclusion
• We propose the SmartCore system
  – NoC-based DMR by using multifunction routers
• The multifunction router has 3 special functions
  – Copying a packet
  – Changing the destination of a packet
  – Waiting for and comparing the contents of two packets
• Low performance overhead
• Hardware implementation on an FPGA-based prototyping system
• Future work
  – Recovery after error detection
  – TMR by SmartCore system