Upload
truongnga
View
217
Download
2
Embed Size (px)
Citation preview
AMS
Institute of Computing Technology Chinese Academy of Sciences
High-Efficient Architecture of Godson-T Many-Core Processor
Dongrui Fan, Hao Zhang, Da Wang, Xiaochun Ye,
Fenglong Song, Junchao Zhang, Lingjun Fan
Advanced Micro-System GroupNational Key Laboratory of Computer Architecture,
Institute of Computing Technology, Chinese Academy of Sciences
AMS Godson-T | HOT CHIPS 20112
Overview
Microprocessor Architecture Challenges
Godson-T Features for Parallel Program
Godson-T Design and Implementation
AMS Godson-T | HOT CHIPS 20113
Memory wall + Power wall + ILP wall = Brick wall !What parallel architecture will be successful?
Parallel architectures come to rescue, superscalar processor is being replaced by on-chip multi-core / many-core processor
Microprocessor Architecture Revolution
Power, Complexity
Performance
Simpler
Processor Core
Superscalar
Processor Core
Multi-Core
Many-Core
Power Wall
ILP Wall
AMS Godson-T | HOT CHIPS 20114
Exploiting Polymorphic Parallelism
Instruction-Level
Parallelism
Heterogenous
Many-Core
Godson-3
Godson-T
Thread-Level Parallelism
With Powerful
Streaming Unit
With moderate
DLP and strong
TLP exploitation
AMS Godson-T | HOT CHIPS 20115
Many-Core Challenges: The 5 P’s
Power efficiency challenge Performance per watt is the new metric – Dark Silicon will appear in 2020
when facing 11nm process technology
Performance challenge How to scale from 1 to 1000 cores – the number of cores is the new
Megahertz
Programming challenge How to provide a converged many core solution in a standard programming
environment
Parallelism challenge How to exploit parallelism through software and hardware co-design
Platform challenge How to maximize the usability, such as, runtime system (OS), compiler &
library, many-core debug…
AMS Godson-T | HOT CHIPS 20116
What we could help……
Productivity Side:
Most sequential programmers are not ready to switch into parallel programming conservative programming model and make incremental improvement?
Unique programming model cannot solve all problems efficiently; support multiple programming features efficiently?
Locks are messy new way to eliminate deadlock?
……
Performance Side: On-chip synchronization is fast;
handle all synchronizations on-chip? Flops are cheap, memory communications are expensive;
trade flops for communication latency? Fine-grained parallelism should be taken into consideration;
enable data-driven thread execution on chip?……
Take this carefully, because it may affect productivity !
AMS Godson-T | HOT CHIPS 20117
Data Communication concept
Fine GrainedThread division
Thread Synchronization
Motivation
Target atHPC
Godson-TMany cores to
accelerate one program
Cost CostMethod
AMS Godson-T | HOT CHIPS 20118
Overview Microprocessor Architecture Challenges
› Memory wall + Power wall + ILP wall
› The 5P’s: Many-core Processor Challenges
Godson-T Features for Parallel Program› Godson-T architecture overview
› Architectural supports for multithreading
› Software runtime system
Godson-T Design and Implementation› Journey of design and implementation
› Software and hardware co-simulation
› Prototype
AMS Godson-T | HOT CHIPS 20119
Architecture Overview of Godson-T
I/O Controller
Synchronization Manager
L2 Cache Bank
Memory Controller
Mem
ory ControllerM
emor
y C
ontr
olle
r
Memory Controller
Private I-$
Router
Fetch & Decode
Register Files
INT ALU
Sync Unit
FP/Vector ALU
Processing Core
SPM
Local Memory
D$
DTAMACALU
Private Memroy
L1$
CORE
R
GodsonT
Processor
Runtime
System
Program
AMS Godson-T | HOT CHIPS 201110
Processing Core
FETCH
DECODE
Vector
Register File
Instruction
Buffer
Load/Store/
Synchronization
Unit
Vector/
Floating-Point
Unit
32 Bytex 2x1
1 instruction
Vector/
Fixed-Point
Unit
32Byte 2r/1w
L0_Icache
4Byte
Router
16Byte
16Byte in/out msg
L0_Dcache
16Byte
SPM/
L1 Cache
8Byte
1 instruction
ISA: MIPS (user), SIMD-ext., sync-ext.
8-stage pipeline
Dual-issue per thread
Expected SIMD Load, Store, Data Move,
Arithmetic ……
Fast level-1 memory
16KB private memory Automatically mapped into stack
address space
Full/empty bit tagged on each 64-bit slot enables efficient producer-consumer style synchronization
Communication with external modules through message packets
AMS Godson-T | HOT CHIPS 201111
Interconnection
Separated routers with each processing core Static XY wormhole routing
Round-Robin arbitration
Two independent physical networks
Duplex 128bit link for each network
Guaranteed in-order point-to-point communication Deadlock-free & livelock-free
Scalable and power-efficient for MESH topology
Low latency core-to-core communication
Separated networks allowing traffic segregation and tolerating burst DMA transfer
AMS Godson-T | HOT CHIPS 201112
Memory Hierarchy
Reg
LocalMemory
L2 Cache
Off-Chip Memory
Data Transfer Bandwidth
512GB/s
128GB/s
51.2GB/s
Each processing core with 32 fixed-point and 32 floating-point registers
32KB local memory, including L1$D, SPM
16 address-interleaved L2 cache banks, 128KB each
4 DDR3-1600 memory controllers
AMS Godson-T | HOT CHIPS 201113
Power Management
RT RT RT RT
RT RT RT RT
RT RT RT RT
RT RT RT RT
• Monitor status of each core at program level
• Shut down or turn on cores separately
• Reassign tasks
RT RT RT RT
RT RT RT RT
RT RT RT RT
RT RT RT RT
RT RT RT RT
RT RT RT RT
RT RT RT RT
RT RT RT RT
RT RT RT RT
RT RT RT RT
RT RT RT RT
RT RT RT RT
AMS Godson-T | HOT CHIPS 201114
Overview Microprocessor Architecture Challenges
› Memory wall + Power wall + ILP wall
› The 5P’s: Many-core Processor Challenges
Godson-T Features for Parallel Program› Godson-T architecture overview
› Architectural supports for multithreading
› Software runtime system
Godson-T Design and Implementation› Journey of design and implementation
› Software and hardware co-simulation
› Prototype
AMS Godson-T | HOT CHIPS 201115
Thread Communication & SynchronizationLock-Based Cache Coherent Protocol
Lazy cache coherent protocol Pure mutual-exclusion synchronization instruction without memory
accesses; Eliminate busy-waiting for locks; More scalable than bus-snoopy and directory-based cache protocol; Enable hardware deadlock detecting.
Mini Core Mini Core
Inte rc o nne c t
int mutual_func_a()
{
……
acquire_lock(LOCK_VAL);
X = 3;
release_lock(LOCK_VAL);
……
}
int mutual_func_b()
{
……
acquire_lock(LOCK_VAL);
Y = X;
Z = X;
release_lock(LOCK_VAL);
……
}
Synchronization
ManagerShared L2 Cache
Private Cache Private Cache
Inte rc o nne c t
X=0 X=0X=0
X=3
ACQ. ACQ.REL. REL.
ACK.ACK.X=0
X=3
CORE 0 UNLOCKCORE 1 UNLOCK CORE 0 LOCKEDCORE 1 UNLOCK CORE 0 LOCKEDCORE 1 WAITINGCORE 0 UNLOCKCORE 1 LOCKEDCORE 0 UNLOCKCORE 1 UNLOCK
AMS Godson-T | HOT CHIPS 201116
Thread Communication & Synchronization
Data Transfer Agent (DTA)
invovled data block address
…… ……
chunk stride chunk stride
block stride block strideblock
(b) 2D strided DTA operations
DTA DTA
SPM SPM
……
on-chip network
……
……
Router Router Router
L2 Cache
……
off-chip memory
DTA horizontal operation DTA vertical operation DTA vertical operation
(a) vertical and horizontal DTA operations
Programmable asynchronous data transfer agent Support vertical and horizontal
DTA operations (such as prefetch)
Data transfers between multi-dimension addresses (such as matrix inversion )
Network load perception, automatically flow control(improve bandwidth-efficiency)
Support fine-grain synchronous operations
AMS 17
Synchronized DTA Operations
1 0*load.future (s)/store.future (s)/store.sync (f)
load.sync (s)
store.sync (s)
load.future (f)/store.future (f)/
load.sync (f)
(s): successfully perform on the state of full/empty bit (f): failed to perform on the state of full/empty bit
Target Node
SourceNode
Traffic Awareness
Node
AMS Godson-T | HOT CHIPS 201118
Evaluating On-chip DTA
SGEMM 1-D FFT0
10
20
30
40
50
60
70
80
90
100
110
120
130
72.864.8
122.8127.5
24.719.7
63.7
Per
form
ance
(GFL
OP
S)
Kernels
Cache SPM, without DTA SPM, with DTA Theoretical
72.9
Pe rfo rm an ce co m pariso n s o f SGEMM an d 1-D FFT
Benchmark Processor Godson-T Cell Cyclops-64 GTX8800
SGEMMEfficiency 95.9% 1 99.9% 2 43.4% 60.0%
Performance 122.81 204.7 13.9 206.0
1-D FFTEfficiency 33.2% 20.4% 25.8% 29.9%
Performance 63.72 41.8 20.7 155.0
1 The SGEMM kernel contains only multiply-and-add operation, so that the ideal peak performance is measured by the multiply-and-add function unit, which is 128GFLOPS.
2 Efficiency of SGEMM on Cell is slightly better than that on Godson-T, because 256KB SPM for each SPE on Cell makes the better utilization of data locality.
AMS 19
Evaluating Synchronization without Memory
0 8 16 24 32 40 48 56 64
0.1
1
10
0 8 16 24 32 40 48 56 641E-3
0.01
0.1
1
10
100
1000
0 8 16 24 32 40 48 56 640.01
0.1
1
(c) average time for each load(b) lock transferring overhead
Tim
e (u
s)
# of threads
FAA-based SM-based Pthread
(a) lock overhead without lock contention# of threads # of threads
0 8 16 24 32 40 48 56 640.01
0.1
1
10
100
1000
0 8 16 24 32 40 48 56 640.01
0.1
1
10
100
1000
10000
0 8 16 24 32 40 48 56 640.01
0.1
1
10
100
1000
10000
(c) average time of each load(b) barrier overhead with load imbalancing
Tim
e (u
s)
# of threads
SM-Based Pthread
(a) barrier overhead without workload# of threads # of threads
AMS 20
Evaluating Full-empty Bit Synchronization
Livermore loop 6for ( i=1 ; i<n ; i++)
for ( k=0 ; k<i ; k++)W[i] += B[k][i] * W[(i-k)-1];
0 8 16 24 32 40 48 56 64
0
20
40
60
80
100
120
Nor
mal
ized
Spe
edup
# of threads
coarse-grain sync. fine-grain sync. synchronized DTA
Speedup of Livermore Loop 6
0 8 16 24 32 40 48 56 640
8
16
24
32
40
48
56
64
Spee
dup
nom
aliz
ed to
ser
ail p
erfo
rman
ce
# of threads
Speedup of 2-D Wavefront
AMS Godson-T | HOT CHIPS 201121
Overview Microprocessor Architecture Challenges
› Memory wall + Power wall + ILP wall
› The 5P’s: Many-core Processor Challenges
Godson-T Features for Parallel Program› Godson-T architecture overview
› Architectural supports for multithreading
› Software runtime system
Godson-T Design and Implementation› Journey of design and implementation
› Software and hardware co-simulation
› Prototype
AMS Godson-T | HOT CHIPS 201122
Pthread-like Thread Execution using Private
Memory Support master-slave style thread execution; Programming easily: C programming model + Pthread API; Compatible with large amount of conventional codes on shared
memory system.
Inte rc o nne c t
int mater_func()
{
……
int thread_id =
godsont_ create_thread(slave_func, 2, param_array);
……
}
int slave_func(int argc, int argv[])
{
// do computation;
……
godsont_exit(0);
}
SPM
Stack
Pointer
ab
1. Master transmits parameters to slave;
2. Master setups PC and register context for slave;
3. Slave function begins execution, private memory acts as a local stack.
PC
Stack v.s. L1-D$ accesses ≈ 30 (8x8 matrix multiply, no opt.)
Tens of cycles to create or terminate a thread.
AMS Godson-T | HOT CHIPS 201123
GodRunner Software Stack
Applications(ANSI C Programs)
Pthreads-like Library
GNU C Compiler and Binutils
C Libraries
Godson-T Architecture Simulator (GAS)
“GodRunner” Runtime System
Application Binary
Static
Dynamic
Software
Hardware
The programmers majorly focus on expressing parallelism, While the processor and runtime system concentrate on efficient parallel executions.The library provides a rich set of APIs for task management, including batch thread management.Programmer is responsible for creating,terminating and synchronizing tasks by inserting appropriate Pthreads-like APIs.
AMS Godson-T | HOT CHIPS 201124
GodRunnerRuntime System
GodRunner permits programmers to create more tasks than hardware thread units, and transparently maps them to hardware thread units at runtime.
AMS 25
Evaluating the Scalability of
Performance
0 8 16 24 32 40 48 56 640
8
16
24
32
40
48
56
64
Spee
dup
# of Threads
pFind (32K pep.) iBlastP (600 seq.) FFT (m=16) RADIX (n=256k) LU (n=512) CHOLESKY (tk29.O)
Speedup of the bioinformatics and SPLASH-2 benchmarks on Godson-T
AMS Godson-T | HOT CHIPS 201126
Overview Microprocessor Architecture Challenges
› Memory wall + Power wall + ILP wall
› The 5P’s: Many-core Processor Challenges
Godson-T Features for Parallel Program› Godson-T architecture overview
› Architectural supports for multithreading
› Software runtime system
Godson-T Design and Implementation› Journey of design and implementation
› Software and hardware co-simulation
› Prototype
AMS Godson-T | HOT CHIPS 201127
Journey of Godson-T
Data-driven parallelism
Fast on-chip synchronization
Management
Efficient implicit/explicit
memory management
Efficient task management
Fast and accurate
simulation
Maximize parallelism exploitation
High-performance synchronization
High-performance communication
High-performance and ease-to-use task management
Research platform
Tape Out
AMS Godson-T | HOT CHIPS 201128
Monitor/Analysis/Evaluation/Platform
AMS Godson-T | HOT CHIPS 201129
Godson-T Many-Core Prototype
64 light-weight processing mini cores currently
16-core sample:130nm, 230mm2 , SMIC
Target to domain-specific parallel acceleration
Core Coer Core Core
Core Core Core Core
Core Core Core Core
Core Core Core Core Sync.
Manager
L2
Cache
L2
Cache
L2
Cache
L2
Cache
IO
Ctrl.
Chip On Mind ConceptRTL
Chip on EDAPHY Design
Chip on SiliconWafer
Chip in PackagePackage
Chip on BoardSystem
AMS Godson-T | HOT CHIPS 201130
Conclusions
Exploiting parallelism on multiple level Without disturbing DLP, Can merge with DLP processors
Exploiting fine-grained data-driven TLP
A scalable many-core architecture research platform Lock-based cache coherence protocol
Asynchronous DTA and hardware-supported synchronization mechanisms
Pthreads-like programming model, and versatile parallel libraries
Flexible to extend, easy to use, and simple to implement
Keep balance between software and hardware to gain acceptable programmability and performance
AMS Godson-T | HOT CHIPS 201131
Thank you!