High-Efficient Architecture of Godson-T Many-Core Processor

Dongrui Fan, Hao Zhang, Da Wang, Xiaochun Ye, Fenglong Song, Junchao Zhang, Lingjun Fan

Advanced Micro-System Group, National Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences



Godson-T | HOT CHIPS 2011

Overview

Microprocessor Architecture Challenges

Godson-T Features for Parallel Program

Godson-T Design and Implementation


Microprocessor Architecture Revolution

Memory wall + Power wall + ILP wall = Brick wall!

Parallel architectures come to the rescue: the superscalar processor is being replaced by on-chip multi-core and many-core processors.

What parallel architecture will be successful?

[Figure: performance versus power and complexity. Labels: superscalar processor core, simpler processor core, multi-core, many-core, power wall, ILP wall.]


Exploiting Polymorphic Parallelism

[Figure: design space spanning instruction-level parallelism and thread-level parallelism, showing Godson-3, Godson-T, and a heterogeneous many-core. Annotations: "with powerful streaming unit"; "with moderate DLP and strong TLP exploitation".]


Many-Core Challenges: The 5 P’s

Power efficiency challenge
› Performance per watt is the new metric; dark silicon will appear in 2020 when facing 11nm process technology

Performance challenge
› How to scale from 1 to 1000 cores; the number of cores is the new megahertz

Programming challenge
› How to provide a converged many-core solution in a standard programming environment

Parallelism challenge
› How to exploit parallelism through software and hardware co-design

Platform challenge
› How to maximize usability, such as the runtime system (OS), compiler & library, many-core debug…


Where we could help……

Productivity Side:
› Most sequential programmers are not ready to switch to parallel programming: keep a conservative programming model and make incremental improvements?
› A single programming model cannot solve all problems efficiently: support multiple programming features efficiently?
› Locks are messy: a new way to eliminate deadlock?
› ……

Performance Side:
› On-chip synchronization is fast: handle all synchronization on-chip?
› Flops are cheap, memory communication is expensive: trade flops for communication latency?
› Fine-grained parallelism should be taken into consideration: enable data-driven thread execution on chip?
› ……

Take the performance side carefully, because it may affect productivity!


Motivation

Target at HPC

Godson-T: many cores to accelerate one program

[Figure labels: data communication concept; fine-grained thread division; thread synchronization; cost; cost; method.]


Overview

Microprocessor Architecture Challenges
› Memory wall + Power wall + ILP wall
› The 5P’s: Many-core Processor Challenges

Godson-T Features for Parallel Program
› Godson-T architecture overview
› Architectural supports for multithreading
› Software runtime system

Godson-T Design and Implementation
› Journey of design and implementation
› Software and hardware co-simulation
› Prototype


Architecture Overview of Godson-T

[Figure: Godson-T block diagram. Chip level: processing cores (CORE), each with a router (R), private memory, and L1$; L2 cache banks; four memory controllers; I/O controller; synchronization manager. Processing core: private I-$, fetch & decode, register files, INT ALU, FP/vector ALU, sync unit, local memory (SPM and D$), DTA, and router. Software view: program, runtime system, Godson-T processor.]


Processing Core

[Figure: processing core datapath. Blocks: fetch, decode, instruction buffer, vector register file, load/store/synchronization unit, vector/floating-point unit, vector/fixed-point unit, L0_Icache, L0_Dcache, SPM/L1 cache, router. Datapath labels as extracted: 32 Byte x 2x1; 32 Byte 2r/1w; 4 Byte; 8 Byte; 16 Byte; 16 Byte in/out msg; 1 instruction.]

ISA: MIPS (user), SIMD-ext., sync-ext.

8-stage pipeline

Dual-issue per thread

Expected SIMD Load, Store, Data Move, Arithmetic ……

Fast level-1 memory
› 16KB private memory, automatically mapped into the stack address space
› A full/empty bit tagged on each 64-bit slot enables efficient producer-consumer style synchronization

Communication with external modules through message packets


Interconnection

A separate router for each processing core

Static XY wormhole routing

Round-robin arbitration

Two independent physical networks

Duplex 128-bit link for each network

Guaranteed in-order point-to-point communication

Deadlock-free & livelock-free

Scalable and power-efficient for mesh topology

Low-latency core-to-core communication

Separate networks allow traffic segregation and tolerate burst DMA transfers
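
The static XY routing named on this slide can be sketched as follows. This is a minimal illustration of dimension-order routing in general, not the Godson-T router logic: the 8x8 mesh size, the coordinate convention (+x is east, +y is north), and every identifier are assumptions made for the example.

```c
#include <assert.h>

#define MESH_DIM 8  /* assumption: 64 cores arranged as an 8x8 mesh */

typedef struct { int x, y; } node;
typedef enum { PORT_LOCAL, PORT_EAST, PORT_WEST, PORT_NORTH, PORT_SOUTH } port;

/* Dimension-order (XY) routing: correct the X coordinate first, then Y.
 * The output port depends only on the current and destination nodes,
 * which is what makes the route static. */
port xy_next_port(node cur, node dst) {
    if (dst.x > cur.x) return PORT_EAST;
    if (dst.x < cur.x) return PORT_WEST;
    if (dst.y > cur.y) return PORT_NORTH;
    if (dst.y < cur.y) return PORT_SOUTH;
    return PORT_LOCAL;  /* arrived: deliver to the attached core */
}

/* Walk a packet from src to dst and count the router-to-router hops. */
int xy_hops(node src, node dst) {
    assert(src.x >= 0 && src.x < MESH_DIM && src.y >= 0 && src.y < MESH_DIM);
    assert(dst.x >= 0 && dst.x < MESH_DIM && dst.y >= 0 && dst.y < MESH_DIM);
    node cur = src;
    int hops = 0;
    for (;;) {
        port p = xy_next_port(cur, dst);
        if (p == PORT_LOCAL) break;
        if (p == PORT_EAST) cur.x++;
        else if (p == PORT_WEST) cur.x--;
        else if (p == PORT_NORTH) cur.y++;
        else cur.y--;
        hops++;
    }
    return hops;
}
```

Because every packet finishes its X movement before turning into Y, packets never turn from Y back into X. That restriction is the usual argument for why dimension-order routing on a mesh is deadlock-free.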


Memory Hierarchy

[Figure: memory hierarchy from registers (Reg) through local memory and L2 cache to off-chip memory. Data transfer bandwidth labels, as listed: 512GB/s, 128GB/s, 51.2GB/s.]

Each processing core has 32 fixed-point and 32 floating-point registers

32KB local memory, including L1 D$ and SPM

16 address-interleaved L2 cache banks, 128KB each

4 DDR3-1600 memory controllers
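
The off-chip figure of 51.2GB/s is consistent with four DDR3-1600 channels, and the L2 bullet implies 2MB of L2 in total. The sketch below works through that arithmetic and shows one way address interleaving across 16 banks could look. The 64-bit channel width is the standard DDR3 module width and is assumed here, and the 64-byte interleaving granularity and all names are hypothetical; the slides do not state them.

```c
#include <assert.h>
#include <stdint.h>

#define L2_BANKS    16   /* from the slide */
#define L2_BANK_KB  128  /* from the slide */
#define LINE_BYTES  64   /* assumption: interleaving granularity is not given */

/* Peak bandwidth in GB/s of DDR3-1600 controllers, assuming each drives a
 * 64-bit (8-byte) channel at 1600 mega-transfers per second. */
double ddr3_1600_peak_gbs(int controllers) {
    assert(controllers > 0);
    const double transfers_per_sec = 1600e6;
    const double bytes_per_transfer = 8.0;
    return controllers * transfers_per_sec * bytes_per_transfer / 1e9;
}

/* Total L2 capacity in KB across all banks. */
unsigned l2_total_kb(void) {
    return L2_BANKS * L2_BANK_KB;
}

/* Hypothetical interleaving: consecutive lines map to consecutive banks,
 * so a streaming access pattern spreads over all 16 banks. */
unsigned l2_bank_of(uint64_t addr) {
    return (unsigned)((addr / LINE_BYTES) % L2_BANKS);
}
```

Under these assumptions each controller peaks at 12.8GB/s, so four of them give the 51.2GB/s shown for off-chip memory.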


Power Management

• Monitor status of each core at program level

• Shut down or turn on cores separately

• Reassign tasks

[Figure: four 4x4 grids of tiles, each tile labeled RT.]


Thread Communication & Synchronization

Lock-Based Cache Coherent Protocol

Lazy cache coherent protocol

Pure mutual-exclusion synchronization instruction without memory accesses

Eliminate busy-waiting for locks

More scalable than bus-snoopy and directory-based cache protocols

Enable hardware deadlock detection

int mutual_func_a()
{
    ……
    acquire_lock(LOCK_VAL);
    X = 3;
    release_lock(LOCK_VAL);
    ……
}

int mutual_func_b()
{
    ……
    acquire_lock(LOCK_VAL);
    Y = X;
    Z = X;
    release_lock(LOCK_VAL);
    ……
}

[Figure: two mini cores with private caches, connected through the interconnect to the synchronization manager and the shared L2 cache. The cores exchange ACQ., ACK., and REL. messages with the synchronization manager while X changes from 0 to 3. Lock states in sequence: core 0 unlocked, core 1 unlocked; core 0 locked, core 1 unlocked; core 0 locked, core 1 waiting; core 0 unlocked, core 1 locked; core 0 unlocked, core 1 unlocked.]
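
The lock hand-off in the figure (acquire, wait, release, grant) can be modeled in a few lines. The sketch below is only a software model of the queueing behavior suggested by the state sequence; it does not describe the actual synchronization manager hardware or the cache actions of the lazy coherence protocol. The FIFO grant order and all names (`sm_lock`, `sm_acquire`, `sm_release`) are assumptions.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_CORES 64    /* from the prototype description: 64 mini cores */
#define NO_OWNER  (-1)

/* Hypothetical per-lock record kept by a central synchronization manager. */
typedef struct {
    int owner;                /* core holding the lock, or NO_OWNER */
    int waiters[NUM_CORES];   /* circular FIFO of parked cores */
    int head, count;
} sm_lock;

void sm_init(sm_lock *l) {
    l->owner = NO_OWNER;
    l->head = 0;
    l->count = 0;
}

/* ACQ message. Returns true if the ACK is sent at once. Returns false if the
 * core is parked: it receives no ACK yet and does not spin on memory. */
bool sm_acquire(sm_lock *l, int core) {
    assert(core >= 0 && core < NUM_CORES);
    if (l->owner == NO_OWNER) {
        l->owner = core;
        return true;
    }
    assert(l->count < NUM_CORES);
    l->waiters[(l->head + l->count) % NUM_CORES] = core;
    l->count++;
    return false;
}

/* REL message. Hands the lock to the oldest waiter, if any, and returns the
 * core that now receives its deferred ACK, or NO_OWNER if the lock is free. */
int sm_release(sm_lock *l, int core) {
    assert(l->owner == core);
    if (l->count == 0) {
        l->owner = NO_OWNER;
        return NO_OWNER;
    }
    l->owner = l->waiters[l->head];
    l->head = (l->head + 1) % NUM_CORES;
    l->count--;
    return l->owner;
}
```

Replaying the figure: core 0 acquires and is acknowledged, core 1 acquires and is parked, core 0 releases and the lock passes to core 1, then core 1 releases and the lock becomes free.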


Thread Communication & Synchronization

Data Transfer Agent (DTA)

Programmable asynchronous data transfer agent

Supports vertical and horizontal DTA operations (such as prefetch)

Data transfers between multi-dimension addresses (such as matrix inversion)

Network load perception with automatic flow control (improves bandwidth efficiency)

Supports fine-grain synchronous operations

[Figure (a): vertical and horizontal DTA operations. Labels: DTA, SPM, router, on-chip network, L2 cache, off-chip memory, DTA horizontal operation, DTA vertical operation.]

[Figure (b): 2D strided DTA operations. Labels: involved data block address, chunk stride, block stride, block.]
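
To make the 2D strided transfer in figure (b) concrete, here is a minimal software sketch of a gather with a chunk stride and a block stride. The descriptor layout and the meaning given to each field are assumptions based only on the figure labels; the real DTA programming interface is not shown on the slides, and a plain `memcpy` loop stands in for the asynchronous hardware transfer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical 2D transfer descriptor; names follow the figure labels. */
typedef struct {
    size_t chunk_bytes;       /* contiguous bytes copied per chunk */
    size_t chunk_stride;      /* byte distance between chunk starts in a block */
    size_t chunks_per_block;  /* chunks gathered from each block */
    size_t block_stride;      /* byte distance between block starts */
    size_t blocks;            /* number of blocks */
} dta_desc2d;

/* Gather the described chunks from src into a contiguous buffer at dst
 * (for example an SPM region). Returns the number of bytes written. */
size_t dta_gather2d(void *dst, const void *src, const dta_desc2d *d) {
    unsigned char *out = dst;
    const unsigned char *in = src;
    assert(d->chunk_stride >= d->chunk_bytes);  /* chunks must not overlap */
    for (size_t b = 0; b < d->blocks; b++) {
        for (size_t c = 0; c < d->chunks_per_block; c++) {
            const unsigned char *p =
                in + b * d->block_stride + c * d->chunk_stride;
            memcpy(out, p, d->chunk_bytes);
            out += d->chunk_bytes;
        }
    }
    return (size_t)(out - (unsigned char *)dst);
}
```

With `chunk_bytes = 2`, `chunk_stride = 4`, `chunks_per_block = 2`, `block_stride = 16`, and `blocks = 2`, the gather reads source offsets 0-1, 4-5, 16-17, and 20-21 and packs them into 8 contiguous bytes.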


Synchronized DTA Operations

[Figure: state diagram of the full/empty bit with states 1 and 0*. Transition labels: load.future (s) / store.future (s) / store.sync (f); load.sync (s); store.sync (s); load.future (f) / store.future (f) / load.sync (f).]

(s): the operation succeeds on the current state of the full/empty bit

(f): the operation fails on the current state of the full/empty bit

[Figure: source node, target node, and traffic awareness node.]
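
One plausible reading of the state diagram is sketched below as a small state machine. The mapping of transition labels to states is an assumption inferred from the order of the labels in the figure: in state 1 (full) the future operations succeed and leave the bit unchanged, `load.sync` succeeds and empties the slot, and `store.sync` fails; in state 0 (empty) only `store.sync` succeeds and fills the slot. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { FE_EMPTY = 0, FE_FULL = 1 } fe_state;
typedef enum { LOAD_FUTURE, STORE_FUTURE, LOAD_SYNC, STORE_SYNC } fe_op;

/* Apply one synchronized operation to a slot's full/empty bit.
 * Returns true for (s) success and false for (f) failure.
 * On failure the state is left unchanged. */
bool fe_apply(fe_state *s, fe_op op) {
    assert(s != 0);
    switch (op) {
    case LOAD_FUTURE:
    case STORE_FUTURE:
        /* Assumed: futures need a full slot and do not change the bit. */
        return *s == FE_FULL;
    case LOAD_SYNC:
        /* Consumer side: succeeds on full, leaves the slot empty. */
        if (*s != FE_FULL) return false;
        *s = FE_EMPTY;
        return true;
    case STORE_SYNC:
        /* Producer side: succeeds on empty, leaves the slot full. */
        if (*s != FE_EMPTY) return false;
        *s = FE_FULL;
        return true;
    }
    return false;
}
```

In this reading, a producer using `store.sync` and a consumer using `load.sync` alternate strictly on one slot, which is the producer-consumer pattern the full/empty bits are meant to support.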


Evaluating On-chip DTA

[Figure: performance (GFLOPS, axis 0 to 130) of the SGEMM and 1-D FFT kernels for four configurations: cache; SPM without DTA; SPM with DTA; theoretical. Data labels as extracted: 72.8, 64.8, 122.8, 127.5, 24.7, 19.7, 63.7, 72.9.]

Performance comparisons of SGEMM and 1-D FFT

Benchmark   Metric        Godson-T    Cell        Cyclops-64   GTX8800
SGEMM       Efficiency    95.9% (1)   99.9% (2)   43.4%        60.0%
SGEMM       Performance   122.81      204.7       13.9         206.0
1-D FFT     Efficiency    33.2%       20.4%       25.8%        29.9%
1-D FFT     Performance   63.72       41.8        20.7         155.0

(1) The SGEMM kernel contains only multiply-and-add operations, so the ideal peak performance is that of the multiply-and-add functional unit, which is 128GFLOPS.

(2) The efficiency of SGEMM on Cell is slightly better than on Godson-T because the 256KB SPM of each SPE on Cell allows better utilization of data locality.
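
The efficiency figures are the ratio of achieved to peak performance. As a quick check of the Godson-T SGEMM entry, the sketch below uses the 122.8 GFLOPS bar label and the 128 GFLOPS peak from footnote 1. The function names are hypothetical, and the peak implied for 1-D FFT is derived here from the table, not stated on the slide.

```c
#include <assert.h>

/* Efficiency as a percentage of peak performance. */
double efficiency_percent(double achieved_gflops, double peak_gflops) {
    assert(peak_gflops > 0.0);
    return 100.0 * achieved_gflops / peak_gflops;
}

/* Peak performance implied by an achieved rate and a stated efficiency. */
double implied_peak_gflops(double achieved_gflops, double eff_percent) {
    assert(eff_percent > 0.0);
    return 100.0 * achieved_gflops / eff_percent;
}
```

`efficiency_percent(122.8, 128.0)` gives about 95.9, matching the table. `implied_peak_gflops(63.7, 33.2)` gives roughly 192, which suggests the 1-D FFT efficiency is measured against a higher peak than the multiply-and-add peak used for SGEMM.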


Evaluating Synchronization without Memory

[Figure: lock overhead. Time (us, logarithmic axis) versus # of threads (0 to 64) for FAA-based, SM-based, and Pthread locks. Panels: (a) lock overhead without lock contention; (b) lock transferring overhead; (c) average time for each load.]

[Figure: barrier overhead. Time (us, logarithmic axis) versus # of threads (0 to 64) for SM-based and Pthread barriers. Panels: (a) barrier overhead without workload; (b) barrier overhead with load imbalancing; (c) average time of each load.]


Evaluating Full-empty Bit Synchronization

Livermore loop 6

for (i = 1; i < n; i++)
    for (k = 0; k < i; k++)
        W[i] += B[k][i] * W[(i - k) - 1];
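
Each `W[i]` in this kernel depends on every earlier element `W[0..i-1]`, which is why the granularity of synchronization matters. The sketch below rewrites the loop in a data-driven order, where each finished `W[j]` is pushed to all later elements, and checks that both orders agree. It is an illustration of the dependence structure only, not the parallel code evaluated on the slide; the flat row-major layout of `B`, the test values, and all names are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Reference order, as on the slide; B is n x n, row-major: B[k][i] = B[k*n+i]. */
void loop6_reference(size_t n, double *W, const double *B) {
    assert(W != NULL && B != NULL);
    for (size_t i = 1; i < n; i++)
        for (size_t k = 0; k < i; k++)
            W[i] += B[k * n + i] * W[(i - k) - 1];
}

/* Data-driven order: substituting j = i - k - 1, W[j] is final once all
 * earlier elements have pushed into it, and it then contributes to every
 * later W[i]. The inner loop has no dependences between iterations. */
void loop6_data_driven(size_t n, double *W, const double *B) {
    assert(W != NULL && B != NULL);
    for (size_t j = 0; j + 1 < n; j++)
        for (size_t i = j + 1; i < n; i++)
            W[i] += B[(i - j - 1) * n + i] * W[j];
}

/* Compare both orders on small integer-valued inputs, which keeps the
 * floating-point sums exact. Returns 1 if every element matches. */
int loop6_forms_agree(void) {
    enum { N = 6 };
    double B[N * N], Wa[N], Wb[N];
    for (size_t k = 0; k < N; k++)
        for (size_t i = 0; i < N; i++)
            B[k * N + i] = (double)((k + i) % 3 + 1);
    for (size_t i = 0; i < N; i++)
        Wa[i] = Wb[i] = 1.0;
    loop6_reference(N, Wa, B);
    loop6_data_driven(N, Wb, B);
    for (size_t i = 0; i < N; i++)
        if (Wa[i] != Wb[i]) return 0;
    return 1;
}
```

In the data-driven form, a consumer only needs to know when a particular `W[j]` is final, which is the kind of per-element readiness signal that a full/empty bit provides.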

[Figure: Speedup of Livermore Loop 6. Normalized speedup (axis 0 to 120) versus # of threads (0 to 64) for coarse-grain sync., fine-grain sync., and synchronized DTA.]

[Figure: Speedup of 2-D Wavefront. Speedup normalized to serial performance (axis 0 to 64) versus # of threads (0 to 64).]


Pthread-like Thread Execution using Private Memory

Supports master-slave style thread execution

Easy programming: C programming model + Pthread API

Compatible with a large amount of conventional code for shared memory systems

int master_func()
{
    ……
    int thread_id =
        godsont_create_thread(slave_func, 2, param_array);
    ……
}

int slave_func(int argc, int argv[])
{
    // do computation;
    ……
    godsont_exit(0);
}

1. Master transmits parameters to slave;

2. Master sets up the PC and register context for the slave;

3. Slave function begins execution; private memory acts as a local stack.

[Figure: cores connected through the interconnect. Labels: SPM, stack pointer, PC, a, b.]

Stack vs. L1-D$ accesses ≈ 30 (8x8 matrix multiply, no opt.)

Tens of cycles to create or terminate a thread.
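
For readers more familiar with standard POSIX threads, the same master-slave pattern can be written as below. This is an analogue using the real `pthread_create` and `pthread_join` calls, not the Godson-T library: the slide shows only `godsont_create_thread` and `godsont_exit`, and their exact signatures and semantics are not given. The `slave_args` struct and the summing workload are made up for the example.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical argument block, standing in for (argc, argv) on the slide. */
typedef struct {
    int argc;
    const int *argv;
    long result;
} slave_args;

/* Slave: does its computation (here, a sum) and returns. */
static void *slave_func(void *p) {
    slave_args *a = p;
    long sum = 0;
    for (int i = 0; i < a->argc; i++)
        sum += a->argv[i];
    a->result = sum;
    return NULL;
}

/* Master: passes parameters, starts the slave, and waits for it. */
long master_func(const int *params, int count) {
    slave_args args = { count, params, 0 };
    pthread_t tid;
    int rc = pthread_create(&tid, NULL, slave_func, &args);
    assert(rc == 0);
    rc = pthread_join(tid, NULL);
    assert(rc == 0);
    return args.result;
}
```

On most toolchains this needs the `-pthread` compiler flag. On a conventional OS, thread creation goes through the kernel and typically costs far more than the tens of cycles quoted on the slide.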


GodRunner Software Stack

[Figure: software stack. Layers: applications (ANSI C programs); Pthreads-like library; GNU C compiler and binutils; C libraries; application binary; “GodRunner” runtime system; Godson-T Architecture Simulator (GAS). Axis labels: static / dynamic; software / hardware.]

Programmers focus mainly on expressing parallelism, while the processor and runtime system concentrate on efficient parallel execution.

The library provides a rich set of APIs for task management, including batch thread management.

The programmer is responsible for creating, terminating, and synchronizing tasks by inserting the appropriate Pthreads-like API calls.


GodRunner Runtime System

GodRunner permits programmers to create more tasks than hardware thread units, and transparently maps them to hardware thread units at runtime.
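
Mapping more tasks than thread units can be pictured with a small simulation. The sketch below uses a simple greedy policy (each queued task goes to the unit that becomes free first). GodRunner's actual scheduling policy is not described on the slides, so the policy, the cost model, and all names here are assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_UNITS 64  /* assumption: one hardware thread unit per core */

/* Simulate running `tasks` tasks on `units` thread units. cost[t] is the
 * run time of task t in arbitrary units. Tasks are taken in queue order
 * and placed on the unit that frees up first (lowest index on ties).
 * If `assigned` is non-NULL, assigned[t] receives the chosen unit.
 * Returns the time at which the last task finishes. */
unsigned long map_tasks(const unsigned *cost, size_t tasks,
                        size_t units, size_t *assigned) {
    unsigned long busy_until[MAX_UNITS] = {0};
    unsigned long makespan = 0;
    assert(units >= 1 && units <= MAX_UNITS);
    for (size_t t = 0; t < tasks; t++) {
        size_t best = 0;
        for (size_t u = 1; u < units; u++)
            if (busy_until[u] < busy_until[best])
                best = u;
        busy_until[best] += cost[t];
        if (assigned != NULL)
            assigned[t] = best;
        if (busy_until[best] > makespan)
            makespan = busy_until[best];
    }
    return makespan;
}
```

For example, five tasks with costs 3, 1, 2, 2, 1 on two units are placed on units 0, 1, 1, 0, 1 and finish at time 5. The program only creates tasks; it never names a thread unit.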


Evaluating the Scalability of Performance

[Figure: Speedup of the bioinformatics and SPLASH-2 benchmarks on Godson-T. Speedup (axis 0 to 64) versus # of threads (0 to 64) for pFind (32K pep.), iBlastP (600 seq.), FFT (m=16), RADIX (n=256k), LU (n=512), and CHOLESKY (tk29.O).]


Journey of Godson-T

[Figure: design journey leading to tape-out. Goals: maximize parallelism exploitation; high-performance synchronization; high-performance communication; high-performance and easy-to-use task management; research platform. Techniques: data-driven parallelism; fast on-chip synchronization management; efficient implicit/explicit memory management; efficient task management; fast and accurate simulation.]


Monitor/Analysis/Evaluation/Platform


Godson-T Many-Core Prototype

64 light-weight processing mini cores currently

16-core sample: 130nm, 230 mm², SMIC

Targeted at domain-specific parallel acceleration

[Figure: floorplan of the 16-core sample: a 4x4 array of cores with four L2 cache blocks, the synchronization manager, and the I/O controller.]

[Figure: development stages: chip on mind (concept, RTL); chip on EDA (PHY design); chip on silicon (wafer); chip in package (package); chip on board (system).]


Conclusions

Exploiting parallelism on multiple levels
› Without disturbing DLP; can merge with DLP processors
› Exploiting fine-grained data-driven TLP

A scalable many-core architecture research platform
› Lock-based cache coherence protocol
› Asynchronous DTA and hardware-supported synchronization mechanisms
› Pthreads-like programming model and versatile parallel libraries
› Flexible to extend, easy to use, and simple to implement

Keep a balance between software and hardware to gain acceptable programmability and performance


Thank you!