31
ARM® 64-bit Coherent Scale Out over RapidIO® Rick O’Connor [email protected] 13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing 1

ARM® 64-bit Coherent Scale Out over RapidIO® · ACE-Lite ASB APB AHB APB2 AXI3 ATB AHB-Lite APB3 AXI4 AXI4-Lite AXI4-Stream APB4 Time AMBA 1 AMBA 2 AMBA 3 AMBA 4 ‘13 AMBA 5 CHI

  • Upload
    others

  • View
    17

  • Download
    0

Embed Size (px)

Citation preview

ARM® 64-bit Coherent Scale Out over RapidIO®

Rick O’Connor [email protected]

13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing 1

Outline •  Open Industry Collaboration

– ARM 64-bit Coherent Scale Out over RapidIO Task Group

•  System Challenges & Coherent Scale Out Terminology

•  What is RapidIO? •  Summary

– Call for Industry Participation •  Background Material

– ARM AMBA® & ACE – RapidIO Globally Shared Memory

13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing 2

Task Group Charter •  The ARM 64-bit Coherent Scale Out over RapidIO Task Group shall

be responsible for developing a specification for multi SoC / core coherent scale out of ARM 64-bit cores with the following functionality: –  coherent scale out of a few 10s to 100s cores & 10s of sockets –  ARM AMBA® protocol mapping to RapidIO protocols

•  AMBA 4 AXI4/ACE mapping to RapidIO protocols •  AMBA 5 CHI mapping to RapidIO protocols

–  Migration path from AXI4/ACE to CHI and future ARM protocols –  support for GPU/DSP floating point heterogeneous systems –  HW hooks and definition to support RDMA, MPI, secure boot,

authentication, SDN, Open Flow, Open Data Plane, etc –  Other functionality as necessary to for performance critical

computing - support Data Center, HPC and Networking Infrastructure system development and deployment

13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing 3

Task Group Contributors

13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing 4

System Challenges Networking and Computing Infrastructure

13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing 5

Networking & Computing Infrastructure View

Fiber Termination 1010101010111

1010101010111 Cable

DSLAM

Residential Access

Wifi

Aggrega&on  

Access  

Microwave

1010101010111

1010101010111 1010101010111

Macro Basestation

Basestation Aggregation

Radio Access

Small Cell Basestation

Enterprise Data Center

Enterprise

Core  

Data Center

13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing 6

Hybrid Cloud - Networking and Computing functions Colliding

13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing 7

Application specific boxes transitioning to converged, scaled heterogeneous nodes

Mobile Base

Station

Access Fibre/

uWave Aggregation Gateways Core

Data center power budget •  ATCA blade: ~200W

Many power constrained form factors •  Small cell: ~5W SoC •  Macro cell: ~25W SoC

Move to general purpose processors Virtualisation

Separation of operational plane (hardware) and control plane Consolidation (different boxes/subsystems within a box)

Flexibility for evolving workloads

Access   “Cloud”  

Data Center

Distribution of Flexible Intelligence End-to-End

13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing 8

Pool of scalable heterogeneous nodes with increased visibility and programmability

Access to distributed intelligence for flexibility of workloads, traffic,

services Heterogeneous platforms to meet diversity of physical requirements,

workload requirements

Access   “Cloud”  

Processing

Storage

Acceleration

Networking

Processing

Storage

Acceleration

Networking

Processing

Storage

Acceleration

Networking

P

N

A

P

N

S A P

N

S A

System Level Coherency

•  What is it? – Hold the same memory location in multiple

caches – All caches kept up to date with most recent data

•  Why do we need it? – Sharing data structures between chips,

without the need for software management •  How does it work?

– Multiple copies of data are allowed – Enforce sequential modification of cache lines

13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing 9

Software View of Coherency

•  It should just work! •  Without hardware coherency the options

are: – Software cache maintenance

•  Requires fine details of data location •  Additional processor overhead •  Prone to error •  Difficult to verify/validate/debug

•  Improves performance, without long debug cycle

13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing 10

Scale Out Model - Terminology

13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing 11

Inter-Node

Inter-Blade Multi-Node Processing

Intra-Node

Inter-SoC Multi-core Processing

Inter-Chassis

Inter-Box Grid/Cluster Computing

Domain-Specific Middlewares / Frameworks

Distributed Applications

Interconnect Technologies SoC Backplane / Local Network Wide Area Network

Shared Memory MPI (RDMA)

Socket (TCP/UDP)

up to ~100mm

100 cores / node

50 - 100 ns

Up to 100 Gbytes/s

100mm –10m

16 – 200 blades

100ns – 20,000ns

1-10 Gbytes/s

100m+

50,000 boxes

1ms – 5s

0.1-1 Gbytes/s

Distance

Scalability

Latency

Throughput

This  is  our

 target  scal

e  

out  applica

&on  space  

ARM AMBA & RapidIO Cache Coherency

13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing 12

Features   ARM    AMBA  Cache  Coherency   RapidIO  Cache  Coherency  

Architecture   CC-­‐NUMA   CC-­‐NUMA  

Fabric   Cache  Coherency  for  mul;-­‐core  SoC   Cache  Coherency  for  mul;-­‐SoC  System  

Coherency  State  Model   5-­‐State  or  smaller  Coherency  Model   3-­‐State  or  larger  Coherency  Model  

Granularity   Cache-­‐Line  and  Large  Buffer  64-­‐byte  (baseline)  

Cache-­‐Line  Granularity  64-­‐byte  (baseline)  

Protocol  Implementa;on  

Snoopy  Cache  Coherency  Snoop  Filter  Cache  Coherency  

Directory  based  Cache  Coherency  Other  extension  possible  

Complimentary Cache Coherency Protocols for multi-core and multi-SoC Unified Fabric Scale-out System

What is RapidIO? Low Latency 10Gbps per lane Unified Fabric

13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing 13

RapidIO Overview

•  Proven technology > 10 years of market deployment

•  6.25Gbps/lane (6xN) RapidIO on CPUs, DSPs, FPGAs and ASICs

•  10Gbps/lane Specification (10xN) Released Q4 2013

•  25Gbps/lane (25xN) on roadmap •  Hardware termination at PHY layer •  Lowest Latency Interconnect ~ 100 ns •  Inherently scales to 10,000’s of nodes

13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing

RapidIO Switch

FPGA DSP CPU

RapidIO Switch

•  Over 100 million 10-20 Gbps ports shipped worldwide •  100% 4G/LTE interconnect market share

•  60% Global 3G & 100% China 3G interconnect market share

14

Hardware Terminated Protocol Stack

13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing

Impl

emen

ted

in H

ardw

are

15

RapidIO Global Shared Memory

•  CC-NUMA model – Cache Coherent – Non-Uniform Memory Access – Directory based

•  Coherency for –  Instruction cache – Data cache

•  Processor to processor •  Processor to shared memory

– Non-cached •  IO data fetch •  Memory management unit page table invalidation/

synchronization

13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing 16

Coherency  Reference  System  

•  Illustrative example – not limited to 4 sockets

13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing 17

ARM  SOC  

ARM  SOC  

ARM  SOC  

ARM  SOC  

4  Core  ARM  Cluster  

4  Core  ARM  Cluster  

RapidIO  Coherency  Interface  

Memory  

AXI/ACE/AMBA 5

Memory Directory “Coherency Granule” States

13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing

ARM  SOC  

ARM  SOC  

ARM  SOC  

ARM  SOC  

1  

2  3  

0  

“Coherency Granule” States 18

ARM AMBA & ACE

13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing 19

Evolution of AMBA specifications

‘95 ‘99 ‘03 ‘10 ‘11

ACE

ACE-Lite

ASB

APB

AHB

APB2

AXI3

ATB

AHB-Lite

APB3

AXI4

AXI4-Lite

AXI4-Stream

APB4

Time

AMBA 1 AMBA 2 AMBA 3 AMBA 4

‘13

AMBA 5 CHI

‘14 13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing 20

ARM ACE Cache Line States

•  Unique – It is only in this cache •  Shared – It may be in another cache •  Clean – Does not have to update main memory •  Dirty – Has responsibility for updating main memory

Unique Dirty

Unique Clean

Shared Dirty

Shared Clean

Invalid

Invalid Valid

Unique Shared D

irty

Cle

an

13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing 21

Summary

13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing 22

Join Us!

13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing 23

Join Us! •  Open Call for Industry Participation •  The ARM 64-bit Coherent Scale Out over RapidIO

Task Group shall be responsible for developing a specification for multi node / core coherent scale out of ARM 64-bit cores with the following functionality:

•  coherent scale out of a few 10s to 100s cores & 10s of sockets –  ARM AMBA® protocol mapping to RapidIO protocols

•  AMBA 4 AXI4/ACE mapping to RapidIO protocols •  AMBA 5 CHI mapping to RapidIO protocols

–  Migration path from AXI4/ACE to CHI and future ARM protocols

13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing 24

Background Material

13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing 25

ARM ACE / MOESI Mapping

ARM  ACE   MOESI   ACE  Meaning  

UniqueDirty   M  –  Modified   Not  shared,  dirty,  must  be  wriTen  back  

SharedDirty   O  –  Owned     Shared,  dirty,  must  be  wriTen  back  to  memory  

UniqueClean   E  –  Exclusive   Not  shared,  clean  

SharedClean   S  –  Shared   Shared,  no  need  to  write  back,  may  be  clean  or  dirty  

Invalid   I  –  Invalid   Invalid  

13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing 26

RapidIO GSM Transactions

13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing 27

RapidIO GSM Transactions (2)

13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing 28

RapidIO GSM Example 1

13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing

0   0   0  0  

0   0   0  1  1   0  

SoC 1 Reads SoC 0

0   1   1  0  1  0  2  

SoC 2 Reads With Intent to Modify

29

RapidIO GSM Example 2

13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing

1   0   0  0  3  

0  SoC 3 Reads SoC 0

0   0   0  0  

3   0  SoC 3 Releases Ownership

0   1   1  0  

2  

30

RapidIO GSM Example 3

13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing

0   0   0  0  3   0  

SoC 3 Requests Data Flush to Memory

0   1   1  0  

2  

0   0   0  0  

non-coherent IO Peripheral Reads Memory

1   0  

31