21
Michael J. Miller VP, Technology Innovation & System Applications Second Generation Bandwidth Engine® IC Breaks 4.5 Billion Accesses/sec Hot Chips 25 Stanford Memorial Auditorium August 25-27, 2013

Second Generation Bandwidth Engine IC Breaks 4.5 … J. Miller VP, Technology Innovation & System Applications Second Generation Bandwidth Engine® IC Breaks 4.5 Billion Accesses/sec

Embed Size (px)

Citation preview

Michael J. Miller

VP, Technology Innovation & System Applications

Second Generation Bandwidth Engine® IC

Breaks 4.5 Billion Accesses/sec

Hot Chips 25 Stanford Memorial Auditorium August 25-27, 2013

The Vision: Fast, Intelligent Access

Architecture

Packet Processor

0

1

n-1

n

Serial Link

Serial Link

Serial Link

Serial Link

Bandwidth Engine – Second Generation

12 GOp/s

4.5 GA

384 Gb/s

Bit Safe™

Technology

ingress egress

Multi-bank

Multi-partitions

allow for

high access

availability

Multi-threaded

Multi-Cores

allow for

high processing

throughput

Multi-linked

allow for

concurrent

transport

operations

Fixed Macro

ALUs for

functional

Acceleration

minimizes

intra-chip

traffic

Atomic

Memory

Operations

Allows

Extended

Carrier Class

& In package

Repair

ALU ALU

2 Copyright ©MoSys, Inc. 2013. All rights reserved.

Bandwidth Engine 2 Architecture & Family

Sampling Now

Parallel Array Architecture … Performance up to: 16 outstanding transactions

4.5G Accesses per second

192 Gbps full duplex throughput

~12ns deterministic read latency

2.7ns Random cycle time (tRC)

GigaChip Interface … 90% Efficient Transport Protocol Up to sixteen low latency SerDes lanes (8G to 15G)

High Reliability …70X better SER than 6T-SRAM Full ECC support; 72bit array and macro datapath operations

CRC protected and self recovering GigaChip Interface

SEU resistant 1T-SRAM Memory core: < 10 FIT/Mb

Bit Safe™ Self Test and Self Repair Option

`

Array Manager

GCI Port B

GCI Port A

TX RX

8 8 8 8

2M X 72

Burst,

Flat,

Macro

2M X 72

Burst,

Flat,

Macro

2M X 72

Burst,

Flat,

Macro

2M X 72

Burst,

Flat,

Macro

GCI Port GCI Port

Host

TX RX

RX TX RX TX

Applications BE1 - MSR576 MSR620 MSR820 MSR720

Lookup, LPM, Hash Highest single component Access Rate

Statistics Onboard ALU Onboard ALU+

Buffer – up to 80% efficient Per cycle Burst, Write Broadcast

Metering, Dual Ops Fixed Macro

Semaphore, Link List Atomic Ops

State, Queuing, Link List Dual Port w/ Data Coherency,

36 bit word access

Bandwidth Engine – 2nd Generation

3 Copyright ©MoSys, Inc. 2013. All rights reserved.

Functional Design

Bandwidth Engine MSR720 Architecture

GCI-B

GCI-B

15G Rx SerDes

15G Tx SerDes

GCI-A

15G Rx SerDes

GCI-A

15G Tx SerDes

Memory Controller BCR BCR BCR BCR

Bank 0

Bank 1

Bank 2

Bank 3

Bank 4

Bank 59

Bank 60

Bank 61

Bank 62

Bank 63

Memory Ops:

Rd/Wr 72b

Wr 36b

375 MHz Clock x

8 Reads +

8 Writes

6GA Internally

Bank 0

Bank 1

Bank 2

Bank 3

Bank 4

Bank 59

Bank 60

Bank 61

Bank 62

Bank 63

Bank 0

Bank 1

Bank 2

Bank 3

Bank 4

Bank 59

Bank 60

Bank 61

Bank 62

Bank 63

Bank 0

Bank 1

Bank 2

Bank 3

Bank 4

Bank 59

Bank 60

Bank 61

Bank 62

Bank 63

Results

Bank Conflict Resolution:

32K x 72b SRAM

Delayed Write Cache

4 x Partitions:

64 Single Port Banks

32K x 72b each

5 Copyright ©MoSys, Inc. 2013. All rights reserved.

A r r a y M a n a g e r

GCI Port B

GCI Port A

RX TX RX TX

TX RX TX RX

8 8 8 8

2M X 72

(1 bank)

2M X 72

(1 bank)

2M X 72

(1 bank)

2M X 72

(1 bank)

GCI Port GCI Port

Host

MSR720: Bandwidth Engine 2 – Access

Simultaneous Read & Write @ same

address

… treat each partition as a single bank

… >>2x the access performance of QDR SRAM • 4.5GA : 3 billion reads/sec, 1.5B 36b write/sec

72b & 36b words each access

12 ns read latency pin to pin

2.7ns tRC cycle time

8.5W @ 12.5G system power

… 90% efficient transport protocol Up to Sixteen 15G serial lanes (2 links of 8 lanes)

High Density

… 576Mbit 1T-SRAM ® memory core 19mm x 19mm package – 1mm pitch

High Reliability

… 70X better SER than 6T-based SRAM Memory core: < 10 FIT/Mb native

Interface: < 1 FIT

Bit Safe™ Self Test and Self Repair Option

Bandwidth Engine - Access

6 Copyright ©MoSys, Inc. 2013. All rights reserved.

-

0.5

1.0

1.5

2.0

2.5

3.0

3.5

4.0

4.5

0% 20% 40% 60% 80% 100%

Gig

a-A

cce

ss

es

/se

c (

GA

/s)

%Reads

MSR720: Basic ‘Dual Port’ Operation

Flat Partition Mode:

Double the effective bandwidth in worst case: • Bank Conflict Resolution (BCR) function allows for

simultaneous Read and Write of the same bank

• Implements dual port behavior with single port banks

Guarantees data coherency

3 billion accesses / sec @ 15G

Memory

Partition

BCR logic RDFP Data

WRFP Data

banks hidden =

‘flat’ addressing

BCR guarantees

full data

coherency

Dynamically

Scheduled

Writes

Performance @ 15G

Scheduling Balanced R:W

Balanced R:W Controller

Fra

me

Input (RX)

Fra

me

Output (TX)

Partition CMDARX CMDBRX QATX QATX

1 0/1

RDFP WRFP RDFP WRFP 16 RDFP RDFP

2 WD WD 17

3 2/3

RDFP WRFP RDFP WRFP 18 RDFP RDFP

4 WD WD 19

15G

1.50 GA Write

1.50 GA Read

3.00 GA Total

Re

ad

Wri

te

Wri

te

Rea

d

7 Copyright ©MoSys, Inc. 2013. All rights reserved.

“BCR” Implemented w/Delayed Write Cache

Conceptual Block Diagram

Bank 0 Bank 1 Bank 2

Bank n-1 Bank n

Delayed Write Cache Dual Port Memory

Write

Data

Read

Data

Data Data

Write Address

Bank Offset

Read Address

Bank Offset Addr Addr

Write Address

Bank #

Read Address

Bank # Addr Tag

Cached Read Data

Read Data

Hit

RAB # WAB # WAB Offset RAB Offset

Addr Tag

Cached Write Data

8 Copyright ©MoSys, Inc. 2013. All rights reserved.

Up to 4 Accesses Per Partition In One Cycle

3 Accesses Sustained Throughput

The MSR720 supports Bank Aware commands. This can be used to improve the data read performance on the interface,

however bypasses the BCR logic

Useful for unified memory applications combining dual port SRAM performance of “buffers” or “state” tables and read-only “lookup” tables

The two address ranges cannot overlap.

MSR720 supports 36b write operations: WRFP and WRBA Useful for small word size tables (pointers)

Increases write performance

Memory

Partition

BCR logic

WRBA Data

RDFP Data

RDBA Data

WRFP Data

banked array

banks hidden

‘flat’ addressing

Dynamically

Scheduled

Writes

9 Copyright ©MoSys, Inc. 2013. All rights reserved.

Write Flat Partition

Read Flat Partition

Read Bank Aware

Write Bank Aware

1

2

3 4

Mapping Ports and Banks

A r r a y M a n a g e r

GCI Port B

GCI Port A

RX TX RX TX

8 8 8 8

1M X 72

(1 bank)

1M X 72

(1 bank)

1M X 72

(1 bank)

1M X 72

(1 bank)

100G

100G

100G

100G

PPE

PPE

PPE

PPE

1M X 72

1M X 72

1M X 72

1M X 72

For State Memory

(WrFP, RdFP)

For other

functions like

Look Up Table

(RdBA) 10 Copyright ©MoSys, Inc. 2013. All rights reserved.

MSR720 Access Scheduling

Flat Partition guarantees no bank conflict between read and write to any address, even in the same cycle

Balanced R:W Controller (Basic 72b R:W Operation) Native BE Controller (72b WRITES)

Fram

e Input (RX)

Fram

e Output (TX)

Fram

e Input (RX)

Fram

e Output (TX) Partition CMDARX CMDBRX QATX QATX Partition CMDARX CMDBRX QATX QATX

1 0/1

RDFP WRFP RDFP WRFP 16 RDFP Data RDFP Data 1 0/1

RDFP RDBA RDFP RDBA 16 RDFP RDFP 2 WD WD 17 2 WRFP WDL WRFP WDL 17 RDBA RDBA 3

2/3 RDFP WRFP RDFP WRFP 18 RDFP Data RDFP Data 3

2/3 RDFP RDBA RDFP RDBA 18 RDFP RDFP

4 WD WD 19 4 WRFP WDL WRFP WDL 19 RDBA RDBA 15G 1

0/1 RDFP RDBA RDFP RDBA 16 RDFP RDFP

1.50 GA Write 2 WRFP WDU WRFP WDU 17 RDBA RDBA 1.50 GA Read 3

2/3 RDFP RDBA RDFP RDBA 18 RDFP RDFP

3.00 GA Total 4 WRFP WDU WRFP WDU 19 RDBA RDBA 15G

1.50 GA Read Native BE Controller (36b WRITES) 0.75 GA Write WDL = Lower 36b data

Fram

e Input (RX)

Fram

e Output (TX) 1.50 GA Read WDU = Upper 36b data

Partition CMDARX CMDBRX QATX QATX 3.75 GA Total 1

0/1 RDFP RDBA RDFP RDBA 16 RDFP RDFP

2 WRFP WD WRFP WD 17 RDBA RDBA Partition Access Restrictions 3

2/3 RDFP RDBA RDFP RDBA 18 RDFP RDFP

Fram

e RX 4 WRFP WD WRFP WD 19 RDBA RDBA Partition GCI-A GCI-B

15G 1 0/1 P0 P1

1.50 GA Read 2 1.50 GA Write (36b) 3

2/3 P2 P3

1.50 GA Read 4

4.50 GA Total

BA

FP FP

BA

FP FP

FP FP

11 Copyright ©MoSys, Inc. 2013. All rights reserved.

8 x SerDes

Rx Lanes 8 x SerDes

Rx Lanes

8 x SerDes

Tx Lanes 8 x SerDes

Tx Lanes

2.7

ns

WD = 72b data

MSR 720 : Breaking 4.5GA

-

0.5

1.0

1.5

2.0

2.5

3.0

3.5

4.0

4.5

0% 20% 40% 60% 80% 100%

Gig

a-A

cc

es

se

s/s

ec

(G

A/s

)

Read to Write Ratio (%Reads)

Second Gen BE:

MSR 720

Peak

w/15Gbps

SerDes

72b/36b FP Write

72b FP Read

72b BA Read

Sigma-Quad

or QDR SRAM

BL2 @ 600MHz

36b BA Write

First Gen BE

Note: Datasheet comparison - 1 access = 72 bits

- FP Access is arbitrated by BCR.

- BA Access is to/from array only. Must be bank aware.

RLDRAM3

12 Copyright ©MoSys, Inc. 2013. All rights reserved.

Get More Done With The

Same Amount of I/O

Bandwidth Engine 2 Architecture & Family

Parallel Array Architecture … Performance up to: 16 outstanding transactions

4.5G Accesses per second

192 Gbps full duplex throughput

~12ns deterministic read latency

2.7ns Random cycle time (tRC)

GigaChip Interface … 90% Efficient Transport Protocol Up to sixteen low latency SerDes lanes (8G to 15G)

High Reliability …70X better SER than 6T-SRAM Full ECC support; 72bit array and macro datapath operations

CRC protected and self recovering GigaChip Interface

SEU resistant 1T-SRAM Memory core: < 10 FIT/Mb

Bit Safe™ Self Test and Self Repair Option

`

Array Manager

GCI Port B

GCI Port A

TX RX

8 8 8 8

2M X 72

Burst,

Flat,

Macro

2M X 72

Burst,

Flat,

Macro

2M X 72

Burst,

Flat,

Macro

2M X 72

Burst,

Flat,

Macro

GCI Port GCI Port

Host

TX RX

RX TX RX TX

Applications BE1 - MSR576 MSR620 MSR820 MSR720

Lookup, LPM, Hash Highest single component Access Rate

Statistics Onboard ALU Onboard ALU+

Buffer – up to 80% efficient Per cycle Burst, Write Broadcast

Metering, Dual Ops Fixed Macro

Semaphore, Link List Atomic Ops

State, Queuing, Link List Dual Port w/ Data Coherency,

36 bit word access

Bandwidth Engine – 2nd Generation

14 Copyright ©MoSys, Inc. 2013. All rights reserved.

2M x 72 Partition

ECC

ECC

RA

RD Byte Count

WA WD WD

RD Packet Count

Addr 16b Imm

16

16

16b

64b

72

GCI Port

RX TX

16ns

ECC

ECC

1b 64b

+1

72

Dual Counter: 5 Stage Pipeline

Includes Index Compare Logic (not illustrated) for Data Forwarding:

In case of an index match, data is forwarded in the pipeline

Prevents stale data in the pipeline

Similar pipeline for Split Counter

End-to-End ECC Protection

16 ns to completion

15 Copyright ©MoSys, Inc. 2013. All rights reserved.

Individual Flow Programmability Meter Type

Flow Rates

Thresholds

8M Two Color Flows

4M Three Color Flows

Line Rate 4x100G

MSR820 – Metering Capabilities

CIR

EIR

CBS

EBS

> EBS

+

≤ EBS ≤ CBS

+

CIR

CBS

EBS

> EBS

+

≤ EBS ≤ CBS

+

Overflow

Single Rate Three Color Meter Two Rate Three Color Meter

Basic Meter (Two Color)

IR

BS

> BS ≤ BS

+

16 Copyright ©MoSys, Inc. 2013. All rights reserved.

Intelligent Offload, Fire Forward Architecture

MSR820 – Bandwidth Engine 2 –

Intelligent Memory Macros Leverages I/O

Second Generation Bandwidth Engine Architecture Up to 4.5 billion external memory accesses per second w/16 SerDes Lanes

Macros support up to 6 billion internal accesses per second w/8 SerDes Lane

Macros execute Atomically: Stats, Metering, Read & Set, Test & Set

Partition

Bandwidth Engine

f(χ)

Partition f(χ)

Partition f(χ)

Partition f(χ)

Packet Processor

Instruction

CIR

EIR

CBS

EBS

> EBS

+ ≤ EBS ≤ CBS

+

Stats Table +1

Byte Packet

Supports:

4 x 100GE ports

w/ Stats + Metering

over 8 lanes

17 Copyright ©MoSys, Inc. 2013. All rights reserved.

Physical Design

Conceptual Timing & Data Access Control

Analog D Q

Clock Data Recovery

Slip B

uf

DeS

cram

ble

r

Memory Macro

Serializer

Analog

Ou

tpu

t Reg

Frame Detect & Clock Div

Bit Clk

Lane Offset

Access Phase DLL Timing

Data Path

Timing Path

Rx & Bit Aligner

Lane/FrameAligner Serializer&Tx

Mem Access & RMW

BE 1: Read Latency of 15.9ns vs. BE2: Read Latency of ~12.5ns

19 Copyright ©MoSys, Inc. 2013. All rights reserved.

DeS

eria

lizer

Ou

tpu

t Reg

Instru

ction

P

ipelin

e

Core Clk

ALU

BCR

MSR720 Layout

BCR Cach

e

Ban

k 0

8x SerDes

8x SerDes

32K x 72b 32K x (72b + 8b + 6b +1b) Data + ECC + Tag + Valid

Partition 0

~6x

Area

SerDes I/O

5% of Die Area

@ 240 Gbps

20 Copyright ©MoSys, Inc. 2013. All rights reserved.

QDR like

Dual Port

Performance

for 1.10x

vs

3x die area

cost

Michael J. Miller

[email protected]

Thank you

Copyright ©MoSys, Inc. 2013. All rights reserved. 21

Hot Chips 25 Stanford Memorial Auditorium August 25-27, 2013