
Page 1: NoC Bergman

Frontiers in Nanophotonics and Plasmonics, Guaruja, SP Brazil

November, 2007

Silicon Photonic On-Chip Optical Interconnection Networks

Keren Bergman, Columbia University

Page 2: NoC Bergman


Acknowledgements

Columbia

– Prof. Luca Carloni

– Dr. Assaf Shacham, Michele Petracca, Ben Lee, Caroline Lai, Howard Wang, Sasha Biberman

IBM

– Jeff Kash

– Yurii Vlasov

Cornell

– Michal Lipson

Page 3: NoC Bergman


Emerging Trend of Chip MultiProcessors (CMP)

Niagara, Sun 2004

Cell BE, IBM 2005

Montecito, Intel 2004

Terascale, Intel 2007

Barcelona, AMD 2007

Page 4: NoC Bergman


Networks on Chip (NoC)

Shared, packet-switched, optimized for communications

– Resource efficiency

– Design simplicity

– IP reusability

– High performance

But… no true relief in power dissipation

[Kolodny, 2005]

Page 5: NoC Bergman


Chip Multiprocessors: the IBM Cell

Technology process: 90nm SOI with low-κ dielectrics and 8 metal layers of copper interconnect

Chip area: 235 mm²

Number of transistors: ~234M

Operating clock frequency: 4 GHz

Power dissipation: ~100 W

Percentage of power dissipation due to global interconnect: 30-50%

Intra-chip, inter-core communication bandwidth: 1.024 Tb/s, 2 Gb/s/lane (four shared buses, 128 bits data + 64 bits address each)

I/O communication bandwidth: 0.819 Tb/s (includes external memory)

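The quoted intra-chip bandwidth can be sanity-checked from the bus description; a minimal sketch, assuming the 1.024 Tb/s figure counts only the 128 data bits of each of the four shared buses at 2 Gb/s per lane:

```python
# Sanity check of the quoted Cell intra-chip bandwidth (assumption: only the
# 128 data bits of each of the four shared buses count, at 2 Gb/s per lane).
buses = 4
data_bits_per_bus = 128
lane_rate_gbps = 2  # Gb/s per lane

intra_chip_tbps = buses * data_bits_per_bus * lane_rate_gbps / 1000
print(f"Intra-chip bandwidth: {intra_chip_tbps:.3f} Tb/s")  # 1.024 Tb/s
```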

Page 6: NoC Bergman


The Interconnection Challenge: Off-Chip Bandwidth

Off-chip bandwidth is rising

– Pin count

– Signaling rate

Some examples

Page 7: NoC Bergman


Why Photonics for CMP NoC?

OPTICS: Modulate/receive ultra-high bandwidth data stream once per communication event

Transparency: broadband switch routes entire multi-wavelength high BW stream

Low power switch fabric, scalable

Off-chip and on-chip can use essentially the same technology

Off-chip BW = On-chip BW for nearly same power


ELECTRONICS: Buffer, receive and re-transmit at every switch

Off-chip is pin-limited and really power hungry

Photonics changes the rules for Bandwidth-per-Watt, On-chip AND Off-chip

Page 8: NoC Bergman


Recent advances in photonic integration

Infinera, 2005; IBM, 2007; Lipson, Cornell, 2005; Luxtera, 2005; Bowers, UCSB, 2006

Page 9: NoC Bergman


3DI CMP System Concept

Future CMP system in 22nm

Chip size ~625 mm²

3D layer stacking used to combine:

– Multi-core processing plane

– Several memory planes

– Photonic NoC

[Figure: Processor System Stack with Optical I/O]

At 22nm, scaling will enable 36 multithreaded cores similar to today’s Cell

Estimated on-chip local memory per complex core ~0.5GB

Page 10: NoC Bergman


Optical NoC: Design Considerations

Design to exploit optical advantages:

– Bit rate transparency: transmission/switching power independent of bandwidth

– Low loss: power independent of distance

– Bandwidth: exploit WDM for maximum effective bandwidths across network

• (Over) provision maximized bandwidth per port

• Maximize effective communications bandwidth

– Seamless optical I/O to external memory with same BW

Design must address optical challenges:

– No optical buffering

– No optical signal processing

– Network routing and flow control managed in electronics

• Distributed vs. Central

• Electronic control path provisioning latency

Packaging constraints: CMP chip layout, avoid long electronic interfaces, network gateways must be in close proximity on photonic plane

Design for photonic building blocks: low switch radix

Page 11: NoC Bergman


Goal: Design a NoC for a chip multiprocessor (CMP)

Electronics

– Integration density → abundant buffering and processing

– Power dissipation grows with data rate

Photonics

– Low loss, large bandwidth, bit-rate transparency

– Limited processing, no buffers

Our solution – a hybrid approach:

A dual-network design

– Data transmission in a photonic network

– Control in an electronic network

– Paths reserved before transmission → no optical buffering

Photonic On-Chip Network

[Figure: photonic on-chip network connecting processors (P) through gateways (G)]

Page 12: NoC Bergman


On-Chip Optical Network Architecture: Bufferless, Deflection-switch based

[Figure: array of processor/gateway (PG) tiles over the photonic NoC]

Cell Core (on processor plane)

Gateway to Photonic NoC (on processor and photonic planes)

Thin Electrical Control Network (~1% BW, small messages)

Photonic NoC

Deflection Switch

Page 13: NoC Bergman


Key Building Blocks

LOW LOSS BROADBAND NANO-WIRES: 5 cm SOI nanowire, 1.28 Tb/s (32 x 40 Gb/s)

HIGH-SPEED MODULATOR (Cornell) [Figure: modulator in OFF and ON states]

BROADBAND ROUTER SWITCH

HIGH-SPEED RECEIVER [Figure: GeSi-on-SOI photodetector cross-section with Si, SiO2, GeSi, Ti/Al contacts and n+/p+ doping]

Collaborations: IBM/Columbia; IBM; Cornell/Columbia; IBM

Page 14: NoC Bergman


4x4 Photonic Switch Element

4 deflection switches grouped with electronic control

4 waveguide-pair I/O links

Electronic router

– High speed, simple logic

– Links optimized for high speed

Nearly no power consumption in OFF state

[Figure: 4x4 switch element with four photonic switching elements (PSE), an electronic router (ER), CMOS drivers, and North/South/East/West ports]

Page 15: NoC Bergman


Non-Blocking 4x4 Switch Design

Original switch is internally blocking

Addressed by routing algorithm in original design

Limited topology choices

New design

– Strictly non-blocking*

– Same number of rings

– Negligible additional loss

– Larger area

* U-turns not allowed

[Figure: original and non-blocking 4x4 switch layouts with W/E/N/S ports]

Page 16: NoC Bergman


Design of Nonblocking Network for CMP NoC

Begin with crossbar -- strictly non-blocking architecture

– Any unoccupied input can transmit to any unoccupied output without altering paths taken by other traffic in network

Connections from every input to every output

– Each node transmits and receives on independent paths in each dimension

Unidirectional links

1 x 2 Switches

Simple routing algorithm

[Figure: 8x8 crossbar with numbered input (1-8) and output (1-8) ports]

Page 17: NoC Bergman


Design of photonic nonblocking mesh

Utilizing the nonblocking switch design, with increased functionality and bidirectionality, enables a novel network architecture

Bidirectionality provides for independent reception by two nodes from the output (Y) dimension

[Figure: nonblocking photonic mesh with numbered input and output ports]

Page 18: NoC Bergman


Mapping onto a direct network

Internalizing nodes in a crossbar (indirect network) produces a mesh/torus (direct network)

[Figure: crossbar with nodes 1-4 internalized]

Page 19: NoC Bergman


Nonblocking Torus Network

Internalizing nodes maintains two nodes per dimension

– There is always an independent path available for a node to transmit/receive on in each dimension

Page 20: NoC Bergman


Nonblocking Torus Network

Internalizing nodes maintains two nodes per dimension

– There is always an independent path available for a node to transmit/receive on in each dimension

Page 21: NoC Bergman


Nonblocking Torus Network

Each node injects into the network on the X dimension

[Figure: torus with gateway (GW) injecting on X]

Page 22: NoC Bergman


Nonblocking Torus Network

Each node ejects from the network on the Y dimension (see the routing sketch below)

[Figure: torus with gateway (GW) ejecting on Y]
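The injection/ejection convention above suggests dimension-order routing; the sketch below illustrates X-then-Y routing on a k-ary 2-D torus under that assumption. It is not the authors' exact router logic, and the coordinate scheme and `torus_route` helper are hypothetical.

```python
# Illustrative sketch (not the slide authors' algorithm): X-then-Y
# dimension-order routing on a k x k torus, matching the convention that
# gateways inject on the X dimension and eject on the Y dimension.
def torus_route(src, dst, k=4):
    """Return the list of (x, y) hops from src to dst on a k x k torus."""
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    # Traverse X first (injection dimension), taking the shorter wrap-around.
    while x != dx:
        step = 1 if (dx - x) % k <= k // 2 else -1
        x = (x + step) % k
        path.append((x, y))
    # Then traverse Y (ejection dimension).
    while y != dy:
        step = 1 if (dy - y) % k <= k // 2 else -1
        y = (y + step) % k
        path.append((x, y))
    return path

print(torus_route((0, 0), (2, 3)))   # [(0, 0), (1, 0), (2, 0), (2, 3)]
```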

Page 23: NoC Bergman


Nonblocking Torus Network

Folding the torus to maintain equal path lengths

[Figure: folded torus built from 4 x 4 non-blocking photonic switches (Non-Blocking 4x4 Design)]

Page 24: NoC Bergman


Power Analysis (strawman)

Modulator power 0.4 pJ/bit

Receiver power 0.4 pJ/bit

On-chip Broadband Router Power 10 mW

Power per laser 10 mW

Number of wavelength channels 250

[Figure: off-chip lasers (10 mW each) feed modulators (0.4 pJ/bit, 1 Tb/s) at the source super-core, through on-chip broadband routers (10 mW each), to a demux, amplifiers, and receivers (0.4 pJ/bit) at the destination super-core]

Cores per supercore 21

Supercores per chip 36

broadband switch routers 36

Number of lasers 250

Bandwidth per supercore 1 Tb/s

Optical Tx + Rx Power (36 Tb/s) 28.8 W

On-chip Broadband routers (36) 360 mW

External lasers (250) 2.5 W

Total power photonic interconnect 31.7 W
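A minimal sketch reproducing the strawman totals directly from the parameters in this table:

```python
# Reproducing the strawman power totals above (all values from the table).
modulator_pj_per_bit = 0.4
receiver_pj_per_bit = 0.4
router_power_w = 0.010          # per on-chip broadband router
laser_power_w = 0.010           # per off-chip laser
supercores = 36
routers = 36
lasers = 250
bw_per_supercore_tbps = 1.0     # Tb/s

total_bw_bps = supercores * bw_per_supercore_tbps * 1e12            # 36 Tb/s
tx_rx_w = total_bw_bps * (modulator_pj_per_bit + receiver_pj_per_bit) * 1e-12
router_w = routers * router_power_w
laser_w = lasers * laser_power_w
print(f"Tx+Rx:   {tx_rx_w:.1f} W")            # 28.8 W
print(f"Routers: {router_w*1000:.0f} mW")     # 360 mW
print(f"Lasers:  {laser_w:.1f} W")            # 2.5 W
print(f"Total:   {tx_rx_w + router_w + laser_w:.1f} W")   # ~31.7 W
```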

Page 25: NoC Bergman


Performance Analysis

Goal to evaluate performance-per-Watt advantage of CMP system with photonic NoC

Developed network simulator using OMNeT++: modular, open-source, event-driven simulation environment

– Modules for photonic building blocks, assembled in network

– Multithreaded model for complex cores

Evaluate NoC performance under uniform random distribution

Performance-per-Watt gains of photonic NoC on FFT application


Page 26: NoC Bergman


Multithreaded complex core model

Model complex core as a multithreaded processor with many computational threads executed in parallel

Each thread independently makes a communication request to any core

Three main blocks:

– Traffic generator – simulates the core threads' data transfer requests; requests are stored in a back-pressure FIFO queue

– Scheduler – extracts requests from the FIFO and generates path-setup messages over the electronic interface; blocked requests are re-queued, avoiding HoL blocking

– Gateway – photonic interface; sends/receives and reads/writes data to local memory
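The actual simulator is built in OMNeT++; the following is a simplified Python sketch of the three-block structure described above, with illustrative class names and a placeholder path-setup model.

```python
# Simplified sketch of the three-block core model (traffic generator,
# scheduler, gateway); names and the path-setup model are illustrative only.
from collections import deque
import random

class TrafficGenerator:
    """Simulates the core's threads issuing data-transfer requests."""
    def __init__(self, num_cores, core_id, threads):
        self.targets = [c for c in range(num_cores) if c != core_id]
        self.threads = threads
        self.fifo = deque()          # back-pressure FIFO of pending requests

    def generate(self):
        # Each thread independently requests a transfer to some other core.
        for _ in range(self.threads):
            self.fifo.append(random.choice(self.targets))

class Gateway:
    """Photonic interface: path setup, then send/receive to local memory."""
    def try_setup(self, dst):
        # Placeholder for the electronic path-setup handshake.
        return random.random() > 0.2

class Scheduler:
    """Extracts requests from the FIFO and issues path-setup messages."""
    def __init__(self, generator, gateway):
        self.generator, self.gateway = generator, gateway

    def step(self):
        if self.generator.fifo:
            dst = self.generator.fifo.popleft()
            if not self.gateway.try_setup(dst):
                # Blocked request goes to the back of the queue so that
                # later requests are not head-of-line blocked.
                self.generator.fifo.append(dst)

gen = TrafficGenerator(num_cores=36, core_id=0, threads=4)
gen.generate()
Scheduler(gen, Gateway()).step()
```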

Page 27: NoC Bergman


Throughput per core

Throughput-per-core = ratio of time core transmits photonic message over total simulation time

– Metric of average path setup time

– Function of message length and network topology

Offered load considered when core is ready to transmit

For uncongested network: throughput-per-core = offered load

Simulation system parameters:

– 36 multithreaded cores

– DMA transfers of fixed size messages, 16kB

– Line rate = 960Gbps; Photonic message = 134ns
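A quick consistency check of the quoted 134 ns message duration for a 16 kB DMA transfer at the 960 Gbps line rate (the kB convention is not stated on the slide):

```python
# Serialization time of a 16 kB DMA message at the 960 Gb/s line rate.
line_rate_bps = 960e9
for bytes_per_kb in (1000, 1024):     # kB convention is not stated on the slide
    msg_bits = 16 * bytes_per_kb * 8
    print(f"{msg_bits / line_rate_bps * 1e9:.1f} ns")   # ~133 ns / ~137 ns, vs 134 ns quoted
```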

Page 28: NoC Bergman


Throughput per core for 36-node photonic NoC

Multithreading enables better exploitation of photonic NoC high BW

– Gain of 26% over single-thread

– Non-blocking mesh, shorter average path, improved by 13% over crossbar

Page 29: NoC Bergman


FFT Computation Performance

We consider the execution of the Cooley-Tukey FFT algorithm using 32 of the 36 available cores

First phase: each core processes k = m/M sample elements

– m = array size of input samples

– M = number of cores

After the first phase, log M iterations of a computation step followed by a communication step in which cores exchange data in a butterfly pattern

Time to perform the FFT computation depends on core architecture; time for data movement is a function of NoC line rate and topology

Reported results for FFT on the Cell processor: a 2^24-sample FFT executes in ~43ms based on Bailey’s algorithm

We assume a Cell core with (2X) 256MB local-store memory, DP

Use Bailey’s algorithm to complete the first phase of Cooley-Tukey in 43ms

Cooley-Tukey requires 5k log k floating point operations; each iteration after the first phase is ~1.8ms for k = 2^24

Assuming 960Gbps, the CMP non-blocking mesh NoC can execute a 2^29-sample FFT in 66ms
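A rough back-of-the-envelope reconstruction of the ~66 ms figure, assuming double-precision complex samples (16 bytes each) and a full exchange of each core's 2^24-sample block in every communication step; these assumptions are not stated explicitly on the slide.

```python
# Back-of-the-envelope estimate of the FFT execution time quoted above.
# Assumptions (not from the slide): 16 B per DP complex sample, and each
# communication step moves a core's entire 2^24-sample block.
import math

M = 32                      # cores used
k = 2**24                   # samples per core (total m = 2^29)
line_rate_bps = 960e9

t_first_phase = 43e-3       # Bailey's algorithm, per slide
t_compute_iter = 1.8e-3     # per-iteration compute time, per slide
t_comm_iter = k * 16 * 8 / line_rate_bps        # ~2.2 ms to move 256 MB

t_total = t_first_phase + math.log2(M) * (t_compute_iter + t_comm_iter)
print(f"Estimated FFT time: {t_total*1e3:.0f} ms")   # ~63 ms (slide quotes ~66 ms)
```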

Page 30: NoC Bergman


FFT Computation Power Analysis

For photonic NoC:

– A hop between two switches is 2.78mm; the average path is 11 hops with 4 switch-element turns

– With 32 blocks of 256MB and a line rate of 960Gbps, each connection dissipates 105.6mW at the interfaces and 2mW in switch turns

– Total power dissipation is 3.44W

Electronic NoC:

– Assume equivalent electronic circuit switched network

– Power dissipated only for length of optimally repeated wire at 22nm, 0.26pJ/bit/mm

Summary: Computation time is a function of the line rate, independent of medium
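A minimal sketch reproducing the per-connection power figures for both NoCs from the numbers above; the electronic figure assumes the same 11-hop, 2.78 mm/hop average path.

```python
# Reproducing the per-connection power figures above for both NoCs.
line_rate_bps = 960e9
connections = 32

# Photonic: interface power plus switch-turn power, per the slide.
photonic_conn_w = 105.6e-3 + 2e-3
print(f"Photonic total: {connections * photonic_conn_w:.2f} W")        # ~3.44 W

# Electronic: optimally repeated wire at 22 nm over the same average path.
pj_per_bit_mm = 0.26
path_mm = 11 * 2.78                                                    # 11 hops of 2.78 mm
electronic_conn_w = pj_per_bit_mm * 1e-12 * path_mm * line_rate_bps
print(f"Electronic per connection: {electronic_conn_w:.1f} W")         # ~7.6 W
```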

Page 31: NoC Bergman


FFT Computation Performance Comparison

FFT computation: time ratio and power ratio as function of line rate

Page 32: NoC Bergman


Performance-per-Watt

To achieve same execution time (time ratio = 1), electronic NoC must operate at the same line rate of 960Gbps, dissipating 7.6W/connection or ~70X over photonic

Total dissipated power is ~244W

To achieve same power (power ratio = 1), electronic NoC must operate at line rate of 13.5Gbps, a reduction of 98.6%.

Execution time is then ~1 sec, or ~15X longer than photonic
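The equal-time and equal-power operating points quoted above follow directly from the per-connection power model; a small check, restating the assumptions from the previous sketch:

```python
# Checking the equal-time and equal-power operating points quoted above.
line_rate_bps = 960e9
connections = 32
pj_per_bit_mm = 0.26
path_mm = 11 * 2.78
photonic_conn_w = 105.6e-3 + 2e-3                          # 107.6 mW per connection

# Equal execution time: electronic NoC at the full 960 Gb/s line rate.
elec_conn_w = pj_per_bit_mm * 1e-12 * path_mm * line_rate_bps
print(f"{elec_conn_w:.1f} W/connection, {connections * elec_conn_w:.0f} W total")  # ~7.6 W, ~244 W
print(f"Ratio vs photonic: {elec_conn_w / photonic_conn_w:.0f}x")                  # ~70x

# Equal power: line rate at which an electronic connection also burns 107.6 mW.
break_even_bps = photonic_conn_w / (pj_per_bit_mm * 1e-12 * path_mm)
print(f"Break-even line rate: {break_even_bps/1e9:.1f} Gb/s")                      # ~13.5 Gb/s
print(f"Reduction: {(1 - break_even_bps/line_rate_bps)*100:.1f}%")                 # ~98.6%
```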

Page 33: NoC Bergman


Summary


Electronic NoCs dissipate prohibitively high power; a technology shift is required

Remarkable advances in Silicon Nanophotonics

Photonic NoCs provide the enormous capacity, at dramatically lower power consumption, required for future CMPs, both on- and off-chip

Performance-per-Watt gains on communications intensive applications

CMPs are clearly emerging to provide power-efficient, high-performance computing capability

Future on-chip interconnects must provide large bandwidth to many cores