Extreme Scale Computing with Optical Data Movementsalishan.ahsc-nm.org/uploads/4/9/7/0/49704495/2016... · 2019-11-18 · 6 The Photonic Opportunity for Data Movement Energy efficient,

Extreme Scale Computing with Optical Data Movement

Keren Bergman

Department of Electrical Engineering

Columbia University

Loss of US Dominance in Supercomputing

Average computing performance of the top 3 Supercomputers over past decade

US had 10X advantage!

Tianhe-2 (China)

The Major Lag in Data Communications…

Top 10 supercomputers' computation capabilities over the past 5 years:

• Vast increase in parallelism requires ever more communication, but bandwidth has stagnated
• Over the past 5 years, system compute power has grown by 13X
• Node I/O bandwidth has increased by less than 2X
• Data movement is too expensive ($ and energy)

Silicon Photonics: all the parts for on-chip optical communications

- Silicon as core material
  • High refractive index and high contrast – sub-micron cross-section dimensions, smallest bend radius
- Small footprint devices
  • 10 μm – 1 mm scale compared to cm-level scale for telecom components
- Low power consumption
  • Can reach <1 pJ/bit per full point-to-point link
- Aggressive WDM platform
  • Bandwidth densities 1-2 Tb/s per pin
- Silicon wafer-level CMOS processing
  • Integration
  • Mass production, price
  • Compatibility with CMOS fabs, CMOS electronics

[Figure: silicon microring/microdisk based devices. Panels: switching, WDM modulation and demultiplexing, modulation, detection.]

Silicon Photonics Technology – toward Commercialization

[Timeline axis: 1990s, 2000s, 2010, 2015+]

Fundamental Discoveries
• low-loss, single-mode waveguiding
• optical coupling
• optical modulation via carrier injection

Introduction of Innovative Devices
• high-speed microring modulators and switches
• high-speed MZM modulators and switches
• arrayed waveguide gratings
• germanium photodetectors
• ultra low-loss waveguides and crossings
• hybrid silicon lasers

Integration and Commercialization
• Hybrid platforms
• Foundry and design services
• Transceivers for datacom

The Photonic Opportunity for Data Movement

Energy efficient, low-latency, high-bandwidth data interconnectivity is the core challenge to continued scalability across computing platforms

Energy consumption completely dominated by costs of data movement

Bandwidth taper from chip to system forces extreme locality

[Figure, two panels, each comparing electronic and photonic interconnects along the axis "Chip Edge PCB Rack System".
 Eliminate Bandwidth Taper: bandwidth density (Gb/sec/mm), log scale from 10 to 1,000,000.
 Reduce Energy Consumption: data movement energy (pJ/bit), log scale from 0.01 to 1000.]

Node level bandwidth requirements

Assume: 10 Teraflop (TF) node (exascale with 100K nodes)

Near memory bandwidth: 10 TF x 0.5 B/F x 8 bit/B = 40 Tb/s
(split over ~6-10 individual ~5 Tb/s interfaces)

Bulk memory bandwidth:
0.1 B/F → 8 Tb/s
0.2 B/F → 16 Tb/s
(split over ~1-6 links)

Interconnect bandwidth:
0.01 B/F → 0.8 Tb/s
0.05 B/F → 4 Tb/s
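The node-level figures above all follow from one product: node performance times byte-per-flop ratio times 8 bits per byte. A minimal Python sketch of that arithmetic follows; the function and constant names are illustrative, and only the 10 TF node size and the B/F ratios come from the slide.

```python
# Sketch: bandwidth a node needs for a given byte-per-flop (B/F) ratio.
# NODE_TFLOPS = 10 is the slide's assumption; the names are illustrative.

NODE_TFLOPS = 10.0   # node performance in teraflop/s
BITS_PER_BYTE = 8

def required_tbps(bytes_per_flop, node_tflops=NODE_TFLOPS):
    """Bandwidth in Tb/s: (Tflop/s) x (byte/flop) x (8 bit/byte)."""
    return node_tflops * bytes_per_flop * BITS_PER_BYTE

if __name__ == "__main__":
    cases = [("near memory", 0.5), ("bulk memory", 0.1), ("bulk memory", 0.2),
             ("interconnect", 0.01), ("interconnect", 0.05)]
    for label, ratio in cases:
        print(f"{label:12s} {ratio:5.2f} B/F -> {required_tbps(ratio):5.1f} Tb/s")
```

The same function with `node_tflops` changed shows how the requirement scales for larger or smaller nodes.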

Power requirements

• Today’s largest envelope: Tianhe-2 = 17MW; RIKEN = 12MW

• Exascale at 100MW is maximal consideration: 10 GigaFLOP/Joule

• 20MW total system power envelope preferred: 50 GigaFLOP/Joule

[Figure: energy budget per bit (pJ) versus verbosity (byte/flop, 0.001 to 1), drawn for 10 Gigaflop/J with 10% of the envelope.]
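One way to read the "10 Gigaflop/J, 10% of the envelope" curve is as a per-bit energy budget: the share of the energy per flop allotted to data movement, divided by the bits moved per flop. The sketch below assumes that relationship, which the slides do not spell out, and its function name is illustrative.

```python
# Sketch, assuming budget/bit = share x (energy per flop) / (bits moved per flop).
# This formula is an interpretation of the plot, not something stated in the deck.

def energy_budget_pj_per_bit(gflops_per_joule, share, bytes_per_flop):
    """Per-bit data movement budget in pJ."""
    pj_per_flop = 1e12 / (gflops_per_joule * 1e9)   # 1 J = 1e12 pJ
    bits_per_flop = bytes_per_flop * 8
    return share * pj_per_flop / bits_per_flop

if __name__ == "__main__":
    for verbosity in (0.001, 0.01, 0.1, 1.0):
        budget = energy_budget_pj_per_bit(10, 0.10, verbosity)
        print(f"{verbosity:6.3f} B/F -> {budget:8.2f} pJ/bit")
```

Under the same assumption, 50 GigaFLOP/Joule with a 10% share and 1 B/F gives 0.25 pJ/bit, which coincides with the end-to-end budget quoted under "Network energy requirements"; the deck does not say whether that is how the figure was obtained.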




Network energy requirements

End-to-end data movement energy budget:

0.25 pJ/bit

100s of pJ to 10s of pJ

10s of pJ to single pJs

pJs to fJs!

Interconnect costs

• Network is ~15% of total system cost
– $200M considered typical exascale price
– $30M max for network
• Total interconnect bandwidth
– ~300 PB/s (0.1 B/F)
• $30M / 300 PB/s → $1 per 10 GB/s → 1.25¢ per Gb/s
• Cost reduction required:
– >100X for 0.1 B/F
– >10X for 0.01 B/F
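The cost target is a straight unit conversion from the $30M budget and the ~300 PB/s aggregate bandwidth. A short Python sketch of that conversion, with illustrative names:

```python
# Sketch: network cost target per unit of bandwidth, using the slide's inputs.

NETWORK_BUDGET_USD = 30e6        # 15% of a $200M system
TOTAL_BW_BYTES_PER_S = 300e15    # ~300 PB/s

def dollars_per_10_gbytes_per_s(budget_usd, bw_bytes_per_s):
    """Cost of each 10 GB/s slice of bandwidth, in dollars."""
    return budget_usd / (bw_bytes_per_s / 10e9)

def cents_per_gbit_per_s(budget_usd, bw_bytes_per_s):
    """Cost of each Gb/s of bandwidth, in cents."""
    gbits_per_s = bw_bytes_per_s * 8 / 1e9
    return 100 * budget_usd / gbits_per_s

if __name__ == "__main__":
    print(dollars_per_10_gbytes_per_s(NETWORK_BUDGET_USD, TOTAL_BW_BYTES_PER_S))
    print(cents_per_gbit_per_s(NETWORK_BUDGET_USD, TOTAL_BW_BYTES_PER_S))
```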

Photonic Computing Architectures: Beyond Wires


• Leverage dense WDM bandwidth density

• Photonic switching

• Distance-independent, cut-through, bufferless

• Bandwidth-energy optimized interconnects

[Figure: conventional hop-by-hop data movement (on chip, short distance PCB, long distance PCB, optical link; 12 conversions!) versus fully flattened end-to-end data movement (no conversion!).]

Columbia PhoenixSim: Integrated Multi-Level Modeling and Design Environment

• Novel design environment enabling HFI across three layers:
  • Application IO primitives
    • Copy memory array to remote location
    • Send, multicast, broadcast messages
    • Thread synchronization (e.g. barrier)
  • Network architecture and protocols
    • Link locking mechanisms (frame detection)
    • Network topology (routing)
    • Arbitration of shared buses, switches
  • Si photonic hardware implementations
    • Silicon photonic modulators, switches
• Complete “toolbox” of models at each layer
  • Ensure interoperability among models
  • Avoids “manual” adaptations of data between distinct software

[Figure: layer stack with labels: Appl. model, Network arch., Hardware SiP devices, software.]

S. Rumley, M. Glick, S. D. Hammond, A. Rodrigues, K. Bergman, “Design Methodology for Optimizing Optical Interconnection Networks in High Performance Systems”, ISC-HPC 2015.

Si Photonic physical hardware layer

• Silicon photonic WDM links
• Silicon photonic switches:
– Electro-optical (switch time: 1-2 ns)
– Thermo-optical (1-2 μs)
– Electro-mechanical (~1 ms)

[Figure labels: external laser, Chip 1, optical switch, Chip 2, photodetectors, other chips.]

Parameterization of silicon photonic links

• Co-existence of electronics and photonics
• Energy-bandwidth optimization

[Figure: silicon photonic WDM link schematic. Labelled elements: clock generation and clock distribution, serialization of data, driver, comb laser, vertical grating coupler, coupled waveguides, OOK modulation (“0” / “1” versus λ), silicon waveguide, fiber, thermal tuning, demultiplexing filter, photodiode, amplifier, TIA, deserialization.]

Optimization of the Optical Link

• Link bandwidth: data rate, wavelength channels
• Multitude of different circuits and devices
– Energy per bit (pJ/bit)
– Loss, crosstalk, power penalties
• Component energy consumption: constant, linear, quadratic or logarithmic with bit rate
• Power penalty of demux depends on bit rate
– Low Q leads to high level of crosstalk
– High Q leads to narrowing down the OOK spectrum
– Q of the ring can be optimized for minimum penalty

[Figure: energy per bit (E/bit) versus bit rate for the serializer, driver, and receiver; single-channel breakdown across laser, tuning, TIA, deserializer, and serializer; < 1 pJ/bit marked.]
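To illustrate why mixed scaling laws produce an energy-optimal bit rate, the sketch below sums four power terms that scale as constant, linear, quadratic, and logarithmic with bit rate, then divides by the bit rate. This is a toy model, not PhoenixSim: treating the scaling laws as power terms is an assumption, and every coefficient is hypothetical, chosen only so that the curve has a visible minimum.

```python
import math

# Hypothetical coefficients in mW; none of these values come from the slides.
CONST_MW = 2.0               # term independent of bit rate
LINEAR_MW_PER_GBPS = 0.3     # term proportional to bit rate
QUAD_MW_PER_GBPS2 = 0.02     # term proportional to bit rate squared
LOG_MW = 0.5                 # term proportional to ln(bit rate)

def power_mw(rate_gbps):
    """Total channel power as a sum of the four scaling laws."""
    return (CONST_MW
            + LINEAR_MW_PER_GBPS * rate_gbps
            + QUAD_MW_PER_GBPS2 * rate_gbps ** 2
            + LOG_MW * math.log(rate_gbps))

def energy_pj_per_bit(rate_gbps):
    """mW divided by Gb/s is numerically pJ/bit."""
    return power_mw(rate_gbps) / rate_gbps

def best_rate(rates):
    """Bit rate in `rates` with the lowest energy per bit."""
    return min(rates, key=energy_pj_per_bit)

RATES = [r / 10 for r in range(10, 401)]   # 1.0 to 40.0 Gb/s in 0.1 steps

if __name__ == "__main__":
    r = best_rate(RATES)
    print(f"optimum near {r:.1f} Gb/s at {energy_pj_per_bit(r):.3f} pJ/bit")
```

At low bit rates the constant term is amortized over few bits; at high bit rates the quadratic term dominates, so the minimum sits in between.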

Bandwidth-Energy Design Exploration

• Goal is to maximize throughput while minimizing energy/bit
• Lessons learned: WDM link can provide up to 2 Tbps
– Min energy, 8 Gb/s per channel: 1.46 pJ/bit, 1.6 Tb/s of optical bandwidth, 200 channels possible
– Max bandwidth, 13 Gb/s per channel: 1.9 Tb/s, energy cost 1.54 pJ/bit, 146 channels possible
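The two operating points can be cross-checked by multiplying channel count by per-channel rate. The sketch below does that and also derives the implied total link power, which is not a number given on the slide; the dictionary layout and function names are illustrative.

```python
# Sketch: consistency check of the two design points quoted on the slide.
# Link power is derived here (bandwidth x energy/bit); the slide does not state it.

DESIGN_POINTS = {
    "min energy":    {"gbps_per_channel": 8,  "channels": 200, "pj_per_bit": 1.46},
    "max bandwidth": {"gbps_per_channel": 13, "channels": 146, "pj_per_bit": 1.54},
}

def aggregate_tbps(point):
    """Aggregate optical bandwidth in Tb/s."""
    return point["gbps_per_channel"] * point["channels"] / 1000

def link_power_w(point):
    """Tb/s x pJ/bit = 1e12 bit/s x 1e-12 J/bit = W."""
    return aggregate_tbps(point) * point["pj_per_bit"]

if __name__ == "__main__":
    for name, point in DESIGN_POINTS.items():
        print(f"{name:13s} {aggregate_tbps(point):.3f} Tb/s, "
              f"{link_power_w(point):.2f} W")
```

The second point comes to 1.898 Tb/s, which the slide rounds to 1.9 Tb/s.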

Implementing Photonic Computing Systems

• Silicon photonic chips subsystem hardware

– Electronic/photonic packaging

– FPGA control and programmability

• 2 Examples of Photonic Architectures:

– Photonic switching in extreme scale computing

– Optically interconnected memory

• Photonic multi-level memory

Prototypical Test Vehicles for Silicon Photonics

Generalized characterization test vehicle for electrical-optical packaged devices: (switches, modulators, filters, receivers)

[Figure panels: 4-channel WDM Mach-Zehnder modulator; 8-channel WDM microring drop filter; fully E/O-packaged sub-assembly.]

Special purpose high-speed functional verification test vehicle for 4-channel WDM modulator (MZI-based)

Optical Network Interface Card (ONIC)

FF2 motherboard with ONIC daughterboard

• 8 wavelengths modulated at 10 Gbps
• TX wavelength multiplexers
• RX wavelength demultiplexer
• Optical I/O to network
• Samtec interface connector
• Interface to Fast Forward 2 HMC-enabled motherboard

Placement of E/O packaged devices directly on printed circuit board (PCB) – abstract SiP functionality via FPGA

[Figure: ONIC block diagram, not to scale.
 TX path: 8x 10 Gbps, TX driver amplifiers, WDM optical TX path.
 RX path: WDM optical RX path, ≥ 10 GHz signal photodiodes, RX transimpedance and limiting amplifiers.
 Control: Xilinx Kintex ONIC control FPGA from FF2, GPIO control, multi-GPIO to ADC/DAC, 8-ch ADC and DAC, feedback control with a small power split to ~200 kHz feedback photodiodes.
 Power: supply circuitry with adjustment, discrete bias supplies.
 SiP chips: IME 003-C3 (two), IME 002-W9U; grating couplers (GC).]

Advanced Packaging Subassembly

Key concepts:
• Fan-out RF interconnects from Tx/Rx using interposers
• Au wire bonding from Tx/Rx to interposers and test board
• Arrays of SMA headers for RF connections
• Rotate assembly 45 degrees against test board to provide field of view for side camera
• A maximum of 32 differential pairs can be used

[Figure (preliminary design): side camera view, top view, iso view, and device image. Labels: auto-aligner side camera (focus length 30-40 mm), DC connectors, RF connectors, Interposer 1, Interposer 2, 64 channel fiber array.]

Columbia-Sandia Collaboration: Incorporating Silicon Photonics at System Level

Use demultiplexer controlled by ONIC at node to perform wavelength routing

[Figure: Nodes 1 to 8, each with RX demultiplexers, form a λ-selective periphery around a spatially connected or switched network.]

Controller FPGA to actuate and arbitrate the central spatial switch

Mach-Zehnder interferometer-based 4x4 Benes switching topology (4x4 spatial switch)
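A 4x4 Benes network uses three stages of two 2x2 elements, six elements in total, and is rearrangeably non-blocking. The Python sketch below checks that property by brute force over all 64 bar/cross settings. The port wiring is the textbook Benes arrangement, assumed here because the slide does not show it, and each Mach-Zehnder element is idealized as a lossless bar/cross switch.

```python
from itertools import permutations, product

def switch_2x2(top, bottom, cross):
    """Ideal 2x2 element: pass straight through (bar) or swap (cross)."""
    return (bottom, top) if cross else (top, bottom)

def benes_4x4(settings, inputs=(0, 1, 2, 3)):
    """Return which input arrives at each of the four outputs.

    settings: six booleans (True = cross), ordered as
    input stage (2), middle stage (2), output stage (2).
    """
    in0, in1, mid_up, mid_lo, out0, out1 = settings
    a_top, a_bot = switch_2x2(inputs[0], inputs[1], in0)
    b_top, b_bot = switch_2x2(inputs[2], inputs[3], in1)
    # Top outputs feed the upper middle element, bottom outputs the lower one.
    u0, u1 = switch_2x2(a_top, b_top, mid_up)
    l0, l1 = switch_2x2(a_bot, b_bot, mid_lo)
    # Each output element takes one signal from each middle element.
    o0, o1 = switch_2x2(u0, l0, out0)
    o2, o3 = switch_2x2(u1, l1, out1)
    return (o0, o1, o2, o3)

def reachable_permutations():
    """All input-to-output mappings over the 2**6 switch settings."""
    return {benes_4x4(s) for s in product([False, True], repeat=6)}

def find_settings(target):
    """Brute-force search for one setting that realizes `target`."""
    for s in product([False, True], repeat=6):
        if benes_4x4(s) == tuple(target):
            return s
    return None

if __name__ == "__main__":
    print(len(reachable_permutations()), "of 24 permutations reachable")
    print("settings for full reversal:", find_settings((3, 2, 1, 0)))
```

An arbiter for a larger switch would use a constructive routing algorithm in place of exhaustive search; brute force is used here only because 64 settings are cheap to enumerate.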

AIM 2.5D Integration Platform

[Figure labels: SiP devices, optical I/O, optical interposer with multi-layer waveguides, copper pillar and redistribution layer, solder bumps, electrical PCB.]

Photonic switching for dynamic bandwidth

Multi-node FPGA-enabled Terabit/s WDM SiP network testbed

Full System Validation with Nodes and Switch

[Figure: Nodes A, B, C, and D, each with a TX SiP and an RX SiP, attached to the network inputs and outputs of the photonic switch subsystem, with an arbiter FPGA. Legend: optical fiber, electrical cable, off-chip versus on-chip. Other labels: off-chip laser sources, EDFA, RX chain (PIN/TIA, LA), λ1, λ2, FPGA, 12.5 GHz, 10 MHz, low-speed feedback, control/data, software communication control.]

Growing Challenges of Computing/Memory

• Architectures such as High-Bandwidth Memory (HBM) stress the IO for processor connectivity, limited storage capacity.

• Low cost NVM slow – requires large pin counts; DRAM

• Highly heterogeneous, multi-level memory architectures

FPGA SiP Interconnected HMC

• Eight-wavelength bi-directional WDM: {λ1 … λ8} at 10 Gbps
• HMC (4GB, latest gen)
• 2.56 Tbps bisectional bandwidth: 32 bidirectional lanes to each FPGA, 10 Gbps signaling rate

Address a subset of HMC I/O bandwidth

• HMC (4GB, latest gen), FPGA (Stratix 5)
• 640 Gbps bisectional bandwidth (written as 0.625 Tbps on the slide, i.e. 640/1024): 8 bidirectional lanes to each FPGA, 10 Gbps signaling rate
• 160 Gbps bisectional (written as 0.15625 Tbps, i.e. 160/1024)

[Figure: HMC with four SiP WDM Tx/Rx blocks and four FPGAs.]
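The bisectional bandwidth figures can be reproduced by multiplying lanes per FPGA, number of FPGAs, lane rate, and two directions. In the Python sketch below, the count of four FPGAs and the factor of two for bidirectional lanes are assumptions, chosen because they match the slide's numbers; the function name is illustrative.

```python
# Sketch: reproduce the slide's bisectional bandwidth numbers.
# Assumptions (not stated explicitly on the slide): four FPGAs share the HMC,
# and both directions of each bidirectional lane are counted.

def bisection_gbps(lanes_per_fpga, fpgas=4, lane_gbps=10, directions=2):
    """Aggregate bandwidth in Gb/s across all lanes and both directions."""
    return lanes_per_fpga * fpgas * lane_gbps * directions

if __name__ == "__main__":
    print(bisection_gbps(32))           # full HMC I/O, four FPGAs
    print(bisection_gbps(8))            # subset of HMC I/O, four FPGAs
    print(bisection_gbps(8, fpgas=1))   # subset of HMC I/O, one FPGA
```

In decimal units 640 Gbps is 0.64 Tbps; the 0.625 Tbps label follows from dividing by 1024 instead of 1000.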

[Figure: HMC-FPGA link with WDM SiP Tx and WDM SiP Rx; labels: D/A, A/D, driver, elec.]


Shared Low-Speed Bus Induced Latency

• A single data bus is shared among multiple NVM packages

• When the bus is busy, even if the package is ready, data is stalled

• Problem becomes worse as the number of layers per package grows

• Toshiba/SanDisk 48-layer packages

[Figure: four NVM packages (Toshiba 48-layer BiCS 3D-NAND) share one bus to the interface control logic; the bus has 8/16 pins, each of 400 Mb/s; a waiting package is marked "Stall".]
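The stall effect can be illustrated with a toy serialization model: several packages become ready at once, and the shared bus serves them one after another. In the Python sketch below, only the pin counts and the 400 Mb/s pin rate come from the slide; the 16 KiB transfer size and the function names are hypothetical.

```python
# Sketch: queueing delay on a shared NVM bus.
# PIN_RATE_MBPS and the 8/16 pin counts are from the slide.
# The 16 KiB transfer size is a hypothetical value for illustration only.

PIN_RATE_MBPS = 400

def bus_gbps(pins):
    """Aggregate bus bandwidth in Gb/s."""
    return pins * PIN_RATE_MBPS / 1000

def stall_times_us(n_packages, pins, transfer_bytes=16 * 1024):
    """Wait (in µs) before each package gets the bus, if all are ready at t=0."""
    bits_per_us = bus_gbps(pins) * 1e3          # Gb/s -> bits per microsecond
    transfer_us = transfer_bytes * 8 / bits_per_us
    return [k * transfer_us for k in range(n_packages)]

if __name__ == "__main__":
    for pins in (8, 16):
        waits = stall_times_us(4, pins)
        print(f"{pins} pins, {bus_gbps(pins)} Gb/s:",
              [round(w, 2) for w in waits])
```

The last package in line waits for every transfer ahead of it, so the worst-case stall grows linearly with the number of ready packages on the bus.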

A Photonic Multi-Package Fabric

Multi-wavelength photonic bus provides ample IO bandwidth for each package

[Figure: processor with cores A and B and an array of APM blocks, connected through a λ router to four NVM packages; links are labelled λ1-λ4.]

Photonic switch enabled memory affinity

• Optimizing core-memory affinity is key when there are multiple memory stacks
• Can use a reconfigurable photonic switch to map optical connections to core-memory relations

[Figure: processor NoC of Apex cores and memory controllers (Mem Ctrl), connected by optical links through a photonic switch to 4 memory stacks numbered 0, 1, 2, …, N-1; work assignment direction and data layout direction [N-1, …, 1, 0] are marked.]

[Figure: attainable MFlops/s versus operational intensity (Flops/byte), with peak single-core performance and peak multi-core performance; marked values 3900, 3424, and 1408. ApexMap intensity = 0.25 Flops/byte.]
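A plot of attainable performance against operational intensity is normally read with the roofline model: performance is capped either by peak compute or by memory bandwidth times intensity. The Python sketch below shows that bound at the ApexMap intensity of 0.25 Flops/byte. The peak and bandwidth values are hypothetical and are not the values marked on the slide.

```python
# Sketch of the roofline bound. Only the 0.25 Flops/byte intensity is from the
# slide; peak_mflops and the bandwidth values below are hypothetical.

APEXMAP_INTENSITY = 0.25   # Flops/byte

def attainable_mflops(intensity, peak_mflops, bandwidth_mbytes_per_s):
    """Roofline: min(compute ceiling, bandwidth ceiling at this intensity)."""
    return min(peak_mflops, bandwidth_mbytes_per_s * intensity)

if __name__ == "__main__":
    peak = 4000.0                       # hypothetical peak, MFlops/s
    for bw in (4000.0, 8000.0, 16000.0, 32000.0):   # hypothetical MB/s
        perf = attainable_mflops(APEXMAP_INTENSITY, peak, bw)
        print(f"bandwidth {bw:8.0f} MB/s -> {perf:6.0f} MFlops/s")
```

At such a low intensity the kernel stays bandwidth-bound until the effective memory bandwidth is large, which is why the memory path matters more than peak compute here.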

Example: Reconfigure photonic switch for memory affinity

• By reconfiguring the photonic switch, the number of hops traversed by memory traffic on the processor NoC can significantly decrease

[Figure: processor with cores A and B, A's interface and B's interface, and HBM stacks 1 to 4 holding Core A's data and Core B's data. Link labels: E-MUX, modulators, photonic mux, WDM link, photonic demux, reconfigurable silicon photonic chip.]
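The hop reduction can be shown with a deliberately simple model: cores sit in a line, core i is adjacent to memory interface i, and the photonic switch decides which stack each interface reaches. The 1-D layout, the one-stack-per-core traffic pattern, and all names in the Python sketch below are assumptions made for illustration.

```python
# Toy model (all assumptions): core i sits next to memory interface i on a
# 1-D NoC, and each core reads only from the stack that holds its data.

def noc_hops(data_layout, interface_to_stack):
    """Total NoC hops for every core to reach the interface of its stack.

    data_layout[i]        : stack holding core i's data
    interface_to_stack[j] : stack that the photonic switch connects to interface j
    """
    stack_to_interface = {s: j for j, s in enumerate(interface_to_stack)}
    return sum(abs(core - stack_to_interface[stack])
               for core, stack in enumerate(data_layout))

if __name__ == "__main__":
    layout = [3, 2, 1, 0]        # data laid out in reverse order of the cores
    fixed = [0, 1, 2, 3]         # static wiring: interface j -> stack j
    reconfigured = list(layout)  # switch set so interface i -> core i's stack
    print("fixed wiring hops:     ", noc_hops(layout, fixed))
    print("reconfigured link hops:", noc_hops(layout, reconfigured))
```

With the reversed layout, static wiring forces every request across the NoC, while the reconfigured mapping lets each core use its local interface.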

Demonstrating 4-node Layer-1(λ) Switching

• Emulated quad-core processor reading data from four memory modules through a four-channel reconfigurable optical link
• Optical demux to route wavelengths in a way that matches NIOS-memory affinity
• TX: 4 x 10 Gbps/channel
• Four DRAM modules use four demux rings
• Channel wavelengths: 1543.7 nm, 1545.3 nm, 1546.9 nm, 1548.5 nm

[Figure: FPGA-emulated memory node and FPGA-emulated processor node joined by a λ-mux over a single fiber; PD/TIA receivers; voltage control of the demux.]
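The four channel wavelengths are 1.6 nm apart. Converting each to optical frequency with f = c/λ shows the spacing in frequency terms; the Python sketch below does the conversion. The slide does not name a channel grid, so the result is only a computed spacing, and the function names are illustrative.

```python
# Sketch: convert the four channel wavelengths to frequency spacing.

C_M_PER_S = 299_792_458                      # speed of light in vacuum
WAVELENGTHS_NM = [1543.7, 1545.3, 1546.9, 1548.5]

def frequency_thz(wavelength_nm):
    """Optical frequency in THz for a vacuum wavelength in nm."""
    return C_M_PER_S / (wavelength_nm * 1e-9) / 1e12

def spacings_ghz(wavelengths_nm):
    """Frequency gaps between neighbouring channels, in GHz."""
    freqs = sorted((frequency_thz(w) for w in wavelengths_nm), reverse=True)
    return [(hi - lo) * 1000 for hi, lo in zip(freqs, freqs[1:])]

if __name__ == "__main__":
    for w in WAVELENGTHS_NM:
        print(f"{w} nm -> {frequency_thz(w):.3f} THz")
    print("spacings (GHz):", [round(s, 1) for s in spacings_ghz(WAVELENGTHS_NM)])
```

Each gap comes out close to 200 GHz, far wider than the spectrum of a 10 Gbps OOK signal.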

FPGA-programmable optically connected memory

[Figure: MZI-based SiP switch connected by fibers to an FPGA emulating the processor and performing switch control, an FPGA emulating DRAM, and an FPGA emulating NVM.]
