Extreme Scale Computing with Optical
Data Movement
Keren Bergman
Department of Electrical Engineering
Columbia University
Loss of US Dominance in Supercomputing
Average computing performance of the top 3 Supercomputers over past decade
US had 10X advantage!
Tianhe-2 (China)
• Vast increase in parallelism requires ever more communication
…but bandwidth has stagnated
• Over the past 5 years, system compute power grew by 13X while node I/O bandwidth increased by less than 2X
• Data movement is too expensive ($ and energy)!
The Major Lag in Data Communications…
Top 10 Supercomputers computation capabilities over past 5 years:
Silicon Photonics: all the parts for on-chip optical communications
- Silicon as core material
  • High refractive index and high contrast: sub-micron cross-section dimensions, smallest bend radius
- Small footprint devices
  • 10 μm – 1 mm scale compared to cm-level scale for telecom components
- Low power consumption
  • Can reach <1 pJ/bit per full point-to-point link
- Aggressive WDM platform
  • Bandwidth densities 1-2 Tb/s per pin
- Silicon wafer-level CMOS processing
  • Integration
  • Mass production, price
  • Compatibility with CMOS fabs, CMOS electronics
Silicon Microring/Microdisk Based Devices: modulation, WDM modulation and demultiplexing, switching, detection
Silicon Photonics Technology – toward Commercialization
Fundamental Discoveries
• low-loss, single-mode waveguiding
• optical coupling
• optical modulation via carrier injection
Introduction of Innovative Devices
• high-speed microring modulators and switches
• high-speed MZM modulators and switches
• arrayed waveguide gratings
• germanium photodetectors
• ultra low-loss waveguides and crossings
• hybrid silicon lasers
Integration and Commercialization
• Hybrid platforms
• Foundry and Design Services
• Transceivers for Datacom
Timeline axis: 1990s, 2000s, 2010, 2015+
The Photonic Opportunity for Data Movement
Energy efficient, low-latency, high-bandwidth data interconnectivity is the core challenge to continued scalability across computing platforms
Energy consumption completely dominated by costs of data movement
Bandwidth taper from chip to system forces extreme locality
[Figure: two log-scale plots comparing electronic and photonic links at the chip edge, PCB, rack and system levels.
Reduce Energy Consumption: data movement energy (pJ/bit), axis from 0.01 to 1000.
Eliminate Bandwidth Taper: bandwidth (Gb/sec/mm), axis from 10 to 1,000,000.]
Node level bandwidth requirements
Near memory bandwidth: 10 TF x 8 bit x 0.5 B/F = 40 Tb/s
(split over ~6-10 individual ~5 Tb/s interfaces)
Bulk memory bandwidth:
0.1 B/F → 8 Tb/s
0.2 B/F → 16 Tb/s
(split over ~1-6 links)
Interconnect bandwidth:
0.01 B/F → 0.8 Tb/s
0.05 B/F → 4 Tb/s
Assume: 10 Teraflop (TF) node (exascale with 100K nodes)
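The bandwidth figures on this slide all follow from one product: node flop rate x bytes per flop x 8 bits per byte. A minimal sketch of that arithmetic, where the function name `bandwidth_tbps` is an illustrative choice and not something from the talk:

```python
def bandwidth_tbps(node_flops: float, bytes_per_flop: float) -> float:
    """Bandwidth in Tb/s needed to sustain `bytes_per_flop` at `node_flops`."""
    bits_per_second = node_flops * bytes_per_flop * 8  # 8 bits per byte
    return bits_per_second / 1e12


NODE_FLOPS = 10e12  # 10 TF node, as assumed on the slide

# Byte-per-flop ratios quoted on the slide for each memory/network tier.
tiers = {
    "near memory (0.5 B/F)": 0.5,
    "bulk memory (0.1 B/F)": 0.1,
    "bulk memory (0.2 B/F)": 0.2,
    "interconnect (0.01 B/F)": 0.01,
    "interconnect (0.05 B/F)": 0.05,
}

for name, ratio in tiers.items():
    print(f"{name}: {bandwidth_tbps(NODE_FLOPS, ratio):g} Tb/s")
```

Changing `NODE_FLOPS` shows how the same byte-per-flop ratios scale for a larger or smaller node.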
Power requirements
• Today’s largest power envelopes: Tianhe-2 = 17 MW; RIKEN = 12 MW
• Exascale at 100 MW is the upper limit considered: 10 GigaFLOP/Joule
• 20 MW total system power envelope preferred: 50 GigaFLOP/Joule
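The efficiency targets above are the exascale flop rate divided by the power envelope. A short sketch of that conversion, assuming exascale means 10^18 flop/s; the function names are illustrative:

```python
EXAFLOP = 1e18  # floating-point operations per second


def required_gflops_per_joule(power_watts: float, flops: float = EXAFLOP) -> float:
    """Efficiency (GFLOP/J) needed to reach `flops` within `power_watts`."""
    return flops / power_watts / 1e9


def energy_per_flop_pj(power_watts: float, flops: float = EXAFLOP) -> float:
    """Total system energy spent per flop, in picojoules."""
    return power_watts / flops * 1e12


for megawatts in (100, 20):
    watts = megawatts * 1e6
    print(
        f"{megawatts} MW: {required_gflops_per_joule(watts):g} GFLOP/J, "
        f"{energy_per_flop_pj(watts):g} pJ per flop"
    )
```

The second function gives the same target as a per-flop energy allowance: 100 pJ per flop at 100 MW and 20 pJ per flop at 20 MW, covering the whole system and not only the network.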
[Figure: energy budget per bit (pJ, axis ticks at 1 and 10) versus verbosity (byte/flop, 0.001 to 1), with four curves: 10 Gigaflop/J at 10% of the envelope, 10 Gigaflop/J at 15% of the envelope, 50 Gigaflop/J at 10% of the envelope, 50 Gigaflop/J at 15% of the envelope]
Network energy requirements
End-to-end data movement energy budget:
0.25 pJ/bit
100s of pJ to 10s of pJ
10s of pJ to single pJs
pJs to fJs!
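One way to read the 0.25 pJ/bit budget: give the network a fixed share of the per-flop energy and divide it by the bits moved per flop. The sketch below assumes the figure corresponds to 50 GigaFLOP/J, a 10% network share and a verbosity of 1 byte/flop. That mapping is an assumption, although it does reproduce the number. The function name is illustrative.

```python
def energy_budget_pj_per_bit(
    gflops_per_joule: float,
    envelope_fraction: float,
    bytes_per_flop: float,
) -> float:
    """Energy allowed per transported bit, in pJ.

    gflops_per_joule  -- whole-system efficiency target
    envelope_fraction -- share of system energy given to data movement
    bytes_per_flop    -- verbosity: bytes moved per floating-point operation
    """
    pj_per_flop = 1e12 / (gflops_per_joule * 1e9)   # total energy per flop
    network_pj_per_flop = pj_per_flop * envelope_fraction
    bits_per_flop = bytes_per_flop * 8
    return network_pj_per_flop / bits_per_flop


# Assumed operating point: 50 GFLOP/J, 10% of the envelope, 1 byte/flop.
print(energy_budget_pj_per_bit(50, 0.10, 1.0))

# Lower verbosity relaxes the per-bit budget proportionally.
for verbosity in (0.001, 0.01, 0.1, 1.0):
    budget = energy_budget_pj_per_bit(50, 0.10, verbosity)
    print(f"{verbosity} B/F -> {budget:g} pJ/bit")
```

The budget is inversely proportional to verbosity, so each curve in the plot is a straight line on log-log axes.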
Interconnect costs
• Network is ~15% of total system cost
– $200M considered typical exascale price
– $30M max for network
• Total interconnect bandwidth
– ~300 PB/s (0.1 B/F)
• $30M / 300 PB/s → $1 per 10 GB/s → 1.25¢ per Gb/s
• Cost reduction required:
– >100X for 0.1 B/F
– >10X for 0.01 B/F
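The cost target can be checked with a few lines of arithmetic. A minimal sketch using only the numbers on the slide ($200M system, 15% network share, 300 PB/s total interconnect bandwidth); the function name is illustrative:

```python
def dollars_per_gbps(network_budget_usd: float, bandwidth_bytes_per_s: float) -> float:
    """Allowed network cost per Gb/s of interconnect bandwidth."""
    gigabits_per_s = bandwidth_bytes_per_s * 8 / 1e9
    return network_budget_usd / gigabits_per_s


system_price_usd = 200e6          # typical exascale price quoted on the slide
network_share = 0.15              # network is ~15% of total system cost
total_bandwidth = 300e15          # ~300 PB/s, in bytes per second

network_budget = system_price_usd * network_share        # about $30M
cost = dollars_per_gbps(network_budget, total_bandwidth)

print(f"network budget: ${network_budget / 1e6:.0f}M")
print(f"target: {cost * 100:.2f} cents per Gb/s")
print(f"equivalently ${cost * 80:.2f} per 10 GB/s")       # 10 GB/s = 80 Gb/s
```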
Photonic Computing Architectures: Beyond Wires
• Leverage dense WDM bandwidth density
• Photonic switching
• Distance-independent, cut-through, bufferless
• Bandwidth-energy optimized interconnects
Conventional hop-by-hop data movement (on chip, short distance PCB, long distance PCB, optical link): 12 conversions!
Fully flattened end-to-end data movement: No conversion!
• Novel design environment enabling HFI across three layers:
  – Application IO primitives
    · Copy memory array to remote location
    · Send, multicast, broadcast messages
    · Thread synchronization (e.g. barrier)
  – Network architecture and protocols
    · Link locking mechanisms (frame detection)
    · Network topology (routing)
    · Arbitration of shared buses, switches
  – Si Photonic Hardware implementations
    · Silicon photonics modulators, switches
• Complete “toolbox” of models at each layer
  – Ensure interoperability among models
  – Avoids “manual” adaptations of data between distinct software
Columbia PhoenixSim: Integrated Multi-Level Modeling and Design Environment
[Figure: three model layers (Appl. model, Network arch., Hardware SiP devices) connected by software interfaces]
S. Rumley, M. Glick, S. D. Hammond, A. Rodrigues, K. Bergman, “Design Methodology for Optimizing Optical Interconnection Networks in High Performance Systems”, ISC-HPC 2015.
Si Photonic physical hardware layer
• Silicon Photonic WDM links
• Silicon Photonic Switches:
– Electro-optical (switch time: 1-2 ns)
– Thermo-optical (1-2 μs)
– Electro-mechanical (~1 ms)
[Figure labels: external laser, Chip 1, Chip 2, optical switch, photodetectors, other chips]
Parameterization of silicon photonic links
[Figure: link circuit schematic with clock generation (clk gen) and clock distribution, data inputs, TIA receivers and an RC element, connected by silicon waveguides]
• Co-existence of Electronics and Photonics
• Energy-Bandwidth optimization
[Figure labels: clock distribution, serialization of data, driver, comb laser, vertical grating coupler, coupled waveguides, OOK modulation (“0”, “1”), fiber, thermal tuning, photodiode, λ, deserialization, demultiplexing filter, amplifier]
Optimization of the Optical Link
• Link bandwidth: data rate, wavelength channels
• Multitude of different circuits and devices
– Energy per bit (pJ/bit)
– Loss, Crosstalk, Power penalties
• Component energy consumption: constant, linear, quadratic or logarithmic with bit rate
• Power penalty of Demux depends on bit rate
– Low Q leads to high level of crosstalk
– High Q leads to narrowing down the OOK spectrum
– Q of the ring can be optimized for minimum penalty
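The four scaling laws listed above can be turned into an energy-per-bit figure by dividing component power by bit rate. The sketch below assumes the scaling laws describe component power. Every coefficient is hypothetical and chosen only to make the trends visible. None of them are measured values from the talk.

```python
import math


def energy_per_bit_pj(power_mw: float, bit_rate_gbps: float) -> float:
    """Convert power at a given bit rate to energy per bit (mW / Gb/s = pJ/bit)."""
    return power_mw / bit_rate_gbps


# Hypothetical power models in mW as a function of bit rate r in Gb/s.
POWER_MODELS_MW = {
    "constant (e.g. thermal tuning)": lambda r: 2.0,
    "linear": lambda r: 0.1 * r,
    "quadratic": lambda r: 0.005 * r ** 2,
    "logarithmic": lambda r: 1.0 * math.log2(1.0 + r),
}


def link_energy_per_bit_pj(bit_rate_gbps: float) -> float:
    """Energy per bit of a link built from all four hypothetical components."""
    total_mw = sum(model(bit_rate_gbps) for model in POWER_MODELS_MW.values())
    return energy_per_bit_pj(total_mw, bit_rate_gbps)


for rate in (5, 10, 20, 40):
    parts = ", ".join(
        f"{name}: {energy_per_bit_pj(model(rate), rate):.3f}"
        for name, model in POWER_MODELS_MW.items()
    )
    print(f"{rate} Gb/s -> total {link_energy_per_bit_pj(rate):.3f} pJ/bit ({parts})")
```

Constant-power components get cheaper per bit as the rate rises, quadratic ones get more expensive, and linear ones stay flat. The total therefore has a minimum at some intermediate bit rate, which is the kind of trade-off the optimization targets.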
< 1pJ/bit
[Figure: energy per bit (E/bit) versus bit rate for a single channel, in three panels (serializer, driver, receiver), with contributions from laser, tuning, TIA, serializer and deserializer]
Bandwidth-Energy Design Exploration
• Goal is to maximize throughput while minimizing energy/bit
• Lessons learned: WDM link can provide up to 2 Tbps
– Minimum energy: 1.46 pJ/bit at 8 Gb/s per channel; 200 channels possible; 1.6 Tb/s of optical bandwidth
– Maximum bandwidth: 1.9 Tb/s at 13 Gb/s per channel; 146 channels possible; energy cost 1.54 pJ/bit
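The two design points can be cross-checked: aggregate bandwidth is channels x per-channel rate, and link power is energy per bit x aggregate bandwidth. A minimal sketch using the slide's numbers. The class and field names are illustrative, and the power figures are derived here, not quoted in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DesignPoint:
    name: str
    channels: int
    gbps_per_channel: float
    pj_per_bit: float

    @property
    def aggregate_tbps(self) -> float:
        """Total WDM link bandwidth in Tb/s."""
        return self.channels * self.gbps_per_channel / 1000

    @property
    def link_power_w(self) -> float:
        """Link power in watts (pJ/bit x Gb/s = mW)."""
        return self.pj_per_bit * self.channels * self.gbps_per_channel / 1000


points = [
    DesignPoint("minimum energy", channels=200, gbps_per_channel=8, pj_per_bit=1.46),
    DesignPoint("maximum bandwidth", channels=146, gbps_per_channel=13, pj_per_bit=1.54),
]

for p in points:
    print(f"{p.name}: {p.aggregate_tbps:.3f} Tb/s at {p.pj_per_bit} pJ/bit "
          f"-> {p.link_power_w:.2f} W")
```

Moving from the minimum-energy point to the maximum-bandwidth point buys about 19% more bandwidth for about 5% more energy per bit.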
Implementing Photonic Computing Systems
• Silicon photonic chips subsystem hardware
– Electronic/photonic packaging
– FPGA control and programmability
• 2 Examples of Photonic Architectures:
– Photonic switching in extreme scale computing
– Optically interconnected memory
• Photonic multi-level memory
Prototypical Test Vehicles for Silicon Photonics
Generalized characterization test vehicle for electrical-optical packaged devices: (switches, modulators, filters, receivers)
4-Channel WDM Mach-Zehnder Modulator
8-Channel WDM Microring Drop Filter
Fully E/O-packaged sub-assembly
Special purpose high-speed functional verification test vehicle for 4-channel WDM modulator (MZI-based)
Optical Network Interface Card (ONIC)
FF2 Motherboard
ONIC Daughterboard
• 8 wavelengths modulated at 10 Gbps
• TX wavelength multiplexers
• RX wavelength demultiplexer
Optical I/O to Network
Samtec Interface Conn.
Placement of E/O packaged devices directly on printed circuit
board (PCB) – abstract SiP functionality via FPGA
[Figure: ONIC block diagram (*not to scale). Labels:
TX path: 8x 10 Gbps, TX driver amplifiers, WDM optical TX path
RX path: WDM optical RX path, ≥ 10 GHz signal photodiodes, RX transimpedance and limiting amplifiers
Feedback: small power split, ~200 kHz feedback photodiodes, feedback control, multi-GPIO to ADC/DAC, 8-ch ADC and DAC
Supplies: supply circuitry w/ adjustment, discrete bias supplies
Photonic chips: IME 003-C3 (two), IME 002-W9U, grating couplers (GC)
Control: GPIO control, Xilinx Kintex, ONIC control FPGA from FF2
Interface to Fast Forward 2 HMC-enabled motherboard]
Advanced Packaging Subassembly
[Figure: side camera view (auto-aligner side camera, focus length 30-40 mm), top view, iso view and device image, with labels for DC connectors, RF connectors, Interposer 1, Interposer 2 and a 64 channel fiber array]
Key concepts:
• Fan-out RF interconnects from Tx/Rx using interposers
• Au wire bonding from Tx/Rx to interposers and test board
• Arrays of SMA headers for RF connections
• Rotate assembly 45 degrees against test board to provide field of view for side camera
• A maximum of 32 differential pairs can be used
Preliminary design
Columbia-Sandia Collaboration: Incorporating Silicon Photonics at System Level (Rev PA1)
Use demultiplexer controlled by ONIC at node to perform wavelength routing
[Figure: RX banks for Nodes 1 to 8]
Spatially Connected or Switched Network
λ-selective periphery
Controller FPGA to Actuate and Arbitrate Central Spatial Switch
Mach-Zehnder Interferometer-Based 4x4 Benes Switching Topology
4x4 spatial switch
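For reference, an N-port Benes network has 2·log2(N) − 1 stages of N/2 two-by-two elements. The sketch below computes those counts, assuming each 2x2 element is realized as one Mach-Zehnder interferometer. That mapping is an assumption here, and the function name is illustrative.

```python
import math


def benes_dimensions(ports: int) -> tuple[int, int, int]:
    """Return (stages, elements_per_stage, total_elements) for an N x N Benes network."""
    if ports < 2 or ports & (ports - 1):
        raise ValueError("ports must be a power of two, at least 2")
    log2_ports = ports.bit_length() - 1
    stages = 2 * log2_ports - 1
    elements_per_stage = ports // 2
    return stages, elements_per_stage, stages * elements_per_stage


stages, per_stage, total = benes_dimensions(4)
print(f"4x4 Benes: {stages} stages x {per_stage} elements = {total} 2x2 elements")

# Each element has two states (bar or cross), so the switch has 2**total settings,
# enough to realize all 4! input-to-output permutations.
print(f"{2 ** total} settings for {math.factorial(4)} permutations")
```

A 4x4 Benes switch therefore needs 6 two-by-two elements and every path crosses 3 of them, which bounds the insertion loss and crosstalk accumulated per path.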
[Figure labels: SiP devices, optical I/O, optical interposer, multi-layer waveguides, copper pillar and redistribution layer, solder bumps, electrical PCB]
AIM 2.5D Integration Platform
Photonic switching for dynamic bandwidth
Multi-node FPGA-enabled Terabit/s WDM SiP network testbed
[Figure: testbed schematic with Nodes A to D (each with a TX SiP and an RX SiP), a photonic switch subsystem with network inputs and outputs, and an arbiter FPGA. Other labels: off-chip laser sources, off-chip and on-chip regions, EDFA, RX chain (EDFA, PIN/TIA, LA), low-speed feedback (PIN/TIA), 12.5 GHz, 10 MHz, λ1, λ2, optical fiber and electrical cable links, control and data paths, software communication control]
Full System Validation with Nodes and Switch
Growing Challenges of Computing/Memory
• Architectures such as High-Bandwidth Memory (HBM) stress the I/O for processor connectivity and offer limited storage capacity.
• Low cost NVM slow – requires large pin counts; DRAM
• Highly heterogeneous, multi-level memory architectures
FPGA SiP Interconnected HMC
Eight-wavelength bi-directional WDM {λ1 … λ8} at 10 Gbps
[Figure: HMC (4GB, latest gen) connected to four FPGAs, each through a SiP WDM Tx/Rx]
2.56 Tbps bisectional bandwidth - 32 bidirectional lanes to each FPGA - 10 Gbps signaling rate
HMC (4GB, latest gen)
FPGA (Stratix 5)
0.64 Tbps bisectional bandwidth - 8 bidirectional lanes to each FPGA - 10 Gbps signaling rate
Address a subset of HMC I/O bandwidth:
0.64 Tbps = 640 Gbps (bisectional)
0.16 Tbps = 160 Gbps (bisectional)
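The lane counts and signaling rate reproduce the quoted bandwidths if each bidirectional lane is counted once per direction and the total is taken over four FPGAs. That counting convention is an assumption inferred from the numbers, not stated on the slide. The function name is illustrative.

```python
def aggregate_gbps(
    fpgas: int,
    lanes_per_fpga: int,
    gbps_per_lane: float,
    directions: int = 2,
) -> float:
    """Aggregate bandwidth in Gb/s, counting each lane once per direction."""
    return fpgas * lanes_per_fpga * gbps_per_lane * directions


full = aggregate_gbps(fpgas=4, lanes_per_fpga=32, gbps_per_lane=10)
subset = aggregate_gbps(fpgas=4, lanes_per_fpga=8, gbps_per_lane=10)
single = aggregate_gbps(fpgas=1, lanes_per_fpga=8, gbps_per_lane=10)

print(f"32 lanes x 4 FPGAs: {full:g} Gb/s = {full / 1000:g} Tb/s")
print(f" 8 lanes x 4 FPGAs: {subset:g} Gb/s = {subset / 1000:g} Tb/s")
print(f" 8 lanes x 1 FPGA:  {single:g} Gb/s = {single / 1000:g} Tb/s")
```

Under this reading, the 160 Gbps figure is the share of a single FPGA, which matches two directions of an eight-wavelength WDM link at 10 Gbps per wavelength.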
[Figure labels: HMC-FPGA, WDM SiP Tx, WDM SiP Rx, D/A, A/D, Driver, Elec.]
Shared Low-Speed Bus Induced Latency
• A single data bus is shared among multiple NVM packages
• When the bus is busy, even if the package is ready, data is stalled
• Problem becomes worse as the number of layers per package grows
• Toshiba/SanDisk 48-layer packages
[Figure: four NVM packages on a shared bus to the interface control logic, with a stall marked]
8/16 pins, each of 400 Mb/s
Toshiba 48-layer BiCS 3D-NAND
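The pin figures translate into a small shared bandwidth. A rough sketch, assuming the bus is divided evenly among the packages that share it. The even split and the four-package count (taken from the figure) are simplifying assumptions, and the function names are illustrative.

```python
def bus_bandwidth_gbps(pins: int, mbps_per_pin: float = 400.0) -> float:
    """Raw bandwidth of the shared NVM bus in Gb/s."""
    return pins * mbps_per_pin / 1000


def per_package_share_gbps(pins: int, packages: int, mbps_per_pin: float = 400.0) -> float:
    """Average bandwidth per package when all packages contend for the bus."""
    return bus_bandwidth_gbps(pins, mbps_per_pin) / packages


for pins in (8, 16):
    total = bus_bandwidth_gbps(pins)
    share = per_package_share_gbps(pins, packages=4)
    print(f"{pins} pins: {total:g} Gb/s total, {share:g} Gb/s per package with 4 packages")
```

Even the 16-pin case gives 6.4 Gb/s for the whole bus, which is less than a single 10 Gbps wavelength channel.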
A Photonic Multi-Package Fabric
Multi-wavelength photonic bus provides ample IO bandwidth for each package
[Figure: a processor (grid of APM tiles, with cores A and B marked) connected through a λ router to four NVM packages; wavelength groups labeled λ1-λ4]
Photonic switch enabled memory affinity
• Optimizing core-memory affinity is key when there are multiple memory stacks
• Can use a reconfigurable photonic switch to map optical connections to core-memory relations
[Figure: 4 memory stacks attached by optical links through a photonic switch to a processor NoC made of Apex cores and memory controllers (Mem Ctrl). Labels: work assignment direction, data layout direction, index orders 0 1 2 … N-1 and [N-1, …, 1, 0]]
[Figure: attainable MFlops/s versus operational intensity (Flops/byte), with peak single-core performance and peak multi-core performance lines; marked values 1408, 3424 and 3900. *ApexMap intensity = 0.25 Flops/byte]
[Figure: processor cores A and B connected through E-MUX blocks and modulators to a WDM link (photonic mux and photonic demux) reaching HBM stacks 1 to 4. Labels: Core A’s data, Core B’s data, A’s intf, B’s intf]
Example: Reconfigure photonic switch for memory affinity
• By reconfiguring the photonic switch, the number of hops traversed by memory traffic on the processor NoC can decrease significantly
Reconfigurable
Silicon photonic chip
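A toy model illustrates the hop reduction. It assumes a 1D chain of N cores, with memory interface k sitting next to core k and a photonic switch that connects each memory stack to any one interface. The data layout runs opposite to the work assignment, so core i needs stack N-1-i. The topology, the layout and all names are hypothetical simplifications, not the system in the talk.

```python
from itertools import permutations


def total_hops(stack_of_core: list[int], interface_of_stack: tuple[int, ...]) -> int:
    """Sum of NoC hops from each core to the interface serving its memory stack."""
    return sum(
        abs(core - interface_of_stack[stack])
        for core, stack in enumerate(stack_of_core)
    )


def best_switch_setting(stack_of_core: list[int]) -> tuple[int, ...]:
    """Brute-force the stack-to-interface mapping that minimizes total hops."""
    n = len(stack_of_core)
    return min(permutations(range(n)), key=lambda p: total_hops(stack_of_core, p))


N = 4
stack_of_core = [N - 1 - core for core in range(N)]   # reversed data layout
fixed_wiring = tuple(range(N))                        # stack k hard-wired to interface k

best = best_switch_setting(stack_of_core)
print("fixed wiring hops:   ", total_hops(stack_of_core, fixed_wiring))
print("reconfigured setting:", best)
print("reconfigured hops:   ", total_hops(stack_of_core, best))
```

In this toy case the fixed wiring costs 8 hops in total, while the reconfigured switch brings every stack to the interface next to the core that uses it, for 0 hops. Brute force is only practical for a handful of stacks; a larger system would need an assignment algorithm.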
Demonstrating 4-node Layer-1(λ) Switching
Wavelength channels: 1548.5 nm, 1546.9 nm, 1545.3 nm, 1543.7 nm
FPGA-Emulated Memory Node
FPGA-Emulated Processor Node
PD/TIA receivers (four)
Four DRAM modules use four demux rings
TX: 4 x 10 Gbps/channel
• Emulated quad-core processor reading data from four memory modules through a four-channel reconfigurable optical link
• Optical demux to route wavelengths in a way that matches NIOS-memory affinity
Voltage control
λ-mux single fiber
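The four channel wavelengths are 1.6 nm apart, which corresponds to roughly 200 GHz of optical frequency spacing near 1550 nm. A short sketch of the conversion using f = c/λ; the function names are illustrative:

```python
C = 299_792_458.0  # speed of light in vacuum, m/s

wavelengths_nm = [1548.5, 1546.9, 1545.3, 1543.7]


def frequency_thz(wavelength_nm: float) -> float:
    """Optical frequency in THz for a vacuum wavelength given in nm."""
    return C / (wavelength_nm * 1e-9) / 1e12


def spacings_ghz(wavelengths: list[float]) -> list[float]:
    """Frequency spacing in GHz between neighbouring channels."""
    freqs = [frequency_thz(w) for w in wavelengths]
    return [abs(b - a) * 1000 for a, b in zip(freqs, freqs[1:])]


for wl in wavelengths_nm:
    print(f"{wl} nm -> {frequency_thz(wl):.3f} THz")
print("channel spacings (GHz):", [round(s, 1) for s in spacings_ghz(wavelengths_nm)])
```

At 10 Gbps per channel, a spacing near 200 GHz leaves a wide guard band between neighbouring OOK spectra.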
FPGA-programmable optically connected memory
MZI based SiP switch
fibers
FPGA emulating processor and performing switch control
FPGA emulating DRAM
FPGA emulating NVM