LOW-POWER, HIGH-PERFORMANCE, RECONFIGURABLE PROCESSOR
USING SINGLE-FLUX-QUANTUM CIRCUITS
Naofumi Takagi
Graduate School of Informatics, Kyoto University
Kyoto 606-8501
Our Team
• Takagi Group (Kyoto University): Prof. N. Takagi, Prof. K. Takagi
• Murakami Group (Kyushu University): Prof. K. Murakami, Prof. K. Inoue, Prof. H. Honda
• Yoshikawa Group (Yokohama National University): Prof. N. Yoshikawa, Prof. Y. Yamanashi
• Akaike Group (Nagoya University): Prof. H. Akaike, Prof. A. Fujimaki, Prof. M. Tanaka
• Nagasawa Group (ISTEC-SRL: Superconductivity Research Laboratory, International Superconductivity Technology Center): Mr. S. Nagasawa, Dr. M. Hidaka
Aim of the research
Developing basic technologies for energy-efficient, high-performance computers, e.g., a 10 TFlops desk-side computer, using superconducting single-flux-quantum (SFQ) circuits and adopting the processor architecture called 'large-scale reconfigurable data-paths (RDPs).'
[Figure: roadmap toward a 10 TFlops computer. With technologies at 2006: a parallel computer in 90 nm CMOS. This project: developing basic technologies for an LSRDP in SFQ (0.5~0.35 µm).]
Our approach
• Reduction of power consumption of conventional circuit technology
• Development of technologies for realizing a high-performance computer using a new low-power circuit technology: SFQ
Backgrounds
Superconducting Single-Flux-Quantum Circuits: Ultra Low-Power, Ultra High-Speed
Φ0 = h/2e = 2.07 mV·ps
[Figure: a single flux quantum (SFQ) in a superconductive loop containing a Josephson junction; the SFQ pulse is about 1 mV high and 2~3 ps wide.]
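The value Φ0 = 2.07 mV·ps can be checked numerically. The sketch below is an illustration added here, not part of the original slides. It computes h/2e from the SI constants and, assuming an idealized rectangular pulse (a simplification made only for this estimate), the width of a pulse of given height whose area equals one flux quantum. The function names are hypothetical.

```python
# Exact SI values of the constants (fixed by the 2019 SI redefinition).
PLANCK_H = 6.62607015e-34            # J*s
ELEMENTARY_CHARGE = 1.602176634e-19  # C


def flux_quantum_wb() -> float:
    """Magnetic flux quantum h/2e in webers (1 Wb = 1 V*s)."""
    return PLANCK_H / (2.0 * ELEMENTARY_CHARGE)


def flux_quantum_mv_ps() -> float:
    """Flux quantum in mV*ps: 1 V*s = 1e3 mV * 1e12 ps = 1e15 mV*ps."""
    return flux_quantum_wb() * 1e15


def rectangular_pulse_width_ps(amplitude_mv: float) -> float:
    """Width of a rectangular pulse of the given height with area h/2e.

    Assumption: the real SFQ pulse is not rectangular; this only gives
    the order of magnitude.
    """
    if amplitude_mv <= 0:
        raise ValueError("amplitude must be positive")
    return flux_quantum_mv_ps() / amplitude_mv


if __name__ == "__main__":
    print(f"flux quantum = {flux_quantum_mv_ps():.2f} mV*ps")
    print(f"1 mV pulse  -> {rectangular_pulse_width_ps(1.0):.2f} ps wide")
```

Under this simplification a pulse about 1 mV high comes out about 2 ps wide, which is in line with the 2~3 ps quoted on the slide.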
SFQ technologies at 2006
• Conventional Nb 4-layer 2µm fabrication process
– Cell-based design, logic cell library
– Automatic routing by Josephson transmission line
– SFQ-LSIs with more than 10,000 JJs
• Development of a new 1µm fabrication process
– Nb 6 layers
– No design environment
• Development of passive transmission line (PTL) technology
– High-speed intra-chip data transfer
[Figure: an SFQ pulse on a superconducting micro-strip line.]
Reconfigurable Data-Path (RDP) processor
• Reconfigurable data-path
– A lot of floating-point units (FPUs)
– Reconfigurable operand routing networks (ORNs)
– Dynamic reconfiguration
• Features
– Reconfiguring the data-path by routing ORNs to fit the processing of a loop in large-scale numerical computation
– Parallel and pipelined processing
– Burst input/output data is transferred from/to memory
[Figure: block diagram of the RDP processor. Rows of FPUs (processing elements, PEs) alternate with operand routing networks (ORNs). The other blocks shown are the streaming memory access controller (SMAC), the I/O port, the general-purpose processor (GPP), and main memory.]
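The idea of a data-path configured for one loop body can be illustrated with a purely functional toy model. This is a sketch under stated assumptions, not the project's simulator. Each row of the configuration stands for one row of FPUs, and the two source indices per FPU stand for the routing chosen in the ORN above that row. The names `OPS`, `run_datapath`, and `stream` are hypothetical, and constraints such as the maximum connection length are ignored.

```python
import operator

# Operations a toy FPU can perform; "pass" forwards its first operand.
OPS = {
    "add": operator.add,
    "mul": operator.mul,
    "pass": lambda a, b: a,
}


def run_datapath(config, inputs):
    """Push one record through the configured rows.

    config: list of rows; each row is a list of (op, src_a, src_b),
            where src_a and src_b index the outputs of the previous row
            (or the inputs for the first row).
    """
    values = list(inputs)
    for row in config:
        values = [OPS[op](values[a], values[b]) for op, a, b in row]
    return values


def stream(config, records):
    """Apply the same configuration to a burst of records."""
    return [run_datapath(config, record) for record in records]


# Configuration for the loop body (x + y) * z.
CONFIG = [
    [("add", 0, 1), ("pass", 2, 2)],  # row 1: x + y, forward z
    [("mul", 0, 1)],                  # row 2: (x + y) * z
]

if __name__ == "__main__":
    print(stream(CONFIG, [(1.0, 2.0, 3.0), (2.0, 2.0, 0.5)]))
```

In the model the configuration stays fixed while records stream through it, which mirrors the slide's description of burst data transferred from and to memory.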
Research subjects
1. SFQ fabrication process and circuit design environments
(1) Nb multi-layer 1µm fabrication process (Nagasawa G.)
(2) Logic cell library for the 1µm process (Yoshikawa G. and Akaike G.)
(3) CAD for SFQ digital circuit design (Takagi G.)
2. SFQ-FPUs and SFQ-RDP prototypes
(1) SFQ-FPUs (Yoshikawa G. and Takagi G.)
– Half-precision FPA and FPM operating at 25GHz (2µm process)
– FPA and FPM operating at 50GHz (1µm process)
(2) SFQ-RDP prototypes (ALU+ORN) (Akaike G.)
– 2x2 SFQ-RDP operating at 25GHz (2µm process)
– 4x4 SFQ-RDP operating at 50GHz (1µm process)
3. RDP architecture (Murakami G.)
– RDP architecture, RDP compiler, RDP-oriented algorithms
Results of the research
1. Fabrication process and design environment
Development of a Nb 9-layer 1µm fabrication process
[Figure: cross-section of the Nb 9-layer structure on a Si substrate with SiO2 insulation, annotated with Nb and SiO2 layer thicknesses between 150 nm and 400 nm. Nb layers: M1 (DCP), M2 (GND1), M3 (PTL1), M4 (GND2), M5 (PTL2), M6 (GND3), M7 (GP), M8 (BAS), M9 (COU). Other labels: contacts C1 to C6, BC, GC, JC, RC, the junction (JJ) with AlOx, and RES1.]
Layer roles labelled in the figure:
• DC power layer
• 1st PTL layer
• 2nd PTL layer
• Main ground plane
• Active layers including junctions and resistors
• Complemented planarization layer
Nb layers for M1-M7 are planarized.
Cross-sectional SEM photograph: excellent flatness was obtained even though the step edges of several underlying patterns overlap.
Shift registers for evaluation of the Nb 9-layer 1µm process
Design: 16 circuits, 68,990 JJs in total
• Two 16-bit shift registers
• Two 64-bit shift registers
• Two 160-bit shift registers
• Two 640-bit shift registers
• Four 1280-bit shift registers
• Four 2560-bit shift registers
Chip size: 8.5mm x 7.0mm
Measurement results (best chip)
• Only three defects
• Correct operation of 13 of 16 circuits
• Correct operation of all 2560-bit shift registers with 10,281 JJs
[Figure: chip layout showing the 16 shift registers (SR).]
Development of a logic cell library for the 9-layer process
[Figure: basic structure of a logic cell, 30 µm x 30 µm, with PTL1 and PTL2 layers, bias pillar, and ground contacts.]
[Figure: a microphotograph of a dffc2 cell.]
A 4x4 switch by the 9-layer 1µm process and the cell library
• Operation up to 112 GHz (world's highest)
• Total power consumption: 660 µW
• Total number of JJs: 3362
• Number of vias: 434
Circuit area ratio, conventional Nb 4-layer 2µm process : new Nb 9-layer 1µm process = 1 : 0.19, i.e., an area reduction of 81%.
(Cell size: 40µm x 40µm → 30µm x 30µm)
[Figure: layouts of the same functional block in both processes, with upper PTL, lower PTL, and via holes marked.]
Device density and operating frequency in LSIs
[Figure: device density (Trs/cm2, 1 to 10^12) versus frequency (GHz, 0.01 to 1000) for Si MOSFET, Si Bip, SiGe HBT, GaAs MESFET, GaAs HEMT, GaAs HBT, InP HEMT, and InP HBT. Boundaries are labelled "Limit from Long Interconnect Delay", "Limit from Power Density for CMOS", and "Limit from Power Density for Compound". SFQ points: present SFQ in USA / previous SFQ in Japan, and the 4x4 SW and 2x2 RDP demonstrated in CREST.]
SFQ LSIs developed in this project have reached the region that semiconductors cannot reach.
Energy consumption for a device used in LSIs
[Figure: energy consumption per device (J, 10^-23 to 10^-13) versus clock period (ps, 1 to 10^4). Reference levels: thermal energy @4K and thermal energy @350K. Plotted: present CMOS, SFQ in Japan before 2005, present SFQ in USA, and the 2x2 RDP, FPM/FPA, and 4x4 SW demonstrated in CREST. Primitives will be demonstrated in CREST.]
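The two thermal-energy reference levels on the chart can be reproduced with a few lines. The sketch assumes that "thermal energy" means k_B·T; the slide does not say which definition it uses, so this reading is an assumption. The function name is hypothetical.

```python
BOLTZMANN_K = 1.380649e-23  # J/K, exact SI value


def thermal_energy_joules(temperature_k: float) -> float:
    """Thermal energy k_B * T in joules (assumed definition)."""
    if temperature_k < 0:
        raise ValueError("temperature must be non-negative")
    return BOLTZMANN_K * temperature_k


if __name__ == "__main__":
    for temperature in (4.0, 350.0):
        energy = thermal_energy_joules(temperature)
        print(f"k_B*T at {temperature:g} K = {energy:.2e} J")
```

Under this assumption the 4 K level is about 5.5e-23 J and the 350 K level about 4.8e-21 J, so both fall inside the 10^-23 to 10^-13 J range of the chart's energy axis.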
Design flow of SFQ LSIs
[Figure: design flow. Main path: Design Entry (Specification & Constraints) → Logic Synthesis → Logic Netlist → Placement → Placed Cells & Connections → Routing → Mask Layout → Layout Viewer. Supporting blocks: Technology Library, P&R Library, Cell & Wire Geometry, Constraints & Violations, and Timing Verification (Logic Simulator, Static Timing Analyzer).]
Specific synthesis subsystems for SFQ logic circuits (a process unique to SFQ circuits):
• Sequential circuit synthesis
• Clock scheduling and distribution
• Asynchronous logic synthesis
Other annotations:
• Layout-driven design
• Precise timing analysis
• Verification of pipeline operations
Development of design tools
• Designed a sample circuit: 8-bit carry lookahead adder (CLA)
– Clock tree synthesis
– Semi-automatic placement
– Automatic routing
• Verified correct operations
8-bit CLA: 158 gates, 9 levels, concurrent-flow clocking, 7092 JJs, 598 PTLs
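The adder is reported as 158 gates in 9 levels. One common way to obtain such a level count is to levelize the gate netlist: a gate's level is one more than the deepest of its fan-ins. The sketch below is a generic illustration, not the project's CAD tool. It also counts the delay elements that would be needed if every fan-in had to arrive exactly one level before its gate, which is an assumption about gate-level pipelined logic made here for illustration. The toy netlist and all names are hypothetical.

```python
def logic_levels(netlist):
    """Return {gate: level} for an acyclic netlist.

    netlist maps each gate name to the list of its fan-in names.
    Names that are not keys of netlist are primary inputs (level 0).
    """
    levels = {}

    def level_of(name):
        if name not in netlist:
            return 0
        if name not in levels:
            deepest = max((level_of(src) for src in netlist[name]), default=0)
            levels[name] = deepest + 1
        return levels[name]

    for gate in netlist:
        level_of(gate)
    return levels


def balancing_delays(netlist):
    """Delay elements needed so every fan-in arrives one level early."""
    levels = logic_levels(netlist)
    total = 0
    for gate, fanins in netlist.items():
        for src in fanins:
            total += levels[gate] - levels.get(src, 0) - 1
    return total


# Toy netlist: x = f(a, b), y = f(x, c), z = f(x, y).
TOY = {"x": ["a", "b"], "y": ["x", "c"], "z": ["x", "y"]}

if __name__ == "__main__":
    print(logic_levels(TOY))      # depth of each gate
    print(balancing_delays(TOY))  # delays for c -> y and x -> z
```

For the toy netlist the deepest gate sits at level 3, and two delay elements are needed: one on the path from input c to y and one from x to z.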
2. SFQ-FPUs and SFQ-RDP prototypes
Half-precision FPA and FPM using the 2µm process
FPA
• Operating frequency: 20GHz
• Performance: 1.67 GFLOPS
• Number of junctions: 10244 JJs
• Power consumption: 3.5 mW
• Circuit area: 5.86 x 5.72 mm2
[Figure: FPA micrograph (scale bar 1mm) with blocks: Shifter of A, Shifter of B, Controller, Adder & Subtractor, Normalizer, Clock Generator, Shift Register of Significands, Shift Register of Exponent and Sign, Shift Register for Confirmation.]
FPM
• Operating frequency: 32GHz
• Performance: 2.6 GFLOPS
• Number of junctions: 11044 JJs
• Power consumption: 3.5 mW
• Circuit area: 6.22 × 3.78 mm2
[Figure: FPM micrograph (scale bar 1mm) with blocks: Multiplier, Normalizer, Clock Generator, Shift Registers.]
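Two derived quantities make the figures for the 2µm FPA and FPM easier to compare: clock cycles per floating-point operation and energy per operation. The sketch below only divides the numbers quoted on the slide. The class and method names are hypothetical, and the energy covers the reported chip power only, so refrigeration power is not included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FpuResult:
    name: str
    clock_ghz: float   # operating frequency from the slide
    gflops: float      # reported performance
    power_mw: float    # reported chip power consumption

    def cycles_per_operation(self) -> float:
        """Clock cycles spent per floating-point operation."""
        return self.clock_ghz / self.gflops

    def picojoules_per_operation(self) -> float:
        """mW / GFLOPS = 1e-3 J/s / 1e9 op/s = 1e-12 J/op = pJ/op."""
        return self.power_mw / self.gflops


FPA = FpuResult("FPA", clock_ghz=20.0, gflops=1.67, power_mw=3.5)
FPM = FpuResult("FPM", clock_ghz=32.0, gflops=2.6, power_mw=3.5)

if __name__ == "__main__":
    for unit in (FPA, FPM):
        print(
            f"{unit.name}: {unit.cycles_per_operation():.1f} cycles/op, "
            f"{unit.picojoules_per_operation():.2f} pJ/op"
        )
```

Both units come out at roughly 12 clock cycles per operation, and at about 2.1 pJ (FPA) and 1.3 pJ (FPM) per operation at the chip level.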
FPA and FPM using the 1µm process
[Figure: block diagram of bit-serial FPM with blocks: Shift Register for Input, Significand Processing Circuit, Systolic array multiplier, Exponent Processing Circuit, Normalizer, Clock Generator, Shift Register for Output.]
Micrograph of 10-bit bit-serial FPM
• Circuit area: 7.58 mm2 (3.510 mm x 2.16 mm)
• Junction count: 6157 JJs
50-GHz test results of 4b multiplier
[Figure: measurement versus simulation; the plot carries a 9% annotation.]
2x3 SFQ-RDP prototype using the 2µm process
• 6 ALUs
• Clock frequency: 23 GHz
• Junction count: 14040 (world's largest integration scale)
• Circuit area: 6.84 × 6.72 mm2
• CONNECT: cooperated with SRL, NiCT, NU & YNU
*SRL Nb 2.5 kA/cm2 standard process
2x2 SFQ-RDP prototype using the 1µm process
• Clock frequency: 45GHz
• Power dissipation: 3.4mW
• Junction count: 11458
• Circuit area: 5.61 × 2.82 mm2
[Figure: micrograph (scale bar 1mm) with inputs IN1 to IN4, outputs OUT1 to OUT4, ORN1 and ORN2, transfer units (TU), and four ALUs (labelled ALU1a, ALU1b, ALU2a, ALU2b) performing SUB, XOR, ADD, and AND.]
TU: Transfer Unit
ORN architecture and a crossbar switch for 2-bit wide data streams
[Figure: ORN crossbar layout. Number of rows = 1.5×M; number of columns = 4×MCL+1.]
ORN with MCL = 2
• Junction count: 547
• Clock frequency: 65GHz
• Power consumption: 0.14mW
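The two size formulas from the ORN figure, rows = 1.5×M and columns = 4×MCL+1, can be turned into a small helper for trying out parameter choices. This is a sketch: the function name is hypothetical, and requiring M to be even (so that 1.5×M is an integer) is an assumption made here, not something the slide states.

```python
def orn_crossbar_shape(m: int, mcl: int) -> tuple[int, int]:
    """Return (rows, columns) of the ORN crossbar.

    rows = 1.5 * M and columns = 4 * MCL + 1, as given in the figure.
    Assumption: M is even so that the row count is a whole number.
    """
    if m <= 0 or m % 2 != 0:
        raise ValueError("M must be a positive even integer")
    if mcl <= 0:
        raise ValueError("MCL must be a positive integer")
    rows = 3 * m // 2
    columns = 4 * mcl + 1
    return rows, columns


if __name__ == "__main__":
    for m, mcl in ((24, 5), (48, 5)):
        rows, columns = orn_crossbar_shape(m, mcl)
        print(f"M={m}, MCL={mcl}: {rows} rows x {columns} columns")
```

The column count depends only on MCL, so doubling M from 24 to 48 at a fixed MCL doubles the number of rows and leaves the columns unchanged.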
                        M    MCL   JJ count   Power dissipation (mW)
RDP-M (Middle-scale)    24   5     307692     7.7
RDP-L (Large-scale)     48   5     615384     15

M: Number of FPUs in an array, MCL: Maximum Connection Length
[Figure: operating region.]
Micro-architecture:
• Two types of FPUs: FPA and FPM
• FPU: three inputs (A, B, C) → three outputs (A(*B), B, C)
• Arrangement of FPUs: alternate
• Three types (scales) of RDP (Small, Medium and Large-Scales)
3. Development of RDP Architecture
[Figure: PE (i, j), drawn as FU/FP and TU blocks, connects through the ORN to PEs (i, j+1), (i+1, j+1), (i+2, j+1), ..., (i+L, j+1); MCL = L.]
TU: Data Through
RDP parameters (optimized by total number of JJs)

         # Input   # Output   Width   Height   MCL   Total JJs (∝ RDP size)
RDP-S    19        12         22      14       4     19387K
RDP-M    19        12         24      17       5     27027K
RDP-L    38        24         41      34       6     96374K
Development of RDP Compiler
[Figure: compiler tool flow with two paths. Both start from the application C code, which is modified manually by inserting LSRDP instructions.]
1: flow of generation of assembly codes for GPP
• Modified code → ISAcc or COINS compiler → .asm code for MIPS-based GPP
2: flow of generation of configuration bit-stream for RDP simulator
• Modified code → DFG extraction (semi-manual) → data flow graphs → placement and routing tool → configuration file + various text and schematic reports
Other blocks in the figure:
• RDP library file (function definitions & declarations)
• RDP architecture description
• Performance evaluation
Development of RDP-Oriented Algorithms
• One-dimensional heat and vibrational equations
• Two-dimensional heat and FDTD equations
• Two-Electron Repulsion Integral calculation in quantum chemistry
• Runge-Kutta calculation for ordinary differential equations
Performance Estimation
• Two-dimensional heat equation (1024x1024 mesh)
• SFQ-RDP: 50.6 GFlops vs. GPU 1): 63.0 GFlops
Estimation method:
• RDP: execution time model; DFG has 21 inputs, 9 outputs, and 63 operations; BW: 159.0GB/s
• GPP: cycle-accurate processor simulator
1) T. Aoki and A. Nukada, "CUDA programming: beginners," Kougakusya, ISBN-10: 4777514773, 2009 (in Japanese).
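The DFG size quoted for the two-dimensional heat equation (21 inputs, 9 outputs, 63 operations) is consistent with one DFG updating a 3x3 tile of the mesh with a 5-point explicit stencil. That reading is an inference made here, not something the slides state, and the sketch below only checks that the counts match. The 7-operation form of the update and all names are assumptions.

```python
def stencil_inputs(block: int) -> set:
    """Mesh points read when a block x block tile is updated
    with a 5-point stencil (centre plus four neighbours)."""
    points = set()
    for i in range(block):
        for j in range(block):
            points.update(
                {(i, j), (i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)}
            )
    return points


def heat_update(u, i, j, r):
    """Explicit update of one interior point, written with 7 operations."""
    neighbours = ((u[i - 1][j] + u[i + 1][j]) + u[i][j - 1]) + u[i][j + 1]  # 3 adds
    return u[i][j] + r * (neighbours - 4.0 * u[i][j])  # mul, sub, mul, add


OPS_PER_POINT = 7  # count for the formulation in heat_update
BLOCK = 3          # assumed tile edge

if __name__ == "__main__":
    n_inputs = len(stencil_inputs(BLOCK))
    n_outputs = BLOCK * BLOCK
    n_ops = OPS_PER_POINT * n_outputs
    print(n_inputs, n_outputs, n_ops)  # 21 9 63
```

A 3x3 tile reads its own 9 points plus 3 points on each of its four sides, which gives the 21 inputs. A different operation count per point or a different tile shape would not reproduce all three numbers at once.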
Summary
1. SFQ fabrication process and circuit design environments
(1) We have established a Nb 9-layer 1µm fabrication process.
(2) We have developed a logic cell library for the 1µm process.
(3) We have developed several CAD tools for the 1µm process.
It is possible to design and fabricate large-scale SFQ circuits.
2. SFQ-FPUs and SFQ-RDP prototypes
(1) We have fabricated a half-precision FPA and FPM by the 2µm process, which have operated at 20GHz and 35GHz, respectively. We are designing an FPA and FPM by the 1µm process.
(2) We have fabricated a 2x3 SFQ-RDP prototype by the 2µm process, which has operated at 23GHz. We have fabricated a 2x2 SFQ-RDP prototype by the 1µm process, which has operated at 45GHz. We are developing a 4x4 SFQ-RDP prototype by the 1µm process.
It is possible to realize SFQ-RDPs.
3. RDP architecture
(1) We have determined architectural specifications of SFQ-RDP.
(2) We have developed a compiler for RDP.
(3) We have developed RDP-oriented algorithms for several applications and have estimated the performance of SFQ-RDP.
SFQ-RDP is effective.
We will be able to develop energy-efficient, high-performance computers using SFQ circuits in the near future.