LOW-POWER, HIGH-PERFORMANCE, RECONFIGURABLE PROCESSOR USING SINGLE-FLUX-QUANTUM CIRCUITS

Naofumi Takagi
Graduate School of Informatics, Kyoto University
Kyoto 606-8501, Japan
[email protected]


Our Team

Takagi Group (Kyoto University): Prof. N. Takagi, Prof. K. Takagi

Murakami Group (Kyushu University): Prof. K. Murakami, Prof. K. Inoue, Prof. H. Honda

Yoshikawa Group (Yokohama National University): Prof. N. Yoshikawa, Prof. Y. Yamanashi

Akaike Group (Nagoya University): Prof. H. Akaike, Prof. A. Fujimaki, Prof. M. Tanaka

Nagasawa Group (ISTEC-SRL: Superconductivity Research Laboratory, International Superconductivity Technology Center): Mr. S. Nagasawa, Dr. M. Hidaka

Aim of the research

Developing basic technologies for energy-efficient, high-performance computers, e.g., a 10 TFlops desk-side computer, using superconducting single-flux-quantum (SFQ) circuits and adopting a processor architecture called "large-scale reconfigurable data-paths (RDPs)."

[Figure: two routes to a 10 TFlops computer. By technologies at 2006: 90nm CMOS, parallel computer. This research (developing basic technologies): SFQ (0.5~0.35 µm), LSRDP.]

Our approach

• Reduction of power consumption of conventional circuit technology
• Development of technologies for realizing a high-performance computer using a new low-power circuit technology: SFQ

Backgrounds

Superconducting Single-Flux-Quantum Circuits: Ultra Low-Power, Ultra High-Speed

Φ0 = h/2e = 2.07 mV·ps

[Figure: an SFQ in a superconductive loop with a Josephson junction, and the SFQ pulse it produces (height ~1 mV, width 2~3 ps).]
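The value Φ0 = h/2e quoted above can be checked numerically. The sketch below uses the exact SI values of h and e; the function names are illustrative, and the rectangular pulse shape is a simplifying assumption (a real SFQ pulse is not rectangular).

```python
# Exact SI values of the Planck constant and the elementary charge.
PLANCK_H = 6.62607015e-34            # J*s
ELEMENTARY_CHARGE = 1.602176634e-19  # C


def flux_quantum_wb():
    """Magnetic flux quantum h/(2e) in webers (V*s)."""
    return PLANCK_H / (2.0 * ELEMENTARY_CHARGE)


def flux_quantum_mv_ps():
    """Flux quantum in mV*ps (1 V*s = 1e3 mV * 1e12 ps)."""
    return flux_quantum_wb() * 1e3 * 1e12


def pulse_width_ps(height_mv):
    """Width of an idealized rectangular pulse whose area equals one flux quantum."""
    return flux_quantum_mv_ps() / height_mv


print(f"Phi0 = {flux_quantum_mv_ps():.2f} mV*ps")               # about 2.07
print(f"width at 1 mV height = {pulse_width_ps(1.0):.2f} ps")   # about 2.07
```

A pulse about 1 mV high therefore lasts about 2 ps, in line with the 2~3 ps shown on the slide.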

SFQ technologies at 2006

• Conventional Nb 4-layer 2µm fabrication process
  – Cell-based design, logic cell library
  – Automatic routing by Josephson transmission line
  – SFQ-LSIs with more than 10,000 JJs
• Development of a new 1µm fabrication process
  – Nb 6 layers
  – No design environment
• Development of passive transmission line (PTL) technology
  – High-speed inner-chip data transfer

[Figure: an SFQ pulse travelling along a superconducting micro-strip line.]

Reconfigurable Data-Path (RDP) processor

• Reconfigurable data-path
  – A large number of floating-point units (FPUs)
  – Reconfigurable operand routing networks (ORNs)
  – Dynamic reconfiguration
• Features
  – The data-path is reconfigured by routing the ORNs to fit the processing of a loop in large-scale numerical computation
  – Parallel and pipelined processing
  – Burst input/output data is transferred from/to memory

[Figure: block diagram of the RDP processor. Rows of FPUs (processing elements, PEs) alternate with operand routing networks (ORNs). The other blocks shown are the streaming memory access controller (SMAC), the I/O port, the general-purpose processor (GPP) and the main memory.]
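The idea of configuring a data-path once for a loop body and then streaming data through it can be illustrated with a toy software model. This is only a sketch: the configuration format, `make_datapath`, and the example loop body are hypothetical and are not taken from the RDP design itself.

```python
import operator


def make_datapath(rows):
    """Build a data-path from a configuration.

    rows: list of rows of functional units. Each unit is (op, src_a, src_b),
    where src_a and src_b index the outputs of the previous row. These indices
    play the role of the ORN routing in this toy model.
    """
    def run(inputs):
        values = list(inputs)
        for row in rows:
            values = [op(values[a], values[b]) for op, a, b in row]
        return values
    return run


# Configure the data-path once for the loop body y = (a + b) * (c - d).
config = [
    [(operator.add, 0, 1), (operator.sub, 2, 3)],  # first row of units
    [(operator.mul, 0, 1)],                        # second row of units
]
datapath = make_datapath(config)

# Stream loop iterations through the configured data-path.
stream = [(1.0, 2.0, 5.0, 3.0), (0.5, 0.5, 4.0, 1.0)]
results = [datapath(x)[0] for x in stream]
print(results)  # [6.0, 3.0]
```

In hardware the rows would operate in a pipelined fashion on successive iterations; the model above only captures the routing and the per-iteration arithmetic.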

Research subjects

1. SFQ fabrication process and circuit design environments
   (1) Nb multi-layer 1µm fabrication process (Nagasawa G.)
   (2) Logic cell library for the 1µm process (Yoshikawa G. and Akaike G.)
   (3) CAD for SFQ digital circuit design (Takagi G.)

2. SFQ-FPUs and SFQ-RDP prototypes
   (1) SFQ-FPUs (Yoshikawa G. and Takagi G.)
       Half-precision FPA and FPM operating at 25GHz (2µm process)
       FPA and FPM operating at 50GHz (1µm process)
   (2) SFQ-RDP prototypes (ALU+ORN) (Akaike G.)
       2x2 SFQ-RDP operating at 25GHz (2µm process)
       4x4 SFQ-RDP operating at 50GHz (1µm process)

3. RDP architecture (Murakami G.)
   RDP architecture, RDP compiler, RDP-oriented algorithms


Results of the research

1. Fabrication process and design environment

Development of a Nb 9-layer 1µm fabrication process

[Figure: device cross-section on a Si substrate, annotated with Nb and SiO2 layer thicknesses (150 nm to 400 nm). Nb layers: M1 (DCP), M2 (GND1), M3 (PTL1), M4 (GND2), M5 (PTL2), M6 (GND3), M7 (GP), M8 (BAS), M9 (COU). Other labels: contacts C1 to C6, GC, BC, RC, JC; resistor RES1; AlOx; JJ; complemented planarization layer. Layer roles marked on the figure: DC power layer, 1st PTL layer, 2nd PTL layer, main ground plane, active layers including junctions and resistors.]

Nb layers for M1-M7 are planarized.

[Figure: cross-sectional SEM photograph.] Excellent flatness was obtained even though the step edges of several underlying patterns overlap.

Shift registers for evaluation of the Nb 9-layer 1µm process

Design: 16 circuits, 68,990 JJs in total
• Two 16-bit shift registers
• Two 64-bit shift registers
• Two 160-bit shift registers
• Two 640-bit shift registers
• Four 1280-bit shift registers
• Four 2560-bit shift registers

Chip size: 8.5mm x 7.0mm

Measurement results (best chip)
• Only three defects
• Correct operation of 13 of 16 circuits
• Correct operation of all 2560-bit shift registers with 10,281 JJs

[Figure: chip layout showing the placement of the 16 shift registers (SR).]
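The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below assumes that the 10,281 JJs refer to a single 2560-bit shift register, which the slide does not state explicitly; the variable names are illustrative.

```python
# Shift register lengths (bits) mapped to the number of instances on the chip.
design = {16: 2, 64: 2, 160: 2, 640: 2, 1280: 4, 2560: 4}

TOTAL_JJS = 68_990       # whole chip, from the slide
JJS_PER_2560_SR = 10_281 # assumed to be the count for one 2560-bit register

circuits = sum(design.values())
total_bits = sum(bits * count for bits, count in design.items())

jj_per_bit_chip = TOTAL_JJS / total_bits
jj_per_bit_2560 = JJS_PER_2560_SR / 2560

print(circuits)                    # 16
print(total_bits)                  # 17120
print(round(jj_per_bit_chip, 2))   # 4.03
print(round(jj_per_bit_2560, 2))   # 4.02
```

Both estimates come out at about 4 JJs per stored bit, so the per-register and whole-chip numbers are consistent under that assumption.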

Development of a logic cell library for the 9-layer process

[Figure: basic structure of a logic cell (30 µm x 30 µm), showing the PTL1 and PTL2 layers, the bias pillar and the ground contact.]

[Figure: a microphotograph of a dffc2 cell (30 µm scale).]

A 4x4 switch by the 9-layer 1µm process and the cell library

• Operation up to 112 GHz (world's highest)
• Total power consumption: 660 µW
• Total number of JJs: 3362
• Number of vias: 434

[Figure: chip layout with the upper PTL, lower PTL and via holes marked.]
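For a rough sense of scale, the quoted figures can be turned into per-junction and per-cycle numbers. This is a back-of-the-envelope sketch that assumes the 660 µW figure also applies when the switch runs at its maximum clock frequency of 112 GHz; the slide does not say at which frequency the power was evaluated.

```python
POWER_W = 660e-6      # total power consumption from the slide
NUM_JJ = 3362         # total number of Josephson junctions
MAX_FREQ_HZ = 112e9   # maximum demonstrated operating frequency

# Average power attributed to one junction.
power_per_jj_w = POWER_W / NUM_JJ

# Energy dissipated by the whole switch during one clock cycle (assumption:
# the quoted power holds at the maximum frequency).
energy_per_cycle_j = POWER_W / MAX_FREQ_HZ

# The same energy divided evenly over all junctions.
energy_per_jj_per_cycle_j = energy_per_cycle_j / NUM_JJ

print(f"{power_per_jj_w * 1e6:.3f} uW per JJ")
print(f"{energy_per_cycle_j:.2e} J per clock cycle")
print(f"{energy_per_jj_per_cycle_j:.2e} J per JJ per clock cycle")
```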

Area reduction by 81% compared to the conventional Nb 4-layer 2µm fabrication process

• Circuit area ratio of a functional block, conventional Nb 4-layer 2µm process : new Nb 9-layer 1µm process = 1 : 0.19 (81% reduction)
• Cell size: 40µm x 40µm → 30µm x 30µm

[Figure: layouts of a functional block in each of the two processes.]
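The two numbers on this slide describe different things, which a short calculation makes visible. The sketch below only does the arithmetic; it makes no claim about where the remaining reduction comes from, since the slide does not break it down.

```python
def cell_area_ratio(old_side_um, new_side_um):
    """Area ratio of a square cell after shrinking its side length."""
    return (new_side_um / old_side_um) ** 2


CIRCUIT_AREA_RATIO = 0.19  # functional block, from the slide

cell_ratio = cell_area_ratio(40.0, 30.0)
remaining_factor = CIRCUIT_AREA_RATIO / cell_ratio

print(cell_ratio)                  # 0.5625
print(round(remaining_factor, 3))  # 0.338
```

The cell shrink alone gives a ratio of 0.5625, so roughly a further factor of 0.34 in the functional-block area is due to something other than the cell size.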

Device density and operating frequency in LSIs

[Figure: device density (Trs/cm2, 1 to 10^12) versus frequency (0.01 to 1000 GHz) for Si MOSFET, Si bipolar, SiGe HBT, GaAs MESFET, GaAs HEMT, GaAs HBT, InP HEMT and InP HBT. Marked limits: limit from long interconnect delay, limit from power density for CMOS, limit from power density for compound semiconductors. SFQ points: "Present SFQ in USA / Previous SFQ in Japan" and the circuits demonstrated in CREST (4x4 SW, 2x2 RDP).]

SFQ LSIs developed in this project have reached a region that semiconductors cannot reach.

Energy consumption for a device used in LSIs

[Figure: energy consumption (J, 10^-23 to 10^-13) versus clock period (ps, 1 to 10^4). Reference lines: thermal energy at 4 K and thermal energy at 350 K. Plotted: present CMOS, SFQ in Japan before 2005, present SFQ in USA, and the circuits demonstrated in CREST (2x2 RDP, FPM/FPA, 4x4 SW). Annotation: "Primitives will be demonstrated in CREST".]
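The two thermal-energy reference lines on this chart can be reproduced under the assumption that "thermal energy" means k_B·T; the slide does not define it. The function name is illustrative.

```python
BOLTZMANN_K = 1.380649e-23  # J/K, exact SI value


def thermal_energy_j(temperature_k):
    """Thermal energy k_B * T in joules."""
    return BOLTZMANN_K * temperature_k


kt_4k = thermal_energy_j(4.0)      # liquid-helium temperature
kt_350k = thermal_energy_j(350.0)  # typical operating temperature of a warm chip

print(f"k_B*T at 4 K   = {kt_4k:.2e} J")    # about 5.52e-23
print(f"k_B*T at 350 K = {kt_350k:.2e} J")  # about 4.83e-21
```

Both values fall inside the 10^-23 to 10^-13 J range of the chart's vertical axis.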

Design flow of SFQ LSIs

[Figure: flow chart of the design flow.]

Main flow: Design Entry → Logic Synthesis → Logic Netlist → Placement → Placed Cells & Connections → Routing → Mask Layout → Layout Viewer

Inputs and supporting tools: Specification & Constraints, Technology Library, P&R Library, Cell & Wire Geometry, Constraints & Violations, Timing Verification, Logic Simulator, Static Timing Analyzer

Specific synthesis subsystems for SFQ logic circuits
• Sequential circuit synthesis
• Clock scheduling and distribution
• Asynchronous logic synthesis

Annotations on the flow: layout-driven design, precise timing analysis, verification of pipeline operations, process unique to SFQ circuits

Development of design tools

• Designed a sample circuit: 8-bit carry lookahead adder (CLA)
  – 158 gates, 9 levels
  – Concurrent-flow clocking
  – 7092 JJs, 598 PTLs
• Steps: clock tree synthesis, semi-automatic placement, automatic routing
• Verified correct operations
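As a reminder of what the sample circuit computes, here is a bit-level software model of an 8-bit carry lookahead adder. It is a generic textbook formulation with fully expanded carry equations, not the 158-gate, 9-level SFQ netlist described on the slide; `cla_add` is an illustrative name.

```python
def cla_add(a, b, carry_in=0, width=8):
    """Add two unsigned integers using generate/propagate carry lookahead.

    Returns (sum modulo 2**width, carry_out).
    """
    # Per-bit generate (g) and propagate (p) signals.
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(width)]
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(width)]

    carries = [carry_in]
    for i in range(width):
        # c[i+1] = g[i] | p[i]g[i-1] | ... | p[i]...p[0]c[0],
        # computed directly from g, p and carry_in (no ripple dependency).
        carry = 0
        prefix = 1
        for j in range(i, -1, -1):
            carry |= prefix & g[j]
            prefix &= p[j]
        carry |= prefix & carry_in
        carries.append(carry)

    total = 0
    for i in range(width):
        total |= (p[i] ^ carries[i]) << i
    return total, carries[width]


print(cla_add(200, 100))  # (44, 1), since 300 = 256 + 44
```

Each carry depends only on the generate and propagate signals and the carry-in, which is what allows a hardware implementation to evaluate the carries in a small number of logic levels.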

2. SFQ-FPUs and SFQ-RDP prototypes

Half-precision FPA and FPM using the 2µm process

FPA
• Operating frequency: 20GHz
• Performance: 1.67 GFLOPS
• Number of junctions: 10244 JJs
• Power consumption: 3.5 mW
• Circuit area: 5.86 x 5.72 mm2

[Figure: FPA chip micrograph (scale bar 1mm). Blocks: Shifter of A, Shifter of B, Controller, Adder & Subtractor, Normalizer (two blocks), Clock Generator, Shift Register of Significands, Shift Register of Exponent and Sign, Shift Register for Confirmation.]

FPM
• Operating frequency: 32GHz
• Performance: 2.6 GFLOPS
• Number of junctions: 11044 JJs
• Power consumption: 3.5 mW
• Circuit area: 6.22 x 3.78 mm2

[Figure: FPM chip micrograph (scale bar 1mm). Blocks: Multiplier, Normalizer (two blocks), Shift Register (three blocks), Clock Generator.]
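Two derived quantities follow directly from the figures quoted for the half-precision FPA and FPM: the number of clock cycles per floating-point result and the throughput per watt. The sketch below is plain arithmetic on the slide's numbers; the per-watt figure counts chip power only and assumes nothing about cooling overhead, which the slide does not address.

```python
fpus = {
    "FPA": {"freq_ghz": 20.0, "gflops": 1.67, "power_mw": 3.5},
    "FPM": {"freq_ghz": 32.0, "gflops": 2.6, "power_mw": 3.5},
}


def cycles_per_op(freq_ghz, gflops):
    """Clock cycles spent per floating-point result."""
    return freq_ghz / gflops


def gflops_per_watt(gflops, power_mw):
    """Throughput per watt of chip power (cooling not included)."""
    return gflops / (power_mw * 1e-3)


for name, u in fpus.items():
    print(name,
          round(cycles_per_op(u["freq_ghz"], u["gflops"]), 1),
          round(gflops_per_watt(u["gflops"], u["power_mw"])))
```

Both units come out at roughly 12 clock cycles per result.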

FPA and FPM using the 1µm process

10-bit bit-serial FPM
• Circuit area: 7.58 mm2 (3.510 mm x 2.16 mm)
• Junction count: 6157 JJs

[Figure: micrograph of the 10-bit bit-serial FPM. Blocks: Clock Generator, Significand Processing Circuit, Exponent Processing Circuit, Normalizer, Shift Register for Input, Shift Register for Output.]

[Figure: block diagram of the bit-serial FPM. Labels: operation circuit for significand part, operation circuit for exponent part, systolic array multiplier.]

[Figure: 50-GHz test results of a 4b multiplier, comparing measurement and simulation; annotated "9%".]

2x3 SFQ-RDP prototype using the 2µm process*

• 6 ALUs
• Clock frequency: 23 GHz
• Junction count: 14040 (world's largest integration scale)
• Circuit area: 6.84 x 6.72 mm2

CONNECT, in cooperation with SRL, NiCT, NU & YNU

*SRL Nb 2.5 kA/cm2 standard process

2x2 SFQ-RDP prototype using the 1µm process

• Clock frequency: 45GHz
• Power dissipation: 3.4mW
• Junction count: 11458
• Circuit area: 5.61 x 2.82 mm2

[Figure: block diagram with inputs IN1-IN4, outputs OUT1-OUT4, four ALUs (ALU1: SUB, ALU2: XOR, ALU3: ADD, ALU4: AND) and four transfer units (TU).]

[Figure: chip micrograph (scale bar 1mm) with ORN1, ORN2, ALU1a, ALU1b, ALU2a and ALU2b marked.]

TU: Transfer Unit

ORN architecture and a crossbar switch for 2-bit wide data streams

ORN size: number of rows = 1.5×M, number of columns = 4×MCL+1
M: number of FPUs in an array, MCL: maximum connection length

ORN with MCL = 2
• Junction count: 547
• Clock frequency: 65GHz
• Power consumption: 0.14mW

                        M     MCL   JJ count   Power dissipation (mW)
RDP-M (middle-scale)    24    5     307692     7.7
RDP-L (large-scale)     48    5     615384     15

[Figure: ORN structure and the operating region of the crossbar switch.]
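The ORN sizing formulas can be applied to the two table rows. In the sketch below, treating rows × columns as a count of switch "crosspoints" and dividing the JJ count by it is an interpretation, not something stated on the slide; the function and variable names are illustrative.

```python
def orn_shape(m, mcl):
    """ORN dimensions from the slide: rows = 1.5*M, columns = 4*MCL + 1."""
    rows = (3 * m) // 2   # 1.5 * M, exact for even M
    cols = 4 * mcl + 1
    return rows, cols


# (M, MCL, JJ count) taken from the table on the slide.
table = {"RDP-M": (24, 5, 307692), "RDP-L": (48, 5, 615384)}

jj_per_crosspoint = {}
for name, (m, mcl, jj) in table.items():
    rows, cols = orn_shape(m, mcl)
    # Assumption: every row/column intersection costs the same number of JJs.
    jj_per_crosspoint[name] = jj / (rows * cols)
    print(name, rows, cols, jj_per_crosspoint[name])
# RDP-M 36 21 407.0
# RDP-L 72 21 407.0
```

Both table entries divide evenly into 407 JJs per row/column intersection, which suggests the JJ counts were obtained by scaling a fixed per-intersection cost.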

3. Development of RDP Architecture

Micro-architecture
• Two types of FPUs: FPA and FPM
• FPU: three inputs (A, B, C) → three outputs (A(*B), B, C)
• Arrangement of FPUs: alternate
• Three types (scales) of RDP: small, medium and large scale

[Figure: a PE at (i, j), drawn as an FPU with transfer units (TUs), connected through the ORN to PEs (i, j+1), (i+1, j+1), (i+2, j+1), ..., (i+L, j+1); MCL = L.]

TU: data through

RDP parameters (optimized by total number of JJs)

          # Input   # Output   Width   Height   MCL   Total JJs (∝ RDP size)
RDP-S     19        12         22      14       4     19387K
RDP-M     19        12         24      17       5     27027K
RDP-L     38        24         41      34       6     96374K

Development of RDP Compiler

[Figure: flow chart of the RDP compiler with two flows, marked 1 and 2.]

Common first step: application C code → modifying application code (manual: inserting LSRDP instructions in the code) → modified code

Flow 1: generation of assembly code for the GPP
Modified code → ISAcc or COINS compiler, with the RDP library file (function definitions & declarations) → .asm code for the MIPS-based GPP

Flow 2: generation of the configuration bit-stream for the RDP
Modified code → DFG extraction (semi-manual) → data flow graphs → placement and routing tool, with the RDP architecture description → configuration file + various text and schematic reports

Simulator → performance evaluation

Development of RDP-Oriented Algorithms
• One-dimensional heat and vibrational equations
• Two-dimensional heat and FDTD equations
• Two-electron repulsion integral calculation in quantum chemistry
• Runge-Kutta calculation for ordinary differential equations

Performance Estimation
• Two-dimensional heat equation (1024x1024 mesh)
• SFQ-RDP: 50.6 GFlops vs. GPU 1): 63.0 GFlops
• Estimation method
  – RDP: execution time model; the DFG has 21 inputs, 9 outputs, and 63 operations; BW: 159.0GB/s
  – GPP: cycle-accurate processor simulator

1) T. Aoki and A. Nukada, "CUDA programming: beginners," Kougakusya, ISBN-10: 4777514773, 2009 (in Japanese).
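To show the kind of loop body that such a data-path would be configured for, here is a generic explicit time step of the two-dimensional heat equation using a 5-point stencil. It is a minimal sketch under stated assumptions: fixed boundary values, a square mesh, and a textbook update rule. It is not the 63-operation DFG used in the estimate above, and `heat_step` and `alpha` are illustrative names.

```python
def heat_step(u, alpha):
    """One explicit time step of the 2D heat equation.

    u:     mesh as a list of rows; boundary values are kept fixed.
    alpha: D * dt / dx**2 (the explicit scheme is stable for alpha <= 0.25).
    """
    n, m = len(u), len(u[0])
    new = [row[:] for row in u]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            # 5-point stencil: four neighbours minus four times the centre.
            laplacian = (u[i - 1][j] + u[i + 1][j] +
                         u[i][j - 1] + u[i][j + 1] - 4.0 * u[i][j])
            new[i][j] = u[i][j] + alpha * laplacian
    return new


# A single hot point in the middle of a 5x5 mesh spreads to its neighbours.
mesh = [[0.0] * 5 for _ in range(5)]
mesh[2][2] = 1.0
after = heat_step(mesh, 0.25)
print(after[2][2], after[1][2], after[2][1])  # 0.0 0.25 0.25
```

The inner statement is the same small arithmetic expression for every mesh point, which is the property that makes such loops candidates for mapping onto a reconfigured, pipelined data-path.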

Summary

1. SFQ fabrication process and circuit design environments
   (1) We have established a Nb 9-layer 1µm fabrication process.
   (2) We have developed a logic cell library for the 1µm process.
   (3) We have developed several CAD tools for the 1µm process.

   It is possible to design and fabricate large-scale SFQ circuits.

2. SFQ-FPUs and SFQ-RDP prototypes
   (1) We have fabricated a half-precision FPA and FPM by the 2µm process, which have operated at 20GHz and 35GHz, respectively. We are designing an FPA and FPM by the 1µm process.
   (2) We have fabricated a 2x3 SFQ-RDP prototype by the 2µm process, which has operated at 23GHz. We have fabricated a 2x2 SFQ-RDP prototype by the 1µm process, which has operated at 45GHz. We are developing a 4x4 SFQ-RDP prototype by the 1µm process.

   It is possible to realize SFQ-RDPs.

3. RDP architecture
   (1) We have determined architectural specifications of SFQ-RDP.
   (2) We have developed a compiler for RDP.
   (3) We have developed RDP-oriented algorithms for several applications and have estimated the performance of SFQ-RDP.

   SFQ-RDP is effective.

We will be able to develop energy-efficient, high-performance computers using SFQ circuits in the near future.