Computer Architecture for the Next Millennium
November 1, 1999

William J. Dally
Computer Systems Laboratory
Stanford University
[email protected]
Computer Architecture for the Next Millennium - WJD - November 1, 1999
Outline

• The Stanford Concurrent VLSI Architecture Group
• Forces acting on computer architecture
  – applications (media)
  – technology (wire-limited)
  – techniques (explicit parallelism)
• Example: register organization
  – distributed register files
• Imagine: a stream processor
  – 20 GFLOPS on a 0.5cm² chip
• Tremendous opportunities and challenges for computer architecture in the next millennium
  – it's not a mature field yet
The Concurrent VLSI Architecture Group

• Architecture and design technology for VLSI
• Routing chips
  – Torus Routing Chip, Network Design Frame, Reliable Router
  – Basis for Intel, Cray/SGI, Mercury, Avici network chips
Parallel computer systems

• J-Machine (MDP) led to Cray T3D/T3E
• M-Machine (MAP)
  – Fast messaging, scalable processing nodes, scalable memory architecture

[Figure: MDP chip, J-Machine, Cray T3D, and MAP chip]
Design technology

• Off-chip I/O
  – Simultaneous bidirectional signaling, 1989
    • now used by Intel and Hitachi
  – High-speed signaling
    • 4Gb/s in 0.6µm CMOS, equalization, 1995
• On-chip signaling
  – Low-voltage on-chip signaling
  – Low-skew clock distribution
• Synchronization
  – Mesochronous, plesiochronous
  – Self-timed design

[Figure: eye diagram of 4Gb/s CMOS I/O, 250ps/division]
What is Computer Architecture?

[Figure: the computer architect mediates between Technology and Applications, producing Interfaces (ISA, API, Link, I/O Channel), Machine Organization (Regs, IR), and Measurement & Evaluation]
Forces Acting on Architecture

• Applications - shifting towards media applications dealing with streams of low-precision samples
  – video, graphics, audio, DSL modems, cellular base stations
• Technology - becoming wire-limited
  – power and delay dominated by communication, not arithmetic
  – global structures: register files and instruction issue don't scale
• Technique - microarchitecture - ILP has been mined out
  – to the point of diminishing returns on squeezing performance from sequential code
  – explicit parallelism (data parallelism and thread-level parallelism) required to continue scaling performance
Applications

• Little locality of reference
  – read each pixel once
  – often non-unit stride
  – but there is producer-consumer locality
• Very high arithmetic intensity
  – 100s of arithmetic operations per memory reference
• Dominated by low-precision (16-bit) integer operations
Wires Are Becoming Like Wet Noodles

[Figure: signal waveforms along a minimum-width wire in a 0.35µm process at lengths of 0.0mm, 2.5mm, 5.0mm, 7.5mm, and 10.0mm]
Technology scaling makes communication the scarce resource

                 1999                     2008
Process          0.18µm                   0.07µm
DRAM             256Mb                    4Gb
Processors       16 64b FP @ 500MHz       256 64b FP @ 2.5GHz
Chip edge        18mm                     25mm
Wire tracks      30,000                   120,000
Chip crossing    1 clock                  16 clocks
Repeaters        every 3mm                every 0.5mm
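The numbers above imply how far a signal can travel per clock in each generation; a small sketch, using only the figures from this slide, makes the trend concrete:

```python
# Distance a global wire can cover per clock, from the slide's numbers:
# 1999: an 18mm chip crossed in 1 clock; 2008: a 25mm chip in 16 clocks.
generations = {
    "1999 (0.18um, 500MHz)": (18.0, 1),    # (chip edge in mm, clocks to cross)
    "2008 (0.07um, 2.5GHz)": (25.0, 16),
}

reach = {}
for gen, (chip_mm, clocks) in generations.items():
    reach[gen] = chip_mm / clocks          # mm reachable in one clock
    print(f"{gen}: {reach[gen]:.2f} mm per clock")
```

The reachable distance per clock drops by more than an order of magnitude across the two generations, which is why architectures built around global communication stop scaling.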
Care and Feeding of ALUs

[Figure: the 'feeding' structure dwarfs the ALU - instruction bandwidth (instruction cache, IR, IP) and data bandwidth (registers) surround a comparatively small ALU]
What Does This Say About Architecture?

• Tremendous opportunities
  – Media problems have lots of parallelism and locality
  – VLSI technology enables 100s of ALUs per chip (1000s soon)
    • (in 0.18µm: 0.1mm² per integer adder, 0.5mm² per FP adder)
• Challenging problems
  – Locality - global structures won't work
  – Explicit parallelism - ILP won't keep 100 ALUs busy
  – Memory - streaming applications don't cache well
• It's time to try some new approaches
Example: Register File Organization

• Register files serve two functions:
  – Short-term storage for intermediate results
  – Communication between multiple function units
• Global register files don't scale well as N, the number of ALUs, increases
  – Need more registers to hold more results (grows with N)
  – Need more ports to connect all of the units (grows with N²)
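A common first-order model makes the scaling explicit (this is a generic sketch, not the area model used in the talk): each ALU needs a fixed number of ports into a shared file, register-cell area grows with the square of the port count, and the register count grows with N, so a central file's area grows roughly as N³ while per-ALU distributed files grow only linearly in N:

```python
def central_rf_area(n_alus, regs_per_alu=16):
    """First-order area of one shared register file: cell area ~ ports^2."""
    ports = 3 * n_alus               # e.g., 2 reads + 1 write per ALU
    regs = regs_per_alu * n_alus     # storage grows with N
    return regs * ports ** 2         # area in arbitrary wire-grid units

def distributed_rf_area(n_alus, regs_per_alu=16, ports_per_file=4):
    """N small per-ALU files, each with a fixed port count."""
    return n_alus * regs_per_alu * ports_per_file ** 2

for n in (4, 16, 64):
    ratio = central_rf_area(n) / distributed_rf_area(n)
    print(f"N={n:3d}: central/distributed area ratio = {ratio:.0f}x")
```

With ports proportional to N, the central file's disadvantage grows quadratically in N, which is the motivation for the distributed and clustered organizations on the following slides.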
Register Cells are Mostly Switch

[Figure: register-cell layouts - bit lines, word lines, Vdd, and Gnd wiring dominate the cell area, leaving only about one wire grid for the storage itself]
Register Architecture for 'wide' Processors

[Figure: four register organizations - (A) a central register file shared by N arithmetic units, (B) distributed register files among N arithmetic units, (C) C SIMD clusters of N/C arithmetic units each, and (D) C SIMD clusters with distributed register files]
Area of Register Organizations
Central
SIMD
DRF
SIMD/DRF
0.1
1
10
100
1000
1 10 100 1000
Number of Arithmetic Units
Delay of Register Organizations

[Figure: log-log plot of register file delay (0.1 to 1000) versus number of arithmetic units (1 to 1000) for the Central, DRF, SIMD/DRF, and SIMD organizations]
Performance of Register Organizations

[Figure: bar charts of normalized performance (0.0 to 1.2) for the Central, SIMD, SIMD/DRF, Hierarchical, and Stream organizations - (A) raw performance and (B) performance with latency]
Stubs Abstract the Communication Between Operations

[Figure: Op 1 executes on FU(Op 1) and Op 2 on FU(Op 2), each with its own register file; a write stub, a data transfer, and a read stub carry the result from one to the other]
A Communication Example

[Figure: three alternatives (a), (b), and (c) for communicating the intermediate value t1 between an adder (+ k), a multiplier (t1 * t2 → x), and a load/store (L/S) unit across instructions 5 and 6]
The Imagine Stream Processor

[Figure: Imagine block diagram - a host processor attaches through a host interface and a network through a network interface; a stream register file feeds eight ALU clusters (0-7) under a microcontroller; a streaming memory system connects to four SDRAMs]
Data Bandwidth Hierarchy

[Figure: four SDRAMs at 3.2GB/s feed the stream register file at 64GB/s, which feeds the ALU clusters at 544GB/s]
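The three levels are deliberately spaced roughly an order of magnitude apart; the ratios follow directly from the slide's numbers:

```python
# Bandwidth at each level of Imagine's hierarchy (GB/s), from the slide.
levels = {"SDRAM": 3.2, "stream register file": 64.0, "ALU clusters": 544.0}

names = list(levels)
for lower, upper in zip(names, names[1:]):
    ratio = levels[upper] / levels[lower]
    print(f"{upper} / {lower}: {ratio:.1f}x")
```

Each level supplies far more bandwidth than the one below it, so kernels that keep their traffic at the local and stream levels rarely wait on memory.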
Cluster Architecture

• VLIW organization with shared control
• Local register files provide high data bandwidth

[Figure: one ALU cluster - three adders, two multipliers, and a divider, each fed by local register files through a cross-point switch, connected to the stream buffers and the intercluster network under a control unit (CU)]
Imagine is a Stream Processor

• Instructions are Load, Store, and Operate
  – operands are streams
  – also Send and Receive for multiple-Imagine systems
• Operate performs a compound stream operation
  – read elements from input streams
  – perform a local computation
  – append elements to output streams
  – repeat until the input stream is consumed
  – (e.g., triangle transform)
• Order of magnitude less global register bandwidth than a vector processor
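In software terms, a compound stream operation behaves like a function mapped over whole streams: all intermediates stay local to the kernel, and only the input and output streams cross the stream register file. A minimal Python sketch of the pattern (the kernel body here is a made-up placeholder, not Imagine's actual triangle transform):

```python
def stream_op(kernel, *input_streams):
    """Apply a compound kernel element-by-element until input is consumed.

    All intermediate values live inside `kernel` (the 'local' level);
    only the input and output streams touch the global (stream) level.
    """
    output = []
    for elements in zip(*input_streams):
        output.append(kernel(*elements))   # local computation per element
    return output

# Hypothetical kernel: scale-and-bias each sample, entirely locally.
def scale_bias(x):
    return 2 * x + 1

result = stream_op(scale_bias, [1, 2, 3, 4])
print(result)   # [3, 5, 7, 9]
```

Because the kernel touches the stream level only once per input and output element, global register traffic scales with stream length rather than with the number of operations, which is the source of the order-of-magnitude saving claimed above.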
Triangle Rendering

[Figure: a triangle rendering pipeline mapped onto Imagine - kernels (Transform, Project/Cull, Shade, Span Setup, Process Span, Sort, Compact, Z-Composite) run on the arithmetic clusters, passing streams of triangle records, shaded triangle records, projected triangle records, span records, fragment records, and pixel depth & color through the stream register file (register bandwidth), with the input data and image buffer in memory (memory bandwidth)]
Transform Kernel

Bandwidth demands, in references per ∆:

             Stream    Scalar        Vector
Memory       5.5       117 (21.3)    48 (8.7)
Global RF    48        624 (13.0)    261 (5.4)
Local RF     372       N/A           N/A
Data Parallelism is Easier than ILP

Kernel             1-to-8 Cluster Speedup
FFT (1024)         6.4
DCT (8x8)          7.8
Blockwarp (8x8)    7.2
Transform (∆)      8.0
Harmonic Mean      7.3
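The table's summary row can be checked directly; the harmonic mean is the conventional way to average speedups:

```python
# Reproduce the harmonic-mean speedup from the per-kernel values above.
speedups = {"FFT (1024)": 6.4, "DCT (8x8)": 7.8,
            "Blockwarp (8x8)": 7.2, "Transform": 8.0}

hmean = len(speedups) / sum(1.0 / s for s in speedups.values())
print(f"harmonic mean = {hmean:.1f}")   # matches the table's 7.3
```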
Conventional Approaches to Data-Dependent Conditional Execution

[Figure: three approaches to a data-dependent branch on x>0 selecting between blocks B and C, between blocks A and J/K - (a) branching control flow, (b) speculation, where a mispredicted branch ("whoops") costs a speculative loss of D x W, ~100s of issue slots, and (c) predication with y=(x>0), executing both the "if y" and "if ~y" paths with exponentially decreasing duty factor]
Zero-Cost Conditionals

• Most approaches to conditional operations are costly
  – Branching control flow - dead issue slots on mispredicted branches
  – Predication (SIMD select, masked vectors) - a large fraction of execution 'opportunities' go idle
• Conditional streams
  – append an element to an output stream depending on a case variable

[Figure: a value stream and a {0,1} case stream combine to produce a compacted output stream]
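Functionally, a conditional stream is a compaction: elements whose case bit is set are appended to the output, so downstream kernels see a dense, homogeneous stream instead of predicated-off lanes. A small sketch of the semantics (not Imagine's hardware interface):

```python
def conditional_append(value_stream, case_stream):
    """Append each value to the output only where its case bit is 1."""
    return [v for v, c in zip(value_stream, case_stream) if c == 1]

values = [10, 20, 30, 40, 50]
cases  = [1, 0, 1, 1, 0]      # one {0,1} case bit per element
print(conditional_append(values, cases))   # [10, 30, 40]
```

Every element of the compacted output then does useful work, avoiding both the mispredicted-branch penalty of (a)/(b) and the idle predicated slots of (c) on the previous slide.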
Sustainable Performance

[Figure: sustained GOPS (0 to 30) for the Transform, FFT, FIR, Blockwarp, Sort, Merge, DCT, and FIR kernels, with per-kernel totals ranging from roughly 0.8 to 24.4 GOPS, each bar broken down into communication, 16-bit arithmetic, 32-bit arithmetic, and FP arithmetic]
Power Comparison

[Figure: GOPS/W (0 to 6) for Imagine, AD 21160, TI C67x, and StrongARM SA-1100 on a 1024-point floating-point FFT, communications, and Dhrystone workloads; reported values include 3.7, 1.5, 1.2, 0.28, and 0.22 GOPS/W]

- Source: web pages of Intel, TI, and Analog Devices
Power and Performance
34%
45%
9%
9% 3%
00.5
11.5
22.5
33.5
W
FIR
FFT
BWAR
P
XFOR
M
MER
GE DCT
OtherMem SysSRFClustersClock
11.5 5.2 5.2 3.8 7.1 14.1GOPs/W:
A Look Inside an Application: Stereo Depth Extraction

• 320x240 8-bit grayscale images
• 30-disparity search
• 220 frames/second
• 12.7 GOPS
• 5.7 GOPS/W
[Figure: execution traces of the clusters and two memory units - the convolution phase interleaves UNPACK, CONV 7x7, and CONV 3x3 kernels on the clusters with row loads and stores in memory; the disparity-search phase interleaves BlockSAD kernels with loads and stores]

Stereo depth extractor:
• Convolutions: load original packed row → unpack (8-bit → 16-bit) → 7x7 convolve → 3x3 convolve → store convolved row
• Disparity search: load convolved rows → calculate block SADs at different disparities → store best disparity values
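The disparity-search step can be sketched in a few lines: for each candidate disparity, shift one image against the other and sum absolute differences (SAD) over a block, keeping the disparity with the smallest SAD. This is a hypothetical scalar rendition of the BlockSAD kernels, not the Imagine code:

```python
def best_disparity(left_row, right_row, x, block, max_disp):
    """Return the disparity with minimum block SAD at column x."""
    best_d, best_sad = 0, float("inf")
    for d in range(max_disp + 1):
        # Sum of absolute differences over the block, skipping
        # right-image columns that fall outside the row.
        sad = sum(abs(left_row[x + i] - right_row[x + i - d])
                  for i in range(block)
                  if 0 <= x + i - d < len(right_row))
        if sad < best_sad:
            best_d, best_sad = d, sad
    return best_d

# Tiny example: the right row is the left row shifted by 2 pixels.
left  = [0, 0, 9, 8, 7, 6, 0, 0, 0, 0]
right = [9, 8, 7, 6, 0, 0, 0, 0, 0, 0]
print(best_disparity(left, right, x=2, block=4, max_disp=3))   # 2
```

On Imagine, this inner loop maps naturally onto the clusters: each cluster evaluates SADs for different columns in parallel, with the convolved rows streaming in from the stream register file.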
7x7 Convolve Kernel

[Figure: VLIW schedule of the 7x7 convolve kernel - each row is a cycle, with operations (IMULRND16, IADDS16, SHUFFLE, SELECT, PASS, etc.) issued across the cluster's units (ADD0-ADD2, MUL0-MUL1, DIV0, input/output ports, scratchpad, communication unit, and microcontroller); nearly every multiplier and adder slot is filled]
Imagine Summary

• Imagine operates on streams of records
  – simplifies programming
  – exposes locality and concurrency
• Compound stream operations
  – perform a subroutine on each stream element
  – reduce global register bandwidth
• Bandwidth hierarchy
  – use bandwidth where it's inexpensive
  – distributed and hierarchical register organization
• Conditional stream operations
  – sort elements into homogeneous streams
  – avoid predication and speculation
Computer Architecture for the Next Millennium

• Applications and technology are changing
  – media applications process streams of low-precision samples
  – wires dominate gates
• ILP is at the point of diminishing returns
• Tremendous opportunities for new architectures
  – new applications have lots of parallelism and locality
  – modern technology can build chips with 100s of ALUs (32b FP), 1000s in the near future
• The challenge is to develop architectures
  – that can harness this potential performance
  – in a way that can be easily programmed
• Stream processing is one approach; there are many others. We need to start exploring them.