5/10/99 ©UCB Spring 1999 CS152 / Kubiatowicz
Lec25.1
CS 152 Computer Architecture and Engineering
Lecture 25
The Final Chapter: A whirlwind retrospective on the term
May 10, 1999
John Kubiatowicz (http.cs.berkeley.edu/~kubitron)
lecture slides: http://www-inst.eecs.berkeley.edu/~cs152/
Lec25.2
Outline of Today’s Lecture
° Recap: What was covered in lectures (45 minutes)
° Questions and Administrative Matters (2 minutes)
° Future of Computer Architecture and Engineering (15 minutes)
° Lessons from CS 152 (10 minutes)
° HKN evaluation of teaching staff (15 minutes)
Lec25.3
Where have we been?
° Arithmetic
° Single/multicycle Datapaths
° Pipelining
° Memory Systems
° I/O
[Figures: double-bit Booth multiplier datapath; overlapped IFetch/Dcd/Exec/Mem/WB pipeline stages; Processor-Memory Performance Gap chart, 1980-2000 (µProc 60%/yr, 2X/1.5 yr; DRAM 9%/yr, 2X/10 yrs; gap grows 50% / year)]
CS152 Spring ‘99
Lec25.4
The Big Picture
° Since 1946 all computers have had 5 components:
• Control
• Datapath (Control and Datapath together form the Processor)
• Memory
• Input
• Output
Lec25.5
What is “Computer Architecture”?
[Figure: levels of abstraction, top to bottom]
Application
Operating System
Compiler, Firmware
Instruction Set Architecture
Instr. Set Proc., I/O system
Datapath & Control
Digital Design
Circuit Design
Layout
• Coordination of many levels of abstraction
• Under a rapidly changing set of forces
• Design, Measurement, and Evaluation
Lec25.6
Performance and Technology Trends
[Figure: Performance (log scale, 0.1 to 1000) vs. Year (1965-2000) for Supercomputers, Mainframes, Minicomputers, and Microprocessors]
• Technology Power: 1.2 x 1.2 x 1.2 = 1.7x / year
– Feature Size: shrinks 10% / yr. => Switching speed improves 1.2x / yr.
– Density: improves 1.2x / yr.
– Die Area: 1.2x / yr.
• One lesson of RISC is to keep the ISA as simple as possible:
– Shorter design cycle => fully exploit the advancing technology (~3yr)
– Advanced branch prediction and pipeline techniques
– Bigger and more sophisticated on-chip caches
Lec25.7
Instruction Set Architecture (subset of Computer Arch.)
... the attributes of a [computing] system as seen by the programmer, i.e. the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation. – Amdahl, Blaauw, and Brooks, 1964
SOFTWARE
-- Organization of Programmable Storage
-- Data Types & Data Structures: Encodings & Representations
-- Instruction Set
-- Instruction Formats
-- Modes of Addressing and Accessing Data Items and Instructions
-- Exceptional Conditions
Lec25.8
Instruction Set Design
instruction set
software
hardware
Lec25.9
Summary of the Design Process
Hierarchical Design to manage complexity
Top Down vs. Bottom Up vs. Successive Refinement
Importance of Design Representations:
• Block Diagrams
• Decomposition into Bit Slices
• Truth Tables, K-Maps
• Circuit Diagrams
• Other Descriptions: state diagrams, timing diagrams, reg xfer, . . .
(top down and bottom up: the mux design meets at the truth table (TT))
Optimization Criteria:
• Gate Count
• [Package Count]
• Logic Levels
• Fan-in/Fan-out
• Power
• Area
• Delay
• Cost
• Design time
• Pin Out
Lec25.10
Measurement and Evaluation
Architecture is an iterative process -- searching the space of possible designs -- at all levels of computer systems
[Figure: design cycle. Creativity feeds Design; Cost/Performance Analysis sorts the results into Good Ideas, Mediocre Ideas, and Bad Ideas]
You must be willing to throw out Bad Ideas!!
Lec25.11
One of the most important aspects of design: TEST
• Think about testing from beginning of design
• Well over 50% of modern teams devoted to testing
• VHDL test benches: monitoring hardware to aid debugging:
– Include assert statements to check for “things that should never happen”
[Figure: Complete Top-Level Design. The Test Bench (inline vectors, assert statements, file I/O for either patterns or output diagnostics) drives the Device Under Test, which contains an Inline Monitor (output in readable format (disassembly), assert statements)]
Lec25.12
Why should you keep a design notebook?
• Keep track of the design decisions and the reasons behind them
– Otherwise, it will be hard to debug and/or refine the design
– Write it down so that you can remember in a long project: 2 weeks -> 2 yrs
– Others can review notebook to see what happened
• Record insights you have on certain aspects of the design as they come up
• Record of the different design & debug experiments
– Memory can fail when very tired
• Industry practice: learn from others’ mistakes
Lec25.13
Basis of Evaluation

Actual Target Workload
  Pros: • representative
  Cons: • very specific • non-portable • difficult to run, or measure • hard to identify cause
Full Application Benchmarks
  Pros: • portable • widely used • improvements useful in reality
  Cons: • less representative
Small “Kernel” Benchmarks
  Pros: • easy to run, early in design cycle
  Cons: • easy to “fool”
Microbenchmarks
  Pros: • identify peak capability and potential bottlenecks
  Cons: • “peak” may be a long way from application performance
Lec25.14
Amdahl's Law
Speedup due to enhancement E:
$$\text{Speedup}(E) = \frac{\text{ExTime w/o } E}{\text{ExTime w/ } E} = \frac{\text{Performance w/ } E}{\text{Performance w/o } E}$$
Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then:
$$\text{ExTime(with } E) = \left((1-F) + \frac{F}{S}\right) \times \text{ExTime(without } E)$$
$$\text{Speedup(with } E) = \frac{1}{(1-F) + F/S}$$
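The speedup formula can be checked numerically. The following is a minimal sketch, not part of the original slides; the function name and the sample values of F and S are illustrative assumptions.

```python
def amdahl_speedup(fraction_enhanced: float, factor: float) -> float:
    """Overall speedup when a fraction F of the task runs S times faster."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("F must be between 0 and 1")
    if factor <= 0:
        raise ValueError("S must be positive")
    # Speedup = 1 / ((1 - F) + F / S)
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / factor)


if __name__ == "__main__":
    # Hypothetical case: 80% of the task is accelerated 4x.
    print(amdahl_speedup(0.8, 4.0))
    # Even an enormous S cannot beat the 1 / (1 - F) limit (5x here).
    print(amdahl_speedup(0.8, 1e9))
```

The second call illustrates why the unimproved part of the program bounds the achievable speedup.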
Lec25.15
Performance Evaluation Summary
$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$
° Time is the measure of computer performance!
° Remember Amdahl’s Law: Speedup is limited by unimproved part of program
° Good products are created when you have:
• Good benchmarks
• Good ways to summarize performance
° If you do NOT have good benchmarks and a good summary, then the choice is between 1) improving the product for real programs and 2) changing the product to get more sales (sales almost always wins)
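As a small worked example of the CPU time equation, the sketch below multiplies the three factors. The function name and the sample instruction count, CPI, and clock rate are assumptions chosen only for illustration.

```python
def cpu_time_seconds(instructions: float, cpi: float, clock_rate_hz: float) -> float:
    """CPU time = Instructions/Program x Cycles/Instruction x Seconds/Cycle."""
    seconds_per_cycle = 1.0 / clock_rate_hz
    return instructions * cpi * seconds_per_cycle


if __name__ == "__main__":
    # Hypothetical program: 1e9 instructions, CPI of 2, 500 MHz clock.
    print(cpu_time_seconds(1e9, 2.0, 500e6))
```

Halving any one factor (instruction count, CPI, or cycle time) halves the CPU time, which is why all three have to be considered together.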
Lec25.16
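Integrated Circuit Costs
$$\text{Die cost} = \frac{\text{Wafer cost}}{\text{Dies per wafer} \times \text{Die yield}}$$
$$\text{Dies per wafer} = \frac{\pi \times (\text{Wafer diam}/2)^2}{\text{Die area}} - \frac{\pi \times \text{Wafer diam}}{\sqrt{2 \times \text{Die area}}} - \text{Test dies} \approx \frac{\text{Wafer area}}{\text{Die area}}$$
$$\text{Die yield} = \text{Wafer yield} \times \left\{1 + \frac{\text{Defects per unit area} \times \text{Die area}}{\alpha}\right\}^{-\alpha}$$
Die cost goes roughly with the cube of the area.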
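The cost model can be sketched in a few lines of code. This is an illustrative sketch under stated assumptions: the function names are hypothetical, `alpha` is the process-complexity parameter of the textbook yield model with an assumed default of 3.0, and the sample wafer numbers are made up.

```python
import math


def dies_per_wafer(wafer_diam_cm: float, die_area_cm2: float, test_dies: int = 0) -> float:
    """Wafer area over die area, minus dies lost at the round edge and test dies."""
    wafer_area = math.pi * (wafer_diam_cm / 2) ** 2
    edge_loss = math.pi * wafer_diam_cm / math.sqrt(2 * die_area_cm2)
    return wafer_area / die_area_cm2 - edge_loss - test_dies


def die_yield(wafer_yield: float, defects_per_cm2: float,
              die_area_cm2: float, alpha: float = 3.0) -> float:
    """Yield model; alpha = 3.0 is an assumed default, not a value from the slide."""
    return wafer_yield * (1 + defects_per_cm2 * die_area_cm2 / alpha) ** (-alpha)


def die_cost(wafer_cost: float, wafer_diam_cm: float, die_area_cm2: float,
             defects_per_cm2: float, wafer_yield: float = 1.0,
             alpha: float = 3.0, test_dies: int = 0) -> float:
    """Die cost = wafer cost / (dies per wafer x die yield)."""
    n = dies_per_wafer(wafer_diam_cm, die_area_cm2, test_dies)
    y = die_yield(wafer_yield, defects_per_cm2, die_area_cm2, alpha)
    return wafer_cost / (n * y)


if __name__ == "__main__":
    # Hypothetical 20 cm wafer costing $1000 with 1 defect per cm^2.
    for area in (0.5, 1.0, 2.0):
        print(area, round(die_cost(1000.0, 20.0, area, 1.0), 2))
```

Running the loop shows cost rising much faster than linearly with die area, because fewer dies fit on the wafer and each one is less likely to be good.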
Lec25.17
Computer Arithmetic
• Bits have no inherent meaning: operations determine whether they are really ASCII characters, integers, floating point numbers
• Hardware algorithms for arithmetic:
– Carry lookahead/carry save addition
– Multiplication and divide
– Booth algorithms
• Divide uses same hardware as multiply (Hi & Lo registers in MIPS)
• Floating point follows paper & pencil method of scientific notation
– using integer algorithms for multiply/divide of significands
Lec25.18
Carry Look Ahead (Design trick: peek)
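A  B  C-out
0  0  0     “kill”
0  1  C-in  “propagate”
1  0  C-in  “propagate”
1  1  1     “generate”
G = A and B
P = A xor B
[Figure: four adder bit slices with inputs A0/B0 through A3/B3, each producing a sum bit S and its G and P signals; the lookahead logic also produces block-level G and P outputs]
C0 = Cin
C1 = G0 + C0 P0
C2 = G1 + G0 P1 + C0 P0 P1
C3 = G2 + G1 P2 + G0 P1 P2 + C0 P0 P1 P2
C4 = . . .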
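The carry equations can be exercised with a short simulation. This is a sketch only: it takes generate as G = A and B and propagate as P = A xor B, which is what the kill/propagate/generate table implies, and the expression for C4 is an assumption that simply extends the pattern of C1 through C3 (the slide elides it). The function name is hypothetical.

```python
def cla_4bit(a: int, b: int, cin: int = 0):
    """4-bit carry-lookahead add; returns (4-bit sum, carry out)."""
    abits = [(a >> i) & 1 for i in range(4)]
    bbits = [(b >> i) & 1 for i in range(4)]
    g = [x & y for x, y in zip(abits, bbits)]   # generate: both inputs are 1
    p = [x ^ y for x, y in zip(abits, bbits)]   # propagate: exactly one input is 1

    c0 = cin
    c1 = g[0] | (c0 & p[0])
    c2 = g[1] | (g[0] & p[1]) | (c0 & p[0] & p[1])
    c3 = g[2] | (g[1] & p[2]) | (g[0] & p[1] & p[2]) | (c0 & p[0] & p[1] & p[2])
    # Assumed extension of the same pattern for the carry out of the block.
    c4 = (g[3] | (g[2] & p[3]) | (g[1] & p[2] & p[3])
          | (g[0] & p[1] & p[2] & p[3]) | (c0 & p[0] & p[1] & p[2] & p[3]))

    carries = [c0, c1, c2, c3]
    total = sum((p[i] ^ carries[i]) << i for i in range(4))  # S_i = P_i xor C_i
    return total, c4


if __name__ == "__main__":
    print(cla_4bit(0b0111, 0b0001))
```

Every carry is computed directly from the G and P signals and Cin, so no carry has to ripple through the earlier bit positions.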
Lec25.19
MULTIPLY HARDWARE Version 3
• 32-bit Multiplicand reg, 32-bit ALU, 64-bit Product reg (shift right), (0-bit Multiplier reg)
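[Figure: the 32-bit Multiplicand register feeds the 32-bit ALU; the ALU result is written into the “HI” half of the 64-bit Product register, whose “LO” half initially holds the Multiplier; Control drives Write and Shift Right]
Divide can use same hardware!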
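The shared product/multiplier register can be modeled in software. The sketch below is an illustration for unsigned operands only; the function name is hypothetical, and the slide does not specify signedness or how the ALU carry out is handled, so letting the carry shift into the product is an assumption.

```python
MASK32 = 0xFFFFFFFF


def multiply_v3(multiplicand: int, multiplier: int) -> int:
    """Unsigned 32x32 -> 64-bit multiply with a single shifting product register."""
    multiplicand &= MASK32
    # LO half starts as the multiplier; HI half starts at zero.
    product = multiplier & MASK32
    for _ in range(32):
        if product & 1:
            # Add the multiplicand into the HI half. The sum can need 33 bits;
            # the extra carry bit is kept and shifted in below.
            hi = (product >> 32) + multiplicand
            product = (hi << 32) | (product & MASK32)
        # Shift the whole register right: one multiplier bit is consumed
        # and one more product bit becomes final.
        product >>= 1
    return product


if __name__ == "__main__":
    print(multiply_v3(6, 7))
```

Because the multiplier bits are shifted out of LO as product bits shift in, no separate multiplier register is needed, which is the point of the "0-bit Multiplier reg" remark.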
Lec25.20
Booth’s Algorithm Insight
Current Bit   Bit to the Right   Explanation           Example      Op
1             0                  Begins run of 1s      0001111000   sub
1             1                  Middle of run of 1s   0001111000   none
0             1                  End of run of 1s      0001111000   add
0             0                  Middle of run of 0s   0001111000   none
Originally for Speed (when shift was faster than add)
• Replace a string of 1s in multiplier with an initial subtract when we first see a one and then later add for the bit after the last one
[Figure: in the multiplier 0 1 1 1 1 0, the rightmost 1 is the beginning of the run, the inner 1s are the middle of the run, and the 0 to the left of the last 1 is the end of the run; 01111 = 10000 – 1]
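The recoding table translates directly into a loop over multiplier bits. The following is a minimal single-bit Booth sketch, assuming a two's-complement multiplier that fits in `bits` bits; the function name and the default width are illustrative assumptions.

```python
def booth_multiply(multiplicand: int, multiplier: int, bits: int = 32) -> int:
    """Signed multiply using single-bit Booth recoding of the multiplier."""
    product = 0
    prev = 0  # the implied bit to the right of bit 0
    for i in range(bits):
        cur = (multiplier >> i) & 1
        if cur == 1 and prev == 0:
            # Beginning of a run of 1s: subtract the shifted multiplicand.
            product -= multiplicand << i
        elif cur == 0 and prev == 1:
            # End of a run of 1s: add the shifted multiplicand.
            product += multiplicand << i
        # Middle of a run of 1s or of 0s: no operation.
        prev = cur
    return product


if __name__ == "__main__":
    # 0001111000 is one run of 1s: a single subtract and a single add.
    print(booth_multiply(7, 0b0001111000))
```

For the example multiplier the loop performs one subtract (at bit 3) and one add (at bit 7) instead of four adds, and the same loop handles negative multipliers without a special case.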
Lec25.21
Double Bit Booth Multiplier
[Figure: double-bit Booth multiplier datapath. InputMultiplier (32 bits) loads the LO register (16x2 bits); InputMultiplicand (32 bits) loads the Multiplicand Register and is sign-extended 32=>34; a 34x2 MUX selects the multiplicand x1 or x2 (<<1); a 34-bit ALU (Sub/Add) feeds the HI register (16x2 bits) plus 2 extra bits; the Booth Encoder takes LO[1:0] and the previous LO[1] and produces ENC[2], ENC[1], ENC[0]; Control Logic drives LoadHI, ClearHI, LoadLO, LoadMp, and ShiftAll; outputs are Result[HI] and Result[LO], 32 bits each]
Lec25.22
Pentium Bug
• Pentium: Difference between bugs that board designers must know about and bugs that potentially affect all users
– $200,000 cost in June to repair design
– $400,000,000 loss in December in profits to replace bad parts
– How much to repair Intel’s reputation?
– Make public complete description of bugs in latter category?
• What is technologist’s and company’s responsibility to disclose bugs?
Lec25.23
Multiple Cycle Datapath
[Figure: multicycle datapath. Components: PC, Ideal Memory (WrAdr, RAdr, Din, Dout), Instruction Reg, Mem Data Reg, Reg File (Ra, Rb, Rw, busA, busB, busW), A and B registers, Extend (16-bit Imm to 32 bits), << 2, ALU with ALU Control, ALU Out, and muxes. Control signals: PCWr, PCWrCond, PCSrc, IorD, MemWr, IRWr, RegDst, RegWr, ExtOp, ALUSelA, ALUSelB, ALUOp, MemtoReg; Zero is the ALU status output]
Lec25.24
Control: Hardware vs. Microprogrammed
° Control may be designed using one of several initial representations. The choice of sequence control, and how logic is represented, can then be determined independently; the control can then be implemented with one of several methods using a structured logic technique.
                           “hardwired control”            “microprogrammed control”
Initial Representation     Finite State Diagram           Microprogram
Sequencing Control         Explicit Next State Function   Microprogram counter + Dispatch ROMs
Logic Representation       Logic Equations                Truth Tables
Implementation Technique   PLA                            ROM
Lec25.25
Finite State Machine (FSM) Spec
“instruction fetch”: IR <= MEM[PC]; PC <= PC + 4
“decode”: ALUout <= PC + SX
Execute, Memory, and Write-back states, by instruction:
R-type: ALUout <= A fun B; then R[rd] <= ALUout
ORi: ALUout <= A or ZX; then R[rt] <= ALUout
LW: ALUout <= A + SX; then M <= MEM[ALUout]; then R[rt] <= M
SW: ALUout <= A + SX; then MEM[ALUout] <= B
BEQ: If A = B then PC <= ALUout
[Figure: state diagram; the states carry the codes 0000 through 1100]
Lec25.26
Sequencer-based control unit
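[Figure: Control Logic takes its Inputs from the Opcode and the State Reg and sends its Outputs to the Multicycle Datapath; an Adder increments the state by 1, and Address Select Logic chooses the next state]
Types of “branching”
• Set state to 0
• Dispatch (state 1)
• Use incremented state number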
Lec25.27
“Macroinstruction” Interpretation
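[Figure: Main Memory holds the user program plus data (ADD, SUB, AND, . . ., DATA); this can change! The CPU contains the execution unit and the control memory. One of the macroinstructions (e.g., AND) is mapped into one of the microsequences in control memory (the AND microsequence)]
Microsequence steps, e.g.: Fetch, Calc Operand Addr, Fetch Operand(s), Calculate, Save Answer(s)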
Lec25.28
Microprogramming
Label   ALU    SRC1  SRC2     Dest.   Memory     Mem. Reg.  PC Write     Sequencing
Fetch:  Add    PC    4                Read PC    IR         ALU          Seq
        Add    PC    Extshft                                             Dispatch
Rtype:  Func   rs    rt                                                  Seq
                              rd ALU                                     Fetch
Ori:    Or     rs    Extend0                                             Seq
                              rt ALU                                     Fetch
Lw:     Add    rs    Extend                                              Seq
                                      Read ALU                           Seq
                              rt MEM                                     Fetch
Sw:     Add    rs    Extend                                              Seq
                                      Write ALU                          Fetch
Beq:    Subt.  rs    rt                                     ALUoutCond.  Fetch
Lec25.29
Precise Interrupts
• Precise: state of the machine is preserved as if program executed up to the offending instruction
– All previous instructions completed
– Offending instruction and all following instructions act as if they have not even started
– Same system code will work on different implementations
– Position clearly established by IBM
– Difficult in the presence of pipelining, out-of-order execution, ...
– MIPS takes this position
• Imprecise: system software has to figure out what is where and put it all back together
• Performance goals often lead designers to forsake precise interrupts
– system software developers, users, markets, etc. usually wish they had not done this
• Modern techniques for out-of-order execution and branch prediction help implement precise interrupts
Lec25.30
Administrivia
• Oral reports tomorrow:
– 10am - 12pm, 2pm-4pm, 306 Soda
– 5pm go over to lab to run mystery programs
– Still 3 empty slots. You *MUST* sign up today.
– Reports due at 5pm in Lab (not in BOX downstairs)
– TAs have handed out a list of requirements
– Remember: talk is 15 minutes + 5 minutes questions
  • Don’t bring more than 8 slides!!!
  • Practice! Your final project grade will depend partially on your oral report.
• Grades posted by Friday
– Give us a random 8 digit number for final grades! We will use this to post final grades on the web site.
Lec25.31
Recap: Pipelining Lessons (it’s intuitive!)
° Pipelining doesn’t help latency of single task, it helps throughput of entire workload
° Multiple tasks operating simultaneously using different resources
° Potential speedup = Number pipe stages
° Pipeline rate limited by slowest pipeline stage
° Unbalanced lengths of pipe stages reduce speedup
° Time to “fill” pipeline and time to “drain” it reduce speedup
° Stall for Dependences
[Figure: laundry analogy. Task order A, B, C, D plotted against Time from 6 PM past 9 PM, divided into seven 30-minute slots]
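The fill/drain and slowest-stage lessons can be put into numbers. This is a sketch, assuming four laundry stages of 30 minutes each (the stage count is inferred from four tasks finishing in seven 30-minute slots, and is not stated in the text); the function names are hypothetical.

```python
def sequential_time(num_tasks: int, stage_times: list) -> int:
    """Each task runs all stages before the next task starts."""
    return num_tasks * sum(stage_times)


def pipelined_time(num_tasks: int, stage_times: list) -> int:
    """All stages advance in lockstep at the pace of the slowest stage."""
    if num_tasks == 0:
        return 0
    slot = max(stage_times)
    # The first task needs one slot per stage (fill); each further task adds one slot.
    return (len(stage_times) + num_tasks - 1) * slot


if __name__ == "__main__":
    balanced = [30, 30, 30, 30]      # assumed: four 30-minute stages
    unbalanced = [30, 40, 20, 30]    # hypothetical unbalanced stages
    for stages in (balanced, unbalanced):
        seq = sequential_time(4, stages)
        pipe = pipelined_time(4, stages)
        print(stages, seq, pipe, round(seq / pipe, 2))
```

With only four tasks the speedup stays well below the number of stages because of fill and drain time; it approaches the stage count only as the number of tasks grows, and the unbalanced case is slower still.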
Lec25.32
Why Pipeline? Because the resources are there!
[Figure: Instr. Order (Inst 0 through Inst 4) vs. Time (clock cycles); each instruction uses Im, Reg, ALU, Dm, Reg in successive cycles, so every resource is busy with a different instruction in each cycle]
Lec25.33
Can pipelining get us into trouble?
• Yes: Pipeline Hazards
– structural hazards: attempt to use the same resource two different ways at the same time
  • E.g., combined washer/dryer would be a structural hazard or folder busy doing something else (watching TV)
– data hazards: attempt to use item before it is ready
  • E.g., one sock of pair in dryer and one in washer; can’t fold until get sock from washer through dryer
  • instruction depends on result of prior instruction still in the pipeline
– control hazards: attempt to make a decision before condition is evaluated
  • E.g., washing football uniforms and need to get proper detergent level; need to see after dryer before next load in
  • branch instructions
• Can always resolve hazards by waiting
– pipeline control must detect the hazard
– take action (or delay action) to resolve hazards
Lec25.34
Exceptions in a 5 stage pipeline
• Use pipeline to sort this out!
– Pass exception status along with instruction.
– Keep track of PCs for every instruction in pipeline.
– Don’t act on exception until it reaches WB stage
• Handle interrupts through “faulting noop” in IF stage
• When instruction reaches WB stage:
– Save PC => EPC, Interrupt vector addr => PC
– Turn all instructions in earlier stages into noops!
[Figure: Program Flow vs. Time for four overlapped instructions (IFetch Dcd Exec Mem WB), with exceptions raised in different stages: Inst TLB fault, Bad Inst, Overflow, Data TLB]
Lec25.35
Data Stationary Control
• The Main Control generates the control signals during Reg/Dec
– Control signals for Exec (ExtOp, ALUSrc, ...) are used 1 cycle later
– Control signals for Mem (MemWr, Branch) are used 2 cycles later
– Control signals for Wr (MemtoReg, RegWr) are used 3 cycles later
[Figure: Main Control in Reg/Dec produces ExtOp, ALUOp, RegDst, ALUSrc, Branch, MemWr, MemtoReg, and RegWr. The ID/Ex, Ex/Mem, and Mem/Wr pipeline registers (after the IF/ID register) carry each signal forward to the stage that uses it: ExtOp, ALUOp, RegDst, ALUSrc in Exec; Branch, MemWr in Mem; MemtoReg, RegWr in Wr]
Lec25.36
Pipeline Summary
° Simple 5-stage pipeline: F D E M W
° Pipelines pass control information down the pipe just as data moves down pipe
° Resolve data hazards through forwarding.
° Forwarding/Stalls handled by local control
° Exceptions stop the pipeline
° MIPS I instruction set architecture made pipeline visible (delayed branch, delayed load)
° More performance from deeper pipelines, parallelism
° You built a complete 5-stage pipeline in the lab!
Lec25.37
Out of order execution: Tomasulo Organization
[Figure: Tomasulo organization. The FP Op Queue and FP Registers feed the Reservation Stations (Add1, Add2, Add3 for the FP adders; Mult1, Mult2 for the FP multipliers); Load Buffers (Load1 through Load6) receive data From Mem and Store Buffers send data To Mem; results travel on the Common Data Bus (CDB)]
Lec25.38
How can the machine exploit available ILP?

Technique                                           Limitation
° Pipelining                                        Issue rate, FU stalls, FU depth
° Super-pipeline                                    Clock skew, FU stalls, FU depth
  - Issue 1 instr. / (fast) cycle
  - IF takes multiple cycles
° Super-scalar                                      Hazard resolution
  - Issue multiple scalar instructions per cycle
° VLIW                                              Packing, Compiler
  - Each instruction specifies multiple scalar operations
[Figure: IF D Ex M W timing diagrams for each technique]
Lec25.39
Processor-DRAM Gap (latency)
[Figure: Performance (log scale, 1 to 1000) vs. Time (1980-2000). CPU (“Moore’s Law”): µProc 60%/yr. DRAM: 7%/yr. Processor-Memory Performance Gap: grows 50% / year]
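The 50% / year figure follows from the two growth rates. A quick check, as a sketch with hypothetical function names and the chart's 60%/yr and 7%/yr rates as defaults:

```python
def gap_growth_per_year(cpu_growth: float = 0.60, dram_growth: float = 0.07) -> float:
    """Yearly growth of the CPU/DRAM performance ratio."""
    return (1 + cpu_growth) / (1 + dram_growth) - 1


def gap_after(years: int, cpu_growth: float = 0.60, dram_growth: float = 0.07) -> float:
    """Ratio of CPU to DRAM performance after `years`, starting from equal performance."""
    return ((1 + cpu_growth) / (1 + dram_growth)) ** years


if __name__ == "__main__":
    print(round(gap_growth_per_year() * 100, 1), "% per year")
    print(round(gap_after(10), 1), "x after 10 years")
```

The ratio 1.60 / 1.07 is just under 1.5, so the gap grows by roughly half each year and compounds quickly over a decade.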
Lec25.40
Levels of the Memory Hierarchy
Level           Capacity       Access Time   Cost
CPU Registers   100s Bytes     <2s ns
Cache           K Bytes SRAM   2-100 ns      $.01-.001/bit
Main Memory     M Bytes DRAM   100ns-1us     $.01-.001
Disk            G Bytes        ms            10^-3 - 10^-4 cents
Tape            infinite       sec-min       10^-6

Staging Xfer Unit between levels (Upper Level is faster, Lower Level is larger):
Registers <-> Cache: Instr. Operands; prog./compiler; 1-8 bytes
Cache <-> Memory: Blocks; cache cntl; 8-128 bytes
Memory <-> Disk: Pages; OS; 512-4K bytes
Disk <-> Tape: Files; user/operator; Mbytes
Lec25.41
Memory Hierarchy
° The Principle of Locality:
• Programs access a relatively small portion of the address space at any instant of time.
- Temporal Locality: Locality in Time
- Spatial Locality: Locality in Space
° Three Major Categories of Cache Misses:
• Compulsory Misses: sad facts of life. Example: cold start misses.
• Conflict Misses: increase cache size and/or associativity.
• Capacity Misses: increase cache size
° Virtual Memory invented as another level of the hierarchy
– Today VM allows many processes to share single memory without having to swap all processes to disk; protection is more important
– TLBs are important for fast translation/checking
Lec25.42
Set Associative Cache
• N-way set associative: N entries for each Cache Index
– N direct mapped caches operate in parallel
• Example: Two-way set associative cache
– Cache Index selects a “set” from the cache
– The two tags in the set are compared to the input in parallel
– Data is selected based on the tag result
[Figure: two ways, each with Valid, Cache Tag, and Cache Data (Cache Block 0, ...) arrays indexed by the Cache Index; the Adr Tag is compared against both tags; the two Compare results are ORed to form Hit and drive Sel1/Sel0 of the Mux that selects the Cache Block]
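The lookup steps above can be sketched as a small software model. This is an illustration, not the hardware: the class and method names are hypothetical, a software loop stands in for the parallel tag compare, and round-robin replacement is an assumption (the text does not name a policy).

```python
class SetAssociativeCache:
    """Tag-only model of an N-way set associative cache (no data stored)."""

    def __init__(self, num_sets: int, ways: int, block_bytes: int):
        self.num_sets = num_sets
        self.ways = ways
        self.block_bytes = block_bytes
        # One tag slot per way in each set; None means the entry is invalid.
        self.tags = [[None] * ways for _ in range(num_sets)]
        self.next_victim = [0] * num_sets  # assumed round-robin replacement

    def access(self, address: int) -> bool:
        """Return True on a hit; on a miss, fill the block and return False."""
        block = address // self.block_bytes
        index = block % self.num_sets      # Cache Index selects a set
        tag = block // self.num_sets       # remaining address bits form the tag
        # Hardware compares all tags of the set at once and ORs the results.
        if tag in self.tags[index]:
            return True
        victim = self.next_victim[index]
        self.tags[index][victim] = tag
        self.next_victim[index] = (victim + 1) % self.ways
        return False


if __name__ == "__main__":
    cache = SetAssociativeCache(num_sets=4, ways=2, block_bytes=16)
    for addr in (0, 64, 0, 128, 0):
        print(addr, "hit" if cache.access(addr) else "miss")
```

Addresses 0 and 64 map to the same set but coexist in its two ways; a third conflicting block (128) forces an eviction, which is the kind of conflict miss that more associativity reduces.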
Lec25.43
Quicksort vs. Radix as the number of keys varies: Cache misses
[Figure: Cache misses vs. Job size in keys, for Radix sort and Quicksort]
What is proper approach to fast algorithms?
Lec25.44
Static RAM Cell
6-Transistor SRAM Cell
[Figure: cell with a word (row select) line and complementary bit lines, bit and bit-bar, holding 1 and 0 on opposite sides; annotation: replaced with pullup to save area]
• Write:
1. Drive bit lines (bit=1, bit-bar=0)
2. Select row
• Read:
1. Precharge bit and bit-bar to Vdd or Vdd/2 => make sure equal!
2. Select row
3. Cell pulls one line low
4. Sense amp on column detects difference between bit and bit-bar
Lec25.45
1-Transistor Memory Cell (DRAM)
• Write:
– 1. Drive bit line
– 2. Select row
• Read:
– 1. Precharge bit line to Vdd
– 2. Select row
– 3. Cell and bit line share charges
  • Very small voltage changes on the bit line
– 4. Sense (fancy sense amp)
  • Can detect changes of ~1 million electrons
– 5. Write: restore the value
• Refresh
– 1. Just do a dummy read to every cell.
[Figure: one transistor, gated by row select, connecting the storage cell to the bit line]
Lec25.46
Classical DRAM Organization (square)
[Figure: square RAM Cell Array. The row address drives the row decoder, which asserts one word (row) select line; the bit (data) lines run to the Column Selector & I/O Circuits, which use the Column Address to deliver the data. Each intersection represents a 1-T DRAM Cell]
• Row and Column Address together:
– Select 1 bit at a time
Lec25.47
Main Memory Performance
• Simple: CPU, Cache, Bus, Memory same width (32 bits)
• Wide: CPU/Mux 1 word; Mux/Cache, Bus, Memory N words (Alpha: 64 bits & 256 bits)
• Interleaved: CPU, Cache, Bus 1 word; Memory N Modules (4 Modules); example is word interleaved
Timing model: 1 to send address, 6 access time, 1 to send data
Cache Block is 4 words
Simple M.P. = 4 x (1+6+1) = 32
Wide M.P. = 1 + 6 + 1 = 8
Interleaved M.P. = 1 + 6 + 4x1 = 11
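The three miss penalties can be reproduced with a few lines of code. This is a sketch with hypothetical function names; it assumes the wide organization moves the whole block in one transfer and the interleaved organization has at least as many modules as words in the block.

```python
def simple_miss_penalty(words: int, send_addr: int = 1, access: int = 6, send_data: int = 1) -> int:
    """One full address/access/transfer sequence per word."""
    return words * (send_addr + access + send_data)


def wide_miss_penalty(send_addr: int = 1, access: int = 6, send_data: int = 1) -> int:
    """Memory and bus are as wide as the block: a single sequence fetches it all."""
    return send_addr + access + send_data


def interleaved_miss_penalty(words: int, send_addr: int = 1, access: int = 6, send_data: int = 1) -> int:
    """Modules access in parallel; words return one per bus cycle."""
    return send_addr + access + words * send_data


if __name__ == "__main__":
    block_words = 4
    print(simple_miss_penalty(block_words))       # 32
    print(wide_miss_penalty())                    # 8
    print(interleaved_miss_penalty(block_words))  # 11
```

Interleaving gets most of the benefit of the wide organization while keeping a one-word bus, since only the data transfers are serialized.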
Lec25.48
I/O System Design Issues
[Figure: Processor (receiving interrupts) and Cache on a Memory - I/O Bus shared with Main Memory and three I/O Controllers, which serve two Disks, Graphics, and a Network]
• Systems have a hierarchy of busses as well (PC: memory, PCI, ESA)
Lec25.49
A Three-Bus System
• A small number of backplane buses tap into the processor-memory bus
– Processor-memory bus is only used for processor-memory traffic
– I/O buses are connected to the backplane bus
• Advantage: loading on the processor bus is greatly reduced
[Figure: Processor and Memory on the Processor Memory Bus; a Bus Adaptor connects it to the Backplane Bus, and further Bus Adaptors connect the Backplane Bus to the I/O Buses]
Lec25.50
Disk Device Terminology
Disk Latency = Queueing Time + Controller time + Seek Time + Rotation Time + Xfer Time
Order of magnitude times for 4K byte transfers:
Average Seek: 8 ms or less
Rotate: 4.2 ms @ 7200 rpm
Xfer: 1 ms @ 7200 rpm
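The rotation figure is the average wait of half a revolution. The sketch below recomputes it and sums the latency terms; the function names are hypothetical, and the zero queueing and controller times in the example are assumptions made only to keep the arithmetic visible.

```python
def avg_rotation_ms(rpm: float) -> float:
    """Average rotational latency: half of one revolution, in milliseconds."""
    ms_per_revolution = 60_000.0 / rpm
    return 0.5 * ms_per_revolution


def disk_latency_ms(queue_ms: float, controller_ms: float, seek_ms: float,
                    rpm: float, xfer_ms: float) -> float:
    """Queueing + controller + seek + rotation + transfer time."""
    return queue_ms + controller_ms + seek_ms + avg_rotation_ms(rpm) + xfer_ms


if __name__ == "__main__":
    print(round(avg_rotation_ms(7200), 2))
    # Assumed idle disk: no queueing, negligible controller time.
    print(round(disk_latency_ms(0.0, 0.0, 8.0, 7200, 1.0), 2))
```

At 7200 rpm a revolution takes about 8.3 ms, so the average rotational delay is about 4.2 ms, and seek plus rotation dominate a 4K byte transfer.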
Lec25.51
Disk I/O Performance
Response time = Queue + Device Service time
Metrics: Response Time, Throughput
[Figure: Response Time (ms), 0 to 300, vs. Throughput (% total BW), 0% to 100%. System model: Proc -> Queue -> IOC -> Device]
Lec25.52
A Little Queuing Theory: M/M/1 queues
[Figure: System = Proc -> Queue -> server (IOC, Device)]
• Described “memoryless” or Markovian request arrival (M for C = 1, exponentially random), 1 server: M/M/1 queue
• When service times have C = 1, M/M/1 queue:
Tq = Tser x u / (1 – u)
Tser: average time to service a customer
u: server utilization (0..1): u = (arrival rate) x Tser
Tq: average time/customer in queue
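The queueing formula shows how sharply waiting time grows as the server approaches saturation. A minimal sketch, with hypothetical function names and an assumed 20 ms service time:

```python
def utilization(arrival_rate_per_ms: float, t_ser_ms: float) -> float:
    """u = arrival rate x average service time."""
    return arrival_rate_per_ms * t_ser_ms


def mm1_queue_time(t_ser_ms: float, u: float) -> float:
    """Average time a customer waits in an M/M/1 queue: Tq = Tser x u / (1 - u)."""
    if not 0.0 <= u < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return t_ser_ms * u / (1.0 - u)


if __name__ == "__main__":
    t_ser = 20.0  # assumed average service time in ms
    for u in (0.1, 0.5, 0.9, 0.99):
        print(u, round(mm1_queue_time(t_ser, u), 1))
```

At 50% utilization the average wait equals one service time; at 90% it is nine service times, which is why response time climbs steeply as throughput nears 100% of bandwidth.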
Lec25.53
Computers in the news: Tunneling Magnetic Junction
Lec25.54
Computers in the News: Sony Playstation 2000
• (as reported in Microprocessor Report, Vol 13, No. 5)
– Emotion Engine: 6.2 GFLOPS, 75 million polygons per second
– Graphics Synthesizer: 2.4 Billion pixels per second
– Claim: Toy Story realism brought to games!
Lec25.55
Computers in the News: Electronic Ink
• Electronic Ink:
– Little capsules with charged balls that are half black/half white
– Placing an electronic charge of one polarity makes dot black and the other polarity makes it white.
– Flexible, cheap, paper-like displays!
[Figures: Schematic Diagram; Electron Micrograph]
Lec25.57
Forces on Computer Architecture
[Figure: Computer Architecture at the center, acted on by Technology, Programming Languages, Operating Systems, History, and Applications]
(A = F / M)
Lec25.58
Key Technologies
° Fast, cheap, highly integrated “computers-on-a-chip”
• IDT R4640, NEC VR4300, StrongARM, Superchips
° Affordable access to fast networks -> Network is everywhere!
• ISDN, Cable Modems, ATM, . . .
° Platform independent programming languages
• Java, JavaScript, Visual Basic Script
° Lightweight Operating Systems
• GEOS, NCOS, RISCOS
° ???
Lec25.59
Future of Computer Architecture and Engineering
• Performance
• High Level Computer Architecture
• Multiprocessors
• “IRAM”
• “Introspective Computing”
• AetherStore
Lec25.60
Processor Performance
[Figure: Performance (0 to 300) vs. Year (1982-1995) for RISC and Intel x86; annotations: 35%/yr, RISC introduction]
Lec25.61
3 Recent Machines

               Alpha 21164     Pentium II       HP PA-8000
Year           1995            1996             1996
Clock          600 MHz (‘97)   300 MHz (‘97)    236 MHz (‘97)
Cache          8K/8K/96K/2M    16K/16K/0.5M     0/0/4M
Issue rate     2int+2FP        3 instr (x86)    4 instr
Pipe stages    7-9             12-14            7-9
Out-of-Order   6 loads         40 instr (µop)   56 instr
Rename regs    none            40               56

Design styles labeled on the slide: “Brainiac”, “Speed Demon”
Lec25.62
SPECint95base Performance (Oct. 1997)
[Figure: bar chart, scale 0 to 20, for PA-8000, 21164, and PPro on go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, and the overall SPECint rating]
Lec25.63
SPECfp95base Performance (Oct. 1997)
[Figure: bar chart, scale 0 to 50, for PA-8000, 21164, and PPro on tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5, and the overall SPECfp rating]
Lec25.64
Performance Retrospective
° Theory of Algorithms & Compilers based on number of operations
° Compilers remove operations and “simplify” ops: Integer adds << Integer multiplies << FP adds << FP multiplies
• Advanced pipelines => these operations take similar time (FP multiply faster than integer multiply)
° As clock rates get higher and pipelines get longer, instructions take less time but DRAMs are only slightly faster (although much larger)
° Today time is a function of (ops, cache misses);
• How do you tune performance on Pentium Pro? Random?
° Given importance of caches, what does this mean to:
• Compilers?
• Data structures?
• Algorithms?
Lec25.65
1985 Computer Food Chain
[Figure: PC, Workstation, Minicomputer, Mainframe, Vector Supercomputer; annotation: “Big Iron”]
Lec25.66
1995 Computer Food Chain
[Figure: PC, Workstation, Minicomputer, Mainframe, Vector Supercomputer, Massively Parallel Processors; annotations: (hitting wall soon), (future is bleak)]
Lec25.67
Interconnection Networks
° Switched vs. Shared Media: pairs communicate at same time: “point-to-point” connections
Lec25.68
[Figure: four organizations built from processors (P), memories (M), disks (D), and network interfaces (NI): MPP, SMP (several P sharing one M, with NI on the I/O Bus), Distributed Comp., and Cluster/Network of Workstations (NOW). Labels in the figure: Fast, Switched Network; Fast Communication; Slow, Scalable Network; General Purpose; Incremental Scalability, Timeliness]
Lec25.69
2005 Computer Food Chain?
[Figure: Portable Computers; Networks of Workstations/PCs; Mainframe; Vector Supercomputer; Minicomputer; Massively Parallel Processors]
Lec25.70
Intelligent DRAM (IRAM)
• IRAM motivation (2000 to 2005)
– 256 Mbit/1Gbit DRAMs in near future (128 MByte)
– Current CPUs starved for memory BW
– On chip memory BW = SQRT(Size)/RAS or 80 GB/sec
– 1% of Gbit DRAM = 10M transistors for µprocessor
– Even in DRAM process, a 10M trans. CPU is attractive
– Package could be network interface vs. Addr./Data pins
– Embedded computers are increasingly important
• Why not re-examine computer design based on separation of memory and processor?
– Compact code & data?
– Vector instructions?
– Operating systems? Compilers? Data Structures?
Lec25.71
IRAM Vision Statement
Microprocessor & DRAM on a single chip:
– on-chip memory latency 5-10X, bandwidth 50-100X
– improve energy efficiency 2X-4X (no off-chip bus)
– serial I/O 5-10X v. buses
– smaller board area/volume
– adjustable memory size/width
[Figure: a conventional system, with Proc, caches ($, L2$), and Bus built in a logic fab and separate DRAM chips and I/O reached over buses, compared with a single chip built in a DRAM fab that holds Proc, Bus, DRAM, and I/O]
Lec25.73
Introspective Computing
• Biological Analogs for computer systems:
– Continuous adaptation
– Insensitivity to design flaws
  • Both hardware and software
  • Necessary if can never be sure that all components are working properly…
• Examples:
– ISTORE -- applies introspective computing to disk storage
– DynaComp -- applies introspective computing at chip level
  • Compiler always running and part of execution!
[Figure: cycle of Compute, Monitor, Adapt]
Lec25.74
and why not
° multiprocessors on a chip?
° complete systems on a chip?
• memory + processor + I/O
° computers in your credit card?
° networking in your kitchen? car?
° eye tracking input devices?
Lec25.76
CS152: So what's in it for me? (from 1st lecture)
° In-depth understanding of the inner-workings of modern computers, their evolution, and trade-offs present at the hardware/software boundary.
• Insight into fast/slow operations that are easy/hard to implement in hardware
° Experience with the design process in the context of a large complex (hardware) design.
• Functional Spec --> Control & Datapath --> Physical implementation
• Modern CAD tools
° Designer's "Intellectual" toolbox.
Lec25.77
Simulate Industrial Environment (from 1st lecture)
° Project teams must have at least 4 members
• Managers have value
° Communicate with colleagues (team members)
• What have you done?
• What answers do you need from others?
• You must document your work!!!
• Everyone must keep an on-line notebook
° Communicate with supervisor (TAs)
• How is the team’s plan?
• Short progress reports are required:
- What is the team’s game plan?
- What is each member’s responsibility?
Lec25.79
Summary: Things we Hope You Learned from 152
° Keep it simple and make it work:
• Fully test everything individually & then together; break when together
• Retest everything whenever you make any changes
• Last minute changes are big “no nos”
° Group dynamics. Communication is the key to success:
• Be open with others of your expectations & your problems (e.g., trip)
• Everybody should be there on design meetings when key decisions are made and jobs are assigned
° Planning is very important (“plan your life; live your plan”):
• Promise what you can deliver; deliver more than you promise
• Murphy’s Law: things DO break at the last minute
- DON’T make your plan based on the best case scenarios
- Freeze your design and don’t make last minute changes
° Never give up! It is not over until you give up (“Bear won’t die”)