14:332:331 Computer Architecture and Assembly Language Fall 2003 Week 12 Buses and I/O system

331 Lec19.1 Fall 2003

14:332:331Computer Architecture and Assembly Language

Fall 2003

Week 12Buses and I/O system

[Adapted from Dave Patterson’s UCB CS152 slides and

Mary Jane Irwin’s PSU CSE331 slides]

331 Lec19.2 Fall 2003

Head’s Up This week’s material

Buses: Connecting I/O devices- Reading assignment – PH 8.4

Memory hierarchies- Reading assignment – PH 7.1 and B.5

Reminders Next week’s material Basics of caches

- Reading assignment – PH 7.2

331 Lec19.3 Fall 2003

Review: Major Components of a Computer

Processor

Control

Datapath

Memory

Devices

Input

Output

Cach

e

Main

M

emo

ry

Seco

nd

ary M

emo

ry(D

isk)

331 Lec19.4 Fall 2003

Input and Output Devices

I/O devices are incredibly diverse wrt Behavior Partner Data rate

Device Behavior Partner Data rate (KB/sec)Keyboard input human 0.01

Mouse input human 0.02

Laser printer output human 200.00

Graphics display output human 60,000.00

Network/LAN input or output

machine 500.00-6000.00

Floppy disk storage machine 100.00

Magnetic disk storage machine 2000.00-10,000.00

331 Lec19.5 Fall 2003

Magnetic Disk

Purpose Long term, nonvolatile storage Lowest level in the memory hierarchy

- slow, large, inexpensive

General structure A rotating platter coated with a magnetic surface

Use a moveable read/write head to access the disk

Advantages of hard disks over floppy disks Platters are more rigid (metal or glass) so they can be larger Higher density because it can be controlled more precisely Higher data rate because it spins faster Can incorporate more than one platter

331 Lec19.6 Fall 2003

Organization of a Magnetic Disk

Typical numbers (depending on the disk size) 1 to 15 (2 surface) platters per disk with 1” to 8” diameter 1,000 to 5,000 tracks per surface 63 to 256 sectors per track

- the smallest unit that can be read/written (typically 512 to 1,024 B)

Traditionally all tracks have the same number of sectors- Newer disks with smart controllers can record more sectors on

the outer tracks (constant bit density)

Platters

Track

Sector

331 Lec19.7 Fall 2003

Magnetic Disk Characteristic Cylinder: all the tracks under the heads

at a given point on all surfaces

Read/write data is a three-stage process: Seek time: position the arm over the

proper track (6 to 14 ms avg.)- due to locality of disk references

the actual average seek time may be only 25% to 33% of the advertised number

Rotational latency: wait for the desired sectorto rotate under the read/write head (½ of 1/RPM)

Transfer time: transfer a block of bits (sector)under the read-write head (2 to 20 MB/sec typical)

Controller time: the overhead the disk controller imposes in performing an disk I/O access (typically < 2 ms)

SectorTrack

Cylinder

HeadPlatter

331 Lec19.8 Fall 2003

Magnetic Disk Examples

Characteristic Sun X6713A Toshiba MK2016Disk diameter (inches) 3.5 2.5

Capacity 73 GB 20 GB

MTTF (k hr’s) 1,200 300

# of platters - heads 2 - 4

# cylinders 16,383

# B/sector - # sectors/track 512 - 63

Rotation speed (RPM) 10,000 4,200

Max. - Avg. seek time (ms) ? - 6.6 24 - 13

Avg. rot. latency (ms) 3 7.14

Transfer rate (PIO) 35 MB/sec 16.6 MB/sec

Power (watts) < 2.5

Volume (in3) 4.01

Weight (oz) 3.49

331 Lec19.9 Fall 2003

I/O System Interconnect Issues

A bus is a shared communication link (a set of wires used to connect multiple subsystems)

Performance Expandability Resilience in the face of failure – fault tolerance

Processor

ReceiverMainMemory

Keyboard

bus

331 Lec19.10 Fall 2003

Performance Measures

Latency (execution time, response time) is the total time from the start to finish of one instruction or action

usually used to measure processor performance

Throughput – total amount of work done in a given amount of time

aka execution bandwidth the number of operations performed per second

Bandwidth – amount of information communicated across an interconnect (e.g., a bus) per unit time

the bit width of the operation * rate of the operation usually used to measure I/O performance

331 Lec19.11 Fall 2003

I/O System Expandability

Cache Memory

Memory - I/O Bus

MainMemory

I/OController

Disk Disk

I/OController

I/OController

Terminal Network

interrupt signals

Usually have more than one I/O device in the system each I/O device is controlled by an I/O Controller

Processor

331 Lec19.12 Fall 2003

Bus Characteristics

Control lines Signal requests and acknowledgments Indicate what type of information is on the data lines

Data lines Data, complex commands, and addresses

Bus transaction consists of Sending the address Receiving (or sending) the data

Data Lines

Control Lines

331 Lec19.13 Fall 2003

Output (Read) Bus Transaction Defined by what they do to memory

read = output: transfers data from memory (read) to I/O device (write)

Processor Main Memory

Control

Data

Step 1: Processor sends read request and read address to memory

Processor

Control

DataMain Memory

Step 2: Memory accesses data

Processor

Control

DataMain Memory

Step 3: Memory transfers data to disk

331 Lec19.14 Fall 2003

Input (Write) Bus Transaction Defined by what they do to memory

write = input: transfers data from I/O device (read) to memory (write)

Processor Main Memory

Control

Data

Step 1: Processor sends write request and write address to memory

Processor

Control

DataMain Memory

Step 2: Disk transfers data to memory

331 Lec19.15 Fall 2003

Advantages and Disadvantages of Buses Advantages

Versatility:- New devices can be added easily

- Peripherals can be moved between computer systems that use the same bus standard

Low Cost:- A single set of wires is shared in multiple ways

Disadvantages It creates a communication bottleneck

- The bus bandwidth limits the maximum I/O throughput

The maximum bus speed is largely limited by- The length of the bus

- The number of devices on the bus

It needs to support a range of devices with widely varying latencies and data transfer rates

331 Lec19.16 Fall 2003

Types of Buses Processor-Memory Bus (proprietary)

Short and high speed Matched to the memory system to maximize the memory-

processor bandwidth Optimized for cache block transfers

I/O Bus (industry standard, e.g., SCSI, USB, ISA, IDE) Usually is lengthy and slower Needs to accommodate a wide range of I/O devices Connects to the processor-memory bus or backplane bus

Backplane Bus (industry standard, e.g., PCI) The backplane is an interconnection structure within the

chassis Used as an intermediary bus connecting I/O busses to the

processor-memory bus

331 Lec19.17 Fall 2003

A Two Bus System

I/O buses tap into the processor-memory bus via Bus Adaptors (that do speed matching between buses)

Processor-memory bus: mainly for processor-memory traffic

I/O busses: provide expansion slots for I/O devices

Processor Memory

Processor-Memory Bus

I/OBus

BusAdaptor

BusAdaptor

BusAdaptor

I/OBus

I/OBus

331 Lec19.18 Fall 2003

A Three Bus System

A small number of Backplane Buses tap into the Processor-Memory Bus

Processor-Memory Bus is used for processor memory traffic I/O buses are connected to the Backplane Bus

Advantage: loading on the Processor-Memory Bus is greatly reduced

Processor Memory


BusAdaptor

Backplane Bus

BusAdaptor

BusAdaptor

I/O Bus

I/O Bus

331 Lec19.19 Fall 2003

I/O System Example (Apple Mac 7200)

Cache Memory

PCI

MainMemory

I/OController

I/OController

Graphic Terminal

Network

Processor

Typical of midrange to high-end desktop system in 1997

PCIInterface/Memory

ControllerI/O

ControllerI/O

Controller

SC

SI

bu

s

Disk

CDRom

Tape


Serial portsAudio I/O

331 Lec19.20 Fall 2003

Example: Pentium System Organization

Processor-MemoryBus

PCI Bus

I/O Busses

Memory controller(“Northbridge”)

http://developer.intel.com/design/chipsets/850/animate.htm?iid=PCG+devside&

331 Lec19.21 Fall 2003

A Bus Transaction A bus transaction includes three parts:

Gaining access to the bus - arbitration Issuing the command (and address) - request Transferring the data - action

Gaining access to the bus How is the bus reserved by a devices that wishes to use it? Chaos is avoided by a master-slave arrangement

- The bus master initiates and controls all bus requests

In the simplest system: The processor is the only bus master Major drawback - the processor must be involved in every

bus transaction

BusMaster

BusSlave

Control: Master initiates requests

Data can go either way

331 Lec19.22 Fall 2003

Single Master Bus Transaction All bus requests are controlled by the processor

it initiates the bus cycle on behalf of the requesting device

Processor

Control

DataMemory

Step 2: Processor responds and generates appropriate control signals

Processor Memory

Control

Data

Step 1: Disk wants to use the bus so it generates a bus request to processor

Processor

Control

DataMemory

Step 3: Processor gives slave (disk) permission to use the bus

331 Lec19.23 Fall 2003

Multiple Potential Bus Masters: Arbitration Bus arbitration scheme:

A bus master wanting to use the bus asserts the bus request A bus master cannot use the bus until its request is granted A bus master must release the bus after its use

Bus arbitration schemes usually try to balance two factors:

Bus priority - the highest priority device should be serviced first Fairness - Even the lowest priority device should never

be completely locked out from using the bus

Bus arbitration schemes can be divided into four broad classes

Daisy chain arbitration: all devices share 1 request line Centralized, parallel arbitration: multiple request and grant lines Distributed arbitration by self-selection: each device wanting the

bus places a code indicating its identity on the bus Distributed arbitration by collision detection: Ethernet uses this

331 Lec19.24 Fall 2003

Centralized Parallel Arbitration

Used in essentially all backplane and high-speed I/O busses

BusArbiter

Device 1

Device NDevice 2

Control

Data

Grant1

Grant2

GrantN

ReqReq

Req

331 Lec19.25 Fall 2003

Synchronous and Asynchronous Buses Synchronous Bus

Includes a clock in the control lines A fixed protocol for communication that is relative to the clock Advantage: involves very little logic and can run very fast Disadvantages:

- Every device on the bus must run at the same clock rate

- To avoid clock skew, they cannot be long if they are fast

Asynchronous Bus It is not clocked, so requires handshaking protocol (req, ack)

- Implemented with additional control lines

Advantages:- Can accommodate a wide range of devices

- Can be lengthened without worrying about clock skew or synchronization problems

Disadvantage: slow(er)

331 Lec19.26 Fall 2003

Asynchronous Handshaking Protocol

1. Memory sees ReadReq, reads addr from data lines, and raises Ack

2. I/O device sees Ack and releases the ReadReq and data lines

3. Memory sees ReadReq go low and drops Ack

4. When memory has data ready, it places it on data lines and raises DataRdy

5. I/O device sees DataRdy, reads the data from data lines, and raises Ack

6. Memory sees Ack, releases the data lines, and drops DataRdy

7. I/O device sees DataRdy go low and drops Ack

Output (read) data from memory to an I/O device.

I/O device signals a request by raising ReadReq and putting the addr on the data lines

12

3

ReadReq

Data

Ack

DataRdy

addr data

4

5

6

7

331 Lec19.27 Fall 2003

Key Characteristics of Two Bus Standards

Characteristic PCI SCSIType backplane I/O

Data bus width 32 or 64 8 to 32

Addr/data muxed? multiplexed multiplexed

# of masters multiple multiple

Arbitration centralized self-selection

Clocking synchronous (33 - 66 MHz)

asynchronous

Peak bandwidth 133 - 512 MB/sec 5 MB/sec

Typical bandwidth 80 MB/sec 1.5 MB/sec

Max. devices 32 per bus segment 7 to 31

Max. length 0.5 meters 25 meters

331 Lec19.28 Fall 2003

Review: Major Components of a Computer

Processor

Control

Datapath

Memory

Devices

Input

Output

331 Lec19.29 Fall 2003

SecondLevelCache

(SRAM)

A Typical Memory Hierarchy

Control

Datapath

SecondaryMemory(Disk)

On-Chip Components

RegF

ile

MainMemory(DRAM)

Data

Cache

InstrC

ache

ITLB

DT

LB

eDRAM

Speed (ns): .1’s 1’s 10’s 100’s 1,000’s

Size (bytes): 100’s K’s 10K’s M’s T’s

Cost: highest lowest

By taking advantage of the principle of locality: Present the user with as much memory as is available in the

cheapest technology. Provide access at the speed offered by the fastest technology.

331 Lec19.30 Fall 2003

Characteristics of the Memory Hierarchy

Increasing distance from the processor in access time

L1$

L2$

Main Memory

Secondary Memory

Processor

(Relative) size of the memory at each level

Inclusive– what is in L1$ is a subset of what is in L2$ is a subset of what is in MM that is a subset of is in SM

4-8 bytes (word)

1 block

1,023+ bytes (disk sector = page)

8-32 bytes (block)

331 Lec19.31 Fall 2003

Memory Hierarchy Technologies Random Access

“Random” is good: access time is the same for all locations DRAM: Dynamic Random Access Memory

- High density (1 transistor cells), low power, cheap, slow

- Dynamic: need to be “refreshed” regularly (~ every 8 ms)

SRAM: Static Random Access Memory- Low density (6 transistor cells), high power, expensive, fast

- Static: content will last “forever” (until power turned off)

Size: DRAM/SRAM 4 to 8 Cost/Cycle time: SRAM/DRAM 8 to 16

“Non-so-random” Access Technology Access time varies from location to location and from time to

time (e.g., Disk, CDROM)

331 Lec19.32 Fall 2003

Classical SRAM Organization (~Square)

row

decoder

rowaddress

data word

RAM Cell Array

word (row) select

bit (data) lines

Each intersection represents a 6-T SRAM cell

Column Selector & I/O Circuits

columnaddress

One memory row holds a block of data, so the column address selects the requested word from that block

331 Lec19.33 Fall 2003

data bitdata bit

Classical DRAM Organization (~Square Planes)

row

decoder

rowaddress

Column Selector & I/O Circuits

columnaddress

data bit

word (row) select

bit (data) lines

Each intersection represents a 1-T DRAM cell

The column addressselects the requested bit from the row in

eachplanedata word

. . .

. . .

RAM Cell Array

331 Lec19.34 Fall 2003

RAM Memory Definitions Caches use SRAM for speed

Main Memory is DRAM for density Addresses divided into 2 halves (row and column)

- RAS or Row Access Strobe triggering row decoder- CAS or Column Access Strobe triggering column selector

Performance of Main Memory DRAMs Latency: Time to access one word

- Access Time: time between request and when word arrives- Cycle Time: time between requests- Usually cycle time > access time

Bandwidth: How much data can be supplied per unit time- width of the data channel * the rate at which it can be used

331 Lec19.35 Fall 2003

Classical DRAM Operation

DRAM Organization: N rows x N column x M-bit Read or Write M-bit at a time Each M-bit access requires

a RAS / CAS cycle

Row Address

CAS

RAS

Col Address Row Address Col Address

1st M-bit Access 2nd M-bit Access

N r

ows

N cols

DRAM

M bits

RowAddress

ColumnAddress

M-bit OutputCycle Time

331 Lec19.36 Fall 2003

Ways to Improve DRAM Performance Memory interleaving

Fast Page Mode DRAMs – FPM DRAMs www.usa.samsungsemi.com/products/newsummary/asyncdram/K4F661612D.

htm

Extended Data Out DRAMs – EDO DRAMs www.chips.ibm.com/products/memory/88H2011/88H2011.pdf

Synchronous DRAMS – SDRAMS www.usa.samsungsemi.com/products/newsummary/sdramcomp/K4S641632D.

htm

Rambus DRAMS www.rambus.com/developer/quickfind_documents.html www.usa.samsungsemi.com/products/newsummary/rambuscomp/K4R271669B

.htm

Double Data Rate DRAMs – DDR DRAMS www.usa.samsungsemi.com/products/newsummary/ddrsyncdram/K4D62323H

A.htm

. . .

331 Lec19.37 Fall 2003

Increasing Bandwidth - InterleavingAccess pattern without Interleaving:

Start Access for D1

CPU Memory

Start Access for D2

D1 available

Access pattern with 4-way Interleaving:

CPU

MemoryBank 1

MemoryBank 0

MemoryBank 3

MemoryBank 2

Cycle Time

Access Time

Acc

ess

Ban

k 0

Access Bank 1

Access Bank 2

Access Bank 3

We can Access Bank 0 again

D2 available

331 Lec19.38 Fall 2003

Problems with Interleaving

How many banks? Ideally, the number of banks number of clocks we have to

wait to access the next word in the bank Only works for sequential accesses (i.e., first word requested

in first bank, second word requested in second bank, etc.)

Increasing DRAM sizes => fewer chips => harder to have banks

Growth bits/chip DRAM : 50%-60%/yr

Only can use for very large memory systems (e.g., those encountered in supercomputer systems)

331 Lec19.39 Fall 2003

N r

ows

N cols

DRAM

ColumnAddress

M-bit Output

M bitsN x M “SRAM”

RowAddress

Fast Page Mode DRAM Operation Fast Page Mode DRAM

N x M “SRAM” to save a row

Row Address

CAS

RAS

Col Address Col Address

1st M-bit Access

Col Address Col Address

2nd M-bit 3rd M-bit 4th M-bit

After a row is read into the SRAM “register”

Only CAS is needed to access other M-bit blocks on that row

RAS remains asserted while CAS is toggled

331 Lec19.40 Fall 2003

Why Care About the Memory Hierarchy?

1

10

100

1000

1980

1981

1983

1984

1985

1986

1987

1988

1989

1990

1991

1992

1993

1994

1995

1996

1997

1998

1999

2000

DRAM

CPU19

82

Processor-MemoryPerformance Gap:(grows 50% / year)

Per

form

ance

Time

“Moore’s Law”

Processor-DRAM Memory GapµProc60%/year(2X/1.5yr)

DRAM9%/year(2X/10yrs)

331 Lec19.41 Fall 2003

Memory Hierarchy: Goals Fact: Large memories are slow, fast memories are

small

How do we create a memory that gives the illusion of being large, cheap and fast (most of the time)?

by taking advantage of

The Principle of Locality: Programs access a relatively small portion of the address space at any instant of time.

Address Space0 2n - 1

Probabilityof reference

331 Lec19.42 Fall 2003

Memory Hierarchy: Why Does it Work?

Temporal Locality (Locality in Time):=> Keep most recently accessed data items closer to the

processor

Spatial Locality (Locality in Space):=> Move blocks consists of contiguous words to the upper

levels

Lower LevelMemoryUpper Level

MemoryTo Processor

From ProcessorBlk X

Blk Y

331 Lec19.43 Fall 2003

Memory Hierarchy: Terminology Hit: data appears in some block in the upper level (Block X)

Hit Rate: the fraction of memory accesses found in the upper level Hit Time: Time to access the upper level which consists of

RAM access time + Time to determine hit/miss

Miss: data needs to be retrieve from a block in the lower level (Block Y)

Miss Rate = 1 - (Hit Rate) Miss Penalty: Time to replace a block in the upper level

+ Time to deliver the block the processor Hit Time << Miss Penalty

Lower LevelMemoryUpper Level

MemoryTo Processor

From ProcessorBlk X

Blk Y

331 Lec19.44 Fall 2003

How is the Hierarchy Managed? registers <-> memory

by compiler (programmer?)

cache <-> main memory by the hardware

main memory <-> disks by the hardware and operating system (virtual memory) by the programmer (files)

331 Lec19.45 Fall 2003

Summary DRAM is slow but cheap and dense

Good choice for presenting the user with a BIG memory system

SRAM is fast but expensive and not very dense Good choice for providing the user FAST access time

Two different types of locality Temporal Locality (Locality in Time): If an item is referenced, it will

tend to be referenced again soon. Spatial Locality (Locality in Space): If an item is referenced, items

whose addresses are close by tend to be referenced soon.

By taking advantage of the principle of locality: Present the user with as much memory as is available in the

cheapest technology. Provide access at the speed offered by the fastest technology.

Documents

14:332:331 Computer Architecture and Assembly Language Fall 2003 Week 12 Buses and I/O system