8/14/2019 Historically, The Limiting Factor in a Computers
EE 4504 Section 3 3
Terminology
Capacity: the amount of information that can be contained in a memory unit -- usually in terms of words or bytes
Word: the natural unit of organization in the memory, typically the number of bits used to represent a number
Addressable unit: the fundamental data element size that can be addressed in the memory -- typically either the word size or individual bytes
Unit of transfer: the number of data elements transferred at a time -- usually bits in main memory and blocks in secondary memory
Transfer rate: rate at which data is transferred to/from the memory device
EE 4504 Section 3 4
Access time:
  For RAM, the time to address the unit and perform the transfer
  For non-random access memory, the time to position the R/W head over the desired location
Memory cycle time: access time plus any other time required before a second access can be started
Access technique: how memory contents are accessed
Random access:
  Each location has a unique physical address
  Locations can be accessed in any order, and all access times are the same
  What we term RAM is more aptly called read/write memory, since this access technique applies to ROMs as well
  Example: main memory
EE 4504 Section 3 5
Sequential access:
  Data does not have a unique address
  Must read all data items in sequence until the desired item is found
  Access times are highly variable
  Example: tape drive units
Direct access:
  Data items have unique addresses
  Access is done by moving to a general memory area, followed by a sequential access to reach the desired data item
  Example: disk drives
EE 4504 Section 3 6
Associative access:
  A variation of random access memory
  Data items are accessed based on their contents rather than their actual location
  All memory locations are searched in parallel for a match to a given search pattern, without regard to the size of the memory
  Extremely fast for large memory sizes
  Cost per bit is 5-10 times that of a normal RAM cell
  Example: some cache memory units
EE 4504 Section 3 7
Memory Hierarchy
Major design objective of any memory system:
  To provide adequate storage capacity
  At an acceptable level of performance
  At a reasonable cost
Four interrelated ways to meet this goal:
  Use a hierarchy of storage devices
  Develop automatic space allocation methods for efficient use of the memory
  Through the use of virtual memory techniques, free the user from memory management tasks
  Design the memory and its related interconnection structure so that the processor can operate at or near its maximum rate
EE 4504 Section 3 8
Basis of the memory hierarchy:
  Registers internal to the CPU for temporary data storage (small in number but very fast)
  External storage for data and programs (relatively large and fast)
  External permanent storage (much larger and much slower)
Characteristics of the memory hierarchy:
  Consists of distinct levels of memory components
  Each level characterized by its size, access time, and cost per bit
  Each increasing level in the hierarchy consists of modules of larger capacity, slower access time, and lower cost/bit
Goal of the memory hierarchy:
  Try to match the processor speed with the rate of information transfer from the lowest element in the hierarchy
EE 4504 Section 3 9
The memory hierarchy (fastest/smallest to slowest/largest):
  Registers in the CPU
  Cache
  Main memory
  Disk cache
  Magnetic disk
  Magnetic tape / Optical disk
EE 4504 Section 3 10
Typical memory parameters

Memory Type     Technology          Size         Access Time
Cache           Semiconductor RAM   128-512 KB   10 ns
Main memory     Semiconductor RAM   4-128 MB     50 ns
Magnetic disk   Hard disk           Gigabytes    10 ms, 10 MB/sec
Optical disk    CD-ROM              Gigabytes    300 ms, 600 KB/sec
Magnetic tape   Tape                100s of MB   sec-min, 10 MB/min
EE 4504 Section 3 11
The memory hierarchy works because of locality of reference:
  Memory references made by the processor, for both instructions and data, tend to cluster together
    Instruction loops, subroutines
    Data arrays, tables
  Keep these clusters in high-speed memory to reduce the average delay in accessing data
  Over time, the clusters being referenced will change -- memory management must deal with this
EE 4504 Section 3 12
Example: two-level memory system
  Level 1 access time of 1 us
  Level 2 access time of 10 us
  Average access time = H(1) + (1-H)(10) us, where H is the hit ratio -- the fraction of accesses satisfied by level 1
Figure 4.2 2-level memory performance
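As a quick sketch (not from the slides), the average-access-time formula can be evaluated over a range of hit ratios; times are in microseconds, matching the example above:

```python
def avg_access_time(hit_ratio, t1=1.0, t2=10.0):
    """Two-level average access time (us): hits are served in t1,
    misses in t2, per the simple model on this slide."""
    return hit_ratio * t1 + (1.0 - hit_ratio) * t2

# As H approaches 1, the average approaches the level-1 time
for h in (0.5, 0.9, 0.95, 0.99):
    print(f"H = {h:.2f} -> {avg_access_time(h):.2f} us")
```

At H = 0.5 the average is 5.5 us; at H = 0.99 it is already 1.09 us, which is why locality of reference makes the hierarchy effective.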
EE 4504 Section 3 13
Main Memory
Core memory
  Used in generations 2 and 3
  Magnetic cores (toroids) used to store a logical 0 or 1 state by setting their magnetization (hysteresis loop)
  1 core = 1 bit of storage
  Addressing and sensing wires ran through each core
  Destructive readout
  Obsolete -- replaced in the 1970s by semiconductor memory
EE 4504 Section 3 14
Semiconductor memory
  Typically random access
  RAM: actually read-write memory
  Dynamic RAM
    Storage cell is essentially a transistor acting as a capacitor
    Capacitor charge dissipates over time, causing a 1 to flip to a 0
    Cells must be refreshed periodically to avoid this
    Very high packaging density
  Static RAM: basically an array of flip-flop storage cells
    Uses 5-10x more transistors than a similar dynamic cell, so packaging density is 10x lower
    Faster than a dynamic cell
EE 4504 Section 3 15
Read-Only Memories (ROM)
  Permanent data storage
  ROMs
    Data is wired in during fabrication at the chip manufacturer's plant
    Purchased in lots of 10k or more
  PROMs
    Programmable ROM
    Data can be written once by the user employing a PROM programmer
    Useful for small production runs
  EPROM
    Erasable PROM
    Programming is similar to a PROM
    Can be erased by exposure to UV light
EE 4504 Section 3 16
EEPROMs
  Electrically erasable PROMs
  Can be written to many times while remaining in a system
  Does not have to be erased first
  Individual bytes can be programmed
  Writes require several hundred microseconds per byte
  Used in systems for development, personalization, and other tasks requiring unique stored information
Flash memory
  Similar to EEPROM in using electrical erase
  Fast erasures, block erasures
  Higher density than EEPROM
EE 4504 Section 3 17
Organization
  Each memory chip contains a number of 1-bit cells
    1-, 4-, and 16-million-cell chips are common
  Cells can be arranged as a single bit column (e.g., 4Mx1) or in multiple bits per address location (e.g., 1Mx4)
  To reduce pin count, address lines can be multiplexed with data and/or split into high and low halves
    Trade-off is slower operation
  Typical control lines
    W* (write), OE* (output enable) for write and read operations
    CS* (chip select), derived from external address decoding logic
    RAS*, CAS* (row and column address strobes), used when the address is applied to the chip in 2 halves
EE 4504 Section 3 18
Figure 4.8 256Kx8 memory from 256Kx1 chips
EE 4504 Section 3 19
Figure 4.9 1Mx8 memory from 256Kx1 chips
EE 4504 Section 3 20
Improvements to DRAM
  Basic DRAM design has not changed much since its development in the 1970s
  Cache was introduced to improve performance
    Limited to no gain in performance after a certain amount of cache is implemented
  Enhanced DRAM
    Adds a fast 1-line SRAM cache to the DRAM chip
    Consecutive reads to the same line come from this cache and are thus faster than the DRAM itself
    Tests indicate these chips can perform as well as traditional DRAM-cache combinations
  Cache DRAM
    Uses a larger SRAM cache on the chip as a true multi-line cache
    Can also use it as a serial data stream buffer for block data transfers
EE 4504 Section 3 21
Error Correction
Semiconductor memories are subject to errors
  Hard (permanent) errors
    Environmental abuse
    Manufacturing defects
    Wear
  Soft (transient) errors
    Power supply problems
    Alpha particles -- problematic as feature sizes shrink
Memory systems include logic to detect and/or correct errors
  Width of memory word is increased
  Additional bits are parity bits
  Number of parity bits required depends on the level of detection and correction needed
EE 4504 Section 3 22
Figure 4.10 Basic error detection and correction circuitry
EE 4504 Section 3 23
General error detection and correction
  A single error is a bit flip -- multiple bit flips can occur in a word
  2^M valid data words
  2^(M+K) codeword combinations in the memory
  Distribute the 2^M valid data words among the 2^(M+K) codeword combinations such that the distance between valid words is sufficient to distinguish the error

[Figure: two valid codewords with a chain of invalid codewords between them, each one bit flip apart; 7 bit flips would map the upper valid codeword into the lower one, so such a code can detect up to 6 errors and correct up to 3]
EE 4504 Section 3 24
Single error detection and correction
  For each valid codeword, there will be 2^K - 1 invalid codewords
  2^K - 1 must be large enough to identify which of the M+K bit positions is in error
  Therefore 2^K - 1 >= M + K
    8-bit data requires 4 check bits
    32-bit data requires 6 check bits
  Arrange bits as shown in Figure 4.12
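The condition 2^K - 1 >= M + K can be checked mechanically; a small sketch (Python chosen for illustration) that finds the minimum number of check bits for a given data width:

```python
def min_check_bits(m):
    """Smallest K satisfying 2**K - 1 >= m + K, the single-error-
    correcting condition from the slide (m = data bits)."""
    k = 1
    while 2**k - 1 < m + k:
        k += 1
    return k

for m in (8, 16, 32, 64):
    print(f"{m}-bit data -> {min_check_bits(m)} check bits")
```

This reproduces the slide's figures: 4 check bits for 8-bit data, 6 for 32-bit data.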
EE 4504 Section 3 27
If there are 2^n words in the main memory, then there will be M = 2^n / K blocks in the memory, where K is the block size in words
  M will be much greater than the number of lines, C, in the cache
Every line of data in the cache must be tagged in some way to identify which main memory block it holds
  The line of data and its tag are stored in the cache
Factors in the cache design
  Mapping function between main memory and the cache
  Line replacement algorithm
  Write policy
  Block size
  Number and type of caches
EE 4504 Section 3 28
Mapping functions -- since M >> C, how are blocks mapped to specific lines in the cache?
Direct mapping
  Each main memory block is assigned to a specific line in the cache:
    i = j modulo C
  where i is the cache line number assigned to main memory block j
  If M = 64, C = 4:
    Line 0 can hold blocks 0, 4, 8, 12, ...
    Line 1 can hold blocks 1, 5, 9, 13, ...
    Line 2 can hold blocks 2, 6, 10, 14, ...
    Line 3 can hold blocks 3, 7, 11, 15, ...
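The mapping above is just a modulo operation; a minimal sketch reproducing the M = 64, C = 4 example:

```python
def direct_map_line(block, num_lines):
    """Cache line for a main-memory block under direct mapping: i = j mod C."""
    return block % num_lines

# The M = 64, C = 4 example from the slide
for line in range(4):
    blocks = [b for b in range(64) if direct_map_line(b, 4) == line]
    print(f"Line {line} can hold blocks {blocks[:4]} ...")
```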
EE 4504 Section 3 29
Direct mapping cache treats a main memory address as 3 distinct fields
  Tag identifier
  Line number identifier
  Word identifier (offset)
Word identifier specifies the specific word (or addressable unit) in a cache line that is to be read
Line identifier specifies the physical line in cache that will hold the referenced address
The tag is stored in the cache along with the data words of the line
For every memory reference that the CPU makes, the specific line that would hold the reference (if it has already been copied into the cache) is determined
The tag held in that line is then checked to see if the correct block is in the cache
EE 4504 Section 3 30
Figure 4.18 Direct mapping cache organization
EE 4504 Section 3 31
Example:
  Memory size of 1 MB (20 address bits), addressable to the individual byte
  Cache size of 1K lines, each holding 8 bytes
    Word id = 3 bits
    Line id = 10 bits
    Tag id = 7 bits
  Where is the byte at main memory location $ABCDE stored?
    $ABCDE = 1010101 1110011011 110
    Cache line $39B, word offset $6, tag $55
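The field extraction can be sketched with shifts and masks, using the 7/10/3-bit split from the example above:

```python
def direct_map_fields(addr, word_bits=3, line_bits=10):
    """Split a 20-bit address into (tag, line, word) fields for the
    slide's direct-mapped cache: 1K lines of 8 bytes each."""
    word = addr & ((1 << word_bits) - 1)
    line = (addr >> word_bits) & ((1 << line_bits) - 1)
    tag = addr >> (word_bits + line_bits)
    return tag, line, word

tag, line, word = direct_map_fields(0xABCDE)
print(f"tag=${tag:X}, line=${line:X}, word=${word:X}")  # tag=$55, line=$39B, word=$6
```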
EE 4504 Section 3 32
Advantages of direct mapping
  Easy to implement
  Relatively inexpensive to implement
  Easy to determine where a main memory reference can be found in cache
Disadvantage
  Each main memory block is mapped to a specific cache line
  Through locality of reference, it is possible to repeatedly reference blocks that map to the same line number
  These blocks will be constantly swapped in and out of cache, causing the hit ratio to be low
EE 4504 Section 3 33
Associative mapping
  Let a block be stored in any cache line that is not in use
    Overcomes direct mapping's main weakness
  Must examine each line in the cache to find the right memory block
    Examine the line tag id for each line
    Slow process for large caches!
  Line numbers (ids) have no meaning in the cache
    Parse the main memory address into 2 fields (tag and word offset) rather than 3 as in direct mapping
  Implement cache in 2 parts
    The lines themselves in SRAM
    The tag storage in associative memory
  Perform an associative search over all tags to find the desired line (if it's in the cache)
EE 4504 Section 3 34
Figure 4.20 Associative cache organization
EE 4504 Section 3 35
Our memory example again ...
  Word id = 3 bits
  Tag id = 17 bits
  Where is the byte at main memory location $ABCDE stored?
    $ABCDE = 10101011110011011 110
    Cache line unknown, word offset $6, tag $1579B
Advantages
  Fast
  Flexible
Disadvantage
  Implementation cost
  Example above has an 8 KB cache and requires 1024 x 17 = 17,408 bits of associative memory for the tags!
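The fully associative split is just tag and word offset; a sketch, where shifting the example address right by the 3 word bits yields the 17-bit tag:

```python
def assoc_fields(addr, word_bits=3):
    """Split a 20-bit address into (tag, word) fields for the fully
    associative version of the slide's cache (8-byte lines)."""
    word = addr & ((1 << word_bits) - 1)
    tag = addr >> word_bits
    return tag, word

tag, word = assoc_fields(0xABCDE)
print(f"tag=${tag:X}, word=${word:X}")  # tag=$1579B, word=$6
```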
EE 4504 Section 3 36
Set associative mapping
  Compromise between direct and fully associative mappings that builds on the strengths of both
  Divide the cache into a number of sets (v), each set holding a number of lines (k)
  A main memory block can be stored in any one of the k lines in a set such that
    set number = j modulo v
  If a set can hold X lines, the cache is referred to as an X-way set associative cache
  Most cache systems today that use set associative mapping are 2- or 4-way set associative
EE 4504 Section 3 37
Figure 4.22 Set associative cache organization (caption in the text is misleading -- not 2-way!)
EE 4504 Section 3 38
Our memory example again
  Assume the 1024 lines are 4-way set associative
    1024/4 = 256 sets
  Word id = 3 bits
  Set id = 8 bits
  Tag id = 9 bits
  Where is the byte at main memory location $ABCDE stored?
    $ABCDE = 101010111 10011011 110
    Cache set $9B, word offset $6, tag $157
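And the 4-way set associative split (3-bit word, 8-bit set, 9-bit tag) as a sketch:

```python
def set_assoc_fields(addr, word_bits=3, set_bits=8):
    """Split a 20-bit address into (tag, set, word) fields for the
    4-way set associative version of the slide's cache (256 sets)."""
    word = addr & ((1 << word_bits) - 1)
    set_id = (addr >> word_bits) & ((1 << set_bits) - 1)
    tag = addr >> (word_bits + set_bits)
    return tag, set_id, word

tag, set_id, word = set_assoc_fields(0xABCDE)
print(f"tag=${tag:X}, set=${set_id:X}, word=${word:X}")  # tag=$157, set=$9B, word=$6
```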
EE 4504 Section 3 39
Line replacement algorithms
  When an associative cache or a set associative cache set is full, which line should be replaced by the new line that is to be read from memory?
    Not a problem for direct mapping, since each block has a predetermined line it must use
  Candidates:
    Least recently used (LRU)
    First in, first out (FIFO)
    Least frequently used (LFU)
    Random
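A minimal sketch of one of these policies, LRU, for a single cache set (illustrative only; tags stand in for whole cache lines):

```python
from collections import OrderedDict

class LRUSet:
    """One k-way cache set with least-recently-used replacement."""
    def __init__(self, k):
        self.k = k
        self.lines = OrderedDict()  # tag -> line data, oldest first

    def access(self, tag):
        """Return True on hit, False on miss (filling the line)."""
        if tag in self.lines:            # hit: mark most recently used
            self.lines.move_to_end(tag)
            return True
        if len(self.lines) >= self.k:    # miss with full set: evict LRU line
            self.lines.popitem(last=False)
        self.lines[tag] = None
        return False

s = LRUSet(2)
print([s.access(t) for t in (1, 2, 1, 3, 2)])  # [False, False, True, False, False]
```

The third access hits because tag 1 is still resident; accessing tag 3 then evicts tag 2, the least recently used line.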
EE 4504 Section 3 40
Write policy
  When a line is to be replaced, the original copy of the line in main memory must be updated if any addressable unit in the line has been changed
  Write through
    Any time a word in cache is changed, it is also changed in main memory
    Both copies always agree
    Generates lots of memory writes to main memory
  Write back
    During a write, only change the contents of the cache
    Update main memory only when the cache line is to be replaced
    Causes cache coherency problems -- different values for the contents of an address exist in the cache and the main memory
    Complex circuitry is needed to avoid this problem
EE 4504 Section 3 41
Block / line sizes
  How much data should be transferred from main memory to the cache in a single memory reference?
  Complex relationship between block size and hit ratio, as well as the operation of the system bus itself
  As block size increases:
    Locality of reference predicts that the additional information transferred will likely be used, thus increasing the hit ratio (good)
    Number of blocks that fit in the cache goes down, limiting the total number of blocks in the cache (bad)
    As the block size gets big, the probability of referencing all the data in it goes down (hit ratio goes down) (bad)
  Size of 4-8 addressable units seems about right for current systems
EE 4504 Section 3 42
Number of caches
  Single vs. 2-level
    Modern CPU chips have on-board cache (L1)
      80486 -- 8 KB
      Pentium -- 16 KB
      PowerPC -- up to 64 KB
    L1 provides best performance gains
    Secondary, off-chip cache (L2) provides higher-speed access to main memory
      L2 is generally 512 KB or less -- more than this is not cost-effective
EE 4504 Section 3 43
Unified vs. split cache
  Unified cache stores data and instructions in 1 cache
    Only 1 cache to design and operate
    Cache is flexible and can balance the allocation of space between instructions and data to best fit the execution of the program -- higher hit ratio
  Split cache uses 2 caches -- 1 for instructions and 1 for data
    Must build and manage 2 caches
    Static allocation of cache sizes
    Can outperform a unified cache in systems that support parallel execution and pipelining (reduces cache contention)
EE 4504 Section 3 44
Pentium and PowerPC Cache
In the last 10 years, we have seen the introduction of cache into microprocessor chips
  On-board cache (L1) is supplemented with external fast SRAM cache (L2)
  While off-chip, L2 can provide zero-wait-state performance compared to a relatively slow main memory access
Intel family
  386 had no internal cache
  486 had an 8 KB unified cache
  Pentium has a 16 KB split cache: 8 KB data and 8 KB instruction
  Pentium supports 256 or 512 KB of external L2 cache that is 2-way set associative
PowerPC
  601 had a single 32 KB cache
  603/604/620 have split caches of size 16/32/64 KB
EE 4504 Section 3 45
MESI Cache Coherency Protocol
The MESI protocol provides cache coherency in both the Pentium and the PowerPC
  Stands for Modified, Exclusive, Shared, Invalid
  Implemented with an additional 2-bit field for each cache line
  Becomes interesting in the interactions of the L1 and L2 caches -- each tracks the local MESI status as a line moves from main memory to L2 and then to L1
  PowerPC adds an additional state, A, for "allocated"
    A line is marked as A while its data is being swapped out
EE 4504 Section 3 46
Operation of 2-Level Memory
Recall the goal of the memory system:
  Provide an average access time to all memory locations that is approximately the same as that of the fastest memory component
  Provide a system memory with an average cost approximately equal to the cost/bit of the cheapest memory component
Simplistic approach:
  Ts = H1 x T1 + H2 x (T1 + T2 + Tb21)
  H2 = 1 - H1
  T1, T2 are the access times to levels 1 and 2
  Tb21 is the block transfer time from level 2 to level 1
  Can be generalized to 3 or more levels
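A sketch of the Ts formula above; the timings used are illustrative assumptions, not from the slides:

```python
def two_level_access_time(h1, t1, t2, tb21):
    """Ts = H1*T1 + (1-H1)*(T1 + T2 + Tb21): a hit is served from
    level 1; a miss pays the level-2 access plus the block transfer
    back into level 1."""
    return h1 * t1 + (1.0 - h1) * (t1 + t2 + tb21)

# Assumed example timings: t1 = 10 ns, t2 = 50 ns, block transfer 100 ns
for h1 in (0.80, 0.95, 0.99):
    print(f"H1 = {h1:.2f} -> Ts = {two_level_access_time(h1, 10, 50, 100):.1f} ns")
```

As H1 approaches 1, Ts approaches T1, which is exactly the stated goal of the memory system.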
EE 4504 Section 3 47
External Memory
Magnetic disks
  The disk is a metal or plastic platter coated with magnetizable material
  Data is recorded onto and later read from the disk using a conducting coil, the head
  Data is organized into concentric rings, called tracks, on the platter
    Tracks are separated by gaps
  Disk rotates at a constant speed -- constant angular velocity
    Number of data bits per track is a constant
    Data density is higher on the inner tracks
  Logical data transfer unit is the sector
    Sectors are identified on each track during the formatting process
EE 4504 Section 3 48
Figures 5.1 and 5.2 Disk organization
EE 4504 Section 3 49
Disk characteristics
  Single vs. multiple platters per drive (each platter has its own read/write head)
  Fixed vs. movable head
    Fixed head has a head per track
    Movable head uses one head per platter
  Removable vs. nonremovable platters
    Removable platter can be removed from the disk drive for storage or transfer to another machine
  Data accessing times
    Seek time -- position the head over the correct track
    Rotational latency -- wait for the desired sector to come under the head
    Access time -- seek time plus rotational latency
    Block transfer time -- time to read the block (sector) off of the disk and transfer it to main memory
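These timing components combine into a quick estimate of the time to read one sector; the drive parameters below (9 ms seek, 7200 rpm, 512-byte sectors, 10 MB/s) are illustrative assumptions, not from the slides:

```python
def disk_access_ms(seek_ms, rpm, sector_bytes, transfer_mb_s):
    """Average time (ms) to read one sector: seek time, plus average
    rotational latency (half a revolution), plus the sector transfer."""
    rot_latency = 0.5 * (60_000.0 / rpm)                   # ms per half revolution
    transfer = sector_bytes / (transfer_mb_s * 1e6) * 1e3  # ms to move the sector
    return seek_ms + rot_latency + transfer

print(f"{disk_access_ms(9.0, 7200, 512, 10):.2f} ms")
```

Note that seek and rotational latency dominate: the transfer itself is only about 0.05 ms here, which is why disk access is measured in milliseconds while semiconductor memory is measured in nanoseconds.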
EE 4504 Section 3 50
RAID Technology
  Disk drive performance has not kept pace with improvements in other parts of the system
    Limited in many cases by the electromechanical transport means
  Capacity of a high-performance disk drive can be duplicated by operating many (much cheaper) disks in parallel with simultaneous access
    Data is distributed across all disks
  With many parallel disks operating as if they were a single unit, redundancy techniques can be used to guard against data loss in the unit (due to the aggregate failure rate being higher)
  RAID developed at Berkeley -- Redundant Array of Independent Disks
    Six levels: 0 -- 5
EE 4504 Section 3 51
RAID 0
  No redundancy techniques are used
  Data is distributed over all disks in the array
  Data is divided into strips for actual storage
    Similar in operation to interleaved memory data storage
  Can be used to support high data transfer rates by having the block transfer size be a multiple of the strip
  Can support low response time by having the block transfer size equal a strip -- supports multiple strip transfers in parallel
RAID 1
  All disks are mirrored -- duplicated
    Data is stored on a disk and its mirror
    Read from either the disk or its mirror
    Write must be done to both the disk and mirror
  Fault recovery is easy -- use the data on the mirror
  System is expensive!
EE 4504 Section 3 52
RAID 2
  All disks are used for every access -- disks are synchronized together
  Data strips are small (byte)
  Error-correcting code is computed across all disks and stored on additional disks
  Uses fewer disks than RAID 1 but still expensive
    Number of additional disks is proportional to the log of the number of data disks
RAID 3
  Like RAID 2, but only a single redundant disk is used -- the parity drive
  Parity bit is computed for the set of individual bits in the same position on all disks
  If a drive fails, parity information on the redundant disk can be used to calculate the data from the failed disk on the fly
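The on-the-fly reconstruction works because parity is a byte-wise XOR across the data strips; a minimal sketch:

```python
from functools import reduce

def parity(strips):
    """RAID 3-style parity strip: byte-wise XOR across the data strips."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*strips))

data = [b"\x12\x34", b"\xab\xcd", b"\x0f\xf0"]  # strips on three data drives
p = parity(data)

# If one drive fails, XOR-ing the surviving strips with the parity
# strip regenerates the lost data.
lost = data[1]
recovered = parity([data[0], data[2], p])
print(recovered == lost)  # True
```

Since p = d0 ^ d1 ^ d2, XOR-ing any n-1 strips with p cancels them out and leaves the missing strip.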
EE 4504 Section 3 53
RAID 4
  Access is to individual strips rather than to all disks at once (RAID 3)
  Bit-by-bit parity is calculated across corresponding strips on each disk
  Parity strips are stored on the redundant disk
  Write penalty
    For every write to a strip, the parity strip must also be recalculated and written
    Thus 1 logical write equals 2 physical disk accesses
    The parity drive is always written to and can thus be a bottleneck
RAID 5
  Parity information is distributed across the data disks in a round-robin scheme
  No separate parity disk is needed
EE 4504 Section 3 54
Optical disks
  Advent of CDs in the early 1980s revolutionized the audio and computer industries
  Basic operation
    CD is operated using constant linear velocity
    Essentially one long track spiraled onto the disk
    Track passes under the disk's head at a constant rate -- requires the disk to change rotational speed based on what part of the track you are on
    To write to the disk, a laser is used to burn pits into the track -- write once!
    During reads, a low-power laser illuminates the track and its pits
      In the track, pits reflect light differently than non-pits, thus allowing you to store 1s and 0s
EE 4504 Section 3 55
Master disk is made using the laser
  Master is used to press copies in a mass-production mechanical style
  Cheaper than production of information on magnetic disks
Storage capacity roughly 775 MB, or about 550 3.5-inch floppy disks
Transfer rate standard is 176 KB/second
Only economical for production of large quantities of disks
Disks are removable and thus archival
Slower than magnetic disks
EE 4504 Section 3 56
WORMs -- Write Once, Read Many disks
  User can produce CD-ROMs in limited quantities
  Specially prepared disk is written to using a medium-power laser
  Can be read many times, just like a normal CD-ROM
  Permits archival storage of user information and distribution of large amounts of information by a user
Erasable optical disk
  Combines laser and magnetic technology to permit information storage
  Laser heats an area that can then have its magnetic field orientation changed to alter the stored information
  State of the field can be detected using polarized light during reads
EE 4504 Section 3 57
Magnetic Tape
  The first kind of secondary memory
  Still widely used
    Very cheap
    Very slow
  Sequential access
  Data is organized as records, with physical air gaps between records
  One word is stored across the width of the tape and read using multiple read/write heads
EE 4504 Section 3 58
Summary
Goal of the memory hierarchy is to produce a memory system that has an average access time of roughly the L1 memory and an average cost per bit roughly equal to that of the lowest level in the hierarchy
Range of performance spans 10 orders of magnitude!
Components / levels discussed:
  Cache
  Main memory
  Secondary memory