Caches
Electronic Computers LM
Parallelism
Cache
• For the sake of simplicity let's suppose there is only one level of cache
• The cache is a memory with an access time some orders of magnitude shorter than that of the main memory BUT with a much smaller size. It contains a small (see later) replicated portion of the main memory.
• The CPU, when accessing data, tries FIRST to find them in the cache (hit); only when they are not found there (miss) does it access the main memory
• The cache does not hold single bytes BUT groups of bytes with contiguous addresses (normally 32, 64 or 128 bytes, and in any case "aligned"): each group is called a line
LOCALITY PRINCIPLE (SPATIAL AND TEMPORAL)
WORKING SET
CPU registers
Cache I lev.
Cache II lev.
Cache III lev.
Memory
Disk
Tape
Cache
[Figure: main memory divided into lines (0, 2, 5, m, m+1, n, n+2, ...) of 32-256 bytes each, some of which are replicated in the cache. The processor generated address is split into line number and in-line offset: the line number is used to detect the position of the line in the cache, the offset selects the data within the line.]
Memory access time: >100 clock cycles
Cache access time: 1 to 4 clock cycles
Line number: the address of the lowest byte of a line divided by the line size (lines are aligned). In other words, the line number is the complete address of the first byte of the line without its least significant bits, which are all zeros (alignment!)
Accessed data range: from a single byte to the entire line
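The split of a processor address into line number and in-line offset can be sketched as follows. This is only an illustration: the 64-byte line size and the name `split_address` are assumptions, not part of the slides.

```python
LINE_SIZE = 64  # bytes per line (assumed value; must be a power of 2)

def split_address(address, line_size=LINE_SIZE):
    """Return (line_number, offset) for a byte address."""
    offset_bits = line_size.bit_length() - 1   # log2(line_size)
    line_number = address >> offset_bits       # drop the offset bits
    offset = address & (line_size - 1)         # keep only the offset bits
    return line_number, offset

line_number, offset = split_address(0x12345)
# The first byte of the line has all offset bits equal to zero (alignment).
first_byte_address = line_number * LINE_SIZE
```

Multiplying the line number by the line size gives back the aligned address of the first byte of the line.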
Associative memories (Content Addressable Memories)
• Associative memories: they store BOTH the data lines and the address of the lowest byte of each line (line number - TAG)
• Data are found not through the decoding of the CPU address BUT by means of a parallel comparison between the line numbers (TAGs) of all the cache lines and the MSBs of the CPU address. The comparison can be either successful (hit) or not (miss)
[Figure: associative memory with eight entries, each holding a line number (TAG) and the corresponding data.]
The line size could be one byte only (never used)!
Fully associative cache
[Figure: fully associative cache. Each slot holds a validity bit, a TAG and a line; any memory line (Line 0, Line 1, Line 2, ..., Line k, ..., Line w, Line w+1, ..., Line z) can be stored in any slot.]
• In each slot any memory line can be stored. The TAG is the line number
• For instance: 64 GB memory (36-bit address) and 256-byte lines. In-line offset: 8 bits. TAG = 36 - 8 = 28 bits
The line number is compared with all the cache TAGs. In case of a HIT (and if the validity bit is 1) the requested data are present. The address offset is the position within the line of the first byte of the requested data (which can be a byte, a word, a double word and so on, provided they lie within the line boundary). This organization makes the best use of the cache, but it is very complex because it requires one comparator per slot: a cache with 1024 slots (cache size 1024 x 256 bytes = 256 KB) needs 1024 28-bit comparators, and caches normally have 64K slots or more
The cache size is always a power of 2, as is the line size
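A minimal software model of the fully associative lookup, using the numbers of the example above (36-bit address, 256-byte lines, 28-bit TAG). The class and method names are hypothetical; the loop only stands in for the comparators, which in hardware all work in parallel.

```python
ADDRESS_BITS = 36
OFFSET_BITS = 8                          # 256-byte lines
TAG_BITS = ADDRESS_BITS - OFFSET_BITS    # 28-bit TAG = line number

class FullyAssociativeCache:
    def __init__(self, slots):
        self.valid = [False] * slots
        self.tags = [0] * slots

    def fill(self, slot, address):
        """Store the line containing `address` in the given slot."""
        self.valid[slot] = True
        self.tags[slot] = address >> OFFSET_BITS

    def lookup(self, address):
        """Return the slot holding the line, or None on a miss."""
        tag = address >> OFFSET_BITS
        # One comparison per slot: this is what costs one comparator per slot.
        for slot, stored in enumerate(self.tags):
            if self.valid[slot] and stored == tag:
                return slot
        return None

cache = FullyAssociativeCache(slots=1024)
cache.fill(7, 0x123456789)
```

Any address inside the same 256-byte line hits slot 7; an address in the next line misses.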
Processor generated address: Line number (28 bit) | In line offset (8 bit)
Directly mapped cache
[Figure: directly mapped cache. Each slot holds a validity bit, a TAG and a line.]
In each cache slot only a subset of the memory lines can be stored. For instance, slot 0 can store only the lines whose line number is exactly divisible by the number of slots, slot 1 only those whose line number divided by the number of slots gives remainder 1, and so on. Obviously the initial memory address of the data in each slot is the line number multiplied by the line size.
For instance: 1 MB memory, 64-byte lines, 16K different lines, 14-bit line numbers. If the cache has 128 slots, 16K divided by 128 is 128: each slot can therefore host 128 different lines. Slot 0 hosts lines number 0, 128, 256, etc., slot 1 lines number 1, 129, 257, etc.
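The slot assignment of the example can be checked with a few lines of arithmetic. The sketch below uses the numbers given above (1 MB memory, 64-byte lines, 128 slots); the function names are illustrative only.

```python
MEMORY_SIZE = 1 << 20   # 1 MB
LINE_SIZE = 64          # bytes per line
SLOTS = 128             # cache slots

LINES = MEMORY_SIZE // LINE_SIZE   # number of different memory lines

def slot_of(line_number):
    """Slot of a line: the remainder of line number / number of slots."""
    return line_number % SLOTS

def lines_in_slot(slot, count=3):
    """First `count` line numbers that compete for the given slot."""
    return [slot + k * SLOTS for k in range(count)]
```

Lines that share a slot differ by a multiple of the number of slots, so they can never be in the cache at the same time.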
[Figure: memory lines (Line 0, Line 1, Line 2, ..., Line k, ..., Line w, Line w+1, ..., Line z), each of which can be stored in only one cache slot.]
Directly mapped cache: an example (4-byte lines)
[Figure: a memory of 16 lines (Line 0 to Line 15) mapped onto a directly mapped cache whose slots each hold a validity bit, a TAG and a line.]
Directly mapped cache
The LSBs of the line number indicate the only cache slot where the line can be stored. See the previous example of a processor with a 36-bit address (64 GB) and 256-byte lines (8-bit offset): the line number is 28 bits (how many lines? 2^28 = 2^10 x 2^10 x 2^8). If the cache has 1024 slots (256 KB), the 10 LSBs of the line number (index) indicate the slot where a line must be stored
Address fields: TAG (18 bit) | Slot (10 bit) | In line offset (8 bit)
Only one 10-bit decoder (to detect the involved slot) and only one 18-bit comparator are needed
Not very flexible
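The field split of this example (18-bit TAG, 10-bit index, 8-bit offset) and the single TAG comparison can be sketched as below. The class is a hypothetical model that tracks only validity bits and TAGs, not the data.

```python
OFFSET_BITS, INDEX_BITS, TAG_BITS = 8, 10, 18   # 36-bit address

def split(address):
    """Return (tag, index, offset) of a 36-bit address."""
    offset = address & ((1 << OFFSET_BITS) - 1)
    index = (address >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = address >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

class DirectMappedCache:
    def __init__(self):
        self.valid = [False] * (1 << INDEX_BITS)
        self.tags = [0] * (1 << INDEX_BITS)

    def access(self, address):
        """Return True on a hit; on a miss, load the line into its only slot."""
        tag, index, _ = split(address)
        hit = self.valid[index] and self.tags[index] == tag   # one comparator
        if not hit:
            self.valid[index] = True
            self.tags[index] = tag
        return hit

cache = DirectMappedCache()
a = (0x2AF37 << 18) | (530 << 8) | 0x34   # tag 0x2AF37, slot 530
b = (0x00001 << 18) | (530 << 8)          # different tag, same slot
results = [cache.access(a), cache.access(a), cache.access(b), cache.access(a)]
```

The last access to `a` misses again because `b` evicted it from the shared slot: this is the lack of flexibility mentioned above.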
Processor generated address: Line number (28 bit), whose 10 LSBs are the index | In line offset (8 bit)
Directly mapped cache
[Figure: the processor generated address is split into TAG, slot (index) and offset in line; the index selects one cache slot, whose stored TAG is compared with the address TAG.]
In each slot only one line for each index can be stored
A compromise: n-way set-associative cache
• N-way set-associative: many lines (one per way) for each index
• N comparators for n ways. The width of each comparator is the same as in the directly mapped cache
• In directly mapped caches the data can be provided before the validity and TAG check. In set-associative caches only after the check
• Sometimes speculative mechanisms are used (the data of way 0 are provided first, then the check takes place)
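A rough model of the n-way lookup: the index selects one set and the TAG is compared with the TAGs of all the ways of that set. The class name, the parameters and the geometry used below (256 sets, 4 ways, 256-byte lines) are assumptions chosen for illustration.

```python
class SetAssociativeCache:
    def __init__(self, sets, ways, line_size):
        self.sets, self.ways = sets, ways
        self.offset_bits = line_size.bit_length() - 1
        self.valid = [[False] * ways for _ in range(sets)]
        self.tags = [[0] * ways for _ in range(sets)]

    def _split(self, address):
        line_number = address >> self.offset_bits
        return line_number // self.sets, line_number % self.sets  # tag, index

    def fill(self, address, way):
        tag, index = self._split(address)
        self.valid[index][way] = True
        self.tags[index][way] = tag

    def lookup(self, address):
        """Return the hit way, or None on a miss."""
        tag, index = self._split(address)
        for way in range(self.ways):   # n comparators, one per way
            if self.valid[index][way] and self.tags[index][way] == tag:
                return way
        return None

cache = SetAssociativeCache(sets=256, ways=4, line_size=256)
cache.fill(0x10000, way=0)   # index 0, tag 1
cache.fill(0x20000, way=1)   # index 0, tag 2: same set, different way
```

Two lines with the same index can now coexist, which a directly mapped cache does not allow.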
[Figure: n-way set-associative cache. The processor generated address is split into TAG, slot (index) and offset in block; each way stores TAG and DATA for every index.]
Set-associative cache
[Figure: the address (Tag, Index, Offset) selects one cache block (Status, Tag, Data) in each of Way 0, Way 1, ..., Way n. Hit/miss: TAG check and data selection according to the data type requested by the CPU (byte, word, DW etc.), producing the data word.]
Therefore...
• In a fully associative cache a line can be stored in any slot
• In a directly mapped cache in only one slot, that corresponding to the INDEX
• In a set-associative cache in any way of the slot corresponding to the INDEX
• http://www.ecs.umass.edu/ece/koren/architecture/Cache/default.htm
• http://www.ecs.umass.edu/ece/koren/architecture/Cache/page3.htm
• http://www.ecs.umass.edu/ece/koren/architecture/Cache/frame2.htm
Replacement algorithms
Caches are of limited size, and it is therefore necessary (e.g. in case of a read miss) to select a line to be discarded (overwritten if not modified; written back to memory and then overwritten if modified)
There are basically three possible policies: RAND (Random), LRU (Least Recently Used) and FIFO (First In First Out), with different efficiency and complexity
RAND: in this case the logic network must first detect whether invalid lines are present (and, if so, overwrite one of them); if not, it must select the line to be replaced according to a random number generator (e.g. a shift register fed back through an EX-OR gate). The algorithm can be refined by selecting the non-modified lines first. Although non-optimal, this algorithm is very cost-effective
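The RAND policy can be sketched with a small linear feedback shift register. The 4-bit width and the feedback taps are assumptions (the slide only mentions a shift register fed back through an EX-OR gate); the refinement that prefers non-modified lines is not modelled.

```python
def lfsr_step(state):
    """4-bit shift register: the EX-OR of bits 3 and 2 is shifted in."""
    new_bit = ((state >> 3) ^ (state >> 2)) & 1
    return ((state << 1) | new_bit) & 0xF

def choose_victim(valid, state):
    """Return (way, new_state). Invalid ways are always used first."""
    for way, is_valid in enumerate(valid):
        if not is_valid:
            return way, state
    state = lfsr_step(state)
    return state % len(valid), state

# The register cycles through all 15 non-zero states before repeating.
state, period = lfsr_step(1), 1
while state != 1:
    state, period = lfsr_step(state), period + 1
```

Since the all-zero state never occurs, with 4 ways the choice is slightly biased against way 0; a wider register would reduce the bias.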
Replacement algorithms
STACK: the same network is replicated for each set. When a "hit" occurs the hit way must become the most recent one, and the ways that were more recent than it drop one rank, with no change in their relative order.
Let's suppose there are 4 ways and that all the lines of the set are valid. The way (its number) in position Ra is the most recently hit. The other ways were hit in the past, in the order of their positions.
Na, Nb, Nc, Nd store the four way numbers 0, 1, 2, 3 (obviously not in order! It depends on the history of the set)
Rx (Ra, Rb, Rc, Rd): 2-bit registers
X: hit way number (if any)
Rd stores the number of the least recently hit way. Its line is the candidate for replacement in case of a miss in the set: the way where the read data are stored after the replacement becomes the most recently hit and its number is stored in Ra, while all the other way numbers are shifted right by one position
[Figure: shift register Ra, Rb, Rc, Rd holding the way numbers Na, Nb, Nc, Nd. The hit way number X is compared with the registers by EX-OR gates, and AND gates gate the clock (CLK) of the registers.]
Replacement algorithms
Let's now suppose a HIT for way 2, with the way numbers in the Ri registers being, from left to right, 1, 0, 2 and 3 (way 3 is the replacement candidate in case of a set miss). The shift register shifts right only up to Rc (whose way number is 2), because the clock of Rd is blocked. After the clock the Ri registers store (in sequence) 2, 1, 0, 3: way 3 is still the candidate for replacement, while all the other way numbers are correctly updated, with way 2 as the most recently hit
When a line is invalidated its way number is stored in Rd, and all the way numbers that were hit less recently than the invalidated line are shifted left by one position. The mechanism is symmetrical to the hit mechanism. For instance, suppose that in the situation depicted in the figure (1, 0, 2, 3) way 0 (in register Rb) is invalidated: way 0 is stored in Rd, while way 2 is stored in Rb and way 3 in Rc. To deal with invalidations a symmetrical circuit must be added.
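The behaviour of the shift register network can be mimicked with a list ordered from Ra (most recently hit) to Rd (replacement candidate). This is a software analogy with hypothetical function names, not a description of the circuit.

```python
def on_hit(order, way):
    """The hit way moves to Ra; the ways in front of it shift right."""
    order.remove(way)
    order.insert(0, way)

def on_invalidate(order, way):
    """The invalidated way moves to Rd; the ways behind it shift left."""
    order.remove(way)
    order.append(way)

def victim(order):
    """The way stored in Rd is the replacement candidate."""
    return order[-1]

after_hit = [1, 0, 2, 3]
on_hit(after_hit, 2)                # hit on way 2

after_invalidate = [1, 0, 2, 3]
on_invalidate(after_invalidate, 0)  # way 0 invalidated
```

Both examples start from the register content 1, 0, 2, 3 used in the text.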
[Figure: the same network with Ra, Rb, Rc, Rd = 1, 0, 2, 3 and X = 2.]
Replacement algorithms
COUNTERS: a counter for each way of each set
The counter values correspond to the ranking of the ways for replacement: 0 -> most recently hit, 3 -> least recently hit
In most implementations the counters can be incremented or reset. In case of a hit on a way, the counters with a value lower than that of the hit way are incremented and the counter of the hit way is reset. In case of a miss with replacement, the way whose counter is 3 is selected and then the system behaves as if that way had been hit. In case of invalidation, the counter of the invalidated way becomes 3 and all the counters with a greater value are decremented
Event | Validity before the event (W0 W1 W2 W3) | Initial status (W0 W1 W2 W3) | Final status (W0 W1 W2 W3)
01) Hit Way 0 | 1 1 1 1 | 1 0 3 2 | 0 1 3 2
02) Miss (line fill, Way 2 with count 3 replaced) | 1 1 1 1 | 0 1 3 2 | 1 2 0 3
03) Way 1 invalidated | 1 1 1 1 | 1 2 0 3 | 1 3 0 2
04) Hit Way 0 | 1 0 1 1 | 1 3 0 2 | 0 3 1 2
05) Way 3 invalidated | 1 0 1 1 | 0 3 1 2 | 0 2 1 3
06) Miss (line fill, Way 3 with count 3 replaced) | 1 0 1 0 | 0 2 1 3 | 1 3 2 0
07) Hit Way 2 | 1 0 1 1 | 1 3 2 0 | 2 3 0 1
08) Miss (line fill, Way 1 with count 3 replaced) | 1 0 1 1 | 2 3 0 1 | 3 0 1 2
09) Miss (line fill, Way 0 with count 3 replaced) | 1 1 1 1 | 3 0 1 2 | 0 1 2 3
10) Miss (line fill, Way 3 with count 3 replaced) | 1 1 1 1 | 0 1 2 3 | 1 2 3 0
Notice that the counter algorithm is equivalent to the shift register network: there the age rank is given by the position in the register, here by the counter values
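The three counter rules (hit, miss with replacement, invalidation) are short enough to be simulated directly. The sketch below replays the ten events of the example starting from the counter values 1, 0, 3, 2; the function names are illustrative.

```python
def hit(counters, way):
    """Counters lower than the hit one are incremented; the hit one is reset."""
    old = counters[way]
    for w in range(len(counters)):
        if counters[w] < old:
            counters[w] += 1
    counters[way] = 0

def miss(counters):
    """Replace the way whose counter is maximal, then treat it as a hit."""
    way = counters.index(len(counters) - 1)
    hit(counters, way)
    return way

def invalidate(counters, way):
    """The invalidated way gets the highest count; greater counters drop by 1."""
    old = counters[way]
    for w in range(len(counters)):
        if counters[w] > old:
            counters[w] -= 1
    counters[way] = len(counters) - 1

counters = [1, 0, 3, 2]          # W0 W1 W2 W3
replaced = []
hit(counters, 0)                 # event 01
replaced.append(miss(counters))  # event 02
invalidate(counters, 1)          # event 03
hit(counters, 0)                 # event 04
invalidate(counters, 3)          # event 05
replaced.append(miss(counters))  # event 06
hit(counters, 2)                 # event 07
replaced.append(miss(counters))  # event 08
replaced.append(miss(counters))  # event 09
replaced.append(miss(counters))  # event 10
```

The counters always remain a permutation of 0..3, which is what makes them equivalent to a position in the shift register.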
Replacement algorithms
PSEUDO-LRU (in this example 4 ways)
The 4 ways of the set are indicated by I0, I1, I2 and I3
When a line is invalid it is the one replaced in case of a miss
There are three bits (B0, B1 and B2) for each set
If the last access to the set was to I0 or I1 then B0=1, otherwise B0=0
If the last access to the pair I0:I1 was to I0 then B1=1, otherwise B1=0
If the last access to the pair I2:I3 was to I2 then B2=1, otherwise B2=0
In case of replacement
According to B0 the cache first selects which pair (I0:I1 or I2:I3) was least recently accessed, then selects within the pair the way to be replaced according to B1 or B2
B0=0 ?
  Yes: (I0:I1) least recently accessed. B1=0 ? Yes: replace I0. No: replace I1
  No: (I2:I3) least recently accessed. B2=0 ? Yes: replace I2. No: replace I3
The algorithm is only pseudo-optimal because I1 could be the least recently accessed way of the whole set but be «masked» by I0 if I0 is the most recently accessed.
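A small model of the three-bit scheme shows the masking effect. The class name is hypothetical and invalid lines are not modelled; the bits follow the rules given above.

```python
class PseudoLRU4:
    def __init__(self):
        self.b0 = self.b1 = self.b2 = 0

    def access(self, way):
        """Update B0, B1, B2 after an access to way 0..3 (I0..I3)."""
        if way in (0, 1):
            self.b0 = 1
            self.b1 = 1 if way == 0 else 0
        else:
            self.b0 = 0
            self.b2 = 1 if way == 2 else 0

    def victim(self):
        """Follow the decision tree: first the pair, then the way."""
        if self.b0 == 0:                      # I0:I1 least recently accessed
            return 0 if self.b1 == 0 else 1
        return 2 if self.b2 == 0 else 3       # I2:I3 least recently accessed

plru = PseudoLRU4()
for way in (1, 2, 3, 0):   # I1 is the true least recently used way
    plru.access(way)
chosen = plru.victim()
```

After the accesses I1, I2, I3, I0 a true LRU would replace I1, but the last access to I0 sets B0=1 and the scheme picks I2 instead.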
Replacement algorithms
FIFO
In this implementation there is a single counter for each set which, starting from 0, is incremented at each read miss (that is, at each replacement). The new line is inserted in the way pointed to by the counter.
This algorithm has an anomaly: it does not take invalidations into account. If the counter has value 3 and the line in way 2 is invalidated, way 2 and not way 3 should be used at the next read miss. Although suboptimal, this algorithm has a very good cost/effectiveness ratio.
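The FIFO counter and its blind spot for invalidations can be shown in a few lines. The class name and the wrap-around of the counter after the last way are assumptions of this sketch.

```python
class FifoSet:
    def __init__(self, ways=4):
        self.ways = ways
        self.counter = 0            # points to the next way to be replaced

    def replace(self):
        """Return the way to fill on a read miss and advance the counter."""
        way = self.counter
        self.counter = (self.counter + 1) % self.ways
        return way

first_five = FifoSet()
order = [first_five.replace() for _ in range(5)]

fifo = FifoSet()
for _ in range(3):
    fifo.replace()                  # ways 0, 1, 2 filled; counter is now 3
valid = [True, True, False, True]   # way 2 has just been invalidated
used = fifo.replace()               # the counter ignores the validity bits
```

The last replacement overwrites the valid way 3 although the invalid way 2 was free.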
http://www.ecs.umass.edu/ece/koren/architecture/Cache/frame1.htm
http://www.ecs.umass.edu/ece/koren/architecture/PReplace/
What is then a TLB?
[Figure: Processor, TLB, Cache and RAM. Labels: virtual address, physical address, hit, miss, data.]
• The TLB is a cache which provides memory addresses (physical addresses) instead of memory data: it is addressed by the processor virtual addresses
• The TLB access time is similar to that of the 1st level cache. In modern processors the TLB (like the caches) has two levels
• NB: a processor could (nowadays only theoretically) be non-paged. In this case the TLB does not exist, since the virtual addresses are also the physical addresses
• Like the caches, the TLB can be fully associative, directly mapped or set-associative, with the same replacement problems. TLBs are normally 8-16 way set-associative with 64-1024 slots
• http://www.ecs.umass.edu/ece/koren/architecture/CacheTLB/index.html
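The role of the TLB as a cache of address translations can be sketched as follows. The 4 KB page size, the dictionary used as page table and the unlimited capacity (no replacement) are simplifying assumptions, not details from the slides.

```python
PAGE_SIZE = 4096   # assumed page size

class TLB:
    def __init__(self):
        self.entries = {}   # virtual page number -> physical page number

    def translate(self, virtual_address, page_table):
        """Return (physical_address, hit). On a miss the page table is read."""
        vpn, offset = divmod(virtual_address, PAGE_SIZE)
        hit = vpn in self.entries
        if not hit:
            self.entries[vpn] = page_table[vpn]   # slow path: page table walk
        return self.entries[vpn] * PAGE_SIZE + offset, hit

page_table = {5: 42}                 # virtual page 5 -> physical page 42
tlb = TLB()
first = tlb.translate(5 * PAGE_SIZE + 0x123, page_table)
second = tlb.translate(5 * PAGE_SIZE + 0x456, page_table)
```

The second access to the same page hits in the TLB and never touches the page table.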
Virtually addressed caches
[Figure: the processor accesses the cache directly with the virtual address; a hit returns the data, a miss goes to RAM.]
• In this case one level of address indirection (the TLB or, in the worst case, the page tables) is spared, but:
• Different virtual addresses can be mapped to the same physical address (for different processes), so a process tag must be inserted in the cache, and the same data could be duplicated in several slots. When the data of a line change in RAM or in cache, all the slots containing the line must be invalidated, which implies complex hardware. Otherwise the whole data cache must be flushed upon a context switch.
Cache write
How are lines dealt with in case of a write miss? Is the line read (with a possible replacement) and then written?
Two possible policies:
Yes: write-allocate
No: no-write-allocate
In case of write-allocate the operation is a read/replacement followed by a write to the line in cache.
In the other case the data are written to the following cache level (if there is one and it contains the line, otherwise to memory)
N.B.: Write operations are MUCH less frequent than read operations, and their addresses are very likely to be sparse.
Cache write
When a write hit occurs, must the data also be written to the following cache levels?
Two policies
Yes: write-through
No: write-back
In the first case the line is overwritten and the data are also written to the following cache levels (down to the memory).
In the second case the line is overwritten without forwarding the data to the next cache level (except for coherency problems, see later). When a line that has been overwritten must be replaced, it must first be written back to the following cache level, since the data in the first level are more recent. The data traffic is much smaller (less bandwidth is used) but the hardware is more complex
It must be underlined that a line is a consistent data structure, and therefore even when a single byte has been modified the entire line must be written back.
All modern processors use the write-back policy with the "write once" system, which will be explained later
The write-back policy implies that each line must have a bit indicating whether the line has been modified (dirty bit)
Posted write
Very often, in order to reduce the impact of the access time, the posted-write technique is used
[Figure: Processor/Cache -> FIFO -> RAM]
• Data to be written to RAM are inserted in a FIFO write buffer, which is accessed by the processor (or by the cache, in case of a write-back for replacement) with no delay. The memory controller then transfers the data from the buffer to the memory at the memory speed (much lower)
• Normally the FIFO has 4-32 slots. When the FIFO is full, the processor (or the cache) is delayed.
• NB: When the write buffer is used, the cache read system must first check whether the requested data are in the FIFO
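The three points above (posting without delay, stalling when full, checking the buffer on reads) can be illustrated with a toy write buffer. The names, the 4-slot size and the dictionary standing in for the RAM are assumptions of this sketch.

```python
from collections import deque

class WriteBuffer:
    def __init__(self, slots=4):
        self.slots = slots
        self.fifo = deque()                  # pending (address, data) writes

    def post(self, address, data):
        """Queue a write. False means the FIFO is full: the writer must wait."""
        if len(self.fifo) == self.slots:
            return False
        self.fifo.append((address, data))
        return True

    def drain_one(self, memory):
        """Memory controller side: retire the oldest write at memory speed."""
        if self.fifo:
            address, data = self.fifo.popleft()
            memory[address] = data

    def read(self, address, memory):
        """A read must see the newest posted write before looking in memory."""
        for pending_address, data in reversed(self.fifo):
            if pending_address == address:
                return data
        return memory.get(address)

memory = {0x10: 1}
buffer = WriteBuffer(slots=4)
buffer.post(0x10, 2)
seen_before_drain = buffer.read(0x10, memory)   # served from the FIFO
stale_in_memory = memory[0x10]
buffer.drain_one(memory)
```

Without the check in `read`, the reader would get the stale value still stored in memory.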
Coherency
• Caches have coherency problems
• This means that the system must provide the most recent data to any system «agent» (processor, DMA, graphics processor, ...) upon a read request.
• The coherency problem arises not only between the caches of the processors of a multiprocessor system but also between the different cache levels of the same processor.
• For the sake of simplicity let's assume that all the processors of the same multiprocessor system have two cache levels (L1 and L2) and a common memory. In most cases L2 is bigger than L1 (the cache directly connected to the processor). Let's suppose the caches are inclusive, that is, if a line is present in L1 it is also present in L2 (but not vice versa).
• The presented mechanism can easily be extended to the case of n-level caches
Coherency policies: READ
How can we guarantee that an external agent (not the processor) reads from memory the most recent version of the data (the data in memory could be stale, that is «old»)? Let's consider the write policies
Write-through
For each processor data write (whether or not the data are present in cache) the data are also written to memory: coherency is therefore guaranteed, but the system is slowed down by the memory access time.
Posted write-through
Similar to the previous case. The processor efficiency is improved (the processor is not normally delayed by the memory access time). No access is allowed to the external agent until the data are written to memory (not easy to implement and not very efficient)
Write-back
In this case the memory is updated only when necessary (e.g. upon a replacement). For each external agent access the cache (or caches) must be checked to verify whether it stores the requested data; if so, the agent memory access must be blocked until the requested data have been forcibly written back to memory. This is the cache snoop mechanism
Coherency policies: WRITE
What happens when another agent wants to write data to memory?
Write-through
The cache controller must monitor the system bus and invalidate in cache the lines (if any) containing the data overwritten in memory by the agent (lines which until then were coherent)
Write-back
The cache controller must monitor the system bus, and in case of an agent attempt to write it must perform the following operations:
a) If the data are present in cache in a line (or lines) in modified state, the controller must stop the agent memory access, trigger a write-back of the modified line and then invalidate the line (lines). Notice that the write-back is needed because a line is made of several bytes and there is no way of detecting which byte (or bytes) were modified: the new master could write bytes different from those which were modified
b) If the line data are not modified, upon a write from another master the line must only be invalidated in cache.
Two-level caches: coherency policies
L1 and L2 write-through
For each processor write (whether the data are present in cache or not) the data are written down to the memory. This obviously has a great impact on the bus, the most important bottleneck. The write operation by L2 could be deferred. In case of a write by another agent the data are invalidated (if present) in both cache levels
L1 write-through and L2 write-back
In this case L2 must monitor the bus, and when another agent tries to read data it must first write back to memory the modified data, if any (the data are in any case the same in L2 and in L1, if present in L1). In case of a write access by another agent, the modified data must first be written back to memory and then invalidated in both caches.
N.B. The processor has no way of determining whether a secondary cache is present. The signals exchanged with the system must be the same whether a secondary cache exists or not. The same applies to the secondary cache if a third-level cache is present.
How can we guarantee that another agent reads from memory the most up-to-date data (if the same data were also in cache, the corresponding data in memory could be «stale», that is «older» than those in cache)?
Two-level caches: coherency policies
L1 and L2 both write-back
When the processor reads data (line fill) upon a miss in L1, L2 checks whether it stores the requested data. If so, the data are transferred to L1 (with a possible replacement). If the data are present in L2, this means that they are «cacheable». If the data are not available in L2, they are requested from the memory controller (MC). If the data are «cacheable», a line fill takes place in both L2 and L1. If not, the data are simply read by the processor.
In case of a processor write operation with both L1 and L2 write-back there are many cases, which depend on whether the system is mono- or multiprocessor: in any case the system must provide the most up-to-date data when they are requested
MESI PROTOCOL
M.E.S.I. (monoprocessor - write back)
M - modified (L1 and L2)
The requested line is available in the cache, where it was modified without a write-back downstream (downstream is L2 for L1 and memory for L2). The considered cache stores the updated data. Notice that if a line is in modified state in both L1 and L2, the line in L1 is more up to date than the same line in L2. A write operation triggers a transition from M to M without a downstream write
E - exclusive (L1 and L2)
The considered line is present and identical to the same line in the downstream device (L2 for L1 and memory for L2). A write operation triggers a state change from E to M without a downstream write. (Careful: the name can be misleading)
S - shared (state possible only for L1 in a monoprocessor system)
The line is present in L1 (S), L2 (E) and memory. A write operation triggers a downstream write, upon which the L1 state becomes E and the L2 state changes from E to M (no memory write: see state E). In monoprocessor systems L2 is never in shared state because there are no agents which need to be informed of the state of the internal lines of the (single) processor (which is not the case in multiprocessor systems)
I - invalid (L1 and L2)
The requested line is not available in the cache
N.B. The lines of a code cache can only be in S or I state
At system start-up all the lines in all the caches are invalid
Possible states of the same line
Monoprocessor case (with two-level caches)
L2 | L1
I | I
M | E, M or not present
E | S or not present
NB: "Not present": the line is not present in L1 (with inclusive caches a line can be in L2 without being in L1, but not vice versa). L2 is never shared in monoprocessor systems! L1 is always in a state which is related to the state of L2. A line cannot be in M state in L1 if it is not in M state in L2
31
Coherency policies(L1 and L2 both write back)
In case of monoprocessor systems a line-fill when data are not present both in L1 and L2. L1 state becomes S and L2 state E.
A successive write operation to L1 triggers a state change of L1 to E and L2 to M (the L1 written data are also written to L2). Data are not writte to memory
A successive write operation affects only L1 whose state becomes M.
NB: Since the size of L2 is bigger than the size of L1 it is possible (because of replacements) that a line is not present in L1 but in L2 only either in E or M state. A line fill, therefore in L1 stores the line in L1 respectively in S or E state.
In the following slides we assume that all caches are inclusive. The MESI protocol is however applicable also to other cases
Coherency policies (monoprocessor)
The following cases apply to external agents without a private cache (e.g. a DMA controller) accessing memory
Read operation. If the L2 line containing the requested data is in E state, then the same line (if present) is in S state in L1. The memory data are therefore the most up to date.
Write operation. It triggers an "enquiry" by the Memory Controller to L2. If the line containing the data is present in L2 in E state, then the same line (if present) is in S state in L1: the line is invalidated in both L1 and L2. If the line is in M state in L2, L2 must check whether the line is present in L1 in M state. In any case the most recently updated version of the line is written back to memory and the line is invalidated in both caches. The external agent can then write its data to memory.
PROCESSOR READ COHERENCY
No external caches
Monoprocessor
1) Miss in L1 but not in L2. Line fill from L2 to L1. The L1 state depends on the L2 state: if L2 is exclusive, L1 becomes shared; if L2 is modified, L1 becomes exclusive. A line can never be present in L1 and not in L2
2) Miss in L1 and L2 -> double line fill. L1 -> shared and L2 -> exclusive
N.B. Why must L1 -> S if L2 is exclusive? Because, if L1 were in exclusive state, a write would cause no write to L2 (L1 E->M); a memory enquiry would then find that the requested data in L2 are identical to those in memory (although stale) and no further enquiry on L1 would take place. A read or write by an external agent would operate on the memory data without a write-back of the L1 data (the most recent data)
PROCESSOR WRITE COHERENCY
No external caches
Monoprocessor
1) Miss in L1 and L2: line fill from memory into L2 (->E) and L1 (->S), then write to both caches (L1->E and L2->M)
2) Miss in L1 but not in L2. Line fill from L2 into L1, then write. If L2 is in E state L1->S, otherwise L1->E (L2 can only be in E or M state). Final states as per point 1
3) L1 hit. Three cases (the line is surely in L2 too)
a) L1 shared (and therefore L2 exclusive). Write to L1 and L2. L2->M and L1->E.
b) L1 exclusive (and therefore necessarily L2 modified). Write to L1 only. L1->M
c) L1 modified (and therefore L2 modified): write to L1 only. L1 remains in M state
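Points 1 and 3 above can be written down as a small state table. The sketch is only a paraphrase of the listed transitions: the function name and the action strings are hypothetical, and point 2 (miss in L1 with the line already in L2) is deliberately left out.

```python
# (L1 state, L2 state) -> (new L1 state, new L2 state, action) for a write hit
WRITE_HIT = {
    ("S", "E"): ("E", "M", "write to L1 and L2"),   # case 3a
    ("E", "M"): ("M", "M", "write to L1 only"),     # case 3b
    ("M", "M"): ("M", "M", "write to L1 only"),     # case 3c
}

def processor_write(l1, l2):
    """Monoprocessor write, both caches write-back (points 1 and 3 only)."""
    if (l1, l2) == ("I", "I"):
        # Point 1: double line fill (L2 -> E, L1 -> S), then the write itself.
        l1, l2 = "S", "E"
    if (l1, l2) not in WRITE_HIT:
        raise ValueError(f"combination not modelled: L1={l1}, L2={l2}")
    return WRITE_HIT[(l1, l2)]

# Three consecutive writes to a line that is initially in neither cache.
history = []
state = ("I", "I")
for _ in range(3):
    l1, l2, action = processor_write(*state)
    history.append((l1, l2, action))
    state = (l1, l2)
```

Only the first write reaches L2; later writes stay in L1, and memory is never written.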
External agent READ/WRITE coherency
Cacheless external agent
External agent READ
1) Miss in L1 and L2, or hit in L1 or L2 with both not modified: NOP
2) Hit in L1 modified (and therefore L2 modified): L1 write back to memory and L2. L1->S and L2->E
3) Hit in L2 modified (and L1 exclusive or line not present in L1): L2 write back to memory. L2->E and L1 (if any) ->S
External agent WRITE
4) Miss in L1 and L2: NOP
5) Hit in L2 and possibly in L1, both not modified: L2->I and L1 (if any) ->I
6) Hit both in L2 and L1 (both modified): write back to memory of L1, then L1->I and L2->I
M.E.S.I. (multiprocessor)
M - modified
The line is present only in the caches of one processor, and in the specified cache it was modified without being written back to the downstream device (it is different from the same line in the downstream device). The line can be read and written without any downstream cycle.
E - exclusive
The line is present only in the caches of one processor and its content is identical to that of the downstream device. The line can be read and written without any downstream cycle. A processor write operation causes a transition to M state.
S - shared
The line is possibly in the caches of many processors. (Possibly, because it could have been present, for instance, in two processors, one of which has since replaced the line.) A write operation causes a downstream write and invalidates the line in the caches of the other processors, if any.
I - invalid
The requested line is not available in the cache. A read operation causes a LINE FILL. A write operation causes a WRITE-THROUGH in case of a no-write-allocate policy, otherwise a line fill followed by the write operation
Possible states of the same line
Multiprocessor case (with two-level caches)
L2 | L1
I | I
M | E, M or not present
E | S or not present
S | S or not present
In case of multilevel caches a lower level cache stores a reduced set of the lines of the upper level (inclusive caches). But not always (non-inclusive caches)!
READ COHERENCY
Multiprocessor (only L1 and L2)
1) Miss in L1 but not in L2.
• When L2 is shared or exclusive, the line read into L1 becomes shared
• When L2 is modified, the line read into L1 becomes exclusive.
NB: Similar to the monoprocessor case, but notice that in this case it is possible that both L1 and L2 are shared (while in a monoprocessor L2 is NEVER in S state)
READ COHERENCY
Multiprocessor
2) Miss in L1 and L2. Bus snoop
• When the line is in neither cache a double line fill occurs. If the line is not present in the caches of any other processor, the line read is in shared state in L1 and in exclusive state in L2
• When the line is present, not modified, in some other caches (that is, in shared or exclusive state), upon the snoop they all become shared. The line is read into L1 and L2: in both caches of the requesting processor (as in all the caches of the other processors) the state becomes shared
• If the line is present in the caches of only one processor and is in modified state (a line can be in modified state in ONLY one processor!): back-off on the bus, write-back of the line to memory, and the state of the hit caches becomes shared. The line is read into L1 and L2: in both caches the state becomes shared. Notice that if a line is in modified state in an L1 it is in modified state in the corresponding L2 too!
N.B. A bus snoop is a snoop on L2, which is forwarded to L1 if L2 is in modified state
WRITE COHERENCY
Multiprocessor
1) Miss in L1 and L2. Three cases
a) The line is not in the caches of other processors: as for the monoprocessor
b) The line is present, not modified, in other caches: all the caches containing the line are invalidated. Read into L1 and L2 and then write; final state L1 exclusive and L2 modified
c) The line is present in another processor (only one!) in modified state. Bus back-off, the modified line is written back to memory and the caches storing the line are invalidated (do not forget that both L1 and L2 can be in modified state). The modified line must first be written back because it is not known which data of the line will be rewritten. Then as in the case of the monoprocessor. In any case at the end of the operation L2 is modified and L1 exclusive.
2) Miss in L1 but not in L2. The line stored in L2 is forwarded to L1 and then written. Three cases
a) L2 exclusive. No bus snoop. The line is written in L2 and L1. At the end L2 modified and L1 exclusive.
b) L2 modified. At the end L2 modified and L1 modified
c) L2 shared. Bus snoop with invalidation, read into L1 and L2 and then write operation. At the end L1 exclusive and L2 modified
WRITE COHERENCY
3) Hit in L1 (and therefore in L2). Three cases:
a) L1 modified. Only L1 is written
b) L1 exclusive. Only L1 is written. L1 modified
c) L1 shared. Two cases:
I. L2 is shared. Bus snoop with invalidation, then write on L1 and L2. L1 exclusive and L2 modified
II. L2 exclusive. No bus snoop. Write on L1 and L2. L1 exclusive and L2 modified
N.B. There are no cases with L1 shared and L2 modified
Other coherency policies
"Directory based" coherency protocol
• The total memory is the sum of the local memories of the processors (accessible also from the other processors) and the common memory. There is therefore a single addressing system for all the memories
• Information about memory lines is stored in a directory associated with the block
• Each directory stores the information about the line and the processors whose caches (if any) store the line
• Each line can be in one of the following states
Shared: the caches of one or more processors hold the line, coherent with the memory
Non cached: no processor cache holds the line
Modified: only one processor cache holds the (modified) line. In this case the processor is the owner of the line
[Figure: four nodes, each with a processor and its caches, possibly multilevel (P1/C1 ... P4/C4), a local memory (M), I/O and a directory (D), connected to the common memory with its memory controller (MC) and directory (D).]
Directory based protocol
• In the line directory there is a bit for each processor, which is 1 if the processor cache stores the line. Two or more 1's mean that the line is in shared state. A single 1 means that there is a possible owner of the line (the line could be in modified state). If a line is modified in a cache, a message is sent to the directory, which sends a message to invalidate the other caches. In case of a read a message is sent to the owner (if any), which must write back its modified data, and the line becomes shared. In case of a write: write-back and invalidation of the previous owner (if any).
• The transitions are similar to those of MESI but the implementation is different. This system is very useful when there are multiple connections between the processors, because it reduces the global use of the busses.
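A directory entry with one presence bit per processor can be modelled as below. This is a loose sketch: the class, the explicit `modified` flag and the message tuples are assumptions made to keep the example small, and real protocols involve more message types.

```python
class DirectoryEntry:
    def __init__(self, processors):
        self.present = [False] * processors   # one bit per processor
        self.modified = False                 # True: the single holder is the owner

    def state(self):
        if not any(self.present):
            return "non cached"
        return "modified" if self.modified else "shared"

    def read(self, p):
        """Processor p reads the line; return the messages the directory sends."""
        messages = []
        if self.modified and not self.present[p]:
            owner = self.present.index(True)
            messages.append(("write back", owner))   # owner updates the memory
            self.modified = False
        self.present[p] = True
        return messages

    def write(self, p):
        """Processor p writes the line and becomes its owner."""
        messages = []
        for q, has_line in enumerate(self.present):
            if has_line and q != p:
                if self.modified:
                    messages.append(("write back", q))
                messages.append(("invalidate", q))
                self.present[q] = False
        self.present[p] = True
        self.modified = True
        return messages

entry = DirectoryEntry(processors=4)
log = [entry.read(0), entry.read(1), entry.write(2), entry.read(3)]
```

Messages go only to the processors whose bit is set, which is how the scheme avoids broadcasting on the busses.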
[Figure: the same four-node system, showing a request for a line L (L = line) and the messages exchanged between the Local, Home and Remote nodes.]
Caches
• There are two types of caches: unified and not-unified. Not-unified means that data and instructions are kept in separate caches. Unified means that they share the same cache
• In general, in modern processors the first level caches are not-unified (Harvard architecture). The other levels are unified
R2000/R3000 - RISC (real DLX)
[Figure: five-stage pipeline IF, ID, EX, MEM, WB with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB; separate I-Cache and D-Cache in front of the memory (Harvard architecture)]
Pipeline split-cycle (phases Φ1 and Φ2)
IF   Φ1: Virtual address translation (TLB)
     Φ2: I-Cache access
ID   Φ1: If hit, instruction read and parity check
     Φ2: Registers read; if branch, destination address computing
EX   Φ1: Start ALU execution; if branch, check condition
     Φ2: End ALU; if load/store, virtual address translation (TLB)
MEM  Φ1: D-Cache access (if write)
     Φ2: Data from D-Cache (if read)
WB   Φ1: Register file write
     Φ2: --
Normally the branch condition is tested at the end of EX (when two other instructions have already started). In this case the test occurs in phase 1 (Φ1) of EX and the I-cache is addressed in phase 2 (Φ2), so the penalty is only one instruction (the one in the ID stage)
[Figure labels: each stage split into Φ1 and Φ2; branch feedback path; data; physical addresses]
Web site
https://www.cs.tcd.ie/Jeremy.Jones/vivio/vivio.htm
This address offers some interesting animated views of cache behaviour
Branch Target Buffer
In order to avoid stalls caused by branches, a branch prediction is necessary in the first stage of the pipeline. The prediction can be either correct or wrong. In any case the branch is tested in the execution stage.
[Figure: the PC is compared with the BTB entries; each entry holds PC address (TAG), destination address and T/U (taken/untaken) bits]
The BTB is a cache whose TAGs are the addresses of the instructions detected as branches. The line in this case is the branch destination address, and among the status bits there are those which predict whether the branch is taken or untaken
In case of a miss (detected in the execution stage) a line fill occurs and a replacement procedure is activated. The initial prediction is the outcome that occurred in the execution stage
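The lookup and fill behaviour described above can be sketched as a small associative table. This is a toy model, not a description of any real processor: the class name, the 4-byte instruction size, the FIFO replacement and the single prediction bit (last outcome) are all assumptions made for illustration.

```python
class BranchTargetBuffer:
    """Toy BTB: TAG = branch PC, line = destination address + T/U bit."""

    def __init__(self, n_entries=4):
        self.n_entries = n_entries
        self.entries = {}   # branch PC -> [destination address, predicted taken?]

    def predict(self, pc):
        """First-stage lookup: predicted next PC, or None on a miss."""
        entry = self.entries.get(pc)
        if entry is None:
            return None
        destination, taken = entry
        # 4-byte instructions are assumed for the fall-through address
        return destination if taken else pc + 4

    def resolve(self, pc, destination, taken):
        """Execution-stage update with the verified outcome of the branch."""
        if pc not in self.entries and len(self.entries) >= self.n_entries:
            oldest = next(iter(self.entries))   # FIFO replacement (an assumption)
            del self.entries[oldest]
        # On a fill, the initial prediction is the outcome just observed
        self.entries[pc] = [destination, taken]
```

A miss returns `None`, meaning the fetch stage simply continues sequentially until the branch is resolved in the execution stage.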
[Figure: Branch Target Buffer producing a taken/untaken prediction]
Branch prediction
How is a prediction managed? On a statistical basis?
• Simple case: static prediction. The prediction is always «taken». The error probability with this policy, according to SPEC benchmarks, is 34% (fairly high)
• Static prediction according to the direction of the branch (forward or backward)
  • In this case the prediction is taken for backward branches (see loops) and untaken for forward branches
  • In SPEC benchmarks, however, the majority of branches are forward and are taken, therefore the «always taken» prediction gives better results
• Dynamic prediction on the basis of the history of the branch
  • The prediction error varies between 5% and 22%
Branch Target Buffer
With only one prediction bit, which records the last verified outcome of the branch.
In this case for loop1 there are two successive prediction errors
[Figure: two loops, Loop1 and Loop2]
When loop2 ends (predicted as taken, but actually untaken) there is an error, followed by a second one because in the first following iteration loop1 will be predicted as untaken
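The double error of the one-bit scheme can be reproduced with a short simulation. This is a minimal sketch: the function name and the branch trace (a loop-closing branch taken three times and then untaken at the exit) are hypothetical.

```python
def one_bit_mispredictions(outcomes, initial_prediction=True):
    """Count the errors of a predictor that repeats the last verified outcome."""
    prediction = initial_prediction
    errors = 0
    for taken in outcomes:
        if prediction != taken:
            errors += 1
        prediction = taken   # the single bit records the last outcome
    return errors


# Loop-closing branch: taken 3 times, then untaken at loop exit.
one_run = [True, True, True, False]
first_run_errors = one_bit_mispredictions(one_run)
three_runs_errors = one_bit_mispredictions(one_run * 3)
```

With this trace the first run of the loop costs one error (the exit), while every later run costs two: the exit, and the first iteration of the next run, which is predicted as untaken. Three runs therefore give five errors in total.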
Branch Target Buffer
Normally two bits are used. Two possible schemes:
[Figure: two four-state diagrams, one per scheme, with transitions labelled TAKEN and UNTAKEN]
In one scheme the prediction is changed after two «mispredictions» (low pass filter)
In the other scheme the prediction is changed after two «mispredictions», but the predictor is ready to go back to the previous prediction in case of a further change
With both schemes the accuracy is higher than 80%
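A common way to realise a two-bit scheme is a saturating counter. The sketch below assumes that encoding (0 and 1 predict untaken, 2 and 3 predict taken); whether it matches either of the two diagrams exactly is an assumption, and the function name and trace are hypothetical.

```python
def two_bit_mispredictions(outcomes, counter=3):
    """2-bit saturating counter: 0,1 predict untaken; 2,3 predict taken."""
    errors = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            errors += 1
        # move towards 3 on taken, towards 0 on untaken, saturating at both ends
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return errors


# Same loop-closing branch as a one-bit predictor would see: T, T, T, U repeated.
loop_errors = two_bit_mispredictions([True, True, True, False] * 3)

# Starting from "strongly taken", two mispredictions are needed to flip the prediction.
flip_errors = two_bit_mispredictions([False, False, False, False])
```

On the repeated loop trace the counter makes only one error per run (the loop exit), because a single untaken outcome is not enough to change the prediction.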
Simulator
http://www.ecs.umass.edu/ece/koren/architecture/BTBuffer/project.html
Advanced algorithms for BTB
Two-level adaptive prediction
Two structures: BHR (Branch History Register) and PHT (Pattern History Table)
First case: global approach
Ex: BHR (shift register) = 0 0 1 0 1 1 1 0 (content = 2Eh)
PHT:
(00)  0 0 1 0 1
(01)  0 1 1 0 0
(2E)  1 1 1 0 0
(FF)  1 0 1 0 1
1 -> Branch taken; 0 -> Branch not taken
History of the most recent n (8 in this example) branches: what really happened, that is whether each branch was verified as taken or untaken
What was predicted with the same global succession (BHR)?
Decision: taken or untaken
In this example the content of the BHR is 2Eh = 46 (decimal)
Advanced algorithms for BTB
In case of a branch, the most recent succession of events is analysed (whether each branch was really taken or untaken). For each configuration of this succession a pattern is selected which reflects the decisions taken with that configuration. After each branch execution the BHR is shifted right and the resulting outcome is stored in it
A function must be defined which predicts the branch according to the contents of the BHR and the PHT
This prediction system (which uses n + (2**n x m) flip-flops (FF), where n is the size of the BHR and m that of each PHT slot) is not particularly accurate because it makes no distinction among branches. Effective but not very precise.
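A minimal sketch of the global approach follows. The slides leave the prediction function open, so this example assumes m = 2 and uses each PHT slot as a 2-bit saturating counter; the class name and that choice of function are assumptions, not the method of the slides.

```python
class GlobalTwoLevelPredictor:
    """One n-bit BHR shared by all branches, indexing a PHT of 2**n slots."""

    def __init__(self, n=8):
        self.n = n
        self.bhr = 0                     # global history, one bit per branch outcome
        self.pht = [3] * (2 ** n)        # assumed: 2-bit counters, initially "taken"

    def predict(self):
        """Prediction for the next branch, given the current global history."""
        return self.pht[self.bhr] >= 2

    def update(self, taken):
        """Record the verified outcome in the selected PHT slot and in the BHR."""
        c = self.pht[self.bhr]
        self.pht[self.bhr] = min(c + 1, 3) if taken else max(c - 1, 0)
        # shift the BHR right and store the outcome in the most significant bit
        self.bhr = (self.bhr >> 1) | (int(taken) << (self.n - 1))
```

With n = 4 and a single branch repeating the pattern taken, taken, taken, untaken, the history preceding the untaken outcome is always the same, so after two errors the corresponding slot learns it and the pattern is then predicted without further errors.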
Advanced algorithms for BTB
Second case: mixed predictor
In this case the history is kept in a table (BHT) of K shift registers of n bits each (one BHR for each branch), while there is only one PHT.
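[Figure: the branch address selects one of the K registers (n bits each) of the BHT, K being the number of branches considered; the register content points to one of the 2**n slots (m bits each) of the single PHT; different registers can point to the same slot]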
Advanced algorithms for BTB
N.B.: registers related to different branches can point to the same PHT register
In this case too there is a lack of consistency: while the history of each branch is different the originating pattern is the same
Used FFs: k x n + (2**n x m)
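The sharing of PHT slots among branches can be illustrated with a small sketch. Everything specific here is an assumption: the class name, the mapping from branch address to BHT register (address divided by 4, modulo k) and the use of 2-bit counters as PHT slots.

```python
class PerBranchHistoryPredictor:
    """K per-branch history registers of n bits, one PHT shared by all of them."""

    def __init__(self, k=16, n=4):
        self.k, self.n = k, n
        self.bht = [0] * k               # one n-bit shift register per branch
        self.pht = [3] * (2 ** n)        # single PHT, assumed 2-bit counters

    def _register(self, pc):
        # hypothetical mapping from branch address to BHT register
        return (pc >> 2) % self.k

    def predict(self, pc):
        return self.pht[self.bht[self._register(pc)]] >= 2

    def update(self, pc, taken):
        r = self._register(pc)
        history = self.bht[r]
        c = self.pht[history]
        self.pht[history] = min(c + 1, 3) if taken else max(c - 1, 0)
        self.bht[r] = (history >> 1) | (int(taken) << (self.n - 1))
```

If the branch at address 0x100 is untaken twice, its history stays at 0 and PHT slot 0 drops to "untaken". A different branch at 0x104 that has never executed also has history 0, so it is now predicted untaken as well: its own history is separate, but the pattern it points to is the same.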
Advanced algorithms for BTB
Third case: homogeneous predictor
A (complex) refinement of the second case
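Required FFs: k x n + (2**n x m x k)
[Figure: the branch address selects one of the k registers (n bits each) of the BHT; each of the k branches has its own PHT of 2**n slots of m bits]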
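The three flip-flop formulas given for the global, mixed and homogeneous predictors can be compared numerically. In the sketch below the function name and the scheme labels are hypothetical; n = 8 is taken from the BHR example, m = 5 from the width of the PHT rows shown there, and k = 16 is an arbitrary illustrative value.

```python
def flip_flops(n, m, k=1, scheme="global"):
    """Storage cost in flip-flops for the three two-level schemes."""
    pht = (2 ** n) * m                  # one PHT: 2**n slots of m bits
    if scheme == "global":
        return n + pht                  # n + (2**n x m)
    if scheme == "mixed":
        return k * n + pht              # k x n + (2**n x m)
    if scheme == "homogeneous":
        return k * n + pht * k          # k x n + (2**n x m x k)
    raise ValueError(f"unknown scheme: {scheme}")


costs = {s: flip_flops(n=8, m=5, k=16, scheme=s)
         for s in ("global", "mixed", "homogeneous")}
```

With these values the costs are 1288, 1408 and 20608 flip-flops respectively: the per-branch history registers are cheap, while replicating the PHT for each branch dominates the cost of the third scheme.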