9.1
CS356 Midterm II
Review
Reminder on Page Faults
Consequences
To remember:
● TLB ⇒ MM (not the reverse)
● CACHE ⇒ MM (not the reverse)
● TLB and CACHE are independent
Here MM means: page in main memory (hit in the page table, PT)
9.5
Caches
Memory: addresses of m bits ⇒ M = 2^m memory locations
Cache:
● S = 2^s cache sets
● Each set has K lines
● Each line has: a data block of B = 2^b bytes, a valid bit, and t = m − (s + b) tag bits
How to check if the word at an
address is in the cache?
9.6
Caches: Data Access
FF1E  1111 1111 0001 1110
FF3E  1111 1111 0011 1110
FF4E  1111 1111 0100 1110
9.7
Caches: Data Access
9.8
Caches: Data Access
9.9
Caches: Data Access
Average Access Time = (Hit Time) + (Miss Ratio) ⨯ (Miss Penalty)
9.10
Caches: Data Access
8.11
Cache Operation Example
Access Trace
– R: 0x00a0
– W: 0x00f4
– R: 0x00b0
– W: 0x2a2c
Possible Operations: Hit, or fetch block X (possibly with “evict block Y” and “WB of Y” if Y is dirty)
Break down each address and decide the operations for a
K=2-way set-associative cache with N=4 blocks, B=32 bytes/block (S = N/K = 2 sets)

Address  Tag           Set  Byte Offset
0x00a0   0000 0000 10  1    0 0000
0x00f4   0000 0000 11  1    1 0100
0x00b0   0000 0000 10  1    1 0000
0x2a2c   0010 1010 00  1    0 1100

Access      Cache Operation
R: 0x00a0   Fetch block 00a0-00bf
W: 0x00f4   Fetch block 00e0-00ff
R: 0x00b0   Hit
W: 0x2a2c   Evict 00e0-00ff with WB, fetch block 2a20-2a3f
Done!       Final WB of 2a20-2a3f

Set 1 after each access (LRU)
9.12
Caches: Trace Simulation
You are asked to optimize a cache capable of storing 8 bytes total for the given references. There are three direct-mapped cache designs possible by varying the block size:
● C1 has one-byte blocks
● C2 has two-byte blocks
● C3 has four-byte blocks
In terms of miss rate, which cache design is best?
If the miss stall time is 25 cycles, and C1 has an access time of 2 cycles, C2 takes 3 cycles, and C3 takes 5 cycles, which is the best cache design?
Trace (LSB)
1    0000 0001
134  1000 0110
212  1101 0100
1    0000 0001
135  1000 0111
213  1101 0101
162  1010 0010
161  1010 0001
2    0000 0010
44   0010 1100
41   0010 1001
221  1101 1101
9.13
Caches: Trace Simulation on C1
Trace (each entry is the set index followed by m for miss or h for hit)
MEM  LSB        C1     C2    C3
1    0000 0001  1m     0m    0m
134  1000 0110  6m     3m    1m
212  1101 0100  4m     2m    1m
1    0000 0001  1h     0h    0h
135  1000 0111  7m     3h    1m
213  1101 0101  5m     2h    1m
162  1010 0010  2m     1m    0m
161  1010 0001  1m     0m    0h
2    0000 0010  2m     1m    0m
44   0010 1100  4m     2m    1m
41   0010 1001  1m     0m    0m
221  1101 1101  5m     2m    1m
m_rate:         11/12  9/12  10/12
9.14
Caches: Trace Simulation on C2
9.15
Caches: Trace Simulation on C3
Address breakdown
● C1 has no block offset, 3-bit set address
● C2 has 1-bit block offset, 2-bit set address
● C3 has 2-bit block offset, 1-bit set address
How to run a trace: extract set address (3, 2, 1 bits)
from LSB; on miss, load (1, 2, 4) bytes.
Running C3:
● Get 1: miss. Put bytes 0-3 in bucket 0.
● Get 134: miss. Put 132-135 in bucket 1.
● Get 212: miss. Put 212-215 in bucket 1.
● Get 1: hit.
● Get 135: miss. Put 132-135 in bucket 1.
● Get 213: miss. Put 212-215 in bucket 1.
● Get 162: miss. Put 160-163 in bucket 0.
● Get 161: hit.
9.16
Loop over a Matrix, by row
Example: each cache line holds 4 array elements
stored in registers (temporal locality)
hopefully in cache (spatial locality)
9.17
Loop over a Matrix, by col
Example: each cache line holds 4 array elements
stored in registers (temporal locality)
hopefully in cache (spatial locality)
9.18
Single-Level Page Tables
points to a different table for each process
9.19
Single-Level Page Tables
9.20
Single-Level Page Tables
9.21
Single-Level Page Tables
8-bit virtual addresses, 10-bit physical addresses, 32-byte pages
● Physical address of virtual address 0x2D? 00101101 => 0 0011 1100 1101
● Physical address of virtual address 0x7A? 01111010 => 0 0000 1101 1010
● Physical address of virtual address 0xEF? 11101111 => page fault (entry 7 is not valid)
● Physical address of virtual address 0xA8? 10101000 => 0 0011 1110 1000
Index Valid PPN
0 0 0x0E
1 1 0x1E
2 1 0x16
3 1 0x06
4 0 0x0B
5 1 0x1F
6 0 0x15
7 0 0x0A
9.22
Multi-level Page Tables
9.23
Page Table with 3 Levels
9.24
Page Table with 3 Levels
9.25
Translation Lookaside Buffer
A k-level page table requires k memory accesses in the worst case.
Idea: cache address mappings inside the CPU (10 ns hit time).
● VPN is the cache tag, PPN is the entire cache block
● High degree of associativity (4-way or fully-associative: low miss rate)
● Usually smaller than data cache (fast lookup, low hit time)
Average Access Time = (Hit Time) + (Miss Rate) ⨯ (Miss Penalty)
9.26
Example: 2-way set-associative TLB
16-bit virtual and physical addresses, 256-byte pages
● Physical address of virtual address 0x7E85 == 0111 1110 1000 0101
● Virtual address of physical address 0x3020 == 0011 0000 0010 0000
Set Index  Valid Bit  Tag   PPN
0          1          0x13  0x30
0          0          0x34  0x58
1          0          0x1F  0x80
1          1          0x2A  0x72
2          1          0x1F  0x95
2          0          0x20  0xAA
3          1          0x3F  0x20
3          0          0x3E  0xFF
9.27
TLB == Subset of Page Table
9.28
Addressing: Cache, VM, TLB
9.29
Addressing: Cache, VM, TLB
9.30
Structs: Offsets in assembly
Assume 4-byte int / float, 8-byte long / double.
Can you figure out the offsets for %rdi ?
struct record_t {
    char a[2];
    int b;
    long c;
    int d[3];
    short e;
};
void initialize(struct record_t *x) {
    x->a[1] = 1;
    x->b = 2;
    x->c = 3;
    x->d[1] = 4;
    x->e = 5;
}
initialize:
    movb $1, 1(%rdi)
    movl $2, 4(%rdi)
    movq $3, 8(%rdi)
    movl $4, 20(%rdi)
    movw $5, 28(%rdi)
    ret
Layout (8 bytes per row, "-" marks padding):
a  a  -  -  b  b  b  b
c  c  c  c  c  c  c  c
d0 d0 d0 d0 d1 d1 d1 d1
d2 d2 d2 d2 e  e  -  -
9.31
struct B { // this struct must start/end at a multiple of 4, because that's required by 'y'
char x; // 1 byte
int y; // 4 bytes (needs 3 bytes of padding before to start at a multiple of 4)
char z; // 1 byte (needs 3 bytes of padding after to end at a multiple of 4)
};
struct A {
char a; // 1 byte
struct B b; // has 4-byte alignment: 3 bytes of padding before 'b'
char c; // also 3 bytes of padding before 'c', so that 'b' ends at a multiple of 4
};
void init(struct A *a) {
a->a = 1;
a->b.x = 2;
a->b.y = 3;
a->b.z = 4;
a->c = 5;
}
$ gcc -fomit-frame-pointer -mno-red-zone -Og -S align.c; cat align.s | grep mov
movb $1, (%rdi)
movb $2, 4(%rdi)
movl $3, 8(%rdi)
movb $4, 12(%rdi)
movb $5, 16(%rdi)
Nested Structs
Layout (8 bytes per row, "-" marks padding):
a - - - x - - -
y y y y z - - -
c - - -
We still want each member of the nested struct to start at a multiple of its size, but where should the nested struct itself start?
Its start/end should have the largest alignment required by its members.
9.32
Struct Alignment
9.33
Unions
• Unions allow you to read/write the same memory region as variables with different types
– All elements start at offset 0
– The size of the union is simply the size of the biggest member
– Elements must be POD (plain old data) or at least default-constructible
union Data1 {
    int x;
    char y;
};
union Data2 {
    short w;
    char *p;
};
int main() {
    union Data1 item;
    item.x = 0x356;
    item.y = 'a';
}
Data1: x and y both start at offset 0
Data2 (w/o padding): w spans offsets 0 to 2, p starts at offset 0
After item.x = 0x356: item 56 03 00 00
After item.y = 'a':   item 61 03 00 00
Recall x86 uses little-endian
CS:APP 3.9.2
9.34
Unions: Revealing Endianness
• 4-byte union
• x reads/writes an int
• bytes reads/writes 4 consecutive char
Note that the bytes are stored in reversed order (least significant byte first)
#include <stdio.h>
union int_bytes {
    int x;
    unsigned char bytes[4]; // unsigned, so bytes >= 0x80 print as two hex digits
};
int main() {
    union int_bytes ib;
    ib.x = 256;
    // print the bytes from highest address to lowest
    printf("%08X is %02X %02X %02X %02X\n", ib.x,
           ib.bytes[3], ib.bytes[2], ib.bytes[1], ib.bytes[0]);
    return 0;
}
// prints:
// 00000100 is 00 00 01 00
9.35
Return-oriented Programming
9.36
Return-oriented Programming
9.37
Return-oriented Programming
A64.38
Arithmetic/Logic Operations
A different style!
• Always operate between registers/immediates (not memory)
• Three operands: OP dst, src1, src2 means dst = src1 OP src2
Examples
x1 0x11111111 11111111 (initial state)
x2 0x22002200 22002200
x3 0x33003300 00330033
• “add x1, x2, x3” assigns x2+x3 to x1, like “x1 = x2 + x3”
x1 0x55005500 22332233 after add x1,x2,x3
• “add w1, w2, w3” assigns w2+w3 to w1, sets MS 32 bits to 0
x1 0x00000000 22332233 after add w1,w2,w3
A64.39
Arithmetic/Logic Operations
Instruction                          Example                  Effect
Add immediate                        add x0, x1, 1            x0 = x1 + 1
Add register                         add x0, x1, x2           x0 = x1 + x2
Add shifted register                 add x0, x1, x2, lsl 10   x0 = x1 + (x2 << 10)
Subtract                             sub x0, x1, x2           x0 = x1 - x2
Subtract shifted                     sub x0, x1, x2, lsl 10   x0 = x1 - (x2 << 10)
Negate                               neg x0, x1               x0 = -x1
Multiply                             mul x0, x1, x2           x0 = x1 * x2
Multiply-add                         madd x0, x1, x2, x4      x0 = x1 * x2 + x4
Divide signed                        sdiv x0, x1, x2          x0 = x1 / x2
Divide unsigned                      udiv x0, x1, x2          x0 = x1 / x2
Logical shift left                   lsl x0, x1, x2           x0 = x1 << (x2 % 64)
Logical shift right                  lsr x0, x1, x2           x0 = x1 >> (x2 % 64)
Arithmetic shift right               asr x0, x1, x2           x0 = x1 (signed) >> (x2 % 64)
Rotate right (bits from LSB to MSB)  ror x0, x1, x2           x0 = x1 >>> (x2 % 64)
Bitwise AND                          and x0, x1, x2           x0 = x1 & x2
Bitwise OR                           orr x0, x1, x2           x0 = x1 | x2
Bitwise XOR                          eor x0, x1, x2           x0 = x1 ^ x2
Bitwise NOT (“move not”)             mvn x0, x1               x0 = ~x1
A64.40
Flexible Operands
Shift or Rotate Operand2
• add, sub, bitwise ops (and, orr, eor, mvn) and move (mov) allow optional shift or rotation of the last argument:
OP dst, src1, src2, LSL n means dst = src1 OP (src2 << n)
OP dst, src1, src2, LSR n means dst = src1 OP (src2 >> n)
OP dst, src1, src2, ASR n means dst = src1 OP (src2 s>> n)
OP dst, src1, src2, ROR n means dst = src1 OP (src2 >>> n) (bitwise ops only; A64 add and sub do not accept ROR)
Example
x1 0x11111111 11111111 (initial state)
x2 0x22002200 22002200
x3 0x33003300 00330033
• “add x1, x2, x3, lsl 32” assigns x2+(x3<<32) to x1
x1 0x22332233 22002200 after add x1,x2,x3,lsl 32
• “orr w1, w2, w3, ror 8” assigns w2|(w3>>>8) to w1, sets MS 32 bits to 0
x1 0x00000000 33003300 after orr w1,w2,w3,ror 8
A64.41
Simple Arithmetic: A64 vs x64
// a64
add:
/* return value in w0 */
/* arguments in w0, w1 */
add w0, w0, w1
ret
multadd:
add w0, w0, 10
add w0, w0, w1, lsl 3
ret
bias:
mov w2, 1
lsl w2, w2, w1
sub w2, w2, #1
and w2, w2, w0, asr 31
add w0, w2, w0
ret
// C
int add(int x, int y) {
return x + y;
}
int multadd(int x, int y) {
return 10 + x + 8*y;
}
int bias(int x, int k) {
int bias = (1 << k) - 1;
int mask = x >> 31;
return x + (mask & bias);
}
// x64
add:
leal (%rdi,%rsi), %eax
ret
multadd:
leal 10(%rdi,%rsi,8), %eax
ret
bias:
movl $1, %eax
movl %esi, %ecx
sall %cl, %eax
subl $1, %eax
movl %edi, %edx
sarl $31, %edx
andl %edx, %eax
addl %edi, %eax
ret
A64.42
Memory Load/Store
A different style!
• In x86-64: mov to read/write memory, suffix must match
• A64 has dedicated instructions without size suffix (size inferred from the register name)
– ldr x1, [x2] to load into x1 the 8 bytes at address x2
– str x1, [x2] to store 8 bytes of x1 to address x2
• Additional instructions to load/store register pairs
– ldp x0, x1, [x2] 8 bytes at x2 => x0, 8 bytes at x2+8 => x1
– stp x0, x1, [x2] write x0 to address x2, x1 to address x2+8
• Moves are only between registers
– mov x0, x1 is equivalent to x0 = x1
A64.43
Working with Pointers: A64 vs x64
// a64
get:
ldr w0, [x0]
ret
set:
str w1, [x0]
ret
// C
int get(int *ptr) {
return *ptr;
}
void set(int *ptr, int x) {
*ptr = x;
}
// x64
get:
movl (%rdi), %eax
ret
set:
movl %esi, (%rdi)
ret
A64.44
Calling Procedures
• Arguments in x0, .., x7 then on the stack
• Return value in x0
• Caller-save registers x0 to x18
• Callee-save registers x19 to x29
• Callee saves link register x30 if it invokes a procedure...
Call and Return Mechanisms
• Branch with link (bl) sets the link register x30 to PC+4
– PC is the address of the current instruction (each is 4 bytes)
• Return (ret) jumps to the address in x30 (lr)
– Can also use “ret x0” or with any other register
Calling Conventions