
9.1

CS356 Midterm II

Review

Reminder on Page Faults

Reminder on Page Faults

Consequences

To remember:

● TLB ⇒ MM (not the reverse)

● CACHE ⇒ MM (not the reverse)

● TLB and CACHE are independent

MM means: page in MM (hit in PT)

9.5

Caches

Memory: addresses of m bits ⇒ M = 2^m memory locations

Cache:

● S = 2^s cache sets

● Each set has K lines

● Each line has: data block of B = 2^b bytes, valid bit, t = m − (s+b) tag bits

How to check if the word at an address is in the cache?
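One way to answer the question above is to split the address into tag, set index, and block offset, then compare the tag against every valid line of the selected set. The sketch below assumes a hypothetical geometry (s = 1, b = 5, K = 2); the type and function names are illustrative, not taken from the slides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed geometry: S = 2^s sets, K lines per set, B = 2^b bytes per block. */
enum { S_BITS = 1, B_BITS = 5, K_WAYS = 2 };
static_assert(S_BITS + B_BITS < 32, "a 32-bit address must leave room for tag bits");

struct line {
    bool valid;
    uint32_t tag;
};

struct cache {
    struct line sets[1 << S_BITS][K_WAYS];
};

/* Block offset: the low b bits of the address. */
uint32_t addr_offset(uint32_t addr) { return addr & ((1u << B_BITS) - 1); }

/* Set index: the next s bits above the offset. */
uint32_t addr_set(uint32_t addr) { return (addr >> B_BITS) & ((1u << S_BITS) - 1); }

/* Tag: the remaining t = m - (s + b) high bits. */
uint32_t addr_tag(uint32_t addr) { return addr >> (S_BITS + B_BITS); }

/* Hit if some line of the selected set is valid and its tag matches. */
bool cache_contains(const struct cache *c, uint32_t addr) {
    const struct line *set = c->sets[addr_set(addr)];
    for (int way = 0; way < K_WAYS; way++) {
        if (set[way].valid && set[way].tag == addr_tag(addr)) return true;
    }
    return false;
}
```

With this geometry, address 0x00a0 splits into tag 0x2, set 1, offset 0.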

9.6

Caches: Data Access

FF1E  1111 1111 0001 1110

FF3E  1111 1111 0011 1110

FF4E  1111 1111 0100 1110

9.7

Caches: Data Access

9.8

Caches: Data Access

9.9

Caches: Data Access

Average Access Time = (Hit Time) + (Miss Ratio) ⨯ (Miss Penalty)

9.10

Caches: Data Access

8.11

Cache Operation Example

Access Trace

– R: 0x00a0

– W: 0x00f4

– R: 0x00b0

– W: 0x2a2c

Possible Operations: Hit, or fetch block X (possibly with “evict block Y” and “WB of Y” if Y is dirty)

Break down address and decide operations for

K=2-way set-associative, N=4 lines, B=32 bytes/block (S = N/K = 2 sets)

Address  Tag           Set  Byte Offset

0x00a0   0000 0000 10  1    0 0000

0x00f4   0000 0000 11  1    1 0100

0x00b0   0000 0000 10  1    1 0000

0x2a2c   0010 1010 00  1    0 1100

Access     Cache Operation

R: 0x00a0  Fetch block 00a0-00bf

W: 0x00f4  Fetch block 00e0-00ff

R: 0x00b0  Hit

W: 0x2a2c  Evict 00e0-00ff with WB, fetch block 2a20-2a3f

Done!      Final WB of 2a20-2a3f

Set 1 after each access (LRU)

R: 0x00a0  00a0-00bf

W: 0x00f4  00a0-00bf, 00e0-00ff (dirty)

R: 0x00b0  00a0-00bf, 00e0-00ff (dirty, now LRU)

W: 0x2a2c  00a0-00bf, 2a20-2a3f (dirty)
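The operations in this example can be replayed with a small simulator. The sketch below assumes a write-back, write-allocate policy with LRU replacement (as the example implies); the struct layout, the timestamp-based LRU, and the function names are illustrative choices, not the course's reference implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { SETS = 2, WAYS = 2, BLOCK_BYTES = 32 };
static_assert((BLOCK_BYTES & (BLOCK_BYTES - 1)) == 0, "block size must be a power of two");

struct line {
    bool valid, dirty;
    uint32_t block;      /* block number = address / BLOCK_BYTES */
    unsigned last_used;  /* timestamp of the most recent access, for LRU */
};

struct cache {
    struct line lines[SETS][WAYS];
    unsigned clock;
    unsigned hits, fetches, writebacks;
};

/* Perform one read or write; returns true on a hit. */
bool cache_access(struct cache *c, uint32_t addr, bool is_write) {
    uint32_t block = addr / BLOCK_BYTES;
    struct line *set = c->lines[block % SETS];
    struct line *victim = &set[0];
    c->clock++;

    for (int w = 0; w < WAYS; w++) {
        if (set[w].valid && set[w].block == block) {  /* hit */
            set[w].last_used = c->clock;
            set[w].dirty |= is_write;
            c->hits++;
            return true;
        }
        /* Prefer an invalid line; otherwise the least recently used one. */
        if (victim->valid && (!set[w].valid || set[w].last_used < victim->last_used))
            victim = &set[w];
    }

    if (victim->valid && victim->dirty) c->writebacks++;  /* evict with WB */
    c->fetches++;
    *victim = (struct line){ .valid = true, .dirty = is_write,
                             .block = block, .last_used = c->clock };
    return false;
}

/* Write back every dirty block still resident at the end of the trace. */
void cache_flush(struct cache *c) {
    for (int s = 0; s < SETS; s++)
        for (int w = 0; w < WAYS; w++)
            if (c->lines[s][w].valid && c->lines[s][w].dirty) {
                c->lines[s][w].dirty = false;
                c->writebacks++;
            }
}
```

Starting from a zero-initialized `struct cache` and feeding it the four accesses gives three fetches, one hit, and one write-back on eviction; `cache_flush` then accounts for the final write-back of 2a20-2a3f.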

9.12

Caches: Trace Simulation

You are asked to optimize a cache capable of storing 8 bytes total for the given references. There are three direct-mapped cache designs possible by varying the block size:

● C1 has one-byte blocks

● C2 has two-byte blocks

● C3 has four-byte blocks

In terms of miss rate, which cache design is best?

If the miss stall time is 25 cycles, and C1 has an access time of 2 cycles, C2 takes 3 cycles, and C3 takes 5 cycles, which is the best cache design?

Trace (LSB)

1 0000 0001

134 1000 0110

212 1101 0100

1 0000 0001

135 1000 0111

213 1101 0101

162 1010 0010

161 1010 0001

2 0000 0010

44 0010 1100

41 0010 1001

221 1101 1101

9.13

Caches: Trace Simulation on C1

Trace

MEM LSB C1 C2 C3

1 0000 0001 1m 0m 0m

134 1000 0110 6m 3m 1m

212 1101 0100 4m 2m 1m

1 0000 0001 1h 0h 0h

135 1000 0111 7m 3h 1m

213 1101 0101 5m 2h 1m

162 1010 0010 2m 1m 0m

161 1010 0001 1m 0m 0h

2 0000 0010 2m 1m 0m

44 0010 1100 4m 2m 1m

41 0010 1001 1m 0m 0m

221 1101 1101 5m 2m 1m

m_rate: 11/12 9/12 10/12

9.14

Caches: Trace Simulation on C2

Trace

MEM LSB C1 C2 C3

1 0000 0001 1m 0m 0m

134 1000 0110 6m 3m 1m

212 1101 0100 4m 2m 1m

1 0000 0001 1h 0h 0h

135 1000 0111 7m 3h 1m

213 1101 0101 5m 2h 1m

162 1010 0010 2m 1m 0m

161 1010 0001 1m 0m 0h

2 0000 0010 2m 1m 0m

44 0010 1100 4m 2m 1m

41 0010 1001 1m 0m 0m

221 1101 1101 5m 2m 1m

m_rate: 11/12 9/12 10/12

9.15

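Caches: Trace Simulation on C3

Address breakdown

● C1 has no block offset, 3-bit set address

● C2 has 1-bit block offset, 2-bit set address

● C3 has 2-bit block offset, 1-bit set address

How to run a trace: extract the set address (3, 2, 1 bits) from the LSBs, just above the block offset; on a miss, load (1, 2, 4) bytes.

Running C3:

● Get 1: miss. Put bytes 0-3 in bucket 0.

● Get 134: miss. Put 132-135 in bucket 1.

● Get 212: miss. Put 212-215 in bucket 1.

● Get 1: hit.

● Get 135: miss. Put 132-135 in bucket 1.

● Get 213: miss. Put 212-215 in bucket 1.

● Get 162: miss. Put 160-163 in bucket 0.

● Get 161: hit.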

Trace

MEM LSB C1 C2 C3

1 0000 0001 1m 0m 0m

134 1000 0110 6m 3m 1m

212 1101 0100 4m 2m 1m

1 0000 0001 1h 0h 0h

135 1000 0111 7m 3h 1m

213 1101 0101 5m 2h 1m

162 1010 0010 2m 1m 0m

161 1010 0001 1m 0m 0h

2 0000 0010 2m 1m 0m

44 0010 1100 4m 2m 1m

41 0010 1001 1m 0m 0m

221 1101 1101 5m 2m 1m

m_rate: 11/12 9/12 10/12
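The miss counts in the table can be checked mechanically, and the same numbers feed the average-access-time formula. The following is a minimal sketch, assuming an 8-byte direct-mapped cache that starts empty; `count_misses` and `average_access_time` are hypothetical helper names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { CACHE_BYTES = 8, TRACE_LEN = 12 };

/* The reference trace from the slides (decimal byte addresses). */
static const unsigned TRACE[TRACE_LEN] = {
    1, 134, 212, 1, 135, 213, 162, 161, 2, 44, 41, 221
};

/* Count misses of a direct-mapped, 8-byte cache with the given block size. */
int count_misses(const unsigned *trace, size_t n, unsigned block_bytes) {
    assert(block_bytes > 0 && CACHE_BYTES % block_bytes == 0);
    unsigned sets = CACHE_BYTES / block_bytes;
    unsigned resident[CACHE_BYTES];        /* block number held by each set */
    bool valid[CACHE_BYTES] = { false };   /* all sets start empty */
    int misses = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned block = trace[i] / block_bytes;  /* drop the block offset */
        unsigned set = block % sets;              /* low bits pick the set */
        if (!valid[set] || resident[set] != block) {
            valid[set] = true;
            resident[set] = block;
            misses++;
        }
    }
    return misses;
}

/* Average access time = hit time + miss ratio * miss penalty. */
double average_access_time(double hit_time, int misses, size_t accesses,
                           double miss_penalty) {
    return hit_time + (double)misses / (double)accesses * miss_penalty;
}
```

With block sizes 1, 2, and 4 this reproduces 11, 9, and 10 misses. Plugging in the access times of 2, 3, and 5 cycles and the 25-cycle miss stall gives roughly 24.9, 21.75, and 25.8 cycles, so C2 has both the lowest miss rate and the lowest average access time.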

9.16

Loop over a Matrix, by row

Example: each cache line holds 4 array elements

stored in registers (temporal locality)

hopefully in cache (spatial locality)

9.17

Loop over a Matrix, by col

Example: each cache line holds 4 array elements

stored in registers (temporal locality)

hopefully in cache (spatial locality)
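The two traversal orders can be sketched as follows. This is an illustrative example, assuming a hypothetical N×N matrix of int stored in row-major order (as C does); the function names are not from the slides. The accumulator `sum` is the value that can live in a register, while `a[i][j]` is the access that benefits (or not) from neighbouring elements sharing a cache line.

```c
#include <assert.h>
#include <stddef.h>

enum { N = 8 };
static_assert(N % 4 == 0, "rows should span whole cache lines of 4 elements");

/* By row: the inner loop walks consecutive addresses (stride 1), so after
 * one miss the next 3 elements of the same cache line are likely hits. */
long sum_by_row(int a[N][N]) {
    long sum = 0;  /* reused every iteration: temporal locality */
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

/* By column: the inner loop jumps N ints at a time, so consecutive accesses
 * fall in different cache lines and spatial locality is lost. */
long sum_by_col(int a[N][N]) {
    long sum = 0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            sum += a[i][j];
    return sum;
}
```

Both functions compute the same sum; only the order of memory accesses differs, which is what changes the miss rate on large matrices.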

9.18

Single-Level Page Tables

points to a different table for each process

9.19


9.20


9.21


8-bit virtual addresses, 10-bit physical addresses, 32-byte pages (3-bit VPN, 5-bit page offset)

● Physical address of virtual address 0x2D? 00101101 => 11 1100 1101

● Physical address of virtual address 0x7A? 01111010 => 00 1101 1010

● Physical address of virtual address 0xEF? 11101111 => page fault (entry 7 is not valid)

● Physical address of virtual address 0xA8? 10101000 => 11 1110 1000

Index Valid PPN

0 0 0x0E

1 1 0x1E

2 1 0x16

3 1 0x06

4 0 0x0B

5 1 0x1F

6 0 0x15

7 0 0x0A
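The translation steps for this example can be sketched in a few lines: take the top 3 bits as the VPN, look up the entry, and concatenate the PPN with the 5-bit offset. The `translate` function and `struct pte` below are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { OFFSET_BITS = 5, NUM_PAGES = 8 };  /* 32-byte pages, 8-bit virtual addresses */
static_assert(NUM_PAGES == 1 << (8 - OFFSET_BITS), "one entry per virtual page");

struct pte {
    bool valid;
    uint8_t ppn;
};

/* The page table from the example (index = VPN). */
static const struct pte PAGE_TABLE[NUM_PAGES] = {
    {false, 0x0E}, {true, 0x1E}, {true, 0x16}, {true, 0x06},
    {false, 0x0B}, {true, 0x1F}, {false, 0x15}, {false, 0x0A},
};

/* Translate vaddr. Returns false on a page fault; otherwise stores the
 * 10-bit physical address in *paddr and returns true. */
bool translate(uint8_t vaddr, uint16_t *paddr) {
    unsigned vpn = vaddr >> OFFSET_BITS;
    unsigned offset = vaddr & ((1u << OFFSET_BITS) - 1);
    if (!PAGE_TABLE[vpn].valid) return false;  /* page not in main memory */
    *paddr = (uint16_t)((PAGE_TABLE[vpn].ppn << OFFSET_BITS) | offset);
    return true;
}
```

For the four addresses in the example this yields 0x3CD for 0x2D, 0x0DA for 0x7A, a page fault for 0xEF, and 0x3E8 for 0xA8.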

9.22

Multi-level Page Tables

9.23

Page Table with 3 Levels

9.24

Page Table with 3 Levels

9.25

Translation Lookaside Buffer

A k-level page table requires k memory accesses in the worst case.

Idea: cache address mappings inside the CPU (10 ns hit time).

● VPN is the cache tag, PPN is the entire cache block

● High degree of associativity (4-way or fully-associative: low miss rate)

● Usually smaller than data cache (fast lookup, low hit time)

Average Access Time = (Hit Time) + (Miss Rate) ⨯ (Miss Penalty)

9.26

Example: 2-way set-associative TLB

16-bit virtual and physical addresses, 256-byte pages

● Physical address of virtual address 0x7E85 == 0111 1110 1000 0101

● Virtual address of physical address 0x3020 == 0011 0000 0010 0000

Set Index  Valid Bit  Tag   PPN

0          1          0x13  0x30

0          0          0x34  0x58

1          0          0x1F  0x80

1          1          0x2A  0x72

2          1          0x1F  0x95

2          0          0x20  0xAA

3          1          0x3F  0x20

3          0          0x3E  0xFF
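Both questions can be worked through in code. With 256-byte pages the offset is 8 bits, and with 4 sets the 8-bit VPN splits into a 6-bit tag and a 2-bit set index. The sketch below hard-codes the table from this example; `tlb_lookup` and `tlb_reverse` are hypothetical helper names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { TLB_SETS = 4, TLB_WAYS = 2, PAGE_BITS = 8, SET_BITS = 2 };
static_assert(TLB_SETS == 1 << SET_BITS, "set index must cover every set");

struct tlb_entry {
    bool valid;
    uint8_t tag;
    uint8_t ppn;
};

/* The TLB contents from the example, indexed by [set][way]. */
static const struct tlb_entry TLB[TLB_SETS][TLB_WAYS] = {
    { {true,  0x13, 0x30}, {false, 0x34, 0x58} },
    { {false, 0x1F, 0x80}, {true,  0x2A, 0x72} },
    { {true,  0x1F, 0x95}, {false, 0x20, 0xAA} },
    { {true,  0x3F, 0x20}, {false, 0x3E, 0xFF} },
};

/* Virtual -> physical: split the VPN into tag and set index, compare tags. */
bool tlb_lookup(uint16_t vaddr, uint16_t *paddr) {
    unsigned vpn = vaddr >> PAGE_BITS;
    unsigned set = vpn & (TLB_SETS - 1);
    unsigned tag = vpn >> SET_BITS;
    for (int w = 0; w < TLB_WAYS; w++) {
        const struct tlb_entry *e = &TLB[set][w];
        if (e->valid && e->tag == tag) {
            *paddr = (uint16_t)((e->ppn << PAGE_BITS) | (vaddr & 0xFF));
            return true;
        }
    }
    return false;  /* TLB miss: the page table would be consulted next */
}

/* Physical -> virtual: find a valid entry with this PPN, rebuild the VPN
 * from its tag and the set it sits in. */
bool tlb_reverse(uint16_t paddr, uint16_t *vaddr) {
    unsigned ppn = paddr >> PAGE_BITS;
    for (unsigned set = 0; set < TLB_SETS; set++)
        for (int w = 0; w < TLB_WAYS; w++) {
            const struct tlb_entry *e = &TLB[set][w];
            if (e->valid && e->ppn == ppn) {
                unsigned vpn = ((unsigned)e->tag << SET_BITS) | set;
                *vaddr = (uint16_t)((vpn << PAGE_BITS) | (paddr & 0xFF));
                return true;
            }
        }
    return false;
}
```

Under these assumptions, 0x7E85 hits in set 2 (tag 0x1F) and maps to physical address 0x9585, while physical address 0x3020 corresponds to virtual address 0x4C20 (tag 0x13, set 0).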

9.27

TLB == Subset of Page Table

9.28

Addressing: Cache, VM, TLB

9.29

Addressing: Cache, VM, TLB

9.30

Structs: Offsets in assembly

Assume 4-byte int / float, 8-byte long / double.

Can you figure out the offsets for %rdi ?

struct record_t {
    char a[2];
    int b;
    long c;
    int d[3];
    short e;
};

void initialize(struct record_t *x) {
    x->a[1] = 1;
    x->b = 2;
    x->c = 3;
    x->d[1] = 4;
    x->e = 5;
}

initialize:
    movb $1, 1(%rdi)
    movl $2, 4(%rdi)
    movq $3, 8(%rdi)
    movl $4, 20(%rdi)
    movw $5, 28(%rdi)
    ret

Layout (8 bytes per row, "." marks padding):

offset 0:   a  a  .  .  b  b  b  b

offset 8:   c  c  c  c  c  c  c  c

offset 16:  d0 d0 d0 d0 d1 d1 d1 d1

offset 24:  d2 d2 d2 d2 e  e  .  .
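The offsets used in the assembly can be confirmed with `offsetof`. This sketch assumes an LP64 target such as x86-64 Linux (4-byte int, 8-byte long); on other ABIs the compile-time checks may not hold. The helper `offset_of_d` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

struct record_t {
    char a[2];
    int b;
    long c;
    int d[3];
    short e;
};

/* Assumption: LP64 ABI. These match the displacements off %rdi. */
static_assert(offsetof(struct record_t, b) == 4, "2 bytes of padding after a");
static_assert(offsetof(struct record_t, c) == 8, "long is 8-byte aligned");
static_assert(offsetof(struct record_t, d) == 16, "d follows c directly");
static_assert(offsetof(struct record_t, e) == 28, "e follows d[2]");

/* Byte offset of x->d[i], computed the way the compiler does:
 * base offset of the array plus index times element size. */
size_t offset_of_d(size_t i) {
    return offsetof(struct record_t, d) + i * sizeof(int);
}
```

For instance, `offset_of_d(1)` evaluates to 20, the displacement in `movl $4, 20(%rdi)`, and the whole struct occupies 32 bytes because of the 2 bytes of tail padding after `e`.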

9.31

struct B { // this struct must start/end at a multiple of 4, because that's required by 'y'

char x; // 1 byte

int y; // 4 bytes (needs 3 bytes of padding before to start at a multiple of 4)

char z; // 1 byte (needs 3 bytes of padding after to end at a multiple of 4)

};

struct A {

char a; // 1 byte

struct B b; // has 4-byte alignment: 3 bytes of padding before 'b'

char c; // also 3 bytes of padding before 'c', so that 'b' ends at a multiple of 4

};

void init(struct A *a) {

a->a = 1;

a->b.x = 2;

a->b.y = 3;

a->b.z = 4;

a->c = 5;

}

$ gcc -fomit-frame-pointer -mno-red-zone -Og -S align.c; cat align.s | grep mov

movb $1, (%rdi)

movb $2, 4(%rdi)

movl $3, 8(%rdi)

movb $4, 12(%rdi)

movb $5, 16(%rdi)

Nested Structs

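Layout (8 bytes per row, "." marks padding):

offset 0:   a  .  .  .  x  .  .  .

offset 8:   y  y  y  y  z  .  .  .

offset 16:  c  .  .  .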

We still want each member of the nested struct to start at a multiple of its size, but where should the nested struct itself start?

Its start/end should have the largest alignment required by its members.
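The rule can be checked at compile time with `alignof` and `offsetof`. The sketch below assumes a typical ABI with 4-byte, 4-byte-aligned int (for example x86-64); `align_up` is a hypothetical helper showing how an offset is rounded up to the required alignment.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

struct B { char x; int y; char z; };
struct A { char a; struct B b; char c; };

/* Assumption: int is 4 bytes with 4-byte alignment. */
static_assert(alignof(struct B) == alignof(int), "B takes its strictest member alignment");
static_assert(sizeof(struct B) == 12, "B is padded to end at a multiple of 4");
static_assert(offsetof(struct A, b) == 4, "3 bytes of padding before b");
static_assert(offsetof(struct A, c) == 16, "c follows the padded end of b");
static_assert(sizeof(struct A) == 20, "A is padded to a multiple of 4 as well");

/* Round offset up to the next multiple of align (a power of two). */
size_t align_up(size_t offset, size_t align) {
    return (offset + align - 1) & ~(align - 1);
}
```

For example, `align_up(1, alignof(struct B))` gives 4, which is where `b` starts inside `struct A`.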

9.32

Struct Alignment

9.33

Unions

• Unions allow you to read/write the same memory region as variables with different types

– All elements start at offset 0

– The size of the union is simply the size of the biggest member

– Elements must be POD (plain old data) or at least default-constructible

union Data1 {
    int x;
    char y;
};

union Data2 {
    short w;
    char *p;
};

int main() {
    union Data1 item;   // step 1
    item.x = 0x356;     // step 2
    item.y = 'a';       // step 3
}

Data1: x starts at offset 0, y starts at offset 0

Data2 (w/o padding): w occupies offsets 0 to 2, p also starts at offset 0

Bytes of item, from offset 0 (recall x86 uses little-endian):

after step 2: 56 03 00 00

after step 3: 61 03 00 00

CS:APP 3.9.2
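The byte contents shown for `item` can be observed by copying the union's storage into a byte array. This sketch assumes a little-endian machine with 4-byte int, and relies on common compiler behaviour: the C standard leaves the bytes not covered by `y` unspecified after the store to `y`, although gcc and clang on x86-64 keep them unchanged in practice. `replay` and `data1_bytes` are hypothetical names.

```c
#include <assert.h>
#include <string.h>

union Data1 {
    int x;
    char y;
};
static_assert(sizeof(union Data1) == sizeof(int), "union is as big as its biggest member");

/* Replays the two stores from the example and returns the union. */
union Data1 replay(void) {
    union Data1 item;
    item.x = 0x356;  /* bytes: 56 03 00 00 on little-endian */
    item.y = 'a';    /* overwrites only the byte at offset 0 with 0x61 */
    return item;
}

/* Copies the raw bytes of a Data1 into out[] so they can be inspected. */
void data1_bytes(const union Data1 *item, unsigned char out[sizeof(union Data1)]) {
    memcpy(out, item, sizeof *item);
}
```

Reading `x` back after the second store gives 0x361 under these assumptions, since only the lowest-addressed byte changed.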

9.34

Unions: Revealing Endianness

• 4-byte union

• x reads/writes an int

• bytes reads/writes 4 consecutive char

Note that bytes are stored in reversed order

#include <stdio.h>

union int_bytes {
    int x;
    unsigned char bytes[4];  // unsigned, so %02X never prints a sign-extended value
};

int main() {
    union int_bytes ib;
    ib.x = 256;
    // Print the int, then its bytes from the highest address down to the lowest.
    printf("%08X is %02X %02X %02X %02X\n",
           ib.x, ib.bytes[3], ib.bytes[2], ib.bytes[1], ib.bytes[0]);
}

// prints:
// 00000100 is 00 00 01 00

9.35

Return-oriented Programming

9.36


9.37


A64.38

Arithmetic/Logic Operations

A different style!

• Always operate between registers/immediates (not memory)

• Three operands: OP dst, src1, src2 means dst = src1 OP src2

Examples

x1 0x11111111 11111111 (initial state)

x2 0x22002200 22002200

x3 0x33003300 00330033

• “add x1, x2, x3” assigns x2+x3 to x1, like “x1 = x2 + x3”

x1 0x55005500 22332233 after add x1,x2,x3

• “add w1, w2, w3” assigns w2+w3 to w1, sets MS 32 bits to 0

x1 0x00000000 22332233 after add w1,w2,w3
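The difference between the x and w forms can be modelled in C with fixed-width integers. This is only an emulation of the register semantics described above, with hypothetical function names; it is not how one would write A64 code.

```c
#include <assert.h>
#include <stdint.h>

/* A w register is the low 32 bits of the corresponding x register. */
static_assert((uint32_t)0x1122334455667788ull == 0x55667788u,
              "truncation keeps the low 32 bits");

/* add xd, xn, xm: full 64-bit addition (wraps modulo 2^64). */
uint64_t a64_add_x(uint64_t xn, uint64_t xm) {
    return xn + xm;
}

/* add wd, wn, wm: adds the low 32 bits only; writing a w register
 * clears the upper 32 bits of the x register. */
uint64_t a64_add_w(uint64_t xn, uint64_t xm) {
    uint32_t wn = (uint32_t)xn;
    uint32_t wm = (uint32_t)xm;
    return (uint64_t)(uint32_t)(wn + wm);
}
```

Calling both helpers with the x2 and x3 values from the example reproduces the two results shown for x1.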

A64.39

Arithmetic/Logic Operations

Instruction                      Example                   Effect

Add immediate                    add x0, x1, 1             x0 = x1 + 1

Add register                     add x0, x1, x2            x0 = x1 + x2

Add shifted register (or imm.)   add x0, x1, x2, lsl 10    x0 = x1 + (x2 << 10)

Subtract                         sub x0, x1, x2            x0 = x1 - x2

Subtract shifted                 sub x0, x1, x2, lsl 10    x0 = x1 - (x2 << 10)

Negate                           neg x0, x1                x0 = -x1

Multiply                         mul x0, x1, x2            x0 = x1 * x2

Multiply-add                     madd x0, x1, x2, x4       x0 = x1 * x2 + x4

Divide signed                    sdiv x0, x1, x2           x0 = x1 / x2

Divide unsigned                  udiv x0, x1, x2           x0 = x1 / x2

Logical shift left               lsl x0, x1, x2            x0 = x1 << (x2 % 64)

Logical shift right              lsr x0, x1, x2            x0 = x1 >> (x2 % 64)

Arithmetic shift right           asr x0, x1, x2            x0 = x1 (signed) >> (x2 % 64)

Rotate right (LSB wraps to MSB)  ror x0, x1, x2            x0 = x1 >>> (x2 % 64)

Bitwise AND                      and x0, x1, x2            x0 = x1 & x2

Bitwise OR                       orr x0, x1, x2            x0 = x1 | x2

Bitwise XOR                      eor x0, x1, x2            x0 = x1 ^ x2

Bitwise NOT (“move not”)         mvn x0, x1                x0 = ~x1

A64.40

Flexible Operands

Shift or Rotate Operand2

• add, sub and the bitwise ops (and, orr, eor, mvn) allow an optional shift of the last register argument; rotation (ROR) is available only for the bitwise ops:

OP dst, src1, src2, LSL n means dst = src1 OP (src2 << n)

OP dst, src1, src2, LSR n means dst = src1 OP (src2 >> n)

OP dst, src1, src2, ASR n means dst = src1 OP (src2 s>> n)

OP dst, src1, src2, ROR n means dst = src1 OP (src2 >>> n)

Example

x1 0x11111111 11111111 (initial state)

x2 0x22002200 22002200

x3 0x33003300 00330033

• “add x1, x2, x3, lsl 32” assigns x2+(x3<<32) to x1

x1 0x22332233 22002200 after add x1,x2,x3,lsl 32

• “eor w1, w2, w3, ror 8” assigns w2^(w3>>>8) to w1, sets MS 32 bits to 0

x1 0x00000000 11001100 after eor w1,w2,w3,ror 8
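The shifted second operand can be emulated in C to check values by hand. C has no rotate operator, so the rotation is built from two shifts. This is a sketch with hypothetical helper names, covering only LSL on 64-bit registers and ROR on 32-bit values.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit value right by n bits: bits shifted out of the LSB
 * re-enter at the MSB. n is taken modulo 32. */
uint32_t ror32(uint32_t v, unsigned n) {
    n &= 31;
    return n ? (v >> n) | (v << (32 - n)) : v;
}

/* Models "add xd, xn, xm, lsl n": dst = src1 + (src2 << n). */
uint64_t a64_add_lsl(uint64_t xn, uint64_t xm, unsigned n) {
    assert(n < 64);  /* the shift amount must be smaller than the register width */
    return xn + (xm << n);
}
```

For the register values in the example, `ror32` applied to the low half of x3 with n = 8 moves the lowest byte to the top, giving 0x33003300 as the second operand.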

A64.41

Simple Arithmetic: A64 vs x64

// a64

add:

/* return value in w0 */

/* arguments in w0, w1 */

add w0, w0, w1

ret

multadd:

add w0, w0, 10

add w0, w0, w1, lsl 3

ret

bias:

mov w2, 1

lsl w2, w2, w1

sub w2, w2, #1

and w2, w2, w0, asr 31

add w0, w2, w0

ret

// C

int add(int x, int y) {

return x + y;

}

int multadd(int x, int y) {

return 10 + x + 8*y;

}

int bias(int x, int k) {

int bias = (1 << k) - 1;

int mask = x >> 31;

return x + (mask & bias);

}

// x64

add:

leal (%rdi,%rsi), %eax

ret

multadd:

leal 10(%rdi,%rsi,8), %eax

ret

bias:

movl $1, %eax

movl %esi, %ecx

sall %cl, %eax

subl $1, %eax

movl %edi, %edx

sarl $31, %edx

andl %edx, %eax

addl %edi, %eax

ret

A64.42

Memory Load/Store

A different style!

• In x86-64: mov to read/write memory, suffix must match

• A64 has dedicated instructions without size suffix (inferred from the register name)

– ldr x1, [x2] to load into x1 the 8 bytes at address x2

– str x1, [x2] to store the 8 bytes of x1 to address x2

• Additional instructions to load/store register pairs

– ldp x0, x1, [x2] 8 bytes at x2 => x0, 8 bytes at x2+8 => x1

– stp x0, x1, [x2] write x0 to address x2, x1 to address x2+8

• Moves are only between registers

– mov x0, x1 is equivalent to x0 = x1

A64.43

Working with Pointers: A64 vs x64

// a64

get:

ldr w0, [x0]

ret

set:

str w1, [x0]

ret

// C

int get(int *ptr) {

return *ptr;

}

void set(int *ptr, int x) {

*ptr = x;

}

// x64

get:

movl (%rdi), %eax

ret

set:

movl %esi, (%rdi)

ret

A64.44

Calling Procedures

• Arguments in x0, .., x7 then on the stack

• Return value in x0

• Caller-save registers x0 to x18

• Callee-save registers x19 to x29

• Callee saves link register x30 if it invokes a procedure...

Call and Return Mechanisms

• Branch with link (bl) sets the link register x30 to PC+4

– PC is the address of the current instruction (each is 4 bytes)

• Return (ret) jumps to the address in x30 (lr)

– Can also use “ret x0” or with any other register

Calling Conventions
