Code and Memory Optimisation Tricks. Evgeny Muralev, Software Engineer, Sperasoft Inc.




Page 1: Code and Memory Optimisation Tricks


Page 2: Code and Memory Optimisation Tricks

About me

• Software engineer at Sperasoft
• Worked on code for EA Sports games (FIFA, NFL, Madden); now on a Ubisoft AAA title
• Indie game developer in my free time

Page 3: Code and Memory Optimisation Tricks

Our Clients

Electronic Arts, Riot Games, Wargaming, BioWare, Ubisoft, Disney, Sony

Our Projects

Dragon Age: Inquisition, FIFA 14, SIMS 4, Mass Effect 2, League of Legends, Grand Theft Auto V

About us

Our Office Locations

USA, Poland, Russia

The Facts

Founded in 2004, 300+ employees

Sperasoft on-line

sperasoft.com
linkedin.com/company/sperasoft
twitter.com/sperasoft
facebook.com/sperasoft

Page 4: Code and Memory Optimisation Tricks

Agenda

• Brief architecture overview
• Optimizing for data cache
• Optimizing branches (and I-cache)

Page 5: Code and Memory Optimisation Tricks

Developing AAA title

• Fixed performance requirements
  • Min 30 fps (33.3 ms per frame)
• Performance is king
  • A LOT of work to do in one frame!

Page 6: Code and Memory Optimisation Tricks

Make code faster?…

• Improved hardware
  • Wait for another generation
  • Fixed on consoles
• Improved algorithm
  • Very important
• Hardware-aware optimization
  • Optimizing for a (limited) range of hardware
  • Micro-optimizations for a specific architecture!

Page 7: Code and Memory Optimisation Tricks

Brief overview

[Memory hierarchy diagram]
CPU registers -> L1 I-cache / L1 D-cache (~2 cycles) -> L2 I/D cache (~20 cycles) -> RAM (~200 cycles)

Page 8: Code and Memory Optimisation Tricks

• Last-level cache (LLC) miss cost: ~200 cycles
• Intel Skylake instruction latencies:
  • ADDPS/ADDSS: 4 cycles
  • MULPS/MULSS: 4 cycles
  • DIVPS/DIVSS: 11 cycles
  • SQRTPS/SQRTSS: 13 cycles

Brief overview

Page 9: Code and Memory Optimisation Tricks

Brief overview

Page 10: Code and Memory Optimisation Tricks

Intel Skylake case study:

Level        Capacity/Associativity              Fastest latency   Peak bandwidth (B/cycle)
L1/D         32 KB / 8-way                       4 cycles          96 (2x32 load + 1x32 store)
L1/I         32 KB / 8-way                       N/A               N/A
L2           256 KB / 4-way                      12 cycles         64
L3 (shared)  up to 2 MB per core / up to 16-way  44 cycles         32

http://www.intel.ru/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

Brief overview

Page 11: Code and Memory Optimisation Tricks

• Out-of-order execution cannot hide big latencies such as an access to main memory
• That is why the processor always tries to prefetch ahead
  • Both instructions and data

Brief overview

Page 12: Code and Memory Optimisation Tricks

• Linear data access is the best you can do to help hardware prefetching
  • The processor recognizes the pattern and preloads data for the next iterations beforehand

Vec4D in[SIZE];            // Offset from origin
float ChebyshevDist[SIZE]; // Chebyshev distance from origin

for (auto i = 0; i < SIZE; ++i)
{
    ChebyshevDist[i] = Max(in[i].x, in[i].y, in[i].z, in[i].w);
}

Optimizing for data cache

Page 13: Code and Memory Optimisation Tricks

• Access patterns must be trivial
  • Triggering prefetching after every cache miss would pollute the cache
• Prefetching cannot happen across page boundaries
  • Might trigger an invalid page table walk (on a TLB miss)

Optimizing for data cache

Page 14: Code and Memory Optimisation Tricks

• What about traversal of pointer-based data structures?• Spoiler: It sucks

Optimizing for data cache

Page 15: Code and Memory Optimisation Tricks

struct GameActor
{
    // Data…
    GameActor* next;
};

while (current != nullptr)
{
    // Do some operations on current actor…
    current = current->next;
}

[Diagram: current, current->next, current->next->next: every hop is an LLC miss!]

• Prefetching is blocked
  • next->next is not known
• Cache miss every iteration!
• Increases the chance of TLB misses*
  * Depending on your memory allocator

Optimizing for data cache

Page 16: Code and Memory Optimisation Tricks

Array vs. linked list traversal

[Chart: time vs. number of elements; linear array traversal stays nearly flat while random-access list traversal grows much faster]

Optimizing for data cache

Page 17: Code and Memory Optimisation Tricks

• Load from memory:
  • auto data = *pointerToData;
• Special instructions:
  • Use intrinsics: _mm_prefetch(const char* p, enum _mm_hint h)
    • The hint is configurable!

Optimizing for data cache
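As a sketch of how the intrinsic might be used: the array size, the stand-in workload, and the prefetch distance below are illustrative choices, not values from the slides, and the right distance can only be found by profiling.

```cpp
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0
#include <cstddef>

constexpr std::size_t SIZE = 1024;         // illustrative
constexpr std::size_t PREFETCH_DIST = 16;  // elements ahead; tune by profiling

// Stand-in for per-element work expensive enough to overlap the prefetch.
float HeavyTransform(float v) { return v * v + 1.0f; }

void Process(const float* in, float* out)
{
    for (std::size_t i = 0; i < SIZE; ++i)
    {
        // Ask for the line we will need PREFETCH_DIST iterations from now.
        if (i + PREFETCH_DIST < SIZE)
            _mm_prefetch(reinterpret_cast<const char*>(in + i + PREFETCH_DIST),
                         _MM_HINT_T0);
        out[i] = HeavyTransform(in[i]);
    }
}
```

Note that the hardware prefetcher would handle this linear pattern on its own; software prefetch pays off mainly for patterns the hardware cannot see, as the following slides discuss.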

Page 18: Code and Memory Optimisation Tricks

• Load from memory != prefetch instructions
• Prefetch instructions may differ depending on the H/W vendor

e.g. the Intel guide on prefetch instructions:
• They usually retire after the virtual-to-physical address translation is completed
• In case of an exception, such as a page fault, the software prefetch retires without prefetching any data

Optimizing for data cache

Page 19: Code and Memory Optimisation Tricks

while (current != nullptr)
{
    Prefetch(current->next);
    // Trivial ALU computations on current actor…
    current = current->next;
}

• Probably won't help
  • Computations don't overlap the memory access time enough
  • Remember: an LLC miss is ~200 cycles vs. ~3-4 cycles for trivial ALU operations

Optimizing for data cache

Page 20: Code and Memory Optimisation Tricks

while (current != nullptr)
{
    Prefetch(current->next);
    // HighLatencyComputation…
    current = current->next;
}

• May help around high-latency computations
• Make sure the data is not evicted from the cache before use

Optimizing for data cache

Page 21: Code and Memory Optimisation Tricks

• Prefetch far enough ahead to overlap the memory access time
• Prefetch near enough that the data is not evicted from the data cache before use
• Do NOT over-prefetch
  • Prefetching is not free
  • It pollutes the cache
• Always profile when using software prefetching

Optimizing for data cache

Page 22: Code and Memory Optimisation Tricks

RAM:
… … … … … … a … … … … … … … … … … … … … … … … … … … … … … … … …

Cache:
a … … … … … … … … … … … … … … …

• The cache operates on blocks called "cache lines"
• When "a" is accessed, the whole cache line containing it is loaded
• You can expect a 64-byte-wide cache line on x64

Optimizing for data cache
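The 64-byte figure can be made concrete with a small compile-time sketch; the helper name is ours, and the 64-byte line size is an assumption that holds on current x64 parts.

```cpp
#include <cstddef>

// Assumed cache-line size, as the slide states for x64.
constexpr std::size_t kCacheLineBytes = 64;

// How many elements of type T share one cache line.
template <typename T>
constexpr std::size_t ElemsPerLine()
{
    return kCacheLineBytes / sizeof(T);
}

// Touching one float drags 15 neighbours into the cache with it.
static_assert(ElemsPerLine<float>() == 16, "16 floats per 64-byte line");
static_assert(ElemsPerLine<double>() == 8, "8 doubles per 64-byte line");
```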

Page 23: Code and Memory Optimisation Tricks

Example of a poor data layout:

struct FooBonus
{
    float fooBonus;
    float otherData[15];
};

// For every character…
// Assume FooArray is an array of SIZE FooBonus structs
for (auto i = 0; i < SIZE; ++i)
{
    Actor->Total += FooArray[i].fooBonus;
}

Optimizing for data cache

Page 24: Code and Memory Optimisation Tricks

• 64-byte offset between loads
• Each load hits a separate cache line
• 60 of the 64 bytes are wasted

addss xmm6,dword ptr [rax-40h]
addss xmm6,dword ptr [rax]
addss xmm6,dword ptr [rax+40h]
addss xmm6,dword ptr [rax+80h]
addss xmm6,dword ptr [rax+0C0h]
addss xmm6,dword ptr [rax+100h]
addss xmm6,dword ptr [rax+140h]
addss xmm6,dword ptr [rax+180h]
add rax,200h
cmp rax,rcx
jl main+0A0h

*MSVC loves x8 loop unrolling

Optimizing for data cache

Page 25: Code and Memory Optimisation Tricks

• Look for patterns in how your data is accessed
• Split the data based on access patterns
  • Data used together should be located together
  • Look for the most common case

Optimizing for data cache

Page 26: Code and Memory Optimisation Tricks

Cold fields:

struct FooBonus
{
    MiscData* otherData;
    float fooBonus;
};

struct MiscData
{
    float otherData[15];
};

(+4 bytes for memory alignment on 64-bit)

Optimizing for data cache

Page 27: Code and Memory Optimisation Tricks

• 12-byte offset between loads
• Much less bandwidth is wasted
• Can we do better?!

addss xmm6,dword ptr [rax-0Ch]
addss xmm6,dword ptr [rax]
addss xmm6,dword ptr [rax+0Ch]
addss xmm6,dword ptr [rax+18h]
addss xmm6,dword ptr [rax+24h]
addss xmm6,dword ptr [rax+30h]
addss xmm6,dword ptr [rax+3Ch]
addss xmm6,dword ptr [rax+48h]
add rax,60h
cmp rax,rcx
jl main+0A0h

Optimizing for data cache

Page 28: Code and Memory Optimisation Tricks

• Maybe there is no need for a pointer to the cold fields at all?
• Make use of Structure of Arrays
• Store and index separate arrays

struct FooBonus
{
    float fooBonus;
};

struct MiscData
{
    float otherData[15];
};

Optimizing for data cache
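Put together, the Structure-of-Arrays idea might look like the sketch below; the array names and the summing loop are ours, chosen to match the fooBonus example from the previous slides.

```cpp
#include <cstddef>

constexpr std::size_t SIZE = 256;  // illustrative

struct FooBonus { float fooBonus; };
struct MiscData { float otherData[15]; };

// Hot and cold data live in separate, parallel arrays:
// element i of both arrays describes the same character.
FooBonus BonusArray[SIZE];
MiscData MiscArray[SIZE];

float SumBonuses()
{
    float total = 0.0f;
    // The hot loop now walks tightly packed 4-byte elements,
    // so every byte of every fetched cache line is useful.
    for (std::size_t i = 0; i < SIZE; ++i)
        total += BonusArray[i].fooBonus;
    return total;
}
```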

Page 29: Code and Memory Optimisation Tricks

• 100% bandwidth utilization
  • If everything is 64-byte aligned

addss xmm6,dword ptr [rax-4]
addss xmm6,dword ptr [rax]
addss xmm6,dword ptr [rax+4]
addss xmm6,dword ptr [rax+8]
addss xmm6,dword ptr [rax+0Ch]
addss xmm6,dword ptr [rax+10h]
addss xmm6,dword ptr [rax+14h]
addss xmm6,dword ptr [rax+18h]
add rax,20h
cmp rax,rcx
jl main+0A0h

Optimizing for data cache

Page 30: Code and Memory Optimisation Tricks

Bandwidth utilization

[Chart: time vs. number of elements for Attempt 1, Attempt 2, and Attempt 3]

Optimizing for data cache

Page 31: Code and Memory Optimisation Tricks

• Poor data utilization causes:
  • Wasted bandwidth
  • An increased probability of TLB misses
  • More cache misses due to crossing page boundaries

Optimizing for data cache

Page 32: Code and Memory Optimisation Tricks

• Recognize data access patterns:
  • Just analyze the data and how it is used
  • Add logging to getters/setters
  • Collect any other useful data (time/counters)

float GameCharacter::GetStamina() const
{
    // Active only in debug builds
    CollectData("GameCharacter::Stamina");
    return Stamina;
}

Optimizing for data cache

Page 33: Code and Memory Optimisation Tricks

• What to consider:
  • What data is accessed together?
  • How often is the data accessed?
  • From where is it accessed?

Optimizing for data cache

Page 34: Code and Memory Optimisation Tricks

Instruction lifetime:

• Instruction fetch
• Decoding
• Execution
• Memory access
• Retirement

*Of course it is more complex on real hardware

Optimizing branches

Page 35: Code and Memory Optimisation Tricks

IF ID EX MEM WB

I1

Optimizing branches

Page 36: Code and Memory Optimisation Tricks

IF ID EX MEM WB

I1

I2 I1

Optimizing branches

Page 37: Code and Memory Optimisation Tricks

IF ID EX MEM WB

I1

I2 I1

I3 I2 I1

Optimizing branches

Page 38: Code and Memory Optimisation Tricks

IF ID EX MEM WB

I1

I2 I1

I3 I2 I1

I4 I3 I2 I1

Optimizing branches

Page 39: Code and Memory Optimisation Tricks

IF ID EX MEM WB

I1

I2 I1

I3 I2 I1

I4 I3 I2 I1

I5 I4 I3 I2 I1

Optimizing branches

Page 40: Code and Memory Optimisation Tricks

// Instruction A
if (Condition == true)
{
    // Instruction B
    // Instruction C
}
else
{
    // Instruction D
    // Instruction E
}

• Which instructions should be fetched after instruction A?
• The condition has not been evaluated yet
• The processor speculatively chooses one of the paths
• A wrong guess is called a branch misprediction

Optimizing branches

Page 41: Code and Memory Optimisation Tricks

IF ID EX MEM WB

A
B A
C B A

Mispredicted branch!

A
D A

• Pipeline flush
• A lot of wasted cycles

Optimizing branches

Page 42: Code and Memory Optimisation Tricks

• Try to remove branches altogether
  • Especially hard-to-predict branches
  • Reduces the chance of branch misprediction
  • Does not consume Branch Target Buffer resources

Optimizing branches

Page 43: Code and Memory Optimisation Tricks

Know bit tricks!
Example: negate a number based on a flag value

Branchy version:

int In;
int Out;
bool bDontNegate;

Out = In;
if (bDontNegate == false)
{
    Out *= -1;
}

Branchless version:

int In;
int Out;
bool bDontNegate;

Out = (bDontNegate ^ (bDontNegate - 1)) * In;

https://graphics.stanford.edu/~seander/bithacks.html

Optimizing branches
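Wrapped into a function, the branchless version can be checked quickly; the function name is ours. bool promotes to int, so the flag becomes 1 or 0 before the arithmetic.

```cpp
// Negate In unless bDontNegate is set, without a branch.
int NegateUnless(bool bDontNegate, int In)
{
    // bDontNegate == true : 1 ^ (1 - 1) = 1 ^  0 =  1 -> returns  In
    // bDontNegate == false: 0 ^ (0 - 1) = 0 ^ -1 = -1 -> returns -In
    return (bDontNegate ^ (bDontNegate - 1)) * In;
}
```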

Page 44: Code and Memory Optimisation Tricks

• Compute both branches
Example: X = (A < B) ? CONST1 : CONST2

Optimizing branches
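The same select can be written branchlessly in C++; this sketch mirrors the mask idea shown in assembly on the next slide, though the exact instructions the compiler emits may differ.

```cpp
#include <cstdint>

// Branchless X = (A < B) ? CONST1 : CONST2.
int Select(int A, int B, int CONST1, int CONST2)
{
    // (A < B) is 1 or 0; negating yields a mask of all ones or all zeros.
    std::int32_t mask = -static_cast<std::int32_t>(A < B);
    return (mask & (CONST1 - CONST2)) + CONST2;
}
```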

Page 45: Code and Memory Optimisation Tricks

X = (A < B) ? CONST1 : CONST2

Branchy version:

cmp a, b        ; Condition
jbe L30         ; Conditional branch
mov ebx, const1 ; ebx holds X
jmp L31         ; Unconditional branch
L30:
mov ebx, const2
L31:

Branchless version (conditional instructions setCC and cmovCC):

xor ebx, ebx    ; Clear ebx (X in the C code)
cmp A, B
setge bl        ; ebx = 0 or 1; OR the complement condition
sub ebx, 1      ; ebx = 11..11 or 00..00
and ebx, const3 ; const3 = const1 - const2
add ebx, const2 ; ebx = const1 or const2

http://www.intel.ru/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

Optimizing branches

Page 46: Code and Memory Optimisation Tricks

• SIMD mask + blending example
X = (A < B) ? CONST1 : CONST2

// Create the selector: mask = 0xFFFFFFFF if (a < b), 0 otherwise
mask = _mm_cmplt_ps(a, b);

// Blend the values using the mask
res = _mm_blendv_ps(const2, const1, mask);

Optimizing branches

Page 47: Code and Memory Optimisation Tricks

Compute-both summary:

• Do it only for hard-to-predict branches
• Obviously, both results have to be computed
• Introduces a data dependency, blocking out-of-order execution
• Profile!

Optimizing branches

Page 48: Code and Memory Optimisation Tricks

Example: need to update a squad

• Blue nodes: archers
• Red nodes: swordsmen

Optimizing branches

Page 49: Code and Memory Optimisation Tricks

struct CombatActor
{
    // Data…
    EUnitType Type; // ARCHER or SWORDSMAN
};

struct Squad
{
    CombatActor Units[SIZE][SIZE];
};

void UpdateArmy(const Squad& squad)
{
    for (auto i = 0; i < SIZE; ++i)
        for (auto j = 0; j < SIZE; ++j)
        {
            const auto& Unit = squad.Units[i][j];
            switch (Unit.Type)
            {
            case EUnitType::ARCHER:
                // Process archer
                break;
            case EUnitType::SWORDSMAN:
                // Process swordsman
                break;
            default:
                // Handle default
                break;
            }
        }
}

• Branching every iteration?
• Bad performance for hard-to-predict branches

Optimizing branches

Page 50: Code and Memory Optimisation Tricks

struct CombatActor
{
    // Data…
    EUnitType Type; // ARCHER or SWORDSMAN
};

struct Squad
{
    CombatActor Archers[A_SIZE];
    CombatActor Swordsmen[S_SIZE];
};

void UpdateArchers(const Squad& squad)
{
    // Just iterate and process, no branching here
    // Update archers
}

void UpdateSwordsmen(const Squad& squad)
{
    // Just iterate and process, no branching here
    // Update swordsmen
}

• Split! And process separately
• No branching in the processing methods
• + Better utilization of the I-cache!

Optimizing branches
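Fleshed out minimally, the split update might look like this; the Strength field and the per-unit update rules are invented for illustration.

```cpp
#include <cstddef>

constexpr std::size_t A_SIZE = 64;  // illustrative sizes
constexpr std::size_t S_SIZE = 64;

struct CombatActor
{
    float Strength;  // hypothetical per-unit data
};

struct Squad
{
    CombatActor Archers[A_SIZE];
    CombatActor Swordsmen[S_SIZE];
};

// Every element takes the same path: no per-element branch to mispredict,
// and each loop body stays small and hot in the I-cache.
void UpdateArchers(Squad& squad)
{
    for (std::size_t i = 0; i < A_SIZE; ++i)
        squad.Archers[i].Strength += 1.0f;  // illustrative archer update
}

void UpdateSwordsmen(Squad& squad)
{
    for (std::size_t i = 0; i < S_SIZE; ++i)
        squad.Swordsmen[i].Strength += 2.0f;  // illustrative swordsman update
}
```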

Page 51: Code and Memory Optimisation Tricks

• For very predictable branches:
  • Generally prefer predicted-not-taken conditional branches
  • Depending on the architecture, a predicted-taken branch may incur slightly more latency

Optimizing branches

Page 52: Code and Memory Optimisation Tricks

Predicted not taken:

; function prologue
cmp dword ptr [data], 0
je END
; set of some ALU instructions…
;…
END:
; function epilogue

Predicted taken:

; function prologue
cmp dword ptr [data], 0
jne COMP
jmp END
COMP:
; set of some ALU instructions…
;…
END:
; function epilogue

• Imagine cmp dword ptr [data], 0 is likely to evaluate to "false"
• Prefer the predicted-not-taken version

Optimizing branches

Page 53: Code and Memory Optimisation Tricks

• Study the branch predictor on your target architecture
• Consider whether you really need a branch
  • Compute both results
  • Bit/math hacks
• Study the data and split it
  • Based on access patterns
  • Based on the performed computation

Optimizing branches

Page 54: Code and Memory Optimisation Tricks

Conclusion

• Know your hardware
  • Architecture matters!
• Design code around data, not abstractions
  • Hardware is a real thing

Page 55: Code and Memory Optimisation Tricks

Resources

• http://www.agner.org/optimize/microarchitecture.pdf
• https://people.freebsd.org/~lstewart/articles/cpumemory.pdf
• http://support.amd.com/TechDocs/47414_15h_sw_opt_guide.pdf
• http://www.intel.ru/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
• https://graphics.stanford.edu/~seander/bithacks.html

Page 56: Code and Memory Optimisation Tricks

Questions?

• E-mail: [email protected]
• Twitter: @EvgenyGD
• Web: evgenymuralev.com