PERFORMANCE AND PREDICTABILITY Richard Warburton @richardwarburto insightfullogic.com


DESCRIPTION

These days fast code needs to operate in harmony with its environment. At the deepest level this means working well with hardware: RAM, disks and SSDs. A unifying theme is treating memory access patterns in a uniform and predictable way that is sympathetic to the underlying hardware. For example, writing to and reading from RAM and hard disks can be significantly sped up by operating sequentially on the device rather than randomly accessing the data. In this talk we’ll cover why access patterns are important, what kind of speed gain you can get, and how you can write simple high-level code which works well with these kinds of patterns.


Page 1: Performance and predictability

PERFORMANCE AND PREDICTABILITY
Richard Warburton, [email protected]

Page 2: Performance and predictability
Page 3: Performance and predictability

Why care about low level rubbish?

Branch Prediction

Memory Access

Storage

Conclusions

Page 4: Performance and predictability
Page 5: Performance and predictability

Performance Discussion

Product Solutions: “Just use our library/tool/framework, and everything is web-scale!”

Architecture Advocacy: “Always design your software like this.”

Methodology & Fundamentals: “Here are some principles and knowledge, use your brain.”

Page 6: Performance and predictability

Do you care?

DON'T look at access patterns first

Many problems are not pattern related:
Networking
Database or External Service
Minimising I/O
Garbage Collection
Insufficient Parallelism

Harvest the low hanging fruit first

Page 7: Performance and predictability

What counts as low hanging fruit is a cost-benefit analysis

Page 8: Performance and predictability

So when does it matter?

Page 9: Performance and predictability

Informed Design & Architecture *

* this is not a call for premature optimisation

Page 10: Performance and predictability
Page 11: Performance and predictability

That 10% that actually matters

Page 12: Performance and predictability

Performance of Predictable Accesses

Page 13: Performance and predictability

Why care about low level rubbish?

Branch Prediction

Memory Access

Storage

Conclusions

Page 14: Performance and predictability

What 4 things do CPUs actually do?

Page 15: Performance and predictability

Fetch, Decode, Execute, Writeback

Page 16: Performance and predictability

Pipelining speeds this up

Page 17: Performance and predictability

Pipelined

Page 18: Performance and predictability

What about branches?

public static int simple(int x, int y, int z) {
    int ret;
    if (x > 5) {
        ret = y + z;
    } else {
        ret = y;
    }
    return ret;
}

Page 19: Performance and predictability

Branches cause stalls, stalls kill performance

Page 20: Performance and predictability

Super-pipelined & Superscalar

Page 21: Performance and predictability

Can we eliminate branches?

Page 22: Performance and predictability

Naive Compilation

if (x > 5) {
    ret = z;
} else {
    ret = y;
}

cmp x, 5
jle L1          ; x <= 5: jump to the else path
mov z, ret
jmp L2
L1: mov y, ret
L2: ...

Page 23: Performance and predictability

Conditional Branches

if (x > 5) {

ret = z;

} else {

ret = y;

}

cmp x, 5
mov z, ret      ; speculatively take the 'then' value
cmovle y, ret   ; replace with y when x <= 5, no jump needed
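HotSpot can emit cmov for simple ternaries itself, and integer bit tricks can remove a branch entirely. A minimal Java sketch (not from the slides; assumes the subtraction does not overflow):

// Branch-free max: no conditional jump, so nothing to mispredict.
static int maxNoBranch(int a, int b) {
    int diff = a - b;
    int sign = diff >> 31;    // -1 if a < b, otherwise 0
    return a - (diff & sign); // yields b when a < b, else a
}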

Page 24: Performance and predictability

Strategy: predict branches and speculatively execute

Page 25: Performance and predictability

Static Prediction

A forward branch defaults to not taken

A backward branch defaults to taken
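In source terms, the fall-through path after a forward branch is the statically predicted one. A hedged Java sketch (names illustrative; the JIT may reorder blocks itself):

if (likelyCondition) {
    commonWork();   // 'then' block is the fall-through: statically predicted
} else {
    rareWork();     // reached via a forward branch that defaults to not taken
}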

Page 26: Performance and predictability
Page 27: Performance and predictability

Static Hints (Pentium 4 or later)

__emit 0x3E defaults to taken

__emit 0x2E defaults to not taken

don’t use them; flip the branch instead so the default prediction is already right

Page 28: Performance and predictability

Dynamic prediction: record history and predict future

Page 29: Performance and predictability

Branch Target Buffer (BTB)

a log of the history of each branch

also stores the branch target address

it’s finite!

Page 30: Performance and predictability

Local

records a history per conditional branch

Global

records a shared history across all conditional jumps

Page 31: Performance and predictability

Loop

specialised predictor when there’s a loop (jumping in a cycle n times)

Function

specialised buffer for predicting nearby function returns

Two level Adaptive Predictor

accounts for patterns of up to 3 if statements

Page 32: Performance and predictability

Measurement and Monitoring

Use Performance Event Counters (Model Specific Registers)

Can be configured to store branch prediction information

Profilers & Tooling: perf (Linux), VTune, AMD Code Analyst, Visual Studio, Oracle Performance Studio
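From Java, the easiest route to these counters is JMH's perf integration. A minimal sketch (assumes JMH is on the classpath; run with -prof perfnorm on Linux to see branches and branch-misses per operation):

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import java.util.concurrent.ThreadLocalRandom;

@State(Scope.Benchmark)
public class BranchBench {
    int x = ThreadLocalRandom.current().nextInt(10); // example input

    @Benchmark
    public int branchy() {
        return x > 5 ? x + 1 : x; // the branch under measurement
    }
}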

Page 33: Performance and predictability

Approach: Loop Unrolling

for (int x = 0; x < 100; x += 5) {
    foo(x);
    foo(x + 1);
    foo(x + 2);
    foo(x + 3);
    foo(x + 4);
}
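For comparison, the rolled form this unrolling replaces: the same 100 calls, but the loop branch executes 100 times instead of 20 (foo is the slide's placeholder):

for (int x = 0; x < 100; x++) {
    foo(x);
}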

Page 34: Performance and predictability

Approach: minimise branches in loops

Page 35: Performance and predictability

Approach: Separation of Concerns

Little branching on a latency critical path or thread

Heavily branchy administrative code on a different path

Compensates for the in-flight branch predictor only tracking a limited number of targets

Only applicable with low context switching rates

Also easier to understand the code
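A self-contained sketch of the pattern (structure illustrative, not from the slides): the hot loop stays nearly branch-free and hands work to a branch-heavy admin thread via a queue.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class HotColdSplit {
    public static void main(String[] args) {
        BlockingQueue<Long> adminQueue = new ArrayBlockingQueue<>(1024);

        // Admin thread: branchy bookkeeping, kept off the hot path.
        Thread admin = new Thread(() -> {
            try {
                while (true) {
                    long v = adminQueue.take();
                    if (v % 100_000 == 0) System.out.println("checkpoint " + v);
                }
            } catch (InterruptedException e) { /* shutdown */ }
        });
        admin.setDaemon(true);
        admin.start();

        // Hot loop: nearly branch-free; compute and hand off.
        long acc = 0;
        for (long i = 0; i < 1_000_000; i++) {
            acc += i;               // stands in for latency-critical work
            adminQueue.offer(acc);  // non-blocking hand-off
        }
        System.out.println(acc);
    }
}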

Page 36: Performance and predictability

One more thing ...

Page 37: Performance and predictability

Division/Modulus

index = (index + 1) % SIZE;

vs

index = index + 1;
if (index == SIZE) {
    index = 0;
}
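A third option, not on the slide: when SIZE is a power of two, a bitmask wraps the index with neither the division nor the branch:

index = (index + 1) & (SIZE - 1); // valid only when SIZE is a power of two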

Page 38: Performance and predictability

/ and % cost up to 92 cycles on Core i* CPUs

Page 39: Performance and predictability

Summary

CPUs are Super-pipelined and Superscalar

Branches cause stalls

Branches can be removed, minimised or simplified

You frequently get easier-to-read code as well

Page 40: Performance and predictability

Why care about low level rubbish?

Branch Prediction

Memory Access

Storage

Conclusions

Page 41: Performance and predictability

The Problem

(Diagram: the CPU is very fast; main memory is relatively slow.)

Page 42: Performance and predictability

The Solution: CPU Cache

Core Demands Data, looks at its cache

If present (a "hit"), data is returned to a register
If absent (a "miss"), data is fetched from memory and stored in the cache

Fast memory is expensive, a small amount is affordable

Page 43: Performance and predictability

Multilevel Cache: Intel Sandy Bridge

(Diagram: physical cores 0..N, each presenting 2 logical cores via Hyper-Threading; each physical core has its own Level 1 Instruction Cache, Level 1 Data Cache and Level 2 Cache, and all cores share a single Level 3 Cache.)

Page 44: Performance and predictability

How bad is a miss?

Location      Latency in clock cycles
Register      0
L1 Cache      3
L2 Cache      9
L3 Cache      21
Main Memory   150-400

Page 45: Performance and predictability

Cache Lines

Data transferred in cache lines

Fixed size block of memory

Usually 64 bytes on current x86 CPUs (between 32 and 256 bytes across architectures)

Purely hardware consideration

Page 46: Performance and predictability

Prefetching

Prefetch = eagerly load data

Adjacent Cache Line Prefetch

Data Cache Unit (Streaming) Prefetch

Problem: CPU prediction isn’t perfect

Solution: arrange data so accesses are predictable

Page 47: Performance and predictability

Sequential Locality

Referring to data that is arranged linearly in memory

Spatial Locality

Referring to data that is close together in memory

Temporal Locality

Repeatedly referring to same data in a short time span

Page 48: Performance and predictability

General Principles

Use smaller data types (-XX:+UseCompressedOops)

Avoid 'big holes' in your data

Make accesses as linear as possible

Page 49: Performance and predictability

Primitive Arrays

// Sequential access = predictable
for (int i = 0; i < someArray.length; i++)
    someArray[i]++;

Page 50: Performance and predictability

Primitive Arrays - Skipping Elements

// Holes hurt
for (int i = 0; i < someArray.length; i += SKIP)
    someArray[i]++;

Page 51: Performance and predictability

Primitive Arrays - Skipping Elements

Page 52: Performance and predictability

Multidimensional Arrays

Multidimensional Arrays are really Arrays of Arrays in Java. (Unlike C)

Some people realign their accesses:

for (int col = 0; col < COLS; col++) {
    for (int row = 0; row < ROWS; row++) {
        array[ROWS * col + row]++;
    }
}

Page 53: Performance and predictability

Bad Access Alignment

Strides the wrong way, bad locality.

array[COLS * row + col]++;

Strides the right way, good locality.

array[ROWS * col + row]++;
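The same point with plain Java 2D arrays (a sketch; ROWS and COLS as above): iterate so the innermost index walks one backing array sequentially.

int[][] grid = new int[ROWS][COLS];

// Good: the inner loop walks one row's backing int[] sequentially.
for (int row = 0; row < ROWS; row++)
    for (int col = 0; col < COLS; col++)
        grid[row][col]++;

// Bad: every inner step hops to a different row's array, poor locality.
for (int col = 0; col < COLS; col++)
    for (int row = 0; row < ROWS; row++)
        grid[row][col]++;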

Page 54: Performance and predictability

Access pattern   L1D        L2         Memory
Full Random      5 clocks   37 clocks  280 clocks
Sequential       5 clocks   14 clocks  28 clocks

Page 55: Performance and predictability

Data Layout Principles

Primitive Collections (GNU Trove, GS-Coll)
Arrays > Linked Lists
Hashtable > Search Tree
Avoid code bloat (loop unrolling)
Custom data structures:

Judy Arrays
an associative array/map

kD-Trees
generalised Binary Space Partitioning

Z-Order Curve
multidimensional data in one dimension
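A sketch of the first principle, using GNU Trove's TIntArrayList as one example of a primitive collection (any similar library works):

import gnu.trove.list.array.TIntArrayList;
import java.util.ArrayList;
import java.util.List;

// Boxed: each element is a reference to a separately allocated Integer.
List<Integer> boxed = new ArrayList<>();

// Primitive: values packed contiguously in one int[], cache friendly.
TIntArrayList packed = new TIntArrayList();

for (int i = 0; i < 1_000_000; i++) {
    boxed.add(i);   // allocates or looks up an Integer per element
    packed.add(i);  // writes an int into the backing array
}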

Page 56: Performance and predictability

Data Locality vs Java Heap Layout

class Foo {
    Integer count;
    Bar bar;
    Baz baz;
}

// No alignment guarantees
for (Foo foo : foos) {
    foo.count = 5;
    foo.bar.visit();
}

(Diagram: heap slots 0, 1, 2, 3, ... with each Foo and its count, bar and baz objects scattered across them.)
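One common workaround, not from the slides: flatten the hot fields into parallel primitive arrays so the access order is under your control (names hypothetical):

// Entity i is described by counts[i] and barValues[i]:
// two contiguous arrays instead of scattered Foo objects.
int[] counts = new int[n];
long[] barValues = new long[n];

for (int i = 0; i < n; i++) {
    counts[i] = 5;      // sequential writes, cache line by cache line
    barValues[i] += 1;  // stands in for the per-entity bar work
}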

Page 57: Performance and predictability

Data Locality vs Java Heap Layout

Serious Java Weakness

Location of objects in memory hard to guarantee.

GC also interferes:
Copying
Compaction

Page 58: Performance and predictability

Summary

Cache misses cause stalls, which kill performance

Measurable via Performance Event Counters

Common techniques for optimising code

Page 59: Performance and predictability

Why care about low level rubbish?

Branch Prediction

Memory Access

Storage

Conclusions

Page 60: Performance and predictability

Hard Disks

Commonly used persistent storage

Spinning Rust, with a head to read/write

Constant Angular Velocity - rotations per minute stays constant

Page 61: Performance and predictability

A simple model

Zoned Constant Angular Velocity (ZCAV) / Zoned Bit Recording (ZBR)

Operation Time = time to process the command + seek time + rotational latency + sequential transfer time
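A rough worked example of that model (all numbers illustrative: a 7200 RPM disk, 8 ms average seek, 150 MB/s sequential transfer):

// Reading 4 KB from a random location (sketch):
double commandMs  = 0.1;                         // command overhead, assumed
double seekMs     = 8.0;                         // average seek, assumed
double rotationMs = 0.5 * 60_000.0 / 7200;       // half a revolution ~ 4.2 ms
double transferMs = 4.0 / (150.0 * 1024) * 1000; // 4 KB at 150 MB/s ~ 0.026 ms
double totalMs    = commandMs + seekMs + rotationMs + transferMs; // ~12.3 ms
// Seek + rotation dominate; the transfer itself is well under 1% of the total.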

Page 62: Performance and predictability

ZBR implies faster transfer on the outer tracks than at the centre

Page 63: Performance and predictability

Seeking vs Sequential reads

Seek and rotation times dominate for small amounts of data

Random writes of 4kb can be 300 times slower than theoretical max data transfer

Consider the impact of context switching between applications or threads

Page 64: Performance and predictability

Fragmentation causes unnecessary seeks

Page 65: Performance and predictability

Sector alignment is irrelevant at the disk level

Page 66: Performance and predictability
Page 67: Performance and predictability
Page 68: Performance and predictability

Summary

Simple, sequential access patterns win

Fragmentation is your enemy

Alignment can be relevant in RAID scenarios

SSDs are a different game

Page 69: Performance and predictability

Why care about low level rubbish?

Branch Prediction

Memory Access

Storage

Conclusions

Page 70: Performance and predictability

Software runs on Hardware, play nice

Page 71: Performance and predictability

Predictability is a running theme

Page 72: Performance and predictability

Huge theoretical speedups

Page 73: Performance and predictability

YMMV: measured incremental improvements >>> going nuts

Page 74: Performance and predictability

More information

Articles:
http://www.akkadia.org/drepper/cpumemory.pdf
https://gmplib.org/~tege/x86-timing.pdf
http://psy-lob-saw.blogspot.co.uk/
http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html
http://mechanical-sympathy.blogspot.co.uk
http://www.agner.org/optimize/microarchitecture.pdf

Mailing Lists:
https://groups.google.com/forum/#!forum/mechanical-sympathy
https://groups.google.com/a/jclarity.com/forum/#!forum/friends
http://gee.cs.oswego.edu/dl/concurrency-interest/

Page 75: Performance and predictability
Page 76: Performance and predictability

Q & A
[email protected]
/java8lambdas

Page 77: Performance and predictability