Page 1: Low Level CPU Performance Profiling Examples

gluent.com 1

 Low-Level CPU Performance Profiling Examples

Tanel Poder, a long time computer performance geek

@tanelpoder
blog.tanelpoder.com

Page 2: Low Level CPU Performance Profiling Examples

Intro: About me

• Tanel Põder
• RDBMS Performance geek 20+ years (Oracle)
• Unix/Linux Performance geek
• Hadoop Performance geek
• Spark Performance geek?

• http://blog.tanelpoder.com
• @tanelpoder

Expert Oracle Exadata book

Page 3: Low Level CPU Performance Profiling Examples

[Diagram labels: Gluent; Oracle, Teradata, NoSQL, MSSQL, Big Data Sources; App X, App Y, App Z]

A data sharing platform for enterprise applications

Gluent as a data virtualization layer

Page 4: Low Level CPU Performance Profiling Examples

Some Microscopic level stuff to talk about…

1. Some things worth knowing about modern CPUs
2. Measuring internal CPU efficiency (C++)
3. A columnar database scanning example (Oracle)
4. Low level Analysis of Spark Performance
   • RDD vs DataFrame
   • DataFrame with bad code

This is gonna be a (hopefully fun) hacking session!

Page 5: Low Level CPU Performance Profiling Examples

”100%” busy?

A CPU close to 100% busy?

What if I told you your CPU is not that busy?

Page 6: Low Level CPU Performance Profiling Examples

CPU Performance Counters on Linux

# perf stat -d -p PID sleep 30

Performance counter stats for process id '34783':

      27373.819908 task-clock                #    0.912 CPUs utilized
    86,428,653,040 cycles                    #    3.157 GHz
    32,115,412,877 instructions              #    0.37  insns per cycle
                                             #    2.39  stalled cycles per insn
     7,386,220,210 branches                  #  269.828 M/sec
        22,056,397 branch-misses             #    0.30% of all branches
    76,697,049,420 stalled-cycles-frontend   #   88.74% frontend cycles idle
    58,627,393,395 stalled-cycles-backend    #   67.83% backend cycles idle
       256,440,384 cache-references          #    9.368 M/sec
       222,036,981 cache-misses              #   86.584 % of all cache refs
       234,361,189 LLC-loads                 #    8.562 M/sec
       218,570,294 LLC-load-misses           #   93.26% of all LL-cache hits
        18,493,582 LLC-stores                #    0.676 M/sec
         3,233,231 LLC-store-misses          #    0.118 M/sec
     7,324,946,042 L1-dcache-loads           #  267.589 M/sec
       305,276,341 L1-dcache-load-misses     #    4.17% of all L1-dcache hits
        36,890,302 L1-dcache-prefetches      #    1.348 M/sec

      30.000601214 seconds time elapsed

Measure what’s going on inside a CPU!

Metrics explained in my blog entry: http://bit.ly/1PBIlde
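The derived metrics on the right-hand side of the perf stat output can be recomputed from the raw counters. The sketch below does that in Python with the numbers from this slide; the variable names are mine, and taking the larger of the two stall counters for "stalled cycles per insn" is an assumption that happens to reproduce the printed 2.39.

```python
# Raw counters copied from the perf stat output on this slide.
task_clock_ms = 27373.819908
cycles = 86_428_653_040
instructions = 32_115_412_877
stalled_frontend = 76_697_049_420
stalled_backend = 58_627_393_395
cache_references = 256_440_384
cache_misses = 222_036_981

# Instructions per cycle: how much work each clock cycle actually retired.
ipc = instructions / cycles

# Assumption: the larger stall counter is the one reported per instruction.
stalled_per_insn = max(stalled_frontend, stalled_backend) / instructions

# Effective clock rate while the task was on CPU.
ghz = cycles / (task_clock_ms / 1000) / 1e9

frontend_idle_pct = 100 * stalled_frontend / cycles
backend_idle_pct = 100 * stalled_backend / cycles
cache_miss_pct = 100 * cache_misses / cache_references

print(f"IPC                     {ipc:.2f}")
print(f"stalled cycles per insn {stalled_per_insn:.2f}")
print(f"clock rate              {ghz:.3f} GHz")
print(f"frontend cycles idle    {frontend_idle_pct:.2f}%")
print(f"backend cycles idle     {backend_idle_pct:.2f}%")
print(f"cache misses            {cache_miss_pct:.3f}% of cache refs")
```

An IPC of 0.37 on a process that looks about 91% busy in OS tools is the point of the slide: most of those "busy" cycles are stalls.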

Page 7: Low Level CPU Performance Profiling Examples

Modern CPUs can run multiple operations concurrently

http://software.intel.com

Multiple ports/execution units for computation & memory ops

If waiting for RAM – CPU pipeline stall!

Page 8: Low Level CPU Performance Profiling Examples

Latency Numbers Every Programmer Should Know

Latency Comparison Numbers
--------------------------
L1 cache reference                           0.5 ns
Branch mispredict                            5   ns
L2 cache reference                           7   ns                      14x L1 cache
Mutex lock/unlock                           25   ns
Main memory reference                      100   ns                      20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy             3,000   ns        3 us
Send 1K bytes over 1 Gbps network       10,000   ns       10 us
Read 4K randomly from SSD*             150,000   ns      150 us          ~1GB/sec SSD
Read 1 MB sequentially from memory     250,000   ns      250 us
Round trip within same datacenter      500,000   ns      500 us
Read 1 MB sequentially from SSD*     1,000,000   ns    1,000 us    1 ms  ~1GB/sec SSD, 4X memory
Disk seek                           10,000,000   ns   10,000 us   10 ms  20x datacenter roundtrip
Read 1 MB sequentially from disk    20,000,000   ns   20,000 us   20 ms  80x memory, 20X SSD
Send packet CA->Netherlands->CA    150,000,000   ns  150,000 us  150 ms

Source: https://gist.github.com/jboner/2841832
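To relate these latencies to the CPU counters, it helps to express them in clock cycles. This is a rough sketch only: the 3.157 GHz clock is borrowed from the perf stat example earlier in the deck as an assumption, and the latencies are the order-of-magnitude figures from the table, not measurements of any particular machine.

```python
# Assumption: clock rate taken from the perf stat example (3.157 GHz).
CLOCK_GHZ = 3.157

# Latencies in nanoseconds, from the table above.
latencies_ns = {
    "L1 cache reference": 0.5,
    "L2 cache reference": 7,
    "Main memory reference": 100,
}

# cycles = nanoseconds * (cycles per nanosecond)
latency_cycles = {name: ns * CLOCK_GHZ for name, ns in latencies_ns.items()}

for name, cyc in latency_cycles.items():
    print(f"{name:<24} ~{cyc:,.0f} cycles")
```

At that clock rate a single main memory reference costs roughly 316 cycles, during which a stalled core retires nothing for that dependency chain.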

Page 9: Low Level CPU Performance Profiling Examples

CPU = fast

CPU L2 / L3 cache in between

RAM = slow 

Page 10: Low Level CPU Performance Profiling Examples

Tape is dead, disk is tape, flash is disk, RAM locality is king

Jim Gray, 2006

http://research.microsoft.com/en-us/um/people/gray/talks/flash_is_good.ppt

Page 11: Low Level CPU Performance Profiling Examples

Just caching all your data in RAM does not give you a modern “in-memory” system!

* Columnar data structures to the rescue!

Page 12: Low Level CPU Performance Profiling Examples

Row-Major Data Structures

SELECT SUM(column) FROM array

Page 13: Low Level CPU Performance Profiling Examples

Variable field offsets

Memory line (cache line) size = 64 Bytes

Page 14: Low Level CPU Performance Profiling Examples

Columnar Data Structure (conceptual)

Store values of a column next to each other (data locality)

Much less data to scan (or filter) if accessing a subset of columns

Better compression due to adjacent repeating (or slightly differing) values
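A back-of-the-envelope sketch of the data locality argument: count how many 64-byte cache lines a one-column scan touches in each layout. The table shape (100,000 rows, 40 columns, fixed 8-byte values) is hypothetical and chosen only to keep the arithmetic simple; real row formats have variable field offsets, which makes the row-major case worse, not better.

```python
CACHE_LINE = 64      # bytes, as on the earlier slide
ROWS = 100_000       # hypothetical row count
COLUMNS = 40         # hypothetical column count
VALUE_BYTES = 8      # hypothetical fixed-width values


def lines_touched(n_values, stride, width=VALUE_BYTES, line=CACHE_LINE):
    """Count distinct cache lines touched when reading n_values,
    each `width` bytes wide and `stride` bytes apart."""
    touched = set()
    for i in range(n_values):
        start = i * stride
        touched.add(start // line)
        touched.add((start + width - 1) // line)
    return len(touched)


# Row-major: consecutive values of one column are a whole row apart.
row_major = lines_touched(ROWS, stride=COLUMNS * VALUE_BYTES)

# Columnar: consecutive values of one column are adjacent.
columnar = lines_touched(ROWS, stride=VALUE_BYTES)

print(f"row-major: {row_major:,} cache lines")
print(f"columnar:  {columnar:,} cache lines")
print(f"ratio:     {row_major / columnar:.0f}x")
```

In this setup the row-major scan pulls one full cache line per row to use 8 bytes of it, while the columnar scan uses every byte of every line it loads.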

Page 15: Low Level CPU Performance Profiling Examples

Single-Instruction-Multiple-Data (SIMD) processing

• Run an operation (like ADD) on multiple registers/memory locations in a single instruction:

Do the same work with fewer (but more complex) instructions

More concurrency inside CPU

If the underlying data structures “feed” data fast enough …

Page 16: Low Level CPU Performance Profiling Examples

A database example (Oracle)

Page 17: Low Level CPU Performance Profiling Examples

A simple Data Retrieval test!

• Retrieve 1% of the rows out of an 8 GB table:

SELECT COUNT(*), SUM(order_total)
FROM orders
WHERE warehouse_id BETWEEN 500 AND 510

The Warehouse IDs range between 1 and 999

Test data generated by SwingBench tool

Page 18: Low Level CPU Performance Profiling Examples

Data Retrieval: Test Results

• Remember, this is a very simple scanning + filtering query:

TESTNAME                   PLAN_HASH    ELA_MS   CPU_MS      LIOS  BLK_READ
------------------------- ---------- -------- -------- --------- ---------
test1: index range scan *   16715356   265203    37438    782858    511231
test2: full buffered */ C  630573765   132075    48944   1013913    849316
test3: full direct path *  630573765    15567    11808   1013873   1013850
test4: full smart scan */  630573765     2102      729   1013873   1013850
test5: full inmemory scan  630573765      155      155        14         0
test6: full buffer cache   630573765     7850     7831   1014741         0

Test 5 & Test 6 run entirely from memory

Source: http://www.slideshare.net/tanelp/oracle-database-inmemory-option-in-action

But why 50x difference in CPU usage?
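The 50x figure comes straight from the CPU_MS column. A small sketch that recomputes it, using only the numbers in the results table (the dictionary layout is mine):

```python
# (ELA_MS, CPU_MS) per test, copied from the results table.
results = {
    "test1: index range scan": (265203, 37438),
    "test2: full buffered": (132075, 48944),
    "test3: full direct path": (15567, 11808),
    "test4: full smart scan": (2102, 729),
    "test5: full inmemory scan": (155, 155),
    "test6: full buffer cache": (7850, 7831),
}

inmemory_ela, inmemory_cpu = results["test5: full inmemory scan"]
buffer_ela, buffer_cpu = results["test6: full buffer cache"]

# Both tests read everything from RAM (BLK_READ = 0), so the gap is pure CPU work.
cpu_ratio = buffer_cpu / inmemory_cpu
ela_ratio = buffer_ela / inmemory_ela

print(f"CPU time:     buffer cache scan uses {cpu_ratio:.1f}x more than in-memory columnar")
print(f"Elapsed time: buffer cache scan takes {ela_ratio:.1f}x longer")
```

Neither test does any disk I/O, so the difference has to come from how the data is laid out in memory and how many instructions and cycles it takes to walk it.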

Page 19: Low Level CPU Performance Profiling Examples

CPU & cache friendly data structures are key!

[Diagram: a data block made of Headers and ITL entries, a Row Directory (#0 offset, #1 offset, #2 offset, #3 offset, …) and rows (#0 hdr row … #8 hdr row, …). Each row: Hdr byte | Lock byte | CC byte | Col. len | Column data | Col. len | Column data | Col. len | Column data | Col. len | …]

• OLTP: Block->Row->Column format
  • 8kB blocks
  • Great for writes, changes

• Field-length encoding
  • Reading column #100 requires walking through all preceding columns

• Columns (with similar values) not densely packed together

• Not CPU cache friendly for analytics!
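The "walking through all preceding columns" point can be sketched with a toy length-prefixed row. This is a deliberately simplified, hypothetical encoding (one length byte followed by the column bytes); it is not Oracle's actual block or row format, and the function names are made up for illustration.

```python
def encode_row(values):
    """Toy row format: each column is a 1-byte length followed by its bytes."""
    out = bytearray()
    for v in values:
        out.append(len(v))  # length prefix (values must be < 256 bytes)
        out += v
    return bytes(out)


def read_column(row, col_index):
    """Return (column bytes, number of columns skipped to get there).

    There are no fixed offsets, so the only way to find column N is to
    read every preceding length byte and hop over that column's data.
    """
    pos = 0
    skipped = 0
    for _ in range(col_index):
        pos += 1 + row[pos]  # hop over length byte + column data
        skipped += 1
    length = row[pos]
    return row[pos + 1 : pos + 1 + length], skipped


row = encode_row([b"42", b"Tanel", b"", b"2017-01-19"])
value, skipped = read_column(row, 3)
print(value, "found after skipping", skipped, "columns")
```

Every hop is a dependent memory read: the position of the next length byte is not known until the current one has been loaded, which is exactly the kind of access pattern that keeps a pipeline waiting.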

Page 20: Low Level CPU Performance Profiling Examples

Scanning columnar data structures

[Diagram: two layouts, each drawn as cells labeled col 1 … col 6.]
• Scanning a column in a row-oriented data block
• Scanning a column in a column-oriented compression unit

Read filter column(s) first. Access only projected columns if matches found.

Reduced memory traffic. More sequential RAM access, SIMD on adjacent data.

Page 21: Low Level CPU Performance Profiling Examples

Testing data access path differences on Oracle 12c

SELECT COUNT(cust_valid) FROM customers_nopart c WHERE cust_id > 0

Run the same query on the same dataset stored in different formats/layouts.

Full details: http://blog.tanelpoder.com/2015/11/30/ram-is-the-new-disk-and-how-to-measure-its-performance-part-3-cpu-instructions-cycles/

Test result data: http://bit.ly/1RitNMr

Page 22: Low Level CPU Performance Profiling Examples

CPU instructions used for scanning/counting 69M rows

Page 23: Low Level CPU Performance Profiling Examples

Average CPU instructions per row processed

• Knowing that the table has about 69M rows, I can calculate the average number of instructions issued per row processed

Page 24: Low Level CPU Performance Profiling Examples

CPU cycles consumed (full scans only)

Page 25: Low Level CPU Performance Profiling Examples

CPU efficiency (Instructions-per-Cycle)

Yes, modern superscalar CPUs can execute multiple instructions per cycle

Page 26: Low Level CPU Performance Profiling Examples

Reducing memory writes within SQL execution

• Old approach:
  1. Read compressed data chunk
  2. Decompress data (write data to temporary memory location)
  3. Filter out non-matching rows
  4. Return data

• New approach:
  1. Read and filter compressed columns
  2. Decompress only required columns of matching rows
  3. Return data
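One way to picture "read and filter compressed columns" is dictionary encoding, where the predicate is evaluated once per distinct value and rows are then matched on their small integer codes. The slide does not say which compression scheme is in play, so dictionary encoding here is my assumption, and the data is made up; the predicate mirrors the earlier warehouse_id BETWEEN 500 AND 510 query.

```python
def dictionary_encode(values):
    """Replace each value by an index into a sorted dictionary of distinct values."""
    dictionary = sorted(set(values))
    code_of = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [code_of[v] for v in values]


# Hypothetical column data, one entry per row.
warehouse_ids = [500, 7, 505, 999, 510, 7, 123]
order_totals = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0]

dictionary, codes = dictionary_encode(warehouse_ids)

# Step 1: filter on the compressed form. The predicate runs once per
# distinct value, then rows are matched by code without decoding them.
matching_codes = {i for i, v in enumerate(dictionary) if 500 <= v <= 510}
matching_rows = [r for r, c in enumerate(codes) if c in matching_codes]

# Step 2: touch the projected column only for the rows that matched.
row_count = len(matching_rows)
total = sum(order_totals[r] for r in matching_rows)

print(row_count, total)
```

No decompressed copy of the filter column is ever written out, which is the memory-write saving the new approach is after.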

Page 27: Low Level CPU Performance Profiling Examples

Memory reads & writes during internal processing

[Chart, unit = MB. Annotations:]
• Read only requested columns
• Rows counted from chunk headers
• Scan compressed data: few memory writes

Page 28: Low Level CPU Performance Profiling Examples

Spark Examples

• Will use:
  • Spark built in tools
  • Perf
  • Honest Profiler
  • FlameGraphs

Page 30: Low Level CPU Performance Profiling Examples

Apache Spark Tungsten Data Structures

Much denser data structure

“Good memory locality”

Page 31: Low Level CPU Performance Profiling Examples

Spark test setup (RDD)

[Diagram labels: CSV; RDD (partitioned); .cache().repartition(1); RDD (single partition); “For each” sum column X]

import scala.util.Try

// Load the CSV and collapse it into a single partition
val lines = sc.textFile("/tmp/simple_data.csv").repartition(1)

// Split into fields and keep only rows with the full number of fields
val stringFields = lines.map(line => line.split(","))
val fullFieldLength = stringFields.first.length
val completeFields = stringFields.filter(fields => fields.length == fullFieldLength)

// Convert the "year" field to Int (0 if unparsable); yearIndex is defined elsewhere in the demo
val data = completeFields.map(fields => fields.patch(yearIndex, Array(Try(fields(yearIndex).toInt).getOrElse(0)), 1))

log("cache entire RDD in memory")
data.cache()

log("run map(length).max to populate cache")
println(data.map(r => r.length).reduce((l1, l2) => Math.max(l1, l2)))

I wanted to simplify this test as much as possible

Page 32: Low Level CPU Performance Profiling Examples

“SELECT” sum (Year) from RDD

// SUM all values of “year” column
println(data.map(d => d(yearIndex).asInstanceOf[Int]).reduce((y1, y2) => y1 + y2))

Cached RDD ~1M records, ~40 columns

1-column sum: 0.349 seconds!

17/01/19 18:43:36 INFO DAGScheduler: ResultStage 123 (reduce at demo.scala:89) finished in 0.349 s
17/01/19 18:43:36 INFO DAGScheduler: Job 61 finished: reduce at demo.scala:89, took 0.353754 s

Page 33: Low Level CPU Performance Profiling Examples

Spark test setup (DataFrame)

[Diagram labels: CSV; RDD (partitioned); .cache().repartition(1); RDD (single partition); DataFrame; “For each” sum column X]

// Same loading and cleanup steps as in the RDD test
val lines = sc.textFile("/tmp/simple_data.csv").repartition(1)

val stringFields = lines.map(line => line.split(","))
val fullFieldLength = stringFields.first.length
val completeFields = stringFields.filter(fields => fields.length == fullFieldLength)

val data = completeFields.map(fields => fields.patch(yearIndex, Array(Try(fields(yearIndex).toInt).getOrElse(0)), 1))

...

// Wrap each record in a Row and attach the schema to get a DataFrame
val dataFrame = ss.createDataFrame(data.map(d => Row(d: _*)), schema)

log("cache entire data-frame in memory")
dataFrame.cache()

log("run map(length).max to populate cache")
println(dataFrame.map(r => r.length).reduce((l1, l2) => Math.max(l1, l2)))

Page 34: Low Level CPU Performance Profiling Examples

“SELECT” sum (Year) from DataFrame (silly example!)

// SUM all values of “year” column
println(dataFrame.map(r => r(yearIndex).asInstanceOf[Int]).reduce((y1, y2) => y1 + y2))

17/01/19 19:39:25 INFO DAGScheduler: ResultStage 29 (reduce at demo.scala:71) finished in 4.664 s
17/01/19 19:39:25 INFO DAGScheduler: Job 14 finished: reduce at demo.scala:71, took 4.673204 s

Cached DataFrame: ~1M records, ~40 columns

1-column SUM: 4.67 seconds!  (13x more than RDD?)

This does not make sense!

Page 35: Low Level CPU Performance Profiling Examples

“SELECT” sum (Year) from DataFrame (proper)

// SUM all values of “year” column
println(dataFrame.agg(sum("Year")).first.get(0))

17/01/19 19:32:02 INFO DAGScheduler: ResultStage 118 (first at demo.scala:70) finished in 0.004 s
17/01/19 19:32:02 INFO DAGScheduler: Job 40 finished: first at demo.scala:70, took 0.041698 s

Cached DataFrame ~1M records, ~40 columns

1-column sum with aggregation pushdown: 0.041 seconds! 

(Over 100x faster than previous Silly DataFrame and 8.5x faster than 1st RDD example)
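The speedup figures quoted on these slides follow from the three "Job … finished … took" log lines. A quick sketch that recomputes them (variable names are mine; the timings are from the DAGScheduler output shown earlier):

```python
# Job durations in seconds, from the DAGScheduler log lines.
rdd_map_reduce = 0.353754   # RDD: map + reduce
df_map_reduce = 4.673204    # DataFrame: map + reduce (the "silly" version)
df_agg_sum = 0.041698       # DataFrame: agg(sum("Year"))

silly_vs_rdd = df_map_reduce / rdd_map_reduce
agg_vs_silly = df_map_reduce / df_agg_sum
agg_vs_rdd = rdd_map_reduce / df_agg_sum

print(f"silly DataFrame vs RDD:        {silly_vs_rdd:.1f}x slower")
print(f"agg() vs silly DataFrame:      {agg_vs_silly:.0f}x faster")
print(f"agg() vs RDD:                  {agg_vs_rdd:.1f}x faster")
```

The same cached data gives a spread of more than two orders of magnitude depending only on how the sum is expressed.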

Page 36: Low Level CPU Performance Profiling Examples

Summary

• New data structures are required for CPU efficiency!
  • Columnar …

• On efficient data structures, efficient code becomes possible
  • Bad code still performs badly …

• It is possible to measure the CPU efficiency of your code
  • That should come after the usual profiling and DAG / execution plan validation

• All secondary metrics (like efficiency ratios) should be used in context of how much work got done

Page 37: Low Level CPU Performance Profiling Examples

Past & Future

Page 38: Low Level CPU Performance Profiling Examples

Future-proof Open Data Formats!

• Disk-optimized columnar data structures
  • Apache Parquet
    • https://parquet.apache.org/
  • Apache ORC
    • https://orc.apache.org/

• Memory / CPU-cache optimized data structures
  • Apache Arrow
    • Not only storage format
    • … also a cross-system/cross-platform IPC communication framework
    • https://arrow.apache.org/

Page 39: Low Level CPU Performance Profiling Examples

Future

1. RAM gets cheaper + bigger, not necessarily faster

2. CPU caches get larger

3. RAM blends with storage and becomes non-volatile

4. IO subsystems (flash) get even closer to CPUs

5. IO latencies shrink

6. The latency difference between non-volatile storage and volatile RAM shrinks - new database layouts!

7. CPU cache is king – new data structures needed!

Page 40: Low Level CPU Performance Profiling Examples

The tools used here:

• Honest Profiler by Richard Warburton (@RichardWarburto)
  • https://github.com/RichardWarburton/honest-profiler

• Flame Graphs by Brendan Gregg (@brendangregg)
  • http://www.brendangregg.com/flamegraphs.html

• Linux perf tool
  • https://perf.wiki.kernel.org/index.php/Main_Page

• Spark-Prof demos:
  • https://github.com/gluent/spark-prof

Page 42: Low Level CPU Performance Profiling Examples

Thanks!

http://gluent.com/

We are hiring developers & data engineers!!!

http://blog.tanelpoder.com
@tanelpoder