Low-Level CPU Performance Profiling Examples

Tanel Poder
a long-time computer performance geek

@tanelpoder
blog.tanelpoder.com

http://gluent.com/

Intro: About me

• Tanel Põder
  • RDBMS Performance geek 20+ years (Oracle)
  • Unix/Linux Performance geek
  • Hadoop Performance geek
  • Spark Performance geek?

• http://blog.tanelpoder.com
• @tanelpoder

Expert Oracle Exadata book

Gluent as a data virtualization layer

A data sharing platform for enterprise applications

[Diagram labels: Gluent; Oracle, Teradata, NoSQL, MSSQL, Big Data Sources; App X, App Y, App Z]

Some Microscopic level stuff to talk about…

1. Some things worth knowing about modern CPUs
2. Measuring internal CPU efficiency (C++)
3. A columnar database scanning example (Oracle)
4. Low level Analysis of Spark Performance
   • RDD vs DataFrame
   • DataFrame with bad code

This is gonna be a (hopefully fun) hacking session!

“100%” busy?

A CPU close to 100% busy?

What if I told you your CPU is not that busy?

CPU Performance Counters on Linux

# perf stat -d -p PID sleep 30

 Performance counter stats for process id '34783':

      27373.819908 task-clock                #    0.912 CPUs utilized
    86,428,653,040 cycles                    #    3.157 GHz
    32,115,412,877 instructions              #    0.37  insns per cycle
                                             #    2.39  stalled cycles per insn
     7,386,220,210 branches                  #  269.828 M/sec
        22,056,397 branch-misses             #    0.30% of all branches
    76,697,049,420 stalled-cycles-frontend   #   88.74% frontend cycles idle
    58,627,393,395 stalled-cycles-backend    #   67.83% backend cycles idle
       256,440,384 cache-references          #    9.368 M/sec
       222,036,981 cache-misses              #   86.584 % of all cache refs
       234,361,189 LLC-loads                 #    8.562 M/sec
       218,570,294 LLC-load-misses           #   93.26% of all LL-cache hits
        18,493,582 LLC-stores                #    0.676 M/sec
         3,233,231 LLC-store-misses          #    0.118 M/sec
     7,324,946,042 L1-dcache-loads           #  267.589 M/sec
       305,276,341 L1-dcache-load-misses     #    4.17% of all L1-dcache hits
        36,890,302 L1-dcache-prefetches      #    1.348 M/sec

      30.000601214 seconds time elapsed

Measure what’s going on inside a CPU!

Metrics explained in my blog entry: http://bit.ly/1PBIlde
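
The ratios that perf prints after the '#' sign can be recomputed from the raw counters. The following is a minimal Python sketch using the counter values from the perf stat output above; the helper name and dictionary keys are illustrative and are not part of perf itself.

```python
# Raw counter values copied from the perf stat output above.
counters = {
    "cycles": 86_428_653_040,
    "instructions": 32_115_412_877,
    "stalled-cycles-frontend": 76_697_049_420,
    "cache-references": 256_440_384,
    "cache-misses": 222_036_981,
}


def derived_metrics(c):
    """Recompute three of the ratios shown after '#' (illustrative helper)."""
    return {
        "insns_per_cycle": c["instructions"] / c["cycles"],
        "stalled_cycles_per_insn": c["stalled-cycles-frontend"] / c["instructions"],
        "cache_miss_pct": 100.0 * c["cache-misses"] / c["cache-references"],
    }


metrics = derived_metrics(counters)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Read this way, a process that keeps a CPU about 91% "utilized" is retiring only about 0.37 instructions per cycle, so most of its cycles are spent stalled rather than doing work.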

Modern CPUs can run multiple operations concurrently

http://software.intel.com

Multiple ports/execution units for computation & memory ops

If waiting for RAM – CPU pipeline stall!

Latency Numbers Every Programmer Should Know

Latency Comparison Numbers
--------------------------
L1 cache reference                           0.5 ns
Branch mispredict                            5   ns
L2 cache reference                           7   ns                      14x L1 cache
Mutex lock/unlock                           25   ns
Main memory reference                      100   ns                      20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy             3,000   ns        3 us
Send 1K bytes over 1 Gbps network       10,000   ns       10 us
Read 4K randomly from SSD*             150,000   ns      150 us          ~1GB/sec SSD
Read 1 MB sequentially from memory     250,000   ns      250 us
Round trip within same datacenter      500,000   ns      500 us
Read 1 MB sequentially from SSD*     1,000,000   ns    1,000 us    1 ms  ~1GB/sec SSD, 4X memory
Disk seek                           10,000,000   ns   10,000 us   10 ms  20x datacenter roundtrip
Read 1 MB sequentially from disk    20,000,000   ns   20,000 us   20 ms  80x memory, 20X SSD
Send packet CA->Netherlands->CA    150,000,000   ns  150,000 us  150 ms

Source: https://gist.github.com/jboner/2841832
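
To relate these latencies to CPU work, it can help to express them in clock cycles. The sketch below assumes the 3.157 GHz clock rate from the earlier perf stat example; that pairing is an assumption for illustration only, since the latency table is not tied to any particular CPU.

```python
CLOCK_GHZ = 3.157  # assumed clock rate, borrowed from the earlier perf stat example

# A few rows copied from the latency table above (nanoseconds).
latencies_ns = {
    "L1 cache reference": 0.5,
    "L2 cache reference": 7,
    "Main memory reference": 100,
}


def ns_to_cycles(ns, ghz=CLOCK_GHZ):
    """Clock cycles that elapse in `ns` nanoseconds (1 GHz = 1 cycle per ns)."""
    return ns * ghz


for name, ns in latencies_ns.items():
    ratio_to_l1 = ns / latencies_ns["L1 cache reference"]
    print(f"{name}: {ns} ns = ~{ns_to_cycles(ns):.0f} cycles, {ratio_to_l1:.0f}x L1")
```

Under that assumption a single main memory reference costs roughly 300 cycles, during which a stalled core retires nothing for the waiting instruction.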

CPU = fast

CPU L2 / L3 cache in between

RAM = slow

Tape is dead, disk is tape, flash is disk, RAM locality is king

Jim Gray, 2006

http://research.microsoft.com/en-us/um/people/gray/talks/flash_is_good.ppt

Just caching all your data in RAM does not give you a modern “in-memory” system!

* Columnar data structures to the rescue!

Row-Major Data Structures

SELECT SUM(column) FROM array

Variable field offsets

Memory line (cache line) size = 64 Bytes

Columnar Data Structure (conceptual)

Store values of a column next to each other (data locality)

Much less data to scan (or filter) if accessing a subset of columns

Better compression due to adjacent repeating (or slightly differing) values
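
The data locality argument can be put into rough numbers by counting how many 64-byte cache lines a one-column scan has to pull in. This is a back-of-the-envelope sketch, not a measurement: it assumes fixed-width, aligned 8-byte values and fixed-size rows, and the function names and the 1M-row, 40-column table are made up for illustration.

```python
import math

CACHE_LINE_BYTES = 64  # cache line size, as on the earlier slide


def lines_touched_row_major(num_rows, row_bytes):
    """Cache lines read when summing one column stored inside fixed-size rows."""
    if row_bytes >= CACHE_LINE_BYTES:
        # Consecutive values of the column are at least one cache line apart,
        # so every row costs its own cache line.
        return num_rows
    return math.ceil(num_rows * row_bytes / CACHE_LINE_BYTES)


def lines_touched_columnar(num_rows, value_bytes):
    """Cache lines read when the column's values are stored next to each other."""
    return math.ceil(num_rows * value_bytes / CACHE_LINE_BYTES)


# Hypothetical table: 1M rows, 40 columns of 8 bytes each (320-byte rows).
rows, value_bytes, num_cols = 1_000_000, 8, 40
row_major = lines_touched_row_major(rows, value_bytes * num_cols)
columnar = lines_touched_columnar(rows, value_bytes)
print(row_major, columnar, row_major / columnar)
```

Under these assumptions the columnar layout touches one eighth of the cache lines, before any compression is taken into account.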

Single-Instruction-Multiple-Data (SIMD) processing

• Run an operation (like ADD) on multiple registers/memory locations in a single instruction:

Do the same work with fewer (but more complex) instructions

More concurrency inside CPU

If the underlying data structures “feed” data fast enough …

A database example (Oracle)

A simple Data Retrieval test!

• Retrieve 1% of the rows from an 8 GB table:

SELECT COUNT(*), SUM(order_total)
FROM orders
WHERE warehouse_id BETWEEN 500 AND 510

The Warehouse IDs range between 1 and 999

Test data generated by SwingBench tool

Data Retrieval: Test Results

• Remember, this is a very simple scanning + filtering query:

TESTNAME                   PLAN_HASH   ELA_MS   CPU_MS      LIOS  BLK_READ
------------------------- ---------- -------- -------- --------- ---------
test1: index range scan *   16715356   265203    37438    782858    511231
test2: full buffered */ C  630573765   132075    48944   1013913    849316
test3: full direct path *  630573765    15567    11808   1013873   1013850
test4: full smart scan */  630573765     2102      729   1013873   1013850
test5: full inmemory scan  630573765      155      155        14         0
test6: full buffer cache   630573765     7850     7831   1014741         0

Test 5 & Test 6 run entirely from memory

Source: http://www.slideshare.net/tanelp/oracle-database-inmemory-option-in-action

But why 50x difference in CPU usage?
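
The 50x figure follows directly from the CPU_MS column. A small Python sketch of the arithmetic, using the ELA_MS and CPU_MS values from the table above (the variable and function names are illustrative):

```python
# (ELA_MS, CPU_MS) pairs copied from the test results table above.
results = {
    "test1: index range scan": (265203, 37438),
    "test5: full inmemory scan": (155, 155),
    "test6: full buffer cache": (7850, 7831),
}


def cpu_fraction(test_name):
    """Share of elapsed time that was spent on CPU."""
    ela_ms, cpu_ms = results[test_name]
    return cpu_ms / ela_ms


cpu_ratio = results["test6: full buffer cache"][1] / results["test5: full inmemory scan"][1]
print(f"test6 vs test5 CPU time: {cpu_ratio:.1f}x")
for name in results:
    print(f"{name}: {cpu_fraction(name):.0%} of elapsed time on CPU")
```

Both in-memory tests are almost entirely CPU-bound, so the gap between them cannot be explained by I/O; it has to come from how the data is laid out and processed in memory.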

CPU & cache friendly data structures are key!

[Diagram: data block layout. Block: Headers, ITL entries; Row Directory (#0 offset, #1 offset, #2 offset, #3 offset, …); rows (#0 hdr row … #8 hdr row, …). Row: Hdr byte, Lock byte, CC byte, followed by repeated (Col. len, Column data) pairs]

• OLTP: Block->Row->Column format
  • 8kB blocks
  • Great for writes, changes

• Field-length encoding
  • Reading column #100 requires walking through all preceding columns

• Columns (with similar values) not densely packed together

• Not CPU cache friendly for analytics!
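
The cost of field-length encoding can be illustrated with a toy row format. The sketch below is a deliberately simplified model (a 1-byte length prefix before each column value, no row header), not Oracle's actual on-disk format; `encode_row` and `read_column` are hypothetical helpers.

```python
def encode_row(values):
    """Encode a row as repeated (1-byte length, data) pairs (toy format)."""
    out = bytearray()
    for value in values:
        data = value.encode("utf-8")
        if len(data) > 255:
            raise ValueError("toy format supports values up to 255 bytes")
        out.append(len(data))
        out += data
    return bytes(out)


def read_column(row, col_idx):
    """Return (value, length_bytes_inspected) for the column at `col_idx`."""
    pos = 0
    # There are no fixed offsets: every preceding column's length byte
    # must be read just to find where the wanted column starts.
    for _ in range(col_idx):
        pos += 1 + row[pos]
    length = row[pos]
    value = row[pos + 1:pos + 1 + length].decode("utf-8")
    return value, col_idx + 1


row = encode_row(["a", "bb", "ccc"])
print(read_column(row, 2))
```

In this model, reading the last of 100 columns means inspecting 100 length bytes scattered across the row, which is the kind of dependent, pointer-chasing access pattern that works against CPU caches.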

Scanning columnar data structures

[Diagram: scanning a column in a row-oriented data block (values of col 1 … col 6 interleaved row by row) vs. scanning a column in a column-oriented compression unit (values of each column stored together)]

Read filter column(s) first. Access only projected columns if matches found.

Reduced memory traffic. More sequential RAM access, SIMD on adjacent data.
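
The "filter column first, projected columns only on a match" idea can be sketched in a few lines of Python. This is a conceptual illustration with one list per column and a tiny made-up dataset; it reuses the column names from the earlier orders query but says nothing about how a real database engine implements the scan.

```python
def columnar_count_and_sum(filter_col, projected_col, low, high):
    """COUNT(*) and SUM(projected_col) for rows where low <= filter_col <= high."""
    # Step 1: scan only the filter column and remember matching row positions.
    matches = [i for i, value in enumerate(filter_col) if low <= value <= high]
    # Step 2: touch the projected column only at the matching positions.
    total = sum(projected_col[i] for i in matches)
    return len(matches), total


# Tiny made-up dataset, one list per column.
warehouse_id = [499, 500, 505, 510, 511]
order_total = [10, 20, 30, 40, 50]

result = columnar_count_and_sum(warehouse_id, order_total, 500, 510)
print(result)
```

With a selective filter, most of the projected column is never read at all, which is where the reduced memory traffic comes from.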

Testing data access path differences on Oracle 12c

SELECT COUNT(cust_valid) FROM customers_nopart c WHERE cust_id > 0

Run the same query on same dataset stored in different formats/layouts.

Full details: http://blog.tanelpoder.com/2015/11/30/ram-is-the-new-disk-and-how-to-measure-its-performance-part-3-cpu-instructions-cycles/

Test result data: http://bit.ly/1RitNMr

CPU instructions used for scanning/counting 69M rows

Average CPU instructions per row processed

• Knowing that the table has about 69M rows, I can calculate the average number of instructions issued per row processed

CPU cycles consumed (full scans only)

CPU efficiency (Instructions-per-Cycle)

Yes, modern superscalar CPUs can execute multiple instructions per cycle

Reducing memory writes within SQL execution

• Old approach:
  1. Read compressed data chunk
  2. Decompress data (write data to temporary memory location)
  3. Filter out non-matching rows
  4. Return data

• New approach:
  1. Read and filter compressed columns
  2. Decompress only required columns of matching rows
  3. Return data
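
The difference between the two approaches can be sketched with dictionary encoding standing in for "compression". Dictionary encoding is an assumption chosen for illustration (the slide does not name the compression scheme), and all function names below are hypothetical.

```python
def encode(values):
    """Dictionary-encode a column: returns (dictionary, list of integer codes)."""
    dictionary = sorted(set(values))
    code_of = {value: code for code, value in enumerate(dictionary)}
    return dictionary, [code_of[value] for value in values]


def old_approach(dictionary, codes, wanted):
    # Decompress everything into a temporary copy, then filter.
    decompressed = [dictionary[code] for code in codes]
    return [value for value in decompressed if value == wanted]


def new_approach(dictionary, codes, wanted):
    # Translate the predicate into a code once, then filter the compressed codes.
    if wanted not in dictionary:
        return []
    target = dictionary.index(wanted)
    matching_rows = [i for i, code in enumerate(codes) if code == target]
    # Decompress only the matching rows.
    return [dictionary[codes[i]] for i in matching_rows]


dictionary, codes = encode(["NY", "CA", "NY", "TX"])
print(old_approach(dictionary, codes, "NY"), new_approach(dictionary, codes, "NY"))
```

Both functions return the same rows, but the second never materializes the non-matching values, which corresponds to the reduction in temporary memory writes described above.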

Memory reads & writes during internal processing

Unit = MB

Read only requested columns

Rows counted from chunk headers

Scan compressed data: few memory writes

Spark Examples

• Will use:
  • Spark built-in tools
  • Perf
  • Honest Profiler
  • FlameGraphs

Apache Spark Tungsten Data Structures

Databricks presentation:
http://www.slideshare.net/SparkSummit/deep-dive-into-project-tungsten-josh-rosen

Much denser data structure

Using sun.misc.Unsafe API to bypass JVM object allocator

Apache Spark Tungsten Data Structures

Much denser data structure

“Good memory locality”

Spark test setup (RDD)

[Diagram labels: CSV; RDD (partitioned); .cache().repartition(1); RDD (single partition); “For each” sum column X]

import scala.util.Try

// Load the CSV and collapse it into a single partition
val lines = sc.textFile("/tmp/simple_data.csv").repartition(1)

// Split each line into fields and keep only rows with the full number of fields
val stringFields = lines.map(line => line.split(","))
val fullFieldLength = stringFields.first.length
val completeFields = stringFields.filter(fields => fields.length == fullFieldLength)

// Convert the "year" field to Int (0 if it does not parse)
val data = completeFields.map(fields => fields.patch(yearIndex, Array(Try(fields(yearIndex).toInt).getOrElse(0)), 1))

log("cache entire RDD in memory")
data.cache()

log("run map(length).max to populate cache")
println(data.map(r => r.length).reduce((l1, l2) => Math.max(l1, l2)))

I wanted to simplify this test as much as possible

“SELECT” sum (Year) from RDD

// SUM all values of “year” column
println(data.map(d => d(yearIndex).asInstanceOf[Int]).reduce((y1, y2) => y1 + y2))

Cached RDD: ~1M records, ~40 columns

1-column sum: 0.349 seconds!

17/01/19 18:43:36 INFO DAGScheduler: ResultStage 123 (reduce at demo.scala:89) finished in 0.349 s
17/01/19 18:43:36 INFO DAGScheduler: Job 61 finished: reduce at demo.scala:89, took 0.353754 s

Spark test setup (DataFrame)

[Diagram labels: CSV; RDD (partitioned); .cache().repartition(1); RDD (single partition); DataFrame; “For each” sum column X]

import scala.util.Try
import org.apache.spark.sql.Row

// Same loading and cleanup steps as in the RDD test
val lines = sc.textFile("/tmp/simple_data.csv").repartition(1)

val stringFields = lines.map(line => line.split(","))
val fullFieldLength = stringFields.first.length
val completeFields = stringFields.filter(fields => fields.length == fullFieldLength)

val data = completeFields.map(fields => fields.patch(yearIndex, Array(Try(fields(yearIndex).toInt).getOrElse(0)), 1))

...

// Wrap each record in a Row and build a DataFrame using the schema
val dataFrame = ss.createDataFrame(data.map(d => Row(d: _*)), schema)

log("cache entire data-frame in memory")
dataFrame.cache()

log("run map(length).max to populate cache")
println(dataFrame.map(r => r.length).reduce((l1, l2) => Math.max(l1, l2)))

“SELECT” sum (Year) from DataFrame (silly example!)

// SUM all values of “year” column
println(dataFrame.map(r => r(yearIndex).asInstanceOf[Int]).reduce((y1, y2) => y1 + y2))

17/01/19 19:39:25 INFO DAGScheduler: ResultStage 29 (reduce at demo.scala:71) finished in 4.664 s
17/01/19 19:39:25 INFO DAGScheduler: Job 14 finished: reduce at demo.scala:71, took 4.673204 s

Cached DataFrame: ~1M records, ~40 columns

1-column SUM: 4.67 seconds! (13x more than RDD?)

This does not make sense!

“SELECT” sum (Year) from DataFrame (proper)

import org.apache.spark.sql.functions.sum

// SUM all values of “year” column
println(dataFrame.agg(sum("Year")).first.get(0))

17/01/19 19:32:02 INFO DAGScheduler: ResultStage 118 (first at demo.scala:70) finished in 0.004 s
17/01/19 19:32:02 INFO DAGScheduler: Job 40 finished: first at demo.scala:70, took 0.041698 s

Cached DataFrame ~1M records, ~40 columns

1-column sum with aggregation pushdown: 0.041 seconds!

(Over 100x faster than previous Silly DataFrame and 8.5x faster than 1st RDD example)
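
The speedup figures can be checked against the "Job ... finished" log lines of the three Spark runs. A small Python sketch of that arithmetic follows; the dictionary keys and the helper name are illustrative, and job times are used throughout (the slides mix stage and job times, so the rounded figures differ slightly).

```python
# Elapsed seconds from the "Job ... finished" log lines of the three Spark tests.
elapsed_s = {
    "rdd_reduce": 0.353754,
    "dataframe_map_reduce": 4.673204,
    "dataframe_agg": 0.041698,
}


def speedup(slow, fast):
    """How many times faster the `fast` run is than the `slow` run."""
    return elapsed_s[slow] / elapsed_s[fast]


print(f"DataFrame agg vs silly DataFrame: {speedup('dataframe_map_reduce', 'dataframe_agg'):.0f}x")
print(f"DataFrame agg vs RDD:             {speedup('rdd_reduce', 'dataframe_agg'):.1f}x")
print(f"RDD vs silly DataFrame:           {speedup('dataframe_map_reduce', 'rdd_reduce'):.1f}x")
```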

Summary

• New data structures are required for CPU efficiency!
  • Columnar …

• On efficient data structures, efficient code becomes possible
  • Bad code still performs badly …

• It is possible to measure the CPU efficiency of your code
  • That should come after the usual profiling and DAG / execution plan validation

• All secondary metrics (like efficiency ratios) should be used in context of how much work got done
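
The last point is easy to see with numbers. The sketch below uses entirely hypothetical instruction counts and IPC values to show how a run with a "worse" efficiency ratio can still finish sooner because it does far less work.

```python
def cycles_needed(instructions, ipc):
    """CPU cycles consumed for a given instruction count and instructions-per-cycle."""
    return instructions / ipc


# Hypothetical runs of the same query: A looks efficient, B executes far fewer instructions.
run_a = {"instructions": 10_000_000_000, "ipc": 2.0}
run_b = {"instructions": 1_000_000_000, "ipc": 0.5}

cycles_a = cycles_needed(**run_a)
cycles_b = cycles_needed(**run_b)
print(f"run A: IPC {run_a['ipc']}, {cycles_a:.0f} cycles")
print(f"run B: IPC {run_b['ipc']}, {cycles_b:.0f} cycles")
```

Run A has four times the IPC of run B, yet consumes 2.5 times as many cycles, so the ratio alone would point at the wrong winner.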

Past & Future

Future-proof Open Data Formats!

• Disk-optimized columnar data structures
  • Apache Parquet
    • https://parquet.apache.org/
  • Apache ORC
    • https://orc.apache.org/

• Memory / CPU-cache optimized data structures
  • Apache Arrow
    • Not only storage format
    • … also a cross-system/cross-platform IPC communication framework
    • https://arrow.apache.org/

Future

1. RAM gets cheaper + bigger, not necessarily faster

2. CPU caches get larger

3. RAM blends with storage and becomes non-volatile

4. IO subsystems (flash) get even closer to CPUs

5. IO latencies shrink

6. The latency difference between non-volatile storage and volatile RAM shrinks - new database layouts!

7. CPU cache is king – new data structures needed!

The tools used here:

• Honest Profiler by Richard Warburton (@RichardWarburto)
  • https://github.com/RichardWarburton/honest-profiler

• Flame Graphs by Brendan Gregg (@brendangregg)
  • http://www.brendangregg.com/flamegraphs.html

• Linux perf tool
  • https://perf.wiki.kernel.org/index.php/Main_Page

• Spark-Prof demos:
  • https://github.com/gluent/spark-prof

References

• Slides & Video of a similar presentation (about Oracle):
  • http://www.slideshare.net/tanelp
  • https://vimeo.com/gluent

• RAM is the new disk series:
  • http://blog.tanelpoder.com/2015/08/09/ram-is-the-new-disk-and-how-to-measure-its-performance-part-1/
  • https://docs.google.com/spreadsheets/d/1ss0rBG8mePAVYP4hlpvjqAAlHnZqmuVmSFbHMLDsjaU/

Thanks!

http://gluent.com/

We are hiring developers & data engineers!!!

http://blog.tanelpoder.com
@tanelpoder
