
Steve Shaw, Intel Database Technology Manager


Agenda

• Introduction
• HammerDB Introduction
• OS Configuration Essentials and Tools: CPU, Memory, I/O
• OLTP workload customer example
• Analysing Results and Price/Performance
• Scaling and Clustering
• Development and Directions
• Summary


Introduction

• Steve Shaw – Database Tech Manager, Intel

• Co-authored 2 books on Oracle

• Works with multiple databases, commercial and open source, all on Intel since Dynix/ptx on the Pentium Pro (200 MHz, 1 MB cache)

• Specialized in scaling up and scaling out

• HammerDB is a GPL, employer-approved open source project developed under the Intel Linux User Group program


What is HammerDB?
"industry standard database benchmarking tool"*

http://www.hammerdb.com/benchmarks.html

Database                                                 License      Test Results     Interface    Library
Oracle / TimesTen                                        Commercial   Restricted       Oracle OCI   Oratcl
MS SQL Server                                            Commercial   Restricted       ODBC         TclODBC
IBM DB2                                                  Commercial   Restricted       DB2 CLI      Db2tcl
MariaDB / MySQL (Amazon Aurora)                          Open Source  Free to publish  MySQL C API  MySQLtcl
PostgreSQL (EnterpriseDB / Greenplum / Amazon Redshift)  Open Source  Free to publish  libpq        Pgtclng
Redis                                                    Open Source  Free to publish  TCL client   In-built (Retcl planned)
Trafodion SQL on Hadoop                                  Open Source  Free to publish  ODBC         TDBC

Workloads: OLTP (from TPC-C) and OLAP (from TPC-H)


Databases, Licenses and Workloads

• HammerDB is GPL, as are all extensions, compiled using gcc on Linux and MS Visual C/C++ on Windows

• Third party database libraries are required in the library path


How benchmarking has changed

• Old method:
  • Submit official audited benchmarks
  • Example cost for 1 official benchmark: $4,483,729
  • 'Bare metal' environment
  • Last official TPC-C benchmark in 2014; 1 current

• New method:
  • Enable people to run their own benchmarks
  • Inbuilt OLTP (TPC-C) and OLAP (TPC-H) workloads
  • Zero cost
  • Bare metal + cloud, virtualization, containers, control groups
  • Share your results online

OLTP and OLAP workloads

[Diagram — TPC-H Overview: an OLTP database serves Business Operations with OLTP transactions; a DSS database (100GB, 300GB, 1TB, 3TB or 10TB) serves Business Analysis with DSS queries for Decision Makers. DSS = Decision Support System. TPC-C models the OLTP side; TPC-H models the DSS side. Source: www.tpc.org]

• OLAP, based on the TPC-H specification: complex analytic queries that favour parallel query engines and column-store databases

• OLTP, based on the TPC-C specification: transactional throughput; a complex workload with deliberate contention to test the scalability of the database engine

NOTE: HammerDB does not implement a full TPC-C or TPC-H workload, or use any of the terminology (e.g. tpmC, QphH) to imply that the workloads do.

Scalable Schema Configurations

OLTP/TPC-C OLAP/TPC-H


OLTP HammerDB Scalability

• Used in Intel database testing in multiple groups, e.g. to generate database performance data for forthcoming processors (more than 50 data points)

  – CPUs are required to pass multiple performance tests; HammerDB is one of these

  – Scalability has proven alignment with TPC-C over generations, at a fraction of the cost

• Example: 'Skylake'

  – Generation of performance data for multiple SKUs

  – Review and approval

  – Known product performance

  – Not public due to commercial database licensing, known as the 'DeWitt Clause'

[Charts: TPM across CPU generations; TPM across CPU generations and SKUs]

2. Up to 5x claim based on OLTP warehouse workload: 1-node, 4 x Intel® Xeon® Processor E7-4870 (Source: Request Number 56, Benchmark: HammerDB, Score: 2.46322e+006; higher is better) vs. 1-node, 4 x Intel® Xeon® Platinum 8180 Processor


Example: Skylake Launch

"With up to 28 of the highest-performance cores, the all-new Intel Xeon Scalable platform can support up to 5x more transactions per second2 than 4-year-old systems"

https://newsroom.intel.com/editorials/intel-xeon-scalable-processor-family-data-center/

‘Relative’ rather than ‘Absolute’ performance data at product launches


OLTP Comparing Results: TPM and NOPM

• HammerDB produces two results: TPM and NOPM

• TPM is database-specific, i.e. TPM cannot be compared between different databases (apart from relative scaling)

• NOPM is schema/workload-specific, i.e. NOPM can be compared between different databases

• HammerDB uses both: TPM is a lightweight database statistic to gather and monitor, while NOPM may impact the schema, so it is used minimally
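The TPM/NOPM rule above can be made concrete with a small sketch. The numbers here are hypothetical, used only to illustrate which comparisons are valid:

```python
# Hypothetical HammerDB results, illustrating the rule above:
# TPM is only meaningful as relative scaling within the SAME database;
# NOPM is comparable across databases running the same schema/workload.

def relative_scaling(tpm_new: float, tpm_baseline: float) -> float:
    """Relative scaling between two runs of the same database."""
    return tpm_new / tpm_baseline

# Same database on two CPU generations: TPM scaling is valid.
print(round(relative_scaling(2_500_000, 2_000_000), 2))  # 1.25

# Different databases: compare NOPM, never raw TPM.
nopm = {"MariaDB": 460_000, "CommercialDB": 520_000}
print(max(nopm, key=nopm.get))  # CommercialDB
```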


HammerDB/Commercial and Sysbench/MariaDB

[Charts: HammerDB v3 to v4 (commercial database), E5-2697 v3 vs E5-2697 v4: 1.26X; HammerDB v4 to Skylake (commercial), E5-2699 v4 vs Intel® Xeon® Platinum 8180: 1.70X]

• Similar scaling observed across databases
• HammerDB has more complexity and contention, and supports OLTP and OLAP


Why HammerDB is written in TCL (and not Python)

• TCL threading model: the user sees one process per application

• TCL: one interpreter per thread = high performance, high scalability, stability

• Python is restricted by the GIL ('global interpreter lock') = one thread executing at a time

• One thread, one interpreter with a low-level API; database-driven to 100%

• Users check a TSV to see if the stop button was pressed, or can kill threads

• Tens of millions of transactions per minute


Commercial Tool Comparison

• The TCL-based open source application delivers higher performance at lower CPU utilisation than a leading commercial tool


BIOS Settings


• Optimal BIOS settings are essential to performance

• Testing is essential for optimization

• Beware of 'Maximum Performance' set-and-forget

Visit ark.intel.com

CPU

cat /proc/cpuinfo | grep -i intel
vendor_id  : GenuineIntel
model name : Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz


intel@purley1:~$ sudo turbostat --debug

turbostat version 4.16 24 Dec 2016 Len Brown <lenb@kernel.org>

10 * 100 = 1000 MHz max efficiency frequency

25 * 100 = 2500 MHz base frequency

cpu106: MSR_IA32_POWER_CTL: 0x29240059

cpu106: MSR_TURBO_RATIO_LIMIT: 0x2021232323232426

32 * 100 = 3200 MHz max turbo 8 active cores

33 * 100 = 3300 MHz max turbo 7 active cores

35 * 100 = 3500 MHz max turbo 6 active cores

35 * 100 = 3500 MHz max turbo 5 active cores

35 * 100 = 3500 MHz max turbo 4 active cores

35 * 100 = 3500 MHz max turbo 3 active cores

36 * 100 = 3600 MHz max turbo 2 active cores

38 * 100 = 3800 MHz max turbo 1 active cores
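The turbostat output above encodes the turbo-ratio table line by line. As a small sketch (not part of turbostat or HammerDB), the per-core-count turbo limits can be extracted into a lookup table:

```python
import re

# Parse turbostat turbo-ratio lines such as
# "38 * 100 = 3800 MHz max turbo 1 active cores"
# into a {active_cores: max_turbo_mhz} mapping.
TURBO_RE = re.compile(r"=\s*(\d+)\s*MHz max turbo\s*(\d+)\s*active core")

def parse_turbo(lines):
    table = {}
    for line in lines:
        match = TURBO_RE.search(line)
        if match:
            table[int(match.group(2))] = int(match.group(1))
    return table

sample = [
    "32 * 100 = 3200 MHz max turbo 8 active cores",
    "38 * 100 = 3800 MHz max turbo 1 active cores",
]
print(parse_turbo(sample))  # {8: 3200, 1: 3800}
```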

Intel® Turbo Boost Technology: increases performance by increasing processor frequency, enabling faster speeds when conditions allow

[Diagram: Normal — all cores operate at rated frequency; 8C Turbo — all cores operate at a higher frequency; <8C Turbo — fewer active cores may operate at even higher frequencies. Higher Performance on Demand]

Intel® Turbo Boost Technology (turbostat)

cpupower frequency (P-States)


intel@purley1:~$ cpupower frequency-set --governor=performance
intel@purley1:~$ sudo cpupower frequency-info
analyzing CPU 0:
driver: intel_pstate
CPUs which run at the same hardware frequency: 0
CPUs which need to have their frequency coordinated by software: 0
maximum transition latency: Cannot determine or is not supported.
hardware limits: 1000 MHz - 3.80 GHz
available cpufreq governors: performance powersave
current policy: frequency should be within 1000 MHz and 3.80 GHz.
The governor "performance" may decide which speed to use within this range.
current CPU frequency: 2.83 GHz (asserted by call to hardware)
boost state support:
Supported: yes
Active: yes

cpupower idle (C-States)

intel@purley1:~$ cpupower idle-set --enable-all

intel@purley1:~$ sudo cpupower idle-info

CPUidle driver: intel_idle

CPUidle governor: menu

analyzing CPU 0:

Number of idle states: 4

Available idle states: POLL C1-SKX C1E-SKX C6-SKX

POLL:

Flags/Description: CPUIDLE CORE POLL IDLE

Latency: 0

Usage: 2280

Duration: 1822287


Energy Performance Bias

MSR_IA32_ENERGY_PERF_BIAS

0 = high performance 6 = balanced 15 = low power

Red Hat 5 defaulted to ‘performance’

Red Hat 7 set this MSR to ‘balanced’

To return to the performance setting:

intel@purley1:~$ sudo x86_energy_perf_policy -v performance

CPUID.06H.ECX: 0x9

cpu0 msr0x1b0 0x0000000000000006 -> 0x0000000000000000

cpu1 msr0x1b0 0x0000000000000006 -> 0x0000000000000000


Hyper-Threading

./cpu_topology64.out

Software visible enumeration in the system:

Number of logical processors visible to the OS: 112

Number of logical processors visible to this process: 112

Number of processor cores visible to this process: 56

Number of physical packages visible to this process: 2

Memory

dmidecode | more

Base Board Information

Manufacturer: Intel Corporation

Product Name: S2600WFD

Memory Device

Total Width: 72 bits

Data Width: 64 bits

Size: 16384 MB

Form Factor: DIMM

Bank Locator: NODE 1

Type: DDR4

Speed: 2666 MHz

Manufacturer: Micron

Configured Clock Speed: 2666 MHz

ark.intel.com

NUMA and Memory Latency

intel@purley1:~/cputools/Linux$ sudo ./mlc

Intel(R) Memory Latency Checker - v3.4

Measuring idle latencies (in ns)...

Idle latency matrix (ns):

           Numa node 0   Numa node 1
Node 0         72.5         135.0
Node 1        132.4          70.5

Measuring Peak Injection Memory Bandwidths for the system

Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)

Using all the threads from each core if Hyper-Threading is enabled
Using traffic with the following read-write ratios:

ALL Reads : 226259.4

3:1 Reads-Writes : 207531.0

2:1 Reads-Writes : 204783.2

1:1 Reads-Writes : 188107.7

Stream-triad like: 182377.5
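A quick sanity check on the mlc latency matrix above: the remote-node penalty is the ratio of cross-node to local access latency.

```python
# NUMA penalty implied by the mlc idle latencies above
# (node 0 reading node 1, versus node 0 reading locally).
local_ns, remote_ns = 72.5, 135.0
print(round(remote_ns / local_ns, 2))  # 1.86
```

In other words, a remote-node memory access costs roughly 1.9x a local one on this system, which is why NUMA-aware placement of database memory matters for the workloads shown here.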

Results were derived using simulations run on an architecture simulator or model. Any difference in system hardware or software design or configuration may affect actual performance. Intel product plans in this presentation do not constitute Intel plan of record product roadmaps. Please contact your Intel representative to obtain Intel’s current plan of record product roadmaps. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance


I/O: Intel® Optane™ SSD DC P4800X

                   With all-NAND SSDs               With Intel® Optane™ SSD
Server             2 x Intel® Xeon® E5              2 x Intel® Xeon® E5
Database files     1 x Intel® SSD DC P3700 Series   1 x Intel® Optane™ SSD DC P4800X
TPS                1395                             16480
Latency            ~11ms @ 99%                      ~10ms @ 99%
$/transaction      ~$10.09                          ~$0.90

Up to 10x more transactions per second (TPS at the same latency level)1
Up to 91% lower cost per transaction1

1. System configuration: Server Intel® Server System R2208WT2YS, 2x Intel® Xeon® E5 2699v4, 384 GB DDR4 DRAM, boot drive: 1x Intel® SSD DC S3710 Series (400 GB), database drives: 1x Intel® SSD DC P3700 Series (400 GB) and 1x Intel® SSD DC P4800X Series (140 GB prototype), CentOS 7.2, MySQL Server 5.7.14, Sysbench 0.5 configured for 70/30 Read/Write OLTP transaction split using a 100GB database. Cost per transaction determined by total MSRP for each configuration divided by the transactions per second.

*Other names and brands may be claimed as the property of others
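The headline claims can be re-derived from the published figures. Note that the raw TPS ratio is ~11.8x; the slide's "up to 10x" is stated at the same latency level.

```python
# Re-deriving the claims from the published figures:
# TPS 1395 vs 16480, cost/transaction $10.09 vs $0.90.
tps_nand, tps_optane = 1395, 16480
usd_per_txn_nand, usd_per_txn_optane = 10.09, 0.90

print(round(tps_optane / tps_nand, 1))                           # 11.8
print(round((1 - usd_per_txn_optane / usd_per_txn_nand) * 100))  # 91
```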


Choose Database


Schema Creation

• Choose Schema Options

• Select Build

• Multi-threaded schema build

• Option for flat-file data generation for cloud


Build Complete

• Schema of chosen size has built successfully


Timed Workload

• Test and Timed Workloads


Virtual Users

• Configure Virtual Users
• For timed workloads, one Virtual User is a monitor


Transaction Counter

• Transaction Counter should be as ‘flat’ as possible

• Peaks and Troughs indicate configuration errors


Test Complete

• Monitor Virtual User shows average TPM and NOPM over the test

• TPM will be slightly lower than the Transaction Counter due to the longer sample interval


Autopilot

• Completely Automated and Unattended performance test

• Provide sequence of Virtual Users and leave to run


Analysing Results

[Charts: Performance profile — NOPM (0–600,000) vs Virtual Users (0–100), MariaDB on v3 vs v4 CPUs; MariaDB peak performance, v3 vs v4: 1.26X. CPUs: Intel(R) Xeon(R) E5-2697 v3 @ 2.60GHz and E5-2699 v4 @ 2.20GHz]


Commercial Comparison

[Chart: MariaDB peak performance — NOPM (0–600,000), MariaDB v3 vs v4]

• NOPM results enable comparison of MariaDB results to commercial database results archive


Example Price / Performance

[Charts: Database software 3-year TCO in USD (0–1,200,000), Commercial vs MariaDB; Cost per transaction, price/NOPM (0–1.4), Commercial vs MariaDB]

• Calculate the license cost + support for a 3-year period

• Divide the cost by performance to get the cost per transaction
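The two steps above amount to a one-line calculation. This is a worked sketch of the method only; all prices here are hypothetical placeholders, not actual vendor pricing:

```python
# 3-year TCO = license + 3 years of support;
# cost per transaction = TCO / measured NOPM.
# All dollar amounts below are hypothetical.

def cost_per_transaction(license_usd, support_usd_per_year, nopm, years=3):
    """3-year TCO (license + support) divided by measured NOPM."""
    tco = license_usd + support_usd_per_year * years
    return tco / nopm

commercial = cost_per_transaction(500_000, 110_000, 600_000)
mariadb = cost_per_transaction(0, 7_500, 500_000)
print(round(commercial, 3))  # 1.383
print(round(mariadb, 3))     # 0.045
```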


Skylake InnoDB Optimisation

[Chart: MariaDB OLTP database performance — NOPM (0–700,000) vs Virtual Users (0–90), MariaDB Skylake InnoDB update vs MariaDB Skylake default]

• Non-contention workloads do not highlight the performance impact

• Optimizing the InnoDB spinlock on Skylake reduces contention and increases throughput: 1.43X

SELECT FOR UPDATE Locking Overhead

MariaDB / MySQL:

SELECT d_next_o_id, d_tax INTO no_d_next_o_id, no_d_tax
FROM district
WHERE d_id = no_d_id AND d_w_id = no_w_id FOR UPDATE;
UPDATE district SET d_next_o_id = d_next_o_id + 1
WHERE d_id = no_d_id AND d_w_id = no_w_id;
SET o_id = no_d_next_o_id;

Oracle:

UPDATE district SET d_next_o_id = d_next_o_id + 1
WHERE d_id = no_d_id AND d_w_id = no_w_id
RETURNING d_next_o_id, d_tax INTO no_d_next_o_id, no_d_tax;
o_id := no_d_next_o_id;

DB2:

SELECT d_next_o_id, d_tax INTO no_d_next_o_id, no_d_tax
FROM OLD TABLE ( UPDATE district
  SET d_next_o_id = d_next_o_id + 1
  WHERE d_id = no_d_id
  AND d_w_id = no_w_id );
SET o_id = no_d_next_o_id;

MS SQL Server:

UPDATE dbo.district
SET @no_d_tax = d_tax
  , @o_id = d_next_o_id
  , d_next_o_id = district.d_next_o_id + 1
WHERE district.d_id = @no_d_id
  AND district.d_w_id = @no_w_id
SET @no_d_next_o_id = @o_id + 1

Clustering Performance: MariaDB Galera Cluster on Intel® SSD DC P3700 Series

[Chart: stored procedure response times in microseconds (0–180,000) at peak throughput for NEWORD, PAYMENT, DELIVERY, SLEV and OSTAT; standalone MariaDB at 363285 NOPM vs Galera at 72489 NOPM]

• Replication impact on performance

• High penalty for the stored procedure with 'Delete' transactions
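The replication overhead follows directly from the two peak-throughput figures quoted above:

```python
# Throughput penalty implied by the peak NOPM figures:
# standalone MariaDB 363285 NOPM vs Galera cluster 72489 NOPM.
nopm_single, nopm_galera = 363285, 72489
print(round(nopm_single / nopm_galera, 1))  # 5.0
```

That is, synchronous Galera replication cost roughly 5x in peak throughput on this configuration.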


MyRocks Potential

• Optane and MyRocks show great potential to make better use of NVM

• > 460,000 HammerDB NOPM in initial testing

Future: Intel® DIMMs based on 3D XPoint™ memory media

[Diagram: Intel® Xeon® E5 connected via DDR to DRAM and via PCIe* to Intel® 3D NAND SSDs and Intel® Optane™ SSDs; OS paging extends the 'memory pool' with Intel® Memory Drive Technology]

FPGA: Intel Arria 10 for Database

• Up to 80% reduction in power consumption (vs. Intel® Xeon®)

• Real-time, inline processing of streaming data without buffering

• Power efficiency; high throughput, low latency

• Acceleration of targeted algorithms, for example:
  • Compression


HammerDB v3.0

• Multiple requests to support additional databases, e.g. requests for SAP HANA, SQLite, Cassandra, MongoDB, Tibero, Cubrid, Linter and TPC-E, TPC-DS

• Refactoring underway to make adding databases easier

• Now XML <-> Dict driven, to enable adding databases via an XML file plus build and driver scripts

Moore’s Law


[Diagram: Strained Silicon, Hi-K Metal Gate and 3D Transistors across process nodes 90 nm, 65 nm, 45 nm, 32 nm, 22 nm, 14 nm, 10 nm, 7 nm]

Enabling new devices with higher functionality and complexity while controlling power, cost, and size


Executing to Moore’s Law
