DBA Level 400

Super scaling singleton inserts



Page 1: Super scaling singleton inserts

DBA Level 400

Page 2: Super scaling singleton inserts

About Me

I’m pushing the database engine as hard as I can captain, she’s going to blow.

An independent SQL consultant.

A user of SQL Server since 2000.

14+ years of SQL Server experience.

The ‘standard’ stuff . . . and what I’m passionate about!

Page 3: Super scaling singleton inserts

The Exercise

Squeeze every last drop of performance out of the hardware !

ostress -E -dSingletonInsert -Q"exec usp_insert" -n40
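The deck shows only the ostress call; the body of usp_insert is not shown. A minimal sketch of what such a singleton insert procedure might look like, assuming the MyBigTable definition from page 6:

```sql
-- Hypothetical body for usp_insert: one row per call into MyBigTable.
CREATE PROCEDURE dbo.usp_insert
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO dbo.MyBigTable (c2, c3, c4, c5, c6)
    VALUES (GETDATE(), 'payload', 1, 2, 3);
END;
```

ostress then executes this on 40 concurrent connections.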

Page 4: Super scaling singleton inserts

Test Environment

SQL Server 2016 CTP 2.3

Windows Server 2012 R2

2 x 10-core Xeon V3 CPUs, 2.2GHz, with hyper-threading enabled

64GB DDR4 quad channel memory

4 x SanDisk Extreme Pro 480GB, RAID 1 (64K allocation size)

ostress used for generating concurrent workload

Use the conventional database engine to begin with . . .

Page 5: Super scaling singleton inserts

I Will Be Using Windows Performance Toolkit . . . A Lot! It allows CPU time to be quantified across the whole database engine.

Not just what Microsoft deems we should see, but everything!

The database engine equivalent of seeing the Matrix in code form ;-)

Page 6: Super scaling singleton inserts

Where Everyone Starts From . . . A Monotonically Increasing Key

CREATE TABLE [dbo].[MyBigTable] (
    [c1] [bigint] IDENTITY(1, 1) NOT NULL,
    [c2] [datetime] NULL,
    [c3] [char](111) NULL,
    [c4] [int] NULL,
    [c5] [int] NULL,
    [c6] [bigint] NULL,
    CONSTRAINT [PK_BigTableSeq] PRIMARY KEY CLUSTERED (
        [c1] ASC
    )
)

(Charts: CPU utilization and wait stats; elapsed time 02:12:26.)

Page 7: Super scaling singleton inserts

The “Last Page Problem”

(Diagram: a B-tree with HOBT_ROOT at the top; with a monotonically increasing key, every insert follows the ‘Max’ path down the right-hand edge of the tree, so all sessions contend for the same last page.)

Page 8: Super scaling singleton inserts

Overcoming The “Last Page” Problem

Elapsed Time (s) / Key Type:

SPID Offset: 600
Partition + SPID Offset: 616
NEWID(): 982
IDENTITY: 7946
NEWSEQUENTIALID: 8170

What are we waiting on?
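The deck does not show the ‘SPID offset’ key scheme itself. A plausible sketch is to derive the key from the session id so that each session inserts into its own key range, spreading the insert point across many pages; the offset constant, sequence and table variant below are assumptions:

```sql
-- Each session writes into its own key range: no single "last page".
CREATE SEQUENCE dbo.seq_c1 AS bigint START WITH 1;

DECLARE @key bigint = CAST(@@SPID AS bigint) * 100000000000
                    + NEXT VALUE FOR dbo.seq_c1;

INSERT INTO dbo.MyBigTable2 (c1, c2)   -- assumes a variant of MyBigTable
VALUES (@key, GETDATE());              -- with no IDENTITY on c1
```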

Page 9: Super scaling singleton inserts

Can Delayed Durability Help ?

Elapsed time (s) / Logging Type:

Delayed durability: 265
Conventional: 600
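Delayed durability can be forced for the whole database or requested per commit; for example, against the test database used by ostress:

```sql
-- Force delayed durability for every transaction in the database:
ALTER DATABASE SingletonInsert SET DELAYED_DURABILITY = FORCED;

-- Or opt in per transaction (requires DELAYED_DURABILITY = ALLOWED):
COMMIT TRANSACTION WITH (DELAYED_DURABILITY = ON);
```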

Page 10: Super scaling singleton inserts

What Is Wrong In Task Manager ?

Page 11: Super scaling singleton inserts

Fixing CPU Core Starvation With Trace Flag 8008

The scheduler with the least load is now favoured over the ‘preferred’ scheduler.

Documented in this CSS engineer’s note.

Elapsed time has gone backwards: it is now 453 seconds! Why?
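Trace flag 8008 can be enabled globally at runtime, or via the -T startup parameter:

```sql
DBCC TRACEON (8008, -1);  -- -1 = enable globally for all sessions
```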

Page 12: Super scaling singleton inserts

Where Are Our CPU Cycles Going ?

Page 13: Super scaling singleton inserts

How Spinlocks Work

A task on a scheduler will spin until it can acquire the spinlock it is after.

For short-lived waits this uses fewer CPU cycles than yielding and then waiting for the task’s thread to reach the head of the runnable queue.

Page 14: Super scaling singleton inserts

Spinlock Backoff

We have to yield the scheduler at some stage !

Page 15: Super scaling singleton inserts

Introducing The LOGCACHE_ACCESS Spinlock

(Diagram: the commit pipeline. A committing thread (T0 . . . Tn) acquires the LOGCACHE_ACCESS spinlock to allocate a slot in the current log buffer (slot 1 . . . slot 127, addressed by cache-line-aligned buffer offsets), then memcpy’s its log record content into the slot (LOGBUFFER / WRITELOG waits). Full buffers go onto the writer queue (LOGFLUSHQ wait); the log writer hardens them through an async I/O completion port and signals the thread which issued the commit.)

The LOGCACHE_ACCESS spinlock is the bit we are interested in.

Page 16: Super scaling singleton inserts

Anatomy of A Modern CPU

(Diagram: anatomy of a modern CPU. Each core has an L0 uop cache, a 32KB L1 instruction cache, a 32KB L1 data cache and a 256KB unified L2 cache; all cores on a socket share the L3 cache. The ‘un-core’ holds the memory controller, QPI links, TLB, memory bus, and power and clock circuitry.)

Page 17: Super scaling singleton inserts

Memory, Cache Lines and The CPU Cache

(Diagram: objects allocated in memory, e.g. new OperationData(), are mapped onto 64-byte cache lines; each line held in the CPU cache is identified by a tag.)

Page 18: Super scaling singleton inserts

Spinlocks and Memory

(Diagram: cores spinning on the same spinlock (spin_acquire on a shared int) force the cache line holding the lock to be transferred back and forth, both between cores on one socket, via the shared L3, and between sockets.)

Page 19: Super scaling singleton inserts

What Happens If We Give The Log Writer Its Own CPU Core ?

Page 20: Super scaling singleton inserts

Elapsed time (s) / Configuration:

Conventional logging: 600
Delayed durability: 265
TF8008, delayed durability: 1193
TF8008, delayed durability, affinity mask change: 231 *

We get the lowest elapsed time so far.

* With 38 threads, all other tests with 40.

Page 21: Super scaling singleton inserts

Scalability With and Without A CPU Core Dedicated To The Log Writer

(Chart: Insert Rate / Insert Threads, 0 to 600,000 inserts/s over 2 to 38 insert threads. Series: Baseline (Batch Size=1); Log Writer With Dedicated Core, Batch Size=1.)

Page 22: Super scaling singleton inserts

. . . and What About LOGCACHE_ACCESS Spins ?

(Chart: LOGCACHE_ACCESS spins / Thread Count, 0 to 12,000,000,000 spins over 2 to 34 threads. Series: Baseline; Log Writer with Dedicated CPU Core.)

Page 23: Super scaling singleton inserts

What Difference Has This Made To Where CPU Time Is Going ?

With the default CPU affinity mask: 63,166,836 ms (40 threads)

vs.

Log writer with dedicated CPU core: 220,168 ms (38 threads)

Page 24: Super scaling singleton inserts

Optimizations That Failed To Make The Grade

Large memory pages: allows the translation lookaside buffer to cover more memory for logical to physical memory mapping.

Trace flag 2330: stops spins on OPT_IDX_STATS.

Trace flag 1118: prevents mixed allocation extents; enabled by default in SQL Server 2016.

Page 25: Super scaling singleton inserts

A Different Spinlock Is Now The Most Spin Intensive

XDESMGR, probably spinlock<109,9,1>. What does it do?
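Besides Windows Performance Toolkit, the most spin-intensive spinlocks can be found from inside SQL Server via the sys.dm_os_spinlock_stats DMV:

```sql
-- Top spinlocks by spins since instance start (cumulative counters):
SELECT TOP (5) name, collisions, spins, backoffs
FROM sys.dm_os_spinlock_stats
ORDER BY spins DESC;
```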

Page 26: Super scaling singleton inserts

Digging Into The Call Stack To Understand Undocumented Spinlocks

1. Start trace: xperf -on PROC_THREAD+LOADER+PROFILE -StackWalk Profile

2. Run workload

3. Stop trace: xperf -d stackwalk.etl

4. Load trace into WPA

5. Locate spinlock in call stack

6. ‘Invert’ the call stack

Page 27: Super scaling singleton inserts

Examining The XDESMGR Spinlock By Digging Into The Call Stack

This serialises access to the part of the database engine that allocates and destroys transaction ids.

How do you relieve pressure on this spinlock? Have multiple insert statements per transaction.
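A sketch of that fix: wrapping two inserts in one explicit transaction halves the number of transaction ids the XDESMGR spinlock has to serialise the allocation of (the procedure name is illustrative):

```sql
CREATE PROCEDURE dbo.usp_insert_batch2
AS
BEGIN
    SET NOCOUNT ON;
    BEGIN TRANSACTION;  -- one transaction id covers both statements
    INSERT INTO dbo.MyBigTable (c2) VALUES (GETDATE());
    INSERT INTO dbo.MyBigTable (c2) VALUES (GETDATE());
    COMMIT TRANSACTION;
END;
```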

Page 28: Super scaling singleton inserts

Options For Dealing With The XDESMGR Spinlock

Relieving pressure on the LOGCACHE_ACCESS spinlock makes the XDESMGR spinlock the bottleneck. There are three places to go at this point:

Increase the number of DML statements per transaction.

Shard the table across databases and instances.

Use in-memory OLTP native transactions.

Page 29: Super scaling singleton inserts

Increasing The Batch Size By Just One Makes A Big Difference !

(Chart: Insert Rate / Thread Count, 0 to 900,000 inserts/s over 2 to 36 threads. Series: Baseline (Batch Size=1); Log Writer With Dedicated Core, Batch Size=1; Log Writer With Dedicated Core, Batch Size=2.)

Page 30: Super scaling singleton inserts

. . . and The Difference This Makes To XDESMGR Spins

(Chart: XDESMGR Spins / Thread Count, 0 to 200,000,000,000 spins over 2 to 36 threads. Series: Baseline (Batch Size=1); Log Writer With Dedicated Core, Batch Size=1; Log Writer With Dedicated Core, Batch Size=2.)

Page 31: Super scaling singleton inserts

Does It Matter Which NUMA Node The Insert Runs On ?

(Diagram: two 10-core CPU sockets. Faster here, on NUMA node 0 . . . or faster here, on NUMA node 1?)

“What’s really going to bake your noodle . . .”: 8 threads on one NUMA node complete in 73 s; the same 8 threads on the other node take 125 s.

Page 32: Super scaling singleton inserts

What Does Windows Performance Toolkit Have To Tell Us ?

18 insert threads co-located on the same CPU socket as the log writer: 84,697 ms

vs.

18 insert threads not co-located on the same socket as the log writer: 11,281,235 ms

Page 33: Super scaling singleton inserts

So I Should Look At Tuning The CPU Affinity Mask ?

Get the basics right first: minimize transaction log fragmentation (both internal and external); use low latency storage; avoid log intensive operations, page splits etc.; use minimally logged operations where appropriate.

Only look at giving the log writer a CPU core to itself when all of the above has been done, the disk row store engine is being used, and the workload is OLTP heavy using more than 12 CPU cores (6 per socket).
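Dedicating a core to the log writer is done by taking its ‘home’ CPU out of the affinity mask for everything else. A sketch for this 40-logical-CPU box, assuming the log writer is homed on CPU 0 (which CPU it lands on is box-specific and must be verified first):

```sql
-- Leave CPU 0 for the log writer; schedule everything else on 1-39.
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY CPU = 1 TO 39;
```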

Page 34: Super scaling singleton inserts

Hard To Solve Logging Issues

I have to use the disk row store engine. My single threaded app cannot easily be multi-threaded. How do I get the best possible log write performance?

Use NUMA connection affinity to connect to the same socket as the log writer.

Disable hyper-threading; whole cores are always faster than hyper-threads.

‘Affinitize’ the rest of the database engine away from the log writer thread’s ‘home’ CPU core.

Go for a CPU with the best single threaded performance available.

Page 35: Super scaling singleton inserts

The CPU Cycle Cost Of Spinlock Cache Line Transfers

(Diagram: as on page 18, cores spinning on the same lock int transfer its cache line between caches. Core to core on the same socket, through the shared L3: around 34 CPU cycles. Core to core on different sockets: around 100 CPU cycles.)

Page 36: Super scaling singleton inserts

Remember, All Memory Access Is CPU Intensive

Page 37: Super scaling singleton inserts

This Man Seriously Knows A Lot About Memory

Ulrich Drepper, author of: What Every Programmer Should Know About Memory

From Understanding CPU Caches

“Use per CPU memory; lock thread to specific CPU”. This is our CPU affinity mask trick.

Page 38: Super scaling singleton inserts

Cache Line Ping Pong

(Diagram: eight CPU sockets connected through I/O hubs; a contended cache line bounces from socket to socket.)

“Cache line ping pong is deadly for performance”

The more CPU sockets and cores you have, the greater the ramifications this has for SQL Server scalability on “big boxes”.

Page 39: Super scaling singleton inserts

‘Sharding’ The Database Across Instances

(Diagram: two 10-core CPU sockets; instance A ‘affinitized’ to NUMA node 0, instance B ‘affinitized’ to NUMA node 1.)

‘Shard’ databases across instances.

2 x LOGCACHE_ACCESS and XDESMGR spinlocks.

Spinlock cache entries are bound by the latency of the L3 cache, not the Quick Path Interconnect.

Page 40: Super scaling singleton inserts

What Can We Get From An Instance ‘Affinitized’ To One CPU Socket ?

(Chart: Insert Rate / Thread Count, 0 to 500,000 inserts/s over 1 to 18 threads.)

Page 41: Super scaling singleton inserts

With a Batch Size of 2, 32 Threads Achieve The Best Throughput

(Call stack annotations: logging related activity, and latching!)

Where to now?

Page 42: Super scaling singleton inserts

In Memory OLTP To The Rescue, But What Will It Give Us ?

Only redo is written to the transaction log (durability = SCHEMA_AND_DATA). Does this relieve pressure on the LOGCACHE_ACCESS spinlock?

Zero latching and locking.

Native procedure compilation.

No “last page” problem, due to IMOLTP’s use of hash buckets.

Spinlocks will still be in play though.
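A minimal sketch of an in-memory equivalent of MyBigTable with a hash primary key (column list trimmed; table name and bucket count are illustrative):

```sql
CREATE TABLE dbo.MyBigTableIM (
    c1 bigint IDENTITY(1, 1) NOT NULL
        PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 8388608),
    c2 datetime NULL,
    c3 char(111) NULL
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);
```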

Page 43: Super scaling singleton inserts

Insert Scalability with A Non Natively Compiled Stored Procedure

(Chart: Insert Rate / Thread Count, 0 to 600,000 inserts/s over 1 to 18 threads. Series: Default Engine; IMOLTP Range Index; IMOLTP Hash Index bc=8388608; IMOLTP Hash Index bc=16777216.)

Page 44: Super scaling singleton inserts

What Does The BLOCKER_ENUM Spinlock Protect ?

Transaction synchronization between the default and in-memory OLTP engines ?

Page 45: Super scaling singleton inserts

Where Are Our CPU Cycles Going, The Overhead Of Language Processing

Time to try native in memory OLTP transactions and compiled stored procedures ?
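A minimal sketch of a natively compiled insert procedure targeting a memory-optimized table (the table and procedure names are illustrative):

```sql
CREATE PROCEDURE dbo.usp_insert_native
WITH NATIVE_COMPILATION, SCHEMABINDING, EXECUTE AS OWNER
AS
BEGIN ATOMIC WITH (TRANSACTION ISOLATION LEVEL = SNAPSHOT,
                   LANGUAGE = N'us_english')
    -- The ATOMIC block gives one native transaction per call.
    INSERT INTO dbo.MyBigTableIM (c2, c3) VALUES (GETDATE(), 'payload');
END;
```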

Page 46: Super scaling singleton inserts

Insert Scalability with A Natively Compiled Stored Procedure

(Chart: Insert Rate / Thread Count, 0 to 9,000,000 inserts/s over 1 to 40 threads. Series: bucket count=8388608; bucket count=16777216; bucket count=33554432; range.)

Page 47: Super scaling singleton inserts

Hash Indexes Bucket Count and Balancing The Equation

Smaller bucket counts: better cache line reuse, reduced TLB thrashing, reduced hash table cache-out.

Larger bucket counts: reduced cache line reuse, increased TLB thrashing, but less hash bucket scanning for lookups.
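In SQL Server 2016 the bucket count of a hash index can be changed with an index rebuild, which is how different counts can be compared (the index name is assumed):

```sql
-- Double the bucket count of a hash index on a memory-optimized table:
ALTER TABLE dbo.MyBigTableIM
    ALTER INDEX ix_hash_c1 REBUILD WITH (BUCKET_COUNT = 16777216);
```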

Page 48: Super scaling singleton inserts

Is Our CPU Affinity Mask Trick Relevant To In Memory OLTP ?

Default CPU affinity mask and 18 insert threads.

A CPU core dedicated to the log writer and 18 insert threads.

Page 49: Super scaling singleton inserts

Optimizations That Failed To Make The Grade

Large memory pages: as per the default database engine, this made no difference to performance.

Turning off adjacent cache line pre-fetching: pre-fetching can degrade performance by saturating the memory bus when hyper-threading is in use, and cause cache pollution when the pre-fetched line is not used.

Page 50: Super scaling singleton inserts

Takeaways

Monotonically increasing keys do not scale with the default database engine.

Dedicate a CPU core to the log writer to relieve pressure on the LOGCACHE_ACCESS spinlock.

Batch DML statements together per transaction to relieve XDESMGR spinlock pressure.

The further the LOGCACHE_ACCESS spinlock cache line has to travel, the more performance is degraded.

Native compilation results in a performance increase of at least an order of magnitude over non natively compiled stored procedures.

There is a bucket count “sweet spot” for IMOLTP hash indexes, influenced by hash collisions, bucket scans and hash lookup table cache-out.

Page 51: Super scaling singleton inserts

Further Reading

Super scaling singleton inserts blog post

Tuning The LOGCACHE_ACCESS Spinlock On A “Big Box” blog post

Tuning The XDESMGR Spinlock On A “Big Box” blog post

Page 52: Super scaling singleton inserts
Page 53: Super scaling singleton inserts

[email protected]

http://uk.linkedin.com/in/wollatondba

ChrisAdkin8