Slides from Episode 4 of the DB2 pureScale webcast series with the IBM lab team.
DB2 pureScale Performance
Steve ([email protected]), Oct 19, 2010
Copyright IBM 2010
Agenda
• DB2 pureScale technology review
• RDMA and low-latency interconnect
• Monitoring and tuning bufferpools in pureScale
• Architectural features for top performance
• Performance metrics
DB2 pureScale: Technology Review
[Architecture diagram: clients see a single database view and can connect to any member; each member runs the DB2 engine with its own log; all members share storage access over the cluster interconnect; integrated cluster services (CS) run on every host; a primary and a secondary Cluster Caching Facility serve the cluster.]
• DB2 engine runs on several host computers
  – The members co-operate with each other to provide coherent access to the database from any member
• Data sharing architecture
  – Shared access to the database
  – Members write to their own logs
  – Logs are accessible from another host (used during recovery)
• Cluster Caching Facility (CF) technology from STG
  – Efficient global locking and buffer management
  – Synchronous duplexing to the secondary ensures availability
• Low-latency, high-speed interconnect
  – Special optimizations provide significant advantages on RDMA-capable interconnects like InfiniBand
• Clients connect anywhere, see a single database
  – Clients connect into any member
  – Automatic load balancing and client reroute may change the underlying physical member to which a client is connected
• Integrated cluster services
  – Failure detection, recovery automation, cluster file system
  – In partnership with STG (GPFS, RSCT) and Tivoli (SA MP)
• Leverages IBM's System z Sysplex experience and know-how
DB2 pureScale and low-latency interconnect
• InfiniBand & uDAPL provide the low-latency RDMA infrastructure exploited by pureScale
• pureScale currently uses DDR and QDR IB adapters according to platform
  – Peak throughput of about 2-4 M messages per second
  – Provide message latencies in the 10s of microseconds or even lower
• The InfiniBand development roadmap indicates continued increases in bit rates
[Figure: InfiniBand roadmap from www.infinibandta.org]
Two-level page buffering – data consistency & improved performance
• The local bufferpool (LBP) caches both read-only and updated pages for that member
• The shared GBP contains references to every page in all LBPs across the cluster
  – References ensure consistency across members – who's interested in which pages, in case the pages are updated
• The GBP also contains copies of all updated pages from the LBPs
  – Sent from the LBP at transaction commit time
  – Stored in the GBP & available to members on demand
  – A 30 µs page read request over InfiniBand from the GBP can be more than 100x faster than reading from disk
• Statistics are kept for tuning
  – Found in LBP vs. found in GBP vs. read from disk
  – Useful in tuning GBP / LBP sizes
[Diagram: members M1-M3 and the CF; a modified page is sent to the GBP at commit and read back by another member in about 30 µs, vs. roughly 5000 µs for a disk read – expensive disk reads from M1 and M2 are not required, since they can get the modified page from the CF.]
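Taken at face value, those latencies put a GBP page read (about 30 µs) roughly 165x ahead of a disk read (about 5000 µs), consistent with the "more than 100x" figure above.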
pureScale bufferpool monitoring and tuning
• Familiar DB2 hit ratio calculations are useful with pureScale
  – HR = (logical reads – physical reads) / logical reads
    e.g. (pool_data_l_reads – pool_data_p_reads) / pool_data_l_reads
  – As usual, physical reads come from disk, logical reads from the bufferpool (in pureScale, this means either the LBP or the GBP)
    e.g., pool_data_l_reads = pool_data_lbp_pages_found + pool_data_gbp_l_reads
• New metrics in pureScale support breaking this down by LBP & GBP amounts
  – pool_data_lbp_pages_found = logical data reads satisfied by the LBP
    • i.e., we needed a page, and it was present & valid in the LBP
  – pool_data_gbp_l_reads = logical data reads attempted at the GBP
    • i.e., either not present or not valid in the LBP, so we needed to go to the GBP
  – pool_data_gbp_p_reads = physical data reads due to the page not being present in either the LBP or GBP
    • Essentially the same as non-pureScale pool_data_p_reads
  – pool_data_gbp_invalid_pages = number of GBP data read attempts due to an LBP page being present but marked invalid
    • An indicator of the rate of GBP updates & their impact on the LBP
(Of course, there are index equivalents too.)
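To make the breakdown concrete, here is a minimal sketch (not IBM tooling) that splits logical data reads by where they were satisfied, using the monitor element names above; the values are invented for illustration and would normally come from a monitoring interface such as MON_GET_BUFFERPOOL.

```python
# Hypothetical monitor element values; the numbers are made up for illustration.
metrics = {
    "pool_data_l_reads": 1_000_000,         # all logical data reads
    "pool_data_lbp_pages_found": 850_000,   # satisfied by the local bufferpool
    "pool_data_gbp_l_reads": 150_000,       # attempted at the group bufferpool
    "pool_data_gbp_p_reads": 30_000,        # missed the GBP too, read from disk
    "pool_data_gbp_invalid_pages": 40_000,  # GBP reads caused by invalidated LBP pages
}

# Identity from the slide: l_reads = lbp_pages_found + gbp_l_reads
assert (metrics["pool_data_lbp_pages_found"]
        + metrics["pool_data_gbp_l_reads"]) == metrics["pool_data_l_reads"]

l_reads = metrics["pool_data_l_reads"]
gbp_hits = metrics["pool_data_gbp_l_reads"] - metrics["pool_data_gbp_p_reads"]
print(f"satisfied by LBP : {metrics['pool_data_lbp_pages_found'] / l_reads:.1%}")
print(f"satisfied by GBP : {gbp_hits / l_reads:.1%}")
print(f"read from disk   : {metrics['pool_data_gbp_p_reads'] / l_reads:.1%}")
print(f"GBP reads due to invalidated LBP pages: "
      f"{metrics['pool_data_gbp_invalid_pages'] / metrics['pool_data_gbp_l_reads']:.1%}")
```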
pureScale bufferpool monitoring
• Overall (and non-pureScale) hit ratio
  – (pool_data_l_reads – pool_data_p_reads) / pool_data_l_reads
  – Great values: 95% for index, 90% for data
  – Good values: 80-90% for index, 75-85% for data
• LBP hit ratio
  – (pool_data_lbp_pages_found / pool_data_l_reads) * 100%
  – Generally lower than the overall hit ratio, since it excludes GBP hits
  – Factors which may affect it, other than LBP size
    • Increases with a greater portion of read activity in the system (decreasing probability that LBP copies of the page have been invalidated)
    • May decrease with cluster size (increasing probability that another member has invalidated the LBP page)
• GBP hit ratio
  – (pool_data_gbp_l_reads – pool_data_gbp_p_reads) / pool_data_gbp_l_reads
  – A hit here is a read of a previously modified page, so hit ratios are typically quite low
    • An overall (LBP+GBP) H/R in the high 90's can correspond to a GBP H/R in the low 80's
  – Factors which may affect it, other than GBP size
    • Decreases with a greater portion of read activity
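As a self-contained illustration of how the three ratios relate, the following sketch uses invented metric values chosen to reproduce the pattern described above (an overall H/R in the high 90's alongside a GBP H/R in the low 80's):

```python
# Hypothetical data-page monitor element values, for illustration only.
pool_data_l_reads = 1_000_000
pool_data_lbp_pages_found = 850_000
pool_data_gbp_l_reads = 150_000            # = l_reads - lbp_pages_found
pool_data_gbp_p_reads = 30_000             # GBP misses that went to disk
pool_data_p_reads = pool_data_gbp_p_reads  # physical reads ultimately come from disk

overall_hr = (pool_data_l_reads - pool_data_p_reads) / pool_data_l_reads
lbp_hr = pool_data_lbp_pages_found / pool_data_l_reads
gbp_hr = (pool_data_gbp_l_reads - pool_data_gbp_p_reads) / pool_data_gbp_l_reads

print(f"overall data H/R: {overall_hr:.1%}")   # 97.0% - 'great' is 90%+ for data
print(f"LBP data H/R    : {lbp_hr:.1%}")       # 85.0% - lower, since GBP hits are excluded
print(f"GBP data H/R    : {gbp_hr:.1%}")       # 80.0% - hits are previously modified pages
```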
pureScale bufferpool tuning
Step 1: typical rule of thumb for GBP size = 35-40% of Σ( all members' LBP sizes ) – see the sizing sketch after this list
  – e.g. 4 members with an LBP size of 1M pages each -> GBP size of 1.4 to 1.6M pages
  – NB: don't forget, GBP page size is always 4 KB, no matter what the LBP page size is
  – If your workload is very read-heavy (e.g. 90% read), the initial GBP allocation could be in the 20-30% range
  – For 2-member clusters, you may want to start with 40-50% of total LBP, vs. 35-40%
Step 2: monitor the overall BP hit ratio as usual, with pool_data_l_reads, pool_data_p_reads, etc.
  – Meets your goals? If yes, then done!
Step 3: check the LBP H/R with pool_data_lbp_pages_found / pool_data_l_reads
  – Great values: 90% for index, 85% for data
  – Good values: 70-80% for index, 65-80% for data
  – Increasing LBP size can help increase the LBP H/R
  – NB: for each 16 extra LBP pages, the GBP needs 1 extra page for registrations
Step 4: check the GBP H/R with pool_data_gbp_l_reads, pool_data_gbp_p_reads, etc.
  – Great values: 90% for index, 80% for data
  – Good values: 65-80% for index, 60-75% for data
  – pool_data_l_reads > 10 x pool_data_gbp_l_reads means low GBP dependence – tuning GBP size in this case may be less valuable
  – pool_data_gbp_invalid_pages > 25% of pool_data_gbp_l_reads means the GBP is really helping out, and could benefit from extra pages
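A minimal sizing sketch for Step 1, under stated assumptions: the cluster configuration is hypothetical, and converting through kilobytes when the LBP page size differs from 4 KB is one reading of the "GBP page size is always 4 KB" note rather than an official formula.

```python
# Step 1 rule of thumb: initial GBP size ~= 35-40% of the sum of all members'
# LBP sizes, expressed in 4 KB GBP pages. Hypothetical configuration below.

def initial_gbp_pages(member_lbp_pages, lbp_page_size_kb=4, fraction=0.40):
    """Starting GBP size, in 4 KB pages, for the given per-member LBP sizes."""
    total_lbp_kb = sum(member_lbp_pages) * lbp_page_size_kb
    return int(total_lbp_kb * fraction / 4)   # GBP page size is always 4 KB

# Slide example: 4 members, 1M-page LBPs (4 KB pages), 40% -> 1.6M GBP pages
members = [1_000_000] * 4
print(f"{initial_gbp_pages(members):,} GBP pages")                 # 1,600,000

# Step 3 note: each 16 extra LBP pages need 1 extra GBP page for registrations
extra_lbp_pages = 160_000
print(f"{extra_lbp_pages // 16:,} extra GBP registration pages")   # 10,000
```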
pureScale architectural features for optimum performance
• Page lock negotiation – or: Psst! Hey buddy, can you pass me that page?
  – pureScale page locks are physical locks, indicating which member currently 'owns' the page. Picture the following:
    • Member A: acquires a page P and modifies a row on it, and continues with its transaction. 'A' holds an exclusive page lock on page P until 'A' commits
    • Member B: wants to modify a different row on the same page P. What now?
  – 'B' doesn't have to wait until 'A' commits & releases the page lock
    • The CF will negotiate the page back from 'A' in the middle of 'A's transaction, on 'B's behalf
    • Provides far better concurrency & performance than needing to wait for the page lock until the holder commits
[Diagram: members A and B, each with a log and a copy of page P; the CF's global lock manager (GLM) entry for page P shows holders A and B as the page is passed from A to B.]
pureScale architectural features for optimum performance
• Table append cache and index page cache
  – What happens in the case of rapid inserts into a single table by multiple members? Or rapid index updates? Will it cause the insert page to 'thrash' back & forth between the members, each time one has a new row?
  – No: each member sets aside an extent for insertion into the table to eliminate contention & page thrashing. Similarly for indexes with the page cache
• Lock avoidance
  – pureScale exploits cursor stability (CS) locking semantics to avoid taking locks in many common cases
  – Reduces pathlength and saves trips to the CF
  – Transparent & always on
Notes on storage configuration for performance
• GPFS best practices
  – Automatically configured by the db2cluster command
    • Blocksize >= 1 MB (vs. the 64 KB default) provides noticeably improved performance
    • Direct (unbuffered) IO for both logs & tablespace containers
    • SCSI-3 P/R on AIX enables faster disk takeover on member failure
  – Separate paths for logs & tablespaces are recommended
• Dominant storage performance factor for pureScale: fast log writes
  – Always important in OLTP
  – Extra important in pureScale due to log flushes driven by page reclaims
  – Separate filesystems and separate devices, from each other & from tablespaces
  – Ideally comfortably under 1 ms
  – Possibly even SSDs to keep write latencies as low as possible
12 Member Scalability Example
• Moderately heavy transaction processing workload modeling a warehouse & ordering process
  – Write transaction rate 20%
  – Typical read/write ratio of many OLTP workloads
• No cluster awareness in the app
  – No affinity
  – No partitioning
  – No routing of transactions to members
• Configuration
  – Twelve 8-core p550 members, 64 GB, 5 GHz
  – IBM 20 Gb/s IB HCAs + 7874-024 IB switch
  – Duplexed PowerHA pureScale across 2 additional 8-core p550s, 64 GB, 5 GHz
  – DS8300 storage, 576 15K disks, two 4 Gb FC switches
[Topology diagram: clients (2-way x345) connect over 1 Gb Ethernet; p550 members and the p550 Cluster Caching Facility communicate over the 20 Gb IB pureScale interconnect via a 7874-024 switch; DS8300 storage sits behind two 4 Gb FC switches.]
12 Member Scalability Example - Results
[Chart: throughput relative to 1 member vs. # of members – 1.98x @ 2 members, 3.9x @ 4 members, 7.6x @ 8 members, 10.4x @ 12 members.]
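Expressed as speedup divided by member count, these points correspond to per-member scaling efficiencies of roughly 99% at 2 members, 98% at 4, 95% at 8, and 87% at 12.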
DB2 pureScale Architecture Scalability
• How far will it scale?
• Take a web-commerce-type workload
  – Read mostly but not read only – about 90/10
• Don't make the application cluster aware
  – No routing of transactions to members
  – Demonstrate transparent application scaling
• Scale out to the 128-member limit and measure scalability
The 128-member result
[Chart callouts, by cluster size:]
  2, 4 and 8 members : over 95% scalability
  16 members         : over 95% scalability
  32 members         : over 95% scalability
  64 members         : 95% scalability
  88 members         : 90% scalability
  112 members        : 89% scalability
  128 members        : 84% scalability
Summary
• Performance & scalability are two top goals of pureScale
  – Many architectural features were designed solely to drive the best possible performance
• Monitoring and tuning for pureScale extend existing DB2 interfaces and practices
  – e.g., techniques for optimizing GBP/LBP configuration build on steps already familiar to DB2 DBAs
• The pureScale architecture exploits leading-edge low-latency interconnects and RDMA to achieve excellent performance & scalability
  – The initial 12- and 128-member proof points are strong evidence of a successful first release, with even better things to come!
Questions