Techniques for Analyzing Performance Problems Session: What are the key indicators of problems and...

Preview:

Citation preview

Techniques for Analyzing Performance Problems

Session: What are the key indicators of problems and poor performance that every database administrator should be tracking? Where do you find this information, and how can you distinguish the urgent from the mundane? We will learn how to approach common monitoring problems through live demonstrations and explain how to get started on the road to proactively managing your database and application. Join this session to find out the answers to all these questions and more!

Techniques For Analyzing

Performance Problems

It’s The Right Thing To Do.

Tom Bascom, White Star Softwaretom@wss.com

3

A Few Words about the Speaker

• Tom Bascom; Progress user & roaming DBA since 1987

• President, DBAppraise, LLC– Remote database management service for OpenEdge.– Simplifying the job of managing and monitoring the world’s

best business applications.– tom@dbappraise.com

• VP, White Star Software, LLC– Expert consulting services related to all aspects of Progress and

OpenEdge.– tom@wss.com

Get the Basic Facts

• What end-user owns the problem? Who confirms that we have fixed it?

• Is the problem related to a specific screen, program or business process?

• Is the problem repeatable? On demand? Is there a pattern (i.e. only in the afternoon or whenever X is also running…)

• Is there a specific target metric that we need to satisfy?• Do we have before and after run times or other end-user

metrics related to the identified problem?• Did the problem start on a particular date? Or after an

identified change?

Establish a Timeline

• When did the problem begin?– Verify this with independent data! (Logfiles, timestamps on output

etc.)• What changes are known to have occurred around this time?

– Software releases – “The System”, OpenEdge, Apache, Java, OS etc.– Parameter changes (DB, 4gl client, OS, SAN, network etc.)

• Has the Workload changed?– Application workload characteristics (business volume, number of

users, transaction volume, CRUD statistics, etc.)– Environment workload changes: OS, disk, network…– External workload changes (Shared SAN, Network, Other VMs).

When did the problem begin?

• It isn’t always the same as when the user first noticed the problem.

• Getting the wrong begin date may lead to identifying the wrong root cause.

• Independent verification is critical – log files, timestamps of output files and careful interviews of end-users are all tools that can help establish a reliable begin date.

What Changes Occurred?

• A complete and reliable timeline of all changes is critical.

• It is important to look “outside the box” – changes sometimes have unexpected side effects.

• Make sure that changes to any and all systems involved in the business process being analyzed are tracked.

Sample Timeline

Normal Performance

OE 11.2 Upgrade

Monthly Release

Record Reads

Increase 300%, Latch

Timeouts Spike

SAN Upgrade

Users Report Bad

Performance

Application Workload

• Over time more users create more data and engage in more activity. If parts of the application are inappropriately sensitive to growth there may be problems.

• DB activity should scale linearly (or less) with business volume.• Number of users, record creates, reads, updates and deletes,

number of transactions etc. should all be related to business activity.

• Reporting and data aggregation will take longer with more data – but the additional time should be linear and predictable.

• Any metrics which are growing exponentially faster than business growth are indicators of an underlying scalability problem.

Environment Workload

• As the application becomes busier so should the environment that it is hosted in.

• Just as with the application workload the environment workload should scale linearly with business activity.

• CPU utilization, network traffic, memory utilization and disk space should all grow in step with business growth.

• Disk IO operations should grow more slowly than business growth (IO ops are buffered resources that should follow an inverse square growth path).

• Any metrics which are growing exponentially faster than business growth are indicators of an underlying scalability problem.

External Workload

• Changes outside the immediate environment can impact performance.

• Network load from other applications my reduce the available bandwidth and increase latency.

• Disk IO from other applications may impact SAN performance.

• Demand for CPU from other applications on a shared VM may impact the scheduling and availability of shared CPU.

• If the application and environment workloads have not changed but we are seeing increased latency or deeper queuing it is likely that the external workload has increased.

TOOLS

Tools to Narrow the Scope• Kill USR1 -- 4GL Stack Trace

– Generates a “protrace” file from a running process.– Does not interfere with the running process.– Requires that you be the process owner or “root”.

• Progress Profiler– Gathers detailed line by line execution timing.– Can be enabled via startup parameters or by manipulating PROFILER

handle.• Client Logging

– Logs specified 4GL session details to a file.– Can be very verbose.

• Client Statement Cache– Shows the 4GL data access statement and stack trace– Or the SQL statement

About Line Numbers

• Most tools use DEBUG LIST line numbers.• This listing expands all include files and pre-

processor statements.• COMPILE “program.p” DEBUG-LIST

“program.dbg”.

Obtaining a USR#

/home/tom> ps -fu f474458 UID PID PPID C STIME TTY TIME CMD f474458 18612454 35669114 0 09:30:43 - 0:00 boks_sshd: f474458@pts/11 f474458 14171658 46286012 0 12:58:50 pts/20 0:00 /progress/dlc/bin/_progres -p mls.p f474458 34488084 44385460 0 09:34:30 - 0:00 boks_sshd: f474458@pts/20 f474458 37176446 39666876 0 Dec 04 pts/12 0:00 bash f474458 39666876 17448590 0 Dec 04 - 0:00 boks_sshd: f474458@pts/12 f474458 42812578 47989918 120 12:59:02 pts/11 0:00 ps -fu f474458 f474458 46286012 34488084 0 09:34:31 pts/20 0:00 bash f474458 47989918 18612454 0 09:30:43 pts/11 0:00 bash

/home/tom> grep f474458 /pdb_prod/xlos.lg | grep 14171658

[2012/12/06@12:58:50] P-14171658 T-1 I ABL 124: (452) Login by f474458 on /dev/pts/20.[2012/12/06@12:58:52] P-14171658 T-1 I ABL 124: (708) Userid is now f474458.

KILL USR14GL Stack Trace From The Command Line

Generate a 4GL Stack Trace

• Great for batch processes and “hung” users.• Use “kill -USR1” to generate a stack trace and

establish what code is actually running.

• The “protrace” output will be in the process’ current directory.

$ ps –fu f474458UID PID PPID C STIME TTY TIME CMDf474458 33570788 37176446 62 11:21:57 pts/12 0:00 ps -fu f474458f474458 37176446 39666876 0 08:44:23 pts/12 0:00 bashf474458 39666876 17448590 0 08:44:23 - 0:00 boks_sshd: f474458@pts/12f474458 42288226 45303030 0 11:21:30 pts/48 0:00 bashf474458 45302816 42288226 0 11:21:37 pts/48 0:00 /progress/dlc/bin/_progres -p mls.p$ kill –USR1 45302816$ ls –l /los_prod/protrace.45302816$ -rw-rw-r-- 1 f474458 los_prd 2194 Dec 4 11:28 /los_prod/protrace.45302816

Sample 4GL Stack Trace$ cat /los_prod/protrace.45302816

PROGRESS stack trace as of Tue Dec 4 11:28:00 2012Command line arguments are/progress/dlc/bin/_progres -p mls.p -pf /los_prod/develop.pf -T /dbtmp

Startup parameters:-pf /progress/dlc/startup.pf,-T /dbtmp,-clientlog /los_logs/debug/client_f474458_20121204_112720_45302816.log... (snip)

+++PARALLEL TOOLS CONSORTIUM LIGHTWEIGHT COREFILE FORMAT version 1.0... (snip)

** 4GL Stack Trace **

--> obj/mnu/menu (/los_prod/obj/mnu/menu.r) at line 416 obj/src/startup (/los_prod/obj/src/startup.r) at line 2042 mls.p (/los_prod/mls.r) at line 26

** Persistent procedures/Classes **

Handle File Name001010 /los_prod/obj/tools/proclib.r001009 /los_prod/obj/tools/quick-supers.r000000 /los_prod/config.r (STATIC)

CODE PROFILERLine By Line Execution Time

Profiler – The Hard Way

• -profile (startup option)– Non-intrusive– Non-selective

• PROFILER: (session handle)– More selective– But requires code insertion or “wrappering”– profiler/profctrl.p

• Analysis Tools– http://communities.progress.com/pcom/docs/DOC-2808– http://dbappraise.com

PROFILER Attributes

• DESCRIPTION – optional text describing this session• LISTINGS – whether or not to create debug listings• DIRECTORY – where to create debug listings (default to –T)• FILE-NAME – name of output file (default profile.out)• ENABLED – yes/no; initializes listings and so forth• PROFILING – turn profiling on or off

Profiling an Entire Session

• Create file called profiler.cfg with 3 lines:-OUTFILE /tmp/profiler.out-LISTINGS /tmp-DESCRIBE someDescription

• Add –profile to session startup:mpro dbName –p start.p –profile profiler.cfg

• Run normally.• Terminate cleanly & analyze the output.• Multiple gigabytes of data may be generated…

Minimal Embedded Usageassign profiler:enabled = yes profiler:profiling = yes

.

do i = 1 to 1000000:

/* do something */end.

assign profiler:enabled = no profiler:profiling = no

.

profiler:write-data().

<Escape>-P Profiler Tool

• Dynamically capture line by line execution time at the point where issues occur.

• Send output to a user-defined destination.• <Esc>P to configure and enable. (Don’t forget

“Yes”.)• <Esc>P again to complete the capture and disable.

Sample Profiling OutputDescription: XYZZY Date: 10/04/11

Top Total Time Lines

Program Line Avg Time Time Calls

------------------------------ ----- ----------- ---------- -------

xtabsms2.p 19281 93.256652 186.513304 2

getdocprep2.p 939 63.346967 126.693934 2

proc_create_sitm xtabsms2.p 11926 12.611345 50.445380 4

xtcountry_x2_x3.p 359 0.000013 0.459658 34,967

proc_read-database sysval.p 536 0.000117 0.336606 2,879

xtcountry_x2_x3.p 360 0.000009 0.324163 34,894

getdocprep2.p 741 0.067411 0.269642 4

proc_upd_nref xtmfintb2.p 3582 0.003438 0.233802 68

proc_upd_nref xtmfintb2.p 3298 0.003137 0.213345 68

proc_process_tasks xttskscn.r 3091 0.012560 0.200956 16

findClient sysval.p 328 0.000094 0.165167 1,763

CLIENT LOGGINGDetailed 4GL Session Logging

Client Logging – The Hard Way

• -clientlog filename– /logs/debug/USER_DATE_TIME_PID.log

• -logginglevel– 0 = disable– 1 = errors only– 2 thru 4 = basic, verbose & extended -logentrytypes

• -logentrytypes– 4GLTrace, 4GLTrans, DynObjects.*, FileID, QryInfo– LOG-MANAGER:LOG-ENTRY-TYPES = ”4GLTrace:2”

<Escape>-T Trace Tool

• Dynamically combine LOG-ENTRY-TYPEs & LOGGING-LEVELs to suit.

• 4GLTrace, QryInfo, DynObjects…• 2 = Basic, 3 = Verbose, 4 = Extended

• Specify a custom path to log file (/home/SID…)• Trace only the relevant portion of your session (not all

or nothing).

Client Logging Output

[12/12/06@09:12:22.471-0500] P-41043172 T-000001 1 4GL -- Logging level set to = 1[12/12/06@09:12:22.471-0500] P-41043172 T-000001 1 4GL -- Log entry types activated: 4GLTrace:2[12/12/06@09:12:22.490-0500] P-41043172 T-000001 2 4GL 4GLTRACE Run obj/src/connect [Main Block - mls.p @ 25]

Run obj/src/startup [Main Block - mls.p @ 26]PUBLISH Monitor [Main Block - obj/src/startup @ 270]Run obj/sec/access "startup no" [Main Block - obj/src/startup @ 366]Run obj/mnt/setcolor [Main Block - obj/src/startup @ 435]Func isDbReadOnly [Main Block - obj/src/startup @ 849]Func get-env [Main Block - obj/src/startup @ 882]Run obj/prt/genBarcode PERSIST [Main Block - obj/src/startup @ 903]Run loadCodes "0" [Main Block - obj/src/startup @ 914]Func addCode "00 00 11011001100 " [loadCodes - obj/prt/genBarcode @ 352]Func buildPCLString "11011001100 00" [addCode - obj/prt/genBarcode @ 291]…Run obj/mnu/menu "main" [Main Block - obj/src/startup @ 2042]Run obj/app/ckqaread [Main Block - obj/mnu/menu @ 136]Run obj/sec/access "main no" [Main Block - obj/mnu/menu @ 140]Run obj/lkp/fixmod [Main Block - obj/mnu/menu @ 189]Run obj/hlp/helpmess "main Retail Mortgage Processing" [Main Block - obj/mnu/menu @ 404]Run obj/tools/logout [Main Block - obj/src/startup @ 2119]

Prolog removed to improve readability…

STATEMENT CACHETracing Data Access Statements

Client Statement Cache

• Shows the session’s most recent data access statement.– Both 4gl and SQL clients are supported.

• 4gl stack trace is optionally available for 4gl clients.

• Enable via PROMON or via _CONNECT VST• Access results via PROMON or _CONNECT

PROMON – Statement CacheUser number : 89User name : f474458User type : SELF/ABLLogin date/time : 12/06/12 09:30

Statement caching type : SQL Statement or Partial ABL Program StackStatement caching last updated : 12/06/12 09:33

Statement cache information : 101 : obj/mrxv/lnstatex-val 1152 : execute obj/mrxe/50005399-2012-06-13-31926.r 153 : obj/mrxe/50005399-2012-06-13-31926.r 80 : obj/mrx/runcompmtrx 169 : obj/mrx/mtrxeval-stateless 282 : obj/sec/access 4300 : obj/adm/mortype4 575 : obj/mnu/menu 568 : obj/mnu/menu 3018 : flow-proc obj/src/loanflow 1947 : obj/src/loanflow 51 : obj/lkp/retailfl 575 : obj/mnu/menu 568 : obj/mnu/menu 568 : obj/mnu/menu 2042 : obj/src/startup 26 : mls.p

SCENARIOS

Scenarios

• Hung Session• Runaway• Slow Program• Slow System

NON-RESPONSIVE SESSION

Checking a “Hung” Session

• Check for Scroll-Lock (Control-S)• Examine log files.• Is the process consuming any CPU ticks?• Examine table statistics by user – is there any activity?• Check “blocked users” – is the session waiting on a

record lock or other resource?• Get a 4GL stack trace – examine the active code.

Could we be waiting for something?– Be especially wary of OS-* commands, file IO and any

external interfaces.

CPU Utilization

• Use “ps” to check CPU utilization:

• No change in CPU in 60 seconds – we might be “hung”.

• In this case a sub-process is blocked waiting for input.

/home/tom> ps -ef | grep [1]4171658; sleep 60; ps -ef | grep [1]4171658f474458 14171658 46286012 120 12:58:50 pts/20 0:06 _progres -p mls.pf474458 14171658 46286012 120 12:58:50 pts/20 0:06 _progres -p mls.p

/home/tom> ps -ef | grep [1]4171658f474458 14171658 46286012 0 12:58:50 pts/20 0:06 _progres -p mls.pf474458 50676836 14171658 0 13:48:56 pts/20 0:00 rm -i /tmp/xyzzy

Is There a Problem?

• Holding record locks, preventing access etc.• No? Don’t worry about it. Move on.• Yes? It might be a hung session that needs to be killed.

– Save log files and other analysis from above.– Dynamic restart: kill –INT (raises OpenEdge STOP condition

and re-runs startup procedure)– proshut dbname –disconnect usr#– Terminate: kill –HUP or kill –TERM– Check the .lg files for transaction rollback completion

before killing again.– Do NOT “kill -9” if there is DB activity – you may crash the

db!

Key Metrics

• ProTop – Active TRX, Blocked Users, Table IO by User

MY LITTLE RUNAWAY

Runaway?

• A particular OS process is consistently consuming significant CPU resources (usually something in excess of 90% of a core).

• It may, or may not, be executing as designed and expected.

• How can we tell?

Checking a Runaway

• Examine log files.• Is the process consuming a lot of CPU ticks?• Get a 4GL stack trace.• Examine table and index statistics by user.• Verify the Client Statement Cache.

CPU Utilization

• Check CPU utilization over time• If the utilization is a significant portion of the

time slice then you might have a “runaway”:

/home/tom> ps -ef | grep [1]4171658; sleep 60; ps -ef | grep [1]4171658f474458 14171658 46286012 120 12:58:50 pts/20 0:08 _progres -p mls.pf474458 14171658 46286012 120 12:58:50 pts/20 0:46 _progres -p mls.p

0:46 – 0:08 = 34 seconds of CPU time, 57% of the available timeslice.

Performing as Expected?

• Yes? It may be time to refactor.• No? It might be a runaway that needs to be killed.– Save log files and other analysis from above.– Dynamic restart: kill –INT (raises OpenEdge STOP

condition and re-runs startup procedure)– proshut dbname –disconnect usr#– Terminate: kill –HUP or kill –TERM– Check for transaction rollback completion before killing

again.– Do NOT “kill -9” if there is DB activity – you may crash

the db!

Key Metrics

• ProTop – Active TRX, Blocked Users, Table IO by User

RAPID READERS

Is it a “Rapid Reader”?

• Very heavy DB read activity.• Probably close to the theoretical limit for a

session (200,000 rec/sec for current HW).• Usually focused on just one table.• Often a small table held entirely in memory.• Often associated with a table scan (WHOLE-

INDEX search).• Poorly constructed joins may be involved.

ProTop Example

• Show summary, uio, tableStat data

Fixing a Rapid Reader

• Add an appropriate Index• Refactor JOIN logic.– Rewriting– Use a temp-table

• Use a temp-table to replace real table.

*** Tuesday at 2:30 -- Pick an Index, Any Index

A SPECIFIC PROGRAM IS

SLOWThe Easiest Scenario

Slow Program

• A specific program is slow.• The problem is repeatable and measurable.• We have used tools to establish which code is

slow.– If the profiler shows that data access statements

are at fault follow “Data Access” analysis.– Otherwise follow “Other Cause”.

Data Access Issues

• Possible Data Access Issues:– Table scans.– Missing or inappropriate indexes.– Poorly constructed joins.– Non-linear scaling.

• May be sensitive to the amount of data present.

• May be sensitive to data values.• May not be reproducible outside the problem

environment.

Expected Data Access

• First – determine what you *expect*:– What business metrics can we relate to this process?– How much activity should we see for X business?– How long should that activity take?– What tables and indexes should be used by this

business process?Example:

Extracting 100 loans of data should result in 100 “loanfile” and 225 “borrower” reads…

Actual Data Access

• Examine Table & Index Statistics By User– Are the expected tables being accessed at the

expected rates?– Are the indexes being used at runtime the

indexes that you expected?– Do data access operations vary linearly with the

amount of useful work being performed?– Is any activity approaching the known limits of

the infrastructure?

ProTop – CRUD Data09:38:33 ProTop -- Progress Database Monitor 09/19/04Sample sports2000 [/data/s2k/sports2000] RateHit Ratio: 14:1 14:1 Commits: 62 65 Local: 51 Miss%: 7.239% 7.140% Latch Waits: 45 46 Remote: 0 Hit%:92.761% 92.860% Tot/Mod Bufs: 1002 370 Batch: 50Log Reads: 22960 26486 Evict Bufs: 26602 6225 Server: 0 OS Reads: 1662 1891 Lock Table: 8192 11 Other: 1 Chkpts: 1 0 Lock Tbl HWM: 138 TRX: 1 Flushed: 0 0 Old/Curr BI: 6141 6141 Blocked: 5Area Full: 1 100.00% After Image: DISABLED Total: 52 Table StatisticsTbl# Table Name Create Read Update Delete---- -------------------- --------- --------- --------- --------- 4 OrderLine 0 5937 152 0 24 POLine 0 2641 56 0 23 PurchaseOrder 0 1699 36 0 18 Order 0 1608 37 0 21 Bin 0 286 14 0 2 Customer 0 206 16 0 12 Vacation 0 111 5 0

Theoretical Limits

• Record Reads/sec = 200,000 per session, 1.2M total (in a “shared” LPAR environment, 2M if “dedicated”).

• Transactions/sec = 18,000• Disk Read/sec = 20,000• Disk Write/sec = 5,000

Actual limits may be lower – these are the best values that we obtained in the performance lab.

Other Cause

• External Resources– Slow File IO.– Interfaces (especially networked services).

• Excessive recursion.• Dynamic Object allocation.

– May be revealed by memory growth.

• Parameter passing with deep copies.– May be revealed by memory growth.

• Poor “Big O” algorithm choice.

Key Metrics

• ProTop – Active TRX, Blocked Users, Table IO by User

“IT’S SLOW”Let’s Go Fishing…

Slow System

• “The System”, as a whole, is slow.• The problem is repeatable or predictable.– From 2pm until 5pm.– The last Thursday of the month.– Whenever the X extract is run.

Verify

• Are there workload changes that correlate with the problem?– Has business volume increased?– Have there been changes in how the system is being

used?– Has there been a change in the mix of products being

originated?• Are any system limits being approached?– OS limits: CPU utilization? Disk IO? Network Ops?– DB limits: TRX/sec? Record Rd/sec?

Database Activity

• Are there behavioral changes that correlate with the problem?– Do certain tables show increased IO activity?

• Have the “top 10” tables changed?

– Do certain indexes become active?• Have the “top 10” indexes changed?

– Do particular background processes become exceptionally busy?

– Are any bottleneck metrics worrisome?• Latch timeouts, blocked users, long transactions

Key Metrics

• Record & Index Creates, Reads, Updates, Deletes, Locks and Waits per second.– Aggregate– Per Table– Per User

• Transactions (Commits) per second• Latch Wait Timeouts• Number of Users:

– Blocked– With Active Transactions– With Old Active Transactions

66

Questions?

67

Thank You!

Recommended