
Tips for Using SQL Server Performance Monitor Counters

One cause of excess I/O on a SQL Server is page splitting. Page splitting occurs when an index or data page becomes full, and then is split between the current page and a newly allocated page. While occasional page splitting is normal, excess page splitting can cause excessive disk I/O and contribute to slow performance.

If you want to find out if your SQL Server is experiencing a large number of page splits, monitor the SQL Server Access Methods object: Page Splits/sec. If you find that the number of page splits is high, consider lowering the fill factor of your indexes. A lower fill factor helps to reduce page splits because it leaves more free room in data pages before they fill up and a page split has to occur.

What is a high Page Splits/sec? There is no simple answer, as it somewhat depends on your system's I/O subsystem. But if you are regularly having disk I/O performance problems and this counter is consistently over 100, then you might want to experiment with lowering the fill factor to see if it helps. [6.5, 7.0, 2000] Updated 9-4-2006
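
As a minimal sketch of how you might apply a new fill factor (the table name, index name, and value of 80 here are hypothetical, not from the original tip), you could rebuild an existing index with DBCC DBREINDEX, which is available in the versions this tip covers:

-- Rebuild the hypothetical IX_Orders_CustomerID index on Orders,
-- leaving roughly 20% free space in each leaf page
DBCC DBREINDEX ('Orders', IX_Orders_CustomerID, 80)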

*****

If you want to see how much physical RAM is devoted to SQL Server's data cache, monitor the SQL Server Buffer Manager Object: Cache Size (pages). This number is reported in pages, and each page is 8 KB (8,192 bytes), so multiply the reported value by 8 to determine the amount of RAM, in KB, that is being used.
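
For example, if the counter reports 131,072 pages, that is 131,072 x 8 KB = 1,048,576 KB, or roughly 1 GB of data cache. As a sketch, you can also do the conversion in Transact-SQL, assuming the counter is exposed in master..sysperfinfo under the same name Performance Monitor shows:

-- Convert the Cache Size (pages) counter into KB (8 KB per page)
SELECT cntr_value AS cache_pages, cntr_value * 8 AS cache_kb
FROM master..sysperfinfo
WHERE counter_name = 'Cache Size (pages)'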

Generally, this number should be close to the total amount of RAM in the server, less the RAM used by NT, SQL Server, and any utilities you have running on the server.

If the amount of RAM devoted to the data cache is much smaller than you would expect, then you need to do some investigating to find out why. Perhaps you aren't allowing SQL Server to dynamically allocate RAM, and instead have accidentally specified that SQL Server use less RAM than it should have access to for optimal performance. Whatever the cause, you need to find a solution, as the amount of data cache available to SQL Server can significantly affect SQL Server's performance.

In the real world, I don't spend much time looking at this counter, as there are other counters that do a better job of letting you know if SQL Server is memory starved or not. [6.5, 7.0, 2000] Updated 9-4-2006

*****

To get a feel for how busy SQL Server is, monitor the SQLServer: SQL Statistics: Batch Requests/Sec counter. This counter measures the number of batch requests that SQL Server receives per second, and generally moves in step with how busy your server's CPUs are. Generally speaking, over 1,000 batch requests per second indicates a very busy SQL Server, and could mean that if you are not already experiencing a CPU bottleneck, you may very well be soon. Of course, this is a relative number, and the bigger your hardware, the more batch requests per second SQL Server can handle.

From a network bottleneck perspective, a typical 100 Mbps network card is only able to handle about 3,000 batch requests per second. If you have a server that is this busy, you may need to add two or more network cards, or move to a 1 Gbps network card.

Some DBAs use the SQLServer: Databases: Transactions/sec: _Total counter to measure total SQL Server activity, but this is not a good idea. Transactions/sec only measures activity that occurs inside a transaction, not all activity, producing skewed results. Instead, always use the SQLServer: SQL Statistics: Batch Requests/Sec counter, which measures all SQL Server activity. [7.0, 2000] Updated 9-4-2006
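
Because the value exposed to Transact-SQL for this counter is a cumulative count rather than a rate, a rough sketch for approximating the per-second rate (assuming the 7.0/2000-era master..sysperfinfo view and the counter name shown above) is to sample it twice and divide by the interval:

DECLARE @before int, @after int
SELECT @before = cntr_value FROM master..sysperfinfo WHERE counter_name = 'Batch Requests/sec'
WAITFOR DELAY '00:00:10'  -- wait ten seconds between samples
SELECT @after = cntr_value FROM master..sysperfinfo WHERE counter_name = 'Batch Requests/sec'
SELECT (@after - @before) / 10 AS batch_requests_per_sec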

*****

Compilation of Transact-SQL code is a normal part of SQL Server's operation. But because compilations chew up CPU and other resources, SQL Server attempts to reuse as many execution plans in cache as possible (execution plans are created when compilations occur). The more execution plans are reused, the less overhead there is on the server, and the faster overall performance will be.

To find out how many compilations SQL Server is doing, you can monitor the SQLServer: SQL Statistics: SQL Compilations/Sec counter. As you would expect, this measures how many compilations are performed by SQL Server per second.

Generally speaking, if this figure is over 100 compilations per second, then you may be experiencing unnecessary compilation overhead. A high number such as this might indicate that your server is just very busy, or it could mean that unnecessary compilations are being performed. For example, compilations can be forced by SQL Server if an object's schema changes, if previously parallelized execution plans have to run serially, if statistics are recomputed, or if a number of other things occur. In some cases, you have the power to reduce the number of unnecessary compilations. See this page for tips on how to do this.

If you find that your server is performing over 100 compilations per second, you should take the time to investigate if the cause of this is something that you can control. Too many compilations will hurt your SQL Server's performance. [7.0, 2000] Updated 9-4-2006
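
One common way to cut down on avoidable compilations is to submit parameterized statements, for example via sp_executesql, so a single cached plan can be reused across different parameter values. A minimal sketch, using a hypothetical Orders table rather than anything from the original tip:

-- Ad hoc version: each distinct literal value can compile its own plan
-- SELECT OrderID, OrderDate FROM Orders WHERE CustomerID = 42

-- Parameterized version: one plan is compiled and then reused
EXEC sp_executesql
    N'SELECT OrderID, OrderDate FROM Orders WHERE CustomerID = @CustomerID',
    N'@CustomerID int',
    @CustomerID = 42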

*****


The SQLServer: Databases: Log Flushes/sec counter measures the number of log flushes per second. This can be measured on a per database level, or for all databases on a SQL Server.

So exactly what is a log flush? The best way to describe it is to provide an example. Let's say that you want to start a transaction that has 10 INSERTs in it. When the transaction begins and the first INSERT is made, essentially two things happen at the same time. The data page in the buffer cache is updated with the newly INSERTed row of data, and the appropriate data for the log file is written to the log cache for this single INSERT. This continues to happen until the transaction is complete. At that point, the data for this transaction is immediately written from the log cache to the log file, but the data in the buffer cache stays there until the next checkpoint process runs, at which time the database is updated with the newly INSERTed rows.

You may have never heard of the log cache, which is an area of memory where SQL Server records data that is to be written to the log file. The log cache is important because it is used to roll back a transaction before it is committed, if circumstances call for it. But once a transaction is complete (and can no longer be rolled back), the log cache is immediately flushed to the physical log file. This is a normal procedure. Keep in mind that SELECT queries that don't modify data don't create transactions and don't produce log flushes.

Essentially, a log flush occurs when data is written from the log cache to the physical log file. So in essence, a log flush occurs every time a transaction completes, and the number of log flushes that occur is related to the number of transactions performed by SQL Server. And as you might expect, the size of a log flush (how much data is written from the log cache to disk) varies depending on the transaction. So how can this information help us?

Let's say that we know we have a disk I/O bottleneck, but we are not sure what is causing it. One way to troubleshoot the disk I/O bottleneck is to capture the Log Flushes/sec counter data and see how busy this mechanism is. As you might expect, if your server experiences lots of transactions, it will also experience a lot of log flushes, so the value you see for this counter will vary from server to server, depending on how busy it is with action-type queries that create transactions. What you want to do with this information is to identify situations where the number of log flushes per second seems significantly higher than the number of transactions you think should be running on the server.

For example, let's say that you have a daily process that INSERTs 1,000,000 rows into a table. There are several different ways that these rows could be inserted. First, each row could be inserted separately, with each INSERT wrapped inside its own transaction. Second, all of the INSERTs could be performed within a single transaction. And last, the INSERTs might be divided into multiple transactions, somewhere between 1 and 1,000,000. Each of these options is different and has a significantly different effect on SQL Server and on the number of log flushes per second. In addition, it's easy to make a mistake and assume that the process you are running is a single transaction, even though it might not be. Most people tend to think of a single process as a single transaction.

In the first case, if 1,000,000 rows are INSERTed with 1,000,000 transactions, there will also be 1,000,000 log flushes. But in the second case, where 1,000,000 rows are inserted within a single transaction, there will be only one log flush. And in the third case, the number of log flushes will equal the number of transactions. Obviously, each individual log flush will be much larger with 1 transaction than with 1,000,000 transactions, but for the most part this is not important from a performance standpoint as described here.

So which option is best? In all cases, you will still be producing a lot of disk I/O. There is no way to get around this if you deal with 1,000,000 rows. But by using one or just a few transactions, you reduce the number of log flushes significantly, and disk I/O is reduced significantly, which helps to reduce the I/O bottleneck, boosting performance.
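
A minimal sketch of the difference (the table name and loop are hypothetical, and a real bulk load would more likely use BULK INSERT or a similar tool): wrapping the whole loop in one explicit transaction produces a single log flush at COMMIT, whereas committing each INSERT separately produces roughly one log flush per row.

-- One explicit transaction: one log flush when COMMIT completes
BEGIN TRANSACTION
DECLARE @i int
SET @i = 1
WHILE @i <= 1000000
BEGIN
    INSERT INTO LoadTest (id) VALUES (@i)
    SET @i = @i + 1
END
COMMIT TRANSACTION

-- By contrast, moving BEGIN TRANSACTION/COMMIT inside the WHILE loop (or relying
-- on autocommit for each INSERT) produces roughly one log flush per row.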

So we have learned two key things here. First, you want to reduce log flushes as much as you can. Second, one key way to do this is to reduce the number of transactions occurring on your server. [7.0, 2000] Updated 9-4-2006

*****

Since the number of users using SQL Server affects its performance, you may want to keep an eye on the SQL Server General Statistics Object: User Connections. This shows the number of user connections, not the number of users, that are currently connected to SQL Server.

When interpreting this number, keep in mind that a single user can have multiple connections open, and also that multiple people can share a single connection. Don't assume that this number represents actual users. Instead, use it as a relative measure of how "busy" the server is. Watch the number over time to get a feel for whether your server is being used more or less. [6.5, 7.0, 2000] Updated 6-12-2006
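
If you want to cross-check the counter from Transact-SQL, a rough sketch (assuming the usual convention for these versions that SPIDs of 50 and below are reserved for system processes) is:

-- Approximate count of current user connections
SELECT COUNT(*) AS user_connections
FROM master..sysprocesses
WHERE spid > 50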

*****

If your databases are suffering from deadlocks, you can track them by using the SQL Server Locks Object: Number of Deadlocks/sec. But unless this number is relatively high, you won't see much here, because the measure is per second, and it takes quite a few deadlocks to be noticeable.


But still, it is worth checking out if you are having a deadlock problem. Better yet, use the Profiler's ability to track deadlocks. It will provide you with more detailed information. What you might consider doing is to use the Number of Deadlocks/sec counter on a regular basis to get the "big" picture, and if you discover deadlock problems with this counter, then use the Profiler to "drill" down on the problem for a more detailed analysis. [6.5, 7.0, 2000] Updated 6-12-2006
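
As an alternative or complement to Profiler, in these versions you can also have SQL Server write deadlock detail to its error log by turning on trace flag 1204; a sketch, assuming you want it enabled server-wide until the next restart:

-- Record the participants and resources of each deadlock in the SQL Server error log
DBCC TRACEON (1204, -1)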

*****

If your users are complaining that they have to wait for their transactions to complete, you may want to find out if object locking on the server is contributing to this problem. To do this, use the SQL Server Locks Object: Average Wait Time (ms). You can use this counter to measure the average wait time for a variety of lock types, including database, extent, key, page, RID, and table locks.

As the DBA, you have to decide what an acceptable average wait time is. One way to do this is to watch this counter over time for each of the lock types, finding average values for each type of lock. Then use these average values as a point of reference. For example, if the average wait time in milliseconds of RID (row) locks is 500, then you might consider any value over 500 as potentially a problem, especially if the value is a lot higher than 500, and extends over long periods of time.

If you can identify one or more types of locks causing transaction delays, then you will want to investigate further to see if you can identify what specific transactions are causing the locking. The Profiler is the best tool for this detailed analysis of locking issues. [6.5, 7.0, 2000] Updated 6-12-2006
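
If you want to see in real time which sessions are being held up and by whom, a quick sketch against master..sysprocesses (available in the versions this tip covers) is:

-- Sessions currently blocked by another SPID, with what they are waiting on
SELECT spid, blocked, waittime, lastwaittype, waitresource
FROM master..sysprocesses
WHERE blocked <> 0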

*****

While table scans are a fact of life, and sometimes faster than index seeks, generally it is better to have fewer table scans than more. To find out how many table scans your server is performing, use the SQL Server Access Methods Object: Full Scans/sec. Note that this counter is for an entire server, not just a single database. One thing you will notice with this counter is that there often appears to be a pattern of scans occurring periodically. In many cases, these are table scans SQL Server is performing on a regular basis for internal use.

What you want to look for are the random table scans that represent your application. If you see what you consider to be an inordinate number of table scans, then break out the Profiler and Index Tuning Wizard to help you determine exactly what is causing them, and whether adding any indexes can help reduce the table scans. Of course, SQL Server may just be doing its job well, and performing table scans instead of using indexes because it is simply more efficient. But you won't know unless you look and see what is really happening under the covers. [6.5, 7.0, 2000] Updated 6-12-2006
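
One quick way to check whether a specific query is responsible for the scans is to look at its plan. A sketch using SET SHOWPLAN_TEXT, with a hypothetical table and query:

SET SHOWPLAN_TEXT ON
GO
-- Look for 'Table Scan' versus 'Index Seek' operators in the plan output
SELECT OrderID, OrderDate FROM Orders WHERE CustomerID = 42
GO
SET SHOWPLAN_TEXT OFF
GO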

*****

If you suspect that your backup or restore operations are running at sub-optimal speeds, you can help verify this by using the SQL Server Backup Device Object: Device Throughput Bytes/sec. This counter will give you a good feel for how fast your backups are performing. You will also want to use the Physical Disk Object: Avg. Disk Queue Length counter to help corroborate your suspicions. Most likely, if you are having backup or restore performance issues, it is because of an I/O bottleneck.

As the DBA, it will be your job to determine which I/O bottlenecks you are experiencing and deal with them appropriately. For example, the cause of slow backups or restores could be something as simple as a DTS job that is running at the same time, and could be fixed by rescheduling that job. [6.5, 7.0, 2000] Updated 6-12-2006
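
If you want a second data point from Transact-SQL, the BACKUP command itself reports throughput when it finishes; a sketch with a hypothetical backup path (Northwind is just a stand-in database):

-- STATS = 10 prints progress every 10 percent; the completion message includes MB/sec throughput
BACKUP DATABASE Northwind
TO DISK = 'D:\Backups\Northwind.bak'
WITH STATS = 10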

*****

If you are using transactional replication, you may want to monitor the latency of the Log Reader Agent as it moves transactions from a database's transaction log to the distribution database, and also the latency of the Distribution Agent as it moves transactions from the distribution database to the subscriber database. The total of these two figures is the amount of time it takes a transaction to get from the publication database to the subscriber database.

The counters for these two processes are the: SQL Server Replication LogReader: Delivery Latency counter and the SQL Server Replication Dist.: Delivery Latency counter.

If you see a significant increase in the latency of either of these processes, this should be a signal to find out what has changed to cause the increased latency. [6.5, 7.0, 2000] Updated 6-12-2006

*****

A key counter to watch is the SQL Server Buffer Manager Object: Buffer Cache Hit Ratio. This indicates how often SQL Server goes to the buffer, not the hard disk, to get data. The higher this ratio, the less often SQL Server has to go to the hard disk to fetch data, and performance overall is boosted.

Unlike many of the other counters available for monitoring SQL Server, this counter averages the Buffer Cache Hit Ratio from the time the SQL Server instance was last restarted. In other words, this counter is not a real-time measurement, but an average over all the days since SQL Server was last restarted. Because of this, if you really want to get an accurate picture of what is happening in your buffer cache right now, you must stop and restart the SQL Server service, then let SQL Server run under several hours of normal activity before you check this figure (in order to get a good reading).

If you have not restarted SQL Server lately, then the Buffer Cache Hit Ratio figure you see may not be accurate for what is occurring now on your SQL Server, and it is possible that although the ratio looks good, it may in fact not be, because of the way this counter averages the ratio over time.

In OLTP applications, this ratio should exceed 90-95%. If it doesn't, then you need to add more RAM to your server to increase performance.

In OLAP applications, the ratio could be much less because of the nature of how OLAP works. In any case, more RAM should increase the performance of SQL Server OLAP activity. [6.5, 7.0, 2000] Updated 8-21-2005
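
If you would rather pull the ratio from Transact-SQL than from Performance Monitor, a sketch against master..sysperfinfo (assuming the default-instance object name 'SQLServer:Buffer Manager'; named instances use a different prefix) is:

-- The ratio counter must be divided by its matching 'base' counter to get a percentage
SELECT 100.0 * a.cntr_value / b.cntr_value AS buffer_cache_hit_ratio
FROM master..sysperfinfo a, master..sysperfinfo b
WHERE a.object_name = 'SQLServer:Buffer Manager'
  AND b.object_name = 'SQLServer:Buffer Manager'
  AND a.counter_name = 'Buffer cache hit ratio'
  AND b.counter_name = 'Buffer cache hit ratio base'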

*****

Consider watching these two counters: SQLServer:Memory Manager: Total Server Memory (KB) and SQLServer:Memory Manager: Target Server Memory (KB). The first counter, Total Server Memory (KB), tells you how much memory the MSSQLServer service is currently using. This includes the total of the buffers committed to the SQL Server BPool and the OS buffers of the type "OS in Use".

The second counter, SQLServer:Memory Manager: Target Server Memory (KB), tells you how much memory SQL Server would like to have in order to operate efficiently. This is based on the number of buffers reserved by SQL Server when it is first started up.

If, over time, the Total Server Memory (KB) counter is less than the Target Server Memory (KB) counter, then this means that SQL Server has enough memory to run efficiently. On the other hand, if the Total Server Memory (KB) counter is greater than or equal to the Target Server Memory (KB) counter, this indicates that SQL Server may be under memory pressure and could use access to more physical memory. [7.0, 2000] Updated 5-25-2005
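
A sketch for comparing the two values directly from Transact-SQL (again assuming the counters appear in master..sysperfinfo under the names shown by Performance Monitor):

-- Both values are reported in KB
SELECT counter_name, cntr_value AS kb
FROM master..sysperfinfo
WHERE counter_name IN ('Total Server Memory (KB)', 'Target Server Memory (KB)')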

*****

SQL Server performs faster and with fewer resources if it can retrieve data from the buffer cache instead of reading it from disk. In some cases, memory intensive operations can force data pages out of the cache before they ideally should be flushed out. This can occur if the buffer cache is not large enough and the memory intensive operation needs more buffer space to work with. When this happens, the data pages that were flushed out to make extra room must again be read from disk, hurting performance.

There are three different SQL Server counters that you can watch to help determine if your SQL Server is experiencing such a problem.

SQL Server Buffer Mgr: Page Life Expectancy: This performance monitor counter tells you, on average, how long data pages are staying in the buffer. If this value gets below 300 seconds, this is a potential indication that your SQL Server could use more memory in order to boost performance.

SQL Server Buffer Mgr: Lazy Writes/Sec: This counter tracks how many times a second the Lazy Writer process moves dirty pages from the buffer to disk in order to free up buffer space. Generally speaking, this should not be a high value, say more than 20 per second or so. Ideally, it should be close to zero. If it is zero, this indicates that your SQL Server's buffer cache is plenty big and SQL Server doesn't have to free up dirty pages, instead waiting for this to occur during regular checkpoints. If this value is high, then a need for more memory is indicated.

SQL Server Buffer Mgr: Checkpoint Pages/Sec: When a checkpoint occurs, all dirty pages are written to disk. This is a normal procedure and will cause this counter to rise during the checkpoint process. What you don't want to see is a high value for this counter over time. This can indicate that the checkpoint process is running more often than it should, which can use up valuable server resources. If this has a high figure (and this will vary from server to server), consider adding more RAM to reduce how often the checkpoint occurs, or consider increasing the "recovery interval" SQL Server configuration setting.
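
A sketch for spot-checking these three counters from Transact-SQL (assuming they appear in master..sysperfinfo under the names Performance Monitor shows; Page Life Expectancy is a point-in-time value in seconds, while the other two are cumulative counts):

SELECT counter_name, cntr_value
FROM master..sysperfinfo
WHERE counter_name IN ('Page life expectancy', 'Lazy writes/sec', 'Checkpoint pages/sec')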

These performance monitor counters should be considered advanced and only used to "refine" a potential diagnosis of "not enough memory" for your SQL Server. [7.0, 2000] Updated 8-21-2005

*****

A latch is in essence a "lightweight lock". From a technical perspective, a latch is a lightweight, short-term synchronization object (for those who like technical jargon). A latch acts like a lock, in that its purpose is to prevent data from changing unexpectedly. For example, when a row of data is being moved from the buffer to the SQL Server storage engine, a latch is used by SQL Server during this move (which is very quick indeed) to prevent the data in the row from being changed during this very short time period. This not only applies to rows of data, but to index information as well, as it is retrieved by SQL Server.


Just like a lock, a latch can prevent SQL Server from accessing rows in a database, which can hurt performance. Because of this, you want to minimize latch time.

SQL Server provides three different ways to measure latch activity. They include:

Average Latch Wait Time (ms): The wait time (in milliseconds) for latch requests that had to wait. Note that this is a measurement only for those latches whose requests had to wait; in many cases there is no wait at all, so this figure does not cover every latch taken.

Latch Waits/sec: This is the number of latch requests that could not be granted immediately. In other words, this is the number of latches, in a one-second period, that had to wait. These are the latches measured by Average Latch Wait Time (ms).

Total Latch Wait Time (ms): This is the total latch wait time (in milliseconds) for latch requests in the last second. In essence, it is roughly the product of the two counters above for the most recent second.

When reading these figures, be sure you have read the scale on Performance Monitor correctly. The scale can change from counter to counter, and this can be confusing if you don't compare apples to apples.

Based on my experience, the Average Latch Wait Time (ms) counter will remain fairly constant over time, while you may see huge fluctuations in the other two counters, depending on what SQL Server is doing.

Because each server is somewhat different, latch activity is different on each server. It is a good idea to get baseline numbers for each of these counters for your typical workload. This will allow you to compare "typical" latch activity against what is happening right now, letting you know if latch activity is higher or lower than "typical".

If latch activity is higher than expected, this often indicates one of two potential problems. First, it may mean your SQL Server could use more memory. If latch activity is high, check to see what your buffer cache hit ratio is. If it is below 99%, your server could probably benefit from more RAM. If the hit ratio is above 99%, then it could be the I/O system that is contributing to the problem, and a faster I/O system might benefit your server's performance.

If you really like to get your hands dirty, here are a couple of commands you might want to experiment with to learn more about latching behavior of your software.

-- Show user SPIDs that are currently waiting, along with their wait details
SELECT * FROM master..sysprocesses WHERE waittime > 0 AND spid > 50

This query displays the currently existing SPIDs that are waiting, along with the waittype, waittime, lastwaittype, and waitresource columns. The lastwaittype column tells you the latch (or other wait) type, and waitresource tells you what resource the SPID is waiting on. When you run it, you may not get any results, because there are no waits occurring at that moment. But if you run the query over and over, you will eventually get some results.

DBCC SQLPERF (waitstats, clear)  -- clears the accumulated wait statistics
DBCC SQLPERF (waitstats)         -- shows statistics accumulated since the last clear (or SQL Server service restart)

This command displays the current latch and other wait statistics, along with their Wait Type and Wait Time. You may first want to clear the stats, then run DBCC SQLPERF (waitstats) periodically over a short time period to see which latches are taking the most time.

Thanks to these forum members who contributed to this tip: josephobrien, rortloff, harryarchibald. [7.0, 2000] Updated 8-21-2005
