EMC Navisphere Analyzer: A Case Study

May 2001

No part of this publication may be reproduced or distributed in any form or by any means, or stored in a database or retrieval system, without the prior written consent of EMC Corporation. The information contained in this document is subject to change without notice. EMC Corporation assumes no responsibility for any errors that may appear.

All computer software programs, including but not limited to microcode, described in this document are furnished under a license, and may be used or copied only in accordance with the terms of such license. EMC either owns or has the right to license the computer software programs described in this document. EMC Corporation retains all rights, title, and interest in the computer software programs.

EMC Corporation makes no warranties, expressed or implied, by operation of law or otherwise, relating to this document, the products, or the computer software programs described herein. EMC CORPORATION DISCLAIMS ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. In no event shall EMC Corporation be liable for (a) incidental, indirect, special, or consequential damages or (b) any damages whatsoever resulting from the loss of use, data, or profits, arising out of this document, even if advised of the possibility of such damages.

EMC2, EMC, CLARiiON, and Navisphere are registered trademarks and where information lives is a trademark of EMC Corporation. All other brands or products may be trademarks or registered trademarks of their respective holders.

Copyright 2001 EMC Corporation. All rights reserved. C844

Table of Contents

Executive Summary
Introduction
Customer's Problem
Conclusions of the Analysis
Summary
Appendix 1

Executive Summary

This white paper describes the functionality of EMC Navisphere Analyzer through the discussion of a customer case study. The case study presents the functionality available with Navisphere Analyzer, the problem that the customer experienced, the types of data that were collected, and the methodology that was used to resolve the problem using Analyzer. As more and more companies rely on CLARiiON products, the need increases to be able to quickly determine the following:

- The array is being used efficiently
- The array is working properly
- Sufficient resources exist on the array for normal day-to-day operations, as well as potential growth

Navisphere Analyzer software is a host-based performance analysis tool, intended to be used as a microscope that examines specific data in as much detail as necessary to determine the cause of a bottleneck or performance issue. Once the cause has been isolated, Analyzer further helps assess whether fine-tuning parameters of the array will solve the problem or whether hardware components, such as cache memory or disks, need to be added. Analyzer can be used to continuously monitor and analyze performance. This is most helpful in determining how to fine-tune array performance for maximum utilization. Alternately, it can be used to analyze data collected earlier. Data can be collected automatically from selected arrays, LUNs, or storage processors. The user can specify when to record data, from which hosts to gather data, and where the data should be stored. Collecting historical data of this type is helpful in determining the cause of lingering performance problems. Finally, the user can compare real-time data to data recorded previously to help analyze performance issues.
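To make the collection model concrete, here is a minimal sketch of how periodic counter samples might be archived for later comparison. This is an illustration only, not Analyzer's implementation; poll_counters and all field names are hypothetical, with simulated values standing in for real array counters.

```python
import csv
import random
import time

def poll_counters(component):
    # Hypothetical stand-in for reading a component's raw counters;
    # Analyzer's real data source is the array itself. Simulated here.
    return {"busy_ticks": random.randint(0, 1000),
            "reads": random.randint(0, 500),
            "writes": random.randint(0, 500)}

def record_samples(components, interval_s, n_samples, path):
    # Poll each component at a fixed interval and append the raw
    # counters to a CSV archive for later (historical) analysis.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp", "component", "busy_ticks", "reads", "writes"])
        for _ in range(n_samples):
            now = time.time()
            for name in components:
                c = poll_counters(name)
                writer.writerow([now, name, c["busy_ticks"], c["reads"], c["writes"]])
            time.sleep(interval_s)

# Example: archive two hours of a report window at 60-second granularity.
# record_samples(["SP-A", "LUN 0x02"], interval_s=60, n_samples=120,
#                path="analyzer_archive.csv")
```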

Introduction

A typical problem for a system administrator to encounter is a complaint by one or more departments that the performance of an array drastically changes from time to time, seemingly unrelated to what the department is doing at the time. That is, the department's usage of the array has not changed significantly. The administrator will usually try to gather information about when the problem occurred to see if he or she can determine what else was going on at the time. This is usually difficult to do because it's rare that every department remembers precisely what they were doing, and when. Navisphere Analyzer permits the administrator to collect data over different blocks of time and then analyze that data for hints about the underlying causes of the problem. For many problems, looking at the utilization of different components of the array is usually sufficient to quickly narrow down the basis of the problem: first looking in general at the utilization of each LUN, then, if a particular LUN's utilization is high, looking in more and more detail at the performance characteristics of that LUN. When using Navisphere Analyzer for such an investigation, the Basic data types (see Appendix 1) are typically sufficient for the analysis of most performance problems. That is, utilization of the LUN, storage processor, and/or disks can give a specific direction to pursue in researching a performance issue. To illustrate this point, a case study is presented here in which Navisphere Analyzer was used to determine the underlying cause of a performance problem.

Customer's Problem

The customer is a major CLARiiON account that was experiencing a severe performance problem each month when a large sales report was run. The configuration consisted of two FC5700 CLARiiON arrays with a combined storage of two terabytes. The arrays were configured as RAID 5. While the arrays performed well most of the time, when a particular large sales report was executed, the performance of the arrays was severely affected. The system administrator used Navisphere Analyzer to first look at the utilization of all of the LUNs on the array. Data was collected from 07:59 to 10:06 to overlap with the running of the sales report. Figure 1 is a printout of the utilization of the LUNs. It is clear that LUN 0x02 is close to 100 percent utilization: its average utilization is 90 percent, and its latest utilization was close to 100 percent. This would obviously affect the performance of the array in general.

Figure 1. LUN Utilization Report

The next step was to look in detail at the utilization for that LUN. Figure 2 shows that shortly after the sales report started to run at 08:00, the utilization for the LUN reached 100 percent and more or less stayed there for the duration of the report. This makes it very clear that this LUN is over-utilized.

Figure 2. Report showing the LUN's utilization in detail
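Utilization is simply the fraction of the observation interval a component spent busy. The check the administrator performed amounts to the following sketch, with invented per-LUN busy times rather than real Analyzer output:

```python
def utilization(busy_ms, interval_ms):
    # Fraction of the observation interval the component was busy.
    return busy_ms / interval_ms

# Hypothetical per-LUN busy times over a 60,000 ms polling interval.
samples = {"LUN 0x00": 21_000, "LUN 0x01": 18_500, "LUN 0x02": 59_400}

for lun, busy in samples.items():
    u = utilization(busy, 60_000)
    flag = "  <-- saturated, investigate" if u > 0.9 else ""
    print(f"{lun}: {u:.0%}{flag}")
# LUN 0x02 shows ~99 percent utilization, matching the behavior in Figure 2.
```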

The next step that the administrator took was to look at the utilization of the storage processor. This data is shown in the lower portion of Figure 3. It is clear that the storage processor is not being over-utilized; that is, the storage processor is not the limiting factor in this performance issue, because its utilization is only around 40 percent.

Figure 3. Report showing LUN and storage processor utilization combined

The next step was to look at the queue length of the storage processor. This is shown in Figure 4. Again, it is clear that the storage processor is not the issue, because its queue length is around 5, which matches the number of disks that constitute the LUN.

Figure 4. LUN utilization and storage processor queue length

The queue length for the LUN itself was then examined; it is shown in Figure 5. This queue length reveals the location of the problem: it is close to 20, which exceeds the number of disks that constitute the LUN. Clearly, this is where the bottleneck lies.

Figure 5. LUN utilization and LUN queue length
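The reasoning behind these two checks can be reduced to a simple rule of thumb: a sustained queue length above the number of disks backing the LUN means requests arrive faster than the spindles can service them. A small sketch using the numbers from Figures 4 and 5:

```python
def spindle_pressure(queue_length, n_disks):
    # Rule of thumb from the case study: a sustained queue length above
    # the number of disks backing the LUN means requests are piling up
    # faster than the spindles can service them.
    ratio = queue_length / n_disks
    if ratio <= 1.0:
        return f"queue/disk ratio {ratio:.1f}: disks are keeping up"
    return f"queue/disk ratio {ratio:.1f}: offered load exceeds what the disks can handle"

print(spindle_pressure(queue_length=5, n_disks=5))   # the SP's queue in Figure 4
print(spindle_pressure(queue_length=20, n_disks=5))  # the LUN's queue in Figure 5: 4x overload
```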

Conclusions of the Analysis

First, looking at the overall utilization report, it was clear that the LUN was being over-utilized. What wasn't clear, however, was whether the issue was the load on the storage processor or a lack of disks backing the LUN. The next step looked specifically at the LUN while the sales report was being run. It was obvious from that report (Figure 2) that while the report was running, the LUN was over-utilized. Next, the utilization of the storage processor was examined. When the data was examined, it was clear that the storage processor was not being bogged down, because its utilization was less than 50 percent. The queue lengths for both the storage processor and the LUN were then examined. The queue length for the storage processor was less than or equal to the number of disks that constitute the LUN, and therefore the storage processor was not the performance bottleneck. The queue length for the LUN, however, was close to 20, which is four times the number of disks in the LUN. That is, the load on the LUN was four times higher than it could handle. The conclusion drawn from this data was that the number of disks on the system should be increased. Once this was done, the problem was fixed.

Summary

Navisphere Analyzer was used to determine the cause of a performance problem. Starting at the level of the LUN, reports were used to move closer and closer to the problem. It was concluded, using a few very clear reports, that the LUN in question needed to be spread across more disks. Additional disks were added and the problem was resolved.

Appendix 1

Navisphere Analyzer collects and analyzes data on the following performance properties:

Basic (for disk, storage processor, and LUN)

- Utilization: The fraction of a certain observation period that the system component is busy serving incoming requests. An SP or disk that shows 100 percent (or close to 100 percent) utilization is a system bottleneck, since an increase in the overall workload will not increase the component's throughput; the component has reached its saturation point. Since a LUN is considered busy if any of its disks are busy, LUN utilization usually represents a pessimistic view. That is, a high LUN utilization value does not necessarily indicate that the LUN is approaching its maximum capacity.

- Queue length: The average number of requests within a certain time interval waiting to be served by the component, including the one in service. An (average) queue length of zero indicates an idle system. If three requests arrive at an empty service center at the same time, only one of them can be served immediately; the other two must wait in the queue, resulting in a queue length of three.

- Response time (ms): The average time, in milliseconds, required for one request to pass through a system component, including its waiting time. The higher the queue length for a component, the more requests are waiting in its queue, thus increasing the average response time of a single request. For a given workload, queue length and response time are directly proportional.

- Total bandwidth (MB/s): The average amount of data, in megabytes, passed through a system component per second. Larger requests usually result in a higher total bandwidth than smaller requests. Total bandwidth includes both read and write requests.

- Total throughput (I/O/s): The average number of requests that pass through a system component per second. Since smaller requests take less time to service, they usually result in a higher total throughput than larger requests do. Total throughput includes both read and write requests.
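These Basic properties are arithmetically linked: average queue length equals throughput multiplied by response time (Little's law), which is why queue length and response time rise together for a given workload. A minimal sketch with invented counter values:

```python
def basic_metrics(busy_s, interval_s, requests, bytes_moved, sum_response_s):
    # Derive the Basic properties from raw counters over one interval.
    throughput = requests / interval_s                 # I/O per second
    return {
        "utilization": busy_s / interval_s,            # fraction of interval busy
        "throughput_iops": throughput,
        "bandwidth_mb_s": bytes_moved / interval_s / 2**20,
        "response_ms": sum_response_s / requests * 1000,
        # Little's law: queue length = throughput x average response time,
        # so the two measures move together for a given workload.
        "queue_length": throughput * (sum_response_s / requests),
    }

# Invented counters for one 60-second interval on a busy LUN.
m = basic_metrics(busy_s=58, interval_s=60, requests=12_000,
                  bytes_moved=96 * 2**20, sum_response_s=1_200)
for k, v in m.items():
    print(f"{k}: {v:.2f}")
# Yields ~97 percent utilization, 200 IO/s, 100 ms response time, and a
# queue length of 20 -- the same saturation signature seen in the case study.
```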

Workload

- Maximum outstanding requests (storage processor): The largest number of commands on the storage processor at one time since statistics logging was enabled. This value measures the biggest burst of requests sent to this storage processor at a time.

- Maximum request count (LUN): The largest number of requests queued to this LUN at one time since statistics logging was enabled. This value also indicates the worst instantaneous response time due to the maximum number of waiting requests.

- Maximum requests in queue (disk): The maximum number of requests waiting to be serviced by this specific disk since statistics logging was enabled.

- Read/write throughput (I/O/s; disk, storage processor, LUN): The average number of reads or writes, respectively, passed through a component per second. Since smaller requests need less processing time, they usually result in a higher read or write throughput than larger requests.

- Read/write size (KB; disk, storage processor, LUN): The average read or write size, respectively, in kilobytes. This number indicates whether the workload is oriented more toward throughput (I/Os per second) or bandwidth (MB per second).

- Read/write bandwidth (MB/s; disk, storage processor, LUN): The average number of megabytes read or written, respectively, that passed through a component per second. Large requests usually result in a higher bandwidth than smaller ones.
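The read/write measures are related: bandwidth is throughput multiplied by average request size. A quick worked check, with assumed example values:

```python
# Relationship between the workload measures (assumed example values):
read_iops = 800          # read throughput, I/O per second
read_size_kb = 64        # average read size, KB

read_bandwidth_mb_s = read_iops * read_size_kb / 1024
print(read_bandwidth_mb_s)  # 50.0 MB/s

# The same 50 MB/s could also come from 6,400 small 8 KB reads per second,
# which is why small-request workloads look throughput-oriented and
# large-request workloads look bandwidth-oriented.
```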

Read cache

- Used prefetches (percent; LUN): An indication of prefetching efficiency. To improve read bandwidth, two consecutive sequential requests trigger prefetching, filling the read cache with data before it is requested. Thus sequential requests will receive the data from the read cache instead of from the disks, which results in a lower response time and higher throughput. As the percentage of sequential requests rises, so does the percentage of used prefetches.

- Hit ratio (LUN): The fraction of read requests served from both read and write caches vs. the number of read requests to this LUN. The higher the ratio, the better the read performance.

- Miss rate (LUN): The rate of read requests that could not be satisfied by the storage processor cache and therefore required a disk access.

- Hit rate (LUN): The number of read requests per second that were satisfied by either the write or read cache. A read cache hit occurs when recently accessed data is referenced while it is still in the cache.
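The hit and miss figures are simple functions of the same counters; a sketch with invented numbers for one sampling interval:

```python
# Invented read counters for one 60-second interval on a LUN.
interval_s = 60
reads_total = 9_000
reads_served_from_cache = 7_200   # satisfied by read or write cache

hit_ratio = reads_served_from_cache / reads_total                  # 0.80
hit_rate = reads_served_from_cache / interval_s                    # 120 hits/s
miss_rate = (reads_total - reads_served_from_cache) / interval_s   # 30 disk reads/s

print(f"hit ratio {hit_ratio:.0%}, hit rate {hit_rate:.0f}/s, miss rate {miss_rate:.0f}/s")
```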

Write cache

- Miss rate (LUN): The number of write requests per second that could not be satisfied by the write cache, since the data was not currently in the cache from a previous disk access.

- Hit ratio (LUN): The fraction of write requests served from the write cache vs. the total number of write requests to this LUN. The higher the ratio, the better the write performance.

- Hit rate (LUN): The number of write requests per second that were satisfied by the write cache because they had been referenced before and not yet flushed to the disks. Write cache hits occur when recently accessed data is referenced again while it is still in the write cache.

- Flush ratio (storage processor): The fraction of the number of flush operations performed vs. the number of write requests. A flush operation is a write of a portion of the cache to make room for incoming write data. Since the ratio is a measure of back-end activity vs. front-end activity, a lower number indicates better performance.

- Dirty pages percentage (percent; storage processor): The percentage of cache pages owned by this storage processor that have been modified since they were last read from, or written to, disk. In an optimal environment, the dirty pages percentage will not exceed the high watermark for a long period.

Block flush rates:

- Forced flush rate (LUN): The number of times per second the cache had to flush pages to disk to free space for incoming write requests. Forced flushes indicate that the incoming workload is higher than the back-end workload. A relatively high number over a long period of time suggests that you spread the load over more disks.

- High watermark flush on rate (storage processor): The number of times, since the last sample, that the number of modified pages in the write cache reached the high watermark. The higher the number, the greater the write workload coming from the host.

- Idle flush on rate (storage processor): The number of times, since the last sample, that the write cache started flushing dirty pages to disk due to a given idle period. Idle flushes indicate a low workload.

- Low watermark flush off rate (storage processor): The number of times, since the last sample, that the number of modified pages in the write cache reached the low watermark, at which point the storage processor stops flushing the cache. The higher the number, the greater the write workload coming from the host. This number should be close to the high watermark flush on number.

- Flush rate (storage processor): The number of times per second that the write cache performed a flush operation. A flush operation is a write of a portion of the cache for any reason; it includes forced flushes, flushes resulting from reaching the high watermark, and flushes from an idle state. This value indicates back-end workload.
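The watermark mechanism behind these counters can be sketched as a simple hysteresis loop. The thresholds and page counts below are invented for illustration and are not CLARiiON defaults:

```python
def step_write_cache(dirty_pages, incoming_writes, flushing,
                     high_wm=80, low_wm=40, flush_per_step=10):
    # One time step of a high/low watermark write cache: start flushing
    # when dirty pages reach the high watermark, stop once they fall
    # back to the low watermark.
    dirty_pages += incoming_writes
    if dirty_pages >= high_wm:
        flushing = True          # a "high watermark flush on" event
    if flushing:
        dirty_pages = max(dirty_pages - flush_per_step, 0)
        if dirty_pages <= low_wm:
            flushing = False     # a "low watermark flush off" event
    return dirty_pages, flushing

dirty, flushing = 0, False
for writes in [30, 30, 30, 5, 0, 0, 0, 0]:
    dirty, flushing = step_write_cache(dirty, writes, flushing)
    print(f"dirty={dirty:3d} flushing={flushing}")
# Each "flush on" is eventually paired with a "flush off", which is why
# the two counts should track each other on a healthy storage processor.
```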

Miscellaneous

- Average seek distance (disk): The average seek distance in gigabytes. Longer seek distances result in longer seek times and therefore higher response times. Defragmentation might help to reduce seek distances.

- Disk crossing percentage (LUN): The percentage of requests that require I/O to at least two disks vs. the total number of server requests. A disk crossing may involve more than two disks; that is, more than two stripe element crossings. Disk crossings relate to the LUN stripe element size. Generally, a low value is needed for good performance.

- Disk crossing rate (LUN): Indicates how many back-end requests per second used an average of at least two disks. Disk crossings are counted for read and write requests. Generally, a low value is needed for good performance.

- Average busy queue length (disk, storage processor): The average number of requests waiting for a busy system component to be serviced, including the request that is currently in service. Since the queue length is counted only when the component is busy, the value indicates the frequency variation (burst frequency) of incoming requests. The higher the value, the bigger the burst and the longer the average response time at this component.

- Service time (disk, storage processor, LUN): The time, in milliseconds, a request spent being serviced by a component. It does not include time waiting in a queue. Service time is mainly a property of the system component. However, larger I/Os take longer and therefore usually result in lower throughput (I/Os) but better bandwidth (MB/s).
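Response time differs from service time only by queueing delay, which is why a saturated LUN reports high response times even when each disk's service time is normal. A rough worked illustration, assuming a fixed per-request service time and ignoring overlap across disks:

```python
# Assumed value: a disk services one request in 8 ms on average.
service_ms = 8.0

# With no queue, response time equals service time. With N requests
# already waiting, a new arrival waits for all of them first.
for waiting in [0, 4, 19]:
    response_ms = service_ms * (waiting + 1)
    print(f"{waiting:2d} queued -> response ~{response_ms:.0f} ms")

# At a queue length near 20 (the case study's LUN), response time is
# roughly 20x the bare service time, even though the disks themselves
# are healthy.
```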

SnapView

- Reads from snapshot cache: The number of reads during this session that resulted in a read from the snapshot cache rather than from the source LUN.

- Reads from snapshot LUN: The number of read requests on SnapView during this snapshot session.

- Reads from snapshot source LUN: The number of reads during this snapshot session from the source LUN. It is calculated as the difference between the total reads in the session and the reads from cache.

- Writes to snapshot source LUN: The number of writes during this snapshot session to the source LUN (on the pertinent storage processor).

- Writes to snapshot cache: The number of writes to the source LUN during this session that triggered a copy-on-write operation (the first write to each snapshot cache chunk region).

- Writes larger than cache chunk size: The number of writes to the source LUN during this session that were larger than the chunk size (they resulted in multiple writes to the cache).

- Cache chunks used in snapshot session: The number of chunks that this session has used.
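The copy-on-write bookkeeping these counters describe can be sketched as follows; the chunk size, function, and counter names are invented for illustration:

```python
CHUNK_KB = 64  # invented chunk size for illustration

def record_source_write(offset_kb, size_kb, copied_chunks, counters):
    # Account for one write to the source LUN during a snapshot session.
    counters["writes_to_source"] += 1
    if size_kb > CHUNK_KB:
        counters["writes_larger_than_chunk"] += 1
    # Copy-on-write: the first write touching a chunk copies that chunk
    # to the snapshot cache before the new data lands on the source.
    first = offset_kb // CHUNK_KB
    last = (offset_kb + size_kb - 1) // CHUNK_KB
    for chunk in range(first, last + 1):
        if chunk not in copied_chunks:
            copied_chunks.add(chunk)
            counters["writes_to_snapshot_cache"] += 1

counters = {"writes_to_source": 0, "writes_larger_than_chunk": 0,
            "writes_to_snapshot_cache": 0}
copied = set()
for off, size in [(0, 8), (16, 8), (128, 200), (0, 8)]:
    record_source_write(off, size, copied, counters)
print(counters, "chunks used:", len(copied))
# Only the first write to each chunk region triggers a copy; repeated
# writes to an already-copied chunk add no snapshot cache traffic.
```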
