Performance monitoring is the process of collecting and analyzing the server data to compare the server statistics against the expected values. It helps the performance tester to have the health check of the server during the test. By monitoring the servers during the test, one can identify the server behavior for load condition and take steps to change the server behavior by adopting software or hardware performance tuning activities.
Each performance counter helps in identifying a specific value about the server performance. For example, % CPU Utilization is a performance counter which helps in identifying the utilization level of the CPU.
In a nutshell, server monitoring should provide information on four parameters of any system: Latency, Throughput, Utilization and Efficiency, which helps in answering the following questions.
- Is your server available?
- How busy is your CPU?
- Is there enough Primary Memory (RAM)?
- Is the disk fast enough?
- Is there any other hardware issues?
- Is the hardware issue result of software malfunctioning?
Always start with few counters and once you notice a specific problem, start adding few more counters related to the symptom. Start monitoring the performance of the major resources like CPU, Memory, Disk or Network. This section provides you the details about key counters from each of above mentioned 4 areas which are very important for a Performance Tester to know.
Processor Bottlenecks
The bottlenecks related to processor (CPU) are comparatively easy to identify. The important performance counters that helps in identifying the processor bottleneck includes
% Processor Utilization (Processor_Total: % Processor Time) – This counter helps in knowing how busy the system is. It indicates the processor activity. It is the average percentage of elapsed time that the processor spends to execute a productive (non-idle) thread. A consistent level of more than 80% utilization (in case of single CPU machines) indicates that there is not enough CPU capacity. It is worth further investigation using other processor counters.
% User time (Processor_Total: % User Time) – This refers to the processor’s time spent in handling the application related processes. A high percentage indicates that the application is consuming high CPU. The process level counters needs to be monitored to understand which user process consumes more CPU.
% Privileged time (Processor_Total: % Privilege Time) – This refers to the processor’s time spent in handling the kernel mode processes. A high value indicates that the processor is too busy in handling other operating system related activities. It needs immediate attention from the system administrator to check the system configuration or service.
% I/O Wait (%wio - in case of UNIX platforms) – This refers to the percentage wait for completion of I/O activity. It is a good indication to confirm whether the threads are waiting for the I/O completion.
Processor Queue Length (System: Processor Queue Length) – This counter helps in identifying how many threads are waiting in queue for execution. A consistent queue length of more than 2 indicates bottleneck and it is worth investigation. Generally if the queue length is more than the number of CPUs available in the system, then it might reduce the system performance. A high value of % usr time coupled with high processor queue length indicates the processor bottleneck.
Other counters of interest:
Other counters like Processor: Interrupts per second, System: Context Switches per second can be used in case of any specific issues. Interrupts per second refers to the number of interrupts that the hardware devices sends to the processor. A consistent value of above 1000 in Interrupts per second indicates hardware failure or driver configuration issues. Context Switches refers to the switching of the processor from a lower priority thread to a high priority thread. A consistent value of above 15000 per second per processor indicates the presence of too many threads of same priority and possibility of having blocked threads.
Windows operating system comes with a performance monitoring tool called Perfmon. This monitoring tool can be used to collect the server statistics during the performance test run. Normally, most of the performance testing tools have its own monitors to monitor the system resources of the system under test. In this case, it becomes easy to compare the system load related metrics with the system resource utilization metrics to arrive at a conclusion. Infact, any performance testing tool that monitors windows machine internally talks to perfmon to collect the system resource utilization details.
There are lots of licensed tools available in the market which is used for post production monitoring. These tools are used to capture the web server traffic details and provide online traffic details. A very popular tool of this category is WebTrends. Many organizations uses this tool to get to the traffic trends of an application running in production environment. There are other tools like HP OpenView tools which run in production servers and monitor the server resource utilization levels. It provides alarming mechanism to indicate the heavy usage and provides easy bottleneck isolation capabilities. But due to the cost involved with these kinds of tools, most of small organizations don’t opt for them. But post production monitoring data would be of great use for designing realistic performance tests.
· Allows you to analyze and isolate the performance problem.
· Understand the resource utilization of the server and make best use of them.
· Plan for Capacity Planning activity based on the resource utilization level.
· Provides the server performance details offline (by creating a alert of sending a mail/message when resource utilization reaches the threshold value).