Advertorial: Windows Server Performance Monitoring
“The system’s really slow today!”
How often have you heard that? Finding the solution isn’t so easy. The obvious questions to ask are why is it running slowly and what can you do about it? An even better question is how can you tell that a server is beginning to reach its limits in time to do something about it before system performance and business productivity start to suffer? This is where Server Performance Monitoring comes in.
As with most things in life, computing resources are finite and there is a limit to how much can be done in any given period of time by any system. The key to understanding performance issues on servers is to know:
- What are the key resources that might be the limit on the overall performance of the system?
- What should be measured in order to assess their utilisation?
- What should be done if there are signs of overload?
Some things are easier to rectify than others, and knowing when and how to upgrade or replace systems or components is important to ensure that any investments that are made will actually solve the problem that is being experienced.
Unfortunately, there is no single solution that will address all performance issues. It depends on several factors including the application mix, numbers of users, the hardware itself and external factors such as network topology. Of the many parts of a server, there are four key elements that in practice tend to influence the performance of the system as a whole:
- Processor throughput
- Memory capacity
- Disk I/O throughput
- Network I/O throughput.
Each of these has its limits, and the overall performance of the server will be determined by whichever of them is exhausted first.
So, enough of the generalities. Where should you look and for what should you be looking in each case?
Processor
This one’s easy, right? Open up Performance Monitor and look at the CPU utilisation. Simple! Well, in reality, prolonged periods of high utilisation are certainly a bad thing, but periods of 100 percent CPU utilisation are quite normal and simply mean that the applications are making full use of the capacity of the system. Monitoring the CPU utilisation is useful, but even at 100 percent the system can still be responsive as long as the applications are well written. In any multi-tasking operating system, applications get allocations of processor time in turn, and as long as the system can switch between them quickly enough it will still give acceptable performance.
When performance really starts to be impacted is if applications are queuing for processor time and being held up by one another. Fortunately modern versions of Windows are much better at scheduling applications but there are limits to what can be achieved. For that reason one of the most useful parameters to monitor is the processor queue length. If the operating system is struggling to balance the demands of multiple applications the applications will be queued for access to the processor and the queue length will go up. As a general rule, if the processor queue length exceeds twice the number of CPUs and/or processor cores in the system, then performance will drop off dramatically.
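The rule of thumb above can be sketched in a few lines of Python. This is only an illustration: the function name, the sample readings and the use of an average over the samples are assumptions made for the example, and the readings would in practice come from a counter such as `\System\Processor Queue Length` in Performance Monitor.

```python
from statistics import mean


def cpu_queue_overloaded(queue_samples, logical_processors, factor=2.0):
    """Return True when the average queue length exceeds factor x processors.

    factor=2.0 reflects the "twice the number of CPUs/cores" rule of thumb;
    averaging the samples is an assumption made for this sketch.
    """
    if logical_processors < 1:
        raise ValueError("logical_processors must be at least 1")
    if not queue_samples:
        return False
    return mean(queue_samples) > factor * logical_processors


# Hypothetical readings taken a few seconds apart on a 4-core server.
samples = [3, 5, 12, 14, 11, 13]
print(cpu_queue_overloaded(samples, logical_processors=4))
```

Averaging over several readings, rather than reacting to a single one, keeps a momentary burst of queued work from being reported as a processor bottleneck.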
If it is shown that the processor capacity in a system is the underlying problem, the options are fairly straightforward:
- Add more processors if the system can accommodate them.
- Move applications onto different servers.
- Upgrade the server as a whole to one with higher performance processor(s).
One thing to note is that the difference between having one processor and more than one can be very significant. In a multi-tasking operating system, such as all modern versions of Windows, the OS will schedule tasks for access to the processor and if there is only one processor in the system a single application can hog the processor despite the best efforts of the OS to schedule other tasks. With two or more processors the OS has more flexibility in allocating tasks so even if the utilization is very high the system can be more responsive.
Memory
Modern operating systems all use virtual memory to manage applications and data being processed by the system. This is a good thing in that it prevents the system from stopping if there is insufficient physical memory available – unfortunately the fact that virtual memory is based on using disk space as a substitute for physical memory means that if virtual memory is being used to a significant extent then system performance will be impacted dramatically.
Again, it might appear simple to know when a shortage of memory is proving to be a problem – look at the memory allocated and compare it to the physical memory in the machine; if it is higher, then add more memory. However, as with CPU utilization, it is a little more complicated. For example, most versions of Microsoft Exchange will allocate as much memory as they can to maximize throughput and responsiveness (through the store.exe process). If another application requires more memory then store.exe is supposed to release memory to avoid the need to swap memory to disk. Hence in most servers running Exchange it will almost always appear that all physical memory is allocated all the time.
Note: Adding more memory in this situation will make little or no difference to the performance of the server.
Significant over-allocation of memory may be an indication of a problem, but the key question is how much of that memory is in active use. If an application has requested a memory allocation but is almost completely idle, then its memory can be swapped out to disk and will rarely have to be recalled, so performance will not be overly affected. Therefore, the other key parameter to monitor in relation to memory is the page fault rate: the frequency with which the operating system has to fetch data from disk because it is not in physical memory, usually moving other data out to disk to make room for it. Every time this happens the application concerned has to wait until the swap has completed before it can carry on with its next task, so a sustained high rate makes performance much slower.
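As a rough sketch of this reasoning, the Python below separates a server that merely looks over-allocated from one that is actively paging. The function name, the limit of 100 pages per second and the "half of all samples" test are assumptions chosen for illustration, not recommended values; on Windows the paging figures would typically be read from a counter such as `\Memory\Pages/sec`.

```python
def classify_memory(committed_mb, physical_mb, pages_per_sec,
                    paging_limit=100.0, sustained_fraction=0.5):
    """Classify memory health from allocation and paging-rate samples.

    paging_limit and sustained_fraction are illustrative assumptions;
    suitable values depend on the disks and the workload.
    """
    if pages_per_sec:
        # Fraction of samples in which paging exceeded the limit.
        busy = sum(1 for rate in pages_per_sec if rate > paging_limit)
        busy_fraction = busy / len(pages_per_sec)
    else:
        busy_fraction = 0.0

    if busy_fraction >= sustained_fraction:
        return "paging heavily"
    if committed_mb > physical_mb:
        return "over-allocated but mostly idle"
    return "ok"


# Hypothetical server with 4GB of RAM and 6GB allocated.
print(classify_memory(6000, 4096, [5, 12, 3, 8]))
print(classify_memory(6000, 4096, [250, 400, 90, 310]))
```

The first call describes the Exchange-style case discussed earlier, where allocation is high but little paging takes place; only the second suggests that more memory would help.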
The solution to a memory problem is simple, at least up to a point. Adding more memory is often the simplest and cheapest way to boost the performance of a system, but 32-bit versions of Windows work with a 4GB address space, of which (by default) half is reserved for the system, leaving just 2GB for each application. Adding more memory beyond 4GB will therefore generally not bring about any further improvement, so alternative approaches such as splitting application load across multiple servers must be employed. Of course, 64-bit versions of Windows don’t have this problem, being able to address up to 16TB directly, although the amount of physical memory actually supported depends on the edition.
Disk I/O
Hard drives are one of the slowest components in a server as they rely on the physical movement of heads over platters to access the positions at which data is to be read or written. Like most other components of a multi-tasking system, different applications can access disks simultaneously and it is up to the operating system to manage the requests for access to disk resources. As with other shared resources, the operating system maintains a queue for these requests and handles them in sequence, routing the data between the disk controller and the applications requesting disk access.
The raw throughput of a disk drive or array and the associated disk controller(s) clearly have a significant effect on performance but these do not change as the system is used, so it is more useful to look at the amount of time that the disks spend servicing requests, reflected in the percentage disk time. If this value is consistently high it will usually indicate that the disk system is working flat out to process all the data transfers that are being requested.
Another useful indicator that disk performance is limiting overall system performance is the length of the read and write queues. As with processor queues, if the counters start to show values significantly in excess of the number of devices (in this case, drives) in the system, the disks are not able to process requests as fast as the applications are making them, and applications are therefore likely to be held up waiting for data to be delivered from the disks.
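The two disk indicators can be combined in a short Python sketch. The function name, the 90 percent busy limit and the sample figures are assumptions for illustration, and comparing the average queue directly with the drive count is a simplification of "significantly in excess". The inputs correspond to counters such as `\PhysicalDisk(_Total)\% Disk Time` and `\PhysicalDisk(_Total)\Avg. Disk Queue Length`.

```python
from statistics import mean


def disk_report(disk_time_percent, queue_lengths, drive_count,
                busy_limit=90.0):
    """Summarise disk load from % disk time and queue-length samples.

    busy_limit is an illustrative assumption, not a vendor recommendation.
    """
    if drive_count < 1:
        raise ValueError("drive_count must be at least 1")
    avg_busy = mean(disk_time_percent) if disk_time_percent else 0.0
    avg_queue = mean(queue_lengths) if queue_lengths else 0.0
    return {
        "avg_disk_time": avg_busy,
        "avg_queue": avg_queue,
        # Disks spend nearly all of their time servicing requests.
        "busy": avg_busy >= busy_limit,
        # More requests are waiting than there are drives to serve them.
        "queued": avg_queue > drive_count,
    }


# Hypothetical samples from a server with a four-drive array.
report = disk_report([95, 98, 92, 97], [6, 9, 7, 8], drive_count=4)
print(report)
```

A server that is busy but not queued is working hard yet keeping up; one that is both busy and queued is the case where applications are likely to be waiting on the disks.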
The solution to disk bottlenecks will depend on the underlying problem.
- If the system is running multiple different tasks, each of which is making heavy use of the disk subsystem, then there may be little that can be done other than installing faster disks, disk controllers or array configurations. The right answer to the question of which one to change will vary from one case to another. The classic answer is that the three most important things about disk performance are spindles, spindles and spindles. By spreading the activity across multiple physical devices it is possible to increase the overall throughput substantially and the right use of RAID configurations (preferably in hardware rather than software) can be very effective.
- Another low-cost approach is to spread applications across more than one server, if there are other machines available, making effective use of the aggregate performance across all the available hardware.
- The use of network disk devices (SAN or NAS) is a further, albeit significantly more expensive, solution, removing the disk I/O from the server to a device optimized for high throughput applications.
- However, hardware may not always be the answer – a disk bottleneck may be the result of poor design so increasing the throughput of the disk subsystem may not be that effective. The most common case of this nature is databases that are poorly indexed. Performing searches on database tables that are not well indexed will place a very heavy load on the disks and the impact of correcting such a problem by adding the right indices is dramatic – disk I/O can be reduced dramatically and system performance improved significantly.
Network
Identifying network bottlenecks is generally fairly straightforward – any network link has a finite amount of bandwidth, and the higher the proportion of it that is used up, the more applications will be slowed down in communicating with other devices on the network. The actual value of utilisation that indicates a serious problem will depend on the network topology – switched networks can sustain significantly higher utilisation levels than shared networks, so whereas anything over about 30-40 percent in a shared network will indicate a problem, that would not be the case in a switched environment.
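The utilisation calculation itself is simple arithmetic, sketched below in Python. The function names are hypothetical; the 30 percent shared-network limit is taken from the lower end of the range above, while the 80 percent switched-network limit is purely an assumption, since no figure is given for switched networks. On Windows the inputs would typically come from counters such as `\Network Interface(*)\Bytes Total/sec` and `\Network Interface(*)\Current Bandwidth`.

```python
def link_utilisation(bytes_per_sec, link_bits_per_sec):
    """Return link utilisation as a percentage of the available bandwidth."""
    if link_bits_per_sec <= 0:
        raise ValueError("link_bits_per_sec must be positive")
    # Traffic is counted in bytes, link speed is quoted in bits.
    return 100.0 * bytes_per_sec * 8 / link_bits_per_sec


def network_overloaded(bytes_per_sec, link_bits_per_sec, shared,
                       shared_limit=30.0, switched_limit=80.0):
    """Compare utilisation with a limit that depends on the topology.

    switched_limit is an assumption made for this sketch.
    """
    limit = shared_limit if shared else switched_limit
    return link_utilisation(bytes_per_sec, link_bits_per_sec) > limit


# Hypothetical: 5 MB/s of traffic on a 100 Mbit/s link.
print(link_utilisation(5_000_000, 100_000_000))
print(network_overloaded(5_000_000, 100_000_000, shared=True))
print(network_overloaded(5_000_000, 100_000_000, shared=False))
```

The same traffic level is flagged on a shared segment but not on a switched one, which is why the topology has to be known before a utilisation figure can be interpreted.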
Modern PCs and servers have sufficient performance that they can saturate a network link quite quickly if they are performing sustained network activity such as large file copies. For each individual PC, the impact is limited to that PC but of course a server has to service requests from all the other machines on a network, so it is a point of concentration and if there are several client machines making heavy network I/O demands on the server, these must all be channeled through the server’s NIC and the switch port into which it is connected.
There are a number of ways to overcome network utilisation issues including installing higher throughput components, such as Gigabit Ethernet devices; network reconfiguration, to use multiple network segments to divide up the traffic across multiple network interfaces; and teaming of network interfaces, to utilise more than one physical network port on a single subnet. Which solution is practical and appropriate will require a detailed understanding of the current network topology and application environment.
Summary
In conclusion, there are a number of parameters that can affect the overall performance of a system and understanding performance issues and how to solve them depends on analysing the underlying root cause and identifying the most effective solution to overcome that. While there are a great number of performance metrics that can be used, in reality a relatively small number of them can be used to get a view of how a machine is performing and which of its components is becoming overloaded.
In most systems there will be spikes of high utilisation on all of the parameters but performance will only be impacted noticeably when one starts to show signs of consistent overloading. The use of effective tools that monitor performance metrics over a period of time and can show when these signs are developing is a valuable addition to the network monitoring process.
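The distinction between a spike and consistent overloading can be made concrete with a small Python sketch. It treats a metric as overloaded only when a run of consecutive samples all exceed a limit; the function name, the limit of 90 and the run length of 3 are assumptions chosen for the example, not values from any particular monitoring tool.

```python
def sustained_overload(samples, limit, window=5):
    """Return True if `window` consecutive samples all exceed `limit`."""
    if window < 1:
        raise ValueError("window must be at least 1")
    run = 0
    for value in samples:
        # Extend the current run, or reset it when a sample is within limits.
        run = run + 1 if value > limit else 0
        if run >= window:
            return True
    return False


# Hypothetical CPU utilisation readings (percent), one per minute.
spiky = [20, 100, 30, 100, 25, 100, 40, 35]
overloaded = [60, 95, 97, 99, 96, 98, 70]

print(sustained_overload(spiky, limit=90, window=3))
print(sustained_overload(overloaded, limit=90, window=3))
```

The first series touches 100 percent three times but never for long, so it is not flagged; the second stays above the limit for five readings in a row and is.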
Further Reading
Performance Monitoring for Windows Server 2003 and 2000
Simple guide to performance monitoring under Windows.
Windows 2000 Performance Guide
Refers to Windows 2000 but has in-depth discussion of performance principles and the detail of how Windows performance can be analysed, most of which is still relevant to current versions of Windows.
Windows 2000 Performance Tuning
TechNet article describing performance monitoring and optimization techniques, again focused on Windows 2000 but still generally applicable.
Windows Server 2003 Performance Counters Reference
Comprehensive, albeit a little hard to navigate, listing of key performance counters and what they mean.
This article was submitted by GFI Software, a leading software developer that provides a single source for network administrators to address their network security, content security and messaging needs. With award-winning technology, an aggressive pricing strategy and a strong focus on small-to-medium sized businesses, GFI is able to satisfy the need for business continuity and productivity encountered by organizations on a global scale. GFI has offices in the US, Malta, England, Scotland, Austria, Romania, Hong Kong and Australia which support more than 200,000 installations worldwide. GFI is a channel-focused company with over 10,000 partners worldwide. GFI is a Microsoft Gold Certified Partner. More information about GFI can be found at http://www.gfi.com.

