Performance Counter Disambiguation

RISC Networks assessments collect data from a variety of sources within your IT environment. In addition to consuming this data throughout our platform to provide cutting-edge analytics, we make aggregated performance data available to you directly in the Performance and Trending module. This document explains where some of that data comes from and clarifies how to interpret selected counters you will encounter in our platform.

Disambiguation / Clarification on Selected Counters

This section clarifies some of the subtleties of the performance counters we report.

CPU Utilization

The percent and absolute CPU utilization of a given device may be collected directly from the device's operating system (through WMI, SSH, or SNMP), and for virtual machines it may also be reported by VMware (see Data Sources, below). When both metrics are present, you may notice a difference between the two. VMware's view of CPU utilization includes the operational overhead of virtualization, so it is often higher than the value reported by the operating system. Conversely, at times of high host resource consumption, VMware may restrict a guest's CPU access in a way that is not visible to the guest operating system, resulting in VMware reporting a lower CPU utilization than the operating system does.

Memory Usage

Please note that the memory utilization metric we collect from VMware is different from the one we collect directly from the OS (see Data Sources, below). Operating systems report committed memory: the amount of memory currently reserved by processes, and thus unavailable to the OS. VMware, in contrast, reports the amount of memory actively being used by guest processes. Because processes may not be actively using all of the memory they have reserved, the two metrics are not exactly equivalent. However, both metrics play similar roles in our analytics.

Disk Read and Write Statistics

Counters relating to disk IO are generally estimated by sampling the absolute counters provided by the operating system twice, separated by a brief interval. For SSH and some 64-bit SNMP implementations, we will use a longer sampling window to improve convergence (see Sampled vs Aggregate Data, below).
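
The two-sample estimate described above can be sketched as follows. This is a minimal illustration, not the platform's implementation: the function name sampled_rate, the 5-second interval, and the counter readings are all hypothetical.

```python
def sampled_rate(first_count: int, second_count: int, interval_seconds: float) -> float:
    """Estimate events per second from two readings of an absolute counter."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    # The counter only ever increases, so the difference between the two
    # readings is the number of events that occurred during the interval.
    return (second_count - first_count) / interval_seconds


# Hypothetical readings of a cumulative "bytes read" counter, taken 5 seconds apart.
bytes_per_second = sampled_rate(1_000_000, 1_250_000, 5.0)
print(bytes_per_second)  # 50000.0
```

The same arithmetic applies to any cumulative counter, including the network byte and packet counters discussed in the next section.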

Network IO

Data transmit and receive rates (both in packets and bytes) are estimated by sampling the absolute counters provided by the operating system twice, separated by a brief interval. For SSH and some 64-bit SNMP implementations, we will use a longer sampling window to improve convergence (see Sampled vs Aggregate Data, below).

Packets dropped and packets with errors are reported as an average number seen during the sampling window, rather than as a rate. Because 32-bit counters do not provide the same level of coverage for detecting interface errors and discards as 64-bit counters, we recommend against comparing interface errors and discards between different devices. An increase or decrease in this statistic for a given device is noteworthy, and any value above 0 for these counters may indicate a problem.

CPU Active/Running (VMware only)

VMware reports these counters as percentages, but please be aware that each value is the sum of the percent utilization of every logical core. This means that the total can exceed 100% on systems with multiple logical CPU cores.
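
If you want to compare one of these summed values against a 0-100% scale, one option is to divide by the number of logical cores. The sketch below assumes you know the core count for the virtual machine; the function name normalize_cpu_percent and the sample figures are hypothetical, and this is not a calculation the platform is stated to perform.

```python
def normalize_cpu_percent(summed_percent: float, logical_cores: int) -> float:
    """Convert a per-core sum of utilization percentages to a 0-100% average."""
    if logical_cores < 1:
        raise ValueError("logical_cores must be at least 1")
    return summed_percent / logical_cores


# Hypothetical example: a 4-core VM whose cores sum to 260% utilization.
print(normalize_cpu_percent(260.0, 4))  # 65.0
```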

Data Sources

We collect data about the state of the hosts and network devices in your system via a variety of methods, including WMI, SSH, SNMP, VMware's API, NetFlow, and database metadata. We may collect against the same device using multiple collection methods. To distinguish between data about a virtualized server gathered using VMware's API and data collected directly from that server's operating system (through WMI, SSH, or SNMP), we may refer separately to the server's VMware-sourced data and its data collected directly from the device's OS. While the metrics collected are typically ways of measuring the same property of the system in question, they can differ from each other due to differences in data availability, collection interval, or sampling method.

For the purposes of cloud cost estimation, we prefer data collected from VMware if both that and directly-collected data were available throughout the assessment's collection period.

Sampled vs. Aggregate Data

Most of our performance data is sampled: we poll each device once per collection interval (not more than once every five minutes). Many metrics and collection types report the instantaneous value of relevant performance metrics at the time of collection (e.g. a server with 25% memory usage at a given point in time); we report the average value of these metrics observed each hour of collection. Because we do not collect continuously, observed values are approximate; we recommend collecting for at least a month to reduce the impact of sampling error.
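
To make the hourly averaging concrete, here is a minimal sketch that groups instantaneous samples by the hour in which they were taken and averages each group. The function name hourly_averages and the sample values are hypothetical; the platform's actual aggregation may differ in detail.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def hourly_averages(samples):
    """Average (timestamp, value) samples within each clock hour."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        # Truncate the timestamp to the start of its hour to form the bucket key.
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(value)
    return {hour: mean(values) for hour, values in sorted(buckets.items())}


# Hypothetical memory-usage samples (percent), polled every five minutes.
samples = [
    (datetime(2024, 1, 1, 9, 0), 20.0),
    (datetime(2024, 1, 1, 9, 5), 30.0),
    (datetime(2024, 1, 1, 9, 10), 25.0),
    (datetime(2024, 1, 1, 10, 0), 40.0),
]
averages = hourly_averages(samples)
for hour, average in averages.items():
    print(hour.isoformat(), average)
```

With only a handful of samples per hour, a short spike between polls can be missed entirely, which is why a longer collection period gives more representative averages.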

Some metrics, however, are recorded by operating systems as a count of events since system restart (or counter rollover). When the counter is stored as a 32-bit integer, we measure the rate of events by polling the absolute counter twice within a short period, and extrapolate from this sampled rate to determine the average overall rate of events. With enough data points, these sampled values should converge towards the "true" average. When the operating system provides 64-bit counters (SSH and some SNMP implementations), we are able to calculate the number of events between successive polls, effectively collecting events from the whole interval rather than extrapolating an approximate rate from a sample.
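
The difference between the two approaches can be sketched as follows. This is an illustration under stated assumptions rather than the platform's code: the function names, the 10-second sampling window, the 300-second polling interval, and all counter readings are hypothetical. The modulo arithmetic is a common way to tolerate a single counter rollover between two readings.

```python
def counter_delta(previous: int, current: int, bits: int) -> int:
    """Events between two readings of a counter, allowing for one rollover.

    Note: this cannot distinguish a rollover from a counter that was reset
    to zero by a system restart.
    """
    return (current - previous) % (1 << bits)


def extrapolated_events(previous: int, current: int, sample_seconds: float,
                        interval_seconds: float, bits: int = 32) -> float:
    """Estimate events over a full interval from a brief two-reading sample."""
    rate = counter_delta(previous, current, bits) / sample_seconds
    return rate * interval_seconds


# 32-bit case: the counter rolls over between two readings taken 10 seconds
# apart, and the sampled rate is extrapolated to a 300-second interval.
estimate = extrapolated_events(4_294_967_000, 704, sample_seconds=10, interval_seconds=300)
print(estimate)  # 30000.0

# 64-bit case: readings from two successive polls give the event count
# for the whole interval directly, with no extrapolation.
exact = counter_delta(5_000_000, 5_027_500, bits=64)
print(exact)  # 27500
```

The extrapolated figure is only as representative as the brief sample it is based on, whereas the 64-bit delta counts every event that occurred between the two polls.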