Monitoring Windows Server Memory Pressure in Prometheus

A common alert in Prometheus monitoring setups for Linux hosts is one for high memory pressure, which is detected by the combination of 1) a low amount of available RAM and 2) a high rate of major page faults.

For example, GitLab is kind and gracious enough to make its alerting rules available to the public, and its alert looks like this:

  - alert: HighMemoryPressure
    expr: instance:node_memory_available:ratio * 100 < 5 and rate(node_vmstat_pgmajfault[1m]) > 1000
    for: 15m
    labels:
      severity: s4
      alert_type: cause
    annotations:
      description: The node is under heavy memory pressure.  The available memory is under 5% and
        there is a high rate of major page faults.
      runbook: docs/monitoring/node_memory_alerts.md
      value: 'Available memory {{ $value | printf "%.2f" }}%'
      title: Node is under heavy memory pressure

This is a great way to reduce noise and raise the quality of the alert, because requiring both conditions makes it more likely that there is an actual problem when the alert fires. Your server might be happily chugging along at 5% memory available, but if it also starts paging heavily, something is probably wrong.

Unfortunately, Windows does not distinguish between major and minor page faults in its performance counters, which are the source windows_exporter collects from. Consequently, you have to do a little extra work to determine how often major page faults are occurring.

To estimate the rate of major faults, you have to combine the metrics windows_memory_swap_page_operations_total and windows_memory_swap_page_reads_total. The operations counter (backed by the Pages/sec performance counter) increases for every page read from or written to disk, so it rises whenever there is paging activity of any kind. The reads counter (Page Reads/sec) increases only when data has to be read from disk back into memory, which is what servicing a major page fault requires.

With this in mind, I devised the following query that we are now using in our production monitoring for Windows.

instance:windows_memory_available:ratio * 100 < 5
and (rate(windows_memory_swap_page_operations_total[2m]) > 1000)
and (
    (rate(windows_memory_swap_page_reads_total[2m]) / rate(windows_memory_swap_page_operations_total[2m])) > 0.5
)

For this alert to fire, three things need to happen:

1. Windows memory available has to be less than 5%.
2. There must be a high rate of page operations (more than 1,000 per second).
3. The ratio of page reads to total page operations must be greater than 0.5, that is, more than 50%.

All of these thresholds are arbitrary and can be adjusted as fits your needs. These thresholds just happened to fit well for us.
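
To mirror the GitLab rule shown earlier, the query can be wrapped in a rules file. The following is a minimal sketch rather than a drop-in file. The group name, alert name, severity label, and annotation text are placeholders, and the `for: 15m` is simply borrowed from the GitLab rule. The recording rule is an assumption about how instance:windows_memory_available:ratio might be defined. It uses windows_memory_available_bytes and windows_cs_physical_memory_bytes, and metric names differ between windows_exporter versions, so check what your exporter actually exposes.

```yaml
groups:
  - name: windows-memory  # hypothetical group name
    rules:
      # Assumed definition of the ratio used in the alert expression.
      # Metric names vary between windows_exporter versions; adjust as needed.
      - record: instance:windows_memory_available:ratio
        expr: windows_memory_available_bytes / windows_cs_physical_memory_bytes

      - alert: WindowsHighMemoryPressure  # hypothetical alert name
        expr: |
          instance:windows_memory_available:ratio * 100 < 5
          and (rate(windows_memory_swap_page_operations_total[2m]) > 1000)
          and (
            (rate(windows_memory_swap_page_reads_total[2m]) / rate(windows_memory_swap_page_operations_total[2m])) > 0.5
          )
        for: 15m  # borrowed from the GitLab rule; not part of the original query
        labels:
          severity: warning  # placeholder severity
        annotations:
          title: Windows host is under heavy memory pressure
          description: Available memory is under 5% and most page operations are reads from disk.
          value: 'Available memory {{ $value | printf "%.2f" }}%'
```

Because `and` keeps the value of its left-hand side, `$value` in the annotation is the available-memory percentage, just as in the Linux rule.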

For more information on the distinction between minor and major page faults, the Wikipedia article on Page Faults explains it better than I ever could.