<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Red Hunting Cap]]></title><description><![CDATA[Catcher in the .py]]></description><link>https://wbhegedus.me/</link><image><url>https://wbhegedus.me/favicon.png</url><title>Red Hunting Cap</title><link>https://wbhegedus.me/</link></image><generator>Ghost 4.48</generator><lastBuildDate>Thu, 09 Apr 2026 12:42:10 GMT</lastBuildDate><atom:link href="https://wbhegedus.me/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Alerting on Missing Data in Prometheus]]></title><description><![CDATA[<p>Alerting on missing data in Prometheus is commonly handled by the <code>absent</code> function, but that&apos;s really only useful when you know the labels you expect to be there ahead of time. How can you dynamically alert on missing data then?</p><p>By using the <code>unless</code> operator, you can return</p>]]></description><link>https://wbhegedus.me/alerting-on-missing-data-in-prometheus/</link><guid isPermaLink="false">63864c582a9cc502e639c09b</guid><dc:creator><![CDATA[Will Hegedus]]></dc:creator><pubDate>Tue, 29 Nov 2022 18:23:00 GMT</pubDate><content:encoded><![CDATA[<p>Alerting on missing data in Prometheus is commonly handled by the <code>absent</code> function, but that&apos;s really only useful when you know the labels you expect to be there ahead of time. How can you dynamically alert on missing data then?</p><p>By using the <code>unless</code> operator, you can return a set of labels only when a different matching metric does not exist. For example,</p><pre><code class="language-promql">  group without (instance) (up{job=&quot;blackbox_http_2xx&quot;})
unless
  count without (instance) (probe_http_status_code{job=&quot;blackbox_http_2xx&quot;} == 200)</code></pre><p>Only if there are no HTTP 200s for the label set that results from the <code>group</code> query will the alert fire. In my environment, it fires with a label set similar to this:</p><pre><code class="language-promql">{job=&quot;blackbox_http_2xx&quot;, environment=&quot;production&quot;, cluster=&quot;clusterA&quot;, service=&quot;website&quot;}</code></pre><p>Having these extra labels can be extremely useful in your Alertmanager routing configs and any templating you do, which is why I strive to keep as many labels as possible when designing alerting rules.</p>]]></content:encoded></item><item><title><![CDATA[Terminating Prometheus Exporter TLS with Vector]]></title><description><![CDATA[<p>I&apos;ve recently been digging into using <a href="https://vector.dev/">Vector</a> more for collecting telemetry from systems since it can pull from a variety of sources (logs and metrics are what I&apos;m most concerned about) and spit them out to a variety of &quot;sinks&quot;.</p><p>One of the use</p>]]></description><link>https://wbhegedus.me/terminate-tls-with-vector/</link><guid isPermaLink="false">61c347ea6393191a38fe5543</guid><dc:creator><![CDATA[Will Hegedus]]></dc:creator><pubDate>Wed, 22 Dec 2021 15:51:14 GMT</pubDate><content:encoded><![CDATA[<p>I&apos;ve recently been digging into using <a href="https://vector.dev/">Vector</a> more for collecting telemetry from systems since it can pull from a variety of sources (logs and metrics are what I&apos;m most concerned about) and spit them out to a variety of &quot;sinks&quot;.</p><p>One of the use cases I have for Vector is to have it aggregate multiple Prometheus exporters on a single host and expose them all under a single port/endpoint. Previously, I used a reverse proxy for this, which had its uses but was also overkill. 
However, it did provide me with the benefit of putting my exporters behind HTTPS, which I did not know was also possible with Vector.</p><p>Vector&apos;s docs for the <a href="https://vector.dev/docs/reference/configuration/sinks/prometheus_exporter/">Prometheus Exporter sink</a> do not mention it at the time of this writing, but Vector actually does support listening for SSL connections on the Prometheus exporter it exposes. To do so, simply add a <code>tls</code> object to your sink&apos;s configuration. Example below:</p><!--kg-card-begin: html--><script src="https://gist.github.com/wbh1/47016f1a04a46ed97c437479329700cd.js"></script><!--kg-card-end: html-->]]></content:encoded></item><item><title><![CDATA[Using Linode Object Storage with Thanos]]></title><description><![CDATA[How to send your Prometheus TSDB blocks to Linode for long term storage.]]></description><link>https://wbhegedus.me/using-linode-object-storage-with-thanos/</link><guid isPermaLink="false">60ad5c466393191a38fe54c3</guid><category><![CDATA[thanos]]></category><dc:creator><![CDATA[Will Hegedus]]></dc:creator><pubDate>Tue, 25 May 2021 20:47:45 GMT</pubDate><content:encoded><![CDATA[<p><a href="https://thanos.io/">Thanos</a> is an amazing tool for extending and scaling the functionality of Prometheus. One of the core features it provides is the ability to back up your TSDB data into an object storage bucket to be queried later on by the <a href="https://thanos.io/tip/components/store.md/">Thanos Store</a> component.</p><p>Thanos supports a variety of storage backends, including S3. 
Linode offers an <a href="https://www.linode.com/products/object-storage/">object storage service</a> with an S3-compatible API, and it&apos;s easy to get started with it and begin sending data from Thanos/Prometheus into it.</p><p>This presumes that you already set up a <a href="https://thanos.io/tip/components/sidecar.md/">Thanos Sidecar</a> to run alongside your Prometheus deployment and it has access to the directory in which the TSDB writes its blocks. Setting that up is outside of the scope of this post, but it is simple to get up and running.</p><p>On the Linode side, first login to your account and sign up for the Object Storage service. Please note that this comes at a flat rate of $5 in order to reserve your minimum storage space of 250GB. You can get signed up just by creating your first bucket.</p><figure class="kg-card kg-image-card"><img src="https://wbhegedus.me/content/images/2021/05/Screen-Shot-2021-05-25-at-4.36.42-PM.png" class="kg-image" alt loading="lazy" width="2000" height="1095" srcset="https://wbhegedus.me/content/images/size/w600/2021/05/Screen-Shot-2021-05-25-at-4.36.42-PM.png 600w, https://wbhegedus.me/content/images/size/w1000/2021/05/Screen-Shot-2021-05-25-at-4.36.42-PM.png 1000w, https://wbhegedus.me/content/images/size/w1600/2021/05/Screen-Shot-2021-05-25-at-4.36.42-PM.png 1600w, https://wbhegedus.me/content/images/size/w2400/2021/05/Screen-Shot-2021-05-25-at-4.36.42-PM.png 2400w" sizes="(min-width: 720px) 720px"></figure><p>After your bucket is created, at the home page for Object Storage, click on the &quot;Access Keys&quot; tab and create an Access Key. I recommend limiting access to just the bucket you just created, but that is up to you. Thanos Sidecar will need read <em>and</em> write in order to do its job. 
</p><p>Once that Access Key is created, make sure you save the secret key and the access key &#x2013; we&apos;ll need to put them in a config file for Thanos Sidecar.</p><p>Go ahead and open your text editor of choice and create a file named something like <code>linode.yml</code> with the contents below. Modify <code>endpoint</code> based on your datacenter &#x2013; inferable from the link displayed under your bucket name in the GUI. This is the least intuitive part &#x2013; the URL presented is a full path directly to the bucket (e.g. <code>thanos-snap.us-east-1.linodeobjects.com</code>), but Thanos won&apos;t like that. So just strip out the bucket name from the URL for the <code>endpoint</code> and modify <code>bucket</code> based on your bucket name.</p><pre><code class="language-yaml">type: S3
config:
  bucket: &quot;thanos-snap&quot;
  endpoint: &quot;us-east-1.linodeobjects.com&quot;
  access_key: &quot;YOUR_ACCESS_KEY_HERE&quot;
  insecure: false
  secret_key: &quot;YOUR_SECRET_KEY_HERE&quot;
  signature_version2: false</code></pre><p>Now you can point your Thanos Sidecar to this configuration file by adding the <code>--objstore.config-file=linode.yml</code> flag, and it should automatically start uploading TSDB blocks.</p><p>One last thing to note is to make sure that you followed the instructions in setting up the Thanos Sidecar so that Prometheus doesn&apos;t compact the blocks &#x2013; you want another Thanos component, <a href="https://thanos.io/tip/components/compact.md/">Thanos Compactor</a> to do the compaction. The good news, though, is that you can reuse that <code>linode.yml</code> file when specifying the <code>--objstore.config-file</code> flag for Thanos Compactor (and any other components that connect to block storage).</p>]]></content:encoded></item><item><title><![CDATA[Configuring Podman for WSL2]]></title><description><![CDATA[How to set up Podman to run in Windows Subsystem for Linux 2 (WSL2)]]></description><link>https://wbhegedus.me/running-podman-on-wsl2/</link><guid isPermaLink="false">609ecc936393191a38fe548c</guid><category><![CDATA[containers]]></category><dc:creator><![CDATA[Will Hegedus]]></dc:creator><pubDate>Wed, 31 Mar 2021 20:48:49 GMT</pubDate><content:encoded><![CDATA[<p>This post is intended to serve as a sort of update to Red Hat&apos;s (now outdated since v3 of Podman) <a href="https://www.redhat.com/sysadmin/podman-windows-wsl2">blog post</a> on how to run Podman in WSL2.</p><p>The commands for the below presume that you are running Ubuntu 20.04 or higher, but the WSL2 specific configuration at the end is independent of which Linux distro you are using.</p><p>You can find the specific installation dependencies and commands for your distro in the <a href="https://podman.io/getting-started/installation.html">Podman docs</a>.</p><h2 id="basic-install">Basic Install</h2><p>Here are the commands I used in order to install Podman on Ubuntu.</p><p>First, you must determine the version of Ubuntu you&apos;re running, if you 
don&apos;t already know. Then you export it to an environment variable (I named mine <code>VERSION_ID</code>) so that you can use it when adding repos from the <a href="https://kubic.opensuse.org/">Kubic project</a>, which was necessary for my specific OS version.</p><!--kg-card-begin: html--><pre class="command-line" data-user="caulfield" data-host="pencey" data-output="2-5">
<code class="language-bash">cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION=&quot;Ubuntu 20.04.2 LTS&quot;
export VERSION_ID=&quot;20.04&quot;
echo &quot;deb https://download.opensuse.org/repositories/devel:/kubic:/libcontainers:/stable/xUbuntu_${VERSION_ID}/ /&quot; | sudo tee /etc/apt/sources.list.d/devel:kubic:libcontainers:stable.list
curl -L https://download.opensuse.org/repositories/devel:/kubic:/libcontainers:/stable/xUbuntu_${VERSION_ID}/Release.key | sudo apt-key add -
sudo apt-get update
sudo apt-get -y upgrade
sudo apt-get -y install podman</code>
</pre><!--kg-card-end: html--><h2 id="wsl2-specifics">WSL2 Specifics</h2><p>OK - now comes the WSL2 specific stuff that you need to change in order for Podman to work. Basically, you just have to change the systemd-specific stuff (default) to non-systemd stuff.</p><p>With version 3+ of Podman, this can all be done in one file.</p><p>First, make the requisite directory. Mine was not automatically created, but YMMV. Then, create a <code>containers.conf</code> file in that directory.</p><!--kg-card-begin: html--><pre class="command-line" data-user="caulfield" data-host="pencey">
<code class="language-bash">mkdir -p ~/.config/containers
vim ~/.config/containers/containers.conf</code>
</pre><!--kg-card-end: html--><p>Inside that file, simply add the following:</p><pre><code class="language-toml">[engine]
events_logger=&quot;file&quot;
cgroup_manager=&quot;cgroupfs&quot;</code></pre><p>The default <code>events_logger</code> is <strong>journald</strong> and the default <code>cgroup_manager</code> is <strong>systemd</strong>, in case you were curious.</p><p>Now Podman should run for you with no issues, and you don&apos;t need to run that clunky Docker daemon in Windows anymore &#x1F389;</p>]]></content:encoded></item><item><title><![CDATA[Monitoring Windows Server Memory Pressure in Prometheus]]></title><description><![CDATA[Windows does not distinguish between major and minor page faults in its performance counters. Consequently, you have to do a little bit of extra work to determine how often the major page faults are occurring.]]></description><link>https://wbhegedus.me/monitoring-windows-server-memory-pressure-in-prometheus/</link><guid isPermaLink="false">609ecc936393191a38fe548b</guid><category><![CDATA[prometheus]]></category><dc:creator><![CDATA[Will Hegedus]]></dc:creator><pubDate>Mon, 09 Nov 2020 17:22:03 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1542978709-19c95dc3bc7e?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1080&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1542978709-19c95dc3bc7e?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1080&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" alt="Monitoring Windows Server Memory Pressure in Prometheus"><p>A common alert to see in Prometheus monitoring setups for Linux hosts is something to do with high memory pressure, which is determined by having both 1) a low amount of available RAM, and 2) a high amount of <em>major</em> page faults.</p><p>For example, Gitlab is kind and gracious enough to make the alerting rules they use available to the public, and their alert looks like <a href="https://gitlab.com/gitlab-com/runbooks/blob/master/rules/node.yml#L166">this</a>:</p><pre><code 
class="language-yaml">  - alert: HighMemoryPressure
    expr: instance:node_memory_available:ratio * 100 &lt; 5 and rate(node_vmstat_pgmajfault[1m]) &gt; 1000
    for: 15m
    labels:
      severity: s4
      alert_type: cause
    annotations:
      description: The node is under heavy memory pressure.  The available memory is under 5% and
        there is a high rate of major page faults.
      runbook: docs/monitoring/node_memory_alerts.md
      value: &apos;Available memory {{ $value | printf &quot;%.2f&quot; }}%&apos;
      title: Node is under heavy memory pressure
</code></pre><p>This is a great way to reduce noise and increase the quality of the alert, because you are increasing your chances that when the alert fires there is actually a problem. Your server might be happily chugging along at 5% memory available, but if it also starts paging heavily, there is likely a real problem.</p><p>Unfortunately, however, Windows <a href="https://techcommunity.microsoft.com/t5/ask-the-performance-team/the-basics-of-page-faults/ba-p/373120">does not distinguish</a> between major and minor page faults in its performance counters, which is what the <a href="https://github.com/prometheus-community/windows_exporter">windows_exporter</a> collects from. Consequently, you have to do a little bit of extra work to determine how often the major page faults are occurring.</p><p>In order to determine the rate of major faults, you have to combine the metrics <code>windows_memory_swap_page_operations_total</code> and <code>windows_memory_swap_page_reads_total</code>. The operations counter will increase no matter the type of page operation (minor or major). The reads counter will increase when data has to be read out of the pagefile into memory. </p><p>With this in mind, I devised the following query that we are now using in our production monitoring for Windows.</p><!--kg-card-begin: markdown--><pre><code class="language-promql">instance:windows_memory_available:ratio * 100 &lt; 5
and (rate(windows_memory_swap_page_operations_total[2m]) &gt; 1000)
and (
    (rate(windows_memory_swap_page_reads_total[2m]) / rate(windows_memory_swap_page_operations_total[2m])) &gt; 0.5
)
</code></pre>
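<p>For context, the expression above can be embedded in a complete Prometheus alerting rule. The sketch below is illustrative only: the rule name, <code>for</code> duration, and annotations are my assumptions, not part of the original post, and the <code>instance:windows_memory_available:ratio</code> recording rule is assumed to already exist.</p>

```yaml
groups:
  - name: windows-memory
    rules:
      - alert: WindowsHighMemoryPressure   # illustrative name
        expr: |
          instance:windows_memory_available:ratio * 100 < 5
          and (rate(windows_memory_swap_page_operations_total[2m]) > 1000)
          and (
              (rate(windows_memory_swap_page_reads_total[2m]) / rate(windows_memory_swap_page_operations_total[2m])) > 0.5
          )
        for: 15m        # illustrative duration
        labels:
          severity: warning
        annotations:
          description: Available memory is under 5% and a high share of page operations are major faults.
```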
<!--kg-card-end: markdown--><p>For this alert to fire, 3 things need to happen:</p><ol><li>Windows memory available has to be less than 5%.</li><li>There must be a high rate of page operations occurring.</li><li>The proportion of page reads vs total page operations must be at least 50%.</li></ol><p>All of these thresholds are arbitrary and can be adjusted as fits your needs. These thresholds just happened to fit well for us.</p><p>For more information on the distinction between minor and major page faults, the <a href="https://en.wikipedia.org/wiki/Page_fault">Wikipedia article</a> on Page Faults explains it better than I ever could.</p>]]></content:encoded></item><item><title><![CDATA[Demystifying Kubernetes CPU Limits (and Throttling)]]></title><description><![CDATA[How can a pod have its CPU throttled for more than 1 second in a 1 second window? Let's find out.]]></description><link>https://wbhegedus.me/understanding-kubernetes-cpu-limits/</link><guid isPermaLink="false">609ecc936393191a38fe548a</guid><dc:creator><![CDATA[Will Hegedus]]></dc:creator><pubDate>Fri, 23 Oct 2020 18:53:18 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1592664474483-a62364f738d8?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1080&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1592664474483-a62364f738d8?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1080&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" alt="Demystifying Kubernetes CPU Limits (and Throttling)"><p>Recently, I&apos;ve been doing some investigation into high CPU utilization that occurs during routine security scans of our Wordpress websites, causing issues such as slow response, increased errors, and other undesirable outcomes. 
This is typically limited to a single pod&#x2013;the one the scanner randomly gets routed to&#x2013;but can still be user-visible (and Pagerduty-activating &#x1F605;), so we want to get better monitoring on it.</p><h2 id="initial-investigation">Initial Investigation</h2><p>Like anyone else in IT investigating something they&apos;re not sure of, I turned first to Google. I sought out what other people are doing to monitor CPU usage of pods in Kubernetes. This is what first led me to discover that it&apos;s actually far more useful to monitor how much the CPU is being <em>throttled</em> rather than how much it&apos;s being <em>used</em>.</p><p>I already knew of the <a href="https://github.com/kubernetes-monitoring/kubernetes-mixin/">kubernetes-mixin</a> project, which provides sane default Prometheus alerting rules for monitoring Kubernetes cluster health, so I looked there first to see what rules they are using to monitor CPU. Currently, the only CPU <em>usage</em> alert bundled in is &quot;<a href="https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/64aa37e837b0e93bfc6fab9430f57bd7366e5a83/alerts/resource_alerts.libsonnet#L149">CPUThrottlingHigh</a>&quot;, which calculates <code>number_of_cpu_cycles_pod_gets_throttled / number_of_cpu_cycles_total</code> (not actual metric names) to give you a percentage of how frequently your pod is getting its CPU throttled. </p><p>But wait, what does <em>throttled</em> even mean? 
Throttled (at least in my mind) means something along the lines of just getting slowed down, but in this case throttled means completely stopped &#x2013; you cannot use any more CPU until the next CFS period (every 100ms in Kubernetes, which is also the <a href="https://www.kernel.org/doc/html/latest/scheduler/sched-bwc.html#management">Linux default</a> - more on this later).</p><p>While abstractly this seems pretty cut and dry, it gets more confusing when you&apos;re actually looking at it in practice on production servers with tons of CPU cores.</p><h2 id="conceptualizing">Conceptualizing</h2><p><em>For the purposes of this article, I&apos;ll be referring to a server with 128 CPU cores running a pod with a CPU limit of 4.0.</em></p><p>If you are not already familiar with the concept of <a href="https://www.noqcks.io/notes/2016/12/14/kubernetes-understanding-millicores/">millicores</a>, suffice to say that 1 millicore = 1/1000th of a CPU&apos;s time (1000 millicores = 1 whole core). This is the metric used to define CPU requests/limits in Kubernetes. Our example pod has a limit of 4.0, which is 4,000 millicores, or 4 whole cores&apos; worth of work capability.</p><p>But how does the operating system kernel even enforce this measure? If you&apos;re familiar with how Linux containers work, you probably have heard of <a href="https://en.wikipedia.org/wiki/Cgroups">cgroups</a>. Cgroups, put simply, are a way to isolate and control groups of processes such that they have no awareness of the other processes also running on the same server. It&apos;s why when you run a Docker container, it thinks that its ENTRYPOINT + CMD is PID 1.</p><p>Among other things, cgroups use the Linux CFS (Completely Fair Scheduler) to set and enforce resource limits on groups of processes, e.g. our pods in Kubernetes. It does this by setting a quota and a period. A quota is how much CPU time you can use during a given period. 
Once you use up your quota, you are &quot;throttled&quot; until the next period when you can begin using CPU again.</p><p>Going back to our discussion on millicores, this means that in every 100ms <code>cfs_period</code> in the operating system, we get 400ms of usage allowed. The reason why we get 400ms in a 100ms time frame is each core is capable of doing 100ms of work in a 100ms period &#x2013; 100ms x 4 cores = 400ms. This 400ms of work can be broken up in any way - it could translate to 4 vCPUs each doing 100ms of work in a 100ms <code>cfs_period</code>, 8 vCPUs each doing 50ms of work, etc. Remember - CPU limits are based on time, not actual vCPUs.</p><p>Understanding that, the reason for the throttling confusion starts to come into focus. So far as I can comprehend, the theoretical upper bound of throttling is <code>n * (100ms) - limit</code> where <code>n</code> is the number of vCPUs and <code>limit</code> is how many milliseconds of CPU you are allotted in a 100ms window (calculated earlier by <code>cpuLimit * 100ms</code>). This means that the theoretical upper bound for throttling on my 128-core machine is 124 seconds of throttling per second because <code>(128 cores * 100ms - 400ms) * 10 = 124s</code>.</p><p><em>Note: the <u>actual</u> CPU throttling is determined by how many processes you&apos;re running and which core(s) they&apos;re assigned to by the OS scheduler.</em></p><h2 id="putting-it-together">Putting it together</h2><p>Now things started to finally click in my brain. At least... as much as they could, considering I&apos;m still somewhat ignorant of all the nitty-gritty details occurring in the Linux scheduler itself.</p><p>This whole investigation was kicked off by the fact that when I went to use a <code>rate()</code> function on the <code>container_cpu_cfs_throttled_seconds_total</code> metric in Prometheus, the per second rate of throttling was significantly higher than 1s (think closer to 70s per second). 
<em>How can a pod be throttled for more than 1 second in a 1 second window?</em> I wondered. </p><p>Putting all this information together, I now know that the reason for such high throttling was that <code>httpd</code> was spawning additional processes on additional CPU cores, which raises the amount of throttling to significantly higher than the resource limit. </p><h2 id="conclusion">Conclusion</h2><p>With my brain sufficiently exploded, I can now say that we have sufficient monitoring in place to alert us of high CPU usage based on the amount of time the CPU is being throttled. This is the alert we have in place now:</p><!--kg-card-begin: markdown--><pre><code class="language-yaml">    - alert: Wordpress_High_CPU_Throttling
      expr: rate(container_cpu_cfs_throttled_seconds_total{namespace=~&quot;wordpress-.*&quot;}[1m]) &gt; 1
      for: 30m
      labels:
        severity: warning
      annotations:
        message: The {{ $labels.pod_name }} pod in {{ $labels.namespace }} is experiencing a high amount of CPU throttling as a result of its CPU limit.
</code></pre>
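<p>The throttling arithmetic described above can be sketched in a few lines of Python. This is a toy illustration of the formula, not production code:</p>

```python
# Theoretical upper bound on reported throttled-seconds per second:
# n_cores * 100ms of schedulable time per CFS period, minus the quota
# (cpu_limit * 100ms), times 10 CFS periods per second.

CFS_PERIOD_MS = 100  # Kubernetes/Linux default CFS period

def max_throttled_seconds_per_second(n_cores: int, cpu_limit: float) -> float:
    quota_ms = cpu_limit * CFS_PERIOD_MS           # e.g. a limit of 4.0 -> 400ms per period
    throttled_ms_per_period = n_cores * CFS_PERIOD_MS - quota_ms
    periods_per_second = 1000 / CFS_PERIOD_MS      # 10 periods every second
    return throttled_ms_per_period * periods_per_second / 1000

# The post's example: a 128-core machine running a pod limited to 4.0 CPUs
print(max_throttled_seconds_per_second(128, 4.0))  # -> 124.0
```

<p>Remember that this is only the upper bound; the actual throttling depends on how many cores your processes actually land on.</p>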
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><p>In summary:</p>
<ul>
<li>Kubernetes uses a <code>cfs_period_us</code> of 100ms (Linux default)</li>
<li>A CPU limit of 1.0 in k8s represents 100ms of CPU time in a <code>cfs_period</code>
<ul>
<li>Theoretically this is 100% of 1 CPU&apos;s time, but not practically, since pods usually run multiple processes on multiple cores</li>
</ul>
</li>
<li>The upper bound of how many seconds the kernel reports a pod being throttled for is determined by the number of CPU cores that the pod is using.
<ul>
<li>The number of CPU cores you use is not directly related to your CPU limit. It correlates more strongly with the number of processes your pod runs.</li>
</ul>
</li>
</ul>
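<p>To connect the summary above to the underlying kernel counters, here is a small hypothetical Python sketch of deriving a throttled-seconds-per-second rate from two samples of a cgroup v1 <code>cpu.stat</code> file, which reports <code>nr_periods</code>, <code>nr_throttled</code>, and <code>throttled_time</code> (in nanoseconds) &#x2013; the counter cAdvisor exposes as <code>container_cpu_cfs_throttled_seconds_total</code>. The sample values are made up for illustration:</p>

```python
def parse_cpu_stat(text: str) -> dict:
    """Parse the key/value lines of a cgroup v1 cpu.stat file."""
    return {k: int(v) for k, v in (line.split() for line in text.strip().splitlines())}

def throttled_rate(sample_a: str, sample_b: str, interval_s: float) -> float:
    """Seconds throttled per second between two cpu.stat samples."""
    a, b = parse_cpu_stat(sample_a), parse_cpu_stat(sample_b)
    delta_ns = b["throttled_time"] - a["throttled_time"]  # throttled_time is nanoseconds
    return delta_ns / 1e9 / interval_s

# Hypothetical samples taken 60s apart from a busy multi-process pod
t0 = "nr_periods 600\nnr_throttled 580\nthrottled_time 1000000000"
t1 = "nr_periods 1200\nnr_throttled 1170\nthrottled_time 4201000000000"
print(throttled_rate(t0, t1, 60))  # -> 70.0, i.e. 70s of throttling per second
```

<p>A result like 70.0 is only possible because multiple runqueues are throttled simultaneously, which is exactly the confusion described above.</p>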
<!--kg-card-end: markdown--><p>Please feel free to reach out if I got anything wrong, or if you have any questions. I&apos;m available on Twitter <a href="https://twitter.com/wbhegedus">@wbhegedus</a><br></p><h4 id="works-cited">Works Cited</h4><p>These resources were useful to me in my quest for knowledge.</p><!--kg-card-begin: markdown--><ol>
<li><a href="https://www.kernel.org/doc/html/latest/scheduler/sched-bwc.html">CFS Bandwidth Control</a> - Kernel.org</li>
<li><a href="https://engineering.indeedblog.com/blog/2019/12/unthrottled-fixing-cpu-limits-in-the-cloud/">Unthrottled: Fixing CPU Limits in the Cloud</a> - Indeed Engineering</li>
<li><a href="https://github.com/kubernetes/kubernetes/issues/67577">CFS quotas can lead to unnecessary throttling</a> - Kubernetes Github Issue #67577</li>
<li><a href="https://landley.net/kdocs/ols/2010/ols2010-pages-245-254.pdf">CPU bandwidth control for CFS</a> - Academic paper on Linux CFS from Turner, Rao, and Rao</li>
<li><a href="https://github.com/google/cadvisor">cAdvisor Github</a></li>
<li><a href="https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/64aa37e837b0e93bfc6fab9430f57bd7366e5a83/alerts/resource_alerts.libsonnet#L149">Kubernetes Monitoring Mixin CPU Alert</a></li>
<li><a href="https://medium.com/omio-engineering/cpu-limits-and-aggressive-throttling-in-kubernetes-c5b20bd8a718">CPU limits and aggressive throttling in Kubernetes</a> - Omio Engineering</li>
</ol>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Recovering from a major etcd failure]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>Etcd defines a &quot;disastrous&quot; failure as more than (N-1)/2 members being lost &quot;permanently&quot;, in which N signifies the number of cluster members. In order to recover from this type of failure, you will need to essentially create a new etcd cluster.</p>
<p>The following steps presume</p>]]></description><link>https://wbhegedus.me/recovering-from-an-etcd-failure/</link><guid isPermaLink="false">609ecc936393191a38fe5489</guid><dc:creator><![CDATA[Will Hegedus]]></dc:creator><pubDate>Thu, 28 May 2020 15:28:19 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1532264523420-881a47db012d?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1080&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://images.unsplash.com/photo-1532264523420-881a47db012d?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1080&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" alt="Recovering from a major etcd failure"><p>Etcd defines a &quot;disastrous&quot; failure as more than (N-1)/2 members being lost &quot;permanently&quot;, in which N signifies the number of cluster members. In order to recover from this type of failure, you will need to essentially create a new etcd cluster.</p>
<p>The following steps presume a 3-node cluster in which m1, m2, and m3 are 3 Kubernetes masters running etcd as static pods. They encompass what I did in order to restore the etcd portion of one of my Kubernetes clusters.</p>
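<p>The (N-1)/2 threshold can be made concrete with a couple of lines of Python (integer division; a toy illustration, not from any etcd tooling):</p>

```python
def fault_tolerance(n_members: int) -> int:
    """Members an etcd cluster can lose while still keeping quorum: (N-1)/2."""
    return (n_members - 1) // 2

def quorum(n_members: int) -> int:
    """Minimum members required for the cluster to accept writes."""
    return n_members // 2 + 1

# A 3-member cluster needs 2 members for quorum and tolerates losing 1;
# losing 2 of 3 exceeds (N-1)/2 and is "disastrous".
for n in (1, 3, 5):
    print(n, quorum(n), fault_tolerance(n))
```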
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><ol>
<li>Identify the master that is going to be the progenitor of your new cluster. In our case, this will be m1.</li>
<li>Stop etcd on all masters, even m1.
<ol>
<li>Do this by moving the etcd manifest out of the /etc/kubernetes/manifests directory</li>
<li><code>mv /etc/kubernetes/manifests/etcd.yaml /root/etcd.yaml</code></li>
</ol>
</li>
<li>Update the etcd manifest on m1 to force it to create a new cluster
<ol>
<li>Add the <code>--force-new-cluster</code> flag to the command in the manifest</li>
</ol>
</li>
<li>Start etcd on m1
<ol>
<li><code>mv /root/etcd.yaml /etc/kubernetes/manifests/etcd.yaml</code></li>
</ol>
</li>
<li>Verify that the container is started and OK
<ol>
<li><code>docker container ls -a | grep etcd</code></li>
<li><code>docker logs &lt;container_id&gt;</code></li>
</ol>
</li>
<li>Exec into the container on m1 to add a new member to the cluster
<ol>
<li><code>docker exec -it &lt;container_id&gt; /bin/sh</code></li>
<li>
<pre><code class="language-bash"> # Check on the existing members. It should just be m1 right now. Replace m1 with the FQDN of your etcd endpoint

 etcdctl --key-file=/etc/kubernetes/pki/etcd/server.key --cert-file=/etc/kubernetes/pki/etcd/server.crt --endpoints=https://m1:2379 --ca-file=/etc/kubernetes/pki/etcd/ca.crt member list
  
 # Add the first new member. The first argument to &quot;add&quot; is the name of the cluster member. This isn&apos;t terribly important, but make sure you can use it to distinguish cluster members. The second argument is the IP and port for the peer address.
  
 etcdctl --key-file=/etc/kubernetes/pki/etcd/server.key --cert-file=/etc/kubernetes/pki/etcd/server.crt --endpoints=https://m1:2379 --ca-file=/etc/kubernetes/pki/etcd/ca.crt member add m2 https://10.253.5.18:2380
</code></pre>
</li>
</ol>
</li>
<li>After adding back m2, your etcd cluster will be unavailable until you start etcd on m2 since a quorum needs to be established.
<ol>
<li>SSH to m2 and remove the etcd data from the previous cluster
<ul>
<li><code>rm -rf /var/lib/etcd</code></li>
</ul>
</li>
<li>Ensure m2&apos;s etcd manifest has only m1 and m2 in the --initial-cluster flag
<ul>
<li><code>--initial-cluster=m2=https://10.253.5.18:2380,m1=https://10.253.5.17:2380</code></li>
<li>Also ensure that the --initial-cluster-state=existing flag is set</li>
<li>You&apos;ll get an error if the number of nodes specified in initial-cluster is more than the actual number of nodes in the cluster.</li>
</ul>
</li>
<li>Start etcd on m2 using the command from Step 4</li>
<li>On m1, run the &quot;member list&quot; command from above to ensure that m2 joined successfully.
<ul>
<li>If m2 hasn&apos;t joined yet and participated in a leader election, you&apos;ll get an error saying m1 has no leader.</li>
</ul>
</li>
</ol>
</li>
<li>Now that m2 is added, we need to add m3 back in.
<ol>
<li>Add the member
<ul>
<li>Repeat the steps in 6 above, but update the member name and peer address to reflect that of m3</li>
</ul>
</li>
<li>Start etcd on m3
<ul>
<li>Follow the steps from 7 above, but update the <code>--initial-cluster</code> flag to also include m3 now</li>
</ul>
</li>
</ol>
</li>
<li>Verify on m1 that the &quot;member list&quot; command from step 6 now shows all 3 members.
<ul>
<li>Everything should be OK now!</li>
</ul>
</li>
</ol>
<!--kg-card-end: markdown--><h3 id="works-cited-">Works Cited:</h3><ul><li><a href="https://docs.openshift.com/container-platform/3.11/admin_guide/assembly_restore-etcd-quorum.html#cluster-restore-etcd-quorum-static-pod_restore-etcd-quorum">https://docs.openshift.com/container-platform/3.11/admin_guide/assembly_restore-etcd-quorum.html#cluster-restore-etcd-quorum-static-pod_restore-etcd-quorum</a></li><li><a href="https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/recovery.md">https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/recovery.md</a></li><li><a href="https://etcd.io/docs/v3.3.12/op-guide/runtime-configuration/#restart-cluster-from-majority-failure">https://etcd.io/docs/v3.3.12/op-guide/runtime-configuration/#restart-cluster-from-majority-failure</a></li></ul>]]></content:encoded></item><item><title><![CDATA[Nested Active Directory Group Membership in Grafana]]></title><description><![CDATA[<p>I am currently in the process of onboarding several teams into our Grafana environment. While we were just POC&apos;ing Grafana, it was all fine and dandy to just have &quot;Grafana Viewer&quot;, &quot;Grafana Editor&quot;, and &quot;Grafana Admin&quot; groups because not that many people</p>]]></description><link>https://wbhegedus.me/nested-active-directory-group-membership-in-grafana/</link><guid isPermaLink="false">609ecc936393191a38fe5488</guid><dc:creator><![CDATA[Will Hegedus]]></dc:creator><pubDate>Fri, 20 Dec 2019 21:21:23 GMT</pubDate><content:encoded><![CDATA[<p>I am currently in the process of onboarding several teams into our Grafana environment. While we were just POC&apos;ing Grafana, it was all fine and dandy to just have &quot;Grafana Viewer&quot;, &quot;Grafana Editor&quot;, and &quot;Grafana Admin&quot; groups because not that many people would be in any of them. However, as our environment is growing, it has quickly become clear that managing this additional group membership would be a pain. 
</p><p>When I first set up Grafana, I was unable to get nested group membership working but didn&apos;t care enough at the time to troubleshoot much &#x2013; after all, only myself and a few others would be using it. Now that I&apos;ve gone back and actually figured out how it works, I want to share what I&apos;ve learned.</p><p><em>Note: these instructions presume that you are using Active Directory, but the concepts should be transferable to other LDAP providers.</em></p><p>Presumably, you&apos;ve already configured your Grafana environment to use LDAP as your authentication provider with this bit in your configuration file:<br></p><!--kg-card-begin: markdown--><pre><code class="language-ini">[auth.ldap]
enabled = true
config_file = /etc/grafana/ldap.toml
</code></pre>
<!--kg-card-end: markdown--><p>The instructions <a href="https://grafana.com/docs/grafana/latest/auth/ldap/">on Grafana&apos;s site</a> do a good job of getting you up and running with what you&apos;ll need in that <code>/etc/grafana/ldap.toml</code> file. However, we need to expand upon that by not just getting what groups a user is a member of, but also what groups those groups are a member of (nested group membership).</p><p>To do this, you need to specify a <code>group_search_filter</code> in addition to your plain <code>search_filter</code>. It supplements <code>search_filter</code> rather than replacing it &#x2013; both are required. Your <code>group_search_filter</code> is an LDAP query that essentially tells AD to find all groups that a user is a member of within the <code>group_search_base_dns</code>. </p><p>This <code>group_search_filter</code> looks like:</p><pre><code class="language-ldap">(member:1.2.840.113556.1.4.1941:=CN=%s,OU=Users,OU=FSA,DC=example,DC=domain,DC=com)</code></pre><p>Those random-looking numbers are an important OID to enable <a href="https://docs.microsoft.com/en-us/windows/win32/adsi/search-filter-syntax">LDAP_MATCHING_RULE_IN_CHAIN</a>, which is what lets us find nested group memberships, too. </p><p>When it&apos;s configured, it should look like:</p><!--kg-card-begin: markdown--><pre><code class="language-toml">## Group search filter, to retrieve the groups of which the user is a member (only set if memberOf attribute is not available)
group_search_filter = &quot;(member:1.2.840.113556.1.4.1941:=CN=%s,OU=Users,OU=FSA,DC=example,DC=domain,DC=com)&quot;
## An array of the base DNs to search through for groups. Typically uses ou=groups
group_search_base_dns = [&quot;OU=Admin Groups,OU=Security Groups,OU=FSA,DC=example,DC=domain,DC=com&quot;]
group_search_filter_user_attribute = &quot;sAMAccountName&quot;
</code></pre>
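At login time, Grafana substitutes the authenticating user's `group_search_filter_user_attribute` value (here, `sAMAccountName`) into the `%s` placeholder before running the query. A quick Python sketch of that expansion, using a hypothetical username `jdoe`:

```python
# Rough sketch of how Grafana fills in the %s placeholder at login time.
# "jdoe" is a hypothetical sAMAccountName; the DN mirrors the example config.
GROUP_SEARCH_FILTER = (
    "(member:1.2.840.113556.1.4.1941:="
    "CN=%s,OU=Users,OU=FSA,DC=example,DC=domain,DC=com)"
)

def expand_filter(template: str, attribute_value: str) -> str:
    # Grafana substitutes the user's group_search_filter_user_attribute
    # value into the %s placeholder, producing the final LDAP query.
    return template % attribute_value

print(expand_filter(GROUP_SEARCH_FILTER, "jdoe"))
```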
<!--kg-card-end: markdown--><p>As a bonus tip when configuring this, make your <code>group_search_base_dns</code> as specific as possible, because that is what Grafana is going to loop through looking for your group memberships. For example, it was taking 15-20s to log me in when I used &quot;Security Groups&quot; as my base OU, but when I got more specific to use the &quot;Admin Groups&quot; OU (which is where my &quot;Grafana Admin&quot;, &quot;Grafana Viewer&quot;, and &quot;Grafana Editor&quot; groups are), the results were nearly instant.</p>]]></content:encoded></item><item><title><![CDATA[Automate Testing of Prometheus Targets files with Drone CI/CD]]></title><description><![CDATA[File-based service discovery is one of the most popular and flexible methods of service discovery available in Prometheus. However, there was no good way that I knew of to test the validity of the files...]]></description><link>https://wbhegedus.me/automate-testing-of-prometheus-targets-files/</link><guid isPermaLink="false">609ecc936393191a38fe5487</guid><category><![CDATA[prometheus]]></category><dc:creator><![CDATA[Will Hegedus]]></dc:creator><pubDate>Thu, 05 Sep 2019 21:49:08 GMT</pubDate><content:encoded><![CDATA[<p>File-based service discovery is one of the most popular and flexible methods of service discovery available in Prometheus. However, there was no <em>good</em> way that I knew of to test the validity of the files that <code>file_sd</code> will try to use prior to actually giving them to Prometheus to be read. </p><p>This particular solution uses Drone, as it is my CI/CD tool of choice, but you can use the same logic in Jenkins, Gitlab-CI, or whatever you utilize for your builds/deployments. I also use JSON as my file-format for <code>file_sd</code> but you can apply the same concepts if you use YAML.</p><p>First, we need a way to actually define what the file <em>should</em> look like. To do this, I chose to use JSON schemas. 
You can pop on over to <a href="https://www.jsonschema.net/">jsonschema.net</a>, paste in one of your valid JSON target files, and infer a schema based off of it. Personally, I changed it so that <strong>env</strong>, <strong>team</strong>, and <strong>service</strong> labels are required on targets in order to enforce uniformity in our targets regardless of what team is adding them. Go ahead and save that schema as <code>schema.json</code> in a <code>drone</code> folder in the top level of your repository.</p><p>With that schema in hand, now we need a way to compare our JSON files to it. To do this, I wrote a simple Python script:</p><!--kg-card-begin: markdown--><pre><code class="language-python">from jsonschema import validate
from jsonschema.exceptions import ValidationError
from json.decoder import JSONDecodeError

import glob
import json

import sys

errors: dict = {}  # maps filename to error message
error_encountered = False

with open(&quot;drone/schema.json&quot;) as f:
    schema = json.load(f)

for f in glob.glob(&quot;files/targets/*.json&quot;):
    with open(f) as inst:
        try:
            validate(json.load(inst), schema)
        except ValidationError as e:
            error_encountered = True
            errors[inst.name] = f&quot;Schema error: {e.message}&quot;
        except JSONDecodeError as e:
            error_encountered = True
            errors[inst.name] = f&quot;JSON decoding error: {e.msg} on line {e.lineno}&quot;

for f, err in errors.items():
    if err:
        print(f&quot;[ ERROR in {f} ]&quot;)
        print(&quot;  |&gt;&quot;, err)
        print()

if error_encountered:
    sys.exit(1)

print(&quot;All tests passed! Have a cookie. &#x1F36A;&quot;)
</code></pre>
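The heart of what my schema enforces (required `env`, `team`, and `service` labels) can be sketched with the standard library alone. The sample targets below are made up; in practice, the `jsonschema` validation above does this for you:

```python
import json

# Labels my schema marks as required on every target group.
REQUIRED_LABELS = {"env", "team", "service"}

def missing_labels(target_group: dict) -> set:
    """Return the required labels absent from a file_sd target group."""
    return REQUIRED_LABELS - set(target_group.get("labels", {}))

# Hypothetical file_sd content for demonstration purposes.
doc = json.loads("""
[
  {"targets": ["foo.example.com:9100"],
   "labels": {"env": "prod", "team": "sre", "service": "foobar"}},
  {"targets": ["bar.example.com:9100"],
   "labels": {"env": "prod"}}
]
""")

for group in doc:
    print(group["targets"], "missing:", sorted(missing_labels(group)))
```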
<!--kg-card-end: markdown--><p>The script presumes the location of the schema file (<code>drone/schema.json</code>) and the target file(s) (<code>files/targets/*.json</code>) but you can adapt that as needed.</p><p>Now in your <code>.drone.yml</code> just add a step to your pipeline to run the validation script (and install its dependency):</p><!--kg-card-begin: markdown--><pre><code class="language-yaml">  - name: validate JSON of targets
    image: python:3-alpine
    commands:
      - pip3 install --quiet jsonschema
      - python3 drone/validate.py

</code></pre>
<!--kg-card-end: markdown--><p>Easy peasy. Now you&apos;ll never have another malformed target file again! (err... at least not one that makes it to prod).</p>]]></content:encoded></item><item><title><![CDATA[Deleting Elements from Slices in Go]]></title><description><![CDATA[A brief how-to on filtering slices in Go without pulling your hair out.]]></description><link>https://wbhegedus.me/deleting-elements-from-slices-in-go/</link><guid isPermaLink="false">609ecc936393191a38fe5486</guid><category><![CDATA[go]]></category><dc:creator><![CDATA[Will Hegedus]]></dc:creator><pubDate>Mon, 05 Aug 2019 21:09:36 GMT</pubDate><content:encoded><![CDATA[<p>If you come from a language such as Python or Java, you might expect to find a standardized way to remove items from a list object in Go. However, as of Go 1.12, there is not a built-in way to perform this action. You can do it with relative ease once you wrap your head around it, though.</p><p>Personally, it took me a bit to wrap <em>my</em> head around it, which is why I&apos;m writing this &#x2014; in hopes of aiding someone else down the line.</p><p>I was working on a problem that required me to take a list of items and filter them by removing any items that matched a particular regex, then return the filtered list. Several hours of debugging and brainstorming later, I finally figured out how to do it right.</p><p>In my particular case, I did not care about the order of the elements in the slice so I could rearrange them any which way.</p><p>First, we start with the function declaration:</p><pre><code class="language-go">func filterExcluded(tl *[]task, excluded []string) []task {
	filtered := *tl
	lastIndex := (len(filtered) - 1)
</code></pre><p>The function accepts a pointer to a slice of task-typed objects, along with a slice of strings that will be interpreted as regex matchers, and then returns a slice of task-typed objects.</p><p>The <code>filtered</code> variable is a copy of all the items in the <code>tl</code> argument, which we will whittle down later. The <code>lastIndex</code> variable will be updated repeatedly later on and we&apos;ll use it to return a subset of data from the <code>filtered</code> slice.</p><p>Next, we loop over the regex matchers and make them <code>*regexp.Regexp</code> objects. Nothing fancy.</p><p>We&apos;ll also create a <code>matched</code> boolean that we&apos;ll use to log an error if it matches nothing, so that end users can make sure the regex is actually right and isn&apos;t just wasting compute time.</p><!--kg-card-begin: markdown--><pre><code class="language-go">    for _, regexString := range excluded {
		matched := false
		regex, err := regexp.Compile(regexString)
		if err != nil {
			logrus.Error(err)
			continue
		}
</code></pre>
<!--kg-card-end: markdown--><p>Now we&apos;re gonna go 0 to 100 real quick, but don&apos;t panic &#x2014; I&apos;ve commented everything in-line so you can see what&apos;s happening:</p><!--kg-card-begin: markdown--><pre><code class="language-go">		// Do not increment index on each loop.
		// Handle incrementing inside of the loop.
		for index := 0; index &lt;= lastIndex; {

			taskName := &amp;filtered[index].Name

			if regex.Match([]byte(*taskName)) {
				logrus.WithFields(logrus.Fields{
					&quot;task_name&quot;: *taskName,
					&quot;regex&quot;:     regexString,
				}).Debug(&quot;Excluding task&quot;)

				// Hooray! The provided regex matched something and wasn&apos;t a waste of time.
				matched = true

				// Replace the value at the current index with the value at
				// the last index of the slice.
				filtered[index] = filtered[lastIndex]

				// Set the value of the last index of the slice
				// to the nil value of `task`. The value that was previously
				// there is now at filtered[index], so we did not lose it.
				// We will just NOT increment `index` so that the new value gets checked, too.
				filtered[lastIndex] = task{}

				// Set the `filtered` slice to be everything up to
				// the last index, which we just set to a nil value.
				filtered = filtered[:lastIndex]

				// The last index will now be one less than before.
				// This is the same as if we just did
				// lastIndex = len(filtered)
				// every time, except this should be slightly more performant.
				lastIndex--
			} else {
				// If no match was found, increment the index
				// so that we check the next value in the `filtered` slice
				index++
				logrus.WithFields(logrus.Fields{
					&quot;task_name&quot;: *taskName,
					&quot;regex&quot;:     regexString,
				}).Debug(&quot;Did not match regex&quot;)
			}
		}
		// Log an error if the regex didn&apos;t match any tasks.
		// This should warn users if they&apos;re providing a useless regex.
		if !matched {
			logrus.Error(&quot;No task found to exclude matching: &quot;, regexString)
		}
	}
</code></pre>
<!--kg-card-end: markdown--><p>Now you just need to return the filtered list and you&apos;re golden.</p><!--kg-card-begin: markdown--><pre><code class="language-go">	return filtered
</code></pre>
<!--kg-card-end: markdown--><p>Ta-Da! We just filtered a slice while incurring only a slight amount of pain.</p><p><em>TL;DR &#x2014; Loop over slice; only increment index on non-matches; on match, set slice[i] = last item in slice; set last item in slice to nil (helps w/ GC); set slice = slice[:lastItem]; &#xA0;profit.</em></p><h3 id="further-reading">Further Reading</h3><ul><li><a href="https://github.com/golang/go/wiki/SliceTricks">SliceTricks</a> (Golang Github wiki)</li></ul><h3 id="bonus">Bonus</h3><p>Putting it all together, it looks like this:</p><!--kg-card-begin: markdown--><pre><code class="language-go">func filterExcluded(tl *[]task, excluded []string) []task {
	filtered := *tl
	lastIndex := (len(filtered) - 1)

	for _, regexString := range excluded {
		matched := false
		regex, err := regexp.Compile(regexString)
		if err != nil {
			logrus.Error(err)
			continue
		}

		// Do not increment index on each loop.
		// Handle incrementing inside of the loop.
		for index := 0; index &lt;= lastIndex; {

			taskName := &amp;filtered[index].Name

			if regex.Match([]byte(*taskName)) {
				logrus.WithFields(logrus.Fields{
					&quot;task_name&quot;: *taskName,
					&quot;regex&quot;:     regexString,
				}).Debug(&quot;Excluding task&quot;)

				// Hooray! The provided regex matched something and wasn&apos;t a waste of time.
				matched = true

				// Replace the value at the current index with the value at
				// the last index of the slice.
				filtered[index] = filtered[lastIndex]

				// Set the value of the last index of the slice
				// to the nil value of `task`. The value that was previously
				// there is now at filtered[index], so we did not lose it.
				// We will just NOT increment `index` so that the
				// new value will get checked, too.
				filtered[lastIndex] = task{}

				// Set the `filtered` slice to be everything up to
				// the last index, which we just set to a nil value.
				filtered = filtered[:lastIndex]

				// The last index will now be one less than before.
				// This is the same as if we just did
				// lastIndex = len(filtered)
				// every time, except this should be slightly more performant.
				lastIndex--
			} else {
				// If no match was found, increment the index
				// so that we check the next value in the `filtered` slice
				index++
				logrus.WithFields(logrus.Fields{
					&quot;task_name&quot;: *taskName,
					&quot;regex&quot;:     regexString,
				}).Debug(&quot;Did not match regex&quot;)
			}
		}
		// Log an error if the regex didn&apos;t match any tasks.
		// This should warn users if they&apos;re providing a useless regex.
		if !matched {
			logrus.Error(&quot;No task found to exclude matching: &quot;, regexString)
		}
	}
	return filtered
}
</code></pre>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Fix a Nagios Service Stuck in Scheduled Downtime]]></title><description><![CDATA[Today an issue was brought to my attention in which a Nagios service check was genuinely stuck in downtime, with the only way to fix it being manually updating the nagios_servicestatus table.]]></description><link>https://wbhegedus.me/fix-a-nagios-service-stuck-in-downtime/</link><guid isPermaLink="false">609ecc936393191a38fe5485</guid><category><![CDATA[nagios]]></category><dc:creator><![CDATA[Will Hegedus]]></dc:creator><pubDate>Thu, 13 Jun 2019 13:46:43 GMT</pubDate><content:encoded><![CDATA[<p>Today an issue was brought to my attention in which a Nagios service check was genuinely <em>stuck</em> in downtime, with the only way to fix it being manually updating the values in the database.</p><h1 id="background">Background</h1><p>The service did not show up in the &quot;Scheduled Downtime&quot; page in the XI UI, nor did it show up in the <code>nagios_scheduleddowntime</code> table in our MariaDB instance. Yet, the service page still showed it as being in downtime.</p><p>According to the comment history on the service, the downtime was scheduled until 2042, so we couldn&apos;t really just wait and hope it expired.</p><!--kg-card-begin: markdown--><blockquote>
<p>This service has been scheduled for fixed downtime from 2019-04-30 00:39:16 to 2042-02-21 07:39:16.</p>
</blockquote>
<!--kg-card-end: markdown--><p><em><strong>Note: </strong>Before I talk about how the problem was resolved, if you&apos;re just here wondering how to remove scheduled downtime, then you can do so in the aforementioned &quot;Scheduled Downtime&quot; page. It&apos;s at <code>nagiosxi/includes/components/xicore/downtime.php</code> in the XI UI, and under the &quot;System -&gt; Downtime&quot; page in Nagios Core.</em></p><h1 id="technical-info">Technical Info</h1><p>The issue that presented itself to us was that the entry in the <code>nagios_servicestatus</code> table had incorrect values in the <code>acknowledgement_type</code> and <code>scheduled_downtime_depth</code> columns. </p><p>The <code>acknowledgement_type</code> column can be either <strong>0</strong>, <strong>1</strong>, or <strong>2</strong>, which represent <strong>None</strong>, <strong>Normal</strong>, or <strong>Sticky</strong>, respectively.</p><p>The <code>scheduled_downtime_depth</code> column can be any <strong>smallint</strong> number, and it represents the number of downtimes that the service is in (since a service can be in multiple levels of downtime).</p><h1 id="fix">Fix</h1><p>After determining the object ID of the service and verifying that no entries corresponding to it existed in <code>nagios_scheduleddowntime</code>, I used the following SQL to fix the service&apos;s status.</p><!--kg-card-begin: markdown--><pre><code class="language-sql">-- Fix acknowledgement_type, replace service_object_id with the appropriate ID
update nagios.nagios_servicestatus set acknowledgement_type=&apos;0&apos; where service_object_id=&apos;8540&apos;;

-- Fix scheduled_downtime_depth, replace service_object_id with the appropriate ID
update nagios.nagios_servicestatus set scheduled_downtime_depth=&apos;0&apos; where service_object_id=&apos;8540&apos;;
</code></pre>
<!--kg-card-end: markdown--><p>After doing this, you can also optionally remove the comment history for the downtime from the <code>nagios_commenthistory</code> table.</p><!--kg-card-begin: markdown--><pre><code class="language-sql">-- replace object_id and commenthistory_id with appropriate values.
delete from nagios.nagios_commenthistory where object_id = &apos;8540&apos; and commenthistory_id=&apos;2461761&apos;;
</code></pre>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Using Explicit Waits in Selenium to Improve Test Case Reliability]]></title><description><![CDATA[Selenium test cases can be a pain to write. They seem to fail for inane and irreplicable reasons that vanish on the next run of the test case. However, incorporating Explicit Waits can help reduce the number of false failures you receive. ]]></description><link>https://wbhegedus.me/using-explicit-waits-in-selenium-to-improve-test-case-reliability/</link><guid isPermaLink="false">609ecc936393191a38fe5484</guid><category><![CDATA[selenium]]></category><category><![CDATA[monitoring]]></category><dc:creator><![CDATA[Will Hegedus]]></dc:creator><pubDate>Thu, 30 May 2019 18:25:17 GMT</pubDate><content:encoded><![CDATA[<p>Selenium test cases can be a pain to write. They seem to fail for inane and irreplicable reasons that vanish on the next run of the test case. However, incorporating <a href="https://www.seleniumhq.org/docs/04_webdriver_advanced.jsp#explicit-and-implicit-waits">Explicit Waits</a> can help reduce the number of false failures you receive. </p><p>I use Selenium at work to run test cases that are called by Nagios checks. Since Nagios needs a response from a check within 60s, having predictable wait times is imperative. Rather than using the implicit waits for everything, I prefer to set explicit waits depending on what it is that I&apos;m waiting on.</p><p>The default value in Selenium for an implicit wait is 0 seconds. This means that running <code>driver.find_element_by_id(&quot;main-box&quot;)</code> will fail if the element isn&apos;t there, even if the page hasn&apos;t fully loaded yet. </p><p>I solve this by using <code>WebDriverWait</code> and <code>expected_conditions</code>.</p><!--kg-card-begin: markdown--><pre><code class="language-python">from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# [...] (code omitted)

WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, &quot;application&quot;)),
                                &quot;Page did not load as expected&quot;)
</code></pre>
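Conceptually, `WebDriverWait` is just a poll-until-truthy loop with a deadline. This simplified, stdlib-only sketch (not Selenium's actual implementation) shows the contract that makes explicit waits predictable:

```python
import time

class TimeoutException(Exception):
    """Stand-in for selenium.common.exceptions.TimeoutException."""

def wait_until(condition, timeout=10, poll_interval=0.5, message=""):
    # Poll `condition` until it returns a truthy value, then return it.
    # If `timeout` seconds elapse first, raise with the supplied message,
    # mirroring how WebDriverWait(driver, timeout).until(...) behaves.
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutException(message)
        time.sleep(poll_interval)

# Hypothetical usage with a Selenium driver (not runnable here):
# wait_until(lambda: driver.find_elements(By.ID, "application"),
#            timeout=10, message="Page did not load as expected")
```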
<!--kg-card-end: markdown--><p>This code calls <code>WebDriverWait</code> on the <code>driver</code> I set up, it waits a maximum of 10s or until the expected condition (<code>EC</code>) is met. If the <code>EC</code> is <em>not</em> met, then it will raise a <code>TimeoutException</code> with &quot;Page did not load as expected&quot; as the error message.</p><p>Not only does <code>WebDriverWait</code> allow me to explicitly set how long to wait for something (e.g. the presence of an element on a page), but it also allows me to set the error message to something meaningful that will help our on-shift team identify where the issue actually is &#x2013; all without having to add an unnecessary try-catch block.</p><p>Explicitly specifying how long I want to wait for an element to load or until a page&apos;s title changes allows me to better guarantee that a site is actually broken when our test cases fail &#x2013; not that it just happened to load slower than expected.</p><p><em>Note: code examples are in Python, but the same functionality is available in Java and other Selenium libraries.</em></p>]]></content:encoded></item><item><title><![CDATA[Migrating Grafana's Database from SQLite to Postgres]]></title><description><![CDATA[Grafana is a fantastic tool for creating and sharing dashboards. 
However, you may quickly find yourself outgrowing its default SQLite database if you have too many people trying to use your Grafana instance at once.]]></description><link>https://wbhegedus.me/migrating-grafanas-database-from-sqlite-to-postgres/</link><guid isPermaLink="false">609ecc936393191a38fe5483</guid><category><![CDATA[grafana]]></category><category><![CDATA[monitoring]]></category><category><![CDATA[postgres]]></category><category><![CDATA[sqlite]]></category><dc:creator><![CDATA[Will Hegedus]]></dc:creator><pubDate>Mon, 27 May 2019 15:26:57 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1535320903710-d993d3d77d29?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1080&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h2 id="background">Background</h2>
<img src="https://images.unsplash.com/photo-1535320903710-d993d3d77d29?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1080&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" alt="Migrating Grafana&apos;s Database from SQLite to Postgres"><p>If you are unaware, Grafana is a fantastic tool for creating and sharing dashboards. However, you may quickly find yourself outgrowing its default SQLite database if you have too many people trying to use your Grafana instance at once.</p>
<p>In my case, this issue of scale presented itself as the SQLite database frequently being unable to look up a user&apos;s session cookie, which would cause users to get logged out despite having a valid session. This is discussed in various Github issues in the Grafana project, with some of the main ones being <a href="https://github.com/grafana/grafana/issues/10727#issuecomment-479378941">issue #10727</a> and <a href="https://github.com/grafana/grafana/issues/15316">issue #15316</a>.</p>
<p>The Grafana project has made some improvements in <a href="https://github.com/grafana/grafana/blob/c53c55391c2ccb83cf7979124b9217fd9161c0ed/CHANGELOG.md">Grafana 6.2.1</a> designed to address the issue I experienced of being frequently logged out, but I still wanted to migrate my database to Postgres as an additional precaution and to have a more scalable database.</p>
<h2 id="thejourney">The Journey</h2>
<p>The first thing I tried when migrating my database was to use the <a href="https://github.com/dimitri/pgloader">pgloader</a> project, which is a nice tool designed for migrating a variety of databases into Postgres. However, there are some column types that are difficult to translate from SQLite to Postgres, particularly the boolean values. Booleans in SQLite are <code>1</code> or <code>0</code> whereas they are <code>true</code> or <code>false</code> in Postgres. If these aren&apos;t translated correctly, Grafana will just implode in on itself and you won&apos;t be able to login.</p>
<p>After discovering the boolean issue, I found the <a href="https://github.com/haron/grafana-migrator">grafana-migrator</a> project on Github, which is designed specifically for migrating Grafana databases. This addressed the boolean issue by changing the column types after the database is initialized, importing the SQLite database, and then changing the column types back by translating the values to ones Postgres will like.</p>
<pre><code class="language-sql">-- alter column type
ALTER TABLE alert ALTER COLUMN silenced TYPE integer USING silenced::integer;

-- translate back after import
ALTER TABLE alert
			ALTER COLUMN silenced TYPE boolean
			USING CASE WHEN silenced = 0 THEN FALSE
				WHEN silenced = 1 THEN TRUE
				ELSE NULL
				END;
</code></pre>
<p>While this project was a good start, it was inconsistently maintained and was essentially just several scripts strung together. To ensure that everything would work the way I wanted it to, I set out to make my own application that would handle the import of the SQLite database to Postgres.</p>
<h2 id="thesolution">The Solution</h2>
<p>Rather than just copying the <a href="https://github.com/haron/grafana-migrator">grafana-migrator</a> project, I wanted to create a more robust method of migrating the Grafana database than just a collection of scripts. Consequently, I decided to create a <a href="https://golang.org/">Go</a> program since that would allow me to compile to a single binary for easier usage.</p>
<p>In addition to just running the DDL statements after sanitizing them, the primary points I addressed in my program are:</p>
<ul>
<li>Automatic resetting of <a href="https://www.w3resource.com/PostgreSQL/postgresql-sequence.php">database sequences</a></li>
<li>Automatic decoding of hex-encoded values</li>
<li>Automatic translation of boolean columns and values</li>
</ul>
<p>The biggest challenge I encountered while building the project was the fact that SQLite dumps the JSON definitions of Grafana dashboards as hex-encoded values, but you cannot just import those hex-encoded strings into Postgres as they will not be recognized. The HexDecode function I wrote has some inefficiencies (such as converting <code>decoded</code> to a string so that it can be wrapped in single quotes again); however, the performance hit is negligible due to the scale of the project.</p>
<pre><code class="language-go">// HexDecode takes a file path containing a SQLite dump and
// decodes any hex-encoded data it finds.
func HexDecode(dumpFile string) error {
	re := regexp.MustCompile(`(?m)X\&apos;([a-fA-F0-9]+)\&apos;`)
	re2 := regexp.MustCompile(`&apos;`)
	data, err := ioutil.ReadFile(dumpFile)
	if err != nil {
		return err
	}

	// Define a function to actually decode hexstring.
	decodeHex := func(hexEncoded []byte) []byte {
		// Find the regex submatch in the argument passed to the function
		// then decode the submatch.
		decoded, err := hex.DecodeString(string(re.FindSubmatch(hexEncoded)[1]))
		if err != nil {
			logrus.Fatalf(&quot;Failed to decode hex-string in: %s&quot;, hexEncoded)
		}
		decoded = re2.ReplaceAll(decoded, []byte(`&apos;&apos;`))

		// Surround decoded string with single quotes again.
		return []byte(`&apos;` + string(decoded) + `&apos;`)
	}

	// Replace regex matches from the dumpFile using the `decodeHex` function defined above.
	sanitized := re.ReplaceAllFunc(data, decodeHex)
	return ioutil.WriteFile(dumpFile, sanitized, 0644)
}
</code></pre>
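For illustration, the same transformation can be sketched in Python (assuming UTF-8 content; the Go version above is what the tool actually uses):

```python
import re

def hex_decode(dump_sql: str) -> str:
    # Replace each SQLite X'...' hex literal with its decoded text wrapped
    # in single quotes, doubling any embedded quotes so Postgres accepts it.
    pattern = re.compile(r"X'([a-fA-F0-9]+)'")

    def decode(match):
        text = bytes.fromhex(match.group(1)).decode("utf-8")
        return "'" + text.replace("'", "''") + "'"

    return pattern.sub(decode, dump_sql)

print(hex_decode("INSERT INTO dashboard VALUES (X'6869');"))
```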
<p>The fruit of all this effort can be found on my Github in the <a href="https://github.com/wbh1/grafana-sqlite-to-postgres">grafana-sqlite-to-postgres</a> repository. I have successfully used this app to migrate multiple Grafana databases, including our production Grafana database used by 50+ users with 50+ dashboards. As such, I would consider it safe for production use but you should maintain a copy of your <code>grafana.db</code> SQLite database just in case.</p>
<p>You can find usage instructions in the <a href="https://github.com/wbh1/grafana-sqlite-to-postgres/blob/master/README.md">README</a>, but feel free to open up an issue if you run into any problems.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Deduplicating Prometheus Blackbox ICMP checks with File Based Service Discovery]]></title><description><![CDATA[Most of my scrape targets come from service discovery. This poses a challenge of deduplicating ICMP probes for hosts with multiple scrape endpoints.]]></description><link>https://wbhegedus.me/deduplicating-prometheus-icmp-checks-with-file-based-service-discovery/</link><guid isPermaLink="false">609ecc936393191a38fe5482</guid><category><![CDATA[prometheus]]></category><category><![CDATA[monitoring]]></category><dc:creator><![CDATA[Will Hegedus]]></dc:creator><pubDate>Fri, 21 Sep 2018 21:29:36 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>In my Prometheus setup, most of my scrape targets come from a <code>file_sd</code> service discovery configuration. This poses a challenge of deduplicating ICMP probes from the <a href="https://github.com/prometheus/blackbox_exporter">Blackbox Exporter</a> for hosts with multiple scrape endpoints.</p>
<p>Most of my <code>file_sd</code> targets have JSON files that look like this:</p>
<pre><code class="language-json">[
  {
    &quot;targets&quot;: [ &quot;foo.example.com:9100&quot;,&quot;bar.example.com:9100&quot; ],
    &quot;labels&quot;: {
      &quot;env&quot;: &quot;prod&quot;,
      &quot;job&quot;: &quot;node&quot;,
      &quot;service&quot;: &quot;foobar&quot;
    }
  },
  {
    &quot;targets&quot;: [ &quot;foo.example.com:9104&quot; ],
    &quot;labels&quot;: {
      &quot;env&quot;: &quot;prod&quot;,
      &quot;job&quot;: &quot;mysql&quot;,
      &quot;service&quot;: &quot;barfoo&quot;
    }
  }
]
</code></pre>
<p>When setting up the Blackbox Exporter in a &quot;standard&quot; way, it uses <code>relabel_configs</code> to take the target&apos;s URL (including the port) and put that into the <code>__param_target</code> label. For example:</p>
<pre><code class="language-yaml">  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
</code></pre>
<p>This means that <strong>foo.example.com</strong> would get scraped twice because it has two different <code>__address__</code> labels (one on port 9100 and one on port 9104).</p>
<p>To prevent this, we can use a regex to strip the port, preventing the creation of multiple ICMP metrics for a single host. Simply update the part of the <code>relabel_configs</code> configuration that sets the <code>__param_target</code> to look like this:</p>
<pre><code class="language-yaml">  relabel_configs:
    - source_labels: [__address__]
      regex: (.*?)(:[0-9]+)?
      target_label: __param_target
      replacement: ${1}
</code></pre>
<p>This regex will take everything up to the colon (the URL without the port) and save that in a capture group ( <code>${1}</code> ), which we then use as the <code>__param_target</code> label.</p>
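You can sanity-check the capture behavior with Python's `re` module; `fullmatch` approximates Prometheus's fully-anchored relabel matching (hostnames are just examples):

```python
import re

# Prometheus fully anchors relabel regexes; re.fullmatch mirrors that.
RELABEL_REGEX = re.compile(r"(.*?)(:[0-9]+)?")

def strip_port(address: str) -> str:
    # Capture group 1 is the address without its port, i.e. what the
    # ${1} replacement in the relabel_configs above produces.
    return RELABEL_REGEX.fullmatch(address).group(1)

print(strip_port("foo.example.com:9100"))
print(strip_port("foo.example.com:9104"))
print(strip_port("bar.example.com"))
```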
<p>Now you can update your config, reload Prometheus, and you&apos;ll see in the Targets page that it&apos;s no longer duplicating targets on your ICMP job.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Configure Nginx to Export Prometheus-formatted Metrics]]></title><description><![CDATA[How to build Nginx from source with an integration to serve Prometheus-formatted metrics]]></description><link>https://wbhegedus.me/configure-nginx-to-export-prometheus-formatted-metrics/</link><guid isPermaLink="false">609ecc936393191a38fe5481</guid><category><![CDATA[prometheus]]></category><category><![CDATA[nginx]]></category><dc:creator><![CDATA[Will Hegedus]]></dc:creator><pubDate>Wed, 12 Sep 2018 00:48:22 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>The first thing I wanted to do upon setting up my Kubernetes cluster at home was to configure monitoring for it with Prometheus. Since I didn&apos;t really have any pods spun up at the time, I turned to what I did have -- this blog.</p>
<p>This blog runs on the <a href="https://github.com/TryGhost/Ghost">Ghost</a> blogging engine, which is served behind an <a href="https://www.nginx.com/">Nginx</a> reverse proxy. Nginx has a built-in <a href="http://nginx.org/en/docs/http/ngx_http_stub_status_module.html">&quot;stub_status&quot;</a> page, but it gives you very little information. To get decent built-in monitoring/metrics support, you are directed to purchase NGINX Plus, which I&apos;m obviously not the target audience for.</p>
<p>Luckily, there&apos;s a GitHub project to fill the void: <a href="https://github.com/vozlt/nginx-module-vts">nginx-module-vts</a> provides an Nginx module that exposes host traffic statistics. However, it <em>does</em> involve compiling Nginx from source.</p>
<h1 id="instructions">Instructions</h1>
<p><em>Note: This guide is for Ubuntu systems and is a conglomeration of my experience and various resources I used, but the most useful one is located <a href="https://serversforhackers.com/c/compiling-third-party-modules-into-nginx">here</a> if you want more info.</em></p>
<ol>
<li>First things first, you&apos;ll want to set up your workspace. So go ahead and make a directory that you can work out of. For the purposes of this demonstration, it will be <code>~/rebuildingnginx</code>
<ul>
<li>
<pre><code class="language-bash">  mkdir ~/rebuildingnginx
</code></pre>
</li>
</ul>
</li>
<li><code>cd</code> to that directory and clone the VTS module from Github
<ul>
<li>
<pre><code class="language-bash">  cd ~/rebuildingnginx
  git clone https://github.com/vozlt/nginx-module-vts.git
</code></pre>
</li>
</ul>
</li>
<li>Add the Nginx repository, if you haven&apos;t already
<ul>
<li>
<pre><code class="language-bash">  sudo add-apt-repository -y ppa:nginx/stable
</code></pre>
</li>
</ul>
</li>
<li>Make sure that <strong>deb-src</strong> isn&apos;t commented out in the repository file
<ul>
<li>
<pre><code class="language-bash">  cat /etc/apt/sources.list.d/nginx-ubuntu-stable-xenial.list
  
deb http://ppa.launchpad.net/nginx/stable/ubuntu xenial main
deb-src http://ppa.launchpad.net/nginx/stable/ubuntu xenial main
</code></pre>
</li>
</ul>
</li>
<li>Do package manager things
<ul>
<li>
<pre><code class="language-bash">  # Update package lists
  sudo apt-get update
  
  # Install dpkg-dev to create package
  sudo apt-get install -y dpkg-dev

  # Get Nginx source files
  sudo apt-get source nginx

  # Install the build dependencies for nginx
  sudo apt-get build-dep nginx
</code></pre>
</li>
</ul>
</li>
<li>Edit the <code>rules</code> file in the source code
<ul>
<li>
<pre><code class="language-bash">  vim ~/rebuildingnginx/nginx-1.14.0/debian/rules
</code></pre>
</li>
<li><em>Note: the exact path may differ depending on what version of the Nginx source you downloaded</em></li>
<li>Since we&apos;re going to install the nginx-full version, we&apos;re going to append the build flag to the <code>full_configure_flags</code> section of the file</li>
<li>Go ahead and add an <code>--add-module=/root/nginx-module-vts</code> to the end of that list of arguments, adjusting the path to wherever you cloned the module (use an absolute path, since <code>~</code> isn&apos;t expanded inside the rules file)</li>
<li>It should look something like:<pre><code class="language-bash">full_configure_flags := \
                   $(common_configure_flags) \
                   --with-http_addition_module \
                   --with-http_geoip_module=dynamic \
                   --with-http_gunzip_module \
                   --with-http_gzip_static_module \
                   --with-http_image_filter_module=dynamic \
                   --with-http_sub_module \
                   --with-http_xslt_module=dynamic \
                   --with-stream=dynamic \
                   --with-stream_ssl_module \
                   --with-stream_ssl_preread_module \
                   --with-mail=dynamic \
                   --with-mail_ssl_module \
                   --add-dynamic-module=$(MODULESDIR)/http-auth-pam \
                   --add-dynamic-module=$(MODULESDIR)/http-dav-ext \
                   --add-dynamic-module=$(MODULESDIR)/http-echo \
                   --add-dynamic-module=$(MODULESDIR)/http-upstream-fair \
                   --add-dynamic-module=$(MODULESDIR)/http-subs-filter \
                   --add-module=/root/nginx-module-vts
</code></pre>
</li>
</ul>
</li>
<li>Now we can start the actual build process
<ul>
<li>
<pre><code class="language-bash">  # Again, this path may differ depending on your version
  cd ~/rebuildingnginx/nginx-1.14.0
  sudo dpkg-buildpackage -b
</code></pre>
</li>
<li>This takes a <strong>LONG</strong> time, so go find something fun to do and come back later</li>
</ul>
</li>
<li>Now that it&apos;s built, we can actually install Nginx
<ul>
<li>The build process put a bunch of <code>.deb</code> files in our <code>~/rebuildingnginx</code> directory, but the only one we need to care about is <code>nginx-full_1.14.0-0+xenial1_amd64.deb</code> or whatever the equivalent is for you depending on your Ubuntu &amp; Nginx versions.</li>
<li>
<pre><code class="language-bash">  cd ~/rebuildingnginx
  sudo dpkg --install nginx-full_1.14.0-0+xenial1_amd64.deb
</code></pre>
</li>
</ul>
</li>
<li>Nginx should be installed now! Next, we need to configure the VTS module for our metrics</li>
<li>Open your Nginx configuration file for some light editing
<ul>
<li>
<pre><code class="language-bash">  sudo vim /etc/nginx/nginx.conf
</code></pre>
</li>
<li>I&apos;ve elided the parts of the configuration that aren&apos;t relevant here with <code>[...]</code>, but your configuration file should be updated to include the following info:</li>
<li>
<pre><code class="language-nginx">  user www-data;
  worker_processes auto;
  pid /run/nginx.pid;
  include /etc/nginx/modules-enabled/*.conf;

  events {
      worker_connections 768;
      # multi_accept on;
  }

  http {

      [...]

      ##
      # VTS Settings
      ##
      vhost_traffic_status_zone;
      vhost_traffic_status_dump /var/log/nginx/vts.db;

      server {
        listen 8080;
            server_name wbhegedus.me;

        if ($time_iso8601 ~ &quot;^(\d{4})-(\d{2})-(\d{2})&quot;) {
          set $year $1;
          set $month $2;
          set $day $3;
        }

        vhost_traffic_status_filter_by_set_key $year year::$server_name;
        vhost_traffic_status_filter_by_set_key $year-$month month::$server_name;
        vhost_traffic_status_filter_by_set_key $year-$month-$day day::$server_name;

        location /status {
          vhost_traffic_status_bypass_limit on;
          vhost_traffic_status_bypass_stats on;
          vhost_traffic_status_display;
          vhost_traffic_status_display_format html;
        }
      }	
     [...]
  }
</code></pre>
</li>
</ul>
</li>
<li>Once this is complete, you can reload your Nginx config
<ul>
<li>
<pre><code class="language-bash">   # Check for a valid config first
   nginx -t
   
   # If that command returns fine, go ahead and reload
   sudo systemctl reload nginx
</code></pre>
</li>
</ul>
</li>
<li>If everything worked correctly, you should be able to access Prometheus-formatted metrics at <code>localhost:8080/status/format/prometheus</code> on your Nginx box
<ul>
<li>You can add this target to your Prometheus config the same as you would any other endpoint.</li>
<li>For example:</li>
</ul>
<pre><code class="language-yaml">- job_name: ghost_blog
  scheme: http
  metrics_path: /status/format/prometheus
  static_configs:
    - targets:
        - wbhegedus.me:8080
</code></pre>
</li>
</ol>
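<p>If you&apos;d rather script against that endpoint than eyeball it, the Prometheus text format is simple enough to pull apart by hand. A rough sketch (the sample line is made up for illustration rather than copied from VTS output, and the label parsing ignores escaped quotes and commas inside label values):</p>
<pre><code class="language-python">sample = 'nginx_vts_server_requests_total{host="wbhegedus.me",code="2xx"} 1027'

def parse_sample(line):
    # Split a simple Prometheus text-format line into name, labels, and value
    name_and_labels, value = line.rsplit(" ", 1)
    name, _, raw_labels = name_and_labels.partition("{")
    labels = {}
    for pair in raw_labels.rstrip("}").split(","):
        key, _, val = pair.partition("=")
        labels[key] = val.strip('"')
    return name, labels, float(value)

name, labels, value = parse_sample(sample)
print(name, labels["code"], value)  # nginx_vts_server_requests_total 2xx 1027.0
</code></pre>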
<p>Enjoy the wonder of Prometheus and let me know in the comments if you run into any issues!</p>
<!--kg-card-end: markdown-->]]></content:encoded></item></channel></rss>