<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Red Hunting Cap]]></title><description><![CDATA[Catcher in the .py]]></description><link>https://wbhegedus.me/</link><image><url>https://wbhegedus.me/favicon.png</url><title>Red Hunting Cap</title><link>https://wbhegedus.me/</link></image><generator>Ghost 4.48</generator><lastBuildDate>Thu, 09 Apr 2026 12:42:10 GMT</lastBuildDate><atom:link href="https://wbhegedus.me/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Alerting on Missing Data in Prometheus]]></title><description><![CDATA[<p>Alerting on missing data in Prometheus is commonly handled by the <code>absent</code> function, but that&apos;s really only useful when you know the labels you expect to be there ahead of time. How can you dynamically alert on missing data then?</p><p>By using the <code>unless</code> operator, you can return</p>]]></description><link>https://wbhegedus.me/alerting-on-missing-data-in-prometheus/</link><guid isPermaLink="false">63864c582a9cc502e639c09b</guid><dc:creator><![CDATA[Will Hegedus]]></dc:creator><pubDate>Tue, 29 Nov 2022 18:23:00 GMT</pubDate><content:encoded><![CDATA[<p>Alerting on missing data in Prometheus is commonly handled by the <code>absent</code> function, but that&apos;s really only useful when you know the labels you expect to be there ahead of time. How can you dynamically alert on missing data then?</p><p>By using the <code>unless</code> operator, you can return a set of labels only when a different matching metric does not exist. For example,</p><pre><code class="language-promql">  group without (instance) (up{job=&quot;blackbox_http_2xx&quot;})
unless
  count without (instance) (probe_http_status_code{job=&quot;blackbox_http_2xx&quot;} == 200)</code></pre><p>Only if there are no HTTP 200s for the label set that results from the <code>group</code> query will the alert fire. In my environment, it fires with a label set similar to this:</p><pre><code class="language-promql">{job=&quot;blackbox_http_2xx&quot;, environment=&quot;production&quot;, cluster=&quot;clusterA&quot;, service=&quot;website&quot;}</code></pre><p>Having these extra labels can be extremely useful in your Alertmanager routing configs and any templating you do, which is why I strive to keep as many labels as possible when designing alerting rules.</p>]]></content:encoded></item><item><title><![CDATA[Terminating Prometheus Exporter TLS with Vector]]></title><description><![CDATA[<p>I&apos;ve recently been digging into using <a href="https://vector.dev/">Vector</a> more for collecting telemetry from systems since it can pull from a variety of sources (logs and metrics are what I&apos;m most concerned about) and spit them out to a variety of &quot;sinks&quot;.</p><p>One of the use</p>]]></description><link>https://wbhegedus.me/terminate-tls-with-vector/</link><guid isPermaLink="false">61c347ea6393191a38fe5543</guid><dc:creator><![CDATA[Will Hegedus]]></dc:creator><pubDate>Wed, 22 Dec 2021 15:51:14 GMT</pubDate><content:encoded><![CDATA[<p>I&apos;ve recently been digging into using <a href="https://vector.dev/">Vector</a> more for collecting telemetry from systems since it can pull from a variety of sources (logs and metrics are what I&apos;m most concerned about) and spit them out to a variety of &quot;sinks&quot;.</p><p>One of the use cases I have for Vector is to have it aggregate multiple Prometheus exporters on a single host and expose them all under a single port/endpoint. Previously, I used a reverse proxy for this, which had its uses but was also overkill. 
However, it did provide me with the benefit of putting my exporters behind HTTPS, which I did not know was also possible with Vector.</p><p>Vector&apos;s docs for the <a href="https://vector.dev/docs/reference/configuration/sinks/prometheus_exporter/">Prometheus Exporter sink</a> do not mention it at the time of this writing, but Vector actually does support listening for SSL connections on the Prometheus exporter it exposes. To do so, simply add a <code>tls</code> object to your sink&apos;s configuration. Example below:</p><!--kg-card-begin: html--><script src="https://gist.github.com/wbh1/47016f1a04a46ed97c437479329700cd.js"></script><!--kg-card-end: html-->]]></content:encoded></item><item><title><![CDATA[Using Linode Object Storage with Thanos]]></title><description><![CDATA[How to send your Prometheus TSDB blocks to Linode for long term storage.]]></description><link>https://wbhegedus.me/using-linode-object-storage-with-thanos/</link><guid isPermaLink="false">60ad5c466393191a38fe54c3</guid><category><![CDATA[thanos]]></category><dc:creator><![CDATA[Will Hegedus]]></dc:creator><pubDate>Tue, 25 May 2021 20:47:45 GMT</pubDate><content:encoded><![CDATA[<p><a href="https://thanos.io/">Thanos</a> is an amazing tool for extending and scaling the functionality of Prometheus. One of the core features it provides is the ability to back up your TSDB data into an object storage bucket to be queried later on by the <a href="https://thanos.io/tip/components/store.md/">Thanos Store</a> component.</p><p>Thanos supports a variety of storage backends, including S3. 
Linode offers an <a href="https://www.linode.com/products/object-storage/">object storage service</a> with an S3-compatible API, and it&apos;s easy to get started with it and begin sending data from Thanos/Prometheus into it.</p><p>This presumes that you already set up a <a href="https://thanos.io/tip/components/sidecar.md/">Thanos Sidecar</a> to run alongside your Prometheus deployment and it has access to the directory in which the TSDB writes its blocks. Setting that up is outside of the scope of this post, but it is simple to get up and running.</p><p>On the Linode side, first login to your account and sign up for the Object Storage service. Please note that this comes at a flat rate of $5 in order to reserve your minimum storage space of 250GB. You can get signed up just by creating your first bucket.</p><figure class="kg-card kg-image-card"><img src="https://wbhegedus.me/content/images/2021/05/Screen-Shot-2021-05-25-at-4.36.42-PM.png" class="kg-image" alt loading="lazy" width="2000" height="1095" srcset="https://wbhegedus.me/content/images/size/w600/2021/05/Screen-Shot-2021-05-25-at-4.36.42-PM.png 600w, https://wbhegedus.me/content/images/size/w1000/2021/05/Screen-Shot-2021-05-25-at-4.36.42-PM.png 1000w, https://wbhegedus.me/content/images/size/w1600/2021/05/Screen-Shot-2021-05-25-at-4.36.42-PM.png 1600w, https://wbhegedus.me/content/images/size/w2400/2021/05/Screen-Shot-2021-05-25-at-4.36.42-PM.png 2400w" sizes="(min-width: 720px) 720px"></figure><p>After your bucket is created, at the home page for Object Storage, click on the &quot;Access Keys&quot; tab and create an Access Key. I recommend limiting access to just the bucket you just created, but that is up to you. Thanos Sidecar will need read <em>and</em> write in order to do its job. 
</p><p>Once that Access Key is created, make sure you save the secret key and the access key &#x2013; we&apos;ll need to put them in a config file for Thanos Sidecar.</p><p>Go ahead and open your text editor of choice and create a file named something like <code>linode.yml</code> with the contents below. Modify <code>endpoint</code> based on your datacenter &#x2013; inferable from the link displayed under your bucket name in the GUI. This is the least intuitive part &#x2013; the URL presented is a full path directly to the bucket (e.g. <code>thanos-snap.us-east-1.linodeobjects.com</code>), but Thanos won&apos;t like that. So just strip out the bucket name from the URL for the <code>endpoint</code> and modify <code>bucket</code> based on your bucket name.</p><pre><code class="language-yaml">type: S3
config:
  bucket: &quot;thanos-snap&quot;
  endpoint: &quot;us-east-1.linodeobjects.com&quot;
  access_key: &quot;YOUR_ACCESS_KEY_HERE&quot;
  insecure: false
  secret_key: &quot;YOUR_SECRET_KEY_HERE&quot;
  signature_version2: false</code></pre><p>Now you can point your Thanos Sidecar to this configuration file by adding the <code>--objstore.config-file=linode.yml</code> flag, and it should automatically start uploading TSDB blocks.</p><p>One last thing to note is to make sure that you followed the instructions in setting up the Thanos Sidecar so that Prometheus doesn&apos;t compact the blocks &#x2013; you want another Thanos component, <a href="https://thanos.io/tip/components/compact.md/">Thanos Compactor</a> to do the compaction. The good news, though, is that you can reuse that <code>linode.yml</code> file when specifying the <code>--objstore.config-file</code> flag for Thanos Compactor (and any other components that connect to block storage).</p>]]></content:encoded></item><item><title><![CDATA[Configuring Podman for WSL2]]></title><description><![CDATA[How to set up Podman to run in Windows Subsystem for Linux 2 (WSL2)]]></description><link>https://wbhegedus.me/running-podman-on-wsl2/</link><guid isPermaLink="false">609ecc936393191a38fe548c</guid><category><![CDATA[containers]]></category><dc:creator><![CDATA[Will Hegedus]]></dc:creator><pubDate>Wed, 31 Mar 2021 20:48:49 GMT</pubDate><content:encoded><![CDATA[<p>This post is intended to serve as a sort of update to Red Hat&apos;s (now outdated since v3 of Podman) <a href="https://www.redhat.com/sysadmin/podman-windows-wsl2">blog post</a> on how to run Podman in WSL2.</p><p>The commands for the below presume that you are running Ubuntu 20.04 or higher, but the WSL2 specific configuration at the end is independent of which Linux distro you are using.</p><p>You can find the specific installation dependencies and commands for your distro in the <a href="https://podman.io/getting-started/installation.html">Podman docs</a>.</p><h2 id="basic-install">Basic Install</h2><p>Here are the commands I used in order to install Podman on Ubuntu.</p><p>First, you must determine the version of Ubuntu you&apos;re running, if you 
don&apos;t already know. Then you export it to an environment variable (I named mine <code>VERSION_ID</code>) so that you can use it when adding repos from the <a href="https://kubic.opensuse.org/">Kubic project</a>, which was necessary for my specific OS version.</p><!--kg-card-begin: html--><pre class="command-line" data-user="caulfield" data-host="pencey" data-output="2-5">
<code class="language-bash">cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION=&quot;Ubuntu 20.04.2 LTS&quot;
export VERSION_ID=&quot;20.04&quot;
echo &quot;deb https://download.opensuse.org/repositories/devel:/kubic:/libcontainers:/stable/xUbuntu_${VERSION_ID}/ /&quot; | sudo tee /etc/apt/sources.list.d/devel:kubic:libcontainers:stable.list
curl -L https://download.opensuse.org/repositories/devel:/kubic:/libcontainers:/stable/xUbuntu_${VERSION_ID}/Release.key | sudo apt-key add -
sudo apt-get update
sudo apt-get -y upgrade
sudo apt-get -y install podman</code>
</pre><!--kg-card-end: html--><h2 id="wsl2-specifics">WSL2 Specifics</h2><p>OK - now comes the WSL2 specific stuff that you need to change in order for Podman to work. Basically, you just have to change the systemd-specific stuff (default) to non-systemd stuff.</p><p>With version 3+ of Podman, this can all be done in one file.</p><p>First, make the requisite directory. Mine was not automatically created, but YMMV. Then, create a <code>containers.conf</code> file in that directory.</p><!--kg-card-begin: html--><pre class="command-line" data-user="caulfield" data-host="pencey">
<code class="language-bash">mkdir -p ~/.config/containers
vim ~/.config/containers/containers.conf</code>
</pre><!--kg-card-end: html--><p>Inside that file, simply add the following:</p><pre><code class="language-toml">[engine]
events_logger=&quot;file&quot;
cgroup_manager=&quot;cgroupfs&quot;</code></pre><p>The default <code>events_logger</code> is <strong>journald</strong> and the default <code>cgroup_manager</code> is <strong>systemd</strong>, in case you were curious.</p><p>Now Podman should run for you with no issues, and you don&apos;t need to run that clunky Docker daemon in Windows anymore &#x1F389;</p>]]></content:encoded></item><item><title><![CDATA[Monitoring Windows Server Memory Pressure in Prometheus]]></title><description><![CDATA[Windows does not distinguish between major and minor page faults in its performance counters. Consequently, you have to do a little bit of extra work to determine how often the major page faults are occurring.]]></description><link>https://wbhegedus.me/monitoring-windows-server-memory-pressure-in-prometheus/</link><guid isPermaLink="false">609ecc936393191a38fe548b</guid><category><![CDATA[prometheus]]></category><dc:creator><![CDATA[Will Hegedus]]></dc:creator><pubDate>Mon, 09 Nov 2020 17:22:03 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1542978709-19c95dc3bc7e?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1080&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1542978709-19c95dc3bc7e?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1080&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" alt="Monitoring Windows Server Memory Pressure in Prometheus"><p>A common alert to see in Prometheus monitoring setups for Linux hosts is something to do with high memory pressure, which is determined by having both 1) a low amount of available RAM, and 2) a high amount of <em>major</em> page faults.</p><p>For example, Gitlab is kind and gracious enough to make the alerting rules they use available to the public, and their alert looks like <a href="https://gitlab.com/gitlab-com/runbooks/blob/master/rules/node.yml#L166">this</a>:</p><pre><code 
class="language-yaml">  - alert: HighMemoryPressure
    expr: instance:node_memory_available:ratio * 100 &lt; 5 and rate(node_vmstat_pgmajfault[1m]) &gt; 1000
    for: 15m
    labels:
      severity: s4
      alert_type: cause
    annotations:
      description: The node is under heavy memory pressure.  The available memory is under 5% and
        there is a high rate of major page faults.
      runbook: docs/monitoring/node_memory_alerts.md
      value: &apos;Available memory {{ $value | printf &quot;%.2f&quot; }}%&apos;
      title: Node is under heavy memory pressure
</code></pre><p>This is a great way to reduce noise and increase the quality of the alert, because you are increasing your chances that when the alert fires there is actually a problem. Your server might be happily chugging along at 5% memory available, but if it also starts paging heavily, there is likely a real problem.</p><p>Unfortunately, however, Windows <a href="https://techcommunity.microsoft.com/t5/ask-the-performance-team/the-basics-of-page-faults/ba-p/373120">does not distinguish</a> between major and minor page faults in its performance counters, which is what the <a href="https://github.com/prometheus-community/windows_exporter">windows_exporter</a> collects from. Consequently, you have to do a little bit of extra work to determine how often the major page faults are occurring.</p><p>In order to determine the rate of major faults, you have to combine the metrics <code>windows_memory_swap_page_operations_total</code> and <code>windows_memory_swap_page_reads_total</code>. The operations counter will increase no matter the type of page operation (minor or major). The reads counter will increase when data has to be read out of the pagefile into memory. </p><p>With this in mind, I devised the following query that we are now using in our production monitoring for Windows.</p><!--kg-card-begin: markdown--><pre><code class="language-promql">instance:windows_memory_available:ratio * 100 &lt; 5
and (rate(windows_memory_swap_page_operations_total[2m]) &gt; 1000)
and (
    (rate(windows_memory_swap_page_reads_total[2m]) / rate(windows_memory_swap_page_operations_total[2m])) &gt; 0.5
)
</code></pre>
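<p>For context, the expression above can be embedded in a complete Prometheus alerting rule. The sketch below is illustrative only: the rule name, <code>for</code> duration, and annotations are my assumptions, not part of the original post, and the <code>instance:windows_memory_available:ratio</code> recording rule is assumed to already exist.</p>

```yaml
groups:
  - name: windows-memory
    rules:
      - alert: WindowsHighMemoryPressure   # illustrative name
        expr: |
          instance:windows_memory_available:ratio * 100 < 5
          and (rate(windows_memory_swap_page_operations_total[2m]) > 1000)
          and (
              (rate(windows_memory_swap_page_reads_total[2m]) / rate(windows_memory_swap_page_operations_total[2m])) > 0.5
          )
        for: 15m        # illustrative duration
        labels:
          severity: warning
        annotations:
          description: Available memory is under 5% and a high share of page operations are major faults.
```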
<!--kg-card-end: markdown--><p>For this alert to fire, 3 things need to happen:</p><ol><li>Windows memory available has to be less than 5%.</li><li>There must be a high rate of page operations occurring.</li><li>The proportion of page reads vs total page operations must be at least 50%.</li></ol><p>All of these thresholds are arbitrary and can be adjusted as fits your needs. These thresholds just happened to fit well for us.</p><p>For more information on the distinction between minor and major page faults, the <a href="https://en.wikipedia.org/wiki/Page_fault">Wikipedia article</a> on Page Faults explains it better than I ever could.</p>]]></content:encoded></item><item><title><![CDATA[Demystifying Kubernetes CPU Limits (and Throttling)]]></title><description><![CDATA[How can a pod have its CPU throttled for more than 1 second in a 1 second window? Let's find out.]]></description><link>https://wbhegedus.me/understanding-kubernetes-cpu-limits/</link><guid isPermaLink="false">609ecc936393191a38fe548a</guid><dc:creator><![CDATA[Will Hegedus]]></dc:creator><pubDate>Fri, 23 Oct 2020 18:53:18 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1592664474483-a62364f738d8?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1080&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1592664474483-a62364f738d8?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1080&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" alt="Demystifying Kubernetes CPU Limits (and Throttling)"><p>Recently, I&apos;ve been doing some investigation into high CPU utilization that occurs during routine security scans of our Wordpress websites, causing issues such as slow response, increased errors, and other undesirable outcomes. 
This is typically limited to a single pod&#x2013;the one the scanner randomly gets routed to&#x2013;but can still be user-visible (and Pagerduty-activating &#x1F605;), so we want to get better monitoring on it.</p><h2 id="initial-investigation">Initial Investigation</h2><p>Like anyone else in IT investigating something they&apos;re not sure of, I turned first to Google. I sought out what other people are doing to monitor CPU usage of pods in Kubernetes. This is what first led me to discover that it&apos;s actually far more useful to monitor how much the CPU is being <em>throttled</em> rather than how much it&apos;s being <em>used</em>.</p><p>I already knew of the <a href="https://github.com/kubernetes-monitoring/kubernetes-mixin/">kubernetes-mixin</a> project, which provides sane default Prometheus alerting rules for monitoring Kubernetes cluster health, so I looked there first to see what rules they are using to monitor CPU. Currently, the only CPU <em>usage</em> alert bundled in is &quot;<a href="https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/64aa37e837b0e93bfc6fab9430f57bd7366e5a83/alerts/resource_alerts.libsonnet#L149">CPUThrottlingHigh</a>&quot;, which calculates <code>number_of_cpu_cycles_pod_gets_throttled / number_of_cpu_cycles_total</code> (not actual metric names) to give you a percentage of how frequently your pod is getting its CPU throttled. </p><p>But wait, what does <em>throttled</em> even mean? 
Throttled (at least in my mind) means something along the lines of just getting slowed down, but in this case throttled means completely stopped &#x2013; you cannot use any more CPU until the next CFS period (every 100ms in Kubernetes, which is also the <a href="https://www.kernel.org/doc/html/latest/scheduler/sched-bwc.html#management">Linux default</a> - more on this later).</p><p>While abstractly this seems pretty cut and dry, it gets more confusing when you&apos;re actually looking at it in practice on production servers with tons of CPU cores.</p><h2 id="conceptualizing">Conceptualizing</h2><p><em>For the purposes of this article, I&apos;ll be referring to a server with 128 CPU cores running a pod with a CPU limit of 4.0.</em></p><p>If you are not already familiar with the concept of <a href="https://www.noqcks.io/notes/2016/12/14/kubernetes-understanding-millicores/">millicores</a>, suffice to say that 1 millicore = 1/1000th of a CPU&apos;s time (1000 millicores = 1 whole core). This is the metric used to define CPU requests/limits in Kubernetes. Our example pod has a limit of 4.0, which is 4,000 millicores, or 4 whole cores&apos; worth of work capability.</p><p>But how does the operating system kernel even enforce this measure? If you&apos;re familiar with how Linux containers work, you probably have heard of <a href="https://en.wikipedia.org/wiki/Cgroups">cgroups</a>. Cgroups, put simply, are a way to isolate and control groups of processes such that they have no awareness of the other processes also running on the same server. It&apos;s why when you run a Docker container, it thinks that its ENTRYPOINT + CMD is PID 1.</p><p>Among other things, cgroups use the Linux CFS (Completely Fair Scheduler) to set and enforce resource limits on groups of processes, e.g. our pods in Kubernetes. It does this by setting a quota and a period. A quota is how much CPU time you can use during a given period. 
Once you use up your quota, you are &quot;throttled&quot; until the next period when you can begin using CPU again.</p><p>Going back to our discussion on millicores, this means that in every 100ms <code>cfs_period</code> in the operating system, we get 400ms of usage allowed. The reason why we get 400ms in a 100ms time frame is each core is capable of doing 100ms of work in a 100ms period &#x2013; 100ms x 4 cores = 400ms. This 400ms of work can be broken up in any way - it could translate to 4 vCPUs each doing 100ms of work in a 100ms <code>cfs_period</code>, 8 vCPUs each doing 50ms of work, etc. Remember - CPU limits are based on time, not actual vCPUs.</p><p>Understanding that, the reason for the throttling confusion starts to come into focus. So far as I can comprehend, the theoretical upper bound of throttling is <code>n * (100ms) - limit</code> where <code>n</code> is the number of vCPUs and <code>limit</code> is how many milliseconds of CPU you are allotted in a 100ms window (calculated earlier by <code>cpuLimit * 100ms</code>). This means that the theoretical upper bound for throttling on my 128-core machine is 124 seconds of throttling per second because <code>(128 cores * 100ms - 400ms) * 10 = 124s</code>.</p><p><em>Note: the <u>actual</u> CPU throttling is determined by how many processes you&apos;re running and which core(s) they&apos;re assigned to by the OS scheduler.</em></p><h2 id="putting-it-together">Putting it together</h2><p>Now things started to finally click in my brain. At least... as much as they could, considering I&apos;m still somewhat ignorant of all the nitty-gritty details occurring in the Linux scheduler itself.</p><p>This whole investigation was kicked off by the fact that when I went to use a <code>rate()</code> function on the <code>container_cpu_cfs_throttled_seconds_total</code> metric in Prometheus, the per second rate of throttling was significantly higher than 1s (think closer to 70s per second). 
<em>How can a pod be throttled for more than 1 second in a 1 second window?</em> I wondered. </p><p>Putting all this information together, I now know that the reason for such high throttling was that <code>httpd</code> was spawning additional processes on additional CPU cores, which raises the amount of throttling to significantly higher than the resource limit. </p><h2 id="conclusion">Conclusion</h2><p>With my brain sufficiently exploded, I can now say that we have sufficient monitoring in place to alert us of high CPU usage based on the amount of time the CPU is being throttled. This is the alert we have in place now:</p><!--kg-card-begin: markdown--><pre><code class="language-yaml">    - alert: Wordpress_High_CPU_Throttling
      expr: rate(container_cpu_cfs_throttled_seconds_total{namespace=~&quot;wordpress-.*&quot;}[1m]) &gt; 1
      for: 30m
      labels:
        severity: warning
      annotations:
        message: The {{ $labels.pod_name }} pod in {{ $labels.namespace }} is experiencing a high amount of CPU throttling as a result of its CPU limit.
</code></pre>
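<p>The throttling arithmetic described above can be sketched in a few lines of Python. This is a toy illustration of the formula, not production code:</p>

```python
# Theoretical upper bound on reported throttled-seconds per second:
# n_cores * 100ms of schedulable time per CFS period, minus the quota
# (cpu_limit * 100ms), times 10 CFS periods per second.

CFS_PERIOD_MS = 100  # Kubernetes/Linux default CFS period

def max_throttled_seconds_per_second(n_cores: int, cpu_limit: float) -> float:
    quota_ms = cpu_limit * CFS_PERIOD_MS           # e.g. a limit of 4.0 -> 400ms per period
    throttled_ms_per_period = n_cores * CFS_PERIOD_MS - quota_ms
    periods_per_second = 1000 / CFS_PERIOD_MS      # 10 periods every second
    return throttled_ms_per_period * periods_per_second / 1000

# The post's example: a 128-core machine running a pod limited to 4.0 CPUs
print(max_throttled_seconds_per_second(128, 4.0))  # -> 124.0
```

<p>Remember that this is only the upper bound; the actual throttling depends on how many cores your processes actually land on.</p>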
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><p>In summary:</p>
<ul>
<li>Kubernetes uses a <code>cfs_period_us</code> of 100ms (Linux default)</li>
<li>A CPU limit of 1.0 in k8s represents 100ms of CPU time in a <code>cfs_period</code>
<ul>
<li>Theoretically this is 100% of 1 CPU&apos;s time, but not practically, since pods usually run multiple processes on multiple cores</li>
</ul>
</li>
<li>The upper bound of how many seconds the kernel reports a pod being throttled for is determined by the number of CPU cores that the pod is using.
<ul>
<li>The number of CPU cores you use is not directly related to your CPU limit. It correlates more strongly with the number of processes your pod runs.</li>
</ul>
</li>
</ul>
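<p>To connect the summary above to the underlying kernel counters, here is a small hypothetical Python sketch of deriving a throttled-seconds-per-second rate from two samples of a cgroup v1 <code>cpu.stat</code> file, which reports <code>nr_periods</code>, <code>nr_throttled</code>, and <code>throttled_time</code> (in nanoseconds) &#x2013; the counter cAdvisor exposes as <code>container_cpu_cfs_throttled_seconds_total</code>. The sample values are made up for illustration:</p>

```python
def parse_cpu_stat(text: str) -> dict:
    """Parse the key/value lines of a cgroup v1 cpu.stat file."""
    return {k: int(v) for k, v in (line.split() for line in text.strip().splitlines())}

def throttled_rate(sample_a: str, sample_b: str, interval_s: float) -> float:
    """Seconds throttled per second between two cpu.stat samples."""
    a, b = parse_cpu_stat(sample_a), parse_cpu_stat(sample_b)
    delta_ns = b["throttled_time"] - a["throttled_time"]  # throttled_time is nanoseconds
    return delta_ns / 1e9 / interval_s

# Hypothetical samples taken 60s apart from a busy multi-process pod
t0 = "nr_periods 600\nnr_throttled 580\nthrottled_time 1000000000"
t1 = "nr_periods 1200\nnr_throttled 1170\nthrottled_time 4201000000000"
print(throttled_rate(t0, t1, 60))  # -> 70.0, i.e. 70s of throttling per second
```

<p>A result like 70.0 is only possible because multiple runqueues are throttled simultaneously, which is exactly the confusion described above.</p>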
<!--kg-card-end: markdown--><p>Please feel free to reach out if I got anything wrong, or if you have any questions. I&apos;m available on Twitter <a href="https://twitter.com/wbhegedus">@wbhegedus</a><br></p><h4 id="works-cited">Works Cited</h4><p>These resources were useful to me in my quest for knowledge.</p><!--kg-card-begin: markdown--><ol>
<li><a href="https://www.kernel.org/doc/html/latest/scheduler/sched-bwc.html">CFS Bandwidth Control</a> - Kernel.org</li>
<li><a href="https://engineering.indeedblog.com/blog/2019/12/unthrottled-fixing-cpu-limits-in-the-cloud/">Unthrottled: Fixing CPU Limits in the Cloud</a> - Indeed Engineering</li>
<li><a href="https://github.com/kubernetes/kubernetes/issues/67577">CFS quotas can lead to unnecessary throttling</a> - Kubernetes Github Issue #67577</li>
<li><a href="https://landley.net/kdocs/ols/2010/ols2010-pages-245-254.pdf">CPU bandwidth control for CFS</a> - Academic paper on Linux CFS from Turner, Rao, and Rao</li>
<li><a href="https://github.com/google/cadvisor">cAdvisor Github</a></li>
<li><a href="https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/64aa37e837b0e93bfc6fab9430f57bd7366e5a83/alerts/resource_alerts.libsonnet#L149">Kubernetes Monitoring Mixin CPU Alert</a></li>
<li><a href="https://medium.com/omio-engineering/cpu-limits-and-aggressive-throttling-in-kubernetes-c5b20bd8a718">CPU limits and aggressive throttling in Kubernetes</a> - Omio Engineering</li>
</ol>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Recovering from a major etcd failure]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>Etcd defines a &quot;disastrous&quot; failure as more than (N-1)/2 members being lost &quot;permanently&quot;, in which N signifies the number of cluster members. In order to recover from this type of failure, you will need to essentially create a new etcd cluster.</p>
<p>The following steps presume</p>]]></description><link>https://wbhegedus.me/recovering-from-an-etcd-failure/</link><guid isPermaLink="false">609ecc936393191a38fe5489</guid><dc:creator><![CDATA[Will Hegedus]]></dc:creator><pubDate>Thu, 28 May 2020 15:28:19 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1532264523420-881a47db012d?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1080&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://images.unsplash.com/photo-1532264523420-881a47db012d?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1080&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" alt="Recovering from a major etcd failure"><p>Etcd defines a &quot;disastrous&quot; failure as more than (N-1)/2 members being lost &quot;permanently&quot;, in which N signifies the number of cluster members. In order to recover from this type of failure, you will need to essentially create a new etcd cluster.</p>
<p>The following steps presume a 3-node cluster in which m1, m2, and m3 are 3 Kubernetes masters running etcd as static pods. They encompass what I did in order to restore the etcd portion of one of my Kubernetes clusters.</p>
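<p>The (N-1)/2 threshold can be made concrete with a couple of lines of Python (integer division; a toy illustration, not from any etcd tooling):</p>

```python
def fault_tolerance(n_members: int) -> int:
    """Members an etcd cluster can lose while still keeping quorum: (N-1)/2."""
    return (n_members - 1) // 2

def quorum(n_members: int) -> int:
    """Minimum members required for the cluster to accept writes."""
    return n_members // 2 + 1

# A 3-member cluster needs 2 members for quorum and tolerates losing 1;
# losing 2 of 3 exceeds (N-1)/2 and is "disastrous".
for n in (1, 3, 5):
    print(n, quorum(n), fault_tolerance(n))
```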
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><ol>
<li>Identify the master that is going to be the progenitor of your new cluster. In our case, this will be m1.</li>
<li>Stop etcd on all masters, even m1.
<ol>
<li>Do this by moving the etcd manifest out of the /etc/kubernetes/manifests directory</li>
<li><code>mv /etc/kubernetes/manifests/etcd.yaml /root/etcd.yaml</code></li>
</ol>
</li>
<li>Update the etcd manifest on m1 to force it to create a new cluster
<ol>
<li>Add the <code>--force-new-cluster</code> flag to the command in the manifest</li>
</ol>
</li>
<li>Start etcd on m1
<ol>
<li><code>mv /root/etcd.yaml /etc/kubernetes/manifests/etcd.yaml</code></li>
</ol>
</li>
<li>Verify that the container is started and OK
<ol>
<li><code>docker container ls -a | grep etcd</code></li>
<li><code>docker logs &lt;container_id&gt;</code></li>
</ol>
</li>
<li>Exec into the container on m1 to add a new member to the cluster
<ol>
<li><code>docker exec -it &lt;container_id&gt; /bin/sh</code></li>
<li>
<pre><code class="language-bash"> # Check on the existing members. It should just be m1 right now. Replace m1 with the FQDN of your etcd endpoint

 etcdctl --key-file=/etc/kubernetes/pki/etcd/server.key --cert-file=/etc/kubernetes/pki/etcd/server.crt --endpoints=https://m1:2379 --ca-file=/etc/kubernetes/pki/etcd/ca.crt member list
  
 # Add the first new member. The first argument to &quot;add&quot; is the name of the cluster member. This isn&apos;t terribly important, but make sure you can use it to distinguish cluster members. The second argument is the IP and port for the peer address.
  
 etcdctl --key-file=/etc/kubernetes/pki/etcd/server.key --cert-file=/etc/kubernetes/pki/etcd/server.crt --endpoints=https://m1:2379 --ca-file=/etc/kubernetes/pki/etcd/ca.crt member add m2 https://10.253.5.18:2380
</code></pre>
</li>
</ol>
</li>
<li>After adding back m2, your etcd cluster will be unavailable until you start etcd on m2 since a quorum needs to be established.
<ol>
<li>SSH to m2 and remove the etcd data from the previous cluster
<ul>
<li><code>rm -rf /var/lib/etcd</code></li>
</ul>
</li>
<li>Ensure m2&apos;s etcd manifest has only m1 and m2 in the --initial-cluster flag
<ul>
<li><code>--initial-cluster=m2=https://10.253.5.18:2380,m1=https://10.253.5.17:2380</code></li>
<li>Also ensure that the --initial-cluster-state=existing flag is set</li>
<li>You&apos;ll get an error if the number of nodes specified in initial-cluster is more than the actual number of nodes in the cluster.</li>
</ul>
</li>
<li>Start etcd on m2 using the command from Step 4</li>
<li>On m1, run the &quot;member list&quot; command from above to ensure that m2 joined successfully.
<ul>
<li>If m2 hasn&apos;t joined yet and participated in a leader election, you&apos;ll get an error saying m1 has no leader.</li>
</ul>
</li>
</ol>
</li>
<li>Now that m2 is added, we need to add m3 back in.
<ol>
<li>Add the member
<ul>
<li>Repeat the steps in 6 above, but update the member name and peer address to reflect that of m3</li>
</ul>
</li>
<li>Start etcd on m3
<ul>
<li>Follow the steps from 7 above, but update the <code>--initial-cluster</code> flag to also include m3 now</li>
</ul>
</li>
</ol>
</li>
<li>Verify on m1 that the &quot;member list&quot; command from step 6 now shows all 3 members.
<ul>
<li>Everything should be OK now!</li>
</ul>
</li>
</ol>
<!--kg-card-end: markdown--><h3 id="works-cited-">Works Cited:</h3><ul><li><a href="https://docs.openshift.com/container-platform/3.11/admin_guide/assembly_restore-etcd-quorum.html#cluster-restore-etcd-quorum-static-pod_restore-etcd-quorum">https://docs.openshift.com/container-platform/3.11/admin_guide/assembly_restore-etcd-quorum.html#cluster-restore-etcd-quorum-static-pod_restore-etcd-quorum</a></li><li><a href="https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/recovery.md">https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/recovery.md</a></li><li><a href="https://etcd.io/docs/v3.3.12/op-guide/runtime-configuration/#restart-cluster-from-majority-failure">https://etcd.io/docs/v3.3.12/op-guide/runtime-configuration/#restart-cluster-from-majority-failure</a></li></ul>]]></content:encoded></item><item><title><![CDATA[Nested Active Directory Group Membership in Grafana]]></title><description><![CDATA[<p>I am currently in the process of onboarding several teams into our Grafana environment. While we were just POC&apos;ing Grafana, it was all fine and dandy to just have &quot;Grafana Viewer&quot;, &quot;Grafana Editor&quot;, and &quot;Grafana Admin&quot; groups because not that many people</p>]]></description><link>https://wbhegedus.me/nested-active-directory-group-membership-in-grafana/</link><guid isPermaLink="false">609ecc936393191a38fe5488</guid><dc:creator><![CDATA[Will Hegedus]]></dc:creator><pubDate>Fri, 20 Dec 2019 21:21:23 GMT</pubDate><content:encoded><![CDATA[<p>I am currently in the process of onboarding several teams into our Grafana environment. While we were just POC&apos;ing Grafana, it was all fine and dandy to just have &quot;Grafana Viewer&quot;, &quot;Grafana Editor&quot;, and &quot;Grafana Admin&quot; groups because not that many people would be in any of them. However, as our environment is growing, it has quickly become clear that managing this additional group membership would be a pain. 
</p><p>When I first set up Grafana, I was unable to get nested group membership working but didn&apos;t care enough at the time to troubleshoot much &#x2013; after all, only myself and a few others would be using it. Now that I&apos;ve gone back and actually figured out how it works, I want to share what I&apos;ve learned.</p><p><em>Note: these instructions presume that you are using Active Directory, but the concepts should be transferable to other LDAP providers.</em></p><p>Presumably, you&apos;ve already configured your Grafana environment to use LDAP as your authentication provider with this bit in your configuration file:<br></p><!--kg-card-begin: markdown--><pre><code class="language-ini">[auth.ldap]
enabled = true
config_file = /etc/grafana/ldap.toml
</code></pre>
<!--kg-card-end: markdown--><p>The instructions <a href="https://grafana.com/docs/grafana/latest/auth/ldap/">on Grafana&apos;s site</a> do a good job of getting you up and running with what you&apos;ll need in that <code>/etc/grafana/ldap.toml</code> file. However, we need to expand upon that by not just getting what groups a user is a member of, but also what groups those groups are a member of (nested group membership).</p><p>To do this, you need to specify a <code>group_search_filter</code> in addition to your plain <code>search_filter</code>. It supplements <code>search_filter</code> rather than replacing it &#x2013; both are required. Your <code>group_search_filter</code> is an LDAP query that essentially tells AD to find all groups that a user is a member of within the <code>group_search_base_dns</code>. </p><p>This <code>group_search_filter</code> looks like:</p><pre><code class="language-ldap">(member:1.2.840.113556.1.4.1941:=CN=%s,OU=Users,OU=FSA,DC=example,DC=domain,DC=com)</code></pre><p>Those random-looking numbers are an important OID to enable <a href="https://docs.microsoft.com/en-us/windows/win32/adsi/search-filter-syntax">LDAP_MATCHING_RULE_IN_CHAIN</a>, which is what lets us find nested group memberships, too. </p><p>When it&apos;s configured, it should look like:</p><!--kg-card-begin: markdown--><pre><code class="language-toml">## Group search filter, to retrieve the groups of which the user is a member (only set if memberOf attribute is not available)
group_search_filter = &quot;(member:1.2.840.113556.1.4.1941:=CN=%s,OU=Users,OU=FSA,DC=example,DC=domain,DC=com)&quot;
## An array of the base DNs to search through for groups. Typically uses ou=groups
group_search_base_dns = [&quot;OU=Admin Groups,OU=Security Groups,OU=FSA,DC=example,DC=domain,DC=com&quot;]
group_search_filter_user_attribute = &quot;sAMAccountName&quot;
</code></pre>
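At login time, Grafana substitutes the authenticating user's `group_search_filter_user_attribute` value (here, `sAMAccountName`) into the `%s` placeholder before running the query. A quick Python sketch of that expansion, using a hypothetical username `jdoe`:

```python
# Rough sketch of how Grafana fills in the %s placeholder at login time.
# "jdoe" is a hypothetical sAMAccountName; the DN mirrors the example config.
GROUP_SEARCH_FILTER = (
    "(member:1.2.840.113556.1.4.1941:="
    "CN=%s,OU=Users,OU=FSA,DC=example,DC=domain,DC=com)"
)

def expand_filter(template: str, attribute_value: str) -> str:
    # Grafana substitutes the user's group_search_filter_user_attribute
    # value into the %s placeholder, producing the final LDAP query.
    return template % attribute_value

print(expand_filter(GROUP_SEARCH_FILTER, "jdoe"))
```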
<!--kg-card-end: markdown--><p>As a bonus tip when configuring this, make your <code>group_search_base_dns</code> as specific as possible, because that is what Grafana is going to loop through looking for your group memberships. For example, it was taking 15-20s to log me in when I used &quot;Security Groups&quot; as my base OU, but when I got more specific to use the &quot;Admin Groups&quot; OU (which is where my &quot;Grafana Admin&quot;, &quot;Grafana Viewer&quot;, and &quot;Grafana Editor&quot; groups are), the results were nearly instant.</p>]]></content:encoded></item><item><title><![CDATA[Automate Testing of Prometheus Targets files with Drone CI/CD]]></title><description><![CDATA[File-based service discovery is one of the most popular and flexible methods of service discovery available in Prometheus. However, there was no good way that I knew of to test the validity of the files...]]></description><link>https://wbhegedus.me/automate-testing-of-prometheus-targets-files/</link><guid isPermaLink="false">609ecc936393191a38fe5487</guid><category><![CDATA[prometheus]]></category><dc:creator><![CDATA[Will Hegedus]]></dc:creator><pubDate>Thu, 05 Sep 2019 21:49:08 GMT</pubDate><content:encoded><![CDATA[<p>File-based service discovery is one of the most popular and flexible methods of service discovery available in Prometheus. However, there was no <em>good</em> way that I knew of to test the validity of the files that <code>file_sd</code> will try to use prior to actually giving them to Prometheus to be read. </p><p>This particular solution uses Drone, as it is my CI/CD tool of choice, but you can use the same logic in Jenkins, Gitlab-CI, or whatever you utilize for your builds/deployments. I also use JSON as my file-format for <code>file_sd</code> but you can apply the same concepts if you use YAML.</p><p>First, we need a way to actually define what the file <em>should</em> look like. To do this, I chose to use JSON schemas. 
You can pop on over to <a href="https://www.jsonschema.net/">jsonschema.net</a>, paste in one of your valid JSON target files, and infer a schema based off of it. Personally, I changed it so that <strong>env</strong>, <strong>team</strong>, and <strong>service</strong> labels are required on targets in order to enforce uniformity in our targets regardless of what team is adding them. Go ahead and save that schema as <code>schema.json</code> in a <code>drone</code> folder in the top level of your repository.</p><p>With that schema in hand, now we need a way to compare our JSON files to it. To do this, I wrote a simple Python script:</p><!--kg-card-begin: markdown--><pre><code class="language-python">from jsonschema import validate
from jsonschema.exceptions import ValidationError
from json.decoder import JSONDecodeError

import glob
import json

import sys

errors: dict = {}  # maps filename to error message
error_encountered = False

with open(&quot;drone/schema.json&quot;) as f:
    schema = json.load(f)

for f in glob.glob(&quot;files/targets/*.json&quot;):
    with open(f) as inst:
        try:
            validate(json.load(inst), schema)
        except ValidationError as e:
            error_encountered = True
            errors[inst.name] = f&quot;Schema error: {e.message}&quot;
        except JSONDecodeError as e:
            error_encountered = True
            errors[inst.name] = f&quot;JSON decoding error: {e.msg} on line {e.lineno}&quot;

for f, err in errors.items():
    if err:
        print(f&quot;[ ERROR in {f} ]&quot;)
        print(&quot;  |&gt;&quot;, err)
        print()

if error_encountered:
    sys.exit(1)

print(&quot;All tests passed! Have a cookie. &#x1F36A;&quot;)
</code></pre>
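The heart of what my schema enforces (required `env`, `team`, and `service` labels) can be sketched with the standard library alone. The sample targets below are made up; in practice, the `jsonschema` validation above does this for you:

```python
import json

# Labels my schema marks as required on every target group.
REQUIRED_LABELS = {"env", "team", "service"}

def missing_labels(target_group: dict) -> set:
    """Return the required labels absent from a file_sd target group."""
    return REQUIRED_LABELS - set(target_group.get("labels", {}))

# Hypothetical file_sd content for demonstration purposes.
doc = json.loads("""
[
  {"targets": ["foo.example.com:9100"],
   "labels": {"env": "prod", "team": "sre", "service": "foobar"}},
  {"targets": ["bar.example.com:9100"],
   "labels": {"env": "prod"}}
]
""")

for group in doc:
    print(group["targets"], "missing:", sorted(missing_labels(group)))
```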
<!--kg-card-end: markdown--><p>The script presumes the location of the schema file (<code>drone/schema.json</code>) and the target file(s) (<code>files/targets/*.json</code>) but you can adapt that as needed.</p><p>Now in your <code>.drone.yml</code> just add a step to your pipeline to run the validation script (and install its dependency):</p><!--kg-card-begin: markdown--><pre><code class="language-yaml">  - name: validate JSON of targets
    image: python:3-alpine
    commands:
      - pip3 install --quiet jsonschema
      - python3 drone/validate.py

</code></pre>
<!--kg-card-end: markdown--><p>Easy peasy. Now you&apos;ll never have another malformed target file again! (err... at least not one that makes it to prod).</p>]]></content:encoded></item><item><title><![CDATA[Deleting Elements from Slices in Go]]></title><description><![CDATA[A brief how-to on filtering slices in Go without pulling your hair out.]]></description><link>https://wbhegedus.me/deleting-elements-from-slices-in-go/</link><guid isPermaLink="false">609ecc936393191a38fe5486</guid><category><![CDATA[go]]></category><dc:creator><![CDATA[Will Hegedus]]></dc:creator><pubDate>Mon, 05 Aug 2019 21:09:36 GMT</pubDate><content:encoded><![CDATA[<p>If you come from a language such as Python or Java, you might expect to find a standardized way to remove items from a list object in Go. However, as of Go 1.12, there is not a built-in way to perform this action. You can do it with relative ease once you wrap your head around it, though.</p><p>Personally, it took me a bit to wrap <em>my</em> head around it, which is why I&apos;m writing this &#x2014; in hopes of aiding someone else down the line.</p><p>I was working on a problem that required me to take a list of items and filter them by removing any items that matched a particular regex, then return the filtered list. Several hours of debugging and brainstorming later, I finally figured out how to do it right.</p><p>In my particular case, I did not care about the order of the elements in the slice so I could rearrange them any which way.</p><p>First, we start with the function declaration:</p><pre><code class="language-go">func filterExcluded(tl *[]task, excluded []string) []task {
	filtered := *tl
	lastIndex := (len(filtered) - 1)
</code></pre><p>The function accepts a pointer to a slice of task-typed objects, along with a slice of strings that will be interpreted as regex matchers, and then returns a slice of task-typed objects.</p><p>The <code>filtered</code> variable is a copy of all the items in the <code>tl</code> argument, which we will whittle down later. The <code>lastIndex</code> variable will be updated repeatedly later on and we&apos;ll use it to return a subset of data from the <code>filtered</code> slice.</p><p>Next, we loop over the regex matchers and make them <code>*regexp.Regexp</code> objects. Nothing fancy.</p><p>We&apos;ll also create a <code>matched</code> boolean that we&apos;ll use to log an error if it matches nothing, so that end users can make sure the regex is actually right and isn&apos;t just wasting compute time.</p><!--kg-card-begin: markdown--><pre><code class="language-go">    for _, regexString := range excluded {
		matched := false
		regex, err := regexp.Compile(regexString)
		if err != nil {
			logrus.Error(err)
			continue
		}
</code></pre>
<!--kg-card-end: markdown--><p>Now we&apos;re gonna go 0 to 100 real quick, but don&apos;t panic &#x2014; I&apos;ve commented everything in-line so you can see what&apos;s happening:</p><!--kg-card-begin: markdown--><pre><code class="language-go">		// Do not increment index on each loop.
		// Handle incrementing inside of the loop.
		for index := 0; index &lt;= lastIndex; {

			taskName := &amp;filtered[index].Name

			if regex.Match([]byte(*taskName)) {
				logrus.WithFields(logrus.Fields{
					&quot;task_name&quot;: *taskName,
					&quot;regex&quot;:     regexString,
				}).Debug(&quot;Excluding task&quot;)

				// Hooray! The provided regex matched something and wasn&apos;t a waste of time.
				matched = true

				// Replace the value at the current index with the value at
				// the last index of the slice.
				filtered[index] = filtered[lastIndex]

				// Set the value of the last index of the slice
				// to the nil value of `task`. The value that was previously
				// there is now at filtered[index], so we did not lose it.
				// We will just NOT increment `index` so that the new value gets checked, too.
				filtered[lastIndex] = task{}

				// Set the `filtered` slice to be everything up to
				// the last index, which we just set to a nil value.
				filtered = filtered[:lastIndex]

				// The last index will now be one less than before.
				// This is the same as if we just did
				// lastIndex = len(filtered)
				// every time, except this should be slightly more performant.
				lastIndex--
			} else {
				// If no match was found, increment the index
				// so that we check the next value in the `filtered` slice
				index++
				logrus.WithFields(logrus.Fields{
					&quot;task_name&quot;: *taskName,
					&quot;regex&quot;:     regexString,
				}).Debug(&quot;Did not match regex&quot;)
			}
		}
		// Log an error if the regex didn&apos;t match any tasks.
		// This should warn users if they&apos;re providing a useless regex.
		if !matched {
			logrus.Error(&quot;No task found to exclude matching: &quot;, regexString)
		}
	}
</code></pre>
<!--kg-card-end: markdown--><p>Now you just need to return the filtered list and you&apos;re golden.</p><!--kg-card-begin: markdown--><pre><code class="language-go">	return filtered
</code></pre>
<!--kg-card-end: markdown--><p>Ta-Da! We just filtered a slice while incurring only a slight amount of pain.</p><p><em>TL;DR &#x2014; Loop over slice; only increment index on non-matches; on match, set slice[i] = last item in slice; set last item in slice to nil (helps w/ GC); set slice = slice[:lastItem]; &#xA0;profit.</em></p><h3 id="further-reading">Further Reading</h3><ul><li><a href="https://github.com/golang/go/wiki/SliceTricks">SliceTricks</a> (Golang Github wiki)</li></ul><h3 id="bonus">Bonus</h3><p>Putting it all together, it looks like this:</p><!--kg-card-begin: markdown--><pre><code class="language-go">func filterExcluded(tl *[]task, excluded []string) []task {
	filtered := *tl
	lastIndex := (len(filtered) - 1)

	for _, regexString := range excluded {
		matched := false
		regex, err := regexp.Compile(regexString)
		if err != nil {
			logrus.Error(err)
			continue
		}

		// Do not increment index on each loop.
		// Handle incrementing inside of the loop.
		for index := 0; index &lt;= lastIndex; {

			taskName := &amp;filtered[index].Name

			if regex.Match([]byte(*taskName)) {
				logrus.WithFields(logrus.Fields{
					&quot;task_name&quot;: *taskName,
					&quot;regex&quot;:     regexString,
				}).Debug(&quot;Excluding task&quot;)

				// Hooray! The provided regex matched something and wasn&apos;t a waste of time.
				matched = true

				// Replace the value at the current index with the value at
				// the last index of the slice.
				filtered[index] = filtered[lastIndex]

				// Set the value of the last index of the slice
				// to the nil value of `task`. The value that was previously
				// there is now at filtered[index], so we did not lose it.
				// We will just NOT increment `index` so that the
				// new value will get checked, too.
				filtered[lastIndex] = task{}

				// Set the `filtered` slice to be everything up to
				// the last index, which we just set to a nil value.
				filtered = filtered[:lastIndex]

				// The last index will now be one less than before.
				// This is the same as if we just did
				// lastIndex = len(filtered)
				// every time, except this should be slightly more performant.
				lastIndex--
			} else {
				// If no match was found, increment the index
				// so that we check the next value in the `filtered` slice
				index++
				logrus.WithFields(logrus.Fields{
					&quot;task_name&quot;: *taskName,
					&quot;regex&quot;:     regexString,
				}).Debug(&quot;Did not match regex&quot;)
			}
		}
		// Log an error if the regex didn&apos;t match any tasks.
		// This should warn users if they&apos;re providing a useless regex.
		if !matched {
			logrus.Error(&quot;No task found to exclude matching: &quot;, regexString)
		}
	}
	return filtered
}
</code></pre>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Fix a Nagios Service Stuck in Scheduled Downtime]]></title><description><![CDATA[Today an issue was brought to my attention in which a Nagios service check was genuinely stuck in downtime, with the only way to fix it being manually updating the nagios_servicestatus table.]]></description><link>https://wbhegedus.me/fix-a-nagios-service-stuck-in-downtime/</link><guid isPermaLink="false">609ecc936393191a38fe5485</guid><category><![CDATA[nagios]]></category><dc:creator><![CDATA[Will Hegedus]]></dc:creator><pubDate>Thu, 13 Jun 2019 13:46:43 GMT</pubDate><content:encoded><![CDATA[<p>Today an issue was brought to my attention in which a Nagios service check was genuinely <em>stuck</em> in downtime, with the only way to fix it being manually updating the values in the database.</p><h1 id="background">Background</h1><p>The service did not show up in the &quot;Scheduled Downtime&quot; page in the XI UI, nor did it show up in the <code>nagios_scheduleddowntime</code> table in our MariaDB instance. Yet, the service page still showed it as being in downtime.</p><p>According to the comment history on the service, the downtime was scheduled until 2042, so we couldn&apos;t really just wait and hope it expired.</p><!--kg-card-begin: markdown--><blockquote>
<p>This service has been scheduled for fixed downtime from 2019-04-30 00:39:16 to 2042-02-21 07:39:16.</p>
</blockquote>
<!--kg-card-end: markdown--><p><em><strong>Note: </strong>Before I talk about how the problem was resolved, if you&apos;re just here wondering how to remove scheduled downtime, then you can do so in the aforementioned &quot;Scheduled Downtime&quot; page. It&apos;s at <code>nagiosxi/includes/components/xicore/downtime.php</code> in the XI UI, and under the &quot;System -&gt; Downtime&quot; page in Nagios Core.</em></p><h1 id="technical-info">Technical Info</h1><p>The issue that presented itself to us was that the entry in the <code>nagios_servicestatus</code> table had incorrect values in the <code>acknowledgement_type</code> and <code>scheduled_downtime_depth</code> columns. </p><p>The <code>acknowledgement_type</code> column can be either <strong>0</strong>, <strong>1</strong>, or <strong>2</strong>, which represent <strong>None</strong>, <strong>Normal</strong>, or <strong>Sticky</strong>, respectively.</p><p>The <code>scheduled_downtime_depth</code> column can be any <strong>smallint</strong> number, and it represents the number of downtimes that the service is in (since a service can be in multiple levels of downtime).</p><h1 id="fix">Fix</h1><p>After determining the object ID of the service and verifying that no entries corresponding to it existed in <code>nagios_scheduleddowntime</code>, I used the following SQL to fix the service&apos;s status.</p><!--kg-card-begin: markdown--><pre><code class="language-sql">-- Fix acknowledgement_type, replace service_object_id with the appropriate ID
update nagios.nagios_servicestatus set acknowledgement_type=&apos;0&apos; where service_object_id=&apos;8540&apos;;

-- Fix scheduled_downtime_depth, replace service_object_id with the appropriate ID
update nagios.nagios_servicestatus set scheduled_downtime_depth=&apos;0&apos; where service_object_id=&apos;8540&apos;;
</code></pre>
<!--kg-card-end: markdown--><p>After doing this, you can also optionally remove the comment history for the downtime from the <code>nagios_commenthistory</code> table.</p><!--kg-card-begin: markdown--><pre><code class="language-sql">-- replace object_id and commenthistory_id with appropriate values.
delete from nagios.nagios_commenthistory where object_id = &apos;8540&apos; and commenthistory_id=&apos;2461761&apos;;
</code></pre>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Using Explicit Waits in Selenium to Improve Test Case Reliability]]></title><description><![CDATA[Selenium test cases can be a pain to write. They seem to fail for inane and irreplicable reasons that vanish on the next run of the test case. However, incorporating Explicit Waits can help reduce the number of false failures you receive. ]]></description><link>https://wbhegedus.me/using-explicit-waits-in-selenium-to-improve-test-case-reliability/</link><guid isPermaLink="false">609ecc936393191a38fe5484</guid><category><![CDATA[selenium]]></category><category><![CDATA[monitoring]]></category><dc:creator><![CDATA[Will Hegedus]]></dc:creator><pubDate>Thu, 30 May 2019 18:25:17 GMT</pubDate><content:encoded><![CDATA[<p>Selenium test cases can be a pain to write. They seem to fail for inane and irreplicable reasons that vanish on the next run of the test case. However, incorporating <a href="https://www.seleniumhq.org/docs/04_webdriver_advanced.jsp#explicit-and-implicit-waits">Explicit Waits</a> can help reduce the number of false failures you receive. </p><p>I use Selenium at work to run test cases that are called by Nagios checks. Since Nagios needs a response from a check within 60s, having predictable wait times is imperative. Rather than using the implicit waits for everything, I prefer to set explicit waits depending on what it is that I&apos;m waiting on.</p><p>The default value in Selenium for an implicit wait is 0 seconds. This means that running <code>driver.find_element_by_id(&quot;main-box&quot;)</code> will fail if the element isn&apos;t there, even if the page hasn&apos;t fully loaded yet. </p><p>I solve this by using <code>WebDriverWait</code> and <code>expected_conditions</code>.</p><!--kg-card-begin: markdown--><pre><code class="language-python">from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# [...] (code omitted)

WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, &quot;application&quot;)),
                                &quot;Page did not load as expected&quot;)
</code></pre>
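Conceptually, `WebDriverWait` is just a poll-until-truthy loop with a deadline. This simplified, stdlib-only sketch (not Selenium's actual implementation) shows the contract that makes explicit waits predictable:

```python
import time

class TimeoutException(Exception):
    """Stand-in for selenium.common.exceptions.TimeoutException."""

def wait_until(condition, timeout=10, poll_interval=0.5, message=""):
    # Poll `condition` until it returns a truthy value, then return it.
    # If `timeout` seconds elapse first, raise with the supplied message,
    # mirroring how WebDriverWait(driver, timeout).until(...) behaves.
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutException(message)
        time.sleep(poll_interval)

# Hypothetical usage with a Selenium driver (not runnable here):
# wait_until(lambda: driver.find_elements(By.ID, "application"),
#            timeout=10, message="Page did not load as expected")
```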
<!--kg-card-end: markdown--><p>This code calls <code>WebDriverWait</code> on the <code>driver</code> I set up, it waits a maximum of 10s or until the expected condition (<code>EC</code>) is met. If the <code>EC</code> is <em>not</em> met, then it will raise a <code>TimeoutException</code> with &quot;Page did not load as expected&quot; as the error message.</p><p>Not only does <code>WebDriverWait</code> allow me to explicitly set how long to wait for something (e.g. the presence of an element on a page), but it also allows me to set the error message to something meaningful that will help our on-shift team identify where the issue actually is &#x2013; all without having to add an unnecessary try-catch block.</p><p>Explicitly specifying how long I want to wait for an element to load or until a page&apos;s title changes allows me to better guarantee that a site is actually broken when our test cases fail &#x2013; not that it just happened to load slower than expected.</p><p><em>Note: code examples are in Python, but the same functionality is available in Java and other Selenium libraries.</em></p>]]></content:encoded></item><item><title><![CDATA[Migrating Grafana's Database from SQLite to Postgres]]></title><description><![CDATA[Grafana is a fantastic tool for creating and sharing dashboards. 
However, you may quickly find yourself outgrowing its default SQLite database if you have too many people trying to use your Grafana instance at once.]]></description><link>https://wbhegedus.me/migrating-grafanas-database-from-sqlite-to-postgres/</link><guid isPermaLink="false">609ecc936393191a38fe5483</guid><category><![CDATA[grafana]]></category><category><![CDATA[monitoring]]></category><category><![CDATA[postgres]]></category><category><![CDATA[sqlite]]></category><dc:creator><![CDATA[Will Hegedus]]></dc:creator><pubDate>Mon, 27 May 2019 15:26:57 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1535320903710-d993d3d77d29?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1080&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h2 id="background">Background</h2>
<img src="https://images.unsplash.com/photo-1535320903710-d993d3d77d29?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1080&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" alt="Migrating Grafana&apos;s Database from SQLite to Postgres"><p>If you are unaware, Grafana is a fantastic tool for creating and sharing dashboards. However, you may quickly find yourself outgrowing its default SQLite database if you have too many people trying to use your Grafana instance at once.</p>
<p>In my case, this issue of scale presented itself as the SQLite database frequently being unable to look up a user&apos;s session cookie, which would cause users to get logged out despite having a valid session. This is discussed in various Github issues in the Grafana project, with some of the main ones being <a href="https://github.com/grafana/grafana/issues/10727#issuecomment-479378941">issue #10727</a> and <a href="https://github.com/grafana/grafana/issues/15316">issue #15316</a>.</p>
<p>The Grafana project has made some improvements in <a href="https://github.com/grafana/grafana/blob/c53c55391c2ccb83cf7979124b9217fd9161c0ed/CHANGELOG.md">Grafana 6.2.1</a> designed to address the issue I experienced of being frequently logged out, but I still wanted to migrate my database to Postgres as an additional precaution and to have a more scalable database.</p>
<h2 id="thejourney">The Journey</h2>
<p>The first thing I tried when migrating my database was to use the <a href="https://github.com/dimitri/pgloader">pgloader</a> project, which is a nice tool designed for migrating a variety of databases into Postgres. However, there are some column types that are difficult to translate from SQLite to Postgres, particularly the boolean values. Booleans in SQLite are <code>1</code> or <code>0</code> whereas they are <code>true</code> or <code>false</code> in Postgres. If these aren&apos;t translated correctly, Grafana will just implode in on itself and you won&apos;t be able to login.</p>
<p>After discovering the boolean issue, I found the <a href="https://github.com/haron/grafana-migrator">grafana-migrator</a> project on Github, which is designed specifically for migrating Grafana databases. This addressed the boolean issue by changing the column types after the database is initialized, importing the SQLite database, and then changing the column types back by translating the values to ones Postgres will like.</p>
<pre><code class="language-sql">-- alter column type
ALTER TABLE alert ALTER COLUMN silenced TYPE integer USING silenced::integer;

-- translate back after import
ALTER TABLE alert
			ALTER COLUMN silenced TYPE boolean
			USING CASE WHEN silenced = 0 THEN FALSE
				WHEN silenced = 1 THEN TRUE
				ELSE NULL
				END;
</code></pre>
<p>While this project was a good start, it was inconsistently maintained and was essentially just several scripts strung together. To ensure that everything would work the way I wanted it to, I set out to make my own application that would handle the import of the SQLite database to Postgres.</p>
<h2 id="thesolution">The Solution</h2>
<p>Rather than just copying the <a href="https://github.com/haron/grafana-migrator">grafana-migrator</a> project, I wanted to create a more robust method of migrating the Grafana database than just a collection of scripts. Consequently, I decided to create a <a href="https://golang.org/">Go</a> program since that would allow me to compile to a single binary for easier usage.</p>
<p>In addition to just running the DDL statements after sanitizing them, the primary points I addressed in my program are:</p>
<ul>
<li>Automatic resetting of <a href="https://www.w3resource.com/PostgreSQL/postgresql-sequence.php">database sequences</a></li>
<li>Automatic decoding of hex-encoded values</li>
<li>Automatic translation of boolean columns and values</li>
</ul>
<p>The biggest challenge I encountered while building the project was the fact that SQLite dumps the JSON definitions of Grafana dashboards as hex-encoded values, but you cannot just import those hex-encoded strings into Postgres as they will not be recognized. The HexDecode function I wrote has some inefficiencies (such as converting <code>decoded</code> to a string so that it can be wrapped in single quotes again); however, the performance hit is negligible due to the scale of the project.</p>
<pre><code class="language-go">// HexDecode takes a file path containing a SQLite dump and
// decodes any hex-encoded data it finds.
func HexDecode(dumpFile string) error {
	re := regexp.MustCompile(`(?m)X\&apos;([a-fA-F0-9]+)\&apos;`)
	re2 := regexp.MustCompile(`&apos;`)
	data, err := ioutil.ReadFile(dumpFile)
	if err != nil {
		return err
	}

	// Define a function to actually decode hexstring.
	decodeHex := func(hexEncoded []byte) []byte {
		// Find the regex submatch in the argument passed to the function
		// then decode the submatch.
		decoded, err := hex.DecodeString(string(re.FindSubmatch(hexEncoded)[1]))
		if err != nil {
			logrus.Fatalf(&quot;Failed to decode hex-string in: %s&quot;, hexEncoded)
		}
		decoded = re2.ReplaceAll(decoded, []byte(`&apos;&apos;`))

		// Surround decoded string with single quotes again.
		return []byte(`&apos;` + string(decoded) + `&apos;`)
	}

	// Replace regex matches from the dumpFile using the `decodeHex` function defined above.
	sanitized := re.ReplaceAllFunc(data, decodeHex)
	return ioutil.WriteFile(dumpFile, sanitized, 0644)
}
</code></pre>
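For illustration, the same transformation can be sketched in Python (assuming UTF-8 content; the Go version above is what the tool actually uses):

```python
import re

def hex_decode(dump_sql: str) -> str:
    # Replace each SQLite X'...' hex literal with its decoded text wrapped
    # in single quotes, doubling any embedded quotes so Postgres accepts it.
    pattern = re.compile(r"X'([a-fA-F0-9]+)'")

    def decode(match):
        text = bytes.fromhex(match.group(1)).decode("utf-8")
        return "'" + text.replace("'", "''") + "'"

    return pattern.sub(decode, dump_sql)

print(hex_decode("INSERT INTO dashboard VALUES (X'6869');"))
```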
<p>The fruit of all this effort can be found on my Github in the <a href="https://github.com/wbh1/grafana-sqlite-to-postgres">grafana-sqlite-to-postgres</a> repository. I have successfully used this app to migrate multiple Grafana databases, including our production Grafana database used by 50+ users with 50+ dashboards. As such, I would consider it safe for production use but you should maintain a copy of your <code>grafana.db</code> SQLite database just in case.</p>
<p>You can find usage instructions in the <a href="https://github.com/wbh1/grafana-sqlite-to-postgres/blob/master/README.md">README</a>, but feel free to open up an issue if you run into any problems.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Deduplicating Prometheus Blackbox ICMP checks with File Based Service Discovery]]></title><description><![CDATA[Most of my scrape targets come from service discovery. This poses a challenge of deduplicating ICMP probes for hosts with multiple scrape endpoints.]]></description><link>https://wbhegedus.me/deduplicating-prometheus-icmp-checks-with-file-based-service-discovery/</link><guid isPermaLink="false">609ecc936393191a38fe5482</guid><category><![CDATA[prometheus]]></category><category><![CDATA[monitoring]]></category><dc:creator><![CDATA[Will Hegedus]]></dc:creator><pubDate>Fri, 21 Sep 2018 21:29:36 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>In my Prometheus setup, most of my scrape targets come from a <code>file_sd</code> service discovery configuration. This poses a challenge of deduplicating ICMP probes from the <a href="https://github.com/prometheus/blackbox_exporter">Blackbox Exporter</a> for hosts with multiple scrape endpoints.</p>
<p>Most of my <code>file_sd</code> targets have JSON files that look like this:</p>
<pre><code class="language-json">[
  {
    &quot;targets&quot;: [ &quot;foo.example.com:9100&quot;,&quot;bar.example.com:9100&quot; ],
    &quot;labels&quot;: {
      &quot;env&quot;: &quot;prod&quot;,
      &quot;job&quot;: &quot;node&quot;,
      &quot;service&quot;: &quot;foobar&quot;
    }
  },
  {
    &quot;targets&quot;: [ &quot;foo.example.com:9104&quot; ],
    &quot;labels&quot;: {
      &quot;env&quot;: &quot;prod&quot;,
      &quot;job&quot;: &quot;mysql&quot;,
      &quot;service&quot;: &quot;barfoo&quot;
    }
  }
]
</code></pre>
<p>When setting up the Blackbox Exporter in a &quot;standard&quot; way, it uses <code>relabel_configs</code> to take the target&apos;s URL (including the port) and put that into the <code>__param_target</code> label. For example:</p>
<pre><code class="language-yaml">  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
</code></pre>
<p>This means that <strong>foo.example.com</strong> would get scraped twice because it has two different <code>__address__</code> labels (one on port 9100 and one on port 9104).</p>
<p>To prevent this, we can use a regex to strip the port, preventing the creation of multiple ICMP metrics for a single host. Simply update the part of the <code>relabel_configs</code> configuration that sets the <code>__param_target</code> to look like this:</p>
<pre><code class="language-yaml">  relabel_configs:
    - source_labels: [__address__]
      regex: (.*?)(:[0-9]+)?
      target_label: __param_target
      replacement: ${1}
</code></pre>
<p>This regex will take everything up to the colon (the URL without the port) and save that in a capture group ( <code>${1}</code> ), which we then use as the <code>__param_target</code> label.</p>
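You can sanity-check the capture behavior with Python's `re` module; `fullmatch` approximates Prometheus's fully-anchored relabel matching (hostnames are just examples):

```python
import re

# Prometheus fully anchors relabel regexes; re.fullmatch mirrors that.
RELABEL_REGEX = re.compile(r"(.*?)(:[0-9]+)?")

def strip_port(address: str) -> str:
    # Capture group 1 is the address without its port, i.e. what the
    # ${1} replacement in the relabel_configs above produces.
    return RELABEL_REGEX.fullmatch(address).group(1)

print(strip_port("foo.example.com:9100"))
print(strip_port("foo.example.com:9104"))
print(strip_port("bar.example.com"))
```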
<p>Now you can update your config, reload Prometheus, and you&apos;ll see in the Targets page that it&apos;s no longer duplicating targets on your ICMP job.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Configure Nginx to Export Prometheus-formatted Metrics]]></title><description><![CDATA[How to build Nginx from source with an integration to serve Prometheus-formatted metrics]]></description><link>https://wbhegedus.me/configure-nginx-to-export-prometheus-formatted-metrics/</link><guid isPermaLink="false">609ecc936393191a38fe5481</guid><category><![CDATA[prometheus]]></category><category><![CDATA[nginx]]></category><dc:creator><![CDATA[Will Hegedus]]></dc:creator><pubDate>Wed, 12 Sep 2018 00:48:22 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>The first thing I wanted to do upon setting up my Kubernetes cluster at home was to configure monitoring for it with Prometheus. Since I didn&apos;t really have any pods spun up at the time, I turned to what I did have -- this blog.</p>
<p>This blog runs on the <a href="https://github.com/TryGhost/Ghost">Ghost</a> blogging engine, which is served behind an <a href="https://www.nginx.com/">Nginx</a> reverse proxy. Nginx has a built-in <a href="http://nginx.org/en/docs/http/ngx_http_stub_status_module.html">&quot;stub_status&quot;</a> page, but it gives you very little information. To get decent built-in monitoring/metrics support, you are directed to purchase NGINX Plus, which I&apos;m obviously not the target audience for.</p>
<p>Luckily, there&apos;s a GitHub project to fill the void: <a href="https://github.com/vozlt/nginx-module-vts">nginx-module-vts</a> provides an Nginx module that exposes host traffic statistics. However, it <em>does</em> involve compiling Nginx from source.</p>
<h1 id="instructions">Instructions</h1>
<p><em>Note: This guide is for Ubuntu systems and is a conglomeration of my experience and various resources I used, but the most useful one is located <a href="https://serversforhackers.com/c/compiling-third-party-modules-into-nginx">here</a> if you want more info.</em></p>
<ol>
<li>First things first, you&apos;ll want to set up your workspace. So go ahead and make a directory that you can work out of. For the purposes of this demonstration, it will be <code>~/rebuildingnginx</code>
<ul>
<li>
<pre><code class="language-bash">  mkdir ~/rebuildingnginx
</code></pre>
</li>
</ul>
</li>
<li><code>cd</code> to that directory and clone the VTS module from Github
<ul>
<li>
<pre><code class="language-bash">  cd ~/rebuildingnginx
  git clone https://github.com/vozlt/nginx-module-vts.git
</code></pre>
</li>
</ul>
</li>
<li>Add the Nginx repository, if you haven&apos;t already
<ul>
<li>
<pre><code class="language-bash">  sudo add-apt-repository -y ppa:nginx/stable
</code></pre>
</li>
</ul>
</li>
<li>Make sure that <strong>deb-src</strong> isn&apos;t commented out in the repository file
<ul>
<li>
<pre><code class="language-bash">  cat /etc/apt/sources.list.d/nginx-ubuntu-stable-xenial.list
  
deb http://ppa.launchpad.net/nginx/stable/ubuntu xenial main
deb-src http://ppa.launchpad.net/nginx/stable/ubuntu xenial main
</code></pre>
</li>
</ul>
</li>
<li>Do package manager things
<ul>
<li>
<pre><code class="language-bash">  # Update package lists
  sudo apt-get update
  
  # Install dpkg-dev to create package
  sudo apt-get install -y dpkg-dev

  # Get Nginx source files
  sudo apt-get source nginx

  # Install the build dependencies for nginx
  sudo apt-get build-dep nginx
</code></pre>
</li>
</ul>
</li>
<li>Edit the <code>rules</code> file in the source code
<ul>
<li>
<pre><code class="language-bash">  vim ~/rebuildingnginx/nginx-1.14.0/debian/rules
</code></pre>
</li>
<li><em>Note: the exact path may differ depending on what version of the Nginx source you downloaded</em></li>
<li>Since we&apos;re going to install the nginx-full version, we&apos;re going to append the build flag to the <code>full_configure_flags</code> section of the file</li>
<li>Go ahead and add an <code>--add-module=/root/nginx-module-vts</code> to the end of that list of arguments, adjusting the path to wherever you cloned the module (use an absolute path, since <code>~</code> isn&apos;t expanded inside the rules file)</li>
<li>It should look something like:<pre><code class="language-bash">full_configure_flags := \
                   $(common_configure_flags) \
                   --with-http_addition_module \
                   --with-http_geoip_module=dynamic \
                   --with-http_gunzip_module \
                   --with-http_gzip_static_module \
                   --with-http_image_filter_module=dynamic \
                   --with-http_sub_module \
                   --with-http_xslt_module=dynamic \
                   --with-stream=dynamic \
                   --with-stream_ssl_module \
                   --with-stream_ssl_preread_module \
                   --with-mail=dynamic \
                   --with-mail_ssl_module \
                   --add-dynamic-module=$(MODULESDIR)/http-auth-pam \
                   --add-dynamic-module=$(MODULESDIR)/http-dav-ext \
                   --add-dynamic-module=$(MODULESDIR)/http-echo \
                   --add-dynamic-module=$(MODULESDIR)/http-upstream-fair \
                   --add-dynamic-module=$(MODULESDIR)/http-subs-filter \
                   --add-module=/root/nginx-module-vts
</code></pre>
</li>
</ul>
</li>
<li>Now we can start the actual build process
<ul>
<li>
<pre><code class="language-bash">  # Again, this path may differ depending on your version
  cd ~/rebuildingnginx/nginx-1.14.0
  sudo dpkg-buildpackage -b
</code></pre>
</li>
<li>This takes a <strong>LONG</strong> time, so go find something fun to do and come back later</li>
</ul>
</li>
<li>Now that it&apos;s built, we can actually install Nginx
<ul>
<li>The build process put a bunch of <code>.deb</code> files in our <code>~/rebuildingnginx</code> directory, but the only one we need to care about is <code>nginx-full_1.14.0-0+xenial1_amd64.deb</code> or whatever the equivalent is for you depending on your Ubuntu &amp; Nginx versions.</li>
<li>
<pre><code class="language-bash">  cd ~/rebuildingnginx
  sudo dpkg --install nginx-full_1.14.0-0+xenial1_amd64.deb
</code></pre>
</li>
</ul>
</li>
<li>Nginx should be installed now! Next, we need to configure the VTS module for our metrics</li>
<li>Open your Nginx configuration file for some light editing
<ul>
<li>
<pre><code class="language-bash">  sudo vim /etc/nginx/nginx.conf
</code></pre>
</li>
<li>I&apos;ve elided the parts of the configuration that aren&apos;t relevant here with <code>[...]</code>, but your configuration file should be updated to include the following info:</li>
<li>
<pre><code class="language-nginx">  user www-data;
  worker_processes auto;
  pid /run/nginx.pid;
  include /etc/nginx/modules-enabled/*.conf;

  events {
      worker_connections 768;
      # multi_accept on;
  }

  http {

      [...]

      ##
      # VTS Settings
      ##
      vhost_traffic_status_zone;
      vhost_traffic_status_dump /var/log/nginx/vts.db;

      server {
        listen 8080;
            server_name wbhegedus.me;

        if ($time_iso8601 ~ &quot;^(\d{4})-(\d{2})-(\d{2})&quot;) {
          set $year $1;
          set $month $2;
          set $day $3;
        }

        vhost_traffic_status_filter_by_set_key $year year::$server_name;
        vhost_traffic_status_filter_by_set_key $year-$month month::$server_name;
        vhost_traffic_status_filter_by_set_key $year-$month-$day day::$server_name;

        location /status {
          vhost_traffic_status_bypass_limit on;
          vhost_traffic_status_bypass_stats on;
          vhost_traffic_status_display;
          vhost_traffic_status_display_format html;
        }
      }	
     [...]
  }
</code></pre>
</li>
</ul>
</li>
<li>Once this is complete, you can reload your Nginx config
<ul>
<li>
<pre><code class="language-bash">   # Check for a valid config first
   nginx -t
   
   # If that command returns fine, go ahead and reload
   sudo systemctl reload nginx
</code></pre>
</li>
</ul>
</li>
<li>If everything worked correctly, you should be able to access Prometheus-formatted metrics at <code>localhost:8080/status/format/prometheus</code> on your Nginx box
<ul>
<li>You can add this target to your Prometheus config the same as you would any other endpoint.</li>
<li>For example:</li>
</ul>
<pre><code class="language-yaml">- job_name: ghost_blog
  scheme: http
  metrics_path: /status/format/prometheus
  static_configs:
    - targets:
        - wbhegedus.me:8080
</code></pre>
</li>
</ol>
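<p>If you&apos;d rather script against that endpoint than eyeball it, the Prometheus text format is simple enough to pull apart by hand. A rough sketch (the sample line is made up for illustration rather than copied from VTS output, and the label parsing ignores escaped quotes and commas inside label values):</p>
<pre><code class="language-python">sample = 'nginx_vts_server_requests_total{host="wbhegedus.me",code="2xx"} 1027'

def parse_sample(line):
    # Split a simple Prometheus text-format line into name, labels, and value
    name_and_labels, value = line.rsplit(" ", 1)
    name, _, raw_labels = name_and_labels.partition("{")
    labels = {}
    for pair in raw_labels.rstrip("}").split(","):
        key, _, val = pair.partition("=")
        labels[key] = val.strip('"')
    return name, labels, float(value)

name, labels, value = parse_sample(sample)
print(name, labels["code"], value)  # nginx_vts_server_requests_total 2xx 1027.0
</code></pre>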
<p>Enjoy the wonder of Prometheus and let me know in the comments if you run into any issues!</p>
<!--kg-card-end: markdown-->]]></content:encoded></item></channel></rss>