Debugging Prometheus Memory Usage

When you run workloads on Kubernetes, debugging and hunting for errors can quickly escalate into a murder mystery. Not much of a surprise, this was also the case for parts of our Prometheus stack.

A few words on Prometheus: When you start out using it, it might feel cumbersome and require a high initial investment, but as soon as you gather the data, you will be able to do a lot of great things simply because the data is there. So if you are just starting with it: keep going!

Out of memory

We’re running Prometheus in a highly distributed environment and on many clusters. The workloads of those clusters are vastly different, so a one-size-fits-all approach usually does not cut straight to the final solution. Late one evening I was checking some metrics of a compute node to build a theory about another issue we were seeing. I noticed that the data wasn’t available in Grafana anymore and soon discovered that Prometheus was pushing the compute node beyond its 32G of memory – bummer. This felt like a lot, even for the size of the cluster it was running on. So my search began to figure out why Prometheus uses that much memory.

Dangerous Defaults

One thing I encounter over and over in modern systems is something I started to call dangerous defaults (or better: “Did Not RTFM” defaults), as I trust every developer and engineer enough to assume that no default in technology exists for truly malicious reasons. Most of the time those defaults have reasons, which can be discovered by reading through the entire history of the feature in question. I’ve started to document the reasoning behind default values when I encounter them. This helps future-me to be proud of past-me and shows why values are set the way they are.

TSDB Top 10 Metrics

I started to look at the failing Prometheus node and saw that there is a handy overview of Time Series Database (TSDB) vitals, which includes the top 10 metrics by stored series.
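For reference, the same top-10 numbers can be pulled straight from the HTTP API. A minimal sketch, assuming Prometheus listens on localhost:9090 (adjust `PROM_URL` for your setup):

```python
# Sketch: fetch the TSDB head stats that back the status page's
# "top 10 series by metric name" overview.
import json
from urllib.request import urlopen

PROM_URL = "http://localhost:9090"  # assumption: adjust for your instance

def top_series(stats: dict, n: int = 10) -> list:
    """Return the n metric names holding the most series."""
    entries = stats["data"]["seriesCountByMetricName"]
    return sorted(entries, key=lambda e: e["value"], reverse=True)[:n]

if __name__ == "__main__":
    with urlopen(f"{PROM_URL}/api/v1/status/tsdb") as resp:
        stats = json.load(resp)
    for entry in top_series(stats):
        print(f'{entry["value"]:>10}  {entry["name"]}')
```

The metrics dominating that list are your first suspects when memory climbs.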

A few metrics stand out by an order of magnitude. After checking them, I saw that we currently don’t need to store them, as we don’t make use of those metrics, so we decided to drop them. Here it gets a bit tricky, as I didn’t find any good resources on how this is actually done. To cut to the chase: what you are looking for is metricRelabelings. As we’re using the Helm chart to roll out ingress-nginx to our clusters, it resulted in the following snippet.

      metricRelabelings:
        - action: drop
          regex: '(nginx_ingress_controller_request_duration_seconds_bucket|nginx_ingress_controller_response_size_bucket|nginx_ingress_controller_request_size_bucket|nginx_ingress_controller_response_duration_seconds_bucket|nginx_ingress_controller_bytes_sent_bucket)'
          sourceLabels: [__name__]
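One detail worth knowing here: Prometheus anchors relabel regexes, so the pattern has to match the whole metric name, not just a substring. A small sketch of that matching behaviour, using the drop list from the snippet above (`is_dropped` is a hypothetical helper for illustration):

```python
# Sketch: Prometheus relabel rules match the regex against the full
# concatenated source label value (the pattern is anchored), which
# re.fullmatch reproduces here.
import re

DROP_RE = re.compile(
    r"(nginx_ingress_controller_request_duration_seconds_bucket"
    r"|nginx_ingress_controller_response_size_bucket"
    r"|nginx_ingress_controller_request_size_bucket"
    r"|nginx_ingress_controller_response_duration_seconds_bucket"
    r"|nginx_ingress_controller_bytes_sent_bucket)"
)

def is_dropped(metric_name: str) -> bool:
    """True if a series with this __name__ would be dropped at scrape time."""
    return DROP_RE.fullmatch(metric_name) is not None
```

The anchoring is why a pattern like `nginx_ingress_controller` alone would not drop anything – it never matches a full metric name.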

Cleanup

After the change above is rolled out, the existing TSDB buckets will remain large but should not accumulate more data. There is a way to drop them via the API: api/v1/status/tsdb gives you the seriesCountByMetricName, and then it’s a few calls to /api/v1/admin/tsdb/delete_series to drop those buckets. Alternatively, take the no-op approach and wait until the buckets get rotated out of the TSDB, or remove the whole TSDB if that is acceptable for you (yes, this is a 🔨 approach, but you are here for it).
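Sketched out, those cleanup calls could look like this. This assumes the admin API is enabled (Prometheus flag --web.enable-admin-api); `PROM_URL` and the metric list are placeholders for your own setup:

```python
# Sketch of the cleanup calls described above: mark the unwanted series
# for deletion, then compact the tombstones away to reclaim space.
from urllib.parse import urlencode
from urllib.request import Request, urlopen

PROM_URL = "http://localhost:9090"  # assumption: adjust for your instance
METRICS = [
    "nginx_ingress_controller_request_duration_seconds_bucket",
    "nginx_ingress_controller_response_size_bucket",
]

def delete_url(base: str, names: list) -> str:
    """Build the delete_series URL with one matcher per metric name."""
    query = urlencode([("match[]", f'{{__name__="{n}"}}') for n in names])
    return f"{base}/api/v1/admin/tsdb/delete_series?{query}"

if __name__ == "__main__":
    # Mark the series for deletion ...
    urlopen(Request(delete_url(PROM_URL, METRICS), method="POST"))
    # ... then actually remove the tombstoned data from disk.
    urlopen(Request(f"{PROM_URL}/api/v1/admin/tsdb/clean_tombstones", method="POST"))
```

Without the clean_tombstones call, the deleted series only disappear from queries; the disk space is reclaimed during the next compaction.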

This helped us identify the memory-hogging metrics, lower the usage, and stabilize Prometheus, which had occasionally been pushing the infra node of the cluster out of memory.

As you can see, memory usage went from around 24G down to approximately 14G, which is much more practical.

One contributing factor helped us in this case: in larger clusters, we run compute nodes for specific tasks, e.g. load balancing or monitoring/infra. These two node groups are kept separate from customer-facing workloads, both to limit issues such as OOM situations to those nodes and to run compute nodes tailored to those workloads.

Dealing with Transparency

Transparency – highly praised in politics, but unfortunately not (yet) consistently put into practice.

I’ve given this some thought over the past few months and briefly reworked my website during its rebuild.

Why a rebuild? Because a Gatsby 2 to 4 migration strained my nerves too much 😉

Vested Interests

People naturally want to know which interests the other party represents, or where vested interests exist. I’m taking the step of disclosing everything from board positions and memberships down to regular donations and projects I support financially.

Votes

This should come as no surprise. /votez has existed on my site for a long time, but it now gets a bit more prominence. Here I disclose my personal votes. I do this selectively, whenever I want to waive ballot secrecy.

Pandemic Code of Conduct

And last but not least – the Pandemic Code of Conduct. Similar to the vested interests, it has unfortunately become very important to define how one deals with the pandemic. This simply records my principles and makes questions easy to handle if I cancel on short notice, or leave an event without much ado when I notice there is no consensus on the basics; because looking out for each other is great, and mutual. 💚

The Code of Conduct is published under Creative Commons – help yourself 🙂

Angelesen #82

Well, COP26 was rather underwhelming, and that is reflected in a few of the articles here. Besides Corona topics that jump back and forth between certificate infrastructure and vaccinations, there are also chip factories and heavy DDoS attacks via unpatched GitLab instances. Enjoy.

Grosse Worte um den kleinen Piks (woz.ch)

Rhetorical disarmament is urgently needed. The decision to get vaccinated is not an act of resistance – nor one of great solidarity. It should be a rational negotiation. I myself got vaccinated to protect myself and my family. And to be able to move around as freely as possible. Ultimately, that is just as selfish as the decision to forgo vaccination. Which is why it is silly when people carry their vaccination in front of them like a badge of honor.

Man, I am so tired. Tired of hearing the umpteenth pro or con argument delivered in inflated words. Of hearing how we supposedly live in a dictatorship, and with how much doomsday mood the surveillance state is being heralded. Don’t worry, we already have the PMT and various surveillance laws. And above all, watching some acquaintances drift off in all possible directions. Watching reflective people suddenly wage war on each other over an admittedly difficult topic and completely lose their footing. That is hard. Before the votes I treat myself to a media and communication diet. It is good for one’s personal wellbeing. I take my own medicine here: I was also "grumpy" these days and gave my opinion room. But always based on facts and statistics.

Seen, among other places, over at Habi’s

Vierte Corona-Welle: Selbst 2G reicht jetzt nicht mehr, warnen führende Corona-Forscher (zeit.de)

Well shit. I’ll keep my analysis of the big picture to myself.

Verkehrswende: "Die Begeisterung der Boomer fürs Auto war legitim" (zeit.de)

Given the situation in the city centers and climate change, that is completely legitimate. But in the seventies the circumstances were different. The boomers’ enthusiasm for the car was equally legitimate back then, and the younger generation should grant them that.

I recently discussed this topic to death with a friend of mine. That it is now picked up in an interview is great, because it reflects many lived realities around the cult object: the car.

Maestral (maestral.app)

Maestral is a lightweight Dropbox client for macOS and Linux. It provides powerful command line tools, supports gitignore patterns to exclude local files from syncing and allows syncing multiple Dropbox accounts.

Sweeet! Works like a charm!

H/T Manu

Intel slipped—and its future now depends on making everyone else’s chips (arstechnica.com)

Which is why Intel, under Gelsinger, is doing something now that it historically has shunned. “We are now a foundry,” Gelsinger said at the Arizona groundbreaking. In the coming years, he said, Intel will “open the doors of our fab wide for the community at large to serve the foundry needs of our customers—many of them US companies that are dependent on solely having foreign supply sources today.”

All in all, it’s not bad to see more than just the big four foundries – TSMC, Samsung, UMC and GlobalFoundries – on the market.

Befriedung des Braunkohletagebau: Die Möglichkeit einer Halbinsel (taz.de)

And now, after COP23, seeing certain numbers on what politics collectively manages to achieve… The end of coal will only come once it is no longer viable for the corporations in terms of liability. See Angelesen #61.

Big Tech in Zürich: Kommt das Silicon Limmattal? (tsri.ch)

While the big players naturally pay significantly more – Gertsch has heard of well-trained Google employees earning 350,000 francs a year – a start-up like his can score with a family atmosphere and the prospect of taking on long-term responsibility within the company: «I suspect that at Facebook, Google and so on, you are more stuck in your role.»

The salary level in Zurich is… interesting.

Meta will continue to use facial recognition technology, actually (inputmag.com)

Well, see my surprised recognizable face, actually.

wechselwarm | Dein Leben in der Klimakrise (wechselwarm.de)

A great podcast in which you can steer the outcome yourself.

GitLab servers are being exploited in DDoS attacks in excess of 1 Tbps (therecord.media)

Threat actors are exploiting a security flaw in GitLab self-hosted servers to assemble botnets and launch gigantic distributed denial of service (DDoS) attacks, with some in excess of 1 terabit per second (Tbps).

Based on Cloudflare’s metrics, the attack peaked at 2 Tbps, and it seems to be only partially coming from GitLab servers.

Xiloe on Twitter: "Some anon got access to a #COVID19 certificate issuance panel and is claiming to have all the EU private keys saying he will leak them soon I’m kinda amazed at how poorly secured this shit is for someone random to get access like that.. (twitter.com)

That was a fun weekend when someone found the unsecured infrastructure of a DGCA web panel and was just great enough to create a name people will totally pass around. After a while, this led to a full revoke of the North Macedonian key. So technology is working after all?

The great theory around this is that the default docker-compose settings were partially to blame for this issue. WELL, IF YOU RUN YOUR CERTIFICATE INFRASTRUCTURE ON A VM BY JUST RUNNING docker-compose up -d AND WALK AWAY, I DON’T HAVE MORE CAPSLOCK FOR YOU.

This Twitter thread discusses the theory, but I really hope it’s just a hot take.

Amazon copied products and rigged search results, documents show (reuters.com)

Looks like not only Google is rigging things… Surprised much?

Wearable Microphone Jamming (sandlab.cs.uchicago.edu)

We engineered a wearable microphone jammer that is capable of disabling microphones in its user’s surroundings, including hidden microphones. Our device is based on a recent exploit that leverages the fact that when exposed to ultrasonic noise, commodity microphones will leak the noise into the audible range.

Gold!

Angelesen #81

And here we are again 👋 enjoy some short and long reads. I’m working towards a new schedule for this format, as Sunday-to-Sunday seems to lead to a lot of off-by-one errors on my end. Let’s see – for now, just enjoy a few links from the archive.

Reise-Busse in den USA – Flixbus übernimmt Greyhound (srf.ch)

Ah well, a pity somehow…

Operations is not Developer IT (matduggan.com)

It is baffling on many levels to me. First, I am not an application developer and never have been. I enjoy writing code, mostly scripting in Python, as a way to reliably solve problems in my own field. I have very little context on what your application may even do, as I deal with many application demands every week. I’m not in your retros or part of your sprint planning. I likely don’t even know what "working" means in the context of your app.

A long read, but a really good one. I fully understand a lot of the push and pull factors of the roles involved. But somehow along the way we lost the DevOps, it seems. Or rather, a lot of stacks got very complex within just a few short years, so people no longer master things and just expect them to work. And then things get passed off to Operations "because they know". The bandwidth Operations is expected to have for all of this simply isn’t there.

H/T Tyler

COVID lesson: trust the public with hard truths (nature.com)

Of the many fears during the pandemic, one has been particularly pernicious: governments’ fear of their people. Former US president Donald Trump admitted to playing down the risks of the coronavirus to “reduce panic”. Jair Bolsonaro, president of Brazil, blamed the press for causing “hysteria”. The UK government delayed its lockdown, fearing the British population would rapidly become fatigued by restrictions. And, in my home country of Denmark, the authorities tried not to draw public attention to pandemic preparations in early 2020, to avoid “unnecessary fear”.

But Denmark pivoted to a strategy of trusting its citizens with hard truths. The buy-in that ensued led to low death rates and laid the groundwork for a vaccination rate of 95% for everyone aged above 50 (and 75% for the population in general). In September 2021, my country announced that COVID-19 is no longer classified as a “critical threat”.

Well that aged somehow. But the general strategy seems not to have been the worst one.

Why the "specialness spiral" leads us to not use some ordinary objects (edition.cnn.com)

When people decide not to use something at one point in time, the item can start to feel more special. And as it feels more special, they want to protect it and are less likely to want to use it in the future. This accrual of specialness can be one explanation for how possessions accumulate and turn into unused clutter.

That’s good knowledge. I fall for this sometimes too – something as simple as a notebook feels too special to use. So just go with it and use it; it’s meant to be used.

Covid pandemic is not the supply chains’ only problem (washingtonpost.com)

I think I’ve been talking about supply chain issues since early June, and the situation has not gotten better. It is something that will most likely stay for a while. This article on Bloomberg is sadly behind a paywall now, but it’s also a good one.

[Bug]: Let’s Encrypt root CA isn’t working properly · Issue #31212 · electron/electron (github.com)

It is interesting to see which parts of the Internet broke when the Let’s Encrypt Root Certificate ran out.

Also another case of what the Operations is not Developer IT post above describes.

Mistakes I’ve Made in AWS (laravel-news.com)

Some low-level money-saving tips.

Leaded Gas Was a Known Poison the Day It Was Invented (smithsonianmag.com)

That report acknowledged that exposure levels might rise over time. “But, of course, that would be another generation’s problem,” she writes. Those early actions set a precedent that was hard to undo: it wouldn’t be until the mid-1970s that a growing body of evidence about the dangers of leaded gasoline led the EPA to enter into a years-long legal struggle with gasoline-makers over phasing out leaded gasoline.

An industry not to be trusted for so many reasons…

Clearview AI Offered Free Trials To Police Around The World (buzzfeednews.com)

Clearview trials coming to a police officer near you soon!

Tor is a Great SysAdmin Tool (jamieweb.net)

  • Testing IP Address Based Access Rules
  • Testing Internally-Hosted Services From an External Perspective
  • Making Reliable External DNS Lookups When Operating in a Split-Horizon DNS Environment

Didn’t think about using Tor to test those scenarios, because I have access to enough jump hosts across many networks. But it’s a good reminder that Tor can also be used for this 🙂

Has the firefighting stopped? The effect of COVID-19 on on-call engineers (pagerduty.com)

For many teams responsible for supporting this always-on world, “firefighting” has become the typical mode of operation. But this digital shift is here to stay, and the workload is not going to reduce. Over the next few blogs, we’re going to dig further into the findings from our platform data and explore how the growing volume of real-time work is increasingly burdening technical teams. In this first blog, we’ll share how this firefighting affects burnout levels, how to classify and quantify interruptions, and what teams can do to avoid attrition.

Seeing this article from PagerDuty made me realize how much has shifted in the past two years. Day-to-day changes were easy to see, but I still see many teams firefighting daily. Luckily, a few of the patterns outlined in the article affect us less: 24/7 availability is something we can handle via different time zones in our team, which alone makes it much more doable.

World’s Largest Chip Maker to Raise Prices, Threatening Costlier Electronics (wsj.com)

TSMC to increase prices of most advanced chips by roughly 10%; less advanced chips will cost about 20% more

Welcome to the supply chain shortage

Zoom RCE from Pwn2Own 2021 (sector7.computest.nl)

Zero-click exploits are crazy to witness. This is a great writeup of some details of the RCE.

CO2 Einsparungen durch Homeoffice (erneuer.bar)

For simplicity, I calculate the yearly savings over 47 weeks, accounting for 5 weeks of holidays per employee. That gives us 35,626 km and 4,117 kg of CO2 emissions saved. And that fits very nicely, because the average CO2 emission per capita in Switzerland is 4,120 kg. In Germany, by comparison, emissions are twice as high. So with our home-office credo across 320 percent of full-time positions, we save as much CO2 as one person in Switzerland causes on average. You’re very welcome.

A good summary of how much CO2 can be saved by working from home.

Turing Pi V2 is here (turingpi.com)

The Turing PI V2 looks great. Able to handle 4 Compute modules. Can I haz?

My MacBook Pro had over 10,000 USD in repairs (pqvst.com)

The total repair costs (excluding complete laptop replacements), which has thankfully all been covered by AppleCare Protection Plan, are roughly 4,000 USD. More than the initial cost of the laptop itself. Factoring in the cost of the complete replacements, it would be closer to 10,000 USD!

Been there too:

  • 2 top case replacements on my old device due to popped speakers
  • Graphics Card Damage when it just got out of warranty

Joe Rogan, confined to Spotify, is losing influence (theverge.com)

However, a new data investigation by The Verge finds that the powerful podcaster’s influence has waned since he went behind Spotify’s wall. His show has declined as a hype vehicle for guests, and Rogan’s presence as a mainstay in the news has plummeted.

Surprised much? The team around Joe had the YouTube game pretty much figured out. Being constrained to a walled garden won’t help build more reach.

Happy birthday – 30 Years of Linux (ubuntu.com)

just a hobby, won’t be big and professional like gnu

This made my day when I read it first 🙂

Samsung Supports Retailers Affected By Looting With Innovative Television Block Function (news.samsung.com)

The aim of the technology is to mitigate against the creation of secondary markets linked to the sale of illegal goods, both in South Africa and beyond its borders. This technology is already pre-loaded on all Samsung TV products.

Digital arm-breakers… I didn’t post this one in an earlier installment, but here we go: The zombie economy and digital arm-breakers. Not saying stealing isn’t bad, but this piece of technology is in your TV no matter what.

PaulWetz – Hypnotize Me (youtube.com)

Running a Public DNS Resolver for fun

When I set up my Odroid server earlier this year, I wondered if it was a good idea to run a public-facing DNS resolver based on Pi-hole. Against all the voices telling me no, I decided nonetheless to try it and see what happens. In the end, the traffic will be limited at some point by the available CPU power, and the operations team at CommunityRack.org will give me a hearty slap on the wrist, saying “you broke it, you’ll fix it”, and make me buy some pizza and/or doughnuts for the next time we meet in person. So the experiment began towards mid-February.

You can see the traffic I was generating most of the time until around May: primarily clients connected to the VPN using the DNS resolver, following mostly standard day/night/weekend traffic patterns.

There’s a noticeable bump in July, a considerable spike towards August, and then in October the floodgates opened entirely, with a couple of million DNS queries per day. My theory is that at some point in July the resolver got onto some well-known DNS list and started to gain “trust”, as it was always online.

A few observations:

  1. First, there was only my traffic, but soon after, someone or a small group of people discovered the resolver and started using it.
  2. Discovery — I was confused about why and how people found the resolver, but they seemed to use it steadily.
  3. Service-Thoughts — You can’t get in touch with unknown users like that, so I set up a small landing page on the IP and added an email address, asking anyone who plans to use the service for an extended time to reach out, so I could at least give them a heads-up in case the service needs to shut down. If you have ever debugged a failing DNS server, you know why – nobody deserves this.
  4. Privacy — At some point I noticed that I would need to shred the log files, and I started reducing the data logged to disk. The less I know, the better. At this point, I only cared about the raw numbers.
  5. Trust? — Last, and most concerning to me personally: people seem to blindly trust a random IP on the internet that gives them DNS responses. (I do take some pride in having run a DNS resolver with seemingly good uptime and minimal maintenance.)

So how long?

The answer is 7 months and 3 days (15th March till 18th October).

Sorry to the people who now have a broken DNS resolver. And sorry if my resolver has been part of some sort of DNS amplification attack (based on the traffic it should not have been, but that’s hard to say).

The experiment has ended; thanks for participating. I’ve just shredded all logs.