Prometheus and anti-pattern pushgateway timeouts

Prometheus is a pretty awesome recording server for metrics of all sorts. We use it at work to record data about servers, room temperatures, and other things. The whole setup gets really nice and shiny when combined with a slick dashboard like Grafana.

But enough of this fanboy-ism. There is a problem with Prometheus which almost became a deal breaker for us: Prometheus employs a (relatively) strict pull mechanism for fetching metrics from devices. The server is configured to regularly contact its peers and fetch the metrics from them. Prometheus takes the active part as the data collector and can therefore detect downtimes of devices automatically. This lets you define on each client which metrics it exposes, and configure on the server which clients to fetch them from. A nicely encapsulated design!
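To make the pull idea concrete, here is a minimal sketch in Python. It is not how Prometheus is implemented; `scrape_all` and the `fetch` callback are hypothetical names made up for illustration. The point is only that the collector notices a dead device for free, because the failed fetch itself is the signal (Prometheus records this per target as the `up` metric).

```python
from typing import Callable, Dict, Iterable

Samples = Dict[str, float]


def scrape_all(
    targets: Iterable[str],
    fetch: Callable[[str], Samples],
) -> Dict[str, Samples]:
    """Pull metrics from every target and record whether it was reachable.

    `fetch` is a hypothetical callback that returns the samples of one
    target or raises OSError (e.g. ConnectionError) if it is unreachable.
    """
    results: Dict[str, Samples] = {}
    for target in targets:
        try:
            samples = dict(fetch(target))
            samples["up"] = 1.0  # scrape succeeded
        except OSError:
            samples = {"up": 0.0}  # scrape failed: the device counts as down
        results[target] = samples
    return results


def fake_fetch(target: str) -> Samples:
    """Stand-in for an HTTP request to the target's metrics endpoint."""
    if target == "rack-2":
        raise ConnectionError("connection refused")
    return {"room_temperature_celsius": 21.5}


if __name__ == "__main__":
    print(scrape_all(["rack-1", "rack-2"], fake_fetch))
```

Because the server initiates every request, no extra bookkeeping is needed to detect an outage. That is exactly the property that gets lost once devices push their data instead.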

This design reaches its limits, though, when it collides with company restrictions on data flow, also called “firewalls”. Publishing metrics to the internet from “the inside” becomes almost impossible, since the active part of the Prometheus system sits outside the firewall and cannot contact the machines it should “scrape” the data from. This is a well-known issue and it can partially be fixed by relying on the so-called pushgateway. Metrics are pushed to this pushgateway, stored there, and later served to the Prometheus server when the pushgateway is scraped. Since metrics are now pushed from the devices, it is possible to get through business firewalls and send data to servers on the internet.
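As a rough sketch of what such a push looks like from a device, the following Python snippet uses only the standard library. The pushgateway accepts metrics in the text exposition format via HTTP on a path of the form `/metrics/job/<job>/instance/<instance>`. The gateway address, job name, and metric name below are assumptions for illustration, and the helper functions are hypothetical, not part of any official client.

```python
import urllib.request
from typing import Dict
from urllib.parse import quote


def build_push_url(gateway: str, job: str, instance: str) -> str:
    """Build the push URL for one job/instance group.

    Assumes simple label values without slashes.
    """
    return (
        f"{gateway.rstrip('/')}/metrics"
        f"/job/{quote(job, safe='')}"
        f"/instance/{quote(instance, safe='')}"
    )


def render_metrics(samples: Dict[str, float]) -> bytes:
    """Render samples as one 'name value' line each, newline-terminated."""
    lines = [f"{name} {value}" for name, value in sorted(samples.items())]
    return ("\n".join(lines) + "\n").encode("utf-8")


def push(gateway: str, job: str, instance: str,
         samples: Dict[str, float], timeout: float = 5.0) -> int:
    """PUT the samples to the pushgateway, replacing the group's old metrics."""
    request = urllib.request.Request(
        build_push_url(gateway, job, instance),
        data=render_metrics(samples),
        method="PUT",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Hypothetical gateway address; replace with your own before running.
    push("http://pushgateway.example.com:9091", "room_sensors", "rack-1",
         {"room_temperature_celsius": 21.5})
```

The connection is opened from the inside to the outside, which is why this direction usually passes a firewall while an inbound scrape does not.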

However, the authors of Prometheus see this use of the pushgateway as an antipattern. The official use case for the pushgateway is to persist metrics that are not continuously available, but are generated, for example, by an automated script that runs for a short time. When it finishes, it produces some metric that needs to be made available to Prometheus, but the script cannot expose it itself since it is not a continuously running server process. Pushing the generated metrics to a local(!) pushgateway for later scraping by an external(!) Prometheus server is the solution. Note that this is different from the proposed firewall use case for the pushgateway. To be able to push through the firewall, the pushgateway must be on the Prometheus server side, not on the devices.

The consequence of this design decision is that an important feature is missing from the pushgateway: timeouts for stored metrics. These are important in the firewall use case, because the Prometheus server can no longer check whether a device is offline. The last pushed value stays in the pushgateway forever, and the data just “flatlines” if a device goes offline.

At work, this was a real shame: the Prometheus server worked fine and was great, but we could not use it through our business firewall. Personally, I see why the original developers consider it an antipattern to use a pushgateway for firewall circumvention. On the other hand, it is also a pity that this software becomes entirely unusable in this situation, especially since the missing feature is relatively small. Therefore, it was time to code the antipattern!

Since it was needed for work, I contributed to the project and implemented the unintended feature, which is available on GitHub and also in binary form on Docker Hub as a compiled Docker image. The extension of the pushgateway allows devices to send timeout information along with their metrics. The pushgateway will then delete these metrics if they were not refreshed within the defined timeout.
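The idea behind the timeout can be sketched in a few lines of Python. This is not the code of the actual extension, and it says nothing about how the real pushgateway receives the timeout value; `ExpiringMetricStore` and its methods are hypothetical names that only illustrate the behaviour: every push refreshes a deadline, and groups whose deadline has passed are dropped before they are served.

```python
import time
from typing import Callable, Dict, Optional, Tuple

Samples = Dict[str, float]


class ExpiringMetricStore:
    """Toy in-memory store that forgets metrics which were not refreshed."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # group key -> (samples, deadline); a deadline of None means "keep forever"
        self._groups: Dict[str, Tuple[Samples, Optional[float]]] = {}

    def push(self, key: str, samples: Samples,
             timeout_s: Optional[float] = None) -> None:
        """Store samples for a group and (re)start its timeout, if any."""
        deadline = None if timeout_s is None else self._clock() + timeout_s
        self._groups[key] = (dict(samples), deadline)

    def scrape(self) -> Dict[str, Samples]:
        """Drop expired groups, then return everything that is still fresh."""
        now = self._clock()
        expired = [key for key, (_, deadline) in self._groups.items()
                   if deadline is not None and deadline <= now]
        for key in expired:
            del self._groups[key]
        return {key: samples for key, (samples, _) in self._groups.items()}


if __name__ == "__main__":
    store = ExpiringMetricStore()
    store.push("rack-1", {"room_temperature_celsius": 21.5}, timeout_s=60)
    print(store.scrape())  # rack-1 is served until 60 s pass without a new push
```

A device that keeps pushing keeps resetting its deadline. A device that goes offline simply disappears from the next scrape once the timeout has passed, instead of reporting its last value forever.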

This works like a charm and allows us to make use of Prometheus through our firewall. If servers are offline, their metrics no longer just flatline, but are shown as missing. Perfect! I cannot support this project at work, but will probably do so every now and then in my free time. So have fun using this feature if you feel a bit “antipattern”.
