Replacing hchk.io with Prometheus and Pushgateway

Published Aug 31, 2019 by Ricard Bejarano

This post has been archived.
It is no longer maintained nor listed.

The other day, reading this Lobste.rs thread, I came across healthchecks.io (or hchk.io), a service for monitoring cron jobs.

The service is pretty simple: you specify the details of your cron job (schedule, grace time, etc.) and you are given a unique URL, which your job then pings once it has finished, letting healthchecks.io know the job ran.

You can also ping URL/start first and then URL to track how long your job took to finish, and ping URL/fail to signal a failure.
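
To make the ping protocol concrete, here is a minimal Python sketch of a wrapper that follows it. The function names, the injectable ping callable and any URL you pass in are assumptions for illustration; this is not healthchecks.io's own tooling.

```python
import urllib.request
from typing import Callable


def http_ping(url: str) -> None:
    """Default pinger: a plain GET request with a short timeout."""
    urllib.request.urlopen(url, timeout=10).close()


def run_with_pings(job: Callable[[], None], base_url: str,
                   ping: Callable[[str], None] = http_ping) -> bool:
    """Ping URL/start, run the job, then ping URL on success or URL/fail on error."""
    ping(base_url + "/start")
    try:
        job()
    except Exception:
        # The job raised: report the failure instead of the success ping.
        ping(base_url + "/fail")
        return False
    ping(base_url)
    return True
```

Passing a different ping callable makes the wrapper easy to try out without touching the network.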

The logic is pretty simple. There’s nothing special about the service, and as I already monitor my services with Prometheus, I thought I could hack together a similar solution with Pushgateway.

What is Pushgateway?

First, a little background on how Prometheus gathers metrics.

Prometheus collects metrics by pulling: it actively reaches out to endpoints that expose metrics in the Prometheus format.

Therefore an endpoint must be up and running whenever Prometheus comes around to scrape it.

But how do I make Prometheus get metrics about my batch jobs, if they run only for a few minutes?

That’s where Pushgateway comes into play. Pushgateway is a service that listens for metrics being pushed to it, keeps them until they are overwritten or deleted, and exposes them for Prometheus to scrape, something like voicemail for metrics.
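
As a rough illustration of what "pushing" means here, the sketch below builds the HTTP request a job would send. The gateway address, job name and metric name are hypothetical, and the job name is assumed to contain no slashes.

```python
import urllib.request


def build_push(gateway: str, job: str, metric: str, value: float) -> urllib.request.Request:
    """Build (but do not send) a request that pushes one metric to Pushgateway."""
    # Pushgateway groups pushed metrics under /metrics/job/<job name>.
    url = f"{gateway.rstrip('/')}/metrics/job/{job}"
    # The body uses the Prometheus text format and must end with a newline.
    body = f"{metric} {value}\n".encode()
    # POST replaces metrics with the same name in the group and keeps the rest.
    return urllib.request.Request(url, data=body, method="POST")
```

Sending it is then a matter of calling urllib.request.urlopen(request) from the job, after which the value stays available for Prometheus to scrape.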

Why replace healthchecks.io?

Less cost

healthchecks.io has three pricing tiers (as of August 2019):

Hobbyist: $0/mo for 20 jobs and 3 team members
Business: $16/mo for 100 jobs, 10 team members and 50 SMS & WhatsApp monthly alerts
Business Plus: $64/mo for 1000 jobs, unlimited team members and 500 SMS & WhatsApp monthly alerts

If you don’t exceed the Hobbyist tier, you’ll be fine, but at $16/mo for 100 jobs, I’d rather write my own alternative.

With Prometheus, you get unlimited jobs, unlimited team members and unlimited alerts through whatever channels you choose, at the cost of running a simple stack of three microservices and a slightly less “turn-key” feel.

More features

Pushgateway supports labeling metrics, so you can label your pings with a stage label and let Pushgateway know the status of each of your job’s stages.
Pushgateway supports pushing many metrics at once, letting you export more insight about your job every time you push.
Prometheus lets you better tune which rules trigger an alert, who in the team gets notified, etc.
With this setup, your jobs report status over the LAN, so you can restrict your jobs’ internet access, which is a nice security plus.
This setup keeps your IP address private (not like it’s sensitive information if you are running internet services anyway, but it’s nice to have).
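
The first two points, labels and multi-metric pushes, can be sketched as follows. The stage label comes from the text above; the job and metric names are made up for illustration.

```python
from typing import Mapping


def grouping_path(job: str, labels: Mapping[str, str]) -> str:
    """Build the Pushgateway path for a job plus extra grouping labels."""
    # Assumes names and values are plain words: values containing "/" need
    # the base64 encoding described in Pushgateway's README.
    parts = ["metrics", "job", job]
    for name, value in labels.items():
        parts += [name, value]
    return "/" + "/".join(parts)


def render_metrics(metrics: Mapping[str, float]) -> str:
    """Render several metrics as one text-format body, one sample per line."""
    return "".join(f"{name} {value}\n" for name, value in metrics.items())
```

Pushing the rendered body to the gateway address plus grouping_path("nightly_backup", {"stage": "upload"}) would let each stage of a job report its own set of metrics in a single request.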

Note: the only downside to this setup is that Alertmanager supports fewer alert channels, but it’s not too hard to write your own client.

How to replace healthchecks.io?

Install the Prometheus stack

Install and run Prometheus, Pushgateway and Alertmanager.
Connect Prometheus to Alertmanager by listing it under alerting in prometheus.yml.
Make Prometheus scrape Pushgateway by adding it under scrape_configs in prometheus.yml.
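
The two prometheus.yml edits might look roughly like the fragment rendered by this small Python helper. The host names are assumptions (9093 and 9091 are the default ports of Alertmanager and Pushgateway), and this is a sketch rather than the configuration originally used in this post.

```python
PROMETHEUS_FRAGMENT = """\
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["{alertmanager}"]

scrape_configs:
  - job_name: pushgateway
    honor_labels: true  # keep the job labels attached by the pushing jobs
    static_configs:
      - targets: ["{pushgateway}"]
"""


def render_prometheus_fragment(alertmanager: str = "localhost:9093",
                               pushgateway: str = "localhost:9091") -> str:
    """Fill in the (assumed) Alertmanager and Pushgateway addresses."""
    return PROMETHEUS_FRAGMENT.format(alertmanager=alertmanager,
                                      pushgateway=pushgateway)


if __name__ == "__main__":
    # Print the fragment so it can be merged into prometheus.yml by hand.
    print(render_prometheus_fragment())
```

Setting honor_labels matters here: without it, Prometheus would overwrite the job label of every pushed metric with the name of the scrape job.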

Make jobs push metrics to Pushgateway

Once you have all that interconnected, your jobs can push metrics to Pushgateway for Prometheus to scrape. There are three things worth reporting:

Job execution (did the job run at all?)
Job success or failure
Job execution time
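
The three cases above can be covered by one wrapper around the job. This is a minimal sketch: the metric names, the gateway address and the helper names are assumptions of mine, not something prescribed by Pushgateway.

```python
import time
import urllib.request
from typing import Callable, Dict


def run_and_measure(job: Callable[[], None],
                    clock: Callable[[], float] = time.time) -> Dict[str, float]:
    """Run the job and collect the three metrics: last run, success, duration."""
    started = clock()
    try:
        job()
        success = 1
    except Exception:
        success = 0
    finished = clock()
    return {
        "job_last_run_timestamp_seconds": finished,  # did it run, and when?
        "job_success": success,                      # 1 on success, 0 on failure
        "job_duration_seconds": finished - started,  # how long it took
    }


def push(gateway: str, job_name: str, metrics: Dict[str, float]) -> None:
    """POST the metrics to Pushgateway in the text format (job_name: no slashes)."""
    body = "".join(f"{name} {value}\n" for name, value in metrics.items()).encode()
    url = f"{gateway.rstrip('/')}/metrics/job/{job_name}"
    request = urllib.request.Request(url, data=body, method="POST")
    urllib.request.urlopen(request, timeout=10).close()
```

A cron job would then end with something like push("http://localhost:9091", "nightly_backup", run_and_measure(do_backup)), where do_backup stands in for the actual work.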

See Pushgateway’s README.md for more information on how to push metrics.

Configure alerting upon job failure

Reference a rules file from prometheus.yml (under rule_files) and define the alerting rules for your job metrics in rules.yml.
Finally, configure an Alertmanager alert receiver and start getting notified whenever your jobs fail.
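
As a hedged sketch of those two pieces, the helper below renders a rule_files entry and a rules.yml with two alerts: one for a reported failure and one for a job that has not run recently. The metric names, alert names and the 25-hour default are assumptions, so adapt them to whatever your jobs actually push.

```python
RULE_FILES_FRAGMENT = """\
rule_files:
  - rules.yml
"""

RULES_TEMPLATE = """\
groups:
  - name: batch_jobs
    rules:
      - alert: JobFailed
        expr: job_success == 0
        labels:
          severity: critical
      - alert: JobMissed
        expr: time() - job_last_run_timestamp_seconds > {max_age_seconds}
        labels:
          severity: critical
"""


def render_rules(max_age_seconds: int = 90000) -> str:
    """Render rules.yml; the default fires after 25 hours without a run."""
    return RULES_TEMPLATE.format(max_age_seconds=max_age_seconds)


if __name__ == "__main__":
    print(RULE_FILES_FRAGMENT)  # goes into prometheus.yml
    print(render_rules())       # goes into rules.yml
```

The second alert plays the role of healthchecks.io's grace time: it fires when the last pushed timestamp is older than the schedule allows.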

Note: check out my previous post on integrating Alertmanager with Amazon SES to get email alerts without worrying about setting up SMTP servers.
