mtail job alert on rdsys-test-01 falling through to the default route
I got this email while doing the round of reboots this morning:
Date: Wed, 03 Jul 2024 13:34:47 +0000
From: alertmanager@hetzner-nbg1-01.torproject.org
To: root@localhost
Subject: Configuration error - Default route: [FIRING:1] JobDown
CONFIGURATION ERROR: The following notifications were sent via the default route node, meaning
that they had no team label matching one of the per-team routes.
This should not be happening and it should be fixed. See:
https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/prometheus#reference
Total firing alerts: 1
## Firing Alerts
-----
Time: 2024-07-03 13:34:17.366 +0000 UTC
Summary: Job mtail@rdsys-test-01.torproject.org is down
Description: Job mtail on rdsys-test-01.torproject.org has been down for more than 5 minutes.
-----
There are two problems here:
- An alert was sent out. This was a routine reboot, which shouldn't have triggered an alert at all.
- The alert fell through to the default notification route. It should have been routed to a specific team instead.
I'm not sure which team this belongs to. My gut reaction was that this is an anti-censorship host, so that team should get the notification. Looking more closely, though, the alert is about the mtail job, which is our responsibility, so it should probably have been routed to TPA.
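
To make the fall-through concrete, here is a rough sketch of the routing decision as I understand it from the email: an alert with no `team` label matches none of the per-team routes and lands on the default one. This is not Alertmanager's actual matching code, and the receiver names, team values, and label names other than `team` are assumptions for illustration only.

```python
from typing import Dict

# Hypothetical per-team receivers; the real route tree lives in the
# Alertmanager configuration and may use different names.
TEAM_RECEIVERS: Dict[str, str] = {
    "TPA": "tpa-email",
    "anti-censorship": "anti-censorship-email",
}
# Hypothetical name for the default (fallback) route's receiver.
DEFAULT_RECEIVER = "fallback"


def pick_receiver(labels: Dict[str, str]) -> str:
    """Return the receiver for an alert based on its team label.

    Alerts without a team label, or with an unknown team, fall
    through to the default receiver.
    """
    team = labels.get("team", "")
    return TEAM_RECEIVERS.get(team, DEFAULT_RECEIVER)


# Labels loosely modelled on the email above (label names assumed):
# the mtail job carries no team label.
alert = {
    "alertname": "JobDown",
    "job": "mtail",
    "instance": "rdsys-test-01.torproject.org",
}

print(pick_receiver(alert))                     # fallback
print(pick_receiver({**alert, "team": "TPA"}))  # tpa-email
```

If that model is right, the fix is to make sure the mtail scrape job (or the alert itself) carries a `team` label, rather than to change the routes.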
/cc @lelutin