prom: review design and architecture (#41655) authored by anarcat
......@@ -2040,7 +2040,7 @@ Prometheus is currently not doing alerting so it doesn't have any sort
of guaranteed availability. It should, hopefully, not lose too many
metrics over time so we can do proper long-term resource planning.
## Design
## Design and architecture
Here is, from the [Prometheus overview documentation][], the
basic architecture of a Prometheus site:
......@@ -2082,103 +2082,23 @@ Here's how the internal design of the Alertmanager looks like:
The first deployments of the Alertmanager at TPO do not feature
a "cluster", or high availability (HA) setup.
Alerts are typically sent over email, but Alertmanager also has
built-in support for:
The Alertmanager has its own web interface to see and silence alerts,
but it's not deployed in our configuration; we use [Karma][]
(previously Cloudflare's [unsee][]) instead.
* Email
* Slack
* [Victorops][] (now Splunk)
* [Pagerduty][]
* [Opsgenie][] (now Atlassian)
* Wechat
There's also a [generic web hook receiver][] which is typically used
to send notifications. Many other endpoints are implemented through
that web hook, for example:
* [Cachet][]
* [Dingtalk][]
* [Discord][]
* [Google Chat][]
* [IRC][]
* Matrix: [`matrix-alertmanager`][] (JavaScript) or [knopfler][] (Python), see
also [#40216][]
* [Mattermost][]
* [Microsoft teams][]
* [Phabricator][]
* [Sachet][] supports *many* messaging systems (Twilio, Pushbullet,
Telegram, Sipgate, etc)
* [Sentry][]
* [Signal][] (or [Signald][])
* [Splunk][]
* [SNMP][]
* Telegram: [`nopp/alertmanager-webhook-telegram-python`][] or [`metalmatze/alertmanager-bot`][]
* [Twilio][]
* [Wechat][]
* Zabbix: [`alertmanager-zabbix-webhook`][] or [`zabbix-alertmanager`][]
And that is only what was available at the time of writing; the
[`alertmanager-webhook`][] and [`alertmanager` tags][] on GitHub might have more.
The Alertmanager has its own web interface to see and silence alerts,
but there are also alternatives like [Karma][] (previously
Cloudflare's [unsee][]). The web interface is
not shipped with the Debian package, because it depends on the [Elm
compiler][] which is [not in Debian][]. It can be built by hand
using the `debian/generate-ui.sh` script, but only in newer, post-buster
versions. Another alternative to consider is [Crochet][].
[the "My Philosophy on Alerting" paper from a Google engineer]: https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit
[Monitoring distributed systems]: https://www.oreilly.com/radar/monitoring-distributed-systems/
[Site Reliability Engineering]: https://www.oreilly.com/library/view/site-reliability-engineering/9781491929117/
[kthxbye bot]: https://github.com/prymitive/kthxbye
#### Alerting philosophy
### Alerting philosophy
In general, when working on alerting, keep in mind [the "My Philosophy on
Alerting" paper from a Google engineer][] (now the [Monitoring
distributed systems][] chapter of the [Site Reliability
Engineering][] O'Reilly book).
Another issue with alerting in Prometheus is that you can only silence
warnings for a certain amount of time, then you get a notification
again. The [kthxbye bot][] works around that issue.
[Victorops]: https://victorops.com
[Pagerduty]: https://pagerduty.com/
[Opsgenie]: https://opsgenie.com
[generic web hook receiver]: https://prometheus.io/docs/alerting/latest/configuration/#webhook_config
[Cachet]: https://github.com/oxyno-zeta/prometheus-cachethq
[Dingtalk]: https://github.com/timonwong/prometheus-webhook-dingtalk
[Discord]: https://github.com/rogerrum/alertmanager-discord
[Google Chat]: https://github.com/mr-karan/calert
[IRC]: https://github.com/crisidev/alertmanager_irc
[#40216]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40216
[`matrix-alertmanager`]: https://github.com/jaywink/matrix-alertmanager
[knopfler]: https://github.com/sinnwerkstatt/knopfler
[Mattermost]: https://github.com/cpanato/mattermost-plugin-alertmanager
[Microsoft teams]: https://github.com/prometheus-msteams/prometheus-msteams
[Phabricator]: https://github.com/knyar/phalerts
[Sachet]: https://github.com/messagebird/sachet
[Sentry]: https://github.com/mikeroll/alertmanager-sentry-gateway
[Signal]: https://github.com/dadevel/alertmanager-signal-receiver
[Signald]: https://github.com/dgl/alertmanager-webhook-signald
[Splunk]: https://github.com/sylr/alertmanager-splunkbot
[SNMP]: https://github.com/maxwo/snmp_notifier
[`nopp/alertmanager-webhook-telegram-python`]: https://github.com/nopp/alertmanager-webhook-telegram-python
[`metalmatze/alertmanager-bot`]: https://github.com/metalmatze/alertmanager-bot
[Twilio]: https://github.com/Swatto/promtotwilio
[Wechat]: https://github.com/daozzg/work_wechat_robot
[`alertmanager-zabbix-webhook`]: https://github.com/gmauleon/alertmanager-zabbix-webhook
[`zabbix-alertmanager`]: https://github.com/devopyio/zabbix-alertmanager
[`alertmanager-webhook`]: https://github.com/topics/alertmanager-webhook
[`alertmanager` tags]: https://github.com/topics/alertmanager
[Karma]: https://karma-dashboard.io/
[unsee]: https://github.com/cloudflare/unsee
[Elm compiler]: https://github.com/elm/compiler
[not in Debian]: http://bugs.debian.org/973915
[Crochet]: https://github.com/simonpasquier/crochet
[the "My Philosophy on Alerting" paper from a Google engineer]: https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit
[Monitoring distributed systems]: https://www.oreilly.com/radar/monitoring-distributed-systems/
[Site Reliability Engineering]: https://www.oreilly.com/library/view/site-reliability-engineering/9781491929117/
[kthxbye bot]: https://github.com/prymitive/kthxbye
#### Alert timing details
### Alert timing details
Alert timing can be a hard topic to understand in Prometheus alerting,
because there are many components associated with it, and Prometheus
......@@ -2289,6 +2209,10 @@ So, conclusions:
This analysis was done in response to a [mysterious failure to send
notification in a particularly flappy alert][].
Another issue with alerting in Prometheus is that you can only silence
warnings for a certain amount of time, then you get a notification
again. The [kthxbye bot][] works around that issue.
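
To illustrate why silences expire, here is a minimal sketch of the
body a client could send to create one. It assumes the silence object
shape used by the Alertmanager v2 API (`matchers`, `startsAt`,
`endsAt`, `createdBy`, `comment`); the `build_silence` helper and its
parameters are hypothetical and not part of our configuration. The
point is that `endsAt` is always a fixed timestamp, so a silence has
to be renewed by something (a human or a bot) to last longer.

```python
from datetime import datetime, timedelta, timezone


def build_silence(alertname, hours, created_by, comment, now=None):
    """Build a silence body matching one alert name for a fixed duration.

    Hypothetical helper: the field names follow the Alertmanager v2
    silence object as we understand it, everything else is illustrative.
    """
    now = now or datetime.now(timezone.utc)
    return {
        # exact (non-regex) match on the alertname label
        "matchers": [
            {"name": "alertname", "value": alertname, "isRegex": False},
        ],
        "startsAt": now.isoformat(),
        # the silence stops applying at this timestamp
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }
```

Such a body would typically be serialized to JSON and posted to the
Alertmanager silences endpoint; once `endsAt` passes, notifications
resume for alerts that are still firing.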
[Alertmanager git HEAD]: https://github.com/prometheus/alertmanager/tree/e9904f93a7efa063bac628ed0b74184acf1c7401
[customized by route]: https://prometheus.io/docs/alerting/latest/configuration/#route
[documentation on grouping]: https://prometheus.io/docs/alerting/latest/alertmanager/#grouping
......@@ -2956,3 +2880,85 @@ respective team's service admins.
| `tor-check-onionoo` | Network health |
[#40052]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40052
### Other Alertmanager receivers
Alerts are typically sent over email, but Alertmanager also has
built-in support for:
* Email
* Slack
* [Victorops][] (now Splunk)
* [Pagerduty][]
* [Opsgenie][] (now Atlassian)
* Wechat
There's also a [generic web hook receiver][] which is typically used
to send notifications. Many other endpoints are implemented through
that web hook, for example:
* [Cachet][]
* [Dingtalk][]
* [Discord][]
* [Google Chat][]
* [IRC][]
* Matrix: [`matrix-alertmanager`][] (JavaScript) or [knopfler][] (Python), see
also [#40216][]
* [Mattermost][]
* [Microsoft teams][]
* [Phabricator][]
* [Sachet][] supports *many* messaging systems (Twilio, Pushbullet,
Telegram, Sipgate, etc)
* [Sentry][]
* [Signal][] (or [Signald][])
* [Splunk][]
* [SNMP][]
* Telegram: [`nopp/alertmanager-webhook-telegram-python`][] or [`metalmatze/alertmanager-bot`][]
* [Twilio][]
* [Wechat][]
* Zabbix: [`alertmanager-zabbix-webhook`][] or [`zabbix-alertmanager`][]
And that is only what was available at the time of writing; the
[`alertmanager-webhook`][] and [`alertmanager` tags][] on GitHub might
have more.
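
Most of the webhook integrations listed above follow the same pattern:
they accept the JSON document Alertmanager posts and turn it into a
message for another system. As a rough sketch of that step, assuming
the documented webhook payload layout (a top-level `alerts` list whose
entries carry `status`, `labels` and `annotations`), the following
hypothetical `summarize_webhook` function formats one line per alert.
The `summary` annotation is an assumption about how alert rules are
written, not something Alertmanager guarantees.

```python
import json


def summarize_webhook(body: str) -> list[str]:
    """Turn an Alertmanager webhook payload into one text line per alert."""
    payload = json.loads(body)
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        name = labels.get("alertname", "unknown")
        # status is "firing" or "resolved" in the webhook payload
        status = alert.get("status", "unknown").upper()
        line = f"[{status}] {name}"
        # the summary annotation is optional, append it only when set
        summary = annotations.get("summary", "")
        if summary:
            line += f": {summary}"
        lines.append(line)
    return lines
```

A real receiver would wrap this in an HTTP server and forward the
resulting lines to the target system (IRC, Matrix, and so on).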
The Alertmanager web interface is not shipped with the Debian package,
because it depends on the [Elm compiler][] which is [not in
Debian][]. It can be built by hand using the `debian/generate-ui.sh`
script, but only in newer, post-buster versions. Another alternative
to consider is [Crochet][].
[Victorops]: https://victorops.com
[Pagerduty]: https://pagerduty.com/
[Opsgenie]: https://opsgenie.com
[generic web hook receiver]: https://prometheus.io/docs/alerting/latest/configuration/#webhook_config
[Cachet]: https://github.com/oxyno-zeta/prometheus-cachethq
[Dingtalk]: https://github.com/timonwong/prometheus-webhook-dingtalk
[Discord]: https://github.com/rogerrum/alertmanager-discord
[Google Chat]: https://github.com/mr-karan/calert
[IRC]: https://github.com/crisidev/alertmanager_irc
[#40216]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40216
[`matrix-alertmanager`]: https://github.com/jaywink/matrix-alertmanager
[knopfler]: https://github.com/sinnwerkstatt/knopfler
[Mattermost]: https://github.com/cpanato/mattermost-plugin-alertmanager
[Microsoft teams]: https://github.com/prometheus-msteams/prometheus-msteams
[Phabricator]: https://github.com/knyar/phalerts
[Sachet]: https://github.com/messagebird/sachet
[Sentry]: https://github.com/mikeroll/alertmanager-sentry-gateway
[Signal]: https://github.com/dadevel/alertmanager-signal-receiver
[Signald]: https://github.com/dgl/alertmanager-webhook-signald
[Splunk]: https://github.com/sylr/alertmanager-splunkbot
[SNMP]: https://github.com/maxwo/snmp_notifier
[`nopp/alertmanager-webhook-telegram-python`]: https://github.com/nopp/alertmanager-webhook-telegram-python
[`metalmatze/alertmanager-bot`]: https://github.com/metalmatze/alertmanager-bot
[Twilio]: https://github.com/Swatto/promtotwilio
[Wechat]: https://github.com/daozzg/work_wechat_robot
[`alertmanager-zabbix-webhook`]: https://github.com/gmauleon/alertmanager-zabbix-webhook
[`zabbix-alertmanager`]: https://github.com/devopyio/zabbix-alertmanager
[`alertmanager-webhook`]: https://github.com/topics/alertmanager-webhook
[`alertmanager` tags]: https://github.com/topics/alertmanager
[Karma]: https://karma-dashboard.io/
[unsee]: https://github.com/cloudflare/unsee
[Elm compiler]: https://github.com/elm/compiler
[not in Debian]: http://bugs.debian.org/973915
[Crochet]: https://github.com/simonpasquier/crochet