prom: review design and architecture (#41655)
authored Oct 07, 2024 by anarcat
service/prometheus.md
View page @ 45bd3dac
@@ -2040,7 +2040,7 @@ Prometheus is currently not doing alerting so it doesn't have any sort
 of guaranteed availability. It should, hopefully, not lose too many
 metrics over time so we can do proper long-term resource planning.
-## Design
+## Design and architecture
 Here is, from the [Prometheus overview documentation][], the
 basic architecture of a Prometheus site:
@@ -2082,103 +2082,23 @@ Here's how the internal design of the Alertmanager looks like:
 The first deployments of the Alertmanager at TPO do not feature
 a "cluster", or high availability (HA) setup.
-Alerts are typically sent over email, but Alertmanager also has
-builtin support for:
-* Email
-* Slack
-* [Victorops][] (now Splunk)
-* [Pagerduty][]
-* [Opsgenie][] (now Atlassian)
-* Wechat
-There's also a [generic web hook receiver][] which is typically used
-to send notifications. Many other endpoints are implemented through
-that web hook, for example:
-* [Cachet][]
-* [Dingtalk][]
-* [Discord][]
-* [Google Chat][]
-* [IRC][]
-* Matrix: [`matrix-alertmanager`][] (JavaScript) or [knopfler][] (Python), see
-  also [#40216][]
-* [Mattermost][]
-* [Microsoft teams][]
-* [Phabricator][]
-* [Sachet][] supports *many* messaging systems (Twilio, Pushbullet,
-  Telegram, Sipgate, etc)
-* [Sentry][]
-* [Signal][] (or [Signald][])
-* [Splunk][]
-* [SNMP][]
-* Telegram: [`nopp/alertmanager-webhook-telegram-python`][] or [`metalmatze/alertmanager-bot`][]
-* [Twilio][]
-* [Wechat][]
-* Zabbix: [`alertmanager-zabbix-webhook`][] or [`zabbix-alertmanager`][]
-And that is only what was available at the time of writing, the
-[`alertmanager-webhook`][] and [`alertmanager` tags][] GitHub might have more.
-The Alertmanager has its own web interface to see and silence alerts,
-but there are also alternatives like [Karma][] (previously
-Cloudflare's [unsee][]). The web interface is
-not shipped with the Debian package, because it depends on the [Elm
-compiler][] which is [not in Debian][]. It can be built by hand
-using the `debian/generate-ui.sh` script, but only in newer, post
-buster versions. Another alternative to consider is [Crochet][].
+The Alertmanager has its own web interface to see and silence alerts
+but it's not deployed in our configuration, we use [Karma][]
+(previously Cloudflare's [unsee][]) instead.
+[the "My Philosophy on Alerting" paper from a Google engineer]: https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit
+[Monitoring distributed systems]: https://www.oreilly.com/radar/monitoring-distributed-systems/
+[Site Reliability Engineering]: https://www.oreilly.com/library/view/site-reliability-engineering/9781491929117/
+[kthxbye bot]: https://github.com/prymitive/kthxbye
-#### Alerting philosophy
+### Alerting philosophy
 In general, when working on alerting, keeping [the "My Philosophy on
 Alerting" paper from a Google engineer][] (now the [Monitoring
 distributed systems][] chapter of the [Site Reliability
 Engineering][] O'Reilly book.
-Another issue with alerting in Prometheus is that you can only silence
-warnings for a certain amount of time, then you get a notification
-again. The [kthxbye bot][] works around that issue.
-[Victorops]: https://victorops.com
-[Pagerduty]: https://pagerduty.com/
-[Opsgenie]: https://opsgenie.com
-[generic web hook receiver]: https://prometheus.io/docs/alerting/latest/configuration/#webhook_config
-[Cachet]: https://github.com/oxyno-zeta/prometheus-cachethq
-[Dingtalk]: https://github.com/timonwong/prometheus-webhook-dingtalk
-[Discord]: https://github.com/rogerrum/alertmanager-discord
-[Google Chat]: https://github.com/mr-karan/calert
-[IRC]: https://github.com/crisidev/alertmanager_irc
-[#40216]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40216
-[`matrix-alertmanager`]: https://github.com/jaywink/matrix-alertmanager
-[knopfler]: https://github.com/sinnwerkstatt/knopfler
-[Mattermost]: https://github.com/cpanato/mattermost-plugin-alertmanager
-[Microsoft teams]: https://github.com/prometheus-msteams/prometheus-msteams
-[Phabricator]: https://github.com/knyar/phalerts
-[Sachet]: https://github.com/messagebird/sachet
-[Sentry]: https://github.com/mikeroll/alertmanager-sentry-gateway
-[Signal]: https://github.com/dadevel/alertmanager-signal-receiver
-[Signald]: https://github.com/dgl/alertmanager-webhook-signald
-[Splunk]: https://github.com/sylr/alertmanager-splunkbot
-[SNMP]: https://github.com/maxwo/snmp_notifier
-[`nopp/alertmanager-webhook-telegram-python`]: https://github.com/nopp/alertmanager-webhook-telegram-python
-[`metalmatze/alertmanager-bot`]: https://github.com/metalmatze/alertmanager-bot
-[Twilio]: https://github.com/Swatto/promtotwilio
-[Wechat]: https://github.com/daozzg/work_wechat_robot
-[`alertmanager-zabbix-webhook`]: https://github.com/gmauleon/alertmanager-zabbix-webhook
-[`zabbix-alertmanager`]: https://github.com/devopyio/zabbix-alertmanager
-[`alertmanager-webhook`]: https://github.com/topics/alertmanager-webhook
-[`alertmanager` tags]: https://github.com/topics/alertmanager
-[Karma]: https://karma-dashboard.io/
-[unsee]: https://github.com/cloudflare/unsee
-[Elm compiler]: https://github.com/elm/compiler
-[not in Debian]: http://bugs.debian.org/973915
-[Crochet]: https://github.com/simonpasquier/crochet
-[the "My Philosophy on Alerting" paper from a Google engineer]: https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit
-[Monitoring distributed systems]: https://www.oreilly.com/radar/monitoring-distributed-systems/
-[Site Reliability Engineering]: https://www.oreilly.com/library/view/site-reliability-engineering/9781491929117/
-[kthxbye bot]: https://github.com/prymitive/kthxbye
-#### Alert timing details
+### Alert timing details
 Alert timing can be a hard topic to understand in Prometheus alerting,
 because there are many components associated with it, and Prometheus
@@ -2289,6 +2209,10 @@ So, conclusions:
 This analysis was done in response to a [mysterious failure to send
 notification in a particularly flappy alert][].
+Another issue with alerting in Prometheus is that you can only silence
+warnings for a certain amount of time, then you get a notification
+again. The [kthxbye bot][] works around that issue.
 [Alertmanager git HEAD]: https://github.com/prometheus/alertmanager/tree/e9904f93a7efa063bac628ed0b74184acf1c7401
 [customized by route]: https://prometheus.io/docs/alerting/latest/configuration/#route
 [documentation on grouping]: https://prometheus.io/docs/alerting/latest/alertmanager/#grouping
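The time limit on silences mentioned above follows from the fact that every Alertmanager silence carries an explicit end time. As a minimal sketch (this is not how the kthxbye bot is implemented), the following Python builds the body that would be sent to Alertmanager's `POST /api/v2/silences` endpoint. The `build_silence` helper, the alert name, the user and the duration are all hypothetical and only serve as illustration.

```python
from datetime import datetime, timedelta, timezone


def build_silence(matchers, duration, created_by, comment, now=None):
    """Build a JSON-serializable silence body for the Alertmanager v2 API.

    `matchers` maps label names to exact values (hypothetical helper
    signature). The silence starts at `now` (defaults to the current UTC
    time) and always ends at `now + duration`: there is no open-ended form.
    """
    start = now or datetime.now(timezone.utc)
    end = start + duration
    return {
        "matchers": [
            {"name": name, "value": value, "isRegex": False, "isEqual": True}
            for name, value in sorted(matchers.items())
        ],
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


# Hypothetical example: silence one alert for two hours.
example = build_silence(
    {"alertname": "HypotheticalAlert"},
    timedelta(hours=2),
    "example-user",
    "planned maintenance",
)
```

Once `endsAt` passes, the silence expires and notifications resume, which is why something has to keep extending it if the alert is still being worked on.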
@@ -2956,3 +2880,85 @@ respective team's service admins.
 | `tor-check-onionoo` | Network health |
 [#40052]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40052
+### Other Alertmanager receivers
+Alerts are typically sent over email, but Alertmanager also has
+builtin support for:
+* Email
+* Slack
+* [Victorops][] (now Splunk)
+* [Pagerduty][]
+* [Opsgenie][] (now Atlassian)
+* Wechat
+There's also a [generic web hook receiver][] which is typically used
+to send notifications. Many other endpoints are implemented through
+that web hook, for example:
+* [Cachet][]
+* [Dingtalk][]
+* [Discord][]
+* [Google Chat][]
+* [IRC][]
+* Matrix: [`matrix-alertmanager`][] (JavaScript) or [knopfler][] (Python), see
+  also [#40216][]
+* [Mattermost][]
+* [Microsoft teams][]
+* [Phabricator][]
+* [Sachet][] supports *many* messaging systems (Twilio, Pushbullet,
+  Telegram, Sipgate, etc)
+* [Sentry][]
+* [Signal][] (or [Signald][])
+* [Splunk][]
+* [SNMP][]
+* Telegram: [`nopp/alertmanager-webhook-telegram-python`][] or [`metalmatze/alertmanager-bot`][]
+* [Twilio][]
+* [Wechat][]
+* Zabbix: [`alertmanager-zabbix-webhook`][] or [`zabbix-alertmanager`][]
+And that is only what was available at the time of writing, the
+[`alertmanager-webhook`][] and [`alertmanager` tags][] GitHub might
+have more.
+The Alertmanager web interface is not shipped with the Debian package,
+because it depends on the [Elm compiler][] which is [not in
+Debian][]. It can be built by hand using the `debian/generate-ui.sh`
+script, but only in newer, post buster versions. Another alternative
+to consider is [Crochet][].
+[Victorops]: https://victorops.com
+[Pagerduty]: https://pagerduty.com/
+[Opsgenie]: https://opsgenie.com
+[generic web hook receiver]: https://prometheus.io/docs/alerting/latest/configuration/#webhook_config
+[Cachet]: https://github.com/oxyno-zeta/prometheus-cachethq
+[Dingtalk]: https://github.com/timonwong/prometheus-webhook-dingtalk
+[Discord]: https://github.com/rogerrum/alertmanager-discord
+[Google Chat]: https://github.com/mr-karan/calert
+[IRC]: https://github.com/crisidev/alertmanager_irc
+[#40216]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40216
+[`matrix-alertmanager`]: https://github.com/jaywink/matrix-alertmanager
+[knopfler]: https://github.com/sinnwerkstatt/knopfler
+[Mattermost]: https://github.com/cpanato/mattermost-plugin-alertmanager
+[Microsoft teams]: https://github.com/prometheus-msteams/prometheus-msteams
+[Phabricator]: https://github.com/knyar/phalerts
+[Sachet]: https://github.com/messagebird/sachet
+[Sentry]: https://github.com/mikeroll/alertmanager-sentry-gateway
+[Signal]: https://github.com/dadevel/alertmanager-signal-receiver
+[Signald]: https://github.com/dgl/alertmanager-webhook-signald
+[Splunk]: https://github.com/sylr/alertmanager-splunkbot
+[SNMP]: https://github.com/maxwo/snmp_notifier
+[`nopp/alertmanager-webhook-telegram-python`]: https://github.com/nopp/alertmanager-webhook-telegram-python
+[`metalmatze/alertmanager-bot`]: https://github.com/metalmatze/alertmanager-bot
+[Twilio]: https://github.com/Swatto/promtotwilio
+[Wechat]: https://github.com/daozzg/work_wechat_robot
+[`alertmanager-zabbix-webhook`]: https://github.com/gmauleon/alertmanager-zabbix-webhook
+[`zabbix-alertmanager`]: https://github.com/devopyio/zabbix-alertmanager
+[`alertmanager-webhook`]: https://github.com/topics/alertmanager-webhook
+[`alertmanager` tags]: https://github.com/topics/alertmanager
+[Karma]: https://karma-dashboard.io/
+[unsee]: https://github.com/cloudflare/unsee
+[Elm compiler]: https://github.com/elm/compiler
+[not in Debian]: http://bugs.debian.org/973915
+[Crochet]: https://github.com/simonpasquier/crochet
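To illustrate how the generic web hook receiver mentioned above hands alerts to other endpoints, here is a minimal sketch of a receiver in Python, using only the standard library. It assumes the usual Alertmanager webhook JSON body, with an `alerts` list whose entries carry `status`, `labels` and `annotations`. The `summarize` and `serve` names, the listening address and the `summary` annotation are hypothetical choices for this example, not part of our configuration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize(payload):
    """Return one human-readable line per alert in a webhook payload."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append("[{}] {}: {}".format(
            alert.get("status", "unknown"),
            labels.get("alertname", "(no alertname)"),
            annotations.get("summary", ""),
        ))
    return lines


class WebhookHandler(BaseHTTPRequestHandler):
    """Accept POSTed notifications and print a summary of each alert."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        # An empty body is treated as an empty payload; malformed JSON
        # is not handled in this sketch.
        payload = json.loads(self.rfile.read(length) or b"{}")
        for line in summarize(payload):
            print(line)
        self.send_response(200)
        self.end_headers()


def serve(host="127.0.0.1", port=9099):
    """Run the receiver until interrupted (hypothetical address and port)."""
    HTTPServer((host, port), WebhookHandler).serve_forever()
```

Calling `serve()` starts the listener. An Alertmanager receiver with a `webhook_configs` entry whose `url` points at that address would then deliver notifications to it. A real bridge would forward the summary to the target service instead of printing it.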