review Prometheus documentation after service overhaul

We're moving a lot of things around in Prometheus. Make sure the Prometheus documentation in the wiki is up to date. In particular, perform the following checks:

priority A

These need to be done as part of %TPA-RFC-33-A: emergency Icinga retirement, before we give the training (#41767 (closed)).

  • overall document structure review (done until pager playbooks section)
  • quickly review the monitoring and testing section, to see if there are any urgent changes to be made there
  • how to scrape a new target? (present, but messy)
  • how to add an existing alert to Prometheus
  • document IRC channel
  • how to write an alert?
  • "where is my Nagios check?" howto
  • document silences
  • how to create a blackbox check?
  • document blackbox exporter oddities

See also the questions raised in the training, in #41767 (closed).

priority B

These may be done after Icinga is retired, as part of %TPA-RFC-33-B: Prometheus server merge, more exporters.

  • sync with template.md
  • review backups
  • review monitoring and testing, yes, again
  • review architecture
  • review design, possibly copying a lot of TPA-RFC-33 in here
  • storage
  • queues
  • authentication
  • implementation
  • related services
  • Security and risk assessment
  • Technical debt and next steps
  • document TPA-RFC-33 in the wiki page (proposed solution?)
  • remaining TODO items

/cc @lelutin

Edited by lelutin