@@ -1173,8 +1173,11 @@ is *not* configured through our Puppet like other Prometheus
 servers. It has still been (manually) integrated in our Prometheus
 setup and Grafana dashboards (see [pager playbook](#pager-playbook)) have been deployed.
 
-More work is underway to improve monitoring in [this issue](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40077) (not
-hardcoding exporters). We could also use the following tools:
+One problem with the current monitoring is that the GitLab exporters
+are [currently hardcoded](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40077).
+
+We could also use the following tools to integrate alerting into
+GitLab better:
 
 * [moosh3/gitlab-alerts](https://github.com/moosh3/gitlab-alerts): autogenerate issues based from Prometheus
 Alert Manager (with the webhook)
@@ -1185,6 +1188,13 @@ hardcoding exporters). We could also use the following tools:
 including Prometheus (starting from 13.1) and Pagerduty (which is
 supported by Prometheus)
 
+We also lack visibility on certain key aspects of GitLab. For example,
+it would be nice to [monitor issue counts in Prometheus](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40591) or have
+better monitoring of GitLab pipelines like wait time, success/failure
+rates and so on. There was an issue open about [monitoring individual
+runners](https://gitlab.torproject.org/tpo/tpa/team/-/issues/41042) but the runners do not expose (nor do they have access to)
+that information, so that was scrapped.
+
 There is a development server called `gitlab-dev-01` that can be used
 to test dangerous things if there is a concern a change could break
 the production server.