Verified Commit 94cfe039 authored by anarcat's avatar anarcat

finish filling up the grafana template

parent 0a1c9642
......@@ -15,9 +15,32 @@ difference between the internal and external servers.
# Tutorial
## Important dashboards
Typically, working Grafana dashboards are "starred". Since we have
many such dashboards now, here's a curated list of the most important
dashboards you might need to look at:
* [Overview](https://grafana.torproject.org/d/wUmZB05Zk/tpo-overview) - first panel to show up on login, can filter basic
stats (bandwidth, memory, load, etc) per server role (currently
"class" field)
* [Per-node server stats](https://grafana.torproject.org/d/Z7T7Cfemz/node-exporter-full) - basic server stats (CPU, disk, memory
usage), with drill down options
* [Node comparison dashboard](https://grafana.torproject.org/d/ER3U2cqmk/node-exporter-server-metrics) - similar to the above, but can
display multiple servers in columns, useful for cluster overview and
drawing correlations between servers
* [Postfix](https://grafana.torproject.org/d/Ds5BxBYGk/postfix-mtail) - to monitor mailings, see [monitoring mailings, in
the CRM documentation](service/crm#monitoring-mailings)
Other services (e.g. Apache, Bind, PostgreSQL, GitLab) also have
their own dashboards, and many dashboards are still a work in progress.
The above list doesn't cover the "external" Grafana server
(`grafana2`) which has its own distinct set of dashboards.
# How-to
## Updating a Grafana dashboard
## Updating a dashboard
As mentioned in the [installation section](#installation) below, the Grafana
dashboards are maintained by Puppet. So while new dashboards can be
......@@ -36,15 +59,20 @@ export dashboards into git.
## Pager playbook
<!-- information about common errors from the monitoring system and -->
<!-- how to deal with them. this should be easy to follow: think of -->
<!-- your future self, in a stressful situation, tired and hungry. -->
In general, Grafana is not a high availability service and shouldn't
"page" you. It is, however, quite useful in emergencies or diagnostic
situations. To diagnose server-level issues, head to the [per-node
server stats](https://grafana.torproject.org/d/Z7T7Cfemz/node-exporter-full), which shows basic server stats (CPU, disk, memory usage),
with drill down options. If that's not enough, look at the [list of
important dashboards](#important-dashboards).
## Disaster recovery
<!-- what to do if all goes to hell. e.g. restore from backups? -->
<!-- rebuild from scratch? not necessarily those procedures (e.g. see -->
<!-- "Installation" below but some pointers. -->
In theory, if the Grafana server dies in a fire, it should be possible
to rebuild it from scratch with Puppet (see the [installation
procedure](#installation)). In practice, some important dashboards
might not have been saved into git, in which case restoring from
backups might bring them back.
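
To find out ahead of time which dashboards would be lost in a rebuild, the live dashboard list can be compared with the JSON files kept in git. The following is only a sketch under assumptions: it expects the live list in the shape returned by Grafana's `/api/search` HTTP endpoint (a list of objects with `uid` and `title`), it assumes each dashboard file in git carries a top-level `uid` field, and the directory layout is hypothetical. Fetching the list from the server is left out, since authentication is handled by the web server in front of Grafana.

```python
import json
from pathlib import Path


def uids_in_git(dashboard_dir):
    """Collect the `uid` field of every dashboard JSON file under a directory."""
    uids = set()
    for path in Path(dashboard_dir).glob("**/*.json"):
        with open(path, encoding="utf-8") as fh:
            dashboard = json.load(fh)
        uid = dashboard.get("uid")
        if uid:
            uids.add(uid)
    return uids


def missing_from_git(search_results, dashboard_dir):
    """Return sorted (uid, title) pairs for live dashboards with no matching file.

    `search_results` is assumed to be the decoded JSON list from /api/search;
    entries without a `uid` are skipped.
    """
    known = uids_in_git(dashboard_dir)
    return sorted(
        (item["uid"], item.get("title", ""))
        for item in search_results
        if item.get("uid") and item["uid"] not in known
    )
```

Any pair this returns is a dashboard that exists on the server but not in the checked-out directory, and is a candidate for export before a reinstall.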
# Reference
......@@ -73,32 +101,23 @@ into the repository.
## SLA
<!-- this describes an acceptable level of service for this service -->
There is no SLA established for this service.
## Design
<!-- how this is built -->
<!-- should reuse and expand on the "proposed solution", it's a -->
<!-- "as-built" documented, whereas the "Proposed solution" is an -->
<!-- "architectural" document, which the final result might differ -->
<!-- from, sometimes significantly -->
<!-- a good guide to "audit" an existing project's design: -->
<!-- https://bluesock.org/~willkg/blog/dev/auditing_projects.html -->
Grafana is a single-binary daemon written in Golang with a frontend
written in TypeScript. It stores its configuration in an INI file
(`/etc/grafana/grafana.ini`, managed by Puppet). It doesn't keep
metrics itself and instead delegates time series storage to "data
sources"; we currently use Prometheus for that.
<!-- things to evaluate here:
It is mostly driven by a web browser interface making heavy use of
JavaScript. Dashboards are stored in JSON files deployed by Puppet.
* services
* storage (databases? plain text files? cloud/S3 storage?)
* queues (e.g. email queues, job queues, schedulers)
* interfaces (e.g. webserver, commandline)
* authentication (e.g. SSH, LDAP?)
* programming languages, frameworks, versions
* dependent services (e.g. authenticates against LDAP, or requires
git pushes)
* deployments: how is code for this deployed (see also Installation)
It supports alerting, but we do not use that feature, instead
relying on Prometheus and Nagios for alerts.
how is this thing built, basically? -->
Authentication is delegated to the webserver proxy (currently Apache).
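
Since the configuration lives in a single INI file, it can be inspected with a few lines of Python when checking what Puppet deployed. This is a minimal sketch, not a description of the actual deployment: the sample content below is hypothetical (the setting names exist in Grafana, but the values and the use of `auth.proxy` are assumptions), and the `__top__` section name is a placeholder introduced here to hold any keys that appear before the first section header.

```python
import configparser


def load_grafana_ini(text):
    """Parse grafana.ini-style text into a ConfigParser.

    A placeholder section header is prepended so that keys set before the
    first real section do not raise MissingSectionHeaderError.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string("[__top__]\n" + text)
    return parser


# Hypothetical sample; the real file is managed by Puppet and will differ.
SAMPLE = """\
app_mode = production

[server]
http_addr = 127.0.0.1
http_port = 3000

[auth.proxy]
enabled = true
header_name = X-WEBAUTH-USER
"""

config = load_grafana_ini(SAMPLE)
print(config.get("server", "http_port"))
print(config.getboolean("auth.proxy", "enabled"))
```

To read the real file, pass the contents of `/etc/grafana/grafana.ini` to `load_grafana_ini()` instead of `SAMPLE`.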
## Issues
......@@ -111,61 +130,69 @@ There is no issue tracker specifically for this project, [File][] or
[File]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/new
[search]: https://gitlab.torproject.org/tpo/tpa/team/-/issues
Issues with Grafana itself may be [browsed or filed on GitHub](https://github.com/grafana/grafana/issues).
## Maintainer, users, and upstream
<!-- document who deployed and operates this service, who the users -->
<!-- are, who the upstreams are, if they are still active, -->
<!-- collaborative, how do we keep up to date, -->
This service was deployed by anarcat and hiro. The internal server is
used by TPA and the external server can be used by any other team,
but is particularly used by the anti-censorship and metrics teams.
Upstream is [Grafana Labs](https://grafana.com/), a startup with a few products alongside
Grafana.
## Monitoring and testing
<!-- describe how this service is monitored and how it can be tested -->
<!-- after major changes like IP address changes or upgrades. describe -->
<!-- CI, test suites, linting, how security issues and upgrades are -->
<!-- tracked -->
Grafana itself is monitored by [Prometheus](howto/prometheus) and produces graphs for
its own metrics.
The test procedure is basically to log in to the service and load a
few dashboards.
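
A scripted check can complement the manual test, for example after an upgrade. The sketch below relies on Grafana's `/api/health` endpoint, which returns a small JSON document including a `database` field. It is only a sketch: the base URL is whatever server is being tested, and because authentication is delegated to the web server in front of Grafana, an unauthenticated request may be rejected. Credential handling is deliberately left out.

```python
import json
import urllib.request


def health_is_ok(payload):
    """Return True when a decoded /api/health payload reports a healthy database."""
    return payload.get("database") == "ok"


def fetch_health(base_url, timeout=10):
    """Fetch and decode /api/health from a Grafana server.

    Raises urllib.error.URLError (or HTTPError) when the server is unreachable
    or answers with an error status, such as 401 from the fronting web server.
    """
    url = base_url.rstrip("/") + "/api/health"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

A call such as `health_is_ok(fetch_health("https://grafana.torproject.org"))` only tells you the daemon and its database respond. It does not replace loading a few dashboards, which also exercises the Prometheus data source.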
## Logs and metrics
<!-- where are the logs? how long are they kept? any PII? -->
<!-- what about performance metrics? same questions -->
Grafana doesn't hold metrics itself, and delegates this task to an
external data source. We use [Prometheus](howto/prometheus) for that purpose, but
other backends could be used as well.
Grafana logs incoming requests in `/var/log/grafana/grafana.log`. That
log may contain private information like IP addresses and request times.
## Backups
<!-- does this service need anything special in terms of backups? -->
<!-- e.g. locking a database? special recovery procedures? -->
No special backup procedure has been established for Grafana,
considering the service can be rebuilt from scratch.
## Other documentation
<!-- references to upstream documentation, if relevant -->
* [Upstream Grafana manual](https://grafana.com/docs/grafana/latest/)
* [Grafana GitHub project](https://github.com/grafana/grafana)
# Discussion
## Overview
<!-- describe the overall project. should include a link to a ticket -->
<!-- that has a launch checklist -->
<!-- if this is an old project being documented, summarize the known -->
<!-- issues with the project. to quote the "audit procedure":
5. When was the last security review done on the project? What was
the outcome? Are there any security issues currently? Should it
have another security review?
The Grafana project was quickly thrown together in 2019 to replace the
Munin service, which had "died in a fire". Prometheus was first set up to
collect metrics and Grafana was picked as a frontend because
Prometheus didn't seem sufficient to produce good graphs. There was no
elaborate discussion or evaluation of alternatives done at the time.
6. When was the last risk assessment done? Something that would cover
risks from the data stored, the access required, etc.
There hasn't been a significant security audit of the service, but
given that authentication is managed by Apache with a limited set of
users, it should be fairly safe.
7. Are there any in-progress projects? Technical debt cleanup?
Migrations? What state are they in? What's the urgency? What's the
next steps?
Note that it is assumed the dashboard and Prometheus are *public* on
the internal server. The external server is considered private and
shouldn't be publicly accessible.
8. What urgent things need to be done on this project?
-->
There are lots of dashboards in the interface, which should probably
be cleaned up and renamed. Some are not in Git and might be lost in a
reinstall. Some dashboards do not work very well.
## Goals
<!-- include bugs to be fixed -->
N/A. No ongoing migration or major project.
### Must have
......@@ -179,8 +206,13 @@ There is no issue tracker specifically for this project, [File][] or
## Proposed Solution
N/A.
## Cost
N/A.
## Alternatives considered
<!-- include benchmarks and procedure if relevant -->
No extensive evaluation of alternatives was performed when Grafana
was deployed.