Grafana is a graphing engine and dashboard management tool that processes data from multiple data sources. We use it to trend various metrics collected from servers by Prometheus.
Grafana is installed alongside Prometheus, on the same server. These are the known instances:
- https://grafana.torproject.org/ - internal server
- https://grafana2.torproject.org/ - external server
See also the Prometheus monitored services to understand the difference between the internal and external servers.
Tutorial
Important dashboards
Typically, working Grafana dashboards are "starred". Since we have many such dashboards now, here's a curated list of the most important dashboards you might need to look at:
- Overview - first panel to show up on login, can filter basic stats (bandwidth, memory, load, etc) per server role (currently "class" field)
- Per-node server stats - basic server stats (CPU, disk, memory usage), with drill down options
- Node comparison dashboard - similar to the above, but can display multiple servers in columns, useful for cluster overview and drawing correlations between servers
- Postfix - to monitor mailings, see monitoring mailings, in the CRM documentation
Other services (e.g. Apache, Bind, PostgreSQL, GitLab) also have their own dashboards, and many dashboards are still works in progress.
The above list doesn't cover the "external" Grafana server (grafana2), which has its own distinct set of dashboards.
How-to
Updating a dashboard
As mentioned in the installation section below, the Grafana dashboards are maintained by Puppet. So while new dashboards can be created and edited in the Grafana web interface, changes to provisioned dashboards will be lost when Puppet ships a new version of the dashboard.
You therefore need to make sure you update the dashboard in git before leaving. New dashboards not in git should be safe, but please do also commit them to git so we have a proper versioned history of their deployment. It's also the right way to make sure they are usable across other instances of Grafana. Finally, they are also easier to share and collaborate on that way.
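As a rough illustration of what an export step can look like, the sketch below takes the JSON returned by Grafana's HTTP API for a single dashboard (`GET /api/dashboards/uid/<uid>` wraps the dashboard model in a `dashboard` key next to a `meta` key) and turns it into a stable, diff-friendly string. The function name, the output layout and the choice to drop the instance-specific `id` are assumptions made for this example; the README in the grafana-dashboards repository remains the reference for the actual procedure.

```python
import json


def normalize_dashboard(api_response: dict) -> str:
    """Return a stable JSON string for a dashboard fetched from the API.

    Assumes the payload has the shape {"meta": {...}, "dashboard": {...}}.
    """
    # Shallow copy so the caller's top-level dict is not modified.
    dashboard = dict(api_response["dashboard"])
    # The numeric id belongs to one Grafana instance; dropping it is an
    # assumption made here so the file can be loaded on other instances.
    dashboard.pop("id", None)
    # Sorted keys and fixed indentation keep git diffs small.
    return json.dumps(dashboard, indent=2, sort_keys=True) + "\n"


if __name__ == "__main__":
    # Hypothetical payload, only for demonstration.
    sample = {
        "meta": {"folderTitle": "system"},
        "dashboard": {"id": 42, "uid": "abc123", "title": "Example"},
    }
    print(normalize_dashboard(sample), end="")
```

Writing the result to a file in the repository and reviewing the diff before committing gives the versioned history described above.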
Folders and tags
Dashboards provisioned by Grafana should be tagged with the provisioned label, and filed in the appropriate folder:
- meta: self-monitoring, mostly metrics on Prometheus and Grafana themselves
- network: network monitoring, bandwidth management
- services: service-specific dashboards, for example database, web server, applications like GitLab, etc
- system: system-level metrics, like disk, memory, CPU usage
Non-provisioned dashboards should be filed in one of those folders:
- broken: dashboards found to be completely broken and useless, might be deleted in the future
- deprecated: functionality overlapping with another dashboard, to be deleted in the future
- inprogrress: currently being built, could be partly operational, must absolutely NOT be deleted
The General folder is special and holds the "home" dashboard, which is, on grafana1, the "TPO overview" dashboard. It should not be used by other dashboards.
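The filing rules above can be expressed as a small check. This is a sketch only: the function and constant names are hypothetical, and it assumes the caller already knows a dashboard's folder, its tags (the `tags` list of the dashboard JSON) and whether it is provisioned.

```python
# Folder names are copied from the lists above, spelling included.
PROVISIONED_FOLDERS = {"meta", "network", "services", "system"}
OTHER_FOLDERS = {"broken", "deprecated", "inprogrress"}


def filing_problems(folder: str, tags: list[str], provisioned: bool,
                    is_home: bool = False) -> list[str]:
    """Return a list of human-readable problems; empty means well filed."""
    problems = []
    if folder == "General":
        # Only the "home" dashboard may live in General.
        if not is_home:
            problems.append("General is reserved for the home dashboard")
        return problems
    if provisioned:
        if folder not in PROVISIONED_FOLDERS:
            problems.append(f"provisioned dashboard in unexpected folder {folder!r}")
        if "provisioned" not in tags:
            problems.append("provisioned dashboard lacks the 'provisioned' tag")
    elif folder not in OTHER_FOLDERS:
        problems.append(f"non-provisioned dashboard in unexpected folder {folder!r}")
    return problems


if __name__ == "__main__":
    print(filing_problems("system", ["provisioned"], provisioned=True))
    print(filing_problems("system", [], provisioned=False))
```

A check like this could run over the exported JSON files before they are committed, but nothing in this page says such a check exists today.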
See the grafana-dashboards repository for instructions on how to export dashboards into git.
Pager playbook
In general, Grafana is not a high availability service and shouldn't "page" you. It is, however, quite useful in emergencies or diagnostic situations. To diagnose server-level issues, head to the per-node server stats dashboard, which shows basic server stats (CPU, disk, memory usage), with drill down options. If that's not enough, look at the list of important dashboards.
Disaster recovery
In theory, if the Grafana server dies in a fire, it should be possible to rebuild it from scratch with Puppet (see the installation procedure). In practice, some important dashboards might not have been saved into git, in which case restoring from backups might bring them back.
Reference
Installation
Puppet deployment
Grafana was installed with Puppet using the upstream Debian package, following a debate regarding the merits of Debian packages versus Docker containers when neither is trusted; see this comment for a summary.
Some manual configuration was performed after the install: the admin password was reset on first install and stored in tor-passwords.git, in hosts-extra-info. Everything else is configured in Puppet.
Grafana dashboards, in particular, are kept in the grafana-dashboards repository. The README.md file there contains more instructions on how to add and update dashboards. In general, dashboards must not be modified directly through the web interface, at least not without being exported back into the repository.
SLA
There is no SLA established for this service.
Design
Grafana is a single-binary daemon written in Golang with a frontend written in TypeScript. It stores its configuration in an INI file (in /etc/grafana/grafana.ini, managed by Puppet). It doesn't keep metrics itself and instead delegates time series storage to "data sources", for which we currently use Prometheus.
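Because the configuration is a plain INI file, it can be inspected with standard tooling. The sketch below parses an INI snippet with Python's `configparser`; the section names and values in the sample are illustrative assumptions, not a copy of the deployed /etc/grafana/grafana.ini, and the helper name is hypothetical.

```python
import configparser

# Illustrative values only, not the deployed configuration.
SAMPLE = """\
; lines starting with ';' are comments
[server]
http_port = 3000

[auth.proxy]
enabled = true
"""


def load_settings(text: str) -> configparser.ConfigParser:
    """Parse INI text into a ConfigParser object."""
    # interpolation=None keeps literal '%' characters in values intact.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return parser


if __name__ == "__main__":
    settings = load_settings(SAMPLE)
    print(settings.getint("server", "http_port"))
    print(settings.getboolean("auth.proxy", "enabled"))
```

Since the real file is managed by Puppet, a script like this is only useful for reading the settings; any change belongs in Puppet.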
It is mostly driven by a web browser interface making heavy use of JavaScript. Dashboards are stored in JSON files deployed by Puppet.
It supports alerting, but we do not use that feature, instead relying on Prometheus for alerts.
Authentication is delegated to the webserver proxy (currently Apache).
Issues
There is no issue tracker specifically for this project. File or search for issues in the team issue tracker with the Grafana label.
Issues with Grafana itself may be browsed or filed on GitHub.
Maintainer, users, and upstream
This service was deployed by anarcat and hiro. The internal server is used by TPA and the external server can be used by any other teams, but is particularly used by the anti-censorship and metrics teams.
Upstream is Grafana Labs, a startup with a few products alongside Grafana.
Monitoring and testing
Grafana itself is monitored by Prometheus and produces graphs for its own metrics.
The test procedure is basically to log in to the service and load a few dashboards.
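A scripted smoke test could complement that manual procedure. The sketch below queries Grafana's `/api/health` endpoint, which returns a small JSON document including a `database` field. The function names are hypothetical, and because authentication is handled by the web server in front of Grafana, the request may be rejected without credentials; the sketch does not deal with that and does not replace loading dashboards by hand.

```python
import json
import urllib.request


def database_ok(payload: dict) -> bool:
    """Interpret the JSON body returned by /api/health."""
    return payload.get("database") == "ok"


def check_health(base_url: str, timeout: float = 10.0) -> bool:
    """Fetch /api/health from a Grafana instance and report its status.

    Raises urllib.error.URLError (or HTTPError) if the request fails.
    """
    url = base_url.rstrip("/") + "/api/health"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return database_ok(json.load(response))


if __name__ == "__main__":
    # Offline demonstration with a hand-written payload.
    print(database_ok({"database": "ok"}))
```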
Logs and metrics
Grafana doesn't hold metrics itself, and delegates this task to external data sources. We use Prometheus for that purpose, but other backends could be used as well.
Grafana logs incoming requests in /var/log/grafana/grafana.log, which may contain private information like IP addresses and request times.
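When log excerpts need to be shared (in an issue, for example), the private fields can be masked first. The sketch below assumes the log lines are space-separated `key=value` pairs with double-quoted values where needed, and that the client address sits in a `remote_addr` field; both the line format and the field name are assumptions that should be checked against the actual log file, and the function names are hypothetical.

```python
import shlex


def parse_logfmt(line: str) -> dict:
    """Split a key=value log line into a dict, honouring quoted values.

    shlex raises ValueError on unbalanced quotes; callers may want to
    skip such lines.
    """
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def redact(line: str, private_keys=("remote_addr",)) -> str:
    """Return the line with the values of private_keys masked."""
    fields = parse_logfmt(line)
    for key in private_keys:
        if key in fields:
            fields[key] = "REDACTED"
    parts = []
    for key, value in fields.items():
        # Re-quote values containing spaces; embedded quotes are not handled.
        if " " in value:
            value = '"' + value + '"'
        parts.append(f"{key}={value}")
    return " ".join(parts)


if __name__ == "__main__":
    # Hand-written sample line using a documentation-only IP address.
    sample = 'lvl=info msg="Request Completed" method=GET status=200 remote_addr=192.0.2.1'
    print(redact(sample))
```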
Backups
No special backup procedure has been established for Grafana, considering the service can be rebuilt from scratch.
Other documentation
Discussion
Overview
The Grafana project was quickly thrown together in 2019 to replace the Munin service, which had "died in a fire". Prometheus was first set up to collect metrics, and Grafana was picked as a frontend because Prometheus didn't seem sufficient to produce good graphs. There was no elaborate discussion or evaluation of alternatives done at the time.
There hasn't been a significant security audit of the service, but given that authentication is managed by Apache with a limited set of users, it should be fairly safe.
Note that it is assumed the dashboard and Prometheus are public on the internal server. The external server is considered private and shouldn't be publicly accessible.
There are lots of dashboards in the interface, which should probably be cleaned up and renamed. Some are not in Git and might be lost in a reinstall. Some dashboards do not work very well.
Goals
N/A. No ongoing migration or major project.
Must have
Nice to have
Non-Goals
Approvals required
Proposed Solution
N/A.
Cost
N/A.
Alternatives considered
No extensive evaluation of alternatives was performed when Grafana was deployed.