Grafana is a graphing engine and dashboard management tool that processes data from multiple data sources. We use it to trend various metrics collected from servers by Prometheus.

Grafana is installed alongside Prometheus, on the same server. These are the known instances:

https://grafana.torproject.org/ - internal server
https://grafana2.torproject.org/ - external server

See also the Prometheus monitored services to understand the difference between the internal and external servers.

Tutorial

Important dashboards

Typically, working Grafana dashboards are "starred". Since we have many such dashboards now, here's a curated list of the most important dashboards you might need to look at:

Overview - first panel to show up on login, can filter basic stats (bandwidth, memory, load, etc) per server role (currently "class" field)
Per-node server stats - basic server stats (CPU, disk, memory usage), with drill down options
Node comparison dashboard - similar to the above, but can display multiple servers in columns, useful for cluster overview and drawing correlations between servers
Postfix - to monitor mailings, see monitoring mailings, in the CRM documentation

Other services (e.g. Apache, Bind, PostgreSQL, GitLab) also have their own dashboards, and many dashboards are still a work in progress.

The above list doesn't cover the "external" Grafana server (grafana2) which has its own distinct set of dashboards.

How-to

Updating a dashboard

As mentioned in the installation section below, the Grafana dashboards are maintained by Puppet. So while new dashboards can be created and edited in the Grafana web interface, changes to provisioned dashboards will be lost when Puppet ships a new version of the dashboard.

You therefore need to make sure you update the dashboard in git before leaving. New dashboards not in git should be safe, but please commit them to git as well so we have a proper versioned history of their deployment. It's also the right way to make sure they are usable across other instances of Grafana. Finally, they are also easier to share and collaborate on that way.
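As an illustration of preparing an exported dashboard for git, here is a minimal Python sketch. It is not the procedure from the grafana-dashboards repository, whose README remains the reference. It assumes the dashboard was exported as a JSON document carrying Grafana's usual `id` and `uid` fields; the function name `normalize_dashboard` and the sample dashboard are hypothetical.

```python
import json


def normalize_dashboard(raw: str) -> str:
    """Return a stable JSON rendering of an exported dashboard."""
    dashboard = json.loads(raw)
    # Assumption: the numeric "id" is specific to one Grafana instance, so it
    # is reset; "uid" is kept because it identifies the dashboard everywhere.
    dashboard["id"] = None
    # Sorted keys and fixed indentation keep diffs small between exports.
    # Lists (for example panels) keep their original order.
    return json.dumps(dashboard, indent=2, sort_keys=True) + "\n"


if __name__ == "__main__":
    # Hypothetical export, for illustration only.
    sample = '{"title": "Example", "uid": "abc123", "id": 42, "tags": ["provisioned"]}'
    print(normalize_dashboard(sample), end="")
```

Running the function twice on the same export yields the same text, so re-exporting an unchanged dashboard should not produce a spurious diff.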

Folders and tags

Dashboards provisioned by Grafana should be tagged with the provisioned label, and filed in the appropriate folder:

meta: self-monitoring, mostly metrics on Prometheus and Grafana themselves
network: network monitoring, bandwidth management
services: service-specific dashboards, for example database, web server, applications like GitLab, etc
system: system-level metrics, like disk, memory, CPU usage

Non-provisioned dashboards should be filed in one of those folders:

broken: dashboards found to be completely broken and useless, might be deleted in the future
deprecated: functionality overlapping with another dashboard, to be deleted in the future
inprogress: currently being built, could be partly operational, must absolutely NOT be deleted

The General folder is special and holds the "home" dashboard, which is, on grafana1, the "TPO overview" dashboard. It should not be used by other dashboards.

See the grafana-dashboards repository for instructions on how to export dashboards into git.
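The filing rules above could be checked mechanically. The following Python sketch is only an illustration under stated assumptions: the dashboard is available as exported JSON with a `tags` list, the folder name is supplied by the caller (it is not read from the JSON), and the folder names are spelled as in the sets below. The function `check_dashboard` is hypothetical and not part of any existing tooling.

```python
import json

# Folder names taken from the lists above (spelling assumed).
PROVISIONED_FOLDERS = {"meta", "network", "services", "system"}
WORK_FOLDERS = {"broken", "deprecated", "inprogress"}


def check_dashboard(raw: str, folder: str) -> list:
    """Return a list of problems; an empty list means the conventions hold."""
    tags = json.loads(raw).get("tags", [])
    provisioned = "provisioned" in tags
    allowed = PROVISIONED_FOLDERS if provisioned else WORK_FOLDERS
    problems = []
    if folder == "General":
        # General is reserved for the home dashboard.
        problems.append("the General folder is reserved for the home dashboard")
    elif folder not in allowed:
        kind = "provisioned" if provisioned else "non-provisioned"
        problems.append(
            f"{kind} dashboard filed in '{folder}', expected one of {sorted(allowed)}"
        )
    return problems
```

For example, `check_dashboard('{"tags": ["provisioned"]}', "system")` returns an empty list, while an untagged dashboard in `system` yields one problem.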

Pager playbook

In general, Grafana is not a high availability service and shouldn't "page" you. It is, however, quite useful in emergencies or diagnostic situations. To diagnose server-level issues, head to the per-node server stats dashboard, which shows basic server stats (CPU, disk, memory usage), with drill down options. If that's not enough, look at the list of important dashboards.

Disaster recovery

In theory, if the Grafana server dies in a fire, it should be possible to rebuild it from scratch with Puppet (see the installation procedure). In practice, it's possible that important dashboards might not have been saved into git, in which case restoring from backups might bring them back.

Reference

Installation

Puppet deployment

Grafana was installed with Puppet using the upstream Debian package, following a debate regarding the merits of Debian packages versus Docker containers when neither is trusted; see this comment for a summary.

Some manual configuration was performed after the install: the admin password was reset on first install and is stored in tor-passwords.git, in hosts-extra-info. Everything else is configured in Puppet.

Grafana dashboards, in particular, are managed through the grafana-dashboards repository. The README.md file there contains more instructions on how to add and update dashboards. In general, dashboards must not be modified directly through the web interface, at least not without being exported back into the repository.

SLA

There is no SLA established for this service.

Design

Grafana is a single-binary daemon written in Go with a frontend written in TypeScript. It stores its configuration in an INI file (/etc/grafana/grafana.ini, managed by Puppet). It doesn't keep metrics itself and instead delegates time series storage to "data sources"; we currently use Prometheus for that.
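When debugging, it can help to read a setting out of that INI file. The sketch below uses Python's standard `configparser`. The helper `read_setting` and the sample content are hypothetical and do not reflect the actual Puppet-managed configuration; `http_port` and `domain` in the `[server]` section are standard Grafana settings used here only as examples.

```python
import configparser


def read_setting(ini_text: str, section: str, key: str, default=None):
    """Look up one key in grafana.ini-style text, or return the default."""
    # interpolation=None: values may legitimately contain '%' or '${...}'.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    # grafana.ini may carry keys before the first section header, which
    # configparser rejects, so park them under a placeholder section.
    parser.read_string("[__top__]\n" + ini_text)
    return parser.get(section, key, fallback=default)


if __name__ == "__main__":
    # Illustrative content only, not the deployed configuration.
    sample = "[server]\nhttp_port = 3000\n;domain = localhost\n"
    print(read_setting(sample, "server", "http_port"))
    print(read_setting(sample, "server", "domain", default="(unset)"))
```

On a server, the text would come from reading /etc/grafana/grafana.ini. Lines starting with `;` are comments, so a commented-out key falls back to the default. Note that Grafana also accepts overrides through `GF_`-prefixed environment variables, which this sketch does not consider.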

It is mostly driven by a web browser interface making heavy use of JavaScript. Dashboards are stored in JSON files deployed by Puppet.

It supports alerting, but we do not use that feature, instead relying on Prometheus for alerts.

Authentication is delegated to the webserver proxy (currently Apache).

Issues

There is no issue tracker specifically for this project. File or search for issues in the team issue tracker with the Grafana label.

Issues with Grafana itself may be browsed or filed on GitHub.

Maintainer, users, and upstream

This service was deployed by anarcat and hiro. The internal server is used by TPA and the external server can be used by any other team, but is particularly used by the anti-censorship and metrics teams.

Upstream is Grafana Labs, a startup with a few products alongside Grafana.

Monitoring and testing

Grafana itself is monitored by Prometheus and produces graphs for its own metrics.

The test procedure is basically to log in to the service and load a few dashboards.
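A scripted smoke test can complement that manual procedure, though it does not replace loading dashboards in a browser. The Python sketch below queries Grafana's `/api/health` endpoint, which reports the state of Grafana's own database. It assumes the endpoint is reachable without credentials from wherever the script runs, which may not hold here since authentication is handled by the Apache proxy; the function names are hypothetical.

```python
import json
import urllib.error
import urllib.request


def health_ok(payload: bytes) -> bool:
    """Interpret the JSON body returned by Grafana's /api/health endpoint."""
    return json.loads(payload).get("database") == "ok"


def check(base_url: str, timeout: float = 10.0) -> bool:
    """Return True if the Grafana instance at base_url reports healthy."""
    url = base_url.rstrip("/") + "/api/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200 and health_ok(response.read())
    except urllib.error.URLError:
        # Covers connection failures and HTTP errors such as 401 or 503.
        return False
```

It would be called as `check("https://grafana.torproject.org")`. A `False` result only says that the check did not succeed from that vantage point; an authentication prompt from the proxy produces the same result as a real outage.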

Logs and metrics

Grafana doesn't hold metrics itself, and delegates this task to external data sources. We use Prometheus for that purpose, but other backends could be used as well.

Grafana logs incoming requests in /var/log/grafana/grafana.log, which may contain private information like IP addresses and request times.

Backups

No special backup procedure has been established for Grafana, considering the service can be rebuilt from scratch.

Discussion

Overview

The Grafana project was quickly thrown together in 2019 to replace the Munin service, which had "died in a fire". Prometheus was first set up to collect metrics and Grafana was picked as a frontend because Prometheus didn't seem sufficient to produce good graphs. There was no elaborate discussion or evaluation of alternatives at the time.

There hasn't been a significant security audit of the service, but given that authentication is managed by Apache with a limited set of users, it should be fairly safe.

Note that it is assumed the dashboard and Prometheus are public on the internal server. The external server is considered private and shouldn't be publicly accessible.

There are lots of dashboards in the interface, which should probably be cleaned up and renamed. Some are not in Git and might be lost in a reinstall. Some dashboards do not work very well.

Goals

N/A. No ongoing migration or major project.

Must have

Nice to have

Non-Goals

Approvals required

Proposed Solution

N/A.

Cost