From 94cfe039e1f510917ae3b0c81cb7f02041266b65 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Antoine=20Beaupr=C3=A9?= <anarcat@debian.org>
Date: Wed, 18 Aug 2021 14:26:27 -0400
Subject: [PATCH] finish filling up the grafana template

---
 howto/grafana.md | 148 ++++++++++++++++++++++++++++-------------------
 1 file changed, 90 insertions(+), 58 deletions(-)

diff --git a/howto/grafana.md b/howto/grafana.md
index 47b873d2..cab37aad 100644
--- a/howto/grafana.md
+++ b/howto/grafana.md
@@ -15,9 +15,32 @@ difference between the internal and external servers.
 
 # Tutorial
 
+## Important dashboards
+
+Typically, working Grafana dashboards are "starred". Since we have
+many such dashboards now, here's a curated list of the most important
+dashboards you might need to look at:
+
+ * [Overview](https://grafana.torproject.org/d/wUmZB05Zk/tpo-overview) - first panel to show up on login, can filter basic
+   stats (bandwidth, memory, load, etc) per server role (currently
+   "class" field)
+ * [Per-node server stats](https://grafana.torproject.org/d/Z7T7Cfemz/node-exporter-full) - basic server stats (CPU, disk, memory
+   usage), with drill down options
+ * [Node comparison dashboard](https://grafana.torproject.org/d/ER3U2cqmk/node-exporter-server-metrics) - similar to the above, but can
+   display multiple servers in columns, useful for cluster overview and
+   drawing correlations between servers
+ * [Postfix](https://grafana.torproject.org/d/Ds5BxBYGk/postfix-mtail) - to monitor mailings, see [monitoring mailings, in
+   the CRM documentation](service/crm#monitoring-mailings)
+
+Other services (e.g. Apache, Bind, PostgreSQL, GitLab) also have
+their own dashboards, and many dashboards are still a work in progress.
+
+The above list doesn't cover the "external" Grafana server
+(`grafana2`) which has its own distinct set of dashboards.
+
 # How-to
 
-## Updating a Grafana dashboard
+## Updating a dashboard
 
 As mentioned in the [installation section](#installation) below, the Grafana
 dashboards are maintained by Puppet. So while new dashboard can be
@@ -36,15 +59,20 @@ export dashboards into git.
 
 ## Pager playbook
 
-<!-- information about common errors from the monitoring system and -->
-<!-- how to deal with them. this should be easy to follow: think of -->
-<!-- your future self, in a stressful situation, tired and hungry. -->
+In general, Grafana is not a high availability service and shouldn't
+"page" you. It is, however, quite useful in emergencies or diagnostic
+situations. To diagnose server-level issues, head to the [per-node
+server stats](https://grafana.torproject.org/d/Z7T7Cfemz/node-exporter-full), which shows basic server stats (CPU, disk, memory usage),
+with drill down options. If that's not enough, look at the [list of
+important dashboards](#important-dashboards).
 
 ## Disaster recovery
 
-<!-- what to do if all goes to hell. e.g. restore from backups? -->
-<!-- rebuild from scratch? not necessarily those procedures (e.g. see -->
-<!-- "Installation" below but some pointers. -->
+In theory, if the Grafana server dies in a fire, it should be possible
+to rebuild it from scratch with Puppet; see the [installation
+procedure](#installation). In practice, it's possible that important dashboards
+might not have been saved into git, in which case restoring from
+backups might bring them back.
 
 # Reference
 
@@ -73,32 +101,23 @@ into the repository.
 
 ## SLA
 
-<!-- this describes an acceptable level of service for this service -->
+There is no SLA established for this service.
 
 ## Design
 
-<!-- how this is built -->
-<!-- should reuse and expand on the "proposed solution", it's a -->
-<!-- "as-built" documented, whereas the "Proposed solution" is an -->
-<!-- "architectural" document, which the final result might differ -->
-<!-- from, sometimes significantly -->
-
-<!-- a good guide to "audit" an existing project's design: -->
-<!-- https://bluesock.org/~willkg/blog/dev/auditing_projects.html -->
+Grafana is a single-binary daemon written in Go with a frontend
+written in TypeScript. It stores its configuration in an `INI` file (in
+`/etc/grafana/grafana.ini`, managed by Puppet). It doesn't keep
+metrics itself and instead delegates time series storage to "data
+sources"; we currently use Prometheus for that.
 
-<!-- things to evaluate here:
+It is mostly driven by a web browser interface making heavy use of
+JavaScript. Dashboards are stored in JSON files deployed by Puppet.
 
- * services
- * storage (databases? plain text files? cloud/S3 storage?)
- * queues (e.g. email queues, job queues, schedulers)
- * interfaces (e.g. webserver, commandline)
- * authentication (e.g. SSH, LDAP?)
- * programming languages, frameworks, versions
- * dependent services (e.g. authenticates against LDAP, or requires
-   git pushes) 
- * deployments: how is code for this deployed (see also Installation)
+It supports alerting, but we do not use that feature, instead
+relying on Prometheus and Nagios for alerts.
 
-how is this thing built, basically? -->
+Authentication is delegated to the webserver proxy (currently Apache).
 
 ## Issues
 
@@ -111,61 +130,69 @@ There is no issue tracker specifically for this project, [File][] or
  [File]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/new
  [search]: https://gitlab.torproject.org/tpo/tpa/team/-/issues
 
+Issues with Grafana itself may be [browsed or filed on GitHub](https://github.com/grafana/grafana/issues).
+
 ## Maintainer, users, and upstream
 
-<!-- document who deployed and operates this service, who the users -->
-<!-- are, who the upstreams are, if they are still active, -->
-<!-- collaborative, how do we keep up to date, -->
+This service was deployed by anarcat and hiro. The internal server is
+used by TPA and the external server can be used by any other teams,
+but is particularly used by the anti-censorship and metrics teams.
+
+Upstream is [Grafana Labs](https://grafana.com/), a startup with a few products alongside
+Grafana.
 
 ## Monitoring and testing
 
-<!-- describe how this service is monitored and how it can be tested -->
-<!-- after major changes like IP address changes or upgrades. describe -->
-<!-- CI, test suites, linting, how security issues and upgrades are -->
-<!-- tracked -->
+Grafana itself is monitored by [Prometheus](howto/prometheus) and produces graphs for
+its own metrics.
+
+The test procedure is basically to log in to the service and load a
+few dashboards.
 
 ## Logs and metrics
 
-<!-- where are the logs? how long are they kept? any PII? -->
-<!-- what about performance metrics? same questions -->
+Grafana doesn't hold metrics itself, and delegates this task to
+external data sources. We use [Prometheus](howto/prometheus) for that purpose, but
+other backends could be used as well.
+
+Grafana logs incoming requests in `/var/log/grafana/grafana.log`, which
+may contain private information like IP addresses and request times.
 
 ## Backups
 
-<!-- does this service need anything special in terms of backups? -->
-<!-- e.g. locking a database? special recovery procedures? -->
+No special backup procedure has been established for Grafana,
+considering the service can be rebuilt from scratch.
 
 ## Other documentation
 
-<!-- references to upstream documentation, if relevant -->
+ * [Upstream Grafana manual](https://grafana.com/docs/grafana/latest/)
+ * [Grafana GitHub project](https://github.com/grafana/grafana)
 
 # Discussion
 
 ## Overview
 
-<!-- describe the overall project. should include a link to a ticket -->
-<!-- that has a launch checklist -->
-
-<!-- if this is an old project being documented, summarize the known -->
-<!-- issues with the project. to quote the "audit procedure":
-
- 5. When was the last security review done on the project? What was
-    the outcome? Are there any security issues currently? Should it
-    have another security review?
+The Grafana project was quickly thrown together in 2019 to replace the
+Munin service, which had "died in a fire". Prometheus was first set up to
+collect metrics and Grafana was picked as a frontend because
+Prometheus didn't seem sufficient to produce good graphs. There was no
+elaborate discussion or evaluation of alternatives done at the time.
 
- 6. When was the last risk assessment done? Something that would cover
-    risks from the data stored, the access required, etc.
+There hasn't been a significant security audit of the service, but
+given that authentication is managed by Apache with a limited set of
+users, it should be fairly safe.
 
- 7. Are there any in-progress projects? Technical debt cleanup?
-    Migrations? What state are they in? What's the urgency? What's the
-    next steps?
+Note that it is assumed the dashboard and Prometheus are *public* on
+the internal server. The external server is considered private and
+shouldn't be publicly accessible.
 
- 8. What urgent things need to be done on this project?
-
--->
+There are lots of dashboards in the interface, which should probably
+be cleaned up and renamed. Some are not in Git and might be lost in a
+reinstall. Some dashboards do not work very well.
 
 ## Goals
 
-<!-- include bugs to be fixed -->
+N/A. No ongoing migration or major project.
 
 ### Must have
 
@@ -179,8 +206,13 @@ There is no issue tracker specifically for this project, [File][] or
 
 ## Proposed Solution
 
+N/A.
+
 ## Cost
 
+N/A.
+
 ## Alternatives considered
 
-<!-- include benchmarks and procedure if relevant -->
+No extensive evaluation of alternatives was performed when Grafana
+was deployed.
-- 
GitLab