finish filling up the grafana template

94cfe039 · anarcat · 0a1c9642 · 94cfe039
Verified Commit 94cfe039 authored 3 years ago by anarcat
--- a/howto/grafana.md
+++ b/howto/grafana.md
@@ -15,9 +15,32 @@ difference between the internal and external servers.

 # Tutorial

+## Important dashboards
+
+Typically, working Grafana dashboards are "starred". Since we have
+many such dashboards now, here's a curated list of the most important
+dashboards you might need to look at:
+
+ * [Overview](https://grafana.torproject.org/d/wUmZB05Zk/tpo-overview) - first panel to show up on login, can filter basic
+   stats (bandwidth, memory, load, etc) per server role (currently
+   "class" field)
+ * [Per-node server stats](https://grafana.torproject.org/d/Z7T7Cfemz/node-exporter-full) - basic server stats (CPU, disk, memory
+   usage), with drill down options
+ * [Node comparison dashboard](https://grafana.torproject.org/d/ER3U2cqmk/node-exporter-server-metrics) - similar to the above, but can
+   display multiple servers in columns, useful for cluster overview and
+   drawing correlations between servers
+ * [Postfix](https://grafana.torproject.org/d/Ds5BxBYGk/postfix-mtail) - to monitor mailings, see [monitoring mailings, in
+   the CRM documentation](service/crm#monitoring-mailings)
+
+Other services (e.g. Apache, Bind, PostgreSQL, GitLab) also have
+their own dashboards, and many dashboards are still works in progress.
+
+The above list doesn't cover the "external" Grafana server
+(`grafana2`) which has its own distinct set of dashboards.
+
 # How-to

-## Updating a Grafana dashboard
+## Updating a dashboard

 As mentioned in the [installation section](#installation) below, the Grafana
 dashboards are maintained by Puppet. So while new dashboard can be
@@ -36,15 +59,20 @@ export dashboards into git.

 ## Pager playbook

-<!-- information about common errors from the monitoring system and -->
-<!-- how to deal with them. this should be easy to follow: think of -->
-<!-- your future self, in a stressful situation, tired and hungry. -->
+In general, Grafana is not a high availability service and shouldn't
+"page" you. It is, however, quite useful in emergencies or diagnostic
+situations. To diagnose server-level issues, head to the [per-node
+server stats](https://grafana.torproject.org/d/Z7T7Cfemz/node-exporter-full) dashboard, which shows basic server stats (CPU, disk, memory usage),
+with drill down options. If that's not enough, look at the [list of
+important dashboards](#important-dashboards).

 ## Disaster recovery

-<!-- what to do if all goes to hell. e.g. restore from backups? -->
-<!-- rebuild from scratch? not necessarily those procedures (e.g. see -->
-<!-- "Installation" below but some pointers. -->
+In theory, if the Grafana server dies in a fire, it should be possible
+to rebuild it from scratch with Puppet; see the [installation
+procedure](#installation). In practice, some important dashboards
+might not have been saved into git, in which case restoring from
+backups might bring them back.

 # Reference

@@ -73,32 +101,23 @@ into the repository.

 ## SLA

-<!-- this describes an acceptable level of service for this service -->
+There is no SLA established for this service.

 ## Design

-<!-- how this is built -->
-<!-- should reuse and expand on the "proposed solution", it's a -->
-<!-- "as-built" documented, whereas the "Proposed solution" is an -->
-<!-- "architectural" document, which the final result might differ -->
-<!-- from, sometimes significantly -->
-
-<!-- a good guide to "audit" an existing project's design: -->
-<!-- https://bluesock.org/~willkg/blog/dev/auditing_projects.html -->
+Grafana is a single-binary daemon written in Golang with a frontend
+written in TypeScript. It stores its configuration in an `INI` file
+(`/etc/grafana/grafana.ini`, managed by Puppet). It doesn't keep
+metrics itself and instead delegates time series storage to "data
+sources"; we currently use Prometheus for that.

-<!-- things to evaluate here:
+It is mostly driven by a web browser interface making heavy use of
+JavaScript. Dashboards are stored in JSON files deployed by Puppet.

- * services
- * storage (databases? plain text files? cloud/S3 storage?)
- * queues (e.g. email queues, job queues, schedulers)
- * interfaces (e.g. webserver, commandline)
- * authentication (e.g. SSH, LDAP?)
- * programming languages, frameworks, versions
- * dependent services (e.g. authenticates against LDAP, or requires
-   git pushes) 
- * deployments: how is code for this deployed (see also Installation)
+It supports alerting, but we do not use that feature, instead
+relying on Prometheus and Nagios for alerts.

-how is this thing built, basically? -->
+Authentication is delegated to the webserver proxy (currently Apache).

 ## Issues

@@ -111,61 +130,69 @@ There is no issue tracker specifically for this project, [File][] or
 [File]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/new
 [search]: https://gitlab.torproject.org/tpo/tpa/team/-/issues

+Issues with Grafana itself may be [browsed or filed on GitHub](https://github.com/grafana/grafana/issues).
+
 ## Maintainer, users, and upstream

-<!-- document who deployed and operates this service, who the users -->
-<!-- are, who the upstreams are, if they are still active, -->
-<!-- collaborative, how do we keep up to date, -->
+This service was deployed by anarcat and hiro. The internal server is
+used by TPA and the external server can be used by any other teams,
+but is particularly used by the anti-censorship and metrics teams.
+
+Upstream is [Grafana Labs](https://grafana.com/), a startup with a few products alongside
+Grafana.

 ## Monitoring and testing

-<!-- describe how this service is monitored and how it can be tested -->
-<!-- after major changes like IP address changes or upgrades. describe -->
-<!-- CI, test suites, linting, how security issues and upgrades are -->
-<!-- tracked -->
+Grafana itself is monitored by [Prometheus](howto/prometheus) and produces graphs for
+its own metrics.
+
+The test procedure is basically to log in to the service and load a
+few dashboards.

 ## Logs and metrics

-<!-- where are the logs? how long are they kept? any PII? -->
-<!-- what about performance metrics? same questions -->
+Grafana doesn't hold metrics itself, and delegates this task to
+an external data source. We use [Prometheus](howto/prometheus) for that purpose, but
+other backends could be used as well.
+
+Grafana logs incoming requests in `/var/log/grafana/grafana.log`, which
+may contain private information like IP addresses and request times.

 ## Backups

-<!-- does this service need anything special in terms of backups? -->
-<!-- e.g. locking a database? special recovery procedures? -->
+No special backup procedure has been established for Grafana,
+considering the service can be rebuilt from scratch.

 ## Other documentation

-<!-- references to upstream documentation, if relevant -->
+ * [Upstream Grafana manual](https://grafana.com/docs/grafana/latest/)
+ * [Grafana GitHub project](https://github.com/grafana/grafana)

 # Discussion

 ## Overview

-<!-- describe the overall project. should include a link to a ticket -->
-<!-- that has a launch checklist -->
-
-<!-- if this is an old project being documented, summarize the known -->
-<!-- issues with the project. to quote the "audit procedure":
-
- 5. When was the last security review done on the project? What was
-    the outcome? Are there any security issues currently? Should it
-    have another security review?
+The Grafana project was quickly thrown together in 2019 to replace the
+Munin service, which had "died in a fire". Prometheus was first set up to
+collect metrics and Grafana was picked as a frontend because
+Prometheus didn't seem sufficient to produce good graphs. There was no
+elaborate discussion or evaluation of alternatives done at the time.

- 6. When was the last risk assessment done? Something that would cover
-    risks from the data stored, the access required, etc.
+There hasn't been a significant security audit of the service, but
+given that authentication is managed by Apache with a limited set of
+users, it should be fairly safe.

- 7. Are there any in-progress projects? Technical debt cleanup?
-    Migrations? What state are they in? What's the urgency? What's the
-    next steps?
+Note that it is assumed the dashboard and Prometheus are *public* on
+the internal server. The external server is considered private and
+shouldn't be publicly accessible.

- 8. What urgent things need to be done on this project?
-
-->
+There are lots of dashboards in the interface, which should probably
+be cleaned up and renamed. Some are not in Git and might be lost in a
+reinstall. Some dashboards do not work very well.

 ## Goals

-<!-- include bugs to be fixed -->
+N/A. No ongoing migration or major project.

 ### Must have

@@ -179,8 +206,13 @@ There is no issue tracker specifically for this project, [File][] or

 ## Proposed Solution

+N/A.
+
 ## Cost

+N/A.
+
 ## Alternatives considered

-<!-- include benchmarks and procedure if relevant -->
+No extensive evaluation of alternatives was performed when Grafana
+was deployed.
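
The test procedure added under "Monitoring and testing" (log in and load a few dashboards) could be complemented by a small scripted check. The sketch below is not part of this commit. It assumes Grafana's upstream `/api/health` endpoint is reachable from wherever the script runs, which may not hold here since authentication is delegated to the Apache proxy. The function names `grafana_is_healthy` and `check_grafana` are hypothetical.

```python
import json
import urllib.request


def grafana_is_healthy(payload: str) -> bool:
    """Return True if a /api/health response body reports a working database."""
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        # Not JSON at all, e.g. an HTML login or error page from the proxy.
        return False
    return isinstance(data, dict) and data.get("database") == "ok"


def check_grafana(base_url: str, timeout: float = 10.0) -> bool:
    """Fetch <base_url>/api/health and report whether Grafana looks healthy.

    Assumption: the endpoint is reachable without credentials; behind an
    authenticating proxy the request would need to carry them.
    """
    url = base_url.rstrip("/") + "/api/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8")
    except OSError:
        # Covers connection failures, timeouts and HTTP error statuses.
        return False
    return grafana_is_healthy(body)
```

A call such as `check_grafana("https://grafana.torproject.org")` only confirms that the daemon and its internal database respond. It does not replace loading a few dashboards, which also exercises the Prometheus data source.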