From 94cfe039e1f510917ae3b0c81cb7f02041266b65 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Antoine=20Beaupr=C3=A9?= <anarcat@debian.org>
Date: Wed, 18 Aug 2021 14:26:27 -0400
Subject: [PATCH] finish filling up the grafana template

---
 howto/grafana.md | 148 ++++++++++++++++++++++++++++-------------------
 1 file changed, 90 insertions(+), 58 deletions(-)

diff --git a/howto/grafana.md b/howto/grafana.md
index 47b873d2..cab37aad 100644
--- a/howto/grafana.md
+++ b/howto/grafana.md
@@ -15,9 +15,32 @@ difference between the internal and external servers.
 
 # Tutorial
 
+## Important dashboards
+
+Typically, working Grafana dashboards are "starred". Since we have
+many such dashboards now, here's a curated list of the most important
+dashboards you might need to look at:
+
+ * [Overview](https://grafana.torproject.org/d/wUmZB05Zk/tpo-overview) - first panel to show up on login, can filter basic
+   stats (bandwidth, memory, load, etc.) per server role (currently
+   the "class" field)
+ * [Per-node server stats](https://grafana.torproject.org/d/Z7T7Cfemz/node-exporter-full) - basic server stats (CPU, disk, memory
+   usage), with drill-down options
+ * [Node comparison dashboard](https://grafana.torproject.org/d/ER3U2cqmk/node-exporter-server-metrics) - similar to the above, but can
+   display multiple servers in columns, useful for cluster overview and
+   drawing correlations between servers
+ * [Postfix](https://grafana.torproject.org/d/Ds5BxBYGk/postfix-mtail) - to monitor mailings, see [monitoring mailings, in
+   the CRM documentation](service/crm#monitoring-mailings)
+
+Other services (e.g. Apache, Bind, PostgreSQL, GitLab) also have
+their own dashboards, and many dashboards are still works in progress.
+
+The above list doesn't cover the "external" Grafana server
+(`grafana2`), which has its own distinct set of dashboards.
+
 # How-to
 
-## Updating a Grafana dashboard
+## Updating a dashboard
 
 As mentioned in the [installation section](#installation) below, the Grafana
 dashboards are maintained by Puppet. So while new dashboard can be
@@ -36,15 +59,20 @@ export dashboards into git.
 
 ## Pager playbook
 
-<!-- information about common errors from the monitoring system and -->
-<!-- how to deal with them. this should be easy to follow: think of -->
-<!-- your future self, in a stressful situation, tired and hungry. -->
+In general, Grafana is not a high-availability service and shouldn't
+"page" you. It is, however, quite useful in emergency or diagnostic
+situations. To diagnose server-level issues, head to the [per-node
+server stats](https://grafana.torproject.org/d/Z7T7Cfemz/node-exporter-full), which shows basic server stats (CPU, disk, memory usage),
+with drill-down options. If that's not enough, look at the [list of
+important dashboards](#important-dashboards).
 
 ## Disaster recovery
 
-<!-- what to do if all goes to hell. e.g. restore from backups? -->
-<!-- rebuild from scratch? not necessarily those procedures (e.g. see -->
-<!-- "Installation" below but some pointers. -->
+In theory, if the Grafana server dies in a fire, it should be possible
+to rebuild it from scratch with Puppet; see the [installation
+procedure](#installation). In practice, it's possible that important dashboards
+might not have been saved into git, in which case restoring from
+backups might bring them back.
 
 # Reference
 
@@ -73,32 +101,23 @@ into the repository.
 
 ## SLA
 
-<!-- this describes an acceptable level of service for this service -->
+There is no SLA established for this service.
 
 ## Design
 
-<!-- how this is built -->
-<!-- should reuse and expand on the "proposed solution", it's a -->
-<!-- "as-built" documented, whereas the "Proposed solution" is an -->
-<!-- "architectural" document, which the final result might differ -->
-<!-- from, sometimes significantly -->
-
-<!-- a good guide to "audit" an existing project's design: -->
-<!-- https://bluesock.org/~willkg/blog/dev/auditing_projects.html -->
+Grafana is a single-binary daemon written in Go with a frontend
+written in TypeScript. It stores its configuration in an `INI` file (in
+`/etc/grafana/grafana.ini`, managed by Puppet). It doesn't keep
+metrics itself and instead delegates time series storage to "data
+sources", for which we currently use Prometheus.
 
-<!-- things to evaluate here:
+It is mostly driven by a web browser interface making heavy use of
+JavaScript. Dashboards are stored in JSON files deployed by Puppet.
 
- * services
- * storage (databases? plain text files? cloud/S3 storage?)
- * queues (e.g. email queues, job queues, schedulers)
- * interfaces (e.g. webserver, commandline)
- * authentication (e.g. SSH, LDAP?)
- * programming languages, frameworks, versions
- * dependent services (e.g. authenticates against LDAP, or requires
-   git pushes)
- * deployments: how is code for this deployed (see also Installation)
+It supports alerting, but we do not use that feature, instead
+relying on Prometheus and Nagios for alerts.
 
-how is this thing built, basically? -->
+Authentication is delegated to the webserver proxy (currently Apache).
 
 ## Issues
 
@@ -111,61 +130,69 @@ There is no issue tracker specifically for this project, [File][] or
 [File]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/new
 [search]: https://gitlab.torproject.org/tpo/tpa/team/-/issues
 
+Issues with Grafana itself may be [browsed or filed on GitHub](https://github.com/grafana/grafana/issues).
+
 ## Maintainer, users, and upstream
 
-<!-- document who deployed and operates this service, who the users -->
-<!-- are, who the upstreams are, if they are still active, -->
-<!-- collaborative, how do we keep up to date, -->
+This service was deployed by anarcat and hiro. The internal server is
+used by TPA and the external server can be used by any other team,
+but is particularly used by the anti-censorship and metrics teams.
+
+Upstream is [Grafana Labs](https://grafana.com/), a startup with a few products alongside
+Grafana.
 
 ## Monitoring and testing
 
-<!-- describe how this service is monitored and how it can be tested -->
-<!-- after major changes like IP address changes or upgrades. describe -->
-<!-- CI, test suites, linting, how security issues and upgrades are -->
-<!-- tracked -->
+Grafana itself is monitored by [Prometheus](howto/prometheus) and produces graphs for
+its own metrics.
+
+The test procedure is basically to log in to the service and load a
+few dashboards.
 
 ## Logs and metrics
 
-<!-- where are the logs? how long are they kept? any PII? -->
-<!-- what about performance metrics? same questions -->
+Grafana doesn't hold metrics itself, and delegates this task to
+an external data source. We use [Prometheus](howto/prometheus) for that purpose, but
+other backends could be used as well.
+
+Grafana logs incoming requests in `/var/log/grafana/grafana.log`, which
+may contain private information like IP addresses and request times.
 
 ## Backups
 
-<!-- does this service need anything special in terms of backups? -->
-<!-- e.g. locking a database? special recovery procedures? -->
+No special backup procedure has been established for Grafana,
+considering the service can be rebuilt from scratch.
 
 ## Other documentation
 
-<!-- references to upstream documentation, if relevant -->
+ * [Upstream Grafana manual](https://grafana.com/docs/grafana/latest/)
+ * [Grafana GitHub project](https://github.com/grafana/grafana)
 
 # Discussion
 
 ## Overview
 
-<!-- describe the overall project. should include a link to a ticket -->
-<!-- that has a launch checklist -->
-
-<!-- if this is an old project being documented, summarize the known -->
-<!-- issues with the project. to quote the "audit procedure":
-
- 5. When was the last security review done on the project? What was
-    the outcome? Are there any security issues currently? Should it
-    have another security review?
+The Grafana project was quickly thrown together in 2019 to replace the
+Munin service, which had "died in a fire". Prometheus was first set up to
+collect metrics and Grafana was picked as a frontend because
+Prometheus didn't seem sufficient to produce good graphs. There was no
+elaborate discussion or evaluation of alternatives done at the time.
 
- 6. When was the last risk assessment done? Something that would cover
-    risks from the data stored, the access required, etc.
+There hasn't been a significant security audit of the service, but
+given that authentication is managed by Apache with a limited set of
+users, it should be fairly safe.
 
- 7. Are there any in-progress projects? Technical debt cleanup?
-    Migrations? What state are they in? What's the urgency? What's the
-    next steps?
+Note that it is assumed the dashboard and Prometheus are *public* on
+the internal server. The external server is considered private and
+shouldn't be publicly accessible.
 
- 8. What urgent things need to be done on this project?
-
--->
+There are lots of dashboards in the interface, which should probably
+be cleaned up and renamed. Some are not in Git and might be lost in a
+reinstall. Some dashboards do not work very well.
 
 ## Goals
 
-<!-- include bugs to be fixed -->
+N/A. No ongoing migration or major project.
 
 ### Must have
 
@@ -179,8 +206,13 @@ There is no issue tracker specifically for this project, [File][] or
 
 ## Proposed Solution
 
+N/A.
+
 ## Cost
 
+N/A.
+
 ## Alternatives considered
 
-<!-- include benchmarks and procedure if relevant -->
+No extensive evaluation of alternatives was performed when Grafana
+was deployed.
-- 
GitLab