A "status" dashboard is a simple website that allows service admins to
clearly and simply announce downtimes and recovery.

[[_TOC_]]

# Tutorial

<!-- simple, brainless step-by-step instructions requiring little or -->
<!-- no technical background -->

# How-to

<!-- more in-depth procedure that may require interpretation -->

TODO: document how to push updates to the dashboard

## Pager playbook

<!-- information about common errors from the monitoring system and -->
<!-- how to deal with them. this should be easy to follow: think of -->
<!-- your future self, in a stressful situation, tired and hungry. -->

## Disaster recovery

<!-- what to do if all goes to hell. e.g. restore from backups? -->
<!-- rebuild from scratch? not necessarily those procedures (e.g. see -->
<!-- "Installation" below) but some pointers. -->

TODO: contingencies when/if the normal system is down

# Reference

## Installation
<!-- how to set up the service from scratch -->

## SLA

This service should be highly available. It should tolerate the failure
of one or all points of presence: if all fail, it should be easy to
deploy it to a third-party provider.

## Design
<!-- how this is built -->
<!-- should reuse and expand on the "proposed solution", it's an -->
<!-- "as-built" document, whereas the "Proposed solution" is an -->
<!-- "architectural" document, which the final result might differ -->
<!-- from, sometimes significantly -->

<!-- a good guide to "audit" an existing project's design: -->
<!-- https://bluesock.org/~willkg/blog/dev/auditing_projects.html -->

## Issues

<!-- such projects are never over. add a pointer to well-known issues -->
<!-- and show how to report problems. usually a link to the bugtracker -->

There is no issue tracker specifically for this project. [File][] or
[search][] for issues in the [team issue tracker][search].

[File]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/new
[search]: https://gitlab.torproject.org/tpo/tpa/team/-/issues

## Monitoring and testing

<!-- describe how this service is monitored and how it can be tested -->
<!-- after major changes like IP address changes or upgrades -->

## Logs and metrics

<!-- where are the logs? how long are they kept? any PII? -->
<!-- what about performance metrics? same questions -->

## Backups

<!-- does this service need anything special in terms of backups? -->
<!-- e.g. locking a database? special recovery procedures? -->

## Other documentation

<!-- references to upstream documentation, if relevant -->

# Discussion

## Overview

<!-- describe the overall project. should include a link to a ticket -->
<!-- that has a launch checklist -->

## Goals
<!-- include bugs to be fixed -->

### Must have

* **user-friendly**: the public website must be easy to understand by
  the wider Tor community of users (not just TPI/TPA)
* **status updates and progress**: "post status problem we know about
  so the world can learn if problems are known to the Tor team."
  * example: "[recent] v3 outage where we could have put out a small
    FAQ right away (go static HTML!) and then update the world as we
    figure out the problem but also expected return to normal."
* **multi-stakeholder**: "easily editable by many of us namely likely
  the network health team and we could also have the network team to
  help out"
* **simple to deploy and use**: pushing an update shouldn't require
  complex software or procedures. Editing a text file, committing and
  pushing, or building with a single command and pushing the HTML,
  for example, is simple enough. Installing a MySQL database and PHP
  server, for example, is not simple enough.
* keep it simple
* free-software based
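
The "simple to deploy and use" requirement above (edit a text file,
build with a single command, push the HTML) can be illustrated with a
minimal sketch in Python. This is not the tool proposed here: the
incident file layout (`key: value` header lines, a blank line, then a
free-form body) and the names `Incident`, `parse_incident` and
`render_page` are assumptions made up for illustration only.

```python
from __future__ import annotations

from dataclasses import dataclass
from html import escape


@dataclass
class Incident:
    """One manually written status update (hypothetical structure)."""
    date: str       # ISO 8601 date, so plain string sorting is chronological
    title: str
    resolved: bool
    body: str


def parse_incident(text: str) -> Incident:
    """Parse "key: value" header lines, a blank line, then a body."""
    header, _, body = text.partition("\n\n")
    fields = {}
    for line in header.splitlines():
        key, _, value = line.partition(":")
        fields[key.strip().lower()] = value.strip()
    return Incident(
        date=fields["date"],
        title=fields["title"],
        resolved=fields.get("resolved", "no").lower() in ("yes", "true"),
        body=body.strip(),
    )


def render_page(incidents: list[Incident]) -> str:
    """Render all incidents, newest first, as one static HTML page."""
    items = []
    for inc in sorted(incidents, key=lambda i: i.date, reverse=True):
        state = "resolved" if inc.resolved else "ongoing"
        items.append(
            f"<li><strong>{escape(inc.date)}: {escape(inc.title)}</strong> "
            f"({state})<p>{escape(inc.body)}</p></li>"
        )
    return (
        '<!DOCTYPE html>\n<html><head><meta charset="utf-8">'
        "<title>Status</title></head>\n<body><h1>Status</h1>\n<ul>\n"
        + "\n".join(items)
        + "\n</ul>\n</body></html>\n"
    )
```

In such a setup, an update would amount to adding one small text file,
running the build, and pushing the resulting HTML to any static web
host, with no database or application server involved.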

### Nice to have

* deployment through GitLab (pages?), with contingency plans
* separate TLD to thwart DNS-based attacks against torproject.org
* same tool for multiple teams
* per-team filtering
* RSS feeds
* integration with social media?
* responsive design

### Non-Goals

* automation: updating the site is a manual process. No automatic
  reports from sensors, metrics or Nagios, as this tends to complicate
  the implementation and cause false positives

## Approvals required

TPA, network team, network health team.

## Proposed Solution

## Cost

Just research and development time. Hosting costs are negligible.

## Alternatives considered

These are the status dashboards we know about that are still
somewhat actively developed:

* [Cachet](https://cachethq.io/)
  * PHP
  * MySQL database
  * [demo site](https://demo.cachethq.io/) (test@test.com, test123)
  * responsive
  * [not decentralized](https://twitter.com/theanarcat/status/575061666532102144)
  * [no Nagios support](https://github.com/cachethq/Cachet/issues/225)
  * user-friendly
  * publicly accessible
  * fairly easy to use
  * [aims for LDAP support](https://github.com/CachetHQ/Cachet/issues/2108)
  * no Twitter, Identica, IRC or XMPP support for now
  * [dropped RSS support](https://github.com/CachetHQ/Cachet/issues/3313)
  * future of the project uncertain ([4037](https://github.com/CachetHQ/Cachet/issues/4037), [3968](https://github.com/CachetHQ/Cachet/issues/3968))
* [cstate](https://github.com/cstate/cstate), Hugo-based static site generator, tag-based RSS
  feeds, easy setup on Netlify, GitLab CI integration, badges,
  read-only API
* [Staytus](http://staytus.co/)
  * Ruby
  * MySQL database
  * responsive
  * email notifications
  * mobile-friendly
  * not distributed
  * no Nagios integration
  * [no Twitter notifications](https://github.com/adamcooke/staytus/issues/2)
  * user-friendly - seems to be even nicer than Cachet, as there are links to individual announcements and notifications
  * no LDAP support
  * MIT-licensed
  * [performance problems similar to Cachet's](https://github.com/adamcooke/staytus/issues/4)

### Abandonware

These were evaluated in a previous life but have since been
abandoned upstream:

* [Overseer](https://github.com/disqus/overseer) - used at [Disqus.com](http://disqus.com/), Python/Django, user-friendly/simple, [administrator non-friendly](https://overseer.readthedocs.org/en/latest/admin.html), Twitter integration, Apache2 license, development stopped, Disqus replaced it with [Statuspage.io](https://www.statuspage.io/)
* [Stashboard](http://www.stashboard.org/) - used at [Twilio](http://www.twilio.com/), MIT license, [demo](http://stashboard.appspot.com/), Twitter integration,
  REST API, abandonware, no authentication, no Unicode support,
  depends on Google App Engine, requires daily updates
* [Baobab](https://github.com/Gandi/baobab) - previously used at [Gandi](https://gandi.net/), replaced with statuspage.io, Django-based

### Hack-ish solutions

These were discarded because they do not provide an "out of the box"
experience:

* use Jenkins to run jobs that check a bunch of things and report a
  user-friendly status?
* just use a social network account (e.g. Twitter)
* "just use the wiki"
* use Drupal ("there's a module for that")
* roll our own with [Lektor](https://www.getlektor.com/), e.g. using [this template](https://www.hamma.dev/hamma1/)

### Example sites

* [État des services gandi](https://www.gandi.net/servstat)
* [Amazon Service Health Dashboard](http://status.aws.amazon.com/)
* [Disqus.com service status](http://status.disqus.com) - based on [statuspage.io](https://www.statuspage.io/)
* [GitHub status](https://status.github.com/) - "Battle station fully operational",
  auto-refresh, Twitter-connected, simple color coding (see [this
  blog post for more details](https://github.com/blog/1240-new-status-site)), not open-source (confirmed in
  personal email between GitHub support and anarcat on 2013-05-02)
* [Wikimedia status page](http://status.wikimedia.org/) - based on [proprietary nimsoft
  software](http://www.nimsoft.com/solutions/nimsoft-cloud-user-experience/key-features/public-status-page.html?m=41159&c=pspfoot), deprecated in favor of Grafana
* [Riseup](https://status.riseup.net/) - RSS feeds
* [Potager.org](https://meteo.potager.org/) - ikiwiki based
* [Twilio status](https://status.twilio.com/) - email, Slack, RSS subscriptions, lots of
  services shown