deploy webPassword authentication on prometheus1
Quote from TPA-RFC-33:
Authentication
To unify the clusters as we intend to, we need to fix authentication on the Prometheus and Grafana servers.
Current situation
Authentication is currently handled as follows:
- Icinga: static `htpasswd` file, not managed by Puppet, modified manually when onboarding/off-boarding
- Prometheus 1: static `htpasswd` file with dummy password managed by Puppet
- Grafana 1: same, with an extra admin password kept in Trocla, using the auth proxy configuration
- Prometheus 2: static htpasswd file with real admin password deployed, extra password generated for [prometheus-alerts][] continuous integration (CI) validation, all deployed through Puppet
- Grafana 2: static htpasswd file with real admin password for "admin" and "metrics", both of which are shared with an unclear number of people
Originally, both Prometheus servers had the same authentication system but that was split in 2019 to protect the external server.
Proposed changes
The plan was originally to just delegate authentication to Grafana, but we're concerned this would introduce yet another authentication source, which we want to avoid. Instead, we should re-enable the `webPassword` field in LDAP, which was mysteriously dropped in `userdir-ldap-cgi`'s `7cba921` (drop many fields from update form, 2016-03-20), a trivial patch. This would allow any tor-internal person to access the dashboards. Access levels would be managed inside the Grafana database.
Prometheus servers would reuse the same password file, allowing tor-internal users to issue "raw" queries, browse and manage alerts.
Note that this change will negatively impact the `prometheus-alerts` CI, which will require another way to validate its rulesets.

We have briefly considered making Grafana dashboards publicly available, but ultimately rejected this idea, as it would mean having two entirely different time series datasets, which would be too hard to separate reliably. That would also impose a cardinal explosion of servers if we want to provide high availability.
TL;DR: deploy the new `webPassword` file from LDAP (probably by tweaking the host entry in the LDAP DB) and hook the webserver up to it. Notify users.
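As a sanity check once the webserver is hooked up to the LDAP-exported password file, something along these lines could confirm the wiring; the hostname, username and query below are placeholders rather than values taken from this ticket:

```sh
# Placeholders throughout: adjust host, vhost and credentials to the actual deployment.
sudo apache2ctl configtest        # the vhost referencing the new password file should still parse
curl -s -o /dev/null -w '%{http_code}\n' \
  https://prometheus1.torproject.org/                          # expect 401 without credentials
curl -s -o /dev/null -w '%{http_code}\n' \
  -u "alice:$WEBPASSWORD" \
  'https://prometheus1.torproject.org/api/v1/query?query=up'   # expect 200 with a valid webPassword
```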
- prom1
- merge https://gitlab.torproject.org/tpo/tpa/puppet-control/-/merge_requests/44 and deploy on both prom1 and prom2
- set a timeline for the retirement of `tor-guest`
  - april 17th
- announce the change to TPA
- fix prometheus-alerts/tor-puppet so nothing relies on the shared user
  - only found the minio container that hardcodes the credentials
- add fallback credentials for prom1 so that we have a fallback admin account even if ud-ldap fails (probably the same password as `services/grafana.torproject.org` in `tor-password.git`); see the htpasswd sketch after this list
- send reminder to TPA before the cutoff date
- fix permissions for the individual users
  - set up the TPA team with access to the main org and the folders; users as admins of the main org
- wait for the tor-guest retirement, see if anything breaks
- prom2
- set a timeline for retiring the shared passwords from /etc/apache2/prom_htpasswd (essentially the `metrics` account)
  - april 17th
- announce retirement of shared user to the teams that use it
- fix permissions for users who set up their new accounts
  - set up a grafana team that has access to the main org and the folder that contains the team's dashboards; add users to the team; if they need to be able to modify dashboards, set them as `Admin` in the team (see the Grafana sketch after this list)
- send reminder before the cutoff date
- at the planned cutoff date, remove the `metrics` account entirely
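A minimal sketch of the htpasswd operations referenced in the checklist above; the fallback account name is a placeholder, and the file path is the one named in the prom2 items:

```sh
# Add a fallback admin entry (placeholder name), prompting for its password; -B stores it with bcrypt.
sudo htpasswd -B /etc/apache2/prom_htpasswd fallback-admin
# At the planned cutoff date, drop the shared "metrics" account from the same file.
sudo htpasswd -D /etc/apache2/prom_htpasswd metrics
```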
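For the per-team Grafana permissions, one possible route is Grafana's HTTP API (the same can be done through the UI); the host, admin credentials, team name, team id and folder UID below are all placeholders:

```sh
# Create a team (placeholder name), assuming basic-auth access as a Grafana admin.
curl -s -u "admin:$GRAFANA_ADMIN_PASSWORD" -H 'Content-Type: application/json' \
  -X POST 'https://grafana.example.org/api/teams' \
  -d '{"name": "metrics-team"}'
# Grant that team Editor access to the folder holding its dashboards.
# permission levels: 1 = Viewer, 2 = Editor, 4 = Admin
curl -s -u "admin:$GRAFANA_ADMIN_PASSWORD" -H 'Content-Type: application/json' \
  -X POST 'https://grafana.example.org/api/folders/FOLDER_UID/permissions' \
  -d '{"items": [{"teamId": 1, "permission": 2}]}'
```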