puppet fleet doesn't converge

Open Issue created 4 months ago by anarcat

while building a grafana dashboard over the prometheus puppet metrics, i noticed we have a solid 6-8 hosts that never converge.

https://grafana.torproject.org/d/fe1vcz4hlvgu8b/puppet-health?orgId=1&from=now-30d&to=now&timezone=utc&refresh=auto&viewPanel=panel-2

Notice the min on that: 6 hosts.

I made a panel showing what those hosts are, they are currently:

Each of those hosts should be inspected and tweaked so that they converge. I believe most of those hosts (except tb-build*) have this issue:

root@weather-01:~# pat
Info: Using environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Warning: /Postgresql_psql[ALTER ROLE torweather ENCRYPTED PASSWORD ****]: Unable to mark 'unless' as sensitive: unless is a parameter and not a property, and cannot be automatically redacted.
Info: Caching catalog for weather-01.torproject.org
Info: Applying configuration version '1732290058'
Notice: /Stage[main]/Profile::Weather/Postgresql::Server::Db[torweather]/Postgresql::Server::Role[torweather]/Postgresql_psql[ALTER ROLE torweather ENCRYPTED PASSWORD ****]/command: changed [redacted] to [redacted]
Notice: Applied catalog in 12.38 seconds

crm-int-01 doesn't run psql of course, but i think it's a similar issue with mysql passwords.

for the tb-build servers, it's more like this:

root@tb-build-03:~# pat
Info: Using environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for tb-build-03.torproject.org
Info: Applying configuration version '1732290110'
Notice: /Stage[main]/Profile::Torbrowser_build::Runner/File[/usr/local/bin/update-apps-repos.sh]/content: 

Notice: /Stage[main]/Profile::Torbrowser_build::Runner/File[/usr/local/bin/update-apps-repos.sh]/content: content changed '{mtime}2024-11-22 15:10:45 +0000' to '{mtime}2024-11-22 15:41:59 +0000'
Notice: Applied catalog in 4.79 seconds
root@tb-build-03:~#
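
To triage the remaining hosts faster, the resource that keeps changing can be pulled out of the agent output automatically. This is only a rough sketch: `changed_resources` and `NOTICE_RE` are names made up for illustration, and the regex assumes the `Notice: /Stage[...]/Type[title]/property: message` shape seen in the two runs above, nothing more.

```python
import re

# Matches agent lines shaped like the ones pasted in this issue:
#   Notice: /Stage[main]/.../File[/path]/content: content changed ...
# Assumption: the resource path starts with /Stage[ and ends with "]",
# followed by "/<property>:".
NOTICE_RE = re.compile(r"^Notice: (/Stage\[.*\])/([a-z_]+):\s*(.*)$")


def changed_resources(agent_output):
    """Return the distinct (resource, property) pairs reported as changed."""
    seen = []
    for line in agent_output.splitlines():
        match = NOTICE_RE.match(line.strip())
        if match is None:
            continue  # Info:, Warning:, "Applied catalog", blank lines...
        key = (match.group(1), match.group(2))
        if key not in seen:
            seen.append(key)
    return seen


# Lines copied from the tb-build-03 run above.
SAMPLE = """\
Info: Applying configuration version '1732290110'
Notice: /Stage[main]/Profile::Torbrowser_build::Runner/File[/usr/local/bin/update-apps-repos.sh]/content:

Notice: /Stage[main]/Profile::Torbrowser_build::Runner/File[/usr/local/bin/update-apps-repos.sh]/content: content changed '{mtime}2024-11-22 15:10:45 +0000' to '{mtime}2024-11-22 15:41:59 +0000'
Notice: Applied catalog in 4.79 seconds
"""

for resource, prop in changed_resources(SAMPLE):
    print(f"{resource} ({prop})")
```

A host whose list is non-empty on every run, with the same resource each time, is a candidate for the kind of fix described here.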

It would be nice to resolve those so we could have an alert to warn us about those.

(Honestly, I'm not sure this is actually an issue: is it really a bad thing that hosts don't converge? It can be a sign of a serious issue: for example we could be constantly trying to install a package and failing, and we should probably notice that... but i'm not sure this is high priority, which is why this is Icebox and TPA-RFC-33-C, as we definitely were not monitoring this before Prometheus came in, let alone alerting on that...)
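
For what it's worth, the condition such an alert would check is simple: a host "never converges" if every run in the window reported at least one changed resource. A minimal sketch of that logic, assuming we can get a per-host list of changed-resource counts (one per run) out of the metrics; the function name, the data shape, and the host names below are all hypothetical placeholders, not real measurements.

```python
def never_converged(samples):
    """Return hosts whose every run in the window changed something.

    samples maps a host name to a list of changed-resource counts,
    one entry per Puppet run. Hosts without any runs are skipped,
    since there is nothing to conclude about them.
    """
    return sorted(
        host
        for host, counts in samples.items()
        if counts and min(counts) > 0
    )


# Hypothetical data, for illustration only.
window = {
    "host-a": [1, 1, 1, 1],  # changes on every run: never converges
    "host-b": [0, 0, 3, 0],  # one deploy, then quiet: fine
    "host-c": [2, 1, 0, 0],  # settled after two runs: fine
    "host-d": [],            # no data in the window
}

print(never_converged(window))
```

Using the minimum rather than the latest value avoids flagging a host that just happened to pick up a legitimate change on its last run.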

2 of 6 checklist items completed · Edited 4 months ago by Jérôme Charaoui
