puppet fleet doesn't converge
While building a Grafana dashboard over the Prometheus Puppet metrics, I noticed we have a solid 6-8 hosts that never converge.
Notice the min on that: 6 hosts.
I made a panel showing which hosts those are. Currently they are:
- anonticket-01.torproject.org
- crm-int-01.torproject.org
- puppetdb-01.torproject.org
- tb-build-02.torproject.org
- tb-build-03.torproject.org
- weather-01.torproject.org
Each of those hosts should be inspected and tweaked so that they converge. I believe most of those hosts (all except tb-build*) have this issue:
root@weather-01:~# pat
Info: Using environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Warning: /Postgresql_psql[ALTER ROLE torweather ENCRYPTED PASSWORD ****]: Unable to mark 'unless' as sensitive: unless is a parameter and not a property, and cannot be automatically redacted.
Info: Caching catalog for weather-01.torproject.org
Info: Applying configuration version '1732290058'
Notice: /Stage[main]/Profile::Weather/Postgresql::Server::Db[torweather]/Postgresql::Server::Role[torweather]/Postgresql_psql[ALTER ROLE torweather ENCRYPTED PASSWORD ****]/command: changed [redacted] to [redacted]
Notice: Applied catalog in 12.38 seconds
crm-int-01 doesn't run psql of course, but I think it's a similar issue with MySQL passwords.
For the tb-build servers, it's more like this:
root@tb-build-03:~# pat
Info: Using environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for tb-build-03.torproject.org
Info: Applying configuration version '1732290110'
Notice: /Stage[main]/Profile::Torbrowser_build::Runner/File[/usr/local/bin/update-apps-repos.sh]/content:
Notice: /Stage[main]/Profile::Torbrowser_build::Runner/File[/usr/local/bin/update-apps-repos.sh]/content: content changed '{mtime}2024-11-22 15:10:45 +0000' to '{mtime}2024-11-22 15:41:59 +0000'
Notice: Applied catalog in 4.79 seconds
root@tb-build-03:~#
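To check which resources keep changing on a given host, the agent output from two consecutive runs could be compared. The snippet below is only a rough sketch, assuming the `Notice:` line format seen in the two transcripts above; `changed_resources` is a hypothetical helper, not something that exists in our tooling.

```python
import re

# Matches agent lines of the form seen above:
#   Notice: <resource path>]/<property>: <message>
# The property must be followed by whitespace or end of line, so that
# "::" in class names (Profile::Weather) is not taken for the separator.
NOTICE_RE = re.compile(
    r"^Notice: (?P<resource>/.+?\])/(?P<prop>\w+):(?:\s+(?P<msg>.*))?$"
)


def changed_resources(agent_output: str) -> list[tuple[str, str]]:
    """Return unique (resource, property) pairs reported as changed."""
    seen: list[tuple[str, str]] = []
    for line in agent_output.splitlines():
        match = NOTICE_RE.match(line.strip())
        if not match:
            continue  # Info/Warning lines and "Applied catalog" are skipped
        pair = (match.group("resource"), match.group("prop"))
        if pair not in seen:
            seen.append(pair)
    return seen


def changed_in_both(first_run: str, second_run: str) -> list[tuple[str, str]]:
    """Resources that changed in two consecutive runs never settle."""
    second = set(changed_resources(second_run))
    return [pair for pair in changed_resources(first_run) if pair in second]
```

Feeding it the output of two back-to-back agent runs on the same host should leave only the resources that prevent convergence, like the `Postgresql_psql` and `File` resources above.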
It would be nice to resolve those so we could have an alert that warns us when a host stops converging.
(Honestly, I'm not sure this is actually an issue: is it really a bad thing that hosts don't converge? It can be a sign of a serious issue: for example we could be constantly trying to install a package and failing, and we should probably notice that... but I'm not sure this is high priority, which is why this is Icebox and TPA-RFC-33-C, as we definitely were not monitoring this before Prometheus came in, let alone alerting on it...)
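If we do end up alerting on this, the rule would roughly be "the number of changed resources never drops to zero over some window", which is what the min on the dashboard shows. Here is a minimal sketch of that logic, assuming we have a per-host list of changed-resource counts, one per agent run; the data shape and the `never_converging` name are made up for illustration and do not reflect the actual Prometheus metrics.

```python
def never_converging(changes_by_host: dict[str, list[int]]) -> list[str]:
    """Return hosts whose agent runs changed something every single time.

    A host counts as converged if at least one run in the window
    reported zero changed resources.
    """
    return sorted(
        host
        for host, counts in changes_by_host.items()
        if counts and min(counts) > 0  # hosts with no data are ignored
    )


# Hypothetical sample: one count of changed resources per agent run.
sample = {
    "weather-01.torproject.org": [1, 1, 1, 1],
    "tb-build-03.torproject.org": [1, 1, 1, 1],
    "example-01.torproject.org": [3, 0, 0, 0],  # changed once, then settled
}
print(never_converging(sample))
```

Using the minimum rather than the latest value avoids flagging a host that just picked up a legitimate change on its last run.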