The machine has been deployed on gnt-dal and a new role/class pair specifically for managing PuppetDB has been pushed to tor-puppet. So far it seems to be working great!
Tomorrow I'll start testing data migration from the old PuppetDB.
I copied the PuppetDB database from pauli to puppetdb-01, using pg_dump. At startup PuppetDB automatically executed the necessary database migrations, and did so successfully.
On pauli, I've added submit_only_server_urls = https://puppetdb-01.torproject.org:8081 to /etc/puppet/puppetdb.conf to test whether our current Puppet master is able to submit reports and catalogs to the new PuppetDB, and it appears this is also working without issues. Upon enabling HTTP session logging, I can see submissions being received and processed without any sign of errors.
To view the PuppetDB dashboard, one can ssh -NL8080:localhost:8080 puppetdb-01.torproject.org & and open http://localhost:8080/pdb/dashboard/index.html in a browser. It's also possible to use that connection for cumin queries.
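As a rough illustration of using that tunnel for queries (this is not cumin itself, just a direct call to PuppetDB's v4 query API), a minimal Python sketch could look like the following. It assumes the SSH tunnel above is up and that PuppetDB answers plain HTTP on localhost:8080; the helper names `query_url` and `run_query` are made up for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the SSH tunnel described above is running, so PuppetDB
# is reachable on this local address.
TUNNEL_BASE = "http://localhost:8080"


def query_url(entity, ast_query=None, base=TUNNEL_BASE):
    """Build a PuppetDB v4 query URL for an entity endpoint such as 'nodes'."""
    url = f"{base}/pdb/query/v4/{entity}"
    if ast_query is not None:
        # PuppetDB expects the AST query as a JSON-encoded 'query' parameter.
        url += "?" + urlencode({"query": json.dumps(ast_query)})
    return url


def run_query(entity, ast_query=None, base=TUNNEL_BASE, timeout=10):
    """Send the query through the tunnel and return the decoded JSON reply."""
    with urlopen(query_url(entity, ast_query, base), timeout=timeout) as response:
        return json.load(response)


# Example call (needs the tunnel, so it is left commented out):
# nodes = run_query("nodes", ["=", "certname", "puppetdb-01.torproject.org"])
```

Keeping the URL construction separate from the network call makes it easy to check what would be sent before the tunnel is involved at all.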
I've disabled Puppet across all nodes, switched to the new PuppetDB and tested:
puppet agent --enable && time puppet agent --test --noop && puppet agent --disable "testing new bookworm puppetdb endpoint"
This works. Resources are found and applied as expected. Yay!
However, using the new PuppetDB, puppet agent --test is noticeably slower, approximately 15 to 20 seconds more per run, which is a bit confounding. I'm not sure if this is due to PuppetDB itself, PostgreSQL, Puppet master, or what. Perhaps at least part of the explanation is that pauli is on gnt-fsn and puppetdb-01 is on gnt-dal. I'm wondering if it would be worth it to move pauli to gnt-dal?
I ran more tests, enabling --debug on the Puppet master.
With puppetdb-01:
puppet-master[8458]: Creating new connection for https://puppetdb-01.torproject.org:8081
puppet-master[8458]: (Scope(Class[Profile::Prometheus::Node_exporter])) Collected 1 Ferm::Rule resource in 0.47 seconds
puppet-master[8458]: Creating new connection for https://puppetdb-01.torproject.org:8081
puppet-master[8458]: (Scope(Class[Nagios::Client])) Collected 1 Concat::Fragment resource in 0.42 seconds
puppet-master[8458]: Creating new connection for https://puppetdb-01.torproject.org:8081
puppet-master[8458]: (Scope(Class[Nagios::Client])) Collected 1 Concat::Fragment resource in 0.44 seconds
puppet-master[8458]: Creating new connection for https://puppetdb-01.torproject.org:8081
puppet-master[8458]: (Scope(Class[Nagios::Client])) Collected 1 Ferm::Rule resource in 0.43 seconds
puppet-master[8458]: Creating new connection for https://puppetdb-01.torproject.org:8081
puppet-master[8458]: (Scope(Class[Nagios::Client])) Collected 1 Ferm::Rule resource in 0.42 seconds
puppet-master[8458]: Creating new connection for https://puppetdb-01.torproject.org:8081
puppet-master[8458]: (Scope(Class[Nagios::Client])) Collected 1 Ferm::Rule resource in 0.42 seconds
puppet-master[8458]: Creating new connection for https://puppetdb-01.torproject.org:8081
puppet-master[8458]: (Scope(Class[Ssh])) Collected 89 Ferm::Rule::Simple resources in 0.48 seconds
puppet-master[8458]: Creating new connection for https://puppetdb-01.torproject.org:8081
puppet-master[8458]: (Scope(Class[Bacula::Client])) Collected 1 Bacula::Client::Director resource in 0.44 seconds
puppet-master[8458]: Creating new connection for https://puppetdb-01.torproject.org:8081
puppet-master[8458]: (Scope(Class[Bacula::Client])) Collected 1 Ferm::Rule::Simple resource in 0.44 seconds
puppet-master[8458]: Creating new connection for https://puppetdb-01.torproject.org:8081
puppet-master[8458]: (Scope(Class[Profile::Gitlab::Runner])) Collected 1 Ferm::Rule resource in 0.42 seconds
puppet-master[8458]: Creating new connection for https://puppetdb-01.torproject.org:8081
puppet-master[8458]: (Scope(Class[Profile::Gitlab::Runner])) Collected 1 Ferm::Rule::Simple resource in 0.44 seconds
With the current PuppetDB:
puppet-master[8985]: Creating new connection for https://puppet.torproject.org:8081
puppet-master[8985]: (Scope(Class[Profile::Prometheus::Node_exporter])) Collected 1 Ferm::Rule resource in 0.02 seconds
puppet-master[8985]: Creating new connection for https://puppet.torproject.org:8081
puppet-master[8985]: (Scope(Class[Nagios::Client])) Collected 1 Concat::Fragment resource in 0.02 seconds
puppet-master[8985]: Creating new connection for https://puppet.torproject.org:8081
puppet-master[8985]: (Scope(Class[Nagios::Client])) Collected 1 Concat::Fragment resource in 0.02 seconds
puppet-master[8985]: Creating new connection for https://puppet.torproject.org:8081
puppet-master[8985]: (Scope(Class[Nagios::Client])) Collected 1 Ferm::Rule resource in 0.01 seconds
puppet-master[8985]: Creating new connection for https://puppet.torproject.org:8081
puppet-master[8985]: (Scope(Class[Nagios::Client])) Collected 1 Ferm::Rule resource in 0.01 seconds
puppet-master[8985]: Creating new connection for https://puppet.torproject.org:8081
puppet-master[8985]: (Scope(Class[Nagios::Client])) Collected 1 Ferm::Rule resource in 0.02 seconds
puppet-master[8985]: Creating new connection for https://puppet.torproject.org:8081
puppet-master[8985]: (Scope(Class[Ssh])) Collected 89 Ferm::Rule::Simple resources in 0.03 seconds
puppet-master[8985]: Creating new connection for https://puppet.torproject.org:8081
puppet-master[8985]: (Scope(Class[Bacula::Client])) Collected 1 Bacula::Client::Director resource in 0.02 seconds
puppet-master[8985]: Creating new connection for https://puppet.torproject.org:8081
puppet-master[8985]: (Scope(Class[Bacula::Client])) Collected 1 Ferm::Rule::Simple resource in 0.01 seconds
puppet-master[8985]: Creating new connection for https://puppet.torproject.org:8081
puppet-master[8985]: (Scope(Class[Profile::Gitlab::Runner])) Collected 1 Ferm::Rule resource in 0.01 seconds
puppet-master[8985]: Creating new connection for https://puppet.torproject.org:8081
puppet-master[8985]: (Scope(Class[Profile::Gitlab::Runner])) Collected 1 Ferm::Rule::Simple resource in 0.01 seconds
To be clear, we've got a difference of ~750ms for one PuppetDB query, based on the origin of the query. There are a lot of PuppetDB queries in one Puppet run, so I'm not surprised this can add up to 15 seconds of extra delay.
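To compare the two debug excerpts without eyeballing them, a small Python sketch along these lines could tally the collection times per PuppetDB endpoint. It assumes the log lines keep the exact wording quoted above ("Creating new connection for ..." followed by "Collected ... in N seconds"); the function names and the shortened sample are only illustrative.

```python
import re
from collections import defaultdict
from statistics import mean

# A few entries copied from the debug excerpts above; feed in the full logs as needed.
SAMPLE_LOG = """\
puppet-master[8458]: Creating new connection for https://puppetdb-01.torproject.org:8081
puppet-master[8458]: (Scope(Class[Profile::Prometheus::Node_exporter])) Collected 1 Ferm::Rule resource in 0.47 seconds
puppet-master[8458]: Creating new connection for https://puppetdb-01.torproject.org:8081
puppet-master[8458]: (Scope(Class[Nagios::Client])) Collected 1 Concat::Fragment resource in 0.42 seconds
puppet-master[8458]: Creating new connection for https://puppetdb-01.torproject.org:8081
puppet-master[8458]: (Scope(Class[Nagios::Client])) Collected 1 Concat::Fragment resource in 0.44 seconds
puppet-master[8985]: Creating new connection for https://puppet.torproject.org:8081
puppet-master[8985]: (Scope(Class[Profile::Prometheus::Node_exporter])) Collected 1 Ferm::Rule resource in 0.02 seconds
puppet-master[8985]: Creating new connection for https://puppet.torproject.org:8081
puppet-master[8985]: (Scope(Class[Nagios::Client])) Collected 1 Concat::Fragment resource in 0.02 seconds
puppet-master[8985]: Creating new connection for https://puppet.torproject.org:8081
puppet-master[8985]: (Scope(Class[Nagios::Client])) Collected 1 Concat::Fragment resource in 0.02 seconds
"""

# Either a connection line (captures the endpoint) or a collection line
# (captures the elapsed seconds). Assumes the format shown in the excerpts.
ENTRY_RE = re.compile(
    r"Creating new connection for (?P<endpoint>https://[^\s:/]+:\d+)"
    r"|Collected \d+ \S+ resources? in (?P<seconds>[\d.]+) seconds"
)


def collection_times(log_text):
    """Return {endpoint: [seconds, ...]}, attributing each collection
    to the most recent 'Creating new connection' line before it."""
    times = defaultdict(list)
    endpoint = None
    for match in ENTRY_RE.finditer(log_text):
        if match.group("endpoint"):
            endpoint = match.group("endpoint")
        elif endpoint is not None:
            times[endpoint].append(float(match.group("seconds")))
    return dict(times)


def mean_times(log_text):
    """Return {endpoint: mean collection time in seconds}."""
    return {ep: mean(vals) for ep, vals in collection_times(log_text).items()}


if __name__ == "__main__":
    for endpoint, seconds in mean_times(SAMPLE_LOG).items():
        print(f"{endpoint}: {seconds:.3f}s per collection")
```

In the excerpts above, each collection takes 0.42 to 0.48 seconds against puppetdb-01 and 0.01 to 0.03 seconds against the current PuppetDB, so multiplying the per-query gap by the number of collections in a catalog gives a rough idea of the added run time.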
@anarcat Do you see any serious issues in migrating pauli to gnt-dal?
That shouldn't be a problem, and you might even use the process to act
as a backup: you could migrate the host to gnt-dal, then upgrade it
there, and revert to the previous cluster if all goes to shit.
Just an idea.
...
On 2023-10-03 18:48:23, Jérôme Charaoui (@lavamind) wrote:
@anarcat Do you see any serious issues in migrating pauli to gnt-dal?
Total query time is now ~6-7ms instead of ~2-3ms, which I think is acceptable (and probably inevitable).
Just to make sure I understand what you're saying here: we have a 3-fold
performance penalty after the upgrade, and that's inevitable
because... upgrade? :)
...
On 2023-10-04 14:29:52, Jérôme Charaoui (@lavamind) wrote:
I think labeling this as a "3-fold performance penalty", as you're doing here, is quite unfair, because individual query performance is not a useful metric: what we care about is the performance of puppet agent runs, and this has not changed in any significant or perceptible way.
And to be precise, I was saying inevitable because the Puppet master <-> PuppetDB connections are now hitting the (real-world) wire as opposed to being local to the same machine.
Sorry, I was a little harsh there. I was just trying to understand the
underlying cause and impact.
So what are those? Cause is moving PuppetDB to a different VM?
Impact on run time is how much? Twice as slow?
I don't mean to question the track we've taken, I still think it's the
right direction. There are probably many optimizations we can make to the
setup if we want to improve runtime from here on too.
...
On 2023-10-04 16:30:16, Jérôme Charaoui (@lavamind) wrote:
And to be precise, I was saying inevitable because the Puppet master <-> PuppetDB connections are now hitting the (real-world) wire as opposed to being local to the same machine.