puppet catalog runs are failing intermittently with an error like:
root@gitlab-02:~# pat
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Failed to execute '/pdb/cmd/v1?checksum=b88fe540833780418991630b46806cf81256b973&version=9&certname=gitlab-02.torproject.org&command=replace_catalog&producer-timestamp=2022-03-29T15:57:42.071Z' on at least 1 of the following 'server_urls': https://puppet.torproject.org:8081
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run
it's fairly harmless: puppet skips the run and moves on. eventually it succeeds, which is why things are flapping like mad in nagios.
the intermittence also makes the problem pretty hard to reproduce, so diagnostics are spotty. but this seems to be the error we get from java:
Mar 29 16:00:19 pauli/pauli java[6063]: WARNING: An illegal reflective access operation has occurred
Mar 29 16:00:19 pauli/pauli java[6063]: WARNING: Illegal reflective access by clojure.lang.InjectedInvoker/0x0000000840062040 (file:/usr/share/java/clojure.jar) to method sun.nio.ch.ChannelInputStream.close()
Mar 29 16:00:19 pauli/pauli java[6063]: WARNING: Please consider reporting this to the maintainers of clojure.lang.InjectedInvoker/0x0000000840062040
Mar 29 16:00:19 pauli/pauli java[6063]: WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
Mar 29 16:00:19 pauli/pauli java[6063]: WARNING: All illegal access operations will be denied in a future release
@hacim suggested the fix might be a downgrade of libjetty to an earlier version, but i was worried about the security implications, so instead i tried upgrading to the backports version. that didn't go well at all: puppetdb just crashes on startup:
Mar 29 15:58:33 pauli/pauli java[5522]: 15:58:32.770 [main] DEBUG puppetlabs.puppetdb.http - The v1 API has been retired; please use v4 Caught HTTP processing exception
Mar 29 15:58:33 pauli/pauli java[5522]: 15:58:32.778 [main] DEBUG puppetlabs.puppetdb.http - The v2 API has been retired; please use v4 Caught HTTP processing exception
Mar 29 15:58:33 pauli/pauli java[5522]: 15:58:32.779 [main] DEBUG puppetlabs.puppetdb.http - The v3 API has been retired; please use v4 Caught HTTP processing exception
Mar 29 15:58:33 pauli/pauli java[5522]: 15:58:33.358 [main] DEBUG puppetlabs.trapperkeeper.bootstrap - Loading bootstrap config from classpath: 'jar:file:/usr/share/puppetdb/puppetdb.jar!/bootstrap.cfg'
Mar 29 15:58:35 pauli/pauli java[5522]: Exception in thread "main" java.lang.IllegalStateException: KeyStores with multiple certificates are not supported on the base class org.eclipse.jetty.util.ssl.SslContextFactory. (Use org.eclipse.jetty.util.ssl.SslContextFactory$Server or org.eclipse.jetty.util.ssl.SslContextFactory$Client instead)
Mar 29 15:58:35 pauli/pauli java[5522]:     at org.eclipse.jetty.util.ssl.SslContextFactory.newSniX509ExtendedKeyManager(SslContextFactory.java:1283)
Mar 29 15:58:35 pauli/pauli java[5522]:     at org.eclipse.jetty.util.ssl.SslContextFactory.getKeyManagers(SslContextFactory.java:1265)
Mar 29 15:58:35 pauli/pauli java[5522]:     at org.eclipse.jetty.util.ssl.SslContextFactory.load(SslContextFactory.java:372)
Mar 29 15:58:35 pauli/pauli java[5522]:     at org.eclipse.jetty.util.ssl.SslContextFactory.doStart(SslContextFactory.java:243)
Mar 29 15:58:35 pauli/pauli java[5522]:     at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
Mar 29 15:58:35 pauli/pauli java[5522]:     at org.eclipse.jetty.util.component.ContainerLifeCycle.start(ContainerLifeCycle.java:169)
Mar 29 15:58:35 pauli/pauli java[5522]:     at org.eclipse.jetty.util.component.ContainerLifeCycle.doStart(ContainerLifeCycle.java:117)
Mar 29 15:58:35 pauli/pauli java[5522]:     at org.eclipse.jetty.server.SslConnectionFactory.doStart(SslConnectionFactory.java:97)
Mar 29 15:58:35 pauli/pauli java[5522]:     at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
Mar 29 15:58:35 pauli/pauli java[5522]:     at org.eclipse.jetty.util.component.ContainerLifeCycle.start(ContainerLifeCycle.java:169)
Mar 29 15:58:35 pauli/pauli java[5522]:     at org.eclipse.jetty.util.component.ContainerLifeCycle.doStart(ContainerLifeCycle.java:117)
Mar 29 15:58:35 pauli/pauli java[5522]:     at org.eclipse.jetty.server.AbstractConnector.doStart(AbstractConnector.java:321)
Mar 29 15:58:35 pauli/pauli java[5522]:     at org.eclipse.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:81)
Mar 29 15:58:35 pauli/pauli java[5522]:     at org.eclipse.jetty.server.ServerConnector.doStart(ServerConnector.java:234)
Mar 29 15:58:35 pauli/pauli java[5522]:     at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
Mar 29 15:58:35 pauli/pauli java[5522]:     at org.eclipse.jetty.server.Server.doStart(Server.java:401)
Mar 29 15:58:35 pauli/pauli java[5522]:     at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
Mar 29 15:58:35 pauli/pauli java[5522]:     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
Mar 29 15:58:35 pauli/pauli java[5522]:     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
Mar 29 15:58:35 pauli/pauli java[5522]:     at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
Mar 29 15:58:35 pauli/pauli java[5522]:     at java.base/java.lang.reflect.Method.invoke(Method.java:566)
Mar 29 15:58:35 pauli/pauli java[5522]:     at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:167)
Mar 29 15:58:35 pauli/pauli java[5522]:     at clojure.lang.Reflector.invokeNoArgInstanceMember(Reflector.java:438)
Mar 29 15:58:35 pauli/pauli java[5522]:     at puppetlabs.trapperkeeper.services.webserver.jetty9_core$eval43528$start_webserver_BANG___43533$fn__43534$fn__43535.invoke(jetty9_core.clj:685)
Mar 29 15:58:35 pauli/pauli java[5522]:     at puppetlabs.trapperkeeper.services.webserver.jetty9_core$eval43528$start_webserver_BANG___43533$fn__43534.invoke(jetty9_core.clj:684)
Mar 29 15:58:35 pauli/pauli java[5522]:     at puppetlabs.trapperkeeper.services.webserver.jetty9_core$eval43528$start_webserver_BANG___43533.invoke(jetty9_core.clj:677)
Mar 29 15:58:35 pauli/pauli java[5522]:     at puppetlabs.trapperkeeper.services.webserver.jetty9_core$start_server_single_default.invokeStatic(jetty9_core.clj:929)
Mar 29 15:58:35 pauli/pauli java[5522]:     at puppetlabs.trapperkeeper.services.webserver.jetty9_core$start_server_single_default.invoke(jetty9_core.clj:926)
Mar 29 15:58:35 pauli/pauli java[5522]:     at puppetlabs.trapperkeeper.services.webserver.jetty9_core$eval43947$start_BANG___43952$fn__43953.invoke(jetty9_core.clj:1008)
Mar 29 15:58:35 pauli/pauli java[5522]:     at puppetlabs.trapperkeeper.services.webserver.jetty9_core$eval43947$start_BANG___43952.invoke(jetty9_core.clj:1003)
Mar 29 15:58:35 pauli/pauli java[5522]:     at puppetlabs.trapperkeeper.services.webserver.jetty9_service$reify__44354$service_fnk__23931__auto___positional$reify__44361.start(jetty9_service.clj:44)
Mar 29 15:58:35 pauli/pauli java[5522]:     at puppetlabs.trapperkeeper.services$eval23729$fn__23743$G__23719__23746.invoke(services.clj:8)
Mar 29 15:58:35 pauli/pauli java[5522]:     at puppetlabs.trapperkeeper.services$eval23729$fn__23743$G__23718__23750.invoke(services.clj:8)
Mar 29 15:58:35 pauli/pauli java[5522]:     at puppetlabs.trapperkeeper.internal$eval24312$run_lifecycle_fn_BANG___24319$fn__24320.invoke(internal.clj:204)
Mar 29 15:58:35 pauli/pauli java[5522]:     at puppetlabs.trapperkeeper.internal$eval24312$run_lifecycle_fn_BANG___24319.invoke(internal.clj:187)
Mar 29 15:58:35 pauli/pauli java[5522]:     at puppetlabs.trapperkeeper.internal$eval24341$run_lifecycle_fns__24346$fn__24347.invoke(internal.clj:238)
Mar 29 15:58:35 pauli/pauli java[5522]:     at puppetlabs.trapperkeeper.internal$eval24341$run_lifecycle_fns__24346.invoke(internal.clj:215)
Mar 29 15:58:35 pauli/pauli java[5522]:     at puppetlabs.trapperkeeper.internal$eval24899$build_app_STAR___24908$fn$reify__24920.start(internal.clj:591)
Mar 29 15:58:35 pauli/pauli java[5522]:     at puppetlabs.trapperkeeper.internal$eval24947$boot_services_for_app_STAR__STAR___24954$fn__24955$fn__24957.invoke(internal.clj:617)
Mar 29 15:58:35 pauli/pauli java[5522]:     at puppetlabs.trapperkeeper.internal$eval24947$boot_services_for_app_STAR__STAR___24954$fn__24955.invoke(internal.clj:615)
Mar 29 15:58:35 pauli/pauli java[5522]:     at puppetlabs.trapperkeeper.internal$eval24947$boot_services_for_app_STAR__STAR___24954.invoke(internal.clj:609)
Mar 29 15:58:35 pauli/pauli java[5522]:     at clojure.core$partial$fn__5826.invoke(core.clj:2630)
Mar 29 15:58:35 pauli/pauli java[5522]:     at puppetlabs.trapperkeeper.internal$eval24383$initialize_lifecycle_worker__24394$fn__24395$fn__24534$state_machine__12865__auto____24559$fn__24562.invoke(internal.clj:255)
Mar 29 15:58:35 pauli/pauli java[5522]:     at puppetlabs.trapperkeeper.internal$eval24383$initialize_lifecycle_worker__24394$fn__24395$fn__24534$state_machine__12865__auto____24559.invoke(internal.clj:255)
Mar 29 15:58:35 pauli/pauli java[5522]:     at clojure.core.async.impl.ioc_macros$run_state_machine.invokeStatic(ioc_macros.clj:973)
Mar 29 15:58:35 pauli/pauli java[5522]:     at clojure.core.async.impl.ioc_macros$run_state_machine.invoke(ioc_macros.clj:972)
Mar 29 15:58:35 pauli/pauli java[5522]:     at clojure.core.async.impl.ioc_macros$run_state_machine_wrapped.invokeStatic(ioc_macros.clj:977)
Mar 29 15:58:35 pauli/pauli java[5522]:     at clojure.core.async.impl.ioc_macros$run_state_machine_wrapped.invoke(ioc_macros.clj:975)
Mar 29 15:58:35 pauli/pauli java[5522]:     at clojure.core.async$ioc_alts_BANG_$fn__13094.invoke(async.clj:384)
Mar 29 15:58:35 pauli/pauli java[5522]:     at clojure.core.async$do_alts$fn__13026$fn__13029.invoke(async.clj:253)
Mar 29 15:58:35 pauli/pauli java[5522]:     at clojure.core.async.impl.channels.ManyToManyChannel$fn__7046$fn__7047.invoke(channels.clj:95)
Mar 29 15:58:35 pauli/pauli java[5522]:     at clojure.lang.AFn.run(AFn.java:22)
Mar 29 15:58:35 pauli/pauli java[5522]:     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
Mar 29 15:58:35 pauli/pauli java[5522]:     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
Mar 29 15:58:35 pauli/pauli java[5522]:     at java.base/java.lang.Thread.run(Thread.java:829)
Mar 29 15:58:35 pauli/pauli systemd[1]: puppetdb.service: Main process exited, code=exited, status=1/FAILURE
Mar 29 15:58:35 pauli/pauli systemd[1]: puppetdb.service: Failed with result 'exit-code'.
to do the downgrade, i had to add a snapshot.debian.org line to apt's sources, which meant i had to disable puppet (otherwise it would have removed the line).
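for reference, such a sources entry looks roughly like this; the snapshot timestamp and file path here are hypothetical, not the ones actually used on pauli:

```
# hypothetical /etc/apt/sources.list.d/snapshot.list entry;
# substitute the snapshot timestamp carrying the wanted jetty9 version.
# check-valid-until=no is needed because snapshot's Release files are
# long expired.
deb [check-valid-until=no] https://snapshot.debian.org/archive/debian/20210101T000000Z/ buster main
```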
i'll also have to add pinning to keep unattended-upgrades from upgrading the package, but before that, i will probably consider just upgrading the host altogether (#40696 (closed)), hoping that bullseye fixes that problem.
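such a pin would look something like this; the package names are an assumption (the exact set of jetty9 packages downgraded may differ):

```
# hypothetical /etc/apt/preferences.d/jetty9: hold the downgraded
# jetty9 packages so unattended-upgrades won't pull them back up.
# a priority above 1000 allows apt to keep (or even install) the
# older version from this origin.
Package: libjetty9-java libjetty9-extra-java
Pin: origin snapshot.debian.org
Pin-Priority: 1001
```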
so i don't understand why that downgrade would fix this, and it is also a security liability. but for now, it beats having constant warnings in nagios.
so, as a quick fix, the downgrade will remain in place for now. i'll pin the package in puppet and think about next steps.
note that the downgrade was canceled and this problem still occurs; it is just silenced by a change in tor-nagios.git, specifically:
From ea86005a55e35d62b197195012ab6ff15e57fc5d Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Antoine=20Beaupr=C3=A9?= <anarcat@debian.org>
Date: Thu, 31 Mar 2022 14:01:07 -0400
Subject: [PATCH] silence intermittent warnings from PuppetDB
 (tpo/tpa/team#40706)

We think those warnings are harmless in the sense that they eventually
resolve themselves and occur relatively infrequently. Unfortunately,
the plugin is quite intolerant and warns as soon as such an error is
found and therefore doesn't handle transient failures.

This is not ideal: we'd rather fix the bug in itself. But that has
proven to be very difficult. The only fix that worked so far is to
downgrade jetty9, which introduces a whole set of security issues we
feel is not worth auditing individually.

We also hope this will be solved when the puppetdb Debian package is
put back into shape, somewhere in may. See also tpo/tpa/team#33588.

We haven't rolled out a new version of the Debian package for this,
and instead have distributed the new check manually on pauli, since
it's the only place it's ran.

diff --git a/config/nagios-master.cfg b/config/nagios-master.cfg
index b809195..0b8eef3 100644
--- a/config/nagios-master.cfg
+++ b/config/nagios-master.cfg
@@ -1640,13 +1640,13 @@ services:
   # picked double that for critical arbitrarily.
   - name: puppet - catalog run
-    remotecheck: "/usr/lib/nagios/plugins/check_puppetdb_nodes -a 4 --node $HOSTNAME$.torproject.org -w 720 -c 1440"
+    remotecheck: "/usr/lib/nagios/plugins/check_puppetdb_nodes -a 4 --node $HOSTNAME$.torproject.org -w 720 -c 1440 --exclude \"Failed to execute '/pdb/cmd/v1\""
     hostgroups: computers
     runfrom: pauli
   - name: puppet - all catalog runs
-    nrpe: "/usr/lib/nagios/plugins/check_puppetdb_nodes -a 4 -w 720 -c 1440"
+    nrpe: "/usr/lib/nagios/plugins/check_puppetdb_nodes -a 4 -w 720 -c 1440 --exclude \"Failed to execute '/pdb/cmd/v1\""
     hosts: pauli
   # build boxes
diff --git a/tor-nagios-checks/checks/check_puppetdb_nodes b/tor-nagios-checks/checks/check_puppetdb_nodes
old mode 100644
new mode 100755
index f76674e..eed29a1
--- a/tor-nagios-checks/checks/check_puppetdb_nodes
+++ b/tor-nagios-checks/checks/check_puppetdb_nodes
@@ -117,6 +117,12 @@ $np->add_arg(
     default => '',
 );
+$np->add_arg(
+    spec => 'exclude|e=s',
+    help => 'exclude the given log entries regex from error logs (default: %s)',
+    default => '',
+);
+
 $np->getopts;

 my %apiurls = (
@@ -232,8 +238,10 @@ foreach my $node (@$data) {
         my $logs = decode_json( $response->decoded_content );
         foreach my $log (@$logs) {
             my $tags = $log->{'tags'};
-            if ( grep(/^err$/, @$tags) ) {
-                $np->add_message( WARNING, "$certname, $log->{'message'}" );
+            if ( grep(/^err$/, @$tags) ) {
+                if ( !grep($np->opts->exclude, $log->{'message'} ) ){
+                    $np->add_message( WARNING, "$certname, $log->{'message'}" );
+                }
             }
         }
     }
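to illustrate what the new --exclude option does, here is a rough shell sketch (not the actual Perl) of the filtering logic: an error message matching the exclude pattern is silenced instead of raised as a warning. the should_warn helper and the fixed-string matching are simplifications of my own.

```shell
#!/bin/sh
# sketch of the check's new exclusion filter: error log messages
# containing the exclude pattern are dropped instead of raising a
# nagios WARNING. matching is done as a fixed string here for
# simplicity; the real plugin treats --exclude as a regex.
should_warn() {
    msg="$1"
    exclude="Failed to execute '/pdb/cmd/v1"
    if printf '%s\n' "$msg" | grep -qF -- "$exclude"; then
        echo "excluded"
    else
        echo "warn"
    fi
}

should_warn "gitlab-02.torproject.org, Failed to execute '/pdb/cmd/v1?checksum=abc'"  # prints: excluded
should_warn "gitlab-02.torproject.org, some other puppet error"                       # prints: warn
```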
we will still trip on the warning while running cumin, for example. i opened #40707 (closed) to follow up on a puppetdb upgrade that could eventually fix this issue; once that lands, we should remove that exclusion.