In the last few days CollecTor has been getting stuck and is not updating bridge descriptors. I was under the impression this might be due to #40037, but maybe it is something else.
I see in the logs that it finds many duplicate lines in the bridgepool assignments, but this shouldn't prevent the service from creating the descriptor archives.
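To illustrate why duplicates need not be fatal: a parser can collapse repeated assignment lines and merely count them, rather than aborting. This is a minimal sketch, not CollecTor's actual bridgepool-assignments code; the class and method names are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class AssignmentDedup {

  /**
   * Returns the unique lines in their original order, logging how many
   * duplicates were skipped instead of treating them as an error.
   * (Hypothetical helper for illustration only.)
   */
  static List<String> dedup(List<String> lines) {
    Set<String> unique = new LinkedHashSet<>(lines);
    int duplicates = lines.size() - unique.size();
    if (duplicates > 0) {
      System.err.println("Skipped " + duplicates + " duplicate assignment lines.");
    }
    return List.copyOf(unique);
  }

  public static void main(String[] args) {
    List<String> lines = List.of("bridge-a moat", "bridge-b https", "bridge-a moat");
    // Duplicates are dropped; archive creation could proceed with the rest.
    System.out.println(dedup(lines));
  }
}
```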
```
2024-01-10 14:20:31,575 ERROR o.t.m.c.c.CollecTorMain:59 The relaydescs module failed: Java heap space
java.lang.OutOfMemoryError: Java heap space
    at java.base/java.util.Arrays.copyOf(Arrays.java:3745)
    at java.base/java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:172)
    at java.base/java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:538)
    at java.base/java.lang.StringBuilder.append(StringBuilder.java:178)
    at ch.qos.logback.core.pattern.FormattingConverter.write(FormattingConverter.java:39)
    at ch.qos.logback.core.pattern.PatternLayoutBase.writeLoopOnConverters(PatternLayoutBase.java:114)
    at ch.qos.logback.classic.PatternLayout.doLayout(PatternLayout.java:141)
    at ch.qos.logback.classic.PatternLayout.doLayout(PatternLayout.java:1)
    at ch.qos.logback.core.encoder.LayoutWrappingEncoder.doEncode(LayoutWrappingEncoder.java:130)
    at ch.qos.logback.core.OutputStreamAppender.writeOut(OutputStreamAppender.java:187)
    at ch.qos.logback.core.FileAppender.writeOut(FileAppender.java:269)
    at ch.qos.logback.core.OutputStreamAppender.subAppend(OutputStreamAppender.java:212)
    at ch.qos.logback.core.rolling.RollingFileAppender.subAppend(RollingFileAppender.java:235)
    at ch.qos.logback.core.OutputStreamAppender.append(OutputStreamAppender.java:100)
    at ch.qos.logback.core.UnsynchronizedAppenderBase.doAppend(UnsynchronizedAppenderBase.java:84)
    at ch.qos.logback.core.spi.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:48)
    at ch.qos.logback.classic.Logger.appendLoopOnAppenders(Logger.java:270)
    at ch.qos.logback.classic.Logger.callAppenders(Logger.java:257)
    at ch.qos.logback.classic.Logger.buildLoggingEventAndAppend(Logger.java:421)
    at ch.qos.logback.classic.Logger.filterAndLog_0_Or3Plus(Logger.java:383)
    at ch.qos.logback.classic.Logger.info(Logger.java:579)
    at org.torproject.metrics.collector.relaydescs.ReferenceChecker.checkReferences(ReferenceChecker.java:312)
    at org.torproject.metrics.collector.relaydescs.ReferenceChecker.check(ReferenceChecker.java:86)
    at org.torproject.metrics.collector.relaydescs.ArchiveWriter.startProcessing(ArchiveWriter.java:212)
    at org.torproject.metrics.collector.cron.CollecTorMain.run(CollecTorMain.java:55)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
```
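Note that the `OutOfMemoryError` is thrown while logback formats a log message from `ReferenceChecker.checkReferences`, which suggests the message string itself may be very large. One mitigation (a sketch only, not CollecTor's current code; the helper name is hypothetical) is to cap the size of any message handed to the logger, so the logging layer never has to buffer the full string:

```java
public class BoundedLog {

  /**
   * Truncates a potentially huge log message to at most maxChars
   * characters, appending a note about how much was cut.
   * (Hypothetical helper for illustration only.)
   */
  static String truncate(String message, int maxChars) {
    if (message.length() <= maxChars) {
      return message;
    }
    return message.substring(0, maxChars)
        + " ... [" + (message.length() - maxChars) + " chars truncated]";
  }

  public static void main(String[] args) {
    // Simulate an oversized reference-report string.
    String huge = "x".repeat(1_000_000);
    // Only a bounded prefix would ever reach the appender.
    System.out.println(truncate(huge, 80));
  }
}
```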
It seems the heap error has disappeared after forcing the service to create a new reference file. The service is still using a lot of memory, though, so I wonder whether the issue is the GC algorithm again in this case. I'll try another algorithm this week and experiment with a few tuning options. Hopefully that will give us better memory utilization on the VM.
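For reference, switching the GC algorithm and capturing a heap dump on the next OOM could look roughly like this. These are standard HotSpot flags, but the specific values, paths, and jar name below are placeholders, not the actual deployment configuration:

```shell
# Illustrative JVM options only; tune heap sizes and paths for the real VM.
JAVA_OPTS="-Xms2g -Xmx8g \
  -XX:+UseG1GC \
  -XX:MaxGCPauseMillis=200 \
  -XX:+HeapDumpOnOutOfMemoryError \
  -XX:HeapDumpPath=/var/log/collector/heapdump.hprof"

java $JAVA_OPTS -jar collector.jar
```

A heap dump from `-XX:+HeapDumpOnOutOfMemoryError` would also let us confirm whether the oversized log message really is what exhausts the heap.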
It seems we have a related problem: the first_seen date of a lot of relays got reset to 2024-01-10 16:00:00 for some reason. It looks a bit like a self-healing process for some of the relays that were reported e.g. to our tor-relays@ list, in that their correct first_seen date shows up again. However, some relays still seem to be stuck on that date.