Configure timeout for metrics-lib clients, e.g., those using DescriptorIndexCollector
Tonight the relaydescs module of the main CollecTor instance froze while attempting to fetch the remote index.json from the backup CollecTor instance. Here's a stack trace I got from jcmd:
"CollecTor-Scheduled-Thread-10" #38 daemon prio=5 os_prio=0 tid=0x00007fb7ac009800 nid=0xea7 runnable [0x00007fb7e5893000]
   java.lang.Thread.State: RUNNABLE
	at java.net.SocketInputStream.socketRead0(Native Method)
	at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
	at java.net.SocketInputStream.read(SocketInputStream.java:171)
	at java.net.SocketInputStream.read(SocketInputStream.java:141)
	at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
	at sun.security.ssl.InputRecord.read(InputRecord.java:503)
	at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:983)
	- locked <0x000000008061f060> (a java.lang.Object)
	at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1385)
	- locked <0x000000008061f188> (a java.lang.Object)
	at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1413)
	at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1397)
	at sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:559)
	at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:185)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1564)
	- locked <0x000000008061f1f8> (a sun.net.www.protocol.https.DelegateHttpsURLConnection)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1492)
	- locked <0x000000008061f1f8> (a sun.net.www.protocol.https.DelegateHttpsURLConnection)
	at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:263)
	- locked <0x000000008061f2e0> (a sun.net.www.protocol.https.HttpsURLConnectionImpl)
	at java.net.URL.openStream(URL.java:1045)
	at org.torproject.descriptor.index.IndexNode.fetchIndex(IndexNode.java:101)
	at org.torproject.descriptor.index.DescriptorIndexCollector.collectDescriptors(DescriptorIndexCollector.java:74)
	at org.torproject.collector.sync.SyncManager.collectFromOtherInstances(SyncManager.java:59)
	at org.torproject.collector.sync.SyncManager.merge(SyncManager.java:43)
	at org.torproject.collector.cron.CollecTorMain.run(CollecTorMain.java:76)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
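I didn't look at the code in detail, but the thread is stuck in SocketInputStream.socketRead0() during the TLS handshake that URL.openStream() starts from IndexNode.fetchIndex(). URL.openStream() sets no connect or read timeout of its own, and a URLConnection's timeouts default to 0 (wait forever), so a read timeout would likely have turned this freeze into an exception. Maybe there are other possible fixes, though.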
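A minimal sketch of what such a timeout could look like, assuming the fix is to replace the url.openStream() call in IndexNode.fetchIndex() with an explicitly configured URLConnection. The class TimeoutFetch, the method openWithTimeouts(), the timeout values, and the example URL are all hypothetical and not part of metrics-lib; only URL.openConnection(), setConnectTimeout(), setReadTimeout() and getInputStream() are standard JDK calls.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URL;
import java.net.URLConnection;

/** Hypothetical helper, not part of metrics-lib. */
public class TimeoutFetch {

  // Example values only; sensible defaults would still need to be chosen.
  static final int CONNECT_TIMEOUT_MILLIS = 10_000;
  static final int READ_TIMEOUT_MILLIS = 60_000;

  /** Creates a connection with both timeouts set; no network I/O happens yet. */
  static URLConnection openWithTimeouts(URL url, int connectTimeoutMillis,
      int readTimeoutMillis) throws IOException {
    URLConnection connection = url.openConnection();
    // A value of 0 means "wait forever", which is what URL.openStream() gets by default.
    connection.setConnectTimeout(connectTimeoutMillis);
    connection.setReadTimeout(readTimeoutMillis);
    return connection;
  }

  /** Possible replacement for url.openStream(): a stalled peer should now cause
   *  a java.net.SocketTimeoutException instead of blocking the thread forever. */
  static InputStream openStream(URL url) throws IOException {
    return openWithTimeouts(url, CONNECT_TIMEOUT_MILLIS, READ_TIMEOUT_MILLIS)
        .getInputStream();
  }

  public static void main(String[] args) throws IOException {
    // Hypothetical URL; only the connection object is created, nothing is fetched.
    URL url = URI.create("https://collector.example.org/index/index.json").toURL();
    URLConnection connection =
        openWithTimeouts(url, CONNECT_TIMEOUT_MILLIS, READ_TIMEOUT_MILLIS);
    System.out.println(connection.getConnectTimeout() + " "
        + connection.getReadTimeout());
  }
}
```

Note that the read timeout bounds each individual read, not the whole download, so a server that keeps trickling bytes could still hold the thread for a long time; capping the total fetch time would need something extra, for example running the fetch as a Future and calling get() with a timeout.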