The rdsys distributor connections use long-polling: they are kept open forever and reconnected automatically if they get closed. Apache is closing them every 30s. Can we disable or raise this timeout just for rdsys-frontend.torproject.org (in polyanthum)?
i'm ... not sure. i don't understand why you'd want a long-running socket open like this... maybe i'm too old school, but i have the feeling this consumes needless resources and exposes the webserver to denial of service attacks. but since this vhost is restricted to a single IP, i guess we could allow this.
how big should the timeout be? we tuned that down to 30 seconds from the upstream default (60 seconds). from what i can tell, nginx has a 60-second timeout as well...
i can't help but think there's a design issue here: shouldn't this be handled through websockets or something? or some half-open socket? i'm not very familiar with the modern web, to be honest, but it doesn't quite feel right to have such long-running sockets open. why aren't they opened only as needed?
maybe if you expand a bit on the underlying design here, it would be easier to help you.
AFAIK long-polling is a common pattern in APIs nowadays when you want the client to wait for the server to send things.
The situation is that the distributor needs to get resource updates (like bridges) from the backend. The way this is solved in rdsys is by keeping an open connection over which the backend can send updates whenever needed.
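To make this concrete, here is a rough sketch of what the distributor side of that pattern looks like. This is not the actual rdsys code: the URL, endpoint path, and update format are made up for illustration. A request stays open, updates stream in, and the client reconnects whenever the connection is closed (for example by Apache's timeout):

```go
package main

import (
	"bufio"
	"log"
	"net/http"
	"time"
)

// backendURL is a hypothetical endpoint; the real rdsys backend API differs.
const backendURL = "https://rdsys-backend.example.org/resource-stream"

func main() {
	// No client-side timeout: the connection is meant to stay open until the
	// backend pushes an update or a proxy in between closes it.
	client := &http.Client{Timeout: 0}

	for {
		if err := pollOnce(client); err != nil {
			log.Printf("connection closed or failed: %v; reconnecting", err)
		}
		// Small delay so a persistent failure does not turn into a tight loop.
		time.Sleep(5 * time.Second)
	}
}

// pollOnce opens one long-lived request and consumes updates until the
// connection is closed by either side.
func pollOnce(client *http.Client) error {
	resp, err := client.Get(backendURL)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		// In the real service this would be a resource update (e.g. bridges);
		// here we just log the raw line.
		log.Printf("received update: %s", scanner.Text())
	}
	return scanner.Err()
}
```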
We can think about redesigning the API; I'm not sure long-polling is the best solution (nor are websockets, which seem to be kind of broken in apache). But I can't block the deployment of gettor on redesigning the rdsys backend API. We could roll back the decision to split rdsys into different VMs and go back to hosting everything in polyanthum, so apache is not a problem anymore.
i am happy to keep the servers separate: i would rather keep rdsys-frontend separate and keep splitting polyanthum into multiple inter-communicating services like this, even if that means figuring out more complex issues like this one that we might not have (and why would that be?) in a local environment.
i'm just worried that i can't actually define an infinite timeout here. at some point you will have to handle timeouts, even if only because the remote server reboots once in a while, or apache gets killed for whatever other reason.
in my mind, having something not too big is helpful, because if there's a failure to reconnect in your code path, we will detect it earlier. if we have (say) a 7-day timeout or something, it could take a long time to notice the failure.
conversely, if the delay is short, you will quickly see if you have problems reconnecting. in that sense, 1min actually seems like a good default.
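to illustrate what i mean, here's a rough sketch (function names and thresholds are made up, this is not rdsys code) of a reconnect loop instrumented so that repeated reconnect failures become visible after a few cycles; with a short server-side timeout the reconnect path gets exercised often enough that this would fire quickly:

```go
package main

import (
	"errors"
	"log"
	"time"
)

// maxConsecutiveFailures is an arbitrary threshold for this sketch.
const maxConsecutiveFailures = 5

// streamUpdates stands in for one long-poll cycle: it returns nil when the
// connection was closed normally (e.g. by the proxy timeout) and an error
// when reconnecting failed. It is only a stub here.
func streamUpdates() error {
	return errors.New("stub: no backend in this sketch")
}

func main() {
	failures := 0
	for {
		if err := streamUpdates(); err != nil {
			failures++
			log.Printf("reconnect failed (%d in a row): %v", failures, err)
			if failures >= maxConsecutiveFailures {
				// With a 30s-1min server timeout this fires within minutes of
				// a real problem; with a 7-day timeout a broken reconnect
				// path could go unnoticed for a week.
				log.Print("ALERT: cannot re-establish backend connection")
			}
		} else {
			failures = 0
		}
		time.Sleep(10 * time.Second)
	}
}
```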
but sure, i'll raise it to 30 minutes. i just hope this won't sweep problems under the rug in the future... but hey, it's your service, you get to break it as you wish... ;)
this is hidden behind an IP-level access list, so the security risk should be minimal... and we can deal with that as it comes.