Moat process quits unexpectedly
On Saturday, March 5, 2022, the moat service failed because the moat process quit unexpectedly. @arma restored the service by restarting the moat process. However, whatever caused the process to quit has not been identified or fixed, so the failure can happen again.
To solve this issue, we have created the following roadmap:
- Set up an automatic restart mechanism, such as systemd, so that moat is restarted when it fails
- Set up log collection to capture the diagnostic output, which currently goes only to stdout
- Redact the log file to hide user IP addresses
- Fix the root cause of the crashes by analyzing the diagnostic output
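
As an illustration of the redaction step, here is a minimal sketch in Python of a filter that scrubs IP addresses from log lines before they are stored. It is not part of moat: the names `redact_line`, `redact_stream`, and the `[redacted]` placeholder are hypothetical, and the actual format of the moat log is an assumption. IPv4 addresses are matched loosely on purpose, since redacting too much is the safer error, and IPv6 candidates are validated with the standard `ipaddress` module so that timestamps such as `12:34:56` are left alone.

```python
import ipaddress
import re

PLACEHOLDER = "[redacted]"  # hypothetical placeholder text

# Loose IPv4 match: four dot-separated groups of 1-3 digits. It over-matches
# on purpose (e.g. 999.1.1.1), because over-redacting is the safer mistake.
IPV4_RE = re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?!\d)")

# IPv6 candidates: standalone runs of hex digits and colons. Each candidate is
# validated with the ipaddress module before it is replaced.
IPV6_CANDIDATE_RE = re.compile(r"(?<![0-9A-Za-z])[0-9A-Fa-f:]{2,}(?![0-9A-Za-z])")


def _redact_ipv6(match):
    text = match.group(0)
    if ":" not in text:
        return text
    # Try the run as-is, then without trailing colons ("from ::1: EOF").
    for candidate in (text, text.rstrip(":")):
        try:
            ipaddress.IPv6Address(candidate)
        except ValueError:
            continue
        # Keep whatever followed the address (for example a trailing colon).
        return PLACEHOLDER + text[len(candidate):]
    return text


def redact_line(line):
    """Return the line with IPv4 and IPv6 addresses replaced."""
    line = IPV4_RE.sub(PLACEHOLDER, line)
    return IPV6_CANDIDATE_RE.sub(_redact_ipv6, line)


def redact_stream(src, dst):
    """Copy src to dst line by line, redacting addresses on the way."""
    for raw in src:
        dst.write(redact_line(raw))
        dst.flush()
```

Calling `redact_stream(sys.stdin, sys.stdout)` would let the filter sit in a pipeline between the process output and the log file. Port numbers are kept, since they are useful for debugging. The sketch may miss address formats it does not anticipate, so it should be checked against real log samples before being relied on.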
The relevant IRC log is as follows:
<n8fr8[m]> not sure where the right channel to report this, but the moat service endpoint seems to be down. can't request a bridge in TB, Orbot, OnionBrowser, etc.
<lavamind> n8fr8[m]: thanks for the heads up, usually a good place to start is #tor of forum.torproject.net
<lavamind> or*
<n8fr8[m]> well... a key part of our anti-censorship services for users in places like Russia where tor is blocked is down, and you want me to post it to the general tor forum on a message board? I suppose I will send an email to the right people instead
<lavamind> well I know there was an upgrade to bridgedb done this week, so I've poked a member of the anticensorship team responsible for it
<lavamind> I've looked at the system and couldn't find anything obviously wrong
<n8fr8[m]> thanks for checking!
<n8fr8[m]> could be something with fastly front domain, hopefully temporary
<n8fr8[m]> I am getting a TLS cert error on https://moat.torproject.org.global.prod.fastly.net/
<lavamind> oh nice catch
<lavamind> well we definitely dont generate those certificates :p
<lavamind> so yeah, maybe an issue on Fastly's end
<+armadev> n8fr8[m]: i think i might have fixed moat
<+armadev> if somebody wants to verify, please do :)
<shelikhoo> I think moat is recovered.... how did armadev fix it?
<+armadev> shelikhoo: the moat fix was 'sudo -u moat /srv/bridges.torproject.org/bin/run-moat-shim'
<+armadev> as specified in https://gitlab.torproject.org/tpo/anti-censorship/team/-/wikis/Survival-Guides/Moat-Survival-Guide
<+armadev> (and i had said this in #tpo-admin but i guess not in more detail here too)
<+armadev> shelikhoo: the bigger issue though is: that moat-shim thing has died at least twice now, mysteriously, for no reason
<+armadev> so somebody should debug it, set up monitoring for when it disappears, etc
<shelikhoow> Yes, so I think a systemd based solution that restart it automatically will work....
<+armadev> can you open a ticket somewhere?
<shelikhoow> Yes, consider it done.
<+armadev> i agree that using systemd to auto restart it is a good first step
<+armadev> there is definitely also a part of me that wants to know why it exits though :)
<+armadev> thanks!
<shelikhoo> That means we need to investigate the log....
<+armadev> yep. the log is full of tls complaints and ip addresses,
<+armadev> but i think the log only goes to stdout
<+armadev> i.e. when you log out from running run-moat-shim, the log now goes nowhere
<shelikhoo> Yes... If we setup a systemd based deployment, we can configure it to store the log from stdout,err
<+armadev> great. so, step one, switch to systemd and start logging stuff somewhere,
<shelikhoo> Systemd have its issues, but it is quite convenient
<shelikhoo> Yes.
<+armadev> then step two, when it dies, see if it said anything useful
<shelikhoo> The effort to do so will be tracked in the issue.
<shelikhoo> Yes
<+armadev> step one-point-five, notice that the log has a bunch of ip addresses in it and wonder if that is urgent enough to fix
<shelikhoo> Yes. I don't think it is urgent but we should fix it....
<+armadev> great
Edited by shelikhoo