Moat process unexpected quit issue

Last Saturday (March 5, 2022), the moat service failed. The failure was caused by the moat process quitting unexpectedly. @arma restored the service by restarting the moat process. However, the underlying cause of the unexpected quit has not been identified or fixed.

To solve this issue we have created the following roadmap:

  1. Set up an automatic restart mechanism, such as systemd, so that moat is restarted automatically when it fails
  2. Set up a log collection system to capture the diagnostic output
  3. Redact the log file to hide user IP addresses
  4. Analyze the diagnostic output to find and fix the root cause of the crash
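
As a rough illustration of step 3, here is a minimal Python sketch of a filter that scrubs IP addresses from log lines before they are stored. It is not part of moat or its deployment: the function names, the `[scrubbed]` placeholder, and the sample log line format are all assumptions made for this example. It deliberately errs on the side of over-redaction, so any dotted-quad string (including, for instance, a four-part version number) is replaced.

```python
import io
import ipaddress
import re

# Placeholder text is an assumption for this sketch, not moat's actual format.
PLACEHOLDER = "[scrubbed]"

# Anything shaped like a dotted quad is treated as an IPv4 address.
IPV4_RE = re.compile(r"(?<![\w.])(?:\d{1,3}\.){3}\d{1,3}(?!\w)")

# Candidate IPv6 strings: hex groups with at least two colons.
# Candidates are validated below so that timestamps such as 12:34:56 survive.
IPV6_RE = re.compile(
    r"(?<![0-9A-Za-z:])(?:[0-9A-Fa-f]{0,4}:){2,7}[0-9A-Fa-f]{0,4}(?![0-9A-Za-z:])"
)


def _scrub_ipv6(match):
    """Replace the candidate only if it parses as a real IPv6 address."""
    candidate = match.group(0)
    try:
        ipaddress.IPv6Address(candidate)
    except ValueError:
        return candidate
    return PLACEHOLDER


def redact_line(line):
    """Return the line with IPv4 and IPv6 addresses replaced by PLACEHOLDER."""
    line = IPV4_RE.sub(PLACEHOLDER, line)
    return IPV6_RE.sub(_scrub_ipv6, line)


def redact_stream(src, dst):
    """Copy text lines from src to dst, redacting each one on the way."""
    for line in src:
        dst.write(redact_line(line))
```

A filter like this could sit between the process output and the log file, for example by calling `redact_stream(sys.stdin, sys.stdout)` in a small wrapper script. Redacting inside moat itself, before anything is written, would be the more robust option, since no unredacted copy would ever exist.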

The relevant IRC log is as follows:

<n8fr8[m]> not sure where the right channel to report this, but the moat service endpoint seems to be down. can't request a bridge in TB, Orbot, OnionBrowser, etc.
<lavamind> n8fr8[m]: thanks for the heads up, usually a good place to start is #tor of forum.torproject.net
<lavamind> or*
<n8fr8[m]> well... a key part of our anti-censorship services for users in places like Russia where tor is blocked is down, and you want me to post it to the general tor forum on a message board? I suppose I will send an email to the right people instead
<lavamind> well I know there was an upgrade to bridgedb done this week, so I've poked a member of the anticensorship team responsible for it
<lavamind> I've looked at the system and couldn't find anything obviously wrong
<n8fr8[m]> thanks for checking!
<n8fr8[m]> could be something with fastly front domain, hopefully temporary
<n8fr8[m]> I am getting a TLS cert error on https://moat.torproject.org.global.prod.fastly.net/
<lavamind> oh nice catch
<lavamind> well we definitely dont generate those certificates :p
<lavamind> so yeah, maybe an issue on Fastly's end
<+armadev> n8fr8[m]: i think i might have fixed moat
<+armadev> if somebody wants to verify, please do :)
<shelikhoo> I think moat is recovered.... how did armadev fix it?
<+armadev> shelikhoo: the moat fix was 'sudo -u moat /srv/bridges.torproject.org/bin/run-moat-shim'
<+armadev> as specified in https://gitlab.torproject.org/tpo/anti-censorship/team/-/wikis/Survival-Guides/Moat-Survival-Guide
<+armadev> (and i had said this in #tpo-admin but i guess not in more detail here too)
<+armadev> shelikhoo: the bigger issue though is: that moat-shim thing has died at least twice now, mysteriously, for no reason
<+armadev> so somebody should debug it, set up monitoring for when it disappears, etc
<shelikhoow> Yes, so I think a systemd based solution that restart it automatically will work.... 
<+armadev> can you open a ticket somewhere?
<shelikhoow> Yes, consider it done. 
<+armadev> i agree that using systemd to auto restart it is a good first step
<+armadev> there is definitely also a part of me that wants to know why it exits though :)
<+armadev> thanks!
<shelikhoo> That means we need to investigate the log....
<+armadev> yep. the log is full of tls complaints and ip addresses,
<+armadev> but i think the log only goes to stdout
<+armadev> i.e. when you log out from running run-moat-shim, the log now goes nowhere
<shelikhoo> Yes... If we setup a systemd based deployment, we can configure it to store the log from stdout,err
<+armadev> great. so, step one, switch to systemd and start logging stuff somewhere,
<shelikhoo> Systemd have its issues, but it is quite convenient 
<shelikhoo> Yes.
<+armadev> then step two, when it dies, see if it said anything useful
<shelikhoo> The effort to do so will be tracked in the issue. 
<shelikhoo> Yes
<+armadev> step one-point-five, notice that the log has a bunch of ip addresses in it and wonder if that is urgent enough to fix
<shelikhoo> Yes. I don't think it is urgent but we should fix it.... 
<+armadev> great
Edited by shelikhoo