this is a larger issue than just moly. I've filed the following ticket ([TC-Support #212964]) with cymru (upstream):
Hi!
Since around 2019-12-15 23:55:01 UTC, we have started seeing some weird
networking issues with moly.torproject.org. I can reach the machine okay
and it pings properly, but some TCP connexions do not work
correctly. For example, this works:
.. but it also affects a DNS server (fallax), a build box and a web
mirror. It would be great if you could look into this promptly because
it's a bit of a show stopper for us.
Thanks!
a.
--
Antoine Beaupré
torproject.org system administration
I'll try to do more network diagnostics after lunch, in the hope that this can be resolved in the short term. But we have started a mitigation strategy that involves restoring majus from backups.
12:03:19 <+anarcat> this is the last backup that ran:
12:03:22 <+anarcat> +---------+-------+----------+---------------+---------------------+-------------------------------------------------------+
12:03:22 <+anarcat> | jobid   | level | jobfiles | jobbytes      | starttime           | volumename                                            |
12:03:22 <+anarcat> +---------+-------+----------+---------------+---------------------+-------------------------------------------------------+
12:03:26 <+anarcat> | 118,333 | I     | 1,510    | 42,379,972    | 2019-12-15 10:27:15 | torproject-majus.torproject.org-inc.2019-12-15_10:27  |
12:03:26 <+anarcat> +---------+-------+----------+---------------+---------------------+-------------------------------------------------------+
We've been meaning to move majus to the ganeti cluster (#31784, as part of #29974), exactly for this kind of scenario. Thankfully, we migrated the director and getulum already, so this problem is not as bad as it would have been.
But it will still take some time (days?) to restore the service, if Cymru doesn't figure it out in time.
Sorry about this trouble everyone! Hopefully we'll be able to get back on track soon... I'm just happy it's happening this week, instead of during the holidays. :p
Trac:
Summary: "majus cannot connect to internet with git or the transifex client (although it pings ok)" to "major networking issues on moly, affects: majus, fallax, web-cymru-01, build-x86-05, build-x86-06"
Priority: High to Very High
Severity: Normal to Major
cymru responded and explained there was a networking change on their end that changed the MTU, as I had suspected. at 20:40 UTC, they said they had implemented a workaround, but 4 minutes later, at 20:44 UTC, they said the investigation was still ongoing.
and now the entire cymru network is unreachable, at least as seen from nagios:
15:42:40 <nsa> tor-nagios: [mini-nag/auto-dns] moly.torproject.org is considered BAD (ping-check (50.00%)) @ 20:42:31 +0000.
15:46:50 <nsa> tor-nagios: [gw-cymru] gw-cymru is DOWN: Date/Time: Mon Dec 16 20:46:39 UTC 2019
above times UTC-5. so i'm not sure what's up with cymru, but clearly there's a huge outage going on there.
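To line up the IRC timestamps with the UTC times in the nagios messages, the offset can be applied like this. This is only a small sketch: the helper name `irc_to_utc` is made up for illustration, and the date is assumed to be 2019-12-16, taken from the nagios message itself.

```python
from datetime import datetime, timedelta, timezone

# The IRC log timestamps are in UTC-5, as noted in the comment above.
IRC_TZ = timezone(timedelta(hours=-5))


def irc_to_utc(stamp, day):
    """Convert an HH:MM:SS IRC timestamp on the given day (YYYY-MM-DD) to UTC.

    Hypothetical helper: the IRC log carries no date, so the caller supplies it.
    """
    local = datetime.strptime(f"{day} {stamp}", "%Y-%m-%d %H:%M:%S")
    return local.replace(tzinfo=IRC_TZ).astimezone(timezone.utc)


if __name__ == "__main__":
    # The gw-cymru DOWN notice was relayed to IRC at 15:46:50 local time.
    print(irc_to_utc("15:46:50", "2019-12-16").strftime("%H:%M:%S UTC"))
```

The result, 20:46:50 UTC, falls a few seconds after the 20:46:39 UTC stamp inside the nagios notice, which is what one would expect for a relayed alert.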
now i'm really happy to have moved those other services off of that box...
network has returned, but the MTU problem remains.
i've temporarily lowered the MTU to 1474 after bisecting to see what actually works, on both moly and majus.
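For reference, the bisection can be sketched roughly as below. This is not the exact procedure used here: the function names are hypothetical, and probing with `ping` with the don't-fragment bit set (Linux iputils `-M do`) is an assumption, as is the 28-byte IPv4 + ICMP header overhead being the only thing to subtract from the MTU.

```python
import subprocess

IPV4_ICMP_OVERHEAD = 28  # 20-byte IPv4 header + 8-byte ICMP header


def largest_working_mtu(works, low, high):
    """Return the largest MTU in [low, high] for which works(mtu) is true.

    Assumes works(low) is true and that failures are monotonic: once an
    MTU fails, every larger MTU fails as well.
    """
    if not works(low):
        raise ValueError(f"even the lower bound {low} does not work")
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always shrinks
        if works(mid):
            low = mid       # mid passes: the answer is mid or larger
        else:
            high = mid - 1  # mid fails: the answer is smaller than mid
    return low


def ping_fits(host, mtu):
    """Probe one MTU with a single non-fragmentable ping (assumed Linux iputils)."""
    payload = mtu - IPV4_ICMP_OVERHEAD
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", "-M", "do", "-s", str(payload), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


if __name__ == "__main__":
    # Simulated path that drops anything above 1474 bytes, so no network is needed.
    print(largest_working_mtu(lambda mtu: mtu <= 1474, 576, 1500))
```

Against a real host, the predicate would be something like `lambda mtu: ping_fits("moly.torproject.org", mtu)`. A single lost ping would skew the search, so a real run would probably want to retry each probe a few times.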
i've turned the other VMs back off after realizing they would all need the same MTU tweak, and I don't have to do that right now.
@emmapeel: majus should be back online now, for what that's worth. please consider speeding up the cleanup work on that box or whatever is required to migrate over to the new cluster, now that we have access again.
backups should again run tonight and get us a fresh copy.
I may need a little help on this. Where are the backups, or where can I see what is being backed up, to spot problems? Or shall I make a list of what I think should be moved?
Also, are we going to have some puppet thing running the new server?
In the meantime, the actual underlying problem was fixed here. Cymru were making changes to the network this weekend, and they somewhat screwed up a transition. This was resolved last night and today I restored the MTUs on the various machines back to normal.
Everything should be back to normal again today (and in fact was operational last night, UTC-5, when I commented on this issue).
Please do reopen this ticket if you see further networking issues.
Thanks for the report!
Trac:
Status: accepted to closed
Resolution: N/A to fixed