i believe that should also involve moving the crm-ext-01 machine, since it's closely related.
@lavamind do you think we have everything ready in Puppet to enable migrations between gnt-fsn and gnt-dal? would you be interested in performing such migration, to see if my documentation works okay?
ETA i gave @mathieu in the other ticket is "one-two weeks".
@lavamind do you think we have everything ready in Puppet to enable migrations between gnt-fsn and gnt-dal? would you be interested in performing such migration, to see if my documentation works okay?
I'm happy to take this on. Based on the Backlog label, I'd plan this for next week or the week after. I think we have everything in Puppet ready, minus the firewall rules.
We could deploy a rule that pokes holes permanently between the ganeti masters, or just add it as a preparation step in the migration docs. I'd prefer the latter: it helps keep the Puppet complexity down, and I don't think we do this often enough that automation is a high priority here.
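As a sketch, the preparation step in the docs could look something like this (the hostname and the blanket TCP accept are placeholders I made up, not something from our Puppet code):

```shell
# Hypothetical prep step for the migration docs; hostname and rule
# are placeholders, to be adapted to the real cluster configuration.

# disable puppet on both masters first, so a run doesn't let ferm
# wipe the temporary rule mid-transfer
puppet agent --disable "crm migration in progress"

# on the destination master: accept transfer connections from the source
iptables -I INPUT -s gnt-fsn-master.torproject.org -p tcp -j ACCEPT
```

The teardown step would then be re-enabling puppet and letting ferm restore the normal ruleset.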
In my experience, though, transfers are slow enough that it can
definitely happen that something else kicks ferm and trashes the rules,
breaking the transfer.
It could have been because I was doing multiple transfers and that one
new transfer was changing the firewall rules, and then Puppet went
around trashing the firewall. Maybe for one-time changes we won't have
this problem?
Worth a try, let's try without adding the firewall rules in Puppet for
now.
...
On 2023-03-27 21:09:29, Jérôme Charaoui (@lavamind) wrote:
We could deploy a rule that pokes holes permanently between the ganeti masters, or just add it as a preparation step in the migration docs. I'd prefer the latter: it helps keep the Puppet complexity down, and I don't think we do this often enough that automation is a high priority here.
@anarcat Do you have an estimate for how long the migration might take? I've tried looking through the mass-migrate ticket but couldn't find conclusive info about that. From one of the logs I saw 60MB/s, is that about right? If so that would put us just under one hour to migrate crm-int-01.
I feel we may want to plan a maintenance window for this with some warning on the donate page that donations will be offline for a period of time. @smith do you think that's a good idea?
@anarcat Do you have an estimate for how long the migration might take? I've tried looking through the mass-migrate ticket but couldn't find conclusive info about that. From one of the logs I saw 60MB/s, is that about right? If so that would put us just under one hour to migrate crm-int-01.
that sounds about right, for the raw transfer speed. you'd also need to
migrate crm-ext and do some IP changes, which is what takes the longest really...
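for the record, the back-of-the-envelope math behind that estimate (the ~200GB disk size is an assumption on my part; the 60MB/s figure is from the logs):

```shell
disk_gb=200      # assumed instance disk size, not a measured value
rate_mb_s=60     # transfer rate observed in the mass-migrate logs
seconds=$(( disk_gb * 1024 / rate_mb_s ))
echo "$(( seconds / 60 )) minutes"   # prints: 56 minutes
```

so yeah, just under an hour for the raw copy, plus the renumbering on top.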
I feel we may want to plan a maintenance window for this with some warning on the donate page that donations will be offline for a period of time. @smith do you think that's a good idea?
yeah, that seems like a wise choice.
...
On 2023-04-18 18:45:52, Jérôme Charaoui (@lavamind) wrote:
--
Antoine Beaupré
torproject.org system administration
The problem with the generic error is that it's no different from entering the wrong URL in the browser bar, so I think what @kez is proposing here will be useful so people wanting to donate understand why they can't at the moment, and are reassured that they will be able to, later on.
lavamind explained it perfectly. we're telling the user "hey, the donate site is still here, it's going to be working again Soon™, try again later" instead of an unhelpful "you went to the wrong place", or "something went wrong with the gateway".
especially since the frontend is not served by the middleware: if we shut down the middleware but the frontend is still being served by the static mirror system, users might (not entirely sure) be able to get far enough in the donation process to be charged, but not far enough that we receive their donation in civi
@mattlav I want to plan a 3-hour maintenance window on donate.tpo to complete the migration. Do you have any stats that might inform us of the best time to proceed? The plan would be to post something to status.tpo and flip a new maintenanceMode switch on the front-end, while the back-end is migrating.
Without taking a bunch of time to analyze donation stats, the answer is that there are about as many users east of our time zone as west, and they use CiviCRM when they're awake, so the time when the fewest users will be inconvenienced is basically the most inconvenient time that you can stand to do it. Midnight to 3 AM on a Sunday? It might be helpful to put a notice up / temporarily replace the donate page altogether.
After discussion with @mattlav over IRC we figured out midnight to 3 AM would be somewhat painful, and that 10 PM to 1 AM would also do. So, let's plan the maintenance window for Sunday 10 PM to Monday 1 AM.
Before:
Look for any hard-coded IPs in donate code
Review cross-cluster transfer procedure
Post message on status.tpo
During:
Toggle maintenance mode on frontend
Suspend Puppet on origin and destination clusters
Deploy required firewall rules on origin and destination nodes
Transfer backend, fixup IPs
Transfer frontend, fixup IPs
Fix backend IPSec tunnel IPs
Add NVMe volume on backend instance and move MySQL database to it
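For the transfer steps themselves, I'd expect we use ganeti's cross-cluster `move-instance` tool; a rough sketch, with the tool path and options from memory (to be double-checked against the actual docs before the window):

```shell
# sketch only: run from a host with RAPI access to both clusters,
# after puppet is disabled and the temporary firewall rules are in place
/usr/lib/ganeti/tools/move-instance --verbose \
    gnt-fsn.torproject.org gnt-dal.torproject.org \
    crm-int-01.torproject.org

# the renumbering afterwards is manual: update the instance's IPv4/IPv6
# addresses, then fix the IPsec tunnel endpoints between crm-int-01 and
# crm-ext-01
```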
Just making a note here that as I prepare to shut down crm-int-01 for the migration, I notice there's a mariadb thread that is consuming a full vCPU running some absolutely dreadful caching query with over a dozen subqueries... I've taken note of it.
Alright, so that took longer than expected because of the DRBD sync "pre-step", but crm-int-01 is now moved to the new cluster and renumbered (both IPv4 and IPv6). I've made sure the IPsec link with crm-ext-01 is up and working. Newsletter subscription and donate staging are both working.
Given the time, I'm going to defer any further changes for now. The MySQL volume change and frontend move may come at a later time.
For what it's worth, even after migrating, MariaDB is still churning along one full vCPU at 100% on some weird, complicated queries that all start with `INSERT IGNORE INTO civicrm_tmp_e_gccache_<UUID> (group_id, contact_id)`.
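To keep an eye on that thread, something like this against the MariaDB processlist should surface it (the 60-second threshold is arbitrary):

```shell
# list active queries running longer than 60 seconds, longest first
mysql -e "SELECT id, time, LEFT(info, 100) AS query
          FROM information_schema.PROCESSLIST
          WHERE command <> 'Sleep' AND time > 60
          ORDER BY time DESC"
```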
This issue has been waiting for information for two weeks or more. It needs attention. Please take care of this before the end of 2023-06-06. ~"Needs Information" tickets will be moved to the Icebox after that point.

(Any ticket left in Needs Review, Needs Information, Next, or Doing without activity for 14 days gets such notifications. Make a comment describing the current state of this ticket and remove the Stale label to fix this.)
So, this work didn't happen last night because I was just too tired at the end of the day. I could try to give it a go tonight if it's OK with you @mattlav, otherwise, we can punt it again to next Sunday night.
I'd say go ahead tonight or even today during the day.
...
On 2023-05-29 17:23:59, Jérôme Charaoui (@lavamind) wrote:
So, this work didn't happen last night because I was just too tired at the end of the day. I could try to give it a go tonight if it's OK with you @mattlav, otherwise, we can punt it again to next Sunday night.
Finished transferring and renumbering crm-ext-01. The IPsec tunnel is up and running, and donate and newsletter both seem to work. Donate is back online; I marked the status site entry as resolved.
Add NVMe volume on backend instance and move MySQL database to it
Note that I didn't implement this change, since there seem to have been no I/O issues whatsoever since the migration, and also since we fixed the CiviCRM smart groups issue that was generating impossible SQL queries.
I'm not sure if this is related but the timing seems like it is: last night I got a copy of the Civimail newsletter draft Pavel had been trying to send me last week, without it ever getting delivered. Maybe not important to report, but I figured it was better to share anyway.
Thanks for reporting this. If you and Pavel experience issues with mail delivery, do open a ticket; it should not take more than a few hours at most for an email to be delivered.