i believe that should also involve moving the crm-ext-01 machine, since it's closely related.
@lavamind do you think we have everything ready in Puppet to enable migrations between gnt-fsn and gnt-dal? would you be interested in performing such migration, to see if my documentation works okay?
ETA i gave @mathieu in the other ticket is "one-two weeks".
@lavamind do you think we have everything ready in Puppet to enable migrations between gnt-fsn and gnt-dal? would you be interested in performing such migration, to see if my documentation works okay?
I'm happy to take this on. Based on the Backlog label, I'd plan this for next week or the week after. I think we have everything in Puppet ready, minus the firewall rules.
We could deploy a rule that pokes holes permanently between the ganeti masters, or just add it as a preparation step in the migration docs. I'd prefer the latter: it helps keep the Puppet complexity down, and I don't think we do this often enough that automation is a high priority here.
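As a sketch, the preparation step in the docs could look something like this (the hostname and the blanket TCP accept are placeholders I made up, not something from our Puppet code):

```shell
# Hypothetical prep step for the migration docs; hostname and rule
# are placeholders, to be adapted to the real cluster configuration.

# disable puppet on both masters first, so a run doesn't let ferm
# wipe the temporary rule mid-transfer
puppet agent --disable "crm migration in progress"

# on the destination master: accept transfer connections from the source
iptables -I INPUT -s gnt-fsn-master.torproject.org -p tcp -j ACCEPT
```

The teardown step would then be re-enabling puppet and letting ferm restore the normal ruleset.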
In my experience, though, transfers are slow enough that it can
definitely happen that something else kicks ferm and trashes the rules,
breaking the transfer.
It could have been because I was doing multiple transfers and that one
new transfer was changing the firewall rules, and then Puppet went
around trashing the firewall. Maybe for one-time changes we won't have
this problem?
Worth a try, let's try without adding the firewall rules in Puppet for
now.
...
On 2023-03-27 21:09:29, Jérôme Charaoui (@lavamind) wrote:
We could deploy a rule that pokes holes permanently between the ganeti masters, or just add it as a preparation step in the migration docs. I'd prefer the latter: it helps keep the Puppet complexity down, and I don't think we do this often enough that automation is a high priority here.
@anarcat Do you have an estimate for how long the migration might take? I've tried looking through the mass-migrate ticket but couldn't find conclusive info about that. From one of the logs I saw 60MB/s, is that about right? If so that would put us just under one hour to migrate crm-int-01.
I feel we may want to plan a maintenance window for this with some warning on the donate page that donations will be offline for a period of time. @smith do you think that's a good idea?
@anarcat Do you have an estimate for how long the migration might take? I've tried looking through the mass-migrate ticket but couldn't find conclusive info about that. From one of the logs I saw 60MB/s, is that about right? If so that would put us just under one hour to migrate crm-int-01.
that sounds about right, for the raw transfer speed. you'd also need to
migrate crm-ext and do some IP changes, which is what takes the longest really...
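for the record, the back-of-the-envelope math behind that estimate (the ~200GB disk size is an assumption on my part; the 60MB/s figure is from the logs):

```shell
disk_gb=200      # assumed instance disk size, not a measured value
rate_mb_s=60     # transfer rate observed in the mass-migrate logs
seconds=$(( disk_gb * 1024 / rate_mb_s ))
echo "$(( seconds / 60 )) minutes"   # prints: 56 minutes
```

so yeah, just under an hour for the raw copy, plus the renumbering on top.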
I feel we may want to plan a maintenance window for this with some warning on the donate page that donations will be offline for a period of time. @smith do you think that's a good idea?
yeah, that seems like a wise choice.
...
On 2023-04-18 18:45:52, Jérôme Charaoui (@lavamind) wrote:
--
Antoine Beaupré
torproject.org system administration
The problem with the generic error is that it's no different from entering the wrong URL in the browser bar, so I think what @kez is proposing here will be useful so people wanting to donate understand why they can't at the moment, and are reassured that they will be able to, later on.
lavamind explained it perfectly. we're telling the user "hey, the donate site is still here, it's going to be working again Soon™, try again later" instead of an unhelpful "you went to the wrong place", or "something went wrong with the gateway".
especially since the frontend is not served by the middleware: if we shut down the middleware but the frontend is still being served by the static mirror system, users might (not entirely sure) be able to get far enough in the donation process to be charged, but not far enough that we receive their donation in civi
@mattlav I want to plan a 3-hour maintenance window on donate.tpo to complete the migration. Do you have any stats that might inform us of the best time to proceed? The plan would be to post something to status.tpo and flip a new maintenanceMode switch on the front-end, while the back-end is migrating.
Without taking a bunch of time to analyze donation stats, the answer is that there are about as many users east of our time zone as west, and they use CiviCRM when they're awake, so the time when the fewest users will be inconvenienced is basically the most inconvenient time that you can stand to do it. Midnight to 3 AM on a Sunday? It might be helpful to put a notice up / temporarily replace the donate page altogether.
After discussion with @mattlav over IRC we figured out midnight to 3 AM would be somewhat painful, and that 10 PM to 1 AM would also do. So, let's plan the maintenance window for Sunday 10 PM to Monday 1 AM.
Before:
Look for any hard-coded IPs in donate code
Review cross-cluster transfer procedure
Post message on status.tpo
During:
Toggle maintenance mode on frontend
Suspend Puppet on origin and destination clusters
Deploy required firewall rules on origin and destination nodes
Transfer backend, fixup IPs
Transfer frontend, fixup IPs
Fix backend IPSec tunnel IPs
Add NVMe volume on backend instance and move MySQL database to it
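For the transfer steps themselves, I'd expect we use ganeti's cross-cluster `move-instance` tool; a rough sketch, with the tool path and options from memory (to be double-checked against the actual docs before the window):

```shell
# sketch only: run from a host with RAPI access to both clusters,
# after puppet is disabled and the temporary firewall rules are in place
/usr/lib/ganeti/tools/move-instance --verbose \
    gnt-fsn.torproject.org gnt-dal.torproject.org \
    crm-int-01.torproject.org

# the renumbering afterwards is manual: update the instance's IPv4/IPv6
# addresses, then fix the IPsec tunnel endpoints between crm-int-01 and
# crm-ext-01
```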
Just making a note here that as I prepare to shut down crm-int-01 for the migration, I notice there's a mariadb thread that is consuming a full vCPU running some absolutely dreadful caching query with over a dozen subqueries... I've taken note of it.
Alright, so that took longer than expected because of the DRBD sync "pre-step", but crm-int-01 is now moved to the new cluster and renumbered (both IPv4 and IPv6). I've made sure the IPsec link with crm-ext-01 is up and working. Newsletter subscription and donate staging are both working.
Given the time, I'm going to defer any further changes for now. The MySQL volume change and frontend move may come at a later time.
For what it's worth, even after migrating, MariaDB is still churning along one full vCPU at 100% on some weird, complicated queries that all start with `INSERT IGNORE INTO civicrm_tmp_e_gccache_<UUID> (group_id, contact_id)`.
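To keep an eye on that thread, something like this against the MariaDB processlist should surface it (the 60-second threshold is arbitrary):

```shell
# list active queries running longer than 60 seconds, longest first
mysql -e "SELECT id, time, LEFT(info, 100) AS query
          FROM information_schema.PROCESSLIST
          WHERE command <> 'Sleep' AND time > 60
          ORDER BY time DESC"
```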
This issue has been waiting for information for two weeks or more. It needs attention. Please take care of this before the end of 2023-06-06. ~"Needs Information" tickets will be moved to the Icebox after that point.

(Any ticket left in Needs Review, Needs Information, Next, or Doing without activity for 14 days gets such notifications. Make a comment describing the current state of this ticket and remove the Stale label to fix this.)
So, this work didn't happen last night because I was just too tired at the end of the day. I could try to give it a go tonight if it's OK with you @mattlav, otherwise, we can punt it again to next Sunday night.
I'd say go ahead tonight or even today during the day.
...
On 2023-05-29 17:23:59, Jérôme Charaoui (@lavamind) wrote:
So, this work didn't happen last night because I was just too tired at the end of the day. I could try to give it a go tonight if it's OK with you @mattlav, otherwise, we can punt it again to next Sunday night.
Finished transferring and renumbering crm-ext-01. The IPsec tunnel is up and running, and donate and newsletter both seem to work. Donate is back online; I marked the status site entry as resolved.
Add NVMe volume on backend instance and move MySQL database to it
Note that I didn't implement this change, since there seem to have been no I/O issues whatsoever since the migration, and also since we fixed the CiviCRM smart groups issue that was generating impossible SQL queries.
I'm not sure if this is related but the timing seems like it is: last night I got a copy of the Civimail newsletter draft Pavel had been trying to send me last week, without it ever getting delivered. Maybe not important to report, but I figured it was better to share anyway.
Thanks for reporting this. If you and Pavel experience issues with mail delivery, do open a ticket; it should not take more than a few hours at most for an email to be delivered.