|
|
|
|
|
/var/opt/gitlab/git-data/repositories/@hashed/86/bc/86bc00bf176c8b99e9cbdd89afdd2492de002c1dcce63606f711e0c04203c4da.git
|
|
|
|
|
|
or, on `gitaly-01`:
|
|
|
|
|
|
/home/git/repositories/@hashed/86/bc/86bc00bf176c8b99e9cbdd89afdd2492de002c1dcce63606f711e0c04203c4da.git
|
|
|
|
|
|
### Finding objects common to forks
|
|
|
|
|
|
Note that forks are "special" in the sense that they store some of
|
|
|
their objects outside of their repository. For example, the
|
|
|
[ahf/arti](https://gitlab.torproject.org/ahf/arti) fork (project ID 744) is in:
|
root@gitlab-02:~# du -sh /var/opt/gitlab/git-data/repositories/@pools/ef/2d/ef2d127de37b942baad06145e54b0c619a1f22327b2ebbcfbec78f5564afe39d.git/objects
|
|
6.1G /var/opt/gitlab/git-data/repositories/@pools/ef/2d/ef2d127de37b942baad06145e54b0c619a1f22327b2ebbcfbec78f5564afe39d.git/objects
|
|
|
```
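To confirm that a fork is actually linked to an object pool, you can
look at its Git "alternates" file, which is the mechanism used to
share objects with the pool. A minimal sketch, with a hypothetical
hashed path (substitute the fork's actual hash):

```
# the @hashed path below is a placeholder, use the fork's real hashed path
cat /var/opt/gitlab/git-data/repositories/@hashed/xx/yy/<hash>.git/objects/info/alternates
# this should print a path (often relative) pointing at the
# corresponding @pools .../objects directory
```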
|
|
|
|
|
|
### Finding the right Gitaly server
|
|
|
|
|
|
Repositories are stored on a Gitaly server, which is currently
|
|
|
`gitaly-01.torproject.org` (but could also be on `gitlab-02` or
|
|
|
another `gitaly-NN` server). So typically, just look on
|
|
|
`gitaly-01`. But if you're unsure, to find which server a repository
|
|
|
is on, use the [get a single project API endpoint](https://docs.gitlab.com/api/projects/#get-a-single-project):
|
|
|
|
|
|
curl"https://gitlab.torproject.org/api/v4/projects/647" | jq .repository_storage
|
|
|
|
|
|
The convention is that `storage1` is `gitaly-01`, `storage2` would be
`gitaly-02`, and so on; the exception is `gitlab-02`, whose storage is
currently named `default`.
|
|
|
|
|
|
## Find the project associated with a project ID
|
|
|
|
|
|
Sometimes you'll find a numeric project ID instead of a human-readable
|
for i in range(2000):
    # hashed storage paths are the SHA-256 of the decimal project ID
    h = hashlib.sha256(str(i).encode("ascii"))
    print(i, h.hexdigest())
|
|
|
```
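You can also go the other way and check a single candidate directly,
since the hashed storage path is the SHA-256 of the decimal project
ID; for example, for project 744 (the `ahf/arti` fork mentioned
above):

```
python3 -c 'import hashlib; print(hashlib.sha256(b"744").hexdigest())'
```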
|
|
|
|
|
|
|
|
|
## Moving projects between Gitaly servers
|
|
|
|
|
|
If there are multiple Gitaly servers (and there currently aren't:
|
|
|
there's only one, named `gitaly-01`), you can *move* repositories
|
|
|
between Gitaly servers through the GitLab API.
|
|
|
|
|
|
They call this [project repository storage moves](https://docs.gitlab.com/api/project_repository_storage_moves/); see also the
|
|
|
[moving repositories](https://docs.gitlab.com/administration/operations/moving_repositories/) documentation. You can move individual
|
|
|
groups, snippets or projects, or *all* of them.
|
|
|
|
|
|
### Moving one project at a time
|
|
|
|
|
|
This procedure only concerns moving a *single* repository. Do NOT use
|
|
|
the batch-migration API that migrates all repositories unless you know
|
|
|
what you're doing (see below).
|
|
|
|
|
|
The GitLab API call is simple: send a `POST` to
[`/project/:project_id/repository_storage_moves`](https://docs.gitlab.com/api/project_repository_storage_moves/#schedule-a-repository-storage-move-for-a-project). For example,
assuming you have a GitLab admin personal access token in
`$PRIVATE_TOKEN`:
|
|
|
|
|
|
curl -X POST -H "PRIVATE-TOKEN: $private_token" -H "Content-Type: application/json" --data '{"destination_storage_name":"storage1"}' --url "https://gitlab.torproject.org/api/v4/projects/1600/repository_storage_moves"
|
|
|
|
|
|
This returns a JSON object with an `id` that is the unique identifier
|
|
|
for this move. You can see the status of the transfer by polling the
|
|
|
`project_repository_storage_moves` endpoint; for example, for a while
|
|
|
we were doing this:
|
|
|
|
|
|
watch -d -c 'curl -s -X GET -H "PRIVATE-TOKEN: $PRIVATE_TOKEN" --url "https://gitlab.torproject.org/api/v4/project_repository_storage_moves" | jq -C . '
|
|
|
|
|
|
Then you need to wait for the transfer to complete and, ideally, run
|
|
|
housekeeping to deduplicate objects.
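Housekeeping on a single project can be triggered through the
[project housekeeping API endpoint](https://docs.gitlab.com/api/projects/#start-the-housekeeping-task-for-a-project); a sketch, reusing project 1600
from the example above:

```
curl -X POST -H "PRIVATE-TOKEN: $PRIVATE_TOKEN" \
     --url "https://gitlab.torproject.org/api/v4/projects/1600/housekeeping"
```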
|
|
|
|
|
|
There is a Fabric task named `gitlab.move-repo` that does all of this
|
|
|
at once. Here's an example run:
|
|
|
|
|
|
```
|
|
|
anarcat@angela:fabric-tasks$ fab gitlab.move-repo --dest-storage=default --project=3466
|
|
|
INFO: Successfully connected to https://gitlab.torproject.org
|
|
|
move repository tpo/anti-censorship/connectivity-measurement/uget (3466) from storage1 to default? [Y/n]
|
|
|
INFO: waiting for repository move 3758 to complete
|
|
|
INFO: Successfully connected to https://gitlab.torproject.org
|
|
|
INFO: going to try 15 times over 2 hours
|
|
|
INFO: move completed with status finished
|
|
|
INFO: starting housekeeping task...
|
|
|
```
|
|
|
|
|
|
See also the [underlying design of repository moves](https://docs.gitlab.com/development/repository_storage_moves/).
|
|
|
|
|
|
But you would likely prefer batch moves instead; see below.
|
|
|
|
|
|
### Moving all repositories with `rsync`
|
|
|
|
|
|
Repositories can be more usefully moved in *batches*. Typically, this
|
|
|
occurs in a disaster recovery situation, when you need to evacuate a
|
|
|
Gitaly server in favor of another one.
|
|
|
|
|
|
We are *not* going to use the API for this, although that procedure
|
|
|
(and its caveats) is documented further down.
|
|
|
|
|
|
Note that this procedure uses `rsync`, which upstream warns against in
|
|
|
their [official documentation](https://docs.gitlab.com/administration/operations/moving_repositories/#the-target-directory-contains-an-outdated-copy-of-the-repositories-use-rsync) ([gitlab-org/gitlab#270422](https://gitlab.com/gitlab-org/gitlab/-/issues/270422)) but
|
|
|
we believe this procedure is sufficiently safe in a disaster recovery
|
|
|
scenario or with a maintenance window planned.
|
|
|
|
|
|
This procedure is also untested; it's an expanded version of the
upstream docs. One unclear part of the upstream procedure is how to
handle the leftover repositories on the original server: presumably
they can either be deleted or left in place, but this hasn't been
confirmed.
|
|
|
|
|
|
Let's say, for example, you're migrating from `gitaly-01` to
`gitaly-03`, assuming the `gitaly-03` server has been installed
properly and has a weight of zero (so no new repositories are created
there yet).
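Storage weights live in the GitLab application settings
(`repository_storages_weighted`) and can be inspected or changed
through the [application settings API](https://docs.gitlab.com/api/settings/). A sketch; the `storage3` name
for `gitaly-03` follows the convention above but is hypothetical:

```
# show the current weights
curl -s -H "PRIVATE-TOKEN: $PRIVATE_TOKEN" \
     --url "https://gitlab.torproject.org/api/v4/application/settings" \
     | jq .repository_storages_weighted

# hypothetical: keep new repositories away from storage3 for now
curl -s -X PUT -H "PRIVATE-TOKEN: $PRIVATE_TOKEN" -H 'Content-Type: application/json' \
     --data '{"repository_storages_weighted": {"storage1": 100, "storage3": 0}}' \
     --url "https://gitlab.torproject.org/api/v4/application/settings" \
     | jq .repository_storages_weighted
```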
|
|
|
|
|
|
1. analyze how much disk space is used by various components on each
|
|
|
end:
|
|
|
|
|
|
du -sch /home/git/repositories/* | sort -h
|
|
|
|
|
|
For example:
|
|
|
|
|
|
root@gitaly-01:~# du -sch /home/git/repositories/* | sort -h
|
|
|
704K /home/git/repositories/+gitaly
|
|
|
1.2M /home/git/repositories/@groups
|
|
|
17M /home/git/repositories/@snippets
|
|
|
35G /home/git/repositories/@pools
|
|
|
98G /home/git/repositories/@hashed
|
|
|
132G total
|
|
|
|
|
|
Keep a copy of this to give you a rough idea that all the data was
|
|
|
transferred correctly. Using Prometheus metrics is also acceptable
|
|
|
here.
|
|
|
|
|
|
2. do a first `rsync` pass between the two servers to copy the bulk of
|
|
|
the data, even if it's inconsistent:
|
|
|
|
|
|
sudo -u git rsync -a /home/git/repositories/ git@gitaly-03:/var/opt/gitlab/git-data/repositories/
|
|
|
|
|
|
Notice the different paths here
(`/var/opt/gitlab/git-data/repositories/` vs
`/home/git/repositories`). Those may differ according to how the
server was set up: on `gitaly-01`, a standalone Gitaly server, it's
the latter, while on `gitlab-02` it's the former because it's an
Omnibus install.
|
|
|
|
|
|
3. set the server in [maintenance mode](https://docs.gitlab.com/administration/maintenance_mode/) or at least [set repositories read-only](https://docs.gitlab.com/administration/read_only_gitlab/); see the API sketch after this list.
|
|
|
|
|
|
4. rerun the synchronization:
|
|
|
|
|
|
sudo -u git rsync -a --delete /home/git/repositories/ git@gitaly-03:/var/opt/gitlab/git-data/repositories/
|
|
|
|
|
|
Note that this is destructive! DO NOT MIX UP THE SOURCE AND
|
|
|
TARGET HERE!
|
|
|
|
|
|
5. reverse the weights: mark `gitaly-01` as weight 0 and `gitaly-03`
|
|
|
as 100.
|
|
|
|
|
|
6. disable Gitaly on the original server (e.g. `gitaly['enable'] =
|
|
|
false` in omnibus)
|
|
|
|
|
|
7. turn off maintenance or read-only mode
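This is the API sketch referenced in step 3: maintenance mode can be
toggled through the same application settings API (`maintenance_mode`
is a standard setting, the banner message here is illustrative):

```
# enable maintenance mode with a banner message (step 3)
curl -s -X PUT -H "PRIVATE-TOKEN: $PRIVATE_TOKEN" -H 'Content-Type: application/json' \
     --data '{"maintenance_mode": true, "maintenance_mode_message": "Gitaly migration in progress"}' \
     --url "https://gitlab.torproject.org/api/v4/application/settings" | jq .maintenance_mode

# disable it again when done (step 7)
curl -s -X PUT -H "PRIVATE-TOKEN: $PRIVATE_TOKEN" -H 'Content-Type: application/json' \
     --data '{"maintenance_mode": false}' \
     --url "https://gitlab.torproject.org/api/v4/application/settings" | jq .maintenance_mode
```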
|
|
|
|
|
|
### Batch project migrations
|
|
|
|
|
|
It is *NOT* recommended to use the "all" endpoint. In the [gitaly-01
migration](https://gitlab.torproject.org/tpo/tpa/team/-/issues/42225), this approach was used, and it led to an explosion in
disk usage, as forks do not automatically deduplicate their objects
with their parents. A "housekeeping" job is needed before space is
regained, so in the case of large fork trees or large repositories
this can lead to a catastrophic disk usage explosion and an overall
migration failure. Housekeeping *can* be run and the migration
retried, but it's a scary and inconvenient way to move *all* repos.
|
|
|
|
|
|
In any case, here's how part of that migration was done.
|
|
|
|
|
|
First, you need a personal access token with Admin privileges on
GitLab. Let's say you set it in the environment variable
`PRIVATE_TOKEN` from here on.
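For example (the token value here is obviously a placeholder):

```
export PRIVATE_TOKEN=glpat-xxxxxxxxxxxxxxxxxxxx
```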
|
|
|
|
|
|
Let's say you're migrating from the Gitaly storage `default` to
|
|
|
`storage1`. In the above migration, those were `gitlab-02` and
|
|
|
`gitaly-01`.
|
|
|
|
|
|
1. First, we evaluated the number of repositories on each server with:
|
|
|
|
|
|
curl -v -s -X GET -H "PRIVATE-TOKEN: $PRIVATE_TOKEN" --url "https://gitlab.torproject.org/api/v4/projects?repository_storage=default&simple=true" 2>&1 | grep x-total
|
|
|
|
|
|
curl -v -s -X GET -H "PRIVATE-TOKEN: $PRIVATE_TOKEN" --url "https://gitlab.torproject.org/api/v4/projects?repository_storage=storage1&simple=true" 2>&1 | grep x-total
|
|
|
|
|
|
It's also possible to extract the number of repositories with the
|
|
|
`gitlab.list-projects` task, but that's much slower as it needs to
|
|
|
page through all projects.
|
|
|
|
|
|
2. Then we migrated a couple of repositories by hand, again with
|
|
|
`curl`, to see how things worked. But eventually this was
|
|
|
automated with the `fab gitlab.move-repo` fabric task, see above
|
|
|
for individual moves.
|
|
|
|
|
|
3. We then migrated *groups* of repositories, by piping a list of
|
|
|
projects into a script, with this:
|
|
|
|
|
|
fab gitlab.list-projects -g tpo/tpa | while read id path; do
|
|
|
echo "moving project $id ($path)"
|
|
|
curl -X POST -H "PRIVATE-TOKEN: $PRIVATE_TOKEN" \
|
|
|
-H 'Content-Type: application/json' \
|
|
|
--data '{"destination_storage_name":"storage1"}'
|
|
|
--url "https://gitlab.torproject.org/api/v4/projects/$id/repository_storage_moves" | jq .
|
|
|
done
|
|
|
|
|
|
This is when we made the wrong decision. These moves went extremely
well: even when migrating all groups, we were under the impression
everything would be fast and smooth. We had underestimated the
volume of the work remaining, because we were not checking the
repository counts.
|
|
|
|
|
|
For this, you should look at [this Grafana panel](https://grafana.torproject.org/d/QrDJktiMz/gitlab-omnibus?orgId=1&refresh=1m&from=now-24h&to=now&timezone=utc&var-node=gitlab-02.torproject.org&viewPanel=panel-47) which shows
|
|
|
per-server repository counts.
|
|
|
|
|
|
Indeed, there are vastly more user forks than project
|
|
|
repositories, so those simulations were only the tip of the
|
|
|
iceberg. But we didn't realize that, so we plowed ahead.
|
|
|
|
|
|
4. We then migrated essentially *everything* at once, by using the
|
|
|
[all projects endpoint](https://docs.gitlab.com/api/project_repository_storage_moves/#schedule-repository-storage-moves-for-all-projects-on-a-storage-shard):
|
|
|
|
|
|
curl -X POST -H "PRIVATE-TOKEN: $PRIVATE_TOKEN" \
|
|
|
-H 'Content-Type: application/json' \
|
|
|
--data '{"destination_storage_name":"storage1", "source_storage_name": "default"}' \
|
|
|
--url "https://gitlab.torproject.org/api/v4/project_repository_storage_moves" | jq .
|
|
|
|
|
|
This is where things went wrong.
|
|
|
|
|
|
The first thing that happened is that the Sidekiq queue flooded,
|
|
|
triggering an alert in monitoring:
|
|
|
|
|
|
15:32:10 -ALERTOR1:#tor-alerts- SidekiqQueueSize [firing] Sidekiq queue default on gitlab-02.torproject.org is too large
|
|
|
|
|
|
That's because all the migrations are dumped in the default
Sidekiq queue. There are notes in [this issue](https://gitlab.com/gitlab-org/gitlab/-/issues/270422#note_437064984) about tweaking the
Sidekiq configuration to avoid this, which might have prevented
this flood from blocking other things in GitLab. It's unclear why
having a dedicated queue for this is not the default; [the idea
seems to have been rejected upstream](https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/issues/7177).
|
|
|
|
|
|
The other problem is that each repository is copied as is, with
|
|
|
all its objects, including a copy of all the objects from the
|
|
|
parent in the fork tree. This "reduplicates" the objects between
|
|
|
parent and fork on the target server and creates an explosion of
|
|
|
disk space. In theory, that `@pool` stuff [should be handled
correctly](https://gitlab.com/groups/gitlab-org/-/epics/10361), but it seems maintenance is needed before objects are
deduplicated again.
|
|
|
|
|
|
5. At this point, we waited for moves to complete, ran housekeeping,
|
|
|
and tried again until it worked (see below). Then we also migrated
|
|
|
snippets:
|
|
|
|
|
|
curl -s -X POST -H "PRIVATE-TOKEN: $PRIVATE_TOKEN" -H 'Content-Type: application/json' --data '{"destination_storage_name":"storage1", "source_storage_name": "default"}' --url "https://gitlab.torproject.org/api/v4/snippet_repository_storage_moves"
|
|
|
|
|
|
and groups:
|
|
|
|
|
|
curl -X POST -H "PRIVATE-TOKEN: $PRIVATE_TOKEN" -H 'Content-Type: application/json' --data '{"destination_storage_name":"storage1", "source_storage_name": "default"}' --url "https://gitlab.torproject.org/api/v4/group_repository_storage_moves" | jq .; date
|
|
|
|
|
|
Ultimately, we ended up automating a "one-by-one" migration with:
|
|
|
|
|
|
fab gitlab.move-repos --source-storage=default --dest-storage=storage1 --no-prompt;
|
|
|
|
|
|
... which migrated each repository one by one. It's possible a
|
|
|
full server migration could be performed this way, but it's much
|
|
|
slower because it doesn't parallelize. An issue should be filed
|
|
|
upstream so that housekeeping is scheduled on migrated
repositories and the normal API works correctly. This is likely
not the case because GitLab.com has its own tool,
[`gitalyctl`](https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/gitaly/gitalyctl.md?ref_type=heads), to perform migrations between Gitaly clusters,
part of a toolset called [woodhouse](https://gitlab.com/gitlab-com/gl-infra/woodhouse).
|
|
|
|
|
|
6. Finally, we checked how many repositories were left on the servers again:
|
|
|
|
|
|
curl -v -s -X GET -H "PRIVATE-TOKEN: $PRIVATE_TOKEN" --url "https://gitlab.torproject.org/api/v4/projects?repository_storage=default&simple=true" 2>&1 | grep x-total
|
|
|
|
|
|
curl -v -s -X GET -H "PRIVATE-TOKEN: $PRIVATE_TOKEN" --url "https://gitlab.torproject.org/api/v4/projects?repository_storage=storage1&simple=true" 2>&1 | grep x-total
|
|
|
|
|
|
And at this point, `list-projects` worked for the origin server as
|
|
|
there were so few repositories left:
|
|
|
|
|
|
fab gitlab.list-projects --storage=default
|
|
|
|
|
|
While migration happened, the Grafana panels [repository count per
|
|
|
server](https://grafana.torproject.org/d/QrDJktiMz/gitlab-omnibus?orgId=1&refresh=1m&from=now-24h&to=now&timezone=utc&var-node=gitlab-02.torproject.org&viewPanel=panel-47), [disk usage](https://grafana.torproject.org/d/zbCoGRjnz/disk-usage), [CPU usage](https://grafana.torproject.org/d/gex9eLcWz/cpu-usage) and [sidekiq](https://grafana.torproject.org/d/c3201e86-7dde-4897-9d67-a161d0b8d2bf/gitlab-sidekiq?folderUid=faa9db2b-c105-4c67-8f83-a918aaeac5e5&orgId=1&from=now-24h&to=now&timezone=utc&var-query0&var-node=gitlab-02.torproject.org&var-alias=gitlab-02.torproject.org) were used
|
|
|
to keep track of progress.
|
|
|
|
|
|
The `fab gitlab.list-moves` task was also used (and written!) to keep
|
|
|
track of individual states. For example, this lists the name of
|
|
|
projects in-progress:
|
|
|
|
|
|
fab gitlab.list-moves --since 2025-07-16T19:30 --status=started | jq -rc '.project.path_with_namespace' | sort
|
|
|
|
|
|
... or scheduled:
|
|
|
|
|
|
fab gitlab.list-moves --since 2025-07-16T19:30 --status=scheduled | jq -r -c '.project.path_with_namespace'
|
|
|
|
|
|
Or everything but finished tasks:
|
|
|
|
|
|
fab gitlab.list-moves --since 2025-07-16T19:30 --not-status=finished | jq -c '.'
|
|
|
|
|
|
The `--since` flag should be set to when the batch migration was started,
|
|
|
otherwise you get a flood of requests from the beginning of time (yes,
|
|
|
it's weird like that).
|
|
|
|
|
|
This was used to list move failures:
|
|
|
|
|
|
fab gitlab.list-moves --since 2025-07-16T19:30 --status=failed | jq -rc '[.project.id, .project.path_with_namespace, .error_message] | join(" ")'
|
|
|
|
|
|
And this, the number of jobs by state:
|
|
|
|
|
|
fab gitlab.list-moves --since 2025-07-16T19:30 | jq -r .state | sort | uniq -c
|
|
|
|
|
|
This was used to collate all failures and check for anomalies:
|
|
|
|
|
|
fab gitlab.list-moves --kind=project --not-status=finished | jq -r .error_message | sed 's,/home/git/repositories/+gitaly/tmp/[^:]*,/home/git/repositories/+gitaly/tmp/XXXX,' | sort | uniq -c | sort -n
|
|
|
|
|
|
Note that, while the failures were kind of scary, things eventually
|
|
|
turned out okay. Gitaly, when running out of disk space, handles it
|
|
|
gracefully: the job is marked as failed, and it moves on to the next
|
|
|
one. Then housekeeping can be run and the moves resumed.
|
|
|
|
|
|
[Heuristical housekeeping](https://docs.gitlab.com/administration/housekeeping/#heuristical-housekeeping) can be scheduled by tweaking
|
|
|
Gitaly's `daily_maintenance.start_hour` setting.
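To see what window is currently configured, you can inspect the
Gitaly configuration; a sketch, assuming a standalone Gitaly with its
configuration in `/etc/gitaly/config.toml` (that path is an
assumption; on an Omnibus install the setting lives in
`/etc/gitlab/gitlab.rb` instead):

```
# show the configured daily maintenance window; the config path is an assumption
grep -A 4 'daily_maintenance' /etc/gitaly/config.toml
```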
|
|
|
|
|
|
It is *possible* that scheduling a maintenance *while* doing the
|
|
|
migration could resolve the disk space issue.
|
|
|
|
|
|
Note that maintenance logs can be tailed on gitaly-01 with:
|
|
|
|
|
|
journalctl -u gitaly --grep maintenance.daily -f
|
|
|
|
|
|
Or this will show maintenance tasks that take longer than one second:
|
|
|
|
|
|
journalctl -o cat -u gitaly --since 2025-07-17T03:45 -f | jq -c '. | select (.source == "maintenance.daily") | select (.time_ms > 1000)'
|
|
|
|
|
|
## Connect to the PostgreSQL server
|
|
|
|
|
|
We previously had instructions on how to connect to the GitLab Omnibus
|