|
|
|
|
|
/var/opt/gitlab/git-data/repositories/@hashed/86/bc/86bc00bf176c8b99e9cbdd89afdd2492de002c1dcce63606f711e0c04203c4da.git
|
|
|
|
|
|
or, on `gitaly-01`:
|
|
|
|
|
|
/home/git/repositories/@hashed/86/bc/86bc00bf176c8b99e9cbdd89afdd2492de002c1dcce63606f711e0c04203c4da.git
|
|
|
|
|
|
### Finding objects common to forks
|
|
|
|
|
|
Note that forks are "special" in the sense that they store some of
|
|
|
their objects outside of their repository. For example, the
|
|
|
[ahf/arti](https://gitlab.torproject.org/ahf/arti) fork (project ID 744) is in:
|
root@gitlab-02:~# du -sh /var/opt/gitlab/git-data/repositories/@pools/ef/2d/ef2d127de37b942baad06145e54b0c619a1f22327b2ebbcfbec78f5564afe39d.git/objects
|
|
6.1G /var/opt/gitlab/git-data/repositories/@pools/ef/2d/ef2d127de37b942baad06145e54b0c619a1f22327b2ebbcfbec78f5564afe39d.git/objects
|
|
|
```
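To confirm that a fork is actually linked to an object pool, you can
look at its Git "alternates" file, which is the mechanism used to
share objects with the pool. A minimal sketch, with a hypothetical
hashed path (substitute the fork's actual hash):

```
# the @hashed path below is a placeholder, use the fork's real hashed path
cat /var/opt/gitlab/git-data/repositories/@hashed/xx/yy/<hash>.git/objects/info/alternates
# this should print a path (often relative) pointing at the
# corresponding @pools .../objects directory
```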
|
|
|
|
|
|
### Finding the right Gitaly server
|
|
|
|
|
|
Repositories are stored on a Gitaly server, which is currently
|
|
|
`gitaly-01.torproject.org` (but could also be on `gitlab-02` or
|
|
|
another `gitaly-NN` server). So typically, just look on
|
|
|
`gitaly-01`. But if you're unsure, to find which server a repository
|
|
|
is on, use the [get a single project API endpoint](https://docs.gitlab.com/api/projects/#get-a-single-project):
|
|
|
|
|
|
curl"https://gitlab.torproject.org/api/v4/projects/647" | jq .repository_storage
|
|
|
|
|
|
The convention is that `storage1` is `gitaly-01`, `storage2` would be
`gitaly-02`, and so on; the exception is `gitlab-02`, whose storage is
currently named `default`.
|
|
|
|
|
|
## Find the project associated with a project ID
|
|
|
|
|
|
Sometimes you'll find a numeric project ID instead of a human-readable
|
for i in range(2000):
    # hashed storage paths are the SHA-256 of the decimal project ID
    h = hashlib.sha256(str(i).encode("ascii"))
    print(i, h.hexdigest())
|
|
|
```
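You can also go the other way and check a single candidate directly,
since the hashed storage path is the SHA-256 of the decimal project
ID; for example, for project 744 (the `ahf/arti` fork mentioned
above):

```
python3 -c 'import hashlib; print(hashlib.sha256(b"744").hexdigest())'
```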
|
|
|
|
|
|
|
|
|
## Moving projects between Gitaly servers
|
|
|
|
|
|
If there are multiple Gitaly servers (and there currently aren't:
|
|
|
there's only one, named `gitaly-01`), you can *move* repositories
|
|
|
between Gitaly servers through the GitLab API.
|
|
|
|
|
|
They call this [project repository storage moves](https://docs.gitlab.com/api/project_repository_storage_moves/); see also the
|
|
|
[moving repositories](https://docs.gitlab.com/administration/operations/moving_repositories/) documentation. You can move individual
|
|
|
groups, snippets or projects, or *all* of them.
|
|
|
|
|
|
### Moving one project at a time
|
|
|
|
|
|
This procedure only concerns moving a *single* repository. Do NOT use
|
|
|
the batch-migration API that migrates all repositories unless you know
|
|
|
what you're doing (see below).
|
|
|
|
|
|
The GitLab API call is simple: send a `POST` to
[`/project/:project_id/repository_storage_moves`](https://docs.gitlab.com/api/project_repository_storage_moves/#schedule-a-repository-storage-move-for-a-project). For example,
assuming you have a GitLab admin personal access token in
`$PRIVATE_TOKEN`:
|
|
|
|
|
|
curl -X POST -H "PRIVATE-TOKEN: $private_token" -H "Content-Type: application/json" --data '{"destination_storage_name":"storage1"}' --url "https://gitlab.torproject.org/api/v4/projects/1600/repository_storage_moves"
|
|
|
|
|
|
This returns a JSON object with an `id` that is the unique identifier
|
|
|
for this move. You can see the status of the transfer by polling the
|
|
|
`project_repository_storage_moves` endpoint; for example, for a while
|
|
|
we were doing this:
|
|
|
|
|
|
watch -d -c 'curl -s -X GET -H "PRIVATE-TOKEN: $PRIVATE_TOKEN" --url "https://gitlab.torproject.org/api/v4/project_repository_storage_moves" | jq -C . '
|
|
|
|
|
|
Then you need to wait for the transfer to complete and, ideally, run
|
|
|
housekeeping to deduplicate objects.
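Housekeeping on a single project can be triggered through the
[project housekeeping API endpoint](https://docs.gitlab.com/api/projects/#start-the-housekeeping-task-for-a-project); a sketch, reusing project 1600
from the example above:

```
curl -X POST -H "PRIVATE-TOKEN: $PRIVATE_TOKEN" \
     --url "https://gitlab.torproject.org/api/v4/projects/1600/housekeeping"
```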
|
|
|
|
|
|
There is a Fabric task named `gitlab.move-repo` that does all of this
|
|
|
at once. Here's an example run:
|
|
|
|
|
|
```
|
|
|
anarcat@angela:fabric-tasks$ fab gitlab.move-repo --dest-storage=default --project=3466
|
|
|
INFO: Successfully connected to https://gitlab.torproject.org
|
|
|
move repository tpo/anti-censorship/connectivity-measurement/uget (3466) from storage1 to default? [Y/n]
|
|
|
INFO: waiting for repository move 3758 to complete
|
|
|
INFO: Successfully connected to https://gitlab.torproject.org
|
|
|
INFO: going to try 15 times over 2 hours
|
|
|
INFO: move completed with status finished
|
|
|
INFO: starting housekeeping task...
|
|
|
```
|
|
|
|
|
|
See also the [underlying design of repository moves](https://docs.gitlab.com/development/repository_storage_moves/).
|
|
|
|
|
|
But you would likely prefer batch moves instead; see below.
|
|
|
|
|
|
### Moving all repositories with `rsync`
|
|
|
|
|
|
Repositories can be more usefully moved in *batches*. Typically, this
|
|
|
occurs in a disaster recovery situation, when you need to evacuate a
|
|
|
Gitaly server in favor of another one.
|
|
|
|
|
|
We are *not* going to use the API for this, although that procedure
|
|
|
(and its caveats) is documented further down.
|
|
|
|
|
|
Note that this procedure uses `rsync`, which upstream warns against in
|
|
|
their [official documentation](https://docs.gitlab.com/administration/operations/moving_repositories/#the-target-directory-contains-an-outdated-copy-of-the-repositories-use-rsync) ([gitlab-org/gitlab#270422](https://gitlab.com/gitlab-org/gitlab/-/issues/270422)) but
|
|
|
we believe this procedure is sufficiently safe in a disaster recovery
|
|
|
scenario or with a maintenance window planned.
|
|
|
|
|
|
This procedure is also untested; it's an expanded version of the
upstream docs. One unclear part of the upstream procedure is how to
handle the leftover repositories on the original server: presumably
they can either be deleted or left in place, but this hasn't been
confirmed.
|
|
|
|
|
|
Let's say, for example, you're migrating from `gitaly-01` to
`gitaly-03`, assuming the `gitaly-03` server has been installed
properly and has a weight of zero (so no new repositories are created
there yet).
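Storage weights live in the GitLab application settings
(`repository_storages_weighted`) and can be inspected or changed
through the [application settings API](https://docs.gitlab.com/api/settings/). A sketch; the `storage3` name
for `gitaly-03` follows the convention above but is hypothetical:

```
# show the current weights
curl -s -H "PRIVATE-TOKEN: $PRIVATE_TOKEN" \
     --url "https://gitlab.torproject.org/api/v4/application/settings" \
     | jq .repository_storages_weighted

# hypothetical: keep new repositories away from storage3 for now
curl -s -X PUT -H "PRIVATE-TOKEN: $PRIVATE_TOKEN" -H 'Content-Type: application/json' \
     --data '{"repository_storages_weighted": {"storage1": 100, "storage3": 0}}' \
     --url "https://gitlab.torproject.org/api/v4/application/settings" \
     | jq .repository_storages_weighted
```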
|
|
|
|
|
|
1. analyze how much disk space is used by various components on each
|
|
|
end:
|
|
|
|
|
|
du -sch /home/git/repositories/* | sort -h
|
|
|
|
|
|
For example:
|
|
|
|
|
|
root@gitaly-01:~# du -sch /home/git/repositories/* | sort -h
|
|
|
704K /home/git/repositories/+gitaly
|
|
|
1.2M /home/git/repositories/@groups
|
|
|
17M /home/git/repositories/@snippets
|
|
|
35G /home/git/repositories/@pools
|
|
|
98G /home/git/repositories/@hashed
|
|
|
132G total
|
|
|
|
|
|
Keep a copy of this to give you a rough idea that all the data was
|
|
|
transferred correctly. Using Prometheus metrics is also acceptable
|
|
|
here.
|
|
|
|
|
|
2. do a first `rsync` pass between the two servers to copy the bulk of
|
|
|
the data, even if it's inconsistent:
|
|
|
|
|
|
sudo -u git rsync -a /home/git/repositories/ git@gitaly-03:/var/opt/gitlab/git-data/repositories/
|
|
|
|
|
|
Notice the different paths here
(`/var/opt/gitlab/git-data/repositories/` vs
`/home/git/repositories`). Those may differ according to how the
server was set up: on `gitaly-01`, a standalone Gitaly server, it's
the latter, while on `gitlab-02` it's the former because it's an
Omnibus install.
|
|
|
|
|
|
3. set the server in [maintenance mode](https://docs.gitlab.com/administration/maintenance_mode/) or at least [set repositories read-only](https://docs.gitlab.com/administration/read_only_gitlab/); see the API sketch after this list.
|
|
|
|
|
|
4. rerun the synchronization:
|
|
|
|
|
|
sudo -u git rsync -a --delete /home/git/repositories/ git@gitaly-03:/var/opt/gitlab/git-data/repositories/
|
|
|
|
|
|
Note that this is destructive! DO NOT MIX UP THE SOURCE AND
|
|
|
TARGET HERE!
|
|
|
|
|
|
5. reverse the weights: mark `gitaly-01` as weight 0 and `gitaly-03`
|
|
|
as 100.
|
|
|
|
|
|
6. disable Gitaly on the original server (e.g. `gitaly['enable'] =
|
|
|
false` in omnibus)
|
|
|
|
|
|
7. turn off maintenance or read-only mode
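This is the API sketch referenced in step 3: maintenance mode can be
toggled through the same application settings API (`maintenance_mode`
is a standard setting, the banner message here is illustrative):

```
# enable maintenance mode with a banner message (step 3)
curl -s -X PUT -H "PRIVATE-TOKEN: $PRIVATE_TOKEN" -H 'Content-Type: application/json' \
     --data '{"maintenance_mode": true, "maintenance_mode_message": "Gitaly migration in progress"}' \
     --url "https://gitlab.torproject.org/api/v4/application/settings" | jq .maintenance_mode

# disable it again when done (step 7)
curl -s -X PUT -H "PRIVATE-TOKEN: $PRIVATE_TOKEN" -H 'Content-Type: application/json' \
     --data '{"maintenance_mode": false}' \
     --url "https://gitlab.torproject.org/api/v4/application/settings" | jq .maintenance_mode
```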
|
|
|
|
|
|
### Batch project migrations
|
|
|
|
|
|
It is *NOT* recommended to use the "all" endpoint. In the [gitaly-01
migration](https://gitlab.torproject.org/tpo/tpa/team/-/issues/42225), this approach was used, and it led to an explosion in
disk usage, as forks do not automatically deduplicate their objects
with their parents. A "housekeeping" job is needed before space is
regained, so in the case of large fork trees or large repositories
this can lead to a catastrophic disk usage explosion and an overall
migration failure. Housekeeping *can* be run and the migration
retried, but it's a scary and inconvenient way to move *all* repos.
|
|
|
|
|
|
In any case, here's how part of that migration was done.
|
|
|
|
|
|
First, you need a personal access token with Admin privileges on
GitLab. Let's say you set it in the environment variable
`PRIVATE_TOKEN` from here on.
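For example (the token value here is obviously a placeholder):

```
export PRIVATE_TOKEN=glpat-xxxxxxxxxxxxxxxxxxxx
```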
|
|
|
|
|
|
Let's say you're migrating from the Gitaly storage `default` to
|
|
|
`storage1`. In the above migration, those were `gitlab-02` and
|
|
|
`gitaly-01`.
|
|
|
|
|
|
1. First, we evaluated the number of repositories on each server with:
|
|
|
|
|
|
curl -v -s -X GET -H "PRIVATE-TOKEN: $PRIVATE_TOKEN" --url "https://gitlab.torproject.org/api/v4/projects?repository_storage=default&simple=true" 2>&1 | grep x-total
|
|
|
|
|
|
curl -v -s -X GET -H "PRIVATE-TOKEN: $PRIVATE_TOKEN" --url "https://gitlab.torproject.org/api/v4/projects?repository_storage=storage1&simple=true" 2>&1 | grep x-total
|
|
|
|
|
|
It's also possible to extract the number of repositories with the
|
|
|
`gitlab.list-projects` task, but that's much slower as it needs to
|
|
|
page through all projects.
|
|
|
|
|
|
2. Then we migrated a couple of repositories by hand, again with
|
|
|
`curl`, to see how things worked. But eventually this was
|
|
|
automated with the `fab gitlab.move-repo` fabric task, see above
|
|
|
for individual moves.
|
|
|
|
|
|
3. We then migrated *groups* of repositories, by piping a list of
|
|
|
projects into a script, with this:
|
|
|
|
|
|
fab gitlab.list-projects -g tpo/tpa | while read id path; do
|
|
|
echo "moving project $id ($path)"
|
|
|
curl -X POST -H "PRIVATE-TOKEN: $PRIVATE_TOKEN" \
|
|
|
-H 'Content-Type: application/json' \
|
|
|
--data '{"destination_storage_name":"storage1"}'
|
|
|
--url "https://gitlab.torproject.org/api/v4/projects/$id/repository_storage_moves" | jq .
|
|
|
done
|
|
|
|
|
|
This is when we made the wrong decision. These moves went extremely
well: even when migrating all groups, we were under the impression
everything would be fast and smooth. We had underestimated the
volume of the work remaining, because we were not checking the
repository counts.
|
|
|
|
|
|
For this, you should look at [this Grafana panel](https://grafana.torproject.org/d/QrDJktiMz/gitlab-omnibus?orgId=1&refresh=1m&from=now-24h&to=now&timezone=utc&var-node=gitlab-02.torproject.org&viewPanel=panel-47) which shows
|
|
|
per-server repository counts.
|
|
|
|
|
|
Indeed, there are vastly more user forks than project
|
|
|
repositories, so those simulations were only the tip of the
|
|
|
iceberg. But we didn't realize that, so we plowed ahead.
|
|
|
|
|
|
4. We then migrated essentially *everything* at once, by using the
|
|
|
[all projects endpoint](https://docs.gitlab.com/api/project_repository_storage_moves/#schedule-repository-storage-moves-for-all-projects-on-a-storage-shard):
|
|
|
|
|
|
curl -X POST -H "PRIVATE-TOKEN: $PRIVATE_TOKEN" \
|
|
|
-H 'Content-Type: application/json' \
|
|
|
--data '{"destination_storage_name":"storage1", "source_storage_name": "default"}' \
|
|
|
--url "https://gitlab.torproject.org/api/v4/project_repository_storage_moves" | jq .
|
|
|
|
|
|
This is where things went wrong.
|
|
|
|
|
|
The first thing that happened is that the Sidekiq queue flooded,
|
|
|
triggering an alert in monitoring:
|
|
|
|
|
|
15:32:10 -ALERTOR1:#tor-alerts- SidekiqQueueSize [firing] Sidekiq queue default on gitlab-02.torproject.org is too large
|
|
|
|
|
|
That's because all the migrations are dumped in the default
Sidekiq queue. There are notes in [this issue](https://gitlab.com/gitlab-org/gitlab/-/issues/270422#note_437064984) about tweaking the
Sidekiq configuration to avoid this, which might have prevented
this flood from blocking other things in GitLab. It's unclear why
having a dedicated queue for this is not the default; [the idea
seems to have been rejected upstream](https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/issues/7177).
|
|
|
|
|
|
The other problem is that each repository is copied as is, with
|
|
|
all its objects, including a copy of all the objects from the
|
|
|
parent in the fork tree. This "reduplicates" the objects between
|
|
|
parent and fork on the target server and creates an explosion of
|
|
|
disk space. In theory, that `@pool` stuff [should be handled
correctly](https://gitlab.com/groups/gitlab-org/-/epics/10361), but it seems maintenance is needed before objects are
deduplicated again.
|
|
|
|
|
|
5. At this point, we waited for moves to complete, ran housekeeping,
|
|
|
and tried again until it worked (see below). Then we also migrated
|
|
|
snippets:
|
|
|
|
|
|
curl -s -X POST -H "PRIVATE-TOKEN: $PRIVATE_TOKEN" -H 'Content-Type: application/json' --data '{"destination_storage_name":"storage1", "source_storage_name": "default"}' --url "https://gitlab.torproject.org/api/v4/snippet_repository_storage_moves"
|
|
|
|
|
|
and groups:
|
|
|
|
|
|
curl -X POST -H "PRIVATE-TOKEN: $PRIVATE_TOKEN" -H 'Content-Type: application/json' --data '{"destination_storage_name":"storage1", "source_storage_name": "default"}' --url "https://gitlab.torproject.org/api/v4/group_repository_storage_moves" | jq .; date
|
|
|
|
|
|
Ultimately, we ended up automating a "one-by-one" migration with:
|
|
|
|
|
|
fab gitlab.move-repos --source-storage=default --dest-storage=storage1 --no-prompt;
|
|
|
|
|
|
... which migrated each repository one by one. It's possible a
|
|
|
full server migration could be performed this way, but it's much
|
|
|
slower because it doesn't parallelize. An issue should be filed
|
|
|
upstream so that housekeeping is scheduled on migrated
repositories and the normal API works correctly. This is likely
not the case because GitLab.com has its own tool,
[`gitalyctl`](https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/gitaly/gitalyctl.md?ref_type=heads), to perform migrations between Gitaly clusters,
part of a toolset called [woodhouse](https://gitlab.com/gitlab-com/gl-infra/woodhouse).
|
|
|
|
|
|
6. Finally, we checked how many repositories were left on the servers again:
|
|
|
|
|
|
curl -v -s -X GET -H "PRIVATE-TOKEN: $PRIVATE_TOKEN" --url "https://gitlab.torproject.org/api/v4/projects?repository_storage=default&simple=true" 2>&1 | grep x-total
|
|
|
|
|
|
curl -v -s -X GET -H "PRIVATE-TOKEN: $PRIVATE_TOKEN" --url "https://gitlab.torproject.org/api/v4/projects?repository_storage=storage1&simple=true" 2>&1 | grep x-total
|
|
|
|
|
|
And at this point, `list-projects` worked for the origin server as
|
|
|
there were so few repositories left:
|
|
|
|
|
|
fab gitlab.list-projects --storage=default
|
|
|
|
|
|
While migration happened, the Grafana panels [repository count per
|
|
|
server](https://grafana.torproject.org/d/QrDJktiMz/gitlab-omnibus?orgId=1&refresh=1m&from=now-24h&to=now&timezone=utc&var-node=gitlab-02.torproject.org&viewPanel=panel-47), [disk usage](https://grafana.torproject.org/d/zbCoGRjnz/disk-usage), [CPU usage](https://grafana.torproject.org/d/gex9eLcWz/cpu-usage) and [sidekiq](https://grafana.torproject.org/d/c3201e86-7dde-4897-9d67-a161d0b8d2bf/gitlab-sidekiq?folderUid=faa9db2b-c105-4c67-8f83-a918aaeac5e5&orgId=1&from=now-24h&to=now&timezone=utc&var-query0&var-node=gitlab-02.torproject.org&var-alias=gitlab-02.torproject.org) were used
|
|
|
to keep track of progress.
|
|
|
|
|
|
The `fab gitlab.list-moves` task was also used (and written!) to keep
|
|
|
track of individual states. For example, this lists the name of
|
|
|
projects in-progress:
|
|
|
|
|
|
fab gitlab.list-moves --since 2025-07-16T19:30 --status=started | jq -rc '.project.path_with_namespace' | sort
|
|
|
|
|
|
... or scheduled:
|
|
|
|
|
|
fab gitlab.list-moves --since 2025-07-16T19:30 --status=scheduled | jq -r -c '.project.path_with_namespace'
|
|
|
|
|
|
Or everything but finished tasks:
|
|
|
|
|
|
fab gitlab.list-moves --since 2025-07-16T19:30 --not-status=finished | jq -c '.'
|
|
|
|
|
|
The `--since` flag should be set to when the batch migration was started,
|
|
|
otherwise you get a flood of requests from the beginning of time (yes,
|
|
|
it's weird like that).
|
|
|
|
|
|
This was used to list move failures:
|
|
|
|
|
|
fab gitlab.list-moves --since 2025-07-16T19:30 --status=failed | jq -rc '[.project.id, .project.path_with_namespace, .error_message] | join(" ")'
|
|
|
|
|
|
And this, the number of jobs by state:
|
|
|
|
|
|
fab gitlab.list-moves --since 2025-07-16T19:30 | jq -r .state | sort | uniq -c
|
|
|
|
|
|
This was used to collate all failures and check for anomalies:
|
|
|
|
|
|
fab gitlab.list-moves --kind=project --not-status=finished | jq -r .error_message | sed 's,/home/git/repositories/+gitaly/tmp/[^:]*,/home/git/repositories/+gitaly/tmp/XXXX,' | sort | uniq -c | sort -n
|
|
|
|
|
|
Note that, while the failures were kind of scary, things eventually
|
|
|
turned out okay. Gitaly, when running out of disk space, handles it
|
|
|
gracefully: the job is marked as failed, and it moves on to the next
|
|
|
one. Then housekeeping can be run and the moves resumed.
|
|
|
|
|
|
[Heuristical housekeeping](https://docs.gitlab.com/administration/housekeeping/#heuristical-housekeeping) can be scheduled by tweaking
|
|
|
Gitaly's `daily_maintenance.start_hour` setting.
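To see what window is currently configured, you can inspect the
Gitaly configuration; a sketch, assuming a standalone Gitaly with its
configuration in `/etc/gitaly/config.toml` (that path is an
assumption; on an Omnibus install the setting lives in
`/etc/gitlab/gitlab.rb` instead):

```
# show the configured daily maintenance window; the config path is an assumption
grep -A 4 'daily_maintenance' /etc/gitaly/config.toml
```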
|
|
|
|
|
|
It is *possible* that scheduling a maintenance *while* doing the
|
|
|
migration could resolve the disk space issue.
|
|
|
|
|
|
Note that maintenance logs can be tailed on gitaly-01 with:
|
|
|
|
|
|
journalctl -u gitaly --grep maintenance.daily -f
|
|
|
|
|
|
Or this will show maintenance tasks that take longer than one second:
|
|
|
|
|
|
journalctl -o cat -u gitaly --since 2025-07-17T03:45 -f | jq -c '. | select (.source == "maintenance.daily") | select (.time_ms > 1000)'
|
|
|
|
|
|
## Connect to the PostgreSQL server
|
|
|
|
|
|
We previously had instructions on how to connect to the GitLab Omnibus
|