Verified Commit ad0d489a authored by anarcat

document all troubleshooting that was done in mass migration (team#40972)

parent c5a266ef
......@@ -1679,15 +1679,22 @@ Note that it currently migrates only one VM at a time, because of the
Also note that this procedure depends on a patched version of
`move-instance`, which was changed after the 3.0 Ganeti release, see
[this comment](https://github.com/ganeti/ganeti/issues/1696#issuecomment-1465221351) for details. We also have patches on top of that
which fix various issues we have found during the gnt-chi to gnt-dal
migration, see [this comment](https://github.com/ganeti/ganeti/issues/1696#issuecomment-1470693963) for a discussion.
On 2023-03-16, @anarcat uploaded a patched version of Ganeti to our
internal repositories (on `db.torproject.org`) with a debdiff
documented in [this comment](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40972#note_2887055) and featuring the following three
patches:
* [GitHub ganeti#1697](https://github.com/ganeti/ganeti/pull/1697) - Python 3 tweak, optional
* [GitHub ganeti#1698](https://github.com/ganeti/ganeti/pull/1698) - network configuration hack, mandatory
* [GitHub ganeti#1699](https://github.com/ganeti/ganeti/pull/1699) - OpenSSL verification hack, mandatory
An extra optimisation was reported as [issue 1702](https://github.com/ganeti/ganeti/issues/1702) and patched on
`dal-node-01` manually (see [PR 1703](https://github.com/ganeti/ganeti/pull/1703)).
Once those patches have been deployed, use the following procedure to
migrate a VM. In this example, we migrate a VM named
`test-01.torproject.org` from the gnt-chi cluster to gnt-dal.
......@@ -1756,6 +1763,11 @@ will move *one* VM, in this example the `test-01` VM from the
1. stop the VM, on the source cluster:
gnt-instance stop test-01
Note that this is necessary only if you are worried changes will
happen on the source node and not be reproduced on the target
cluster. If the service is fully redundant and ephemeral (e.g. a
DNS secondary), the VM can be kept running.
2. move the VM to the new cluster:
......@@ -1782,25 +1794,177 @@ will move *one* VM, in this example the `test-01` VM from the
Note how we use the name of the Ganeti node where the VM resides.
TODO: the above rewrites `/etc/network/interfaces` while many VMs
actually configure `/etc/network/interfaces.d/eth0` instead
4. test the new VM
5. if satisfied, change DNS to new VM
6. schedule destruction of the old VM (7 days)
tsa-misc$ ./ganeti -v -H test-01.torproject.org retire --master-host=chi-node-01.torproject.org
### Troubleshooting
The above procedure was tested on a test VM migrating from gnt-chi to
gnt-dal ([tpo/tpa/team#40972][]). In that process, *many* hurdles were
overcome. If the above procedure is followed again and somewhat fails,
this section documents workarounds for the issues we have encountered
so far.
[tpo/tpa/team#40972]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40972
#### Debugging and logs
If the above procedure doesn't work, try again with `--debug` instead
of `--verbose`; you might see extra error messages. The import/export
logs can also be found in `/var/log/ganeti/os/` on the node where
the import or export happened.
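To find the logs of the latest attempt quickly, the directory can be listed by modification time. This is a minimal sketch: the `latest_logs` function name is hypothetical, and it assumes the log directory is flat enough for a plain `ls` to be useful.

```shell
# Hypothetical helper: list the most recently modified entries of a log
# directory, newest first. Defaults to the Ganeti OS log directory
# mentioned above and to five entries.
latest_logs() {
    ls -t "${1:-/var/log/ganeti/os}" | head -n "${2:-5}"
}
```

Running `latest_logs` without arguments on the node where the import or export happened should show the files to inspect first.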
#### Missing patches
This error:
TypeError: '>' not supported between instances of 'NoneType' and 'int'
... is [upstream bug 1696][] fixed in master with [PR 1697](https://github.com/ganeti/ganeti/pull/1697). An
alternative is to add those flags to the `move-instance` command:
--opportunistic-tries=1 --iallocator=hail
This error:
ganeti.errors.OpPrereqError: ('If network is given, no mode or link is allowed to be passed', 'wrong_input')
... is also documented in [upstream bug 1696][] and fixed with [PR
1698](https://github.com/ganeti/ganeti/pull/1698).
This mysterious failure:
Disk 0 failed to receive data: Exited with status 1 (recent output: socat: W ioctl(9, IOCTL_VM_SOCKETS_GET_LOCAL_CID, ...): Inappropriate ioctl for device\n0+0 records in\n0+0 records out\n0 bytes copied, 12.2305 s, 0.0 kB/s)
... is *probably* due to a certificate verification bug in Ganeti's
import-export daemon. This should be confirmed in the logs in
`/var/log/ganeti/os` on the relevant node. The log entry that confirms
it is:
Disk 0 failed to send data: Exited with status 1 (recent output: socat: E certificate is valid but its commonName does not match hostname "ganeti.example.com")
That is upstream bug [1681](https://github.com/ganeti/ganeti/issues/1681) that should have been fixed in [PR
1699](https://github.com/ganeti/ganeti/pull/1699).
#### Not enough space on the volume group
If the export fails on the source cluster with:
WARNING: Could not snapshot disk/2 on node chi-node-10.torproject.org: Error while executing backend function: Not enough free space: required 20480, available 15364.0
That is because the volume group doesn't have enough room to make a
snapshot. In this case, there was a 300GB swap partition on the node
(!) that could easily be removed, but an alternative would be to
evacuate other instances off of the node (even as secondaries) to free
up some space.
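To gauge how much space needs to be freed, the shortfall can be computed from the error message itself. This is a minimal sketch, assuming the message format shown above. The `vg_shortfall` function name is hypothetical, and the unit is presumably mebibytes, although the message does not say so.

```shell
# Hypothetical helper: extract the "required" and "available" figures
# from the Ganeti error message and print how much space is missing,
# in the same unit as the message (presumably MiB).
vg_shortfall() {
    printf '%s\n' "$1" |
        sed -n 's/.*required \([0-9.]*\), available \([0-9.]*\).*/\1 \2/p' |
        awk '{ printf "%d\n", $1 - $2 }'
}

vg_shortfall "Not enough free space: required 20480, available 15364.0"
# prints 5116
```

The actual free space in the volume group can then be checked with `vgs` on the node.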
#### Snapshot failure
If the procedure fails with:
ganeti.errors.OpExecError: Not all disks could be snapshotted, and you did not allow the instance to remain offline for a longer time through the --long-sleep option;
aborting
... try again with the VM stopped.
#### Connectivity issues
If the procedure fails during the data transfer with:
pycurl.error: (7, 'Failed to connect to chi-node-01.torproject.org port 5080: Connection refused')
or:
Disk 0 failed to send data: Exited with status 1 (recent output: dd: 0 bytes copied, 0.996381 s, 0.0 kB/s\ndd: 0 bytes copied, 5.99901 s, 0.0 kB/s\nsocat: E SSL_connect(): Connection refused)
... make sure you have the firewalls opened. Note that Puppet or other
things might clear out the temporary firewall rules established in the
preparation step.
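Before retrying, it can help to check that the port is reachable at all. This is a rough sketch, not part of the documented procedure: the `port_open` function name is hypothetical, and it assumes `bash` (for its `/dev/tcp` pseudo-device) and `timeout` from coreutils are available.

```shell
# Hypothetical helper: succeed if a TCP connection to host ($1) and
# port ($2) can be opened within five seconds, fail otherwise.
port_open() {
    timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}
```

For the `pycurl` error above, `port_open chi-node-01.torproject.org 5080 || echo "check the firewall"` would be run from the machine running `move-instance`.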
#### DNS issues
This error:
ganeti.errors.OpPrereqError: ('The given name (metrics-psqlts-01.torproject.org.2.8.0.0.0.0.0.5.0.0.8.8.4.0.6.2.ip6.arpa) does not resolve: Name or service not known', 'resolver_error')
... means the reverse DNS on the instance has not been properly
configured. In this case, the fix was to add a trailing dot to the
`PTR` record:
```diff
--- a/2.8.0.0.0.0.0.5.0.0.8.8.4.0.6.2.ip6.arpa
+++ b/2.8.0.0.0.0.0.5.0.0.8.8.4.0.6.2.ip6.arpa
@@ -55,7 +55,7 @@ b.c.b.7.0.c.e.f.f.f.8.3.6.6.4.0 IN PTR ci-runner-x86-01.torproject.org.
; 2604:8800:5000:82:466:38ff:fe3c:f0a7
7.a.0.f.c.3.e.f.f.f.8.3.6.6.4.0 IN PTR dangerzone-01.torproject.org.
; 2604:8800:5000:82:466:38ff:fe97:24ac
-c.a.4.2.7.9.e.f.f.f.8.3.6.6.4.0 IN PTR metrics-psqlts-01.torproject.org
+c.a.4.2.7.9.e.f.f.f.8.3.6.6.4.0 IN PTR metrics-psqlts-01.torproject.org.
; 2604:8800:5000:82:466:38ff:fed4:51a1
1.a.1.5.4.d.e.f.f.f.8.3.6.6.4.0 IN PTR onion-test-01.torproject.org.
; 2604:8800:5000:82:466:38ff:fea3:7c78
```
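Other records in the zone might have the same problem. The following sketch lists `PTR` records whose target lacks the trailing dot. The `ptr_missing_dot` function name is hypothetical, and it assumes records are laid out as `name IN PTR target`, as in the diff above. Records with an explicit TTL field would not be matched.

```shell
# Hypothetical helper: read zone file content on standard input and
# print the PTR records whose target does not end with a dot.
ptr_missing_dot() {
    awk '$2 == "IN" && $3 == "PTR" && $4 !~ /\.$/ { print }'
}

printf '%s\n' \
    '7.a.0.f.c.3.e.f.f.f.8.3.6.6.4.0 IN PTR dangerzone-01.torproject.org.' \
    'c.a.4.2.7.9.e.f.f.f.8.3.6.6.4.0 IN PTR metrics-psqlts-01.torproject.org' |
    ptr_missing_dot
# prints only the metrics-psqlts-01 line
```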
#### Capacity issues
If the procedure fails with:
ganeti.errors.OpPrereqError: ('Instance allocation to group 64c116fc-1ab2-4f6d-ba91-89c65875f888 (default) violates policy: memory-size value 307200 is not in range [128, 65536]', 'wrong_input')
It's because the VM is smaller or bigger than the cluster
configuration allows. You need to change the `--ipolicy-bounds-specs`
in the cluster; see, for example, the [gnt-dal cluster
initialization](#gnt-dal-cluster-initialization) instructions.
If the procedure fails with:
ganeti.errors.OpPrereqError: ("Can't compute nodes using iallocator 'hail': Request failed: Group default (preferred): No valid allocation solutions, failure reasons: FailMem: 6", 'insufficient_resources')
... you may be able to work around the problem by specifying a
destination node by hand. Add this to the `move-instance` command, for
example:
--dest-primary-node=dal-node-02.torproject.org \
--dest-secondary-node=dal-node-03.torproject.org
The error:
ganeti.errors.OpPrereqError: Disk template 'blockdev' is not enabled in cluster. Enabled disk templates are: drbd,plain
... means that you should pass a supported `--dest-disk-template`
argument to the `move-instance` command.
#### Rerunning failed migrations
This error obviously means the instance already exists in the cluster:
ganeti.errors.OpPrereqError: ("Instance 'rdsys-frontend-01.torproject.org' is already in the cluster", 'already_exists')
... maybe you're retrying a failed move? In that case, delete the
*target* instance (yes, really make sure you delete the target, not
the source!!!):
gnt-instance remove --shutdown-timeout=0 test-01.torproject.org
#### Other issues
This error is harmless and can be ignored:
WARNING: Failed to run rename script for tpa-bootstrap-01.torproject.org on node dal-node-02.torproject.org: OS rename script failed (exited with exit code 1), last lines in the log file:\nCannot rename from tpa-bootstrap-01.torproject.org to tpa-bootstrap-01.torproject.org:\nInstance has a different hostname (tpa-bootstrap-01)
It's probably a flaw in the `ganeti-instance-debootstrap` backend that
doesn't properly renumber the instance. We have our own renumbering
procedure in Fabric instead, but that could be merged inside
`ganeti-instance-debootstrap` eventually.
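The error signatures documented in this section can be summarized as a small lookup. This is only a sketch: the `diagnose` function and its hint strings are hypothetical, and it only covers the messages listed above.

```shell
# Hypothetical helper: map an error message from move-instance (or from
# the import/export logs) to the matching troubleshooting hint.
diagnose() {
    case "$1" in
        *"not supported between instances of 'NoneType' and 'int'"*)
            echo "missing patch: PR 1697" ;;
        *"no mode or link is allowed"*)
            echo "missing patch: PR 1698" ;;
        *"commonName does not match hostname"*)
            echo "missing patch: PR 1699" ;;
        *"Not enough free space"*)
            echo "not enough space on the volume group" ;;
        *"Not all disks could be snapshotted"*)
            echo "snapshot failure: retry with the VM stopped" ;;
        *"Connection refused"*)
            echo "connectivity issue: check the firewalls" ;;
        *"does not resolve"*)
            echo "DNS issue: check the reverse DNS" ;;
        *"violates policy"*|*"No valid allocation solutions"*|*"is not enabled in cluster"*)
            echo "capacity issue" ;;
        *"is already in the cluster"*)
            echo "failed migration rerun: remove the target instance" ;;
        *"Failed to run rename script"*)
            echo "harmless, can be ignored" ;;
        *)
            echo "unknown, retry with --debug" ;;
    esac
}

diagnose "pycurl.error: (7, 'Failed to connect to chi-node-01.torproject.org port 5080: Connection refused')"
# prints: connectivity issue: check the firewalls
```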
#### Tracing executed commands
Finally, to trace which commands are executed (which can be
challenging in Ganeti), the `execsnoop.bt` command (from the [bpftrace
......@@ -1823,6 +1987,11 @@ The `execsnoop` command (from the [libbpf-tools package](https://tracker.debian.
work but it truncates the command after 128 characters ([Debian
1033013](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1033013), [upstream 740](https://github.com/iovisor/bcc/issues/740)).
This was used to troubleshoot the certificate issues with `socat` in
[upstream bug 1681](https://github.com/ganeti/ganeti/issues/1681).
[upstream bug 1696]: https://github.com/ganeti/ganeti/issues/1696
## Pager playbook
### I/O overload
......