... | ... | @@ -1679,15 +1679,22 @@ Note that it currently migrates only one VM at a time, because of the |
|
|
|
|
|
Also note that this procedure depends on a patched version of
|
|
|
`move-instance`, which was changed after the 3.0 Ganeti release, see
|
|
|
[this comment](https://github.com/ganeti/ganeti/issues/1696#issuecomment-1465221351) for details. We also have three patches on top of
|
|
|
that which fix various issues we have found during the gnt-chi to
|
|
|
gnt-dal migration, see [this comment](https://github.com/ganeti/ganeti/issues/1696#issuecomment-1470693963) and specifically the
|
|
|
following PRs:
|
|
|
[this comment](https://github.com/ganeti/ganeti/issues/1696#issuecomment-1465221351) for details. We also have patches on top of that
|
|
|
which fix various issues we have found during the gnt-chi to gnt-dal
|
|
|
migration, see [this comment](https://github.com/ganeti/ganeti/issues/1696#issuecomment-1470693963) for a discussion.
|
|
|
|
|
|
On 2023-03-16, @anarcat uploaded a patched version of Ganeti to our
|
|
|
internal repositories (on `db.torproject.org`) with a debdiff
|
|
|
documented in [this comment](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40972#note_2887055) and featuring the following three
|
|
|
patches.
|
|
|
|
|
|
* [GitHub ganeti#1697](https://github.com/ganeti/ganeti/pull/1697) - Python 3 tweak, optional
|
|
|
* [GitHub ganeti#1698](https://github.com/ganeti/ganeti/pull/1698) - network configuration hack, mandatory
|
|
|
* [GitHub ganeti#1699](https://github.com/ganeti/ganeti/pull/1699) - OpenSSL verification hack, mandatory
|
|
|
|
|
|
An extra optimisation was reported as [issue 1702](https://github.com/ganeti/ganeti/issues/1702) and patched on
|
|
|
`dal-node-01` manually (see [PR 1703](https://github.com/ganeti/ganeti/pull/1703)).
|
|
|
|
|
|
Once those patches have been deployed, use the following procedure to
|
|
|
migrate a VM. In this example, we migrate a VM named
|
|
|
`test-01.torproject.org` from the gnt-chi cluster to gnt-dal.
|
... | ... | @@ -1757,6 +1764,11 @@ will move *one* VM, in this example the `test-01` VM from the |
|
|
|
|
|
gnt-instance stop test-01
|
|
|
|
|
|
Note that this is necessary only if you are worried changes will
|
|
|
happen on the source node and not be reproduced on the target
|
|
|
cluster. If the service is fully redundant and ephemeral (e.g. a
|
|
|
DNS secondary), the VM can be kept running.
|
|
|
|
|
|
2. move the VM to the new cluster:
|
|
|
|
|
|
/usr/lib/ganeti/tools/move-instance \
|
... | ... | @@ -1782,25 +1794,177 @@ will move *one* VM, in this example the `test-01` VM from the |
|
|
|
|
|
Note how we use the name of the Ganeti node where the VM resides.
|
|
|
|
|
|
TODO: the above rewrites `/etc/network/interfaces` while many VMs
|
|
|
actually configure `/etc/network/interfaces.d/eth0` instead
|
|
|
|
|
|
4. test the new VM
|
|
|
|
|
|
5. if satisfied, change DNS to new VM
|
|
|
|
|
|
6. schedule destruction of the old VM (7 days)
|
|
|
|
|
|
This procedure was tested on a test VM migrating from gnt-chi to
|
|
|
gnt-dal, see [tpo/tpa/team#40972][] for the gory details.
|
|
|
tsa-misc$ ./ganeti -v -H test-01.torproject.org retire --master-host=chi-node-01.torproject.org
|
|
|
|
|
|
### Troubleshooting
|
|
|
|
|
|
The above procedure was tested on a test VM migrating from gnt-chi to
|
|
|
gnt-dal ([tpo/tpa/team#40972][]). In that process, *many* hurdles were
|
|
|
overcome. If the above procedure is followed again and somehow fails,
|
|
|
this section documents workarounds for the issues we have encountered
|
|
|
so far.
|
|
|
|
|
|
[tpo/tpa/team#40972]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40972
|
|
|
|
|
|
### Troubleshooting
|
|
|
#### Debugging and logs
|
|
|
|
|
|
If the above procedure doesn't work, try again with `--debug` instead
|
|
|
of `--verbose`; you might see extra error messages. The import/export
|
|
|
logs can also be visible in `/var/log/ganeti/os/...`.
|
|
|
logs can also be found in `/var/log/ganeti/os/` on the node where
|
|
|
the import or export happened.
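
To find the relevant log quickly, the most recently modified files in that directory can be listed first. This is only a minimal sketch: the `latest_os_logs` helper name is made up for illustration, and it assumes file names in the log directory contain no newlines.

```shell
# Hypothetical helper: print the N most recently modified entries in a
# log directory, newest first.
# Defaults: /var/log/ganeti/os and 5 entries.
latest_os_logs() {
    dir="${1:-/var/log/ganeti/os}"
    count="${2:-5}"
    # ls -t sorts by modification time, newest first
    ls -t "$dir" | head -n "$count"
}

# Example, on the node where the import or export ran:
#   latest_os_logs /var/log/ganeti/os 3
```

Run it on both the source and the target node, since each side keeps its own logs.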
|
|
|
|
|
|
#### Missing patches
|
|
|
|
|
|
This error:
|
|
|
|
|
|
TypeError: '>' not supported between instances of 'NoneType' and 'int'
|
|
|
|
|
|
... is [upstream bug 1696][] fixed in master with [PR 1697](https://github.com/ganeti/ganeti/pull/1697). An
|
|
|
alternative is to add these flags to the `move-instance` command:
|
|
|
|
|
|
--opportunistic-tries=1 --iallocator=hail
|
|
|
|
|
|
This error:
|
|
|
|
|
|
ganeti.errors.OpPrereqError: ('If network is given, no mode or link is allowed to be passed', 'wrong_input')
|
|
|
|
|
|
... is also documented in [upstream bug 1696][] and fixed with [PR
|
|
|
1698](https://github.com/ganeti/ganeti/pull/1698).
|
|
|
|
|
|
This mysterious failure:
|
|
|
|
|
|
Disk 0 failed to receive data: Exited with status 1 (recent output: socat: W ioctl(9, IOCTL_VM_SOCKETS_GET_LOCAL_CID, ...): Inappropriate ioctl for device\n0+0 records in\n0+0 records out\n0 bytes copied, 12.2305 s, 0.0 kB/s)
|
|
|
|
|
|
Is *probably* due to a certificate verification bug in Ganeti's
|
|
|
import-export daemon. It should be confirmed in the logs in
|
|
|
`/var/log/ganeti/os` on the relevant node. The actual confirmation log
|
|
|
is:
|
|
|
|
|
|
Disk 0 failed to send data: Exited with status 1 (recent output: socat: E certificate is valid but its commonName does not match hostname "ganeti.example.com")
|
|
|
|
|
|
That is upstream bug [1681](https://github.com/ganeti/ganeti/issues/1681) that should have been fixed in [PR
|
|
|
1699](https://github.com/ganeti/ganeti/pull/1699).
|
|
|
|
|
|
#### Not enough space on the volume group
|
|
|
|
|
|
If the export fails on the source cluster with:
|
|
|
|
|
|
WARNING: Could not snapshot disk/2 on node chi-node-10.torproject.org: Error while executing backend function: Not enough free space: required 20480, available 15364.0
|
|
|
|
|
|
That is because the volume group doesn't have enough room to make a
|
|
|
snapshot. In this case, there was a 300GB swap partition on the node
|
|
|
(!) that could easily be removed, but an alternative would be to
|
|
|
evacuate other instances off of the node (even as secondaries) to free
|
|
|
up some space.
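
To see how much space needs to be freed, the two numbers in the error message can be subtracted. The following is a rough sketch, not part of the original procedure: `snapshot_shortfall` is a hypothetical helper, and it assumes both numbers are in the same unit (believed to be MiB).

```shell
# Hypothetical helper: read a "Not enough free space" message on stdin
# and print how much more space the volume group needs.
# Assumption: "required" and "available" use the same unit.
snapshot_shortfall() {
    # extract the two numbers, then subtract available from required
    sed -n 's/.*required \([0-9.]*\), available \([0-9.]*\).*/\1 \2/p' |
        awk '{ printf "%d\n", $1 - $2 }'
}

echo 'Not enough free space: required 20480, available 15364.0' | snapshot_shortfall
# prints 5116
```

The result can then be compared against what `vgs` reports as free on the source node after cleaning up.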
|
|
|
|
|
|
#### Snapshot failure
|
|
|
|
|
|
If the procedure fails with:
|
|
|
|
|
|
ganeti.errors.OpExecError: Not all disks could be snapshotted, and you did not allow the instance to remain offline for a longer time through the --long-sleep option; aborting
|
|
|
|
|
|
... try again with the VM stopped.
|
|
|
|
|
|
#### Connectivity issues
|
|
|
|
|
|
If the procedure fails during the data transfer with:
|
|
|
|
|
|
pycurl.error: (7, 'Failed to connect to chi-node-01.torproject.org port 5080: Connection refused')
|
|
|
|
|
|
or:
|
|
|
|
|
|
Disk 0 failed to send data: Exited with status 1 (recent output: dd: 0 bytes copied, 0.996381 s, 0.0 kB/s\ndd: 0 bytes copied, 5.99901 s, 0.0 kB/s\nsocat: E SSL_connect(): Connection refused)
|
|
|
|
|
|
... make sure you have the firewalls opened. Note that Puppet or other
|
|
|
things might clear out the temporary firewall rules established in the
|
|
|
preparation step.
|
|
|
|
|
|
#### DNS issues
|
|
|
|
|
|
This error:
|
|
|
|
|
|
ganeti.errors.OpPrereqError: ('The given name (metrics-psqlts-01.torproject.org.2.8.0.0.0.0.0.5.0.0.8.8.4.0.6.2.ip6.arpa) does not resolve: Name or service not known', 'resolver_error')
|
|
|
|
|
|
... means the reverse DNS on the instance has not been properly
|
|
|
configured. In this case, the fix was to add a trailing dot to the
|
|
|
`PTR` record:
|
|
|
|
|
|
```diff
--- a/2.8.0.0.0.0.0.5.0.0.8.8.4.0.6.2.ip6.arpa
+++ b/2.8.0.0.0.0.0.5.0.0.8.8.4.0.6.2.ip6.arpa
@@ -55,7 +55,7 @@ b.c.b.7.0.c.e.f.f.f.8.3.6.6.4.0 IN PTR ci-runner-x86-01.torproject.org.
 ; 2604:8800:5000:82:466:38ff:fe3c:f0a7
 7.a.0.f.c.3.e.f.f.f.8.3.6.6.4.0 IN PTR dangerzone-01.torproject.org.
 ; 2604:8800:5000:82:466:38ff:fe97:24ac
-c.a.4.2.7.9.e.f.f.f.8.3.6.6.4.0 IN PTR metrics-psqlts-01.torproject.org
+c.a.4.2.7.9.e.f.f.f.8.3.6.6.4.0 IN PTR metrics-psqlts-01.torproject.org.
 ; 2604:8800:5000:82:466:38ff:fed4:51a1
 1.a.1.5.4.d.e.f.f.f.8.3.6.6.4.0 IN PTR onion-test-01.torproject.org.
 ; 2604:8800:5000:82:466:38ff:fea3:7c78
```
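
Since a missing trailing dot is easy to overlook, a quick scan of the reverse zone file before migrating can catch it. This is a sketch under assumptions: `ptr_missing_dot` is a hypothetical helper, and it assumes one record per line in the form `<name> IN PTR <target>`, as in the zone file above.

```shell
# Hypothetical helper: print PTR records whose target lacks the trailing
# dot, and exit non-zero if any are found.
# Assumes records look like "<name> IN PTR <target>", one per line.
ptr_missing_dot() {
    awk '$2 == "IN" && $3 == "PTR" && $4 !~ /\.$/ { print; bad = 1 }
         END { exit (bad + 0) }' "$@"
}

# Example:
#   ptr_missing_dot 2.8.0.0.0.0.0.5.0.0.8.8.4.0.6.2.ip6.arpa
```

Comment lines and records with an explicit TTL do not match the pattern and are skipped, so this check is a hint rather than a full zone validation.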
|
|
|
|
|
|
#### Capacity issues
|
|
|
|
|
|
If the procedure fails with:
|
|
|
|
|
|
ganeti.errors.OpPrereqError: ('Instance allocation to group 64c116fc-1ab2-4f6d-ba91-89c65875f888 (default) violates policy: memory-size value 307200 is not in range [128, 65536]', 'wrong_input')
|
|
|
|
|
|
It's because the VM is smaller or bigger than the cluster
|
|
|
configuration allows. You need to change the `--ipolicy-bounds-specs`
|
|
|
in the cluster, see, for example, the [gnt-dal cluster
|
|
|
initialization](#gnt-dal-cluster-initialization) instructions.
|
|
|
|
|
|
If the procedure fails with:
|
|
|
|
|
|
ganeti.errors.OpPrereqError: ("Can't compute nodes using iallocator 'hail': Request failed: Group default (preferred): No valid allocation solutions, failure reasons: FailMem: 6", 'insufficient_resources')
|
|
|
|
|
|
... you may be able to work around the problem by specifying a
|
|
|
destination node by hand. Add this to the `move-instance` command, for
|
|
|
example:
|
|
|
|
|
|
--dest-primary-node=dal-node-02.torproject.org \
|
|
|
--dest-secondary-node=dal-node-03.torproject.org
|
|
|
|
|
|
The error:
|
|
|
|
|
|
ganeti.errors.OpPrereqError: Disk template 'blockdev' is not enabled in cluster. Enabled disk templates are: drbd,plain
|
|
|
|
|
|
... means that you should pass a supported `--dest-disk-template`
|
|
|
argument to the `move-instance` command.
|
|
|
|
|
|
#### Rerunning failed migrations
|
|
|
|
|
|
This error obviously means the instance already exists in the cluster:
|
|
|
|
|
|
ganeti.errors.OpPrereqError: ("Instance 'rdsys-frontend-01.torproject.org' is already in the cluster", 'already_exists')
|
|
|
|
|
|
... maybe you're retrying a failed move? In that case, delete the
|
|
|
*target* instance (yes, really make sure you delete the target, not
|
|
|
the source!!!):
|
|
|
|
|
|
gnt-instance remove --shutdown-timeout=0 test-01.torproject.org
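
Because removing the wrong instance is destructive, it may be worth guarding the removal with a check that you are on the target cluster. The sketch below is illustrative only: `is_target_cluster` is a hypothetical helper, and `dal-node-01.torproject.org` is assumed here to be the gnt-dal master, which should be verified before use.

```shell
# Hypothetical guard: succeed only if the master we are talking to
# matches the expected *target* master. Both names are passed in.
is_target_cluster() {
    expected="$1"
    actual="$2"
    # refuse empty names so a failed lookup never counts as a match
    [ -n "$expected" ] && [ "$expected" = "$actual" ]
}

# Usage sketch, on the master node (the master name is an assumption):
#   if is_target_cluster dal-node-01.torproject.org "$(gnt-cluster getmaster)"; then
#       gnt-instance remove --shutdown-timeout=0 test-01.torproject.org
#   else
#       echo "not on the target cluster, refusing to remove" >&2
#   fi
```

Passing both names explicitly keeps the comparison separate from the Ganeti call, so the guard still refuses if the lookup returns nothing.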
|
|
|
|
|
|
#### Other issues
|
|
|
|
|
|
This error is harmless and can be ignored:
|
|
|
|
|
|
WARNING: Failed to run rename script for tpa-bootstrap-01.torproject.org on node dal-node-02.torproject.org: OS rename script failed (exited with exit code 1), last lines in the log file:\nCannot rename from tpa-bootstrap-01.torproject.org to tpa-bootstrap-01.torproject.org:\nInstance has a different hostname (tpa-bootstrap-01)
|
|
|
|
|
|
It's probably a flaw in the `ganeti-instance-debootstrap` backend that
|
|
|
doesn't properly renumber the instance. We have our own renumbering
|
|
|
procedure in Fabric instead, but that could be merged inside
|
|
|
`ganeti-instance-debootstrap` eventually.
|
|
|
|
|
|
#### Tracing executed commands
|
|
|
|
|
|
Finally, to trace which commands are executed (which can be
|
|
|
challenging in Ganeti), the `execsnoop.bt` command (from the [bpftrace
|
... | ... | @@ -1823,6 +1987,11 @@ The `execsnoop` command (from the [libbpf-tools package](https://tracker.debian. |
|
|
work but it truncates the command after 128 characters ([Debian
|
|
|
1033013](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1033013), [upstream 740](https://github.com/iovisor/bcc/issues/740)).
|
|
|
|
|
|
This was used to troubleshoot the certificate issues with `socat` in
|
|
|
[upstream bug 1681](https://github.com/ganeti/ganeti/issues/1681).
|
|
|
|
|
|
[upstream bug 1696]: https://github.com/ganeti/ganeti/issues/1696
|
|
|
|
|
|
## Pager playbook
|
|
|
|
|
|
### I/O overload
|
... | ... | |