... | ... | @@ -2191,9 +2191,26 @@ possible through things like [zerofree](https://tracker.debian.org/pkg/zerofree) |
|
|
|
|
|
### Mass migrating instances to a new cluster
|
|
|
|
|
|
The [move-instance](https://docs.ganeti.org/docs/ganeti/3.0/html/move-instance.html) command can do this, apparently. In practice,
|
|
|
this procedure doesn't currently work, see the end of the section for
|
|
|
details.
|
|
|
If an entire cluster needs to be evacuated, the [move-instance](https://docs.ganeti.org/docs/ganeti/3.0/html/move-instance.html)
command can be used to automatically move instances between clusters.
It currently migrates only one VM at a time (because of the `--net`
argument, a limitation which could eventually be lifted), but it should
be easier than the export/import procedure above.
|
|
|
|
|
|
Note that this procedure depends on a patched version of
`move-instance`, which was changed after the 3.0 Ganeti release; see
[this comment](https://github.com/ganeti/ganeti/issues/1696#issuecomment-1465221351) for details. We also carry three patches on top of
that, which fix various issues we found during the gnt-chi to
gnt-dal migration; see [this comment](https://github.com/ganeti/ganeti/issues/1696#issuecomment-1470693963) and specifically the
following PRs:
|
|
|
|
|
|
* [GitHub ganeti#1697](https://github.com/ganeti/ganeti/pull/1697) - Python 3 tweak, optional
|
|
|
* [GitHub ganeti#1698](https://github.com/ganeti/ganeti/pull/1698) - network configuration hack, mandatory
|
|
|
* [GitHub ganeti#1699](https://github.com/ganeti/ganeti/pull/1699) - OpenSSL verification hack, mandatory
|
|
|
|
|
|
Once those patches have been deployed, use the following procedure to
|
|
|
migrate a VM. In this example, we migrate a VM named
|
|
|
`test-01.torproject.org` from the gnt-chi cluster to gnt-dal.
|
|
|
|
|
|
1. create a new secret on the source cluster:
|
|
|
|
... | ... | @@ -2230,15 +2247,24 @@ details. |
|
|
echo 38.229.82.104 chignt.torproject.org >> /etc/hosts
|
|
|
echo 204.8.99.101 dalgnt.torproject.org >> /etc/hosts
|
|
|
|
|
|
TODO: maybe those records should point at the public IP addresses
|
|
|
in the normal torproject.org zonefile? Right now it points at the
|
|
|
private IP space, but I'm not sure why.
|
|
|
|
|
|
7. make RAPI listen on the public network, on both master nodes:
|
|
|
|
|
|
echo 'RAPI_ARGS="--require-authentication"' >> /etc/default/ganeti
|
|
|
|
|
|
TODO: add a flag in Puppet to make this configurable, so that we
|
|
|
don't have to stop Puppet.
|
|
|
|
|
|
5. enable an [API user](https://docs.ganeti.org/docs/ganeti/3.0/html/rapi.html#users-and-passwords) on the source *and* on the target cluster:
|
|
|
|
|
|
echo move-instance $(tr -dc '[:alnum:]' < /dev/urandom | head -c 30) write >> /var/lib/ganeti/rapi/users
|
|
|
systemctl restart ganeti
|
|
|
|
|
|
TODO: add to Puppet
|
|
|
|
|
|
6. enter the passwords in two files on the target cluster, for
|
|
|
example:
|
|
|
|
... | ... | @@ -2269,42 +2295,46 @@ details. |
|
|
--keep-source-instance \
|
|
|
--verbose \
|
|
|
|
|
|
Note that the above procedure depends on a patched version of
|
|
|
`move-instance`, which was changed after the 3.0 Ganeti release, see
|
|
|
[this comment](https://github.com/ganeti/ganeti/issues/1696#issuecomment-1465221351) for details.
|
|
|
|
|
|
Also note that the `/var/lib/ganeti/rapi/users` files get overwritten
by Puppet, so the API user might be removed after (or during) your attempt.
|
|
|
|
|
|
Currently fails with:
|
|
|
|
|
|
==> /var/log/ganeti/jobs.log <==
|
|
|
2023-03-06 21:57:25,346: job-1270 pid=1733692 ERROR Op 1/1: Caught exception in INSTANCE_CREATE(test-01.torproject.org)
|
|
|
Traceback (most recent call last):
|
|
|
File "/usr/share/ganeti/3.0/ganeti/jqueue/__init__.py", line 933, in _ExecOpCodeUnlocked
|
|
|
result = self.opexec_fn(op.input,
|
|
|
File "/usr/share/ganeti/3.0/ganeti/mcpu.py", line 705, in ExecOpCode
|
|
|
result = self._LockAndExecLU(lu, locking.LEVEL_CLUSTER + 1,
|
|
|
File "/usr/share/ganeti/3.0/ganeti/mcpu.py", line 631, in _LockAndExecLU
|
|
|
result = self._LockAndExecLU(lu, level + 1, calc_timeout,
|
|
|
File "/usr/share/ganeti/3.0/ganeti/mcpu.py", line 631, in _LockAndExecLU
|
|
|
result = self._LockAndExecLU(lu, level + 1, calc_timeout,
|
|
|
File "/usr/share/ganeti/3.0/ganeti/mcpu.py", line 631, in _LockAndExecLU
|
|
|
result = self._LockAndExecLU(lu, level + 1, calc_timeout,
|
|
|
[Previous line repeated 1 more time]
|
|
|
File "/usr/share/ganeti/3.0/ganeti/mcpu.py", line 639, in _LockAndExecLU
|
|
|
result = self._LockAndExecLU(lu, level + 1, calc_timeout, pending=pending)
|
|
|
File "/usr/share/ganeti/3.0/ganeti/mcpu.py", line 547, in _LockAndExecLU
|
|
|
result = self._ExecLU(lu)
|
|
|
File "/usr/share/ganeti/3.0/ganeti/mcpu.py", line 483, in _ExecLU
|
|
|
lu.CheckPrereq()
|
|
|
File "/usr/share/ganeti/3.0/ganeti/cmdlib/instance_create.py", line 827, in CheckPrereq
|
|
|
self.nics = ComputeNics(self.op, cluster, self.check_ip, self.cfg,
|
|
|
File "/usr/share/ganeti/3.0/ganeti/cmdlib/instance_utils.py", line 1240, in ComputeNics
|
|
|
raise errors.OpPrereqError("If network is given, no mode or link"
|
|
|
ganeti.errors.OpPrereqError: ('If network is given, no mode or link is allowed to be passed', 'wrong_input')
|
|
|
|
|
|
Reported as [bug 1696 upstream](https://github.com/ganeti/ganeti/issues/1696), blocked.
|
|
|
9. finally, the IP address inside the VM must be changed:
|
|
|
|
|
|
tsa-misc$ ./ganeti -H test-01.torproject.org -v renumber-instance dal-node-02.torproject.org
|
|
|
|
|
|
Note how we use the name of the Ganeti node where the VM resides.
|
|
|
|
|
|
TODO: the above rewrites `/etc/network/interfaces` while many VMs
|
|
|
actually configure `/etc/network/interfaces.d/eth0` instead
|
|
|
|
|
|
This procedure was tested on a test VM migrating from gnt-chi to
|
|
|
gnt-dal, see [tpo/tpa/team#40972][] for the gory details.
|
|
|
|
|
|
[tpo/tpa/team#40972]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40972
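
Since Puppet may overwrite the RAPI users file, the "enable an API user" step above may need to be repeated. The following is only a sketch of how that step could be made safe to re-run; the `add_rapi_user` function name and its optional second argument are hypothetical, and the password generation is a bounded variant of the command shown in the procedure:

```shell
# Hypothetical helper: add a RAPI user with "write" access unless a line
# for that user already exists. The file path defaults to the one used in
# the procedure above; pass another path as second argument for testing.
add_rapi_user() {
    user="$1"
    users_file="${2:-/var/lib/ganeti/rapi/users}"
    if grep -q -- "^${user} " "$users_file" 2>/dev/null; then
        echo "user ${user} already present in ${users_file}" >&2
        return 0
    fi
    # 30 random alphanumeric characters, read from a bounded chunk of
    # /dev/urandom so that no process in the pipeline is killed by SIGPIPE
    password=$(head -c 1024 /dev/urandom | LC_ALL=C tr -dc '[:alnum:]' | head -c 30)
    echo "${user} ${password} write" >> "$users_file"
}
```

It would be called as `add_rapi_user move-instance` on each master node, followed by `systemctl restart ganeti` as in the original step.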
|
|
|
|
|
|
#### Troubleshooting
|
|
|
|
|
|
If the above procedure doesn't work, try again with `--debug` instead
of `--verbose`; you might see extra error messages. The import/export
logs can also be found in `/var/log/ganeti/os/...`.
|
|
|
|
|
|
Finally, to trace which commands are executed (which can be
challenging in Ganeti), the `execsnoop.bt` command (from the
[bpftrace package](https://tracker.debian.org/bpftrace)) is invaluable. Make sure `debugfs` is
mounted first and the package is installed:
|
|
|
|
|
|
mount -t debugfs debugfs /sys/kernel/debug
|
|
|
apt install bpftrace
|
|
|
|
|
|
Then simply run:
|
|
|
|
|
|
execsnoop.bt
|
|
|
|
|
|
This will show *every* [`execve(2)`](https://manpages.debian.org/execve.2) system call executed on the
system. Filtering is probably a good idea; in my case I was doing:
|
|
|
|
|
|
execsnoop.bt | grep socat
|
|
|
|
|
|
The `execsnoop` command (from the [libbpf-tools package](https://tracker.debian.org/libbpf-tools)) may also
work, but it truncates the command after 128 characters ([Debian
1033013](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1033013), [upstream 740](https://github.com/iovisor/bcc/issues/740)).
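
The `mount` command above can complain if `debugfs` is already mounted. As a rough sketch, the check below only mounts it when needed; the `is_mounted` function name is hypothetical, and it assumes the mount table is readable from `/proc/mounts`:

```shell
# Hypothetical helper: succeed if the given mount point appears in the
# second column of a mounts table (defaults to /proc/mounts).
is_mounted() {
    mount_point="$1"
    mounts_file="${2:-/proc/mounts}"
    awk -v mp="$mount_point" '$2 == mp { found = 1 } END { exit !found }' "$mounts_file"
}

# Usage (as root): only mount debugfs when it is not already there.
# is_mounted /sys/kernel/debug || mount -t debugfs debugfs /sys/kernel/debug
```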
|
|
|
|
|
|
### Reboot procedures
|
|
|
|
... | ... | @@ -3183,11 +3213,13 @@ Upstream Ganeti has of course its own [issue tracker on GitHub](https://github.c |
|
|
## Logs and metrics
|
|
|
|
|
|
Ganeti logs a significant amount of information in
|
|
|
`/var/log/ganeti.log`. Those logs are of particular interest:
|
|
|
`/var/log/ganeti/`. Those logs are of particular interest:
|
|
|
|
|
|
* `node-daemon.log`: all low-level commands and HTTP requests on the
|
|
|
node daemon, includes, for example, LVM and DRBD commands
|
|
|
* `os/*$hostname*.log`: installation log for machine `$hostname`
|
|
|
* `os/*$hostname*.log`: installation log for machine `$hostname`,
|
|
|
this also includes VM migration logs for the `move-instance` or
|
|
|
`gnt-instance export` commands
|
|
|
|
|
|
It does not expose performance metrics that are digested by Prometheus
|
|
|
right now, but that would be an interesting feature to add.
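
When debugging an installation or a migration, it can help to jump straight to the newest `os/` log for a given machine. This is a minimal sketch, assuming the `os/*$hostname*.log` naming described above; the `latest_os_log` function name and its optional second argument are hypothetical:

```shell
# Hypothetical helper: print the most recently modified OS log for a host.
# The log directory defaults to /var/log/ganeti; pass another directory as
# second argument for testing. Prints nothing if no log matches.
latest_os_log() {
    hostname="$1"
    log_dir="${2:-/var/log/ganeti}"
    ls -t "$log_dir"/os/*"$hostname"*.log 2>/dev/null | head -n 1
}
```

For example, `less "$(latest_os_log test-01.torproject.org)"` would open the latest log for the test VM used in the migration procedure, if one exists.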
|
... | ... | |