howto/ganeti.md +73 −41

@@ -2191,9 +2191,26 @@ possible through things like [zerofree](https://tracker.debian.org/pkg/zerofree)

### Mass migrating instances to a new cluster

The [move-instance](https://docs.ganeti.org/docs/ganeti/3.0/html/move-instance.html)
command can do this, apparently. In practice, this procedure doesn't
currently work, see the end of the section for details.

If an entire cluster needs to be evacuated, the
[move-instance](https://docs.ganeti.org/docs/ganeti/3.0/html/move-instance.html)
command can be used to automatically propagate instances between
clusters.

It currently migrates only one VM at a time (because of the `--net`
argument, a limitation which could eventually be waived), but should
be easier to do than the export/import procedure above.

Note that this procedure depends on a patched version of
`move-instance`, which was changed after the 3.0 Ganeti release, see
[this comment](https://github.com/ganeti/ganeti/issues/1696#issuecomment-1465221351)
for details. We also have three patches on top of that which fix
various issues we have found during the gnt-chi to gnt-dal migration,
see [this comment](https://github.com/ganeti/ganeti/issues/1696#issuecomment-1470693963)
and specifically the following PRs:

 * [GitHub ganeti#1697](https://github.com/ganeti/ganeti/pull/1697) - Python 3 tweak, optional
 * [GitHub ganeti#1698](https://github.com/ganeti/ganeti/pull/1698) - network configuration hack, mandatory
 * [GitHub ganeti#1699](https://github.com/ganeti/ganeti/pull/1699) - OpenSSL verification hack, mandatory

Once those patches have been deployed, use the following procedure to
migrate a VM. In this example, we migrate a VM named
`test-01.torproject.org` from the gnt-chi cluster to gnt-dal.

 1. create a new secret on the source cluster:

@@ -2230,15 +2247,24 @@ details.
        echo 38.229.82.104 chignt.torproject.org >> /etc/hosts
        echo 204.8.99.101 dalgnt.torproject.org >> /etc/hosts

    TODO: maybe those records should point at the public IP addresses
    in the normal torproject.org zonefile? Right now they point at the
    private IP space, but I'm not sure why.

 7. make RAPI listen on the public network, on both master nodes:

        echo 'RAPI_ARGS="--require-authentication"' >> /etc/default/ganeti

    TODO: add a flag in Puppet to make this configurable, so that we
    don't have to stop Puppet.

 5. enable an [API user](https://docs.ganeti.org/docs/ganeti/3.0/html/rapi.html#users-and-passwords)
    on the source *and* on the target cluster:

        echo move-instance $(tr -dc '[:alnum:]' < /dev/urandom | head -c 30) write >> /var/lib/ganeti/rapi/users
        systemctl restart ganeti

    TODO: add to Puppet

 6. enter the passwords in two files on the target cluster, for example:

@@ -2269,42 +2295,46 @@ details.

            --keep-source-instance \
            --verbose \

Note that the above procedure depends on a patched version of
`move-instance`, which was changed after the 3.0 Ganeti release, see
[this comment](https://github.com/ganeti/ganeti/issues/1696#issuecomment-1465221351)
for details.

Also note that the `/var/lib/ganeti/rapi/users` files get overwritten
by Puppet, so that might be cleaned up after (or during) your attempt.
Currently fails with:

    ==> /var/log/ganeti/jobs.log <==
    2023-03-06 21:57:25,346: job-1270 pid=1733692 ERROR Op 1/1: Caught exception in INSTANCE_CREATE(test-01.torproject.org)
    Traceback (most recent call last):
      File "/usr/share/ganeti/3.0/ganeti/jqueue/__init__.py", line 933, in _ExecOpCodeUnlocked
        result = self.opexec_fn(op.input,
      File "/usr/share/ganeti/3.0/ganeti/mcpu.py", line 705, in ExecOpCode
        result = self._LockAndExecLU(lu, locking.LEVEL_CLUSTER + 1,
      File "/usr/share/ganeti/3.0/ganeti/mcpu.py", line 631, in _LockAndExecLU
        result = self._LockAndExecLU(lu, level + 1, calc_timeout,
      File "/usr/share/ganeti/3.0/ganeti/mcpu.py", line 631, in _LockAndExecLU
        result = self._LockAndExecLU(lu, level + 1, calc_timeout,
      File "/usr/share/ganeti/3.0/ganeti/mcpu.py", line 631, in _LockAndExecLU
        result = self._LockAndExecLU(lu, level + 1, calc_timeout,
      [Previous line repeated 1 more time]
      File "/usr/share/ganeti/3.0/ganeti/mcpu.py", line 639, in _LockAndExecLU
        result = self._LockAndExecLU(lu, level + 1, calc_timeout, pending=pending)
      File "/usr/share/ganeti/3.0/ganeti/mcpu.py", line 547, in _LockAndExecLU
        result = self._ExecLU(lu)
      File "/usr/share/ganeti/3.0/ganeti/mcpu.py", line 483, in _ExecLU
        lu.CheckPrereq()
      File "/usr/share/ganeti/3.0/ganeti/cmdlib/instance_create.py", line 827, in CheckPrereq
        self.nics = ComputeNics(self.op, cluster, self.check_ip, self.cfg,
      File "/usr/share/ganeti/3.0/ganeti/cmdlib/instance_utils.py", line 1240, in ComputeNics
        raise errors.OpPrereqError("If network is given, no mode or link"
    ganeti.errors.OpPrereqError: ('If network is given, no mode or link is allowed to be passed', 'wrong_input')

Reported as [bug 1696 upstream](https://github.com/ganeti/ganeti/issues/1696), blocked.

 9. finally, the IP address inside the VM must be changed:

        tsa-misc$ ./ganeti -H test-01.torproject.org -v renumber-instance dal-node-02.torproject.org

    Note how we use the name of the Ganeti node where the VM resides.
    TODO: the above rewrites `/etc/network/interfaces` while many VMs
    actually configure `/etc/network/interfaces.d/eth0` instead

This procedure was tested on a test VM migrating from gnt-chi to
gnt-dal, see [tpo/tpa/team#40972][] for the gory details.

[tpo/tpa/team#40972]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40972

#### Troubleshooting

If the above procedure doesn't work, try again with `--debug` instead
of `--verbose`, you might see extra error messages. The import/export
logs can also be found in `/var/log/ganeti/os/...`.

Finally, to trace which commands are executed (which can be
challenging in Ganeti), the `execsnoop.bt` command (from the
[bpftrace package](https://tracker.debian.org/bpftrace)) is
invaluable. Make sure `debugfs` is mounted first and the package
installed:

    mount -t debugfs debugfs /sys/kernel/debug
    apt install bpftrace

Then simply run:

    execsnoop.bt

This will show *every* [`execve(2)`](https://manpages.debian.org/execve.2)
system call executed on the system. Filtering is probably a good idea,
in my case I was doing:

    execsnoop.bt | grep socat

The `execsnoop` command (from the
[libbpf-tools package](https://tracker.debian.org/libbbpf-tools)) may
also work but it truncates the command after 128 characters
([Debian 1033013](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1033013),
[upstream 740](https://github.com/iovisor/bcc/issues/740)).

### Reboot procedures

@@ -3183,11 +3213,13 @@ Upstream Ganeti has of course its own [issue tracker on GitHub](https://github.c

## Logs and metrics

Ganeti logs a significant amount of information in
`/var/log/ganeti.log`. Those logs are of particular interest:
`/var/log/ganeti/`.
Those logs are of particular interest:

 * `node-daemon.log`: all low-level commands and HTTP requests on the
   node daemon, includes, for example, LVM and DRBD commands
 * `os/*$hostname*.log`: installation log for machine `$hostname`
 * `os/*$hostname*.log`: installation log for machine `$hostname`, this
   also includes VM migration logs for the `move-instance` or
   `gnt-instance export` commands

It does not expose performance metrics that are digested by Prometheus
right now, but that would be an interesting feature to add.
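
When debugging a migration, it can help to find the per-instance logs quickly. The following is only a minimal sketch: it assumes the logs live under `/var/log/ganeti/os` and that their file names contain the instance hostname, as the `os/*$hostname*.log` pattern suggests. The `ganeti_os_logs` function name is hypothetical and not part of Ganeti.

```shell
# Hypothetical helper, not shipped with Ganeti: list the OS logs
# (installation and import/export) for one instance, newest first.
#
# $1: instance hostname (or any fragment of it)
# $2: log directory, assumed to default to /var/log/ganeti/os
ganeti_os_logs() {
    # ls -t sorts by modification time, newest first; the error is
    # silenced when no file matches the glob
    ls -1t "${2:-/var/log/ganeti/os}"/*"$1"*.log 2>/dev/null
}
```

For example, `ganeti_os_logs test-01.torproject.org | head -1` would print the most recent log for the test VM, which can then be passed to `less` or `tail -f`.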