Commit 6e19b4be authored by anarcat

update move-instance documentation with latest success (team#40972)

parent 9353163d
+73 −41
@@ -2191,9 +2191,26 @@ possible through things like [zerofree](https://tracker.debian.org/pkg/zerofree)

### Mass migrating instances to a new cluster

The [move-instance](https://docs.ganeti.org/docs/ganeti/3.0/html/move-instance.html) command can do this, apparently. In practice,
this procedure doesn't currently work, see the end of the section for
details.
If an entire cluster needs to be evacuated, the [move-instance](https://docs.ganeti.org/docs/ganeti/3.0/html/move-instance.html) command can
be used to automatically move instances between clusters. It
currently migrates only one VM at a time (because of the `--net`
argument, a limitation which could eventually be lifted), but should
be easier than the export/import procedure above.

Note that this procedure depends on a patched version of
`move-instance`, which was changed after the 3.0 Ganeti release; see
[this comment](https://github.com/ganeti/ganeti/issues/1696#issuecomment-1465221351) for details. We also have three patches on top of
that, which fix various issues we found during the gnt-chi to
gnt-dal migration; see [this comment](https://github.com/ganeti/ganeti/issues/1696#issuecomment-1470693963) and specifically the
following PRs:

 * [GitHub ganeti#1697](https://github.com/ganeti/ganeti/pull/1697) - Python 3 tweak, optional
 * [GitHub ganeti#1698](https://github.com/ganeti/ganeti/pull/1698) - network configuration hack, mandatory
 * [GitHub ganeti#1699](https://github.com/ganeti/ganeti/pull/1699) - OpenSSL verification hack, mandatory

Once those patches have been deployed, use the following procedure to
migrate a VM. In this example, we migrate a VM named
`test-01.torproject.org` from the gnt-chi cluster to gnt-dal.

 1. create a new secret on the source cluster:

@@ -2230,15 +2247,24 @@ details.
        echo 38.229.82.104   chignt.torproject.org >> /etc/hosts
        echo 204.8.99.101    dalgnt.torproject.org >> /etc/hosts

    TODO: maybe those records should point at the public IP addresses
    in the normal torproject.org zonefile? Right now they point at the
    private IP space, but I'm not sure why.

 7. make RAPI listen on the public network, on both master nodes:
 
        echo 'RAPI_ARGS="--require-authentication"' >> /etc/default/ganeti

    TODO: add a flag in Puppet to make this configurable, so that we
    don't have to stop Puppet.

 5. enable an [API user](https://docs.ganeti.org/docs/ganeti/3.0/html/rapi.html#users-and-passwords) on the source *and* on the target cluster:

        echo move-instance $(tr -dc '[:alnum:]' < /dev/urandom | head -c 30) write >> /var/lib/ganeti/rapi/users
        systemctl restart ganeti

    TODO: add to Puppet

 6. enter the passwords in two files on the target cluster, for
    example:
    
@@ -2269,42 +2295,46 @@ details.
            --keep-source-instance \
            --verbose \

Note that the above procedure depends on a patched version of
`move-instance`, which was changed after the 3.0 Ganeti release, see
[this comment](https://github.com/ganeti/ganeti/issues/1696#issuecomment-1465221351) for details.

Also note that the `/var/lib/ganeti/rapi/users` files get overwritten
by Puppet, so the API user might be removed after (or during) your attempt.

Currently fails with:

    ==> /var/log/ganeti/jobs.log <==
    2023-03-06 21:57:25,346: job-1270 pid=1733692 ERROR Op 1/1: Caught exception in INSTANCE_CREATE(test-01.torproject.org)
    Traceback (most recent call last):
      File "/usr/share/ganeti/3.0/ganeti/jqueue/__init__.py", line 933, in _ExecOpCodeUnlocked
        result = self.opexec_fn(op.input,
      File "/usr/share/ganeti/3.0/ganeti/mcpu.py", line 705, in ExecOpCode
        result = self._LockAndExecLU(lu, locking.LEVEL_CLUSTER + 1,
      File "/usr/share/ganeti/3.0/ganeti/mcpu.py", line 631, in _LockAndExecLU
        result = self._LockAndExecLU(lu, level + 1, calc_timeout,
      File "/usr/share/ganeti/3.0/ganeti/mcpu.py", line 631, in _LockAndExecLU
        result = self._LockAndExecLU(lu, level + 1, calc_timeout,
      File "/usr/share/ganeti/3.0/ganeti/mcpu.py", line 631, in _LockAndExecLU
        result = self._LockAndExecLU(lu, level + 1, calc_timeout,
      [Previous line repeated 1 more time]
      File "/usr/share/ganeti/3.0/ganeti/mcpu.py", line 639, in _LockAndExecLU
        result = self._LockAndExecLU(lu, level + 1, calc_timeout, pending=pending)
      File "/usr/share/ganeti/3.0/ganeti/mcpu.py", line 547, in _LockAndExecLU
        result = self._ExecLU(lu)
      File "/usr/share/ganeti/3.0/ganeti/mcpu.py", line 483, in _ExecLU
        lu.CheckPrereq()
      File "/usr/share/ganeti/3.0/ganeti/cmdlib/instance_create.py", line 827, in CheckPrereq
        self.nics = ComputeNics(self.op, cluster, self.check_ip, self.cfg,
      File "/usr/share/ganeti/3.0/ganeti/cmdlib/instance_utils.py", line 1240, in ComputeNics
        raise errors.OpPrereqError("If network is given, no mode or link"
    ganeti.errors.OpPrereqError: ('If network is given, no mode or link is allowed to be passed', 'wrong_input')

Reported as [bug 1696 upstream](https://github.com/ganeti/ganeti/issues/1696), blocked.
 9. finally, the IP address inside the VM must be changed:
 
        tsa-misc$ ./ganeti -H test-01.torproject.org -v renumber-instance dal-node-02.torproject.org 

    Note how we use the name of the Ganeti node where the VM resides.

    TODO: the above rewrites `/etc/network/interfaces` while many VMs
    actually configure `/etc/network/interfaces.d/eth0` instead
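
Since Puppet may overwrite `/var/lib/ganeti/rapi/users`, the API user
step may need to be repeated. The following is a minimal sketch of an
idempotent variant of that step. The `ensure_rapi_user` function name is
hypothetical, and the password is generated with a bounded read from
`/dev/urandom` rather than the exact one-liner from the procedure:

```shell
# Hypothetical helper: add a RAPI user only when the users file lacks one.
ensure_rapi_user() {
    # $1: user name, $2: users file (defaults to the path used in the procedure)
    local user="$1" file="${2:-/var/lib/ganeti/rapi/users}" password
    # Nothing to do if a line for this user already exists.
    if [ -f "$file" ] && grep -q "^${user} " "$file"; then
        return 0
    fi
    # 30 random alphanumeric characters, as in the procedure; reading a
    # bounded 1024 bytes avoids leaving tr blocked on an endless stream.
    password=$(head -c 1024 /dev/urandom | LC_ALL=C tr -dc '[:alnum:]' | head -c 30)
    printf '%s %s write\n' "$user" "$password" >> "$file"
}
```

Calling `ensure_rapi_user move-instance` as root on each master would
then only append the line when it is missing; the `systemctl restart
ganeti` from the procedure is still needed afterwards.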

This procedure was tested on a test VM migrating from gnt-chi to
gnt-dal, see [tpo/tpa/team#40972][] for the gory details.

[tpo/tpa/team#40972]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40972

#### Troubleshooting

If the above procedure doesn't work, try again with `--debug` instead
of `--verbose`; you might see extra error messages. The import/export
logs can also be found in `/var/log/ganeti/os/...`.

Finally, to trace which commands are executed (which can be
challenging in Ganeti), the `execsnoop.bt` command (from the [bpftrace
package](https://tracker.debian.org/pkg/bpftrace)) is invaluable. Make sure `debugfs` is mounted first
and the package installed:

    mount -t debugfs debugfs /sys/kernel/debug
    apt install bpftrace

Then simply run:

    execsnoop.bt

This will show *every* [`execve(2)`](https://manpages.debian.org/execve.2) system call executed on the
system. Filtering is probably a good idea; in my case I was doing:

    execsnoop.bt | grep socat

The `execsnoop` command (from the [libbpf-tools package](https://tracker.debian.org/pkg/libbpf-tools)) may also
work, but it truncates the command after 128 characters ([Debian
1033013](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1033013), [upstream 740](https://github.com/iovisor/bcc/issues/740)).
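
When the filtered trace is piped further (for example into `tee`),
`grep` buffers its output by default and matches can show up late. A
small sketch of a wrapper that flushes each match immediately follows.
The `filter_execs` name and the example pattern are illustrative
assumptions, not part of the procedure, and `--line-buffered` is a GNU
grep option:

```shell
# Hypothetical helper: keep only exec events matching an extended regex.
filter_execs() {
    # --line-buffered flushes every matching line as soon as it is seen,
    # which matters when the output is piped or redirected during a live trace.
    grep --line-buffered -E -- "$1"
}

# Usage (needs root and the bpftrace package):
#   execsnoop.bt | filter_execs 'socat|dd' | tee /tmp/exec-trace.log
```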

### Reboot procedures

@@ -3183,11 +3213,13 @@ Upstream Ganeti has of course its own [issue tracker on GitHub](https://github.c
## Logs and metrics

Ganeti logs a significant amount of information in
`/var/log/ganeti.log`. Those logs are of particular interest:
`/var/log/ganeti/`. Those logs are of particular interest:

 * `node-daemon.log`: all low-level commands and HTTP requests on the
   node daemon, includes, for example, LVM and DRBD commands
 * `os/*$hostname*.log`: installation log for machine `$hostname`
 * `os/*$hostname*.log`: installation log for machine `$hostname`;
   this also includes VM migration logs for the `move-instance` or
   `gnt-backup export` commands

It does not expose performance metrics that are digested by Prometheus
right now, but that would be an interesting feature to add.
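
To find the most recent `os/` log for a given machine, for example
while following a migration, something like the sketch below may help.
The `latest_os_log` name is hypothetical, and the glob simply mirrors
the `os/*$hostname*.log` pattern listed above:

```shell
# Hypothetical helper: print the newest OS log matching a hostname.
latest_os_log() {
    # $1: hostname, $2: log directory (defaults to the path named above)
    local hostname="$1" dir="${2:-/var/log/ganeti/os}" newest="" f
    for f in "$dir"/*"$hostname"*.log; do
        # Skip the unexpanded glob when nothing matches.
        [ -e "$f" ] || continue
        # Keep the file with the most recent modification time.
        if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
            newest="$f"
        fi
    done
    [ -n "$newest" ] && printf '%s\n' "$newest"
}
```

For instance, `tail -f "$(latest_os_log test-01.torproject.org)"` would
follow the latest log for the test VM used in the migration procedure.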