Commit e465b607 authored by anarcat

document the ghost disk error that occurred after team#40910

parent ff88340e
+52 −0
@@ -1956,6 +1956,58 @@ to remove the logical volumes on the target node:
    ssh fsn-node-06.torproject.org -tt lvremove vg_ganeti/abf0eeac-55a0-4ccc-b8a0-adb0d8d67cf7.disk1_meta
    ssh fsn-node-06.torproject.org -tt lvremove vg_ganeti/abf0eeac-55a0-4ccc-b8a0-adb0d8d67cf7.disk1_data

### Cleaning up ghost disks

Under certain circumstances, you might end up with "ghost" disks, for
example:

    Tue Oct  4 13:24:07 2022   - ERROR: cluster : ghost disk 'ed225e68-83af-40f7-8d8c-cf7e46adad54' in temporary DRBD map

It's unclear how this happens, but in this specific case it is
believed the problem occurred because a disk failed to add to an
instance being resized.

It's *possible* this is a situation similar to the one above, in which
case you must first find *where* the ghost disk is, with something
like:

    gnt-cluster command 'lvs --noheadings' | grep 'ed225e68-83af-40f7-8d8c-cf7e46adad54'

If this finds a device, you can remove it as normal:

    ssh fsn-node-06.torproject.org -tt lvremove vg_ganeti/ed225e68-83af-40f7-8d8c-cf7e46adad54.disk1_data

... but in this case, the DRBD map is *not* associated with a logical
volume. You can also check the `dmsetup` output for a match:

    gnt-cluster command 'dmsetup ls' | grep 'ed225e68-83af-40f7-8d8c-cf7e46adad54'
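
When `gnt-cluster verify` reports several ghost disks, the two
searches above can be wrapped in small helpers. This is a minimal
sketch, not a tested procedure: the function names `ghost_disk_uuids`
and `find_ghost_disk` are made up for illustration, and the `sed`
pattern assumes the error line has exactly the format shown in the
example above.

```shell
# Hypothetical helper: print the unique UUIDs of all ghost disks found
# in `gnt-cluster verify` output read from stdin. Assumes the error
# format shown in the example above.
ghost_disk_uuids() {
    sed -n "s/.*ghost disk '\([0-9a-f-]*\)' in temporary DRBD map.*/\1/p" | sort -u
}

# Hypothetical helper: search logical volumes and device mapper entries
# on all nodes for one UUID, using the same commands as above.
# Returns 0 if either search printed a match, 1 otherwise.
find_ghost_disk() {
    uuid="$1"
    found=1
    gnt-cluster command 'lvs --noheadings' | grep -- "$uuid" && found=0
    gnt-cluster command 'dmsetup ls' | grep -- "$uuid" && found=0
    return "$found"
}

# Usage, on the Ganeti master:
#   for uuid in $(gnt-cluster verify 2>&1 | ghost_disk_uuids); do
#       find_ghost_disk "$uuid" || echo "$uuid: not found anywhere"
#   done
```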

According to [this discussion](https://groups.google.com/g/ganeti/c/s5qoh26T1yA), restarting Ganeti on all
nodes might clear the issue:

    gnt-cluster command 'service ganeti restart'

If *none* of the "ghost" disks mentioned are found anywhere in the
cluster, either in the device mapper or in the logical volumes, they
might just be stray data left over in the data file.

So it *looks* like the proper way to clear them is to *remove* the
temporary file where this data is stored:

    gnt-cluster command 'grep ed225e68-83af-40f7-8d8c-cf7e46adad54 /var/lib/ganeti/tempres.data'
    ssh ... service ganeti stop
    ssh ... rm /var/lib/ganeti/tempres.data
    ssh ... service ganeti start
    gnt-cluster verify
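
Since this deletes Ganeti state, it can help to print the commands for
review before running them. The sketch below only *prints* the
sequence for one node and runs nothing. The function name
`tempres_cleanup_plan` and the node name in the usage line are
placeholders, and moving the file aside to a `.bak` copy instead of
deleting it is a precaution assumed here, not part of the procedure
above.

```shell
# Hypothetical helper: print (do not run) the cleanup commands for the
# node given as first argument. The file is moved aside rather than
# removed, which is an assumption of this sketch.
tempres_cleanup_plan() {
    node="$1"
    file=/var/lib/ganeti/tempres.data
    printf 'ssh %s service ganeti stop\n' "$node"
    printf 'ssh %s mv %s %s.bak\n' "$node" "$file" "$file"
    printf 'ssh %s service ganeti start\n' "$node"
}

# Usage (node.example.org is a placeholder):
#   tempres_cleanup_plan node.example.org
```

After running the printed commands by hand, `gnt-cluster verify`
should be run again as above to confirm the warning is gone.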

That solution was proposed in [this discussion](https://groups.google.com/g/ganeti/c/SMR3yNek3Js). Anarcat toured the
Ganeti source code and found that the `ComputeDRBDMap` function, in
the Haskell codebase, basically just sucks the data out of that
`tempres.data` JSON file, and dumps it into the Python side of
things. Then the Python code looks for those disks in its internal
disk list and compares the two. It's therefore pretty unlikely that
the warning would appear while the disks are still around.
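
Conceptually, the comparison described above amounts to a set
difference: any disk listed in the temporary DRBD map but absent from
the known disk list is reported as a ghost. The sketch below only
illustrates that idea on two plain text files with one UUID per line.
It does not parse the real `tempres.data` JSON, and the function name
`ghost_candidates` is hypothetical.

```shell
# Hypothetical illustration: print every line of the first file (UUIDs
# from the temporary DRBD map) that does not appear as a whole line in
# the second file (UUIDs of known disks).
ghost_candidates() {
    grep -vxF -f "$2" "$1"
}

# Usage, with two hand-made lists:
#   ghost_candidates drbd-map-uuids.txt known-disk-uuids.txt
```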

### Fixing inconsistent disks

Sometimes `gnt-cluster verify` will give this error: