Verified Commit e465b607 authored by anarcat

document the ghost disk error that occurred after team#40910

parent ff88340e
@@ -1956,6 +1956,58 @@ to remove the logical volumes on the target node:

    ssh fsn-node-06.torproject.org -tt lvremove vg_ganeti/abf0eeac-55a0-4ccc-b8a0-adb0d8d67cf7.disk1_meta
    ssh fsn-node-06.torproject.org -tt lvremove vg_ganeti/abf0eeac-55a0-4ccc-b8a0-adb0d8d67cf7.disk1_data

### Cleaning up ghost disks
Under certain circumstances, you might end up with "ghost" disks, for
example:

    Tue Oct 4 13:24:07 2022 - ERROR: cluster : ghost disk 'ed225e68-83af-40f7-8d8c-cf7e46adad54' in temporary DRBD map

It's unclear how this happens, but in this specific case the problem
is believed to have occurred because adding a disk to an instance
being resized failed.
It's *possible* this is a situation similar to the one above, in which
case you must first find *where* the ghost disk is, with something
like:

    gnt-cluster command 'lvs --noheadings' | grep 'ed225e68-83af-40f7-8d8c-cf7e46adad54'

If this finds a device, you can remove it as normal:

    ssh fsn-node-06.torproject.org -tt lvremove vg_ganeti/ed225e68-83af-40f7-8d8c-cf7e46adad54.disk1_data

... but in this case, the DRBD map is *not* associated with a logical
volume. You can also check the `dmsetup` output for a match:

    gnt-cluster command 'dmsetup ls' | grep 'ed225e68-83af-40f7-8d8c-cf7e46adad54'

According to [this discussion](https://groups.google.com/g/ganeti/c/s5qoh26T1yA), restarting
Ganeti on all nodes might clear out the issue:

    gnt-cluster command 'service ganeti restart'

If *none* of the "ghost" disks mentioned can be found anywhere in the
cluster, either in the device mapper or in the logical volumes, the
error might just come from stray data left over in the data file.
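To check every ghost disk in one pass, the two lookups shown earlier
(`lvs` and `dmsetup`) can be wrapped in a small script. This is a
minimal, untested sketch: the function names are hypothetical, and it
assumes `gnt-cluster verify` reports each disk with the same
`ghost disk '<uuid>' in temporary DRBD map` wording as the example
above.

```shell
#!/bin/sh
# Minimal sketch; ghost_disk_uuids, ghost_disk_present and
# check_ghost_disks are hypothetical helper names, not Ganeti commands.

# Read `gnt-cluster verify` output on stdin and print each ghost disk
# UUID once. Assumes the message wording shown in the example above.
ghost_disk_uuids() {
    sed -n "s/.*ghost disk '\([0-9a-f-]*\)' in temporary DRBD map.*/\1/p" | sort -u
}

# Succeed if UUID $1 shows up in the logical volumes or the device
# mapper table of any node, using the same commands as above.
# stdin is redirected so gnt-cluster cannot consume the caller's input.
ghost_disk_present() {
    {
        gnt-cluster command 'lvs --noheadings'
        gnt-cluster command 'dmsetup ls'
    } </dev/null 2>/dev/null | grep -q -- "$1"
}

# Run the verification and report, for each ghost disk, whether it
# still exists anywhere in the cluster.
check_ghost_disks() {
    gnt-cluster verify 2>&1 | ghost_disk_uuids | while read -r uuid; do
        if ghost_disk_present "$uuid"; then
            echo "$uuid: found in lvs or dmsetup output"
        else
            echo "$uuid: not found on any node"
        fi
    done
}
```

Source the file on the master node and run `check_ghost_disks`. If
every disk is reported as not found, the `tempres.data` cleanup
described in this section may apply.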
In that case, it *looks* like the proper fix is to *remove* the
temporary file where this data is stored, on the node where the
`grep` below finds the disk:

    gnt-cluster command 'grep ed225e68-83af-40f7-8d8c-cf7e46adad54 /var/lib/ganeti/tempres.data'
    ssh ... service ganeti stop
    ssh ... rm /var/lib/ganeti/tempres.data
    ssh ... service ganeti start
    gnt-cluster verify

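The stop, remove and start steps can be scripted for a single node.
This is a hypothetical sketch: `reset_tempres` is not a Ganeti
command, its argument stands in for the `...` host in the commands
above, and, unlike the procedure above, it *moves* the file aside
instead of deleting it so it can be restored if needed.

```shell
#!/bin/sh
# Hypothetical helper: reset the tempres.data file on node $1.
# Deviation from the procedure above: the file is renamed to
# tempres.data.bak instead of being removed.
reset_tempres() {
    node=$1
    # Do not touch the file if Ganeti could not be stopped.
    ssh "$node" service ganeti stop || return 1
    ssh "$node" mv /var/lib/ganeti/tempres.data /var/lib/ganeti/tempres.data.bak
    status=$?
    # Always try to bring Ganeti back, even if the move failed.
    ssh "$node" service ganeti start || status=1
    return "$status"
}
```

Call it as `reset_tempres NODE`, with `NODE` being the node reported
by the `grep`, then run `gnt-cluster verify` again.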
That solution was proposed in [this discussion](https://groups.google.com/g/ganeti/c/SMR3yNek3Js). Anarcat toured the
Ganeti source code and found that the `ComputeDRBDMap` function, in
the Haskell codebase, basically just sucks the data out of that
`tempres.data` JSON file and dumps it into the Python side of
things. The Python code then looks up those disks in its internal
disk list and compares the two. It is therefore pretty unlikely that
the warning would appear while the disks still exist on a node.
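The comparison described above can be roughly approximated from the
shell, to confirm that a ghost disk is only referenced in
`tempres.data`. This is a heuristic, not a reimplementation of
`ComputeDRBDMap`: instead of parsing the JSON, it extracts every
UUID-shaped string from `tempres.data` and prints those that never
appear in the cluster configuration. It assumes the configuration is
`/var/lib/ganeti/config.data` and that both files are readable on the
node where it runs; the function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers; a heuristic, not Ganeti's actual check.

# Print every lowercase UUID-shaped string in file $1, once each, sorted.
extract_uuids() {
    grep -o -E '[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}' "$1" | sort -u
}

# Print UUIDs that appear in file $1 but nowhere in file $2.
uuids_only_in() {
    uuids_a=$(mktemp) || return 1
    uuids_b=$(mktemp) || return 1
    extract_uuids "$1" > "$uuids_a"
    extract_uuids "$2" > "$uuids_b"
    # comm -23 keeps only the lines unique to the first sorted input.
    comm -23 "$uuids_a" "$uuids_b"
    rm -f "$uuids_a" "$uuids_b"
}
```

For example, `uuids_only_in /var/lib/ganeti/tempres.data
/var/lib/ganeti/config.data` should print the ghost disk UUID if it
is only known to `tempres.data`. Other UUIDs in that file (nodes,
instances) are filtered out as long as the configuration mentions
them.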
### Fixing inconsistent disks
Sometimes `gnt-cluster verify` will give this error: