Unverified Commit acda93e5 authored by anarcat

more ganeti tricks i found in the last rebalancing

parent 6f49bd94
@@ -230,6 +230,39 @@ everything, be very careful with it:
gnt-instance remove test01.torproject.org
## Getting information
Information about an instance can be found in the rather verbose
`gnt-instance info`:
root@fsn-node-01:~# gnt-instance info tb-build-02.torproject.org
- Instance name: tb-build-02.torproject.org
UUID: 8e9f3ca6-204f-4b6c-8e3e-6a8fda137c9b
Serial number: 5
Creation time: 2020-12-15 14:06:41
Modification time: 2020-12-15 14:07:31
State: configured to be up, actual state is up
Nodes:
- primary: fsn-node-03.torproject.org
group: default (UUID 8c32fd09-dc4c-4237-9dd2-3da3dfd3189e)
- secondaries: fsn-node-04.torproject.org (group default, group UUID 8c32fd09-dc4c-4237-9dd2-3da3dfd3189e)
Operating system: debootstrap+buster
A quicker way to show only the primary and secondary nodes for a
given instance is:
gnt-instance info tb-build-02.torproject.org | grep -A 3 Nodes
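For the instance above, that yields something like:

    Nodes:
    - primary: fsn-node-03.torproject.org
    group: default (UUID 8c32fd09-dc4c-4237-9dd2-3da3dfd3189e)
    - secondaries: fsn-node-04.torproject.org (group default, group UUID 8c32fd09-dc4c-4237-9dd2-3da3dfd3189e)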
An equivalent command will show the primary and secondary for *all*
instances, along with extra information (like the CPU count, memory
and disk usage):
gnt-instance list -o pnode,snodes,name,be/vcpus,be/memory,disk_usage,disk_template,status | sort
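If you are feeding that list into a script, the generic `gnt-instance
list` options `--no-headers` and `--separator` give a more
machine-readable output; for example (a sketch, adjust the field list
to taste):

    gnt-instance list --no-headers --separator=' ' \
        -o pnode,snodes,name,be/vcpus,be/memory,disk_usage,disk_template,status | sort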
It can be useful to run this in a loop to see changes:
watch -n5 -d 'gnt-instance list -o pnode,snodes,name,be/vcpus,be/memory,disk_usage,disk_template,status | sort'
## Disk operations (DRBD)
Instances should be set up using the DRBD backend, in which case you
@@ -242,7 +275,9 @@ not be necessary.
This will list instances repeatedly, but also show their assigned
memory, and compare it with the node's capacity:
watch -n5 -d 'gnt-instance list -o pnode,name,be/vcpus,be/memory,disk_usage,disk_template,status | sort; echo; gnt-node list'
gnt-instance list -o pnode,name,be/vcpus,be/memory,disk_usage,disk_template,status | sort &&
echo &&
gnt-node list
The latter does not show disk usage for secondary volume groups (see
[upstream issue 1379](https://github.com/ganeti/ganeti/issues/1379)); for a complete picture of disk usage, use:
@@ -870,6 +905,40 @@ to work around that issue, but [those do not work for secondary
instances](https://github.com/ganeti/ganeti/issues/1497). For this we would need to set up [node groups](http://docs.ganeti.org/ganeti/current/html/man-gnt-group.html)
instead.
Another option is to specifically look for instances that do not have
an HDD and migrate only those. In my situation, `gnt-cluster verify`
was complaining that `fsn-node-02` was full, so I looked at all the
instances on that node and found the ones without an HDD:
    gnt-instance list -o pnode,snodes,name,be/vcpus,be/memory,disk_usage,disk_template,status \
        | sort | grep 'fsn-node-02' | awk '{print $3}' |
        while read instance ; do
            printf "checking %s: " "$instance"
            if gnt-instance info "$instance" | grep -q hdd ; then
                echo "HAS HDD"
            else
                echo "NO HDD"
            fi
        done
Then you can manually `migrate -f` (to fail over to the secondary) and
`replace-disks -n` (to find another secondary) the instances that
*can* be migrated out of the first four machines (which have HDDs) to
the last three (which do not). Look at the memory usage in `gnt-node
list` to pick the best node.
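As a sketch, using the throwaway instance name from earlier and an
arbitrary target node, that boils down to something like:

    # fail over to the current secondary (which becomes the new primary)
    gnt-instance migrate -f test01.torproject.org
    # then pick a new secondary, here fsn-node-06, chosen from `gnt-node list`
    gnt-instance replace-disks -n fsn-node-06.torproject.org test01.torproject.org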
In general, if a given node in the first four is overloaded, a good
trick is to look for an instance that can be failed over, with, for example:
gnt-instance list -o pnode,snodes,name,be/vcpus,be/memory,disk_usage,disk_template,status | sort | grep '^fsn-node-0[1234]' | grep 'fsn-node-0[5678]'
... or, for a particular node (say fsn-node-04):
gnt-instance list -o pnode,snodes,name,be/vcpus,be/memory,disk_usage,disk_template,status | sort | grep ^fsn-node-04 | grep 'fsn-node-0[5678]'
The instances listed there would be ones that can be migrated to their
secondary to give `fsn-node-04` some breathing room.
## Adding and removing addresses on instances
Say you created an instance but forgot to assign an extra
@@ -930,6 +999,61 @@ passphrase. Re-enable the machine with this command on mandos:
mandos-ctl --enable chi-node-02.torproject
### Cleaning up orphan disks
Sometimes `gnt-cluster verify` will give this warning, particularly
after a failed rebalance:
* Verifying orphan volumes
- WARNING: node fsn-node-06.torproject.org: volume vg_ganeti/27dd3687-8953-447e-8632-adf4aa4e11b6.disk0_meta is unknown
- WARNING: node fsn-node-06.torproject.org: volume vg_ganeti/27dd3687-8953-447e-8632-adf4aa4e11b6.disk0_data is unknown
- WARNING: node fsn-node-06.torproject.org: volume vg_ganeti/abf0eeac-55a0-4ccc-b8a0-adb0d8d67cf7.disk1_meta is unknown
- WARNING: node fsn-node-06.torproject.org: volume vg_ganeti/abf0eeac-55a0-4ccc-b8a0-adb0d8d67cf7.disk1_data is unknown
This can happen when an instance was partially migrated to a node (in
this case `fsn-node-06`) but the migration failed because (for
example) there was no HDD on the target node. The fix here is simply
to remove the logical volumes on the target node:
ssh fsn-node-06.torproject.org -tt lvremove vg_ganeti/27dd3687-8953-447e-8632-adf4aa4e11b6.disk0_meta
ssh fsn-node-06.torproject.org -tt lvremove vg_ganeti/27dd3687-8953-447e-8632-adf4aa4e11b6.disk0_data
ssh fsn-node-06.torproject.org -tt lvremove vg_ganeti/abf0eeac-55a0-4ccc-b8a0-adb0d8d67cf7.disk1_meta
ssh fsn-node-06.torproject.org -tt lvremove vg_ganeti/abf0eeac-55a0-4ccc-b8a0-adb0d8d67cf7.disk1_data
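If there are many such volumes, the same cleanup can be scripted from
the `gnt-cluster verify` output shown above; this is a rough sketch
which only prints the commands, so they can be reviewed before
dropping the `echo`:

    gnt-cluster verify | grep 'volume .* is unknown' |
        awk '{gsub(":$", "", $4); print $4, $6}' |
        while read node volume; do
            # drop the "echo" once the list looks right
            echo ssh "$node" -tt lvremove "$volume"
        done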
### Fixing inconsistent disks
Sometimes `gnt-cluster verify` will give this kind of warning:
WARNING: instance materculae.torproject.org: disk/0 on fsn-node-02.torproject.org is degraded; local disk state is 'ok'
... or, worse, this error:
ERROR: instance materculae.torproject.org: couldn't retrieve status for disk/2 on fsn-node-03.torproject.org: Can't find device <DRBD8(hosts=46cce2d9-ddff-4450-a2d6-b2237427aa3c/10-053e482a-c9f9-49a1-984d-50ae5b4563e6/22, port=11177, backend=<LogicalVolume(/dev/vg_ganeti/486d3e6d-e503-4d61-a8d9-31720c7291bd.disk2_data, visible as /dev/, size=10240m)>, metadev=<LogicalVolume(/dev/vg_ganeti/486d3e6d-e503-4d61-a8d9-31720c7291bd.disk2_meta, visible as /dev/, size=128m)>, visible as /dev/disk/2, size=10240m)>
The fix for both is to run:
gnt-instance activate-disks materculae.torproject.org
This will make sure the disks are correctly set up for the instance.
If you have a lot of those warnings, pipe the output into this filter,
for example:
    gnt-cluster verify | grep -e 'WARNING: instance' -e 'ERROR: instance' |
        sed 's/.*instance//;s/:.*//' |
        sort -u |
        while read instance; do
            gnt-instance activate-disks "$instance"
        done
### Not enough memory for failovers
Another error that `gnt-cluster verify` can give you is, for example:
- ERROR: node fsn-node-04.torproject.org: not enough memory to accomodate instance failovers should node fsn-node-03.torproject.org fail (16384MiB needed, 10724MiB available)
The solution is to [rebalance the cluster](#Rebalancing-a-cluster).
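For reference, that usually means a run of the `hbal` balancer against
the cluster (see the linked section for the procedure we actually
follow; the flags here are only the common ones):

    # dry-run first to see the proposed moves, then add -X to execute them
    hbal -L -G default
    hbal -L -G default -X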
### Other troubleshooting
Riseup has [documentation on various failure scenarios](https://we.riseup.net/riseup+tech/ganeti#failure-scenarios) including