replace "Tor VM hosts" spreadsheet with Grafana dashboard

Trac:
Parent Ticket: #30273 (moved)

added component::internal services/tor sysadmin team owner::tpa parent::30273 priority::medium severity::minor status::assigned type::task labels

if I might add, given the trouble I am having figuring out how moly was built and what hardware it's running, I'm thinking more and more we should keep more details about the various devices somewhere. maybe it could be in LDAP, but I can't help but think this is stuff that could very well live in a YAML file in Hiera.

other possible inspirations include:

- this dashboard, which shows how many servers have how many cores - a little backwards really: we'd want to list each server instead
- this libvirt dashboard (that may belong with this libvirt exporter) - interesting, but doesn't show capacity, only actual usage

the latter specifically mentions some interesting metrics that we might be able to use for our purposes:

  "block.<num>.capacity" - logical size in bytes of the block device
                           backing image as unsigned long long.
  "block.<num>.physical" - physical size in bytes of the container of the
                           backing image as unsigned long long.

another thing to consider here is that we don't have a clear, global view of which (physical) machines we have and how much they cost. we do have a list of machines in LDAP, but that includes limited information and does not include cost, so it's hard to do requirements assessment and depreciation evaluation.

i'll start looking into this more directly as part of the Hiera move in #30020 (moved).

Trac:
Owner: tpa to anarcat
Status: new to assigned

Trac:
Description: Our KVM allocation strategy is currently managed through a Google spreadsheet. This is suboptimal for a few reasons:

1. it is hard to keep up to date - for example, moly is not listed in there even though it's in LDAP as a "KVM host"
2. it's not real-time data - for example, even if a host is allocated one vCPU, it might be totally idle most of the time and doing mostly network or disk, while another one might hit the CPU hard. actual load is what matters
3. it's hosted by Google - that has a few problems, the most important of which is that some TPA members do not actually want to use Google services and might be reluctant to update it, worsening problem 1

I propose we shift this to a Grafana dashboard. I already have a prototype in the form of the Node exporter server metrics Grafana Dashboard, which shows basic stats for multiple hosts in parallel. I set the default of the dashboard in Grafana to show the 6 KVM hosts:

https://grafana.torproject.org/d/ER3U2cqmk/node-exporter-server-metrics?orgId=1&from=now-12h&to=now&var-node=kvm4.torproject.org:9100&var-node=kvm5.torproject.org:9100&var-node=macrum.torproject.org:9100&var-node=moly.torproject.org:9100&var-node=textile.torproject.org:9100&var-node=unifolium.torproject.org:9100

That looks like this:

![https://screenshotscdn.firefoxusercontent.com/images/444d04c8-bea4-4ac9-803e-5397126877a2.png](https://screenshotscdn.firefoxusercontent.com/images/444d04c8-bea4-4ac9-803e-5397126877a2.png)

... but it's not ideal:

- it's showing stats that are irrelevant for this purpose, like context switches or detailed disk or memory stats
- it's missing critical information like the number of KVM guests hosted on the machine, how many CPUs and how much disk space are allocated, and so on

This is the information we should be showing:

- disk capacity vs allocation
- disk utilization
- CPU count vs allocation
- actual CPU utilization
- load?
- memory capacity vs allocation
- actual memory usage
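
To make the "capacity vs allocation" and "actual utilization" distinction concrete, here is a minimal sketch in Python. The `HostSummary` class, its field names and the numbers are all hypothetical; nothing here comes from the spreadsheet or from Prometheus.

```python
from dataclasses import dataclass


@dataclass
class HostSummary:
    """Capacity of one KVM host versus what its guests were promised."""
    name: str
    capacity: float   # e.g. CPU cores, or bytes of memory or disk
    allocated: float  # sum of what is allocated to the guests, same unit
    used: float       # what is actually in use right now, same unit

    def allocation_ratio(self) -> float:
        """A value above 1.0 means the host is over-provisioned."""
        return self.allocated / self.capacity

    def utilization(self) -> float:
        """Fraction of the real capacity that is actually busy."""
        return self.used / self.capacity


# Hypothetical numbers for illustration only.
example = HostSummary(name="example-host", capacity=16, allocated=24, used=4)
print(f"{example.name}: allocated {example.allocation_ratio():.0%}, "
      f"used {example.utilization():.0%}")
```

A host can look full by allocation and still be mostly idle, which is why both numbers would be worth showing side by side.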

Some of that information currently lives only in the spreadsheet. For example, disk allocations are only available there, as the KVM guests run on QCOW (QEMU Copy On Write) disk images that only take space when actually used by the guest. This has the advantage of allowing us to over-provision, but means we must keep that metadata somewhere else.

So for now it's in the spreadsheet, but we could find a way to move it somewhere Prometheus can scrape. One trick is that the Prometheus node exporter can expose metrics stored as text files in /var/lib/prometheus/node-exporter/*.prom. This is how the smartctl and APT metrics get shipped, for example: a cron job (well, a systemd timer) regularly writes that file, atomically. So one option could be to move this information to (say) LDAP or Puppet/Hiera and write it into such a file using a cron job (LDAP) or Puppet (Hiera).
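
A minimal sketch of such a writer in Python, assuming the allocations have already been fetched from LDAP or Hiera into a plain dictionary. The metric names (`kvm_guest_allocated_*`), the `guest` label and the file name are made up for illustration; only the "write a temporary file, then rename" pattern is the point.

```python
import os
import tempfile

# Hypothetical metric names; nothing in our setup defines them yet.
METRICS = {
    "vcpus": "kvm_guest_allocated_vcpus",
    "memory_bytes": "kvm_guest_allocated_memory_bytes",
    "disk_bytes": "kvm_guest_allocated_disk_bytes",
}


def render_metrics(allocations):
    """Render {guest: {key: value}} in the Prometheus text exposition format."""
    lines = []
    for key, name in METRICS.items():
        lines.append(f"# TYPE {name} gauge")
        for guest, values in sorted(allocations.items()):
            if key in values:
                lines.append(f'{name}{{guest="{guest}"}} {values[key]}')
    return "\n".join(lines) + "\n"


def write_atomically(path, content):
    """Write to a temporary file in the same directory, then rename over path.

    The temporary name does not end in .prom, so a collector that only
    reads *.prom files never sees a half-written file.
    """
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(content)
        os.chmod(tmp_path, 0o644)   # mkstemp creates the file as 0600
        os.replace(tmp_path, path)  # atomic on POSIX within one filesystem
    except BaseException:
        os.unlink(tmp_path)
        raise


if __name__ == "__main__":
    # Made-up allocations; the real ones would come from LDAP or Hiera.
    demo = {"guest1": {"vcpus": 2, "memory_bytes": 4 * 1024**3,
                       "disk_bytes": 20 * 1024**3}}
    with tempfile.TemporaryDirectory() as directory:
        target = os.path.join(directory, "kvm_allocations.prom")
        write_atomically(target, render_metrics(demo))
        with open(target) as handle:
            print(handle.read(), end="")
```

In production the target would be a file under /var/lib/prometheus/node-exporter/ and the script would run from a systemd timer, like the existing smartctl and APT jobs.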

Then we'd build a custom Grafana dashboard and get rid of the other spreadsheet.

A stop-gap measure might be to simplify the spreadsheet and move it to a plain text markdown file. We would lose the automatic calculations the spreadsheet provides, in exchange for easier updating and transparency.

to the same text, with only the screenshot replaced:

![https://paste.anarc.at/snaps/snap-2019.04.17-16.48.43.png](https://paste.anarc.at/snaps/snap-2019.04.17-16.48.43.png)

this is part of the broader inventory problem, which i documented in a separate ticket

Trac:
Parent: N/A to #30273 (moved)

I have added moly to the spreadsheet. Memory and CPU allocations were extracted from libvirt with:

    grep -e memory -e vcpu /etc/libvirt/qemu/*.xml
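
If this ever needs to be automated rather than eyeballed from grep output, the same two values can be read by parsing the domain XML. This is only a sketch: the sample document is made up, and only a few memory units are handled (libvirt defaults to KiB when no unit is given).

```python
import xml.etree.ElementTree as ET

# Subset of the units libvirt accepts; extend as needed.
UNIT_BYTES = {"b": 1, "bytes": 1, "KiB": 1024, "MiB": 1024**2,
              "GiB": 1024**3, "TiB": 1024**4}

# Made-up domain definition, trimmed to the elements of interest.
SAMPLE = """
<domain type='kvm'>
  <name>guest1</name>
  <memory unit='KiB'>4194304</memory>
  <currentMemory unit='KiB'>4194304</currentMemory>
  <vcpu placement='static'>2</vcpu>
</domain>
"""


def parse_domain(xml_text):
    """Return the name, memory (in bytes) and vCPU count of one domain."""
    root = ET.fromstring(xml_text)
    memory = root.find("memory")
    unit = memory.get("unit", "KiB")
    return {
        "name": root.findtext("name"),
        "memory_bytes": int(memory.text) * UNIT_BYTES[unit],
        "vcpus": int(root.findtext("vcpu")),
    }


# On a real host, one would loop over glob.glob("/etc/libvirt/qemu/*.xml")
# and pass the content of each file to parse_domain().
print(parse_domain(SAMPLE))
```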

Because moly used LVM to allocate disks, I could also get the allocations with lvdisplay -C.
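
For scripting, `lvs` (the command `lvdisplay -C` is a front-end for) can print machine-readable sizes. The sketch below only parses such output; the command line in the comment and the volume names in the sample are assumptions, not what was actually run on moly.

```python
# Assumed invocation that would produce the input format parsed here:
#   lvs --noheadings --nosuffix --units b --separator , -o lv_name,lv_size

# Made-up sample output: one "name,size in bytes" pair per line.
SAMPLE_OUTPUT = """
  guest1-root,21474836480
  guest1-swap,2147483648
"""


def parse_lvs(output):
    """Map each logical volume name to its size in bytes."""
    sizes = {}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        name, size = line.split(",")
        sizes[name] = int(size)
    return sizes


volumes = parse_lvs(SAMPLE_OUTPUT)
total_bytes = sum(volumes.values())
print(f"{len(volumes)} volumes, {total_bytes / 1024**3:.0f} GiB allocated")
```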

The following was used to extract information from VMs in hetzner cloud:

    cumin -o txt 'hetzner*' 'facter | grep -e processorcount -e memorysize -e "blockdevice_[^_]*_size"'

the spreadsheet should now cover all hosting locations with costs attached to them. i'd like to also have an inventory of all bare metal, but that will have to do for now.

the spreadsheet was moved into nextcloud by weasel, at least, so that has slightly improved. but we could use more automation here.

i am not sure we're ready to make this jump just yet. i've included the new spreadsheet in our host creation/retirement procedures so it should be consistently updated now at least. i'm not exactly sure the benefit of creating a new dashboard in grafana outweighs the costs at this point, so i'm unassigning this for now.

Trac:
Owner: anarcat to tpa

mentioned in issue #30020 (moved)

mentioned in issue #30028 (moved)

mentioned in issue #30273 (moved)

mentioned in issue tpo/tpa/team#29387
