title: TPA-RFC-68: Idle canary servers
costs: marginal
approval: TPA
affected users: TPA
deadline: 2024-09-19
status: standard
discussion: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41750
Summary: provision test servers that sit idle to monitor infrastructure and stage deployments
# Background
In various recent incidents, it became apparent that we don't have a good place to test deployments or "normal" behavior on servers.
Examples:

- While deploying the `needrestart` package (tpo/tpa/team#41633), we had
  to deploy on `perdulce` (AKA `people.tpo`) and test there. This had no
  negative impact.
- While testing a workaround for mini-nag's deprecation
  (tpo/tpa/team#41734), `perdulce` was used again, but an operator error
  destroyed `/dev/null`, and the operator failed to recreate it. Impact
  was minor: some errors during a nightly job, which a reboot promptly
  fixed.
- While diagnosing a network outage (e.g. tpo/tpa/team#41740), it can be
  hard to tell whether issues are related to a server's exotic
  configuration or to our baseline (in that case, single-stack IPv4 vs
  IPv6).
- While diagnosing performance issues in Ganeti clusters, we can
  sometimes suffer from the "noisy neighbor" syndrome, where another VM
  in the cluster "pollutes" the node and causes bad performance.
- Rescue boxes were set up with too little disk space, because we
  actually have no idea what our minimum space requirements are
  (tpo/tpa/team#41666).
We previously had an `ipv6only.torproject.org` server, which was retired
in TPA-RFC-23 (tpo/tpa/team#40727) because it was undocumented and
blocking deployment. It also didn't seem to have any sort of
configuration management.
# Proposal

Create a pair of "idle canary servers", one per cluster, named
`idle-fsn-01` and `idle-dal-02`.
Optionally, deploy an `idle-dal-ipv6only-03` and `idle-dal-ipv4only-04`
pair to test single-stack configurations for eventual dual-stack
monitoring (tpo/tpa/team#41714).
## Server specifications and usage

- zero configuration in Puppet, unless specifically required for the
  role (e.g. an IPv4-only or IPv6-only stack might be an acceptable
  configuration)
- some test deployments are allowed, but should be reverted as cleanly
  as possible; on total failure, a new host should be reinstalled from
  scratch instead of letting it drift into unmanaged chaos (see the
  sketch after this list)
- files in `/home` and `/tmp` are cleared out automatically on a weekly
  basis, with the `motd` clearly stating that fact
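
One way to catch test deployments that were not cleanly reverted would
be a Prometheus alert comparing the canaries' root filesystem usage
against the expected baseline. This is a minimal sketch, not a committed
alerting rule: the `alias=~"idle-.*"` matcher and the 6GiB threshold
(the ~5GiB baseline from the table below, plus some slack) are
assumptions:

```
# fire when an idle canary's root filesystem usage drifts above 6GiB,
# roughly 1GiB over the expected ~5GiB baseline (assumed threshold)
(
  node_filesystem_size_bytes{alias=~"idle-.*", mountpoint="/"}
  - node_filesystem_avail_bytes{alias=~"idle-.*", mountpoint="/"}
) > 6 * 1024^3
```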
## Hardware configuration

| component   | current minimum | proposed spec | note                                         |
|-------------|-----------------|---------------|----------------------------------------------|
| CPU count   | 1               | 1             |                                              |
| RAM         | 960MiB          | 512MiB        | covers 25% of current servers                |
| Swap        | 50MiB           | 100MiB        | covers 90% of current servers                |
| Total Disk  | 10GiB           | ~5.6GiB       | 5GiB `/` + 512MiB `/boot` + 100MiB swap      |
| `/`         | 3GiB            | 5GiB          | current median used size                     |
| `/boot`     | 270MiB          | 512MiB        | `/boot` often fills up on `dal-rescue` hosts |
| `/boot/efi` | 124MiB          | N/A           | no EFI support in Ganeti clusters            |
| `/home`     | 10GiB           | N/A           | `/home` on root filesystem                   |
| `/srv`      | 10GiB           | N/A           | same                                         |
## Goals

- identify "noisy neighbors" in each Ganeti cluster (see the sketch
  after this list)
- keep a long-term "minimum requirements" specification for servers,
  continuously validated throughout upgrades
- provide an impact-free testing ground for upgrades, test deployments,
  and environments
- trace long-term usage trends, for example electric power usage
  (tpo/tpa/team#40163), recurring jobs like unattended upgrades
  (tpo/tpa/team#40934), or basic CPU usage cycles
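
To illustrate the first goal: CPU "steal" time, as exported by
node_exporter, is one way to spot noisy neighbors, since on a VM that is
otherwise idle, sustained steal means another guest on the same Ganeti
node is hogging the physical CPU. A minimal sketch, assuming the
proposed `idle-*` naming and the `alias` label used in the queries in
the appendix:

```
# per-CPU "steal" rate on the idle canaries over 5 minutes; anything
# consistently above zero on an idle VM points at a noisy neighbor
rate(node_cpu_seconds_total{alias=~"idle-.*", mode="steal"}[5m])
```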
# Timeline
No fixed timeline. Those servers can be deployed in our precious free time, but it would be nice to actually have them deployed eventually. No rush.
# Appendix
Some observations on current usage:
## Memory usage

Sample query (25th percentile):

```
quantile(0.25,
  node_memory_MemTotal_bytes - node_memory_MemFree_bytes
  - (node_memory_Cached_bytes + node_memory_Buffers_bytes)
)
```

≈ 486 MiB
- minimum is currently `carinatum`, at 228MiB; `perdulce` and `ssh-dal`
  are more around 300MiB
- a quarter of servers use less than 512MiB of RAM, the median is 1GiB,
  and the 90th percentile is 17GiB
- largest memory use is `dal-node-01`, at 310GiB used (out of 504GiB,
  61.5%)
- largest used ratio is `colchicifolium` at 94.2%, followed by
  `gitlab-02` at 68% (see the sketch after this list)
- largest memory size is `ci-runner-x86-03` at 1.48TiB, followed by the
  `dal-node` cluster at 504GiB each; the median is 8GiB, the 90th
  percentile 74GiB
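
The per-host ratios above can presumably be derived from a variant of
the same query, dividing used memory by total memory instead of taking a
quantile; a sketch:

```
# used-memory ratio per host, sorted to surface the most
# memory-pressured servers (e.g. colchicifolium at ~94%)
sort_desc(
  (
    node_memory_MemTotal_bytes - node_memory_MemFree_bytes
    - (node_memory_Cached_bytes + node_memory_Buffers_bytes)
  )
  / node_memory_MemTotal_bytes
)
```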
## Swap usage

Sample query (median used swap):

```
quantile(0.5, node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes)
```

= 0 bytes
- median swap usage is zero; in other words, 50% of servers do not touch
  swap at all
- median swap size is 2GiB
- some servers have large swap space (`tb-build-02` and `-03` have
  300GiB, `-06` has 100GiB, and `gnt-fsn` nodes have 64GiB)
| Percentile | Usage  | Size |
|------------|--------|------|
| 50%        | 0      | 2GiB |
| 75%        | 16MiB  | 4GiB |
| 90%        | 100MiB | N/A  |
| 95%        | 400MiB | N/A  |
| 99%        | 1.2GiB | N/A  |
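
The other rows in this table presumably come from the same query with a
different quantile parameter; for example, the 95th percentile of used
swap:

```
# 95th percentile of used swap across the fleet (~400MiB in the table)
quantile(0.95, node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes)
```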
## Disk usage

Sample query (median root partition used space):

```
quantile(0.5,
  sum(node_filesystem_size_bytes{mountpoint="/"}) by (alias, mountpoint)
  - sum(node_filesystem_avail_bytes{mountpoint="/"}) by (alias, mountpoint)
)
```

≈ 5GiB
- 90% of servers fit in 10GiB of disk space for the root filesystem;
  median usage is around 5GiB
- median `/boot` usage is actually much lower than our specification, at
  139.4MiB, but the problem is with edge cases: we know we're having
  trouble at the 2^8MiB (256MiB) boundary, so we're simply doubling that
  (see the query sketch below)
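
The `/boot` median cited above can presumably be obtained with the same
query as the root partition, restricted to that mount point:

```
# median used space on /boot across the fleet (~139MiB, per the above)
quantile(0.5,
  sum(node_filesystem_size_bytes{mountpoint="/boot"}) by (alias, mountpoint)
  - sum(node_filesystem_avail_bytes{mountpoint="/boot"}) by (alias, mountpoint)
)
```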
## CPU usage

Sample query (median percentage, with one decimal):

```
quantile(0.5,
  round(
    sum(
      rate(node_cpu_seconds_total{mode!="idle"}[24h])
    ) by (instance)
    / count(node_cpu_seconds_total{mode="idle"}) by (instance) * 1000
  ) / 10
)
```

≈ 2.5%
Servers sorted by CPU usage over the last 7 days:

```
sort_desc(
  round(
    sum(
      rate(node_cpu_seconds_total{mode!="idle"}[7d])
    ) by (instance)
    / count(node_cpu_seconds_total{mode="idle"}) by (instance) * 1000
  ) / 10
)
```
- half of the servers used only 2.5% of CPU time over the last 24h
- the median is, perhaps surprisingly, similar over the last 30 days
- `metricsdb-01` used 76% of a CPU in the last 24h at the time of
  writing
- over the last week, results vary more, with `relay-01` using 45%,
  `colchicifolium` and `check-01` 40%, `metricsdb-01` 33%...
| Percentile    | last 24h usage ratio |
|---------------|----------------------|
| 50th (median) | 2.5%                 |
| 90th          | 22%                  |
| 95th          | 32%                  |
| 99th          | 45%                  |