@@ -3225,6 +3225,13 @@ reserved, this is for hosts outside of the cluster.

The network name must follow the [naming convention](doc/naming-scheme).

## Upgrades

Ganeti upgrades need to be handled specially, and have their own
documentation in the [howto/upgrades](howto/upgrades) documents.

TODO: move procedures here?

## SLA

As long as the cluster is not over capacity, it should be able to
@@ -3236,7 +3243,7 @@ without problems.
New nodes can be provisioned within a week or two, depending on budget
and hardware availability.

## Design
## Design and architecture

Our first Ganeti cluster (`gnt-fsn`) is made of multiple machines
hosted with [Hetzner Robot](https://robot.your-server.de/), Hetzner's dedicated server hosting
@@ -3396,7 +3403,12 @@ We have custom configurations on top of that to:
There is work underway to refactor and automate the install better,
see [ticket 31239](https://bugs.torproject.org/31239) for details.

### Storage
## Services

TODO: document a bit how the different Ganeti services interface with
each other.

## Storage

TODO: document how DRBD works in general, and how it's set up here in
particular.
@@ -3420,6 +3432,20 @@ links:
For now, iSCSI volumes are manually created and passed to new virtual
machines.

## Queues

TODO: document gnt-job

## Interfaces

## Authentication

## Implementation

## Related services

ref DRBD

## Issues

There is no issue tracker specifically for this project, [File][] or
@@ -3431,14 +3457,20 @@ There is no issue tracker specifically for this project, [File][] or

Upstream Ganeti has of course its own [issue tracker on GitHub](https://github.com/ganeti/ganeti/issues).

## Monitoring and testing
## Users

## Upstream

## Monitoring and metrics

<!-- TODO: describe how this service is monitored and how it can be tested -->
<!-- after major changes like IP address changes or upgrades -->

TODO: https://github.com/ganeti/prometheus-ganeti-exporter

## Logs and metrics
## Tests

## Logs

Ganeti logs a significant amount of information in
`/var/log/ganeti/`. Those logs are of particular interest:
@@ -3449,8 +3481,9 @@ Ganeti logs a significant amount of information in
this also includes VM migration logs for the `move-instance` or
`gnt-instance export` commands

It does not expose performance metrics that are digested by Prometheus
right now, but that would be an interesting feature to add.
## Backups

TODO

## Other documentation
@@ -3468,8 +3501,54 @@ right now, but that would be an interesting feature to add.

# Discussion

The Ganeti cluster has served us well over the years. This section
aims to discuss its current limitations and possible future.

## Overview

Ganeti works well for our purposes, which is hosting generic virtual
machines. It's less efficient at managing mixed-usage or specialized
setups like large file storage or high-performance databases, because
of cross-machine contamination and storage overhead.

## Security and risk assessment

No in-depth security review or risk assessment has been done on the
Ganeti clusters recently. It is believed that the cryptography and
design of the Ganeti clusters are sound. There's a concern about the
reuse of server host keys and, in general, there's some confusion over
what goes over TLS and what goes over SSH.

Deleting VMs is arguably too easy in Ganeti. You just need one
confirmation, and a VM is completely wiped, so there's always a risk
of accidental removal.
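
One possible mitigation, sketched here purely as an illustration, is a
small wrapper that asks the operator to retype the full instance name
before `gnt-instance remove` is called. The function names
(`confirm_removal`, `removal_command`, `guarded_remove`) and the example
hostname are hypothetical and not part of Ganeti or of any existing TPA
tooling. Ganeti's own confirmation prompt would still appear after this
check.

```python
import subprocess


def confirm_removal(instance: str, typed: str) -> bool:
    """Return True only if the operator retyped the exact instance name."""
    return bool(instance) and typed.strip() == instance


def removal_command(instance: str) -> list[str]:
    """Build the `gnt-instance remove` invocation for a single instance."""
    return ["gnt-instance", "remove", instance]


def guarded_remove(instance: str, ask=input, run=subprocess.run) -> bool:
    """Remove an instance only after the name was typed back correctly.

    `ask` and `run` are injectable so the logic can be exercised without
    a real terminal or a real cluster (an assumption made for this sketch).
    """
    typed = ask(f"Type the full name of the instance to delete ({instance}): ")
    if not confirm_removal(instance, typed):
        print("Names do not match, not removing anything.")
        return False
    # Ganeti will still ask for its own confirmation at this point.
    run(removal_command(instance), check=True)
    return True


if __name__ == "__main__":
    # Hypothetical instance name, for illustration only.
    guarded_remove("test-01.example.org")
```

Requiring the name rather than a plain `y` makes it harder to confirm a
removal on the wrong instance out of habit.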

## Technical debt and next steps

The ganeti-instance-debootstrap installer is slow and almost abandoned
upstream. It required significant patching to get cross-cluster
migrations working.

There are concerns that the DRBD and memory redundancy required by the
Ganeti allocators lead to resource waste; this is to be investigated
in [tpo/tpa/team#40799](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40799).
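
To give a rough idea of the overhead in question, here is a minimal
sketch under two simplifying assumptions: every DRBD-backed disk is
stored in full on both its primary and its secondary node, and each
node must keep enough memory free to restart the instances it is
secondary for if any single primary node fails (a simplified reading of
Ganeti's N+1 check). The function names and all figures are
hypothetical and are not measurements from our clusters.

```python
from collections import defaultdict


def drbd_raw_disk(instance_disks_gib):
    """Raw disk consumed cluster-wide: each DRBD disk exists on two nodes."""
    return 2 * sum(instance_disks_gib)


def n_plus_one_reserve(instances):
    """Memory (MiB) each node should keep free to survive one peer failure.

    `instances` is an iterable of (primary, secondary, memory_mib) tuples.
    For every node, sum the memory of the instances it is secondary for,
    grouped by primary node, and keep the largest group: that is the worst
    single-node failure this node has to absorb.
    """
    per_pair = defaultdict(int)
    for primary, secondary, memory_mib in instances:
        per_pair[(secondary, primary)] += memory_mib

    reserve = defaultdict(int)
    for (secondary, _primary), memory_mib in per_pair.items():
        reserve[secondary] = max(reserve[secondary], memory_mib)
    return dict(reserve)


if __name__ == "__main__":
    # Hypothetical three-node layout: (primary, secondary, memory in MiB).
    layout = [
        ("node-a", "node-b", 4096),
        ("node-a", "node-b", 2048),
        ("node-c", "node-b", 8192),
        ("node-b", "node-a", 1024),
    ]
    print(drbd_raw_disk([10, 20]))
    print(n_plus_one_reserve(layout))
```

In this model, disk usage is always double the provisioned size, while
the memory reserve depends on how instances are spread across node
pairs, so the actual amount of waste can only be settled by looking at
the real allocation.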

## Proposed Solution

No recent proposal has been made for the Ganeti clusters, although the
Cymru migration is somewhat relevant:

- [TPA-RFC-40: Cymru migration](policy/tpa-rfc-40-cymru-migration)
- [TPA-RFC-43: Cymru migration plan](policy/tpa-rfc-43-cymru-migration-plan)
- [TPA-RFC-52: Cymru migration timeline](policy/tpa-rfc-52-cymru-migration-timeline)

## Other alternatives

Proxmox is probably the biggest contender here. OpenStack is also
marginally similar.

# Old libvirt cluster retirement

The project of creating a Ganeti cluster for Tor appeared in the
summer of 2019. The machines were delivered by Hetzner in July 2019
and set up by weasel by the end of the month.