Currently onionoo is a service composed of 4 VMs: two backends running the onionoo java apps that serve and update the data, and two frontends.
At the time the service was launched this architecture made a lot of sense, but I think we could now simplify its maintenance by reducing it to a single backend fronted by a web server (like nginx) doing some aggressive caching.
I was hoping we would reach the point of retiring onionoo sooner, but given the current pace of development of the metrics pipeline, I personally think it makes sense to shrink this service now so that it is easier for metrics and tpa to maintain.
What do you think?
i've been considering retiring haproxy/varnish in favor of nginx in the past (#32462 (closed)), and this is the last place where we use haproxy, so i'm all for retiring that part of the architecture, for sure.
i'm just not quite clear on the gain here: what burden would be reduced exactly? two fewer machines to maintain, in general? but then we have some work to do to replace haproxy... we'd also probably need to build one (or two?) machines from scratch as well, which is also... well, work.
if it's less work for you, sure, makes sense. but for us, at this point, the benefits would be relatively marginal, especially if the plan is to retire the entire thing long term.
we would save disk space, that's for sure, so that's useful. but i kind of worry about availability here; right now this service is one of our sturdiest because we can take out any one machine in the cluster and the service still works (heck, we can take out half if we pick frontend/backend right). so i would love to keep that architecture, if that's an SLA you're still interested in.
how long can we afford to lose onionoo before things start to fall apart?
oh, i have just read tpo/network-health/metrics/relay-search#40024 (closed) where there is an inconsistency between the two backends, and removing one of the backends would clearly make debugging easier. so that's at least one good justification...
it's as you wish, really: if you don't need HA on this and can afford to live off a single VM, there's surely lots of tricks we can do with nginx, and we've got past experience setting it up as a cache, so it could replace the haproxy/varnish stack quite advantageously. if that's a word. :)
The only reason to have two backends and two frontends was to split load, so if we can do that differently there is no need for this complicated setup.
Well, if we need to split the load and scale the machines, that will be hard if there's a single machine... have you looked at resource usage on those servers to see if a single box could handle it? what's the limiting factor here, memory? io? cpu?
there you can see that the hosts are comfortable with their current memory allocations, which are 16G for the backend and 4G for the frontend. we'd probably need to bump that up for a merged host, so something like 20G or even 32G might be good enough.
we see a pattern emerge, which is that CPU usage is super spiky: every hour it maxes out one CPU in full, for 2-3 minutes. which is probably fine, but we shouldn't get rid of cores in the new setup.
we peak at about 1k IOPS / 16MiB/s which should be fine for NVMe.
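As a back-of-the-envelope sanity check on those peaks (assuming the ~1k IOPS and ~16MiB/s figures happen at the same time), the implied average I/O size is tiny, which is consistent with a workload of many small file writes:

```python
# Rough arithmetic on the observed peaks: ~1000 IOPS at ~16 MiB/s.
iops = 1000
throughput_bytes = 16 * 1024 * 1024  # 16 MiB/s

# average size of each I/O operation, in KiB
avg_io_size_kib = throughput_bytes / iops / 1024
print(f"average I/O size: {avg_io_size_kib:.1f} KiB")  # ~16.4 KiB per operation
```

Small operations like that are easily within NVMe territory, as noted above.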
in other words, hell yeah let's rebuild this in the new ganeti cluster and get rid of 4 machines! how about 128GB NVMe disk, 32GB RAM, and 8 cores? @lavamind, how does that sound?
For "simplifying debugging" I would not make such changes, but that's just me. I would rather find out what exactly the issue is with that backend box or the java application on it. Otherwise it looks a bit like swapping boxes to fix an unknown issue.
Java apps might have a monitoring interface available (RMI, JMX) to graph their internal resource usage. That has been helpful for me a few times in the past, though I have no clue about java itself, I've just hosted a few such things. As I recall, some of them needed reconfiguring when a specific JVM config limit was exceeded, while the machine itself still had plenty of everything available - the JVM config just disallowed its usage by the java app.
You could try reconfiguring both current frontends to use only one of the backends. That should be enough to tell whether just one of them is plenty. 1k IOPS doesn't sound like much.
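For illustration, pointing both frontends at a single backend could be a one-line change in each frontend's haproxy config; the section and server names below are hypothetical, not the actual torproject.org configuration:

```
# hypothetical haproxy backend section on the frontends; commenting out
# one server directs all traffic to the remaining backend
backend onionoo_backends
    balance roundrobin
    server onionoo-backend-01 onionoo-backend-01.torproject.org:443 check
    # server onionoo-backend-02 onionoo-backend-02.torproject.org:443 check
```

Running like that for a while would show whether one backend can carry the full load before anything is actually retired.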
This I/O peaking is not too nice, if you ask me. Not as absolute values, but as a trend. Maybe this application needs some changes so that the peaky read/write habit spreads out a bit more. Otherwise it doesn't look like a very taxing load for those machines.
@anarcat thanks so much for this. The only thing I wanted to mention about your analysis is that the two backends have/should have the same data, so about 100GB is fine IMO.
@er this issue started when we started having memory issues on collector because data from collector started being late.
Basically onionoo fetches network documents from collector, parses the data, and writes tiny text files to disk that contain the node statuses (this is why we see I/O spikes).
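To make the I/O pattern concrete, here is an illustrative sketch (not actual onionoo code, which is in Java) of the "one tiny status file per relay" write habit described above, which shows up as bursts of small writes every update run:

```python
# Illustrative sketch of the write pattern: one small status file per
# relay fingerprint, flushed in a burst. Sample data is made up.
import json
import tempfile
from pathlib import Path


def write_statuses(statuses, out_dir):
    """Write one small JSON file per relay; return total bytes written."""
    total = 0
    for fingerprint, status in statuses.items():
        data = json.dumps(status)
        # thousands of such tiny writes per run cause the observed I/O spikes
        (out_dir / f"{fingerprint}.json").write_text(data)
        total += len(data)
    return total


# hypothetical parsed collector data standing in for real network documents
statuses = {
    "A" * 40: {"running": True, "nickname": "relay1"},
    "B" * 40: {"running": False, "nickname": "relay2"},
}
with tempfile.TemporaryDirectory() as d:
    written = write_statuses(statuses, Path(d))
    print(f"wrote {written} bytes across {len(statuses)} files")
```

Spreading those writes over time (or batching them) would flatten the spikes, which is the kind of rework hinted at below.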
Some of the assumptions that we have in onionoo code need some re-work and we are aware of it, that's why we are working on a different backend and API.
We are not just getting rid of one backend to fix this issue. Our plan is to first have 1 backend and allocate more resources to it, while also doing the load balancing with nginx. Then we need to look at the bandwidth history data and see if we can reset the first_seen field from there.
that all sounds good to me! what's the next step here? @hiro do you want us to make a VM and let you loose in there? i could probably make up some Puppet blobs to setup nginx and friends and then you could do the rest...
actually, now that i'm looking into this again (and sorry for the horrid delays here), "loadbalancing"? what do you mean there? "caching", sure, but not sure how i can do load balancing with a single VM here...
This issue has been waiting for information two weeks or more. It needs attention. Please take care of this before the end of 2024-03-14. ~"Needs Information" tickets will be moved to the Icebox after that point.
(Any ticket left in Needs Review, Needs Information, Next, or Doing without activity for 14 days gets such notifications. Make a comment describing the current state of this ticket and remove the Stale label to fix this.)
To make the bot ignore this ticket, add the bot-ignore label.
basic server install has been done, hostname is onionoo-backend-03.torproject.org. there's a caching nginx proxy setup as well, which forwards traffic to localhost:8080, let me know if you need anything else. entry point in puppet is role::onionoo_backend (not to be confused with roles::onionoo_backend) and most code should live in profile::onionoo_backend.
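For the record, the caching proxy part of such a setup boils down to something like the following nginx vhost; this is a sketch with illustrative paths, cache sizes, and timings, not the actual Puppet-managed configuration:

```nginx
# hypothetical caching reverse proxy in front of the java app on :8080
proxy_cache_path /var/cache/nginx/onionoo levels=1:2
                 keys_zone=onionoo:10m max_size=1g inactive=60m;

server {
    listen 443 ssl;
    server_name onionoo.torproject.org;
    # ssl_certificate / ssl_certificate_key directives omitted here

    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_cache onionoo;
        proxy_cache_valid 200 5m;                      # cache good answers briefly
        proxy_cache_use_stale error timeout updating;  # serve stale if the app struggles
        add_header X-Cache-Status $upstream_cache_status;
    }
}
```

The `proxy_cache_use_stale` line is what buys back some resilience on a single-VM setup: nginx can keep answering from cache while the java app restarts or chokes.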
I don't think anything is needed from you here. I will assign this ticket to myself and try to migrate onionoo to the single backend, time permitting. Thanks.
This issue has been waiting for information two weeks or more. It needs attention. Please take care of this before the end of 2024-04-24. ~"Needs Information" tickets will be moved to the Icebox after that point.
(Any ticket left in Needs Review, Needs Information, Next, or Doing without activity for 14 days gets such notifications. Make a comment describing the current state of this ticket and remove the Stale label to fix this.)
To make the bot ignore this ticket, add the bot-ignore label.
I have the new onionoo backend working at https://onionoo-backend-03.torproject.org/. I will monitor it for a few days and if everything is ok we can switch onionoo.tpo to this backend only and retire everything else.
Results returned on onionoo-backend-03 are the same as on onionoo.tpo. Can we switch onionoo.tpo to point to onionoo-backend-03? Then we can also retire the two frontends and backends.
this is going to be a tad more complicated than i thought: onionoo.tpo is in the "auto-dns" setup, so we'll need to either twist that in a "single node" auto-dns setup (which i'm not even sure works), or try to simplify this and retire the auto-dns part of this in favor of a simple CNAME.
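For illustration, the simple-CNAME option would reduce the DNS side to a single alias record along these lines (sketch only; the actual zone is generated, and the rotation records shown are placeholders):

```
; today: auto-dns publishes health-checked A/AAAA records for the rotation
; onionoo.torproject.org.  300   IN  A      <frontend-01 address>
; onionoo.torproject.org.  300   IN  A      <frontend-02 address>

; simplified single-host setup: one plain alias, no automatic failover
onionoo.torproject.org.    3600  IN  CNAME  onionoo-backend-03.torproject.org.
```

The trade-off is exactly the one raised above: a CNAME is trivial to maintain, but gives up the automatic removal of dead hosts that the auto-dns rotation provides.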
i don't think i've ever done either.
@hiro what do you think? should we keep the capacity to have automatic failovers for this, in case we build a second backend?
@hiro i'm still unsure: the rotation hosts are currently onionoo-frontend-01 and -02 (not onionoo-backend-01 and -02), are we sure we want to just replace that with onionoo-backend-03? (in which case i have to wonder why we still call this host a "backend". :))
this is the commit i have pending:
```diff
commit c494505c9f647d2bc3266868197d53e05ea53998 (master)
Author: Antoine Beaupré <anarcat@debian.org>
Date:   Tue Oct 22 12:42:41 2024 -0400
Parent: 6677c39 retire web-bhs-* servers (tpo/tpa/team#41111)

    add onionoo-backend-03 to the onionoo.torproject.org rotation

    This is a bit counter-intuitive, but onionoo-backend-03 does both
    frontend and backend.

    See: tpo/tpa/team#41512

 hosts.yaml                              | 4 ++++
 services/onionoo.torproject.org.service | 1 +
 2 files changed, 5 insertions(+)

--- a/hosts.yaml
+++ b/hosts.yaml
@@ -31,5 +31,9 @@ hosts:
     4: 49.12.57.137
     6: 2a01:4f8:fff0:4f:266:37ff:fe33:259b
     checks: ['ping-check', 'shutdown-check', 'http-check']
+  onionoo-backend-03.torproject.org:
+    4: 204.8.99.156
+    6: 2620:7:6002:0:466:39ff:fe3c:7686
+    checks: ['ping-check', 'shutdown-check', 'http-check']
 # vim:set shiftwidth=2:
--- a/services/onionoo.torproject.org.service
+++ b/services/onionoo.torproject.org.service
@@ -4,4 +4,5 @@ hosts:
   default:
   - onionoo-frontend-01.torproject.org
   - onionoo-frontend-02.torproject.org
+  - onionoo-backend-03.torproject.org
 # vim:syn=yaml:
```
@hiro after confirmation from IRC, i pushed the above change and onionoo-backend-03 is now in rotation, let me know if everything looks okay from your end, and later today (or tomorrow) i'll remove one of the other backends to see if the new host can take the increased load.
Are you sure the cert is really fixed? I am getting intermittent errors pulling from onionoo, and folks on #tor-relays are reporting it too.
```
<bauruine[m]> * Server certificate:
<bauruine[m]> *  subject: CN=onionoo-backend-03.torproject.org
<bauruine[m]> *  subjectAltName does not match onionoo.torproject.org
<bauruine[m]> The other two A records (hosted at hetzner) work.
```
```
HTTPSConnectionPool(host='onionoo.torproject.org', port=443): Max retries
exceeded with url: /details (Caused by SSLError(CertificateError("hostname
'onionoo.torproject.org' doesn't match either of
'onionoo-backend-03.torproject.org', 'oniooo.torproject.org'")))
```
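The error above is plain hostname verification: the certificate's SAN list simply does not contain `onionoo.torproject.org` (note the misspelled `oniooo` entry it does contain). A minimal illustration of the check, simplified to exact matching (real TLS stacks also handle RFC 6125 wildcards):

```python
# Simplified sketch of the SAN hostname check that fails above;
# wildcard handling and CN fallback are deliberately omitted.
def hostname_matches(hostname, san_dns_names):
    return hostname.lower() in (name.lower() for name in san_dns_names)


# SAN entries exactly as reported in the error message above
san = ["onionoo-backend-03.torproject.org", "oniooo.torproject.org"]

print(hostname_matches("onionoo.torproject.org", san))             # False: the reported failure
print(hostname_matches("onionoo-backend-03.torproject.org", san))  # True
```

So clients hitting the A record that serves this certificate fail intermittently, while the other two records in the rotation still present a matching certificate.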