A staging server for rdsys

I'm going to be mostly AFK until September; it would be great if we manage to get this VM working so I can start setting it up once I'm back. I'll try to keep an eye on this issue while I'm AFK so you are not blocked, but it might take days sometimes.

mentioned in issue tpo/anti-censorship/rdsys#170

some questions:

an account that we can ssh into automatically from the CI to set up everything. We'll also need everybody from anti-censorship to be able to sudo into that account.

so both CI and users will need access to this. what's the story for "oops, jane destroyed the test server", you rebuild it?

in other words, how's deployment from scratch done here? is that something you do in CI or something we help you do in Puppet, or a mix?

[...] gettor-tst@torproject.org?

https://bridges-tst.torproject.org [...]

tst looks weird to me, can we make that test or staging, since, well, it's a staging server?

I don't think we'll connect them to the prometheus server

why not?

otherwise, we're a bit crammed right now but i'll try to squeeze this in August, no promises. but i'm happy you're looking at staging, i think it's a great idea!

@kez can you share your experience of setting up a dev server for donate? how would you do it here?

in general, this gets us closer to doing "continuous deployments" in GitLab, which is, in theory, designed for this. we don't have much experience with that, but @lavamind did work extensively on environments for the static site deployments, so we at least have that running in prod. it's not a long-running server process however (unless you count apache, but that's managed by us/puppet, not GitLab)...

would help tremendously if you have existing examples of other projects deploying such staging servers, otherwise we'll look into it of course.

so both CI and users will need access to this. what's the story for "oops, jane destroyed the test server", you rebuild it?

in other words, how's deployment from scratch done here? is that something you do in CI or something we help you do in Puppet, or a mix?

I don't have it all sorted out in my head yet, but my first idea is to do it from the CI. So basically the CI deletes some folders and replaces binaries, configuration and things like that. The idea would be to have a script for that; I was hoping not to get into Puppet.

But I'm happy to hear ideas on how to do it.
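
A minimal sketch of what such a CI-driven clean deployment script could look like, assuming a build directory containing `bin/` and `conf/` subdirectories and a deployment directory on the staging host. The directory layout and the `clean_deploy` name are hypothetical; nothing here reflects the actual rdsys layout.

```shell
# clean_deploy BUILD_DIR DEPLOY_DIR
# Wipe the previous deployment and install fresh binaries and configuration.
# The bin/, conf/ and state/ subdirectories are assumptions for illustration.
clean_deploy() {
    build_dir=$1
    deploy_dir=$2

    # Refuse an empty or root target, since we run rm -rf below.
    case "$deploy_dir" in
        ""|"/") echo "refusing to deploy to '$deploy_dir'" >&2; return 1 ;;
    esac

    # Delete leftovers from manual experiments, then recreate the layout.
    rm -rf "$deploy_dir/bin" "$deploy_dir/conf" "$deploy_dir/state" || return 1
    mkdir -p "$deploy_dir/bin" "$deploy_dir/conf" "$deploy_dir/state" || return 1

    # Replace binaries and configuration with the ones built from main.
    cp "$build_dir"/bin/* "$deploy_dir/bin/" || return 1
    cp "$build_dir"/conf/* "$deploy_dir/conf/" || return 1
}
```

A CI job on main could call something like `clean_deploy ./build "$DEPLOY_DIR"` and then restart the services, with `DEPLOY_DIR` being whatever path ends up being used on the staging host.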

tst looks weird to me, can we make that test or staging, since, well, it's a staging server?

I'm ok with that, let's go for test. staging is too long.

I don't think we'll connect them to the prometheus server

why not?

It will require us to redesign our dashboards a bit to handle the staging data, but maybe.

would help tremendously if you have existing examples of other projects deploying such staging servers, otherwise we'll look into it of course.

I agree, I'll poke around to see if I can find something.

changed due date to August 28, 2023

added Anti-Censorship Backlog Prometheus lifecycle labels

assigned to @anarcat

@meskio could you read up on https://docs.gitlab.com/ee/ci/services/ and let me know if you'd be autonomous in setting that up? for podman there's a feature flag to enable for this to work, but i'm just setting up the podman executor now (#41296 (closed))...

What are you suggesting here? To set up rdsys as a CI service? AFAIK this is not meant to leave a service running after the CI has finished, nor for us to access the service and modify it. We might want to do that at some point for integration tests, but for now we want to have a server with the latest version of rdsys where we can try things while doing development.

In my head the staging server is loosely connected to doing integration tests in the CI, and I want to use this work to think about how to do integration tests and see if we can reuse some pieces for both.

What are you suggesting here? To set up rdsys as a CI service?

not the prod version, but i was under the impression that would work for the staging version...

In my head the staging server is loosely connected to doing integration tests in the CI, and I want to use this work to think about how to do integration tests and see if we can reuse some pieces for both.

Maybe my mental model of what you're proposing is unclear. Could you make a diagram or a longer-form explanation of how things would work?

Maybe my mental model of what you're proposing is unclear. Could you make a diagram or a longer-form explanation of how things would work?

The idea is to have a rdsys setup with fake bridges where we can test new features we are developing. I thought it might be handy if that setup is automatically deployed on every commit on main, so our messy tests get deleted and we have a clean system with the latest piece of code.

But I have to recognize I don't have a clear plan here.

I guess the normal steps in a workflow with my proposal will be:

1. a commit into main triggered the CI to do a clean deployment in our staging server
2. I start working on a new feature
3. I deploy my work in progress feature in the staging server to test it
4. I finish the development and create a merge request to rdsys
5. My merge request gets merged into main and the CI cleans up my mess in the staging server and leaves it in a clean working status (back to 1.)

okay, so that's a good first draft. let's push this a little bit.

a commit into main triggered the CI to do a clean deployment in our staging server

that's the same as step 5 below, right?

I start working on a new feature

I deploy my work in progress feature in the staging server to test it

I finish the development and create a merge request to rdsys

My merge request gets merged into main and the CI cleans up my mess in the staging server and leaves it in a clean working status (back to 1.)

okay, i find this really confusing. why does CI mess with your setup at all in this case? it seems you would have a CI job running only on the protected branch here (main) and all it would do would be to trash your dev environment to rebuild it with the stuff from main, and then... do nothing?

how do you intend on doing the other steps above? like when you "work on a new feature" and "deploy [on] the staging server", what does that mean concretely? you deploy with git? rsync? copy-paste?

i ask, because typically what we would do is that merge requests would create MR-specific "environments" and those get deployed on their own to wherever we choose. this could be a new VM, a vhost in a VM, a prefix on a vhost in a VM, or could be a container image, it can be anything really.

so i find the above process a bit confusing because it mixes up manual deployments and automated deployments on the same host. i think that's error prone and bound to create problems, for example permissions problems between files managed by you and the ones deployed by CI.

i would much rather have all of this deployed by CI, including pipelines running from your merge request. that might require a bit more thought on how we actually design this thing, but it seems like we need to think this through anyway...
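
One way to keep per-merge-request deployments apart on a single host is to derive a deploy prefix from GitLab's predefined CI variables. The sketch below assumes merge request pipelines (where `CI_MERGE_REQUEST_IID` is set) and a base directory passed in by the caller; `deploy_prefix` is an illustrative name, not an existing tool.

```shell
# deploy_prefix BASE_DIR
# Print the directory this pipeline should deploy into:
#   BASE_DIR/mr-<iid>  for merge request pipelines,
#   BASE_DIR/main      for everything else (e.g. the protected branch).
# CI_MERGE_REQUEST_IID is a GitLab predefined variable; BASE_DIR is an
# assumption supplied by whoever designs the host layout.
deploy_prefix() {
    base=$1
    if [ -n "${CI_MERGE_REQUEST_IID:-}" ]; then
        printf '%s/mr-%s\n' "$base" "$CI_MERGE_REQUEST_IID"
    else
        printf '%s/main\n' "$base"
    fi
}
```

With a layout like this, a pipeline from a merge request never touches the files deployed from main, which sidesteps the mixed-ownership problem described above.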

how do you develop this locally right now? maybe we can take inspiration from that?

We had a discussion about that on IRC. I need to rethink the CI setup, but we'll move ahead with setting up a staging server and see whether we connect it to the CI afterwards.

@anarcat What is the status of this issue? How can I help to make this happen?

The current status here is "due, mostly forgotten, oops". I'll see how i can schedule this, thanks.

added Next label and removed Backlog label

added Sponsor 150 label

this was done:

    gnt-instance add \
      -o debootstrap+bookworm \
      -t drbd --no-wait-for-sync \
      --net 0:ip=pool,network=gnt-dal-01 \
      --no-ip-check \
      --no-name-check \
      --disk 0:size=10G \
      --disk 1:size=20G \
      --backend-parameters memory=8g,vcpus=2 \
      rdsys-frontend-test-02.torproject.org

next up is to run the new machine setup.

update: removed and recreated the box as -02 to avoid confusion.

server bootstrapped and now in DNS, @meskio you should have shell access to rdsys-frontend-test-02.torproject.org, can you check? you can sync the host keys from another existing TPO host, or use DNSSEC.

changed the description

marked the checklist item We'll also need everybody from anti-censorship to be able to sudo into that account. as completed

We'll also need everybody from anti-censorship to be able to sudo into that account.

That should be done now, same permissions as rdsys-frontend-01, that is, members of the gettor or rdsys group can log in.

I'm not sure how to do this. We might be OK with a single rdsys account that everyone in ACT can log into and that we can also somehow ssh into from the CI.

How do you want to do the CI ssh thing? Can we ssh into the rdsys user from the CI using an SSH key?

right now there's a rdsys account in the rdsys group that you can all sudo into. we could make another account that's also part of the group for CI, but i'm still unclear on how you expect CI to work in the first place...

then again, maybe i can just setup all the pieces and you play with them as you see fit after!
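
If CI does end up connecting over SSH with a dedicated key, the invocation might look roughly like this. The `rdsys-ci` account name, the key path, and the `CI_SSH_DRY_RUN` switch are assumptions for illustration; only the `ssh` options themselves are standard OpenSSH.

```shell
# ci_ssh KEY_FILE HOST COMMAND...
# Run COMMAND on HOST as a dedicated CI account (the account name
# "rdsys-ci" is hypothetical). Set CI_SSH_DRY_RUN=1 to print the ssh
# command line instead of executing it.
ci_ssh() {
    key_file=$1
    host=$2
    shift 2
    # BatchMode=yes: never prompt for a password, fail instead.
    # IdentitiesOnly=yes: offer only the key given with -i.
    ${CI_SSH_DRY_RUN:+echo} ssh -i "$key_file" \
        -o BatchMode=yes -o IdentitiesOnly=yes \
        "rdsys-ci@$host" "$@"
}
```

The private key would typically live in a protected CI variable rather than in the repository, so that only pipelines on protected branches can reach the host.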

so i need to remove the gettor group here.

https://bridges-tst.torproject.org proxying to http://localhost:7200

what's running on port 7200? does it have a name?

This will be the HTTPS distributor, the replacement that we are working on for the current https://bridges.torproject.org.

https://bridges-tst.torproject.org/status proxying to http://localhost:710/status

did you really mean port 710 here? i'll assume port 7100.

oops, you are right, 7100

changed the description

now i feel like i made a terrible mistake in naming that server like rdsys-frontend, it's the backend, isn't it? so i should probably rip all that out and name it rdsys-backend-01?

I see now that I was trigger-happy and wrote a comment about this before reading this one. Yes, it will be better to rename it. But it is not going to be only the backend; the distributors will live there too. Also, this is going to be a staging/test server, so let's not call it backend. Maybe call it rdsys-test-01.

okay, i think i'm stuck now. i need to be clearer on what this thing is, whether it's the frontend or the backend, to make sure i name things correctly.

added Doing label and removed Next label

marked the checklist item We'll also need everybody from anti-censorship to be able to sudo into that account. as incomplete

added Needs Information label and removed Doing label

I don't think it makes sense to name this rdsys-frontend-...; it's a staging server for the whole of rdsys, not just the frontends, and we'll test the backend there too. Can we just call it rdsys-test-01 (or whatever number you want)?

changed the description

added Doing label and removed Needs Information label

retiring the wrongly-named instance:

anarcat@angela:tsa-misc$ ./retire -H rdsys-frontend-test-02.torproject.org retire-all --parent-host=dal-node-01.torproject.org -v
starting tasks at 2023-08-30 16:11:19.845657+00:00
checking for ganeti master on host dal-node-01.torproject.org
ganeti node detected with master dal-node-01.torproject.org
checking on dal-node-01.torproject.org if instance rdsys-frontend-test-02.torproject.org is running
stopping instance rdsys-frontend-test-02.torproject.org on dal-node-01.torproject.org
Waiting for job 102720 for rdsys-frontend-test-02.torproject.org ...
scheduling rdsys-frontend-test-02.torproject.org instance removal on host dal-node-01.torproject.org
scheduling gnt-instance remove --force rdsys-frontend-test-02.torproject.org to run on dal-node-01.torproject.org in 7 days
warning: commands will be executed using /bin/sh
job 7 at Wed Sep  6 16:11:00 2023
scheduling rdsys-frontend-test-02.torproject.org backup disks removal on host bungei.torproject.org and director bacula-director-01.torproject.org
checking for path "/srv/backups/bacula/rdsys-frontend-test-02.torproject.org/" on bungei.torproject.org
scheduling rm -rf "/srv/backups/bacula/rdsys-frontend-test-02.torproject.org/" to run on bungei.torproject.org in 30 days
warning: commands will be executed using /bin/sh
job 108 at Fri Sep 29 16:11:00 2023
checking for path "/srv/backups/pg/rdsys-frontend-test-02/" on bungei.torproject.org
path /srv/backups/pg/rdsys-frontend-test-02/ not found: [Errno 2] No such file
scheduling echo delete client=rdsys-frontend-test-02.torproject.org-fd yes | bconsole to run on bacula-director-01.torproject.org in 30 days
warning: commands will be executed using /bin/sh
job 59 at Fri Sep 29 16:11:00 2023
Notice: Revoked certificate with serial 185
Notice: Removing file Puppet::SSL::Certificate rdsys-frontend-test-02.torproject.org at '/var/lib/puppet/ssl/ca/signed/rdsys-frontend-test-02.torproject.org.pem'
rdsys-frontend-test-02.torproject.org
Submitted 'deactivate node' for rdsys-frontend-test-02.torproject.org with UUID df1e580d-e663-4442-8571-c67cd40bdeb6
completed tasks, elasped: 0:00:26.924690 (user 2.43 system 0.07 chlduser 0.04 chldsystem 0.01 RSS 57.6 MB)
anarcat@angela:tsa-misc$

deleted from LDAP:

589 host=rdsys-frontend-test-02,ou=hosts,dc=torproject,dc=org
host: rdsys-frontend-test-02
hostname: rdsys-frontend-test-02.torproject.org
objectClass: top
objectClass: debianServer
l: Dallas, TX, USA
distribution: Debian
access: restricted
admin: torproject-admin@torproject.org
architecture: amd64
physicalHost: gnt-dal
description: rdsys test server
ipHostNumber: 204.8.99.151
ipHostNumber: 2620:7:6002:0:466:39ff:fe6f:3fa9
rebootPolicy: justdoit
sshRSAHostKey: ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIEXZsnH3dIegPgnrhNLgLA1CWVT8yJTjDrtgCectYCfv root@rdsys-frontend-test-02.torproject.org
sshRSAHostKey: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQC/HeDiGAA8kM9+jbZJHRJh0MoEG3sf8pWdcC46D+A7fBGT4a+fQOjE+al47GWsJu/DPnsC5iUxL58EE/ni7/eYveJNi1SDesmDiCwDqEoXGBoNBI3mYmJOhJszjP0eBkueEbVFv+RXwmCq74IAW9vw8nUh09L3ihzgrJdI7sVPdjtemJSLFEpY2a9GjUM5ORbNoBJnPD8KzxmRcdKYrv4sWk6LVn8Q1qmDpnWHSZUPlGTq185/FRJQXW2boy/hcxIRHuZUmzM6Nh4fqhWhHaZe+gQtYfJI3wvdKHLwwQJnGHyF50aP/LZT5LDf9kzzaEz6GawqQVPQEd94a8uAuhDPLpWM8jkhUGHME6OZVkgvAcJsPcZoRllRTDL5jEWkGS3C8IqY8SwR6PEY8Di9220divziJ3/IvXLA4/b+g1LmeChY+ZOxoLbsthUL51ulxso0GxZh3Ev3W5W4sPAAVQJhNa/ftHy+TOEw5Z4x69hkwQZX9fDGzgMyba6OoDkKiRk= root@rdsys-frontend-test-02.torproject.org
allowedGroups: rdsys
allowedGroups: gettor

changes to the password manager and nagios will be done through a rename instead.

re-creating the VM:

root@dal-node-01:~# gnt-instance add \
      -o debootstrap+bookworm \
      -t drbd --no-wait-for-sync \
      --net 0:ip=pool,network=gnt-dal-01 \
      --no-ip-check \
      --no-name-check \
      --disk 0:size=10G \
      --disk 1:size=20G \
      --backend-parameters memory=8g,vcpus=2 \
      rdsys-test-01.torproject.org
Wed Aug 30 16:14:34 2023  - INFO: Selected nodes for instance rdsys-test-01.torproject.org via iallocator hail: dal-node-03.torproject.org, dal-node-01.torproject.org
Wed Aug 30 16:14:34 2023  - INFO: NIC/0 inherits netparams ['br0', 'bridged', '']
Wed Aug 30 16:14:34 2023  - INFO: Chose IP 204.8.99.152 from network gnt-dal-01
Wed Aug 30 16:14:35 2023 * creating instance disks...
Wed Aug 30 16:14:52 2023 adding instance rdsys-test-01.torproject.org to cluster config
Wed Aug 30 16:14:52 2023 adding disks to cluster config
Wed Aug 30 16:14:52 2023 * checking mirrors status
Wed Aug 30 16:14:52 2023  - INFO: - device disk/0:  3.60% done, 4m 7s remaining (estimated)
Wed Aug 30 16:14:52 2023  - INFO: - device disk/1:  0.30% done, 9m 10s remaining (estimated)
Wed Aug 30 16:14:52 2023 * checking mirrors status
Wed Aug 30 16:14:53 2023  - INFO: - device disk/0:  3.80% done, 4m 20s remaining (estimated)
Wed Aug 30 16:14:53 2023  - INFO: - device disk/1:  0.40% done, 5m 50s remaining (estimated)
Wed Aug 30 16:14:53 2023 * pausing disk sync to install instance OS
Wed Aug 30 16:14:54 2023 * running the instance OS create scripts...

done:

256 SHA256:yCBRUO5sAy2GXJ4mhO6VOYIOomYWBKWt/hETSSf274A root@rdsys-test-01.torproject.org (ED25519)
3072 SHA256:gpoC7x5VYcorPLXv6Kyj92C1DbRB/gll6OUzwnpEcJE root@rdsys-test-01.torproject.org (RSA)
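
The fingerprints above can be compared against what the server actually presents before the first login. A small sketch, assuming OpenSSH's `ssh-keyscan` and `ssh-keygen` are available; `fingerprint_ok` is a hypothetical helper that only does the comparison.

```shell
# Expected ED25519 fingerprint, copied from the listing above.
expected='SHA256:yCBRUO5sAy2GXJ4mhO6VOYIOomYWBKWt/hETSSf274A'

# fingerprint_ok EXPECTED
# Read `ssh-keygen -l` style output on stdin (bits, fingerprint, comment,
# key type) and succeed only if some line carries the expected fingerprint.
fingerprint_ok() {
    awk -v want="$1" '$2 == want { found = 1 } END { exit !found }'
}

# Example use against the live host (requires network access):
#   ssh-keyscan -t ed25519 rdsys-test-01.torproject.org 2>/dev/null \
#     | ssh-keygen -lf - | fingerprint_ok "$expected" && echo "host key matches"
```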

puppet bootstrapping, next step is to make a puppet recipe for this host to do the forwards and everything, but within the hour, @meskio and ACT should have access to rdsys-test-01.torproject.org already.

found latency issues during bootstrap, reported as #41311 (closed).

those seem to be gone after reboot, interestingly.

marked the checklist item We'll also need everybody from anti-censorship to be able to sudo into that account. as completed

We'll also need everybody from anti-censorship to be able to sudo into that account.

so, back into this again, @meskio people in the rdsys group should have access to that server, sudo rules coming up shortly.

I can ssh into the machine and sudo into rdsys, it works thanks :)

prometheus exporters to be exposed, I don't think we'll connect them to the prometheus server, but will be useful to be able to reach them for tests:

Okay hold on here. I was about to open ports 7100, 7600, 7700 and 7800 to the big bad internet right there, but right above that we proxy at least port 7100 (and what's up with the other ports), so which one is it? do you need that stuff proxied or not? :) It's okay to have it both proxied and not, but i just want to make double-sure i don't expose stuff that shouldn't be exposed.

That is a good catch. I think it is fine to expose them, as it will be a testing machine with nothing private on it.

But that makes me realize that either I stop reusing ports for API+metrics or we define a different way to share prometheus exporters. I'll chime in #41280 about that.

okay, so I won't proxy them and just open the firewall wide.

an email account that can send and receive emails over imap and smtp. Maybe gettor-tst@torproject.org?

that is gettor-test@rdsys-test.torproject.org, no global forward for now, do let me know if you really need gettor-test@torproject.org.

That is fine with me, we don't need it in @torproject.org.

Can you send me the credentials and config details for the IMAP/SMTP servers?

oh, that's interesting. i would have figured we'd have a way to automatically do this, but it seems we haven't implemented this yet.

for now i've dropped a plain-text version in ~meskio/rdsys-mail-password.txt on rdsys-test-01. i am happy to write a plain-text version owned by the right user in the right format somewhere else if that's better for you, as for now that's a static copy that won't change if we rotate the password (or when we deploy this to prod)...

what format would you like, and where?

Let me try some ideas and I'll come back with a proposal for this.

changed the description

marked the checklist item an email account that can send and receive emails over imap and smtp. Maybe gettor-tst@torproject.org? as completed

marked the checklist item a web server with: as completed

marked the checklist item https://bridges-tst.torproject.org proxying to http://localhost:7200 as completed

marked the checklist item https://bridges-tst.torproject.org/moat proxying to http://localhost:7500/moat as completed

marked the checklist item https://bridges-tst.torproject.org/status proxying to http://localhost:7100/status as completed

i think we're all good here, here's the current puppet config:

# test backend vhost for a polyanthum replacement
#
# this is not built on top of profile::rdsys::backend because that's
# really messy and too mixed up with polyanthum. we also like the way
# the frontend was deployed, using modular bits of nginx config.
class profile::rdsys::backend_test(
  Array[Stdlib::Host] $vhost_aliases = ['bridges-test.torproject.org'],
) {
  $allow_ipv4 = query_nodes('Class[role::rdsys::frontend]', 'networking.ip')
  $allow_ipv6 = query_nodes('Class[role::rdsys::frontend]', 'networking.ip6')
  $allow_addresses = join($allow_ipv4 + $allow_ipv6, ' ')

  # sudo rules
  sudo::conf { 'rdsys':
    content => @(EOT)
    # This file is managed by Puppet.
    %rdsys      ALL=(rdsys)       ALL
    | EOT
  }

  loginctl_user { 'rdsys':
    linger => enabled,
  }

  ssl::service { $::fqdn:
    notify => Class['nginx::service'],
    key    => true,
    onion  => false,
  }

  include profile::nginx
  nginx::resource::server { $::fqdn:
    server_name         => [$::fqdn] + $vhost_aliases,
    ipv6_enable         => true,
    ipv6_listen_options => '',
    ssl                 => true,
    ssl_redirect        => false,
    ssl_cert            => "/etc/ssl/torproject/certs/${::fqdn}.crt-chained",
    ssl_key             => "/etc/ssl/private/${::fqdn}.key",
    access_log          => "/var/log/nginx/${::fqdn}.access.log",
    format_log          => 'privacy',
    error_log           => "/var/log/nginx/${::fqdn}.error.log",
    location_allow      => [
      query_nodes('Class[roles::monitoring::external]', 'networking.ip')
      + query_nodes('Class[roles::monitoring::external]', 'networking.ip6')
    ],
    proxy               => 'http://localhost:7200/',
    require             => [
      Ssl::Service[$::fqdn],
    ],
  }
  nginx::resource::location { '/moat':
    server   => $::fqdn,
    ssl      => true,
    ssl_only => false,
    proxy    => 'http://localhost:7500/',
  }
  nginx::resource::location { '/status':
    server   => $::fqdn,
    ssl      => true,
    ssl_only => false,
    proxy    => 'http://localhost:7100/',
  }
  include profile::dovecot::private

  # journald mail namespace
  concat::fragment { 's_local-journal.mail':
    target  => '/etc/syslog-ng/conf.d/s_local.conf',
    order   => '10',
    content => @(EOT);
        # ingest messages captured in journald mail namespace
        unix-stream("/run/systemd/journal.mail/dev-log");
      | EOT
  }
  file { '/etc/systemd/journald@mail.conf':
    content => @(EOT);
      # This file is managed by Puppet.
      #
      # keep up to 500M of dovecot and
      # postfix logs in journald
      [Journal]
      SystemMaxUse=500M
      | EOT
  }
  dsa_systemd::override { [ 'postfix', 'postfix@-', 'dovecot' ]:
    content => @(EOF),
      # This file is managed by Puppet.
      #
      # routing these logs to a separate namespace
      # so they don't count towards the default
      # journald storage namespace limit of 1G
      [Service]
      LogNamespace=mail
      | EOF
    require => File['/etc/systemd/journald@mail.conf'],
  }

  # metrics
  # ferm::rule::simple { 'rdsys-backend-metrics-7100':
  #   description => 'Expose rdsys metrics to the Internet',
  #   port        => 7100,
  # }
  # ferm::rule::simple { 'rdsys-backend-metrics-7600':
  #   description => 'Expose rdsys metrics to the Internet',
  #   port        => 7600,
  # }
  # ferm::rule::simple { 'rdsys-backend-metrics-7700':
  #   description => 'Expose rdsys metrics to the Internet',
  #   port        => 7700,
  # }
  # ferm::rule::simple { 'rdsys-backend-metrics-7800':
  #   description => 'Expose rdsys metrics to the Internet',
  #   port        => 7800,
  # }
}

notice the commented out firewall rules there, let me know if that's all good or if it should be enabled.

Looks good, let's enable them.

done, and deployed

added Needs Review label and removed Doing label

marked the checklist item prometheus exporters to be exposed, I don't think we'll connect them to the prometheus server, but will be useful to be able to reach them for tests: as completed

marked the checklist item backend bridges-tst.torproject.org:7100/metrics as completed

marked the checklist item telegram bridges-tst.torproject.org:7600/metrics as completed

marked the checklist item gettor-distributor bridges-tst.torproject.org:7700/metrics as completed
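
A quick smoke test for the exposed exporters could loop over the ports discussed in this issue. This is a sketch only: plain HTTP on those ports is an assumption, the service behind port 7800 is not named in the thread, and `metrics_urls` and `check_metrics` are hypothetical helper names.

```shell
# metrics_urls HOST
# Print one metrics URL per exporter port opened in the firewall
# (7100 backend, 7600 telegram, 7700 gettor-distributor, 7800 unnamed).
metrics_urls() {
    host=$1
    for port in 7100 7600 7700 7800; do
        printf 'http://%s:%s/metrics\n' "$host" "$port"
    done
}

# check_metrics HOST
# Fetch each endpoint with curl and report "ok" or "FAIL" per URL.
check_metrics() {
    metrics_urls "$1" | while read -r url; do
        if curl --silent --fail --max-time 10 --output /dev/null "$url"; then
            echo "ok   $url"
        else
            echo "FAIL $url"
        fi
    done
}
```

Running `check_metrics` with the test host's name gives a one-line status per exporter, which is enough to confirm the firewall rules took effect.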
