The Snowflake protocol is under constant revision, and there is an increasing need for a staging environment to test changes before rolling them out to the production network. (Related to P146, O3.2, "Setup staging servers and CI infrastructure for Snowflake"; and now the stop-work order on this project has been lifted.)
There have been some previous, more or less ad-hoc attempts to set up such a hosting server on a VPS; those efforts are quite manual and could be improved with a more automated process.
So here are the requirements:
a system running container orchestration tools that allows automated deployment of containers.
a router that forwards traffic to containers based on routing rules (some tools include a router)
sufficient system resources to run the containers
Let's discuss: what kind of container orchestration tools and routers do you think would work best?
Nominated Candidates:
k3s
minikube
microk8s
podman-compose
docker-compose
traefik (router)
I'm a little confused by this request, as it seems to somewhat connect (yet ignore) the related issue #41769 where we are asked to perform containerization work for rdsys. Are those two related in any way?
In any case, we're not planning on setting up kubernetes, docker-compose, or container orchestration technologies in our 2025 roadmap, as far as I know, but i would encourage folks to converge over #41769 to discuss how to containerize your workloads in the future.
For now, are there container images for snowflake we can deploy?
i'd be a little hesitant in picking any orchestration system until we have a better idea of what we're dealing with.
@shelikhoo could you clarify why you think we need some container orchestration system for snowflake in particular? how do you deploy the service? our service documentation points at this survival guide that doesn't offer me much information...
do you already have a kubernetes deployment file or a docker compose file you could share? or even various Containerfile samples?
It will build 3 containers, which can be run with podman run commands:
snowflake-broker
snowflake-proxy
snowflake-server
Right now, for local testing, running these containers is manual and no HTTPS has been set up. It is currently something like this (example only):
podman network create --subnet 192.5.0.0/16 snowflake

podman run -d --rm --network snowflake:interface_name=eth0,alias=broker \
  --entrypoint "/snowflake-broker" --name "snowflake-broker" \
  -e 'SNOWFLAKE_TEST_DEBUG=1' -v $(pwd)/data/broker:/opt/broker/ \
  localhost/snowflake-broker -disable-tls -addr :8080 -disable-geoip \
  -default-relay-pattern '^snowflake.torproject.net$' \
  -allowed-relay-pattern 'snowflake.torproject.net$' \
  -bridge-list-path '/opt/broker/bridgelist.jsonl'

podman run -d --rm --network snowflake:interface_name=eth0,alias=stund \
  --entrypoint "/stund" --name "snowflake-stund" localhost/snowflake-stund

for i in {1..8}
do
  podman run -d --rm --network snowflake:interface_name=eth0 \
    --entrypoint "/snowflake-proxy" \
    -e "SNOWFLAKE_TEST_ASSUMEUNRESTRICTED=1" -e "SNOWFLAKE_TEST_PROXY_DEBUG=1" \
    localhost/snowflake-proxy -broker http://broker:8080/ \
    -verbose -unsafe-logging -keep-local-addresses -stun "stun:stund:3478" \
    -allowed-relay-hostname-pattern 'snowflake.torproject.net$' -allow-non-tls-relay
done

podman run -d --rm --network snowflake:interface_name=eth0,alias=httpserver \
  --name snowflake-httpserver -v $(pwd)/data/http:/opt/httpserver/ \
  localhost/snowflake-httpserver python3 -m http.server

podman run -d --rm --network snowflake:interface_name=eth0,alias=transientsnow1-snowflake.torproject.net \
  --entrypoint "/snowflake-server" \
  -e "TOR_PT_MANAGED_TRANSPORT_VER=1" \
  -e "TOR_PT_SERVER_BINDADDR=snowflake-0.0.0.0:8888" \
  -e "TOR_PT_SERVER_TRANSPORTS=snowflake" \
  -e "TOR_PT_ORPORT=$(podman inspect snowflake-httpserver --format {{.NetworkSettings.Networks.snowflake.IPAddress}}):8000" \
  -e "SNOWFLAKE_TEST_KCP_FAST3MODE=1" \
  localhost/snowflake-server -disable-tls

podman run --rm -it --tty --cap-add NET_ADMIN --network snowflake:interface_name=eth0 \
  -e "TOR_PT_MANAGED_TRANSPORT_VER=1" -e "TOR_PT_CLIENT_TRANSPORTS=snowflake" \
  -e "SNOWFLAKE_TEST_FORCELISTENADDR=127.0.0.1:1080" \
  -v $(pwd)/data/clientcompare:/opt/clientcompare/ \
  localhost/snowflake-clientcompare bash
But the commands above are for local testing only, so they do not represent how things would actually work in a deployment environment. With podman alone, a script similar to the one above would need to be run each time a new version is deployed, plus a script to adjust the reverse proxy. An orchestration system would let us:
have a standardized and structured way to run more than one containerized service, instead of using podman commands
deal with HTTPS and domain names via its ingress unit, removing the need to handle them in the container itself
The container images will be built automatically in CI, which will also generate a manifest. To deploy a staging Snowflake server, one just needs to apply the generated manifest. This way there is no need to give CI access to the machine, while still minimizing the effort of deploying a new version of the staging server.
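A rough sketch of that workflow, assuming kubernetes-style manifests published as CI artifacts (the registry, file names, and URL below are placeholders, not the actual CI output):

# CI side: build and push the images, then render a manifest as a job artifact
podman build -t registry.example.net/snowflake-broker:${CI_COMMIT_SHORT_SHA} broker/
podman push registry.example.net/snowflake-broker:${CI_COMMIT_SHORT_SHA}

# staging side: fetch the rendered manifest and apply it,
# so CI never needs credentials for the staging machine
curl -fsSL -o staging-manifest.yaml "<artifact URL for the pipeline>"
kubectl apply -f staging-manifest.yaml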
hey @shelikhoo - i had a conversation with @meskio about the rdsys containerization work he is doing with @lavamind in #41769 and it touches a bit on this project, so we talked about this as well.
just to make things crystal clear: we won't be able to set up a full orchestration framework for this any time soon. the timeline is that we might look at this in 2026, with a possible deployment in 2026 or 2027, but for now it's not roadmapped at all, and won't be unless we're explicitly asked about this in the 2026 roadmapping process (which typically starts in november or december for us).
what we could do is provide you with a VM where you could run podman-compose or whatever you want. right now we favor podman deployments because it provides a smooth migration from our current systemd services approach, but if you want we might set you up with docker as well. i would recommend podman because it's easier to run as a normal user.
this can be a VM where you deploy compose files yourself, or we can also set you up with a "shell runner", which is a special kind of gitlab runner that executes shell commands directly on the server instead of inside a container. we're using this to do container deployments in donate-neo right now, and will likely use a similar approach to deploy rdsys containers, so maybe that would be interesting to you as well.
podman might also be able to deal with a subset of kubernetes deployment files, and you might want to work on those instead of podman-compose if you want to look forward to kubernetes deployments, but don't rely on the full kubernetes stack being available any time soon.
this, in particular, would exclude any sort of control over the ingress units for now.
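To illustrate the podman-with-kubernetes-files idea, here is a minimal sketch (the pod name, image, and flags are assumptions drawn from the local-testing commands above, not a tested manifest):

# write a minimal kubernetes Pod manifest
cat > broker-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: snowflake-broker
spec:
  containers:
    - name: broker
      image: localhost/snowflake-broker
      command: ["/snowflake-broker"]
      args: ["-disable-tls", "-addr", ":8080"]
EOF

# run it rootless with podman ("podman play kube" on older versions);
# the same file should also apply unchanged on a real cluster
podman kube play broker-pod.yaml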
i'm kind of sorry we don't have a kubernetes cluster ready for you: i've been thinking k8s (or a subset of it) is actually something we need here to help various teams do their work, but it's kind of a big pill to swallow, and a lot of complexity to add to our stack, with no promise of reducing our workload in the short term, so it's been a hard target to set. i don't exclude working on this in the future, but it's going to require more careful planning than a "just this issue" kind of thing.
i think the next step here is for you to experiment with podman-compose. if you want TPA to be involved in the deployment, we can look into deploying this the same way we deploy rdsys, likely after rdsys (end of march, in theory).
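For that experiment, a minimal compose file could look something like this (service name, ports, and flags are assumptions, just to show the shape):

cat > compose.yaml <<'EOF'
services:
  broker:
    image: localhost/snowflake-broker
    entrypoint: ["/snowflake-broker"]
    command: ["-disable-tls", "-addr", ":8080"]
    ports:
      - "8080:8080"
EOF

podman-compose up -d   # bring the stack up rootless
podman-compose down    # tear it down again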
Thanks for the super long reply. I will have a closer look and provide an itemized reply soon, but there is one thing really on my mind:
The reason an ingress unit would be beneficial is that it can create and manage a wildcard certificate acquired automatically via ACME. That way each component can have its own domain name and get automatic domain-name-level routing.
How would domain names and certificates be managed on such an "IaaS" machine instead of a "SaaS" one? Because of how cookies work in browsers, a compromised subdomain could set a cookie for another subdomain as long as they belong to the same public suffix.
How would domain names and certificates be managed on such an "IaaS" machine instead of a "SaaS" one?
i'm not sure what IaaS or SaaS refers to in this context, could you clarify?
Because of how cookies work in browsers, a compromised subdomain could set a cookie for another subdomain as long as they belong to the same public suffix.
well, given that this is a staging server, that shouldn't be too much of an issue, should it?
i would assume you'd have a domain and cert specifically for the staging environment, perhaps multiple or a wildcard. What i would avoid is multiple names per branch.
Sorry, "IaaS" means TPA provide a machine and user decides what to install on it(like a VPS). "SaaS" means TPA provide an online service, and user use this online service(like an email service or git hosting service).
i would assume you'd have a domain and cert specifically for the staging environment
The easiest way would be a wildcard certificate like "staging-snowflake.xxxxx.net, *.staging-snowflake.xxxxx.net", so all use cases would be covered by a single certificate. The ingress would then forward traffic to services based on the domain name.
"What i would avoid is multiple names per branch."
I imagine that for each deployment there will be more than one domain name, like "server-mergerequest123.staging-snowflake.xxxxx.net" and "broker-mergerequest123.staging-snowflake.xxxxx.net"; however, each deployment name, like "mergerequest123", would appear as the postfix in every domain name associated with that deployment. Does this sound like something that would work for you?
Sorry, "IaaS" means TPA provide a machine and user decides what to install on it(like a VPS). "SaaS" means TPA provide an online service, and user use this online service(like an email service or git hosting service).
Thank you for the clarification. I ask because Kubernetes and friends are often called IaaS platforms precisely because you rent hardware. I imagine SaaS platforms, as you say, like "here, you get a wordpress" kind of platforms, which i don't think is what you actually had in mind here in the first place (as you're the developer!).
i would assume you'd have a domain and cert specifically for the staging environment
The easiest way would be a wildcard certificate like "staging-snowflake.xxxxx.net, *.staging-snowflake.xxxxx.net", so all use cases would be covered by a single certificate. The ingress would then forward traffic to services based on the domain name.
"What i would avoid is multiple names per branch."
I imagine that for each deployment there will be more than one domain name, like "server-mergerequest123.staging-snowflake.xxxxx.net" and "broker-mergerequest123.staging-snowflake.xxxxx.net"; however, each deployment name, like "mergerequest123", would appear as the postfix in every domain name associated with that deployment. Does this sound like something that would work for you?
That's exactly what I'm trying to avoid here.
Having multiple names per merge request is a complication we cannot support at the moment, I think.
I understand that having more than one domain name per branch is too complex to appropriately deal with on the TPA side. Unfortunately, not having a domain name per component also requires a lot of engineering to work around. I have an alternative proposal: I can use one of my existing domain names (non-Tor related) that I personally manage as the "certificate domain" and use dns-01 to get certificates for the machine. On the staging server, the ingress (reverse proxy) listens on port 20443, so that it does not need root to bind to this port and can still communicate with the snowflake testing clients and proxies.
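A minimal sketch of that proposal with acme.sh (the domain here is a placeholder; the actual commands used later are documented further down in this issue):

# dns-01 needs no listening ports at all, so it can run as an unprivileged user;
# the ingress then serves HTTPS on an unprivileged port such as 20443
acme.sh --issue --dns dns_acmedns \
  -d staging.example.net -d '*.staging.example.net'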
I understand that having more than one domain name per branch is too complex to appropriately deal with on the TPA side. Unfortunately, not having a domain name per component also requires a lot of engineering to work around. I have an alternative proposal: I can use one of my existing domain names (non-Tor related) that I personally manage as the "certificate domain" and use dns-01 to get certificates for the machine. On the staging server, the ingress (reverse proxy) listens on port 20443, so that it does not need root to bind to this port and can still communicate with the snowflake testing clients and proxies.
If you're willing to run the ingress, as I said, you can pretty much do whatever you want. :)
Just be careful you are not reinventing kubernetes from scratch, which is my whole concern here.
I have attempted to work with podman-compose, but it does not work with terraform and would require additional work to get it running with a scriptable ingress controller. So I decided not to reinvent kubernetes by just running kubernetes... It takes more resources than doing everything by hand would, but I think it is worth it for the time saved by not having to build a custom solution.
The real setup is done with terraform (actually opentofu), with fully automated deployment.
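The day-to-day loop for such a setup would presumably be the standard opentofu workflow (a sketch; the directory layout is an assumption, not the actual repository structure):

cd deploy/    # hypothetical directory holding the .tf files
tofu init     # fetch providers (e.g. the kubernetes provider)
tofu plan     # preview changes against the staging cluster
tofu apply    # create or update the staging deployment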
I have got a working setup for running a rootless single node kubernetes instance.
Custom Setup Instructions for root
(Please do not disable unprivileged_userns_clone)
run the following file as root: https://gitlab.torproject.org/shelikhoo/snowflakestaging/-/blob/d5c1fd304e95b928acfe463fda332c5dfb817a28/structure_config/k3s/sbin/init_by_root.sh, and then
(This step will be done by @shelikhoo , content here for documentation)
# Switch to service account, must use machinectl not sudo, as sudo does not have systemd integration
machinectl shell --uid k3shost

# !!! copy the content of https://gitlab.torproject.org/shelikhoo/snowflakestaging/-/tree/main/structure_config?ref_type=heads to ~/.config
# !!! copy the k3s binary to ~/.config/k3s/bin/k3s

# Setup ACME
export HOME=/home/k3shost/.config/acmesh/state/
/home/k3shost/.config/acmesh/bin/acme.sh --register-account -m shelikhoo@torproject.org --server ***
export ACMEDNS_BASE_URL="https://auth.acme-dns.io"
export ACMEDNS_USERNAME="***"
export ACMEDNS_PASSWORD="***"
export ACMEDNS_SUBDOMAIN="***"
/home/k3shost/.config/acmesh/bin/acme.sh --issue --dns dns_acmedns \
  -d vwyjlwqyoh3sqmycg6wmi5e732the58s3png-testing.*** \
  -d '*.vwyjlwqyoh3sqmycg6wmi5e732the58s3png-testing.***' --server ***

# Enable and Run k3s services
export HOME=/home/k3shost/
systemctl enable --user --now k3s-rootless.service
cd ~/.config/k3s/conf
cat * | HOME=~/.config/k3s/state/ ~/.config/k3s/bin/k3s kubectl apply -f 00-traefik-tls.yaml

# Enable and Run ACME Renew services
systemctl enable --user --now acme-cron.timer
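Once the user services are up, a quick sanity check could look like this (paths taken from the script above, run from the same k3shost session):

# verify the rootless k3s node and the workloads it is running
HOME=~/.config/k3s/state/ ~/.config/k3s/bin/k3s kubectl get nodes
HOME=~/.config/k3s/state/ ~/.config/k3s/bin/k3s kubectl get pods -A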
Server Spec advice
(Assuming all resources can be upgraded later if necessary)
@shelikhoo / @lavamind how does snowflake-staging-01.torproject.org sound as a name?
I think this name is nice. It will not be seen by those connecting to the staging snowflake instance; a separate domain name will be used to acquire the certificate via ACME dns-01.
does that mean you do not want IPv6? normally all our VMs get one.
No, I mean it does not need IPv6; having one will not adversely impact anything.
if you're going to roll out your own TLS certs, you will have trouble doing that on torproject.org and .net as we have CAA records bound to our specific account-uris. we can add exceptions of course, but i thought you should know.
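For reference, those CAA constraints are visible with a plain DNS query (the exact output will vary):

dig +short torproject.org CAA
dig +short torproject.net CAA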
I'm preparing to provision this VM soon. In the team we're thinking of starting to provision new hosts on trixie (currently "testing") right away, to avoid a release upgrade further down the line. Does this sound like a plan that works for you?
the new VM is now online. The only detail that I'm missing is who to give ssh access to that host. can we reuse the rdsys group or do we need a group with a different set of users? if so, can you list who needs to be able to ssh into snowflake-staging-01 ?
It will make more sense to call the group anti-censorship rather than rdsys, as this machine is not related to rdsys. But the people that should have access to it are the same.
trying to create a group alias in ldap. online documentation on how to do this is not great. It seems as though it directly depends on what schema and object classes are being used. posixGroup seems to have some mechanism for this called RFC2307bis but we're not using the corresponding objectClass.
I've tried copying the following, but I get an error saying that member is not allowed for the object class:
if I try adding objectClass: groupOfNames to the above to enable member and memberOf, I get the error invalid structural object class chain (debianGroup/groupOfNames)
I've created the group using this attribute. I'll have to test whether it's achieving what we want or not. but so far I'm not even seeing users and groups from ldap on snowflake-staging-01 so I'll have to figure out what's happening there
ok I've removed the subGroup line from the anti-censorship group and added all members of the rdsys group to the anti-censorship group. we'll have to maintain membership of this new group separately from now on.
and now the ldap users and groups show up. so I'll assume that subGroup just broke things.
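For what it's worth, a quick way to confirm the host resolves the new group and its members (the user name below is a placeholder):

getent group anti-censorship   # should list the group and its members
getent passwd someuser         # should resolve an LDAP user on the host
id someuser                    # should include the anti-censorship group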
I've also added a corresponding role uid and the creation of its home dir on the right volume via puppet.