The Snowflake protocol is under constant revision, and there is an increasing need for a staging environment to test changes before rolling them out to the production network. (Related to P146, O3.2, "Setup staging servers and CI infrastructure for Snowflake"; and now the stop-work order on this project has been lifted.)
There have been some previous, more or less ad-hoc attempts to set up such a hosting server on a VPS; those efforts are quite manual and could be improved with a more automated process.
So here are the requirements:
a system running container orchestration tools that allows automated deployment of containers.
a router that forwards traffic to containers based on routing rules (some tools include a router)
sufficient system resources to run the containers
Let's discuss: what kind of container orchestration tools and routers do you think would work best?
Nominated Candidates:
k3s
minikube
microk8s
podman-compose
docker-compose
traefik (router)
I'm a little confused by this request, as it seems to somewhat connect (yet ignore) the related issue #41769 where we are asked to perform containerization work for rdsys. Are those two related in any way?
In any case, we're not planning on setting up kubernetes, docker-compose, or container orchestration technologies in our 2025 roadmap, as far as I know, but i would encourage folks to converge over #41769 to discuss how to containerize your workloads in the future.
For now, are there container images for snowflake we can deploy?
i'd be a little hesitant in picking any orchestration system until we have a better idea of what we're dealing with.
@shelikhoo could you clarify why you think we need some container orchestration system for snowflake in particular? how do you deploy the service? our service documentation points at this survival guide that doesn't offer me much information...
do you already have a kubernetes deployment file or a docker compose file you could share? or even various Containerfile samples?
It will build 3 containers, which can be run with podman run commands:
snowflake-broker
snowflake-proxy
snowflake-server
Right now, for local testing, running these containers is manual and no HTTPS has been set up. It is currently something like this (example only):
podman network create --subnet 192.5.0.0/16 snowflake

podman run -d --rm --network snowflake:interface_name=eth0,alias=broker \
  --entrypoint "/snowflake-broker" --name "snowflake-broker" \
  -e 'SNOWFLAKE_TEST_DEBUG=1' -v $(pwd)/data/broker:/opt/broker/ \
  localhost/snowflake-broker -disable-tls -addr :8080 -disable-geoip \
  -default-relay-pattern '^snowflake.torproject.net$' \
  -allowed-relay-pattern 'snowflake.torproject.net$' \
  -bridge-list-path '/opt/broker/bridgelist.jsonl'

podman run -d --rm --network snowflake:interface_name=eth0,alias=stund \
  --entrypoint "/stund" --name "snowflake-stund" localhost/snowflake-stund

for i in {1..8}
do
  podman run -d --rm --network snowflake:interface_name=eth0 \
    --entrypoint "/snowflake-proxy" \
    -e "SNOWFLAKE_TEST_ASSUMEUNRESTRICTED=1" -e "SNOWFLAKE_TEST_PROXY_DEBUG=1" \
    localhost/snowflake-proxy -broker http://broker:8080/ \
    -verbose -unsafe-logging -keep-local-addresses -stun "stun:stund:3478" \
    -allowed-relay-hostname-pattern 'snowflake.torproject.net$' -allow-non-tls-relay
done

podman run -d --rm --network snowflake:interface_name=eth0,alias=httpserver \
  --name snowflake-httpserver -v $(pwd)/data/http:/opt/httpserver/ \
  localhost/snowflake-httpserver python3 -m http.server

podman run -d --rm --network snowflake:interface_name=eth0,alias=transientsnow1-snowflake.torproject.net \
  --entrypoint "/snowflake-server" \
  -e "TOR_PT_MANAGED_TRANSPORT_VER=1" \
  -e "TOR_PT_SERVER_BINDADDR=snowflake-0.0.0.0:8888" \
  -e "TOR_PT_SERVER_TRANSPORTS=snowflake" \
  -e "TOR_PT_ORPORT=$(podman inspect snowflake-httpserver --format {{.NetworkSettings.Networks.snowflake.IPAddress}}):8000" \
  -e "SNOWFLAKE_TEST_KCP_FAST3MODE=1" \
  localhost/snowflake-server -disable-tls

podman run --rm -it --tty --cap-add NET_ADMIN --network snowflake:interface_name=eth0 \
  -e "TOR_PT_MANAGED_TRANSPORT_VER=1" -e "TOR_PT_CLIENT_TRANSPORTS=snowflake" \
  -e "SNOWFLAKE_TEST_FORCELISTENADDR=127.0.0.1:1080" \
  -v $(pwd)/data/clientcompare:/opt/clientcompare/ \
  localhost/snowflake-clientcompare bash
But the commands above are for local testing only, so they do not represent how things would actually work in a deployment environment. With podman alone, a script similar to the one above would need to be run each time a new version is deployed, plus a script to adjust the reverse proxy. An orchestration system would let us:
have a standardized and structured way to run more than one containerized service, instead of using podman commands
deal with HTTPS and domain names via its ingress unit, removing the need to handle them in the container itself
The container images will be built automatically in CI, which will also generate a manifest. To deploy a staging Snowflake server, one just needs to apply the generated manifest. This way there is no need to give CI access to the machine, while still minimizing the effort of deploying a new version of the staging server.
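A rough sketch of that workflow, assuming kubernetes-style manifests published as CI artifacts (the registry, file names, and URL below are placeholders, not the actual CI output):

# CI side: build and push the images, then render a manifest as a job artifact
podman build -t registry.example.net/snowflake-broker:${CI_COMMIT_SHORT_SHA} broker/
podman push registry.example.net/snowflake-broker:${CI_COMMIT_SHORT_SHA}

# staging side: fetch the rendered manifest and apply it,
# so CI never needs credentials for the staging machine
curl -fsSL -o staging-manifest.yaml "<artifact URL for the pipeline>"
kubectl apply -f staging-manifest.yaml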
hey @shelikhoo - i had a conversation with @meskio about the rdsys containerization work he is doing with @lavamind in #41769 and it touches a bit on this project, so we talked about this as well.
just to make things crystal clear: we won't be able to set up a full orchestration framework for this any time soon. the timeline is that we might look at this in 2026, with a possible deployment in 2026 or 2027, but for now it's not roadmapped at all, and won't be unless we're explicitly asked about this in the 2026 roadmapping process (which typically starts in november or december for us).
what we could do is provide you with a VM where you could run podman-compose or whatever you want. right now we favor podman deployments because it provides a smooth migration from our current systemd services approach, but if you want we might set you up with docker as well. i would recommend podman because it's easier to run as a normal user.
this can be a VM where you deploy compose files yourself, or we can also set you up with a "shell runner", which is a special kind of gitlab runner that executes shell commands directly on the server instead of inside a container. we're using this to do container deployments in donate-neo right now, and will likely use a similar approach to deploy rdsys containers, so maybe that would be interesting to you as well.
podman might also be able to deal with a subset of kubernetes deployment files, and you might want to work on those instead of podman-compose if you want to look forward to kubernetes deployments, but don't rely on the full kubernetes stack being available any time soon.
this, in particular, would exclude any sort of control over the ingress units for now.
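To illustrate the podman-with-kubernetes-files idea, here is a minimal sketch (the pod name, image, and flags are assumptions drawn from the local-testing commands above, not a tested manifest):

# write a minimal kubernetes Pod manifest
cat > broker-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: snowflake-broker
spec:
  containers:
    - name: broker
      image: localhost/snowflake-broker
      command: ["/snowflake-broker"]
      args: ["-disable-tls", "-addr", ":8080"]
EOF

# run it rootless with podman ("podman play kube" on older versions);
# the same file should also apply unchanged on a real cluster
podman kube play broker-pod.yaml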
i'm kind of sorry we don't have a kubernetes cluster ready for you: i've been thinking k8s (or a subset of it) is actually something we need here to help various teams do their work, but it's kind of a big pill to swallow, and a lot of complexity to add to our stack, with no promise of reducing our workload in the short term, so it's been a hard target to set. i don't exclude working on this in the future, but it's going to require more careful planning than a "just this issue" kind of thing.
i think the next step here is for you to experiment with podman-compose. if you want TPA to be involved in the deployment, we can look into deploying this the same way we deploy rdsys, likely after rdsys (end of march, in theory).
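For that experiment, a minimal compose file could look something like this (service name, ports, and flags are assumptions, just to show the shape):

cat > compose.yaml <<'EOF'
services:
  broker:
    image: localhost/snowflake-broker
    entrypoint: ["/snowflake-broker"]
    command: ["-disable-tls", "-addr", ":8080"]
    ports:
      - "8080:8080"
EOF

podman-compose up -d   # bring the stack up rootless
podman-compose down    # tear it down again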
Thanks for the super long reply. I will have a closer look and provide an itemized reply soon, but there is one thing really on my mind:
The reason an ingress unit would be beneficial is that it can create and manage a wildcard certificate acquired automatically via ACME. That way each component can have its own domain name and get automatic domain-name-level routing.
How would domain names and certificates be managed on such an "IaaS" machine instead of a "SaaS" one? Because of how cookies work in browsers, a compromised subdomain could set a cookie for another subdomain as long as they belong to the same public suffix.
How would domain names and certificates be managed on such an "IaaS" machine instead of a "SaaS" one?
i'm not sure what IaaS or SaaS refers to in this context, could you clarify?
Because of how cookies work in browsers, a compromised subdomain could set a cookie for another subdomain as long as they belong to the same public suffix.
well, given that this is a staging server, that shouldn't be too much of an issue, should it?
i would assume you'd have a domain and cert specifically for the staging environment, perhaps multiple or a wildcard. What i would avoid is multiple names per branch.
Sorry, "IaaS" means TPA provide a machine and user decides what to install on it(like a VPS). "SaaS" means TPA provide an online service, and user use this online service(like an email service or git hosting service).
i would assume you'd have a domain and cert specifically for the staging environment
The easiest way would be a wildcard certificate like "staging-snowflake.xxxxx.net, *.staging-snowflake.xxxxx.net", so all use cases would be covered by a single certificate. The ingress would then forward traffic to services based on the domain name.
"What i would avoid is multiple names per branch."
I imagine that for each deployment there will be more than one domain name, like "server-mergerequest123.staging-snowflake.xxxxx.net" and "broker-mergerequest123.staging-snowflake.xxxxx.net"; however, each deployment name, like "mergerequest123", would appear as the postfix in every domain name associated with that deployment. Does this sound like something that would work for you?
Sorry, "IaaS" means TPA provide a machine and user decides what to install on it(like a VPS). "SaaS" means TPA provide an online service, and user use this online service(like an email service or git hosting service).
Thank you for the clarification. I ask because Kubernetes and friends are often called IaaS platforms precisely because you rent hardware. I imagine SaaS platforms, as you say, like "here, you get a wordpress" kind of platforms, which i don't think is what you actually had in mind here in the first place (as you're the developer!).
i would assume you'd have a domain and cert specifically for the staging environment
The easiest way would be a wildcard certificate like "staging-snowflake.xxxxx.net, *.staging-snowflake.xxxxx.net", so all use cases would be covered by a single certificate. The ingress would then forward traffic to services based on the domain name.
"What i would avoid is multiple names per branch."
I imagine that for each deployment there will be more than one domain name, like "server-mergerequest123.staging-snowflake.xxxxx.net" and "broker-mergerequest123.staging-snowflake.xxxxx.net"; however, each deployment name, like "mergerequest123", would appear as the postfix in every domain name associated with that deployment. Does this sound like something that would work for you?
That's exactly what I'm trying to avoid here.
Having multiple names per merge request is a complication we cannot support at the moment, I think.
I understand that having more than one domain name per branch is too complex to appropriately deal with on the TPA side. Unfortunately, not having a domain name per component also requires a lot of engineering to work around. I have an alternative proposal: I can use one of my existing domain names (non-Tor related) that I personally manage as the "certificate domain" and use dns-01 to get certificates for the machine. On the staging server, the ingress (reverse proxy) listens on port 20443, so that it does not need root to bind to this port and can still communicate with the snowflake testing clients and proxies.
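A minimal sketch of that proposal with acme.sh (the domain here is a placeholder; the actual commands used later are documented further down in this issue):

# dns-01 needs no listening ports at all, so it can run as an unprivileged user;
# the ingress then serves HTTPS on an unprivileged port such as 20443
acme.sh --issue --dns dns_acmedns \
  -d staging.example.net -d '*.staging.example.net'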
I understand that having more than one domain name per branch is too complex to appropriately deal with on the TPA side. Unfortunately, not having a domain name per component also requires a lot of engineering to work around. I have an alternative proposal: I can use one of my existing domain names (non-Tor related) that I personally manage as the "certificate domain" and use dns-01 to get certificates for the machine. On the staging server, the ingress (reverse proxy) listens on port 20443, so that it does not need root to bind to this port and can still communicate with the snowflake testing clients and proxies.
If you're willing to run the ingress, as I said, you can pretty much do whatever you want. :)
Just be careful you are not reinventing kubernetes from scratch, which is my whole concern here.
I have attempted to work with podman-compose, but it does not work with terraform and would require additional work to get it running with a scriptable ingress controller. So I decided not to reinvent kubernetes by just running kubernetes... It takes more resources than doing everything by hand would, but I think it is worth it for the time saved by not having to build a custom solution.
The real setup is done with terraform (actually opentofu), with fully automated deployment.
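The day-to-day loop for such a setup would presumably be the standard opentofu workflow (a sketch; the directory layout is an assumption, not the actual repository structure):

cd deploy/    # hypothetical directory holding the .tf files
tofu init     # fetch providers (e.g. the kubernetes provider)
tofu plan     # preview changes against the staging cluster
tofu apply    # create or update the staging deployment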
I have got a working setup for running a rootless single node kubernetes instance.
Custom Setup Instructions for root
(Please do not disable unprivileged_userns_clone)
run the following file as root: https://gitlab.torproject.org/shelikhoo/snowflakestaging/-/blob/d5c1fd304e95b928acfe463fda332c5dfb817a28/structure_config/k3s/sbin/init_by_root.sh, and then
(This step will be done by @shelikhoo , content here for documentation)
# Switch to service account, must use machinectl not sudo, as sudo does not have systemd integration
machinectl shell --uid k3shost

# !!! copy the content of https://gitlab.torproject.org/shelikhoo/snowflakestaging/-/tree/main/structure_config?ref_type=heads to ~/.config
# !!! copy the k3s binary to ~/.config/k3s/bin/k3s

# Setup ACME
export HOME=/home/k3shost/.config/acmesh/state/
/home/k3shost/.config/acmesh/bin/acme.sh --register-account -m shelikhoo@torproject.org --server ***
export ACMEDNS_BASE_URL="https://auth.acme-dns.io"
export ACMEDNS_USERNAME="***"
export ACMEDNS_PASSWORD="***"
export ACMEDNS_SUBDOMAIN="***"
/home/k3shost/.config/acmesh/bin/acme.sh --issue --dns dns_acmedns \
  -d vwyjlwqyoh3sqmycg6wmi5e732the58s3png-testing.*** \
  -d '*.vwyjlwqyoh3sqmycg6wmi5e732the58s3png-testing.***' --server ***

# Enable and Run k3s services
export HOME=/home/k3shost/
systemctl enable --user --now k3s-rootless.service
cd ~/.config/k3s/conf
cat * | HOME=~/.config/k3s/state/ ~/.config/k3s/bin/k3s kubectl apply -f 00-traefik-tls.yaml

# Enable and Run ACME Renew services
systemctl enable --user --now acme-cron.timer
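Once the user services are up, a quick sanity check could look like this (paths taken from the script above, run from the same k3shost session):

# verify the rootless k3s node and the workloads it is running
HOME=~/.config/k3s/state/ ~/.config/k3s/bin/k3s kubectl get nodes
HOME=~/.config/k3s/state/ ~/.config/k3s/bin/k3s kubectl get pods -A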
Server Spec advice
(Assuming all resources can be upgraded later if necessary)
@shelikhoo / @lavamind how does snowflake-staging-01.torproject.org sound as a name?
I think this name is nice. It will not be seen by those connecting to the staging snowflake instance; a separate domain name will be used to acquire the certificate via ACME dns-01.
does that mean you do not want IPv6? normally all our VMs get one.
No, I mean it does not need IPv6; having one will not adversely impact anything.
if you're going to roll out your own TLS certs, you will have trouble doing that on torproject.org and .net as we have CAA records bound to our specific account-uris. we can add exceptions of course, but i thought you should know.
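For reference, those CAA constraints are visible with a plain DNS query (the exact output will vary):

dig +short torproject.org CAA
dig +short torproject.net CAA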
I'm preparing to provision this VM soon. In the team we're thinking of starting to provision new hosts on trixie (currently "testing") right away, to avoid a release upgrade further down the line. Does this sound like a plan that works for you?
the new VM is now online. The only detail that I'm missing is who to give ssh access to that host. can we reuse the rdsys group or do we need a group with a different set of users? if so, can you list who needs to be able to ssh into snowflake-staging-01 ?
It will make more sense to call the group anti-censorship rather than rdsys, as this machine is not related to rdsys. But the people that should have access to it are the same.
trying to create a group alias in ldap. online documentation on how to do this is not great. It seems as though it directly depends on what schema and object classes are being used. posixGroup seems to have some mechanism for this called RFC2307bis but we're not using the corresponding objectClass.
I've tried copying the following, but I get an error saying that member is not allowed for the object class:
if I try adding objectClass: groupOfNames to the above to enable member and memberOf, I get the error invalid structural object class chain (debianGroup/groupOfNames)
I've created the group using this attribute. I'll have to test whether it's achieving what we want or not. but so far I'm not even seeing users and groups from ldap on snowflake-staging-01 so I'll have to figure out what's happening there
ok I've removed the subGroup line from the anti-censorship group and added all members of the rdsys group to the anti-censorship group. we'll have to maintain membership of this new group separately from now on.
and now the ldap users and groups show up. so I'll assume that subGroup just broke things.
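For what it's worth, a quick way to confirm the host resolves the new group and its members (the user name below is a placeholder):

getent group anti-censorship   # should list the group and its members
getent passwd someuser         # should resolve an LDAP user on the host
id someuser                    # should include the anti-censorship group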
I've also added a corresponding role uid and the creation of its home dir on the right volume via puppet.