So we have this interesting situation where it takes longer than a day to build all of our browser nightlies, and this situation is only going to worsen once we add privacy-browser to the mix.
I think another machine with the same specs as tb-build-04 would be sufficient, but we should probably run some verify that.
@micah Do you think we could get funding approval for this?
I would be happy to spec out some hardware to ship to the new colo (which should be ready to accept it real soon now), or rent whatever at hetzner, as you wish.
So we have this interesting situation where it takes longer than a day
to build all of our browser nightlies, and this situation is only
going to worsen once we add privacy-browser to the mix.
I think another machine with the same specs as tb-build-04 would be
sufficient, but we should probably run some verify that.
Do you mean with the additional disks that haven't been added yet, or
without?
Also, your sentence "we should probably run some verify that" is
incomplete; were you asking @boklm and @pierov to do something there, or
us?
...
On 2022-12-02 17:38:38, Richard Pospesel (@richard) wrote:
I think it might be better if we take a step back and paint a broader picture of where we want this to be in a few months. In #40964 (comment 2860430) you did a good job of presenting your needs and current infrastructure, but I'd like to see if we can push that idea a bit further...
If you could spec out the 2 or 3 (or 4?) servers you will actually need for the next 2-3 years, then we could see how we implement this. tb-build-04 and -05 have worked well as a stopgap measure to allow you to iterate much faster, and i really like that, but they're going to cost us hell if we pile up more resources on there... they charge prime money for those little extras at hetzner...
I understand 2-3 years is a big ask: it's hard to plan that far. If you want, just look at the next 6 months or something. Also feel free to say "i just want three tb-build-04 with 10TB of disk each, ktxbye", that works too, but then I would worry a little that you'd be in the same situation again in 6 months... ;)
Also keep in mind we're going to set up a huge, high-performance cluster in January. The roadmap for that is here:
So I don't really have preferences about how/where the machines are allocated. I am a bit hesitant to say "yeah, totally colocate the build machines with other tor stuff", just because Firefox builds can be a bit greedy with respect to resources (I don't want to reduce perf of other TPO services while doing builds).
In terms of #goals, what I'd like to see over the next several years:
sufficient remote build resources for devs: currently met with the hetzner servers given the current team size;
historically we've basically had the implicit assumption that browser devs have sufficient spare hardware lying around to do cross-platform development, build verification, etc.
Unfortunately this kind of need is more spikey than consistent (i.e., the team may be working entirely on general desktop features, in which case local Linux dev-builds are fine; other times people may be working on bugs/features specific to other platforms, in which case the build machines make everything a lot easier).
The tb-build-04 server is fast enough for this sort of dev work, and the Android situation has improved a bit with some software/build changes (improvements to tor-browser-build as well as to local Android dev builds)
tb-build-04/05 are absolutely invaluable for releases
nightly builds: as mentioned in the other ticket, our current nightly/test machine cannot do all the nightly builds within a day which makes us sad
test infra: also described in the other ticket; long term I want to be able to run some fraction of Mozilla's Firefox test suite on each desktop platform (Linux i686+x86_64, Windows i686+x86_64, macOS x86_64+aarch64) automatically as part of nightlies and on-demand for alpha and stable releases. Even longer term I'd like the same for Android, but that's a ways away.
So I don't really have preferences about how/where the machines are allocated. I am a bit hesitant to say "yeah, totally colocate the build machines with other tor stuff", just because Firefox builds can be a bit greedy with respect to resources (I don't want to reduce perf of other TPO services while doing builds).
I think this is not something you should worry about: if (say) your
build or VM is causing problems with the rest of the infra (and we did
have that problem in other clusters before), it's our problem, not
yours. And we can fix it in a myriad of ways, either by giving you
dedicated hardware, throttling, or grouping tasks better...
In terms of #goals, what I'd like to see over the next several years:
sufficient remote build resources for devs: currently met with the hetzner servers given the current team size;
historically we've basically had the implicit assumption that browser devs have sufficient spare hardware lying around to do cross-platform development, build verification, etc.
Unfortunately this kind of need is more spikey than consistent (i.e., the team may be working entirely on general desktop features, in which case local Linux dev-builds are fine; other times people may be working on bugs/features specific to other platforms, in which case the build machines make everything a lot easier).
So this very part here is what makes me think the browser stuff is a
prime candidate for running in GitLab CI or some sort of shared
infrastructure. CI is definitely "spikey": we have everything in there
from the arti folks doing regular CI tests on every push to the research
folks doing days-long simulations on dedicated hardware, with persistent
storage.
(You can see how full the CI queue is here, for reference:
... keep in mind we're in a transition period right now with fewer
runners than normal as we're moving between clusters...)
It's where we're going for everything right now, and I think you'd fit
right in. We'd probably need some work to adapt to your giant repos and
peculiar workflow, but we've done it for shadow, and we can do it for
you.
The benefit is that the "spikey" nature is then distributed: we don't
need to rent a $600+/mth cluster just for you to sometimes run
jobs. We have that cluster, and then it's available for you, for shadow,
and for regular builds...
Of course, the downside is it's shared, so if you end up needing
resources at the same time that shadow needs to run a billion sims that
all take a week each, we're in trouble... But I don't think that's worse
than the current situation, in the sense that right now shadow runs into
resource limits because we only have one beefy server for them... if we
had many, everyone benefits most of the time, at the cost of
sometimes getting stuck.
The tb-build-04 server is fast enough for this sort of dev work, and the Android situation has improved a bit with some software/build changes (improvements to tor-browser-build as well as to local Android dev builds)
tb-build-04/05 are absolutely invaluable for releases
nightly builds: as mentioned in the other ticket, our current nightly/test machine cannot do all the nightly builds within a day which makes us sad
test infra: also described in the other ticket; long term I want to be able to run some fraction of Mozilla's Firefox test suite on each desktop platform (Linux i686+x86_64, Windows i686+x86_64, macOS x86_64+aarch64) automatically as part of nightlies and on-demand for alpha and stable releases. Even longer term I'd like the same for Android, but that's a ways away.
It sounds like what you're saying is "i need basically one more of
tb-build-04/05 for nightlies and something else for tests, I'm not sure
what". Did I summarize this right?
...
On 2022-12-06 18:10:58, Richard Pospesel (@richard) wrote:
--
Antoine Beaupré
torproject.org system administration
It's where we're going for everything right now, and I think you'd fit right in. We'd probably need some work to adapt to your giant repos and peculiar workflow, but we've done it for shadow, and we can do it for you.
The benefit is that the "spikey" nature is then distributed: we don't need to rent a $600+/mth cluster just for you to sometimes run jobs. We have that cluster, and then it's available for you, for shadow, and for regular builds...
I mean if we had the resources to make testbuilds in CI each time we pushed a commit to various component repos, that would def be cool but also probably overkill. As a point of comparison, it took 19ish hours on tb-build-05 to do a full alpha build (and the artifacts would need to persist for faster subsequent builds). But you know, the dream would be to have a build and test run after a set of commits is pushed.
It sounds like what you're saying is "i need basically one more of tb-build-04/05 for nightlies and something else for tests, I'm not sure what". Did I summarize this right?
That pretty much covers it; we should know more once we have tests working and passing on at least Linux.
It's where we're going for everything right now, and I think you'd fit right in. We'd probably need some work to adapt to your giant repos and peculiar workflow, but we've done it for shadow, and we can do it for you.
The benefit is that the "spikey" nature is then distributed: we don't need to rent a $600+/mth cluster just for you to sometimes run jobs. We have that cluster, and then it's available for you, for shadow, and for regular builds...
I mean if we had the resources to make testbuilds in CI each time we pushed a commit to various component repos, that would def be cool but also probably overkill. As a point of comparison, it took 19ish hours on tb-build-05 to do a full alpha build (and the artifacts would need to persist for faster subsequent builds). But you know, the dream would be to have a build and test run after a set of commits is pushed.
Well, we can probably figure out a way to rate-limit that stuff or keep
it to specific branches, for example. I also don't think it's a good
idea to trigger a build or test on every commit, but surely you could do
that to test sets of commits, merge requests, or release branches?
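To make that concrete, here's a rough sketch of how such gating could look in a `.gitlab-ci.yml` `workflow` block; the branch pattern is a made-up placeholder for whatever naming scheme you actually use:

```yaml
# Sketch only: run pipelines for merge requests, release-style
# branches, and schedules -- never for every push everywhere.
workflow:
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
    - if: '$CI_COMMIT_BRANCH =~ /^tor-browser-/'   # hypothetical branch pattern
    - if: '$CI_PIPELINE_SOURCE == "schedule"'      # e.g. nightlies
    - when: never
```

That keeps pipelines off by default and only spends build time on the cases above.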
It sounds like what you're saying is "i need basically one more of tb-build-04/05 for nightlies and something else for tests, I'm not sure what". Did I summarize this right?
That pretty much covers it; we should know more once we have tests working and passing on at least Linux.
Awesome, thanks for the feedback.
...
On 2022-12-06 20:03:17, Richard Pospesel (@richard) wrote:
--
Antoine Beaupré
torproject.org system administration
I think we could keep some artifacts, such as container images and compilers, in some persistent storage for the CI, if possible, rather than building everything from scratch.
That would mean having tor-browser.git and tor-browser-build.git somehow interact with the same CI.
Maybe not easy, but it would closely match what we do for releases.
When I only have to compile tor-browser.git/Firefox, a Linux testbuild takes about 30 minutes on my computer, which has similar specs to tb-build-04.
This time is inside our containerized build system (tor-browser-build.git).
A fresh build outside it takes about 10 minutes less.
The additional time is used to prepare the container, extract everything, do additional packaging steps, etc.
I also don't think it's a good idea to trigger a build or test on every commit, but surely you could do that to test sets of commits, merge requests, or release branches?
Definitely not at every commit.
We create two release branches with ~50 new commits or more for each Firefox ESR update (two branches at least once per month: one for the stable and one for the alpha).
These fresh branches should not be compiled until we also add our patch set, which is ~90 commits per branch.
We push Firefox updates directly (we fetch https://github.com/mozilla/gecko-dev on our machine, create a new branch, then push it to gitlab.tpo).
For the rebased patch set, we instead use a single big MR for each version.
After the new release branch has been created, we shift all new development there and start merging MRs.
So, building for MRs is definitely a better idea.
or release branches?
For tor-browser.git we only use release branches.
All the development is done in forks.
But these release branches change for each Firefox release, and the alpha rebased on the latest Firefox version becomes the default branch.
However, we have at least two branches for which we need the builds: current alpha and current stable.
Most of the development is done in alphas, and then backported to stable.
Having a build after backporting would be a great idea in my opinion, especially if we could run (automatic) tests after the build.
At the moment, we build on the stable branch only at release time, which is not ideal for QA.
Forks should not have the CI enabled automatically, in my opinion, but at most trigger a rebuild before being merged. At least for starters.
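For starters, a guard like the following could do that; just a sketch, and the canonical project path here is a made-up placeholder:

```yaml
# Sketch: build merge requests, but keep pipelines off in forks by
# only allowing branch pipelines in the canonical project.
workflow:
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
    - if: '$CI_PROJECT_PATH == "tpo/applications/tor-browser"'  # placeholder path
    - when: never
```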
something else for tests
For the tests, we could try to analyze what Mozilla does.
They have some algorithms to decide which tests to run, even though they run all the suite from time to time.
So far I've seen that tests seem to take forever. You could let them run for hours.
From a hardware point of view, the most crucial point is/will be macOS: would it be possible to have one or two Mac machines in our new cluster?
Maybe even some old Mac mini is okay for x86, but we've just started shipping ARM builds as well.
In general, what are TPA's policies regarding proprietary OSes in our infrastructure, and letting us have VMs/machines with them?
The majority of our users are on Windows, so we should run tests on Windows sooner than macOS.
But at least for Windows we won't need additional hardware, only licenses...
I think we could keep some artifacts, such as container images and
compilers, in some persistent storage for the CI, if possible, rather
than building everything from scratch. [...]
We've been considering setting up the container registry in GitLab; see
gitlab#89 (closed) for that discussion. I think you'd basically need to
have one pipeline/project building up-to-date containers on a schedule
or something, and then other builds could reuse those images.
That is definitely something that could optimise things for us.
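For illustration, a rough sketch of that scheduled-container idea; the job names, the Docker-based build, and the entry point are all assumptions here, not a settled design:

```yaml
# Sketch: one scheduled job refreshes a base image in the project
# registry; later pipelines pull it instead of rebuilding from scratch.
build-base-image:
  image: docker:24
  services:
    - docker:24-dind
  rules:
    - if: '$CI_PIPELINE_SOURCE == "schedule"'
  script:
    - docker login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" "$CI_REGISTRY"
    - docker build -t "$CI_REGISTRY_IMAGE/base:latest" .
    - docker push "$CI_REGISTRY_IMAGE/base:latest"

# Any other job can then start from the prebuilt image.
testbuild:
  image: $CI_REGISTRY_IMAGE/base:latest
  script:
    - ./testbuild.sh   # hypothetical entry point
```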
[...]
something else for tests
For the tests, we could try to analyze what Mozilla does.
They have some algorithms to decide which tests to run, even though they run all the suite from time to time.
So far I've seen that tests seem to take forever. You could let them run for hours.
Right. So that's still something that might be possible, with more
hardware than we have now, of course. Maybe before a release you could
have that as a last step?
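Something like this could work as that last step; a rough sketch only, assuming runners tagged per platform (the tags and the test script are hypothetical):

```yaml
# Sketch: a manual job that fans the test suite out over the desktop
# platforms listed earlier, run deliberately before tagging a release.
release-tests:
  stage: test
  when: manual
  parallel:
    matrix:
      - PLATFORM: [linux-i686, linux-x86_64, windows-i686,
                   windows-x86_64, macos-x86_64, macos-aarch64]
  tags:
    - $PLATFORM          # assumes one runner tag per platform
  script:
    - ./run-tests.sh "$PLATFORM"   # hypothetical wrapper script
```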
From a hardware point of view, the most crucial point is/will be macOS: would it be possible to have one or two Mac machines in our new cluster?
Maybe even some old Mac mini is okay for x86, but we've just started shipping ARM builds as well.
There's an old issue tracking that for GitLab CI in
#40095...
In general, what are TPA's policies regarding proprietary OSes in our infrastructure, and letting us have VMs/machines with them?
The policy is generally a hard no, but we're ready to make an exception
for CI. In particular, for the above ticket, we had a full machine set up
for ahf to do a libvirt-based deployment... but that never came to life,
and the machine will be retired soon.
In general, I think I'd rather have Debian running proprietary OS images
so that we don't really have to manage those OSes ourselves: we just
pull an image from... somewhere and run it. I understand this is
actually problematic for some OSes, however, particularly macOS, and
particularly for the ARM infra, so we're ready to give you some slack
on that.
The big challenge is, I think, Mac ARM support...
The majority of our users are on Windows, so we should run tests on Windows sooner than macOS.
But at least for Windows we won't need additional hardware, only licenses...
I think there might be ways of running Windows stuff without paying
licenses, but I defer to ahf on that. And maybe that discussion is
better carried out in #40095 as well...
a.
...
On 2022-12-07 09:47:14, Pier Angelo Vendrame (@pierov) wrote:
--
Antoine Beaupré
torproject.org system administration
We've been considering setting up the container registry in GitLab; see gitlab#89 (closed) for that discussion. I think you'd basically need to have one pipeline/project building up-to-date containers on a schedule or something, and then other builds could reuse those images.
Please note that we don't use Docker ourselves, but we've received feedback that you can run tor-browser-build inside Docker.
This could make things easier, maybe.
Basically, we'd need (see the sketch below):
the possibility to have subuids and subgids, and to use user_namespaces in the container
a simple Debian container with a few Perl libraries installed (here's the list)
a persistent volume, where we store our artifacts
The same applies to nightly builds, if you want to move them to CI somehow, but if I understand correctly we clean everything for every build (or maybe we keep the git clones, so some persistent storage might still be helpful).
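To give a rough idea, a CI job covering these points could look like the sketch below; the package names (a partial list), cache paths, and make target are placeholders, and the subuid/user-namespace requirement has to be enabled on the runner host itself rather than in this file:

```yaml
# Sketch: a Debian-based job with some of the Perl dependencies, and a
# cache acting as the persistent storage for artifacts and git clones.
nightly-build:
  image: debian:bookworm
  rules:
    - if: '$CI_PIPELINE_SOURCE == "schedule"'
  cache:
    key: tor-browser-build
    paths:
      - out/          # placeholder: reusable build artifacts
      - git_clones/   # placeholder: the big git clones
  script:
    - apt-get update
    - apt-get install -y libyaml-libyaml-perl libio-captureoutput-perl  # partial list
    - make nightly    # placeholder for the real entry point
```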
I think there might be ways of running Windows stuff without paying licenses, but I defer to ahf on that.
There are some development VMs whose license lasts for 90 days...
But you'd better ask Kendra rather than ahf, I think.
FWIW, Windows doesn't even ask for a product key these days, and magically self-activates on KVM VMs...
But again, these are questions for lawyers rather than technical issues.
@richard This issue has been waiting for information for two weeks. It needs attention. Please take care of this before the end of 2023-01-18 or it will be moved to the Icebox.
i'm a bit swamped right now... i think the next step here is to summarize the current situation, what's needed next, and make a proposal.
it also doesn't help that we have delays in provisioning the new cluster, which, depending on how that goes, will inform whether we go with our own metal, rented hardware, or full-on cloud...
@richard how long can i postpone this? can this wait for february?
This issue has been waiting for information for two weeks or more. It needs attention. Please take care of this before the end of 2023-02-15 or it will be moved to the Icebox.
(Any ticket left in Needs Information, Next, or Doing without activity for 14 days gets such notifications. Make a comment describing the current state of this ticket and remove the Stale label, or close the ticket to get rid of the bot.)
we talked about this briefly with @boklm last night and it seems one thing we could try here would be to throw in a big tmpfs where the build happens, to optimize performance. An alternative is to create a new VM in the new ganeti cluster, which still has some capacity.
alternatively, this could be totally new hardware; see also #40964 (closed)
If we still want a separate machine for nightly builds, as opposed to doing them on tb-build-04 or -05, then I would suggest we rebuild it on gnt-chi as a plain instance, with no DRBD.
This issue has been waiting for information for two weeks or more. It needs attention. Please take care of this before the end of 2023-05-30. ~"Needs Information" tickets will be moved to the Icebox after that point.
(Any ticket left in Needs Review, Needs Information, Next, or Doing without activity for 14 days gets such notifications. Make a comment describing the current state of this ticket and remove the Stale label to fix this.)