So we have this interesting situation where it takes longer than a day to build all of our browser nightlies, and this situation is only going to worsen once we add privacy-browser to the mix.
I think another machine with the same specs as tb-build-04 would be sufficient, but we should probably run some verify that.
@micah Do you think we could get funding approval for this?
I would be happy to spec out some hardware to ship to the new colo (which should be ready to accept it real soon now), or rent whatever at hetzner, as you wish.
So we have this interesting situation where it takes longer than a day
to build all of our browser nightlies, and this situation is only
going to worsen once we add privacy-browser to the mix.
I think another machine with the same specs as tb-build-04 would be
sufficient, but we should probably run some verify that.
Do you mean with the additional disks that haven't been added yet, or
without?
Also, your sentence "we should probably run some verify that" is
incomplete; were you asking @boklm and @pierov to do something there, or
us?
...
On 2022-12-02 17:38:38, Richard Pospesel (@richard) wrote:
I think it might be better if we take a step back and paint a broader picture of where we want this to be in a few months. In #40964 (comment 2860430) you did a good job of presenting your needs and current infrastructure, but I'd like to see if we can push that idea a bit further...
If you could spec out the 2 or 3 (or 4?) servers you will actually need for the next 2-3 years, then we could see how we implement this. tb-build-04 and -05 have worked well as a stopgap measure to allow you to iterate much faster, and i really like that, but they're going to cost us hell if we pile up more resources on there... they charge prime money for those little extras at hetzner...
I understand 2-3 years is a big ask: it's hard to plan that far. If you want, just look at the next 6 months or something. Also feel free to say "i just want three tb-build-04 with 10TB of disk each, ktxbye", that works too, but then I would worry a little that you'd be in the same situation again in 6 months... ;)
Also keep in mind we're going to set up a huge, high-performance cluster in January. The roadmap for that is here:
So I don't really have preferences about how/where the machines are allocated. I am a bit hesitant to say "yeah, totally colocate the build machines with other tor stuff", just because Firefox builds can be a bit greedy with respect to resources (I don't want to reduce perf of other TPO services while doing builds).
In terms of #goals, what I'd like to see over the next several years:
sufficient remote build resources for devs: currently met with the hetzner servers given the current team size;
historically we've basically had the implicit assumption that browser devs have sufficient spare hardware lying around to do cross-platform development, build verification, etc.
Unfortunately this kind of need is more spikey than consistent (i.e., the team may be working entirely on general desktop features, in which case local Linux dev-builds are fine; other times people may be working on bugs/features specific to other platforms, in which case the build machines make everything a lot easier).
The tb-build-04 server is fast enough for this sort of dev work, and the Android situation has improved a bit with some software/build changes (improvements to tor-browser-build as well as to local Android dev builds)
tb-build-04/05 are absolutely invaluable for releases
nightly builds: as mentioned in the other ticket, our current nightly/test machine cannot do all the nightly builds within a day which makes us sad
test infra: also described in the other ticket; long term I want to be able to run some fraction of Mozilla's Firefox test suite on each desktop platform (Linux i686+x86_64, Windows i686+x86_64, macOS x86_64+aarch64) automatically as part of nightlies and on-demand for alpha and stable releases. Even longer term I'd like the same for Android, but that's a ways away.
So I don't really have preferences about how/where the machines are allocated. I am a bit hesitant to say "yeah, totally colocate the build machines with other tor stuff", just because Firefox builds can be a bit greedy with respect to resources (I don't want to reduce perf of other TPO services while doing builds).
I think this is not something you should worry about: if (say) your
build or VM is causing problems with the rest of the infra (and we did
have that problem in other clusters before), it's our problem, not
yours. And we can fix it in a myriad of ways, either by giving you
dedicated hardware, throttling, or grouping tasks better...
In terms of #goals, what I'd like to see over the next several years:
sufficient remote build resources for devs: currently met with the hetzner servers given the current team size;
historically we've basically had the implicit assumption that browser devs have sufficient spare hardware lying around to do cross-platform development, build verification, etc.
Unfortunately this kind of need is more spikey than consistent (i.e., the team may be working entirely on general desktop features, in which case local Linux dev-builds are fine; other times people may be working on bugs/features specific to other platforms, in which case the build machines make everything a lot easier).
So this very part here is what makes me think the browser stuff is a
prime candidate for running in GitLab CI or some sort of shared
infrastructure. CI is definitely "spikey": we have everything in there
from the arti folks doing regular CI tests on every push to the research
folks doing days-long simulations on dedicated hardware, with persistent
storage.
(You can see how full the CI queue is here, for reference:
... keep in mind we're in a transition period right now with fewer
runners than normal as we're moving between clusters...)
It's where we're going for everything right now, and I think you'd fit
right in. We'd probably need some work to adapt to your giant repos and
peculiar workflow, but we've done it for shadow, and we can do it for
you.
The benefit is that the "spikey" nature is then distributed: we don't
need to rent a $600+/mth cluster just for you to sometimes run
jobs. We have that cluster, and then it's available for you, for shadow,
and for regular builds...
Of course, the downside is it's shared, so if you end up needing
resources at the same time that shadow needs to run a billion sims that
all take a week each, we're in trouble... But I don't think that's worse
than the current situation, in the sense that right now shadow runs into
resource limits because we only have one beefy server for them... if we
had many, everyone benefits most of the time, at the cost of
sometimes getting stuck.
The tb-build-04 server is fast enough for this sort of dev work, and the Android situation has improved a bit with some software/build changes (improvements to tor-browser-build as well as to local Android dev builds)
tb-build-04/05 are absolutely invaluable for releases
nightly builds: as mentioned in the other ticket, our current nightly/test machine cannot do all the nightly builds within a day which makes us sad
test infra: also described in the other ticket; long term I want to be able to run some fraction of Mozilla's Firefox test suite on each desktop platform (Linux i686+x86_64, Windows i686+x86_64, macOS x86_64+aarch64) automatically as part of nightlies and on-demand for alpha and stable releases. Even longer term I'd like the same for Android, but that's a ways away.
It sounds like what you're saying is "i need basically one more of
tb-build-04/05 for nightlies and something else for tests, I'm not sure
what". Did I summarize this right?
...
On 2022-12-06 18:10:58, Richard Pospesel (@richard) wrote:
--
Antoine Beaupré
torproject.org system administration
It's where we're going for everything right now, and I think you'd fit right in. We'd probably need some work to adapt to your giant repos and peculiar workflow, but we've done it for shadow, and we can do it for you.
The benefit is that the "spikey" nature is then distributed: we don't need to rent a $600+/mth cluster just for you to sometimes run jobs. We have that cluster, and then it's available for you, for shadow, and for regular builds...
I mean if we had the resources to make testbuilds in CI each time we pushed a commit to various component repos, that would def be cool but also probably overkill. As a point of comparison, it took 19ish hours on tb-build-05 to do a full alpha build (and the artifacts would need to persist for faster subsequent builds). But you know, the dream would be to have a build and test run after a set of commits is pushed.
It sounds like what you're saying is "i need basically one more of tb-build-04/05 for nightlies and something else for tests, I'm not sure what". Did I summarize this right?
That pretty much covers it; we should know more once we have tests working and passing on at least Linux.
It's where we're going for everything right now, and I think you'd fit right in. We'd probably need some work to adapt to your giant repos and peculiar workflow, but we've done it for shadow, and we can do it for you.
The benefit is that the "spikey" nature is then distributed: we don't need to rent a $600+/mth cluster just for you to sometimes run jobs. We have that cluster, and then it's available for you, for shadow, and for regular builds...
I mean if we had the resources to make testbuilds in CI each time we pushed a commit to various component repos, that would def be cool but also probably overkill. As a point of comparison, it took 19ish hours on tb-build-05 to do a full alpha build (and the artifacts would need to persist for faster subsequent builds). But you know, the dream would be to have a build and test run after a set of commits is pushed.
Well, we can probably figure out a way to rate-limit that stuff or keep
it to specific branches, for example. I also don't think it's a good
idea to trigger a build or test on every commit, but surely you could do
that to test sets of commits, merge requests, or release branches?
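To make that concrete, here's a rough sketch of how such gating could look in a `.gitlab-ci.yml` `workflow` block; the branch pattern is a made-up placeholder for whatever naming scheme you actually use:

```yaml
# Sketch only: run pipelines for merge requests, release-style
# branches, and schedules -- never for every push everywhere.
workflow:
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
    - if: '$CI_COMMIT_BRANCH =~ /^tor-browser-/'   # hypothetical branch pattern
    - if: '$CI_PIPELINE_SOURCE == "schedule"'      # e.g. nightlies
    - when: never
```

That keeps pipelines off by default and only spends build time on the cases above.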
It sounds like what you're saying is "i need basically one more of tb-build-04/05 for nightlies and something else for tests, I'm not sure what". Did I summarize this right?
That pretty much covers it; we should know more once we have tests working and passing on at least Linux.
Awesome, thanks for the feedback.
...
On 2022-12-06 20:03:17, Richard Pospesel (@richard) wrote:
--
Antoine Beaupré
torproject.org system administration
I think we could keep some artifacts, such as container images and compilers, in some persistent storage for the CI, if possible, rather than building everything from scratch.
That would mean having tor-browser.git and tor-browser-build.git somehow interact with the same CI.
Maybe not easy, but it would closely match what we do for releases.
When I only have to compile tor-browser.git/Firefox, a Linux testbuild takes about 30 minutes on my computer, which has similar specs to tb-build-04.
This time is inside our containerized build system (tor-browser-build.git).
A fresh build outside it takes about 10 minutes less.
The additional time is used to prepare the container, extract everything, do additional packaging steps, etc.
I also don't think it's a good idea to trigger a build or test on every commit, but surely you could do that to test sets of commits, merge requests, or release branches?
Definitely not at every commit.
We create two release branches with ~50 new commits or more for each Firefox ESR update (two branches at least once per month: one for the stable and one for the alpha).
These fresh branches should not be compiled until we also add our patch set, which is ~90 commits per branch.
We push Firefox updates directly (we fetch https://github.com/mozilla/gecko-dev on our machine, create a new branch, then push it to gitlab.tpo).
For the rebased patch set, we instead use a single big MR for each version.
After the new release branch has been created, we shift all new development there and start merging MRs.
So, building for MRs is definitely a better idea.
or release branches?
For tor-browser.git we only use release branches.
All the development is done in forks.
But these release branches change for each Firefox release, and the alpha rebased on the latest Firefox version becomes the default branch.
However, we have at least two branches for which we need the builds: current alpha and current stable.
Most of the development is done in alphas, and then backported to stable.
Having a build after backporting would be a great idea in my opinion, especially if we could run (automatic) tests after the build.
At the moment, we build on the stable branch only at release time, which is not ideal for QA.
Forks should not have the CI enabled automatically, in my opinion, but at most trigger a rebuild before being merged. At least for starters.
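For starters, a guard like the following could do that; just a sketch, and the canonical project path here is a made-up placeholder:

```yaml
# Sketch: build merge requests, but keep pipelines off in forks by
# only allowing branch pipelines in the canonical project.
workflow:
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
    - if: '$CI_PROJECT_PATH == "tpo/applications/tor-browser"'  # placeholder path
    - when: never
```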
something else for tests
For the tests, we could try to analyze what Mozilla does.
They have some algorithms to decide which tests to run, even though they run all the suite from time to time.
So far I've seen that tests seem to take forever. You could let them run for hours.
From a hardware point of view, the most crucial point is/will be macOS: would it be possible to have one or two Mac machines in our new cluster?
Maybe even some old Mac mini is okay for x86, but we've just started shipping ARM builds as well.
In general, what are TPA's policies regarding proprietary OSes in our infrastructure, and letting us have VMs/machines with them?
The majority of our users are on Windows, so we should run tests on Windows sooner than macOS.
But at least for Windows we won't need additional hardware, only licenses...
I think we could keep some artifacts, such as container images and
compilers, in some persistent storage for the CI, if possible, rather
than building everything from scratch. [...]
We've been considering setting up the container registry in GitLab; see
gitlab#89 (closed) for that discussion. I think you'd basically need to
have one pipeline/project building up-to-date containers on a schedule
or something, and then other builds could reuse those images.
That is definitely something that could optimise things for us.
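For illustration, a rough sketch of that scheduled-container idea; the job names, the Docker-based build, and the entry point are all assumptions here, not a settled design:

```yaml
# Sketch: one scheduled job refreshes a base image in the project
# registry; later pipelines pull it instead of rebuilding from scratch.
build-base-image:
  image: docker:24
  services:
    - docker:24-dind
  rules:
    - if: '$CI_PIPELINE_SOURCE == "schedule"'
  script:
    - docker login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" "$CI_REGISTRY"
    - docker build -t "$CI_REGISTRY_IMAGE/base:latest" .
    - docker push "$CI_REGISTRY_IMAGE/base:latest"

# Any other job can then start from the prebuilt image.
testbuild:
  image: $CI_REGISTRY_IMAGE/base:latest
  script:
    - ./testbuild.sh   # hypothetical entry point
```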
[...]
something else for tests
For the tests, we could try to analyze what Mozilla does.
They have some algorithms to decide which tests to run, even though they run all the suite from time to time.
So far I've seen that tests seem to take forever. You could let them run for hours.
Right. So that's still something that might be possible, with more
hardware than we have now, of course. Maybe before a release you could
have that as a last step?
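Something like this could work as that last step; a rough sketch only, assuming runners tagged per platform (the tags and the test script are hypothetical):

```yaml
# Sketch: a manual job that fans the test suite out over the desktop
# platforms listed earlier, run deliberately before tagging a release.
release-tests:
  stage: test
  when: manual
  parallel:
    matrix:
      - PLATFORM: [linux-i686, linux-x86_64, windows-i686,
                   windows-x86_64, macos-x86_64, macos-aarch64]
  tags:
    - $PLATFORM          # assumes one runner tag per platform
  script:
    - ./run-tests.sh "$PLATFORM"   # hypothetical wrapper script
```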
From a hardware point of view, the most crucial point is/will be macOS: would it be possible to have one or two Mac machines in our new cluster?
Maybe even some old Mac mini is okay for x86, but we've just started shipping ARM builds as well.
There's an old issue tracking that for GitLab CI in
#40095...
In general, what are TPA's policies regarding proprietary OSes in our infrastructure, and letting us have VMs/machines with them?
The policy is generally a hard no, but we're ready to make an exception
for CI. In particular, for the above ticket, we had a full machine set up
for ahf to do a libvirt-based deployment... but that never came to life,
and the machine will be retired soon.
In general, I think I'd rather have Debian running proprietary OS images
so that we don't really have to manage those OSes ourselves: we just
pull an image from... somewhere and run it. I understand this is
actually problematic for some OSes, however, particularly macOS, and
particularly for the ARM infra, so we're ready to give you some slack
on that.
The big challenge is, I think, Mac ARM support...
The majority of our users are on Windows, so we should run tests on Windows sooner than macOS.
But at least for Windows we won't need additional hardware, only licenses...
I think there might be ways of running Windows stuff without paying
licenses, but I defer to ahf on that. And maybe that discussion is
better carried out in #40095 as well...
a.
...
On 2022-12-07 09:47:14, Pier Angelo Vendrame (@pierov) wrote:
--
Antoine Beaupré
torproject.org system administration
We've been considering setting up the container registry in GitLab; see gitlab#89 (closed) for that discussion. I think you'd basically need to have one pipeline/project building up-to-date containers on a schedule or something, and then other builds could reuse those images.
Please note that we don't use Docker ourselves, but we've received feedback that you can run tor-browser-build inside Docker.
This could make things easier, maybe.
Basically, we'd need (see the sketch below):
the possibility to have subuids and subgids, and to use user_namespaces in the container
a simple Debian container with a few Perl libraries installed (here's the list)
a persistent volume, where we store our artifacts
The same applies to nightly builds, if you want to move them to CI somehow, but if I understand correctly we clean everything for every build (or maybe we keep the git clones, so some persistent storage might still be helpful).
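To give a rough idea, a CI job covering these points could look like the sketch below; the package names (a partial list), cache paths, and make target are placeholders, and the subuid/user-namespace requirement has to be enabled on the runner host itself rather than in this file:

```yaml
# Sketch: a Debian-based job with some of the Perl dependencies, and a
# cache acting as the persistent storage for artifacts and git clones.
nightly-build:
  image: debian:bookworm
  rules:
    - if: '$CI_PIPELINE_SOURCE == "schedule"'
  cache:
    key: tor-browser-build
    paths:
      - out/          # placeholder: reusable build artifacts
      - git_clones/   # placeholder: the big git clones
  script:
    - apt-get update
    - apt-get install -y libyaml-libyaml-perl libio-captureoutput-perl  # partial list
    - make nightly    # placeholder for the real entry point
```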
I think there might be ways of running Windows stuff without paying licenses, but I defer to ahf on that.
There are some development VMs whose license lasts for 90 days...
But you'd better ask Kendra rather than ahf, I think.
FWIW, Windows doesn't even ask for a product key these days, and magically self-activates on KVM VMs...
But again, these are questions for lawyers rather than technical issues.
@richard This issue has been waiting for information for two weeks. It needs attention. Please take care of this before the end of 2023-01-18 or it will be moved to the Icebox.
i'm a bit swamped right now... i think the next step here is to summarize the current situation, what's needed next, and make a proposal.
it also doesn't help that we have delays in provisioning the new cluster, which, depending on how that goes, will inform whether we go with our own metal, rented hardware, or full-on cloud...
@richard how long can i postpone this? can this wait for february?
This issue has been waiting for information for two weeks or more. It needs attention. Please take care of this before the end of 2023-02-15 or it will be moved to the Icebox.
(Any ticket left in Needs Information, Next, or Doing without activity for 14 days gets such notifications. Make a comment describing the current state of this ticket and remove the Stale label, or close the ticket to get rid of the bot.)
we talked about this briefly with @boklm last night and it seems one thing we could try here would be to throw in a big tmpfs where the build happens, to optimize performance. An alternative is to create a new VM in the new ganeti cluster, which still has some capacity.
alternatively, this could be totally new hardware; see also #40964 (closed)
If we still want a separate machine for nightly builds, as opposed to doing them on tb-build-04 or -05, then I would suggest we rebuild it on gnt-chi as a plain instance, with no DRBD.
This issue has been waiting for information for two weeks or more. It needs attention. Please take care of this before the end of 2023-05-30. ~"Needs Information" tickets will be moved to the Icebox after that point.
(Any ticket left in Needs Review, Needs Information, Next, or Doing without activity for 14 days gets such notifications. Make a comment describing the current state of this ticket and remove the Stale label to fix this.)