RFC: Gate nightly updates on CI, and make nightly CI failures blocking
Currently, we treat CI failures on nightly as warnings rather than as blocking errors. This is because new nightly releases keep breaking things.
I would like to propose an alternative strategy which would still defer the "fix with recent nightly" treadmill work, but would prevent us from introducing new nightly failures caused by changes in the tree:
- Pin the nightly image to a particular date in the in-tree CI config
- Have a robot automatically update the pin, but merge that update only if CI passes; have the robot notify us if nightly remains broken
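
As a rough illustration of what "update the pin" could mean mechanically, here is a minimal sketch. The pin format (`name:nightly-YYYY-MM-DD@sha256:<digest>`), the regex, and the `bump_pin` helper are all assumptions made up for this example; the real CI config may pin the image differently.

```python
import re

# Hypothetical pin format, e.g.
#   image: example/rust:nightly-2024-01-01@sha256:<64 hex digits>
# This is an assumption for illustration, not the format arti's CI uses.
PIN_RE = re.compile(
    r"(?P<name>[\w./-]+):nightly-(?P<date>\d{4}-\d{2}-\d{2})"
    r"@sha256:(?P<digest>[0-9a-f]{64})"
)


def bump_pin(config_text: str, new_date: str, new_digest: str) -> str:
    """Return config_text with every pinned nightly reference updated."""

    def repl(match: re.Match) -> str:
        # Keep the image name, swap in the new date and digest.
        return f"{match.group('name')}:nightly-{new_date}@sha256:{new_digest}"

    new_text, count = PIN_RE.subn(repl, config_text)
    if count == 0:
        # Refuse to "succeed" silently if the config has no pin at all.
        raise ValueError("no pinned nightly image found in config")
    return new_text
```

Keeping the date and the digest on the same line means the robot's daily change is a one-line diff, which should make its branch easy to review.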
Suggestions for the robot details:
- Robot maintains a "prospective new nightly" branch in its own clone of arti in its gitlab namespace
- Every day, the robot rewrites that branch to be precisely current main plus an update to the nightly image date and hash
- 8h later (say), the robot checks whether CI on the branch passed. If it did, the robot merges the branch into main (using a gitlab API endpoint, perhaps)
- Otherwise, if the nightly version pinned in current main is more than 7 days old, the robot files a ticket (if there isn't one open already), giving the URL of its own branch (which can be used to review the CI logs).
- A human who wants to fix things can then fetch the robot's branch, fix what needs to be fixed in-tree, and make a normal MR of it.
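
The robot's daily decision described above could be sketched roughly as follows. This only covers the decision itself; talking to gitlab (checking the pipeline, merging, filing the ticket) is left out, and the names `Action`, `STALE_AFTER`, and `decide` are hypothetical, chosen for this example.

```python
from datetime import date, timedelta
from enum import Enum


class Action(Enum):
    MERGE = "merge the robot's branch into main"
    FILE_TICKET = "file a ticket pointing at the robot's branch"
    WAIT = "do nothing until tomorrow's attempt"


# "more than 7 days out of date", per the suggestion above.
STALE_AFTER = timedelta(days=7)


def decide(ci_passed: bool, pinned: date, today: date, ticket_open: bool) -> Action:
    """Pick the robot's action once CI on its branch has finished.

    pinned is the nightly date currently pinned in main.
    """
    if ci_passed:
        return Action.MERGE
    # CI failed: only bother humans once main's pin is stale,
    # and never open a second ticket while one is still open.
    if today - pinned > STALE_AFTER and not ticket_open:
        return Action.FILE_TICKET
    return Action.WAIT
```

For example, with main pinned to 2024-01-01, a failing run on 2024-01-08 just waits (exactly 7 days old), while a failing run on 2024-01-09 files a ticket unless one is already open.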