Incidents
We've crossed the 100 incidents mark.
What constitutes an incident is a little fuzzy, but generally it's an unexpected service interruption that requires urgent or imminent action, as opposed to normal issues that are more long-term improvements or bugs that can be delayed.
Looking at the first page of 20 incidents there, we can see almost half of those (9!) in the last 3 weeks (well, 22 days, since Oct 30th). Going to the next page, we get only ~30 more incidents for the last 12 months.
In other words, since 2022-10-24, 13 months ago, we had 40 incidents, and a quarter of those were in the last three weeks. That's massive!
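To put that acceleration in numbers, here is a minimal back-of-the-envelope sketch using only the figures quoted above (the 13-month period is approximated as 390 days):

```python
# Rough incident-rate comparison, using only the counts quoted above.
total_incidents = 40   # incidents since 2022-10-24 (~13 months)
total_days = 13 * 30   # approximate length of that period, in days
recent_incidents = 9   # incidents in the last 22 days
recent_days = 22

baseline_rate = total_incidents / total_days  # incidents per day, overall
recent_rate = recent_incidents / recent_days  # incidents per day, recently

print(f"baseline: {baseline_rate:.2f}/day, recent: {recent_rate:.2f}/day, "
      f"ratio: {recent_rate / baseline_rate:.1f}x")
# => roughly 0.10/day vs 0.41/day, about a 4x increase in incident rate
```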
Some of those are also serious incidents:
- #41361 (closed) - bungei filling up
- #41388 (closed) - tor weather database loss
- #41398 (closed) - database backups failing globally
- #41402 (closed) - gitlab-02 filling up (still open at the time of writing)
So there's that.
Issues
We haven't kept up the regular pattern of monthly meetings over the summer, but we did have a meeting on October 2nd. At that point, the stats were:
- GitLab tickets: 196 tickets including...
  - open: 0
  - icebox: 163
  - needs information: 5
  - backlog: 13
  - next: 9
  - doing: 4
  - needs review: 2
  - (closed: 3301)
If we did that report now, it would say:
- GitLab tickets: 203 tickets including...
  - open: 0
  - icebox: 160
  - needs information: 6
  - backlog: 20
  - next: 9
  - doing: 4
  - needs review: 5
  - (closed: 3361)
We closed 60 tickets in those 50 days, but we still ended up with more open tickets than when we started.
It's the first time since we started doing these reports that we've crossed the "200 tickets" mark:
    anarcat@angela:meeting$ grep 'tickets inc' *.md | sed 's/.md:.*:/|/;s/ tickets including.../|/;s/^/|/'
    [...]
| report     | tickets |
|------------|---------|
| 2021-01-19 | 113 |
| 2021-02-02 | 130 |
| 2021-03-02 | ? |
| 2021-04-07 | 138 |
| 2021-05-03 | ? |
| 2021-06-02 | 132 |
| 2021-09-07 | ? |
| 2021-10-07 | 156 |
| 2021-11-01 | ? |
| 2021-12-06 | 164 |
| 2022-01-11 | 159 |
| 2022-02-14 | 166 |
| 2022-03-14 | 177 |
| 2022-04-04 | 185 |
| 2022-05-09 | 178 |
| 2022-06-06 | 183 |
| 2022-07-24 | 184 |
| 2022-08-29 | 180 |
| 2022-10-03 | 186 |
| 2022-11-07 | 175 |
| 2022-12-06 | 183 |
| 2023-02-06 | 192 |
| 2023-03-13 | 177 |
| 2023-05-08 | 192 |
| 2023-06-05 | 193 |
| 2023-10-02 | 196 |
| 2023-11-21 | 203 |

Part of this rise is normal: we document and discover progressively more issues with the infrastructure as we formalize our work. Normally, this would eventually settle down and the ticket count would ideally go down; instead, we're now in a period where the count just keeps rising.
We don't have good metrics for this, but the last time we did a burn rate report we had a "mean" of "-2.3" spillover, which means, on average, there were 2.3 new open tickets per report.
Now we have 7 new open tickets since the last report, and we haven't had a positive spillover since March (when we seem to have knocked down 15 extra tickets!).
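For illustration, here is a minimal sketch of that spillover arithmetic, under the assumption that spillover is simply the drop in open-ticket count between consecutive reports (that is my reading of the metric, not an official definition; the numbers come from the table above, with unknown counts skipped, so some gaps span more than one report):

```python
# Open-ticket totals from the table above, "?" entries skipped.
counts = [113, 130, 138, 132, 156, 164, 159, 166, 177, 185, 178,
          183, 184, 180, 186, 175, 183, 192, 177, 192, 193, 196, 203]

# Spillover for a report: positive when we knocked tickets down since the
# previous report, negative when net new open tickets piled up.
spillover = [prev - cur for prev, cur in zip(counts, counts[1:])]

print(spillover)
print(f"mean spillover: {sum(spillover) / len(spillover):.1f}")
# With these numbers the mean comes out negative: on average the open count
# grows from report to report, matching the trend described above.
```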
Qualitative assessment
Things are not going to hell completely. Email is still holding up, mostly, but it's been a challenge to get ops to send email now that Gmail has tightened its requirements. We're working on it (#41396 (closed)), so that's good. But all long-term plans (#41009) have basically been thrown out the window and are unlikely to be drafted before the end of the year, which means we're likely going to end up in another crisis/interrupt before long.
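For context, Gmail's new sender requirements essentially boil down to publishing valid SPF, DKIM, and DMARC records. The following is a generic sketch (not our actual tooling, and example.org is a placeholder) of how the DNS side can be sanity-checked with the dnspython library:

```python
# Check whether a domain publishes the SPF and DMARC records that Gmail's
# sender requirements expect. Placeholder domain; illustrative only.
import dns.resolver

def txt_records(name):
    """Return the TXT record strings published at `name`, or [] if none."""
    try:
        answers = dns.resolver.resolve(name, "TXT")
    except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
        return []
    return [b"".join(rdata.strings).decode() for rdata in answers]

domain = "example.org"  # placeholder, not our actual domain

spf = [r for r in txt_records(domain) if r.startswith("v=spf1")]
dmarc = [r for r in txt_records(f"_dmarc.{domain}") if r.startswith("v=DMARC1")]

print("SPF:  ", spf or "missing")
print("DMARC:", dmarc or "missing")
# DKIM lives at <selector>._domainkey.<domain> and needs the selector name,
# which depends on the mail setup, so it is not checked here.
```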
Storage is an issue. We keep throwing hardware at metrics and there's no end in sight, probably related to the recent network churn. We still have capacity in both Ganeti clusters, so we can probably keep up for a while, but the virtual machine approach is showing its limits, because we constantly need to resize individual boxes instead of centralizing the problem. Moving things to object storage could help, but we need a strategy for backups first, which requires thinking long term (#41403).
Monitoring is falling apart. Too many alerts, too much noise, lagging behind on upgrades, and we're caught in between systems. We keep finding solutions that would work better and be easier to implement in Prometheus, but we're stuck in Icinga. We probably need to just make a decision and move (#40695 (closed)), but that is, again, another project; the conversion alone is a massive undertaking.
Right now this is my focal point: I'm trying to get nagios back in the green so we can get some peace there and start looking at the transition.
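As an example of what "easier in Prometheus" means in practice: the classic disk-usage check becomes a one-line PromQL expression over standard node_exporter metrics, which can be evaluated through the Prometheus HTTP API. This is a minimal sketch; the server URL and the 90% threshold are made-up values for illustration:

```python
# Evaluate a disk-usage alert expression through the Prometheus HTTP API.
# Server URL and threshold are illustrative; the metric names are the
# standard node_exporter filesystem metrics.
import requests

PROM = "http://prometheus.example.org:9090"  # placeholder server
QUERY = (
    "100 * (1 - node_filesystem_avail_bytes{fstype!~'tmpfs|overlay'}"
    " / node_filesystem_size_bytes{fstype!~'tmpfs|overlay'}) > 90"
)

resp = requests.get(f"{PROM}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    labels = result["metric"]
    usage = float(result["value"][1])
    print(f"{labels.get('instance')} {labels.get('mountpoint')}: {usage:.1f}% used")
```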
In general, we have large, complex, legacy systems that are closely coupled and intertwined, difficult to maintain, and that require a massive investment in time to properly document, standardize, and modernize. This cannot be done while fires are happening, and at the rate things are going, fires are going to accumulate without giving us the opportunity to fix their causes.
I sent an email with a heads-up to tor-internal.
This analysis makes me feel that we should still focus on monitoring. Email is too hard and too broad to fix, and seems to hold together. Puppet is still holding up, and the bookworm upgrades (#41252 (closed)) can wait. (The bullseye upgrade of eugeni, however, might eventually become urgent... but until June 2024, it is still covered by LTS. Same with the gitolite migration, which can be postponed to at least January while still respecting the deadline set in the RFC.)
So, once we've cleared Icinga, I'm going back to #40755 (closed) and moving out of Icinga. The main challenge there is dealing with checks that have side effects, particularly DNSSEC.
So, moving on and closing.