Incidents
We've crossed the 100 incidents mark.
What constitutes an incident is a little fuzzy, but generally it's an unexpected service interruption that requires urgent or imminent action, as opposed to normal issues that are more long-term improvements or bugs that can be delayed.
Looking at the first page of 20 incidents there, we can see almost half of those (9!) in the last 3 weeks (well, 22 days, since Oct 30th). Going to the next page, we get only ~30 more incidents for the last 12 months.
In other words, since 2022-10-24, 13 months ago, we had 40 incidents, and a quarter of those were in the last three weeks. That's massive!
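To put that acceleration in numbers, here is a minimal back-of-the-envelope sketch using only the figures quoted above (the 13-month period is approximated as 390 days):

```python
# Rough incident-rate comparison, using only the counts quoted above.
total_incidents = 40   # incidents since 2022-10-24 (~13 months)
total_days = 13 * 30   # approximate length of that period, in days
recent_incidents = 9   # incidents in the last 22 days
recent_days = 22

baseline_rate = total_incidents / total_days  # incidents per day, overall
recent_rate = recent_incidents / recent_days  # incidents per day, recently

print(f"baseline: {baseline_rate:.2f}/day, recent: {recent_rate:.2f}/day, "
      f"ratio: {recent_rate / baseline_rate:.1f}x")
# => roughly 0.10/day vs 0.41/day, about a 4x increase in incident rate
```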
Some of those are also serious incidents:
- #41361 (closed) - bungei filling up
- #41388 (closed) - tor weather database loss
- #41398 (closed) - database backups failing globally
- #41402 (closed) - gitlab-02 filling up (still open at the time of writing)
So there's that.
Issues
We haven't kept up the regular pattern of monthly meetings over the summer, but we did have a meeting on October 2nd. At that point, the stats were:
- GitLab tickets: 196 tickets including...
  - open: 0
  - icebox: 163
  - needs information: 5
  - backlog: 13
  - next: 9
  - doing: 4
  - needs review: 2
  - (closed: 3301)
If we did that report now, it would say:
- GitLab tickets: 203 tickets including...
  - open: 0
  - icebox: 160
  - needs information: 6
  - backlog: 20
  - next: 9
  - doing: 4
  - needs review: 5
  - (closed: 3361)
We closed 60 tickets in those 50 days, but we still ended up with more open tickets than when we started.
It's the first time since we started doing these reports that we've crossed the "200 tickets" mark:
    anarcat@angela:meeting$ grep 'tickets inc' *.md | sed 's/.md:.*:/|/;s/ tickets including.../|/;s/^/|/'
    [...]
| report     | tickets |
|------------|---------|
| 2021-01-19 | 113 |
| 2021-02-02 | 130 |
| 2021-03-02 | ? |
| 2021-04-07 | 138 |
| 2021-05-03 | ? |
| 2021-06-02 | 132 |
| 2021-09-07 | ? |
| 2021-10-07 | 156 |
| 2021-11-01 | ? |
| 2021-12-06 | 164 |
| 2022-01-11 | 159 |
| 2022-02-14 | 166 |
| 2022-03-14 | 177 |
| 2022-04-04 | 185 |
| 2022-05-09 | 178 |
| 2022-06-06 | 183 |
| 2022-07-24 | 184 |
| 2022-08-29 | 180 |
| 2022-10-03 | 186 |
| 2022-11-07 | 175 |
| 2022-12-06 | 183 |
| 2023-02-06 | 192 |
| 2023-03-13 | 177 |
| 2023-05-08 | 192 |
| 2023-06-05 | 193 |
| 2023-10-02 | 196 |
| 2023-11-21 | 203 |

Part of this rise is normal: we document and discover progressively more issues with the infrastructure as we formalize our work. Normally, this would eventually settle down and the ticket count would ideally go down; instead, we're now in a period where the count just keeps rising.
We don't have good metrics for this, but the last time we did a burn rate report we had a "mean" of "-2.3" spillover, which means, on average, there were 2.3 new open tickets per report.
Now we have 7 new open tickets since the last report, and we haven't had a positive spillover since March (when we seem to have knocked down 15 extra tickets!).
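For illustration, here is a minimal sketch of that spillover arithmetic, under the assumption that spillover is simply the drop in open-ticket count between consecutive reports (that is my reading of the metric, not an official definition; the numbers come from the table above, with unknown counts skipped, so some gaps span more than one report):

```python
# Open-ticket totals from the table above, "?" entries skipped.
counts = [113, 130, 138, 132, 156, 164, 159, 166, 177, 185, 178,
          183, 184, 180, 186, 175, 183, 192, 177, 192, 193, 196, 203]

# Spillover for a report: positive when we knocked tickets down since the
# previous report, negative when net new open tickets piled up.
spillover = [prev - cur for prev, cur in zip(counts, counts[1:])]

print(spillover)
print(f"mean spillover: {sum(spillover) / len(spillover):.1f}")
# With these numbers the mean comes out negative: on average the open count
# grows from report to report, matching the trend described above.
```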
Qualitative assessment
Things are not going to hell completely. Email is still holding up, mostly, but it's been a challenge to get ops to send email now that Gmail has tightened its requirements. We're working on it (#41396 (closed)), so that's good. But all long-term plans (#41009) have basically been thrown out the window and are unlikely to be drafted before the end of the year, which means we're likely going to end up in another crisis/interrupt before long.
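For context, Gmail's new sender requirements essentially boil down to publishing valid SPF, DKIM, and DMARC records. The following is a generic sketch (not our actual tooling, and example.org is a placeholder) of how the DNS side can be sanity-checked with the dnspython library:

```python
# Check whether a domain publishes the SPF and DMARC records that Gmail's
# sender requirements expect. Placeholder domain; illustrative only.
import dns.resolver

def txt_records(name):
    """Return the TXT record strings published at `name`, or [] if none."""
    try:
        answers = dns.resolver.resolve(name, "TXT")
    except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
        return []
    return [b"".join(rdata.strings).decode() for rdata in answers]

domain = "example.org"  # placeholder, not our actual domain

spf = [r for r in txt_records(domain) if r.startswith("v=spf1")]
dmarc = [r for r in txt_records(f"_dmarc.{domain}") if r.startswith("v=DMARC1")]

print("SPF:  ", spf or "missing")
print("DMARC:", dmarc or "missing")
# DKIM lives at <selector>._domainkey.<domain> and needs the selector name,
# which depends on the mail setup, so it is not checked here.
```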
Storage is an issue. We keep throwing hardware at metrics and there's no end in sight, probably related to the recent network churn. We still have capacity in both Ganeti clusters, so we can probably keep up for a while, but the virtual machine approach is showing its limits, because we constantly need to resize individual boxes instead of centralizing the problem. Moving things to object storage could help, but we need a strategy for backups first, which requires thinking long term (#41403).
Monitoring is falling apart. Too many alerts, too much noise, lagging behind on upgrades, and we're caught in between systems. We keep finding solutions that would work better and be easier to implement in Prometheus, but we're stuck in Icinga. We probably need to just make a decision and move (#40695 (closed)), but that is, again, another project; the conversion alone is a massive undertaking.
Right now this is my focal point: I'm trying to get nagios back in the green so we can get some peace there and start looking at the transition.
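As an example of what "easier in Prometheus" means in practice: the classic disk-usage check becomes a one-line PromQL expression over standard node_exporter metrics, which can be evaluated through the Prometheus HTTP API. This is a minimal sketch; the server URL and the 90% threshold are made-up values for illustration:

```python
# Evaluate a disk-usage alert expression through the Prometheus HTTP API.
# Server URL and threshold are illustrative; the metric names are the
# standard node_exporter filesystem metrics.
import requests

PROM = "http://prometheus.example.org:9090"  # placeholder server
QUERY = (
    "100 * (1 - node_filesystem_avail_bytes{fstype!~'tmpfs|overlay'}"
    " / node_filesystem_size_bytes{fstype!~'tmpfs|overlay'}) > 90"
)

resp = requests.get(f"{PROM}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    labels = result["metric"]
    usage = float(result["value"][1])
    print(f"{labels.get('instance')} {labels.get('mountpoint')}: {usage:.1f}% used")
```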
In general, we have large, complex, legacy systems that are closely coupled and intertwined, difficult to maintain, and that require a massive investment in time to properly document, standardize, and modernize. This cannot be done while fires are happening, and at the rate things are going, fires are going to accumulate without giving us the opportunity to fix their causes.
I sent an email with a heads-up to tor-internal.
This analysis makes me feel that we should still focus on monitoring. Email is too hard and too broad to fix, and seems to hold together. Puppet is still holding up, and the bookworm upgrades (#41252 (closed)) can wait. (The bullseye upgrade of eugeni, however, might eventually become urgent... but until June 2024, it is still covered by LTS. Same with the gitolite migration, which can be postponed to at least January while still respecting the deadline set in the RFC.)
So, once we've cleared Icinga, I'm going back to #40755 (closed) and moving out of Icinga. The main challenge there is dealing with checks that have side effects, particularly DNSSEC.
So, moving on and closing.