    • anarcat added Doing label
    • anarcat
      Owner

      Incidents

      We've crossed the 100 incidents mark.

      What constitutes an incident is a little fuzzy, but generally it's an unexpected service interruption that requires urgent or imminent action, as opposed to normal issues that are more long-term improvements or bugs that can be delayed.

      Looking at the first page of 20 incidents there, we can see almost half of those (9!) in the last 3 weeks (well, 22 days, since Oct 30th). Going to the next page, we get only ~30 more incidents for the last 12 months.

      In other words, since 2022-10-24, 13 months ago, we had 40 incidents, and a quarter of those were in the last three weeks. That's massive!
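
      As an aside, a count like that can be reproduced roughly from the GitLab API; this is only a sketch (not how the numbers above were produced), and the project path and label name are assumptions on my part:

      # count issues labeled "Incident" created since 2022-10-24; note this
      # only fetches the first page of results (per_page maxes out at 100)
      curl --silent \
        "https://gitlab.torproject.org/api/v4/projects/tpo%2Ftpa%2Fteam/issues?labels=Incident&created_after=2022-10-24T00:00:00Z&state=all&per_page=100" \
        | jq length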

      Some of those incidents are also serious incidents:

      So there's that.

      Edited by anarcat
    • anarcat
      Owner

      Issues

      We haven't kept up the regular monthly meeting pattern we used to have over the summer, but we did have a meeting on October 2nd. At that point, the stats were:

      • GitLab tickets: 196 tickets including...
        • open: 0
        • icebox: 163
        • needs information: 5
        • backlog: 13
        • next: 9
        • doing: 4
        • needs review: 2
        • (closed: 3301)

      If we did that report now, it would say:

      • GitLab tickets: 203 tickets including...
        • open: 0
        • icebox: 160
        • needs information: 6
        • backlog: 20
        • next: 9
        • doing: 4
        • needs review: 5
        • (closed: 3361)

      We closed 60 tickets in those 50 days, but we still have more tickets open than when we started.

      It's the first time since we started doing those reports that we've crossed the "200 tickets" mark:

      anarcat@angela:meeting$ grep 'tickets inc' *.md | sed 's/.md:.*:/|/;s/ tickets including.../|/;s/^/|/'
      [...]
      report tickets
      2021-01-19 113
      2021-02-02 130
      2021-03-02 ?
      2021-04-07 138
      2021-05-03 ?
      2021-06-02 132
      2021-09-07 ?
      2021-10-07 156
      2021-11-01 ?
      2021-12-06 164
      2022-01-11 159
      2022-02-14 166
      2022-03-14 177
      2022-04-04 185
      2022-05-09 178
      2022-06-06 183
      2022-07-24 184
      2022-08-29 180
      2022-10-03 186
      2022-11-07 175
      2022-12-06 183
      2023-02-06 192
      2023-03-13 177
      2023-05-08 192
      2023-06-05 193
      2023-10-02 196
      2023-11-21 203

      Part of this rise is normal: we document and discover progressively more issues with the infrastructure as we formalize our work. Normally, though, this would eventually settle down and the ticket count would ideally go down; instead, we're now in a period where the count just keeps rising.

      We don't have good metrics for this, but the last time we did a burn rate report we had a "mean" of "-2.3" spillover, which means, on average, there were 2.3 new open tickets per report.

      Now we have 7 new open tickets since the last report, and we haven't had a positive spillover since March (when we seem to have knocked down an extra 15 tickets!).
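
      To make the math explicit, here is a quick sketch of how a spillover figure like that can be derived from the table above, assuming "spillover" simply means the previous report's open ticket count minus the current one (so a negative mean means net new tickets). counts.txt would hold the date/count table; rows with unknown ("?") counts are skipped, so some deltas span more than one report:

      awk '$2 ~ /^[0-9]+$/ {
          if (n++) { delta = prev - $2; sum += delta; print $1, delta }
          prev = $2
      }
      END { if (n > 1) printf "mean spillover: %.1f\n", sum / (n - 1) }' counts.txt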

    • anarcat
      Owner

      Qualitative assessment

      Things are not completely going to hell. Email is still holding up, mostly, but it's been a challenge to get ops to send email now that Gmail tightened their requirements. We're working on it (#41396 (closed)), so that's good. But all long-term plans (#41009) have basically been thrown out the window and are unlikely to be drafted before year end, which means we're likely going to end up in another crisis/interrupt before long.
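
      For context, those requirements roughly boil down to SPF/DKIM authentication and, for bulk senders, a published DMARC policy. A quick way to eyeball the DNS side of that (example.org is a placeholder here, not our actual records, and the DKIM selector name is made up):

      # SPF record for the sending domain
      dig +short TXT example.org | grep spf1
      # DMARC policy
      dig +short TXT _dmarc.example.org
      # DKIM public key for a given selector
      dig +short TXT selector1._domainkey.example.org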

      Storage is an issue. We keep throwing hardware at metrics and there's no end in sight, probably related to the recent network churn. We still have capacity on both Ganeti clusters, so we can probably keep up for a while, but the virtual machine approach is showing its limits, because we constantly need to resize individual boxes instead of centralizing the problem. Moving things to object storage could help, but we need a strategy for backups first, which requires thinking long term (#41403).
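
      To give a sense of what "resizing individual boxes" means in practice, the per-VM dance looks roughly like this (a sketch with a made-up instance name, not a runbook):

      # grow the first disk of a (hypothetical) instance by 50GB from the
      # Ganeti master, then reboot (or rescan) so the guest sees the new size
      gnt-instance grow-disk some-vm.torproject.org 0 50g
      gnt-instance reboot some-vm.torproject.org
      # ...and inside the guest, the partition and filesystem still have to be
      # grown (growpart, resize2fs or similar), for each machine individually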

      Monitoring is falling apart. Too many alerts, too much noise, lagging behind in upgrades, and caught in between systems. We keep finding solutions that would work better and be easier to implement in Prometheus, but we're stuck in Icinga. We probably need to just make a decision and move (#40695 (closed)), but that is, again, another project. Just the conversion is a massive undertaking.
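
      As a small illustration of why the grass looks greener on the Prometheus side (a generic sketch, not anything from our actual setup, and the hostname is a placeholder): the equivalent of "which hosts are down right now" is a one-liner against the query API:

      # ask Prometheus which scrape targets are currently down
      curl --silent 'https://prometheus.example.org/api/v1/query?query=up==0' \
        | jq '.data.result[].metric.instance'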

      Right now this is my focal point: I'm trying to get nagios back in the green so we can get some peace there and start looking at the transition.

      In general, we have large, complex, legacy systems that are closely coupled and intertwined, difficult to maintain, and in need of a massive investment in time to properly document, standardize, and modernize. This cannot be done while fires are burning, and at the rate things are going, fires are going to accumulate without giving us the opportunity to fix their causes.

      Edited by anarcat
    • anarcat marked this issue as related to #41396 (closed)
    • anarcat marked this issue as related to #41009
    • anarcat marked this issue as related to #41403
    • anarcat marked this issue as related to #40695 (closed)
    • anarcat marked this issue as related to #41321 (closed)
    • anarcat marked this issue as related to #41252 (closed)
    • anarcat
      Owner

      i sent an email with a heads up to tor-internal.

      this analysis makes me feel that we should still focus on monitoring. email is too hard and too broad to fix, and seems to hold together. puppet is still holding up, and the bookworm upgrades (#41252 (closed)) can wait. (the bullseye upgrade of eugeni, however, might eventually become urgent... but until June 2024, it is still covered by LTS. same with the gitolite migration, which can be postponed until at least january while still respecting the deadline set in the RFC.)

      so, once we've cleared icinga, i'm going back to #40755 (closed) and moving out of icinga. the main challenge there is dealing with checks that have side effects, particularly DNSSEC.

      so, moving on and closing.

    • anarcat closed
    • anarcat mentioned in issue #41252 (closed)
    • micah mentioned in issue #41405
    • anarcat mentioned in issue #41214 (closed)
    • anarcat marked this issue as related to #41214 (closed)
    • anarcat marked this issue as related to #40260
    • anarcat marked this issue as related to #41219 (closed)