monitor technical debt and legacy

I often say that we have huge technical debt in TPA, and that we keep needing to close things down, write documentation, and so on.

But we do not have hard data on this. After reading "Managing Technical Debt", I realized we should at least keep track of metrics about this. What's interesting about that article is that it says we shouldn't necessarily set targets, but that keeping track of metrics would be a good start.

The author, Kaplan-Moss, specifically suggests DORA metrics, but I'm not sure they're the best match for us. Here's what I think we should monitor:

- tickets:
  - "lead time" (time between when a ticket enters backlog/next/doing and when it is closed)
  - start using the Technical Debt ticket and measure ticket counts
  - general per-queue ticket counts (already done in monthly reports, put in Prometheus, see #40591 (closed))
- incidents:
  - "lead time" is especially important here: how long do incident tickets stay open? This might also serve as a measure of MTTR (mean time to recovery)
  - "change failure rate": measure how many incidents are caused by deployment failures
- documentation: systematically measure how many services we have and how well they are documented (this is partially done, by hand, in the service.md wiki page, but could be automated somehow)
- untracked package counts: use anarcat's puppet-package-check to generate metrics on how many packages are not managed by Puppet, per host, as a rough estimate of the "puppetization ratio"
- unit test coverage: across all our software projects (or maybe per project?), what is the coverage of unit tests? (requires CI and extraction of those numbers in an exporter)
- out of date systems: how long does it take to update the fleet, and how long do we live on LTS? (at least partly tracked in Prometheus now, but not retained long enough to have good metrics, see also #40330)
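
To make the ticket and incident metrics a bit more concrete, here is a minimal Python sketch of how "lead time" and the incident-based "change failure rate" could be computed. Everything in it is an assumption for illustration: the `Ticket` record, its field names and the `deployment` label are hypothetical and do not reflect the GitLab API or our current labels. Note also that DORA defines change failure rate as failed deployments over total deployments; the function below follows the wording above instead (share of incidents caused by deployments).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Ticket:
    # Hypothetical record, not the GitLab API schema.
    started: datetime                   # entered backlog/next/doing, or incident opened
    closed: Optional[datetime] = None   # None while the ticket is still open
    labels: frozenset = frozenset()     # e.g. frozenset({"deployment"})


def lead_times(tickets):
    """Return the start-to-close duration of every closed ticket."""
    return [t.closed - t.started for t in tickets if t.closed is not None]


def mean_lead_time(tickets):
    """Mean lead time as a timedelta, or None if nothing is closed yet.

    Applied to incident tickets only, this is a rough stand-in for MTTR.
    """
    durations = lead_times(tickets)
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


def deployment_incident_share(incidents, label="deployment"):
    """Fraction of incidents carrying the (assumed) deployment-failure label."""
    incidents = list(incidents)
    if not incidents:
        return None
    caused = sum(1 for t in incidents if label in t.labels)
    return caused / len(incidents)
```

Open tickets are ignored in the lead time on purpose, which means the number looks better than reality while old tickets linger; tracking the age of open tickets separately might be worth considering.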

The end result here is a small set of metrics that describe the current state of affairs, and its evolution over time. It will allow us to more easily realize when we're in trouble (e.g. #41411 (closed)) and evaluate how much effort we should put into this.
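
As a rough idea of how a per-host number like the untracked package count could end up in Prometheus, here is a sketch that renders a gauge in the Prometheus text exposition format (the format scraped from exporters and read by the node exporter's textfile collector). The metric name `tpa_unmanaged_packages`, the `host` label and the input counts are made up for the example; the sketch assumes the counts have already been extracted from puppet-package-check by some other means and does not parse its output.

```python
def puppetization_ratio(managed, unmanaged):
    """Share of packages managed by Puppet on a host, or None if it has no packages."""
    total = managed + unmanaged
    return managed / total if total else None


def render_gauge(name, help_text, samples):
    """Render one gauge in the Prometheus text exposition format.

    `samples` maps a hostname to a numeric value. Hostnames are assumed
    not to contain characters that need escaping in label values
    (backslash, double quote, newline).
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for host, value in sorted(samples.items()):
        lines.append(f'{name}{{host="{host}"}} {value}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    counts = {"a.example.org": 12, "b.example.org": 3}
    print(render_gauge("tpa_unmanaged_packages",
                       "Packages not managed by Puppet", counts), end="")
```

Writing the result to a `.prom` file picked up by the textfile collector would avoid running a dedicated exporter, at the cost of the metric only being as fresh as the last run.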

These metrics would probably be more useful if we kept them beyond the "one year" mark. Ticket counts, for example, are kept forever in the minutes, and that's a good thing, so we should consider expanding the storage retention here (#40330).

One thing Kaplan-Moss advises is to set time aside to deal with technical debt: he suggests 10%. He also says we shouldn't set "sprints" to deal with technical debt, but I disagree with that: I have found that Debian upgrades work well with sprints, and I wonder what else we could extend the practice to. On the other hand, the docs hack week wasn't a clear success for us, so maybe he's at least partly right.

Edited Dec 21, 2023 by anarcat