answer the opsreportcard, AKA the "limoncelli test", 2022 edition
A few months after starting work inside TPA (in July 2019), I had enough of a footing to think, "okay, I think I can find my way around here, what's next?" Then I made #30881 (closed), which goes like this:
Tom Limoncelli is the renowned author of Time Management for System Administrators and The Practice of System and Network Administration, two excellent books I recommend every sysadmin read attentively.
He made up a 32-question test (PDF, website version on opsreportcard.com or the previous one-page HTML version) that covers the basics of a well-rounded setup. I believe we will get a good score, but going through the list will make sure we don't miss anything.
I didn't establish what a "good score" was, but we certainly didn't get a "passing grade" (60%+??), according to the summary (#30881 (comment 2541524)), produced in October 2019:
- Section A: Public Facing Practices: 1.5/3 (50%) tickets: #31242 (closed), #31243 (closed), #31244 (closed)
- Section B: Modern Team Practices: 3.5/7 (50%) tickets: #30880 (closed), #29387, missing: post-mortem, total puppetization, design docs, ticket prioritization of stability
- Section C: Operational Practices: 0.5/5 (10%) tickets: none yet, missing: "ops docs" for each service, pager rotation schedule, dev/stage/prod environments, canary process
- Section D: Automation Practices: 1.5/3 (50%) tickets: #31242 (closed), missing: reduce email noise
- Section E: Fleet Management Processes: 2.5/4 (63%) tickets: #30273, #31969, #31239, #31957 (closed), #29304
- Section F: Disaster Preparation Practices: 4/5 (80%) tickets: none yet, missing: disaster recovery plan
- Section G: Security Practices: 0.5/5 (10%) tickets: #32519 (closed), missing: malware scanners, security policy, security audits, global root password rotation
Final score: 14/32 (44%)
A lot of good things came out of this process, like the service templates, lots of automation, formal support policies, and so on. We should look at what was fixed in there (I see, for example, lots of the tickets above marked as closed, which is a good thing!) and how we could improve. This involves redoing the questionnaire, but also revisiting whether the process worked in the first place, and how well.
- section A Public Facing Practices: 3/3 (2020: 1.5/3), excellent, mostly done
- section B Modern team practices: 6/7 (2020: 3.5/7), excellent, just need to formalize the post-mortem process and publish our source code (#29387)
- section C Operational practices: 1.5/5 (2020: 0.5/5), slight improvement, but still lots of docs missing, need monitoring for all services, figure out monitoring (#40755), import a dev/stage/prod culture
- section D Automation practices: 1.5/3 (2020: 1.5/3), unchanged, still lots of email noise, no configuration management without Puppet access
- section E Fleet management practices: 2/4 (2020: 2.5/4), mostly unchanged: installs still not automated (#31239), inventory chaotic (#30273)
- section F "We acknowledge that hardware breaks" practices: 4/5 (2020: 4/5), unchanged, still missing disaster recovery plan (#40628)
- section G Security practices: 0/5 (2020: 0.5/5), worse: no security policy (tpo/team#41), needs improvement to the password manager (#29677), need to rethink central authentication
Final score: 18/32, 56% (2020: 14/32, 44%)
This is an improvement, but there is still a lot of work to do. We're almost at the passing grade!
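The totals above are easy to double-check with a short script. This is only a sketch: the per-section numbers are copied from the two summaries, the 60% threshold is the guessed "passing grade" rather than anything the test itself defines, the half-point granularity is inferred from the scores seen here, and the names (`SCORES`, `total`, `points_to_pass`) are made up for illustration.

```python
import math

# (earned, possible) points per section, copied from the two summaries above.
# "2020" is the label the comparison list uses for the first audit.
SCORES = {
    "2020": {
        "A": (1.5, 3), "B": (3.5, 7), "C": (0.5, 5), "D": (1.5, 3),
        "E": (2.5, 4), "F": (4, 5), "G": (0.5, 5),
    },
    "2022": {
        "A": (3, 3), "B": (6, 7), "C": (1.5, 5), "D": (1.5, 3),
        "E": (2, 4), "F": (4, 5), "G": (0, 5),
    },
}

PASSING_RATIO = 0.60  # assumption: the guessed "60%+" passing grade


def total(sections):
    """Return (earned, possible) summed over all sections."""
    earned = sum(e for e, _ in sections.values())
    possible = sum(p for _, p in sections.values())
    return earned, possible


def points_to_pass(sections, ratio=PASSING_RATIO):
    """Points still missing to reach `ratio`, rounded up to the next half point."""
    earned, possible = total(sections)
    # Assumption: answers are scored in half points, as in the summaries.
    needed = math.ceil(ratio * possible * 2) / 2
    return max(0.0, needed - earned)


for label, sections in SCORES.items():
    earned, possible = total(sections)
    print(f"{label}: {earned:g}/{possible} ({earned / possible:.0%}), "
          f"{points_to_pass(sections):g} points short of passing")
```

With those inputs, the section scores add up to 14/32 and 18/32, and the 2022 run lands 1.5 points short of the assumed 60% mark.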
It seems like the most critical aspects we need to work on (outlined by a "star" in the PDF version of the test) are:
- C: Operational practices:
  - *11. Does each service have an OpsDoc? (no plan)
  - *12. Does each service have appropriate monitoring? (improving thanks to Prometheus)
- E: Fleet management practices:
  - *19. Is there a database of all machines? (#30273, no plan)
- F: "We acknowledge that hardware breaks" practices:
  - *26. Are your disaster recovery plans tested periodically? (#40628)
- G: Security practices:
  - *28. Do desktops/laptops/servers run self-updating, silent, anti-malware software? (no plan)
  - *29. Do you have a written security policy? (no, tpo/team#41)