Commits · 5806d89c237bf3b2b5da421a3852504554651491 · The Tor Project / TPA / Wiki Replica

Sep 26, 2024
- add another pager playbook (for team#41770 ) · 5806d89c
  anarcat authored 5 months ago
  
  This page really is a mess now, ugh.
  Verified
  
  5806d89c
- cosmetic · 3b9f6a05
  anarcat authored 5 months ago
  
  Verified
  
  3b9f6a05
Sep 25, 2024
- document how to write pager playbooks · 135593d2
  anarcat authored 6 months ago
  
  Verified
  
  135593d2
- refer to the OpenPGP docs from yubikey, and vice-versa · 2db6130e
  anarcat authored 6 months ago
  
  Verified
  
  2db6130e
Sep 24, 2024
- add playbook for textfile collector failures (found in team#41774 ) · 0fee57ce
  anarcat authored 6 months ago
  
  Verified
  
  0fee57ce
- Fix heading level for blackbox exporter · 00e18a0b
  lelutin authored 6 months ago
  
  with the move I should've fixed that
  Verified
  
  00e18a0b
- Move reference doc for blackbox exporter out of installation · b7b66e17
  lelutin authored 6 months ago
  
  I wanted to place this further down but failed to notice that the same section names were also present in the Installation section.
  Verified
  
  b7b66e17
- Document how to debug blackbox exporter · 3448442a
  lelutin authored 6 months ago
  
  It's clear enough in the upstream readme, but it's nice to have a reference to this trick closer to our eyes.
  Verified
  
  3448442a
- add pager playbook entries for postgresql alerts (team#41774 ) · faa5af8d
  anarcat authored 6 months ago
  
  Verified
  
  faa5af8d
- expand the job down playbook · 2bdbad31
  anarcat authored 6 months ago
  
  We currently have a gitlab-runner warning pending and the playbook didn't cover it at all.
  Verified
  
  2bdbad31
- document postgres exporter failures · 5bb412df
  anarcat authored 6 months ago
  
  Verified
  
  5bb412df
- document that the postgres exporter is configured automatically · 4c574f0f
  anarcat authored 6 months ago
  
  Verified
  
  4c574f0f
- clarify missing base warning · cf7a64dc
  anarcat authored 6 months ago
  
  Verified
  
  cf7a64dc
- howto/puppet: document filebucket · 07826ae9
  Jérôme Charaoui authored 6 months ago
  
  Verified
  
  07826ae9
Sep 23, 2024
- make copy-paste from source markdown easier · ecfce360
  anarcat authored 6 months ago
  
  Verified
  
  ecfce360
- start a prometheus cheat sheet · 48a347de
  anarcat authored 6 months ago
  
  Those are queries I often find myself having to dig out of dashboards and alerts, but that are useful on their own.
  Verified
  
  48a347de
Sep 20, 2024
- fix typos · 965282c1
  anarcat authored 6 months ago
  
  Verified
  
  965282c1
- document the new alert logger (team#41745 ) · de66b91f
  anarcat authored 6 months ago
  
  Verified
  
  de66b91f
Sep 19, 2024

alertmanager/alert timers: missing words · eca288ac
lelutin authored 6 months ago

Verified

eca288ac

try to rephrase the group_wait stuff again · 30146e2f

anarcat authored 6 months ago

I didn't find the result to be particularly legible, and it had lost
the separation of source code references I had before. I am not sure
we should dig too much into implementation details (like "threads"),
but I kept that anyways.

This is mostly reformulations.

Verified

30146e2f

Some clarifications on the timers · 209eff5b

lelutin authored 6 months ago

Starting with the "fourth" timer, knowing about alert grouping and how
it's done is useful to understand the rest of the timers, so I've added
a bit of context there.

During discussions on IRC, we took the time to dig into the alertmanager
code and we've confirmed what @cks was mentioning in their blog post:
once a group is created for a route, a thread is launched for processing
new notifications every `group_interval`, so that setting is really like
a group-specific ticker for new notifications.

Verified

209eff5b

document alert timings details, see prometheus-alerts#18 · d96aa943
anarcat authored 6 months ago

Verified

d96aa943
cross-ref to the "how to add people to donate page" · d1d680bb
anarcat authored 6 months ago

Verified

d1d680bb

Sep 18, 2024

cross-ref cumin to direct-ssh setup and expand on effects of using batch · 9e3ecd39

lelutin authored 6 months ago

Using cumin's batch size is still a possibility to avoid issues, but it
is preferred to configure yourself for direct ssh connections and avoid
using the batch size if not necessary.

if direct-ssh connection is not possible, then using the batch size hack
is still possible. using it does have some side-effects that one should
be aware of though.

small correction in the text after my tests today: the limitation is
imposed by the MaxStartups setting, not MaxSessions.

Verified

9e3ecd39

cumin: default cumin ssh connections to the root user · e5bf6970

lelutin authored 6 months ago

without this, if you have some blocks in your ssh config that set you up
for connecting to certain hosts as an unprivileged user, you'll end up
running cumin commands with that user and very probably failing.

cumin is mostly used for running ad-hoc admin commands on hosts so it
makes sense to make it force connection to root.

Verified

e5bf6970

follow s/runbook/playbook/ in prometheus · 109e46c7
anarcat authored 6 months ago

Verified

109e46c7

move service template to its correct location · d5e6b875

anarcat authored 6 months ago

This has been bugging me since basically forever: the howto/template
is not a template for the "howto" section (which is now poorly defined
anyways) at *all*. It's precisely the template for *services*, and
really just belongs there.

I've been hesitant in performing that rename for a long time. First
because GitLab wikis didn't support redirects (they do now, and we add
one here), but also because we probably link to the wiki-replica
version of this in a few places.

I've tried to fix the links inside the wiki, but there are certainly
others that will break. We'll fix those as we go.

For now it seems better and more intuitive to have this at the right
place than preserve the legacy location.

Verified

d5e6b875

mention tpa-rfc-33 in alternatives · 342971af
anarcat authored 6 months ago

Verified

342971af
please harper · d38d00b4
anarcat authored 6 months ago

Verified

d38d00b4

prometheus: move munin section down into the discussion section · b893ca6c

anarcat authored 6 months ago

That was relevant in 2019, when we actually were replacing
Munin (which "died in a fire"), but we're really far past that
now. Perhaps we could also have a "migrating from Nagios" section here
as well though, see also #41655.

Verified

b893ca6c

clarify and scrape_job usage · 86f73912

anarcat authored 6 months ago

We were missing key bits about the firewall rules and a simpler
example for `collect_scrape_jobs`.

Verified

86f73912

Sep 17, 2024

Rework information about alert routes and recipients · 8ede3aad

lelutin authored 6 months ago

Clearer distinction between recipients and routes. They're set in
different hiera keys.

The definitions are now in hiera, not in puppet manifests, so the
examples need to be refreshed.

Verified

8ede3aad

Reorganize and rephrase rules + scrape jobs/targets · e8741f85

lelutin authored 6 months ago

Currently rules are *not* defined in puppet. However, scrape jobs and
targets should be for all TPA-related services.

Verified

e8741f85

howto/prometheus: better overview of the systems involved · c0dda00c

lelutin authored 6 months ago

By using a numbered list with unnumbered subpoints, we can convey a good
sense of what are all systems that collaborate for the monitoring.

The higher-level list is numbered since it follows the path of what
happens in time. First the alert is created by prom, then received by
alertmanager and finally consulted by sysadmins via karma/grafana

Verified

c0dda00c

prometheus: tiny bit of rewording to make it easier to read · 4c07f6a4

lelutin authored 6 months ago

The first paragraph says that we are not using prom alerting, and while
it's still technically true that we haven't fully switched to it yet, we
do have alerts for TPA services in prometheus now and we're slowly
moving towards switching to that completely. So we might as well change
that now to say that we do indeed use this for our monitoring.

The "Looking for alerts" paragraph gives a better overview of things if
we make the list of URLs that one needs to know about in a list format
with verbosity reduced.

Verified

4c07f6a4

clarify next steps in irc bridge (team#41761) · 01e71def
anarcat authored 6 months ago

I still can't actually do this, but this is the way it works according
to @ahf.

Verified

01e71def
try to document how to bridge a matrix channel (team#41761 ) · 6a32f305
anarcat authored 6 months ago

Verified

6a32f305
new-person: fix typo · 19b6be3a
lelutin authored 6 months ago

Verified

19b6be3a
document how to look at prometheus alerts better · a2d851b1
anarcat authored 6 months ago

Verified

a2d851b1

document blocked upgrades more directly (team#41671 ) · 7eddf801

anarcat authored 6 months ago

The previous runbook wasn't directly mentioning the alert and might
have been a little jarring.

Now that we have a magic command to dump the packages pending upgrade,
use it!

Verified

7eddf801