Commits · ecfce360ead697239031a2a02ebb7a8a8f679796 · The Tor Project / TPA / Wiki Replica

Sep 23, 2024
- make copy-paste from source markdown easier · ecfce360
  anarcat authored 6 months ago
  
  ecfce360
- start using our stable-backports image (base-images#13) · 19c89fca
  anarcat authored 6 months ago
  
  19c89fca
- use backports image for markdownlint, reducing supply chain length · 3b2ba6cb
  anarcat authored 6 months ago
  
  We'd use our own image here, but I can't find a stable-backports image (base-images#13).
  3b2ba6cb
- use our own container image for wiki-replica CI · 3833109e
  anarcat authored 6 months ago
  
  This will reduce the impact on Docker hub rate limiting and improve our supply chain, among many other things.
  3833109e
- TPA-RFC-68: mark idle canary servers as adopted (team#41750) · c699772e
  anarcat authored 6 months ago
  
  c699772e
- start a prometheus cheat sheet · 48a347de
  anarcat authored 6 months ago
  
  Those are queries I often find myself having to dig out of dashboards and alerts, but that are useful on their own.
  48a347de
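
As an illustration of running a cheat-sheet query outside of dashboards and alerts, here is a minimal sketch that builds an instant-query URL for the Prometheus HTTP API. The `/api/v1/query` endpoint and its `query` parameter are part of the Prometheus API; the server hostname and the `up == 0` expression are placeholders chosen for this example, not values taken from the cheat sheet itself.

```python
from urllib.parse import urlencode


def instant_query_url(base_url: str, expr: str) -> str:
    """Build the URL for a Prometheus instant query (GET /api/v1/query)."""
    # urlencode takes care of escaping spaces and operators in the PromQL expression.
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


# Hypothetical server name, used only for illustration.
url = instant_query_url("https://prometheus.example.org/", "up == 0")
print(url)
```

The resulting URL can be fetched with any HTTP client; the response is a JSON document whose `data.result` field holds the matching series.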
Sep 22, 2024
- Merge branch 'policy_map' into 'master' · 17906bd2
  anarcat authored 6 months ago
  
  Describe a smtp_tls_policy_maps setup

  See merge request !57
  17906bd2
Sep 21, 2024
- Describe a smtp_tls_policy_maps setup · b3378dff
  Sebastian Hahn authored 6 months ago
  
  b3378dff
Sep 20, 2024
- fix typos · 965282c1
  anarcat authored 6 months ago
  
  965282c1
- document the new alert logger (team#41745) · de66b91f
  anarcat authored 6 months ago
  
  de66b91f
Sep 19, 2024

alertmanager/alert timers: missing words · eca288ac
lelutin authored 6 months ago

eca288ac

try to rephrase the group_wait stuff again · 30146e2f

anarcat authored 6 months ago

I didn't find the result to be particularly legible, and it had lost
the separation of source code references I had before. I am not sure
we should dig too much into implementation details (like "threads"),
but I kept that anyways.

This is mostly reformulations.

30146e2f

Some clarifications on the timers · 209eff5b

lelutin authored 6 months ago

Starting with the "fourth" timer, knowing about alert grouping and how
it's done is useful to understand the rest of the timers, so I've added
a bit of context there.

During discussions on IRC, we took the time to dig into the alertmanager
code and we've confirmed what @cks was mentioning in their blog post:
once a group is created for a route, a thread is launched for processing
new notifications every `group_interval`, so that setting is really like
a group-specific ticker for new notifications.
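
The ticker behaviour described above can be sketched as a small simulation. This is a simplified model and not Alertmanager's actual code: it assumes the first flush of a group happens `group_wait` seconds after the first alert, that later flushes happen every `group_interval` seconds, and that a flush only notifies when new alerts joined the group since the previous flush. It ignores `repeat_interval`, resolved alerts and inhibitions. The function name and the `horizon` parameter are made up for this example.

```python
def notification_times(arrivals, group_wait, group_interval, horizon):
    """Return the times (in seconds) at which a simplified alert group notifies.

    arrivals: times at which new alerts join the group
    horizon: time at which the simulation stops
    """
    arrivals = sorted(arrivals)
    if not arrivals:
        return []

    sent = []
    last_flush = None
    # The group is created by its first alert; the first flush waits group_wait.
    tick = arrivals[0] + group_wait
    while tick <= horizon:
        # Alerts that arrived since the previous flush, up to and including this one.
        pending = [
            t for t in arrivals
            if t <= tick and (last_flush is None or t > last_flush)
        ]
        if pending:
            sent.append(tick)
        last_flush = tick
        # After the first flush, the group ticks every group_interval.
        tick += group_interval
    return sent


# Two alerts close together, then a third one 100 seconds in.
print(notification_times([0, 10, 100], group_wait=30, group_interval=300, horizon=1000))
```

In this model the third alert is not notified when it arrives, but only at the next tick of the group, which is the "group-specific ticker" effect.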

209eff5b

create missing, empty page for rfc-70 · 308f70a8
anarcat authored 6 months ago

308f70a8

convert policy.md back to unix line endings · 00c4aef8

anarcat authored 6 months ago

I don't understand wtf is going on here, but it looks like edits done
through the wiki interface somehow rewrite the entire file with DOS
line endings.
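
For reference, a conversion like this one can be sketched in a few lines. This is only an illustration of the technique, not necessarily how the commit was produced; the helper names are made up for this example.

```python
from pathlib import Path


def to_unix_line_endings(data: bytes) -> bytes:
    """Replace DOS (CRLF) line endings with Unix (LF) ones."""
    return data.replace(b"\r\n", b"\n")


def convert_file(path: str) -> bool:
    """Rewrite the file in place; return True if anything changed."""
    target = Path(path)
    original = target.read_bytes()
    converted = to_unix_line_endings(original)
    if converted != original:
        target.write_bytes(converted)
        return True
    return False
```

Working on bytes rather than text avoids any implicit newline translation by Python itself. Calling `convert_file("policy.md")` would rewrite that file only if it contains CRLF sequences.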

00c4aef8

Merge remote-tracking branch 'wiki/master' · d7e90ca4
anarcat authored 6 months ago

d7e90ca4
document alert timings details, see prometheus-alerts#18 · d96aa943
anarcat authored 6 months ago

d96aa943
show more "wtf is my ip" tricks · 0f315bdb
anarcat authored 6 months ago

0f315bdb
cross-ref to the "how to add people to donate page" · d1d680bb
anarcat authored 6 months ago

d1d680bb
Update policy · 5e27bf7c
groente authored 6 months ago

5e27bf7c

Sep 18, 2024

cross-ref cumin to direct-ssh setup and expand on effects of using batch · 9e3ecd39

lelutin authored 6 months ago

Using cumin's batch size is still a possibility to avoid issues, but it
is preferred to configure yourself for direct ssh connections and avoid
using the batch size if not necessary.

If a direct ssh connection is not possible, then using the batch size hack
is still possible. Using it does have some side effects that one should
be aware of, though.

Small correction in the text after my tests today: the limitation is
imposed by the MaxStartups setting, not MaxSessions.
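
To illustrate why MaxStartups matters when many connections are opened at once, here is a minimal sketch of the "random early drop" rule documented in sshd_config(5) for the `start:rate:full` form of that setting. The defaults used below (10:30:100) are OpenSSH's documented defaults; the values configured on the hosts discussed here are not known and may differ. The function name is made up for this example.

```python
def drop_probability(unauthenticated: int, start: int = 10,
                     rate: int = 30, full: int = 100) -> float:
    """Percent chance that sshd refuses a new connection under MaxStartups.

    unauthenticated: number of connections currently not yet authenticated
    start, rate, full: the three fields of MaxStartups start:rate:full
    """
    if unauthenticated < start:
        return 0.0
    if unauthenticated >= full or rate == 100:
        return 100.0
    # Between start and full, the probability grows linearly from rate to 100%.
    extra = (100 - rate) * (unauthenticated - start) / (full - start)
    return rate + extra


for n in (5, 10, 55, 100):
    print(n, drop_probability(n))
```

Under this rule, opening a large number of simultaneous connections through one sshd quickly leads to some of them being refused before authentication, which is consistent with limiting concurrency via a batch size.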

9e3ecd39

cumin: default cumin ssh connections to the root user · e5bf6970

lelutin authored 6 months ago

Without this, if you have some blocks in your ssh config that set you up
for connecting to certain hosts as an unprivileged user, you'll end up
running cumin commands with that user and very probably failing.

Cumin is mostly used for running ad-hoc admin commands on hosts, so it
makes sense to make it force connections as root.

e5bf6970

show when *not* to use a jump host · 8773d85b
anarcat authored 6 months ago

/cc @lelutin

8773d85b
follow s/runbook/playbook/ in prometheus · 109e46c7
anarcat authored 6 months ago

109e46c7
fix typos found by harper · 2bcb5280
anarcat authored 6 months ago

2bcb5280
cross-ref civicrm and donate metrics sections · 96e152ab
anarcat authored 6 months ago

I was looking for the answer to "what are the metrics here".

96e152ab

cross-reference the gitlab labels proposals · 11b740f3

anarcat authored 6 months ago

Amazingly, those two didn't know each other... At least now we can
find one another when looking at one, but perhaps the triage stuff
could be merged in the labels proposal?

11b740f3

fix pager playbook link on crm jobs · 85b0ab48
anarcat authored 6 months ago

85b0ab48

move service template to its correct location · d5e6b875

anarcat authored 6 months ago

This has been bugging me since basically forever: the howto/template
is not a template for the "howto" section (which is now poorly defined
anyways) at *all*. It's precisely the template for *services*, and
really just belongs there.

I've been hesitant in performing that rename for a long time. First
because GitLab wikis didn't support redirects (they do now, and we add
one here), but also because we probably link to the wiki-replica
version of this in a few places.

I've tried to fix the links inside the wiki, but there are certainly
others that will break. We'll fix those as we go.

For now it seems better and more intuitive to have this at the right
place than preserve the legacy location.

d5e6b875

document more generic job failures in CiviCRM · c4076d60
anarcat authored 6 months ago

This is so we have a runbook to link to in a new alert about this.

c4076d60
mention tpa-rfc-33 in alternatives · 342971af
anarcat authored 6 months ago

342971af
please harper · d38d00b4
anarcat authored 6 months ago

d38d00b4

prometheus: move munin section down into the discussion section · b893ca6c

anarcat authored 6 months ago

That was relevant in 2019, when we actually were replacing
Munin (which "died in a fire"), but we're really far past that
now. Perhaps we could also have a "migrating from Nagios" section here
as well though, see also team#41655.

b893ca6c

clarify and scrape_job usage · 86f73912

anarcat authored 6 months ago

We were missing key bits about the firewall rules and a simpler
example for `collect_scrape_jobs`.

86f73912

Sep 17, 2024

Rework information about alert routes and recipients · 8ede3aad

lelutin authored 6 months ago

Clearer distinction between recipients and routes. They're set in
different hiera keys.

The definitions are now in hiera, not in puppet manifests, so the
examples need to be refreshed.

8ede3aad

Reorganize and rephrase rules + scrape jobs/targets · e8741f85

lelutin authored 6 months ago

Currently rules are *not* defined in puppet. However, scrape jobs and
targets should be for all TPA-related services.

e8741f85

howto/prometheus: better overview of the systems involved · c0dda00c

lelutin authored 6 months ago

By using a numbered list with unnumbered subpoints, we can convey a good
sense of all the systems that collaborate on monitoring.

The higher-level list is numbered since it follows the order in which
things happen. First the alert is created by prom, then received by
alertmanager and finally consulted by sysadmins via karma/grafana.

c0dda00c

prometheus: tiny bit of rewording to make it easier to read · 4c07f6a4

lelutin authored 6 months ago

The first paragraph says that we are not using prom alerting, and while
it's still technically true that we haven't fully switched to it yet, we
do have alerts for TPA services in prometheus now and we're slowly
moving towards switching to that completely. So we might as well change
that now to say that we do indeed use this for our monitoring.

The "Looking for alerts" paragraph gives a better overview if we
present the URLs that one needs to know about as a list, with reduced
verbosity.

4c07f6a4

add another possible architecture diagram (tpo/web/donate-neo#79) · cd20b009
anarcat authored 6 months ago

/cc @stephen

cd20b009
clarify next steps in irc bridge (team#41761) · 01e71def
anarcat authored 6 months ago

I still can't actually do this, but this is the way it works according
to @ahf.

01e71def