The Tor Project issues
https://gitlab.torproject.org/groups/tpo/-/issues (feed updated 2020-06-27)

#29948: Add micah to nextcloud-admin@
https://gitlab.torproject.org/tpo/tpa/team/-/issues/29948 (Linus Nordberg <linus@torproject.org>, updated 2020-06-27; assigned: Linus Nordberg)

#29942: Please create a new git repo project/tor-browser/fastlane
https://gitlab.torproject.org/tpo/tpa/team/-/issues/29942 (Matthew Finkel, updated 2020-06-27)

```
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512
Hi,
Please create a new public git repo project/tor-browser/fastlane under "Infrastructure and Administration".
The description for this repo should be: Tor Browser app store and deployment configuration for Fastlane
Please give sysrqb and gk write access.
Thanks!
signed for trac.torproject.org on 2019-03-28 17:25:00 UTC
-----BEGIN PGP SIGNATURE-----
iQIzBAEBCgAdFiEEmQpn3DVLpEMbqGYohK8DqE7aGAAFAlydA5EACgkQhK8DqE7a
GACbMg/+NNf2u/EO+r5mzGVRSFRfFA8zk4izvJtgZqhuv6smEC5E9Utz5hNZPIEr
sZW9UPw5YFtDTs2hd3A37mgGoosRFctoodKsTF2valdQqGr4+ur/7gPW0Q2qDU+I
dZVbPbXpHQBRr7nKcT+4u1L4NSb3JHDKpkoeD4nI3HlG6xG32lixXpvvWIY33YVx
D91gkFxLuibZ5IL0z1VuH8GIjNe8MAv8kp38mc97zyjWvLd86qGDggZ8csw3DCxJ
4YLYwRl7ULzjUeX34wqaWujH4AWEtGqG2aU6BlsFLM3igKSXlAufzww87j8hjbYN
J3g7MpvOfWvij9I0vUTvtOwq5J0uwy8gbGpbWe5vvg76o+HjcR5eXD6gdrIW2W9h
oFfySg2uVVF9oQtS1dRwwP5cZfpQOXygnXHkQe/YmPA5MoLb9hjzADgQrBTM4HlQ
fxmeFSp/CsHIKpVBEqvlPK03SRflHUQcwmiCMG5xMoekE1jazXS3Eyjl2eeA7J4T
Xn+kG5YAuSAbqtk3nwIO+s25STs3RRwUWfw+sHXCkX0Dyd+nd94A36dYVAcIoQMH
rrUT/T78j0LS5XGb1ImgELF13VuoRXhTWfiSdJ3dK7pGm7+PUYJ6XWtKpN0QLBZm
eAqLOU/4pPgPxi/KBTtoe3ZOC7lYFEhHM8zuaiNI8pV/6KA7Cnw=
=Q7cI
-----END PGP SIGNATURE-----
```

#29921: Please create new static component onionperf.torproject.org
https://gitlab.torproject.org/tpo/tpa/team/-/issues/29921 (irl, updated 2020-06-27)

This static component will contain the development/deployment documentation for OnionPerf. This should be built by Jenkins from the git.tpo onionperf.git repository automatically when new pushes are made and I would hope that we can copy over the config from stem.
Please let me know if there are any actions needed from me to progress this (like creating any other tickets for other people). (assigned: anarcat)

#29901: Point www.tp.o to tpo repository
https://gitlab.torproject.org/tpo/tpa/team/-/issues/29901 (Hiro, updated 2020-06-30)

Hi,
We would like to switch www.tp.o to the new website.
Could you please point it to the following repository:
https://gitweb.torproject.org/project/web/tpo.git/
I will need to have htaccess overrides enabled. (assigned: Hiro)
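(For context, enabling .htaccess overrides is typically an Apache `AllowOverride` change on the vhost serving the site; a hypothetical sketch with a made-up DocumentRoot, not the actual www.torproject.org configuration:)

```
<Directory /srv/www.torproject.org/htdocs>
    # let .htaccess files in the docroot override these directives
    AllowOverride All
    Require all granted
</Directory>
```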
#29889: Add email alias nextcloud-admin@tpo
https://gitlab.torproject.org/tpo/tpa/team/-/issues/29889 (Linus Nordberg <linus@torproject.org>, updated 2020-06-27)

In order to not have to add individuals to trac tickets related to nextcloud service administration, let's have an email alias for this group. We can then let that alias be the owner of the trac component. (assigned: Linus Nordberg)

#29888: retire nova
https://gitlab.torproject.org/tpo/tpa/team/-/issues/29888 (weasel (Peter Palfrader), updated 2023-03-27)

remove from NSset and delegation,
then zero the disk

#29864: TPA-RFC-33: consider replacing nagios with prometheus
https://gitlab.torproject.org/tpo/tpa/team/-/issues/29864 (anarcat, updated 2023-05-17)

As a followup to the Prometheus/Grafana setup started in #29681, I am wondering if we should also consider replacing the Nagios/Icinga server with Prometheus. I have done a little research on the subject and figured it might be good to at least document the current state of affairs.
This would remove a complex piece of architecture we have at TPO that was designed before Puppet was properly deployed. Prometheus has an interesting federated design that allows it to scale to multiple machines easily, along with a high-availability component for the alertmanager that allows it to be more reliable than a traditional Nagios configuration. It would also simplify our architecture: the Nagios server automation is a complex mix of Debian packages and git hooks that is serving us well but is hard to comprehend and debug for new administrators. (I managed to wipe the entire Nagios config myself in my first week on the job by messing up a configuration file.) Having the monitoring server fully deployed by Puppet would be a huge improvement, even if it would be done with Nagios instead of Prometheus, of course.
Right now the Nagios server is actually running Icinga 1.13, a Nagios fork, on a Hetzner machine (`hetzner-hel1-01`). It's doing its job generally well although it feels a *little* noisy, but that's to be expected from Nagios servers. Reducing the number of alerts seems to be an objective, explicitly documented in #29410, for example.
Both Grafana and Prometheus can do alerting, with various mechanisms and plugins. I haven't investigated those deeply, but in general that's not a problem in alerting: you fire some script or API call and the rest happens. I suspect we could port the current Nagios alerting scripts to Prometheus fairly easily, although I haven't investigated our scripts in detail.
The problem is reproducing the check scripts and their associated alert thresholds. In the Nagios world, when a check is installed, it *comes* with its own health thresholds ("OK", "WARNING", "CRITICAL"), and TPO has developed a wide variety of such checks. According to the current Nagios dashboard, it monitors 4612 services on 88 hosts (which is interesting considering LDAP thinks there are 78). That looks terrifying, but it's actually a set of 9 commands running on the Nagios server, including the complex `check_nrpe` system, which is basically a client-side Nagios that has its own set of checks. And that's where the "cardinality explosion" happens: on a typical host, there are 315 such checks implemented.
That's the hard part: converting those 324 checks into Prometheus alerts, one at a time. Unfortunately, there are no built-in or even third-party "Prometheus alert sets" that I could find in my [original research](https://anarc.at/blog/2018-01-17-monitoring-prometheus/), although that might have changed in the last year.
Each check in Prometheus is basically a YAML rule describing a Prometheus query that, when it evaluates to true (e.g. disk usage > 90%), sends an alert. It's not impossible to do that conversion; it's just a lot of work.
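For illustration, a minimal sketch of what such a rule could look like, assuming the metric names of a recent node exporter (the threshold, group name and labels are made up for the example):

```
groups:
  - name: disk
    rules:
      # fire when a filesystem has been more than 90% full for 15 minutes
      - alert: FilesystemAlmostFull
        expr: 100 * (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) > 90
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.mountpoint }} on {{ $labels.instance }} is over 90% full"
```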
To do this progressively while allowing us to make new alerts on Prometheus instead of Nagios, I suggest proceeding the same way Cloudflare did, which is to establish a "Nagios to Prometheus" bridge: Nagios stops sending the alerts on its own and instead forwards them to the Prometheus server, through a plugin they called [Promsaint](https://github.com/cloudflare/promsaint).
With the bridge in place, Nagios checks can be migrated into Prometheus alerts progressively without disruption. Note that Cloudflare documented their experience with Prometheus in [this 2017 promcon talk](https://promcon.io/2017-munich/talks/monitoring-cloudflares-planet-scale-edge-network-with-prometheus/). Cloudflare also made an alert dashboard called [unsee](https://github.com/cloudflare/unsee) (see also the fork called [karma](https://github.com/prymitive/karma)) and [elasticsearch integration](https://github.com/cloudflare/alertmanager2es) which might be good to investigate further.
Another useful piece is this [NRPE to Prometheus exporter](https://www.robustperception.io/nagios-nrpe-prometheus-exporter), which allows Prometheus to directly scrape NRPE targets. It doesn't include Prometheus alerts and instead relies on a Grafana dashboard to show possible problems, so I don't think it's that useful as an alternative. There's a [similar approach using check_mk](https://github.com/m-lab/prometheus-nagios-exporter) instead.
Another possible approach is to send alerts from Nagios based on Prometheus checks, using the [Prometheus nagios plugins](https://github.com/prometheus/nagios_plugins). This might allow us to get rid of NRPE everywhere but it would probably be useful only if we do want to keep Nagios in the long term and remove NRPE in favor of the existing Prometheus exporters.
So, the battle plan is basically this:
1. `apt install prometheus-alertmanager`
2. reimplement the Nagios alerting commands
3. send Nagios alerts through the alertmanager (a minimal sketch of such a bridge follows the list)
4. rewrite (non-NRPE) commands (9) as Prometheus alerts
5. optionally, scrape the NRPE metrics from Prometheus
6. optionally, create a dashboard and/or alerts for the NRPE metrics
7. rewrite NRPE commands (300+) as Prometheus alerts
8. turn off the Nagios server
9. remove all traces of NRPE on all nodes
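For steps 2 and 3, a minimal sketch of what the bridge could look like: a Nagios notification command that POSTs the alert to the Alertmanager API instead of sending mail. The host, port and label names here are assumptions for the example, not our actual configuration:

```
#!/bin/sh
# hypothetical Nagios notification command: forward an alert to Alertmanager
# $1 = host, $2 = service, $3 = state (WARNING/CRITICAL/...), $4 = plugin output
curl -s -XPOST "http://localhost:9093/api/v1/alerts" -d "[{
  \"labels\": {\"alertname\": \"$2\", \"instance\": \"$1\", \"severity\": \"$3\"},
  \"annotations\": {\"summary\": \"$4\"}
}]"
```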
Update: this, obviously, will require more discussion than just implementing the above battle plan, as there isn't a consensus in the team towards Prometheus as a replacement for Icinga. I have assigned TPA-RFC-33 to this and started drafting requirements and personas in #40755. (milestone: Debian 11 bullseye upgrade; assigned: anarcat; due 2022-09-01)

#29852: Let ahf and cohosh push to /pluggable-transports/snowflake.git
https://gitlab.torproject.org/tpo/tpa/team/-/issues/29852 (David Fifield <dcf@torproject.org>, updated 2020-06-27)

```
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512
I'd like ahf and cohosh to be able to push to the
/pluggable-transports/snowflake.git repo.
Signed, David 2019-03-21
-----BEGIN PGP SIGNATURE-----
iQIzBAEBCgAdFiEEeXoyauxKR4rwUMw64rk9gVzTiOUFAlyUEAoACgkQ4rk9gVzT
iOXPww//UwoNN6Fi3PaKepwqQYbO6TunJJ4SeBzwDPKOFwOe0E0F5zSzJnVP329h
8f5o/20E88gRdHRlbJ5SzZnGWwW2KJFSRStJhSx7r3glJrsbsK6gUXFB8nbIuM7B
sczbG7gzOv9b3ly/EjjBk0+mC5EjtdzqsiqKYDH2mXQCRoDQsqlnWAvCVfPPgnC4
LRyFLtaK5dZQU6gRWqBN1EhWzrtLdbl2qSG2mLot+8z/BJrUk2BkNDXsyQla6ahP
vKpS3XKN5ldBfxBXTqOLZQTgUKQwRoTIT0xZpbuSb2WCRx7N3WDWbpwKbyL4GTJO
ZNNXsrz5qiSZ6h20qHTMt4TMrmKMQrmkcocWJ/6uqCqYCrDqCyF7SFmfwJzNYgtE
36OfQnaYcuH8XJYHEviwrLJRPWfgjcxzkEvA/RxwFTl/1Er1ulwxrFq8yya8IFCy
uQjSydLtga/WLWhq993x8aUGFvapF4P8L9iWkRSEqxln+WTxbx4F3yAszH3UeALS
eCzyl5kp6N1MkEB+VbvfsEViOLYqSwgZf74XRkmcMt9R1fy72CpSqUO24gY0Neuu
XpJ7F6+ksvKhKie31fZPxQb30ifEyZgsXSh4a9ia/62X99cl13sAZ/aoJ3vCjT8L
kfCRit6QtZ5GuPCOXTuiPbgkxlBJRvnp4TZsB8iAkmIUoBuzH5U=
=nwyV
-----END PGP SIGNATURE-----
```

#29846: fstrim script makes noises on some servers
https://gitlab.torproject.org/tpo/tpa/team/-/issues/29846 (anarcat, updated 2020-06-27)

we get this nightly since bungei was installed:
```
To: root@bungei.torproject.org
Date: Thu, 21 Mar 2019 06:25:03 +0000
/etc/cron.daily/puppet-trim-fs:
fstrim: /srv/backups/pg: the discard operation is not supported
fstrim: /srv/backups/bacula: the discard operation is not supported
```
this is from the following script deployed through puppet:
```
# by weasel
if tty > /dev/null; then
verbose="-v"
else
verbose=""
fi
awk '$9 ~ "^(ext4|xfs)$" && $4 == "/" {print $3, $5}' /proc/self/mountinfo | while read mm mountpoint; do
path="/sys/dev/block/$mm"
[ -e "$path" ] || continue
path="$(readlink -f "$path")"
while : ; do
qi="$path/queue/discard_max_bytes"
if [ -e "$qi" ]; then
[ "$(cat "$qi")" -gt "0" ] && fstrim $verbose "$mountpoint"
break
fi
# else try the parent
path="$(readlink -f "$path/..")"
# as long as it's a device
[ -e "$path/dev" ] || break
done
done
```
I can confirm the mapped device cannot be "trimmed":
```
root@bungei:/home/anarcat# fstrim -v /srv/backups/pg
fstrim: /srv/backups/pg: the discard operation is not supported
```
I'm unsure why that is the case. I suspect it might be a matter of adding the `issue_discards` option to `lvm.conf`, but I'm not sure. I also note that the `discard` option is not present in `crypttab` either.
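For reference, the two knobs in question would look roughly like this (untested here; note that `issue_discards` only covers discards issued by LVM itself when removing or shrinking LVs, so the `crypttab` option is the more likely fix for `fstrim` on a dm-crypt device):

```
# /etc/lvm/lvm.conf
devices {
    issue_discards = 1
}

# /etc/crypttab: add "discard" to the options column (device name is a placeholder)
# backups-crypt  /dev/sdX2  none  luks,discard
```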
In the [Debian wiki SSDOptimization page](https://wiki.debian.org/SSDOptimization), they mention an `fstrim` systemd service (not required in Buster, apparently) that supposedly takes care of that work for us. It does, however, only run the following command:
```
fstrim -av
```
... which doesn't seem to do anything here. It also doesn't silence the warnings from the script, so I'm not sure it's *doing* anything.
In any case, I welcome advice on how to deal with that one warning. (assigned: anarcat)

#29841: ipsec VPN generates gigantic logs
https://gitlab.torproject.org/tpo/tpa/team/-/issues/29841 (anarcat, updated 2020-06-27)

Serious yak shaving night...
To try to silence this seemingly innocuous warning:
```
/etc/cron.daily/logrotate:
error: Compressing program wrote following message to stderr when compressing log /var/log/syslog.1:
gzip: stdin: file size changed while zipping
```
... I have looked at the logrotate configuration deployed through Puppet, and it seems slightly out of date compared to the one available in stretch. This is the configuration left over from the stretch upgrade on eugeni, for example:
```
/var/log/syslog
{
rotate 7
daily
missingok
notifempty
delaycompress
compress
postrotate
invoke-rc.d syslog-ng reload > /dev/null
endscript
}
/var/log/mail.info
/var/log/mail.warn
/var/log/mail.err
/var/log/mail.log
/var/log/daemon.log
/var/log/kern.log
/var/log/auth.log
/var/log/user.log
/var/log/lpr.log
/var/log/cron.log
/var/log/debug
/var/log/messages
/var/log/error
{
rotate 4
weekly
missingok
notifempty
compress
delaycompress
sharedscripts
postrotate
invoke-rc.d syslog-ng reload > /dev/null
endscript
}
```
Out of those, we're not doing the `syslog-ng reload`, the `delaycompress`, or `notifempty`, and each logfile is in a separate block, which makes it harder to read. So I looked at doing the postrotate action, but then I realized it was happening on the syslog logfile, which *is* correctly reloaded. So then I figured the `delaycompress` might be the bit missing.
but before enabling that blindly, I figured I would check if this would blow up the disk space on a server. how to do that you ask? well with our shiny new Cumin tool of course:
```
anarcat@curie:~(master)$ cumin -p 0 '*' 'for log in /var/log/*.log ; do if [ `du -b "$log" | cut -f1` -gt 1000000000 ] ; then echo "logfile $log larger than 1GB"; exit 1 ; fi; done'
74 hosts will be targeted:
alberti.torproject.org,arlgirdense.torproject.org,bracteata.torproject.org,brulloi.torproject.org,build-arm-[None..None](../compare/None...None).torproject.org,build-x86-[None..None](../compare/None...None).torproject.org,bungei.torproject.org,carinatum.torproject.org,cdn-backend-sunet-01.torproject.org,chamaemoly.torproject.org,chiwui.torproject.org,colchicifolium.torproject.org,corsicum.torproject.org,crispum.torproject.org,crm-ext-01.torproject.org,crm-int-01.torproject.org,cupani.torproject.org,dictyotum.torproject.org,eugeni.torproject.org,fallax.torproject.org,forrestii.torproject.org,gayi.torproject.org,getulum.torproject.org,gitlab-01.torproject.org,henryi.torproject.org,hetzner-hel1-[None..None](../compare/None...None).torproject.org,hetzner-nbg1-01.torproject.org,hyalinum.torproject.org,iranicum.torproject.org,kvm[None..None](../compare/None...None).torproject.org,listera.torproject.org,macrum.torproject.org,majus.torproject.org,materculae.torproject.org,meronense.torproject.org,moly.torproject.org,neriniflorum.torproject.org,nevii.torproject.org,nova.torproject.org,nutans.torproject.org,omeiense.torproject.org,oo-hetzner-03.torproject.org,opacum.torproject.org,orestis.torproject.org,oschaninii.torproject.org,palmeri.torproject.org,pauli.torproject.org,peninsulare.torproject.org,perdulce.torproject.org,polyanthum.torproject.org,rouyi.torproject.org,rude.torproject.org,savii.torproject.org,saxatile.torproject.org,scw-arm-ams-01.torproject.org,scw-arm-par-01.torproject.org,staticiforme.torproject.org,subnotabile.torproject.org,textile.torproject.org,togashii.torproject.org,troodi.torproject.org,unifolium.torproject.org,vineale.torproject.org,web-cymru-01.torproject.org,web-hetzner-01.torproject.org
Confirm to continue [y/n]? y
|██████████████▌ | 12% (9/74) [00:47<08:25, 7.78s/hosts]
===== NODE GROUP ===== |██████████████▌ | 12% (9/74) [00:52<08:25, 7.78s/hosts]
(3) build-arm-[None..None](../compare/None...None).torproject.org |██████████████▌ | 12% (9/74) [00:52<08:25, 7.78s/hosts]
----- OUTPUT of 'for log in /var/...xit 1 ; fi; done' ----- |██████████████▌ | 12% (9/74) [00:52<08:25, 7.78s/hosts]
Connection timed out during banner exchange |██████████████▌ | 12% (9/74) [00:52<08:25, 7.78s/hosts]
===== NODE GROUP ===== |██████████████▌ | 12% (9/74) [00:52<08:25, 7.78s/hosts]
(5) hetzner-hel1-01.torproject.org,kvm4.torproject.org,macrum.torproject.org,textile.torproject.org,unifolium.torproject.org |██████████████▌ | 12% (9/74) [00:52<08:25, 7.78s/hosts]
----- OUTPUT of 'for log in /var/...xit 1 ; fi; done' ----- |██████████████▌ | 12% (9/74) [00:52<08:25, 7.78s/hosts]
logfile /var/log/daemon.log larger than 1GB |██████████████▌ | 12% (9/74) [00:52<08:25, 7.78s/hosts]
===== NODE GROUP ===== |██████████████▌ | 12% (9/74) [00:52<08:25, 7.78s/hosts]
(1) hyalinum.torproject.org |██████████████▌ | 12% (9/74) [00:52<08:25, 7.78s/hosts]
----- OUTPUT of 'for log in /var/...xit 1 ; fi; done' ----- |██████████████▌ | 12% (9/74) [00:52<08:25, 7.78s/hosts]
ssh: Could not resolve hostname hyalinum.torproject.org: No address associated with hostname |██████████████▌ | 12% (9/74) [00:52<08:25, 7.78s/hosts]
================ PASS |████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 88% (65/74) [00:52<00:07, 1.23hosts/s]
FAIL |██████████████▌ | 12% (9/74) [00:52<08:25, 7.78s/hosts]
12.2% (9/74) of nodes failed to execute command 'for log in /var/...xit 1 ; fi; done': build-arm-[None..None](../compare/None...None).torproject.org,hetzner-hel1-01.torproject.org,hyalinum.torproject.org,kvm4.torproject.org,macrum.torproject.org,textile.torproject.org,unifolium.torproject.org
87.8% (65/74) success ratio (>= 0.0% threshold) for command: 'for log in /var/...xit 1 ; fi; done'.: alberti.torproject.org,arlgirdense.torproject.org,bracteata.torproject.org,brulloi.torproject.org,build-x86-[None..None](../compare/None...None).torproject.org,bungei.torproject.org,carinatum.torproject.org,cdn-backend-sunet-01.torproject.org,chamaemoly.torproject.org,chiwui.torproject.org,colchicifolium.torproject.org,corsicum.torproject.org,crispum.torproject.org,crm-ext-01.torproject.org,crm-int-01.torproject.org,cupani.torproject.org,dictyotum.torproject.org,eugeni.torproject.org,fallax.torproject.org,forrestii.torproject.org,gayi.torproject.org,getulum.torproject.org,gitlab-01.torproject.org,henryi.torproject.org,hetzner-hel1-[None..None](../compare/None...None).torproject.org,hetzner-nbg1-01.torproject.org,iranicum.torproject.org,kvm5.torproject.org,listera.torproject.org,majus.torproject.org,materculae.torproject.org,meronense.torproject.org,moly.torproject.org,neriniflorum.torproject.org,nevii.torproject.org,nova.torproject.org,nutans.torproject.org,omeiense.torproject.org,oo-hetzner-03.torproject.org,opacum.torproject.org,orestis.torproject.org,oschaninii.torproject.org,palmeri.torproject.org,pauli.torproject.org,peninsulare.torproject.org,perdulce.torproject.org,polyanthum.torproject.org,rouyi.torproject.org,rude.torproject.org,savii.torproject.org,saxatile.torproject.org,scw-arm-ams-01.torproject.org,scw-arm-par-01.torproject.org,staticiforme.torproject.org,subnotabile.torproject.org,togashii.torproject.org,troodi.torproject.org,vineale.torproject.org,web-cymru-01.torproject.org,web-hetzner-01.torproject.org
87.8% (65/74) success ratio (>= 0.0% threshold) of nodes successfully executed all commands.: alberti.torproject.org,arlgirdense.torproject.org,bracteata.torproject.org,brulloi.torproject.org,build-x86-[None..None](../compare/None...None).torproject.org,bungei.torproject.org,carinatum.torproject.org,cdn-backend-sunet-01.torproject.org,chamaemoly.torproject.org,chiwui.torproject.org,colchicifolium.torproject.org,corsicum.torproject.org,crispum.torproject.org,crm-ext-01.torproject.org,crm-int-01.torproject.org,cupani.torproject.org,dictyotum.torproject.org,eugeni.torproject.org,fallax.torproject.org,forrestii.torproject.org,gayi.torproject.org,getulum.torproject.org,gitlab-01.torproject.org,henryi.torproject.org,hetzner-hel1-[None..None](../compare/None...None).torproject.org,hetzner-nbg1-01.torproject.org,iranicum.torproject.org,kvm5.torproject.org,listera.torproject.org,majus.torproject.org,materculae.torproject.org,meronense.torproject.org,moly.torproject.org,neriniflorum.torproject.org,nevii.torproject.org,nova.torproject.org,nutans.torproject.org,omeiense.torproject.org,oo-hetzner-03.torproject.org,opacum.torproject.org,orestis.torproject.org,oschaninii.torproject.org,palmeri.torproject.org,pauli.torproject.org,peninsulare.torproject.org,perdulce.torproject.org,polyanthum.torproject.org,rouyi.torproject.org,rude.torproject.org,savii.torproject.org,saxatile.torproject.org,scw-arm-ams-01.torproject.org,scw-arm-par-01.torproject.org,staticiforme.torproject.org,subnotabile.torproject.org,togashii.torproject.org,troodi.torproject.org,vineale.torproject.org,web-cymru-01.torproject.org,web-hetzner-01.torproject.org
```
This might not be very easy to read, but the important bit is this:
```
(5) hetzner-hel1-01.torproject.org,kvm4.torproject.org,macrum.torproject.org,textile.torproject.org,unifolium.torproject.org
----- OUTPUT of 'for log in /var/...xit 1 ; fi; done' -----
|logfile /var/log/daemon.log larger than 1GB
```
So I looked at the first one of those (hetzner-hel1-01) and lo and behold, the `daemon.log` is gigantic:
```
1,4G /var/log/daemon.log
```
I looked into the file briefly and it looks like a *lot* of information from ipsec. But before I start shaving another yak, I figured I would just file this as a ticket to document how far I went and let this one rest for a while.
(I did end up setting delaycompress after doing more investigations in Prometheus about free disk space, but that's documented in the tor-puppet commit 44f86c7d and previous.) (assigned: weasel (Peter Palfrader))

#29822: prometheus server cannot reach build-arm* boxes
https://gitlab.torproject.org/tpo/tpa/team/-/issues/29822 (anarcat, updated 2020-06-27)

The `build-arm-0[None..None](../compare/None...None).torproject.org` boxes are behind NAT (or some sort of firewall?) which makes them unreachable from the global internet. They are therefore not monitored from the Prometheus server right now, although they *are* reachable from the Nagios server.
We need to set up a similar configuration to have those boxes scraped like the other ones. (assigned: weasel (Peter Palfrader))

#29820: Change PGP key for ahf
https://gitlab.torproject.org/tpo/tpa/team/-/issues/29820 (Linus Nordberg <linus@torproject.org>, updated 2020-06-27; assigned: Linus Nordberg)

#29817: dead disk on moly
https://gitlab.torproject.org/tpo/tpa/team/-/issues/29817 (anarcat, updated 2020-06-27)

one of the hard drives on moly has died. this was spotted by cymru's staff and confirmed when smartd was installed (legacy/trac#29709).
i have done some research on the machine to figure out what's up, and wrote the following reply to Cymru's people:
> [...] I can confirm that one of the hard drives in Moly has failed, according to SMART metrics we have available.
>
> According to smartd, that disk is:
>
> [SEAGATE ST3600057SS 0008], lu id: 0x5000c5003b5bc36f, S/N: 6SL1G7Q60000N1497K0E, 600 GB
>
> It's a 600GB SAS drive. It's part of a megaraid RAID-10 array that has marked the drive as "Firmware state: Failed". I'll go under the assertion this means the drive is dead.
>
> Being new here, I'm not familiar with the machine either. From what I can tell, it's a Supermicro X8DTU motherboard, and possibly an iXsystems iX1204-R700UB case. Does it look like this picture?
>
> https://static.ixsystems.co/uploads/2017/08/1204h-t_front.png
>
> If so, the only datasheet I could find is this limited PDF:
>
> https://www.ixsystems.com/wp-content/uploads/2017/09/Server_Line_2017_WEB.pdf
>
> It *does* say the hard drives are hot-swappable, so in theory, it should just be a matter of replacing the hard drive.
>
> It looks like each drive has its own LED, hopefully the one with the amber warning light should be the dead disk. I've issued a command to the RAID controller to make it "flash" the drive LED, so hopefully that will allow you to locate it better.
>
> I *think* the disk controller is new enough for you to simply hot swap the drive with a new one without any other intervention on our part. But it might be better if we are available during the operation. [...]
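(The "flash the drive LED" command mentioned above is presumably a MegaCLI locate command along these lines; the enclosure:slot numbers are placeholders, not the actual values on moly:)

```
# start blinking the locate LED on the failed drive, then stop it after the swap
megacli -PdLocate -start -PhysDrv '[32:5]' -a0
megacli -PdLocate -stop -PhysDrv '[32:5]' -a0
```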
I've created some documentation on the hardware RAID stuff here:
https://help.torproject.org/tsa/howto/raid/
we're at the waiting step now - we'll see if Cymru can do the replacement and when. i'm still not quite certain we can just hotswap the drive, but I'm hoping we can. (assigned: anarcat)

#29816: replace "Tor VM hosts" spreadsheet with Grafana dashboard
https://gitlab.torproject.org/tpo/tpa/team/-/issues/29816 (anarcat, updated 2023-08-28)

Our KVM allocation strategy is currently managed through a Google spreadsheet. This is suboptimal for a few reasons:
1. it is hard to keep up to date - for example, moly is not listed in there even though it's in LDAP as a "KVM host"
2. it's not real time data - for example, even if a host is allocated one vCPU, it might be totally idle most of the time and doing mostly network or disk, while another one might hit the CPU hard. actual load is what matters
3. ~~it's hosted by Google - that has a few problems, the most important of which is that some TPA do not actually *want* to use Google services and might be reluctant to update it, worsening problem 1~~ that part is fixed: we have moved it to Nextcloud
I propose we shift this to a Grafana dashboard. I already have a prototype in the form of the [Node exporter server metrics Grafana Dashboard](https://grafana.com/dashboards/405), which shows basic stats for multiple hosts in parallel. I set the dashboard's default in Grafana to show the 6 KVM hosts:
<https://grafana.torproject.org/d/ER3U2cqmk/node-exporter-server-metrics?orgId=1&from=now-12h&to=now&var-node=kvm4.torproject.org:9100&var-node=kvm5.torproject.org:9100&var-node=macrum.torproject.org:9100&var-node=moly.torproject.org:9100&var-node=textile.torproject.org:9100&var-node=unifolium.torproject.org:9100>
That looks like this:
![https://paste.anarc.at/snaps/snap-2019.04.17-16.48.43.png](https://paste.anarc.at/snaps/snap-2019.04.17-16.48.43.png)
... but it's not ideal:
* it's showing irrelevant stats for this purpose like context switches or detailed disk or memory stats
* it's missing critical information like the number of KVM guests hosted on the machine, how many CPUs and disk space is allocated and so on
This is the information we should be showing:
* disk capacity vs allocation
* disk utilization
* CPU count vs allocation
* actual CPU utilization
* load?
* memory capacity vs allocation
* actual memory usage
Some of that information currently lives *only* in the spreadsheet. For example, disk allocations are only available there, as the KVM guests run on QCOW (QEMU copy-on-write) disk images that only take space when actually used by the guest. This has the advantage of allowing us to over-provision, but means we must keep that metadata somewhere else.
So for now it's in the spreadsheet, but we could find a way to move it somewhere Prometheus can scrape. One trick that Prometheus has is that it can expose metrics stored as text files in `/var/lib/prometheus/node-exporter/*.prom`. This is how the smartctl and APT metrics get shipped for example: a cron job (well, a systemd timer) regularly writes that file, atomically. So one option could be to move this information to (say) LDAP or Puppet/Hiera and write that information into that file using a cronjob (LDAP) or Puppet (Hiera).
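As a sketch of how that could look for the KVM allocation data (the file name, metric names and values are all made up for the example), the writer would produce the file elsewhere and rename it into place so the exporter never reads a half-written file:

```
cat > /var/lib/prometheus/node-exporter/kvm_allocations.prom.tmp <<EOF
# HELP kvm_allocated_vcpus vCPUs promised to guests on this KVM host
# TYPE kvm_allocated_vcpus gauge
kvm_allocated_vcpus 24
# HELP kvm_allocated_disk_bytes disk space promised to guests (qcow2 virtual sizes)
# TYPE kvm_allocated_disk_bytes gauge
kvm_allocated_disk_bytes 1.2e+12
EOF
mv /var/lib/prometheus/node-exporter/kvm_allocations.prom.tmp \
   /var/lib/prometheus/node-exporter/kvm_allocations.prom
```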
Then we'd build a custom Grafana dashboard and get rid of the other spreadsheet.
A stop-gap measure might be to simplify the spreadsheet and move it to a plain text markdown file. We would lose the automatic calculations the spreadsheet provides, in exchange for easier updating and transparency. (assigned: anarcat)

#29796: synchronize puppet and LDAP hosts
https://gitlab.torproject.org/tpo/tpa/team/-/issues/29796 (anarcat, updated 2020-06-27)

We have hosts that are in Puppet and not in LDAP and vice versa. Every host in LDAP should be in Puppet and vice versa.
We have 78 hosts in LDAP and 74 in Puppet, with 73 hosts in common. This is the current diff:
```
$ diff puppet ldap
29a30,31
> geyeri.torproject.org
> gillii.torproject.org
36d37
< hyalinum.torproject.org
74a76,78
> weissii.torproject.org
> winklerianum.torproject.org
> woronowii.torproject.org
```
That is, right now, we have the following hosts in LDAP but not in Puppet:
* geyeri.torproject.org
* gillii.torproject.org
* weissii.torproject.org
* winklerianum.torproject.org
* woronowii.torproject.org
The following is in Puppet, but not LDAP:
* hyalinum.torproject.org
The two lists (`puppet` and `ldap`) were obtained using the following commands:
```
ssh -t pauli.torproject.org 'sudo -u postgres psql puppetdb -P pager=off -A -t -c "SELECT c.certname FROM certnames c WHERE c.deactivated IS NULL"' | tee puppet
tail -n +2 puppet | sort | sponge puppet
ssh alberti.torproject.org 'ldapsearch -h db.torproject.org -x -ZZ -b dc=torproject,dc=org -LLL "hostname=*.torproject.org" hostname | awk "\$1 == \"hostname:\" {print \$2}" | sort' > ldap
```
... as detailed in the [new Puppet docs](https://help.torproject.org/tsa/howto/puppet/).
I'm not exactly sure how to resolve this. When weasel saw a previous version of this list, he said:
```
12:30:00 <weasel> from a quick glance, all but the arm hosts can go.
12:30:06 <weasel> best to double-check with ldap.
12:30:19 <weasel> if they are not in ldap, and they haven't done a puppet run in a while, they should be removed from puppet also.
12:30:45 <weasel> gillii and geyeri are the old CRM hosts. I think linus wants to kill them soon but maybe keep them around (and offline) for now.
```
According to nagios, hyalinum has not checked into Puppet since 2018-02-12T08:53:13.339Z, over a month ago. So presumably that should be removed from puppet, and we should double-check the retirement procedure to see if it was completed correctly.
The hosts in LDAP and not in Puppet should probably be added to puppet, carefully (--noop is your friend) to see if it breaks anything.
In the future, we might want to add a Nagios check on the Puppet server to make sure this is synchronized.

#29788: Create email alias to core contributor Vinícius Zavam (egypcio)
https://gitlab.torproject.org/tpo/tpa/team/-/issues/29788 (Gus, updated 2020-06-27)

Hello, please create the email alias for our core contributor Vinicius Zavam (egypcio):
egypcio@tpo to egypcio@riseup.net
thanks!
Gus

#29781: Adding Fundraising Team page to https://trac.torproject.org home page
https://gitlab.torproject.org/tpo/tpa/team/-/issues/29781 (alsmith, updated 2020-06-27)

The Fundraising Team now has a Wiki page, hooray! It's here: https://trac.torproject.org/projects/tor/wiki/org/teams/FundraisingTeam
I don't have the permissions to edit the trac.torproject.org home page and add the Fundraising Team under the 'Teams' header. (assigned: Jens Kubieziel)

#29774: Please create 2019.www.torproject.org
https://gitlab.torproject.org/tpo/tpa/team/-/issues/29774 (Hiro, updated 2020-06-27)

This should serve as an archive for the current website.

#29770: mails relayed from lists.tpo to gmail.com bounces
https://gitlab.torproject.org/tpo/tpa/team/-/issues/29770 (anarcat, updated 2022-07-09)

It seems we're having trouble relaying mails to gmail.com and possibly other providers.
A similar question was raised by a tor-announce@ participant in `Message-ID: <CAHQQdjtuVj68EVw_TsnDGiTRj1bY7iyxuehUDj6bya=KAqE0MQ@mail.gmail.com>`
I myself had trouble sending mail to a @tpo account that is forwarded to gmail (bounce is `Message-ID: <20190311210933.9AB57E1B16@eugeni.torproject.org>`):
```
<target@gmail.com>: host gmail-smtp-in.l.google.com[64.233.167.26] said:
550-5.7.1 This message does not have authentication information or fails to
pass 550-5.7.1 authentication checks. To best protect our users from spam,
the 550-5.7.1 message has been blocked. Please visit 550-5.7.1
https://support.google.com/mail/answer/81126#authentication for more 550
5.7.1 information. f2si4120705wrj.403 - gsmtp (in reply to end of DATA
command)
```
I suspect this might have to do with (lack of) SPF records and/or ARC headers. (assigned: Jens Kubieziel)
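(For illustration only, an SPF record is a DNS TXT entry along these lines; the mechanisms below are placeholders, not a proposal for the actual torproject.org policy:)

```
; zone-file syntax; "eugeni" stands in for whatever hosts legitimately send mail
torproject.org.  IN  TXT  "v=spf1 mx a:eugeni.torproject.org ~all"
```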
#29764: Accessing this bug tracker via TorBrowser results in endless loop of captchas
https://gitlab.torproject.org/tpo/tpa/team/-/issues/29764 (Trac, updated 2020-06-27)

Trac always thinks submission is spam, even on successful authentication and captcha replies.
**Trac**:
**Username**: DNied

(assigned: Jens Kubieziel)