In the anti-censorship team we currently monitor several services with sysmon. We recently discovered that sysmon doesn't seem to follow HTTP 301 redirects. This means that if a web service dies but the 301 redirect still works (e.g., BridgeDB is dead but its apache reverse proxy still works), sysmon won't notice.
Now that prometheus is running, we should fill this monitoring gap by testing the following web sites:
Our test should ensure that these sites serve the content we expect, e.g., make sure that bridges.tp.o contains the string "BridgeDB" in its HTML. Testing the HTTP status code does not suffice: if BridgeDB is down, the reverse proxy may still respond.
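The idea behind such a content check can be sketched in Python (the expected string and the check structure follow the description above; the function names are illustrative):

```python
import urllib.request

def content_ok(status: int, body: str, needle: str) -> bool:
    # Checking the status code alone is not enough: if BridgeDB is down,
    # its Apache reverse proxy may still answer, so also require that the
    # expected string appears in the page body.
    return status == 200 and needle in body

def check_page(url: str, needle: str) -> bool:
    # Unlike sysmon, urlopen follows 301/302 redirects by default.
    with urllib.request.urlopen(url, timeout=10) as resp:
        return content_ok(resp.status, resp.read().decode("utf-8", "replace"), needle)
```

For example, `check_page("https://bridges.torproject.org/", "BridgeDB")` would fail both when the page is unreachable and when the proxy answers with an error page that lacks the expected string.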
I wonder if prometheus could also help us with legacy/trac#12802 (moved) by sending an email to bridges@tp.o and making sure that it responds with at least one bridge?
Checklist:
monitor services in Nagios: BridgeDB, Snowflake, and GetTor
deploy Prometheus's "blackbox exporter" for default bridges, which are external services
delegate to (and train) the anti-censorship team the blackbox exporter configuration
experiment with Prometheus's "alertmanager", which can send notifications if a monitoring target goes offline
grant the anti-censorship team access to Prometheus's grafana dashboard.
1 of 5 checklist items completed
There are a few things you are asking for that we might be able to do with prometheus, and some that we can't do at the moment.
For example, we cannot send an email and parse the reply, because Prometheus only scrapes HTTP endpoints. Also, we are not doing alerting yet, only monitoring.
There is also another aspect to consider: in the case of a service like gettor, monitoring the https endpoint will only tell us about the static HTML we are serving with apache. GetTor itself (the service that sends emails) is a Twisted service.
Maybe we can consider an approach in which services expose an http endpoint that we can use to know that the service is alive. Otherwise I think we could do some other monitoring via nagios checks.
> There is also another aspect to consider: in the case of a service like gettor, monitoring the https endpoint will only tell us about the static HTML we are serving with apache. GetTor itself (the service that sends emails) is a Twisted service.
Gotcha. We have a similar problem with BridgeDB because it is exposed over an Apache reverse proxy and you cannot directly talk to BridgeDB. However, if BridgeDB is down, bridges.torproject.org responds with an internal server error if I remember correctly, so we can still monitor BridgeDB despite the reverse proxy, right?
To monitor BridgeDB, we need to set up an exporter, right?
> Maybe we can consider an approach in which services expose an http endpoint that we can use to know that the service is alive. Otherwise I think we could do some other monitoring via nagios checks.
I think we already have that for BridgeDB and snowflake's website but not for GetTor.
> > There is also another aspect to consider: in the case of a service like gettor, monitoring the https endpoint will only tell us about the static HTML we are serving with apache. GetTor itself (the service that sends emails) is a Twisted service.
> Gotcha. We have a similar problem with BridgeDB because it is exposed over an Apache reverse proxy and you cannot directly talk to BridgeDB. However, if BridgeDB is down, bridges.torproject.org responds with an internal server error if I remember correctly, so we can still monitor BridgeDB despite the reverse proxy, right?
Should, yes.
> To monitor BridgeDB, we need to set up an exporter, right?
In Prometheus, yes. This could be a simple configuration in a "blackbox exporter":
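Something along these lines (a sketch; the module name and regexp are assumptions, and the exact option names depend on the blackbox exporter version):

```yaml
# blackbox exporter module that fails unless the page serves expected content
modules:
  http_bridgedb:
    prober: http
    timeout: 10s
    http:
      valid_status_codes: [200]
      fail_if_body_not_matches_regexp:
        - "BridgeDB"
```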
> > Maybe we can consider an approach in which services expose an http endpoint that we can use to know that the service is alive. Otherwise I think we could do some other monitoring via nagios checks.
> I think we already have that for BridgeDB and snowflake's website but not for GetTor.
From what I can tell, we check bridges.torproject.org:
```yaml
- name: bridges.tpo web service
  nrpe: "/usr/lib/nagios/plugins/check_http -H bridges.torproject.org -S --string=bridge"
  hosts: polyanthum
  depends: network service - https
```
gettor-01.torproject.org (the service should respond to emails; hiro already worked on this)
Note that the strings that should be present in the respective pages are mere suggestions. Ultimately, we just need a test that guarantees that these pages are correctly serving content.
Awesome summary, thanks. I turned that into a checklist and assigned the ticket to hiro, who I think will handle followup on this. hiro, let me know if you need help or if any of this is incorrect...
Trac: Status: new to assigned; Owner: tpa to hiro; Description: updated to add the checklist.
Thanks! I checked the grafana box on our todo list in the ticket description because we now have access to it.
I see that BridgeDB is already being monitored. Are we able to add our own targets to Prometheus?
Trac: Description: updated to mark the grafana checklist item as completed.
I can give you access to the machine and we can think of a way to do this, but it would be better if you could pass me the targets and I add them in puppet directly. How does that sound?
> I can give you access to the machine and we can think of a way to do this, but it would be better if you could pass me the targets and I add them in puppet directly. How does that sound?
Hmm, ok. Note that the entire reason for filing legacy/trac#32679 (moved) was that I wanted our team to have control over the list of monitoring targets, so we don't have to block on others. But we can go with your plan for now and see how it goes.
The list of default bridges is available in a table on this wiki page. Please ignore the last two rows in the table, 0.0.2.0:2 and 0.0.3.0:1; these are pseudo IP addresses.
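The scrape side could look roughly like the following sketch (the job name, exporter address, and module are assumptions; the bridge address is one example from this thread):

```yaml
# Prometheus scrape job probing default bridges through the blackbox
# exporter's tcp_connect module, using the standard relabeling pattern
scrape_configs:
  - job_name: default-bridges
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets:
          - 146.57.248.225:22   # example default bridge address
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115   # where the blackbox exporter listens
```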
Hi phw,
This is all configured now. It is quite quick for us to add targets and, as I mentioned, maybe we can give up on using puppet for this and just let you edit the configuration file directly. Let's see how it goes.
> This is all configured now. It is quite quick for us to add targets and, as I mentioned, maybe we can give up on using puppet for this and just let you edit the configuration file directly. Let's see how it goes.
Thanks!
I took a look at the Grafana dashboard and found it difficult to interpret the data. For example, 146.57.248.225:22 is currently offline and the panels don't reveal that. I understand that one can add panels (I think I would like an "Alert List") but I'm struggling with creating one.
I would like something similar to the following UI. Is this something you can help with?
This is indeed a complex panel to create! I managed to make one using "singlestat" - I couldn't figure out how to make the "alert list" thing work - but it's kind of clunky:
Now, after asking on #prometheus (freenode), I was told there's a Grafana plugin specifically for that purpose. It's really heavy on the JavaScript, but it seems to actually work and provides a much better visualization. Here's the dashboard I created with the plugin:
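Whatever panel is used, the data it visualizes boils down to the blackbox exporter's standard `probe_success` metric; a query listing offline targets would be something like the following (the job label is an assumption):

```
probe_success{job="default-bridges"} == 0
```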
I am ok with this if people are happy with the result. I will add it to puppet.
The blackbox-target-availability plugin looks great and solves this problem. However, our default bridges aren't all down (only 146.57.248.225 is, as of 2020-04-27), so there seems to be an error with the blackbox exporter?
In the meantime, we've set up a monit instance on my VPS, which now monitors all of our anti-censorship infrastructure. Frankly, this works better for us than prometheus: it's simple, effective, and we control it. There's some merit in having prometheus monitor our infrastructure, but given that the sysadmin team is stretched thin, I'm inclined to close this ticket as a "wontfix".