rewrite exporter more simply
This is basically a rewrite, but the main change is that it removes the "status" label completely and assumes the alerting framework or dashboards will process the values and interpret them as they choose.
This essentially removes the business logic from the exporter and reduces it to the bare minimum: writing numbers to a file.
We originally suggested turning this into a "counter", but it turns out the convention in Prometheus is to track things such as "last update" as a gauge holding a UNIX timestamp, with which you can do things like:
changes(process_start_time_seconds[1h])
We also exit instead of showing errors in the metrics stream. The prometheus_client library doesn't clearly show how to do this, but it seems better than contaminating the metrics samples with garbage. In other words, error conditions are better checked out of band than here.
Finally, we skip the use of the filestat script altogether. It seems like it only does a stat on the latest file. Here's the full script:
files=(/srv/tordnsel.torproject.org/lists/*)
filename="${files[${#files[@]}-1]}"
echo $(($(date +%s) - $(date +%s -r "$filename")))
The "last file in that list" (which is what the second line selects) is basically always the file named latest, so this can be shortened to this in Python:
time.time() - os.stat("/srv/tordnsel.torproject.org/lists/latest").st_mtime
And, since this is a gauge and we don't need to bother with extra complexity, we can just track the UNIX timestamp directly, so we just keep the mtime as a float.
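As a rough illustration of the approach described above, here is a minimal sketch that reads the mtime, exits on error instead of emitting a sample, and writes a single gauge in the Prometheus text format. It deliberately uses only the standard library rather than prometheus_client, so it is not the code in this merge request; the function names `read_mtime` and `write_gauge`, the help text, and the output path in the usage comment are all hypothetical.

```python
import os
import sys
import tempfile


def read_mtime(path):
    """Return the file's mtime as a float UNIX timestamp.

    On failure, exit with a message on stderr instead of writing
    anything to the metrics file (errors are checked out of band).
    """
    try:
        return os.stat(path).st_mtime
    except OSError as exc:
        sys.exit(f"cannot stat {path}: {exc}")


def write_gauge(name, help_text, value, output):
    """Write one gauge sample to `output` in the Prometheus text format."""
    body = (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name} {value!r}\n"
    )
    # Write to a temporary file in the same directory, then rename it,
    # so a scraper never sees a half-written file.
    directory = os.path.dirname(os.path.abspath(output))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        handle.write(body)
    os.chmod(tmp, 0o644)  # mkstemp creates the file as 0600
    os.replace(tmp, output)


# Hypothetical usage (the output path is an assumption):
# mtime = read_mtime("/srv/tordnsel.torproject.org/lists/latest")
# write_gauge("exits_list_timestamp_seconds",
#             "Modification time of the latest exit list",
#             mtime, "/tmp/exits_list.prom")
```

Keeping the value as the raw mtime means the exporter holds no thresholds or status strings; any interpretation happens in queries or alert rules.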
Once this is merged, the metric will change from:
exits_list_last_updated_in_minutes{status="DELAYED"} 62
To:
exits_list_timestamp_seconds 1693252379
And then a query like:
changes(exits_list_timestamp_seconds[1h])
... will show how many times it has changed in the last hour, for example. To show the equivalent of the previous metric (age in minutes), you would use:
(time() - exits_list_timestamp_seconds)/60
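For readers less familiar with PromQL, the arithmetic in that last query can be sketched in plain Python. The helper name `age_minutes` is hypothetical, and the "now" value below is chosen only to reproduce the 62 minutes from the old sample.

```python
def age_minutes(timestamp, now):
    """Age of the list in minutes, mirroring (time() - metric) / 60."""
    return (now - timestamp) / 60


# The new sample value from above, evaluated 62 minutes later.
sample = 1693252379
print(age_minutes(sample, sample + 62 * 60))  # 62.0
```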
Closes: #32 (closed)