rewrite exporter more simply (!19) · Merge requests · The Tor Project / Network Health / Metrics / Monitoring and Alerting

anarcat requested to merge anarcat/monitoring-and-alerting:counter into main Aug 28, 2023

This is basically a rewrite, but the main change is that it removes the "status" label completely and assumes the alerting framework or dashboards will process the values and interpret them as they choose.

This, essentially, removes the business logic from the exporter completely and dumbs it down to a minimum of writing numbers to a file.

We originally suggested turning this into a "counter", but it turns out the convention in Prometheus is to track things such as "last update" as a gauge holding a UNIX timestamp, with which you can do things like:

changes(process_start_time_seconds[1h])

We also exit instead of showing errors in the metrics stream. The prometheus_client library doesn't clearly show how to report errors, but exiting seems better than contaminating the metrics samples with garbage. Error conditions, in other words, are better checked out of band than here.

Finally, we skip the use of the filestat script altogether. It seems like it only does a stat on the latest file. Here's the full script:

    files=(/srv/tordnsel.torproject.org/lists/*)
    filename="${files[${#files[@]}-1]}"
    echo $(($(date +%s) - $(date +%s -r "$filename")))

The "last file in that list" (which is what that second line does) is basically always the file named latest, so this can be shortened to this in Python:

    time.time() - os.stat("/srv/tordnsel.torproject.org/lists/latest").st_mtime

And, since this is a gauge and we don't need to bother with extra complexity, we can just track the UNIX timestamp directly, so we just keep the mtime as a float.
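As a rough illustration of the idea (not the code in this merge request, which uses prometheus_client), here is a stdlib-only sketch that stats a file, exits on error instead of emitting a bogus sample, and writes the mtime as a gauge in the Prometheus text format. The function names, the HELP text and the temporary-file handling are assumptions made for the example.

```python
import os
import sys
import tempfile

METRIC_NAME = "exits_list_timestamp_seconds"  # metric name from this merge request


def render_metric(path):
    """Return the file's mtime as a gauge in Prometheus text format."""
    mtime = os.stat(path).st_mtime  # float, seconds since the UNIX epoch
    return (
        f"# HELP {METRIC_NAME} Modification time of the latest exit list\n"
        f"# TYPE {METRIC_NAME} gauge\n"
        f"{METRIC_NAME} {mtime}\n"
    )


def write_metric(source, output):
    """Write the metric for `source` to `output`, replacing it atomically."""
    try:
        text = render_metric(source)
    except OSError as exc:
        # Exit rather than contaminate the metrics file with garbage.
        sys.exit(f"cannot stat {source}: {exc}")
    directory = os.path.dirname(os.path.abspath(output))
    # Write to a temporary file in the same directory, then rename over the
    # target so a reader never sees a half-written file.
    with tempfile.NamedTemporaryFile("w", dir=directory, delete=False) as tmp:
        tmp.write(text)
    os.chmod(tmp.name, 0o644)  # temporary files default to mode 0600
    os.replace(tmp.name, output)
```

It would be called along the lines of write_metric("/srv/tordnsel.torproject.org/lists/latest", output_path), where output_path is wherever the metrics file is expected to live (not specified here).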

Once this is merged, the metric will change from:

exits_list_last_updated_in_minutes{status="DELAYED"} 62

To:

exits_list_timestamp_seconds 1693252379

And then a query like:

changes(exits_list_timestamp_seconds[1h])

... will show how many times it has changed in the last hour, for example. To show the equivalent of the previous metric (age in minutes), you would use:

(time() - exits_list_timestamp_seconds)/60
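To make the two queries concrete, here is a small Python sketch of the arithmetic they perform. It only mimics the PromQL functions over a plain list of samples; the sample values other than 1693252379 are made up for illustration.

```python
def changes(samples):
    """Count how many times consecutive samples differ, like PromQL changes()."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)


def age_minutes(now, timestamp):
    """Equivalent of (time() - exits_list_timestamp_seconds)/60."""
    return (now - timestamp) / 60


# Hypothetical scrapes over one hour: the list was rewritten twice.
samples = [1693252379, 1693252379, 1693254179, 1693254179, 1693255979]
print(changes(samples))  # 2

# 62 minutes after the timestamp, the old metric would have read 62.
print(age_minutes(1693252379 + 62 * 60, 1693252379))  # 62.0
```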

Closes: #32 (closed)

Edited Aug 28, 2023 by anarcat
