Aggregating data for the relays-by-country graph has become prohibitively expensive. It keeps the server busy for 2 hours every day, which affects more important tasks like downloading descriptors. That's why I disabled this aggregation step on February 1; the relays-by-country graphs are still available but won't receive new data. The problem is the PostgreSQL-based IP-to-country lookup. I should look into making this lookup much, much faster. Creating this ticket so I don't forget.
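One direction for a faster lookup, sketched under assumptions rather than taken from the existing code: load the IP ranges into memory once, sort them, and resolve each address with a binary search instead of a database query. The class name `CountryLookup`, the `(first_ip, last_ip, country_code)` input shape, and the sample ranges are all hypothetical, and the sketch covers IPv4 only and assumes non-overlapping ranges.

```python
import bisect
import ipaddress


class CountryLookup:
    """In-memory IPv4-to-country lookup over sorted, non-overlapping ranges."""

    def __init__(self, ranges):
        # ranges: iterable of (first_ip, last_ip, country_code) strings.
        rows = sorted(
            (int(ipaddress.IPv4Address(first)), int(ipaddress.IPv4Address(last)), cc)
            for first, last, cc in ranges
        )
        self._starts = [row[0] for row in rows]
        self._ends = [row[1] for row in rows]
        self._codes = [row[2] for row in rows]

    def lookup(self, address):
        """Return the country code for address, or None if no range covers it."""
        ip = int(ipaddress.IPv4Address(address))
        # Index of the last range whose first address is <= ip.
        i = bisect.bisect_right(self._starts, ip) - 1
        if i >= 0 and ip <= self._ends[i]:
            return self._codes[i]
        return None


# Hypothetical sample ranges, not taken from a real GeoIP file.
lookup = CountryLookup([
    ("10.0.0.0", "10.0.0.255", "aa"),
    ("10.0.2.0", "10.0.2.255", "bb"),
])
```

With the table built once per run, `lookup.lookup("10.0.0.17")` resolves without touching the database. Whether this is actually fast enough for the full descriptor archive would still need to be measured.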
Trac:
Type: defect to enhancement
Summary: "Fix and re-enable relays-by-country graph on metrics website" to "Bring back the relays-by-country graph"
Sponsor: N/A to N/A
Severity: N/A to Normal
Reviewer: N/A to N/A
Can a dedicated server allocated only to do this task help to bring back the relays-by-country graphs?
Unfortunately, it's not just a question of hardware. The code used for the blog post is good enough to run once for a blog post, but it needs more work before it can run periodically. Here are a few issues:
Every time this code runs, it processes all descriptors in the in/ directory. In a production environment we'd want it to skip descriptors it has processed before and reuse the aggregations previously computed from them.
Updating geoip files is a manual step. In fact, we're currently using the very same geoip file in a graph covering years of data. We'll need to find a way to automate updating geoip files. And we need to define which geoip file we're using for any given consensus. That last requirement alone is far from trivial if we want to ensure that two people have a chance to independently produce the same graph.
Everything here works with files, but we'll want to use a database, or we'll be sad whenever the server reboots at the wrong moment. And we want the database schema to scale for the next five years.
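To make the first and third issues more concrete, here is a minimal sketch of incremental processing backed by SQLite. It is only an illustration, not a proposed design: the table names, the `count_countries` callback, and the assumption that a file name in in/ uniquely identifies a descriptor are all hypothetical. Each file's aggregates and its "processed" marker are committed in one transaction, so a reboot in the middle of a run should leave a file either fully counted or not counted at all.

```python
import sqlite3
from pathlib import Path


def open_state(db_path):
    """Open (or create) the hypothetical state database."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS processed (name TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS relays_by_country ("
        "date TEXT NOT NULL, country TEXT NOT NULL, relays INTEGER NOT NULL, "
        "PRIMARY KEY (date, country))"
    )
    conn.commit()
    return conn


def process_new_files(conn, in_dir, count_countries):
    """Aggregate only files not seen before; return how many were processed.

    count_countries is a hypothetical callback that parses one file's bytes
    and returns an iterable of (date, country, relays) rows.
    """
    processed = 0
    for path in sorted(Path(in_dir).iterdir()):
        if not path.is_file():
            continue
        seen = conn.execute(
            "SELECT 1 FROM processed WHERE name = ?", (path.name,)
        ).fetchone()
        if seen:
            continue  # Skip descriptors handled in an earlier run.
        rows = count_countries(path.read_bytes())
        # One transaction per file: commits on success, rolls back on error.
        with conn:
            for date, country, relays in rows:
                conn.execute(
                    "INSERT OR IGNORE INTO relays_by_country VALUES (?, ?, 0)",
                    (date, country),
                )
                conn.execute(
                    "UPDATE relays_by_country SET relays = relays + ? "
                    "WHERE date = ? AND country = ?",
                    (relays, date, country),
                )
            conn.execute("INSERT INTO processed (name) VALUES (?)", (path.name,))
        processed += 1
    return processed
```

Running `process_new_files` twice over the same directory should process nothing the second time. Whether a schema like this scales for five years is exactly the open question raised above.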
Aren't these the same GeoIP files as the ones used for Tor metrics currently?
Nonetheless do you think that these issues can be created as separate sub-tickets?
Well, Onionoo uses the latest of these GeoIP files in MaxMind's format. But nothing else in Tor Metrics uses these files. None of this is hard; it's just a couple of substeps that need to be done.
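As an illustration of one such substep, defining which geoip file applies to a given consensus could be as simple as a fixed, documented rule. The rule sketched here is purely an assumption (nothing in this ticket settles on it): use the most recent geoip file published at or before the consensus valid-after time. The function name and the file names in the sample are hypothetical.

```python
import bisect
from datetime import datetime


def select_geoip_file(geoip_files, valid_after):
    """Pick the geoip file for a consensus under the assumed rule.

    geoip_files: list of (published datetime, file name) tuples.
    Returns the name of the most recent file published at or before
    valid_after, or None if every file is newer than the consensus.
    """
    files = sorted(geoip_files)
    published = [when for when, _ in files]
    # Index of the last file with published <= valid_after.
    i = bisect.bisect_right(published, valid_after) - 1
    return files[i][1] if i >= 0 else None


# Hypothetical file names and dates, for illustration only.
sample_files = [
    (datetime(2017, 1, 4), "geoip-2017-01-04"),
    (datetime(2017, 2, 8), "geoip-2017-02-08"),
]
chosen = select_geoip_file(sample_files, datetime(2017, 2, 1, 12, 0))
```

Because the rule depends only on the published dates and the valid-after time, two people with the same set of geoip files would pick the same file for the same consensus.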
Not really. These were just some examples, not a list of things that need to be done to resolve this ticket. I'd like to leave the implementation steps to whoever implements this.