Directory authorities have different opinions on MTBF and WFU

It turns out that directory authorities have very different opinions on relays' MTBF (mean time between failures) and WFU (weighted fractional uptime). As a result, they vote differently on the Guard and Stable flags:

http://freehaven.net/~karsten/metrics/relayflags-2009-04-01.pdf

One reason might be false assumptions about running relays as reflected in the router-stability files. If a relay is running, the corresponding MTBF line contains the starting time of the current run. This starting time is used to include the running session in the MTBF and WFU calculations. An analysis of three router-stability files shows that the authorities think there are between 6K and 24K relays currently running, which is wrong.

These lines are never removed from the router-stability files, so whenever these relays come back, they appear to be uber-stable, which of course they are not.

The problem is that this starting time is only reset to 0 in a few edge cases, via rep_hist_note_router_unreachable() in rephist.c. This function should be called whenever a relay has gone offline, which is of course difficult to know.

As a possible solution, Tor could check during maintenance when each relay was last contacted. If that was more than twice the reachability timeout ago, the relay should be marked as unreachable in rephist.c, too. A simple patch (with some code duplication from dirserv_set_router_is_running() in dirserv.c) would look like this:

```diff
Index: src/or/rephist.c
--- src/or/rephist.c (revision 19341)
+++ src/or/rephist.c (working copy)
@@ -658,6 +658,22 @@
     digestmap_iter_get(orhist_it, &d1, &or_history_p);
     or_history = or_history_p;

+#define DOUBLE_REACHABLE_TIMEOUT (2*45*60)
+    /* If we are an authority, check if this router is still running. */
+    if (authority && !or_history->start_of_run) {
+      char time_buf[ISO_TIME_LEN+1];
+      routerinfo_t *router = router_get_by_digest(d1);
+      if (!router || (router_is_me(router) && we_are_hibernating()) ||
+          (!get_options()->AssumeReachable &&
+          before >= router->last_reachable + DOUBLE_REACHABLE_TIMEOUT)) {
+        format_iso_time(time_buf, before);
+        log_info(LD_DIR, "When cleaning the reputation history at %s, "
+                 "we found that router %s is not running anymore.",
+                 time_buf, hex_str(d1, DIGEST_LEN));
+        rep_hist_note_router_unreachable(d1, before);
+      }
+    }
     /* Now decide if we want to keep it. */
     remove = authority ? (or_history->total_run_weights < STABILITY_EPSILON &&
                           !or_history->start_of_run)
                        : (or_history->changed < before);
```

[Automatically added by flyspray2trac: Operating System: All]