Bridge usage statistics on metrics website are broken
The graph of bridge users from all countries recently jumped from 10,000 to 50,000. There was no event that could explain this increase, so I looked for a possible bug.
Here's the bug: when we aggregate bridge users per day, we write single observations to a file with lines like this:
bridge,date,time,??,a1,a2,...,all
0007BC3A0CFC768DB2FA1E3EB6FB4ABF4EBE2D13,2012-05-24,07:12:18,NA,1.12,NA,...,30.55
In the next step we aggregate these lines by summing up all observations of a given day.
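To make the aggregation step concrete, here is a minimal sketch of summing the last column ("all") per date. The class and method names are hypothetical, not the actual metrics code, and the sketch assumes that NA values are skipped rather than counted as zero.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not the real metrics code.
class DailyBridgeUsers {

    /**
     * Sums the last column ("all") of every observation line, grouped by date.
     * Assumed line layout: bridge,date,time,<per-country columns>,all
     * Assumption: "NA" means no value and is skipped.
     */
    static Map<String, Double> sumAllByDate(List<String> lines) {
        Map<String, Double> totals = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.split(",");
            // Skip the header line and anything too short to be an observation.
            if (parts.length < 4 || parts[0].equals("bridge")) {
                continue;
            }
            String all = parts[parts.length - 1];
            if (all.equals("NA")) {
                continue;
            }
            // parts[1] is the date; add this observation to that day's total.
            totals.merge(parts[1], Double.parseDouble(all), Double::sum);
        }
        return totals;
    }
}
```

Because every observation of a day is simply added up, any lines missing from the single-observations file silently change the daily totals, which is why a truncated file goes unnoticed at this stage.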
Turns out the file with single observations was truncated and we didn't notice. Whenever we add lines to that file, we read it into memory, add the new observations, and write the whole file back to disk. The file is always kept ordered by bridge fingerprint. Here's the distribution of bridge fingerprints in the file, grouped by first hex character:
0 24567
1 24623
2 11687
3 1526
4 1124
5 825
6 1352
7 1422
8 1271
9 1287
A 1336
B 1048
C 1525
D 1227
E 1497
F 994
We would expect roughly the same number of bridges in each bucket. Looks like the file was truncated after writing half of the fingerprints starting with 2. The much smaller counts for 3 through F are presumably observations that were added after the truncation. The truncation could have happened due to Java running out of memory, the server being restarted while writing the file, etc.
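A check like the one behind the distribution above could be sketched as follows. This is only an illustration: the class name, method names, and the ratio threshold are assumptions, not part of the existing code.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sanity check, not the real metrics code.
class FingerprintBuckets {

    /** Counts observation lines by the first hex character of the fingerprint. */
    static Map<Character, Integer> countByFirstHexChar(List<String> lines) {
        Map<Character, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            // The header starts with 'b', which is also a hex digit, so skip it explicitly.
            if (line.isEmpty() || line.startsWith("bridge,")) {
                continue;
            }
            char c = Character.toUpperCase(line.charAt(0));
            if (Character.digit(c, 16) < 0) {
                continue; // not a hex character, so not a fingerprint
            }
            counts.merge(c, 1, Integer::sum);
        }
        return counts;
    }

    /**
     * Returns true if a bucket is missing or the largest bucket is more than
     * maxRatio times the smallest one. The threshold is an assumed tuning knob.
     */
    static boolean looksSkewed(Map<Character, Integer> counts, double maxRatio) {
        if (counts.size() < 16) {
            return true;
        }
        int min = Collections.min(counts.values());
        int max = Collections.max(counts.values());
        return max > maxRatio * min;
    }
}
```

With the counts listed above, the largest bucket (24623) is roughly 30 times the smallest (825), so even a generous threshold would flag the file.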
The quick fix is to aggregate bridge usage statistics again and replace the single-observations file on yatei. I'm going to do that now.
The next fix is to avoid truncating the file by writing to a temp file and replacing the original file with it once we're done writing. I'll look into that next.
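A minimal sketch of that temp-file approach, using java.nio.file, might look like this. The class and method names are hypothetical. It assumes the temp file can live in the same directory as the target, so that the final rename stays on one file system.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

// Hypothetical helper, not the real metrics code.
class SafeFileWriter {

    /**
     * Writes all lines to a temp file next to the target and only then moves
     * it over the target, so a crash mid-write leaves the old file untouched.
     */
    static void writeLinesSafely(Path target, List<String> lines) throws IOException {
        Path dir = target.toAbsolutePath().getParent();
        Path tmp = Files.createTempFile(dir, target.getFileName().toString(), ".tmp");
        try {
            try (BufferedWriter writer = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
                for (String line : lines) {
                    writer.write(line);
                    writer.newLine();
                }
            }
            // Whether ATOMIC_MOVE replaces an existing target is implementation
            // specific; on typical POSIX file systems it does.
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
        } finally {
            // No-op after a successful move; cleans up the temp file on failure.
            Files.deleteIfExists(tmp);
        }
    }
}
```

If the JVM runs out of memory or the server restarts during the write, only the temp file is incomplete and the previous single-observations file is still there. The sketch does not force the data to disk before the move, so it does not by itself protect against power loss.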
The real fix is to stop using flat files for something that requires a database. That's going to take me quite a bit longer.