Generate CSV of client rendezvous polls by country
Parse rendezvous method stats to generate a CSV of client polls by country and method.
Totals are presented in the CSV with country "total". This is maybe a weird way of doing things because then countries equal to "total" potentially have to be filtered out.
Here's what a graph using this data looks like:
mentioned in issue tpo/anti-censorship/pluggable-transports/snowflake#40464 (closed)
- Resolved by David Fifield
In the process of cleaning up some strange artifacts in a previous version of the above plot, I noticed a problem with the way that coverage, as it's applied in practice, interacts with counts of country-based stats. This is most visible in the client rendezvous method stats, but may also apply to the proxy-country snowflake-ips counts as well.
We have to be careful. Country codes may not appear in all snowflake descriptors due to there being no users from that country. So, for example, in the CSV generated by this script, I see
2025-05-29,sv,sqs,8,0.26
because there were 8 or fewer client polls using sqs from sv in the metrics window from 2025-05-29 17:40:03 to 2025-05-30 17:40:03, but no sqs polls from sv in the previous metrics window from 2025-05-28 17:40:03 to 2025-05-29 17:40:03.
My feeling is that in this case it is more accurate to represent this as
2025-05-29,sv,sqs,8,1
in the CSV because we can interpret the absence of a country code as a count of zero. But, it's going to be more difficult to implement.
Edited by Cecylia Bocovich
added 1 commit
- fae9b9fd - Set coverage to highest coverage seen that day
added 10 commits
- 94a580f2 - Generate CSV of client rendezvous polls by country
- 32429346 - Generate client-polls.csv.
- 3342bae5 - Set coverage to highest coverage seen that day
- 2d147ad7 - Generate client-polls.csv.
- 439c00ba - Multiply count by frac_int
- 883863ad - Generate client-polls.csv.
- 19a0286d - Makefile targets for client_polls/snowflakes-*.client_polls.csv.
- d8f07a26 - Deduplicate code in parsing rendezvous method logic.
- 888a6e6c - Use null remainder row in client-polls, not a "total" row.
- 256cb0f3 - Generate client-polls.csv.
- Resolved by David Fifield
The descriptor lines use ampcache (e.g. client-ampcache-count, client-amp-count), but the CSV output uses amp in the rendezvous_method column. It's the only case where the string doesn't match between the input descriptors and the output CSV. Is it intended? At cohosh/snowflake-graphs@d8f07a26 I added a mapping to make the name change explicit.
I pushed some additional commits.
cohosh/snowflake-graphs@d8f07a26 deduplicates common logic when the only thing that differs across descriptor lines is the rendezvous method name.
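To illustrate the kind of deduplication described here, the following is a minimal sketch, not the code from snowflake-graphs. The names METHOD_NAMES, LINE_RE, and parse_client_lines are hypothetical, and the sketch assumes every method has a client-METHOD-count and a client-METHOD-ips line of the same shape.

```python
import re

# Hypothetical mapping from the method name used in descriptor keywords to the
# name written to the rendezvous_method column. Only ampcache is renamed.
METHOD_NAMES = {"http": "http", "ampcache": "amp", "sqs": "sqs"}

# Matches keywords such as client-http-count or client-sqs-ips.
LINE_RE = re.compile(r"^client-(?P<method>[a-z]+)-(?P<kind>count|ips)$")


def parse_client_lines(lines):
    """Return {method: {"count": int or None, "ips": {cc: int}}}."""
    result = {}
    for line in lines:
        keyword, _, rest = line.partition(" ")
        m = LINE_RE.match(keyword)
        if m is None or m.group("method") not in METHOD_NAMES:
            continue  # not a per-method client line
        method = METHOD_NAMES[m.group("method")]
        entry = result.setdefault(method, {"count": None, "ips": {}})
        if m.group("kind") == "count":
            entry["count"] = int(rest)
        else:
            # An empty list of mappings yields no pairs.
            for pair in filter(None, rest.split(",")):
                cc, _, num = pair.partition("=")
                entry["ips"][cc.lower()] = int(num)
    return result


stats = parse_client_lines([
    "client-http-count 16",
    "client-http-ips RU=8,IR=8",
    "client-ampcache-count 8",
    "client-ampcache-ips SV=8",
])
```

With a single table of names, the ampcache to amp rename is visible in one place, and changing the output name later would be a one-line edit.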
cohosh/snowflake-graphs@888a6e6c gets rid of the "total" rows and instead follows the convention of the other metrics: a row with a blank entry representing the difference between the sum of the other rows and the known total. However, there's a problem: these difference rows are coming out negative. What this means is that, for example, client-http-count is less than the sum of client-http-ips, sometimes by a wide margin. Are we sure that client-http-ips
really represents a total? Here's an example descriptor where the difference is large:
snowflake-stats-end 2025-03-14 17:02:02 (86400 s)
client-http-count 1208040
client-http-ips RU=953792,IR=344704,CN=265784,US=162864,DE=78752,NL=63712,GB=30616,IN=27120,FR=23216,PL=18000,BR=16160,TR=16024,BY=15536,CA=14128,FI=13536,JP=12736,PK=11952,SA=10888,EG=10536,IT=9912,AU=9496,UA=7752,ES=7216,HK=6616,ID=6520,AE=6384,MX=6216,KR=5808,SE=5696,ZA=5008,KE=4784,CH=4776,CZ=4760,SG=4304,RO=4144,NG=4128,NO=3976,AT=3920,TH=3600,EC=3552,LV=3280,TN=3016,BG=2800,MY=2672,HU=2512,LT=2496,KZ=2376,BE=2352,PT=2232,DK=2232,JO=2144,PE=2112,BD=2000,TW=1976,PH=1920,IL=1752,VN=1736,IE=1704,AR=1648,CL=1568,EE=1464,LI=1448,GR=1424,NZ=1360,AZ=1344,DZ=1272,RS=1144,LU=1096,TZ=1056,HR=1008,BZ=968,GH=952,CM=864,CO=864,MD=864,CI=848,CD=848,UZ=840,MA=792,NP=776,QA=736,GE=712,ET=656,IQ=624,PA=584,MM=560,AL=544,MU=536,SK=528,YE=528,VE=488,ZM=472,SY=472,LK=464,AQ=456,DO=456,MZ=432,UG=424,KW=416,MG=376,PR=376,GN=368,IS=360,HN=344,BI=328,PY=328,SD=320,AF=296,JM=296,OM=280,AM=272,BH=272,MT=264,LY=256,MN=224,LB=216,KG=216,RW=200,SN=192,GT=184,PS=176,NC=168,KH=168,SS=160,MK=152,BW=152,BO=152,CU=144,ME=136,EU=136,SI=128,BN=120,SL=120,TT=112,AO=112,NA=104,SV=96,AP=96,TJ=96,DJ=96,BA=88,CY=80,RE=72,BS=72,AG=64,NF=56,GP=56,CR=56,MV=56,SR=56,SB=48,MW=48,BM=40,AD=40,ML=32,UY=32,SC=32,??=32,TG=32,SO=24,ZW=24,GL=16,CW=16,GD=16,TM=16,PG=16,GY=16,MR=16,NI=16,VC=8,GM=8,PF=8,BJ=8,KY=8,BB=8,GF=8
Adding up just the RU and IR counts already takes us to 1298496, which exceeds client-http-count. The difference lines for 2025-03-14 come out to:
2025-03-14,,amp,-410528.74,1.00
2025-03-14,,http,-1042131.62,1.00
2025-03-14,,sqs,-16311.51,1.00
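The arithmetic behind the RU and IR observation can be checked directly. This sketch uses only the two largest entries from the descriptor above. It does not try to reproduce the -1042131.62 figure, which depends on all countries and on whatever per-day weighting the script applies.

```python
# Values copied from the 2025-03-14 descriptor quoted above.
client_http_count = 1208040
partial_ips = {"RU": 953792, "IR": 344704}  # first two entries only

partial_sum = sum(partial_ips.values())
# A remainder row is "known total minus sum of the other rows".
remainder = client_http_count - partial_sum

print(partial_sum)  # 1298496
print(remainder)    # -90456
```

Two countries alone already push the remainder below zero, so the negative rows cannot be explained by rounding on this date.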
We have to be careful. Country codes may not appear in all snowflake descriptors due to there being no users from that country.
This is a known issue, #3. Properly fixing it would, I think, take writing auxiliary files with just the beginning and end timestamps of each descriptor. The approach of taking the maximum of other countries' values seen for that day is a reasonable heuristic, but I don't think it's quite right. It interacted badly, in some cases, with the former practice of recording a "total" row, because those rows had a coverage of over 1.00 in some cases. E.g. in https://gitlab.torproject.org/cohosh/snowflake-graphs/-/blob/883863ad1f8fef982c3657b01df28afd20ee9637/client-polls.csv there was
2024-10-31,total,amp,747896,1.37
2024-10-31,total,http,4135264,1.37
which then caused all other rows for that day to have a coverage of 1.37. The problem might have gone away with cohosh/snowflake-graphs@888a6e6c and cohosh/snowflake-graphs@256cb0f3.
One way could be to make a preliminary pass over the data to record the set of all observed countries, then fill in zero-valued rows for every country on every day that is not naturally represented.
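A minimal sketch of that two-pass idea, using only the standard library. The tuple layout (date, country, method, count) and the function name fill_zero_rows are assumptions for illustration. It only fills in dates that already appear in the data, and it does not address the separate question of how coverage should be computed.

```python
from itertools import product


def fill_zero_rows(rows):
    """Add a zero-count row for every (date, country, method) not present.

    rows: iterable of (date, country, method, count) tuples.
    """
    rows = list(rows)
    # First pass: record what was observed.
    seen = {(d, c, m): n for d, c, m, n in rows}
    dates = sorted({d for d, _, _, _ in rows})
    countries = sorted({c for _, c, _, _ in rows})
    methods = sorted({m for _, _, m, _ in rows})
    # Second pass: emit the full grid, defaulting to zero.
    return [
        (d, c, m, seen.get((d, c, m), 0))
        for d, c, m in product(dates, countries, methods)
    ]


rows = [
    ("2025-05-28", "ru", "sqs", 16),
    ("2025-05-29", "ru", "sqs", 8),
    ("2025-05-29", "sv", "sqs", 8),
]
filled = fill_zero_rows(rows)
```

Here sv gets an explicit zero row for 2025-05-28, so a later per-day computation sees every country on every day.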
There's a contradiction in the description of how coverage is treated in the snowflake-paper source code. https://github.com/turfed/snowflake-paper/commit/ccb19c3736c72255b7ba2f7bd1a90fb18dd2b498 explicitly says that the proxy graphs do not compensate for coverage < 1.0, but the code itself does compensate. In any case, the commit message gives a reason for not adjusting by coverage (unique IP addresses do not scale linearly with partial days) that we should consider before making other adjustments.
mentioned in merge request tpo/anti-censorship/pluggable-transports/snowflake!574 (merged)
- Resolved by David Fifield
I was thinking it would be good to use a more descriptive name than "count" for the output column of client_polls. Like we're using unique_ips in proxy_country.
But then I wondered, what does the quantity really represent? Descriptor fields like client-http-ips seem to indicate that what is being counted is unique IP addresses. But the documentation in broker-spec.txt says nothing about IP address, only a count of the number of polls.
"client-http-count" NUM NL
    [At most once.]
    A count of the number of times a client has requested a proxy using the HTTP rendezvous method from the broker, rounded up to the nearest multiple of 8.
"client-http-ips" [CC=NUM,CC=NUM,...,CC=NUM] NL
    [At most once.]
    List of mappings from two-letter country codes to the number of times a client has requested a proxy using the HTTP rendezvous method, rounded up to the nearest multiple of 8. Each country code only appears once.
What is it really? A raw count of polls, or the number of unique IP addresses that have made a poll? What would be a good name for the output column? num_polls?
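Whatever the quantity turns out to be, the rounding rule in the quoted spec text has a visible effect on the numbers. The sketch below uses made-up raw counts and assumes the total and each per-country value are rounded up independently. That assumption is mine, not something the spec excerpt states.

```python
def round_up_to_8(n):
    """Round a non-negative integer up to the nearest multiple of 8."""
    return (n + 7) // 8 * 8


# Hypothetical raw poll counts per country (not real data).
raw = {"ru": 13, "ir": 5, "sv": 1}

# The total is binned once; each country is binned separately.
reported_total = round_up_to_8(sum(raw.values()))
reported_by_country = {cc: round_up_to_8(n) for cc, n in raw.items()}
sum_of_countries = sum(reported_by_country.values())

print(reported_total)    # 24
print(sum_of_countries)  # 32
```

Under that assumption the per-country sum can exceed the reported total by up to 7 per country listed, purely from rounding.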
mentioned in issue tpo/network-health/metrics/collector#40053
- Resolved by David Fifield
@dcf would you like me to make the changes suggested here? I see you've already made changes to this MR. I'm happy to pick it back up again, or hand it off to you if that's what you prefer.
mentioned in merge request tpo/anti-censorship/pluggable-transports/snowflake!577 (merged)
added 24 commits
- 256cb0f3...fd93fa18 - 14 commits from branch dcf:main
- f1692389 - Generate CSV of client rendezvous polls by country
- 21a13dce - Generate client-polls.csv.
- 3141140a - Set coverage to highest coverage seen that day
- 8b3d4658 - Generate client-polls.csv.
- e7ea7cab - Multiply count by frac_int
- e9752f75 - Generate client-polls.csv.
- 1bb6ac20 - Makefile targets for client_polls/snowflakes-*.client_polls.csv.
- d3ce7a2f - Deduplicate code in parsing rendezvous method logic.
- 890d4341 - Use null remainder row in client-polls, not a "total" row.
- d99ed06d - Generate client-polls.csv.
- Resolved by David Fifield
Just rebased the current commits onto main. Now I'm going to address the above feedback:
- remove the coverage column from the client-polls CSV
- for descriptors between 2024-01-31 and 2024-03-20, keep the reported total counts, knowing they are undercounted.
- for descriptors published between 2024-03-21 and 2025-06-24, use the sum of the per-country counts as the totals
- for descriptors after 2025-06-25, add a blank row to make up the discrepancy between the per-country sum and the new total
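The date-dependent part of this plan could be sketched roughly as follows. The constant and function names are hypothetical, and treating each boundary date as inclusive (with the last range starting on 2025-06-25) is an assumption. The plan does not spell out which rows are written in the first range, so the sketch only chooses the total and derives the remainder from it.

```python
import datetime

# Boundary dates taken from the plan above; inclusiveness is an assumption.
UNDERCOUNT_END = datetime.date(2024, 3, 20)
COUNT_FIX = datetime.date(2025, 6, 25)


def choose_total(date, reported_total, per_country):
    """Pick the total to use for one descriptor and one rendezvous method."""
    if date <= UNDERCOUNT_END:
        # Reported totals are known to be undercounted, but are kept as-is.
        # Dates before 2024-01-31 are not covered by the plan and also land
        # here (assumption).
        return reported_total
    if date < COUNT_FIX:
        # Reported total is not trusted; use the per-country sum instead.
        return sum(per_country.values())
    # Reported total is trusted again.
    return reported_total


def remainder(date, reported_total, per_country):
    """Value for the blank-country row: chosen total minus per-country sum."""
    total = choose_total(date, reported_total, per_country)
    return total - sum(per_country.values())


example = remainder(datetime.date(2025, 7, 1), 24, {"ru": 16, "ir": 16})
```

In the middle range the remainder is zero by construction, so a blank row only carries information from the last range onward.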
added 10 commits
- 0883728c - Generate CSV of client rendezvous polls by country
- 42785c98 - Generate client-polls.csv.
- 5887c2c5 - Set coverage to highest coverage seen that day
- ef5302eb - Generate client-polls.csv.
- 3a8ea815 - Multiply count by frac_int
- 2a003571 - Generate client-polls.csv.
- c0b2cc3c - Makefile targets for client_polls/snowflakes-*.client_polls.csv.
- 205c2379 - Deduplicate code in parsing rendezvous method logic.
- efde0ed9 - Use null remainder row in client-polls, not a "total" row.
- 01f5cdcc - Generate client-polls.csv.
added 2 commits
assigned to @cohosh
added 9 commits
- abd8819a - Generate client-polls.csv.
- 7cdf0577 - Multiply count by frac_int
- 2b23b377 - Generate client-polls.csv.
- 2a8f4a95 - Makefile targets for client_polls/snowflakes-*.client_polls.csv.
- c3a9fdc9 - Deduplicate code in parsing rendezvous method logic.
- f62f5f96 - Use null remainder row in client-polls, not a "total" row.
- 61029bf9 - Generate client-polls.csv.
- 63dd6703 - Increase accuracy of client poll totals
- 2a6365c2 - Generate client-polls.csv
Okay, I removed the commit that messed with the coverage and implemented the plan above, and it's ready for another look.
I did realize that we're still getting None country rows that are negative because of my logic in !1 (comment 3216220): Due to binning to multiples of 8 for each country code, the sum of the per-country client poll counts is going to be higher than the total counts reported. How do you want to handle that? Do we let the graph script figure out how to deal with the negative value?
added 24 commits
- 90bb00ca - 1 commit from branch dcf:main
- 90bb00ca...52484a32 - 13 earlier commits
- 16b70d2d - Use DataFrame.from_records in client-polls.
- ebef7172 - Formatting in client-polls.
- 7d70f15c - Unconditionally write client_poll_path.
- d39b5fa7 - Rename "count" to "num_polls" in client-polls.
- fc8fe237 - Generate client-polls.csv.
- f96503d1 - Remove SNOWFLAKE_CLIENT_IPS_START.
- 00a760f0 - Clean up the logic around SNOWFLAKE_CLIENT_COUNT_FIX.
- 66c95085 - Use the datetime.datetime constructor, no need for strptime.
- 48550866 - Format comment.
- b714623e - Warn about negative discrepancy rows in client-polls.
mentioned in commit cohosh/snowflake-graphs@50024f97
added 6 commits
- f15987e8 - Remove SNOWFLAKE_CLIENT_IPS_START.
- add3f7ab - Clean up the logic around SNOWFLAKE_CLIENT_COUNT_FIX.
- dc34a7bc - Use the datetime.datetime constructor, no need for strptime.
- b7fc4735 - Format comment.
- 50024f97 - Warn about negative discrepancy rows in client-polls.
- 4a373b14 - Generate client-polls.csv.
mentioned in commit cohosh/snowflake-graphs@2ed710c1
mentioned in commit cohosh/snowflake-graphs@a8cc8056
added 9 commits
- 984f06e2 - Formatting for client-polls.
- 96014d0c - Unconditionally write client_poll_path.
- a8cc8056 - Rename "count" to "num_polls" in client-polls.
- ceb5b387 - Remove SNOWFLAKE_CLIENT_IPS_START.
- 6ededdc9 - Clean up the logic around SNOWFLAKE_CLIENT_COUNT_FIX.
- 9241edd3 - Use the datetime.datetime constructor, no need for strptime.
- 863882c2 - Format comment.
- 2ed710c1 - Warn about negative discrepancy rows in client-polls.
- ada255a5 - Generate client-polls.csv.
mentioned in commit bd1f24ec
I made some further code style and other changes, including changing count to num_polls as discussed in !1 (comment 3217028). Comparison if you want to see the changes.
Then I pushed a squashed commit bd1f24ec to main.
All done! Thanks!
Okay wait, I'm having second thoughts about amp versus ampcache (!1 (comment 3215671)). Currently it's amp in the output CSV files, but that's being done with a special-case mapping, as it's the only case where the output rendezvous method name doesn't match the string used in descriptors. What do you think? Should it be ampcache instead?
Thanks, opened here: !2 (closed)
I did realize that we're still getting None country rows that are negative because of my logic in !1 (comment 3216220): Due to binning to multiples of 8 for each country code, the sum of the per-country client poll counts is going to be higher than the total counts reported. How do you want to handle that? Do we let the graph script figure out how to deal with the negative value?
Are you okay with the negative values in the None rows?
Sorry, I thought I had commented on the negative row. Yes, I think it's fine. I added a comment in b714623e. If someone wants to graph the total, i.e., reconstruct what client-$METHOD-count would have been in the descriptor, then they do the natural thing and sum num_polls grouped by date. If they want it broken down by country, they can either ignore the null rows, or take the sum as before and proportionally distribute it according to the ratios of the per-country rows. It doesn't make much difference either way, as we know it is literally rounding error.
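Both options can be sketched with a few made-up rows. The tuple layout (date, country, rendezvous_method, num_polls), the use of None for the null country, and the function names are assumptions for illustration. The sketch groups by method as well as date so that different rendezvous methods are not mixed.

```python
from collections import defaultdict

# Hypothetical rows, not real data. The None row is the remainder.
rows = [
    ("2025-07-01", "ru", "http", 16),
    ("2025-07-01", "ir", "http", 8),
    ("2025-07-01", "sv", "http", 8),
    ("2025-07-01", None, "http", -8),
]


def totals(rows):
    """Sum num_polls per (date, method), including the null-country row."""
    out = defaultdict(int)
    for date, _country, method, n in rows:
        out[(date, method)] += n
    return dict(out)


def distribute(rows):
    """Spread each total over countries in proportion to their own rows."""
    tot = totals(rows)
    country_sum = defaultdict(int)
    for date, country, method, n in rows:
        if country is not None:
            country_sum[(date, method)] += n
    result = {}
    for date, country, method, n in rows:
        if country is None:
            continue
        key = (date, method)
        denom = country_sum[key]
        result[(date, country, method)] = tot[key] * n / denom if denom else 0.0
    return result


print(totals(rows))      # {('2025-07-01', 'http'): 24}
print(distribute(rows))  # ru 12.0, ir 6.0, sv 6.0
```

Ignoring the null row instead would leave ru, ir, and sv at 16, 8, and 8. The relative shares are the same in both cases.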