Underestimates coverage for certain rarely occurring values
The coverage column in output files can be too small (less than 1 when it should be 1)
for certain rarely occurring values,
when values of zero are represented by absent rows.
The problem is that coverage is computed row-wise,
and missing rows are interpreted as "missing measurement"
rather than "measured value of 0".
Here is an example.
Notice that snowflake-ips-total ≠ snowflake-ips-badge + snowflake-ips-iptproxy
in just one descriptor, the snowflake-stats-end 2023-01-16 18:00:00 one:
@type snowflake-stats 1.0
snowflake-stats-end 2023-01-15 18:00:00 (86400 s)
snowflake-ips-badge 100
snowflake-ips-iptproxy 200
snowflake-ips-total 300
@type snowflake-stats 1.0
snowflake-stats-end 2023-01-16 18:00:00 (86400 s)
snowflake-ips-badge 100
snowflake-ips-iptproxy 200
snowflake-ips-total 301
@type snowflake-stats 1.0
snowflake-stats-end 2023-01-17 18:00:00 (86400 s)
snowflake-ips-badge 100
snowflake-ips-iptproxy 200
snowflake-ips-total 300
@type snowflake-stats 1.0
snowflake-stats-end 2023-01-18 18:00:00 (86400 s)
snowflake-ips-badge 100
snowflake-ips-iptproxy 200
snowflake-ips-total 300
Processing this input and looking only at the proxy_type output file,
we see that the null proxy type is represented in just one row
(the value corresponding to other descriptors being implicitly zero):
./snowflake-stats input /dev/null snowflakes.proxy_type.csv /dev/null /dev/null
begin,end,type,unique_ips
2023-01-14 18:00:00,2023-01-15 18:00:00,badge,100
2023-01-14 18:00:00,2023-01-15 18:00:00,iptproxy,200
2023-01-15 18:00:00,2023-01-16 18:00:00,badge,100
2023-01-15 18:00:00,2023-01-16 18:00:00,iptproxy,200
2023-01-15 18:00:00,2023-01-16 18:00:00,,1
2023-01-16 18:00:00,2023-01-17 18:00:00,badge,100
2023-01-16 18:00:00,2023-01-17 18:00:00,iptproxy,200
2023-01-17 18:00:00,2023-01-18 18:00:00,badge,100
2023-01-17 18:00:00,2023-01-18 18:00:00,iptproxy,200
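The single null-type row can be read as a remainder: the part of snowflake-ips-total that is not accounted for by the typed counts. The following is a minimal sketch of that derivation, not the tool's actual code. It assumes that a zero remainder simply produces no row; the function name proxy_type_rows and the dict representation of a descriptor are hypothetical.

```python
def proxy_type_rows(descriptor):
    """Return (type, unique_ips) pairs for one parsed descriptor.

    `descriptor` is a hypothetical dict mapping field names to integers.
    """
    prefix = "snowflake-ips-"
    typed = {
        key[len(prefix):]: value
        for key, value in descriptor.items()
        if key.startswith(prefix) and key != prefix + "total"
    }
    rows = sorted(typed.items())
    remainder = descriptor[prefix + "total"] - sum(typed.values())
    # Assumption: a zero remainder is represented by an absent row,
    # which is what later makes the measurement look "missing".
    if remainder != 0:
        rows.append(("", remainder))
    return rows


balanced = proxy_type_rows(
    {"snowflake-ips-badge": 100, "snowflake-ips-iptproxy": 200, "snowflake-ips-total": 300}
)
unbalanced = proxy_type_rows(
    {"snowflake-ips-badge": 100, "snowflake-ips-iptproxy": 200, "snowflake-ips-total": 301}
)
print(balanced)    # no null-type row
print(unbalanced)  # includes a null-type row with value 1
```

Under that assumption, only the 2023-01-16 18:00:00 descriptor yields a null-type row, matching the output above.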
When the value of 1 is apportioned over 24-hour periods aligned to midnight,
we get a unique_ips share of 0.25 assigned to 2023-01-15 and a share of 0.75 assigned to 2023-01-16,
which is correct.
What's wrong is the coverage. It is recorded as 0.25 and 0.75,
but it should be 1.00 and 1.00.
We are not missing measurements before and after the
snowflake-stats-end 2023-01-16 18:00:00 descriptor;
we have measurements there, whose value is an implicit zero.
./proxy-type snowflakes.proxy_type.csv
date,type,unique_ips,coverage
2023-01-14,badge,25.00,0.25
2023-01-14,iptproxy,50.00,0.25
2023-01-15,badge,100.00,1.00
2023-01-15,iptproxy,200.00,1.00
2023-01-15,,0.25,0.25
2023-01-16,badge,100.00,1.00
2023-01-16,iptproxy,200.00,1.00
2023-01-16,,0.75,0.75
2023-01-17,badge,100.00,1.00
2023-01-17,iptproxy,200.00,1.00
2023-01-18,badge,75.00,0.75
2023-01-18,iptproxy,150.00,0.75
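The 0.25 and 0.75 figures for the null type can be reproduced with a short sketch of row-wise apportioning. This is an illustration under assumptions, not the proxy-type program itself: the helper names apportion and daily are hypothetical, and coverage is assumed to be the fraction of each midnight-aligned day overlapped by rows that are present.

```python
from collections import defaultdict
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"


def apportion(begin, end, value):
    """Yield (date, share_of_value, fraction_of_day) for each
    midnight-aligned day that the interval [begin, end) overlaps."""
    total = (end - begin).total_seconds()
    day = datetime(begin.year, begin.month, begin.day)
    while day < end:
        next_day = day + timedelta(days=1)
        overlap = (min(end, next_day) - max(begin, day)).total_seconds()
        yield day.date(), value * overlap / total, overlap / 86400
        day = next_day


def daily(rows):
    """Accumulate per-(date, type) unique_ips and coverage, row by row.

    An absent row contributes nothing to coverage, so an implicit zero
    is indistinguishable from a missing measurement.
    """
    out = defaultdict(lambda: [0.0, 0.0])
    for begin, end, typ, ips in rows:
        for date, share, frac in apportion(begin, end, ips):
            out[(date, typ)][0] += share
            out[(date, typ)][1] += frac
    return out


# The only null-type row in snowflakes.proxy_type.csv.
rows = [
    (datetime.strptime("2023-01-15 18:00:00", FMT),
     datetime.strptime("2023-01-16 18:00:00", FMT), "", 1),
]
result = daily(rows)
for (date, typ), (ips, coverage) in sorted(result.items()):
    print(f"{date},{typ},{ips:.2f},{coverage:.2f}")
```

Because only one null-type row exists, the neighboring intervals add nothing to coverage for 2023-01-15 and 2023-01-16, even though they were measured.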
One way to fix this is to record explicit zeroes in the intermediate files.
That works for proxy_type, which has a small, known set of possible values.
It may not work as well for proxy_country, which has a large number of country codes,
not all of which are represented on every day.
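A minimal sketch of the explicit-zero approach for proxy_type follows. It assumes the set of types is known in advance; KNOWN_TYPES here contains only the types that appear in the example above and is not meant to be the real list. The function name with_explicit_zeroes is hypothetical.

```python
# Assumption: illustrative set of types taken from the example, not the real list.
KNOWN_TYPES = ("badge", "iptproxy", "")


def with_explicit_zeroes(rows, known_types=KNOWN_TYPES):
    """Return rows plus a zero-valued row for every (interval, type)
    combination that is absent from the input.

    rows: list of (begin, end, type, unique_ips) tuples.
    """
    seen = {(begin, end, typ) for begin, end, typ, _ in rows}
    intervals = sorted({(begin, end) for begin, end, _, _ in rows})
    filled = list(rows)
    for begin, end in intervals:
        for typ in known_types:
            if (begin, end, typ) not in seen:
                filled.append((begin, end, typ, 0))
    # Sort by interval, keeping the order of known_types within each interval.
    order = {typ: i for i, typ in enumerate(known_types)}
    return sorted(filled, key=lambda r: (r[0], r[1], order.get(r[2], len(order))))


rows = [
    ("2023-01-14 18:00:00", "2023-01-15 18:00:00", "badge", 100),
    ("2023-01-14 18:00:00", "2023-01-15 18:00:00", "iptproxy", 200),
    ("2023-01-15 18:00:00", "2023-01-16 18:00:00", "badge", 100),
    ("2023-01-15 18:00:00", "2023-01-16 18:00:00", "iptproxy", 200),
    ("2023-01-15 18:00:00", "2023-01-16 18:00:00", "", 1),
]
filled = with_explicit_zeroes(rows)
for row in filled:
    print(",".join(str(field) for field in row))
```

With the zero rows present, row-wise coverage counts the intervals in which the null type was measured as zero. The same approach applied to proxy_country would need a zero row for every country code in every interval, which is the scaling concern noted above.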