I'm not yet clear on what we'd gain by adding CAIDA.org data. We're using MaxMind's GeoLite ASN file, which contains the following entry for 1984 Hosting:
1566564352,1566566399,"AS44925 1984 ehf AS number"
Onionoo would include that as follows in a relay details document:
"as_number":"AS44925","as_name":"1984 ehf AS number"
(Admittedly, the "AS number" part in that string doesn't make much sense and looks like a data import problem on MaxMind's side. But we can probably expect similar problems with CAIDA.org's data, just not with this particular entry.)
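To make the mapping concrete, here is a minimal sketch of how a GeoLite ASN row could be split into the two fields shown above. This is not Onionoo's actual code; the function name `parse_maxmind_asn_line` and the returned dictionary layout are assumptions made for illustration.

```python
import csv
import io


def parse_maxmind_asn_line(line):
    """Split one GeoLite ASN CSV row into range bounds and AS fields.

    Hypothetical helper: assumes the row has exactly three columns
    (range start, range end, quoted label).
    """
    start, end, label = next(csv.reader(io.StringIO(line)))
    # The label holds the AS number, a space, then the free-text name.
    as_number, _, as_name = label.partition(" ")
    return {
        "start": int(start),
        "end": int(end),
        "as_number": as_number,
        "as_name": as_name,
    }


entry = parse_maxmind_asn_line('1566564352,1566566399,"AS44925 1984 ehf AS number"')
# Only these two fields end up in the relay details document.
details_fragment = {"as_number": entry["as_number"], "as_name": entry["as_name"]}
print(details_fragment)
```

Splitting on the first space keeps whatever MaxMind put after the AS number, which is why the odd "AS number" suffix carries over unchanged.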
But let's also look at a bigger AS/organization that hosts a lot of relays: OVH. Here's what CAIDA.org says about OVH:
ORG-OS3-RIPE||OVH SAS|FR|RIPE
16276||OVH|ORG-OS3-RIPE|RIPE
35540||OVH-TELECOM|ORG-OS3-RIPE|RIPE
And here's what MaxMind's ASN file says about OVH:
86441984,86474751,"AS16276 OVH SAS"
92733440,92798975,"AS16276 OVH SAS"
96731136,96796671,"AS16276 OVH SAS"
134738944,134739199,"AS16276 OVH SAS"
135430144,135430399,"AS16276 OVH SAS"
135432192,135434239,"AS16276 OVH SAS"
135441408,135441663,"AS16276 OVH SAS"
135556608,135556863,"AS16276 OVH SAS"
135604480,135604735,"AS16276 OVH SAS"
135792640,135794687,"AS16276 OVH SAS"
135945728,135945983,"AS16276 OVH SAS"
136175616,136175871,"AS16276 OVH SAS"
136237056,136239103,"AS16276 OVH SAS"
136404992,136407039,"AS16276 OVH SAS"
136413184,136415743,"AS16276 OVH SAS"
624623616,624689151,"AS16276 OVH SAS"
624701440,624705535,"AS16276 OVH SAS"
633012224,633077759,"AS16276 OVH SAS"
635305984,635338751,"AS16276 OVH SAS"
635371520,635437055,"AS16276 OVH SAS"
778633216,778698751,"AS16276 OVH SAS"
1056243712,1056251903,"AS16276 OVH SAS"
1466073088,1466105855,"AS16276 OVH SAS"
1532647424,1532649471,"AS16276 OVH SAS"
1534656512,1534722047,"AS16276 OVH SAS"
1558052864,1558118399,"AS16276 OVH SAS"
1578565632,1578631167,"AS16276 OVH SAS"
1728384000,1728385023,"AS16276 OVH SAS"
1841168384,1841233919,"AS35540 OVH SAS"
2382675968,2382684159,"AS16276 OVH SAS"
2809266176,2809331711,"AS16276 OVH SAS"
2954821632,2954887167,"AS16276 OVH SAS"
2988441600,2988572671,"AS16276 OVH SAS"
3001868288,3001872383,"AS16276 OVH SAS"
3104444672,3104444927,"AS16276 OVH SAS"
3104579584,3104580095,"AS16276 OVH SAS"
3164930048,3164985007,"AS16276 OVH SAS"
3164985009,3164995583,"AS16276 OVH SAS"
3227451392,3227467775,"AS16276 OVH SAS"
3227713536,3227779071,"AS16276 OVH SAS"
3244823296,3244823551,"AS16276 OVH SAS"
3245162240,3245162495,"AS16276 OVH SAS"
3278773760,3278774271,"AS16276 OVH SAS"
3287738368,3287738879,"AS16276 OVH SAS"
3323674624,3323691007,"AS16276 OVH SAS"
3325198336,3325231103,"AS16276 OVH SAS"
3328479232,3328483327,"AS16276 OVH SAS"
3337957376,3337961471,"AS16276 OVH SAS"
3585744896,3585753087,"AS16276 OVH SAS"
3590029312,3590045695,"AS16276 OVH SAS"
Wouldn't we include the exact same output after switching to CAIDA.org data?
I'm hesitant to add another data source because I expect inconsistencies between the two, for example AS numbers that appear in one file but not in the other, and similar issues.
Another (minor) issue is the additional overhead for Onionoo server operators.
Stated differently, I'd want us to have a good reason for adding another data source. Can you maybe give a counterexample where using CAIDA.org data in addition to MaxMind data would enhance Onionoo data notably?
Trac:
Status: new to needs_information
As far as I can tell, the OVH example you gave is, by happenstance, an instance where the CAIDA data is superior.
Let's look at:
ORG-OS3-RIPE||OVH SAS|FR|RIPE
16276||OVH|ORG-OS3-RIPE|RIPE
35540||OVH-TELECOM|ORG-OS3-RIPE|RIPE
For these entries, CAIDA says that the same organization, with id ORG-OS3-RIPE, owns both AS16276 and AS35540.
When you look up OVH on MaxMind, you get only AS16276. In this case I believe the issue is that, in the raw records, AS16276 has the as-name OVH while AS35540 has the as-name OVH-TELECOM. Ergo, on plain-text matching they are not the same. The CAIDA data squashes these two distinct strings into the same organization id, and thus into the same organization name.
Going beyond this specific case, the CAIDA data does some cleverness (see http://www.caida.org/research/topology/as2org/ for the methodology) to determine which "real organization" owns the AS number. This is helpful when firm A purchases firm B, and then firm A becomes an upstream provider of firm B (making firm B part of firm A's "cone"). The MaxMind data would list firm B, but CAIDA would list firm A.
These sorts of measures are important to ensure we're getting "real organizational diversity".
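As a rough sketch of the grouping described above, the three CAIDA records can be read into two lookup tables so that both AS numbers resolve to one organization. The field positions are an assumption inferred from the three sample lines only (organization rows: id, changed, name, country, source; AS rows: number, changed, as-name, organization id, source), and `parse_as2org` is a hypothetical helper, not part of any CAIDA tooling.

```python
CAIDA_SAMPLE = """\
ORG-OS3-RIPE||OVH SAS|FR|RIPE
16276||OVH|ORG-OS3-RIPE|RIPE
35540||OVH-TELECOM|ORG-OS3-RIPE|RIPE
"""


def parse_as2org(text):
    """Return (org_id -> org_name, asn -> org_id) from pipe-delimited records."""
    org_names = {}
    asn_to_org = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split("|")
        if len(fields) < 4:
            continue  # not a record shape we recognise
        if fields[0].isdigit():
            # Assumed AS row: the organization id is the fourth field.
            asn_to_org[int(fields[0])] = fields[3]
        else:
            # Assumed organization row: the name is the third field.
            org_names[fields[0]] = fields[2]
    return org_names, asn_to_org


org_names, asn_to_org = parse_as2org(CAIDA_SAMPLE)
for asn in (16276, 35540):
    print(asn, org_names[asn_to_org[asn]])
```

Both AS numbers resolve through the shared organization id, which is the property an organizational-diversity measure would rely on.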
Hang on, the MaxMind data that I quoted above does include this row:
1841168384,1841233919,"AS35540 OVH SAS"
So, in this case the CAIDA data looks about as good as MaxMind's.
Can you give an example, or better a handful of examples, where CAIDA data is obviously better than MaxMind's?
Sure, it's available here: https://download.maxmind.com/download/geoip/database/asnum/GeoIPASNum2.zip
Trac:
collapsed_pairs.txt
Comparing the two data sets...
#ASNs in CAIDA: 73,256
#ASNs in MaxMind: 53,354
#ASNs in both: 52,810
#ASNs only in CAIDA: 20,446
#ASNs only in MaxMind: 544
(1) On the above alone, the CAIDA data is more comprehensive.
(2) I also attach a list of 849 pairs of entries that, within MaxMind, closely resemble each other (between 90% and 99% similar) yet are merged into a single entity in CAIDA. If we want an organization diversity measure, we need the entities to match.
We can go deeper here. But I feel the point has been made. The next point would be that if firm A buys firm B, our org-diversity measure needs to look at firm A, not firm B. This is what CAIDA does and MaxMind does not. http://www.caida.org/research/topology/as2org/
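For reference, the coverage counts quoted earlier in this comment are plain set arithmetic. The sketch below shows one way they could be derived; `compare_asn_sets` is a hypothetical helper and the toy input is made up purely to exercise it, not taken from either data set.

```python
def compare_asn_sets(caida_asns, maxmind_asns):
    """Count ASNs per source, shared ASNs, and ASNs unique to each source."""
    caida, maxmind = set(caida_asns), set(maxmind_asns)
    return {
        "in_caida": len(caida),
        "in_maxmind": len(maxmind),
        "in_both": len(caida & maxmind),
        "only_in_caida": len(caida - maxmind),
        "only_in_maxmind": len(maxmind - caida),
    }


# Toy input (illustrative only): two shared ASNs, one unique to each side.
toy = compare_asn_sets([16276, 35540, 44925], [16276, 44925, 64512])
print(toy)

# Consistency check on the figures reported in the comment:
# total per source minus the shared count gives the "only in" count.
assert 73256 - 52810 == 20446
assert 53354 - 52810 == 544
```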
Okay, let's evaluate the pros and cons of adding CAIDA data. I'm counting pros as +1, neutrals as 0, and cons as -1. Let's see whether we'll get a positive number here.
- The CAIDA data doesn't contain IP address ranges, so we'll have to keep using MaxMind data in addition to CAIDA data. Okay. But that means that CAIDA's comprehensiveness in terms of number of ASNs is meaningless to us, because we're limited to whatever ASNs are in MaxMind data. (0)
- MaxMind contains 67 of its 2833 ASNs (not sure where your 53k number comes from) that CAIDA does not know about. Right now we'd have organization names for these ASNs, but once we switch over to using CAIDA's organization names we'd provide less information there. And I'm not willing to provide MaxMind data if CAIDA doesn't have anything for a given ASN, because nobody will understand that, nor do I want to provide both organization names. This is a serious problem that I don't know how to work around cleanly. (-1)
- CAIDA data is only updated every three months, MaxMind provides a new update every month. It already happens that people ping me because MaxMind's data is old, and that's only going to get worse with CAIDA. Somewhat related, MaxMind has been providing ASN data for many years now without major issues whereas CAIDA apparently started providing data only 2 years ago. (-1)
- We'd still need to write, review, and test code to handle CAIDA's data format. This could become a neutral if somebody submits a good patch, but please only do that if that makes the overall sum positive, or that patch might not get accepted. (-1)
- Operating an Onionoo server becomes a bit harder with an additional data source to update. We want more people to run Onionoo servers at some point, so we should make that process easier not harder. (-1)
- MaxMind indeed contains similar but not equivalent organization names which should be exactly the same. However, the actual number is lower than what your pairwise comparison implies, and somebody measuring organization diversity could always use a similarity metric as yours when looking at these strings. Anyway, CAIDA is indeed better here than MaxMind. (1)
I'm calculating -3 as the sum here. That means no. Sorry. Leaving this ticket open for a few more days in case you have convincing arguments why the cons I'm listing above are actually neutrals or can be turned into pros.
Trac:
MMs_very_diff.txt
The CAIDA data doesn't contain IP address ranges, so we'll have to keep using MaxMind data in addition to CAIDA data. Okay. But that means that CAIDA's comprehensiveness in terms of number of ASNs is meaningless to us, because we're limited to whatever ASNs are in MaxMind data. (0)
You're right. We still only start with the IP#, and it would be a pain to implement a method to learn the AS numbers. Okay, that kills any utility of CAIDA having more ASs.
MaxMind contains 67 of its 2833 ASNs (not sure where your 53k number comes from) that CAIDA does not know about. Right now we'd have organization names for these ASNs, but once we switch over to using CAIDA's organization names we'd provide less information there. And I'm not willing to provide MaxMind data if CAIDA doesn't have anything for a given ASN, because nobody will understand that, nor do I want to provide both organization names. This is a serious problem that I don't know how to work around cleanly. (-1)
CAIDA data is only updated every three months, MaxMind provides a new update every month. It already happens that people ping me because MaxMind's data is old, and that's only going to get worse with CAIDA. Somewhat related, MaxMind has been providing ASN data for many years now without major issues whereas CAIDA apparently started providing data only 2 years ago. (-1)
The 53k figure is actually correct. Additionally, I would never wholly replace MaxMind data with CAIDA; the fields convey very different things. MaxMind says which organization is the registered owner, while CAIDA does some cleverness to learn the parent organization. These are very different. I would propose that there be a new field, called something like parent_organization, for each relay, which is populated by CAIDA [when it exists]. I claim this sets both of the above (-1)s to (0).
We'd still need to write, review, and test code to handle CAIDA's data format. This could become a neutral if somebody submits a good patch, but please only do that if that makes the overall sum positive, or that patch might not get accepted. (-1)
The CAIDA format is a standard CSV. https://commons.apache.org/proper/commons-csv/ (0)
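To illustrate the proposal, here is a minimal sketch of an optional parent_organization field that is added only when a CAIDA record exists, reading the pipe-delimited file with a stock CSV parser. It uses Python's csv module rather than Apache Commons CSV, and the function names, field positions, and details-fragment shape are all assumptions for illustration, not Onionoo code.

```python
import csv
import io


def load_parent_orgs(as2org_text):
    """Map 'AS<number>' to a parent organization name (assumed field layout)."""
    org_names, asn_to_org = {}, {}
    reader = csv.reader(io.StringIO(as2org_text), delimiter="|", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 4 or row[0].startswith("#"):
            continue  # skip blank, short, and comment rows
        if row[0].isdigit():
            asn_to_org["AS" + row[0]] = row[3]  # AS row: org id in 4th field
        else:
            org_names[row[0]] = row[2]  # org row: name in 3rd field
    return {asn: org_names[org] for asn, org in asn_to_org.items() if org in org_names}


def add_parent_organization(details, parent_orgs):
    """Return a copy of the details fragment; add the field only if known."""
    enriched = dict(details)
    parent = parent_orgs.get(details.get("as_number"))
    if parent is not None:
        enriched["parent_organization"] = parent
    return enriched


sample = "ORG-OS3-RIPE||OVH SAS|FR|RIPE\n35540||OVH-TELECOM|ORG-OS3-RIPE|RIPE\n"
parents = load_parent_orgs(sample)
with_field = add_parent_organization({"as_number": "AS35540", "as_name": "OVH SAS"}, parents)
# AS44925 is absent from this two-line sample only, so the field is omitted.
without_field = add_parent_organization(
    {"as_number": "AS44925", "as_name": "1984 ehf AS number"}, parents
)
print(with_field)
print(without_field)
```

Passing an empty mapping for `parent_orgs` leaves every fragment unchanged, which would correspond to a server operator who chooses not to load the CAIDA file.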
Operating an Onionoo server becomes a bit harder with an additional data source to update. We want more people to run Onionoo servers at some point, so we should make that process easier not harder. (-1)
This is indeed an issue. It seems entirely reasonable to me that if someone doesn't want to deal with the CAIDA data, they simply won't have the parent_organization field. Totally cool with that. (0?)
MaxMind indeed contains similar but not equivalent organization names which should be exactly the same. However, the actual number is lower than what your pairwise comparison implies, and somebody measuring organization diversity could always use a similarity metric as yours when looking at these strings. Anyway, CAIDA is indeed better here than MaxMind. (1)
So I actually low-balled this for you.
Here are the actual numbers.
- Number of ASNs for which MM's organizations are different, yet CAIDA's 'parent organization' is the same: 3299
- Number of ASNs for which MM's organizations are very different, yet CAIDA's 'parent organization' is the same: 1935
I attach a list of those 1935 pairs as MMs_very_diff.txt .
Two AS-ORG names being similar is neither sufficient nor necessary for two ASs to be correctly grouped under the same parent organization. We totally tried to learn these relationships from the MaxMind data, and failed. I was in the process of deriving my own method from the academic literature until I found the CAIDA data, which did everything I needed.
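A small sketch of why name similarity is not necessary for a shared parent: the two as-names from the OVH example score well below a 90% threshold even though CAIDA places both under one organization. The metric here is `difflib.SequenceMatcher`, chosen only for illustration; the metric actually used for the attached pair lists is not stated in this ticket, and the 0.9 cutoff is an assumption borrowed from the "90-99%" range mentioned earlier.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.9  # assumed cutoff, not taken from the attached lists


def name_similarity(a, b):
    """Case-insensitive similarity ratio between two names, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# As-names of AS16276 and AS35540 from the raw records quoted in this ticket.
score = name_similarity("OVH", "OVH-TELECOM")
grouped_by_name = score >= SIMILARITY_THRESHOLD
print(round(score, 3), grouped_by_name)
```

A pure string comparison would keep these two apart, so any grouping has to come from an explicit AS-to-organization mapping rather than from the names themselves.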
I have no stake in this. We tried to use something like MaxMind for Roster, failed, but then discovered CAIDA worked. You then requested that we move as much functionality into Onionoo as possible. So this is me trying to do that. It's of course totally fine to say that this is too niche a need to be worth including in Onionoo. In which case, Roster will just continue to use its own database for this, which is totally cool. I'm just trying to, as you requested, upload the goods we found to the Onionoo Mothership. This is me exerting effort to be a good uploader of candidate good things to Onionoo.
Trac:
Cc: karsten, seansaito to karsten, seansaito, twim@riseup.net