The Tor Project issues
https://gitlab.torproject.org/groups/tpo/-/issues
Updated: 2020-06-27T14:23:37Z

Issue 22279: simplify and avoid repetition in ParserHelper methods
https://gitlab.torproject.org/tpo/network-health/metrics/library/-/issues/22279
Updated: 2020-06-27T14:23:37Z. Reported by iwakeh.

Simplify and avoid repetition in ParserHelper methods, and adapt the tests in ExtraInfoDescriptorTest accordingly. Cf. legacy/trac#22217, comments 6 and 7. This includes a minor code clean-up and raising exceptions when duplicate keys are encountered.

Milestone: metrics-lib 1.8.0. Assignee: Karsten Loesing.

Issue 22228: Deprecate setFailUnrecognizedDescriptorLines()
https://gitlab.torproject.org/tpo/network-health/metrics/library/-/issues/22228
Updated: 2020-06-27T14:23:37Z. Reported by Karsten Loesing.

We have two methods, `DescriptorParser#setFailUnrecognizedDescriptorLines()` and `DescriptorReader#setFailUnrecognizedDescriptorLines()`, that a) are never used in known code and b) can easily be replaced by checking whether `Descriptor#getUnrecognizedLines()` contains any lines.
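The replacement check could look like the following minimal sketch. The stand-in method below is invented here to keep the example self-contained; only the name `getUnrecognizedLines()` mirrors the real accessor.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical sketch (stand-in type, not the metrics-lib API): instead of
// configuring the parser to fail on unrecognized lines, the application
// inspects each parsed descriptor's unrecognized lines afterwards.
class UnrecognizedLinesSketch {

  // Minimal stand-in for Descriptor#getUnrecognizedLines().
  static List<String> getUnrecognizedLines() {
    return Collections.singletonList("padding-counts 2017-05-10 ...");
  }

  public static void main(String[] args) {
    List<String> unrecognized = getUnrecognizedLines();
    if (!unrecognized.isEmpty()) {
      // The application decides what to do: log, skip, or throw.
      System.out.println("unrecognized: " + unrecognized.size());
    }
  }
}
```

This moves the failure policy from the library into the application, which is the point of the deprecation.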
We should get rid of those methods.

Milestone: metrics-lib 1.7.0

Issue 22217: Parse new padding-counts lines
https://gitlab.torproject.org/tpo/network-health/metrics/library/-/issues/22217
Updated: 2020-06-27T14:23:37Z. Reported by Karsten Loesing.

As found in today's CollecTor logs:
```
2017-05-10 06:09:07,711 WARN o.t.c.b.SanitizedBridgesWriter:1212 Unrecognized line 'padding-counts 2017-05-10 01:48:43 (86400 s) bin-size=10000 write-drop=10000 write-pad=10000 write-total=10000 read-drop=10000 read-pad=10000 read-total=70000 enabled-read-pad=0 enabled-read-total=0 enabled-write-pad=0 enabled-write-total=0 max-chanpad-timers=0'. Skipping.
```
We should provide support for parsing these lines in relay descriptors and in bridge descriptors if we keep those lines there (legacy/trac#22216).
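A minimal, self-contained sketch of what parsing such a line could involve. This is not the metrics-lib implementation; the splitting logic and class name are assumptions for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: split a padding-counts line into its interval
// length and its key=value counters.
class PaddingCountsSketch {
  public static void main(String[] args) {
    String line = "padding-counts 2017-05-10 01:48:43 (86400 s) "
        + "bin-size=10000 write-drop=10000 write-pad=10000";
    String[] parts = line.split(" ");
    // parts[1..2] are the date and time, parts[3..4] are "(86400" and "s)".
    String intervalSeconds = parts[3].substring(1);
    Map<String, Long> counts = new LinkedHashMap<>();
    for (int i = 5; i < parts.length; i++) {
      String[] keyValue = parts[i].split("=");
      counts.put(keyValue[0], Long.parseLong(keyValue[1]));
    }
    System.out.println(intervalSeconds + " " + counts.get("bin-size"));
  }
}
```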
Optimistically aiming for 1.7.0.

Milestone: metrics-lib 1.7.0. Assignee: Karsten Loesing.

Issue 22208: Provide lines containing extra, unrecognized arguments
https://gitlab.torproject.org/tpo/network-health/metrics/library/-/issues/22208
Updated: 2021-11-29T14:19:31Z. Reported by Karsten Loesing.

This question came up when implementing [#21934](https://trac.torproject.org/projects/tor/ticket/21934#comment:4): do we need to include lines with extra arguments in `Descriptor#getUnrecognizedLines()`? Or do we need another method `Descriptor#getPartiallyUnrecognizedLines()`? Or can we safely drop those extra arguments on the floor? See also `TorperfResults#getUnrecognizedKeys()`, which is somewhat related here, but not exactly the same.
This is not super urgent, because at least the bytes of the entire descriptor are available for further processing.
The fix is also not trivial to write, because we'll have to manually go through all places where extra arguments are permitted and ignored and add those lines to a list of partially unrecognized lines.
The result could be that doing nothing is the way to go. Needs more thought.

Milestone: metrics-lib 3.0.0

Issue 22196: Configure descriptor sources using method chaining
https://gitlab.torproject.org/tpo/network-health/metrics/library/-/issues/22196
Updated: 2021-11-29T14:18:52Z. Reported by Karsten Loesing.

The two descriptor sources `DescriptorReader` and `DescriptorParser` are instantiated by `DescriptorSourceFactory`, configured, and then executed. This requires state checks and leads to quite verbose application code. An exception to this is `DescriptorCollector`, which is state-less but has a method with 5 parameters, which also does not produce very readable code.
We should consider switching to configuring descriptor sources using method chaining. [Here's a code sample](https://gitweb.torproject.org/user/karsten/metrics-lib.git/commit/?h=task-16225&id=f8c5862bbdfa0acefbbbd7d4899b997077c20196) from a few months ago, written for another ticket.
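As a self-contained illustration of the idea (all names below are invented; this is not the linked code sample and not the metrics-lib API):

```java
// Hypothetical sketch of fluent configuration: each setter returns `this`,
// so configuration and execution read as one chained expression.
class FluentReaderSketch {
  private String directory = ".";
  private boolean deleteExtraneous = false;

  FluentReaderSketch fromDirectory(String directory) {
    this.directory = directory;
    return this;
  }

  FluentReaderSketch deleteExtraneousLocalFiles(boolean delete) {
    this.deleteExtraneous = delete;
    return this;
  }

  String describe() {
    return directory + " " + deleteExtraneous;
  }

  public static void main(String[] args) {
    String config = new FluentReaderSketch()
        .fromDirectory("in/recent")
        .deleteExtraneousLocalFiles(true)
        .describe();
    System.out.println(config);
  }
}
```

The design trade-off is exactly the one iwakeh points out below: a fluent builder sidesteps the factory's state checks, but the factory approach would need to be adapted to it.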
iwakeh rightly [commented on that other ticket](https://trac.torproject.org/projects/tor/ticket/16225#comment:6): "configuration of a descriptor source by fluent-style (or method chaining) is fine, but metrics-lib currently has the DescriptorSourceFactory approach, which would need to be adapted. That is, I see two things: the ideas around the code of DescriptorSourceBuilder are ideas about a new way of configuration and not exception/error handling in metrics-lib, i.e. a different ticket (there is one for redesign of the interface hierarchy, I'll look for it). Second, the current configuration and descriptor source instantiation need to be considered together (in that other ticket)."
Not assigning to a milestone on purpose. This likely needs more thought before we can write code and make plans for releasing it.

Issue 22190: DescriptorIndexCollector does not delete extraneous local files if remote paths have a leading /
https://gitlab.torproject.org/tpo/network-health/metrics/library/-/issues/22190
Updated: 2020-06-27T14:23:37Z. Reported by Karsten Loesing.

Today I realized that metrics-web does not delete any extraneous local files since we made `DescriptorIndexCollector` the default in metrics-lib version 1.6.0.
Turns out that the following metrics-web patch works around the issue:
```
diff --git a/modules/collectdescs/src/main/java/org/torproject/metrics/collectdescs/Main.java b/modules/collectdescs/src/main/java/org/torproject/metrics/collectdescs/Main.java
index 499dff9..39b3b69 100644
--- a/modules/collectdescs/src/main/java/org/torproject/metrics/collectdescs/Main.java
+++ b/modules/collectdescs/src/main/java/org/torproject/metrics/collectdescs/Main.java
@@ -17,14 +17,14 @@ public class Main {
DescriptorSourceFactory.createDescriptorCollector();
collector.collectDescriptors(
"https://collector.torproject.org", new String[] {
- "/recent/bridge-descriptors/extra-infos/",
- "/recent/bridge-descriptors/server-descriptors/",
- "/recent/bridge-descriptors/statuses/",
- "/recent/exit-lists/",
- "/recent/relay-descriptors/consensuses/",
- "/recent/relay-descriptors/extra-infos/",
- "/recent/relay-descriptors/server-descriptors/",
- "/recent/torperf/"
+ "recent/bridge-descriptors/extra-infos/",
+ "recent/bridge-descriptors/server-descriptors/",
+ "recent/bridge-descriptors/statuses/",
+ "recent/exit-lists/",
+ "recent/relay-descriptors/consensuses/",
+ "recent/relay-descriptors/extra-infos/",
+ "recent/relay-descriptors/server-descriptors/",
+ "recent/torperf/"
}, 0L, new File("../../shared/in"), true);
}
}
```
However, I think we need to fix this in metrics-lib by accepting remote paths with leading `/` for fetching remote files _and_ deleting extraneous local files.
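A minimal sketch of the normalization this might involve. The helper below is hypothetical, not the actual `DescriptorIndexCollector` code; the point is only that both spellings of a remote path should compare equal.

```java
// Hypothetical sketch: normalize remote paths so that "/recent/..." and
// "recent/..." are treated the same when fetching remote files and when
// deciding which local files are extraneous.
class PathNormalizeSketch {

  static String normalize(String remotePath) {
    return remotePath.startsWith("/")
        ? remotePath.substring(1) : remotePath;
  }

  public static void main(String[] args) {
    System.out.println(normalize("/recent/exit-lists/"));
    System.out.println(normalize("recent/exit-lists/"));
  }
}
```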
iwakeh, want to look into this?

Milestone: metrics-lib 1.7.0. Assignee: iwakeh.

Issue 22154: Remove code that was deprecated in the 1.x series
https://gitlab.torproject.org/tpo/network-health/metrics/library/-/issues/22154
Updated: 2020-06-27T14:23:37Z. Reported by Karsten Loesing.

When we release version 2.0.0 we should remove all (?) code that was deprecated in the 1.x series.
Really all? What about code that we deprecated in the very last 1.x version just a few days before releasing 2.0.0? I believe even that, because whoever updates from 1.x to 2.x knows that things might break. And even if they update from an earlier 1.x, they'll need to watch out for changes. But I could be convinced otherwise.

Milestone: metrics-lib 2.0.0. Assignee: iwakeh.

Issue 22141: Deprecate `DescriptorFile` and add relevant information to `Descriptor`
https://gitlab.torproject.org/tpo/network-health/metrics/library/-/issues/22141
Updated: 2020-06-27T14:23:37Z. Reported by Karsten Loesing.

I'd want us to implement legacy/trac#20395, where we'd be able to handle much larger descriptor files without copying all contents to memory before even looking at them. But I realized that `DescriptorFile#getDescriptors()` makes this rather pointless. If we need to keep a list with all parsed descriptors in memory, each containing raw contents of some sort (see legacy/trac#22140), then what's the point of avoiding reading complete files into memory?
One way to fix this is to deprecate `DescriptorFile` and add all relevant information to `Descriptor`. And then `DescriptorReader` would return an `Iterator<Descriptor>`, internally backed by `BlockingIteratorImpl<Descriptor>`, rather than an `Iterator<DescriptorFile>`. Sounds easy!
Here's a catch though: `DescriptorFile#getException()` returns "the first exception that was thrown when reading this file or parsing its content". If we'd move this method to `Descriptor`, would we set an I/O exception to the first descriptor in that file, to all of them, or maybe to none of them? Turns out we don't have to worry too much about this. The only code that actually uses `DescriptorFile#getException()` is Onionoo's `DescriptorQueue`, and all it does is log the exception. We could just do the same.
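The blocking-iterator idea can be sketched with a plain `BlockingQueue`: a reader thread puts parsed items into a bounded queue, the consumer iterates, and a sentinel object marks the end. This is a simplified stand-in, not the actual `BlockingIteratorImpl`.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch of an Iterator<T> backed by a blocking queue.
class BlockingIteratorSketch<T> implements Iterator<T> {

  private final BlockingQueue<Object> queue = new ArrayBlockingQueue<>(64);
  private static final Object END = new Object();  // end-of-stream sentinel
  private Object next;

  void put(T item) throws InterruptedException { queue.put(item); }
  void finish() throws InterruptedException { queue.put(END); }

  @Override public boolean hasNext() {
    if (next == null) {
      try {
        next = queue.take();  // blocks until the reader provides an item
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
    return next != END;
  }

  @SuppressWarnings("unchecked")
  @Override public T next() {
    if (!hasNext()) { throw new NoSuchElementException(); }
    Object result = next;
    next = null;
    return (T) result;
  }

  public static void main(String[] args) {
    BlockingIteratorSketch<String> it = new BlockingIteratorSketch<>();
    new Thread(() -> {
      try { it.put("desc1"); it.put("desc2"); it.finish(); }
      catch (InterruptedException ignored) { }
    }).start();
    int count = 0;
    while (it.hasNext()) { it.next(); count++; }
    System.out.println(count);
  }
}
```

The bounded queue is what lets the reader thread run ahead of the consumer without buffering whole files in memory.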
What else could go wrong?

Milestone: metrics-lib 1.9.0. Assignee: Karsten Loesing.

Issue 22140: Store raw descriptor contents as UTF-8 encoded Strings rather than byte[]
https://gitlab.torproject.org/tpo/network-health/metrics/library/-/issues/22140
Updated: 2022-04-11T16:41:44Z. Reported by Karsten Loesing.

When we're reading descriptors from disk we're storing raw descriptor contents as `byte[]` and returning them in `Descriptor#getRawDescriptorBytes()`. Also, we're storing partial raw descriptor contents in `DirSourceEntry#getDirSourceEntryBytes()` and `NetworkStatusEntry#getStatusEntryBytes()`.
Storing `byte[]` can be useful when writing raw contents back to disk, because we can be sure that contents are exactly the same as when we read them from disk. Namely, we don't have to worry about character encoding.
However, support for handling (large) `byte[]` content is limited. Today I looked into ways to handle large descriptor files (legacy/trac#20395), and I found that most libraries work best with character streams, not with byte streams. And I only briefly considered implementing Knuth-Morris-Pratt myself...
So, I looked at the four main code bases using metrics-lib (CollecTor, ExoneraTor, metrics-web, Onionoo) to see which of them use raw descriptor bytes and how. After all, if we're not using them ourselves, we can as well get rid of them. Here's what I found:
1. Onionoo's `DescriptorQueue` uses raw bytes to keep statistics on processed bytes, which seems like something that would still work reasonably well with character lengths.
2. **CollecTor's `DescriptorPersistence` indeed uses raw descriptor bytes to write descriptors obtained from another CollecTor instance to disk. We'd have to change that.**
3. CollecTor's `VotePersistence` uses raw descriptor bytes to calculate the digest of votes, which is something we should implement in metrics-lib directly (legacy/trac#20333).
4. ExoneraTor's `ExoneraTorDatabaseImporter` imports raw status entry bytes into the database, but we know that those are just ASCII, so this would work as well with UTF-8 strings.
5. metrics-web's `RelayDescriptorDatabaseImporter` also imports raw status entry bytes into the database, which works with strings for the same reason as above.
I might have overlooked something.
But if not, CollecTor's `DescriptorPersistence` is the only place where we really need `byte[]` rather than `String`. If we can change that, we can switch from `Descriptor#getRawDescriptorBytes()` to `Descriptor#getRawDescriptor()` and deprecate the former (and do the same with the other two partial contents).
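The reason UTF-8 `String`s are safe for the ASCII cases above can be demonstrated with a quick round-trip check (the sample line is taken from the descriptor quoted elsewhere in this ticket list):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch: ASCII descriptor bytes survive a round-trip through a UTF-8
// String unchanged, so String storage loses nothing for ASCII-only
// status entries.
class RoundTripSketch {
  public static void main(String[] args) {
    byte[] raw = "router UbuntuCore169 176.158.53.63 44583 0 0\n"
        .getBytes(StandardCharsets.US_ASCII);
    String asString = new String(raw, StandardCharsets.UTF_8);
    byte[] back = asString.getBytes(StandardCharsets.UTF_8);
    System.out.println(Arrays.equals(raw, back));
  }
}
```

Note this guarantee holds because UTF-8 is a superset of ASCII; it would not hold for arbitrary binary content, which is why CollecTor's `DescriptorPersistence` is the case to watch.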
And then we can resume legacy/trac#20395 with a much more complete toolbox.

Issue 22139: last_restarted and platform missing even though it is available in descriptor
https://gitlab.torproject.org/tpo/network-health/metrics/library/-/issues/22139
Updated: 2020-06-27T14:23:38Z. Reported by cypherpunks.

There are currently 3 such relays, one example:
256F183F252DBBF080F2E70E5CB0F523A6323D0F
Also note that recommended_version is set to true even though that depends on the relay's platform string.
https://onionoo.torproject.org/details?fingerprint=256F183F252DBBF080F2E70E5CB0F523A6323D0F
```
{"version":"4.0",
"relays_published":"2017-05-03 08:00:00",
"relays":[
{"nickname":"UbuntuCore169","fingerprint":"256F183F252DBBF080F2E70E5CB0F523A6323D0F","or_addresses":["176.158.53.63:44583"],"last_seen":"2017-05-02 18:00:00","last_changed_address_or_port":"2017-05-02 18:00:00","first_seen":"2017-05-02 18:00:00","running":false,"flags":["Running","V2Dir","Valid"],"country":"fr","country_name":"France","region_name":"\u00CEle-de-France","city_name":"Paris","latitude":48.8628,"longitude":2.3292,"as_number":"AS5410","as_name":"Bouygues Telecom SA","consensus_weight":0,"host_name":"static-176-158-53-63.ftth.abo.bbox.fr","exit_policy_summary":{"reject":["1-65535"]},"recommended_version":true,"measured":false}
],
"bridges_published":"2017-05-03 06:57:32",
"bridges":[
]}
```
https://collector.torproject.org/recent/relay-descriptors/server-descriptors/2017-05-02-18-05-00-server-descriptors
```
@type server-descriptor 1.0
router UbuntuCore169 176.158.53.63 44583 0 0
identity-ed25519
-----BEGIN ED25519 CERT-----
AQQABleiAZ2Ce5QY1oSL0F79WeaPhL/zWomAVJG1vwTioPBkpeG7AQAgBABF3iK6
clXuNv2ZbfNSbmrJkKRLKsC41BZAVs1BSWQndRMNDsZJ/s6GmOd5IiU6axR5z2Nn
XTUR0TMGOc5KNJHqKi9Ht+iSIH02OeV1Gm/PNfos7KBKSJJROme1YQQsvwQ=
-----END ED25519 CERT-----
master-key-ed25519 Rd4iunJV7jb9mW3zUm5qyZCkSyrAuNQWQFbNQUlkJ3U
platform Tor 0.3.0.6 on Linux
proto Cons=1-2 Desc=1-2 DirCache=1 HSDir=1-2 HSIntro=3-4 HSRend=1-2 Link=1-4 LinkAuth=1,3 Microdesc=1-2 Relay=1-2
published 2017-05-02 17:25:22
fingerprint 256F 183F 252D BBF0 80F2 E70E 5CB0 F523 A632 3D0F
uptime 12
bandwidth 4194304 6291456 0
extra-info-digest 357F399E5A0FE2EEDEB7B3AD3D9328440EC17582 OgEu6BAQLUeTFjGofg0WTT9CYQsUGH9tiDENt/tiAD0
onion-key
-----BEGIN RSA PUBLIC KEY-----
MIGJAoGBAMYpYIFcAGOcfZBWt+nUPDu1ovbG8uamDBN4A/XTla74p6A3Ozl8/06D
1E/CcX6N2UahjDs+iM9EmND0k1CFgnkkkU7qBhm4aeOwfzSjDGXA52ab9vS0yEpa
aFHORGn88LRqcSvm9zRtChde5Ez0QJpBOuhyh19qIsSwT4EVa6CXAgMBAAE=
-----END RSA PUBLIC KEY-----
signing-key
-----BEGIN RSA PUBLIC KEY-----
MIGJAoGBAL6touSlbyMx2frcjIrLXcUUhN9rydnQhZrREZEdpALondnaEZzu3LE8
AeQI+VUTpZBlYbWR3Wh+wMDrdPzB3B07ATjAV3N07x6CtKk8YHE5RgShLlEr1k9c
DhN1VZi3rEA63pVfGTC1n7jXpAkMgYMW4KSHk40kgueu+3JxNSe1AgMBAAE=
-----END RSA PUBLIC KEY-----
onion-key-crosscert
-----BEGIN CROSSCERT-----
ITr+XCRVFqFE5o/5utRst/j8cZjEj43Ucd6n4Xoo566rVS9VPvUszduvPAZJECVS
QHPmshTsvXFH5+LEzCk0nN3cR5+iZX5zT15+1EoplE97doHQqtSTcA1CJSSFvoRj
1iobnqDn1lHLFyTMBJ4VV38a1NeovFmy4YkodTrtztk=
-----END CROSSCERT-----
ntor-onion-key-crosscert 1
-----BEGIN ED25519 CERT-----
AQoABlV6AUXeIrpyVe42/Zlt81JuasmQpEsqwLjUFkBWzUFJZCd1ALsQt0Q8mBNP
FcAXX6E+2oX2nGto910Sb1CBMPenMopKXaqArOPeqEQQx4+4x/waBLw7niBtEVjb
+WZ5cSha6Aw=
-----END ED25519 CERT-----
hidden-service-dir
ntor-onion-key fhiVUl9Ff0OlXd6zyqnfEA8u86KmewZISILHeU33Diw=
reject *:*
tunnelled-dir-server
router-sig-ed25519 pyHeZ3dimbx4cBOAjlhLbnav2F9FLrmy+CqO+QIv01VI4qK5xihG6s75HLj3s6dpa52xGBE6HNRdx2rCk2r3Bg
router-signature
-----BEGIN SIGNATURE-----
gJGxrxrbBVnO5x34450bKkBBBGZGJrgfYBLL6tfN6BhEYtENy9cWqt556boXsEuW
cN8z+OdNYr+LGJqUJgGWTSb1am26lU9lyHHHzVIhp9I9K4CXYq93POHCSore0M0c
PgAHPTkUN6WJvxachkEXwftzYaOLvJOqP+GFj+QvsVg=
-----END SIGNATURE-----
```

Milestone: metrics-lib 1.9.0

Issue 22122: Add support for six new key-value pairs added by OnionPerf
https://gitlab.torproject.org/tpo/network-health/metrics/library/-/issues/22122
Updated: 2020-06-27T14:23:38Z. Reported by Karsten Loesing.

OnionPerf adds six new key-value pairs to the .tpf format that Torperf/CollecTor did not produce: `ENDPOINTLOCAL`, `ENDPOINTPROXY`, `ENDPOINTREMOTE`, `HOSTNAMELOCAL`, `HOSTNAMEREMOTE`, and `SOURCEADDRESS`. We should add support for these keys to metrics-lib, so that we can start using their values.

Milestone: metrics-lib 1.7.0

Issue 21934: Allow extra arguments in lines containing comma-separated key-value lists
https://gitlab.torproject.org/tpo/network-health/metrics/library/-/issues/21934
Updated: 2020-06-27T14:23:38Z. Reported by Karsten Loesing.

In the context of simulating changes to directory-request statistics I added a few extra arguments to `dirreq-v3-resp` lines to include parameters like `bin_size=8`. But metrics-lib refuses to parse those descriptors, throwing a `DescriptorParseException`. I think this violates the [specification](https://gitweb.torproject.org/torspec.git/tree/dir-spec.txt#n255) which says:
```
For forward compatibility, each item MUST allow extra arguments at the
end of the line unless otherwise noted. So if an item's description below
is given as:
"thing" int int int NL
then implementations SHOULD accept this string as well:
"thing 5 9 11 13 16 12" NL
but not this string:
"thing 5" NL
and not this string:
"thing 5 10 thing" NL
.
Whenever an item DOES NOT allow extra arguments, we will tag it with
"no extra arguments".
```
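The rule can be illustrated with a self-contained sketch: a hypothetical parser for the spec's `"thing" int int int NL` example that reads exactly the expected arguments and ignores anything after them. This is illustrative only, not the metrics-lib parser.

```java
// Sketch of the forward-compatibility rule: accept extra arguments after
// the expected ones instead of rejecting the whole line.
class ExtraArgsSketch {

  static int[] parseThing(String line) {
    String[] parts = line.split(" ");
    if (parts.length < 4) {
      throw new IllegalArgumentException("Too few arguments: " + line);
    }
    int[] values = new int[3];
    for (int i = 0; i < 3; i++) {
      // Anything beyond parts[3] is silently ignored.
      values[i] = Integer.parseInt(parts[i + 1]);
    }
    return values;
  }

  public static void main(String[] args) {
    int[] v = parseThing("thing 5 9 11 13 16 12");
    System.out.println(v[0] + "," + v[1] + "," + v[2]);
  }
}
```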
I put in a quick fix which might even become the real fix:
```
diff --git a/src/main/java/org/torproject/descriptor/impl/ParseHelper.java b/src/main/java/org/torproject/descriptor/impl/ParseHelper.java
index f73a591..2de05b5 100644
--- a/src/main/java/org/torproject/descriptor/impl/ParseHelper.java
+++ b/src/main/java/org/torproject/descriptor/impl/ParseHelper.java
@@ -341,10 +341,6 @@ public class ParseHelper {
if (partsNoOpt.length < index) {
throw new DescriptorParseException("Line '" + line + "' does not "
+ "contain a key-value list at index " + index + ".");
- } else if (partsNoOpt.length > index + 1 ) {
- throw new DescriptorParseException("Line '" + line + "' contains "
- + "unrecognized values beyond the expected key-value list at "
- + "index " + index + ".");
} else if (partsNoOpt.length > index) {
if (!commaSeparatedKeyValueListPatterns.containsKey(keyLength)) {
String keyPattern = "[0-9a-zA-Z?<>\\-_]"
```
I can confirm that this solves my issue. But does it produce any new issues by accepting lines that it shouldn't accept? And are there similar issues where we're throwing an exception instead of simply ignoring extra arguments?

Milestone: metrics-lib 1.7.0

Issue 21932: Stop relying on the platform's default charset
https://gitlab.torproject.org/tpo/network-health/metrics/library/-/issues/21932
Updated: 2020-06-27T14:23:38Z. Reported by Karsten Loesing.

While looking into the encoding issue of different Onionoo instances producing different contact string encodings (legacy/trac#15813), I tracked down this issue to metrics-lib's `ServerDescriptorImpl.java` class and its usage of `new String(byte[])`.
The issue is that the constructor above uses "the platform's default charset". Turns out that the main Onionoo instance uses `US-ASCII` as default charset (`Charset.defaultCharset()`) and the mirror uses `UTF-8`. (Interestingly, the mirror only uses `UTF-8` for commands executed by cron and also uses `US-ASCII` for commands directly executed by my user, so the default would change depending on whether Onionoo's updater was started automatically after a reboot or started manually by the user; which made debugging just a bit more challenging!)
Long story short, we should not rely on the platform's default charset when converting bytes to strings or vice versa, but we should explicitly specify the charset we want! We just need to pick one.
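A self-contained demonstration of why the explicit-charset constructor is the safe choice (the byte values below are just an example: the UTF-8 encoding of "é"):

```java
import java.nio.charset.StandardCharsets;

// Sketch: new String(byte[]) decodes with the platform's default charset,
// so the same bytes can decode differently on different machines.
// new String(byte[], Charset) is deterministic everywhere.
class CharsetSketch {
  public static void main(String[] args) {
    byte[] raw = {(byte) 0xC3, (byte) 0xA9};  // "é" encoded as UTF-8
    String explicit = new String(raw, StandardCharsets.UTF_8);
    // One character when decoded as UTF-8; a US-ASCII or ISO-8859-1
    // default charset would instead yield two (replacement) characters.
    System.out.println(explicit.length());
  }
}
```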
Somewhat related, I ran an analysis of character encodings in relay server descriptors two weeks ago. Here's what I found:
```
$ wget https://collector.torproject.org/archive/relay-descriptors/server-descriptors/server-descriptors-2017-02.tar.xz
$ tar xf server-descriptors-2017-02.tar.xz
$ find server-descriptors-2017-02 -type f -exec file --mime {} \; > mimes
$ cut -d" " -f3 mimes | sort | uniq -c
68 charset=iso-8859-1
466900 charset=us-ascii
1145 charset=utf-8
```
I'd say let's just pretend that server descriptors are UTF-8 encoded. In this case, the following patch will resolve the issue for server descriptors:
```
diff --git a/src/main/java/org/torproject/descriptor/impl/ServerDescriptorImpl.java b/src/main/java/org/torproject/descriptor/impl/ServerDescriptorImpl.java
index 309cad4..2381378 100644
--- a/src/main/java/org/torproject/descriptor/impl/ServerDescriptorImpl.java
+++ b/src/main/java/org/torproject/descriptor/impl/ServerDescriptorImpl.java
@@ -8,6 +8,7 @@ import org.torproject.descriptor.DescriptorParseException;
import org.torproject.descriptor.ServerDescriptor;
import java.io.UnsupportedEncodingException;
+import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
@@ -56,8 +57,8 @@ public abstract class ServerDescriptorImpl extends DescriptorImpl
}
private void parseDescriptorBytes() throws DescriptorParseException {
- Scanner scanner = new Scanner(new String(this.rawDescriptorBytes))
- .useDelimiter("\n");
+ Scanner scanner = new Scanner(new String(this.rawDescriptorBytes,
+ StandardCharsets.UTF_8)).useDelimiter("\n");
String nextCrypto = "";
List<String> cryptoLines = null;
while (scanner.hasNext()) {
```
If this sounds like a reasonable plan, we should look into other places in the code where we use methods relying on the platform's default charset and explicitly specify a charset there, too.

Milestone: metrics-lib 2.0.0

Issue 21890: Don't skip unrecognized lines in certain cases
https://gitlab.torproject.org/tpo/network-health/metrics/library/-/issues/21890
Updated: 2020-06-27T14:23:38Z. Reported by Karsten Loesing.

When we started using Java 7's switch-on-String in 2b4d773, we broke unrecognized-line parsing in extra-info descriptors. Namely, when we reached the end of a crypto block we didn't reset the list for collecting crypto lines. So far so good, but any following unrecognized lines would be collected as crypto lines and later discarded, rather than being added to the unrecognized-lines list and later reported.
This only affects relay descriptors, because sanitized bridge descriptors don't contain crypto blocks. And it only affects relay descriptors with crypto blocks, like "identity-ed25519", whereas relay extra-info descriptors published by older versions were not affected.
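The missing reset can be sketched in isolation (a simplified stand-in for the parsing loop, not the actual extra-info descriptor implementation):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the fix: null out the crypto-line buffer when a crypto block
// ends, so later unrecognized lines are reported instead of silently
// collected as crypto lines and discarded.
class CryptoBlockSketch {
  public static void main(String[] args) {
    List<String> cryptoLines = null;
    List<String> unrecognizedLines = new ArrayList<>();
    String[] lines = {
        "-----BEGIN ED25519 CERT-----",
        "AAAA",
        "-----END ED25519 CERT-----",
        "totally-unknown-line 42"};
    for (String line : lines) {
      if (line.startsWith("-----BEGIN ")) {
        cryptoLines = new ArrayList<>();
      } else if (line.startsWith("-----END ")) {
        cryptoLines = null;  // the reset that was missing
      } else if (cryptoLines != null) {
        cryptoLines.add(line);  // inside a crypto block
      } else {
        unrecognizedLines.add(line);  // outside: report, don't drop
      }
    }
    System.out.println(unrecognizedLines.size());
  }
}
```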
Branch follows in a second.

Milestone: metrics-lib 1.7.0

Issue 21751: Use multiple threads to parse descriptors
https://gitlab.torproject.org/tpo/network-health/metrics/library/-/issues/21751
Updated: 2020-06-27T14:23:38Z. Reported by Karsten Loesing.

The following idea came up when I looked a bit into legacy/trac#17831 to speed up metrics-lib.
When we read and parse descriptors from disk, we're using a single thread to read and parse descriptors. It's a daemon thread and not the application's main thread, so if the application's thread is busy processing parsed descriptors we're at least using two threads. But we could parallelize even more by using separate threads for reading and parsing and even using multiple threads for reading and/or for parsing. I'll leave the I/O part to legacy/trac#17831 and focus on the multi-threaded parsing part here.
I wrote a little patch that measures time spent on reading tarball contents in `DescriptorReaderImpl#readTarballs()` and then extended that by moving descriptor parsing code to a separate class that implements `Runnable` and that gets executed by an `ExecutorService`. I initialized that executor with `Executors.newFixedThreadPool(n)` for `n = [2, 4, 8, 16, 32, 64]`. I also tried `n = 1`, but ran out of memory due to a major issue in my simple patch: it reads _all_ tarball contents to memory when creating `Task` instances even if they cannot be executed anytime soon. What we should do is block the reader thread when it realizes that the executor is already full. I'm attaching my patch, but only to avoid starting from zero the next time. It needs more work.
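The bounded-executor idea can be sketched in isolation. In this sketch, `CallerRunsPolicy` provides back-pressure by making the submitting (reader) thread run a task itself whenever the queue is full, which throttles reading instead of buffering everything in memory; the pool size of 8 and the no-op tasks standing in for parsing are assumptions for illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: fixed parser pool with a bounded task queue and back-pressure.
class BoundedParserPoolSketch {
  public static void main(String[] args) throws InterruptedException {
    ThreadPoolExecutor executor = new ThreadPoolExecutor(8, 8,
        0L, TimeUnit.MILLISECONDS, new ArrayBlockingQueue<Runnable>(16),
        new ThreadPoolExecutor.CallerRunsPolicy());
    final AtomicInteger parsed = new AtomicInteger();
    for (int i = 0; i < 100; i++) {
      executor.execute(parsed::incrementAndGet);  // stand-in for parsing
    }
    executor.shutdown();
    executor.awaitTermination(10, TimeUnit.SECONDS);
    System.out.println(parsed.get());
  }
}
```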
| **separate parser threads** | **read `.tar` file (s)** | **parse `.tar` file (s)** | **read `.tar.xz` file (s)** | **parse `.tar.xz` file (s)** |
|-----------------------------|--------------------------|---------------------------|-----------------------------|------------------------------|
| none (current code) | 35 | 159 | 9 | 162 |
| 2 | 36 | 42 | 8 | 126 |
| 4 | 41 | 13 | 7 | 96 |
| 8 | 42 | 11 | 6 | 35 |
| 16 | 41 | 11 | 10 | 28 |
| 32 | 45 | 13 | 7 | 34 |
| 64 | 41 | 13 | 6 | 38 |
These results show that 4 threads speed up the parse time for `.tar` files by a **factor 12** after which there's no visible improvement, and 8 threads speed up the parse time for `.tar.xz` files by a **factor 4.6**. Just from these numbers I'd suggest using 8 threads by default and making this number configurable for the application. But: needs more work.
My recommendation would be to look more into making parsing multi-threaded and save legacy/trac#17831 for later. It seems like parsing is the lower-hanging fruit.
Note that reading the same tarball in extracted form using the current code took 271 seconds. In that case the lower-hanging fruit might be I/O improvements, not multi-threaded parsing. But my hope is that not many applications extract tarballs containing over 800,000 files and read them using `DescriptorReader`, especially not if they could as well read the tarball directly.
Suggestions welcome! Otherwise I might pick this up again and move it forward whenever there's time.

Issue 21469: Exclude index package from Javadocs
https://gitlab.torproject.org/tpo/network-health/metrics/library/-/issues/21469
Updated: 2020-06-27T14:23:38Z. Reported by Karsten Loesing.

I just found that we're not excluding the index package from generated Javadocs, though the classes there are not part of the API but part of the implementation. We should exclude that package just like we exclude the impl package. Dirty patch:
```
diff --git a/build.xml b/build.xml
index a870eca..136c15b 100644
--- a/build.xml
+++ b/build.xml
@@ -8,7 +8,7 @@
<property name="release.version" value="1.5.0-dev" />
<property name="javadoc-title" value="DescripTor API Documentation"/>
- <property name="javadoc-excludes" value="**/impl/**" />
+ <property name="javadoc-excludes" value="**/i*/**" />
<property name="implementation-title" value="DescripTor" />
<property name="name" value="descriptor" />
<property name="jarpatternprop" value="empty" />
```
Maybe there's a better way to achieve this? (Listing both packages explicitly, e.g. `value="**/impl/**,**/index/**"`, would avoid matching every package whose name starts with "i".)

Milestone: metrics-lib 1.6.0

Issue 21379: Metrics-lib web-site
https://gitlab.torproject.org/tpo/network-health/metrics/library/-/issues/21379
Updated: 2020-06-27T14:23:39Z. Reported by iwakeh.

Quote from meeting:
[this should contain]
* an overview what it is with a link to releases,
* javadocs,
* tutorials.

Issue 21365: Investigate whether descriptor parsing is guaranteed to be thread-safe
https://gitlab.torproject.org/tpo/network-health/metrics/library/-/issues/21365
Updated: 2020-06-27T14:23:39Z. Reported by iwakeh.

Partially discovered in legacy/trac#20440 and should be thoroughly verified.
Class affected: `BlockingIteratorImpl`.

Issue 21144: remove 604 checkstyle complaints from metrics-lib
https://gitlab.torproject.org/tpo/network-health/metrics/library/-/issues/21144
Updated: 2020-06-27T14:23:39Z. Reported by iwakeh.

Mostly for test classes.
Run `ant checks` to see the items.

Assignee: Karsten Loesing.

Issue 20765: adapt to new lines in votes and consensus and make the adaptation to protocol changes easier
https://gitlab.torproject.org/tpo/network-health/metrics/library/-/issues/20765
Updated: 2020-06-27T14:23:39Z. Reported by iwakeh.

New protocol versions 23 and 25 introduce new lines for votes and consensus; e.g. [vote from moria](https://collector.sky-ip.org/recent/relay-descriptors/votes/2016-11-25-12-00-00-vote-D586D18309DED4CD6D57C18FDB97EFA96D330566-5858C65AC36DE37F3CA8EE03059A8730A97D31B8)
```
...
recommended-relay-protocols Cons=1-2 Desc=1-2 DirCache=1 HSDir=1 HSIntro=3 HSRend=1 Link=4 LinkAuth=1 Microdesc=1-2 Relay=2
recommended-client-protocols Cons=1-2 Desc=1-2 DirCache=1 HSDir=1 HSIntro=3 HSRend=1 Link=4 LinkAuth=1 Microdesc=1-2 Relay=2
required-relay-protocols Cons=1 Desc=1 DirCache=1 HSDir=1 HSIntro=3 HSRend=1 Link=3-4 LinkAuth=1 Microdesc=1 Relay=1-2
required-client-protocols Cons=1-2 Desc=1-2 DirCache=1 HSDir=1 HSIntro=3 HSRend=1 Link=4 LinkAuth=1 Microdesc=1-2 Relay=2
...
shared-rand-participate
shared-rand-commit 1 sha3-256 D586D18309DED4CD6D57C18FDB97EFA96D330566 AAAAAFg3fwANZSmhaatp83nojq97N/eLHSCShOFJqiR1Skc9lO/dXA== AAAAAFg3fwDxNhND8l9+/S4fxn+yeCKNgaZp3yJ8qWSkg8NICmZ+PA==
```
These should not be treated as unknown lines by metrics-lib.
In addition, the process of recognizing lines should be improved to make accommodation of new protocol versions a lot easier, ideally without code changes, i.e. w/o having to compile metrics-lib for such changes.
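A self-contained sketch of parsing such protocol-version lines (illustrative only, not the metrics-lib implementation): each entry is a protocol name mapped to a comma-separated list of versions or version ranges.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Sketch: parse a line like "Cons=1-2 LinkAuth=1,3" into a map from
// protocol name to the set of supported version numbers.
class ProtocolLineSketch {

  static Map<String, SortedSet<Integer>> parse(String line) {
    Map<String, SortedSet<Integer>> result = new LinkedHashMap<>();
    for (String entry : line.split(" ")) {
      String[] keyValue = entry.split("=");
      SortedSet<Integer> versions = new TreeSet<>();
      for (String range : keyValue[1].split(",")) {
        String[] bounds = range.split("-");
        int lo = Integer.parseInt(bounds[0]);
        int hi = bounds.length > 1 ? Integer.parseInt(bounds[1]) : lo;
        for (int v = lo; v <= hi; v++) {
          versions.add(v);
        }
      }
      result.put(keyValue[0], versions);
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, SortedSet<Integer>> p = parse("Cons=1-2 LinkAuth=1,3");
    System.out.println(p.get("Cons").size() + " " + p.get("LinkAuth"));
  }
}
```

A table-driven parser along these lines is one way to accommodate new protocol keywords without recompiling metrics-lib, which is what the ticket asks for.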
(Should be tackled together with legacy/trac#17861, legacy/trac#19640, legacy/trac#19607.)

Milestone: metrics-lib 1.6.0