Stop relying on the platform's default charset
While looking into why different Onionoo instances produce different contact string encodings (legacy/trac#15813 (moved)), I tracked the issue down to metrics-lib's ServerDescriptorImpl.java class and its use of new String(byte[]).
The issue is that the constructor above uses "the platform's default charset". It turns out that the main Onionoo instance uses US-ASCII as its default charset (Charset.defaultCharset()) and the mirror uses UTF-8. (Interestingly, the mirror only uses UTF-8 for commands executed by cron and uses US-ASCII for commands executed directly by my user, so the default would change depending on whether Onionoo's updater was started automatically after a reboot or manually by the user, which made debugging just a bit more challenging!)
Long story short, we should not rely on the platform's default charset when converting bytes to strings or vice versa, but we should explicitly specify the charset we want! We just need to pick one.
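To illustrate the point above, here is a minimal sketch of how the same bytes decode differently depending on the charset. The class name ExplicitCharsetDemo, the helper decodeUtf8 and the sample contact bytes are made up for illustration and are not part of metrics-lib.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of metrics-lib.
final class ExplicitCharsetDemo {

  /** Decodes bytes as UTF-8, independent of the platform's default charset. */
  static String decodeUtf8(byte[] bytes) {
    return new String(bytes, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    // Made-up contact bytes: "Jos" followed by the UTF-8 encoding of U+00E9.
    byte[] raw = {'J', 'o', 's', (byte) 0xC3, (byte) 0xA9};

    // Explicit charsets: the result is the same on every machine.
    String asUtf8 = decodeUtf8(raw);
    String asAscii = new String(raw, StandardCharsets.US_ASCII);

    // Default charset: the result depends on how the JVM was started.
    String asDefault = new String(raw);

    System.out.println("UTF-8 length:    " + asUtf8.length());
    System.out.println("US-ASCII length: " + asAscii.length());
    System.out.println("default matches UTF-8: " + asDefault.equals(asUtf8));
  }
}
```

With US-ASCII, each of the two non-ASCII bytes is replaced by U+FFFD, so the two explicitly decoded strings differ in both length and content. The third line of output is the one that would vary between the main instance and the mirror.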
Somewhat related, I ran an analysis of character encodings in relay server descriptors two weeks ago. Here's what I found:
$ wget https://collector.torproject.org/archive/relay-descriptors/server-descriptors/server-descriptors-2017-02.tar.xz
$ tar xf server-descriptors-2017-02.tar.xz
$ find server-descriptors-2017-02 -type f -exec file --mime {} \; > mimes
$ cut -d" " -f3 mimes | sort | uniq -c
68 charset=iso-8859-1
466900 charset=us-ascii
1145 charset=utf-8
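For anyone who wants to repeat a similar check from Java, the following is a rough sketch of a three-way classification. It is not how file --mime works internally, and it will not reproduce the numbers above exactly. The class CharsetGuess and the label "other" are my own invention.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, only a rough approximation of what file --mime reports.
final class CharsetGuess {

  /** Returns "us-ascii", "utf-8" or "other" for the given descriptor bytes. */
  static String classify(byte[] bytes) {
    boolean ascii = true;
    for (byte b : bytes) {
      if (b < 0) { // Bytes 0x80..0xFF are negative in Java and never ASCII.
        ascii = false;
        break;
      }
    }
    if (ascii) {
      return "us-ascii";
    }
    // Strict decoder: report malformed input instead of silently replacing it.
    CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);
    try {
      decoder.decode(ByteBuffer.wrap(bytes));
      return "utf-8";
    } catch (CharacterCodingException e) {
      return "other"; // For example ISO-8859-1 with non-ASCII characters.
    }
  }
}
```

Since US-ASCII is a subset of UTF-8, only descriptors in the "other" bucket would decode differently under UTF-8. With new String(bytes, StandardCharsets.UTF_8), their malformed bytes would be replaced by U+FFFD rather than causing an exception.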
I'd say let's just pretend that server descriptors are UTF-8 encoded. In this case, the following patch will resolve the issue for server descriptors:
diff --git a/src/main/java/org/torproject/descriptor/impl/ServerDescriptorImpl.java b/src/main/java/org/torproject/descriptor/impl/ServerDescriptorImpl.java
index 309cad4..2381378 100644
--- a/src/main/java/org/torproject/descriptor/impl/ServerDescriptorImpl.java
+++ b/src/main/java/org/torproject/descriptor/impl/ServerDescriptorImpl.java
@@ -8,6 +8,7 @@ import org.torproject.descriptor.DescriptorParseException;
 import org.torproject.descriptor.ServerDescriptor;
 
 import java.io.UnsupportedEncodingException;
+import java.nio.charset.StandardCharsets;
 import java.security.MessageDigest;
 import java.security.NoSuchAlgorithmException;
 import java.util.ArrayList;
@@ -56,8 +57,8 @@ public abstract class ServerDescriptorImpl extends DescriptorImpl
   }
 
   private void parseDescriptorBytes() throws DescriptorParseException {
-    Scanner scanner = new Scanner(new String(this.rawDescriptorBytes))
-        .useDelimiter("\n");
+    Scanner scanner = new Scanner(new String(this.rawDescriptorBytes,
+        StandardCharsets.UTF_8)).useDelimiter("\n");
     String nextCrypto = "";
     List<String> cryptoLines = null;
     while (scanner.hasNext()) {
If this sounds like a reasonable plan, we should look into other places in the code where we use methods relying on the platform's default charset and explicitly specify a charset there, too.
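As a starting point for that search, here is a sketch of two other common JDK calls that fall back to the platform's default charset, String.getBytes() and new InputStreamReader(InputStream), next to their explicit counterparts. I have not checked which of these metrics-lib actually uses. The class ExplicitCharsetIo and its methods are hypothetical.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical examples, not taken from metrics-lib.
final class ExplicitCharsetIo {

  /** Encodes text as UTF-8 instead of calling text.getBytes(). */
  static byte[] encode(String text) {
    return text.getBytes(StandardCharsets.UTF_8);
  }

  /** Reads the first line as UTF-8 instead of using new InputStreamReader(in). */
  static String readFirstLine(InputStream in) {
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(in, StandardCharsets.UTF_8))) {
      return reader.readLine();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    // Made-up descriptor-like content with one non-ASCII character.
    byte[] bytes = encode("contact Jos\u00e9\nplatform Tor");
    String firstLine = readFirstLine(new ByteArrayInputStream(bytes));
    System.out.println("round trip ok: " + firstLine.equals("contact Jos\u00e9"));
  }
}
```

Other candidates worth grepping for include FileReader, FileWriter, PrintWriter and Scanner constructors that take no charset argument.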