Fix deserialization of UTF-8 characters in details statuses and documents
While looking into the encoding issue of different Onionoo instances producing different contact string encodings (legacy/trac#15813 (moved)), I tracked down a somewhat related issue with UTF-8 characters all being converted to `?`.
The issue is related to how we avoid storing UTF-8 characters in details statuses and details documents by escaping those characters instead. We do this correctly in the serialization part but incorrectly in the deserialization part.
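For context, escaping on the serialization side means turning each non-ASCII character into its `\uxxxx` form before writing, so the file itself is pure ASCII. A minimal sketch of that idea (my own illustration; the actual Onionoo code may implement the escaping differently):

```java
import org.apache.commons.lang3.StringEscapeUtils;

public class EscapeDemo {
  public static void main(String[] args) {
    // escapeJava turns every character outside printable ASCII into a
    // \uxxxx sequence, so the result is safe to write as US-ASCII.
    String escaped = StringEscapeUtils.escapeJava("Jérôme");
    System.out.println(escaped);  // J\u00E9r\u00F4me
  }
}
```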
We have two choices here: we could either give up on escaping and just store UTF-8 characters directly, or we could fix the deserialization part. I have a fix for the latter and ran out of time for the former, but maybe the former would be the better fix. I'm including my fix here anyway:
```diff
diff --git a/src/main/java/org/torproject/onionoo/docs/DocumentStore.java b/src/main/java/org/torproject/onionoo/docs/DocumentStore.java
index 39d6271..b6b1c4c 100644
--- a/src/main/java/org/torproject/onionoo/docs/DocumentStore.java
+++ b/src/main/java/org/torproject/onionoo/docs/DocumentStore.java
@@ -496,22 +496,25 @@ public class DocumentStore {
     if (!parse) {
       return this.retrieveUnparsedDocumentFile(documentType,
           documentString);
-    } else if (documentType.equals(DetailsDocument.class)
-        || documentType.equals(BandwidthDocument.class)
+    } else if (documentType.equals(BandwidthDocument.class)
         || documentType.equals(WeightsDocument.class)
         || documentType.equals(ClientsDocument.class)
         || documentType.equals(UptimeDocument.class)) {
       return this.retrieveParsedDocumentFile(documentType,
           documentString);
+    } else if (documentType.equals(DetailsStatus.class)
+        || documentType.equals(DetailsDocument.class)) {
+      if (documentType.equals(DetailsStatus.class)) {
+        documentString = "{" + documentString + "}";
+      }
+      documentString = StringUtils.replace(documentString, "\\u", "\\\\u");
+      return this.retrieveParsedDocumentFile(documentType, documentString);
     } else if (documentType.equals(BandwidthStatus.class)
         || documentType.equals(WeightsStatus.class)
         || documentType.equals(ClientsStatus.class)
         || documentType.equals(UptimeStatus.class)
         || documentType.equals(UpdateStatus.class)) {
       return this.retrieveParsedStatusFile(documentType, documentString);
-    } else if (documentType.equals(DetailsStatus.class)) {
-      return this.retrieveParsedDocumentFile(documentType, "{"
-          + documentString + "}");
     } else {
       log.error("Parsing is not supported for type "
           + documentType.getName() + ".");
```
The main difference here is the `StringUtils.replace()` part. Without this line (so, current master) we would pass a string containing `\uxxxx` to `Gson.fromJson()`, which would unescape it and turn it into the corresponding UTF-8 character. So far so good. But when we later write this status or document back to disk, `DocumentStore#writeToFile` writes these bytes as US-ASCII, and that replaces all UTF-8 characters with `?`.
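To make the failure mode concrete, here is a minimal standalone sketch of that round trip (my own illustration, not Onionoo code; the `Doc` class is hypothetical):

```java
import com.google.gson.Gson;
import java.nio.charset.StandardCharsets;

public class AsciiRoundTripDemo {
  // Hypothetical minimal document type with a single contact field.
  static class Doc {
    String contact;
  }

  public static void main(String[] args) {
    // On disk the character is stored escaped, e.g. "é" as \u00e9.
    String documentString = "{\"contact\":\"J\\u00e9r\\u00f4me\"}";

    // Gson unescapes \u00e9 into the actual character 'é'.
    Doc doc = new Gson().fromJson(documentString, Doc.class);
    System.out.println(doc.contact);  // Jérôme

    // Writing the serialized form back as US-ASCII mangles every
    // non-ASCII character into '?'.
    String serialized = new Gson().toJson(doc);
    byte[] asciiBytes = serialized.getBytes(StandardCharsets.US_ASCII);
    System.out.println(new String(asciiBytes, StandardCharsets.US_ASCII));
    // {"contact":"J?r?me"}
  }
}
```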
The patch fixes this by replacing `\uxxxx` in the file content with `\\uxxxx`, which `Gson.fromJson()` will not consider an escaped UTF-8 character. We already have code in place that reverses this double-escaping, see `DetailsStatus#unescapeJson`. So, the patch fixes the problem by keeping things escaped until they are used.
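Sketched end to end, the patched read path behaves roughly like this (again my own illustration; the real `DetailsStatus#unescapeJson` may differ in detail):

```java
import com.google.gson.Gson;
import org.apache.commons.lang3.StringUtils;

public class DoubleEscapeDemo {
  static class Doc {
    String contact;
  }

  public static void main(String[] args) {
    String fromDisk = "{\"contact\":\"J\\u00e9r\\u00f4me\"}";

    // Double the escaping so that Gson no longer sees \u00e9 as an
    // escape sequence; the parsed field keeps the literal six
    // characters "\u00e9".
    String doubled = StringUtils.replace(fromDisk, "\\u", "\\\\u");
    Doc doc = new Gson().fromJson(doubled, Doc.class);
    System.out.println(doc.contact);  // J\u00e9r\u00f4me

    // Because the field now contains only ASCII, writing it back with
    // US-ASCII is lossless. The escaping is only undone at the point
    // of use, by something like DetailsStatus#unescapeJson.
  }
}
```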
Again, I think the cleaner fix would be to give up on escaping UTF-8 characters and just switch to UTF-8. The part that might make this a little harder is that we'll have to make sure it works correctly for existing files; if it does not, we'll need to special-case those somehow. But maybe the patch above helps with coming up with this cleaner fix.
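For comparison, here is a minimal sketch of what the UTF-8 variant could look like (illustrative only; the method names are hypothetical, not the actual `DocumentStore` API):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class Utf8FileIo {
  // Write the serialized document as UTF-8, so no escaping is needed.
  static void writeDocument(Path file, String documentString)
      throws IOException {
    Files.write(file, documentString.getBytes(StandardCharsets.UTF_8));
  }

  // Read it back as UTF-8.
  static String readDocument(Path file) throws IOException {
    return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
  }

  public static void main(String[] args) throws IOException {
    Path file = Paths.get("details-status.tmp");
    writeDocument(file, "{\"contact\":\"Jérôme\"}");
    System.out.println(readDocument(file));  // {"contact":"Jérôme"}
    Files.delete(file);
  }
}
```

One helpful property here: existing files that only contain escaped ASCII are themselves valid UTF-8, so they would still read correctly; files already mangled to `?` are lost either way.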