Skip to content

Fix regular expression in descriptor parser to correctly recognize bandwidth files

We're using a regular expression on the first 100 characters of a descriptor to recognize bandwidth files. More specifically, if a descriptor starts with ten digits followed by a newline, we parse it as a bandwidth file. (This is ugly, but the legacy bandwidth file format doesn't give us much of a choice.)

This regular expression is broken. The regular expression we want is one that matches the first 100 characters of a descriptor, which ours didn't do.

Suggested fix:

diff --git a/src/main/java/org/torproject/descriptor/impl/DescriptorParserImpl.java b/src/main/java/org/torproject/descriptor/impl/DescriptorParserImpl.java
index 119fe09..08ac909 100644
--- a/src/main/java/org/torproject/descriptor/impl/DescriptorParserImpl.java
+++ b/src/main/java/org/torproject/descriptor/impl/DescriptorParserImpl.java
@@ -132,7 +132,7 @@ public class DescriptorParserImpl implements DescriptorParser {
           sourceFile);
     } else if (fileName.contains(LogDescriptorImpl.MARKER)) {
       return LogDescriptorImpl.parse(rawDescriptorBytes, sourceFile, fileName);
-    } else if (firstLines.matches("^[None..None](../compare/None...None){10}\\n")) {
+    } else if (firstLines.matches("(?s)[None..None](../compare/None...None){10}\\n.*")) {
       /* Identifying bandwidth files by a 10-digit timestamp in the first line
        * breaks with files generated before 2002 or after 2286 and when the next
        * descriptor identifier starts with just a timestamp in the first line

Explanation:

  • We don't need to start the pattern with ^, because the regular expression needs to match the whole string anyway.
  • The (?s) part enables the dotall mode: "In dotall mode, the expression . matches any character, including a line terminator. By default this expression does not match line terminators. Dotall mode can also be enabled via the embedded flag expression (?s). (The s is a mnemonic for "single-line" mode, which is what this is called in Perl.)"
  • We need to end the pattern with .* to match any characters following the first newline, which also includes newlines due to the previously enabled dotall mode.

I'll create a branch for this in a minute.

To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information