Issue #30369

Closed (moved)
Created May 02, 2019 by Karsten Loesing (@karsten)

Fix regular expression in descriptor parser to correctly recognize bandwidth files

We're using a regular expression on the first 100 characters of a descriptor to recognize bandwidth files. More specifically, if a descriptor starts with ten digits followed by a newline, we parse it as a bandwidth file. (This is ugly, but the legacy bandwidth file format doesn't give us much of a choice.)

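This regular expression is broken. Java's String.matches() only succeeds if the pattern matches the entire string, which here is the first 100 characters of the descriptor. Our pattern covers only the ten digits and the newline, so it fails whenever anything follows that first line.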

Suggested fix:

diff --git a/src/main/java/org/torproject/descriptor/impl/DescriptorParserImpl.java b/src/main/java/org/torproject/descriptor/impl/DescriptorParserImpl.java
index 119fe09..08ac909 100644
--- a/src/main/java/org/torproject/descriptor/impl/DescriptorParserImpl.java
+++ b/src/main/java/org/torproject/descriptor/impl/DescriptorParserImpl.java
@@ -132,7 +132,7 @@ public class DescriptorParserImpl implements DescriptorParser {
           sourceFile);
     } else if (fileName.contains(LogDescriptorImpl.MARKER)) {
       return LogDescriptorImpl.parse(rawDescriptorBytes, sourceFile, fileName);
-    } else if (firstLines.matches("^[0-9]{10}\\n")) {
+    } else if (firstLines.matches("(?s)[0-9]{10}\\n.*")) {
       /* Identifying bandwidth files by a 10-digit timestamp in the first line
        * breaks with files generated before 2002 or after 2286 and when the next
        * descriptor identifier starts with just a timestamp in the first line

Explanation:

  • We don't need to start the pattern with ^, because the regular expression needs to match the whole string anyway.
  • The (?s) part enables the dotall mode: "In dotall mode, the expression . matches any character, including a line terminator. By default this expression does not match line terminators. Dotall mode can also be enabled via the embedded flag expression (?s). (The s is a mnemonic for "single-line" mode, which is what this is called in Perl.)"
  • We need to end the pattern with .* to match any characters following the first newline, which also includes newlines due to the previously enabled dotall mode.
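
The three points above can be checked with a small standalone sketch. It is not code from metrics-lib: the class name, the helper method, and the sample input are made up for illustration, and the digit class is written as [0-9], which is an assumption based on the "ten digits" description above.

```java
public class BandwidthFilePatternDemo {

  /* Pattern before the fix: with String.matches() it only accepts a string
   * that consists of exactly ten digits and one newline. */
  static final String OLD_PATTERN = "^[0-9]{10}\\n";

  /* Pattern after the fix: dotall mode plus a trailing .* so that the
   * remainder of the string, including further newlines, is matched. */
  static final String NEW_PATTERN = "(?s)[0-9]{10}\\n.*";

  /* Same as the fix, but without (?s), to show why dotall mode is needed. */
  static final String NO_DOTALL_PATTERN = "[0-9]{10}\\n.*";

  /* Hypothetical helper; String.matches() requires the whole input to match. */
  static boolean looksLikeBandwidthFile(String firstLines, String pattern) {
    return firstLines.matches(pattern);
  }

  public static void main(String[] args) {
    /* Made-up input: a 10-digit timestamp line followed by more lines. */
    String firstLines = "1556800000\nsome line\nanother line\n";

    /* false: the old pattern does not cover the text after the first line. */
    System.out.println(looksLikeBandwidthFile(firstLines, OLD_PATTERN));

    /* false: without dotall mode, .* stops at the next newline. */
    System.out.println(looksLikeBandwidthFile(firstLines, NO_DOTALL_PATTERN));

    /* true: the fixed pattern matches the whole string. */
    System.out.println(looksLikeBandwidthFile(firstLines, NEW_PATTERN));

    /* false: input that does not start with ten digits and a newline. */
    System.out.println(looksLikeBandwidthFile("@type foo 1.0\nbar\n", NEW_PATTERN));
  }
}
```

The old pattern would only have matched an input consisting of nothing but the timestamp line, which is why real bandwidth files were not recognized.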

I'll create a branch for this in a minute.
