Closed (moved)
Opened Mar 15, 2017 by Karsten Loesing@karsten

Use multiple threads to parse descriptors

The following idea came up when I looked a bit into #17831 (moved) to speed up metrics-lib.

When we read and parse descriptors from disk, a single thread does both the reading and the parsing. It's a daemon thread and not the application's main thread, so if the application's main thread is busy processing parsed descriptors, we're using at least two threads. But we could parallelize further by using separate threads for reading and parsing, and even multiple threads for reading and/or parsing. I'll leave the I/O part to #17831 (moved) and focus on multi-threaded parsing here.

I wrote a little patch that measures the time spent reading tarball contents in DescriptorReaderImpl#readTarballs(), and then extended it by moving the descriptor parsing code to a separate class that implements Runnable and gets executed by an ExecutorService. I initialized that executor with Executors.newFixedThreadPool(n) for n = [2, 4, 8, 16, 32, 64]. I also tried n = 1, but ran out of memory due to a major issue in my simple patch: it reads all tarball contents into memory when creating Task instances, even if they cannot be executed anytime soon. What we should do is block the reader thread when it realizes that the executor is already full. I'm attaching my patch, but only to avoid starting from zero next time. It needs more work.
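The "block the reader thread when the executor is full" idea could look roughly like the following. This is only a sketch and not the attached patch: the class name, the parse() stand-in, and the maxInFlight parameter are all made up for illustration, and a Semaphore is just one way to apply back-pressure. The reader acquires a permit before submitting each entry and a parser thread releases it when done, so at most maxInFlight entries are held in memory at any time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedParserSketch {

  /** Hypothetical stand-in for descriptor parsing; just returns the size. */
  public static int parse(byte[] rawDescriptorBytes) {
    return rawDescriptorBytes.length;
  }

  /**
   * Hands entries to a fixed thread pool, but never lets more than
   * maxInFlight entries wait for or undergo parsing at the same time.
   * Returns the sum of all parse() results.
   */
  public static int parseAll(Iterable<byte[]> entries, int threads,
      int maxInFlight) throws InterruptedException {
    ExecutorService executor = Executors.newFixedThreadPool(threads);
    Semaphore permits = new Semaphore(maxInFlight);
    AtomicInteger parsedBytes = new AtomicInteger();
    try {
      for (byte[] entry : entries) {
        // The reader thread blocks here while all permits are taken,
        // so it stops pulling further entries into memory.
        permits.acquire();
        executor.execute(() -> {
          try {
            parsedBytes.addAndGet(parse(entry));
          } finally {
            permits.release();
          }
        });
      }
    } finally {
      executor.shutdown();
      executor.awaitTermination(1, TimeUnit.HOURS);
    }
    return parsedBytes.get();
  }

  public static void main(String[] args) throws InterruptedException {
    // In real use the Iterable would read tarball entries lazily; a
    // pre-filled list is used here only to keep the sketch self-contained.
    List<byte[]> entries = new ArrayList<>();
    for (int i = 0; i < 1000; i++) {
      entries.add(new byte[100]);
    }
    System.out.println(parseAll(entries, 8, 16));
  }
}
```

With this shape, even n = 1 should no longer exhaust memory, because the reader can only run maxInFlight entries ahead of the parser.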

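| separate parser threads | read .tar file (s) | parse .tar file (s) | read .tar.xz file (s) | parse .tar.xz file (s) |
|---|---|---|---|---|
| none (current code) | 35 | 159 | 9 | 162 |
| 2 | 36 | 42 | 8 | 126 |
| 4 | 41 | 13 | 7 | 96 |
| 8 | 42 | 11 | 6 | 35 |
| 16 | 41 | 11 | 10 | 28 |
| 32 | 45 | 13 | 7 | 34 |
| 64 | 41 | 13 | 6 | 38 |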

These results show that 4 threads speed up the parse time for .tar files by a factor of 12 (from 159 to 13 seconds), after which there's no visible improvement, and 8 threads speed up the parse time for .tar.xz files by a factor of 4.6 (from 162 to 35 seconds). Just from these numbers I'd suggest using 8 threads by default and making this number configurable for the application. But: needs more work.
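For reference, the speedup factors are simply the single-threaded parse time divided by the parse time with n parser threads. A small sketch that recomputes them from the measurements in this ticket (the class and method names are made up for illustration):

```java
import java.util.Locale;

public class ParseSpeedup {

  /** Speedup factor: baseline parse time divided by multi-threaded parse time. */
  public static double speedup(double baselineSeconds, double seconds) {
    return baselineSeconds / seconds;
  }

  public static void main(String[] args) {
    // Parse times in seconds, copied from the measurements in this ticket.
    int[] threads = {2, 4, 8, 16, 32, 64};
    double[] parseTar = {42, 13, 11, 11, 13, 13};
    double[] parseTarXz = {126, 96, 35, 28, 34, 38};
    double baselineTar = 159;
    double baselineTarXz = 162;
    for (int i = 0; i < threads.length; i++) {
      System.out.printf(Locale.ROOT,
          "%2d threads: .tar x%.1f, .tar.xz x%.1f%n",
          threads[i],
          speedup(baselineTar, parseTar[i]),
          speedup(baselineTarXz, parseTarXz[i]));
    }
  }
}
```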

My recommendation would be to look more into making parsing multi-threaded and save #17831 (moved) for later. It seems like parsing is the lower-hanging fruit.

Note that reading the same tarball in extracted form using the current code took 271 seconds. In that case the lower-hanging fruit might be I/O improvements, not multi-threaded parsing. But my hope is that not many applications extract tarballs containing over 800,000 files and read them using DescriptorReader, especially not if they could as well read the tarball directly.

Suggestions welcome! Otherwise I might pick this up again and move it forward whenever there's time.

Reference: legacy/trac#21751