
Legacy Trac issue #13600 · Closed (moved)

Opened Oct 29, 2014 by Karsten Loesing (@karsten)

Improve bulk imports of descriptor archives

We need to improve bulk imports of descriptor archives. Whenever somebody wants to initialize Onionoo with existing data, they'll need to process years of descriptors. The current code is not optimized for that at all; it is designed to run once per hour and update things as quickly as possible. Let's fix that and support bulk imports better.

Here's what we should do:

  • We define a new directory in/archive/ where operators can put descriptor archives fetched from CollecTor. Whenever there are files in that directory, we import them first (before descriptors in in/recent/). In particular, we iterate over files twice: in the first iteration we look at the first contained descriptor to determine the file's type, and in the second iteration we parse files containing server descriptors and then files containing other descriptors. (This order is important for computing advertised bandwidth fractions, which only works if we parse server descriptors before consensuses.) This process will take a long time, so we should log whenever we complete a tarball, and ideally print how many tarballs we have already parsed and how many remain.
  • We add a new command-line switch --update-only for only updating status files and not downloading descriptors or writing document files. Operators could then import archives, which would take days or even weeks, and afterwards switch to downloading and processing recent descriptors. My branch task-12651-2 is a major improvement here, because it ensures that all documents will be written once the bulk import is done, not just the ones for relays and bridges that were contained in recent descriptors. Future command-line options would be --download-only and --write-only for the other two phases, and --single-run to do what is currently the default once we switch from being called by cron every hour to scheduling our own hourly runs internally.

I somewhat expect us to run into memory problems when importing months or even years of data at once. So, part of the challenge here will be to keep an eye on memory usage and fix any memory issues.

Reference: legacy/trac#13600