Closed (moved)
Opened Apr 28, 2015 by Karsten Loesing (@karsten)

Develop database schema to support Onionoo's search parameter efficiently

The current way for handling incoming client requests is to load all search-relevant data about relays and bridges into memory and handle requests from there. This has two major downsides: it's difficult to extend and it requires us to limit searches to relays and bridges that have been running in the past seven days. We would like to overcome both.

After some experimenting with database schemas, it seems that supporting the search parameter efficiently will be the most difficult part. Here's what it does (from https://onionoo.torproject.org/protocol.html):

Return only (1) relays with the parameter value matching (part of a) nickname, (possibly $-prefixed) beginning of a hex-encoded fingerprint, beginning of a base64-encoded fingerprint without trailing equal signs, or beginning of an IP address, (2) bridges with (part of a) nickname or (possibly $-prefixed) beginning of a hashed hex-encoded fingerprint, and (3) relays and/or bridges matching a given qualified search term. Searches by relay IP address include all known addresses used for onion routing and for exiting to the Internet. Searches for beginnings of IP addresses are performed on textual representations of canonical IP address forms, so that searches using CIDR notation or non-canonical forms will return empty results. Searches are case-insensitive, except for base64-encoded fingerprints. If multiple search terms are given, separated by spaces, the intersection of all relays and bridges matching all search terms will be returned. Complete hex-encoded fingerprints should always be hashed using SHA-1, regardless of searching for a relay or a bridge, in order to not accidentally leak non-hashed bridge fingerprints in the URL. [...]
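
To make these matching rules concrete, here are a few hypothetical example requests that the schema would have to answer efficiently (the summary document is just one of several document types the search parameter applies to; a space between two terms would be URL-encoded as %20):

https://onionoo.torproject.org/summary?search=tonga          (part of a nickname)
https://onionoo.torproject.org/summary?search=$4A0CCD2D      (beginning of a hex-encoded fingerprint)
https://onionoo.torproject.org/summary?search=82.94.251.     (beginning of an IP address)
https://onionoo.torproject.org/summary?search=tonga%2082.94  (two terms, intersection of matches)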

Before providing my experimental code (which I'd have to clean up anyway), I want to keep this discussion as open as possible and only present my general ideas for how this could be implemented:

  • In the following I'll assume that we're going to use PostgreSQL as the database. I already have some experience with it from other Tor-related projects, and our sysadmins like it more than other SQL databases. If NoSQL turns out to be superior for this use case based on some actual performance evaluations, I'm happy to consider that.

  • We'll have to support three comparison modes for the search parameter: "starts with", "starts with ignore case", and "contains as substring".

  • PostgreSQL does not support substring searches (LIKE '%foo%') efficiently out of the box, but there's an extension called pg_trgm that can "determine the similarity of text based on trigram matching". It's contained in Debian's postgresql-contrib package, so it should be available to us (a sketch of such an index follows below this list).

  • Right now, search terms are supported starting at a minimum length of 1 character. I could imagine raising that to 3 characters if it has major benefits for search efficiency (presumably a trigram-based index only helps for terms of at least three characters). Though if it doesn't, let's keep supporting searches for 1 or 2 characters.

  • I briefly experimented with a normalized database schema with a servers table containing one row per relay or bridge, an addresses table with one or more addresses per server, and a fingerprints table with original and hashed fingerprints per server (a sketch of this schema follows below this list). The performance was not very promising, because searches would have to happen in all three tables. Happy to try again if somebody has hints about what I could have done wrong.

  • I also considered (but did not test) a schema with a single servers table that encodes all fields that are relevant for the search parameter in a single string with format "lower-case-nickname#base64-fingerprint|lower-case-hex-fingerprint|lower-case-hashed-hex-fingerprint|lower-case-address1|lower-case-address2". For example, Tonga would have the combined string "tonga#SgzNLdx5lQg9c/XWZxAMilgx8W0|4a0ccd2ddc7995083d73f5d667100c8a5831f16d|e654ae16b76cf002bd26adaf060f8a9c5d333cc9|82.94.251.203", and searches for Tonga would use the following condition: WHERE search LIKE '%tonga%#%' OR search LIKE '%#Tonga%' OR search LIKE '%|tonga%' (a sketch of this variant, including the trigram index, follows below this list).

  • There may be variants of these two schemas that have advantages that I didn't think of yet. Suggestions very welcome.
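
Here's a minimal sketch of the normalized variant mentioned above, assuming PostgreSQL; table and column names are illustrative and not taken from my experimental code:

-- One row per relay or bridge.
CREATE TABLE servers (
  server_id SERIAL PRIMARY KEY,
  is_relay  BOOLEAN NOT NULL,
  nickname  TEXT                       -- stored lower-case for case-insensitive matching
);

-- One or more addresses per server, in canonical textual form.
CREATE TABLE addresses (
  server_id INTEGER NOT NULL REFERENCES servers (server_id),
  address   TEXT NOT NULL
);

-- Original and hashed fingerprints per server.
CREATE TABLE fingerprints (
  server_id          INTEGER NOT NULL REFERENCES servers (server_id),
  fingerprint        CHARACTER(40),    -- hex-encoded, lower-case
  hashed_fingerprint CHARACTER(40),    -- SHA-1 of the fingerprint, hex-encoded, lower-case
  base64_fingerprint CHARACTER(27)     -- base64-encoded, without trailing equal signs
);

-- A search for "Tonga" has to touch all three tables:
SELECT DISTINCT s.server_id
  FROM servers s
  LEFT JOIN addresses a USING (server_id)
  LEFT JOIN fingerprints f USING (server_id)
 WHERE s.nickname LIKE '%tonga%'
    OR a.address LIKE 'tonga%'
    OR f.fingerprint LIKE 'tonga%'
    OR f.hashed_fingerprint LIKE 'tonga%'
    OR f.base64_fingerprint LIKE 'Tonga%';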
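
And here's a sketch of the single-table variant with a trigram index, again only an illustration of the idea (an alternative to the normalized sketch above), not a tested implementation:

-- pg_trgm ships with Debian's postgresql-contrib package.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE TABLE servers (
  server_id SERIAL PRIMARY KEY,
  search    TEXT NOT NULL    -- "nickname#base64-fp|hex-fp|hashed-hex-fp|address1|address2|..."
);

-- GIN trigram index that PostgreSQL can use for LIKE '%foo%' lookups.
CREATE INDEX servers_search_trgm_idx
    ON servers USING gin (search gin_trgm_ops);

-- Search for "Tonga", using the condition from the bullet above:
SELECT server_id
  FROM servers
 WHERE search LIKE '%tonga%#%'   -- part of the nickname
    OR search LIKE '%#Tonga%'    -- beginning of the base64 fingerprint (case-sensitive)
    OR search LIKE '%|tonga%';   -- beginning of hex fingerprint, hashed fingerprint, or address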

If we can find a good database schema for the search parameter, implementing the other parameters should be relatively easy.

Here's Tonga's search data as a very first sample:

{
    "t": true,
    "f": "4A0CCD2DDC7995083D73F5D667100C8A5831F16D",
    "n": "Tonga",
    "ad": [
        "82.94.251.203"
    ],
    "cc": "nl",
    "as": "AS3265",
    "fs": "2007-10-27 12:00:00",
    "ls": "2015-04-18 13:00:00",
    "rf": [
        "Authority",
        "Fast",
        "HSDir",
        "Running",
        "Stable",
        "V2Dir",
        "Valid"
    ],
    "cw": 20,
    "r": true,
    "c": "4096/fd3428b4 lucky green <shamrock@cypherpunks.to>"
}

I also uploaded more sample search data in case that helps the discussion.

Reference: legacy/trac#15844