Store raw descriptor contents as UTF-8 encoded Strings rather than byte[]
When we read descriptors from disk, we store raw descriptor contents as byte[] and return them in Descriptor#getRawDescriptorBytes(). We also store partial raw descriptor contents in DirSourceEntry#getDirSourceEntryBytes() and NetworkStatusEntry#getStatusEntryBytes().
Storing byte[] can be useful when writing raw contents back to disk, because we can be sure that the contents are exactly the same as when we read them from disk. In particular, we don't have to worry about character encoding.
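To illustrate the encoding concern, here is a minimal sketch (not metrics-lib code; the class name and sample contents are made up for illustration). It decodes bytes as UTF-8 and encodes them again: pure ASCII content survives unchanged, whereas a byte sequence that is not valid UTF-8 does not.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class RoundTripSketch {

  /** Decodes the given bytes as UTF-8 and encodes the result again. */
  static byte[] roundTrip(byte[] raw) {
    String decoded = new String(raw, StandardCharsets.UTF_8);
    return decoded.getBytes(StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    // Hypothetical ASCII-only descriptor line.
    byte[] ascii = "router example 127.0.0.1 9001 0 0\n"
        .getBytes(StandardCharsets.US_ASCII);
    // Hypothetical line containing a lone 0xE9 byte (Latin-1 e-acute),
    // which is not a valid UTF-8 sequence on its own.
    byte[] latin1 = new byte[] {
        'c', 'o', 'n', 't', 'a', 'c', 't', ' ', (byte) 0xE9, '\n' };

    System.out.println(Arrays.equals(ascii, roundTrip(ascii)));
    System.out.println(Arrays.equals(latin1, roundTrip(latin1)));
  }
}
```

The second comparison fails because the decoder substitutes the replacement character U+FFFD for the malformed byte, so the re-encoded content is no longer what was on disk.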
However, support for handling (large) byte[] content is limited. Today I looked into ways to handle large descriptor files (#20395 (moved)), and I found that most libraries work best with character streams, not with byte streams. And I only briefly considered implementing Knuth-Morris-Pratt myself...
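As a rough illustration of why character-based content is easier to work with, the following sketch splits a String of concatenated descriptors using nothing but String#indexOf(). It is only an assumption about what such code could look like, not what metrics-lib does; the class name, method name, and the "router" keyword in the usage example are chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class SplitSketch {

  /**
   * Splits concatenated descriptors at every line starting with the given
   * keyword followed by a space. Each part keeps its trailing newline.
   */
  static List<String> split(String contents, String keyword) {
    List<String> parts = new ArrayList<>();
    String token = "\n" + keyword + " ";
    int start = 0;
    while (start < contents.length()) {
      int next = contents.indexOf(token, start);
      // Cut right after the newline that precedes the next keyword line,
      // or at the end of the contents if there is no further match.
      int end = next < 0 ? contents.length() : next + 1;
      parts.add(contents.substring(start, end));
      start = end;
    }
    return parts;
  }
}
```

For example, `SplitSketch.split("router a\nbandwidth 1\nrouter b\nbandwidth 2\n", "router")` returns two strings, one per descriptor. Doing the same on byte[] would require a hand-written substring search.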
So, I looked at the four main code bases using metrics-lib (CollecTor, ExoneraTor, metrics-web, Onionoo) to see which of them use raw descriptor bytes and how. After all, if we're not using them ourselves, we might as well get rid of them. Here's what I found:
- Onionoo's DescriptorQueue uses raw bytes to keep statistics on processed bytes, which seems like something that would still work reasonably well with character lengths.
- CollecTor's DescriptorPersistence indeed uses raw descriptor bytes to write descriptors obtained from another CollecTor instance to disk. We'd have to change that.
- CollecTor's VotePersistence uses raw descriptor bytes to calculate the digest of votes, which is something we should implement in metrics-lib directly (#20333 (moved)).
- ExoneraTor's ExoneraTorDatabaseImporter imports raw status entry bytes into the database, but we know that those are just ASCII, so this would work as well with UTF-8 strings.
- metrics-web's RelayDescriptorDatabaseImporter also imports raw status entry bytes into the database, which works with strings for the same reason as above.
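The "just ASCII" argument in the list above can be checked with a small sketch: for ASCII-only content, the character length of a String equals its UTF-8 byte length, so byte statistics and database imports would see the same data. The class name and sample strings below are invented for illustration.

```java
import java.nio.charset.StandardCharsets;

class LengthSketch {

  /** Returns the number of bytes the string occupies when encoded as UTF-8. */
  static int utf8Length(String s) {
    return s.getBytes(StandardCharsets.UTF_8).length;
  }

  public static void main(String[] args) {
    // Hypothetical ASCII-only status entry line.
    String entry = "r example AAAA BBBB 2016-10-20 12:00:00 127.0.0.1 9001 0\n";
    // Hypothetical line with one non-ASCII character (u-umlaut).
    String nonAscii = "contact J\u00fcrgen\n";

    System.out.println(entry.length() == utf8Length(entry));
    System.out.println(utf8Length(nonAscii) - nonAscii.length());
  }
}
```

Character lengths only drift from byte lengths where descriptors contain non-ASCII characters, which is why character-based statistics would be approximate rather than exact in that case.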
I might have overlooked something.
But if not, CollecTor's DescriptorPersistence is the only place where we really need byte[] rather than String. If we can change that, we can switch from Descriptor#getRawDescriptorBytes() to Descriptor#getRawDescriptor() and deprecate the former (and do the same with the other two partial contents).
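One possible shape for that switch is sketched below. The interface name DescriptorSketch is hypothetical and stands in for the real Descriptor interface; only the two method names come from this ticket, and deriving the deprecated method from the new one is an assumption, not a decided design.

```java
import java.nio.charset.StandardCharsets;

interface DescriptorSketch {

  /** Proposed accessor: raw descriptor contents as a String. */
  String getRawDescriptor();

  /**
   * Existing accessor, kept for compatibility and derived from the String.
   * The result matches the bytes on disk only if they were valid UTF-8.
   */
  @Deprecated
  default byte[] getRawDescriptorBytes() {
    return getRawDescriptor().getBytes(StandardCharsets.UTF_8);
  }
}
```

Existing callers would keep compiling, with a deprecation warning, while new code moves to the String accessor.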
And then we can resume #20395 (moved) with a much more complete toolbox.