Store raw descriptor contents as UTF-8 encoded Strings rather than byte[]
When we're reading descriptors from disk, we're storing raw descriptor contents as byte[] and returning them in Descriptor#getRawDescriptorBytes(). Also, we're storing partial raw descriptor contents in byte[].
Using byte[] can be useful when writing raw contents back to disk, because we can be sure that contents are exactly the same as when we read them from disk. Namely, we don't have to worry about character encoding.
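To make the encoding concern concrete, here is a minimal sketch of what can go wrong when bytes are decoded to a String and encoded again. The class name, method name, and sample lines are made up for illustration and are not part of metrics-lib. It relies on the documented behavior of new String(byte[], Charset), which replaces malformed input with the replacement character.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of metrics-lib.
class RoundTripCheck {

  /** Returns true if decoding as UTF-8 and re-encoding reproduces the input bytes. */
  static boolean survivesUtf8RoundTrip(byte[] raw) {
    String decoded = new String(raw, StandardCharsets.UTF_8);
    byte[] reencoded = decoded.getBytes(StandardCharsets.UTF_8);
    return Arrays.equals(raw, reencoded);
  }

  public static void main(String[] args) {
    // Made-up sample lines; real descriptors contain many more lines.
    byte[] ascii = "router example 127.0.0.1 9001 0 0\n"
        .getBytes(StandardCharsets.US_ASCII);
    // A contact line written in ISO-8859-1 is not valid UTF-8.
    byte[] latin1 = "contact J\u00fcrgen\n"
        .getBytes(StandardCharsets.ISO_8859_1);
    System.out.println(survivesUtf8RoundTrip(ascii));   // true
    System.out.println(survivesUtf8RoundTrip(latin1));  // false
  }
}
```

So pure ASCII content survives the round trip, while content in another encoding would be written back differently from how it was read.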
However, support for handling (large) byte[] content is limited. Today I looked into ways to handle large descriptor files (#20395), and I found that most libraries work best with character streams, not with byte streams. And I only briefly considered implementing Knuth-Morris-Pratt myself...
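As one illustration of why character data is more convenient: String comes with indexOf for substring search, whereas I'm not aware of a direct counterpart for searching a byte[] within a byte[] in the standard library. The following is only a sketch of splitting contents at a keyword, not how metrics-lib does or will do it. The class name, method name, and sample input are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of metrics-lib.
class DescriptorSplitter {

  /** Splits contents in front of every line that starts with the keyword. */
  static List<String> splitAtKeyword(String contents, String keyword) {
    List<String> parts = new ArrayList<>();
    String marker = "\n" + keyword;
    int start = 0;
    while (start < contents.length()) {
      // Find the next newline that is followed by the keyword.
      int next = contents.indexOf(marker, start);
      // Keep the newline with the preceding part.
      int end = next < 0 ? contents.length() : next + 1;
      parts.add(contents.substring(start, end));
      start = end;
    }
    return parts;
  }

  public static void main(String[] args) {
    // Made-up input consisting of two tiny pseudo-descriptors.
    String contents = "router a\nbandwidth 1\nrouter b\nbandwidth 2\n";
    for (String part : splitAtKeyword(contents, "router ")) {
      System.out.print("---\n" + part);
    }
  }
}
```

This loads everything into memory, so it says nothing yet about truly large files. It only shows how little code the String variant needs.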
So, I looked at the four main code bases using metrics-lib (CollecTor, ExoneraTor, metrics-web, Onionoo) to see which of them use raw descriptor bytes and how. After all, if we're not using them ourselves, we might as well get rid of them. Here's what I found:
DescriptorQueue uses raw bytes to keep statistics on processed bytes, which seems like something that would still work reasonably well with character lengths.
DescriptorPersistence indeed uses raw descriptor bytes to write descriptors obtained from another CollecTor instance to disk. We'd have to change that.
VotePersistence uses raw descriptor bytes to calculate the digest of votes, which is something we should implement in metrics-lib directly (#20333).
ExoneraTorDatabaseImporter imports raw status entry bytes into the database, but we know that those are just ASCII, so this would work as well with UTF-8 strings.
RelayDescriptorDatabaseImporter also imports raw status entry bytes into the database, which works with strings for the same reason as above.
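The two assumptions in this list (character lengths are good enough for statistics, and ASCII content is unaffected by the switch) can be checked with a small sketch. Class and method names as well as the sample strings are made up. For pure ASCII the character count equals the UTF-8 byte count, and for anything else the byte count is larger.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any of the code bases above.
class LengthCheck {

  /** Number of bytes the string occupies when encoded as UTF-8. */
  static int utf8Length(String s) {
    return s.getBytes(StandardCharsets.UTF_8).length;
  }

  /** True if every character is in the 7-bit ASCII range. */
  static boolean isPureAscii(String s) {
    return s.chars().allMatch(c -> c < 128);
  }

  public static void main(String[] args) {
    String asciiLine = "s Fast Running Valid\n";  // made-up ASCII-only line
    String nonAscii = "contact J\u00fcrgen\n";    // made-up line with one non-ASCII char

    // Same count for ASCII: 21 characters, 21 bytes.
    System.out.println(asciiLine.length() + " " + utf8Length(asciiLine));
    // Counts differ by one: 15 characters, 16 bytes.
    System.out.println(nonAscii.length() + " " + utf8Length(nonAscii));
    System.out.println(isPureAscii(asciiLine) + " " + isPureAscii(nonAscii));
  }
}
```

In other words, statistics based on String#length() would slightly undercount whenever descriptors contain non-ASCII characters, which seems acceptable for the purpose described above.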
I might have overlooked something.
But if not, CollecTor's DescriptorPersistence is the only place where we really need byte[] rather than String. If we can change that, we can switch from Descriptor#getRawDescriptorBytes() to Descriptor#getRawDescriptor() and deprecate the former (and do the same with the other two partial contents).
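A rough sketch of what the transition could look like follows. Only the two method names come from this ticket. The nested interface is a stripped-down stand-in and not the actual metrics-lib Descriptor interface, and the implementing class is purely hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not the actual metrics-lib API.
class DeprecationSketch {

  /** Stand-in for the real interface, reduced to the two methods in question. */
  interface Descriptor {

    /** New accessor: raw contents as a String decoded from UTF-8. */
    String getRawDescriptor();

    /** Old accessor, kept for a transition period. */
    @Deprecated
    default byte[] getRawDescriptorBytes() {
      // Only equals the bytes on disk if those were valid UTF-8.
      return getRawDescriptor().getBytes(StandardCharsets.UTF_8);
    }
  }

  /** Hypothetical implementation that stores contents as a String. */
  static final class StringBackedDescriptor implements Descriptor {
    private final String raw;

    StringBackedDescriptor(byte[] rawBytes) {
      this.raw = new String(rawBytes, StandardCharsets.UTF_8);
    }

    @Override
    public String getRawDescriptor() {
      return this.raw;
    }
  }

  public static void main(String[] args) {
    byte[] onDisk = "router a\n".getBytes(StandardCharsets.US_ASCII);
    Descriptor descriptor = new StringBackedDescriptor(onDisk);
    System.out.print(descriptor.getRawDescriptor());
  }
}
```

With something like this, existing callers of the byte[] accessor keep compiling with a deprecation warning while they migrate.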
And then we can resume #20395 with a much more complete toolbox.