Do not let appended descriptor files grow too large
I revisited legacy/trac#20395 last week. The issue is that metrics-lib cannot handle large descriptor files, because it reads the entire file into memory before splitting it into single descriptors and parsing them. While it would be possible to parse large descriptor files after making some major code changes (using FileChannel and doing lazy parsing), I don't think that we have to do that. After all, we're writing these large descriptor files ourselves in CollecTor, and it's up to us to stop doing that.
Going back in time, the original reason for concatenating multiple descriptors into a single file was that rsyncing many tiny files from one host to another was slow. So we appended server descriptors and extra-info descriptors into a single file. This works well for server descriptors or extra-info descriptors published within 1 hour or even 10 hours. It does not work as well when all server descriptors or extra-info descriptors are synced from another CollecTor instance while starting a new instance (legacy/trac#20335). It works even less well when importing one or more monthly tarballs containing server descriptors or extra-info descriptors (legacy/trac#27716).
My suggestion is that we define a configurable size limit for appended descriptor files of, say, 20 MiB. When storing a descriptor, we check whether appending it to the existing descriptor file would exceed this limit, and if so, we start a new descriptor file instead.
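To make the idea concrete, here is a minimal sketch of the check-then-append logic. It is not CollecTor code: the class name `CappedDescriptorAppender`, the `baseName-index` file naming scheme, and the constructor parameters are all made up for illustration, and the real limit would come from CollecTor's configuration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper, not part of CollecTor: appends descriptors to
 *  numbered files and starts a new file once a size limit would be exceeded. */
final class CappedDescriptorAppender {

  private final Path directory;
  private final String baseName;
  private final long maxFileBytes;
  private int index = 0;

  CappedDescriptorAppender(Path directory, String baseName,
      long maxFileBytes) {
    this.directory = directory;
    this.baseName = baseName;
    this.maxFileBytes = maxFileBytes;
  }

  /** Appends one descriptor and returns the file it was written to. */
  Path append(byte[] descriptor) throws IOException {
    Files.createDirectories(this.directory);
    Path target = this.currentFile();
    // Skip ahead while the current file is non-empty and would grow past
    // the limit. An empty file always accepts the descriptor, so a single
    // descriptor larger than the limit still ends up in a file of its own.
    while (sizeOf(target) > 0
        && sizeOf(target) + descriptor.length > this.maxFileBytes) {
      this.index++;
      target = this.currentFile();
    }
    Files.write(target, descriptor,
        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    return target;
  }

  /** Assumed naming scheme for illustration: baseName-0, baseName-1, ... */
  private Path currentFile() {
    return this.directory.resolve(this.baseName + "-" + this.index);
  }

  private static long sizeOf(Path file) throws IOException {
    return Files.exists(file) ? Files.size(file) : 0L;
  }
}
```

With a 20 MiB limit this would be constructed as `new CappedDescriptorAppender(dir, "server-descriptors", 20L * 1024L * 1024L)`. The sketch keeps no state beyond a counter, so after a restart it simply walks forward from the first file until it finds one with room.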
There are some technical details to work out, but I think they can be solved. I also don't expect this to require much code or any complex code changes. The benefit would be that implementing this lets us resolve both legacy/trac#20395 and legacy/trac#27716.
Thoughts on the general idea?