Extend file objects in index.json to include descriptor types, publication times, and file digests
atagar suggested extending file objects in CollecTor's index.json to include descriptor types, publication times, and file digests.
As of now, file objects in the index.json file have the following fields:
"path": Relative path of the file.
"size": Size of the file in bytes.
"last_modified": Timestamp when the file was last modified, using the pattern "YYYY-MM-DD HH:MM" in the UTC timezone.
The new fields could be defined as follows, though this is very much subject to discussion on this ticket:
"types": List of descriptor types as found in @type annotations of contained descriptors (optional).
"first_published": Earliest published timestamp (or similar) of contained descriptors (optional).
"last_published": Latest published timestamp (or similar) of contained descriptors (optional).
"sha256": SHA-256 digest of the file, encoded as base64 (optional).
All these new fields seem reasonable things to add, and I don't see why we wouldn't want to add them. The index will get bigger, but that sounds acceptable. The coding effort is non-zero, which is something we'll have to admit. But all in all, I don't see a blocker for doing this.
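To get a feel for the per-file work, here is a minimal Java sketch of how "sha256" and "types" might be derived. The class and method names are made up for illustration and are not CollecTor code. It assumes an uncompressed file in which each @type annotation starts a line, and it leaves out "first_published" and "last_published" because those would need real descriptor parsing.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helper, not part of CollecTor.
class FileFieldsSketch {

  /** Returns the base64-encoded SHA-256 digest of the file's raw bytes. */
  static String sha256Base64(Path file)
      throws IOException, NoSuchAlgorithmException {
    MessageDigest md = MessageDigest.getInstance("SHA-256");
    try (InputStream in
        = new DigestInputStream(Files.newInputStream(file), md)) {
      byte[] buffer = new byte[8192];
      while (in.read(buffer) != -1) {
        // Reading through the stream updates the digest as a side effect.
      }
    }
    return Base64.getEncoder().encodeToString(md.digest());
  }

  /** Collects the distinct values of "@type" lines, sorted. */
  static SortedSet<String> types(Path file) throws IOException {
    SortedSet<String> result = new TreeSet<>();
    // ISO-8859-1 maps every byte to a character, so odd bytes in
    // descriptors cannot make the reader throw.
    try (BufferedReader reader
        = Files.newBufferedReader(file, StandardCharsets.ISO_8859_1)) {
      String line;
      while ((line = reader.readLine()) != null) {
        if (line.startsWith("@type ")) {
          result.add(line.substring("@type ".length()).trim());
        }
      }
    }
    return result;
  }

  public static void main(String[] args) throws Exception {
    Path file = Files.createTempFile("descriptors", ".txt");
    Files.write(file, "@type server-descriptor 1.0\nrouter example\n"
        .getBytes(StandardCharsets.US_ASCII));
    System.out.println(types(file));
    System.out.println(sha256Base64(file));
    Files.delete(file);
  }
}
```

Both methods read the whole file, which is the cost the implementation note is about. Compressed files and tarballs would additionally need to be unpacked before looking for @type annotations.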
Implementation note: What these new fields have in common is that they're not plain file attributes that we can easily obtain from Java's File class. We'll have to open and read files to obtain them, which is very time-consuming. I could see us doing this in a background thread (or thread pool) started by CollecTor's CreateIndexJson.java, with a state file of some sort to avoid reprocessing files that haven't changed. As long as this thread (pool) hasn't finished processing a file, the index would simply omit these new fields (not the file itself!) for that file, which is why the fields are defined as optional above.
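A rough sketch of the background-processing idea, under several assumptions: the class name, the in-memory map standing in for the state file, and the pluggable digester function are all hypothetical, and nothing here reflects how CreateIndexJson.java is actually structured. A lookup returns an empty Optional until the pool has processed the file, which corresponds to omitting the optional field from the index.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// Hypothetical sketch, not CollecTor code.
class LazyFieldsSketch {

  /** Cached result, valid only for the last-modified time it was computed for. */
  private static final class Entry {
    final long lastModified;
    final String sha256;

    Entry(long lastModified, String sha256) {
      this.lastModified = lastModified;
      this.sha256 = sha256;
    }
  }

  // Stand-in for the state file; a real version would persist this.
  private final Map<String, Entry> state = new ConcurrentHashMap<>();
  // Paths currently queued or being processed, to avoid duplicate work.
  private final Set<String> pending = ConcurrentHashMap.newKeySet();
  private final ExecutorService pool = Executors.newFixedThreadPool(2);
  // The expensive part (open and read the file), supplied by the caller.
  private final Function<String, String> digester;

  LazyFieldsSketch(Function<String, String> digester) {
    this.digester = digester;
  }

  /** Returns the digest if already known for this file version, else schedules it. */
  Optional<String> sha256For(String path, long lastModified) {
    Entry cached = state.get(path);
    if (cached != null && cached.lastModified == lastModified) {
      return Optional.of(cached.sha256);
    }
    if (pending.add(path)) {
      pool.execute(() -> {
        try {
          state.put(path, new Entry(lastModified, digester.apply(path)));
        } finally {
          pending.remove(path);
        }
      });
    }
    return Optional.empty(); // The index writer omits the field for now.
  }

  void shutdownAndAwait() throws InterruptedException {
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.MINUTES);
  }

  public static void main(String[] args) throws InterruptedException {
    LazyFieldsSketch index = new LazyFieldsSketch(path -> "digest-of-" + path);
    // First index run: the field is not yet available and would be omitted.
    System.out.println(index.sha256For("recent/example", 1L).isPresent());
    index.shutdownAndAwait();
    // Later index run for the unchanged file: the cached value is returned.
    System.out.println(index.sha256For("recent/example", 1L).isPresent());
  }
}
```

Keying the cache on path plus last-modified time is one possible way to detect changed files; comparing file sizes as well would be a cheap extra check.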
What else did I miss? atagar, please fill in any thoughts that I left out.
Once we agree on the spec here, this could be a fine little project for a volunteer.