Share more code between modules
We currently have nine data-processing modules in metrics-web. Each of them reads descriptors from a local directory, aggregates them somehow, and writes one or more CSV files to a local directory. Some modules use a database for the aggregation part, others use state files.
Some of these modules have a lot of code in common. Yet they do not share actual code other than what's in metrics-lib for reading and parsing descriptors. This is bad for the usual reasons: fixes have to be made in several places, and the copies drift apart over time.
I'd like to approach this refactoring from a top-down perspective where we generalize similar functionality and use it in all modules. The following list is ordered by topic, not by priority:
- Configuration: Most of the modules can be configured in some way, including database connection details or file system paths. It would be easier to configure things once for all modules by using a single config file, or to have reasonable defaults and modify single options via command-line arguments.
- Scheduling: Right now, modules run once per day, started by cron. We would like to run them more often, but we need to avoid overlapping runs, and we need to handle shutdowns gracefully. A common scheduler might help with this.
- Descriptor reading and parsing: Each module has its own code for reading and parsing descriptors using metrics-lib. This includes setting paths where descriptors are located and paths for parse history files.
- Statistics: We have similar code for computing percentiles and other statistics distributed over the code base. We might be able to generalize these computations and provide a common math/statistics interface for them. We should still use a math library, so this would be mostly a wrapper for that library.
- Database access: Several of our modules have the same or very similar code to connect to a database, import parsed descriptor parts into tables, execute stored procedures for importing and aggregating data, query one or more result views, and disconnect from the database. What we need is a more powerful API to our databases than `java.sql`.
- Output: Our modules write one or more CSV files as their output. Some modules treat missing values differently in the output, but this code is mostly the same in all modules. Maybe this is still part of the database access item above. If not, we should share some code across modules for writing output files.
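To make the configuration item more concrete, here is a minimal sketch of layered configuration: built-in defaults, then an optional config file, then `--key=value` command-line arguments. The class name `ModuleConfig` and all option names and default values are made up for illustration; nothing like this exists in metrics-web yet.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

/** Hypothetical shared configuration helper; all option names are made up. */
final class ModuleConfig {

  private ModuleConfig() {
  }

  /** Built-in defaults that would apply to every module. */
  static Properties defaults() {
    Properties defaults = new Properties();
    defaults.setProperty("descriptors.dir", "in");
    defaults.setProperty("output.dir", "out");
    return defaults;
  }

  /** Layers an optional config file and --key=value arguments over the defaults. */
  static Properties load(Path configFile, String[] args) {
    // Lookups fall back to the defaults for keys that are not set explicitly.
    Properties config = new Properties(defaults());
    if (configFile != null && Files.exists(configFile)) {
      try (Reader reader = Files.newBufferedReader(configFile)) {
        config.load(reader);
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    }
    // Command-line arguments are applied last, so they win over the file.
    for (String arg : args) {
      int eq = arg.indexOf('=');
      if (arg.startsWith("--") && eq > 2) {
        config.setProperty(arg.substring(2, eq), arg.substring(eq + 1));
      }
    }
    return config;
  }

  public static void main(String[] args) {
    Properties config = load(null, args);
    System.out.println("output.dir = " + config.getProperty("output.dir"));
  }
}
```

A module would call `ModuleConfig.load(...)` once at startup and read its paths and connection details from the result, rather than parsing its own options.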
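For the scheduling item, one possible shape of a common scheduler is sketched below. It uses `ScheduledExecutorService.scheduleWithFixedDelay` on a single thread, so the next run starts only after the previous one has finished, and `shutdown()` followed by `awaitTermination()` lets a run in progress complete. `ModuleScheduler` and its methods are hypothetical names. This only prevents overlap inside one JVM; if cron keeps starting separate processes, something like a lock file would still be needed.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical shared scheduler; class and method names are made up. */
final class ModuleScheduler {

  private final ScheduledExecutorService executor =
      Executors.newSingleThreadScheduledExecutor();

  /** Runs the module repeatedly; the delay counts from the end of the previous run. */
  void schedule(Runnable module, long delay, TimeUnit unit) {
    executor.scheduleWithFixedDelay(() -> {
      try {
        module.run();
      } catch (RuntimeException e) {
        // An exception escaping the task would cancel all future runs.
        e.printStackTrace();
      }
    }, 0, delay, unit);
  }

  /** Stops scheduling new runs and waits for a run in progress to finish. */
  boolean shutdown(long timeout, TimeUnit unit) {
    executor.shutdown();
    try {
      return executor.awaitTermination(timeout, unit);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }

  /** Registers a JVM shutdown hook so that a termination signal waits for the current run. */
  void shutdownOnExit(long timeout, TimeUnit unit) {
    Runtime.getRuntime().addShutdownHook(new Thread(() -> shutdown(timeout, unit)));
  }

  public static void main(String[] args) throws InterruptedException {
    // Short demo: run a dummy module a few times, then shut down cleanly.
    ModuleScheduler scheduler = new ModuleScheduler();
    scheduler.schedule(() -> System.out.println("module run"), 100, TimeUnit.MILLISECONDS);
    Thread.sleep(350);
    scheduler.shutdown(1, TimeUnit.MINUTES);
  }
}
```

In a long-running deployment, the process would call `shutdownOnExit(...)` once and never call `shutdown(...)` directly; the demo in `main` is bounded only so that it terminates.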
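For the statistics item, the common interface could be as small as the following sketch. `Statistics` is a hypothetical name, and the body is a plain-Java stand-in (linear interpolation between the two closest ranks) so that the example is self-contained; the real wrapper would delegate to whichever math library we pick, and that library's percentile definition may differ from this one.

```java
import java.util.Arrays;

/** Hypothetical shared statistics interface; a real version would wrap a math library. */
final class Statistics {

  private Statistics() {
  }

  /** Returns the p-th percentile (0 to 100) using linear interpolation between ranks. */
  static double percentile(double[] values, double p) {
    if (values.length == 0 || p < 0.0 || p > 100.0) {
      throw new IllegalArgumentException("need at least one value and 0 <= p <= 100");
    }
    // Sort a copy so that the caller's array is left untouched.
    double[] sorted = values.clone();
    Arrays.sort(sorted);
    double rank = p / 100.0 * (sorted.length - 1);
    int lower = (int) Math.floor(rank);
    int upper = Math.min(lower + 1, sorted.length - 1);
    return sorted[lower] + (rank - lower) * (sorted[upper] - sorted[lower]);
  }

  /** Convenience method for the 50th percentile. */
  static double median(double[] values) {
    return percentile(values, 50.0);
  }

  public static void main(String[] args) {
    System.out.println(median(new double[] {4, 1, 3, 2})); // prints 2.5
  }
}
```

The point of the wrapper is that all modules agree on one percentile definition, whatever the underlying library is.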
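For the output item, shared code could look roughly like this sketch, where the only per-module difference is the placeholder written for missing values. `CsvOutput` is a hypothetical name, and the sketch assumes that values never contain commas or line breaks, so it does no quoting.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

/** Hypothetical shared CSV writer; assumes values need no quoting or escaping. */
final class CsvOutput {

  private final String missingValue;

  /** The placeholder is what gets written for null values, e.g. "" or "NA". */
  CsvOutput(String missingValue) {
    this.missingValue = missingValue;
  }

  /** Formats one row, replacing null values with the configured placeholder. */
  String formatRow(List<?> values) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < values.size(); i++) {
      if (i > 0) {
        sb.append(',');
      }
      Object value = values.get(i);
      sb.append(value == null ? missingValue : value.toString());
    }
    return sb.toString();
  }

  /** Writes a header line followed by all rows to the given file. */
  void write(Path file, List<String> header, List<? extends List<?>> rows) {
    try (Writer writer = Files.newBufferedWriter(file)) {
      writer.write(String.join(",", header));
      writer.write('\n');
      for (List<?> row : rows) {
        writer.write(formatRow(row));
        writer.write('\n');
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    CsvOutput output = new CsvOutput("NA");
    // Arrays.asList is used because it permits null elements.
    System.out.println(output.formatRow(Arrays.asList("2018-01-01", null, 42)));
    // prints 2018-01-01,NA,42
  }
}
```

If output ends up being generated directly from database result views, the same missing-value handling could live in the database access layer instead.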
This is not high priority, and it requires discussion prior to making any code changes. This ticket is supposed to get us started here. (And I said I wouldn't close legacy/trac#26035 before this ticket exists.)