This is a living document that accompanies the current project to improve [https://CollecTor.torproject.org CollecTor].
== Areas of Work
Over the course of this project, the following sections will gradually turn into descriptions and documentation.
Currently, they are a mixture of well-defined improvements, sketches, wishes, and open questions.
=== Analyze Descriptor Completeness
The analysis will be based on log files and the downloaded files, and will address the following questions:
==== How many descriptors are missing?
* Details about missing referenced descriptors can be found here: [wiki:doc/CollecTor/AnalysisDescriptorCompleteness Analysis Part 1]
* Details about missing consensuses and votes: [wiki:doc/CollecTor/AnalysisVotesAndConsensusCompleteness Analysis Part 2]
* Analysis of missing referenced descriptors on the current development CollecTor mirror: [wiki:doc/CollecTor/AnalysisDescriptorCompletenessFromScratch Analysis of pure download mirror]
==== How could this loss be avoided?
* Actively monitor resources such as available storage space (discussion in ticket #18865).
* Verify and improve runtime statistics in order to get a clearer picture (discussion in ticket #19169).
* Extra-info descriptors that are dropped because of parsing problems are currently counted as missing. This should be avoided (ticket #19170).
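The storage monitoring mentioned above could be sketched roughly as follows. This is only an illustration using the standard `java.nio.file` API; the class name `StorageCheck`, the method names, and the 10% threshold are assumptions made for this example and are not part of CollecTor.

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper that warns when storage space runs low. */
public class StorageCheck {

  /** Returns true if the free fraction is below minFreeFraction (0.0 to 1.0). */
  public static boolean isLow(long usableBytes, long totalBytes,
      double minFreeFraction) {
    if (totalBytes <= 0) {
      return true; // treat an unknown or empty file store as a problem
    }
    return (double) usableBytes / totalBytes < minFreeFraction;
  }

  /** Checks the file store that holds the given directory. */
  public static boolean isLowOnSpace(Path dir, double minFreeFraction)
      throws IOException {
    FileStore store = Files.getFileStore(dir);
    return isLow(store.getUsableSpace(), store.getTotalSpace(),
        minFreeFraction);
  }

  public static void main(String[] args) throws IOException {
    // The directory to check is taken from the command line (assumption).
    Path dir = Paths.get(args.length > 0 ? args[0] : ".");
    if (isLowOnSpace(dir, 0.10)) {
      System.err.println("WARNING: less than 10% free space on "
          + dir.toAbsolutePath());
    }
  }
}
```

A check like this could run before each download cycle, so that a full disk is logged as a warning rather than showing up later as missing descriptors.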
==== Next Steps
Continue the analysis once the sync process is deployed.
=== Provide Guide Documents
These guides should be based on the previous work in [https://onionoo.torproject.org Onionoo] and metrics-lib. In detail:
* Contributor's Guide: create as detailed in #18733 and place the new guide in a central location, which still needs to be identified. One option is a large document in the central location and a small document in CollecTor that references the main document (detailed discussion in #18730).
* Release Process (defined in #18732)
* Installation Guide for Operators (adapt the [https://gitweb.torproject.org/collector.git/tree/INSTALL.md existing document]), ticket #18734
=== Implement the Release Process
(according to the guide above)
== Design Changes
This section describes improvements that ought to make CollecTor more maintainable, more testable, and more efficient.
1. Run CollecTor with an internal scheduler instead of using external scheduling (e.g. crontab), #19018.
1. Add a shutdown hook to provide a controlled way of stopping. Discussion in #19016.
1. Some parts of CollecTor's data processing are provided by bash scripts run via crontab. These should be integrated into the Java application.
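The first two items can be illustrated with a minimal sketch built on the JDK's `ScheduledExecutorService` and `Runtime.addShutdownHook`. The class name `SchedulerSketch`, the hourly period, and the ten-minute grace period are assumptions for this example; the actual design is the subject of #19018 and #19016.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical internal scheduler with a controlled shutdown. */
public class SchedulerSketch {

  private final ScheduledExecutorService scheduler =
      Executors.newScheduledThreadPool(1);

  /** Runs the given module periodically, replacing a crontab entry. */
  public void schedule(Runnable module, long initialDelay, long period,
      TimeUnit unit) {
    scheduler.scheduleAtFixedRate(module, initialDelay, period, unit);
  }

  /**
   * Cancels future runs and waits for a currently running module to finish.
   * Returns true if everything terminated within the timeout.
   */
  public boolean shutdown(long timeout, TimeUnit unit) {
    scheduler.shutdown();
    try {
      return scheduler.awaitTermination(timeout, unit);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      scheduler.shutdownNow();
      return false;
    }
  }

  public static void main(String[] args) {
    SchedulerSketch sketch = new SchedulerSketch();
    // Placeholder module; a real module would download descriptors.
    sketch.schedule(() -> System.out.println("collecting descriptors ..."),
        0, 1, TimeUnit.HOURS);
    // On SIGTERM or Ctrl-C, let a running module finish before exiting.
    Runtime.getRuntime().addShutdownHook(
        new Thread(() -> sketch.shutdown(10, TimeUnit.MINUTES)));
  }
}
```

With this approach a stop request does not interrupt a module in the middle of writing files. It only prevents further runs and waits for the current one to complete.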
=== Improve CollecTor Operation and Setup
Once the executable jar including the shutdown hook implementation is available, CollecTor should be started as a Linux service; that is, an appropriate shell script needs to be provided.
=== Further Sketches of Areas for Improvements
* store unparsable descriptors rather than discarding them
- add local storage for descriptors that cannot be parsed, so that the service operator can review them and reprocess them later
* synchronization between CollecTor instances (see #18910 and DescriptorDistribution)
* improve the process of creating tarballs
- reduce memory consumption throughout
* consider using an embedded HTTP server in order to reduce operating complexity
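The idea of keeping unparsable descriptors for later review might look roughly like the following sketch. The class name `UnparsableStore`, the directory layout, and the choice of a SHA-256 digest as the file name are assumptions for illustration only; none of this is decided in the project.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical store for descriptors that failed to parse. */
public class UnparsableStore {

  private final Path dir;

  public UnparsableStore(Path dir) {
    this.dir = dir;
  }

  /** Writes the raw bytes to a file named after their SHA-256 digest. */
  public Path store(byte[] rawDescriptor) throws IOException {
    Files.createDirectories(dir);
    Path target = dir.resolve(sha256Hex(rawDescriptor));
    Files.write(target, rawDescriptor);
    return target;
  }

  /** Returns the lowercase hex SHA-256 digest of the given bytes. */
  public static String sha256Hex(byte[] data) {
    try {
      MessageDigest md = MessageDigest.getInstance("SHA-256");
      StringBuilder sb = new StringBuilder();
      for (byte b : md.digest(data)) {
        sb.append(String.format("%02x", b));
      }
      return sb.toString();
    } catch (NoSuchAlgorithmException e) {
      // SHA-256 must be supported by every Java platform implementation.
      throw new IllegalStateException(e);
    }
  }
}
```

Naming files by digest means that the same broken descriptor, if downloaded repeatedly, is stored only once, and the operator can reprocess the directory after a parser fix.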