Investigate consensus-tracker's memory usage
The first script that I ported over to stem was the consensus-tracker script which provides the automated emails for the list by the same name...
https://gitweb.torproject.org/atagar/tor-utils.git/blob/HEAD:/consensusTracker.py https://lists.torproject.org/cgi-bin/mailman/listinfo/consensus-tracker/
Moving this turned out to reveal some major issues with stem's ExitPolicy class in terms of memory usage. Those issues are fixed and the script then ran for several days without issue, but a new type of memory problem has since surfaced.
Each hour the consensus-tracker makes an instance of the Sampling class, storing up to 192 of them at a time. Individually these are fine, but as the script runs and approaches that threshold the memory starts to stack up.
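For reference, the retention pattern described above can be sketched roughly as follows. This is not the script's actual code: the Sampling class here is a hypothetical stand-in, and using a deque is my assumption about one way to cap the history. The point is just that memory grows until 192 samplings are held, then levels off at 192 times the per-sampling cost.

```python
import collections

MAX_SAMPLINGS = 192  # the threshold mentioned in this ticket


class Sampling(object):
    """Hypothetical stand-in for the script's hourly Sampling class."""

    def __init__(self, hour):
        self.hour = hour


# A deque with maxlen drops its oldest entry once the cap is reached.
samplings = collections.deque(maxlen=MAX_SAMPLINGS)


def record_sampling(sampling):
    # Appending past maxlen silently discards the oldest sampling.
    samplings.append(sampling)


# Simulate 200 hourly samplings; only the most recent 192 are retained.
for hour in range(200):
    record_sampling(Sampling(hour))
```

So whatever each sampling holds onto (such as a full set of router status entries) gets multiplied by up to 192.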
After a week the consensus-tracker instance on my system was using 75% of the system's memory and started failing to fetch new consensus information (I'm not positive that the memory usage is related to the failures, but it seems likely).
So first question: why is stem using more memory than TorCtl? At a guess there are two issues...
- TorCtl likely provided version 2 router status entries while stem provides version 3. A big difference between those two is that version 3 includes the microdescriptor exit policy.
- TorCtl's ExitPolicyLine class is far lighter than our ExitPolicy. All it stores is the binary representation of the address, subnet mask, and port range (ie, the bare minimum to have a working match() method). Ours, however, includes IPv6 support and some additional data.
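To illustrate what "bare minimum to have a working match() method" could look like, here is a rough sketch. This is not TorCtl's actual ExitPolicyLine nor anything in stem: the LightPolicyRule name, its fields, and the use of __slots__ are all assumptions of mine, and it is IPv4-only.

```python
import socket
import struct


def _ipv4_to_int(address):
    # Pack a dotted-quad IPv4 address into one unsigned 32-bit integer.
    return struct.unpack("!I", socket.inet_aton(address))[0]


class LightPolicyRule(object):
    """Hypothetical IPv4-only rule holding just enough state for match()."""

    # __slots__ avoids a per-instance __dict__, which matters when
    # many thousands of rules are held in memory at once.
    __slots__ = ("is_accept", "network", "mask", "min_port", "max_port")

    def __init__(self, is_accept, address, mask_bits, min_port, max_port):
        self.is_accept = is_accept
        # Netmask with the top mask_bits bits set (0 bits matches anything).
        self.mask = (0xFFFFFFFF << (32 - mask_bits)) & 0xFFFFFFFF
        self.network = _ipv4_to_int(address) & self.mask
        self.min_port = min_port
        self.max_port = max_port

    def match(self, address, port):
        # True if the address is in our subnet and the port is in range.
        in_subnet = (_ipv4_to_int(address) & self.mask) == self.network
        return in_subnet and self.min_port <= port <= self.max_port


# Example: a 'reject 192.168.0.0/16:*' style rule.
rule = LightPolicyRule(False, "192.168.0.0", 16, 1, 65535)
```

Each instance is three integers, a boolean, and the mask, with no strings or per-instance dictionary kept around.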
I've made a little hack in my consensus-tracker to drop the exit policy from the router status entries (... actually, the script doesn't use them so this should have zero impact). After a week or so of running this'll confirm or deny that the ExitPolicy is the issue.
If it is then I'll likely make the microdescriptor policies lighter weight. They only need a subset of the information of a normal policy.