enabling CellStatistics results in gigabytes of incremental memory consumption

changed milestone to %Tor: unspecified

added 031-unreached-backport 034-removed-20180328 034-triage-20180328 component::core tor/tor milestone::Tor: unspecified parent::24806 points::1 priority::high severity::major status::new tor-dos type::defect version::tor 0.3.2.9 labels

Trac:
Version: N/A to Tor: 0.3.2.9
Component: - Select a component to Core Tor/Tor

See #24806 (moved) for some background, particularly https://trac.torproject.org/projects/tor/ticket/24806#comment:13

Marking for backport, let's get this done soon, because it makes the DDoS worse.

Edit: spacing

Trac:
Parent: N/A to #24806 (moved)
Severity: Normal to Major
Priority: Medium to High
Keywords: N/A deleted, must-fix-before-033-stable, 029-backport, ddos, 031-backport added
Milestone: N/A to Tor: 0.3.3.x-final
Points: N/A to 1

Can you give us any hints about what goes wrong? How quickly does the bloating happen?

I just ran my fast non-exit relay for an hour or so under valgrind with CellStatistics on, and it didn't find any leaks.

Now I'm running it without valgrind, and it doesn't seem to be bigger than normal.

Does it happen only at certain points, e.g. where it publishes a new descriptor with these stats?

Is there anything else I should do to trigger it?

Ok, my tor relay, now that it's not under valgrind, has been slowly growing.

It could be actual leaks that only happen when the relay is fast enough to do something. It could be memory fragmentation. Or it could be that we're simply keeping a whole lot of extra stuff in memory.

Yep, this thing just grows and grows. I'm up to 6 gigs of ram used.

We're keeping a whole lot of stuff in memory.

diff --git a/src/or/rephist.c b/src/or/rephist.c
index 15fb674..bb59b6b 100644
--- a/src/or/rephist.c
+++ b/src/or/rephist.c
@@ -2390,6 +2390,8 @@ rep_hist_add_buffer_stats(double mean_num_cells_in_queue,
   if (!circuits_for_buffer_stats)
     circuits_for_buffer_stats = smartlist_new();
   smartlist_add(circuits_for_buffer_stats, stats);
+  log_info(LD_HIST, "circuits_for_buffer_stats now size %d",
+           smartlist_len(circuits_for_buffer_stats));
 }
 
 /** Remember cell statistics for circuit <b>circ</b> at time

Every time a circuit closes, we malloc some stuff and add it onto this smartlist. And we only do something with the contents of the smartlist after 24 hours.

During this overload period, my relay handles many hundreds of millions of circuits in a day. Keeping anything about each one of them is going to be too much.

What do we actually use this option for? Some experiment we did in the distant past? It sounds like "recommend against enabling it" is the first step. And then either redesigning it to accomplish whatever goals remain, or stripping it out.

I am cc'ing Karsten because I bet he can help us with the 'distant past' part.

Trac:
Cc: N/A to karsten

Replying to arma:

I am cc'ing Karsten because I bet he can help us with the 'distant past' part.

commit b493a2ccb97e00f4fe3acb5c59c941c2babaeebb
Author: Karsten Loesing <karsten.loesing@gmx.net>
Date:   Sun Jul 5 19:53:25 2009 +0200

    If configured, write cell statistics to disk periodically.

And then either redesigning it to accomplish whatever goals remain, or stripping it out.

Instead of deciles, a combination of variance, skew and kurtosis could be used (a la Commons Math's AbstractStorelessUnivariateStatistic).

I'd like to take a closer look and also discuss this at the metrics team meeting on Thursday. Copying iwakeh and robgjansen who I think might have an opinion on this, too.

Trac:
Cc: karsten to karsten, robgjansen, iwakeh

@arma, CellStatistics still needs to be explicitly enabled in the torrc. Is your valgrind to 6GB of RAM have it enabled? If not, something else is causing massive memory usage (again...)

Just as an FYI, I've opened #25149 (moved) to hook the rephist layer in the OOM handler.

I'm thinking this is an 0.3.4 item, unless there's some reason that people really need to have CellStatistics turned on.

Trac:
Milestone: Tor: 0.3.3.x-final to Tor: 0.3.4.x-final

TL;DR

These stats are useful for traffic modeling. I think a fine plan is to recommend to phase out cell statistics, with the goal of replacing them in the long term with more useful privacy-preserving measurements using the new PrivCount protocol when its ready, such as bytes-per-stream and streams-per-circuit.

Data point about RAM usage:

I've had the CellStatistics option enabled on my fast relays for a couple of years now. I've noticed that RAM usage on any of the relays would occasionally increase above 2 GiB (even when I temporarily experimented with MaxMemInQueues 512 MB), but I never linked it to CellStatistics. I've now disabled CellStatistics because RAM has recently become a bit of a problem for other reasons.

Do we still care about cell statistics?

Theoretically, cell statistics can help us understand how well our test networks model the conditions of the real Tor network, which would be useful for test networks that actually try to faithfully model the conditions of the real Tor network (like Shadow). For example, we could run a bunch of clients in the test network that are initiating stream transfers according to some understanding of client usage, and then we collect the cell stats on the relays in the test network and compare them to the stats from the public network. How close we are provides a measure of fidelity. I actually did this in the original Shadow paper.

However, circuit-level cell information by itself is not the greatest for informing how the clients in the test network should generate traffic in the first place. For that, we need a combination of stream-level and circuit-level information. From a client modeling perspective, I need to know how many streams each client should create, how much to download on each stream, how many pauses to add and how long to pause, etc. Circuits and cells are a second-level effect of the client model.

The privacy implications are also worth considering (some concerns raised here, and moved here). I think individual relay stats are much more specific that what we need; for modeling purposes, we lose very little utility by using aggregate network results, but they are much safer.

must-fix-before-033-stable is now 033-must

Trac:
Keywords: must-fix-before-033-stable deleted, 033-must added

Looks like we don't need this in 033

Trac:
Keywords: 033-must deleted, N/A added

Trac:
Keywords: ddos deleted, tor-dos added

Trac:
Keywords: N/A deleted, 034-triage-20180328 added

Per our triage process, these tickets are pending removal from 0.3.4.

Trac:
Keywords: N/A deleted, 034-removed-20180328 added

These tickets, tagged with 034-removed-*, are no longer in-scope for 0.3.4. We can reconsider any of them, if time permits.

Trac:
Milestone: Tor: 0.3.4.x-final to Tor: unspecified

0.3.1 is end of life, there are no more backports. Tagging with 031-unreached-backport instead.

Trac:
Keywords: 031-backport deleted, 031-unreached-backport added

0.2.9 is no longer maintained as of 1 January 2020. Therefore, we won't be backporting anything to it.

(None of these tickets are merge_ready, so we can just delete the backport tag.)

Trac:
Keywords: 029-backport deleted, N/A added

changed time estimate to 8h

mentioned in issue #25149 (moved)

moved to tpo/core/tor#25141 (closed)

mentioned in issue tpo/core/tor#25149 (closed)

enabling CellStatistics results in gigabytes of incremental memory consumption

Child items 0

Activity