prop224: Disconnects on long-lasting HS connections (possibly because of mds)

changed milestone to %Tor: 0.3.3.x-final in legacy/trac

added component::core tor/tor in Legacy / Trac milestone::Tor: 0.3.3.x-final in Legacy / Trac owner::asn in Legacy / Trac points::1 in Legacy / Trac priority::medium in Legacy / Trac prop224 in Legacy / Trac prop224-bugs in Legacy / Trac resolution::invalid in Legacy / Trac severity::normal in Legacy / Trac status::closed in Legacy / Trac tor-hs in Legacy / Trac type::defect in Legacy / Trac labels

Replying to asn:

but everytime the number of mds would decrease more and more. Then at some point it reached below minimum dir info, Tor stalled and then it started fetching mds again. It's like something expires mds while in memory, but there is no logic to fetch them back in before we reach the minimum point.

There is something that expires the md's -- see microdesc_cache_clean(). And if you haven't made any new socks requests lately, there won't be any predicted ports, so Tor won't be fetching new ones. Its list of descriptors will slowly decay, and that's ok. Or at least, that's intended by the current design.

So maybe the weird bug here is that Tor cares about whether you have a current microdesc for your primary guard, in order to maintain the connection to it?

If that's the issue, one fix would be to fix stuff so Tor doesn't break current connections just because a microdesc has disappeared. Another fix would be to prevent Tor from going dormant while it has any open connections. I'd be tempted to explore the former fix, since the latter one will (a) reduce the effectiveness of going dormant, and (b) probably require some fiddly bits where we say "oh but that connection was just for fetching a new consensus so it's not really a connection".

Replying to asn:

{{{ Sep 16 00:51:57.000 [info] circuit_mark_for_close_(): Circuit 0 marked for close at src/or/command.c:599 (orig reason: 521, new reason: 0) }}}

521 means "remote" plus "finished". So I think it is referring to the consensus directory fetch attempt, not to your irc stream.

Replying to asn:

{{{ ... Sep 16 01:13:30.000 [info] connection_edge_reached_eof(): conn (fd 18) reached eof. Closing. }}}

Tell me more about this eof? It would appear that your irc client decided to hang up.

{{{ Sep 16 01:13:31.000 [info] connection_handle_listener_read(): New SOCKS connection opened from 127.0.0.1. }}}

And here it would appear that your irc client decided to make a new connection.

Trac:
torlog_first_incident.log.gz

Replying to arma:

Replying to asn:
...
Sep 16 01:13:30.000 [info] connection_edge_reached_eof(): conn (fd 18) reached eof. Closing.
}}}
Tell me more about this eof? It would appear that your irc client decided to hang up.

Here are the logs surrounding the eof. Seems like it's from natural cause...

{{{ Sep 16 01:12:56.000 [debug] connection_bucket_refill_helper(): global_read_bucket now 1073741824. Sep 16 01:12:56.000 [debug] connection_bucket_refill_helper(): or_conn->read_bucket now 1073741824. Sep 16 01:13:30.000 [debug] conn_read_callback(): socket 18 wants to read. Sep 16 01:13:30.000 [debug] read_to_chunk(): Encountered eof on fd 18 Sep 16 01:13:30.000 [debug] circuit_consider_stop_edge_reading(): considering layer_hint->package_window 906 Sep 16 01:13:30.000 [info] connection_edge_reached_eof(): conn (fd 18) reached eof. Closing. Sep 16 01:13:30.000 [debug] connection_edge_end(): Sending end on conn (fd 18). Sep 16 01:13:30.000 [debug] relay_send_command_from_edge_(): delivering 3 cell forward. Sep 16 01:13:30.000 [debug] circuit_package_relay_cell(): encrypting a layer of the relay cell. Sep 16 01:13:30.000 [debug] circuit_package_relay_cell(): encrypting a layer of the relay cell. Sep 16 01:13:30.000 [debug] circuit_package_relay_cell(): encrypting a layer of the relay cell. Sep 16 01:13:30.000 [debug] circuit_package_relay_cell(): encrypting a layer of the relay cell. Sep 16 01:13:30.000 [debug] append_cell_to_circuit_queue(): Made a circuit active. Sep 16 01:13:30.000 [debug] scheduler_channel_has_waiting_cells(): Channel 6 at 0x7f8c240de460 went from waiting_for_cells to pending Sep 16 01:13:30.000 [debug] conn_close_if_marked(): Cleaning up connection (fd 18). Sep 16 01:13:30.000 [debug] circuit_detach_stream(): Removing stream 51231 from circ 4062943041 Sep 16 01:13:30.000 [debug] connection_remove(): removing socket 18 (type Socks), n_conns now 4 Sep 16 01:13:30.000 [debug] connection_free_(): closing fd 18.


I wonder what's going on here. My IRC client was idle at that time and shouldn't have disconnected... IRC clients are supposed to stick around for weeks, months, years.

WRT my IRC client deciding to make a new connection shortly after, I guess that's probably normal reconnect behavior from irssi when it detects a disconnect.

I attached some debug logs of the surrounding times. Couldn't find something incriminating there...

Here is another such disconnect case from yesterday evening. It's a legacy HS client running latest master connecting to a legacy HS:

Sep 16 17:29:31.000 [debug] connection_edge_process_relay_cell(): Now seen 55751 relay cells here (command 9, stream 0).
Sep 16 17:29:31.000 [debug] circuit_get_by_circid_channel_impl(): circuit_get_by_circid_channel_impl() returning circuit 0x7f8c23b46820 for circ_id 2915300815, channel ID 6 (0x7f8c240de460)
Sep 16 17:29:31.000 [debug] pathbias_count_use_success(): Marked circuit 151 (7.000000/8.000000) as used successfully for guard motmot ($E248C3A604E196137A3175D4B2E4328922178B47)
Sep 16 17:29:31.000 [info] circuit_mark_for_close_(): Circuit 2915300815 marked for close at src/or/circuitbuild.c:1543 (orig reason: 520, new reason: 0)
Sep 16 17:29:31.000 [debug] connection_or_process_cells_from_inbuf(): 16: starting, inbuf_datalen 0 (0 pending in tls object).
Sep 16 17:29:31.000 [debug] connection_bucket_refill_helper(): global_read_bucket now 1073741824.
Sep 16 17:29:31.000 [debug] connection_bucket_refill_helper(): or_conn->read_bucket now 1073741824.
Sep 16 17:29:31.000 [debug] circuit_get_by_circid_channel_impl(): circuit_get_by_circid_channel_impl() returning circuit 0x7f8c23b46820 for circ_id 2915300815, channel ID 6 (0x7f8c240de460)
Sep 16 17:29:31.000 [debug] circuitmux_append_destroy_cell(): Cmux at 0x7f8c23af4dc0 queued a destroy for circ 2915300815, cmux counter is now 1, global counter is now 1
Sep 16 17:29:31.000 [debug] circuitmux_append_destroy_cell(): Primed a buffer.
Sep 16 17:29:31.000 [debug] channel_write_cell_generic_(): Writing p 0x7f8c2364cee0 to channel 0x7f8c240de460 with global ID 6
Sep 16 17:29:31.000 [debug] circuit_get_by_circid_channel_impl(): circuit_get_by_circid_channel_impl() returning circuit 0x7f8c23b46820 for circ_id 2915300815, channel ID 6 (0x7f8c240de460)
Sep 16 17:29:31.000 [debug] channel_tls_get_overhead_estimate_method(): Estimated overhead ratio for TLS chan 6 is 1.053891
Sep 16 17:29:31.000 [debug] channel_update_xmit_queue_size(): Increasing queue size for channel 6 by 541 from 0 to 541
Sep 16 17:29:31.000 [debug] channel_update_xmit_queue_size(): Increasing global queue size by 541 for channel 6, new size is 541
Sep 16 17:29:31.000 [debug] scheduler_adjust_queue_size(): Queue size adjustment by +541 for channel 6
Sep 16 17:29:31.000 [debug] scheduler_update_queue_heuristic(): Queue heuristic is now 0
Sep 16 17:29:31.000 [debug] scheduler_adjust_queue_size(): Queue heuristic is now 541
Sep 16 17:29:31.000 [debug] circuitmux_notify_xmit_destroy(): Cmux at 0x7f8c23af4dc0 sent a destroy, cmux counter is now 0, global counter is now 0
Sep 16 17:29:31.000 [debug] channel_send_destroy(): Sending destroy (circID 2915300815) on channel 0x7f8c240de460 (global ID 6)
Sep 16 17:29:31.000 [info] connection_edge_destroy(): CircID 0: At an edge. Marking connection for close.
Sep 16 17:29:31.000 [info] rend_client_note_connection_attempt_ended(): Connection attempt for d6sfftbz6pkwfwwl has ended; cleaning up temporary state.
Sep 16 17:29:31.000 [debug] conn_close_if_marked(): Cleaning up connection (fd 12).
Sep 16 17:29:31.000 [debug] connection_remove(): removing socket 12 (type Socks), n_conns now 4
Sep 16 17:29:31.000 [debug] connection_free_(): closing fd 12.
Sep 16 17:29:31.000 [debug] conn_write_callback(): socket 16 wants to write.
Sep 16 17:29:31.000 [debug] flush_chunk_tls(): flushed 514 bytes, 0 ready to flush, 0 remain.
Sep 16 17:29:31.000 [debug] connection_handle_write_impl(): After TLS write of 514: 0 read, 543 written
Sep 16 17:29:31.000 [debug] channel_tls_get_overhead_estimate_method(): Estimated overhead ratio for TLS chan 6 is 1.053891
Sep 16 17:29:31.000 [debug] channel_update_xmit_queue_size(): Decreasing queue size for channel 6 by 541 from 541 to 0
Sep 16 17:29:31.000 [debug] channel_update_xmit_queue_size(): Decreasing global queue size by 541 for channel 6, new size is 0
Sep 16 17:29:31.000 [debug] scheduler_adjust_queue_size(): Queue size adjustment by -541 for channel 6
Sep 16 17:29:31.000 [debug] scheduler_adjust_queue_size(): Queue heuristic is now 0
Sep 16 17:29:31.000 [debug] connection_bucket_refill_helper(): global_write_bucket now 1073741824.
Sep 16 17:29:31.000 [debug] connection_bucket_refill_helper(): or_conn->write_bucket now 1073741824.
Sep 16 17:29:32.000 [debug] conn_read_callback(): socket 7 wants to read.
Sep 16 17:29:32.000 [debug] connection_handle_listener_read(): Connection accepted on socket 12 (child of fd 7).
Sep 16 17:29:32.000 [info] connection_handle_listener_read(): New SOCKS connection opened from 127.0.0.1.
Sep 16 17:29:32.000 [debug] connection_add_impl(): new conn type Socks, socket 12, address 127.0.0.1, n_conns 4.
Sep 16 17:29:32.000 [debug] conn_read_callback(): socket 12 wants to read.
Sep 16 17:29:32.000 [debug] read_to_chunk(): Read 33 bytes. 33 on inbuf.
Sep 16 17:29:32.000 [debug] connection_ap_handshake_process_socks(): entered.
Sep 16 17:29:32.000 [debug] parse_socks(): socks4: Everything is here. Success.
Sep 16 17:29:32.000 [debug] connection_ap_handshake_rewrite(): Client asked for d6sfftbz6pkwfwwl.onion:6697
Sep 16 17:29:32.000 [info] connection_ap_handle_onion(): Got a hidden service request for ID 'd6sfftbz6pkwfwwl'
Sep 16 17:29:32.000 [info] rep_hist_note_used_internal(): New port prediction added. Will continue predictive circ building for 2580 more seconds.
Sep 16 17:29:32.000 [info] connection_ap_handle_onion(): Descriptor is here. Great.
Sep 16 17:29:32.000 [info] connection_edge_process_inbuf(): data from edge while in 'waiting for circuit' state. Leaving it on buffer.
Sep 16 17:29:32.000 [notice] Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.

The incriminating mark for close at circuitbuild.c:1543 seems to be from circuit_truncated(). Could this be from natural causes where one of the rend circuit hops died and the circ had to be rebuilt? :/

Seems like the eof from comment:4 was actually caused by my irssi, since it failed to receive a PONG message from the ircd for 5 minutes which is the timeout. We now need to figure that part out.

Trac:
Status: new to assigned
Owner: N/A to asn

Trac:
Status: assigned to needs_information

Deferring this out of 0.3.2 for now, till we learn more info.

Trac:
Milestone: Tor: 0.3.2.x-final to Tor: 0.3.3.x-final

Seems like most disconnects happening on the ircd are caused by killing of rend circs:

circuit_mark_for_close_(): Circuit 4054201541 marked for close at src/or/circuitbuild.c:1543 (orig reason: 520, new reason: 0)\

where orig reason: 520 means that the circ died because of END_CIRC_REASON_CHANNEL_CLOSED...

Is it normal for this to occur so frequently??? It occurs for me about once a day, and it also occurs to more people.

It seems the current conclusion for the client-side disconnect is that it happened because of a (mis)feature in your irc client?

Replying to asn:

Seems like most disconnects happening on the ircd are caused by killing of rend circs: {{{ circuit_mark_for_close_(): Circuit 4054201541 marked for close at src/or/circuitbuild.c:1543 (orig reason: 520, new reason: 0)
}}}

Whose log is this? Your Tor client still, or is this from the service-side?

If it's the service side, are there more logs? Do they come in clumps, or spread out, or what?

After almost a month of ping/pong with a simpler client and service setup, we got very very few of these so I'm calling it "IRC protocol + IRC daemon" crazyness.

Trac:
Status: needs_information to closed
Resolution: N/A to invalid

closed

changed time estimate to 8h

mentioned in issue legacy/trac#23621 (moved)

moved from legacy/trac#23543 (moved)

added Bug label and removed 1 deleted label

added 1 deleted label and removed 1 deleted label

prop224: Disconnects on long-lasting HS connections (possibly because of mds)

Child items ...

Activity