I'm having trouble parsing compressed descriptors fetched via tor's DirPort. I don't know enough about .z compression to track down the problem though. Anyway, here's what I found:
Download consensus and all server descriptors from turtles:
Attempt to parse compressed consensus using Python's zlib:
Python 2.7.5 (default, Aug 25 2013, 00:04:04)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import zlib
>>> print len(zlib.decompress(open('turtles-consensus.z', 'rb').read()))
1085611
>>> print len(zlib.decompress(open('turtles-server-all.z', 'rb').read()))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
zlib.error: Error -5 while decompressing data: incomplete or truncated stream
>>> print len(zlib.decompressobj().decompress(open('turtles-server-all.z', 'rb').read()))
8932838
Decompressing the consensus works fine using zlib.decompress(), but decompressing the server descriptors does not. The only workaround I found was to explicitly create a decompressor using zlib.decompressobj(). AIUI, the difference between the two approaches is that the latter can handle partial content (cf. http://stackoverflow.com/questions/20620374/how-to-inflate-a-partial-zlib-file/20625078#20625078).
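For reference, here's a minimal sketch of that difference, using the same file as above (nothing new here, just the two calls side by side):

import zlib

data = open('turtles-server-all.z', 'rb').read()

# One-shot API: expects a complete zlib stream, i.e. one that was closed with
# Z_FINISH and carries the final Adler-32 trailer. On the turtles file it
# raises "Error -5 ... incomplete or truncated stream".
try:
    print len(zlib.decompress(data))
except zlib.error as e:
    print 'zlib.decompress failed:', e

# Streaming API: decompresses whatever it gets and does not insist on seeing
# the end of the stream, so it copes with a merely sync-flushed stream.
print len(zlib.decompressobj().decompress(data))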
We might need to look at or/directory.c for this. It might be that tor is not returning a properly 'terminated' zlib image (or whatever the right term is; I don't know how the zlib format works, at least not yet).
Maybe take a look at the part of the code that does the actual 'send zlib-compressed data in chunks' (in directory_handle_command_get() which starts on line ~2524):
Could it be that the way tor terminates that zlib stream (add 'empty compressed data' to the zlib stream) is not a proper way to close a zlib archive? (So for example this could mean that tor does indeed assume that it's okay to send 'partial content' (in the sense of unclosed-zlib-archive, zlib-as-stream, not in the sense of fail-to-send-some-data)?)
Can't look more into this right now, but maybe this pointer will help somehow.
It seems that zlib header should include the length of the zlib image. When sending in multiple chunks, tor is of course unable to provide the right length in the header.
(This is the "try this magical pill" part:) does using zlib.decompress(data, 15 + 32) on Python's side help, perchance? (Can't test right now.) See: http://stackoverflow.com/questions/1838699/how-can-i-decompress-a-gzip-stream-with-zlib (Note: this will probably not work, because the problem on SE was actually somewhat different; but it's easy to try out.)
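For concreteness, the pill would look something like this (untested sketch; zlib.MAX_WBITS is 15, and adding 32 asks zlib to auto-detect a zlib or gzip header):

import zlib

data = open('turtles-server-all.z', 'rb').read()
# 15 + 32 == zlib.MAX_WBITS | 32: let zlib auto-detect zlib vs. gzip headers.
print len(zlib.decompress(data, zlib.MAX_WBITS | 32))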
Could it be that the way tor terminates that zlib stream (add 'empty compressed data' to the zlib stream) is not a proper way to close a zlib archive? (So for example this could mean that tor does indeed assume that it's okay to send 'partial content' (in the sense of unclosed-zlib-archive, zlib-as-stream, not in the sense of fail-to-send-some-data)?)
I didn't look very closely, but it seems that tor doesn't simply add empty compressed data, but that it also sets done to 1.
It seems that zlib header should include the length of the zlib image. When sending in multiple chunks, tor is of course unable to provide the right length in the header.
Well, tor could go through chunks twice, once to sum up their lengths and write that to the header, and another time to actually send chunks.
But I don't know the code enough to say if this would solve the problem.
(This is the "try this magical pill" part:) does using zlib.decompress(data, 15 + 32) on Python's side help, perchance? (Can't test right now.) See: http://stackoverflow.com/questions/1838699/how-can-i-decompress-a-gzip-stream-with-zlib (Note: this will probably not work, because the problem on SE was actually somewhat different; but it's easy to try out.)
The magical pill didn't work.
But here's some more information from curl -vv: it seems that tor uses "deflate" to compress descriptors. AFAIK, there's also "zlib" and "gzip", which tor doesn't seem to use. Though Python's zlib should support all of them. Hmm.
I didn't look very closely, but it seems that tor doesn't simply add empty compressed data, but that it also sets done to 1.
You're right, I think. Someone could try using Wireshark or inserting a dump-this-data-in-hex-to-log call, and comparing the bytes with zlib's flush modes, or something. Fun pastime! :)
Python's zlib should indeed support deflate. Hrmgh.
Here's a more pragmatic angle, which does not help reduce anxiety at all:
curl http://76.73.17.194:9030/tor/server/all.z > turtles-server-all.z
curl http://76.73.17.194:9030/tor/server/all > turtles-server-all-not-compressed
python
Python 2.7.3 (default, Mar 13 2014, 11:03:55)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import zlib
>>> with open('turtles-server-all', 'wb') as f:
...     f.write(zlib.decompressobj().decompress(open('turtles-server-all.z', 'rb').read()))
...
>>> ^D
diff turtles-server-all turtles-server-all-not-compressed > turtles-diff
I also tried sorting the outputs before diffing, but that didn't help much, I think. Those same outputs sorted and then diffed: http://ravinesmp.com/volatile/sorted-turtles-diff (I just used sort naively.)
I did the two curls right one after the other, so maybe it's a matter of the dir authority continuously updating its descriptors, and thus it is natural for two successive all requests to return different results? (edit: some of the differences can definitely be explained by this: at least one relay updated its descriptor to a newer one; and the bandwidth numbers differ a bit; this is OK.)
I'm not sure what to make of this. Diffing the two outputs using vimdiff does help reduce confusion a bit; I'd recommend trying vimdiff here. It seems that the bandwidth-number-related differences are simply there because of the two successive GETs, but I'm not sure about the other differences.
If we can make sure the differences are only because of the continuously updated directory data, then we can pragmatically conclude that "every tool using compressed descriptors from directory authorities should assume it is receiving zlib stream data", and maybe "this assumption (zlib stream) should be mentioned in some appropriate place to avoid confusion."
If we can't make sure of that, then we have a spooky rabbit hole.
It seems that zlib header should include the length of the zlib image. When sending in multiple chunks, tor is of course unable to provide the right length in the header.
Well, tor could go through chunks twice, once to sum up their lengths and write that to the header, and another time to actually send chunks.
I suppose. The overall logic would have to be changed, as it's currently "make a chunk and send it out right now."
But I don't know the code enough to say if this would solve the problem.
Yeah, maybe this is not a very productive angle anyway.
[...]
But here's some more information from curl -vv: it seems that tor uses "deflate" to compress descriptors. AFAIK, there's also "zlib" and "gzip", which tor doesn't seem to use. Though Python's zlib should support all of them. Hmm.
I'm reeeally not sure of those terms myself, but I think that zlib encapsulates DEFLATE either way; it adds a (zlib or gzip) header to the DEFLATE-compressed data (DEFLATE is a compression algorithm). So it should be "the data is DEFLATEd; the only question is whether the header is in zlib or gzip format." But I'm not sure.
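In case it helps, here's a small sketch of those three framings with Python's zlib (my own toy payload; the wbits values assume a zlib library recent enough to understand the +16/+32 offsets):

import zlib

payload = 'router example ...\n' * 100

# zlib framing: 2-byte header (e.g. 0x78da) + DEFLATE data + Adler-32 trailer.
as_zlib = zlib.compress(payload)

# Raw DEFLATE: the same algorithm, but with no header and no trailer.
co = zlib.compressobj(9, zlib.DEFLATED, -zlib.MAX_WBITS)
as_raw = co.compress(payload) + co.flush()

# gzip framing: DEFLATE data with a gzip header and CRC-32/length trailer.
co = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS | 16)
as_gzip = co.compress(payload) + co.flush()

# Decompressing requires the matching wbits value (or | 32 for auto-detect).
assert zlib.decompress(as_zlib) == payload
assert zlib.decompress(as_raw, -zlib.MAX_WBITS) == payload
assert zlib.decompress(as_gzip, zlib.MAX_WBITS | 16) == payload
assert zlib.decompress(as_gzip, zlib.MAX_WBITS | 32) == payload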
Here's a theory why compressed and non-compressed results differ: the compressed result is cached on the directory, whereas the non-compressed result is put together on-the-fly. I didn't find the place in the code where this is done, but it seems at least plausible to me.
So, assuming we don't have this particular rabbit hole, can you rephrase your other insights? Is there anything wrong with the way directories compress strings, or is this just a matter of clients handling compressed responses in the right way? Thanks!
Here's a theory why compressed and non-compressed results differ: the compressed result is cached on the directory, whereas the non-compressed result is put together on-the-fly. I didn't find the place in the code where this is done, but it seems at least plausible to me.
Good idea. (fwiw, I could not find such a place, and from looking at write_to_evbuffer_zlib() (called by connection_write_to_buf_impl_(), which is the 'implementation' function for connection_write_to_buf_zlib(), noted in comment 1 above) in or/buffers.c (around line 2549), it would appear no caching is taking place. But I know nothing about bufferevents, and maybe one of those write() functions further down the callstack is doing something (haven't looked; didn't find anything, though).)
So, assuming we don't have this particular rabbit hole, can you rephrase your other insights? Is there anything wrong with the way directories compress strings, or is this just a matter of clients handling compressed responses in the right way? Thanks!
Let's keep this open for at least another day if that's OK, but for now, the tl;dr would be "the clients should assume the compressed data is a zlib stream." This should automatically imply that 1) all is well in the way tor handles zlib data (directory-server-side and client-side), and that 2) your python solution with the decompressobj() is the correct way to handle this (from that SO thread it looks like other people do the same when having to handle zlib streams) as far as python is concerned.
If it's not the correct way, it's (only) a matter of finding the proper way of handling zlib streams {in python | in general}.
Main caveat: I don't know nearly enough about zlib to be able to make such claims without frowning. :) But, (assuming the diffs can be explained away) I think that this was just a mismatch in assumptions about zlib data. This is how it should be.
I might still recommend to note this behaviour down somewhere, i.e. include a "clients should assume the compressed data is a zlib stream" (vs. archive), but I'm not sure what's the best place for that, if any.
I don't think any zlib-compressed-data caching happens (can't find any in buffers or torgzip, etc.)
Nevertheless, the diffs should really be just because directories' view of relays and their descriptors is always changing (it would be nice to confirm this in a definite manner, of course.) Also, some caching could be happening "somewhere else" (i.e.: I don't know.)
all.z is a zlib stream, and as far as zlib streams go, everything is OK with it.
I'm not really qualified to make the above statements, sorry. i.e.: someone else should take a look at this, too.
I don't think any zlib-compressed-data caching happens (can't find any in buffers or torgzip, etc.)
Nevertheless, the diffs should really be just because directories' view of relays and their descriptors is always changing (it would be nice to confirm this in a definite manner, of course.) Also, some caching could be happening "somewhere else" (i.e.: I don't know.)
I'm not too worried about this, but I can't say for sure that there's no bug here. But should we move this to a separate ticket? The two possible issues seem unrelated. Would you want to create that ticket?
all.z is a zlib stream, and as far as zlib streams go, everything is OK with it.
I don't buy that yet. Why would zlib.decompress fail if this is a valid zlib stream?
(Also note that the zlib.decompressobj solution doesn't work for me, because I'm really looking for a way to make java.util.zip.InflaterInputStream work. I just wrote the test cases in Python, because I was hoping to get more feedback on that. And look how this worked just fine! But I really suspect that the problem is in tor's use of zlib.)
So, I looked at turtles-server-all.z using xxd with RFC 1950 opened in a second window. The first bytes (0x78da) look like a valid header (78 = "deflate with 32k window size", da= "max compression, no dict"), but I'm not sure about the last bytes (0x0000ffff). These last bytes are supposed to be the Adler-32 checksum, but these bytes don't look like a valid checksum to me. Maybe something is not initialized or not flushed correctly?
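The same check can be done programmatically; here's a sketch (same file as above; it just prints the header and trailer bytes and compares the trailer against the Adler-32 of whatever decompressobj() can recover):

import binascii
import zlib

data = open('turtles-server-all.z', 'rb').read()

# RFC 1950 header: CMF byte (0x78 = deflate, 32k window) and FLG byte (0xda).
print 'header: ', binascii.hexlify(data[:2])

# The last four bytes of a properly finished zlib stream are the Adler-32
# checksum of the uncompressed data, stored big-endian.
print 'trailer:', binascii.hexlify(data[-4:])

plain = zlib.decompressobj().decompress(data)
print 'adler32 of recovered data: %08x' % (zlib.adler32(plain) & 0xffffffff)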
I'm not really qualified to make the above statements, sorry. i.e.: someone else should take a look at this, too.
zlib.decompress(string[, wbits[, bufsize]])
[...] wbits must not be smaller than the size originally used to compress the stream; using a too-small value will result in an exception. The default value is [...] 15.
I tried all the window sizes that people try (including zlib.MAX_WBITS|32 for automatic header detection), but nothing worked.
I don't know why decompressobj().decompress() does work. I looked at python's packages briefly, in particular, at local/lib/python2.7/encodings/zlib_codec.py, but didn't find anything useful there. Fun times.
I don't think any zlib-compressed-data caching happens (can't find any in buffers or torgzip, etc.)
Nevertheless, the diffs should really be just because directories' view of relays and their descriptors is always changing (it would be nice to confirm this in a definite manner, of course.) Also, some caching could be happening "somewhere else" (i.e.: I don't know.)
I'm not too worried about this, but I can't say for sure that there's no bug here. But should we move this to a separate ticket? The two possible issues seem unrelated. Would you want to create that ticket?
Could you please clarify/confirm what you had in mind? By the two issues, did you mean
.z-compressed descriptors can't be parsed / the zlib data returned by tor is strange (this ticket)
tor/server/all and tor/server/all.z (possibly) differ (new ticket)
Just say "yes, this", and I'll create a ticket for "2."
So, I looked at turtles-server-all.z using xxd with RFC 1950 opened in a second window. The first bytes (0x78da) look like a valid header (78 = "deflate with 32k window size", da= "max compression, no dict"), but I'm not sure about the last bytes (0x0000ffff). These last bytes are supposed to be the Adler-32 checksum, but these bytes don't look like a valid checksum to me. Maybe something is not initialized or not flushed correctly?
Good catch!
Potentially directly related: there are multiple zlib flush modes. Here's what zlib's manual has to say about Z_SYNC_FLUSH (emphasis mine):
If the parameter flush is set to Z_SYNC_FLUSH, all pending output is flushed to the output buffer and the output is aligned on a byte boundary, so that the decompressor can get all input data available so far. (In particular avail_in is zero after the call if enough output space has been provided before the call.) Flushing may degrade compression for some compression algorithms and so it should be used only when necessary. This completes the current deflate block and follows it with an empty stored block that is three bits plus filler bits to the next byte, followed by four bytes (00 00 ff ff).
This is probably related. (To clarify for someone else who might be reading this (ha, ha): those four bytes are at the very end of the archived .z stream.)
Regarding the flush modes themselves, python's zlib mentions
Z_SYNC_FLUSH, Z_FULL_FLUSH, Z_FINISH.
tor's gzip module only uses Z_SYNC_FLUSH and Z_FINISH.
Could it be that the two sides are making different assumptions about these flush modes?
Or that tor does not do a Z_FINISH in the end (though, maybe it does. It has the code for that. Whether it's called at the end of (the writing of) the .z stream, I'm not sure.)
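To make the flush-mode question concrete, here's a small sketch (toy payload, not tor's data) comparing a stream that is only Z_SYNC_FLUSHed with one that is closed with Z_FINISH:

import binascii
import zlib

payload = 'router example ...\n' * 100

# Only sync-flushed, never finished: the hypothesis about the bogus .z files.
co = zlib.compressobj()
synced = co.compress(payload) + co.flush(zlib.Z_SYNC_FLUSH)
print binascii.hexlify(synced[-4:])    # -> '0000ffff'

# Properly finished: ends with the Adler-32 trailer instead.
co = zlib.compressobj()
finished = co.compress(payload) + co.flush(zlib.Z_FINISH)
print binascii.hexlify(finished[-4:])  # -> some checksum, not '0000ffff'

# zlib.decompress() insists on a finished stream; decompressobj() does not.
print len(zlib.decompressobj().decompress(synced))   # works
print len(zlib.decompress(finished))                 # works
try:
    zlib.decompress(synced)
except zlib.error as e:
    print 'as expected:', e            # Error -5: incomplete or truncated stream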
Responding inline to less important points below (i.e.: can probably skip them.)
all.z is a zlib stream, and as far as zlib streams go, everything is OK with it.
I don't buy that yet. Why would zlib.decompress fail if this is a valid zlib stream?
(This paragraph doesn't help much, but it explains my confused thought process.) When I wrote that, I was operating under the assumption that there are zlib archives and streams, and that python expects an archive. fwiw, there does seem to be some kind of assumption that python makes (see the comments on http://stackoverflow.com/questions/3122145/zlib-error-error-3-while-decompressing-incorrect-header-check/22310760#22310760 maybe), but I've little clue. Maybe zlib data is a zlib stream and that's that, and it should just work.
(Also note that the zlib.decompressobj solution doesn't work for me, because I'm really looking for a way to make java.util.zip.InflaterInputStream work. I just wrote the test cases in Python, because I was hoping to get more feedback on that. And look how this worked just fine! But I really suspect that the problem is in tor's use of zlib.)
Any meaningful error messages on java's part here?
Also, by the way: the tor_gzip_compress() in torgzip.c looks like a rather hairy function. It doesn't operate at a "quick hack" level, but it does seem to make some specific assumptions. Either way, the comments included there do not reassure me that tor_gzip_compress() does everything right. :)
Hi Karsten, isn't this the same issue as legacy/trac#9379 (moved)? Nine months back I noticed that the DirPort's compressed responses were very frequently truncated. Retries could address it, but the failure rate was so high it was actually better for DocTor to opt for uncompressed requests...
Hi Karsten, isn't this the same issue as legacy/trac#9379 (moved)? Nine months back I noticed that the DirPort's compressed responses were very frequently truncated. Retries could address it, but the failure rate was so high it was actually better for DocTor to opt for uncompressed requests...
Ah, so there was a ticket for this. :) Yes, AFAICT this is the same issue. Same exception in python's traceback, etc.
There's a useful comment there,
In theory, you could use your script to download from a directory mirror, right? Can you find a directory mirror that logs on debug level that you then request server or extra-info descriptors from?
I now remember actually trying to download tor/server/all[.z] from a (very-small-time) exit relay that is also a directory mirror (with the intention of hacking in "dump descriptor data for the GET request as it comes to debug log / to file" functionality; switching to tor git-master would be easy anyway), and getting an empty response for both tor/server/all and tor/server/all.z:
My guess would be that the stream is ending before it writes out data from Z_FINISH. But when I tried to reproduce this behavior with chutney, I couldn't make it happen.
One possible thing to blame from dirserv.c:
/* XXXX024 This 'last' business should actually happen on the last
 * routerinfo, not on the last fingerprint. */
If that's the case, this will happen sometimes but not always. It looks like it would only happen when the identity of the last router added to the big key list turns out not to be present... but I'd expect that would be pretty rare. Strange that this would be happening so consistently.
To actually debug this, one good first step would be to see if you can get it to repeat reliably on a test network.
With respect to the other issues:
My guess is that the difference between all and the decompressed all.z is an artifact of changed information between the first and second requests. Probably a different issue, if it's an issue at all.
My guess is that the low-bandwidth router is refusing to answer because of:
if (global_write_bucket_low(TO_CONN(conn), dlen, 2)) {
  log_info(LD_DIRSERV,
           "Client asked for server descriptors, but we've been "
           "writing too many bytes lately. Sending 503 Dir busy.");
  write_http_status_line(conn, 503, "Directory busy, try again later");
  conn->dir_spool_src = DIR_SPOOL_NONE;
  goto done;
}
(I just added a unit test for write_to_buf_zlib to see if there might be something low-level going on there; I didn't find anything. Commit was ed0e2eca)
I'm not sure about reproing this for a test network, but see #9379 for a small script to repro this. You can target it against a particular authority or directory mirror to see what it's doing on the server side.
95.85.10.71 does indeed return a 503 (should have seen this myself, whoops.) Upped the limit, and at least the compressed descriptors archive is now being returned.
The .z it returns is successfully parsed by python's import zlib; print len(zlib.decompress(open('all.z', 'rb').read())), and the archive ends with a different four-byte tail (which seems unique etc., assuming it's the adler32 checksum). Re-downloaded later, another different tail.
(Re: other things that Nick said, will try and see to them, but won't happen before ~tomorrow)
turtles seems to consistently produce a bogus all.z (which at this point can be identified by the trailing 0000ffff, which seems to be the way a Z_SYNC_FLUSH (not a Z_FINISH) ends). Other authorities are inconclusive, but in any case turtles seems to be unique in its failure rate (subsequent requests do not help).
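For anyone who wants to check other authorities, this is roughly the check I used (a sketch; the address is turtles' DirPort from earlier in this ticket, substitute any other authority or mirror):

import binascii
import urllib2
import zlib

# turtles' DirPort, as used earlier in this ticket.
url = 'http://76.73.17.194:9030/tor/server/all.z'
data = urllib2.urlopen(url).read()

tail = binascii.hexlify(data[-4:])
print 'last four bytes:', tail
if tail == '0000ffff':
    print 'looks like the stream ends with a Z_SYNC_FLUSH marker, not Z_FINISH'

# decompressobj() recovers the data either way.
print 'decompressed length:', len(zlib.decompressobj().decompress(data))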
I wonder what the last entry in conn->fingerprint_stack (well, dir_fps, when it gets filled) can be. Can we infer it somehow? Can there be a dangling node on conn->fingerprint_stack that fails at descriptor lookup?
Without looking at dirserv.c, it looks like missing descriptors might be causing this problem. Here's an easy way to reproduce:
Fetch an existing descriptor by digest:
$ curl http://128.31.0.39:9131/tor/server/d/f53bcf37a5233f4bf91373a62f92692a526e866d.z > moria1-f5.z
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   874    0   874    0     0    300      0 --:--:--  0:00:02 --:--:--   300
Attempt to fetch a non-existing descriptor by digest:
$ curl http://128.31.0.39:9131/tor/server/d/1917db60e6c769567fc364e9dbb6fe8aa5feeb16.z > moria1-19.z
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
So far so good, now fetch both at once:
$ curl http://128.31.0.39:9131/tor/server/d/f53bcf37a5233f4bf91373a62f92692a526e866d+1917db60e6c769567fc364e9dbb6fe8aa5feeb16.z > moria1-f5+19.z
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   875    0   875    0     0   2523      0 --:--:-- --:--:-- --:--:--  2528
Note in the curl output how the second download is 1 byte larger than the first.
There's one difference in the first line, but more importantly, the last line differs. The second download ends with 0x0000ffff, which we saw before. It seems this was supposed to be the separator between the first and second descriptor. But then tor didn't find a second descriptor and just closed the stream.
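Here's a toy model of that hypothesis (my own strings standing in for descriptors, not tor's actual code path): each descriptor is followed by a Z_SYNC_FLUSH "separator", and Z_FINISH only happens when the last descriptor on the list is actually found.

import binascii
import zlib

def fetch(descriptors):
    # Toy server loop: descriptors that fail lookup are None and get skipped,
    # which also skips the finishing step if the *last* one is missing.
    co = zlib.compressobj()
    out = ''
    for i, desc in enumerate(descriptors):
        if desc is None:
            continue
        out += co.compress(desc)
        last = (i == len(descriptors) - 1)
        out += co.flush(zlib.Z_FINISH if last else zlib.Z_SYNC_FLUSH)
    return out

good = fetch(['router first ...\n', 'router second ...\n'])
bad = fetch(['router first ...\n', None])   # second lookup fails

print binascii.hexlify(good[-4:])   # Adler-32 trailer
print binascii.hexlify(bad[-4:])    # '0000ffff', the sync-flush marker
print len(zlib.decompress(good))                   # fine
print len(zlib.decompressobj().decompress(bad))    # only this works on the bad one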
Now, what is a compressed result of two descriptors supposed to look like?
Let's fetch another existing descriptor:
$ curl http://128.31.0.39:9131/tor/server/d/6d44fd5b8e5b7382089b02614eddb64b290eed42.z > moria1-6d.z
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   817    0   817    0     0    598      0 --:--:--  0:00:01 --:--:--   598
And now we're fetching both existing descriptors at once:
$ curl http://128.31.0.39:9131/tor/server/d/f53bcf37a5233f4bf91373a62f92692a526e866d+6d44fd5b8e5b7382089b02614eddb64b290eed42.z > moria1-f5+6d.z
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1497    0  1497    0     0   1257      0 --:--:--  0:00:01 --:--:--  1257
Here are the first line, all lines containing 0x0000ffff, and the last line:
Could you please clarify/confirm what you had in mind? By the two issues, did you mean
.z-compressed descriptors can't be parsed / the zlib data returned by tor is strange (this ticket)
tor/server/all and tor/server/all.z (possibly) differ (new ticket)
Just say "yes, this", and I'll create a ticket for "2."
Yes, this. :) See also nickm's later comment for a possible explanation of 2 ("My guess is that the difference between all and the decompressed all.z is an artifact of changed information between the first and second requests. Probably a different issue, if it's an issue at all."). Thanks!
In that case, I have a theory that my branch "bug11648" will fix this. I'm still curious where the fingerprints for the missing relays are coming from in the all.z case, though.
Trac: Status: new to needs_review Keywords: tor-relay deleted, tor-relay 024-backport added Milestone: N/A to Tor: 0.2.5.x-final
In that case, I have a theory that my branch "bug11648" will fix this. I'm still curious where the fingerprints for the missing relays are coming from in the all.z case, though.
TL;DR: the patch works, and seems to work in the very way we'd assumed it should. At this point I don't think there's any doubt how the strange .zs are produced: if the {microdescriptor, descriptor} for the last digest on a list of digests to be processed cannot be fetched, the needed call to close the zlib stream is skipped. The question (maybe) remains why a digest list for all.z may contain a non-fetchable entry at the end so frequently (this is empirical, and unclear).
As far as the patch is concerned, IMO it's good. (See below for details.)
Can't (yet) reproduce bogus (with 0000ffff in the end) all.z on a directory mirror
Can reproduce bogus "unfetchable descriptor at the end" archive:
(note that descriptors may expire etc.; later on, you may need to supply different digests)
Inserted a couple of log_info()s (branch "bug11648_dirserv_log_bogus_desc"). The "log_info() if descriptor or microdescriptor lookup fails" part might as well eventually go into master, because IMO it's log_info()-worthy. Unless logging requested router digests at INFO level is not a good idea?
"processed last digest" not logged when last digest lookup fails (confirms one of the assumptions)
"processed last digest" logged when last digest lookup succeeds
It would be nice to see what kinds of dangling non-fetchable digests are on turtles' all. This could be done in a simple manner by running those log_info()s, enabling INFO log, and grepping for "last digest on the stack".
still curious why turtles' (and maybe in general) all.z contain nonexistent digests at the end so often.
still curious why turtles' (and maybe in general) all.z contain nonexistent digests at the end so often.
Note how the processing for "all" works in dirserv_get_routerdescs(): it takes all the routers in the routerlist, then excludes any that can't be sent unencrypted. It takes their identity fingerprints, not their descriptor digests.
Later, connection_dirserv_add_servers_to_outbuf() tries to get a descriptor using signed_descriptor_get_by_fp(). But look how the publish_cutoff field works: it only actually serves a descriptor if it was published within the last ROUTER_MAX_AGE_TO_PUBLISH seconds. Otherwise, it treats the server as having no descriptor. I think that is likely the major cause of what's going on here.
Based on those tests, it seems like a good idea to merge the patch into 0.2.5. Because of our use of Z_SYNC_FLUSH, we aren't actually losing any real data here, though, so I say this is not a backport candidate for 0.2.4.
Merging patch, closing ticket. Please open new ticket(s) as needed if there are lingering mysteries to be chased down.
Trac: Status: needs_review to closed Resolution: N/A to fixed
We aren't done here. See bug legacy/trac#311 (moved) -- that's the reason we introduced the old behavior in 475eb5d6. I think there's an easy fix, though.
And in fact, I think we really fixed bug legacy/trac#311 (moved) with 05eff35a. Added a unit test as 891d239e. Opening a new ticket for a related issue though.
Trac: Status: reopened to closed Resolution: N/A to fixed