At comment:18:ticket:10703, xfix reports on another means of discovering the browser's fallback character encoding, the document.characterSet property (and possibly its aliases document.charset and document.inputEncoding). There is a demo site here:
https://hsivonen.com/test/moz/check-charset.htm
Using tor-browser-linux64-6.5a2_en-US.tar.xz, I get the output
Your fallback charset is: windows-1252
But using tor-browser-linux64-6.0.4_ko.tar.xz, I get the output
Your fallback charset is: EUC-KR
This is a separate issue from legacy/trac#10703 (closed). I'll leave a comment with a demo page that shows both techniques, with the one in legacy/trac#10703 (closed) giving the same result and document.characterSet giving different results.
The really strange thing is that this only seems to be effective when the server has HSTS (a valid Strict-Transport-Security header). I couldn't reproduce the result of the hsivonen.com demo site with a local web server, nor with an onion service, even when copying the demo and its header exactly. Only when I put it on an HTTPS server with HSTS could I reproduce it. I'll leave a comment with two demo pages allowing you to compare.
Edit 2019-10-02: Ignore the above paragraph about HSTS. The difference is actually due to whether the document specifies its own encoding. See comment:7.
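To illustrate why the fallback encoding is observable at all: the same bytes, served without any declared encoding, decode to different text depending on which fallback the browser picks. A minimal sketch using the WHATWG TextDecoder (encoding labels per the Encoding Standard; runnable in Node):

```javascript
// The two-byte sequence 0xB0 0xA1 is "가" in EUC-KR (the ko fallback)
// but "°¡" in windows-1252 (the en-US fallback), so a page that decodes
// undeclared bytes can tell the two fallbacks apart.
const bytes = new Uint8Array([0xb0, 0xa1]);
const asWin1252 = new TextDecoder('windows-1252').decode(bytes);
const asEucKr = new TextDecoder('euc-kr').decode(bytes);
console.log(asWin1252); // "°¡"
console.log(asEucKr);   // "가"
```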
I set up a demo page on two servers, one with HSTS and one without. Only the one with HSTS shows a difference in document.characterSet. Note that neither of the servers specifies the encoding in the Content-Type header, so you get a warning in the browser console and the browser has to infer the encoding.
The technique from legacy/trac#10703 (closed) always finds iso-8859-1. (I think that technique has trouble distinguishing iso-8859-1 and windows-1252.)
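One plausible reason (my reading of the WHATWG Encoding Standard): the label iso-8859-1 is defined as an alias of windows-1252, so as far as the web platform is concerned they are the same decoder:

```javascript
// "iso-8859-1" is just a label for the windows-1252 decoder in the
// WHATWG Encoding Standard, so decoding cannot tell the two apart.
console.log(new TextDecoder('iso-8859-1').encoding); // "windows-1252"
```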
Chromium 52.0.2743.116 doesn't appear to distinguish between HSTS and non-HSTS. Go to Settings → Web content → Customize fonts → Encoding and change the encoding to Korean; both demo pages then show EUC-KR.
gk: can we change the keyword to tbb-fingerprinting-locale please? TIA :)
I am only going on previous comments about which sites have HSTS and which don't (and those comments are contradictory, I think; I need coffee, so let me know if I have it the wrong way round). Either way, there are four test sites.
thinking out loud: if they're requesting pages as en-US, etc. (spoof = 2), then the breakage should be nothing more than a normal en-US bundle, right? IDK, does the override pref affect chrome? Does this impact users on non-English OSes?
I am only going on previous comments about which sites have HSTS and which don't
You can forget about HSTS. That conjecture was wrong. bamsoftware.com has HSTS and it doesn't show the leak. The reason the previous results seem contradictory is that the page that in 2016 was at https://people.eecs.berkeley.edu/ (no HSTS) now redirects to a different server, https://www.bamsoftware.com/ (HSTS).
If the cause of the difference is not HSTS, what is it? My new guess is that it must have to do with the Content-Type header and whether it specifies an encoding or not.
Edit: PS: can we change the title: replace HSTS with legacy encoding or something, thanks.
There's an error in my spreadsheet: hu and pl are the same, but I said they were different, so that's one less bucket. But I tested all the legacy fallback options available in Firefox, and ko returns EUC-KR, so I would expect the same in TB.
There are 14 values in the UI legacy fallback combobox, they are
as well as "default for current locale", which would cover any others, I guess (IANA expert) - e.g. I am not sure what happens with Lithuanian or Malay, but Thai would leak as windows-874
Trac: Summary: document.characterSet enables fingerprinting of localization (only with HSTS?) to document.characterSet leaks locale when HTML page does not specify its own encoding
all TBs not using spoof_english - previous glorious colored results from 2 years ago: #20025 (comment 2633410)
ar (was windows-1256)
cs (was windows-1250)
el (was ISO-8859-7)
fa (was windows-1256)
he (was windows-1255)
hu (was ISO-8859-2)
ja (was Shift_JIS)
ko (was EUC-KR according to david)
lt (new)
mk (was windows-1251)
ms (new)
my (new)
pl (was ISO-8859-2)
ru (was windows-1251)
th (new - would have been windows-874?)
tr (was windows-1254)
vi (was windows-1258)
zh-CN (was GBK)
zh-TW (was Big5)
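Collecting the known pairs above into a table shows the anonymity buckets directly (lt/ms/my/th omitted, since their values are new/unconfirmed). A page could invert the map like this:

```javascript
// Locale → fallback-encoding pairs from the results above. A page that
// reads document.characterSet can invert this map to bucket visitors by
// locale; some buckets hold more than one locale (ar/fa, hu/pl, mk/ru).
const fallbackByLocale = {
  ar: 'windows-1256', cs: 'windows-1250', el: 'ISO-8859-7',
  fa: 'windows-1256', he: 'windows-1255', hu: 'ISO-8859-2',
  ja: 'Shift_JIS', ko: 'EUC-KR', mk: 'windows-1251',
  pl: 'ISO-8859-2', ru: 'windows-1251', tr: 'windows-1254',
  vi: 'windows-1258', 'zh-CN': 'GBK', 'zh-TW': 'Big5',
};

function localesFor(charset) {
  return Object.keys(fallbackByLocale)
    .filter(l => fallbackByLocale[l] === charset);
}

console.log(localesFor('windows-1251')); // [ 'mk', 'ru' ]
```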
PS: that was a painful exercise (mostly all on old TB11.5a1 with the missing connection button bug - thanks), please don't make me do it again. I will update my non-en-US suite one day, probably when 12 hits... it just takes so long :-(