At comment:18:ticket:10703, xfix reports on another means of discovering the browser's fallback character encoding, the document.characterSet property (and possibly its aliases document.charset and document.inputEncoding). There is a demo site here:
https://hsivonen.com/test/moz/check-charset.htm
Using tor-browser-linux64-6.5a2_en-US.tar.xz, I get the output
Your fallback charset is: windows-1252
But using tor-browser-linux64-6.0.4_ko.tar.xz, I get the output
Your fallback charset is: EUC-KR
This is a separate issue from #10703 (closed). I'll leave a comment with a demo page that shows both techniques, with the one in #10703 (closed) giving the same result and document.characterSet giving different results.
The really strange thing is that this only seems to be effective when the server has HSTS (a valid Strict-Transport-Security header). I couldn't reproduce the result of the hsivonen.com demo site with a local web server, nor with an onion service, even when copying the demo and its header exactly. Only when I put it on an HTTPS server with HSTS could I reproduce it. I'll leave a comment with two demo pages allowing you to compare.
Edit 2019-10-02: Ignore the above paragraph about HSTS. The difference is actually due to whether the document specifies its own encoding. See comment:7.
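The condition described in that edit can be sketched mechanically. This is a rough illustration only (the regex is simplified and is not a real HTML parser): when a document declares its own encoding, the browser never consults the locale-dependent fallback, so document.characterSet reveals nothing about the locale.

```javascript
// Rough check: does this HTML document declare its own character encoding,
// either via <meta charset> or the http-equiv Content-Type form?
function declaresEncoding(html) {
  return /<meta\s+charset=|<meta[^>]*http-equiv=["']?content-type/i.test(html);
}

console.log(declaresEncoding('<!doctype html><meta charset="utf-8"><p>hi</p>')); // true
console.log(declaresEncoding('<!doctype html><p>hi</p>'));                       // false
```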
I set up a demo page on two servers, one with HSTS and one without. Only the one with HSTS shows a difference in document.characterSet. Note that neither of the servers specifies the encoding in the Content-Type header, so you get a warning in the browser console and the browser has to infer the encoding.
The technique from #10703 (closed) always finds iso-8859-1. (I think that technique has trouble distinguishing iso-8859-1 and windows-1252.)
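That ambiguity is baked into the web platform: per the WHATWG Encoding Standard, "iso-8859-1" is merely a label for the windows-1252 decoder, so the two cannot be told apart from decoding behavior. A quick check with the standard TextDecoder API (this assumes a Node build with full ICU, the default in current releases):

```javascript
// The label "iso-8859-1" resolves to the canonical windows-1252 decoder,
// so .encoding reports "windows-1252" even when asked for iso-8859-1.
const d = new TextDecoder('iso-8859-1');
console.log(d.encoding); // "windows-1252"
```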
Chromium 52.0.2743.116 doesn't appear to behave differently with HSTS versus non-HSTS. Go to Settings → Web content → Customize fonts → Encoding and change it to Korean; both demo pages then show EUC-KR.
gk: can we change the keyword to tbb-fingerprinting-locale please? TIA :)
I am only going on previous comments about which sites have HSTS and which don't (and those comments are contradictory, I think; I need coffee - let me know if I have it the wrong way round). Either way, there are four test sites
The thorin test page links to and opens the other three in a new tab.
Obligatory Pic
spreadsheet to follow
Results:
all tests done in 9.0a6
all 30 non-en-US bundles tested were set to spoof English
excluding the windows-1252 fallback, there are 12 buckets covering 14 languages
ko - not tested, waiting for #31886 (moved), but reading above it would be windows-1252 anyway
mk - had to install the Macedonian language pack and set spoof etc, see #31725 (moved)
Notes
Options → General → Languages → Fonts and Colors → Advanced → Text Encoding for Legacy Content
this sets the pref intl.charset.fallback.override if you change it from "Default for current locale"
Solution
Set intl.charset.fallback.override = windows-1252 when privacy.spoof_english = 2, and reset it when privacy.spoof_english !== 2
Do this upstream (not sure if #10703 (closed) also needs upstreaming)
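The proposed rule amounts to a one-line decision; here is a sketch (the function name is made up, and an actual patch would set the pref through Firefox's preferences service rather than return a value):

```javascript
// Sketch of the proposed rule: when privacy.spoof_english === 2
// (spoof English), force the en-US fallback encoding; otherwise
// clear the override so the locale default applies again.
function fallbackOverrideFor(spoofEnglish) {
  return spoofEnglish === 2 ? 'windows-1252' : null; // null => reset the pref
}

console.log(fallbackOverrideFor(2)); // 'windows-1252'
console.log(fallbackOverrideFor(1)); // null
```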
thinking out loud: If they're requesting pages as en-US, etc. (spoof = 2), then the breakage should be nothing more than for a normal en-US bundle, right? IDK, does the override pref affect chrome? Does this impact users on non-English OSes?
I am only going on previous comments about which sites have HSTS and which don't
You can forget about HSTS. That conjecture was wrong. bamsoftware.com has HSTS and it doesn't show the leak. The reason the previous results seem contradictory is that the page that in 2016 was at https://people.eecs.berkeley.edu/ (no HSTS) now redirects to a different server, https://www.bamsoftware.com/ (HSTS).
If the cause of the difference is not HSTS, what is it? My new guess is that it must have to do with the Content-Type header and whether it specifies an encoding or not.
Edit: PS: can we change the title? Replace HSTS with legacy encoding or something - thanks
There's an error in my spreadsheet: hu and pl are the same, but I said they were different, so that's one less bucket. But I tested all the legacy fallback options available in Firefox, and ko returns EUC-KR, so I would expect that to be the same in TB.
There are 14 values in the UI legacy-fallback combobox; they are
as well as "default for current locale", which would cover any others, I guess (IANA expert) - e.g. I am not sure what happens with Lithuanian, Malay: but Thai would leak as windows-874
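For reference, a partial sketch of the locale → fallback buckets implied above. The ko, th, and windows-1252-default entries come from this thread; the remaining entries are my best recollection of Firefox's per-locale fallback table and should be verified against the source before relying on them:

```javascript
// Partial locale → fallback-encoding map (illustrative, not exhaustive).
const FALLBACK = {
  ko: 'EUC-KR',        // from this thread
  th: 'windows-874',   // from this thread
  ru: 'windows-1251',  // assumption (Cyrillic bucket)
  tr: 'windows-1254',  // assumption
  he: 'windows-1255',  // assumption
  ar: 'windows-1256',  // assumption
};
const DEFAULT_FALLBACK = 'windows-1252'; // en-US and most Western locales

function fallbackFor(locale) {
  return FALLBACK[locale] || DEFAULT_FALLBACK;
}

console.log(fallbackFor('ko'));    // 'EUC-KR'
console.log(fallbackFor('en-US')); // 'windows-1252'
```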
Trac: Description edited to add the 2019-10-02 note above. Summary changed from "document.characterSet enables fingerprinting of localization (only with HSTS?)" to "document.characterSet leaks locale when HTML page does not specify its own encoding"