document.characterSet leaks locale when HTML page does not specify its own encoding
At comment:18:ticket:10703, xfix reports on another means of discovering the browser's fallback character encoding, the document.characterSet property (and possibly its aliases document.charset and document.inputEncoding). There is a demo site here:
Using tor-browser-linux64-6.5a2_en-US.tar.xz, I get the output
Your fallback charset is: windows-1252
But using tor-browser-linux64-6.0.4_ko.tar.xz, I get the output
Your fallback charset is: EUC-KR
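The check behind those two outputs can be sketched roughly as follows. This is a minimal illustration, not the code of the demo site: `reportedEncoding` and `describeEncoding` are hypothetical helper names, and the message text is simply copied from the output above. The function takes a document-like object so that it can also be exercised outside a browser.

```javascript
// Returns the encoding the browser reports for a document.
// `doc` is any object shaped like a DOM Document; in a page, pass `document`.
function reportedEncoding(doc) {
  // characterSet is the standard property; charset and inputEncoding
  // are the legacy aliases mentioned in this ticket.
  return doc.characterSet || doc.charset || doc.inputEncoding || null;
}

// Hypothetical helper that formats the result like the demo output.
function describeEncoding(doc) {
  const enc = reportedEncoding(doc);
  return enc === null
    ? 'No encoding reported'
    : 'Your fallback charset is: ' + enc;
}

// Only runs inside a browser, where a global `document` exists.
if (typeof document !== 'undefined') {
  console.log(describeEncoding(document));
}
```

The value is only a fallback charset when the page does not declare an encoding itself; otherwise the property reports the declared encoding.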
This is a separate issue from legacy/trac#10703 (closed). I'll leave a comment with a demo page that shows both techniques: the one from legacy/trac#10703 gives the same result in both browsers, while document.characterSet gives different results.
The really strange thing is that this only seems to be effective when the server has HSTS (a valid Strict-Transport-Security header). I couldn't reproduce the result of the hsivonen.com demo site with a local web server, nor with an onion service, even when copying the demo and its header exactly. Only when I put it on an HTTPS server with HSTS could I reproduce it. I'll leave a comment with two demo pages allowing you to compare.
Edit 2019-10-02: Ignore the above paragraph about HSTS. The difference is actually due to whether the document specifies its own encoding. See comment:7.
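To compare the two cases from the edit note, one could generate a pair of test pages that differ only in whether they declare an encoding. The sketch below is an assumption-laden illustration, not the demo from comment:7: `buildProbePage` is a hypothetical helper, and it assumes the server sends `Content-Type: text/html` without a `charset` parameter, since a charset in the header would also count as the document specifying its encoding.

```javascript
// Hypothetical helper: builds a tiny page that prints document.characterSet.
// declareEncoding = true  -> page declares UTF-8 via <meta charset>
// declareEncoding = false -> page declares nothing, so the browser must
//                            fall back to its own default encoding
function buildProbePage(declareEncoding) {
  const meta = declareEncoding ? '<meta charset="utf-8">' : '';
  return [
    '<!DOCTYPE html>',
    '<html><head>' + meta + '<title>charset probe</title></head>',
    '<body><p id="out"></p>',
    '<script>',
    'document.getElementById("out").textContent =',
    '  "document.characterSet is: " + document.characterSet;',
    // Escaped so this source can itself be embedded in an inline script.
    '<\/script>',
    '</body></html>'
  ].join('\n');
}

const declaredPage = buildProbePage(true);
const undeclaredPage = buildProbePage(false);
```

Serving `undeclaredPage` is the case where the reported value can depend on the browser's locale; `declaredPage` should report the declared encoding regardless of locale.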