At comment:18:ticket:10703, xfix reports on another means of discovering the browser's fallback character encoding, the document.characterSet property (and possibly its aliases document.charset and document.inputEncoding). There is a demo site here:
https://hsivonen.com/test/moz/check-charset.htm
Using tor-browser-linux64-6.5a2_en-US.tar.xz, I get the output
Your fallback charset is: windows-1252
But using tor-browser-linux64-6.0.4_ko.tar.xz, I get the output
Your fallback charset is: EUC-KR
This is a separate issue from legacy/trac#10703 (closed). I'll leave a comment with a demo page that shows both techniques, with the one in legacy/trac#10703 (closed) giving the same result and document.characterSet giving different results.
The really strange thing is that this only seems to be effective when the server has HSTS (a valid Strict-Transport-Security header). I couldn't reproduce the result of the hsivonen.com demo site with a local web server, nor with an onion service, even when copying the demo and its header exactly. Only when I put it on an HTTPS server with HSTS could I reproduce it. I'll leave a comment with two demo pages allowing you to compare.
Edit 2019-10-02: Ignore the above paragraph about HSTS. The difference is actually due to whether the document specifies its own encoding. See comment:7.
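To illustrate why the fallback encoding is observable at all: the same bytes, served without any declared encoding, decode to different text depending on which fallback the browser picks. A minimal sketch using the WHATWG TextDecoder (encoding labels per the Encoding Standard; runnable in Node):

```javascript
// The two-byte sequence 0xB0 0xA1 is "가" in EUC-KR (the ko fallback)
// but "°¡" in windows-1252 (the en-US fallback), so a page that decodes
// undeclared bytes can tell the two fallbacks apart.
const bytes = new Uint8Array([0xb0, 0xa1]);
const asWin1252 = new TextDecoder('windows-1252').decode(bytes);
const asEucKr = new TextDecoder('euc-kr').decode(bytes);
console.log(asWin1252); // "°¡"
console.log(asEucKr);   // "가"
```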
I set up a demo page on two servers, one with HSTS and one without. Only the one with HSTS shows a difference in document.characterSet. Note that neither of the servers specifies the encoding in the Content-Type header, so you get a warning in the browser console and the browser has to infer the encoding.
The technique from legacy/trac#10703 (closed) always finds iso-8859-1. (I think that technique has trouble distinguishing iso-8859-1 and windows-1252.)
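One plausible reason (my reading of the WHATWG Encoding Standard): the label iso-8859-1 is defined as an alias of windows-1252, so as far as the web platform is concerned they are the same decoder:

```javascript
// "iso-8859-1" is just a label for the windows-1252 decoder in the
// WHATWG Encoding Standard, so decoding cannot tell the two apart.
console.log(new TextDecoder('iso-8859-1').encoding); // "windows-1252"
```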
Chromium 52.0.2743.116 doesn't appear to distinguish between HSTS and non-HSTS. Go to Settings → Web content → Customize fonts → Encoding and change the encoding to Korean; both demo pages then show EUC-KR.
gk: can we change the keyword to tbb-fingerprinting-locale please? TIA :)
I am only going on previous comments about which sites have HSTS and which don't (and those comments are contradictory, I think; I need coffee, so let me know if I have it the wrong way round). Either way, there are four test sites.
thinking out loud: if they're requesting pages as en-US, etc. (spoof = 2), then the breakage should be nothing more than a normal en-US bundle, right? IDK, does the override pref affect chrome? Does this impact users on non-English OSes?
I am only going on previous comments about which sites have HSTS and which don't
You can forget about HSTS. That conjecture was wrong. bamsoftware.com has HSTS and it doesn't show the leak. The reason the previous results seem contradictory is that the page that in 2016 was at https://people.eecs.berkeley.edu/ (no HSTS) now redirects to a different server, https://www.bamsoftware.com/ (HSTS).
If the cause of the difference is not HSTS, what is it? My new guess is that it must have to do with the Content-Type header and whether it specifies an encoding or not.
Edit: PS: can we change the title: replace HSTS with legacy encoding or something, thanks.
There's an error in my spreadsheet: hu and pl are the same, but I said they were different, so that's one less bucket. But I tested all the legacy fallback options available in Firefox, and ko returns EUC-KR, so I would expect the same in TB.
There are 14 values in the UI legacy fallback combobox, they are
as well as "default for current locale", which would cover any others, I guess (IANA expert) - e.g. I am not sure what happens with Lithuanian or Malay, but Thai would leak as windows-874
Trac: Summary: document.characterSet enables fingerprinting of localization (only with HSTS?) to document.characterSet leaks locale when HTML page does not specify its own encoding
all TBs not using spoof_english - previous glorious colored results from 2 years ago: #20025 (comment 2633410)
ar (was windows-1256)
cs (was windows-1250)
el (was ISO-8859-7)
fa (was windows-1256)
he (was windows-1255)
hu (was ISO-8859-2)
ja (was Shift_JIS)
ko (was EUC-KR according to david)
lt (new)
mk (was windows-1251)
ms (new)
my (new)
pl (was ISO-8859-2)
ru (was windows-1251)
th (new - would have been windows-874?)
tr (was windows-1254)
vi (was windows-1258)
zh-CN (was GBK)
zh-TW (was Big5)
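Collecting the known pairs above into a table shows the anonymity buckets directly (lt/ms/my/th omitted, since their values are new/unconfirmed). A page could invert the map like this:

```javascript
// Locale → fallback-encoding pairs from the results above. A page that
// reads document.characterSet can invert this map to bucket visitors by
// locale; some buckets hold more than one locale (ar/fa, hu/pl, mk/ru).
const fallbackByLocale = {
  ar: 'windows-1256', cs: 'windows-1250', el: 'ISO-8859-7',
  fa: 'windows-1256', he: 'windows-1255', hu: 'ISO-8859-2',
  ja: 'Shift_JIS', ko: 'EUC-KR', mk: 'windows-1251',
  pl: 'ISO-8859-2', ru: 'windows-1251', tr: 'windows-1254',
  vi: 'windows-1258', 'zh-CN': 'GBK', 'zh-TW': 'Big5',
};

function localesFor(charset) {
  return Object.keys(fallbackByLocale)
    .filter(l => fallbackByLocale[l] === charset);
}

console.log(localesFor('windows-1251')); // [ 'mk', 'ru' ]
```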
PS: that was a painful exercise (mostly all on old TB11.5a1 with the missing connection button bug - thanks), please don't make me do it again. I will update my non-en-US suite one day, probably when 12 hits... it just takes so long :-(