Detecting censorship in HTML pages
The OONI HTTP test opens a TCP connection to the target host and sends an HTTP request to obtain a web page. If a page is retrieved, it may be one served by the censor rather than by the site. The issue is determining whether such a page is the legitimate response or a block page. How do we do this?
The naive way is to fetch the page over Tor and check whether it matches the one fetched over the live network. This has some problems, though: for example, if the site serves geolocated content, the page fetched over Tor will differ from the one fetched directly even when no censorship is involved.
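As a rough sketch of the naive comparison, the snippet below hashes two response bodies and checks them for equality. It assumes both bodies have already been fetched (one over Tor, one over the live network); the function names are made up for illustration and are not part of OONI.

```python
import hashlib


def body_digest(body: bytes) -> str:
    """Return the SHA-256 hex digest of a response body."""
    return hashlib.sha256(body).hexdigest()


def bodies_match(control_body: bytes, experiment_body: bytes) -> bool:
    """True only if the two bodies are byte-for-byte identical.

    control_body:    page fetched over Tor (assumed uncensored)
    experiment_body: page fetched over the live network
    """
    return body_digest(control_body) == body_digest(experiment_body)
```

Because the check is exact, two legitimate copies of a geolocated page (say, one greeting a visitor from Germany and one from Italy) would be reported as a mismatch, which is the weakness described above.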
Another simple approach is to keep a database of the content lengths of websites, but this will also fail if the block page is very similar in length to the real web page.
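A minimal sketch of the content-length check might look like the following. The 10% tolerance and the function name are assumptions chosen for illustration; the expected length would come from the database mentioned above.

```python
def length_looks_legitimate(expected_length: int,
                            observed_length: int,
                            tolerance: float = 0.1) -> bool:
    """Compare an observed body length against the recorded one.

    Returns True when the relative difference is within `tolerance`
    (a hypothetical default of 10%).
    """
    if expected_length <= 0:
        raise ValueError("expected_length must be positive")
    relative_diff = abs(observed_length - expected_length) / expected_length
    return relative_diff <= tolerance
```

For example, with a recorded length of 10000 bytes, an 800-byte block page would be flagged, but a 9800-byte block page would pass unnoticed.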
Another approach is to find a smart fuzzy matching algorithm for comparing the test page against a known-good copy.
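One possible starting point for fuzzy matching, sketched here with Python's standard difflib module, is to compute a similarity ratio between the two pages and flag the result when it falls below a threshold. The 0.5 threshold and the function names are assumptions for illustration, not a tested recommendation.

```python
import difflib


def similarity(control_html: str, experiment_html: str) -> float:
    """Return a similarity ratio between 0.0 and 1.0 for two pages."""
    matcher = difflib.SequenceMatcher(
        None, control_html, experiment_html, autojunk=False
    )
    return matcher.ratio()


def looks_like_block_page(control_html: str,
                          experiment_html: str,
                          threshold: float = 0.5) -> bool:
    """Flag the experiment page if it is too dissimilar to the control.

    `threshold` is a hypothetical cut-off that would need tuning
    against real pages.
    """
    return similarity(control_html, experiment_html) < threshold
```

Unlike an exact comparison, this tolerates small differences such as a geolocated greeting, while a short page like "403 Forbidden" compared against a full article scores far below the threshold. SequenceMatcher is quadratic in the worst case, so a cheaper measure may be preferable for large pages.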
Other ideas?