Friday 17 June 2016

Browsing is Broken Part 2: Blocking Unsolicited Content

In part 1, I explained why I want news websites to send me their content directly, instead of passing me off to third-party advert networks that they have no control over. Since that isn't going to happen any time soon, we have to find ways to stop our browsers fetching potentially damaging content from the third-party servers that the media companies refer us to. The most popular options are ad blockers, proxies and blacklists.

Ad Blockers

Adblock is a browser add-in that hides ads. Your browser still fetches and downloads the ad, but then Adblock steps in and stops the ad from being displayed, or the video from being played. For many people, this is a fine solution, and Adblock is justifiably very popular. I used it for a while myself.

The problem with most ad blockers is that your browser still requests and fetches all that ad content in the first place. Then the ad blocking software runs, taking up more time and CPU, in order to remove the ad. The advertising networks of course know about ad blockers, so they try to circumvent them by disguising ads so that the ad blockers let them through. Some ad blockers can noticeably slow down the browser and use up a lot of CPU and memory.

The advertiser vs adblocker battle feels a lot like virus writers vs virus scanners. Both sides have to run as fast as possible just to stay still, and the users never quite know who is in the lead. When the virus writers jump ahead, the consequences are devastating, in part because the Darwinian selection pressure means that a successful virus has to be immensely sophisticated and difficult to eliminate.

There has also been a backlash against ad blockers from technology companies. Google removed Adblock from the Play Store in 2013, and many ad blockers have been removed from the Apple App Store over the years too. So as a user, you can't rely on these apps being available on your device indefinitely.

The big advantage that ad blockers have over the techniques I'll discuss next is that they are really simple to install and use. Pretty much no technical knowledge is required by the user, and if the adblocker stops working for whatever reason, the browser usually works fine.

Proxies

A proxy is a server that a browser uses to access the internet. So the conversation becomes:

mybrowser: hey, proxy, can you get me the nytimes front page please?
proxy: sure, I'll go get it
[proxy talks to nytimes]
proxy: here you go mybrowser
mybrowser: thanks proxy! ok, let's look at this html, ok, gotta get a bunch more stuff
mybrowser: hey proxy, can you get me all this stuff from strange_address_1 through strange_address_200
proxy: sheesh, sure, whatever, coming right up....ok here you go:
mybrowser: thanks! ...ah crap this is huge.

Every browser supports configuring a proxy to talk to, because in many corporate networks, the only way to access the web is via a proxy. This helps IT control and monitor web access, and partition bandwidth so that syncing your iTunes library on your work PC doesn't interfere with corporate email traffic, which would go on a different route.
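To make "configuring a proxy" a bit more concrete, here is a minimal Python sketch that does the same thing a browser's proxy settings do, using the standard library's urllib. The address 127.0.0.1:8118 is only an assumption for illustration; use whatever address your own proxy listens on.

```python
import urllib.request

# Assumed proxy address for illustration; change it to match your own proxy.
PROXY_URL = "http://127.0.0.1:8118"


def make_proxied_opener(proxy_url=PROXY_URL):
    """Build a URL opener that sends http and https requests via the proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


# Usage (only works if a proxy is really listening at PROXY_URL):
#   opener = make_proxied_opener()
#   with opener.open("http://www.example.com/", timeout=10) as response:
#       print(response.status)
```

Every request made through the opener goes to the proxy first, just like the conversation above.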

Once you have all the web traffic going through a proxy server that you control, it's easy to do things like set up a blacklist of sites that the proxy refuses to access. So if a browser requests hidden.malware.site.xyz.com, the proxy says "Access forbidden" or something equally sinister. Naturally, it doesn't stop there, and lots of corporations set up their proxies to stop people accessing Facebook. The smart ones let them access Facebook but have the proxy log every access and track how much time the employee is goofing off.
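The blacklist check itself can be very simple. Here's a rough sketch in Python of how a proxy might decide whether to refuse a request. The function names and the blacklist contents are made up for this example; real proxies have their own rule formats.

```python
# Hypothetical blacklist built from the made-up domain used in the text.
BLACKLIST = {"malware.site.xyz.com"}


def is_blocked(hostname, blacklist=BLACKLIST):
    """Return True if hostname, or any parent domain of it, is blacklisted."""
    labels = hostname.lower().rstrip(".").split(".")
    # For "a.b.c" check "a.b.c", then "b.c", then "c".
    for i in range(len(labels)):
        if ".".join(labels[i:]) in blacklist:
            return True
    return False


def proxy_decision(hostname, blacklist=BLACKLIST):
    """Return what the sketch proxy would do with a request for hostname."""
    return "Access forbidden" if is_blocked(hostname, blacklist) else "forward"
```

Matching on whole labels means hidden.malware.site.xyz.com is refused because its parent malware.site.xyz.com is listed, while a lookalike such as notmalware.site.xyz.com is not caught by accident.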

The functionality developed for corporate proxies is very close to what we need for unsolicited content blocking, and sure enough, there are ad-blocking proxies that do a great job of removing unsolicited and potentially harmful content. 

Cloud proxies

Most proxy services are cloud-based; you get the address details of the proxy, you input them into your browser proxy configuration, and from then on, your browser asks the proxy server in the cloud to fulfill all your requests. 

There are a couple of problems with this. Sometimes, the cloud proxy server is in a different country to you. So you get the Google homepage for Romania (no joke, happens to me a lot when I use privatetunnel), and Google asks you all the time if you want to translate the page into Romanian.

When Gmail sees you logging in via a proxy, it will probably have a bit of a fit, and ask you to reauthenticate, and prove that you're a human with one of those squiggly text things. Also, this will keep happening, because when you go through proxy services, the server you go through changes regularly. From Gmail's point of view, it looks a lot like someone's trying to hack into your account from dodgy locations that keep changing.

Another problem with cloud proxies is that your requests have to make the extra trip to the proxy server. Usually this isn't a big overhead, but it can mean that your browsing feels slower. 

Local Proxies

A proxy is just a software program, so you can install one on your PC. One of the best is a program called Privoxy, which works really well and can be configured to do whatever you need. Using Privoxy installed locally on your PC has none of the issues of using a cloud proxy. Websites like Gmail don't see any difference when you access them. Unlike cloud-based proxies, your web requests don't have to jump through a remote server, so your browsing should feel as fast as it does with no proxy. In fact it might feel a bit faster, because Privoxy will filter out ads and other content.

The downside of Privoxy is that it can require a bit of technical knowledge to set up and maintain. If you configure your browser to use the Privoxy proxy, then if Privoxy isn't running, you'll get a message saying "Proxy server not accepting connections" or something similar. If you've read this far, and installed Privoxy, I'm sure that won't be a problem for you to figure out, but it may not be something you install for non-technical friends and family. 
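If you do hit that "not accepting connections" message, a quick way to find out whether the proxy is running at all is to try opening a connection to it. This is a small Python sketch; it assumes the proxy listens on 127.0.0.1 port 8118, which is Privoxy's default, so adjust the values if you changed your configuration.

```python
import socket


def proxy_is_listening(host="127.0.0.1", port=8118, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection raises OSError (e.g. connection refused) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Usage:
#   if not proxy_is_listening():
#       print("Nothing is listening on 127.0.0.1:8118; is Privoxy running?")
```

This only tells you that something is accepting connections on that port, not that it is actually Privoxy or that it is configured correctly.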

Blacklists

Every time your browser loads a page, it starts by converting the "human" name of the website, like www.example.com, into an IP address. It does this by accessing the domain name system, or DNS, which is like a big database that maps all the web addresses on the internet onto IP addresses. If no entry for the human name is found, the browser doesn't know what IP address to send the request to. If it can't find the address for the name you typed into the address bar, you see an error message like "www.bleaurgh223A.com not found".
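You can watch this name-to-address step happen with a few lines of Python. This is only a sketch of the lookup, not what a browser does internally; the function name is made up for the example.

```python
import socket


def lookup(hostname):
    """Return an IPv4 address for hostname, or None if the lookup fails."""
    try:
        # Asks the operating system's resolver, the same one browsers rely on.
        return socket.gethostbyname(hostname)
    except OSError:
        # socket.gaierror (a subclass of OSError) is raised when no address is found.
        return None


# Usage (needs a network connection for real names):
#   print(lookup("www.example.com"))
#   print(lookup("www.bleaurgh223A.com"))
```

A result of None corresponds to the "not found" error the browser shows.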

When your browser gets the html for a page from the primary site, like nytimes.com, it reads the html and fetches any content required to complete the page. The html will have addresses telling the browser where to go. This is how your browser ends up fetching content from adnetwork3vil.dblclack.net when you want to read nytimes.com. It asks DNS for the IP address of adnetwork3vil.dblclack.net and then fetches that anorak ad. 
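To see which other servers a page sends your browser to, you can pull the addresses out of the html yourself. The sketch below uses Python's standard html.parser; the class and function names, and the list of tags it looks at, are choices made for this example and will not catch everything a real browser fetches (content loaded by scripts, for instance).

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ResourceHostCollector(HTMLParser):
    """Collect hostnames from src/href attributes of tags that load content."""

    LOADING_TAGS = {"script", "img", "iframe", "link", "video", "audio", "source", "embed"}

    def __init__(self):
        super().__init__()
        self.hosts = set()

    def handle_starttag(self, tag, attrs):
        if tag not in self.LOADING_TAGS:
            return
        for name, value in attrs:
            if name in ("src", "href") and value:
                host = urlparse(value).hostname  # None for relative addresses
                if host:
                    self.hosts.add(host)


def third_party_hosts(html, page_host):
    """Return hosts referenced by the page that are not page_host or a subdomain of it."""
    collector = ResourceHostCollector()
    collector.feed(html)
    return {
        h for h in collector.hosts
        if h != page_host and not h.endswith("." + page_host)
    }
```

Calling third_party_hosts(html, "nytimes.com") on a saved page gives the set of outside servers the page asks your browser to contact.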

Browsers are very forgiving, and they expect errors to happen. This is good, because if you look at your browser console (a hidden debugging window you can usually access by hitting F12), you'll see that almost every page you load has some errors. Often these errors are due to broken links. The guy who created bestcatpictures2003.com linked to lots of cat pictures on websites that no longer exist. The browser expects this kind of thing, so it just does its best and displays whatever parts of the page it could successfully get. 

Hosts File Blacklisting

Before DNS existed, every computer had to keep a file that listed all the mappings from human names to IP addresses. This is the "hosts" file, and it's still located at /etc/hosts on most Unix systems, and at C:\Windows\System32\drivers\etc\hosts on Windows. The operating system still checks this file when the browser asks it to look up a name, just in case it has an entry. If it finds one, the answer comes back without any network request, which is much faster than asking a DNS server. These days, the hosts file usually contains just one or two entries, but there's no reason you can't add more.

That's how we create a blacklist. We add entries to the hosts file that map unwanted names to an unusable IP address, e.g. 0.0.0.0. When the browser asks for the address of adnetwork3vil.dblclack.net, the operating system checks the hosts file and finds an entry mapping adnetwork3vil.dblclack.net to 0.0.0.0. The browser tries to fetch content from 0.0.0.0, which fails, but the browser is built to expect such failures, so the rest of the page loads just fine.
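Each line of a hosts file is just an IP address followed by one or more names, with # starting a comment. As a rough illustration, this Python sketch generates blacklist lines and reads a hosts file back into a dictionary. The function names are invented for the example, and keeping the first entry seen for a name is a simplification made here.

```python
def hosts_lines(domains, sink="0.0.0.0"):
    """Return hosts-file lines mapping each domain to the sink address."""
    return [f"{sink} {domain}" for domain in sorted(set(domains))]


def parse_hosts(text):
    """Parse hosts-file text into a {hostname: address} dict, ignoring comments."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            # Simplification for this sketch: keep the first entry seen.
            mapping.setdefault(name.lower(), address)
    return mapping
```

For example, hosts_lines(["adnetwork3vil.dblclack.net"]) produces the single line "0.0.0.0 adnetwork3vil.dblclack.net", which is exactly what you would add to the file.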

A lot of people work to create these blacklist files, and you can download good ones for free on the web. Although the description of how all this works is a bit technical, installing a hosts file is just a matter of backing up your existing file and copying in the new one to the right location. After that, you might want to get the latest version every few months as new sites are added, but there's really no maintenance required. 
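If you would rather script the install, here is a hedged Python sketch of the backup-then-update step. Note one difference from the description above: it appends the downloaded entries to your existing file rather than replacing it, so entries such as localhost survive. That is a choice made for this example. The function name and the ".bak" suffix are also assumptions, and on a real system you need administrator rights to write to the hosts file.

```python
import shutil
from pathlib import Path


def install_blacklist(hosts_path, blacklist_path, backup_suffix=".bak"):
    """Back up the hosts file, then append blacklist lines not already present.

    Returns (backup_path, number_of_lines_added).
    """
    hosts_path = Path(hosts_path)
    backup_path = hosts_path.with_name(hosts_path.name + backup_suffix)
    shutil.copy2(hosts_path, backup_path)  # keep a copy before touching anything

    existing = set(hosts_path.read_text().splitlines())
    new_lines = [
        line
        for line in Path(blacklist_path).read_text().splitlines()
        if line.strip() and not line.lstrip().startswith("#") and line not in existing
    ]
    if new_lines:
        with hosts_path.open("a") as f:
            # Leading newline protects against a hosts file with no final newline.
            f.write("\n" + "\n".join(new_lines) + "\n")
    return backup_path, len(new_lines)


# Usage on a Unix system (run with administrator rights):
#   install_blacklist("/etc/hosts", "downloaded_blacklist.txt")
```

If anything goes wrong, copying the backup over the hosts file puts things back as they were.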

The main catch with using the hosts file for blacklisting is that you need to have administrator access on your device. For PCs this isn't usually a problem (you definitely have admin access on your home PC), but on an Android phone, it means you need root access, which requires some technical knowledge. 

Dissatisfaction

None of the methods we have to avoid unsolicited content is entirely satisfactory. I currently use a hosts blacklist on all my devices, and I really like the results. Ad blocker browser plug-ins are the best solution for non-technical users, but Google and Apple have shown that they are opposed to allowing us to use them. At the time of writing (June 2016), ad blocker apps are available; let's hope it stays that way.

Privoxy is probably the best solution overall: it gives you complete control, and nobody can stop you installing it on your laptop/PC. Unfortunately, in order to get Privoxy working on your phone, you need it to be jailbroken or rooted.



