Friday 17 June 2016

Browsing is Broken Part 3: Privacy

Access Provider Privacy

Whenever you connect to the web, you're connecting via some kind of access provider. Most people will think of their ISP (internet service provider), that is, their home broadband provider, but these days we're constantly connecting our phones to wifi at work, in cafes, shops and airports. Many phone network providers are teaming up with wifi networks so your phone will automatically connect to wifi spots around your city, and the latest phones support making calls and texts over the wifi connection.

My privacy requirement is that when I connect to an access point, my web traffic is protected from the access provider, and that they can't see what I'm browsing or read the emails or messages that I download over their wifi. You might feel that this is unnecessary; can't we just trust the access providers? I'm not going to get into that here, other than to point out that you're also trusting every individual tech geek that works at those companies, and a lot of small technology outsourcing companies that they will use for IT installation and support. You're also hoping that they haven't been hacked by malicious individuals, and that they never will be hacked (probability:zero). And regarding your need for privacy, even if you are entirely blameless, consider the possibility that one day a friend or relative sends you a "private" message in which they joke about something that looks illegal or sinister when taken out of context. Their privacy is dependent on your privacy. 

VPNs

OK, let's start simple here. Say your company has a bunch of computers in two offices in different cities. Each office has its own private network, connecting just the computers in that office to each other. Naturally, you'll want to be able to connect the two networks together (an 'inter-office-network'!). That's the "N" in VPN. So you connect the two with a cable from one office to another. These days, unlike when the telegraph first arrived, you don't lay the cable yourself, you lease one from the phone company. Everything works great, the connection is Private (that's the P), but, leasing a line is really expensive. And since the internet is already available for free, why not use that instead? So you want a private network that goes out over the public internet, so you need some fancy software that creates a Virtual reality (that's the "V") style simulation of a Private Network on top of the public internet.  

That's where VPNs come from. These days, you can download a VPN client to your phone or laptop, and connect to a cloud-based VPN server. Now, it's as if you have a cable connecting your device to that server directly, through the magic of encryption and internet routing. Any traffic that goes over that tunnel can't be accessed by the real devices in between, such as the wifi router in the cafe, because it's encrypted and only the VPN server knows how to decrypt it.

So, VPN clients are a great solution for maintaining privacy from your access provider, right? It's true they provide a potential solution, but there are pitfalls. The VPN client on your phone can stop running, or need to reconnect to the server, and while this is happening all your web traffic is exposed. Even if the VPN stays up and running all the time, you can't always be sure what traffic is routed over it. Remember our two offices VPN example? Well, in that situation, IT guys would still route the web traffic from PCs in the office directly to the internet - only traffic destined for the other office's machines would be routed over the VPN.
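
To make the routing point concrete, here's a rough sketch in Python of the decision a device makes for each destination. The address range for the "other office" and the function name are made up for illustration; a real VPN client gets its routes from its own configuration.

```python
import ipaddress

# Hypothetical address range for the "other office"; not from any real network.
VPN_ROUTES = [ipaddress.ip_network("10.20.0.0/16")]


def goes_over_vpn(destination, full_tunnel=False):
    """Return True if traffic to `destination` would be sent through the VPN."""
    if full_tunnel:
        # Phone-style VPN clients try to send everything through the tunnel.
        return True
    address = ipaddress.ip_address(destination)
    # Office-style setup: only the other office's addresses use the tunnel.
    return any(address in network for network in VPN_ROUTES)


print(goes_over_vpn("10.20.3.7"))      # True: a machine in the other office
print(goes_over_vpn("93.184.216.34"))  # False: a public web server, sent directly
```

With full_tunnel=True the second call would also return True, which is what most phone VPN apps aim for.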

Most VPN client apps for phones do try to route everything over the VPN, since that's the real reason people use them. But they can still leak information. When you connect to a wifi access point, your device has to talk to it directly in order to get configuration information so that it can actually work (this is called DHCP). If the VPN client refused to let any traffic go to any destination other than over the VPN, you wouldn't be able to connect to the access point in the first place. 

Even if you get your VPN configured as tight as possible, it's quite likely that you still leak DNS lookups (remember those from part 2?). So the access provider can't see exactly what data you're transferring, but they can see all the website addresses that you look up in order to connect to them, which is quite a lot of metadata and certainly doesn't constitute privacy.
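
To see why a leaked DNS lookup gives so much away, here's a minimal sketch that builds the bytes of a traditional (unencrypted) DNS query. It only constructs the packet and doesn't send anything; the function name and the query id are my own choices for illustration.

```python
import struct


def build_dns_query(hostname, query_id=0x1234):
    """Build the bytes of a standard DNS query for an A record."""
    # Header: id, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # The name is sent as length-prefixed labels: 3 www 7 example 3 com 0.
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.split(".")
    )
    question = labels + b"\x00" + struct.pack("!HH", 1, 1)  # type A, class IN
    return header + question


packet = build_dns_query("www.example.com")
print(packet)  # the words "www", "example" and "com" are readable in the output
```

Anyone who can see this packet on the network can read the site name straight out of it, with no decryption needed.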

The larger access providers, such as the big home broadband companies, are aware of the use of VPNs and of course they can detect when your VPN client attempts to connect to a VPN server (since they know the DNS names and IP addresses of the popular VPN services). If they wanted to, it's pretty easy for them to cause these connections to fail by blocking the initial connection, so your client can't reach the VPN server to start the whole encryption process. 

It's also possible for the access provider to take the traffic from your VPN client and send it to one of their own servers. This requires some sophisticated NSA level techniques, but it's entirely feasible. A less sophisticated approach requires the attacker to first hack into the VPN servers and get some decryption keys, but that's not at all infeasible - most OSs have security vulnerabilities and it only takes one server to be unpatched for the attacker to succeed. 

Proxies

Now that you understand VPNs, proxies are a cinch, and we already discussed them in a previous post. Essentially a proxy is a server in the cloud that your browser connects to and sends all its web requests to. It's arguably a little simpler than a VPN, and proxies are just focused on keeping your browser traffic private, unlike the more general purpose VPN.

Unfortunately, many of the popular browsers and proxies still leak DNS requests. So your web traffic is encrypted, but a snooper can easily tell which sites you're accessing. 

Dissatisfaction

I'm sure when technically minded people read this, they'll suggest many possible ways of securing your web traffic from the access provider, but I've yet to find anything that a person of basic technical ability can be confident they've configured correctly and be sure they won't leak information or leave themselves open to various vulnerabilities. 



Browsing is Broken Part 2: Blocking Unsolicited Content

In part 1, I explained why I want news websites to send me their content directly, instead of passing me off to third-party advert networks that they have no control over. Since that isn't going to happen any time soon, we have to find ways to stop our browsers fetching potentially damaging content from the third-party servers that the media companies refer us to. The most popular options are ad blockers, proxies and blacklists.

Ad Blockers

Adblock is a browser add-in that hides ads. Your browser still fetches and downloads the ad, but then Adblock steps in and stops the ad from being displayed, or the video from being played. For many people, this is a fine solution, and Adblock is justifiably very popular. I used it for a while myself.

The problem with most ad blockers is that your browser still requests and fetches all that ad content in the first place, and then, the ad blocking software runs, taking up more time and cpu, in order to remove the ad. The advertising networks of course know about ad blockers, so they try to circumvent them, disguising ads, so the adblockers let them through. Some adblockers can really slow down the browser quite a bit, and use up a lot of cpu and memory.

The advertiser vs adblocker battle feels a lot like virus writers vs virus scanners. Both sides have to run as fast as possible just to stay still, and the users never quite know who is in the lead. When the virus writers jump ahead, the consequences are devastating, in part because the Darwinian selection pressure means that a successful virus has to be immensely sophisticated and difficult to eliminate.

There has also been a backlash against ad blockers from technology companies. Google removed Adblock from the Play Store in 2013, and many ad blockers have been removed from the Apple App Store over the years too. So as a user, you can't rely on these apps being available on your device indefinitely.

The big advantage that ad blockers have over the techniques I'll discuss next is that they are really simple to install and use. Pretty much no technical knowledge is required by the user, and if the adblocker stops working for whatever reason, the browser usually works fine.

Proxies

A proxy is a server that a browser uses to access the internet. So the conversation becomes:

mybrowser: hey, proxy, can you get me the nytimes front page please?
proxy: sure, I'll go get it
[proxy talks to nytimes]
proxy: here you go mybrowser
mybrowser: thanks proxy! ok, let's look at this html, ok, gotta get a bunch more stuff
mybrowser: hey proxy, can you get me all this stuff from strange_address_1 through strange_address_200
proxy: sheesh, sure, whatever, coming right up....ok here you go:
mybrowser: thanks! ...ah crap this is huge.

Every browser supports configuring a proxy to talk to, because in many corporate networks, the only way to access the web is via a proxy. This helps IT control and monitor web access, and partition bandwidth so that you syncing your iTunes library on your work PC doesn't interfere with corporate email traffic, which would go on a different route. 

Once you have all the web traffic going through a proxy server that you control, it's easy to do things like set up a blacklist of sites that the proxy refuses to access. So if a browser requests hidden.malware.site.xyz.com, the proxy says "Access forbidden" or something equally sinister. Naturally, it doesn't stop there, and lots of corporations set up their proxies to stop people accessing facebook. The smart ones let them access facebook but have the proxy log every access and track how much time the employee is goofing off. 
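
The blacklist check itself is simple. Here's a minimal sketch of how a filtering proxy might decide whether to refuse a request; the blacklist entries, function names and response strings are all illustrative and not taken from any particular proxy.

```python
BLOCKED_DOMAINS = {"xyz.com", "adnetwork3vil.dblclack.net"}  # illustrative entries


def is_blocked(hostname, blocked=BLOCKED_DOMAINS):
    """Return True if hostname or any parent domain is on the blacklist."""
    parts = hostname.lower().rstrip(".").split(".")
    # Check "a.b.c", then "b.c", then "c" so that subdomains are covered too.
    return any(".".join(parts[i:]) in blocked for i in range(len(parts)))


def proxy_response(hostname):
    """Return the status line a filtering proxy might send for this host."""
    if is_blocked(hostname):
        return "403 Access forbidden"
    return "200 OK (request forwarded)"


print(proxy_response("hidden.malware.site.xyz.com"))
print(proxy_response("nytimes.com"))
```

Matching on parent domains means one entry for xyz.com also covers every subdomain under it, which keeps the list short.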

The functionality developed for corporate proxies is very close to what we need for unsolicited content blocking, and sure enough, there are ad-blocking proxies that do a great job of removing unsolicited and potentially harmful content. 

Cloud proxies

Most proxy services are cloud-based; you get the address details of the proxy, you input them into your browser proxy configuration, and from then on, your browser asks the proxy server in the cloud to fulfill all your requests. 

There are a couple of problems with this. Sometimes, the cloud proxy server is in a different country to you. So you get the google homepage for Romania (no joke, happens to me a lot when I use privatetunnel), and google asks you all the time if you want to translate the page into Romanian. 

When gmail sees you logging in via a proxy, it will probably have a bit of a fit, and ask you to reauthenticate, and prove that you're a human with one of those squiggly text things. Also, this will keep happening, because when you go through proxy services, the server you go through changes regularly. From gmail's point of view, it looks a lot like someone's trying to hack into your account from dodgy locations that keep changing. 

Another problem with cloud proxies is that your requests have to make the extra trip to the proxy server. Usually this isn't a big overhead, but it can mean that your browsing feels slower. 

Local Proxies

A proxy is just a software program, so you can install one on your PC. One of the best is a program called Privoxy, which works really well and can be configured to do whatever you need. Using Privoxy installed locally on your PC has none of the issues of using a cloud proxy. Websites like gmail don't see any difference when you access them. Unlike cloud based proxies, your web requests don't have to jump through a remote server, so your browsing should feel as fast as it does with no proxy - in fact it might feel a bit faster because Privoxy will filter out ads and other content. 

The downside of Privoxy is that it can require a bit of technical knowledge to set up and maintain. If you configure your browser to use the Privoxy proxy, then if Privoxy isn't running, you'll get a message saying "Proxy server not accepting connections" or something similar. If you've read this far, and installed Privoxy, I'm sure that won't be a problem for you to figure out, but it may not be something you install for non-technical friends and family. 
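
If you do hit that error, a quick way to find out whether the proxy is actually running is to try opening a connection to it. This is a small sketch that assumes Privoxy is listening on its usual default address of 127.0.0.1 port 8118; if your configuration uses something else, pass that in instead.

```python
import socket


def proxy_is_listening(host="127.0.0.1", port=8118, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused or timed out: nothing usable is listening there.
        return False
```

Calling proxy_is_listening() returns False when nothing is accepting connections on that port, which is the same condition that makes the browser complain.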

Blacklists

Every time your browser loads a page, it starts by converting the "human" name of the website, like www.example.com into an IP address. It does this by accessing the domain name system, or DNS, which is like a big database that maps all the web addresses on the internet onto IP addresses. If no entry for the human name is found, the browser doesn't know what IP address to send the request to. If it can't find the address for the name you typed into the address bar, you see an error message like "www.bleaurgh223A.com not found". 

When your browser gets the html for a page from the primary site, like nytimes.com, it reads the html and fetches any content required to complete the page. The html will have addresses telling the browser where to go. This is how your browser ends up fetching content from adnetwork3vil.dblclack.net when you want to read nytimes.com. It asks DNS for the IP address of adnetwork3vil.dblclack.net and then fetches that anorak ad. 
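
Here's a rough sketch of that step: reading a page's html and listing the other servers it points to. The page fragment is made up, and for simplicity the sketch only looks at src attributes (a real browser also follows stylesheets and more).

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ResourceHosts(HTMLParser):
    """Collect the hostnames that a page's src attributes point to."""

    def __init__(self):
        super().__init__()
        self.hosts = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "src" and value:
                host = urlparse(value).hostname  # None for relative links
                if host:
                    self.hosts.add(host)


def third_party_hosts(html, primary_domain):
    """Return hosts referenced by the page that are not under primary_domain."""
    parser = ResourceHosts()
    parser.feed(html)
    return {
        h for h in parser.hosts
        if h != primary_domain and not h.endswith("." + primary_domain)
    }


# A made-up page fragment for illustration.
PAGE = """
<img src="https://static.nytimes.com/logo.png">
<script src="https://adnetwork3vil.dblclack.net/anorak.js"></script>
<a href="/section/world">World</a>
"""
print(third_party_hosts(PAGE, "nytimes.com"))  # {'adnetwork3vil.dblclack.net'}
```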

Browsers are very forgiving, and they expect errors to happen. This is good, because if you look at your browser console (a hidden debugging window you can usually access by hitting F12), you'll see that almost every page you load has some errors. Often these errors are due to broken links. The guy who created bestcatpictures2003.com linked to lots of cat pictures on websites that no longer exist. The browser expects this kind of thing, so it just does its best and displays whatever parts of the page it could successfully get. 

Hosts File Blacklisting

Before DNS became cloud-based, computers had to have a file that listed all the mappings from human names to IP addresses. This is the "hosts" file, and it's still located in /etc/hosts on most Unix systems, and C:\Windows\System32\drivers\etc\hosts on Windows. The operating system still checks this file every time the browser makes a DNS request, just in case it has an entry. If it finds an entry, it's much faster than asking the cloud based DNS. These days, the hosts file usually contains just one or two entries, but there's no reason you can't add more.

That's how we create a blacklist. We add entries into the hosts file and map them to bad IP addresses, e.g. 0.0.0.0. When the browser asks for the address of adnetwork3vil.dblclack.net, the operating system checks the hosts file and finds an entry mapping adnetwork3vil.dblclack.net to 0.0.0.0. The browser tries to fetch content from 0.0.0.0, which fails, but the browser is built to expect such failures, so the rest of the page loads just fine.
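
The lookup can be sketched in a few lines of Python. This is a simplified imitation of what the operating system does, not its actual code: it parses hosts-file text into a dictionary and checks it before anything would be sent to DNS. The sample entries are made up.

```python
def parse_hosts(text):
    """Parse hosts-file text into a {hostname: ip_address} dictionary."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        ip_address, *names = line.split()
        for name in names:
            # Simplification: if a name appears twice, keep the first entry.
            mapping.setdefault(name.lower(), ip_address)
    return mapping


SAMPLE_HOSTS = """
127.0.0.1   localhost
# blacklist entries
0.0.0.0     adnetwork3vil.dblclack.net
"""

hosts = parse_hosts(SAMPLE_HOSTS)
print(hosts.get("adnetwork3vil.dblclack.net"))  # 0.0.0.0, so no DNS query is made
print(hosts.get("nytimes.com"))                 # None, so the system falls back to DNS
```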

A lot of people work to create these blacklist files, and you can download good ones for free on the web. Although the description of how all this works is a bit technical, installing a hosts file is just a matter of backing up your existing file and copying in the new one to the right location. After that, you might want to get the latest version every few months as new sites are added, but there's really no maintenance required. 
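
The back-up-and-copy step could be scripted along these lines. It's a sketch under a few assumptions: the function name and the ".bak" suffix are my own choices, the paths are passed in rather than hard-coded, and writing to the real hosts file needs administrator rights.

```python
import shutil
from pathlib import Path


def install_hosts_file(new_file, hosts_path):
    """Back up the hosts file at hosts_path, then replace it with new_file."""
    hosts_path = Path(hosts_path)
    backup_path = hosts_path.with_name(hosts_path.name + ".bak")
    if hosts_path.exists():
        shutil.copy2(hosts_path, backup_path)  # keep the original to roll back
    shutil.copyfile(new_file, hosts_path)
    return backup_path
```

On a Unix system you would call it with something like install_hosts_file("downloaded_hosts.txt", "/etc/hosts"), where downloaded_hosts.txt stands in for whichever blacklist you fetched. Copying the backup over the hosts file undoes the change.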

The main catch with using the hosts file for blacklisting is that you need to have administrator access on your device. For PCs this isn't usually a problem (you definitely have admin access on your home PC), but on an Android phone, it means you need root access, which requires some technical knowledge. 

Dissatisfaction

None of the methods we have to avoid unsolicited content is entirely satisfactory. I currently use a hosts blacklist on all my devices, and I really like the results. Ad blocker browser plug-ins are the best solution for non-technical users, but Google and Apple have shown that they are opposed to allowing us to use them. At the time of writing (June 2016), adblock apps are available. Let's hope it stays that way.

Privoxy is probably the best solution overall  - it gives you complete control, and nobody can stop you installing it on your laptop/PC. Unfortunately, in order to get Privoxy working on your phone, you need it to be jailbroken or rooted. 




Thursday 16 June 2016

Browsing is Broken Part 1: Unsolicited Content

The websites of many of the major news outlets that I used to read regularly are now overloaded with ads and content from third parties that I just can't tolerate any more. I started to notice how bad things were getting about three years ago when visiting rollingstone.com on my Android phone exposed me to malware that made a charge to my mobile bill. It's not just inconvenient, it's insecure, and ultimately it's lose-lose for the media and their audience.

I get the business model that online companies need to sell advertising, and in principle I support that absolutely. Heck, I tried signing this blog up for Adsense on the off chance I can finally make a dollar back off Google (they turned me down). What I don't agree with is the way they implement it. Simplistically, when I enter nytimes.com in my browser, the conversation between the computers involved goes something like this:

myphone: Hey, nytimes.com can I get the front page please?
nytimes: Hang on a sec...just looking you up...
nytimes: Hey adservers, dubhrosa just asked me for my front page, whaddya got for him
adserver_network: oh baby! dubhrosa, I've got a ton of stuff for that guy, I'll send it directly to him if that's ok.
nytimes: Yeah sure, go nuts, I'll send him the headlines and some pictures, I'll leave most of the page for you guys
adserver_network: Great! Last week he bought an anorak from amazon. Maybe he'd like to see a couple more ads for anoraks. Also, a few months ago, he clicked on an ad for septic tank inspection services, it might have been a misclick, and we've shown him about 2000 more of the same ad since then, but hey maybe today's the day. Oh, and he seems to be into cars, so let's put on that video for the new Ford truck that starts to play automatically.
nytimes: ok great, thanks dudes
adserver_network: sure thing nytimes, here's your 0.001c
nytimes: hey thanks! nice tip! you guys are sooo nice!
myphone: ok, here's the html for this nytimes front page, thanks nytimes
nytimes: my pleasure
myphone: ok, in order to display this page, I need to go fetch a crapload of pictures and stuff, let's get that
myphone: hey, strange_name_1 through strange_name_200, can I have this stuff please?
adserver_network: [teehee, they never know it's us] sure! here you go!
myphone: yikes, they're sending me 50 megabytes of crap here, oh well this is gonna hurt my data plan and my battery.


Here's how the conversation should go:

myphone: Hey, nytimes.com can I get the front page please?
nytimes: Hang on a sec...just looking you up...
nytimes: (to self) ok, what ads do I have today that I should show dubhrosa...ok, stick them into the page
nytimes: here you go, this is the front page html
myphone: thanks nytimes
myphone: ok, there's some other stuff I need to download from nytimes to complete the page
myphone: nytimes, give me these pictures and video links please
nytimes: here you go
myphone: thanks!

The key difference is that in this flow, the nytimes is responsible for storing and serving the advertising content to its readers. The ad content is stored on their servers, and their staff have the ability to control that content. They can still target me with ads they think are relevant based on my previous online activity, but they have full control over the content that is sent in response to my request. They can keep the page size below some sensible limit. They can ensure their readers have a nice experience when browsing their site. I don't think any of this is an unreasonable demand. Imagine if a newspaper editor allowed advertisers to scrawl whatever they wanted into the adspace of the newspaper, with absolutely no review by the newspaper staff. Shouldn't media companies, whose brand is so important, take control of what they send to their readers?

Unfortunately, the way the online ad industry has turned out means that this is unlikely to change, and in order to make browsing tolerable, we have to find solutions. I've looked at quite a few, and that's what I'll be talking about in part 2. Read Part 2