Alfred - Notes on HTTP cookies

Some Alfred customers have noticed that HTTP "cookies" are used by the wrangler mode web interface, and they have asked for an explanation of what cookies really are and what risks, if any, are associated with their usage.

The following is a somewhat chatty discussion of the cookie mechanism, which doesn't presuppose that the reader is a web hacker (and as a result it doesn't delve too deeply into the technical details). The truly nerdy may want to read the more technical Netscape cookie specification, or visit one of the many other related sites.

In a nutshell, HTTP cookies implement a form of "client-side persistent state" using tagged strings, on top of the otherwise stateless HTTP protocol. They're called cookies mostly because there wasn't another readily available word that described their purpose (actually I think that the X Window System used the term 'magic cookie' in the 1980s to describe the string passed around by the basic X authentication scheme).

Some Background

One point of background information: most web browsers take pains not to reveal the identity of the user to the servers. In fact HTTP has always had built-in support for the transmission of user identity, but most browsers don't use it. This is an active choice on the part of browser designers: almost everyone who uses a browser has filled in the "identity" form, so the browser does know who you claim to be (in fact it even knows who the underlying operating system thinks you are); it just doesn't send that to servers. For the most part this is exactly the behavior that users want: they'd rather not have their name or email address sent to Yahoo or Microsoft or CNN every time they connect. (NB: one exception is the text-only browser 'lynx', which appears to send the user identity with every request.)

There are times, however, when a site needs a mechanism for recognizing returning visitors. Usually it's not necessary to know their name or email address; rather, a site just wants to know something simple about the current visit. This is where HTTP cookies come in. For example, if you're shopping for CDs it's pretty important for the site to have a list of what you've picked out when you get to the "checkout counter". Remember, default web transactions are atomic: there's no notion of a session or a 'history of your visit to a site', and the server has no way of knowing whether you've been wandering around deep in its pages for an hour or whether you were just catapulted there from outer space. From the server's point of view it will get your request to pick a Billie Holiday CD and then someone else's request for Bengali surf music, and the 'atomic' response to these two requests needs to be "ok" ... but how does it know where to store each selection, under whose name?

The old solution to this problem was to continuously tack more and more parameters onto the URL and funnel everything through a form-handler, or to make every page at the site a dynamically generated form which had the user's current state stuffed away in hidden data fields. This mechanism is still used for simple cases but it becomes remarkably complicated very quickly. What was needed was a way to maintain state without requiring dynamically generated pages or having to run every trivial link to a new document through a CGI script.
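To make the "tack parameters onto the URL" approach concrete, here is a minimal sketch. The paths, the parameter name "item", and the helper names are invented for illustration; no particular site worked exactly this way.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def link_with_state(path, state):
    """Build a link to `path` that carries the visitor's state as query parameters."""
    return f"{path}?{urlencode(state, doseq=True)}"

def state_from_url(url):
    """Recover the state dictionary from a URL built by link_with_state."""
    return parse_qs(urlsplit(url).query)

# Hypothetical shopping state: every link on every page has to carry it along.
cart = {"item": ["billie-holiday-cd"]}
url = link_with_state("/jazz/page2.html", cart)

# Each further selection makes every later link longer.
cart["item"].append("bengali-surf-cd")
url2 = link_with_state("/checkout", cart)
```

Note that every page now has to be generated on the fly just so that its links carry the current state, which is exactly the complication described above.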

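So, cookies are something that Netscape came up with (way back in version 1.1) to address this problem, and it's a real problem if you're trying to encourage businesses to buy your server software and go online. The resulting scheme depends on a lot of cooperation between the browser and the server. Which is to say: it's a significant, albeit straightforward, protocol change. Here's the way it works, in outline: when the server answers a request it can include a Set-Cookie header carrying a short tagged string (a name and a value, plus optional attributes such as a domain, a path, and an expiration date); the browser stores that string, and on later requests to a matching site it sends the string back in a Cookie header.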
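A minimal sketch of the exchange between server and browser, using Python's standard http.cookies module to play both roles. The cookie name "visitor" and its value are made up for the example, and a real browser does a good deal more bookkeeping than this.

```python
from http.cookies import SimpleCookie

# Server side: attach a tagged string to the response.
response_cookie = SimpleCookie()
response_cookie["visitor"] = "abc123"      # hypothetical name and value
response_cookie["visitor"]["path"] = "/"
set_cookie_header = response_cookie.output()   # a "Set-Cookie: ..." header line

# Browser side: remember what the server sent.
browser_jar = SimpleCookie()
browser_jar.load(set_cookie_header.split(":", 1)[1].strip())

# Browser side, on a later request to the same site: send the string back.
cookie_header = "Cookie: " + "; ".join(
    f"{name}={morsel.value}" for name, morsel in browser_jar.items()
)

# Server side: read the returning cookie out of the request.
incoming = SimpleCookie()
incoming.load(cookie_header.split(":", 1)[1].strip())
returning_visitor = incoming["visitor"].value
```

The server never has to look the browser up by name or address; the returning string is all it needs to pick the conversation back up.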

Sites can, and do, make widely varying uses of cookies. For example, some sites just want to know whether you've visited before, or they're just trying to get a feel for which pages on their site actually get revisited. The shopping sites sometimes just continuously add your purchase choices to their cookie string as you make selections, so the cookie is literally your 'live' shopping list; more often though they have a database running on the server which contains your shopping list and the cookie string is just an identifier that they use to access the database.
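The second style, where the cookie is only an identifier into a server-side database, might look roughly like this. The in-memory dictionary stands in for a real database, and the function name and identifier format are assumptions made for the sketch.

```python
import secrets

carts = {}  # stand-in for the server's database: cookie value -> list of selections

def add_to_cart(cookie_value, item):
    """Record a selection and return the cookie value the server should send back."""
    # A visitor with no cookie, or an unknown one, gets a fresh random identifier.
    session_id = cookie_value if cookie_value in carts else secrets.token_hex(16)
    carts.setdefault(session_id, []).append(item)
    return session_id

# Two visitors whose requests arrive interleaved.
alice = add_to_cart(None, "Billie Holiday CD")
bob = add_to_cart(None, "Bengali surf music CD")
alice = add_to_cart(alice, "Lester Young CD")
```

The cookie itself says nothing about the purchases; it is just the key that lets the server file each 'atomic' request under the right shopping list.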

This capability is immensely useful for real applications; Alfred uses it, and I'm glad it's available.

But, as is pretty obvious, it's this notion of a database with 'my' name in it that raises some privacy concerns. Since the site cookie is sent with every connection to that site, they can quite easily do 'useful' things like track the actual path I take through their site, or how often I visit, or pick up my last session from where I left off; they can track which little advertising banners I've seen (and even match advertising to what they know about my purchasing habits and links that I choose to visit). And once I make a purchase, and actually put a real name, address, and credit card number into the system, then their cookie is no longer just an arbitrary id, it's a pseudonym for a very real person.

To be fair, of course, we blithely hand out this sort of information all the time via magazine subscriptions and our address at Radio Shack, etc, etc. And credit card companies, magazines, and even the post office sell their mailing lists all the time to private sector companies which exist to cross-correlate these lists and find patterns. So the actual level of 'new' privacy danger from web cookies is pretty small; it's just getting a lot of press at the moment. People who are very concerned about privacy should also be aware that web servers do know the internet host address of connecting browsers (this is necessary for sending information back to the browser!). In many situations this address is sufficient to recognize a returning visitor, even if the user's login name, email address, or other identifier isn't known.

One thing that's built into the cookie protocol is that the browser is only supposed to send a cookie to the specified domain, and no other (this is certainly true of the Netscape and Microsoft browsers). This would seem to limit the tracking exposure to those sites at which you've elected to make yourself trackable (by accepting the server's cookie). There are some interesting twists in this regard though. There are a couple of ad companies, like doubleclick.net and LinkExchange, which have contracts with various corporate web sites in which the corporate xyz.com server sends the browser a cookie which is tagged not for the xyz domain but for the ad company's domain instead. Hence the browser won't send the cookie back to the site that created it since the domains don't match!

The reason for doing this is that images on a web page are retrieved separately from the text, with their own atomic transaction request ... and the images for a page can be retrieved from any arbitrary server, although usually it's the same as for the rest of the document. The advertising scheme is that the ad banners are retrieved from the ad company site, not the xyz machines, so when your browser goes to fetch a banner into the xyz homepage it initiates a transaction which includes the xyz-set cookie for the ad domain. Hence, the ad company can show you a unique ad, different from the one you might have just seen at a completely different site which also has a contract with the same ad company. And obviously with the right sort of pre-arrangement with the referring site, they can read the cookie to determine something about the sort of ad they should be showing you. Also, the ad company can control exposure of particular ads, for example charging the advertiser more for repeat display to the same user.
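The domain rule, and the way the banner fetch gets around it, can be sketched as follows. This is a simplified suffix match written for illustration; real browsers apply further restrictions, and the cookie names and the ad host name are invented.

```python
def domain_matches(request_host, cookie_domain):
    """True if a cookie tagged for cookie_domain may be sent to request_host."""
    host = request_host.lower()
    domain = cookie_domain.lower().lstrip(".")
    return host == domain or host.endswith("." + domain)

def cookies_to_send(jar, request_host):
    """Names of the cookies in `jar` that the browser would attach to this request."""
    return [name for name, domain in jar if domain_matches(request_host, domain)]

# Hypothetical jar after visiting xyz.com: (cookie name, domain it is tagged for).
jar = [("cart", "xyz.com"), ("ad_id", "doubleclick.net")]

page_request = cookies_to_send(jar, "www.xyz.com")           # fetching the page text
banner_request = cookies_to_send(jar, "ad.doubleclick.net")  # fetching the banner image
```

The page request carries only the xyz cookie, while the separate image request to the ad company's machine carries the ad cookie, whichever site's page happens to embed the banner.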

One more technical detail about cookies: they have expiration dates. A server can tell the browser, in effect, 'only send this cookie back to me until such-and-such a date in the future'. If the date is in the past, it is interpreted to mean 'unset a previously set cookie'. If the cookie is sent with no expiration date then the spec says that it should be interpreted to mean that it expires when the browser is shut down ... i.e. don't keep the cookie around between launches of the browser. The flip side of which is that cookies with explicit expiration dates are kept around somewhere. For example, on a Unix system the Netscape browser puts them in a file called ~/.netscape/cookies, and it's a plain-text file which you can almost figure out by looking through it. If you object to a cookie you've received, exit the browser (which will kill non-saved cookies) and edit the cookie file to clean out any offending entries. Fair warning: the spec allows an individual cookie to be up to 4K bytes (and binary!), although it is rare to see one that's more than 30 (printable) characters.
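The three expiration cases can be summarized in a few lines. This is a sketch of the rules as described above, not of any browser's actual code; the function name and the date strings are illustrative, and real headers may lay the date out slightly differently.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def classify(expires, now):
    """Say how a browser following the rules above would treat a cookie."""
    if expires is None:
        return "session"        # no date: forget it when the browser exits
    when = parsedate_to_datetime(expires)
    if when <= now:
        return "delete"         # date in the past: unset a previously set cookie
    return "persistent"         # date in the future: save it in the cookie file

now = datetime(1997, 6, 1, tzinfo=timezone.utc)  # pretend "today" for the example
no_date = classify(None, now)
future = classify("Fri, 31 Dec 1999 23:59:59 GMT", now)
past = classify("Thu, 01 Jan 1970 00:00:00 GMT", now)
```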

And finally, with regard to Alfred's usage of cookies: they are only used in one situation, and that is to authenticate the wrangler mode web browser when it is connecting to a user's dispatcher to get job status. The point is that render wranglers don't want to have to type in a different user name and password every time they grab status from a different dispatcher. When the wrangler mode is initiated the maitre-d authenticates the would-be wrangler with a password challenge and then sends the wrangler's browser an encoded cookie. Then when the wrangler connects to a dispatcher looking for status the cookie is also sent along, and the dispatcher can use it to authenticate the connection.
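How Alfred actually encodes its cookie isn't described here, so the following is only a generic illustration of the idea: one party issues a cookie after a password check, and another party can verify it without asking for the password again. It assumes, purely for the sketch, a secret shared between the issuer and the verifier and an HMAC signature; the names and the cookie layout are hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret, assumed to be known to both the issuer and the verifier.
SHARED_SECRET = b"example-secret"

def issue_cookie(wrangler):
    """Issuer: after the password challenge succeeds, hand out a signed cookie."""
    sig = hmac.new(SHARED_SECRET, wrangler.encode(), hashlib.sha256).hexdigest()
    return f"{wrangler}:{sig}"

def check_cookie(cookie):
    """Verifier: return the wrangler's name if the cookie is genuine, else None."""
    wrangler, _, sig = cookie.rpartition(":")
    expected = hmac.new(SHARED_SECRET, wrangler.encode(), hashlib.sha256).hexdigest()
    if wrangler and hmac.compare_digest(sig.encode(), expected.encode()):
        return wrangler
    return None

cookie = issue_cookie("pat")
accepted = check_cookie(cookie)
rejected = check_cookie("pat:deadbeef")
```

The point of any scheme along these lines is the one made above: the wrangler types a password once, and each dispatcher can trust the cookie instead of issuing its own challenge.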
