Find out what fingerprints your browser leaves behind as you surf the web, and see how easily identifiable you are

Traditionally, people assume they can prevent a website from identifying them by disabling cookies on their web browser. Unfortunately, this is not the whole story. When you visit a website, you are allowing that site to access a lot of information about your computer's configuration. Combined, this information can create a kind of fingerprint — a signature that could be used to identify you and your computer. But how effective would this kind of online tracking be?

EFF (Electronic Frontier Foundation) is running an experiment to find out. The Panopticlick website will anonymously log the configuration and version information from your operating system, your browser, and your plug-ins, and compare it to our database of five million other configurations. Then, it will give you a uniqueness score, letting you see how easily identifiable you might be as you surf the web. Adding your information to our database will help EFF evaluate the capabilities of Internet tracking and advertising companies, who are already using techniques of this sort to record people's online activities. They develop these methods in secret, and don't always tell the world what they've found. But this experiment will give us more insight into the privacy risk posed by browser fingerprinting, and help web users to protect themselves.
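
To make the idea concrete, here is a minimal Python sketch of how configuration details could be combined into a fingerprint and compared against previously seen ones. It is not Panopticlick's actual code: the attribute names, the `fingerprint` and `record_and_score` helpers, and the in-memory `seen` counter are all assumptions made for illustration.

```python
import hashlib
from collections import Counter

# Hypothetical in-memory "database" of fingerprints seen so far.
seen = Counter()

def fingerprint(config):
    """Hash a dict of browser attributes into a stable identifier.

    Keys are sorted so the same configuration always gives the same hash,
    regardless of the order in which the attributes were collected.
    """
    canonical = "\n".join(f"{key}={config[key]}" for key in sorted(config))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def record_and_score(config):
    """Record a visit and return N, as in 'one in N visits share this fingerprint'."""
    fp = fingerprint(config)
    seen[fp] += 1
    return sum(seen.values()) / seen[fp]

# Made-up example configurations.
browser_a = {"user_agent": "ExampleBrowser/1.0", "timezone": "-60", "plugins": "flash,java"}
browser_b = dict(browser_a, timezone="0")

score_first = record_and_score(browser_a)   # first visit ever recorded
score_other = record_and_score(browser_b)   # a different configuration
score_repeat = record_and_score(browser_a)  # browser_a comes back
```

A higher score means the configuration is rarer among the visits recorded so far, and so more identifying.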

What Information is "Personally Identifiable"?

Mr. X lives in ZIP code 02138 and was born July 31, 1945.

These facts about him were included in an anonymized medical record released to the public. Sounds like Mr. X is pretty anonymous, right?

Not if you're Latanya Sweeney, a Carnegie Mellon University computer science professor who showed in 1997 that this information was enough to pin down Mr. X's more familiar identity -- William Weld, the governor of Massachusetts throughout the 1990s.

Gender, ZIP code, and birth date feel anonymous, but Prof. Sweeney was able to identify Governor Weld through them for two reasons. First, each of these facts about an individual (or other kinds of facts we might not usually think of as identifying) independently narrows down the population, so much so that the combination of (gender, ZIP code, birthdate) was unique for about 87% of the U.S. population. If you live in the United States, there's an 87% chance that you don't share all three of these attributes with any other U.S. resident. Second, there may be particular data sources available (Sweeney used a Massachusetts voter registration database) that let people do searches to bootstrap what they know about someone in order to learn more -- including traditional identifiers like name and address. In a very concrete sense, "anonymized" or "merely demographic" information about people may be neither. (And a web site that asks "anonymous" users for seemingly trivial information about themselves may be able to use that information to make a unique profile for an individual, or even look up that individual in other databases.)
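
The narrowing effect of combining attributes can be sketched in a few lines of Python. The records below are invented for illustration (they are not Sweeney's data), and `unique_fraction` is a hypothetical helper that reports what share of records is unique on a chosen set of columns.

```python
from collections import Counter

# Made-up records: (gender, ZIP code, birth date).
records = [
    ("M", "02138", "1945-07-31"),
    ("F", "02138", "1945-07-31"),
    ("M", "02138", "1962-03-14"),
    ("M", "02139", "1962-03-14"),
    ("F", "02139", "1980-11-02"),
    ("F", "02139", "1980-11-02"),
]

def unique_fraction(rows, columns):
    """Fraction of rows whose values in `columns` occur exactly once."""
    keys = [tuple(row[i] for i in columns) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(rows)

gender_only = unique_fraction(records, [0])      # nobody is unique by gender alone
zip_only = unique_fraction(records, [1])         # nor by ZIP code alone
all_three = unique_fraction(records, [0, 1, 2])  # most are unique on the combination
```

In this toy data no single attribute singles anyone out, yet four of the six records are unique once all three attributes are combined.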

Many contemporary privacy rules and debates center on the notion of "personally identifiable information" (PII). The PII concept is used by several legal regimes and many organizations' privacy policies; generally, information that identifies a particular person is considered much more sensitive than information that does not. For instance,

Federal telecommunications privacy laws use "individually identifiable information" (about a subscriber) as a basis for the category of protected information called Customer Proprietary Network Information (CPNI);
Federal health privacy regulations use "individually identifiable health information" (about a patient) as a basis for the category called Protected Health Information (PHI);
Federal financial privacy laws, the EU Data Protection Directive, and state privacy laws all employ similar terms and concepts;

and, in each case, facts deemed "personally identifiable" or "individually identifiable" may receive dramatically higher protections under these laws and regulations.

But research by Prof. Sweeney and other experts has demonstrated that surprisingly many facts, including those that seem quite innocuous, neutral, or "common", could potentially identify an individual. Privacy law, mainly clinging to a traditional intuitive notion of identifiability, has largely not kept up with the technical reality.

"When our freedoms in the networked world come under attack, the Electronic Frontier Foundation (EFF) is the first line of defense."

A recent paper by Paul Ohm, "Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization", provides a thorough introduction and a useful perspective on this issue. Prof. Ohm's paper is important reading for anyone interested in personal privacy, because it shows how deanonymization results achieved by researchers like Latanya Sweeney and Arvind Narayanan seriously undermine traditional privacy assumptions. In particular, the binary distinction between "personally-identifiable information" and "non-personally-identifiable information" is increasingly difficult to sustain. Our intuition that certain information is "anonymous" is often wrong. Given the proper circumstances and insight, almost any kind of information might tend to identify an individual; information about people is more identifying than has been assumed, and in the long run the whole enterprise of classifying facts as "PII" or "not PII" is questionable.

Statistical inference and clever use of databases have resulted in impressive examples of deanonymization of supposedly anonymous data, the kinds of data that most organizations have not regarded as PII. Apart from combinations of demographic data, some of the sorts of things that may well uniquely identify you include your search terms; your purchase habits; your preferences or opinions about music, books, or movies; and even the structure of your social networks -- in a purely abstract sense, even when shorn of the identities of your friends and contacts. Deanonymization is effective, and it's dramatically easier than our intuitions suggest. Given the number of variables that potentially distinguish us, we are much more different from each other than we expect, and there are more sources of data than we realize that may be used to narrow down exactly who a particular record refers to.

Many of these papers were meant as proofs of concept: they show that people can potentially be re-identified by these kinds of data, not that everyone will be. Not everyone's medical records were as easy to put a name to as Governor Weld's. And Narayanan and Shmatikov's research definitively identified only two Netflix users from their movie ratings -- not every user whose ratings were published by Netflix. Still, many of these research results deliberately do not use all the data available about individuals because their goal is to show the effectiveness of mathematical techniques, not to violate individuals' privacy. Real-world attacks will use many more kinds of available information simultaneously to narrow in on people's identities. As Bruce Schneier has observed, such attacks only get better over time; they never get worse.

The Electronic Frontier Foundation (EFF) told a federal judge today that the government should not be allowed to use the "state secrets privilege" to preempt the class-action lawsuit against AT&T. EFF's suit accuses AT&T of collaborating with the National Security Agency (NSA) in illegally spying on millions of Americans -- handing over customers' telephone and Internet records and communications without any legal authority. Department of Justice lawyers argued today that even if the NSA program is illegal, pursuing the case might expose "state secrets." However, EFF attorneys asked the judge to allow the case to proceed, considering the privilege with regard to specific evidence and situations instead of derailing the suit altogether.

Ohm argues that it's more appropriate to think of identifiability as a continuum. The notion of "anonymized" or "sanitized" data is then problematic; researchers habitually share, or even publish, data sets which assign code numbers to individuals. There have already been conspicuous problems with this practice, like when AOL published "anonymized" search logs which turned out to identify some individuals from the content of their search terms alone.

We hope "Broken Promises of Privacy" encourages people who work with personal data to think more critically about their retention and sharing practices and the effectiveness of the anonymization or pseudonymization techniques they're using. We also hope it finds a broad audience and helps start a wider discussion among researchers, technologists, and lawyers about what "privacy protection" should mean in the era of deanonymization.

A Primer on Information Theory and Privacy

If we ask whether a fact about a person identifies that person, it turns out that the answer isn't simply yes or no. If all I know about a person is their ZIP code, I don't know who they are. If all I know is their date of birth, I don't know who they are. If all I know is their gender, I don't know who they are. But it turns out that if I know these three things about a person, I could probably deduce their identity! Each of the facts is partially identifying.

There is a mathematical quantity which allows us to measure how close a fact comes to revealing somebody's identity uniquely. That quantity is called entropy, and it's often measured in bits. Intuitively, you can think of entropy as a generalization of the number of different possibilities there are for a random variable: if there are two possibilities, there is 1 bit of entropy; if there are four possibilities, there are 2 bits of entropy, etc. Adding one more bit of entropy doubles the number of possibilities.
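
The relationship between bits and possibilities is just a base-2 logarithm. A minimal Python sketch, assuming all possibilities are equally likely (the function names are illustrative):

```python
import math

def bits_for(possibilities):
    """Bits of entropy for a number of equally likely possibilities."""
    return math.log2(possibilities)

def possibilities_for(bits):
    """Number of equally likely possibilities that a given number of bits can distinguish."""
    return 2 ** bits

two_options = bits_for(2)    # 1 bit
four_options = bits_for(4)   # 2 bits
one_more_bit = possibilities_for(four_options + 1)  # doubling: 8 possibilities
```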

Because there are around 7 billion humans on the planet, the identity of a random, unknown person contains just under 33 bits of entropy (two to the power of 33 is about 8.6 billion). When we learn a new fact about a person, that fact reduces the entropy of their identity by a certain amount. There is a formula to say how much:

ΔS = - log2 Pr(X=x)

Where ΔS is the reduction in entropy, measured in bits, and Pr(X=x) is simply the probability that the fact would be true of a random person. Let's apply the formula to a few facts, just for fun:

Starsign: ΔS = - log2 Pr(STARSIGN=capricorn) = - log2 (1/12) = 3.58 bits of information
Birthday: ΔS = - log2 Pr(DOB=2nd of January) = -log2 (1/365) = 8.51 bits of information
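
The two calculations above can be reproduced with a short Python sketch of the formula. The name `surprisal_bits` is an illustrative choice, and the uniform 1/12 and 1/365 probabilities are the same simplifying assumptions the examples make.

```python
import math

def surprisal_bits(probability):
    """Reduction in entropy, in bits, from learning a fact that holds with `probability`."""
    if not 0 < probability <= 1:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)

starsign_bits = surprisal_bits(1 / 12)   # each starsign assumed equally likely
birthday_bits = surprisal_bits(1 / 365)  # each birthday assumed equally likely

print(f"starsign: {starsign_bits:.2f} bits")  # starsign: 3.58 bits
print(f"birthday: {birthday_bits:.2f} bits")  # birthday: 8.51 bits
```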

Note that if you combine several facts together, you might not learn anything new; for instance, telling me someone's starsign doesn't tell me anything new if I already knew their birthday.
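
One way to see why the bits don't simply add is to work with the joint probability. The sketch below assumes, as a simplification, that a birthday fully determines the starsign, so the conditional probability of the starsign given the birthday is 1. The variable names are illustrative.

```python
import math

p_birthday = 1 / 365
p_starsign = 1 / 12
p_starsign_given_birthday = 1.0  # assumption: the birthday fixes the starsign

birthday_only_bits = -math.log2(p_birthday)

# Naively adding the two facts as if they were independent overcounts.
naive_bits = -math.log2(p_birthday) - math.log2(p_starsign)

# Using the joint probability P(birthday) * P(starsign | birthday)
# shows the starsign contributes nothing extra.
joint_bits = -math.log2(p_birthday * p_starsign_given_birthday)
```

Here `joint_bits` equals `birthday_only_bits`, while `naive_bits` is larger by the 3.58 bits that the starsign would have contributed on its own.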

In the examples above, each starsign and birthday was assumed to be equally likely. The calculation can also be applied to facts which have non-uniform likelihoods. For instance, the likelihood that an unknown person's ZIP code is 90210 (Beverly Hills, California) is different to the likelihood that their ZIP code would be 40203 (part of Louisville, Kentucky). As of 2007, there were 21,733 people living in the 90210 area, only 452 in 40203, and around 6.625 billion on the planet.

Knowing my ZIP code is 90210: ΔS = - log2 (21,733/6,625,000,000) = 18.21 bits
Knowing my ZIP code is 40203: ΔS = - log2 (452/6,625,000,000) = 23.81 bits
Knowing that I live in Moscow: ΔS = -log2 (10524400/6,625,000,000) = 9.30 bits
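
The same figures can be computed directly from the population counts quoted above. This is a minimal Python sketch; `bits_from_group_size` is a hypothetical helper name, and the probability of a fact is taken to be the group's share of the 2007 world population.

```python
import math

WORLD_POPULATION_2007 = 6_625_000_000  # figure quoted in the text

def bits_from_group_size(group_size, population=WORLD_POPULATION_2007):
    """Bits learned from knowing a person belongs to a group of `group_size` people."""
    return -math.log2(group_size / population)

zip_90210_bits = bits_from_group_size(21_733)
zip_40203_bits = bits_from_group_size(452)
moscow_bits = bits_from_group_size(10_524_400)

for label, bits in [("90210", zip_90210_bits),
                    ("40203", zip_40203_bits),
                    ("Moscow", moscow_bits)]:
    print(f"{label}: {bits:.2f} bits")
```

The smaller the group, the more bits the fact reveals, which is why the sparsely populated ZIP code is far more identifying than a large city.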


How much entropy is needed to identify someone?

As of 2007, identifying someone from the entire population of the planet required:

S = - log2 (1/6625000000) = 32.6 bits of information.

Conservatively, we can round that up to 33 bits.

So for instance, if we know someone's birthday, and we know their ZIP code is 40203, we have 8.51 + 23.81 = 32.32 bits; that's almost, but perhaps not quite, enough to know who they are: there might be a couple of people who share those characteristics. Add in their gender, that's 33.32 bits, and we can probably say exactly who the person is.
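
A rough way to read these totals is to ask how many people on the planet would be expected to match a given number of bits. The sketch below assumes the facts are independent (so their bits add) and uses the 2007 population figure from the text; `expected_matches` is an illustrative name.

```python
WORLD_POPULATION_2007 = 6_625_000_000  # figure quoted in the text

def expected_matches(total_bits, population=WORLD_POPULATION_2007):
    """Expected number of people sharing facts worth `total_bits` bits."""
    return population / 2 ** total_bits

birthday_and_zip_bits = 8.51 + 23.81          # birthday plus ZIP code 40203
with_gender_bits = birthday_and_zip_bits + 1  # gender adds about 1 bit

matches_without_gender = expected_matches(birthday_and_zip_bits)
matches_with_gender = expected_matches(with_gender_bits)
```

With birthday and ZIP code alone, the expected number of matches is a little over one person; adding gender pushes it below one, which is the sense in which the combination probably pins down a single individual.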

An Application To Web Browsers

Now, how would this paradigm apply to web browsers? It turns out that, in addition to the commonly discussed "identifying" characteristics of web browsers, like IP addresses and tracking cookies, there are more subtle differences between browsers that can be used to tell them apart.

One significant example is the User-Agent string, which contains the name, operating system and precise version number of the browser, and which is sent to every web server you visit. A typical User-Agent string looks something like this:

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv: Gecko/20070725 Firefox/

As you can see, there's quite a lot of "stuff" in there. It turns out that that "stuff" is quite useful for telling different people apart on the net. In another post, we report that on average, User Agent strings contain about 10.5 bits of identifying information, meaning that if you pick a random person's browser, only one in 1,500 other Internet users will share their User Agent string.
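
An average figure like this is the Shannon entropy of the distribution of User Agent strings. The Python sketch below shows how such a number could be estimated from a list of observed strings, and how 10.5 bits translates into "one in N". The sample data and the `shannon_entropy_bits` helper are invented for illustration; this is not EFF's measurement code.

```python
import math
from collections import Counter

def shannon_entropy_bits(observations):
    """Average identifying information, in bits, over a list of observed values."""
    counts = Counter(observations)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Made-up sample: four placeholder User Agent strings with unequal popularity.
sample = ["UA-A"] * 4 + ["UA-B"] * 2 + ["UA-C", "UA-D"]
sample_entropy = shannon_entropy_bits(sample)  # 1.75 bits for this toy sample

# Converting the reported 10.5 bits into "one in N" browsers.
one_in_n = 2 ** 10.5  # roughly 1,450, which the text rounds to 1,500
```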

EFF's Panopticlick project is a privacy research effort to measure how much identifying information is being conveyed by other browser characteristics. Visit Panopticlick to see how identifying your browser is, and to help us in our research.

Go to Panopticlick and find out about yourself




found by murmur55
