In 2006, AOL released a file containing search queries posed by many of its users. The user names were replaced with random hashes, but the query text was not modified. It turned out that some users had searched for their own names ("vanity queries") and for nearby locations such as local businesses. As a result, reporters had little difficulty identifying and interviewing an AOL user [1] and then learning personal details about her (such as age and medical history) from the rest of her queries.
Could AOL have protected all its users by also replacing each word in the search queries with a random hash? Probably not; Kumar et al. [27] showed that word co-occurrence patterns would provide clues about which hashes correspond to which words, allowing an attacker to partially reconstruct the original queries.

Such privacy concerns are not unique to Web-search data. Businesses, government agencies, and research groups routinely collect data about individuals and need to release some form of it for a variety of reasons, such as meeting legal requirements, satisfying business obligations, and encouraging reproducible scientific research. However, they must also protect sensitive information in the raw data, including identities, facts about individuals, trade secrets, and other application-specific considerations. The privacy challenge is that sensitive information can be inferred in many ways from released data. Homer et al. [20] showed that participants in genomic research studies may be identified from publication of aggregated research results. Greveler et al. [17] showed that smart meter readings can be used to identify the TV shows and movies being watched in a target household. Coull et al. [6] showed that webpages viewed by users can be deduced from metadata about network flows, even when server IP addresses are replaced with pseudonyms. And Goljan and Fridrich [16] showed how cameras can be identified from noise in the images they produce.
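To see why hashing individual words offers little protection, consider that hashing is deterministic: every occurrence of a word maps to the same token, so word frequencies (and co-occurrences) survive intact. The following toy sketch, not from the article, shows a simpler cousin of the attack Kumar et al. describe: an attacker with a public corpus of similar statistics matches hashes to words purely by frequency rank. All names and the example queries here are invented for illustration.

```python
import hashlib
from collections import Counter

def hash_token(tok: str, salt: str = "s") -> str:
    # Deterministic per-token hash, standing in for AOL-style word pseudonyms.
    return hashlib.sha256((salt + tok).encode()).hexdigest()[:8]

# The "released" dataset: queries with each word replaced by its hash.
queries = [
    "cheap flights to boston",
    "boston red sox tickets",
    "cheap hotels boston",
]
released = [[hash_token(w) for w in q.split()] for q in queries]

# Attacker's side knowledge: a corpus with similar word statistics.
# (Here it is identical to the hidden text, so recovery is exact; in
# practice an attacker would use any large public query log or corpus.)
public_corpus = " ".join(queries).split()

hash_freq = Counter(h for q in released for h in q)
word_freq = Counter(public_corpus)

# Align by frequency rank: most frequent hash <-> most frequent word.
# Ties make individual matches guesses, but common words fall out quickly.
guess = {h: w for (h, _), (w, _) in zip(hash_freq.most_common(),
                                        word_freq.most_common())}
recovered = [" ".join(guess[h] for h in q) for q in released]
```

Real attacks go further by using which hashes appear *together* in queries, not just how often each appears, which disambiguates words of similar frequency; the principle is the same: hashing hides spellings, not statistics.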