Publishing Search Logs – A Comparative Study of Privacy Guarantees

Abstract  

Search engine companies collect the “database of intentions”, the histories of their users’ search queries. These search logs are a gold mine for researchers. Search engine companies, however, are wary of publishing search logs in order not to disclose sensitive information.

In this paper we analyze algorithms for publishing frequent keywords, queries and clicks of a search log. We first show how methods that achieve variants of k-anonymity are vulnerable to active attacks. We then demonstrate that the stronger guarantee ensured by ε-differential privacy unfortunately does not provide any utility for this problem. We then propose a novel algorithm ZEALOUS and show how to set its parameters to achieve (ε, δ)-probabilistic privacy. We also contrast our analysis of ZEALOUS with an analysis that achieves (ε′, δ′)-indistinguishability.

Our paper concludes with a large experimental study using real applications where we compare ZEALOUS and previous work that achieves k-anonymity in search log publishing. Our results show that ZEALOUS yields comparable utility to k-anonymity while at the same time achieving much stronger privacy guarantees.

Introduction

Search engines play a crucial role in navigating the vastness of the Web. Today’s search engines do not just collect and index webpages; they also collect and mine information about their users. They store the queries, clicks, IP addresses, and other information about their interactions with users in what is called a search log. Search logs contain valuable information that search engines use to tailor their services to their users’ needs. They enable the discovery of trends, patterns, and anomalies in the search behavior of users, and they can be used in the development and testing of new algorithms to improve search performance and quality. Scientists all around the world would like to tap this gold mine for their own research; search engine companies, however, do not release search logs because they contain sensitive information about their users, for example searches for diseases, lifestyle choices, personal tastes, and political affiliations. The only release of a search log, by AOL in 2006, went into the annals of tech history as one of the great debacles in the search industry. AOL published three months of search logs of 650,000 users. The only measure taken to protect user privacy was the replacement of user IDs with random numbers. This protection was utterly insufficient, as the New York Times showed by identifying a user from Lilburn, Georgia [4], whose search queries contained not only identifying information but also sensitive information about her friends’ ailments.
