Database Anonymization. David Sánchez

by law in most Western countries. Indeed, without privacy, other fundamental rights, like freedom of speech and democracy, are impaired. The outstanding challenge is to create technology that implements those legal guarantees in a way compatible with functionality and security.

      This book is devoted to privacy preservation in data releases. Indeed, in our era of big data, harnessing the enormous wealth of information available is essential to increasing the progress and well-being of humankind. The challenge is how to release data that are useful for administrations and companies to make accurate decisions without disclosing sensitive information on specific identifiable individuals.

      This conflict between utility and privacy has motivated research by several communities since the 1970s, both in official statistics and computer science. Specifically, computer scientists contributed the important notion of the privacy model in the late 1990s, with k-anonymity being the first practical privacy model. The idea of a privacy model is to state ex ante privacy guarantees that can be attained for a particular data set using one (or several) anonymization methods.

      In addition to k-anonymity, we survey here its extensions l-diversity and t-closeness, as well as the alternative paradigm of differential privacy. Further, we draw on our recent research to report connections and synergies between all these privacy models: in fact, the k-anonymity-like models and differential privacy turn out to be more related than previously thought. We also show how microaggregation, a well-known family of anonymization methods that we have developed to a large extent since the late 1990s, can be used to create anonymization methods that satisfy most of the surveyed privacy models while improving the utility of the resulting protected data.

      We sincerely hope that the reader, whether academic or practitioner, will benefit from this piece of work. On our side, we have enjoyed writing it and also conducting the original research described in some of the chapters.

      Josep Domingo-Ferrer, David Sánchez, and Jordi Soria-Comas

      January 2016

       Acknowledgments

      We thank Professor Elisa Bertino for encouraging us to write this Synthesis Lecture. This work was partly supported by the European Commission (through project H2020 “CLARUS”), by the Spanish Government (through projects “ICWT” TIN2012-32757 and “SmartGlacis” TIN2014-57364-C2-1-R), by the Government of Catalonia (under grant 2014 SGR 537), and by the Templeton World Charity Foundation (under project “CO-UTILITY”). Josep Domingo-Ferrer is partially supported as an ICREA-Acadèmia researcher by the Government of Catalonia. The authors are with the UNESCO Chair in Data Privacy, but the opinions expressed in this work are the authors’ own and do not necessarily reflect the views of UNESCO or any of the funders.

      Josep Domingo-Ferrer, David Sánchez, and Jordi Soria-Comas

      January 2016

      CHAPTER 1

       Introduction

      The current social and economic context increasingly demands open data to improve planning, scientific research, market analysis, etc. In particular, the public sector is under pressure to release as much information as possible for the sake of transparency. Organizations releasing data include national statistical institutes (whose core mission is to publish statistical information), healthcare authorities (which occasionally release epidemiological information), and even private organizations (which sometimes publish consumer surveys). When published data refer to individual respondents, care must be exercised so that the privacy of those respondents is not violated. It should be de facto impossible to relate the published data to specific individuals. Indeed, supplying data to national statistical institutes is compulsory in most countries but, in return, these institutes commit to preserving the privacy of the respondents. Hence, rather than publishing accurate information for each individual, the aim should be to provide useful statistical information, that is, to preserve in the released data as many of the statistical properties of the original data as possible.

      Disclosure risk limitation has a long tradition in official statistics, where privacy-preserving databases on individuals are called statistical databases. Inference control in statistical databases, also known as Statistical Disclosure Control (SDC), Statistical Disclosure Limitation (SDL), database anonymization, or database sanitization, is a discipline that seeks to protect data so that they can be published without revealing confidential information that can be linked to specific individuals among those to whom the data correspond.

      Disclosure limitation has also been a topic of interest in the computer science research community, which refers to it as Privacy Preserving Data Publishing (PPDP) and Privacy Preserving Data Mining (PPDM). The latter focuses on protecting the privacy of the results of data mining tasks, whereas the former focuses on the publication of data of individuals.

      Although SDC and PPDP pursue the same objective, they differ in emphasis: SDC proposes protection mechanisms that are more concerned with the utility of the data and offer only vague (i.e., ex post) privacy guarantees, whereas PPDP seeks to attain an ex ante privacy guarantee (by adhering to a privacy model) but offers no utility guarantees.

      In this book we provide an exhaustive overview of the fundamentals of privacy in data releases, including the privacy models, anonymization/SDC methods, and utility and risk metrics that have been proposed so far in the literature. Moreover, as a more advanced topic, we discuss in detail the connections between several proposed privacy models (how to accumulate the guarantees offered by different privacy models to achieve more robust protection, and when such guarantees are equivalent or complementary). We also propose bridges between SDC methods and privacy models (i.e., how specific SDC methods can be used to satisfy specific privacy models and thereby offer ex ante privacy guarantees).

      The book is organized as follows.

      • Chapter 2 details the basic notions of privacy in data releases: types of data releases, privacy threats and metrics, and families of SDC methods.

      • Chapter 3 offers a comprehensive overview of SDC methods, classified into perturbative and non-perturbative ones.

      • Chapter 4 describes how disclosure risk can be empirically quantified via record linkage.

      • Chapter 5 discusses the well-known k-anonymity privacy model, which is focused on preventing re-identification of individuals, and details which data protection mechanisms can be used to enforce it.

      • Chapter 6 describes two extensions of k-anonymity (l-diversity and t-closeness) focused on offering protection against attribute disclosure.

      • Chapter 7 presents in detail how t-closeness can be attained on top of k-anonymity by relying on data microaggregation (i.e., a specific SDC method based on data clustering).

      • Chapter 8 describes the differential privacy model, which mainly focuses on providing sanitized answers with robust privacy guarantees to specific queries. We explain SDC techniques that can be used to attain differential privacy, and we discuss in detail the relationship between differential privacy and k-anonymity-based models (t-closeness, specifically).

      • Chapters 9 and 10 present two state-of-the-art approaches to offer utility-preserving differentially private data releases by relying on the notion of k-anonymous data releases and on multivariate and univariate microaggregation, respectively.

      • Chapter 11 summarizes general conclusions and introduces some topics for future research. More specific conclusions are given at the end of each chapter.
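As a small illustration of two notions named in the outline above, the following Python sketch masks a single numeric attribute by microaggregation and then checks the result against k-anonymity. It rests on assumptions that are not taken from the book: the function names `microaggregate` and `is_k_anonymous` are hypothetical, the grouping rule (sort, then cut into consecutive groups of at least k values) is a deliberately simple stand-in rather than any of the authors' algorithms, and k-anonymity is read in its usual sense, namely that every combination of quasi-identifier values is shared by at least k records.

```python
from collections import Counter


def microaggregate(values, k):
    """Replace each value by the mean of its group of at least k sorted neighbors.

    Simplified univariate sketch (an assumption, not the book's method).
    """
    if k < 1 or len(values) < k:
        raise ValueError("need k >= 1 and at least k values")
    # Indices of the values in ascending order of value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    n_groups = len(values) // k
    result = [0.0] * len(values)
    for g in range(n_groups):
        start = g * k
        # The last group absorbs the remainder, so no group has fewer than k members.
        end = (g + 1) * k if g < n_groups - 1 else len(values)
        group = order[start:end]
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            result[i] = mean
    return result


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every quasi-identifier combination occurs at least k times."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(c >= k for c in counts.values())


# Hypothetical toy data: ages play the role of the only quasi-identifier.
ages = [23, 25, 31, 38, 40, 52, 55]
masked_ages = microaggregate(ages, k=3)

original = [{"age": a} for a in ages]
released = [{"age": a} for a in masked_ages]

print(is_k_anonymous(original, ["age"], 3))
print(is_k_anonymous(released, ["age"], 3))
```

In this toy setting the original ages are all distinct, so they cannot satisfy the check for k = 3, whereas every masked age is shared by the whole group it was averaged over. The price is the information lost within each group, which is the utility side of the trade-off described in this chapter.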

      CHAPTER 2

       Privacy in Data Releases

      References to privacy were already present in the writings of Greek philosophers, who distinguished the outer (public) from the inner (private). Nowadays privacy is considered a fundamental right of individuals [34, 101]. Despite this long history, the formal description of the “right to privacy” is

