Database Anonymization. David Sánchez
Читать онлайн книгу.are lumped together to form a new category. The same is done for bottom values (those below a certain threshold).
Local suppression
This is a masking method in which certain values of individual attributes are suppressed with the aim of increasing the set of records agreeing on a combination of key values. Ways to combine local suppression and generalization are implemented in the μ-Argus SDC package [45].
If a continuous attribute Xi is part of a set of key attributes, then each combination of key values is probably unique. Since it does not make sense to systematically suppress the values of Xi, we conclude that local suppression is rather oriented to categorical attributes.
3.2 PERTURBATIVE MASKING METHODS
Noise addition
Additive noise is a family of perturbative masking methods. The values in the original data set are masked by adding some random noise. The statistical properties of the noise being added determine the effect of noise addition on the original data set. Several noise addition procedures have been developed, each of them with the aim to better preserve the statistical properties of the original data.
• Masking by uncorrelated noise addition. The vector of observations, xi, for the i-th attribute of the original data set Xi is replaced by a vector yi = xi + ei where ei is a vector of normally distributed errors. Let
, respectively, the k-th and l-th components of vector ei. We have that and are independent and drawn from a normal distribution . The usual approach is for the variance of the noise added to attribute Xi to be proportional to the variance of Xi; that is, . The term “uncorrelated” is used to mean that there is no correlation between the noise added to different attributes.This method preserves means and covariances,
However, neither variances nor correlations are preserved
• Masking by correlated noise addition. Noise addition alone always modifies the variance of the original attributes. Thus, if we want to preserve the correlation coefficients of the original data, the covariances must be modified. This is what masking by correlated noise does. By taking the covariance matrix of the noise to be proportional to the covariance matrix of the original data we have:
• Masking by noise addition and linear transformation. In [48], a method is proposed that ensures by additional transformations that the sample covariance matrix of the masked attributes is an unbiased estimator for the covariance matrix of the original attributes.
• Masking by noise addition and nonlinear transformation. Combining simple additive noise and nonlinear transformation has also been proposed, in such a way that application to discrete attributes is possible and univariate distributions are preserved. Unfortunately, the application of this method is very time-consuming and requires expert knowledge on the data set and the algorithm. See [44] for more details.
Noise addition methods with normal distributions are naturally meant for continuous data, even though some adaptations to categorical data have been also proposed [76]. Moreover, the introduction of the differential privacy model for disclosure control has motivated the use of other noise distributions. The focus here is on the preservation of the privacy guarantees of the model rather than the statistical properties of the data. The addition of uncorrelated Laplace distributed noise is the most common approach to attain differential privacy [29]. For the case of discrete data, the geometric distribution [33] is a better alternative to the Laplace distributed noise. It has also been shown that the Laplace distribution is not the optimal noise in attaining differential privacy for continuous data [92].
Data/rank swapping
Data swapping was originally presented as an SDC method for databases containing only categorical attributes. The basic idea behind the method is to transform a database by exchanging values of confidential attributes among individual records. Records are exchanged in such a way that low-order frequency counts or marginals are maintained.
In spite of the original procedure not being very used in practice, its basic idea had a clear influence in subsequent methods. A variant of data swapping for microdata is rank swapping, which will be described next in some detail. Although originally described only for ordinal attributes [36], rank swapping can also be used for any numerical attribute. See Algorithm 1. First, values of an attribute A are ranked in ascending order, then each ranked value of A is swapped with another ranked value randomly chosen within a restricted range (e.g., the rank of two swapped values cannot differ by more than p% of the total number of records, where p is an input parameter).
This algorithm is independently used on each original attribute in the original data set. It is reasonable to expect that multivariate statistics computed from data swapped with this algorithm will be less distorted than those computed after an unconstrained swap.
Microaggregation
Microaggregation is a family of SDC techniques for continuous microdata. The rationale behind microaggregation is that confidentiality rules in use allow publication of microdata sets if records correspond to groups of k or more individuals, where no individual dominates (i.e., contributes too much to) the group and k is a threshold value. Strict application of such confidentiality rules leads to replacing individual values with values computed on small aggregates (microaggregates) prior to publication. This is the basic principle of microaggregation. To obtain microaggregates in a microdata set with n records, these are combined to form g groups of size at least k. For each attribute, the average value over each group is computed and is used to replace each of the original averaged values. Groups are formed using a criterion of maximal similarity. Once the procedure has been completed, the resulting (modified) records can be published. The optimal k-partition (from the information loss point of view) is defined to be the one that maximizes within-group homogeneity; the higher the within-group homogeneity, the lower the information loss, since microaggregation replaces values in a group by the group centroid. The sum of squares criterion is common to measure homogeneity in clustering. The within-groups sum of squares SSE is defined as:
Algorithm 1 Rank swapping with swapping restricted to p%
The between-groups sum of squares SSA is
The total sum of squares is SST = SSA C SSE or explicitly
The lower the SSE, the higher the within group homogeneity. Thus, in terms of sums of squares, the optimal k-partition is the one that minimizes SSE (or equivalently, maximizes SSA).
Given a microdata set consisting of p attributes, these can be microaggregated together or partitioned into several groups of attributes and then microaggregated. Also, the way to form