I have been thinking about data-security issues, in particular database-reconstruction attacks. To quote Wikipedia, a reconstruction attack is any method for partially reconstructing a private database from public aggregate information. The question I am specifically interested in is this: can an attacker with general interactive query access to a dataset recover a piece of information they do not have permission to access, by constructing a sequence of statistical queries and making suitable inferences from the answers? (Note that we are not really talking here about the deductive closure of a logic database D [GMN84], which is the set of all facts that can be iteratively deduced from queries on D and previously deduced facts; every database can be completely reconstructed given an unlimited number of queries written in a sufficiently rich query language. In database-reconstruction attacks, we look specifically at statistical counting queries, i.e. queries that return how many records satisfy a certain property, and we also care about efficiency: can the database be reconstructed with only a relatively small number of such counting queries?)
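To make the threat concrete, here is a minimal sketch of the simplest such inference, a differencing attack: two aggregate queries, neither of which asks about any individual directly, whose difference nonetheless reveals one person's attribute. The table and names are of course made up.

```python
# A toy table with a sensitive attribute (all data is made up).
records = [
    {"name": "alice", "age": 34, "hiv_positive": False},
    {"name": "bob",   "age": 41, "hiv_positive": True},
    {"name": "carol", "age": 29, "hiv_positive": False},
]

def count(pred):
    """A statistical counting query: how many records satisfy pred?"""
    return sum(1 for r in records if pred(r))

# Query 1: how many patients are HIV-positive?
q1 = count(lambda r: r["hiv_positive"])
# Query 2: how many patients other than bob are HIV-positive?
q2 = count(lambda r: r["hiv_positive"] and r["name"] != "bob")

# Neither query mentions bob's status directly, yet their difference does.
print("bob is HIV-positive:", q1 - q2 == 1)
```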
In general terms, data-reconstruction attacks cannot be completely avoided using simple controls. There is, in fact, a Database Reconstruction Theorem [DN03] which shows that a linear number of summation queries on random subsets of a private dataset D can be used to reconstruct large parts of D with high probability, by exploiting sparsity constraints in the solutions of linear systems of equations. (As one might expect, the proof [DMT07] has a tight connection with the field of Compressive Sensing [FR13].) Linear-programming and related constraint-programming attack methods (Sudoku is not a bad mental model) have been demonstrated successfully in practice [GAM18, CK18], and their significance has been acknowledged by institutions like the US Census Bureau [KA23].
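The attack is short enough to sketch in code. The following toy version is in the style of [DN03] and LP decoding [DMT07]; the sizes, the noise bound, and the use of scipy's off-the-shelf linear-programming solver are my own illustrative choices, not the papers' exact construction.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m = 64, 512          # n secret bits, m random subset-sum queries
E = 2                   # bound on the noise added to each answer

secret = rng.integers(0, 2, size=n)                      # private 0/1 column
Q = rng.integers(0, 2, size=(m, n))                      # random subsets
answers = Q @ secret + rng.integers(-E, E + 1, size=m)   # noisy counts

# LP decoding: find c in [0,1]^n with |Q c - answers| <= E, then round.
A_ub = np.vstack([Q, -Q])
b_ub = np.concatenate([answers + E, E - answers])
res = linprog(np.zeros(n), A_ub=A_ub, b_ub=b_ub,
              bounds=[(0, 1)] * n, method="highs")
guess = (res.x > 0.5).astype(int)
print("fraction of bits recovered:", (guess == secret).mean())
```

With small per-query noise, the rounded LP solution typically recovers almost the entire secret column, even though no single query targets an individual record.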
In addition to the above attack methods for reconstructing entire databases, there are numerous other techniques in the machine-learning literature for recovering or predicting the values of unobserved or missing attributes from observable ones, including low-rank matrix-factorisation techniques for collaborative filtering, and algorithms for various probabilistic graphical models and neural-network models.
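As a flavour of the matrix-factorisation line of work, here is a toy iterative-SVD imputation on synthetic rank-2 data; the sizes, rank, and iteration count are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up low-rank "user x item" ratings matrix with hidden entries.
U, V = rng.normal(size=(30, 2)), rng.normal(size=(2, 20))
ratings = U @ V
mask = rng.random(ratings.shape) < 0.3           # 30% of entries are hidden
observed = np.where(mask, np.nan, ratings)

# Iterative SVD imputation: fill the gaps, project onto rank 2, repeat.
filled = np.where(mask, 0.0, observed)
for _ in range(50):
    u, s, vt = np.linalg.svd(filled, full_matrices=False)
    low_rank = (u[:, :2] * s[:2]) @ vt[:2]       # rank-2 approximation
    filled = np.where(mask, low_rank, observed)  # keep observed values fixed

err = np.abs(low_rank[mask] - ratings[mask]).mean()
print("mean absolute error on hidden entries:", round(err, 3))
```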
Results like the Database Reconstruction Theorem have led to different formal models of privacy protection, the most established of which is Differential Privacy [DR14], an approach that provides privacy guarantees by constraining algorithms to limit the disclosure of individual records when publishing aggregate information from a database. Differentially private data-release mechanisms also have the nice properties of being robust to post-processing and composition. The former says that anything computed from the output of a DP mechanism is itself differentially private. The latter says that the privacy risk from multiple DP data releases is bounded by the sum of the privacy risks of the individual releases; this is, of course, how DP blunts data-reconstruction attacks that use multiple queries on a database! Differential Privacy is now widely adopted in the collection and publishing of data in industry and government [Li23]. Some of the limitations of the Differential Privacy formulation have also been analysed and generalised in a systematic manner [KM14].
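A minimal sketch of the standard Laplace mechanism for counting queries, together with the two properties just mentioned; the dataset and parameter values are made up.

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.integers(0, 2, size=1000)    # made-up private 0/1 attribute column

def laplace_count(bits, epsilon):
    """An epsilon-DP counting query: a count has sensitivity 1 (adding or
    removing one record changes it by at most 1), so Laplace noise with
    scale 1/epsilon suffices."""
    return int(bits.sum()) + rng.laplace(scale=1.0 / epsilon)

# Post-processing: any function of a DP output is itself DP, because it
# touches no further private data.
noisy = laplace_count(data, epsilon=0.1)
rounded = max(0, round(noisy))                  # still 0.1-DP

# Composition: k releases at epsilon each cost at most k * epsilon in total.
k, eps = 50, 0.1
releases = [laplace_count(data, eps) for _ in range(k)]
print(f"rounded release: {rounded}; total budget spent: {k * eps:.1f}")
```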
The general mathematical theory of Quantitative Information Flow [QIF20] provides a systematic way to understand and quantify the leakage of data and information as they pass through different transformations and processing steps. It does so in terms of the attacker's initial uncertainty about a secret (in the form of a prior probability distribution) and the attacker's remaining uncertainty about the secret (in the form of a posterior probability distribution) after observing the results of transformations, both direct and indirect (so-called side channels). Quantitative Information Flow is related to Differential Privacy [Al11], but it is a strictly more general framework that can be used to analyse complex data-processing pipelines.
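In the simplest QIF setting, the secret's vulnerability is the attacker's probability of guessing it in one try, before and after observing the output of a channel. A toy computation with a made-up prior and channel matrix:

```python
import numpy as np

# A tiny channel: secret X in {0, 1, 2}; the attacker observes output Y.
# C[x, y] = P(Y = y | X = x). All numbers are made up for illustration.
prior = np.array([0.5, 0.3, 0.2])
C = np.array([[0.8, 0.2],
              [0.3, 0.7],
              [0.5, 0.5]])

# Prior Bayes vulnerability: attacker's best one-shot guess before observing.
v_prior = prior.max()

# Posterior Bayes vulnerability: expected best guess after observing Y.
joint = prior[:, None] * C                 # P(X = x, Y = y)
v_post = joint.max(axis=0).sum()           # sum over y of max over x

print("prior vulnerability:    ", v_prior)
print("posterior vulnerability:", v_post)
print("multiplicative leakage: ", v_post / v_prior)
```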
Here are some of my suggestions for dealing with the threat of Database Reconstruction Attacks when building database systems that contain sensitive information:
- Consider differential-privacy (DP) techniques when publishing aggregated information or statistics from private datasets. Since DP protects privacy by injecting suitable noise into published statistics, there is a trade-off against the need for accurate reporting of statistics from a private dataset; as such, DP may be reserved for highly sensitive datasets, for example those containing health data [Dy21].
- Specific and practical DP algorithms to consider for interactive and/or correlated queries include the Median Mechanism [RR10], the SmallDB Mechanism [BLR13], and the SCQ Mechanism [Fi20].
- Use Quantitative Information Flow techniques to analyse and quantify data leakage in major and sensitive data-processing pipelines.
- Log all queries on the database, and continuously monitor and analyse the logs, using modern machine-learning and AI algorithms to detect anomalous query patterns (see the sketch after this list).
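Production monitoring would use richer statistical or ML models; the following deliberately simple, rule-based sketch only illustrates the idea by flagging one classic anomalous pattern, namely pairs of queries from the same user whose matching record sets differ by exactly one record (the differencing pattern from the opening example). All names and data are made up.

```python
from itertools import combinations

def matching_ids(records, pred):
    return frozenset(r["id"] for r in records if pred(r))

def flag_differencing(records, query_log):
    """query_log: list of (user, predicate) pairs already executed.
    Flag same-user query pairs whose result sets differ by one record."""
    alerts = []
    sets = [(user, matching_ids(records, pred)) for user, pred in query_log]
    for (u1, s1), (u2, s2) in combinations(sets, 2):
        if u1 == u2 and len(s1 ^ s2) == 1:     # symmetric difference of 1
            alerts.append((u1, next(iter(s1 ^ s2))))
    return alerts

records = [{"id": i, "hiv_positive": i == 7} for i in range(10)]
log = [("mallory", lambda r: True),
       ("mallory", lambda r: r["id"] != 7)]
print(flag_differencing(records, log))   # [('mallory', 7)]
```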
Practitioners of data security who want to guard against database reconstruction attacks should also be aware of the following known issues in widely used controls:
- Traditional techniques like k-anonymity, l-diversity, and t-closeness have been shown to fail privacy-protection standards like the GDPR when it comes to anonymisation and preventing “singling out”, i.e. the possibility of isolating some or all records that identify an individual in a dataset [Ni21].
- Differential Privacy is known to have limitations on correlated data [KM11] because DP assumes that the records in the underlying database are independent of each other, a condition that may not hold in general, especially for data collected from some underlying network or social phenomenon (a toy numerical illustration follows this list). There are generalised classes of privacy notions, including Dependent Differential Privacy [Liu16] and Inferential Privacy [GK17], that seek to quantify the amount of correlation in a dataset and then introduce additional noise to preserve privacy. There are also alternative formulations of privacy, like Zero-Knowledge Privacy [Ge11], that take completely different approaches.
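To see the correlated-data problem numerically, suppose one secret bit is shared by k records (a family's shared attribute, say). A noisy count that is eps-DP with respect to each individual record still reveals the shared bit almost perfectly, because group privacy only guarantees k*eps for the group as a whole. A toy sketch with made-up parameters:

```python
import numpy as np

rng = np.random.default_rng(3)
eps, k = 0.1, 100    # per-record privacy budget; k perfectly correlated rows

# One secret bit b, copied into k correlated records. The noisy count below
# is 0.1-DP *per record* (sensitivity 1, Laplace scale 1/eps), yet an
# attacker who knows the correlation recovers b almost every time.
def noisy_count(b):
    return k * b + rng.laplace(scale=1.0 / eps)

trials = [(b, noisy_count(b) > k / 2) for b in rng.integers(0, 2, 1000)]
accuracy = np.mean([b == guess for b, guess in trials])
print("attacker guesses b correctly:", accuracy)   # ~1.0 despite eps = 0.1
```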
References:
- [GMN84] H. Gallaire, J. Minker, J-M. Nicolas, Logic and Databases: A Deductive Approach, Computing Surveys, 1984.
- [FR13] S. Foucart and H. Rauhut, A Mathematical Introduction to Compressive Sensing, Springer, 2013.
- [DN03] I. Dinur and K. Nissim, Revealing information while preserving privacy, PODS 2003. (See also http://differentialprivacy.org/reconstruction-theory.)
- [DMT07] C. Dwork, F. McSherry and K. Talwar, The price of privacy and the limits of LP decoding, STOC, 2007.
- [GAM18] S. Garfinkel, J.M. Abowd, and C. Martindale, Understanding Database Reconstruction Attacks on Public Data, ACM Queue, 2018.
- [CK18] A. Cohen and K. Nissim, Linear Program Reconstruction in Practice, 2018.
- [KA23] S.A. Keller and J.M. Abowd, Database reconstruction does compromise confidentiality, PNAS, 2023.
- [DR14] C. Dwork and A. Roth, The Algorithmic Foundations of Differential Privacy, Now Publishers, 2014.
- [Li23] Y. Li, et al, Private Graph Data Release: A Survey, ACM Computing Surveys, 2023.
- [KM14] D. Kifer and A. Machanavajjhala, Pufferfish: A Framework for Mathematical Privacy Definitions, TODS, 2014.
- [QIF20] M. Alvim, et al, The Science of Quantitative Information Flow, Springer, 2020.
- [Al11] M. Alvim, et al, Differential Privacy: On the trade-off between utility and information leakage, FAST, 2011.
- [Dy21] A. Dyda, et al, Differential Privacy for public health data: An innovative tool to optimise information sharing while protecting data confidentiality, Patterns, 2021.
- [RR10] A. Roth and T. Roughgarden, Interactive Privacy via the Median Mechanism, STOC, 2010.
- [BLR13] A. Blum, K. Ligett, and A. Roth, A learning theory approach to non-interactive database privacy, JACM, 2013.
- [Fi20] B. Fish, et al, Sampling without compromising accuracy in adaptive data analysis, ALT, 2020.
- [Ni21] K. Nissim, Privacy: From Database Reconstruction to Legal Theorems, PODS, 2021.
- [KM11] D. Kifer and A. Machanavajjhala, No free lunch in data privacy, SIGMOD, 2011.
- [Liu16] C. Liu, et al, Dependence makes you vulnerable: Differential Privacy under dependent tuples, NDSS, 2016.
- [GK17] A. Ghosh and R. Kleinberg, Inferential privacy guarantees for differentially private mechanisms, ITCS, 2017.
- [Ge11] J. Gehrke, et al, Towards privacy for social networks: A zero-knowledge based definition of privacy, TCC, 2011.