Enhancing Synthetic Data Techniques for Practical Applications

Background

Addressing unresolved topics in synthetic data.

Project Description

This research project will advance statistical and computational methods for releasing high-quality synthetic data as public use files. In the face of high and expanding risks of unintended or illegal disclosures, many data stewards are considering synthetic public use files. These files comprise simulated records, with values generated from statistical models estimated with the confidential data. This approach can reduce disclosure risks, since it can be difficult to re-identify individuals and infer their sensitive attributes when the released values are simulated. Despite growing interest in synthetic data solutions for data dissemination, there are significant gaps in the theory and methods of synthetic data that complicate and hinder practical implementations. This project will address three critical yet unresolved topics in synthetic data, namely (1) assessing data subjects' disclosure risks, (2) facilitating data analysts' evaluation of their synthetic data inferences, and (3) generating synthetic datasets in surveys with complex designs. The results of this research will offer federal agencies, survey organizations, research centers, and other data producers the means to create safer and more analytically useful synthetic data products. In turn, this will help data stewards better meet the challenges of public use data dissemination. The project will train Ph.D. and undergraduate students to become researchers in data privacy protection methods, thereby contributing to the pipeline of experts in data privacy and in statistics and data science more broadly. The project will also develop and disseminate software code that implements the various approaches.

Publications

Lin, T. and Reiter, J. P. (forthcoming), “Differentially private verification of survey-weighted estimates,” Transactions on Data Privacy.

Kazan, Z. and Reiter, J. P. (forthcoming), “Prioritizing privacy: A Bayesian approach to setting the privacy budget in differential privacy,” NeurIPS.

Guha, S. and Reiter, J. P. (2024), “Simultaneous causal inference and probabilistic record linkage in observational studies with covariates spread over two files,” Journal of Statistical Planning and Inference, 229, Article 106090.

Kazan, Z. and Reiter, J. P. (2024), “Assessing disclosure risk for differentially private, hierarchical count data, with application to the 2020 U.S. decennial census,” Statistica Sinica, 34, to appear.

Mathur, S., Si, Y., and Reiter, J. P. (2024), “Fully synthetic data for complex surveys,” Survey Methodology, to appear.

Yang, C. and Reiter, J. P. (2024), “Differentially private methods for stability analyses,” The American Statistician, 78, 180-191.

Reiter, J. P. (2023), “Synthetic data: A look back and a look forward,” Transactions on Data Privacy, 16, 15-24.

Under Review:

Guha, S. and Reiter, J. P., “Differentially private estimation of weighted average treatment effects for binary outcomes,” submitted.

Kazan, Z. and Reiter, J. P., “Bayesian inference under differential privacy: Prior selection considerations with application to univariate Gaussian data and regression,” submitted.

Duration

8/22 - 8/25

Project Team

Jerome P. Reiter

Zeki Kazan
