Synthesizing public-health microdata with pervasive missingness: a deep generative approach
Nathan Blackthorn
University of West Florida Libraries
Master of Science (MS), University of West Florida
2025
Abstract
There is growing recognition that social determinants of health matter in shaping fair public health policies. These determinants, reflected in emerging data streams such as mobility and social media data, are increasingly integral to public health models. However, privacy concerns impede broad access to sensitive data, because even non-identifiable data are susceptible to deductive disclosure. Synthetic populations trained on such data offer a privacy-conscious alternative, with the added benefit of supporting the exploration of "what-if" scenarios. Traditional techniques for generating synthetic populations, however, face limitations.
This thesis explores the use of deep generative models (DGMs), specifically Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), to generate synthetic populations that address privacy concerns related to social determinants of health data. Developing DGMs for this purpose involves overcoming several challenges. First, the thesis proposes an approach that incorporates an autoencoder to handle both continuous and categorical variables, ensuring the synthetic data maintains consistency with categorical constraints. Second, it introduces a VAE-based training algorithm capable of learning directly from datasets with missing values. Existing state-of-the-art methods for handling missing data face two major limitations: (1) they typically require an initial imputation step followed by model training, which increases computational time, and (2) they often rely on deep learning techniques that assume fully observed training data—an assumption that doesn’t hold in many real-world scenarios. To address these limitations, this work presents a novel technique that enables training of fully connected layers directly from incomplete datasets, even when no individual training sample is fully observed.
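As a rough illustration of what learning from incomplete records can look like, the sketch below computes a reconstruction error over observed entries only, so that missing cells contribute nothing to the loss. This is a common masking idea and an assumption made here for illustration; the abstract does not specify the thesis's actual algorithm. The function name `masked_mse` and the use of `None` to mark missing values are hypothetical.

```python
from typing import Optional, Sequence


def masked_mse(
    x: Sequence[Sequence[Optional[float]]],
    x_hat: Sequence[Sequence[float]],
) -> float:
    """Mean squared error over observed entries only.

    Missing entries in `x` are encoded as None (an illustrative
    convention, not one taken from the thesis) and are skipped.
    """
    total = 0.0
    observed = 0
    for row, row_hat in zip(x, x_hat):
        for value, recon in zip(row, row_hat):
            if value is None:
                continue  # missing cell: no contribution to the loss
            total += (value - recon) ** 2
            observed += 1
    if observed == 0:
        raise ValueError("no observed entries to score")
    return total / observed


# Every row has a missing value, yet the loss is still defined.
data = [[1.0, None], [None, 2.0]]
reconstruction = [[0.0, 5.0], [7.0, 4.0]]
loss = masked_mse(data, reconstruction)
```

In this toy case no record is fully observed, which mirrors the setting the abstract describes, and the loss still averages over the two observed cells.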
Using the 2019 PUMS dataset for Florida, the study trains DGMs to produce diverse and adaptable synthetic populations. The results, assessed by comparing marginal distributions and by applying t-SNE, underscore the DGMs' efficacy in balancing privacy and data utility. The significance of this thesis lies in identifying the promise of DGMs in generating synthetic populations to advance health research.
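One simple way to compare marginal distributions for a categorical variable is the total variation distance between the real and synthetic category frequencies. The sketch below is a minimal example under that assumption; the abstract does not state which comparison the thesis uses, and the function names and toy column values are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def marginal(values: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Empirical frequency of each category in one column."""
    if not values:
        raise ValueError("column is empty")
    counts = Counter(values)
    n = len(values)
    return {category: count / n for category, count in counts.items()}


def total_variation(real: Sequence[Hashable], synth: Sequence[Hashable]) -> float:
    """Total variation distance between two empirical marginals.

    0.0 means identical marginals; 1.0 means disjoint support.
    """
    p = marginal(real)
    q = marginal(synth)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# Toy columns standing in for one real and one synthetic variable.
real_column = ["a", "a", "b", "b"]
synthetic_column = ["a", "b", "b", "b"]
distance = total_variation(real_column, synthetic_column)
```

A small distance on every variable indicates that the synthetic population reproduces the one-dimensional marginals, although it says nothing about joint structure.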
Files and links (1)
pdf
Synthesizing public-health microdata with pervasive missingness (4.08 MB)
Preprint Thesis pdf Open Access
Details
Title
Synthesizing public-health microdata with pervasive missingness
Resource Type
Thesis
Contributors
Ashok Srinivasan (Committee Chair)
Andrew A Mahyari (Committee Member)
Brian Jalaian (Committee Member)
Publisher
University of West Florida Libraries
Format
pdf
Number of pages
42
Copyright
Permission granted to the University of West Florida Libraries by the author to digitize and/or display this information for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires the permission of the copyright holder.
Identifiers
99381469153406600
Academic Unit
Computer Science
Language
English
Awarding Institution
University of West Florida; Master of Science (MS)