Synthesizing public-health microdata with pervasive missingness: a deep generative approach
Thesis   Open access

Nathan Blackthorn
University of West Florida Libraries
Master of Science (MS), University of West Florida
2025

Abstract

There is growing recognition of the importance of social determinants of health in shaping fair public health policies. These determinants, reflected in emerging data streams such as mobility and social media data, are increasingly integral to public health models. However, privacy concerns impede broad access to sensitive data, because even non-identifiable data are susceptible to deductive disclosure. Synthetic populations, generated by models trained on such data, offer a privacy-conscious solution with the added benefit of supporting the exploration of "what-if" scenarios. Traditional techniques for generating synthetic populations, however, face limitations. This thesis explores the use of deep generative models (DGMs), specifically Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), to generate synthetic populations that address privacy concerns related to social determinants of health data. Developing DGMs for this purpose involves overcoming several challenges. First, the thesis proposes an approach that incorporates an autoencoder to handle both continuous and categorical variables, ensuring that the synthetic data remain consistent with categorical constraints. Second, it introduces a VAE-based training algorithm capable of learning directly from datasets with missing values. Existing state-of-the-art methods for handling missing data face two major limitations: (1) they typically require an initial imputation step followed by model training, which increases computation time, and (2) they often rely on deep learning techniques that assume fully observed training data, an assumption that does not hold in many real-world scenarios. To address these limitations, this work presents a novel technique that enables training of fully connected layers directly from incomplete datasets, even when no individual training sample is fully observed.
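As a rough illustration of what learning directly from incomplete records can mean, the sketch below computes a reconstruction loss over observed entries only, so that missing values contribute nothing and no imputation step is needed. This is a common masking idea and an assumption made here for illustration; it is not the thesis's algorithm. The function name `masked_mse` and the toy rows are hypothetical.

```python
from typing import Optional, Sequence

def masked_mse(target: Sequence[Optional[float]], recon: Sequence[float]) -> float:
    """Mean squared error over observed entries only (None marks a missing value)."""
    # Keep only positions where the training record actually has a value.
    observed = [(t, r) for t, r in zip(target, recon) if t is not None]
    if not observed:
        return 0.0  # a fully missing row carries no training signal
    return sum((t - r) ** 2 for t, r in observed) / len(observed)

# Hypothetical toy data: no row is fully observed.
rows = [[1.0, None, 3.0], [None, 2.0, None]]
# Hypothetical model reconstructions for the same rows.
recons = [[0.5, 9.0, 3.0], [7.0, 2.0, 7.0]]

# Reconstructions at missing positions (9.0, 7.0) do not affect the loss.
losses = [masked_mse(t, r) for t, r in zip(rows, recons)]
```

In a real model the same mask would also have to be applied to the gradient computation; this sketch covers only the loss value.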
Using the 2019 Public Use Microdata Sample (PUMS) dataset for Florida, the study trains DGMs to produce diverse and adaptable synthetic populations. The results, assessed by comparing marginal distributions and by t-SNE visualization, underscore the DGMs' efficacy in balancing privacy and data utility. The significance of this thesis lies in identifying the promise of DGMs for generating synthetic populations to advance health research.
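One simple way to compare marginal distributions of a categorical variable is the total variation distance between the real and synthetic category frequencies. The abstract does not name the metric used in the thesis, so this choice is an assumption; the sketch below is a minimal version, and the column values are hypothetical rather than actual PUMS codes.

```python
from collections import Counter

def marginal(values):
    """Relative frequency of each category, ignoring missing entries (None)."""
    observed = [v for v in values if v is not None]
    if not observed:
        return {}
    counts = Counter(observed)
    return {k: c / len(observed) for k, c in counts.items()}

def total_variation(p, q):
    """Total variation distance between two categorical distributions (0 = identical)."""
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)

# Hypothetical tenure column from a real sample (with one missing value)
# and from a synthetic sample.
real = ["own", "rent", None, "rent", "own"]
synthetic = ["own", "rent", "rent", "rent"]

tvd = total_variation(marginal(real), marginal(synthetic))
```

A distance near 0 indicates that the synthetic marginal closely matches the real one; repeating the comparison per variable gives a quick utility check, although it says nothing about joint structure, which is where a t-SNE plot can help.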