Synthesizing public-health microdata with pervasive missingness: a deep generative approach

Nathan Blackthorn

Thesis

Open access

University of West Florida Libraries

Master of Science (MS), University of West Florida

2025

Abstract

There is a growing acknowledgment of the significance of social determinants of health in shaping fair public health policies. These determinants, reflected in emerging data streams like mobility and social media data, are increasingly integral to public health models. However, privacy concerns impede broad access to sensitive data, because even non-identifiable data are susceptible to deductive disclosure. To address this, synthetic populations trained on such data emerge as a privacy-conscious solution, offering the added benefit of exploring various "what-if" scenarios. However, traditional techniques for generating synthetic populations face limitations. This thesis explores the use of deep generative models (DGMs), specifically Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), to generate synthetic populations that address privacy concerns related to social determinants of health data. Developing DGMs for this purpose involves overcoming several challenges. First, the thesis proposes an approach that incorporates an autoencoder to handle both continuous and categorical variables, ensuring the synthetic data maintains consistency with categorical constraints. Second, it introduces a VAE-based training algorithm capable of learning directly from datasets with missing values. Existing state-of-the-art methods for handling missing data face two major limitations: (1) they typically require an initial imputation step followed by model training, which increases computational time, and (2) they often rely on deep learning techniques that assume fully observed training data, an assumption that does not hold in many real-world scenarios. To address these limitations, this work presents a novel technique that enables training of fully connected layers directly from incomplete datasets, even when no individual training sample is fully observed.
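
One common way to learn from incomplete records without an imputation step is to score a model's reconstruction only on the entries that were actually observed. The sketch below illustrates that general idea with NumPy; it is not the thesis's algorithm, whose details are not given in this abstract. The function name `masked_mse`, the boolean `observed` mask, and the toy arrays are all assumptions made for illustration.

```python
import numpy as np

def masked_mse(x, x_hat, observed):
    """Mean squared error computed over observed entries only.

    x        : (n, d) data array; missing entries may hold NaN
    x_hat    : (n, d) reconstruction produced by some model
    observed : (n, d) boolean mask, True where x is observed
    """
    n_obs = observed.sum()
    if n_obs == 0:
        raise ValueError("no observed entries to score")
    # Zero-fill missing entries so NaN never enters the arithmetic.
    x_filled = np.where(observed, x, 0.0)
    # Missing positions contribute exactly zero to the loss.
    sq_err = np.where(observed, (x_filled - x_hat) ** 2, 0.0)
    return sq_err.sum() / n_obs

# Toy data (hypothetical): no row is fully observed.
x = np.array([[1.0, np.nan, 3.0],
              [np.nan, 2.0, np.nan]])
observed = ~np.isnan(x)
x_hat = np.array([[1.5, 9.0, 3.0],
                  [7.0, 2.0, 4.0]])

loss = masked_mse(x, x_hat, observed)
```

Because the mask removes missing positions from the sum, the loss stays finite even though every row in the toy data has at least one missing value.
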
Utilizing the 2019 PUMS dataset for Florida, the study trains DGMs to produce diverse and adaptable synthetic populations. The results, assessed by comparing marginal distributions and employing t-SNE, underscore the DGMs' efficacy in balancing privacy and data utility. The significance of this thesis lies in identifying the promise of DGMs in generating synthetic populations to advance health research.
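
As an illustration of what comparing marginal distributions can look like for a single categorical variable, the sketch below computes the total variation distance between the empirical marginals of a real and a synthetic column. The choice of total variation distance is an assumption; the abstract does not say which comparison the thesis uses. The helper names and the toy columns are hypothetical.

```python
from collections import Counter

def marginal(values):
    """Empirical distribution of one categorical column, skipping None (missing)."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("column has no observed values")
    counts = Counter(observed)
    return {k: c / len(observed) for k, c in counts.items()}

def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 = identical)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Toy columns (hypothetical values, not PUMS data).
real = ["a", "a", "b", None, "c"]
synth = ["a", "b", "b", "c"]

tv = total_variation(marginal(real), marginal(synth))
```

A value near 0 means the synthetic column reproduces the real category frequencies closely; a value near 1 means the two columns share almost no probability mass.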

Files and links (1)

Synthesizing public-health microdata with pervasive missingness (pdf, 4.08 MB)

Preprint Thesis, Open Access

Details

Title: Synthesizing public-health microdata with pervasive missingness
Resource Type: Thesis
Contributors: Ashok Srinivasan (Committee Chair)
Andrew A Mahyari (Committee Member)
Brian Jalaian (Committee Member)
Publisher: University of West Florida Libraries
Format: pdf
Number of pages: 42
Copyright: Permission granted to the University of West Florida Libraries by the author to digitize and/or display this information for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires the permission of the copyright holder.
Identifiers: 99381469153406600
Academic Unit: Computer Science
Language: English
Awarding Institution: University of West Florida; Master of Science (MS)
Theses and Dissertations: Master of Science (MS), University of West Florida
