The importance of social determinants of health in shaping equitable public health policies is increasingly recognized. Emerging data sources, such as mobility and social media data, are becoming key to public health models. However, privacy concerns often limit access to these sensitive data, as even anonymized datasets are vulnerable to deductive disclosure. A promising solution to this challenge is the use of synthetic populations, which not only safeguard privacy but also enable the exploration of various "what-if" scenarios. Most existing studies on synthetic populations assume access to complete datasets for training. However, in many public health applications, such as those built on census data, this assumption is unrealistic because responses are often incomplete, especially for sensitive questions about wealth or education. This paper introduces a novel approach for training Variational AutoEncoders (VAEs) with incomplete data, without resorting to missing value imputation. Instead, the VAEs are trained solely on the observed data. Using the 2019 PUMS dataset for Florida, we successfully train VAEs to generate diverse and flexible synthetic populations. Comparisons of marginal distributions and a t-SNE analysis highlight the effectiveness of this method in addressing missing data challenges. This work demonstrates the potential of VAEs in generating synthetic populations for health research, even when complete datasets are unavailable, thereby offering a robust solution to advance public health studies while preserving data privacy.
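The abstract does not spell out how training is restricted to the observed data. One common way to do this, shown here only as an illustrative sketch and not as the authors' method, is to mask the reconstruction loss so that missing attributes contribute nothing to it. The function name `masked_cross_entropy`, the example record, and the choice of a categorical cross-entropy are all assumptions; the KL term of the VAE objective is omitted.

```python
import math

def masked_cross_entropy(probs, targets, observed):
    """Mean negative log-likelihood over observed attributes only.

    probs:    per-attribute predicted category distributions (list of lists)
    targets:  per-attribute true category index, or None where missing
    observed: per-attribute booleans, True where the value was reported
    """
    total, count = 0.0, 0
    for p, t, seen in zip(probs, targets, observed):
        if not seen:
            continue  # missing attribute: excluded from the loss entirely
        total += -math.log(max(p[t], 1e-12))  # clamp to avoid log(0)
        count += 1
    return total / count if count else 0.0

# Hypothetical record with three categorical attributes; the second is missing.
probs = [[0.5, 0.5], [0.1, 0.9], [0.25, 0.25, 0.5]]
targets = [0, None, 2]
observed = [True, False, True]
loss = masked_cross_entropy(probs, targets, observed)
```

Because the missing attribute is skipped, the decoder's prediction for it has no effect on the loss, so no imputed value is ever treated as ground truth.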
Details
Title
Training Variational Autoencoders for Population Synthesis in Public Health with Missing Data
Publication Details
IEEE International Conference on Big Data, pp.4969-4973
Resource Type
Conference proceeding
Conference
IEEE International Conference on Big Data (BigData) (Washington, DC, USA, 12/15/2024–12/18/2024)