List of works
Conference proceeding
Toward Human-Aligned LLM Reviews for Scientific Papers
Published 09/15/2025
Proceedings IEEE International Conference on e-Science: eScience 2025, 363 - 364
IEEE International Conference on e-Science: eScience 2025, 09/15/2025–09/18/2025, Chicago, Illinois, USA
The peer review process is strained by increasing submission volumes, reviewer fatigue, and inconsistent standards. While Large Language Models (LLMs) can aid in reviews, they are often overly optimistic and lack technical depth. We developed an innovative prompting strategy that, when applied to ChatGPT-4 on ICLR 2025 papers, reduced score inflation and generated reviews more closely aligned with human reviewer median scores.
Conference proceeding
Training Variational Autoencoders for Population Synthesis in Public Health with Missing Data
Published 12/15/2024
IEEE International Conference on Big Data, 4969 - 4973
IEEE International Conference on Big Data (BigData), 12/15/2024–12/18/2024, Washington, DC, USA
The importance of social determinants of health in shaping equitable public health policies is gaining increasing recognition. Emerging data sources, such as mobility and social media data, are becoming key to public health models. However, privacy concerns often limit access to these sensitive data, as even anonymized datasets are vulnerable to deductive disclosure. A promising solution to this challenge is the use of synthetic populations, which not only safeguard privacy but also enable the exploration of various "what-if" scenarios. Most existing studies on synthetic populations assume full access to complete datasets for training. However, in many public health applications, such as census data, this assumption is unrealistic due to incomplete responses, especially regarding sensitive questions like wealth or education. This paper introduces a novel approach for training Variational AutoEncoders (VAEs) with incomplete data, without resorting to missing value imputation. Instead, the VAEs are trained solely on the observed data. Using the 2019 PUMS dataset for Florida, we successfully train VAEs to generate diverse and flexible synthetic populations. By comparing marginal distributions and utilizing t-SNE for analysis, the results highlight the effectiveness of this method in addressing missing data challenges. This work demonstrates the potential of VAEs in generating synthetic populations for health research, even when complete datasets are unavailable, thereby offering a robust solution to advance public health studies while preserving data privacy.
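As an illustration of training "solely on the observed data", the sketch below computes a reconstruction loss that ignores missing entries instead of imputing them. It is a minimal sketch under stated assumptions, not the paper's implementation: the function name `masked_reconstruction_loss`, the squared-error term, and the per-record normalization are all hypothetical choices made for illustration, and a full VAE objective would add a KL-divergence term.

```python
import numpy as np

def masked_reconstruction_loss(x, x_hat, observed):
    """Mean squared reconstruction error over observed entries only.

    x        : (n_records, n_features) data; missing entries may be NaN
    x_hat    : (n_records, n_features) decoder output
    observed : (n_records, n_features) boolean mask, True where x is observed
    (All names here are hypothetical, chosen for this sketch.)
    """
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    observed = np.asarray(observed, dtype=bool)

    # Zero out missing inputs first so NaN placeholders cannot propagate.
    x_filled = np.where(observed, x, 0.0)
    # Missing entries contribute exactly zero to the loss.
    sq_err = np.where(observed, (x_filled - x_hat) ** 2, 0.0)

    # Normalize by the number of observed features in each record,
    # guarding against records with no observed features.
    n_observed = observed.sum(axis=1)
    per_record = sq_err.sum(axis=1) / np.maximum(n_observed, 1)
    return per_record.mean()
```

Because masked entries contribute nothing to the loss, they would also contribute no gradient in an automatic-differentiation framework, which is what lets training proceed without imputed values.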
Conference proceeding
Published 03/2020
2020 IEEE Aerospace Conference
IEEE Aerospace Conference, 03/07/2020–03/14/2020, Big Sky, MT, USA
This paper presents an integrated computational modelling framework, combining pedestrian dynamics and infection spread models, to analyse infectious disease spread during the different stages of air travel. While commercial air travel is central to the global mobility of goods and people, it has also been identified as a leading factor in the spread of several epidemic diseases, including influenza, SARS and Ebola. The mixing of susceptible and infectious individuals in high-density locations such as airports involves pedestrian movement, which needs to be taken into account in modelling studies of disease dynamics. We develop a Molecular Dynamics based social force modeling approach for pedestrian dynamics and combine it with a stochastic infection dynamics model to evaluate the spread of viral infectious diseases in airplanes and airports. We apply the multiscale model to various key components of air travel and suggest strategies to reduce the number of contacts and the spread of infectious diseases. We simulate pedestrian movement during boarding and deplaning of some typical commercial airplane models and the movement of people through security check areas. We find specific boarding strategies that reduce the number of contacts. Further, we find that smaller airplanes are more effective in reducing the number of contacts compared to larger airplanes. We propose queue configurations that reduce contacts between people and mitigate disease spread.
Conference proceeding
Published 01/31/2020
Cyberinfrastructure for Sustained Scientific Innovation (CSSI) PIs meeting, 02/13/2020–02/14/2020, Seattle, Washington
Pedestrian dynamics provides mathematical models that can accurately simulate the movement of individuals in a crowd. These models allow scientists to understand how different policies, such as boarding procedures on planes, can reduce or worsen the transmission of infections. This project seeks to develop novel software that will provide a variety of pedestrian dynamics models and infection spread models, as well as data, so that scientists can analyze the effect of different mechanisms on the spread of directly transmitted diseases in crowded areas. The initial focus of this project is on air travel. However, the software can be extended to a broader scope of applications in movement analysis and epidemiology, such as theme parks and sports venues.
Conference proceeding
Development of cybersecurity lab exercises for mobile health
Published 2020
Journal of The Colloquium for Information Systems Security Education, 7, 1
Colloquium for Information Systems Security Education
There is an emerging class of public health applications where non-health data from mobile apps, such as social media data, are used in subsequent models that identify threats to public health. On one hand, these models require accurate data, which would have an immense impact on public health. On the other hand, results from these models could compromise the privacy of an individual’s health status even without directly using health data. In addition, privacy could also be affected if systems hosting these models are compromised through security breaches. Students ought to be trained in evaluating the effectiveness of different protocols in ensuring privacy while providing useful data to the models.
Conference proceeding
Next-Generation High-Resolution Vector-Borne Disease Risk Assessment
Published 07/16/2019
2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM)
International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 08/27/2019–08/30/2019, Vancouver, BC, Canada
Vector-borne diseases cause more than 1 million deaths annually. Estimates of epidemic risk at high spatial resolutions can enable effective public health interventions. Our goal is to identify the risk of importation of such diseases into vulnerable cities at the granularity of neighborhoods. Conventional models cannot achieve such spatial resolution, especially in real-time. Besides, they lack real-time data on demographic heterogeneity, which is vital for accurate risk estimation. Social media, such as Twitter, promise data from which demographic and spatial information could be inferred in real-time. On the other hand, such data can be noisy and inaccurate. Our novel approach leverages Twitter data, using machine learning techniques at multiple spatial scales to overcome its limitations, to deliver results at the desired resolution. We validate our method against the Zika outbreak in Florida in 2016. Our main contribution lies in proposing a novel approach that uses machine learning on social media data to identify the risk of vector-borne disease importation at a sufficiently fine spatial resolution to permit effective intervention. This will lead to a new generation of epidemic risk assessment models, promising to transform public health by identifying specific locations for targeted intervention.
Conference proceeding
High-resolution home location prediction from tweets using deep learning with dynamic structure
Published 07/06/2019
2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 540 - 542
International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 08/27/2019–08/30/2019, Vancouver, BC, Canada
Timely and high-resolution estimates of the home locations of a sufficiently large subset of the population are critical for effective disaster response and public health intervention, but this is still an open problem. Conventional data sources, such as census and surveys, have a substantial time lag and cannot capture seasonal trends. Recently, social media data has been exploited to address this problem by leveraging its large user-base and real-time nature. However, inherent sparsity and noise, along with large estimation uncertainty in home locations, have limited their effectiveness. Consequently, much of previous research has aimed only at a coarse spatial resolution, with accuracy being limited for high-resolution methods. In this paper, we develop a deep-learning solution that uses a two-phase dynamic structure to deal with sparse and noisy social media data. In the first phase, high recall is achieved using a random forest, producing more balanced home location candidates. Then two deep neural networks are used to detect home locations with high accuracy. We obtained over 90% accuracy for large subsets on a commonly used dataset. Compared to other high-resolution methods, our approach yields up to 60% error reduction by reducing high-resolution home prediction error from over 21% to less than 8%. Systematic comparisons show that our method gives the highest accuracy both for the entire sample and for subsets. Evaluation on a real-world public health problem further validates the effectiveness of our approach.
Conference proceeding
Parallel Low Discrepancy Parameter Sweep for Public Health Policy
Published 01/01/2018
2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), 291 - 300
International Symposium on Cluster, Cloud and Grid Computing (CCGRID), 05/01/2018–05/04/2018, Washington, DC, USA
Numerical simulations are used to analyze the effectiveness of alternate public policy choices in limiting the spread of infections. In practice, it is usually not feasible to predict their precise impacts due to inherent uncertainties, especially at the early stages of an epidemic. One option is to parameterize the sources of uncertainty and carry out a parameter sweep to identify their robustness under a variety of possible scenarios. The Self Propelled Entity Dynamics (SPED) model has used this approach successfully to analyze the robustness of different airline boarding and deplaning procedures. However, the time taken by this approach is too large to answer questions raised during the course of a decision meeting. In this paper, we use a modified approach that pre-computes simulations of passenger movement, performing only the disease-specific analysis in real time. A novel contribution of this paper lies in using a low discrepancy sequence (LDS) in the parameter sweep, and demonstrating that it can lead to a reduction in analysis time by one to three orders of magnitude over the conventional lattice-based parameter sweep. However, its parallelization suffers from greater load imbalance than the conventional approach. We examine this and relate it to number-theoretic properties of the LDS. We then propose solutions to this problem. Our approach and analysis are applicable to other parameter sweep problems too. The primary contributions of this paper lie in the new approach of low discrepancy parameter sweep and in exploring solutions to challenges in its parallelization, evaluated in the context of an important public health application.
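To make the contrast with a lattice-based sweep concrete, the sketch below generates parameter-sweep points from a Halton sequence, one common low discrepancy sequence. This is a sketch under stated assumptions: the abstract does not say which LDS the paper uses, and the function names (`van_der_corput`, `halton_points`, `scale_to_bounds`) and the example bounds are hypothetical.

```python
def van_der_corput(index, base):
    """Radical inverse of a positive integer `index` in the given base."""
    value, denom = 0.0, 1.0
    while index > 0:
        index, digit = divmod(index, base)
        denom *= base
        value += digit / denom
    return value

def halton_points(n_points, bases=(2, 3)):
    """First `n_points` Halton points in the unit cube, one prime base per dimension."""
    return [
        tuple(van_der_corput(i, b) for b in bases)
        for i in range(1, n_points + 1)
    ]

def scale_to_bounds(point, bounds):
    """Map a unit-cube point onto parameter ranges given as (low, high) pairs."""
    return tuple(lo + u * (hi - lo) for u, (lo, hi) in zip(point, bounds))

# Hypothetical two-parameter sweep; the bounds are illustrative only.
bounds = [(0.0, 1.0), (10.0, 40.0)]
sweep = [scale_to_bounds(p, bounds) for p in halton_points(8)]
```

Unlike a lattice, whose size is fixed by the number of grid points per dimension, the sequence can be truncated after any number of points and still covers the parameter space fairly evenly, so a sweep can be stopped early or extended without starting over.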
Conference proceeding
Optimizing Massively Parallel Simulations of Infection Spread Through Air-Travel for Policy Analysis
Published 01/01/2016
2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), 136 - 145
IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), 05/16/2016–05/19/2016, Cartagena, Colombia
Project VIPRA [1] uses a new approach to modeling the potential spread of infections in airplanes, which involves tracking detailed movements of individual passengers. Inherent uncertainties are parameterized, and a parameter sweep carried out in this space to identify potential vulnerabilities. Simulation time is a major bottleneck for exploration of 'what-if' scenarios in a policy-making context under real-world time constraints. This paper identifies important bottlenecks to efficient computation: inefficiency in workflow, parallel IO, and load imbalance. Our solutions to the above problems include modifying the workflow, optimizing parallel IO, and a new scheme to predict computational time, which leads to efficient load balancing on fewer nodes than currently required. Our techniques reduce the computational time from several hours on 69,000 cores to around 20 minutes on around 39,000 cores on the Blue Waters machine for the same computation. The significance of this paper lies in identifying performance bottlenecks in this class of applications, which is crucial to public health, and presenting a solution that is effective in practice.
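The link between predicted computational time and load balancing can be sketched with a generic greedy heuristic: sort tasks by predicted time, longest first, and give each one to the currently least-loaded worker. This is an assumption-laden illustration, not the scheme from the paper, which the abstract does not describe; the function name `assign_by_predicted_time` and the sample times are hypothetical.

```python
import heapq

def assign_by_predicted_time(predicted_times, n_workers):
    """Greedy longest-first assignment of tasks to workers.

    predicted_times : predicted run time for each task (hypothetical inputs)
    n_workers       : number of workers to spread the tasks over
    Returns (assignment, loads): task indices per worker and each worker's total.
    """
    # Min-heap of (current load, worker id): the least-loaded worker is on top.
    heap = [(0.0, worker) for worker in range(n_workers)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_workers)]

    # Place the longest tasks first so short tasks can fill the remaining gaps.
    order = sorted(range(len(predicted_times)),
                   key=lambda task: predicted_times[task],
                   reverse=True)
    for task in order:
        load, worker = heapq.heappop(heap)
        assignment[worker].append(task)
        heapq.heappush(heap, (load + predicted_times[task], worker))

    loads = [sum(predicted_times[t] for t in tasks) for tasks in assignment]
    return assignment, loads

# Illustrative predicted times (arbitrary units) for six simulations.
assignment, loads = assign_by_predicted_time([5, 3, 3, 2, 2, 1], n_workers=2)
```

The quality of such a schedule depends directly on the accuracy of the predictions: when predicted times are close to actual times, workers finish at nearly the same moment and fewer of them sit idle.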
Conference proceeding
Efficient Barrier Implementation on the POWER8 Processor
Published 01/01/2015
2015 IEEE 22nd International Conference on High Performance Computing (HiPC), 165 - 173
International Conference on High Performance Computing, 12/16/2015–12/19/2015, Bengaluru, India
POWER8 is a new generation of POWER processor capable of 8-way simultaneous multi-threading per core. High-performance computing capabilities, such as a high degree of instruction-level and thread-level parallelism, are integrated with a deep memory hierarchy. Fine-grained parallel applications running on such architectures often rely on an efficient barrier implementation for synchronization. We present a variety of barrier implementations for a 4-chip POWER8 node. These implementations are optimized based on a careful study of the POWER8 memory subsystem. Our best implementation takes one to two orders of magnitude less time than the current MPI and POSIX threads based barrier implementations on POWER8. Apart from providing efficient barrier implementations, an additional significance of this work lies in demonstrating how certain features of the memory subsystem, such as NUMA access to remote L3 cache and the impact of prefetching, can be used to design efficient primitives on the POWER8.