Mimi Sun
Authored Publications
Sort By
Community search signatures as foundation features for human-centered geospatial modeling
Chaitanya Kamath
Mohit Agarwal
David Schottlander
Shailesh Bavadekar
Niv Efron
Shravya Shetty
ICML 2024 Workshop on Data-Centric Machine Learning Research
Preview abstract
Aggregated relative search frequencies offer a unique composite signal reflecting people's habits, concerns, interests, intents, and general information needs, which are not found in other readily available datasets. Temporal search trends have been successfully used to perform nowcasting across a variety of domains such as infectious diseases, unemployment rates, and retail sales. However, most existing applications require curating specialized datasets of individual keywords, queries, or query clusters, and the search data need to be temporally aligned with the outcome variable of interest. We propose a novel approach for generating an aggregated and anonymized representation of search interest as foundation features at the community level for geospatial modeling. We benchmark these features using spatial datasets across multiple domains. In regions with a population greater than 3000 that cover over 95% of the contiguous US population, our models achieve an average R-squared score of 0.74 across 21 health variables, and 0.80 across 6 demographic and environmental variables. Our results demonstrate that these search features can be used for spatial predictions without strict temporal alignment, and that the resulting models outperform spatial interpolation and state of the art methods using satellite imagery features.
View details
General Geospatial Inference with a Population Dynamics Foundation Model
Chaitanya Kamath
Prithul Sarker
Joydeep Paul
Yael Mayer
Sheila de Guia
Jamie McPike
Adam Boulanger
David Schottlander
Yao Xiao
Manjit Chakravarthy Manukonda
Monica Bharel
Von Nguyen
Luke Barrington
Niv Efron
Krish Eswaran
Shravya Shetty
(2024) (to appear)
Preview abstract
Supporting the health and well-being of dynamic populations around the world requires governmental agencies, organizations, and researchers to understand and reason over complex relationships between human behavior and local contexts. This support includes identifying populations at elevated risk and gauging where to target limited aid resources. Traditional approaches to these classes of problems often entail developing manually curated, task-specific features and models to represent human behavior and the natural and built environment, which can be challenging to adapt to new, or even related tasks. To address this, we introduce the Population Dynamics Foundation Model (PDFM), which aims to capture the relationships between diverse data modalities and is applicable to a broad range of geospatial tasks. We first construct a geo-indexed dataset for postal codes and counties across the United States, capturing rich aggregated information on human behavior from maps, busyness, and aggregated search trends, and environmental factors such as weather and air quality. We then model this data and the complex relationships between locations using a graph neural network, producing embeddings that can be adapted to a wide range of downstream tasks using relatively simple models. We evaluate the effectiveness of our approach by benchmarking it on 27 downstream tasks spanning three distinct domains: health indicators, socioeconomic factors, and environmental measurements. The approach achieves state-of-the-art performance on geospatial interpolation across all tasks, surpassing existing satellite and geotagged image based location encoders. In addition, it achieves state-of-the-art performance in extrapolation and super-resolution for 25 of the 27 tasks. We also show that the PDFM can be combined with a state-of-the-art forecasting foundation model, TimesFM, to predict unemployment and poverty, achieving performance that surpasses fully supervised forecasting. The full set of embeddings and sample code are publicly available for researchers. In conclusion, we have demonstrated a general purpose approach to geospatial modeling tasks critical to understanding population dynamics by leveraging a rich set of complementary globally available datasets that can be readily adapted to previously unseen machine learning tasks.
View details
Dense Feature Memory Augmented Transformers for COVID-19 Vaccination Search Classification
Yi Tay
Chaitanya Kamath
Shailesh Bavadekar
Evgeniy Gabrilovich
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (2022)
Preview abstract
With the devastating outbreak of COVID-19, vaccines are one of the crucial lines of defense against mass infection in this global pandemic. Given the protection they provide, vaccines are becoming mandatory in certain social and professional settings. This paper presents a classification model for detecting COVID-19 vaccination related search queries, a machine learning model that is used to generate search insights for COVID-19 vaccinations. The proposed method combines and leverages advancements from modern state-of-the-art (SOTA) natural language understanding (NLU) techniques such as pretrained Transformers with traditional dense features. We propose a novel approach of considering dense features as memory tokens that the model can attend to. We show that this new modeling approach enables a significant improvement to the Vaccine Search Insights (VSI) task, improving a strong well-established gradient-boosting baseline by relative +15% improvement in F1 score and +14% in precision.
View details
Google COVID-19 Vaccination Search Insights: Anonymization Process Description
Adam Boulanger
Akim Kumok
Arti Patankar
Benjamin Miller
Chaitanya Kamath
Charlotte Stanton
Chris Scott
Damien Desfontaines
Evgeniy Gabrilovich
Gregory A. Wellenius
John S. Davis
Karen Lee Smith
Krishna Kumar Gadepalli
Mark Young
Shailesh Bavadekar
Tague Griffith
Yael Mayer
Arxiv.org (2021)
Preview abstract
This report describes the aggregation and anonymization process applied to the COVID-19 Vaccination Search Insights~\cite{vaccination}, a publicly available dataset showing aggregated and anonymized trends in Google searches related to COVID-19 vaccination. The applied anonymization techniques protect every user’s daily search activity related to COVID-19 vaccinations with $(\varepsilon, \delta)$-differential privacy for $\varepsilon = 2.19$ and $\delta = 10^{-5}$.
View details
Vaccine Search Patterns Provide Insights into Vaccination Intent
Sean Malahy
Keith Spangler
Jessica Leibler
Kevin J. Lane
Shailesh Bavadekar
Chaitanya Kamath
Akim Kumok
Yuantong Sun
Tague Griffith
Adam Boulanger
Mark Young
Charlotte Stanton
Yael Mayer
Karen Lee Smith
Kat Chou
Jonathan I. Levy
Adam A.Szpiro
Evgeniy Gabrilovich
Gregory A. Wellenius
arXiv (2021), TBD
Preview abstract
Despite ample supply of COVID-19 vaccines, the proportion of fully vaccinated individuals remains suboptimal across much of the US. Rapid vaccination of additional people will prevent new infections among both the unvaccinated and the vaccinated, thus saving lives. With the rapid rollout of vaccination efforts this year, the internet has become a dominant source of information about COVID-19 vaccines, their safety and efficacy, and their availability. We sought to evaluate whether trends in internet searches related to COVID-19 vaccination - as reflected by Google's Vaccine Search Insights (VSI) index - could be used as a marker of population-level interest in receiving a vaccination. We found that between January and August of 2021: 1) Google's weekly VSI index was associated with the number of new vaccinations administered in the subsequent three weeks, and 2) the average VSI index in earlier months was strongly correlated (up to r = 0.89) with vaccination rates many months later. Given these results, we illustrate an approach by which data on search interest may be combined with other available data to inform local public health outreach and vaccination efforts. These results suggest that the VSI index may be useful as a leading indicator of population-level interest in or intent to obtain a COVID-19 vaccine, especially early in the vaccine deployment efforts. These results may be relevant to current efforts to administer COVID-19 vaccines to unvaccinated individuals, to newly eligible children, and to those eligible to receive a booster shot. More broadly, these results highlight the opportunities for anonymized and aggregated internet search data, available in near real-time, to inform the response to public health emergencies.
View details
A prospective evaluation of AI-augmented epidemiology to forecast COVID-19 in the USA and Japan
Joel Shor
Arkady Epshteyn
Ashwin Sura Ravi
Beth Luan
Chun-Liang Li
Daisuke Yoneoka
Dario Sava
Hiroaki Miyata
Hiroki Kayama
Isaac Jones
Joe Mckenna
Johan Euphrosine
Kris Popendorf
Nate Yoder
Shashank Singh
Shuhei Nomura
Thomas Tsai
npj Digital Medicine (2021)
Preview abstract
The COVID-19 pandemic has highlighted the global need for reliable models of disease spread. We evaluate an AI-improved forecasting approach that provides daily predictions of the expected number of confirmed COVID-19 deaths, cases and hospitalizations during the following 28 days. We present an international, prospective evaluation of model performance across all states and counties in the USA and prefectures in Japan. National mean absolute percentage error (MAPE) for predicting COVID-19 associated deaths before and after prospective deployment remained consistently <3% (US) and <10% (Japan). Average statewide (US) and prefecture wide (Japan) MAPE was 6% and 20% respectively (14% when looking at prefectures with more than 10 deaths).We show our model performs well even during periods of considerable change in population behavior, and that it is robust to demographic differences across different geographic locations.We further demonstrate the model provides meaningful explanatory insights, finding that the model appropriately responds to local and national policy interventions. Our model enables counterfactual simulations, which indicate continuing NPIs alongside vaccinations is essential for more rapidly recovering from the pandemic, delaying the application of interventions has a detrimental effect, and allow exploration of the consequences of different vaccination strategies. The COVID-19 pandemic remains a global emergency. In the face of substantial challenges ahead, the approach presented here has the potential to inform critical decisions.
View details
Google COVID-19 Search Trends Symptoms Dataset: Anonymization Process Description
Akim Kumok
Chaitanya Kamath
Charlotte Stanton
Damien Desfontaines
Evgeniy Gabrilovich
Gerardo Flores
Gregory Alexander Wellenius
Ilya Eckstein
John S. Davis
Katie Everett
Krishna Kumar Gadepalli
Rayman Huang
Shailesh Bavadekar
Thomas Ludwig Roessler
Venky Ramachandran
Yael Mayer
Arxiv.org, N/A (2020)
Preview abstract
This report describes the aggregation and anonymization process applied to the initial version of COVID-19 Search Trends symptoms dataset, a publicly available dataset that shows aggregated, anonymized trends in Google searches for symptoms (and some related topics). The anonymization process is designed to protect the daily search activity of every user with \varepsilon-differential privacy for \varepsilon = 1.68.
View details
Scalable and accurate deep learning for electronic health records
Alvin Rishi Rajkomar
Eyal Oren
Nissan Hajaj
Mila Hardt
Peter J. Liu
Xiaobing Liu
Jake Marcus
Patrik Per Sundberg
Kun Zhang
Yi Zhang
Gerardo Flores
Gavin Duggan
Jamie Irvine
Kurt Litsch
Alex Mossin
Justin Jesada Tansuwan
De Wang
Dana Ludwig
Samuel Volchenboum
Kat Chou
Michael Pearson
Srinivasan Madabushi
Nigam Shah
Atul Butte
npj Digital Medicine (2018)
Preview abstract
Predictive modeling with electronic health record (EHR) data is anticipated to drive personalized medicine and improve healthcare quality. Constructing predictive statistical models typically requires extraction of curated predictor variables from normalized EHR data, a labor-intensive process that discards the vast majority of information in each patient’s record. We propose a representation of patients’ entire raw EHR records based on the Fast Healthcare Interoperability Resources (FHIR) format. We demonstrate that deep learning methods using this representation are capable of accurately predicting multiple medical events from multiple centers without site-specific data harmonization. We validated our approach using de-identified EHR data from two U.S. academic medical centers with 216,221 adult patients hospitalized for at least 24 hours. In the sequential format we propose, this volume of EHR data unrolled into a total of 46,864,534,945 data points, including clinical notes. Deep learning models achieved high accuracy for tasks such as predicting: in-hospital mortality (AUROC across sites 0.93-0.94), 30-day unplanned readmission (AUROC 0.75-0.76), prolonged length of stay (AUROC 0.85-0.86), and all of a patient’s final discharge diagnoses (frequency-weighted AUROC 0.90). These models outperformed state-of-the-art traditional predictive models in all cases. We also present a case-study of a neural-network attribution system, which illustrates how clinicians can gain some transparency into the predictions. We believe that this approach can be used to create accurate and scalable predictions for a variety of clinical scenarios, complete with explanations that directly highlight evidence in the patient’s chart.
View details