Mimi Sun

Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Community search signatures as foundation features for human-centered geospatial modeling
    Chaitanya Kamath
    Mohit Agarwal
    David Schottlander
    Shailesh Bavadekar
    Niv Efron
    Shravya Shetty
    ICML 2024 Workshop on Data-Centric Machine Learning Research
    Preview abstract Aggregated relative search frequencies offer a unique composite signal reflecting people's habits, concerns, interests, intents, and general information needs, which are not found in other readily available datasets. Temporal search trends have been successfully used to perform nowcasting across a variety of domains such as infectious diseases, unemployment rates, and retail sales. However, most existing applications require curating specialized datasets of individual keywords, queries, or query clusters, and the search data need to be temporally aligned with the outcome variable of interest. We propose a novel approach for generating an aggregated and anonymized representation of search interest as foundation features at the community level for geospatial modeling. We benchmark these features using spatial datasets across multiple domains. In regions with a population greater than 3000 that cover over 95% of the contiguous US population, our models achieve an average R-squared score of 0.74 across 21 health variables, and 0.80 across 6 demographic and environmental variables. Our results demonstrate that these search features can be used for spatial predictions without strict temporal alignment, and that the resulting models outperform spatial interpolation and state of the art methods using satellite imagery features. View details
    General Geospatial Inference with a Population Dynamics Foundation Model
    Chaitanya Kamath
    Prithul Sarker
    Joydeep Paul
    Yael Mayer
    Sheila de Guia
    Jamie McPike
    Adam Boulanger
    David Schottlander
    Yao Xiao
    Manjit Chakravarthy Manukonda
    Monica Bharel
    Von Nguyen
    Luke Barrington
    Niv Efron
    Krish Eswaran
    Shravya Shetty
    (2024) (to appear)
    Preview abstract Supporting the health and well-being of dynamic populations around the world requires governmental agencies, organizations, and researchers to understand and reason over complex relationships between human behavior and local contexts. This support includes identifying populations at elevated risk and gauging where to target limited aid resources. Traditional approaches to these classes of problems often entail developing manually curated, task-specific features and models to represent human behavior and the natural and built environment, which can be challenging to adapt to new, or even related tasks. To address this, we introduce the Population Dynamics Foundation Model (PDFM), which aims to capture the relationships between diverse data modalities and is applicable to a broad range of geospatial tasks. We first construct a geo-indexed dataset for postal codes and counties across the United States, capturing rich aggregated information on human behavior from maps, busyness, and aggregated search trends, and environmental factors such as weather and air quality. We then model this data and the complex relationships between locations using a graph neural network, producing embeddings that can be adapted to a wide range of downstream tasks using relatively simple models. We evaluate the effectiveness of our approach by benchmarking it on 27 downstream tasks spanning three distinct domains: health indicators, socioeconomic factors, and environmental measurements. The approach achieves state-of-the-art performance on geospatial interpolation across all tasks, surpassing existing satellite and geotagged image based location encoders. In addition, it achieves state-of-the-art performance in extrapolation and super-resolution for 25 of the 27 tasks. We also show that the PDFM can be combined with a state-of-the-art forecasting foundation model, TimesFM, to predict unemployment and poverty, achieving performance that surpasses fully supervised forecasting. The full set of embeddings and sample code are publicly available for researchers. In conclusion, we have demonstrated a general purpose approach to geospatial modeling tasks critical to understanding population dynamics by leveraging a rich set of complementary globally available datasets that can be readily adapted to previously unseen machine learning tasks. View details
    Dense Feature Memory Augmented Transformers for COVID-19 Vaccination Search Classification
    Yi Tay
    Chaitanya Kamath
    Shailesh Bavadekar
    Evgeniy Gabrilovich
    Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (2022)
    Preview abstract With the devastating outbreak of COVID-19, vaccines are one of the crucial lines of defense against mass infection in this global pandemic. Given the protection they provide, vaccines are becoming mandatory in certain social and professional settings. This paper presents a classification model for detecting COVID-19 vaccination related search queries, a machine learning model that is used to generate search insights for COVID-19 vaccinations. The proposed method combines and leverages advancements from modern state-of-the-art (SOTA) natural language understanding (NLU) techniques such as pretrained Transformers with traditional dense features. We propose a novel approach of considering dense features as memory tokens that the model can attend to. We show that this new modeling approach enables a significant improvement to the Vaccine Search Insights (VSI) task, improving a strong well-established gradient-boosting baseline by relative +15% improvement in F1 score and +14% in precision. View details
    Vaccine Search Patterns Provide Insights into Vaccination Intent
    Sean Malahy
    Keith Spangler
    Jessica Leibler
    Kevin J. Lane
    Shailesh Bavadekar
    Chaitanya Kamath
    Akim Kumok
    Yuantong Sun
    Tague Griffith
    Adam Boulanger
    Mark Young
    Charlotte Stanton
    Yael Mayer
    Karen Lee Smith
    Kat Chou
    Jonathan I. Levy
    Adam A.Szpiro
    Evgeniy Gabrilovich
    Gregory A. Wellenius
    arXiv (2021), TBD
    Preview abstract Despite ample supply of COVID-19 vaccines, the proportion of fully vaccinated individuals remains suboptimal across much of the US. Rapid vaccination of additional people will prevent new infections among both the unvaccinated and the vaccinated, thus saving lives. With the rapid rollout of vaccination efforts this year, the internet has become a dominant source of information about COVID-19 vaccines, their safety and efficacy, and their availability. We sought to evaluate whether trends in internet searches related to COVID-19 vaccination - as reflected by Google's Vaccine Search Insights (VSI) index - could be used as a marker of population-level interest in receiving a vaccination. We found that between January and August of 2021: 1) Google's weekly VSI index was associated with the number of new vaccinations administered in the subsequent three weeks, and 2) the average VSI index in earlier months was strongly correlated (up to r = 0.89) with vaccination rates many months later. Given these results, we illustrate an approach by which data on search interest may be combined with other available data to inform local public health outreach and vaccination efforts. These results suggest that the VSI index may be useful as a leading indicator of population-level interest in or intent to obtain a COVID-19 vaccine, especially early in the vaccine deployment efforts. These results may be relevant to current efforts to administer COVID-19 vaccines to unvaccinated individuals, to newly eligible children, and to those eligible to receive a booster shot. More broadly, these results highlight the opportunities for anonymized and aggregated internet search data, available in near real-time, to inform the response to public health emergencies. View details
    A prospective evaluation of AI-augmented epidemiology to forecast COVID-19 in the USA and Japan
    Joel Shor
    Arkady Epshteyn
    Ashwin Sura Ravi
    Beth Luan
    Chun-Liang Li
    Daisuke Yoneoka
    Dario Sava
    Hiroaki Miyata
    Hiroki Kayama
    Isaac Jones
    Joe Mckenna
    Johan Euphrosine
    Kris Popendorf
    Nate Yoder
    Shashank Singh
    Shuhei Nomura
    Thomas Tsai
    npj Digital Medicine (2021)
    Preview abstract The COVID-19 pandemic has highlighted the global need for reliable models of disease spread. We evaluate an AI-improved forecasting approach that provides daily predictions of the expected number of confirmed COVID-19 deaths, cases and hospitalizations during the following 28 days. We present an international, prospective evaluation of model performance across all states and counties in the USA and prefectures in Japan. National mean absolute percentage error (MAPE) for predicting COVID-19 associated deaths before and after prospective deployment remained consistently <3% (US) and <10% (Japan). Average statewide (US) and prefecture wide (Japan) MAPE was 6% and 20% respectively (14% when looking at prefectures with more than 10 deaths).We show our model performs well even during periods of considerable change in population behavior, and that it is robust to demographic differences across different geographic locations.We further demonstrate the model provides meaningful explanatory insights, finding that the model appropriately responds to local and national policy interventions. Our model enables counterfactual simulations, which indicate continuing NPIs alongside vaccinations is essential for more rapidly recovering from the pandemic, delaying the application of interventions has a detrimental effect, and allow exploration of the consequences of different vaccination strategies. The COVID-19 pandemic remains a global emergency. In the face of substantial challenges ahead, the approach presented here has the potential to inform critical decisions. View details
    Google COVID-19 Vaccination Search Insights: Anonymization Process Description
    Adam Boulanger
    Akim Kumok
    Arti Patankar
    Benjamin Miller
    Chaitanya Kamath
    Charlotte Stanton
    Chris Scott
    Damien Desfontaines
    Evgeniy Gabrilovich
    Gregory A. Wellenius
    John S. Davis
    Karen Lee Smith
    Krishna Kumar Gadepalli
    Mark Young
    Shailesh Bavadekar
    Tague Griffith
    Yael Mayer
    Arxiv.org (2021)
    Preview abstract This report describes the aggregation and anonymization process applied to the COVID-19 Vaccination Search Insights~\cite{vaccination}, a publicly available dataset showing aggregated and anonymized trends in Google searches related to COVID-19 vaccination. The applied anonymization techniques protect every user’s daily search activity related to COVID-19 vaccinations with $(\varepsilon, \delta)$-differential privacy for $\varepsilon = 2.19$ and $\delta = 10^{-5}$. View details
    Google COVID-19 Search Trends Symptoms Dataset: Anonymization Process Description
    Akim Kumok
    Chaitanya Kamath
    Charlotte Stanton
    Damien Desfontaines
    Evgeniy Gabrilovich
    Gerardo Flores
    Gregory Alexander Wellenius
    Ilya Eckstein
    John S. Davis
    Katie Everett
    Krishna Kumar Gadepalli
    Rayman Huang
    Shailesh Bavadekar
    Thomas Ludwig Roessler
    Venky Ramachandran
    Yael Mayer
    Arxiv.org, N/A (2020)
    Preview abstract This report describes the aggregation and anonymization process applied to the initial version of COVID-19 Search Trends symptoms dataset, a publicly available dataset that shows aggregated, anonymized trends in Google searches for symptoms (and some related topics). The anonymization process is designed to protect the daily search activity of every user with \varepsilon-differential privacy for \varepsilon = 1.68. View details
    Scalable and accurate deep learning for electronic health records
    Alvin Rishi Rajkomar
    Eyal Oren
    Nissan Hajaj
    Mila Hardt
    Peter J. Liu
    Xiaobing Liu
    Jake Marcus
    Patrik Per Sundberg
    Kun Zhang
    Yi Zhang
    Gerardo Flores
    Gavin Duggan
    Jamie Irvine
    Kurt Litsch
    Alex Mossin
    Justin Jesada Tansuwan
    De Wang
    Dana Ludwig
    Samuel Volchenboum
    Kat Chou
    Michael Pearson
    Srinivasan Madabushi
    Nigam Shah
    Atul Butte
    npj Digital Medicine (2018)
    Preview abstract Predictive modeling with electronic health record (EHR) data is anticipated to drive personalized medicine and improve healthcare quality. Constructing predictive statistical models typically requires extraction of curated predictor variables from normalized EHR data, a labor-intensive process that discards the vast majority of information in each patient’s record. We propose a representation of patients’ entire raw EHR records based on the Fast Healthcare Interoperability Resources (FHIR) format. We demonstrate that deep learning methods using this representation are capable of accurately predicting multiple medical events from multiple centers without site-specific data harmonization. We validated our approach using de-identified EHR data from two U.S. academic medical centers with 216,221 adult patients hospitalized for at least 24 hours. In the sequential format we propose, this volume of EHR data unrolled into a total of 46,864,534,945 data points, including clinical notes. Deep learning models achieved high accuracy for tasks such as predicting: in-hospital mortality (AUROC across sites 0.93-0.94), 30-day unplanned readmission (AUROC 0.75-0.76), prolonged length of stay (AUROC 0.85-0.86), and all of a patient’s final discharge diagnoses (frequency-weighted AUROC 0.90). These models outperformed state-of-the-art traditional predictive models in all cases. We also present a case-study of a neural-network attribution system, which illustrates how clinicians can gain some transparency into the predictions. We believe that this approach can be used to create accurate and scalable predictions for a variety of clinical scenarios, complete with explanations that directly highlight evidence in the patient’s chart. View details