Diana Mincu

Diana Mincu

Diana is a research engineer in Google Research and a member of the Brain Health Research team, having previously worked in Google Health, DeepMind and Google Play. She's currently focusing on the inclusive mobile health technology space, but she is also an active collaborator on fairness projects. During her research-time at Google she has worked on Electronic Health Records problems and has published on model explainability within the healthcare domain. She holds a Masters in Artificial Intelligence and Machine Learning from the Polytechnic University of Bucharest, and has also worked on ML research in the security domain at BitDefender.
Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Preview abstract Prompting and in-context learning (ICL) have become efficient learning paradigms for large language models (LLMs). However, LLMs suffer from prompt brittleness and various bias factors in the prompt, including but not limited to the formatting, the choice verbalizers, and the ICL examples. To address this problem that results in unexpected performance degradation, calibration methods have been developed to mitigate the effects of these biases while recovering LLM performance. In this work, we first conduct a systematic analysis of the existing calibration methods, where we both provide a unified view and reveal the failure cases. Inspired by these analyses, we propose Batch Calibration (BC), a simple yet intuitive method that controls the contextual bias from the batched input, unifies various prior approaches, and effectively addresses the aforementioned issues. BC is zero-shot, inference-only, and incurs negligible additional costs. In the few-shot setup, we further extend BC to allow it to learn the contextual bias from labeled data. We validate the effectiveness of BC with PaLM 2-(S, M, L) and CLIP models and demonstrate state-of-the-art performance over previous calibration baselines across more than 10 natural language understanding and image classification tasks. View details
    Preview abstract Diagnosing and mitigating changes in model fairness under distribution shift is an important component of the safe deployment of machine learning in healthcare settings. Importantly, the success of any mitigation strategy strongly depends on the structure of the shift. Despite this, there has been little discussion of how to empirically assess the structure of a distribution shift that one is encountering in practice. In this work, we adopt a causal framing to motivate conditional independence tests as a key tool for characterizing distribution shifts. Using our approach in two medical applications, we show that this knowledge can help diagnose failures of fairness transfer, including cases where real-world shifts are more complex than is often assumed in the literature. Based on these results, we discuss potential remedies at each step of the machine learning pipeline. View details
    Preview abstract Machine learning (ML) approaches have demonstrated promising results in a wide range of healthcare applications. Data plays a crucial role in developing ML-based healthcare systems that directly affect people’s lives. Many of the ethical issues surrounding the use of ML in healthcare stem from structural inequalities underlying the way we collect, use, and handle data. Developing guidelines to improve documentation practices regarding the creation, use, and maintenance of ML healthcare datasets is therefore of critical importance. In this work, we introduce Healthsheet, a contextualized adaptation of the original datasheet questionnaire for health-specific applications. Through a series of semi-structured interviews, we adapt the datasheets for healthcare data documentation. As part of the Healthsheet development process and to understand the obstacles researchers face in creating datasheets, we worked with three publicly-available healthcare datasets as our case studies, each with different types of structured data: Electronic health Records (EHR), clinical trial study data, and smartphone-based performance outcome measures. Our findings from the interviewee study and case studies show 1) that datasheets should be contextualized for healthcare, 2) that despite incentives to adopt accountability practices such as datasheets, there is a lack of consistency in the broader use of these practices 3) how the ML for health community views datasheets and particularly Healthsheets as diagnostic tool to surface the limitations and strength of datasets and 4) the relative importance of different fields in the datasheet to healthcare concerns. View details
    Preview abstract Interpretability techniques aim to provide the rationale behind a model's decision, typically by explaining either an individual prediction (local explanation, e.g. `why is this patient diagnosed with this condition') or a class of predictions (global explanation, e.g. `why is this set of patients diagnosed with this condition in general'). While there are many methods focused on either one, few frameworks can provide both local and global explanations in a consistent manner. In this work, we combine two powerful existing techniques, one local (Integrated Gradients, IG) and one global (Testing with Concept Activation Vectors), to provide local and global concept-based explanations. We first sanity check our idea using two synthetic datasets with a known ground truth, and further demonstrate with a benchmark natural image dataset. We test our method with various concepts, target classes, model architectures and IG parameters (e.g. baselines). We show that our method improves global explanations over vanilla TCAV when compared to ground truth, and provides useful local insights. Finally, a user study demonstrates the usefulness of the method compared to no or global explanations only. We hope our work provides a step towards building bridges between many existing local and global methods to get the best of both worlds. View details
    Multi-task prediction of organ dysfunction in the ICU using sequential sub-network routing
    Eric Loreaux
    Anne Mottram
    Hugh Montgomery
    Ali Connell
    Nenad Tomašev
    Martin Seneviratne
    Journal of the American Medical Informatics Association (JAMIA) (2021)
    Preview abstract Introduction: Multi-task learning (MTL) using electronic health records (EHRs) allows concurrent prediction of multiple endpoints. MTL has shown promise in improving model performance and training efficiency; however it often suffers from negative transfer - impaired learning if tasks are not appropriately selected. We introduce a sequential sub-network routing (SeqSNR) architecture which uses soft parameter sharing to find related tasks and encourage cross-learning between them. Materials and Methods: Using the Medical Information Mart for Intensive Care (MIMIC-III) dataset, we train deep neural network models to predict the onset of six endpoints including specific organ dysfunctions and general clinical outcomes: acute kidney injury, continuous renal replacement therapy, mechanical ventilation, vasoactive medications, mortality, and length of stay. We compare single task models (ST) with naive multi-task (shared bottom, SB) and SeqSNR in terms of discriminative performance and label efficiency. Results: SeqSNR showed a modest yet statistically significant performance boost across at least 4 out of 6 tasks compared to SB and ST. When the size of the training dataset was reduced for a given task, SeqSNR outperformed ST for all cases showing an average AU PRC boost of 2.1%, 2.9%, and 2.1% for tasks using 1%, 5%, and 10% of labels respectively. Discussion and Conclusion: Multi-task learning has variable performance compared to single-task learning, with the possibility for negative transfer. The SeqSNR architecture outperforms SB and ST in discriminative performance and shows superior performance in terms of label efficiency. SeqSNR should be considered for multi-task predictive modeling using EHR data. View details
    Concept-based model explanations for Electronic Health Records
    Eric Loreaux
    Shaobo Hou
    Sebastien Baur
    Martin G Seneviratne
    Anne Mottram
    Nenad Tomasev
    Association for Computing Machinery, New York, NY, USA (2021), 36–46
    Preview abstract Recurrent Neural Networks (RNNs) are often used for sequential modeling of adverse outcomes in electronic health records (EHRs) due to their ability to encode past clinical states. These deep, recurrent architectures have displayed increased performance compared to other modeling approaches in a number of tasks, fueling the interest in deploying deep models in clinical settings. One of the key elements in ensuring safe model deployment and building user trust is model explainability. Testing with Concept Activation Vectors (TCAV) has recently been introduced as a way of providing human-understandable explanations by comparing high-level concepts to the network's gradients. While the technique has shown promising results in real-world imaging applications, it has not been applied to structured temporal inputs. To enable an application of TCAV to sequential predictions in the EHR, we propose an extension of the method to time series data. We evaluate the proposed approach on an open EHR benchmark from the intensive care unit, as well as synthetic data where we are able to better isolate individual effects. View details
    Underspecification Presents Challenges for Credibility in Modern Machine Learning
    Dan Moldovan
    Ben Adlam
    Babak Alipanahi
    Alex Beutel
    Christina Chen
    Jon Deaton
    Shaobo Hou
    Neil Houlsby
    Ghassen Jerfel
    Yian Ma
    Akinori Mitani
    Andrea Montanari
    Christopher Nielsen
    Thomas Osborne
    Rajiv Raman
    Kim Ramasamy
    Martin Gamunu Seneviratne
    Shannon Sequeira
    Harini Suresh
    Victor Veitch
    Journal of Machine Learning Research (2020)
    Preview abstract ML models often exhibit unexpectedly poor behavior when they are deployed in real-world domains. We identify underspecification as a key reason for these failures. An ML pipeline is underspecified when it can return many predictors with equivalently strong held-out performance in the training domain. Underspecification is common in modern ML pipelines, such as those based on deep learning. Predictors returned by underspecified pipelines are often treated as equivalent based on their training domain performance, but we show here that such predictors can behave very differently in deployment domains. This ambiguity can lead to instability and poor model behavior in practice, and is a distinct failure mode from previously identified issues arising from structural mismatch between training and deployment domains. We show that this problem appears in a wide variety of practical ML pipelines, using examples from computer vision, medical imaging, natural language processing, clinical risk prediction based on electronic health records, and medical genomics. Our results show the need to explicitly account for underspecification in modeling pipelines that are intended for real-world deployment in any domain. View details