Saptarashmi Bandyopadhyay

I'm Saptarashmi, a Student Researcher at Google AI AR in Summer 2024, Fall 2024, and Spring 2025, collaborating with Google DeepMind, and a fifth-year PhD candidate in Computer Science at the University of Maryland, College Park (UMD), where I started my PhD in Fall 2020. During my PhD, I was also an AI Resident at Google X, Alphabet's Moonshot Factory, in Summer 2023.

My PhD thesis topic is "Multi-Agent Autonomous Decision Making in Artificial Learning". It focuses on deep multi-agent learning paradigms such as multi-agent reinforcement learning (RL), multi-agent imitation learning (IL), multi-agent meta-learning, and multi-agent self-supervised learning (SSL), applied to real-world problems such as multimodal agents, embodied agents, climate conservation, economic applications like autonomous orchestration in supply chains, stock portfolio optimization, recommender systems, AI alignment, game theory, multilingual AI agents (especially for low-resource languages), and AI privacy, among others. Before joining as a Student Researcher intern, I was the lead PhD research assistant on a DoD project on multi-agent explainable AI, correcting the feedback of AI agents. I have also developed libraries such as JAXMARL, in collaboration with researchers at Google DeepMind, Oxford University, and Waymo, that speed up multi-agent evolutionary training in simulation environments by up to 12,500x with JAX; worked on interactive ML for education with researchers at Carnegie Mellon University; and worked on asset representations in multi-agent RL and on predicting vulnerable hotspots of deforestation in Indonesia at UMD, among other research projects.
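The JAX-based speedups mentioned above come from expressing environment dynamics as pure functions that can be vectorized and JIT-compiled so that thousands of environments step in parallel. The following is a minimal illustrative sketch of that pattern; the toy environment, shapes, and batch size are my own assumptions for illustration, not JAXMARL code.

import jax
import jax.numpy as jnp

# Toy multi-agent environment step (hypothetical): each agent moves by its
# action, and its reward is the negative distance to the origin.
def env_step(state, actions):
    next_state = state + actions                     # (n_agents, dim)
    rewards = -jnp.linalg.norm(next_state, axis=-1)  # (n_agents,)
    return next_state, rewards

# Vectorize the step over a batch of parallel environments and JIT-compile it;
# this batching-plus-compilation pattern is what drives JAX-style speedups.
batched_step = jax.jit(jax.vmap(env_step))

key = jax.random.PRNGKey(0)
states = jax.random.normal(key, (4096, 8, 2))   # 4096 envs, 8 agents, 2-D positions
actions = jnp.zeros((4096, 8, 2))
states, rewards = batched_step(states, actions)
print(rewards.shape)                            # (4096, 8)

Because env_step is a pure function, the same code runs unchanged on CPU, GPU, or TPU, and the whole rollout can sit inside a single compiled training loop.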

My papers have been published at AAAI, NeurIPS (machine learning), AAMAS (AI agents), EMNLP and ACL (NLP), and SPIE (medical computer vision). I'm also the Lead Organizer and Workshop Chair of the accepted AAAI 2025 Workshop on Multi-Agent AI in the Real World, to be held in Philadelphia, which I'm organizing with Aleksandra Faust at Google DeepMind US, Andrea Colaco at Google AI Augmented Reality, John Dickerson and Tom Goldstein at the University of Maryland, College Park, and other co-organizers from Carnegie Mellon University, ETH Zurich, the University of Maryland, College Park, Google DeepMind Robotics, Google DeepMind UK, Columbia University, and Arthur AI. (Details on our website: https://sites.google.com/corp/view/marw-ai-agents.)

If you are interested in AI agents, you may also be interested in the Multi-Agent AI Reading Group (https://go.umd.edu/marl), started at UMD in 2022, which has 1090 participants from six continents and features prominent research speakers and talk resources from industry and academia, including Turing Award laureates. Talk details and the Google Form to join the talks are on our website.

Further details can be found at https://sites.google.com/view/saptarashmi/about
Authored Publications
Multimodal AI agents are AI models that can interactively and cooperatively assist human users with day-to-day tasks. Augmented Reality (AR) head-worn devices can uniquely improve the user experience of solving procedural day-to-day tasks by providing egocentric multimodal (audio and video) observational capabilities to AI agents. Such AR capabilities can help AI agents see and listen to the actions users take, mirroring the multimodal capabilities of human users. Existing AI agents, whether Large Language Models (LLMs) or multimodal Vision-Language Models (VLMs), are reactive in nature: they cannot take an action without reading or listening to the human user's prompts. Proactivity of AI agents, on the other hand, can help the human user detect and correct mistakes in agent-observed tasks, encourage users when they do tasks correctly, or simply engage in conversation with the user, akin to a human teaching or assisting a user. Our proposed YET to Intervene (YETI) multimodal agent focuses on the research question of identifying circumstances that may require the agent to intervene proactively. This allows the agent to understand when it can intervene in a conversation with human users to help them correct mistakes on tasks, like cooking, using Augmented Reality. Our YETI agent learns scene-understanding signals based on interpretable notions of Structural Similarity (SSIM) on consecutive video frames. We also define an alignment signal by which the AI agent can learn to identify whether the video frames corresponding to the user's actions on the task are consistent with expected actions. These signals are used by our AI agent to determine when it should proactively intervene. We compare our results on the instances of proactive intervention in the HoloAssist multimodal benchmark for an expert agent guiding a user agent to complete procedural tasks.
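To make the consecutive-frame SSIM idea concrete, here is a minimal sketch of how frame-to-frame structural similarity could be thresholded into intervention cues. This is an illustrative assumption on my part, not the YETI implementation; the threshold value, function name, and synthetic frames are hypothetical.

import numpy as np
from skimage.metrics import structural_similarity as ssim

# Hypothetical threshold: a large drop in frame-to-frame similarity is treated
# as a scene-change cue that may warrant a proactive intervention.
SSIM_DROP_THRESHOLD = 0.6

def intervention_cues(frames):
    """Flag frame transitions whose SSIM falls below the threshold.

    `frames` is a sequence of grayscale frames (H, W) as uint8 arrays.
    """
    cues = []
    for prev, curr in zip(frames[:-1], frames[1:]):
        score = ssim(prev, curr, data_range=255)
        cues.append(bool(score < SSIM_DROP_THRESHOLD))
    return cues

# Example with synthetic frames: an abrupt change between frames 1 and 2.
frames = [np.zeros((64, 64), dtype=np.uint8),
          np.zeros((64, 64), dtype=np.uint8),
          np.full((64, 64), 200, dtype=np.uint8)]
print(intervention_cues(frames))  # [False, True]

In practice such a per-frame signal would be combined with the paper's alignment signal (comparing observed actions against expected task steps) before the agent decides to intervene.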