Vidhya Navalpakkam

Vidhya Navalpakkam

I am currently a Principal Scientist at Google Research. I lead an interdisciplinary team at the intersection of Machine learning, Neuroscience, Cognitive Psychology and Vision. My interests are in modeling user attention and behavior across multimodal interfaces, for improved usability and accessibility of Google products. I am also interested in applications of attention for healthcare (e.g., smartphone-based screening for health conditions).
Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Preview abstract Recent Text-to-Image (T2I) generation models such as Stable Diffusion and Imagen have made significant progress in generating high-resolution images based on text descriptions. However, many generated images still suffer from issues such as artifacts/implausibility, misalignment with text descriptions, and low aesthetic quality. Inspired by the success of Reinforcement Learning with Human Feedback (RLHF) for large language models, prior work collected human-provided scores as feedback on generated images and trained a reward model to improve the T2I generation. In this paper, we enrich the feedback signal by (i) marking image regions that are implausible or misaligned with the text, and (ii) annotating which keywords in the text prompt are not represented in the image. We collect such rich human feedback on 18K generated images and train a multimodal transformer to predict these rich feedback automatically. We show that the predicted rich human feedback can be leveraged to improve image generation, for example, by selecting high-quality training data to finetune and improve the generative models, or by creating masks with predicted heatmaps to inpaint the problematic regions. Notably, the improvements generalize to models (Muse) beyond those used to generate the images on which human feedback data were collected (Stable Diffusion variants). View details
    Preview abstract Progress in human behavior modeling involves understanding both implicit, early-stage perceptual behavior, such as human attention, and explicit, later-stage behavior, such as subjective preferences or likes. Yet most prior research has focused on modeling implicit and explicit human behavior in isolation; and often limited to a specific type of visual content. We propose UniAR – a unified model of human attention and preference behavior across diverse visual content. UniAR leverages a multimodal transformer to predict subjective feedback, such as satisfaction or aesthetic quality, along with the underlying human attention or interaction heatmaps and viewing order. We train UniAR on diverse public datasets spanning natural images, webpages, and graphic designs, and achieve SOTA performance on multiple benchmarks across various image domains and behavior modeling tasks. Potential applications include providing instant feedback on the effectiveness of UIs/visual content, and enabling designers and content-creation models to optimize their creation for human-centric improvements. View details
    Preview abstract We consider the task of producing heatmaps from users' aggregated data while protecting their privacy. We give a differentially private algorithm for this task and demonstrate its advantages over previous algorithms on several real-world datasets. Our core algorithmic primitive is a differentially private procedure that takes in a set of distributions and produces an output that is close in Earth Mover's Distance (EMD) to the average of the inputs. We prove theoretical bounds on the error of our algorithm under certain sparsity assumption and that these are essentially optimal. View details
    Digital biomarker of mental fatigue
    Vincent Wen-Sheng Tseng
    Venky Ramachandran
    Tanzeem Choudhury
    npj Digital Medicine, 4 (2021), pp. 1-5
    Preview abstract Mental fatigue is an important aspect of alertness and wellbeing. Existing fatigue tests are subjective and/or time-consuming. Here, we show that smartphone-based gaze is significantly impaired with mental fatigue, and tracks the onset and progression of fatigue. A simple model predicts mental fatigue reliably using just a few minutes of gaze data. These results suggest that smartphone-based gaze could provide a scalable, digital biomarker of mental fatigue. View details
    Preview abstract Eye tracking has been widely used for decades in vision research, language and usability. However, most prior research has focused on large desktop displays using specialized eye trackers that are expensive and cannot scale. Little is known about eye movement behavior on phones, despite their pervasiveness and large amount of time spent. We leverage machine learning to demonstrate accurate smartphone-based eye tracking without any additional hardware. We show that the accuracy of our method is comparable to state-of-the-art mobile eye trackers that are 100x more expensive. Using data from over 100 opted-in users, we replicate key findings from previous eye movement research on oculomotor tasks and saliency analyses during natural image viewing. In addition, we demonstrate the utility of smartphone-based gaze for detecting reading comprehension difficulty. Our results show the potential for scaling eye movement research by orders-of-magnitude to thousands of participants (with explicit consent), enabling advances in vision research, accessibility and healthcare. View details
    Preview abstract Recent research has demonstrated the ability to estimate user’s gaze on mobile devices, by performing inference from an image captured with the phone’s front-facing camera, and without requiring specialized hardware. Gaze estimation accuracy is known to improve with additional calibration data from the user. However, most existing methods require either significant number of calibration points or computationally intensive model fine-tuning that is practically infeasible on a mobile device. In this paper, we overcome limitations of prior work by proposing a novel few-shot personalization approach for 2D gaze estimation. Compared to the best calibration-free model [11], the proposed method yields substantial improvements in gaze prediction accuracy (24%) using only 3 calibration points in contrast to previous personalized models that offer less improvement while requiring more calibration points. The proposed model requires 20x fewer FLOPS than the state-of-the-art personalized model [11] and can be run entirely on-device and in real-time, thereby unlocking a variety of important applications like accessibility, gaming and human-computer interaction. View details
    Towards better measurement of attention and satisfaction in mobile search
    Dmitry Lagun
    Chih-Hung Hsieh
    SIGIR '14 Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval (2014), pp. 113-122
    Preview abstract Web Search has seen two big changes recently: rapid growth in mobile search traffic, and an increasing trend towards providing answer-like results for relatively simple information needs (e.g., [weather today]). Such results display the answer or relevant information on the search page itself without requiring a user to click. While clicks on organic search results have been used extensively to infer result relevance and search satisfaction, clicks on answer-like results are often rare (or meaningless), making it challenging to evaluate answer quality. Together, these call for better measurement and understanding of search satisfaction on mobile devices. In this paper, we studied whether tracking the browser viewport (visible portion of a web page) on mobile phones could enable accurate measurement of user attention at scale, and provide good measurement of search satisfaction in the absence of clicks. Focusing on answer-like results in web search, we designed a lab study to systematically vary answer presence and relevance (to the user's information need), obtained satisfaction ratings from users, and simultaneously recorded eye gaze and viewport data as users performed search tasks. Using this ground truth, we identified increased scrolling past answer and increased time below answer as clear, measurable signals of user dissatisfaction with answers. While the viewport may contain three to four results at any given time, we found strong correlations between gaze duration and viewport duration on a per result basis, and that the average user attention is focused on the top half of the phone screen, suggesting that we may be able to scalably and reliably identify which specific result the user is looking at, from viewport data alone. View details
    Measurement and modeling of eye-mouse behavior
    LaDawn Jentzsch
    Rory Sayres
    Sujith Ravi
    Alex J. Smola
    Proceedings of the 22nd International World Wide Web Conference (2013)
    Preview