Junfeng He

Short bio

Junfeng He is a tech lead and research scientist at Google Research. He received his bachelor's and master's degrees from Tsinghua University and his PhD from Columbia University.
His full publication list can be found on his Google Scholar page.

Research areas

His major research areas include computer vision, machine learning, search/retrieval/ranking, HCI, and health. He has about 20 years of research experience in image retrieval and classification, image generation/editing and their detection, ranking, and large-scale (approximate) machine learning.

His current research interests include:
  • User foundation models that capture user feedback/behavior/interaction on visual content, and their application to improving content generation and design
  • Generative models, especially learning from human feedback, post-training improvement, evaluation, and behavior understanding for generative models
  • Computer vision with humans in the loop, and the intersection of computer vision and human vision/perception

    Recent research papers (*: co-first author; +: corresponding author)

    User foundation models, and their use for evaluating/optimizing generative models and content creation

  • Rich Human Feedback for Text-to-Image Generation, Youwei Liang*, Junfeng He*+, Gang Li*+, Peizhao Li, Arseniy Klimovskiy, Nicholas Carolan, Jiao Sun, Jordi Pont-Tuset, Sarah Young, Feng Yang, Junjie Ke, Krishnamurthy Dj Dvijotham, Katherine M Collins, Yiwen Luo, Yang Li, Kai J Kohlhoff, Deepak Ramachandran, Vidhya Navalpakkam, CVPR 2024 (Best Paper)
  • Parrot: Pareto-optimal Multi-Reward Reinforcement Learning Framework for Text-to-Image Generation, Seung Hyun Lee, Yinxiao Li, Junjie Ke, Innfarn Yoo, Han Zhang, Jiahui Yu, Qifei Wang, Fei Deng, Glenn Entis, Junfeng He, Gang Li, Sangpil Kim, Irfan Essa, Feng Yang, ECCV 2024
  • Beyond Thumbs Up/Down: Untangling Challenges of Fine-Grained Feedback for Text-to-Image Generation, Katherine M. Collins, Najoung Kim, Yonatan Bitton, Verena Rieser, Shayegan Omidshafiei, Yushi Hu, Sherol Chen, Senjuti Dutta, Minsuk Chang, Kimin Lee, Youwei Liang, Georgina Evans, Sahil Singla, Gang Li, Adrian Weller, Junfeng He, Deepak Ramachandran, Krishnamurthy Dj Dvijotham, AIES 2024
  • ALOHA: from Attention to Likes – a unified mOdel for understanding HumAn responses to diverse visual content, Peizhao Li*, Junfeng He*+, Gang Li*+, Rachit Bhargava, Shaolei Shen, Nachiappan Valliappan, Youwei Liang, Hongxiang Gu, Venky Ramachandran, Golnaz Farhadi, Yang Li, Kai J Kohlhoff, Vidhya Navalpakkam, arXiv
  • Deep Saliency Prior for Reducing Visual Distraction, Kfir Aberman*, Junfeng He*, Yossi Gandelsman, Inbar Mosseri, David E Jacobs, Kai Kohlhoff, Yael Pritch, Michael Rubinstein, CVPR 2022

    Modeling of human attention & behavior and its applications

  • Learning from Unique Perspectives: User-aware Saliency Modeling, Shi Chen, Nachiappan Valliappan, Shaolei Shen, Xinyu Ye, Kai J Kohlhoff, Junfeng He+, CVPR 2023
  • Teacher-generated spatial-attention labels boost robustness and accuracy of contrastive models, Yushi Yao*, Chang Ye*, Junfeng He+, Gamaleldin Fathy Elsayed+, CVPR 2023
  • Teacher-generated pseudo human spatial-attention labels boost contrastive learning models, Yushi Yao, Chang Ye, Junfeng He, Gamaleldin Fathy Elsayed, SVRHM Workshop @ NeurIPS 2022
  • Smartphone-based gaze estimation for in-home autism research, Na Yeon Kim, Junfeng He, Qianying Wu, Na Dai, Kai Kohlhoff, Jasmin Turner, Lynn K Paul, Daniel P Kennedy, Ralph Adolphs, Vidhya Navalpakkam, Autism Research, 2024
  • Accelerating eye movement research via accurate and affordable smartphone eye tracking, N Valliappan, N Dai, E Steinberg, J He, K Rogers…, Nature Communications, 2020
  • On-Device Few-Shot Personalization for Real-Time Gaze Estimation, Junfeng He, Khoi Pham, Nachiappan Valliappan, Pingmei Xu, Chase Roberts, Dmitry Lagun, Vidhya Navalpakkam, ICCV 2019 GAZE Workshop (Best Paper)
  • GazeGAN: unpaired adversarial image generation for gaze estimation, M Sela, P Xu, J He, V Navalpakkam, D Lagun, arXiv preprint arXiv:1711.09767, 2017
  • Differentially Private Heatmaps, Badih Ghazi, Junfeng He, Kai Kohlhoff, Ravi Kumar, Pasin Manurangsi, Vidhya Navalpakkam, Nachiappan Valliappan, AAAI 2023

    Awards

  • Best Paper Award, CVPR, 2024
  • Publication & Open-Sourcing Excellence Award, Perira org, Google Research, 2021
  • Best Paper Award, ICCV GAZE Workshop, 2019

    Google blog posts

  • Blogpost for "Rich human feedback for text-to-image generation"
  • Blogpost for "Enabling delightful user experiences via predictive models of human attention"
  • Blogpost for using saliency in JPEG XL

    Authored publications

UniAR: Progress in human behavior modeling involves understanding both implicit, early-stage perceptual behavior, such as human attention, and explicit, later-stage behavior, such as subjective preferences or likes. Yet most prior research has focused on modeling implicit and explicit human behavior in isolation, and often limited to a specific type of visual content. We propose UniAR, a unified model of human attention and preference behavior across diverse visual content. UniAR leverages a multimodal transformer to predict subjective feedback, such as satisfaction or aesthetic quality, along with the underlying human attention or interaction heatmaps and viewing order. We train UniAR on diverse public datasets spanning natural images, webpages, and graphic designs, and achieve state-of-the-art performance on multiple benchmarks across various image domains and behavior modeling tasks. Potential applications include providing instant feedback on the effectiveness of UIs/visual content, and enabling designers and content-creation models to optimize their creations for human-centric improvements.

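A minimal sketch, under assumed names and dimensions (not UniAR's actual architecture), of the kind of multi-task heads such a model needs on top of pooled multimodal-transformer features: one dense head for attention/interaction heatmaps and one scalar head for subjective preference.

```python
# Hypothetical multi-task output heads; feature dim and grid size are assumptions.
import torch
import torch.nn as nn

class BehaviorHeads(nn.Module):
    def __init__(self, feat_dim: int = 768, heatmap_hw: int = 32):
        super().__init__()
        self.heatmap_hw = heatmap_hw
        # Dense head: project pooled features onto a coarse spatial grid.
        self.heatmap_head = nn.Linear(feat_dim, heatmap_hw * heatmap_hw)
        # Scalar head: subjective preference / aesthetic quality score.
        self.score_head = nn.Linear(feat_dim, 1)

    def forward(self, pooled_feat: torch.Tensor):
        # pooled_feat: [batch, feat_dim], pooled output of a multimodal encoder.
        heatmap = self.heatmap_head(pooled_feat)
        heatmap = heatmap.view(-1, 1, self.heatmap_hw, self.heatmap_hw).sigmoid()
        score = self.score_head(pooled_feat).squeeze(-1)
        return heatmap, score
```
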
Rich Human Feedback for Text-to-Image Generation: Recent Text-to-Image (T2I) generation models such as Stable Diffusion and Imagen have made significant progress in generating high-resolution images based on text descriptions. However, many generated images still suffer from issues such as artifacts/implausibility, misalignment with text descriptions, and low aesthetic quality. Inspired by the success of Reinforcement Learning with Human Feedback (RLHF) for large language models, prior work collected human-provided scores as feedback on generated images and trained a reward model to improve the T2I generation. In this paper, we enrich the feedback signal by (i) marking image regions that are implausible or misaligned with the text, and (ii) annotating which keywords in the text prompt are not represented in the image. We collect such rich human feedback on 18K generated images and train a multimodal transformer to predict this rich feedback automatically. We show that the predicted rich human feedback can be leveraged to improve image generation, for example by selecting high-quality training data to finetune and improve the generative models, or by creating masks from the predicted heatmaps to inpaint the problematic regions. Notably, the improvements generalize to models (Muse) beyond those used to generate the images on which the human feedback data were collected (Stable Diffusion variants).

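For intuition, a minimal sketch of the inpainting use case the abstract mentions: thresholding a predicted implausibility heatmap into a binary mask that a text-guided inpainting model can consume. The threshold value and array conventions are assumptions, not the paper's settings.

```python
# Sketch only: convert a predicted heatmap into an inpainting mask.
import numpy as np
from PIL import Image

def heatmap_to_mask(heatmap: np.ndarray, threshold: float = 0.5) -> Image.Image:
    """heatmap: HxW float array in [0, 1] from a rich-feedback predictor.
    Returns a binary mask image (white = region to inpaint)."""
    mask = (heatmap >= threshold).astype(np.uint8) * 255
    return Image.fromarray(mask, mode="L")
```

The resulting mask, together with the original image and prompt, can then be handed to any off-the-shelf text-guided inpainting pipeline.
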
Learning from Unique Perspectives (user-aware saliency modeling): Everyone is unique. Given the same visual stimuli, people's attention is driven by both salient visual cues and their own inherent preferences. Knowledge of visual preferences not only facilitates understanding of fine-grained attention patterns of diverse users, but also has the potential of benefiting the development of customized applications. Nevertheless, existing saliency models typically limit their scope to attention as it applies to the general population and ignore the variability between users' behaviors. In this paper, we identify the critical role of visual preferences in attention modeling, and for the first time study the problem of user-aware saliency modeling. Our work aims to advance attention research from three distinct perspectives: (1) We present a new model with the flexibility to capture attention patterns of various combinations of users, so that we can adaptively predict personalized attention, user group attention, and general saliency at the same time with one single model; (2) To augment models with knowledge about the composition of attention from different users, we further propose a principled learning method to understand visual attention in a progressive manner; and (3) We carry out extensive analyses on publicly available saliency datasets to shed light on the roles of visual preferences. Experimental results on diverse stimuli, including naturalistic images and web pages, demonstrate the advantages of our method in capturing the distinct visual behaviors of different users and the general saliency of visual stimuli.

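A hedged sketch of the core idea (stand-in backbone and assumed names, not the paper's architecture): condition a single saliency model on an embedding of the target user set, so the same network can produce personalized, group-level, or general saliency maps.

```python
# Illustrative only: user-set-conditioned saliency prediction.
import torch
import torch.nn as nn

class UserConditionedSaliency(nn.Module):
    def __init__(self, num_users: int, feat_dim: int = 256):
        super().__init__()
        self.user_emb = nn.Embedding(num_users, feat_dim)
        # Stand-in for a real image backbone.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=3, padding=1), nn.ReLU())
        self.decoder = nn.Conv2d(feat_dim, 1, kernel_size=1)

    def forward(self, image: torch.Tensor, user_ids: torch.Tensor):
        # image: [B, 3, H, W]; user_ids: [B, K]. Averaging K user embeddings
        # covers one user (K=1), a user group, or "all users" for general saliency.
        feats = self.image_encoder(image)
        cond = self.user_emb(user_ids).mean(dim=1)     # [B, feat_dim]
        feats = feats + cond[:, :, None, None]         # broadcast the condition
        return torch.sigmoid(self.decoder(feats))      # [B, 1, H, W] saliency map
```
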
Differentially Private Heatmaps: We consider the task of producing heatmaps from users' aggregated data while protecting their privacy. We give a differentially private algorithm for this task and demonstrate its advantages over previous algorithms on several real-world datasets. Our core algorithmic primitive is a differentially private procedure that takes in a set of distributions and produces an output that is close in Earth Mover's Distance (EMD) to the average of the inputs. We prove theoretical bounds on the error of our algorithm under certain sparsity assumptions and show that these are essentially optimal.

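For intuition only, a simplified sketch that privately averages per-user heatmaps with the standard Gaussian mechanism. This is the naive baseline such work improves upon, not the EMD-based algorithm the paper proposes; the normalization and noise calibration here are assumptions.

```python
# Naive baseline sketch: Gaussian-mechanism average of per-user heatmaps.
import numpy as np

def dp_average_heatmap(user_heatmaps: np.ndarray, epsilon: float, delta: float):
    """user_heatmaps: [num_users, H, W], each assumed normalized to unit L2 norm."""
    n = user_heatmaps.shape[0]
    avg = user_heatmaps.mean(axis=0)
    # Replacing one unit-norm heatmap changes the mean by at most 2/n in L2.
    sensitivity = 2.0 / n
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    noisy = avg + np.random.normal(scale=sigma, size=avg.shape)
    return np.clip(noisy, 0.0, None)  # heatmap values are non-negative
```
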
Accelerating eye movement research via accurate and affordable smartphone eye tracking: Eye tracking has been widely used for decades in vision research, language and usability. However, most prior research has focused on large desktop displays using specialized eye trackers that are expensive and cannot scale. Little is known about eye movement behavior on phones, despite their pervasiveness and the large amount of time spent on them. We leverage machine learning to demonstrate accurate smartphone-based eye tracking without any additional hardware. We show that the accuracy of our method is comparable to state-of-the-art mobile eye trackers that are 100x more expensive. Using data from over 100 opted-in users, we replicate key findings from previous eye movement research on oculomotor tasks and saliency analyses during natural image viewing. In addition, we demonstrate the utility of smartphone-based gaze for detecting reading comprehension difficulty. Our results show the potential for scaling eye movement research by orders of magnitude to thousands of participants (with explicit consent), enabling advances in vision research, accessibility and healthcare.

On-Device Few-Shot Personalization for Real-Time Gaze Estimation: Recent research has demonstrated the ability to estimate a user's gaze on mobile devices by performing inference from an image captured with the phone's front-facing camera, without requiring specialized hardware. Gaze estimation accuracy is known to improve with additional calibration data from the user. However, most existing methods require either a significant number of calibration points or computationally intensive model fine-tuning that is practically infeasible on a mobile device. In this paper, we overcome limitations of prior work by proposing a novel few-shot personalization approach for 2D gaze estimation. Compared to the best calibration-free model [11], the proposed method yields substantial improvements in gaze prediction accuracy (24%) using only 3 calibration points, in contrast to previous personalized models that offer less improvement while requiring more calibration points. The proposed model requires 20x fewer FLOPS than the state-of-the-art personalized model [11] and can be run entirely on-device and in real-time, thereby unlocking a variety of important applications like accessibility, gaming and human-computer interaction.
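
A minimal sketch, under stated assumptions, of how little data few-shot personalization can need: a per-user least-squares correction fitted from a handful of calibration points on top of a frozen base gaze model's 2D predictions. The paper's actual personalization method differs; this only illustrates the calibration step.

```python
# Illustrative per-user affine correction from K calibration points (K >= 3).
import numpy as np

def fit_affine_correction(pred_xy: np.ndarray, true_xy: np.ndarray) -> np.ndarray:
    """pred_xy, true_xy: [K, 2] base-model predictions and ground-truth screen
    points collected during calibration. Returns a 3x2 affine map."""
    ones = np.ones((pred_xy.shape[0], 1))
    A = np.hstack([pred_xy, ones])                    # [K, 3]
    W, *_ = np.linalg.lstsq(A, true_xy, rcond=None)   # least-squares fit
    return W

def apply_correction(pred_xy: np.ndarray, W: np.ndarray) -> np.ndarray:
    ones = np.ones((pred_xy.shape[0], 1))
    return np.hstack([pred_xy, ones]) @ W             # corrected gaze points
```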