Kai Kohlhoff
Kai Kohlhoff studied computer science, computational biology, and structural bioinformatics at the Karlsruhe Institute of Technology (KIT), Jacobs University Bremen, and the University of Cambridge. After finishing his PhD, he was a Simbios Distinguished Postdoctoral Fellow in Bioengineering at Stanford University. Kai joined Google as a Visiting Faculty member in 2011 and now works as a research scientist at Google AI.
Authored Publications
UniAR: A Unified model for predicting human Attention and Responses on visual content
Peizhao Li, Gang Li, Rachit Bhargava, Shaolei Shen, Youwei Liang, Hongxiang Gu, Venky Ramachandran, Golnaz Farhadi
Progress in human behavior modeling involves understanding both implicit, early-stage perceptual behavior, such as human attention, and explicit, later-stage behavior, such as subjective preferences or likes. Yet most prior research has focused on modeling implicit and explicit human behavior in isolation; and often limited to a specific type of visual content. We propose UniAR – a unified model of human attention and preference behavior across diverse visual content. UniAR leverages a multimodal transformer to predict subjective feedback, such as satisfaction or aesthetic quality, along with the underlying human attention or interaction heatmaps and viewing order. We train UniAR on diverse public datasets spanning natural images, webpages, and graphic designs, and achieve SOTA performance on multiple benchmarks across various image domains and behavior modeling tasks. Potential applications include providing instant feedback on the effectiveness of UIs/visual content, and enabling designers and content-creation models to optimize their creation for human-centric improvements.
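The abstract above does not spell out the architecture beyond a multimodal transformer with multiple prediction outputs; the following PyTorch-style fragment is only a minimal sketch of that multi-task shape (shared features feeding a heatmap head and a scalar preference head). Module names, dimensions, and the pooling choice are illustrative assumptions, not details from the paper.

```python
# Minimal multi-task sketch (not the UniAR implementation): one shared
# backbone with separate heads for an attention heatmap and a scalar
# subjective rating. All sizes and module choices are assumptions.
import torch
import torch.nn as nn

class MultiTaskBehaviorModel(nn.Module):
    def __init__(self, feat_dim=256, heatmap_size=(32, 32)):
        super().__init__()
        self.heatmap_size = heatmap_size
        # Stand-in for a multimodal transformer over fused image/text tokens.
        self.backbone = nn.Sequential(nn.LazyLinear(feat_dim), nn.ReLU())
        # Head 1: dense attention/interaction heatmap.
        self.heatmap_head = nn.Linear(feat_dim, heatmap_size[0] * heatmap_size[1])
        # Head 2: scalar subjective rating (e.g., satisfaction or aesthetics).
        self.rating_head = nn.Linear(feat_dim, 1)

    def forward(self, fused_tokens):
        # fused_tokens: (batch, num_tokens, token_dim) pre-fused features.
        feats = self.backbone(fused_tokens.mean(dim=1))   # simple mean pooling
        logits = self.heatmap_head(feats)                 # (batch, H*W)
        heatmap = logits.softmax(dim=-1).view(-1, *self.heatmap_size)
        rating = self.rating_head(feats).squeeze(-1)
        return heatmap, rating

heatmap, rating = MultiTaskBehaviorModel()(torch.randn(2, 50, 768))
```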
Rich Human Feedback for Text-to-Image Generation
Katherine Collins, Nicholas Carolan, Youwei Liang, Peizhao Li, Dj Dvijotham, Gang Li, Sarah Young, Jiao Sun, Arseniy Klimovskiy
Recent Text-to-Image (T2I) generation models such as Stable Diffusion and Imagen have made significant progress in generating high-resolution images based on text descriptions. However, many generated images still suffer from issues such as artifacts/implausibility, misalignment with text descriptions, and low aesthetic quality. Inspired by the success of Reinforcement Learning with Human Feedback (RLHF) for large language models, prior work collected human-provided scores as feedback on generated images and trained a reward model to improve the T2I generation. In this paper, we enrich the feedback signal by (i) marking image regions that are implausible or misaligned with the text, and (ii) annotating which keywords in the text prompt are not represented in the image. We collect such rich human feedback on 18K generated images and train a multimodal transformer to predict these rich feedback signals automatically. We show that the predicted rich human feedback can be leveraged to improve image generation, for example, by selecting high-quality training data to finetune and improve the generative models, or by creating masks with predicted heatmaps to inpaint the problematic regions. Notably, the improvements generalize to models (Muse) beyond those used to generate the images on which human feedback data were collected (Stable Diffusion variants).
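One of the uses mentioned above, turning a predicted artifact/implausibility heatmap into a mask for inpainting the problematic regions, can be pictured as a simple thresholding step. The threshold and the crude dilation below are illustrative choices, not values from the paper.

```python
# Illustrative only: convert a predicted implausibility heatmap in [0, 1]
# into a binary inpainting mask. Threshold and dilation radius are assumptions.
import numpy as np

def heatmap_to_inpaint_mask(heatmap, threshold=0.5, dilate=2):
    mask = heatmap >= threshold
    # Crude dilation so the mask comfortably covers the flagged region.
    padded = np.pad(mask, dilate)
    out = np.zeros_like(mask)
    h, w = mask.shape
    for dy in range(-dilate, dilate + 1):
        for dx in range(-dilate, dilate + 1):
            out |= padded[dilate + dy:dilate + dy + h,
                          dilate + dx:dilate + dx + w]
    return out.astype(np.uint8)

mask = heatmap_to_inpaint_mask(np.random.rand(64, 64))
# `mask` would then be handed to an inpainting model together with the prompt.
```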
Everyone is unique. Given the same visual stimuli, people's attention is driven by both salient visual cues and their own inherent preferences. Knowledge of visual preferences not only facilitates understanding of fine-grained attention patterns of diverse users, but also has the potential of benefiting the development of customized applications. Nevertheless, existing saliency models typically limit their scope to attention as it applies to the general population and ignore the variability between users' behaviors. In this paper, we identify the critical role of visual preferences in attention modeling, and for the first time study the problem of user-aware saliency modeling. Our work aims to advance attention research from three distinct perspectives: (1) We present a new model with the flexibility to capture attention patterns of various combinations of users, so that we can adaptively predict personalized attention, user group attention, and general saliency at the same time with one single model; (2) To augment models with knowledge about the composition of attention from different users, we further propose a principled learning method to understand visual attention in a progressive manner; and (3) We carry out extensive analyses on publicly available saliency datasets to shed light on the roles of visual preferences. Experimental results on diverse stimuli, including naturalistic images and web pages, demonstrate the advantages of our method in capturing the distinct visual behaviors of different users and the general saliency of visual stimuli.
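The abstract above gives no architectural specifics; one minimal way to picture a single model covering personalized, group, and general attention is to condition a saliency head on a (possibly averaged) user embedding, with a reserved index standing in for the generic viewer. Everything in the sketch below, including names, sizes, and the embedding-averaging trick, is an assumption for illustration only.

```python
# Illustrative sketch of user-conditioned saliency, not the paper's model.
import torch
import torch.nn as nn

class UserConditionedSaliency(nn.Module):
    def __init__(self, num_users=100, embed_dim=32, feat_dim=128, map_size=(16, 16)):
        super().__init__()
        self.map_size = map_size
        # Index 0 is reserved as a "generic viewer" for population-level saliency.
        self.user_embed = nn.Embedding(num_users + 1, embed_dim)
        self.image_enc = nn.Sequential(nn.LazyLinear(feat_dim), nn.ReLU())
        self.head = nn.Linear(feat_dim + embed_dim, map_size[0] * map_size[1])

    def forward(self, image_feats, user_ids):
        # Averaging several user embeddings yields a user-group prediction.
        u = self.user_embed(user_ids).mean(dim=1)
        x = torch.cat([self.image_enc(image_feats), u], dim=-1)
        return self.head(x).softmax(dim=-1).view(-1, *self.map_size)

model = UserConditionedSaliency()
maps = model(torch.randn(3, 512),
             torch.tensor([[7, 7], [3, 12], [0, 0]]))  # personal, group, general
```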
We consider the task of producing heatmaps from users' aggregated data while protecting their privacy. We give a differentially private algorithm for this task and demonstrate its advantages over previous algorithms on several real-world datasets.
Our core algorithmic primitive is a differentially private procedure that takes in a set of distributions and produces an output that is close in Earth Mover's Distance (EMD) to the average of the inputs. We prove theoretical bounds on the error of our algorithm under a certain sparsity assumption and show that these bounds are essentially optimal.
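The algorithm itself aggregates under EMD; as a much simpler point of comparison, the standard baseline is to average the per-user heatmaps (each normalized to a probability distribution) and add Laplace noise calibrated to the L1 sensitivity of that average. The sketch below shows only this baseline, not the EMD-based procedure described above.

```python
# Baseline epsilon-DP aggregation via the Laplace mechanism (illustration only;
# this is NOT the paper's EMD-based algorithm).
import numpy as np

def dp_average_heatmap(user_heatmaps, epsilon):
    # Each user's heatmap is normalized to sum to 1, so replacing one user
    # changes the average by at most 2/n in L1 norm; Laplace noise with scale
    # 2/(n*epsilon) per cell therefore gives epsilon-DP.
    n = len(user_heatmaps)
    dists = [h / h.sum() for h in user_heatmaps]
    avg = np.mean(dists, axis=0)
    noise = np.random.laplace(scale=2.0 / (n * epsilon), size=avg.shape)
    return np.clip(avg + noise, 0.0, None)  # clipping is post-processing

private_map = dp_average_heatmap([np.random.rand(32, 32) for _ in range(50)],
                                 epsilon=1.0)
```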
Accelerating eye movement research via accurate and affordable smartphone eye tracking
Na Dai, Ethan Steinberg, Kantwon Rogers, Venky Ramachandran, Mina Shojaeizadeh, Li Guo
Nature Communications, 11 (2020)
Eye tracking has been widely used for decades in vision research, language, and usability. However, most prior research has focused on large desktop displays using specialized eye trackers that are expensive and cannot scale. Little is known about eye movement behavior on phones, despite their pervasiveness and the large amount of time spent on them. We leverage machine learning to demonstrate accurate smartphone-based eye tracking without any additional hardware. We show that the accuracy of our method is comparable to state-of-the-art mobile eye trackers that are 100x more expensive. Using data from over 100 opted-in users, we replicate key findings from previous eye movement research on oculomotor tasks and saliency analyses during natural image viewing. In addition, we demonstrate the utility of smartphone-based gaze for detecting reading comprehension difficulty. Our results show the potential for scaling eye movement research by orders of magnitude to thousands of participants (with explicit consent), enabling advances in vision research, accessibility, and healthcare.
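The underlying model is not reproduced here; as a rough illustration of the general idea, a learned regressor mapping front-camera eye crops to on-screen gaze coordinates with no dedicated hardware, the sketch below shows a hypothetical minimal CNN. The architecture, input format, and the calibration note are assumptions, not the paper's method.

```python
# Hypothetical sketch: regress normalized (x, y) screen coordinates from a
# cropped eye-region image captured by the front camera.
import torch
import torch.nn as nn

class GazeRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, 2)  # (x, y) in [0, 1] screen coordinates

    def forward(self, eye_crop):
        return torch.sigmoid(self.head(self.features(eye_crop)))

# A short per-user calibration pass (a few known fixation targets) is the kind
# of step that typically closes the gap to dedicated eye-tracking hardware.
gaze_xy = GazeRegressor()(torch.randn(1, 3, 64, 64))
```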
Given an object and a hand, identifying a robust grasp out of an infinite set of grasp candidates is a challenging problem, and several grasp synthesis approaches have been proposed in the robotics community to find the promising ones. Most of the approaches assume both the object and the hand to be rigid and evaluate the robustness of the grasp based on the wrenches acting at contact points. Since rigid body mechanics is used in these works, the actual distribution of the contact tractions is not considered, and contacts are represented by their resultant wrenches. However, the tractions acting at the contact interfaces play a critical role in the robustness of the grasp, and not accounting for these in detail is a serious limitation of the current approaches. In this paper, we replace the conventional wrench-based rigid-body approaches with a deformable-body mechanics formulation as is conventional in solid mechanics. We briefly review the wrench-based grasp synthesis approaches in the literature and address the drawbacks present in them from a solid mechanics standpoint. In our formulation, we account for deformation in both the grasper and the object and evaluate the robustness of the grasp based on the distribution of normal and tangential tractions at the contact interface. We contrast how a given grasp situation is solved using conventional wrench space formulations and deformable solid mechanics and show how tractions on the contacting surfaces influence the grasp equilibrium. Recognizing that contact areas can be correlated to contact tractions, we propose a grasp performance index, π, based on the contact areas. We also devise a grasp analysis strategy to identify robust grasps under random perturbations and implement it using the Finite Element Method (FEM) to study a few grasps. One of the key aspects of our Finite Element (FE)-based approach is that it can be used to monitor the dynamic interaction between object and hand for judging grasp robustness. We then compare our measure, π, with conventional grasp quality measures, ϵ and v, and show that it successfully accounts for the effect of the physical characteristics of the object and hand (such as the mass, Young’s modulus, and coefficient of friction) and identifies robust grasps that are in line with human intuition and experience.
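The definition of the index π is given in the paper and is not reproduced here; what can be sketched generically is the kind of computation the deformable-body view enables: integrating FE contact tractions over each contact patch into resultant forces and checking grasp equilibrium. Array layouts, names, and the omission of moment balance below are illustrative simplifications.

```python
# Illustrative post-processing of FE contact output; not the paper's pipeline.
import numpy as np

def contact_resultant(node_areas, normal_tractions, tangential_tractions, normals):
    # node_areas: (N,) tributary area per contact node
    # normal_tractions: (N,) contact pressure magnitudes
    # tangential_tractions: (N, 3) frictional traction vectors
    # normals: (N, 3) outward unit normals at the nodes
    f_normal = (normal_tractions * node_areas)[:, None] * normals
    f_tangential = tangential_tractions * node_areas[:, None]
    return f_normal.sum(axis=0) + f_tangential.sum(axis=0)

def in_force_equilibrium(patch_resultants, object_weight, tol=1e-3):
    # Force balance only; a full check would also include moment balance.
    total = np.sum(patch_resultants, axis=0) + np.array([0.0, 0.0, -object_weight])
    return np.linalg.norm(total) < tol

# Intuition behind an area-based index: for the same resultant load, a larger
# contact area implies lower peak tractions at the interface.
```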
Google-Accelerated Biomolecular Simulations
Biomolecular Simulations, Springer, New York (2019)
Biomolecular simulations rely heavily on the availability of suitable compute infrastructure for data-driven tasks like modeling, sampling, and analysis. These resources are typically available on a per-lab and per-facility basis, or through dedicated national supercomputing centers. In recent years, cloud computing has emerged as an alternative by offering an abundance of on-demand, specialist-maintained resources that enable efficiency and increased turnaround through rapid scaling.
Scientific computations that take the shape of parallel workloads using large datasets are commonplace, making them ideal candidates for distributed computing in the cloud. Recent developments have greatly simplified the task for the experimenter to configure the cloud for use and job submission. This chapter will show how to use Google’s Cloud Platform for biomolecular simulations by example of the molecular dynamics package GROningen MAchine for Chemical Simulations (GROMACS). The instructions readily transfer to a large variety of other tasks, allowing the reader to use the cloud for their specific purposes.
Importantly, by using Docker containers (a popular light-weight virtualization solution) and cloud storage, key issues in scientific research are addressed: reproducibility of results, record keeping, and the possibility for other researchers to obtain copies and directly build upon previous work for further experimentation and hypothesis testing.
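The chapter gives the full walkthrough; the overall shape of the workflow can be sketched as a few shell-outs from Python: provision a VM, run GROMACS in a container on it, and copy the results to Cloud Storage. The instance name, zone, machine type, bucket, container image, and input file names below are placeholders, and the exact commands and flags in the chapter may differ.

```python
# Workflow sketch only; all names are placeholders and the chapter's exact
# commands may differ.
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

INSTANCE, ZONE, BUCKET = "gmx-node", "us-central1-a", "gs://my-md-results"

# 1. Provision a VM (machine type chosen arbitrarily here).
run(["gcloud", "compute", "instances", "create", INSTANCE,
     "--zone", ZONE, "--machine-type", "n1-standard-8"])

# 2. Run GROMACS inside a container on the VM (assumes Docker and the prepared
#    input .tpr file are already on the instance; 'gromacs/gromacs' is a
#    publicly available image).
run(["gcloud", "compute", "ssh", INSTANCE, "--zone", ZONE, "--command",
     "docker run -v $HOME/md:/data -w /data gromacs/gromacs gmx mdrun -deffnm topol"])

# 3. Copy trajectories and logs to Cloud Storage for record keeping.
run(["gcloud", "compute", "ssh", INSTANCE, "--zone", ZONE, "--command",
     f"gsutil cp $HOME/md/topol.* {BUCKET}/"])
```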
Tensor Field Networks: Rotation- and Translation-Equivariant Neural Networks for 3D Point Clouds
Nathaniel Cabot Thomas, Tess Smidt, Steven Kearnes, Li Li, Patrick Riley
(2018)
We introduce tensor field neural networks, which are locally equivariant to 3D rotations, translations, and permutations of points at every layer. 3D rotation equivariance removes the need for data augmentation to identify features in arbitrary orientations. Our network uses filters built from spherical harmonics; due to the mathematical consequences of this filter choice, each layer accepts as input (and guarantees as output) scalars, vectors, and higher-order tensors, in the geometric sense of these terms. We demonstrate the capabilities of tensor field networks with tasks in geometry, physics, and chemistry.
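A small, self-contained instance of the equivariance property described above: a point-convolution filter of the form R(|r|)·r̂, a radial profile times the unit displacement vector (i.e. an ℓ=1 spherical-harmonic filter), maps per-point scalars to per-point vectors, and rotating the input point cloud rotates the output by the same matrix. The check below uses a fixed Gaussian radial profile for simplicity; it is a toy illustration, not the paper's network.

```python
# Toy l=1 (vector) filter and a rotation-equivariance check.
import numpy as np

def l1_filter_layer(points, scalar_feats, radial=lambda d: np.exp(-d**2)):
    """out_i = sum_{j != i} R(|r_ij|) * (r_ij / |r_ij|) * s_j  -> a 3-vector per point."""
    out = np.zeros_like(points)
    for i in range(len(points)):
        for j in range(len(points)):
            if i == j:
                continue
            r = points[j] - points[i]
            d = np.linalg.norm(r)
            out[i] += radial(d) * (r / d) * scalar_feats[j]
    return out

rng = np.random.default_rng(0)
pts, feats = rng.normal(size=(5, 3)), rng.normal(size=5)
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
lhs = l1_filter_layer(pts @ R.T, feats)   # rotate inputs, then apply the layer
rhs = l1_filter_layer(pts, feats) @ R.T   # apply the layer, then rotate outputs
assert np.allclose(lhs, rhs)              # equivariance: both paths agree
```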
Dex-Net 1.0: A cloud-based network of 3D objects for robust grasp planning using a Multi-Armed Bandit model with correlated rewards
Jeffrey Mahler, Florian T. Pokorny, Brian Hou, Melrose Roderick, Michael Laskey, Mathieu Aubry, Torsten Kroeger, James Kuffner, Ken Goldberg
2016 IEEE International Conference on Robotics and Automation (ICRA) (2016)
This paper presents the Dexterity Network (Dex-Net) 1.0, a dataset of 3D object models and a sampling-based planning algorithm to explore how Cloud Robotics can be used for robust grasp planning. The algorithm uses a Multi-Armed Bandit model with correlated rewards to leverage prior grasps and 3D object models in a growing dataset that currently includes over 10,000 unique 3D object models and 2.5 million parallel-jaw grasps. Each grasp includes an estimate of the probability of force closure under uncertainty in object and gripper pose and friction. Dex-Net 1.0 uses Multi-View Convolutional Neural Networks (MV-CNNs), a new deep learning method for 3D object classification, to provide a similarity metric between objects, and the Google Cloud Platform to simultaneously run up to 1,500 virtual cores, reducing experiment runtime by up to three orders of magnitude. Experiments suggest that correlated bandit techniques can use a cloud-based network of object models to significantly reduce the number of samples required for robust grasp planning. We report on system sensitivity to variations in similarity metrics and in uncertainty in pose and friction. Code and updated information are available at http://berkeleyautomation.github.io/dex-net/.
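To make the bandit loop concrete, here is a simplified Thompson-sampling sketch over a fixed set of grasp candidates, where each candidate's probability of force closure is estimated from binary trial outcomes. The correlation structure Dex-Net 1.0 exploits, sharing evidence across similar grasps and objects via the MV-CNN similarity metric, is deliberately omitted and only noted in a comment; all names are illustrative.

```python
# Simplified, *uncorrelated* Thompson sampling over grasp candidates.
# Dex-Net 1.0 additionally shares rewards across similar grasps/objects
# (correlated bandits); that is not modeled here.
import numpy as np

def thompson_grasp_selection(simulate_trial, num_candidates, num_rounds, seed=0):
    rng = np.random.default_rng(seed)
    successes = np.ones(num_candidates)   # Beta(1, 1) priors on P(force closure)
    failures = np.ones(num_candidates)
    for _ in range(num_rounds):
        sampled = rng.beta(successes, failures)   # posterior samples
        g = int(np.argmax(sampled))               # try the most promising grasp
        reward = simulate_trial(g)                # 1 if force closure holds, else 0
        successes[g] += reward
        failures[g] += 1 - reward
    return int(np.argmax(successes / (successes + failures)))

# Toy environment with hidden per-grasp success probabilities.
true_p = np.random.default_rng(1).uniform(0.1, 0.9, size=20)
env = np.random.default_rng(2)
best = thompson_grasp_selection(lambda g: int(env.random() < true_p[g]), 20, 500)
```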