Jump to Content
Jonathan Tompson

Jonathan Tompson

My research background covers a wide range of topics: computer vision and graphics, robotics, computational fluid dynamics, reinforcement learning, unsupervised learning, hand and human body tracking and analog IC design. You can find more of my projects at jonathantompson.com.
Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, desc
  • Year
  • Year, desc
    Preview abstract In recent years, much progress has been made in learning robotic manipulation policies that can follow natural language instructions. Common approaches involve learning methods that operate on offline datasets, such as task-specific teleoperated demonstrations or on hindsight labeled robotic experience. Such methods work reasonably but rely strongly on the assumption of clean data: teleoperated demonstrations are collected with specific tasks in mind, while hindsight language descriptions rely on expensive human labeling. Recently, large-scale pretrained language and vision-language models like CLIP have been applied to robotics in the form of learning representations and planners. However, can these pretrained models also be used to cheaply impart internet-scale knowledge onto offline datasets, providing access to skills contained in the offline dataset that weren't necessarily reflected in ground truth labels? We investigate fine-tuning a reward model on a small dataset of robot interactions with crowd-sourced natural language labels and using the model to relabel instructions of a large offline robot dataset. The resulting dataset with diverse language skills is used to train imitation learning policies, which outperform prior methods by up to 30% when evaluated on a diverse set of novel language instructions that were not contained in the original dataset. View details
    InnerMonologue: Embodied Reasoning through Planning with Language Models
    Wenlong Huang
    Harris Chan
    Jacky Liang
    Igor Mordatch
    Yevgen Chebotar
    Noah Brown
    Tomas Jackson
    Linda Luu
    Brian Andrew Ichter
    Conference on Robot Learning (2022) (to appear)
    Preview abstract Recent works have shown the capabilities of large language models to perform tasks requiring reasoning and to be applied to applications beyond natural language processing, such as planning and interaction for embodied robots.These embodied problems require an agent to understand the repertoire of skills available to a robot and the order in which they should be applied. They also require an agent to understand and ground itself within the environment. In this work we investigate to what extent LLMs can reason over sources of feedback provided through natural language. We propose an inner monologue as a way for an LLM to think through this process and plan. We investigate a variety of sources of feedback, such as success detectors and object detectors, as well as human interaction. The proposed method is validated in a simulation domain and on real robotic. We show that Innerlogue can successfully replan around failures, and generate new plans to accommodate human intent. View details
    Preview abstract Rearranging and manipulating deformable objects such as cables, fabrics, and bags is a long-standing challenge in robotic manipulation. The complex dynamics and high-dimensional configuration spaces of deformables, compared to rigid objects, make manipulation difficult not only for multi-step planning, but even for goal specification. Goals cannot be as easily specified as rigid object poses, and may involve complex relative spatial relations such as ``place the item inside the bag". In this work, we develop a suite of simulated benchmarks with 1D, 2D, and 3D deformable structures, including tasks that involve image-based goal-conditioning and multi-step deformable manipulation. We propose embedding goal-conditioning into Transporter Networks, a recently proposed model architecture for robotic manipulation that uses learned template matching to infer displacements that can represent pick and place actions. We demonstrate that goal-conditioned Transporter Networks enable agents to manipulate deformable structures into flexibly specified configurations without test-time visual anchors for target locations. We also significantly extend prior results using Transporter Networks for manipulating deformable objects by testing on tasks with 2D and 3D deformables. View details
    Preview abstract We find that across a wide range of robot policy learning scenarios, treating supervised policy learning with an implicit model generally performs better, on average, than commonly used explicit models. We present extensive experiments on this finding, and we provide both intuitive insight and theoretical arguments distinguishing the properties of implicit models compared to their explicit counterparts, particularly with respect to approximating complex, potentially discontinuous and multi-valued (set-valued) functions. On robotic policy learning tasks we show that implicit behavioral cloning policies with energy-based models (EBM) often outperform common explicit (Mean Square Error, or Mixture Density) behavioral cloning policies, including on tasks with high-dimensional action spaces and visual image inputs. We find these policies provide competitive results or outperform state-of-the-art offline reinforcement learning methods on the challenging human-expert tasks from the D4RL benchmark suite, despite using no reward information. In the real world, robots with implicit policies can learn complex and remarkably subtle behaviors on contact-rich tasks from human demonstrations, including tasks with high combinatorial complexity and tasks requiring 1mm precision. View details
    Preview abstract We investigate the visual cross-embodiment imitation setting, in which agents learn policies from videos of other agents (such as humans) demonstrating the same task, but with stark differences in their embodiments -- shape, actions, end-effector dynamics, etc. In this work, we demonstrate that it is possible to automatically discover and learn vision-based reward functions from cross-embodiment demonstration videos that are robust to these differences. Specifically, we present a self-supervised method for Cross-embodiment Inverse Reinforcement Learning (XIRL) that leverages temporal cycle-consistency constraints to learn deep visual embeddings that capture task progression from offline videos of demonstrations across multiple expert agents, each performing the same task differently due to embodiment differences. Prior to our work, producing rewards from self-supervised embeddings typically required alignment with a reference trajectory, which may be difficult to acquire under stark embodiment differences. We show empirically that if the embeddings are aware of task progress, simply taking the negative distance between the current state and goal state in the learned embedding space is useful as a reward for training policies with reinforcement learning. We find our learned reward function not only works for embodiments seen during training, but also generalizes to entirely new embodiments. Additionally, when transferring real-world human demonstrations to a simulated robot, we find that XIRL is more sample efficient than current best methods. View details
    Preview abstract We present the ADaptive Adversarial Imitation Learning (ADAIL) algorithm for learning adaptive policies that can be transferred between environments of varying dynamics, by imitating a small number of demonstrations collected from a single source domain. This problem is important in robotic learning because in real world scenarios 1) reward functions are hard to obtain, 2) learned policies from one domain are difficult to deploy in another due to varying source to target domain statistics, 3) collecting expert demonstrations in multiple environments where the dynamics are known and controlled is often infeasible. We address these constraints by building upon recent advances in adversarial imitation learning; we condition our policy on a learned dynamics embedding and we employ a domain-adversarial loss to learn a dynamics-invariant discriminator. The effectiveness of our method is demonstrated on simulated control tasks with varying environment dynamics and the learned adaptive agent outperforms several recent baselines. View details
    Preview abstract Robotic manipulation can be formulated as inducing a sequence of spatial displacements: where the space being moved can encompass object(s) or an end effector. In this work, we propose the Transporter Network, a simple model architecture that rearranges deep features to infer spatial displacements from visual input -- which can parameterize robot actions. It makes no assumptions of objectness (e.g. canonical poses, models, or keypoints), it exploits spatial symmetries, and is orders of magnitude more sample efficient than our benchmarked alternatives in learning vision-based manipulation tasks: from stacking a pyramid of blocks, to assembling kits with unseen objects; from manipulating deformable ropes, to pushing piles of small objects with closed-loop feedback. Our method can represent complex multi-modal policy distributions and generalizes to multi-step sequential tasks, as well as 6DoF pick-and-place. Experiments on 10 simulated tasks show that it learns faster and generalizes better than a variety of end-to-end baselines, including policies that use ground-truth object poses. We validate our methods with hardware in the real world. View details
    Preview abstract Fully convolutional deep correlation networks are currently the state of the art approaches to single object visual tracking. It is commonly assumed that these networks perform tracking by detection by matching features of the object instance with features of the scene. Strong architectural priors and conditioning on the object representation is thought to encourage this tracking strategy. Despite these efforts, we show that deep trackers often default to “tracking by saliency” detection – without relying on the object representation. This leads us to introduce an auxiliary detection task that encourages more discriminative object representations and improves tracking performance. View details
    Preview abstract The need for understanding periodic videos is pervasive. Videos of biological processes, manufacturing processes, people exercising, objects being manipulated are only a few examples where the respective fields would benefit greatly if they were able to process periodic videos automatically. We present an approach for estimating the period with which an action is repeated in a video. The crux of the approach lies in leveraging temporal self-similarity as an intermediate representation bottleneck that allows generalization to unseen videos in the wild. We train this model with a synthetic dataset from a large unlabeled video dataset by sampling short clips of varying lengths and repeating them with different periods. However, simply training powerful video classification models on this synthetic dataset doesn't transfer to real videos. We constrain the period prediction model to use the self-similarity of temporal representations to ensure that the model generalizes to real videos with repeated actions. This combination of synthetic data and a powerful yet constrained model allows us to predict periods in a class-agnostic fashion. Our repetition counting model substantially exceeds the state of the art performance on existing periodicity benchmarks. We also collect a new challenging dataset called Countix which is more difficult than the existing datasets, capturing difficulties in repetition counting in videos in the real-world. We present extensive experiments on this dataset and hope this encourages more research in this important problem. View details
    Imitation Learning via Off-Policy Distribution Matching
    Ilya Kostrikov
    Ofir Nachum
    Submission for NeurIPS workshop, ICLR conference (2020)
    Preview abstract When performing imitation learning from expert demonstrations, distribution matching is a popular approach, in which one typically alternates between estimating distribution ratios and then using these ratios as rewards in a standard reinforcement learning (RL) algorithm. Traditionally, estimation of the distribution ratio requires on-policy data, which has caused previous work to either be exorbitantly data-inefficient or alter the original objective in a manner that can drastically change its optimum. In this work, we show how the original distribution ratio estimation objective may be transformed in a principled manner to yield a completely off-policy objective. In addition to the data-efficiency that this provides, we are able to show that this objective also renders the use of a separate RL optimization unnecessary. Rather, an imitation policy may be learned directly from this objective without the use of explicit rewards. We call the resulting algorithm ValueDICE and evaluate it on a suite of popular imitation learning benchmarks, finding that it can consistently outperform state-of-the-art methods. View details
    Preview abstract We introduce a self-supervised representation learning method based on the task of temporal alignment between videos. The method trains a network using temporal cycleconsistency (TCC), a differentiable cycle-consistency loss that can be used to find correspondences across time in multiple videos. The resulting per-frame embeddings can be used to align videos by simply matching frames using nearest-neighbors in the learned embedding space. To evaluate the power of the embeddings, we densely label the Pouring and Penn Action video datasets for action phases. We show that (i) the learned embeddings enable few-shot classification of these action phases, significantly reducing the supervised training requirements; and (ii) TCC is complementary to other methods of selfsupervised learning in videos, such as Shuffle and Learn and Time-Contrastive Networks. The embeddings are also used for a number of applications based on alignment (dense temporal correspondence) between video pairs, including transfer of metadata of synchronized modalities between videos (sounds, temporal semantic labels), synchronized playback of multiple videos, and anomaly detection. Project webpage: https://sites.google.com/view/temporal-cycle-consistency. View details
    Preview abstract We propose a self-supervised approach to learning a wide variety of manipulation skills from unlabeled data collected through playing in and interacting within a playground environment. Learning by playing offers three main advantages: 1) Collecting large amounts of play data is cheap and fast as it does not require staging the scene nor labeling data, 2) It relaxes the need to have a discrete and rigid definition of skills/tasks during the data collection. This allows the agent to focus on acquiring a continuum set of manipulation skills as a whole, which can then be conditioned to perform a particular skill such as grasping. Furthermore, this data already includes ways to recover, retry or transition between different skills, which can be used to achieve a reactive closed-loop control policy, 3) It allows to quickly learn a new skill from making use of pre-existing general abilities. Our proposed approach to learning new skills from unlabeled play data decouples high-level planning prediction from low-level action prediction by: first self-supervise learning of a latent planning space, then self-supervise learning of an action model that is conditioned on a latent plan. This results in a single task-agnostic policy conditioned on a user-provided goal. This policy can perform a variety of tasks in the environment where playing was observed. We train a single model on 3 hours of unlabeled play data and evaluate it on 18 tasks simply by feeding a goal state corresponding to each task. The baseline model reaches an accuracy of 65\% using 18 specialized policies in 100-shot per task and trained on 1800 expensive demonstrations. Our model completes the tasks with an average of 85\% accuracy using a single policy in zero shots (having never been explicitly trained on these tasks) using cheap unlabeled data. Videos of the performed experiments are available at https://sites.google.com/view/sslmp View details
    Preview abstract Algorithms for imitation learning based on adversarial optimization, such as generative adversarial imitation learning (GAIL) and adversarial inverse reinforcement learning (AIRL), can effectively mimic demonstrated behaviours by employing both reward and reinforcement learning (RL). However, applications of such algorithms are challenged by the inherent instability and poor sample efficiency of on-policy RL. In particular, the inadequate handling of absorbing states in canonical implementations of RL environments causes an implicit bias in reward functions used by these algorithms. While these biases might work well for some environments, they lead to sub-optimal behaviors in others. Moreover, despite the ability of these algorithms to learn from a few demonstrations, they require a prohibitively large number of the environment interactions for many real-world applications. To address these issues, we first propose to extend the environment MDP with absorbing states which leads to task-independent, and more importantly, unbiased rewards. Secondly, we introduce an off-policy learning algorithm, which we refer to as Discriminator-Actor-Critic. We demonstrate the effectiveness of proper handling of absorbing states, while empirically improving the sample efficiency by an average factor of 10. Our implementation is available online. View details
    Preview abstract Recently, deep learning based models have pushed the state-of-the-art performance for the task of action recognition in videos. Yet, for many large-scale datasets like Kinetics and UCF101, the correct temporal order of frames doesn't seem to be essential to solving the task. We find that the temporal order matters more for the recently introduced 20BN Something-Something dataset where the task of fine-grained action recognition necessitates the model to do temporal reasoning. We show that when temporal order matters, recurrent models can significantly outperform non-recurrent models. This also provides us with an opportunity to inspect the recurrent units using qualitative approaches to get more insight into what they are encoding about actions in videos. View details
    Preview abstract We present a box-free bottom-up approach for the tasks of pose estimation and instance segmentation of people in multi-person images using an efficient single-shot model. The proposed PersonLab model tackles both semantic-level reasoning and object-part associations using part-based modeling. Our model employs a convolutional network which learns to detect individual keypoints and predict their relative displacements, allowing us to group keypoints into person pose instances. Further, we propose a part-induced geometric embedding descriptor which allows us to associate semantic person pixels with their corresponding person instance, delivering instance-level person segmentations. Our system is based on a fully-convolutional architecture and allows for efficient inference, with runtime essentially independent of the number of people present in the scene. Trained on COCO data alone, our system achieves COCO test-dev keypoint average precision of 0.665 using single-scale inference and 0.687 using multi-scale inference, significantly outperforming all previous bottom-up pose estimation systems. We are also the first bottom-up method to report competitive results for the person class in the COCO instance segmentation task, achieving a person category average precision of 0.417. View details
    Preview abstract In this work we explore a new approach for robots to teach themselves about the world simply by observing it. In particular we investigate the effectiveness of learning task-agnostic representations for continuous control tasks. We extend Time-Contrastive Networks (TCN) that learn from visual observations by embedding multiple frames jointly in the embedding space as opposed to a single frame. We show that by doing so, we are now able to encode both position and velocity attributes significantly more accurately. We test the usefulness of this self-supervised approach in a reinforcement learning setting. We show that the representations learned by agents observing themselves take random actions, or other agents perform tasks successfully, can enable the learning of continuous control policies using algorithms like Proximal Policy Optimization (PPO) using only the learned embeddings as input. View details
    Preview abstract This paper presents KeypointNet, an end-to-end geometric reasoning framework to learn an optimal set of category-specific 3D keypoints, along with their detectors. Given a single image, KeypointNet extracts 3D keypoints that are optimized for a downstream task. We demonstrate this framework on 3D pose estimation by proposing a differentiable objective that seeks the optimal set of keypoints for recovering the relative pose between two views of an object. Our model discovers geometrically and semantically consistent keypoints across viewing angles and instances of an object category. Importantly, we find that our end-to-end framework using no ground-truth keypoint annotations outperforms a fully supervised baseline using the same neural network architecture on the task of pose estimation. The discovered 3D keypoints on the car, chair, and plane categories of ShapeNet are visualized at keypointnet.github.io. View details
    Preview abstract In this work we explore a new approach for robots to teach themselves about the world simply by observing it. In particular we investigate the effectiveness of learning task-agnostic representations for continuous control tasks. We extend Time-Contrastive Networks (TCN) that learn from visual observations by embedding multiple frames jointly in the embedding space as opposed to a single frame. We show that by doing so, we are now able to encode both position and velocity attributes significantly more accurately. We test the usefulness of this self-supervised approach in a reinforcement learning setting. We show that the representations learned by agents observing themselves take random actions, or other agents perform tasks successfully, can enable the learning of continuous control policies using algorithms like Proximal Policy Optimization (PPO) using only the learned embeddings as input. We also demonstrate significant improvements on the real-world Pouring dataset with a relative error reduction of 39.4% for motion attributes and 11.1% for static attributes compared to the single-frame baseline. Video results are available at this https URL . View details
    Accelerating Eulerian Fluid Simulation With Convolutional Networks
    Kristofer Schlachter
    Pablo Sprechmann
    Ken Perlin
    ICML (2017)
    Preview abstract Real-time simulation of fluid and smoke is a long standing problem in computer graphics, where state-of-the-art approaches require large compute resources, making real-time applications often impractical. In this work, we propose a data-driven approach that leverages the approximation power of deep-learning methods with the precision of standard fluid solvers to obtain both fast and highly realistic simulations. The proposed method solves the incompressible Euler equations following the standard operator splitting method in which a large, often ill-condition linear system must be solved. We propose replacing this system by learning a Convolutional Network (ConvNet) from a training set of simulations using a semi-supervised learning method to minimize long-term velocity divergence. ConvNets are amenable to efficient GPU implementations and, unlike exact iterative solvers, have fixed computational complexity and latency. The proposed hybrid approach restricts the learning task to a linear projection without modeling the well understood advection and body forces. We present real-time 2D and 3D simulations of fluids and smoke; the obtained results are realistic and show good generalization properties to unseen geometry. View details
    Preview abstract We propose a method for multi-person detection and 2-D keypoint localization (human pose estimation) that achieves state-of-the-art results on the challenging COCO keypoints task. It is a simple, yet powerful, top-down approach consisting of two stages. In the first stage, we predict the location and scale of boxes which are likely to contain people; for this we use the Faster RCNN detector with an Inception-ResNet architecture. In the second stage, we estimate the keypoints of the person potentially contained in each proposed bounding box. For each keypoint type we predict dense heatmaps and offsets using a fully convolutional ResNet. To combine these outputs we introduce a novel aggregation procedure to obtain highly localized keypoint predictions. We also use a novel form of keypoint-based Non-Maximum-Suppression (NMS), instead of the cruder box-level NMS, and a novel form of keypoint-based confidence score estimation, instead of box-level scoring. Our final system achieves average precision of 0.636 on the COCO test-dev set and the 0.628 test-standard sets, outperforming the CMU-Pose winner of the 2016 COCO keypoints challenge. Further, by using additional labeled data we obtain an even higher average precision of 0.668 on the test-dev set and 0.658 on the test-standard set, thus achieving a roughly 10% improvement over the previous best performing method on the same challenge. View details
    Preview abstract In this paper, we examine the problem of robotic manipulation of granular media. We evaluate multiple predictive models used to infer the dynamics of scooping and dumping actions. These models are evaluated on a task that involves manipulating the media in order to deform it into a desired shape. Our best performing model is based on a highly-tailored convolutional network architecture with domain-specific optimizations, which we show accurately models the physical interaction of the robotic scoop with the underlying media. We empirically demonstrate that explicitly predicting physical mechanics results in a policy that out-performs both a hand-crafted dynamics baseline, and a “value-network”, which must otherwise implicitly predict the same mechanics in order to produce accurate value estimates. View details
    No Results Found