Pierre Sermanet

Authored Publications
    In recent years, much progress has been made in learning robotic manipulation policies that can follow natural language instructions. Common approaches involve learning methods that operate on offline datasets, such as task-specific teleoperated demonstrations or on hindsight labeled robotic experience. Such methods work reasonably but rely strongly on the assumption of clean data: teleoperated demonstrations are collected with specific tasks in mind, while hindsight language descriptions rely on expensive human labeling. Recently, large-scale pretrained language and vision-language models like CLIP have been applied to robotics in the form of learning representations and planners. However, can these pretrained models also be used to cheaply impart internet-scale knowledge onto offline datasets, providing access to skills contained in the offline dataset that weren't necessarily reflected in ground truth labels? We investigate fine-tuning a reward model on a small dataset of robot interactions with crowd-sourced natural language labels and using the model to relabel instructions of a large offline robot dataset. The resulting dataset with diverse language skills is used to train imitation learning policies, which outperform prior methods by up to 30% when evaluated on a diverse set of novel language instructions that were not contained in the original dataset.
    Robotic Table Tennis: A Case Study into a High Speed Learning System
    Jon Abelian
    Saminda Abeyruwan
    Michael Ahn
    Justin Boyd
    Erwin Johan Coumans
    Omar Escareno
    Wenbo Gao
    Navdeep Jaitly
    Juhana Kangaspunta
    Satoshi Kataoka
    Gus Kouretas
    Yuheng Kuang
    Corey Lynch
    Thinh Nguyen
    Ken Oslund
    Barney J. Reed
    Anish Shankar
    Avi Singh
    Grace Vesom
    Peng Xu
    Robotics: Science and Systems (2023)
    We present a deep-dive into a learning robotic system that, in previous work, was shown to be capable of hundreds of table tennis rallies with a human and has the ability to precisely return the ball to desired targets. This system puts together a highly optimized and novel perception subsystem, a high-speed low-latency robot controller, a simulation paradigm that can prevent damage in the real world and also train policies for zero-shot transfer, and automated real world environment resets that enable autonomous training and evaluation on physical robots. We complement a complete system description including numerous design decisions that are typically not widely disseminated, with a collection of ablation studies that clarify the importance of mitigating various sources of latency, accounting for training and deployment distribution shifts, robustness of the perception system, and sensitivity to policy hyper-parameters and choice of action space. A video demonstrating the components of our system and details of experimental results is included in the supplementary material.
    InnerMonologue: Embodied Reasoning through Planning with Language Models
    Wenlong Huang
    Harris Chan
    Jacky Liang
    Pete Florence
    Andy Zeng
    Igor Mordatch
    Yevgen Chebotar
    Noah Brown
    Tomas Jackson
    Linda Luu
    Sergey Levine
    Karol Hausman
    Brian Andrew Ichter
    Conference on Robot Learning (2022) (to appear)
    Recent works have shown the capabilities of large language models to perform tasks requiring reasoning and to be applied to applications beyond natural language processing, such as planning and interaction for embodied robots. These embodied problems require an agent to understand the repertoire of skills available to a robot and the order in which they should be applied. They also require an agent to understand and ground itself within the environment. In this work we investigate to what extent LLMs can reason over sources of feedback provided through natural language. We propose an inner monologue as a way for an LLM to think through this process and plan. We investigate a variety of sources of feedback, such as success detectors and object detectors, as well as human interaction. The proposed method is validated in a simulation domain and on a real robot. We show that InnerMonologue can successfully replan around failures, and generate new plans to accommodate human intent.
    GoalsEye: Learning High Speed Precision Table Tennis on a Physical Robot
    Saminda Wishwajith Abeyruwan
    Anish Shankar
    Corey Harrison Lynch
    International Conference on Intelligent Robots and Systems (IROS) (2022)
    Learning goal conditioned control in the real world is a challenging open problem in robotics. Reinforcement learning systems have the potential to learn autonomously via trial-and-error, but in practice the costs of manual reward design, ensuring safe exploration, and hyperparameter tuning are often enough to preclude real world deployment. Imitation learning approaches, on the other hand, offer a simple way to learn control in the real world, but typically require costly curated demonstration data and lack a mechanism for continuous improvement. Recently, iterative imitation techniques have been shown to learn goal directed control from undirected demonstration data, and improve continuously via self-supervised goal reaching, but results thus far have been limited to simulated environments. In this work, we present evidence that iterative imitation learning can scale to goal-directed behavior on a real robot in a dynamic setting: high speed, precision table tennis (e.g. "land the ball on this particular target"). We find that this approach offers a straightforward way to do continuous on-robot learning, without complexities such as reward design or sim-to-real transfer. It is also scalable -- sample efficient enough to train on a physical robot in just a few hours. In real world evaluations, we find that the resulting policy can perform on par or better than amateur humans (with players sampled randomly from a robotics lab) at the task of returning the ball to specific targets on the table. Finally, we analyze the effect of an initial undirected bootstrap dataset size on performance, finding that a modest amount of unstructured demonstration data provided up-front drastically speeds up the convergence of a general purpose goal-reaching policy.
    Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
    Alexander Herzog
    Alexander Toshkov Toshev
    Andy Zeng
    Anthony Brohan
    Brian Andrew Ichter
    Byron David
    Chelsea Finn
    Clayton Tan
    Diego Reyes
    Dmitry Kalashnikov
    Eric Victor Jang
    Jarek Liam Rettinghouse
    Jornell Lacanlale Quiambao
    Julian Ibarz
    Karol Hausman
    Kyle Alan Jeffrey
    Linda Luu
    Mengyuan Yan
    Michael Soogil Ahn
    Nicolas Sievers
    Noah Brown
    Omar Eduardo Escareno Cortes
    Peng Xu
    Peter Pastor Sampedro
    Rosario Jauregui Ruano
    Sally Augusta Jesmonth
    Sergey Levine
    Steve Xu
    Yao Lu
    Yevgen Chebotar
    Yuheng Kuang
    Conference on Robot Learning (CoRL) (2022)
    Large language models can encode a wealth of semantic knowledge about the world. Such knowledge could in principle be extremely useful to robots aiming to act upon high-level, temporally extended instructions expressed in natural language. However, a significant weakness of language models is that they lack contextual grounding, which makes it difficult to leverage them for decision making within a given real-world context. For example, asking a language model to describe how to clean a spill might result in a reasonable narrative, but it may not be applicable to a particular agent, such as a robot, that needs to perform this task in a particular environment. We propose to provide this grounding by means of pretrained behaviors, which are used to condition the model to propose natural language actions that are both feasible and contextually appropriate. The robot can act as the language model’s “hands and eyes,” while the language model supplies high-level semantic knowledge about the task. We show how low-level tasks can be combined with large language models so that the language model provides high-level knowledge about the procedures for performing complex and temporally extended instructions, while value functions associated with these tasks provide the grounding necessary to connect this knowledge to a particular physical environment. We evaluate our method on a number of real-world robotic tasks, where we show that this approach is capable of executing long-horizon, abstract, natural-language tasks on a mobile manipulator. The project's website and the video can be found at say-can.github.io.
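The grounding idea in this abstract, where a language model proposes skills and value functions indicate which of them are feasible in the current scene, can be illustrated with a minimal sketch. It assumes the two signals are combined by multiplication and that both are already available as plain numbers; the function name `select_skill`, the skill strings, and all scores are hypothetical and are not taken from the project's code.

```python
def select_skill(llm_scores, affordances):
    """Pick the skill that is both useful for the instruction and feasible now.

    llm_scores:  hypothetical language-model scores, one per candidate skill.
    affordances: hypothetical value-function estimates that the skill can
                 succeed from the current state (missing skills count as 0).
    """
    # Assumption: the two signals are combined by simple multiplication.
    combined = {
        skill: llm_scores[skill] * affordances.get(skill, 0.0)
        for skill in llm_scores
    }
    best = max(combined, key=combined.get)
    return best, combined


# Illustrative numbers only: the language model prefers picking up the sponge,
# but the robot is far from it, so the value function rates that skill low.
llm_scores = {"pick up sponge": 0.5, "go to table": 0.3, "pick up apple": 0.2}
affordances = {"pick up sponge": 0.1, "go to table": 0.9, "pick up apple": 0.8}
best, combined = select_skill(llm_scores, affordances)
print(best)
```

In this toy setting the skill the language model likes best is not selected, because its low feasibility score pulls the combined score below that of a skill the robot can actually execute.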
    We propose a self-supervised approach for learning representations of objects from monocular videos and demonstrate it is particularly useful in situated settings such as robotics. The main contributions of this paper are: 1) a self-supervising objective trained with contrastive learning that can discover and disentangle object attributes from video without using any labels; 2) we leverage object self-supervision for online adaptation: the longer our online model looks at objects in a video, the lower the object identification error, while the offline baseline retains a large fixed error; 3) to explore the possibilities of a system entirely free of human supervision, we let a robot collect its own data, train on this data with our self-supervised scheme, and then show the robot can point to objects similar to the one presented in front of it, demonstrating generalization of object attributes. An interesting and perhaps surprising finding of this approach is that given a limited set of objects, object correspondences will naturally emerge when using contrastive learning without requiring explicit positive pairs. Videos illustrating online object adaptation and robotic pointing are available at this address: https://sites.google.com/view/object-contrastive-networks/home
    The need for understanding periodic videos is pervasive. Videos of biological processes, manufacturing processes, people exercising, objects being manipulated are only a few examples where the respective fields would benefit greatly if they were able to process periodic videos automatically. We present an approach for estimating the period with which an action is repeated in a video. The crux of the approach lies in leveraging temporal self-similarity as an intermediate representation bottleneck that allows generalization to unseen videos in the wild. We train this model with a synthetic dataset from a large unlabeled video dataset by sampling short clips of varying lengths and repeating them with different periods. However, simply training powerful video classification models on this synthetic dataset doesn't transfer to real videos. We constrain the period prediction model to use the self-similarity of temporal representations to ensure that the model generalizes to real videos with repeated actions. This combination of synthetic data and a powerful yet constrained model allows us to predict periods in a class-agnostic fashion. Our repetition counting model substantially exceeds the state of the art performance on existing periodicity benchmarks. We also collect a new challenging dataset called Countix which is more difficult than the existing datasets, capturing difficulties in repetition counting in videos in the real-world. We present extensive experiments on this dataset and hope this encourages more research in this important problem.
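To make the notion of temporal self-similarity in this abstract concrete, the sketch below builds a frame-by-frame similarity matrix from per-frame embeddings and reads a period off it. Two things are assumptions made for illustration: similarity is taken to be negative squared Euclidean distance, and the period is estimated with a naive scan over lags, which stands in for the learned period predictor the abstract describes. The function names and the toy embeddings are hypothetical.

```python
def self_similarity(embeddings):
    """Return an n x n matrix of negative squared distances between frames."""
    n = len(embeddings)
    return [
        [
            -sum((a - b) ** 2 for a, b in zip(embeddings[i], embeddings[j]))
            for j in range(n)
        ]
        for i in range(n)
    ]


def estimate_period(sim, max_lag=None):
    """Naive stand-in for a learned predictor: pick the lag whose diagonal of
    the self-similarity matrix has the highest mean similarity."""
    n = len(sim)
    max_lag = max_lag or n // 2
    best_lag, best_score = None, float("-inf")
    for lag in range(1, max_lag + 1):
        score = sum(sim[i][i + lag] for i in range(n - lag)) / (n - lag)
        # Strict comparison keeps the smallest lag among equally good ones.
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Toy one-dimensional "embeddings" that repeat every 3 frames.
frames = [[0.0], [1.0], [2.0]] * 4
print(estimate_period(self_similarity(frames)))
```

Because the scan only ever sees the similarity matrix and never the raw frames, it reflects the same bottleneck idea in miniature: anything that produces repeating embeddings yields the same periodic pattern, regardless of what the video depicts.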
    Learning Latent Plans from Play
    Corey Harrison Lynch
    Mohi Khansari
    Vikash Kumar
    Sergey Levine
    RSS (2019)
    We propose a self-supervised approach to learning a wide variety of manipulation skills from unlabeled data collected through playing in and interacting within a playground environment. Learning by playing offers three main advantages: 1) Collecting large amounts of play data is cheap and fast as it does not require staging the scene nor labeling data, 2) It relaxes the need to have a discrete and rigid definition of skills/tasks during the data collection. This allows the agent to focus on acquiring a continuum of manipulation skills as a whole, which can then be conditioned to perform a particular skill such as grasping. Furthermore, this data already includes ways to recover, retry or transition between different skills, which can be used to achieve a reactive closed-loop control policy, 3) It allows a new skill to be learned quickly by making use of pre-existing general abilities. Our proposed approach to learning new skills from unlabeled play data decouples high-level planning prediction from low-level action prediction in two steps: first, self-supervised learning of a latent planning space; then, self-supervised learning of an action model that is conditioned on a latent plan. This results in a single task-agnostic policy conditioned on a user-provided goal. This policy can perform a variety of tasks in the environment where playing was observed. We train a single model on 3 hours of unlabeled play data and evaluate it on 18 tasks simply by feeding a goal state corresponding to each task. The baseline model reaches an accuracy of 65% using 18 specialized policies, with 100 shots per task, trained on 1800 expensive demonstrations. Our model completes the tasks with an average of 85% accuracy using a single policy in zero shots (having never been explicitly trained on these tasks) using cheap unlabeled data. Videos of the performed experiments are available at https://sites.google.com/view/sslmp
    We introduce a self-supervised representation learning method based on the task of temporal alignment between videos. The method trains a network using temporal cycle-consistency (TCC), a differentiable cycle-consistency loss that can be used to find correspondences across time in multiple videos. The resulting per-frame embeddings can be used to align videos by simply matching frames using nearest-neighbors in the learned embedding space. To evaluate the power of the embeddings, we densely label the Pouring and Penn Action video datasets for action phases. We show that (i) the learned embeddings enable few-shot classification of these action phases, significantly reducing the supervised training requirements; and (ii) TCC is complementary to other methods of self-supervised learning in videos, such as Shuffle and Learn and Time-Contrastive Networks. The embeddings are also used for a number of applications based on alignment (dense temporal correspondence) between video pairs, including transfer of metadata of synchronized modalities between videos (sounds, temporal semantic labels), synchronized playback of multiple videos, and anomaly detection. Project webpage: https://sites.google.com/view/temporal-cycle-consistency.
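The nearest-neighbor alignment step mentioned in this abstract can be sketched as follows, assuming per-frame embeddings are already available as lists of floats and that distance is squared Euclidean. The cycle check shown here is a hard, non-differentiable version included only to illustrate the idea of cycle-consistency; it is not the differentiable loss the abstract refers to. All names and the toy embeddings are hypothetical.

```python
def nearest(query, candidates):
    """Index of the candidate embedding closest to the query."""
    def sq_dist(c):
        return sum((q - x) ** 2 for q, x in zip(query, c))
    return min(range(len(candidates)), key=lambda j: sq_dist(candidates[j]))


def align(video_a, video_b):
    """For each frame of video_a, the index of its nearest frame in video_b."""
    return [nearest(frame, video_b) for frame in video_a]


def is_cycle_consistent(i, video_a, video_b):
    """Hard cycle check: go from frame i of A to B and back, and see whether
    the round trip lands on frame i again."""
    j = nearest(video_a[i], video_b)
    return nearest(video_b[j], video_a) == i


# Toy one-dimensional embeddings of two videos of the same action; the second
# video has one extra frame at the end.
video_a = [[0.0], [1.0], [2.0]]
video_b = [[0.1], [0.9], [2.2], [3.0]]
print(align(video_a, video_b))
print(all(is_cycle_consistent(i, video_a, video_b) for i in range(len(video_a))))
```

Once frames are matched this way, per-frame metadata attached to one video (for example a phase label) can be copied to the matched frames of the other, which is the kind of transfer the abstract lists among its applications.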
    Recently, deep learning based models have pushed the state-of-the-art performance for the task of action recognition in videos. Yet, for many large-scale datasets like Kinetics and UCF101, the correct temporal order of frames doesn't seem to be essential to solving the task. We find that the temporal order matters more for the recently introduced 20BN Something-Something dataset, where the task of fine-grained action recognition requires the model to perform temporal reasoning. We show that when temporal order matters, recurrent models can significantly outperform non-recurrent models. This also provides us with an opportunity to inspect the recurrent units using qualitative approaches to get more insight into what they are encoding about actions in videos.