Eduard Trulls

Eduard Trulls is a Research Scientist at Google Zurich, working on machine learning for visual recognition. Before that, he was a postdoctoral researcher at the Computer Vision Lab at EPFL in Lausanne, Switzerland, working with Pascal Fua. He obtained his PhD from the Institute of Robotics in Barcelona, Spain, co-advised by Francesc Moreno and Alberto Sanfeliu. Before his PhD, he worked in mobile robotics.
Authored Publications
    TUSK: Task-Agnostic Unsupervised Keypoints
    Yuhe Jin
    Weiwei Sun
    Jan Hosang
    Kwang Moo Yi
    Thirty-sixth Conference on Neural Information Processing Systems (2022)
    Abstract: Existing unsupervised methods for keypoint learning rely heavily on the assumption that a specific keypoint type (e.g. elbow, digit, abstract geometric shape) appears only once in an image. This greatly limits their applicability, as each instance must be isolated before applying the method, an issue which is never discussed or evaluated. We propose a novel method to learn Task-agnostic, UnSupervised Keypoints (TUSK) which can deal with multiple instances. To achieve this, instead of the commonly-used strategy of detecting multiple heatmaps, each dedicated to a specific keypoint type, we use a single heatmap for detection, and enable unsupervised learning of keypoint types through clustering. Specifically, we encode semantics into the keypoints by teaching them to reconstruct images from a sparse set of keypoints and their descriptors, where the descriptors are forced to form distinct clusters in feature space around learned prototypes. This makes our approach amenable to a wider range of tasks than any previous unsupervised keypoint method: we show experiments on multiple-instance detection and classification, object discovery, and landmark detection, all unsupervised, with performance on par with the state of the art, while also being able to deal with multiple instances.
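The clustering mechanism in the abstract above can be illustrated with a toy sketch: keypoint descriptors are softly assigned to a set of learned prototypes and pulled toward them, so that keypoint "types" emerge from the data without labels. The NumPy snippet below is a minimal illustration of that idea; the function name and the soft-assignment form are assumptions for illustration, not the paper's implementation (which also trains detection and image reconstruction jointly).

import numpy as np

def prototype_clustering_loss(descriptors, prototypes, temperature=0.1):
    # descriptors: (N, D) keypoint descriptors; prototypes: (K, D) learned centres.
    # Squared distance from every descriptor to every prototype: (N, K).
    d2 = ((descriptors[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
    # Soft assignment of each descriptor to the prototypes (softmax over -d2).
    logits = -d2 / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    assign = np.exp(logits)
    assign = assign / assign.sum(axis=1, keepdims=True)
    # Expected distance to the assigned prototype; minimizing it forms clusters.
    return (assign * d2).sum(axis=1).mean()

# Toy usage: 32 random 16-D descriptors, 5 prototypes.
rng = np.random.default_rng(0)
loss = prototype_clustering_loss(rng.normal(size=(32, 16)), rng.normal(size=(5, 16)))
print(f"clustering loss: {loss:.3f}")
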
    COTR: Correspondence Transformer for Matching Across Images
    Wei Jiang
    Jan Hosang
    Kwang Moo Yi
    International Conference on Computer Vision (2021)
    Abstract: We propose a novel framework for finding correspondences in images based on a deep neural network that, given two images and a query point in one of them, finds its correspondence in the other. By doing so, one has the option to query only the points of interest and retrieve sparse correspondences, or to query all points in an image and obtain dense mappings. Importantly, in order to capture both local and global priors, and to let our model relate between image regions using the most relevant among said priors, we realize our network using a transformer. At inference time, we apply our correspondence network by recursively zooming in around the estimates, yielding a multiscale pipeline able to provide highly-accurate correspondences. Our method significantly outperforms the state of the art on both sparse and dense correspondence problems on multiple datasets and tasks, ranging from wide-baseline stereo to optical flow, without any retraining for a specific dataset. We commit to releasing data, code, and all the tools necessary to train from scratch and ensure reproducibility.
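A rough sketch of the recursive zoom-in described above, assuming a matcher whose prediction error shrinks with the crop size: each level re-queries the (here simulated) correspondence network on a smaller crop centred on the previous estimate. predict_match below is a toy stand-in, not the released model.

import numpy as np

TRUE_SHIFT = np.array([41.7, -23.2])   # ground-truth correspondence for the toy matcher

def predict_match(query_xy, crop_half_size, rng):
    # Toy stand-in for the network: prediction error scales with the crop size.
    noise = rng.normal(scale=0.03 * crop_half_size, size=2)
    return query_xy + TRUE_SHIFT + noise

def recursive_zoom_match(query_xy, levels=5, start_half=256.0, zoom=0.5, seed=0):
    # Multiscale inference: re-run the matcher on progressively smaller crops
    # centred on the current estimate, so the estimate gets sharper at each level.
    rng = np.random.default_rng(seed)
    half, estimate = start_half, None
    for level in range(levels):
        estimate = predict_match(query_xy, half, rng)
        err = np.linalg.norm(estimate - (query_xy + TRUE_SHIFT))
        print(f"level {level}: crop half-size {half:5.1f}px, error {err:.2f}px")
        half *= zoom
    return estimate

recursive_zoom_match(np.array([120.0, 300.0]))
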
    ACNe: Attentive Context Normalization for Robust Permutation-Equivariant Learning
    Weiwei Sun
    Wei Jiang
    Kwang Moo Yi
    Computer Vision and Pattern Recognition (CVPR) (2020)
    Abstract: Many problems in computer vision require dealing with sparse, unordered data in the form of point clouds. Permutation-equivariant networks have become a popular solution: they operate on individual data points with simple perceptrons and extract contextual information with global pooling. This can be achieved with a simple normalization of the feature maps, a global operation that is unaffected by the order. In this paper, we propose Attentive Context Normalization (ACN), a simple yet effective technique to build permutation-equivariant networks robust to outliers. Specifically, we show how to normalize the feature maps with weights that are estimated within the network, excluding outliers from this normalization. We use this mechanism to leverage two types of attention: local and global; by combining them, our method is able to find the essential data points in high-dimensional space to solve a given task. We demonstrate through extensive experiments that our approach, which we call Attentive Context Networks (ACNe), provides a significant leap in performance compared to the state-of-the-art on camera pose estimation, robust fitting, and point cloud classification under noise and outliers.
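The normalization step described above can be sketched in a few lines: the feature statistics are computed with attention weights, so points flagged as likely outliers contribute little to the mean and variance. This is a minimal NumPy illustration of that weighted normalization, not the authors' network (which learns the local and global attention inside the model).

import numpy as np

def attentive_context_norm(features, weights, eps=1e-6):
    # features: (N, C) array, one row per data point (e.g. a putative match).
    # weights:  (N,) non-negative attention; higher means more likely an inlier.
    w = weights / (weights.sum() + eps)               # normalize attention to sum to 1
    mean = (w[:, None] * features).sum(axis=0)        # attention-weighted mean
    var = (w[:, None] * (features - mean) ** 2).sum(axis=0)
    return (features - mean) / np.sqrt(var + eps)     # outliers barely affect the statistics

# Toy usage: local attention from a sigmoid gate, global attention from a softmax over the set.
rng = np.random.default_rng(0)
N, C = 100, 8
feats = rng.normal(size=(N, C))
local = 1.0 / (1.0 + np.exp(-rng.normal(size=N)))
global_w = np.exp(rng.normal(size=N))
global_w /= global_w.sum()
normalized = attentive_context_norm(feats, local * global_w)
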
    DISK: Learning local features with policy gradient
    Michal Tyszkiewicz
    Pascal Fua
    Neural Information Processing Systems (2020) (to appear)
    Abstract: Local feature frameworks are difficult to learn in an end-to-end fashion due to the discreteness inherent to the selection and matching of sparse keypoints. We introduce DISK (DIScrete Keypoints), a novel method that overcomes these obstacles by leveraging principles from Reinforcement Learning (RL), optimizing end-to-end for a high number of correct feature matches. Our simple yet expressive probabilistic model lets us keep the training and inference regimes close, while maintaining good enough convergence properties to reliably train from scratch. Our features can be extracted very densely while remaining discriminative, challenging commonly held assumptions about what constitutes a good keypoint, as showcased in Fig. 1, and deliver state-of-the-art results on three public benchmarks.
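The policy-gradient principle the abstract refers to can be illustrated on a single discrete choice: keypoints are sampled from a detection heatmap, a reward is assigned to selections that lead to correct matches, and the gradient reinforces rewarded selections. The sketch below shows a REINFORCE-style gradient for one sampled location; the names and the toy reward are assumptions for illustration, not the DISK implementation.

import numpy as np

def reinforce_heatmap_grad(logits, sampled_idx, reward):
    # logits:      (P,) detection scores over P candidate pixel locations.
    # sampled_idx: index of the location that was sampled as a keypoint.
    # reward:      scalar, e.g. +1 if the keypoint produced a correct match.
    # Returns reward * d log p(sampled_idx) / d logits, the REINFORCE gradient.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    grad_logp = -probs
    grad_logp[sampled_idx] += 1.0          # d log softmax / d logits at the sampled index
    return reward * grad_logp

# Toy usage: sample a keypoint from a tiny flattened "heatmap", pretend it matched correctly.
rng = np.random.default_rng(0)
logits = rng.normal(size=16)
probs = np.exp(logits - logits.max()); probs /= probs.sum()
idx = rng.choice(len(logits), p=probs)
grad = reinforce_heatmap_grad(logits, idx, reward=1.0)
print(grad.round(3))
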
    Image Matching across Wide Baselines: From Paper to Practice
    Yuhe Jin
    Dmytro Mishkin
    Anastasiia Mishchuk
    Jiri Matas
    Pascal Fua
    Kwang Moo Yi
    International Journal of Computer Vision (2020) (to appear)
    Abstract: We introduce a comprehensive benchmark for local features and robust estimation algorithms, focusing on the downstream task, the accuracy of the reconstructed camera pose, as our primary metric. Our pipeline's modular structure allows easy integration, configuration, and combination of different methods and heuristics. This is demonstrated by embedding dozens of popular algorithms and evaluating them, from seminal works to the cutting edge of machine learning research. We show that with proper settings, classical solutions may still outperform the perceived state of the art. Besides establishing the actual state of the art, the conducted experiments reveal unexpected properties of Structure from Motion (SfM) pipelines that can help improve their performance, for both algorithmic and learned methods. Data and code are online, providing an easy-to-use and flexible framework for the benchmarking of local features and robust estimation methods, both alongside and against top-performing methods. This work provides a basis for the Image Matching Challenge.
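A hedged illustration of the modular configuration the abstract describes: a pipeline is assembled from a feature extractor, a matcher, and a robust estimator, and scored by downstream pose accuracy. The keys and values below are hypothetical and only illustrate the idea; they are not the benchmark's actual API.

# Hypothetical pipeline configuration (illustrative only; not the benchmark's API).
pipeline_config = {
    "features": {"method": "sift", "num_keypoints": 2048},
    "matching": {"method": "nearest_neighbour", "ratio_test": 0.8},
    "robust_estimation": {"method": "ransac", "inlier_threshold_px": 0.75},
    "metric": "camera_pose_accuracy",   # the downstream metric used for ranking
}
print(pipeline_config["features"]["method"])
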
    Linearized Multi-Sampling for Differentiable Image Transformation
    Wei Jiang
    Weiwei Sun
    Andrea Tagliasacchi
    Kwang Moo Yi
    International Conference on Computer Vision (2019)
    Abstract: We propose a novel image sampling method for differentiable image transformation in deep neural networks. The sampling schemes currently used in deep learning, such as Spatial Transformer Networks, rely on bilinear interpolation, which performs poorly under severe scale changes, and more importantly, results in poor gradient propagation. This is due to their strict reliance on direct neighbors. Instead, we propose to generate random auxiliary samples in the vicinity of each pixel in the sampled image, and create a linear approximation with their intensity values. We then use this approximation as a differentiable formula for the transformed image. We demonstrate that our approach produces more representative gradients with a wider basin of convergence for image alignment, which leads to considerable performance improvements when training networks for classification tasks. This is not only true under large downsampling, but also when there are no scale changes. We compare our approach with multi-scale sampling and show that we outperform it. We then demonstrate that our improvements to the sampler are compatible with other tangential improvements to Spatial Transformer Networks and that it further improves their performance.
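The sampling scheme above can be sketched for a single query point: draw random auxiliary samples around it, fit a plane to their intensities by least squares, and evaluate the plane at the query location, which yields smooth gradients with respect to the sampling coordinates. A minimal NumPy sketch of that idea follows; it is illustrative, not the paper's implementation, which applies this per pixel inside a spatial transformer.

import numpy as np

def linearized_sample(image, x, y, n_aux=8, radius=1.0, seed=0):
    # Draw auxiliary samples around (x, y), fit I ~ a*x + b*y + c by least squares,
    # and evaluate the plane at (x, y); gradients w.r.t. x and y are simply (a, b).
    rng = np.random.default_rng(seed)
    h, w = image.shape
    ax = np.clip(x + rng.normal(scale=radius, size=n_aux), 0, w - 1)
    ay = np.clip(y + rng.normal(scale=radius, size=n_aux), 0, h - 1)
    vals = image[ay.round().astype(int), ax.round().astype(int)]  # nearest-neighbour lookups
    A = np.stack([ax, ay, np.ones(n_aux)], axis=1)
    coeffs, *_ = np.linalg.lstsq(A, vals, rcond=None)             # plane fit over the samples
    a, b, c = coeffs
    return a * x + b * y + c

# Toy usage on a synthetic ramp image where intensity = row + 2 * column.
img = np.add.outer(np.arange(32, dtype=float), 2.0 * np.arange(32, dtype=float))
print(linearized_sample(img, x=10.3, y=5.7))
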
    Beyond Cartesian Representations for Local Descriptors
    Patrick Ebel
    Anastasiia Mishchuk
    Kwang Moo Yi
    Pascal Fua
    ICCV (2019)
    Abstract: The dominant approach for learning local patch descriptors relies on small image regions whose scale must be properly estimated a priori by a keypoint detector. In other words, if two patches are not in correspondence, their descriptors will not match. A strategy often used to alleviate this problem is to “pool” the pixel-wise features over log-polar regions, rather than regularly spaced ones. By contrast, we propose to extract the “support region” directly with a log-polar sampling scheme. We show that this provides us with a better representation by simultaneously oversampling the immediate neighbourhood of the point and undersampling regions far away from it. We demonstrate that this representation is particularly amenable to learning descriptors with deep networks. Our models can match descriptors across a much wider range of scales than was possible before, and also leverage much larger support regions without suffering from occlusions. We report state-of-the-art results on three different datasets.
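The log-polar sampling described above can be sketched as a coordinate grid generator: radii are log-spaced, so the immediate neighbourhood of the keypoint is oversampled and distant regions are undersampled. The snippet below is a small illustration of such a grid (names and defaults are assumptions, not the paper's code); the resulting coordinates would be fed to a standard bilinear sampler to form the descriptor input.

import numpy as np

def log_polar_grid(center_x, center_y, min_radius=1.0, max_radius=32.0,
                   n_rings=8, n_angles=16):
    # Log-spaced rings: dense near the keypoint, sparse far from it.
    radii = np.geomspace(min_radius, max_radius, n_rings)
    angles = np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False)
    r, t = np.meshgrid(radii, angles, indexing="ij")
    xs = center_x + r * np.cos(t)
    ys = center_y + r * np.sin(t)
    return np.stack([xs, ys], axis=-1)        # (n_rings, n_angles, 2) sample coordinates

# Toy usage: generate the sampling grid around a keypoint at (64, 64).
grid = log_polar_grid(64.0, 64.0)
print(grid.shape)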