Alessio Tonioni

I’m a researcher in computer vision and deep learning, currently working with Federico Tombari at Google Zurich. Previously I was a postdoc at the Computer Vision Lab of the University of Bologna under the supervision of Professor Luigi Di Stefano. I received my PhD in Computer Science and Engineering from the University of Bologna in April 2019. During my PhD I worked on deep learning solutions for product detection and recognition in retail environments and on deep learning applied to depth estimation from stereo and monocular cameras.
Authored Publications
    TextMesh: Generation of Realistic 3D Meshes From Text Prompts
    Christina Tsalicoglou
    Fabian Manhardt
    Michael Niemeyer
    3DV 2024 (2024)
    The ability to generate highly realistic 2D images from mere text prompts has recently made huge progress in terms of speed and quality, thanks to the advent of image diffusion models. Naturally, the question arises if this can be also achieved in the generation of 3D content from such text prompts. To this end, a new line of methods recently emerged trying to harness diffusion models, trained on 2D images, for supervision of 3D model generation using view dependent prompts. While achieving impressive results, these methods, however, have two major drawbacks. First, rather than commonly used 3D meshes, they instead generate neural radiance fields (NeRFs), making them impractical for most real applications. Second, these approaches tend to produce over-saturated models, giving the output a cartoonish looking effect. Therefore, in this work we propose a novel method for generation of highly realistic-looking 3D meshes. To this end, we extend NeRF to employ an SDF backbone, leading to improved 3D mesh extraction. In addition, we propose a novel way to finetune the mesh texture, removing the effect of high saturation and improving the details of the output 3D mesh.
    TouchSDF: A DeepSDF Approach for 3D Shape Reconstruction Using Vision-Based Tactile Sensing
    Mauro Comi
    Yijiong Lin
    Alex Church
    Laurence Aitchison
    Nathan Lepora
    IEEE Robotics and Automation Letters (2024)
    Humans rely on their visual and tactile senses to develop a comprehensive 3D understanding of their physical environment. Recently, there has been a growing interest in exploring and manipulating objects using data-driven approaches that utilise high-resolution vision-based tactile sensors. However, 3D shape reconstruction using tactile sensing has lagged behind visual shape reconstruction because of limitations in existing techniques, including the inability to generalise over unseen shapes, the absence of real-world testing, and limited expressive capacity imposed by discrete representations. To address these challenges, we propose TouchSDF, a Deep Learning approach for tactile 3D shape reconstruction that leverages the rich information provided by a vision-based tactile sensor and the expressivity of the implicit neural representation DeepSDF. Our technique consists of two components: (1) a Convolutional Neural Network that maps tactile images into local meshes representing the surface at the touch location, and (2) an implicit neural function that predicts a signed distance function to extract the desired 3D shape. This combination allows TouchSDF to reconstruct smooth and continuous 3D shapes from tactile inputs in simulation and real-world settings, opening up research avenues for robust 3D-aware representations and improved multimodal perception in robotics. Code and supplementary material are available at: this https URL
    NeRF-GAN Distillation for Efficient 3D-Aware Generation with Convolutions
    Mohamad Shahbazi
    Evangelos Ntavelis
    Edo Collins
    Danda Pani Paudel
    Martin Danelljan
    Luc Van Gool
    Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops (2023)
    Pose-conditioned convolutional generative models struggle with high-quality 3D-consistent image generation from single-view datasets, due to their lack of sufficient 3D priors. Recently, the integration of Neural Radiance Fields (NeRFs) and generative models, such as Generative Adversarial Networks (GANs), has transformed 3D-aware generation from single-view images. NeRF-GANs exploit the strong inductive bias of neural 3D representations and volumetric rendering at the cost of higher computational complexity. This study aims at revisiting pose-conditioned 2D GANs for efficient 3D-aware generation at inference time by distilling 3D knowledge from pretrained NeRF-GANs. We propose a simple and effective method, based on re-using the well-disentangled latent space of a pre-trained NeRF-GAN in a pose-conditioned convolutional network to directly generate 3D-consistent images corresponding to the underlying 3D representations. Experiments on several datasets demonstrate that the proposed method obtains results comparable with volumetric rendering in terms of quality and 3D consistency while benefiting from the computational advantage of convolutional networks. The code is available at: https://github.com/mshahbazi72/NeRF-GAN-Distillation
    NeRF-Supervised Deep Stereo
    Fabio Tosi
    Daniele De Gregorio
    Matteo Poggi
    Computer Vision and Pattern Recognition (2023)
    We introduce a novel framework for training deep stereo networks effortlessly and without any ground-truth. By leveraging state-of-the-art neural rendering solutions, we generate stereo training data from image sequences collected with a single handheld camera. On top of them, a NeRF-supervised training procedure is carried out, from which we exploit rendered stereo triplets to compensate for occlusions and depth maps as proxy labels. This results in stereo networks capable of predicting sharp and detailed disparity maps. Experimental results show that models trained under this regime yield a 30-40% improvement over existing self-supervised methods on the challenging Middlebury dataset, filling the gap to supervised models and, most times, outperforming them at zero-shot generalization.
    LatentSwap3D: Swapping Latent Codes for Semantic Edits
    Enis Simsar
    Evin Pınar Örnek
    Proceedings of the IEEE/CVF International Conference on Computer Vision (2023)
    3D GANs have the ability to generate latent codes for entire 3D volumes rather than only 2D images. These models offer desirable features like high-quality geometry and multi-view consistency, but, unlike their 2D counterparts, complex semantic image editing tasks for 3D GANs have only been partially explored. To address this problem, we propose LatentSwap3D, a semantic edit approach based on latent space discovery that can be used with any off-the-shelf 3D or 2D GAN model and on any dataset. LatentSwap3D relies on identifying the latent code dimensions corresponding to specific attributes by feature ranking using a random forest classifier. It then performs the edit by swapping the selected dimensions of the image being edited with the ones from an automatically selected reference image. Compared to other latent space control-based edit methods, which were mainly designed for 2D GANs, our method on 3D GANs provides remarkably consistent semantic edits in a disentangled manner and outperforms others both qualitatively and quantitatively. We show results on seven 3D GANs (π-GAN, GIRAFFE, StyleSDF, MVCGAN, EG3D, StyleNeRF, and VolumeGAN) and on five datasets (FFHQ, AFHQ, Cats, MetFaces, and CompCars).
    Learning good features to transfer across tasks and domains
    Adriano Cardace
    Luca De Luigi
    Luigi Di Stefano
    Pierluigi Zama Ramirez
    Samuele Salti
    IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) (to appear)
    The availability of labelled data is the major obstacle to the deployment of deep learning algorithms to solve computer vision tasks in new domains. Recent works have shown that it is possible to leverage on correlations between features learned by neural networks for different tasks on different domains to reduce the need for full supervision. This is achieved by learning to transfer features across both tasks and domains. In this work, we show how constraining the structure of the source and target feature space is the key to improve the performances of such a transfer framework. In particular, we demonstrate the benefits of: learning features able to capture fine-grain details of the image and aligning the space across tasks by means of an auxiliary task; aligning the feature spaces across domains by means of a novel norm discrepancy loss. We achieve state of the art results in synthetic-to-real adaptation scenarios for this novel setting.
    Continual Adaptation for Deep Stereo
    Fabio Tosi
    Luigi Di Stefano
    Matteo Poggi
    Stefano Mattoccia
    IEEE Transactions on Pattern Analysis and Machine Intelligence (2021) (to appear)
    Depth estimation from stereo images is carried out with unmatched results by convolutional neural networks trained end-to-end to regress dense disparities. Like for most tasks, this is possible if large amounts of labelled samples are available for training, possibly covering the whole data distribution encountered at deployment time. Being such an assumption systematically unmet in real applications, the capacity of adapting to any unseen setting becomes of paramount importance. Purposely, we propose a continual adaptation paradigm for deep stereo networks designed to deal with challenging and ever-changing environments. We design a lightweight and modular architecture, Modularly ADaptive Network (MADNet), and formulate Modular ADaptation algorithms (MAD, MAD++) which permit efficient optimization of independent sub-portions of the entire network. In our paradigm, the learning signals needed to continuously adapt models online can be sourced from self-supervision via right-to-left image warping or from traditional stereo algorithms. With both sources, no other data than the input images being gathered at deployment time are needed. Thus, our network architecture and adaptation algorithms realize the first real-time self-adaptive deep stereo system and pave the way for a new paradigm that can facilitate practical deployment of end-to-end architectures for dense disparity regression.
    A Divide et Impera Approach for 3D Shape Reconstruction from Multiple Views
    Riccardo Spezialetti
    David Joseph New Tan
    Keisuke Tateno
    International Virtual Conference on 3D Vision (2020) (to appear)
    Estimating the 3D shape of an object from a single or multiple images has gained popularity thanks to the recent breakthroughs powered by deep learning. Most approaches regress the full object shape in a canonical pose, possibly extrapolating the occluded parts based on the learned priors. However, their viewpoint invariant technique often discards the unique structures visible from the input images. In contrast, this paper proposes to rely on viewpoint variant reconstructions by merging the visible information from the given views. Our approach is divided into three steps. Starting from the sparse views of the object, we first align them into a common coordinate system by estimating the relative pose between all the pairs. Then, inspired by the traditional voxel carving, we generate an occupancy grid of the object taken from the silhouette on the images and their relative poses. Finally, we refine the initial reconstruction to build a clean 3D model which preserves the details from each viewpoint. To validate the proposed method, we perform a comprehensive evaluation on the ShapeNet reference benchmark in terms of relative pose estimation and 3D shape reconstruction.