Alessio Tonioni
I’m a researcher in computer vision and deep learning, currently working with Federico Tombari at Google Zurich. Previously, I was a postdoc at the Computer Vision Lab of the University of Bologna under the supervision of Professor Luigi Di Stefano.
I received my PhD in Computer Science and Engineering from the University of Bologna in April 2019. During my PhD, I worked on deep learning solutions for product detection and recognition in retail environments and on deep learning applied to depth estimation from stereo and monocular cameras.
Authored Publications
TouchSDF: A DeepSDF Approach for 3D Shape Reconstruction Using Vision-Based Tactile Sensing
Mauro Comi
Yijiong Lin
Alex Church
Laurence Aitchison
Nathan Lepora
IEEE Robotics and Automation Letters (2024)
Humans rely on their visual and tactile senses to develop a comprehensive 3D understanding of their physical environment. Recently, there has been a growing interest in exploring and manipulating objects using data-driven approaches that utilise high-resolution vision-based tactile sensors. However, 3D shape reconstruction using tactile sensing has lagged behind visual shape reconstruction because of limitations in existing techniques, including the inability to generalise over unseen shapes, the absence of real-world testing, and limited expressive capacity imposed by discrete representations. To address these challenges, we propose TouchSDF, a Deep Learning approach for tactile 3D shape reconstruction that leverages the rich information provided by a vision-based tactile sensor and the expressivity of the implicit neural representation DeepSDF. Our technique consists of two components: (1) a Convolutional Neural Network that maps tactile images into local meshes representing the surface at the touch location, and (2) an implicit neural function that predicts a signed distance function to extract the desired 3D shape. This combination allows TouchSDF to reconstruct smooth and continuous 3D shapes from tactile inputs in simulation and real-world settings, opening up research avenues for robust 3D-aware representations and improved multimodal perception in robotics. Code and supplementary material are available at: this https URL
TextMesh: Generation of Realistic 3D Meshes From Text Prompts
Christina Tsalicoglou
Fabian Manhardt
Michael Niemeyer
3DV 2024 (2024)
The ability to generate highly realistic 2D images from mere text prompts has recently made huge progress in terms of speed and quality, thanks to the advent of image diffusion models. Naturally, the question arises if this can be also achieved in the generation of 3D content from such text prompts. To this end, a new line of methods recently emerged trying to harness diffusion models, trained on 2D images, for supervision of 3D model generation using view dependent prompts. While achieving impressive results, these methods, however, have two major drawbacks. First, rather than commonly used 3D meshes, they instead generate neural radiance fields (NeRFs), making them impractical for most real applications. Second, these approaches tend to produce over-saturated models, giving the output a cartoonish looking effect. Therefore, in this work we propose a novel method for generation of highly realistic-looking 3D meshes. To this end, we extend NeRF to employ an SDF backbone, leading to improved 3D mesh extraction. In addition, we propose a novel way to finetune the mesh texture, removing the effect of high saturation and improving the details of the output 3D mesh.
Learning good features to transfer across tasks and domains
Adriano Cardace
Luca De Luigi
Luigi Di Stefano
Pierluigi Zama Ramirez
Samuele Salti
IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) (to appear)
The availability of labelled data is the major obstacle to the deployment of deep learning algorithms to solve computer vision tasks in new domains. Recent works have shown that it is possible to leverage correlations between features learned by neural networks for different tasks on different domains to reduce the need for full supervision. This is achieved by learning to transfer features across both tasks and domains. In this work, we show how constraining the structure of the source and target feature space is key to improving the performance of such a transfer framework. In particular, we demonstrate the benefits of: learning features able to capture fine-grained details of the image and aligning the space across tasks by means of an auxiliary task; aligning the feature spaces across domains by means of a novel norm discrepancy loss. We achieve state-of-the-art results in synthetic-to-real adaptation scenarios for this novel setting.
NeRF-Supervised Deep Stereo
Fabio Tosi
Daniele De Gregorio
Matteo Poggi
Computer Vision and Pattern Recognition (2023)
We introduce a novel framework for training deep stereo networks effortlessly and without any ground-truth. By leveraging state-of-the-art neural rendering solutions, we generate stereo training data from image sequences collected with a single handheld camera. On top of them, a NeRF-supervised training procedure is carried out, from which we exploit rendered stereo triplets to compensate for occlusions and depth maps as proxy labels. This results in stereo networks capable of predicting sharp and detailed disparity maps. Experimental results show that models trained under this regime yield a 30-40% improvement over existing self-supervised methods on the challenging Middlebury dataset, filling the gap to supervised models and, most times, outperforming them at zero-shot generalization.
NeRF-GAN Distillation for Efficient 3D-Aware Generation with Convolutions
Mohamad Shahbazi
Evangelos Ntavelis
Edo Collins
Danda Pani Paudel
Martin Danelljan
Luc Van Gool
Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops (2023)
Pose-conditioned convolutional generative models struggle with high-quality 3D-consistent image generation from single-view datasets, due to their lack of sufficient 3D priors. Recently, the integration of Neural Radiance Fields (NeRFs) and generative models, such as Generative Adversarial Networks (GANs), has transformed 3D-aware generation from single-view images. NeRF-GANs exploit the strong inductive bias of neural 3D representations and volumetric rendering at the cost of higher computational complexity. This study aims at revisiting pose-conditioned 2D GANs for efficient 3D-aware generation at inference time by distilling 3D knowledge from pretrained NeRF-GANs. We propose a simple and effective method, based on re-using the well-disentangled latent space of a pre-trained NeRF-GAN in a pose-conditioned convolutional network to directly generate 3D-consistent images corresponding to the underlying 3D representations. Experiments on several datasets demonstrate that the proposed method obtains results comparable with volumetric rendering in terms of quality and 3D consistency while benefiting from the computational advantage of convolutional networks. The code is available at: https://github.com/mshahbazi72/NeRF-GAN-Distillation
LatentSwap3D: Swapping Latent Codes for Semantic Edits
Enis Simsar
Evin Pınar Örnek
Proceedings of the IEEE/CVF International Conference on Computer Vision (2023)
3D GANs have the ability to generate latent codes for entire 3D volumes rather than only 2D images. These models offer desirable features like high-quality geometry and multi-view consistency, but, unlike their 2D counterparts, complex semantic image editing tasks for 3D GANs have only been partially explored. To address this problem, we propose LatentSwap3D, a semantic edit approach based on latent space discovery that can be used with any off-the-shelf 3D or 2D GAN model and on any dataset. LatentSwap3D relies on identifying the latent code dimensions corresponding to specific attributes by feature ranking using a random forest classifier. It then performs the edit by swapping the selected dimensions of the image being edited with the ones from an automatically selected reference image. Compared to other latent space control-based edit methods, which were mainly designed for 2D GANs, our method on 3D GANs provides remarkably consistent semantic edits in a disentangled manner and outperforms others both qualitatively and quantitatively. We show results on seven 3D GANs (π-GAN, GIRAFFE, StyleSDF, MVCGAN, EG3D, StyleNeRF, and VolumeGAN) and on five datasets (FFHQ, AFHQ, Cats, MetFaces, and CompCars).
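The swapping step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `swap_latent_dims`, the toy latent codes, and the importance scores are all hypothetical, and the scores are assumed to have been computed beforehand (the abstract obtains the ranking from a random forest classifier, which is not reproduced here).

```python
def swap_latent_dims(source, reference, importance, k):
    """Copy the k highest-ranked latent dimensions from reference into source.

    source, reference: latent codes as equal-length lists of floats.
    importance: one precomputed score per dimension (assumed given).
    Returns the edited code and the sorted indices that were swapped.
    """
    if not (len(source) == len(reference) == len(importance)):
        raise ValueError("all inputs must have the same length")
    # Indices of the k dimensions with the highest importance scores.
    ranked = sorted(range(len(importance)),
                    key=lambda i: importance[i], reverse=True)
    top = ranked[:k]
    edited = list(source)  # leave the original code untouched
    for i in top:
        edited[i] = reference[i]
    return edited, sorted(top)


# Toy values, purely illustrative.
source = [0.1, -0.4, 0.7, 0.2]
reference = [0.9, 0.5, -0.3, -0.8]
importance = [0.05, 0.60, 0.10, 0.25]

edited, swapped = swap_latent_dims(source, reference, importance, k=2)
print(edited, swapped)
```

Only the selected dimensions change, so the remaining dimensions of the source code are carried over unmodified.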
Continual Adaptation for Deep Stereo
Fabio Tosi
Luigi Di Stefano
Matteo Poggi
Stefano Mattoccia
IEEE Transactions on Pattern Analysis and Machine Intelligence (2021) (to appear)
Depth estimation from stereo images is carried out with unmatched results by convolutional neural networks trained end-to-end to regress dense disparities. As for most tasks, this is possible if large amounts of labelled samples are available for training, possibly covering the whole data distribution encountered at deployment time. Since this assumption is systematically unmet in real applications, the capacity to adapt to any unseen setting becomes of paramount importance. To this end, we propose a continual adaptation paradigm for deep stereo networks designed to deal with challenging and ever-changing environments. We design a lightweight and modular architecture, Modularly ADaptive Network (MADNet), and formulate Modular ADaptation algorithms (MAD, MAD++) which permit efficient optimization of independent sub-portions of the entire network. In our paradigm, the learning signals needed to continuously adapt models online can be sourced from self-supervision via right-to-left image warping or from traditional stereo algorithms. With both sources, no other data than the input images being gathered at deployment time are needed. Thus, our network architecture and adaptation algorithms realize the first real-time self-adaptive deep stereo system and pave the way for a new paradigm that can facilitate practical deployment of end-to-end architectures for dense disparity regression.
A Divide et Impera Approach for 3D Shape Reconstruction from Multiple Views
Riccardo Spezialetti
David Joseph New Tan
Keisuke Tateno
International Virtual Conference on 3D Vision (2020) (to appear)
Estimating the 3D shape of an object from a single or multiple images has gained popularity thanks to the recent breakthroughs powered by deep learning. Most approaches regress the full object shape in a canonical pose, possibly extrapolating the occluded parts based on the learned priors. However, their viewpoint invariant technique often discards the unique structures visible from the input images. In contrast, this paper proposes to rely on viewpoint variant reconstructions by merging the visible information from the given views. Our approach is divided into three steps. Starting from the sparse views of the object, we first align them into a common coordinate system by estimating the relative pose between all the pairs. Then, inspired by the traditional voxel carving, we generate an occupancy grid of the object taken from the silhouette on the images and their relative poses. Finally, we refine the initial reconstruction to build a clean 3D model which preserves the details from each viewpoint. To validate the proposed method, we perform a comprehensive evaluation on the ShapeNet reference benchmark in terms of relative pose estimation and 3D shape reconstruction.
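As a rough illustration of the traditional voxel carving that inspires the second step, the sketch below builds an occupancy grid from binary silhouettes. It makes strong simplifying assumptions that are not in the paper: exactly three orthographic, axis-aligned views stand in for the estimated relative poses, and the function name `carve` and the toy silhouettes are hypothetical.

```python
def carve(sil_xy, sil_xz, sil_yz, n):
    """Return an n x n x n occupancy grid from three binary silhouettes.

    Each silhouette is an n x n list of 0/1 values seen along one axis
    (orthographic, axis-aligned views; an assumption of this sketch).
    A voxel stays occupied only if its projection falls inside every
    silhouette; otherwise it is carved away.
    """
    return [[[bool(sil_xy[x][y] and sil_xz[x][z] and sil_yz[y][z])
              for z in range(n)]
             for y in range(n)]
            for x in range(n)]


n = 3
# Two views see a fully covered image; the third sees only the centre pixel.
full = [[1] * n for _ in range(n)]
center = [[1 if (i == 1 and j == 1) else 0 for j in range(n)]
          for i in range(n)]

grid = carve(center, full, full, n)
occupied = [(x, y, z)
            for x in range(n) for y in range(n) for z in range(n)
            if grid[x][y][z]]
print(occupied)
```

With these toy inputs, only the column of voxels behind the single centre pixel survives, which is the expected behaviour of silhouette-based carving.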