Publications

Publishing our work allows us to share ideas and work collaboratively to advance the field of computer science.

1–15 of 1,449 publications
    Abstract: We present PhoMoH, a neural network methodology for constructing generative models of the photo-realistic 3D geometry and appearance of human heads, including hair, beards, the oral cavity, and clothing. In contrast to prior work, PhoMoH models the human head using neural fields, thus supporting complex topology. Instead of learning a head model from scratch, we propose to augment an existing expressive head model with new features. Concretely, we learn a highly detailed geometry network layered on top of a mid-resolution head model, together with a detailed, local geometry-aware, and disentangled color field. Our proposed architecture allows us to learn photo-realistic human head models from relatively little data. The learned generative geometry and appearance networks can be sampled individually and enable the creation of diverse and realistic human heads. Extensive experiments validate our method qualitatively and across different metrics.
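    To make the layering idea concrete, here is a minimal sketch (not the authors' code) in which a detail network predicts a signed-distance residual on top of a mid-resolution base head model; DetailField, the latent code, and the base-model interface are all assumptions.

        # Hypothetical sketch: a high-detail geometry field layered on a
        # mid-resolution base head model, as the PhoMoH abstract describes.
        import torch
        import torch.nn as nn

        class DetailField(nn.Module):
            """Predicts a small signed-distance residual around the base surface."""
            def __init__(self, latent_dim=64, hidden=128):
                super().__init__()
                self.mlp = nn.Sequential(
                    nn.Linear(3 + latent_dim, hidden), nn.ReLU(),
                    nn.Linear(hidden, hidden), nn.ReLU(),
                    nn.Linear(hidden, 1),
                )

            def forward(self, xyz, z_geo):
                # xyz: (N, 3) query points; z_geo: (1, latent_dim) identity code.
                h = torch.cat([xyz, z_geo.expand(xyz.shape[0], -1)], dim=-1)
                return self.mlp(h)

        def layered_sdf(xyz, base_sdf_fn, detail_field, z_geo):
            # Final geometry = mid-resolution base SDF + learned detail residual.
            return base_sdf_fn(xyz) + detail_field(xyz, z_geo)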
    Abstract: We propose Hierarchical Text Spotter (HTS), the first method for the joint task of word-level text spotting and geometric layout analysis. HTS can annotate text in images with a hierarchical representation of four levels: character, word, line, and paragraph. HTS is characterized by two novel components: (1) a Unified-Detector-Polygon (UDP) that produces Bezier-curve polygons of text lines and an affinity matrix for paragraph grouping between detected lines; and (2) a Line-to-Character-to-Word (L2C2W) recognizer that splits lines into characters and then merges them back into words. HTS achieves state-of-the-art results on multiple word-level text spotting benchmark datasets as well as geometric layout analysis tasks. Code will be released upon acceptance.
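    The four-level hierarchy HTS outputs could be represented roughly as follows; a sketch in which all field names are illustrative, not the released annotation format.

        # Sketch of the character -> word -> line -> paragraph hierarchy.
        from dataclasses import dataclass
        from typing import List, Tuple

        @dataclass
        class Character:
            text: str
            box: Tuple[float, float, float, float]  # x0, y0, x1, y1

        @dataclass
        class Word:
            characters: List[Character]
            @property
            def text(self) -> str:
                return "".join(c.text for c in self.characters)

        @dataclass
        class Line:
            words: List[Word]
            bezier_polygon: List[Tuple[float, float]]  # control points from UDP

        @dataclass
        class Paragraph:
            lines: List[Line]  # grouped via the UDP affinity matrix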
    MetaMix: Meta-state Precision Searcher for Mixed-precision Activation Quantization
    Han-Byul Kim
    Joo Hyung Lee
    Sungjoo Yoo
    Hong-Seok Kim
    Proc. The 38th Annual AAAI Conference on Artificial Intelligence (AAAI) (2024)
    Abstract: Mixed-precision quantization of efficient networks often suffers from activation instability during the exploration of bit selections. To address this problem, we propose a novel method called MetaMix, which consists of a bit selection phase and a weight training phase. The bit selection phase alternates between two steps: (1) a mixed-precision-aware weight update, and (2) bit-search training with the fixed mixed-precision-aware weights. Together, these steps reduce activation instability in mixed-precision quantization and enable fast, high-quality bit selection. The weight training phase then exploits the weights and step sizes trained in the bit selection phase and fine-tunes them, thereby offering fast training. Our experiments with efficient and hard-to-quantize networks, i.e., MobileNet v2 and v3 and ResNet-18 on ImageNet, show that our proposed method pushes the boundary of mixed-precision quantization in terms of accuracy vs. operations, outperforming both mixed- and single-precision SOTA methods.
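    A hedged sketch of the two alternating steps in the bit selection phase, assuming a hypothetical model(x, bits=...) interface for a quantized network that can run at a chosen bit-width; this is an illustration of the described procedure, not the authors' code.

        import torch

        def bit_selection_epoch(model, loader, loss_fn, opt_w, opt_b, candidate_bits):
            for x, y in loader:
                # Step 1: mixed-precision-aware weight update, averaging the loss
                # over candidate bit-widths so the shared weights suit all of them.
                opt_w.zero_grad()
                loss_w = torch.stack(
                    [loss_fn(model(x, bits=b), y) for b in candidate_bits]).mean()
                loss_w.backward()
                opt_w.step()

                # Step 2: bit-search update with the weights frozen; only the
                # per-layer bit-selection parameters receive gradients.
                opt_b.zero_grad()
                loss_b = loss_fn(model(x, bits="search"), y)
                loss_b.backward()
                opt_b.step()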
    Abstract: We present SPHEAR, an accurate, differentiable parametric statistical 3D human head model, enabled by a novel 3D registration method based on spherical embeddings. We shift the paradigm away from classical non-rigid registration methods, which operate under various surface priors, increasing reconstruction fidelity and minimizing required human intervention. Additionally, SPHEAR is a complete model that allows sampling not only diverse synthetic head shapes and facial expressions, but also gaze directions, high-resolution color textures, surface normal maps, and haircuts represented in detail as strands. SPHEAR can be used for automatic realistic visual data generation, semantic annotation, and general reconstruction tasks. Compared to state-of-the-art approaches, our components are fast and memory efficient, and experiments support the validity of our design choices and the accuracy of our registration, reconstruction, and generation techniques.
    TextMesh: Generation of Realistic 3D Meshes From Text Prompts
    Christina Tsalicoglou
    Fabian Manhardt
    Michael Niemeyer
    3DV 2024 (2024)
    Abstract: The generation of highly realistic 2D images from mere text prompts has recently made huge progress in terms of speed and quality, thanks to the advent of image diffusion models. Naturally, the question arises whether this can also be achieved for the generation of 3D content from such text prompts. To this end, a new line of methods has recently emerged that harnesses diffusion models, trained on 2D images, to supervise 3D model generation using view-dependent prompts. While achieving impressive results, these methods have two major drawbacks. First, rather than commonly used 3D meshes, they generate neural radiance fields (NeRFs), making them impractical for most real applications. Second, these approaches tend to produce over-saturated models, giving the output a cartoonish look. Therefore, in this work we propose a novel method for the generation of highly realistic-looking 3D meshes. To this end, we extend NeRF to employ an SDF backbone, leading to improved 3D mesh extraction. In addition, we propose a novel way to fine-tune the mesh texture, removing the effect of high saturation and improving the details of the output 3D mesh.
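    To make the SDF-backbone point concrete: instead of predicting volume density directly, the field predicts a signed distance that is converted to a density for volume rendering, and the mesh is later extracted at the zero level set. The sketch below uses a VolSDF-style Laplace-CDF mapping as one common choice; the paper's exact conversion may differ.

        import torch

        def sdf_to_density(sdf, beta=0.05):
            # Laplace CDF of -sdf: high density inside the surface (sdf < 0),
            # decaying smoothly outside. The sdf = 0 level set is the mesh
            # surface, extractable afterwards with marching cubes.
            alpha = 1.0 / beta
            return alpha * torch.where(
                sdf <= 0,
                1.0 - 0.5 * torch.exp(sdf / beta),
                0.5 * torch.exp(-sdf / beta),
            )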
    LFM-3D: Learnable Feature Matching Across Wide Baselines Using 3D Signals
    Arjun Karpur
    Guilherme Perrotta
    Ricardo Martin-Brualla
    Proc. 3DV'24 (2024) (to appear)
    Abstract: Finding localized correspondences across different images of the same object is crucial to understanding its geometry. In recent years, this problem has seen remarkable progress with the advent of deep-learning-based local image features and learnable matchers. Still, learnable matchers often underperform when there exist only small regions of co-visibility between image pairs (i.e., wide camera baselines). To address this problem, we leverage recent progress in coarse single-view geometry estimation methods. We propose LFM-3D, a Learnable Feature Matching framework that uses models based on graph neural networks and enhances their capabilities by integrating noisy, estimated 3D signals to boost correspondence estimation. When integrating 3D signals into the matcher model, we show that a suitable positional encoding is critical to effectively making use of the low-dimensional 3D information. We experiment with two different 3D signals, normalized object coordinates and monocular depth estimates, and evaluate our method on large-scale (synthetic and real) datasets containing object-centric image pairs across wide baselines. We observe strong feature matching improvements compared to 2D-only methods, with up to +6% total recall and +28% precision at fixed recall. Additionally, we demonstrate that the resulting improved correspondences lead to much higher relative posing accuracy for in-the-wild image pairs, up to 8.6% better than the 2D-only approach.
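    The abstract stresses that a suitable positional encoding is critical for the low-dimensional 3D signals. Below is a generic sinusoidal (Fourier-feature) lifting of such a signal; a sketch of a standard encoding, not necessarily the one used in LFM-3D, with illustrative frequencies.

        import torch

        def fourier_encode(x, num_freqs=8):
            """x: (N, D) low-dim 3D signal (e.g. depth or object coordinates)
            -> (N, D * 2 * num_freqs) feature the GNN matcher can consume."""
            freqs = 2.0 ** torch.arange(num_freqs, dtype=x.dtype)  # 1, 2, 4, ...
            angles = x[..., None] * freqs                          # (N, D, F)
            enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
            return enc.flatten(start_dim=-2)                       # (N, D * 2F)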
    Using Early Readouts to Mediate Featural Bias in Distillation
    Durga Sivasubramanian
    Anmol Mekala
    Ganesh Ramakrishnan
    WACV 2024 (2024)
    Abstract: Deep networks tend to learn spurious feature-label correlations in real-world supervised learning tasks. This vulnerability is aggravated in distillation, where a (student) model may have less representational capacity than the corresponding teacher model. Often, knowledge of specific problem features is used to reweight instances and rebalance the learning process. We propose a novel early-readout mechanism whereby we attempt to predict the label using representations from earlier network layers. We show that these early readouts automatically identify problem instances or groups in the form of confident, incorrect predictions. By leveraging these signals to mediate between the teacher logits and the supervised label, we improve group fairness measures across benchmark datasets. We extend our results to the closely related but distinct problem of domain generalization, which also critically depends on the quality of learned features. We provide secondary analyses that bring insight into the role of feature learning in supervision and distillation.
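    A minimal sketch of the early-readout idea, assuming a simple linear probe and a confidence-threshold gating rule (both illustrative, not the authors' exact mechanism): instances the early readout gets confidently wrong are down-weighted when mediating between teacher logits and labels.

        import torch
        import torch.nn.functional as F

        def early_readout_weights(early_feats, labels, probe, tau=0.9):
            """Return per-instance weights in [0, 1]: low when the early
            readout is confident AND wrong, a signature of spurious shortcuts."""
            logits = probe(early_feats)           # linear head on an early layer
            probs = F.softmax(logits, dim=-1)
            conf, pred = probs.max(dim=-1)
            confident_wrong = (pred != labels) & (conf > tau)
            return torch.where(confident_wrong, 1.0 - conf, torch.ones_like(conf))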
    Beyond SOT: Tracking Multiple Generic Objects at Once
    Christoph Mayer
    Martin Danelljan
    Vittorio Ferrari
    Luc Van Gool
    WACV'24 (2024)
    Abstract: Generic Object Tracking (GOT) is the problem of tracking target objects specified by bounding boxes in the first frame of a video. While the task has received much attention over the last decades, researchers have almost exclusively focused on the single-object setting. However, multi-object GOT poses its own challenges and is more attractive in real-world applications. We attribute the lack of research interest in this problem to the absence of suitable benchmarks. In this work, we introduce a new large-scale GOT benchmark, LaGOT, containing multiple annotated target objects per sequence. Our benchmark allows users to tackle key remaining challenges in GOT, aiming to increase robustness and reduce computation through joint tracking of multiple objects simultaneously. In addition, we propose a transformer-based GOT tracker baseline capable of joint processing of multiple objects through shared computation. Our approach achieves a 4× faster run-time for 10 concurrent objects compared to tracking each object independently, and outperforms existing single-object trackers on our new benchmark. In addition, our approach achieves highly competitive results on single-object GOT datasets, setting a new state of the art on TrackingNet with a success-rate AUC of 84.4%. Our benchmark, code, results, and trained models are available at https://github.com/visionml/pytracking.
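    One way to picture the shared-computation design (a speculative sketch, not the released pytracking code): encode each frame once, then let every tracked object contribute only a lightweight query that attends to the shared features, so adding objects barely adds cost.

        import torch
        import torch.nn as nn

        class JointTracker(nn.Module):
            def __init__(self, dim=256, num_heads=8):
                super().__init__()
                # Stand-in patchify encoder; any backbone would do here.
                self.backbone = nn.Conv2d(3, dim, kernel_size=16, stride=16)
                self.cross_attn = nn.MultiheadAttention(dim, num_heads,
                                                        batch_first=True)
                self.box_head = nn.Linear(dim, 4)

            def forward(self, frame, object_queries):
                # frame: (1, 3, H, W); object_queries: (1, K, dim), one per target.
                feats = self.backbone(frame).flatten(2).transpose(1, 2)  # shared once
                out, _ = self.cross_attn(object_queries, feats, feats)   # per-object
                return self.box_head(out)   # (1, K, 4): one box per tracked object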
    Large Scale Self-Supervised Pretraining for Active Speaker Detection
    Alice Chuang
    Keith Johnson
    Tony (Tuấn) Nguyễn
    Wei Xia
    Yunfan Ye
    ICASSP 2024 (2024) (to appear)
    Abstract: In this work we investigate the impact of a large-scale self-supervised pretraining strategy for active speaker detection (ASD), using an unlabeled dataset of over 125k hours of YouTube videos. Compared to a baseline trained from scratch on much smaller in-domain labeled datasets, we show that pretraining not only makes supervised training more stable, thanks to better audio-visual features used for initialization, but also improves ASD mean average precision by 23% on a challenging dataset collected with Google Nest Hub Max devices capturing real user interactions.
    Understanding metric-related pitfalls in image analysis validation
    Annika Reinke
    Lena Maier-Hein
    Paul Jager
    Shravya Shetty
    Understanding Metrics Workgroup
    Nature Methods (2024)
    Abstract: Validation metrics are key for the reliable tracking of scientific progress and for bridging the current chasm between artificial intelligence (AI) research and its translation into practice. However, increasing evidence shows that, particularly in image analysis, metrics are often chosen inadequately in relation to the underlying research problem. This could be attributed to a lack of accessibility of metric-related knowledge: while taking into account the individual strengths, weaknesses, and limitations of validation metrics is a critical prerequisite to making educated choices, the relevant knowledge is currently scattered and poorly accessible to individual researchers. Based on a multi-stage Delphi process conducted by a multidisciplinary expert consortium as well as extensive community feedback, the present work provides the first reliable and comprehensive common point of access to information on pitfalls related to validation metrics in image analysis. Focusing on biomedical image analysis but with the potential of transfer to other fields, the addressed pitfalls generalize across application domains and are categorized according to a newly created, domain-agnostic taxonomy. To facilitate comprehension, illustrations and specific examples accompany each pitfall. As a structured body of information accessible to researchers of all levels of expertise, this work enhances global comprehension of a key topic in image analysis validation.
    Abstract: Hand-drawn doodles present an interesting and difficult set of textures to model and synthesize. Unlike the typical natural images most often used in texture synthesis studies, the doodles examined here are characterized by sharp, irregular, and imperfectly scribbled patterns, frequent imprecise strokes, haphazardly connected edges, and randomly or spatially shifting themes. The almost binary nature of the doodles makes it difficult to hide common mistakes such as discontinuities. Further, there is no color or shading to mask flaws and repetition; any process that relies on region copying, even stochastic copying, is readily discernible. To tackle the problem of synthesizing these textures, we model the underlying generation process of the doodle, taking into account potential unseen, but related, expansion contexts. We demonstrate how to generate infinitely long textures, such that the texture can be extended far beyond a single image's source material. This is accomplished by creating a novel learning mechanism that is taught to condition the generation process on its own generated context (what was generated in previous steps), not just on the original material.
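    A sketch of the self-conditioning loop the abstract describes, assuming a hypothetical generator that maps a context window to the next strip of texture; window size and step count are illustrative.

        import numpy as np

        def extend_texture(seed, generator, window=64, steps=100):
            canvas = seed  # (H, W0) near-binary doodle image
            for _ in range(steps):
                context = canvas[:, -window:]     # condition on own output
                next_strip = generator(context)   # (H, strip_width)
                canvas = np.concatenate([canvas, next_strip], axis=1)
            return canvas  # far longer than the original source material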
    Brain2Music: Reconstructing Music from Human Brain Activity
    Yu Takagi
    Takuya Matsuyama
    Andrea Agostinelli
    Tomoya Nakai
    Christian Frank
    Shinji Nishimoto
    (2023)
    Abstract: The process of reconstructing experiences from human brain activity offers a unique lens into how the brain interprets and represents the world. In this paper, we introduce a method for reconstructing music from brain activity captured using functional magnetic resonance imaging (fMRI). Our approach uses either music retrieval or the MusicLM music generation model conditioned on embeddings derived from fMRI data. The generated music resembles the musical stimuli that human subjects experienced with respect to semantic properties like genre, instrumentation, and mood. We investigate the relationship between different components of MusicLM and brain activity through a voxel-wise encoding modeling analysis. Furthermore, we discuss which brain regions represent information derived from purely textual descriptions of music stimuli. We provide supplementary material, including examples of the reconstructed music, at https://google-research.github.io/seanet/brain2music
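    For the retrieval variant, a minimal sketch assuming a learned linear decoder from voxel space into a music-embedding space; the decoder, shapes, and candidate set are hypothetical placeholders, not the paper's implementation.

        import numpy as np

        def retrieve_music(fmri, W, candidate_embs):
            """fmri: (V,) voxel responses; W: (D, V) learned linear decoder;
            candidate_embs: (N, D) embeddings of candidate music clips."""
            query = W @ fmri
            query = query / np.linalg.norm(query)
            sims = candidate_embs @ query / np.linalg.norm(candidate_embs, axis=1)
            return int(np.argmax(sims))  # index of the best-matching clip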
    Learning from Semantic Alignment between Unpaired Multiviews for Egocentric Video Recognition
    Qitong Wang
    Liangzhe Yuan
    Xi Peng
    International Conference on Computer Vision (ICCV) (2023)
    Abstract: We are concerned with a challenging scenario in unpaired multiview video learning, in which the model aims to learn comprehensive multiview representations while the cross-view semantic information exhibits variations. We propose Semantics-based Unpaired Multiview Learning (SUM-L) to tackle this problem. The key idea is to build cross-view pseudo-pairs and perform view-invariant alignment by leveraging the semantic information of videos. To improve the data efficiency of multiview learning, we further perform video-text alignment for first-person and third-person videos, fully leveraging semantic knowledge to improve video representations. Extensive experiments on multiple benchmark datasets verify the effectiveness of our framework. Our method also outperforms multiple existing view-alignment methods, in a scenario more challenging than typical paired or unpaired multimodal or multiview learning.
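    A sketch of cross-view pseudo-pair construction by semantic similarity, the key idea the abstract names; the embedding source, normalization, and threshold are illustrative assumptions.

        import torch

        def build_pseudo_pairs(ego_embs, exo_embs, threshold=0.8):
            """ego_embs: (N, D), exo_embs: (M, D): L2-normalized semantic
            embeddings of unpaired first-person and third-person clips."""
            sims = ego_embs @ exo_embs.T              # (N, M) cosine similarities
            best_sim, best_idx = sims.max(dim=1)      # best exo clip per ego clip
            keep = best_sim > threshold               # only semantically close pairs
            return torch.nonzero(keep).squeeze(1), best_idx[keep]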
    Joint Adaptive Representations for Image-Language Learning
    Transformers for Vision (T4V) Workshop at the Conference on Computer Vision and Pattern Recognition (CVPR) (2023)
    Abstract: Image-language transformer models have achieved tremendous success, but they come at high computational cost. We propose joint adaptive image-language representation learning, which adaptively and iteratively fuses the multi-modal features. This consistently reduces model cost and size, allows the model to scale without a large increase in FLOPs or memory, and outperforms bigger and much more expensive models. With only 40M training examples and 39 GFLOPs, our model outperforms models many times larger, some reaching 800 GFLOPs.
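    A sketch of what adaptive, iterative fusion could look like, assuming a learned per-token importance score and a fixed keep ratio (both illustrative, not the paper's mechanism): each round prunes the joint token set, shrinking compute as image and text features merge.

        import torch
        import torch.nn as nn

        def adaptive_fuse(tokens, scorer: nn.Linear, keep_ratio=0.5, rounds=2):
            # tokens: (B, N, D) concatenated image + text features.
            for _ in range(rounds):
                scores = scorer(tokens).squeeze(-1)       # (B, N) importance
                k = max(1, int(tokens.shape[1] * keep_ratio))
                idx = scores.topk(k, dim=1).indices       # keep salient tokens
                idx = idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
                tokens = torch.gather(tokens, 1, idx)     # fewer tokens, fewer FLOPs
            return tokens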