Daniel Duckworth

Daniel Duckworth received his Master of Science in Engineering from the University of California, Berkeley under the supervision of Prof. Stuart J. Russell, where he worked on stochastic methods for Bayesian inference. Since joining Google Brain in 2017, Daniel has branched out into numerical optimization, generalization, generative modeling, and natural language processing.
Authored Publications
    Inferring the structure of 3D scenes from 2D observations is a fundamental challenge in computer vision. Recently popularized approaches based on neural scene representations have achieved tremendous impact and have been applied across a variety of applications. One of the major remaining challenges in this space is training a single model which can provide latent representations which effectively generalize beyond a single scene. Scene Representation Transformer (SRT) has shown promise in this direction, but scaling it to a larger set of diverse scenes is challenging and necessitates accurately posed ground truth data. To address this problem, we propose RUST (Really Unposed Scene representation Transformer), a pose-free approach to novel view synthesis trained on RGB images alone. Our main insight is that one can train a Pose Encoder that peeks at the target image and learns a latent pose embedding which is used by the decoder for view synthesis. We perform an empirical investigation into the learned latent pose structure and show that it allows meaningful test-time camera transformations and accurate explicit pose readouts. Perhaps surprisingly, RUST achieves similar quality as methods which have access to perfect camera pose, thereby unlocking the potential for large-scale training of amortized neural scene representations.
    The Scene Representation Transformer (SRT) is a recent method to render novel views at interactive rates. Since SRT uses camera poses with respect to an arbitrarily chosen reference camera, it is not invariant to the order of the input views. As a result, SRT is not directly applicable to large-scale scenes where the reference frame would need to be changed regularly. In this work, we propose Relative Pose Attention SRT (RePAST): Instead of fixing a reference frame at the input, we inject pairwise relative camera pose information directly into the attention mechanism of the Transformers. This leads to a model that is by definition invariant to the choice of any global reference frame, while still retaining the full capabilities of the original method. Empirical results show that adding this invariance to the model does not lead to a loss in quality. We believe that this is a step towards applying fully latent transformer-based rendering methods to large-scale scenes.
    A classical problem in computer vision is to infer a 3D scene representation from few images that can be used to render novel views at interactive rates. Previous work focuses on reconstructing pre-defined 3D representations, e.g. textured meshes, or implicit representations, e.g. radiance fields, and often requires input images with precise camera poses and long processing times for each novel scene. In this work, we propose the Scene Representation Transformer (SRT), a method which processes posed or unposed RGB images of a new area, infers a "set-latent scene representation", and synthesises novel views, all in a single feed-forward pass. To calculate the scene representation, we propose a generalization of the Vision Transformer to sets of images, enabling global information integration, and hence 3D reasoning. An efficient decoder transformer parameterizes the light field by attending into the scene representation to render novel views. Learning is supervised end-to-end by minimizing a novel-view reconstruction error. We show that this method outperforms recent baselines in terms of PSNR and speed on synthetic datasets, including a new dataset created for the paper. Further, we demonstrate that SRT scales to support interactive visualization and semantic segmentation of real-world outdoor environments using Street View imagery.
    Kubric: A scalable dataset generator
    Anissa Yuenming Mak
    Austin Stone
    Carl Doersch
    Cengiz Oztireli
    Charles Herrmann
    Daniel Rebain
    Derek Nowrouzezahrai
    Dmitry Lagun
    Fangcheng Zhong
    Florian Golemo
    Francois Belletti
    Henning Meyer
    Hsueh-Ti (Derek) Liu
    Issam Laradji
    Klaus Greff
    Kwang Moo Yi
    Matan Sela
    Noha Radwan
    Thomas Kipf
    Tianhao Wu
    Vincent Sitzmann
    Yilun Du
    Yishu Miao
    (2022)
    Data is the driving force of machine learning. The amount and quality of training data is often more important for the performance of a system than the details of its architecture. Data is also an important tool for testing specific hypotheses, and for empirically evaluating the behaviour of complex systems. Synthetic data generation represents a powerful tool that can address all these shortcomings: 1) it is cheap, 2) supports rich ground-truth annotations, 3) offers full control over data, and 4) can circumvent privacy and legal concerns. Unfortunately, the toolchain for generating data is less well developed than that for building models. We aim to improve this situation by introducing Kubric: a scalable open-source pipeline for generating realistic image and video data with rich ground truth annotations. We also publish a collection of generated datasets and baseline results on several vision tasks.
    Object Scene Representation Transformer
    Filip Pavetić
    Leonidas Guibas
    Klaus Greff
    Thomas Kipf
    Advances in Neural Information Processing Systems (2022), pp. 9512-9524
    A compositional understanding of the world in terms of objects and their geometry in 3D space is considered a cornerstone of human cognition. Facilitating the learning of such a representation in neural networks holds promise for substantially improving labeled data efficiency. As a key step in this direction, we make progress on the problem of learning 3D-consistent decompositions of complex scenes into individual objects in an unsupervised fashion. We introduce Object Scene Representation Transformer (OSRT), a 3D-centric model in which individual object representations naturally emerge through novel view synthesis. OSRT scales to significantly more complex scenes with larger diversity of objects and backgrounds than existing methods. At the same time, it is multiple orders of magnitude faster at compositional rendering thanks to its light field parametrization and the novel Slot Mixer decoder.
    We present NeSF, a method for producing 3D semantic fields from pre-trained density fields and sparse 2D semantic supervision. Our method side-steps traditional scene representations by leveraging neural representations where 3D information is stored within neural fields. In spite of being supervised by 2D signals alone, our method is able to generate 3D-consistent semantic maps from novel camera poses and can be queried at arbitrary 3D points. Notably, NeSF is compatible with any method producing a density field, and its accuracy improves as the quality of the pre-trained density fields improves. Our empirical analysis demonstrates comparable quality to competitive 2D and 3D semantic segmentation baselines on convincing synthetic scenes while also offering features unavailable to existing methods.
    NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections
    Ricardo Martin-Brualla*
    Noha Radwan*
    Alexey Dosovitskiy
    Conference on Computer Vision and Pattern Recognition (CVPR) (2021)
    We present a learning-based method for synthesizing novel views of complex scenes using only unstructured collections of in-the-wild photographs. We build on Neural Radiance Fields (NeRF), which uses the weights of a multilayer perceptron to model the density and color of a scene as a function of 3D coordinates. While NeRF works well on images of static subjects captured under controlled settings, it is incapable of modeling many ubiquitous, real-world phenomena in uncontrolled images, such as variable illumination or transient occluders. We introduce a series of extensions to NeRF to address these issues, thereby enabling accurate reconstructions from unstructured image collections taken from the internet. We apply our system, dubbed NeRF-W, to internet photo collections of famous landmarks, and demonstrate temporally consistent novel view renderings that are significantly closer to photorealism than the prior state of the art.
    We show theoretically and experimentally that both data whitening and second order optimization erase information about the training dataset, and can prevent any generalization for high dimensional datasets. First we show that if the input layer of a model is a dense linear layer, then the datapoint-datapoint second moment matrix contains all information which can be used to make predictions. Second, we show that for high dimensional datasets, where the number of features is at least as large as the number of datapoints, and where the whitening transform is computed on the full (train+test) dataset, whitening erases all information in this datapoint-datapoint second moment matrix. Generalization is thus completely impossible for models trained on high dimensional whitened datasets. Second order optimization of a linear model is identical to first order optimization of the same model after data whitening. Second order optimization can thus also prevent any generalization in similar situations. We experimentally verify these predictions for models trained on whitened data, and for linear models trained with an online Newton optimizer. We further experimentally demonstrate that generalization continues to be harmed even when the theoretical constraints on input dimensionality (for whitening), or linearity of the model (for second order optimization) are relaxed.
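    The whitening claim in the abstract above can be checked numerically on toy data. The following is a minimal NumPy sketch, not the authors' code: it assumes uncentered whitening computed through an SVD over all datapoints, and the helper names `whiten` and `gram` are hypothetical. When the number of features is at least the number of datapoints, the datapoint-datapoint second moment matrix of the whitened data collapses to a multiple of the identity, so it no longer distinguishes one datapoint from another.

```python
import numpy as np


def whiten(X, tol=1e-10):
    """Whiten the features of X (rows are datapoints, columns are features).

    Assumption of this sketch: the uncentered feature-feature second moment
    is used, restricted to the directions the data actually spans.
    """
    n = X.shape[0]
    # Thin SVD: X = U @ diag(s) @ Vt.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    keep = s > tol * s.max()
    # Equivalent to X @ W with W = sqrt(n) * V @ diag(1 / s) @ V.T:
    # every nonzero singular value of X is replaced by sqrt(n).
    return np.sqrt(n) * U[:, keep] @ Vt[keep, :]


def gram(X):
    """Datapoint-datapoint second moment matrix (n x n)."""
    return X @ X.T


rng = np.random.default_rng(0)
X_high = rng.normal(size=(5, 20))  # 5 datapoints, 20 features
X_low = rng.normal(size=(20, 5))   # 20 datapoints, 5 features

gram_high = gram(whiten(X_high))
gram_low = gram(whiten(X_low))

# Does the whitened Gram matrix equal n times the identity?
high_is_identity = np.allclose(gram_high, 5 * np.eye(5))
low_is_identity = np.allclose(gram_low, 20 * np.eye(20))
print(high_is_identity, low_is_identity)
```

    In the low dimensional case the whitened Gram matrix is a scaled projection of rank 5 rather than the identity, so it still carries structure that a model could use.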
    Automatic Detection of Generated Text is Easiest when Humans are Fooled
    Chris Callison-Burch
    Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (2020), pp. 1808-1822
    Recent advancements in neural language modelling make it possible to rapidly generate vast amounts of human-sounding text. The capabilities of humans and automatic discriminators to detect machine-generated text have been a large source of research interest, but humans and machines rely on different cues to make their decisions. Here, we perform careful benchmarking and analysis of three popular sampling-based decoding strategies (top-k, nucleus sampling, and untruncated random sampling) and show that improvements in decoding methods have primarily optimized for fooling humans. This comes at the expense of introducing statistical abnormalities that make detection easy for automatic systems. We also show that though both human and automatic detector performance improve with longer excerpt length, even multi-sentence excerpts can fool expert human raters over 30% of the time. Our findings reveal the importance of using both human and automatic detectors to assess the humanness of text generation systems.
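    For readers unfamiliar with the three decoding strategies named in the abstract above, here is a minimal pure-Python sketch of how each one restricts a next-token distribution before sampling. It is illustrative only and not taken from the paper: the function names, the toy distribution `probs`, and the settings `k=2` and `p=0.9` are all assumptions.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = order[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def nucleus_filter(probs, p):
    """Keep the smallest set of most probable tokens whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return {i: probs[i] / mass for i in keep}


def untruncated(probs):
    """Sample from the full distribution with no truncation."""
    return dict(enumerate(probs))


def sample(dist, rng):
    """Draw one token id from a {token_id: probability} mapping."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


# Toy next-token distribution over a four-token vocabulary (hypothetical).
probs = [0.5, 0.3, 0.15, 0.05]
rng = random.Random(0)

top_k_dist = top_k_filter(probs, k=2)        # tokens 0 and 1 survive
nucleus_dist = nucleus_filter(probs, p=0.9)  # tokens 0, 1 and 2 survive
full_dist = untruncated(probs)               # all four tokens survive

drawn = [sample(d, rng) for d in (top_k_dist, nucleus_dist, full_dist)]
```

    Both truncation schemes remove the low-probability tail entirely, which is one way the "statistical abnormalities" the abstract mentions could arise: tokens in the tail never appear in the generated text.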
    Taskmaster-1: Toward a Realistic and Diverse Dialog Dataset
    Chinnadhurai Sankar
    Arvind Neelakantan
    Semih Yavuz
    Ben Goodrich
    Amit Dubey
    Kyu-Young Kim
    Andy Cedilnik
    EMNLP (2019) (to appear)
    A significant barrier to progress in data-driven approaches to building dialog systems is the lack of high-quality, goal-oriented conversational data. To help satisfy this elementary requirement, we introduce the initial release of the Taskmaster-1 dataset which includes 13,215 task-based dialogs comprising six domains. Two procedures were used to create this collection, each with unique advantages. The first involves a two-person, spoken "Wizard of Oz" (WOz) approach in which trained agents and crowdsourced workers interact to complete the task while the second is "self-dialog" in which crowdsourced workers write the entire dialog themselves. We do not restrict the workers to detailed scripts or to a small knowledge base and hence we observe that our dataset contains more realistic and diverse conversations in comparison to existing datasets. We offer several baseline models including state-of-the-art neural seq2seq architectures with benchmark performance as well as qualitative human evaluations. Dialogs are labeled with API calls and arguments, a simple and cost-effective approach which avoids the requirement of complex annotation schema. The layer of abstraction between the dialog model and the service provider API allows for a given model to interact with multiple services that provide similar functionality. Finally, the dataset will evoke interest in written vs. spoken language, discourse patterns, error handling and other linguistic phenomena related to dialog system research, development and design. The dataset is available at ai.google/tools/datasets/taskmaster-1.