Timothy Dozat
I joined Google as a research scientist in January 2019. My current work falls into two broad categories: neural network architectures (and some of the theory behind them), emphasizing language model pretraining and distillation; and "classic" NLP tasks, such as part-of-speech tagging and parsing. I've also recently been collaborating with teams working on the Google Assistant. I received my PhD in Linguistics from Stanford University, where I worked under Chris Manning on developing Universal Dependencies and building neural parsers that could reproduce its analyses. I also dabbled in convex optimization at one point, and I might come back to it someday.
Authored Publications
Dialect-robust Evaluation of Generated Text
Jiao Sun
Elizabeth Clark
Tu Vu
Sebastian Gehrmann
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada (2023), pp. 6010-6028
Evaluation metrics that are not robust to dialect variation make it impossible to tell how well systems perform for many groups of users, and can even penalize systems for producing text in lower-resource dialects. However, there is currently no way to quantify how metrics respond to changes in the dialect of a generated utterance. We thus formalize dialect robustness and dialect awareness as goals for NLG evaluation metrics. We introduce a suite of methods and corresponding statistical tests one can use to assess metrics in light of the two goals. Applying the suite to current state-of-the-art metrics, we demonstrate that they are not dialect-robust and that semantic perturbations frequently lead to smaller decreases in a metric than the introduction of dialect features. As a first step to overcome this limitation, we propose a training schema, NANO, which introduces regional and language information to the pretraining process of a metric. We demonstrate that NANO provides a size-efficient way for models to improve dialect robustness while simultaneously improving their performance on the standard metric benchmark.
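As a rough illustration of the robustness criterion above (not the paper's actual protocol): a metric is dialect-robust if rewriting a candidate into another dialect lowers its score no more than a meaning-changing perturbation does. The token-overlap metric and all example sentences below are hypothetical stand-ins for a learned evaluation metric.

```python
# Hypothetical sketch of a dialect-robustness check. A metric passes if a
# dialect rewrite hurts its score no more than a semantic perturbation does.
# `overlap_score` is a toy surface-overlap metric, not one from the paper.

def overlap_score(reference: str, candidate: str) -> float:
    """Toy stand-in metric: token-level F1 between reference and candidate."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    common = len(set(ref) & set(cand))
    if common == 0:
        return 0.0
    p, r = common / len(cand), common / len(ref)
    return 2 * p * r / (p + r)

def dialect_robust(reference: str, dialect_variant: str, semantic_perturb: str) -> bool:
    """True if the dialect rewrite scores at least as well as the
    meaning-changing perturbation (the robustness goal)."""
    base = overlap_score(reference, reference)
    dialect_drop = base - overlap_score(reference, dialect_variant)
    semantic_drop = base - overlap_score(reference, semantic_perturb)
    return dialect_drop <= semantic_drop

dialect_robust("I am going to the shop",
               "I'm off to the shops",       # dialect rewrite, same meaning
               "I am going to the bank")     # semantic perturbation
# → False (the overlap metric is not dialect-robust here)
```

On simple examples like this, surface-overlap metrics tend to fail the check, echoing the paper's finding that dialect rewrites are often penalized more heavily than meaning changes.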
FormNetV2: Inductive Multimodal Graph Contrastive Learning for Form Document Information Extraction
Chun-Liang Li
Hao Zhang
Xiang Zhang
Kihyuk Sohn
Nikolai Glushnev
Joshua Ainslie
Nan Hua
ACL (2023)
The recent advent of self-supervised pre-training techniques has led to a surge in the use of multimodal learning in form document understanding. However, existing approaches that extend masked language modeling to other modalities require careful multi-task tuning, complex reconstruction target designs, or additional pre-training data. In FormNetV2, we introduce a centralized multimodal graph contrastive learning strategy to unify self-supervised pre-training for all modalities in one loss. The graph contrastive objective maximizes the agreement of multimodal representations, providing a natural interplay for all modalities without special customization. In addition, we extract image features within the bounding box that joins a pair of tokens connected by a graph edge, capturing more targeted visual cues without needing a sophisticated, separately pre-trained image embedder. FormNetV2 establishes new state-of-the-art performance on FUNSD, CORD, SROIE and Payment benchmarks with a more compact model size.
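The unified contrastive objective can be pictured as a generic InfoNCE loss over two corrupted multimodal views of the graph's node embeddings. This is a standard formulation sketched under that assumption, not FormNetV2's exact loss; `info_nce` and its inputs are illustrative.

```python
# Sketch of a generic InfoNCE contrastive loss, the family of objective the
# abstract describes. Rows of view_a and view_b are embeddings of the same
# graph nodes under two different corruptions; each node's positive pair is
# its own counterpart in the other view.
import numpy as np

def info_nce(view_a: np.ndarray, view_b: np.ndarray, temperature: float = 0.1) -> float:
    # Cosine-normalize both views so logits are scaled similarities.
    a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
    b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)
    logits = (a @ b.T) / temperature          # (n, n) similarity matrix
    # Log-softmax over each row; the diagonal holds the positive pairs.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

Minimizing this loss pulls matched node representations together across views while pushing mismatched ones apart, which is the "agreement of multimodal representations" the abstract refers to.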
FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation
Jan A. Botha
Xavier Garcia
Transactions of the Association for Computational Linguistics (2023)
We present FRMT, a new dataset and evaluation benchmark for Few-shot Region-aware Machine Translation, a type of style-targeted translation. The dataset consists of professional translations from English into two regional variants each of Portuguese and Mandarin Chinese. Source documents are selected to enable detailed analysis of phenomena of interest, including lexically distinct terms and distractor terms. We explore automatic evaluation metrics for FRMT and validate their correlation with expert human evaluation across both region-matched and mismatched rating scenarios. Finally, we present a number of baseline models for this task, and offer guidelines for how researchers can train, evaluate, and compare their own models. Our dataset and evaluation code are publicly available: https://bit.ly/frmt-task
FormNet: Structural Encoding beyond Sequential Modeling in Form Document Information Extraction
Chun-Liang Li
Nan Hua
Joshua Ainslie
Association for Computational Linguistics (ACL) (2022)
Sequence modeling has demonstrated state-of-the-art performance on natural language and document understanding tasks. However, it is challenging to correctly serialize tokens in form-like documents in practice due to the variety of their layout patterns. We propose FormNet, a structure-aware sequence model to mitigate the suboptimal serialization of forms. First, we design Rich Attention, which leverages the spatial relationship between tokens in a form for more precise attention score calculation. Second, we construct Super-Tokens for each word by embedding representations from their neighboring tokens through graph convolutions. FormNet therefore explicitly recovers local syntactic information that may have been lost during serialization. In experiments, FormNet outperforms existing methods with a more compact model size and less pre-training data, establishing new state-of-the-art performance on CORD, FUNSD and Payment benchmarks.
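The spatial-attention idea can be caricatured as dot-product attention with an additive bias derived from token geometry. Rich Attention's actual parameterization (learned order and distance terms) differs; the function below and its fixed distance penalty are toy stand-ins for illustration only.

```python
# Toy sketch of spatially biased attention: standard scaled dot-product
# attention with an additive penalty proportional to the horizontal distance
# between token positions on the page. NOT the paper's Rich Attention, which
# learns its order and distance terms.
import numpy as np

def spatially_biased_attention(q: np.ndarray, k: np.ndarray,
                               x_positions: np.ndarray,
                               alpha: float = 1.0) -> np.ndarray:
    d = q.shape[-1]
    logits = (q @ k.T) / np.sqrt(d)           # ordinary attention logits
    dist = np.abs(x_positions[:, None] - x_positions[None, :])
    logits = logits - alpha * dist            # nearby tokens attend more strongly
    # Row-wise softmax with max-subtraction for numerical stability.
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    return weights / weights.sum(axis=1, keepdims=True)
```

With identical queries and keys, the penalty alone shapes the distribution, so each token attends most to its spatial neighbors, which is the intuition behind using layout geometry rather than serialized order.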