Yichao Zhou

Yichao is a Software Engineer at Google Research. He focuses on information extraction using machine learning. Prior to Google, he earned his Ph.D. in Computer Science from the University of California, Los Angeles (UCLA), advised by Prof. Wei Wang. His research interests lie broadly in natural language processing and text mining.
Authored Publications
    VRDU: A Benchmark for Visually-rich Document Understanding
    Zilong Wang
    Wei Wei
    2023 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
    Understanding visually-rich business documents to extract structured data and automate business workflows has been receiving attention both in academia and industry. Although recent multi-modal language models have achieved impressive results, we argue that existing benchmarks do not reflect the complexity of real documents seen in industry, and are therefore not suitable for measuring progress in practical settings. In this work, we identify the desiderata for a more comprehensive benchmark and propose one we call VRDU for Visually Rich Document Understanding. VRDU contains two datasets that represent several challenges: rich schema including diverse data types as well as nested entities, complex templates including tables and multi-column layouts, and diversity of layouts within a single document type. We design few-shot and conventional experiment settings along with a carefully designed matching algorithm to evaluate extraction results. We report the performance of strong baselines and draw three conclusions: (1) generalizing to new templates from a document type is still very challenging, (2) few-shot performance continues to have a lot of headroom, and (3) models struggle with nested repeated fields such as line-items in an invoice. We plan to open source the benchmark and the evaluation toolkit. We hope that it helps inspire and guide future research in this challenging area.
    Building automatic extraction models for visually rich documents like invoices, receipts, bills, tax forms, etc. has received significant attention lately. A key bottleneck in developing extraction models for new document types is the cost of acquiring the several thousand high-quality labeled documents that are needed to train a model with acceptable accuracy. In this paper, we propose selective labeling as a solution to this problem. The key insight is to simplify the labeling task to provide “yes/no” labels for candidate extractions predicted by a model trained on partially labeled documents. We combine this with a custom active learning strategy to find the predictions that the model is most uncertain about. We show through experiments on document types drawn from 3 different domains that selective labeling can reduce the cost of acquiring labeled data by 10× with a negligible loss in accuracy.
    Given a web page, extracting an object along with various attributes of interest (e.g. price, publisher, author, and genre for a book) can facilitate a variety of downstream applications such as large-scale knowledge base construction, e-commerce product search, and personalized recommendation. Prior approaches have either relied on computationally expensive visual feature engineering or required large amounts of training data to get to an acceptable precision. In this paper, we propose a novel method, LeArNing TransfErable node RepresentatioNs for Attribute Extraction (LANTERN), to tackle the problem. We model the problem as a tree node tagging task. The key insight is to learn a contextual representation for each node in the DOM tree where the context explicitly takes into account the tree structure of the neighborhood around the node. Experiments on the SWDE public dataset show that LANTERN outperforms the previous state-of-the-art (SOTA) by 1.44% (F1 score) with a dramatically simpler model architecture. Furthermore, we report that utilizing data from a different domain (for instance, using training data about web pages with cars to extract book objects) is surprisingly useful and helps beat the SOTA by a further 1.37%.