Yeqing Li

Authored Publications
    High Resolution Medical Image Analysis with Spatial Partitioning
    Le Hou
    Niki J. Parmar
    Noam Shazeer
    Xiaodan Song
    Youlong Cheng
    High Resolution Medical Image Analysis with Spatial Partitioning (2019)
    Abstract: Medical images, such as 3D computerized tomography (CT) scans, have a typical resolution of 512×512×512 voxels, three orders of magnitude more pixel data than ImageNet images. It is impossible to train CNN models directly on such high-resolution images, because the feature maps of a single image do not fit in the memory of a single GPU/TPU. Existing image analysis approaches alleviate this problem by dividing input images (e.g., taking 2D slices of 3D scans) or down-sampling them, which leads to complicated implementations and sub-optimal performance due to information loss. In this paper, we implement spatial partitioning, which internally distributes the input and output of convolution operations across GPUs/TPUs. Our implementation is based on the Mesh-TensorFlow framework and is transparent to end users. To the best of our knowledge, this is the first work to train networks end-to-end on 512×512×512-resolution CT scans without significant computational overhead.
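
The core idea, splitting a convolution's input spatially and giving each shard a small halo of neighbouring rows so its local output matches the corresponding slice of the global result, can be sketched in a few lines. The NumPy toy below illustrates only that idea under assumed sizes and a single-axis split; it is not the paper's Mesh-TensorFlow implementation.

```python
# Toy sketch of spatial partitioning for a convolution (NumPy, illustrative only).
# Each "shard" stands in for one GPU/TPU core: it holds a slab of rows plus a halo
# of neighbouring rows, so its local convolution equals its slice of the global one.
import numpy as np

def conv2d_valid(x, k):
    """Plain 'valid' 2D cross-correlation, loop-based for clarity."""
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((64, 64))
kernel = rng.standard_normal((3, 3))
halo = 1                                  # (kernel_size - 1) // 2 overlap rows per side
num_shards = 4                            # assumed number of devices
rows_per_shard = image.shape[0] // num_shards

padded = np.pad(image, halo)              # "same" padding for the global reference
reference = conv2d_valid(padded, kernel)  # unpartitioned result, shape (64, 64)

# Partition rows across shards; the halo rows make boundary outputs agree.
pieces = []
for s in range(num_shards):
    start, stop = s * rows_per_shard, (s + 1) * rows_per_shard
    local_slab = padded[start:stop + 2 * halo, :]   # this shard's rows plus halo
    pieces.append(conv2d_valid(local_slab, kernel))

assert np.allclose(np.concatenate(pieces, axis=0), reference)
```

The abstract notes that in the actual system this distribution happens inside the Mesh-TensorFlow framework and is transparent to end users, so the explicit halo handling above is only a stand-in for what the framework does internally.
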
    Guided Attention for Large Scale Scene Text Verification
    Dafang He
    Alex Gorban
    Derrall Heath
    Julian Ibarz
    Qian Yu
    Daniel Kifer
    C. Lee Giles
    arXiv (2018)
    Abstract: Many tasks involve determining whether a particular text string exists in an image. In this work, we propose a model called Guided Attention that learns this task end-to-end. The model takes an image and a text string as input and outputs the probability that the text string is present in the image. This is the first end-to-end model that learns such relationships between text and images without requiring explicit scene text detection or recognition. Such a model can be applied to a variety of tasks that require knowing whether a named entity is present in an image. Furthermore, the model does not need any bounding-box annotations, and it is the first work in the scene-text area to tackle such a problem. We show that our method outperforms several state-of-the-art methods on the challenging Street View Business Matching dataset, which contains millions of images. In addition, we demonstrate the uniqueness of our task via a comparison with a typical VQA (Visual Question Answering) problem, which also takes an image and a sequence of words as input. This new real-world task provides a fresh perspective for research combining images and text.
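
The verification setup described above, scoring whether a query string appears anywhere in an image, can be pictured as text-guided attention over spatial image features followed by a binary classifier. The NumPy sketch below shows only that general pattern; the encoders, shapes, pooling, and classifier are assumptions and do not reproduce the paper's architecture.

```python
# Illustrative sketch: text-guided attention over CNN feature-map locations,
# pooled into one vector and scored for "is this text string present in the image?".
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
d = 128                                            # assumed shared embedding size
image_feats = rng.standard_normal((14 * 14, d))    # flattened H*W CNN feature map (stand-in)
text_query = rng.standard_normal(d)                # encoded query string (stand-in for an RNN state)

# The text query guides which spatial locations receive attention weight.
attn = softmax(image_feats @ text_query / np.sqrt(d))   # (196,) attention weights
attended = attn @ image_feats                           # (d,) attention-pooled image vector

# Binary verification head: probability that the text string is present.
w, b = rng.standard_normal(d), 0.0                      # illustrative classifier parameters
p_present = 1.0 / (1.0 + np.exp(-(attended @ w + b)))
print(f"P(text in image) = {p_present:.3f}")
```

Because the score is produced directly from the attended features, no explicit text detection or recognition step (and hence no bounding-box supervision) appears anywhere in this pipeline, which is the property the abstract emphasizes.
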
    AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions
    Abstract: This paper introduces a video dataset of spatio-temporally localized Atomic Visual Actions (AVA). The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, in which actions are localized in space and time, resulting in 1.58M action labels, with multiple labels per person occurring frequently. The key characteristics of our dataset are: (1) the definition of atomic visual actions, rather than composite actions; (2) precise spatio-temporal annotations, with possibly multiple annotations per person; (3) exhaustive annotation of these atomic actions over 15-minute video clips; (4) people temporally linked across consecutive segments; and (5) the use of movies to gather a varied set of action representations. This departs from existing datasets for spatio-temporal action recognition, which typically provide sparse annotations for composite actions in short video clips. We will release the dataset publicly. AVA, with its realistic scene and action complexity, exposes the intrinsic difficulty of action recognition. To benchmark this, we present a novel approach for action localization that builds upon current state-of-the-art methods and demonstrates better performance on JHMDB and UCF101-24. While this approach sets a new state of the art on existing datasets, the overall results on AVA are low, at 15.6% mAP, underscoring the need for new approaches to video understanding.
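
To make the annotation structure the abstract describes concrete (a person box at a keyframe, possibly carrying several atomic-action labels, with the person linked across consecutive segments), here is a small illustrative record type. The field names, types, and example values are assumptions for exposition, not the released AVA annotation format.

```python
# Illustrative record for one spatio-temporally localized person annotation.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class PersonActionAnnotation:
    video_id: str                                 # which 15-minute movie clip
    timestamp_s: float                            # keyframe time within the clip
    box: Tuple[float, float, float, float]        # (x1, y1, x2, y2), normalized person box
    action_labels: List[str] = field(default_factory=list)  # multiple atomic actions allowed
    person_track_id: int = -1                     # links the same person across segments

# Hypothetical example annotation.
ann = PersonActionAnnotation(
    video_id="example_movie_clip",
    timestamp_s=902.0,
    box=(0.12, 0.30, 0.45, 0.95),
    action_labels=["stand", "talk to a person"],
    person_track_id=7,
)
print(ann)
```
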
    Attention-based Extraction of Structured Information from Street View Imagery
    Zbigniew Wojna
    Alex Gorban
    Dar-Shyang Lee
    Qian Yu
    Julian Ibarz
    ICDAR (2017), pp. 8
    Abstract: We present a neural network model, based on CNNs, RNNs, and attention mechanisms, which achieves 84.04% accuracy on the challenging French Street Name Signs (FSNS) dataset, significantly outperforming the previous state of the art (Smith’16), which achieved 72.46%. Furthermore, our new method is much simpler and more general than the previous approach. To demonstrate the generality of our model, we also apply it to two datasets derived from Google Street View, in which the goals are to extract business names from storefronts and structured date/time information from parking signs. Finally, we study the speed/accuracy tradeoff that results from cutting pretrained Inception CNNs at different depths and using them as feature extractors for the attention mechanism. The resulting model is not only accurate but also efficient, allowing it to be used at scale on a variety of challenging real-world text-extraction problems.
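
The speed/accuracy study mentioned at the end amounts to truncating a pretrained Inception CNN at different depths and using the truncated network as the feature extractor. The sketch below shows one way to make such a cut with tf.keras's InceptionV3; the particular cut points ('mixed2', 'mixed5', 'mixed8') and the use of Keras are assumptions for illustration, not the paper's code.

```python
# Illustrative sketch: cut InceptionV3 at different depths and inspect the
# resulting feature-map size and parameter count (a proxy for the speed/accuracy knob).
import tensorflow as tf

def truncated_inception(cut_layer: str, input_shape=(299, 299, 3)) -> tf.keras.Model:
    """Return InceptionV3 truncated at `cut_layer`, for use as a feature extractor."""
    base = tf.keras.applications.InceptionV3(
        include_top=False, weights=None, input_shape=input_shape)  # weights=None avoids a download
    return tf.keras.Model(inputs=base.input, outputs=base.get_layer(cut_layer).output)

# Shallower cuts keep fewer layers (fewer parameters, less compute) and produce larger,
# less abstract feature maps; deeper cuts cost more but give more semantic features.
for cut in ("mixed2", "mixed5", "mixed8"):
    extractor = truncated_inception(cut)
    h, w, c = extractor.output_shape[1:]
    print(f"{cut}: feature map {h}x{w}x{c}, params {extractor.count_params():,}")
```

The attention mechanism would then read from whichever feature map the cut exposes; the choice of depth trades a richer representation against inference cost, which is the tradeoff the abstract studies.
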