Eugene Ie

Authored Publications
    On the Evaluation of Vision-and-Language Navigation Instructions
    Ming Zhao
    Peter Anderson
    Vihan Jain
    Conference of the European Chapter of the Association for Computational Linguistics (EACL) (2021)
    Preview abstract Vision-and-Language Navigation wayfinding agents can be enhanced by exploiting automatically generated navigation instructions. However, existing instruction generators have not been comprehensively evaluated, and the automatic evaluation metrics used to develop them have not been validated. Using human wayfinders, we show that these generators perform on par with or only slightly better than a template-based generator and far worse than human instructors. Furthermore, we discover that BLEU, ROUGE, METEOR and CIDEr are ineffective for evaluating grounded navigation instructions. To improve instruction evaluation, we propose an instruction-trajectory compatibility model that operates without reference instructions. Our model shows the highest correlation with human wayfinding outcomes when scoring individual instructions. For ranking instruction generation systems, if reference instructions are available we recommend using SPICE. View details
    RecSim NG: Toward Principled Uncertainty Modeling for Recommender Ecosystems
    Martin Mladenov
    Vihan Jain
    Christopher Colby
    Nicolas Mayoraz
    Hubert Pham
    Dustin Tran
    Ivan Vendrov
    ArXiv (2021)
    Preview abstract The development of recommender systems that optimize multi-turn interaction with users and model the interactions of different agents (e.g., users, content providers, vendors) in the recommender ecosystem has drawn increasing attention in recent years. Developing and training models and algorithms for such recommenders can be especially difficult using static datasets, which often fail to offer the types of counterfactual predictions needed to evaluate policies over extended horizons. To address this, we develop RecSim NG, a probabilistic platform for the simulation of multi-agent recommender systems. RecSim NG is a scalable, modular, differentiable simulator implemented in Edward2 and TensorFlow. It offers: a powerful, general probabilistic programming language for agent-behavior specification; tools for probabilistic inference and latent-variable model learning, backed by automatic differentiation and tracing; a TensorFlow-based runtime for running simulations on accelerated hardware. We describe RecSim NG and illustrate how it can be used to create transparent, configurable, end-to-end models of a recommender ecosystem, complemented by a small set of simple use cases that demonstrate how RecSim NG can help both researchers and practitioners easily develop and train novel algorithms for recommender systems. A short version of this paper was published at RecSys 2020. View details
    CoMSum: Dataset and Neural Model for Contextual Multi-Document Summarization
    Fei Sha
    Sheide Chammas
    Wan Zhu
    International Conference on Document Analysis and Recognition (2021)
    Preview abstract Summarization is the task of compressing source document(s) into coherent and succinct passages. Query-based (contextual) multi-document summarization (qMDS) is a variant that targets summaries to specific informational needs with queries providing additional contexts. Progress in qMDS has been hampered by limited availability of corresponding types of datasets. In this work, we make two contributions. First, we develop an automatic approach for creating both extractive and abstractive qMDS examples from existing language resources. We use this approach to create CoMSum, a qMDS dataset for public use. Second, to validate the utility of CoMSum, we propose a neural model for extractive summarization that exploits the hierarchical nature of the input from multiple documents. It also infuses queries into the modeling to extract query-specific summaries. The experimental results show that modeling the queries and the multiple documents hierarchically improves the performance of qMDS on this dataset. This is consistent with our intuition and supports using CoMSum for developing learning methods for qMDS. View details
    Learning to Represent Images and Texts with Denotation Graphs
    Bowen Zhang
    Hexiang Hu
    Vihan Jain
    Fei Sha
    Proceedings of EMNLP 2020 (to appear)
    Preview abstract Learning to fuse vision and language information and represent them is an important topic with many applications. Recent progress has leveraged the ideas of pre-training (from language modeling) and attention layers in Transformers to learn representations from datasets with images aligned with linguistic expressions that describe the images. In this paper, we propose learning representations from a set of implied visually grounded expressions between image and text, automatically mined from those datasets. In particular, we use denotation graphs to represent how specific concepts (such as sentences describing images) can be linked to abstract and generic concepts (such as short phrases) that are also visually grounded. Such generic-to-specific relations can be discovered using linguistic analysis tools. We propose methods to incorporate such relations into representation learning. We show that state-of-the-art multimodal learning methods such as ViLBERT can be further improved by leveraging automatically harvested structural relations. The representations lead to stronger empirical results on downstream tasks of text-based image retrieval and referral expression localization. We will release to the public both our code and the extracted denotation graphs on both the Flickr30K and the COCO datasets. View details
    Demonstrating Principled Uncertainty Modeling for Recommender Ecosystems with RecSim NG
    Martin Mladenov
    Vihan Jain
    Christopher Colby
    Nicolas Mayoraz
    Hubert Pham
    Dustin Tran
    Ivan Vendrov
    RecSys '20: Fourteenth ACM Conference on Recommender Systems (2020), pp. 591-593
    Preview abstract We develop RecSim NG, a probabilistic platform that supports natural, concise specification and learning of models for multi-agent recommender systems simulation. RecSim NG is a scalable, modular, differentiable simulator implemented in Edward2 and TensorFlow. An extended version of this paper is available as arXiv:2103.08057. View details
    Preview abstract We introduce Room-Across-Room (RxR), a new Vision-and-Language Navigation (VLN) dataset. RxR is multilingual (English, Hindi, and Telugu) and larger (more paths and instructions) than other VLN datasets. It emphasizes the role of language in VLN by addressing known biases in paths and eliciting more references to visible entities. Furthermore, each word in an instruction is time-aligned to the virtual poses of instruction creators and validators. We establish baseline scores for monolingual and multilingual settings and multitask learning when including Room-to-Room annotations. We also provide results for a model that learns from synchronized pose traces by focusing only on portions of the panorama attended to in human demonstrations. The size, scope and detail of RxR dramatically expand the frontier for research on embodied language agents in simulated, photo-realistic environments. View details
    BabyWalk: Going Farther in Vision-and-Language Navigation by Taking Baby Steps
    Wang Zhu
    Hexiang Hu
    Jiacheng Chen
    Zhiwei Deng
    Vihan Jain
    Fei Sha
    Proceedings of ACL 2020
    Preview abstract Learning to follow instructions is of fundamental importance to autonomous agents for vision-and-language navigation (VLN). In this paper, we study how an agent can navigate long paths when learning from a corpus that consists of shorter ones. We show that existing state-of-the-art agents do not generalize well. To this end, we propose BabyWalk, a new VLN agent that learns to navigate by decomposing long instructions into shorter ones (BabySteps) and completing them sequentially. A specially designed memory buffer is used by the agent to turn its past experiences into contexts for future steps. The learning process is composed of two phases. In the first phase, the agent uses imitation learning from demonstration to accomplish BabySteps. In the second phase, the agent uses curriculum-based reinforcement learning to maximize rewards on navigation tasks with increasingly longer instructions. We create two new benchmark datasets (of long navigation tasks) and use them in conjunction with existing ones to examine BabyWalk’s generalization ability. Empirical results show that BabyWalk achieves state-of-the-art results on several metrics and, in particular, is able to follow long instructions better. View details
    Preview abstract The Touchdown dataset (Chen et al., 2019) provides instructions by human annotators for navigation through New York City streets and for resolving spatial descriptions at a given location. To enable the wider research community to work effectively with the Touchdown tasks, we are publicly releasing the 29k raw Street View panoramas needed for Touchdown. We follow the process used for the StreetLearn data release (Mirowski et al., 2019) to check panoramas for personally identifiable information and blur them as necessary. These have been added to the StreetLearn dataset and can be obtained via the same process as used previously for StreetLearn. We also provide a reference implementation for both of the Touchdown tasks: vision and language navigation (VLN) and spatial description resolution (SDR). We compare our model results to those given in Chen et al. (2019) and show that the panoramas we have added to StreetLearn fully support both Touchdown tasks and can be used effectively for further research and comparison. View details
    Multi-modal Discriminative Model for Vision-and-Language Navigation
    Haoshuo Huang
    Vihan Jain
    Proceedings of the Combined Workshop on Spatial Language Understanding (SpLU) and Grounded Communication for Robotics (RoboNLP) (2019)
    Preview abstract Vision-and-Language Navigation (VLN) is a natural language grounding task where agents have to interpret natural language instructions in the context of visual scenes in a dynamic environment to achieve prescribed navigation goals. Successful agents must have the ability to parse natural language of varying linguistic styles, ground them in potentially unfamiliar scenes, plan and react with ambiguous environmental feedback. Generalization ability is limited by the amount of human-annotated data. In particular, paired vision-language sequence data is expensive to collect. We develop a discriminator that evaluates how well an instruction explains a given path in the VLN task using multi-modal alignment. Our study reveals that only a small fraction of the high-quality augmented data from Fried et al. (2018), as scored by our discriminator, is useful for training VLN agents with similar performance on previously unseen environments. We also show that a VLN agent warm-started with pre-trained components from the discriminator outperforms the benchmark success rates of 35.5 by 10% relative measure on previously unseen environments. View details
    Reinforcement Learning for Slate-based Recommender Systems: A Tractable Decomposition and Practical Methodology
    Vihan Jain
    Jing Wang
    Sanmit Narvekar
    Ritesh Agarwal
    Rui Wu
    Morgane Lustman
    Vince Gatto
    Paul Covington
    Jim McFadden
    arXiv (2019)
    Preview abstract Most practical recommender systems focus on estimating immediate user engagement without considering the long-term effects of recommendations on user behavior. Reinforcement learning (RL) methods offer the potential to optimize recommendations for long-term user engagement. However, since users are often presented with slates of multiple items---which may have interacting effects on user choice---methods are required to deal with the combinatorics of the RL action space. In this work, we address the challenge of making slate-based recommendations to optimize long-term value using RL. Our contributions are three-fold. (i) We develop SlateQ, a decomposition of value-based temporal-difference and Q-learning that renders RL tractable with slates. Under mild assumptions on user-choice behavior, we show that the long-term value (LTV) of a slate can be decomposed into a tractable function of its component item-wise LTVs. (ii) We outline a methodology that leverages existing myopic learning-based recommenders to quickly develop a recommender that handles LTV. (iii) We demonstrate our methods in simulation, and validate the scalability of decomposed TD-learning using SlateQ in live experiments on YouTube. View details
    Preview abstract We propose RecSim, a configurable platform for authoring simulation environments for recommender systems (RSs) that naturally supports sequential interaction with users. RecSim allows the creation of new environments that reflect particular aspects of user behavior and item structure at a level of abstraction well-suited to pushing the limits of current reinforcement learning (RL) and RS techniques in sequential interactive recommendation problems. Environments can be easily configured that vary assumptions about: user preferences and item familiarity; user latent state and its dynamics; and choice models and other user response behavior. We outline how RecSim offers value to RL and RS researchers and practitioners, and how it can serve as a vehicle for academic-industrial collaboration. View details
    Preview abstract We show that it is feasible to perform entity linking by training a dual encoder (two-tower) model that encodes mentions and entities in the same dense vector space, where candidate entities are retrieved by approximate nearest neighbor search. Unlike prior work, this setup does not rely on an alias table followed by a re-ranker, and is thus the first fully learned entity retrieval model. We show that our dual encoder, trained using only anchor-text links in Wikipedia, outperforms discrete alias table and BM25 baselines, and is competitive with the best comparable results on the standard TACKBP-2010 dataset. In addition, it can retrieve candidates extremely fast, and generalizes well to a new dataset derived from Wikinews. On the modeling side, we demonstrate the dramatic value of an unsupervised negative mining algorithm for this task. View details
    Natural Language Grounded Multitask Navigation
    Xin Wang
    Vihan Jain
    William Wang
    Zornitsa Kozareva
    Sujith Ravi
    NeurIPS Visually Grounded Interaction and Language (ViGIL) (2019)
    Preview abstract Recent research efforts enable the study of natural language grounded navigation in photo-realistic environments, e.g., following natural language instructions or dialog. However, data scarcity is a critical issue in these tasks, as conducting human demonstrated language interactions in the simulator is still expensive and time-consuming and it is impractical to exhaustively collect samples for all variants of the navigation tasks. Therefore, we introduce a generalized multitask navigation model that can seamlessly be trained on language-grounded navigation tasks such as Vision-Language Navigation (VLN) and Navigation from Dialog History (NDH). Benefiting from richer natural language guidance, the multitask model can efficiently transfer knowledge across related tasks. Experiments show that it outperforms the single-task model by 7% (success rate) on VLN and 61% (goal progress) on NDH, establishing the new state of the art for NDH. View details
    Stay on the Path: Instruction Fidelity in Vision-and-Language Navigation
    Vihan Jain
    Gabriel Magalhaes
    Ashish Vaswani
    Association for Computational Linguistics (2019)
    Preview abstract Advances in learning and representations have reinvigorated work that connects language to other modalities. A particularly exciting direction is Vision-and-Language Navigation (VLN), in which agents interpret natural language instructions and visual scenes to move through environments and reach goals. Despite recent progress, current research leaves unclear how much of a role language understanding plays in this task, especially because dominant evaluation metrics have focused on goal completion rather than the sequence of actions corresponding to the instructions. Here, we highlight shortcomings of current metrics for the Room-to-Room dataset (Anderson et al. 2018b) and propose a new metric, Coverage weighted by Length Score (CLS). We also show that the existing paths in the dataset are not ideal for evaluating instruction following because they are direct-to-goal shortest paths. We join existing short paths to form more challenging extended paths to create a new data set, Room-for-Room (R4R). Using R4R and CLS, we show that agents that receive rewards for instruction fidelity outperform agents that focus on goal completion. View details
    SlateQ: A Tractable Decomposition for Reinforcement Learning with Recommendation Sets
    Vihan Jain
    Jing Wang
    Sanmit Narvekar
    Ritesh Agarwal
    Rui Wu
    Proceedings of the Twenty-eighth International Joint Conference on Artificial Intelligence (IJCAI-19), Macau, China (2019), pp. 2592-2599
    Preview abstract Reinforcement learning (RL) methods for recommender systems optimize recommendations for long-term user engagement. However, since users are often presented with slates of multiple items---which may have interacting effects on user choice---methods are required to deal with the combinatorics of the RL action space. We develop SlateQ, a decomposition of value-based temporal-difference and Q-learning that renders RL tractable with slates. Under mild assumptions on user choice behavior, we show that the long-term value (LTV) of a slate can be decomposed into a tractable function of its component item-wise LTVs. We demonstrate our methods in simulation, and validate the scalability and effectiveness of decomposed TD-learning on YouTube. View details
    General Evaluation for Instruction Conditioned Navigation using Dynamic Time Warping
    Gabriel Ilharco Magalhaes
    Vihan Jain
    NeurIPS Visually Grounded Interaction and Language (ViGIL) Workshop (2019)
    Preview abstract In instruction conditioned navigation, agents interpret natural language and their surroundings to navigate in an environment. Datasets for such tasks typically contain pairs of these instructions and reference trajectories, but current popular evaluation metrics fail to properly account for the fidelity of agents to those trajectories. To address this, we introduce the normalized Dynamic Time Warping (nDTW) metric. nDTW softly penalizes deviations from the reference path, is naturally sensitive to the order of the nodes composing each path, is suited for both continuous and graph-based evaluations, and can be efficiently calculated. Further, we define SDTW, which constrains nDTW to only successful episodes and effectively captures both success and fidelity. We collect human similarity judgments for simulated paths and find our DTW metrics correlate better with human rankings than all other metrics. We also show that using nDTW as a reward signal for agents using reinforcement learning improves performance on both the Room-to-Room and Room-for-Room datasets. View details
    Preview abstract Vision-and-Language Navigation (VLN) tasks such as Room-to-Room (R2R) require machine agents to interpret natural language instructions and learn to act in visually realistic environments to achieve navigation goals. The overall task requires competence in several perception problems: successful agents combine spatio-temporal, vision and language understanding to produce appropriate action sequences. Our approach adapts pre-trained vision and language representations to relevant in-domain tasks making them more effective for VLN. Specifically, the representations are adapted to solve both a cross-modal sequence alignment and sequence coherence task. In the sequence alignment task, the model determines whether an instruction corresponds to a sequence of visual frames. In the sequence coherence task, the model determines whether the perceptual sequences are predictive sequentially in the instruction-conditioned latent space. By transferring the domain-adapted representations, we improve competitive agents in R2R as measured by the success rate weighted by path length (SPL) metric. View details
    Preview abstract Object recognition and localization are important tasks in computer vision. The focus of this work is the incorporation of contextual information in order to improve object recognition and localization. For instance, it is natural to expect not to see an elephant appear in the middle of an ocean. We consider a simple approach to encapsulate such common sense knowledge using co-occurrence statistics from web documents. By merely counting the number of times nouns (such as elephants, sharks, oceans, etc.) co-occur in web documents, we obtain a good estimate of expected co-occurrences in visual data. We then cast the problem of combining textual co-occurrence statistics with the predictions of image-based classifiers as an optimization problem. The resulting optimization problem serves as a surrogate for our inference procedure. Despite the simplicity of the resulting optimization problem, it is effective in improving both recognition and localization accuracy. Concretely, we observe significant improvements in recognition and localization rates for both ImageNet Detection 2012 and Sun 2012 datasets. View details
    Translation-Inspired OCR
    Dmitriy Genzel
    Nemanja Spasojevic
    Michael Jahr
    Frank Yung-Fong Tang
    ICDAR-2011
    Preview abstract Optical character recognition is carried out using techniques borrowed from statistical machine translation. In particular, the use of multiple simple feature functions in linear combination, along with minimum-error-rate training, integrated decoding, and N-gram language modeling is found to be remarkably effective, across several scripts and languages. Results are presented using both synthetic and real data in five languages. View details
    Minimizing off-target signals in RNA fluorescent in situ hybridization
    Aaron Arvey
    Anita Hermann
    Cheryl C. Hsia
    Yoav Freund
    William McGinnis
    Nucleic Acids Research (2010)
    Large Scale Content-Based Audio Retrieval from Text Queries
    Gal Chechik
    Martin Rehn
    Samy Bengio
    ACM International Conference on Multimedia Information Retrieval (MIR), ACM (2008)
    Preview abstract In content-based audio retrieval, the goal is to find sound recordings (audio documents) based on their acoustic features. This content-based approach differs from retrieval approaches that index media files using metadata such as file names and user tags. In this paper, we propose a machine learning approach for retrieving sounds that is novel in that it (1) uses free-form text queries rather than sound-sample-based queries, (2) searches by audio content rather than via textual metadata, and (3) can scale to a very large number of audio documents and a very rich query vocabulary. We handle generic sounds, including a wide variety of sound effects, animal vocalizations and natural scenes. We test a scalable approach based on a passive-aggressive model for image retrieval (PAMIR), and compare it to two state-of-the-art approaches: Gaussian mixture models (GMM) and support vector machines (SVM). We test our approach on two large real-world datasets: a collection of short sound effects, and a noisier and larger collection of user-contributed, user-labeled recordings (25K files, 2000 terms vocabulary). We find that all three methods achieved very good retrieval performance. For instance, a positive document is retrieved in the first position of the ranking more than half the time, and on average there are more than 4 positive documents in the first 10 retrieved, for both datasets. PAMIR completed both training and retrieval of all data in less than 6 hours for both datasets, on a single machine. It was one to three orders of magnitude faster than the competing approaches. This approach should therefore scale to much larger datasets in the future. View details
    Multi-class protein classification using adaptive codes
    Iain Melvin
    Jason Weston
    William Stafford Noble
    Christina Leslie
    Journal of Machine Learning Research, vol. 8 (2007), pp. 1557-1581
    CP motifs, Hap1 and Heme Signaling
    L. Zhang
    C. Leslie
    H. C. Lee
    A. Kundaje
    X. Xin
    Y. Freund
    International Proceedings of the 15th International Conference on Cytochrome P450: Biochemistry, Biophysics, Functional Genomics (2007), pp. 45-51
    SVM-fold: a tool for discriminative multi-class protein fold and superfamily recognition
    Iain Melvin
    Rui Kuang
    Jason Weston
    William Stafford Noble
    Christina Leslie
    BMC Bioinformatics (2007)
    BioSpike: Efficient search for homologous proteins by indexing patterns
    Yoav Freund
    University of California, San Diego (2006)
    Semi-supervised protein classification using cluster kernels
    Jason Weston
    Christina Leslie
    William Stafford Noble
    Semi-Supervised Learning, MIT Press (2006), pp. 329-346
    Profile-based string kernels for remote homology detection and motif extraction
    Rui Kuang
    Ke Wang
    Kai Wang
    Mahira Siddiqi
    Yoav Freund
    Christina Leslie
    Journal of Bioinformatics and Computational Biology, vol. 3 (2005), pp. 527-550
    Multi-class protein fold recognition using adaptive codes
    Jason Weston
    William Stafford Noble
    Christina S. Leslie
    Proceedings of the 22nd international conference on Machine learning (2005), pp. 329-336
    Semi-supervised protein classification using cluster kernels
    Jason Weston
    Christina S. Leslie
    Dengyong Zhou
    Andr
    William Stafford Noble
    Bioinformatics, vol. 21 (2005), pp. 3241-3247
    Profile-based string kernels for remote homology detection and motif extraction
    Rui Kuang
    Ke Wang
    Kai Wang
    Mahira Siddiqi
    Yoav Freund
    Christina S. Leslie
    Proceedings of the 2004 IEEE Computational Systems Bioinformatics Conference (CSB'04), pp. 152-160