Eugene Ie
Authored Publications
CoMSum: Dataset and Neural Model for Contextual Multi-Document Summarization
Sheide Chammas
Wan Zhu
International Conference on Document Analysis and Recognition (2021)
Preview abstract
Summarization is the task of compressing source document(s) into coherent and succinct passages. Query-based (contextual) multi-document summarization (qMDS) is a variant that targets summaries to specific informational needs, with queries providing additional context. Progress in qMDS has been hampered by the limited availability of suitable datasets. In this work, we make two contributions. First, we develop an automatic approach for creating both extractive and abstractive qMDS examples from existing language resources. We use this approach to create CoMSum, a qMDS dataset for public use. Second, to validate the utility of CoMSum, we propose a neural model for extractive summarization that exploits the hierarchical nature of the input from multiple documents. It also infuses queries into the modeling to extract query-specific summaries. The experimental results show that modeling the queries and the multiple documents hierarchically improves the performance of qMDS on this dataset. This is consistent with our intuition and supports using CoMSum for developing learning methods for qMDS.
RecSim NG: Toward Principled Uncertainty Modeling for Recommender Ecosystems
Martin Mladenov
Vihan Jain
Christopher Colby
Nicolas Mayoraz
Hubert Pham
Ivan Vendrov
ArXiv (2021)
Preview abstract
The development of recommender systems that optimize multi-turn interaction with users and model the interactions of different agents (e.g., users, content providers, vendors) in the recommender ecosystem has drawn increasing attention in recent years. Developing and training models and algorithms for such recommenders can be especially difficult using static datasets, which often fail to offer the types of counterfactual predictions needed to evaluate policies over extended horizons. To address this, we develop RecSim NG, a probabilistic platform for the simulation of multi-agent recommender systems. RecSim NG is a scalable, modular, differentiable simulator implemented in Edward2 and TensorFlow. It offers: a powerful, general probabilistic programming language for agent-behavior specification; tools for probabilistic inference and latent-variable model learning, backed by automatic differentiation and tracing; and a TensorFlow-based runtime for running simulations on accelerated hardware. We describe RecSim NG and illustrate how it can be used to create transparent, configurable, end-to-end models of a recommender ecosystem, complemented by a small set of simple use cases that demonstrate how RecSim NG can help both researchers and practitioners easily develop and train novel algorithms for recommender systems.
A short version of this paper was published at RecSys 2020.
On the Evaluation of Vision-and-Language Navigation Instructions
Ming Zhao
Peter Anderson
Vihan Jain
Conference of the European Chapter of the Association for Computational Linguistics (EACL) (2021)
Preview abstract
We introduce Room-Across-Room (RxR), a new Vision-and-Language Navigation (VLN) dataset. RxR is multilingual (English, Hindi, and Telugu) and larger (more paths and instructions) than other VLN datasets. It emphasizes the role of language in VLN by addressing known biases in paths and eliciting more references to visible entities. Furthermore, each word in an instruction is time-aligned to the virtual poses of instruction creators and validators. We establish baseline scores for monolingual and multilingual settings and multitask learning when including Room-to-Room annotations. We also provide results for a model that learns from synchronized pose traces by focusing only on portions of the panorama attended to in human demonstrations. The size, scope and detail of RxR dramatically expand the frontier for research on embodied language agents in simulated, photo-realistic environments.
Demonstrating Principled Uncertainty Modeling for Recommender Ecosystems with RecSim NG
Martin Mladenov
Vihan Jain
Christopher Colby
Nicolas Mayoraz
Hubert Pham
Ivan Vendrov
RecSys '20: Fourteenth ACM Conference on Recommender Systems (2020), pp. 591-593
Preview abstract
We develop RecSim NG, a probabilistic platform that supports natural, concise specification and learning of models for multi-agent recommender systems simulation. RecSim NG is a scalable, modular, differentiable simulator implemented in Edward2 and TensorFlow.
An extended version of this paper is available as arXiv:2103.08057.
Preview abstract
Learning to follow instructions is of fundamental importance to autonomous agents for vision-and-language navigation (VLN). In this paper, we study how an agent can navigate long paths when learning from a corpus that consists of shorter ones. We show that existing state-of-the-art agents do not generalize well. To this end, we propose BabyWalk, a new VLN agent that learns to navigate by decomposing long instructions into shorter ones (BabySteps) and completing them sequentially. A specially designed memory buffer is used by the agent to turn its past experiences into contexts for future steps. The learning process is composed of two phases. In the first phase, the agent uses imitation learning from demonstration to accomplish BabySteps. In the second phase, the agent uses curriculum-based reinforcement learning to maximize rewards on navigation tasks with increasingly longer instructions. We create two new benchmark datasets (of long navigation tasks) and use them in conjunction with existing ones to examine BabyWalk’s generalization ability. Empirical results show that BabyWalk achieves state-of-the-art results on several metrics and, in particular, is able to follow long instructions better.
Preview abstract
Learning to fuse vision and language information and represent them is an important topic with many applications. Recent progress has leveraged the ideas of pre-training (from language modeling) and attention layers in Transformers to learn representations from datasets with images aligned with linguistic expressions that describe the images. In this paper, we propose learning representations from a set of implied visually grounded expressions between image and text, automatically mined from those datasets. In particular, we use denotation graphs to represent how specific concepts (such as sentences describing images) can be linked to abstract and generic concepts (such as short phrases) that are also visually grounded. Such generic-to-specific relations can be discovered using linguistic analysis tools. We propose methods to incorporate such relations into representation learning. We show that state-of-the-art multimodal learning methods such as ViLBERT can be further improved by leveraging automatically harvested structural relations. The representations lead to stronger empirical results on downstream tasks of text-based image retrieval and referring expression localization. We will release to the public both our code and the extracted denotation graphs on both the Flickr30K and the COCO datasets.
Preview abstract
The Touchdown dataset (Chen et al., 2019) provides instructions by human annotators for navigation through New York City streets and for resolving spatial descriptions at a given location. To enable the wider research community to work effectively with the Touchdown tasks, we are publicly releasing the 29k raw Street View panoramas needed for Touchdown. We follow the process used for the StreetLearn data release (Mirowski et al., 2019) to check panoramas for personally identifiable information and blur them as necessary. These have been added to the StreetLearn dataset and can be obtained via the same process as used previously for StreetLearn. We also provide a reference implementation for both of the Touchdown tasks: vision and language navigation (VLN) and spatial description resolution (SDR). We compare our model results to those given in Chen et al. (2019) and show that the panoramas we have added to StreetLearn fully support both Touchdown tasks and can be used effectively for further research and comparison.
Natural Language Grounded Multitask Navigation
Xin Wang
Vihan Jain
William Wang
Zornitsa Kozareva
Sujith Ravi
NeurIPS Visually Grounded Interaction and Language (ViGIL) (2019)
Preview abstract
Recent research efforts enable the study of natural language grounded navigation in photo-realistic environments, e.g., following natural language instructions or dialog. However, data scarcity is a critical issue in these tasks, as conducting human demonstrated language interactions in the simulator is still expensive and time-consuming and it is impractical to exhaustively collect samples for all variants of the navigation tasks. Therefore, we introduce a generalized multitask navigation model that can seamlessly be trained on language-grounded navigation tasks such as Vision-Language Navigation (VLN) and Navigation from Dialog History (NDH). Benefiting from richer natural language guidance, the multitask model can efficiently transfer knowledge across related tasks. Experiments show that it outperforms the single-task model by 7% (success rate) on VLN and 61% (goal progress) on NDH, establishing the new state of the art for NDH.
Stay on the Path: Instruction Fidelity in Vision-and-Language Navigation
Vihan Jain
Gabriel Magalhaes
Ashish Vaswani
Association for Computational Linguistics (2019)
Preview abstract
Advances in learning and representations have reinvigorated work that connects language to other modalities. A particularly exciting direction is Vision-and-Language Navigation (VLN), in which agents interpret natural language instructions and visual scenes to move through environments and reach goals. Despite recent progress, current research leaves unclear how much of a role language understanding plays in this task, especially because dominant evaluation metrics have focused on goal completion rather than the sequence of actions corresponding to the instructions. Here, we highlight shortcomings of current metrics for the Room-to-Room dataset (Anderson et al. 2018b) and propose a new metric, Coverage weighted by Length Score (CLS). We also show that the existing paths in the dataset are not ideal for evaluating instruction following because they are direct-to-goal shortest paths. We join existing short paths to form more challenging extended paths to create a new data set, Room-for-Room (R4R). Using R4R and CLS, we show that agents that receive rewards for instruction fidelity outperform agents that focus on goal completion.
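The abstract above names CLS without defining it. As a rough illustration, the sketch below implements a coverage-times-length-score metric in the form commonly given for CLS: path coverage is the mean of exp(-d/d_th) over reference nodes, and the length score compares the predicted path length to the expected length implied by that coverage. Euclidean distance between 2D points, the default threshold `d_th=3.0`, and the function names are assumptions made for this example; distances on a navigation graph would normally be used instead.

```python
import math

def path_length(path):
    """Sum of straight-line distances between consecutive nodes."""
    return sum(math.dist(a, b) for a, b in zip(path, path[1:]))

def cls_score(pred, ref, d_th=3.0):
    """Coverage weighted by length score for a predicted and a reference path."""
    # Path coverage: how closely the prediction passes by each reference node.
    coverage = sum(
        math.exp(-min(math.dist(r, p) for p in pred) / d_th) for r in ref
    ) / len(ref)
    # Expected length of a path that covers the reference this well.
    expected_len = coverage * path_length(ref)
    denom = expected_len + abs(expected_len - path_length(pred))
    length_score = expected_len / denom if denom > 0 else 0.0
    return coverage * length_score

reference = [(0.0, 0.0), (5.0, 0.0), (10.0, 0.0)]
detour = [(0.0, 0.0), (0.0, 10.0)]
print(cls_score(reference, reference))  # a path identical to the reference
print(cls_score(detour, reference))     # a path that leaves the reference
```

A prediction that reaches the goal by a different route is penalized here through the coverage term, which is the behavior a goal-completion metric would miss.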
Reinforcement Learning for Slate-based Recommender Systems: A Tractable Decomposition and Practical Methodology
Vihan Jain
Jing Wang
Sanmit Narvekar
Ritesh Agarwal
Rui Wu
Morgane Lustman
Vince Gatto
Paul Covington
Jim McFadden
arXiv (2019)
Preview abstract
Most practical recommender systems focus on estimating immediate user engagement without considering the long-term effects of recommendations on user behavior. Reinforcement learning (RL) methods offer the potential to optimize recommendations for long-term user engagement. However, since users are often presented with slates of multiple items (which may have interacting effects on user choice), methods are required to deal with the combinatorics of the RL action space. In this work, we address the challenge of making slate-based recommendations to optimize long-term value using RL. Our contributions are three-fold. (i) We develop SlateQ, a decomposition of value-based temporal-difference and Q-learning that renders RL tractable with slates. Under mild assumptions on user-choice behavior, we show that the long-term value (LTV) of a slate can be decomposed into a tractable function of its component item-wise LTVs. (ii) We outline a methodology that leverages existing myopic learning-based recommenders to quickly develop a recommender that handles LTV. (iii) We demonstrate our methods in simulation, and validate the scalability of decomposed TD-learning using SlateQ in live experiments on YouTube.
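The decomposition described in (i) can be illustrated with a small sketch in which the value of a slate is the choice-probability-weighted sum of item-wise LTVs. The conditional logit choice model with a no-click option, the zero value assigned to no-click, the brute-force slate search, and all names and numbers below are assumptions made for this example, not details taken from the paper.

```python
import math
from itertools import combinations

def choice_probs(scores, null_score=0.0):
    """Conditional-logit click probabilities for the items on a slate.

    The null option (user selects nothing) absorbs the remaining mass.
    """
    weights = [math.exp(s) for s in scores]
    total = sum(weights) + math.exp(null_score)
    return [w / total for w in weights]

def slate_value(slate, scores, item_ltv):
    """Slate LTV as a sum of item-wise LTVs weighted by choice probability.

    Assumption: the no-click outcome contributes zero value.
    """
    probs = choice_probs([scores[i] for i in slate])
    return sum(p * item_ltv[i] for p, i in zip(probs, slate))

def best_slate(items, scores, item_ltv, k):
    """Exhaustive search over size-k slates; viable only for tiny catalogs."""
    return max(combinations(items, k),
               key=lambda s: slate_value(s, scores, item_ltv))

scores = {"a": 2.0, "b": 0.5, "c": 0.1}    # hypothetical user-item affinities
item_ltv = {"a": 1.0, "b": 3.0, "c": 2.5}  # hypothetical item-wise LTVs
print(best_slate(list(scores), scores, item_ltv, k=2))
```

Because the slate value depends only on per-item quantities, item-wise values can be learned without enumerating slates during training; only the final slate selection is combinatorial, and this sketch handles it by brute force.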
RecSim: A Configurable Simulation Platform for Recommender Systems
Martin Mladenov
Vihan Jain
Sanmit Narvekar
Jing Wang
Rui Wu
arXiv (2019)
Preview abstract
We propose RecSim, a configurable platform for authoring simulation environments for recommender systems (RSs) that naturally supports sequential interaction with users. RecSim allows the creation of new environments that reflect particular aspects of user behavior and item structure at a level of abstraction well-suited to pushing the limits of current reinforcement learning (RL) and RS techniques in sequential interactive recommendation problems. Environments can be easily configured that vary assumptions about: user preferences and item familiarity; user latent state and its dynamics; and choice models and other user response behavior. We outline how RecSim offers value to RL and RS researchers and practitioners, and how it can serve as a vehicle for academic-industrial collaboration.
Learning Dense Representations for Entity Retrieval
Larry Lansing
Diego Garcia-Olano
CoNLL (2019)
Preview abstract
We show that it is feasible to perform entity linking by training a dual encoder (two-tower) model that encodes mentions and entities in the same dense vector space, where candidate entities are retrieved by approximate nearest neighbor search. Unlike prior work, this setup does not rely on an alias table followed by a re-ranker, and is thus the first fully learned entity retrieval model. We show that our dual encoder, trained using only anchor-text links in Wikipedia, outperforms discrete alias table and BM25 baselines, and is competitive with the best comparable results on the standard TACKBP-2010 dataset. In addition, it can retrieve candidates extremely fast, and generalizes well to a new dataset derived from Wikinews. On the modeling side, we demonstrate the dramatic value of an unsupervised negative mining algorithm for this task.
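The retrieval step described above can be sketched as a nearest neighbor lookup in a shared vector space. In this minimal example the encoders are replaced by hand-written vectors, inner product is assumed as the similarity function, and exact brute-force search stands in for approximate nearest neighbor search; the entity names and numbers are made up for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def retrieve(mention_vec, entity_vecs, k=2):
    """Return the k entity ids whose vectors score highest against the mention."""
    ranked = sorted(entity_vecs,
                    key=lambda e: dot(mention_vec, entity_vecs[e]),
                    reverse=True)
    return ranked[:k]

# Hypothetical entity embeddings, standing in for the entity tower's output.
entity_vecs = {
    "Paris_(city)": [0.9, 0.1, 0.0],
    "Paris_Hilton": [0.1, 0.9, 0.0],
    "Paris_(mythology)": [0.0, 0.2, 0.9],
}
# Hypothetical mention embedding, standing in for the mention tower's output.
mention_vec = [0.8, 0.2, 0.1]
print(retrieve(mention_vec, entity_vecs))
```

No alias table is consulted at any point: candidates come purely from vector similarity, which is the property the abstract emphasizes.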
SlateQ: A Tractable Decomposition for Reinforcement Learning with Recommendation Sets
Vihan Jain
Jing Wang
Sanmit Narvekar
Ritesh Agarwal
Rui Wu
Proceedings of the Twenty-eighth International Joint Conference on Artificial Intelligence (IJCAI-19), Macau, China (2019), pp. 2592-2599
Preview abstract
Reinforcement learning (RL) methods for recommender systems optimize recommendations for long-term user engagement. However, since users are often presented with slates of multiple items (which may have interacting effects on user choice), methods are required to deal with the combinatorics of the RL action space. We develop SlateQ, a decomposition of value-based temporal-difference and Q-learning that renders RL tractable with slates. Under mild assumptions on user choice behavior, we show that the long-term value (LTV) of a slate can be decomposed into a tractable function of its component item-wise LTVs. We demonstrate our methods in simulation, and validate the scalability and effectiveness of decomposed TD-learning on YouTube.
General Evaluation for Instruction Conditioned Navigation using Dynamic Time Warping
Gabriel Ilharco Magalhaes
Vihan Jain
NeurIPS Visually Grounded Interaction and Language (ViGIL) Workshop (2019)
Transferable Representation Learning in Vision-and-Language Navigation
Haoshuo Huang
Vihan Jain
Gabriel Ilharco Magalhaes
ICCV 2019 (2019)
Multi-modal Discriminative Model for Vision-and-Language Navigation
Haoshuo Huang
Vihan Jain
Proceedings of the Combined Workshop on Spatial Language Understanding (SpLU) and Grounded Communication for Robotics (RoboNLP) (2019)
Using Web Co-occurrence Statistics for Improving Image Categorization
Samy Bengio
Andrew Rabinovich
Jonathon Shlens
Yoram Singer
arXiv (2013)
Preview abstract
Object recognition and localization are important tasks in computer vision. The focus of this work is the incorporation of contextual information in order to improve object recognition and localization. For instance, it is natural not to expect an elephant to appear in the middle of an ocean. We consider a simple approach to encapsulate such common sense knowledge using co-occurrence statistics from web documents. By merely counting the number of times nouns (such as elephants, sharks, oceans, etc.) co-occur in web documents, we obtain a good estimate of expected co-occurrences in visual data. We then cast the problem of combining textual co-occurrence statistics with the predictions of image-based classifiers as an optimization problem. The resulting optimization problem serves as a surrogate for our inference procedure. Despite the simplicity of the resulting optimization problem, it is effective in improving both recognition and localization accuracy. Concretely, we observe significant improvements in recognition and localization rates for both ImageNet Detection 2012 and Sun 2012 datasets.
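The counting step described above can be sketched in a few lines. The example assumes whitespace tokenization, exact lowercase matching against a fixed noun vocabulary, and document-level co-occurrence; these choices, along with the toy documents, are illustrative assumptions rather than the procedure used in the paper.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(documents, vocabulary):
    """Count, for each pair of nouns, how many documents mention both."""
    counts = Counter()
    for doc in documents:
        # Sorting gives each pair a single canonical key, e.g. ("ocean", "shark").
        present = sorted(vocabulary & set(doc.lower().split()))
        counts.update(combinations(present, 2))
    return counts

docs = [
    "a shark swims in the ocean",
    "the elephant walks across the savanna",
    "an ocean full of shark and fish",
]
vocab = {"elephant", "shark", "ocean", "savanna"}
counts = cooccurrence_counts(docs, vocab)
print(counts[("ocean", "shark")], counts[("elephant", "ocean")])
```

Pairs that never appear together, such as elephant and ocean here, keep a count of zero, which is the kind of signal the abstract proposes combining with classifier predictions.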
Translation-Inspired OCR
Dmitriy Genzel
Nemanja Spasojevic
Michael Jahr
Frank Yung-Fong Tang
ICDAR-2011
Preview abstract
Optical character recognition is carried out using techniques borrowed from statistical machine translation. In particular, the use of multiple simple feature functions in linear combination, along with minimum-error-rate training, integrated decoding, and N-gram language modeling is found to be remarkably effective, across several scripts and languages. Results are presented using both synthetic and real data in five languages.
Minimizing off-target signals in RNA fluorescent in situ hybridization
Aaron Arvey
Anita Hermann
Cheryl C. Hsia
Yoav Freund
William McGinnis
Nucleic Acids Research (2010)
Large Scale Content-Based Audio Retrieval from Text Queries
Gal Chechik
Martin Rehn
Samy Bengio
ACM International Conference on Multimedia Information Retrieval (MIR), ACM (2008)
Preview abstract
In content-based audio retrieval, the goal is to find sound recordings (audio documents) based on their acoustic features. This content-based approach differs from retrieval approaches that index media files using metadata such as file names and user tags.
In this paper, we propose a machine learning approach for retrieving sounds that is novel in that it (1) uses free-form text queries rather than sound-sample-based queries, (2) searches by audio content rather than via textual metadata, and (3) can scale to a very large number of audio documents and a very rich query vocabulary. We handle generic sounds, including a wide variety of sound effects, animal vocalizations and natural scenes. We test a scalable approach based on a passive-aggressive model for image retrieval (PAMIR), and compare it to two state-of-the-art approaches: Gaussian mixture models (GMM) and support vector machines (SVM).
We test our approach on two large real-world datasets: a collection of short sound effects, and a noisier and larger collection of user-contributed, user-labeled recordings (25K files, 2000-term vocabulary). We find that all three methods achieved very good retrieval performance. For instance, a positive document is retrieved in the first position of the ranking more than half the time, and on average there are more than 4 positive documents in the first 10 retrieved, for both datasets. PAMIR completed both training and retrieval of all data in less than 6 hours for both datasets, on a single machine. It was one to three orders of magnitude faster than the competing approaches. This approach should therefore scale to much larger datasets in the future.
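The abstract names a passive-aggressive ranking model without spelling out its update. The sketch below shows one standard passive-aggressive step for a bilinear ranking score, trained on (query, relevant document, irrelevant document) triplets. The bilinear form, the aggressiveness cap `C`, and the toy vectors are assumptions made for this example and may differ from the configuration used in the paper.

```python
def score(W, q, d):
    """Bilinear relevance score q^T W d for text query q and audio features d."""
    return sum(q[i] * W[i][j] * d[j]
               for i in range(len(q)) for j in range(len(d)))

def pa_update(W, q, d_pos, d_neg, C=1.0):
    """One passive-aggressive step on a (query, relevant, irrelevant) triplet."""
    # Hinge loss: the relevant document should outscore the other by a margin of 1.
    loss = max(0.0, 1.0 - score(W, q, d_pos) + score(W, q, d_neg))
    diff = [p - n for p, n in zip(d_pos, d_neg)]
    # Squared Frobenius norm of the outer product q * diff^T.
    norm_sq = sum(x * x for x in q) * sum(x * x for x in diff)
    if loss == 0.0 or norm_sq == 0.0:
        return W  # passive: the margin is already satisfied
    tau = min(C, loss / norm_sq)  # aggressive: smallest step that fixes the triplet
    return [[W[i][j] + tau * q[i] * diff[j] for j in range(len(diff))]
            for i in range(len(q))]

W0 = [[0.0, 0.0], [0.0, 0.0]]
query, relevant, irrelevant = [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]
W1 = pa_update(W0, query, relevant, irrelevant)
print(score(W1, query, relevant) - score(W1, query, irrelevant))
```

Each step touches only one triplet and has a closed-form step size, which is consistent with the training-speed advantage the abstract reports over the GMM and SVM baselines.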
Multi-class protein classification using adaptive codes
Iain Melvin
Jason Weston
William Stafford Noble
Christina Leslie
Journal of Machine Learning Research, vol. 8 (2007), pp. 1557-1581
SVM-fold: a tool for discriminative multi-class protein fold and superfamily recognition
Iain Melvin
Rui Kuang
Jason Weston
William Stafford Noble
Christina Leslie
BMC Bioinformatics (2007)
CP motifs, Hap1 and Heme Signaling
L. Zhang
C. Leslie
H. C. Lee
A. Kundaje
X. Xin
Y. Freund
International Proceedings of the 15th International Conference on Cytochrome P450: Biochemistry, Biophysics, Functional Genomics (2007), pp. 45-51
BioSpike: Efficient search for homologous proteins by indexing patterns
Semi-supervised protein classification using cluster kernels
Jason Weston
Christina Leslie
William Stafford Noble
Semi-Supervised Learning, MIT Press (2006), pp. 329-346
Semi-supervised protein classification using cluster kernels
Jason Weston
Christina S. Leslie
Dengyong Zhou
Andr
William Stafford Noble
Bioinformatics, vol. 21 (2005), pp. 3241-3247
Profile-based string kernels for remote homology detection and motif extraction
Rui Kuang
Ke Wang
Kai Wang
Mahira Siddiqi
Yoav Freund
Christina Leslie
Journal of Bioinformatics and Computational Biology, vol. 3 (2005), pp. 527-550
Multi-class protein fold recognition using adaptive codes
Jason Weston
William Stafford Noble
Christina S. Leslie
Proceedings of the 22nd international conference on Machine learning (2005), pp. 329-336
Profile-based string kernels for remote homology detection and motif extraction
Rui Kuang
Ke Wang
Kai Wang
Mahira Siddiqi
Yoav Freund
Christina S. Leslie
Proceedings of the 2004 IEEE Computational Systems Bioinformatics Conference (CSB'04), pp. 152-160