Gustavo Hernandez Abrego

Authored Publications
    Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models
    We provide the first exploration of sentence embeddings from text-to-text transformers (T5), including the effects of scaling up sentence encoders to 11B parameters. Sentence embeddings are broadly useful for language processing tasks. While T5 achieves impressive performance on language tasks, it is unclear how to produce sentence embeddings from encoder-decoder models. We investigate three methods to construct Sentence-T5 (ST5) models: two use only the T5 encoder and one uses the full T5 encoder-decoder. We establish a new sentence representation transfer benchmark, SentGLUE, which extends the SentEval toolkit to nine tasks from the GLUE benchmark. Our encoder-only models outperform the previous best models on both SentEval and SentGLUE transfer tasks, including semantic textual similarity (STS). Scaling up ST5 from millions to billions of parameters is shown to consistently improve performance. Finally, our encoder-decoder method achieves a new state-of-the-art on STS when using sentence embeddings.
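    The abstract does not spell out the exact pooling choices, but the encoder-only constructions can be sketched roughly as below with Hugging Face transformers; this is an illustrative reconstruction rather than the authors' code, and the t5-base checkpoint and input sentences are placeholders.

        # Rough sketch (not the paper's code): two encoder-only ways to pool a
        # sentence embedding out of T5, using Hugging Face transformers.
        import torch
        from transformers import AutoTokenizer, T5EncoderModel

        tokenizer = AutoTokenizer.from_pretrained("t5-base")   # placeholder checkpoint
        encoder = T5EncoderModel.from_pretrained("t5-base")

        batch = tokenizer(["A sentence to embed.", "Another one."],
                          padding=True, return_tensors="pt")
        with torch.no_grad():
            hidden = encoder(**batch).last_hidden_state        # (batch, tokens, dim)

        mask = batch["attention_mask"].unsqueeze(-1).float()
        first_token = hidden[:, 0]                             # pool via the first token
        mean_pooled = (hidden * mask).sum(1) / mask.sum(1)     # pool via masked mean
        # The third variant mentioned in the abstract uses the full encoder-decoder
        # and is omitted here.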
    Large Dual Encoders Are Generalizable Retrievers
    Jianmo Ni
    Zhuyun Dai
    Vincent Zhao
    Yi Luan
    Keith B. Hall
    Ming-Wei Chang
    Yinfei Yang
    (2022)
    It has been shown that dual encoders trained on one domain often fail to generalize to other domains for retrieval tasks. One widespread belief is that the bottleneck layer of a dual encoder, where the final score is simply a dot product between a query vector and a passage vector, is too limited to make dual encoders an effective retrieval model for out-of-domain generalization. In this paper, we challenge this belief by scaling up the size of the dual encoder model while keeping the bottleneck embedding size fixed. With multi-stage training, surprisingly, scaling up the model size brings significant improvement on a variety of retrieval tasks, especially for out-of-domain generalization. Experimental results show that our dual encoders, Generalizable T5-based dense Retrievers (GTR), significantly outperform existing sparse and dense retrievers on the BEIR dataset (Thakur et al., 2021). Most surprisingly, our ablation study finds that GTR is very data efficient, as it only needs 10% of MS MARCO supervised data to achieve the best out-of-domain performance.
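    As a concrete picture of the fixed bottleneck, the sketch below scores passages by a plain dot product between query and passage vectors; encode() is a random stand-in for the real T5-based encoder, so the names and dimensions are assumptions, not part of GTR.

        # Minimal sketch of dual-encoder scoring with a fixed-size bottleneck.
        # encode() is a random stand-in; in GTR it would be a (possibly very large)
        # T5 encoder whose output dimension stays fixed as the model grows.
        import numpy as np

        def encode(texts, dim=768, seed=0):
            rng = np.random.default_rng(seed)
            return rng.normal(size=(len(texts), dim)).astype(np.float32)

        def retrieve(query, passages, k=3):
            q = encode([query])[0]
            p = encode(passages)
            scores = p @ q                      # dot product between fixed-size vectors
            top = np.argsort(-scores)[:k]
            return [(passages[i], float(scores[i])) for i in top]

        print(retrieve("what is a dual encoder?",
                       ["passage a", "passage b", "passage c"], k=2))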
    Multi-stage Training with Improved Negative Contrast for Neural Passage Retrieval
    Jianmo Ni
    Yinfei Yang
    EMNLP 2021, Association for Computational Linguistics (2021), pp. 6091-6103
    In this paper we explore the effects of negative sampling in dual encoder models used to retrieve passages in automatic question answering tasks. We study four negative sampling strategies that complement the straightforward random sampling of negatives typically used to train dual encoder models. Of the four strategies, three are based on retrieval and one on heuristics. Among the retrieval-based strategies, two rely on the semantic similarity between the actual passage and its alternatives, and one on the lexical overlap between them. In our experiments we train the dual encoder models in two stages: pre-training with synthetic data and fine-tuning with domain-specific data. Negative sampling is applied in both stages. Our negative sampling is particularly useful when we augment the generic data for pre-training with synthetic examples. We evaluate our approach on three passage retrieval tasks for open-domain question answering. Although no single sampling strategy works best across all three tasks, all of them contribute to improving the contrast between the actual retrieval and its alternatives. Furthermore, mixing the negatives from different strategies achieves performance on par with the best-performing strategy in all tasks. Our results establish a new state-of-the-art level of performance on two of the open-domain question answering tasks that we evaluated.
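    The abstract does not give the exact selectors, so the sketch below only illustrates the idea of mixing negatives from several strategies for one training example; lexical_overlap, the dot-product similarity, and the embed argument are hypothetical stand-ins rather than the paper's procedures.

        # Hypothetical sketch: build a mixed pool of negatives for one gold passage
        # from random, lexical-overlap and semantic-similarity strategies.
        # embed() is any function mapping a list of texts to an array of vectors.
        import numpy as np

        def lexical_overlap(a, b):
            ta, tb = set(a.lower().split()), set(b.lower().split())
            return len(ta & tb) / max(1, len(ta | tb))

        def mix_negatives(positive, candidates, embed, per_strategy=2, seed=0):
            rng = np.random.default_rng(seed)
            pool = [c for c in candidates if c != positive]
            # Strategy 1: plain random negatives.
            random_negs = list(rng.choice(pool, size=min(per_strategy, len(pool)),
                                          replace=False))
            # Strategy 2: negatives with high lexical overlap with the gold passage.
            lexical_negs = sorted(pool, key=lambda c: lexical_overlap(positive, c),
                                  reverse=True)[:per_strategy]
            # Strategy 3: negatives semantically close to the gold passage.
            sims = embed(pool) @ embed([positive])[0]
            semantic_negs = [pool[i] for i in np.argsort(-sims)[:per_strategy]]
            return random_negs + lexical_negs + semantic_negs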
    Self-supervised Learning for Pairwise Data Refinement
    Bowen Liang
    Wei Wang
    Zarana Parekh
    Yinfei Yang
    Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, Association for Computational Linguistics, Suzhou, China (2020), pp. 435-446
    We present a self-supervised method to refine pairwise data using the contents of the data itself. Our method is based on cross-lingual similarity scores calculated with a dual-encoder model, which are used to select data for training new dual-encoder models in an iterative way. To illustrate the functionality of our method, we apply it to the task of denoising parallel texts mined from the internet on two language pairs: en-fr and en-de. We train dual-encoder models on the refined data and test them on the BUCC bitext mining tasks. The dual-encoder models show steady performance improvement with every iteration. We also use the refined data to train machine translation models that we integrate into our method to further improve the dual-encoder models. The machine translation models that we evaluate are competitive against similar models trained on data filtered with a supervised approach. Our method has the advantage that, being entirely self-supervised, it is well suited to handling text data for which there is no prior knowledge about the language or for which labeled clean data is not available.
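    The iterative loop can be sketched as below; refine, encode_src, encode_tgt, retrain and the keep_frac threshold are hypothetical names and values standing in for the paper's dual-encoder scoring and retraining steps, not the actual implementation.

        # Illustrative sketch of self-supervised pairwise data refinement: score the
        # pairs with the current dual encoder, keep the highest-scoring fraction,
        # retrain, and repeat.
        import numpy as np

        def refine(pairs, encode_src, encode_tgt, retrain, rounds=3, keep_frac=0.7):
            data = list(pairs)
            for _ in range(rounds):
                src = encode_src([s for s, _ in data])      # (n, dim) source vectors
                tgt = encode_tgt([t for _, t in data])      # (n, dim) target vectors
                sims = (src * tgt).sum(axis=1) / (
                    np.linalg.norm(src, axis=1) * np.linalg.norm(tgt, axis=1) + 1e-9)
                keep = np.argsort(-sims)[: max(1, int(len(data) * keep_frac))]
                data = [data[i] for i in keep]              # keep the cleanest pairs
                encode_src, encode_tgt = retrain(data)      # retrain on the refined set
            return data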