Jump to Content

Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

people standing in front of a screen with images and a chipboard

Publications

Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
1 - 15 of 423 publications
    Preview abstract A recent large-scale experiment conducted by Chrome has demonstrated that a "quieter" web permission prompt can reduce unwanted interruptions while only marginally affecting grant rates. However, the experiment and the partial roll-out were missing two important elements: (1) an effective and context-aware activation mechanism for such a quieter prompt, and (2) an analysis of user attitudes and sentiment towards such an intervention. In this paper, we address these two limitations by means of a novel ML-based activation mechanism -- and its real-world on-device deployment in Chrome -- and a large-scale user study with 13.1k participants from 156 countries. First, the telemetry-based results, computed on more than 20 million samples from Chrome users in-the-wild, indicate that the novel on-device ML-based approach is both extremely precise (>99% post-hoc precision) and has very high coverage (96% recall for notifications permission). Second, our large-scale, in-context user study shows that quieting is often perceived as helpful and does not cause high levels of unease for most respondents. View details
    Preview abstract The web utilizes permission prompts to moderate access to certain capabilities. We present the first investigation of user behavior and sentiment of this security and privacy measure on the web, using 28 days of telemetry data from more than 100M Chrome installations on desktop platforms and experience sampling responses from 25,706 Chrome users. Based on this data, we find that ignoring and dismissing permission prompts are most common for geolocation and notifications. Permission prompts are perceived as more annoying and interrupting when they are not allowed, and most respondents cite a rational reason for the decision they took. Our data also supports that the perceived availability of contextual information from the requesting website is associated with allowing access to a requested capability. More usable permission controls could facilitate adoption of best practices that address several of the identified challenges; and ultimately could lead to better user experiences and a safer web. View details
    Scaling Up LLM Reviews for Google Ads Content Moderation
    Ariel Fuxman
    Chih-Chun Chia
    Dongjin Kwon
    Enming Luo
    Mehmet Tek
    Ranjay Krishna
    Tiantian Fang
    Tushar Dogra
    Yu-Han Lyu
    (2024)
    Preview abstract Large language models (LLMs) are powerful tools for content moderation but LLM inference costs and latency on large volumes of data, such as the Google Ads repository, are prohibitive for their casual usage. This study is focused on scaling up LLM reviews for content moderation in Google Ads. First, we use heuristics to select candidates via filtering and duplicate removal, and create clusters of ads for which we select one representative ad per cluster. Then, LLMs are used to review only the representative ads. Finally we propagate the LLM decisions for representative ads back to their clusters. This method reduces the number of reviews by more than 3 orders of magnitude while achieving a 2x recall compared to a non-LLM model as a baseline. Note that, the success of this approach is a strong function of the representations used in clustering and label propagation; we observed that cross-modal similarity representations yield better results than uni-modal representations. View details
    Preview abstract Zero-shot text rankers powered by recent LLMs achieve remarkable ranking performance by simply prompting. Existing prompts for pointwise LLM rankers mostly ask the model to choose from binary relevance labels like "Yes" and "No". However, the lack of intermediate relevance label options may cause the LLM to provide noisy or biased answers for documents that are partially relevant to the query. We propose to incorporate fine-grained relevance labels into the prompt for LLM rankers, enabling them to better differentiate among documents with different levels of relevance to the query and thus derive a more accurate ranking. We study two variants of the prompt template, coupled with different numbers of relevance levels. Our experiments on 8 BEIR data sets show that adding fine-grained relevance labels significantly improves the performance of LLM rankers. View details
    Preview abstract Ranking documents using Large Language Models (LLMs) by directly feeding the query and candidate documents into the prompt is an interesting and practical problem. However, researchers have found it difficult to outperform fine-tuned baseline rankers on benchmark datasets. We analyze pointwise and listwise ranking prompts used by existing methods and argue that off-the-shelf LLMs do not fully understand these challenging ranking formulations. In this paper, we propose to significantly reduce the burden on LLMs by using a new technique called Pairwise Ranking Prompting (PRP). Our results are the first in the literature to achieve state-of-the-art ranking performance on standard benchmarks using moderate-sized open-sourced LLMs. On TREC-DL 2019&2020, PRP based on the Flan-UL2 model with 20B parameters performs favorably with the previous best approach in the literature, which is based on the blackbox commercial GPT-4 that has 50x (estimated) model size, while outperforming other LLM-based solutions, such as InstructGPT which has 175B parameters, by over 10% for all ranking metrics. By using the same prompt template on seven BEIR tasks, PRP outperforms supervised baselines and outperforms the blackbox commercial ChatGPT solution by 4.2% and pointwise LLM-based solutions by more than 10% on average NDCG@10. Furthermore, we propose several variants of PRP to improve efficiency and show that it is possible to achieve competitive results even with linear complexity. View details
    Preview abstract Browser fingerprinting is often associated with cross-site user tracking, a practice that many browsers (e.g., Safari, Brave, Edge, Firefox and Chrome) want to block. However, less is publicly known about its uses to enhance online safety, where it can provide an additional security layer against service abuses (e.g., in combination with CAPTCHAs) or during user authentication. To the best of our knowledge, no fingerprinting defenses deployed thus far consider this important distinction when blocking fingerprinting attempts, so they might negatively affect website functionality and security. To address this issue we make three main contributions. First, we propose and evaluate a novel machine learning-based method to automatically identify authentication pages (i.e. sign-in and sign-up pages). Our algorithm -- which relies on a hybrid unsupervised/supervised approach -- achieves 96-98% precision and recall on a large, manually-labelled dataset of 10,000 popular sites. Second, we compare our algorithm with other methods from prior works on the same dataset, showing that it significantly outperforms all of them (+83% F1-score). Third, we quantify the prevalence of fingerprinting scripts across sign-in and sign-up pages (9.2%) versus those executed on other pages (8.9%); while the rates of fingerprinting are similar, home pages and authentication pages differ in the third-party scripts they include and how often these scripts are labeled as tracking. We also highlight the substantial differences in fingerprinting behavior on login and sign-up pages. Our work sheds light on the complicated reality that fingerprinting is used to both protect user security and invade user privacy, and that this dual nature must be considered by fingerprinting mitigations. View details
    DSI++: Updating Transformer Memory with New Documents
    Yi Tay
    Jinfeng Rao
    Emma Strubell
    Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
    Preview abstract Differentiable Search Indices (DSIs) encode a corpus of documents in model parameters and use the same model to answer user queries directly. Despite the strong performance of DSI models, deploying them in situations where the corpus changes over time is computationally expensive because reindexing the corpus requires re-training the model. In this work, we introduce DSI++, a continual learning challenge for DSI to incrementally index new documents while being able to answer queries related to both previously and newly indexed documents. Across different model scales and document identifier representations, we show that continual indexing of new documents leads to considerable forgetting of previously indexed documents. We also hypothesize and verify that the model experiences forgetting events during training, leading to unstable learning. To mitigate these issues, we investigate two approaches. The first focuses on modifying the training dynamics. Flatter minima implicitly alleviate forgetting, so we optimize for flatter loss basins and show that the model stably memorizes more documents (+12%). Next, we introduce a generative memory to sample pseudo-queries for documents and supplement them during continual indexing to prevent forgetting for the retrieval task. Extensive experiments on novel continual indexing benchmarks based on Natural Questions (NQ) and MS MARCO demonstrate that our proposed solution mitigates forgetting significantly. Concretely, it improves the average Hits@10 by +21.1% over competitive baselines for NQ and requires 6 times fewer model updates compared to re-training the DSI model for incrementally indexing five corpora in a sequence. View details
    Preview abstract Unbiased learning to rank (ULTR) studies the problem of mitigating various biases from implicit user feedback data such as clicks, and has been receiving considerable attention recently. A popular ULTR approach for real-world applications uses a two-tower architecture, where click modeling is factorized into a relevance tower with regular input features, and a bias tower with bias-relevant inputs such as the position of a document. A successful factorization will allow the relevance tower to be exempt from biases. In this work, we identify a critical issue that existing ULTR methods ignored - the bias tower can be confounded with the relevance tower via the underlying true relevance. In particular, the positions were determined by the logging policy, i.e., the previous production model, which would possess relevance information. We give both theoretical analysis and empirical results to show the negative effects on relevance tower due to such a correlation. We then propose two methods to mitigate the negative confounding effects by better disentangling relevance and bias. Offline empirical results on both controlled public datasets and a large-scale industry dataset show the effectiveness of the proposed approaches. We conduct a live experiment on a popular web store for four weeks, and find a significant improvement in user clicks over the baseline, which ignores the negative confounding effect. View details
    Regression Compatible Listwise Objectives for Calibrated Ranking with Binary Relevance
    Pratyush Kar
    Bing-Rong Lin
    Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (2023)
    Preview abstract As Learning-to-Rank (LTR) approaches primarily seek to improve ranking quality, their output scores are not scale-calibrated by design. This fundamentally limits LTR usage in score-sensitive applications. Though a simple multi-objective approach that combines a regression and a ranking objective can effectively learn scale-calibrated scores, we argue that the two objectives are not necessarily compatible, which makes the trade-off less ideal for either of them. In this paper, we propose a practical regression compatible ranking (RCR) approach that achieves a better trade-off, where the two ranking and regression components are proved to be mutually aligned. Although the same idea applies to ranking with both binary and graded relevance, we mainly focus on binary labels in this paper. We evaluate the proposed approach on several public LTR benchmarks and show that it consistently achieves either best or competitive result in terms of both regression and ranking metrics, and significantly improves the Pareto frontiers in the context of multi-objective optimization. Furthermore, we evaluated the proposed approach on YouTube Search and found that it not only improved the ranking quality of the production pCTR model, but also brought gains to the click prediction accuracy. The proposed approach has been successfully deployed in the YouTube production system. View details
    Preview abstract The approximate nearest neighbor (ANN) search problem is fundamental to efficiently serving many real-world machine learning applications. A number of techniques have been developed for ANN search that are efficient, accurate, and scalable. However, such techniques typically have a number of parameters that affect the speed-recall tradeoff, and exhibit poor performance when such parameters aren't properly set. Tuning these parameters has traditionally been a manual process, demanding in-depth knowledge of the underlying search algorithm. This is becoming an increasingly unrealistic demand as ANN search grows in popularity. To tackle this obstacle to ANN adoption, this work proposes a constrained optimization-based approach to tuning quantization-based ANN algorithms. Our technique takes just a desired search cost or recall as input, and then generates tunings that, empirically, are very close to the speed-recall Pareto frontier and give leading performance on standard benchmarks. View details
    Conversational Information Seeking
    Hamed Zamani
    Johanne R. Trippas
    Jeff Dalton
    Filip Radlinski
    Foundations and Trends® in Information Retrieval (2023), pp. 244-456
    Preview abstract Conversational information seeking (CIS) is concerned with a sequence of interactions between one or more users and an information system. Interactions in CIS are primarily based on natural language dialogue, while they may include other types of interactions, such as click, touch, and body gestures. This monograph provides a thorough overview of CIS definitions, applications, interactions, interfaces, design, implementation, and evaluation. This monograph views CIS applications as including conversational search, conversational question answering, and conversational recommendation. Our aim is to provide an overview of past research related to CIS, introduce the current state-of-the-art in CIS, highlight the challenges still being faced in the community, and suggest future directions. View details
    HiPrompt: Few-Shot Biomedical Knowledge Fusion via Hierarchy-Oriented Prompting
    Jiaying Lu
    Jiaming Shen
    Bo Xiong
    Wenjing Ma
    Steffen Staab
    Carl Yang
    Proc. of The 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (2023)
    Preview abstract Medical decision-making processes can be enhanced by comprehensive biomedical knowledge bases, which require fusing knowledge graphs constructed from different sources via a uniform index system. The index system often organizes biomedical terms in a hierarchy to provide the aligned entities with fine-grained granularity. To address the challenge of scarce supervision in the biomedical knowledge fusion (BKF) task, researchers have proposed various unsupervised methods. However, these methods heavily rely on ad-hoc lexical and structural matching algorithms, which fail to capture the rich semantics conveyed by biomedical entities and terms. Recently, neural embedding models have proved effective in semantic-rich tasks, but they rely on sufficient labeled data to be adequately trained. To bridge the gap between the scarce-labeled BKF and neural embedding models, we propose HiPrompt, a supervision-efficient knowledge fusion framework that elicits the few-shot reasoning ability of large language models through hierarchy-oriented prompts. Empirical results on the collected KG-Hi-BKF benchmark datasets demonstrate the effectiveness of HiPrompt. View details
    Optimizing Test-time Query Representations for Dense Retrieval
    Mujeen Sung
    Jungsoo Park
    Jaewoo Kang
    Danqi Chen
    Findings of ACL 2023
    Preview abstract Recent developments of dense retrieval rely on quality representations of queries and contexts from pre-trained query and context encoders. In this paper, we introduce TOUR (Test-Time Optimization of Query Representations), which further optimizes instance-level query representations guided by signals from test-time retrieval results. We leverage a cross-encoder re-ranker to provide fine-grained pseudo labels over retrieval results and iteratively optimize query representations with gradient descent. Our theoretical analysis reveals that TOUR can be viewed as a generalization of the classical Rocchio algorithm for pseudo relevance feedback, and we present two variants that leverage pseudo-labels as hard binary or soft continuous labels. We first apply TOUR on phrase retrieval with our proposed phrase re-ranker, and also evaluate its effectiveness on passage retrieval with an off-the-shelf re-ranker. TOUR greatly improves end-to-end open-domain question answering accuracy, as well as passage retrieval performance. TOUR also consistently improves direct re-ranking by up to 2.0% while running 1.3-2.4x faster with an efficient implementation. View details
    Rethinking the Role of Token Retrieval in Multi-Vector Retrieval
    Zhuyun Dai
    Tao Lei
    Iftekhar Naim
    Ming-Wei Chang
    Vincent Zhao
    Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS 2023)
    Preview abstract Multi-vector retrieval models such as ColBERT [Khattab and Zaharia, 2020] allow token-level interactions between queries and documents, and hence achieve state of the art on many information retrieval benchmarks. However, their non-linear scoring functions are computationally expensive, necessitating a two-stage process for inference: initial candidate retrieval via token retrieval and subsequent refinement stage which re-ranks candidates using the scoring function. Prior training algorithms mainly focus on the re-ranking stage, under-estimating the importance of the token retrieval stage. In this paper, we rethink the role of token retrieval for multi-vector retrieval models and presentXTR, ConteXtualized TokenRetriever. XTR introduces a simple, yet novel, objective function to encourage better token retrieval, which drastically reduce the mismatch between the training objective and the inference procedure. Unexpectedly, our studies have demonstrated that when the token retrieval stage is improved, the refinement stage can be reduced and approximated. Based on this observation, XTR includes a fast refinement algorithm that can re-rank the candidates 4,000× cheaper compared to the refinement stage of ColBERT. On the popular BEIR benchmark [Thakur et al., 2021], XTR advances the state-of-the-art by 3.3 points, achieving 53.2 nDCG@10. Detailed analysis is conducted to confirm that the success of XTR indeed come from better recall of the token-level retrieval stage. View details
    End-to-End Query Term Weighting
    Karan Samel
    Tao Chen
    Mingyang Zhang
    Swaraj Khadanga
    Wensong Xu
    Xingyu Wang
    Kashyap Kolipaka
    Proceedings of the 29th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '23) (2023)
    Preview abstract Bag-of-words based lexical retrieval systems are still the most commonly used methods for real-world search applications. Recently deep learning methods have shown promising results to improve this retrieval performance but are expensive to run in an online fashion, non-trivial to integrate into existing production systems, and might not generalize well in out-of-domain retrieval scenarios. Instead, we build on top of lexical retrievers by proposing a Term Weighting BERT (TW-BERT) model. TW-BERT learns to predict the weight for individual n-gram (e.g., uni-grams and bi-grams) query input terms. These inferred weights and terms can be used directly by a retrieval system to perform a query search. To optimize these term weights, TW-BERT incorporates the scoring function used by the search engine, such as BM25, to score query-document pairs. Given sample query-document pairs we can compute a ranking loss over these matching scores, optimizing the learned query term weights in an end-to-end fashion. Aligning TW-BERT with search engine scorers minimizes the changes needed to integrate it into existing production applications, whereas existing deep learning based search methods would require further infrastructure optimization and hardware requirements. The learned weights can be easily utilized by standard lexical retrievers and by other retrieval techniques such as query expansion. We show that TW-BERT improves retrieval performance over strong term weighting baselines within MSMARCO and in out-of-domain retrieval on TREC datasets. View details