Jump to Content

Information Retrieval and the Web

The science surrounding search engines is commonly referred to as information retrieval, in which algorithmic principles are developed to match user interests to the best information about those interests.

Google started as a result of our founders' attempt to find the best matching between the user queries and Web documents, and do it really fast. During the process, they uncovered a few basic principles: 1) best pages tend to be those linked to the most; 2) best description of a page is often derived from the anchor text associated with the links to a page. Theories were developed to exploit these principles to optimize the task of retrieving the best documents for a user query.

Search and Information Retrieval on the Web has advanced significantly from those early days: 1) the notion of ""information"" has greatly expanded from documents to much richer representations such as images, videos, etc., 2) users are increasingly searching on their Mobile devices with very different interaction characteristics from search on the Desktops; 3) users are increasingly looking for direct information, such as answers to a question, or seeking to complete tasks, such as appointment booking. Through our research, we are continuing to enhance and refine the world's foremost search engine by aiming to scientifically understand the implications of those changes and address new challenges that they bring.

Recent Publications

Preview abstract The web utilizes permission prompts to moderate access to certain capabilities. We present the first investigation of user behavior and sentiment of this security and privacy measure on the web, using 28 days of telemetry data from more than 100M Chrome installations on desktop platforms and experience sampling responses from 25,706 Chrome users. Based on this data, we find that ignoring and dismissing permission prompts are most common for geolocation and notifications. Permission prompts are perceived as more annoying and interrupting when they are not allowed, and most respondents cite a rational reason for the decision they took. Our data also supports that the perceived availability of contextual information from the requesting website is associated with allowing access to a requested capability. More usable permission controls could facilitate adoption of best practices that address several of the identified challenges; and ultimately could lead to better user experiences and a safer web. View details
Preview abstract A recent large-scale experiment conducted by Chrome has demonstrated that a "quieter" web permission prompt can reduce unwanted interruptions while only marginally affecting grant rates. However, the experiment and the partial roll-out were missing two important elements: (1) an effective and context-aware activation mechanism for such a quieter prompt, and (2) an analysis of user attitudes and sentiment towards such an intervention. In this paper, we address these two limitations by means of a novel ML-based activation mechanism -- and its real-world on-device deployment in Chrome -- and a large-scale user study with 13.1k participants from 156 countries. First, the telemetry-based results, computed on more than 20 million samples from Chrome users in-the-wild, indicate that the novel on-device ML-based approach is both extremely precise (>99% post-hoc precision) and has very high coverage (96% recall for notifications permission). Second, our large-scale, in-context user study shows that quieting is often perceived as helpful and does not cause high levels of unease for most respondents. View details
VRDU: A Benchmark for Visually-rich Document Understanding
Zilong Wang
Wei Wei
2023 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Preview abstract Understanding visually-rich business documents to extract structured data and automate business workflows has been receiving attention both in academia and industry. Although recent multi-modal language models have achieved impressive results, we argue that existing benchmarks do not reflect the complexity of real documents seen in industry, and therefore not suitable for measuring progress in practical settings. In this work, we identify the desiderata for a more comprehensive benchmark and propose one we call VRDU for Visually Rich Document Understanding. VRDU contains two datasets that represent several challenges: rich schema including diverse data types as well as nested entities, complex templates including tables and multi-column layouts and diversity of different layouts within a single document type. We design few-shot and conventional experiment settings along with a carefully designed matching algorithm to evaluate extraction results. We report the performance of strong baselines and observe three conclusions: (1) generalizing to new templates from a document type is still very challenging, (2) few-shot performance continues to have a lot of headroom, and (3) models struggle with nested repeated fields such as line-items in an invoice. We plan to open source the benchmark and the evaluation toolkit. We hope that it helps inspire and guide future research in this challenging area. View details
Preview abstract Conversational recommendation systems (CRS) aim to recommend suitable items to users through natural language conversation. However, most CRS approaches do not effectively utilize the signal provided by these conversations. They rely heavily on explicit external knowledge e.g., knowledge graphs to augment the models' understanding of the items and attributes, which is quite hard to scale. To alleviate this, we propose an alternative information retrieval (IR)-styled approach to the CRS item recommendation task, where we represent conversations as queries and items as documents to be retrieved. We expand the document representation used for retrieval with conversations from the training set. With a simple BM25-based retriever, we show that our task formulation compares favorably with much more complex baselines using complex external knowledge on a popular CRS benchmark. We demonstrate further improvements using user-centric modeling and data augmentation to counter the cold start problem for CRSs. View details
Preview abstract The approximate nearest neighbor (ANN) search problem is fundamental to efficiently serving many real-world machine learning applications. A number of techniques have been developed for ANN search that are efficient, accurate, and scalable. However, such techniques typically have a number of parameters that affect the speed-recall tradeoff, and exhibit poor performance when such parameters aren't properly set. Tuning these parameters has traditionally been a manual process, demanding in-depth knowledge of the underlying search algorithm. This is becoming an increasingly unrealistic demand as ANN search grows in popularity. To tackle this obstacle to ANN adoption, this work proposes a constrained optimization-based approach to tuning quantization-based ANN algorithms. Our technique takes just a desired search cost or recall as input, and then generates tunings that, empirically, are very close to the speed-recall Pareto frontier and give leading performance on standard benchmarks. View details
HiPrompt: Few-Shot Biomedical Knowledge Fusion via Hierarchy-Oriented Prompting
Jiaying Lu
Bo Xiong
Wenjing Ma
Steffen Staab
Carl Yang
Proc. of The 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (2023)
Preview abstract Medical decision-making processes can be enhanced by comprehensive biomedical knowledge bases, which require fusing knowledge graphs constructed from different sources via a uniform index system. The index system often organizes biomedical terms in a hierarchy to provide the aligned entities with fine-grained granularity. To address the challenge of scarce supervision in the biomedical knowledge fusion (BKF) task, researchers have proposed various unsupervised methods. However, these methods heavily rely on ad-hoc lexical and structural matching algorithms, which fail to capture the rich semantics conveyed by biomedical entities and terms. Recently, neural embedding models have proved effective in semantic-rich tasks, but they rely on sufficient labeled data to be adequately trained. To bridge the gap between the scarce-labeled BKF and neural embedding models, we propose HiPrompt, a supervision-efficient knowledge fusion framework that elicits the few-shot reasoning ability of large language models through hierarchy-oriented prompts. Empirical results on the collected KG-Hi-BKF benchmark datasets demonstrate the effectiveness of HiPrompt. View details