Cheng Li

Cheng Li

I got my Ph.D. in information science from the University of Michigan, Ann Arbor. My advisor is Qiaozhu Mei. I received my Bachelor's degree from the College of Computer Science and Technology at Zhejiang University, China. I am particularly interested in data mining, information retrieval, machine learning (especially deep learning), and natural language processing, with application to the retrieval, management, and analysis of data from Web, scientific literature, social networks, and various online communities. My publications can be found in Google Scholar
Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Preview abstract This workshop discusses the cutting-edge developments in research and applications of personalizing large language models (LLMs) and adapting them to the demands of diverse user populations and societal needs. The full-day workshop plan includes several keynotes and invited talks, a poster session and a panel discussion. View details
    Preview abstract Facilitated by large language models (LLMs), personalized text generation has become a rapidly growing research direction. Most existing studies focus on designing specialized models for a particular domain, or they require fine-tuning the LLMs to generate personalized text. We consider a typical scenario in which the large language model, which generates personalized output, is frozen and can only be accessed through APIs. Under this constraint, all one can do is to improve the input text (i.e., text prompts) sent to the LLM, a procedure that is usually done manually. In this paper, we propose a novel method to automatically revise prompts for personalized text generation. The proposed method takes the initial prompts generated by a state-of-the-art, multistage framework for personalized generation and rewrites a few critical components that summarize and synthesize the personal context. The prompt rewriter employs a training paradigm that chains together supervised learning (SL) and reinforcement learning (RL), where SL reduces the search space of RL and RL facilitates end-to-end training of the rewriter. Using datasets from three representative domains, we demonstrate that the rewritten prompts outperform both the original prompts and the prompts optimized via supervised learning or reinforcement learning alone. In-depth analysis of the rewritten prompts shows that they are not only human readable, but also able to guide manual revision of prompts when there is limited resource to employ reinforcement learning to train the prompt rewriter, or when it is costly to deploy an automatic prompt rewriter for inference. View details
    SparseEmbed: Learning Sparse Lexical Representations with Contextual Embeddings for Retrieval
    Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '23), ACM(2023) (to appear)
    Preview abstract In dense retrieval, prior work has largely improved retrieval effectiveness using multi-vector dense representations, exemplified by ColBERT. In sparse retrieval, more recent work, such as SPLADE, demonstrated that one can also learn sparse lexical representations to achieve comparable effectiveness while enjoying better interpretability. In this work, we combine the strengths of both the sparse and dense representations for first-stage retrieval. Specifically, we propose SparseEmbed – a novel retrieval model that learns sparse lexical representations with contextual embeddings. Compared with SPLADE, our model leverages the contextual embeddings to improve model expressiveness. Compared with ColBERT, our sparse representations are trained end-to-end to optimize both efficiency and effectiveness. View details
    End-to-End Query Term Weighting
    Karan Samel
    Swaraj Khadanga
    Wensong Xu
    Xingyu Wang
    Kashyap Kolipaka
    Proceedings of the 29th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '23)(2023)
    Preview abstract Bag-of-words based lexical retrieval systems are still the most commonly used methods for real-world search applications. Recently deep learning methods have shown promising results to improve this retrieval performance but are expensive to run in an online fashion, non-trivial to integrate into existing production systems, and might not generalize well in out-of-domain retrieval scenarios. Instead, we build on top of lexical retrievers by proposing a Term Weighting BERT (TW-BERT) model. TW-BERT learns to predict the weight for individual n-gram (e.g., uni-grams and bi-grams) query input terms. These inferred weights and terms can be used directly by a retrieval system to perform a query search. To optimize these term weights, TW-BERT incorporates the scoring function used by the search engine, such as BM25, to score query-document pairs. Given sample query-document pairs we can compute a ranking loss over these matching scores, optimizing the learned query term weights in an end-to-end fashion. Aligning TW-BERT with search engine scorers minimizes the changes needed to integrate it into existing production applications, whereas existing deep learning based search methods would require further infrastructure optimization and hardware requirements. The learned weights can be easily utilized by standard lexical retrievers and by other retrieval techniques such as query expansion. We show that TW-BERT improves retrieval performance over strong term weighting baselines within MSMARCO and in out-of-domain retrieval on TREC datasets. View details
    Job Type Extraction for Service Businesses
    Yaping Qi
    Hayk Zakaryan
    Yonghua Wu
    Companion Proceedings of the ACM Web Conference 2023
    Preview abstract Google My Business (GMB) is a platform that allows business owners to manage their business profiles, which will be displayed when a user issues a relevant query on Google Search or Maps. Many GMB businesses provide diverse services from home cleaning and plumbing to legal services and education. However the exact service content, which we call job types, is often missing in their profiles. This leaves the burden of finding such content to the users, either by the tedious work of scanning through business websites or time-consuming calling of the owners. In the present paper, we describe how we build a pipeline to automatically extract the job types from websites of business owners and how we solve scalability issues for deployment. Rather than focusing on developing novel and sophisticated machine learning models, we share various challenges we have faced and practical experiences of building such a pipeline, including the cold start problem of dataset collection with limited human annotation resource, scalability, reaching a launch bar of high precision, and building a general pipeline with reasonable coverage of any free-text web pages without relying on the Document Object Model (DOM) structure. With these challenges, standard approaches for information extraction do not directly apply or are not scalable to be served. In this paper, we show how we address these challenges in different stages of the extraction pipeline, including: (1) utilizing structured content like tables and lists to tackle the cold start problem of dataset collection; (2) exploitation of various context information to improve model performance without hurting scalability; and (3) formulating the extraction problem as a retrieval task to improve generalizability, efficiency as well as coverage. The pipeline has been successfully deployed, and is scalable enough to be refreshed every few days to extract the latest online information. The extracted job types are serving millions of users of Google Search and Google Maps with at least three use cases: (1) job types of a place are directly displayed on mobile devices; (2) job types provide explanation as to why a place shows up given a query; (3) job types are used as a signal to rank business places. According to a user survey, the displayed job types has greatly enhanced the probability of a user hiring a service provider. View details
    Learning Sparse Lexical Representations Over Expanded Vocabularies for Retrieval
    Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (CIKM '23)(2023)
    Preview abstract A recent line of work in first-stage Neural Information Retrieval has focused on learning sparse lexical representations instead of dense embeddings. One such work is SPLADE, which has been shown to lead to state-of-the-art results in both the in-domain and zero-shot settings, can leverage inverted indices for efficient retrieval, and offers enhanced interpretability. However, existing SPLADE models are fundamentally limited to learning a sparse representation based on the native BERT WordPiece vocabulary. In this work, we extend SPLADE to support learning sparse representations over arbitrary sets of tokens to improve flexibility and aid integration with existing retrieval systems. As an illustrative example, we focus on learning a sparse representation over a large (300k) set of unigrams. We add an unsupervised pretraining task on C4 to learn internal representations for new tokens. Our experiments show that our Expanded-SPLADE model maintains the performance of WordPiece-SPLADE on both in-domain and zero-shot retrieval while allowing for custom output vocabularies. View details
    Multi-Aspect Dense Retrieval
    Swaraj Khadanga
    Wensong Xu
    Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, ACM(2022)
    Preview abstract Prior work in Dense Retrieval usually encodes queries and documents using single-vector representations (also called embeddings) and performs retrieval in the embedding space using approximate nearest neighbor search. This paradigm enables efficient semantic retrieval. However, the single-vector representations can be ineffective at capturing different aspects of the queries and documents in relevance matching, especially for some vertical domains. For example, in e-commerce search, these aspects could be category, brand and color. Given a query "white nike socks", a Dense Retrieval model may mistakenly retrieve some "white adidas socks" while missing out the intended brand. We propose to explicitly represent multiple aspects using one embedding per aspect. We introduce an aspect prediction task to teach the model to capture aspect information with particular aspect embeddings. We design a lightweight network to fuse the aspect embeddings for representing queries and documents. Our evaluation using an e-commerce dataset shows impressive improvements over strong Dense Retrieval baselines. We also discover that the proposed aspect embeddings can enhance the interpretability of Dense Retrieval models as a byproduct. View details
    Preview abstract Document layout comprises both structural and visual (\eg font size) information that are vital but often ignored by machine learning models. The few existing models which do use layout information only consider \textit{textual} contents, and overlook the existence of contents in other modalities such as images. Additionally, spatial interactions of presented contents in a layout was never fully exploited. On the other hand, a series of document understanding tasks are calling out for layout information. One example is given a position in a document, which image is the best to fit in. To address current models' limitations and tackle layout-aware document understanding tasks, we first parse a document into blocks whose content can be textual, tabular, or multimedia (\eg images) using a proprietary tool. We then propose a novel hierarchical framework, LAMPreT, to encode the blocks. Our LAMPreT model encodes each block with a multimodal transformer in the lower-level, and aggregates the block-level representations and connections utilizing a specifically designed transformer at the higher-level. We design hierarchical pre-training objectives where the lower-level model is trained with the standard masked language modeling (MLM) loss and the multimodal alignment loss, and the higher-level model is trained with three layout-aware objectives: (1) block-order predictions, (2) masked block predictions, and (3) image fitting predictions. We test the proposed model on two layout-aware tasks -- image suggestions and text block filling, and show the effectiveness of our proposed hierarchical architecture as well as pre-training techniques. View details
    Preview abstract Many natural language processing and information retrieval problems can be formalized as the task of semantic matching. Existing work in this area has been largely focused on matching between short texts (e.g., question answering), or between a short and a long text (e.g., ad-hoc retrieval). Semantic matching between long-form documents, which has many important applications like news recommendation, related article recommendation and document clustering, is relatively less explored and needs more research effort. In recent years, self-attention based models like Transformers and BERT have achieved state-of-the-art performance in the task of text matching. These models, however, are still limited to short text like a few sentences or one paragraph due to the quadratic computational complexity of self-attention with respect to input text length. In this paper, we address the issue by proposing the Siamese Multi-depth Transformer-based Hierarchical (SMITH) Encoder for long-form document matching. Our model contains several innovations to adapt self-attention models for longer text input. We propose a transformer based hierarchical encoder to capture the document structure information. In order to better capture sentence level semantic relations within a document, we pre-train the model with a novel masked sentence block language modeling task in addition to the masked word language modeling task used by BERT. Our experimental results on several benchmark datasets for long-form document matching show that our proposed SMITH model outperforms the previous state-of-the-art models including hierarchical attention, multi-depth attention-based hierarchical recurrent neural network, and BERT. Comparing to BERT based baselines, our model is able to increase maximum input text length from 512 to 2048. We will open source a Wikipedia based benchmark dataset, code and a pre-trained checkpoint to accelerate future research on long-form document matching. View details
    Preview abstract Search engines often follow a 2-phase paradigm where in the first step an initial set of documents is retrieved (the \emp{retrieval} step) and in the second step the documents are ranked so as to obtain the final result list (the \emp{re-ranking} step). The focus of this paper is on improving the \emph{retrieval} step (measured mainly by recall) using deep neural network-based approaches. While deep neural networks were shown to improve the performance of the re-ranking step, there is little literature about using deep neural networks to improve the retrieval step. Previous works on deep neural networks for IR usually apply a simple lexical retrieval model for the retrieval step (e.g., BM25) and emphasize on the re-ranking step. In this paper, we propose and study a hybrid retrieval approach, which leverages both semantic (deep neural network based) and lexical (keyword matching based like BM25) matching techniques. The main idea is to perform semantic and lexical retrieval in parallel, and then to combine the result lists to generate the initial result set for re-ranking. An empirical evaluation, using a public TREC collection, shows that semantic retrieval model generated result lists often contain a substantial number of relevant documents not covered by the lexical-based generated lists. Further analysis of these relevant documents shows that they often also exhibit different characteristics than the lexical-based documents, attesting to the complementary nature of the two approaches. Finally, the experiments show that by combining the two result lists, the recall of the result list can increase significantly, the retrieval step can be greatly improved and these improvements are highly robust. View details