Jump to Content

Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

people standing in front of a screen with images and a chipboard

Publications

Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
1 - 15 of 355 publications
    First Passage Percolation with Queried Hints
    Kritkorn Karntikoon
    Aaron Schild
    Yiheng Shen
    Ali Sinop
    AISTATS (2024)
    Preview abstract Optimization problems are ubiquitous throughout the modern world. In many of these applications, the input is inherently noisy and it is expensive to probe all of the noise in the input before solving the relevant optimization problem. In this work, we study how much of that noise needs to be queried in order to obtain an approximately optimal solution to the relevant problem. We focus on the shortest path problem in graphs, where one may think of the noise as coming from real-time traffic. We consider the following model: start with a weighted base graph $G$ and multiply each edge weight by an independently chosen, uniformly random number in $[1,2]$ to obtain a random graph $G'$. This model is called \emph{first passage percolation}. Mathematicians have studied this model extensively when $G$ is a $d$-dimensional grid graph, but the behavior of shortest paths in this model is still poorly understood in general graphs. We make progress in this direction for a class of graphs that resembles real-world road networks. Specifically, we prove that if the geometric realization of $G$ has constant doubling dimension, then for a given $s-t$ pair, we only need to probe the weights on $((\log n) / \epsilon)^{O(1)}$ edges in $G'$ in order to obtain a $(1 + \epsilon)$-approximation to the $s-t$ distance in $G'$. We also demonstrate experimentally that this result is pessimistic -- one can even obtain a short path in $G'$ with a small number of probes to $G'$. View details
    Preview abstract Many geographic information systems applications rely on the data provided by user devices in the road network. Such applications include traffic monitoring, driving navigation, detecting road closures or the construction of new roads, etc. This signal is collected by sampling locations from the user trajectories and is a critical process for all such systems. Yet, it has not been sufficiently studied in the literature. The most natural way to sample a trajectory is perhaps using a frequency based algorithm, e.g., sample every $x$ seconds. However, as we argue in this paper, such a simple strategy can be very wasteful in terms of resources (e.g., server-side processing, user battery) and in terms of the amount of user data that it maintains. In this work we conduct a horizontal study of various location sampling algorithms (including frequency-based, road geography-based, reservoir-sampling based, etc.) and extract their trade-offs in terms of various metrics of interest, such as, the size of the stored data and the induced quality of training for prediction tasks (e.g., predicting speeds) using the road network of New York City. View details
    Preview abstract Computing efficient traffic signal plans is often based on the amount of traffic in an intersection, its distribution over the various intersection movements and hours as well as on performance metrics such as traffic delay. In their simple and typical form plans are fixed in the same hour over weekdays. This allows low operation costs without the necessity for traffic detection and monitoring tools. A critical factor on the potential efficiency of such plans is the similarity of traffic patterns over the days along each of the intersection movements. We refer to such similarity as the traffic stability of the intersection and define simple metrics to measure it based on traffic volume and traffic delay. In this paper, we propose an automatic probe data based method, for city-wide estimation of traffic stability. We discuss how such measures can be used for signal planning such as in selecting plan resolution or as an indication as which intersections can benefit from dynamic but expensive traffic detection tools. We also identify events of major changes in traffic characteristics of an intersection. We demonstrate the framework by using real traffic statistics to study the traffic stability in the city of Haifa along its 162 intersections. We study the impact of the time of day on the stability, detect major changes in traffic and find intersections with high and low stability. View details
    Business Intelligence Career Master Plan
    Danny Moncada
    Eduardo Chavez
    Packt (2023) (to appear)
    Preview abstract Navigating the challenging path of a business intelligence career requires considering individual expertise, interest, and skills. This book explores key skills like data modeling, visualization, and warehousing, alongside organizational structures, technology stacks, coursework, certifications, and interview advice, thus enabling readers to make informed decisions about their BI journey. The book will begin by assessing the different roles of BI and provide an in-depth walkthrough of the roadmap while helping you match your skills and career with the tech stack in business intelligence. The book will then teach you to build taxonomy and a data story using visualization types. You would be learning the fundamentals of programming, frontend development, backend development, software development lifecycle, and project management to give you a broad-level view of the end-to-end BI process. This book will also help you to identify what subjects and areas are crucial to study and what does not add much value to your BI skill set. You would be able to find the right job fit and build a business problem and data solutions matrix. By the end of this book, you would be able to make an informed, well-thought-out decision on which of the myriad paths to choose in your business intelligence journey. View details
    Preview abstract Text injection for automatic speech recognition (ASR), wherein unpaired text-only data is used to supplement paired audio-text data, has shown promising improvements for word error rate. This study examines the use of text injection for auxiliary tasks, which are the non-ASR tasks often performed by an E2E model. In this work, we use joint end-to-end and internal language model training (JEIT) as our text injection algorithm to train an ASR model which performs two auxiliary tasks. The first is capitalization, which is a de-normalization task. The second is turn-taking prediction, which attempts to identify whether a user has completed their conversation turn in a digital assistant interaction. We show results demonstrating that our text injection method boosts capitalization performance for long-tail data, and improves turn-taking detection recall. View details
    Preview abstract In this study, we explore the use of a hybrid approach in online surveys, combining traditional form-based closed-ended questions with open-ended questions administered by a chatbot. We trained a chatbot using OpenAI's GPT-3 language model to produce context-dependent probes to responses given to open-ended questions. The goal was to mimic a typical professional survey interviewer scenario where the interviewer is trained to probe the respondent when answering an open-ended question. For example, assume this initial exchange: “What did you find hard to use or frustrating when using Google Maps?” “It wasn't easy to find the address we were looking for” The chatbot would follow-up with “What made it hard to find the address?” or “What about it made it difficult to find?” or “What steps did you take to find it?”. The experiment consisted of a Qualtrics survey with 1,200 participants, who were randomly assigned to one of two groups. Both groups answered closed-ended questions, but the final open-ended question differed between the groups, with one group receiving a chatbot and the other group receiving a single open-ended question. The results showed that using a chatbot resulted in higher quality and more detailed responses compared to the single open-ended question approach, and respondents indicated a preference towards using a chatbot to open-ended questions. However, respondents also noted the importance of avoiding repetitive probes and expressed dislike for the uncertainty around the number of required exchanges. This hybrid approach has the potential to provide valuable insights for survey practitioners, although there is room for improvement in the conversation flow. View details
    Transformers as Graph-to-Graph Models
    James Henderson
    Alireza Mohammadshahi
    Andrei C. Coman
    The Big Picture Workshop, ACL, (2023)
    Preview abstract We argue that Transformers are essentially graph-to-graph models, with sequences just being a special case. Attention weights are functionally equivalent to graph edges. Our Graph-to-Graph Transformer architecture makes this ability explicit, by inputting graph edges into the attention weight computations and predicting graph edges with attention-like functions, thereby integrating explicit graphs into the latent graphs learned by pretrained Transformers. Adding iterative graph refinement provides a joint embedding of input, output, and latent graphs, allowing non-autoregressive graph prediction to optimise the complete graph without any bespoke pipeline or decoding strategy. Empirical results show that this architecture achieves state-of-the-art accuracies for modelling a variety of linguistic structures, integrating very effectively with the latent linguistic representations learned by pretraining. View details
    Preview abstract We consider the classic Euclidean k-median and k-means objective on insertion-only streams, where the goal is to maintain a (1 + ε)-approximation to the k-median or k-means, while using as little memory as possible. Over the last 20 years, clustering in data streams has received a tremendous amount of attention and has been the test-bed for a large variety of new techniques, including coresets, merge-and-reduce, bicriteria approximation, sensitivity sampling, etc. Despite this intense effort to obtain smaller and smaller sketches for these problems, all known techniques require storing at least Ω(log (n∆)) words of memory, where n is size of the input and ∆ is the aspect ratio. In this paper, we show how to finally bypass this bound and obtain the first algorithm that achieves a (1 + ε)-approximation to the more general (k, z)-clustering problem, using only ̃O ( dk / ε^2) · (2^{z log z} ) · min ( 1/ε^z , k) · poly(log log(n∆)) words of memory. View details
    HiPrompt: Few-Shot Biomedical Knowledge Fusion via Hierarchy-Oriented Prompting
    Jiaying Lu
    Bo Xiong
    Wenjing Ma
    Steffen Staab
    Carl Yang
    Proc. of The 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (2023)
    Preview abstract Medical decision-making processes can be enhanced by comprehensive biomedical knowledge bases, which require fusing knowledge graphs constructed from different sources via a uniform index system. The index system often organizes biomedical terms in a hierarchy to provide the aligned entities with fine-grained granularity. To address the challenge of scarce supervision in the biomedical knowledge fusion (BKF) task, researchers have proposed various unsupervised methods. However, these methods heavily rely on ad-hoc lexical and structural matching algorithms, which fail to capture the rich semantics conveyed by biomedical entities and terms. Recently, neural embedding models have proved effective in semantic-rich tasks, but they rely on sufficient labeled data to be adequately trained. To bridge the gap between the scarce-labeled BKF and neural embedding models, we propose HiPrompt, a supervision-efficient knowledge fusion framework that elicits the few-shot reasoning ability of large language models through hierarchy-oriented prompts. Empirical results on the collected KG-Hi-BKF benchmark datasets demonstrate the effectiveness of HiPrompt. View details
    Personalisation in digital ecomuseums: the case of Pros-Eleusis
    Ektor Vrettakis
    Akrivi Katifori
    Myrto Koukouli
    Maria Boile
    Apostolos Glenis
    Dimitra Petousi
    Maria Vayanou
    Yannis Ioannidis
    MDPI , Applied Sciences (2023) (to appear)
    Preview abstract In comparison with a traditional museum, an “ecomuseum” is radically different: It is not housed in a building and does not have a collection of physical objects or artifacts. It aims to help visitors discover the tangible and intangible cultural heritage of a region through the identification of important points of interest (POIs), while offering a variety of activities and direct engagement with the region’s cultural identity. The diversity and amount of information that may be available through digital means highlight the need for supporting the visitor in selecting which POIs to visit by offering personalized content. In this paper, we present our approach for a recommendation system for an ecomuseum, through its application in the city of Eleusis, Greece. We present the approach from needs to implementation, as well as the results of a preliminary evaluation, showing promising results for its application as an engaging visitor experience for an ecomuseum. We conclude the paper with a wider discussion about personalization in this context and in a cultural heritage context in general. View details
    Cold-Start Data Selection for Better Few-shot Language Model Fine-tuning: A Prompt-based Uncertainty Propagation Approach
    Yue Yu
    Rongzhi Zhang
    Ran Xu
    Jieyu Zhang
    Chao Zhang
    Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL) (2023)
    Preview abstract Large Language Models have demonstrated remarkable few-shot performance, but the performance can be sensitive to the selection of few-shot instances. We propose PATRON, a new method that uses prompt-based uncertainty estimation for data selection for pre-trained language model fine-tuning under cold-start scenarios, i.e., no initial labeled data are available. In PATRON, we design (1) a prompt-based uncertainty propagation approach to estimate the importance of data points and (2) a partition-then-rewrite (PTR) strategy to promote sample diversity when querying for annotations. Experiments on six text classification datasets show that PATRON outperforms the strongest cold-start data selection baselines by up to 6.9%. Besides, with 128 labels only, PATRON achieves 91.0% and 92.1% of the fully supervised performance based on vanilla fine-tuning and prompt-based learning respectively. View details
    Preview abstract Automatic headline generation enables users to comprehend ongoing news events promptly and has recently become an important task in web mining and natural language processing. With the growing need for news headline generation, we argue that the hallucination issue, namely the generated headlines being not supported by the original news stories, is a critical challenge for the deployment of this feature in web-scale systems Meanwhile, due to the infrequency of hallucination cases and the requirement of careful reading for raters to reach the correct consensus, it is difficult to acquire a large dataset for training a model to detect such hallucinations through human curation. In this work, we present a new framework named ExHalder to address this challenge for headline hallucination detection. ExHalder adapts the knowledge from public natural language inference datasets into the news domain and learns to generate natural language sentences to explain the hallucination detection results. To evaluate the model performance, we carefully collect a dataset with more than six thousand labeled "article, headline" pairs. Extensive experiments on this dataset and another six public ones demonstrate that ExHalder can identify hallucinated headlines accurately and justifies its predictions with human-readable natural language explanations. View details
    Unsupervised Event Chain Mining from Multiple Documents
    Yizhu Jiao
    Ming Zhong
    Yunyi Zhang
    Chao Zhang
    Jiawei Han
    Proceedings of the ACM Web Conference 2023
    Preview abstract Massive and fast-evolving news articles keep emerging on the web. To effectively summarize and provide concise insights into real-world events, we propose a new event knowledge extraction task Event Chain Mining in this paper. Given multiple documents about a super event, it aims to mine a series of salient events in temporal order. For example, the event chain of super event "Mexico Earthquake in 2017" is {"earthquake hit Mexico", "destroy houses", "kill people", "block roads"}. This task can help readers capture the gist of texts quickly, thereby improving reading efficiency and deepening text comprehension. To address this task, we regard an event as a cluster of different mentions of similar meanings. In this way, we can identify the different expressions of events, enrich their semantic knowledge and replenish relation information among them. Taking events as the basic unit, we present a novel unsupervised framework, EMiner. Specifically, we extract event mentions from texts and merge them with similar meanings into a cluster as a single event. By jointly incorporating both content and commonsense, essential events are then selected and arranged chronologically to form an event chain. Meanwhile, we annotate a multi-document benchmark to build a comprehensive testbed for the proposed task. Extensive experiments are conducted to verify the effectiveness of EMiner in terms of both automatic and human evaluations. View details
    Job Type Extraction for Service Businesses
    Yaping Qi
    Hayk Zakaryan
    Yonghua Wu
    Companion Proceedings of the ACM Web Conference 2023
    Preview abstract Google My Business (GMB) is a platform that allows business owners to manage their business profiles, which will be displayed when a user issues a relevant query on Google Search or Maps. Many GMB businesses provide diverse services from home cleaning and plumbing to legal services and education. However the exact service content, which we call job types, is often missing in their profiles. This leaves the burden of finding such content to the users, either by the tedious work of scanning through business websites or time-consuming calling of the owners. In the present paper, we describe how we build a pipeline to automatically extract the job types from websites of business owners and how we solve scalability issues for deployment. Rather than focusing on developing novel and sophisticated machine learning models, we share various challenges we have faced and practical experiences of building such a pipeline, including the cold start problem of dataset collection with limited human annotation resource, scalability, reaching a launch bar of high precision, and building a general pipeline with reasonable coverage of any free-text web pages without relying on the Document Object Model (DOM) structure. With these challenges, standard approaches for information extraction do not directly apply or are not scalable to be served. In this paper, we show how we address these challenges in different stages of the extraction pipeline, including: (1) utilizing structured content like tables and lists to tackle the cold start problem of dataset collection; (2) exploitation of various context information to improve model performance without hurting scalability; and (3) formulating the extraction problem as a retrieval task to improve generalizability, efficiency as well as coverage. The pipeline has been successfully deployed, and is scalable enough to be refreshed every few days to extract the latest online information. The extracted job types are serving millions of users of Google Search and Google Maps with at least three use cases: (1) job types of a place are directly displayed on mobile devices; (2) job types provide explanation as to why a place shows up given a query; (3) job types are used as a signal to rank business places. According to a user survey, the displayed job types has greatly enhanced the probability of a user hiring a service provider. View details
    Preview abstract We explore a fundamental question in language model pre-training with huge amounts of unlabeled and randomly sampled text data - should every data sample have equal contribution to the model learning. To this end, we use self-influence (SI) scores as an indicator of sample importance, analyzing the relationship of self-influence scores with the sample quality and probing the efficacy of SI scores for offline pre-training dataset filtering. Building upon this, we propose PRESENCE: Pre-training data REweighting with Self-influENCE, an online and adaptive pre-training data re-weighting strategy using self-influence scores. PRESENCE is a two-phased learning method: In the first phase of learning, the data samples with higher SI scores are emphasized more, while in the subsequent phase of learning, the data samples with higher SI scores are de-emphasized to limit the impact of noisy and unreliable samples. We validate PRESENCE over $2$ model sizes of multilingual-t5 with $5$ datasets across $3$ tasks, obtaining significant performance improvements over the baseline methods considered. Through extensive ablations and qualitative analyses, we put forward a new research direction for language model pre-training. View details