Jump to Content

Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

people standing in front of a screen with images and a chipboard

Publications

Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
1 - 15 of 357 publications
    First Passage Percolation with Queried Hints
    Sreenivas Gollapudi
    Kritkorn Karntikoon
    Aaron Schild
    Yiheng Shen
    Ali Sinop
    AISTATS (2024)
    Preview abstract Optimization problems are ubiquitous throughout the modern world. In many of these applications, the input is inherently noisy and it is expensive to probe all of the noise in the input before solving the relevant optimization problem. In this work, we study how much of that noise needs to be queried in order to obtain an approximately optimal solution to the relevant problem. We focus on the shortest path problem in graphs, where one may think of the noise as coming from real-time traffic. We consider the following model: start with a weighted base graph $G$ and multiply each edge weight by an independently chosen, uniformly random number in $[1,2]$ to obtain a random graph $G'$. This model is called \emph{first passage percolation}. Mathematicians have studied this model extensively when $G$ is a $d$-dimensional grid graph, but the behavior of shortest paths in this model is still poorly understood in general graphs. We make progress in this direction for a class of graphs that resembles real-world road networks. Specifically, we prove that if the geometric realization of $G$ has constant doubling dimension, then for a given $s-t$ pair, we only need to probe the weights on $((\log n) / \epsilon)^{O(1)}$ edges in $G'$ in order to obtain a $(1 + \epsilon)$-approximation to the $s-t$ distance in $G'$. We also demonstrate experimentally that this result is pessimistic -- one can even obtain a short path in $G'$ with a small number of probes to $G'$. View details
    Preview abstract Many geographic information systems applications rely on the data provided by user devices in the road network. Such applications include traffic monitoring, driving navigation, detecting road closures or the construction of new roads, etc. This signal is collected by sampling locations from the user trajectories and is a critical process for all such systems. Yet, it has not been sufficiently studied in the literature. The most natural way to sample a trajectory is perhaps using a frequency based algorithm, e.g., sample every $x$ seconds. However, as we argue in this paper, such a simple strategy can be very wasteful in terms of resources (e.g., server-side processing, user battery) and in terms of the amount of user data that it maintains. In this work we conduct a horizontal study of various location sampling algorithms (including frequency-based, road geography-based, reservoir-sampling based, etc.) and extract their trade-offs in terms of various metrics of interest, such as, the size of the stored data and the induced quality of training for prediction tasks (e.g., predicting speeds) using the road network of New York City. View details
    Preview abstract Computing efficient traffic signal plans is often based on the amount of traffic in an intersection, its distribution over the various intersection movements and hours as well as on performance metrics such as traffic delay. In their simple and typical form plans are fixed in the same hour over weekdays. This allows low operation costs without the necessity for traffic detection and monitoring tools. A critical factor on the potential efficiency of such plans is the similarity of traffic patterns over the days along each of the intersection movements. We refer to such similarity as the traffic stability of the intersection and define simple metrics to measure it based on traffic volume and traffic delay. In this paper, we propose an automatic probe data based method, for city-wide estimation of traffic stability. We discuss how such measures can be used for signal planning such as in selecting plan resolution or as an indication as which intersections can benefit from dynamic but expensive traffic detection tools. We also identify events of major changes in traffic characteristics of an intersection. We demonstrate the framework by using real traffic statistics to study the traffic stability in the city of Haifa along its 162 intersections. We study the impact of the time of day on the stability, detect major changes in traffic and find intersections with high and low stability. View details
    LinguaMeta: Unified Metadata for Thousands of Languages
    Uche Okonkwo
    Emily Drummond
    Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
    Preview abstract We introduce LinguaMeta, a unified resource for language metadata for thousands of languages, including language codes, names, number of speakers, writing systems, countries, official status, coordinates, and language varieties. The resources are drawn from various existing repositories and supplemented with our own research. Each data point is tagged for its origin, allowing us to easily trace back to and improve existing resources with more up-to-date and complete metadata. The resource is intended for use by researchers and organizations who aim to extend technology to thousands of languages. View details
    Shorts vs. Regular Videos on YouTube: A Comparative Analysis of User Engagement and Content Creation Trends
    Caroline Violot
    Tugrulcan Elmais
    Mathias Humbert
    ACM Web Science Conference 2024 (WEBSCI24) (2024) (to appear)
    Preview abstract YouTube introduced the Shorts video format in 2021, allowing users to upload short videos that are prominently displayed on its website and app. Despite having such a large visual footprint, there are no studies to date that have looked at the impact Shorts introduction had on the production and consumption of content on YouTube. This paper presents the first comparative analysis of YouTube Shorts versus regular videos with respect to user engagement (i.e., views, likes, and comments), content creation frequency and video categories. We collected a dataset containing information about 70k channels that posted at least one Short, and we analyzed the metadata of all the videos (9.9M Shorts and 6.9M regular videos) they uploaded between January 2021 and December 2022, spanning a two-year period including the introduction of Shorts. Our longitudinal analysis shows that content creators consistently increased the frequency of Shorts production over this period, especially for newly-created channels, which surpassed that of regular videos. We also observe that Shorts target mostly entertainment categories, while regular videos cover a wide variety of categories. In general, Shorts attract more views and likes per view than regular videos, but attract less comments per view. However, Shorts do not outperform regular videos in the education and political categories as much as they do in other categories. Our study contributes to understanding social media dynamics, to quantifying the spread of short-form content, and to motivating future research on its impact on society. View details
    Business Intelligence Career Master Plan
    Danny Moncada
    Eduardo Chavez
    Packt (2023) (to appear)
    Preview abstract Navigating the challenging path of a business intelligence career requires considering individual expertise, interest, and skills. This book explores key skills like data modeling, visualization, and warehousing, alongside organizational structures, technology stacks, coursework, certifications, and interview advice, thus enabling readers to make informed decisions about their BI journey. The book will begin by assessing the different roles of BI and provide an in-depth walkthrough of the roadmap while helping you match your skills and career with the tech stack in business intelligence. The book will then teach you to build taxonomy and a data story using visualization types. You would be learning the fundamentals of programming, frontend development, backend development, software development lifecycle, and project management to give you a broad-level view of the end-to-end BI process. This book will also help you to identify what subjects and areas are crucial to study and what does not add much value to your BI skill set. You would be able to find the right job fit and build a business problem and data solutions matrix. By the end of this book, you would be able to make an informed, well-thought-out decision on which of the myriad paths to choose in your business intelligence journey. View details
    Preview abstract Text injection for automatic speech recognition (ASR), wherein unpaired text-only data is used to supplement paired audio-text data, has shown promising improvements for word error rate. This study examines the use of text injection for auxiliary tasks, which are the non-ASR tasks often performed by an E2E model. In this work, we use joint end-to-end and internal language model training (JEIT) as our text injection algorithm to train an ASR model which performs two auxiliary tasks. The first is capitalization, which is a de-normalization task. The second is turn-taking prediction, which attempts to identify whether a user has completed their conversation turn in a digital assistant interaction. We show results demonstrating that our text injection method boosts capitalization performance for long-tail data, and improves turn-taking detection recall. View details
    Preview abstract In this study, we explore the use of a hybrid approach in online surveys, combining traditional form-based closed-ended questions with open-ended questions administered by a chatbot. We trained a chatbot using OpenAI's GPT-3 language model to produce context-dependent probes to responses given to open-ended questions. The goal was to mimic a typical professional survey interviewer scenario where the interviewer is trained to probe the respondent when answering an open-ended question. For example, assume this initial exchange: “What did you find hard to use or frustrating when using Google Maps?” “It wasn't easy to find the address we were looking for” The chatbot would follow-up with “What made it hard to find the address?” or “What about it made it difficult to find?” or “What steps did you take to find it?”. The experiment consisted of a Qualtrics survey with 1,200 participants, who were randomly assigned to one of two groups. Both groups answered closed-ended questions, but the final open-ended question differed between the groups, with one group receiving a chatbot and the other group receiving a single open-ended question. The results showed that using a chatbot resulted in higher quality and more detailed responses compared to the single open-ended question approach, and respondents indicated a preference towards using a chatbot to open-ended questions. However, respondents also noted the importance of avoiding repetitive probes and expressed dislike for the uncertainty around the number of required exchanges. This hybrid approach has the potential to provide valuable insights for survey practitioners, although there is room for improvement in the conversation flow. View details
    Transformers as Graph-to-Graph Models
    James Henderson
    Alireza Mohammadshahi
    Andrei C. Coman
    The Big Picture Workshop, ACL, (2023)
    Preview abstract We argue that Transformers are essentially graph-to-graph models, with sequences just being a special case. Attention weights are functionally equivalent to graph edges. Our Graph-to-Graph Transformer architecture makes this ability explicit, by inputting graph edges into the attention weight computations and predicting graph edges with attention-like functions, thereby integrating explicit graphs into the latent graphs learned by pretrained Transformers. Adding iterative graph refinement provides a joint embedding of input, output, and latent graphs, allowing non-autoregressive graph prediction to optimise the complete graph without any bespoke pipeline or decoding strategy. Empirical results show that this architecture achieves state-of-the-art accuracies for modelling a variety of linguistic structures, integrating very effectively with the latent linguistic representations learned by pretraining. View details
    Preview abstract We consider the classic Euclidean k-median and k-means objective on insertion-only streams, where the goal is to maintain a (1 + ε)-approximation to the k-median or k-means, while using as little memory as possible. Over the last 20 years, clustering in data streams has received a tremendous amount of attention and has been the test-bed for a large variety of new techniques, including coresets, merge-and-reduce, bicriteria approximation, sensitivity sampling, etc. Despite this intense effort to obtain smaller and smaller sketches for these problems, all known techniques require storing at least Ω(log (n∆)) words of memory, where n is size of the input and ∆ is the aspect ratio. In this paper, we show how to finally bypass this bound and obtain the first algorithm that achieves a (1 + ε)-approximation to the more general (k, z)-clustering problem, using only ̃O ( dk / ε^2) · (2^{z log z} ) · min ( 1/ε^z , k) · poly(log log(n∆)) words of memory. View details
    HiPrompt: Few-Shot Biomedical Knowledge Fusion via Hierarchy-Oriented Prompting
    Jiaying Lu
    Jiaming Shen
    Bo Xiong
    Wenjing Ma
    Steffen Staab
    Carl Yang
    Proc. of The 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (2023)
    Preview abstract Medical decision-making processes can be enhanced by comprehensive biomedical knowledge bases, which require fusing knowledge graphs constructed from different sources via a uniform index system. The index system often organizes biomedical terms in a hierarchy to provide the aligned entities with fine-grained granularity. To address the challenge of scarce supervision in the biomedical knowledge fusion (BKF) task, researchers have proposed various unsupervised methods. However, these methods heavily rely on ad-hoc lexical and structural matching algorithms, which fail to capture the rich semantics conveyed by biomedical entities and terms. Recently, neural embedding models have proved effective in semantic-rich tasks, but they rely on sufficient labeled data to be adequately trained. To bridge the gap between the scarce-labeled BKF and neural embedding models, we propose HiPrompt, a supervision-efficient knowledge fusion framework that elicits the few-shot reasoning ability of large language models through hierarchy-oriented prompts. Empirical results on the collected KG-Hi-BKF benchmark datasets demonstrate the effectiveness of HiPrompt. View details
    Personalisation in digital ecomuseums: the case of Pros-Eleusis
    Ektor Vrettakis
    Akrivi Katifori
    Myrto Koukouli
    Maria Boile
    Apostolos Glenis
    Dimitra Petousi
    Maria Vayanou
    Yannis Ioannidis
    MDPI , Applied Sciences (2023) (to appear)
    Preview abstract In comparison with a traditional museum, an “ecomuseum” is radically different: It is not housed in a building and does not have a collection of physical objects or artifacts. It aims to help visitors discover the tangible and intangible cultural heritage of a region through the identification of important points of interest (POIs), while offering a variety of activities and direct engagement with the region’s cultural identity. The diversity and amount of information that may be available through digital means highlight the need for supporting the visitor in selecting which POIs to visit by offering personalized content. In this paper, we present our approach for a recommendation system for an ecomuseum, through its application in the city of Eleusis, Greece. We present the approach from needs to implementation, as well as the results of a preliminary evaluation, showing promising results for its application as an engaging visitor experience for an ecomuseum. We conclude the paper with a wider discussion about personalization in this context and in a cultural heritage context in general. View details
    Cold-Start Data Selection for Better Few-shot Language Model Fine-tuning: A Prompt-based Uncertainty Propagation Approach
    Yue Yu
    Rongzhi Zhang
    Ran Xu
    Jieyu Zhang
    Jiaming Shen
    Chao Zhang
    Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL) (2023)
    Preview abstract Large Language Models have demonstrated remarkable few-shot performance, but the performance can be sensitive to the selection of few-shot instances. We propose PATRON, a new method that uses prompt-based uncertainty estimation for data selection for pre-trained language model fine-tuning under cold-start scenarios, i.e., no initial labeled data are available. In PATRON, we design (1) a prompt-based uncertainty propagation approach to estimate the importance of data points and (2) a partition-then-rewrite (PTR) strategy to promote sample diversity when querying for annotations. Experiments on six text classification datasets show that PATRON outperforms the strongest cold-start data selection baselines by up to 6.9%. Besides, with 128 labels only, PATRON achieves 91.0% and 92.1% of the fully supervised performance based on vanilla fine-tuning and prompt-based learning respectively. View details
    “Why is this misleading?”: Detecting News Headline Hallucinations with Explanations
    Jiaming Shen
    Dan Finnie
    Negar Rahmati
    Proceedings of the ACM Web Conference 2023 (WWW 2023)
    Preview abstract Automatic headline generation enables users to comprehend ongoing news events promptly and has recently become an important task in web mining and natural language processing. With the growing need for news headline generation, we argue that the hallucination issue, namely the generated headlines being not supported by the original news stories, is a critical challenge for the deployment of this feature in web-scale systems Meanwhile, due to the infrequency of hallucination cases and the requirement of careful reading for raters to reach the correct consensus, it is difficult to acquire a large dataset for training a model to detect such hallucinations through human curation. In this work, we present a new framework named ExHalder to address this challenge for headline hallucination detection. ExHalder adapts the knowledge from public natural language inference datasets into the news domain and learns to generate natural language sentences to explain the hallucination detection results. To evaluate the model performance, we carefully collect a dataset with more than six thousand labeled "article, headline" pairs. Extensive experiments on this dataset and another six public ones demonstrate that ExHalder can identify hallucinated headlines accurately and justifies its predictions with human-readable natural language explanations. View details
    Unsupervised Event Chain Mining from Multiple Documents
    Yizhu Jiao
    Ming Zhong
    Jiaming Shen
    Yunyi Zhang
    Chao Zhang
    Jiawei Han
    Proceedings of the ACM Web Conference 2023
    Preview abstract Massive and fast-evolving news articles keep emerging on the web. To effectively summarize and provide concise insights into real-world events, we propose a new event knowledge extraction task Event Chain Mining in this paper. Given multiple documents about a super event, it aims to mine a series of salient events in temporal order. For example, the event chain of super event "Mexico Earthquake in 2017" is {"earthquake hit Mexico", "destroy houses", "kill people", "block roads"}. This task can help readers capture the gist of texts quickly, thereby improving reading efficiency and deepening text comprehension. To address this task, we regard an event as a cluster of different mentions of similar meanings. In this way, we can identify the different expressions of events, enrich their semantic knowledge and replenish relation information among them. Taking events as the basic unit, we present a novel unsupervised framework, EMiner. Specifically, we extract event mentions from texts and merge them with similar meanings into a cluster as a single event. By jointly incorporating both content and commonsense, essential events are then selected and arranged chronologically to form an event chain. Meanwhile, we annotate a multi-document benchmark to build a comprehensive testbed for the proposed task. Extensive experiments are conducted to verify the effectiveness of EMiner in terms of both automatic and human evaluations. View details