Adam Lelkes

Adam Lelkes

I am a Senior Software Engineer in Google Research, working on fundamental and applied NLP and ML research. Before joining Google Research, I received my Ph.D. in Mathematics at the University of Illinois at Chicago where I was advised by Lev Reyzin and György Turán, and was working on problems in computational complexity theory, combinatorial optimization, and machine learning.
Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Preview abstract Recently proposed long-form question answering (QA) systems, supported by large language models (LLMs), have shown promising capabilities. Yet, attributing and verifying their generated abstractive answers can be difficult, and automatically evaluating their accuracy remains an ongoing challenge. In this paper, we introduce a new QA task for answering multi-answer questions by summarizing multiple diverse sources in a semi-extractive fashion. Specifically, Semi-extractive Multi-source QA (SEMQA) requires models to output a comprehensive answer while mixing between factual quoted spans---copied verbatim from given input sources---and non-factual free-text connectors that glue these spans together into a single cohesive passage. This setting bridges the gap between the outputs of well-grounded but constrained extractive QA systems and more fluent but harder to attribute fully abstractive answers. Particularly, it enables a new mode for language models that leverages their advanced language generation capabilities, while also producing fine in-line attributions by-design that are easy to verify, interpret, and evaluate. To study this task, we create the first dataset of this kind with human-written semi-extractive answers to natural and generated questions, and define text-based evaluation metrics. Experimenting with several LLMs in various settings, we find this task to be surprisingly challenging, demonstrating the importance of our work for developing and studying such consolidation capabilities. View details
    Preview abstract Social and behavioral determinants of health (SDOH) play a significant role in shaping health outcomes, and extracting these determinants from clinical notes is a first step to help healthcare providers systematically identify opportunities to provide appropriate care and address disparities. Progress on using NLP methods for this task has been hindered by the lack of high-quality public datasets, largely due to the privacy and regulatory constraints on the use of real patient data. This paper introduces a new dataset, SDOH-NLI, that is based on fully public data. We formulate SDOH extraction as a natural language inference task, and provide binary textual entailment labels obtained from human raters for a cross product of a set of social history snippets as premises and SDOH factors as hypotheses. Our dataset differs from standard NLI benchmark in that our premises and hypotheses are obtained independently. We evaluate both "off-the-shelf" entailment models as well as models fine-tuned on our data, and highlight the ways in which our dataset appears more challenging than commonly used NLI datasets. View details
    Preview abstract Popularized by the Differentiable Search Index, the emerging paradigm of Generative Retrieval re-frames the classic information retrieval problem into a sequence-to-sequence modeling task, forgoing external indices and encoding an entire document corpus into the parameters of a single transformer. Although many different approaches have been proposed to improve the effectiveness of generative retrieval, they have only been evaluated on document corpora on the order of 100k in size. We conduct the first study of generative retrieval techniques across various corpus scales, ultimately scaling up to the entire MS MARCO passage ranking task consisting of 8.8M passages. After ablating for the most promising techniques, we then consider model scales up to 11B parameters. Along the way, we uncover several findings about scaling generative retrieval to millions of passages. Notably, the use of synthetic query generation as document representation is the only modeling technique critical to retrieval effectiveness. In addition, we find that the strongest performing architecture modifications from the literature at T5-Base initialization only perform well due to added parameters. Naively scaling to a comparable model size outperforms these proposed techniques. Finally, while model scale is necessary as corpus size increases, we find that given existing techniques, scaling model parameters past a certain point can be detrimental for retrieval effectiveness. This result might be counter-intuitive to the commonly held belief that model capacity is a limiting factor for scaling generative retrieval to larger corpora, and suggests the need for more fundamental improvements. In general, we believe that these findings will be highly valuable for the community to clarify the state of generative retrieval at scale and highlight the challenges currently facing the paradigm. View details
    Automated LOINC Standardization Using Pre-trained Large Language Models
    Eric Loreaux
    Emma Chesley
    Paul Gamble
    Martin Seneviratne
    Ming-Jun Chen
    PMLR(2022), pp. 343-355
    Preview abstract Harmonization of local source concepts to standard clinical terminologies is a prerequisite for multi-center data aggregation and sharing. Challenges in automating the mapping process stem from the idiosyncratic source encoding schemes adopted by different health systems and the lack of large publicly available training data. In this study, we aim to develop a scalable and generalizable machine learning tool to facilitate standardizing laboratory observations to the Logical Observation Identifiers Names and Codes (LOINC). Specifically, we leverage the contextual embedding from pre-trained T5 models and propose a two-stage fine-tuning strategy based on contrastive learning to enable learning in a few-shot setting without manual feature engineering. Our method utilizes unlabeled general LOINC ontology and data augmentation to achieve impressive performance on retrieving the most relevant LOINC targets when limited amount of labeled data are available. We further show that our model generalizes well to unseen targets. Taken together, our approach shows great potential to reduce manual effort in LOINC standardization and can be easily extended to mapping other terminologies. View details
    Instability in clinical risk prediction models using deep learning
    Daniel Lopez-Martinez
    Alex Yakubovich
    Martin Seneviratne
    Akshit Tyagi
    Ethan Steinberg
    N. Lance Downing
    Ron C. Li
    Keith E. Morse
    Nigam H. Shah
    Ming-Jun Chen
    Proceedings of the 2nd Machine Learning for Health symposium, PMLR(2022), pp. 552-565
    Preview abstract While it has been well known in the ML community that deep learning models suffer from instability, the consequences for healthcare deployments are under-characterised. We study the stability of different model architectures trained on electronic health records, using a set of outpatient prediction tasks as a case study. We show that repeated training runs of the same deep learning model on the same training data can result in significantly different outcomes at a patient level even though global performance metrics remain stable. We propose two stability metrics for measuring the effect of randomness of model training, as well as mitigation strategies for improving model stability. View details
    Preview abstract Multi-Task Learning (MTL) models have shown their robustness, effectiveness, and efficiency for transferring learned knowledge across tasks. In real industrial applications such as web content classification, multiple classification tasks are predicted from the same input text such as a web article. However, at the serving time, the existing multitask transformer models such as prompt or adaptor based approaches need to conduct N forward passes for N tasks with O(N) computation cost. To tackle this problem, we propose a scalable method that can achieve stronger performance with close to O(1) computation cost via only one forward pass. To illustrate real application usage, we release a multitask dataset on news topic and style classification. Our experiments show that our proposed method outperforms strong baselines on both the GLUE benchmark and our news dataset. View details
    Preview abstract We aim to renew interest in a particular multi-document summarization (MDS) task which we call AgreeSum: agreement-oriented multi-document summarization. Given a cluster of articles, the goal is to provide abstractive summaries that represent information common and faithful to all input articles. Given the lack of existing datasets, we create a dataset for AgreeSum, and provide annotations on article-summary entailment relations for a subset of the clusters in the dataset. We aim to create strong baselines for the task by applying the top-performing pretrained single-document summarization model PEGASUS onto AgreeSum, leveraging both annotated clusters by supervised losses, and unannotated clusters by T5-based entailment-related and language-related losses. Compared to other baselines, both automatic evaluation and human evaluation show better article-summary and cluster-summary entailment in generated summaries. On a separate note, we hope that our article-summary entailment annotations contribute to the community's effort in improving abstractive summarization faithfulness. View details
    Quiz-Style Question Generation for News Stories
    Cong Yu
    Proceedings of The Web Conference(2021), pp. 2501-2511
    Preview abstract A large majority of American adults get at least some of their news from the Internet. Even though many online news products have the goal of informing their users about the news, they lack scalable and reliable tools for measuring how well they are achieving this goal, and therefore have to resort to noisy proxy metrics (e.g., click-through rates or reading time) to track their performance. As a first step towards measuring news informedness at a scale, we study the problem of quiz-style multiple-choice question generation, which may be used to survey users about their knowledge of recent news. In particular, we formulate the problem as two sequence-to-sequence tasks: question-answer generation (QAG) and distractor, or incorrect answer, generation (DG). We introduce NewsQuizQA, the first dataset intended for quiz-style question-answer generation, containing 20K human written question-answer pairs from 5K news article summaries. Using this dataset, we propose a series of novel techniques for applying large pre-trained Transformer encoder-decoder models, namely PEGASUS and T5, to the tasks of question-answer generation and distractor generation. We show that our models outperform strong baselines using both automated metrics and human raters. We provide a case study of running weekly quizzes on real-world users via the Google Surveys platform over the course of two months. We found that users generally found the automatically generated questions to be educational and enjoyable. Finally, to serve the research community, we are releasing the NewsQuizQA dataset. View details
    Investigating Rumor News Using Agreement-Aware Search
    Jingbo Shang
    Jiaming Shen
    Tianhang Sun
    Xingbang Liu
    Anja Gruenheid
    Cong Yu
    Jiawei Han
    CIKM(2018)
    Preview abstract Recent years have witnessed a widespread increase of rumor news generated by humans and machines in order to attract readership, influence opinions, and increase click-through revenue. Therefore, tools for investigating rumor news have become an urgent necessity. One useful function of such tools is to see ways a specific topic or event is represented by presenting different points of view from multiple sources. In this paper, we propose Maester, a novel agreement-aware search framework for investigating rumor news. Given an investigative question, Maester will retrieve related articles to that question, assign and display top articles from agree, disagree, and discuss categories to users. Splitting the results into these three categories provides the user a holistic view towards the investigative question. We build Maester based on the following two key observations: (1) relatedness can commonly be determined by keywords and entities occurring in both questions and articles, and (2) the level of agreement between the investigative question and the related news article can often be decided by a few key sentences. Accordingly, we use gradient boosting tree models with keyword/entity matching features for relatedness detection, and leverage recurrent neural network to infer the level of agreement. Our experiments on the Fake News Challenge (FNC) “stance detection” dataset demonstrate up to an order of magnitude improvement of Maester over the original FNC winning solution, for agreement-aware search. View details
    A Confidence-Based Approach for Balancing Fairness and Accuracy
    Benjamin Fish
    Jeremy Kun
    Proceedings of the 2016 SIAM International Conference on Data Mining(2016)
    Preview abstract We study three classical machine learning algorithms in the context of algorithmic fairness: adaptive boosting, support vector machines, and logistic regression. Our goal is to maintain the high accuracy of these learning algorithms while reducing the degree to which they discriminate against individuals because of their membership in a protected group. Our first contribution is a method for achieving fairness by shifting the decision boundary for the protected group. The method is based on the theory of margins for boosting. We empirically compare our method with other variants of these learning algorithms as well as results in previous papers in the fairness literature. Our method, in addition to outperforming many of the prior algorithms in terms of accuracy and low discrimination, also allows for a fast and transparent quantification of the trade-off between bias and error. Our second contribution addresses the shortcomings of the bias-error trade-off studied in most of the algorithmic fairness literature. We demonstrate that even hopelessly naive modifications of a biased algorithm, which cannot be reasonably said to be `fair,' can still achieve low bias and high accuracy. To help to distinguish between these naive algorithms and more sensible algorithms we propose a new measure of fairness, called resilience to random bias (RRB). We demonstrate that RRB distinguishes well between our naive and sensible fairness algorithms. RRB together with bias and accuracy provides a more complete picture of the fairness of an algorithm. View details
    Network installation under convex costs
    Alexander Gutfraind
    Jeremy Kun
    Lev Reyzin
    Journal of Complex Networks, 4(2016)
    Preview abstract We study the Neighbor Aided Network Installation Problem (NANIP) introduced previously which asks for a minimal cost ordering of the vertices of a graph, where the cost of visiting a node is a function of the number of neighbors that have already been visited. This problem has applications in resource management and disaster recovery. In this paper we analyze the computational hardness of NANIP. In particular we show that this problem is NP-hard even when restricted to convex decreasing cost functions, give a linear approximation lower bound for the greedy algorithm, and prove a general sub-constant approximation lower bound. Then we give a new integer programming formulation of NANIP and empirically observe its speedup over the original integer program. View details
    Fair boosting: a case study
    Benjamin Fish
    Jeremy Kun
    ICML 2015 Workshop on Fairness, Accountability, and Transparency in Machine Learning(2015)
    Preview abstract We study the classical AdaBoost algorithm in the context of fairness. We use the Census Income Dataset as a case study. We empirically evaluate the bias and error of four variants of AdaBoost relative to an unmodified AdaBoost baseline, and study the trade-offs between reducing bias and maintaining low error. We further define a new notion of fairness and measure it for all of our methods. Our proposed method, modifying the hypothesis output by AdaBoost by shifting the decision boundary for the protected group, outperforms the state of the art for the census dataset. View details
    Interactive Clustering of Linear Classes and Cryptographic Lower Bounds
    Lev Reyzin
    Proceedings of the 26th International Conference on Algorithmic Learning Theory(2015)
    Preview abstract We study an interactive model of supervised clustering introduced by Balcan and Blum (2008), where the clustering algorithm has query access to a teacher. We give an efficient algorithm clustering linear functionals over finite fields, which implies the learnability of parity functions in this model. We also present an efficient clustering algorithm for hyperplanes which are a natural generalization of the problem of clustering linear functionals over Rd. We also give cryptographic hardness results for interactive clustering. In particular, we show that, under plausible cryptographic assumptions, the interactive clustering problem is intractable for the concept classes of polynomial-size constant-depth threshold circuits, Boolean formulas, and finite automata. View details
    On the Computational Complexity of MapReduce
    Benjamin Fish
    Jeremy Kun
    Lev Reyzin
    György Turán
    Proceedings of the 29th International Symposium on Distributed Computing(2015)
    Preview abstract In this paper we study the MRC model, which aims to formally capture distributed MapReduce computations. We show that the class of regular languages, and moreover all of sublogarithmic space, lies in constant-round MRC. In addition, we prove that, conditioned on a weak version of the Exponential Time Hypothesis, there are strict hierarchies within MRC so that increasing the number of rounds or the amount of time per processor increases the power of MRC. Our work lays the foundation for further analysis relating MapReduce to established complexity classes. Our results also hold for Valiant's BSP model of parallel computation and the MPC model of Beame et al. View details
    Biclique coverings, rectifier networks and the cost of ε-removal
    Szabolcs Iván
    Judit Nagy-György
    Balázs Szörényi
    György Turán
    Proceedings of the 16th International Workshop on Descriptional Complexity of Formal Systems(2014)
    Preview abstract We relate two complexity notions of bipartite graphs: the minimal weight biclique covering number Cov(G) and the minimal rectifier network size Rect(G) of a bipartite graph G. We show that there exist graphs with Cov(G)≥Rect(G)3/2−ϵ. As a corollary, we establish that there exist nondeterministic finite automata (NFAs) with ε-transitions, having n transitions total such that the smallest equivalent ε-free NFA has Ω(n3/2−ϵ) transitions. We also formulate a version of previous bounds for the weighted set cover problem and discuss its connections to giving upper bounds for the possible blow-up. View details
    Improved algorithms for splitting full matrix algebras
    Gábor Ivanyos
    Lajos Rónyai
    JP Journal of Algebra, Number Theory and Applications, 28(2013), pp. 141-156
    Preview abstract Let K be an algebraic number field of degree d and discriminant Δ over Q. Let A be an associative algebra over K given by structure constants such that A ≅ Mn(K) holds for some positive integer n. Suppose that d, n and |Δ| are bounded. In a previous paper a polynomial time ff-algorithm was given to construct explicitly an isomorphism A → Mn(K). Here we simplify and improve this algorithm in the cases n≤43, K=Q, and n=2, with K=Q(√(-1)) or K=Q(√(-3)). The improvements are based on work by Y. Kitaoka and R. Coulangeon on tensor products of lattices. View details