Julian Martin Eisenschlos

NLP Researcher based in Zurich. I work on natural language understanding, with a focus on text generation, semantic parsing and multilinguality. Co-founded botmaker, a leading conversational AI platform in Latin America. Previously worked on NLP research at ASAPP and knowledge base fusion at Facebook.
Authored Publications
    Table-based reasoning with large language models (LLMs) is a promising direction for many table understanding tasks, such as table-based question answering and fact verification. Compared with generic reasoning, table-based reasoning requires extracting the underlying semantics from both free-form questions and semi-structured tabular data. Chain-of-Thought and similar approaches incorporate the reasoning chain in the form of textual context, but it is still an open question how to effectively leverage tabular data in the reasoning chain. We propose the Chain-of-Table framework, where tabular data is explicitly used in the reasoning chain as a proxy for intermediate thoughts. Specifically, we guide LLMs using in-context learning to iteratively generate operations and update the table to represent a tabular reasoning chain. LLMs can therefore dynamically plan the next operation based on the results of the previous ones. This continuous evolution of the table forms a chain that shows the reasoning process for a given tabular problem. The chain carries structured information about the intermediate results, enabling more accurate and reliable predictions. Chain-of-Table achieves new state-of-the-art performance on the WikiTQ, FeTaQA, and TabFact benchmarks across multiple LLM choices.
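A minimal sketch of the iterative plan-then-execute loop described above, assuming a generic `call_llm` callable and a toy operation set; the prompt format and operations are illustrative assumptions, not the paper's implementation.

# Chain-of-Table-style loop: the LLM plans one table operation at a time,
# the host code executes it, and the updated table feeds the next prompt.
from typing import Callable, List

Table = List[List[str]]  # header row followed by data rows

def serialize(table: Table) -> str:
    return "\n".join(" | ".join(row) for row in table)

def select_columns(table: Table, cols: List[str]) -> Table:
    idx = [table[0].index(c) for c in cols]
    return [[row[i] for i in idx] for row in table]

def chain_of_table(question: str, table: Table,
                   call_llm: Callable[[str], str], max_steps: int = 5) -> str:
    for _ in range(max_steps):
        prompt = (f"Question: {question}\nTable:\n{serialize(table)}\n"
                  "Next operation (e.g. SELECT_COLUMNS col1,col2 or ANSWER <text>):")
        action = call_llm(prompt).strip()
        if action.startswith("ANSWER"):
            return action[len("ANSWER"):].strip()
        if action.startswith("SELECT_COLUMNS"):
            cols = [c.strip() for c in action[len("SELECT_COLUMNS"):].split(",")]
            table = select_columns(table, cols)  # the table evolves step by step
    return call_llm(f"Question: {question}\nTable:\n{serialize(table)}\nAnswer:")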
    DIFFQG: Generating Questions to Summarize Factual Changes
    Palak Jain
    Michael Zhang
    Eunsol Choi
    Bhuwan Dhingra
    European Chapter of the Association for Computational Linguistics (EACL), 2023
    Question Generation has been emerging as a new method to improve QA systems and represent factual information in text. However, despite the rash of new work on the topic, there is still no obvious method to evaluate such systems. Here we present DiffQG, a method to evaluate the precision and recall of question generation systems. DiffQG consists of expert-labeled annotations, focusing on the particularly challenging task of generating questions from similar pieces of text. Given an edit to a Wikipedia passage and a noun phrase, annotators wrote questions that are answered by one passage but answered differently or not at all by the other. These questions are intended to be both unambiguous and information-seeking, pushing the bounds of current question generation systems' capabilities. Moreover, as annotators also marked when no such question exists, it serves as a new evaluation for difference detection, which also lacks evaluations with as much diversity as DiffQG. We hope that this dataset will be of value to researchers as they seek to improve such systems for a variety of purposes.
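A simplified precision/recall computation for a question generation system against expert-written reference questions, in the spirit of the evaluation described above; the matching criterion is a naive placeholder assumption.

from typing import List, Tuple

def precision_recall(generated: List[str], references: List[str]) -> Tuple[float, float]:
    def match(a: str, b: str) -> bool:
        return a.strip().lower() == b.strip().lower()  # naive exact match
    # precision: fraction of generated questions that match some reference
    tp_gen = sum(any(match(g, r) for r in references) for g in generated)
    precision = tp_gen / len(generated) if generated else 0.0
    # recall: fraction of reference questions recovered by the system
    tp_ref = sum(any(match(r, g) for g in generated) for r in references)
    recall = tp_ref / len(references) if references else 0.0
    return precision, recall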
    A hallmark of modern large language models (LLMs) is their impressive general zero-shot and few-shot abilities, often elicited through in-context learning (ICL) via prompting. However, while highly coveted and the most general, zero-shot performance in LLMs is still typically weaker due to the lack of guidance and the difficulty of applying existing automatic prompt design methods to general tasks when ground-truth labels are unavailable. In this study, we address this by presenting Universal Self-Adaptive Prompting (USP), an automatic prompt design approach specifically tailored for zero-shot learning (while compatible with few-shot). Requiring only a small amount of unlabeled data and an inference-only LLM, USP is highly versatile: to achieve universal prompting, USP categorizes a possible NLP task into one of three possible task types and then uses a corresponding selector to select the most suitable queries and zero-shot model-generated responses as pseudo-demonstrations, thereby generalizing ICL to the zero-shot setup in a fully automated way. We evaluate USP with PaLM and PaLM 2 models and demonstrate performances that are considerably stronger than standard zero-shot baselines and often comparable to or even superior to few-shot baselines across more than 40 natural language understanding, natural language generation, and reasoning tasks.
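A rough sketch of the pseudo-demonstration idea described above: generate zero-shot responses on unlabeled queries, score them with a confidence heuristic, and reuse the highest-scoring pairs as in-context examples. The scoring interface and prompt format are simplified assumptions, not the published USP method.

from typing import Callable, List, Tuple

def build_pseudo_demos(queries: List[str],
                       generate: Callable[[str], Tuple[str, float]],
                       k: int = 4) -> str:
    # generate() returns (zero-shot answer, model confidence) for a query
    scored = [(generate(q)[1], q, generate(q)[0]) for q in queries]
    scored.sort(reverse=True)  # keep the most confident pseudo-demonstrations
    return "\n\n".join(f"Q: {q}\nA: {a}" for _, q, a in scored[:k])

def usp_style_prompt(test_query: str, demos: str) -> str:
    # the selected pseudo-demonstrations are prepended to the test query
    return f"{demos}\n\nQ: {test_query}\nA:"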
    Prior work on constructing challenging tabular inference data centered primarily on human annotation or automatic synthetic generation. Both techniques have their own set of issues. Human annotation, despite its diversity and superior reasoning, suffers from scaling concerns. Synthetic data, on the other hand, despite its scalability, suffers from a lack of linguistic and reasoning diversity. In this paper, we address both of these concerns by presenting a recasting approach that semi-automatically generates tabular NLI instances. We transform the table2text dataset ToTTo (Parikh et al., 2020) into a tabular NLI dataset using our proposed framework. We demonstrate the use of our recasted data as an evaluation benchmark as well as augmentation data to improve performance on TabFact (Chen et al., 2020b). Furthermore, we test the effectiveness of models trained on our data on the TabFact benchmark in the zero-shot scenario.
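A toy illustration of recasting a table-to-text example into tabular NLI: the reference description becomes an "entailed" hypothesis, and a simple numeric perturbation yields a "refuted" one. The perturbation rule is a placeholder assumption; the paper's framework is considerably richer.

import re
from typing import Dict, List

def recast_to_nli(table: List[List[str]], description: str) -> List[Dict]:
    # the original table description is entailed by construction
    examples = [{"table": table, "hypothesis": description, "label": "entailed"}]
    numbers = re.findall(r"\d+", description)
    if numbers:
        wrong = str(int(numbers[0]) + 1)
        refuted = description.replace(numbers[0], wrong, 1)
        examples.append({"table": table, "hypothesis": refuted, "label": "refuted"})
    return examples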
    Encoder-only transformer models have been successfully applied to different table understanding tasks, as in TAPAS (Herzig et al., 2020). A major limitation of these architectures is that they are constrained to classification-like tasks such as cell selection or entailment detection. We present TABT5, an encoder-decoder model that generates natural language text based on tables and textual inputs. TABT5 overcomes the encoder-only limitation by incorporating a decoder component and leverages the input structure with table-specific embeddings as well as pre-training. TABT5 achieves new state-of-the-art results on several domains, including spreadsheet formula prediction (15% increase in sequence accuracy), question answering (10% increase in sequence accuracy) and data-to-text generation (2% increase in BLEU).
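A sketch of how a table plus a textual query might be linearized for an encoder-decoder model in the spirit of TABT5. The token markup is an illustrative assumption; the actual model additionally uses table-specific (row/column) embeddings on top of the flat token sequence.

from typing import List

def linearize(query: str, table: List[List[str]]) -> str:
    # flatten the table into a single sequence the encoder can consume
    header, *rows = table
    cells = [f"<col> {c}" for c in header]
    for r, row in enumerate(rows, start=1):
        cells.extend(f"<row {r}> {c}" for c in row)
    return f"{query} <table> " + " ".join(cells)

print(linearize("total revenue in 2020?",
                [["year", "revenue"], ["2019", "10"], ["2020", "12"]]))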
    Language models (LMs) trained on raw texts have no access to the real physical world environment. Gordon and Van Durme (2013) argue that they thus suffer from reporting bias, meaning that texts rarely report commonsensical facts about the world but more frequently talk about non-commonsensical facts/events. If LMs naively overfit to the co-occurrence statistics of training corpora, they learn a biased view of the physical world. While prior studies have repeatedly verified that LMs of smaller scales (e.g. RoBERTa, GPT-2) amplify reporting bias, it remains unknown whether such trends continue when models are scaled up. We investigate reporting bias in larger language models (LLMs) such as PaLM and GPT-3. Specifically, we query LLMs for the colour of objects, using colour as a representative property for visual commonsense. Surprisingly, we find that LLMs significantly outperform smaller LMs on answering queries about an object's typical colour. We find that LLMs' predictions deviate from corpus co-occurrence statistics induced from resources such as Google Books Ngram and are closer to human judgement. We believe this serves as evidence that larger LMs can overcome reporting bias, rather than showing an inverse scaling function as previously suggested.
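An illustrative comparison between an LLM's colour prediction and a raw co-occurrence baseline, in the spirit of the probing described above; `ngram_counts` and `ask_llm` are stand-in interfaces, with no real corpus or model wired in.

from typing import Callable, Dict

def cooccurrence_colour(obj: str, ngram_counts: Dict[str, Dict[str, int]]) -> str:
    # pick the colour that most frequently co-occurs with the object in text
    counts = ngram_counts.get(obj, {})
    return max(counts, key=counts.get) if counts else "unknown"

def llm_colour(obj: str, ask_llm: Callable[[str], str]) -> str:
    return ask_llm(f"What is the typical colour of a {obj}? Answer with one word.")

# Agreement with human judgements can then be computed for both predictors
# to check whether the larger model escapes the corpus statistics.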
    By nature of the cost and time required to train Large Language Models (LLMs), the embedded knowledge within is usually frozen at the moment their training data is collected. As a result, LLMs have been shown to suffer from diachronic degradation. The in-context learning paradigm can provide a workaround for this limitation by supplying relevant information at inference time. We introduce a new benchmark to evaluate LLMs for one particular but critical aspect of diachronic change: language acquisition. To that end, we rewrite Winograd-style co-reference resolution problems by replacing a word with a new synthetic but plausible English word. The meaning of the word is given to the model in the prompt via a dictionary definition. We show that the accuracy of LLMs compared to the original Winograd tasks decreases radically in our benchmark, and we believe this serves as a measure of progress for future models.
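A minimal sketch of how a Winograd-style item could be rewritten with a synthetic word whose definition is supplied in the prompt, as described above. The example word and schema are invented for illustration only.

from typing import List

def build_prompt(definition: str, sentence: str, pronoun: str,
                 candidates: List[str]) -> str:
    options = " or ".join(candidates)
    return (f"Definition: {definition}\n"
            f"Sentence: {sentence}\n"
            f"Question: In the sentence, what does '{pronoun}' refer to, {options}?\n"
            "Answer:")

print(build_prompt(
    definition="To 'plimb' means to lift something heavy with great effort.",
    sentence="The man couldn't plimb the piano because it was too heavy.",
    pronoun="it",
    candidates=["the man", "the piano"],
))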
    Visual language such as charts and plots is ubiquitous in the human world. Comprehending plots and charts requires strong reasoning skills. Prior state-of-the-art models are end-to-end multimodal Transformers pretrained with dedicated plot derendering and numerical reasoning objectives. However, the models' reasoning capabilities still fall short, and they generally fail on complex queries. In this paper, we decompose the multimodal reasoning problem into, first, a modality conversion problem from image to text and, second, a purely textual reasoning problem, by combining a pretrained image-to-text model with an LLM for the task of chart/figure reasoning. Compared with a SOTA model finetuned on more than 10k data points, our plug-and-play DePlot-LLM model achieves a >20% improvement with just one-shot prompting.
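A plug-and-play pipeline sketch matching the two-stage idea above: first convert the chart image to a textual table, then let an LLM reason over that table with one-shot prompting. Both callables are assumed interfaces, not concrete model APIs.

from typing import Callable

def chart_qa(image_bytes: bytes, question: str,
             derender: Callable[[bytes], str],
             llm: Callable[[str], str]) -> str:
    table_text = derender(image_bytes)      # modality conversion: image -> table text
    prompt = (f"Table:\n{table_text}\n"
              f"Answer the question based on the table.\nQ: {question}\nA:")
    return llm(prompt)                      # purely textual reasoning step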
    Visual language data such as plots, charts, and infographics are ubiquitous in the human world. However, state-of-the-art vision-language models do not perform well on these data. We propose a set of pretraining tasks to enhance visual language models' capabilities in jointly modeling charts/plots and language data. We initialize with Pix2Struct, a recently proposed image-to-text visual language model, and continue pretraining with our proposed objectives. We argue that numerical reasoning and plot deconstruction equip a model with the key capabilities of (1) extracting key information and (2) reasoning on the extracted information. On standard benchmarks such as PlotQA and ChartQA, our continually pretrained MatCha model outperforms state-of-the-art methods by as much as ~20%. We also examine how well MatCha pretraining transfers to domains such as screenshot, textbook, and poster figures. We observe an improvement over the base Pix2Struct checkpoint by 1.2% on average, verifying the usefulness of MatCha pretraining on broader visual language tasks.
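A sketch of how a chart-derendering pretraining pair might be constructed: render a small data table to an image and use the table text as the generation target. The use of matplotlib for rendering is an illustrative assumption about the data pipeline, not the paper's setup.

import io
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

def make_derendering_example(labels, values):
    # render a bar chart image from the underlying table
    fig, ax = plt.subplots()
    ax.bar(labels, values)
    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    plt.close(fig)
    # the model sees the image and must reproduce the table text
    target = "label | value\n" + "\n".join(f"{l} | {v}" for l, v in zip(labels, values))
    return buf.getvalue(), target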
    We propose a benchmark to assess the capability of large language models to reason with metaphor. Our benchmark combines the previously isolated topics of metaphor detection and commonsense reasoning into a single task that requires a model to make inferences by accurately selecting between the literal and metaphorical register. We examine the performance of state-of-the-art pretrained models on forced-choice tasks and find a large discrepancy between small and very large models, going from chance- to human-level performance. However, upon examining the generative performance of the largest model, we find that there is still a gap to bridge before human performance is reached in a more natural conversational setting.
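A sketch of forced-choice evaluation as described above: the model scores a literal and a metaphorical reading, and the higher-scoring one is taken as its answer. `log_likelihood` is an assumed interface to a scoring model.

from typing import Callable

def forced_choice(context: str, literal: str, metaphorical: str,
                  log_likelihood: Callable[[str, str], float]) -> str:
    # score each candidate continuation given the context
    scores = {
        "literal": log_likelihood(context, literal),
        "metaphorical": log_likelihood(context, metaphorical),
    }
    return max(scores, key=scores.get)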