Automatic Histograms: Leveraging Language Models for Text Dataset Exploration

Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (CHI EA '24), ACM, Honolulu, HI, USA (2024), pp. 9

Abstract

Making sense of unstructured text datasets is perennially difficult, yet increasingly relevant with Large Language Models. Data practitioners often rely on dataset summaries, especially distributions of various derived features. Some features, like toxicity or topics, are relevant to many datasets, but many interesting features are domain specific, e.g., instruments and genres for a music dataset, or diseases and symptoms for a medical dataset. Accordingly, data practitioners often run custom analyses for each dataset, which is cumbersome and difficult, or use unsupervised methods. We present AutoHistograms, a visualization tool leveraging LLMs. AutoHistograms automatically identifies relevant entity-based features, visualizes their distributions, and allows the user to interactively query the dataset for new categories of entities. In a user study with (n=10) data practitioners, we observe that participants were able to quickly onboard to AutoHistograms, use the tool to identify actionable insights, and conceptualize a broad range of applicable use cases. We also describe a variety of usage scenarios from different types of users to highlight how this app can provide value in many different contexts. Finally, we present a quantitative evaluation of the tool. Together, this tool and user study contribute to the growing field of LLM-assisted sensemaking tools.