DS-STAR: A state-of-the-art versatile data science agent

November 6, 2025

Jinsung Yoon, Research Scientist, and Jaehyun Nam, Student Researcher, Google Cloud

DS-STAR is a state-of-the-art data science agent whose versatility is shown by its ability to automate a range of tasks — from statistical analysis to visualization and data wrangling — across various data types, culminating in a top-ranking performance on the famous DABStep benchmark.

Data science is a field dedicated to transforming raw data into meaningful, actionable insights, playing an essential role in solving real-world challenges. Businesses often depend on data-driven insights to make pivotal strategic decisions. However, the data science process is frequently complex, demanding a high level of expertise in fields like computer science and statistics. This workflow consists of many time-intensive activities, from interpreting various documents to performing complex data processing and statistical analysis.

To streamline this complex workflow, recent research has focused on using off-the-shelf large language models (LLMs) to create autonomous data science agents. The goal of these agents is to convert natural language questions into executable code for a desired task. But despite significant progress, current data science agents have several limitations that hinder their practical use. A major issue is their heavy reliance on well-structured data, such as CSV files or relational database tables. This narrow focus ignores the valuable information contained in the diverse, heterogeneous formats common in real-world applications, such as JSON, unstructured text, and markdown files. Another challenge is that many data science problems are open-ended and lack ground-truth labels, making it difficult to verify whether an agent's reasoning is correct.


Data science agents answer user queries by generating code that operates on diverse data formats. Following the code's execution, the agent provides a final solution, which may take the form of a trained model, a processed database, a visualization, or a text-formatted answer.

To address these challenges, we present DS-STAR, a new agent designed to solve data science problems. DS-STAR introduces three key innovations: (1) a data file analysis module that automatically extracts context from varied data formats, including unstructured ones; (2) a verification stage in which an LLM-based judge assesses the plan’s sufficiency at each step; and (3) a sequential planning process that iteratively refines the initial plan based on that feedback. This iterative refinement allows DS-STAR to handle complex analyses that draw verifiable insights from multiple data sources. We demonstrate that DS-STAR achieves state-of-the-art performance on challenging benchmarks like DABStep, KramaBench, and DA-Code, and it especially excels on tasks involving diverse, heterogeneous data files.

DS-STAR

The DS-STAR framework operates in two main stages. First, it automatically examines all files in a directory and creates a textual summary of their structure and contents. This summary becomes a vital source of context for tackling the task at hand.


DS-STAR creates a Python script to analyze diverse data files by extracting key information.
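For illustration, the snippet below is a minimal sketch of what such a file analyzer might look like, assuming a pandas-based Python environment. The function names (`summarize_file`, `describe_directory`) and the per-format heuristics are hypothetical; in DS-STAR the analysis scripts are generated by the LLM itself rather than hard-coded.

```python
# Minimal sketch of a data-file analyzer in the spirit of DS-STAR's first stage.
# Names and heuristics are illustrative, not DS-STAR's actual implementation.
import json
from pathlib import Path

import pandas as pd


def summarize_file(path: Path, max_chars: int = 500) -> str:
    """Return a short textual description of a single data file."""
    suffix = path.suffix.lower()
    if suffix == ".csv":
        df = pd.read_csv(path, nrows=5)
        return f"CSV with columns {list(df.columns)}; sample rows:\n{df.to_string(index=False)}"
    if suffix == ".json":
        data = json.loads(path.read_text())
        shape = list(data.keys()) if isinstance(data, dict) else f"list of {len(data)} records"
        return f"JSON file; top-level structure: {shape}"
    # Fall back to a raw text preview for markdown, logs, and other formats.
    return f"Text preview:\n{path.read_text()[:max_chars]}"


def describe_directory(data_dir: str) -> str:
    """Concatenate per-file summaries into one context string for downstream planning."""
    parts = []
    for path in sorted(Path(data_dir).iterdir()):
        if path.is_file():
            parts.append(f"## {path.name}\n{summarize_file(path)}")
    return "\n\n".join(parts)
```

The resulting text summary is what gives the downstream agents enough context to plan over files they have never seen, including unstructured ones.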

Second, DS-STAR engages in a primary loop of planning, implementing, and verifying. The Planner agent first creates a high-level plan, which the Coder agent then transforms into a code script. Subsequently, the Verifier agent evaluates the code's effectiveness in solving the problem. The Verifier agent is an LLM-based judge prompted to determine if the current plan is adequate. If the judge finds the plan insufficient, DS-STAR refines it by altering or adding steps (determined by the Router agent) and then repeats the cycle. Importantly, DS-STAR uses a method that mimics how an expert analyst uses tools like Google Colab to build a plan sequentially, reviewing intermediate results before proceeding. This iterative cycle continues until a plan is deemed satisfactory or the maximum number of rounds (10) is reached, at which point the final code is delivered as the solution.


DS-STAR's workflow is an iterative loop. It starts by executing a simple plan and uses a Verifier agent to check if it's sufficient. If the plan is inadequate, a Router agent guides the refinement by adding a step or correcting any errors before the cycle repeats. The process continues until the Verifier approves the plan or the maximum number of rounds is reached.
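To make the loop concrete, here is a simplified sketch of that cycle under a few assumptions: `llm` is any callable that maps a prompt string to a text completion, the prompts are paraphrases rather than DS-STAR's actual prompts, and the subprocess-based `execute` helper stands in for a proper sandboxed executor.

```python
# Illustrative sketch of DS-STAR's plan-code-verify-route loop, based on the
# description above. Prompts and helpers here are simplified stand-ins.
import subprocess
import sys
import tempfile

MAX_ROUNDS = 10  # the blog states a maximum of 10 refinement rounds


def execute(code: str) -> str:
    """Run generated Python in a subprocess and capture its output (simplified sandbox)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
    proc = subprocess.run([sys.executable, f.name], capture_output=True, text=True, timeout=120)
    return proc.stdout + proc.stderr


def solve(query: str, data_description: str, llm) -> str:
    # Planner: draft an initial, deliberately simple plan from the query and data summary.
    plan = llm(f"Draft an initial analysis plan for:\n{query}\n\nData:\n{data_description}")
    for _ in range(MAX_ROUNDS):
        # Coder: turn the current plan into executable Python and run it.
        code = llm(f"Write Python implementing this plan:\n{plan}")
        result = execute(code)

        # Verifier: an LLM judge decides whether the plan now suffices to answer the query.
        verdict = llm(f"Question: {query}\nPlan: {plan}\nResult: {result}\n"
                      "Is this plan sufficient to answer the question? Answer yes or no.")
        if verdict.strip().lower().startswith("yes"):
            return code  # the final code is delivered as the solution

        # Router: decide whether to append a new step or fix an existing one,
        # then refine the plan accordingly before the next round.
        action = llm(f"Given the result, should we ADD a new step or DEBUG the plan?\nPlan: {plan}")
        plan = llm(f"Revise the plan ({action}):\n{plan}\nObserved result:\n{result}")
    return code
```

The key design choice is that each round reviews executed intermediate results, mirroring how an analyst inspects outputs cell by cell before deciding what to do next.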

Evaluation

To evaluate DS-STAR’s effectiveness, we compared its performance to that of existing state-of-the-art methods (AutoGen, DA-Agent) on a set of well-regarded data science benchmarks: DABStep, KramaBench, and DA-Code. These benchmarks evaluate performance on complex tasks, such as data wrangling, machine learning, and visualization, that draw on multiple data sources and formats.

The results show that DS-STAR substantially outperforms AutoGen and DA-Agent in all test scenarios. Compared to the best alternative, DS-STAR raised accuracy from 41.0% to 45.2% on DABStep, from 39.8% to 44.7% on KramaBench, and from 37.0% to 38.5% on DA-Code. Notably, DS-STAR also secured the top rank on the public leaderboard for the DABStep benchmark (as of 9/18/2025). On both easy tasks (where the answer is in a single file) and hard tasks (requiring multiple files), DS-STAR consistently surpasses competing baselines, demonstrating its superior ability to work with multiple, heterogeneous data sources.


This chart shows the normalized accuracy (%) on both easy (single-file) and hard (multi-file) tasks from the DABStep, KramaBench, and DA-Code benchmarks. DS-STAR consistently outperforms competing baselines, showing a particularly strong advantage in hard tasks that require processing multiple, heterogeneous data files.

In-depth analysis of DS-STAR

Next, we conducted ablation studies to verify the effectiveness of DS-STAR’s individual components and analyze the impact of the number of refinement rounds, specifically by measuring the iterations required to generate a sufficient plan.

Data File Analyzer: This agent is essential for high performance. Without the descriptions it generates (Variant 1), DS-STAR's accuracy on difficult tasks within the DABStep benchmark sharply dropped to 26.98%, underscoring the importance of rich data context for effective planning and implementation.

Router: The Router agent’s ability to determine if a new step is needed or to fix an incorrect step is vital. When we removed it (Variant 2), DS-STAR only added new steps sequentially, leading to worse performance on both easy and hard tasks. This demonstrated that it is more effective to correct mistakes in a plan than to keep adding potentially flawed steps.

Generalizability Across LLMs: We also tested DS-STAR's adaptability by using GPT-5 as the base model. This yielded promising results on the DABStep benchmark, indicating the framework's generalizability. Interestingly, DS-STAR with GPT-5 performed better on easy tasks, while the Gemini-2.5-Pro version performed better on hard tasks.


Ablation study results for DS-STAR on the DABStep benchmark, evaluating individual agent effectiveness and LLM compatibility.

An analysis of the refinement process: The figure below shows that difficult tasks naturally require more iterations. On the DABStep benchmark, hard tasks needed an average of 5.6 rounds to solve, whereas easy tasks required only 3.0 rounds. Furthermore, over half of the easy tasks were completed in just a single round.


An analysis of refinement rounds on the DABStep benchmark shows that difficult tasks require more iterations. Hard tasks average 5.6 rounds versus 3.0 for easy tasks, with over 50% of easy tasks being solved in the first round alone.

Conclusion

In this work, we introduced DS-STAR, a new agent that can autonomously solve data science problems. The framework is defined by two core innovations: the automatic analysis of diverse file formats and an iterative, sequential planning process that uses a novel LLM-based verification system. DS-STAR establishes a new state-of-the-art on the DABStep, KramaBench, and DA-Code benchmarks, outperforming the best alternative. By automating complex data science tasks, DS-STAR has the potential to make data science more accessible for individuals and organizations, helping to drive innovation across many different fields.

Acknowledgements

We would like to thank Jiefeng Chen, Jinwoo Shin, Raj Sinha, Mihir Parmar, George Lee, Vishy Tirumalashetty, Tomas Pfister and Burak Gokturk for their valuable contributions to this work.