AI as a research partner: Advancing theoretical computer science with AlphaEvolve

September 30, 2025

Ansh Nadga, Student Researcher, and Abhradeep Thakurta, Staff Research Scientist, Google DeepMind, and Prabhakar Raghavan, Chief Technologist, Google

We invoke AlphaEvolve, an LLM-based coding agent, to find and verify combinatorial structures that improve results on the hardness of approximately solving certain optimization problems.

Recently, large language models (LLMs) have demonstrated surprising capabilities in competitive mathematics and competitive programming, demonstrating world-leading performance across both of these fields. However, their successes in mathematical discovery — proving novel theorems or uncovering new combinatorial structures — have been relatively few (with some notable exceptions [1, 2, 3]). Since mathematics and theoretical computer science demand absolute correctness[94fb54], any AI-based method that makes mathematical discovery must either have a proof of correctness that can be confirmed computationally (without any human involvement), or have a domain-expert human in the loop to certify correctness.

In our recent paper, “Reinforced Generation of Combinatorial Structures: Applications to Complexity Theory”, we demonstrate how an LLM-powered coding agent can help discover new mathematical structures that push the boundaries of our understanding of complexity theory (a sub-field of theoretical computer science). Our work utilizes AlphaEvolve, a system developed at Google DeepMind that uses LLMs to iteratively evolve code. By employing a feedback loop, AlphaEvolve began with populations of code snippets, evaluated the structures produced by the code snippets, and used an LLM to morph the most successful snippets toward better solutions. This approach led to new results in two distinct areas of complexity theory: 1) improving the state-of-the-art for the limit on our ability to approximate the outcome (i.e., the "inapproximability") of the maximum cut problem for 4 slices (which we define as the MAX-4-CUT problem), and 2) tightening the bounds on the average-case hardness of certifying properties of random graphs.

AI-assisted mathematical research can operate in the following modes:

  1. A person invokes an LLM to summarize the literature, to chart a research plan towards new theorems, or to directly generate chunks of (or entire) proofs.
  2. A person uses AI-derived tools, such as AlphaEvolve, to generate better proof elements.

Our work falls in the second category, where we obtain better proof elements using AlphaEvolve that can be automatically verified by a computer program.

The power of lifting: From finite constructions to universal statements

A fundamental challenge in using AI for theoretical computer science research lies in the universal nature of the problems studied. An AI system might find a solution to a specific instance of a problem — say, the optimal route for a traveling salesman visiting 50 specific cities. However, computer scientists often seek theorems that hold true universally for all problem instances and sizes (denoted as ∀n).

How can we use AlphaEvolve to prove a universal statement? The answer lies in a technique known as "lifting" (see image below). If a proof is viewed as a long string, then one can take a chunk of the proof (corresponding to a certain finite structure), and evolve it to support a stronger universal statement, while keeping the interface to the rest of the proof intact. The advantage of this approach is that to certify overall correctness, one needs to only certify the correctness of the finite structure that has been evolved.

Lifting: Morphing finite structures using AI, while keeping the interface to the broader proof intact.

Lifting: Morphing finite structures using AI, while keeping the interface to the broader proof intact.

In complexity theory, researchers often use established proof frameworks that rely on the existence of specific, highly optimized finite structures. If a better structure can be found, the entire proof framework "lifts" this improvement to a better universal result.

A key example of this is a "gadget reduction." To prove that a target problem is computationally hard (intractable), researchers try to map a known intractable source problem to it, hence demonstrating that the target problem is at least as hard as the source problem. A gadget is a recipe for locally transforming a small piece of the source problem into a piece of the target problem. These gadgets are finite structures, and finding the optimal gadget is a painstaking process often done by hand.

By tasking AlphaEvolve with finding better gadgets, we were able to discover structures far more complex than those previously known. These finite discoveries, when plugged into the existing mathematical frameworks, immediately yield new universal theorems in complexity theory.

New theorems in complexity theory

We applied this methodology to the MAX-k-CUT problem. Given a graph (a network of nodes and edges), the goal is to partition the nodes into k distinct sets such that the number of edges crossing between different sets is maximized. This is a classic intractable (NP-hard) problem, meaning we do not expect to find efficient algorithms that solve it exactly. Therefore, we focused on approximation algorithms — those that efficiently find solutions guaranteed to be close to the optimum.

The crucial question is: what is the limit of approximation?

MAX-4-CUT: A new state of the art

For MAX-4-CUT (partitioning into four sets), the previous best-known result proved that it is NP-hard to approximate the solution within a factor of 0.9883. AlphaEvolve was deployed to search for a new gadget reduction to MAX-4-CUT.

The system discovered an intricate gadget involving 19 variables (nodes) with a complex weighting scheme (some connections having up to 1429 times the weight of others). This discovery established a new inapproximability bound of 0.987.

This improvement may seem incremental, but in the mature field of hardness of approximation, such advances often require significant new techniques or combinatorial insights.

Drawing of a graph representing the gadget found by AlphaEvolve for the reduction to MAX-4-CUT.

Gadget found by AlphaEvolve for the reduction to MAX-4-CUT.

Average-case hardness and Ramanujan graphs

We also explored the hardness of problems on average, rather than in the worst case. Specifically, we studied the difficulty of certifying bounds on the MAX-2-CUT (as well as maximum independent set) of sparse random graphs[3d54ef]. Recent work connected this problem to the existence of specific Ramanujan graphs — deterministic graphs that “look” like sparse random graphs. They conjectured that the existence of Ramanujan graphs with unnaturally large cuts implies it is computationally hard to certify the MAX-2-CUT of a random graph.

Prior work used computer assistance to find such graphs on up to 10 nodes. Improving their results requires finding more extremal Ramanujan graphs on many more nodes, which are exceedingly difficult to find and verify. AlphaEvolve successfully navigated this vast search space, discovering Ramanujan graphs with even larger cuts on as many as 163 nodes.

Drawing of a 4-regular Ramanujan graph with large 2-cut found by AlphaEvolve.

A 4-regular Ramanujan graph with large 2-cut found by AlphaEvolve.

These discoveries significantly improved the lower bounds for average-case hardness. Furthermore, combined with new algorithmic progress (non-AI based), we were able to nearly settle the computational hardness of these questions, matching the upper and lower bounds to within the third decimal place.

The crucial role of verified correctness

A critical distinction of this work is that the results come with proofs of correctness.

When an LLM is prompted to generate a mathematical proof directly, it often produces a proof sketch or an argument that requires substantial human intervention to verify and complete. Hallucinations or subtle errors can render the output useless. As mentioned earlier, the standard for correctness in math is absolute.

In contrast, the approach taken here uses AI to discover a structure within the proof, not the proof itself. The validity of the final theorem relies on two components: the correctness of the lifting framework, and the verification of the discovered structure. While the frameworks are sound, verifying the structures discovered by AlphaEvolve is computationally intensive.

Remarkably, AlphaEvolve achieved a 10,000x speedup in the verification process by implementing sophisticated branch-and-bound strategies and system-level optimizations. This massive speedup was the key enabler for the research, allowing the system to explore much larger and more complex gadgets.

Crucially, the final gadgets discovered were still verified using the original, brute-force algorithm, ensuring the absolute correctness of the theorems.

The future of AI-assisted theory

While these initial research findings are far from conclusive, they suggest that AI is poised to become a helpful collaborator in mathematical discovery. We have observed the models in AlphaEvolve generate intricate mathematical objects that at times exhibit nascent reasoning capabilities. However, as we transition into an era where proofs may increasingly be attributed to AI, the crucial task of verification is set to become a significant bottleneck.

Acknowledgments

We would like to thank Adam Zsolt Wagner, Swarat Chaudhuri, Pasin Manurangsi and Sushant Sachdeva for helping us during various stages of the project.


  1. In mathematics, a statement is definitively true or false, with no intermediate state possible. This stands in contrast to several other applications of AI, such as essay-writing or artistic creation, which have subjective standards of correctness and do not need to be correct in an absolute sense.

  2. A sparse random graph is generated by randomly adding edges between a pair of nodes, where each node is guaranteed to have exactly d neighbors for some small d.