Gemini provides automated feedback for theoretical computer scientists at STOC 2026
December 15, 2025
Vincent Cohen-Addad and David Woodruff, Research Scientists, Google Research, on behalf of the research team
We describe a new tool, tested for the STOC 2026 conference, that uses Gemini to help scientists rigorously verify the correctness of their conference submissions.
The pursuit of truth in theoretical computer science and mathematics relies on the highest standards of proof, rigor, and clarity. While peer review is the crucial final check, the process of drafting and refining complex theoretical work often takes months, with simple errors, inconsistent variables, or subtle logical gaps frequently slowing down the entire research pipeline. But could a highly specialized AI tool act as a fast, rigorous collaborator, helping authors pre-vet their work before it ever reaches human reviewers?
To test this potential, we created an experimental program for the Annual ACM Symposium on Theory of Computing (STOC 2026) — one of the most prestigious venues in theoretical computer science. This program offered authors automated, pre-submission feedback generated by a specialized Gemini AI tool. Our objective was to provide constructive suggestions and identify potential technical issues within 24 hours of submission, helping authors polish their final drafts before the submission deadline.
The responses were very positive: the tool successfully identified a variety of issues, including calculation and logic errors. Here we report how we developed the tool and the results of its use.
Optimized for mathematical rigor
The feedback tool leveraged inference scaling methods in an advanced version of Gemini 2.5 Deep Think. This setup enables the model to explore and combine multiple candidate solutions simultaneously before giving a final answer, rather than pursuing a single, linear chain of thought. By combining different reasoning and evaluation traces, the method reduces hallucinations and focuses on the most salient issues.
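To make the idea concrete, here is a minimal sketch of this kind of parallel-trace aggregation, assuming a generic LLM client; the function names and prompts are illustrative and do not reflect the actual Deep Think system.

```python
# Illustrative sketch only: NOT the actual Deep Think implementation.
# The idea: sample several independent review traces, then ask the model to
# cross-check and merge them, instead of trusting a single chain of thought.
# `call_model` is a hypothetical stand-in for a real LLM client.

from concurrent.futures import ThreadPoolExecutor


def call_model(prompt: str) -> str:
    # Placeholder: swap in a real model call here.
    return f"(model response to a {len(prompt)}-character prompt)"


def review_trace(paper_text: str, trace_id: int) -> str:
    # One independent pass over the paper, looking for errors and gaps.
    prompt = (
        f"[trace {trace_id}] Verify the proofs in the paper below. "
        "List potential mistakes, logical gaps, and typos.\n\n" + paper_text
    )
    return call_model(prompt)


def aggregate(traces: list[str]) -> str:
    # Combine the traces: issues that survive cross-checking across several
    # independent traces are more likely to be real and salient.
    joined = "\n\n---\n\n".join(traces)
    prompt = (
        "Below are independent reviews of the same paper. Merge them, "
        "discard unsupported claims, and keep only the most salient, "
        "well-justified issues.\n\n" + joined
    )
    return call_model(prompt)


def review_paper(paper_text: str, n_traces: int = 4) -> str:
    # Explore multiple reasoning traces in parallel, then combine them.
    with ThreadPoolExecutor(max_workers=n_traces) as pool:
        traces = list(pool.map(lambda i: review_trace(paper_text, i), range(n_traces)))
    return aggregate(traces)
```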
Feedback format
Authors received structured feedback divided into key sections: a summary of the paper's contributions, a list of potential mistakes and improvements (often analyzing specific lemmas or theorems), and a list of minor corrections and typos. See some feedback examples.
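For illustration, a report with these sections might be represented roughly as follows; the field names are hypothetical and not the tool's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class FeedbackReport:
    # Hypothetical representation of one feedback report.
    summary: str  # summary of the paper's contributions
    potential_mistakes: list[str] = field(default_factory=list)  # issues in specific lemmas/theorems, plus improvements
    minor_corrections: list[str] = field(default_factory=list)   # typos and small fixes
```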
Impact and technical depth
The tool successfully identified a wide range of issues, from inconsistent variable names to complex problems like calculation errors, incorrect application of inequalities, and logical gaps in proofs. As one author noted, the tool found "a critical bug... that made our proof entirely incorrect," further adding that it was an "embarrassingly simple bug that evaded us for months."
Over 120 participants responded to our post-experiment survey and gave their consent; the responses were very positive, with authors citing the model's success at finding critical errors and its insightful commentary. In summary:
- Over 80% of papers submitted by the time our experiment ended had opted in to the AI review
- 97% found the feedback helpful
- 97% would use this tool again for future submissions
- 81% found the model improved clarity or readability of the paper
The user experience
Beyond technical accuracy, authors valued the speed and neutrality of the AI review. Some participants noted receiving feedback in just two days, while others praised the "neutral tone and rigor" of the output, finding it a useful complement to human readers.
Interpreting the output
Because participants were experts in their respective fields, they could readily distinguish helpful insights from occasional "hallucinations". While the model sometimes struggled, particularly with parsing complex notation or interpreting figures, authors weren't dismissive of its output. Rather, they filtered out the noise, extracted the correct and useful findings, and used the feedback as a starting point for their own verification. This outcome demonstrates the potential for AI to serve as a collaborative partner, augmenting the research workflow by helping human experts make informed decisions based on the model's output.
Educational impact and future outlook
The research community surveyed in this experiment saw significant potential for this tool in training the next generation: 75% of surveyed authors believed the tool would have educational value for students, offering immediate feedback on mathematical rigor and presentation clarity.
This pilot demonstrated the potential for specialized AI tools to serve as collaborative partners in fundamental research areas, pointing to directions for future research initiatives. Our overall goal is not to replace the critical peer review process, but to augment and enhance it. Reflecting this, 88% of participants expressed strong interest in having continuous access to such a tool throughout their research process.
Acknowledgements
Vincent Cohen-Addad, Rajesh Jayaram, Jon Schneider, and David Woodruff co-led this project.