Gemini provides automated feedback for theoretical computer scientists at STOC 2026

December 15, 2025

Vincent Cohen-Addad and David Woodruff, Research Scientists, Google Research, on behalf of the research team

We describe a new tool that uses Gemini to help scientists rigorously verify the correctness of their conference submissions; we tested it at the STOC 2026 conference.

The pursuit of truth in theoretical computer science and mathematics relies on the highest standards of proof, rigor, and clarity. While peer review is the crucial final check, the process of drafting and refining complex theoretical work often takes months, with simple errors, inconsistent variables, or subtle logical gaps frequently slowing down the entire research pipeline. But could a highly specialized AI tool act as a fast, rigorous collaborator, helping authors pre-vet their work before it ever reaches human reviewers?

To test this potential, we created an experimental program for the Annual ACM Symposium on Theory of Computing (STOC 2026) — one of the most prestigious venues in theoretical computer science. This program offered authors automated, pre-submission feedback generated by a specialized Gemini AI tool. Our objective was to provide constructive suggestions and identify potential technical issues within 24 hours of submission, helping authors polish their final drafts before the submission deadline.

The responses were very positive: the tool successfully identified a variety of issues, including calculation and logic errors. Here we report how we developed the tool and the results of its use.

Optimized for mathematical rigor

The feedback tool leveraged inference scaling methods in an advanced version of Gemini 2.5 Deep Think. Rather than pursuing a single, linear chain of thought, this setup lets the model explore and combine multiple candidate solutions in parallel before producing a final answer. By combining different reasoning and evaluation traces, the method reduces hallucinations and focuses on the most salient issues.
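
To make the explore-and-combine idea concrete, the sketch below shows a self-consistency-style pattern: sample several independent reasoning traces and keep the finding that most of them agree on. This is an illustration only, not the Deep Think implementation; the generate helper is a hypothetical stand-in for a call to a reasoning model.

    # Minimal sketch of the parallel explore-then-combine pattern described above.
    # `generate` is a hypothetical placeholder for one model call with sampling
    # enabled; it is not the actual Deep Think implementation.

    from collections import Counter

    def generate(prompt: str, seed: int) -> str:
        """Placeholder for one independent reasoning trace from the model."""
        # In practice, each seed would produce a different chain of reasoning.
        return f"candidate-finding-{seed % 2}"  # stubbed output for illustration

    def review_paper(paper_text: str, num_traces: int = 8) -> str:
        """Sample several reasoning traces, then keep the most consistent finding."""
        prompt = f"Check the following proof for errors:\n{paper_text}"
        candidates = [generate(prompt, seed) for seed in range(num_traces)]
        # Combine traces by simple self-consistency: report the finding that the
        # largest number of independent traces agree on.
        most_common, _ = Counter(candidates).most_common(1)[0]
        return most_common

    if __name__ == "__main__":
        print(review_paper("Lemma 3.1: ..."))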

Feedback format

Authors received structured feedback divided into key sections: a summary of the paper's contributions, a list of potential mistakes and improvements (often analyzing specific lemmas or theorems), and a list of minor corrections and typos. See some feedback examples.
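
As a rough sketch of this structure (the field names below are our own illustrative choices, not the tool's actual output schema), each report can be pictured as follows:

    # Illustrative schema for the structured feedback described above; the names
    # are hypothetical and simply mirror the three sections authors received.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Finding:
        location: Optional[str]   # e.g., "Lemma 4.2" or "Theorem 1"
        description: str          # the potential mistake or suggested improvement

    @dataclass
    class Feedback:
        summary: str                                  # summary of the paper's contributions
        potential_issues: List[Finding] = field(default_factory=list)
        minor_corrections: List[str] = field(default_factory=list)  # typos, notation slips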

Impact and technical depth

The tool successfully identified a wide range of issues, from inconsistent variable names to complex problems like calculation errors, incorrect application of inequalities, and logical gaps in proofs. As one author noted, the tool found "a critical bug... that made our proof entirely incorrect," further adding that it was an "embarrassingly simple bug that evaded us for months."

Over 120 participants responded to our post-experiment survey and gave their consent. The feedback was very positive, with respondents citing the model's success at finding critical errors and its ability to return insightful commentary. In summary:

  • >80% of papers submitted by the time our experiment ended had opted in to our AI review
  • 97% found the feedback helpful
  • 97% would use this tool again for future submissions
  • 81% found the model improved clarity or readability of the paper

Three testimonials from university professors praising the tool for identifying significant errors and bugs in their research papers.

The user experience

Beyond technical accuracy, authors valued the speed and neutrality of the AI review. Participants noted receiving feedback in just two days. Others praised the "neutral tone and rigor" of the output, finding it a useful complement to human readers.

A testimonial graphic from Professor Shuchi Chawla praising the tool for providing valuable feedback that exceeded expectations.

Interpreting the output

Because participants are experts in their respective fields, they could readily distinguish helpful insights from occasional "hallucinations". While the model sometimes struggled, particularly with parsing complex notation or interpreting figures, authors were not dismissive of the LLM's output. Rather, they filtered out the noise, extracted the important and correct parts, and used the feedback as a starting point for their own verification. This outcome demonstrates the potential for AI to serve as a collaborative partner, augmenting the research workflow by helping human experts make informed decisions based on the model's outputs.

Educational impact and future outlook

The research community surveyed in this experiment saw significant potential for this tool in training the next generation. 75% of surveyed authors believed the tool has educational value for students by offering immediate feedback on mathematical rigor and presentation clarity.

This pilot demonstrated the potential for specialized AI tools to serve as collaborative partners in fundamental areas, establishing a target for potential future research initiatives. Our overall goal is not to replace the critical peer review process, but rather to augment and enhance it. Reflecting this, 88% of participants expressed strong interest in having continuous access to such a tool throughout their entire research process.

Acknowledgements

Vincent Cohen-Addad, Rajesh Jayaram, Jon Schneider, and David Woodruff co-led this project, with key contributions by Lalit Jain, Jieming Mao, and Vahab Mirrokni. We also thank the STOC 2026 PC chair Artur Czumaj and the many other authors who participated in this experiment and provided their valuable feedback, helpful suggestions, and discussions, including Mohammad Taghi Hajiaghayi, Ravi Kumar, Yossi Matias, and Sergei Vassilvitskii. Finally, this work builds on the efforts of the Deep Think team: Garrett Bingham, Irene Cai, Heng-Tze Cheng, Yong Cheng, Kristen Chiafullo, Vincent Cohen-Addad, Paul Covington, Golnaz Ghiasi, Chenjie Gu, Huan Gui, Ana Hosseini, Dawsen Hwang, Lalit Jain, Vihan Jain, Ragha Kotikalapudi, Chenkai Kuang, Maciej Kula, Nate Kushman, Jane Labanowski, Quoc Le, Jonathan Lee, Zhaoqi Leng, Steve Li, YaGuang Li, Hanzhao (Maggie) Lin, Evan Liu, Yuan Liu, Thang Luong, Jieming Mao, Vahab Mirrokni, Pol Moreno, Nigamaa Nayakanti, Aroonalok Pyne, Shubha Raghvendra, Sashank Reddi, Nikunj Saunshi, Siamak Shakeri, Archit Sharma, Xinying Song, Qijun Tan, Yi Tay, Trieu Trinh, Theophane Weber, Winnie Xu, Zicheng Xu, Shunyu Yao, Lijun Yu, Hao Zhou, Honglei Zhuang, and Song Zuo.


  1. The authors of this blog post and all team members are listed in alphabetical order.
