
Accelerating code migrations with AI

July 18, 2024

Stoyan Nikolov, Senior Staff Software Engineer, Google Core, and Siddharth Taneja, Senior Engineering Manager, Google Ads

Generative AI-powered workflows allow Google to migrate code faster and maintain its codebase more effectively

Over the past decades, source code bases have grown exponentially. Google's monorepo, which contains billions of lines of code, is one example of such a large code dataset. Keeping up with the code transformations (called “migrations”) needed to accommodate new language versions, framework updates, and changing APIs and data types across this vast codebase is challenging, to say the least.

For years, Google has employed specialized infrastructure for large-scale changes to execute complex code migrations. The infrastructure uses static analysis and tools like Kythe and Code Search to discover which locations need to be changed, along with their dependencies. Tools like ClangMR are then employed to make the changes.

This approach works well for changes that are uniform in structure and have a limited set of edge cases. Yet static analysis and simple migration scripts run into limitations when migrating code that has a complex structure: for example, changing interfaces and their usages across multiple components and their dependencies, or updating the associated tests.

In this post, we describe our internal approach to combining multiple AI-driven tasks in a new tool that assists Google developers with code migrations at scale. The goal is to support the engineer, letting them focus on the complex aspects of the migration without isolating them from the process. Our case study demonstrates that this approach can successfully generate the majority of the new code necessary for a migration and significantly reduce the human toil of the work.

Change creation workflow

For code migrations, we have built a new, complementary toolkit to address changes that would be difficult for the standard tooling and would benefit from the ability of machine learning (ML) models to adapt to the surrounding code.

We conceptually split the process of the migration into three stages:

  1. Targeting the locations in the codebase that need modification
  2. Edit generation and validation
  3. Change review and rollout

While each of these stages benefits from AI, we focus on #2.

To generate and validate the code changes we leverage a version of the Gemini model that we fine-tuned on internal Google code and data.

Each migration requires as input:

  • A set of files and the locations of expected changes: path + line number in the file
  • One or two prompts that describe the change
  • [Optional] Few-shot examples to decide if a file actually needs migration
[Video: example execution of the multi-stage code migration process.]
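To make these inputs concrete, here is a minimal sketch in Java of what such a migration request might look like. The `MigrationRequest` and `ChangeSite` types, and the example file path, are hypothetical illustrations for this post, not the internal API:

```java
// Hypothetical shape of a migration request; names and fields are illustrative.
import java.util.List;

record ChangeSite(String filePath, int lineNumber) {}

record MigrationRequest(
    List<ChangeSite> sites,           // files + line numbers where changes are expected
    List<String> prompts,             // one or two natural-language descriptions of the change
    List<String> fewShotExamples) {}  // optional: examples for the "needs migration?" check

class MigrationRequestExample {
  static MigrationRequest example() {
    return new MigrationRequest(
        List.of(new ChangeSite("ads/campaign/CampaignStore.java", 42)),
        List.of("Migrate the campaign ID from a 32-bit to a 64-bit integer."),
        List.of());
  }
}
```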

The file locations provided by the user are collected through a combination of pre-existing static tools and human input. Our migration toolkit automatically expands this set with additional relevant files that can include: test files, interface files, and other dependencies. This step is not yet AI-driven, but uses symbol cross-reference information.
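A sketch of this expansion step, with `XrefIndex` as a hypothetical stand-in for the symbol cross-reference data (it is not Kythe's actual API):

```java
// Expand the user-provided seed files with related files found via
// symbol cross-references: tests, interfaces, and other dependencies.
import java.util.HashSet;
import java.util.Set;

interface XrefIndex {
  Set<String> testsFor(String file);        // e.g., the tests covering a file
  Set<String> interfacesOf(String file);    // interface/header files it implements
  Set<String> dependenciesOf(String file);  // other files referencing its symbols
}

class FileSetExpander {
  static Set<String> expand(Set<String> seeds, XrefIndex index) {
    Set<String> expanded = new HashSet<>(seeds);
    for (String file : seeds) {
      expanded.addAll(index.testsFor(file));
      expanded.addAll(index.interfacesOf(file));
      expanded.addAll(index.dependenciesOf(file));
    }
    return expanded;
  }
}
```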

In many cases, the set of files to migrate provided by the user is not perfect. Because filtering the input list can be onerous, it’s not unusual for some files to have already been partially or completely migrated. Thus, to avoid redundant changes or confusing the model during edit generation, we provide the model with few-shot examples and ask it to predict whether a file needs to be migrated.
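A minimal sketch of how such a few-shot check might be assembled; the prompt layout and the YES/NO convention are assumptions for illustration:

```java
// Build a few-shot prompt asking the model whether a file still needs migration.
import java.util.List;

class MigrationFilter {
  static String buildPrompt(List<String> fewShotExamples, String fileContent) {
    StringBuilder prompt = new StringBuilder(
        "Decide whether the file below still needs the migration. Answer YES or NO.\n\n");
    for (String example : fewShotExamples) {
      prompt.append(example).append("\n\n");  // each example pairs a file snippet with a label
    }
    return prompt.append("File:\n").append(fileContent).append("\nAnswer: ").toString();
  }
}
```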

The edit generation and validation step is where we have found the most benefit from an automated system. Our model was trained following the DIDACT methodology on data from Google’s monorepo and processes. At inference time, we annotate each line where we expect a change with a natural language instruction, and we also give the model a general instruction. In each model query, the input context can contain one or multiple files related to each other.
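Purely as an illustration, an annotated input file might look something like the following; the actual annotation format used with the DIDACT-trained model is internal and is not shown in this post:

```java
// GENERAL INSTRUCTION (assumed format): Migrate campaign IDs from
// 32-bit to 64-bit integers.
class CampaignStore {
  // CHANGE HERE (assumed format): use a 64-bit type for this field.
  private Integer campaignId;

  CampaignStore(Integer campaignId) {
    this.campaignId = campaignId;
  }
}
```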

The model then predicts differences between the files (diffs) where changes are needed and can also change related sections so that the final code is correct.

This last capability is critical to increasing migration velocity: the generated changes might not align with the initially requested locations, but they still fulfill the intent of the migration. This reduces the need to manually find the full set of lines where changes are needed, and it is a big step forward compared to purely deterministic change generation based on abstract syntax tree modifications.


In one example, we prompted the model to update only the constructor of a class where the type had to change. In the predicted unified diff, the model correctly also fixed the private field and the usages within the class.
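A hypothetical Java before/after illustrating this behavior (the class is invented for this post; the actual code is internal):

```java
// Before: only the constructor parameter was marked for migration.
class Campaign {
  private Integer campaignId;

  Campaign(Integer campaignId) {
    this.campaignId = campaignId;
  }

  Integer campaignId() {
    return campaignId;
  }
}
```

```java
// After applying the predicted diff: the model widens the constructor
// parameter and, unprompted, also fixes the field and the accessor so the
// class still compiles.
class Campaign {
  private Long campaignId;

  Campaign(Long campaignId) {
    this.campaignId = campaignId;
  }

  Long campaignId() {
    return campaignId;
  }
}
```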

Different combinations of prompts yield different results depending on the input context. In some cases, providing too many locations where one might expect a change results in worse performance than specifying the change in just one place in the file and prompting the model to apply it globally.

As we apply changes across dozens and potentially hundreds of files, we implement a mechanism that generates prompt combinations, which are tried in parallel for each file group. This is similar to a pass@k strategy, except that instead of varying the inference temperature, we vary the prompting strategy.
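A sketch of that fan-out, assuming a hypothetical `Model` interface as a stand-in for the fine-tuned model:

```java
// Try several prompt variants in parallel for one file group and collect the
// candidate diffs: pass@k style over prompts rather than temperature.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

interface Model {
  String predictDiff(String prompt, String fileGroup);
}

class CandidateGenerator {
  static List<String> generate(Model model, List<String> promptVariants, String fileGroup)
      throws InterruptedException, ExecutionException {
    ExecutorService pool = Executors.newFixedThreadPool(promptVariants.size());
    try {
      List<Future<String>> futures = new ArrayList<>();
      for (String prompt : promptVariants) {
        futures.add(pool.submit(() -> model.predictDiff(prompt, fileGroup)));
      }
      List<String> candidateDiffs = new ArrayList<>();
      for (Future<String> future : futures) {
        candidateDiffs.add(future.get());  // one candidate diff per prompt variant
      }
      return candidateDiffs;
    } finally {
      pool.shutdown();
    }
  }
}
```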

We validate the resulting changes automatically. The validations are configurable and often depend on the migration; the two most common are compiling the changed files and running their unit tests. Each failed validation step can optionally trigger an ML-powered “repair”. The model has also been trained on a large set of failed builds and tests, paired with the diffs that then fixed them. For each build or test failure we encounter, we prompt the model with the changed files, the build/test error, and a request for a fix. With this approach, we observe that in a significant number of cases the model is able to fix the code.
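The loop can be sketched as follows, with `Validator` and `RepairModel` as hypothetical stand-ins for the configurable validations and the ML-powered repair:

```java
// Validate a candidate diff; on failure, ask the model for a repair and retry.
interface Validator {
  ValidationResult check(String diff);  // e.g., compile changed files, run unit tests
}

record ValidationResult(boolean passed, String errorLog) {}

interface RepairModel {
  String repair(String diff, String errorLog);  // changed files + error + fix request
}

class ValidateAndRepair {
  // Returns a validated diff, or null once the repair budget is exhausted.
  static String run(String diff, Validator validator, RepairModel model, int maxAttempts) {
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      ValidationResult result = validator.check(diff);
      if (result.passed()) {
        return diff;
      }
      diff = model.repair(diff, result.errorLog());
    }
    return null;  // leave for human review
  }
}
```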

As we generate multiple changes for each file group, we score them based on the validations and at the end decide which set of changes to propagate back to the final change list (similar to a pull request in Git).
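One way such scoring might look; the criteria here (passing tests outranks merely compiling) are illustrative assumptions, as the actual scoring is not detailed in this post:

```java
// Score validated candidates and pick the one to propagate to the change list.
import java.util.List;

record ScoredCandidate(String diff, boolean compiles, boolean testsPass) {}

class CandidateSelector {
  static int score(ScoredCandidate candidate) {
    int score = 0;
    if (candidate.compiles()) score += 1;
    if (candidate.testsPass()) score += 2;  // passing tests outranks merely compiling
    return score;
  }

  static ScoredCandidate pickBest(List<ScoredCandidate> candidates) {
    ScoredCandidate best = null;
    for (ScoredCandidate candidate : candidates) {
      if (best == null || score(candidate) > score(best)) {
        best = candidate;
      }
    }
    return best;  // null if no candidates were generated
  }
}
```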

Case study: Migrating integers from 32-bit to 64-bit

As Google’s codebase and its products evolve, assumptions made in the past (sometimes over a decade ago) no longer hold. For example, Google Ads has dozens of numerical unique “ID” types used as handles — for users, merchants, campaigns, etc. — and these IDs were originally defined as 32-bit integers. But with the current growth in the number of IDs, they are on track to overflow 32-bit capacity much sooner than originally anticipated.
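A quick Java illustration of why this matters:

```java
// A 32-bit int silently wraps past 2^31 - 1, so IDs stop being usable there.
class OverflowDemo {
  public static void main(String[] args) {
    int maxId = Integer.MAX_VALUE;   // 2,147,483,647: the last usable 32-bit ID
    System.out.println(maxId + 1);   // wraps to -2,147,483,648
    long nextId = (long) maxId + 1;  // widening to 64 bits gives 2,147,483,648
    System.out.println(nextId);      // a long leaves room for ~9.2 * 10^18 IDs
  }
}
```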

This realization led to a significant effort to port these IDs to 64-bit integers. The project is difficult for multiple reasons:

  • There are tens of thousands of locations across thousands of files where these IDs are used.
  • Tracking the changes across all the involved teams would be very difficult if each team handled the migration of its own data independently.
  • The IDs are often defined as generic numbers (int32_t in C++ or Integer in Java) and are not of a unique, easily searchable type, which makes the process of finding them through static tooling non-trivial.
  • Changes in the class interfaces need to be taken into account across multiple files.
  • Tests need to be updated to verify that the 64-bit IDs are handled correctly.

The full effort, if done manually, was expected to require many software engineering years.

To accelerate the work, we employed our AI migration tooling and devised the following workflow:

  1. An expert engineer selects the ID they want to migrate and, using a combination of Code Search, Kythe, and custom scripts, identifies a (relatively tight) superset of files and locations to migrate.
  2. The migration toolkit runs autonomously and produces verified changes that only contain code that passes unit tests. Some tests are themselves updated to reflect the new reality.
  3. The engineer quickly checks the change and potentially updates files where the model failed or made a mistake. The changes are then sharded and sent to multiple reviewers who own the part of the codebase affected by the change.

Note that the IDs used in the internal code base already have appropriate privacy protections applied. While the model migrates them to a new type, it does not alter or surface them, so all privacy protections remain intact.

For this workstream, we found that 80% of the code modifications in the landed CLs were AI-authored; the rest were human-authored. The total time spent on the migration was reduced by an estimated 50%, as reported by the engineers doing the migration. There was a significant reduction in communication overhead, since a single engineer could generate all the necessary changes. Engineers still needed to spend time analyzing the files that needed changes and reviewing the generated changes. We found that in Java files our model predicted the need to edit a file with 91% accuracy.

The toolkit has already been used to create hundreds of change lists in this and other migrations. On average, more than 75% of the AI-generated character changes successfully land in the monorepo.

Future directions

The next step is addressing more complex migrations that impact multiple components exchanging data or requiring system architecture changes. We have already had success when migrating from deprecated types that require non-trivial refactorings, as well as moving away from older testing frameworks.

We are researching how to apply AI to other parts of the development journey, specifically to help with targeting the changes and better filtering out those that are unnecessary. Another interesting area is improving the migration user experience in the IDE, giving the change operator greater control and the freedom to mix and match the existing tooling.

Overall, we see wide potential application for this work, likely beyond the strict space of code migration, possibly extending to error correction and general code maintenance at scale.

Acknowledgements

This work is the result of a collaboration between the Google Core Developer, Google Ads and Google DeepMind teams. We would like to thank key contributors Daniele Codecasa, Anna Sjövall, Ayoub Kachkach, Celal Ziftci, Max Kim, Jonathan Binghan, Ballie Sandhu, and Christoph Grotz. We would also like to thank our colleagues Alexander Frömmgen, Lera Kharatyan, Maxim Tabachnyk, Shardul Natu, Bar Karapetov, Kashmira Phalak, Andrew Villadsen, Maia Deutsch, AK Kulkarni, Satish Chandra, Danny Tarlow, Aditya Kini, Marc Brockschmidt, Yurun Shen, Milad Hashemi, Chris Gorgolewski, Don Schwarz, Chris Kennely, Sarah Drasner, Niranjan Tulpule, Madhura Dudhgaonkar and the developers of the DIDACT effort.