Goran Petrovic

Goran Petrovic has been working at Google since 2012. His main focus areas are Mutation Testing and Engineering Productivity.

Authored Publications
    Modern code review is a process in which incremental code contributions made by one software developer are reviewed by one or more peers before they are committed to the version control system. An important element of modern code review is verifying that the code under review adheres to style guidelines and best practices of the corresponding programming language. Some of these rules are universal and can be checked automatically or enforced via code formatters. Other rules, however, are context-dependent and the corresponding checks are commonly left to developers who are experts in the given programming language and whose time is expensive. Many automated systems have been developed that attempt to detect various rule violations without any human intervention. Historically, such systems implement targeted analyses and were themselves expensive to develop. This paper presents AutoCommenter, a system that uses a state-of-the-art large language model to automatically learn and enforce programming language best practices. We implemented AutoCommenter for four programming languages: C++, Java, Python and Go. We evaluated its performance and adoption in a large industrial setting. Our evaluation shows that a model that automatically learns language best practices is feasible and has a measurable positive impact on the developer workflow. Additionally, we present the challenges we faced when deploying such a model to tens of thousands of developers and provide lessons we learned for any practitioners who would like to replicate the work or build on top of it.
    Productive Coverage: Improving the Actionability of Code Coverage
    Gordon
    Luka Kalinovcic
    Mateusz Lewko
    Rene Just
    Yana Kulizhskaya
    ICSE-SEIP '24: Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice (2024) (to appear)
    Code coverage is an intuitive and established test adequacy measure. However, not all parts of the code base are equally important, and hence additional testing may be critical for some uncovered code, whereas it may not be worthwhile for other uncovered code. As a result, simply visualizing uncovered code is not reliably actionable. To make code coverage actionable and further improve code coverage in our codebase, we developed Productive Coverage, a novel approach to code coverage that guides developers to uncovered code that should be tested by (unit) tests. Specifically, Productive Coverage identifies uncovered code that is similar to existing code, which in turn is tested and/or frequently executed in production. We implemented and evaluated Productive Coverage for four programming languages (C++, Java, Go, and Python). The evaluation shows: (1) The developer sentiment, measured at the point of use, is strongly positive; (2) Productive Coverage meaningfully increases code coverage above a strong baseline; (3) Productive Coverage has no negative effect on code authoring efficiency; (4) Productive Coverage modestly improves code-review efficiency; (5) Productive Coverage directly improves code quality and prevents bugs from being introduced, in addition to improving test quality.
    Please fix this mutant: How do developers resolve mutants surfaced during code review?
    Gordon Fraser
    René Just
    2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE)
    This paper studies the effects of surfacing undetected mutants during code review. Based on a dataset of 633 merge requests and 78,000 mutants, it answers three research questions around the change in mutant location over the course of a merge request, how often mutants are resolved during code review, and the observed changes after mutant intervention. The results show that (1) for 64% of mutants, the mutated code changes as the merge request evolves; (2) overall, 38% of all mutants and 60% of productive mutants are resolved via code changes or test additions; (3) unresolved productive mutants stem from developers questioning the value of adding tests for surfaced mutants, mutants being later resolved in deferred code changes (atomicity of merge requests), and false positives (mutants being resolved by tests not considered in the experiment infrastructure); (4) resolved productive mutants are associated with more test and code changes, compared to unproductive mutants.
    MuRS: Suppressing and Ranking Mutants with Identifier Templates
    Malgorzata (Gosia) Salawa
    René Just
    Zimin Chen
    ESEC/FSE 2023: Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (2023), 1798–1808
    Diff-based mutation testing is a mutation testing approach that only generates mutants in changed lines. At Google we execute more than 150 000 000 tests and submit more than 40 000 commits per day. We have successfully integrated mutation testing into our code review process. Over the years, we have continuously gathered developer feedback on the surfaced mutants and measured the negative feedback rate. To enhance the developer experience, we manually implemented a large number of static rules, which are used to suppress certain mutants. In this paper, we propose MuRS, an automatic tool that finds patterns in the source code under test, and uses these patterns to rank and suppress future mutants based on historical performance of similar mutants. Because MuRS learns mutant suppression rules fully automatically, it significantly reduces the build and maintenance cost of the mutation testing system. To evaluate the effectiveness of MuRS, we conducted an A/B experiment, where mutants in the experiment group were ranked and suppressed by MuRS, and mutants in the control group were randomly shuffled. The experiment showed a statistically significant negative feedback rate of 11.45% in the experiment group versus 12.41% in the control group. Furthermore, we found that statement removal mutants received both the most positive and the most negative developer feedback, suggesting a need for further investigation to identify valuable statement removal mutants.
    Practical Mutation Testing at Scale: A view from Google
    Gordon Fraser
    René Just
    IEEE Transactions on Software Engineering (2021)
    Mutation analysis assesses a test suite's adequacy by measuring its ability to detect small artificial faults, systematically seeded into the tested program. Mutation analysis is considered one of the strongest test-adequacy criteria. Mutation testing builds on top of mutation analysis and is a testing technique that uses mutants as test goals to create or improve a test suite. Mutation testing has long been considered intractable because the sheer number of mutants that can be created represents an insurmountable problem both in terms of human and computational effort. This has hindered the adoption of mutation testing as an industry standard. For example, Google has a codebase of two billion lines of code and more than 150,000,000 tests are executed on a daily basis. The traditional approach to mutation testing does not scale to such an environment; even existing solutions to speed up mutation analysis are insufficient to make it computationally feasible at such a scale. To address these challenges, this paper presents a scalable approach to mutation testing based on the following main ideas: (1) mutation testing is done incrementally, mutating only changed code during code review, rather than the entire code base; (2) mutants are filtered, removing mutants that are likely to be irrelevant to developers, and limiting the number of mutants per line and per code review process; (3) mutants are selected based on the historical performance of mutation operators, further eliminating irrelevant mutants and improving mutant quality. This paper empirically validates the proposed approach by analyzing its effectiveness in a code-review-based setting, used by more than 24,000 developers on more than 1,000 projects. The results show that the proposed approach produces orders of magnitude fewer mutants and that context-based mutant filtering and selection improve mutant quality and actionability. Overall, the proposed approach represents a mutation testing framework that seamlessly integrates into the software development workflow and is applicable to industrial settings of any size.
    Long Term Effects of Mutation Testing
    Gordon Fraser
    René Just
    2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), IEEE, pp. 910-921
    Various proxy metrics for test quality have been defined in order to guide developers when writing tests. Code coverage is particularly well established in practice, even though the question of how coverage relates to test quality is a matter of ongoing debate. Mutation testing offers a promising alternative: Artificial defects can identify holes in a test suite, and thus provide concrete suggestions for additional tests. Despite the obvious advantages of mutation testing, it is not yet well established in practice. Until recently, mutation testing tools and techniques simply did not scale to complex systems. Although they now do scale, a remaining obstacle is lack of evidence that writing tests for mutants actually improves test quality. In this paper, we fill this gap. We analyze a large dataset of 15 million mutants and investigate how the mutants influenced developers over time, and how the mutants relate to real faults. Our analyses suggest that developers using mutation testing write more tests, and actively improve their test suites with high-quality tests such that fewer mutants remain. By analyzing a dataset of historic fixes of real faults we further provide evidence that mutants are indeed coupled with real faults. In other words, had mutation testing been used for the changes introducing the faults, it would have reported a live mutant that could have prevented the bug.
    Code coverage at Google
    René Just
    Gordon Fraser
    Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ACM, pp. 955-963
    Code coverage is a measure of the degree to which a test suite exercises a software system. Although coverage is well established in software engineering research, deployment in industry is often inhibited by the perceived usefulness and the computational costs of analyzing coverage at scale. At Google, coverage information is computed for one billion lines of code daily, for seven programming languages. A key aspect of making coverage information actionable is to apply it at the level of changesets and code review. This paper describes Google’s code coverage infrastructure and how the computed code coverage information is visualized and used. It also describes the challenges and solutions for adopting code coverage at scale. To study how code coverage is adopted and perceived by developers, this paper analyzes adoption rates, error rates, and average code coverage ratios over a five-year period, and it reports on 512 responses, received from surveying 3000 developers. Finally, this paper provides concrete suggestions for how to implement and use code coverage in an industrial setting.
    An Industrial Application of Mutation Testing: Lessons, Challenges, and Research Directions
    Robert Kurtz
    Paul Ammann
    René Just
    Proceedings of the 13th International Workshop on Mutation Analysis (Mutation 2018)
    Mutation analysis evaluates a testing or debugging technique by measuring how well it detects mutants, which are systematically seeded, artificial faults. Mutation analysis is inherently expensive due to the large number of mutants it generates and due to the fact that many of these generated mutants are not effective; they are redundant, equivalent, or simply uninteresting and waste computational resources. A large body of research has focused on improving the scalability of mutation analysis and proposed numerous optimizations to, e.g., select effective mutants or efficiently execute a large number of tests against a large number of mutants. However, comparatively little research has focused on the costs and benefits of mutation testing, in which mutants are presented as testing goals to a developer, in the context of an industrial-scale software development process. This paper aims to fill that gap. Specifically, it first reports on a case study from an open source context, which quantifies the costs of achieving a mutation-adequate test set. The results suggest that achieving mutation adequacy is neither practical nor desirable. This paper then draws on an industrial application of mutation testing, involving more than 30,000 developers and 1,890,442 change sets, written in 4 programming languages. It shows that mutation testing does not add a significant overhead to the software development process and reports on mutation testing benefits perceived by developers. Finally, this paper describes lessons learned from these studies, highlights the current challenges of efficiently and effectively applying mutation testing in an industrial-scale software development process, and outlines research directions.
    State of Mutation Testing at Google
    Proceedings of the 40th International Conference on Software Engineering 2017 (SEIP) (2018) (to appear)
    Mutation testing assesses test suite efficacy by inserting small faults into programs and measuring the ability of the test suite to detect them. It is widely considered the strongest test criterion in terms of finding the most faults and it subsumes a number of other coverage criteria. Traditional mutation analysis is computationally prohibitive, which hinders its adoption as an industry standard. In order to alleviate the computational issues, we present a diff-based probabilistic approach to mutation analysis that drastically reduces the number of mutants by omitting lines of code without statement coverage and lines that are determined to be uninteresting, which we dub arid lines. Furthermore, by reducing the number of mutants and carefully selecting only the most interesting ones we make it easier for humans to understand and evaluate the result of mutation analysis. We propose a heuristic for judging whether a node is arid or not, conditioned on the programming language. We focus on a code-review-based approach and consider the effects of surfacing mutation results on developer attention. The described system is used by 6,000 engineers at Google on all code changes they author or review, affecting in total more than 14,000 code authors as part of the mandatory code review process. The system processes about 30% of all diffs across Google that have statement coverage calculated.