Goran Petrovic
Goran Petrovic has been working at Google since 2012. His main focus areas are Mutation Testing and Engineering Productivity.
Research Areas
Authored Publications
AI-assisted Assessment of Coding Practices in Industrial Code Review
Ivan Budiselic
Malgorzata (Gosia) Salawa
Juanjo Carin
Jovan Andonov
Mateusz Lewko
René Just
Modern code review is a process in which incremental code contributions made by one software developer are reviewed by one or more peers before they are committed to the version control system. An important element of modern code review is verifying that the code under review adheres to the style guidelines and best practices of the corresponding programming language. Some of these rules are universal and can be checked automatically or enforced via code formatters. Other rules, however, are context-dependent, and the corresponding checks are commonly left to developers who are experts in the given programming language and whose time is expensive. Many automated systems have been developed that attempt to detect various rule violations without any human intervention. Historically, such systems implement targeted analyses and were themselves expensive to develop. This paper presents AutoCommenter, a system that uses a state-of-the-art large language model to automatically learn and enforce programming language best practices. We implemented AutoCommenter for four programming languages: C++, Java, Python, and Go. We evaluated its performance and adoption in a large industrial setting. Our evaluation shows that a model that automatically learns language best practices is feasible and has a measurable positive impact on the developer workflow. Additionally, we present the challenges we faced when deploying such a model to tens of thousands of developers and provide lessons we learned for practitioners who would like to replicate this work or build on top of it.
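The abstract describes a system that surfaces model-generated best-practice comments during code review. A minimal sketch of one piece of that workflow, assuming a unified-diff input and a stubbed comment source (the paper's system uses a large language model; the function and field names here are illustrative, not AutoCommenter's API):

```python
# Sketch: anchor best-practice review comments to the lines a diff actually
# changes, so reviewers only see comments about the code under review.
import re

def changed_lines(unified_diff: str) -> set[int]:
    """Return new-file line numbers added by a unified diff."""
    lines = set()
    new_line = 0
    for raw in unified_diff.splitlines():
        header = re.match(r"@@ -\d+(?:,\d+)? \+(\d+)", raw)
        if header:
            new_line = int(header.group(1))       # start of the new-file hunk
        elif raw.startswith("+") and not raw.startswith("+++"):
            lines.add(new_line)                   # an added/modified line
            new_line += 1
        elif not raw.startswith("-"):
            new_line += 1                         # context line
    return lines

def surface_comments(comments, diff):
    """Keep only comments that land on lines touched by the change."""
    touched = changed_lines(diff)
    return [c for c in comments if c["line"] in touched]

diff = """@@ -1,2 +1,3 @@
 def f(xs):
+    l = [x for x in xs]
 return xs
"""
comments = [
    {"line": 2, "text": "Avoid single-letter names like 'l'."},  # changed line
    {"line": 3, "text": "Consider returning a copy."},           # untouched line
]
print(surface_comments(comments, diff))  # keeps only the comment on line 2
```

Restricting comments to changed lines mirrors the paper's code-review setting: suggestions about untouched code would be noise for the reviewer.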
Productive Coverage: Improving the Actionability of Code Coverage
Gordon
Luka Kalinovcic
Mateusz Lewko
René Just
Yana Kulizhskaya
ICSE-SEIP '24: Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice (2024) (to appear)
Code coverage is an intuitive and established test adequacy measure. However, not all parts of the code base are equally important, and hence additional testing may be critical for some uncovered code, whereas it may not be worthwhile for other uncovered code. As a result, simply visualizing uncovered code is not reliably actionable. To make code coverage actionable and further improve code coverage in our codebase, we developed Productive Coverage, a novel approach to code coverage that guides developers to uncovered code that should be tested by (unit) tests. Specifically, Productive Coverage identifies uncovered code that is similar to existing code, which in turn is tested and/or frequently executed in production. We implemented and evaluated Productive Coverage for four programming languages (C++, Java, Go, and Python). The evaluation shows: (1) the developer sentiment, measured at the point of use, is strongly positive; (2) Productive Coverage meaningfully increases code coverage above a strong baseline; (3) Productive Coverage has no negative effect on code authoring efficiency; (4) Productive Coverage modestly improves code-review efficiency; (5) Productive Coverage directly improves code quality and prevents bugs from being introduced, in addition to improving test quality.
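The core idea, guiding developers to uncovered code that resembles tested or production-hot code, can be sketched as a similarity ranking. This toy version (not the paper's implementation) uses Jaccard similarity over identifiers; all function names and sources are made up for illustration:

```python
# Sketch: rank uncovered functions by how similar they are to functions
# that are already tested or frequently executed in production.
import re

def tokens(source: str) -> set[str]:
    """Identifier-level tokenization of a source snippet."""
    return set(re.findall(r"[A-Za-z_]\w*", source))

def similarity(a: str, b: str) -> float:
    """Jaccard similarity between two snippets' token sets."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def rank_uncovered(uncovered: dict, reference: dict) -> list:
    """Sort uncovered functions by best similarity to any tested/hot one."""
    scored = {
        name: max(similarity(src, ref) for ref in reference.values())
        for name, src in uncovered.items()
    }
    return sorted(scored, key=scored.get, reverse=True)

tested = {"parse_user": "def parse_user(row): return User(row['id'], row['name'])"}
uncovered = {
    "parse_admin": "def parse_admin(row): return Admin(row['id'], row['name'])",
    "render_css": "def render_css(theme): return theme.to_css()",
}
print(rank_uncovered(uncovered, tested))  # parse_admin ranks first
```

A real system would use richer code representations than token overlap, but the ranking structure, scoring each uncovered region against a reference set of tested or hot code, is the actionable part.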
Please fix this mutant: How do developers resolve mutants surfaced during code review?
Gordon Fraser
René Just
2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE)
This paper studies the effects of surfacing undetected mutants during code review. Based on a dataset of 633 merge requests and 78,000 mutants, it answers three research questions around the change in mutant location over the course of a merge request, how often mutants are resolved during code review, and the observed changes after mutant intervention. The results show that (1) for 64% of mutants, the mutated code changes as the merge request evolves; (2) overall, 38% of all mutants and 60% of productive mutants are resolved via code changes or test additions; (3) unresolved productive mutants stem from developers questioning the value of adding tests for surfaced mutants, mutants being later resolved in deferred code changes (atomicity of merge requests), and false positives (mutants being resolved by tests not considered in the experiment infrastructure); (4) resolved productive mutants are associated with more test and code changes, compared to unproductive mutants.
MuRS: Suppressing and Ranking Mutants with Identifier Templates
Malgorzata (Gosia) Salawa
René Just
Zimin Chen
ESEC/FSE 2023: Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (2023), 1798–1808
Diff-based mutation testing is a mutation testing approach that only generates mutants in changed lines. At Google, we execute more than 150,000,000 tests and submit more than 40,000 commits per day. We have successfully integrated mutation testing into our code review process. Over the years, we have continuously gathered developer feedback on the surfaced mutants and measured the negative feedback rate. To enhance the developer experience, we manually implemented a large number of static rules, which are used to suppress certain mutants. In this paper, we propose MuRS, an automatic tool that finds patterns in the source code under test, and uses these patterns to rank and suppress future mutants based on the historical performance of similar mutants. Because MuRS learns mutant suppression rules fully automatically, it significantly reduces the build and maintenance cost of the mutation testing system. To evaluate the effectiveness of MuRS, we conducted an A/B experiment, where mutants in the experiment group were ranked and suppressed by MuRS, and mutants in the control group were randomly shuffled. The experiment showed a statistically significant difference in negative feedback rate: 11.45% in the experiment group versus 12.41% in the control group. Furthermore, we found that statement removal mutants received both the most positive and the most negative developer feedback, suggesting a need for further investigation to identify valuable statement removal mutants.
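The idea of learning suppression rules from historical feedback can be sketched in a few lines. This is a simplified illustration, not the paper's algorithm: identifiers and literals are normalized into a template, and templates whose past negative-feedback rate is too high get suppressed (the class name and threshold are assumptions):

```python
# Sketch: suppress future mutants whose identifier template has historically
# drawn mostly negative developer feedback.
import re
from collections import defaultdict

def template(mutated_line: str) -> str:
    """Replace identifiers and numeric literals with placeholders."""
    t = re.sub(r"[A-Za-z_]\w*", "<ID>", mutated_line)   # identifiers first
    t = re.sub(r"\d+", "<NUM>", t)                      # then remaining numbers
    return re.sub(r"\s+", " ", t).strip()

class MutantRanker:
    def __init__(self, threshold=0.5):
        self.stats = defaultdict(lambda: [0, 0])  # template -> [negative, total]
        self.threshold = threshold

    def record(self, line, negative):
        s = self.stats[template(line)]
        s[0] += int(negative)
        s[1] += 1

    def suppressed(self, line):
        neg, total = self.stats[template(line)]
        return total > 0 and neg / total > self.threshold

ranker = MutantRanker()
for _ in range(3):
    ranker.record("sleep(100)", negative=True)     # repeatedly downvoted
ranker.record("total = total + 1", negative=False)  # found useful
print(ranker.suppressed("sleep(42)"))   # same template as sleep(100)
```

Because the rules are derived from feedback rather than hand-written, new uninteresting mutant patterns get suppressed without anyone maintaining a static rule list, which is the maintenance saving the abstract describes.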
Practical Mutation Testing at Scale: A view from Google
Gordon Fraser
René Just
IEEE Transactions on Software Engineering (2021)
Mutation analysis assesses a test suite's adequacy by measuring its ability to detect small artificial faults, systematically seeded into the tested program. Mutation analysis is considered one of the strongest test-adequacy criteria. Mutation testing builds on top of mutation analysis and is a testing technique that uses mutants as test goals to create or improve a test suite. Mutation testing has long been considered intractable because the sheer number of mutants that can be created represents an insurmountable problem, both in terms of human and computational effort. This has hindered the adoption of mutation testing as an industry standard. For example, Google has a codebase of two billion lines of code and more than 150,000,000 tests are executed on a daily basis. The traditional approach to mutation testing does not scale to such an environment; even existing solutions to speed up mutation analysis are insufficient to make it computationally feasible at such a scale. To address these challenges, this paper presents a scalable approach to mutation testing based on the following main ideas: (1) mutation testing is done incrementally, mutating only changed code during code review, rather than the entire code base; (2) mutants are filtered, removing mutants that are likely to be irrelevant to developers, and limiting the number of mutants per line and per code review process; (3) mutants are selected based on the historical performance of mutation operators, further eliminating irrelevant mutants and improving mutant quality. This paper empirically validates the proposed approach by analyzing its effectiveness in a code-review-based setting, used by more than 24,000 developers on more than 1,000 projects. The results show that the proposed approach produces orders of magnitude fewer mutants and that context-based mutant filtering and selection improve mutant quality and actionability. Overall, the proposed approach represents a mutation testing framework that seamlessly integrates into the software development workflow and is applicable to industrial settings of any size.
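The three main ideas above can be condensed into a small sketch. The operator names, scores, and applicability checks below are illustrative assumptions, not the paper's actual operator set or selection model:

```python
# Sketch: mutate only changed lines, cap mutants per line, and prefer
# operators with the best historical record of producing useful mutants.

# Hypothetical historical productivity scores per mutation operator.
OPERATOR_SCORE = {"negate_conditional": 0.8, "delete_statement": 0.6, "off_by_one": 0.3}

def applicable(op, line):
    # Toy applicability checks; a real tool inspects the AST.
    return {
        "negate_conditional": "if " in line,
        "delete_statement": True,
        "off_by_one": any(ch.isdigit() for ch in line),
    }[op]

def select_mutants(changed, per_line_limit=2):
    """changed: {line_number: source_line}; returns (line, operator) pairs."""
    mutants = []
    ranked_ops = sorted(OPERATOR_SCORE, key=OPERATOR_SCORE.get, reverse=True)
    for lineno, src in sorted(changed.items()):
        ops = [op for op in ranked_ops if applicable(op, src)]
        mutants.extend((lineno, op) for op in ops[:per_line_limit])
    return mutants

changed = {10: "if x > 0:", 11: "    total += 1"}
print(select_mutants(changed))
```

The per-line cap and operator ranking are what keep the mutant count orders of magnitude below exhaustive mutation while biasing toward mutants developers have historically acted on.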
Long Term Effects of Mutation Testing
Gordon Fraser
René Just
2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), IEEE, pp. 910-921
Various proxy metrics for test quality have been defined in order to guide developers when writing tests. Code coverage is particularly well established in practice, even though the question of how coverage relates to test quality is a matter of ongoing debate. Mutation testing offers a promising alternative: artificial defects can identify holes in a test suite, and thus provide concrete suggestions for additional tests. Despite the obvious advantages of mutation testing, it is not yet well established in practice. Until recently, mutation testing tools and techniques simply did not scale to complex systems. Although they now do scale, a remaining obstacle is lack of evidence that writing tests for mutants actually improves test quality. In this paper, we fill this gap. We analyze a large dataset of 15 million mutants and investigate how the mutants influenced developers over time, and how the mutants relate to real faults. Our analyses suggest that developers using mutation testing write more tests, and actively improve their test suites with high-quality tests such that fewer mutants remain. By analyzing a dataset of historic fixes of real faults we further provide evidence that mutants are indeed coupled with real faults. In other words, had mutation testing been used for the changes introducing the faults, it would have reported a live mutant that could have prevented the bug.
Code coverage at Google
René Just
Gordon Fraser
Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ACM, pp. 955-963
Code coverage is a measure of the degree to which a test suite exercises a software system. Although coverage is well established in software engineering research, deployment in industry is often inhibited by the perceived usefulness and the computational costs of analyzing coverage at scale. At Google, coverage information is computed for one billion lines of code daily, for seven programming languages. A key aspect of making coverage information actionable is to apply it at the level of changesets and code review. This paper describes Google's code coverage infrastructure and how the computed code coverage information is visualized and used. It also describes the challenges and solutions for adopting code coverage at scale. To study how code coverage is adopted and perceived by developers, this paper analyzes adoption rates, error rates, and average code coverage ratios over a five-year period, and it reports on 512 responses received from surveying 3,000 developers. Finally, this paper provides concrete suggestions for how to implement and use code coverage in an industrial setting.
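Applying coverage "at the level of changesets" comes down to a simple ratio. A minimal sketch, with made-up line numbers (this is the general idea, not Google's infrastructure):

```python
# Sketch: changeset-level coverage = fraction of changed, instrumentable
# lines that tests executed. This is what a reviewer sees for a change,
# rather than whole-repository coverage.

def changeset_coverage(changed_lines, instrumented, covered):
    """All arguments are sets of line numbers in the new file version."""
    relevant = changed_lines & instrumented   # comments/blank lines drop out
    if not relevant:
        return None                           # nothing measurable changed
    return len(relevant & covered) / len(relevant)

changed = {10, 11, 12, 13}
instrumented = {10, 11, 13}        # line 12 is, say, a comment
covered = {10, 13, 40, 41}
print(changeset_coverage(changed, instrumented, covered))  # 2/3
```

Intersecting with the instrumented set first matters: counting non-executable lines as uncovered would understate coverage and erode developer trust in the number.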
An Industrial Application of Mutation Testing: Lessons, Challenges, and Research Directions
Robert Kurtz
Paul Ammann
René Just
Proceedings of the 13th International Workshop on Mutation Analysis (Mutation 2018)
Mutation analysis evaluates a testing or debugging technique by measuring how well it detects mutants, which are systematically seeded, artificial faults. Mutation analysis is inherently expensive due to the large number of mutants it generates and due to the fact that many of these generated mutants are not effective; they are redundant, equivalent, or simply uninteresting and waste computational resources. A large body of research has focused on improving the scalability of mutation analysis and proposed numerous optimizations to, e.g., select effective mutants or efficiently execute a large number of tests against a large number of mutants. However, comparatively little research has focused on the costs and benefits of mutation testing, in which mutants are presented as testing goals to a developer, in the context of an industrial-scale software development process. This paper aims to fill that gap. Specifically, it first reports on a case study from an open-source context, which quantifies the costs of achieving a mutation-adequate test set. The results suggest that achieving mutation adequacy is neither practical nor desirable. This paper then draws on an industrial application of mutation testing, involving more than 30,000 developers and 1,890,442 change sets, written in 4 programming languages. It shows that mutation testing does not add a significant overhead to the software development process and reports on mutation testing benefits perceived by developers. Finally, this paper describes lessons learned from these studies, highlights the current challenges of efficiently and effectively applying mutation testing in an industrial-scale software development process, and outlines research directions.
State of Mutation Testing at Google
Proceedings of the 40th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP) (2018) (to appear)
Mutation testing assesses test suite efficacy by inserting small faults into programs and measuring the ability of the test suite to detect them. It is widely considered the strongest test criterion in terms of finding the most faults, and it subsumes a number of other coverage criteria. Traditional mutation analysis is computationally prohibitive, which hinders its adoption as an industry standard. To alleviate the computational issues, we present a diff-based probabilistic approach to mutation analysis that drastically reduces the number of mutants by omitting lines of code without statement coverage and lines that are determined to be uninteresting; we dub these arid lines. Furthermore, by reducing the number of mutants and carefully selecting only the most interesting ones, we make it easier for humans to understand and evaluate the results of mutation analysis. We propose a heuristic for judging whether a node is arid or not, conditioned on the programming language. We focus on a code-review-based approach and consider the effects of surfacing mutation results on developer attention. The described system is used by 6,000 engineers at Google on all code changes they author or review, affecting in total more than 14,000 code authors as part of the mandatory code review process. The system processes about 30% of all diffs across Google that have statement coverage calculated.
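The arid-line heuristic can be illustrated with a toy filter. The patterns below are assumptions for the sketch, not the paper's actual heuristic, which operates on AST nodes and is conditioned on the programming language:

```python
# Sketch: skip ("arid") lines that either have no statement coverage or
# match boilerplate patterns such as logging, so only interesting lines
# are mutated and surfaced to reviewers.
import re

ARID_PATTERNS = [r"\blog(ging)?\.", r"^\s*print\(", r"^\s*pass\b"]

def is_arid(line: str, covered: bool) -> bool:
    if not covered:
        return True  # uncovered lines can never yield a detected mutant
    return any(re.search(p, line) for p in ARID_PATTERNS)

# line number -> (source, has statement coverage)
lines = {
    3: ("total += price", True),
    4: ("logging.info('added %s', price)", True),
    5: ("rollback()", False),   # never executed by tests
}
mutable = [n for n, (src, cov) in lines.items() if not is_arid(src, cov)]
print(mutable)  # only line 3 is worth mutating
```

Skipping uncovered lines is a correctness optimization (their mutants are trivially undetected), while skipping boilerplate protects developer attention, the resource the abstract emphasizes.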