Assessing Transition-based Test Selection Algorithms at Google
Abstract
Continuous Integration traditionally relies on testing
every code commit with all impacted tests. This practice requires
considerable computational resources and, at Google scale, results
in delayed test results and high operational costs. To
deal with this issue and provide fast feedback, test selection and
prioritization methods aim to execute, as soon as possible, the tests
that are most likely to reveal changes in test results. In this
paper we present a simulation framework to support the study
and evaluation, with real data, of such techniques. We propose
a test selection algorithm evaluation method and detail several
practical requirements that are often ignored by related work,
such as the detection of transitions, the collection and analysis of
data, and the handling of flaky tests. Based on this framework,
we design an experiment evaluating five candidate regression test
selection algorithms built on simple heuristics and inspired by
previous research; the evaluation technique itself is applicable
to any number of algorithms in future experiments. Our results
show that algorithms based on the recent (transition) execution
history do not perform as well as expected (given the previously
reported results) and that the test selection problem remains
largely open. We found that the best-performing algorithms are
based on the number of times a test has been triggered and
the number of distinct authors committing code that triggers
particular tests. More research is needed to close the
gap between the current approaches and the optimal solution.