Flake Aware Culprit Finding
Abstract
When a change introduces a bug into a large software repository, there is
often a delay between when the change is committed and when bug is detected.
This is true even when the bug causes an existing test to fail! These delays
are caused by resource constraints which prevent the organization from running
all of the tests on every change. Due to the delay, a Continuous Integration
system needs to locate buggy commits. Locating them is complicated by flaky
tests that pass and fail non-deterministically. The flaky tests introduce
noise into the CI system requiring costly reruns to determine if a failure was
caused by a bad code change or caused by non-deterministic test behavior. This
paper presents an algorithm, Flake Aware Culprit Finding, that locates
buggy commits more accurately than a traditional bisection search. The
algorithm is based on Bayesian inference and noisy binary search, utilizing
prior information about which changes are most likely to contain the bug. A
large scale empirical study was conducted at Google on 13,000+ test breakages.
The study evaluates the accuracy and cost of the new algorithm versus a
traditional deflaked bisection search.
often a delay between when the change is committed and when bug is detected.
This is true even when the bug causes an existing test to fail! These delays
are caused by resource constraints which prevent the organization from running
all of the tests on every change. Due to the delay, a Continuous Integration
system needs to locate buggy commits. Locating them is complicated by flaky
tests that pass and fail non-deterministically. The flaky tests introduce
noise into the CI system requiring costly reruns to determine if a failure was
caused by a bad code change or caused by non-deterministic test behavior. This
paper presents an algorithm, Flake Aware Culprit Finding, that locates
buggy commits more accurately than a traditional bisection search. The
algorithm is based on Bayesian inference and noisy binary search, utilizing
prior information about which changes are most likely to contain the bug. A
large scale empirical study was conducted at Google on 13,000+ test breakages.
The study evaluates the accuracy and cost of the new algorithm versus a
traditional deflaked bisection search.