Unit testing forms the bedrock of any Continuous Integration (CI) system. It warns software engineers of bugs in newly implemented code and regressions in existing code before changes are merged. This improves software reliability. It also improves overall developer productivity, as bugs are caught early in the software development lifecycle. Hence, building a stable and reliable testing system is often a key requirement for software development organizations.
Unfortunately, flaky unit tests undercut this requirement by definition. A unit test is considered flaky if it can return different results (pass or fail) on two executions without any underlying change to the source code. Flakiness can arise from program-level non-determinism (e.g., thread ordering and other concurrency issues) within the test code or the code being tested. Alternatively, it can arise from variability in the testing environment (e.g., the machine on which the test is executed, the set of tests that are executed concurrently, etc.). While the former requires fixing the code, the latter involves identifying the environmental factors that caused the non-determinism and addressing them to remove the flakiness. Both coding patterns and test infrastructure must therefore be designed to reduce the potential for flaky tests to arise.
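The definition above can be made concrete with a minimal Python sketch that reruns a test against unchanged code and checks whether the outcomes differ. The names `classify_test` and `make_simulated_flaky_test` are hypothetical and introduced only for illustration; the simulated test fails on every third call as a deterministic stand-in for real non-determinism such as thread ordering or a shared machine. This is not a description of any particular CI system.

```python
from typing import Callable


def classify_test(test: Callable[[], None], runs: int = 20) -> str:
    """Run `test` repeatedly with no code changes and classify its outcomes.

    Returns "stable-pass", "stable-fail", or "flaky" (both outcomes observed).
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")
    outcomes = set()
    for _ in range(runs):
        try:
            test()
            outcomes.add("pass")
        except Exception:
            # Any exception raised by the test counts as a failed run.
            outcomes.add("fail")
    if len(outcomes) == 2:
        return "flaky"
    return "stable-pass" if "pass" in outcomes else "stable-fail"


def make_simulated_flaky_test() -> Callable[[], None]:
    """Build a test that fails on every third call (illustrative assumption)."""
    calls = {"count": 0}

    def test() -> None:
        calls["count"] += 1
        # Hidden state decides the outcome, not the code under test.
        assert calls["count"] % 3 != 0, "simulated environment-dependent failure"

    return test


if __name__ == "__main__":
    print(classify_test(make_simulated_flaky_test()))
```

A finite number of reruns can demonstrate that a test is flaky, but it cannot prove that a test is stable: a test that fails rarely may pass every rerun in the sample.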
Flaky tests affect developer productivity across multiple dimensions. First, when a test fails for extraneous reasons, the underlying issue has to be investigated, which can be time-consuming because the failure does not reproduce deterministically. In many cases, reproducing the failure locally may be impractical, as the error manifests only under specific test configurations and execution environments. Second, if the root cause of the flakiness cannot be identified, then the test has to be retried during CI until a successful run is observed and the accompanying code changes can be merged. Both the investigation and the retries waste critical development time, which motivates building infrastructural support for handling flaky unit tests.
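The retry cost described above can be illustrated with a small Python sketch of a retry-until-pass wrapper. The function name `run_with_retries` and the `max_attempts` parameter are hypothetical, chosen for illustration; real CI systems expose their own retry mechanisms and policies.

```python
from typing import Callable, Optional


def run_with_retries(test: Callable[[], None], max_attempts: int = 3) -> int:
    """Rerun `test` until it passes and return the 1-based passing attempt.

    Re-raises the last failure if all `max_attempts` attempts fail.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            test()
            return attempt  # first successful run; the change can proceed
        except Exception as error:
            # Every failed attempt costs a full test execution.
            last_error = error
    assert last_error is not None
    raise last_error
```

Under the simplifying assumption that each run passes independently with probability p, the expected number of runs until the first pass is 1/p, so a test that passes half the time doubles its average CI cost. Retrying also masks the root cause rather than removing it.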