
How Tracing Uncovers Half-truths in Slack’s CI Infrastructure

submitted by
Style Pass
2021-07-27 20:00:02

This article is a compilation of my recent conference talk at o11ycon, a conference put on by Honeycomb. It includes a blog post by Honeycomb author Eric Thompson that summarizes the talk. The post, reshared here, was originally published on Honeycomb’s blog.

Slack experienced meteoric growth between 2017 and 2020, but that growth came with growing pains. In his talk at the 2021 o11ycon+hnycon, Frank Chen, a Slack Senior Staff Engineer, detailed one of Slack’s biggest pain points in that period: flaky tests.

A flaky test is one that can both pass and fail across runs even though the code has not changed. At one point between 2017 and 2020, Slack’s flaky test rate reached as high as 50%. That much flakiness caused huge problems for the DevOps practice of continuous integration (CI), in which developers frequently integrate code into a central repository.
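To make the definition concrete, here is a minimal sketch of how flaky tests could be spotted in CI results. It is not Slack's tooling. The record shape (test name, commit SHA, pass/fail), the function names, and the way the rate is computed (the share of test-and-commit pairs that saw mixed outcomes) are all assumptions made for illustration. The talk summary does not say how Slack's 50% figure was measured.

```python
from collections import defaultdict


def find_flaky_tests(runs):
    """Return the (test_name, commit_sha) pairs that both passed and failed.

    `runs` is an iterable of (test_name, commit_sha, passed) tuples.
    This record shape is hypothetical, chosen only for the example.
    """
    outcomes = defaultdict(set)
    for test_name, commit_sha, passed in runs:
        outcomes[(test_name, commit_sha)].add(bool(passed))
    # Flaky: both a pass and a failure were seen for the same test on the same code.
    return {key for key, seen in outcomes.items() if seen == {True, False}}


def flaky_rate(runs):
    """Fraction of (test, commit) pairs with mixed outcomes (an assumed definition)."""
    runs = list(runs)
    groups = {(test_name, commit_sha) for test_name, commit_sha, _ in runs}
    if not groups:
        return 0.0
    return len(find_flaky_tests(runs)) / len(groups)


# Made-up sample data: test_login passes and fails on the same commit.
runs = [
    ("test_login", "abc123", True),
    ("test_login", "abc123", False),
    ("test_search", "abc123", True),
    ("test_search", "abc123", True),
]

print(find_flaky_tests(runs))
print(flaky_rate(runs))
```

Grouping by commit matters here: a test that fails on one commit and passes on the next may simply reflect a code change, while mixed results on the same commit point to flakiness.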

As a result, developers’ trust in tests was declining, developer velocity was slowing, and major incidents like a “large and cursed” Jenkins queue (as Frank described it) were starting to crop up.
