We’ve been working to bring components of Quip’s technology into Slack with the canvas feature, while also maintaining the stand-alone Quip product. Quip’s backend, which powers both Quip and canvas, is written in Python. This is the story of a tricky bug we encountered last July and the lessons we learned along the way about being careful with TCP state. We hope that showing you how we tackled our bug helps you avoid — or find — similar bugs in the future!

Our adventure began with a spike in EOFError errors during SQL queries. The errors were distributed across multiple services and multiple database hosts:
The stacktrace showed an asyncio.IncompleteReadError, which we translate to EOFError, raised while reading the response from the database:
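For a sense of where that translation lives, here is a minimal sketch of a read path that frames responses with a length prefix and converts the truncated read into EOFError. The framing and the `read_response` name are illustrative, not Quip’s actual protocol code:

```python
import asyncio
import struct


async def read_response(reader: asyncio.StreamReader) -> bytes:
    """Read one length-prefixed response frame from a database connection."""
    try:
        # readexactly() raises asyncio.IncompleteReadError if the peer closes
        # the connection before the requested number of bytes arrives.
        header = await reader.readexactly(4)
        (length,) = struct.unpack("!I", header)
        return await reader.readexactly(length)
    except asyncio.IncompleteReadError as exc:
        # Surface the truncated read as the EOFError we saw in the stacktrace.
        raise EOFError(
            f"connection closed mid-response after {len(exc.partial)} bytes"
        ) from exc
```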
There are a few places where we close our connection to the database, e.g. if we see one of a certain set of errors. One initial hypothesis was that another Python coroutine closed the connection for its own reasons, after we issued the query but before we read the response. We quickly discarded that theory since connection access is protected with an in-memory lock, which is acquired at a higher level in the code. With a local close ruled out, we reasoned that the other side of the connection must have closed on us.

This theory was bolstered when we found that our database proxies closed a bunch of connections at the time of the event. The closes were distributed independently of database host, which was consistent with the incident’s surface area. The proxy-close metric, ClientConnectionsClosed, represents closes initiated by either side, but we have a separate metric for when we initiate the close, and there was no increase in activity from us at the time. The timing here was extremely suspicious, although it still didn’t quite make sense why this should result in such a high rate of errors on the response read, since we had just verified the connection’s state.
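To make the first point concrete, connection use is serialized by an in-memory lock held across the whole query-and-read, so another coroutine cannot close or reuse the connection between our write and our read. A rough sketch of that shape, assuming hypothetical names (`LockedConnection` and friends are not the real code, and the real lock is acquired at a higher level):

```python
import asyncio


class LockedConnection:
    """Hypothetical sketch: connection access serialized by an in-memory lock."""

    def __init__(self, reader: asyncio.StreamReader, writer: asyncio.StreamWriter):
        self._reader = reader
        self._writer = writer
        self._lock = asyncio.Lock()  # acquired at a higher level in the real code

    async def query(self, request: bytes) -> bytes:
        # Held across the full round trip: no other coroutine can close or
        # reuse this connection between the write and the response read,
        # which is why the "another coroutine closed it" theory didn't hold.
        async with self._lock:
            self._writer.write(request)
            await self._writer.drain()
            header = await self._reader.readexactly(4)
            length = int.from_bytes(header, "big")
            return await self._reader.readexactly(length)

    async def close(self) -> None:
        # Closing also takes the lock, so it cannot interleave with a query.
        async with self._lock:
            self._writer.close()
            await self._writer.wait_closed()
```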