The stack overflow of death. How we lost DNS and what we're doing to prevent this in the future. - bunny.net Blog

2021-06-23 09:30:12

If there is one metric at bunny.net that we obsess about more than performance, it is reliability. We have redundant monitoring, auto-healing at multiple levels, three redundant DNS networks, and a system designed to tie all of this together and ensure your services stay online.

That being said, sometimes even all of this is not enough. After two years of almost stellar uptime, on the 22nd of June bunny.net experienced a 2+ hour, near system-wide outage caused by a DNS failure. In the blink of an eye, we lost over 60% of our traffic and wiped out hundreds of Gbit/s of throughput. Despite all of these systems being in place, a very simple update brought it all crumbling down, affecting over 750,000 websites.

To say we are disappointed would be an understatement, but we want to take this opportunity to learn, improve, and build an even more robust platform. In the spirit of transparency, we also want to share what happened and what we're doing to prevent it in the future. Perhaps it will even help other companies learn from our mistakes.

This is probably the usual story: it all started with a routine update. We are currently in the process of making massive reliability and performance improvements throughout the platform, and part of that was improving the performance of our SmartEdge routing system. SmartEdge leverages a large amount of data that is periodically synced to our DNS nodes. To do this, we take advantage of our Edge Storage platform, which is responsible for distributing the large database files around the world through Bunny CDN.