Some Volumes Were Slow And We Figured Out Why

submitted by
Style Pass
2024-04-19 23:00:07

Fly Volumes are fast. That sounds like a brag, but the truth is, we made tradeoffs to end up with fast Volumes. We back them with a pool of locally-attached NVMe drives, which means they’re pinned to specific physical servers, and while we do back them up, you generally want to be doing something at an upper layer to replicate them. They can lose data! But, the flip side is: they’re very fast.

So it was jarring, earlier this week, to get reports from folks experiencing what appeared to be I/O performance problems. We make it easy to spot I/O issues: you can just click out from our dashboard to Metrics, and look at the I/O Utilization percentage, which should be low.
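For a sense of what a utilization percentage like that usually measures, here is a minimal sketch of the calculation tools like `iostat` do for `%util` on Linux: sample the kernel's cumulative "time spent doing I/Os" counter from `/proc/diskstats` twice, and divide the difference by the wall-clock interval. This is an assumption about the general technique, not a description of how the Fly dashboard computes its number; the function names and the `nvme0n1` device are hypothetical.

```python
def io_ticks_ms(diskstats_text: str, device: str) -> int:
    """Return the cumulative 'time spent doing I/Os' counter (ms) for a device.

    Expects text in the format of Linux's /proc/diskstats.
    """
    for line in diskstats_text.splitlines():
        fields = line.split()
        # Layout: major, minor, device name, then the per-device counters.
        # The 13th field (index 12) is milliseconds spent doing I/O.
        if len(fields) >= 13 and fields[2] == device:
            return int(fields[12])
    raise KeyError(f"device {device!r} not found")


def utilization_percent(ticks_before: int, ticks_after: int, interval_ms: float) -> float:
    """Share of the interval during which the device had I/O in flight."""
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    busy_ms = ticks_after - ticks_before
    # Clamp: sampling jitter can push the ratio slightly past 100.
    return min(100.0, 100.0 * busy_ms / interval_ms)
```

On a Linux box you would read `/proc/diskstats` twice, a known interval apart, and feed both samples through these functions. A device that was busy for 500 ms of a 1,000 ms window comes out at 50%. One caveat worth knowing: on NVMe drives, which serve many requests in parallel, a high number means "something was always in flight," not necessarily "the drive is saturated."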

One tricky thing about doing infra ops for a public cloud is that every possible thing can go wrong. Our customers exercise our hardware in every conceivable way. A performance problem could be on our side, or it could be an app stuck in an expensive tight loop. We started digging, but didn’t see any patterns.

Then our metrics cluster started dragging. Well, we’re confident in the performance envelope of that system. We built it to scale. And we were seeing the Fly Machines running it grinding to a halt. Our digging gained some urgency.
