I have been building and architecting serverless systems for almost a decade now, for a variety of companies, from small start-ups to large enterprise

Navigating through failures, build resilient serverless systems

submited by

Style Pass

2024-05-05 10:30:04

I have been building and architecting serverless systems for almost a decade now, for a variety of companies, from small start-ups to large enterprises. And I have seen many strange things over the years.

Serverless services from AWS come with high availability and resiliency built in. But that is on the service level, AWS Lambda, StepFunctions, EventBridge, and so on are all highly available and have high resiliency.

If all components in our systems were serverless that would be great. However, that is not the case. In almost all systems that I have designed or worked on there has been components that are not serverless. It can be the need for a relational database, and yes, I do argue that not all data-models and search requirements can't be designed for DynamoDB. It can be that we need to integrate with a 3rd party and their API and connections come with quotas and throttling. It could be that we as developers are unaware of certain traits of AWS services that make our serverless services break. And even sometimes our users are even our worst enemy. When we are building our serverless systems we must always remember that there are components involved that don't scale as fast or to that degree that serverless components does.

In this post I'm going to give some thoughts on navigating failures and building a high resiliency serverless system. This post is not about preventing failures, instead it's about handling and recover from failures in a good way. I'll start with a common serverless architecture and add architecture concepts, that I often use, that can help enhance its resiliency.