This is a transcript of a talk given at ElixirDaze and CodeBEAMSF conferences in March of 2018, dealing with supervision trees and with the unexpected.
Among other things, I'm a systems architect, which means that my job is mostly coming up with broad plans, taking the credit if they go well, and otherwise blaming the developers if they go wrong. More seriously, a part of my job is helping make abstract plans about systems in such a way that they are understandable and leave room for developers to make decisions locally about what they mean, while structuring things so that the biggest minefields one might encounter when the rubber hits the road are taken care of.
This is kind of critical at Genetec because we’re designing security systems that are used in supermarkets, coffee shops, train stations, airports, or even city-wide systems. There’s something quite interesting about designing a system where, if there’s any significant failure, the military—with dudes carrying submachine guns—has to be called as a backup for what you do (to be fair, I'm not the one who designed these; those are stories I heard!) The systems are often deployed at customer sites, with no direct management access by their developers. So the challenge is in coming up with solutions that require little to no active maintenance—even though they must be maintainable—must suffer little to no downtime, and must make the right decisions when everything goes bad and the system is fully isolated for hours.
Few of these plans survive the whiteboard, and most of them have to be living documents adapted to whatever developers encounter as they work. I think this is because of one simple problem underneath it all: we don’t know everything. In fact, we don’t know much at all.