Suppose you have a large, A-tier team of data analysts, data scientists, data engineers, platform engineers, and ETL (extract, transform, and load) developers. All of them will eventually need to manipulate data, and they are all trying to accomplish different things using, presumably, the same collection of datasets.
These datasets might be in one central database. They might also be decentralized, spanning multiple databases of different kinds and varying sizes.
These datasets might be small, or they might be incredibly large, so large that it's ill-advised to yank them all down locally for transformation and reloading.
This army of data wranglers might have different approaches to accessing, manipulating, and storing their data. They might use a set of database connectors to pull in iterables, or they might execute a set of query strings to pull data into pandas DataFrames, or maybe they're using an internal tool that is an amalgamation of many external libraries.
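For concreteness, here is a minimal sketch of two of those access patterns side by side: a raw connector that yields iterables of tuples, and the same query string pulled into a pandas DataFrame. The in-memory SQLite table and its contents are hypothetical stand-ins for whatever backend your team actually uses.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory table standing in for one of many backends.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 9.99), (2, 24.50), (3, 5.00)],
)
conn.commit()

# Approach 1: a database connector that hands back iterables (tuples).
rows = conn.execute("SELECT id, amount FROM orders").fetchall()

# Approach 2: the same query string pulled into a pandas DataFrame.
df = pd.read_sql_query("SELECT id, amount FROM orders", conn)
```

Both approaches see the same three rows, but they produce different in-memory shapes, and downstream code written against one will not work against the other without a translation layer.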
No matter the location, size, or manipulation of the data, keeping your analytics codebase organized and efficient can be a house of cards: small changes in the back end or resource constriction in the front end can incur massive amounts of technical debt and bring your A-tier team to an F-tier sluggish crawl. Your team becomes more focused on cleaning up code and fixing errors, when you'd rather they spent more time on innovation and taking advantage of cutting-edge platform changes.