The Pedestrian Yet Subtle Art of CSV Ingestion

submitted by
Style Pass
2024-09-27 05:30:02

For personal reasons (moving country and a baby arriving) I have changed employer twice in quick succession recently. Moving through companies this rapidly sometimes lets you see certain recurring problems more clearly. One thing I noticed is that at all three of my recent employers there was at least one process that ingested CSV or Excel files provided by a third party via FTP or email.

Now I know, this is 2024 and ChatGPT can read my mind already, so this seems like a very unfashionable topic for a blog article. But having seen this situation multiple times, I noticed that it is a seemingly simple problem which is generally solved badly.

Whilst, like any developer, I’m excited by new technology, the real benefits of new technology will never reach society as a whole when most companies can’t get basic stuff right.

So, what’s the problem? The fundamental problem with this kind of setup is that CSV is a bad choice of format to act as an API between two systems. Because remember, if it’s the output of one system and the input to another system, that’s an API - it should be treated like one. In this scenario CSV is a bad format because, lacking types and a schema, it offers no way to define a formal set of expectations for the data being transferred, nor a method to verify that the data arriving is as expected.
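To make that concrete, here is a minimal sketch of what the receiving side ends up doing by hand: since the file itself carries no types or schema, the consumer has to write the expectations down and check every row against them. The column names (`order_id`, `amount`, `shipped_on`) and the `parse_csv` helper are hypothetical, made up purely for illustration, and only the Python standard library is used.

```python
import csv
import io
from datetime import date
from decimal import Decimal

# Hypothetical schema: column name -> parser that raises on bad input.
# CSV cannot express any of this, so it has to live in the consumer.
SCHEMA = {
    "order_id": int,
    "amount": Decimal,
    "shipped_on": date.fromisoformat,
}


def parse_csv(text):
    """Return (rows, errors): typed rows plus human-readable problems."""
    reader = csv.DictReader(io.StringIO(text))

    # Without a schema in the file, the header row is the only contract.
    if reader.fieldnames != list(SCHEMA):
        expected = list(SCHEMA)
        return [], [f"header mismatch: expected {expected}, got {reader.fieldnames}"]

    rows, errors = [], []
    for raw in reader:
        try:
            # Every CSV value arrives as a string; convert each explicitly.
            rows.append({col: parse(raw[col]) for col, parse in SCHEMA.items()})
        except (ValueError, ArithmeticError, TypeError) as exc:
            # reader.line_num is the physical line just consumed.
            errors.append(f"line {reader.line_num}: {exc}")
    return rows, errors
```

Called on a well-formed file, `parse_csv` returns typed rows and an empty error list; a row such as `x,9.99,2024-09-27` is rejected with an error that names the line. Collecting the errors rather than stopping at the first one is a design choice in this sketch, on the assumption that you would want to send the third party one complete report instead of a string of single-line complaints.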
