If you’ve worked with data for any length of time, you’ve come across the Comma-Separated Values (CSV) format. Its simplicity and ubiquity make CSV an extremely popular way for organizations to exchange data both internally and externally. While many programs can’t read or write Excel spreadsheets, almost anything can read and write CSVs, and a human can open a CSV file in any text editor and understand roughly what it contains.
Despite this ubiquity and ease of access, CSV is a wretched way to exchange data. The CSV format itself is notoriously inconsistent, with myriad competing and mutually exclusive dialects that often coexist within a single dataset (or, if you’re particularly unlucky, a single file). Exporting a dataset as a CSV robs it of a wealth of metadata — column types, encodings, locale conventions — that is very hard for the reader to reconstruct accurately, and many programs’ naïve CSV parsers consequently ignore the metadata reconstruction problem entirely. In practice, CSV’s human-readability is more of a liability than an asset.
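Both failure modes are easy to reproduce with nothing but Python’s standard-library `csv` module. The file contents and the `naive_coerce` helper below are invented for illustration — a minimal sketch of how dialect ambiguity and careless type inference each destroy information:

```python
import csv
import io

def naive_coerce(value):
    """Guess a column's type the way many careless CSV importers do:
    try int, then float, then give up and keep the string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value

# A European-style export: ";" as the field delimiter, "," as the
# decimal separator. Nothing in the file itself declares this.
european = 'zip;growth\n00501;1,5\n'

# Read with the default comma dialect, each line is one giant field:
print(next(csv.reader(io.StringIO(european))))
# ['zip;growth']

# Read with the delimiter the author intended, it parses cleanly —
# but only because we knew, out of band, which dialect to use:
row = next(csv.DictReader(io.StringIO(european), delimiter=";"))
print(row)
# {'zip': '00501', 'growth': '1,5'}

# Now apply naive type inference to the correctly split row:
coerced = {k: naive_coerce(v) for k, v in row.items()}
print(coerced)
# {'zip': 501, 'growth': '1,5'}
```

The ZIP code’s leading zeros are gone for good, and the growth figure — which a European reader would understand as 1.5 — survives only as an opaque string that a comma-expecting downstream step may mangle further. Neither mistake is detectable from the coerced output alone.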
CSVs often begin life as exported spreadsheets or table dumps from legacy databases, and often end life as a pile of undifferentiated files in a data lake, awaiting the restoration of their precious metadata so they can be organized and mined for insights. Much of the job of data preparation products is restoring the metadata lost when a spreadsheet is exported as a CSV. The world has lost a mind-boggling amount of time and information to CSV’s imprecision and under-specification.