Whether you have to do with data in form of CSV, JSON or a full-blooded programming language like C, JavaScript, Scala, or maybe a query language like

From String to AST: parsing

submited by
Style Pass
2024-11-22 13:00:08

Whether you have to do with data in form of CSV, JSON or a full-blooded programming language like C, JavaScript, Scala, or maybe a query language like SQL, you always transform some sequence of characters (or binary values) into a structured representation. Whatever you’ll do with that representation depends on your domain and business goals, and is quite often the core value of whatever you are doing. With a plethora of tools doing the parsing for us (including the error-handling), we might easily overlook how complex and interesting process it is.

First of all, most input formats that we handle follow some formal definition, telling e.g. how key-values are organized (JSON), how do you separate column names/values (CSV), how do you express projections and conditions (SQL). These rules are defined in an unambiguous way so that the input could be interpreted in a very deterministic way. It directly opposed language we use to communicate with other people, which often is ambiguous and put into the context. Wanna grab some burger? might be a nice suggestion if you are talking to a colleague that have to skip lunch and likes burgers, but might be offensive if told in a sarcastic tone to someone who doesn’t like meat. Then, words can have different meaning depending on the culture you are currently in, in which times you live or what are you and your conversationalist social position (vide, e.g. Japanese and how your position and suffixes you add at the end of the name change the tone of the whole conversation). Languages we use when communicating with a computer must be free of such uncertainties. The meaning should depend only on the input we explicitly entered and interpreted deterministically. (Just in case: by deterministically, I mean, deterministically interpreted, which doesn’t mean that it would always produce the same result. If I write currentTimeMillis(), the function will always return a different result, but the meaning will be always the same - compiler/interpreter will understand that I want to call currentTimeMillis() function, and it won’t suddenly decide that I want to e.g. change the compiler flag. Of course, the meaning of the function can change in time - for instance, if I edit the source code in between runs - and surely the value returned by it, which is bound to time).

Initially, it wasn’t known, how to parse languages. The reason, that we had to start with punching cards, sometime later moved on to assembly, and later on invent Fortran and Lisp, go through whole spaghetti code with Basic, get The case against goto statement by Dijkstra, until we could - slowly - started developing more sophisticated compilers we have today, was that there were no formal foundations to it.

Leave a Comment