
ANN: Tabular Asa (dataframes for Racket) : Racket

submitted by
Style Pass
2021-08-09 13:00:10

Tabular Asa is a column-oriented, efficient, immutable, dataframe implementation for Racket. It has support for: b-tree indexes (and scanning), generic sorting, joining (inner and outer), grouping, and aggregating. It can also read and write CSV and JSON (columns, records, and lines).

I plan on adding some more features in the near future, but it's at a good, stable place, and I thought others in the community might find it useful.

This kind of thing has to be very performant to be useful, even for data sets that are only a few GB in size. Is it pure Racket? Have you benchmarked it against pandas?

Yes, this is 100% written in Racket. There are several reasons for that, one of which is the next step in this endeavor: using it as an instructional template for people interested in learning how to implement DataFrames (read: a series of blog posts). For many programmers - especially young ones - libraries like Pandas and Spark are black magic. Revealing what's behind the curtain and why certain decisions are made (algorithms, cache usage, better parallelization, etc.) is very beneficial to them down the road.

I use Pandas, R, and Spark every day at work, and I agree that performance is critical for packages like this. Since it's pure Racket, I wouldn't expect it to compete in the performance department. But - for Racket - it's not bad. On average it's roughly 3-6x slower than Pandas currently (some ops are 6x slower, others are only 2x). There are plenty of improvements to be made when it comes to parallelizing many of the operations.
