For a long time, Pandas was, and still is arguably the single most important library inside a data scientist's toolbox. In fact, it was the emergence

Dask vs Vaex - a qualitative comparison

submited by
Style Pass
2021-06-16 15:00:10

For a long time, Pandas was, and still is arguably the single most important library inside a data scientist's toolbox. In fact, it was the emergence of Pandas that helped to make Python such a popular programming language, both for data science tasks and generally. But datasets continue to grow, and now they often exceed the size of available RAM on most machines. Thus, the need for fast and efficient processing and analysis of data, i.e. the next generation of tooling grows too.

There are several popular technologies in the Python ecosystem that are frequently used for processing large datasets in the context of data science and data engineering. Dask is an open-source, general framework for parallel and distributed computations in Python. It is often the go-to technology for horizontal scaling of various types of computations and data science tasks. Its higher level API, the Dask.DataFrame provides arguably the most convenient way to sidestep the memory and performance hurdles one can encounter with Pandas when working even with moderately large datasets.

Vaex is a high-performance DataFrame library in Python, primarily built for the processing, exploration and analysis of datasets as large as the size of your hard-drive and on a single machine. When it comes to large datasets, Vaex argues that horizontal scaling, i.e. adding more computers/nodes, is simply not needed for the majority of the common data science and analytics tasks, provided one uses the right approaches for storing, accessing, and processing of the data.

Leave a Comment