A couple of weeks back, I came across an amazing library, Modin (modin.pandas), that claims to speed up existing pandas code by almost 4x with a change to just one line of code. A claim that big gave me a reason to test it out and see the results for myself.
I will test CSV loading with both modin.pandas and pandas on two datasets of different sizes and compare the two using Python's time module. To make sure the numbers are stable, I will take the average load time over 3 runs.
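The timing procedure described above can be sketched roughly as follows. This is a minimal illustration, not the exact script behind the numbers in this post: the helper names (`average_load_time`, `stdlib_loader`, `make_sample_csv`) are hypothetical, and a standard-library CSV reader stands in for the real loaders so the snippet runs without pandas or Modin installed.

```python
import csv
import os
import tempfile
import time


def average_load_time(loader, path, runs=3):
    """Return the mean wall-clock seconds that loader(path) takes over `runs` calls."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        loader(path)
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)


def stdlib_loader(path):
    """Stand-in loader (assumption) so the sketch runs without pandas or Modin."""
    with open(path, newline="") as handle:
        return list(csv.reader(handle))


def make_sample_csv(rows=1000):
    """Write a small throwaway CSV file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id", "value"])
        writer.writerows((i, i * 2) for i in range(rows))
    return path


sample_path = make_sample_csv()
baseline = average_load_time(stdlib_loader, sample_path)
print(f"average load time over 3 runs: {baseline:.6f} s")
os.remove(sample_path)
```

To reproduce the comparison itself, you would pass `pandas.read_csv` and `modin.pandas.read_csv` as the `loader` argument on the same file and divide the first average by the second to get the speedup.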
The results are really impressive! In my tests the speedup was around 2.5x rather than 4x (probably because not all the cores on my system were free), but that's still better than plain pandas, right?
While doing a little background reading, I also found that Modin gets this remarkable performance by distributing the work across all the CPU cores on the host machine, unlike pandas, which uses just one core for any task it does.
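To put a measured speedup in context, it helps to compare it with the number of cores available. The snippet below is a rough sketch: `parallel_efficiency` is a hypothetical helper, the 2.5x figure is the one measured in this post, and the core count is whatever your own machine reports, not the machine used for these tests.

```python
import os


def parallel_efficiency(speedup, cores):
    """Fraction of the ideal linear speedup achieved (1.0 means perfect scaling)."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return speedup / cores


# os.cpu_count() can return None, so fall back to a single core.
cores = os.cpu_count() or 1
observed_speedup = 2.5  # figure reported in this post

print(f"logical cores on this machine: {cores}")
print(f"parallel efficiency: {parallel_efficiency(observed_speedup, cores):.2f}")
```

On a hypothetical 4-core machine, a 2.5x speedup works out to an efficiency of 0.625, which fits the idea that some cores were busy with other work.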
No doubt, this is a really good contribution for data science and ML enthusiasts. I would encourage you to give it a try in your next Kaggle challenge!