
Python and Pandas: the faster way | Dasha.AI

submitted by
Style Pass
2021-06-15 17:00:17

Data science students were asked to improve their code snippets for data preprocessing. Some parts of the snippets were sub-optimal, and other parts were, well, thought to be optimal. Let’s take a look at the results of this experiment. A piece of advice: while reading, don’t just skim through the text. Instead, try to guess which of the proposed code options is the fastest.

Let's say we have a Pandas DataFrame in which one of the columns is a string containing a price with a dollar sign at the end (fig 1). We need to turn it into an integer, i.e. remove the dollar sign and convert the result to int (price($)_v1).

The first option was offered to the students as a baseline. The logic is simple: the last character is cut off and the remainder is converted to int. There is an obvious inefficiency here: the type conversion is done element by element instead of on the whole column at once. The second option fixes this inefficiency. The disadvantage of both options is that they work only if the dollar sign really is the last character of every string value in the column. The third option is more universal: it removes the dollar sign with the replace function. The fourth option does the same without the lambda function. The execution time is shown on a bar chart (here and below, the time is measured on a data frame with 10,000,000 rows):
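The original snippets are not reproduced in the text, so the following is only a sketch of what the four options might look like, reconstructed from the description above. The source column name `price`, the sample values, and the output column names `price($)_v2` to `price($)_v4` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample data; the article measures a frame with 10,000,000 rows.
df = pd.DataFrame({"price": ["100$", "25$", "7$"]})

# Option 1: cut off the last character and cast to int, element by element.
df["price($)_v1"] = df["price"].apply(lambda x: int(x[:-1]))

# Option 2: cut off the last character per element, then cast the whole column once.
df["price($)_v2"] = df["price"].apply(lambda x: x[:-1]).astype(int)

# Option 3: remove the dollar sign wherever it appears, using Python's str.replace.
df["price($)_v3"] = df["price"].apply(lambda x: x.replace("$", "")).astype(int)

# Option 4: the same replacement without a lambda, through the .str accessor.
# regex=False treats "$" as a literal character rather than a regex end-of-string anchor.
df["price($)_v4"] = df["price"].str.replace("$", "", regex=False).astype(int)

print(df)
```

All four columns should hold the same integers; the options differ only in how the work is split between Python-level calls and column-wide operations.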

It has to be mentioned that the students did not choose the time-optimal option; they relied on the replace function. The fourth option is, in theory, an improved version of the third, though there’s a curveball: on average, it is slower than its progenitor. For some reason, lambda functions here are faster than the .str interface.
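To check the lambda versus .str comparison on your own machine, a minimal timing sketch along the following lines could be used. The frame size, the column name `price`, and the helper names `with_lambda` and `with_str_accessor` are assumptions, and the numbers will depend on the pandas version and hardware, so they may not match the article's chart.

```python
import timeit

import pandas as pd

# Hypothetical frame, far smaller than the article's 10,000,000 rows.
n_rows = 100_000
df = pd.DataFrame({"price": [f"{i}$" for i in range(n_rows)]})


def with_lambda():
    # Third option: Python's str.replace applied element by element.
    return df["price"].apply(lambda x: x.replace("$", "")).astype(int)


def with_str_accessor():
    # Fourth option: the .str accessor, with "$" treated as a literal.
    return df["price"].str.replace("$", "", regex=False).astype(int)


# Take the best of three single runs for each variant.
timings = {
    func.__name__: min(timeit.repeat(func, number=1, repeat=3))
    for func in (with_lambda, with_str_accessor)
}

for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```

Taking the minimum over several repeats reduces noise from other processes; increase `n_rows` to get closer to the scale used in the article.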
