In a previous post, I showed how to use the rayon framework in Rust to automatically parallelize a loop computation across multiple CPU cores. Disappointingly, my benchmarks showed that this only provided a 2x speedup for my workload, on a computer with 8 CPU threads. Worse, the total “user” and “system” times increased linearly with the number of threads, meaning potentially more wasted work. Even Python was only about twice as slow as my Rust code, when Rust is typically 10x to 100x faster than Python.
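As a reminder, here is a minimal sketch of the kind of loop Rayon parallelizes. The actual workload from the previous post is different; `expensive_step` is just a hypothetical placeholder for the per-item work.

```rust
use rayon::prelude::*;

/// Hypothetical stand-in for the per-item work; the real workload
/// from the previous post is more involved.
fn expensive_step(x: u64) -> u64 {
    (0..1_000).fold(x, |acc, i| acc.wrapping_mul(31).wrapping_add(i))
}

fn main() {
    let inputs: Vec<u64> = (0..1_000_000).collect();

    // Sequential version: a plain iterator chain.
    let sequential: u64 = inputs.iter().map(|&x| expensive_step(x)).sum();

    // Parallel version: swapping `iter()` for `par_iter()` lets Rayon
    // spread the work across the available CPU cores.
    let parallel: u64 = inputs.par_iter().map(|&x| expensive_step(x)).sum();

    assert_eq!(sequential, parallel);
}
```

In principle, changing `iter()` into `par_iter()` is all it takes, which is why the modest 2x speedup was such a surprise.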
This was the starting point of an optimization journey that led me to a 10x speed-up from this baseline. In this post, I’ll first explain which profiling tools I used to chase optimizations, before diving into how I built a faster replacement for Rayon for my use case. In the next post, I’ll describe the other optimizations that made my code much faster. Spoiler alert: copying some data sped up my code!
Wait, what? Copying data, really?! Isn’t the whole point of fighting the Rust borrow checker that you unlock super-optimized zero-copy data structures?