Optimization adventures: making a parallel Rust workload even faster with data-oriented design (and other tricks)

submitted by
Style Pass
2024-12-02 17:00:06

This post is the second part of my adventures in optimizing a Rust workload running on multiple threads. In the first post, I explored how the rayon parallelism framework works, and explained how I designed a faster replacement for my use case.

However, the parallelism implementation was not the lowest-hanging fruit for optimizing my program. In this post, I’ll present the other optimizations that the profiling tools led me to: compiler flags, inlining, fine-tuning the layout of data structures, rewriting arithmetic formulas, sorting, and copying data.

Wait, what? Copying data, really?! Isn’t the whole point of fighting the Rust borrow checker that you unlock super-optimized zero-copy data structures?

Inlining is the optimization that I didn’t expect to have to do manually, but it turned out to be very effective. As pointed out by Matklad’s post “Can You Trust a Compiler to Optimize Your Code?”, inlining is really the root of all optimizations, because most other optimizations are local to each function.

To be more concrete, my code contained a newtype for fixed-point arithmetic, which manipulates integers that represent fixed-precision decimal numbers. A simplified view is the following, for a type that handles 9 decimal places.
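
To illustrate what such a newtype can look like, here is a minimal sketch. It is not the author's actual code: the names `FixedDecimal9`, `SCALE`, `from_int` and `from_parts` are hypothetical, and using `#[inline(always)]` on the arithmetic operators is an assumption about how manual inlining might be applied. The inner `i64` counts units of 10^-9, so addition is a plain integer addition and multiplication has to rescale the product.

```rust
use std::ops::{Add, Mul};

/// Hypothetical fixed-point newtype: the inner integer counts units of 10^-9.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct FixedDecimal9(i64);

impl FixedDecimal9 {
    /// Scaling factor for 9 decimal places.
    const SCALE: i64 = 1_000_000_000;

    /// Builds a value from a whole number.
    #[inline(always)]
    pub fn from_int(n: i64) -> Self {
        FixedDecimal9(n * Self::SCALE)
    }

    /// Builds a value from an integer part and a number of billionths.
    #[inline(always)]
    pub fn from_parts(int: i64, billionths: i64) -> Self {
        FixedDecimal9(int * Self::SCALE + billionths)
    }
}

impl Add for FixedDecimal9 {
    type Output = Self;

    // Assumption: forcing inlining so the caller sees a bare integer addition.
    #[inline(always)]
    fn add(self, rhs: Self) -> Self {
        FixedDecimal9(self.0 + rhs.0)
    }
}

impl Mul for FixedDecimal9 {
    type Output = Self;

    #[inline(always)]
    fn mul(self, rhs: Self) -> Self {
        // Widen to i128 so the intermediate product cannot overflow,
        // then rescale back to 9 decimal places (truncating toward zero).
        let wide = self.0 as i128 * rhs.0 as i128 / Self::SCALE as i128;
        FixedDecimal9(wide as i64)
    }
}
```

With operators this small, the cost of a function call can be comparable to the work itself, which is why whether they get inlined into a hot loop matters. Note that `#[inline(always)]` is still only a strong hint to the compiler, and that the final cast back to `i64` in `mul` silently truncates results that do not fit.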
