Table lookups are efficient

When optimizing small functions, I often rely on a table lookup: I replace the actual computation with a table of precomputed values. It is often surprisingly efficient.

Let us consider an example. Suppose that you are given an array of characters and you want to replace all instances of the character ‘\’ with the two-character string “\\” and all instances of the newline character with the two-character string “\n”. You might do it with a sequence of branches.
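The branchy version might be sketched as follows (the function name and the convention that the output buffer can hold up to twice the input are my own assumptions):

```c
#include <stddef.h>

// Escape '\' as "\\" and '\n' as "\n" using branches.
// Writes to `out` (assumed to be at least 2x `len` bytes)
// and returns the number of bytes written.
size_t escape_branchy(const char *in, size_t len, char *out) {
  size_t pos = 0;
  for (size_t i = 0; i < len; i++) {
    char c = in[i];
    if (c == '\\') {
      out[pos++] = '\\';
      out[pos++] = '\\';
    } else if (c == '\n') {
      out[pos++] = '\\';
      out[pos++] = 'n';
    } else {
      out[pos++] = c;
    }
  }
  return pos;
}
```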

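The table-based alternative might look like the following sketch: a 256-entry table maps each possible byte value to its escaped form (at most two bytes) together with its length, so the loop body has no branches on the character value (again, the names and buffer convention are mine):

```c
#include <stddef.h>

// One entry per possible byte value: the escaped bytes and their count.
typedef struct {
  char data[2];
  unsigned char len;
} escape_entry;

static escape_entry table[256];

// Most bytes map to themselves; only '\\' and '\n' expand to two bytes.
static void init_table(void) {
  for (int i = 0; i < 256; i++) {
    table[i].data[0] = (char)i;
    table[i].data[1] = 0;
    table[i].len = 1;
  }
  table[(unsigned char)'\\'] = (escape_entry){{'\\', '\\'}, 2};
  table[(unsigned char)'\n'] = (escape_entry){{'\\', 'n'}, 2};
}

// Branchless loop body: always store two bytes, then advance by the
// entry's length. `out` must hold at least 2x `len` bytes.
size_t escape_table(const char *in, size_t len, char *out) {
  size_t pos = 0;
  for (size_t i = 0; i < len; i++) {
    escape_entry e = table[(unsigned char)in[i]];
    out[pos] = e.data[0];
    out[pos + 1] = e.data[1]; // harmless extra store when len == 1
    pos += e.len;
  }
  return pos;
}
```

Storing two bytes unconditionally and advancing by the table-supplied length is what removes the branch from the loop.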
My table is unnecessarily large: I leave it as an exercise to the reader to optimize the code so that a much smaller table could be used.

When processing inputs sequentially, we must do some data loading anyway: in this case, we must load each character from the input array. The table lookups require additional load operations, but modern processors can typically execute two loads per cycle, so we are often not severely limited by the number of loads. The loads have some latency, but a table used by a hot function is often cached close to the CPU, so the latency may be relatively short (e.g., 5 cycles). If there is not too much dependency between how we process successive characters, this latency is not too harmful. On the plus side, table lookups reduce the number of instructions issued, and they often keep branch mispredictions to a minimum.

You might be concerned that tables can make your binary code larger. However, that may not be true. One of the most important optimizations that compilers make is ‘inlining’: your small functions are often replicated several times to avoid the overhead of a function call and to allow for more advanced optimizations. A hot function implemented with tables is made of few instructions, and these instructions can be reproduced at little cost all over your code base; the table itself is stored only once, no matter how many times the function is inlined. In contrast, an implementation made of many instructions can be costly to inline.
