Regular expressions are used, misused and abused nearly everywhere. Google is no exception, alas, and at our scale, even a simple change can save hundreds or thousands of cores. In this tip, we describe ways to use RE2 more efficiently.
NOTE: This tip is specifically about RE2 and C++. A number of the ideas below are universally applicable, but discussion of other libraries and other languages is out of scope.
As a prelude, let’s consider an example of how regular expressions are often used: a call to RE2::FullMatch() that looks for a zone ID at the end of the zone_name string and extracts its value into the zone_id integer.
This tip describes several techniques for improving efficiency in situations such as this. These fall into two broad categories: improving the code that uses regular expressions, and improving the regular expressions themselves.
To understand why the following techniques matter, we need to talk briefly about RE2 objects. In the initial example, we passed a pattern string to RE2::FullMatch(). Passing a pattern string instead of an RE2 object implicitly constructs a temporary RE2 object on every call. During construction, RE2 parses the pattern string into a syntax tree and compiles the syntax tree into an automaton. Depending on the complexity of the regular expression, construction can consume a lot of CPU time and can build an automaton with a large memory footprint.
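The implicit construction is ordinary C++: the matching function takes the pattern object by const reference, and a non-explicit constructor lets a pattern string convert to it at the call site. The sketch below isolates that mechanism with a hypothetical stand-in class. `ToyPattern`, `ToyFullMatch`, and the construction counter are illustrative names, not part of RE2, and `std::regex` is only a placeholder for the compiled automaton.

```cpp
#include <initializer_list>
#include <regex>
#include <string>

// Counts how many ToyPattern objects have been constructed.
int g_toy_constructions = 0;

// Hypothetical stand-in for a compiled-pattern class. It is NOT RE2's API; it
// only mimics one property: a non-explicit constructor that takes a pattern
// string and does the compilation work up front.
class ToyPattern {
 public:
  // Deliberately not `explicit`, so a string literal converts implicitly.
  ToyPattern(const char* pattern) : compiled_(pattern) {
    ++g_toy_constructions;
  }

  bool FullMatch(const std::string& text) const {
    return std::regex_match(text, compiled_);
  }

 private:
  std::regex compiled_;  // Placeholder for the parsed and compiled pattern.
};

// Accepts the pattern object by const reference, so a pattern string passed
// here is converted into a temporary ToyPattern for the duration of the call.
bool ToyFullMatch(const std::string& text, const ToyPattern& pattern) {
  return pattern.FullMatch(text);
}

// Matches three names, passing a pattern string each time. Returns the number
// of ToyPattern objects constructed along the way.
int ConstructionsWithPatternString() {
  const int before = g_toy_constructions;
  for (const char* name : {"a.zone1", "b.zone2", "c.zone3"}) {
    ToyFullMatch(name, R"(.*\.zone(\d+))");  // New temporary on every call.
  }
  return g_toy_constructions - before;
}

// Matches the same names against one pattern object that is built up front.
int ConstructionsWithPatternObject() {
  const int before = g_toy_constructions;
  const ToyPattern pattern(R"(.*\.zone(\d+))");  // Built exactly once.
  for (const char* name : {"a.zone1", "b.zone2", "c.zone3"}) {
    ToyFullMatch(name, pattern);
  }
  return g_toy_constructions - before;
}
```

Under these assumptions, the first function constructs three pattern objects for three calls, while the second constructs one. In the toy the difference is only a counter, but with a real pattern class each construction repeats the parsing and compilation described above.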