Regex Optimizations


The best regex optimization, in my opinion, is to start with good style. So I highly recommend you visit the page on regex style if you haven't already done so.

Jumping Points
For easy navigation, here are some jumping points to various sections of the page:

Optimizing Your Regular Expressions
What is an Optimization, Exactly?
Unrolling an Alternation Quantified by a Star
In Alternations, put the Most Likely Cases First
Expose literal characters
Use and Expose Anchors
Distributing into the Alternation
Automated Optimization: Study Mode
Optimizations: In Conclusion


(direct link)

Optimizing Your Regular Expressions

Suppose you travel to an exotic country. With no prior knowledge of the exotic language spoken in that land, you learn to say such things as "Take me to the East Gate". Once you get into a cab, you realize that requesting your destination is not enough. Unless you also say how you want to get there, you might be taken all over town. That seems a bit unfair, doesn't it?

You could feel the same about optimizations in regex. They feel like a bit of a scam.

A bad regex engine is like a rude cab driver.
Let me explain what I mean. In a programming language, you can use high level functions without really having to worry about how they work. You say "open this, find that, print the other", and it just works. When you start to learn regex, you may think that it's enough to know how to say "find that", naively assuming that the whole point of the syntax is to let you specify what you are looking for. But that's not the case. In some situations, if you don't tell the regex engine how to find what you want, it may take you all over town. A search that might resolve in a hundred steps can take ten thousand if written the wrong way.

Now a well-optimized regex engine will be able to look at your pattern and be a little forgiving. It sees a lot of foreigners in town, and even if you didn't express yourself most elegantly, it appreciates the effort, tries to understand what you really meant, and takes you to your hotel as fast as possible.

So in some sense, optimizing your regex expressions means learning tricks to speak to impolite or uncivilized regex engines, i.e., engines that haven't been optimized. And that's the scam. Why should I use your programming language if you haven't taken the trouble to make it efficient?

What's more, optimizing your regex can force you to write expressions that are harder to write and read. That too seems a tad unfair.

Nevertheless, studying optimizations is fun and useful. When you study optimizations, you deepen your understanding of how the engine works, and that knowledge helps you write your expressions faster and more accurately.


(direct link)

What is an Optimization, Exactly?

Before we start, I should define what I mean by optimization. In this section, I am not talking about the "big picture" regex writing practices we discussed in the Elements of Regex Style, practices that hinge on fundamental features of your regex engine, such as the different paths down which a greedy and a lazy quantifier may lead your engine before you end up matching the same string.

Here, we are talking about "micro-optimizations" akin to turning a bolt by a quarter of a turn on your V-6.

Frankly, I'm not sure I always know what should qualify as a "big-picture practice" or as a "micro-optimization". In the end, I suppose the results produced by each technique make that decision for us.

About the optimization tests on this page
On this page, I took one trick from the syntax tricks page and a handful of optimizations I found in Jeffrey Friedl's Mastering Regex Expressions and tested how one particular version of one particular engine (PCRE 8.12) used within one particular version of one particular language (PHP 5.3.8) responds to each of them.

At some stage, I would like to run the same tests in .NET, Python, Java, JavaScript and Ruby. In the meantime, even if you don't use PHP, the optimizations are still interesting to study.


(direct link)

Unrolling an Alternation Quantified by a Star

In the trick to Mimic an Alternation Quantified by a Star, we see that (A|B)* can be unrolled to A*(?:B+A*)*. Is there a time benefit to doing away with the alternation?

Using pcretest, I benchmarked two patterns against this string:
14e987eaie7e7e7e9876ei14ou
The patterns: (?:\d|[aeiou])* and \d*(?:[aeiou]+\d*)*
The original pattern (with the alternation) compiles faster: 1.6 millionth of a second vs. 2.2 for the unrolled version. However, it executes a lot slower: 1.7 millionth of a second vs. 0.8.

This seems like a potentially useful optimization to implement at the engine level (in one's own code, it is a bit hard to maintain).


(direct link)

In Alternations, put the Most Likely Cases First

The engine reads from left to right. In the world of web addresses, dot-com is more frequent than dot-net and dot-biz, so if you were checking for these three domains in a pool of random names, in theory you would write
\.(?:com|net|biz)\b rather than
\.(?:biz|net|com)\b
In practice, in PCRE there doesn't seem to be a difference—I ran two patterns two million times and compared the results. In case you're interested in doing some benchmarking of your own, here is the (very basic) code I used for this test.

<?php
$start=time();
for ($i=0;$i<1000000;$i++)
    $res = preg_match('~\.(?:com|net|biz)\b~',
                      'apache.com');
$lap=time();
for ($i=0;$i<1000000;$i++)
     $res = preg_match('~\.(?:biz|net|com)\b~',
                       'apache.com');
$end=time();
$time1 = $lap - $start;
$time2 = $end - $lap;
echo $time1."<br />";
echo $time2."<br />";
?>


(direct link)

Expose Literal Characters

Regex engines match fastest when anchors and literal characters are right there in the main pattern, rather than buried in sub-expressions. Hence the advice to "expose" literal characters whenever you can take them out of an alternation or quantified expression. Let's look at two examples.

Example 1: AA* should be faster than A+. They mean the same: at least one A, possibly followed by more A characters.

I ran these two patterns two million times on the string BBBCCC. Both took the same amount of time. This tells me that the PHP PCRE engine must be "polite" as far as this optimization is concerned—meaning, it does it for you. Just use A+.

Example 2: th(?:is|at) should be faster than this|that.

I ran these two patterns two million times on the string that. The second ("less optimal") pattern was actually faster by eight percent, earning me half a millionth of a second per run, nothing to write home about. Again, the optimization must be built into PHP's PCRE regex engine. Maybe the theoretically "more optimal" pattern loses out somewhere in the compilation.

Should you use it?
Where speed is concerned, the answer depends on the engine you're using. For me, regardless of which engine I happen to use, the decision is one of readability—and therefore maintainability.

For instance, if I am matching all numbers from 10 to 19, I will always bring out the 1 and use 1[0-9]. For me, that is easier to read and maintain than spelling out each number, all the more so when working with a more complex range of numbers.

On the other hand, if I were crafting a regex for all the two-letter abbreviations of States in the USA, I would spell them out: \b(?:AL|AK|AZ|…)\b. I would do so even though I have tools at my fingertips that will automatically compress long alternations into their optimized counterparts (regex-opt in C, Regexp::Assemble and Regexp::Assemble::Compressed in Perl).

That is because any day of the week, I would rather have to debug this:
\b(?:AL|AK|AZ|AR|CA|CO|CT|DE|FL|GA|HI|ID|IL|IN|IA|KS|KY|LA|ME|MD|MA|MI|
MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|OH|OK|OR|PA|RI|SC|SD|TN|TX|UT|VT|VA|
WA|WV|WI|WY)\b


than that:
\b(?:AZ|TX|[AFI]L|[CGILMPVW]A|[CM]O|[CMUV]T|[DMN]E|[HMRW]I|[IMNS]D|[IMT]N|[KM]S|[KNW]Y|[NW]V|[NO]H|[NS]C|N[JM]|[AO][KR])\b


(direct link)

Use and Expose Anchors

The beginning- and end-of-string anchors ^ and $ can save your regex a lot of backtracking in cases where the match is bound to fail.

In theory, ^.*abc fails faster than .*abc. I ran this two million times on a "failure string" (forty z characters in a row). As in the last example, the "less optimal" pattern was faster by eight percent, earning me half a millionth of a second per run. Again, PCRE sounds polite. The lost time might have to do with processing the extra anchor.

It is also advised to expose anchors, which means taking anchors out of alternation parentheses when possible. For instance,
^(?:abc|def) is preferred to ^abc|^def.

I ran each of these two preg_match functions two million times.

$res = preg_match('~^(?:abc|def)~', 'definition');
$res = preg_match('~^abc|^def~', 'definition');

The first saved me one second (out of fourteen). On the one hand, that's a seven percent improvement. On the other hand, on a single run, that's only an improvement of half a millionth of a second.

Should you use it?
I use anchors wherever I can as a matter of good style—and saving unneeded backtracking. As to whether to expose them outside of alternations, I usually do so as well, not because it is faster but because it tends to be more readable.


(direct link)

Distributing into the Alternation

The last technique I tried is what Jeffrey calls "distributing into the alternation":

\b(?:com\b|net\b) vs. \b(?:com|net)\b

This technique did speed up the script by seven percent, saving a millionth of a second per run.

Will I use it? Probably not. I like exposing boundaries.


(direct link)

Automated Optimization: Study Mode

PCRE has a "Study" modifier you can tag at the end of a pattern. For patterns that do not start with a fixed character and are not anchored, this modifier causes the regex engine to study the string a little more before applying the pattern, just in case some optimizations can be discovered.

To use this mode, add a capital S after the closing delimiter, as in:
$pattern='/\d+\b(day|night)\b/S';
Apparently, this study mode can be useful when parsing long documents such as web pages. It may not help, but it costs less than a hundred thousandth of a second.


(direct link)

Optimizations: In Conclusion

When I tested them in PHP, the "micro-optimizations" in Mastering Regex Expressions did not seem to speed up the code. Does that mean they are bad? On the contrary. Maybe Philip Hazel and the other contributors to the PCRE engine read Jeffrey's book, found that the optimizations improved matching speed—or match failure speed, which is often more important—and incorporated them into the engine. In his book, Jeffrey hints that he wouldn't be sad for that to happen.

At least in PHP, I suggest you forget "micro-optimizations" such as the ones presented in this section. Good style matters more.

PCRE won't "scam you" when you board the cab: it's a polite engine. Just write regex that works, focusing on the big picture to avoid patterns that slow you down by orders of magnitude. Mainly, this means making judicious use of anchors, quantifiers (lazy vs. greedy), groups (atomic or backtracking-savvy), and anything that can make your regex more specific than the all-too-common match-all dot-star soup—such as literal characters and negative classes. For more, read the section about the Elements of Regex Style.




next
 Regex Cookbook





Be the First to Leave a Comment






All comments are moderated.
Link spammers, this won't work for you.

To prevent automatic spam, we require that you type the two words below before you submit your comment.