The Explosive Quantifier Trap
If you're not careful, you can easily build a regular expression that has the potential to backtrack for a billion years—longer than most of us are prepared to wait at 10am on a Monday morning. Furthermore, such expressions can be used for regular expression denial of service attacks (ReDos).
This page aims to show you how to detect and troubleshoot such patterns. It's a companion to the tutorial about mastering regex quantifiers.
Jumping Points
For easy navigation, here are some jumping points to various sections of the page:
✽ Introduction
✽ A Simple Example
✽ Identifying Explosive Quantifiers
✽ Symptom: A Quantifier Compounds a Quantifier
✽ Symptom: Contiguous Quantified Tokens are Not Mutually Exclusive
✽ Symptom: Tokens in Alternation are Not Mutually Exclusive
✽ Symptom: Remote Quantifiers Can Reach into Each Other's Territory
✽ Further Reading
(direct link)
Introduction
Backtracking is a wonderful feature of modern regex engines: if a token fails to match, the engine backtracks to any position where it could have taken a different path. A greedy quantifier may then give up one character, a lazy quantifier may expand to match one more, or the rightmost side of an alternation may be tried. If a pattern continues to fail, the engine systematically explores all available paths.We have already seen situations where a naive engine can explore more paths than it needs to. For instance, you may remember when we used the pattern
\b\d+E
to match a series of digits ending with an E. Using this pattern on the string 1234, after the \d+ has finished its work the E token will fail to match. At that stage, it is wasteful for the engine to backtrack into the \d+ and explore to see if the token E might have matched after the 2 or the 3. Possessive quantifiers and atomic groups help us handle such situations by turning a quantified token or a subpattern into a block that cannot be backtracked into. In this example, the syntax for those is \b\d++E
and \b(?>\d+)E
.
In this last example, the amount of potential backtracking needed is proportional to the length of the string. The potential damage isn't too severe.
However, you can write regular expressions where the potential for backtracking in relation to the length of the string is exponential. This is so wildly inefficient that your regex engine may well choke. It is therefore vital to learn to recognize such expressions.
When there is potential for wild backtracking, quantifiers are always at fault. To describe these situations, I speak of explosive quantifiers. In Mastering Regular Expressions, Jeffrey Friedl refers to these situations as exponential matches, while in The Regular Expressions Cookbook Jan Goyvaerts and Steven Levithan speak of catastrophic backtracking.
(direct link)
A Simple Example
Consider this simple pattern:^(A+)*B
It is not as contrived as it looks: for instance, it could be a window to look at the problem raised by ^(A+|X)*B, where A might stand [aeiou]
Let's see what happens when we try to match the string AAAC with that pattern ^(A+)*B. First, A+ matches all the A characters. The greedy * tries to repeat the A+ token, but there are no characters left to match. The engine advances to the next token: B fails to match. The engine backtracks, the A+ gives up the third A. The greedy * tries to repeat A+ and matches the third A. The engine advances to the next token: B fails to match. The greedy * gives up the second A+ token, i.e. the third A. The engine advances to the next token: B fails to match. Now the original A+ gives up the second A… Do you see where this is going?
(direct link)
Table of Combinations
The table below shows the combinations the engine will try for (A+)*. Since the * quantifies the A+ token, several A+ tokens can contribute to what (A+)* matches at any given time. Each column corresponds to the text matched by one of these A+ substrings. But don't worry too much about the details of the table. What matters is the number of rows.
A+ | A+ | A+ |
---|---|---|
AAA | — | — |
AA | A | — |
AA | — | — |
A | AA | — |
A | A | A |
A | A | — |
A | — | — |
— | — | — |
When the engine finally gives up, it has tried to match the final B token eight times while (A+)* contained various strings. In fact on these eight attempts (A+)* only contained four distinct strings (the empty string, A, AA and AAA), but some of these were tried multiple times as they were matched in different ways. For instance, we reached AAA in four ways:
✽ With one single A+ token matching AAA
✽ With one A+ token matching AA and another matching A
✽ With one A+ token matching A and another matching AA
✽ With three A+ tokens all matching a single A.
Note that we tried the final B token eight times, but it took many more steps for the engine to fail, because each time we reached the final B we had to backtrack. The debugger in regexbuddy says that it needs 28 steps before it can fail. In contrast, the possessive ^(A+)*+B and the atomic ^(?>(A+)*)B (which are fair comparisons for this simplified pattern) respectively fail after 5 and 7 steps.
I'm sure you can guess what happens when we try the pattern
^(A+)*B
against longer strings. The number of steps required to fail explodes.
Number of As | Steps to Fail |
---|---|
1, e.g. AC | 7 |
2, e.g. AAC | 14 |
3, e.g. AAAC | 28 |
4 | 56 |
5 | 112 |
10 | 3,584 |
20 | 3,670,016 — RegexBuddy has given up. How about your program? |
100 | 4,436,777,100,798,802,905,238,461,218,816 |
I'm sure you can see the pattern: for each A characters we add to the subject string, the number of steps required to reach failure doubles. From a computational standpoint, this exponential growth is a nightmare. For you big-O geeks out there, the complexity of exploring all the combinations is O(2n).
Fortunately, RegexBuddy gives up at three million steps (to compute the last two rows, I just multiplied an earlier row by a power of two). But other tools may not give up, and if your language imposes no limit on backtracking or regex matching time, you could be shipping the kind of software everyone loves to rant about.
(direct link)
Identifying Explosive Quantifiers
We need to learn to recognize situations where this kind of explosion can occur. But how?Let us return mentally to the table of combinations—where each row represents a distinct way for various A+ tokens to match some of the characters in AAA. In the simplest terms, you can reduce what is happening here as a situation where various A+ tokens (which are spawned by the * quantifier) "negotiate" to divide up the "pie of characters" as the engine explores possible combinations. You can interpret the table's rows as a tug of war between three potential A+ tokens. Of course this metaphor of "negotiation" or "tug of war" is not how things really happen—what we have is a regex engine systematically exploring all possible paths. But the metaphor is helpful in our search for symptoms.
What we're looking to avoid is such tugs of wars: situations where multiple tokens (either explicitly present in the pattern or spawned on the fly by a quantifier) can match the same portion of the subject string. That's easier said than done because such situations can arise in a number of ways. The following sections aim to make us alert to four kinds of symptoms.
Narrowing the Definition of Quantifier
In the following sections, when I mention quantifiers, let's agree that I won't be referring to quantifiers that cause minimal or no repetition, such as the gentle ? or the plain {2}.
The quantifiers we worry about are those that can repeat a token many times, resulting in an explosion in the number of combinations the engine needs to explore. In that basket, we should include:
✽ The plus + (one or more)
✽ The star * (zero or more)
✽ Range quantifiers with no upper boundary, such as {3,}
✽ Finite quantifiers with two or more digits, such as {10}. Remember that two to the tenth power is 1024 (in our example, the engine took 3,584 steps to fail a string with ten As). In particularly vicious configurations, even smaller quantifiers can be explosive.
✽ Range quantifiers with a fixed upper boundary comprised of two or more digits, such as {3,10}—for the same reasons as above.
(direct link)
Symptom: A Quantifier Compounds a Quantifier
Whenever you see that a quantifier applies to a token that is already quantified, as in (A+)*, there is potential for the number of steps to explode.Often, the "compounding quantifier" pattern happens when the outside quantifier applies to an alternation, as in
(?:\D+|0(?!1))*
. Unless you pay attention, you can miss that the (\D+…)* constitutes an explosive quantifier.The lesson here is that when a quantifier needs to apply to another quantifier, we need to prevent the engine from backtracking. We achieve this either by:
✽ making the outer quantifier possessive, e.g.
(?:\D+|0(?!1))*+
or
✽ enclosing the expression in an atomic group, e.g.
(?>(?:\D+|0(?!1))*)
(direct link)
Symptom: Contiguous Quantified Tokens are Not Mutually Exclusive
Consider this pattern:^\d+\w*@
The \d and the \w are both able to match digits: they are not mutually exclusive.
Against a string such as 123, the pattern must fail. While trying all the possibilities in order to find the match, the engine will let the \d+ give up characters that will be matched by the \w*. Exploring these paths takes time: the engine takes 16 steps to reach failure.
Adding one digit to the test string, e.g. 1234, the engine takes 25 steps to fail. With ten digits, it takes 121 steps. With 100 digits, it takes 10,201 steps. The situation is clearly far better than in the first example. The number of steps required to fail in relation to the size of the string does not grow exponentially, but it still explodes—without looking at it closely its complexity seems to be quadratic or thereabouts, i.e. O(n2). It takes 1,100 digits to reach a million steps. That's a lot more than many subject strings but a lot less than others—that's only a page-and-a-half of average text.
The lesson here is to try to use contiguous tokens that are mutually exclusive, following the rule of contrast from the regex style guide.
(direct link)
Symptom: Tokens in Alternation are Not Mutually Exclusive
Let us now consider a variant of our last faulty pattern:^(?:\d|\w)+@
This too will fail against 123. On the first attempt, each digit will be matched by a \d token, as it is the leftmost side of the alternation. When the @ token fails to match, the engine will backtrack into each alternation and let the \w side match characters that were previously matched by the \d. The engine takes 60 steps to fail.
Adding one digit to the test string, e.g. 1234, the engine takes 124 steps to fail. With ten digits, it takes 8,188 steps. With 16 digits, it takes 524,284. For longer strings, RegexBuddy maxes out. The complexity of exploring all the combinations is O(2n).
Clearly, this is far worse than the previous pattern
^\d+\w*@
, which at first sight looks fairly similar. Why? With the earlier pattern, the engine must find a match that is a series of digits \d, then optionally a series of word characters \w. The pie is always divided in that order—first \d tokens, then \w tokens. In contrast, the second pattern ^(?:\d|\w)+@
gives us many more ways to divide up the pie. The pie can be claimed by word characters tokens first, then digit tokens. Or by word character tokens and and digit tokens intermingled in every way imaginable.
In the literature, this symptom is usually shown in the form (A|AA)+, but in my view that's not really helpful. Why would you ever write such a silly pattern? Of course ^(?:\d|\w)+@ is silly too, but it brings out the salient symptom, which is that various components in a quantified alternation are able to "compete" for the same characters.
The lesson here is that when we build an alternation that is quantified, we must make sure that distinct branches cannot match the same characters.
Do character classes present the same risk?
Our vicious pattern
^(?:\d|\w)+@
could be written with a character class: ^[\d\w]+@
Let's forget for a moment that we would never write such a silly pattern—like the others, it is only meant to help us explore potentially explosive patterns. On the face of it, this pattern does the exact same thing as the version with the alternation: at each step, the engine can match either a digit or a word character. Surely it too must explode when the engine fails to find a match, right?
It is not so. Suppose we try
^[\d\w]+@
against the string 123. First, [\d\w]+ greedily matches all the digits. For a moment, let's assume that each of those digits (1, 2, 3) was matched by the \d token inside the character class. Please note that we don't know this for a fact. One engine might notice that \d is a subset of \w and optimize the entire character class to \w before even starting the match attempt. Another engine might have its own set of rules about which tokens in a character class to try first.After the @ token fails to match, the engine looks for positions to backtrack. First, the [\d\w]+ gives up the 3. The engine tries to match the 3 with the token @, and fails. At this stage, in the alternation version, the engine would have tried to match the 3 with the \w token on the right side of the alternation. In this case, however, the engine does not attempt the \w inside the [\d\w].
A character class constitutes a solid block, an atomic token. Once it matches a character, you don't backtrack into it to try different ways to make it match. When you give it up, you give it up.
Therefore, after the @ token fails to match the 3, the engine's next move is to backtrack once more and force the [\d\w] to give up the 2. Next, the @ token fails to match the 2. There is nothing left to backtrack, and the match attempt fails. In RegexBuddy's way of counting, reaching that point takes seven steps. The number of steps required to explore all the paths is directly proportional to the length of the string: the complexity is O(n), which is the best you can ask for, short of making the character class's quantifier possessive — [\d\w]++ — or enclosing it in an atomic group (?>[\d\w]+).
(direct link)
Symptom: Remote Quantifiers Can Reach into Each Other's Territory
If you're reading this page, I'm sure you can tell what's wrong with a pattern such as^.*A.*AB
Suppose our string is AAAAA. The first dot-star can match the whole string, nothing at all, or anything in between. The second dot-star can match a considerable portion of the string, nothing at all, or anything in between. Before the engine can determine that the match must fail, there will be a tug of war between the two dot-stars. It takes 53 steps for the RegexBuddy engine to fail on this short string, and 178 steps on a string that contains ten A characters.
The regex in this example is so short—and we are so used to distrusting dot-stars—that it probably jumps out at you that one dot-star can overreach into the other's territory. But the same situation can arise in less obvious ways.
Consider this pattern, which is only slightly longer than the previous one:
^\d*?1\d*?1B
The lazy \d*? seems to only want to extend up to the first 1, while the second \d*? extends to the second 1. That seems legit.But when the engine has trouble finding a match, the first \d*? can in fact jump over the first 1 if there are more ones to swallow. You may indeed remember from the Lazy Trap that lazy quantifiers can jump over the fence you thought you had made for them, because they expand as far as needed in order to allow a match. The delimiter 1 is not a true fence because the \d token can match it if it needs to.
For instance, against the string 11111C, where the match must clearly fail, at one stage the first \d*? will match all the ones. It takes 59 steps for the engine to fail. With ten ones, it takes 189 steps, and with a hundred ones, it takes 15,354 steps. Once again, we have an explosive quantifier—although nowhere as bad as in our exponential example.
If you thought the ^\d*?1\d*?1B was easy to spot, consider that the same phenomenon could be embedded in something like this:
.*?{START} (lots of stuff in between) .*?{END}
In my view, this is a lot harder to spot—unless you are sensitive to whether the tokens quantified by a lazy quantifier are able to match their intended delimiters.
The lesson here is to carefully consider whether a quantified token might reach over into a section of the string that you had intended for another token to match.
To contain the .*? in .*?{START} to the section before {START}, you need to tweak it or replace it using one of four techniques we have already seen:
✽ Bundle the characters to be matched before {START} together with {START} into an atomic group, forbidding the engine to backtrack and expand the .*? past the first {START}: (?>.*?{START})
✽ Use a Tempered Greedy Token: (?:(?!{START}).)*{START}
✽ Use an Explicit Greedy Alternation: (?:[^{]++|{(?!START}))*+{START}
✽ Use an Unrolled Star Alternation: [^{]*(?:(?:{(?!START}))+[^{]*)*{START}
(direct link)
Further Reading
Time Out feature in C#Regular Expression Denial of Service (ReDos) [Wikipedia]
Static Analysis for Regular Expression Denial-of-Service Attacks [PDF]
A good long look at Regex Conditionals