Recursive Regular Expressions



Recursion is an arcane but immensely helpful feature that only a few regex engines support. Before you invest your time sutdying this topic, I suggest you start out with the recursion summary on the main syntax page. If at that stage you are still interested, you can return here to dive deeper.

Jumping Points
For easy navigation, here are some jumping points to various sections of the page:

Support for Regex Recursion


(direct link)

Supported Engines

Recursion is supported in the following engines:
✽ PCRE (C, PHP, R…)
✽ Perl
✽ Ruby 2+
✽ Python via the alternate regex module
✽ JGSoft (not available in a programming language)

Historical Note
A bit of history: recursion appeared in PCRE seven years before it made it to Perl regex. This explains the differences in behavior between the two engines—which, if you insist on knowing at this early stage, pertain to atomicity and the carryover of capture buffers down the stack. When Perl introduced recursion into its regex it behaved differently from PCRE's, and PCRE stuck to its original behavior for backward compatibility.

Entire Recursion and Local Recursion
The recursive syntax is simple. In PCRE and Perl, you just add (?R) anywhere you like in your pattern. Basically, (?R) means "paste the entire regular expression right here, replacing the original (?R).

In Ruby, you use \g<0>

In its simplest form, a recursive pattern is the same as a plus quantifier.
Of the few people I've seen mentioning recursive patterns on the net, nearly all use it for the same purpose—to match nested parentheses. Is that just tremendously useful, are they all copying one another, or are they copying Jeffrey Friedl's book?

To see how recursion works, let's start with something a little simpler.

The Paste Method

I am going to show you two ways to work with recursive patterns. I call them the "paste method" and the "trace method". The paste method has little practical utility, but in the beginning it can give you an easy way to think about recursive patterns. So we will start with that. Here are a first pattern and a sample subject string.

Recursive Pattern #1
Pattern:   \w{3}\d{3}(?R)?
Subject:   aaa111bbb222

What does this pattern do?

First, it matches three word characters: \w{3}, then three digits: \d{3}. On our test string, with "aaa111", the expression can match on these first steps.

Next, since the expression has worked thus far, the (?R) "pastes" the whole pattern in its own place. (Bear in mind that this is only a manner of speaking. The actual regex engine wouldn't know pasting from pasta.) The question mark at the end of (?R)? is the usual "one or nothing" operator. It makes it so that if the expression in the "pasted pattern" fails, the engine can match "empty" in that same spot.

At this stage, the expanded expression looks like this:

Pattern 1 (first expansion): \w{3}\d{3}(?:\w{3}\d{3}(?R)?)?

The bold part of the pattern is where the full pattern has been pasted in place of (?R). As you can see, I have encapsulated the pasted pattern in a non-capturing group. Why? Because I needed to apply the question mark that was at the end of the (?R)? to the pasted pattern—the non-capturing group is just a way to express that syntax. Without the final question mark, we would not need the non-capturing group. By the way, even though the (?R) is inside parentheses, the parentheses do not capture anything. That is why I used a non-capturing group rather than simple parentheses.

After matching "aaa111", we are now at the beginning of the bold part of the expression. Our first job is to match three more word characters and three more digits. Luckily, with "bbb222", our test string supplies these. Next, we bump against (?R)? once again. The (?R) pastes itself in place. If you wrote out the whole expression at this stage, it would look like this:

Pattern 1 (second expansion): \w{3}\d{3}(?:\w{3}\d{3}#(?:\w{3}\d{3}(?R)?)?)?

This time, I have inserted a bold hash character (#) to show where we are in the expression at the moment. I have also bolded the question mark of the pattern we just pasted in order to emphasize that this new pattern is optional. At this stage, we try to match three word characters again. But we are at the end of our subject string (aaa111bbb222), so the new sub-pattern fails. Thanks to the bold question mark, we can go back to where we were before trying to match the pasted pattern. In the first expansion, that location is the point just before the (?R)?. The engine rolls over the (?R)?, successfully completes the expression, and returns a match: "aaa111bbb222".

As you can see, if you approach them like this, recursive expressions are nothing to be scared of. But the record should show that our recursion accomplished nothing more than the puny plus quantifier in:

Alternate to recursive pattern #1: (?:\w{3}\d{3})+

And as you can imagine, with anything a bit complex, the paste method would quickly become difficult to follow. Fortunately, the "trace method" eliminates the problem.

The Trace Method

As you may have gathered from the other pages on this site, I am a big fan of a regex tool called RegexBuddy, which helps me create and troubleshoot all kinds of expressions. I was thrilled when its author introduced support for recursive regex in version 4 (released in September 2013). A link to the free trial is in the right pane.

With the method I am about to show you, you can analyze recursive patterns of any complexity. All you need is a spreadsheet. The box below traces our recursive pattern #1 as it tries to match our subject.

#    Depth   Position in Regex   String Pos.     Notes                     Backtracks
S1   D0      \w{3}\d{3}(?R)?     aaa111bbb222    Match.
S2   D0      \w{3}\d{3}(?R)?     aaa111bbb222    Match.
S3   D0      \w{3}\d{3}(?R)?     aaa111bbb222    Try Depth 1. 
S4   D1      \w{3}\d{3}(?R)?     aaa111bbb222    Match.                    3
S5   D1      \w{3}\d{3}(?R)?     aaa111bbb222    Match.                    3
S6   D1      \w{3}\d{3}(?R)?     aaa111bbb222#   Try Depth 2.              3
S7   D2      \w{3}\d{3}(?R)?     aaa111bbb222#   No Match. Back to S6.     6,3
S8   D1      \w{3}\d{3}(?R)?#    aaa111bbb222#   D1 succeeds, back to D0   3
S9   D0      \w{3}\d{3}(?R)?#    aaa111bbb222#   D0 succeeds.

Return overall match: aaa111bbb222

Here is how it works.

✽ In the leftmost column of your spreadsheet, enter the step number. For easier reference later, I call these steps S1, S2, S3 etc.

✽ In the next column, you will have the depth level of the recursion. For easier reference, I call these depths D0, D1, D2 etc. In this example you can see how we start at D0, then go to D1 and D2, then back up to D1, and finally D0, which the expression needs to complete in order to succeed. The other levels are all made optional by the question mark at the end of (?R)?, so they are allowed to fail, as happens at Step S7, Depth D2.

✽ In the "Position in Regex" column, you have the expression of the current depth level. For instance, at step S4, we reach depth D1, and the expression shown is the pattern from the depth level being evaluated. By only showing the current level, you do not create a spaghetti of patterns and sub-patterns, as the paste method tends to do. At each step, bold the part of the expression that is being evaluated. When we reach the position at the end of the string, I add a bold hash character (#).

✽ In the next column, paste the string on each row, and bold the part of the string that is being evaluated or that just matched. When we reach the position at the end of the string, I add a bold hash character (#).

✽ In the Notes column, explain what is happening.

✽ In the Backtracks column, keep track of the sequence of steps to which you are allowed to backtrack if the current pattern fails. For instance, at S3, we decide to try D1, so S4 and the following steps on D1 have a backtrack mark to S3 in case D1 fails.

✽ For expressions that have capture groups, create a column for each capture group, and show the value of each capture group at each step. We will see an example of this later.

Navigating the Depths of Recursion
With the Trace method, you can follow a match through complex recursions.

To obtain an overall match, depth 0 (D0) must succeed all the way to the end of the expression. In the middle of D0, the engine may have to dip down a number of levels. These levels all eventually succeed or fail, throwing the engine back to the prior level. At some stage, the engine gets back to D0, and either fails or eventually succeeds in finding a match.

Recursion Depths are Atomic

One feature of PHP recursion that's important to understand is that each recursion level is atomic. What does this mean?

Suppose your expression sends you to D1, made optional by a question mark. You complete D1 successfully. But back on D0, the engine fails to match the next character. Now, D1 may have given the engine a number of options in the form of quantifiers and alternations. When D0 fails, the engine does not go back inside D1 to try the unexplored options (different quantities or other sides of alternations). Instead, it discards D1 as a block. This is the behavior of an atomic group.

This atomic feature of recursion levels can have profound effects on your match. Later on, we will see an example of that. It is a feature of PHP's PCRE engine. Perl, on the other hand, can backtrack into recursion levels.

Leaving a Way Out of the Recursion

In Pattern 1, you saw that whenever you paste the whole expression in place of the recursion marker (?R), you inherit another (?R). You have to, since it's part of the whole expression! To avoid madness and infinite loops, you need to make sure that at some stage the (?R) will stop breeding. In Pattern 1, this was achieved by adding a question mark after the (?R). This ensures that if a recursion level fails, the engine can continue with the match at the level just above.

Another way to make sure you can exit a recursion is to make the (?R) part of an alternation. Consider this:

Pattern 2: abc(?:$|(?R))

This pattern matches series of the string "abc" strung together. This series must be located at the end of the subject string, as it is anchored there by the dollar sign. The pattern matches "abc", "abcabc", but not "abc123". How does it work?

After each "abc" match, the regex engine meets an alternation. On the left side, if it finds the end of string position (expressed by the dollar symbol in the regex), that's the end of the expression. On the other hand, if the end of the string has not yet been reached, the engine moves to the right side of the alternation, goes down one level, and tries to find "abc" once again.

Without some kind of way out, the expression would never match, as eventually any string must run out of "abc"s to feed the regex engine.

Using Recursion to Match Palindromes (mirror words)

Instead of looking at the classic "match nested parentheses" pattern presented elsewhere, I will now show you a pattern that is just as powerful but easier to read. As an exercise, you can tweak it to match nested parentheses. Here's the pattern:

(\w)(?:(?R)|\w?)\1

What does this do? This pattern matches palindromes, which are "mirror words" that can be read in either direction, such as "level" and "peep". Let's unroll it to see how it works.

(?x)     # activate comment mode

(\w)     # capture one word character in Group 1

(?:(?R)  # non-capturing group: match the whole expression again,

|        # OR

\w?)     # match any word character, or "empty"

\1       # match the character captured in Group 1

The pattern starts with one word character. This character is mirrored at the very end with the Group 1 back reference. These are the basic mechanics of how we are "building our mirror". In the very middle of the mirror, we are happy to have either a single character (the \w in the alternation) or nothing (made possible by the question mark after the \w). Note that the pattern is not anchored, so it can match mirror words inside longer strings.

Here is some php code that tests the pattern against a few strings.

<?php
$subjects=array('dontmatchme','kook','book','paper','kayak','okonoko','aaaaa','bbbb');
$pattern='/(\w)(?:(?R)|\w?)\1/';
foreach ($subjects as $sub) {
echo $sub." ".str_repeat('-',15-strlen($sub))."-> ";
if (preg_match($pattern,$sub,$match)) echo $match[0]."<br />";
else echo 'sorry, no match<br />';
}
?>

And here is the output:

dontmatchme -----> sorry, no match
kook ------------> kook
book ------------> oo
paper -----------> pap
kayak -----------> kayak
okonoko ---------> okonoko
aaaaa -----------> aaaaa
bbbb ------------> bbb


It worked perfectly! Well, almost perfectly. For the last string ("bbbb"), a match is found, but not the one we expected.

What is happening there? This has to do with the atomic nature of PHP recursion levels. To explain it properly we will need to use the trace method in order to see every little step taken by the PHP regex engine. For the fully-traced match, click the image.


Now pay close attention to step 22. At this stage, D0 has matched the first b, D1 has matched bbb, completing the string, and D0 cannot continue. If this were Perl, D0 could backtrack into D1, where other D1 matches could be explored:
when D1 matches the bold characters in (\w)(?:(?R)|\w?)\1, it returns bb, and D0 can match the final b, returning the complete intended match: bbbb.

However, this is not how the PCRE engine used by PHP's preg_match function works. Instead of going back into D1, the engine gives up D1 as a block. D0 then completes the match by eating two more "b"s, leaving the last one untouched, and returns "bbb".

May this serve as a warning about the potentially unexpected outcomes of recursive regular expressions!

Numbered Recursion

Now suppose we wanted to match mirror words only if they occupy the entire string. We would have to make sure that the match starts at the beginning of the string, and ends at the end. Easy, you might say, add a caret and a dollar anchor:

Attempt at anchored recursive pattern: ^(\w)(?:(?R)|\w?)\1$

Wait, not so fast… If you add anchors to the expression, when you hit the (?R), the anchors will be pasted back into the middle of the expression, yielding something like this:

Attempt at anchored recursive pattern (first expansion):
^(\w)(?:^(\w)(?:(?R)|\w?)\1$|\w?)\1$

Now you have two carets preceding two distinct characters. This pattern can never match.

Fortunately, you can build a recursive expression without using (?R), which repeats then entire expression. Instead, using the "subroutine expression" (or sub-pattern) syntax, you can paste a sub-pattern specified by a capture group. For instance, to paste the regex inside the Group 1 parentheses, you would use (?1) instead of (?R). Here is how our corrected anchored recursive pattern looks:

Anchored recursive pattern: ^((\w)(?:(?1)|\w?)\2)$

Everything between the two anchors now lives in a set of parentheses. This is Group 1. Therefore, the captured word at the start is now Group 2. In the middle, the repeating expression pastes the subpattern defined by the Group 1 parentheses in place of the alternation: the anchors are left out. At the end, the first character is mirrored by the back reference to Group 2.

This works perfectly. Except, once again, for the "bbbb" string. To see exactly why, you can use the trace method as shown in the earlier example.

Groups Contents and Numbering in Recursive Expressions

In a recursive regex, it can seem as though you are "pasting" the entire expression inside itself. In an expression where you have capture groups, as the one above, you might hope that as the regex shifts to deeper recursion levels and the overall expression "gets longer", the engine would automatically spawn new capture groups corresponding to the "pasted" patterns.

But it is not so. First off, "pasting" is only a way of speaking. And as explained in the section on group numbering, groups are strictly numbered from left to right as you find capturing parentheses by reading the expression on the screen—and I mean the original expression, not one filled with layers of virtual paste operations.

Group numbers are preserved from one depth level to the next. But what about their contents? There are a few simple rules about how group contents travel from one depth level to another. In PCRE,

✽ As you go down depth levels, the contents of a group (such as Group 1) stays the same at first…
✽ But the deeper level can overwrite the contents of a Group set above…
✽ Until you return to the higher level, where the captured groups resume their value.

In contrast, in Perl, as you go down depth levels, the contents of a group (such as Group 1) are wiped out.

To see this, I suggest a simple exercise for which I have prepared an expression and a test string. The expression matches the test string. The hints in the code box explain the value changes of the four capture groups.

Subject:     baacdbcd aacdbcda
Expression:  ^(.)((.)(?:(d)\1\3\4|\3(?2)))[ ]\2\3

Hints:
The recursion (in bold) calls the Group 2 pattern, i.e.: ((.)(?:(d)\1\3\4|\3(?2)))
At depth 0, the left side of the alternation fails.
Value of \1: b throughout the match.
Value of \2: aacdbcd once depth 1 recursion ends.
Value of \3: a on depth 0, c on depth 1, a again once we return to depth 1
Value of \4: a on depth 0 then discarded, d on depth 1, unset again on depth 0

For the fully-traced solution, click the image!


More about Recursive Expressions

There are great examples of regex recursion in several sections of the site. These will show you practical ways to use recursion in situations we haven't explored here.

Quantifier Capture
Matching Line Numbers

For more information about recursive regexes, you can visit the PHP manual's page on recursive patterns and see if some of the examples posted there speak to you.

For me, the best reference on recursive expressions lives in the PCRE documentation written by Philip Hazel, the creator of the PCRE engine.


next
 Quantifier Capture and Quantifier Arithmetic



Buy me a coffee
Buy me a coffee