Backtracking Control Verbs


Several regex flavors have special patterns that instruct the engine about how to match, rather than about what to match, as the other tokens do. In the documentation, these patterns are bundled under the label of special backtracking control verbs, although at first sight some verbs, such as (*ACCEPT) (which tells the engine to return the string it has matched so far) may not seem directly related to backtracking.

(Almost) everything you always wanted to know about backtracking control verbs but never dared to ask.
In practice, the idioms I find the most useful are (*SKIP), (*PRUNE) and the (*SKIP)(*FAIL) combination. If you're only looking for regex that you can put to immediate use, feel free to skip to these sections. Otherwise, stay with me as we explore the various backtracking control verbs.

(direct link)
Jumping Points
For easy navigation, here are some jumping points to various sections of the page:

Engines that Support Backtracking Control Verbs
The Main Backtracking Control Verbs
(*THEN)
Autopossessification
(*PRUNE)
(*SKIP)
(*COMMIT)
Backtracking control verbs inside lookarounds, subpatterns and atomic groups
Backtracking control verbs inside Repeated Groups
(*ACCEPT)
(*FAIL)
(*MARK)
Using (*SKIP)(*FAIL) to Exclude Unwanted Matches
Documentation


(direct link)

Engines that Support Backtracking Control Verbs

At the moment, backtracking control verbs are a rarity:

✽ Perl introduced them in version 5.10 (final release: December 18, 2007)

✽ Four months earlier (28 August 2007), PCRE 7.30 "followed suit" with a back-to-the-future move as Perl 5.10 hadn't yet been officially released. Because of PCRE's wide use, the verbs can be found in contexts such as PHP, R and Apache.

✽ For Python, Matthew Barnett introduced a limited set of the verbs in the September 14, 2015 release of his regex engine (a brilliant alternate to the feature-poor re). So far the supported verbs are (*PRUNE), (*SKIP) and (*FAIL), probably the most important ones.

The slow speed of adoption of the backtracking control verbs reflects some simple truths: they are rarely used, likely because they are poorly known—and little understood by a considerable proportion of those few who have heard of them.

Frankly, this lack of awareness is not an issue because there is so much "basic" material that most people who only occasionally use regex need to master before the features offered by backtracking verbs become meaningful. But with this page, I hope to make them more accessible to those who are interested in polishing their regex—and I count myself firmly in that camp, since:
1. There is always more to learn,
2. There is a lot to forget, and
3. The regex landscape doesn't stand still, as new engines are introduced and old ones evolve.


(direct link)

The Main Backtracking Control Verbs

Four verbs can be called the "main" or "real" backtracking control verbs: (*THEN), (*PRUNE), (*SKIP) and (*COMMIT). In fact, the above list arranges them in "order of strength". Although the most important ones are (*PRUNE) and (*SKIP), it will be easier to gain a well-rounded understanding of the verbs if we explore them in this order.

(direct link)
During Forward Matching, the Verbs are "Invisible"
The first thing to understand about these four verbs is that as long as the engine is moving forward in the pattern and the string, the verbs have no influence on the matching. In other words, they always match. This is a zero-width match, much as the lookahead (?=), which asserts that an empty string can be found at the position immediately to the right of the string cursor: it is always true.

For instance, (*PRUNE) has no influence whatsoever when the pattern \d+(*PRUNE)\D+ is set to work on the string 12 Monkeys: the regex just matches, and the special semantics of (*PRUNE) are never activated.

(direct link)
The verbs activate when the engine needs to backtrack across them
The verbs are triggered at the point when the engine tries to give them up in order to continue a match attempt.

Specifically, all four backtracking control verbs forbid the engine from crossing them while backtracking—in other words, the engine can never go back to their left. What the engine must do at that point depends on the verb, the engine's actions being arranged on a scale of potency from (*THEN) to (*PRUNE), (*SKIP) and (*COMMIT).

To understand how the verbs work, we must therefore understand how backtracking works, which is no small feat considering that various optimizations may influence how different engines backtrack. Nevertheless, treading carefully, you'll usually have a good idea of what behavior to expect.

(direct link)

(*THEN)

This verb is probably the least useful of the four backtracking control verbs. (*THEN) is designed to work within an alternation—on the left side of one or more | tokens.

As with the three other backtracking control verbs, when the engine passes (*THEN) from left to right, nothing happens (the token acts as an always-true assertion at that point in the string).

If the engine needs to backtrack, it cannot backtrack across (*THEN) (same as with the other backtracking control verbs). At that stage, if the engine is on the left side of an alternation (the | token), it gives up trying to match the current branch and starts trying to match the next branch.

Consider this regex:

Comedian: (?:B\w+ (*THEN)Murray|E\w+ (*THEN)Murphy|P\w+ Sellers)

We'll use it against this string:
Comedian: Bill Burr -- Comedian: Peter Sellers
The match is Peter Sellers.

Here's a top-level explanation. What we have here can be seen as a construct of the form
if:then…  elseif:then…  else:then

✽ The B\w+, E\w+ and P\w+ fragments are meant to match a first name.

✽ The idea is that if the first name starts with a B, THEN the last name must be Murray

✽ If the engine matches a first name starting with a B but the last name is something else than Murray, the engine is instructed not to slowly backtrack each of the tokens of the failed branch's pattern B\w+, but to give up the entire branch (this might remind you of an atomic group) and skip directly to the next branch in the alternation, which is E\w+ (*THEN)Murphy.

✽ In that branch, the idea is the same: if the first name starts with an E, THEN the last name must be Murphy, otherwise don't bother backtracking across the first name and skip to the next branch: P\w+ Sellers

In pseudo-regex, the expression reads:
- Match Comedian:
- If the first name starts with B, THEN match Murray
- Elseif the first name starts with E, THEN match Murphy
- Else a match first name starting with P, a space, and Sellers

In our example, the engine starts a match attempt at the beginning of this string:
Comedian: Bill Burr -- Comedian: Peter Sellers

After matching Comedian:, a space character, the Bill in Bill Burr and the space character, the engine matches the always-true (*THEN), but at that stage it fails to match the M in Murray. It cannot backtrack across the (*THEN), so it gives up the content of the alternation's first branch (Bill and a space) and tries to match the E at the start of the middle branch. That fails, so the engine tries to match the P in the third branch. That too fails, so the entire match attempt has failed. The engine advances in the string to the position preceding the o in Comedian and tries a second match attempt. That fails immediately. After a number of other immediately-failed match attempts, the engine tries a match attempt at the position before the C in Comedian: Peter Sellers. This match attempt succeeds.

Here are code snippets if you'd like to try it with the two engines that currently support (*THEN).

# Perl
$comedian_regex =
 qr/Comedian: (?:B\w+ (*THEN)Murray|E\w+ (*THEN)Murphy|P\w+ Sellers)/;
if ('Comedian: Bill Burr -- Comedian: Peter Sellers'
     =~
     $comedian_regex
   ) { print "\$&='$&'\n";  }
else { print "No match\n"; }

// PHP
$comedian_regex =
 '~Comedian: (?:B\w+ (*THEN)Murray|E\w+ (*THEN)Murphy|P\w+ Sellers)~';
echo preg_match($comedian_regex,
                'Comedian: Bill Burr -- Comedian: Peter Sellers',
                $m
                ) ? "$m[0]\n" : "No match\n";


Is this useful?
Not for this simple example. Consider the simpler alternate, where the two (*THEN) have been removed:

Comedian: (?:B\w+ Murray|E\w+ Murphy|P\w+ Sellers)
This matches the same strings.

(direct link)
Autopossessification
The (*THEN) are just meant to speed up the process by cutting down on backtracking. But the time lost to compile this more complex pattern more than offsets any time gained during matching. Furthermore, for this example, there is no time gain during matching (as least with the PCRE engine: I haven't timed Perl). That is because PCRE would never backtrack into the first name in the first place. Before matching, the engine studies the pattern, and an optimization kicks in that turns the \w+ token into a possessive \w++.

PCRE is able to do this because the token that follows \w+ is a space character. The \w token and the space character are mutually exclusive: even if it backtracks into \w+, the engine will never be able to match a space character where a word character was matched earlier. Therefore PCRE can treat \w as a \w++. This process is known by a charming term: autopossessification. Try not to say it while chewing on oatmeal.

When you turn off PCRE optimizations by using PCRE's start of pattern modifiers (*NO_START_OPT) and (*NO_AUTO_POSSESS), the (*THEN) pattern edges out the plain pattern. But only by a hair: this makes sense because Bill does not give \w+ much to backtrack.

That being said, when the expression to the left of a (*THEN) (within the same branch of the alternation) is particularly complex and time-consuming, I am sure there are situations when the verb would be useful. For my part, I have never used it except to try it out.

When (*THEN) is not found within an alternation
The (*THEN) verb works by sending the engine to the next branch of an alternation. When (*THEN) is not found within an alternation, it behaves like (*PRUNE).


(direct link)

(*PRUNE)

If you remember to use it, this verb is probably the second most useful of the four.

As with the other three backtracking control verbs, when the engine passes (*PRUNE) from left to right, nothing happens (the token acts as an always-true assertion at that point in the string).

If the engine needs to backtrack, it cannot cross (*PRUNE) from right to left (same as with the other backtracking control verbs). At that stage, it gives up on the match attempt. As usual, if any characters are left in the string, the engine then advances to the next string position and starts a new match attempt.

(direct link)
Atomic bomb
This behavior probably reminds you of atomic groups. You could design some uses of (*PRUNE) so that it acts the same as an atomic group or a possessive quantifier, but in most cases it will behave differently. This is because you can drop (*PRUNE) anywhere—and once you try to backtrack across it, the whole match attempt explodes. In contrast, with atomic groups and possessive quantifiers, only a fragment of the match is taken out. If it helps, think of (*PRUNE) as an atomic bomb.

(direct link)
When (*PRUNE) is the same as an atomic group
First, let's look at an example where (*PRUNE) behaves the same as an atomic group or a possessive modifier. Let's say you're a security analyst sifting through your database for people called Jones whose first name has six or more letters. Your search should produce strings such as Quincy Jones: music producer, Rashida Jones: actress, Indiana Jones: movie character. At first, you come up with \w{6,} Jones:.*, then you decide to make the first name \w{6,} atomic since there's no point backtracking once you've established that you have the wrong last name. The following three expressions would be equivalent:

- (?>\w{6,}) Jones:.* (atomic group)
- \w{6,}+ Jones:.* (possessive quantifier)
- \w{6,}(*PRUNE) Jones:.* ("atomic bomb")

The reason (*PRUNE) behaves the same as an atomic group or possessive quantifier in this situation is that it appears immediately after the first sub-expression. The match attempt explodes at a point that happens to be the same place where the atomic and possessive versions cause the match to fail: once they've given up the \w{6,} there is nothing left to backtrack.

(direct link)
Usually, (*PRUNE) is not the same as an atomic group
Let's now look at the more general case, where (*PRUNE) behaves differently from an atomic group. This time, we want to match actor names. If the first name has two to four letters, the last name must be Murray; however, Bill Burr is acceptable too.

First, we try with an atomic group:
(?>\w{2,4} )Murray|Bill Burr|Peter Sellers
Against the string Bill Burr -- Peter Sellers, this regex matches Bill Burr. In the first match attempt (at the beginning of the string), after the first name and the space character match in the leftmost side of the alternation, the M in Murray fails. The engine gives up the atomic group, then restarts the match attempt in the same position—before the initial B. This time the middle part of the alternation—Bill Burr—is a direct match.

Let's now try the (*PRUNE) version of this pattern:

\w{2,4} (*PRUNE)Murray|Bill Burr|Peter Sellers
Against the same string Bill Burr -- Peter Sellers, this pattern never matches Bill Burr. Instead, it matches Peter Sellers. The difference is that on the first match attempt, in the leftmost side of the alternation, after the M in Murray fails to match, the engine tries to backtrack across the (*PRUNE). At that stage, the match attempt explodes—the other parts of the alternation are never visited. The engine advances to the next position in the string (between the initial B and the i) and starts a whole new match attempt. After this and several more match attempts fail, the engine starts a new match attempt at the string position immediately preceding the P in Peter Sellers, and the match succeeds with the rightmost branch of the alternation.

Here are some code snippets if you'd like to try this in the three engines that currently support (*PRUNE).

# Perl
$actor_regex = qr/\w{2,4} (*PRUNE)Murray|Bill Burr|Peter Sellers/;
if ('Comedian: Bill Burr -- Comedian: Peter Sellers'
    =~ $actor_regex
   ) { print "\$&='$&'\n";  }
else { print "No match\n"; }

// PHP
$actor_regex = '~\w{2,4} (*PRUNE)Murray|Bill Burr|Peter Sellers~';
echo preg_match($actor_regex,
                'Bill Burr -- Peter Sellers',
                $m
                ) ? "$m[0]\n" : "No match\n";

# Python
# if you don't have the regex package, pip install regex

import regex as mrab

# print(regex.__version__) should output 2.4.76 or higher
actor_regex = r'\w{2,4} (*PRUNE)Murray|Bill Burr|Peter Sellers'
print(mrab.search(actor_regex, 'Bill Burr -- Peter Sellers'))
# <regex.Match object; span=(13, 26), match='Peter Sellers'>


(direct link)
Is this (*PRUNE) useful?
For moderately complex expressions that may entail a lot of backtracking, (*PRUNE) can save the engine a lot of time. It's a powerful weapon to have in your regex arsenal: you drop it anywhere, and it causes the match attempt to explode if the engine ever tries to backtrack across it.

Certain conditions must be met before (*PRUNE) becomes efficient:
- the time saved must outweigh the longer time needed to compile the regex. If the only potential savings is backtracking across two letters (as with a \w{2,4}, this is not a place for (*PRUNE).
- (*PRUNE) must save real-life backtracking. On paper, it may look like a (*PRUNE) saves you some backtracking, but your engine may have performed some optimizations that would prevent backtracking from ever happening anyway: see the section on autopossessification.

When (*PRUNE) does work, in many cases (*SKIP) works even better.

Teaser: in PCRE, you can accomplish similar results to (*PRUNE) with a callout that returns a positive value.


(direct link)

(*SKIP)

The next backtracking control verb on our potency scale is my favorite. I'll give a quick explanation of (*SKIP) in relation to (*PRUNE) for those of you who have been following from the top, then I'll give the long version for those who skipped directly to this section.

The short version
Like (*PRUNE), (*SKIP) acts like an "atomic bomb": when the engine tries to backtrack across it, the match attempt explodes. In the case of (*PRUNE), if any characters are left in the string, the engine then advances to the next string position and starts a new match attempt. In the case of (*SKIP), the engine advances to the string position corresponding to the place in the pattern where (*SKIP) was encountered, and starts a new match attempt at that position. In other words, the engine skips to the string position corresponding to where (*SKIP) was matched—potentially saving a lot of fruitless match attempts.

The longer version
As with the other three backtracking control verbs, when the engine passes (*SKIP) from left to right, nothing visible happens: the token acts as an always-true assertion at that point in the string.

If the engine needs to backtrack, it cannot cross (*SKIP) from right to left (same as with the other backtracking control verbs). Instead, it gives up on the match attempt. At that stage, the way the engine works, it would normally advance to the next string position and starts a new match attempt. Instead, (*SKIP) causes the engine to skip to the position in the string corresponding to where (*SKIP) was matched, and to start the next match attempt at that position. This can save time by avoiding fruitless match attempts.

(direct link)
Double bomb
(*SKIP) is like a bomb that operates on two planes. Not only does backtracking through it blow up the match attempt (as (*PRUNE) does), it also blows up the part of the subject string leading up to it.

(direct link)
PCRE, Python: later match failure always triggers (*SKIP) (unless it is followed by other verbs)
Here it's important to understand how backtracking works. In theory, when a match fails somewhere after a (*SKIP), although you and I can sometimes tell that there is nothing to backtrack, a regex engine will always try to backtrack to the beginning of the string as it looks for other paths to explore. Therefore, if (*SKIP) is the last backtracking control verb in a pattern, a match failure beyond the (*SKIP) should always trigger the (*SKIP). If another verb occurs to the right of the (*SKIP) and to the left of the failure, that verb gets triggered first.

In practice, if you were writing a regex engine, you would try to avoid backtracking when it's obvious that backtracking would be fruitless (for instance, when the pattern contains no quantifiers or alternations). Internally, among all kinds of clever optimizations, Perl, PCRE and Python's regex package are smart enough to avoid fruitless backtracking. But externally, PCRE and Python behave as though they were backtracking all the way back. Perl's behavior is inconsistent.

For instance, given the string aaaardvark aaardwolf, suppose we run a match attempt with this regex:

aa(*SKIP)ard\w+
The engine starts a match attempt at the beginning of the string. It matches aa, passes over the (*SKIP), matches the third a then chokes on the r token because the next character in the string is a. Internally, our engines are smart enough to fail the match right away because there is nothing to backtrack (no quantifiers, no alternations). Externally, the engines behave as though they were conducting a naive path exploration that would cause them to give up the last a then attempt to backtrack across the (*SKIP) in search of a different match.

The (*SKIP) is triggered: the match attempt explodes; the engine skips in the string past the initial aa before starting its second match attempt. At this position, there is only one a character, so this match attempt fails, as do other attempts until we reach the position preceding aaardwolf.

In contrast, without the (*SKIP), the pattern would match the string aaardwark starting on the second a.

(direct link)
Perl: Inconsistent behavior on later failure
Unlike PCRE and Python, when a pattern looks like it should fail to the right of a (*SKIP), the Perl engine does not always fire the (*SKIP). For instance, with the earlier pattern aa(*SKIP)ard\w+ Perl matches aaardvark in aaaardvark aaardwolf.

Perhaps even worse, with a{1,2}(*SKIP)ard\w+, when matching against aaaardvark aaardwolf, after matching aaa and failing on the r token, Perl doesn't try to backtrack across (*SKIP) despite the quantified a{1,2}—a backtrackable expression which should clearly lure the engine to trigger the (*SKIP). The (*SKIP) doesn't fire, and the engine matches aaardvark instead of the expected aaardwolf.

When I reported this as a bug, a Perl team dev kindly explained that (*SKIP) doesn't fire if the "study phase" at the start of a match attempt is able to decide that the match should not even be attempted. This leads me to guess that at the first position in the string, the optimizer sees that the characters ar or aar must be matched but can't be. The match attempt aborts before it even starts, so technically (*SKIP) is never crossed and the engine starts the next match attempt at the position following the first a.

This "optimized behavior" contradicts the pattern writer's directions, so it seems undesirable to me. Since backtracking control verbs are described as experimental, it's entirely possible that the Perl team will decide to change the behavior at some stage.

Here are some code snippets if you'd like to try (*SKIP) in the three engines that currently support it.

# Perl
if ('123ABC' =~
   /123(*SKIP)B|.{3}/
   ) { print "\$&='$&'\n"; }
# $&='ABC'
# with (*PRUNE) instead, the match would be 23A

// PHP
echo preg_match('~aa(*SKIP)ard\w+~',
             'aaaardvark aaardwolf',
                                 $m) ?
             "$m[0]\n" : "No match\n";
# matches aaardwolf, whereas Perl matches aaardvark

# Python
#  if you don't have the regex package, pip install regex

import regex as mrab

# print(regex.__version__) should output 2.4.76 or higher
print(mrab.search(r'aa(*SKIP)ard\w+', 'aaaardvark aaardwolf'))
#  <regex.Match object; span=(11, 20), match='aaardwolf'>


Apart from potentially avoiding lots of fruitless match attempt, (*SKIP) is particularly useful as part of the (*SKIP)(*FAIL) construct, which we'll study shortly.

In the section about (*MARK), we'll see that you can also make (*SKIP) cause the engine to skip to a specific "bookmark" in the string, rather than to the position where (*SKIP) was encountered.


(direct link)

(*COMMIT)

The fourth and "strongest" backtracking control verb is probably the easiest to grasp. As with the other three backtracking control verbs, when the engine passes (*COMMIT) from left to right, nothing happens (the token acts as an always-true assertion at that point in the string).

We are committed to finding a match in this match attempt, or none at all.
If the engine needs to backtrack, it cannot cross (*COMMIT) from right to left (same as with the other backtracking control verbs). At that stage, it gives up on the match attempt. Normally, the engine would then advance to the next string position and start a new match attempt. But when (*COMMIT) fires, the engine abandons any further match attempt, and the overall match just fails.

You can read the (*COMMIT) token as "past this point, we are committed to finding a match in this match attempt, or none at all".

(direct link)
But (*COMMIT) does not always commit
One thing to bear in mind is that other backtracking control verbs may influence the engine's behavior if it fails to match further to the right. For instance, if a (*PRUNE) or a (*SKIP) are crossed to the right of a (*COMMIT), when the match later fails and the engine starts to backtrack, it will try to cross (*PRUNE) or (*SKIP) before (*COMMIT) is reached. This fires the (*PRUNE) or (*SKIP) behavior, so a new match attempts can take place despite the (*COMMIT), which never fires.

At least that's the theory. Perl behaves inconsistently depending on whether (*PRUNE) or (*SKIP) is backtracked into. This is the object of a bug report.

Here are some code snippets if you'd like to see how this works in the two engines that support (*COMMIT).

# Perl
if ('123ABC' =~
   /123(*COMMIT)B|.{3}/
   ) { print "\$&='$&'\n"; }
else { print "No match\n"; }
# No match
# with /1(*COMMIT)23(*SKIP)B|.{3}/ the match is ABC:
#   (*COMMIT) never fires, but (*SKIP) does.
# with /1(*COMMIT)23(*PRUNE)B|.{3}/ there is no match! (Bug.)

// PHP
echo preg_match('~123(*COMMIT)B|.{3}~', '123ABC', $m) ?
                            "$m[0]\n" : "No match\n";
// No match: (*COMMIT) fires

echo preg_match('~1(*COMMIT)23(*PRUNE)B|.{3}~', '123ABC', $m) ?
                            "$m[0]\n" : "No match\n";
// 23A: (*COMMIT) never fires

echo preg_match('~1(*COMMIT)23(*SKIP)B|.{3}~', '123ABC', $m) ?
                            "$m[0]\n" : "No match\n";
// ABC: (*COMMIT) never fires


(direct link)

Backtracking Control Verbs inside Lookarounds, Subpatterns or Atomic Groups

Within a lookaround, subpattern or atomic group, a backtracking control verb only affects the sub-match. For instance, (*COMMIT) only commits the engine with respect to the match within a lookaround—not with respect to the overall match.

Mileage may vary so experimentation is key.

Double Negatives
If you should wish to use backtracking control verbs within negative lookarounds (come on, you know you're asking for trouble!) remember to pay close attention to logic. For instance, a (*SKIP) that causes the failure of the pattern that dwells within a negative lookahead thereby causes the negative lookahead assertion to succeed. Likewise, an (*ACCEPT) (a verb we'll soon see) makes the pattern succeed, resulting in a failure of the negative lookahead assertion.


(direct link)

Backtracking Control Verbs inside Repeated Groups

When a backtracking control verb lives within a repeated group, PCRE fires the verb at the point where it is backtracked. In contrast, Perl, strangely, chooses to ignore verbs when the quantified group has not yet been fully matched. This is the object of a bug report.

These examples may make this a little clearer.

# Perl
if ('1213' =~
   /(?:1(*COMMIT)2)+./
   ) { print "\$&='$&'\n"; }
# $&='121'
# The engine matches the first '12', then it matches
# '1' and the (*COMMIT) token. When the second '2' fails,
# the engine manages to backtrack across (*COMMIT),
# which never fires. Bug?

// PHP
echo preg_match('~(?:1(*COMMIT)2)+.~', '12123', $m) ?
                            "$m[0]\n" : "No match\n";
// No match: (*COMMIT) correctly fires after the second '2'
// fails to match.



(direct link)

(*ACCEPT)

The next two verbs are bundled with backtracking control verbs in the documentation, but I would drop the backtracking part of the description and keep the control. (*ACCEPT) doesn't seem to relate to backtracking. (*FAIL) does cause the engine to backtrack—which is not the same as controlling what happens when the engine backtracks, as in the case of the first four.

(*ACCEPT) is delightfully simple, but its utility is limited. When the engine encounters (*ACCEPT), it immediately returns the portion of the string it has matched so far. If the engine matches (*ACCEPT) in the middle of a capturing group, that group is set to whatever characters have been matched up to that point.

The only use case for (*ACCEPT) that I'm aware of is when the branches of an alternation are distributed into a later expression that is not required for all of the branches. For instance, suppose you want to match any of these patterns: BAZ, BIZ, BO.

You could simply write BAZ|BIZ|BO, but if B and Z stand for complicated sub-patterns, you'll probably look for ways to factor the B and Z patterns. A first pass might give you B(?:AZ|IZ|O), but that solution doesn't factor the Z. Another option would be B(?:A|I)Z|BO, but it forces you to repeat the B. This pattern allows you to factor both the B and the Z:

B(?:A|I|O(*ACCEPT))Z
If he engine follows the O branch, it never matches BOZ because it returns BO as soon as (*ACCEPT) is encountered—which is what we wanted.

Here is sample code to try (*ACCEPT) in the two languages that support it.
For each language, the second fragment demonstrates the behavior inside a capture group.

# Perl
if ('BOZ' =~
   /B(?:A|I|O(*ACCEPT))Z/
   ) { print "\$&='$&'\n"; }
# BO
# with 'BAZ' as subject, the match would be BAZ


if ('BOZX' =~
   /B((?:A|I|O(*ACCEPT))Z)X/
   ) { print "\$1='$1'\n"; }
# Group 1: O
# with 'BAZX' as subject, Group 1 would be AZ

// PHP
echo preg_match('~B(?:A|I|O(*ACCEPT))Z~',
                'BOZ',
                 $m) ? "$m[0]\n" : "No match\n";
// BO
// with 'BAZ' as subject, the match would be BAZ

echo preg_match('~B((?:A|I|O(*ACCEPT))Z)X~',
                'BOZX',
                 $m) ? "$m[1]\n" : "No match\n";
// Group 1: O
// with 'BAZX' as subject, Group 1 would be AZ


Are there other ways to factor the B and Z patterns? Sure. Conditionals come to mind, for instance:
- Option 1: B(?:(A|I)|O)(?(1)Z) (if A or I have been captured to Group 1, then match Z)
- Option 2: B(?:A|I|(O))(?(1)|Z) (if O has been captured to Group 1, then match the empty string, otherwise match Z)

Please note that (*ACCEPT) is not the opposite of (*FAIL)


(direct link)

(*FAIL)

This verb is available in Perl, PCRE and Python's alternate regex package.

(*FAIL) means at the present token in the string, the current token does not match. The result is no different from trying to match a b character with a p token: the engine chokes on the token, and its next move is to backtrack in order to find a different way to make the current match attempt succeed.

Alternates to (*FAIL)
If you want to shave three characters, you can write (*F) instead of (*FAIL).

And in most languages that don't support (*FAIL), you can simply use the classic (?!), which has the same effect. How does (?!) work? It's a negative lookahead that asserts that at the current position in the string, it is not possible to match the empty string. Since the empty string can be matched at any position, the assertion fails.

I recall reading in one of the engines' docs that internally (*FAIL) is translated to (?!)—unless it was the other way around.

The regex syntax offers many options besides (?!) to force the engine to fail at a certain point in the pattern—consider contradictory pairs such as (?=A)B. But (?!) is the most popular, probably because it is so compact. Some of these ways are explored in the trick about forcing a failure.

(direct link)
Use Cases for (*FAIL)
There are a number of use cases for (*FAIL) and its equivalents.

✽ You can use (*F) within conditionals to enforce certain balancing conditions. For instance, (INTRO)?(?:MAIN1|MAIN2(?(1)|(?!))) is a long-winded way of ensuring that MAIN2 is only matched if INTRO has been matched before. This approach is used elegantly in .NET balancing groups. It is also developed in the section on Conditionals At Work: Controlling Failure

✽ In my trick to write pre-defined subroutines for engines that don't support it, I use (*FAIL)—or its alias (?!)— to force the engine to backtrack after defining a capture group without intending to consume characters at that position in the string.

✽ In a moment, we'll look at the (*SKIP)(*FAIL) construct, which is a delightful way of excluding certain patterns from the match.

✽ In Perl, which has extensive callback facilities, (*FAIL) can be used to explore all the branches of a match tree. Consider this example:
'abc' =~ /\w+(?{print "$&\n";})(*F)/
This prints abc, ab, a, bc, b, c. After the \w+ matches the whole string, the code capsule prints the match, then (*F) forces the engine to backtrack. The engine gives up the c, the callback prints ab, the (*F) forces the engine to backtrack, and so on.

(direct link)
(*FAIL) is not the opposite of (*ACCEPT)
It's natural to imagine that (*FAIL) would be the opposite of (*ACCEPT), but that is not the case. If it were the opposite of (*ACCEPT), then (*FAIL) would mean at this point in the match, fail the match attempt. The engine would then throw away whatever had been matched thus far, and perhaps begin a new match attempt at the next starting position in the string.

(direct link)
Forcing the Match Attempt to Fail
If you truly want the opposite of (*ACCEPT) in order to abort the match attempt, you will need to use something like (*PRUNE)(*FAIL) or (*COMMIT)(*FAIL):

✽ In the case of (*PRUNE)(*FAIL), once the engine encounters (*FAIL), it tries to backtrack in order to find a successful match. When it tries to backtrack across the (*PRUNE), the match attempt fails. The engine then tries a new match attempt at the next starting position in the string, if any.

✽ In the case of (*COMMIT)(*FAIL), once the engine encounters (*FAIL), it tries to backtrack in order to find a successful match. When it tries to backtrack across the (*COMMIT), the match attempt fails, and the engine also abandons any further match attempts.

Soon we'll study a surprisingly useful variation on these themes: (*SKIP)(*FAIL).


(direct link)

(*MARK)

This verb is used either on its own or in conjunction with (*SKIP). You use it to tag (and in certain cases "bookmark") a position in the string, as in (*MARK:after_the_digits).

Note that (*:some_tag) is an alias for (*MARK:some_tag)

When the verb is used by itself, you typically pepper several instances of it in your code, as in (*MARK:tag1), (*MARK:tag2). You are later able to determine which path the engine has used to return the match by inspecting the $REGMARK variable in Perl or PCRE's pcre_extra data block (please refer to the documentation).

When the verb is used in conjunction with (*SKIP), the (*SKIP:some_tag) syntax specifies a "bookmark" in the string, the position where the engine should start its next match attempt if the match attempt explodes when the engine tries to backtrack across (*SKIP:some_tag).

If you'd like to see how this works, here's a code sample.

# Perl
if ('123ABC456' =~
  /123(*MARK:past_digits)[A-Z]+(*SKIP:past_digits)9..|.{6}/
   ) { print "\$&='$&'\n"; }
# $&='ABC456'
# instead of skipping past ABC (default (*SKIP) behavior),
# the engine only skips past 123

// PHP
echo preg_match(
  '~123(*MARK:past_digits)[A-Z]+(*SKIP:past_digits)9..|.{6}~',
  '123ABC456',
  $m) ? "MARK $m[0]\n" : "No match\n";
// match: ABC456
// instead of skipping past ABC (default (*SKIP) behavior),
// the engine only skips past 123

Marking with (*PRUNE) and (*THEN)
You can use (*PRUNE:some_tag) and (*THEN:some_tag) to record the matching path if you'd like to later inspect the $REGMARK variable in Perl or PCRE's pcre_extra data block.

One difference between these and (*MARK:some_tag) is that while (*SKIP:some_tag) looks for (*MARK:some_tag), it does not look for (*PRUNE:some_tag) or (*THEN:some_tag).


(direct link)

Using (*SKIP)(*FAIL) to Exclude Unwanted Matches

You might recall from the Elements of Regex Style that often, saying what you don't want is an important strategy to achieve your regex goals. We typically do that with negative character classes such as \D and [^"], or with negative assertions such as (?!A).

Sometimes, we want a bit more sophistication to express what we don't want. For instance, suppose we'd like to match all individual words in a text (defined by the pattern \b\w+\b) except if such words live between curly braces, {like these}.

For this kind of situation, the (*SKIP)(*FAIL) construct is wonderful. With it, instead of cooking up some convoluted negative logic, you express exactly what you want to avoid; if you find it, you skip it; if you don't find it, you match what you want.

Here is how we could match single words so long as they don't live inside a set of curly braces.

{[^}]*}(*SKIP)(*FAIL)|\b\w+\b
Note that for this pattern, we assume we know that our text can never contain {nested{braces}}.

Here is how the pattern works.

✽ First, notice the central pivot around the alternation |.

✽ On the left side, we use {[^}]*} to try to match what we don't want: anything within a set of curly braces. For this, we match a left curly brace, then [^}]* matches zero or more characters that aren't a right curly brace (another example of saying what we don't want), then we match a right curly brace.

✽ If the engine matches a set of curly braces, it then encounters the (*SKIP) verb, which always matches. Next, it encounters the token (*FAIL), which always fails.

✽ At that stage, the engine tries to backtrack across (*SKIP) in the hope of finding a different way to match within the current attempt. But (*SKIP) causes the match attempt to explode, and (*SKIP) also tells the engine to start the next match attempt at the string position corresponding to where (*SKIP) was encountered. This means the engine will never again look at the set of curly braces it just matched. The content we want to avoid has been skipped (matched then thrown away).

✽ At the start of each match attempt, if the engine is unable to match a set of curly braces, it jumps to the right branch of the | alternation. There, \b\w+\b attempts to match an individual word. If this fails, the match attempt fails. The engine then advances to the next position in the string, and tries a whole new match attempt (once again trying to match a set of curly braces first.)

In effect, (*SKIP)(*FAIL) says:

Throw away anything you can match to the left of me.


This is a very useful technique, and it also appears on the page about the best regex trick.

If you'd like to try (*SKIP)(*FAIL), here is sample code for the three engines that support it.

# Perl
while ('good words {and bad} {ones}' =~
      /{[^}]*}(*SKIP)(*FAIL)|\b\w+\b/g
      ) { print "matched: '$&'\n"; }
# matched: 'good'
# matched: 'words'

// PHP
if(preg_match_all('~{[^}]*}(*SKIP)(*FAIL)|\b\w+\b~',
    'good words {and bad} {ones}',
    $matches)) var_dump($matches[0]);
/* array(2) {
              [0]=>
              string(4) "good"
              [1]=>
              string(5) "words"
            } */


# Python
# if you don't have the regex package, pip install regex

import regex as mrab

# print(regex.__version__) should output 2.4.76 or higher
print(mrab.findall(r'{[^}]*}(*SKIP)(*FAIL)|\b\w+\b',
                   'good words {and bad} {ones}'))
# ['good', 'words']


(direct link)

Documentation

That's about all I have to say about backtracking control verbs for the time being. If you'd like to learn more, feel free to dip into the documentation—though my hope is to have made the verbs far more approachable than in those documents.

PCRE documentation on backtracking control verbs
Perl documentation on backtracking control verbs

Smiles,

Rex

next
 Quantifier Capture and Quantifier Arithmetic




1-1 of 1 Threads
Mike
October 18, 2015 - 05:51
Subject: Thank you so much for this

I've been trying to get as well versed in regex as possible and this was a topic that I felt I only had a loose understanding on. I've only skimmed this section a little, but I plan on using this as a more robust reference for future practicing. Very helpful for removing SQL comments, etc.
Reply to Mike
Rex
October 18, 2015 - 08:53
Subject: RE: Thank you so much for this

Hi Mike,

Thank you so much for your positive feedback, you made my day. I only finished this page five days ago after putting days of work into it.

I thought it was an esoteric topic that would attract few readers, so it's wonderful to see that at least one person is enjoying it already.

Wishing you a fun weekend,

Rex


Leave a Comment






All comments are moderated.
Link spammers, this won't work for you.

To prevent automatic spam, we require that you type the two words below before you submit your comment.