Capture Group Numbering & Naming: The Gory Details


Capture groups and back-references are some of the more fun features of regular expressions. You place a sub-expression in parentheses, you access the capture with \1 or $1… What could be easier?

For instance, the regex \b(\w+)\b\s+\1\b matches repeated words, such as regex regex, because the parentheses in (\w+) capture a word to Group 1 then the back-reference \1 tells the engine to match the characters that were captured by Group 1.
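
To see the repeated-word pattern at work, here is a minimal sketch using Python's built-in re module. The sample sentence is made up for illustration.

```python
import re

# The repeated-word pattern from the paragraph above.
repeated_word = re.compile(r'\b(\w+)\b\s+\1\b')

sample = "a regex regex tutorial"  # made-up sample text
match = repeated_word.search(sample)

whole_match = match.group(0)   # the full text matched by the pattern
first_group = match.group(1)   # what (\w+) captured into Group 1
print(whole_match, "|", first_group)
```

Group 0 is the overall match, while Group 1 holds the single word that the back-reference \1 had to find again.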

Yes, capture groups and back-references are easy and fun. But when it comes to numbering and naming, there are a few details you need to know, otherwise you will sooner or later run into situations where capture groups seem to behave oddly.

Jumping Points
For easy navigation, here are some jumping points to various sections of the page:

How do Capture Groups Beyond \9 get Referenced?
Naming Groups—and referring back to them
How Capture Groups Get Numbered
Duplicating Group Names
Generating New Capture Groups Automatically (You Can't!)
Resetting Capture Groups like Variables (You Can't!)


How do Capture Groups Beyond \9 get Referenced?

Normally, within a pattern, you create a back-reference to the content a capture group previously matched by using a backslash followed by the group number—for instance \1 for Group 1. (The syntax for replacements can vary.)

In practice, you rarely need to create back-references to groups with numbers above 3 or 4, because when you need to juggle many groups you tend to create named capture groups. However, if you spend time in the smoky corridors of regex, at one time or another you're sure to wonder what is the correct syntax to create back-references to Groups 10 and higher.

So in a regular expression, what does \10 mean?

It looks ambiguous: on the face of it, that could refer either to Group 10, or to Group 1 followed by a zero. In fact, the meaning does depend on the regex engine. If Group 10 has been set, all major engines treat \10 as a back-reference to Group 10. If there is no Group 10, however, Java translates \10 as a back-reference to Group 1, followed by a literal 0; Python understands it as a back-reference to Group 10 (which will fail); and C#, PCRE, JavaScript, Perl and Ruby understand it as an instruction to match "the backspace character" (whatever that is)… because 10 is the octal code for the backspace character in the ASCII table!

To avoid this kind of ambiguity, here is the proper syntax to create a back-reference to Group 10.

✽ C#, Ruby: \k<10>
✽ PCRE, Perl: \g{10}
✽ Java, JavaScript, Python: no special syntax (use \10—knowing that if Group 10 is not set Java will treat this as Group 1 then a literal 0, while JavaScript will treat it as the elusive "backspace character")
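
As a quick check of the Python behavior described above, here is a small sketch with the built-in re module. The ten-letter pattern is only a convenient way to get ten groups.

```python
import re

# Ten capture groups, then \10: Python reads this as a back-reference to Group 10.
ten_groups = re.compile(r'(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)\10')
tenth = ten_groups.fullmatch("abcdefghijj").group(10)

# Only one group exists here, so \10 points at an undefined group
# and Python refuses to compile the pattern.
try:
    re.compile(r'(a)\10')
    compiled_without_group_10 = True
except re.error:
    compiled_without_group_10 = False

print(tenth, compiled_without_group_10)
```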

Replacement Syntax for Group 10
As you probably know, there is no standard across engines to insert capture groups into replacements. Some engines use the \1 syntax, some use $1, some allow both. So how do you insert Group 10 in a replacement?

✽ C#: ${10}
✽ Java, JavaScript, Perl, PHP: $10 (if Group 10 has not been set, Java and JavaScript insert Group 1 then the literal 0, while Perl and PHP treat this as a back-reference to an undefined group)
✽ Python, Perl, PHP: \10 (if Group 10 has not been set, Python and PHP treat this as a back-reference to an undefined group, while Perl inserts the backspace character, whatever that means)
✽ Ruby does not allow Group numbers above \9 in replacements (use a named group).
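
In Python replacements, the \g<number> form sidesteps the ambiguity altogether. The following is a minimal sketch with the built-in re module; the sample strings are invented.

```python
import re

# \g<1>0 means "Group 1, then a literal 0"; a bare \10 would be read as Group 10.
group1_then_zero = re.sub(r'(\d)', r'\g<1>0', "5")

# \g<10> names Group 10 without any ambiguity.
ten_groups = r'(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)'
only_tenth = re.sub(ten_groups, r'\g<10>', "abcdefghij")

print(group1_then_zero, only_tenth)
```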


Naming Groups—and referring back to them

In this section, to summarize named group syntax across various engines, we'll use the simple regex [A-Z]+ which matches capital letters, and we'll name it CAPS.

✽ .NET (C#, VB.NET…), Java: (?<CAPS>[A-Z]+) defines the group, \k<CAPS> is a back-reference, ${CAPS} inserts the capture in the replacement string.

✽ Python: (?P<CAPS>[A-Z]+) defines the group, (?P=CAPS) is a back-reference, \g<CAPS> inserts the capture in the replacement string.

✽ Perl: (?<CAPS>[A-Z]+) defines the group, \k<CAPS> is a back-reference. The P syntax (see Python) also works. $+{CAPS} inserts the capture in the replacement string.

✽ PHP: (?<CAPS>[A-Z]+) defines the group, \k<CAPS> is a back-reference. The P syntax (see Python) also works. To insert the capture in the replacement string, you must either use the group's number (for instance \1) or use preg_replace_callback() and access the named capture as $match['CAPS']

✽ Ruby: (?<CAPS>[A-Z]+) defines the group, \k<CAPS> is a back-reference. \k<CAPS> also inserts the capture in the replacement string.

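✽ JavaScript: (?<CAPS>[A-Z]+) defines the group, \k<CAPS> is a back-reference, $<CAPS> inserts the capture in the replacement string. This requires an ES2018 or later engine; older JavaScript engines have neither named groups nor lookbehind.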
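
Here is what the Python line of this summary looks like in practice, as a small sketch with the built-in re module. The sample strings are invented.

```python
import re

# (?P<CAPS>...) defines the group; (?P=CAPS) refers back to it inside the pattern.
doubled = re.search(r'(?P<CAPS>[A-Z]+)-(?P=CAPS)', "xx AB-AB yy")
doubled_text = doubled.group(0)
caps_value = doubled.group('CAPS')

# \g<CAPS> inserts the capture in the replacement string.
bracketed = re.sub(r'(?P<CAPS>[A-Z]+)', r'[\g<CAPS>]', "ab CD ef")

print(doubled_text, caps_value, bracketed)
```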


How Capture Groups Get Numbered

This section starts with basics and moves on to more advanced topics. You can skip the topics that don't pertain to your regex engine or to regex features you aren't planning to use in the coming days, but the next three paragraphs (in which I've made sure to insert a streak of yellow) are required reading if you want to understand how group numbering works.

In a regex pattern, every set of capturing parentheses from left to right as you read the pattern gets assigned a number, whether or not the engine uses these parentheses when it evaluates the match. This left-to-right numbering is strict—with the exceptions of the branch reset feature in Perl and PCRE (PHP, R…) and of duplicated group names in .NET. If the first capture group as you read the expression from the left never gets captured (probably because it lives on the wrong side of an | alternation operator), it is still Group 1, even though it will be empty.

Capture groups get numbered from left to right. End of story.
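
A short Python sketch (built-in re module) illustrates the rule. Note one Python-specific detail: a group that did not take part in the match is reported as None rather than as an empty string.

```python
import re

# Groups are numbered by the position of their opening parenthesis.
nested = re.match(r'((a)(b))', "ab").groups()

# The left branch never matches "b", yet it still owns the number 1.
# Python reports an unset group as None rather than an empty string.
alternation = re.match(r'(a)|(b)', "b").groups()

print(nested, alternation)
```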
Capture Groups with Quantifiers
In the same vein, if that first capture group on the left gets read multiple times by the regex because of a star or plus quantifier, as in ([A-Z]_)+, it never becomes Group 2. For good and for bad, for all times eternal, Group 2 is assigned to the second capture group from the left of the pattern as you read the regex. What happens to the number of the first group when it gets captured multiple times? It remains Group 1.

The Returned Value for a Given Group is the Last One Captured
Since a capture group with a quantifier holds on to its number, what value does the engine return when you inspect the group? All engines return the last value captured. For instance, if you match the string A_B_C_D_ with ([A-Z]_)+, when you inspect the match, Group 1 will be D_. With the exception of the .NET engine, all intermediate values are lost. In essence, Group 1 gets overwritten each time its pattern is matched.
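
The A_B_C_D_ example can be checked in a few lines of Python with the built-in re module:

```python
import re

quantified = re.match(r'([A-Z]_)+', "A_B_C_D_")

whole = quantified.group(0)          # the full match
last_capture = quantified.group(1)   # only the final iteration survives
group_count = quantified.re.groups   # number of groups defined by the pattern

print(whole, last_capture, group_count)
```

The pattern still defines a single group, however many times the quantifier loops.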

The .NET Exception: Capture Collections
As far as I know, the only engine that doesn't throw away intermediate captures is the .NET engine, available for instance through C# and VB.NET. In the above example, when you inspect the match and request the value for Group 1, C# will return D_, like every other engine. But it will also make available to you a CaptureCollection object that stores all the values captured for Group 1 along the way. To see how this works, see the CaptureCollection section of the C# page.

Perl, PHP, R, Python: Group Numbering with Subroutines and Recursion
Some engines—such as Perl, PCRE (PHP, R, Delphi…) and Matthew Barnett's regex module for Python—allow you to repeat a part of a pattern (a subroutine) or the entire pattern (recursion). For instance, ([A-Z])_(?1) could be used to match A_B, as (?1) repeats the pattern inside the Group 1 parentheses, i.e. [A-Z]. The subroutine should be considered as a function call: in a sense, it has its own "local variable", i.e. its own version of Group 1. Likewise, each depth level of a recursion has its own version of Group 1 (and therefore no matter how many times you recurse, Group 1 is always Group 1 for a given depth).

What this means is that when ([A-Z])_(?1) is used to match A_B, the Group 1 value returned by the engine is A. It also means that (([A-Z])\2)_(?1) will match AA_BB (Group 1 will be AA and Group 2 will be A). Whatever Group 1 values were used in the subroutine or recursion are discarded.

In PCRE but not Perl, one interesting twist is that the "local" version of a Group in a subroutine or recursion starts out with the value set at the next depth level up the ladder, until it is overwritten. This means that in PCRE, ([A-Z]\2?)([A-Z])_(?1) would match AB_CB (but not in Perl). I covered this point in the section on group contents and numbering in recursive patterns.

Perl, PHP, R: Group Numbering in Pre-Defined Subroutines
Perl and PCRE (PHP, R…) allow you to pre-define and name a subpattern. This allows you to build beautifully modular expressions.

The main syntax page explains the (?(DEFINE) … ) syntax in detail, but we'll look at a short example to refresh our memory. There is also a beautiful example on the page on matching numbers in plain English.

(?(DEFINE)(?<CAPS>[A-Z]+)) lets you define a subpattern called CAPS that matches uppercase letters. Thereafter, you can drop (?&CAPS) anywhere in your expression to match upper-case letters. How does group numbering work with these defined subpatterns?

You should think of these defined subpatterns as function calls: capture groups used by a subpattern won't be available outside. To complicate matters, the subroutine itself is assigned a number, so remember to count it when counting group numbers from left to right.

For instance, in (?(DEFINE)(?<TWOCAPS>([A-Z])\2))(?&TWOCAPS), the subpattern (?<TWOCAPS>([A-Z])\2) counts as Group 1 and can in fact be called with (?1) instead of (?&TWOCAPS). In turn, ([A-Z]) counts as Group 2, so that the entire TWOCAPS pattern matches two identical upper-case letters. However, the regex (?(DEFINE)(?<TWOCAPS>([A-Z])\2))(?&TWOCAPS)\2 will fail on AAA because the final \2 is outside the subroutine and therefore refers to a group that has not been set at that depth. On the other hand, (?(DEFINE)(?<TWOCAPS>([A-Z])\2))(?&TWOCAPS)(?2) would match AAB, as (?2) refers to the pattern of Group 2, i.e. [A-Z].

Remember that we number groups from left to right: therefore, after the TWOCAPS definition, the next available group number is 3. As a result, (?(DEFINE)(?<TWOCAPS>([A-Z])\2))(?&TWOCAPS)([BC])\3 will happily match AABB, as Group 3 ([BC]) captured the first B.

Branch Reset (Perl and PCRE, e.g. PHP and R)
There are two exceptions to the strict left-to-right numbering. One is the numbering with duplicated group names in .NET. The other is the branch reset syntax (?|…(this)…|…(that)…) available in Perl and PCRE. A branch reset is introduced by (?|. In the pseudo-example above, the groups capturing (this) and (that) would be assigned the same number.

For details and examples, see the Branch Reset section of the main regex syntax page.

Duplicating Group Names
In .NET, PCRE (C, PHP, R…), Perl and Ruby, you can use the same group name at various places in your pattern. (In PCRE you need to use the (?J) modifier or PCRE_DUPNAMES option.) In these engines, this regex would be valid:

:(?<token>\d+)|(?<token>\d+)#
This particular example could be handled by branch reset syntax (supported by Perl and PCRE), but in more complex constructions the feature can come in handy.

In PCRE, Perl and Ruby, the two groups still get numbered from left to right: the leftmost is Group 1, the rightmost is Group 2.

In .NET, there is only one group number—Group 1. This is one of the two exceptions to capture groups' left-to-right numbering, the other being branch reset. If the group matches in several places, all captures get added to the capture collection for Group 1 (or named group token). See the section on named group reuse.
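
Python's built-in re module is not among the engines listed above, and it rejects a reused group name outright. The sketch below shows that, along with one possible workaround: two distinct names (token_a and token_b, invented for this example) and a small helper that returns whichever one was set.

```python
import re

# Python's re rejects the pattern when a group name is reused.
try:
    re.compile(r':(?P<token>\d+)|(?P<token>\d+)#')
    duplicate_name_accepted = True
except re.error:
    duplicate_name_accepted = False

# Workaround: two distinct (hypothetical) names, then keep whichever one is set.
pattern = re.compile(r':(?P<token_a>\d+)|(?P<token_b>\d+)#')

def token(text):
    """Return the digits captured by either branch, or None if nothing matches."""
    m = pattern.search(text)
    if m is None:
        return None
    return m.group('token_a') or m.group('token_b')

print(duplicate_name_accepted, token(":42"), token("17#"))
```

With distinct names, the two groups are simply Group 1 and Group 2, numbered from left to right as usual.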

Conclusion on Group Numbering
Once you understand the strict left-to-right numbering of capture groups, much potential confusion about "which group should capture what" melts away.

This numbering mode may seem annoying when it stands in the way of intricate logic you would love to inject in your regex. But it's simple and consistent, and that wins the day.

The following two sections present examples that illustrate the main traps of group numbering we just saw, hopefully helping to drill them in.


Generating New Capture Groups Automatically (You Can't!)

One of the wonderful features of regular expressions is that you can apply quantifiers to patterns. As you know, \d+ repeatedly eats up digits. As you also know, you can capture pieces of the subject. For instance, the regex (\d) captures a digit into group 1.

So what happens if you put the two together into this regex?

(\d)+
When I was a young dinosaur, I romantically hoped that if I applied that regex to the string 1234, the engine would eat each of the digits, one at a time, capturing them into four groups that could later be referenced using \1, \2, \3 and \4. Was I deluded!

Things don't work that way—although alone among the crowd, .NET capture collections offer something nearly identical. The capturing parentheses you see in a pattern only capture a single group. So in (\d)+, capture groups do not magically mushroom as you travel down the string. Rather, they repeatedly refer to Group 1, Group 1, Group 1… If you try this regex on 1234 (assuming your regex flavor even allows it), Group 1 will contain 4—i.e. the last capture.

In essence, Group 1 gets overwritten every time the regex iterates through the capturing parentheses.

The same happens if you use recursive patterns instead of quantifiers.

And the same happens if you use a named group: (?P<MyDigit>\d)+

The group named MyDigit gets overwritten with each digit it captures. That is less surprising, and this scenario helps explain the first, because the two phrasings are equivalent. It may not jump out at you from the raw symbols, but in the regex (\d)+, the set of parentheses refers to a specific group—Group 1—even though that group is not explicitly named, as it is in (?P<MyDigit>\d)+. The name is implied.
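
The equivalence between the two phrasings can be verified in Python with the built-in re module, where groupindex reveals the number hiding behind the name:

```python
import re

numbered = re.match(r'(\d)+', "1234")
named = re.match(r'(?P<MyDigit>\d)+', "1234")

last_numbered = numbered.group(1)      # last digit captured by Group 1
last_named = named.group('MyDigit')    # last digit captured by MyDigit

# groupindex maps each group name to its number.
my_digit_number = named.re.groupindex['MyDigit']
same_group = named.group(1) == named.group('MyDigit')

print(last_numbered, last_named, my_digit_number, same_group)
```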

If you were hoping to use a quantifier to spawn multiple captures as the engine travels down the string, forget about it—unless you use .NET capture collections, a feature that lets you inspect the successive captures made by a quantified group.

But there's always a solution. For instance, the "match all" feature of your language lets you break down the string into chunks, each with its set of captures, which you can put back together if need be. For instance, you could use:

C#: Matches() or iterate with Match() and NextMatch()
Python: finditer or findall
PHP: preg_match_all()
Java: matcher() with while… find()
JavaScript: match() or iterate with exec()
Perl: $subject =~ m!…!g (the g flag, in list context or in a while loop)
Ruby: subject.scan
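
For the Python entry in the list above, a minimal sketch with the built-in re module might look like this. The quantifier is dropped so that each digit becomes its own match.

```python
import re

subject = "1234"

# One match per digit, each match carrying its own Group 1.
digits = [m.group(1) for m in re.finditer(r'(\d)', subject)]

# findall returns the Group 1 captures directly when the pattern has one group.
digits_again = re.findall(r'(\d)', subject)

print(digits, digits_again)
```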


Resetting Capture Groups like Variables (You Can't!)

Closely related to the desire (explored above) to spawn new capture groups by using a quantifier, there is the temptation to try and capture a certain named group at different points in the regex—perhaps using a sophisticated machinery of conditionals as you might be used to doing in a programming language.

For instance, you might try to write this to set the PriorError group at various places in the pattern, much like a variable or a flag:

(?x) # free-spacing mode

# this regex attempts to match "dog",
# allowing for a one-character error, e.g. dig or bog, but not bug

d? # "d" is the first character. It is optional.

(?(?<!d)(?P<PriorError>o)|.) # If the previous character is not d, set PriorError and require "o". Otherwise (no error so far), accept any character.

(?(PriorError)g # If PriorError is set, require g (no more errors!)

| (?:(?<!o)(?P<PriorError>g)|.)) # Otherwise (no prior error), if the previous character is not o, set PriorError and require g (no more errors!), otherwise accept any character


This is real cool regex code (I wrote it as a juvenile.) The only problem is that it doesn't work: You are not allowed to set a capture group at two places in the regex.

Regular expressions just aren't a language where you can set and reset variables anywhere you like in a pattern (with the possible exception of a few intricate tricks, for instance using .NET's balancing groups). For some operations, to get what you want, you may need to use more than one regex.
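
For this particular task, one simpler route is to skip the flag altogether and spell out each place where the error may occur. This is an alternative sketch in Python, not a repair of the pattern above, and it assumes that a "one-character error" means a single substituted character. The helper name is_near_dog is invented for the example.

```python
import re

# One alternative per position where the single wrong character may sit.
near_dog = re.compile(r'.og|d.g|do.')

def is_near_dog(word):
    """True if word is "dog" with at most one substituted character."""
    return near_dog.fullmatch(word) is not None

for word in ("dog", "dig", "bog", "bug"):
    print(word, is_near_dog(word))
```

The alternation grows with the length of the word, so this only suits short targets.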







