Capturing Quantifiers and Quantifier Arithmetic


This page of the regular expressions "black belt program" discusses a special power that is only available to those who have managed to steal a green egg from the velociraptor by the cave near the mountain top: capturing quantifiers, and quantifier arithmetic.

I have not used this feature myself because I got lost on the way to the cave. Joke aside, quantifier capture is a feature I started wondering about in January 2014 and mulled over in the background until drafting this page three months later. It's a simple idea, so I imagined it would be implemented in some engines.

However, after making enquiries from several people who write regex engines, it appears that the feature is not implemented at the moment. On April 13 2014, Jan Goyvaerts, who is probably the arch-expert on cross-engine syntax because he has to maintain many engines for RegexBuddy, wrote:

I do not know of any regex flavors that allow you to capture a quantifier by itself or that allow you to use a backreference inside a quantifier in any way.

If the situation changes, please write at the bottom of the page.

Quantifier Capture

Sometimes, it would be handy to be able to "capture a quantifier"—meaning some way to capture the number of times a quantifier such as "+" or "*" is able to match.

For instance, suppose you were interested in matching balanced lines such as these:

@@ "Snow White" == "1987" -- "Animation" // "Erich Kobler" or
@@@@ "Star Wars" ==== "1977" ---- "Science Fiction" //// "George Lucas" but not
@@@ "Groundhog Day" = "1993" ----- "Comedy" // "Harold Ramis"

The point is that you want to make sure you have the same number of "@" characters as of "=" character, "-" characters and "/" characters.
For that purpose, a regex such as the following will not do because it will match all lines.

^@+ "[^"]+" =+ "[^"]+" -+ "[^"]+" /+ "[^"]+"$

Balancing the number of {@,-,=,/} is fairly straightforward in languages that use .NET regex thanks to the balancing groups feature, and I give a demo of this lower down. In PCRE, it is also possible but far less straightforward, as you need to use some neat tricks: the syntax is far too complex and error-prone for it to be useful on a regular basis. I explain these tricks (which I did not invent) lower on the page.

What we really need is a syntax such as the following:

(*) captures the number of characters the star quantifier is able to match. Likewise, (+), (?) and ({2,9}) capture the number of characters these quantifiers are able to match.

\q1 refers to the first captured quantifier, \q2 refers to the second captured quantifier and so on.

\q+1 refers to the next captured quantifier. \q+2 refers to the second-next captured quantifier, \q-3 refers to the third-previous quantifier.

With this syntax, we could easily match our desired pattern with the following regex:

^@(+) "[^"]+" ={\q1} "[^"]+" -{\q1} "[^"]+" /{\q1} "[^"]+"$

Alternately, using relative addressing, you could use either of the following expressions, where the quantifier is captured further down the string..

^@+{\q1} "[^"]+" ={\q1} "[^"]+" -{\q1} "[^"]+" /(+) "[^"]+"$
^@+{\q+2} "[^"]+" ={\q+1} "[^"]+" -(+) "[^"]+" /{\q-1} "[^"]+"$


The syntax would also allow you to use captured quantifiers in range quantifiers such as {\q1,\q2}.

An Alternative to Recursion
In some cases, capturing quantifiers would elegantly replace recursion. For instance, suppose you want to match a number of zeros and ones framed by the same number of Ls and Rs, a in "LLL100110100RRR".

With recursion, you can write:
L((?R)|[01]*)R

With captured quantifiers, you can write:
L(+)[01]+R{\q1}


Usually, the non-recursive version would be easier to write and read.

Quantifier Arithmetic

A natural extension of capturing quantifiers is to play with the captured integers. For instance, it won't be long until you want to match twice as many (i.e., 2x) instances of a certain character or sequence than you matched of another sequence or character. Or you might want to match three more instances (x+3), or some other function (such as 2x+5).

This could be implemented either directly in the syntax, or with the help of callbacks.

A biologist looking at genomes comprised of the letters A, T, G and C might have a particular interest in finding sequences where a certain number x of "A"s are followed by any number of T and Gs, then "2x + 1" Cs, as in:

ATGGTTTGTCCC, but not
AAATGGTTTGTCCC

Without callbacks, the syntax to accomplish this kind of arithmetic could become cumbersome. Here are two possible implementations, without and with callback:

A(+)[TG]+C{2*\q1+1}

A(+)[TG]+C{CALL_VERB somefunction(\q1)}

In conclusion, it seems to me that quantifier capture (as a first step) and quantifier arithmetic (as a second step) would nicely enhance the expressiveness of regular expressions and would be a logical extension of the syntax.

Current Solutions to Balance the Number of Certain Characters

If you don't have quantifier capture in your regex flavor, you can still check that strings like the one shown higher on the page are balanced. To make the example easier to understand, I simplified the kind of strings we are trying to validate:

L555M002R or
LLL88MMM7281RRR but not
LLL88M7281RRR

The idea is that we want to make sure we have the same number of L, M, and R characters (think of them as "Left", "Middle" and "Right" separators.


.NET: Balancing Groups

In languages that can tap into .NET regex, checking that a string such as LL00MM11RR has the same number of L, M, and R characters is fairly straightforward thanks to balancing groups:

^(?:L(?<c1>)(?<c2>))+[^LM]+(?<-c1>M)+[^MR]+(?<-c2>R)+(?(c1)(?!))(?(c2)(?!))$
This is fairly long, but it is simple.

• First, notice that the string is anchored by ^ and $ to prevent the engine from matching a balanced string within an unbalanced string.

• The non-capturing group (?:L(?<c1>)(?<c2>))+ is quantified with a + and matches all the L characters. It also contains two named capture groups c1 and c2. These groups are empty, so they don't match anything; or, rather, they match an empty string. You may know that .NET deals with quantified capture groups in a special way: it adds the successive captures of a given group to a collection of captures. This means that each time an L is matched, a capture is added to the capture collection for named groups c1 and c2. We will use c1 and c2 as counters. By the time the engine exits the quantified non-capturing group, the c1 and c2 groups both hold the same number of captures, which is the number of Ls that were matched.

• The [^LM]+ eats up all the characters up to the first M.

• The (?<-c1>M)+ is a quantified group that eats up all the Ms. It may look like it is a named group called "-c1", but -c1 is .NET regex syntax to pop (and discard) the last capture in group c1. This means that each time we match an M, one c1 capture is thrown away. In essence, we are decrementing our c1 counter. If we have already matched as many Ms as Ls, group c1 is empty. If there are any Ms left at that point, when the engine tries to pop one capture from the c1 group, it cannot do so, and the regex fails. This ensures that there cannot be more Ms than Ls in the string. Later, we will add a check to ensure that there are no more Ls than Ls.

• The [^MR]+ eats up all the characters up to the first R.

• The (?<-c2>R)+ matches all the Rs while decrementing c2, ensuring that there cannot be more Rs than Ms.

• The (?(c1)(?!)) is a conditional that checks if capture group c1 is set. If c1 is set, the engine tries to match (?!), which is a trick to force the regex to fail. The conditional therefore forces the regex to fail if there are captures left in gorup c1, which would mean that we have not matched enough Ms to fully decrement our c1 "counter". This expression ensures that we cannot have more Ls than Ms.

• Likewise, (?(c2)(?!)) ensures that we cannot have more Ls than Rs.

That's a bit of syntax to explain, but I hope you'll agree that once you understand that syntax, writing such an expression is straightforward.


PCRE: Balancing by Building Capture Groups Accretively

In PCRE, checking that a string such as LL00MM11RR has the same number of L, M, and R characters is possible, but tricky.

This trick and the next are shown for a more complicated pattern on this Stack thread. I have modified the recipes slightly for easier comprehension, but not in its essence. Later if you are interested you can inspect my version for the "Star Wars pattern".

I gave detailed comments in the code box below, but it may be helpful to have a high-level overview of certain aspects before diving in.

Group 1 will capture all the "L" characters. Group 2 captures the Ms, Group 3 captures the Rs.

The expression is in two parts. First, we match all the Ls, and as we do so, a lookahead checks that we have at least as many Ms and Rs. Second, when we are done matching all the Ls and have satisfied ourselves that we have at least as many Ms and Rs as we want, we continue the match, specifying exactly the characters we want, which (among other effects) ensures we have no more Ms and Rs than needed.

In the first part, as the engine matches each L character at a time, the content of Groups 1, 2 and 3 change each time a new "L" is matched. The parentheses of Group 2 actually refer to the current value of Group 2, i.e., \2, in the expression (\2?+M). What does this do? After the first L is matched, Group 2 is undefined, and the "?" in \2?+ makes \2 optional, allowing that part of the expression to match. Group 2 then matches the first M, and the value of Group 2 becomes "M". After the second L is matched, Group 2 is still "M", so the \2 in (\2?+M) matches "M", then we match the second "M", and the value of Group 2 becomes "MM". The "+" in \2?+ ensures that if we fail to match the M that follows, the engine doesn't backtrack by activating the optional \2? and dropping the first M. Without the "+", we could match strings such as "LLL1M2R". See the section on atomic groups.

Please understand that although Group 2 looks like a self-reference, the expression in Group 2 refers to the previously stored value. Therefore, the value \2 of Group 2 after the closing parenthesis is not what it was inside the parentheses.

Ready? Here we go.

^((?:L(?=[^M]*(\2?+M)[^R]*(\3?+R)))+)\d+\2\d+\3$

Easy, right?… Just kidding. Here is the commented break-down.

(?xm) # Free-spacing mode, multi-line
^ # Assert Beginning of Line
( # Begin Group 1
  (?:     # Non-capturing group, which will be repeated
  L       # Match one L
  (?=     # Begin Lookahead
    [^M]* # Match any chars that are not M
    (     # Begin Group 2
    \2?+  # Match Group 2 if possible, and if so
          # do not later give up the match.
          # In other words if Group 2 can be matched, match it.
          # This could be expressed as (?(2)\2)+
          # After we match the first L, Group 2 starts out undefined
          # so the ? will be used.
          # After we match the 2nd L, Group 2 is M
          # so at that point we must match M.
          # After we match the 3rd L, Group 2 is MM
          # so at that point we must match MM.
    M     # Match M
    )     # End Group 2
          # After matching the first L, Group 2 is M
          # After matching the second L, Group 2 is MM
          # After matching the third L, Group 2 is MMM
          # etc.
    [^R]*   # Match any chars that are not R
    (\3?+R) # Group 3 follows the same principle as Group 2
            # If you have a hard time following, simplify
            # the test string
            # and remove the Group 3 section
  )    # End Lookahead
  )+   # Repeat the non-capturing group
)      # End Group 1

# If we stopped right there, the regex would match strings
# that have x characters L and at least x each of characters {M,R}
# but possibly more: there would be no guarantee of balance

# To validate that we have no more than needed,
# we now match (or lookahead) precisely what we want
# after all the L characters we have matched.

\d+ # Match some digits
\2  # Match the characters captured in Group 2
\d+ # Match some digits
\3  # Match the characters captured in Group 3
$   # Assert End of Line
	

Hope you enjoyed this one! Working through it is a great exercise.

But if it shows one thing, apart from the cleverness of certain coders, it's that realistically, to balance strings as we have done, you need something like the quantifier capture syntax advocated on this page.

In the unlikely case you'd like to see the same principle applied to the @@@@ "Star Wars" ==== "1977" ---- "Science Fiction" //// "George Lucas" example from the top, the code is here.

PCRE: Balancing with Recursion

As a reminder, we are trying to check that a string such as LL00MM11RR has the same number of L, M, and R characters.

This method uses regex recursion. For those who jumped in to this point from another page, the task at hand is to validate balanced strings such as

L00M123R or
LLL22MMM1111RRR but not
LLL22M1111RRR

The idea is that we want to make sure we have the same number of L, M, and R characters (think of them as "Left", "Middle" and "Right" separators.

The overall structure of this expression is that of a password validation regex. We have two lookaheads to validate some conditions, then we watch what we want, if possible. The first lookahead validates that the Ms balance with the Ls. The second lookahead validates that the Rs balance with the Ls.

The structure of the first lookahead (which is equivalent to the second one) is as follows. We begin Group 1, which is the group whose pattern we will repeat (recursion). Group 1 matches "L, stuff, then M". The "stuff" in the middle is either another instance of Group 1 (L, stuff, M) or, if there are no more Ls to consume, any characters that are neither L nor M. If you trace this recursion on paper, you will see that for "LLL00MMM123RRR", within this lookahead the engine matches L (level 0), then L (level 1), then L00M (level 2), then M (closing level 1), then M (closing level 0).

Ready? Here we go.

(?xm) # Free-spacing mode, multi-line
^     # Assert Beginning of String

# The function of the following lookahead is to check
# that the Ms balance with the Ls
(?=   # Begin Lookahead
  (   # Begin Group 1
      # Group 1 will match "L stuff M", where
	  # "stuff" may recurse to "L stuff M"
	  # The base case for "stuff" when we run out of Ls
	  # will be characters that are neither L nor M.
    L # Match L
      (?> # Begin Atomic group
        (?-1)   # Recursion to the next level:
                # Match the pattern defined by the previous 
                # defined group, i.e. Group 1, i.e. match the
                # next L then what follows...
        |       # OR (if we cannot match an L)
        [^LM]++ # Match characters that are neither L nor M
                # but do not give up the match
                # if what follows fails (possessive)
      ) # End Atomic Group
    M   # Match M, completing the "L stuff M" of Group 1.
  )     # End Group 1
  (?!M) # Assert that the next character is not "M"
) # End Lookahead

# The next lookahead has the same structure as the previous one
# Its function is to check that the Rs balance with the Ls
(?=(L(?>(?-1)|[^LR]++)*R)(?!R)) 

# We now know that the Ls, Ms and Rs are balanced
# What's left to do is to actually match what we want.

L+\d+M+\d+R+$        # Match what we want
	


Was that awesome?

I thought so when I worked through these tricks! To understand them, I carefully commented each line in RegexBuddy, then worked an example on paper. I highly recommend this procedure as a way to understand such complex expressions.


next
 Regex Tricks



Buy me a coffee
Buy me a coffee