Regex Anchors


Anchors belong to the family of regex tokens that don't match any characters, but that assert something about the string or the matching process. Anchors assert that the engine's current position in the string matches a well-determined location: for instance, the beginning of the string, or the end of a line.

This type of assertion is useful for several reasons. First, it is expressive: it lets you specify that you want to match digits at the end of a line, but not anywhere else. Second, it is efficient: when you tell the engine that you want to find a complex pattern at a given location, it doesn't have to spend time trying to find that pattern at many other locations. This is why the regex style guide recommends using anchors whenever possible, even when your regex would match without them.

Jumping Points
For easy navigation, here are some jumping points to various sections of the page:

Anchors vs. Boundaries and Delimiters
Caret ^: Beginning of String (or Line)
Dollar $: End of String (or Line)
Beginning of String: \A
Very End of String: \z
End of String (or Before Optional Line Break): \Z
Bringing Some Order: Multiline All the Time
Beginning of String or End of Previous Match: \G
Anchors Anywhere, Anytime
Anchors seen as Delimiters



Anchors vs. Boundaries and Delimiters

Why is ^ called an anchor while \b is called a boundary and (?<!.) is called (by me) a delimiter? These semantic questions are explored in Boundaries vs. Anchors on the boundaries page, and in Anchors Seen as Delimiters lower on this page.



Caret ^: Beginning of String (or Line)

In most engines (but not Ruby), when multiline mode hasn't been turned on, the caret anchor ^ asserts that the engine's current position in the string is the beginning of the string. Therefore, ^a matches an a at the beginning of the string.

Ruby
In Ruby, ^ always asserts that the engine's current position in the string is the beginning of a line—which is either the beginning of the string or a position following a line feed character \n. Therefore, ^a matches the a on both of these lines:
apple
apricot


Multiline Mode: ^ Matches on Every Line
Apart from Ruby, all engines allow you to turn on a mode where the anchors ^ and $ match on every line. That mode is often turned on with an inline modifier (?m), but this varies across engines and applications, and the Modifiers page has instructions for how to turn on multiline mode in various languages.

When that mode is set, ^a matches the a on both of these lines:
apple
apricot
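
As a quick check, here is a minimal sketch using Python's built-in re module, where multiline mode is turned on with the re.MULTILINE flag (the inline (?m) modifier would work too). The variable names are only illustrative.

```python
import re

text = "apple\napricot"

# Default mode: ^ only matches at the beginning of the string.
default_starts = [m.start() for m in re.finditer(r"^a", text)]

# Multiline mode: ^ also matches right after each line feed.
multiline_starts = [m.start() for m in re.finditer(r"^a", text, re.MULTILINE)]

print(default_starts)    # [0]
print(multiline_starts)  # [0, 6]
```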




Dollar $: End of String (or Line)

The dollar anchor changes its meaning depending on which engine we use and on whether the multiline mode is on. Let's start with the narrowest meaning then gradually expand.

All Engines: Match at the Very End of the String
All engines agree on one string position where the dollar anchor $ is allowed to match: at the very end of the string. For instance, if your string is the apple, the $ anchor matches the position after the e, and e$ matches the final e, but not the first.

For convenience, most engines build some flexibility into $, allowing it to match before final line breaks—we'll examine this in the following sections.

JavaScript, however, does not have this kind of flexibility (which won't surprise you if you remember that JavaScript is by a long shot the worst among all the major regex engines). If you add a line break at the end of your the apple string, as in the apple\n, the JavaScript engine no longer matches the final e.

All Engines Except JavaScript: Match Before One Final Line Break
In all major engines except JavaScript, if the string has one final line break, the $ anchor can match there. For instance, in the apple\n, e$ matches the final e.

This is convenient: When you want to match some characters at the end of a string, you don't have to worry about whether it might have a line break at the end.
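
To see this flexibility in action, here is a small sketch in Python, one of the engines where $ tolerates a single final line feed. The helper name final_e_position is made up for this example.

```python
import re

def final_e_position(text):
    """Return the index where e$ matches, or None if it does not match."""
    match = re.search(r"e$", text)
    return match.start() if match else None

print(final_e_position("the apple"))      # 8: $ matches at the very end
print(final_e_position("the apple\n"))    # 8: $ matches before the final line feed
print(final_e_position("the apple\n\n"))  # None: only one final line feed is tolerated
```

The last call shows the limit of the convenience: with two trailing line feeds, the position after the e is no longer "just before the final line break".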

For all engines except PCRE, the final line break character must be a linefeed character \n. But in PCRE (C, PHP, R…), the definition of a line break for the purpose of the $ anchor depends on how PCRE was compiled. Depending on the language and tool, that final newline may be allowed to be a linefeed, a carriage return, a carriage return followed by a linefeed, or any of the above.

Apart from these potential differences resulting from how PCRE was compiled, PCRE also lets you override the default newline (and therefore the behavior of the $ anchor) with one of PCRE's special beginning-of-pattern modifiers, such as (*ANYCRLF), (*CR) and (*ANY).

Multiline Mode (and Ruby Default): $ Matches on Every Line
In Ruby, the $ anchor always matches at the end of a line. For instance, e$ matches the e on both of these lines:
apple
orange


Other engines (including JavaScript) also allow the $ to match at the end of a line when you turn on multiline mode (the Modifiers page explains how to turn it on in various languages).

At first, Ruby's insistence on letting ^ and $ match on every line may seem annoying because it is not standard. However, once you learn about anchors such as \A and \Z, you see that this approach makes a lot of sense, as it gives each anchor an unambiguous meaning.
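
For engines other than Ruby, the effect of multiline mode on $ can be sketched in Python, using the re.MULTILINE flag. The variable names are only illustrative.

```python
import re

text = "apple\norange"

# Default mode: $ only matches at the end of the string.
default_matches = [m.start() for m in re.finditer(r"e$", text)]

# Multiline mode: $ also matches before each line feed.
multiline_matches = [m.start() for m in re.finditer(r"e$", text, re.MULTILINE)]

print(default_matches)    # [11]
print(multiline_matches)  # [4, 11]
```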



Beginning of String: \A

In all major regex flavors except JavaScript, the \A anchor asserts that the engine's current position in the string is the beginning of the string. Therefore, in the following string, \Aa matches the a in apple but not apricot:
apple
apricot


Of course the ^ anchor also matches that position. But since ^ changes its meaning when multiline is on, \A gives us an unambiguous anchor to ensure we really only ever match at the beginning of the string.
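
The difference only shows up when multiline mode is on. Here is a minimal Python sketch of that contrast (variable names are illustrative):

```python
import re

text = "apple\napricot"

# In multiline mode, ^ matches at the start of every line...
caret_starts = [m.start() for m in re.finditer(r"^a", text, re.MULTILINE)]

# ...but \A still only matches at the start of the string.
string_starts = [m.start() for m in re.finditer(r"\Aa", text, re.MULTILINE)]

print(caret_starts)   # [0, 6]
print(string_starts)  # [0]
```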



Very End of String: \z

A string goes from \A to \z.
In all major regex flavors except JavaScript and Python (where this token does not exist), the \z anchor asserts that the engine's current position in the string is the very end of the string (past any potential final line breaks). Therefore, in the following string, e\z matches the e in orange but not apple:
apple
orange


Of course the $ anchor also matches that position. But since $ changes its meaning when multiline is on, \z gives us an unambiguous anchor to ensure we really only ever match at the very end of the string.

Since the \A anchor specifies the very beginning of the string, we could have expected another capital letter for the end of the string—rather than the lowercase \z. Instead, the \Z (which we'll see next) gives us some flexibility around the line break (except in Python). Therefore, it's helpful to think of a string as going from \A to \z.



End of String (or Before Optional Line Break): \Z

This flexible "end-of-string" anchor behaves differently depending on the engine.

✽ In Python, the token \Z does what \z does in other engines: it only matches at the very end of the string.

✽ In .NET, Perl and Ruby, \Z is allowed to match before a final line feed. Therefore, in the string apple\norange\n, e\Z will match the final e.

✽ In Java, \Z will also match before a final line break character, which may be a line feed, a carriage return or another line separator.

✽ In PCRE (C, PHP, R…), \Z will also match before a final line break character. As discussed earlier, what PCRE defines as a line break character depends on how it was compiled, as well as on the potential presence of PCRE's special beginning-of-pattern modifiers, such as (*ANYCRLF), (*CR) and (*ANY).
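
The Python behavior from the first bullet is easy to verify with the built-in re module. This sketch contrasts Python's strict \Z with its more lenient $ on the same string (variable names are illustrative):

```python
import re

text = "apple\norange\n"

# In Python, \Z only matches at the very end, so the final line feed blocks it.
strict_end = re.search(r"e\Z", text)

# Python's $ is the flexible one: it matches before the final line feed.
lenient_end = re.search(r"e$", text)

# Once the trailing line feed is removed, \Z matches as well.
stripped_end = re.search(r"e\Z", text.rstrip("\n"))

print(strict_end)            # None
print(lenient_end.start())   # 11
print(stripped_end.start())  # 11
```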



Bringing Some Order: Multiline All the Time

There is a good argument for always leaving multiline mode on in order to give clear roles to all the anchors we've seen so far. If you do this, you know that $ and ^ always match on every line. And if you need to match specifically at the beginning or the end of the string, you use \A, \Z or \z.
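
Here is what that convention might look like in Python, as a sketch. Multiline mode is switched on inline with (?m) in every pattern, and keep in mind that in Python \Z is the strict end-of-string anchor (what other engines call \z).

```python
import re

text = "apple\norange"

# ^ and $ always mean "line" because (?m) is always on.
every_line = re.findall(r"(?m)^\w+$", text)

# \A and \Z are reserved for the two ends of the string.
first_line = re.findall(r"(?m)\A\w+$", text)
last_line = re.findall(r"(?m)^\w+\Z", text)

print(every_line)  # ['apple', 'orange']
print(first_line)  # ['apple']
print(last_line)   # ['orange']
```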

In fact, this is Ruby's behavior. That is also where Perl 6 is headed. To me, this idea makes a lot of sense: it gets rid of the multiline mode m and makes anchors more meaningful.



Beginning of String or End of Previous Match: \G

.NET, PCRE (C, PHP, R…), Java, Perl and Ruby support \G, a useful anchor that can match at one of two positions:
✽ The beginning of the string,
✽ The position that immediately follows the end of the previous match.

Among other uses, \G can come in handy in tokenized strings when you want to match tokens in certain areas of the string but not in others. Consider for instance this string showing Jane and Tarzan's times on three separate swim tests:
Tarzan A:33 B:32 C:36 Jane A:35 B:33 C:31
If we are only interested in matching Jane's scores, we can use:

(?:Jane|\G) \w+:(\d+)
There are other ways to do this, especially if you have infinite lookbehind (.NET), but this approach is particularly economical.

How does it work? When the engine tries to match at the beginning of the string, the first token (?:Jane|\G) succeeds because \G matches at the beginning of the string. However, the next token (a space character) fails against Tarzan's T. The next chance for the pattern to match is at the position preceding Jane. The engine matches "Jane A:35", capturing the 35 to Group 1. At the starting position of the next match attempt, \G matches, and the engine matches " B:33". Finally, \G matches again, and the engine matches " C:31".

Incidentally, in PCRE, Perl and Ruby, you don't need to retrieve the times from Group 1: you can match them directly with this small variation, where \K tells the engine to drop what it matched so far from the match to be returned:
(?:Jane|\G) \w+:\K\d+
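
Python's built-in re module supports neither \G nor \K, so the regexes above cannot be pasted into it as they are. As a rough stand-in, the "continue exactly where the previous match ended" behavior can be emulated with the pos argument of a compiled pattern's match method. The sketch below is only an approximation of the Tarzan and Jane example under that assumption, and the function name janes_scores is made up for the illustration.

```python
import re

# One token: a space, a label, a colon, then the score we want to capture.
TOKEN = re.compile(r" \w+:(\d+)")

def janes_scores(text):
    """Collect the scores that directly follow the word Jane."""
    anchor = re.search(r"Jane", text)
    if anchor is None:
        return []
    scores = []
    pos = anchor.end()
    while True:
        # match(text, pos) must succeed exactly at pos, which plays the role of \G.
        token = TOKEN.match(text, pos)
        if token is None:
            break
        scores.append(token.group(1))
        pos = token.end()
    return scores

print(janes_scores("Tarzan A:33 B:32 C:36 Jane A:35 B:33 C:31"))  # ['35', '33', '31']
```

Unlike the single regex, this loop never considers the beginning of the string as a starting point, so Tarzan's scores are skipped without any extra precaution.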

"Beginning of String" Match: Using or Bridling \G
The fact that \G matches at the beginning of the string is neither convenient nor inconvenient. Half the time, we use that property. The other half, we work around it.

For our second example, consider this string, which might represent two potential positions for placing a "submarine" on a paper grid in preparation for a naval battle:
A1B1C1vsA1A2A3
Each position (on either side of vs) has three tokens composed of one letter and one digit.

Suppose we want to match the first three tokens, i.e. A1, B1, C1. We can do this quite easily with this regex:
\G[A-Z]\d
The \G matches at the beginning of the string, allowing us to match A1. Then \G matches before the next token, so we match it, as well as the following token. \G succeeds again before the vs, but [A-Z] cannot match the v, so the match attempt fails. After that failure, there is no further position where \G can match, and we therefore avoid the tokens to the right, as we wanted.

Now suppose we want to match the second position's tokens, i.e. A1, A2, A3. Remembering the Tarzan and Jane example, we could try (?:vs|\G)([A-Z]\d)… but the strings in these two examples are not built the same way, and this regex would match all the tokens! Let's see how. After the \G matches at the beginning of the string, [A-Z]\d is able to match the first token. Then \G matches again, so we match the second token, and the third. Then, when we hit vs, \G still matches, but [A-Z] fails against the v. The engine backtracks and tries the other side of the alternation, vs, which matches. [A-Z]\d matches the fourth token, then \G helps us with tokens 5 and 6.

Clearly, this time \G is in our way: we wish it didn't match at the beginning of the string. To solve this, we can "bridle \G" by placing the negative lookahead (?!\A) right next to it. It asserts that what immediately follows the current position is not the beginning of the string, so \G can no longer match there. The regex becomes:

(?:vs|\G(?!\A))([A-Z]\d)
It may sound strange that we used (?!\A) to negate the anchor \A. As it turns out, (?<!\A) would also have worked. We'll explore this in the section about anchors within a lookaround.
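
For readers working in Python, whose built-in re module has no \G, the two submarine tasks can be approximated by chaining matches with the pos argument of a compiled pattern's match method. This is a sketch of the idea rather than a translation of the regexes above, and the helper name chained_tokens is hypothetical.

```python
import re

TOKEN = re.compile(r"[A-Z]\d")

def chained_tokens(text, pos=0):
    """Return the run of contiguous letter-digit tokens starting at pos."""
    tokens = []
    while True:
        # Each match must start exactly where the previous one ended.
        token = TOKEN.match(text, pos)
        if token is None:
            break
        tokens.append(token.group())
        pos = token.end()
    return tokens

grid = "A1B1C1vsA1A2A3"

# Starting at the beginning of the string, like the unbridled \G[A-Z]\d.
first_position = chained_tokens(grid)

# Starting right after vs, like (?:vs|\G(?!\A))([A-Z]\d).
second_position = chained_tokens(grid, grid.index("vs") + len("vs"))

print(first_position)   # ['A1', 'B1', 'C1']
print(second_position)  # ['A1', 'A2', 'A3']
```

Whether the chain may begin at the start of the string is decided here by the pos we pass in, which is the choice that (?!\A) makes inside the regex.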



Anchors Anywhere, Anytime

If you read too much beginner literature, overexposure to patterns such as ^Name: \w+$ could lead you to believe that anchors such as ^ and $ must be used in immutable positions at either end of the string. The (?!\A) in the regex just above hints that it is not so.

Anchors are not sacred cows. Move them where you like.
There is nothing sacred about anchors. A single anchor such as ^ may appear multiple times in a regex pattern. It can also participate in any subexpression, such as a group, an alternation, or a lookbehind. Let's see some examples.


Anchor in Multiple Places
^cat|^mouse
Here we use the beginning-of-string anchor twice, on both sides of an alternation. This ensures that whichever word we match will be at the beginning of the string.

In contrast, if we used ^cat|mouse, the ^ would only apply to cat. On the other side of the alternation, mouse is not anchored, so it could match anywhere in the string—perhaps not what we intend.

Another way to make the anchor apply to both sides is to enclose the alternation in a group, as in ^(?:cat|mouse), so that the anchor sits outside the alternation.
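
The three variants can be compared side by side in Python. The helper first_match is just a convenience invented for this sketch.

```python
import re

def first_match(pattern, text):
    """Return the matched text, or None when the pattern does not match."""
    match = re.search(pattern, text)
    return match.group() if match else None

# Anchor on both sides: mouse is rejected because it is not at the start.
both_anchored = first_match(r"^cat|^mouse", "a mouse")

# Anchor on one side only: mouse may match anywhere.
half_anchored = first_match(r"^cat|mouse", "a mouse")

# Anchor outside a group: equivalent to anchoring both sides.
grouped_hit = first_match(r"^(?:cat|mouse)", "mouse trap")
grouped_miss = first_match(r"^(?:cat|mouse)", "a mouse")

print(both_anchored)  # None
print(half_anchored)  # mouse
print(grouped_hit)    # mouse
print(grouped_miss)   # None
```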

Anchor in an Alternation
\bcat(\w+|$)
Here we use the end-of-string anchor $ in an alternation. After the word boundary \b and the letters cat, the engine must either match some word characters \w+ or be able to assert that the current position is the end of the string.

As a result, the word cat on its own can match, but only at the end of the string. In contrast, catch can match anywhere in the string.
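
A short Python sketch of these three cases, with sample strings chosen only for illustration:

```python
import re

pattern = re.compile(r"\bcat(\w+|$)")

# cat on its own matches at the end of the string...
at_end = pattern.search("a cat")

# ...but not in the middle, where neither \w+ nor $ can succeed.
in_middle = pattern.search("cat nap")

# A longer word starting with cat can match anywhere.
longer_word = pattern.search("catch me")

print(at_end.group())       # cat
print(in_middle)            # None
print(longer_word.group())  # catch
```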


Anchor Within a Lookaround
\w+(?=,|$)
Here we use the end-of-string anchor $ within a lookahead. The engine matches word characters, then asserts that what follows is either a comma or the end of the string. In the string one apple, two peaches, three plums, this regex would match the fruits but not the numbers.
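
Run in Python, the fruit example looks like this (a minimal sketch):

```python
import re

text = "one apple, two peaches, three plums"

# Words followed by a comma or by the end of the string.
fruits = re.findall(r"\w+(?=,|$)", text)

print(fruits)  # ['apple', 'peaches', 'plums']
```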

It's worth taking a moment to examine the meaning of an anchor within a lookaround. When the engine is standing at the beginning of the string, you could say that what immediately follows or precedes this position is also the beginning of the string. If immediately refers to an infinitely small distance to the left or to the right, that is indeed the case. This is a matter of semantics and perspective, of course, but it's a perspective that the engine takes on board: ^, (?<=^) and (?=^) all match at the beginning of a string.

The same applies to other anchors: whenever an anchor is within a lookaround, the meaning of the anchor is the same as if the lookaround weren't there at all. This is convenient for several reasons:

✽ We can use a negative lookahead or a negative lookbehind to assert that the current position does not correspond to an anchor. For instance, (?<!^) checks that the current position is not the beginning of the string.

✽ We can place an anchor in an alternation within a lookaround to make a complex delimiter, as in the \w+(?=,|$) example a few paragraphs above. To take another example, (?<=>>|^) ensures that what precedes the current position is either two "greater than" characters >, or—assuming we're in multiline mode—the beginning of a line. This could be useful, for instance, in parsing the text of an email. Note that as in any alternation, the order of tokens is important: (?<=^|>>) matches at the beginning of the string if possible, or past >> if not. A few lines above, the priority was reversed.
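
The equivalence of ^, (?<=^) and (?=^), and the negated form (?<!^), can be checked with a few lines of Python. Note that Python's built-in re module only accepts fixed-width lookbehinds, so a pattern such as (?<=>>|^), whose two branches have different widths, would be rejected there. The sketch below therefore sticks to single-anchor lookarounds.

```python
import re

def starts(pattern, text):
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

# The bare anchor and the two lookaround versions agree: only index 0.
plain = starts(r"^a", "aaa")
behind = starts(r"(?<=^)a", "aaa")
ahead = starts(r"(?=^)a", "aaa")

# The negative lookbehind excludes the beginning of the string.
not_at_start = starts(r"(?<!^)a", "aaa")

print(plain, behind, ahead)  # [0] [0] [0]
print(not_at_start)          # [1, 2]
```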



Anchors seen as Delimiters

Just as you can argue that anchors are a kind of boundary, you can argue that they are a kind of delimiter.

Take the beginning-of-string anchor ^: you can express ^a as (?s)(?<!.)a, where the negative lookbehind (?<!.) asserts that what precedes the current position is not a character. (In this regex, the mode modifier (?s), which turns on dotall mode, is necessary to allow the dot to match any character including line breaks.)

Likewise, in multiline mode, where ^ matches at the beginning of every line, you can mimic ^a with (?<!.)a, without the (?s) modifier. Here the lookbehind asserts that what precedes the current position is not a character, except perhaps a line break.

Expressed in this way, the anchor ^ no longer seems to refer to a specific position in the string: it is just a shorthand notation for a "do-it-yourself delimiter".

Ultimately, this will boil down to a question of semantics. But the question is not futile. Rephrasing anchors as DIY delimiters suggests ways to implement these anchors using other meta-characters if you ever decided to write your own engine. More importantly, exploring alternate ways to perform the work of anchors brings out some of their features and suggests possible variations.

For instance, in many engines, if you wanted a variation on the multiline anchor ^ that only matches at the beginning of lines two and beyond, you could use something like the lookbehind (?<=\n), which asserts that what precedes the current position is a linefeed character.
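
These do-it-yourself delimiters can be compared with the real anchors in Python. The sketch below uses the two-line apple and apricot string from earlier sections, and the helper name starts is only for illustration.

```python
import re

text = "apple\napricot"

def starts(pattern, flags=0):
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text, flags)]

# Dotall lookbehind: behaves like ^a without multiline mode.
diy_string_start = starts(r"(?s)(?<!.)a")

# Plain lookbehind: behaves like ^a in multiline mode.
diy_line_start = starts(r"(?<!.)a")

# Variation: only the beginning of lines two and beyond.
later_lines_only = starts(r"(?<=\n)a")

print(diy_string_start, starts(r"^a"))              # [0] [0]
print(diy_line_start, starts(r"^a", re.MULTILINE))  # [0, 6] [0, 6]
print(later_lines_only)                             # [6]
```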




Yon
May 20, 2015 - 02:35
Subject: The $ effect

In the following use cases, the "$" anchor changes the regex behavior when trying to match the simple string "a" :
- regex "^b*" matches,
- regex "^b*$" does not match. How can we explain that? Thanks,
-Yon
Rex
May 20, 2015 - 09:48
Subject: RE: The $ effect

Hi Yon, Neither of these patterns matches the string "a".

