Character Class Subtraction, Intersection & Union


Some regex engines let you perform set operations within your character classes. You can match characters that belong to one class but not to another (subtraction), characters that belong to both one class and another (intersection), or characters that belong to any of several classes (union).

At the moment, the engines that officially support one or several of these features are .NET, Java, Ruby 1.9+ (Onigmo engine) and Matthew Barnett's regex module for Python. In addition, Perl 5.18 introduced experimental support for set operations with its Extended Bracketed Character Classes, which I plan to document at some point.

For other engines, I'll present some workarounds.

Jumping Points
For easy navigation, here are some jumping points to various sections of the page:

Character Class Intersection (optional brackets, workaround)
Character Class Subtraction (Java & Ruby, .NET, Python, workaround)
Character Class Union (Java & Ruby, Python, workaround)



Character Class Intersection

In Java, Ruby 1.9+ and the regex module for Python, you can use the AND operator && to specify the intersection of multiple classes within a character class: […&&[…]&&[…]] specifies a character class representing the intersection of three sub-classes—meaning that the character matched by the class must belong to all three sub-classes. For instance, [\S&&[\D]] specifies one character that is both a non-whitespace character and a non-digit.

Character class intersection really comes into its own when using Unicode properties, as it lets us zoom in on a desired set of characters. For instance, the regex [\p{InArabic}&&[\p{L}]] matches a character that is both in the Arabic Unicode block and a letter—in other words, an Arabic letter. Likewise, [\p{ASCII}&&\p{L}] matches an ASCII character that is also a letter.
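
To try the three classes above, here is a minimal Java sketch. The class name IntersectionDemo, the constant names and the sample characters are assumptions made for illustration; only java.util.regex.Pattern comes from the standard library.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: the class name, constant names and sample
// characters are illustrative and not part of the original article.
class IntersectionDemo {
    // A non-whitespace character that is also a non-digit
    static final String NON_SPACE_NON_DIGIT = "[\\S&&[\\D]]";
    // A character in the Arabic block that is also a letter
    static final String ARABIC_LETTER = "[\\p{InArabic}&&[\\p{L}]]";
    // An ASCII character that is also a letter
    static final String ASCII_LETTER = "[\\p{ASCII}&&\\p{L}]";

    // True when the whole string s is matched by the given character class.
    static boolean matches(String charClass, String s) {
        return Pattern.matches(charClass, s);
    }

    public static void main(String[] args) {
        System.out.println(matches(NON_SPACE_NON_DIGIT, "x"));  // true
        System.out.println(matches(NON_SPACE_NON_DIGIT, "7"));  // false: a digit
        System.out.println(matches(NON_SPACE_NON_DIGIT, " "));  // false: whitespace
        System.out.println(matches(ARABIC_LETTER, "\u0628"));   // true: U+0628 is the letter beh
        System.out.println(matches(ARABIC_LETTER, "\u0663"));   // false: U+0663 is an Arabic-Indic digit
        System.out.println(matches(ASCII_LETTER, "a"));         // true
        System.out.println(matches(ASCII_LETTER, "\u00e9"));    // false: U+00E9 is a letter but not ASCII
    }
}
```

Each call tests a one-character string, so a true result means that character belongs to every sub-class of the intersection.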

Brackets are optional but… Use them with negation
As long as the class after the && is not a negated class [^…], the brackets around it are optional. Therefore, the above regex can be written:
[\p{InArabic}&&\p{L}]
When negation is involved, I recommend always using brackets because things can get messy. For instance, what does [^a&&[ab]] mean? For Java, it means [[^a]&&[ab]] and therefore only matches b. For Ruby, it means [^[a&&[ab]]], i.e. "not the intersection of two subclasses" (which is a), and it therefore matches any character that is not an a.

Combined with negated classes, intersection is very useful to create character class subtraction.

Workaround for Engines that Don't Support Character Class Intersection
For engines that don't support character class intersection, we can simply use a lookahead. For instance, our Arabic letter regex [\p{InArabic}&&\p{L}] can be written like this in Perl:

(?=\p{InArabic})\p{L}
The lookahead asserts that the following character belongs to the Arabic Unicode block. Then \p{L} matches a letter, which is guaranteed to be an Arabic letter.
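
The text above targets Perl, but Java also supports lookahead, which makes it a convenient place to compare the workaround with the native intersection side by side. The following is only a sketch: the class name, method names and sample characters are assumptions made for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical comparison harness, not part of the original article.
class IntersectionWorkaroundDemo {
    // Native intersection syntax
    static final Pattern NATIVE = Pattern.compile("[\\p{InArabic}&&\\p{L}]");
    // Lookahead workaround: assert the block first, then consume a letter
    static final Pattern LOOKAHEAD = Pattern.compile("(?=\\p{InArabic})\\p{L}");

    static boolean nativeMatches(String s) {
        return NATIVE.matcher(s).matches();
    }

    static boolean lookaheadMatches(String s) {
        return LOOKAHEAD.matcher(s).matches();
    }

    public static void main(String[] args) {
        // U+0628 Arabic letter, U+0663 Arabic-Indic digit,
        // U+060C Arabic comma, and a plain Latin letter
        String[] samples = {"\u0628", "\u0663", "\u060C", "a"};
        for (String s : samples) {
            System.out.printf("U+%04X native=%b lookahead=%b%n",
                    (int) s.charAt(0), nativeMatches(s), lookaheadMatches(s));
        }
    }
}
```

On these four samples both patterns accept only the Arabic letter: the digit and the comma pass the lookahead but fail \p{L}, while the Latin letter fails the lookahead.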



Character Class Subtraction

The syntax for character class subtraction differs depending on whether you're using Java, Ruby 1.9+, .NET or the regex module for Python.

Character Class Subtraction in Java and Ruby 1.9+
Java and Ruby do not have dedicated syntax for character class subtraction. Rather, the feature is a logical by-product of their character class intersection syntax. For instance,
[a-z&&[^aeiou]] matches characters that are both English lowercase letters and not vowels. In effect, it subtracts the vowel class [aeiou] from the class of letters [a-z], leaving all English lowercase consonants.

Subtraction becomes particularly useful with Unicode properties. For instance,
[\p{InArabic}&&[^\P{L}]] subtracts non-letters \P{L} from the set of characters in the Arabic Unicode block—guaranteeing we match an Arabic letter.
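
Both subtraction examples can be checked with a short Java sketch. The class name SubtractionDemo, the helper stripConsonants and the sample inputs are hypothetical and exist only for this illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original article.
class SubtractionDemo {
    // Lowercase letters minus vowels: English lowercase consonants
    static final String CONSONANT = "[a-z&&[^aeiou]]";
    // Arabic block minus non-letters: Arabic letters
    static final String ARABIC_LETTER = "[\\p{InArabic}&&[^\\P{L}]]";

    // True when the whole string s is matched by the given character class.
    static boolean matches(String charClass, String s) {
        return Pattern.matches(charClass, s);
    }

    // Removes every lowercase consonant, leaving vowels and other characters.
    static String stripConsonants(String s) {
        return s.replaceAll(CONSONANT, "");
    }

    public static void main(String[] args) {
        System.out.println(matches(CONSONANT, "b"));              // true
        System.out.println(matches(CONSONANT, "a"));              // false: a vowel was subtracted
        System.out.println(stripConsonants("character class"));   // aae a
        System.out.println(matches(ARABIC_LETTER, "\u0628"));     // true: U+0628 is a letter
        System.out.println(matches(ARABIC_LETTER, "\u0663"));     // false: U+0663 is a digit
    }
}
```

Note the double negative in the second pattern: [^\P{L}] means "not a non-letter", so intersecting with it keeps only letters.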

Character Class Subtraction in .NET
In .NET, with […-[…]], you can specify that the character to be matched belongs to a certain class (everything before the hyphen), except if it belongs to another class (the embedded character class, which is "subtracted" by the hyphen).

For instance, the class
[a-z-[aeiou]] matches an English lowercase consonant.

Using Unicode properties, you can use this feature to zoom in on a useful character range. For instance, you could try [\p{IsArabic}-[\d]] to match one character in the Arabic Unicode block, except if it is a digit. Do not think that gives you an Arabic letter, though, as the Arabic Unicode block also includes punctuation and various marks and symbols.

In contrast, [\p{IsArabic}-[\D]] is much more useful: it gives us one character in the Arabic Unicode block, except if it is a non-digit, which guarantees that it is an Arabic digit.

You can nest subtractions, that is, use subtraction within a class being subtracted. For instance, having defined [a-z-[aeiou]] as English lowercase consonants, we could subtract those from the word character class: [\w-[a-z-[aeiou]]] matches any word character that is not an English lowercase consonant.

Note that [^a-z-[0-9]] is not interpreted as (using pseudoregex) [^{a-z-[0-9]}], but as (pseudoregex) [{^a-z}-[0-9]]. But you would never do that: [^a-z0-9] is much simpler.

Character Class Subtraction in the regex module for Python
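The syntax is the same as for .NET, except for one added hyphen. For instance, the class
[a-z--[aeiou]] matches an English lowercase consonant.

In addition, when the subtracted class does not include a range, its brackets are optional. The above can therefore also be written as [a-z--aeiou]. Note that the regex module documents set operations as Version 1 behaviour, so you may need to turn it on with the regex.V1 flag or the inline (?V1) flag.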

Workaround for Engines that Don't Support Character Class Subtraction
For engines that don't support character class subtraction, we can simply use a negative lookahead. For instance, our English consonant regex [a-z&&[^aeiou]] can be written like this in PCRE (PHP, R…), Perl, Python and JavaScript:

(?![aeiou])[a-z]
The negative lookahead asserts that the following character is not a lowercase vowel. Then [a-z] matches a letter, which is guaranteed not to be a vowel.
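
Since Java supports both the native syntax and negative lookahead, it can serve as a quick cross-check that the workaround selects the same characters. This is a sketch under that assumption; the class name and helper methods are hypothetical, and the check only covers the ASCII range.

```java
import java.util.regex.Pattern;

// Hypothetical comparison harness, not part of the original article.
class SubtractionWorkaroundDemo {
    // Native subtraction via intersection
    static final Pattern NATIVE = Pattern.compile("[a-z&&[^aeiou]]");
    // Workaround: reject vowels first, then consume a lowercase letter
    static final Pattern LOOKAHEAD = Pattern.compile("(?![aeiou])[a-z]");

    // Counts how many single ASCII characters the pattern matches.
    static int countAsciiMatches(Pattern p) {
        int count = 0;
        for (char c = 0; c < 128; c++) {
            if (p.matcher(String.valueOf(c)).matches()) {
                count++;
            }
        }
        return count;
    }

    // True when both patterns accept exactly the same ASCII characters.
    static boolean agreeOnAscii() {
        for (char c = 0; c < 128; c++) {
            String s = String.valueOf(c);
            if (NATIVE.matcher(s).matches() != LOOKAHEAD.matcher(s).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(agreeOnAscii());                // true
        System.out.println(countAsciiMatches(LOOKAHEAD));  // 21 (26 letters minus 5 vowels)
    }
}
```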



Character Class Union

The syntax for the union of character classes depends on which engine you use.

Character class union in Java and Ruby 1.9+
In Java and Ruby 1.9+, you can embed character classes within a character class like so: […[…][…]] to specify that the character to be matched can belong to any of the specified classes. For instance, [\D[9]] matches one character that is either a non-digit or a 9.

Since this can also be written [\D9], unions tend to be useful only in convoluted cases that involve negation or other character class operations (subtraction and intersection). For instance, [0[^\W\d]] is a reduced word character class where the only allowable characters are letters, underscores and the digit 0.
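
The two union classes above can be tried in Java with a sketch like the following. The class name UnionDemo, the constant names and the sample characters are assumptions made for illustration. Keep in mind that Java's \w and \d are ASCII-only by default, so "letters" here means a-z and A-Z.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original article.
class UnionDemo {
    // A non-digit, or the digit 9
    static final String NON_DIGIT_OR_NINE = "[\\D[9]]";
    // The digit 0, or a word character that is not a digit
    static final String REDUCED_WORD = "[0[^\\W\\d]]";

    // True when the whole string s is matched by the given character class.
    static boolean matches(String charClass, String s) {
        return Pattern.matches(charClass, s);
    }

    public static void main(String[] args) {
        System.out.println(matches(NON_DIGIT_OR_NINE, "9"));  // true
        System.out.println(matches(NON_DIGIT_OR_NINE, "x"));  // true
        System.out.println(matches(NON_DIGIT_OR_NINE, "8"));  // false
        System.out.println(matches(REDUCED_WORD, "0"));       // true
        System.out.println(matches(REDUCED_WORD, "_"));       // true
        System.out.println(matches(REDUCED_WORD, "5"));       // false: digits other than 0 are excluded
        System.out.println(matches(REDUCED_WORD, "-"));       // false: not a word character
    }
}
```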

Each embedded class has access to the entire character class syntax, so it too can embed classes, leading to multiple levels of nesting, as in […[…][…[…][…]]]. Looking at it this way, I don't know what situation would ever lead you to specify such a class, but this could come in handy if you build patterns dynamically, concatenating classes that originate in various parts of your application. (Thanks to nhahtdh for bringing this to my attention.)
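
As a rough illustration of the dynamic case, here is a Java sketch that wraps independently supplied classes in one outer class. The helper unionOf and the class name are hypothetical, and the sketch assumes every fragment passed in is already a valid, non-empty bracketed class.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original article.
class DynamicUnionDemo {
    // Wraps the given bracketed classes in an outer class, so that
    // "[a-f]" and "[0-9]" become "[[a-f][0-9]]".
    static String unionOf(String... classes) {
        return "[" + String.join("", classes) + "]";
    }

    // True when the whole string s is matched by the given character class.
    static boolean matches(String charClass, String s) {
        return Pattern.matches(charClass, s);
    }

    public static void main(String[] args) {
        // Fragments that might come from different parts of an application
        String combined = unionOf("[a-f]", "[0-9]", "[_]");
        System.out.println(combined);                 // [[a-f][0-9][_]]
        System.out.println(matches(combined, "c"));   // true
        System.out.println(matches(combined, "7"));   // true
        System.out.println(matches(combined, "z"));   // false
    }
}
```

Because each fragment keeps its own brackets, the caller does not need to strip or merge them before building the final pattern.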

Character class union in the regex module for Python
In the regex module for Python, to create the union of multiple character classes, we use the OR operator ||. For instance, [0||[^\W\d]] specifies a character that is either 0 or a word character that is not a digit.

Workaround for Engines that Don't Support Character Class Union
For engines that don't support character class unions, we can simply use an alternation within a non-capturing group. For instance, our regex [0||[^\W\d]] can be rewritten like this in all major engines:

(?:0|[^\W\d])
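
To check that the alternation selects the same characters as a native union, one option is to compare it with Java's nested-class form [0[^\W\d]] over the ASCII range. The class and method names below are hypothetical, and the comparison is only a sketch limited to ASCII input.

```java
import java.util.regex.Pattern;

// Hypothetical comparison harness, not part of the original article.
class UnionWorkaroundDemo {
    // Native union using a nested class (Java syntax)
    static final Pattern NATIVE = Pattern.compile("[0[^\\W\\d]]");
    // Workaround: alternation inside a non-capturing group
    static final Pattern ALTERNATION = Pattern.compile("(?:0|[^\\W\\d])");

    // Counts how many single ASCII characters the pattern matches.
    static int countAsciiMatches(Pattern p) {
        int count = 0;
        for (char c = 0; c < 128; c++) {
            if (p.matcher(String.valueOf(c)).matches()) {
                count++;
            }
        }
        return count;
    }

    // True when both patterns accept exactly the same ASCII characters.
    static boolean agreeOnAscii() {
        for (char c = 0; c < 128; c++) {
            String s = String.valueOf(c);
            if (NATIVE.matcher(s).matches() != ALTERNATION.matcher(s).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(agreeOnAscii());                  // true
        // 52 letters + underscore + the digit 0
        System.out.println(countAsciiMatches(ALTERNATION));  // 54
    }
}
```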



