Interesting Character Classes
My goal with this page is to assemble a collection of interesting (and potentially useful) regex character classes. I will try to organize the collection into themes.
Jumping Points
For easy navigation, here are some jumping points to various sections of the page:
✽ How do these Character Classes Work?
✽ Useful ASCII Ranges
✽ Obnoxious Ranges
✽ Strange or Beautiful Ranges
✽ Line-Break-Related
✽ Language-Related
How do these Character Classes Work?
Before we start, I want to make sure you don't feel confused when you stumble on something like [!-~]. Remember that the hyphen defines a range between two characters in the ASCII table (or between two Unicode code points, depending on the engine). But a range does not have to look like [a-z]. If you consult the ASCII table, you will see that [!-~] is a valid range, and a useful one too: it runs from ! (code 33) to ~ (code 126).
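If you'd rather check this than take my word for it, here is a minimal sketch. It assumes Python's re module, which is my pick for illustration only, and the variable names are made up for the example.

```python
import re

# The endpoints of the range are just positions in the ASCII table.
low, high = ord("!"), ord("~")

# Collect every ASCII character that the class matches.
visible = re.compile(r"[!-~]")
matched = [chr(i) for i in range(128) if visible.fullmatch(chr(i))]

print(low, high, len(matched))   # 33 126 94
```

The space character (code 32) sits just below the range, so it is left out.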
Sometimes, instead of a straight character class, you'll see something like (?![aeiou])[a-z]. The first part is a negative lookahead that asserts that the following character is not one of those in a given class. This is a way to perform character class subtraction in regex engines that don't support that operation, and that's most of them. In this example, the resulting character class is that of English lower-case consonants, since we have removed the vowels [aeiou] from the range of letters [a-z]. You may, by the way, notice that the letter a appears in both classes, so we could have written this as (?![eiou])[b-z].
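Here is a small sketch of the subtraction trick, again assuming Python's re module. The sample text and variable names are mine; the two patterns are the ones from the paragraph above.

```python
import re

# Lookahead-based "subtraction": any a-z letter that is not a vowel.
consonants = re.compile(r"(?![aeiou])[a-z]")
# The shorter spelling, with the letter a dropped from both classes.
shorter = re.compile(r"(?![eiou])[b-z]")

text = "regex character classes"
kept = "".join(consonants.findall(text))
kept_shorter = "".join(shorter.findall(text))

print(kept)                   # rgxchrctrclsss
print(kept == kept_shorter)   # True
```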
Useful ASCII Ranges
All Printable Characters in the ASCII Table
[ -~]
All Printable Characters in the ASCII Table—Except the Space Character
[!-~]
All "Special Characters" in the ASCII Table
(?![a-zA-Z0-9])[!-~]
All "Special Characters" in the ASCII Table—Without Using Lookahead
[!-/:-@\[-`{-~]
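To convince yourself that the lookahead version and the explicit-range version describe the same set, you could run something like the sketch below. It assumes Python's re module, and it compares both against string.punctuation from Python's standard library, which happens to list the same 32 characters.

```python
import re
import string

with_lookahead = re.compile(r"(?![a-zA-Z0-9])[!-~]")
explicit = re.compile(r"[!-/:-@\[-`{-~]")

ascii_chars = [chr(i) for i in range(128)]
from_lookahead = "".join(c for c in ascii_chars if with_lookahead.fullmatch(c))
from_explicit = "".join(c for c in ascii_chars if explicit.fullmatch(c))

print(from_lookahead == from_explicit)        # True
print(from_explicit == string.punctuation)    # True
```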
All Latin and Accented Characters
(?i)(?:(?![×Þß÷þø])[a-zÀ-ÿ])
All English Consonants
[b-df-hj-np-tv-z]
Obnoxious Ranges
Alphanumeric Characters
[^\W_]
This is an interesting class for engines that don't support the POSIX [[:alnum:]]. It makes use of the fact that \w is very close to what we want. [^\W] is a double negation that matches the same as \w. By adding _ to the negated class, we are left with ASCII letters and digits. Watch out, though: in Python and .NET, \w matches any Unicode letter by default. But frankly... Just use [a-zA-Z0-9]. See also Any White-Space Except Newline.
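The Unicode caveat is easy to see in Python 3, which I am using here purely as an example engine. In this sketch the re.ASCII flag restricts \w to ASCII; the sample string is made up.

```python
import re

alnum_unicode = re.compile(r"[^\W_]+")           # default: \w is Unicode-aware
alnum_ascii = re.compile(r"[^\W_]+", re.ASCII)   # \w limited to [a-zA-Z0-9_]

sample = "snake_case caf\u00e9_42"   # \u00e9 is a precomposed e-acute

unicode_hits = alnum_unicode.findall(sample)
ascii_hits = alnum_ascii.findall(sample)

print(unicode_hits)   # ['snake', 'case', 'café', '42']
print(ascii_hits)     # ['snake', 'case', 'caf', '42']
```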
Binary Number
[^\D2-9]+
This is the same idea as the regex above to match alphanumeric characters. In most engines, the character class only matches the digits 0 and 1. The + quantifier makes this an obnoxious regex to match a binary number. If you want to do that, [01]+ is all you need. Note that in some engines, such as .NET and Python 3, \d matches any digit in any script, so the meaning in those engines would be "any digit in any script, except ASCII digits 2 through 9".
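A quick sketch in Python 3 (my choice of engine for the demo) shows both behaviors: with the re.ASCII flag the obnoxious class agrees with [01]+, and without it a digit from another script slips through. The sample data is invented.

```python
import re

obnoxious_ascii = re.compile(r"[^\D2-9]+", re.ASCII)
plain = re.compile(r"[01]+")

sample = "1011 2024 0001"
obnoxious_hits = obnoxious_ascii.findall(sample)
plain_hits = plain.findall(sample)
print(obnoxious_hits)   # ['1011', '0', '0001']

# Without re.ASCII, \d is Unicode-aware in Python 3, so the class also
# accepts digits from other scripts, e.g. ARABIC-INDIC DIGIT THREE.
unicode_match = re.fullmatch(r"[^\D2-9]", "\u0663")
print(unicode_match is not None)   # True
```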
Strange or Beautiful Ranges
Square Brackets
This will work in .NET, Perl, PCRE and Python.
[][]
The crazy thing is that there is a lot of variation among engines as to which brackets need to be escaped. While [\]\[] will work everywhere, in JavaScript you can use [[\]], and in Java you can use []\[].
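For the Python case, here is a minimal sketch comparing the bare form with the fully escaped one. The sample string is just something I made up.

```python
import re

bare = re.compile(r"[][]")        # ] is literal when it comes first in the class
escaped = re.compile(r"[\]\[]")   # the spelling that works everywhere

sample = "a[0] = b[1]"
bare_hits = bare.findall(sample)
escaped_hits = escaped.findall(sample)

print(bare_hits)                   # ['[', ']', '[', ']']
print(bare_hits == escaped_hits)   # True
```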
Words you can Type with your Left Hand
(But you'll need a QWERTY keyboard.)
(?i)\b[a-gq-tv-xz]+\b
Words you can Type with your Right Hand (QWERTY keyboard)
(?i)\b[h-puy]+\b
Words that only use Letters from the Top Row (QWERTY keyboard)
(?i)\b[eio-rtuwy]+\b
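As a rough illustration, here is the top-row class at work in Python's re module (again, just my choice of engine). The sentence is invented for the demo.

```python
import re

top_row = re.compile(r"(?i)\b[eio-rtuwy]+\b")

sample = "Typewriter and proprietor wrote poetry daily"
top_row_words = top_row.findall(sample)

print(top_row_words)   # ['Typewriter', 'proprietor', 'wrote', 'poetry']
```

The word boundaries on both sides matter: without them, the class would happily match top-row fragments inside longer words.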
Line-Break-Related
Any Character Including Line Breaks
These are ways to replicate the behavior of the dot in DOTALL mode (by default, the dot does not match line breaks): [\S\s] or [\D\d] or [\w\W]. Note that in each of these classes, I have tried to place in first position the token that has the greatest chance of matching first (which of course would depend on the target text).
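Here is a minimal sketch of the difference, assuming Python's re module, where DOTALL mode is switched on with the re.DOTALL flag. The two-line sample is made up.

```python
import re

sample = "first\nsecond"

dot_default = re.findall(r".+", sample)            # the dot stops at the line break
class_trick = re.findall(r"[\S\s]+", sample)       # the class crosses it
dot_dotall = re.findall(r".+", sample, re.DOTALL)  # same result as the class

print(dot_default)   # ['first', 'second']
print(class_trick)   # ['first\nsecond']
print(dot_dotall)    # ['first\nsecond']
```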
Any White-Space Character Except the Newline Character
You may not have a use for this, but it's an interesting class making use of double negation. We're negating \S, so that's the same as all white-space characters \s. But the \n removes itself from the set.
[^\S\n]
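If you do find a use for it, it might look something like this sketch, which collapses runs of horizontal white-space while leaving line breaks alone. It assumes Python's re module, and the sample string is invented.

```python
import re

horizontal_ws = re.compile(r"[^\S\n]+")

sample = "a \t b\n  c"
collapsed = horizontal_ws.sub(" ", sample)

print(repr(collapsed))   # 'a b\n c'
```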
Alternative to [\r\n] for Java and Ruby 2+
(?![ \t\cK\f])\s
This rather pointless regex (except as a learning device) relies on the fact that in these engines \s matches an ASCII space, a tab, a line feed, a carriage return, a vertical tab (written here as \cK) or a form feed. The negative lookahead removes all of those characters except the line feed and the carriage return.
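Python is not one of the engines named above, but the same idea can be sketched there with two adjustments, both of which are my assumptions for the demo: as far as I know Python's re has no \cK escape, so the vertical tab is written \x0b, and the re.ASCII flag is needed to keep \s down to the six ASCII white-space characters.

```python
import re

# Adapted for Python: \x0b stands in for \cK, and re.ASCII keeps \s
# to space, \t, \n, \r, \x0b (vertical tab) and \f.
line_breaks = re.compile(r"(?![ \t\x0b\f])\s", re.ASCII)

survivors = [chr(i) for i in range(128) if line_breaks.fullmatch(chr(i))]

print(survivors)   # ['\n', '\r']
```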
Language-Related
French Letters
[a-zA-ZàâäôéèëêïîçùûüÿæœÀÂÄÔÉÈËÊÏÎÇÙÛÜÆŒ]
German Letters
The controversial capital letter for ß, now included in Unicode, is missing in many fonts, so it might show on your screen as a question mark.
[a-zA-ZäöüßÄÖÜẞ]
Polish Letters
[a-pr-uwy-zA-PR-UWY-ZąćęłńóśźżĄĆĘŁŃÓŚŹŻ]
Note that the Polish alphabet has no Q, V or X. But if you want to allow all English letters as well, use [a-zA-ZąćęłńóśźżĄĆĘŁŃÓŚŹŻ].
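Here is a small sketch of the difference between the two Polish classes, assuming Python's re module. The helper names and test words are mine.

```python
import re

strict_polish = re.compile(r"[a-pr-uwy-zA-PR-UWY-ZąćęłńóśźżĄĆĘŁŃÓŚŹŻ]+")
with_english = re.compile(r"[a-zA-ZąćęłńóśźżĄĆĘŁŃÓŚŹŻ]+")

def is_strict(word):
    # True when every character belongs to the strict Polish class.
    return strict_polish.fullmatch(word) is not None

def is_relaxed(word):
    # True when every character is a Polish or an English letter.
    return with_english.fullmatch(word) is not None

print(is_strict("Łódź"), is_strict("taxi"))     # True False
print(is_relaxed("Łódź"), is_relaxed("taxi"))   # True True
```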
Italian Letters
[a-zA-ZàèéìíîòóùúÀÈÉÌÍÎÒÓÙÚ]
Spanish Letters
[a-zA-ZáéíñóúüÁÉÍÑÓÚÜ]
Smiles,
Rex