Interesting Character Classes

My goal with this page is to assemble a collection of interesting (and potentially useful) regex character classes. I will try to organize the collection into themes.

Jumping Points
For easy navigation, here are some jumping points to various sections of the page:

How do these Character Classes Work?
Useful ASCII Ranges
Obnoxious Ranges
Strange or Beautiful Ranges

(direct link)

How do these Character Classes Work?

Before we start, I want to make sure you don't feel confused when you stumble on something like [!-~]. Remember that the hyphen defines a range between two characters in the ASCII table (or between two Unicode code points, depending on the engine). But a range does not have to look like [a-z]

If you consult the ASCII table, you will see that [!-~] is a valid range—and a useful one too.

Sometimes, instead of a straight character class, you'll see something like (?![aeiou])[a-z]. The first part is a negative lookahead that asserts that the following character is not one of those in a given range. This is a way to perform character class subtraction in regex engines that don't support that operation—and that's most of them. In this example, the resulting character class is that of English lower-case consonants, since we have removed the vowels [aeiou] from the range of letters [a-z]. You may, by the way, notice that the letter a appears in both classes: we could have written this (?![eiou])[b-z]

(direct link)

Useful ASCII Ranges

All Printable Characters in the ASCII Table
[ -~]
All Printable Characters in the ASCII Table—Except the Space Character
All "Special Characters" in the ASCII Table
All "Special Characters" in the ASCII Table—Without Using Lookahead
All Latin and Accented Characters
All English Consonants

(direct link)

Obnoxious Ranges

Alphanumeric Characters
[^\W_] This is an interesting class for engines that don't support the POSIX [[:alnum:]]. It makes use of the fact that \w is very close to what we want. [^\W] is a double negation that matches the same as \w. By adding _ to the negated class, we are left with ASCII digits and numbers. Watch out, though: in Python and .NET, \w matches any unicode letter. But frankly... Just use [a-zA-Z0-9]. See also Any White-Space Except Newline.

Binary Number
[^\D2-9]+ This is the same idea as the regex above to match alphanumeric characters. In most engines, the character class only matches digits 0 or 1. The + quantifier makes this an obnoxious regex to match a binary number—if you want to do that, [01]+ is all you need. Note that in .NET and Python 3 some engines \d matches any digit in any script, so the meaning in those engines would be "any digit in any script, except ASCII digits 2 through 9".

(direct link)

Strange or Beautiful Ranges

Square Brackets
This will work in .NET, Perl, PCRE and Python.
The crazy thing is that there is a lot of variation among engines as to which brackets need to be escaped. While [\]\[] will work everywhere, in JavaScript you can use [[\]], and in Java you can use []\[].

Words you can Type with your Left Hand
(But you'll need a QWERTY keyboard.)
Words you can Type with your Right Hand (QWERTY keyboard)
Words that only use Letters from the Top Row (QWERTY keyboard)

(direct link)


Any Character Including Line Breaks
These are ways to replicate the behavior of the dot in DOTALL mode (by default, the dot does not match line breaks): [\S\s] or [\D\d] or [\w\W]. Note that in each of these classes, I have tried to place in first position the token that has the greatest chance of matching first (which of course would depend on the target text).

Any White-Space Character Except the Newline Character
You may not have a use for this, but it's an interesting class making use of double negation. We're negating \S, so that's the same as all white-space characters \s. But the \n removes itself from the set.
Alternative to [\r\n] for Java and Ruby 2+
(?![ \t\cK\f])\s
This rather pointless regex (except as a learning device) relies on the fact that in these three engines \s matches an ASCII space, a tab, a line feed, a carriage return, a vertical tab or a form feed: the negative lookahead removes all of those characters except the newline and carriage return.

(direct link)


French Letters
German Letters
The controversial capital letter for ß, now included in unicode, is missing in many fonts, so it might show on your screen as a question mark.
Polish Letters
[a-pr-uwy-zA-PR-UWY-ZąćęłńóśźżĄĆĘŁŃÓŚŹŻ] Note that there is no Q, V and X in Polish. But if you want to allow all English letters as well, use [a-zA-ZąćęłńóśźżĄĆĘŁŃÓŚŹŻ]

Italian Letters
Spanish Letters



Be the First to Leave a Comment

All comments are moderated.
Link spammers, this won't work for you.

To prevent automatic spam, we require that you type the two words below before you submit your comment.