Using Regular Expressions with C#


The C# regex tutorial is not as fully fleshed out as I would like, but I'm working on it. In the meantime, I have some great material to get you started in C#'s particular flavor of regex.


What's on this Page

With the C# page and the other language pages, my goal is not to teach you regex. That's what the rest of the site is for! My goal here is to get you fully up and running in your language by:
✽ Explaining the features that are specific to your language and regex flavor
✽ Giving you full working programs that demonstrate these features

I believe in learning features by seeing working code. This is what most of this page is about.

Table of Contents
Here are some jump points to the content on this page.
What does the .NET regex flavor taste like?
What's missing from .NET regex?
C# regex: the first three things you must know
Two programs for all common regex tasks
The "simple" program
The "advanced" program
Differences in .NET Regex across .NET Versions
Capture Groups that can be Quantified
Named Capture Reuse
Quantified Reused Named Groups
Balancing Groups
An Alternate engine: PCRE.NET



What does the .NET regex flavor taste like?

If you hate Windows, you're going to hate .NET regular expressions.
Why… Because it sucks?
No. Because feature-for-feature, it may well be the best regex engine out there, or at least one of the top two contenders for the spot. What's more, if for any reason you just don't like it, you can use the brilliant PCRE.NET interface to the PCRE library.

Among other features, .NET regular expressions have:

✽ Infinite-width lookbehind. This means you can write something like (?<=\d+\w+), which is extremely convenient when you need to check context. If you are writing code, the only other engine to offer this feature is Matthew Barnett's experimental regex module for Python. Jan Goyvaerts' proprietary JGSoft flavor (EditPad, RegexBuddy, PowerGrep) also supports infinite-width lookbehind, but only Jan is allowed to write code with it.

✽ Capture groups that can be quantified. This means that if you write (\w+\s)+, the engine will return not just one Group 1 capture, but a whole array of them. This has terrific value for parsing strings with an arbitrary number of tokens.

✽ Character class subtraction. This allows you to write [a-z0-9-[mp3]], which means you shouldn't be listening to loud music while writing regex. Err… sorry, I meant, this means you can match all lowercase English letters and digits except the characters m, p and 3.

✽ Reuse of capture group names. In .NET, you can reuse the name of a capture group, as with pets in (?<pets>cats) and dogs|(?<pets>pigs) and whistles.

✽ Optional right-to-left matching. I'll soon add a trick to demonstrate a situation where this could be handy. Bear in mind that in other languages, a workaround would be to reverse the string before matching, then to reverse the results.

✽ Balancing groups. This feature allows you to ensure that a subexpression is matched the same number of times as another, and to construct a number of useful counting tricks.

✽ (?n) modifier (also accessible via the ExplicitCapture option). This turns all unnamed (parentheses) into non-capture groups. To capture, you must use named groups.
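To make a few of these features concrete, here is a minimal sketch of infinite-width lookbehind, right-to-left matching and the ExplicitCapture option. The class name, method names and sample strings are made up for illustration; only the Regex calls themselves come from the .NET library.

```csharp
using System;
using System.Text.RegularExpressions;

static class FlavorDemo
{
    // Infinite-width lookbehind: match digits only when they are immediately
    // preceded by "id", an equals sign and any amount of whitespace.
    public static string AfterLabel(string s) =>
        Regex.Match(s, @"(?<=\bid\s*=\s*)\d+").Value;

    // Right-to-left matching: the first match reported is the last one in the string.
    public static string LastNumber(string s) =>
        Regex.Match(s, @"\d+", RegexOptions.RightToLeft).Value;

    // ExplicitCapture, same as (?n): plain parentheses stop capturing, so only
    // the overall match (Group 0) and the named groups are counted.
    public static int GroupCount(string pattern) =>
        new Regex(pattern, RegexOptions.ExplicitCapture).GetGroupNumbers().Length;

    static void Main()
    {
        Console.WriteLine(AfterLabel("id   =    42"));                   // 42
        Console.WriteLine(LastNumber("10 apples, 20 pears, 30 plums"));  // 30
        Console.WriteLine(GroupCount(@"(\w+):(?<num>\d+)"));             // 2
    }
}
```

In other flavors, the LastNumber task would typically mean collecting all matches and keeping the final one, or reversing the string.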



What's Missing from .NET regex?

.NET regex certainly doesn't have it all, though some features that seem lacking are cleanly achieved through other means. If one of your favorite features from Perl or PCRE is really missing, don't despair yet, as you can use the brilliant PCRE.NET interface to the PCRE engine.

✽ Subroutines. In .NET, you cannot write (\d+):(?1). This is a feature I miss. By extension, neither can you write something like (?(DEFINE)(?<digits>\d+))(?&digits):(?&digits)

✽ .NET regex does not have the \K "keep out" feature. However, in Perl and PCRE, \K is only a convenience to (partially) make up for the lack of infinite-width lookbehinds.

✽ No possessive quantifiers as in \w++. I know, this is only a shorthand notation for the atomic group (?>\w+)… But it is much tidier.

✽ No branch reset. .NET does not allow (?|(cats) and dogs|(pigs) and whistles), where Group 1 can be defined in multiple places in the pattern. However, .NET lets you achieve the same by reusing a named group: (?<pets>cats) and dogs|(?<pets>pigs) and whistles
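For each of the gaps listed above, the sketch below shows one possible workaround. Composing the pattern from a C# string in place of a subroutine call is my own suggestion rather than a regex feature, and all names and sample strings are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

static class WorkaroundDemo
{
    // No subroutines: instead of (\d+):(?1), reuse a pattern fragment
    // by composing the pattern from an ordinary C# string.
    const string Digits = @"\d+";
    public static readonly Regex TwoNumbers =
        new Regex("(" + Digits + "):(" + Digits + ")");

    // No \K: instead of price:\s*\K\d+, put the context in a lookbehind.
    public static string Price(string s) =>
        Regex.Match(s, @"(?<=price:\s*)\d+").Value;

    // No possessive quantifiers: instead of \w++, use the atomic group (?>\w+).
    public static bool AtomicEndsInDigit(string s) =>
        Regex.IsMatch(s, @"^(?>\w+)\d$");

    // No branch reset: reuse one group name on both sides of the alternation.
    public static string Pet(string s) =>
        Regex.Match(s, @"(?<pets>cats) and dogs|(?<pets>pigs) and whistles")
             .Groups["pets"].Value;

    static void Main()
    {
        Console.WriteLine(TwoNumbers.IsMatch("12:34"));        // True
        Console.WriteLine(Price("price:   15"));               // 15
        // The atomic group keeps every word character, so \d has nothing left
        Console.WriteLine(AtomicEndsInDigit("abc1"));          // False
        // Without the atomic group, \w+ gives back the final "1"
        Console.WriteLine(Regex.IsMatch("abc1", @"^\w+\d$"));  // True
        Console.WriteLine(Pet("pigs and whistles"));           // pigs
    }
}
```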



The first three things you must know to use regex in C#

Before you start using regex in C#, at a bare minimum, you should know these three things.

1. Import the .NET Regex Library
To access the regex classes, you need this line at the top of your code:

using System.Text.RegularExpressions;

2. Use Verbatim Strings
Regex patterns are full of backslashes. In a normal string, you have to escape them, which prevents you from pasting patterns straight from a regex tool. To get around this problem, use C#'s verbatim strings, whose characters lose any special significance for the compiler. To make a verbatim string, precede your string with an @ character, like so:

string myPattern = @"Score: \w+: \d+";

Verbatim strings can span multiple lines. This is useful for your regex subjects as well as regex patterns that use free-spacing mode. For instance:

string mySubject = @"Arizona, AZ 100
California, CA 122
South Dakota, SD 33
";

string myPattern = @"(?xm) # free-spacing mode
^([\w\s]+),\s # State
([A-Z]{2}\s) # State abbreviation
(\d+) # Value of a dollar, in cents
";

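To see the multi-line subject and the free-spacing pattern above at work, here is a small sketch that runs one against the other. The class name, the Parse helper and the pipe-separated output format are assumptions made for the demo.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class VerbatimDemo
{
    public const string Subject = @"Arizona, AZ 100
California, CA 122
South Dakota, SD 33
";

    public const string Pattern = @"(?xm) # free-spacing mode
^([\w\s]+),\s # State
([A-Z]{2}\s) # State abbreviation
(\d+) # Value of a dollar, in cents
";

    // Returns one "state|abbreviation|value" string per matched line.
    public static List<string> Parse()
    {
        var rows = new List<string>();
        foreach (Match m in Regex.Matches(Subject, Pattern))
        {
            // Group 2 includes the trailing space, so trim it
            rows.Add(m.Groups[1].Value + "|" +
                     m.Groups[2].Value.Trim() + "|" +
                     m.Groups[3].Value);
        }
        return rows;
    }

    static void Main()
    {
        // A regular string needs doubled backslashes; a verbatim string does not
        Console.WriteLine("\\d+" == @"\d+");   // True
        foreach (string row in Parse()) Console.WriteLine(row);
    }
}
```

The # comments inside the pattern are ignored by the engine because of the x flag, which is what makes the multi-line layout possible.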
3. Watch Out for \w and \d
By default, .NET's regex classes are Unicode-aware. A .NET string is a sequence of UTF-16 code units, and the regex tokens \w, \d and \s behave accordingly, matching any character that Unicode classifies as a word character, decimal digit or whitespace character, in any script.

This means that by default,

\d+ will match 123۳۲١८৮੪૯୫୬१७੩௮௫౫೮൬൪๘໒໕២៧៦᠖

\w+ will match abcdᚠᚱᚩᚠᚢᚱტყაოსdᚉᚔమరמטᓂᕆᔭᕌसられま래도

\s+ will match all kinds of strange whitespace characters you've never dreamed of.

If all you wanted was English digits for \d, English letters, English digits and underscore for \w and whitespace characters you can understand for \s, then you need to set the ECMAScript option. Here's how to do it:

var r2 = new Regex(@"\d+", RegexOptions.ECMAScript);
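The difference is easy to check for yourself. The sketch below (names and sample string are mine) runs \d+ on a string that mixes ASCII digits with two Arabic-Indic digits, written as \u escapes so the source file's encoding does not matter, and prints the match length with and without the option.

```csharp
using System;
using System.Text.RegularExpressions;

static class DigitDemo
{
    // Returns the first run of digits found under the given options.
    public static string FirstDigits(string s, RegexOptions options) =>
        Regex.Match(s, @"\d+", options).Value;

    static void Main()
    {
        // "123" followed by ARABIC-INDIC DIGIT THREE and FOUR
        string mixed = "123\u0663\u0664";

        // Default behavior: all five characters count as digits
        Console.WriteLine(FirstDigits(mixed, RegexOptions.None).Length);        // 5

        // ECMAScript option: \d is limited to 0-9
        Console.WriteLine(FirstDigits(mixed, RegexOptions.ECMAScript).Length);  // 3

        // An explicit class is another way to get ASCII digits only
        Console.WriteLine(Regex.Match(mixed, @"[0-9]+").Value);                 // 123
    }
}
```

Spelling out [0-9] can be handy when you want ASCII digits but need options that cannot be combined with ECMAScript.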


Two C# programs that show how to perform common regex tasks

Whenever I start playing with the regex features of a new language, the thing I always miss the most is a complete working program that performs the most common regex tasks—and some not-so-common ones as well.

This is what I have for you in the two following complete C# regex programs. There are two programs: a "simple" one and an "advanced" one. Yes, these terms are subjective.

Both programs perform the same six common regex tasks, but in different contexts. The first four tasks answer the most common questions we use regex for:

✽ Does the string match?
✽ How many matches are there?
✽ What is the first match?
✽ What are all the matches?

The last two tasks perform two other common regex tasks:

✽ Replace all matches
✽ Split the string

If you study this code, you'll have a terrific starting point for tweaking and testing your own expressions in C#.

Differences between the "Simple" and "Advanced" program

Here is the difference in a nutshell.
• The simple program assumes that the overall match and its capture groups are the data we're seeking. This is what you would expect.
• The advanced program assumes that we have no interest in the overall matches, but that the data we're seeking is in capture Group 1, if it is set.

Have fun tweaking
With these two programs in hand, you should have a solid base to understand how to do basic things—and fairly advanced ones as well. I hope you'll have fun changing the pattern, deleting code fragments you don't need and tweaking those you do need.

Disclaimer
As you can imagine, I am not fluent in all of the ten or so languages showcased on the site. This means that although the sample code works, a C# pro may look at the code and see a more idiomatic way of testing an empty value or iterating a structure. If some idiomatic improvements jump out at you, please shoot me a comment.



C# Regex Program #1: Simple

Please note that usually you will choose to perform only one of the six tasks in the code, so your own code will be much shorter.


using System;
using System.Text.RegularExpressions;
class Program {
static void Main()    {
string s1 = @"apple:green:3 banana:yellow:5";
var myRegex = new Regex(@"(\w+):(\w+):(\d+)");

///////// The six main tasks we're likely to have ////////

// Task 1: Is there a match?
Console.WriteLine("*** Is there a Match? ***");
if (myRegex.IsMatch(s1)) Console.WriteLine("Yes");
else Console.WriteLine("No");

// Task 2: How many matches are there?
MatchCollection AllMatches = myRegex.Matches(s1);
Console.WriteLine("\n" + "*** Number of Matches ***");
Console.WriteLine(AllMatches.Count);

// Task 3: What is the first match?
Console.WriteLine("\n" + "*** First Match ***");
Match OneMatch = myRegex.Match(s1);
if (OneMatch.Success)    {
    Console.WriteLine("Overall Match: "+ OneMatch.Groups[0].Value);
    Console.WriteLine("Group 1: " + OneMatch.Groups[1].Value);
    Console.WriteLine("Group 2: " + OneMatch.Groups[2].Value);
    Console.WriteLine("Group 3: " + OneMatch.Groups[3].Value);
    }

// Task 4: What are all the matches?
Console.WriteLine("\n" + "*** Matches ***");
if (AllMatches.Count > 0)    {
    foreach (Match SomeMatch in AllMatches)    {
        Console.WriteLine("Overall Match: " + SomeMatch.Groups[0].Value);
        Console.WriteLine("Group 1: " + SomeMatch.Groups[1].Value);
        Console.WriteLine("Group 2: " + SomeMatch.Groups[2].Value);
        Console.WriteLine("Group 3: " + SomeMatch.Groups[3].Value);
    }
}

// Task 5: Replace the matches
// simple replacement: reverse groups
string replaced = myRegex.Replace(s1, 
       delegate(Match m) {
       return m.Groups[3].Value + ":" +
              m.Groups[2].Value + ":" +
              m.Groups[1].Value;
        }
        );
Console.WriteLine("\n" + "*** Replacements ***");
Console.WriteLine(replaced);

// Task 6: Split
// Let's split at colons or spaces
string[] splits = Regex.Split(s1, @":|\s");
Console.WriteLine("\n" + "*** Splits ***");
foreach (string split in splits) Console.WriteLine(split);

Console.WriteLine("\nPress Any Key to Exit.");
Console.ReadKey();

} // END Main
} // END Program
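Task 5 above uses a delegate to build the replacement. For a simple reordering of groups, a replacement string with $1-style references does the same job; here is a minimal sketch, with a lambda version alongside for comparison. The class and method names are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

static class ReplaceDemo
{
    const string Pattern = @"(\w+):(\w+):(\d+)";

    // Replacement string: $1, $2, $3 refer to the capture groups
    public static string ReverseWithString(string s) =>
        Regex.Replace(s, Pattern, "$3:$2:$1");

    // Same result with a lambda, the shorter modern form of the delegate syntax
    public static string ReverseWithLambda(string s) =>
        Regex.Replace(s, Pattern,
            m => m.Groups[3].Value + ":" + m.Groups[2].Value + ":" + m.Groups[1].Value);

    static void Main()
    {
        string s1 = "apple:green:3 banana:yellow:5";
        Console.WriteLine(ReverseWithString(s1));  // 3:green:apple 5:yellow:banana
        Console.WriteLine(ReverseWithLambda(s1));  // 3:green:apple 5:yellow:banana
    }
}
```

The callback form earns its keep when the replacement depends on logic that a replacement string cannot express, such as a lookup or a calculation.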



C# Regex Program #2: Advanced

The second full C# regex program is featured on my page about the best regex trick ever.

✽ Here is the article's Table of Contents
✽ Here is the explanation for the code
✽ Here is the C# code
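The sketch below is not the program from that page. It is only a rough illustration, under my own assumptions about pattern and subject, of the idea described earlier: the overall match is ignored, and the data we want lives in Group 1 when that group is set.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class GroupOneDemo
{
    // Left side of the alternation: content we want to skip (anything in angle brackets).
    // Right side: the content we want, captured in Group 1.
    static readonly Regex ThreeDigits = new Regex(@"<[^>]*>|\b(\d{3})\b");

    public static List<string> Wanted(string subject)
    {
        var found = new List<string>();
        foreach (Match m in ThreeDigits.Matches(subject))
        {
            // Group 1 is only set when the right side of the alternation matched
            if (m.Groups[1].Success) found.Add(m.Groups[1].Value);
        }
        return found;
    }

    static void Main()
    {
        string subject = "<000> 111 <222> 333 4444";
        Console.WriteLine(string.Join(",", Wanted(subject)));  // 111,333
    }
}
```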




Differences in .NET Regex across .NET Versions

If you're using the latest version of .NET, don't worry about this section. After 2.0—and you're unlikely to target an earlier version—the new features are few.

New regex features in .NET 4.5
The timeout feature lets you limit the damage of catastrophic backtracking. When you construct a Regex, you can now pass a third argument that sets the timeout. For instance,
var myregex = new Regex( @"(A+)+",
                         RegexOptions.IgnoreCase,
                         TimeSpan.FromSeconds(1) );
ensures the engine searches for a match for one second at most, after which it throws a RegexMatchTimeoutException. This timeout is observed by IsMatch, Match, Matches, Replace, Split and Match.NextMatch.
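Since the timeout surfaces as an exception, the calling code needs a try/catch. Here is one way it could look; the SafeIsMatch helper, its nullable return value and the test strings are assumptions of this sketch. Whether the second call actually times out depends on your machine and .NET version, so no output is promised for it.

```csharp
using System;
using System.Text.RegularExpressions;

static class TimeoutDemo
{
    // Returns true or false when the match attempt completes,
    // or null when the engine gives up because the timeout elapsed.
    public static bool? SafeIsMatch(string pattern, string input, TimeSpan timeout)
    {
        try
        {
            return new Regex(pattern, RegexOptions.None, timeout).IsMatch(input);
        }
        catch (RegexMatchTimeoutException)
        {
            return null;
        }
    }

    static void Main()
    {
        // A harmless pattern completes well within the limit
        Console.WriteLine(SafeIsMatch(@"\d+", "abc 123", TimeSpan.FromSeconds(1)));  // True

        // Nested quantifiers on a subject that almost matches invite heavy backtracking
        string risky = new string('A', 40) + "!";
        bool? result = SafeIsMatch(@"^(A+)+$", risky, TimeSpan.FromMilliseconds(200));
        Console.WriteLine(result.HasValue ? result.Value.ToString() : "timed out");
    }
}
```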



Character Class Subtraction

The syntax […-[…]], which allows you to subtract one character class from another, is unique to .NET, though Java and Ruby 2+ have syntax that allows a similar operation.

For details, see character class subtraction in .NET on the page about character class operations.
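As a quick illustration, the sketch below tries two subtracted classes: the [a-z0-9-[mp3]] class mentioned earlier on this page, and a class of lowercase consonants. The helper name and sample strings are invented for the demo.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class SubtractionDemo
{
    // Returns the text of every match of pattern in s.
    public static string[] Find(string pattern, string s) =>
        Regex.Matches(s, pattern).Cast<Match>().Select(m => m.Value).ToArray();

    static void Main()
    {
        // Lowercase letters and digits, minus m, p and 3
        Console.WriteLine(string.Join(",", Find(@"[a-z0-9-[mp3]]+", "mp3player2")));  // layer2

        // Lowercase letters minus the vowels, i.e. consonants
        Console.WriteLine(string.Join(",", Find(@"[a-z-[aeiou]]+", "regex")));        // r,g,x
    }
}
```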



Capture Groups that can be Quantified

You'll recall from the page about regex capture groups that when you place a quantifier after a capturing group, as in (\d+:?)+, the regex engine doesn't create multiple capture groups for you. Instead, the capture group returns the string that was captured last. For instance, if we used the above regex on the string 111:22:33, the engine would match the whole string, and capture Group 1 would be reported as 33.

Well, with .NET regex, all of that changes. If you just ask, C# will still report that Group 1 is 33. But if you dig deeper into Group 1, C# will also return a collection of captures with all the values that Group 1 captured in succession because of the + quantifier.

This feature is a game changer, because it lets you easily parse strings with an unknown number of tokens.

For instance, consider a file with a series of word translations for a number of languages, like so:

Italian:one=uno,two=due German:one=ein,two=zwei,three=drei,four=vier Japanese:one=ichi,two=ni,three=san

For each language, you would like to parse the English word (e.g. "two") and its translation (e.g. "zwei") into variables. If you had the same number of definitions for each language, you could accomplish this with a fixed number of capture groups. But, as you can see, Italian has two definitions, German has four, Japanese has three. For normal regex, the task is complex because you cannot create capture groups on the fly.

With .NET, you have a simple solution. Consider a regex that matches each language at a time. It could look like this:

\w+:(?:(\w+)=(\w+),?)+
The \w+: corresponds to the language (e.g. Italian:). Inside of the non-capturing parentheses, we define a dictionary pair, capturing the English word to Group 1 and the translation to Group 2. The ,? is just an optional comma (there is no comma after each language's last entry). So far, this is all normal. The odd thing here is the + quantifier that repeats the expression for a dictionary pair. What happens to the capture Groups?

In normal regex, if we had just matched the Italian entries, Group 1 and Group 2 would correspond to the last dictionary pair captured for that match, i.e. two for Group 1 and due for Group 2. In .NET, Group 1 and Group 2 are objects. Their Value property is the same as in other regex flavors, i.e. the last dictionary pair captured for the current match. The twist is that each Group has a member called Captures, which is an object that contains all the captures that were made for that group during the match. Therefore for the first match (the Italian entries), Group 1's Captures member will contain two captures, whose values are "one" and "two".

The code below uses this example and shows you exactly how to implement the feature.

Before you examine the code, have a look at the output, which explains how the groups work.

using System;
using System.Text.RegularExpressions;
class Program {
static void Main() {
string ourString = @"Italian:one=uno,two=due Japanese:one=ichi,two=ni,three=san";
string ourPattern = @"\w+:(?:(\w+)=(\w+),?)+";
var ourRegex = new Regex(ourPattern);
MatchCollection AllMatches = ourRegex.Matches(ourString);

Console.WriteLine("**** Understanding .NET Capture Groups with Quantifiers ****");
Console.WriteLine("\nOur string today is:" + ourString);
Console.WriteLine("Our regex today is:" + ourPattern);
Console.WriteLine("Number of Matches: " + AllMatches.Count);

Console.WriteLine("\n*** Let's Iterate Through the Matches ***");
int matchNum = 1;
foreach (Match SomeMatch in AllMatches)    {
    Console.WriteLine("Match number: " + matchNum++);
    Console.WriteLine("Overall Match: " + SomeMatch.Value);
    Console.WriteLine("\nNumber of Groups: " + SomeMatch.Groups.Count);
    Console.WriteLine("Why three Groups, not two? Because the overall match is Group 0");
    // Another way of printing the overall match
    Console.WriteLine("Group 0: " + SomeMatch.Groups[0].Value);
    Console.WriteLine("Since Groups 1 and 2 have quantifiers, the value of Group 1 and Group 2 is the last capture of each group");
    Console.WriteLine("Group 1: " + SomeMatch.Groups[1].Value);
    Console.WriteLine("Group 2: " + SomeMatch.Groups[2].Value);
    // For this match, let's look at all the Group 1 captures manually
    int g1capCount = SomeMatch.Groups[1].Captures.Count;
    Console.WriteLine("Number of Group 1 Captures: " + g1capCount);
    Console.WriteLine("Group 1 Capture 0: " + SomeMatch.Groups[1].Captures[0].Value);
    Console.WriteLine("Group 1 Capture 1: " + SomeMatch.Groups[1].Captures[1].Value);
    // To be safe, we'll check if we have a third capture for Group 1
    // Because the first overall match only has two captures
    if(g1capCount>2) Console.WriteLine("Group 1 Capture 2: " + SomeMatch.Groups[1].Captures[2].Value);
    // Let's look at Group 2 captures automatically
    int g2capCount = SomeMatch.Groups[2].Captures.Count;
    Console.WriteLine("Number of Group 2 Captures: " + g2capCount);
    int i2 = 0;
    foreach (Capture g2capture in SomeMatch.Groups[2].Captures ) {
        Console.WriteLine("Group 2 Capture " + i2 + ": " + g2capture.Value);
        i2++;
        } // end iterate G2 captures
    Console.WriteLine("\n");
    } // end iterate matches

Console.WriteLine("\nPress Any Key to Exit.");
Console.ReadKey();

} // END Main
} // END Program
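Once you are comfortable with the Captures collection, a natural next step is to load the pairs into a data structure. The sketch below does this for the three-language string from the start of this section. It makes one change to the pattern, which is my own assumption: an extra capture group around the language name, so the English words and translations move to Groups 2 and 3. The class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class GlossaryDemo
{
    // Group 1: language, Group 2: English words, Group 3: translations
    const string Pattern = @"(\w+):(?:(\w+)=(\w+),?)+";

    public static Dictionary<string, Dictionary<string, string>> Parse(string text)
    {
        var glossary = new Dictionary<string, Dictionary<string, string>>();
        foreach (Match m in Regex.Matches(text, Pattern))
        {
            CaptureCollection english = m.Groups[2].Captures;
            CaptureCollection foreign = m.Groups[3].Captures;
            var pairs = new Dictionary<string, string>();
            // Both groups capture once per repetition, so the counts line up
            for (int i = 0; i < english.Count; i++)
                pairs[english[i].Value] = foreign[i].Value;
            glossary[m.Groups[1].Value] = pairs;
        }
        return glossary;
    }

    static void Main()
    {
        string text = "Italian:one=uno,two=due " +
                      "German:one=ein,two=zwei,three=drei,four=vier " +
                      "Japanese:one=ichi,two=ni,three=san";
        var glossary = Parse(text);
        Console.WriteLine(glossary.Count);              // 3
        Console.WriteLine(glossary["German"].Count);    // 4
        Console.WriteLine(glossary["German"]["two"]);   // zwei
    }
}
```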



Named Capture Reuse

C# allows you to reuse the same named group multiple times. For a given group, you retrieve the array of captures in the same way as with quantified capture groups. The following program shows you how.

using System;
using System.Text.RegularExpressions;
class Program {
static void Main()    {
string ourString = @"one:uno dos:two three:tres";
string ourPattern = @"(?<someword>\w+):(?<someword>\w+)";
var ourRegex = new Regex(ourPattern);
MatchCollection AllMatches = ourRegex.Matches(ourString);

Console.WriteLine("**** Understanding Named Capture Reuse ****");
Console.WriteLine("\nOur string today is:" + ourString);
Console.WriteLine("Our regex today is:" + ourPattern);
Console.WriteLine("Number of Matches: " + AllMatches.Count);

Console.WriteLine("\n*** Let's Iterate Through the Matches ***");
int matchNum = 1;
foreach (Match SomeMatch in AllMatches)    {
    Console.WriteLine("Match number: " + matchNum++);
    Console.WriteLine("Overall Match: " + SomeMatch.Value);
    Console.WriteLine("\nNumber of Groups: " + SomeMatch.Groups.Count);
    Console.WriteLine("Why two Groups, not one? Because the overall match is Group 0");
    // Another way of printing the overall match
    Console.WriteLine("Groups[0].Value = " + SomeMatch.Groups[0].Value);
    Console.WriteLine(@"Since the 'someword' group appears more than once in the pattern, the value of Groups[1] and Groups[""someword""] is the last capture of each group");
    Console.WriteLine("Groups[1].Value = " + SomeMatch.Groups[1].Value);
    Console.WriteLine(@"Groups[""someword""].Value = " + SomeMatch.Groups["someword"].Value);
    // Let's look at the first capture manually
    Console.WriteLine("Someword Capture 0: " + SomeMatch.Groups["someword"].Captures[0].Value);

    // Let's look at the someword captures automatically
    int somewordCapCount = SomeMatch.Groups["someword"].Captures.Count;
    Console.WriteLine("Number of someword Captures: " + somewordCapCount);
    int i2 = 0;
    foreach (Capture someword in SomeMatch.Groups["someword"].Captures)    {
        Console.WriteLine("someword Capture " + i2 + ": " + someword.Value);
        i2++;
        } // end iterate someword captures
    Console.WriteLine("\n");
    } // end iterate matches

Console.WriteLine("\nPress Any Key to Exit.");
Console.ReadKey();

} // END Main
} // END Program



Quantified Reused Named Groups

What if you were to combine quantified capture groups with reused named groups? No problem. For the given named capture, C# just keeps adding captured strings in the order they are captured. The following program shows you how this works.

using System;
using System.Text.RegularExpressions;
class Program {
static void Main()    {
string ourString = @"one-two-three:uno-dos-tres one-two-three:ichi-ni-san";
string ourPattern = @"(?:(?<someword>\w+)-?)+:(?:(?<someword>\w+)-?)+";
var ourRegex = new Regex(ourPattern);
MatchCollection AllMatches = ourRegex.Matches(ourString);

Console.WriteLine("**** Understanding Quantified Reused Named Groups ****");
Console.WriteLine("\nOur string today is:" + ourString);
Console.WriteLine("Our regex today is:" + ourPattern);
Console.WriteLine("Number of Matches: " + AllMatches.Count);

Console.WriteLine("\n*** Let's Iterate Through the Matches ***");
int matchNum = 1;
foreach (Match SomeMatch in AllMatches)    {
    Console.WriteLine("Match number: " + matchNum++);
    Console.WriteLine("Overall Match: " + SomeMatch.Value);
    Console.WriteLine("\nNumber of Groups: " + SomeMatch.Groups.Count);
    Console.WriteLine("Why two Groups, not one? Because the overall match is Group 0");
    // Another way of printing the overall match
    Console.WriteLine("Groups[0].Value = " + SomeMatch.Groups[0].Value);
    Console.WriteLine(@"Since the 'someword' group appears more than once in the pattern, the value of Groups[1] and Groups[""someword""] is the last capture of each group");
    Console.WriteLine("Groups[1].Value = " + SomeMatch.Groups[1].Value);
    Console.WriteLine(@"Groups[""someword""].Value = " + SomeMatch.Groups["someword"].Value);
    // Let's look at the first capture manually
    Console.WriteLine("Someword Capture 0: " + SomeMatch.Groups["someword"].Captures[0].Value);

    // Let's look at the someword captures automatically
    int somewordCapCount = SomeMatch.Groups["someword"].Captures.Count;
    Console.WriteLine("Number of someword Captures: " + somewordCapCount);
    int i2 = 0;
    foreach (Capture someword in SomeMatch.Groups["someword"].Captures)    {
        Console.WriteLine("someword Capture " + i2 + ": " + someword.Value);
        i2++;
        } // end iterate someword captures
    Console.WriteLine("\n");
    } // end iterate matches

Console.WriteLine("\nPress Any Key to Exit.");
Console.ReadKey();

} // END Main
} // END Program



Balancing Groups

I haven't yet written this section, but there are great examples of this feature in several sections of the site. These will show you everything you need to know to get started with balancing groups.

Matching Balanced Strings such as AAA foo BBB
Matching Line Numbers
Quantifier Capture
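Until this section is written, here is a small taste of the feature: a commonly used balancing-group pattern that checks whether parentheses are balanced. Treat it as a sketch; the group name open and the helper IsBalanced are choices made for this example.

```csharp
using System;
using System.Text.RegularExpressions;

static class BalanceDemo
{
    // (?<open>\()   pushes a capture onto the "open" stack for each (
    // (?<-open>\))  pops one capture for each ), and fails if the stack is empty
    // (?(open)(?!)) fails at the end if any ( is still unmatched
    static readonly Regex Balanced = new Regex(
        @"^(?:[^()]|(?<open>\()|(?<-open>\)))*(?(open)(?!))$");

    public static bool IsBalanced(string s) => Balanced.IsMatch(s);

    static void Main()
    {
        Console.WriteLine(IsBalanced("(a(b)c)"));  // True
        Console.WriteLine(IsBalanced("(a(b)c"));   // False
        Console.WriteLine(IsBalanced(")("));       // False
    }
}
```

The same push-and-pop idea is what makes the counting tricks mentioned at the top of this page possible.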




An Alternate engine: PCRE.NET

PCRE is another of my favorite engines. In fact, this site probably has the most comprehensive resources about PCRE on the web, from the PCRE documentation and changelog to PCRE's special start of pattern modifiers, backtracking control verbs and the pcregrep and pcretest utilities.

So when I found out about Lucas Trzesniewski's .NET wrapper around the PCRE library, I was excited. This means you can get around .NET's lack of a few features such as recursion.

In Visual Studio 2015, installation is a snap:

✽ Create a project.
✽ Press Ctrl + Q for the Quick Launch window, type nuget and select Manage NuGet Packages for Solution.
✽ In the search window, type pcre.net, making sure that the filters pull-down is set to All.
✽ Install.

The Visual C++ Redistributable for Visual Studio 2015 is a requirement, but you probably won't need to install it if you installed all of VS2015.

Compared with using .NET regex, one difference to keep in mind is that on top of the .exe file, you'll have to distribute PCRE.NET.dll (which will be in your build folder). It only weighs 350kB, so that's not a big deal. Still, if for some reason you're shooting for the size of a small console program such as the one below (about 7kB once compiled), this will blow the budget. Of course, in the case of a pure .NET solution you're probably "paying" a similar weight, but it's hidden in the framework's System.Text.RegularExpressions.dll (29kB) and (I assume) its parents.

To get you started, I'll give you a simple but fully functioning program that showcases the main methods. Beyond that,
✽ please visit my page about PCRE callouts, which shows more code examples in PCRE.NET
✽ see the GitHub repo if you'd like more information.

I hope you'll forgive the weird indentation—I wanted everything to fit within the narrow box.

using System;
using System.Linq;
using PCRE;

class Program {
static void Main() {
string subject = "<000> 111 <222> 333 4444";
// Match three digits, unless they live inside angle brackets
var digits_regex = new PcreRegex(@"<[^>]+(*SKIP)(*F)|\b\d(\d)\d\b");

// Does the pattern match?
Console.WriteLine("=== Does it Match? ===");
Console.WriteLine(digits_regex.IsMatch(subject));

// What is the first match?
Console.WriteLine("=== First Match ===");
var onematch = digits_regex.Match(subject);
if (onematch.Success) {
  // onematch.Value is the same as onematch.Groups[0].Value
  Console.WriteLine(onematch.Value);
}

// What is Capture Group 1?
Console.WriteLine("=== Capture Group 1 ===");
if (onematch.Success) {
  Console.WriteLine(onematch.Groups[1].Value);
}

// What are all the matches?
Console.WriteLine("=== Matches ===");
var matches = digits_regex.Matches(subject);
if (matches.Any()) {
  foreach (var match in matches) {
    Console.WriteLine(match.Value);
  }
}

// Replace: surround with angle brackets
Console.WriteLine("=== Replacements ===");
string replaced = digits_regex.Replace(subject, "<$0>");
Console.WriteLine(replaced);

// Replace using callback
Console.WriteLine("=== Replacements with Callback ===");
string replaced2 = digits_regex.Replace(subject,
  delegate (PcreMatch m) {
    if (m.Value == "111") return "<ones>";
    else return m.Value;
  });
Console.WriteLine(replaced2);

// Split
Console.WriteLine("=== Splits ===");
var splits = digits_regex.Split(subject);
foreach (var split in splits) {
  Console.WriteLine(split);
}

Console.WriteLine("Press Any Key");
Console.ReadKey();
} // END Main
} // END Program






Smiles,

Rex
