Using Regular Expressions with Python


Lately, I have been working hard on beefing up the site. There are exciting new pages, and old ones have shiny new sections. The Python regex tutorial is not fully ready for prime-time, but it's one of four at the top of my priority list. I'm working on it!

In the meantime, I don't want to leave you Python coders high and dry, so below there are two programs that show everything you need to get started with Python regex. But first, I feel that a word is in order about the feature set in the re module.


What's Missing from the re Module (lots!)…

 …and why the alternate regex Module is One of the Best

I'm sure you don't want to hear a lot of bad news right at the start, but this is both bad news and excellent news. You should know that re, the default regex engine for Python, is the second worst among all the major engines (granted, JavaScript wins the loser contest by a wide margin). You should also know that Python has an alternate regular expressions module called regex, which is possibly the very best engine available in the major languages.

Missing from the re module
So what's missing from the re module?
Here is a list I've cobbled together. It's incomplete but will give you an idea:

✽ Atomic groups and possessive quantifiers (added in Python 3.11)
✽ Unicode properties
✽ Variable-width lookbehind
✽ \G
✽ \K
✽ \z
✽ Splitting on zero-width matches (fixed in Python 3.7)
✽ Subroutine calls and recursion
✽ Character class operations
✽ Branch reset
✽ (*SKIP)(*FAIL)
✽ Advanced features for inline modifiers such as (?i): setting them in the middle of a pattern and turning them off as in (?-i). Applying them to a subexpression, as in (?i:foo), has been supported since Python 3.6.
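
To see a few of these gaps for yourself, here is a minimal sketch that asks re to compile one sample pattern per feature and reports whether it is accepted. The sample patterns and the helper name compiles_with_re are my own illustrations, not part of the re module, and the exact error messages may vary between Python versions.

```python
import re

# Hypothetical sample patterns, one per feature from the list above.
CANDIDATES = {
    "variable-width lookbehind": r"(?<=a+)b",
    "Unicode property": r"\p{L}+",
    "\\K": r"foo\Kbar",
    "\\G": r"\Gfoo",
    "recursion": r"\((?:[^()]|(?R))*\)",
}

def compiles_with_re(pattern):
    """Return (True, '') if re accepts the pattern, else (False, message)."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return False, str(exc)
    return True, ""

for feature, pattern in CANDIDATES.items():
    ok, message = compiles_with_re(pattern)
    status = "accepted" if ok else "rejected: " + message
    print(feature.ljust(28), status)
```

Each of these sample patterns should be reported as rejected by re, while an ordinary pattern such as \w+ is accepted.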


Why I use the regex package
In my view, the alternate regex package by Matthew Barnett may be the very best regex engine available in a mainstream programming language. Before you Perl fans send me flame letters, I'll explain why: the regex package combines some of the advanced features of .NET (infinite lookbehind, capture collections, character class operations, right-to-left matching) with some of the advanced features of Perl, PCRE and Ruby (subroutines and recursion). It even has a fuzzy matching feature.
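
As a small taste, the sketch below tries two of the .NET-style features mentioned above: infinite lookbehind and capture collections. It assumes the third-party regex package is installed (pip install regex); if it is not, the hypothetical helper regex_demo simply returns None. The sample strings are my own.

```python
try:
    import regex  # third-party package, not part of the standard library
except ImportError:
    regex = None

def regex_demo():
    """Return (lookbehind_matches, group_1_captures), or None without regex."""
    if regex is None:
        return None
    # Variable-width lookbehind: a "b" preceded by one or more "a"
    lookbehind = regex.findall(r"(?<=a+)b", "ab aab cb")
    # Capture collection: every value Group 1 took during the match
    captures = regex.match(r"(?:(\w+),?)+", "red,green,blue").captures(1)
    return lookbehind, captures

print(regex_demo())
```

With re, the first pattern does not compile at all, and only the last value of Group 1 would be available after the second match.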

The recent additions of \K, (?(DEFINE)…) and (*SKIP)(*FAIL) make it a delight to translate advanced patterns from Perl or PCRE. If I could add anything to my perfect regex engine dreamlist to round out this amazing engine, it would be balancing groups and some kind of ground-breaking quantifier capture feature.


An IPython Notebook presentation about the regex package
Around the time I was thinking of putting together a presentation about Python regex for our local Python meetup, I received a message from Rex Dwyer, who kindly shared a presentation he had made for his local Python users' group. Synchronicity!

You can download the presentation here. It is an IPython notebook. I have confirmed that all the cells run in Jupyter for Python 3, but I haven't yet had the time to read the presentation.



Curated Changelog to the re module

Python 3.7: splitting on zero-width matches
Python 3.8: \N was added to match specific characters by name, e.g. \N{YEN SIGN} instead of \u00A5 to match the ¥ character
Python 3.11: atomic groups and possessive quantifiers
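
Here is a minimal sketch of the \N syntax next to the older \u form. The sample subject string is my own, and the named version is guarded so the snippet still runs on interpreters older than 3.8.

```python
import re
import sys

subject = "Price: ¥500"

# Works on any Python 3: the character is given by its code point
by_codepoint = re.search(r"\u00A5(\d+)", subject)

# Python 3.8 and later: the same character is given by its Unicode name
if sys.version_info >= (3, 8):
    by_name = re.search(r"\N{YEN SIGN}(\d+)", subject)
else:
    by_name = None

print(by_codepoint.group(1))
if by_name:
    print(by_name.group(1))
```

Both searches capture the digits that follow the ¥ sign.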



Two Python programs that show how to perform common regex tasks

Whenever I start playing with the regex features of a new language, the thing I always miss the most is a complete working program that performs the most common regex tasks—and some not-so-common ones as well.

This is what I have for you in the following complete Python regex programs. There are two programs: a "simple" one and an "advanced" one. Yes, these terms are subjective.

Both programs perform the same six common regex tasks, but in different contexts. The first four tasks answer the most common questions we use regex for:

✽ Does the string match?
✽ How many matches are there?
✽ What is the first match?
✽ What are all the matches?

The last two tasks cover two other common operations:

✽ Replace all matches
✽ Split the string

If you study this code, you'll have a terrific starting point for tweaking and testing your own expressions in Python.

Differences between the "Simple" and "Advanced" program

Here is the difference in a nutshell.
• The simple program assumes that the overall match and its capture groups are the data we're seeking. This is what you would expect.
• The advanced program assumes that we have no interest in the overall matches, but that the data we're seeking is in capture Group 1, if it is set.
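
To make the second bullet concrete, here is a minimal sketch of the "Group 1, if it is set" idea. The pattern and subject are hypothetical examples of my own, not the ones used in the advanced program: quoted strings are matched and thrown away by the left side of the alternation, and only words captured by Group 1 are kept.

```python
import re

# Left side: contexts we want to ignore. Right side: what we want, in Group 1.
pattern = re.compile(r'"[^"]*"|(\w+)')
subject = 'alpha "beta gamma" delta'

# Keep only the matches where Group 1 actually participated
group1_values = [m.group(1) for m in pattern.finditer(subject)
                 if m.group(1) is not None]
print(group1_values)
```

The overall matches include the quoted string, but the list built from Group 1 holds only the words outside the quotes.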

Have fun tweaking
With these two programs in hand, you should have a solid base to understand how to do basic things—and fairly advanced ones as well. I hope you'll have fun changing the pattern, deleting code fragments you don't need and tweaking those you do need.

Disclaimer
As you can imagine, I am not fluent in all of the ten or so languages showcased on the site. This means that although the sample code works, a Python pro may look at the code and see a more idiomatic way of testing an empty value or iterating a structure. If some idiomatic improvements jump out at you, please shoot me a comment.


Python Regex Program #1: Simple

Please note that usually you will choose to perform only one of the six tasks in the code, so your own code will be much shorter.


import re
# import regex # if you like good times
# intended to replace `re`, the regex module has many advanced
# features for regex lovers. http://pypi.python.org/pypi/regex
pattern = r'(\w+):(\w+):(\d+)'
subject = 'apple:green:3 banana:yellow:5'
regex = re.compile(pattern)

######## The six main tasks we're likely to have ########

# Task 1: Is there a match?
print("*** Is there a Match? ***")
if regex.search(subject):
    print("Yes")
else:
    print("No")

# Task 2: How many matches are there?
print("\n" + "*** Number of Matches ***")
matches = regex.findall(subject)
print(len(matches))

# Task 3: What is the first match?
print("\n" + "*** First Match ***")
match = regex.search(subject)
if match:
    print("Overall match: ", match.group(0))
    print("Group 1 : ", match.group(1))
    print("Group 2 : ", match.group(2))
    print("Group 3 : ", match.group(3))

# Task 4: What are all the matches?
print("\n" + "*** All Matches ***\n")
print("------ Method 1: finditer ------\n")
for match in regex.finditer(subject):
    print("--- Start of Match ---")
    print("Overall match: ", match.group(0))
    print("Group 1 : ", match.group(1))
    print("Group 2 : ", match.group(2))
    print("Group 3 : ", match.group(3))
    print("--- End of Match---\n")

print("\n------ Method 2: findall ------\n")
# if there are capture groups, findall doesn't return the overall match
# therefore, in that case, wrap the pattern in capturing parentheses
# the overall match becomes group 1, so other group numbers are bumped up!
wrappedpattern = "(" + pattern + ")"
wrappedregex = re.compile(wrappedpattern)
matches = wrappedregex.findall(subject)
# each item is a tuple: index 0 is the overall match, 1 to 3 are the groups
for match in matches:
    print("--- Start of Match ---")
    print("Overall Match: ", match[0])
    print("Group 1: ", match[1])
    print("Group 2: ", match[2])
    print("Group 3: ", match[3])
    print("--- End of Match---\n")

# Task 5: Replace the matches
# simple replacement: reverse group
print("\n" + "*** Replacements ***")
print("Let's reverse the groups")
def reversegroups(m):
    return m.group(3) + ":" + m.group(2) + ":" + m.group(1)
replaced = regex.sub(reversegroups, subject)
print(replaced)

# Task 6: Split
print("\n" + "*** Splits ***")
# Let's split at colons or spaces
splits = re.split(r":|\s", subject)
for split in splits:
    print(split)
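
For Task 5, a replacement function is not always needed. As a possible shortcut, the sketch below reverses the groups with a replacement string that uses backreferences, first by number and then by name. The named-group pattern and its group names are my own illustration. It also shows a way to count matches for Task 2 without building a list.

```python
import re

subject = 'apple:green:3 banana:yellow:5'

# Numbered backreferences in the replacement string
regex = re.compile(r'(\w+):(\w+):(\d+)')
replaced = regex.sub(r'\3:\2:\1', subject)
print(replaced)

# Named groups (hypothetical names) referenced with \g<name>
named = re.compile(r'(?P<fruit>\w+):(?P<color>\w+):(?P<count>\d+)')
replaced_named = named.sub(r'\g<count>:\g<color>:\g<fruit>', subject)
print(replaced_named)

# Counting matches lazily with finditer instead of len(findall(...))
match_count = sum(1 for _ in regex.finditer(subject))
print(match_count)
```

A function remains the right tool when the replacement has to be computed, for instance to change case or do arithmetic on a captured number.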


Python Regex Program #2: Advanced

The second full Python regex program is featured on my page about the best regex trick ever.

✽ Here is the article's Table of Contents
✽ Here is the explanation for the code
✽ Here is the Python code


For versions prior to 3.7, re does not split on zero-width matches

EDIT: the following text is obsolete as of Python 3.7



In most regex engines, you can use lookarounds to split a string on a position, i.e. a zero-width match, obtained for instance by using boundaries or lookarounds.

For instance, you would use (?=-) to split when the next character is a dash.

However, for historical reasons (a bug that was long considered too old to fix), Python's re.split did not split on zero-width matches before version 3.7. For instance, in those versions re.split("(?=-)", "a-beautiful-day") returns ['a-beautiful-day'].

To split on zero-width matches in Python, we need to use the regex module in V1 mode. For instance, regex.split("(?V1)(?=-)", "a-beautiful-day") will return ['a', '-beautiful', '-day']—which is what we want.
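
If you are stuck on an older Python without the regex module, one possible workaround is to collect the match positions with re.finditer and slice the string yourself. This is only a sketch: the helper name split_at_positions is hypothetical, it looks only at where each match starts, and it drops empty pieces, so it is meant for zero-width patterns such as lookarounds.

```python
import re
import sys

def split_at_positions(pattern, subject):
    """Split subject at the start of every match of a zero-width pattern."""
    pieces = []
    last = 0
    for match in re.finditer(pattern, subject):
        cut = match.start()
        if cut > last:  # ignore a match at the very start or a repeated position
            pieces.append(subject[last:cut])
            last = cut
    pieces.append(subject[last:])
    return pieces

result = split_at_positions(r"(?=-)", "a-beautiful-day")
print(result)

# On Python 3.7 and later, re.split handles this case directly
if sys.version_info >= (3, 7):
    print(re.split(r"(?=-)", "a-beautiful-day"))
```

On Python 3.7 and later, both calls should give the same three pieces, so the helper is only useful for legacy interpreters.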



Smiles,

Rex
