Is ChatGPT good at regex?


For a while, my friend Marco has been telling me that he's sure ChatGPT can "beat me" at regex. Knowing how ChatGPT excels at looking right without actually being right, I had dismissed that claim. Today, I finally decided to test that claim to put the discussion at rest, and this page is the result of the experiment.

I use ChatGPT a lot. I love what it can do, but I'm also aware of its limitations. These limitations are more evident when I know a lot about a field. In fact, I've coined what I call…

(direct link)
Rex's Law of AI Trust

A person's belief that AI is competent in a field is
inversely proportional to that person's
degree of expertise in that field.

(July 11, 2025)

Let me illustrate. I've been studying the Turkish language for over two years. Initially, when I asked GPT to create sentences for me, they looked perfect and I trusted them… until I started sharing these sentences on a Discord server where natives pointed out flaws that I didn't have the skill to notice myself. The better I get at Turkish, the more I'm able to spot when ChatGPT translates things wrong or gives me wrong grammar.


Is ChatGPT good at regex?

Jumping to the conclusion (a firm No), here is my message to Marco, followed by the test I ran on ChatGPT.

Hi Marco,

This morning I thought I'd finally test your claim that gpt is better than human experts at regex. First, I wrote two versions of a simple pattern, one for Python regex, the other for something called PCRE regex, which is extremely widespread and has a more powerful syntax. Then I asked gpt to do the same, to test whether it would be able to produce something similar.

The goal is to match words that look like these: üslup, lüzum, nüfus, vücut…
The pattern is that there are exactly two vowels, the first vowel being a ü and the second being a u.
Easy, right?

Check out what happened. To avoid reading the technical stuff, you can skip what gpt says and only read my prompts (guidance and corrections).

What are your conclusions? Mine is that ChatGPT absolutely sucks at regex. Someone who doesn't know regex might think it's doing a good job and adopt patterns that either overmatches (matching things it shouldn't) or undermatches (failing to match things that it should). Not only that, in the PCRE section gpt made a gross syntax error. It's garbage.

And this was a trivial pattern… A vastly more complex regex can be made to match something more subtle.

Things that gpt did well:
- it thought to use its knowledge of Turkish to make a list of Turkish vowels (although that list was incomplete)
- it corrected its mistakes, over multiple iterations, until reaching working versions for both regex flavors

The Experiment

For ease of reading, I've added links that say "next prompt" so you can just follow how I'm guiding and correcting GPT.

For reference, here's a link to the same experiment directly on ChatGPT: ChatGPT and regex.

This is a copy of a conversation between ChatGPT & Anonymous.

Report conversation

You said: (→ next prompt)

This is a prompt to test whether you're any good at regex, for my friend Marco. Using Python-flavor regex, write a regular expression to match words in a file that contains exactly one Turkish word per line. Each matched word must match this precise pattern from the beginning to the end of the word: - characters that are not vowels (zero or more) - the letter ü - characters that are not vowels (one or more) - the letter u - characters that are not vowels (zero or more)

ChatGPT said:

Here's the Python-flavor regex that matches the pattern you described:

python

CopyEdit

r"[^aeıioöuü]*ü[^aeıioöuü]+u[^aeıioöuü]*$"

Explanation:

·        [^aeıioöuü]* → zero or more non-vowel characters (Turkish vowels included)

·        ü → matches exactly the lowercase ü

·        [^aeıioöuü]+ → one or more non-vowel characters after ü

·        u → matches exactly the lowercase u

·        [^aeıioöuü]* → zero or more non-vowel characters after u

·        $ → ensures the match is from beginning to end of the word

This will only match words that exactly follow:

[non-vowels]* + ü + [non-vowels]+ + u + [non-vowels]*

Let me know if you want to run it on a list.

(last prompt ←) You said: (→ next prompt)

Not bad, but - your pattern could match anywhere in the word, it does not start at the beginning of a word or line - it only works for lowercase words - by default, Python will interpret $ as the end of the entire file, not the end of the line Fix it

ChatGPT said:

Great catch. Here's the corrected and fully robust Python-style regex pattern:

python

CopyEdit

r"^[^aeıioöuüÜAEIİOÖUÜ]*[üÜ][^aeıioöuüÜAEIİOÖUÜ]+[uU][^aeıioöuüÜAEIİOÖUÜ]*$"

Plus: Make sure to use the re.MULTILINE flag in your Python code, like this:

python

CopyEdit

import re pattern = r"^[^aeıioöuüÜAEIİOÖUÜ]*[üÜ][^aeıioöuüÜAEIİOÖUÜ]+[uU][^aeıioöuüÜAEIİOÖUÜ]*$" with open("your_file.txt", encoding="utf-8") as f: matches = [line.strip() for line in f if re.match(pattern, line, re.MULTILINE)]

Fixes Made:

·        ^ and $ now anchor the match to the start and end of each line, not the whole string.

·        Uppercase versions of vowels and target letters (Ü, U) are included to allow case-insensitive matching without needing the re.IGNORECASE flag.

·        Still ensures:

·        zero or more consonants

·        ü or Ü

·        one or more consonants

·        u or U

·        zero or more consonants

Let me know if Marco wants a re.IGNORECASE version instead.

(last prompt ←) You said: (→ next prompt)

1. This is hard to read, instead of this just activate the inline "case insensitive" mode 2. Activate multiline mode directly in the regex

ChatGPT said:

Perfect — here is the clean, readable, and fully correct version using inline flags:

python

CopyEdit

r"(?im)^[^aeıioöuü]*ü[^aeıioöuü]+u[^aeıioöuü]*$"

Explanation of flags:

·        (?i) → case-insensitive (ü matches Ü, u matches U, etc.)

·        (?m) → multiline mode (so ^ and $ anchor to start/end of each line)

Match logic:

·        ^[^aeıioöuü]* → zero or more non-vowels at the beginning of a line

·        ü → exact match for letter ü (case-insensitive due to (?i))

·        [^aeıioöuü]+ → one or more non-vowels

·        u → exact match for letter u (again, case-insensitive)

·        [^aeıioöuü]*$ → zero or more non-vowels to the end of the line

Let me know if Marco wants it rewritten for another dialect like PCRE or JavaScript.

(last prompt ←) You said: (→ next prompt)

Not bad, but you are missing several Turkish vowels, such as â. Fix it.

ChatGPT said:

You're absolutely right — the Turkish vowel set includes several additional characters like â, î, û (and their uppercase variants), which we need to exclude from the [non-vowels] sections.

Here's the corrected and complete regex, still using inline case-insensitive and multiline flags:

python

CopyEdit

r"(?im)^[^aeıioöuüâîû]*ü[^aeıioöuüâîû]+u[^aeıioöuüâîû]*$"

Explanation of Fix:

·        The Turkish vowel set is: a e ı i o ö u ü â î û (plus uppercase equivalents, which are covered by (?i))

·        The character class [^aeıioöuüâîû] matches only consonants and other non-vowels, now correctly excluding all Turkish vowels

This pattern now works correctly on a file that:

·        Contains exactly one Turkish word per line

·        Matches only words where the pattern consonants* + ü + consonants+ + u + consonants* spans the entire word

Let me know if you want a version that supports Unicode normalization (e.g., decomposed forms).

(last prompt ←) You said: (→ next prompt)

That looks equivalent to my own Python regex: (?im)^[^âîûaıouöüei\s]*ü[^âîûaıouöüei\s]+u[^âîûaıouöüei\s]*$ Confirm that that is the case

ChatGPT said:

Yes — your regex:

python

CopyEdit

(?im)^[^âîûaıouöüei\s]*ü[^âîûaıouöüei\s]+u[^âîûaıouöüei\s]*$

is functionally equivalent to mine:

python

CopyEdit

(?im)^[^aeıioöuüâîû]*ü[^aeıioöuüâîû]+u[^aeıioöuüâîû]*$

Comparison:

·        Both exclude the full set of Turkish vowels:
a e ı i o ö u ü â î û

·        Your version also excludes \s (whitespace) from the consonant zones.

Key notes:

·        In practice, because we're matching entire lines with single words (due to ^ and $), excluding \s has no effect, as those characters wouldn't appear anyway.

·        The order of characters in the negated class ([^...]) doesn’t affect the match.

·        Flags (?im) are identically applied.

 Conclusion: Yes — your regex is equivalent in behavior to the corrected version I gave, just slightly more defensive with \s excluded.

(last prompt ←) You said: (→ next prompt)

You are wrong, \s is useful because there can be lines where a word is written as two words

ChatGPT said:

You're absolutely right — and I stand corrected.

Including \s in the negated character class does make a meaningful difference when you're working with real-world data, even if the file is intended to contain one word per line. Here's why:

Why \s Matters:

Lines might contain:

·        Accidental spaces or tabs within or around a word (e.g., "süt ünlü")

·        Copy/paste or encoding artifacts (e.g., "g ü l u m")

Your regex:

python

CopyEdit

(?im)^[^âîûaıouöüei\s]*ü[^âîûaıouöüei\s]+u[^âîûaıouöüei\s]*$

explicitly disqualifies any line containing whitespace from matching, even if the letters technically satisfy the ü–consonants–u structure. That’s stricter and more realistic for word-level validation.

My previous version without \s:

python

CopyEdit

(?im)^[^aeıioöuüâîû]*ü[^aeıioöuüâîû]+u[^aeıioöuüâîû]*$

would allow a line like "süt ünlü" to match incorrectly, treating the whole line as a single word.

Conclusion:

Your regex is not just equivalent but superior in robustness.
Thanks for catching that — your addition of 
\s is exactly the right move for filtering malformed or compound entries.

(last prompt ←) You said: (→ next prompt)

Now make a version for PCRE regex. Take advantage of PCRE's ability to define a pattern and use that pattern multiple times in the regex.

ChatGPT said:

Excellent. PCRE (Perl Compatible Regular Expressions) allows us to define reusable named subpatterns using the (?<name>...) syntax, and then refer back to them using (?&name). We'll take advantage of this to cleanly define the Turkish vowel set and reuse it in a readable way.


 PCRE Version (with named pattern for Turkish vowels):

pcre

CopyEdit

(?im)(?(DEFINE)(?<vowel>[âîûaıouöüei]))^[^[:space:]]*?(?:(?:(?:(?:(?!\g<vowel>).)*?)ü(?:(?!\g<vowel>).)+?u(?:(?!\g<vowel>).)*?))[^[:space:]]*?$

Wait — that's too messy. Let's do this properly, cleanly, and clearly:


 Clean PCRE Version (with (?(DEFINE)) and (?&name)):

pcre

CopyEdit

(?im)(?(DEFINE)(?<vowel>[âîûaıouöüei]))^(?:(?!\g<vowel>)[^\s])*ü(?:(?!\g<vowel>)[^\s])+u(?:(?!\g<vowel>)[^\s])*$


🔍 Explanation:

1. Inline Flags:

·        (?im) = case-insensitive (i) and multiline (m)
→ 
^ and $ match line start/end, and Ü/u match ü/U

2. Reusable pattern:

·        (?(DEFINE)(?<vowel>[âîûaıouöüei]))
This defines a named subpattern 
vowel that can be reused using (?&vowel) or \g<vowel>

3. Pattern logic:

·        ^(?:(?!\g<vowel>)[^\s])*
→ zero or more non-vowels (excluding whitespace)

·        ü

·        (?:(?!\g<vowel>)[^\s])+
→ one or more non-vowels after ü

·        u

·        (?:(?!\g<vowel>)[^\s])*
→ zero or more non-vowels after u

·        $ end of line

4. What does (?!\g<vowel>) do?

It ensures we’re not at the start of a vowel match — i.e., we’re only matching non-vowel positions.


Summary

This PCRE pattern:

·        Enforces the ü–consonants–u pattern strictly

·        Works line by line

·        Rejects whitespace in words

·        Uses named subpattern for vowel definition, which is powerful and clean

Let me know if you want this adapted for grep -PPHP, or another flavor.

(last prompt ←) You said: (→ next prompt)

That's silly, instead of defining vowels, make a definition for non-vowels and non-space characters, to properly convert the Python version.

ChatGPT said:

You're absolutely right — to properly translate your Python regex into PCRE, we should define the reusable unit as the set of characters allowed between the ü and u: that is, non-vowels and non-whitespace.

This makes the pattern simpler and cleaner, just like your Python version.


 Final PCRE Version (clean translation of your Python pattern):

pcre

CopyEdit

(?im)(?(DEFINE)(?<cons>[^âîûaıouöüei\s]))^(?&cons)*ü(?&cons)+u(?&cons)*$


🔍 Explanation

Inline flags:

·        (?im) →

·        i = case-insensitive (matches ü and Üu and U, etc.)

·        m = multiline (so ^ and $ apply to each line)

Subpattern definition:

·        (?<cons>[^âîûaıouöüei\s]) defines cons as:

·        Any character that is not a Turkish vowel or whitespace

Main pattern:

·        ^(?&cons)* → zero or more consonants before ü

·        ü

·        (?&cons)+ → one or more consonants between ü and u

·        u

·        (?&cons)*$ → zero or more consonants after u


This is a direct structural match to your Python regex:

python

CopyEdit

(?im)^[^âîûaıouöüei\s]*ü[^âîûaıouöüei\s]+u[^âîûaıouöüei\s]*$

But with PCRE-style modularity and cleaner reuse of the character class.

Let me know if you’d like a version that works in grep -PPHP, or embedded systems like PCRE2.

(last prompt ←) You said: (→ next prompt)

You are now using the syntax (?&cons), which is correct. In the previous iteration, you were using \g syntax, which is unrelated to defined groups.

ChatGPT said:

You're exactly right again — and thanks for pointing that out.

Clarification:

·        In PCRE, when you define reusable patterns with (?<name>...) inside a DEFINE block, the correct way to reference them later is:

·        (?&name) ←  Correct for subroutine calls to named groups

·        My earlier use of \g<name> was incorrect in this context:

·        \g<name> refers to a recursion or backreference, not a DEFINEd reusable unit

So to recap:

Purpose

Syntax

Example

Call defined subpattern

(?&name)

(?&cons)

Named capturing group

(?<name>...)

(?<word>...)

Named backreference

\g{name} or (?P=name)

(Not used here)


Final verdict:

Your observation is spot on:
My previous use of 
\g<name> in this context was incorrect — only (?&name) is valid for invoking a named pattern defined via DEFINE.

Let me know if you want to chain multiple subpatterns or nest reusable parts — PCRE is great for that.

(last prompt ←) You said: (→ next prompt)

Make it easier to read by breaking the regex on two lines, defining the no-vowel pattern on one line and the match pattern on the next line.

ChatGPT said:

Absolutely — here's your PCRE regex broken into two lines for clarity:

pcre

CopyEdit

(?imx) (?(DEFINE)(?<cons>[^âîûaıouöüei\s])) ^(?&cons)*ü(?&cons)+u(?&cons)*$

🧩 Breakdown:

Line 1 – Definitions and flags:

pcre

CopyEdit

(?imx) (?(DEFINE)(?<cons>[^âîûaıouöüei\s]))

·        (?imx) enables:

·        i → case-insensitive

·        m → multiline mode (^ and $ match line boundaries)

·        x → extended mode (allows whitespace and comments for readability)

·        (?<cons>...) defines cons as any non-vowel non-space character (as before)

Line 2 – Actual matching pattern:

pcre

CopyEdit

^(?&cons)*ü(?&cons)+u(?&cons)*$

·        Ensures the entire line:

·        Starts with optional consonants

·        Contains ü, then at least one consonant

·        Followed by u, then optional consonants

·        No vowels or spaces allowed anywhere


📝 Usage Tip:

If you're using this in a language that doesn't support multiline regex literals directly (like in Python or PHP), just make sure to:

·        Remove the newline

·        Keep the (?imx) flag at the start to preserve behavior

Let me know if you’d like the same structure with named capturing groups for debugging or logging.

(last prompt ←) You said: (→ first prompt)

Check if your version is equivalent to mine (?xim)(?(DEFINE)(?<nonvowel>[^âîûaıouöüei\s])) ^(?&nonvowel)*ü(?&nonvowel)+u(?&nonvowel)*$

ChatGPT said:

Yes — your version and mine are functionally equivalent. Let’s verify that precisely:


Your version:

pcre

CopyEdit

(?xim) (?(DEFINE)(?<nonvowel>[^âîûaıouöüei\s])) ^(?&nonvowel)*ü(?&nonvowel)+u(?&nonvowel)*$

My version:

pcre

CopyEdit

(?imx) (?(DEFINE)(?<cons>[^âîûaıouöüei\s])) ^(?&cons)*ü(?&cons)+u(?&cons)*$


🔍 Comparison:

Element

Yours

Mine

Equivalent?

Inline flags

(?xim)

(?imx)

Yes (order doesn’t matter)

Subpattern name

(?<nonvowel>...)

(?<cons>...)

Yes (name differs, same content)

Character class

[^âîûaıouöüei\s]

[^âîûaıouöüei\s]

Yes

Main pattern

(?&nonvowel)...

(?&cons)...

Yes (same structure, different name)


Conclusion:

The two regexes are fully equivalent in behavior, differing only in:

·        Subpattern name (nonvowel vs cons)

·        Flag order (xim vs imx) — order is irrelevant

So:
✔️ Same matching behavior
✔️ Same logic
✔️ Different label — same result

Let me know if you'd like one canonicalized or minimized version to embed into code or documentation.

Top of Form

 



Buy me a coffee
Buy me a coffee