Cheatsheet for regular expressions

Regex is a DSL with a very small vocabulary. It is so common that everyone has to deal with it sooner or later. I have finally reached the point where I want to know the exact rules instead of consulting an AI for even very simple regex patterns. So here we are, a cheatsheet.

In the most common cases, the goal is to retrieve a substring that follows some pattern out of a given string. Here are the tools at our disposal.

anchors

^ matches the start of the string and $ the end.
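A small sketch of the anchors using Python's re module (the sample strings are made up for illustration). re.search is used here so that the anchors, not the function, decide where a match may start.

```python
import re

# ^ pins the match to the start of the string, $ to the end.
starts_with_go = re.search(r"^go", "gopher") is not None
ends_with_go = re.search(r"go$", "cargo") is not None

# With both anchors, the whole string has to be exactly "go".
exactly_go = re.search(r"^go$", "cargos") is not None

print(starts_with_go, ends_with_go, exactly_go)
```

In Python, passing the re.MULTILINE flag makes ^ and $ also match at the start and end of each line.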

character class

Use square brackets [] and put inside them the characters we are willing to accept. A character class matches exactly one character, e.g.

[abc] matches any one of the three characters a, b, c
[a-zA-Z] matches any single ASCII letter, lower or upper case
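As a quick illustration in Python (the input strings are arbitrary), re.findall shows that each match of a character class is a single character:

```python
import re

# Every match is one character taken from the set inside the brackets.
vowels = re.findall(r"[aeiou]", "regex")
letters = re.findall(r"[a-zA-Z]", "a1-B2")

print(vowels)
print(letters)
```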

grouping

Use round brackets () to group a sequence of characters into a single unit. A group also captures the text it matched, which we rely on in the example later, e.g.

(awesome)

matches exactly awesome

quantifiers

A quantifier applies to whatever comes immediately before it: a single character, a character class, or a group.

Quantifier	Meaning
*	0 or more
+	1 or more
?	0 or 1
{n}	Exactly n times
{n,}	n or more
{n,m}	Between n and m times

For example,

(go)+

matches go, gogo, gogogo, etc.
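The sketch below checks this in Python and contrasts it with the same pattern without the brackets. The candidate strings are made up, and re.fullmatch is used so that the whole string has to fit the pattern.

```python
import re

candidates = ["go", "gogo", "gogogo", "g", "goo"]

# The + repeats the whole group "go".
grouped = [s for s in candidates if re.fullmatch(r"(go)+", s)]

# Without the brackets, + only repeats the letter o.
ungrouped = [s for s in candidates if re.fullmatch(r"go+", s)]

# {2,4} applied to a character class: two to four digits.
two_to_four = [s for s in ["1", "123", "12345"] if re.fullmatch(r"[0-9]{2,4}", s)]

print(grouped, ungrouped, two_to_four)
```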

escape with `\`

Quantifiers and brackets are special characters that we may want to match literally too. To do this, escape them with a backslash like so

\([0-9]?

to match an opening round bracket followed by zero or one digit.
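A minimal sketch of this pattern in Python, run against two of the interval strings used later in this post. It also shows re.escape, a standard-library helper that adds the backslashes for us when we want to match a whole string literally.

```python
import re

pattern = r"\([0-9]?"

# The digit is optional, so both strings match.
with_digit = re.match(pattern, "(6,inf]").group(0)
without_digit = re.match(pattern, "(-inf,1.1]").group(0)

# re.escape builds a pattern that matches the given text literally.
literal = re.escape("(1.1,6]")
matches_itself = re.fullmatch(literal, "(1.1,6]") is not None
# The dot is escaped, so it no longer stands for "any character".
matches_other = re.fullmatch(literal, "(1x1,6]") is not None

print(with_digit, without_digit, matches_itself, matches_other)
```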

shorthand

Shorthand	Meaning	Matches
\d	Digit	[0-9]
\D	Non-digit	[^0-9]
\w	Word character	[a-zA-Z0-9_]
\W	Non-word character	[^a-zA-Z0-9_]
\s	Whitespace character	[ \t\r\n\f\v]
\S	Non-whitespace character	[^ \t\r\n\f\v]
.	Any character (except \n)	[^\n]
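A short Python sketch of the shorthands on a made-up string. One caveat worth knowing: for str patterns, Python's \d, \w and \s are Unicode-aware by default, so the ASCII ranges in the table hold exactly only when the re.ASCII flag is passed.

```python
import re

text = "id_42 costs 3.50\n"

digits = re.findall(r"\d+", text)
words = re.findall(r"\w+", text)

# . skips the newline unless re.DOTALL is given.
dots = re.findall(r".", "a\nb")
dots_all = re.findall(r".", "a\nb", re.DOTALL)

# "\u0663" is the Arabic-Indic digit three.
unicode_digit = re.fullmatch(r"\d", "\u0663") is not None
ascii_digit = re.fullmatch(r"\d", "\u0663", re.ASCII) is not None

print(digits, words, dots, dots_all, unicode_digit, ascii_digit)
```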

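OR with |, NEGATION with ^

(live|die) matches either live or die. A ^ placed first inside a character class negates it, as in [^0-9] from the shorthand table, which matches any one character that is not a digit. Outside a character class, ^ is the start anchor described above.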
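A sketch in Python of both operators (the sentences are made up). It also shows a common pitfall: | has the lowest precedence, so anchors bind to one alternative only unless the alternatives are wrapped in a group.

```python
import re

# With a single group, findall returns what the group captured.
verbs = re.findall(r"(live|die)", "live free or die")

# A negated class: everything up to, but not including, the first comma.
before_comma = re.match(r"[^,]+", "-inf,1.1").group(0)

# ^live|die$ reads as (^live) OR (die$), so "die" at the end matches.
loose = re.search(r"^live|die$", "alive and die") is not None
# Grouping makes both anchors apply to both alternatives.
strict = re.search(r"^(live|die)$", "alive and die") is not None

print(verbs, before_comma, loose, strict)
```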

Example: match the endpoints in intervals

We have strings like this

(-inf,1.1]
(1.1,6]
(6,inf]

We can retrieve the left endpoint like so

import re

s = "(-inf, 1.1]"
match = re.match(r"\(([^,]+),", s)
if match:
    print(match.group(1))

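Here .group(1) returns the text captured by the first pair of unescaped round brackets, which is the left endpoint we want. .group(0) would return the whole match instead, including the opening bracket and the comma.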
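The same idea extends to both endpoints with a second group. This is only a sketch: the endpoints helper and the extended pattern are illustrative names and choices of mine, and the pattern assumes intervals of the form shown above, opened by ( and closed by ].

```python
import re

# Group 1: everything up to the comma. Group 2: everything up to the closing ].
# \s* tolerates optional whitespace after the comma.
INTERVAL = re.compile(r"\(([^,]+),\s*([^\]]+)\]")

def endpoints(s):
    """Return (left, right) as strings, or None if s is not an interval."""
    m = INTERVAL.fullmatch(s)
    return (m.group(1), m.group(2)) if m else None

for s in ["(-inf, 1.1]", "(1.1,6]", "(6,inf]"]:
    print(endpoints(s))
```

The endpoints come back as strings; converting them with float() also handles "-inf" and "inf".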

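In polars we can do the same on a whole column

import polars as pl

# A small frame holding the interval strings from above
df = pl.DataFrame({"interval": ["(-inf,1.1]", "(1.1,6]", "(6,inf]"]})

pattern = r"\(([^,]+),"
df.select(pl.col("interval").str.extract(pattern))

where the .extract method defaults to group_index=1, i.e. the first capture group.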

Bonus: GPT4 split pattern

Try this

GPT4_SPLIT_PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""

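This is used in the BPE algorithm as the initial step to split a large chunk of text into words, handling punctuation, spaces, Unicode, etc. Note that the \p{L} and \p{N} Unicode property classes are not supported by Python's built-in re module, so the referenced code compiles this pattern with the third-party regex module instead.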

Reference

https://github.com/karpathy/minbpe/blob/master/minbpe/gpt4.py