Cheatsheet for regular expressions
Regex is a DSL with a very small vocabulary. It is so common that everyone has to deal with it sooner or later. I have finally reached the point where I want to know the exact rules instead of consulting an AI for even very simple regex patterns. So here we are: a cheatsheet.
In the most common cases, the goal is to retrieve a substring that follows some pattern out of a given string. Here are the tools at our disposal.
anchors
^ matches the start of the string and $ the end.
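A minimal sketch of the two anchors using Python's built-in re module; the sample words are made up for illustration.

```python
import re

# ^ pins the match to the start of the string, $ to the end.
starts_with_cat = re.search(r"^cat", "catalog") is not None   # "catalog" starts with cat
not_at_start = re.search(r"^cat", "bobcat") is not None       # cat is there, but not at the start
ends_with_cat = re.search(r"cat$", "bobcat") is not None      # "bobcat" ends with cat
whole_string = re.search(r"^cat$", "catalog") is not None     # both anchors: the whole string must be cat

print(starts_with_cat, not_at_start, ends_with_cat, whole_string)
```

Using both anchors together is the usual way to say "the entire string must look like this".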
character class
Use square brackets [], inside which we put the characters we want to match, e.g.
- [abc] matches any one of the three characters a, b, c
- [a-zA-Z] matches any single letter of the English alphabet, lowercase or uppercase
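As a quick check, here is a small sketch with re.findall, which returns every non-overlapping match. The sample strings are invented for illustration. Note that a character class matches one character at a time.

```python
import re

# [aeiou] matches a single character that is one of the listed vowels.
vowels = re.findall(r"[aeiou]", "regex rules")

# [a-zA-Z] uses ranges: any one ASCII letter, lowercase or uppercase.
letters = re.findall(r"[a-zA-Z]", "R2-D2 c3po")

print(vowels)   # ['e', 'e', 'u', 'e']
print(letters)  # ['R', 'D', 'c', 'p', 'o']
```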
grouping
Use round brackets (), inside which we put the entire string we want to match, e.g.
(awesome)
matches exactly awesome
quantifiers
Quantifiers are placed right after a character, a character class, or a group to say how many times it may repeat.
Quantifier | Meaning |
---|---|
* | 0 or more |
+ | 1 or more |
? | 0 or 1 |
{n} | Exactly n times |
{n,} | n or more |
{n,m} | Between n and m times |
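
The rows of the table can be tried out one by one. This is a small sketch, assuming a hypothetical helper named full that wraps re.fullmatch so that the whole string has to match the pattern.

```python
import re

def full(pattern: str, text: str) -> bool:
    """Hypothetical helper: True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

results = {
    "ab* on a": full(r"ab*", "a"),                      # * allows zero b's
    "ab+ on a": full(r"ab+", "a"),                      # + needs at least one b
    "colou?r on color": full(r"colou?r", "color"),      # ? makes the u optional
    "colou?r on colour": full(r"colou?r", "colour"),
    "[0-9]{3} on 12": full(r"[0-9]{3}", "12"),          # exactly three digits
    "[0-9]{2,} on 12345": full(r"[0-9]{2,}", "12345"),  # two or more digits
    "[0-9]{2,4} on 12345": full(r"[0-9]{2,4}", "12345") # between two and four digits
}

for name, matched in results.items():
    print(name, matched)
```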
For example,
(go)+
would match go, gogo, gogogo, etc.
escape with \
Quantifiers and brackets are special characters that we may want to match too. To do this, escape them with a backslash, like so
\([0-9]?
to match an opening round bracket followed by zero or one digit.
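A short sketch of escaping in Python, with made-up sample strings. It also shows re.escape from the standard library, which adds the backslashes for us when the whole string should be taken literally.

```python
import re

# \( is a literal opening round bracket; [0-9]? is an optional digit after it.
with_digit = re.search(r"\([0-9]?", "f(3)").group(0)
without_digit = re.search(r"\([0-9]?", "f(x)").group(0)

# Unescaped, + and ? act as quantifiers, so this pattern does not find the literal text.
literal = "1+1=2?"
unescaped_hit = re.search(literal, literal) is not None
# re.escape backslash-escapes the special characters so the text matches itself.
escaped_hit = re.fullmatch(re.escape(literal), literal) is not None

print(with_digit, without_digit, unescaped_hit, escaped_hit)
```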
shorthand
Shorthand | Meaning | Matches |
---|---|---|
\d | Digit | [0-9] |
\D | Non-digit | [^0-9] |
\w | Word character | [a-zA-Z0-9_] |
\W | Non-word character | [^a-zA-Z0-9_] |
\s | Whitespace character | [ \t\r\n\f\v] |
\S | Non-whitespace character | [^ \t\r\n\f\v] |
. | Any character (except \n) | [^\n] |
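
A small sketch of the shorthands in action on an invented sample string. One caveat for Python specifically: on str patterns, \d, \w and \s are Unicode-aware by default, so they match somewhat more than the ASCII classes in the table unless the re.ASCII flag is passed.

```python
import re

text = "Order 66 shipped on 2024-05-04\tto room_12"

digits = re.findall(r"\d+", text)   # runs of digits
words = re.findall(r"\w+", text)    # runs of letters, digits and underscore
fields = re.split(r"\s+", text)     # split on any run of whitespace, tab included

# . matches any character except a newline.
dots = re.findall(r"a.c", "abc a\nc a-c")

print(digits)  # ['66', '2024', '05', '04', '12']
print(words)   # ['Order', '66', 'shipped', 'on', '2024', '05', '04', 'to', 'room_12']
print(fields)  # ['Order', '66', 'shipped', 'on', '2024-05-04', 'to', 'room_12']
print(dots)    # ['abc', 'a-c']
```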
OR with | , NEGATION with ^
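(live|die) matches live, and it also matches die. A ^ placed as the first character inside square brackets negates the character class, e.g. [^0-9] matches any character that is not a digit. We already saw this use of negation in the shorthand table.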
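A brief sketch of both operators with invented sample strings. The last line shows that ^ means two different things depending on where it sits: an anchor outside square brackets, a negation as the first character inside them.

```python
import re

# | tries the left alternative, then the right one.
either = re.findall(r"live|die", "live free or die hard")

# [^0-9 ] is any character that is neither a digit nor a space.
non_digits = re.findall(r"[^0-9 ]+", "room 101 is free")

# First ^ anchors to the start; second ^ negates the vowel class.
first_consonant = re.findall(r"^[^aeiou]", "regex")

print(either)           # ['live', 'die']
print(non_digits)       # ['room', 'is', 'free']
print(first_consonant)  # ['r']
```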
Example: match the endpoints in intervals
We have strings like this
We can retrieve the left endpoint like so
Here .group(1) returns the text captured by the first group, i.e. the first pair of round brackets in the pattern, which is what we want.
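A minimal sketch of this idea, assuming interval strings such as "(0.5, 10]" and "[-3, 7)". The sample data, the patterns and the helper name endpoints are all hypothetical stand-ins, not the exact ones used for this post.

```python
import re

# Hypothetical sample data: intervals with round or square brackets.
intervals = ["(0.5, 10]", "[-3, 7)"]

# [\[(]   one opening bracket, either [ or (
# ([^,]+) group 1: everything up to the comma, i.e. the left endpoint
LEFT = r"[\[(]([^,]+),"
# After the comma and optional spaces, group 1 is everything up to the closing bracket.
RIGHT = r",\s*([^\])]+)[\])]"

def endpoints(s: str) -> tuple[str, str]:
    """Return (left, right) endpoints of an interval string as text."""
    left = re.search(LEFT, s).group(1)
    right = re.search(RIGHT, s).group(1)
    return left, right

pairs = [endpoints(s) for s in intervals]
print(pairs)  # [('0.5', '10'), ('-3', '7')]
```

The endpoints come back as strings, so a float() call would still be needed to use them as numbers.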
In polars we can
where the .str.extract method has the default group_index=1.
Bonus: GPT4 split pattern
Try this
GPT4_SPLIT_PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""
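This is used as the initial step of the GPT-4 BPE tokenizer to split a large chunk of text into word-like pieces, handling punctuation, spaces, Unicode, etc. Note that \p{L} and \p{N} (Unicode letters and numbers) are not supported by Python's built-in re module; the third-party regex package understands them.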