Regular Expressions - Theory and Tools
April 18, 2020
A regex, regexp or regular expression is a notation used to describe a language: letters and words governed by a set of rules.
For instance, the language described by ab*c includes strings that start with an a, followed by any number of bs (including 0), then a c.
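As a quick illustration, here is a minimal sketch that checks a few strings against that language. The anchors ^ and $ and the sample strings are additions for the example, so that the pattern has to describe the whole string rather than a part of it.

```javascript
// Anchors (^ and $) force the pattern to describe the whole string.
const language = /^ab*c$/;

// Sample strings chosen for illustration.
const candidates = ["ac", "abc", "abbbc", "abcb", "bc"];

// Keep only the strings that belong to the language.
const members = candidates.filter((s) => language.test(s));
console.log(members); // [ 'ac', 'abc', 'abbbc' ]
```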
The syntax features literal characters, metacharacters (.), character classes, quantifiers (+), alternation (|), flags (m), and more.
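The following sketch shows one small example of each of these features, using JavaScript's built-in regex syntax. The sample strings are made up for the illustration, and the g flag used with match is an extra flag that the sentence above does not mention.

```javascript
// Literal characters match themselves.
const hasLiteral = /cat/.test("concatenate"); // true

// The metacharacter . matches any single character except a line terminator.
const hasDot = /c.t/.test("cut"); // true

// A character class matches one character out of a set.
const hasClass = /gr[ae]y/.test("grey"); // true

// The quantifier + means "one or more of the preceding item".
// The g flag (an addition here) asks for every match, not just the first.
const numbers = "a1b22c333".match(/[0-9]+/g); // [ '1', '22', '333' ]

// Alternation matches either side of the |.
const hasPet = /^(cat|dog)$/.test("dog"); // true

// The m flag makes ^ and $ match at line boundaries as well.
const withoutM = /^second/.test("first\nsecond"); // false
const withM = /^second/m.test("first\nsecond"); // true
```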
In a programming context, they are used to find strings of characters that match the described language, and sometimes act on those matches.
Typical applications include text processing, text generation and input validation.
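To make "find, then act" concrete, here is a small sketch. The phone number format (three digits, a dash, four digits) and the sample text are hypothetical and deliberately naive.

```javascript
const text = "Call 555-0100 during the day or 555-0199 at night.";

// Hypothetical, deliberately naive format: three digits, a dash, four digits.
const phonePattern = /\b\d{3}-\d{4}\b/g;

// Find: collect every substring that belongs to the described language.
const found = text.match(phonePattern);
console.log(found); // [ '555-0100', '555-0199' ]

// Act: replace each match, here by masking the last four digits.
const masked = text.replace(phonePattern, (m) => m.slice(0, 4) + "****");
console.log(masked); // Call 555-**** during the day or 555-**** at night.
```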
When writing a regex, it is generally easy to match what we’re looking for but difficult not to match what we want to exclude.
In short, edge cases can be a real pain. Consequently, one of the most useful pieces of advice about regexes is knowing when not to use them.
Here are some non-exhaustive examples of good and bad applications.
What are regexes good at?
- looking for simple information in a text (“Find all the phone numbers in this paragraph”)
- simple pattern-based string validation (“Is this a well-formed ISO8601 date?”)
- tokenization (“Is this a reserved word?”)
- advanced search and replace
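Taking the date item from the list above, a simple shape check could look like the sketch below. It only covers the YYYY-MM-DD calendar form, which is an assumption made for the example, since ISO 8601 allows other forms. It also shows how quickly edge cases appear: a well-formed string is not necessarily a real date.

```javascript
// Shape check only: four digits, two digits, two digits.
const isoDateShape = /^\d{4}-\d{2}-\d{2}$/;

console.log(isoDateShape.test("2020-04-18")); // true
console.log(isoDateShape.test("18/04/2020")); // false

// Well-formed, yet not a real date: this is where edge cases start.
console.log(isoDateShape.test("2020-13-45")); // true
```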
What are regexes bad at?
- parsing languages with any kind of limitless nesting (programming languages, for instance)
- complex string validation (“Is this a valid email?”)
- natural language processing (“Who is the subject in this sentence?”)
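The nesting limitation can be sketched with parentheses. The pattern below handles exactly one level, and every extra level would need a longer pattern. The alternative, isBalanced, is a hypothetical helper written for this example: it uses a counter, which is the kind of unbounded memory a plain regex does not have.

```javascript
// This pattern accepts exactly one level of parentheses, no more.
const oneLevel = /^\([^()]*\)$/;

console.log(oneLevel.test("(a)")); // true
console.log(oneLevel.test("((a))")); // false

// Hypothetical helper: a counter tracks the nesting depth.
function isBalanced(input) {
  let depth = 0;
  for (const ch of input) {
    if (ch === "(") depth += 1;
    if (ch === ")") depth -= 1;
    if (depth < 0) return false; // a closing parenthesis came too early
  }
  return depth === 0; // everything that was opened has been closed
}

console.log(isBalanced("((a))")); // true
console.log(isBalanced("(()")); // false
```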
Theory and tools
Think of a regular expression as a program. This program is a finite-state automaton, or finite-state machine (FSM), written in an extremely concise form.
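To see what that concise form stands for, here is a hand-written automaton for the pattern ab*c. It is only a sketch: the state names (start, afterA, accept) and the accepts function are invented for the illustration, and real regex engines are built differently.

```javascript
// Transition table of a hypothetical automaton for ab*c.
const transitions = {
  start: { a: "afterA" },
  afterA: { b: "afterA", c: "accept" },
  accept: {},
};

// Feed the input one character at a time and follow the transitions.
function accepts(input) {
  let state = "start";
  for (const ch of input) {
    state = transitions[state][ch];
    if (state === undefined) return false; // no transition: reject
  }
  return state === "accept";
}

console.log(accepts("abbbc")); // true
console.log(accepts("abcb")); // false
```

The automaton only ever remembers which state it is in, which is why the whole table fits in the five characters of ab*c.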
If you go looking for documentation on regular expressions on the internet, you may stumble upon conflicting information such as “it has limited memory” and “you can reference any number of capturing groups”.
This is because, depending on the context, “regular expressions” means either the theory or the practical tools built from the theory, such as software libraries.
For example, PCRE (Perl-Compatible Regular Expressions) has features like assertions (lookaround) and recursion that go beyond what is possible with an FSM.
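The gap between theory and tools can be sketched with a backreference. This example swaps in JavaScript's RegExp, which is not PCRE and has no recursion, but it supports backreferences to capturing groups. The pattern below requires the same run of a characters on both sides of a b, so the engine must remember an arbitrarily long string, something a finite automaton cannot do.

```javascript
// \1 must repeat exactly what group 1 captured.
const sameOnBothSides = /^(a+)b\1$/;

console.log(sameOnBothSides.test("aaabaaa")); // true
console.log(sameOnBothSides.test("aaabaa")); // false
```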
Virtually every programming environment has some kind of regex support. Here are some examples.
Famous programs such as awk heavily use regular expressions.
In PHP, the preg_-prefixed functions are based on PCRE.
Notable particularities include the preg_match_all function, which performs a global match.
You can find a few interesting examples in the league/uri codebase.
In JavaScript:
- the RegExp object is built into the language,
- there is a literal form: /ab*c/.test("abbbc") is valid code.
Learn more on MDN.
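A short sketch of both forms follows, together with matchAll, which iterates over every match when the pattern carries the g flag. The sample string is made up for the example.

```javascript
// Literal form and constructor form build equivalent patterns.
const fromLiteral = /ab*c/;
const fromConstructor = new RegExp("ab*c");
console.log(fromLiteral.source === fromConstructor.source); // true

// With the g flag, matchAll yields one result per match.
const allMatches = [..."ac abc abbc".matchAll(/ab*c/g)].map((m) => m[0]);
console.log(allMatches); // [ 'ac', 'abc', 'abbc' ]
```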
- iHateRegex lets you visualize your regex as a graph, which nicely illustrates its automaton origins.
- extendsclass has a regexp tester and visualizer.
- regular-expressions.info provides useful documentation.
A bit of History
- Stephen Kleene seems to be generally considered the inventor of Regular Expressions, in the early 1950s.
- Also in the 50s, Chomsky introduced his hierarchy of languages, with regular languages being “Type-3”.
- One of the first uses in a program is attributed to Ken Thompson’s ED text editor, in the late 60s (it is also cited as one of the first examples of just-in-time compilation).
- In the 80s, Perl introduced even more powerful implementations.