CS253 Regular Expressions
Inclusion
To do regular expressions you need to:
#include <regex>
Some constants, such as std::regex_constants::icase, are in a separate
namespace, std::regex_constants, so you could:
#include <regex>
using namespace std::regex_constants;
but I usually don’t bother.
Nomenclature
- A certain sort of pattern is called a Regular Expression,
alias a regexp, regex, or just an re.
- It’s the middle of grep. In vi, the command :g/re/p means to do a:
  - global match of all lines that match a given
  - regular expression, and
  - print those lines.
- Wikipedia says:
Regular expressions describe regular languages in formal language theory.
They have the same expressive power as regular grammars.
Pattern Matching
- It is often useful to perform pattern matching.
- For example, an actor contemplating a role might want to know how many
times the name “Cornelius” occurs in Shakespeare’s Hamlet :
% grep -i "Cornelius" ~cs253/pub/hamlet.txt
CORNELIUS |
POLONIUS, LAERTES, VOLTIMAND, CORNELIUS, Lords,
You, good Cornelius, and you, Voltimand,
CORNELIUS |
[Exeunt VOLTIMAND and CORNELIUS]
[Re-enter POLONIUS, with VOLTIMAND and CORNELIUS]
[Exeunt VOLTIMAND and CORNELIUS]
In C++
const string home = getpwnam("cs253")->pw_dir;
ifstream play(home+"/pub/hamlet.txt");
for (string line; getline(play, line); )
if (line.find("Cornelius") != string::npos)
cout << line << '\n';
You, good Cornelius, and you, Voltimand,
That’s only one match. Didn’t we see more than that?
Case-independence
const string home = getpwnam("cs253")->pw_dir;
ifstream play(home+"/pub/hamlet.txt");
for (string line; getline(play, line); )
if (line.find("Cornelius") != string::npos ||
line.find("cornelius") != string::npos ||
line.find("CORNELIUS") != string::npos)
cout << line << '\n';
CORNELIUS |
POLONIUS, LAERTES, VOLTIMAND, CORNELIUS, Lords,
You, good Cornelius, and you, Voltimand,
CORNELIUS |
[Exeunt VOLTIMAND and CORNELIUS]
[Re-enter POLONIUS, with VOLTIMAND and CORNELIUS]
[Exeunt VOLTIMAND and CORNELIUS]
Not satisfied
- That’s better but it’s not truly case-independent.
- What about “CorNELius”, or “cOrNeLiUs”?
- There are 2^9, or 512, combinations (two cases for each of the nine letters),
which would make the code quite tedious.
- As if the Bard of Avon would really write “CoRNelIUs”!
Regular expressions
const string home = getpwnam("cs253")->pw_dir;
ifstream play(home+"/pub/hamlet.txt");
const regex r("Cornelius"); // Create the pattern
for (string line; getline(play, line); )
if (regex_search(line, r)) // Search the line
cout << line << '\n';
You, good Cornelius, and you, Voltimand,
OK, but it’s not case-independent.
Regular expressions
const string home = getpwnam("cs253")->pw_dir;
ifstream play(home+"/pub/hamlet.txt");
const regex r("Cornelius", regex_constants::icase);
for (string line; getline(play, line); )
if (regex_search(line, r))
cout << line << '\n';
CORNELIUS |
POLONIUS, LAERTES, VOLTIMAND, CORNELIUS, Lords,
You, good Cornelius, and you, Voltimand,
CORNELIUS |
[Exeunt VOLTIMAND and CORNELIUS]
[Re-enter POLONIUS, with VOLTIMAND and CORNELIUS]
[Exeunt VOLTIMAND and CORNELIUS]
Dialects
You’ve all learned English; did you learn:
- American English?
- Canadian English?
- U.K. English?
- Australian English?
Well, same with regular expressions. There are dialects.
Regular expression dialects
The second argument to the regex ctor is a bitmask of flags.
regex_constants::icase indicates a case-independent pattern.
You can also specify the regular expression dialect:
Flag | Explanation |
regex_constants::ECMAScript | ECMAScript (JavaScript) (default) |
regex_constants::basic | Basic POSIX |
regex_constants::extended | Extended POSIX |
regex_constants::awk | Awk POSIX |
regex_constants::grep | Grep POSIX |
regex_constants::egrep | Egrep POSIX |
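To see that the dialect flag really changes what a pattern means, here is a minimal sketch. The helper name full_match is made up for this illustration. It relies on POSIX basic syntax treating + as an ordinary character, whereas ECMAScript and POSIX extended treat it as “one or more of previous”.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: does the whole of text match pattern,
// when pattern is interpreted in the given dialect?
bool full_match(const std::string &text, const std::string &pattern,
                std::regex_constants::syntax_option_type dialect) {
    const std::regex r(pattern, dialect);   // second ctor argument: the flags
    return std::regex_match(text, r);
}
```

With this helper, the pattern a+ should match "aaa" under ECMAScript and extended, but under basic it should only match the two-character string "a+".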
Not filename patterns
- These are not filename patterns.
- For example, * does not mean “anything” in a regexp.
  It means “repeat what came before”.
- Filename patterns are used to match filenames.
- Regular expressions are used to match all sorts of data.
I mean it!
Regular expressions are NOT filename patterns!
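As a rough illustration of the difference, the filename pattern *.txt corresponds approximately to the regular expression .*\.txt. The sketch below uses a made-up helper name, looks_like_txt_file, and ignores shell details such as hidden files.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if name ends in ".txt", written as a regex.
//   .*  any run of characters (the job that * does alone in a filename pattern)
//   \.  a literal dot (an unescaped . would match any one character)
bool looks_like_txt_file(const std::string &name) {
    static const std::regex r(".*\\.txt");
    return std::regex_match(name, r);   // regex_match must match the whole name
}
```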
Regex components: Character classes
What | Description |
. | any one char but \n |
[a-fxy0-9] | any one of these, where - means a range |
[^a-fxy0-9] | any char but one of these |
\d or \D | digit: [0-9] or [^0-9] |
\w or \W | word: [0-9a-zA-Z_] or [^0-9a-zA-Z_] |
\s or \S | space: [ \t\n\r\f\v] or [^ \t\n\r\f\v] |
These all match exactly one character. \w does not match an entire word;
you need \w+ for that. \d matches a digit (6), not a number (42).
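A small sketch of the “one character” point, using made-up helper names. Each helper uses regex_match(), so the pattern has to account for the whole string.

```cpp
#include <regex>
#include <string>

// Hypothetical helpers contrasting one character with a run of characters.
bool is_one_digit(const std::string &s) {
    static const std::regex r("\\d");      // exactly one digit
    return std::regex_match(s, r);
}

bool is_all_digits(const std::string &s) {
    static const std::regex r("\\d+");     // one or more digits
    return std::regex_match(s, r);
}

bool is_one_word(const std::string &s) {
    static const std::regex r("\\w+");     // one or more word characters
    return std::regex_match(s, r);
}
```

So "6" is one digit, "42" is not one digit but is all digits, and "File your" is not one word because the space is not a \w character.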
Regex components: Repetition
What | Description |
* | 0–∞ of previous (any number) |
+ | 1–∞ of previous (many) |
? | 0–1 of previous (optional) |
{17} | 17 of previous |
{3,8} | 3–8 of previous |
{0,9} | 0–9 of previous |
{12,} | 12–∞ of previous |
These modify what came before. * on its own doesn’t match
anything, but a* matches any number of a characters.
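The repetition operators can be tried out with a sketch like this one. The helper name whole_match is hypothetical; it compiles the pattern in the default ECMAScript dialect.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: does the whole of text match pattern?
bool whole_match(const std::string &text, const std::string &pattern) {
    const std::regex r(pattern);
    return std::regex_match(text, r);
}
```

For example, the empty string matches a* and a? but not a+, and nine a characters are one too many for a{3,8}.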
Regex components: Grouping
What | Description |
| | alternation |
( …) | grouping & capturing |
These are used for choices. Consider (Abe|Abraham) Lincoln.
Without the (), the pattern Abe|Abraham Lincoln
would match either “Abe” or “Abraham Lincoln”,
but not the whole string “Abe Lincoln”.
You can also refer back to the text captured by () with \1, \2, ….
For example, ([a-z])\1 matches doubled letters.
This is called a backreference.
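Here is a sketch of the grouping and backreference examples above. The function names are made up for illustration.

```cpp
#include <regex>
#include <string>

// With parentheses: the choice is only between Abe and Abraham.
bool is_lincoln(const std::string &name) {
    static const std::regex r("(Abe|Abraham) Lincoln");
    return std::regex_match(name, r);
}

// Without parentheses: the choice is between "Abe" and "Abraham Lincoln".
bool is_lincoln_ungrouped(const std::string &name) {
    static const std::regex r("Abe|Abraham Lincoln");
    return std::regex_match(name, r);
}

// Backreference: \1 must repeat whatever the first () captured.
bool has_doubled_letter(const std::string &word) {
    static const std::regex r("([a-z])\\1");
    return std::regex_search(word, r);
}
```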
Regex components: Assertions
What | Description |
\b or \B | word boundary or not |
^ | beginning of line |
$ | end of line |
These match a zero-length string, but only at certain places.
^ matches a zero-length string at the start of a line (string).
It does not match the first character.
\b matches the beginning or end of a word, that is,
the transition between \w and \W, or between \W and \w.
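A sketch of \b in action. The helper whole_word_position is hypothetical, and it assumes that word contains only \w characters, so that it can be pasted into the pattern without escaping.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: position of the first occurrence of word as a whole
// word in text, or -1 if there is none.
// Assumption: word contains only \w characters (no regex metacharacters).
long whole_word_position(const std::string &text, const std::string &word) {
    const std::regex r("\\b" + word + "\\b");
    std::smatch sm;
    if (std::regex_search(text, sm, r))
        return static_cast<long>(sm.position(0));
    return -1;
}
```

In "Snoop Doggy Dogg has fleas." the Dogg inside Doggy is skipped, because there is no word boundary between g and y.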
Regex components: Inherited from string syntax
What | Description |
\t | tab |
\n | newline |
\v | vertical tab |
\f | form feed |
\r | carriage return |
\0 digits | octal number |
\x digits | hexadecimal number |
\u digits | Unicode code point |
Examples
Pattern | Searched text | Matched part | Explanation |
b | abracadabra | b (the first one) | Take the first match |
ac | abracadabra | ac | A plain-text string matches itself |
^abra | abracadabra | abra (the first one) | ^ matches start of string/line |
abra$ | abracadabra | abra (the last one) | $ matches end of string/line |
ca. | abracadabra | cad | Any single character |
r.*b | abracadabra | racadab | * modifies . to match any string (greedy) |
ac.+a | abracadabra | acadabra | + must match at least one |
cx?a | abracadabra | ca | ? matches zero or one |
Examples
Pattern | Searched text | Matched part |
[a-fXY0-9] | My dog has fleas. | d |
[^a-fXY0-9] | Your dog has fleas. | o |
flea|tick | My dog has fleas. | flea |
(My|Your) (dog|cat) | My dog has fleas. | My dog |
\bDogg\b | Snoop Doggy Dogg has fleas. | Dogg (the standalone word, not the start of Doggy) |
\d | File your 1040 form! | 1 |
\s | File your 1040 form! | the space after File |
\w+ | File your 1040 form! | File |
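The rows in these example tables can be checked with a sketch like the following. The helper name first_match is made up; it returns the text of the first match that regex_search() finds.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: the text of the first match of pattern in text,
// or an empty string if nothing matches.
std::string first_match(const std::string &text, const std::string &pattern) {
    const std::regex r(pattern);
    std::smatch sm;
    return std::regex_search(text, sm, r) ? sm.str(0) : std::string();
}
```

For instance, first_match("abracadabra", "r.*b") gives racadab, because .* is greedy and runs on to the last b.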
Construction
To use a regular expression, construct a regex object:
regex r("^(Ben(jamin)?\\s+)?Franklin$"); // double \ to get it into string
If your regular expression is syntactically incorrect, it lets you know:
regex r("abc(def"); // 🦡
terminate called after throwing an instance of 'std::regex_error'
what(): Parenthesis is not closed.
SIGABRT: Aborted
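If you would rather not abort, you can catch the exception. This is a minimal sketch with a made-up helper name, is_valid_pattern. The exact wording of the error message varies between library implementations.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if pattern compiles as a regular expression.
bool is_valid_pattern(const std::string &pattern) {
    try {
        const std::regex r(pattern);   // throws std::regex_error if malformed
        return true;
    } catch (const std::regex_error &) {
        return false;   // the exception's what() and code() describe the problem
    }
}
```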
Match a number
Let’s try to match a number:
const regex r("[0-9]");
cout << boolalpha << regex_search("123", r) << '\n';
true
Hooray, it worked!
Well, perhaps a bit more testing might be worthwhile …
Match a number
Let’s try to match a number:
const regex r("[0-9]");
cout << boolalpha
<< regex_search("123", r) << '\n'
<< regex_search("ab45xy", r) << '\n'
<< regex_search("Bjarne", r) << '\n';
true
true
false
Testing—what a concept! Not very DRY, though.
Match a number
Let’s try to match a number:
const regex r("[0-9]");
for (auto s : {"123", "ab45xy", "Bjarne"})
cout << setw(10) << left << s
<< boolalpha << regex_search(s, r) << '\n';
123 true
ab45xy true
Bjarne false
OK, now it’s DRY. Why does ab45xy succeed?
Match a number
Add *:
const regex r("[0-9]*");
for (auto s : {"123", "ab45xy", "Bjarne"})
cout << setw(10) << left << s
<< boolalpha << regex_search(s, r) << '\n';
123 true
ab45xy true
Bjarne true
Huh, that got worse. Why did "Bjarne" succeed?
Match a number
Add +:
const regex r("[0-9]+");
for (auto s : {"123", "ab45xy", "Bjarne"})
cout << setw(10) << left << s
<< boolalpha << regex_search(s, r) << '\n';
123 true
ab45xy true
Bjarne false
At least we got rid of Bjarne.
Problem is, we haven’t told the regex that it has to match
the whole line. It’s happy just matching part of the line.
Match a number
Anchored:
const regex r("^[0-9]+$");
for (auto s : {"123", "ab45xy", "Bjarne"})
cout << setw(10) << left << s
<< boolalpha << regex_search(s, r) << '\n';
123 true
ab45xy false
Bjarne false
Now it has to match the entire line, since ^ only matches at the
start of the string, and $ only matches at the end of the string.
Match a number
How about floating-point?
const regex r("^[0-9]+$");
for (auto s : {"123", "45.67", "ab45xy", "Bjarne"})
cout << setw(10) << left << s
<< boolalpha << regex_search(s, r) << '\n';
123 true
45.67 false
ab45xy false
Bjarne false
Match a number
Need to add the decimal point:
const regex r("^[0-9.]+$");
for (auto s : {"123", "45.67", "ab45xy", "Bjarne"})
cout << setw(10) << left << s
<< boolalpha << regex_search(s, r) << '\n';
123 true
45.67 true
ab45xy false
Bjarne false
Match a number
We might be too liberal, now:
const regex r("^[0-9.]+$");
for (auto s : {"123", "45.67", "78.", ".89", ".",
"127.0.0.1", "ab45xy", "Bjarne"})
cout << setw(10) << left << s
<< boolalpha << regex_search(s, r) << '\n';
123 true
45.67 true
78. true
.89 true
. true
127.0.0.1 true
ab45xy false
Bjarne false
Match a number
Let’s insist on digits point digits:
const regex r("^[0-9]+\\.[0-9]+$");
for (auto s : {"123", "45.67", "78.", ".89", ".",
"127.0.0.1", "ab45xy", "Bjarne"})
cout << setw(10) << left << s
<< boolalpha << regex_search(s, r) << '\n';
123 false
45.67 true
78. false
.89 false
. false
127.0.0.1 false
ab45xy false
Bjarne false
Why the double backslash?
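One way to answer that question: the C++ compiler processes the string literal first, so "\\." in source code becomes the two characters \. by the time the regex sees them. The sketch below uses made-up function names to show the difference between an escaped dot and a bare dot.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// In C++ source, "\\." is a two-character string: a backslash and a dot.
std::size_t escaped_dot_length() { return std::string("\\.").size(); }

// With the backslash, the regex sees \. and demands a literal dot.
bool matches_literal_dot(const std::string &s) {
    static const std::regex r("1\\.5");
    return std::regex_match(s, r);
}

// Without it, the regex sees a bare . which matches any one character.
bool matches_any_char(const std::string &s) {
    static const std::regex r("1.5");
    return std::regex_match(s, r);
}
```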
Match a number
No, the parts should be optional:
const regex r("^[0-9]*\\.?[0-9]*$");
for (auto s : {"123", "45.67", "78.", ".89", ".",
"127.0.0.1", "ab45xy", "Bjarne"})
cout << setw(10) << left << s
<< boolalpha << regex_search(s, r) << '\n';
123 true
45.67 true
78. true
.89 true
. true
127.0.0.1 false
ab45xy false
Bjarne false
Match a number
Let’s stop hacking and design.
- Digits before or after . are optional, but a naked . is bad,
  so here are the possibilities:
  - digits
  - digits . digits
  - digits .
  - . digits
- or:
  - digits
  - digits . optional-digits
  - . digits
We express alternation with |.
Match a number
const regex r("^([0-9]+|[0-9]+\\.[0-9]*|\\.[0-9]+)$");
for (auto s : {"123", "45.67", "78.", ".89", ".",
"127.0.0.1", "ab45xy", "Bjarne"})
cout << setw(10) << left << s
<< boolalpha << regex_search(s, r) << '\n';
123 true
45.67 true
78. true
.89 true
. false
127.0.0.1 false
ab45xy false
Bjarne false
Match a number
Combine the first two cases:
const regex r("^([0-9]+(\\.[0-9]*)?|\\.[0-9]+)$");
for (auto s : {"123", "45.67", "78.", ".89", ".",
"127.0.0.1", "ab45xy", "Bjarne"})
cout << setw(10) << left << s
<< boolalpha << regex_search(s, r) << '\n';
123 true
45.67 true
78. true
.89 true
. false
127.0.0.1 false
ab45xy false
Bjarne false
Match a number
Let’s use \d instead of [0-9]:
const regex r("^(\\d+(\\.\\d*)?|\\.\\d+)$");
for (auto s : {"123", "45.67", "78.", ".89", ".",
"127.0.0.1", "ab45xy", "Bjarne"})
cout << setw(10) << left << s
<< boolalpha << regex_search(s, r) << '\n';
123 true
45.67 true
78. true
.89 true
. false
127.0.0.1 false
ab45xy false
Bjarne false
Match a number
Those double backslashes are hideous. Use a raw string, which works like
this:
R"(stuff-taken-literally-even-backslashes)"
const regex r(R"(^(\d+(\.\d*)?|\.\d+)$)");
for (auto s : {"123", "45.67", "78.", ".89", ".",
"127.0.0.1", "ab45xy", "Bjarne"})
cout << setw(10) << left << s
<< boolalpha << regex_search(s, r) << '\n';
123 true
45.67 true
78. true
.89 true
. false
127.0.0.1 false
ab45xy false
Bjarne false
Match a number
Should’ve used regex_match() instead of regex_search(); regex_match()
matches the entire string. Now we don’t need the ^, the $, or the
outer parentheses:
const regex r(R"(\d+(\.\d*)?|\.\d+)");
for (auto s : {"123", "45.67", "78.", ".89", ".",
"127.0.0.1", "ab45xy", "Bjarne"})
cout << setw(10) << left << s
<< boolalpha << regex_match(s, r) << '\n';
123 true
45.67 true
78. true
.89 true
. false
127.0.0.1 false
ab45xy false
Bjarne false
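Why were the outer parentheses needed with regex_search()? Because | has lower precedence than ^ and $, so without them the anchors each apply to only one alternative. A sketch, with made-up function names:

```cpp
#include <regex>
#include <string>

// regex_match(): the whole string must match, so no anchors are needed.
bool number_by_match(const std::string &s) {
    static const std::regex r(R"(\d+(\.\d*)?|\.\d+)");
    return std::regex_match(s, r);
}

// regex_search() with anchors around a parenthesized alternation.
bool number_by_search_grouped(const std::string &s) {
    static const std::regex r(R"(^(\d+(\.\d*)?|\.\d+)$)");
    return std::regex_search(s, r);
}

// Without the parentheses this means: (^\d+(\.\d*)?) or (\.\d+$).
bool number_by_search_ungrouped(const std::string &s) {
    static const std::regex r(R"(^\d+(\.\d*)?|\.\d+$)");
    return std::regex_search(s, r);
}
```

The ungrouped version accepts "123abc", since the first alternative is only anchored at the start, and "abc.89", since the second is only anchored at the end.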
Capturing Match
- regex_search() takes an optional smatch object, which contains
  the match results.
  - sm[0] is the entire matched string
  - sm[1] is the first string captured with ()
  - sm[2] is the second string captured with ()
- Actually, sm[n] is a more complicated object, but you
  can treat it as a string.
string in = "My dog Kokopelli is a Chihuahua-terror";
regex r("(\\S+) is a (.*)");
if (smatch sm; regex_search(in, sm, r))
cout << "All: " << sm[0] << '\n'
<< "Name: " << sm[1] << '\n'
<< "Breed: " << sm[2] << '\n';
else
cout << "No match\n";
All: Kokopelli is a Chihuahua-terror
Name: Kokopelli
Breed: Chihuahua-terror
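The “more complicated object” is a std::ssub_match. It converts to std::string either explicitly, with .str(), or implicitly. Here is a sketch that packages the captures into a struct; the names Pet and parse_pet are made up for this illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical result type for this sketch.
struct Pet {
    bool found = false;
    std::string name, breed;
};

// Hypothetical helper: pull the name and breed out of a sentence.
Pet parse_pet(const std::string &sentence) {
    static const std::regex r(R"((\S+) is a (.*))");
    Pet pet;
    std::smatch sm;
    if (std::regex_search(sentence, sm, r)) {
        const std::string name = sm[1].str();   // explicit conversion
        const std::string breed = sm[2];        // implicit conversion
        pet.found = true;
        pet.name = name;
        pet.breed = breed;
    }
    return pet;
}
```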
Contractions
- Let’s do multiple matches on the same string.
- Let’s look for all of the contractions in a string.
- Define a contraction as letters apostrophe letters.
Match contractions
const string s = "Can’t feed y’all before three o’clock!";
const regex r("[a-z]+’[a-z]+");
cout << boolalpha
<< regex_search(s, r) << '\n';
true
- Thanks—that was existential.
- Where are the contractions?
- I want a list!
Match contractions
const string s = "Can’t feed y’all before three o’clock!";
const regex r("[a-z]+’[a-z]+");
sregex_iterator iter(s.begin(), s.end(), r);
sregex_iterator end;
for (; iter!=end; ++iter)
cout << iter->position() << ": " << iter->str() << '\n';
1: an’t
13: y’all
34: o’clock
- Exactly what is iter iterating over?
- Hey, that first contraction is wrong.
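As for what iter iterates over: dereferencing an sregex_iterator yields an smatch, the same type that regex_search() fills in, one per match. The sketch below collects the matches into a vector. The helper name all_matches is made up, and the usage uses a plain ASCII apostrophe rather than the typographic one, which is an assumption of this example.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collect the text of every match of pattern in s.
std::vector<std::string> all_matches(const std::string &s,
                                     const std::string &pattern) {
    const std::regex r(pattern);
    std::vector<std::string> found;
    const std::sregex_iterator end;          // default-constructed: end of sequence
    for (std::sregex_iterator iter(s.begin(), s.end(), r); iter != end; ++iter) {
        const std::smatch &sm = *iter;       // one smatch per match
        found.push_back(sm.str());
    }
    return found;
}
```

Calling all_matches("Can't feed y'all before three o'clock!", "[a-z]+'[a-z]+") should yield three strings, the first of which is an't, since the uppercase C is not in [a-z].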
Match contractions
Let’s add regex_constants::icase:
const string s = "Can’t feed y’all before three o’clock!";
const regex r("[a-z]+’[a-z]+", regex_constants::icase);
sregex_iterator iter(s.begin(), s.end(), r);
sregex_iterator end;
for (; iter!=end; ++iter)
cout << iter->position() << ": " << iter->str() << '\n';
0: Can’t
13: y’all
34: o’clock
Match contractions
- Even though that worked, the [a-z] smacks of chauvinism.
  [a-z] represents the entire alphabet, but does it?
  Who says that the alphabet begins with a and ends with z?
- Instead of [a-z], use [[:alpha:]] or [[:lower:]],
  which mean “all alphabetical/lowercase characters”.
Match contractions
const string s = "Can’t feed y’all before three o’clock!";
const regex r("[[:alpha:]]+’[[:alpha:]]+"); // no more icase
sregex_iterator iter(s.begin(), s.end(), r);
sregex_iterator end;
for (; iter!=end; ++iter)
cout << iter->position() << ": " << iter->str() << '\n';
0: Can’t
13: y’all
34: o’clock
[[:alpha:]] is not [:alpha:]. There are two sets of square brackets.
See https://cplusplus.com/reference/regex/ECMAScript/
for [[:upper:]], [[:xdigit:]], [[:punct:]],
and other such character classes.
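A short sketch of two of these named classes, with made-up helper names. Which characters count as alphabetic depends on the locale in effect; in the default "C" locale, expect only the ASCII letters.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: is s made up entirely of alphabetic characters?
bool is_alpha_word(const std::string &s) {
    static const std::regex r("[[:alpha:]]+");   // note the two sets of brackets
    return std::regex_match(s, r);
}

// Hypothetical helper: is s made up entirely of hexadecimal digits?
bool is_hex_number(const std::string &s) {
    static const std::regex r("[[:xdigit:]]+");
    return std::regex_match(s, r);
}
```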
Crossword Puzzle
The website
has, believe it or not, regular expression crossword puzzles. It has to be
seen to be believed!