
CS253 Regular Expressions

Nomenclature

A certain sort of pattern is called a Regular Expression, alias a regexp, regex, or just an re.
It’s the middle of grep. In vi, the command :g/re/p means:
- g: globally, on all lines that match a given
- re: regular expression,
- p: print those lines.
Wikipedia says:

Regular expressions describe regular languages in formal language theory. They have the same expressive power as regular grammars.

Pattern Matching

It is often useful to perform pattern matching.
For example, an actor contemplating a role might want to know how many times the name “Cornelius” occurs in Shakespeare’s Hamlet:

% grep -i "Cornelius" ~cs253/pub/hamlet.txt
CORNELIUS	|
		POLONIUS, LAERTES, VOLTIMAND, CORNELIUS, Lords,
		You, good Cornelius, and you, Voltimand,
CORNELIUS	|
		[Exeunt VOLTIMAND and CORNELIUS]
		[Re-enter POLONIUS, with VOLTIMAND and CORNELIUS]
		[Exeunt VOLTIMAND and CORNELIUS]

In C++

const string home = getpwnam("cs253")->pw_dir;
ifstream play(home+"/pub/hamlet.txt");
for (string line; getline(play, line); )
    if (line.find("Cornelius") != string::npos)
        cout << line << '\n';

		You, good Cornelius, and you, Voltimand,

That’s only one match. Didn’t we see more than that?

Case-independence

const string home = getpwnam("cs253")->pw_dir;
ifstream play(home+"/pub/hamlet.txt");
for (string line; getline(play, line); )
    if (line.find("Cornelius") != string::npos ||
        line.find("cornelius") != string::npos ||
        line.find("CORNELIUS") != string::npos)
        cout << line << '\n';

CORNELIUS	|
		POLONIUS, LAERTES, VOLTIMAND, CORNELIUS, Lords,
		You, good Cornelius, and you, Voltimand,
CORNELIUS	|
		[Exeunt VOLTIMAND and CORNELIUS]
		[Re-enter POLONIUS, with VOLTIMAND and CORNELIUS]
		[Exeunt VOLTIMAND and CORNELIUS]

Not satisfied

That’s better but it’s not truly case-independent.
What about “CorNELius”, or “cOrNeLiUs”?
There are 2⁹, or 512, combinations, which would make the code quite tedious.
- As if the Bard of Avon would really write “CoRNelIUs”!

Regular expressions

const string home = getpwnam("cs253")->pw_dir;
ifstream play(home+"/pub/hamlet.txt");
const regex r("Cornelius");     // Create the pattern

for (string line; getline(play, line); )
    if (regex_search(line, r))  // Search the line
        cout << line << '\n';

		You, good Cornelius, and you, Voltimand,

OK, but it’s not case-independent.

Regular expressions

const string home = getpwnam("cs253")->pw_dir;
ifstream play(home+"/pub/hamlet.txt");
const regex r("Cornelius", regex_constants::icase);

for (string line; getline(play, line); )
    if (regex_search(line, r))
        cout << line << '\n';

CORNELIUS	|
		POLONIUS, LAERTES, VOLTIMAND, CORNELIUS, Lords,
		You, good Cornelius, and you, Voltimand,
CORNELIUS	|
		[Exeunt VOLTIMAND and CORNELIUS]
		[Re-enter POLONIUS, with VOLTIMAND and CORNELIUS]
		[Exeunt VOLTIMAND and CORNELIUS]

Dialects

You’ve all learned English; did you learn:

American English?
Canadian English?
U.K. English?
Australian English?

Well, same with regular expressions. There are dialects.

Regular expression dialects

The second argument to the regex ctor is a bitmask of flags. regex_constants::icase indicates a case-independent pattern. You can also specify the regular expression dialect:

Flag	Explanation
`regex_constants::ECMAScript`	ECMAScript (Javascript) (default)
`regex_constants::basic`	Basic POSIX
`regex_constants::extended`	Extended POSIX
`regex_constants::awk`	Awk POSIX
`regex_constants::grep`	Grep POSIX
`regex_constants::egrep`	Egrep POSIX

Not filename patterns

These are not filename patterns.
For example, * does not mean “anything” in a regexp. It means “repeat what came before”.
Filename patterns are used to match filenames.
Regular expressions are used to match all sorts of data.

I mean it!

Regular expressions are NOT filename patterns!

Components of regular expressions:

What	Description	What	Description
`.`	any one char but \n	`|`	alternation
`[a-fxy0-9]`	any one of these	`(`…`)`	grouping & capturing
`[^a-fxy0-9]`	any char but one of these	`\b`	word boundary
`*`	0–∞ of previous (any number)	`\d` or `\D`	`[0-9]` or not (just one char)
`+`	1–∞ of previous (many)	`\s` or `\S`	`[ \t\n\r…]` or not (just one char)
`?`	0–1 of previous (optional)	`\w` or `\W`	`[0-9a-zA-Z_]` or not (just one char)
`{17}`	17 of previous	`^`	beginning of line
`{3,8}`	3–8 of previous	`$`	end of line

Examples

Pattern	What it matches	Pattern	What it matches
`b`	`abracadabra`	`[a-fXY0-9]`	`My dog has fleas.`
`ac`	`abracadabra`	`[^a-fXY0-9]`	`Your dog has fleas.`
`^abra`	`abracadabra`	`flea|tick`	`My dog has fleas.`
`abra$`	`abracadabra`	`(My|Your) (dog|cat)`	`My dog has fleas.`
`ca.`	`abracadabra`	`\bDogg\b`	`Snoop Doggy Dogg has fleas.`
`r.*b`	`abracadabra`	`\d`	`File your 1040 form!`
`ac.+a`	`abracadabra`	`\s`	`File your 1040 form!`
`cx?a`	`abracadabra`	`\w+`	`File your 1040 form!`

Construction

To use a regular expression, construct a regex object:

regex r("^J(ohn|ack)( Applin)?");

If your regular expression is syntactically incorrect, it lets you know:

regex r("abc(def");

terminate called after throwing an instance of 'std::regex_error'
  what():  Parenthesis is not closed.
SIGABRT: Aborted

Match a number

Let’s try to match a number:

const regex r("[0-9]");

cout << boolalpha << regex_search("123", r) << '\n';

true

Hooray, it worked!

Well, perhaps a bit more testing might be worthwhile …

Match a number

Let’s try to match a number:

const regex r("[0-9]");

cout << boolalpha
     << regex_search("123",    r) << '\n'
     << regex_search("ab45xy", r) << '\n'
     << regex_search("Jack",   r) << '\n';

true
true
false

Testing—what a concept! Not very DRY, though.

Match a number

Let’s try to match a number:

const regex r("[0-9]");

for (auto s : {"123", "ab45xy", "Jack"})
    cout << setw(10) << left << s
         << boolalpha << regex_search(s, r) << '\n';

123       true
ab45xy    true
Jack      false

OK, now it’s DRY. Why does ab45xy succeed?

Match a number

Add *:

const regex r("[0-9]*");

for (auto s : {"123", "ab45xy", "Jack"})
    cout << setw(10) << left << s
         << boolalpha << regex_search(s, r) << '\n';

123       true
ab45xy    true
Jack      true

Huh—that got worse. Why did "Jack" succeed?

Match a number

Add +:

const regex r("[0-9]+");

for (auto s : {"123", "ab45xy", "Jack"})
    cout << setw(10) << left << s
         << boolalpha << regex_search(s, r) << '\n';

123       true
ab45xy    true
Jack      false

At least we got rid of Jack.

Problem is, we haven’t told the regex that it has to match the whole line. It’s happy just matching part of the line.

Match a number

Anchored:

const regex r("^[0-9]+$");

for (auto s : {"123", "ab45xy", "Jack"})
    cout << setw(10) << left << s
         << boolalpha << regex_search(s, r) << '\n';

123       true
ab45xy    false
Jack      false

Now it has to match the entire line, since ^ only matches at the start of the string, and $ only matches at the end of the string.

Match a number

How about floating-point?

const regex r("^[0-9]+$");

for (auto s : {"123", "45.67", "ab45xy", "Jack"})
    cout << setw(10) << left << s
         << boolalpha << regex_search(s, r) << '\n';

123       true
45.67     false
ab45xy    false
Jack      false

Match a number

Need to add the decimal point:

const regex r("^[0-9.]+$");

for (auto s : {"123", "45.67", "ab45xy", "Jack"})
    cout << setw(10) << left << s
         << boolalpha << regex_search(s, r) << '\n';

123       true
45.67     true
ab45xy    false
Jack      false

Match a number

We might be too liberal, now:

const regex r("^[0-9.]+$");

for (auto s : {"123", "45.67", "78.", ".89", ".",
               "127.0.0.1", "ab45xy", "Jack"})
    cout << setw(10) << left << s
         << boolalpha << regex_search(s, r) << '\n';

123       true
45.67     true
78.       true
.89       true
.         true
127.0.0.1 true
ab45xy    false
Jack      false

Match a number

Let’s insist on digits point digits:

const regex r("^[0-9]+\\.[0-9]+$");

for (auto s : {"123", "45.67", "78.", ".89", ".",
               "127.0.0.1", "ab45xy", "Jack"})
    cout << setw(10) << left << s
         << boolalpha << regex_search(s, r) << '\n';

123       false
45.67     true
78.       false
.89       false
.         false
127.0.0.1 false
ab45xy    false
Jack      false

Why the double backslash?

Match a number

No, the parts should be optional:

const regex r("^[0-9]*\\.?[0-9]*$");

for (auto s : {"123", "45.67", "78.", ".89", ".",
               "127.0.0.1", "ab45xy", "Jack"})
    cout << setw(10) << left << s
         << boolalpha << regex_search(s, r) << '\n';

123       true
45.67     true
78.       true
.89       true
.         true
127.0.0.1 false
ab45xy    false
Jack      false

Match a number

Let’s stop hacking and design.

Digits before or after . are optional, but a naked . is bad, so here are the possibilities:
- digits
- digits . digits
- digits .
- . digits
or:
- digits
- digits . optional-digits
- . digits

We express alternation with |.

Match a number

const regex r("^([0-9]+|[0-9]+\\.[0-9]*|\\.[0-9]+)$");

for (auto s : {"123", "45.67", "78.", ".89", ".",
               "127.0.0.1", "ab45xy", "Jack"})
    cout << setw(10) << left << s
         << boolalpha << regex_search(s, r) << '\n';

123       true
45.67     true
78.       true
.89       true
.         false
127.0.0.1 false
ab45xy    false
Jack      false

Match a number

Combine the first two cases:

const regex r("^([0-9]+(\\.[0-9]*)?|\\.[0-9]+)$");

for (auto s : {"123", "45.67", "78.", ".89", ".",
               "127.0.0.1", "ab45xy", "Jack"})
    cout << setw(10) << left << s
         << boolalpha << regex_search(s, r) << '\n';

123       true
45.67     true
78.       true
.89       true
.         false
127.0.0.1 false
ab45xy    false
Jack      false

Match a number

Let’s use \d instead of [0-9]:

const regex r("^(\\d+(\\.\\d*)?|\\.\\d+)$");

for (auto s : {"123", "45.67", "78.", ".89", ".",
               "127.0.0.1", "ab45xy", "Jack"})
    cout << setw(10) << left << s
         << boolalpha << regex_search(s, r) << '\n';

123       true
45.67     true
78.       true
.89       true
.         false
127.0.0.1 false
ab45xy    false
Jack      false

Match a number

Those double backslashes are hideous. Use a raw string, which works like this:
R"( stuff-taken-literally-even-backslashes )"

const regex r(R"(^(\d+(\.\d*)?|\.\d+)$)");

for (auto s : {"123", "45.67", "78.", ".89", ".",
               "127.0.0.1", "ab45xy", "Jack"})
    cout << setw(10) << left << s
         << boolalpha << regex_search(s, r) << '\n';

123       true
45.67     true
78.       true
.89       true
.         false
127.0.0.1 false
ab45xy    false
Jack      false

Match a number

Should’ve used regex_match() instead of regex_search(); regex_match() matches the entire string. Now we don’t need ^$ and the parentheses:

const regex r(R"(\d+(\.\d*)?|\.\d+)");

for (auto s : {"123", "45.67", "78.", ".89", ".",
               "127.0.0.1", "ab45xy", "Jack"})
    cout << setw(10) << left << s
         << boolalpha << regex_match(s, r) << '\n';

123       true
45.67     true
78.       true
.89       true
.         false
127.0.0.1 false
ab45xy    false
Jack      false

Capturing Match

regex_search() takes an optional smatch object, which contains the match results.
- sm[0] is the entire matched string
- sm[1] is the first string captured with ()
- sm[2] is the second string captured with ()
Actually, sm[n] is a more complicated object, but you can treat it as a string.

string in = "My dog Kokopelli is a Chihuahua-terror";
regex r("(\\S+) is a (.*)");

if (smatch sm; regex_search(in, sm, r))
    cout << "All:   " << sm[0] << '\n'
         << "Name:  " << sm[1] << '\n'
         << "Breed: " << sm[2] << '\n';
else
    cout << "No match\n";

All:   Kokopelli is a Chihuahua-terror
Name:  Kokopelli
Breed: Chihuahua-terror

Contractions

Let’s do multiple matches on the same string.
Let’s look for all of the contractions in a string.
Define a contraction as letters apostrophe letters.

Match contractions

const string s = "Can't feed y'all before three o'clock!";
const regex r("[a-z]+'[a-z]+");

cout << boolalpha
     << regex_search(s, r) << '\n';

true

Thanks—that was existential.
Where are the contractions?
I want a list!

Match contractions

const string s = "Can't feed y'all before three o'clock!";
const regex r("[a-z]+'[a-z]+");

sregex_iterator iter(s.begin(), s.end(), r);
sregex_iterator end;

for (; iter!=end; ++iter)
    cout << iter->position() << ": " << iter->str() << '\n';

1: an't
11: y'all
30: o'clock

Exactly what is iter iterating over?
Hey, that first contraction is wrong.

Match contractions

Let’s add regex_constants::icase:

const string s = "Can't feed y'all before three o'clock!";
const regex r("[a-z]+'[a-z]+", regex_constants::icase);

sregex_iterator iter(s.begin(), s.end(), r);
sregex_iterator end;

for (; iter!=end; ++iter)
    cout << iter->position() << ": " << iter->str() << '\n';

0: Can't
11: y'all
30: o'clock

Match contractions

Even though that worked, the [a-z] is problematic.
[a-z] is supposed to represent the entire alphabet, but does it? Who says that the alphabet begins with 𝒜 and ends with 𝒵?
Swedish alphabetical order is ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ.
- This does not end with Z, so [A-Z] would exclude ÅÄÖ.
Modern Greek alphabetical order is ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ.
- Latin Z (Z) ≠ Greek Zeta (Ζ), even if they look alike.
Instead of [a-z], use [[:alpha:]], which means “all alphabetical characters”.

Match contractions

const string s = "Can't feed y'all before three o'clock!";
const regex r("[[:alpha:]]+'[[:alpha:]]+");  // no more icase

sregex_iterator iter(s.begin(), s.end(), r);
sregex_iterator end;

for (; iter!=end; ++iter)
    cout << iter->position() << ": " << iter->str() << '\n';

0: Can't
11: y'all
30: o'clock

Note that [[:alpha:]] is not [:alpha:]. There are two sets of square brackets. See https://cplusplus.com/reference/regex/ECMAScript/ for [[:upper:]], [[:digit:]], [[:space:]], and other such character classes.