CS253 Regular Expressions

Nomenclature

A certain sort of pattern is called a Regular Expression.
It’s also called a regexp, or regex, or just an re.
It’s the “re” in the middle of grep. In vi:

    :g/re/p

means: globally (g) find every line that matches the given regular expression (re), and print (p) those lines.

Wikipedia says:

Regular expressions describe regular languages in formal language theory. They have the same expressive power as regular grammars.

Pattern Matching

It is often useful to perform pattern matching.
For example, an actor contemplating a role might want to know how many times the name “Cornelius” occurs in Shakespeare’s Hamlet:

% grep -i "Cornelius" ~cs253/pub/hamlet.txt
CORNELIUS	|
		POLONIUS, LAERTES, VOLTIMAND, CORNELIUS, Lords,
		You, good Cornelius, and you, Voltimand,
CORNELIUS	|
		[Exeunt VOLTIMAND and CORNELIUS]
		[Re-enter POLONIUS, with VOLTIMAND and CORNELIUS]
		[Exeunt VOLTIMAND and CORNELIUS]

In C++

ifstream play("/s/bach/a/class/cs253/pub/hamlet.txt");
string line;
while (getline(play, line))
    if (line.find("Cornelius") != string::npos)
        cout << line << '\n';

		You, good Cornelius, and you, Voltimand,

That’s only one match. Didn’t we find more than that?

Case-independence

ifstream play("/s/bach/a/class/cs253/pub/hamlet.txt");
string line;
while (getline(play, line))
    if (line.find("Cornelius") != string::npos ||
        line.find("cornelius") != string::npos ||
        line.find("CORNELIUS") != string::npos)
        cout << line << '\n';

CORNELIUS	|
		POLONIUS, LAERTES, VOLTIMAND, CORNELIUS, Lords,
		You, good Cornelius, and you, Voltimand,
CORNELIUS	|
		[Exeunt VOLTIMAND and CORNELIUS]
		[Re-enter POLONIUS, with VOLTIMAND and CORNELIUS]
		[Exeunt VOLTIMAND and CORNELIUS]

Not satisfied

That’s better but it’s not truly case-independent.
What about “CorNELius”, or “cOrNeLiUs”?
There are 2⁹, or 512, combinations, which would make the code quite tedious.
- As if the Bard of Avon would really write “CoRNelIUs”!

Regular expressions

ifstream play("/s/bach/a/class/cs253/pub/hamlet.txt");
string line;
const regex r("Cornelius");     // Create the pattern

while (getline(play, line))
    if (regex_search(line, r))  // Search the line
        cout << line << '\n';

		You, good Cornelius, and you, Voltimand,

OK, but it’s not case-independent.

Regular expressions

ifstream play("/s/bach/a/class/cs253/pub/hamlet.txt");
string line;
const regex r("Cornelius", regex_constants::icase);

while (getline(play, line))
    if (regex_search(line, r))
        cout << line << '\n';

CORNELIUS	|
		POLONIUS, LAERTES, VOLTIMAND, CORNELIUS, Lords,
		You, good Cornelius, and you, Voltimand,
CORNELIUS	|
		[Exeunt VOLTIMAND and CORNELIUS]
		[Re-enter POLONIUS, with VOLTIMAND and CORNELIUS]
		[Exeunt VOLTIMAND and CORNELIUS]
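
The snippets above leave out the #include lines and assume using namespace std. As a rough, self-contained sketch of the same case-independent search, here is a version that takes its lines from a vector instead of hamlet.txt, so it does not depend on that file. The function name grep_icase is made up for this illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// Return every line that contains the pattern, ignoring case.
// The lines are passed in, so nothing here reads hamlet.txt.
std::vector<std::string> grep_icase(const std::vector<std::string> &lines,
                                    const std::string &pattern) {
    // Build the regex once, outside the loop; icase makes it case-independent.
    const std::regex r(pattern, std::regex_constants::icase);
    std::vector<std::string> hits;
    for (const auto &line : lines)
        if (std::regex_search(line, r))   // a match anywhere in the line counts
            hits.push_back(line);
    return hits;
}
```

Unlike the three-way find() test, this also accepts spellings such as “cOrNeLiUs”, with no extra code per spelling.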

Dialects

You’ve all learned English; did you learn:

American English?
Canadian English?
U.K. English?
Australian English?

Well, same with regular expressions. There are dialects.

Basic components of regular expressions:

What	Description	What	Description
`.`	any one char but \n	`|`	alternation
`[a-fxy0-9]`	any one of these	`(`…`)`	grouping
`[^a-fxy0-9]`	not one of these	`\b`	word boundary
`*`	0–∞ of previous	`\d` or `\D`	[0-9] or not
`+`	1–∞ of previous	`\s` or `\S`	`[ \n\r…]` or not
`?`	0–1 of previous	`\w` or `\W`	[0-9a-zA-Z_] or not
`{17}`	17 of previous	`^`	beginning of line
`{3,8}`	3–8 of previous	`$`	end of line
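
To try out a few rows of the table, a pair of small helpers is enough. This is only a sketch: found_in and is_entirely are hypothetical names, and the patterns are interpreted in the default std::regex dialect (ECMAScript).

```cpp
#include <regex>
#include <string>

// True if the pattern matches somewhere inside the text.
bool found_in(const std::string &text, const std::string &pattern) {
    return std::regex_search(text, std::regex(pattern));
}

// True only if the pattern matches the text from start to end.
bool is_entirely(const std::string &text, const std::string &pattern) {
    return std::regex_match(text, std::regex(pattern));
}
```

With these, the pattern colou?r is entirely “color” and also “colour”; [0-9]{3,8} is entirely “12345” but not “12”; and cat with a \b on each side is found in “the cat sat” but not in “concatenate”.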

Match a number

Let’s try to match a number:

const regex r("[0-9]");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';

true
true
false

Match a number

Add *:

const regex r("[0-9]*");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';

true
true
true

Huh—that got worse. Why did "Jack" succeed?
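
One way to investigate is to ask where the match is and how long it is. The sketch below does that; first_match is a hypothetical helper, not part of the library.

```cpp
#include <regex>
#include <string>
#include <utility>

// Return {position, length} of the first match of pattern in text,
// or {-1, 0} if nothing matches.
std::pair<long, long> first_match(const std::string &text,
                                  const std::string &pattern) {
    const std::regex r(pattern);
    std::smatch m;                          // filled in by regex_search
    if (!std::regex_search(text, m, r))
        return {-1, 0};
    return {static_cast<long>(m.position(0)),
            static_cast<long>(m.length(0))};
}
```

For "Jack" and [0-9]*, this reports position 0 and length 0: zero digits is a legal match for *, so the empty string at the very start of any input satisfies the pattern. The same thing happens for "abc123def", where the match is the empty string before the “a”, not the “123”.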

Match a number

Add +:

const regex r("[0-9]+");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';

true
true
false

At least we got rid of Jack.

Match a number

Anchored:

const regex r("^[0-9]+$");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';

true
false
false

Match a number

How about floating-point?

const regex r("^[0-9]+$");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("45.67",     r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';

true
false
false
false

Match a number

Need to add the decimal point:

const regex r("^[0-9.]+$");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("45.67",     r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';

true
true
false
false

Match a number

We might be too liberal, now:

const regex r("^[0-9.]+$");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("45.67",     r) << '\n'
     << regex_search("78.",       r) << '\n'
     << regex_search(".89",       r) << '\n'
     << regex_search(".",         r) << '\n'
     << regex_search("127.0.0.1", r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';

true
true
true
true
true
true
false
false

Match a number

Let’s insist on digits point digits:

const regex r("^[0-9]+\\.[0-9]+$");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("45.67",     r) << '\n'
     << regex_search("78.",       r) << '\n'
     << regex_search(".89",       r) << '\n'
     << regex_search(".",         r) << '\n'
     << regex_search("127.0.0.1", r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';

false
true
false
false
false
false
false
false

Why the double backslash?
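
A short sketch of what is going on: the C++ compiler processes the string literal first, turning \\ into a single backslash, and only then does the regex engine see \. and read it as a literal dot. The helper matches_whole and the two variable names are made up for this illustration.

```cpp
#include <regex>
#include <string>

// In source code "\\." is the two-character string  \.  after compilation;
// the regex engine then reads  \.  as "a literal dot".
const std::string escaped = "[0-9]+\\.[0-9]+";

// Without the backslash, . keeps its regex meaning: any one character.
const std::string unescaped = "[0-9]+.[0-9]+";

// True only if the whole text matches the pattern.
bool matches_whole(const std::string &text, const std::string &pattern) {
    return std::regex_match(text, std::regex(pattern));
}
```

Here escaped holds 14 characters, not 15. Both patterns accept "45.67", but only the unescaped one also accepts "45x67". A single backslash in the literal, "\.", would not help: \. is not a recognized C++ escape sequence, and compilers typically warn about it.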

Match a number

No, the parts should be optional:

const regex r("^[0-9]*\\.?[0-9]*$");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("45.67",     r) << '\n'
     << regex_search("78.",       r) << '\n'
     << regex_search(".89",       r) << '\n'
     << regex_search(".",         r) << '\n'
     << regex_search("127.0.0.1", r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';

true
true
true
true
true
false
false
false

Match a number

Digits before or after . are optional, but a naked . is bad:
- digits
- digits.digits
- digits.
- .digits
or:
- digits
- digits.optional-digits
- optional-digits.digits

We express alternation with |.

Match a number

const regex r("^([0-9]+|[0-9]+\\.[0-9]*|[0-9]*\\.[0-9]+)$");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("45.67",     r) << '\n'
     << regex_search("78.",       r) << '\n'
     << regex_search(".89",       r) << '\n'
     << regex_search(".",         r) << '\n'
     << regex_search("127.0.0.1", r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';

true
true
true
true
false
false
false
false

Match a number

Combine the first two cases:

const regex r("^([0-9]+(\\.[0-9]*)?|[0-9]*\\.[0-9]+)$");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("45.67",     r) << '\n'
     << regex_search("78.",       r) << '\n'
     << regex_search(".89",       r) << '\n'
     << regex_search(".",         r) << '\n'
     << regex_search("127.0.0.1", r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';

true
true
true
true
false
false
false
false

Match a number

Let’s use \d instead of [0-9]:

const regex r("^(\\d+(\\.\\d*)?|\\d*\\.\\d+)$");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("45.67",     r) << '\n'
     << regex_search("78.",       r) << '\n'
     << regex_search(".89",       r) << '\n'
     << regex_search(".",         r) << '\n'
     << regex_search("127.0.0.1", r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';

true
true
true
true
false
false
false
false

Match a number

Those double backslashes are hideous. Use a raw string:

const regex r(R"(^(\d+(\.\d*)?|\d*\.\d+)$)");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("45.67",     r) << '\n'
     << regex_search("78.",       r) << '\n'
     << regex_search(".89",       r) << '\n'
     << regex_search(".",         r) << '\n'
     << regex_search("127.0.0.1", r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';

true
true
true
true
false
false
false
false

Match a number

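We should have used regex_match instead of regex_search: regex_match succeeds only if the pattern matches the entire string. That makes the ^ and $ anchors, and the parentheses that grouped the alternatives between them, unnecessary: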

const regex r(R"(\d+(\.\d*)?|\d*\.\d+)");

cout << boolalpha
     << regex_match("123",       r) << '\n'
     << regex_match("45.67",     r) << '\n'
     << regex_match("78.",       r) << '\n'
     << regex_match(".89",       r) << '\n'
     << regex_match(".",         r) << '\n'
     << regex_match("127.0.0.1", r) << '\n'
     << regex_match("abc123def", r) << '\n'
     << regex_match("Jack",      r) << '\n';

true
true
true
true
false
false
false
false
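
Why were the outer parentheses needed in the regex_search versions? Alternation has the lowest precedence, so without them each anchor attaches to only one alternative. The sketch below compares the variants; the names loose, grouped, bare, and the three helper functions are hypothetical.

```cpp
#include <regex>
#include <string>

// | binds loosest, so this reads as (^A)|(B$), not ^(A|B)$.
const std::regex loose(R"(^\d+(\.\d*)?|\d*\.\d+$)");

// The parentheses make both anchors apply to both alternatives.
const std::regex grouped(R"(^(\d+(\.\d*)?|\d*\.\d+)$)");

// regex_match needs neither the anchors nor the outer group.
const std::regex bare(R"(\d+(\.\d*)?|\d*\.\d+)");

bool search_loose(const std::string &s)   { return std::regex_search(s, loose); }
bool search_grouped(const std::string &s) { return std::regex_search(s, grouped); }
bool match_bare(const std::string &s)     { return std::regex_match(s, bare); }
```

For "127.0.0.1", search_loose returns true, because “127.0” at the start satisfies the first alternative and nothing forces it to reach the end. Both search_grouped and match_bare return false for that string.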

Change of topic

Enough about matching numbers.
Let’s do multiple matches on the same string.
Let’s look for all of the contractions in a string.
Define a contraction as letters apostrophe letters.

Match contractions

const string s = "Can't feed y'all before three o'clock!";
const regex r("[a-z]+'[a-z]+");

cout << boolalpha
     << regex_search(s, r) << '\n';

true

Thanks—that was existential.
Where are the contractions?
I want a list!

Match contractions

const string s = "Can't feed y'all before three o'clock!";
const regex r("[a-z]+'[a-z]+");

sregex_iterator iter(s.begin(), s.end(), r);
sregex_iterator end;

for (; iter!=end; ++iter)
    cout << iter->position() << ": " << iter->str() << '\n';

1: an't
11: y'all
30: o'clock

Exactly what is iter iterating over?
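
A hint, as a sketch: dereferencing an sregex_iterator yields a std::smatch, the same match-results type that regex_search can fill in, so the loop visits one smatch per non-overlapping match. The struct Hit and the function all_matches are made up for this illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// One match: where it starts, and the matched text.
struct Hit {
    long position;
    std::string text;
};

// Collect every non-overlapping match of r in s, left to right.
// r must be a named regex object: the iterator refuses temporaries.
std::vector<Hit> all_matches(const std::string &s, const std::regex &r) {
    std::vector<Hit> hits;
    const std::sregex_iterator end;        // default-constructed: end of sequence
    for (std::sregex_iterator it(s.begin(), s.end(), r); it != end; ++it) {
        const std::smatch &m = *it;        // each element is a std::smatch
        hits.push_back({static_cast<long>(m.position()), m.str()});
    }
    return hits;
}
```

So iter->position() and iter->str() in the loop above are ordinary smatch member functions, called through the iterator.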

Match contractions

That [a-z] is a problem. What does it mean?
It’s related to collating order (sorting order).
What is the sorting order?
Is it a…z, A…Z? A…Z, a…z? AaBbCc…XxYyZz? aAbBcC…xXyYzZ?
Did your answer involve ASCII?
What would Auður Ava Ólafsdóttir think of your answer?

Match contractions

const string s = "Can't feed y'all before three o'clock!";
const regex r("[[:alpha:]]+'[[:alpha:]]+");

sregex_iterator iter(s.begin(), s.end(), r);
sregex_iterator end;

for (; iter!=end; ++iter)
    cout << iter->position() << ": " << iter->str() << '\n';

0: Can't
11: y'all
30: o'clock

CS253: Software Development with C++

Spring 2018

Regular Expressions
