CS253: Software Development with C++

Fall 2019

Regular Expressions

Show Lecture.RegularExpressions as a slide show.

CS253 Regular Expressions

made at imgflip.com

Nomenclature

    :g/re/p
means to do a global match of all lines that match a given regular expression, and print those lines.

Wikipedia says:

Regular expressions describe regular languages in formal language theory. They have the same expressive power as regular grammars.

Pattern Matching

% grep -i "Cornelius" ~cs253/pub/hamlet.txt
CORNELIUS	|
		POLONIUS, LAERTES, VOLTIMAND, CORNELIUS, Lords,
		You, good Cornelius, and you, Voltimand,
CORNELIUS	|
		[Exeunt VOLTIMAND and CORNELIUS]
		[Re-enter POLONIUS, with VOLTIMAND and CORNELIUS]
		[Exeunt VOLTIMAND and CORNELIUS]

In C++

const string home = getpwnam("cs253")->pw_dir;
ifstream play(home+"/pub/hamlet.txt");
for (string line; getline(play, line); )
    if (line.find("Cornelius") != string::npos)
        cout << line << '\n';
		You, good Cornelius, and you, Voltimand,

That’s only one match. Didn’t we find more than that?

Case-independence

const string home = getpwnam("cs253")->pw_dir;
ifstream play(home+"/pub/hamlet.txt");
for (string line; getline(play, line); )
    if (line.find("Cornelius") != string::npos ||
        line.find("cornelius") != string::npos ||
        line.find("CORNELIUS") != string::npos)
        cout << line << '\n';
CORNELIUS	|
		POLONIUS, LAERTES, VOLTIMAND, CORNELIUS, Lords,
		You, good Cornelius, and you, Voltimand,
CORNELIUS	|
		[Exeunt VOLTIMAND and CORNELIUS]
		[Re-enter POLONIUS, with VOLTIMAND and CORNELIUS]
		[Exeunt VOLTIMAND and CORNELIUS]

Not satisfied

Regular expressions

const string home = getpwnam("cs253")->pw_dir;
ifstream play(home+"/pub/hamlet.txt");
const regex r("Cornelius");     // Create the pattern

for (string line; getline(play, line); )
    if (regex_search(line, r))  // Search the line
        cout << line << '\n';
		You, good Cornelius, and you, Voltimand,

OK, but it’s not case-independent.

Regular expressions

const string home = getpwnam("cs253")->pw_dir;
ifstream play(home+"/pub/hamlet.txt");
const regex r("Cornelius", regex_constants::icase);

for (string line; getline(play, line); )
    if (regex_search(line, r))
        cout << line << '\n';
CORNELIUS	|
		POLONIUS, LAERTES, VOLTIMAND, CORNELIUS, Lords,
		You, good Cornelius, and you, Voltimand,
CORNELIUS	|
		[Exeunt VOLTIMAND and CORNELIUS]
		[Re-enter POLONIUS, with VOLTIMAND and CORNELIUS]
		[Exeunt VOLTIMAND and CORNELIUS]

Dialects

You’ve all learned English; did you learn:

Well, same with regular expressions. There are dialects.

Regular expression dialects

The second argument to the regex ctor is a bitmask of flags. regex_constants::icase indicates a case-independent pattern. You can also specify the regular expression dialect:

FlagExplanation
regex_constants::ECMAScript   ECMAScript (Javascript) (default)
regex_constants::basicBasic POSIX
regex_constants::extendedExtended POSIX
regex_constants::awkAwk POSIX
regex_constants::grepGrep POSIX
regex_constants::egrepEgrep POSIX

Not filename patterns

Components of regular expressions:

What Description What Description
. any one char but \n | alternation
[a-fxy0-9] any one of these () grouping
[^a-fxy0-9] not one of these \b word boundary
* 0–∞ of previous \d or \D [0-9] or not (just one!)
+ 1–∞ of previous \s or \S [ \n\r…] or not
? 0–1 of previous \w or \W [0-9a-zA-Z_] or not
{17} 17 of previous ^ beginning of line
{3,8} 3–8 of previous $ end of line

Match a number

Let’s try to match a number:

const regex r("[0-9]");

cout << boolalpha << regex_search("123", r) << '\n';
true

Hooray, it worked!

Well, perhaps a bit more testing might be worthwhile …

Match a number

Let’s try to match a number:

const regex r("[0-9]");

cout << boolalpha
     << regex_search("123",    r) << '\n'
     << regex_search("ab45xy", r) << '\n'
     << regex_search("Jack",   r) << '\n';
true
true
false

Testing—what a concept! Not very DRY, though.

Match a number

Let’s try to match a number:

const regex r("[0-9]");

for (auto s : {"123", "ab45xy", "Jack"})
    cout << setw(10) << left << s
         << boolalpha << regex_search(s, r) << '\n';
123       true
ab45xy    true
Jack      false

OK, now it’s DRY. Why does ab45xy succeed?

Match a number

Add *:

const regex r("[0-9]*");

for (auto s : {"123", "ab45xy", "Jack"})
    cout << setw(10) << left << s
         << boolalpha << regex_search(s, r) << '\n';
123       true
ab45xy    true
Jack      true

Huh—that got worse. Why did "Jack" succeed?

Match a number

Add +:

const regex r("[0-9]+");

for (auto s : {"123", "ab45xy", "Jack"})
    cout << setw(10) << left << s
         << boolalpha << regex_search(s, r) << '\n';
123       true
ab45xy    true
Jack      false

At least we got rid of Jack.

Match a number

Anchored:

const regex r("^[0-9]+$");

for (auto s : {"123", "ab45xy", "Jack"})
    cout << setw(10) << left << s
         << boolalpha << regex_search(s, r) << '\n';
123       true
ab45xy    false
Jack      false

Match a number

How about floating-point?

const regex r("^[0-9]+$");

for (auto s : {"123", "45.67", "ab45xy", "Jack"})
    cout << setw(10) << left << s
         << boolalpha << regex_search(s, r) << '\n';
123       true
45.67     false
ab45xy    false
Jack      false

Match a number

Need to add the decimal point:

const regex r("^[0-9.]+$");

for (auto s : {"123", "45.67", "ab45xy", "Jack"})
    cout << setw(10) << left << s
         << boolalpha << regex_search(s, r) << '\n';
123       true
45.67     true
ab45xy    false
Jack      false

Match a number

We might be too liberal, now:

const regex r("^[0-9.]+$");

for (auto s : {"123", "45.67", "78.", ".89", ".",
               "127.0.0.1", "ab45xy", "Jack"})
    cout << setw(10) << left << s
         << boolalpha << regex_search(s, r) << '\n';
123       true
45.67     true
78.       true
.89       true
.         true
127.0.0.1 true
ab45xy    false
Jack      false

Match a number

Let’s insist on digits point digits:

const regex r("^[0-9]+\\.[0-9]+$");

for (auto s : {"123", "45.67", "78.", ".89", ".",
               "127.0.0.1", "ab45xy", "Jack"})
    cout << setw(10) << left << s
         << boolalpha << regex_search(s, r) << '\n';
123       false
45.67     true
78.       false
.89       false
.         false
127.0.0.1 false
ab45xy    false
Jack      false

Why the double backslash?

Match a number

No, the parts should be optional:

const regex r("^[0-9]*\\.?[0-9]*$");

for (auto s : {"123", "45.67", "78.", ".89", ".",
               "127.0.0.1", "ab45xy", "Jack"})
    cout << setw(10) << left << s
         << boolalpha << regex_search(s, r) << '\n';
123       true
45.67     true
78.       true
.89       true
.         true
127.0.0.1 false
ab45xy    false
Jack      false

Match a number

Let’s stop hacking and design.

We express alternation with |.

Match a number

const regex r("^([0-9]+|[0-9]+\\.[0-9]*|\\.[0-9]+)$");

for (auto s : {"123", "45.67", "78.", ".89", ".",
               "127.0.0.1", "ab45xy", "Jack"})
    cout << setw(10) << left << s
         << boolalpha << regex_search(s, r) << '\n';
123       true
45.67     true
78.       true
.89       true
.         false
127.0.0.1 false
ab45xy    false
Jack      false

Match a number

Combine the first two cases:

const regex r("^([0-9]+(\\.[0-9]*)?|\\.[0-9]+)$");

for (auto s : {"123", "45.67", "78.", ".89", ".",
               "127.0.0.1", "ab45xy", "Jack"})
    cout << setw(10) << left << s
         << boolalpha << regex_search(s, r) << '\n';
123       true
45.67     true
78.       true
.89       true
.         false
127.0.0.1 false
ab45xy    false
Jack      false

Match a number

Let’s use \d instead of [0-9]:

const regex r("^(\\d+(\\.\\d*)?|\\.\\d+)$");

for (auto s : {"123", "45.67", "78.", ".89", ".",
               "127.0.0.1", "ab45xy", "Jack"})
    cout << setw(10) << left << s
         << boolalpha << regex_search(s, r) << '\n';
123       true
45.67     true
78.       true
.89       true
.         false
127.0.0.1 false
ab45xy    false
Jack      false

Match a number

Those double backslashes are hideous. Use a raw string:

const regex r(R"(^(\d+(\.\d*)?|\.\d+)$)");

for (auto s : {"123", "45.67", "78.", ".89", ".",
               "127.0.0.1", "ab45xy", "Jack"})
    cout << setw(10) << left << s
         << boolalpha << regex_search(s, r) << '\n';
123       true
45.67     true
78.       true
.89       true
.         false
127.0.0.1 false
ab45xy    false
Jack      false

Match a number

Should’ve used regex_match instead of regex_search; regex_match matches the entire string. Now we don’t need ^$ and the parentheses:

const regex r(R"(\d+(\.\d*)?|\.\d+)");

for (auto s : {"123", "45.67", "78.", ".89", ".",
               "127.0.0.1", "ab45xy", "Jack"})
    cout << setw(10) << left << s
         << boolalpha << regex_match(s, r) << '\n';
123       true
45.67     true
78.       true
.89       true
.         false
127.0.0.1 false
ab45xy    false
Jack      false

Change of topic

Match contractions

const string s = "Can't feed y'all before three o'clock!";
const regex r("[a-z]+'[a-z]+");

cout << boolalpha
     << regex_search(s, r) << '\n';
true

Match contractions

const string s = "Can't feed y'all before three o'clock!";
const regex r("[a-z]+'[a-z]+");

sregex_iterator iter(s.begin(), s.end(), r);
sregex_iterator end;

for (; iter!=end; ++iter)
    cout << iter->position() << ": " << iter->str() << '\n';
1: an't
11: y'all
30: o'clock

Match contractions

Let’s add regex_constants::icase:

const string s = "Can't feed y'all before three o'clock!";
const regex r("[a-z]+'[a-z]+", regex_constants::icase);

sregex_iterator iter(s.begin(), s.end(), r);
sregex_iterator end;

for (; iter!=end; ++iter)
    cout << iter->position() << ": " << iter->str() << '\n';
0: Can't
11: y'all
30: o'clock

Oh, good—that didn’t help at all. Why not!?

Match contractions

Match contractions

const string s = "Can't feed y'all before three o'clock!";
const regex r("[[:alpha:]]+'[[:alpha:]]+");  // no more icase

sregex_iterator iter(s.begin(), s.end(), r);
sregex_iterator end;

for (; iter!=end; ++iter)
    cout << iter->position() << ": " << iter->str() << '\n';
0: Can't
11: y'all
30: o'clock

Note that [[:alpha:]] is not [:alpha:]. There are two sets of square brackets. See http://www.cplusplus.com/reference/regex/ECMAScript/ for [[:upper:]], [[:digit:]], [[:space:]], and other such character classes..