CS253 Regular Expressions
Nomenclature
- A certain sort of pattern is called a Regular Expression.
- It’s also called a regexp, or regex, or just an re.
- It’s the middle of grep. In vi, :g/re/p means to do a global match of all lines that match a given regular expression, and print those lines.
Wikipedia says:
Regular expressions describe regular languages in formal language theory.
They have the same expressive power as regular grammars.
Pattern Matching
- It is often useful to perform pattern matching.
- For example, an actor contemplating a role might want to know how many
times the name “Cornelius” occurs in Shakespeare’s Hamlet:
% grep -i "Cornelius" ~cs253/pub/hamlet.txt
CORNELIUS |
POLONIUS, LAERTES, VOLTIMAND, CORNELIUS, Lords,
You, good Cornelius, and you, Voltimand,
CORNELIUS |
[Exeunt VOLTIMAND and CORNELIUS]
[Re-enter POLONIUS, with VOLTIMAND and CORNELIUS]
[Exeunt VOLTIMAND and CORNELIUS]
In C++
const string home = getpwnam("cs253")->pw_dir;
ifstream play(home+"/pub/hamlet.txt");
for (string line; getline(play, line); )
if (line.find("Cornelius") != string::npos)
cout << line << '\n';
You, good Cornelius, and you, Voltimand,
That’s only one match. Didn’t we find more than that?
Case-independence
const string home = getpwnam("cs253")->pw_dir;
ifstream play(home+"/pub/hamlet.txt");
for (string line; getline(play, line); )
if (line.find("Cornelius") != string::npos ||
line.find("cornelius") != string::npos ||
line.find("CORNELIUS") != string::npos)
cout << line << '\n';
CORNELIUS |
POLONIUS, LAERTES, VOLTIMAND, CORNELIUS, Lords,
You, good Cornelius, and you, Voltimand,
CORNELIUS |
[Exeunt VOLTIMAND and CORNELIUS]
[Re-enter POLONIUS, with VOLTIMAND and CORNELIUS]
[Exeunt VOLTIMAND and CORNELIUS]
Not satisfied
- That’s better but it’s not truly case-independent.
- What about “CorNELius”, or “cOrNeLiUs”?
- There are 2^9, or 512, combinations,
which would make the code quite tedious.
- As if the Bard of Avon would really write “CoRNelIUs”!
Regular expressions
const string home = getpwnam("cs253")->pw_dir;
ifstream play(home+"/pub/hamlet.txt");
const regex r("Cornelius"); // Create the pattern
for (string line; getline(play, line); )
if (regex_search(line, r)) // Search the line
cout << line << '\n';
You, good Cornelius, and you, Voltimand,
OK, but it’s not case-independent.
Regular expressions
const string home = getpwnam("cs253")->pw_dir;
ifstream play(home+"/pub/hamlet.txt");
const regex r("Cornelius", regex_constants::icase);
for (string line; getline(play, line); )
if (regex_search(line, r))
cout << line << '\n';
CORNELIUS |
POLONIUS, LAERTES, VOLTIMAND, CORNELIUS, Lords,
You, good Cornelius, and you, Voltimand,
CORNELIUS |
[Exeunt VOLTIMAND and CORNELIUS]
[Re-enter POLONIUS, with VOLTIMAND and CORNELIUS]
[Exeunt VOLTIMAND and CORNELIUS]
Dialects
You’ve all learned English; did you learn:
- American English?
- Canadian English?
- U.K. English?
- Australian English?
Well, same with regular expressions. There are dialects.
Regular expression dialects
The second argument to the regex ctor is a bitmask of flags.
regex_constants::icase indicates a case-independent pattern.
You can also specify the regular expression dialect:
Flag | Explanation |
regex_constants::ECMAScript | ECMAScript (JavaScript) (default) |
regex_constants::basic | Basic POSIX |
regex_constants::extended | Extended POSIX |
regex_constants::awk | Awk POSIX |
regex_constants::grep | Grep POSIX |
regex_constants::egrep | Egrep POSIX |
Not filename patterns
- These are not filename patterns.
- For example, * does not mean “anything” in a regexp.
It means “repeat what came before”.
- Filename patterns are used to match filenames.
- Regular expressions are used to match all sorts of data.
Components of regular expressions:
What | Description | What | Description |
. | any one char but \n | | (vertical bar) | alternation |
[a-fxy0-9] | any one of these | ( …) | grouping |
[^a-fxy0-9] | not one of these | \b | word boundary |
* | 0–∞ of previous | \d or \D | [0-9] or not (just one!) |
+ | 1–∞ of previous | \s or \S | [ \n\r…] or not |
? | 0–1 of previous | \w or \W | [0-9a-zA-Z_] or not |
{17} | 17 of previous | ^ | beginning of line |
{3,8} | 3–8 of previous | $ | end of line |
Match a number
Let’s try to match a number:
const regex r("[0-9]");
cout << boolalpha << regex_search("123", r) << '\n';
true
Hooray, it worked!
Well, perhaps a bit more testing might be worthwhile …
Match a number
Let’s try to match a number:
const regex r("[0-9]");
cout << boolalpha
<< regex_search("123", r) << '\n'
<< regex_search("ab45xy", r) << '\n'
<< regex_search("Jack", r) << '\n';
true
true
false
Testing—what a concept! Not very DRY, though.
Match a number
Let’s try to match a number:
const regex r("[0-9]");
for (auto s : {"123", "ab45xy", "Jack"})
cout << setw(10) << left << s
<< boolalpha << regex_search(s, r) << '\n';
123 true
ab45xy true
Jack false
OK, now it’s DRY. Why does ab45xy succeed?
Match a number
Add *:
const regex r("[0-9]*");
for (auto s : {"123", "ab45xy", "Jack"})
cout << setw(10) << left << s
<< boolalpha << regex_search(s, r) << '\n';
123 true
ab45xy true
Jack true
Huh, that got worse. Why did "Jack" succeed?
Match a number
Add +:
const regex r("[0-9]+");
for (auto s : {"123", "ab45xy", "Jack"})
cout << setw(10) << left << s
<< boolalpha << regex_search(s, r) << '\n';
123 true
ab45xy true
Jack false
At least we got rid of Jack.
Match a number
Anchored:
const regex r("^[0-9]+$");
for (auto s : {"123", "ab45xy", "Jack"})
cout << setw(10) << left << s
<< boolalpha << regex_search(s, r) << '\n';
123 true
ab45xy false
Jack false
Match a number
How about floating-point?
const regex r("^[0-9]+$");
for (auto s : {"123", "45.67", "ab45xy", "Jack"})
cout << setw(10) << left << s
<< boolalpha << regex_search(s, r) << '\n';
123 true
45.67 false
ab45xy false
Jack false
Match a number
Need to add the decimal point:
const regex r("^[0-9.]+$");
for (auto s : {"123", "45.67", "ab45xy", "Jack"})
cout << setw(10) << left << s
<< boolalpha << regex_search(s, r) << '\n';
123 true
45.67 true
ab45xy false
Jack false
Match a number
We might be too liberal, now:
const regex r("^[0-9.]+$");
for (auto s : {"123", "45.67", "78.", ".89", ".",
"127.0.0.1", "ab45xy", "Jack"})
cout << setw(10) << left << s
<< boolalpha << regex_search(s, r) << '\n';
123 true
45.67 true
78. true
.89 true
. true
127.0.0.1 true
ab45xy false
Jack false
Match a number
Let’s insist on digits point digits:
const regex r("^[0-9]+\\.[0-9]+$");
for (auto s : {"123", "45.67", "78.", ".89", ".",
"127.0.0.1", "ab45xy", "Jack"})
cout << setw(10) << left << s
<< boolalpha << regex_search(s, r) << '\n';
123 false
45.67 true
78. false
.89 false
. false
127.0.0.1 false
ab45xy false
Jack false
Why the double backslash?
Match a number
No, the parts should be optional:
const regex r("^[0-9]*\\.?[0-9]*$");
for (auto s : {"123", "45.67", "78.", ".89", ".",
"127.0.0.1", "ab45xy", "Jack"})
cout << setw(10) << left << s
<< boolalpha << regex_search(s, r) << '\n';
123 true
45.67 true
78. true
.89 true
. true
127.0.0.1 false
ab45xy false
Jack false
Match a number
Let’s stop hacking and design.
- Digits before or after . are optional, but a naked . is
bad, so here are the possibilities:
- digits
- digits . digits
- digits .
- . digits
- or:
- digits
- digits . optional-digits
- . digits
We express alternation with |.
Match a number
const regex r("^([0-9]+|[0-9]+\\.[0-9]*|\\.[0-9]+)$");
for (auto s : {"123", "45.67", "78.", ".89", ".",
"127.0.0.1", "ab45xy", "Jack"})
cout << setw(10) << left << s
<< boolalpha << regex_search(s, r) << '\n';
123 true
45.67 true
78. true
.89 true
. false
127.0.0.1 false
ab45xy false
Jack false
Match a number
Combine the first two cases:
const regex r("^([0-9]+(\\.[0-9]*)?|\\.[0-9]+)$");
for (auto s : {"123", "45.67", "78.", ".89", ".",
"127.0.0.1", "ab45xy", "Jack"})
cout << setw(10) << left << s
<< boolalpha << regex_search(s, r) << '\n';
123 true
45.67 true
78. true
.89 true
. false
127.0.0.1 false
ab45xy false
Jack false
Match a number
Let’s use \d instead of [0-9]:
const regex r("^(\\d+(\\.\\d*)?|\\.\\d+)$");
for (auto s : {"123", "45.67", "78.", ".89", ".",
"127.0.0.1", "ab45xy", "Jack"})
cout << setw(10) << left << s
<< boolalpha << regex_search(s, r) << '\n';
123 true
45.67 true
78. true
.89 true
. false
127.0.0.1 false
ab45xy false
Jack false
Match a number
Those double backslashes are hideous. Use a raw string:
const regex r(R"(^(\d+(\.\d*)?|\.\d+)$)");
for (auto s : {"123", "45.67", "78.", ".89", ".",
"127.0.0.1", "ab45xy", "Jack"})
cout << setw(10) << left << s
<< boolalpha << regex_search(s, r) << '\n';
123 true
45.67 true
78. true
.89 true
. false
127.0.0.1 false
ab45xy false
Jack false
Match a number
Should’ve used regex_match instead of regex_search;
regex_match matches the entire string.
Now we don’t need ^$ and the parentheses:
const regex r(R"(\d+(\.\d*)?|\.\d+)");
for (auto s : {"123", "45.67", "78.", ".89", ".",
"127.0.0.1", "ab45xy", "Jack"})
cout << setw(10) << left << s
<< boolalpha << regex_match(s, r) << '\n';
123 true
45.67 true
78. true
.89 true
. false
127.0.0.1 false
ab45xy false
Jack false
Change of topic
- Enough about matching numbers.
- Let’s do multiple matches on the same string.
- Let’s look for all of the contractions in a string.
- Define a contraction as letters apostrophe letters.
Match contractions
const string s = "Can't feed y'all before three o'clock!";
const regex r("[a-z]+'[a-z]+");
cout << boolalpha
<< regex_search(s, r) << '\n';
true
- Thanks—that was existential.
- Where are the contractions?
- I want a list!
Match contractions
const string s = "Can't feed y'all before three o'clock!";
const regex r("[a-z]+'[a-z]+");
sregex_iterator iter(s.begin(), s.end(), r);
sregex_iterator end;
for (; iter!=end; ++iter)
cout << iter->position() << ": " << iter->str() << '\n';
1: an't
11: y'all
30: o'clock
- Exactly what is iter iterating over?
- Hey, that first contraction is wrong.
Match contractions
Let’s add regex_constants::icase:
const string s = "Can't feed y'all before three o'clock!";
const regex r("[a-z]+'[a-z]+", regex_constants::icase);
sregex_iterator iter(s.begin(), s.end(), r);
sregex_iterator end;
for (; iter!=end; ++iter)
cout << iter->position() << ": " << iter->str() << '\n';
0: Can't
11: y'all
30: o'clock
That fixed Can't. But is [a-z] really the right way to say “a letter”?
Match contractions
- That [a-z] is a problem. What does it mean?
- It’s related to collating order (sorting order).
- What is the sorting order?
- Is it a…z, A…Z? A…Z, a…z? AaBbCc…XxYyZz? aAbBcC…xXyYzZ?
- Did your answer involve ASCII?
- What would Auður Ava Ólafsdóttir think of your answer?
Match contractions
const string s = "Can't feed y'all before three o'clock!";
const regex r("[[:alpha:]]+'[[:alpha:]]+"); // no more icase
sregex_iterator iter(s.begin(), s.end(), r);
sregex_iterator end;
for (; iter!=end; ++iter)
cout << iter->position() << ": " << iter->str() << '\n';
0: Can't
11: y'all
30: o'clock
Note that [[:alpha:]] is not [:alpha:].
There are two sets of square brackets.
See http://www.cplusplus.com/reference/regex/ECMAScript/
for [[:upper:]], [[:digit:]], [[:space:]],
and other such character classes.