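grep takes its name from an editor command. In vi:
    :g/re/p
That is: globally search for a regular expression and print the matching lines.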
Wikipedia says:
Regular expressions describe regular languages in formal language theory. They have the same expressive power as regular grammars.
    % grep -i "Cornelius" ~cs253/pub/hamlet.txt
    CORNELIUS |
    POLONIUS, LAERTES, VOLTIMAND, CORNELIUS, Lords,
    You, good Cornelius, and you, Voltimand,
    CORNELIUS |
    [Exeunt VOLTIMAND and CORNELIUS]
    [Re-enter POLONIUS, with VOLTIMAND and CORNELIUS]
    [Exeunt VOLTIMAND and CORNELIUS]
    ifstream play("/s/bach/a/class/cs253/pub/hamlet.txt");
    string line;
    while (getline(play, line))
        if (line.find("Cornelius") != string::npos)
            cout << line << '\n';

    You, good Cornelius, and you, Voltimand,
That’s only one match. Didn’t we find more than that?
    ifstream play("/s/bach/a/class/cs253/pub/hamlet.txt");
    string line;
    while (getline(play, line))
        if (line.find("Cornelius") != string::npos ||
            line.find("cornelius") != string::npos ||
            line.find("CORNELIUS") != string::npos)
            cout << line << '\n';

    CORNELIUS |
    POLONIUS, LAERTES, VOLTIMAND, CORNELIUS, Lords,
    You, good Cornelius, and you, Voltimand,
    CORNELIUS |
    [Exeunt VOLTIMAND and CORNELIUS]
    [Re-enter POLONIUS, with VOLTIMAND and CORNELIUS]
    [Exeunt VOLTIMAND and CORNELIUS]
    ifstream play("/s/bach/a/class/cs253/pub/hamlet.txt");
    string line;
    const regex r("Cornelius");         // Create the pattern
    while (getline(play, line))
        if (regex_search(line, r))      // Search the line
            cout << line << '\n';

    You, good Cornelius, and you, Voltimand,
OK, but it’s not case-independent.
    ifstream play("/s/bach/a/class/cs253/pub/hamlet.txt");
    string line;
    const regex r("Cornelius", regex_constants::icase);
    while (getline(play, line))
        if (regex_search(line, r))
            cout << line << '\n';

    CORNELIUS |
    POLONIUS, LAERTES, VOLTIMAND, CORNELIUS, Lords,
    You, good Cornelius, and you, Voltimand,
    CORNELIUS |
    [Exeunt VOLTIMAND and CORNELIUS]
    [Re-enter POLONIUS, with VOLTIMAND and CORNELIUS]
    [Exeunt VOLTIMAND and CORNELIUS]
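To experiment without the course copy of hamlet.txt, the same idea can be sketched over an in-memory string. This is only an illustration: grep_lines is a hypothetical helper, not part of <regex>, and it reads lines from a string instead of a file.

```cpp
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: return every line of `text` that contains a match
// for `pattern`. When `ignore_case` is true, the pattern is compiled with
// regex_constants::icase, as in the example above.
std::vector<std::string> grep_lines(const std::string &text,
                                    const std::string &pattern,
                                    bool ignore_case) {
    std::regex_constants::syntax_option_type flags =
        std::regex_constants::ECMAScript;       // the default dialect
    if (ignore_case)
        flags |= std::regex_constants::icase;
    const std::regex r(pattern, flags);         // Create the pattern once

    std::vector<std::string> hits;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line))
        if (std::regex_search(line, r))         // Search each line
            hits.push_back(line);
    return hits;
}
```

Given a text holding the lines "CORNELIUS" and "You, good Cornelius, and you, Voltimand,", calling grep_lines(text, "Cornelius", false) should keep only the second line, while passing true should keep both.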
You’ve all learned English; did you learn:
Well, same with regular expressions. There are dialects.
| What | Description | What | Description |
|---|---|---|---|
| . | any one char but \n | \| | alternation |
| [a-fxy0-9] | any one of these | ( …) | grouping |
| [^a-fxy0-9] | not one of these | \b | word boundary |
| * | 0–∞ of previous | \d or \D | [0-9] or not |
| + | 1–∞ of previous | \s or \S | [ \n\r…] or not |
| ? | 0–1 of previous | \w or \W | [0-9a-zA-Z_] or not |
| {17} | 17 of previous | ^ | beginning of line |
| {3,8} | 3–8 of previous | $ | end of line |
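A few rows of the table can be tried out with a small wrapper. This is a sketch under the assumption that the default ECMAScript dialect of std::regex is in use; found is a hypothetical helper name, and the sample strings are made up for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `pattern` matches somewhere inside `text`,
// using the default (ECMAScript) dialect of std::regex.
bool found(const std::string &text, const std::string &pattern) {
    return std::regex_search(text, std::regex(pattern));
}

// Some calls to try, one per table row:
//   found("cat", "c.t")                   .     any one char
//   found("gray", "gr(a|e)y")             | ()  alternation inside a group
//   found("concatenate", R"(\bcat\b)")    \b    word boundary
//   found("123", R"(^\d{3}$)")            {3}   exactly three of previous
//   found("x_1", R"(^\w+$)")              \w    letters, digits, underscore
```

The raw-string form R"(...)" keeps the backslashes readable; it is the same pattern as the escaped form with doubled backslashes.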
Let’s try to match a number:
    const regex r("[0-9]");
    cout << boolalpha
         << regex_search("123", r) << '\n'
         << regex_search("abc123def", r) << '\n'
         << regex_search("Jack", r) << '\n';

    true
    true
    false
Add *:
    const regex r("[0-9]*");
    cout << boolalpha
         << regex_search("123", r) << '\n'
         << regex_search("abc123def", r) << '\n'
         << regex_search("Jack", r) << '\n';

    true
    true
    true
Huh, that got worse. Why did "Jack" succeed?
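One way to see what happened is to ask how long the match was. Since * allows zero repetitions, [0-9]* is satisfied by the empty string, and an empty match is available at the very start of any input. The sketch below uses a hypothetical helper, first_match_length, that is not part of the standard library.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: length of the first (leftmost) match of `pattern`
// in `text`, or -1 if regex_search finds no match at all.
long first_match_length(const std::string &text, const std::string &pattern) {
    const std::regex r(pattern);
    std::smatch m;                          // receives details of the match
    if (!std::regex_search(text, m, r))
        return -1;
    return static_cast<long>(m.length());   // length of the whole match
}
```

With "[0-9]*", both "Jack" and "abc123def" should report a length of 0: the search succeeds at position 0 by matching nothing. With "[0-9]+", "Jack" reports -1 and "abc123def" reports 3.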
Add +:
    const regex r("[0-9]+");
    cout << boolalpha
         << regex_search("123", r) << '\n'
         << regex_search("abc123def", r) << '\n'
         << regex_search("Jack", r) << '\n';

    true
    true
    false
At least we got rid of Jack.
Anchored:
    const regex r("^[0-9]+$");
    cout << boolalpha
         << regex_search("123", r) << '\n'
         << regex_search("abc123def", r) << '\n'
         << regex_search("Jack", r) << '\n';

    true
    false
    false
How about floating-point?
    const regex r("^[0-9]+$");
    cout << boolalpha
         << regex_search("123", r) << '\n'
         << regex_search("45.67", r) << '\n'
         << regex_search("abc123def", r) << '\n'
         << regex_search("Jack", r) << '\n';

    true
    false
    false
    false
Need to add the decimal point:
    const regex r("^[0-9.]+$");
    cout << boolalpha
         << regex_search("123", r) << '\n'
         << regex_search("45.67", r) << '\n'
         << regex_search("abc123def", r) << '\n'
         << regex_search("Jack", r) << '\n';

    true
    true
    false
    false
We might be too liberal, now:
    const regex r("^[0-9.]+$");
    cout << boolalpha
         << regex_search("123", r) << '\n'
         << regex_search("45.67", r) << '\n'
         << regex_search("78.", r) << '\n'
         << regex_search(".89", r) << '\n'
         << regex_search(".", r) << '\n'
         << regex_search("127.0.0.1", r) << '\n'
         << regex_search("abc123def", r) << '\n'
         << regex_search("Jack", r) << '\n';

    true
    true
    true
    true
    true
    true
    false
    false
Let’s insist on digits point digits:
    const regex r("^[0-9]+\\.[0-9]+$");
    cout << boolalpha
         << regex_search("123", r) << '\n'
         << regex_search("45.67", r) << '\n'
         << regex_search("78.", r) << '\n'
         << regex_search(".89", r) << '\n'
         << regex_search(".", r) << '\n'
         << regex_search("127.0.0.1", r) << '\n'
         << regex_search("abc123def", r) << '\n'
         << regex_search("Jack", r) << '\n';

    false
    true
    false
    false
    false
    false
    false
    false
Why the double backslash?
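The short answer is that two parsers read the pattern. The C++ compiler handles string-literal escapes first, so "\\." becomes the two characters \ and . in memory; the regex engine then reads \. as a literal period. A lone "\." is not a recognized C++ escape sequence and typically draws a compiler warning. The sketch below illustrates the difference; contains is a hypothetical helper and the sample strings are made up.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `pattern` matches somewhere inside `text`.
bool contains(const std::string &text, const std::string &pattern) {
    return std::regex_search(text, std::regex(pattern));
}

// What the regex engine actually receives:
//   "^1.5$"    ->  ^1.5$    the . matches any one character
//   "^1\\.5$"  ->  ^1\.5$   the \. matches only a literal period
const std::string any_char = "^1.5$";
const std::string literal_period = "^1\\.5$";
```

Here contains("1x5", any_char) should be true, because the unescaped . accepts the x, while contains("1x5", literal_period) should be false.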
No, the parts should be optional:
    const regex r("^[0-9]*\\.?[0-9]*$");
    cout << boolalpha
         << regex_search("123", r) << '\n'
         << regex_search("45.67", r) << '\n'
         << regex_search("78.", r) << '\n'
         << regex_search(".89", r) << '\n'
         << regex_search(".", r) << '\n'
         << regex_search("127.0.0.1", r) << '\n'
         << regex_search("abc123def", r) << '\n'
         << regex_search("Jack", r) << '\n';

    true
    true
    true
    true
    true
    false
    false
    false
The digits on either side of the . are optional, but a naked . is bad:
We express alternation with |.
    const regex r("^([0-9]+|[0-9]+\\.[0-9]*|[0-9]*\\.[0-9]+)$");
    cout << boolalpha
         << regex_search("123", r) << '\n'
         << regex_search("45.67", r) << '\n'
         << regex_search("78.", r) << '\n'
         << regex_search(".89", r) << '\n'
         << regex_search(".", r) << '\n'
         << regex_search("127.0.0.1", r) << '\n'
         << regex_search("abc123def", r) << '\n'
         << regex_search("Jack", r) << '\n';

    true
    true
    true
    true
    false
    false
    false
    false
Combine the first two cases:
    const regex r("^([0-9]+(\\.[0-9]*)?|[0-9]*\\.[0-9]+)$");
    cout << boolalpha
         << regex_search("123", r) << '\n'
         << regex_search("45.67", r) << '\n'
         << regex_search("78.", r) << '\n'
         << regex_search(".89", r) << '\n'
         << regex_search(".", r) << '\n'
         << regex_search("127.0.0.1", r) << '\n'
         << regex_search("abc123def", r) << '\n'
         << regex_search("Jack", r) << '\n';

    true
    true
    true
    true
    false
    false
    false
    false
Let’s use \d instead of [0-9]:
    const regex r("^(\\d+(\\.\\d*)?|\\d*\\.\\d+)$");
    cout << boolalpha
         << regex_search("123", r) << '\n'
         << regex_search("45.67", r) << '\n'
         << regex_search("78.", r) << '\n'
         << regex_search(".89", r) << '\n'
         << regex_search(".", r) << '\n'
         << regex_search("127.0.0.1", r) << '\n'
         << regex_search("abc123def", r) << '\n'
         << regex_search("Jack", r) << '\n';

    true
    true
    true
    true
    false
    false
    false
    false
Those double backslashes are hideous. Use a raw string:
    const regex r(R"(^(\d+(\.\d*)?|\d*\.\d+)$)");
    cout << boolalpha
         << regex_search("123", r) << '\n'
         << regex_search("45.67", r) << '\n'
         << regex_search("78.", r) << '\n'
         << regex_search(".89", r) << '\n'
         << regex_search(".", r) << '\n'
         << regex_search("127.0.0.1", r) << '\n'
         << regex_search("abc123def", r) << '\n'
         << regex_search("Jack", r) << '\n';

    true
    true
    true
    true
    false
    false
    false
    false
Should’ve used regex_match instead of regex_search; regex_match matches the entire string.
Now we don’t need ^ and $, or the outer parentheses:
    const regex r(R"(\d+(\.\d*)?|\d*\.\d+)");
    cout << boolalpha
         << regex_match("123", r) << '\n'
         << regex_match("45.67", r) << '\n'
         << regex_match("78.", r) << '\n'
         << regex_match(".89", r) << '\n'
         << regex_match(".", r) << '\n'
         << regex_match("127.0.0.1", r) << '\n'
         << regex_match("abc123def", r) << '\n'
         << regex_match("Jack", r) << '\n';

    true
    true
    true
    true
    false
    false
    false
    false
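The difference between the two functions shows up once the anchors are gone. A minimal sketch, using the same pattern as above; is_number and contains_number are hypothetical names chosen for illustration.

```cpp
#include <regex>
#include <string>

// Same pattern as above: digits with an optional fraction, or a
// fraction with optional leading digits. No ^, $, or outer parentheses.
static const std::regex number(R"(\d+(\.\d*)?|\d*\.\d+)");

// Whole-string test: the entire input must be a number.
bool is_number(const std::string &s) {
    return std::regex_match(s, number);
}

// Substring test: some part of the input looks like a number.
bool contains_number(const std::string &s) {
    return std::regex_search(s, number);
}
```

For "127.0.0.1" or "abc123def", is_number should be false while contains_number is true, since regex_search is happy with a match anywhere in the string. Both should reject "Jack" and a lone ".".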
    const string s = "Can't feed y'all before three o'clock!";
    const regex r("[a-z]+'[a-z]+");
    cout << boolalpha << regex_search(s, r) << '\n';

    true
    const string s = "Can't feed y'all before three o'clock!";
    const regex r("[a-z]+'[a-z]+");
    sregex_iterator iter(s.begin(), s.end(), r);
    sregex_iterator end;
    for (; iter!=end; ++iter)
        cout << iter->position() << ": " << iter->str() << '\n';

    1: an't
    11: y'all
    30: o'clock
Exactly what is iter iterating over?
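Each step of an sregex_iterator yields a match object (an smatch, that is, match_results for a std::string), one per non-overlapping match, found by repeated searching from where the previous match ended. A default-constructed sregex_iterator serves as the end marker. The sketch below collects those matches into a vector; all_matches is a hypothetical helper.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: collect (position, matched text) for every
// non-overlapping match of `r` in `s`.
std::vector<std::pair<long, std::string>>
all_matches(const std::string &s, const std::regex &r) {
    std::vector<std::pair<long, std::string>> out;
    const std::sregex_iterator end;             // default-constructed: end marker
    for (std::sregex_iterator it(s.begin(), s.end(), r); it != end; ++it) {
        const std::smatch &m = *it;             // each element is a match object
        out.emplace_back(static_cast<long>(m.position()), m.str());
    }
    return out;
}
```

Note that the string and the regex must outlive the iterator, so both are passed in by reference from the caller here.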
[a-z] is a problem. What does it mean?
    const string s = "Can't feed y'all before three o'clock!";
    const regex r("[[:alpha:]]+'[[:alpha:]]+");
    sregex_iterator iter(s.begin(), s.end(), r);
    sregex_iterator end;
    for (; iter!=end; ++iter)
        cout << iter->position() << ": " << iter->str() << '\n';

    0: Can't
    11: y'all
    30: o'clock
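The range [a-z] covers only the lowercase letters a through z, which is why the capital C was dropped earlier. A quick comparison of the character classes, as a sketch: whole_match is a hypothetical helper, and the default "C" locale is assumed, where [[:alpha:]] means the ASCII letters.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `pattern` matches all of `text`.
bool whole_match(const std::string &text, const std::string &pattern) {
    return std::regex_match(text, std::regex(pattern));
}

// Three ways to say "one or more letters":
const std::string lower_only = "[a-z]+";         // lowercase only
const std::string both_cases = "[A-Za-z]+";      // both cases, spelled out
const std::string alpha      = "[[:alpha:]]+";   // named character class
```

With these, whole_match("Can", lower_only) should be false, while the other two patterns accept "Can". None of them accepts "Can7", since a digit is not a letter.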