CS253 Regular Expressions
    % grep "Osric" ~cs253/pub/hamlet.txt
    Osric, who brings back to him that you attend him in
    KING CLAUDIUS Give them the foils, young Osric. Cousin Hamlet,
    LAERTES Why, as a woodcock to mine own springe, Osric;
    % grep -c "Osric" ~cs253/pub/hamlet.txt
    3
    % grep -ic "Osric" ~cs253/pub/hamlet.txt
    32
    ifstream play("/s/bach/a/class/cs253/pub/hamlet.txt");
    string line;
    while (getline(play, line))
        if (line.find("Osric") != string::npos)
            cout << line << '\n';

    Osric, who brings back to him that you attend him in
    KING CLAUDIUS Give them the foils, young Osric. Cousin Hamlet,
    LAERTES Why, as a woodcock to mine own springe, Osric;
That’s only three matches. Didn’t we find 32 matches?
    ifstream play("/s/bach/a/class/cs253/pub/hamlet.txt");
    string line;
    while (getline(play, line))
        if (line.find("Osric") != string::npos ||
            line.find("osric") != string::npos ||
            line.find("OSRIC") != string::npos)
            cout << line << '\n';

    OSRIC |
    [Enter OSRIC]
    OSRIC Your lordship is right welcome back to Denmark.
    OSRIC Sweet lord, if your lordship were at leisure, I
    OSRIC I thank your lordship, it is very hot.
    OSRIC It is indifferent cold, my lord, indeed.
    OSRIC Exceedingly, my lord; it is very sultry,--as
    OSRIC Nay, good my lord; for mine ease, in good faith.
    OSRIC Your lordship speaks most infallibly of him.
    OSRIC Sir?
    OSRIC Of Laertes?
    OSRIC I know you are not ignorant--
    OSRIC You are not ignorant of what excellence Laertes is--
    OSRIC I mean, sir, for his weapon; but in the imputation
    OSRIC Rapier and dagger.
    OSRIC The king, sir, hath wagered with him six Barbary
    OSRIC The carriages, sir, are the hangers.
    OSRIC The king, sir, hath laid, that in a dozen passes
    OSRIC I mean, my lord, the opposition of your person in trial.
    OSRIC Shall I re-deliver you e'en so?
    OSRIC I commend my duty to your lordship.
    [Exit OSRIC]
    Osric, who brings back to him that you attend him in
    Lords, OSRIC, and Attendants with foils, &c]
    KING CLAUDIUS Give them the foils, young Osric. Cousin Hamlet,
    OSRIC Ay, my good lord.
    OSRIC A hit, a very palpable hit.
    OSRIC Nothing, neither way.
    OSRIC Look to the queen there, ho!
    OSRIC How is't, Laertes?
    LAERTES Why, as a woodcock to mine own springe, Osric;
    OSRIC Young Fortinbras, with conquest come from Poland,
That’s better but it’s not truly case-independent. What about “OsRiC”, or “oSRIc”? There are 2⁵, or 32, combinations, making the code quite tedious:
    if (line.find("osric") != string::npos ||
        line.find("osriC") != string::npos ||
        line.find("osrIc") != string::npos ||
        line.find("osrIC") != string::npos ||
        line.find("osRic") != string::npos ||
        line.find("osRiC") != string::npos ||
        line.find("osRIc") != string::npos ||
        line.find("osRIC") != string::npos ||
        line.find("oSric") != string::npos ||
        line.find("oSriC") != string::npos ||
        line.find("oSrIc") != string::npos ||
        line.find("oSrIC") != string::npos ||
        line.find("oSRic") != string::npos ||
        line.find("oSRiC") != string::npos ||
        line.find("oSRIc") != string::npos ||
        line.find("oSRIC") != string::npos ||
        line.find("Osric") != string::npos ||
        line.find("OsriC") != string::npos ||
        line.find("OsrIc") != string::npos ||
        line.find("OsrIC") != string::npos ||
        line.find("OsRic") != string::npos ||
        line.find("OsRiC") != string::npos ||
        line.find("OsRIc") != string::npos ||
        line.find("OsRIC") != string::npos ||
        line.find("OSric") != string::npos ||
        line.find("OSriC") != string::npos ||
        line.find("OSrIc") != string::npos ||
        line.find("OSrIC") != string::npos ||
        line.find("OSRic") != string::npos ||
        line.find("OSRiC") != string::npos ||
        line.find("OSRIc") != string::npos ||
        line.find("OSRIC") != string::npos)
    ifstream play("/s/bach/a/class/cs253/pub/hamlet.txt");
    string line;
    const regex r("Osric");             // Create the pattern
    while (getline(play, line))
        if (regex_search(line, r))      // Search the line
            cout << line << '\n';

    Osric, who brings back to him that you attend him in
    KING CLAUDIUS Give them the foils, young Osric. Cousin Hamlet,
    LAERTES Why, as a woodcock to mine own springe, Osric;
OK, but it’s not case-independent.
    ifstream play("/s/bach/a/class/cs253/pub/hamlet.txt");
    string line;
    const regex r("Osric", regex_constants::icase);
    while (getline(play, line))
        if (regex_search(line, r))
            cout << line << '\n';

    OSRIC |
    [Enter OSRIC]
    OSRIC Your lordship is right welcome back to Denmark.
    OSRIC Sweet lord, if your lordship were at leisure, I
    OSRIC I thank your lordship, it is very hot.
    OSRIC It is indifferent cold, my lord, indeed.
    OSRIC Exceedingly, my lord; it is very sultry,--as
    OSRIC Nay, good my lord; for mine ease, in good faith.
    OSRIC Your lordship speaks most infallibly of him.
    OSRIC Sir?
    OSRIC Of Laertes?
    OSRIC I know you are not ignorant--
    OSRIC You are not ignorant of what excellence Laertes is--
    OSRIC I mean, sir, for his weapon; but in the imputation
    OSRIC Rapier and dagger.
    OSRIC The king, sir, hath wagered with him six Barbary
    OSRIC The carriages, sir, are the hangers.
    OSRIC The king, sir, hath laid, that in a dozen passes
    OSRIC I mean, my lord, the opposition of your person in trial.
    OSRIC Shall I re-deliver you e'en so?
    OSRIC I commend my duty to your lordship.
    [Exit OSRIC]
    Osric, who brings back to him that you attend him in
    Lords, OSRIC, and Attendants with foils, &c]
    KING CLAUDIUS Give them the foils, young Osric. Cousin Hamlet,
    OSRIC Ay, my good lord.
    OSRIC A hit, a very palpable hit.
    OSRIC Nothing, neither way.
    OSRIC Look to the queen there, ho!
    OSRIC How is't, Laertes?
    LAERTES Why, as a woodcock to mine own springe, Osric;
    OSRIC Young Fortinbras, with conquest come from Poland,
| What | Description | What | Description |
|---|---|---|---|
| . | any one char but \n | \| | alternation |
| [a-fxy0-9] | any one of these | ( …) | grouping |
| [^a-fxy0-9] | not one of these | \b | word boundary |
| * | 0–∞ of previous | \d or \D | [0-9] or not |
| + | 1–∞ of previous | \s or \S | [ \n\r…] or not |
| ? | 0–1 of previous | \w or \W | [0-9a-zA-Z_] or not |
| {17} | 17 of previous | ^ | beginning of line |
| {3,8} | 3–8 of previous | $ | end of line |
Let’s try to match a number:
    const regex r("[0-9]");
    cout << boolalpha
         << regex_search("123", r) << '\n'
         << regex_search("abc123def", r) << '\n'
         << regex_search("Jack", r) << '\n';

    true
    true
    false
Add *:
    const regex r("[0-9]*");
    cout << boolalpha
         << regex_search("123", r) << '\n'
         << regex_search("abc123def", r) << '\n'
         << regex_search("Jack", r) << '\n';

    true
    true
    true
Huh, that got worse: * also accepts zero digits, so the empty string at the start of "Jack" counts as a match.
Add +:
    const regex r("[0-9]+");
    cout << boolalpha
         << regex_search("123", r) << '\n'
         << regex_search("abc123def", r) << '\n'
         << regex_search("Jack", r) << '\n';

    true
    true
    false
Anchored:
    const regex r("^[0-9]+$");
    cout << boolalpha
         << regex_search("123", r) << '\n'
         << regex_search("abc123def", r) << '\n'
         << regex_search("Jack", r) << '\n';

    true
    false
    false
How about floating-point?
    const regex r("^[0-9]+$");
    cout << boolalpha
         << regex_search("123", r) << '\n'
         << regex_search("45.67", r) << '\n'
         << regex_search("abc123def", r) << '\n'
         << regex_search("Jack", r) << '\n';

    true
    false
    false
    false
Need to add the decimal point:
    const regex r("^[0-9.]+$");
    cout << boolalpha
         << regex_search("123", r) << '\n'
         << regex_search("45.67", r) << '\n'
         << regex_search("abc123def", r) << '\n'
         << regex_search("Jack", r) << '\n';

    true
    true
    false
    false
We might be too liberal, now:
    const regex r("^[0-9.]+$");
    cout << boolalpha
         << regex_search("123", r) << '\n'
         << regex_search("45.67", r) << '\n'
         << regex_search("78.", r) << '\n'
         << regex_search(".89", r) << '\n'
         << regex_search(".", r) << '\n'
         << regex_search("127.0.0.1", r) << '\n'
         << regex_search("abc123def", r) << '\n'
         << regex_search("Jack", r) << '\n';

    true
    true
    true
    true
    true
    true
    false
    false
Let’s insist on digits point digits:
    const regex r("^[0-9]+\\.[0-9]+$");
    cout << boolalpha
         << regex_search("123", r) << '\n'
         << regex_search("45.67", r) << '\n'
         << regex_search("78.", r) << '\n'
         << regex_search(".89", r) << '\n'
         << regex_search(".", r) << '\n'
         << regex_search("127.0.0.1", r) << '\n'
         << regex_search("abc123def", r) << '\n'
         << regex_search("Jack", r) << '\n';

    false
    true
    false
    false
    false
    false
    false
    false
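Why the double backslash? In a C++ string literal, \\ stands for a single backslash, so the regex engine sees \. and treats the dot as a literal character.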
No, the parts should be optional:
    const regex r("^[0-9]*\\.?[0-9]*$");
    cout << boolalpha
         << regex_search("123", r) << '\n'
         << regex_search("45.67", r) << '\n'
         << regex_search("78.", r) << '\n'
         << regex_search(".89", r) << '\n'
         << regex_search(".", r) << '\n'
         << regex_search("127.0.0.1", r) << '\n'
         << regex_search("abc123def", r) << '\n'
         << regex_search("Jack", r) << '\n';

    true
    true
    true
    true
    true
    false
    false
    false
The digits on either side of the . are optional, but a naked . is bad:
We express alternation with |.
    const regex r("^([0-9]+|[0-9]+\\.[0-9]*|[0-9]*\\.[0-9]+)$");
    cout << boolalpha
         << regex_search("123", r) << '\n'
         << regex_search("45.67", r) << '\n'
         << regex_search("78.", r) << '\n'
         << regex_search(".89", r) << '\n'
         << regex_search(".", r) << '\n'
         << regex_search("127.0.0.1", r) << '\n'
         << regex_search("abc123def", r) << '\n'
         << regex_search("Jack", r) << '\n';

    true
    true
    true
    true
    false
    false
    false
    false
Combine the first two cases:
    const regex r("^([0-9]+(\\.[0-9]*)?|[0-9]*\\.[0-9]+)$");
    cout << boolalpha
         << regex_search("123", r) << '\n'
         << regex_search("45.67", r) << '\n'
         << regex_search("78.", r) << '\n'
         << regex_search(".89", r) << '\n'
         << regex_search(".", r) << '\n'
         << regex_search("127.0.0.1", r) << '\n'
         << regex_search("abc123def", r) << '\n'
         << regex_search("Jack", r) << '\n';

    true
    true
    true
    true
    false
    false
    false
    false
Let’s use \d instead of [0-9]:
    const regex r("^(\\d+(\\.\\d*)?|\\d*\\.\\d+)$");
    cout << boolalpha
         << regex_search("123", r) << '\n'
         << regex_search("45.67", r) << '\n'
         << regex_search("78.", r) << '\n'
         << regex_search(".89", r) << '\n'
         << regex_search(".", r) << '\n'
         << regex_search("127.0.0.1", r) << '\n'
         << regex_search("abc123def", r) << '\n'
         << regex_search("Jack", r) << '\n';

    true
    true
    true
    true
    false
    false
    false
    false
Those double backslashes are hideous. Use a raw string:
    const regex r(R"(^(\d+(\.\d*)?|\d*\.\d+)$)");
    cout << boolalpha
         << regex_search("123", r) << '\n'
         << regex_search("45.67", r) << '\n'
         << regex_search("78.", r) << '\n'
         << regex_search(".89", r) << '\n'
         << regex_search(".", r) << '\n'
         << regex_search("127.0.0.1", r) << '\n'
         << regex_search("abc123def", r) << '\n'
         << regex_search("Jack", r) << '\n';

    true
    true
    true
    true
    false
    false
    false
    false
Should’ve used regex_match instead of regex_search; regex_match matches the entire string.
Now we don’t need ^$ and the parentheses:
    const regex r(R"(\d+(\.\d*)?|\d*\.\d+)");
    cout << boolalpha
         << regex_match("123", r) << '\n'
         << regex_match("45.67", r) << '\n'
         << regex_match("78.", r) << '\n'
         << regex_match(".89", r) << '\n'
         << regex_match(".", r) << '\n'
         << regex_match("127.0.0.1", r) << '\n'
         << regex_match("abc123def", r) << '\n'
         << regex_match("Jack", r) << '\n';

    true
    true
    true
    true
    false
    false
    false
    false
    const string s = "I can't feed y'all before three o'clock!";
    const regex r("[a-z]+'[a-z]+");
    cout << boolalpha << regex_search(s, r) << '\n';

    true
That was useless. Where are the contractions? I want a list!
    const string s = "I can't feed y'all before three o'clock!";
    const regex r("[a-z]+'[a-z]+");
    sregex_iterator iter(s.begin(), s.end(), r);
    sregex_iterator end;
    for (; iter!=end; ++iter)
        cout << iter->position() << ": " << iter->str() << '\n';

    2: can't
    13: y'all
    32: o'clock
Exactly what is iter iterating over?
Modified: 2017-04-24T14:32