Expression Lexing and Regular Expressions
- Introduce the topic of lexical analysis in a programming language such as Java.
- Develop a robust lexer that is successful regardless of the whitespace.
  - Should be able to parse both "(6 * a) + (b / 4)" and "(6*a)+(b/4)".
- Create a project called L14.
- Import ExpressionLexing-starter.jar into your L14 project.
Your directory should look like this:
L14/
└── src
    ├── Lexer.java
    └── TestCode.java
Lexical analysis is the first phase of a compiler. It involves taking a stream of characters and grouping them into tokens, discarding whitespace and comments along the way.
The Lexer has several different versions of a lexing method for identifying tokens within an expression.
- The first method is called scannerLexer, and uses a Scanner object.
- The second method is called splitLexer, which uses the method String.split().
Use the javadoc to implement the methods in the Lexer class.
HINT: After a token is returned from each of the different lexing methods, call the String.trim() method to remove extra whitespace from the beginning and end of the string. If the token is empty, do not add it to the ArrayList.
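As a rough illustration of the two approaches and the hint above, here is a minimal sketch. The class name LexerSketch, the regex constants, and the method signatures are assumptions made for this example; the javadoc in the starter Lexer.java defines what your methods must actually accept and return, and it may call for a different strategy. The split version cuts the string immediately before and after every operator or parenthesis, then trims each piece and skips the empty ones. The Scanner version asks the Scanner for the next substring that looks like a token.

```java
import java.util.ArrayList;
import java.util.Scanner;
import java.util.regex.Pattern;

// Hypothetical sketch class; the real signatures come from the Lexer javadoc.
class LexerSketch {
    // Zero-width split points immediately before or after an operator or parenthesis.
    private static final String BOUNDARY = "(?<=[-+*/()])|(?=[-+*/()])";

    // One token: a run of digits, a run of letters, or a single operator/parenthesis.
    private static final Pattern TOKEN = Pattern.compile("[0-9]+|[A-Za-z]+|[-+*/()]");

    static ArrayList<String> splitLexer(String expression) {
        ArrayList<String> tokens = new ArrayList<>();
        for (String piece : expression.split(BOUNDARY)) {
            String token = piece.trim();   // drop surrounding whitespace
            if (!token.isEmpty()) {        // skip pieces that were only whitespace
                tokens.add(token);
            }
        }
        return tokens;
    }

    static ArrayList<String> scannerLexer(String expression) {
        ArrayList<String> tokens = new ArrayList<>();
        try (Scanner scanner = new Scanner(expression)) {
            String token;
            // findInLine skips text that does not match (here, whitespace)
            // and returns null once no further token is found.
            while ((token = scanner.findInLine(TOKEN)) != null) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(splitLexer("(6 * a) + (b / 4)"));
        System.out.println(scannerLexer("(6*a)+(b/4)"));
    }
}
```

Both calls print [(, 6, *, a, ), +, (, b, /, 4, )], so the spaced and unspaced forms of the expression produce the same token list.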
This portion of the lab must be completed on a Linux machine (such as the lab machines).
Regular expressions are invaluable for pattern matching, filtering strings, and finding occurrences of phrases in large projects.
First, open a terminal and navigate to your Eclipse workspace for this semester.
cd ~/<path to your workspace>
Run the following command, and see what output you get.
grep -r -P --include="*.java" 'print(f|ln)?\(' .
Let’s break down this command.
- grep is a command which searches one or more files for lines which match a pattern, and prints each matching line.
- -r instructs grep to search recursively, including all files in subdirectories.
- -P instructs grep to interpret the pattern as a Perl-compatible regular expression.
- --include="*.java" instructs grep to include only files whose names end with the .java extension.
- 'print(f|ln)?\(' is the pattern grep searches for. It matches the string 'print', followed by either 'f' or 'ln' (the ? makes this group optional), followed by the string '('. The escape character '\' is necessary to match certain special characters, such as (, literally.
Now, run the following command to see how many lines of code you’ve written which include a print statement.
grep -r -P --include="*.java" 'print(f|ln)?\(' . | wc -l
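If you want to see what the pattern accepts without searching your whole workspace, the same regular expression can be tried in Java. The sketch below is only an illustration: the class name PrintPatternDemo and the sample lines are made up for this example. It keeps the lines that contain a match, the way grep does, and then counts them, the way wc -l does.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the lab's starter code.
class PrintPatternDemo {
    // Same pattern as the grep command; the backslash is doubled inside a Java string literal.
    static final Pattern PRINT_CALL = Pattern.compile("print(f|ln)?\\(");

    // Counts the lines that contain at least one match, like `grep ... | wc -l`.
    static long countMatchingLines(List<String> lines) {
        return lines.stream()
                .filter(line -> PRINT_CALL.matcher(line).find())
                .count();
    }

    public static void main(String[] args) {
        // Made-up sample lines for illustration only.
        List<String> sample = Arrays.asList(
                "System.out.println(\"hello\");",
                "System.out.printf(\"%d%n\", 42);",
                "System.out.print(x);",
                "String printer = \"laser\";",
                "// print the result");
        System.out.println(countMatchingLines(sample)); // prints 3
    }
}
```

The last two sample lines contain the word print but no opening parenthesis directly after it (or after f or ln), so they are not counted.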
Try running the first command, replacing the pattern with '[a-zA-Z]+[0-9]'. This should show you every line where you’ve referenced a variable that ends in a number, along with any other line in which letters are followed by a digit. [a-zA-Z] matches any character in the range a to z and A to Z. + signifies that one or more characters in this category must be present. [0-9] matches any character in the range 0 to 9. Think about how you could improve this pattern to match only instances where variables are called.
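To get a feel for what this pattern picks up before improving it, you can list the exact substrings it matches. The following is a small sketch with a made-up class name (LetterDigitDemo) and made-up sample lines. Note that grep prints the whole matching line, whereas this sketch prints only the matched text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the lab's starter code.
class LetterDigitDemo {
    static final Pattern LETTERS_THEN_DIGIT = Pattern.compile("[a-zA-Z]+[0-9]");

    // Returns every substring of the line that the pattern matches, left to right.
    static List<String> matchesIn(String line) {
        List<String> found = new ArrayList<>();
        Matcher matcher = LETTERS_THEN_DIGIT.matcher(line);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(matchesIn("int total2 = count1 + 5;")); // [total2, count1]
        System.out.println(matchesIn("String s = \"room42\";"));   // [room4]
        System.out.println(matchesIn("int total = 5;"));           // []
    }
}
```

The second sample line shows two weaknesses of the pattern: it matches text inside a string literal, and it stops after the first digit. Both are worth keeping in mind when you refine it.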
Using this same command format and modifying only the pattern, with the tutorials below for reference, write regular expressions to answer the following questions:
- How many times have you called the print command (not printf or println)?
- How many times have you written a single line comment (one that begins with //)?
  - How many times have you written a single line comment on the same line as a line of code (such as int foo = 5; //assign 5 to foo)?
- How many times have you written a for loop or a while loop?
  - How many times have you written a for loop or while loop and not included a space between for and the first (?
- How many times have you written a for each loop (such as for (Movie m : movies) {)?
- OPTIONAL BONUS: How many times have you written a for loop that used i as the incrementor (such as for (int i = 0; i < 10; i++) {)?
Use the following resources to learn more about Regular Expressions and to answer questions.
- Regex 101. Breaks down and explains a given expression. (Select python)
Submission
To receive credit for this recitation, show your TA or helper that your program passes TestCode, and show them your answers to the questions above along with the grep output you used to find them.