CS 270 Programming Assignment #4
Due Tuesday October 28th (by 11:59pm),
Unique LC3 assembly program due to mailing list by Sunday October 26th (by 11:59pm)
Introduction
Programming assignments are
to be done individually. This programming assignment does NOT
include additional homework questions.
For this assignment, you will be implementing and testing
parts of an LC3 assembler.
Your assembler will take as input
LC3 assembly code and generate as output the LC3 machine code
that was used as input to the LC3sim program developed for
PA3. The following is an example of how the two programs will
be used together:
% ./LC3assem file.asm file.LC3
% ./LC3sim file.LC3
For testing, you will be using unit testing and regression testing.
The goals for this assignment are as follows:
- Learn what a symbol table is and how to use it to do a multi-pass
translation of LC3 assembly code to LC3 machine code.
- Learn and use a regression testing technique.
- Write some simple programs in LC3 assembly language.
- Learn about programming with C strings.
- Learn about programming with singly-linked lists in C.
- Learn the C system calls malloc() and strcmp().
The Assignment
The assignment is composed of two implementation pieces: (1) writing a symbol table routine
for looking up a label so as to find its corresponding address and
(2) writing the actual assembler passes.
There is also an important testing component. You will need to use
some regression testing to check that your assembler generates the same
error messages as the LC3assem.public executable. Also, the machine code
you generate must be able to run through LC3sim.public. To help everyone test,
each student must
submit one unique example
LC3 assembly program to the mailing list by October 26th,
11:59pm.
Requirements for the test LC3 assembly program
- It must be submitted via the class mailing list.
- It must be different than any previous test cases sent to the mailing list.
- The test case must either cause the LC3assem.public executable to generate
an error or the resulting machine code file must run with LC3sim.public without causing an infinite loop. In other words, LC3sim.public must stop within one minute on
the generated machine code.
If the test LC3 assembly program is submitted before noon on Sunday October 26th, then we will attempt to let you know whether it is an acceptable test
case before 10pm the 26th.
Part 1: The Symbol Table
The symbol table allows you to map strings representing labels (for
example, "START") to addresses (for example, 0x3000).
The symbol table is a specific example of a dictionary data structure.
You will be given the implementation for the following function:
sym_node_t* sym_table_insert(sym_node_t *st, char* sym, int addr)
This function adds label sym with the address addr to the
symbol table. sym is the key and addr is the value associated
with that key.
Implement the following function in a separate file, sym_table_lookup.c.
int sym_table_lookup(sym_node_t *st, char* sym)
This function returns the address associated with symbol sym in the
symbol table.
The prototypes for all of the functions in the assignment
are in the header files LC3assem and sym_table.h.
It is much easier to unit test sym_table_insert and sym_table_lookup together
by inserting symbols into the symbol table and verifying that they can
be found. Therefore, you need only use one unit test called
test_sym.c, which tests both of the symbol table functions.
test_sym.c is given to you, but you must add it to the Makefile after
implementing and adding sym_table_lookup.c to the Makefile.
Symbol tables can be implemented in many ways (hash tables, linked
lists, arrays, etc.), but we will use a binary tree representation.
Each node in the tree is represented via the structure below.
typedef struct sym_node_struct sym_node_t;
struct sym_node_struct {
char* symbol;
int address;
sym_node_t* left;
sym_node_t* right;
};
The symbol field points to an array of characters (string)
representing the symbol. The address field is the address
in simulated memory for
the symbol. The left and right fields are pointers to
the left and right sym_node_t child structures in the tree
(NULL if child does not exist). The "root" of the tree is pointed to by
a local variable in LC3assem.c, sym_table_ptr, which is initially NULL:
sym_node_t* sym_table_ptr = NULL; // ptr to empty symbol table
To provide fast lookup of symbols, we want to build the tree such that
given any node in the tree, all labels reachable from the left child
are alphabetically less than the label of the node ("car" is
alphabetically before "dog", but it is after "ant"). Similarly, all
labels reachable from the right child are alphabetically greater than
the label of the node. The strcmp(char* s1, char* s2)
function from the C standard library does just such an alphabetical
comparison. The function strcmp(char* s1, char* s2) returns
a 0 if the strings are the same, a number greater than 0 if s1 is
alphabetically after s2, and a number less than 0 if s1 is
alphabetically before s2.
To complete the symbol table implementation, implement the following
function:
- int sym_table_lookup(sym_node_t *st, char* sym): This function
traverses the tree based on the symbol string it is passed. When the
string is found in a node, the value of the address field is
returned. If a non-matching childless node is encountered, the symbol
does not appear in the symbol table. If symbol SYMBOL cannot
be found, print exactly the following error message (so our
testing scripts can detect it) followed by a new line and exit (to exit
a C program, using the following function call: exit(1);).
ERROR: symbol not found (SYMBOL)
SYMBOL is the symbol that could not be found.
The pseudocode:
initialize a node pointer to g_symboltable_ptr
while the node pointer is not NULL do:
use strcmp() to compare sym and the symbol field of structure pointed to by
the node pointer
if the result of the comparison is less than zero:
set the node pointer to be the left pointer of the structure pointed to
by the node pointer
otherwise, if the result of the comparison is greater than zero:
set the node pointer to be the right pointer of the structure pointed to
by the node pointer
otherwise:
the result of the function is the address field of the structure pointed
to by the node pointer
If the symbol was not found, print an error message (using fprintf) and then exit(1)
Our C implementation of this function is less than 20 lines long.
See the implementation of sym_table_insert() to see an example of
how to traverse the binary tree.
Part 2: The Assembler
You will be given the implementation for the following functions:
insn_t* parse(FILE *f)
int insert_field(int bits, int new_bits, int hi_bit, int lo_bit)
int next_address(insn_t* insn_ptr, int current_address)
void write_instr_to_file(FILE * fptr, int address, int word, char * comment)
void pass_one(insn_t* insn_ptr, sym_node_t** sym_table_ptr_ptr)
Adds symbols to symbol table.
void dump(FILE* stream, insn_t* starting_insn_ptr)
Prints all of the instructions in terms of their structure details
to the given stream. Useful for debugging.
Implement the following functions, each in a separate file named based on the function (e.g. pass_two.c).
void pass_two(insn_t* insn_ptr, sym_node_t* sym_table_ptr)
Resolves label references using symbol table.
void pass_three(FILE *output_file, insn_t* insn_ptr)
Actually generates machine code instructions and
writes them to the output file.
Recall from Chapter 7 that assemblers make multiple passes over the
program. Our assembler will make several passes over the program (but
one could combine some of these passes):
- Pass 0: parse the assembly program into a linked list,
- Pass 1: add symbols to symbol table,
- Pass 2: resolve label references using symbol table, and
- Pass 3: actually generate machine code instructions.
Pass 1 and 2 require the symbol table that you completed in Part 1
of the assignment.
Pass #0: Parsing Instructions into a List
Parsing is quite complex, so we provide this code (in
parser.c and parser.h). Feel free to examine this
code if you like. Although we provide the parser, you still need to
understand the structure into which instructions are placed. The
program is parsed into a linked list where each node of the list is an
instance of the structure below.
enum opcodes_t {
// LC3 opcodes
BR=0, ADD=1, LD=2, ST=3, JSR_JSRR=4, AND=5, LDR=6, STR=7,
RTI=8, NOT=9, LDI=10, STI=11, JMP_RET=12, LEA=14, TRAP=15,
// assembler directives
ORIG, FILL, BLKW, STRINGZ, HALT,
// no operation
UNDEF
};
typedef struct insn_struct insn_t;
struct insn_struct {
int address; // address where this instruction should be placed
enum opcodes_t opcode; // the opcode of instruction (see above)
char* label; // label on this line (can be NULL)
int first_reg; // 1st reg number (-1 implies there isn't one)
int second_reg; // 2nd reg number (-1 implies there isn't one)
int third_reg; // 3rd reg number (-1 implies there isn't one)
int immediate; // immediate value
int offset; // offset value
char* label_ref; // label referenced by BR/LD/ST/JSR/LDI/STI/LEA/FILL
char* string; // string for STRINGZ directive
int n; // neg bit (only for BR)
int z; // zero bit (only for BR)
int p; // pos bit (only for BR)
insn_t* next; // pointer to next instruction in list
};
The fields are:
- The address field indicates the address in memory at which
this instruction will be placed.
- The opcode field indicates what kind of instruction this
structure represents (e.g., ADD, AND, etc.).
Opcodes values must come from the opcodes_t enumeration.
Note that the first 16 opcodes correspond directly to actually LC-3
opcode values. For example, the opcode of BR is 0x0, ADD is 0x1, LD is
0x2, etc.
The remaining opcodes are for assembler directive
(i.e., they do not correspond to actual instructions). Note the
JSR_JSRR and JMP_RET enum values. We have named these in this way
because, for example, JSR and JSRR both have the same opcode encoding
in an LC-3 instruction. UNDEF is a special opcode meaning the
structure does not correspond to an instruction at all. This is used
when a label appears on a line without an instruction.
- The label field is a pointer to a null-terminated array
of characters (i.e., a C-style string) representing the label
associated with this line. The value of this pointer is NULL for
instructions that are not preceded by a label.
- The first_reg, second_reg, and
third_reg fields specify the register operands for the
instruction. Instructions that require fewer than 3 registers will
have -1 in the unused registers. For example, if an instruction
requires on two register operands
(e.g., NOT), third_reg will be -1.
- The immediate field is used to hold
immediate values. For example,
an ADD immediate instruction will have an immediate value in
this field.
This field is 0 for instructions that don't require an
immediate, but you can't conclude this by looking at the field because
an instruction might have an immediate with a value of 0. For
example, to determine whether the third operand of
an ADD or AND instruction is a
register or immediate, examine the value of third_reg; if it
is -1, use the immediate in immediate.
To accomodate assembly programs like the one in exercise 9.5 in the book,
instructions such as ST, LD, etc. that typically have a LABEL operand can
instead have a constant address as an operand. That constant address will
be stored in the immediate field.
- The offset field is used to hold offsets. For example,
an LD instruction will have a PCoffset in it.
The offsets for LDR and STR are determined by the parser.
Anything with a PCoffset has its offset calculated in pass_two().
- The label_ref field is a pointer to a null-terminated
array of characters that represents the label referenced by the
instruction (e.g., LABEL1 in "BR LABEL1"). For
instructions not using a label, this pointer will be NULL.
- The string field is a pointer to a null-terminated array of
characters that represents the string operand to the .stringz
directive. It has a NULL value for all instructions except STRINGZ.
- The n, z, and p fields specify the n,
z, and p fields in the BR instruction (1 indicates the
corresponding bit is set in the instruction). These fields are
obviously meaningless for non-BR instructions.
- The next field is a pointer to the next insn_t
structure, representing the next instruction in the program. The next field
has the value NULL if this is the last instruction in the input file.
Passes #1 and #2: Inserting and Looking Up Labels into the Symbol Table
Passes 1 and 2 use a symbol table to store the address that corresponds to all
of the labels in the program. Pass 1 iterates over all the instructions in the
linked list, looking at the label field and adding label/address pairs
to the symbol table. Pass 2 iterates over all the instructions again, but this
time it looks up all label_ref fields in the symbol table and uses the
address of the current instruction and the address of the label to compute the
correct value of offset. After pass 2,
the label_ref field is unused.
Pass 2 also changes constant address immediates in instructions like BR or ST
to offsets (see exercise 9.5 in the book as an example).
Pass 1 is given and can be found in pass_one.c.
You'll need to build pass_two():
- void pass_two(insn_t* insn_ptr, sym_node_t* sym_table_ptr): This function looks
at each instruction for a non-NULL label_ref field. If the
label_ref is non-NULL, the label_ref is looked up in
the symbol table to get the address associate with the label. If the
opcode is FILL, this address becomes the
immediate field. Otherwise, we must compute the
PC-relative offset as follows: address associated with symbol in symbol
table - (current_pc + 1). This value is then stored in
offset. This function will need to call the
sym_table_lookup() function. If there is no label_ref,
the opcode is in the set {BR, JSR_JSRR, LD, LDI, LEA, ST, STI }, and
there is a non-zero immediate, then the immediate
represents a constant address and the PC-relative offset needs to be
computed as follows: immediate - (current_pc + 1).
Pass #3: Instruction Generation
After pass 1 and 2, the list of insn_t structures contains all
the information necessary to generate encoded LC-3 instructions.
Instruction encoding consists of taking the information in the
insn_t structure and encoding it in a single 16-bit value.
We'll provide some auxiliary routines to make this
task more manageable.
- void write_instr_to_file(FILE * fptr, int address, int word, char * comment): This function
(which we provide for you) outputs the 16-bit binary representation of
a single instruction to the output file. It outputs the machine
instruction in the format expected by LC3sim.
- int insert_field(int bits, int new_bits, int hi_bit, int lo_bit): This function (also provided for you) takes a value
(bits) and replaces the bits from hi_bit
to lo_bit with the value in new_bits and returns
that value. For example, to encode an instruction opcode we could do
something like the following:
new_insn = insert_field(old_insn, insn_ptr->opcode, 15, 12);
Similarly, to encode the destination register in an ADD
instruction we would do the following:
new_insn = insert_field(old_insn, insn_ptr->first_reg, 11, 9);
You'll need to complete
the pass_three() function described below.
- void pass_three(FILE *output_file, insn_t* insn_ptr): We've given you
the basic structure of this pass in the pass_three()
function. This function consists principally of a switch statement,
and there is a case for each opcode. Within each case you will uses
the fields of the insn_t structure and insert_field()
to build an instruction, which you will then output to a file
with write_word_to_file(). We have completed
the ADD case for you.
Note that your assembler should print nothing to the display unless an
error is encountered. If an error is encountered, the output from your
LC3assem executable should be the same as the LC3assem.public executable,
which can be found in ~cs270/public/.
Regression Testing
Do some googling to learn about regression testing. We will
also be discussing it some in class. The main idea of regression
testing is to set up a set of tests that you run everytime
you make a change to your program. For this assignment, you
have unit tests that test specific functions, but a regression
test will test the whole assembler. Often a regression test
compares the current output with an output that is known to be
correct. To do this within the context of a course, we will
be using a reference executable LC3assem.public. LC3assem.public
can be found in ~cs270/public/. The output of your LC3assem executable
should be the same as that of LC3assem.public. Here is an example
of how you can compare your LC3asem output to that of LC3assem.public.
// running the oracle
% ./LC3assem.public TestCases/question-7.5a.asm t-public.LC3 >& output.public
% ./LC3sim.public t-public.LC3 >& sim-out.public
// running your LC3assem
% ./LC3assem TestCases/question-7.5a.asm t.LC3 >& output.mine
% ./LC3sim.public t.LC3 >& sim-out.mine
// diff will return nothing if the results are the same
% diff output.public output.mine
% diff sim-out.public sim-out.mine
There may be differences between the comments generated to the machine code file
by LC3assem.public and your LC3assem, but the output generated to stdout and stderr should be the same. Also, the results of simulating the machine code should be the same.
Extra credit
For 10 extra credit points learn how to do some shell scripting to
automate regression testing. We are not going to provide a strict specification
for this. Your script should make it easy to put a bunch of LC3 assembly files into a subdirectory and run the LC3assem.public and your LC3assem and compare results.
Write your own script, because copying someone else's script
on extra credit is still against the academic integrity policy.
Put your shell script in your PA4 subdirectory and call it PA4-regress.
Debugging
You will certainly need to use appropriate tools to debug your code
(staring at the source code will only damage your eyesight!).
Notice the debug flag in LC3assem.c. Turn that flag on to help
with debugging and turn it off when comparing your LC3assem results
with those generated by LC3assem.public.
The GDB
debugger will be very useful. You can use it to determine exactly
what statement caused a segmentation fault. You can also set break
points and step through the program's execution.
You will also want to use the Valgrind memory check. Valgrind can help
identify all kinds of memory errors such as the use of uninitialized
memory, dangling pointers, wild pointers, corrupting the stack, memory
leaks, and more. To run your program with Valgrind:
% valgrind LC3assem test.asm test.LC3
When errors are encountered, Valgrind prints them to the display.
They are mostly self explanatory, but you may want to refer to the
following sources of more information on Valgrind:
Extra Credit: Bug Reports
Since we are not exhaustively testing our functions for this assignment,
there will probably be bugs in the provided code.
If you are the first student to submit a valid bug report for a particular bug
to the
instructor and/or the TAs, then you will receive 5 extra points for this
assignment. A valid bug report must include detailed instructions on how to reproduce the bug. For example,
the code for a
test within a unit test driver that causes the bug to show itself.
If you are the first to find an inconsistency between the commenting and the code, then you will
be given 2 extra credit points.
README
For this assignment the README file should contain
your name, userid, a high-level description of what the LC3assem program does, how to compile and run the unit tests and the LC3 assembler,
and an overview of how
the functions were tested (5 sentences max).
We will be covering README examples in class.
Getting Started
The start of a Makefile, a doxygen configuration file (optional, try command make docs), and some of the functions and regression tests have been provided in the public directory (~cs270/public/PA4-start/). Let a copy of the PA4-start subdirectory become your initial PA4 subdirectory.
To accomplish this do the following at the command prompt:
% cd ~/CS270
(Just to make sure you are in your CS270 directory.)
% cp -r ~cs270/public/PA4-start ~/CS270/PA4
cp: cannot access `/s/bach/a/class/cs270/public/PA4-start/.svn': Permission denied
The above error about the .svn subdirectory is fine. Do an 'ls' in your PA4 directory and you should now see the following files:
LC3assem.c
LC3assem.h
Makefile
TestCases
cs270.doxygen.config
dump.c
insert_field.c
next_address.c
parser.c
pass_one.c
pass_three.c
sym_table.h
sym_table_insert.c
test_insert_field.c
test_next_address.c
test_sym.c
test_write_instr_to_file.c
write_instr_to_file.c
See SVN Notes for directions for setting
up a subversion repository and working directory.
We STRONGLY recommend that you implement one function at a time.
In pass_three() work on one or two assembly instructions at a time and use
regression testing as you go.
Submitting your Assignment
-
Make sure your code follows the code style requirements (see http://www.cs.colostate.edu/~cs270/Programs/code-requirements.html).
-
Put all of the required information in the README file.
- The subversion.txt file should include output from the following commands:
% svn log -r HEAD:1 // must be done in subversion working directory
% svn info // must be done in subversion working directory
% svnlook tree REPOSITORY_PATH // svnlook must be issued on the machine where the repository is stored
We recommend copying the output of the above commands into an editor and
saving the result as subversion.txt.
-
Do a "make clean" in PA4/ before creating the tar ball for the assignment.
-
Include all the source files, the README file, the Makefile,
and a subversion.txt file in a tar file.
% cd ~/CS270/
% tar cvf PA4_userid.tar PA4
-
Submit assignment using checkin utility
% ~cs270/bin/checkin PA4 PA4_username.tar
-
Quick Sanity Check:
% ~cs270/bin/SubmissionVerifier.sh PA4
Full Sanity Check (procedure we will use to grade your assignment):
% cd ~/CS270/
% cp PA4_username.tar ~/Temp
% cd ~/Temp
% tar xf PA4_username.tar
% cd PA4
% make all // this should compile and run unit tests
We will also be copying in our own unit tests for each function, and
we will be reading through the code comments for the code style requirements
portion of the grade.
Late Policy
Late assignments will be accepted up to 24 hours past the due date and time for a deduction of
20% and will not accepted past this period. One minute late is still late.
Based on an assignment written by Milo Martin.
mstrout@cs.colostate.edu
.... October 24, 2008