CS 270 Programming Assignment #4

Due Tuesday October 28th (by 11:59pm),
Unique LC3 assembly program due to mailing list by Sunday October 26th (by 11:59pm)

Introduction

Programming assignments are to be done individually. This programming assignment does NOT include additional homework questions.

For this assignment, you will be implementing and testing parts of an LC3 assembler. Your assembler will take as input LC3 assembly code and generate as output the LC3 machine code that was used as input to the LC3sim program developed for PA3. The following is an example of how the two programs will be used together:

    % ./LC3assem file.asm file.LC3
    % ./LC3sim file.LC3

For testing, you will be using unit testing and regression testing.

The goals for this assignment are as follows:

Learn what a symbol table is and how to use it to do a multi-pass translation of LC3 assembly code to LC3 machine code.
Learn and use a regression testing technique.
Write some simple programs in LC3 assembly language.
Learn about programming with C strings.
Learn about programming with singly-linked lists in C.
Learn the C system calls malloc() and strcmp().

The Assignment

The assignment is composed of two implementation pieces: (1) writing a symbol table routine for looking up a label so as to find its corresponding address and (2) writing the actual assembler passes. There is also an important testing component. You will need to use some regression testing to check that your assembler generates the same error messages as the LC3assem.public executable. Also, the machine code you generate must be able to run through LC3sim.public. To help everyone test, each student must submit one unique example LC3 assembly program to the mailing list by October 26th, 11:59pm.

Requirements for the test LC3 assembly program

It must be submitted via the class mailing list.
It must be different than any previous test cases sent to the mailing list.
The test case must either cause the LC3assem.public executable to generate an error or the resulting machine code file must run with LC3sim.public without causing an infinite loop. In other words, LC3sim.public must stop within one minute on the generated machine code.

If the test LC3 assembly program is submitted before noon on Sunday October 26th, then we will attempt to let you know whether it is an acceptable test case before 10pm the 26th.

Part 1: The Symbol Table

The symbol table allows you to map strings representing labels (for example, "START") to addresses (for example, 0x3000). The symbol table is a specific example of a dictionary data structure. You will be given the implementation for the following function:

    sym_node_t* sym_table_insert(sym_node_t *st, char* sym, int addr)
    
        This function adds label sym with the address addr to the
        symbol table.  sym is the key and addr is the value associated
        with that key.

Implement the following function in a separate file, sym_table_lookup.c.

    int sym_table_lookup(sym_node_t *st, char* sym)

        This function returns the address associated with symbol sym in the
        symbol table.

The prototypes for all of the functions in the assignment are in the header files LC3assem and sym_table.h. It is much easier to unit test sym_table_insert and sym_table_lookup together by inserting symbols into the symbol table and verifying that they can be found. Therefore, you need only use one unit test called test_sym.c, which tests both of the symbol table functions. test_sym.c is given to you, but you must add it to the Makefile after implementing and adding sym_table_lookup.c to the Makefile.

Symbol tables can be implemented in many ways (hash tables, linked lists, arrays, etc.), but we will use a binary tree representation. Each node in the tree is represented via the structure below.

typedef struct sym_node_struct sym_node_t;

struct sym_node_struct {
  char* symbol;
  int address;
  sym_node_t* left;
  sym_node_t* right;
};

The symbol field points to an array of characters (string) representing the symbol. The address field is the address in simulated memory for the symbol. The left and right fields are pointers to the left and right sym_node_t child structures in the tree (NULL if child does not exist). The "root" of the tree is pointed to by a local variable in LC3assem.c, sym_table_ptr, which is initially NULL:

  sym_node_t* sym_table_ptr = NULL;     // ptr to empty symbol table

To provide fast lookup of symbols, we want to build the tree such that given any node in the tree, all labels reachable from the left child are alphabetically less than the label of the node ("car" is alphabetically before "dog", but it is after "ant"). Similarly, all labels reachable from the right child are alphabetically greater than the label of the node. The strcmp(char* s1, char* s2) function from the C standard library does just such an alphabetical comparison. The function strcmp(char* s1, char* s2) returns a 0 if the strings are the same, a number greater than 0 if s1 is alphabetically after s2, and a number less than 0 if s1 is alphabetically before s2.

To complete the symbol table implementation, implement the following function:

int sym_table_lookup(sym_node_t *st, char* sym): This function traverses the tree based on the symbol string it is passed. When the string is found in a node, the value of the address field is returned. If a non-matching childless node is encountered, the symbol does not appear in the symbol table. If symbol SYMBOL cannot be found, print exactly the following error message (so our testing scripts can detect it) followed by a new line and exit (to exit a C program, using the following function call: exit(1);).

  ERROR: symbol not found (SYMBOL)

SYMBOL is the symbol that could not be found. The pseudocode:

initialize a node pointer to g_symboltable_ptr

while the node pointer is not NULL do:
    use strcmp() to compare sym and the symbol field of structure pointed to by
    the node pointer
    if the result of the comparison is less than zero:
        set the node pointer to be the left pointer of the structure pointed to
        by the node pointer
    otherwise, if the result of the comparison is greater than zero:
        set the node pointer to be the right pointer of the structure pointed to
        by the node pointer
    otherwise:
        the result of the function is the address field of the structure pointed
        to by the node pointer

If the symbol was not found, print an error message (using fprintf) and then exit(1)

Our C implementation of this function is less than 20 lines long. See the implementation of sym_table_insert() to see an example of how to traverse the binary tree.

Part 2: The Assembler

You will be given the implementation for the following functions:

    insn_t* parse(FILE *f)

    int insert_field(int bits, int new_bits, int hi_bit, int lo_bit)

    int next_address(insn_t* insn_ptr, int current_address)

    void write_instr_to_file(FILE * fptr, int address, int word, char * comment)

    void pass_one(insn_t* insn_ptr, sym_node_t** sym_table_ptr_ptr)
        Adds symbols to symbol table.

    void dump(FILE* stream, insn_t* starting_insn_ptr)
        Prints all of the instructions in terms of their structure details 
        to the given stream.  Useful for debugging.

Implement the following functions, each in a separate file named based on the function (e.g. pass_two.c).

    void pass_two(insn_t* insn_ptr, sym_node_t* sym_table_ptr)
        Resolves label references using symbol table.
    
    void pass_three(FILE *output_file, insn_t* insn_ptr)
        Actually generates machine code instructions and 
        writes them to the output file.

Recall from Chapter 7 that assemblers make multiple passes over the program. Our assembler will make several passes over the program (but one could combine some of these passes):

Pass 0: parse the assembly program into a linked list,
Pass 1: add symbols to symbol table,
Pass 2: resolve label references using symbol table, and
Pass 3: actually generate machine code instructions.

Pass 1 and 2 require the symbol table that you completed in Part 1 of the assignment.

Pass #0: Parsing Instructions into a List

Parsing is quite complex, so we provide this code (in parser.c and parser.h). Feel free to examine this code if you like. Although we provide the parser, you still need to understand the structure into which instructions are placed. The program is parsed into a linked list where each node of the list is an instance of the structure below.

enum opcodes_t { 
  // LC3 opcodes
  BR=0, ADD=1, LD=2, ST=3, JSR_JSRR=4, AND=5, LDR=6, STR=7, 
  RTI=8, NOT=9, LDI=10, STI=11, JMP_RET=12, LEA=14, TRAP=15,
  // assembler directives
  ORIG, FILL, BLKW, STRINGZ, HALT,
  // no operation
  UNDEF
};

typedef struct insn_struct insn_t;

struct insn_struct {
  int address;            // address where this instruction should be placed
  enum opcodes_t opcode;  // the opcode of instruction (see above)
  char* label;            // label on this line (can be NULL)
  int first_reg;          // 1st reg number (-1 implies there isn't one)
  int second_reg;         // 2nd reg number (-1 implies there isn't one)
  int third_reg;          // 3rd reg number (-1 implies there isn't one)
  int immediate;          // immediate value
  int offset;             // offset value
  char* label_ref;        // label referenced by BR/LD/ST/JSR/LDI/STI/LEA/FILL
  char* string;           // string for STRINGZ directive
  int n;                  // neg bit (only for BR)
  int z;                  // zero bit (only for BR)
  int p;                  // pos bit (only for BR)
  insn_t* next;           // pointer to next instruction in list
};

The fields are:

The address field indicates the address in memory at which this instruction will be placed.
The opcode field indicates what kind of instruction this structure represents (e.g., ADD, AND, etc.). Opcodes values must come from the opcodes_t enumeration. Note that the first 16 opcodes correspond directly to actually LC-3 opcode values. For example, the opcode of BR is 0x0, ADD is 0x1, LD is 0x2, etc. The remaining opcodes are for assembler directive (i.e., they do not correspond to actual instructions). Note the JSR_JSRR and JMP_RET enum values. We have named these in this way because, for example, JSR and JSRR both have the same opcode encoding in an LC-3 instruction. UNDEF is a special opcode meaning the structure does not correspond to an instruction at all. This is used when a label appears on a line without an instruction.
The label field is a pointer to a null-terminated array of characters (i.e., a C-style string) representing the label associated with this line. The value of this pointer is NULL for instructions that are not preceded by a label.
The first_reg, second_reg, and third_reg fields specify the register operands for the instruction. Instructions that require fewer than 3 registers will have -1 in the unused registers. For example, if an instruction requires on two register operands (e.g., NOT), third_reg will be -1.
The immediate field is used to hold immediate values. For example, an ADD immediate instruction will have an immediate value in this field. This field is 0 for instructions that don't require an immediate, but you can't conclude this by looking at the field because an instruction might have an immediate with a value of 0. For example, to determine whether the third operand of an ADD or AND instruction is a register or immediate, examine the value of third_reg; if it is -1, use the immediate in immediate. To accomodate assembly programs like the one in exercise 9.5 in the book, instructions such as ST, LD, etc. that typically have a LABEL operand can instead have a constant address as an operand. That constant address will be stored in the immediate field.
The offset field is used to hold offsets. For example, an LD instruction will have a PCoffset in it. The offsets for LDR and STR are determined by the parser. Anything with a PCoffset has its offset calculated in pass_two().
The label_ref field is a pointer to a null-terminated array of characters that represents the label referenced by the instruction (e.g., LABEL1 in "BR LABEL1"). For instructions not using a label, this pointer will be NULL.
The string field is a pointer to a null-terminated array of characters that represents the string operand to the .stringz directive. It has a NULL value for all instructions except STRINGZ.
The n, z, and p fields specify the n, z, and p fields in the BR instruction (1 indicates the corresponding bit is set in the instruction). These fields are obviously meaningless for non-BR instructions.
The next field is a pointer to the next insn_t structure, representing the next instruction in the program. The next field has the value NULL if this is the last instruction in the input file.

Passes #1 and #2: Inserting and Looking Up Labels into the Symbol Table

Passes 1 and 2 use a symbol table to store the address that corresponds to all of the labels in the program. Pass 1 iterates over all the instructions in the linked list, looking at the label field and adding label/address pairs to the symbol table. Pass 2 iterates over all the instructions again, but this time it looks up all label_ref fields in the symbol table and uses the address of the current instruction and the address of the label to compute the correct value of offset. After pass 2, the label_ref field is unused. Pass 2 also changes constant address immediates in instructions like BR or ST to offsets (see exercise 9.5 in the book as an example).

Pass 1 is given and can be found in pass_one.c. You'll need to build pass_two():

void pass_two(insn_t* insn_ptr, sym_node_t* sym_table_ptr): This function looks at each instruction for a non-NULL label_ref field. If the label_ref is non-NULL, the label_ref is looked up in the symbol table to get the address associate with the label. If the opcode is FILL, this address becomes the immediate field. Otherwise, we must compute the PC-relative offset as follows: address associated with symbol in symbol table - (current_pc + 1). This value is then stored in offset. This function will need to call the sym_table_lookup() function. If there is no label_ref, the opcode is in the set {BR, JSR_JSRR, LD, LDI, LEA, ST, STI }, and there is a non-zero immediate, then the immediate represents a constant address and the PC-relative offset needs to be computed as follows: immediate - (current_pc + 1).

Pass #3: Instruction Generation

After pass 1 and 2, the list of insn_t structures contains all the information necessary to generate encoded LC-3 instructions. Instruction encoding consists of taking the information in the insn_t structure and encoding it in a single 16-bit value. We'll provide some auxiliary routines to make this task more manageable.

void write_instr_to_file(FILE * fptr, int address, int word, char * comment): This function (which we provide for you) outputs the 16-bit binary representation of a single instruction to the output file. It outputs the machine instruction in the format expected by LC3sim.
int insert_field(int bits, int new_bits, int hi_bit, int lo_bit): This function (also provided for you) takes a value (bits) and replaces the bits from hi_bit to lo_bit with the value in new_bits and returns that value. For example, to encode an instruction opcode we could do something like the following:
```
    new_insn = insert_field(old_insn, insn_ptr->opcode, 15, 12);
```
Similarly, to encode the destination register in an ADD instruction we would do the following:
```
    new_insn = insert_field(old_insn, insn_ptr->first_reg, 11, 9);
```

You'll need to complete the pass_three() function described below.

void pass_three(FILE *output_file, insn_t* insn_ptr): We've given you the basic structure of this pass in the pass_three() function. This function consists principally of a switch statement, and there is a case for each opcode. Within each case you will uses the fields of the insn_t structure and insert_field() to build an instruction, which you will then output to a file with write_word_to_file(). We have completed the ADD case for you.

Note that your assembler should print nothing to the display unless an error is encountered. If an error is encountered, the output from your LC3assem executable should be the same as the LC3assem.public executable, which can be found in ~cs270/public/.

Regression Testing

Do some googling to learn about regression testing. We will also be discussing it some in class. The main idea of regression testing is to set up a set of tests that you run everytime you make a change to your program. For this assignment, you have unit tests that test specific functions, but a regression test will test the whole assembler. Often a regression test compares the current output with an output that is known to be correct. To do this within the context of a course, we will be using a reference executable LC3assem.public. LC3assem.public can be found in ~cs270/public/. The output of your LC3assem executable should be the same as that of LC3assem.public. Here is an example of how you can compare your LC3asem output to that of LC3assem.public.

    // running the oracle
    % ./LC3assem.public TestCases/question-7.5a.asm t-public.LC3 >& output.public
    % ./LC3sim.public t-public.LC3 >& sim-out.public
    
    // running your LC3assem
    % ./LC3assem TestCases/question-7.5a.asm t.LC3 >& output.mine
    % ./LC3sim.public t.LC3 >& sim-out.mine
    
    // diff will return nothing if the results are the same
    % diff output.public output.mine
    % diff sim-out.public sim-out.mine

There may be differences between the comments generated to the machine code file by LC3assem.public and your LC3assem, but the output generated to stdout and stderr should be the same. Also, the results of simulating the machine code should be the same.

Extra credit

For 10 extra credit points learn how to do some shell scripting to automate regression testing. We are not going to provide a strict specification for this. Your script should make it easy to put a bunch of LC3 assembly files into a subdirectory and run the LC3assem.public and your LC3assem and compare results. Write your own script, because copying someone else's script on extra credit is still against the academic integrity policy.

Put your shell script in your PA4 subdirectory and call it PA4-regress.

Debugging

You will certainly need to use appropriate tools to debug your code (staring at the source code will only damage your eyesight!).

Notice the debug flag in LC3assem.c. Turn that flag on to help with debugging and turn it off when comparing your LC3assem results with those generated by LC3assem.public.

The GDB debugger will be very useful. You can use it to determine exactly what statement caused a segmentation fault. You can also set break points and step through the program's execution.

You will also want to use the Valgrind memory check. Valgrind can help identify all kinds of memory errors such as the use of uninitialized memory, dangling pointers, wild pointers, corrupting the stack, memory leaks, and more. To run your program with Valgrind:

% valgrind LC3assem test.asm test.LC3

When errors are encountered, Valgrind prints them to the display. They are mostly self explanatory, but you may want to refer to the following sources of more information on Valgrind:

Valgrind Project Homepage - http://valgrind.org/
Valgrind HOWTO - http://www.tldp.org/HOWTO/Valgrind-HOWTO/closerview.html

Extra Credit: Bug Reports

Since we are not exhaustively testing our functions for this assignment, there will probably be bugs in the provided code. If you are the first student to submit a valid bug report for a particular bug to the instructor and/or the TAs, then you will receive 5 extra points for this assignment. A valid bug report must include detailed instructions on how to reproduce the bug. For example, the code for a test within a unit test driver that causes the bug to show itself.

If you are the first to find an inconsistency between the commenting and the code, then you will be given 2 extra credit points.

README

For this assignment the README file should contain your name, userid, a high-level description of what the LC3assem program does, how to compile and run the unit tests and the LC3 assembler, and an overview of how the functions were tested (5 sentences max). We will be covering README examples in class.

Getting Started

The start of a Makefile, a doxygen configuration file (optional, try command make docs), and some of the functions and regression tests have been provided in the public directory (~cs270/public/PA4-start/). Let a copy of the PA4-start subdirectory become your initial PA4 subdirectory. To accomplish this do the following at the command prompt:

    % cd ~/CS270
    (Just to make sure you are in your CS270 directory.)
    % cp -r ~cs270/public/PA4-start ~/CS270/PA4
    cp: cannot access `/s/bach/a/class/cs270/public/PA4-start/.svn': Permission denied

The above error about the .svn subdirectory is fine. Do an 'ls' in your PA4 directory and you should now see the following files:

LC3assem.c
LC3assem.h
Makefile
TestCases
cs270.doxygen.config
dump.c
insert_field.c
next_address.c
parser.c
pass_one.c
pass_three.c
sym_table.h
sym_table_insert.c
test_insert_field.c
test_next_address.c
test_sym.c
test_write_instr_to_file.c
write_instr_to_file.c

See SVN Notes for directions for setting up a subversion repository and working directory.

We STRONGLY recommend that you implement one function at a time. In pass_three() work on one or two assembly instructions at a time and use regression testing as you go.

Submitting your Assignment

Make sure your code follows the code style requirements (see http://www.cs.colostate.edu/~cs270/Programs/code-requirements.html).
Put all of the required information in the README file.

The subversion.txt file should include output from the following commands:

    % svn log -r HEAD:1     // must be done in subversion working directory
    % svn info              // must be done in subversion working directory
    % svnlook tree REPOSITORY_PATH  // svnlook must be issued on the machine where the repository is stored

We recommend copying the output of the above commands into an editor and saving the result as subversion.txt.

Do a "make clean" in PA4/ before creating the tar ball for the assignment.
Include all the source files, the README file, the Makefile, and a subversion.txt file in a tar file.
```
% cd ~/CS270/
```
```
% tar cvf PA4_userid.tar PA4
```

Submit assignment using checkin utility

% ~cs270/bin/checkin PA4 PA4_username.tar

Quick Sanity Check:
```
  % ~cs270/bin/SubmissionVerifier.sh PA4
  
```
Full Sanity Check (procedure we will use to grade your assignment):
```
    % cd ~/CS270/
    % cp PA4_username.tar ~/Temp
    % cd ~/Temp
    % tar xf PA4_username.tar
    % cd PA4
    % make all      // this should compile and run unit tests  
  
```
We will also be copying in our own unit tests for each function, and we will be reading through the code comments for the code style requirements portion of the grade.

Late Policy

Late assignments will be accepted up to 24 hours past the due date and time for a deduction of 20% and will not accepted past this period. One minute late is still late.

Based on an assignment written by Milo Martin.

mstrout@cs.colostate.edu .... October 24, 2008