CS 270 Programming Assignment #4

Due Tuesday October 28th (by 11:59pm),
Unique LC3 assembly program due to mailing list by Sunday October 26th (by 11:59pm)

Introduction

Programming assignments are to be done individually. This programming assignment does NOT include additional homework questions.

For this assignment, you will be implementing and testing parts of an LC3 assembler. Your assembler will take as input LC3 assembly code and generate as output the LC3 machine code that was used as input to the LC3sim program developed for PA3. The following is an example of how the two programs will be used together:
    % ./LC3assem file.asm file.LC3
    % ./LC3sim file.LC3
For testing, you will be using unit testing and regression testing.

The goals for this assignment are as follows:

The Assignment

The assignment is composed of two implementation pieces: (1) writing a symbol table routine for looking up a label so as to find its corresponding address and (2) writing the actual assembler passes. There is also an important testing component. You will need to use some regression testing to check that your assembler generates the same error messages as the LC3assem.public executable. Also, the machine code you generate must be able to run through LC3sim.public. To help everyone test, each student must submit one unique example LC3 assembly program to the mailing list by October 26th, 11:59pm.

Requirements for the test LC3 assembly program If the test LC3 assembly program is submitted before noon on Sunday October 26th, then we will attempt to let you know whether it is an acceptable test case before 10pm the 26th.

Part 1: The Symbol Table

The symbol table allows you to map strings representing labels (for example, "START") to addresses (for example, 0x3000). The symbol table is a specific example of a dictionary data structure. You will be given the implementation for the following function:
    sym_node_t* sym_table_insert(sym_node_t *st, char* sym, int addr)
    
        This function adds label sym with the address addr to the
        symbol table.  sym is the key and addr is the value associated
        with that key.
Implement the following function in a separate file, sym_table_lookup.c.
    int sym_table_lookup(sym_node_t *st, char* sym)

        This function returns the address associated with symbol sym in the
        symbol table.

The prototypes for all of the functions in the assignment are in the header files LC3assem and sym_table.h. It is much easier to unit test sym_table_insert and sym_table_lookup together by inserting symbols into the symbol table and verifying that they can be found. Therefore, you need only use one unit test called test_sym.c, which tests both of the symbol table functions. test_sym.c is given to you, but you must add it to the Makefile after implementing and adding sym_table_lookup.c to the Makefile.

Symbol tables can be implemented in many ways (hash tables, linked lists, arrays, etc.), but we will use a binary tree representation. Each node in the tree is represented via the structure below.

typedef struct sym_node_struct sym_node_t;

struct sym_node_struct {
  char* symbol;
  int address;
  sym_node_t* left;
  sym_node_t* right;
};
The symbol field points to an array of characters (string) representing the symbol. The address field is the address in simulated memory for the symbol. The left and right fields are pointers to the left and right sym_node_t child structures in the tree (NULL if child does not exist). The "root" of the tree is pointed to by a local variable in LC3assem.c, sym_table_ptr, which is initially NULL:
  sym_node_t* sym_table_ptr = NULL;     // ptr to empty symbol table
To provide fast lookup of symbols, we want to build the tree such that given any node in the tree, all labels reachable from the left child are alphabetically less than the label of the node ("car" is alphabetically before "dog", but it is after "ant"). Similarly, all labels reachable from the right child are alphabetically greater than the label of the node. The strcmp(char* s1, char* s2) function from the C standard library does just such an alphabetical comparison. The function strcmp(char* s1, char* s2) returns a 0 if the strings are the same, a number greater than 0 if s1 is alphabetically after s2, and a number less than 0 if s1 is alphabetically before s2.

To complete the symbol table implementation, implement the following function:

Part 2: The Assembler

You will be given the implementation for the following functions:
    insn_t* parse(FILE *f)

    int insert_field(int bits, int new_bits, int hi_bit, int lo_bit)

    int next_address(insn_t* insn_ptr, int current_address)

    void write_instr_to_file(FILE * fptr, int address, int word, char * comment)

    void pass_one(insn_t* insn_ptr, sym_node_t** sym_table_ptr_ptr)
        Adds symbols to symbol table.

    void dump(FILE* stream, insn_t* starting_insn_ptr)
        Prints all of the instructions in terms of their structure details 
        to the given stream.  Useful for debugging.

Implement the following functions, each in a separate file named based on the function (e.g. pass_two.c).
    void pass_two(insn_t* insn_ptr, sym_node_t* sym_table_ptr)
        Resolves label references using symbol table.
    
    void pass_three(FILE *output_file, insn_t* insn_ptr)
        Actually generates machine code instructions and 
        writes them to the output file.
Recall from Chapter 7 that assemblers make multiple passes over the program. Our assembler will make several passes over the program (but one could combine some of these passes): Pass 1 and 2 require the symbol table that you completed in Part 1 of the assignment.

Pass #0: Parsing Instructions into a List

Parsing is quite complex, so we provide this code (in parser.c and parser.h). Feel free to examine this code if you like. Although we provide the parser, you still need to understand the structure into which instructions are placed. The program is parsed into a linked list where each node of the list is an instance of the structure below.

enum opcodes_t { 
  // LC3 opcodes
  BR=0, ADD=1, LD=2, ST=3, JSR_JSRR=4, AND=5, LDR=6, STR=7, 
  RTI=8, NOT=9, LDI=10, STI=11, JMP_RET=12, LEA=14, TRAP=15,
  // assembler directives
  ORIG, FILL, BLKW, STRINGZ, HALT,
  // no operation
  UNDEF
};

typedef struct insn_struct insn_t;

struct insn_struct {
  int address;            // address where this instruction should be placed
  enum opcodes_t opcode;  // the opcode of instruction (see above)
  char* label;            // label on this line (can be NULL)
  int first_reg;          // 1st reg number (-1 implies there isn't one)
  int second_reg;         // 2nd reg number (-1 implies there isn't one)
  int third_reg;          // 3rd reg number (-1 implies there isn't one)
  int immediate;          // immediate value
  int offset;             // offset value
  char* label_ref;        // label referenced by BR/LD/ST/JSR/LDI/STI/LEA/FILL
  char* string;           // string for STRINGZ directive
  int n;                  // neg bit (only for BR)
  int z;                  // zero bit (only for BR)
  int p;                  // pos bit (only for BR)
  insn_t* next;           // pointer to next instruction in list
};

The fields are:

Passes #1 and #2: Inserting and Looking Up Labels into the Symbol Table

Passes 1 and 2 use a symbol table to store the address that corresponds to all of the labels in the program. Pass 1 iterates over all the instructions in the linked list, looking at the label field and adding label/address pairs to the symbol table. Pass 2 iterates over all the instructions again, but this time it looks up all label_ref fields in the symbol table and uses the address of the current instruction and the address of the label to compute the correct value of offset. After pass 2, the label_ref field is unused. Pass 2 also changes constant address immediates in instructions like BR or ST to offsets (see exercise 9.5 in the book as an example).

Pass 1 is given and can be found in pass_one.c. You'll need to build pass_two():

Pass #3: Instruction Generation

After pass 1 and 2, the list of insn_t structures contains all the information necessary to generate encoded LC-3 instructions. Instruction encoding consists of taking the information in the insn_t structure and encoding it in a single 16-bit value. We'll provide some auxiliary routines to make this task more manageable. You'll need to complete the pass_three() function described below. Note that your assembler should print nothing to the display unless an error is encountered. If an error is encountered, the output from your LC3assem executable should be the same as the LC3assem.public executable, which can be found in ~cs270/public/.

Regression Testing

Do some googling to learn about regression testing. We will also be discussing it some in class. The main idea of regression testing is to set up a set of tests that you run everytime you make a change to your program. For this assignment, you have unit tests that test specific functions, but a regression test will test the whole assembler. Often a regression test compares the current output with an output that is known to be correct. To do this within the context of a course, we will be using a reference executable LC3assem.public. LC3assem.public can be found in ~cs270/public/. The output of your LC3assem executable should be the same as that of LC3assem.public. Here is an example of how you can compare your LC3asem output to that of LC3assem.public.
    // running the oracle
    % ./LC3assem.public TestCases/question-7.5a.asm t-public.LC3 >& output.public
    % ./LC3sim.public t-public.LC3 >& sim-out.public
    
    // running your LC3assem
    % ./LC3assem TestCases/question-7.5a.asm t.LC3 >& output.mine
    % ./LC3sim.public t.LC3 >& sim-out.mine
    
    // diff will return nothing if the results are the same
    % diff output.public output.mine
    % diff sim-out.public sim-out.mine
There may be differences between the comments generated to the machine code file by LC3assem.public and your LC3assem, but the output generated to stdout and stderr should be the same. Also, the results of simulating the machine code should be the same.

Extra credit

For 10 extra credit points learn how to do some shell scripting to automate regression testing. We are not going to provide a strict specification for this. Your script should make it easy to put a bunch of LC3 assembly files into a subdirectory and run the LC3assem.public and your LC3assem and compare results. Write your own script, because copying someone else's script on extra credit is still against the academic integrity policy.

Put your shell script in your PA4 subdirectory and call it PA4-regress.

Debugging

You will certainly need to use appropriate tools to debug your code (staring at the source code will only damage your eyesight!).

Notice the debug flag in LC3assem.c. Turn that flag on to help with debugging and turn it off when comparing your LC3assem results with those generated by LC3assem.public.

The GDB debugger will be very useful. You can use it to determine exactly what statement caused a segmentation fault. You can also set break points and step through the program's execution.

You will also want to use the Valgrind memory check. Valgrind can help identify all kinds of memory errors such as the use of uninitialized memory, dangling pointers, wild pointers, corrupting the stack, memory leaks, and more. To run your program with Valgrind:
% valgrind LC3assem test.asm test.LC3
When errors are encountered, Valgrind prints them to the display. They are mostly self explanatory, but you may want to refer to the following sources of more information on Valgrind:

Extra Credit: Bug Reports

Since we are not exhaustively testing our functions for this assignment, there will probably be bugs in the provided code. If you are the first student to submit a valid bug report for a particular bug to the instructor and/or the TAs, then you will receive 5 extra points for this assignment. A valid bug report must include detailed instructions on how to reproduce the bug. For example, the code for a test within a unit test driver that causes the bug to show itself.

If you are the first to find an inconsistency between the commenting and the code, then you will be given 2 extra credit points.

README

For this assignment the README file should contain your name, userid, a high-level description of what the LC3assem program does, how to compile and run the unit tests and the LC3 assembler, and an overview of how the functions were tested (5 sentences max). We will be covering README examples in class.

Getting Started

The start of a Makefile, a doxygen configuration file (optional, try command make docs), and some of the functions and regression tests have been provided in the public directory (~cs270/public/PA4-start/). Let a copy of the PA4-start subdirectory become your initial PA4 subdirectory. To accomplish this do the following at the command prompt:
    % cd ~/CS270
    (Just to make sure you are in your CS270 directory.)
    % cp -r ~cs270/public/PA4-start ~/CS270/PA4
    cp: cannot access `/s/bach/a/class/cs270/public/PA4-start/.svn': Permission denied
The above error about the .svn subdirectory is fine. Do an 'ls' in your PA4 directory and you should now see the following files:
LC3assem.c
LC3assem.h
Makefile
TestCases
cs270.doxygen.config
dump.c
insert_field.c
next_address.c
parser.c
pass_one.c
pass_three.c
sym_table.h
sym_table_insert.c
test_insert_field.c
test_next_address.c
test_sym.c
test_write_instr_to_file.c
write_instr_to_file.c

See SVN Notes for directions for setting up a subversion repository and working directory.

We STRONGLY recommend that you implement one function at a time. In pass_three() work on one or two assembly instructions at a time and use regression testing as you go.

Submitting your Assignment

Late Policy

Late assignments will be accepted up to 24 hours past the due date and time for a deduction of 20% and will not accepted past this period. One minute late is still late.


Based on an assignment written by Milo Martin.
mstrout@cs.colostate.edu .... October 24, 2008