CS 163/4: Java Programming (CS 1)
Computer Science

Practical Three - Data Wrangling University Data

Introduction

When you talk about machine learning and data science, data wrangling becomes essential. What is data wrangling? It is taking data and putting it into a format that is easy to work with: structuring your data so the problem becomes easier to manage. Machine learning is about using data to create predictions and reacting to those predictions, so you can see the critical nature of wrangling that data into manageable formats. For this practical assignment, you are working for Institutional Research, a group that analyzes data across the university to help departments make funding decisions and plan strategic initiatives. In particular, you have been provided with enrollment counts and the gender (as listed in the university database at the time of retrieval) of every student officially majoring in an Engineering or Natural Sciences major at the university.

Your task will be to read data from a Comma Separated Value (CSV) file, store the data in a custom structure using arrays and objects, and then write that data back out in a JSON file format. CSV file formats are called “flat-data” formats, while JSON data is represented as semi-structured data. CSV is very common, and you have probably used it if you have used Microsoft Excel. JSON formats are extremely common for web applications, and are the standard for passing information around between web applications.

You will learn about:

  • More String.format
  • Loops - for and for:each
  • Objects (make heavy use of objects)
  • Arrays

Provided Code

Similar to Practical 2, you will first download your files for use in Eclipse. If you would like to review how to set up your Eclipse environment and download the files, visit Practical 2 for a detailed explanation. As with Practical 2, there are a large number of files, and you will only work on four files in a much larger application. As with the previous practicals, once you get the code on your machine, you should go through and take the coding quiz in Canvas now.

A note on adding the files to your project:

  • src
    Your source folder should contain your .java files
  • data
    We recommend creating a data folder under the project root, and storing the following files in that folder:
    • CollegeData.csv
    • CollegeData10sample.csv (random sampling of only 10 students)
    • CollegeData100sample.csv (random sampling of only 100 students)
    • CollegeData1000sample.csv (random sampling of only 1000 students)
  • project root
    You will want to put test.csv in the project root folder.

Running the Code

After you get the code set up, you will notice that running the application is slightly different from the rest. The program doesn’t use a menu system; the entire design is simply to run the application, passing in an input file and saving out to the file system. This is extremely common when working with files and data wrangling. You also have a number of command-line options you can use to turn on different features, which is also common for command-line programs. They are as follows:

  • Filename (any name) Loads the CSV file. For your CollegeData files, you will often use data/CollegeData.csv or data/CollegeData10sample.csv (etc). That is because they are in the data folder from your project root folder.
  • Output file (any name) A second command-line argument will cause the program to save out the data to that file.
  • Special args:
    • -v --verbose
      The verbose option enables printing when detailed output is desired (very common for command-line programs)
    • -d --debug
      The debug option enables printing of all debug statements
    • --tests
      Run the unit tests even when file options are given (tests first, and then the program)

If you run the program without any command-line arguments, you will get the following display to your console:

Missing program arguments...
usage: Main [-v|--verbose] [-d|--debug] [--tests] inputFile [outputFile]
   input file is required to run the program unless running tests
   output file is optional, but recommended to save out JSON structure
   -v|--verbose to enable verbose printing, add -v or --verbose
   -d|--debug to enable debug printing, add -d or --debug
   --tests to enable unit testing
Examples:
   java Main data/CollegeData.csv output.json
   java Main --tests
Process finished with exit code 0

It is highly suggested that you run with --tests in your program arguments when working through the various steps.
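The argument handling itself lives in the provided Main class. Purely as an illustration, the following sketch shows one way the flags and positional file names from the usage message could be separated. The class name ArgsSketch and its fields are hypothetical and are not part of the provided code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: separates option flags from positional file arguments.
class ArgsSketch {
    boolean verbose;
    boolean debug;
    boolean tests;
    String inputFile;   // first positional argument, or null if absent
    String outputFile;  // second positional argument, or null if absent

    static ArgsSketch parse(String[] args) {
        ArgsSketch parsed = new ArgsSketch();
        List<String> positional = new ArrayList<>();
        for (String arg : args) {
            switch (arg) {
                case "-v": case "--verbose": parsed.verbose = true; break;
                case "-d": case "--debug":   parsed.debug = true;   break;
                case "--tests":              parsed.tests = true;   break;
                default:                     positional.add(arg);   // anything else is a file name
            }
        }
        if (positional.size() > 0) parsed.inputFile = positional.get(0);
        if (positional.size() > 1) parsed.outputFile = positional.get(1);
        return parsed;
    }

    public static void main(String[] args) {
        ArgsSketch a = parse(new String[] {"-v", "data/CollegeData10sample.csv", "output.json"});
        System.out.println(a.verbose + " " + a.inputFile + " " + a.outputFile);
    }
}
```

Because flags are recognized by name rather than by position, they can appear before or after the file names in this sketch.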

Step 1 - GenderCounter: toJson()

GenderCounter is an object used throughout the code to help track gender counts across colleges, departments, and majors. Using the divide-conquer-glue paradigm, the goal is to keep this class as simple as possible. Looking through the code, you will notice that while the data currently has only two gender definitions (admittedly, well behind the times), the class is designed to expand in the future. In fact, if you read the comments in the class, you can see the efforts members of the CS department have made to help the university move forward. Beyond that, you should take note of the following arrays:

   public final static String[] OPTIONS = {"F", "M"};
   public final static String[] FRIENDLY_NAMES = {"Female", "Male"};
   private final int[] counts = new int[OPTIONS.length];

The OPTIONS and FRIENDLY_NAMES arrays are fixed, and their positions match up exactly. You will also notice by looking at .addCount(String gender) that the index into counts matches the index of that gender in OPTIONS. Essentially, index 0 is the shared index for F, Female, and the count of females in GenderCounter. It also means that if I wanted to print out all my friendly names, I could do the following:

for(int i = 0; i < OPTIONS.length; i++) {
  System.out.println(FRIENDLY_NAMES[i]);
}
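To make the shared-index idea concrete, here is a minimal sketch of parallel arrays in which one index ties together a code, a friendly name, and a count. It is not the provided GenderCounter: the class name ParallelCounterSketch and the helpers indexOf and getCount are hypothetical, and ignoring unknown codes is an assumption made only for this example.

```java
// Hypothetical sketch of parallel arrays: one index links a code, a friendly name, and a count.
class ParallelCounterSketch {
    static final String[] OPTIONS = {"F", "M"};
    static final String[] FRIENDLY_NAMES = {"Female", "Male"};
    private final int[] counts = new int[OPTIONS.length];

    // Returns the position of code in OPTIONS, or -1 if it is not listed.
    static int indexOf(String code) {
        for (int i = 0; i < OPTIONS.length; i++) {
            if (OPTIONS[i].equals(code)) {
                return i;
            }
        }
        return -1;
    }

    // Increments the count stored at the same index as the matching option.
    void addCount(String code) {
        int i = indexOf(code);
        if (i >= 0) {           // assumption: unknown codes are silently ignored
            counts[i]++;
        }
    }

    int getCount(String code) {
        int i = indexOf(code);
        return i >= 0 ? counts[i] : 0;
    }

    public static void main(String[] args) {
        ParallelCounterSketch c = new ParallelCounterSketch();
        c.addCount("F");
        c.addCount("F");
        c.addCount("M");
        for (int i = 0; i < OPTIONS.length; i++) {
            System.out.println(FRIENDLY_NAMES[i] + ": " + c.getCount(OPTIONS[i]));
        }
    }
}
```

Adding a third option in this sketch only requires extending the two constant arrays; the counts array and every loop adapt because they rely on OPTIONS.length.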

Pro Tip
You could also make the options array dynamic as the data loads, which would be an even better way. However, that would have overcomplicated things for this assignment. This is also why many data files are shared along with data definitions that list the possibilities in each category.

Writing .toJson()

With the above information in mind, you will want to go down to .toJson(), and you will notice it needs you to add some lines of code. What you will want to do is write a loop that pairs FRIENDLY_NAMES[i] with counts[i], where i is the shared index (see code example above). The method expects you to return a JSON friendly string.

The format for a JSON key:value pair is as follows:

  • Strings (such as Female or Male) need to have double quotes around them.
  • Numbers can just be the number without quotes
  • As you are pairing name to count (key to value), you want a colon : between them
  • To split different pairs, you want a comma between them.

For example:

 
{"Female":2,"Male":1}
{"Female":21,"Male":22}

You will notice the provided code already adds the curly brackets and removes the last comma. For example, if on leaving the loop you have the string "Female":21,"Male":22, (ending in a comma), it will fix it to be "Female":21,"Male":22 by removing the last character. You can remove that line of code if you don’t use it, but know that you should iterate through the options array, as the options may change! Don’t hard code just grabbing the first two elements.

As a reminder, to add a double quote (") to a String, use the escape sequence \". Also, review StringBuilder if you need a reminder of how to build a String piece by piece.
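The pieces described above (a loop over a shared index, escaped quotes, and trimming the final comma) can be sketched with a standalone helper. This is only an illustration under assumptions: PairJsonSketch and its toJson(String[], int[]) signature are hypothetical and differ from the provided method, which works on the class's own arrays.

```java
// Hypothetical helper: builds {"name":count,...} from two parallel arrays.
class PairJsonSketch {
    static String toJson(String[] names, int[] counts) {
        StringBuilder sb = new StringBuilder("{");
        for (int i = 0; i < names.length; i++) {
            // \" places a literal double quote inside the Java String
            sb.append("\"").append(names[i]).append("\":").append(counts[i]).append(",");
        }
        if (names.length > 0) {
            sb.deleteCharAt(sb.length() - 1); // drop the trailing comma left by the loop
        }
        return sb.append("}").toString();
    }

    public static void main(String[] args) {
        // prints {"Female":21,"Male":22}
        System.out.println(toJson(new String[] {"Female", "Male"}, new int[] {21, 22}));
    }
}
```

Because the loop runs to names.length, the same code keeps working if more options are added later.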

Testing .toJson()

It is very important you test this method, as an error can cascade throughout your code if you don’t. You will want to test it in two ways. First, run the program with the --tests program argument, and that will run the unit tests. You will notice we have only tested one option for GenderCounter, and that option matches the example above. You should test other options, to make sure it works.

The second test you will want to do is confirm your JSON format is correct. In order to do that, you are going to use a linter called JSONLint. When you visit the JSONLint website, you are presented with an option to paste in your JSON string. Go ahead and copy it, paste it into the text box, and click Validate JSON.

If the JSON format is correct, the results line will say “Valid JSON”. If it is not correct, it will attempt to detect the line with the error. In both cases, it will format your JSON to be more human-readable. Don’t worry about that change. JSON is meant to be stored without the spaces or returns as that makes the file size smaller. In fact, there should be no spaces in the JSON you generate for this method.

You will continue to come back to JSONLint, and you need to make sure your String is a properly formatted JSON String in each step, or the next steps become more difficult.

Step 2 - Major: getName() and addStudent(String gender)

In this step, you will implement two methods, both of which require only one line of code.

  • getName()
    Simply returns the string stored in name. You will notice name is a final variable that is set in the constructor, which is why there is no mutator method, only the accessor getName().
  • addStudent(String gender)
    addStudent is a void method, so it doesn’t return anything. Instead, the goal is to add to the gender counter by calling the addCount(String) method. You will notice at the top of Major, there is a genderCounter object. You will want to call addCount on that object.
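The two patterns in this step, a one-line accessor for a final field and a one-line method that delegates to another object, look roughly like the sketch below. TallySketch and MajorSketch are hypothetical stand-ins, and the tally is deliberately simplified to count every call rather than track genders separately.

```java
// Hypothetical stand-in for a counter object; it simply counts every call.
class TallySketch {
    private int total;

    void addCount(String gender) {
        total++;
    }

    int getTotal() {
        return total;
    }
}

// Hypothetical stand-in showing an accessor and a delegating void method.
class MajorSketch {
    private final String name;                      // set once in the constructor, so no mutator
    private final TallySketch genderCounter = new TallySketch();

    MajorSketch(String name) {
        this.name = name;
    }

    String getName() {
        return name;                                // accessor: one line
    }

    void addStudent(String gender) {
        genderCounter.addCount(gender);             // delegation: one line
    }

    int getTotal() {
        return genderCounter.getTotal();
    }
}
```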

Testing

You will notice that there are no tests written for these two methods. You should correct that and add some to the unit tests!

Step 3 - Major.toJson()

The toJson method is meant to return a String in the following JSON format.

{"name":"Astronomy","data":{"Female":1,"Male":0}}
{"name":"Basket Weaving","data":{"Female":1,"Male":1}}

Notice that the name value matches the return value of getName() and the data value matches the value of genderCounter.toJson(). None of this method is written for you, but we found String.format to be extremely useful. The only space that can show up in the return value is one that is part of the name of the major (e.g. Computer Science). You will not need a loop for this method, as you only have the two key:value pairs and they are fixed.
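As a reminder of how String.format handles escaped quotes and %s placeholders, here is a small sketch that assembles the same shape of output from two Strings. The class FormatJsonSketch and its two-parameter toJson are hypothetical; in the assignment, the values come from getName() and genderCounter.toJson() instead.

```java
// Hypothetical helper: wraps a name and an already-built JSON object in an outer object.
class FormatJsonSketch {
    // Produces {"name":"<name>","data":<dataJson>} with no extra spaces.
    static String toJson(String name, String dataJson) {
        return String.format("{\"name\":\"%s\",\"data\":%s}", name, dataJson);
    }

    public static void main(String[] args) {
        // prints {"name":"Astronomy","data":{"Female":1,"Male":0}}
        System.out.println(toJson("Astronomy", "{\"Female\":1,\"Male\":0}"));
    }
}
```

Note that the name is wrapped in escaped quotes inside the format String, while the data value is not, because it is already a JSON object.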

Pro Tip
name:value is called a key:value pair. The idea is if you have the key, and the key is unique (which major names are), then the value is associated with it. This is covered more in data structures and Map style collections. JavaScript makes heavy use of maps, as do most languages, including Java.

Testing .toJson()

Similar to the last .toJson method you wrote, you will want to test it both via the unit tests and by using JSONLint. You want to make sure it is formatted correctly, or the next toJson method will be incorrect. So you know, if the linter is saying something is wrong in the String returned from genderCounter.toJson(), it is probably not that spot. Instead, similar to a semicolon error in Java, it is probably something right next to that return value (missing quote, comma, etc).

Step 4 - Department: getName() and getMajors()

You will want to implement both the accessor methods, getName() and getMajors(). They both return the associated variables (name and majors), and are only one line.

Testing

You will notice we did not add tests for these methods. You should do that now in unit tests!

Step 5 - Department: findMajor(String major)

This method is essential, and you can go to College.java for similar methods implemented on finding the department. What findMajor does is loop through the majors array, looking at the name of each major (hint: majors[i].getName()). If String major matches the name of the item in the array, return the major object (majors[i]). If you don’t find the major, return null (already done).

Overall, you are simply writing a loop that goes through the majors array. You can assume the majors are being added in order in the array. You can also assume majorCount is accurate (look at createMajor(String major) to see how that works).

If you are getting a null pointer exception, it probably means you are not using the correct end condition for the for-loop. While the code is easier with a for loop, it is also possible to use a for:each loop and short-circuiting if statements to accomplish the same task. College.java uses that approach. You should check to make sure the majors array has values in it before attempting to index it. You should also check to see if a given index has a value in it.
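The null pointer issue usually comes from a partially filled array, where the unused slots still hold null. The sketch below shows both loop styles on a hypothetical Item type; FindSketch, Item, find, and findForEach are illustrative names, not the provided Department code.

```java
// Hypothetical sketch: searching a partially filled array without a NullPointerException.
class FindSketch {
    // Minimal element type with a name.
    static class Item {
        private final String name;

        Item(String name) {
            this.name = name;
        }

        String getName() {
            return name;
        }
    }

    // Indexed loop: only visits the first `count` slots, which are the ones that were filled.
    static Item find(Item[] items, int count, String name) {
        if (items == null) {
            return null;
        }
        for (int i = 0; i < count; i++) {
            if (items[i] != null && items[i].getName().equals(name)) {
                return items[i];
            }
        }
        return null;
    }

    // for-each loop: visits every slot, so the null check must short-circuit before getName().
    static Item findForEach(Item[] items, String name) {
        if (items == null) {
            return null;
        }
        for (Item item : items) {
            if (item != null && item.getName().equals(name)) {
                return item;
            }
        }
        return null;
    }
}
```

In the for-each version, the && operator stops evaluating as soon as item != null is false, which is what keeps getName() from being called on an empty slot.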

Testing findMajor

You will find we did not create tests for findMajor, but this is a critical method. You may want to create your own tests in UnitTests.java to test this method before moving on. You may want to use createMajor to create the major, and then findMajor to see if you can find it.

Step 6 - Department: addStudent(String major, String gender)

You will now implement the addStudent method. First, you will want to make use of findOrCreateMajor(String major) to get the major, based on the name of the major passed in. Once you have the major, call addStudent(gender) on it, and also call addCount(gender) on the genderCounter, so you can track the total number of students in a department. For an example, take a look at addStudent in College.java.
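The find-or-create pattern, followed by updating both the child and the parent tally, can be sketched as follows. Everything here is hypothetical (DeptSketch, MajorTally, the fixed capacity of 10), and the sketch simplifies the counting by not separating students by gender.

```java
// Hypothetical sketch of find-or-create plus two-level counting.
class DeptSketch {
    // Stand-in for Major: a name plus a student tally.
    static class MajorTally {
        final String name;
        int students;

        MajorTally(String name) {
            this.name = name;
        }
    }

    private final MajorTally[] majors = new MajorTally[10]; // assumption: capacity is never exceeded
    private int majorCount;
    private int totalStudents;

    // Returns the existing entry with this name, or appends a new one.
    MajorTally findOrCreate(String name) {
        for (int i = 0; i < majorCount; i++) {
            if (majors[i].name.equals(name)) {
                return majors[i];
            }
        }
        MajorTally created = new MajorTally(name);
        majors[majorCount++] = created;
        return created;
    }

    // Updates the matching major and the department-wide total.
    // Simplification: students are counted without separating them by gender.
    void addStudent(String major, String gender) {
        findOrCreate(major).students++;
        totalStudents++;
    }

    int getMajorCount() {
        return majorCount;
    }

    int getTotalStudents() {
        return totalStudents;
    }
}
```

A test in the same spirit as the one suggested below would add a few students and then check both the per-major tally and the overall total.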

Testing addStudent

This is a difficult method to test. However, you can write a test where you add a student, and then directly look at your majors array to see if the major is added and genderCounter to see if the counts increased.

Step 7 - Department: toJson()

It is now time to write the toJson() for the Department. The final JSON String will look like the following.

{"name":"Frivolities","data":{"Female":2,"Male":1},"majors":[{"name":"Astronomy","data":{"Female":1,"Male":0}},{"name":"Basket Weaving","data":{"Female":1,"Male":1}}]}

You will notice that in JSON an array (or list) of values (such as the majors array) has square brackets around it, similar to how you declare an array and how an array is commonly printed. Each value is placed in the brackets, separated by commas. If you look at the provided code in .toJson(), we are also planning on using a StringBuilder due to memory savings. For the department, you will first want to get the key:value pairs of name:getName() and data:genderCounter.toJson(); you will then want to loop through the majors array, getting the toJson for each major!

Carefully look at the provided String above. Additionally, you can look at College.toJson() for an example of how to do it, if you are stuck. (try first!)
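If the array portion is the sticking point, the following sketch isolates it: joining already-built JSON objects inside square brackets, then placing the result next to the name and data pairs. ArrayJsonSketch and its methods are hypothetical and take plain Strings as input, which is an assumption made to keep the example self-contained.

```java
// Hypothetical sketch: building a JSON array from already-serialized elements.
class ArrayJsonSketch {
    // Joins the first `count` elements as [a,b,c], adding a comma only between elements.
    static String toJsonArray(String[] elements, int count) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < count; i++) {
            if (i > 0) {
                sb.append(",");
            }
            sb.append(elements[i]);
        }
        return sb.append("]").toString();
    }

    // Combines name, data, and the majors array into one object with no extra spaces.
    static String departmentJson(String name, String dataJson, String[] majorJson, int count) {
        return String.format("{\"name\":\"%s\",\"data\":%s,\"majors\":%s}",
                name, dataJson, toJsonArray(majorJson, count));
    }

    public static void main(String[] args) {
        String[] majors = {
            "{\"name\":\"Astronomy\",\"data\":{\"Female\":1,\"Male\":0}}",
            "{\"name\":\"Basket Weaving\",\"data\":{\"Female\":1,\"Male\":1}}"
        };
        System.out.println(departmentJson("Frivolities", "{\"Female\":2,\"Male\":1}", majors, 2));
    }
}
```

Adding the comma before every element except the first is an alternative to appending a comma after each element and deleting the last character afterward; either approach avoids a trailing comma.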

Testing toJson()

Testing toJson() means you end up testing all the majors in Department. One test has been written for you, and you may want to add more. As with Major.java, you will want to use JSONLint to confirm the JSON String is valid!

Step 8 - CSVReader.java

By this point, you have successfully set up your data structure to store the information about majors in a structured form (technically semi-structured). But where do you get the data from? The goal is to read in from a CSV file, so your next step is to write the code for a simple CSVReader. It should be noted that we are assuming the data is well-formatted, and you don’t have to worry about quoted or escaped data.

Writing public CSVReader(String file) and Scanner

CSVReader will use a Scanner object to scan a file. However, the work is broken up into separate method calls, so the Scanner needs to be declared with the class instance variables at the top of the file. In the constructor, you will then initialize the Scanner with a new File based on the filename, which is the String file.

Don’t worry about the try-catch statement just yet. Just make sure you initialize your Scanner inside the try.

Pro Tip
Ask yourself can private Scanner fileScanner; (or whatever you called it) be final? Why or why not?

Writing public String[] getNext()

This method uses the Scanner to grab the entire next line (what method in Scanner gives you the next line?).

You will then take that String, and call .split on it. However, .split takes a String parameter: the delimiter you want to split on. Ask yourself, if it is a comma separated value file, what separates the values? Return the String array that .split returns. If the line is empty (hint: .isEmpty()), return null.
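A few quick experiments with String.split show why the empty-line check is worth doing separately. The class name SplitSketch and the sample values are made up for illustration.

```java
import java.util.Arrays;

// Small experiments with String.split on a single-character delimiter.
class SplitSketch {
    public static void main(String[] args) {
        // A well-formed row splits into one element per value.
        System.out.println(Arrays.toString("a,b,c".split(",")));   // [a, b, c]
        // An empty value in the middle is kept as an empty String.
        System.out.println("a,,c".split(",").length);              // 3
        // Trailing empty values are dropped by the one-argument split.
        System.out.println("a,b,,".split(",").length);             // 2
        // Splitting an empty line still yields one (empty) element, not an empty array.
        System.out.println("".split(",").length);                  // 1
    }
}
```

Since splitting an empty line produces an array holding a single empty String, checking .isEmpty() before splitting is what makes it possible to return null for blank lines.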

Writing .hasNext()

As long as the scanner is initialized (not null) and the file has more lines, this method should return true. You may want to use the scanner .hasNext() or .hasNextLine() method after checking for the null scanner object.
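Putting the three pieces together, a reader along these lines might look like the sketch below. It is an illustration under assumptions, not the provided CSVReader: the class name CsvReaderSketch, the error message, and the extra Scanner-based constructor (added only so the demo can run without a file) are all hypothetical.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

// Hypothetical sketch of a line-by-line CSV reader for well-formatted data.
class CsvReaderSketch {
    private Scanner fileScanner; // remains null if the file cannot be opened

    CsvReaderSketch(String file) {
        try {
            fileScanner = new Scanner(new File(file));
        } catch (FileNotFoundException e) {
            System.err.println("Could not open " + file);
        }
    }

    // Assumption for demonstration only: lets the sketch read from any Scanner.
    CsvReaderSketch(Scanner scanner) {
        fileScanner = scanner;
    }

    // True only when a Scanner exists and another line is available.
    boolean hasNext() {
        return fileScanner != null && fileScanner.hasNextLine();
    }

    // Returns the values on the next line, or null when that line is empty.
    String[] getNext() {
        String line = fileScanner.nextLine();
        if (line.isEmpty()) {
            return null;
        }
        return line.split(",");
    }

    public static void main(String[] args) {
        // A Scanner over a String stands in for a file here; the values are made up.
        CsvReaderSketch reader = new CsvReaderSketch(new Scanner("a,b,c\n\nd,e,f"));
        while (reader.hasNext()) {
            String[] row = reader.getNext();
            System.out.println(row == null ? "(empty line)" : String.join("|", row));
        }
    }
}
```

Callers are expected to check hasNext() before each getNext(), and to handle a null return for blank lines.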

Testing CSVReader

As all three methods are heavily linked, they must be tested together. A test is built for you in UnitTests, but be careful with the file name. Depending on where you stored test.csv, you may need to update UnitTests (or move the file). This is also true for the loadData test. Double check the location of the data file being tested. If everything is working correctly, you should see the following on your console.

{"201790":[{"code":"NS","name":"Natural Sciences","data":{"Female":2,"Male":0},"departments":[{"name":"Psychology","data":{"Female":2,"Male":0},"majors":[{"name":"Psychology","data":{"Female":2,"Male":0}}]}]},{"code":"EG","name":"Walter Scott College of Engr","data":{"Female":0,"Male":2},"departments":[{"name":"Mechanical Engineering","data":{"Female":0,"Male":1},"majors":[{"name":"Mechanical Engineering","data":{"Female":0,"Male":1}}]},{"name":"Walter Scott College of Engr","data":{"Female":0,"Male":1},"majors":[{"name":"Engineering Science","data":{"Female":0,"Male":1}}]}]}],"201810":[{"code":"NS","name":"Natural Sciences","data":{"Female":1,"Male":0},"departments":[{"name":"Psychology","data":{"Female":1,"Male":0},"majors":[{"name":"Psychology","data":{"Female":1,"Male":0}}]}]}],"201890":[{"code":"EG","name":"Walter Scott College of Engr","data":{"Female":1,"Male":2},"departments":[{"name":"Electrical and Computer Engr","data":{"Female":0,"Male":1},"majors":[{"name":"Electrical Engineering","data":{"Female":0,"Male":1}}]},{"name":"Mechanical Engineering","data":{"Female":0,"Male":1},"majors":[{"name":"Mechanical Engineering","data":{"Female":0,"Male":1}}]},{"name":"Walter Scott College of Engr","data":{"Female":1,"Male":0},"majors":[{"name":"Biomedical Engineering with EE","data":{"Female":1,"Male":0}}]}]}],"201910":[{"code":"NS","name":"Natural Sciences","data":{"Female":1,"Male":0},"departments":[{"name":"Biochemistry & Molecular Biol","data":{"Female":1,"Male":0},"majors":[{"name":"Biochemistry","data":{"Female":1,"Male":0}}]}]}]}
Saved data to file: testOut.json

Step 9 - Running your program

You have completed all the parts for this program. You should now run it by modifying the program arguments. Assuming you stored your .csv data in a data directory, your program arguments could look like the following:

-v data/CollegeData100sample.csv data/fullData100.json
-v data/CollegeData1000sample.csv data/fullData1000.json
data/CollegeData.csv data/fullData.json

You can open your .json files in Eclipse to see the output (though the -v option will also show you the output). If you look at CollegeStats.java you will notice no calculations are performed in stats. Instead, it simply crawls or traverses the data as you have structured it using loops.

Step 10 - Submitting and Reflection

You should click the link through Canvas to submit your code to Zybooks now. You will notice that the only files we are grading are CSVReader.java, Department.java, GenderCounter.java and Major.java. Many of our tests will be weird bounds cases, so make sure you test before submitting.

You should also work on your reflection now. What did you learn? How could the code be simplified? What if we had a student-level with more student information (GPA, etc)?

Going Further
This type of program is very common when working with Artificial Intelligence / Machine Learning or Data Science. While you aren’t actually doing any of the cool Machine Learning algorithms that both fields use (take CS 345, 445 or 440 for that!), you are working on an essential step of getting the data into a usable format. Data often has to be cleaned and organized before it can be analyzed. If you are interested in Artificial Intelligence, you should look at the concentration in computer science that focuses on it, as it is one of the few concentrations that modify your math requirements (Neural Network Backpropagation is an example of calculus used in CS), while Data Science majors spend even more time on the math side of the algorithms. Furthermore, those who minor in computer science have access to the ML and AI classes. CS 345 in particular is a great course for all minors, and a great supplement to any major.

CS 163/4: Java Programming (CS 1)

Computer Programming in Java: Topics include variables, assignment, expressions, operators, booleans, conditionals, characters and strings, control loops, arrays, objects and classes, file input/output, interfaces, recursion, inheritance, and sorting.