Introduction
When you talk about machine learning and data science, data wrangling becomes essential. What is data wrangling? It is taking data and putting it into a format that is easy to work with, structuring your data so the problem becomes easier to manage. Machine learning is about using data to create predictions and reacting to those predictions, so you can see how critical it is to wrangle that data into manageable formats. For this practical assignment, you are working for Institutional Research, a group that analyzes data across the university to help departments make funding decisions and plan strategic initiatives. In particular, you have been provided with enrollment counts and the gender (as listed in the university database at the time of retrieval) of every student officially majoring in an Engineering or Natural Sciences major at the university.
Your task will be to read data from a Comma Separated Value (CSV) file, store the data in a custom structure using arrays and objects, and then write that data back out in JSON format. CSV is called a “flat-data” format, while JSON represents semi-structured data. CSV is very common, and you have probably used it if you have used Microsoft Excel. JSON is extremely common in web applications, and is the standard for passing information between them.
You will learn about:
- More String.format
- Loops - for and for:each
- Objects (make heavy use of objects)
- Arrays
Provided Code
Similar to Practical 2, you will first download your files for use in Eclipse. If you would like to review how to set up your Eclipse environment and download the files, visit Practical 2 for a detailed explanation. As with Practical 2, there are a large number of files, and you will only work on four files in a much larger application. As with the previous practicals, once you get the code on your machine, you should go through and take the coding quiz in Canvas right away.
A note on adding the files to your project:
- src
  Your source folder should contain your .java files.
- data
  We recommend creating a data folder under the project root, and storing the following files in that folder:
  - CollegeData.csv
  - CollegeData10sample.csv (random sampling of only 10 students)
  - CollegeData100sample.csv (random sampling of only 100 students)
  - CollegeData1000sample.csv (random sampling of only 1000 students)
- project root
  You will want to put test.csv in the project root folder.
Running the Code
After you get the code set up, you will notice that running this application is slightly different from the previous practicals. The program doesn’t use a menu system; the entire design is simply to run the application, passing in an input file and saving out to the file system. This is extremely common when working with files and data wrangling. You can also pass a number of command-line options to turn on different features, which is also common for command-line programs. The arguments are as follows:
- Filename (any name)
  Loads the CSV file. For your CollegeData files, you will often use data/CollegeData.csv or data/CollegeData10sample.csv (etc.), because they are in the data folder under your project root folder.
- Output file (any name)
  A second command-line argument will cause the program to save out the data.
- Special args:
  - -v --verbose
    The verbose option enables printing when detailed output is desired (very common for command-line programs).
  - -d --debug
    The debug option enables printing of all debug statements.
  - --tests
    Run the unit tests even when file options are given (tests first, and then the program).
If you run the program without any command-line arguments, you will get the following display to your console:
It is highly suggested that you run with --tests in your program arguments when working through the various steps.
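To make the argument list above concrete, here is a minimal sketch of how a program might sort flags from file names. The class ArgsSketch and its fields are hypothetical names for illustration only; the provided application may handle its arguments quite differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: not the provided application's argument handling.
class ArgsSketch {
    boolean verbose;
    boolean debug;
    boolean tests;
    String inputFile;   // first non-flag argument, if any
    String outputFile;  // second non-flag argument, if any

    static ArgsSketch parse(String[] args) {
        ArgsSketch parsed = new ArgsSketch();
        List<String> files = new ArrayList<>();
        for (String arg : args) {
            switch (arg) {
                case "-v":
                case "--verbose":
                    parsed.verbose = true;
                    break;
                case "-d":
                case "--debug":
                    parsed.debug = true;
                    break;
                case "--tests":
                    parsed.tests = true;
                    break;
                default:
                    // Anything that is not a known flag is treated as a file name.
                    files.add(arg);
            }
        }
        if (files.size() > 0) {
            parsed.inputFile = files.get(0);
        }
        if (files.size() > 1) {
            parsed.outputFile = files.get(1);
        }
        return parsed;
    }

    public static void main(String[] args) {
        ArgsSketch parsed = parse(new String[] {"-v", "data/CollegeData10sample.csv", "data/out.json"});
        System.out.println(parsed.verbose + " " + parsed.inputFile + " " + parsed.outputFile);
    }
}
```

Because flags and file names are separated by value rather than by position, this sketch accepts the flags anywhere in the argument list.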
Step 1 - GenderCounter: toJson()
GenderCounter is an object used throughout the code to help track gender counts across colleges, departments, and majors. Using the divide, conquer, glue paradigm, the goal is to keep this class as simple as possible. Looking through the code, you will notice that while the data currently has only two gender definitions (admittedly, well behind the times), the class is designed to expand in the future. In fact, if you read the comments in the class, you can see the work members of the CS department have done to help the university move forward. Beyond that, you should take note of the following arrays:
public final static String[] OPTIONS = {"F", "M"};
public final static String[] FRIENDLY_NAMES = {"Female", "Male"};
private final int[] counts = new int[OPTIONS.length];
The OPTIONS and FRIENDLY_NAMES arrays are fixed, and the locations match up exactly. You will also notice by looking at .addCount(String gender) that the index into counts matches the index of that gender in OPTIONS. Essentially, index 0 is the shared index for F, Female, and the count of females in GenderCounter. It also means that if I wanted to print out all my friendly names, I could do the following:
for (int i = 0; i < OPTIONS.length; i++) {
    System.out.println(FRIENDLY_NAMES[i]);
}
Pro Tip
You could also make the options array dynamic as the data loads, which would be an even better way. However, that would have overcomplicated things for this assignment. It is also why a lot of data files also have data definitions that are shared with them listing the possibilities in each category.
Writing .toJson()
With the above information in mind, you will want to go down to .toJson(), and you will
notice it needs you to add some lines of code. What you will want to do is write a loop
that pairs FRIENDLY_NAMES[i] with counts[i], where i
is the shared index (see code example above). The
method expects you to return a JSON-friendly string.
The format for a JSON key:value pair is as follows:
- Strings (such as Female or Male) need to have double quotes around them.
- Numbers can just be the number without quotes.
- As you are pairing name to count (key to value), you want a colon (:) between them.
- To split different pairs, you want a comma between them.
For example, a Female count of 21 becomes the pair "Female":21.
You will notice the provided code already adds the curly brackets and removes the last comma. For example, if on leaving the loop you have the string "Female":21,"Male":22, (with a trailing comma), it will fix it to be "Female":21,"Male":22 by removing the last character. You can remove that line of code if you don’t use it, but you should iterate through the options array, as the options may change! Don’t hard-code grabbing just the first two elements.
As a reminder, to add a double quote (") to a String, use the escape sequence \". Also, here is a reminder about StringBuilder.
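The pairing rules above can be sketched on a generic pair of parallel arrays. The class JsonPairsSketch and the method pairsToJson are hypothetical names, and this is not the provided GenderCounter code; it only shows the loop, the escaped quotes, and the trailing-comma cleanup working together.

```java
// Hypothetical sketch: builds a JSON object from two parallel arrays.
class JsonPairsSketch {
    static String pairsToJson(String[] names, int[] counts) {
        StringBuilder json = new StringBuilder("{");
        for (int i = 0; i < names.length; i++) {
            // "name":count, with the key quoted and the number left bare
            json.append("\"").append(names[i]).append("\":").append(counts[i]).append(",");
        }
        if (names.length > 0) {
            // Remove the comma left behind by the final loop iteration.
            json.deleteCharAt(json.length() - 1);
        }
        json.append("}");
        return json.toString();
    }

    public static void main(String[] args) {
        String[] names = {"Red", "Blue"};
        int[] counts = {3, 5};
        System.out.println(pairsToJson(names, counts)); // prints {"Red":3,"Blue":5}
    }
}
```

Looping to names.length rather than a fixed number is what keeps the output correct if more entries are added to the arrays later.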
Testing .toJson()
It is very important you test this method, as an error can cascade throughout your code if you don’t. You will want to test it in two ways. First, run the program with the --tests program argument, and that will run the unit tests. You will notice we have only tested one option for GenderCounter, and that option matches the example above. You should test other options, to make sure it works.
The second test you will want to do is confirm your JSON format is correct. In order to do that, you are going to use a linter called JSONLint. When you visit the JSONLint website, you are presented with an option to paste in your JSON string. Go ahead and copy it, paste it into the text box, and click Validate JSON.
If the JSON format is correct, the results line will say “Valid JSON”. If it is not correct, it will attempt to detect the line with the error. In both cases, it will format your JSON to be more human-readable. Don’t worry about that change. JSON is meant to be stored without the spaces or returns as that makes the file size smaller. In fact, there should be no spaces in the JSON you generate for this method.
You will continue to come back to JSONLint, and you need to make sure your String is a properly formatted JSON String in each step, or the next steps become more difficult.
Step 2 - Major: getName() and addStudent(String gender)
In this step, you will implement two methods, each of which requires only one line of code.
getName()
Simply returns the string stored in name. You will notice name is a final variable that is set in the constructor, which is why there is no mutator method, only the accessor getName().
addStudent(String gender)
Adding a student is a void method that doesn’t return anything. Instead, the goal is to add to the gender counter by calling the addCount(String) method. You will notice at the top of Major, there is a genderCounter object. You will want to call addCount on that object.
Testing
You will notice that there are no tests written for these two methods. You should correct that and add some to the unit tests!
Step 3 - Major.toJson()
The toJson method is meant to return a String in the following JSON format.
Notice that the name value matches the return value of getName() and the data value matches the value of genderCounter.toJson(). None of this method is written for you, but we found String.format to be extremely useful. The only space that can show up in the return value is one that is part of the name of the major (e.g. Computer Science). You will not need a loop for this method, as you only have the two key:value pairs and they are fixed.
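As a rough illustration of how String.format can assemble two fixed key:value pairs, consider the sketch below. The key names "name" and "data" are an assumption taken from the wording "name value" and "data value" in this step, and FormatSketch is a hypothetical class, not the provided Major class.

```java
// Hypothetical sketch: two fixed key:value pairs built with String.format.
class FormatSketch {
    static String toJson(String name, String dataJson) {
        // %s is replaced by each argument in order. The name is wrapped in
        // escaped quotes; the nested JSON object is inserted without quotes.
        return String.format("{\"name\":\"%s\",\"data\":%s}", name, dataJson);
    }

    public static void main(String[] args) {
        String data = "{\"Female\":21,\"Male\":22}";
        System.out.println(toJson("Computer Science", data));
        // prints {"name":"Computer Science","data":{"Female":21,"Male":22}}
    }
}
```

Note that the only space in the output comes from the name itself, since the format string contains none.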
Pro Tip
name:value is called a key:value pair. The idea is that if you have the key, and the key is unique (which major names are), then the value is associated with it. This is covered more in data structures and Map-style collections. JavaScript makes heavy use of maps, as do most languages, including Java.
Testing .toJson()
Similar to the last .toJson method you wrote, you will want to test it both via the unit tests and by using JSONLint. You want to make sure it is formatted correctly, or the next toJson method will be incorrect. Note that if the linter says something is wrong in the String returned from genderCounter.toJson(), the error is probably not in that spot. Instead, similar to a semicolon error in Java, it is probably right next to that return value (missing quote, comma, etc.).
Step 4 - Department: getName() and getMajors()
You will want to implement both accessor methods, getName() and getMajors(). They both return the associated variables (name and majors), and each is only one line.
Testing
You will notice we did not add tests for these methods. You should do that now in unit tests!
Step 5 - Department: findMajor(String major)
This method is essential, and you can go to College.java for similar methods that find the department. What findMajor does is loop through the majors array, looking at the name of each major (hint: majors[i].getName()). If String major matches the name of the item in the array, return the major object (majors[i]). If you don’t find the major, return null (already done).
Overall, you are simply writing a loop that goes through the majors array. You can assume the majors are added to the array in order. You can also assume majorCount is accurate (look at createMajor(String major) to see how that works).
If you are getting a null pointer exception, it probably means you are not using the correct end condition for the for-loop. While the code is easier with a for loop, it is also possible to use a for:each loop and short-circuiting if statements to accomplish the same task. College.java uses that approach. You should check to make sure the majors array has values in it before attempting to index it. You should also check to see if a given index has a value in it.
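The search described in this step can be sketched on a generic, partially filled array. FindSketch, Item, and find are hypothetical names, not the provided Department or Major classes; the point is the loop bound and the null check.

```java
// Hypothetical sketch: linear search through a partially filled array.
class FindSketch {
    static class Item {
        private final String name;

        Item(String name) {
            this.name = name;
        }

        String getName() {
            return name;
        }
    }

    // count is the number of slots actually in use, which may be smaller
    // than items.length; unused slots hold null.
    static Item find(Item[] items, int count, String target) {
        for (int i = 0; i < count; i++) {
            // Guard against an empty slot before calling a method on it.
            if (items[i] != null && items[i].getName().equals(target)) {
                return items[i];
            }
        }
        return null; // not found
    }

    public static void main(String[] args) {
        Item[] items = new Item[5];
        items[0] = new Item("Biology");
        items[1] = new Item("Physics");
        System.out.println(find(items, 2, "Physics").getName()); // prints Physics
        System.out.println(find(items, 2, "Chemistry"));         // prints null
    }
}
```

Stopping at count instead of items.length is what avoids touching the unused null slots at the end of the array.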
Testing findMajor
You will find we did not create tests for findMajor, but this is a critical method. You may want to create your own tests in UnitTests.java to test this method before moving on. You may want to use createMajor to create the major, and then findMajor to see if you can find it.
Step 6 - Department: addStudent(String major, String gender)
You will now implement the addStudent method. First, you will want to make use of findOrCreateMajor(String major) to get the major, based on the name of the major passed in. Once you have the major, call addStudent(gender) on it, and also call addCount(gender) on the genderCounter, so you can track the total number of students in a department. For an example, take a look at addStudent in College.java.
Testing addStudent
This is a difficult method to test. However, you can write a test where you add a student, and then directly look at your majors array to see if the major is added and genderCounter to see if the counts increased.
Step 7 - Department: toJson()
It is now time to write the toJson() for the Department. The final JSON String will look like the following.
You will notice that in JSON an array (or list) of values (such as the majors array) has square brackets around it, similar to how you declare an array and how an array is commonly printed. Each value is then placed in the brackets, separated by commas. If you look at the provided code in .toJson(), we are also planning on using a StringBuilder for its memory savings. For the department, you will first want to get the key:value pairs of name:getName() and data:genderCounter.toJson(); you will then want to loop through the majors array, getting the toJson for each major!
Carefully look at the provided String above. Additionally, you can look at College.toJson() for an example of how to do it, if you are stuck. (try first!)
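To illustrate how a JSON array sits inside an object, here is a minimal sketch that takes already-built JSON strings and joins them in square brackets. JsonArraySketch is a hypothetical class, and the key name "majors" is an assumption made for illustration; check the expected String for this step to see the real key.

```java
// Hypothetical sketch: an object with two fixed pairs plus an array of children.
class JsonArraySketch {
    static String toJson(String name, String dataJson, String[] childJson, int childCount) {
        StringBuilder json = new StringBuilder();
        json.append("{\"name\":\"").append(name).append("\",");
        json.append("\"data\":").append(dataJson).append(",");
        json.append("\"majors\":["); // assumed key name
        for (int i = 0; i < childCount; i++) {
            if (i > 0) {
                json.append(","); // comma goes between values, never after the last
            }
            json.append(childJson[i]);
        }
        json.append("]}");
        return json.toString();
    }

    public static void main(String[] args) {
        String[] children = {"{\"name\":\"A\"}", "{\"name\":\"B\"}", null};
        System.out.println(toJson("Physics", "{\"Female\":1,\"Male\":2}", children, 2));
        // prints {"name":"Physics","data":{"Female":1,"Male":2},"majors":[{"name":"A"},{"name":"B"}]}
    }
}
```

Writing the comma before every value except the first is an alternative to appending a comma each time and deleting the last character afterwards; either approach produces the same String.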
Testing toJson()
Testing toJson() means you end up testing all the majors in Department. One test has been written for you, and you may want to add more. As with Major.java, you will want to use JSONLint to confirm the JSON String is valid!
Step 8 - CSVReader.java
By this point, you have successfully set up your data structure to store the information about majors in a structured form (technically semi-structured). So where do you get the data from? The goal is to read in from a CSV file, so your next step is to write the code for a simple CSVReader. It should be noted that we are assuming the data is well-formatted, and you don’t have to worry about quoted or escaped data.
Writing public CSVReader(String file) and Scanner
CSVReader will use a Scanner object to scan a file. However, the components are broken up into separate method calls, so the Scanner needs to be declared with the class instance variables at the top of the file. In the constructor, you will then initialize the scanner with a new File based on the filename, which is the String file.
Don’t worry about the try-catch statement just yet. Just make sure you initialize your Scanner inside the try.
Pro Tip
Ask yourself: can private Scanner fileScanner; (or whatever you called it) be final? Why or why not?
Writing public String[] getNext()
This method grabs the entire next line from the Scanner (what method in Scanner gives you the next line?). You will then take that String, and call .split on it. However, .split takes a parameter: the String you want to split on. Ask yourself, if it is a comma separated value file, what separates the values? Return the String array that .split returns. If the line is empty (hint: .isEmpty()), return null.
Writing .hasNext()
As long as the scanner is initialized (not null) and the file has more lines, this method should return true. You may want to use the scanner .hasNext() or .hasNextLine() method after checking for the null scanner object.
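The three pieces described above (constructor, getNext, hasNext) can be sketched together as follows. LineReaderSketch is a hypothetical class written for illustration, assuming well-formatted data with no quoted or escaped commas; the provided CSVReader skeleton may be organized differently.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.Scanner;

// Hypothetical sketch: a minimal line-by-line CSV reader.
class LineReaderSketch {
    private Scanner fileScanner; // stays null if the file cannot be opened

    LineReaderSketch(String file) {
        try {
            fileScanner = new Scanner(new File(file));
        } catch (FileNotFoundException e) {
            System.err.println("Could not open " + file);
        }
    }

    boolean hasNext() {
        // The null check comes first so a failed open never throws here.
        return fileScanner != null && fileScanner.hasNextLine();
    }

    // Callers are expected to check hasNext() before calling getNext().
    String[] getNext() {
        String line = fileScanner.nextLine();
        if (line.isEmpty()) {
            return null;
        }
        return line.split(",");
    }

    public static void main(String[] args) throws IOException {
        // Write a small temporary file so the sketch runs on its own.
        File temp = File.createTempFile("sketch", ".csv");
        temp.deleteOnExit();
        Files.write(temp.toPath(), Arrays.asList("Physics,F", "Biology,M"));

        LineReaderSketch reader = new LineReaderSketch(temp.getPath());
        while (reader.hasNext()) {
            System.out.println(Arrays.toString(reader.getNext()));
        }
    }
}
```

Because the Scanner is assigned inside a try block that can fail, it is worth thinking about how that interacts with the Pro Tip question about final.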
Testing CSVReader
As all three methods are heavily linked, they must be tested together. A test is built for you in UnitTests - but be careful on the file name. Depending on where you stored test.csv, you may need to update UnitTests (or move the file). This is also true for the loadData test. Double check the location of the data file being tested. If everything is working correctly, you should see the following on your console.
Step 9 - Running your program
You have completed all the parts for this program. You should now run it by modifying the program arguments. Assuming you stored your .csv data in a data directory, your program arguments could look like the following
-v data/CollegeData100sample.csv data/fullData100.json
-v data/CollegeData1000sample.csv data/fullData1000.json
data/CollegeData.csv data/fullData.json
You can open your .json files in Eclipse to see the output (though the -v option will also show you the output). If you look at CollegeStats.java, you will notice no calculations are performed in stats. Instead, it simply crawls (traverses) the data as you have structured it, using loops.
Step 10 - Submitting and Reflection
You should click the link through Canvas to submit your code to Zybooks now. You will notice that the only files we are grading are CSVReader.java, Department.java, GenderCounter.java and Major.java. Many of our tests will be weird bounds cases, so make sure you test before submitting.
You should also work on your reflection now. What did you learn? How could the code be simplified? What if we had a student-level with more student information (GPA, etc)?
Going Further
This type of program is very common when working with Artificial Intelligence / Machine Learning or Data Science. While you aren’t actually doing any of the cool Machine Learning algorithms that they both use (take CS 345, 445 or 440 for that!), you are working on an essential step: getting the data into a usable format. Data often has to be cleaned and organized before it can be analyzed. If you are interested in Artificial Intelligence, you should look at the computer science concentration that focuses on it, as it is one of the few concentrations that modify your math requirements (Neural Network Backpropagation is an example of calculus used in CS), while Data Science majors spend even more time on the math side of the algorithms. Furthermore, those who minor in computer science have access to the ML and AI classes. CS 345 in particular is a great course for all minors, and a great supplement to any major.