Introduction
When you talk about machine learning and data science, data wrangling becomes essential. What is data wrangling? It is taking data and putting it into a format that is easy to work with, structuring your data so the problem becomes easier to manage. Machine learning is about using data to create predictions and reacting to those predictions, so you can see why wrangling that data into manageable formats is critical.
You will learn about:
- More String.format
- Loops - for and for:each
- Objects (makes heavy use of objects)
- Arrays
Scenario
You are working for Institutional Research, a group that analyzes data across the university to help departments make funding decisions and plan strategic initiatives. In particular, you have been provided with enrollment data listing the self-identified gender of every student officially majoring in Engineering or Natural Sciences at the university.
The file provided (in zybooks, click ‘download files’) contains the following columns, with the data separated by commas (called a CSV, or comma-separated values, file).
- PRIMARY_COLLEGE - The college code (NS, EG)
- PRIMARY_COLLEGE_DESC - The name of the college (unused for this assignment)
- PRIMARY_DEPARTMENT_DESC - The name of the department in the college (unused)
- PRIMARY_MAJOR_DESC - The primary major of the student
- TERM - The term the data was collected
- GENDER - The self-identified gender of the student
There are 52,304 lines in the file! It is essential to build a program to help you calculate the percentage of Male vs. Female vs. Non-binary identifying students, and then present that information in an easy-to-read manner.
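As a minimal sketch of the percentage calculation described above: the class and method names here are hypothetical (the javadoc defines the real ones), the counts are made up, and treating the total as male plus female is an assumption for illustration only.

```java
import java.util.Locale;

class PercentSketch {
    // Returns part/total as a percentage; 0.0 when total is 0 to avoid dividing by zero.
    static double percent(int part, int total) {
        if (total == 0) {
            return 0.0;
        }
        // Multiply by 100.0 first so the division happens in floating point,
        // not integer arithmetic (3 / 4 would be 0 with ints).
        return 100.0 * part / total;
    }

    // Formats a percentage with two decimal places and a percent sign.
    static String formatPercent(double value) {
        // %.2f keeps two decimal places; %% prints a literal percent sign.
        return String.format(Locale.US, "%.2f%%", value);
    }

    public static void main(String[] args) {
        int male = 3;   // made-up counts for illustration
        int female = 1;
        int total = male + female; // assumption: total is male + female
        System.out.println(formatPercent(percent(male, total)));   // 75.00%
        System.out.println(formatPercent(percent(female, total))); // 25.00%
    }
}
```

The main point is the `100.0`: with two `int` values, `male / total` would truncate to 0 before the multiplication ever happened.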
Requirements
- The program will take in two inputs via System.in
- The name of the file to load. Defaults to STEM_Diversity_Data.csv if no file is given
- The identifier of the term. Defaults to 202110 (Spring 2021)
- For reference, university terms are defined as YearStartMonth0 (the four-digit year, the starting month, then a 0), so 202090 is Fall 2020, 202110 is Spring 2021, and 201960 is Summer 2019. You can look through the file to see various terms that are included.
- The prompts for the inputs are
Enter a file to load (STEM_Diversity_Data.csv):
Term (202110):
- The program will output a table for both Natural Sciences and Engineering to System.out.
- There is no need to sort the entries within the College (challenge bonus: sort them!)
- The program can immediately exit after the table is printed
- If an invalid file or term is given, the program can just exit with an error message.
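The default-value prompts and the term code format from the requirements above could be handled along these lines. This is only a sketch: `readWithDefault` and `describeTerm` are hypothetical helper names, not methods from the javadoc, and the month-to-season mapping covers only the three codes mentioned above (1, 6, 9).

```java
import java.util.Scanner;

class InputSketch {
    // Prints the prompt, reads one line, and falls back to the default
    // when the line is blank or there is no more input.
    static String readWithDefault(Scanner in, String prompt, String defaultValue) {
        System.out.print(prompt);
        if (!in.hasNextLine()) {
            return defaultValue;
        }
        String line = in.nextLine().trim();
        return line.isEmpty() ? defaultValue : line;
    }

    // Decodes a YearStartMonth0 code such as 202090 into "Fall 2020".
    static String describeTerm(int term) {
        int year = term / 100;        // 202090 -> 2020
        int month = (term / 10) % 10; // 202090 -> 9
        String season;
        switch (month) {
            case 1: season = "Spring"; break;
            case 6: season = "Summer"; break;
            case 9: season = "Fall"; break;
            default: season = "Unknown"; // codes not listed in the handout
        }
        return season + " " + year;
    }

    public static void main(String[] args) {
        Scanner in = new Scanner(System.in);
        String file = readWithDefault(in,
                "Enter a file to load (STEM_Diversity_Data.csv): ", "STEM_Diversity_Data.csv");
        String term = readWithDefault(in, "Term (202110): ", "202110");
        try {
            System.out.println(file + " / " + describeTerm(Integer.parseInt(term)));
        } catch (NumberFormatException e) {
            System.out.println("Error: term must be a number, got " + term);
        }
    }
}
```

Reading whole lines with `nextLine()` rather than `next()` is what lets an empty return count as "use the default".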
Here are some sample input and outputs of the program running:
Enter a file to load (STEM_Diversity_Data.csv):
Term (202110):
Natural Sciences: Major Male Female
Psychology 23.70% 76.30%
Zoology 17.61% 82.39%
Data Science 70.27% 29.73%
Biological Science 27.71% 72.29%
Computer Science 83.61% 16.39%
Statistics 65.08% 34.92%
Chemistry 44.53% 55.47%
Biochemistry 40.07% 59.93%
Mathematics 54.67% 45.33%
Applied Computing Technology 85.07% 14.93%
Physics 80.56% 19.44%
Natural Sciences 41.67% 58.33%
Engineering: Major Male Female
Biomedical Engineering with ME 53.08% 46.92%
Biomedical Engineering with EE 66.67% 33.33%
Mechanical Engineering 87.55% 12.45%
Chemical & Biological Engineer 63.49% 36.51%
Electrical Engineering 88.89% 11.11%
Computer Engineering 91.53% 8.47%
Civil Engineering 75.98% 24.02%
Environmental Engineering 53.26% 46.74%
Biomedical Engineering with CB 42.47% 57.53%
Engrg Sci and Intl Studies 42.86% 57.14%
Engineering Science 71.43% 28.57%
Biomedical Engineering with EL 66.67% 33.33%
Engineering Open Option 100.00% 0.00%
In the example above, the person just hit return at each prompt and didn’t enter anything else. In the one below, the client types in a file name and a different term.
Enter a file to load (STEM_Diversity_Data.csv): STEM_Diversity_Data.csv
Term (202110): 201990
Natural Sciences: Major Male Female
Psychology 24.49% 75.51%
Zoology 19.77% 80.23%
Biological Science 29.86% 70.14%
Computer Science 86.75% 13.25%
Statistics 62.07% 37.93%
Biochemistry 42.36% 57.64%
Chemistry 47.88% 52.12%
Mathematics 58.26% 41.74%
Applied Computing Technology 81.51% 18.49%
Physics 81.63% 18.37%
Data Science 75.56% 24.44%
Natural Sciences 50.00% 50.00%
Engineering: Major Male Female
Electrical Engineering 88.72% 11.28%
Biomedical Engineering with EE 61.11% 38.89%
Mechanical Engineering 86.78% 13.22%
Biomedical Engineering with ME 55.16% 44.84%
Civil Engineering 74.89% 25.11%
Chemical & Biological Engineer 64.41% 35.59%
Computer Engineering 92.41% 7.59%
Engineering Open Option 80.30% 19.70%
Biomedical Engineering with CB 43.13% 56.88%
Environmental Engineering 48.60% 51.40%
Engineering Science 78.38% 21.62%
Engrg Sci and Intl Studies 28.57% 71.43%
Biomedical Engineering with EL 100.00% 0.00%
The order of your answers may vary, but the percentages should not. Our tests will randomly take lines out to test them individually.
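Since rows may be checked one at a time, it can help to build each row with a single String.format call. The sketch below shows one way to do that. The column widths (32 and 10) and the name `formatRow` are placeholders chosen for illustration; the javadoc defines the actual layout.

```java
import java.util.Locale;

class RowSketch {
    // Builds one table row: a left-aligned name column followed by two
    // right-aligned percentage columns. Widths 32 and 10 are placeholders.
    static String formatRow(String major, double malePercent, double femalePercent) {
        return String.format(Locale.US, "%-32s%10s%10s",
                major,
                String.format(Locale.US, "%.2f%%", malePercent),
                String.format(Locale.US, "%.2f%%", femalePercent));
    }

    public static void main(String[] args) {
        // Header row uses the same widths so the columns line up.
        System.out.println(String.format("%-32s%10s%10s", "Major", "Male", "Female"));
        System.out.println(formatRow("Physics", 80.56, 19.44));
        System.out.println(formatRow("Engineering Open Option", 100.0, 0.0));
    }
}
```

In a format string, `%-32s` pads on the right (left-aligned text) and `%10s` pads on the left (right-aligned text), which is why the percentages line up on their last digit.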
Why only Male vs. Female?
Wait, didn’t we mention Non-Binary was an option? It is an option to identify as at the university, after many years of hard work from faculty (some from Computer Science) to get various other ways to identify listed on the admissions paperwork. However, that does not mean it is selected frequently, and many students don’t know they can choose to change it in Ramweb (Pride Resource Center Howto). As such, the demographic data tends to match what we get from high schools and other information the university gets. To complicate it even more, this option only became available a couple of years ago, and there are legal ramifications on how it can be updated. We left it out of the table because, with rounding, it came up as 0% for every major (the actual value is not zero). It is also an important reminder that behind all the data, there are people, and people are more than a statistic. Good data analysis includes learning about the narrative behind the data, and validating the data collected.
Coding Specifications
For grading purposes, we are asking you to follow this format. Furthermore, the format emphasizes the design of the program, focusing on storing the data in objects, and then printing by grabbing the data from the objects. This is a very standard approach. Also, you will notice that while in Practical 2 you created 3 files from scratch, in this practical you will be creating 5 files from scratch.
Full breakdown for the required methods can be found in the javadoc. Make sure to map out how the files are interacting, and work with a TA early.
Five classes will be made and submitted:
- Main.java - the main driver of the program, but does a bit more work than past mains.
- CSVReader - Helps you read a Comma Separated Value file by using a Scanner to grab each line, and String.split to break up the lines into String arrays.
- Data - Uses CSVReader to read the data, and keeps track of the CollegeDemographics as an array of two elements. Prints out table when requested.
- CollegeDemographics - Keeps track of the various MajorDemographics (majors) in a single College. Builds a String value of the table to print out unique to each college.
- MajorDemographics - Keeps the total gender counts for a major, so percents can be easily calculated when asked.
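To picture how a college object can hold an array of per-major counters, here is a rough sketch of a "find or add" pattern. `MajorTally`, `CollegeTally`, and `findOrAdd` are hypothetical stand-ins, not the required classes or methods from the javadoc, and the gender strings "Male" and "Female" are an assumption about how the file spells them.

```java
import java.util.Arrays;

class TallySketch {
    // Hypothetical stand-in for a per-major counter; not the javadoc's API.
    static class MajorTally {
        final String major;
        int male;
        int female;

        MajorTally(String major) {
            this.major = major;
        }

        // Assumes the data spells genders as "Male" and "Female".
        void add(String gender) {
            if (gender.equals("Male")) {
                male++;
            } else if (gender.equals("Female")) {
                female++;
            }
        }
    }

    // Hypothetical stand-in for a per-college collection backed by an array.
    static class CollegeTally {
        private MajorTally[] majors = new MajorTally[4];
        private int size = 0;

        // Returns the tally for this major, creating it the first time it is seen.
        MajorTally findOrAdd(String major) {
            for (int i = 0; i < size; i++) {
                if (majors[i].major.equals(major)) {
                    return majors[i];
                }
            }
            if (size == majors.length) {
                majors = Arrays.copyOf(majors, size * 2); // grow when full
            }
            majors[size] = new MajorTally(major);
            return majors[size++];
        }

        int size() {
            return size;
        }
    }

    public static void main(String[] args) {
        // Made-up {major, gender} records for illustration.
        String[][] records = {
            {"Physics", "Male"}, {"Physics", "Female"},
            {"Chemistry", "Female"}, {"Physics", "Male"}
        };
        CollegeTally college = new CollegeTally();
        for (String[] record : records) {
            college.findOrAdd(record[0]).add(record[1]);
        }
        MajorTally physics = college.findOrAdd("Physics");
        System.out.println(physics.major + ": " + physics.male + " male, " + physics.female + " female");
    }
}
```

With the made-up records above, this prints `Physics: 2 male, 1 female`. The design idea is that each object only tracks its own level: the major counts people, and the college decides which major a record belongs to.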
Like Practical 2, you will also be asked to write tests in separate main methods for your classes. It is always important to test thoroughly before submitting.
Where to Start?
Here are some hints on how to start
- Write down on paper what you want to do!
- Do you understand the specifications? If not, ask on MS Teams!
- Draw out the program flow. Can you picture how the program will work?
The picture doesn’t have to be clear, but seeing a picture can help you see how classes interact.
- Sometimes it helps to develop empty classes, with comments on what to do in your own words in each class (you can even turn them in, to get your “does it compile” point)
- Look at the problems, divide and conquer.
- What do you know how to do?
- What is ‘self-contained’, meaning you can write it without dependence on the other classes?
- MajorDemographics - does not rely on the other classes to work, so could be a starting place. You can write, and test to make sure it works before moving on.
- CSVReader - does not rely on any of the other classes to work. You could start with reading a CSV file (write your own, make it only 3 lines!) and printing out the contents of that file to the screen. You now know how to read a file, which is a major component of this assignment.
- With those two classes done, you can then build from there!
- A quick reminder, when working with most IDEs (IntelliJ and Eclipse), the file path reads from the root project directory (not src), so you should place your CSV file in the project root.
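As a warm-up along the lines of the CSVReader hint above, the sketch below writes a made-up three-line CSV file and prints it back using a Scanner and String.split. The class and method names are hypothetical, the file uses a reduced set of columns, and it assumes no field contains a comma inside quotes.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.PrintWriter;
import java.util.Scanner;

class CsvSketch {
    // Splits one CSV line on commas. The -1 limit keeps trailing empty fields.
    // Assumes no quoted fields that contain commas.
    static String[] splitLine(String line) {
        return line.split(",", -1);
    }

    // Prints every row of the file with each field in brackets,
    // and returns the number of lines read.
    static int printRows(File file) throws FileNotFoundException {
        int rows = 0;
        try (Scanner in = new Scanner(file)) {
            while (in.hasNextLine()) {
                String[] fields = splitLine(in.nextLine());
                for (String field : fields) {
                    System.out.print("[" + field.trim() + "] ");
                }
                System.out.println();
                rows++;
            }
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        // Write a made-up three-line file so the sketch runs on its own.
        File file = File.createTempFile("tiny", ".csv");
        file.deleteOnExit();
        try (PrintWriter out = new PrintWriter(file)) {
            out.println("PRIMARY_COLLEGE,PRIMARY_MAJOR_DESC,TERM,GENDER");
            out.println("NS,Physics,202110,Female");
            out.println("EG,Civil Engineering,202110,Male");
        }
        System.out.println(printRows(file) + " rows read");
    }
}
```

Note that the first line of a CSV file is often a header row, so a real reader usually needs to skip or handle it separately.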
The biggest problem when starting with code is getting lost in a “where to start mental loop”. Take your time to think it through, but don’t take too long. You should start writing code, even if the code is simply to help you figure out how to do something (read a file for example, or get client input). That type of “warm up” will then get your brain working on how to work on the entire problem. Basically, don’t stare at the problem and do nothing. It doesn’t hurt to try, fail, try again.
Going Further
This type of program is very common when working with Artificial Intelligence / Machine Learning or Data Science. While you aren’t actually doing any of the cool Machine Learning algorithms that they both use (take CS 345, CS 445, or CS 440 for that!), you are working on an essential step of getting the data into a usable format. Data often has to be cleaned and organized before it can be analyzed. If you are interested in Artificial Intelligence, you should look at the concentration in computer science that focuses on it, as it is one of the few concentrations that modify your math requirements (Neural Network Backpropagation is an example of calculus used in CS), while Data Science majors spend even more time on the math side of the algorithms. Furthermore, those who minor in computer science have access to the ML and AI classes. CS 345 in particular is a great course for all minors, and a great supplement to any major.