The objective of this assignment is to refresh your knowledge of how to solve programming problems involving loops, arrays, methods, and files. You will also learn/review the use of extremely useful methods for string and token processing.
Analysis of social media data is a major topic of research these days, both in industry and in academia. In this assignment you will perform a simple analysis of tweets. Your task is to read a file containing tweets and find the most commonly occurring word in those tweets. Since words like "the", "in", etc., occur often and are not very informative, it is common to ignore these very common words, which are called stop-words.
Create a program called P1.java.
Add the following methods to the program.
Here are some steps to get you started on file processing, tweet processing, and testing.
To read and process a file, use the following Scanner pattern:
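    // requires: import java.io.File; and import java.util.Scanner;
    Scanner lineScanner = new Scanner(file); // file is an instance of class File
    while (lineScanner.hasNextLine()) {
        String line = lineScanner.nextLine();
        // process the line
    }
    lineScanner.close(); // release the file when done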
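The pattern above can be turned into a small reusable helper. The following is only a minimal sketch: the class name LineReaderSketch and the method readLines are hypothetical and are not part of the methods required by the assignment. It collects the lines in an ArrayList because the number of lines is not known in advance, then converts the list to an array.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper class for illustration; not part of the required API.
class LineReaderSketch {

    // Collects every remaining line from the given Scanner into an array.
    static String[] readLines(Scanner lineScanner) {
        List<String> lines = new ArrayList<>();
        while (lineScanner.hasNextLine()) {
            lines.add(lineScanner.nextLine());
        }
        return lines.toArray(new String[0]);
    }

    public static void main(String[] args) throws FileNotFoundException {
        // The file name comes from the command line rather than being hardcoded.
        if (args.length != 1) {
            System.out.println("usage: java LineReaderSketch <fileName>");
            return;
        }
        // try-with-resources closes the Scanner, and the file, automatically.
        try (Scanner lineScanner = new Scanner(new File(args[0]))) {
            String[] lines = readLines(lineScanner);
            System.out.println("Read " + lines.length + " lines");
        }
    }
}
```

Passing the Scanner in, instead of a file name, keeps the helper easy to test: a Scanner can also be built directly from a String.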
In a stop-words file, each line contains a stop-word. In a tweets file, each line contains three parts. You will need to extract the third part (i.e., the actual tweet) by using the fact that tabs are used to separate the three parts. You can use a StringTokenizer or the split method defined in the class String.
Using a StringTokenizer: Suppose lineBeingProcessed is a String.
    import java.util.StringTokenizer; // must appear at the top of the file, before the class declaration
    //...
    //...
    StringTokenizer st = new StringTokenizer(lineBeingProcessed, "\t");
    if (st.countTokens() >= 3) {
        st.nextToken();                // first part (not needed)
        st.nextToken();                // second part (not needed)
        String tweet = st.nextToken(); // the third call returns the tweet
        // do something with the tweet
    }
Using the split method in String: Suppose lineBeingProcessed is a String.
    String[] parts = lineBeingProcessed.split("\t");
    // parts[2] contains the tweet
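As a slightly more defensive variant of the split approach, the sketch below wraps the extraction in a helper that also copes with lines that have fewer than three parts. The names TweetFieldSketch and extractTweet are hypothetical, and "field1" and "field2" are placeholders, since the contents of the first two parts are not described here. Passing a limit of 3 to split is a design choice made for this sketch: it assumes that any further tabs belong to the tweet text.

```java
// Hypothetical helper class for illustration; not part of the required API.
class TweetFieldSketch {

    // Returns the third tab-separated part of a line,
    // or null if the line has fewer than three parts.
    static String extractTweet(String lineBeingProcessed) {
        // A limit of 3 stops splitting after the second tab, so any
        // additional tabs stay inside the tweet text (an assumption).
        String[] parts = lineBeingProcessed.split("\t", 3);
        if (parts.length < 3) {
            return null;
        }
        return parts[2];
    }

    public static void main(String[] args) {
        // "field1" and "field2" are placeholders for the first two parts.
        String line = "field1\tfield2\tHello #world";
        System.out.println(extractTweet(line)); // prints "Hello #world"
    }
}
```

Returning null for malformed lines lets the caller decide whether to skip the line or report it.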
Your methods need to ignore punctuation symbols and spaces. Assume that the possible punctuation symbols are contained in this set: {".", "-", ",", "!", "?", "*"}. Tweets will often contain the hashtag symbol ("#") and the "@" symbol. Do not ignore them. Leave those as part of the word to which they are attached, since they are not used as regular words. To parse a tweet contained in a variable t ignoring punctuation, use the Scanner in the following way:
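    Scanner s = new Scanner(t).useDelimiter("[ *,!?.-]+");
    while (s.hasNext()) {
        String word = s.next();
        // process the word
    }

The argument of useDelimiter requires some explanation. It is what is called a regular expression in computer science. The square brackets define a set of characters, and the + means "one or more occurrences", so the delimiter is any run of one or more of the characters listed within the brackets. The hyphen is written last inside the brackets so that it is treated as a literal character; written between two other characters, as in *-, it would instead denote a range of characters and the hyphen itself would not act as a delimiter.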
Your solution should be case-insensitive: "tail" and "Tail" both count as occurrences of the same word. This is easy to achieve using the String method toLowerCase().
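The two ideas above, splitting on punctuation and normalizing case, can be combined in one helper. This is a minimal sketch under stated assumptions: the class name TweetWordsSketch and the method words are hypothetical, and the delimiter covers exactly the punctuation set listed earlier plus the space. Note that the hyphen is placed last in the character class, where it is a literal hyphen; in the middle of the class, between * and the comma, it would form a range.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper class for illustration; not part of the required API.
class TweetWordsSketch {

    // Splits a tweet into lowercase words, treating spaces and the
    // punctuation symbols . - , ! ? * as delimiters.
    static String[] words(String t) {
        List<String> result = new ArrayList<>();
        // The hyphen is last in the brackets so it is taken literally.
        try (Scanner s = new Scanner(t).useDelimiter("[ *,!?.-]+")) {
            while (s.hasNext()) {
                // Lowercasing makes "Tail" and "tail" the same word.
                result.add(s.next().toLowerCase());
            }
        }
        return result.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String tweet = "Tail-wagging @Dog, the #tail!";
        // prints [tail, wagging, @dog, the, #tail]
        System.out.println(Arrays.toString(words(tweet)));
    }
}
```

The "#" and "@" symbols are not in the delimiter set, so they stay attached to their words, as required.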
To help you develop your program, here is a tweets file and a stop-words file. Preliminary testing using the auto-grading system will be performed using these files. (You get the results of preliminary testing almost instantaneously using the online submission system.) For final grading of your program we will use different files (both filename and content), so do not hardcode any parameters into your program. We encourage you to construct your own tweet files that test various scenarios (e.g., that you successfully ignore case).
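One way to construct such a test file is to generate it from a program, so that the expected answer is known in advance. The sketch below is an illustration only: the class and method names are hypothetical, and "field1" and "field2" are placeholders for the first two tab-separated parts of each line. In this data "cat" appears three times, always in lowercase, while "tail" appears four times in four different capitalizations, so a case-insensitive program should report "tail" and a case-sensitive one would report "cat".

```java
import java.io.FileNotFoundException;
import java.io.PrintWriter;

// Hypothetical test-data generator for illustration; not part of the required API.
class CaseTestFileSketch {

    // Three tweet lines in the three-part, tab-separated layout.
    // "field1" and "field2" are placeholders for the first two parts.
    static String[] caseTestLines() {
        return new String[] {
            "field1\tfield2\tcat Tail cat",
            "field1\tfield2\tTAIL cat tail",
            "field1\tfield2\tthe tAil"
        };
    }

    // Writes the given lines to the named file, one per line.
    static void writeLines(String fileName, String[] lines) throws FileNotFoundException {
        try (PrintWriter out = new PrintWriter(fileName)) {
            for (String line : lines) {
                out.println(line);
            }
        }
    }

    public static void main(String[] args) throws FileNotFoundException {
        // The output file name is taken from the command line.
        if (args.length != 1) {
            System.out.println("usage: java CaseTestFileSketch <outputFileName>");
            return;
        }
        writeLines(args[0], caseTestLines());
    }
}
```

Similar small files can be written for other scenarios, such as punctuation attached to words or a stop-word that is the most frequent word overall.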
Keep in mind that we will not test your main method. The methods you implement will be tested directly and separately. However, you should write your main method to test the methods that you write. A barebones main can include something like:
    public static void main(String[] args) {
        P1 p1 = new P1();
        String[] tweets = p1.readTweets("tweets.txt");
        String[] stopWords = p1.readStopWords("stopWords.txt");
        System.out.println(p1.mostCommonWord(tweets));
        System.out.println(p1.mostCommonWordExcludingStopWords(tweets, stopWords));
    }
You will also need to determine the most common word manually in order to check the output of your program.
During preliminary testing, your score equals the number of test cases passed. However, during final testing, certain test cases may be weighted more than others. More difficult methods will be worth more points.
Submit the file P1.java via the online checkin system. This system performs preliminary testing of your program on the same data files as the ones provided above. Final grading will be performed on a different set of files (both filenames and contents may differ).
The Twitter dataset we provided to you is a subset of a much larger dataset, which is available here.