CS542 (Fall 2025) Final Project
Proposal due October 28, 2025
Lightning Presentations December 10, 2025
Final Submission due December 17, 2025
Your project might involve some of the following:
Apply an NLP technique we have discussed in class to a task or dataset of interest to you.
e.g., Anything we have discussed in class would be appropriate as long as it is scoped to be commensurate with a six-week project. Some sources of data are listed below.
Download code for an NLP technique we have not discussed in depth in class and apply it to a task or dataset of interest.
e.g., We did not discuss CRFs in much depth, and only devoted half a lecture to them, but they have demonstrated success on a number of tasks.
Attempt to replicate some previous finding in the field.
e.g., Find a paper with an available codebase, and attempt to replicate their process. Were you able to reproduce the results? Why or why not? Do you agree with the findings they made in the paper?
Summarize 5 research papers about some NLP task or subdiscipline and write a report with a detailed discussion on what you have learned (so you can do a project that doesn’t involve coding).
e.g., Choose an NLP subarea or task like speech recognition, multimodality, translation, parsing, low-resourced languages, etc., and find 5 key papers on that topic. Summarize each paper, discuss commonalities and differences in the respective approaches, and synthesize all the papers into a distinct conclusion about that subdiscipline or task and approaches to it.
This is not an exhaustive list, but just a set of representative examples. You should choose something that motivates you enough to keep you interested. The project is a chance to maximize your success as long as you stay within the scope of the class material. If what you propose to do falls clearly into one of the four categories above, it’s likely to be approved much faster without questions.
In your project proposal, you should make reference to the following questions where appropriate:
What scientific or empirical questions are you seeking to answer?
What hypotheses can you make about the data or method you will be exploring? (For a research report, this may be less applicable.)
What are common approaches in the field to your problem? What appear to be their advantages or drawbacks?
You may adapt an existing project, either for another class or an independent study, for this class, if you wish. The only stipulation is that you must make it relevant to NLP. What that means is that your project must involve language data, or something that shares enough similarity with language data that you can adapt an NLP approach or technique to it. As language data is almost uniformly sequence data, generally anything involving sequences is probably fair game, such as audio data or DNA data. Straight computer vision or bioinformatics projects are out of scope (we have other classes for that), though if you were to do something cool involving vision and language, I would probably sing your praises.
It is not required that you produce conference paper-level work for this project. You will have about a month and a half for this, and it should be scoped appropriately. For coding projects, you will not be graded on the success or failure of your approach, but on your ability to do the work and perform analysis. If you want a general idea of the scope, you can use the amount of coding work on PA2 as a starting point: implementing a distinct algorithm pretty much from start to finish, and applying it to a dataset. This level of work plus an accompanying write-up analyzing the methods and results would be a sufficient minimum level of effort for a coding project. For a purely writing project, you should consider around 5 typed pages to be the approximate minimum, with about a page to summarize each paper and the rest to present your synthesis and conclusions. You can of course write more.
To find datasets and code, some good resources are:
https://www.paperswithcode.com. The premier site for finding papers with associated codebases, as well as current leaderboards on common tasks. Search for “natural language,” “natural language processing,” or related terms, or tasks like “sentiment analysis,” “named entity extraction,” or “speech recognition.”
https://huggingface.co. Really good repository of datasets for language tasks. Search for models or datasets using terms similar to the task search terms above.
http://archive.ics.uci.edu/ml/index.php. UCI Machine Learning Repository. Contains general machine learning datasets, but a search for “language” or related terms will turn up appropriate datasets.
To find papers, other good resources are:
https://scholar.google.com. Of course.
https://aclanthology.org. NLP-specific. The ACL Anthology hosts over 70,000 papers on the study of computational linguistics and natural language processing from all major venues for the past few decades. Searching works similarly to Google Scholar. This is where I find most of the papers for the paper discussions.
You are not restricted to these sources.
Grading will be as follows:
You will not be graded on the success or failure of what you propose. If you try something and it doesn’t work out, you will not be penalized. What matters is your effort and the analysis you do. A negative result is still a result, but delve into the reasons behind the failure. Depending on whether you do a coding project or a report project, grading will differ slightly:
For research report proposals, if you do not include citations in your proposal, I will send it back to you to include those. You will receive credit once these are included.
Start with the outcomes: list what you learned, for example about the relationship between data, task, and approach(es) taken.
Enumerate the mediating processes: what artifacts did you produce in the course of doing the project? For example, for a coding project this might be data processing pipelines or experimental procedures. For a written report this might be notes on the different papers connecting them to each other and your conclusions. How did you go about coming to understand the raw materials (the papers or data)?
Connect your mediating processes to raw artifacts: in this case this is probably limited to the papers or raw datasets/starter code, but one thing to consider discussing is how those artifacts are presented. Are they presented in ways that are easy or difficult to understand? Was there anything in particular about these artifacts that made you use the mediating processes you did?
We will discuss the conjecture mapping process in class on October 1, 2025. Watch the recording if you need more review or examples. I’d caution you not to overthink this. The goal is to reconstruct your learning process so I can understand how you thought about the project and your own learning. The conjecture map just provides a specific process on how to go about that.
If you add the points up you’ll find the total is 240 points, but in the syllabus the project is worth 200 points. Therefore it is possible to earn more than 100% on the final project (up to 240/200 = 120%). Use this to your advantage.
There will be separate submission boxes on Canvas for the project proposal, final report, and presentation slides/videos where applicable. Proposals should be in PDF format only. Final reports should be in PDF format, or a single Jupyter notebook. Coding projects will need to include essential code that you wrote, or links to code that you used if you replicated something out of the box. As mentioned, we will most likely not attempt to run all your code in normal circumstances, but we need to be able to verify the level of effort you assert it took.
For group projects, only one person needs to submit anything. Just make sure all names are clearly on it so we can assign credit properly.