CS542 (Fall 2025) Final Project

Proposal due October 28, 2025

Lightning Presentations December 10, 2025

Final Submission due December 17, 2025

Description
Your project might involve some of the following:
- Apply an NLP technique we have discussed in class to a task or dataset of interest to you.
  - e.g., Anything we have discussed in class is appropriate, as long as it is scoped for a six-week project. Some sources of data are listed below.
- Download code for an NLP technique we have not discussed in depth in class and apply it to a task or dataset of interest.
  - e.g., We did not discuss CRFs in much depth, and only devoted half a lecture to them, but they have demonstrated success on a number of tasks.
- Attempt to replicate some previous finding in the field.
  - e.g., Find a paper with an available codebase, and attempt to replicate their process. Were you able to reproduce the results? Why or why not? Do you agree with the findings they made in the paper?
- Summarize 5 research papers about some NLP task or subdiscipline and write a report with a detailed discussion on what you have learned (so you can do a project that doesn’t involve coding).
  - e.g., Choose an NLP subarea or task like speech recognition, multimodality, translation, parsing, low-resourced languages, etc., and find 5 key papers on that topic. Summarize each paper, discuss commonalities and differences in the respective approaches, and synthesize all the papers into a distinct conclusion about that subdiscipline or task and approaches to it.

This is not an exhaustive list, but just a set of representative examples. You should choose something that motivates you enough to keep you interested. The project is a chance to maximize your success as long as you stay within scope of the class material. If what you propose to do falls clearly into one of the four above categories, it’s likely to be approved much faster without questions.

In your project proposal, you should make reference to the following questions where appropriate:
- What scientific or empirical questions are you seeking to answer?
- What hypotheses can you make about the data or method you will be exploring? (For a research report, this may be less applicable.)
- What are common approaches in the field to your problem? What appear to be their advantages or drawbacks?

You may adapt an existing project, either from another class or an independent study, for this class, if you wish. The only stipulation is that you must make it relevant to NLP. That means your project must involve language data, or something that shares enough similarity with language data that you can adapt an NLP approach or technique to it. As language data is almost uniformly sequence data, generally anything involving sequences is probably fair game, such as audio data or DNA data. Straight computer vision or bioinformatics projects are out of scope (we have other classes for that), though if you were to do something cool involving vision and language, I would probably sing your praises.
Scope
It is not required that you produce conference paper-level work for this project. You will have about a month and a half for this, and it should be scoped appropriately. For coding projects, you will not be graded on the success or failure of your approach, but on your ability to do the work and perform analysis. If you want a general idea of the scope, you can use the amount of coding work on PA2 as a starting point: implementing a distinct algorithm pretty much from start to finish, and applying it to a dataset. This level of work plus an accompanying write-up analyzing the methods and results would be a sufficient minimum level of effort for a coding project. For a purely writing project, you should consider around 5 typed pages to be the approximate minimum, with about a page to summarize each paper and the rest to present your synthesis and conclusions. You can of course write more.
You may work in small groups, of up to 3, if you wish. Working in groups is not required. Group projects should expect to describe the approximate apportionment of work in their final submission.
Schedule
- October 28, 2025: please submit a short project proposal, about the length of a paper abstract (only needs to be about 100 words but longer is acceptable), describing what you would like to do. I will approve it or ask you for some revisions over the following days.
- December 10, 2025: each person or team will give a short presentation of about 3-5 minutes summarizing the project and progress to date with a small number of slides (also about 3-5). We will do these in class, so attendance is obligatory unless you have an unavoidable conflict such as being in a time zone where it’s 4 in the morning during the normal class period. In that case, you will be asked to pre-record a video of you presenting your slides and project. I recommend doing this on Zoom: start a private meeting, record yourself while sharing your screen, present, then save and submit your video. Your face does not have to appear in the video; we only have to hear your voice.
- December 17, 2025: please submit your final project report and associated materials.
Sources
To find datasets and code, some good resources are:
- https://www.paperswithcode.com. The premier site for finding papers with associated codebases, as well as current leaderboards on common tasks. Search for “natural language,” “natural language processing,” or related terms, or tasks like “sentiment analysis,” “named entity extraction,” or “speech recognition.”
- https://huggingface.co. Really good repository of datasets for language tasks. Search for models or datasets using terms similar to the task search terms above.
- http://archive.ics.uci.edu/ml/index.php. UCI Machine Learning Repository. Contains general machine learning datasets but a search for “language” or related terms will turn up appropriate datasets.

To find papers, other good resources are:
- https://scholar.google.com. Of course.
- https://aclanthology.org. NLP-specific. The ACL Anthology hosts over 70,000 papers on the study of computational linguistics and natural language processing from all major venues for the past few decades. Searching works similarly to Google Scholar. This is where I find most of the papers for the paper discussions.

You are not restricted to these sources.
Grading
You will not be graded on the success or failure of what you propose. If you try something and it doesn’t work out, you will not be penalized. What matters is your effort and the analysis you do. A negative result is still a result, but delve into the reasons behind the failure. Grading differs slightly depending on whether you do a coding project or a report project, and will be as follows:
- 20 points for turning in a proposal that is approved. This should be about the length of a paper abstract, so only about 100 words or so, laying out what you plan to do. Whether or not I approve it without changes, you will get the full points for turning one in once approval is granted.
- 10 points for identifying (in your proposal) some sources you can use (coding projects), or the papers you plan to review (report projects). For research report proposals, if you do not include citations in your proposal, I will send it back to you to include those. You will receive credit once these are included.
- 60 points for submitting code you wrote or adapted (coding projects) or summaries of each of the papers you read (report projects). For coding projects, you don’t need to submit an entire codebase; just submit the essentials, and links to other things that might be required. In principle, someone should be able to read your submission and reproduce what you did. We will most likely not run your code in depth, merely look through it to verify that you put in an appropriate amount of effort.
- 60 points for presentation of results (coding projects) or synthesis of content from papers (report projects) and discussion. This is where the bulk of the writing will be for coding projects. Describe your results and what you learned.
- 30 points for doing the in-class or recorded presentation on December 10.
- 30 points for depth. These 30 points are for providing detailed discussion and analysis. Include specific examples of places your approach succeeded or failed (coding projects) or specific examples of where different papers converge or diverge (report projects). Impress me. The scope of a project relative to the number of people on the project will also be considered in this part. That is, a 3-person project better look like it took 3 people to accomplish. Group projects should describe the apportionment of work here.
- 30 points for the “reverse conjecture map”. Conjecture mapping is a technique from the learning sciences proposed by William Sandoval in 2014. It begins with a conjecture about how to support learning outcomes, and then outlines the possible ways to “embody” that conjecture (e.g., tools to use, practices to undertake, etc.), connects those embodiments to the “mediating processes” (observable interactions, artifacts produced), and learning outcomes (new knowledge, motivation, etc.). An entry-level primer can be found here: https://circls.org/reading-list/reading-list-conjecture-mapping. I’m asking you to reverse this process as a way of summarizing your learning process throughout doing the project.
  - Start with the outcomes: list what you learned, for example about the relationship between data, task, and approach(es) taken.
  - Enumerate the mediating processes: what artifacts did you produce in the course of doing the project? For example, for a coding project this might be data processing pipelines or experimental procedures. For a written report this might be notes on the different papers connecting them to each other and your conclusions. How did you go about coming to understand the raw materials (the papers or data)?
  - Connect your mediating processes to raw artifacts: in this case this is probably limited to the papers or raw datasets/starter code, but one thing to consider discussing is how those artifacts are presented. Are they presented in ways that are easy or difficult to understand? Was there anything in particular about these artifacts that made you use the mediating processes you did?

We will discuss the conjecture mapping process in class on October 1, 2025. Watch the recording if you need more review or examples. I’d caution you not to overthink this. The goal is to reconstruct your learning process so I can understand how you thought about the project and your own learning. The conjecture map just provides a specific process on how to go about that.

If you add the points up you’ll find the total is 240 points, but in the syllabus the project is worth 200 points. Therefore it is possible to get more than 100% on the final project. Use this to your advantage.
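
As a quick check of that arithmetic, here is a minimal Python sketch. The dictionary keys are shorthand labels invented for illustration, and the helper name percent_of_syllabus is likewise hypothetical; only the point values and the 200-point syllabus weight come from this handout.

```python
# Point values copied from the Grading list above.
# The keys are illustrative shorthand, not official rubric names.
RUBRIC = {
    "approved proposal": 20,
    "sources or papers identified": 10,
    "code or paper summaries": 60,
    "results or synthesis, with discussion": 60,
    "presentation": 30,
    "depth": 30,
    "reverse conjecture map": 30,
}

SYLLABUS_POINTS = 200  # what the project is worth in the syllabus


def percent_of_syllabus(earned: int) -> float:
    """Return earned points as a percentage of the syllabus weight."""
    return 100 * earned / SYLLABUS_POINTS


total = sum(RUBRIC.values())
print(total)                       # 240
print(percent_of_syllabus(total))  # 120.0
```

So a project that earns every available point scores 120% of the project's syllabus weight, and a project needs 200 of the 240 available points to reach 100%.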

Submission Instructions

There will be separate submission boxes on Canvas for the project proposal, final report, and presentation slides/videos where applicable. Proposals should be in PDF format only. Final reports should be in PDF format, or a single Jupyter notebook. Coding projects will need to include essential code that you wrote, or links to code that you used if you replicated something out of the box. As mentioned, we will most likely not attempt to run all your code in normal circumstances, but we need to be able to verify the level of effort you assert it took.

For group projects, only one person needs to submit anything. Just make sure all names are clearly on it so we can assign credit properly.
