Instructor: Matthew Wilkens
TAs: Kiara Liu, Xinhe Tian
Credits: 3
Mode: In person (only)
Lecture: TR 1:25-2:40pm, Snee 1120
Office hours: See Canvas
All students who have met the prerequisites are welcome in the course. INFO majors can register directly. All others should join the waitlist; you will be provided with a registration PIN in a day or two. Please attend course meetings while you wait for a PIN; otherwise, it is almost impossible to catch up if you're admitted at the end of the add period.
Note that you must have successfully completed INFO 3350 (Text-Mining for History and Literature) or equivalent to enroll in the class. If you're unsure about your preparation, talk to Prof. Wilkens directly.
An advanced, research-centered undergraduate course on applications of AI and natural language processing to humanities problems.
This is not an ordinary NLP course. Yes, we will cover a handful of advanced topics in NLP, mostly concerning large language models and their adaptation to specific tasks. But much of what you need to know on that front you will already have learned in INFO 3350 or other upper-division classes in IS and CS.
The real objective of the course is to learn to carry out independent research. To that end, the majority of class meetings will be devoted to one of two areas:
- Learning and mastering core research activities such as reading papers, selecting problems, building data sets, designing experiments, analyzing results, presenting findings, and crafting arguments according to the conventions of scientific literature.
- Putting those skills to work on a research project of your own design.
You will be responsible for selecting, designing, and carrying out a novel project in computationally assisted humanities research. You will work in small groups, mentored by a teaching assistant. There will be ample opportunities for direction and feedback on every aspect of your work. The final deliverable will be a complete paper suitable for submission to a peer-reviewed conference or journal.
- Analyze and evaluate published research in NLP and the humanities
- Explain, evaluate, and implement advanced techniques for the use of text as data
- Create reliable, compelling, data-driven humanities research reports that apply appropriate advanced computational methods to existing humanities questions
- Identify and analyze historical, ethical, and epistemic limitations of textual corpora and analytic methods
We will use:
- GitHub (right here) to distribute lecture materials, code, and datasets. The current versions of the syllabus (this page) and the schedule are always on GitHub, too. You might want to watch or star this repo to be notified of changes.
- Canvas to distribute restricted readings and other non-public materials.
- Gradescope (linked from Canvas) for assignment distribution and submission.
- Ed (linked from Canvas) for Q&A.
Links and detailed info about each of these are available via the course Canvas site.
Note that you must generally be logged in through your Cornell account to access non-public resources (everything but GitHub).
Grades will be based on interim project deliverables (25% in sum), in-class presentations (15% in sum), a final project in the form of a complete scientific paper (40%), and participation and professionalism (20%). You must achieve a passing grade in each of these components to pass the course. Detailed weights and expectations for individual assignments will be distributed with each assignment.
Assignments in this class are marked using additive scoring, as is standard in professional and scientific work. There are almost never uniquely correct answers to the problems posed. Graders will be looking not for mistakes (to deduct points), but for excellent work (to award points). It is possible (even common) to submit work that is entirely correct, but of low quality. Such work will receive correspondingly low scores.
Grades will be assigned on the following scale:
97 - 100% A+
93 - 97% A
90 - 93% A-
87 - 90% B+
83 - 87% B
80 - 83% B-
77 - 80% C+
73 - 77% C
70 - 73% C-
67 - 70% D+
63 - 67% D
0 - 63% F
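As a rough illustration of how the component weights and the letter scale combine, here is a minimal Python sketch. The function names (`course_percentage`, `letter_grade`), the treatment of each band's lower bound as inclusive, and the use of 63% as the passing threshold for each component are assumptions made for this example. The syllabus does not specify them, and this is not the official grade calculation.

```python
# Component weights as stated in the syllabus (25% + 15% + 40% + 20%).
WEIGHTS = {
    "interim_deliverables": 0.25,
    "presentations": 0.15,
    "final_paper": 0.40,
    "participation": 0.20,
}

# Lower bound of each letter band, highest first.
# ASSUMPTION: each lower bound is inclusive (e.g. exactly 93% is an A).
SCALE = [
    (97, "A+"), (93, "A"), (90, "A-"),
    (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"),
    (67, "D+"), (63, "D"),
]

# ASSUMPTION: a component is "passing" at or above the bottom of the D band.
PASSING = 63.0


def course_percentage(scores):
    """Weighted average of component scores, each given as a percentage (0-100)."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def letter_grade(scores):
    """Map component percentages to a letter; a failed component fails the course."""
    if any(scores[name] < PASSING for name in WEIGHTS):
        return "F"
    total = course_percentage(scores)
    for lower, letter in SCALE:
        if total >= lower:
            return letter
    return "F"


# Hypothetical scores, for illustration only.
example = {
    "interim_deliverables": 90,
    "presentations": 85,
    "final_paper": 92,
    "participation": 80,
}
print(round(course_percentage(example), 2), letter_grade(example))
```

Because every component must be passed, a strong weighted average does not rescue a failing component in this sketch.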
Participation scores are determined as follows:
- 100% of participation points: almost always contributes, raising thoughtful points. Is a strongly positive influence on the level and tenor of the class.
- 80% of participation points: frequently contributes, raising thoughtful points
- 60% of participation points: occasionally makes a valuable contribution
- 40% of participation points: rarely makes a valuable contribution
- 0% of participation points: attends class meetings but never contributes, or actively interferes with learning
There is no required textbook for the course. Assigned readings will be available online, either through the open web or via Canvas. See the schedule for details.
There are two textbooks that may be useful for students who wish to consult them. They are not required and most students will not need them.
- Jurafsky and Martin. Speech and Language Processing (3rd ed.). A detailed textbook focusing on many of the core topics in natural language processing. The linked version is the openly available draft of the third edition. The published second edition is also available for sale.
- Karsdorp, Kestemont, and Riddell. Humanities Data Analysis. A set of case studies in data analytics for literary studies. Probably not sufficiently technical for this class, but provides fully worked examples of computational humanities work.
In principle, Tuesday lectures cover material that you'll use to prepare for Thursday group sessions. But we'll often deviate from that pattern as our needs dictate.
For the detailed (and updating) list of topics and readings, see the course schedule.
A complete write-up of your research will be due at a time during finals week assigned by the Office of the University Registrar. Final project due dates are expected to be released by the middle of March.
All students are entitled to respect from course staff and from their fellow students. All staff are entitled to respect from students and from fellow staff members. Violations of this principle, whether large or small, will not be tolerated.
Respect means that your ideas are taken seriously, that you feel welcome in class settings (including in study groups and online fora), and that you are treated as a full, co-equal member of the class. Harassment describes any action, intentional or otherwise, that abridges the respect owed to every member of the class.
If you experience harassment in any form, or if you would like to discuss your experience in the class, please see me in office hours or contact me by email. The university also has reporting and counseling resources available, including those for sexual harassment and for other bias incidents.
This is a small class that will make frequent use of class time to discuss readings and to debate approaches to academic inquiry. It will also involve significant work in small research groups. For these reasons, attendance is strictly required.
If you need to miss a meeting, let me know in advance (ideally in person during an earlier class session). When you return, consult with a classmate to review the material you missed. Lectures will not be recorded. In every case, assigned work remains due at the appointed time.
Late work is accepted subject to a limit of five total slip days per group for the semester. You may submit any individual assignment up to three days (72 hours) late. The slip day policy does not apply to in-class presentations, nor to the final deliverable.
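The slip-day rules above can be sketched as a small check. This is only an illustration: the function name `charge_slip_days`, the rounding of a partial day up to a whole slip day, and the choice to reject a submission when the group's remaining balance is too small are assumptions, not stated policy.

```python
import math

SEMESTER_LIMIT = 5      # total slip days per group for the semester (from the syllabus)
MAX_DAYS_PER_ITEM = 3   # any one assignment may be at most 72 hours late


def charge_slip_days(hours_late, days_used_so_far):
    """Return the slip days to charge for one submission, or None if not accepted.

    ASSUMPTION: any fraction of a day late is charged as a full slip day.
    """
    if hours_late <= 0:
        return 0  # on time: nothing charged
    days_needed = math.ceil(hours_late / 24)
    if days_needed > MAX_DAYS_PER_ITEM:
        return None  # more than 72 hours late
    if days_used_so_far + days_needed > SEMESTER_LIMIT:
        return None  # too few slip days left (ASSUMED handling)
    return days_needed


# A group that has already used 4 days can still submit one assignment a day late.
print(charge_slip_days(24, 4))
```

The sketch covers ordinary assignments only; in-class presentations and the final deliverable are outside the slip-day policy altogether.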
If you expect to miss a deadline or to be absent for an extended period due to truly exceptional circumstances, contact Professor Wilkens as far in advance as possible so that we can discuss potential accommodations.
If you feel that the graders have made a clear, objective, and significant mistake in assessing your work, you may request a regrade via Gradescope no later than one week after feedback is released. Regrade requests are typically processed within a week or two of submission. You will be notified of the outcome as soon as it is ready.
Remember that this process exists to correct mistakes, not to lobby for points. We want grades to reflect accurately our assessment of your understanding of course material. If an obvious grading error gave you a lower score than you earned, you should absolutely bring it to our attention. But accuracy cuts both ways: if we notice that you were assigned more points than you should have been, we are duty-bound to correct the score downward.
Each student in this course is expected to abide by the Cornell University Code of Academic Integrity. Any work submitted by a student in this course for academic credit will be the student's own work unless specifically and explicitly permitted otherwise.
Using other people's code is an important part of programming, but all code you submit should be the work of the individual student or (for group projects) the group's members, standard libraries excepted. Any submitted code that was not written by the submitting student or group should be placed in separate files and clearly labeled with its source URL. If you have benefited from online resources such as Stack Overflow, list the URLs in comments in your own code, even if you did not directly copy anything.
Work that relates to your other classes or research is encouraged, but you may not recycle assignments. There must be no doubt that the work you turn in for this class was done for this class. When in doubt, consult with me during office hours.
The use of generative AI systems is permitted, subject to conditions. If you use AI for any aspect of your work, you must:
- Clearly indicate the content of your submission that was produced by an AI system or with the assistance of such a system;
- Identify the system(s) you used;
- Write a short statement summarizing your assessment of the system's performance on the task.
Use of generative AI without following these steps will constitute an academic integrity violation.
TAs will not help you debug AI-generated code, because it's a pain in the rear.
I reserve the right to ban the use of generative AI systems outright at a later point in the semester if they become a net negative for our learning goals, but I hope and expect that that won't be necessary.
Every student's access is important to us. If you have, or think you may have, a disability, please contact Student Disability Services for a confidential discussion: sds_cu@cornell.edu, 607-254-4545, or sds.cornell.edu.
- Please request any accommodation letter early in the semester, or as soon as you become registered with Student Disability Services (SDS), so that we have adequate time to arrange your approved academic accommodations.
- Once SDS approves your accommodation letter, it will be emailed to you and to me. Please follow up with me to discuss the necessary logistics of your accommodations.
- If you experience any access barriers in this course, such as with printed content, graphics, online materials, or any communication barriers, reach out to me and/or your SDS counselor right away.
- If you need an immediate accommodation, please speak with me after class or send an email message to me and to SDS.
Your health and well-being are important to us. There are services and resources at Cornell designed specifically to bolster student mental health and well-being. If you or a friend are struggling emotionally or feeling stressed, fatigued, or burned out, there is a continuum of campus resources available to you. Help is available at any time, day or night, through Cornell’s 24/7 phone consultation (607-255-5155). You can also reach out to me, your college student services office, your resident advisor, or Cornell Health for support.