Semester project 3 for Linguistics
- analyze some text
- learn techniques for analyzing text
- learn some software engineering skills
- learn git
Take a text, sense-annotate open-class words with a wordnet.
In previous work, we have done this manually:
Teaching Through Tagging — Interactive Lexical Semantics (Bond et al., GWC 2021)
- Do this automatically
- evaluate how well this works EVAL
- WHO:
- all the WSD tasks need this
- look at common errors
- fix some (e.g. se/si)
- test different contexts WSD-C
- test different wordnet information WSD-W
- test different LLM models WSD-M
- look at settings to make it more efficient
- e.g., prompt caching
- find translations and use as context ALIGN
- find epub, extract text, split into paragraphs, align
- then add to WSD-C
- textual criticism --- compare versions TEXT
- look at sentiment over the story SENTI
- use general tool
- use senses
- compare
- visualize
- improve Czech wordnet EXPAND
- add new senses to existing concepts using aligned data
- verify with LLM?
- create candidate definitions/examples?
- add new Czech concepts NEW
- add new suggestions for concepts to the hierarchy (difficult)
- evaluate how well this works EVAL
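As a starting point for EVAL and for "look at common errors", automatic tags can be compared with the manual ones token by token. This is a minimal sketch, assuming both sets of tags are stored as a mapping from a token id to a sense id. That format, the function name `evaluate`, and the ids in the example are all made up for illustration and are not the project's actual data format.

```python
from collections import Counter


def evaluate(gold, predicted):
    """Compare predicted sense tags with gold (manual) tags.

    Both arguments map a token id to a sense id (hypothetical format).
    Tokens that are missing from `predicted` count as errors.
    """
    correct = 0
    confusions = Counter()  # (gold sense, predicted sense) -> count
    for token_id, gold_sense in gold.items():
        pred_sense = predicted.get(token_id)
        if pred_sense == gold_sense:
            correct += 1
        else:
            confusions[(gold_sense, pred_sense)] += 1
    accuracy = correct / len(gold) if gold else 0.0
    return accuracy, confusions


# Toy data: the token ids and sense ids are placeholders, not real wordnet keys.
gold = {
    "s1.t1": "dog.n.01",
    "s1.t2": "bark.v.01",
    "s1.t3": "loud.a.01",
    "s2.t1": "bank.n.01",
}
predicted = {
    "s1.t1": "dog.n.01",
    "s1.t2": "bark.v.02",
    "s1.t3": "loud.a.01",
}

accuracy, confusions = evaluate(gold, predicted)
print(f"accuracy: {accuracy:.2f}")
for (gold_sense, pred_sense), count in confusions.most_common():
    print(gold_sense, "->", pred_sense, count)
```

The confusion counts give a first list of frequent errors to inspect. Counting untagged tokens as errors is one possible choice; reporting coverage separately would be another.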
- get something useful for each task
- combine to make a best-of-breed
- write, submit and publish a paper
- release at least one automatically tagged, aligned corpus
- work on tasks in pairs
- use github to coordinate
- 5-10 minutes progress
- longer discussion of issues as necessary
- small meetings pair+me or WSD+me, ... as necessary
- ALL: make a GitHub account
- send me your account name
- I will add you on GitHub
- then add your name to a task
- WSD --- try to run ollama on a prompt
- ALIGN --- try to run align on chapter 1 of VsM
- TEXT --- look for existing information on Karel Čapek and versions, maybe ask Bohemian studies
- FCB
- prepare databases and data
- meet to set up eval, ...
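For the WSD to-do item ("try to run ollama on a prompt"), a first attempt could look like the sketch below. It assumes a local Ollama server on its default port (11434) and uses its `/api/generate` endpoint. The model name `llama3`, the prompt wording, the helper names `build_prompt` and `ask_ollama`, and the sense ids are all assumptions for illustration, not project decisions.

```python
import json
import urllib.request


def build_prompt(sentence, target, candidates):
    """Build a plain-text WSD prompt listing the candidate senses.

    `candidates` is a list of (sense_id, definition) pairs (hypothetical format).
    """
    lines = [
        f"Sentence: {sentence}",
        f"Target word: {target}",
        "Candidate senses:",
    ]
    for number, (sense_id, definition) in enumerate(candidates, start=1):
        lines.append(f"{number}. {sense_id}: {definition}")
    lines.append("Answer with the number of the best sense only.")
    return "\n".join(lines)


def ask_ollama(prompt, model="llama3",
               url="http://localhost:11434/api/generate"):
    """Send one prompt to a local Ollama server and return the reply text."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


# Placeholder senses; real ids and definitions would come from the wordnet.
candidates = [
    ("bank-1", "sloping land beside a body of water"),
    ("bank-2", "a financial institution that accepts deposits"),
]
prompt = build_prompt("She sat on the bank of the river.", "bank", candidates)
print(prompt)

# Uncomment once the server is running and the model has been pulled:
# print(ask_ollama(prompt))
```

Keeping prompt construction separate from the model call makes it easier to vary the context (WSD-C), the wordnet information (WSD-W), and the model (WSD-M) independently.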