This repo trains word2vec models on several poetry datasets to produce verse-influenced word vectors.
- English PCD (Poem Comprehensive Dataset)
- Gutenberg Poem Dataset
- Poem Emotion Recognition Corpus (PERC)
- Poetry Foundation Poems
- PoKi: A Large Dataset of Poems by Children
- poetry: uses the data as is, without splitting
- sentences: splits the data into component sentences
- splits: leaves the English PCD, Gutenberg, and PoKi data as is, and splits the others by stanza
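
As a rough illustration of the sentence and stanza splitting described above, here is a minimal Python sketch. It assumes stanzas are separated by blank lines and that sentences end in `.`, `!`, or `?`. The function names `split_stanzas` and `split_sentences` are hypothetical and are not taken from this repo's code, which may split differently.

```python
import re


def split_stanzas(poem: str) -> list[str]:
    """Split a poem on blank lines (assumed stanza separator)."""
    stanzas = re.split(r"\n\s*\n", poem.strip())
    # Drop empty pieces and trim stray whitespace around each stanza.
    return [s.strip() for s in stanzas if s.strip()]


def split_sentences(poem: str) -> list[str]:
    """Naive sentence split: flatten line breaks, then break after . ! or ?"""
    flat = " ".join(poem.split())
    parts = re.split(r"(?<=[.!?])\s+", flat)
    return [p for p in parts if p]


# Toy input, made up for illustration only.
poem = "Roses are red.\nViolets are blue!\n\nSugar is sweet,\nand so are you."
stanzas = split_stanzas(poem)
sentences = split_sentences(poem)
```

The sentence splitter is deliberately naive: it will break on abbreviations such as "Mr." and will not split poems that lack terminal punctuation, so a real pipeline would likely need a proper sentence tokenizer.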
- [source]\_clean\_text\_[subset].csv: cleaned versions of the datasets used
- [subset]\_tokens.csv: tokenized versions of the cleaned data
- [subset]\_tokens\_word2vec.model: models trained on tokenized cleaned data
- Split poems into sentences for specific sources [DONE]
- Add list of sentences as new entries in dataframes [DONE]
- Remove duplicate poems [DONE]
- Check why \t\t\t\t\t\t\t isn't being removed in clean_text.py [DONE]
- Find other instances of malformed data [DONE]
- Re-tokenize as lines, stanzas, entire poems [DONE]
- Test word vectors [IN PROGRESS]
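
Two of the cleanup items above (tab runs that survive cleaning, and duplicate poems) can be sketched in plain Python. This is only a guess at the kind of fix involved, not the logic in clean_text.py. It assumes the stray tabs may appear either as real tab characters or as literal backslash-t text, and that two poems count as duplicates when they match after lowercasing and whitespace normalization. `strip_tab_runs` and `drop_duplicate_poems` are hypothetical names.

```python
import re


def strip_tab_runs(text: str) -> str:
    """Replace runs of tabs (real or literal backslash-t) with one space."""
    # Assumption: the data may contain the two-character sequence "\t"
    # as well as actual tab characters, so both are matched here.
    text = re.sub(r"(?:\t|\\t)+", " ", text)
    # Collapse any doubled spaces left behind, keeping line breaks intact.
    text = re.sub(r" {2,}", " ", text)
    return "\n".join(line.strip() for line in text.split("\n")).strip()


def drop_duplicate_poems(poems: list[str]) -> list[str]:
    """Keep the first occurrence of each poem, comparing normalized text."""
    seen = set()
    unique = []
    for poem in poems:
        # Normalization key: cleaned, lowercased, whitespace-collapsed.
        key = " ".join(strip_tab_runs(poem).lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(poem)
    return unique


# Toy inputs, made up for illustration only.
cleaned = strip_tab_runs("So much depends\\t\\t\\tupon\t\ta red wheel")
unique = drop_duplicate_poems(["A rose\tis a rose", "a rose is a rose", "Another poem"])
```

Matching on a normalized key rather than the raw string catches duplicates that differ only in casing or spacing, at the cost of treating such variants as the same poem.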