verse2vec

This repo trains word2vec models on several poetry datasets, yielding verse-influenced word vectors.

Datasets used

Subsets

poetry: uses the data as is, with no further splitting
sentences: splits the data into its component sentences
splits: leaves the English PCD, Gutenberg, and PoKi data as is and splits the other sources by stanza
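
As a rough illustration of the sentence and stanza subsets described above, the sketch below splits a single poem both ways using only the standard library. The function names `split_stanzas` and `split_sentences` are hypothetical, and the rules (blank lines separate stanzas; `.`, `!`, or `?` followed by whitespace ends a sentence) are assumptions, not necessarily what the scripts in this repo do.

```python
import re


def split_stanzas(poem: str) -> list[str]:
    """Split a poem on blank lines (assumed here to be the stanza separator)."""
    parts = re.split(r"\n\s*\n", poem.strip())
    return [p.strip() for p in parts if p.strip()]


def split_sentences(poem: str) -> list[str]:
    """Naive sentence split: break after '.', '!' or '?' followed by whitespace."""
    flat = " ".join(poem.split())  # collapse line breaks and repeated spaces
    parts = re.split(r"(?<=[.!?])\s+", flat)
    return [p for p in parts if p]


if __name__ == "__main__":
    poem = "Roses are red.\nViolets are blue!\n\nSugar is sweet,\nand so are you."
    print(split_stanzas(poem))
    print(split_sentences(poem))
```

A punctuation-based split like this will mishandle abbreviations and unpunctuated verse, so a proper sentence tokenizer may be a better fit for real data.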

Files

[source]\_clean\_text\_[subset].csv: cleaned versions of the datasets used
[subset]_tokens.csv: tokenized versions of the cleaned data
[subset]\_tokens\_word2vec.model: models trained on tokenized cleaned data
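
The path from a tokens file to a model file could look roughly like the sketch below. It assumes gensim 4.x (`vector_size` was called `size` in earlier versions) and assumes each row's tokens are stored as a stringified Python list, which is how pandas typically writes list columns to CSV. The helper names `parse_tokens` and `train`, the hyperparameters, and the storage format are all illustrative assumptions, not taken from the repo's scripts.

```python
import ast


def parse_tokens(cell: str) -> list[str]:
    """Turn a CSV cell such as "['the', 'moon']" back into a list of tokens.

    Assumes the cell holds a stringified Python list.
    """
    return [str(tok) for tok in ast.literal_eval(cell)]


def train(subset: str, token_rows: list[list[str]]):
    """Train a word2vec model on tokenized rows and save it under the
    [subset]_tokens_word2vec.model naming pattern."""
    # Imported here so parse_tokens stays usable without gensim installed.
    from gensim.models import Word2Vec

    model = Word2Vec(
        sentences=token_rows,
        vector_size=100,  # illustrative values, not the repo's settings
        window=5,
        min_count=1,
        workers=1,
    )
    model.save(f"{subset}_tokens_word2vec.model")
    return model


if __name__ == "__main__":
    rows = [parse_tokens("['the', 'moon', 'rose']"), parse_tokens("['the', 'night', 'fell']")]
    train("sentences", rows)
```

A saved model can be reloaded later with `Word2Vec.load(path)`.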

To do

Split poems into sentences for specific sources [DONE]
Add list of sentences as new entries in dataframes [DONE]
Remove duplicate poems [DONE]
Check why \t\t\t\t\t\t\t isn't being removed in clean_text.py ??? [DONE]
Find other instances of malformed data [DONE]
Re-tokenize as lines, stanzas, entire poems [DONE]
Test word vectors [IN PROGRESS]
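
One simple way to approach the remaining item, testing the word vectors, is to check whether words that should be related end up close together under cosine similarity. The sketch below does this in plain Python on toy vectors that are made up for illustration only; they are not output from any model in this repo. The names `cosine_similarity` and `nearest` are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(word: str, vectors: dict[str, list[float]], topn: int = 3):
    """Rank every other word by cosine similarity to `word`, highest first."""
    target = vectors[word]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


if __name__ == "__main__":
    # Toy 2-dimensional vectors, invented for this example.
    toy = {"moon": [1.0, 0.0], "night": [0.9, 0.1], "plough": [0.0, 1.0]}
    print(nearest("moon", toy))
```

With a trained gensim model, `model.wv.most_similar(word, topn=n)` performs the same kind of ranking directly.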
