Code Revision for Hwan-branch #1

Hi Hwan. Great job; I was very impressed with your code. I do have some concerns about the outcome, and would like you to debug your work using the following process. As always, tell me if this helps.

- [x] Read through this Towards Data Science [Medium article](https://towardsdatascience.com/lovecraft-with-natural-language-processing-part-3-tf-idf-vectors-8c2d4df98621). I think it gives good background on the TF-IDF method.
- [x] Verify that the lemmatization does what you expect. (Why are `infect` and `infection` lemmatized to different words?)
I would suggest checking that the tokenize and lemma functions return the expected results for some common variants you see in the data.
- [x] Apply [spellcheck](https://spacy.io/universe/project/contextualSpellCheck) with spaCy prior to lemmatization.
- [x] Re-implement TF-IDF iteratively, testing on smaller corpora (i.e., try your process with just 1 sentence, then 10 sentences, then 40 sentences, etc.). The TF-IDF formula can be calculated by hand just by counting the words. Can you replicate the results for these small cases?
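
For the hand check in the last item, here is a minimal sketch of what a small reference implementation could look like. It is not your pipeline: the function name `tf_idf`, the whitespace tokenizer, and the two-sentence corpus are all made up for illustration, and it assumes the plain formula tf = count / document length and idf = ln(N / df). Libraries often use other variants (for example, scikit-learn's `TfidfVectorizer` smooths the idf and L2-normalizes by default), so match the formula before comparing numbers.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document.

    Assumes tf = count / tokens in the document and idf = ln(N / df).
    Other variants (smoothing, normalization) give different numbers.
    """
    # Naive whitespace tokenizer, for illustration only.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical two-sentence corpus, small enough to verify by counting.
corpus = [
    "the infection spread",
    "the patient recovered",
]
result = tf_idf(corpus)
for doc, doc_scores in zip(corpus, result):
    print(doc, doc_scores)
```

With this formula, a word that appears in every sentence (here `the`) scores 0, and with a single sentence every score is 0, which is a useful sanity check before moving to 10 and 40 sentences.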

After this, we are going to find a way to create a 'word cloud'. Please review the [wordcloud](https://github.com/amueller/word_cloud) Python package by next week so that you are ready to produce a word cloud of your TF-IDF results!
