Code Revision for Hwan-branch #1

@edwardhuh

Description

Hi Hwan. Great job; I was very impressed with your code. I do have some concerns about the results, though, and would like you to debug your work using the following process. As always, tell me if this helps.

  • Read through this Towards Data Science Medium article. I think it gives you good background on the TF-IDF method.
  • Verify that the lemmatization does what you expect. (Why are "infect" and "infection" lemmatized to different words?)
    I would suggest checking that the tokenize and lemma functions return the expected results for some common variants you see in the data.
  • Apply spellcheck with spaCy prior to lemmatization.
  • Re-implement TF-IDF iteratively, testing on a smaller corpus (i.e., try your process with just 1 sentence, then 10 sentences, then 40 sentences, etc.). The TF-IDF formula can be calculated by hand just by counting words. Can you replicate the results on these small inputs?
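
To make the "by hand" check in the last step concrete, here is a minimal sketch using only the standard library. It assumes the textbook definition, tf = (term count) / (tokens in the document) and idf = ln(N / df), and whitespace tokenization; the function name tf_idf and the three example sentences are made up for illustration and are not from your branch. Note that scikit-learn's TfidfVectorizer defaults to a smoothed idf, ln((1 + N) / (1 + df)) + 1, followed by L2 normalization, so its numbers will not match this sketch unless you account for that.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute TF-IDF for a list of tokenized documents.

    docs: list of token lists.
    Returns one {term: score} dict per document, using
    tf = count / len(doc) and idf = ln(n_docs / document_frequency).
    """
    n_docs = len(docs)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    scores = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical three-sentence corpus, small enough to verify by counting.
corpus = [
    "the virus can infect cells".split(),
    "the infection spread quickly".split(),
    "the vaccine prevents infection".split(),
]
scores = tf_idf(corpus)

for i, doc_scores in enumerate(scores):
    print(i, sorted(doc_scores.items(), key=lambda kv: -kv[1]))
```

Here "the" appears in all three sentences, so its idf is ln(3/3) = 0 and it scores zero everywhere, while "infect" appears in only one sentence and gets (1/5) * ln(3). If your pipeline disagrees with a hand count on a corpus this small, the difference is probably in tokenization, lemmatization, or the idf variant rather than in the data.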

After this, we are going to find a way to create a 'word cloud'. Please review the wordcloud Python package by next week so that you are ready to produce a word cloud of your TF-IDF results!
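
As a starting point for that review, here is a rough sketch of how TF-IDF scores could be fed to the wordcloud package. It assumes the scores for one document (or for the whole corpus) are already in a {term: score} dict; the helper names top_terms and render_cloud, the example scores, and the output filename are hypothetical. The only wordcloud calls used are WordCloud(...), generate_from_frequencies, and to_file, and the import is guarded so the script still runs if the package is not installed yet.

```python
def top_terms(scores, n=50):
    """Keep the n highest-scoring terms, dropping zero-score terms."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return {term: score for term, score in ranked[:n] if score > 0}


def render_cloud(frequencies, path):
    """Draw a word cloud sized by score and save it to path.

    Returns False when the wordcloud package is unavailable or there is
    nothing to draw, True otherwise.
    """
    try:
        from wordcloud import WordCloud
    except ImportError:
        return False
    if not frequencies:
        return False
    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate_from_frequencies(frequencies)
    cloud.to_file(path)
    return True


# Made-up scores for illustration only; replace with real TF-IDF output.
example_scores = {"infection": 0.10, "vaccine": 0.27, "the": 0.0, "virus": 0.22}
frequencies = top_terms(example_scores, n=2)
print(frequencies)
```

Calling render_cloud(frequencies, "tfidf_wordcloud.png") would then write the image. Filtering out zero scores first keeps words that appear in every document, such as "the", out of the picture.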
