Issue:
We currently depend on pretrained vocabularies, like GloVe embeddings, that:
- are oddly biased (although once you backprop into the embeddings, their initial bias matters much less),
- must stay consistent with the tokenizer we use,
- don't necessarily cover the same words as our actual text.
Proposed solution project:
Use https://github.com/tensorflow/transform to build text preprocessing pipelines, e.g. to select tokens that occur sufficiently frequently, and to create either random or smarter word embeddings for them.
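The core idea could be sketched as follows. This is a minimal illustration in plain Python/NumPy rather than tensorflow/transform itself; the names `min_count`, `embed_dim`, and `build_vocab_and_embeddings` are hypothetical, chosen just to show the frequency cutoff and random initialization:

```python
from collections import Counter
import numpy as np

def build_vocab_and_embeddings(corpus, min_count=2, embed_dim=8, seed=0):
    """Keep tokens seen at least `min_count` times; give each a random vector."""
    counts = Counter(tok for doc in corpus for tok in doc.split())
    vocab = sorted(tok for tok, c in counts.items() if c >= min_count)
    rng = np.random.default_rng(seed)
    # One random embedding row per kept token, plus an OOV row at index 0.
    embeddings = rng.normal(0.0, 0.1, size=(len(vocab) + 1, embed_dim))
    index = {tok: i + 1 for i, tok in enumerate(vocab)}  # 0 reserved for OOV
    return index, embeddings

corpus = ["the cat sat", "the dog sat", "a cat ran"]
index, emb = build_vocab_and_embeddings(corpus, min_count=2)
# "the", "cat", "sat" occur twice and are kept; rare tokens map to OOV.
```

In a real pipeline, the counting step would be done with `tft.compute_and_apply_vocabulary` (or similar) inside a tensorflow/transform preprocessing function, so the same vocabulary is baked into both training and serving.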