Skip to content

Latest commit

 

History

History
24 lines (20 loc) · 1.07 KB

File metadata and controls

24 lines (20 loc) · 1.07 KB

term frequency–inverse document frequency (tf-idf) in c#

This library can import pretrained keras tokenizer, which were exported in json format. With the imported tokenizer it is possible to transform text with tf-idf. To use tokenization per word use WordStrategy, for per character tokenization use CharacterStrategy. This is the same behavior as setting the char_level property on the tokenizer.

how-to

train and export tokenizer

from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=tokenizer_word_count, char_level=False, filters='', split=' ')
all_words = "abc def"
tokenizer.fit_on_texts(all_words)

tokenizer_json = tokenizer.to_json()
with open(output_directory + "tokenizer.json", "w", encoding="utf-8") as text_file:
    print(tokenizer_json, file=text_file)

import tokenizer in c#

var loadedJson = File.ReadAllText(@"tokenizer.json", Encoding.UTF8);
tokenizer = Tokenizer.FromJson(loadedJson);

For examples have a look at the unit tests.