Hi!
The code is great!
I'm using this code to implement paragraph2vec, and my use case requires several iterations of training. If we train several times like this:
for (int i = 0; i < n; i++)
    model.train(sentences);
the first iteration works correctly, but each subsequent iteration's memory usage grows larger and larger.
After reviewing word2vec.h, I found that the following code in it may have a problem:
#pragma omp parallel for
for (size_t i=0; i < n_sentences; ++i) {
    auto sentence = sentences[i].get();
    if (sentence->tokens_.empty())
        continue;
    size_t len = sentence->tokens_.size();
    for (size_t i=0; i<len; ++i) {
        auto it = vocab_.find(sentence->tokens_[i]);
        if (it == vocab_.end()) continue;
        Word *word = it->second.get();
        // subsampling
        if (sample_ > 0) {
            float rnd = (sqrt(word->count_ / (sample_ * total_words)) + 1) * (sample_ * total_words) / word->count_;
            if (rnd < rng(eng)) continue;
        }
        sentence->words_.emplace_back(it->second.get());
    }
}
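To see why memory grows, here is a minimal, self-contained sketch of the pattern (the Sentence struct and fake_train function below are hypothetical stand-ins for illustration, not the library's actual API): every call appends to words_ without clearing it first, so entries from earlier passes pile up.

#include <cstdio>
#include <string>
#include <vector>

// Hypothetical stand-in for the library's Sentence type, kept minimal.
struct Sentence {
    std::vector<std::string> tokens_;
    std::vector<const std::string*> words_;
};

// Mimics the filtering loop above: appends to words_ without clearing it.
void fake_train(Sentence &s) {
    for (const auto &tok : s.tokens_)
        s.words_.emplace_back(&tok);
}

int main() {
    Sentence s;
    s.tokens_ = {"the", "quick", "brown", "fox"};
    for (int pass = 1; pass <= 3; ++pass) {
        fake_train(s);
        // Prints 4, then 8, then 12: stale entries accumulate every pass.
        std::printf("pass %d: words_.size() = %zu\n", pass, s.words_.size());
    }
    return 0;
}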
The vector sentence->words_ grows larger and larger when the train function is called a second time, because entries are appended without ever being removed. We can clear the vector first:
#pragma omp parallel for
for (size_t i=0; i < n_sentences; ++i) {
    auto sentence = sentences[i].get();
    // By Largelymfs: drop the entries left over from the previous pass.
    // Note: sentence is a raw pointer, so clear its words_ member.
    sentence->words_.clear();
    if (sentence->tokens_.empty())
        continue;
    size_t len = sentence->tokens_.size();
    for (size_t i=0; i<len; ++i) {
        auto it = vocab_.find(sentence->tokens_[i]);
        if (it == vocab_.end()) continue;
        Word *word = it->second.get();
        // subsampling
        if (sample_ > 0) {
            float rnd = (sqrt(word->count_ / (sample_ * total_words)) + 1) * (sample_ * total_words) / word->count_;
            if (rnd < rng(eng)) continue;
        }
        sentence->words_.emplace_back(it->second.get());
    }
}
With this change, the train function can safely be called in a loop.
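Continuing the hypothetical sketch from above, the same one-line fix keeps the vector's size stable across passes:

// Fixed variant of the hypothetical fake_train: clear before refilling,
// mirroring the proposed sentence->words_.clear() in word2vec.h.
void fake_train_fixed(Sentence &s) {
    s.words_.clear();
    for (const auto &tok : s.tokens_)
        s.words_.emplace_back(&tok);
}

Calling fake_train_fixed three times now prints a size of 4 on every pass, so memory usage stays flat no matter how many training iterations run.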
Thanks a lot.