
A little problem in train function #3

@largelymfs


Hi!
The code is great!

I am using this code to implement paragraph2vec, and in my setup the training has to run for several iterations. If we call train several times like this:

for (int i = 0; i < n; i++)
        model.train(sentences);

The first iteration works fine, but each subsequent iteration uses more and more memory.
After reviewing word2vec.h, I found that the following code may be the problem:

    #pragma omp parallel for
    for (size_t i = 0; i < n_sentences; ++i) {
        auto sentence = sentences[i].get();
        if (sentence->tokens_.empty())
            continue;
        size_t len = sentence->tokens_.size();
        for (size_t i = 0; i < len; ++i) {
            auto it = vocab_.find(sentence->tokens_[i]);
            if (it == vocab_.end()) continue;
            Word *word = it->second.get();
            // subsampling
            if (sample_ > 0) {
                float rnd = (sqrt(word->count_ / (sample_ * total_words)) + 1) * (sample_ * total_words) / word->count_;
                if (rnd < rng(eng)) continue;
            }
            sentence->words_.emplace_back(it->second.get());
        }
    }

The vector sentence->words_ keeps growing every time train is called again, because the entries appended by the previous call are never removed. We can clear the vector first:

    #pragma omp parallel for
    for (size_t i = 0; i < n_sentences; ++i) {
        auto sentence = sentences[i].get();

        // By Largelymfs: clear the words collected by the previous
        // call to train() before refilling the vector.
        sentence->words_.clear();

        if (sentence->tokens_.empty())
            continue;
        size_t len = sentence->tokens_.size();
        for (size_t i = 0; i < len; ++i) {
            auto it = vocab_.find(sentence->tokens_[i]);
            if (it == vocab_.end()) continue;
            Word *word = it->second.get();
            // subsampling
            if (sample_ > 0) {
                float rnd = (sqrt(word->count_ / (sample_ * total_words)) + 1) * (sample_ * total_words) / word->count_;
                if (rnd < rng(eng)) continue;
            }
            sentence->words_.emplace_back(it->second.get());
        }
    }

With that change, the train function can safely be called in a loop.
Thanks a lot!
