training using documents separated by sentence #3

Hello,

Thanks for this fantastic implementation.

I'm wondering whether it's possible to use sentences as the training units, since the context window is normally applied within a sentence, right? If we use whole documents, the last word of a sentence will have a right window of 5 words taken from the next sentence, which shouldn't be included.
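
To make the concern concrete, here is a minimal sketch in plain Python. It does not use svd2vec's internals: `context_pairs` is a hypothetical helper I wrote for illustration, and the two toy sentences are made up. It lists the (center, context) pairs that only appear because the window crosses a sentence boundary.

```python
def context_pairs(tokens, window):
    """Return (center, context) pairs for a symmetric window over one token list."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Toy data (hypothetical): one document made of two sentences.
sentences = [["the", "cat", "sat"], ["dogs", "bark", "loudly"]]

# Document as the unit: sentences are concatenated, so the window
# can reach from the end of one sentence into the next.
document = [tok for sent in sentences for tok in sent]
doc_pairs = set(context_pairs(document, window=2))

# Sentence as the unit: the window stops at each sentence end.
sent_pairs = set()
for sent in sentences:
    sent_pairs.update(context_pairs(sent, window=2))

# Pairs that exist only because the window crossed the boundary.
cross_boundary = sorted(doc_pairs - sent_pairs)
print(cross_boundary)
```

With sentence-level units, pairs such as `("sat", "dogs")` would not be counted at all, which is the behaviour I would expect.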

One could argue that it suffices to pass a list of tokenized sentences as input. However, the following:

```python
from svd2vec import svd2vec
documents = [
    "this is a test right left".split(),
    "this is the second test left right".split(),
]
svd = svd2vec(documents, window=2, min_count=1, size=2)
```
gives

```
test_svd.py 3 <module>
svd = svd2vec(documents, window=2, min_count=1,size=2)

core.py 146 __init__
self.weighted_count_matrix_file = self.skipgram_weighted_count_matrix()

core.py 234 skipgram_weighted_count_matrix
(self.vocabulary_len, self.vocabulary_len), np.dtype('float16'))

temporary_array.py 17 __init__
matrix = self.load(erase=True)

temporary_array.py 23 load
return np.memmap(self.file_name, shape=self.shape, dtype=self.dtype, mode='w+')

memmap.py 267 __new__
mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)

ValueError:
cannot mmap an empty file
```

As one might expect, the error disappears when the input is larger, for example when each token list is repeated 100 times:

```py
from svd2vec import svd2vec
documents = [
    "this is a test right left".split() * 100,
    "this is the second test left right".split() * 100,
]
svd = svd2vec(documents, window=2, min_count=1, size=2)
```


Thanks again!
