training using documents separated by sentence #3

@xiaoouwang

Hello,

Thanks for this fantastic implementation.

I'm wondering whether it's possible to use sentences as training units, since the context window is normally applied within a sentence, right? If we use whole documents, the last word of a sentence gets a right window of 5 words taken from the next sentence, which shouldn't be included.
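
To illustrate the concern, here is a minimal sketch that counts the context pairs that only appear when sentence boundaries are ignored. It assumes a plain symmetric window with no weighting or subsampling; `context_pairs` is a hypothetical helper written for this example and is not part of svd2vec.

```python
from typing import List, Tuple


def context_pairs(tokens: List[str], window: int) -> List[Tuple[str, str]]:
    """Return (center, context) pairs for a symmetric window over one token list."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


sentences = [
    "this is a test".split(),
    "another sentence follows".split(),
]

# Document-level: all sentences flattened into one token list,
# so the window slides across the sentence boundary.
document = [tok for sent in sentences for tok in sent]
doc_pairs = context_pairs(document, window=2)

# Sentence-level: the window is restarted for every sentence.
sent_pairs = [p for sent in sentences for p in context_pairs(sent, window=2)]

# Pairs that exist only because the window crossed a boundary.
crossing = sorted(set(doc_pairs) - set(sent_pairs))
print(crossing)
```

Every pair in `crossing` joins a word from the end of the first sentence with a word from the start of the second, which is the kind of co-occurrence I would like to exclude.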

One could argue that it suffices to pass a list of tokenized sentences as input; however,

from svd2vec import svd2vec
documents = [
    "this is a test right left".split(),
    "this is the second test left right".split(),
]
svd = svd2vec(documents, window=2, min_count=1, size=2)

gives

test_svd.py 3 <module>
svd = svd2vec(documents, window=2, min_count=1,size=2)

core.py 146 __init__
self.weighted_count_matrix_file = self.skipgram_weighted_count_matrix()

core.py 234 skipgram_weighted_count_matrix
(self.vocabulary_len, self.vocabulary_len), np.dtype('float16'))

temporary_array.py 17 __init__
matrix = self.load(erase=True)

temporary_array.py 23 load
return np.memmap(self.file_name, shape=self.shape, dtype=self.dtype, mode='w+')

memmap.py 267 __new__
mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)

ValueError: cannot mmap an empty file

As one might expect, the error disappears if each token list is made longer:

from svd2vec import svd2vec
documents = [
    "this is a test right left".split() * 100,
    "this is the second test left right".split() * 100,
]
svd = svd2vec(documents, window=2, min_count=1, size=2)

Thanks again!
