This repository was archived by the owner on Oct 31, 2023. It is now read-only.

inconsistent sentences length causes encoding failure #155

@yilil


Problem Description:
I tried to run the demo but encountered the following error:

sentences = np.array(sentences)[idx_sort]
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (240,) + inhomogeneous part.

This error is raised at the second-to-last line of the prepare_sample() function, a helper invoked when encoding sentences. The (240,) indicates that I have 240 sentences to encode.

The problem lies in that line, sentences = np.array(sentences)[idx_sort]

The input sentences can have different lengths. After the preceding operations (tokenisation, filtering, etc.), sentences may look like:

[
  ['<s>',  'token1', 'token2', '</s>'],
  ['<s>',  'token1', 'token2', 'token3', 'token4',  '</s>'],
  ['<s>',  'token1', 'token2', 'token3', '</s>']
]

Converting this list of lists into a numpy array can fail, because the inner lists (one per sentence) have different lengths, as shown above.
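A minimal reproduction sketch of the failure, using the placeholder tokens from the example above. The helper name ragged_conversion_fails is made up for illustration; the version behaviour noted in the comments is based on NumPy's release notes (ragged inputs raise from 1.24 onwards), not on anything verified against this repository.

```python
import numpy as np

# Ragged token lists, as in the example above (placeholder tokens).
sentences = [
    ['<s>', 'token1', 'token2', '</s>'],
    ['<s>', 'token1', 'token2', 'token3', 'token4', '</s>'],
    ['<s>', 'token1', 'token2', 'token3', '</s>'],
]


def ragged_conversion_fails(rows):
    """Return True if np.array(rows) raises ValueError (hypothetical helper)."""
    try:
        np.array(rows)
    except ValueError:
        return True
    return False


# Expected to report True on NumPy >= 1.24. Older releases built an object
# array instead (with a deprecation warning from 1.19 onwards).
print(np.__version__, ragged_conversion_fails(sentences))
```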

I'm using numpy 1.25.2. My hypothesis is that you used an older version of numpy, which may implicitly handle or ignore this.

Solutions:
There are two possible fixes:

  1. Pad the tokenised sentence lists (after sorting, but before the numpy array conversion) so that they all have equal length
  2. Change the second-to-last line to sentences = np.array(sentences, dtype=object)[idx_sort]

Though the latter approach seems like a much simpler fix (it forces numpy to treat each inner list as an object, so variable lengths are allowed), it can be computationally inefficient, especially if we plan to do mathematical operations on the array.
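Both options can be sketched roughly as follows. This is an illustration only, not the repository's code: the '<p>' padding token, the pad_sentences helper, and the sort-by-decreasing-length computation of idx_sort are all assumptions made for the example.

```python
import numpy as np

sentences = [
    ['<s>', 'token1', 'token2', '</s>'],
    ['<s>', 'token1', 'token2', 'token3', 'token4', '</s>'],
    ['<s>', 'token1', 'token2', 'token3', '</s>'],
]

# Assumed: idx_sort orders sentences by decreasing length.
lengths = np.array([len(s) for s in sentences])
idx_sort = np.argsort(-lengths)

# Option 2: an object array keeps each token list as one element,
# so rows of different lengths are allowed.
sorted_obj = np.array(sentences, dtype=object)[idx_sort]


# Option 1: pad every sentence to the longest length first.
def pad_sentences(rows, pad_token='<p>'):
    """Right-pad each token list with pad_token (hypothetical token)."""
    max_len = max(len(r) for r in rows)
    return [r + [pad_token] * (max_len - len(r)) for r in rows]


sorted_padded = np.array(pad_sentences(sentences))[idx_sort]

print(sorted_obj.shape)     # one Python list per sentence
print(sorted_padded.shape)  # rectangular array of string tokens
```

With padding, any downstream code that looks up tokens would also need to handle the padding token. With dtype=object, note that if all sentences happen to have the same length numpy builds a 2-D array instead of a 1-D array of lists; indexing rows with idx_sort still works in that case.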
