empty elements were found in the dataset #1

@yiwang454

Hi authors,
Thanks for providing this ABX tool. I encountered the error "fastabx.verify.EmptyDataPointsError: 4 empty elements were found in the dataset (with indices [170498, 264684, 397790, 505072])". What is the usual cause of this issue, and is there a recommended preprocessing approach to avoid it?
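
One thing I can check on my side is whether some rows of the item file describe intervals shorter than one feature frame, since those would slice to empty tensors. A minimal sketch, assuming the error indices refer to rows of the item file and that times are converted to frames by rounding (the exact rounding used by fastabx may differ):

    # Find item-file rows whose interval rounds to zero frames.
    # Assumes the standard item header "#file onset offset #phone prev-phone next-phone speaker"
    # and frame conversion by round(time * frequency); fastabx may round differently.
    FREQUENCY = 50  # HuBERT frame rate in Hz

    with open("./largeclass_tenth.item") as f:
        next(f)  # skip the header line
        for i, line in enumerate(f):
            parts = line.split()
            if not parts:
                continue
            onset, offset = float(parts[1]), float(parts[2])
            n_frames = round(offset * FREQUENCY) - round(onset * FREQUENCY)
            if n_frames <= 0:
                print(f"row {i}: [{onset}, {offset}] rounds to {n_frames} frames")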

I'm currently running it on my data with the following code:

    from fastabx import Dataset, Subsampler, Task

    item, frequency = "./largeclass_tenth.item", 50
    features = f"./data/hubert_feature/large_l{layer}_mic1"  # `layer` is set earlier in the script
    dataset = Dataset.from_item(item, features, frequency)
    subsampler = Subsampler(max_size_group=10, max_x_across=2)
    task = Task(dataset, on="#phone", by=["next-phone", "prev-phone"], across=["speaker"], subsampler=subsampler)

The code worked for part of my data and produced ABX error rates at the expected scale, but on some other parts of the data I got the following error (hubert_abx.py is the file containing the script above):

Building dataset: 100%|
Traceback (most recent call last):
  File "/mnt/ceph_rbd/workspace/fastabx/scripts/hubert_abx.py", line 38, in <module>
    main()
  File "/mnt/ceph_rbd/workspace/fastabx/scripts/hubert_abx.py", line 17, in main
    dataset = Dataset.from_item(item, features, frequency)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/ceph_rbd/workspace/fastabx/src/fastabx/dataset.py", line 274, in from_item
    return Dataset(labels=labels, accessor=InMemoryAccessor(indices, data))
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/ceph_rbd/workspace/fastabx/src/fastabx/dataset.py", line 65, in __init__
    verify_empty_datapoints(self.indices)
  File "/mnt/ceph_rbd/workspace/fastabx/src/fastabx/verify.py", line 61, in verify_empty_datapoints
    raise EmptyDataPointsError(empty)
fastabx.verify.EmptyDataPointsError: 4 empty elements were found in the dataset (with indices [170498, 264684, 397790, 505072])

Meanwhile, I have already modified the load_data_from_item function in the dataset module to pad the features up to the slice end index:

            if end > features.size(0):
                # Pad by repeating the last frame so the slice end index is in range
                missing_frames = end - features.size(0)
                features = torch.cat([features] + [features[-1:]] * missing_frames, dim=0)

So the error shouldn't be caused by the features being too short.
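
If the check above does flag zero-length intervals, one preprocessing workaround would be to drop those rows before building the dataset. A sketch under the same rounding assumption (clean_item_file is a hypothetical helper, not part of fastabx):

    def clean_item_file(src: str, dst: str, frequency: int = 50) -> int:
        """Copy an item file, skipping rows shorter than one feature frame.

        Hypothetical helper, not part of fastabx; returns the number of rows dropped.
        """
        dropped = 0
        with open(src) as fin, open(dst, "w") as fout:
            fout.write(fin.readline())  # keep the header line
            for line in fin:
                parts = line.split()
                if not parts:
                    continue
                onset, offset = float(parts[1]), float(parts[2])
                if round(offset * frequency) <= round(onset * frequency):
                    dropped += 1
                    continue
                fout.write(line)
        return dropped

    print(clean_item_file("./largeclass_tenth.item", "./largeclass_tenth_clean.item"), "rows dropped")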
