empty elements were found in the dataset #1

@yiwang454

Hi authors,
Thanks for providing this ABX tool. I encountered the error "fastabx.verify.EmptyDataPointsError: 4 empty elements were found in the dataset (with indices [170498, 264684, 397790, 505072])". What is the usual cause of this issue, and is there a recommended preprocessing approach to avoid it?
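
One thing I can check on my side is whether some rows of the item file describe intervals shorter than one feature frame, since those would slice to empty tensors. A minimal sketch, assuming the error indices refer to rows of the item file and that times are converted to frames by rounding (the exact rounding used by fastabx may differ):

    # Find item-file rows whose interval rounds to zero frames.
    # Assumes the standard item header "#file onset offset #phone prev-phone next-phone speaker"
    # and frame conversion by round(time * frequency); fastabx may round differently.
    FREQUENCY = 50  # HuBERT frame rate in Hz

    with open("./largeclass_tenth.item") as f:
        next(f)  # skip the header line
        for i, line in enumerate(f):
            parts = line.split()
            if not parts:
                continue
            onset, offset = float(parts[1]), float(parts[2])
            n_frames = round(offset * FREQUENCY) - round(onset * FREQUENCY)
            if n_frames <= 0:
                print(f"row {i}: [{onset}, {offset}] rounds to {n_frames} frames")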

I'm currently running it on my data with the following code:

    from fastabx import Dataset, Subsampler, Task

    item, frequency = "./largeclass_tenth.item", 50
    features = f"./data/hubert_feature/large_l{layer}_mic1"  # `layer` is set earlier in the script
    dataset = Dataset.from_item(item, features, frequency)
    subsampler = Subsampler(max_size_group=10, max_x_across=2)
    task = Task(dataset, on="#phone", by=["next-phone", "prev-phone"], across=["speaker"], subsampler=subsampler)

The code worked for part of my data and produced ABX error rates at the expected scale, but on some other parts of the data I got the following error (hubert_abx.py is the file containing the script above):

Building dataset: 100%|
Traceback (most recent call last):
  File "/mnt/ceph_rbd/workspace/fastabx/scripts/hubert_abx.py", line 38, in <module>
    main()
  File "/mnt/ceph_rbd/workspace/fastabx/scripts/hubert_abx.py", line 17, in main
    dataset = Dataset.from_item(item, features, frequency)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/ceph_rbd/workspace/fastabx/src/fastabx/dataset.py", line 274, in from_item
    return Dataset(labels=labels, accessor=InMemoryAccessor(indices, data))
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/ceph_rbd/workspace/fastabx/src/fastabx/dataset.py", line 65, in __init__
    verify_empty_datapoints(self.indices)
  File "/mnt/ceph_rbd/workspace/fastabx/src/fastabx/verify.py", line 61, in verify_empty_datapoints
    raise EmptyDataPointsError(empty)
fastabx.verify.EmptyDataPointsError: 4 empty elements were found in the dataset (with indices [170498, 264684, 397790, 505072])

Meanwhile, I have already modified the load_data_from_item function in the dataset module to pad the features up to the slice end index:

            if end > features.size(0):
                # Pad by repeating the last frame so the slice end index is in range
                missing_frames = end - features.size(0)
                features = torch.cat([features] + [features[-1:]] * missing_frames, dim=0)

So the error shouldn't be caused by the features being too short.
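
If the check above does flag zero-length intervals, one preprocessing workaround would be to drop those rows before building the dataset. A sketch under the same rounding assumption (clean_item_file is a hypothetical helper, not part of fastabx):

    def clean_item_file(src: str, dst: str, frequency: int = 50) -> int:
        """Copy an item file, skipping rows shorter than one feature frame.

        Hypothetical helper, not part of fastabx; returns the number of rows dropped.
        """
        dropped = 0
        with open(src) as fin, open(dst, "w") as fout:
            fout.write(fin.readline())  # keep the header line
            for line in fin:
                parts = line.split()
                if not parts:
                    continue
                onset, offset = float(parts[1]), float(parts[2])
                if round(offset * frequency) <= round(onset * frequency):
                    dropped += 1
                    continue
                fout.write(line)
        return dropped

    print(clean_item_file("./largeclass_tenth.item", "./largeclass_tenth_clean.item"), "rows dropped")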
