Hi authors,
Thanks for providing this ABX tool. I ran into the error "fastabx.verify.EmptyDataPointsError: 4 empty elements were found in the dataset (with indices [170498, 264684, 397790, 505072])". I would like to ask what the usual reason for this issue is, and whether any preprocessing is recommended to avoid it.
I'm currently running it on my data with the following code:
from fastabx import Dataset, Subsampler, Task

item, frequency = "./largeclass_tenth.item", 50
features = f"./data/hubert_feature/large_l{layer}_mic1"  # layer is set earlier in the script
dataset = Dataset.from_item(item, features, frequency)
subsampler = Subsampler(max_size_group=10, max_x_across=2)
task = Task(dataset, on="#phone", by=["next-phone", "prev-phone"], across=["speaker"], subsampler=subsampler)
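For completeness, the rest of the script scores the task in the usual fastabx way, roughly as below (a sketch following the README pattern; the "angular" distance and the collapse levels are illustrative, and this part only runs once Dataset.from_item succeeds, so it is not where the error is raised):
from fastabx import Score

score = Score(task, "angular")  # angular distance, as in the fastabx README
abx = score.collapse(levels=[("next-phone", "prev-phone"), "speaker"])  # adapt if the API differs
print(abx)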
This code worked on part of my data and gave ABX error rates at the expected scale. But when running it on some other parts of the data, I got the following error (hubert_abx.py is the file containing the script above):
Building dataset: 100%|
Traceback (most recent call last):
File "/mnt/ceph_rbd/workspace/fastabx/scripts/hubert_abx.py", line 38, in <module>
main()
File "/mnt/ceph_rbd/workspace/fastabx/scripts/hubert_abx.py", line 17, in main
dataset = Dataset.from_item(item, features, frequency)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/ceph_rbd/workspace/fastabx/src/fastabx/dataset.py", line 274, in from_item
return Dataset(labels=labels, accessor=InMemoryAccessor(indices, data))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/ceph_rbd/workspace/fastabx/src/fastabx/dataset.py", line 65, in __init__
verify_empty_datapoints(self.indices)
File "/mnt/ceph_rbd/workspace/fastabx/src/fastabx/verify.py", line 61, in verify_empty_datapoints
raise EmptyDataPointsError(empty)
fastabx.verify.EmptyDataPointsError: 4 empty elements were found in the dataset (with indices [170498, 264684, 397790, 505072])
Meanwhile, I have already modified the load_data_from_item function in the dataset module so that features are padded up to the slicing end index:
if end > features.size(0):
    # Repeat the last frame until the slice end index is covered
    missing_frame = end - features.size(0)
    features = torch.cat([features] + [features[-1:] for _ in range(missing_frame)], dim=0)
So the error shouldn't be caused by the features being too short.
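Could the empty elements instead come from intervals that are too short to cover a single frame at 50 Hz, i.e. whose onset and offset round to the same frame index? To check, I plan to inspect the item-file rows behind the reported indices with something like the sketch below (it assumes the dataset indices follow the zero-based row order of the item file, and the exact frame rounding fastabx applies may differ slightly):
frequency = 50
empty = [170498, 264684, 397790, 505072]  # indices from the error message

with open("./largeclass_tenth.item") as f:
    f.readline()  # skip the header line
    rows = f.read().splitlines()

for i in empty:
    # columns: #file onset offset #phone next-phone prev-phone speaker
    cols = rows[i].split()
    onset, offset = float(cols[1]), float(cols[2])
    n_frames = round(offset * frequency) - round(onset * frequency)
    print(rows[i], "->", n_frames, "frame(s)")  # 0 frames would explain "empty"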