Numeric categorical values can not be written to disk #126

@PGijsbers

Description

If a categorical attribute has only numeric values (which is valid, if somewhat underspecified, in the original ARFF definition), the package raises an error when writing the data to disk:

import arff
data = dict(
  relation='dataset name',
  description='dataset description',
  attributes=[("categorical_with_numeric_values", [1, 2, 3])],
  data=[[1], [2], [3]]
)
with open("test.arff", "w") as fh:
  arff.dump(data, fh)

Expected behavior: a valid ARFF file is produced, containing the attribute:

@ATTRIBUTE categorical_with_numeric_values {1, 2, 3}

Actual behavior: the encoder assumes every category is a string and passes each one to encode_string, which raises an error:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Users/pietergijsbers/repositories/automlbenchmark/venv/lib/python3.9/site-packages/arff.py", line 1091, in dump
    for row in generator:
  File "/Users/pietergijsbers/repositories/automlbenchmark/venv/lib/python3.9/site-packages/arff.py", line 1028, in iter_encode
    yield self._encode_attribute(attr[0], attr[1])
  File "/Users/pietergijsbers/repositories/automlbenchmark/venv/lib/python3.9/site-packages/arff.py", line 964, in _encode_attribute
    type_tmp = [u'%s' % encode_string(type_k) for type_k in type_]
  File "/Users/pietergijsbers/repositories/automlbenchmark/venv/lib/python3.9/site-packages/arff.py", line 964, in <listcomp>
    type_tmp = [u'%s' % encode_string(type_k) for type_k in type_]
  File "/Users/pietergijsbers/repositories/automlbenchmark/venv/lib/python3.9/site-packages/arff.py", line 420, in encode_string
    if _RE_QUOTE_CHARS.search(s):
TypeError: expected string or bytes-like object
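
The failure can be reproduced without the package. The sketch below assumes that encode_string does little more than run a compiled regular expression over its argument, as the last traceback frame suggests. `QUOTE_CHARS` and `needs_quoting` are hypothetical stand-ins, not the library's actual `_RE_QUOTE_CHARS` pattern or code.

```python
import re

# Hypothetical stand-in for the library's quoting pattern; the real
# `_RE_QUOTE_CHARS` in arff.py may match a different set of characters.
QUOTE_CHARS = re.compile(r'[\s,{}%]')


def needs_quoting(value):
    """Return True if `value` contains a character that would need quoting."""
    return QUOTE_CHARS.search(value) is not None


# A string category is handled without trouble.
string_result = needs_quoting('1')

# An int category makes the regex search raise TypeError,
# which matches the last frame of the traceback above.
try:
    needs_quoting(1)
    error = None
except TypeError as exc:
    error = exc

print(string_result, type(error).__name__)
```

This suggests that converting each category with str() before the regex check would be enough to avoid the error, although that is a guess about a possible fix, not a description of the package's code.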

A possible workaround is to stringify the categories (they remain unquoted in the resulting ARFF header):

-  attributes=[("categorical_with_numeric_values", [1, 2, 3])],
+  attributes=[("categorical_with_numeric_values", ['1', '2', '3'])],
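
For datasets with many attributes, the same workaround can be applied programmatically. The helper below is a minimal sketch: `stringify_nominal_values` is a hypothetical name, not part of liac-arff, and it assumes that nominal attributes are declared as a list (or tuple) of values while all other types are given as strings such as 'NUMERIC'.

```python
def stringify_nominal_values(attributes):
    """Return a copy of `attributes` with nominal value lists converted to strings."""
    fixed = []
    for name, type_ in attributes:
        # Assumption: a list or tuple marks a nominal attribute;
        # string type names such as 'NUMERIC' are left unchanged.
        if isinstance(type_, (list, tuple)):
            type_ = [str(value) for value in type_]
        fixed.append((name, type_))
    return fixed


data = dict(
    relation='dataset name',
    description='dataset description',
    attributes=[("categorical_with_numeric_values", [1, 2, 3])],
    data=[[1], [2], [3]],
)

# Apply the workaround before handing `data` to arff.dump(data, fh).
data['attributes'] = stringify_nominal_values(data['attributes'])
```

The data rows are left untouched here, as in the manual workaround above. Whether they also need converting may depend on the package version.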

Python 3.11.3, liac-arff 2.5.0

I understand there is currently no work being done on the package, but I figured I would document the bug and workaround.
