forked from jstray/deepform
Open
Description
I'm running the following locally (not in Docker) with the latest from mainline:
python -m deepform.data.add_features data/3_year_manifest.csv
And I'm getting this traceback:
Traceback (most recent call last):
  File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/evan/deepform/deepform/data/add_features.py", line 263, in <module>
    extend_and_write_docs(
  File "/Users/evan/deepform/deepform/data/add_features.py", line 98, in extend_and_write_docs
    doc_index.to_parquet(pq_index)
  File "/Users/evan/Library/Caches/pypoetry/virtualenvs/deepform-kruIrF0o-py3.8/lib/python3.8/site-packages/pandas/util/_decorators.py", line 199, in wrapper
    return func(*args, **kwargs)
  File "/Users/evan/Library/Caches/pypoetry/virtualenvs/deepform-kruIrF0o-py3.8/lib/python3.8/site-packages/pandas/core/frame.py", line 2365, in to_parquet
    to_parquet(
  File "/Users/evan/Library/Caches/pypoetry/virtualenvs/deepform-kruIrF0o-py3.8/lib/python3.8/site-packages/pandas/io/parquet.py", line 270, in to_parquet
    return impl.write(
  File "/Users/evan/Library/Caches/pypoetry/virtualenvs/deepform-kruIrF0o-py3.8/lib/python3.8/site-packages/pandas/io/parquet.py", line 101, in write
    table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
  File "pyarrow/table.pxi", line 1376, in pyarrow.lib.Table.from_pandas
  File "/Users/evan/Library/Caches/pypoetry/virtualenvs/deepform-kruIrF0o-py3.8/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 593, in dataframe_to_arrays
    arrays[i] = maybe_fut.result()
  File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
  File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/Users/evan/Library/Caches/pypoetry/virtualenvs/deepform-kruIrF0o-py3.8/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 565, in convert_column
    raise e
  File "/Users/evan/Library/Caches/pypoetry/virtualenvs/deepform-kruIrF0o-py3.8/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 559, in convert_column
    result = pa.array(col, type=type_, from_pandas=True, safe=safe)
  File "pyarrow/array.pxi", line 265, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 80, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: ('Could not convert 0.0 with type str: tried to convert to double', 'Conversion failed for column gross_amount with type object')
I inspected the DataFrame, and the issue appears to be that the document with slug 499480-cancel-68803-13518579030793-_-pdf has the string '0.0' (type str, not a float) for gross_amount. That leaves the column with object dtype, and pyarrow fails when it tries to convert it to double.
One solution might be to do:
doc_index['gross_amount'] = doc_index.gross_amount.apply(pd.to_numeric, errors='coerce')
before exporting to parquet format, but I wanted to confirm with you all that this field is supposed to be float, and that the 0.0 amount isn't a mistake. I'm also not sure why no one else has run into this, so maybe something else is up.
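A minimal sketch of that idea, assuming the column simply mixes floats with numeric-looking strings. The small DataFrame below is a made-up stand-in for doc_index (the slugs and amounts are invented), and coerce_numeric_column is a hypothetical helper, not something in the deepform codebase. It uses pd.to_numeric on the whole column rather than apply, which should give the same result in one vectorized call:

```python
import pandas as pd


def coerce_numeric_column(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a copy of df with `column` converted to a numeric dtype.

    Values that cannot be parsed as numbers become NaN (errors="coerce").
    Hypothetical helper for illustration only.
    """
    out = df.copy()
    out[column] = pd.to_numeric(out[column], errors="coerce")
    return out


# Made-up stand-in for doc_index: one float, one numeric string, one junk value.
doc_index = pd.DataFrame(
    {
        "slug": ["doc-a", "doc-b", "doc-c"],
        "gross_amount": [1250.0, "0.0", "n/a"],
    }
)

# Find the rows whose gross_amount is stored as a string.
is_str = doc_index["gross_amount"].map(lambda v: isinstance(v, str))
string_slugs = doc_index.loc[is_str, "slug"].tolist()
print("rows holding strings:", string_slugs)

# Coerce the column; after this it should be safe to call
# fixed.to_parquet(...) (not run here, since it needs pyarrow installed).
fixed = coerce_numeric_column(doc_index, "gross_amount")
print(fixed["gross_amount"].dtype)
```

Note that errors='coerce' silently turns anything unparseable into NaN, so it may be worth checking which slugs hold strings (as above) before deciding the 0.0 value is legitimate.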