Add support for decimal types #70

Open: brendanyounger wants to merge 1 commit into adjust:master from brendanyounger:decimal-types

Conversation

@brendanyounger:

Can now read Parquet files with decimal types.

@za-arthur added the enhancement (New feature or request) label Apr 28, 2023

@za-arthur (Contributor) left a comment:


Thank you @brendanyounger for the new feature. Could you also add tests and fix the README?

        return TIMESTAMPOID;
    case arrow::Type::DATE32:
        return DATEOID;
    case arrow::Type::DECIMAL:

@za-arthur (Contributor) commented on this diff:
Shouldn't DECIMAL128 and DECIMAL256 be used here, similar to bytes_to_postgres_type?

@Magmatrix:

@brendanyounger Hi, I tested this in PG 15 but couldn't get it to work fully.

I tried creating the foreign table with a price column of type NUMERIC, and later of type FLOAT. Regardless of the type used, price always shows up as 0. Example:

parquet=# select price,price_m,price_e from "Order" limit 10;
 price | price_m | price_e
-------+---------+---------
     0 |    6000 |      -3
     0 |    2320 |      -3
     0 |   10000 |      -3
     0 |    1530 |      -3
     0 |   66200 |      -3
     0 |    3000 |      -3
     0 |    4010 |      -3
     0 |     232 |      -3
     0 |    7420 |      -3
     0 |  291940 |      -3
(10 rows)

However, if I cast it from NUMERIC to FLOAT (or vice versa), all price values EXCEPT the first row's show up correctly:

parquet=# select price::NUMERIC,price_m,price_e from "Order" limit 10;
 price  | price_m | price_e
--------+---------+---------
      0 |    6000 |      -3
   2.32 |    2320 |      -3
     10 |   10000 |      -3
   1.53 |    1530 |      -3
   66.2 |   66200 |      -3
      3 |    3000 |      -3
   4.01 |    4010 |      -3
  0.232 |     232 |      -3
   7.42 |    7420 |      -3
 291.94 |  291940 |      -3
(10 rows)

The cast obviously triggers something (after the first row), but I have no idea where to start looking for this bug. Any ideas?

@Magmatrix:

NB: If I count the number of orders with price = 0, this happens:

parquet=# select count(*) from "Order" where price = 0;
ERROR:  parquet_fdw: failed to extract row groups from Parquet file: row group filter match failed: cache lookup failed for function 0 ('/srv/parquet-test/Order/2025/02/10/10/Order.parquet')

But with a cast, it finds 105 cases where the price wrongly shows up as 0:

parquet=# select count(*) from "Order" where price::numeric = 0;
 count
-------
   105

This table is constructed from exactly 105 files, so it's the first price in each file that gets misinterpreted when using a cast (or every row if not using a cast). I guess this might help in finding the bug.

@Magmatrix commented Mar 5, 2025:

And a bit more testing shows that if you add another cast, the results become random (different results every time):

parquet=# select count(*) from "Order" where price::numeric::float = 0;
 count
--------
 477094
(1 row)

parquet=# select count(*) from "Order" where price::numeric::float = 0;
 count
--------
 459599
(1 row)

parquet=# select count(*) from "Order" where price::numeric::float <> 0;
 count
--------
 405207
(1 row)

parquet=# select count(*) from "Order" where price::numeric::float <> 0;
 count
--------
 427107
(1 row)

        (UNIX_EPOCH_JDATE - POSTGRES_EPOCH_JDATE));
    case arrow::Type::DECIMAL128: {
        auto dectype = (arrow::Decimal128Type *)arrow_type;
        std::string val = arrow::Decimal128(bytes).ToString(dectype->scale());


Converting the type to a string and then processing it with numeric_in is relatively inefficient, right? At least in our tests, it is significantly slower than Spark’s parsing. Are there better solutions to handle this type of conversion?


4 participants