feature request: dtype option for transform #1111

@skyqrose

Description

Series.transform and DF.transform can create new columns, but neither has a way to specify the dtype of the result. If Explorer doesn't have enough data to infer the new dtype, the result can come out with the wrong one.

I want to be able to guarantee that my dataframes and series have the right columns and dtypes, regardless of the values of their input.

Series.transform()

This comes out as a string series as expected:

["abc"]
|> Series.from_list(dtype: :string)
|> Series.transform(&String.first/1)

#Explorer.Series<
  Polars[1]
  string ["a"]
>

But if the function returns nil for every element, or the input is empty, the result is a null series:

[""]
|> Series.from_list(dtype: :string)
|> Series.transform(&String.first/1)

#Explorer.Series<
  Polars[1]
  null [nil]
>

[]
|> Series.from_list(dtype: :string)
|> Series.transform(&String.first/1)

#Explorer.Series<
  Polars[0]
  null []
>

I want to be able to write something like Series.transform(series, fun, dtype: :string) to specify the dtype of the output.
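Until such an option exists, one possible workaround is to cast the result explicitly after the transform. The following is only a sketch: `TransformTyped.transform_as/3` is a hypothetical helper, not part of Explorer, and it assumes that `Explorer.Series.cast/2` accepts a series whose dtype is null.

```elixir
Mix.install([:explorer])

defmodule TransformTyped do
  alias Explorer.Series

  # Hypothetical helper (not an Explorer API): run Series.transform/2 and then
  # cast to the requested dtype, so that an all-nil or empty result does not
  # stay a `null` series. Assumes casting from `null` is supported.
  def transform_as(series, dtype, fun) do
    series
    |> Series.transform(fun)
    |> Series.cast(dtype)
  end
end

# Usage with the two problematic inputs from above.
[""]
|> Explorer.Series.from_list(dtype: :string)
|> TransformTyped.transform_as(:string, &String.first/1)
|> IO.inspect()

[]
|> Explorer.Series.from_list(dtype: :string)
|> TransformTyped.transform_as(:string, &String.first/1)
|> IO.inspect()
```

This costs an extra pass over the data and has to be repeated at every call site, which is why a built-in dtype option would still be preferable.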

DF.transform

This one has caused crashes for me when I end up with empty input data.

Make a new column with DF.transform and access it:

df =
  DF.new(%{"a" => [1]})
  |> DF.transform(fn %{"a" => a} -> %{"b" => a + 1} end)

df["b"]

#Explorer.Series<
  Polars[1]
  s64 [2]
>

But if the input is empty, the Elixir function never runs, so Explorer has no way to know which columns should be created. Accessing the new column then crashes because it doesn't exist. (The desired result here is an empty series.)

df =
  DF.new(%{"a" => []})
  |> DF.transform(fn %{"a" => a} -> %{"b" => a + 1} end)

df["b"]

** (ArgumentError) could not find column name "b". The available columns are: ["a"].

Because of this, I have to add special cases in my code to handle empty inputs separately. I wish I could guarantee which columns come out of the transform with something like DF.transform(df, f, dtypes: [b: :integer]), which would tell Explorer to create the column even on empty input so that no special case is needed.
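For reference, the kind of special-casing this currently requires can be sketched as a post-processing step. `EnsureColumns.ensure/2` is a hypothetical helper, not an Explorer function; it assumes that `DF.put/3` accepts a zero-length series on a zero-row dataframe and that a null column can be cast to the target dtype.

```elixir
Mix.install([:explorer])

defmodule EnsureColumns do
  alias Explorer.DataFrame, as: DF
  alias Explorer.Series

  # Hypothetical helper (not an Explorer API). `dtypes` maps column names to
  # the dtypes they should have after the transform.
  def ensure(df, dtypes) do
    Enum.reduce(dtypes, df, fn {name, dtype}, acc ->
      if name in DF.names(acc) do
        # Column exists but may have been inferred as `null`: cast it.
        DF.put(acc, name, Series.cast(acc[name], dtype))
      else
        # Column is missing (e.g. empty input): add an all-nil column
        # with one entry per row, which is an empty series for 0 rows.
        nils = List.duplicate(nil, DF.n_rows(acc))
        DF.put(acc, name, Series.from_list(nils, dtype: dtype))
      end
    end)
  end
end

alias Explorer.DataFrame, as: DF

df =
  DF.new(%{"a" => []})
  |> DF.transform(fn %{"a" => a} -> %{"b" => a + 1} end)
  |> EnsureColumns.ensure(%{"b" => {:s, 64}})

IO.inspect(df["b"])
```

A built-in dtypes option would make this wrapper unnecessary and would avoid building and then casting an intermediate column.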

DF.transform can also fail to infer the dtype of a column if every row of the transform happens to return nil, which the dtypes option would also solve. For example, this should conceptually produce a new string column, but with this data all Explorer sees is null:

DF.new(%{"s" => [""]})
|> DF.transform(fn %{"s" => s} -> %{"t" => String.first(s)} end)

#Explorer.DataFrame<
  Polars[1 x 2]
  s string [""]
  t null [nil]
>

DF.mutate / Series.map

These functions also create new columns without letting you specify the new dtypes. However, Explorer can work out the dtype from the query macro. I haven't personally run into a situation where I wanted to specify the dtype, but while you're thinking about transform, it may be worth checking whether these functions need the option, too.

I've looked for examples, and all I found is this hypothetical one: Series.from_list([100], dtype: :u8) |> Series.map(_ * _) results in a u8 series with overflowed data, and you might want to write Series.map(_ * _, dtype: :u16) so that the output fits.
