Description
Series.transform and DF.transform allow creating new columns, but don't have a way to specify the dtype. If Explorer doesn't have enough data to infer the new type, it could come out as the wrong type.
I want to be able to guarantee that my dataframes and series have the right columns and dtypes, regardless of the values of their input.
Series.transform()
This comes out as a string series as expected:
["abc"]
|> Series.from_list(dtype: :string)
|> Series.transform(&String.first/1)
#Explorer.Series<
Polars[1]
string ["a"]
>
But if the function returns nil for every element (String.first("") is nil), or if the input is empty, the result is a null series:
[""]
|> Series.from_list(dtype: :string)
|> Series.transform(&String.first/1)
#Explorer.Series<
Polars[1]
null [nil]
>
[]
|> Series.from_list(dtype: :string)
|> Series.transform(&String.first/1)
#Explorer.Series<
Polars[0]
null []
>
I want to be able to call Series.transform(series, fun, dtype: :string) to specify the dtype of the output.
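Until such an option exists, one possible workaround is to do the mapping in plain Elixir and rebuild the series with an explicit dtype. The following is a minimal sketch, not Explorer's implementation: `TypedTransform.transform/3` is a hypothetical helper, and the script assumes it is run with `elixir` and that `Mix.install/1` can fetch the `:explorer` package.

```elixir
# Assumes a script context (e.g. `elixir sketch.exs`) with access to Hex.
Mix.install([:explorer])

defmodule TypedTransform do
  alias Explorer.Series

  # Hypothetical helper, not part of Explorer: applies `fun` to every element
  # in Elixir, then rebuilds the series with the dtype given by the caller
  # instead of letting it be inferred from the values.
  def transform(series, dtype, fun) do
    series
    |> Series.to_list()
    |> Enum.map(fun)
    |> Series.from_list(dtype: dtype)
  end
end

alias Explorer.Series

# Every element maps to nil, so there is nothing to infer a dtype from.
all_nil =
  [""]
  |> Series.from_list(dtype: :string)
  |> TypedTransform.transform(:string, &String.first/1)

# Empty input: the function is never called.
empty =
  []
  |> Series.from_list(dtype: :string)
  |> TypedTransform.transform(:string, &String.first/1)

IO.inspect(Series.dtype(all_nil), label: "all-nil dtype")
IO.inspect(Series.dtype(empty), label: "empty dtype")
```

Because the dtype is passed to `Series.from_list/2` directly, both cases should keep the requested dtype regardless of the values.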
DF.transform
This one has caused crashes for me when I end up with empty input data.
Make a new column with DF.transform and access it:
df =
DF.new(%{"a" => [1]})
|> DF.transform(fn %{"a" => a} -> %{"b" => a+1} end)
df["b"]
#Explorer.Series<
Polars[1]
s64 [2]
>
But if the input is empty, the Elixir function never runs, so Explorer has no way to know which columns should be created. Accessing the new column then crashes because it doesn't exist. (The desired result here is an empty series.)
df =
DF.new(%{"a" => []})
|> DF.transform(fn %{"a" => a} -> %{"b" => a+1} end)
df["b"]
** (ArgumentError) could not find column name "b". The available columns are: ["a"].
Because of this I have to add some special cases in my code to handle empty inputs separately. I wish I could guarantee which columns come out of the transform with something like DF.transform(df, f, dtypes: [b: :integer]) to tell Explorer to make the column even on empty input, so that I don't need any special case.
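To illustrate the behavior I'm after, here is a minimal sketch of a wrapper that always creates the declared columns. `TypedDFTransform.transform/3` and its `dtypes` argument (a list of `{column_name, dtype}` tuples) are hypothetical, not an Explorer API. It assumes the function returns maps with string keys, and that the script runs with `elixir` and can fetch `:explorer` via `Mix.install/1`.

```elixir
# Assumes a script context (e.g. `elixir sketch.exs`) with access to Hex.
Mix.install([:explorer])

defmodule TypedDFTransform do
  alias Explorer.DataFrame, as: DF
  alias Explorer.Series

  # Hypothetical helper, not part of Explorer. `dtypes` lists the columns the
  # caller expects `fun` to produce, e.g. [{"b", :s64}]. Each declared column
  # is built with an explicit dtype, so it exists even when `fun` never runs.
  def transform(df, dtypes, fun) do
    new_rows = df |> DF.to_rows() |> Enum.map(fun)

    Enum.reduce(dtypes, df, fn {name, dtype}, acc ->
      values = Enum.map(new_rows, &Map.get(&1, name))
      DF.put(acc, name, Series.from_list(values, dtype: dtype))
    end)
  end
end

alias Explorer.DataFrame, as: DF
alias Explorer.Series

add_one = fn %{"a" => a} -> %{"b" => a + 1} end

filled = DF.new(%{"a" => [1]}) |> TypedDFTransform.transform([{"b", :s64}], add_one)
empty = DF.new(%{"a" => []}) |> TypedDFTransform.transform([{"b", :s64}], add_one)

IO.inspect(filled["b"], label: "non-empty input")
IO.inspect(empty["b"], label: "empty input")
```

The same wrapper also covers the case where every row returns nil, since the dtype never depends on the values.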
DF.transform can also fail to infer the type of the column if every row in the transform happens to return nil, which the dtypes option could also solve. For example, this should conceptually result in a new string column, but all Explorer sees with this data is null:
DF.new(%{"s" => [""]})
|> DF.transform(fn %{"s" => s} -> %{"t" => String.first(s)} end)
#Explorer.DataFrame<
Polars[1 x 2]
s string [""]
t null [nil]
>
DF.mutate / Series.map
These functions also allow creating new columns without a way to specify the new dtypes. However, Explorer can figure out the dtype from the query macro. I haven't personally run into a situation where I wanted to specify the dtype, but while you're thinking about transform, it may be worth checking whether these functions need it too.
I've looked for examples, and all I found is this hypothetical example: Series.from_list([100], dtype: :u8) |> Series.map(_ * _) results in a u8 series with overflowed data, and you might want to do Series.map(_ * _, dtype: :u16) to fit the output.
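For this hypothetical case, casting before the arithmetic already seems to be enough, which may be why a dtype option matters less here. A minimal sketch using `Series.cast/2` and `Series.multiply/2`, assuming the script runs with `elixir` and can fetch `:explorer` via `Mix.install/1`:

```elixir
# Assumes a script context (e.g. `elixir sketch.exs`) with access to Hex.
Mix.install([:explorer])

alias Explorer.Series

narrow = Series.from_list([100], dtype: :u8)

# Widen first so that 100 * 100 = 10_000 fits; a u8 tops out at 255.
wide = Series.cast(narrow, :u16)
squared = Series.multiply(wide, wide)

IO.inspect(squared, label: "squared")
```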