```go
import preproc "github.com/rom1mouret/ml-essentials/preprocessing"
```

The package provides the following preprocessors:

- FloatImputer
- Scaler
- HashEncoder
- OneHotEncoder
- AutoPreprocessor, a preprocessor that combines the 4 components above.
Preprocessors follow a simple design principle: by default, they operate on every column of the type they process. For example, FloatImputer will use every float column. If you want to train FloatImputer on only a subset of the float columns, use ColumnView as follows:
```go
imputer := preproc.NewFloatImputer(preproc.FloatImputerOptions{Policy: preproc.Mean})
imputer.Fit(df.ColumnView("height", "age"))
```

Once the imputer is trained on a subset of columns such as "height" and "age", it does not matter which other columns come along when performing the transformation:

```go
imputer.Fit(df.ColumnView("height", "age"))
imputer.TransformInplace(df.ColumnView("height", "age", "weight"))
```

In the example above, "weight" will be ignored.
See preprocessing/interfaces.go
```go
// serialization
serialized, err := json.Marshal(preproc)

// deserialization
preproc = &preprocessing.AutoPreprocessor{}
err = json.Unmarshal(serialized, preproc)
```

To one-hot encode strings, first run a HashEncoder to transform the strings into integers, then a OneHotEncoder to transform the integer categories into boolean columns.
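The two-step idea can be illustrated without the library. The sketch below uses the standard library's FNV hash; `hashEncode` and `oneHot` are hypothetical stand-ins for the concept, not the library's actual API:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashEncode maps each string to a 64-bit integer category,
// mirroring what a hash-based encoder does (no dimensionality reduction).
func hashEncode(values []string) []uint64 {
	out := make([]uint64, len(values))
	for i, v := range values {
		h := fnv.New64a()
		h.Write([]byte(v))
		out[i] = h.Sum64()
	}
	return out
}

// oneHot turns integer categories into boolean columns,
// one column per distinct category seen during the "fit" pass.
func oneHot(codes []uint64) ([][]bool, []uint64) {
	// fit: collect the distinct categories in order of appearance
	index := map[uint64]int{}
	var categories []uint64
	for _, c := range codes {
		if _, ok := index[c]; !ok {
			index[c] = len(categories)
			categories = append(categories, c)
		}
	}
	// transform: one boolean row per input value
	rows := make([][]bool, len(codes))
	for i, c := range codes {
		row := make([]bool, len(categories))
		row[index[c]] = true
		rows[i] = row
	}
	return rows, categories
}

func main() {
	codes := hashEncode([]string{"red", "green", "red"})
	rows, cats := oneHot(codes)
	fmt.Println(len(cats))              // 2 distinct categories
	fmt.Println(rows[0][0], rows[2][0]) // both "red" rows land in the same column
}
```

Note that the hash step alone does not reduce dimensionality; it only replaces arbitrary strings with fixed-size integers that the one-hot step can index.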
Later, we may implement an OrdinalEncoder as an alternative to HashEncoder, but the chance of a hash collision is extremely low on 64-bit systems, so I recommend sticking with HashEncoder on such systems.
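As a rough sanity check on that claim (my calculation, not part of the library), the birthday bound puts the probability of any collision among n distinct categories under a uniform 64-bit hash at roughly n(n-1)/2 divided by 2^64:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// Birthday-bound approximation: P(any collision) ≈ n*(n-1) / 2 / 2^64
	// for n distinct categories hashed uniformly into 64 bits.
	n := 1e6 // one million distinct categories
	p := n * (n - 1) / 2 / math.Pow(2, 64)
	fmt.Printf("collision probability for %g categories: %.2e\n", n, p)
}
```

Even with a million distinct categories, the probability comes out around 3e-8, which is why hash collisions are rarely a practical concern at this hash width.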
To avoid any confusion: HashEncoder does not vectorize categories via feature hashing (the "hashing trick"). Vectorizing is the job of OneHotEncoder, and HashEncoder does not project categories onto a lower-dimensional space.
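For contrast, here is a minimal sketch of what feature hashing does do (and what HashEncoder deliberately avoids): categories are folded into a small, fixed number of buckets, so distinct categories can collide by design. The `featureHash` helper is illustrative only:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// featureHash projects a string into one of nBuckets output columns.
// Because the output dimension is fixed and small, distinct categories
// may share a bucket -- the defining trade-off of the hashing trick.
func featureHash(value string, nBuckets uint64) uint64 {
	h := fnv.New64a()
	h.Write([]byte(value))
	return h.Sum64() % nBuckets
}

func main() {
	for _, v := range []string{"red", "green", "blue"} {
		fmt.Printf("%s -> bucket %d of 8\n", v, featureHash(v, 8))
	}
}
```

HashEncoder keeps the full 64-bit hash instead of taking it modulo a small bucket count, which is why it preserves (with near-certainty) a distinct integer per category rather than producing a reduced vector space.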