```go
import preproc "github.com/rom1mouret/ml-essentials/preprocessing"
```

The package provides the following preprocessors:

- FloatImputer
- Scaler
- HashEncoder
- OneHotEncoder
- AutoPreprocessor, a preprocessor that combines the 4 components above.
Preprocessors follow a simple design principle: by default, they operate on every column of the type they process. For example, FloatImputer will use every float column. If you want to train FloatImputer on only a subset of the float columns, use ColumnView as follows:
```go
imputer := preproc.NewFloatImputer(preproc.FloatImputerOptions{Policy: preproc.Mean})
imputer.Fit(df.ColumnView("height", "age"))
```

Once the imputer is trained on a subset of columns such as "height" and "age", it does not matter which other columns come along when performing the transformation:

```go
imputer.Fit(df.ColumnView("height", "age"))
imputer.TransformInplace(df.ColumnView("height", "age", "weight"))
```

In the example above, "weight" will be ignored.
See preprocessing/interfaces.go
```go
// serialization
serialized, err := json.Marshal(preproc)

// deserialization
preproc = &preprocessing.AutoPreprocessor{}
err = json.Unmarshal(serialized, preproc)
```

To one-hot encode strings, first run a HashEncoder to transform the strings into integers, then a OneHotEncoder to transform the integer categories into boolean columns.
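The two-step idea can be illustrated without the library. The sketch below uses the standard library's FNV hash; `hashEncode` and `oneHot` are hypothetical stand-ins for the concept, not the library's actual API:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashEncode maps each string to a 64-bit integer category,
// mirroring what a hash-based encoder does (no dimensionality reduction).
func hashEncode(values []string) []uint64 {
	out := make([]uint64, len(values))
	for i, v := range values {
		h := fnv.New64a()
		h.Write([]byte(v))
		out[i] = h.Sum64()
	}
	return out
}

// oneHot turns integer categories into boolean columns,
// one column per distinct category seen during the "fit" pass.
func oneHot(codes []uint64) ([][]bool, []uint64) {
	// fit: collect the distinct categories in order of appearance
	index := map[uint64]int{}
	var categories []uint64
	for _, c := range codes {
		if _, ok := index[c]; !ok {
			index[c] = len(categories)
			categories = append(categories, c)
		}
	}
	// transform: one boolean row per input value
	rows := make([][]bool, len(codes))
	for i, c := range codes {
		row := make([]bool, len(categories))
		row[index[c]] = true
		rows[i] = row
	}
	return rows, categories
}

func main() {
	codes := hashEncode([]string{"red", "green", "red"})
	rows, cats := oneHot(codes)
	fmt.Println(len(cats))              // 2 distinct categories
	fmt.Println(rows[0][0], rows[2][0]) // both "red" rows land in the same column
}
```

Note that the hash step alone does not reduce dimensionality; it only replaces arbitrary strings with fixed-size integers that the one-hot step can index.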
Later, we may implement an OrdinalEncoder as an alternative to HashEncoder, but the chance of a hash collision is extremely low on 64-bit systems, so I recommend sticking with HashEncoder on such systems.
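As a rough sanity check on that claim (my calculation, not part of the library), the birthday bound puts the probability of any collision among n distinct categories under a uniform 64-bit hash at roughly n(n-1)/2 divided by 2^64:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// Birthday-bound approximation: P(any collision) ≈ n*(n-1) / 2 / 2^64
	// for n distinct categories hashed uniformly into 64 bits.
	n := 1e6 // one million distinct categories
	p := n * (n - 1) / 2 / math.Pow(2, 64)
	fmt.Printf("collision probability for %g categories: %.2e\n", n, p)
}
```

Even with a million distinct categories, the probability comes out around 3e-8, which is why hash collisions are rarely a practical concern at this hash width.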
To avoid any confusion: HashEncoder does not vectorize categories via feature hashing (the "hashing trick"). Vectorizing is the job of OneHotEncoder, and HashEncoder does not project categories onto a lower-dimensional space.
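For contrast, here is a minimal sketch of what feature hashing does do (and what HashEncoder deliberately avoids): categories are folded into a small, fixed number of buckets, so distinct categories can collide by design. The `featureHash` helper is illustrative only:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// featureHash projects a string into one of nBuckets output columns.
// Because the output dimension is fixed and small, distinct categories
// may share a bucket -- the defining trade-off of the hashing trick.
func featureHash(value string, nBuckets uint64) uint64 {
	h := fnv.New64a()
	h.Write([]byte(value))
	return h.Sum64() % nBuckets
}

func main() {
	for _, v := range []string{"red", "green", "blue"} {
		fmt.Printf("%s -> bucket %d of 8\n", v, featureHash(v, 8))
	}
}
```

HashEncoder keeps the full 64-bit hash instead of taking it modulo a small bucket count, which is why it preserves (with near-certainty) a distinct integer per category rather than producing a reduced vector space.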