A categorical feature selection approach based on information theoretical considerations.
Implementation of the fast correlation-based filter (FCBF) proposed by Yu and Liu:
@inproceedings{inproceedings,
author = {Yu, Lei and Liu, Huan},
year = {2003},
month = {01},
pages = {856-863},
title = {Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution},
volume = {2},
journal = {Proceedings, Twentieth International Conference on Machine Learning}
}Data for testing is taken from the UCI Machine Learning Repository. See also notes on the contained lung cancer dataset.
from fcbf import fcbf, data
dataset = data.lung_cancer
X = dataset.loc[:, [dataset.columns[0]] + dataset.columns[2:].tolist()]
y = dataset[dataset.columns[1]].astype(int)
print(X)
print(y)
relevant_features, irrelevant_features, correlations = fcbf(X, y, su_threshold=0.1, base=2)
print('relevant_features:', relevant_features, '(count:', len(relevant_features), ')')
print('irrelevant_features:', irrelevant_features, '(count:', len(irrelevant_features), ')')
print('correlations:', correlations)Using pip, execute the following
pip install fcbfTODO
TODO
Code is released under the MIT License. All dependencies are copyright to the respective authors and released under the respective licenses.