Preprocessing of discrete variables

I was digging around in the preprocessing code and found that there are two heuristics for identifying discrete variables: one checks whether all non-missing values are integers ([L199](https://github.com/trappmartin/BayesianSumProductNetworks/blob/73534eec1005da86ff57800a99083a78759e9656/scripts/run_abda_rg.jl#L199)), and the other counts unique values ([L200](https://github.com/trappmartin/BayesianSumProductNetworks/blob/73534eec1005da86ff57800a99083a78759e9656/scripts/run_abda_rg.jl#L200)). Which one did you use in the experiments, and why do you prefer it over the other?
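To make the question concrete, here is a minimal Python sketch of the two heuristics as I understand them. It is not the code from `run_abda_rg.jl`: the function names, the `max_unique` threshold, and the toy columns are all hypothetical, and the script may handle missing values differently.

```python
import math

def is_missing(v):
    # Assumption: missing values are encoded as None or NaN.
    return v is None or (isinstance(v, float) and math.isnan(v))

def all_integer_valued(column):
    # Heuristic 1: every non-missing value is a whole number.
    present = [v for v in column if not is_missing(v)]
    return bool(present) and all(float(v).is_integer() for v in present)

def few_unique_values(column, max_unique):
    # Heuristic 2: the number of distinct non-missing values is small.
    # max_unique is a hypothetical threshold, not taken from the script.
    present = {v for v in column if not is_missing(v)}
    return 0 < len(present) <= max_unique

# Toy columns on which the two heuristics disagree.
ages = [23.0, 41.0, math.nan, 35.0]      # whole numbers, 3 distinct values
grades = [1.5, 2.5, 1.5, 2.5, math.nan]  # non-integers, 2 distinct values

for name, col in [("ages", ages), ("grades", grades)]:
    print(name, all_integer_valued(col), few_unique_values(col, max_unique=2))
```

With these toy columns the first heuristic flags only `ages` and the second flags only `grades`, so the choice between them can change which features end up treated as discrete.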

Another thing I noticed is that the casting of binary variables to 0/1 values is applied to the whole dataset, not just the particular column ([L212](https://github.com/trappmartin/BayesianSumProductNetworks/blob/73534eec1005da86ff57800a99083a78759e9656/scripts/run_abda_rg.jl#L212)). Fortunately, the heterogeneous datasets from the ABDA paper should have the binary variables removed. However, this is not the case for the Anneal-U dataset, where the 9th feature is preprocessed as categorical instead of binary; I believe the authors probably used the value range from the [metadata](https://github.com/probabilistic-learning/abda/blob/3d832f7f97c9031b23d8665174d51cc02b788d01/bin/abda.py#L285). As a result, there is some discrepancy caused by the different "detection" of discrete variables, but I cannot judge whether it has a negative effect on the results for that particular dataset.
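The whole-dataset casting concern can be illustrated with a small Python sketch. This is not a reproduction of what line 212 does; it only shows the general failure mode I have in mind, and the function names and toy data are hypothetical.

```python
def cast_binary_whole_dataset(rows, col):
    # Hypothetical buggy variant: the 0/1 mapping is derived from the
    # binary column but then applied to every cell in the dataset.
    lo = min(r[col] for r in rows)
    return [[0 if v == lo else 1 for v in r] for r in rows]

def cast_binary_column(rows, col):
    # Intended variant: only the binary column is recoded to 0/1,
    # all other columns are left untouched.
    lo = min(r[col] for r in rows)
    return [
        [(0 if v == lo else 1) if j == col else v for j, v in enumerate(r)]
        for r in rows
    ]

# Column 0 is continuous, column 1 is binary with values {1, 2}.
rows = [[5.2, 1], [3.1, 2], [4.8, 2]]

print(cast_binary_whole_dataset(rows, col=1))
print(cast_binary_column(rows, col=1))
```

In the first variant the continuous column is overwritten as a side effect, while the second leaves it intact. If all binary columns are removed beforehand, neither variant is ever triggered, which is why the ABDA datasets without binaries would be unaffected.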
