POME is a graph-based representation-learning method for heterogeneous datasets that incorporates missingness structures into the computation of low-dimensional sample and variable embeddings. It is applicable to any tabular datasets consisting of both numeric- and categorical-type features, where missing data patterns are supposed to be taken into account.
POME is implemented as a Python package and is easily installable from PyPI by running
pip install pome-py
or locally from this repository by running
pip install .
POME expects input data to be given in the form of a pandas dataframe object, with rows representing variables/features and columns representing samples. Missing data needs to be encoded by a unique numerical value. Furthermore, POME expects one column storing datatypes of the respective variables. An example dataset could have the following structure, with e.g. value -99 encoding missing data:
| Sample1 | Sample2 | Sample3 | Type | |
|---|---|---|---|---|
| VariableA | 0 | 1 | -99 | cat |
| VariableB | 3.14 | -0.1 | 2.5 | numerical |
| VariableC | 0.3 | 1.2 | -99 | numerical |
| VariableD | 1 | 0 | 2 | cat |
POME's core functionality is integrated into its Embedder class, which handles input transformation, training and output generation. A typical such workflow looks as follows:
import pandas as pd
from pome import Embedder
if __name__ == "__main__":
# Load data and set parameters.
example_df = pd.read_csv("example.csv", index_col=0)
NA_ENCODING = -99.0
DIMENSION = 16
DEVICE = "cpu"
# Initialize embedding object with parameters.
embedder = Embedder(epochs=100,
na_encoding=NA_ENCODING,
embedding_dimension=DIMENSION,
device=DEVICE,
enable_imputation=True)
# Fit embedding object to dataset.
embedder.fit(example_df)
# Output stores low-dimensional embeddings for samples and variables.
sample_embeddings, variable_embeddings, _ , _ = embedder.get_embeddings()
print("Computed sample embeddings: \n", sample_embeddings)
imputed_df = embedder.impute_all(na_value=NA_ENCODING)
print("Imputed data: \n", imputed_df)POME's Embedder class allows for the specification of the following parameters:
embedding_dimension : int = 32: Specifies the number of dimensions of the sample & variable embeddings learned by POME.epochs : int = 500: Sets the number of epochs that POME is supposed to be trained.device : str = "cpu": Specifies whether to train on CPU ("cpu") or GPU ("cuda").na_encoding : float = -99.0: The float encoding value of missing data.enable_imputation : bool = False: Set this to true if you want to use POME for imputation after training.
After initializing the Embedder object, the main three functions for using POME are:
fit(X, y=None): Training POME on the given input dataframe, with the input format as specified above.get_embeddings(): Return computed embeddings of samples and variables in dataframe format. Output is a four-tuple with sample embeddings in at position 0, and variable embeddings at position 1.impute_all(na_value : float): Imputes all missing values specified byna_valuein the input dataset, and directly returns the imputed dataframe.