IdentiBench is a Python library designed to streamline and standardize the benchmarking of system identification models. Evaluating and comparing dynamic models often requires repetitive setup for data handling, evaluation protocols, and metric implementations, making fair comparisons and reproducible results hard to achieve. IdentiBench tackles this by offering a collection of pre-defined benchmark specifications for simulation and prediction tasks, built upon common datasets. It automates data downloading and processing into a consistent format and provides standard evaluation metrics via a simple interface (`run_benchmark`). This lets you focus your efforts on developing innovative models while relying on IdentiBench for robust and reproducible evaluation.
- Access Many Benchmarks from Different Systems: Instantly utilize pre-configured benchmarks covering diverse domains like electronics (Silverbox), mechanics (Industrial Robot), process control (Cascaded Tanks), aerospace (Quadrotors), and more, available for both simulation and prediction tasks.
- Automate Data Management: Forget manual downloading and processing; the library handles fetching data from various sources (web, Drive, Dataverse), extracting archives (ZIP, RAR, MAT, BAG), converting to a standard HDF5 format, and caching locally.
- Integrate Any Model: Plug in your custom models, regardless of the Python framework used (NumPy, SciPy, PyTorch, TensorFlow, JAX, etc.), via a straightforward function interface (`build_model`) that receives all necessary context.
- Capture Comprehensive Results: Obtain detailed evaluation reports including standard metrics (RMSE, NRMSE, FIT%, etc.), task-specific scores, execution timings, configuration parameters (hyperparameters, seed), and raw model predictions for thorough analysis.
- Easily Define New Benchmarks: Go beyond the included datasets by creating your own benchmark specifications (`BenchmarkSpecSimulation`, `BenchmarkSpecPrediction`) for private data or unique tasks, leveraging the library’s structure and transparent data format.
You can install identibench using pip:

```sh
pip install identibench
```

To install the latest development version directly from GitHub, use:

```sh
pip install git+https://github.com/daniel-om-weber/identibench.git
```

For development:

```sh
git clone https://github.com/daniel-om-weber/identibench.git
cd identibench
uv sync --extra dev
```

```python
# Basic usage
import identibench as idb
from pathlib import Path

# Example: Download a single dataset
# Note: Always use a Path object, not a string
save_path = Path('./tmp/wh')
idb.datasets.workshop.dl_wiener_hammerstein(save_path)
```

The following example wraps a FROLS model from the sysidentpy package in a `build_model` function:

```python
from sysidentpy.model_structure_selection import FROLS
from sysidentpy.parameter_estimation import LeastSquares
def build_frols_model(context):
    # Fit on the first available training sequence
    u_train, y_train, _ = next(context.get_train_sequences())

    # Read hyperparameters, falling back to defaults
    ylag = context.hyperparameters.get('ylag', 5)
    xlag = context.hyperparameters.get('xlag', 5)
    n_terms = context.hyperparameters.get('n_terms', 10)
    estimator = context.hyperparameters.get('estimator', LeastSquares())

    _model = FROLS(xlag=xlag, ylag=ylag, n_terms=n_terms, estimator=estimator)
    _model.fit(X=u_train, y=y_train)

    def model(u_test, y_init):
        # Warm-start the free-run simulation with the initial output window
        yhat_full = _model.predict(X=u_test, y=y_init[:_model.max_lag])
        # Return only the predictions after the warm-up samples
        return yhat_full[_model.max_lag:]

    return model

hyperparams = {
    'ylag': 2,
    'xlag': 2,
    'n_terms': 10,  # Number of terms for FROLS
    'estimator': LeastSquares()
}
results = idb.run_benchmark(
    spec=idb.BenchmarkWH_Simulation,
    build_model=build_frols_model,
    hyperparameters=hyperparams
)
```

Available simulation benchmarks:

| Key | Benchmark Name |
|---|---|
| WH_Sim | BenchmarkWH_Simulation |
| Silverbox_Sim | BenchmarkSilverbox_Simulation |
| Tanks_Sim | BenchmarkCascadedTanks_Simulation |
| CED_Sim | BenchmarkCED_Simulation |
| EMPS_Sim | BenchmarkEMPS_Simulation |
| NoisyWH_Sim | BenchmarkNoisyWH_Simulation |
| RobotForward_Sim | BenchmarkRobotForward_Simulation |
| RobotInverse_Sim | BenchmarkRobotInverse_Simulation |
| Ship_Sim | BenchmarkShip_Simulation |
| QuadPelican_Sim | BenchmarkQuadPelican_Simulation |
| QuadPi_Sim | BenchmarkQuadPi_Simulation |
Available prediction benchmarks:

| Key | Benchmark Name |
|---|---|
| WH_Pred | BenchmarkWH_Prediction |
| Silverbox_Pred | BenchmarkSilverbox_Prediction |
| Tanks_Pred | BenchmarkCascadedTanks_Prediction |
| CED_Pred | BenchmarkCED_Prediction |
| EMPS_Pred | BenchmarkEMPS_Prediction |
| NoisyWH_Pred | BenchmarkNoisyWH_Prediction |
| RobotForward_Pred | BenchmarkRobotForward_Prediction |
| RobotInverse_Pred | BenchmarkRobotInverse_Prediction |
| Ship_Pred | BenchmarkShip_Prediction |
| QuadPelican_Pred | BenchmarkQuadPelican_Prediction |
| QuadPi_Pred | BenchmarkQuadPi_Prediction |
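These keys index dictionaries of benchmark specifications. A short sketch of looking one up by key, using the `idb.simulation_benchmarks` dictionary that also appears in the multi-benchmark example below (a matching `idb.prediction_benchmarks` dictionary for the prediction tasks is an assumption, not confirmed here):

```python
import identibench as idb

# Look up a simulation benchmark specification by its key
spec = idb.simulation_benchmarks['WH_Sim']  # the BenchmarkWH_Simulation spec

# List every available simulation benchmark key
print(sorted(idb.simulation_benchmarks.keys()))
```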
This section provides more detail on the core concepts and components of
the identibench workflow.
identibench defines two main types of benchmark tasks, specified using
different classes:
- Simulation (`BenchmarkSpecSimulation`):
  - Goal: Evaluate a model’s ability to perform a free-run simulation, predicting the system’s output over an extended period given the input sequence.
  - Typical Input to Predictor: The full input sequence (`u_test`) and potentially an initial segment of the output sequence (`y_test[:init_window]`) for warm-up or state initialization.
  - Expected Output from Predictor: The predicted output sequence (`y_pred`) corresponding to the input, usually excluding the warm-up period.
  - Use Case: Assessing models intended for long-term prediction, control simulation, or understanding overall system dynamics.
- Prediction (`BenchmarkSpecPrediction`):
  - Goal: Evaluate a model’s ability to predict the system’s output k steps into the future based on recent past data.
  - Typical Input to Predictor: Often involves windows of past inputs and outputs (e.g., `u[t:t+H]`, `y[t:t+H]`).
  - Expected Output from Predictor: The predicted output at a specific future time step (e.g., `y[t+H+k]`). The `pred_horizon` parameter defines k, and `pred_step` defines how frequently predictions are made.
  - Use Case: Evaluating models focused on short-to-medium term forecasting, state estimation, or receding horizon control.
- `init_window`: Both benchmark types often use an `init_window`. This specifies an initial number of time steps whose data might be provided to the model for initialization or warm-up. Importantly, data within this window is typically excluded from the final performance metric calculation to ensure a fair evaluation of the model’s predictive capabilities beyond the initial transient. A sketch of a simulation predictor’s expected shape follows this list.
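The FROLS example in the quickstart already follows this shape for the simulation task; the placeholder below isolates just the interface. The trivial constant "dynamics" are illustrative only, and the exact signature is defined by each benchmark spec:

```python
import numpy as np

def simulation_predictor(u_test: np.ndarray, y_init: np.ndarray) -> np.ndarray:
    """Free-run simulation: full test input in, post-warm-up output out."""
    init_window = len(y_init)  # warm-up segment provided for initialization
    # Placeholder: hold the last warm-up output for all remaining steps.
    # A real model would simulate the system forward from its initial state.
    y_last = np.atleast_1d(y_init[-1])
    return np.tile(y_last, (len(u_test) - init_window, 1))
```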
The core of integrating your custom logic is the `build_model` function you provide to `run_benchmark`.

- Purpose: This function is responsible for defining your model architecture, training it using the provided data, and returning a callable predictor function.
- Input (`context: TrainingContext`): Your `build_model` function receives a single argument, `context`, which is a `TrainingContext` object. This object gives you access to:
  - `context.spec`: The full specification of the current benchmark being run (including dataset paths, input/output columns, `init_window`, etc.).
  - `context.hyperparameters`: A dictionary containing any hyperparameters you passed to `run_benchmark`. Use this to configure your model or training process.
  - `context.seed`: A random seed for ensuring reproducibility.
  - Data access methods: Functions like `context.get_train_sequences()` and `context.get_valid_sequences()` provide iterators over the raw, full-length training and validation sequences (as tuples of NumPy arrays `(u, y, x)`). Note: you need to handle any batching or windowing required by your specific training algorithm inside `build_model`.
- Output (Predictor Callable): `build_model` must return a callable object (e.g., a function or an object’s method) representing your trained model, ready for prediction/simulation. This callable is used internally by `run_benchmark` on the test set. Its expected signature depends on the benchmark type, but it typically accepts NumPy arrays for test inputs (and potentially initial outputs) and returns a NumPy array of predictions. A minimal skeleton follows this list.
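Putting these pieces together, a minimal `build_model` might look like the sketch below. The constant-mean "model" stands in for your actual training code; only the `context` accessors described above are relied upon:

```python
import numpy as np

def build_mean_model(context):
    # Access training data; batching/windowing is your responsibility
    _, y_train, _ = next(context.get_train_sequences())

    # "Training": remember the mean training output (in a real procedure,
    # seed any randomness from context.seed for reproducibility)
    y_mean = y_train.mean(axis=0)

    def predictor(u_test, y_init):
        # Predict the training mean for every step after the warm-up
        n_steps = len(u_test) - len(y_init)
        return np.tile(y_mean, (n_steps, 1))

    return predictor
```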
To evaluate a model across several scenarios efficiently, use the `run_benchmarks` function:
```python
# Example: Run on a subset of benchmarks
specs_to_run = {
    'WH_Sim': idb.simulation_benchmarks['WH_Sim'],
    'Silverbox_Sim': idb.simulation_benchmarks['Silverbox_Sim']
}

# Reuse the build function defined in the quickstart example
all_results = idb.run_benchmarks(specs_to_run, build_model=build_frols_model, n_times=3)
all_results
```

```text
--- Starting benchmark run for 2 specifications, repeating each 3 times ---
-- Repetition 1/3 --
[1/6] Running: BenchmarkWH_Simulation (Rep 1)
-> Success: BenchmarkWH_Simulation (Rep 1) completed.
[2/6] Running: BenchmarkSilverbox_Simulation (Rep 1)
-> Success: BenchmarkSilverbox_Simulation (Rep 1) completed.
-- Repetition 2/3 --
[3/6] Running: BenchmarkWH_Simulation (Rep 2)
-> Success: BenchmarkWH_Simulation (Rep 2) completed.
[4/6] Running: BenchmarkSilverbox_Simulation (Rep 2)
-> Success: BenchmarkSilverbox_Simulation (Rep 2) completed.
-- Repetition 3/3 --
[5/6] Running: BenchmarkWH_Simulation (Rep 3)
-> Success: BenchmarkWH_Simulation (Rep 3) completed.
[6/6] Running: BenchmarkSilverbox_Simulation (Rep 3)
-> Success: BenchmarkSilverbox_Simulation (Rep 3) completed.
--- Benchmark run finished. 6/6 individual runs completed successfully. ---
```

| | benchmark_name | dataset_id | hyperparameters | seed | training_time_seconds | test_time_seconds | benchmark_type | metric_name | metric_score | cs_multisine_rmse | cs_arrow_full_rmse | cs_arrow_no_extrapolation_rmse |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BenchmarkWH_Simulation | wh | {} | 2406651230 | 4.944649 | 1.012850 | BenchmarkSpecSimulation | rmse_mV | 42.161572 | NaN | NaN | NaN |
| 1 | BenchmarkSilverbox_Simulation | silverbox | {} | 3813113752 | 2.839149 | 1.246224 | BenchmarkSpecSimulation | rmse_mV | 10.732386 | 8.501941 | 16.154317 | 7.5409 |
| 2 | BenchmarkWH_Simulation | wh | {} | 1950649438 | 4.801520 | 1.034119 | BenchmarkSpecSimulation | rmse_mV | 42.161572 | NaN | NaN | NaN |
| 3 | BenchmarkSilverbox_Simulation | silverbox | {} | 1560698088 | 2.880391 | 1.217932 | BenchmarkSpecSimulation | rmse_mV | 10.732386 | 8.501941 | 16.154317 | 7.5409 |
| 4 | BenchmarkWH_Simulation | wh | {} | 3258007268 | 4.916941 | 1.021927 | BenchmarkSpecSimulation | rmse_mV | 42.161572 | NaN | NaN | NaN |
| 5 | BenchmarkSilverbox_Simulation | silverbox | {} | 4194043971 | 2.937101 | 1.231710 | BenchmarkSpecSimulation | rmse_mV | 10.732386 | 8.501941 | 16.154317 | 7.5409 |
This function iterates through the provided list or dictionary of benchmark specifications, calling `run_benchmark` for each one with the same `build_model` function and hyperparameters.
```python
# Calculate mean and std of the results
idb.aggregate_benchmark_results(all_results, agg_funcs=['mean', 'std'])
```

| benchmark_name | training_time_seconds (mean) | training_time_seconds (std) | test_time_seconds (mean) | test_time_seconds (std) | metric_score (mean) | metric_score (std) | cs_multisine_rmse (mean) | cs_multisine_rmse (std) | cs_arrow_full_rmse (mean) | cs_arrow_full_rmse (std) | cs_arrow_no_extrapolation_rmse (mean) | cs_arrow_no_extrapolation_rmse (std) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BenchmarkSilverbox_Simulation | 2.885547 | 0.049179 | 1.231955 | 0.014147 | 10.732386 | 0.0 | 8.501941 | 0.0 | 16.154317 | 0.0 | 7.5409 | 0.0 |
| BenchmarkWH_Simulation | 4.887703 | 0.075912 | 1.022966 | 0.010673 | 42.161572 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN |
Understanding how identibench organizes and stores data is helpful for
direct interaction or adding new datasets.
- Directory Structure: Datasets are stored under a root directory (default: `~/.identibench_data`, configurable via the `IDENTIBENCH_DATA_ROOT` environment variable). The structure follows `DATA_ROOT / [dataset_id] / [subset] / [experiment_file.hdf5]`.
- Subsets: Standard subset names are `train`, `valid`, and `test`. An optional `train_valid` directory might contain combined data.
- Download & Cache: Data is downloaded automatically when a benchmark requires it and cached locally to avoid re-downloads. The `identibench.datasets.download_all_datasets` function can fetch all datasets at once.
- File Format: Processed time-series data is stored in the HDF5 (`.hdf5`) format.
- HDF5 Structure:
  - Each `.hdf5` file typically represents one experimental run.
  - Signals (inputs, outputs, states) are stored as separate 1-dimensional datasets within the file, named conventionally as `u0`, `u1`, …, `y0`, `y1`, …, `x0`, …
  - Data is usually stored as `float32` NumPy arrays.
  - Metadata like sampling frequency (`fs`) and suggested initialization window size (`init_sz`) are stored as attributes on the root group of the HDF5 file.
  - Example structure:

    ```
    my_dataset/
    └── train/
        └── train_run_1.hdf5
            ├── u0 (Dataset: shape=(N,), dtype=float32)
            ├── y0 (Dataset: shape=(N,), dtype=float32)
            └── Attributes:
                └── fs (Attribute: float)
    ```

- Extensibility: Adhering to this HDF5 format ensures compatibility when adding new dataset loaders. Helper functions like `identibench.utils.write_array` facilitate creating files in the correct format. A sketch of reading such a file directly follows this list.
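Because the files are plain HDF5, any cached dataset can be inspected directly with `h5py`. A short sketch, assuming the default data root and the `wh` dataset from the quickstart (the exact file name below is illustrative):

```python
from pathlib import Path
import h5py

# Illustrative path; actual file names depend on the dataset loader
file_path = Path.home() / '.identibench_data' / 'wh' / 'train' / 'train_run_1.hdf5'

with h5py.File(file_path, 'r') as f:
    u0 = f['u0'][:]      # input signal as a float32 NumPy array
    y0 = f['y0'][:]      # output signal
    fs = f.attrs['fs']   # sampling frequency stored as a root attribute
    print(f'{len(u0)} samples at {fs} Hz')
```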
The `run_benchmark` function returns a dictionary containing detailed results of the experiment. Key entries include:

- `benchmark_name` (str): The unique name of the benchmark specification used.
- `dataset_id` (str): Identifier for the dataset source.
- `hyperparameters` (dict): The hyperparameters dictionary passed to the run.
- `seed` (int): The random seed used for the run.
- `training_time_seconds` (float): Wall-clock time spent inside your `build_model` function.
- `test_time_seconds` (float): Wall-clock time spent evaluating the returned predictor on the test set.
- `benchmark_type` (str): The type of benchmark run (e.g., `'BenchmarkSpecSimulation'`).
- `metric_name` (str): The name of the primary metric function defined in the spec.
- `metric_score` (float): The calculated score for the primary metric on the test set (aggregated if multiple test files).
- `custom_scores` (dict): Any additional scores calculated by custom evaluation logic specific to the benchmark.
- `model_predictions` (list): A list containing the raw outputs. For simulation, it’s typically `[(y_pred_test1, y_true_test1), (y_pred_test2, y_true_test2), ...]`. For prediction, the structure might be nested, reflecting windowed predictions.
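For instance, the dictionary returned by the quickstart run above can be inspected directly:

```python
# 'results' is the dictionary returned by run_benchmark in the quickstart
print(results['benchmark_name'])                        # benchmark identifier
print(results['metric_name'], results['metric_score'])  # primary metric
print(f"trained in {results['training_time_seconds']:.2f} s")
```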