A configurable machine learning framework for cross-platform and cross-session user identification using keystroke biometrics.
- Configuration-driven experiments - Define all parameters in JSON config files
- Multiple ML models - RandomForest, XGBoost, CatBoost, SVM, MLP, Naive Bayes, ExtraTrees, LightGBM, LogisticRegression, GradientBoosting, KNN
- Cross-platform experiments - Train on one platform, test on another
- Cross-session experiments - Train on session 1, test on session 2
- GPU acceleration - Automatic CUDA support for XGBoost
- Early stopping - Optimized training for gradient boosting models
- Comprehensive metrics - Top-K accuracy, F1 scores, confusion matrices (see the top-K sketch after this list)
- Visual reports - HTML reports with interactive plots
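To make the top-K metric concrete, here is a minimal sketch (illustrative only; the data, model, and split are synthetic stand-ins, not the framework's actual pipeline):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import top_k_accuracy_score

# Synthetic stand-in data: 300 samples, 10 features, 5 "users"
rng = np.random.default_rng(42)
X, y = rng.random((300, 10)), rng.integers(0, 5, 300)

clf = RandomForestClassifier(random_state=42).fit(X[:200], y[:200])

# Top-K accuracy: a prediction counts as correct if the true user is
# among the K highest-probability classes.
proba = clf.predict_proba(X[200:])
print(top_k_accuracy_score(y[200:], proba, k=3, labels=clf.classes_))
```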
Add your `dataset_path` to the configuration, or override it at the command line.
```bash
# Add the dataset_path to the configuration. Uses the default config_full.json.
python ml_runner.py

# Or, override dataset_path using the -d option. Uses the default config.
python ml_runner.py -d path/to/dataset.csv
```

Use your own config. I suggest making a copy of config_full.json and modifying the copy rather than changing config_full.json itself (which can serve as a template).
```bash
# Run with a custom configuration (add your dataset to the config)
python ml_runner.py -c my_config.json

# Or, override dataset_path using the -d option.
python ml_runner.py -c my_config.json -d path/to/dataset.csv
```

Use debug mode to run with a minimal set of models and hyperparameters; it uses config_debug.json by default. Use debug mode to test new models or code changes.
```bash
# Run debug mode to test your setup
python ml_runner.py --debug

# Or, override dataset_path using the -d option.
python ml_runner.py --debug -d path/to/dataset.csv

# Or, use your own debug config.
python ml_runner.py -c my_debug_config.json
```

run_ml.sh runs ml_runner.py on six datasets generated by eda/ml_typenet_features_polars.py. The current version of run_ml.sh expects the dataset directories to be in:
```
../eda/ml-experients-with-outliers2025-05-31_142307
../eda/ml-experients-without-outliers2025-05-31_143027
```
These datasets are currently stored in the main branch of the eda repository, though they may be removed until we're ready to release them. If you do not see them there, you can also find them in the shared Google Drive at https://drive.google.com/drive/folders/1d7VEy-tj9SRFstBrOXYus95j2H9qraeO?usp=drive_link.
If these are not in your `../eda` directory, you can change the `ROOT_DATA_DIR` path in `run_ml.sh`.
```bash
# Run multiple datasets with default settings
./run_ml.sh

# Run in debug mode
./run_ml.sh --debug
```

- `config_full.json` - Full configuration with comprehensive hyperparameter grids and all experiments.
- `config_debug.json` - Minimal configuration for quick testing with reduced hyperparameters and fewer experiments.
```json
{
"dataset_path": "path/to/data.csv",
"early_stopping": false,
"seeds": [42, 123, 456],
"output_affix": "my_experiment",
"show_class_distributions": false,
"draw_feature_importance": true,
"debug": false,
"use_gpu": true,
"models_to_train": [
"RandomForest",
"XGBoost",
"CatBoost",
"SVM",
"MLP",
"NaiveBayes"
],
"experiments": [
{"name": "FI_vs_T", "platform": true, "train": [1, 2], "test": 3},
{"name": "P1_S1_vs_S2", "session": true, "platform": 1, "train": [1], "test": 2}
],
"param_grids": {
"randomforest": {
"n_estimators": [100, 300, 500],
"max_depth": [10, 20, null]
}
}
}
```
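As a side note on how a grid like `param_grids.randomforest` expands, here is a sketch using scikit-learn's `ParameterGrid` (the framework's actual search strategy may differ; this only illustrates the combinatorics):

```python
from sklearn.model_selection import ParameterGrid

# JSON null becomes Python None; this grid expands to 3 x 3 = 9 candidates.
grid = {"n_estimators": [100, 300, 500], "max_depth": [10, 20, None]}
for params in ParameterGrid(grid):
    print(params)
```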
| Option | Description | Example |
|---|---|---|
| `-c, --config` | Configuration file path | `-c my_config.json` |
| `-d, --dataset` | Dataset path (overrides config) | `-d data.csv` |
| `-e, --early-stop` | Enable early stopping | `-e` |
| `-s, --seeds` | Random seeds (overrides config) | `-s 42 123 456` |
| `-o, --output-affix` | Output directory suffix | `-o experiment1` |
| `--show-class-dist` | Show class distribution plots | `--show-class-dist` |
| `--no-feature-importance` | Skip feature importance plots | `--no-feature-importance` |
| `--max-workers` | Max CPU workers | `--max-workers 8` |
| `--no-gpu` | Disable GPU acceleration | `--no-gpu` |
| `--debug` | Use debug configuration | `--debug` |
- Add the model name to `models_to_train` in the config file
- Add a hyperparameter grid to `param_grids` in the config
- If needed, add a new training function in `ml_core.py`
- Register the function in the `model_train_funcs` dictionary in `ml_runner.py` (see the sketch after this list)
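A minimal sketch of the last two steps (the function signature and registry shape here are assumptions for illustration; match whatever `ml_core.py` and `ml_runner.py` actually use):

```python
# ml_core.py (hypothetical shape of a training function)
from sklearn.neighbors import KNeighborsClassifier

def train_knn(X_train, y_train, params):
    """Fit a KNN model with hyperparameters chosen from param_grids."""
    model = KNeighborsClassifier(**params)
    model.fit(X_train, y_train)
    return model

# ml_runner.py (hypothetical registry)
model_train_funcs = {
    # ...existing models...
    "KNN": train_knn,
}
```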
Add experiment definitions to the `experiments` array in the configuration (a splitting sketch follows the example):

```json
{
"experiments": [
// Platform-based experiment
{"name": "Platform1_vs_2", "platform": true, "train": [1], "test": 2},
// Session-based experiment for specific platform
{"name": "S1_vs_S2_P1", "session": true, "platform": 1, "train": [1], "test": 2},
// Session-based experiment for all platforms
{"name": "S1_vs_S2_All", "session": true, "platform": "all", "train": [1], "test": 2}
]
}
```
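For intuition, an experiment entry maps to a row filter on the dataset. A rough pandas sketch of what a platform experiment implies (column names come from the dataset format below; the framework's real splitting logic lives in the runner and may differ):

```python
import pandas as pd

def split_platform_experiment(df: pd.DataFrame, exp: dict):
    """Train on rows from the 'train' platforms, test on the 'test' platform."""
    train_df = df[df["platform_id"].isin(exp["train"])]
    test_df = df[df["platform_id"] == exp["test"]]
    return train_df, test_df

exp = {"name": "Platform1_vs_2", "platform": True, "train": [1], "test": 2}
# train_df, test_df = split_platform_experiment(df, exp)
```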
The dataset CSV must contain the following columns (example below):

- `user_id` - User identifier (required)
- `platform_id` - Platform identifier (for platform experiments)
- `session_id` - Session identifier (for session experiments)
- Feature columns - All other columns are treated as features
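For illustration, a tiny conforming dataset (the feature column names here are made up; any non-ID columns would be treated as features):

```python
import io
import pandas as pd

# Hypothetical rows; hold_mean and flight_mean are placeholder features.
csv = io.StringIO(
    "user_id,platform_id,session_id,hold_mean,flight_mean\n"
    "u01,1,1,0.112,0.231\n"
    "u01,2,1,0.108,0.240\n"
    "u02,1,2,0.145,0.198\n"
)
df = pd.read_csv(csv)
print(df.head())
```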
```
experiment_results_<timestamp>/
├── experiment_results_<timestamp>.csv      # Summary results
├── detailed_topk_results_<timestamp>.csv   # Detailed top-K metrics
├── user_identification_report.html        # Main HTML report
├── performance_plots.html                 # Interactive performance plots
├── class_distribution_*.png               # Class distribution plots
├── *_confusion_matrix.png                 # Confusion matrices
├── *_feature_importance_*.png             # Feature importance plots
└── *.pkl                                  # Trained models with metadata
```
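To inspect the saved models, a standard pickle load should work (a sketch; the exact payload layout, e.g. a bare model versus a dict with metadata, is an assumption to verify against the code):

```python
import glob
import pickle

# Load every saved model artifact under the results directories.
for path in glob.glob("experiment_results_*/*.pkl"):
    with open(path, "rb") as f:
        payload = pickle.load(f)
    print(path, type(payload))
```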
- GPU Acceleration: Automatically enabled for XGBoost when CUDA is available
- Parallel Processing: Uses all CPU cores by default (configurable with `--max-workers`)
- Early Stopping: Reduces training time for XGBoost and CatBoost (see the sketch after this list)
- Debug Mode: Quickly test the pipeline with minimal hyperparameters
If GPU is not detected:
```bash
# Disable GPU in configuration
python ml_runner.py --no-gpu -d data.csv
```
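To check whether XGBoost can actually see a CUDA device, a quick probe like this can help (a sketch assuming XGBoost >= 2.0, where `device="cuda"` is supported; this is not part of the framework itself):

```python
import numpy as np
import xgboost as xgb

# Try fitting a tiny throwaway model on the GPU.
X, y = np.random.rand(16, 4), np.array([0, 1] * 8)
try:
    xgb.XGBClassifier(n_estimators=2, device="cuda").fit(X, y)
    print("CUDA GPU is usable by XGBoost")
except xgb.core.XGBoostError as err:
    print("GPU not usable; run with --no-gpu.", err)
```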
For large datasets:

```bash
# Reduce number of workers
python ml_runner.py --max-workers 4 -d data.csv
```

For quick testing:

```bash
# Use debug configuration
python ml_runner.py --debug -d data.csv
```

- Python 3.8+
- See `requirements.txt` for package dependencies
- CUDA (optional, for GPU acceleration)
- ✅ Linux (verified)
- ✅ macOS
- ✅ Windows (use WSL for bash scripts)
If you use this framework in your research, please cite:
[TBD. We plan to publish. Reach out to us if you wish to cite our work before we publish.]
[All rights reserved. The data and scripts used in this repository are currently proprietary and may not be used for any purpose without express written permission from the owners. We do plan to publish, after which the license will change to something more permissive.]
Contact Alvin (alvineasokuruvilla@gmail.com) or Lori (medlori@gmail.com).