A configurable machine learning framework for cross-platform and cross-session user identification using keystroke biometrics.
- Configuration-driven experiments - Define all parameters in JSON config files
- Multiple ML models - RandomForest, XGBoost, CatBoost, SVM, MLP, Naive Bayes, ExtraTrees, LightGBM, LogisticRegression, GradientBoosting, KNN
- Cross-platform experiments - Train on one platform, test on another
- Cross-session experiments - Train on session 1, test on session 2
- GPU acceleration - Automatic CUDA support for XGBoost
- Early stopping - Optimized training for gradient boosting models
- Comprehensive metrics - Top-K accuracy, F1 scores, confusion matrices (see the top-K sketch after this list)
- Visual reports - HTML reports with interactive plots
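To make the top-K metric concrete, here is a minimal sketch (illustrative only; the data, model, and split are synthetic stand-ins, not the framework's actual pipeline):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import top_k_accuracy_score

# Synthetic stand-in data: 300 samples, 10 features, 5 "users"
rng = np.random.default_rng(42)
X, y = rng.random((300, 10)), rng.integers(0, 5, 300)

clf = RandomForestClassifier(random_state=42).fit(X[:200], y[:200])

# Top-K accuracy: a prediction counts as correct if the true user is
# among the K highest-probability classes.
proba = clf.predict_proba(X[200:])
print(top_k_accuracy_score(y[200:], proba, k=3, labels=clf.classes_))
```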
Add your `dataset_path` to the configuration, or override it at the command line.
```bash
# Add the dataset_path to the configuration. Uses the default config_full.json.
python ml_runner.py

# Or, override dataset_path using the -d option. Uses the default config.
python ml_runner.py -d path/to/dataset.csv
```

Use your own config. I suggest making a copy of config_full.json and modifying the copy rather than changing config_full.json itself (which can serve as a template).
```bash
# Run with a custom configuration (add your dataset to the config)
python ml_runner.py -c my_config.json

# Or, override dataset_path using the -d option.
python ml_runner.py -c my_config.json -d path/to/dataset.csv
```

Use debug mode to run with a minimal set of models and hyperparameters; it uses config_debug.json by default. Use debug mode to test new models or code changes.
```bash
# Run debug mode to test your setup
python ml_runner.py --debug

# Or, override dataset_path using the -d option.
python ml_runner.py --debug -d path/to/dataset.csv

# Or, use your own debug config.
python ml_runner.py -c my_debug_config.json
```

run_ml.sh runs ml_runner.py on six datasets generated by eda/ml_typenet_features_polars.py. The current version of run_ml.sh expects the dataset directories to be in:
```
../eda/ml-experients-with-outliers2025-05-31_142307
../eda/ml-experients-without-outliers2025-05-31_143027
```
These datasets are currently stored in the main branch of the eda repository, though they may be removed until we're ready to release them. If you do not see them there, you can also find them in the shared Google Drive at https://drive.google.com/drive/folders/1d7VEy-tj9SRFstBrOXYus95j2H9qraeO?usp=drive_link.
If these are not in your `../eda` directory, you can change the `ROOT_DATA_DIR` path in `run_ml.sh`.
```bash
# Run multiple datasets with default settings
./run_ml.sh

# Run in debug mode
./run_ml.sh --debug
```

- `config_full.json` - Full configuration with comprehensive hyperparameter grids and all experiments.
- `config_debug.json` - Minimal configuration for quick testing with reduced hyperparameters and fewer experiments.
```json
{
"dataset_path": "path/to/data.csv",
"early_stopping": false,
"seeds": [42, 123, 456],
"output_affix": "my_experiment",
"show_class_distributions": false,
"draw_feature_importance": true,
"debug": false,
"use_gpu": true,
"models_to_train": [
"RandomForest",
"XGBoost",
"CatBoost",
"SVM",
"MLP",
"NaiveBayes"
],
"experiments": [
{"name": "FI_vs_T", "platform": true, "train": [1, 2], "test": 3},
{"name": "P1_S1_vs_S2", "session": true, "platform": 1, "train": [1], "test": 2}
],
"param_grids": {
"randomforest": {
"n_estimators": [100, 300, 500],
"max_depth": [10, 20, null]
}
}
}
```
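As a side note on how a grid like `param_grids.randomforest` expands, here is a sketch using scikit-learn's `ParameterGrid` (the framework's actual search strategy may differ; this only illustrates the combinatorics):

```python
from sklearn.model_selection import ParameterGrid

# JSON null becomes Python None; this grid expands to 3 x 3 = 9 candidates.
grid = {"n_estimators": [100, 300, 500], "max_depth": [10, 20, None]}
for params in ParameterGrid(grid):
    print(params)
```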
| Option | Description | Example |
|---|---|---|
| `-c, --config` | Configuration file path | `-c my_config.json` |
| `-d, --dataset` | Dataset path (overrides config) | `-d data.csv` |
| `-e, --early-stop` | Enable early stopping | `-e` |
| `-s, --seeds` | Random seeds (overrides config) | `-s 42 123 456` |
| `-o, --output-affix` | Output directory suffix | `-o experiment1` |
| `--show-class-dist` | Show class distribution plots | `--show-class-dist` |
| `--no-feature-importance` | Skip feature importance plots | `--no-feature-importance` |
| `--max-workers` | Max CPU workers | `--max-workers 8` |
| `--no-gpu` | Disable GPU acceleration | `--no-gpu` |
| `--debug` | Use debug configuration | `--debug` |
- Add the model name to `models_to_train` in the config file
- Add a hyperparameter grid to `param_grids` in the config
- If needed, add a new training function in `ml_core.py`
- Register the function in the `model_train_funcs` dictionary in `ml_runner.py` (see the sketch after this list)
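A minimal sketch of the last two steps (the function signature and registry shape here are assumptions for illustration; match whatever `ml_core.py` and `ml_runner.py` actually use):

```python
# ml_core.py (hypothetical shape of a training function)
from sklearn.neighbors import KNeighborsClassifier

def train_knn(X_train, y_train, params):
    """Fit a KNN model with hyperparameters chosen from param_grids."""
    model = KNeighborsClassifier(**params)
    model.fit(X_train, y_train)
    return model

# ml_runner.py (hypothetical registry)
model_train_funcs = {
    # ...existing models...
    "KNN": train_knn,
}
```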
Add experiment definitions to the `experiments` array in the configuration (a splitting sketch follows the example):

```json
{
"experiments": [
// Platform-based experiment
{"name": "Platform1_vs_2", "platform": true, "train": [1], "test": 2},
// Session-based experiment for specific platform
{"name": "S1_vs_S2_P1", "session": true, "platform": 1, "train": [1], "test": 2},
// Session-based experiment for all platforms
{"name": "S1_vs_S2_All", "session": true, "platform": "all", "train": [1], "test": 2}
]
}
```
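For intuition, an experiment entry maps to a row filter on the dataset. A rough pandas sketch of what a platform experiment implies (column names come from the dataset format below; the framework's real splitting logic lives in the runner and may differ):

```python
import pandas as pd

def split_platform_experiment(df: pd.DataFrame, exp: dict):
    """Train on rows from the 'train' platforms, test on the 'test' platform."""
    train_df = df[df["platform_id"].isin(exp["train"])]
    test_df = df[df["platform_id"] == exp["test"]]
    return train_df, test_df

exp = {"name": "Platform1_vs_2", "platform": True, "train": [1], "test": 2}
# train_df, test_df = split_platform_experiment(df, exp)
```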
The dataset CSV must contain the following columns (example below):

- `user_id` - User identifier (required)
- `platform_id` - Platform identifier (for platform experiments)
- `session_id` - Session identifier (for session experiments)
- Feature columns - All other columns are treated as features
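For illustration, a tiny conforming dataset (the feature column names here are made up; any non-ID columns would be treated as features):

```python
import io
import pandas as pd

# Hypothetical rows; hold_mean and flight_mean are placeholder features.
csv = io.StringIO(
    "user_id,platform_id,session_id,hold_mean,flight_mean\n"
    "u01,1,1,0.112,0.231\n"
    "u01,2,1,0.108,0.240\n"
    "u02,1,2,0.145,0.198\n"
)
df = pd.read_csv(csv)
print(df.head())
```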
```
experiment_results_<timestamp>/
├── experiment_results_<timestamp>.csv      # Summary results
├── detailed_topk_results_<timestamp>.csv   # Detailed top-K metrics
├── user_identification_report.html        # Main HTML report
├── performance_plots.html                 # Interactive performance plots
├── class_distribution_*.png               # Class distribution plots
├── *_confusion_matrix.png                 # Confusion matrices
├── *_feature_importance_*.png             # Feature importance plots
└── *.pkl                                  # Trained models with metadata
```
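To inspect the saved models, a standard pickle load should work (a sketch; the exact payload layout, e.g. a bare model versus a dict with metadata, is an assumption to verify against the code):

```python
import glob
import pickle

# Load every saved model artifact under the results directories.
for path in glob.glob("experiment_results_*/*.pkl"):
    with open(path, "rb") as f:
        payload = pickle.load(f)
    print(path, type(payload))
```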
- GPU Acceleration: Automatically enabled for XGBoost when CUDA is available
- Parallel Processing: Uses all CPU cores by default (configurable with `--max-workers`)
- Early Stopping: Reduces training time for XGBoost and CatBoost (see the sketch after this list)
- Debug Mode: Quickly test the pipeline with minimal hyperparameters
If GPU is not detected:
```bash
# Disable GPU in configuration
python ml_runner.py --no-gpu -d data.csv
```
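To check whether XGBoost can actually see a CUDA device, a quick probe like this can help (a sketch assuming XGBoost >= 2.0, where `device="cuda"` is supported; this is not part of the framework itself):

```python
import numpy as np
import xgboost as xgb

# Try fitting a tiny throwaway model on the GPU.
X, y = np.random.rand(16, 4), np.array([0, 1] * 8)
try:
    xgb.XGBClassifier(n_estimators=2, device="cuda").fit(X, y)
    print("CUDA GPU is usable by XGBoost")
except xgb.core.XGBoostError as err:
    print("GPU not usable; run with --no-gpu.", err)
```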
For large datasets:

```bash
# Reduce number of workers
python ml_runner.py --max-workers 4 -d data.csv
```

For quick testing:

```bash
# Use debug configuration
python ml_runner.py --debug -d data.csv
```

- Python 3.8+
- See `requirements.txt` for package dependencies
- CUDA (optional, for GPU acceleration)
- ✅ Linux (verified)
- ✅ macOS
- ✅ Windows (use WSL for bash scripts)
If you use this framework in your research, please cite:
[TBD. We plan to publish. Reach out to us if you wish to cite our work before we publish.]
[All rights reserved. The data and scripts used in this repository are currently proprietary and may not be used for any purpose without express written permission from the owners. We do plan to publish, after which the license will change to something more permissive.]
Contact Alvin (alvineasokuruvilla@gmail.com) or Lori (medlori@gmail.com).