FakeProfileDetection/keystroke-scripts
ML Keystroke Biometrics Experiments

A configurable machine learning framework for cross-platform and cross-session user identification using keystroke biometrics.

Features

  • Configuration-driven experiments - Define all parameters in JSON config files
  • Multiple ML models - RandomForest, XGBoost, CatBoost, SVM, MLP, Naive Bayes, ExtraTrees, LightGBM, LogisticRegression, GradientBoosting, KNN
  • Cross-platform experiments - Train on one platform, test on another
  • Cross-session experiments - Train on session 1, test on session 2
  • GPU acceleration - Automatic CUDA support for XGBoost
  • Early stopping - Optimized training for gradient boosting models
  • Comprehensive metrics - Top-K accuracy, F1 scores, confusion matrices
  • Visual reports - HTML reports with interactive plots

Quick Start

Basic Usage

Add your dataset_path to the configuration, or override the dataset_path at the command line.

# Add the dataset_path to the configuration. Use default config_full.json.
python ml_runner.py 

# Or, override dataset_path using -d option.  Use default config.
python ml_runner.py -d path/to/dataset.csv

Use your own config. We suggest making a copy of config_full.json and modifying the copy rather than changing config_full.json itself (which can serve as a template).

# Run with custom configuration (add your dataset to the config)
python ml_runner.py -c my_config.json

# Or, override dataset_path using -d option.
python ml_runner.py -c my_config.json -d path/to/dataset.csv

Use debug mode to run with a minimal set of models and hyperparameters; it uses config_debug.json by default. Debug mode is useful for testing new models or code changes.

# Run debug mode to test your setup
python ml_runner.py --debug

# Or, override dataset_path using -d option.
python ml_runner.py --debug -d path/to/dataset.csv

# Or, use your own debug config.
python ml_runner.py -c my_debug_config.json

Batch Processing

run_ml.sh runs ml_runner.py on six datasets generated by eda/ml_typenet_features_polars.py. The current version of run_ml.sh expects the dataset directories to be in:

  • ../eda/ml-experients-with-outliers2025-05-31_142307
  • ../eda/ml-experients-without-outliers2025-05-31_143027

These datasets are currently stored on the main branch of the eda repository. They may be removed before we are ready to release the datasets. If you do not see them there, you can also find them in the shared Google Drive at https://drive.google.com/drive/folders/1d7VEy-tj9SRFstBrOXYus95j2H9qraeO?usp=drive_link.

If these are not in your ../eda directory, you can change the ROOT_DATA_DIR path in run_ml.sh.

# Run multiple datasets with default settings
./run_ml.sh

# Run in debug mode
./run_ml.sh --debug

Configuration Files

config_full.json

Full configuration with comprehensive hyperparameter grids and all experiments.

config_debug.json

Minimal configuration for quick testing with reduced hyperparameters and fewer experiments.

Configuration Structure

{
  "dataset_path": "path/to/data.csv",
  "early_stopping": false,
  "seeds": [42, 123, 456],
  "output_affix": "my_experiment",
  "show_class_distributions": false,
  "draw_feature_importance": true,
  "debug": false,
  "use_gpu": true,
  
  "models_to_train": [
    "RandomForest",
    "XGBoost",
    "CatBoost",
    "SVM",
    "MLP",
    "NaiveBayes"
  ],
  
  "experiments": [
    {"name": "FI_vs_T", "platform": true, "train": [1, 2], "test": 3},
    {"name": "P1_S1_vs_S2", "session": true, "platform": 1, "train": [1], "test": 2}
  ],
  
  "param_grids": {
    "randomforest": {
      "n_estimators": [100, 300, 500],
      "max_depth": [10, 20, null]
    }
  }
}
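As an illustration of the configuration-driven design, the sketch below shows how such a config file might be loaded and how the -d command-line override could be applied on top of it. The load_config helper is hypothetical; the actual loading code in ml_runner.py may differ.

```python
import json
from pathlib import Path

def load_config(path, dataset_override=None):
    """Load an experiment config file and optionally override
    dataset_path, mirroring the -d command-line option.
    Hypothetical helper; not the framework's actual loader."""
    config = json.loads(Path(path).read_text())
    if dataset_override is not None:
        config["dataset_path"] = dataset_override
    return config
```

Any key present in the JSON file is kept as-is unless explicitly overridden, which matches the precedence described under Command Line Options below.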

Command Line Options

Option                    Description                        Example
-c, --config              Configuration file path            -c my_config.json
-d, --dataset             Dataset path (overrides config)    -d data.csv
-e, --early-stop          Enable early stopping              -e
-s, --seeds               Random seeds (overrides config)    -s 42 123 456
-o, --output-affix        Output directory suffix            -o experiment1
--show-class-dist         Show class distribution plots      --show-class-dist
--no-feature-importance   Skip feature importance plots      --no-feature-importance
--max-workers             Max CPU workers                    --max-workers 8
--no-gpu                  Disable GPU acceleration           --no-gpu
--debug                   Use debug configuration            --debug

Adding New Models

  1. Add the model name to models_to_train in the config file
  2. Add hyperparameter grid to param_grids in the config
  3. If needed, add a new training function in ml_core.py
  4. Register the function in model_train_funcs dictionary in ml_runner.py
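Steps 3 and 4 can be sketched as follows. The training-function signature shown here is an assumption for illustration; check the existing functions in ml_core.py for the real interface expected by the model_train_funcs dictionary.

```python
# Hypothetical training function (step 3); the real signature in
# ml_core.py may differ.
def train_lightgbm(X_train, y_train, param_grid, seed=42):
    """Placeholder: a real implementation would search param_grid,
    fit the model, and return the best estimator."""
    return {"model": "LightGBM", "params": param_grid, "seed": seed}

# Step 4: register the function under the same name used in
# models_to_train, so ml_runner.py can dispatch to it.
model_train_funcs = {
    "LightGBM": train_lightgbm,
}
```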

Adding New Experiments

Add experiment definitions to the experiments array in the configuration:

{
  "experiments": [
    // Platform-based experiment
    {"name": "Platform1_vs_2", "platform": true, "train": [1], "test": 2},
    
    // Session-based experiment for specific platform
    {"name": "S1_vs_S2_P1", "session": true, "platform": 1, "train": [1], "test": 2},
    
    // Session-based experiment for all platforms
    {"name": "S1_vs_S2_All", "session": true, "platform": "all", "train": [1], "test": 2}
  ]
}
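To make the experiment fields concrete, the sketch below shows one plausible way such a definition could drive a train/test split. This is illustrative only; the real splitting logic lives in ml_runner.py and may differ.

```python
def split_by_experiment(rows, experiment):
    """Split rows (dicts with platform_id/session_id keys) into
    train/test sets according to an experiment definition.
    Hypothetical helper mirroring the config fields above."""
    if experiment.get("platform") is True:
        # Platform-based experiment: split on platform_id.
        key = "platform_id"
    else:
        # Session-based experiment: optionally restrict to one
        # platform, then split on session_id.
        key = "session_id"
        if experiment.get("platform") != "all":
            rows = [r for r in rows
                    if r["platform_id"] == experiment["platform"]]
    train = [r for r in rows if r[key] in experiment["train"]]
    test = [r for r in rows if r[key] == experiment["test"]]
    return train, test
```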

Dataset Requirements

The dataset CSV must contain:

  • user_id - User identifier (required)
  • platform_id - Platform identifier (for platform experiments)
  • session_id - Session identifier (for session experiments)
  • Feature columns - All other columns are treated as features
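A quick header check along these lines can catch a malformed CSV before launching a full run. The check_dataset_columns helper is hypothetical (not part of the framework), and it treats all three identifier columns as required, whereas strictly only user_id is; platform_id and session_id are needed only for the corresponding experiment types.

```python
import csv

REQUIRED = {"user_id", "platform_id", "session_id"}

def check_dataset_columns(path):
    """Return (missing_columns, feature_columns) for a dataset CSV.
    All columns other than the identifiers count as features."""
    with open(path, newline="") as f:
        header = next(csv.reader(f))
    missing = REQUIRED - set(header)
    features = [c for c in header if c not in REQUIRED]
    return missing, features
```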

Output Structure

experiment_results_<timestamp>/
├── experiment_results_<timestamp>.csv          # Summary results
├── detailed_topk_results_<timestamp>.csv       # Detailed top-K metrics
├── user_identification_report.html             # Main HTML report
├── performance_plots.html                      # Interactive performance plots
├── class_distribution_*.png                    # Class distribution plots
├── *_confusion_matrix.png                      # Confusion matrices
├── *_feature_importance_*.png                  # Feature importance plots
└── *.pkl                                       # Trained models with metadata

Performance Optimization

  • GPU Acceleration: Automatically enabled for XGBoost when CUDA is available
  • Parallel Processing: Uses all CPU cores by default (configurable with --max-workers)
  • Early Stopping: Reduces training time for XGBoost and CatBoost
  • Debug Mode: Quickly test pipeline with minimal hyperparameters
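As one possible heuristic for the automatic GPU detection mentioned above, a runner could simply check whether nvidia-smi is on the PATH before enabling CUDA. This is a sketch, not the framework's actual detection logic, which may query XGBoost or CUDA directly.

```python
import shutil

def cuda_available():
    """Heuristic CUDA check: assume a GPU is usable when the
    nvidia-smi binary is on PATH. Illustrative only; real detection
    may be more thorough."""
    return shutil.which("nvidia-smi") is not None
```

Users can always force the issue with --no-gpu or the use_gpu config key, regardless of what detection reports.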

Troubleshooting

GPU Issues

If GPU is not detected:

# Disable GPU at the command line
python ml_runner.py --no-gpu -d data.csv

Memory Issues

For large datasets:

# Reduce number of workers
python ml_runner.py --max-workers 4 -d data.csv

Debug Mode

For quick testing:

# Use debug configuration
python ml_runner.py --debug -d data.csv

Requirements

  • Python 3.8+
  • See requirements.txt for package dependencies
  • CUDA (optional, for GPU acceleration)

Platform Compatibility

  • ✅ Linux (verified)
  • ✅ macOS
  • ✅ Windows (use WSL for bash scripts)

Citation

If you use this framework in your research, please cite:

[TBD. We plan to publish. Reach out to us if you wish to cite our work before we publish.]

License

[All rights reserved. The data and scripts in this repository are currently proprietary and may not be used for any purpose without express written permission from the owners. We plan to publish, after which the license will change to something more permissive.]

Contacts

Alvin (alvineasokuruvilla@gmail.com) or Lori (medlori@gmail.com).
