
Clasificación de estadios de sueño usando EEG, EOG y EMG del dataset Sleep-EDF Expanded // Sleep stage classification using EEG, EOG, and EMG from the Sleep-EDF Expanded dataset


ignaciolinari/sleep-ST


Sleep Stage Classification

Idioma / Language: Español · English

Español

Clasificación automática de estadios de sueño (W, N1, N2, N3, REM) a partir de señales EEG, EOG y EMG usando Machine Learning y Deep Learning.

Resultados

| Modelo | Cohen's Kappa | F1 Macro | Accuracy | Uso Recomendado |
|---|---|---|---|---|
| CNN-LSTM Seq | 0.721 | 73.19% | 80.04% | Mejor rendimiento (multi-epoch) |
| CNN1D | 0.680 | 70.83% | 76.86% | Deep Learning single-epoch |
| XGBoost (LOSO-CV) | 0.675 ± 0.097 | 67.96% ± 9.85% | 77.15% ± 6.97% | ML interpretable y confiable para generalizar a nuevos sujetos |
| LSTM Bi + Attention | 0.651 | 68.07% | 74.64% | Deep Learning con atención |
| Random Forest | 0.635 | 69.50% | 72.82% | Baseline ML |
| LSTM Unidireccional | 0.530 | 58.59% | 66.17% | Inferencia real-time |

Nota sobre XGBoost: la tabla muestra LOSO-CV (78 folds, media ± std) para Kappa/F1/Accuracy. La optimización (tuning) se realizó con un holdout test por sujeto; ver detalles en XGBoost LOSO.

Características

  • Pipeline completo: descarga → preprocesamiento → extracción de features → entrenamiento
  • 133 features espectrales, temporales, spindles, ondas lentas y cross-channel
  • Modelos entrenados: Random Forest, XGBoost (LOSO), CNN1D, LSTM (uni/bi/attention), CNN-LSTM Seq
  • Optimización bayesiana con Optuna
  • Validación LOSO (Leave-One-Subject-Out) para generalización a nuevos sujetos

Quickstart (ES/EN)

# 1. Crear entorno / Create environment
conda env create -f environment.yml && conda activate sleep-st

# 2. Descargar y procesar datos / Download and process data
python src/download.py --method wget --subset sleep-cassette --out data/raw --clean
python src/manifest.py --version 1.0.0 --subset sleep-cassette --raw-root data/raw --out data/processed/manifest.csv
python src/preprocessing.py --manifest data/processed/manifest.csv --out-root data/processed/sleep_trimmed_resamp200 --out-manifest data/processed/manifest_trimmed_resamp200.csv --pre-padding 900 --post-padding 900 --resample-sfreq 200 --filter-lowcut 0.3 --filter-highcut 45 --notch-freqs 50 --avg-ref

# 3. Extraer features y entrenar / Extract features and train
python -m src.extract_features --manifest data/processed/manifest_trimmed_resamp200.csv --output data/processed/features_resamp200.parquet
python -m src.models --features-file data/processed/features_resamp200.parquet --model-type xgboost

Nota: Usá --help en cualquier script para ver todas las opciones disponibles.
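
The preprocessing step above trims each recording around the sleep period (`--pre-padding 900 --post-padding 900`). As a rough illustration of the idea, and not the code in `src/preprocessing.py`, the sketch below computes the epoch range to keep. It assumes the padding values are in seconds, that the sleep period runs from the first to the last non-wake epoch, and that stages are scored in 30 s epochs; `trim_bounds` is a hypothetical helper name.

```python
EPOCH_SEC = 30  # assumed scoring resolution: one hypnogram label per 30 s


def trim_bounds(stages, pre_padding_sec=900, post_padding_sec=900, wake="W"):
    """Return (start, stop) epoch indices covering the sleep period plus padding.

    `stop` is exclusive. Returns None when the recording contains no sleep.
    """
    sleep_idx = [i for i, stage in enumerate(stages) if stage != wake]
    if not sleep_idx:
        return None
    pre_epochs = pre_padding_sec // EPOCH_SEC    # 900 s -> 30 epochs
    post_epochs = post_padding_sec // EPOCH_SEC  # 900 s -> 30 epochs
    start = max(0, sleep_idx[0] - pre_epochs)
    stop = min(len(stages), sleep_idx[-1] + 1 + post_epochs)
    return start, stop


# Toy hypnogram: 100 wake epochs, 200 sleep epochs, 100 wake epochs.
stages = ["W"] * 100 + ["N2"] * 200 + ["W"] * 100
start, stop = trim_bounds(stages)
trimmed = stages[start:stop]
print(start, stop, len(trimmed))  # 70 330 260
```

Trimming this way removes most of the long wake stretches before and after the night, which would otherwise dominate the class distribution.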

Documentación

| Documento | Descripción |
|---|---|
| Getting Started | Instalación, descarga y validación de datos |
| Data Pipeline | Preprocesamiento, recorte y estrategias de episodios |
| Features | Descripción detallada de las 133 features extraídas |
| Models | Entrenamiento, splits, cross-validation y optimización |
| DL Architectures | Detalle de arquitecturas Deep Learning (CNN1D, LSTM, CNN-LSTM Seq) |
| Kaggle Notebooks | Entrenar modelos DL en Kaggle con GPU |
| Troubleshooting | FAQ y solución de problemas comunes |
| Future Work | Direcciones futuras: Transformers, LSTM largo, híbridos |

Reportes de Modelos

Reportes principales

| Reporte | Descripción |
|---|---|
| Análisis Comparativo | Comparación final de todos los modelos |
| XGBoost LOSO | Mejor ML: análisis detallado del modelo interpretable (LOSO) |
| CNN-LSTM Seq | Mejor DL: multi-epoch (κ=0.72) |

Otros reportes

| Reporte | Descripción |
|---|---|
| CNN1D | Mejor modelo single-epoch |
| LSTM Unidireccional | Baseline para inferencia real-time |
| LSTM Bidireccional | Análisis de bidireccionalidad |
| LSTM Bi+Attention | Mejor modelo LSTM |

Notebooks de Análisis

Estructura del proyecto

src/                      Source code
├── download.py           Downloads Sleep-EDF from PhysioNet
├── manifest.py           Builds the session inventory
├── preprocessing.py      Trims recordings around the sleep period
├── extract_features.py   CLI for batch feature extraction
├── features.py           Feature extraction (133 features)
├── crossval.py           Cross-validation (LOSO)
├── check_data.py         Dataset validation / sanity checks
├── view_subject.py       Quick per-subject visualization
├── visualize_training.py Training visualization utilities
└── models/               ML and DL models (pipelines, training, tuning)

data/                     Data (not versioned)
├── raw/                  Mirror of PhysioNet
└── processed/            Manifests, processed PSG and features

models/                   Trained models (ML) and CV results
artifacts/                Trained models (DL), metrics and visualizations
reports/                  Generated reports (SHAP, clustering, etc.)
docs/                     Technical documentation
├── reports/              Detailed per-model analyses
notebooks/                Exploration, analysis and DL training
scripts/                  Helper scripts (e.g. Kaggle)
tests/                    Automated tests (pytest)
tmp/                      Temporary outputs / experiments
vintage/                  Historical material / notes and backups
out/                      Local outputs

Dataset

Este proyecto utiliza el subset Sleep-Cassette del Sleep-EDF Database Expanded de PhysioNet.

Citación (dataset original y PhysioNet)

Si usás este repositorio o el dataset para investigación/uso académico, por favor citá:

  • B Kemp, AH Zwinderman, B Tuk, HAC Kamphuisen, JJL Oberyé. Analysis of a sleep-dependent neuronal feedback loop: the slow-wave microcontinuity of the EEG. IEEE Trans Biomed Eng. 47(9):1185-1194 (2000).
  • Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101(23), e215–e220. RRID:SCR_007345.

Note

El dataset completo Sleep-EDF contiene 197 grabaciones divididas en dos estudios:

  • Sleep-Cassette (SC): 78 sujetos, 153 noches - estudio de efectos de la edad en el sueño
  • Sleep-Telemetry (ST): 22 sujetos, 44 noches - estudio de efectos del temazepam

Este proyecto usa únicamente Sleep-Cassette.

Sleep-Cassette (SC)

| Métrica | Valor |
|---|---|
| Sujetos totales | 78 |
| Sujetos con 2 noches | 75 |
| Sujetos con 1 noche | 3 (SC4131, SC4362, SC4522)* |
| Sesiones totales | 153 |
| Epochs (30s) | 186,499 |
| Canales | 2 EEG (Fpz-Cz, Pz-Oz), 1 EOG, 1 EMG |

* Las noches faltantes se perdieron por fallos en el cassette durante el estudio original.

Validación

El pipeline usa Leave-One-Subject-Out (LOSO) por defecto, garantizando:

  • Ningún epoch del sujeto de test aparece en entrenamiento
  • Evaluación de generalización a sujetos completamente nuevos
  • Sin data leakage entre noches del mismo sujeto

Licencia

MIT


English

Automatic sleep stage classification (W, N1, N2, N3, REM) from EEG, EOG and EMG using Machine Learning and Deep Learning.

Results

| Model | Cohen's Kappa | Macro F1 | Accuracy | Recommended Use |
|---|---|---|---|---|
| CNN-LSTM Seq | 0.721 | 73.19% | 80.04% | Best performance (multi-epoch) |
| CNN1D | 0.680 | 70.83% | 76.86% | Deep Learning single-epoch |
| XGBoost (LOSO-CV) | 0.675 ± 0.097 | 67.96% ± 9.85% | 77.15% ± 6.97% | Interpretable, strong baseline for unseen subjects |
| LSTM Bi + Attention | 0.651 | 68.07% | 74.64% | Deep Learning with attention |
| Random Forest | 0.635 | 69.50% | 72.82% | ML baseline |
| LSTM Unidirectional | 0.530 | 58.59% | 66.17% | Real-time oriented inference |

XGBoost note: the table reports LOSO-CV (78 folds, mean ± std) for Kappa/F1/Accuracy. Hyperparameter tuning was performed with a per-subject holdout test; see the Spanish report in XGBoost LOSO.
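
For readers unfamiliar with the three metrics in the table, the sketch below computes them from scratch on a toy set of labels. It is a standard-library illustration of the textbook definitions, not the evaluation code used in this repository; the labels and the helper names are made up for the example.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of epochs whose predicted stage matches the scored stage."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e).

    Undefined (division by zero) when both label lists use a single class.
    """
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    labels = set(y_true) | set(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in labels) / (n * n)
    return (p_o - p_e) / (1 - p_e)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare stages like N1 count equally."""
    scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels for ten epochs (illustrative only).
y_true = ["W", "W", "N1", "N2", "N2", "N2", "N3", "N3", "REM", "REM"]
y_pred = ["W", "W", "N2", "N2", "N2", "N1", "N3", "N2", "REM", "REM"]

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")     # 0.700
print(f"kappa    = {cohen_kappa(y_true, y_pred):.3f}")  # 0.610
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")     # 0.648
```

Kappa and macro F1 sit below accuracy here because the toy predictions never get N1 right, which is also why the table reports all three.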

Features

  • End-to-end pipeline: download → preprocessing → feature extraction → training
  • 133 features: spectral, time-domain, spindles, slow waves and cross-channel
  • Trained models: Random Forest, XGBoost (LOSO), CNN1D, LSTM (uni/bi/attention), CNN-LSTM Seq
  • Bayesian optimization with Optuna
  • LOSO validation (Leave-One-Subject-Out) to estimate generalization to new subjects

Quickstart

See the shared Quickstart.

Tip: use --help on any script to see all options.

Documentation

English technical docs live under docs/en/:

| Document | Description |
|---|---|
| Getting Started | Installation, download and validation |
| Data Pipeline | Preprocessing and dataset preparation |
| Features | Overview of extracted features |
| Models | Model catalog and training notes |
| DL Architectures | Deep Learning architectures overview |
| Kaggle Notebooks | How to run DL training on Kaggle |
| Troubleshooting | FAQ and common issues |
| Future Work | Ideas and next directions |

Spanish reports (canonical, currently not translated):

Note

Most of the content under docs/reports/ and notebooks/ is currently Spanish-first (including the Kaggle notebooks and the main analysis notebooks 01_raw_exploration.ipynb through 05_subject_clustering.ipynb). It should still be easy to follow: it’s mostly plots/tables and code, with consistent filenames and standard metric names (Kappa/F1/Accuracy).

Model Reports

Main reports

| Report | Description |
|---|---|
| Comparative Analysis | Final comparison across all models |
| XGBoost LOSO | Best ML: detailed analysis of the interpretable LOSO model |
| CNN-LSTM Seq | Best DL: multi-epoch (κ=0.72) |

Other reports

| Report | Description |
|---|---|
| CNN1D | Best single-epoch model |
| Unidirectional LSTM | Baseline for real-time oriented inference |
| Bidirectional LSTM | Bidirectionality analysis |
| BiLSTM + Attention | Best LSTM model |

Analysis Notebooks

Project structure

See the shared Project structure.

Dataset

This project uses the Sleep-Cassette subset of the Sleep-EDF Database Expanded dataset from PhysioNet.

Citation (original dataset and PhysioNet)

If you use this repository or the dataset for research/academic purposes, please cite:

  • B Kemp, AH Zwinderman, B Tuk, HAC Kamphuisen, JJL Oberyé. Analysis of a sleep-dependent neuronal feedback loop: the slow-wave microcontinuity of the EEG. IEEE Trans Biomed Eng. 47(9):1185-1194 (2000).
  • Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101(23), e215–e220. RRID:SCR_007345.

Note

The full Sleep-EDF dataset contains 197 recordings split into two studies:

  • Sleep-Cassette (SC): 78 subjects, 153 nights — study of age effects in sleep
  • Sleep-Telemetry (ST): 22 subjects, 44 nights — study of temazepam effects

This project uses only Sleep-Cassette.

Sleep-Cassette (SC)

| Metric | Value |
|---|---|
| Total subjects | 78 |
| Subjects with 2 nights | 75 |
| Subjects with 1 night | 3 (SC4131, SC4362, SC4522)* |
| Total sessions | 153 |
| Epochs (30s) | 186,499 |
| Channels | 2 EEG (Fpz-Cz, Pz-Oz), 1 EOG, 1 EMG |

* The missing nights were lost due to cassette failures during the original study.
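
The epoch count above refers to 30 s windows. As a minimal sketch of how a continuous channel maps onto such epochs, the snippet below cuts a signal into non-overlapping windows. It assumes the 200 Hz rate passed as `--resample-sfreq 200` in the Quickstart and drops any incomplete trailing window; `split_into_epochs` is a hypothetical helper, not a function from this repository.

```python
def split_into_epochs(signal, sfreq=200, epoch_sec=30):
    """Cut a 1-D signal into consecutive, non-overlapping epochs.

    Samples that do not fill a complete epoch at the end are discarded.
    """
    samples_per_epoch = int(sfreq * epoch_sec)  # 200 Hz * 30 s = 6000 samples
    n_epochs = len(signal) // samples_per_epoch
    return [
        signal[i * samples_per_epoch:(i + 1) * samples_per_epoch]
        for i in range(n_epochs)
    ]


# Synthetic "signal": three full epochs plus 50 leftover samples.
signal = list(range(200 * 30 * 3 + 50))
epochs = split_into_epochs(signal)
print(len(epochs), len(epochs[0]))  # 3 6000
```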

Validation

By default, the pipeline uses Leave-One-Subject-Out (LOSO), ensuring:

  • No epochs from the test subject appear in training
  • Evaluation of generalization to completely unseen subjects
  • No leakage between nights from the same subject
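
The guarantees above follow from splitting by subject rather than by epoch or by night. The sketch below shows the idea with the standard library only; it is not the implementation in `src/crossval.py`. The subject IDs, the placeholder per-fold scores and the helper name `loso_splits` are invented for illustration, and the use of the sample standard deviation for the "mean ± std" summary is an assumption.

```python
from statistics import mean, stdev


def loso_splits(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices), one fold per subject."""
    for held_out in sorted(set(subject_ids)):
        train_idx = [i for i, s in enumerate(subject_ids) if s != held_out]
        test_idx = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train_idx, test_idx


# One subject ID per epoch. The night is deliberately not part of the key,
# so both nights of a subject always end up on the same side of the split.
subject_ids = ["A"] * 4 + ["B"] * 4 + ["C"] * 2

folds = list(loso_splits(subject_ids))
for held_out, train_idx, test_idx in folds:
    train_subjects = {subject_ids[i] for i in train_idx}
    assert held_out not in train_subjects  # no subject-level leakage
    print(held_out, len(train_idx), len(test_idx))

# Per-fold scores are then summarised as mean ± std (placeholder values).
fold_kappas = [0.60, 0.70, 0.80]
print(f"{mean(fold_kappas):.3f} ± {stdev(fold_kappas):.3f}")
```

With the 78 Sleep-Cassette subjects this scheme yields 78 folds, which matches the fold count quoted in the XGBoost note.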

License

MIT
