Sleep Stage Classification

Idioma / Language: Español · English

Español

Clasificación automática de estadios de sueño (W, N1, N2, N3, REM) a partir de señales EEG, EOG y EMG usando Machine Learning y Deep Learning.

Resultados

Modelo	Cohen's Kappa	F1 Macro	Accuracy	Uso Recomendado
CNN-LSTM Seq	0.721	73.19%	80.04%	Mejor rendimiento (multi-epoch)
CNN1D	0.680	70.83%	76.86%	Deep Learning single-epoch
XGBoost (LOSO-CV)	0.675 ± 0.097	67.96% ± 9.85%	77.15% ± 6.97%	ML interpretable y confiable para generalizar a nuevos sujetos
LSTM Bi + Attention	0.651	68.07%	74.64%	Deep Learning con atención
Random Forest	0.635	69.50%	72.82%	Baseline ML
LSTM Unidireccional	0.530	58.59%	66.17%	Inferencia real-time

Nota sobre XGBoost: la tabla muestra LOSO-CV (78 folds, media ± std) para Kappa/F1/Accuracy. La optimización (tuning) se realizó con un holdout test por sujeto; ver detalles en XGBoost LOSO.

Características

Pipeline completo: descarga → preprocesamiento → extracción de features → entrenamiento
133 features espectrales, temporales, spindles, ondas lentas y cross-channel
Modelos entrenados: Random Forest, XGBoost (LOSO), CNN1D, LSTM (uni/bi/attention), CNN-LSTM Seq
Optimización bayesiana con Optuna
Validación LOSO (Leave-One-Subject-Out) para generalización a nuevos sujetos

Quickstart (ES/EN)

# 1. Crear entorno / Create environment
conda env create -f environment.yml && conda activate sleep-st

# 2. Descargar y procesar datos / Download and process data
python src/download.py --method wget --subset sleep-cassette --out data/raw --clean
python src/manifest.py --version 1.0.0 --subset sleep-cassette --raw-root data/raw --out data/processed/manifest.csv
python src/preprocessing.py --manifest data/processed/manifest.csv --out-root data/processed/sleep_trimmed_resamp200 --out-manifest data/processed/manifest_trimmed_resamp200.csv --pre-padding 900 --post-padding 900 --resample-sfreq 200 --filter-lowcut 0.3 --filter-highcut 45 --notch-freqs 50 --avg-ref

# 3. Extraer features y entrenar / Extract features and train
python -m src.extract_features --manifest data/processed/manifest_trimmed_resamp200.csv --output data/processed/features_resamp200.parquet
python -m src.models --features-file data/processed/features_resamp200.parquet --model-type xgboost

Nota: Usá --help en cualquier script para ver todas las opciones disponibles.

Documentación

Documento	Descripción
Getting Started	Instalación, descarga y validación de datos
Data Pipeline	Preprocesamiento, recorte y estrategias de episodios
Features	Descripción detallada de las 133 features extraídas
Models	Entrenamiento, splits, cross-validation y optimización
DL Architectures	Detalle de arquitecturas Deep Learning (CNN1D, LSTM, CNN-LSTM Seq)
Kaggle Notebooks	Entrenar modelos DL en Kaggle con GPU
Troubleshooting	FAQ y solución de problemas comunes
Future Work	Direcciones futuras: Transformers, LSTM largo, híbridos

Reportes de Modelos

Reportes principales

Reporte	Descripción
Análisis Comparativo	Comparación final de todos los modelos
XGBoost LOSO	Mejor ML — análisis detallado del modelo interpretable (LOSO)
CNN-LSTM Seq	Mejor DL — multi-epoch (κ=0.72)

Otros reportes

Reporte	Descripción
CNN1D	Mejor modelo single-epoch
LSTM Unidireccional	Baseline para inferencia real-time
LSTM Bidireccional	Análisis de bidireccionalidad
LSTM Bi+Attention	Mejor modelo LSTM

Notebooks de Análisis

03_model_results.ipynb: Comparación de métricas y visualizaciones
04_feature_analysis.ipynb: SHAP, feature importance e interpretabilidad
05_subject_clustering.ipynb: Análisis de clustering de sujetos (resultado: espectro continuo)

Estructura del proyecto

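src/                      Código fuente / Source code
├── download.py           Descarga de Sleep-EDF desde PhysioNet / Sleep-EDF download from PhysioNet
├── manifest.py           Generación de inventario de sesiones / Session inventory generation
├── preprocessing.py      Recorte alrededor del período de sueño / Trimming around the sleep period
├── extract_features.py   CLI para extracción batch / CLI for batch extraction
├── features.py           Extracción de características (133 features) / Feature extraction (133 features)
├── crossval.py           Cross-validation (LOSO)
├── check_data.py         Validaciones y sanity checks del dataset / Dataset validation and sanity checks
├── view_subject.py       Visualización rápida por sujeto / Quick per-subject visualization
├── visualize_training.py Utilidades para visualizar entrenamiento / Training visualization utilities
└── models/               Modelos ML y DL (pipelines, training, tuning) / ML and DL models (pipelines, training, tuning)

data/                     Datos (no versionados) / Data (not versioned)
├── raw/                  Espejo de PhysioNet / PhysioNet mirror
└── processed/            Manifests, PSG procesados y features / Manifests, processed PSGs and features

models/                   Modelos entrenados (ML) y resultados de CV / Trained ML models and CV results
artifacts/                Modelos entrenados (DL), métricas y visualizaciones / Trained DL models, metrics and visualizations
reports/                  Reportes generados (SHAP, clustering, etc.) / Generated reports (SHAP, clustering, etc.)
docs/                     Documentación técnica / Technical documentation
├── reports/              Análisis detallados por modelo / Detailed per-model analyses
notebooks/                Exploración, análisis y entrenamiento DL / Exploration, analysis and DL training
scripts/                  Scripts auxiliares (ej. Kaggle) / Helper scripts (e.g. Kaggle)
tests/                    Tests automatizados (pytest) / Automated tests (pytest)
tmp/                      Salidas temporales y experimentos / Temporary outputs and experiments
vintage/                  Material histórico, notas y backups / Historical material, notes and backups
out/                      Salidas locales / Local outputs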

Dataset

Este proyecto utiliza el subset Sleep-Cassette del Sleep-EDF Database Expanded de PhysioNet.

Citación (dataset original y PhysioNet)

Si usás este repositorio o el dataset para investigación/uso académico, por favor citá:

B Kemp, AH Zwinderman, B Tuk, HAC Kamphuisen, JJL Oberyé. Analysis of a sleep-dependent neuronal feedback loop: the slow-wave microcontinuity of the EEG. IEEE Trans Biomed Eng. 47(9):1185-1194 (2000).
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101(23), e215–e220. RRID:SCR_007345.

Note

El dataset completo Sleep-EDF contiene 197 grabaciones divididas en dos estudios:

Sleep-Cassette (SC): 78 sujetos, 153 noches - estudio de efectos de la edad en el sueño
Sleep-Telemetry (ST): 22 sujetos, 44 noches - estudio de efectos del temazepam

Este proyecto usa únicamente Sleep-Cassette.

Sleep-Cassette (SC)

Métrica	Valor
Sujetos totales	78
Sujetos con 2 noches	75
Sujetos con 1 noche	3 (SC4131, SC4362, SC4522)*
Sesiones totales	153
Epochs (30s)	186,499
Canales	2 EEG (Fpz-Cz, Pz-Oz), 1 EOG, 1 EMG

* Las noches faltantes se perdieron por fallos en el cassette durante el estudio original.

Validación

El pipeline usa Leave-One-Subject-Out (LOSO) por defecto, garantizando:

Ningún epoch del sujeto de test aparece en entrenamiento
Evaluación de generalización a sujetos completamente nuevos
Sin data leakage entre noches del mismo sujeto

Licencia

MIT

English

Automatic sleep stage classification (W, N1, N2, N3, REM) from EEG, EOG and EMG using Machine Learning and Deep Learning.

Results

Model	Cohen's Kappa	Macro F1	Accuracy	Recommended Use
CNN-LSTM Seq	0.721	73.19%	80.04%	Best performance (multi-epoch)
CNN1D	0.680	70.83%	76.86%	Deep Learning single-epoch
XGBoost (LOSO-CV)	0.675 ± 0.097	67.96% ± 9.85%	77.15% ± 6.97%	Interpretable, strong baseline for unseen subjects
LSTM Bi + Attention	0.651	68.07%	74.64%	Deep Learning with attention
Random Forest	0.635	69.50%	72.82%	ML baseline
LSTM Unidirectional	0.530	58.59%	66.17%	Real-time oriented inference

XGBoost note: the table reports LOSO-CV results (78 folds, one per subject, mean ± std) for Kappa/F1/Accuracy. Hyperparameter tuning was performed on a per-subject holdout test set; see the XGBoost LOSO report (in Spanish) for details.
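For readers unfamiliar with the metrics in the table, the following is a minimal pure-Python sketch of Cohen's Kappa and Macro F1. It is illustrative only: the function names and the toy labels are assumptions, and the repository's own evaluation code may compute these metrics differently (for example, through a library).

```python
from collections import Counter


def cohens_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e).

    p_o is the observed agreement, p_e the agreement expected by chance
    given the marginal label frequencies of both label sequences.
    """
    n = len(y_true)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        # Degenerate case (a single class everywhere); kappa is undefined.
        # Returning 0.0 here is a choice made for this sketch.
        return 0.0
    return (p_o - p_e) / (1 - p_e)


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels, invented for illustration (not project data).
STAGES = ["W", "N1", "N2", "N3", "REM"]
y_true = ["W", "W", "N1", "N2", "N2", "N3", "REM", "REM"]
y_pred = ["W", "N1", "N1", "N2", "N2", "N2", "REM", "W"]

kappa = cohens_kappa(y_true, y_pred)
f1 = macro_f1(y_true, y_pred, STAGES)
print(f"kappa={kappa:.3f}  macro_f1={f1:.3f}")
```

Kappa corrects accuracy for chance agreement, which matters here because sleep stages are imbalanced (N1 is typically rare), while Macro F1 weights every stage equally regardless of how many epochs it has.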

Features

End-to-end pipeline: download → preprocessing → feature extraction → training
133 features: spectral, time-domain, spindles, slow waves and cross-channel
Trained models: Random Forest, XGBoost (LOSO), CNN1D, LSTM (uni/bi/attention), CNN-LSTM Seq
Bayesian optimization with Optuna
LOSO validation (Leave-One-Subject-Out) to generalize to new subjects
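As a rough illustration of the step between preprocessing and feature extraction, the sketch below cuts a signal into 30-second epochs at 200 Hz (the `--resample-sfreq 200` value from the Quickstart) and computes two simple time-domain features per epoch. The function names and the two features are hypothetical examples chosen for this sketch; they are not taken from `src/features.py` and are not claimed to be among the 133 features.

```python
import math

SFREQ_HZ = 200        # matches --resample-sfreq 200 in the Quickstart
EPOCH_SECONDS = 30    # epoch length used for sleep staging
SAMPLES_PER_EPOCH = SFREQ_HZ * EPOCH_SECONDS  # 6000 samples


def split_into_epochs(signal, samples_per_epoch=SAMPLES_PER_EPOCH):
    """Cut a 1-D signal into consecutive full epochs.

    Any trailing partial epoch is dropped.
    """
    n_epochs = len(signal) // samples_per_epoch
    return [
        signal[k * samples_per_epoch:(k + 1) * samples_per_epoch]
        for k in range(n_epochs)
    ]


def zero_crossings(epoch):
    """Hypothetical feature: number of sign changes between neighbours."""
    return sum((a < 0) != (b < 0) for a, b in zip(epoch, epoch[1:]))


def mean_abs_amplitude(epoch):
    """Hypothetical feature: mean absolute amplitude of the epoch."""
    return sum(abs(x) for x in epoch) / len(epoch)


# Toy signal: 65 s of a 1 Hz sine wave sampled at 200 Hz.
signal = [math.sin(2 * math.pi * t / SFREQ_HZ) for t in range(65 * SFREQ_HZ)]

epochs = split_into_epochs(signal)
features = [
    {"zero_crossings": zero_crossings(e), "mean_abs": mean_abs_amplitude(e)}
    for e in epochs
]
print(len(epochs), features[0])
```

Each epoch yields one feature row, so a full night becomes a table with one row per 30-second epoch, which is the shape a tabular model such as XGBoost or Random Forest expects.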

Quickstart

See the shared Quickstart.

Tip: use --help on any script to see all options.

Documentation

English technical docs live under docs/en/:

Document	Description
Getting Started	Installation, download and validation
Data Pipeline	Preprocessing and dataset preparation
Features	Overview of extracted features
Models	Model catalog and training notes
DL Architectures	Deep Learning architectures overview
Kaggle Notebooks	How to run DL training on Kaggle
Troubleshooting	FAQ and common issues
Future Work	Ideas and next directions

Spanish reports (canonical, currently not translated):

Model reports: docs/reports/
Analysis notebooks: notebooks/

Note

Most of the content under docs/reports/ and notebooks/ is currently Spanish-first (including the Kaggle notebooks and the main analysis notebooks 01_raw_exploration.ipynb through 05_subject_clustering.ipynb). It should still be easy to follow: it’s mostly plots/tables and code, with consistent filenames and standard metric names (Kappa/F1/Accuracy).

Model Reports

Main reports

Report	Description
Comparative Analysis	Final comparison across all models
XGBoost LOSO	Best ML — detailed analysis of the interpretable LOSO model
CNN-LSTM Seq	Best DL — multi-epoch (κ=0.72)

Other reports

Report	Description
CNN1D	Best single-epoch model
Unidirectional LSTM	Baseline for real-time oriented inference
Bidirectional LSTM	Bidirectionality analysis
BiLSTM + Attention	Best LSTM model

Analysis Notebooks

03_model_results.ipynb: metric comparisons and visualizations
04_feature_analysis.ipynb: SHAP, feature importance, interpretability
05_subject_clustering.ipynb: subject clustering analysis (result: continuous spectrum)

Project structure

See the shared Project structure.

Dataset

This project uses the Sleep-Cassette subset of the Sleep-EDF Database Expanded dataset from PhysioNet.

Citation (original dataset and PhysioNet)

If you use this repository or the dataset for research/academic purposes, please cite:

B Kemp, AH Zwinderman, B Tuk, HAC Kamphuisen, JJL Oberyé. Analysis of a sleep-dependent neuronal feedback loop: the slow-wave microcontinuity of the EEG. IEEE Trans Biomed Eng. 47(9):1185-1194 (2000).
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101(23), e215–e220. RRID:SCR_007345.

Note

The full Sleep-EDF dataset contains 197 recordings split into two studies:

Sleep-Cassette (SC): 78 subjects, 153 nights - study of age effects on sleep
Sleep-Telemetry (ST): 22 subjects, 44 nights - study of temazepam effects on sleep

This project uses only Sleep-Cassette.

Sleep-Cassette (SC)

Metric	Value
Total subjects	78
Subjects with 2 nights	75
Subjects with 1 night	3 (SC4131, SC4362, SC4522)*
Total sessions	153
Epochs (30s)	186,499
Channels	2 EEG (Fpz-Cz, Pz-Oz), 1 EOG, 1 EMG

* The missing nights were lost due to cassette failures during the original study.
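The figures in the table can be cross-checked with a little arithmetic. The sketch below uses only numbers stated above; the variable names are illustrative.

```python
# Values copied from the Sleep-Cassette table above.
subjects_with_two_nights = 75
subjects_with_one_night = 3
total_epochs = 186_499
epoch_seconds = 30

total_subjects = subjects_with_two_nights + subjects_with_one_night
total_sessions = 2 * subjects_with_two_nights + subjects_with_one_night

# Total scored recording time and average session length.
total_hours = total_epochs * epoch_seconds / 3600
mean_epochs_per_session = total_epochs / total_sessions

print(f"subjects={total_subjects}, sessions={total_sessions}")
print(f"total scored time: {total_hours:.1f} h")
print(f"mean epochs per session: {mean_epochs_per_session:.0f}")
```

The subject and session counts reproduce the 78 and 153 listed in the table, and the epoch count corresponds to roughly 1,554 hours of scored signal.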

Validation

By default, the pipeline uses Leave-One-Subject-Out (LOSO), ensuring:

No epochs from the test subject appear in training
Evaluation of generalization to completely unseen subjects
No leakage between nights from the same subject
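The guarantees above follow from splitting by subject rather than by epoch or by night. The following is a minimal pure-Python sketch of such a split; `loso_splits` and the subject IDs are hypothetical, and the project's `src/crossval.py` may be implemented differently.

```python
def loso_splits(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) per fold.

    subject_ids has one entry per epoch. Grouping by subject (not by
    session) keeps both nights of a subject on the same side of the split.
    """
    for held_out in sorted(set(subject_ids)):
        test_idx = [i for i, s in enumerate(subject_ids) if s == held_out]
        train_idx = [i for i, s in enumerate(subject_ids) if s != held_out]
        yield held_out, train_idx, test_idx


# Toy example: six epochs from three made-up subjects.
subject_ids = ["S01", "S01", "S01", "S02", "S02", "S03"]

folds = list(loso_splits(subject_ids))
for held_out, train_idx, test_idx in folds:
    print(held_out, "train:", train_idx, "test:", test_idx)
```

With 78 subjects this produces 78 folds, matching the fold count reported for XGBoost. scikit-learn's `LeaveOneGroupOut` provides the same kind of split when the subject ID is passed as the `groups` argument.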

License

MIT
