A framework for training and evaluating machine learning models that predict protein properties based on embeddings from various Protein Language Models (PLMs).
This project enables systematic comparison of different model architectures (feed-forward networks, linear regression, distance-based models) across multiple PLM embeddings and target protein properties. It automates the experimental workflow from training to evaluation and visualization.
Key capabilities:
- Automated training/evaluation across model types and embeddings
- Multiple baseline comparisons (Euclidean distance, random embeddings)
- Comprehensive performance metrics and visualizations
- Organized output structure for systematic analysis
Requirements:
- Python >= 3.12
- uv package manager
Installation:
git clone <repository-url>
cd unknown_unknowns
uv sync
Prepare your data structure:
data/processed/sprot_train/   # CSV files (train.csv, val.csv, test.csv)
data/processed/sprot_embs/    # HDF5 embedding files
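Before launching a long run, it can help to confirm that this layout is in place. The following is a minimal sketch, not part of the project: `check_data_layout` is a hypothetical helper, and the `.h5`/`.hdf5` suffixes are an assumption about how the embedding files are named.

```python
from pathlib import Path

# Split files named in the data layout.
REQUIRED_SPLITS = ("train.csv", "val.csv", "test.csv")
# Assumption: embedding files use one of these common HDF5 suffixes.
HDF5_SUFFIXES = {".h5", ".hdf5"}


def check_data_layout(csv_dir, emb_dir):
    """Return a list of problems; an empty list means the layout looks complete."""
    csv_dir, emb_dir = Path(csv_dir), Path(emb_dir)
    # One message per missing split file.
    problems = [
        f"missing split file: {csv_dir / name}"
        for name in REQUIRED_SPLITS
        if not (csv_dir / name).is_file()
    ]
    if not emb_dir.is_dir():
        problems.append(f"missing embedding directory: {emb_dir}")
    elif not any(p.suffix.lower() in HDF5_SUFFIXES for p in emb_dir.iterdir()):
        problems.append(f"no HDF5 files found in: {emb_dir}")
    return problems
```

Calling `check_data_layout("data/processed/sprot_train", "data/processed/sprot_embs")` from the repository root returns an empty list when all three CSV files and at least one HDF5 file are present.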
Run experiments:
# Train and evaluate all model combinations
uv run python src/training/run_experiments.py \
  --csv_dir data/processed/sprot_train \
  --evaluate_after_train \
  --model_types fnn linear linear_distance euclidean
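Conceptually, "all model combinations" is the cross product of model types and embeddings. The sketch below illustrates that idea only; it is not the actual logic of `run_experiments.py`. `plan_runs` and the embedding names `plm_a`/`plm_b` are hypothetical, and the `train` flag reflects the description of `euclidean` as a baseline that needs no training.

```python
from itertools import product

MODEL_TYPES = ("fnn", "linear", "linear_distance", "euclidean")
# The Euclidean baseline is described as requiring no training.
UNTRAINED = {"euclidean"}


def plan_runs(model_types, embeddings):
    """Pair every model type with every embedding (hypothetical helper)."""
    return [
        {"model_type": m, "embedding": e, "train": m not in UNTRAINED}
        for m, e in product(model_types, embeddings)
    ]


# Placeholder embedding names; real names depend on the available HDF5 files.
runs = plan_runs(MODEL_TYPES, ["plm_a", "plm_b"])
```

With four model types and two embeddings this yields eight planned runs, two of which (the `euclidean` ones) skip training.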
Generate summary plots:
# Performance summary across all runs
uv run python src/visualization/create_performance_summary_plots.py \
  --results_dir models/sprot_train \
  --output out/plots
Model types:
- fnn: Feed-forward neural network
- linear: Linear regression on concatenated embeddings
- linear_distance: Linear regression on embedding differences
- euclidean: Euclidean distance baseline (no training)
Results are organized as:
models/<dataset>/<model_type>/<parameter>/<embedding>/<timestamp>/
├── checkpoints/ # Model weights
├── tensorboard/ # Training logs
└── evaluation_results/ # Plots and metrics
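Because every run sits at a fixed depth, the path itself carries the run's metadata. The sketch below shows one way to collect runs and keep the newest one per configuration. `RunDir`, `find_runs`, and `latest_runs` are hypothetical names, and the code assumes that timestamp directory names sort chronologically as strings, which the layout above does not guarantee.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class RunDir:
    dataset: str
    model_type: str
    parameter: str
    embedding: str
    timestamp: str
    path: Path


def find_runs(models_root):
    """Collect every directory exactly five levels below models_root."""
    root = Path(models_root)
    runs = []
    # Levels: dataset / model_type / parameter / embedding / timestamp
    for path in sorted(root.glob("*/*/*/*/*")):
        if path.is_dir():
            runs.append(RunDir(*path.relative_to(root).parts, path=path))
    return runs


def latest_runs(runs):
    """Keep the newest run per (dataset, model_type, parameter, embedding)."""
    latest = {}
    for run in runs:
        key = (run.dataset, run.model_type, run.parameter, run.embedding)
        # Assumption: timestamps compare chronologically as plain strings.
        if key not in latest or run.timestamp > latest[key].timestamp:
            latest[key] = run
    return list(latest.values())
```

For example, `latest_runs(find_runs("models"))` returns one `RunDir` per configuration, and `run.path / "evaluation_results"` then points at that run's plots and metrics.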
For detailed documentation, see docs/SPECIFICATION.md.