This repository contains a clean pipeline for predicting 32 specific cancer types (BRCA, KIRC, LUAD, etc.) from TCGA pathology reports.
Determine if pathology report text contains discriminative patterns to identify specific cancer types with high accuracy.
- Classical ML Baseline: TF-IDF → Logistic Regression/Random Forest
- LLM Evaluation: Zero-shot, few-shot, fine-tuned biomedical LLMs
- Comparison: Establish performance hierarchy
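The TF-IDF step of the classical baseline can be illustrated with a minimal, standard-library-only sketch. It is not the notebook's code: the helper `tfidf_vectors`, the whitespace tokenizer, the unsmoothed `log(N / df)` weighting, and the toy report snippets are all assumptions made for illustration. A library implementation such as scikit-learn's `TfidfVectorizer` applies a smoothed and normalized variant of the same idea.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Made-up snippets for illustration only; not taken from TCGAReports.csv.
toy_reports = [
    "invasive ductal carcinoma of the breast",
    "clear cell carcinoma of the kidney",
    "adenocarcinoma of the lung",
]
vectors = tfidf_vectors(toy_reports)

# Terms shared by every report ("of", "the") get weight 0, while
# site-specific terms ("breast", "kidney", "lung") get the largest weights.
for report, vec in zip(toy_reports, vectors):
    ranked = sorted(vec, key=vec.get, reverse=True)
    print(report, "->", ranked[:3])
```

Weights of this kind are what a linear classifier such as Logistic Regression consumes: terms that appear in every report carry no signal, while terms concentrated in one cancer type dominate the feature vector.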
- Logistic Regression: 94.2% accuracy (32 classes)
- 5-Fold CV: 93.1% ± 0.6%
- Random Forest: 93.2% (comparison baseline)
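A cross-validation figure in "mean ± spread" form is typically derived from the per-fold accuracies. The sketch below shows one way to compute it. The helper `summarize_cv` is hypothetical, the fold values are placeholders rather than this project's results, and the use of the sample standard deviation is an assumption (the notebook may use the population form instead).

```python
from statistics import mean, stdev


def summarize_cv(fold_accuracies):
    """Format per-fold accuracies as 'mean ± sample standard deviation'."""
    avg = mean(fold_accuracies)
    spread = stdev(fold_accuracies)  # sample std (n - 1); an assumption
    return f"{avg:.1%} ± {spread:.1%}"


# Placeholder values for illustration only; NOT the repository's fold scores.
placeholder_folds = [0.90, 0.92, 0.91, 0.93, 0.89]
summary = summarize_cv(placeholder_folds)
print(summary)
```

Reporting the spread alongside the mean shows how sensitive the accuracy is to the particular train/test split.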
- 94%+ ML baseline (ACHIEVED)
- LLM superiority (96-98% expected)
- Few-shot efficiency demonstration
data/
├── TCGAReports.csv # 9,523 pathology reports
└── tcga_patient_to_cancer_type.csv # 32 cancer labels
TCGA_cancer_classification.ipynb # Complete 94.2% pipeline
project_documenation.doc # Research journey