Cancer Type Prediction from Pathology Reports

This repository contains a clean pipeline for predicting 32 specific cancer types (BRCA, KIRC, LUAD, etc.) from TCGA pathology reports.

🎯 Aim

Determine whether pathology report text alone contains enough discriminative patterns to identify the specific cancer type with high accuracy.

🛤️ Plan of Action

Classical ML Baseline: TF-IDF → Logistic Regression / Random Forest
LLM Evaluation: Zero-shot, few-shot, and fine-tuned biomedical LLMs
Comparison: Rank the approaches by performance
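
The classical baseline step can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the notebook's actual code: the toy reports, the n-gram range, and the other hyperparameters are assumptions chosen only to make the example runnable.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for pathology report text; NOT real TCGA data.
reports = [
    "invasive ductal carcinoma of the breast, margins negative",
    "breast lumpectomy with ductal carcinoma in situ",
    "clear cell renal cell carcinoma, left kidney, nephrectomy",
    "kidney mass consistent with renal cell carcinoma, clear cell type",
    "lung adenocarcinoma, right upper lobe, lobectomy",
    "adenocarcinoma of the lung with acinar pattern",
]
labels = ["BRCA", "BRCA", "KIRC", "KIRC", "LUAD", "LUAD"]

# TF-IDF features feeding a logistic regression classifier.
# ngram_range and sublinear_tf are illustrative settings (assumptions).
baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
    LogisticRegression(max_iter=1000),
)
baseline.fit(reports, labels)

predicted = baseline.predict(["renal mass, clear cell carcinoma of the kidney"])
print(predicted[0])
```

Wrapping both steps in one pipeline keeps the vectorizer's vocabulary tied to the training split, which avoids leaking test-set terms when the model is later cross-validated.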

📊 Current Results

Logistic Regression: 94.2% accuracy (32 classes)
5-Fold CV: 93.1% ± 0.6%
Random Forest: 93.2% accuracy (comparison baseline)
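
For readers unfamiliar with how a "mean ± std" cross-validation figure is produced, here is a hedged sketch using scikit-learn's `cross_val_score` with stratified folds. The synthetic corpus, the random seed, and the choice of population standard deviation are assumptions; the notebook may compute its numbers differently.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in corpus (NOT TCGA data): 10 short "reports" per class.
keywords = {
    "BRCA": "breast ductal carcinoma",
    "KIRC": "kidney renal clear cell carcinoma",
    "LUAD": "lung adenocarcinoma lobe",
}
texts, labels = [], []
for label, phrase in keywords.items():
    for i in range(10):
        texts.append(f"specimen {i} shows {phrase}")
        labels.append(label)

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# Stratified folds keep the class proportions the same in every split.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, texts, labels, cv=folds, scoring="accuracy")

# Mean and (population) standard deviation across the 5 folds.
summary = f"{scores.mean():.1%} ± {scores.std():.1%}"
print(summary)
```

Passing the whole pipeline to `cross_val_score` means TF-IDF is refit inside each fold, so the held-out fold never influences the features.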

📈 Expected Outcomes

✅ 94%+ accuracy from the ML baseline (ACHIEVED)
LLM superiority over the ML baseline (96-98% accuracy expected)
Few-shot efficiency demonstration

🗂️ Files

data/
├── TCGAReports.csv              # 9,523 pathology reports
└── tcga_patient_to_cancer_type.csv  # 32 cancer labels

TCGA_cancer_classification.ipynb  # Complete pipeline (94.2% accuracy)
project_documentation.docx        # Research journey
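
As a rough sketch of how the two data files might be combined into one labeled table, the example below joins report text to cancer labels with pandas. The column names (`patient_id`, `text`, `cancer_type`) and the one-to-one relationship are hypothetical; check the actual CSV headers before adapting it. In-memory strings stand in for the real files so the snippet runs on its own.

```python
import io

import pandas as pd

# In-memory stand-ins for the two CSVs; column names are assumed, not confirmed.
reports_csv = io.StringIO(
    "patient_id,text\n"
    "P1,invasive ductal carcinoma of the breast\n"
    "P2,clear cell renal cell carcinoma\n"
    "P3,report with no matching label\n"
)
labels_csv = io.StringIO(
    "patient_id,cancer_type\n"
    "P1,BRCA\n"
    "P2,KIRC\n"
)

reports = pd.read_csv(reports_csv)
labels = pd.read_csv(labels_csv)

# Inner join keeps only reports that have a label; validate raises an
# error if a patient ID is duplicated on either side.
dataset = reports.merge(labels, on="patient_id", how="inner", validate="one_to_one")
print(dataset[["patient_id", "cancer_type"]])
```

With the real files, the `io.StringIO` objects would be replaced by the paths `data/TCGAReports.csv` and `data/tcga_patient_to_cancer_type.csv`.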
