Zarif-S/UCL-Cancer-Collaboratorium-Hack


🧬 Designing Better Cancer Vaccines — UCL-CCC Hackathon 2025 🏆

This repository contains our project for the UCL Cancer Collaborative Centre Hackathon 2025, where our team designed an end-to-end computational pipeline to improve personalised cancer vaccine development. Our aim is to identify high-value neoantigen targets that are not already the focus of an exhausted or ineffective T-cell response in the patient.

🏆 Result: Winner — UCL Cancer Collaborative Centre Hackathon 2025


🚀 Project Overview

Cancer vaccines provide a powerful therapeutic avenue, but high non-response rates remain a key challenge. The main bottleneck is the identification and prioritisation of effective neoantigen targets from an enormous search space (≈10¹⁷ peptides). Our solution integrates tumour genomics, patient HLA typing, and TCR repertoire data to generate a ranked list of optimal vaccine candidates.

Key Innovation

We exclude peptides already recognised by exhausted or ineffective T-cell responses, increasing the likelihood of inducing a strong and durable vaccine response.
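The exclusion step can be pictured as a simple set filter. The sketch below is illustrative only: the function name `exclude_recognised` and the choice of which peptide counts as "already targeted" are assumptions made for the example, not the notebook's actual code or real patient data.

```python
def exclude_recognised(candidates, recognised_by_exhausted):
    """Keep candidates that no exhausted or ineffective T-cell response already targets."""
    blocked = set(recognised_by_exhausted)
    # Preserve the input order so any upstream ranking is not disturbed.
    return [peptide for peptide in candidates if peptide not in blocked]


# Peptides taken from this README; the "already targeted" set is hypothetical.
candidates = ["IMDQVPFSV", "TRLALIAPK", "LRVMMLAPF"]
already_targeted = {"TRLALIAPK"}

remaining = exclude_recognised(candidates, already_targeted)
print(remaining)
```

In a real pipeline the blocked set would come from the TCR binding prediction step rather than a hand-written literal.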


🧠 High-Level Pipeline

  1. Neoantigen Identification
    Extract mutated peptides (9-mers) present in cancer cells but not in normal tissue.

  2. MHC Binding Prediction
    Predict HLA-specific binding affinities to determine surface-presentable peptides.

  3. TCR Binding Prediction
    Identify and remove peptides already targeted unsuccessfully by the patient's TCRs.

  4. Ranking & Output
    Score the remaining peptides using immunogenicity, presentation likelihood, clonality, conservation, and safety metrics.
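As a rough illustration of step 1, the sketch below enumerates 9-mers that occur in a tumour protein sequence but not in the matching normal sequence. The function name `mutant_kmers` and both toy sequences are assumptions made for the example; a real workflow would start from somatic variant calls rather than two hand-typed strings.

```python
def mutant_kmers(normal, tumour, k=9):
    """Return k-mers present in the tumour sequence but absent from the normal one."""
    normal_kmers = {normal[i:i + k] for i in range(len(normal) - k + 1)}
    seen = set()
    novel = []
    for i in range(len(tumour) - k + 1):
        kmer = tumour[i:i + k]
        # Keep each tumour-specific k-mer once, in order of first appearance.
        if kmer not in normal_kmers and kmer not in seen:
            seen.add(kmer)
            novel.append(kmer)
    return novel


# Toy sequences (hypothetical): a single I -> L substitution at index 11.
normal = "MKTAYIAKQRQISFVKSHFSRQ"
tumour = "MKTAYIAKQRQLSFVKSHFSRQ"

peptides = mutant_kmers(normal, tumour)
print(len(peptides), peptides[0], peptides[-1])
```

A single substitution yields up to nine overlapping 9-mers, one for each window position that covers the mutated residue.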


🧬 Data Requirements

Training Data

  • Cancer genomes with validated neoantigens
  • HLA–peptide binding datasets (IEDB)
  • TCR–peptide binding datasets

Patient-Specific Inputs

  • Tumour genome sequencing
  • Patient HLA genotype
  • TCR repertoire sequencing

🧩 Model Architecture

Our architecture incorporates:

  • Neoantigen identification module
  • Transformer-based MHC binding prediction
  • Structural TCR–peptide interaction modelling (AlphaFold-based)

📊 Ranking Metrics

  • MHC binding affinity
  • TCR engagement strength
  • Surface presentation probability
  • Peptide abundance
  • Conservation across cancer subclones
  • Phylogenetic clonality
  • Cross-reactivity and safety assessment
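One way to combine these metrics is a weighted sum with a penalty for cross-reactivity risk. The sketch below is a minimal example under stated assumptions: every metric is taken to be pre-normalised to the range 0 to 1, TCR engagement is treated as a positive signal, and the weights, field names, and feature values are all hypothetical rather than the values used in the project.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    peptide: str
    mhc_binding: float       # all metrics assumed normalised to [0, 1]
    tcr_engagement: float
    presentation: float
    abundance: float
    conservation: float
    clonality: float
    cross_reactivity: float  # higher means greater safety risk


# Hypothetical weights; in practice these would need tuning or learning.
WEIGHTS = {
    "mhc_binding": 0.25,
    "tcr_engagement": 0.20,
    "presentation": 0.20,
    "abundance": 0.10,
    "conservation": 0.10,
    "clonality": 0.15,
}
SAFETY_PENALTY = 0.5  # hypothetical penalty applied to cross-reactivity risk


def score(c):
    """Weighted sum of the beneficial metrics minus a cross-reactivity penalty."""
    benefit = sum(w * getattr(c, name) for name, w in WEIGHTS.items())
    return benefit - SAFETY_PENALTY * c.cross_reactivity


def rank(candidates):
    """Return candidates sorted from highest to lowest score."""
    return sorted(candidates, key=score, reverse=True)


# Illustrative feature values only; these are not measured data.
pool = [
    Candidate("TRLALIAPK", 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5),
    Candidate("IMDQVPFSV", 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.1),
]
ranked = rank(pool)
print([c.peptide for c in ranked])
```

A linear score is easy to inspect and explain, which suits a proof of concept; a learned model could replace it once labelled outcomes are available.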

📁 Repository Structure

.
├── Human_AF3_inputs/                    # AlphaFold 3 JSON input files (6 TCR-pMHC jobs)
├── hack-1.ipynb                         # Main pipeline notebook
├── human_tcr_dataset.xlsx               # Raw IEDB export
├── updated_reduced_data.xlsx            # Filtered IEDB data
├── new_reduced_data_TCRonly.csv         # Cleaned TCR-only dataset
├── Processed_Human_TCR_MHC_Dataset.csv  # Enriched dataset with MHC sequences
├── requirements.txt
└── README.md

🗄️ Datasets

Data Sources

| Dataset | Source | Format | Public? |
|---|---|---|---|
| TCR–peptide–MHC binding data | IEDB | Excel export | ✅ Yes |
| MHC heavy chain sequences (HLA allotypes) | UniProt REST API | FASTA | ✅ Yes |
| Beta-2-microglobulin sequence (P61769) | UniProt REST API | FASTA | ✅ Yes |
| AlphaFold 3 input files | Generated by pipeline | JSON | |

Dataset Details

human_tcr_dataset.xlsx / updated_reduced_data.xlsx

  • Source: IEDB query export
  • Columns: peptide, hla, tcr_alpha, tcr_beta
  • ~70 raw rows filtered to 6 complete entries (both TCR chains required)
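The "both TCR chains required" filter could look something like the sketch below, which uses only the standard library. The column names come from the list above; the function name `complete_rows` and the inline CSV rows (including the `*_PLACEHOLDER_*` strings standing in for real chain sequences) are assumptions made for the example.

```python
import csv
import io

REQUIRED = ("peptide", "hla", "tcr_alpha", "tcr_beta")

# Hypothetical rows: placeholder strings stand in for real TCR chain sequences.
RAW = """peptide,hla,tcr_alpha,tcr_beta
IMDQVPFSV,HLA-A*02:01,ALPHA_PLACEHOLDER_1,BETA_PLACEHOLDER_1
TRLALIAPK,HLA-B*27:05,,BETA_PLACEHOLDER_2
LRVMMLAPF,HLA-B*27:05,ALPHA_PLACEHOLDER_3,
"""


def complete_rows(handle):
    """Keep only rows where every required column is present and non-empty."""
    return [
        row
        for row in csv.DictReader(handle)
        if all((row.get(col) or "").strip() for col in REQUIRED)
    ]


rows = complete_rows(io.StringIO(RAW))
print(len(rows), rows[0]["peptide"])
```

To run this against the exported data, the Excel file would first need converting to CSV (or reading with a spreadsheet library), and an open file handle would replace the `io.StringIO` wrapper.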

new_reduced_data_TCRonly.csv

  • Cleaned/filtered version of the IEDB export
  • 6 rows with complete TCR alpha + beta sequences

Processed_Human_TCR_MHC_Dataset.csv

  • Enriched dataset combining IEDB data with UniProt-fetched sequences
  • Columns: peptide, hla, tcr_alpha, tcr_beta, mhc_heavy_chain, beta_2_microglobulin
  • 6 complete TCR-pMHC entries ready for structure prediction

Human_AF3_inputs/

  • 6 AlphaFold Server JSON files (af3_job_0.json to af3_job_5.json)
  • Each file encodes one TCR-pMHC complex with 5 protein chains: TCR-α, TCR-β, MHC heavy chain, β2-microglobulin, peptide (9-mer)
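A five-chain job of this kind could be assembled roughly as follows. The field names (`name`, `modelSeeds`, `sequences`, `proteinChain`, `dialect`, `version`) follow the AlphaFold Server JSON format as we understand it and may not match the files in `Human_AF3_inputs/` exactly; the helper `build_af3_job` is hypothetical, and every sequence except the peptide is a short placeholder rather than a real chain.

```python
import json


def build_af3_job(name, tcr_alpha, tcr_beta, mhc_heavy, b2m, peptide):
    """Build one job dict with five protein chains, in the order listed above."""
    chains = [tcr_alpha, tcr_beta, mhc_heavy, b2m, peptide]
    return {
        "name": name,
        "modelSeeds": [],  # empty list: let the server choose a seed
        "sequences": [
            {"proteinChain": {"sequence": seq, "count": 1}} for seq in chains
        ],
        "dialect": "alphafoldserver",
        "version": 1,
    }


# Placeholder sequences (hypothetical) except the peptide, which is from this README.
job = build_af3_job(
    name="af3_job_0",
    tcr_alpha="AAAAAAAAAA",
    tcr_beta="GGGGGGGGGG",
    mhc_heavy="SSSSSSSSSS",
    b2m="TTTTTTTTTT",
    peptide="IMDQVPFSV",
)

# The server format expects a list of jobs, so wrap the single job in a list.
payload = json.dumps([job], indent=2)
print(payload)
```

Writing `payload` to a file named after the job would give one JSON file per TCR-pMHC complex.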

Peptides & HLA Allotypes in Current Dataset

| Peptide | HLA Allotype |
|---|---|
| IMDQVPFSV | HLA-A*02:01 |
| TRLALIAPK | HLA-B*27:05 |
| LRVMMLAPF | HLA-B*27:05 |

HLA → UniProt ID Mapping (used for MHC sequence retrieval)

| HLA Allotype | UniProt ID |
|---|---|
| HLA-A*02:01 | P01892 |
| HLA-A*02:05 | P30512 |
| HLA-B*27:05 | P03989 |
| HLA-B*27:09 | P30480 |
| HLA-B*08:01 | P01889 |
| HLA-E*01:03 | P30511 |
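Sequence retrieval from this mapping might be sketched as below. The dictionary copies the table above; the helper names (`fasta_url`, `parse_fasta`, `fetch_sequence`) are hypothetical, and the URL pattern is the public UniProt REST endpoint for a single-entry FASTA download, which we assume is what the notebook calls. The parser assumes one record per response.

```python
from urllib.request import urlopen

# Copied from the mapping table above.
HLA_TO_UNIPROT = {
    "HLA-A*02:01": "P01892",
    "HLA-A*02:05": "P30512",
    "HLA-B*27:05": "P03989",
    "HLA-B*27:09": "P30480",
    "HLA-B*08:01": "P01889",
    "HLA-E*01:03": "P30511",
}
B2M_ACCESSION = "P61769"  # beta-2-microglobulin, as listed under Data Sources


def fasta_url(accession):
    """Build the UniProt REST URL for one entry in FASTA format."""
    return f"https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def parse_fasta(text):
    """Join the sequence lines of a single-record FASTA string, dropping the header."""
    lines = (line.strip() for line in text.splitlines())
    return "".join(line for line in lines if line and not line.startswith(">"))


def fetch_sequence(accession, timeout=30):
    """Download one entry and return its sequence (requires network access)."""
    with urlopen(fasta_url(accession), timeout=timeout) as response:
        return parse_fasta(response.read().decode("utf-8"))


print(fasta_url(HLA_TO_UNIPROT["HLA-A*02:01"]))
```

Calling `fetch_sequence(HLA_TO_UNIPROT[hla])` for each row, plus one call with `B2M_ACCESSION`, would supply the `mhc_heavy_chain` and `beta_2_microglobulin` columns; caching the results avoids repeated requests for the same allotype.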

🔮 Scaling This Further

To extend the pipeline beyond the current 6-entry proof of concept:

  • More TCR-pMHC data: Relax IEDB query filters (e.g. allow single-chain TCRs, broader HLA coverage)
  • Neoantigen prediction: Tools like pVACseq, NetMHCpan, or MHCflurry applied to somatic mutation data
  • Tumour mutation data: TCGA (GDC portal) or ICGC — somatic mutation calls in MAF/VCF format
  • HLA population frequencies: Allele Frequency Net Database to prioritise broadly immunogenic alleles
  • Structural validation: PDB templates of TCR-pMHC complexes to benchmark AF3 outputs
  • Immunogenicity scoring: NetTCR, ERGO, or ImRex for TCR-antigen binding prediction

🧑‍🔬 Team TC-AWARE

  • Matthew Cowley
  • Zhen Wei Yap
  • Mohammad Alawwami
  • Zarif Shafiei
  • Linh Hoang
  • Julia Sala-Bayo
  • Graham Bonomo-Jackson
  • Gleb Gmyzov
  • Nick Keatley
