This repository contains our project for the UCL Cancer Collaborative Centre Hackathon 2025, where our team designed an end-to-end computational pipeline to improve personalised cancer vaccine development. Our aim is to identify high-value neoantigen targets against which the patient's immune system has not already mounted a failed response.
🏆 Result: Winner — UCL Cancer Collaborative Centre Hackathon 2025
Cancer vaccines provide a powerful therapeutic avenue, but high non-response rates remain a key challenge. The main bottleneck is the identification and prioritisation of effective neoantigen targets from an enormous search space (≈10¹⁷ peptides). Our solution integrates tumour genomics, patient HLA typing, and TCR repertoire data to generate a ranked list of optimal vaccine candidates.
We exclude peptides already recognised by exhausted or ineffective T-cell responses, increasing the likelihood of inducing a strong and durable vaccine response.
- Neoantigen Identification: Extract mutated peptides (9-mers) present in cancer cells but not in normal tissue.
- MHC Binding Prediction: Predict HLA-specific binding affinities to determine surface-presentable peptides.
- TCR Binding Prediction: Identify and remove peptides already targeted unsuccessfully by the patient's TCRs.
- Ranking & Output: Score the remaining peptides using immunogenicity, presentation likelihood, clonality, conservation, and safety metrics.
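The neoantigen identification step above can be illustrated with a minimal sketch. It assumes the tumour and normal protein sequences are already available as plain strings; the function names and the toy sequences are hypothetical and are not taken from `hack-1.ipynb`.

```python
def kmers(sequence: str, k: int = 9) -> set[str]:
    """Return every length-k window of a protein sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def candidate_neoantigens(tumour_protein: str, normal_protein: str, k: int = 9) -> list[str]:
    """Keep k-mers found in the tumour sequence but absent from the normal one."""
    normal = kmers(normal_protein, k)
    return sorted(kmers(tumour_protein, k) - normal)


# Toy sequences invented for illustration: a single G -> V substitution.
normal = "MTEYKLVVVGAGGVGKSALT"
tumour = "MTEYKLVVVGAVGVGKSALT"

peptides = candidate_neoantigens(tumour, normal)
for peptide in peptides:
    print(peptide)
```

A single substitution yields up to nine candidate 9-mers, one for each window that overlaps the mutated position. A real pipeline would derive the sequences from somatic variant calls rather than hand-written strings.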
- Cancer genomes with validated neoantigens
- HLA–peptide binding datasets (IEDB)
- TCR–peptide binding datasets
- Tumour genome sequencing
- Patient HLA genotype
- TCR repertoire sequencing
Our architecture incorporates:
- Neoantigen identification module
- Transformer-based MHC binding prediction
- Structural TCR–peptide interaction modelling (AlphaFold-based)
- MHC binding affinity
- TCR engagement strength
- Surface presentation probability
- Peptide abundance
- Conservation across cancer subclones
- Phylogenetic clonality
- Cross-reactivity and safety assessment
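One simple way to combine the metrics listed above into a single ranking is a weighted sum. The sketch below assumes each metric has already been normalised to the range 0 to 1 with higher meaning better; the `Candidate` class, the weights, and the example values are all hypothetical and do not reflect the scoring used in the notebook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    peptide: str
    mhc_binding: float      # MHC binding affinity
    tcr_engagement: float   # TCR engagement strength
    presentation: float     # surface presentation probability
    abundance: float        # peptide abundance
    conservation: float     # conservation across cancer subclones
    clonality: float        # phylogenetic clonality
    safety: float           # cross-reactivity and safety (1.0 = safest)


# Hypothetical weights chosen for illustration only; they sum to 1.
WEIGHTS = {
    "mhc_binding": 0.25,
    "tcr_engagement": 0.20,
    "presentation": 0.15,
    "abundance": 0.10,
    "conservation": 0.10,
    "clonality": 0.10,
    "safety": 0.10,
}


def score(candidate: Candidate) -> float:
    """Weighted sum of the normalised metrics."""
    return sum(weight * getattr(candidate, name) for name, weight in WEIGHTS.items())


def rank(candidates: list[Candidate]) -> list[Candidate]:
    """Sort candidates from highest to lowest score."""
    return sorted(candidates, key=score, reverse=True)


# Placeholder peptides and invented metric values.
ranked = rank([
    Candidate("CCCCCCCCC", 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1),
    Candidate("AAAAAAAAA", 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9),
])
for candidate in ranked:
    print(candidate.peptide, round(score(candidate), 3))
```

A weighted sum is easy to inspect and tune, but it treats the metrics as independent; a hard safety cut-off applied before ranking may be a better fit for the cross-reactivity criterion.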
.
├── Human_AF3_inputs/ # AlphaFold 3 JSON input files (6 TCR-pMHC jobs)
├── hack-1.ipynb # Main pipeline notebook
├── human_tcr_dataset.xlsx # Raw IEDB export
├── updated_reduced_data.xlsx # Filtered IEDB data
├── new_reduced_data_TCRonly.csv # Cleaned TCR-only dataset
├── Processed_Human_TCR_MHC_Dataset.csv # Enriched dataset with MHC sequences
├── requirements.txt
└── README.md
| Dataset | Source | Format | Public? |
|---|---|---|---|
| TCR–peptide–MHC binding data | IEDB | Excel export | ✅ Yes |
| MHC heavy chain sequences (HLA allotypes) | UniProt REST API | FASTA | ✅ Yes |
| Beta-2-microglobulin sequence (P61769) | UniProt REST API | FASTA | ✅ Yes |
| AlphaFold 3 input files | Generated by pipeline | JSON | — |
human_tcr_dataset.xlsx / updated_reduced_data.xlsx
- Source: IEDB query export
- Columns: peptide, hla, tcr_alpha, tcr_beta
- ~70 raw rows filtered to 6 complete entries (both TCR chains required)
new_reduced_data_TCRonly.csv
- Cleaned/filtered version of the IEDB export
- 6 rows with complete TCR alpha + beta sequences
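The completeness filter described above (both TCR chains required) could look roughly like the following. This is a sketch using only the standard library and the column names listed for the IEDB export; the helper name and the placeholder rows are hypothetical, and the notebook may implement the filter differently.

```python
import csv
import io

REQUIRED_COLUMNS = ("peptide", "hla", "tcr_alpha", "tcr_beta")


def complete_rows(rows: list[dict]) -> list[dict]:
    """Keep only rows where every required column is present and non-empty."""
    return [
        row for row in rows
        if all((row.get(column) or "").strip() for column in REQUIRED_COLUMNS)
    ]


# Placeholder data: the chain values are dummy strings, not real TCR sequences.
sample = io.StringIO(
    "peptide,hla,tcr_alpha,tcr_beta\n"
    "IMDQVPFSV,HLA-A*02:01,ALPHA_PLACEHOLDER,BETA_PLACEHOLDER\n"
    "TRLALIAPK,HLA-B*27:05,,BETA_PLACEHOLDER\n"
)

rows = list(csv.DictReader(sample))
kept = complete_rows(rows)
print(f"{len(kept)} of {len(rows)} rows have both TCR chains")
```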
Processed_Human_TCR_MHC_Dataset.csv
- Enriched dataset combining IEDB data with UniProt-fetched sequences
- Columns: peptide, hla, tcr_alpha, tcr_beta, mhc_heavy_chain, beta_2_microglobulin
- 6 complete TCR-pMHC entries ready for structure prediction
Human_AF3_inputs/
- 6 AlphaFold Server JSON files (af3_job_0.json–af3_job_5.json)
- Each file encodes one TCR-pMHC complex with 5 protein chains: TCR-α, TCR-β, MHC heavy chain, β2-microglobulin, peptide (9-mer)
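As an illustration of the five-chain layout described above, the sketch below builds one job from a row of the processed dataset. The field names (`name`, `modelSeeds`, `sequences`, `proteinChain`, `dialect`, `version`) follow the AlphaFold Server JSON input format as we understand it; treat them as an assumption and check them against the files in Human_AF3_inputs/ before relying on this. The function name and the placeholder sequences are hypothetical.

```python
import json


def build_af3_job(name: str, row: dict) -> dict:
    """Assemble one TCR-pMHC job with five single-copy protein chains."""
    chain_order = (
        "tcr_alpha",
        "tcr_beta",
        "mhc_heavy_chain",
        "beta_2_microglobulin",
        "peptide",
    )
    return {
        "name": name,
        "modelSeeds": [],  # empty list: let the server choose the seed (assumed behaviour)
        "sequences": [
            {"proteinChain": {"sequence": row[column], "count": 1}}
            for column in chain_order
        ],
        "dialect": "alphafoldserver",
        "version": 1,
    }


# Placeholder row: only the peptide is taken from this README.
row = {
    "peptide": "IMDQVPFSV",
    "tcr_alpha": "AAAA",
    "tcr_beta": "CCCC",
    "mhc_heavy_chain": "DDDD",
    "beta_2_microglobulin": "EEEE",
}

job = build_af3_job("af3_job_0", row)
# The job is wrapped in a top-level list here (assumed format).
payload = json.dumps([job], indent=2)
print(payload)
```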
| Peptide | HLA Allotype |
|---|---|
| IMDQVPFSV | HLA-A*02:01 |
| TRLALIAPK | HLA-B*27:05 |
| LRVMMLAPF | HLA-B*27:05 |
| HLA Allotype | UniProt ID |
|---|---|
| HLA-A*02:01 | P01892 |
| HLA-A*02:05 | P30512 |
| HLA-B*27:05 | P03989 |
| HLA-B*27:09 | P30480 |
| HLA-B*08:01 | P01889 |
| HLA-E*01:03 | P30511 |
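The allotype-to-accession table above can drive sequence retrieval from the UniProt REST API. The sketch below copies the mapping verbatim from the table and uses the `https://rest.uniprot.org/uniprotkb/<accession>.fasta` endpoint; the function names are hypothetical, and the notebook may fetch or post-process the sequences differently.

```python
from urllib.request import urlopen

UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{accession}.fasta"

# Copied from the table above.
HLA_TO_UNIPROT = {
    "HLA-A*02:01": "P01892",
    "HLA-A*02:05": "P30512",
    "HLA-B*27:05": "P03989",
    "HLA-B*27:09": "P30480",
    "HLA-B*08:01": "P01889",
    "HLA-E*01:03": "P30511",
}

B2M_ACCESSION = "P61769"  # beta-2-microglobulin


def parse_fasta(text: str) -> str:
    """Join the sequence lines of a single-record FASTA string."""
    lines = (line.strip() for line in text.splitlines())
    return "".join(line for line in lines if line and not line.startswith(">"))


def fetch_sequence(accession: str, timeout: float = 10.0) -> str:
    """Download one UniProt entry as FASTA and return its sequence."""
    url = UNIPROT_FASTA_URL.format(accession=accession)
    with urlopen(url, timeout=timeout) as response:
        return parse_fasta(response.read().decode("utf-8"))


def fetch_mhc_heavy_chain(hla: str) -> str:
    """Look up the accession for an HLA allotype and fetch its sequence."""
    return fetch_sequence(HLA_TO_UNIPROT[hla])
```

Calling `fetch_mhc_heavy_chain("HLA-A*02:01")` requires network access. An allotype missing from the mapping raises `KeyError`, which makes gaps in HLA coverage visible early.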
To extend the pipeline beyond the current 6-entry proof of concept:
- More TCR-pMHC data: Relax IEDB query filters (e.g. allow single-chain TCRs, broader HLA coverage)
- Neoantigen prediction: Tools like pVACseq, NetMHCpan, or MHCflurry applied to somatic mutation data
- Tumour mutation data: TCGA (GDC portal) or ICGC — somatic mutation calls in MAF/VCF format
- HLA population frequencies: Allele Frequency Net Database to prioritise broadly immunogenic alleles
- Structural validation: PDB templates of TCR-pMHC complexes to benchmark AF3 outputs
- Immunogenicity scoring: NetTCR, ERGO, or ImRex for TCR-antigen binding prediction
- Matthew Cowley
- Zhen Wei Yap
- Mohammad Alawwami
- Zarif Shafiei
- Linh Hoang
- Julia Sala-Bayo
- Graham Bonomo-Jackson
- Gleb Gmyzov
- Nick Keatley