# ProthPrime-ML-XGBoost
Machine-learning assisted search for **Proth primes** using an XGBoost classifier as a pre-screening filter, combined with modular arithmetic heuristics and high-performance Proth primality tests in C and Python.
The goal of this project is not to *replace* deterministic primality tests, but to **reduce the search space**: the model ranks candidate Proth numbers so that deterministic tests are applied first to the most promising candidates.
> **Author:** David Vesterlund, Independent Researcher in Applied Mathematics and Machine Learning, with a focus on computational and algorithmic number theory
> **GitHub:** [VesterlundCoder](https://github.com/VesterlundCoder)
**Keywords:** Proth Primes, Machine Learning, Computational Number Theory, XGBoost, Prime Prediction, Supervised Learning, Primality Testing, Gradient Boosting, Large Primes, Data-Driven Mathematics
---
## 🔍 Background
A Proth number has the form
\[
N = k \cdot 2^n + 1, \quad k \text{ odd}, \quad 2^n > k.
\]
When \(N\) is prime, it is called a **Proth prime**. Classical searches generate huge ranges of \((k, n)\) pairs and apply a Proth test (and related checks) to each candidate – an expensive process at scale.
This project trains a supervised model (gradient-boosted decision trees, XGBoost) on labeled Proth numbers (known primes + carefully generated composites) and uses the model to rank new candidates before running heavy primality tests.
In experiments, the ML-guided candidate lists contain **~4–6× more primes** than a random baseline in the same numerical range, making it a useful *front-end filter* for large prime searches.
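The deterministic back-end of the pipeline is Proth's theorem: for \(N = k \cdot 2^n + 1\) with \(k\) odd and \(2^n > k\), if \(a^{(N-1)/2} \equiv -1 \pmod{N}\) for some base \(a\), then \(N\) is prime. A minimal pure-Python sketch (the fixed base list is an illustrative assumption; the repo's production testers use GMP/gmpy2):

```python
def proth_test(k: int, n: int, bases=(3, 5, 7, 11, 13)) -> bool:
    """Proth's theorem: N = k*2^n + 1 (k odd, 2^n > k) is prime if
    a^((N-1)/2) ≡ -1 (mod N) for some base a."""
    if k % 2 == 0 or k >= 1 << n:
        raise ValueError("need k odd and 2^n > k")
    N = k * (1 << n) + 1
    e = (N - 1) // 2
    for a in bases:
        if pow(a, e, N) == N - 1:  # found a witness: N is certifiably prime
            return True
    return False  # inconclusive with these bases (in practice: composite)

# 3*2^2 + 1 = 13 and 5*2^3 + 1 = 41 are Proth primes; 3*2^3 + 1 = 25 is not
```

A failed search over a small base set is only heuristic evidence of compositeness; when \(N\) actually is prime, exactly half of all bases are witnesses, so a handful of bases almost always suffices.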
---
## ✨ Key Features
- **End-to-end pipeline** from data acquisition → feature engineering → model training → candidate prediction → deterministic verification.
- **XGBoost classifier** trained on:
- known Proth primes from ProthSearch.com, and
- “hard” negative examples (non-primes that look prime-ish).
- **Feature engineering** tailored to Proth numbers:
- raw: `k`, `n`
- logarithmic: `log_k`, `log_n`, `log_proth`
- modular: `k_mod_4`, `n_mod_4`, etc.
- bit-length features for `k` and `n`.
- **Modular k-value optimisation**:
- modular sieving using primes 3,5,7,…,53
- collision score to focus on promising `k` values.
- **High-performance verification**:
- C implementation using GMP + OpenMP (`proth_runner.c`)
- Python/gmpy2 implementation (`prothcalc.py`)
- real-time monitoring (`monitor_primes.py`).
- **Reproducible research**:
- CSV datasets for training, features, predictions and logs
- research report + figures (pipeline, ROC, calibration, enrichment plots).
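The engineered columns above can be derived by one shared helper. This is a sketch consistent with the listed feature names (the choice of log base, here natural log for `log_k`/`log_n` and base 2 for `log_proth`, is an assumption):

```python
import numpy as np
import pandas as pd

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive logarithmic, modular and bit-length features from raw (k, n)."""
    out = df.copy()
    out["log_k"] = np.log(out["k"])
    out["log_n"] = np.log(out["n"])
    # log2(N) ≈ log2(k) + n for N = k*2^n + 1 (the +1 is negligible)
    out["log_proth"] = np.log2(out["k"]) + out["n"]
    out["k_mod_4"] = out["k"] % 4
    out["n_mod_4"] = out["n"] % 4
    out["bit_length_k"] = out["k"].apply(lambda v: int(v).bit_length())
    out["bit_length_n"] = out["n"].apply(lambda v: int(v).bit_length())
    return out
```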
---
## 📁 Main Components
> The exact directory layout is up to you; below assumes scripts live at the repo root or in a `src/` directory.
### 1. Data acquisition & dataset generation
- **`get_prothsearch.py`**
Web scraper for known Proth primes from ProthSearch.com.
**Output:** `prothsearch_primtal.csv` (columns: `k`, `n`).
- **`generate_proth_ml_dataset.py`**
Builds a balanced ML dataset with positive and negative examples.
- Uses `prothsearch_primtal.csv` as ground truth positives.
- Generates synthetic composite Proth numbers (negatives) in similar ranges.
- Approximate ratio: 1 prime : 2 negatives.
**Output:** `proth_ml_dataset.csv` (`k`, `n`, `label`).
- **`make_features.py`**
Adds feature columns used by the ML model.
- `log_k`, `log_n`, `log_proth`
- `k_mod_4`, `n_mod_4` (and related residues)
- `bit_length_k`, `bit_length_n`
**Output:** `proth_ml_features.csv`.
### 2. Machine-learning model
- **`train_proth_xgb.py`**
Trains the XGBoost classifier on `proth_ml_features.csv`.
- 80/20 stratified train–test split
- metrics: accuracy, precision, recall, F1, ROC-AUC
- can export the model with `joblib` or XGBoost’s native format.
- **`predict_proth_primes.py`**
Uses the trained model to rank new candidates.
Typical process:
- randomly sample \((k,n)\) with e.g. \(n \in [50\,001, 99\,999]\)
- compute \(N = k \cdot 2^n + 1\)
- compute features via a shared `add_features()` function
- predict `P(prime | k, n)` and write ranked outputs
**Outputs:**
- `proth_ml_predictions.csv` — all predictions
- `predicted_primes_parallel.csv` — **Gold** tier (highest P)
- `predicted_primes_over_0.9.csv` — **Silver** tier (P ≥ 0.9)
- `top_predicted_primes.csv` — **Bronze** tier (top-K candidates).
### 3. K-value optimisation
- **`generate_k_values.py`**
Modular sieving over small primes (3–53) to score `k` values by a collision metric (how often corresponding Proth numbers fall into “bad” residue classes).
**Outputs:**
- `top_1000_k_by_modular_score.csv`
- `new_100k_values.csv` (top 100 `k` up to 100,000).
- **`compare_k_lists.py`**
Compares different `k` lists (sieved vs heuristic vs other experiments).
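The collision metric follows from a residue argument: \(p \mid k \cdot 2^n + 1\) for some \(n\) exactly when \(-k^{-1} \bmod p\) lies in the multiplicative subgroup generated by 2 modulo \(p\). A sketch of an unweighted version of the score (the repo's exact weighting is not reproduced here):

```python
SIEVE_PRIMES = (3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53)

def collision_score(k: int, primes=SIEVE_PRIMES) -> int:
    """Count sieving primes p for which some n makes p divide k*2^n + 1,
    i.e. the target residue -k^(-1) mod p is a power of 2 mod p.
    Lower scores leave more n untouched by small-prime divisibility."""
    score = 0
    for p in primes:
        if k % p == 0:
            continue  # then k*2^n + 1 ≡ 1 (mod p): p never divides N
        target = (-pow(k, -1, p)) % p
        powers, x = set(), 1
        while x not in powers:  # enumerate the subgroup <2> mod p
            powers.add(x)
            x = (x * 2) % p
        if target in powers:
            score += 1
    return score
```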
### 4. Proth testing & verification
- **`proth_runner.c` + `Makefile`**
High-performance C implementation of the Proth test:
- GMP for big integers
- OpenMP for parallelism
- trial division + Proth’s theorem (check \(a^{(N-1)/2} \equiv -1 \pmod{N}\)).
**Outputs:** `poth_k_log.csv` and related log files.
- **`proth_runner.py`**
Pure-Python Proth tester using `sympy`. Good for experimentation and small tests.
- **`prothcalc.py`**
Production-grade Python pipeline using `gmpy2` + multiprocessing.
- runs trial division, Miller–Rabin and Proth test
- writes all tests to `proth_log.csv`
- writes confirmed primes to `proth_primes.csv`.
- **`monitor_primes.py`**
Watches the log files and shows real-time stats (e.g. total primes found, most recent primes).
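A minimal watcher in this spirit (a stdlib-only sketch, not the actual `monitor_primes.py`; filenames and column names follow the `proth_primes.csv` schema, and the `once` flag is an addition for testability):

```python
import csv
import pathlib
import time

def show_stats(log_path="proth_primes.csv", refresh=60, once=False):
    """Count confirmed primes in the CSV log and print the most
    recent discoveries, refreshing every `refresh` seconds."""
    while True:
        path = pathlib.Path(log_path)
        rows = []
        if path.exists():
            with path.open(newline="") as fh:
                rows = list(csv.DictReader(fh))
        print(f"total primes found: {len(rows)}")
        for row in rows[-10:]:  # ten most recent discoveries
            print(f"  k={row['k']}  n={row['n']}  at {row['timestamp']}")
        if once:                # single pass instead of looping forever
            return len(rows)
        time.sleep(refresh)
```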
### 5. Docs & figures
Suggested (you can adapt to your structure):
- `docs/ProthPrime-ML-report.pdf` or `.docx` — research report.
- `figures/figure1_pipeline_flowchart.png`
`figures/figure2_data_distribution.png`
`…` up to `figure10_k_score_vs_yield.png`.
These cover the pipeline diagram, data distribution, precision vs baseline, enrichment factor, feature importance, calibration plot, ROC curve, score histogram, discovery rate over time and k-score vs prime yield.
---
## ⚙️ Installation
### Python environment
```bash
git clone https://github.com/VesterlundCoder/ProthPrime-ML-XGBoost.git
cd ProthPrime-ML-XGBoost

python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate

pip install -r requirements.txt
```

If you don’t use a `requirements.txt` yet, typical dependencies are:

```bash
pip install numpy pandas xgboost gmpy2 sympy scikit-learn matplotlib tqdm
```

### C environment

To build the fast Proth tester you need:

- GCC or Clang with OpenMP support
- GMP development headers (`libgmp-dev` or equivalent)

```bash
make            # builds proth_runner / proth_runner_c
./proth_runner  # see README / --help in that folder
```

---

## 🚀 Usage

### 1. Build the dataset

```bash
# 1. scrape known Proth primes
python get_prothsearch.py

# 2. generate balanced prime / composite dataset
python generate_proth_ml_dataset.py

# 3. add feature columns
python make_features.py
```

You should now have:

- `prothsearch_primtal.csv`
- `proth_ml_dataset.csv`
- `proth_ml_features.csv`

### 2. Train the model

```bash
python train_proth_xgb.py
```

This script trains the model, prints metrics on the test set and (optionally) saves the model to disk (e.g. `models/proth_xgb_model.json` or `.joblib`).

### 3. Predict candidates

```bash
python predict_proth_primes.py
```

By default this will:

- sample 100,000 random \((k, n)\) pairs in the target range
- generate \(N = k \cdot 2^n + 1\)
- compute features and model probabilities
- write ranked predictions to:
  - `proth_ml_predictions.csv`
  - `predicted_primes_parallel.csv` (gold)
  - `predicted_primes_over_0.9.csv` (silver)
  - `top_predicted_primes.csv` (bronze)

### 4. Verify candidates

Use either the Python or C pipelines, e.g.:

```bash
# Python/gmpy2 verification
python prothcalc.py --input predicted_primes_parallel.csv

# or run the high-performance C code on your favourite k-list
./proth_runner top_1000_k_by_modular_score.csv > poth_k_log.csv
```

Confirmed primes are written to `proth_primes.csv` and/or the C logs.
---

## 📊 Data Files

### `proth_ml_dataset.csv`

| column | description |
|---|---|
| `k` | odd multiplier in Proth form |
| `n` | exponent in \(k \cdot 2^n + 1\) |
| `label` | 1 = prime, 0 = composite |

### `proth_ml_features.csv`

Extends the above with:

- `log_k`, `log_n`, `log_proth`
- `k_mod_4`, `n_mod_4`
- `bit_length_k`, `bit_length_n`
- (plus any extra modular / arithmetic features you add)

### `proth_ml_predictions.csv`

| column | description |
|---|---|
| `k` | candidate k |
| `n` | candidate n |
| `proba` | model probability `P(prime \| k, n)` |

### `proth_primes.csv`

| column | description |
|---|---|
| `timestamp` | discovery time |
| `k` | Proth multiplier |
| `n` | exponent |
| `N` / `prime` | the Proth prime found |
---

## 📈 Figures

The repo can also include small plotting scripts (e.g. `plots/plot_precision_baseline.py`) that read the CSV files and generate:
- Figure 1 – Pipeline Flowchart (diagram, not code-generated)
- Figure 2 – Data Distribution in (k, n)
- Figure 3 – Precision vs Random Baseline (bar chart)
- Figure 4 – Enrichment Factor vs Threshold (line chart)
- Figure 5 – Feature Importance (XGBoost)
- Figure 6 – Calibration Plot (Reliability Diagram)
- Figure 7 – ROC Curve on test set
- Figure 8 – Score Histogram: primes vs composites
- Figure 9 – Cumulative prime discoveries over time
- Figure 10 – k-score vs empirical prime yield.
These figures are described in detail in the accompanying report.

---

## 📄 File Documentation

### Python Scripts (Core ML Pipeline)

#### 1. Data Acquisition & Preparation
get_prothsearch.py
- Purpose: Web scraper for collecting known Proth primes from prothsearch.com
- Contribution: Downloads verified Proth prime data (k, n pairs) from multiple HTML sources to create the initial training dataset
- Output: prothsearch_primtal.csv
generate_proth_ml_dataset.py
- Purpose: Creates balanced ML training dataset with positive and negative examples
- Contribution: Generates synthetic negative examples (non-prime k,n pairs) at 2:1 ratio to positive examples; implements intelligent sampling to avoid known primes
- Output: proth_ml_dataset.csv
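A sketch of that negative-sampling logic (ranges, seed and helper name are hypothetical; the real script may additionally verify the compositeness of each negative):

```python
import random

def sample_negatives(positives, count, k_max=10_000, n_max=1_000, seed=0):
    """Draw random (k, n) pairs in Proth form (k odd, 2^n > k) that are
    not among the known primes, to serve as negative training examples."""
    rng = random.Random(seed)
    negatives = []
    while len(negatives) < count:
        k = rng.randrange(1, k_max, 2)            # odd k
        n = rng.randrange(k.bit_length(), n_max)  # n ≥ bitlen(k) ⇒ 2^n > k
        if (k, n) not in positives:
            negatives.append((k, n))              # note: not proven composite
    return negatives
```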
make_features.py
- Purpose: Feature engineering for ML model
- Contribution: Creates mathematical features including log transformations (log_k, log_n, log_proth), modular arithmetic (k_mod_4, n_mod_4), and bit-length calculations; enriches raw dataset with predictive features
- Output: proth_ml_features.csv
#### 2. Machine-Learning Model
train_proth_xgb.py
- Purpose: XGBoost classifier training and evaluation
- Contribution: Core ML component that trains the predictive model on n < 100,000; includes 80/20 train-test split with stratification; reports accuracy, precision, recall, F1-score, and ROC-AUC metrics
- Key Features Used: k, n, log_k, log_n, k_mod_4, n_mod_4, bit_length_k, bit_length_n, log_proth
predict_proth_primes.py
- Purpose: Generates predictions for new Proth prime candidates
- Contribution: Applies trained XGBoost model to 100,000 randomly sampled (k,n) pairs in range n ∈ [50001, 99999]; outputs top 1,000 candidates ranked by probability; validates predictions using gmpy2 primality testing
- Output: proth_ml_predictions.csv
#### 3. K-Value Optimization
generate_k_values.py
- Purpose: Mathematical algorithm for selecting optimal k values
- Contribution: Implements modular arithmetic sieving using small primes (3,5,7,11,13,17,19,23,29,31,37,41,43,47,53); calculates "collision scores" based on multiplicative order of 2 modulo p; identifies k values that minimize false positives in primality testing
- Algorithm: For each k, computes target residue (-k⁻¹ mod p) and checks if it falls in the set of power-of-2 residues
- Output: new_100k_values.csv (top 100 k values from search up to k=100,000)
compare_k_lists.py
- Purpose: Validation tool for k-value selection methods
- Contribution: Compares different k-value generation approaches; identifies overlaps between modular scoring method and alternative selection strategies
#### 4. Computational Testing (C & Python)
proth_runner.c
- Purpose: High-performance C implementation of Proth primality testing
- Contribution: Uses GMP library for arbitrary-precision arithmetic and OpenMP for parallel processing; implements trial division filtering followed by Proth's theorem verification (testing if a^((N-1)/2) ≡ -1 (mod N) for witnesses a)
- Performance: Multi-threaded execution for large-scale k,n space exploration
- Output: poth_k_log.csv with hexadecimal representation of large numbers
proth_runner.py
- Purpose: Pure Python Proth tester with interactive CLI
- Contribution: Educational implementation using sympy for primality checking; demonstrates Proth test algorithm with trial division pre-filtering; includes timing analysis per n value
- User Interface: Terminal prompts for k, n_start, n_end parameters
prothcalc.py
- Purpose: Parallel Proth calculator using gmpy2
- Contribution: Production-grade implementation with multiprocessing; combines trial division, Miller-Rabin (gmpy2.is_prime), and Proth test; logs detailed results including timestamps and test outcomes
- Output: proth_log.csv (all tests), proth_primes.csv (confirmed primes only)
monitor_primes.py
- Purpose: Real-time monitoring dashboard for prime discovery
- Contribution: Watches log files and displays live statistics; shows total primes found and most recent 10 discoveries; refreshes every 60 seconds
- Use Case: Long-running computational sessions
#### 5. Build & Configuration
Makefile
- Purpose: Automated compilation for C implementation
- Contribution: Simplifies building proth_runner_c executable with proper GMP and OpenMP linking
- Command: make produces optimized binary
README.md
- Purpose: Project documentation
- Contribution: Describes C implementation dependencies, compilation instructions, input file format, and usage examples
### CSV Data Files

#### Training Data
prothsearch_primtal.csv
- Content: Known Proth primes scraped from prothsearch.com
- Columns: k, n
- Size: Source data for positive training examples
- Role: Ground truth dataset for ML training
proth_ml_dataset.csv
- Content: Balanced dataset with positive (prime) and negative (composite) examples
- Columns: k, n, label (1=prime, 0=composite)
- Ratio: 1 positive : 2 negative examples
- Role: Raw training data before feature engineering
proth_ml_features.csv
- Content: Engineered features for ML model
- Columns: k, n, label, log_k, log_n, k_mod_4, n_mod_4, bit_length_k, bit_length_n, log_proth
- Role: Final training dataset with all features

#### K-Value Selection
top_1000_k_by_modular_score.csv
- Content: Top 1,000 k values ranked by modular collision score
- Selection Criteria: Lowest collision scores using sieving primes
- Role: Input for C implementation (proth_runner_c)
new_100k_values.csv
- Content: Top 100 k values from extended search (k ≤ 100,000)
- Role: Alternative k-value list for comparison and validation

#### Predictions & Results
proth_ml_predictions.csv
- Content: ML-predicted Proth prime candidates
- Columns: k, n, proba (probability from XGBoost model)
- Size: Top 1,000 candidates
- Validation: Each candidate verified with gmpy2.is_prime()

#### Log Files
proth_log.csv
- Content: Complete test results from prothcalc.py
- Columns: timestamp, k, n, N, passes_trial, passes_mr, passes_proth
- Role: Detailed audit trail of all primality tests
proth_primes.csv
- Content: Confirmed Proth primes only
- Columns: timestamp, k, n, Proth_Prime_N
- Role: Verified discoveries
poth_k_log.csv (from C implementation)
- Content: Hexadecimal logs from proth_runner_c
- Columns: timestamp, k, n, N_hex, trial_division_passed, proth_test_passed, is_proth_prime
- Format: N values in hexadecimal for large numbers
prime_logfile.csv
- Content: Real-time prime discovery log
- Monitored by: monitor_primes.py
- Role: Live tracking during long computations
prun_logfile.csv,
prun_new100k_logfile.csv
- Content: Alternative log outputs from various test runs
- Role: Historical records of different k-value experiments
### Other Files
proth_runner (compiled binary)
- Type: Executable (compiled from proth_runner.c)
- Role: Production binary for high-performance testing
proth_runner_c (compiled binary)
- Type: Executable (alternative build output)
- Role: Same as above, different naming convention

#### `.numbers` Files (Numbers Spreadsheets)

- `poth_k_log.numbers`, `poth_k_log_blank.numbers`
- `proth_log.numbers`
- `top_1000_k_by_modular_score.numbers`
- `EIJu6Mun3Xc_analysis.numbers`
- Role: macOS Numbers templates for data visualization and analysis
---

## ⚠️ Disclaimer

- This project is research/experimental code.
- It is not intended as a drop-in cryptographic library.
- Always verify primes independently when correctness matters.
---

## 📚 Citation

If you use this code or the ideas in your own work, please consider citing:

```bibtex
@misc{vesterlund_prothprime_ml_xgboost,
  author       = {David Vesterlund},
  title        = {ProthPrime-ML-XGBoost: Machine Learning Assisted Search for Proth Primes},
  year         = {2025},
  howpublished = {GitHub repository},
  url          = {https://github.com/VesterlundCoder/ProthPrime-ML-XGBoost}
}
```

## 📜 License

MIT License

Copyright (c) 2025 David Vesterlund ...
---