# ProthPrime-ML-XGBoost

Machine-learning assisted search for **Proth primes** using an XGBoost classifier as a pre-screening filter, combined with modular arithmetic heuristics and high-performance Proth primality tests in C and Python.

The goal of this project is not to *replace* deterministic primality tests, but to **reduce the search space**: the model ranks candidate Proth numbers so that deterministic tests are applied first to the most promising candidates.

> **Author:** David Vesterlund, Independent Researcher in Applied Mathematics and Machine Learning with a focus on computational and algorithmic number theory  
> **GitHub:** [VesterlundCoder](https://github.com/VesterlundCoder)  
> **Keywords:** Proth primes, machine learning, computational number theory, XGBoost, prime prediction, supervised learning, primality testing, gradient boosting, large primes, data-driven mathematics


---

## 🔍 Background

A Proth number has the form

\[
N = k \cdot 2^n + 1, \quad k \text{ odd}, \quad 2^n > k.
\]

When \(N\) is prime, it is called a **Proth prime**. Classical searches generate huge ranges of \((k, n)\) pairs and apply a Proth test (and related checks) to each candidate – an expensive process at scale.

This project trains a supervised model (gradient-boosted decision trees, XGBoost) on labeled Proth numbers (known primes + carefully generated composites) and uses the model to rank new candidates before running heavy primality tests.

In experiments, the ML-guided candidate lists contain **~4–6× more primes** than a random baseline in the same numerical range, making it a useful *front-end filter* for large prime searches.
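Proth's theorem makes the deterministic step cheap to state: for \(N = k \cdot 2^n + 1\) with odd \(k < 2^n\), \(N\) is prime if and only if some witness \(a\) satisfies \(a^{(N-1)/2} \equiv -1 \pmod{N}\). A minimal stdlib-only sketch of that test (function name and witness list are illustrative; the production implementations are `proth_runner.c` and `prothcalc.py`):

```python
def proth_test(k, n, witnesses=(3, 5, 7, 11, 13, 17)):
    """Proth's theorem: N = k*2^n + 1 (k odd, k < 2^n) is prime iff
    a^((N-1)/2) == -1 (mod N) for some witness a.

    Returns True (prime), False (composite), or None (inconclusive:
    every tried witness was a quadratic residue)."""
    assert k % 2 == 1 and k < 2 ** n, "not a Proth number"
    N = k * 2 ** n + 1
    for a in witnesses:
        if a % N == 0:
            continue                # invalid witness for tiny N
        r = pow(a, (N - 1) // 2, N)
        if r == N - 1:              # Euler criterion hit -1: N is prime
            return True
        if r != 1:                  # neither 1 nor -1: N is composite
            return False
    return None                     # all witnesses gave 1: try more

# 3*2^2 + 1 = 13 is a Proth prime; 3*2^3 + 1 = 25 = 5*5 is not
assert proth_test(3, 2) is True
assert proth_test(3, 3) is False
```

For a prime \(N\), roughly half of all bases are non-residues, so a short witness list almost always settles the question; the rare `None` case just means more witnesses are needed.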

---

## ✨ Key Features

- **End-to-end pipeline** from data acquisition → feature engineering → model training → candidate prediction → deterministic verification.
- **XGBoost classifier** trained on:
  - known Proth primes from ProthSearch.com, and  
  - “hard” negative examples (non-primes that look prime-ish).
- **Feature engineering** tailored to Proth numbers:
  - raw: `k`, `n`  
  - logarithmic: `log_k`, `log_n`, `log_proth`  
  - modular: `k_mod_4`, `n_mod_4`, etc.  
  - bit-length features for `k` and `n`.
- **Modular k-value optimisation**:
  - modular sieving using primes 3,5,7,…,53  
  - collision score to focus on promising `k` values.
- **High-performance verification**:
  - C implementation using GMP + OpenMP (`proth_runner.c`)  
  - Python/gmpy2 implementation (`prothcalc.py`)  
  - real-time monitoring (`monitor_primes.py`).
- **Reproducible research**:
  - CSV datasets for training, features, predictions and logs  
  - research report + figures (pipeline, ROC, calibration, enrichment plots).

---

## 📁 Main Components

> The exact directory layout may vary; the descriptions below assume scripts live at the repo root or in a `src/` directory.

### 1. Data acquisition & dataset generation

- **`get_prothsearch.py`**  
  Web scraper for known Proth primes from ProthSearch.com.  
  **Output:** `prothsearch_primtal.csv` (columns: `k`, `n`).

- **`generate_proth_ml_dataset.py`**  
  Builds a balanced ML dataset with positive and negative examples.  
  - Uses `prothsearch_primtal.csv` as ground truth positives.  
  - Generates synthetic composite Proth numbers (negatives) in similar ranges.  
  - Approximate ratio: 1 prime : 2 negatives.  
  **Output:** `proth_ml_dataset.csv` (`k`, `n`, `label`).

- **`make_features.py`**  
  Adds feature columns used by the ML model.  
  - `log_k`, `log_n`, `log_proth`  
  - `k_mod_4`, `n_mod_4` (and related residues)  
  - `bit_length_k`, `bit_length_n`  
  **Output:** `proth_ml_features.csv`.
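The feature columns above can be computed without ever materialising the full number \(N\). A stdlib-only sketch (the helper name `add_features` mirrors the shared function mentioned below; the log base is an assumption, base 2 here):

```python
import math

def add_features(k, n):
    """Engineered features for a Proth candidate N = k*2^n + 1.
    Log base 2 is an assumption; the repo may use natural logs."""
    return {
        "k": k,
        "n": n,
        "log_k": math.log2(k),
        "log_n": math.log2(n),
        # log2(k*2^n + 1) ~ log2(k) + n, so N itself is never built
        "log_proth": math.log2(k) + n,
        "k_mod_4": k % 4,
        "n_mod_4": n % 4,
        "bit_length_k": k.bit_length(),
        "bit_length_n": n.bit_length(),
    }

feats = add_features(k=3, n=5)
assert feats["k_mod_4"] == 3 and feats["bit_length_k"] == 2
```

Avoiding the explicit \(2^n\) matters once \(n\) reaches tens of thousands: the features stay O(1) to compute regardless of candidate size.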

### 2. Machine-learning model

- **`train_proth_xgb.py`**  
  Trains the XGBoost classifier on `proth_ml_features.csv`.  
  - 80/20 stratified train–test split  
  - metrics: accuracy, precision, recall, F1, ROC-AUC  
  - can export the model with `joblib` or XGBoost’s native format.

- **`predict_proth_primes.py`**  
  Uses the trained model to rank new candidates.  
  Typical process:
  - randomly sample \((k,n)\) with e.g. \(n \in [50\,001, 99\,999]\)  
  - compute \(N = k \cdot 2^n + 1\)  
  - compute features via a shared `add_features()` function  
  - predict `P(prime | k, n)` and write ranked outputs  
  **Outputs:**
  - `proth_ml_predictions.csv` — all predictions  
  - `predicted_primes_parallel.csv` – **Gold** tier (highest P)  
  - `predicted_primes_over_0.9.csv` – **Silver** tier (P ≥ 0.9)  
  - `top_predicted_primes.csv` – **Bronze** tier (top-K candidates).
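The tiering step is a plain ranking over model probabilities. A sketch of the idea (tier thresholds are assumptions inferred from the file names; the gold criterion is left out because "highest P" is not pinned to a threshold in the source):

```python
import csv

def write_tiers(scored, silver_threshold=0.9, top_k=1000):
    """Split (k, n, proba) rows into ranked output tiers.

    `scored` is an iterable of (k, n, proba) tuples. The silver
    threshold (P >= 0.9) and top-K bronze cut are assumptions based
    on the output file names."""
    ranked = sorted(scored, key=lambda row: row[2], reverse=True)
    tiers = {
        "proth_ml_predictions.csv": ranked,                   # all predictions
        "predicted_primes_over_0.9.csv":
            [r for r in ranked if r[2] >= silver_threshold],  # silver
        "top_predicted_primes.csv": ranked[:top_k],           # bronze
    }
    for name, rows in tiers.items():
        with open(name, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["k", "n", "proba"])
            writer.writerows(rows)
    return tiers
```

Because every tier is a view of the same sorted list, deterministic verification can simply walk the files in tier order.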

### 3. K-value optimisation

- **`generate_k_values.py`**  
  Modular sieving over small primes (3–53) to score `k` values by a collision metric (how often corresponding Proth numbers fall into “bad” residue classes).  
  **Outputs:**
  - `top_1000_k_by_modular_score.csv`  
  - `new_100k_values.csv` (top 100 `k` up to 100,000).
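The collision metric follows from a one-line congruence: \(p \mid k \cdot 2^n + 1\) for some \(n\) exactly when \(-k^{-1} \bmod p\) lies in the subgroup of powers of 2 modulo \(p\). A `k` that "collides" with few sieve primes loses fewer residue classes of \(n\) to small factors. A stdlib-only sketch of that scoring idea (an illustration of the principle, not the repo's exact code):

```python
SIEVE_PRIMES = [3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53]

def collision_score(k, primes=SIEVE_PRIMES):
    """Count sieve primes p that can divide k*2^n + 1 for some n.

    p | k*2^n + 1  <=>  2^n == -k^{-1} (mod p), i.e. the target
    residue lies in the subgroup generated by 2 mod p. Lower scores
    mean fewer n-values lost to small factors."""
    score = 0
    for p in primes:
        if k % p == 0:
            continue  # then k*2^n + 1 == 1 (mod p): p never divides N
        # build the subgroup of powers of 2 modulo p
        powers, r = set(), 1
        while r not in powers:
            powers.add(r)
            r = (r * 2) % p
        target = (-pow(k, -1, p)) % p   # modular inverse (Python >= 3.8)
        if target in powers:
            score += 1
    return score
```

Ranking candidate `k` values by ascending `collision_score` then reproduces the "promising `k`" ordering described above.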

- **`compare_k_lists.py`**  
  Compares different `k` lists (sieved vs heuristic vs other experiments).

### 4. Proth testing & verification

- **`proth_runner.c` + `Makefile`**  
  High-performance C implementation of the Proth test:
  - GMP for big integers  
  - OpenMP for parallelism  
  - trial division + Proth’s theorem (check \(a^{(N-1)/2} \equiv -1 \pmod{N}\)).  
  **Outputs:** `poth_k_log.csv` and related log files.

- **`proth_runner.py`**  
  Pure-Python Proth tester using `sympy`. Good for experimentation and small tests.

- **`prothcalc.py`**  
  Production-grade Python pipeline using `gmpy2` + multiprocessing.  
  - runs trial division, Miller–Rabin and Proth test  
  - writes all tests to `proth_log.csv`  
  - writes confirmed primes to `proth_primes.csv`.
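The staged ordering matters: cheap trial division rejects most composites before any expensive modular exponentiation runs. A stdlib sketch of the three stages (the real script uses `gmpy2` and multiprocessing; the trial-division limit and base/witness sets here are illustrative):

```python
def trial_division(N, limit=1000):
    """Stage 1: reject N if a small odd prime factor exists."""
    d = 3
    while d < limit and d * d <= N:
        if N % d == 0:
            return False
        d += 2
    return True

def miller_rabin(N, bases=(2, 3, 5, 7, 11)):
    """Stage 2: probable-prime test (gmpy2.is_prime in the real script)."""
    d, s = N - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for a in bases:
        x = pow(a, d, N)
        if x in (1, N - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, N)
            if x == N - 1:
                break
        else:
            return False
    return True

def proth_stage(k, n):
    """Stage 3: deterministic check, a^((N-1)/2) == -1 (mod N)."""
    N = k * 2 ** n + 1
    return any(pow(a, (N - 1) // 2, N) == N - 1 for a in (3, 5, 7, 11))

def verify(k, n):
    N = k * 2 ** n + 1
    return trial_division(N) and miller_rabin(N) and proth_stage(k, n)
```

Each stage only runs when the previous one passes, so the per-candidate cost is dominated by the (rare) survivors of trial division.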

- **`monitor_primes.py`**  
  Watches the log files and shows real-time stats (e.g. total primes found, most recent primes).

### 5. Docs & figures

Suggested (you can adapt to your structure):

- `docs/ProthPrime-ML-report.pdf` or `.docx` — research report.
- `figures/figure1_pipeline_flowchart.png`  
  `figures/figure2_data_distribution.png`  
  … up to `figures/figure10_k_score_vs_yield.png`.

These cover the pipeline diagram, data distribution, precision vs baseline, enrichment factor, feature importance, calibration plot, ROC curve, score histogram, discovery rate over time and k-score vs prime yield.

---

## ⚙️ Installation

### Python environment

```bash
git clone https://github.com/VesterlundCoder/ProthPrime-ML-XGBoost.git
cd ProthPrime-ML-XGBoost

python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

pip install -r requirements.txt
```

If you don’t use a `requirements.txt` yet, typical dependencies are:

```bash
pip install numpy pandas xgboost gmpy2 sympy scikit-learn matplotlib tqdm
```

### C toolchain (optional but recommended)

To build the fast Proth tester you need:

- GCC or Clang with OpenMP support
- GMP development headers (`libgmp-dev` or equivalent)

```bash
make           # builds proth_runner / proth_runner_c
./proth_runner # see README / --help in that folder
```

## 🚀 Quickstart

### 1. Build the training dataset

```bash
# 1. scrape known Proth primes
python get_prothsearch.py

# 2. generate balanced prime / composite dataset
python generate_proth_ml_dataset.py

# 3. add feature columns
python make_features.py
```

You should now have:

- `prothsearch_primtal.csv`
- `proth_ml_dataset.csv`
- `proth_ml_features.csv`

### 2. Train the XGBoost model

```bash
python train_proth_xgb.py
```

This script trains the model, prints metrics on the test set and (optionally) saves the model to disk (e.g. `models/proth_xgb_model.json` or `.joblib`).

### 3. Generate candidate Proth primes

```bash
python predict_proth_primes.py
```

By default this will:

- sample 100,000 random \((k, n)\) pairs in the target range
- generate \(N = k \cdot 2^n + 1\)
- compute features and model probabilities
- write ranked predictions to:
  - `proth_ml_predictions.csv`
  - `predicted_primes_parallel.csv` (gold)
  - `predicted_primes_over_0.9.csv` (silver)
  - `top_predicted_primes.csv` (bronze)

### 4. Verify candidates with deterministic tests

Use either the Python or C pipeline, e.g.:

```bash
# Python/gmpy2 verification
python prothcalc.py --input predicted_primes_parallel.csv

# or run the high-performance C code on your favourite k-list
./proth_runner top_1000_k_by_modular_score.csv > poth_k_log.csv
```

Confirmed primes will be written to `proth_primes.csv` and/or the C logs.


## 📊 Data Formats (selected)

### `proth_ml_dataset.csv`

| column  | description                          |
|---------|--------------------------------------|
| `k`     | odd multiplier in Proth form         |
| `n`     | exponent in \(k \cdot 2^n + 1\)      |
| `label` | 1 = prime, 0 = composite             |

### `proth_ml_features.csv`

Extends the above with:

- `log_k`, `log_n`, `log_proth`
- `k_mod_4`, `n_mod_4`
- `bit_length_k`, `bit_length_n`
- (plus any extra modular / arithmetic features you add).

### `proth_ml_predictions.csv`

| column  | description                                   |
|---------|-----------------------------------------------|
| `k`     | candidate `k`                                 |
| `n`     | candidate `n`                                 |
| `proba` | model probability \(P(\text{prime} \mid k, n)\) |

### `proth_primes.csv`

| column        | description               |
|---------------|---------------------------|
| `timestamp`   | discovery time            |
| `k`           | Proth multiplier          |
| `n`           | exponent                  |
| `N` / `prime` | the Proth prime found     |

## 📈 Reproducing the figures

The repo can also include small plotting scripts (e.g. `plots/plot_precision_baseline.py`) that read the CSV files and generate:

- Figure 1 – Pipeline Flowchart (diagram, not code-generated)
- Figure 2 – Data Distribution in \((k, n)\)
- Figure 3 – Precision vs Random Baseline (bar chart)
- Figure 4 – Enrichment Factor vs Threshold (line chart)
- Figure 5 – Feature Importance (XGBoost)
- Figure 6 – Calibration Plot (Reliability Diagram)
- Figure 7 – ROC Curve on test set
- Figure 8 – Score Histogram: primes vs composites
- Figure 9 – Cumulative prime discoveries over time
- Figure 10 – k-score vs empirical prime yield

These figures are described in detail in the accompanying report.

---

## 📂 File Documentation

### Python Scripts (Core ML Pipeline)

#### 1. Data Acquisition & Preparation

**`get_prothsearch.py`**

- **Purpose:** Web scraper for collecting known Proth primes from prothsearch.com
- **Contribution:** Downloads verified Proth prime data (`k`, `n` pairs) from multiple HTML sources to create the initial training dataset
- **Output:** `prothsearch_primtal.csv`

**`generate_proth_ml_dataset.py`**

- **Purpose:** Creates a balanced ML training dataset with positive and negative examples
- **Contribution:** Generates synthetic negative examples (non-prime `k`, `n` pairs) at a 2:1 ratio to positives; implements intelligent sampling to avoid known primes
- **Output:** `proth_ml_dataset.csv`

**`make_features.py`**

- **Purpose:** Feature engineering for the ML model
- **Contribution:** Creates mathematical features including log transformations (`log_k`, `log_n`, `log_proth`), modular arithmetic (`k_mod_4`, `n_mod_4`) and bit-length calculations; enriches the raw dataset with predictive features
- **Output:** `proth_ml_features.csv`

#### 2. Machine Learning Model

**`train_proth_xgb.py`**

- **Purpose:** XGBoost classifier training and evaluation
- **Contribution:** Core ML component that trains the predictive model on n < 100,000; includes an 80/20 stratified train–test split; reports accuracy, precision, recall, F1-score and ROC-AUC metrics
- **Key features used:** `k`, `n`, `log_k`, `log_n`, `k_mod_4`, `n_mod_4`, `bit_length_k`, `bit_length_n`, `log_proth`

**`predict_proth_primes.py`**

- **Purpose:** Generates predictions for new Proth prime candidates
- **Contribution:** Applies the trained XGBoost model to 100,000 randomly sampled \((k, n)\) pairs in the range \(n \in [50001, 99999]\); outputs the top 1,000 candidates ranked by probability; validates predictions using gmpy2 primality testing
- **Output:** `proth_ml_predictions.csv`

#### 3. K-Value Optimization

**`generate_k_values.py`**

- **Purpose:** Mathematical algorithm for selecting optimal `k` values
- **Contribution:** Implements modular arithmetic sieving using small primes (3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53); calculates "collision scores" based on the multiplicative order of 2 modulo `p`; identifies `k` values that minimize false positives in primality testing
- **Algorithm:** For each `k`, computes the target residue \(-k^{-1} \bmod p\) and checks whether it falls in the set of power-of-2 residues
- **Output:** `new_100k_values.csv` (top 100 `k` values from a search up to k = 100,000)

**`compare_k_lists.py`**

- **Purpose:** Validation tool for k-value selection methods
- **Contribution:** Compares different k-value generation approaches; identifies overlaps between the modular scoring method and alternative selection strategies

#### 4. Computational Testing (C & Python)

**`proth_runner.c`**

- **Purpose:** High-performance C implementation of Proth primality testing
- **Contribution:** Uses the GMP library for arbitrary-precision arithmetic and OpenMP for parallel processing; implements trial-division filtering followed by Proth's theorem verification (testing whether \(a^{(N-1)/2} \equiv -1 \pmod{N}\) for witnesses `a`)
- **Performance:** Multi-threaded execution for large-scale \((k, n)\) space exploration
- **Output:** `poth_k_log.csv` with hexadecimal representation of large numbers

**`proth_runner.py`**

- **Purpose:** Pure-Python Proth tester with an interactive CLI
- **Contribution:** Educational implementation using sympy for primality checking; demonstrates the Proth test algorithm with trial-division pre-filtering; includes timing analysis per `n` value
- **User interface:** Terminal prompts for `k`, `n_start`, `n_end` parameters

**`prothcalc.py`**

- **Purpose:** Parallel Proth calculator using gmpy2
- **Contribution:** Production-grade implementation with multiprocessing; combines trial division, Miller–Rabin (`gmpy2.is_prime`) and the Proth test; logs detailed results including timestamps and test outcomes
- **Output:** `proth_log.csv` (all tests), `proth_primes.csv` (confirmed primes only)

**`monitor_primes.py`**

- **Purpose:** Real-time monitoring dashboard for prime discovery
- **Contribution:** Watches log files and displays live statistics; shows total primes found and the 10 most recent discoveries; refreshes every 60 seconds
- **Use case:** Long-running computational sessions

#### 5. Build & Configuration

**`Makefile`**

- **Purpose:** Automated compilation for the C implementation
- **Contribution:** Simplifies building the `proth_runner_c` executable with proper GMP and OpenMP linking
- **Command:** `make` produces an optimized binary

**`README.md`**

- **Purpose:** Project documentation
- **Contribution:** Describes C implementation dependencies, compilation instructions, input file format and usage examples

### CSV Data Files

#### Training Data

**`prothsearch_primtal.csv`**

- **Content:** Known Proth primes scraped from prothsearch.com
- **Columns:** `k`, `n`
- **Role:** Ground-truth dataset and source of positive training examples

**`proth_ml_dataset.csv`**

- **Content:** Balanced dataset with positive (prime) and negative (composite) examples
- **Columns:** `k`, `n`, `label` (1 = prime, 0 = composite)
- **Ratio:** 1 positive : 2 negative examples
- **Role:** Raw training data before feature engineering

**`proth_ml_features.csv`**

- **Content:** Engineered features for the ML model
- **Columns:** `k`, `n`, `label`, `log_k`, `log_n`, `k_mod_4`, `n_mod_4`, `bit_length_k`, `bit_length_n`, `log_proth`
- **Role:** Final training dataset with all features

#### K-Value Selection

**`top_1000_k_by_modular_score.csv`**

- **Content:** Top 1,000 `k` values ranked by modular collision score
- **Selection criteria:** Lowest collision scores using the sieving primes
- **Role:** Input for the C implementation (`proth_runner_c`)

**`new_100k_values.csv`**

- **Content:** Top 100 `k` values from an extended search (k ≤ 100,000)
- **Role:** Alternative k-value list for comparison and validation

#### Predictions & Results

**`proth_ml_predictions.csv`**

- **Content:** ML-predicted Proth prime candidates
- **Columns:** `k`, `n`, `proba` (probability from the XGBoost model)
- **Size:** Top 1,000 candidates
- **Validation:** Each candidate verified with `gmpy2.is_prime()`

#### Log Files

**`proth_log.csv`**

- **Content:** Complete test results from `prothcalc.py`
- **Columns:** `timestamp`, `k`, `n`, `N`, `passes_trial`, `passes_mr`, `passes_proth`
- **Role:** Detailed audit trail of all primality tests

**`proth_primes.csv`**

- **Content:** Confirmed Proth primes only
- **Columns:** `timestamp`, `k`, `n`, `Proth_Prime_N`
- **Role:** Verified discoveries

**`poth_k_log.csv`** (from the C implementation)

- **Content:** Hexadecimal logs from `proth_runner_c`
- **Columns:** `timestamp`, `k`, `n`, `N_hex`, `trial_division_passed`, `proth_test_passed`, `is_proth_prime`
- **Format:** `N` values in hexadecimal for large numbers

**`prime_logfile.csv`**

- **Content:** Real-time prime discovery log
- **Monitored by:** `monitor_primes.py`
- **Role:** Live tracking during long computations

**`prun_logfile.csv`**, **`prun_new100k_logfile.csv`**

- **Content:** Alternative log outputs from various test runs
- **Role:** Historical records of different k-value experiments

### Other Files

**`proth_runner`** (compiled binary)

- **Type:** Executable (compiled from `proth_runner.c`)
- **Role:** Production binary for high-performance testing

**`proth_runner_c`** (compiled binary)

- **Type:** Executable (alternative build output)
- **Role:** Same as above, different naming convention

**`.numbers` files (Numbers spreadsheets)**

- `poth_k_log.numbers`, `poth_k_log_blank.numbers`
- `proth_log.numbers`
- `top_1000_k_by_modular_score.numbers`
- `EIJu6Mun3Xc_analysis.numbers`
- **Role:** macOS Numbers templates for data visualization and analysis

## ⚠️ Disclaimer

- This project is research/experimental code.
- It is not intended as a drop-in cryptographic library.
- Always verify primes independently when correctness matters.

## 📚 Citation

If you use this code or these ideas in your own work, please consider citing:

```bibtex
@misc{vesterlund_prothprime_ml_xgboost,
  author       = {David Vesterlund},
  title        = {ProthPrime-ML-XGBoost: Machine Learning Assisted Search for Proth Primes},
  year         = {2025},
  howpublished = {GitHub repository},
  note         = {Independent Researcher in Applied Mathematics and Machine Learning},
  url          = {https://github.com/VesterlundCoder/ProthPrime-ML-XGBoost}
}
```

## 📝 License

MIT License

Copyright (c) 2025 David Vesterlund ...


---


