# ProthPrime-ML-XGBoost
Machine-learning assisted search for **Proth primes** using an XGBoost classifier as a pre-screening filter, combined with modular arithmetic heuristics and high-performance Proth primality tests in C and Python.
The goal of this project is not to *replace* deterministic primality tests, but to **reduce the search space**: the model ranks candidate Proth numbers so that deterministic tests are applied first to the most promising candidates.
> **Author:** David Vesterlund, Independent Researcher in Applied Mathematics and Machine Learning, with a focus on computational and algorithmic number theory
> **GitHub:** [VesterlundCoder](https://github.com/VesterlundCoder)
**Keywords:** Proth Primes, Machine Learning, Computational Number Theory, XGBoost, Prime Prediction, Supervised Learning, Primality Testing, Gradient Boosting, Large Primes, Data-Driven Mathematics
---
## 🔍 Background
A Proth number has the form
\[
N = k \cdot 2^n + 1, \quad k \text{ odd}, \quad 2^n > k.
\]
When \(N\) is prime, it is called a **Proth prime**. Classical searches generate huge ranges of \((k, n)\) pairs and apply a Proth test (and related checks) to each candidate – an expensive process at scale.
This project trains a supervised model (gradient-boosted decision trees, XGBoost) on labeled Proth numbers (known primes + carefully generated composites) and uses the model to rank new candidates before running heavy primality tests.
In experiments, the ML-guided candidate lists contain **~4–6× more primes** than a random baseline in the same numerical range, making it a useful *front-end filter* for large prime searches.
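The deterministic back-end of the pipeline is Proth's theorem: for \(N = k \cdot 2^n + 1\) with \(k\) odd and \(2^n > k\), if \(a^{(N-1)/2} \equiv -1 \pmod{N}\) for some base \(a\), then \(N\) is prime. A minimal pure-Python sketch (the fixed base list is an illustrative assumption; the repo's production testers use GMP/gmpy2):

```python
def proth_test(k: int, n: int, bases=(3, 5, 7, 11, 13)) -> bool:
    """Proth's theorem: N = k*2^n + 1 (k odd, 2^n > k) is prime if
    a^((N-1)/2) ≡ -1 (mod N) for some base a."""
    if k % 2 == 0 or k >= 1 << n:
        raise ValueError("need k odd and 2^n > k")
    N = k * (1 << n) + 1
    e = (N - 1) // 2
    for a in bases:
        if pow(a, e, N) == N - 1:  # found a witness: N is certifiably prime
            return True
    return False  # inconclusive with these bases (in practice: composite)

# 3*2^2 + 1 = 13 and 5*2^3 + 1 = 41 are Proth primes; 3*2^3 + 1 = 25 is not
```

A failed search over a small base set is only heuristic evidence of compositeness; when \(N\) actually is prime, exactly half of all bases are witnesses, so a handful of bases almost always suffices.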
---
## ✨ Key Features
- **End-to-end pipeline** from data acquisition → feature engineering → model training → candidate prediction → deterministic verification.
- **XGBoost classifier** trained on:
- known Proth primes from ProthSearch.com, and
- “hard” negative examples (non-primes that look prime-ish).
- **Feature engineering** tailored to Proth numbers:
- raw: `k`, `n`
- logarithmic: `log_k`, `log_n`, `log_proth`
- modular: `k_mod_4`, `n_mod_4`, etc.
- bit-length features for `k` and `n`.
- **Modular k-value optimisation**:
- modular sieving using primes 3,5,7,…,53
- collision score to focus on promising `k` values.
- **High-performance verification**:
- C implementation using GMP + OpenMP (`proth_runner.c`)
- Python/gmpy2 implementation (`prothcalc.py`)
- real-time monitoring (`monitor_primes.py`).
- **Reproducible research**:
- CSV datasets for training, features, predictions and logs
- research report + figures (pipeline, ROC, calibration, enrichment plots).
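The engineered columns above can be derived by one shared helper. This is a sketch consistent with the listed feature names (the choice of log base, here natural log for `log_k`/`log_n` and base 2 for `log_proth`, is an assumption):

```python
import numpy as np
import pandas as pd

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive logarithmic, modular and bit-length features from raw (k, n)."""
    out = df.copy()
    out["log_k"] = np.log(out["k"])
    out["log_n"] = np.log(out["n"])
    # log2(N) ≈ log2(k) + n for N = k*2^n + 1 (the +1 is negligible)
    out["log_proth"] = np.log2(out["k"]) + out["n"]
    out["k_mod_4"] = out["k"] % 4
    out["n_mod_4"] = out["n"] % 4
    out["bit_length_k"] = out["k"].apply(lambda v: int(v).bit_length())
    out["bit_length_n"] = out["n"].apply(lambda v: int(v).bit_length())
    return out
```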
---
## 📁 Main Components
> The exact directory layout is up to you; below assumes scripts live at the repo root or in a `src/` directory.
### 1. Data acquisition & dataset generation
- **`get_prothsearch.py`**
Web scraper for known Proth primes from ProthSearch.com.
**Output:** `prothsearch_primtal.csv` (columns: `k`, `n`).
- **`generate_proth_ml_dataset.py`**
Builds a balanced ML dataset with positive and negative examples.
- Uses `prothsearch_primtal.csv` as ground truth positives.
- Generates synthetic composite Proth numbers (negatives) in similar ranges.
- Approximate ratio: 1 prime : 2 negatives.
**Output:** `proth_ml_dataset.csv` (`k`, `n`, `label`).
- **`make_features.py`**
Adds feature columns used by the ML model.
- `log_k`, `log_n`, `log_proth`
- `k_mod_4`, `n_mod_4` (and related residues)
- `bit_length_k`, `bit_length_n`
**Output:** `proth_ml_features.csv`.
### 2. Machine-learning model
- **`train_proth_xgb.py`**
Trains the XGBoost classifier on `proth_ml_features.csv`.
- 80/20 stratified train–test split
- metrics: accuracy, precision, recall, F1, ROC-AUC
- can export the model with `joblib` or XGBoost’s native format.
- **`predict_proth_primes.py`**
Uses the trained model to rank new candidates.
Typical process:
- randomly sample \((k,n)\) with e.g. \(n \in [50\,001, 99\,999]\)
- compute \(N = k \cdot 2^n + 1\)
- compute features via a shared `add_features()` function
- predict `P(prime | k, n)` and write ranked outputs
**Outputs:**
- `proth_ml_predictions.csv` — all predictions
- `predicted_primes_parallel.csv` — **Gold** tier (highest P)
- `predicted_primes_over_0.9.csv` — **Silver** tier (P ≥ 0.9)
- `top_predicted_primes.csv` — **Bronze** tier (top-K candidates).
### 3. K-value optimisation
- **`generate_k_values.py`**
Modular sieving over small primes (3–53) to score `k` values by a collision metric (how often corresponding Proth numbers fall into “bad” residue classes).
**Outputs:**
- `top_1000_k_by_modular_score.csv`
- `new_100k_values.csv` (top 100 `k` up to 100,000).
- **`compare_k_lists.py`**
Compares different `k` lists (sieved vs heuristic vs other experiments).
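The collision metric follows from a residue argument: \(p \mid k \cdot 2^n + 1\) for some \(n\) exactly when \(-k^{-1} \bmod p\) lies in the multiplicative subgroup generated by 2 modulo \(p\). A sketch of an unweighted version of the score (the repo's exact weighting is not reproduced here):

```python
SIEVE_PRIMES = (3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53)

def collision_score(k: int, primes=SIEVE_PRIMES) -> int:
    """Count sieving primes p for which some n makes p divide k*2^n + 1,
    i.e. the target residue -k^(-1) mod p is a power of 2 mod p.
    Lower scores leave more n untouched by small-prime divisibility."""
    score = 0
    for p in primes:
        if k % p == 0:
            continue  # then k*2^n + 1 ≡ 1 (mod p): p never divides N
        target = (-pow(k, -1, p)) % p
        powers, x = set(), 1
        while x not in powers:  # enumerate the subgroup <2> mod p
            powers.add(x)
            x = (x * 2) % p
        if target in powers:
            score += 1
    return score
```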
### 4. Proth testing & verification
- **`proth_runner.c` + `Makefile`**
High-performance C implementation of the Proth test:
- GMP for big integers
- OpenMP for parallelism
- trial division + Proth’s theorem (check \(a^{(N-1)/2} \equiv -1 \pmod{N}\)).
**Outputs:** `poth_k_log.csv` and related log files.
- **`proth_runner.py`**
Pure-Python Proth tester using `sympy`. Good for experimentation and small tests.
- **`prothcalc.py`**
Production-grade Python pipeline using `gmpy2` + multiprocessing.
- runs trial division, Miller–Rabin and Proth test
- writes all tests to `proth_log.csv`
- writes confirmed primes to `proth_primes.csv`.
- **`monitor_primes.py`**
Watches the log files and shows real-time stats (e.g. total primes found, most recent primes).
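A minimal watcher in this spirit (a stdlib-only sketch, not the actual `monitor_primes.py`; filenames and column names follow the `proth_primes.csv` schema, and the `once` flag is an addition for testability):

```python
import csv
import pathlib
import time

def show_stats(log_path="proth_primes.csv", refresh=60, once=False):
    """Count confirmed primes in the CSV log and print the most
    recent discoveries, refreshing every `refresh` seconds."""
    while True:
        path = pathlib.Path(log_path)
        rows = []
        if path.exists():
            with path.open(newline="") as fh:
                rows = list(csv.DictReader(fh))
        print(f"total primes found: {len(rows)}")
        for row in rows[-10:]:  # ten most recent discoveries
            print(f"  k={row['k']}  n={row['n']}  at {row['timestamp']}")
        if once:                # single pass instead of looping forever
            return len(rows)
        time.sleep(refresh)
```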
### 5. Docs & figures
Suggested (you can adapt to your structure):
- `docs/ProthPrime-ML-report.pdf` or `.docx` — research report.
- `figures/figure1_pipeline_flowchart.png`
`figures/figure2_data_distribution.png`
`…` up to `figure10_k_score_vs_yield.png`.
These cover the pipeline diagram, data distribution, precision vs baseline, enrichment factor, feature importance, calibration plot, ROC curve, score histogram, discovery rate over time and k-score vs prime yield.
---
## ⚙️ Installation
### Python environment
```bash
git clone https://github.com/VesterlundCoder/ProthPrime-ML-XGBoost.git
cd ProthPrime-ML-XGBoost

python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate

pip install -r requirements.txt
```

If you don’t use a `requirements.txt` yet, typical dependencies are:

```bash
pip install numpy pandas xgboost gmpy2 sympy scikit-learn matplotlib tqdm
```

### C environment

To build the fast Proth tester you need:

- GCC or Clang with OpenMP support
- GMP development headers (`libgmp-dev` or equivalent)

```bash
make            # builds proth_runner / proth_runner_c
./proth_runner  # see README / --help in that folder
```

---

## 🚀 Usage

### 1. Build the dataset

```bash
# 1. scrape known Proth primes
python get_prothsearch.py

# 2. generate balanced prime / composite dataset
python generate_proth_ml_dataset.py

# 3. add feature columns
python make_features.py
```

You should now have:

- `prothsearch_primtal.csv`
- `proth_ml_dataset.csv`
- `proth_ml_features.csv`

### 2. Train the model

```bash
python train_proth_xgb.py
```

This script trains the model, prints metrics on the test set and (optionally) saves the model to disk (e.g. `models/proth_xgb_model.json` or `.joblib`).

### 3. Predict candidates

```bash
python predict_proth_primes.py
```

By default this will:

- sample 100,000 random \((k, n)\) pairs in the target range
- generate \(N = k \cdot 2^n + 1\)
- compute features and model probabilities
- write ranked predictions to:
  - `proth_ml_predictions.csv`
  - `predicted_primes_parallel.csv` (gold)
  - `predicted_primes_over_0.9.csv` (silver)
  - `top_predicted_primes.csv` (bronze)

### 4. Verify candidates

Use either the Python or C pipelines, e.g.:

```bash
# Python/gmpy2 verification
python prothcalc.py --input predicted_primes_parallel.csv

# or run the high-performance C code on your favourite k-list
./proth_runner top_1000_k_by_modular_score.csv > poth_k_log.csv
```

Confirmed primes are written to `proth_primes.csv` and/or the C logs.
---

## 📊 Data Files

### `proth_ml_dataset.csv`

| column | description |
|---|---|
| `k` | odd multiplier in Proth form |
| `n` | exponent in \(k \cdot 2^n + 1\) |
| `label` | 1 = prime, 0 = composite |

### `proth_ml_features.csv`

Extends the above with:

- `log_k`, `log_n`, `log_proth`
- `k_mod_4`, `n_mod_4`
- `bit_length_k`, `bit_length_n`
- (plus any extra modular / arithmetic features you add)

### `proth_ml_predictions.csv`

| column | description |
|---|---|
| `k` | candidate k |
| `n` | candidate n |
| `proba` | model probability `P(prime \| k, n)` |

### `proth_primes.csv`

| column | description |
|---|---|
| `timestamp` | discovery time |
| `k` | Proth multiplier |
| `n` | exponent |
| `N` / `prime` | the Proth prime found |
---

## 📈 Figures

The repo can also include small plotting scripts (e.g. `plots/plot_precision_baseline.py`) that read the CSV files and generate:
- Figure 1 – Pipeline Flowchart (diagram, not code-generated)
- Figure 2 – Data Distribution in (k, n)
- Figure 3 – Precision vs Random Baseline (bar chart)
- Figure 4 – Enrichment Factor vs Threshold (line chart)
- Figure 5 – Feature Importance (XGBoost)
- Figure 6 – Calibration Plot (Reliability Diagram)
- Figure 7 – ROC Curve on test set
- Figure 8 – Score Histogram: primes vs composites
- Figure 9 – Cumulative prime discoveries over time
- Figure 10 – k-score vs empirical prime yield.
These figures are described in detail in the accompanying report.

---

## 📄 File Documentation

### Python Scripts (Core ML Pipeline)

#### 1. Data Acquisition & Preparation
get_prothsearch.py
- Purpose: Web scraper for collecting known Proth primes from prothsearch.com
- Contribution: Downloads verified Proth prime data (k, n pairs) from multiple HTML sources to create the initial training dataset
- Output: prothsearch_primtal.csv
generate_proth_ml_dataset.py
- Purpose: Creates balanced ML training dataset with positive and negative examples
- Contribution: Generates synthetic negative examples (non-prime k,n pairs) at 2:1 ratio to positive examples; implements intelligent sampling to avoid known primes
- Output: proth_ml_dataset.csv
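A sketch of that negative-sampling logic (ranges, seed and helper name are hypothetical; the real script may additionally verify the compositeness of each negative):

```python
import random

def sample_negatives(positives, count, k_max=10_000, n_max=1_000, seed=0):
    """Draw random (k, n) pairs in Proth form (k odd, 2^n > k) that are
    not among the known primes, to serve as negative training examples."""
    rng = random.Random(seed)
    negatives = []
    while len(negatives) < count:
        k = rng.randrange(1, k_max, 2)            # odd k
        n = rng.randrange(k.bit_length(), n_max)  # n ≥ bitlen(k) ⇒ 2^n > k
        if (k, n) not in positives:
            negatives.append((k, n))              # note: not proven composite
    return negatives
```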
make_features.py
- Purpose: Feature engineering for ML model
- Contribution: Creates mathematical features including log transformations (log_k, log_n, log_proth), modular arithmetic (k_mod_4, n_mod_4), and bit-length calculations; enriches raw dataset with predictive features
- Output: proth_ml_features.csv
#### 2. Machine-Learning Model
train_proth_xgb.py
- Purpose: XGBoost classifier training and evaluation
- Contribution: Core ML component that trains the predictive model on n < 100,000; includes 80/20 train-test split with stratification; reports accuracy, precision, recall, F1-score, and ROC-AUC metrics
- Key Features Used: k, n, log_k, log_n, k_mod_4, n_mod_4, bit_length_k, bit_length_n, log_proth
predict_proth_primes.py
- Purpose: Generates predictions for new Proth prime candidates
- Contribution: Applies trained XGBoost model to 100,000 randomly sampled (k,n) pairs in range n ∈ [50001, 99999]; outputs top 1,000 candidates ranked by probability; validates predictions using gmpy2 primality testing
- Output: proth_ml_predictions.csv
#### 3. K-Value Optimization
generate_k_values.py
- Purpose: Mathematical algorithm for selecting optimal k values
- Contribution: Implements modular arithmetic sieving using small primes (3,5,7,11,13,17,19,23,29,31,37,41,43,47,53); calculates "collision scores" based on multiplicative order of 2 modulo p; identifies k values that minimize false positives in primality testing
- Algorithm: For each k, computes target residue (-k⁻¹ mod p) and checks if it falls in the set of power-of-2 residues
- Output: new_100k_values.csv (top 100 k values from search up to k=100,000)
compare_k_lists.py
- Purpose: Validation tool for k-value selection methods
- Contribution: Compares different k-value generation approaches; identifies overlaps between modular scoring method and alternative selection strategies
#### 4. Computational Testing (C & Python)
proth_runner.c
- Purpose: High-performance C implementation of Proth primality testing
- Contribution: Uses GMP library for arbitrary-precision arithmetic and OpenMP for parallel processing; implements trial division filtering followed by Proth's theorem verification (testing if a^((N-1)/2) ≡ -1 (mod N) for witnesses a)
- Performance: Multi-threaded execution for large-scale k,n space exploration
- Output: poth_k_log.csv with hexadecimal representation of large numbers
proth_runner.py
- Purpose: Pure Python Proth tester with interactive CLI
- Contribution: Educational implementation using sympy for primality checking; demonstrates Proth test algorithm with trial division pre-filtering; includes timing analysis per n value
- User Interface: Terminal prompts for k, n_start, n_end parameters
prothcalc.py
- Purpose: Parallel Proth calculator using gmpy2
- Contribution: Production-grade implementation with multiprocessing; combines trial division, Miller-Rabin (gmpy2.is_prime), and Proth test; logs detailed results including timestamps and test outcomes
- Output: proth_log.csv (all tests), proth_primes.csv (confirmed primes only)
monitor_primes.py
- Purpose: Real-time monitoring dashboard for prime discovery
- Contribution: Watches log files and displays live statistics; shows total primes found and most recent 10 discoveries; refreshes every 60 seconds
- Use Case: Long-running computational sessions
#### 5. Build & Configuration
Makefile
- Purpose: Automated compilation for C implementation
- Contribution: Simplifies building proth_runner_c executable with proper GMP and OpenMP linking
- Command: make produces optimized binary
README.md
- Purpose: Project documentation
- Contribution: Describes C implementation dependencies, compilation instructions, input file format, and usage examples
### CSV Data Files

#### Training Data
prothsearch_primtal.csv
- Content: Known Proth primes scraped from prothsearch.com
- Columns: k, n
- Size: Source data for positive training examples
- Role: Ground truth dataset for ML training
proth_ml_dataset.csv
- Content: Balanced dataset with positive (prime) and negative (composite) examples
- Columns: k, n, label (1=prime, 0=composite)
- Ratio: 1 positive : 2 negative examples
- Role: Raw training data before feature engineering
proth_ml_features.csv
- Content: Engineered features for ML model
- Columns: k, n, label, log_k, log_n, k_mod_4, n_mod_4, bit_length_k, bit_length_n, log_proth
- Role: Final training dataset with all features

#### K-Value Selection
top_1000_k_by_modular_score.csv
- Content: Top 1,000 k values ranked by modular collision score
- Selection Criteria: Lowest collision scores using sieving primes
- Role: Input for C implementation (proth_runner_c)
new_100k_values.csv
- Content: Top 100 k values from extended search (k ≤ 100,000)
- Role: Alternative k-value list for comparison and validation

#### Predictions & Results
proth_ml_predictions.csv
- Content: ML-predicted Proth prime candidates
- Columns: k, n, proba (probability from XGBoost model)
- Size: Top 1,000 candidates
- Validation: Each candidate verified with gmpy2.is_prime()

#### Log Files
proth_log.csv
- Content: Complete test results from prothcalc.py
- Columns: timestamp, k, n, N, passes_trial, passes_mr, passes_proth
- Role: Detailed audit trail of all primality tests
proth_primes.csv
- Content: Confirmed Proth primes only
- Columns: timestamp, k, n, Proth_Prime_N
- Role: Verified discoveries
poth_k_log.csv (from C implementation)
- Content: Hexadecimal logs from proth_runner_c
- Columns: timestamp, k, n, N_hex, trial_division_passed, proth_test_passed, is_proth_prime
- Format: N values in hexadecimal for large numbers
prime_logfile.csv
- Content: Real-time prime discovery log
- Monitored by: monitor_primes.py
- Role: Live tracking during long computations
prun_logfile.csv,
prun_new100k_logfile.csv
- Content: Alternative log outputs from various test runs
- Role: Historical records of different k-value experiments
### Other Files
proth_runner (compiled binary)
- Type: Executable (compiled from proth_runner.c)
- Role: Production binary for high-performance testing
proth_runner_c (compiled binary)
- Type: Executable (alternative build output)
- Role: Same as above, different naming convention

#### `.numbers` Files (Numbers Spreadsheets)

- `poth_k_log.numbers`, `poth_k_log_blank.numbers`
- `proth_log.numbers`
- `top_1000_k_by_modular_score.numbers`
- `EIJu6Mun3Xc_analysis.numbers`
- Role: macOS Numbers templates for data visualization and analysis
---

## ⚠️ Disclaimer

- This project is research/experimental code.
- It is not intended as a drop-in cryptographic library.
- Always verify primes independently when correctness matters.
---

## 📚 Citation

If you use this code or the ideas in your own work, please consider citing:

```bibtex
@misc{vesterlund_prothprime_ml_xgboost,
  author       = {David Vesterlund},
  title        = {ProthPrime-ML-XGBoost: Machine Learning Assisted Search for Proth Primes},
  year         = {2025},
  howpublished = {GitHub repository},
  url          = {https://github.com/VesterlundCoder/ProthPrime-ML-XGBoost}
}
```

## 📜 License

MIT License

Copyright (c) 2025 David Vesterlund ...
---