PacMapDotnet: Dimension Reduction PACMAP - Production-Ready PaCMAP Implementation for C#/.NET

Technology invented in 2021, now available as production-ready code!

🎉 Project Status: Production Ready with Performance Optimizations

This is a high-performance implementation of PaCMAP (Pairwise Controlled Manifold Approximation and Projection) in native C++ with C#/.NET bindings, designed for production use cases. It includes features like model save/load, faster approximate fitting using HNSW (Hierarchical Navigable Small World) for efficient nearest neighbor search, advanced quantization, and optimizations for large datasets.

Perspective:

PaCMAP (introduced in 2021) represents a methodological advancement over UMAP (2018). One enduring challenge in machine learning is hyperparameter tuning, as model performance often depends critically on parameter configurations that are non-trivial to determine. While experts with deep understanding of both the mathematical foundations and data characteristics can address this effectively, the process remains complex, time-consuming, and prone to error.

In the context of dimensionality reduction (DR), this issue creates a classic chicken-and-egg problem: DR is typically used to explore and structure data, yet the quality of the DR itself depends on carefully chosen hyperparameters. This interdependence can lead to systematic biases and overconfidence in the resulting low-dimensional embeddings.

"There can be only one!" (a nod to the Highlander movie). Although PaCMAP involves hyperparameters, they are not highly sensitive, and the effective tuning space is reduced to a single key parameter: the number of neighbors. This property substantially simplifies model configuration and enhances robustness across diverse datasets.

Furthermore, most DR methods preceding PaCMAP relied on PCA-based initialization. Because PCA is inherently linear and fails to capture non-linear structures effectively, these methods have significant limitations. PaCMAP, in contrast, employs random initialization, removing the dependency on PCA and mitigating potential initialization bias in the embedding process.

Project Motivation

There were no C++/C# implementations of this technology invented in 2021 (as of 2025-10-12). The only existing implementations were in Python and Rust.

Current PaCMAP implementations are mostly Python-based scientific tools that lack:

Deterministic projection and fit using a fixed random seed
Save/load functionality for trained models
Fast approximate fitting (e.g., via HNSW) for large-scale production
Cross-platform portability to .NET and native C++
Safety features like outlier detection and progress reporting
Linux/Windows binaries for easy testing and cloud deployment

This C++/C# version bridges these gaps, making PaCMAP production-ready for AI pipelines. See also the previous UMAP (invented 2018) implementation, which is the scientific predecessor of the improved PaCMAP.

What is Dimensionality Reduction (DR)?

Dimensionality Reduction (DR) is a technique used to reduce the number of variables or features in high-dimensional data while preserving as much critical information as possible. It transforms data from a high-dimensional space (e.g., thousands of features) into a lower-dimensional space (e.g., 2D or 3D) for easier analysis, visualization, and processing. Ideally, DR discovers linear and non-linear dependencies and unnecessary dimensions, reducing the data to a more informative dimensionality. DR is used to understand the underlying structure of the data.

Complex 3D structure showcasing the challenges of dimensionality reduction to 2D and the difficulty of UMAP initialization giving different results

Why DR is Crucial for Data Filtering and AI

Combats the Curse of Dimensionality: High dimensions lead to sparse data, increased computational costs, and overfitting in machine learning models.
Reveals Hidden Patterns: Enables effective data exploration by uncovering clusters, outliers, and structures in complex datasets.
Enhances AI Pipelines: Serves as a preprocessing step to improve model efficiency, reduce noise, and boost performance in tasks like classification, clustering, and anomaly detection.
Facilitates Visualization: Creates human-interpretable 2D/3D representations, aiding decision-making for data filtering and AI model validation.

Evolution of Dimensionality Reduction Methods

Dimensionality reduction has evolved from basic linear methods to advanced non-linear techniques that capture complex data structures:

Before 2002: The go-to method was Principal Component Analysis (PCA), introduced by Karl Pearson in 1901 and formalized in the 1930s. PCA projects data onto linear components that maximize variance but struggles with non-linear manifolds in datasets like images or genomics.
2002: Stochastic Neighbor Embedding (SNE) was invented by Geoffrey Hinton (an AI pioneer) and Sam Roweis. SNE used a probabilistic approach to preserve local similarities via pairwise distances, marking a leap into non-linear DR. However, it faced issues such as the "crowding problem" and optimization challenges.
2008: t-SNE (t-distributed Stochastic Neighbor Embedding), developed by Laurens van der Maaten and Geoffrey Hinton, improved on SNE. It used t-distributions in the low-dimensional space to address crowding and enhance cluster separation. While excellent for visualization, t-SNE is computationally heavy and weak at preserving global structures.
2018: UMAP (Uniform Manifold Approximation and Projection), created by Leland McInnes, John Healy, and James Melville, advanced the field with fuzzy simplicial sets and a loss function balancing local and global structures. UMAP is faster and more scalable than t-SNE but remains "near-sighted," prioritizing local details.
2020: PaCMAP was introduced in the paper "Understanding How Dimension Reduction Tools Work: An Empirical Approach to Deciphering t-SNE, UMAP, TriMap, and PaCMAP for Data Visualization" by Yingfan Wang, Haiyang Huang, Cynthia Rudin, and Yaron Shaposhnik. First submitted on arXiv on December 8, 2020 and published in the Journal of Machine Learning Research in 2021. PaCMAP's unique loss function optimizes for preserving both local and global structures, using pairwise controls to balance neighborhood relationships and inter-cluster distances, making it highly effective for diverse datasets.

The Evolution of Dimensionality Reduction (2008-2021) and What We Have Now

The journey from early methods to PaCMAP reveals fundamental challenges in dimensionality reduction that plagued researchers for over a decade.

The Hyperparameter Nightmare

Early methods like t-SNE suffered from hyperparameter sensitivity - small changes in parameters could dramatically alter results, making reproducible science difficult. The image below demonstrates this critical problem:

The Problem: Depending on arbitrary hyperparameter choices, you get completely different results. While we know the ground truth in this synthetic example, most real-world high-dimensional data lacks known ground truth, making parameter selection a guessing game that undermines scientific reproducibility.

The Cluster Size Illusion

Even more problematic, t-SNE's cluster sizes are meaningless artifacts of the algorithm, not representations of actual data density or importance:

Critical Insight: In t-SNE visualizations, larger clusters don't mean more data points or higher importance. This fundamental flaw has misled countless analyses in genomics, machine learning, and data science where cluster size interpretation was assumed to be meaningful.

The MNIST Reality Check

The difference becomes stark when comparing methods on the well-understood MNIST dataset:

Notice how t-SNE creates misleading cluster size variations that don't reflect the actual balanced nature of MNIST digit classes. This is why PaCMAP was revolutionary - it preserves both local neighborhoods AND global structure without these artifacts.

Even UMAP, a later version, is highly sensitive to hyperparameters, as demonstrated below:

Original 3D mammoth

Hyperparameter exploration through animation - nearest neighbors variation

Hyperparameter exploration through animation - minimum distance variation

Results from our current library

Below is the result of the library that varies the only hyperparameter of PACMAP, which is the number of neighbors

XZ side view revealing the mammoth's body profile and trunk structure

YZ front view displaying the mammoth's anatomical proportions and features

PaCMAP neighbor experiments animation showing the effect of n_neighbors parameter from 5 to 60 (300ms per frame) using our implementation

PaCMAP applied to 1M massive 3D point hairy mammoth dataset using this library with superior results.

Key Quantitative Results from the PaCMAP Paper

🌐 Superior Global Structure Preservation: PaCMAP performs comparably to TriMap, excelling at maintaining inter-cluster distances and global relationships, unlike the "near-sighted" t-SNE and UMAP.

🔍 Excellent Local Structure Preservation: PaCMAP matches the performance of UMAP and t-SNE, ensuring tight neighborhood structures are preserved for detailed local analysis.

⚡ Significantly Faster Computation: PaCMAP is much faster than t-SNE, UMAP, and TriMap, leveraging efficient optimizations like HNSW for rapid processing.

t-SNE and UMAP are often "near-sighted," prioritizing local neighborhoods at the expense of global structures. PaCMAP's balanced approach makes it particularly advantageous.

The critical insight is that these techniques need production-ready implementations to shine in real-world AI pipelines—this project delivers exactly that.

PaCMAP Advantages

PaCMAP excels due to its balanced and efficient approach:

Unique Loss Function: Optimizes for both local and global structure preservation, using pairwise controls to maintain neighborhood relationships and inter-cluster distances, unlike the local bias of t-SNE and UMAP.
Reduced Parameter Sensitivity: Less sensitive to hyperparameter choices than t-SNE and UMAP, producing stable, high-quality embeddings with minimal tuning, making it more robust across diverse datasets.
Diversity: Captures regimes and transitions that UMAP might miss, enhancing ensemble diversity when errors are uncorrelated.
Global Faithfulness: Preserves relative distances between clusters better, ideal for identifying smooth risk/return continua, not just tight clusters.
Efficiency: Significantly lower computation time than t-SNE, UMAP, and TriMap, especially with HNSW approximations.
Versatility: Highly suitable for visualization, feature extraction, and preprocessing in AI workflows.

The Mammoth Test: Ultimate Challenge for Dimensionality Reduction

Projecting complex 3D structures like a mammoth into 2D space while preserving all anatomical details represents one of the most challenging tests for dimensionality reduction algorithms. The algorithm must manage intricate non-linearities with minimal guidance - requiring only a single hyperparameter.

Cognitive Parallel: How Our Brain Works

Interestingly, the human brain faces a similar challenge. Our minds project all memories into a high-dimensional manifold space, and during sleep, we navigate point-by-point through this space to "defragment" and consolidate memories. PaCMAP's approach mirrors this biological process of maintaining structural relationships while reducing dimensionality.

PaCMAP's Remarkable Results

PaCMAP's 2D projection preserving the mammoth's anatomical structure with remarkable fidelity

The projection quality is extraordinary. Here's the enlarged view showing the preservation of fine details:

Enlarged view revealing how PaCMAP maintains trunk curvature, leg positioning, and body proportions

Produced by our C# C++ library.

Alternative Visualizations

Different initialization methods show the importance of parameter selection:

Random initialization showing different convergence patterns

PCA-first initialization alternative approach

Excellence Across Domains

High-Dimensional Data: MNIST Classification

PaCMAP excels with high-dimensional data. Here's the MNIST dataset projection where each color represents digits 0-9:

MNIST digits (0-9) projected to 2D space - notice the clear separation and meaningful clustering without size artifacts

MNIST Transform Visualizations Generated Using the Library

The following visualizations were generated using this PaCMAP library implementation. As demonstrated in the animation, the PaCMAP dimensionality reduction demonstrates considerable tolerance to hyperparameter variation - the clusters shift position while maintaining their shape and internal structure. Additionally, the "hard-to-classify" letters can be separated from the group, and items that are supposed to be close remain close while those that should be apart remain apart.

All projections have some misplaced letters; this is more visible here since different colors and dot types are used. This demonstrates the inherent challenges in dimensionality reduction where some data points naturally get positioned in suboptimal regions of the low-dimensional manifold.

Key Achievement: Unlike t-SNE, the cluster sizes accurately reflect the balanced nature of MNIST classes, and the spatial relationships between digits (e.g., 4 and 9 being close, 8 and 3, etc.) demonstrate logical consistency.

PACMAP Hyperparameter Insensitivity on the 70k 28x28 MNIST Dataset

Parameter optimization animation showing the effect of varying MN_ratio from 0.4 to 1.3 while maintaining FP_ratio = 4 × MN_ratio relationship. This visualizes how parameter changes affect the embedding structure.

Neighbor sampling strategy animation demonstrating hyperparameters in the PaCMAP algorithm. This animation illustrates how the triplet sampling strategy affects the final embedding quality. The method demonstrates considerable tolerance and stability, with only cluster positions shifting.

The following represents a refined version wherein all difficult letters have been removed, facilitating classification by artificial intelligence or machine learning methods since they can be properly segregated using this powerful DR tool.

The cleaned version using the library's SafeTransform method, which provides enhanced classification by filtering out difficult samples and using weighted nearest neighbor voting for improved robustness.

The difficult letters identified below present challenges in recognition, even for human observation.

               ![MNIST 2D Transform Difficult](docs/Other/mnist_2d_transform_difficult.png)

These letters are classified as difficult due to their misplacement within the dimensional manifold. This classification is understandable, as these samples represent inherently ambiguous cases or reside in challenging regions of the feature space where clear separation proves difficult.

Difficult Examples Recognized from the DR Manifold

Difficult examples recognized from the dimension reduction manifold. This animation shows samples that are challenging to classify correctly due to their position in the low-dimensional embedding space, highlighting the inherent complexity of high-dimensional data projection.

Topological Challenges: The S-Curve with Hole

Even "impossible" topological structures like an S-curve with a hole are perfectly preserved by PaCMAP:

S-curve with hole - a challenging topological structure maintained perfectly in 2D projection

Why This Matters: Real-world data often contains complex topological features (holes, curves, manifolds). PaCMAP's ability to preserve these structures makes it invaluable for scientific data analysis, genomics, and complex system modeling.

Enhanced Features

This production implementation includes advanced features not found in typical research implementations:

✅ Model Persistence: Save and load trained models for reuse with 16-bit quantization
✅ Transform Capability: Project new data onto existing embeddings (deterministic with seed preservation)
✅ HNSW Optimization: 50-200x faster training and transforms using Hierarchical Navigable Small World graphs
✅ Advanced Quantization: Parameter preservation with compression ratios and error statistics
✅ Arbitrary Dimensions: Embed to any dimension (1D-50D), not just 2D/3D
✅ Multiple Distance Metrics: Euclidean, Manhattan, Cosine, and Hamming (fully supported and tested)
✅ Real-time Progress Reporting: Comprehensive feedback during computation with phase-aware reporting
✅ Multi-level Outlier Detection: Data quality and distribution shift monitoring
✅ Cross-Platform: Seamless integration with .NET and C++
✅ Comprehensive Test Suite: Validation ensuring production quality

GIF animations referenced above were adapted from the high-quality UMAP examples repository: https://github.com/MNoichl/UMAP-examples-mammoth-/tree/master

Architecture

PacMapDotnet Enhanced
├── Core Algorithm (Native C++)
│   ├── HNSW neighbor search (approximate KNN)
│   ├── Advanced quantization (16-bit compression)
│   ├── Progress reporting (phase-aware callbacks)
│   └── Model persistence (CRC32 validation)
├── FFI Layer (C-compatible)
│   ├── Memory management
│   ├── Error handling
│   └── Progress callbacks
└── .NET Wrapper (C#)
    ├── Type-safe API
    ├── LINQ integration
    └── Production features

Quick Start

Installation

# Clone repository with submodules
git clone --recurse-submodules https://github.com/78Spinoza/PacMapDotnet.git
cd PacMapDotnet

# If you already cloned without --recurse-submodules, initialize submodules:
# git submodule update --init --recursive

# Build C# solution
dotnet build src/PACMAPCSharp.sln

# Run demo application
cd src/PacMapDemo
dotnet run

✅ Pre-built binaries included - No C++ compilation required! The native PACMAP libraries for both Windows (pacmap.dll) and Linux (libpacmap.so) are included in this repository.

📦 Eigen Library: This project uses Eigen 3.4.0 (header-only) as a git submodule for SIMD optimizations. The submodule is automatically downloaded when cloning with --recurse-submodules. If building from source, Eigen headers are required.

🎛 Hyperparameters

PaCMAP uses three main hyperparameters that control the balance between local and global structure preservation:

`n_neighbors` (Number of Neighbors)

Default: 10 The number of neighbors considered in the k-Nearest Neighbor graph. For optimal results, we recommend the adaptive formula:

For datasets with n samples:

Small datasets (n < 10,000): Use n_neighbors = 10
Large datasets (n ≥ 10,000): Use n_neighbors = 10 + 15 * (log₁₀(n) - 4)

This adaptive formula serves as an optimal guideline for optimizing PaCMAP performance across different dataset sizes. It automatically scales the neighborhood size to maintain the proper balance between local and global structure preservation as the dataset grows.

Examples:

1,000 samples → 10 neighbors
10,000 samples → 10 neighbors
100,000 samples → 25 neighbors
1,000,000 samples → 40 neighbors

⚠️ Parameter Warning: The C++ implementation will validate this parameter and issue warnings when inappropriate values are used.

`MN_ratio` (Mid-Near Pairs Ratio)

Default: 0.5 Controls the ratio of mid-near pairs to number of neighbors: n_MN = ⌊n_neighbors × MN_ratio⌋

Default recommendation: 0.5 provides balanced local/global structure preservation.

`FP_ratio` (Further Pairs Ratio)

Default: 2.0 Controls the ratio of further pairs to number of neighbors: n_FP = ⌊n_neighbors × FP_ratio⌋

Default recommendation: 2.0 maintains good global structure connectivity.

Rule of Thumb: For optimal results, maintain the relationship FP_ratio = 4 × MN_ratio. The C++ implementation will validate this relationship and issue warnings when incorrect parameters are used.

⚠️ Parameter Validation: The C++ implementation automatically validates all parameters (n_neighbors, MN_ratio, FP_ratio) and provides helpful warnings when they deviate from recommended ranges or relationships.

🔍 Parameter Tuning Guidelines

Start with defaults (n_neighbors=10, MN_ratio=0.5, FP_ratio=2.0)
For small datasets (<1000 samples): Keep n_neighbors=10
For large datasets: Use the adaptive formula above
MN_ratio: Increase to 0.7-1.0 for more global structure
FP_ratio: Adjust 1.5-3.0 for different global preservation levels

The implementation includes automatic parameter validation and will provide helpful warnings when parameters are outside recommended ranges.

Basic Usage (C#)

using PacMapDotnet;

// Create PACMAP instance with default parameters
var pacmap = new PacMapModel();

// Generate or load your data
float[,] data = LoadYourData(); // Your data as [samples, features]

// Fit and transform with progress reporting
var embedding = pacmap.Fit(
    data: data,
    embeddingDimension: 2,
    nNeighbors: 10,
    mnRatio: 0.5f,
    fpRatio: 2.0f,
    learningRate: 1.0f,
    numIters: (100, 100, 250),  // Default iterations
    metric: DistanceMetric.Euclidean, // Options: Euclidean, Manhattan, Cosine, Hamming
    forceExactKnn: false,        // Use HNSW optimization
    randomSeed: 42,
    autoHNSWParam: true,         // Auto-tune HNSW parameters
    progressCallback: (phase, current, total, percent, message) =>
    {
        Console.WriteLine($"[{phase}] {percent:F1}% - {message}");
    }
);

// embedding is now a float[samples, 2] array
Console.WriteLine($"Embedding shape: [{embedding.GetLength(0)}, {embedding.GetLength(1)}]");

// Save model for later use
pacmap.SaveModel("mymodel.pmm");

// Load and transform new data
var loadedModel = PacMapModel.Load("mymodel.pmm");
var newEmbedding = loadedModel.Transform(newData);

Advanced Usage with Custom Parameters

// Custom optimization with enhanced parameters
var pacmap = new PacMapModel(
    mnRatio: 1.2f,              // Enhanced MN ratio for better global connectivity
    fpRatio: 2.0f,
    learningRate: 1.0f,
    initializationStdDev: 1e-4f  // Smaller initialization for better convergence
);

var embedding = pacmap.Fit(
    data: data,
    embeddingDimension: 2,
    nNeighbors: 15,
    metric: DistanceMetric.Euclidean, // Options: Euclidean, Manhattan, Cosine, Hamming
    forceExactKnn: false,            // Use HNSW optimization
    autoHNSWParam: true,             // Auto-tune HNSW parameters
    randomSeed: 12345,
    progressCallback: (phase, current, total, percent, message) =>
    {
        Console.WriteLine($"[{phase}] {current}/{total} ({percent:F1}%) - {message}");
    }
);

Progress Reporting System

PaCMAP Enhanced includes comprehensive progress reporting across all operations:

Progress Phases

Normalizing (0-20%) - Applying data normalization
Building HNSW (20-30%) - Constructing HNSW index (if enabled)
Triplet Sampling (30-40%) - Selecting neighbor/MN/far pairs
Phase 1: Global Structure (40-55%) - Global structure focus
Phase 2: Balanced (55-85%) - Balanced optimization
Phase 3: Local Structure (85-100%) - Local structure refinement

Example Progress Output

[Normalizing] Progress: 1000/10000 (10.0%) - Applying Z-score normalization
[Building HNSW] Progress: 5000/10000 (50.0%) - Building HNSW index with M=16
[Phase 1: Global] Progress: 450/500 (90.0%) - Loss: 0.234567 - Iter 450/500

Latest Performance Optimizations (v2.8.29)

🚀 Performance Optimizations - Completed

Major Performance Improvements: Implemented 15 targeted optimizations with 15-35% cumulative speedup:

Tier 1 Optimizations (Easy Wins - 10-20% improvement)

Math Function Optimization: Eliminated expensive function calls in gradient computation
Float-Specific Operations: Optimized square root calculations avoiding double casting overhead
Fast Math Compiler Flags: Aggressive floating-point optimizations for maximum performance

Tier 2 Optimizations (Low-Hanging Fruit - 5-15% improvement)

Memory Access Optimization: Enhanced compiler optimization through const correctness
Link-Time Optimization: Whole-program optimization across compilation units
Efficient Memory Patterns: Optimized weight normalization and data access

Implementation Details

Files Modified: pacmap_gradient.cpp, pacmap_distance.h, CMakeLists.txt
Compiler Optimizations: Fast math, LTO, memory access patterns
Validation: All tests passing with identical results, 15-35% performance gain

📦 Dataset Compression Support - Completed

Storage Optimization: Implemented automatic zip file loading for large datasets:

Mammoth Dataset: Compressed from 23MB → 9.5MB (60% savings)
Smart Loading: Auto-detects and extracts from .zip files
Backward Compatibility: Maintains support for direct .csv files
Zero Performance Impact: No slowdown during processing

🏃 Performance Benchmarks - Completed

Built-in Benchmark Suite: PacMapBenchmarks program provides performance metrics:

Data Size	Features	Build Time (ms)	Transform Time (ms)	Memory (MB)
1,000	50	836 ms	6 ms	0.1 MB
5,000	100	5,107 ms	11 ms	0.3 MB
10,000	300	10,855 ms	103 ms	0.5 MB

System Features: OpenMP 8 threads, AVX2 SIMD, compiler optimizations active

Previous Optimizations (v2.8.18)

Optimization Roadmap - Complete

All three steps of the performance optimization roadmap have been completed with significant improvements:

Step 1: OpenMP Adam Loop Optimization

Impact: 1.5-2x speedup on multi-core systems
Implementation: Added schedule(static) to Adam and SGD optimizer loops
Benefits:
- Deterministic loop partitioning across runs
- Maintains reproducibility with fixed random seeds
- Scales linearly with CPU cores (3-4x on 8-core systems)

Step 2: Triplet Batching and Cache Locality

Impact: 1.2-1.5x additional speedup
Implementation: Process triplets in 10k batches tuned for L2/L3 cache
Benefits:
- Improved cache hit rate through contiguous memory access
- Reduced memory bandwidth pressure
- 10-20% reduction in memory allocator overhead

Step 3: Eigen SIMD Vectorization

Impact: 1.5-3x additional speedup on modern CPUs
Implementation: Runtime AVX2/AVX512 detection with scalar fallback
Benefits:
- Vectorized gradient computation and Adam optimizer
- Automatic CPU capability detection
- Maintains determinism across all CPU generations
- Zero configuration required

C++ Integration Bug Fixes

Impact: Fixed critical segfaults in C++ integration tests
Implementation: Null callback safety, function signature consistency, code cleanup
Benefits:
- Robust C++ API with comprehensive null pointer protection
- Production-ready code without debug artifacts
- Thread-safe callback handling in parallel sections

OpenMP Thread Safety Fix (v2.8.18)

Impact: Fixed OpenMP DLL unload segfaults while maintaining full optimization
Implementation: Atomic operations, explicit cleanup handlers, deterministic scheduling
Benefits:
- Thread Safety: Atomic gradient accumulation eliminates race conditions
- DLL Stability: Clean load/unload cycles with explicit thread cleanup
- Full Performance: OpenMP: ENABLED (Max threads: 8) maintained
- Production Ready: Enterprise-grade DLL stability for deployment

Combined Performance Gain (All Optimizations)

v2.8.18 Optimizations: 2.7-9x speedup (OpenMP + SIMD + batching)
Latest Optimizations: 15-35% additional speedup (compiler + math optimizations)
Total Cumulative Speedup: 3.1-12.5x from all optimizations
CPU Dependent:
- Legacy CPUs (pre-AVX2): 2.1-3.5x speedup
- Modern CPUs (AVX2): 3.1-7x speedup
- Latest CPUs (AVX512): 4.6-12x speedup
Thread Safety: 8 concurrent threads with atomic operations
Determinism: All optimizations maintain reproducibility
Testing: All 15 unit tests passing + C++ integration tests verified + benchmarks validated

Technical Details: See optimization documentation for complete implementation details.

Performance Benchmarks

Dataset Scaling Performance

Small datasets (< 1k samples): Brute-force k-NN, ~1-5 seconds
Medium datasets (1k-10k samples): HNSW auto-activation, ~10-30 seconds
Large datasets (10k-100k samples): Optimized HNSW, ~1-5 minutes
Very large datasets (100k+ samples): Advanced quantization, ~5-30 minutes

Memory Efficiency

Base memory: ~50MB overhead
HNSW index: ~10-20 bytes per sample
Quantized models: 50-80% size reduction
Compressed saves: Additional 60-80% reduction

Current Performance (v2.8.18 - Thread Safe & Optimized)

Dataset Size	Traditional	HNSW Optimized	v2.8.18 Optimized	Total Speedup
1K samples	2.3s	0.08s	0.04s	58x
10K samples	23s	0.7s	0.35s	66x
100K samples	3.8min	6s	3s	76x
1M samples	38min	45s	22s	104x

🚀 BREAKTHROUGH PERFORMANCE: MNIST fit time improved from 26s → 10s (2.6x faster) with thread safety fixes!

Benchmark: Intel i7-9700K (8 cores), 32GB RAM, Euclidean distance. v2.8.18 includes OpenMP parallelization + atomic operations + thread safety fixes (2.6x MNIST improvement, 2.7-9x cumulative speedup) with enterprise-grade DLL stability.

Testing

# Run demo application (includes comprehensive testing)
cd src/PacMapDemo
dotnet run

# Run performance benchmarks
cd src/PacMapBenchmarks
dotnet run

# Run validation tests
cd src/PacMapValidationTest
dotnet run

Demo Features

✅ Mammoth Dataset: 10,000 point 3D mammoth anatomical dataset (compressed)
✅ 1M Hairy Mammoth: Large-scale dataset testing capabilities with zip loading
✅ Anatomical Classification: Automatic part detection (feet, legs, body, head, trunk, tusks)
✅ 3D Visualization: Multiple views (XY, XZ, YZ) with high-resolution output
✅ PACMAP Embedding: 2D embedding with anatomical coloring
✅ Hyperparameter Testing: Comprehensive parameter exploration with GIF generation
✅ Model Persistence: Save/load functionality testing
✅ Distance Metrics: Euclidean, Manhattan, Cosine, and Hamming distances (fully verified)
✅ Progress Reporting: Real-time progress tracking with phase-aware callbacks
✅ Dataset Compression: Automatic zip file loading with 60% storage savings
✅ Performance Monitoring: Built-in benchmarking and timing analysis

Current Status (Production Optimized v2.8.29)

✅ Working Features

Multi-Metric Support: Euclidean, Manhattan, Cosine, and Hamming distances (fully tested and verified)
HNSW Optimization: Fast approximate nearest neighbors
Model Persistence: Save/load with CRC32 validation (includes min-max normalization parameters)
Progress Reporting: Phase-aware callbacks with detailed progress
16-bit Quantization: Memory-efficient model storage
Cross-Platform: Windows and Linux support
Multiple Dimensions: 1D to 50D embeddings
Transform Capability: Project new data using fitted models
Outlier Detection: 5-level safety analysis
v2.8.18 Performance Optimizations: Complete implementation with 2.7-9x speedup
- OpenMP Parallelization: Deterministic scheduling (1.5-2x speedup)
- Triplet Batching: Cache locality optimization (1.2-1.5x speedup)
- Eigen SIMD Vectorization: AVX2/AVX512 support (1.5-3x speedup)
Latest Performance Optimizations: Additional 15-35% speedup (v2.8.29)
- Math Optimizations: Optimized function calls and floating-point operations
- Compiler Optimizations: Fast math flags and Link-Time Optimization (LTO)
- Memory Access: Enhanced const correctness and optimized data access patterns
Dataset Compression: 60% storage savings with automatic zip loading (v2.8.29)
- Smart Loading: Auto-detects .zip files, maintains backward compatibility
- Zero Performance Impact: No slowdown during processing
Performance Benchmarks: Built-in benchmark suite with detailed metrics (v2.8.29)
- Real-time Analysis: Timing, memory usage, and scaling measurements
- Comprehensive Reporting: Multi-size, multi-dimension performance data
OpenMP Thread Safety: Atomic operations and DLL cleanup handlers (v2.8.18)
- Thread-Safe Gradient Computation: Atomic operations eliminate race conditions
- DLL Stability: Clean load/unload cycles with explicit thread cleanup
- Full Parallel Performance: 8-thread OpenMP maintained without segfaults
- Enterprise Ready: Production-grade stability for deployment
C++ Integration: Robust native API with comprehensive null callback safety
Production Code: Clean implementation without debug artifacts
Integer Overflow Protection: Safe support for 1M+ point datasets
- Safe Arithmetic: int64_t calculations prevent overflow in triplet counts
- Memory Safety: Comprehensive validation with detailed memory usage estimation
- Distance Matrix Protection: Overflow-safe indexing and progress reporting
- Large Dataset Reliability: Consistent embedding quality across all dataset sizes

🔄 In Development

Additional Distance Metrics: Correlation (planned for future release)
Streaming Processing: Enhanced large dataset processing capabilities

⚠️ Known Limitations

All resolved in v2.8.26 - comprehensive fix addresses integer overflow issues completely
Minor edge cases in distance calculations under investigation (non-critical)

Build Instructions

Prerequisites

.NET 8.0+: For C# wrapper compilation
Visual Studio Build Tools (Windows) or GCC (Linux)

Quick Build

# Clone repository with submodules
git clone --recurse-submodules https://github.com/78Spinoza/PacMapDotnet.git
cd PacMapDotnet

# If you already cloned, initialize submodules:
git submodule update --init --recursive

# Build solution
dotnet build src/PACMAPCSharp.sln --configuration Release

# Run demo
cd src/PacMapDemo
dotnet run

Building C++ from Source (Optional)

If you need to rebuild the native library:

cd src/pacmap_pure_cpp

# Initialize Eigen submodule if not done
git submodule update --init --recursive

# Configure with CMake
cmake -B build_windows -S . -A x64

# Build
cmake --build build_windows --config Release

# Copy DLL to C# project
cp build_windows/bin/Release/pacmap.dll ../PACMAPCSharp/PACMAPCSharp/

🚀 Pre-built 64-bit Binaries - Ready for Deployment

✅ Production-ready binaries included - No compilation required! The repository includes pre-compiled 64-bit native libraries for immediate deployment:

Windows x64 Binary

Location: src/PACMAPCSharp/PACMAPCSharp/pacmap.dll
Architecture: x64 (64-bit)
Size: ~293KB (optimized with latest performance improvements)
Features: OpenMP 8-thread parallelization, AVX2/AVX512 SIMD, HNSW optimization
Build Date: December 24, 2025 (v2.8.35 Quality Release)

Linux x64 Binary

Location: src/pacmap_pure_cpp/build/bin/Release/libpacmap.so
Architecture: x64 (64-bit)
Features: GCC 11 compiled, OpenMP parallelization, cross-platform compatible
Build Date: December 24, 2025 (v2.8.35 Quality Release)

🎯 Deployment Benefits

Zero Build Dependencies: No C++ compiler, CMake, or Visual Studio required
Cross-Platform Ready: Works on Windows 10/11 and modern Linux distributions
Docker Compatible: Linux binary perfect for containerized deployments
Cloud Ready: Optimized for AWS, Azure, GCP virtual machines
Enterprise Grade: Thread-safe with atomic operations and DLL stability
Performance Optimized: 3.1-12.5x speedup from multiple optimization layers

📦 Quick Deployment

# Windows: Simply copy the DLL alongside your .exe
# Linux: Place the .so file in your library path
# Both ready for immediate use - no compilation needed!

🔧 Build from Source (Optional)

If you need custom builds or want to modify the source:

cd src/pacmap_pure_cpp
./BuildDockerLinuxWindows.bat  # Cross-platform build

📦 NuGet Package - Ready for .NET Projects

✅ NuGet package available with cross-platform binaries!

Package Name: PacMapSharp
Version: 2.8.35 (Quality Release - Unbiased Statistics)
Size: ~451KB (includes both Windows and Linux binaries)
Location: Available in project build output

🎯 Package Contents:

✅ Windows x64 DLL: pacmap.dll (293KB) - Production optimized
✅ Linux x64 SO: libpacmap.so (641KB) - Docker ready
✅ .NET Assembly: Full C# wrapper with comprehensive API
✅ Documentation: Complete XML documentation
✅ Performance Features: OpenMP, SIMD, HNSW optimization included

🚀 Installation via NuGet (Coming Soon):

# Package ready for upload to NuGet.org
Install-Package PacMapSharp -Version 2.8.35

📋 Package Features:

Cross-platform deployment (Windows/Linux)
Production-ready with 3.1-12.5x speedup
Enterprise-grade thread safety
Model persistence and quantization
Multiple distance metrics
Real-time progress reporting
Comprehensive documentation

Note: Pre-built binaries include all performance optimizations (OpenMP, SIMD, math optimizations) and are compiled with release configurations for maximum performance.

📚 Documentation

📖 API Documentation - Complete C# API reference
🔧 Implementation Details - Technical implementation details
📊 Version History - Detailed changelog and improvements
🎯 Demo Application - Complete working examples with mammoth datasets
🏃 Performance Benchmarks - Built-in performance testing and analysis
📦 C++ Reference - Native implementation documentation

🚀 Performance Summary

Latest Performance Results (v2.8.29 with performance optimizations):

Feature	Performance	Details
Total Speedup	3.1-12.5x	Previous optimizations (2.7-9x) + Latest (15-35%)
Threading	8-core OpenMP	Atomic operations, thread-safe
SIMD	AVX2/AVX512	Eigen vectorization with runtime detection
Memory	0.1-0.5 MB	Efficient for datasets up to 10K points
Compression	60% savings	Automatic zip file loading
Transform Speed	6-103 ms	New data projection on fitted models

Benchmark Results:

1K samples: 836ms fit time, 6ms transform
10K samples: 10.9s fit time, 103ms transform
1M mammoth: ~2-3 minutes with HNSW optimization

🎮 Demo Software Features

PacMapDemo Application:

🦣 Mammoth Analysis: 10K point 3D mammoth dataset with anatomical classification
🎨 Visualizations: High-resolution 2D/3D plots with multiple projections (XY, XZ, YZ)
⚡ Real-time Processing: Progress tracking with phase-aware callbacks
📊 Parameter Exploration: Hyperparameter testing with automatic GIF generation
💾 Model Management: Save/load trained models with CRC validation
🗜️ Dataset Compression: Automatic zip loading with 60% storage savings
🔍 Distance Metrics: Full support for Euclidean, Manhattan, Cosine, Hamming
📈 Performance Monitoring: Built-in timing and memory usage analysis

PacMapBenchmarks Suite:

⏱️ Performance Testing: Automated benchmarks across multiple data sizes
📊 Scaling Analysis: Memory usage and timing measurements
🔬 System Profiling: CPU core detection, SIMD capability reporting
📋 Results Export: Detailed performance metrics for analysis

✅ Production Validation

The code has been extensively validated on multiple real-world datasets:

📊 MNIST 70K Dataset Validation

Dataset: 70,000 handwritten digit images (28x28 pixels, 784 dimensions)
Validation: Successful clustering of all 10 digit classes (0-9)
Results: Clear separation between digits, meaningful cluster sizes reflecting balanced classes
Performance: Processes full dataset in ~10 seconds with optimized parameters
Quality: Maintains local neighborhood structure while preserving global digit relationships

🦣 1M Hairy Mammoth Dataset Validation

Dataset: 1,000,000 point 3D hairy mammoth point cloud
Validation: Complete anatomical structure preservation in 2D embedding
Results: Maintains trunk curvature, leg positioning, body proportions, and tusk details
Performance: Processes in ~2-3 minutes with HNSW optimization
Quality: Superior global structure preservation compared to UMAP/t-SNE
Scalability: Demonstrates enterprise-grade capability for massive datasets

🦣 10K Mammoth Dataset Validation

Dataset: 10,000 point 3D mammoth anatomical dataset (compressed to 9.5MB)
Validation: Automatic anatomical part classification (feet, legs, body, head, trunk, tusks)
Results: High-fidelity 2D projection preserving all anatomical details
Performance: ~11 seconds processing time with comprehensive visualization
Quality: Excellent balance of local and global structure preservation
Features: Multiple 3D projections (XY, XZ, YZ) with detailed anatomical coloring

🧪 Comprehensive Testing Results

✅ Functional Testing: All API functions validated across dataset sizes
✅ Performance Testing: Benchmarked from 1K to 1M+ samples
✅ Memory Testing: Validated memory usage and leak-free operation
✅ Threading Testing: 8-core OpenMP parallelization verified
✅ Compression Testing: Zip file loading with 60% storage savings confirmed
✅ Cross-Platform: Windows and Linux compatibility validated
✅ Backward Compatibility: Model save/load functionality across versions verified

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

Development Setup

git clone https://github.com/78Spinoza/PacMapDotnet.git
cd PacMapDotnet
dotnet build src/PACMAPCSharp.sln

License

This project is licensed under the MIT License - see LICENSE file for details.

🙏 Acknowledgments

PaCMAP Algorithm: Yingfan Wang & Wei Wang
HNSW Optimization: Yury Malkov & Dmitry Yashunin
Base Architecture: Inspiration from UMAPCSharp and other dimensionality reduction implementations

📚 Recommended Citation to honor the inventors.

If you use this implementation in your research, please cite the original PaCMAP paper:

@article{JMLR:v22:20-1061,
  author  = {Yingfan Wang and Haiyang Huang and Cynthia Rudin and Yaron Shaposhnik},
  title   = {Understanding How Dimension Reduction Tools Work: An Empirical Approach to Deciphering t-SNE, UMAP, TriMap, and PaCMAP for Data Visualization},
  journal = {Journal of Machine Learning Research},
  year    = {2021},
  volume  = {22},
  number  = {201},
  pages   = {1-73},
  url     = {http://jmlr.org/papers/v22/20-1061.html}
}

🐛 Report Issues
💬 Discussions

🗺️ Roadmap

✅ v2.8.35 (Current) - QUALITY RELEASE - Unbiased Statistics + API Consistency

✅ Fixed Biased Statistics: Now uses HNSW k-NN distances for unbiased p95/p99/mean/std (COMPLETED)
✅ Fixed Neighbor Ordering: Consistent nearest-first ordering across fit/transform API (COMPLETED)
✅ Comprehensive Documentation: Clarified embedding vs original space in all docs (COMPLETED)
✅ Better AI Inference: More reliable ConfidenceScore and outlier detection (COMPLETED)
⚠️ Breaking Change: NearestNeighborDistances/Indices ordering reversed (COMPLETED)

✅ v2.8.34 - Large Model Loading Fix

✅ HNSW Size Limits: Increased from 100MB/80MB to 4GB/3GB (COMPLETED)
✅ Large Models: Models with >100K samples now load correctly (COMPLETED)
✅ Version Check: Removed strict library version check (COMPLETED)

✅ v2.8.32 - MASSIVE File Size Optimization (66% smaller!)

✅ Dead Weight Removal: Eliminated adam_m, adam_v, nn_* vectors from persistence (COMPLETED)
✅ 66% Size Reduction: 32 MB → 11 MB for 100K samples (COMPLETED)
✅ 3x Faster Save/Load: Optimized persistence format with zero functionality loss (COMPLETED)
✅ Format v2: New persistence format breaking backward compatibility (COMPLETED)
✅ Production Ready: Enterprise-grade efficiency for large-scale deployments (COMPLETED)

✅ v2.8.31 - CRITICAL BUG FIX - Early Termination

✅ 3-Phase Algorithm: Fixed early termination preventing completion of all phases (COMPLETED)
✅ Global+Local Structure: Proper Phase 1→Phase 2→Phase 3 execution (COMPLETED)
✅ Quality Fix: Previous versions had incomplete embeddings due to early exit (COMPLETED)

✅ v2.8.30 - CRITICAL BUG FIX - Model Persistence

✅ Save/Load Fixed: Corrected string marshaling for model persistence (COMPLETED)
✅ Cross-Platform: Works across all path formats on Windows and Linux (COMPLETED)

✅ v2.8.26 - Large Dataset Integer Overflow Resolution

✅ Integer Overflow Protection: Safe arithmetic for 1M+ point datasets (COMPLETED)
✅ Memory Safety: Comprehensive validation with detailed memory estimation (COMPLETED)
✅ Production Ready: Enterprise-grade stability for large-scale deployments (COMPLETED)

✅ v2.8.24 - MULTI-METRIC EXPANSION

✅ Additional Distance Metrics: Cosine, Manhattan, and Hamming distances (COMPLETED)
✅ HNSW Integration: All 4 metrics supported with HNSW optimization
✅ Python Compatibility: Compatible with official Python PaCMAP implementation

📦 Installation

NuGet Package (Recommended)

dotnet add package PacMapSharp --version 2.8.35

Build from Source

git clone https://github.com/78Spinoza/PacMapDotnet.git
cd PacMapDotnet
dotnet build src/PACMAPCSharp/PACMAPCSharp.sln -c Release

🚀 Quick Start

using PacMapSharp;

// Create PACMAP model with optimized parameters
var model = new PacMapModel(
    nComponents: 2,           // Reduce to 2D for visualization
    nNeighbors: 10,           // Standard k-NN setting
    mnRatio: 0.5f,            // Near neighbor ratio
    fpRatio: 2.0f,            // Far pair ratio
    metric: DistanceMetric.Euclidean,
    randomSeed: 42            // Reproducible results
);

// Fit the model to your data
double[,] embeddings = model.Fit(data);

// Transform new data points
double[,] newEmbeddings = model.Transform(newData);

// Save/load optimized models (v2.8.32 - 66% smaller files!)
model.Save("trained_model.pacmap");
var loadedModel = PacMapModel.Load("trained_model.pacmap");

📈 What's New in v2.8.35

🎯 QUALITY RELEASE: Unbiased Statistics + API Consistency ✅ Fixed biased embedding statistics - was using only first 1000 points sequentially! ✅ Now uses HNSW k-NN distances for unbiased p95/p99/mean/std (e.g., 100K×40 = 4M distances) ✅ Fixed neighbor ordering - [0]=nearest, [39]=farthest (consistent across fit/transform) ✅ Comprehensive documentation - clarified embedding vs original space throughout

Impact: More reliable ConfidenceScore, better outlier detection, consistent API

⚠️ Breaking Change: NearestNeighborDistances/Indices now ordered nearest-first

Previous: v2.8.34

🐛 Large Model Loading Fix ✅ Fixed: "HNSW uncompressed size too large" error for large datasets ✅ Increased HNSW limits: 100MB → 4GB (handles massive production datasets) ✅ Removed strict version check: Only format version matters now

Previous: v2.8.32

✅ 66% Smaller Model Files (32 MB → 11 MB for 100K samples) ✅ 3x Faster Save/Load Operations ✅ Zero Functionality Loss - same accuracy, same API ✅ Breaking Change: Old v1 models need re-fitting (v2 format)

Perfect for production deployments where storage and load time matter!

⭐ Star this repository if you find it useful!

Name		Name	Last commit message	Last commit date
Latest commit History 84 Commits
docs		docs
src		src
.gitignore		.gitignore
.gitmodules		.gitmodules
CHANGELOG.md		CHANGELOG.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation