This is a parallel implementation of the HyperLogLog algorithm, developed for the High Performance Computing for Data Science course at the University of Trento.
The project includes multiple implementations of the HyperLogLog algorithm, leveraging different parallelization techniques such as MPI and OpenMP. The code is designed to handle large datasets efficiently by estimating the cardinality of sets. Its modular structure allows easy reuse of code and integration of the various algorithm versions: to compile a particular implementation, simply include the corresponding file from the src/implementation directory.
The HyperLogLog algorithm is a probabilistic algorithm for estimating the cardinality of large sets. It is particularly useful in scenarios where an exact count of unique elements is not required and an approximate count is sufficient. The algorithm uses hashing and bit manipulation to achieve this: each element is hashed, the first bits of the hash select a bucket, and each bucket records the longest run of leading zeros observed in the remaining bits.
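To make the idea concrete, here is a minimal, hypothetical HLL sketch in C. This is not the project's `hll.c`: it uses 16 buckets, MurmurHash3's 64-bit finalizer as a stand-in for the full hash, and the basic estimator without small- or large-range corrections.

```c
#include <stdint.h>

#define HLL_B 4             /* bits of the hash used as bucket index */
#define HLL_M (1 << HLL_B)  /* 16 buckets */

static uint8_t hll_reg[HLL_M];

/* MurmurHash3's 64-bit finalizer, standing in for the full hash. */
static uint64_t mix64(uint64_t x) {
    x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
    x ^= x >> 33;
    return x;
}

void hll_add(uint64_t item) {
    uint64_t h = mix64(item);
    uint32_t idx = (uint32_t)(h >> (64 - HLL_B)); /* top b bits pick the bucket */
    uint64_t rest = h << HLL_B;                   /* remaining bits */
    uint8_t rank = 1;                             /* position of first 1-bit */
    while (rest && !(rest & 0x8000000000000000ULL)) {
        rank++;
        rest <<= 1;
    }
    if (rest == 0) rank = 64 - HLL_B + 1;         /* all remaining bits were zero */
    if (rank > hll_reg[idx]) hll_reg[idx] = rank; /* keep the per-bucket maximum */
}

double hll_estimate(void) {
    double sum = 0.0;                             /* harmonic mean of 2^register */
    for (int j = 0; j < HLL_M; j++)
        sum += 1.0 / (double)(1ULL << hll_reg[j]);
    return 0.673 * HLL_M * HLL_M / sum;           /* alpha_16 = 0.673 */
}
```

With only 16 buckets the standard error is about 26%, so the estimate is rough; the executables accept `<n_bucket>` precisely so this trade-off can be tuned.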
```
├── dataset/       # Input datasets for testing
├── include/       # Header files and configuration
├── results/       # Output results in CSV and plot scripts
├── src/
│   ├── implementation/  # Contains multiple versions of the algorithm
│   ├── hll.c            # Core logic of HyperLogLog
│   ├── main.c           # Entry point for execution
│   ├── murmur3.c        # MurmurHash3 implementation
│   └── utility.c        # Utility functions (e.g., file writing)
├── Makefile       # Compilation instructions
├── README.md      # This file
├── hll.py         # Python script (for testing)
├── pythonsub.sh   # Job submission script (Python-based)
├── run.sh         # Script to run the executable
└── wrapper.sh     # Wrapper for job arrays
```
To install the package, clone the repository and compile the code using the following commands:
```
cd parallel-hyperloglog
# load mpich-3.2 if on a cluster
make all
```

This will create five executables, one per implementation of the HyperLogLog algorithm, in the `exe` folder:
- `seq`: the main implementation of the HyperLogLog algorithm, which runs sequentially.
- `mpi`: an MPI implementation, which works as the sequential version but leverages more than one process.
- `omp_v1`: an OpenMP implementation that uses a critical section to update the HLL structure.
- `omp_v2`: an OpenMP implementation that uses a lock to update a register.
- `omp_v3`: an OpenMP implementation where each thread holds its own HLL structure, thus avoiding any lock or critical section.
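The lock-free variant works because two HyperLogLog sketches with the same bucket count merge into the sketch of the union simply by taking the per-bucket maximum. A hedged sketch of that reduction step (hypothetical names, not the project's code), which a per-thread or per-process version would apply after the parallel phase:

```c
#include <stdint.h>

/* Merge HLL register array src into dst (both of m buckets) by taking
 * the per-bucket maximum; dst then describes the union of the two sets. */
void hll_merge(uint8_t *dst, const uint8_t *src, int m) {
    for (int j = 0; j < m; j++)
        if (src[j] > dst[j])
            dst[j] = src[j];
}
```

In the MPI version the same per-bucket maximum could equally be expressed as a reduction with `MPI_MAX` over the register arrays.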
To compile only a specific implementation, run

```
make <exe_name>
```

where `<exe_name>` is one of `seq`, `mpi`, `omp_v1`, `omp_v2`, `omp_v3`.
There are two ways to run the code. The first is to invoke the executable directly:
```
mpirun.actual -n <n_process> ./parallel-hyperloglog/exe/<exe_name> <n_bucket> <dataset_name>
```

where:

- `<n_process>`: the number of processes to use (for the MPI implementation).
- `<exe_name>`: the name of the executable to run (e.g., `seq`, `mpi`, `omp_v1`, `omp_v2`, `omp_v3`).
- `<n_bucket>`: the number of buckets to use in the HyperLogLog algorithm.
- `<dataset_name>`: the name of the dataset to use; it must be in the `dataset` folder (e.g., `dataset.txt`).
Alternatively, on a cluster that uses a PBS environment, you can use the provided wrapper.sh script to submit jobs. The script takes care of setting up the environment and running the specified executable with the provided arguments:
```
./wrapper.sh --run <exe_name> <dataset> <number_of_nodes> <number_of_processes>
```