Parallel Hyperloglog

This is a parallel implementation of the HyperLogLog algorithm, developed for the High Performance Computing for Data Science course at the University of Trento.

The project includes multiple implementations of the HyperLogLog algorithm, leveraging different parallelization techniques such as MPI and OpenMP. The code is designed to handle large datasets efficiently by estimating the cardinality of sets. Its modular structure makes it easy to reuse code and integrate new algorithm variants: to compile a particular implementation, simply include the corresponding file from the src/implementation directory.

The HyperLogLog algorithm is a probabilistic algorithm for estimating the cardinality of large sets. It is particularly useful when an exact count of unique elements is not required and an approximation is sufficient. The algorithm hashes each element, routes it to a register (bucket) based on some of the hash bits, and records the longest run of leading zeros seen in the remaining bits; combining the registers yields the cardinality estimate.

Directory Structure

├── dataset/             # Input datasets for testing
├── include/             # Header files and configuration
├── results/             # Output results in csv and plot scripts
├── src/
│   ├── implementation/  # Contains multiple versions of the algorithm
│   ├── hll.c            # Core logic of HyperLogLog
│   ├── main.c           # Entry point for execution
│   ├── murmur3.c        # Murmur3 hash implementation
│   └── utility.c        # Utility functions (e.g., file writing)
├── Makefile             # Compilation instructions
├── README.md            # This file
├── hll.py               # Python script (for testing)
├── pythonsub.sh         # Job submission script (Python-based)
├── run.sh               # Script to run the executable
└── wrapper.sh           # Wrapper for job arrays

Installation

To install the package, clone the repository and compile the code using the following commands:

cd parallel-hyperloglog
# if on a cluster, load the MPI module first (e.g., module load mpich-3.2)
make all

This will create five executables in the exe folder, one for each implementation of the HyperLogLog algorithm:

  • seq: The baseline implementation of the HyperLogLog algorithm, which runs sequentially.
  • mpi: An MPI implementation, which works like the sequential version but distributes the work across multiple processes.
  • omp_v1: An OpenMP implementation that uses a critical section to update the HLL structure.
  • omp_v2: An OpenMP implementation that uses a lock to update a register.
  • omp_v3: An OpenMP implementation in which each thread holds its own HLL structure, thus avoiding any lock or critical section.

To compile only a specific implementation, run:

make <exe_name>

Where <exe_name> is one of the following: seq, mpi, omp_v1, omp_v2, omp_v3.

Usage

There are two ways to run the code:

On your local machine

mpirun.actual -n <n_process> ./parallel-hyperloglog/exe/<exe_name> <n_bucket> <dataset_name>

Where:

  • <n_process>: The number of processes to use (for the MPI implementation).
  • <exe_name>: The name of the executable to run (seq, mpi, omp_v1, omp_v2, or omp_v3).
  • <n_bucket>: The number of buckets to use in the HyperLogLog algorithm.
  • <dataset_name>: The name of the dataset to use, which must be in the dataset folder (e.g., dataset.txt).

On a cluster

On a cluster that uses a PBS environment, you can use the provided wrapper.sh script to submit jobs. The script takes care of setting up the environment and running the specified executable with the provided arguments.

 ./wrapper.sh --run <exe_name> <dataset> <number_of_nodes> <number_of_processes>

Authors

About

Project for the course High Performance Computing 2024/25 @ University of Trento
