This is a parallel implementation of the HyperLogLog algorithm, developed for the High Performance Computing for Data Science course at the University of Trento.
The project includes multiple implementations of the HyperLogLog algorithm, leveraging different parallelization techniques such as MPI and OpenMP. The code is designed to handle large datasets efficiently by estimating the cardinality of sets. Its modular structure allows easy reuse of code and integration of the various algorithm versions: to compile a particular implementation, simply include the corresponding file from the src/implementation directory.
The HyperLogLog algorithm is a probabilistic algorithm for estimating the cardinality of large sets. It is particularly useful in scenarios where an exact count of unique elements is not required and an approximate count is sufficient. The algorithm uses hashing and bit manipulation to achieve this: each element is hashed, the first bits of the hash select a bucket, and each bucket records the longest run of leading zeros observed in the remaining bits.
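To make the idea concrete, here is a minimal, hypothetical HLL sketch in C. This is not the project's `hll.c`: it uses 16 buckets, MurmurHash3's 64-bit finalizer as a stand-in for the full hash, and the basic estimator without small- or large-range corrections.

```c
#include <stdint.h>

#define HLL_B 4             /* bits of the hash used as bucket index */
#define HLL_M (1 << HLL_B)  /* 16 buckets */

static uint8_t hll_reg[HLL_M];

/* MurmurHash3's 64-bit finalizer, standing in for the full hash. */
static uint64_t mix64(uint64_t x) {
    x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
    x ^= x >> 33;
    return x;
}

void hll_add(uint64_t item) {
    uint64_t h = mix64(item);
    uint32_t idx = (uint32_t)(h >> (64 - HLL_B)); /* top b bits pick the bucket */
    uint64_t rest = h << HLL_B;                   /* remaining bits */
    uint8_t rank = 1;                             /* position of first 1-bit */
    while (rest && !(rest & 0x8000000000000000ULL)) {
        rank++;
        rest <<= 1;
    }
    if (rest == 0) rank = 64 - HLL_B + 1;         /* all remaining bits were zero */
    if (rank > hll_reg[idx]) hll_reg[idx] = rank; /* keep the per-bucket maximum */
}

double hll_estimate(void) {
    double sum = 0.0;                             /* harmonic mean of 2^register */
    for (int j = 0; j < HLL_M; j++)
        sum += 1.0 / (double)(1ULL << hll_reg[j]);
    return 0.673 * HLL_M * HLL_M / sum;           /* alpha_16 = 0.673 */
}
```

With only 16 buckets the standard error is about 26%, so the estimate is rough; the executables accept `<n_bucket>` precisely so this trade-off can be tuned.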
```
├── dataset/       # Input datasets for testing
├── include/       # Header files and configuration
├── results/       # Output results in CSV and plot scripts
├── src/
│   ├── implementation/  # Contains multiple versions of the algorithm
│   ├── hll.c            # Core logic of HyperLogLog
│   ├── main.c           # Entry point for execution
│   ├── murmur3.c        # MurmurHash3 implementation
│   └── utility.c        # Utility functions (e.g., file writing)
├── Makefile       # Compilation instructions
├── README.md      # This file
├── hll.py         # Python script (for testing)
├── pythonsub.sh   # Job submission script (Python-based)
├── run.sh         # Script to run the executable
└── wrapper.sh     # Wrapper for job arrays
```
To install the package, clone the repository and compile the code using the following commands:
```
cd parallel-hyperloglog
# load mpich-3.2 if on a cluster
make all
```

This will create five executables, one per implementation of the HyperLogLog algorithm, in the `exe` folder:
- `seq`: the main implementation of the HyperLogLog algorithm, which runs sequentially.
- `mpi`: an MPI implementation, which works as the sequential version but leverages more than one process.
- `omp_v1`: an OpenMP implementation that uses a critical section to update the HLL structure.
- `omp_v2`: an OpenMP implementation that uses a lock to update a register.
- `omp_v3`: an OpenMP implementation where each thread holds its own HLL structure, thus avoiding any lock or critical section.
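The lock-free variant works because two HyperLogLog sketches with the same bucket count merge into the sketch of the union simply by taking the per-bucket maximum. A hedged sketch of that reduction step (hypothetical names, not the project's code), which a per-thread or per-process version would apply after the parallel phase:

```c
#include <stdint.h>

/* Merge HLL register array src into dst (both of m buckets) by taking
 * the per-bucket maximum; dst then describes the union of the two sets. */
void hll_merge(uint8_t *dst, const uint8_t *src, int m) {
    for (int j = 0; j < m; j++)
        if (src[j] > dst[j])
            dst[j] = src[j];
}
```

In the MPI version the same per-bucket maximum could equally be expressed as a reduction with `MPI_MAX` over the register arrays.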
To compile only a specific implementation, run

```
make <exe_name>
```

where `<exe_name>` is one of `seq`, `mpi`, `omp_v1`, `omp_v2`, `omp_v3`.
There are two ways to run the code. The first is to invoke the executable directly:
```
mpirun.actual -n <n_process> ./parallel-hyperloglog/exe/<exe_name> <n_bucket> <dataset_name>
```

where:

- `<n_process>`: the number of processes to use (for the MPI implementation).
- `<exe_name>`: the name of the executable to run (e.g., `seq`, `mpi`, `omp_v1`, `omp_v2`, `omp_v3`).
- `<n_bucket>`: the number of buckets to use in the HyperLogLog algorithm.
- `<dataset_name>`: the name of the dataset to use; it must be in the `dataset` folder (e.g., `dataset.txt`).
Alternatively, on a cluster that uses a PBS environment, you can use the provided wrapper.sh script to submit jobs. The script takes care of setting up the environment and running the specified executable with the provided arguments:
```
./wrapper.sh --run <exe_name> <dataset> <number_of_nodes> <number_of_processes>
```