A flexible framework for benchmarking data mining algorithms including clustering, dimensionality reduction, and network analysis.
From TestPyPI (https://test.pypi.org/project/data-mining-framework/):

```bash
pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ data-mining-framework
```

Note: The `--extra-index-url` ensures dependencies are installed from the main PyPI repository.
```bash
git clone https://github.com/M-Gkiko/data_mining_framework.git
cd data_mining_framework
pip install -e .

# With development dependencies
pip install -e ".[dev]"

pip install -r requirements.txt  # Alternative
```

```bash
# Run a benchmark from YAML configuration
python examples/run_benchmark_example.py --config examples/clustering_benchmark.yaml

# Run with verbose output
python examples/run_benchmark_example.py --config examples/dr_cl_quality.yaml --verbose

# Network analysis benchmark
python examples/run_benchmark_example.py --config examples/network_benchmark.yaml
```

```python
from data_mining_framework import CSVDataset, HierarchicalClustering, ManhattanDistance

dataset = CSVDataset('data/iris.csv')
distance = ManhattanDistance()
clustering = HierarchicalClustering(distance_measure=distance, n_clusters=3, linkage='complete')
clustering.fit(dataset)
labels = clustering.get_labels()
```

```python
from data_mining_framework import (
    CSVDataset, PCAProjection, HierarchicalClustering,
    CalinskiHarabaszIndex, ManhattanDistance, Pipeline
)
from data_mining_framework.implementations.pipelines import (
    DRAdapter, ClusteringAdapter, ClusteringQualityAdapter
)

dataset = CSVDataset('data/iris.csv')
distance = ManhattanDistance()

pipeline = Pipeline("PCA_Hierarchical_Quality")

# Add dimensionality reduction
pca = PCAProjection(n_components=2)
pipeline.add_component(DRAdapter(pca))

# Add clustering
clustering = HierarchicalClustering(distance_measure=distance, n_clusters=3)
pipeline.add_component(ClusteringAdapter(clustering, distance))

# Add quality measure
quality = CalinskiHarabaszIndex()
pipeline.add_component(ClusteringQualityAdapter(quality))

results = pipeline.execute(dataset)
```

```python
from data_mining_framework import (
    NetworkXWrapper, LouvainCommunityDetection,
    PageRankMeasure, EdgeBetweennessMeasure,
    Pipeline, NetworkAdapter, CommunityDetectionAdapter,
    NodeMeasureAdapter, EdgeMeasureAdapter
)

# Create network
network = NetworkXWrapper(filepath='data/karate.edgelist', format='edgelist')

# Build pipeline: Community Detection → Edge Measures → Node Measures
pipeline = Pipeline("Network_Analysis")
pipeline.add_component(NetworkAdapter(network))
pipeline.add_component(CommunityDetectionAdapter(LouvainCommunityDetection(resolution=1.0)))
pipeline.add_component(EdgeMeasureAdapter(EdgeBetweennessMeasure()))
pipeline.add_component(NodeMeasureAdapter(PageRankMeasure(alpha=0.85)))

# Execute and get all results
results = pipeline.execute(None)

# Results contain: communities, modularity, edge_scores, node_scores
```

```python
from data_mining_framework.benchmarks import run_benchmark

results = run_benchmark('examples/dr_cl_quality.yaml')
print(f"Completed {results.total_runs} runs")
print(f"Average time: {results.average_time:.3f}s")
```

The examples/ directory contains organized examples demonstrating different usage patterns. All examples can be run from the project root directory.
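The adapter pattern in the pipelines above can be sketched in plain Python to show how `Pipeline.execute` threads results through components: each step reads what earlier steps produced and merges in its own output. This is a conceptual toy, not the framework's real API — the `Doubler`/`Summary` components and the results-dict protocol are illustrative assumptions:

```python
class Pipeline:
    """Toy pipeline: each component's run() receives the accumulated
    results dict and returns new entries to merge into it."""
    def __init__(self, name):
        self.name = name
        self.components = []

    def add_component(self, component):
        self.components.append(component)

    def execute(self, dataset):
        results = {"dataset": dataset}
        for component in self.components:
            results.update(component.run(results))
        return results


class Doubler:
    """Stand-in for a transformation step (e.g., dimensionality reduction)."""
    def run(self, results):
        return {"doubled": [2 * x for x in results["dataset"]]}


class Summary:
    """Stand-in for a quality measure that consumes the previous step's output."""
    def run(self, results):
        return {"total": sum(results["doubled"])}


pipeline = Pipeline("demo")
pipeline.add_component(Doubler())
pipeline.add_component(Summary())
out = pipeline.execute([1, 2, 3])
print(out["total"])  # 2 + 4 + 6 = 12
```

This also suggests why the network example can call `execute(None)`: when the first component supplies the data itself (as `NetworkAdapter` does), no input dataset is needed.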
📖 For detailed documentation of all examples, see examples/README.md
Direct component usage without pipelines - great for learning the API.

```bash
python examples/basic_component_usage.py
```

Shows both clustering and network analysis workflows step-by-step.
Complete workflows using the Pipeline framework.

```bash
# Clustering: DR → Clustering → Quality
python examples/clustering_pipeline_example.py

# Network: Community → Edge → Node measures
python examples/network_pipeline_example.py
```

Test multiple algorithm combinations with timing metrics.
```bash
# From examples/ directory (recommended)
cd examples
python clustering_benchmark_example.py  # Tests 4 combinations
python network_benchmark_example.py     # Tests 27 combinations

# Or from project root
python examples/clustering_benchmark_example.py
python examples/network_benchmark_example.py
```

Run benchmarks from YAML configuration files.
```bash
# Programmatic YAML usage tutorial
cd examples
python yaml_benchmark_example.py

# Command-line YAML benchmark runner
python run_benchmark_example.py -c clustering_benchmark.yaml
python run_benchmark_example.py -c network_benchmark.yaml --verbose
```

| Category | Files | Purpose |
|---|---|---|
| Learning | `basic_component_usage.py` | Understand the API |
| Pipelines | `clustering_pipeline_example.py`, `network_pipeline_example.py` | Single workflow execution |
| Benchmarks | `clustering_benchmark_example.py`, `network_benchmark_example.py` | Compare algorithms |
| YAML Configs | `yaml_benchmark_example.py`, `run_benchmark_example.py` | Configuration-based execution |
Import errors: Make sure you've installed the package:

```bash
pip install -e .
```

Path issues when running benchmarks: Some examples work best from the examples/ directory:

```bash
cd examples
python clustering_benchmark_example.py
```

Missing dependencies:

```bash
pip install -e ".[dev]"
```

- 📖 Read `examples/README.md` for detailed descriptions
- 📄 See YAML configuration examples in `examples/*.yaml`
- 🏗️ Check `ARCHITECTURE.md` for framework design
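The benchmark examples above report 4 and 27 combinations: each step's `algorithms` list in a pipeline template is expanded into the cross-product of concrete pipelines. A standalone sketch of that expansion with `itertools.product` (the parsed-template structure shown is an assumption; the framework's own expansion code may differ):

```python
from itertools import product

# A pipeline template as it might look after parsing the YAML config
template = [
    {"type": "community_detection",
     "algorithms": ["Louvain", "GirvanNewman", "LabelPropagation"]},
    {"type": "node_measures",
     "algorithms": ["PageRank", "DegreeCentrality", "ClosenessCentrality"]},
    {"type": "edge_measures",
     "algorithms": ["EdgeBetweenness", "EdgeWeight", "JaccardCoefficient"]},
]

# One benchmark run per element of the cross-product of the algorithm lists
combinations = [
    dict(zip((step["type"] for step in template), choice))
    for choice in product(*(step["algorithms"] for step in template))
]
print(len(combinations))  # 3 * 3 * 3 = 27
```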
```yaml
benchmark:
  name: "My_Benchmark"
  dataset: "path/to/data.csv"
  pipeline_template:
    - type: "step_type"
      algorithms: ["Algorithm1", "Algorithm2"]
      params:
        Algorithm1:
          param1: value1
        Algorithm2:
          param2: value2
  iterations: 3
  output:
    directory: "results"
    format: ["csv"]
```

Algorithms: PCA, MDS, TSNE, Sammon
```yaml
- type: "dimensionality_reduction"
  algorithms: ["PCA", "MDS", "TSNE", "Sammon"]
  params:
    PCA:
      n_components: 2
    MDS:
      n_components: 2
      max_iter: 300
      distance_measure: "Manhattan"  # Optional: Manhattan, Euclidean, Cosine
    TSNE:
      n_components: 2
      perplexity: 30
      max_iter: 1000
      distance_measure: "Manhattan"
    Sammon:
      n_components: 2
      max_iter: 500
      distance_measure: "Manhattan"
      init: "pca"  # or "random"
```

Algorithms: Hierarchical, DBSCAN, KMeans
```yaml
- type: "clustering"
  algorithms: ["Hierarchical", "DBSCAN", "KMeans"]
  params:
    Hierarchical:
      n_clusters: 3
      linkage: "complete"  # complete, average, single, ward
      distance_measure: "Manhattan"
    DBSCAN:
      eps: 0.5
      min_samples: 5
      distance_measure: "Euclidean"
    KMeans:
      n_clusters: 3
      max_iter: 300
      n_init: 10
```

Algorithms: Calinski_Harabasz, Davies_Bouldin, Silhouette
```yaml
- type: "clustering_quality"
  algorithms: ["Calinski_Harabasz", "Davies_Bouldin", "Silhouette"]
  params:
    Calinski_Harabasz: {}
    Davies_Bouldin: {}
    Silhouette: {}
```

Algorithms: Trustworthiness, Continuity, Reconstruction_Error
```yaml
- type: "dr_quality"
  algorithms: ["Trustworthiness", "Continuity", "Reconstruction_Error"]
  params:
    Trustworthiness:
      n_neighbors: 12
    Continuity:
      n_neighbors: 12
    Reconstruction_Error: {}
```

Algorithms: Louvain, GirvanNewman, LabelPropagation
```yaml
- type: "community_detection"
  algorithms: ["Louvain", "GirvanNewman", "LabelPropagation"]
  params:
    Louvain:
      resolution: 1.0
      random_state: 42
    GirvanNewman:
      k: 2  # Number of communities
    LabelPropagation:
      max_iterations: 100
      random_seed: 42
```

Algorithms: PageRank, DegreeCentrality, ClosenessCentrality
```yaml
- type: "node_measures"
  algorithms: ["PageRank", "DegreeCentrality", "ClosenessCentrality"]
  params:
    PageRank:
      alpha: 0.85
      max_iter: 100
      tol: 0.000001
    DegreeCentrality:
      normalized: true
    ClosenessCentrality:
      normalized: true
```

Algorithms: EdgeBetweenness, EdgeWeight, JaccardCoefficient
```yaml
- type: "edge_measures"
  algorithms: ["EdgeBetweenness", "EdgeWeight", "JaccardCoefficient"]
  params:
    EdgeBetweenness:
      normalized: true
    EdgeWeight:
      weight_attribute: "weight"
      default_weight: 1.0
    JaccardCoefficient: {}
```

```yaml
benchmark:
  name: "Benchmark_Name"      # Benchmark identifier
  dataset: "data/file.csv"    # Path to dataset
  iterations: 3               # Number of runs per configuration
  output:
    directory: "results"      # Output directory
    format: ["csv", "json"]   # Output formats
    save_communities: true    # Save community results (network only)
    save_centralities: true   # Save centrality scores (network only)
  timeout: 300                # Timeout per run in seconds
  verbose: true               # Enable detailed logging
  random_seed: 42             # Random seed for reproducibility
```

Available distance measures: Manhattan, Euclidean, Cosine
Use in algorithm params:

```yaml
params:
  AlgorithmName:
    distance_measure: "Manhattan"
```

- `Hierarchical` - Agglomerative clustering (linkage: complete, average, single, ward)
- `DBSCAN` - Density-based clustering
- `KMeans` - K-means clustering
- `PCA` - Principal Component Analysis
- `MDS` - Multidimensional Scaling
- `TSNE` - t-Distributed Stochastic Neighbor Embedding
- `Sammon` - Sammon Mapping
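For reference, the three built-in distance measures named above are the standard metrics. Minimal standalone implementations in plain Python (independent of the framework's `ManhattanDistance`-style classes):

```python
import math

def manhattan(a, b):
    # L1: sum of absolute coordinate differences
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    # L2: square root of the summed squared differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    # Cosine distance = 1 - cosine similarity (angle-based, ignores magnitude)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norm

print(manhattan([0, 0], [3, 4]))  # 7
print(euclidean([0, 0], [3, 4]))  # 5.0
```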
- Community Detection: Louvain, Girvan-Newman, Label Propagation
- Node Measures: PageRank, Degree Centrality, Closeness Centrality
- Edge Measures: Edge Betweenness, Edge Weight, Jaccard Coefficient
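Of the node measures above, PageRank is straightforward to sketch as a power iteration over an undirected neighbor structure. A simplified standalone version using the same default parameters as the YAML section (`alpha=0.85`, `max_iter=100`, `tol=1e-6`); not the framework's implementation:

```python
def pagerank(edges, alpha=0.85, max_iter=100, tol=1e-6):
    """Power-iteration PageRank on an undirected edge list."""
    nodes = sorted({n for edge in edges for n in edge})
    neighbors = {n: [] for n in nodes}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Each node receives an alpha-damped share of its neighbors' rank
        new_rank = {
            v: (1 - alpha) / n
               + alpha * sum(rank[u] / len(neighbors[u]) for u in neighbors[v])
            for v in nodes
        }
        converged = sum(abs(new_rank[v] - rank[v]) for v in nodes) < tol
        rank = new_rank
        if converged:
            break
    return rank


# Star graph: the hub should rank highest
ranks = pagerank([("hub", "a"), ("hub", "b"), ("hub", "c")])
```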
- Clustering: Calinski-Harabasz Index, Davies-Bouldin Index, Silhouette Score
- DR: Trustworthiness, Continuity, Reconstruction Error
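Among the clustering quality measures above, the Silhouette score is easy to illustrate: for each point, `a` is its mean distance to its own cluster, `b` its mean distance to the nearest other cluster, and `s = (b - a) / max(a, b)`. A minimal plain-Python sketch using Manhattan distance (an illustration, not the framework's implementation):

```python
def silhouette_score(points, labels):
    """Mean silhouette over all points; +1 = well separated, <0 = misassigned."""
    def dist(p, q):
        return sum(abs(x - y) for x, y in zip(p, q))

    by_label = {}
    for i, label in enumerate(labels):
        by_label.setdefault(label, []).append(i)

    total = 0.0
    for i, label in enumerate(labels):
        # a: mean distance to other members of the point's own cluster
        own = [j for j in by_label[label] if j != i]
        a = sum(dist(points[i], points[j]) for j in own) / len(own) if own else 0.0
        # b: mean distance to the nearest other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in by_label.items()
            if other != label
        )
        total += (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return total / len(points)


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = silhouette_score(points, [0, 0, 1, 1])  # tight, separated clusters
bad = silhouette_score(points, [0, 1, 0, 1])   # clusters mixed up
```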
The examples/ directory contains both YAML configs and Python examples:
YAML Configs:
- `clustering_benchmark.yaml` - Clustering algorithm comparisons
- `dr_cl_quality.yaml` - Complete DR + Clustering + Quality pipeline
- `network_benchmark.yaml` - Network analysis with all combinations
Python Examples:
- `basic_component_usage.py` - Learn the API without pipelines
- `clustering_pipeline_example.py` - Complete clustering workflow
- `network_pipeline_example.py` - Chained network analysis
- `clustering_benchmark_example.py` - Benchmark all clustering combinations
- `network_benchmark_example.py` - Benchmark all network combinations
- `yaml_benchmark_example.py` - YAML usage tutorial
- `run_benchmark_example.py` - Command-line benchmark runner
📖 See examples/README.md for detailed documentation
```
data_mining_framework/
├── core/                    # Abstract base classes
├── implementations/         # Algorithm implementations
│   ├── clustering/
│   ├── dr/
│   ├── networks/
│   ├── community_detection/
│   ├── node_measures/
│   ├── edge_measures/
│   └── pipelines/           # Pipeline adapters
├── benchmarks/              # Benchmarking system
├── utils/                   # Utilities
├── examples/                # Usage examples and configs
└── data/                    # Sample datasets
```