A comprehensive benchmarking suite for evaluating Gemma and other language models on standard benchmarks, including MMLU (Massive Multitask Language Understanding) and GSM8K (Grade School Math 8K).
- ✅ Support for Gemma models (2B and 7B)
- 🔍 Support for Mistral models
- 📊 MMLU benchmark implementation
- 🔢 GSM8K benchmark implementation
- 🔌 Configurable model parameters
- 🔒 Secure HuggingFace authentication
- 📈 Detailed results reporting and visualization
- 📊 Interactive plots and summary reports
- 🧙‍♂️ Interactive setup wizard
Ensure you have the following installed:
- Python 3.10+
- CUDA-capable GPU (recommended)
- HuggingFace account with access to Gemma models
- Clone the repository

```bash
git clone https://github.com/yourusername/gemma-benchmarking.git
cd gemma-benchmarking
```

- Run the setup wizard (Recommended)

```bash
python scripts/setup_wizard.py
```

The wizard will:
- Check prerequisites
- Set up the Python environment (conda or venv)
- Configure models and benchmarks
- Generate a custom configuration file
- Guide you through the next steps
- Manual Installation (Alternative)

Option 1: Using Conda (Recommended)

```bash
conda env create -f environment.yml
conda activate gemma-benchmark
```

Option 2: Using Python venv

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

Option 3: Using Docker
- Docker installed on your system
- NVIDIA Container Toolkit (for GPU support)
- CPU Version
```bash
docker-compose up --build
```

- GPU Version

```bash
TARGET=gpu docker-compose up --build
```

- Running Jupyter Notebooks

```bash
COMMAND="jupyter notebook --ip=0.0.0.0 --port=8888 --no-browser --allow-root" docker-compose up --build
```

- Running Specific Scripts

```bash
COMMAND="python scripts/run_benchmark.py" docker-compose up --build
```

The Docker setup includes:
- Multi-stage builds for CPU and GPU support
- Persistent volume for HuggingFace cache
- Jupyter notebook support
- Security best practices (non-root user)
- Automatic GPU detection and support
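A compose file of roughly the following shape would support the `TARGET` and `COMMAND` overrides shown above. This is an illustrative sketch only — the service name, stage names, and cache path are assumptions; see the repository's actual `docker-compose.yml`:

```yaml
services:
  benchmark:
    build:
      context: .
      target: ${TARGET:-cpu}        # select the cpu or gpu build stage
    command: ${COMMAND:-python src/main.py}
    volumes:
      - hf-cache:/home/app/.cache/huggingface  # persist downloaded models

volumes:
  hf-cache:
```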
- Install dependencies

```bash
pip install -r requirements.txt
```

For models that require authentication (like Gemma), you need to log in to HuggingFace:

```bash
huggingface-cli login
```

This will prompt you to enter your HuggingFace token. You can get your token from HuggingFace Settings.
All benchmarking settings are controlled via JSON configuration files in the configs/ directory.
- The default configuration is available at `configs/default.json`
- You can create custom configs to tailor model selection, datasets, and evaluation settings
- The setup wizard will help you create a custom configuration file
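For illustration, a custom configuration might look like the sketch below. The field names here are assumptions for illustration only — consult `configs/default.json` (or the file the setup wizard generates) for the actual schema:

```json
{
  "models": [
    {"name": "gemma-2b", "dtype": "bfloat16", "device": "cuda"}
  ],
  "benchmarks": {
    "mmlu": {"num_few_shot": 5, "subjects": "all"},
    "gsm8k": {"num_few_shot": 8}
  },
  "output_dir": "results"
}
```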
Run with the default config:

```bash
python src/main.py
```

Run with a custom config:

```bash
python src/main.py --config path/to/config.json
```

Specify models via the CLI:

```bash
python src/main.py --models gemma-2b mistral-7b
```

Run specific benchmarks:

```bash
python src/main.py --benchmarks mmlu gsm8k
```

Enable verbose output:

```bash
python src/main.py --verbose
```

After running benchmarks, generate visualization reports:

```bash
python scripts/generate_report.py
```

Customize report generation:
```bash
python scripts/generate_report.py --results_dir custom_results --output_dir custom_reports --output_name my_report
```

```
gemma-benchmarking/
├── configs/                  # Configuration files (JSON)
├── environment.yml           # Conda environment specification
├── logs/                     # Log files
├── requirements.txt          # Python dependencies
├── results/                  # Benchmark output results
├── reports/                  # Visualization reports and plots
├── scripts/                  # Utility scripts
│   ├── setup_wizard.py       # Interactive setup wizard
│   ├── generate_report.py    # Report generation script
│   └── prepare_data.py       # Dataset preparation scripts
├── src/                      # Source code
│   ├── benchmarks/           # Benchmark task implementations
│   │   ├── base_benchmark.py # Base benchmark class
│   │   ├── mmlu.py           # MMLU benchmark
│   │   └── gsm8k.py          # GSM8K benchmark
│   ├── models/               # Model wrappers and loading logic
│   ├── utils/                # Helper utilities and tools
│   ├── visualization/        # Visualization and reporting tools
│   │   └── plotter.py        # Results plotting
│   └── main.py               # Entry point for benchmarking
└── README.md                 # You're here!
```
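Each benchmark subclasses the shared base class so that the evaluation loop is written once. The sketch below shows one plausible shape for that pattern — the method names and signatures are assumptions for illustration, not the actual API in `src/benchmarks/base_benchmark.py`:

```python
from abc import ABC, abstractmethod

class BaseBenchmark(ABC):
    """Illustrative sketch of a shared benchmark interface."""

    def __init__(self, num_few_shot=5):
        self.num_few_shot = num_few_shot

    @abstractmethod
    def build_prompt(self, example, few_shot_examples):
        """Format one evaluation example, prepending few-shot demonstrations."""

    @abstractmethod
    def score(self, prediction, reference):
        """Return 1.0 for a correct prediction, 0.0 otherwise."""

    def evaluate(self, model, examples, few_shot_examples):
        """Run the model over all examples and return mean accuracy."""
        correct = 0.0
        for ex in examples:
            prompt = self.build_prompt(ex, few_shot_examples)
            prediction = model(prompt)
            correct += self.score(prediction, ex["answer"])
        return correct / max(len(examples), 1)
```

With this split, `mmlu.py` and `gsm8k.py` only need to define how prompts are built and how answers are scored.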
- Evaluates models across 57 subjects
- Supports few-shot learning
- Configurable number of examples per subject
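MMLU prompts are conventionally formatted as few-shot multiple-choice questions with lettered options and a trailing `Answer:` cue. The helper below sketches that convention; the actual formatting in `src/benchmarks/mmlu.py` may differ:

```python
CHOICES = ["A", "B", "C", "D"]

def format_mmlu_prompt(question, options, subject, few_shot=()):
    """Build an MMLU-style multiple-choice prompt (illustrative sketch).

    few_shot is a sequence of (question, options, answer_letter) tuples
    that are prepended as worked demonstrations.
    """
    lines = [f"The following are multiple choice questions "
             f"(with answers) about {subject}."]
    for q, opts, ans in few_shot:
        lines.append("")
        lines.append(q)
        lines.extend(f"{letter}. {opt}" for letter, opt in zip(CHOICES, opts))
        lines.append(f"Answer: {ans}")
    lines.append("")
    lines.append(question)
    lines.extend(f"{letter}. {opt}" for letter, opt in zip(CHOICES, options))
    lines.append("Answer:")  # the model is expected to complete with a letter
    return "\n".join(lines)
```

Scoring then reduces to checking whether the model's completion begins with the correct letter.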
- Tests mathematical reasoning capabilities
- Step-by-step problem solving
- Few-shot learning support
- Detailed accuracy metrics
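GSM8K reference answers end with a `#### <number>` marker, and a common scoring approach is to compare the final number in the model's completion against it. The extractor below is an illustrative sketch of that approach, not necessarily the repository's implementation:

```python
import re

# Matches integers and decimals, with optional sign and thousands separators.
_NUMBER = r"[-+]?[\d,]*\.?\d+"

def extract_gsm8k_answer(text):
    """Pull the final numeric answer from a solution string.

    Prefers the GSM8K '#### <number>' marker; otherwise falls back to
    the last number appearing in the text. Returns None if no number
    is found.
    """
    marker = re.search(rf"####\s*({_NUMBER})", text)
    if marker is not None:
        raw = marker.group(1)
    else:
        numbers = re.findall(_NUMBER, text)
        if not numbers:
            return None
        raw = numbers[-1]
    return float(raw.replace(",", ""))
```

Comparing the extracted floats (rather than raw strings) makes the check robust to formatting differences like `1,000` vs `1000`.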
- Add CLI wizard for quick setup
- Add support for additional Gemma model variants
- Expand academic benchmark integration
- Add HumanEval benchmark implementation
- Improve visualization and report automation
- Add leaderboard comparison with open models (e.g., LLaMA, Mistral)
- Docker support and multiplatform compatibility
This project is licensed under the MIT License.
Pull requests, issues, and suggestions are welcome! Please open an issue or start a discussion if you'd like to contribute.
- Google for the Gemma models
- Mistral AI for the Mistral models
- HuggingFace for the transformers library and model hosting
- The MMLU and GSM8K benchmark creators