A comprehensive machine learning framework for measuring trigger efficiency in the CMS HH→bbττ analysis
- Overview
- Project Structure
- Quick Start
- Installation
- Running the Gradient Boosting Analysis
- Data Requirements
- Analysis Components
- Output and Results
- Advanced Usage
- Troubleshooting
- Contributing
This project implements a comprehensive machine learning approach to measuring trigger efficiency in the CMS (Compact Muon Solenoid) experiment, specifically for the HH→bbττ analysis. The framework uses gradient boosting methods to predict trigger efficiency, providing a data-driven way to correct for trigger inefficiencies in physics measurements.
- Multi-algorithm Support: XGBoost, LightGBM, CatBoost, and scikit-learn GradientBoosting
- Comprehensive Workflow: From data processing to final efficiency measurements
- Probability Calibration: Ensures reliable probability estimates
- Cutflow Analysis: Track event selection efficiency step-by-step
- Scale Factor Generation: For systematic uncertainty estimation
- Model Comparison: Automated hyperparameter optimization and model selection
The analysis focuses on:
- Signal Triggers: Complex OR combination of HLT paths including jet, HT, MET, and tau triggers
- Reference Trigger: HLT_AK8PFJet260 for unbiased efficiency measurement
- Samples: QCD background, ggF HH signal, VBF HH signal, and collision data
- Variables: Fat jet properties, HT, MET, invariant masses, and kinematic variables
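The measurement follows the reference-trigger method: among events that fire the unbiased HLT_AK8PFJet260 reference, the efficiency is the fraction that also fires the signal-trigger OR. A minimal counting sketch with RDataFrame, assuming the processed trees described in the Data Requirements section (tree name "tree", boolean branches Combo and HLT_AK8PFJet260); the file path is illustrative:

# Minimal sketch of the reference-trigger efficiency count.
# Tree/branch names follow the Data Requirements section; the file path is illustrative.
import ROOT

df = ROOT.RDataFrame("tree", "data/processed/v2/v2-NewQCD.root")

n_ref  = df.Filter("HLT_AK8PFJet260").Count()            # events firing the unbiased reference
n_both = df.Filter("HLT_AK8PFJet260 && Combo").Count()   # ... that also fire the signal OR

efficiency = n_both.GetValue() / n_ref.GetValue()
print(f"Signal-trigger efficiency w.r.t. HLT_AK8PFJet260: {efficiency:.3f}")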
CMS-trigger-efficiency-ashe/
├── run_gradient_boosting.ipynb          # Main analysis notebook
├── data_distribution.ipynb              # Data distribution analysis
├── composing_plot.ipynb                 # Plotting utilities
│
├── library/                             # Core analysis modules
│   ├── trigger_efficiency_ML.py         # Main ML functions and plotting
│   └── processing_data.py               # Data processing utilities
│
├── data/                                # Data storage
│   ├── HHbbtautau-v1/                   # Raw ROOT files (v1)
│   ├── HHbbtautau-v2/                   # Raw ROOT files (v2)
│   └── processed/                       # Processed data files
│       ├── azura/                       # Tau & new processed data
│       ├── ashe/                        # No tau & new processed data
│       ├── briar/                       # Old processed data
│       └── ...
│
├── model_selection/                     # Model comparison framework
│   ├── run_gbdt_comparison.py           # Execute model comparison
│   ├── gbdt_comparison_framework.py     # Framework implementation
│   ├── gbdt_training_pipeline.py        # Training pipeline
│   ├── hyperparameter_optimization.py   # HPO utilities
│   ├── data_loader.py                   # Data loading utilities
│   ├── pipeline_clean.py                # Clean analysis pipeline
│   └── requirements_gbdt.txt            # GBDT-specific requirements
│
├── calibration/                         # Probability calibration
│   ├── run_bdt_calibration.py           # Execute calibration
│   ├── bdt_calibration_workflow.py      # Workflow implementation
│   ├── probability_calibration.py       # Calibration methods
│   ├── bdt_calibration_example.py       # Usage examples
│   ├── test_gradientboosting_integration.py  # Integration tests
│   └── README_BDT_Calibration.md        # Detailed calibration docs
│
├── cutflow/                             # Selection analysis
│   ├── run_cutflow.py                   # Execute cutflow analysis
│   ├── cutflow_analysis.py              # Core cutflow implementation
│   ├── cutflow_example.py               # Advanced cutflow examples
│   └── README_cutflow.md                # Cutflow documentation
│
├── scale_factor/                        # Systematic uncertainties
│   ├── efficiency_ratio_predictor.py    # Ratio prediction methods
│   ├── xgboost_scale_factor_predictor.py    # XGBoost-based scale factors
│   ├── integrated_ratio_predictor_fixed.py  # Fixed integration method
│   ├── example_integration.py           # Integration examples
│   └── README_Efficiency_Ratio_Prediction.md  # Scale factor docs
│
└── result/                              # Output directory
    └── DD-MM-YYYY-suffix/               # Date-stamped results
        ├── plots/                       # Generated plots
        ├── models/                      # Saved models
        └── reports/                     # Analysis reports
Ensure you have access to the CERN CMS environment with all required modules:
# Activate CERN CMS environment (recommended)
source /cvmfs/cms.cern.ch/cmsset_default.sh
# or your local CMS environment setup

To prepare the environment for the first run:
# In CERN CMS environment (recommended)
source /cvmfs/cms.cern.ch/cmsset_default.sh
pip install --user -r requirements.txt
# Or locally
pip install -r requirements.txt

For future updates:
# Re-run the extraction anytime
python extract_libraries.py

- Clone and navigate to the project:
  cd /path/to/CMS-trigger-efficiency-ashe
- Open the main analysis notebook:
  jupyter notebook run_gradient_boosting.ipynb
- Configure your analysis (in the notebook's second cell):
  # Define the run name and version
  run_name = "Run2"
  version = "v2"  # Change to v1, v3, v4, etc. as needed
  # Define samples to analyze
  samples = ["QCD", "ggF", "VBF", "DATA"]
  # Save trained models?
  save_model_gradu = False  # Set to True to save models
  save_model_data = False
- Execute all cells to run the complete analysis pipeline.
The project is designed to work within the CERN CMS software environment where most dependencies are pre-installed:
# Set up CMS environment
source /cvmfs/cms.cern.ch/cmsset_default.sh

If running outside CERN, install the required packages:
# Core ML and data analysis
pip install "scikit-learn>=1.7.0"
pip install "xgboost>=3.0.3"
pip install "lightgbm>=4.6.0"
pip install "catboost>=1.2.68"
pip install "pandas>=1.3.0"
pip install "numpy>=1.21.0"
pip install "matplotlib>=3.5.0"
# HEP-specific packages
pip install mplhep
pip install hist
# Additional utilities
pip install tqdm
pip install "optuna>=3.0.0"  # For hyperparameter optimization

ROOT is essential for reading CMS data files. Install from:
- CERN users: Usually pre-installed in CMS environment
- Local users: Follow the ROOT installation guide
This Jupyter notebook implements the complete gradient boosting analysis for trigger efficiency measurement.
- Configuration (Cells 1-2)
  # Set run parameters
  run_name = "Run2"
  version = "v2"  # Data version (v1, v2, v3, etc.)
  samples = ["QCD", "ggF", "VBF", "DATA"]
  # Configure triggers
  signal_triggers = [
      "HLT_AK8PFHT800_TrimMass50",
      "HLT_AK8PFJet400_TrimMass30",
      "HLT_AK8PFJet500",
      "HLT_PFJet500",
      "HLT_PFHT1050",
      # ... additional triggers
  ]
  reference_trigger = "HLT_AK8PFJet260"
- Data Loading (Cell 3)
  - Automatically loads processed ROOT files
  - Maps data versions to file paths
  - Creates RDataFrame objects for analysis
- Data Distribution Analysis (Cells 4-6)
  - Generates comparison plots for all kinematic variables
  - Creates efficiency plots for each sample
  - Outputs: Distribution_MC_*.png files
- Machine Learning Training (Cells 7-10)
  - Trains gradient boosting models on QCD data
  - Applies the trained model to all samples
  - Generates efficiency predictions (see the sketch after this list)
- Validation and Comparison (Cells 11-15)
  - Compares ML predictions with measured efficiencies
  - Creates validation plots
  - Outputs: Efficiency comparison plots
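For the training step referenced above, the target is the per-event signal-trigger decision and the features are the kinematic variables listed in the next section; the predicted probability then serves as the per-event efficiency estimate. Below is a minimal sketch using scikit-learn's GradientBoostingClassifier and the tree/branch layout from the Data Requirements section; the notebook's actual implementation uses the helpers in library/trigger_efficiency_ML.py and may differ in detail:

# Sketch: per-event efficiency as P(Combo = 1 | kinematics) on reference-trigger events.
# The notebook's real implementation lives in library/trigger_efficiency_ML.py.
import ROOT
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

features = ["HighestPt", "HT", "MET_pt", "mHH", "HighestMass"]  # reduced set for brevity

df = ROOT.RDataFrame("tree", "data/processed/v2/v2-NewQCD.root")
arrays = df.Filter("HLT_AK8PFJet260").AsNumpy(features + ["Combo"])

X = np.column_stack([arrays[f] for f in features])
y = arrays["Combo"].astype(int)  # did the signal OR combination fire?

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = GradientBoostingClassifier(n_estimators=200, learning_rate=0.2)  # illustrative hyperparameters
model.fit(X_train, y_train)

efficiency = model.predict_proba(X_test)[:, 1]  # per-event trigger-efficiency estimate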
The analysis uses these kinematic variables:
analysis_variables = [
"HighestPt", # Leading fat jet pT
"HT", # Scalar sum of jet pT
"MET_pt", # Missing transverse energy
"mHH", # Invariant mass of HH system
"HighestMass", # Leading fat jet mass
"SecondHighestPt", # Sub-leading fat jet pT
"SecondHighestMass", # Sub-leading fat jet mass
"FatHT", # HT from fat jets only
"MET_FatJet", # MET projected on fat jets
"mHHwithMET", # HH+MET invariant mass
"HighestEta", # Leading fat jet Ξ·
"SecondHighestEta", # Sub-leading fat jet Ξ·
"DeltaEta", # |Ξ·β - Ξ·β|
"DeltaPhi" # |Οβ - Οβ|
]

The analysis measures efficiency for this complex trigger combination:
Signal Triggers (OR combination):
- HLT_AK8PFHT800_TrimMass50
- HLT_AK8PFJet400_TrimMass30
- HLT_AK8PFJet500
- HLT_PFJet500
- HLT_PFHT1050
- HLT_PFHT500_PFMET100_PFMHT100_IDTight
- HLT_PFHT700_PFMET85_PFMHT85_IDTight
- HLT_PFHT800_PFMET75_PFMHT75_IDTight
- HLT_DoubleMediumChargedIsoPFTauHPS35_Trk1_eta2p1_Reg
- HLT_MediumChargedIsoPFTau180HighPtRelaxedIso_Trk50_eta2p1
Reference Trigger:
HLT_AK8PFJet260 (unbiased reference)
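The processed files described just below already store this OR as the Combo branch, but it can also be rebuilt on the fly from the individual HLT branches; a minimal sketch with RDataFrame.Define (input file name is illustrative):

# Sketch: rebuild the signal-trigger OR from the individual HLT branches.
# The processed files below already contain Combo, so this is only needed for raw inputs.
import ROOT

signal_triggers = [
    "HLT_AK8PFHT800_TrimMass50",
    "HLT_AK8PFJet400_TrimMass30",
    "HLT_AK8PFJet500",
    "HLT_PFJet500",
    "HLT_PFHT1050",
    # ... remaining HT+MET and tau paths from the list above
]

df = ROOT.RDataFrame("tree", "input.root")
combo_expr = " || ".join(signal_triggers)   # C++ boolean OR of the trigger branches
df = df.Define("Combo", combo_expr)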
The analysis expects data in this format:
data/processed/{version}/
├── {version}-NewQCD.root    # QCD background sample
├── {version}-NewggF.root    # ggF HH signal sample
├── {version}-NewVBF.root    # VBF HH signal sample
└── {version}-NewDATA.root   # Collision data sample
- v1 (briar): Initial data processing
- v2 (azura): Updated selection criteria
- v3 (ashe): Current recommended version
- v4 (cypress): Alternative processing
- v5 (azura-v2): Refined v2 processing
Each ROOT file must contain:
// Kinematic variables
Float_t HighestPt, SecondHighestPt;
Float_t HT, FatHT;
Float_t MET_pt, MET_FatJet;
Float_t mHH, mHHwithMET;
Float_t HighestMass, SecondHighestMass;
Float_t HighestEta, SecondHighestEta;
Float_t DeltaEta, DeltaPhi;
// Trigger information
Bool_t Combo; // Signal trigger combination
Bool_t HLT_AK8PFJet260; // Reference trigger
// ... individual trigger branches

Located in model_selection/, this component compares different GBDT algorithms:
# Run comprehensive model comparison
python model_selection/run_gbdt_comparison.py
# With custom settings
python model_selection/run_gbdt_comparison.py --n_trials 100 --cv_folds 5

Features:
- Automated hyperparameter optimization with Optuna
- Cross-validation evaluation
- Performance comparison across algorithms
- Best model selection and saving
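The actual search spaces and cross-validation setup live in hyperparameter_optimization.py; the sketch below only illustrates the general shape of such an Optuna study for one algorithm (XGBoost), with illustrative parameter ranges and a stand-in dataset:

# Sketch of an Optuna study for one GBDT flavour; the project's real search spaces
# are defined in model_selection/hyperparameter_optimization.py.
import optuna
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=5000, n_features=10, random_state=42)  # stand-in for the trigger dataset

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 500),
        "max_depth": trial.suggest_int("max_depth", 3, 8),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.6, 1.0),
    }
    model = xgb.XGBClassifier(**params, eval_metric="logloss")
    return cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
print("Best AUC:", study.best_value, "with", study.best_params)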
Located in calibration/, ensures reliable probability estimates:
# Run BDT calibration workflow
python calibration/run_bdt_calibration.py
# List available data versions
python calibration/run_bdt_calibration.py --list_versions

Features:
- Isotonic regression and Platt scaling
- Cross-validation to prevent overfitting
- Calibration quality metrics (ECE, MCE, Brier score)
- Calibration curve visualization
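probability_calibration.py implements the project's full workflow; the sketch below only shows the underlying scikit-learn mechanics (isotonic vs. sigmoid/Platt calibration with cross-validation) on a stand-in dataset:

# Sketch of post-hoc probability calibration with scikit-learn; the project's workflow
# in calibration/probability_calibration.py adds ECE/MCE reporting and calibration plots.
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=10, random_state=42)  # stand-in dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

base = GradientBoostingClassifier(n_estimators=200, learning_rate=0.2)
calibrated = CalibratedClassifierCV(base, method="isotonic", cv=5)  # method="sigmoid" for Platt scaling
calibrated.fit(X_train, y_train)

prob = calibrated.predict_proba(X_test)[:, 1]
print("Brier score:", brier_score_loss(y_test, prob))
frac_pos, mean_pred = calibration_curve(y_test, prob, n_bins=10)  # points for a reliability curve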
Located in cutflow/, tracks selection efficiency:
# Run cutflow analysis
python cutflow/run_cutflow.py

Features:
- Step-by-step efficiency tracking
- Event count monitoring
- Selection optimization insights
- Multi-sample comparison
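cutflow_analysis.py handles the multi-sample bookkeeping; at its core a cutflow can be expressed with named RDataFrame filters, as in this sketch (the cut values are illustrative, not the analysis selection):

# Sketch of a step-by-step cutflow with named RDataFrame filters.
# Thresholds are illustrative; the real selection lives in cutflow/cutflow_analysis.py.
import ROOT

df = ROOT.RDataFrame("tree", "data/processed/v2/v2-NewQCD.root")
cutflow = (df.Filter("HighestPt > 250", "Leading fat jet pT")
             .Filter("HT > 500", "HT")
             .Filter("HLT_AK8PFJet260", "Reference trigger"))

cutflow.Report().Print()  # per-cut event counts and efficiencies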
Located in scale_factor/, provides systematic uncertainty estimates:
# Generate efficiency ratio predictions
python scale_factor/efficiency_ratio_predictor.py

Features:
- Data/MC scale factor calculation
- Uncertainty propagation
- Integration with trigger efficiency
- Ratio prediction methods
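Conceptually, a scale factor is the data efficiency divided by the MC efficiency. The sketch below shows the simple per-bin version with binomial uncertainties (counts are illustrative); the scripts above instead predict the ratio with ML and handle the uncertainty propagation:

# Sketch of a per-bin data/MC scale factor with simple binomial errors.
# Counts are illustrative; the scale_factor/ scripts predict the ratio with ML instead.
import numpy as np

def efficiency(n_pass, n_total):
    eff = n_pass / n_total
    err = np.sqrt(eff * (1.0 - eff) / n_total)  # binomial approximation
    return eff, err

eff_data, err_data = efficiency(820.0, 1000.0)
eff_mc,   err_mc   = efficiency(900.0, 1000.0)

sf = eff_data / eff_mc
sf_err = sf * np.sqrt((err_data / eff_data) ** 2 + (err_mc / eff_mc) ** 2)
print(f"Scale factor = {sf:.3f} +/- {sf_err:.3f}")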
After running the main analysis, you'll find results in result/DD-MM-YYYY-suffix/:
- {Sample}_{Variable}_Run2_both.png - Individual sample distributions
- Distribution_MC_{Variable}.png - Combined MC sample comparisons
- Efficiency_{Sample}_{Variable}.png - Trigger efficiency plots
- Validation_{Method}_{Variable}.png - ML validation plots
- gb_{suffix}_200et0p2_gradu.sav - Gradient boosting model (MC training)
- gb_{suffix}_200et0p2_data.sav - Gradient boosting model (data training)
- Analysis summary with key metrics
- Model performance comparisons
- Calibration quality assessments
Expected Efficiency Values:
- QCD Background: 60-80% (depends on kinematic region)
- ggF Signal: 80-95% (higher efficiency due to signal characteristics)
- VBF Signal: 70-90% (moderate efficiency)
- Collision Data: 65-85% (real trigger performance)
Key Performance Metrics:
- AUC Score: >0.85 for well-trained models
- Calibration ECE: <0.05 for well-calibrated probabilities
- Validation Agreement: <5% difference between predicted and measured efficiency
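The ECE target above can be checked directly from the calibrated probabilities; scikit-learn does not ship an ECE function, so a minimal binned implementation looks like this (toy inputs for illustration):

# Sketch of the standard binned Expected Calibration Error (ECE);
# the calibration module reports ECE and MCE as part of its own workflow.
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (y_prob >= lo) & (y_prob < hi)
        if mask.any():
            # bin weight x |mean predicted probability - observed positive fraction|
            ece += mask.mean() * abs(y_prob[mask].mean() - y_true[mask].mean())
    return ece

y_true = np.array([0, 1, 1, 0, 1, 1, 0, 1])                    # toy labels
y_prob = np.array([0.1, 0.8, 0.7, 0.3, 0.9, 0.6, 0.2, 0.75])   # toy calibrated probabilities
print("ECE:", expected_calibration_error(y_true, y_prob))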
Modify the notebook configuration for specialized analyses:
# Custom trigger selection
signal_triggers = ["HLT_PFHT1050", "HLT_PFJet500"] # Subset of triggers
reference_trigger = "HLT_AK8PFJet260"
# Custom variable selection
analysis_variables = ["HighestPt", "HT", "MET_pt"] # Reduced variable set
# Custom sample selection
samples = ["QCD", "DATA"] # Only background and data# Import trigger efficiency functions
from library.trigger_efficiency_ML import *
# Load your RDataFrame
df = ROOT.RDataFrame("tree", "your_file.root")
# Apply trigger efficiency correction
efficiency_weights = predict_trigger_efficiency(df, model_path)
corrected_df = df.Define("trigger_weight", efficiency_weights)

# Process multiple data versions
versions = ["v1", "v2", "v3"]
for version in versions:
    run_analysis(version=version, save_results=True)

Error: RDataFrame unable to read file
Solution:
- Check file paths in the configuration
- Verify data version exists in data/processed/
- Ensure ROOT files are accessible
ImportError: No module named 'xgboost'
Solution:
- Install missing packages: pip install xgboost lightgbm catboost
- Use CERN CMS environment where possible
- Check Python environment activation
Memory allocation error during training
Solution:
- Reduce dataset size for testing: df = df.Range(100000)
- Use fewer features: modify the analysis_variables list
- Increase system memory or use batch processing
AttributeError: module 'ROOT' has no attribute 'RDataFrame'
Solution:
- Ensure compatible ROOT version (≥6.24)
- Check ROOT Python bindings: import ROOT; print(ROOT.__version__)
- Re-source CMS environment
AUC Score < 0.6, Poor efficiency prediction
Solution:
- Check data quality and preprocessing
- Verify trigger branch consistency
- Increase training data size
- Tune hyperparameters in model_selection/
# Use data sampling for faster development
df_sample = df.Range(50000) # Use 50k events for testing
# Enable ROOT parallelization
ROOT.ROOT.EnableImplicitMT(4)  # Use 4 threads

# Save intermediate results
df.Snapshot("processed_tree", "intermediate.root")
# Use efficient data formats
df.AsNumpy()  # Convert to NumPy for ML libraries

- Check existing documentation in component README files
- Review example scripts in each subdirectory
- Validate data files using ROOT TBrowser or similar tools
- Test with reduced datasets to isolate issues
Follow these conventions when contributing:
- Functions: Use descriptive names and comprehensive docstrings
- Variables: Follow physics naming conventions (e.g., pt, eta, phi)
- Files: Include version/date information in output filenames
- Documentation: Update README files for significant changes
- New ML Models: Add to model_selection/gbdt_comparison_framework.py
- New Variables: Update variable lists and plotting functions
- New Triggers: Modify trigger configuration in main notebook
- New Samples: Add sample handling in data loading functions
# Test with minimal dataset
python test_with_small_data.py
# Validate against reference results
python validate_against_baseline.py
# Check calibration quality
python calibration/test_gradientboosting_integration.py

- CMS Collaboration, "Search for Higgs boson pair production..."
- HH→bbττ analysis documentation
- CMS trigger system documentation
- Gradient Boosting: Friedman, J.H. (2001). "Greedy function approximation: A gradient boosting machine"
- Probability Calibration: Platt, J. (1999). "Probabilistic outputs for support vector machines"
- Model Selection: Bergstra, J. & Bengio, Y. (2012). "Random search for hyper-parameter optimization"
- ROOT: https://root.cern/
- XGBoost: https://xgboost.readthedocs.io/
- scikit-learn: https://scikit-learn.org/
- CMS Open Data: http://opendata.cern.ch/
Project Status: Active development for the CMS HH→bbττ trigger efficiency analysis
Last Updated: September 2025
Contact: trantuankha643@gmail.com
License: Follow CMS collaboration guidelines for data and code sharing