A comprehensive implementation and exploration of differential privacy techniques using IBM's Differential Privacy Library (Diffprivlib).
This project demonstrates the implementation of differential privacy mechanisms for machine learning and data analytics tasks. Differential privacy provides mathematical guarantees for protecting individual privacy while still allowing meaningful statistical analysis and machine learning on datasets.
Differential privacy adds carefully calibrated noise to data or algorithms to ensure that the output doesn't reveal information about any individual in the dataset, while maintaining the overall utility of the data for analysis.
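As a toy sketch of this calibration (plain NumPy, separate from Diffprivlib), a count query with sensitivity 1 can be privatized by adding Laplace noise with scale sensitivity/ε:

import numpy as np

epsilon = 1.0      # privacy parameter
sensitivity = 1.0  # a count changes by at most 1 when one record is added or removed
true_count = 42    # hypothetical query result

# Laplace noise with scale sensitivity/epsilon gives epsilon-differential privacy
noisy_count = true_count + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
print(noisy_count)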
- Privacy Mechanisms: Implementation of core differential privacy mechanisms including Laplace, Gaussian, and Exponential mechanisms
- Machine Learning Models: Differentially private versions of common ML algorithms (classification, regression, clustering)
- Privacy Budget Management: Tools for tracking and managing privacy budget across multiple operations
- Data Analysis Tools: Differentially private statistical functions and data exploration utilities
- Educational Examples: Step-by-step tutorials and examples for learning differential privacy concepts
- Python 3.7 or higher
- pip or conda package manager
# Install the IBM Differential Privacy Library
pip install diffprivlib
# Install additional dependencies
pip install numpy pandas scikit-learn matplotlib jupyter
# Libraries for private language model fine-tuning
pip install torch transformers datasets opacus

# Clone the repository
git clone https://github.com/phantom2810/IBM-Differential-Privacy.git
cd IBM-Differential-Privacy
pip install -r requirements.txt

Here's a simple example of training a differentially private classifier:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from diffprivlib.models import GaussianNB
# Load dataset
dataset = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.2
)
# Train a differentially private model; epsilon controls the privacy/utility
# trade-off (without explicit bounds, diffprivlib infers them from the data
# and emits a PrivacyLeakWarning)
dp_clf = GaussianNB(epsilon=1.0)
dp_clf.fit(X_train, y_train)
# Evaluate
accuracy = dp_clf.score(X_test, y_test)
print(f"Differentially private accuracy: {accuracy}")You can also fine-tune Hugging Face language models with differential privacy using the
train_dp_llm function:
from dp_llm import train_dp_llm

model_dir = train_dp_llm(
    model_name="gpt2",                  # Hugging Face model to fine-tune
    dataset_path="data/my_corpus.txt",  # plain-text training corpus
    epsilon=5.0,                        # target privacy budget (epsilon)
    delta=1e-5,                         # target failure probability (delta)
    max_grad_norm=1.0,                  # per-sample gradient clipping for DP-SGD
)
print("Model saved to", model_dir)The building blocks of differential privacy:
- Laplace Mechanism: Adds Laplace noise for ε-differential privacy
- Gaussian Mechanism: Adds Gaussian noise for (ε,δ)-differential privacy (see the example after this list)
- Exponential Mechanism: Selects outputs based on a utility function
- Sparse Vector Technique: Efficiently answers many queries
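For example, a minimal use of Diffprivlib's Gaussian mechanism (the classical Gaussian mechanism requires 0 < ε ≤ 1, hence ε = 0.5 here; the value 42.0 is just a placeholder query result):

from diffprivlib.mechanisms import Gaussian

# (epsilon, delta)-differentially private noise for a query with sensitivity 1
mech = Gaussian(epsilon=0.5, delta=1e-5, sensitivity=1.0)
private_value = mech.randomise(42.0)
print(private_value)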
Differentially private implementations of popular algorithms:
- Classification: Naive Bayes, Logistic Regression, Random Forest (see the sketch after this list)
- Regression: Linear Regression with objective perturbation
- Clustering: K-means with privacy guarantees
- Dimensionality Reduction: Principal Component Analysis (PCA)
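As a sketch, a differentially private logistic regression follows the familiar scikit-learn API (the data_norm value below is an assumed bound on the L2 norm of each Iris feature row, not computed from the data):

from sklearn import datasets
from sklearn.model_selection import train_test_split
from diffprivlib.models import LogisticRegression

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# data_norm bounds each row's L2 norm; rows exceeding it are clipped (with a warning)
dp_lr = LogisticRegression(epsilon=1.0, data_norm=12.0)
dp_lr.fit(X_train, y_train)
print(f"DP logistic regression accuracy: {dp_lr.score(X_test, y_test)}")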
- BudgetAccountant: Track privacy loss across multiple operations (see the sketch after this list)
- Composition Theorems: Calculate total privacy loss using advanced composition
- Privacy Analysis: Tools for analyzing privacy guarantees
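A minimal sketch of manual budget tracking (the context-manager style appears in the budget example further below):

from diffprivlib.accountant import BudgetAccountant

acc = BudgetAccountant(epsilon=1.0, delta=0)

# Record two operations against the budget
acc.spend(epsilon=0.3, delta=0)
acc.spend(epsilon=0.5, delta=0)

print(acc.total())      # privacy budget spent so far
print(acc.remaining())  # budget left for future operations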
IBM-Differential-Privacy/
├── examples/ # Tutorial notebooks and examples
│ ├── basic_mechanisms.ipynb # Introduction to DP mechanisms
│ ├── ml_models.ipynb # Differentially private ML
│ └── privacy_budget.ipynb # Budget management examples
├── src/ # Source code
│ ├── mechanisms/ # Custom mechanism implementations
│ ├── models/ # Extended ML model implementations
│ ├── utils/ # Utility functions
│ └── analysis/ # Privacy analysis tools
├── datasets/ # Sample datasets for examples
├── tests/ # Unit tests
├── docs/ # Documentation
├── requirements.txt # Python dependencies
└── README.md # This file
Explore fundamental differential privacy mechanisms:
from diffprivlib.mechanisms import Laplace

actual_count = 42  # true (non-private) query result

# Add Laplace noise to the query result
mechanism = Laplace(epsilon=1.0, sensitivity=1.0)
private_result = mechanism.randomise(actual_count)

Perform statistical analysis with privacy guarantees:
import numpy as np
from diffprivlib.tools import histogram, mean

data = np.random.randint(0, 100, size=1000)  # example data

# Compute a differentially private histogram
dp_hist = histogram(data, epsilon=1.0, bins=10)

# Calculate a differentially private mean within known bounds
# (bounds limit the sensitivity; without them diffprivlib warns)
dp_mean = mean(data, epsilon=0.5, bounds=(0, 100))

Track privacy expenditure across multiple operations:
from diffprivlib.accountant import BudgetAccountant
from diffprivlib.tools import mean, var

accountant = BudgetAccountant(epsilon=1.0, delta=1e-5)

# Operations run inside the context draw from this accountant's budget
# (reusing the data array from the previous example)
with accountant:
    dp_mean = mean(data, epsilon=0.3, bounds=(0, 100))
    dp_var = var(data, epsilon=0.7, bounds=(0, 100))

print(f"Remaining budget: {accountant.remaining()}")

Understanding key differential privacy parameters:
- ε (epsilon): Privacy parameter; smaller values mean stronger privacy (see the worked example after this list)
- δ (delta): Failure probability; typically set well below 1/n, where n is the dataset size
- Sensitivity: Maximum change in the output when one record is added or removed
- Privacy Budget: Total privacy loss allowed across all operations
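As a quick worked example of the trade-off: for the Laplace mechanism the noise scale is b = sensitivity/ε, so for a counting query (sensitivity 1):

sensitivity = 1.0  # one record changes a count by at most 1
for epsilon in (0.1, 1.0, 10.0):
    print(f"epsilon={epsilon}: Laplace noise scale b = {sensitivity / epsilon}")
# epsilon=0.1 -> b=10.0 (strong privacy, noisy answers)
# epsilon=10  -> b=0.1  (weak privacy, accurate answers)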
This project demonstrates differential privacy applications in:
- Healthcare Data: Analyzing medical records while protecting patient privacy
- Financial Analytics: Processing transaction data with privacy guarantees
- Survey Research: Publishing statistics from sensitive surveys
- Machine Learning: Training models on private data for deployment
- Government Statistics: Creating public datasets from census/administrative data
This project is based on research and implementations from:
- IBM Research: Diffprivlib library and associated research papers
- Academic Literature: Foundational papers on differential privacy
- Industry Applications: Real-world use cases and best practices
Key papers:
- Dwork, C. (2006). Differential Privacy. ICALP
- Holohan, N. et al. (2019). Diffprivlib: The IBM Differential Privacy Library. arXiv:1907.02444
We welcome contributions to this project! Please see our contributing guidelines:
- Fork the repository
- Create a feature branch (git checkout -b feature/new-mechanism)
- Commit your changes (git commit -am 'Add new privacy mechanism')
- Push to the branch (git push origin feature/new-mechanism)
- Create a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- IBM Research for developing and open-sourcing the Differential Privacy Library
- Differential Privacy Research Community for foundational work and ongoing research
- Contributors who have helped improve and extend this project
For questions, suggestions, or collaboration opportunities:
- GitHub Issues: Use the issue tracker for bug reports and feature requests
- Email: [Contact information]
- Academic Collaboration: Open to research partnerships and academic projects
- IBM Differential Privacy Library Documentation
- Differential Privacy Explained
- OpenDP: Open Source Differential Privacy
- Google's Differential Privacy Library
This project is for educational and research purposes. For production use of differential privacy, please consult with privacy experts and consider formal privacy audits.