Explainable AI methods often fail under adversarial attacks, creating trust and compliance risks in critical applications such as healthcare, finance, and autonomous systems.
This project developed a framework to evaluate and improve the resilience of model explanations, reducing misleading explanations under attack by 30% compared to baseline methods.
- Challenge: Organizations increasingly rely on AI to make high-stakes decisions, but explanations provided by these models are often fragile under adversarial inputs, leading to potential regulatory and operational risks.
- Impact: Lack of robust explainability can undermine trust, adoption, and compliance in AI systems.
- Stress-tested popular explainability methods (LIME, SHAP) against adversarial attacks (a minimal sketch of this step follows this list).
- Designed adversarial defense strategies to stabilize explanation outputs (a smoothing-based illustration follows the tool list below).
- Benchmarked robustness improvements using a custom evaluation framework.
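A minimal sketch of the stress-test step, assuming a small TensorFlow 2 Keras CNN on MNIST-shaped inputs, an FGSM attack from the Adversarial Robustness Toolbox, and SHAP's `GradientExplainer`. The architecture, the `eps` budget, and the placeholder data are illustrative assumptions, not the project's actual configuration.

```python
"""Sketch: compare SHAP attributions on clean vs. FGSM-perturbed inputs.
Assumes TensorFlow 2.x, shap, and adversarial-robustness-toolbox are installed."""
import numpy as np
import tensorflow as tf
import shap
from art.estimators.classification import TensorFlowV2Classifier
from art.attacks.evasion import FastGradientMethod

# Toy CNN classifier (stand-in for the project's model).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Wrap the model so ART can attack it.
art_clf = TensorFlowV2Classifier(
    model=model,
    nb_classes=10,
    input_shape=(28, 28, 1),
    loss_object=tf.keras.losses.SparseCategoricalCrossentropy(),
    clip_values=(0.0, 1.0),
)

x_clean = np.random.rand(8, 28, 28, 1).astype(np.float32)      # placeholder batch
x_adv = FastGradientMethod(art_clf, eps=0.1).generate(x=x_clean)

# SHAP attributions for the clean and adversarial versions of the batch.
background = np.random.rand(16, 28, 28, 1).astype(np.float32)  # placeholder background
explainer = shap.GradientExplainer(model, background)
sv_clean = explainer.shap_values(x_clean)
sv_adv = explainer.shap_values(x_adv)

def per_class(sv, class_idx):
    """SHAP returns either a list (one array per class) or a single array
    with a trailing class axis, depending on version; handle both."""
    return sv[class_idx] if isinstance(sv, list) else sv[..., class_idx]

pred = model.predict(x_clean).argmax(axis=1)
attr_clean = np.stack([per_class(sv_clean, c)[i] for i, c in enumerate(pred)])
attr_adv = np.stack([per_class(sv_adv, c)[i] for i, c in enumerate(pred)])
print("mean |Δ attribution| under attack:", np.abs(attr_clean - attr_adv).mean())
```

Attributions are compared for the class predicted on the clean input, which isolates how much the explanation itself drifts under perturbation, independent of any label flip.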
- Data: Benchmark image datasets subjected to adversarial perturbations.
- Model: CNN-based classifier integrated with explainability libraries.
- Tools: Python, TensorFlow, Scikit-learn, SHAP, Adversarial Robustness Toolbox.
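The defense bullet above can be illustrated with a SmoothGrad-style stabilization: averaging attributions over several noise-perturbed copies of the input. This is a commonly used stabilization idea and not necessarily the exact defense implemented in `src/`; the function names, noise scale, and toy explainer below are hypothetical.

```python
"""Sketch: stabilize an explanation by averaging over noisy input copies
(a SmoothGrad-style defense; the project's actual defenses may differ)."""
import numpy as np

def smoothed_attribution(explain_fn, x, n_samples=16, noise_scale=0.05, seed=0):
    """Average the attribution map returned by `explain_fn` over Gaussian-noised
    copies of `x`. `explain_fn(batch) -> attributions` can be any explainer
    wrapper, e.g. a closure around the SHAP explainer from the sketch above."""
    rng = np.random.default_rng(seed)
    noisy = x[None, ...] + rng.normal(0.0, noise_scale, size=(n_samples, *x.shape))
    attrs = np.stack([explain_fn(sample[None, ...])[0] for sample in noisy])
    return attrs.mean(axis=0)

# Usage with a toy explainer (elementwise product with fixed weights),
# included only to show the call pattern:
w = np.random.default_rng(1).normal(size=(28, 28, 1))
toy_explain = lambda batch: np.repeat(w[None, ...], len(batch), axis=0) * batch
x = np.random.rand(28, 28, 1)
stable_attr = smoothed_attribution(toy_explain, x)
print(stable_attr.shape)  # (28, 28, 1)
```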
- Reduced misleading explanations under attack scenarios by 30% (an illustrative way to quantify this rate is sketched after this list).
- Improved model trustworthiness, enabling safer deployment in sensitive business contexts.
- Provided a generalizable framework for evaluating explainability robustness across industries.
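The 30% reduction above was measured with the project's own evaluation framework. As an illustration only, one plausible way to operationalize a "misleading explanation" is a low top-k overlap between clean and adversarial attribution maps; the metric, `k`, and threshold below are assumptions, not the project's definition.

```python
"""Sketch: count an explanation as 'misleading' when the top-k attributed
pixels on the adversarial input barely overlap with those on the clean input."""
import numpy as np

def topk_overlap(attr_a, attr_b, k=50):
    """Fraction of the k highest-|attribution| positions shared by two maps."""
    top_a = set(np.argsort(np.abs(attr_a).ravel())[-k:])
    top_b = set(np.argsort(np.abs(attr_b).ravel())[-k:])
    return len(top_a & top_b) / k

def misleading_rate(attrs_clean, attrs_adv, k=50, threshold=0.5):
    """Share of samples whose clean/adversarial top-k overlap falls below the threshold."""
    overlaps = [topk_overlap(a, b, k) for a, b in zip(attrs_clean, attrs_adv)]
    return float(np.mean([o < threshold for o in overlaps]))

# Toy usage with random attribution maps:
rng = np.random.default_rng(0)
attrs_clean = rng.normal(size=(8, 28, 28, 1))
attrs_adv = attrs_clean + rng.normal(scale=0.5, size=(8, 28, 28, 1))
print(misleading_rate(attrs_clean, attrs_adv))
```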
- Extend the framework to multi-modal models (text, image, and tabular inputs).
- Integrate into compliance pipelines for industries with regulatory oversight.
- Partner with teams focusing on AI governance and responsible AI.
```
├── data/        # Sample adversarial datasets
├── notebooks/   # Experiments & analysis
├── src/         # Core implementation
├── results/     # Evaluation outputs
└── README.md    # This file
```
- Explainability without robustness is not enough for production AI.
- Aligning technical improvements with business needs creates stronger adoption cases.