An AI-powered Android security solution to detect malicious applications using advanced machine learning techniques
Developed by: Varnit Kumar, MCA Student at GGSIPU, Dwarka
Inspired by: MSc thesis at Lisbon Institute of Engineering (ISEL)
Research Paper: "Malware Detection in Android Applications with Machine Learning Techniques"
This repository presents a malware detection system for Android applications built on machine learning. It combines static feature analysis, feature selection, and several classification algorithms to classify Android APKs as malicious or benign.
- Develop a robust malware detection system with high accuracy
- Implement explainable AI for transparent decision-making
- Provide a scalable solution for real-world Android security
- Contribute to cybersecurity research and education
- Static Feature Extraction - Comprehensive APK analysis without execution
- Multiple ML Models - SVM, Random Forest, XGBoost, and Neural Networks
- Feature Selection - Advanced dimensionality reduction techniques
- Explainable AI (XAI) - SHAP and LIME integration for model interpretability
- Performance Metrics - Accuracy, Precision, Recall, F1-Score analysis
- Real-time Processing - Efficient APK classification pipeline
- Multiple Dataset Support - Tested on various public malware datasets
- Visualization Tools - Interactive charts and model performance graphs
The problem is formulated as binary classification: the aim is to classify a given Android application as malicious (positive) or benign (negative). The components that make up the proposed approach, or that enable its assessment, are briefly described next.
Machine Learning module - Builds, improves, and evaluates the ML model that classifies Android applications as benign or malicious.
Feature extraction module - Extracts static features from an Android application's Android Package Kit (APK) file and maps them to the features deemed most indicative of malware. This mapping produces the input data for the model, which then predicts whether the application is benign or malicious.
Android applications - Real-world apps used to assess the developed prototype.
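As a rough illustration of the mapping step described above, the sketch below turns the static features found in one APK into a fixed-length binary vector that a classifier could consume. The vocabulary, the feature names, and the `to_feature_vector` helper are assumptions made for illustration, not the repository's actual code.

```python
from typing import Iterable, List

# Hypothetical vocabulary of features kept after feature selection.
SELECTED_FEATURES: List[str] = [
    "android.permission.SEND_SMS",
    "android.permission.READ_CONTACTS",
    "android.permission.INTERNET",
    "api_call::getDeviceId",
]


def to_feature_vector(extracted: Iterable[str], vocabulary: List[str]) -> List[int]:
    """Return 1 for each vocabulary feature present in the APK, else 0."""
    present = set(extracted)
    return [1 if name in present else 0 for name in vocabulary]


# Features as they might come out of static analysis of one APK (made-up values).
apk_features = [
    "android.permission.INTERNET",
    "android.permission.SEND_SMS",
    "some.unrelated.feature",
]
vector = to_feature_vector(apk_features, SELECTED_FEATURES)
print(vector)  # [1, 0, 1, 0]
```

Features outside the vocabulary are ignored, so every APK yields a vector of the same length regardless of how many raw features were extracted.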
The system follows a binary classification approach to categorize Android applications as malicious (positive) or benign (negative).
┌─────────────────┐    ┌──────────────────────┐    ┌─────────────────┐
│  Android APK    │───▶│  Feature Extraction  │───▶│  ML Classifier  │
│  Applications   │    │        Module        │    │     Module      │
└─────────────────┘    └──────────────────────┘    └─────────────────┘
                                  │                         │
                                  ▼                         ▼
                       ┌──────────────────────┐    ┌─────────────────┐
                       │  Feature Selection   │    │   Prediction    │
                       │   & Preprocessing    │    │  & XAI Output   │
                       └──────────────────────┘    └─────────────────┘
| Component | Description |
|---|---|
| Machine Learning Module | Builds, trains, and evaluates ML models for binary classification |
| Feature Extraction Module | Extracts static features from APK files using advanced parsing techniques |
| Feature Selection Module | Applies dimensionality reduction and selects most relevant features |
| Explainable AI Module | Provides interpretable explanations for model decisions |
| Evaluation Module | Comprehensive performance assessment and validation |
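The README does not say which criterion the Feature Selection Module uses. As one possible filter-style approach, the sketch below ranks binary features by their mutual information with the class label. The function names and the toy data are assumptions for illustration only.

```python
import math
from typing import List, Sequence


def mutual_information(feature: Sequence[int], labels: Sequence[int]) -> float:
    """Mutual information (in bits) between a binary feature and binary labels."""
    n = len(labels)
    mi = 0.0
    for f in (0, 1):
        for y in (0, 1):
            joint = sum(1 for a, b in zip(feature, labels) if a == f and b == y) / n
            p_f = sum(1 for a in feature if a == f) / n
            p_y = sum(1 for b in labels if b == y) / n
            if joint > 0:  # joint > 0 implies both marginals are > 0
                mi += joint * math.log2(joint / (p_f * p_y))
    return mi


def rank_features(rows: List[List[int]], labels: List[int], names: List[str]) -> List[str]:
    """Return feature names sorted from most to least informative."""
    scores = {
        name: mutual_information([row[i] for row in rows], labels)
        for i, name in enumerate(names)
    }
    return sorted(names, key=lambda name: scores[name], reverse=True)


# Toy data: label 1 = malicious, 0 = benign.
# Column 0 matches the label exactly; column 1 is independent of it.
rows = [[1, 1], [1, 0], [0, 1], [0, 0]]
labels = [1, 1, 0, 0]
ranking = rank_features(rows, labels, ["SEND_SMS", "INTERNET"])
print(ranking)  # ['SEND_SMS', 'INTERNET']
```

Keeping only the top-ranked names from such a ranking is one way to shrink a very large static feature space before training.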
Our system has been trained and tested on multiple public datasets to ensure robustness:
| Dataset | Description | Size | Features |
|---|---|---|---|
| Drebin | Comprehensive Android malware dataset | 15,036 samples | 545,333 features |
| CICAndMal2017 | Android permission-based dataset | 426,000+ samples | Permission features |
| Android Malware (AM) | General Android malware collection | 25,000+ samples | Mixed features |
| AMSF | Android static features dataset | 10,000+ samples | 6 feature categories |
- Python 3.10+ - Primary programming language
- scikit-learn - Machine learning framework
- NumPy & Pandas - Data manipulation and analysis
- Matplotlib & Seaborn - Data visualization
- SHAP & LIME - Explainable AI
- PyCharm - Primary IDE
- Jupyter Notebook - Interactive development
- Android Studio - Android app development
- Androguard - APK analysis and feature extraction
- XGBoost - Gradient boosting framework
- TensorFlow/Keras - Deep learning models
- Plotly - Interactive visualizations
- Joblib - Model serialization
- Python 3.10 or higher
- pip package manager
- Git
- 4GB+ RAM recommended
- Clone the repository
git clone https://github.com/vannu07/Android-Malware-Detection.git
cd Android-Malware-Detection
- Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies
pip install -r requirements.txt
- Download datasets (optional)
# Download sample datasets
python scripts/download_datasets.py

# Run the main detection system
python malware_detection.py
# Analyze a specific APK
python detect_single_apk.py --apk_path /path/to/app.apk
# Train a new model
python train_model.py --dataset drebin --model random_forest

# Run with custom configuration
python malware_detection.py --config config/custom_config.yaml
# Batch processing
python batch_process.py --input_dir /path/to/apks --output_dir /path/to/results
# Model evaluation
python evaluate_model.py --model_path models/best_model.pkl --test_data data/test_set.csv

| Model | Accuracy | Precision | Recall | F1-Score | Training Time |
|---|---|---|---|---|---|
| Random Forest | 97.2% | 96.8% | 97.1% | 96.9% | 2.3s |
| SVM | 95.8% | 95.2% | 96.1% | 95.6% | 5.7s |
| XGBoost | 98.1% | 97.9% | 98.2% | 98.0% | 3.1s |
| Neural Network | 96.5% | 96.1% | 96.8% | 96.4% | 12.4s |
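For reference, the four quality metrics in the table can be derived from confusion-matrix counts, with malicious (1) as the positive class. The sketch below is a minimal standalone version. The `classification_metrics` helper and the sample labels are illustrative assumptions, not the repository's evaluation code.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 with malicious (1) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # malware flagged as malware
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # benign passed as benign
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign flagged as malware
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # malware missed

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels: 3 true positives, 3 true negatives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In malware detection, recall reflects how much malware is caught, while precision reflects how often a benign app is wrongly flagged, so both are worth reading alongside accuracy.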
Top 10 most important features for malware detection:
- Suspicious API calls
- Permission requests
- Network activity patterns
- File system operations
- Cryptographic operations
- Intent filters
- Service declarations
- Receiver components
- Content providers
- Application signatures
AndroidMalwareDetection/
├── data/
│   ├── raw/                    # Raw dataset files
│   ├── processed/              # Processed feature files
│   └── models/                 # Trained model files
├── src/
│   ├── feature_extraction/     # Feature extraction modules
│   ├── models/                 # ML model implementations
│   ├── evaluation/             # Model evaluation scripts
│   └── utils/                  # Utility functions
├── notebooks/                  # Jupyter notebooks for analysis
├── config/                     # Configuration files
├── scripts/                    # Automation scripts
├── tests/                      # Unit tests
├── docs/                       # Documentation
├── requirements.txt            # Python dependencies
├── setup.py                    # Package setup
└── README.md                   # This file
Run the test suite to ensure everything works correctly:
# Run all tests
python -m pytest tests/
# Run specific test categories
python -m pytest tests/test_feature_extraction.py
python -m pytest tests/test_models.py
# Run with coverage
python -m pytest --cov=src tests/

We welcome contributions! Here's how you can help:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
- Bug fixes and improvements
- Documentation enhancements
- New feature extraction methods
- Additional ML models
- More comprehensive testing
- UI/UX improvements
- Follow PEP 8 guidelines
- Use meaningful variable names
- Add docstrings to functions
- Include type hints where appropriate
This project builds on and contributes to the following research:
- INForum 2023: "On the Use of ML for Malware Detection"
- RECPAD 2023: "Role of Feature Selection in Malware Detection"
- MDPI Information Journal 2024: "Explainable Machine Learning for Android Malware Detection"
@misc{kumar2024android,
title={Android Malware Detection with Machine Learning},
author={Kumar, Varnit},
year={2024},
publisher={GitHub},
journal={GitHub repository},
howpublished={\url{https://github.com/vannu07/Android-Malware-Detection}}
}

This project is licensed under the MIT License - see the LICENSE file for details.
- Catarina Palma - Original MSc thesis author at ISEL, Lisbon
- Prof. Artur Ferreira - Academic supervisor and research guidance
- GGSIPU Faculty - Educational support and mentorship
- Kaggle Community - Open-source datasets and collaborative environment
- Open Source Contributors - Libraries and tools that made this possible
- Lisbon Institute of Engineering (ISEL) - Original research foundation
- Guru Gobind Singh Indraprastha University - Academic support
- Developer: Varnit Kumar
- Email: varnit.kumar@example.com
- LinkedIn: linkedin.com/in/varnit-kumar-0883bb251
- GitHub: github.com/vannu07
- Star this repository if you find it useful
- Report bugs via GitHub Issues
- Suggest features or improvements
- Share with others in the cybersecurity community
- Follow updates on social media
⚡ Made with ❤️ by Varnit Kumar | MCA Student at GGSIPU
"Securing the digital world, one APK at a time"
Run Project (no backend API)
- Create venv & install deps
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
- Run tests (unit tests use mocks so they are fast)
pytest -q
- Run the project: start `main.py` and the frontend dev server together
# Start main training (runs in foreground if you run directly)
python main.py --dataset Datasets/Drebin_v1.csv --algorithm KNN
# Or use the helper to run main in background and start frontend dev server
./run_all.sh

Note: the FastAPI backend was removed per repository preference. The frontend does not communicate directly with main.py in this setup; the two run concurrently so you can view the UI while main.py logs appear in main.log when using run_all.sh.
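The `--algorithm KNN` option in the command above selects a k-nearest-neighbours classifier. How `main.py` implements it is not shown here, so the following is only a minimal sketch of the general idea on binary feature vectors. The Hamming distance, `k = 3`, and the toy training set are all assumptions.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions where two equal-length binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict(train: List[Tuple[Sequence[int], int]], sample: Sequence[int], k: int = 3) -> int:
    """Predict the majority label among the k training vectors closest to sample."""
    neighbours = sorted(train, key=lambda item: hamming(item[0], sample))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Toy training set of (feature vector, label) pairs: 1 = malicious, 0 = benign.
train = [
    ([1, 1, 0, 1], 1),
    ([1, 1, 1, 1], 1),
    ([1, 0, 0, 1], 1),
    ([0, 0, 1, 0], 0),
    ([0, 0, 0, 0], 0),
    ([0, 1, 1, 0], 0),
]

print(knn_predict(train, [1, 1, 0, 0]))  # 1
print(knn_predict(train, [0, 1, 0, 0]))  # 0
```

An odd `k` avoids tied votes in the binary case. A library implementation would typically be preferred for real datasets, since this sketch compares the sample against every training vector.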
