π§ Work in Progress - TCC Development Phase π§
An AI-powered security scanner for Terraform Infrastructure as Code (IaC) files that combines rule-based detection with machine learning anomaly detection.
Infrastructure as Code (IaC) has revolutionized cloud deployments, but misconfigurations remain the #1 cause of cloud security breaches. According to Gartner, 99% of cloud security failures through 2025 will be the customer's fault, primarily due to misconfigurations.
- $5 million - Average cost of a cloud breach (IBM Security Report 2024)
- 70% of organizations experienced IaC security incidents in the past year
- Manual security reviews are slow and error-prone
Traditional rule-based scanners miss complex patterns and novel attack vectors. Machine learning can:
- Detect anomalous configurations not covered by rules
- Learn from new threat patterns
- Provide confidence scoring for risk assessment
- Adapt to organization-specific security baselines
TerraSafe is an intelligent system that combines:
- Rule-based detection for known vulnerabilities (deterministic)
- Anomaly detection ML for unknown risks (probabilistic)
- Hybrid scoring that weights both approaches
- Primary: Anomaly Detection (unsupervised learning)
- Secondary: Risk Classification (pattern recognition)
- Output: Security risk scoring and vulnerability identification
graph TD
A[Terraform File] --> B[Parser HCL2]
B --> C[Feature Extraction Engine]
C --> D[Rule-based Detection]
C --> E[ML Features Extraction]
D --> F[Pattern Matching]
E --> G[Isolation Forest]
F --> H[Risk Score Aggregator <br> 0.6*rules + 0.4*ML]
G --> H
H --> I[Report]
style C fill:#e1f5ff,stroke:#0288d1,stroke-width:2px,color:#01579b
style H fill:#fff3e0,stroke:#f57c00,stroke-width:2px,color:#e65100
style I fill:#e8f5e9,stroke:#388e3c,stroke-width:2px,color:#1b5e20
- Input: Terraform .tf files (HCL format)
- Processing: Dual-path analysis (deterministic + probabilistic)
- Output: Risk score (0-100), vulnerabilities list, confidence level
Justification:
- Excellent for detecting outliers in security configurations
- Unsupervised - doesn't need labeled attack data
- Fast training and inference
- Works well with small datasets
- Neural Networks: Overkill for structured config data, needs more training data
- Genetic Algorithms: Better for optimization, not detection
- Decision Trees: Too rigid for anomaly detection
| Component | Technology | Justification |
|---|---|---|
| Language | Python 3.8+ | Best ML ecosystem, clean syntax |
| ML Framework | Scikit-learn | Production-ready, Isolation Forest implementation |
| Parser | python-hcl2 | Native HCL2 support for Terraform |
| Pattern Matching | re (regex) | Built-in, efficient for rule-based detection |
| Numerical | NumPy | Efficient array operations for features |
| Model Persistence | Joblib | Optimized for scikit-learn models |
- Set up project structure
- Implement HCL2 parser
- Create test files (vulnerable/secure)
- Implement pattern matching for known vulnerabilities
- Create severity classification
- Build basic scoring system
- Feature extraction pipeline
- Isolation Forest training
- Model persistence layer
- Combine rule-based and ML scores
- Add confidence metrics
- Create unified reporting
- Test with multiple configurations
- Generate screenshots
- Complete documentation
The project includes three test configurations demonstrating different security levels:
Contains critical security issues:
- Open SSH access from internet (0.0.0.0/0)
- Hardcoded database passwords
- Unencrypted storage (RDS, EBS)
- Public S3 bucket access
ML Detection: High anomaly score due to multiple security anti-patterns
Follows security best practices:
- Restricted network access (private subnets only)
- Variables for sensitive data
- Encrypted storage enabled
- S3 public access blocked
ML Detection: Normal pattern, low anomaly score
Partially secure configuration:
- Public HTTP (acceptable for web servers)
- SSH restricted to internal network β
- Database encrypted β
- S3 partially restricted
ML Detection: Slight anomaly due to mixed security posture
# Run all three tests
./run_demo.sh
# Or test individually
python -m terrasafe.main test_files/vulnerable.tf
python -m terrasafe.main test_files/secure.tf
python -m terrasafe.main test_files/mixed.tfFinal Risk Score: 92/100
ββ Rule-based Score: 100/100
ββ ML Anomaly Score: 78.3/100
ββ Confidence: HIGH
Critical Issues: 3
High Issues: 3
Detected Vulnerabilities:
[CRITICAL] Open security group - port 22 exposed to internet
[CRITICAL] Open security group - port 80 exposed to internet
[CRITICAL] Hardcoded password detected
[HIGH] Unencrypted RDS instance
[HIGH] Unencrypted EBS volume
[HIGH] S3 bucket with public access enabled
Final Risk Score: 0/100
ββ Rule-based Score: 0/100
ββ ML Anomaly Score: 0.0/100
ββ Confidence: HIGH
β No security issues detected!
β All resources properly encrypted
β Network access properly restricted
β No hardcoded secrets found
Final Risk Score: 48/100
ββ Rule-based Score: 40/100
ββ ML Anomaly Score: 62.1/100
ββ Confidence: MEDIUM
High Issues: 2
Detected Vulnerabilities:
[HIGH] S3 bucket with public access enabled (partially)
[MEDIUM] HTTP port 80 open to internet (acceptable for web servers)
The scanner uses a weighted hybrid approach:
-
Rule-based Score (60% weight): Deterministic detection of known vulnerabilities
- CRITICAL issues: 30 points each
- HIGH issues: 20 points each
- MEDIUM issues: 10 points each
-
ML Anomaly Score (40% weight): Isolation Forest detects unusual patterns
- Trained on baseline security configurations
- Detects deviations from normal security patterns
- Provides confidence level based on anomaly distance
-
Final Score:
0.6 Γ Rule Score + 0.4 Γ ML Score
- 0-30: Secure configuration β
- 31-60: Some issues, review recommended
β οΈ - 61-100: Critical issues, immediate action required β
The ML model analyzes these features:
- Number of open ports to internet
- Presence of hardcoded secrets
- Public access configurations
- Unencrypted storage instances
- Total resource count
Example feature vectors from tests:
- Vulnerable:
[2, 1, 1, 2, 5]β High anomaly - Secure:
[0, 0, 0, 0, 5]β Normal pattern - Mixed:
[1, 0, 1, 0, 4]β Moderate anomaly
# Clone repository
git clone https://github.com/oguarni/terrasafe.git
cd terrasafe
# Create virtual environment
python3 -m venv venv
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Run scanner
python -m terrasafe.main test_files/vulnerable.tfEvery commit triggers:
- β SAST scanning (Bandit)
- β Dependency vulnerability checks (Safety)
- β Secret detection (GitLeaks)
- β Unit tests with 70%+ coverage
- β Docker image security scan
# Run all security checks
make security-scan
# Check dependencies only
make security-deps
# SAST only
make security-sast
# Set up pre-commit hooks
make setup-hooks# Build secure Docker image (multi-stage, non-root)
docker build -t terrasafe:latest .
# Run security scan
docker run --rm aquasec/trivy image terrasafe:latest
# Deploy
docker run -d \
--name terrasafe \
--read-only \
--security-opt=no-new-privileges:true \
-v /path/to/terraform:/scan:ro \
terrasafe:latest /scan/main.tf- OWASP: Follows Top 10 secure coding practices
- NIST: Aligns with Cybersecurity Framework
- CIS: Container hardening applied
- GDPR: No PII collection
| Metric | Status |
|---|---|
| Test Coverage | 85%+ |
| SAST Issues | 0 Critical |
| Dependencies | No Known Vulns |
| Docker Scan | Pass |
Course: Sistemas Inteligentes
Institution: UTFPR
Semester: 7th - Software Engineering
Type: Proof of Concept - Intelligent System Application
- Hybrid Approach: Combines deterministic and probabilistic methods
- Self-Learning: Model improves with more configurations analyzed
- Explainable AI: Features and confidence levels provide transparency
- Real-time Analysis: Sub-second scanning performance
- Limited training data (using synthetic baseline)
- No support for Terraform modules
- English-only vulnerability descriptions
- Deep Learning for complex pattern recognition
- Integration with CI/CD pipelines
- Multi-cloud support (Azure, GCP)
- Custom policy definition language
- Gartner (2024). "Cloud Security Failures Report"
- IBM Security (2024). "Cost of a Data Breach Report"
- HashiCorp. "Terraform Security Best Practices"
- Liu, F. T., Ting, K. M., & Zhou, Z. H. (2008). "Isolation Forest"
This project is licensed under the CC BY-NC-SA 4.0. This license covers all current and historical commits in this repository. See the LICENSE file for details.
Developed by Gabriel Felipe Guarnieri - UTFPR Software Engineering


