🛡️ AI-powered Terraform security scanner combining rule-based detection with ML anomaly detection using Isolation Forest. Detects hardcoded secrets, open ports, unencrypted storage, and misconfigurations in IaC files.

oguarni/terrasafe

TerraSafe - Intelligent Terraform Security Scanner

🚧 Work in Progress - TCC Development Phase 🚧

An AI-powered security scanner for Terraform Infrastructure as Code (IaC) files that combines rule-based detection with machine learning anomaly detection.

1. Problem Definition

Context

Infrastructure as Code (IaC) has transformed cloud deployments, but misconfigurations remain a leading cause of cloud security breaches. Gartner predicted that through 2025, 99% of cloud security failures would be the customer's fault, primarily due to misconfigurations.

Importance

  • $5 million - Average cost of a cloud breach (IBM Security Report 2024)
  • 70% of organizations experienced IaC security incidents in the past year
  • Manual security reviews are slow and error-prone

Why AI is Appropriate

Traditional rule-based scanners miss complex patterns and novel attack vectors. Machine learning can:

  • Detect anomalous configurations not covered by rules
  • Learn from new threat patterns
  • Provide confidence scoring for risk assessment
  • Adapt to organization-specific security baselines

🎯 2. Proposed Solution

TerraSafe is an intelligent system that combines:

  • Rule-based detection for known vulnerabilities (deterministic)
  • Anomaly detection ML for unknown risks (probabilistic)
  • Hybrid scoring that weights both approaches

Task Classification

  • Primary: Anomaly Detection (unsupervised learning)
  • Secondary: Risk Classification (pattern recognition)
  • Output: Security risk scoring and vulnerability identification

🏗️ 3. Solution Architecture

graph TD
    A[Terraform File] --> B[Parser HCL2]
    B --> C[Feature Extraction Engine]
    
    C --> D[Rule-based Detection]
    C --> E[ML Features Extraction]
    
    D --> F[Pattern Matching]
    E --> G[Isolation Forest]
    
    F --> H[Risk Score Aggregator <br> 0.6*rules + 0.4*ML]
    G --> H
    
    H --> I[Report]

    style C fill:#e1f5ff,stroke:#0288d1,stroke-width:2px,color:#01579b
    style H fill:#fff3e0,stroke:#f57c00,stroke-width:2px,color:#e65100
    style I fill:#e8f5e9,stroke:#388e3c,stroke-width:2px,color:#1b5e20

Input/Output Specification

  • Input: Terraform .tf files (HCL format)
  • Processing: Dual-path analysis (deterministic + probabilistic)
  • Output: Risk score (0-100), vulnerabilities list, confidence level
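
The output described above could be modeled as a small data structure. This is a minimal sketch: `Vulnerability` and `ScanReport` are hypothetical names chosen for illustration, not TerraSafe's actual classes. The sample values come from the mixed-configuration results reported in this README.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass(frozen=True)
class Vulnerability:
    """One finding (hypothetical shape, for illustration only)."""
    severity: str  # e.g. "CRITICAL", "HIGH", "MEDIUM"
    message: str


@dataclass
class ScanReport:
    """Hypothetical container for the three documented outputs."""
    risk_score: int  # 0-100
    confidence: str  # e.g. "HIGH", "MEDIUM"
    vulnerabilities: List[Vulnerability] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Enforce the documented 0-100 range.
        if not 0 <= self.risk_score <= 100:
            raise ValueError("risk_score must be between 0 and 100")


report = ScanReport(
    risk_score=48,
    confidence="MEDIUM",
    vulnerabilities=[
        Vulnerability("HIGH", "S3 bucket with public access enabled (partially)"),
        Vulnerability("MEDIUM", "HTTP port 80 open to internet"),
    ],
)
report_dict = asdict(report)  # plain dict, ready for json.dumps
```

Validating the range in `__post_init__` keeps an out-of-range score from reaching the report stage unnoticed.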

🤖 4. Algorithm Choice

Selected: Isolation Forest (Anomaly Detection)

Justification:

  • Excellent for detecting outliers in security configurations
  • Unsupervised - doesn't need labeled attack data
  • Fast training and inference
  • Works well with small datasets
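
To make the idea concrete, here is a simplified, standard-library-only sketch of the isolation principle from Liu et al. (2008). It is not the project's implementation (the stack table lists scikit-learn for that). As simplifying assumptions, every tree is built from all training rows instead of a random subsample, and splits are drawn only from features that still vary within a node. The baseline rows are invented for illustration. The two scored vectors are the vulnerable and secure examples from the Feature Vector Analysis section.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def average_path_length(n):
    """c(n): expected path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(rows, depth, max_depth, rng):
    """Recursively split rows on a random feature and a random threshold."""
    n_features = len(rows[0]) if rows else 0
    # Only features that still vary in this node can separate points.
    varying = [d for d in range(n_features) if len({r[d] for r in rows}) > 1]
    if depth >= max_depth or not varying:
        return {"size": len(rows)}  # leaf
    dim = rng.choice(varying)
    lo, hi = min(r[dim] for r in rows), max(r[dim] for r in rows)
    split = rng.uniform(lo, hi)
    left = [r for r in rows if r[dim] < split]
    right = [r for r in rows if r[dim] >= split]
    return {
        "dim": dim,
        "split": split,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def path_length(tree, x, depth=0):
    """Depth at which x lands, plus c(leaf size) for the unbuilt subtree."""
    if "size" in tree:
        return depth + average_path_length(tree["size"])
    branch = "left" if x[tree["dim"]] < tree["split"] else "right"
    return path_length(tree[branch], x, depth + 1)


def fit_forest(rows, n_trees=100, seed=0):
    rng = random.Random(seed)
    max_depth = math.ceil(math.log2(max(len(rows), 2)))
    return [build_tree(rows, 0, max_depth, rng) for _ in range(n_trees)]


def anomaly_score(forest, x, n_train):
    """Score in (0, 1]; values closer to 1 mean x is isolated sooner."""
    mean_depth = sum(path_length(t, x) for t in forest) / len(forest)
    return 2.0 ** (-mean_depth / average_path_length(n_train))


# Invented baseline: mostly clean configurations with a few mild deviations.
baseline = [[0, 0, 0, 0, 3 + i % 6] for i in range(60)]
baseline += [[1, 0, 0, 0, 4], [1, 0, 0, 0, 6], [0, 0, 1, 0, 5], [0, 0, 0, 1, 5]]

forest = fit_forest(baseline)
vulnerable_score = anomaly_score(forest, [2, 1, 1, 2, 5], len(baseline))
secure_score = anomaly_score(forest, [0, 0, 0, 0, 5], len(baseline))
```

In this sketch `vulnerable_score` comes out higher than `secure_score`, because the vulnerable vector is separated from the bulk of the baseline after very few splits. How TerraSafe maps its model output onto the 0-100 ML anomaly score is not covered here.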

Why Not Other Algorithms:

  • Neural Networks: Overkill for structured config data, needs more training data
  • Genetic Algorithms: Better for optimization, not detection
  • Decision Trees: Too rigid for anomaly detection

💻 5. Technology Stack

| Component | Technology | Justification |
| --- | --- | --- |
| Language | Python 3.8+ | Best ML ecosystem, clean syntax |
| ML Framework | Scikit-learn | Production-ready, Isolation Forest implementation |
| Parser | python-hcl2 | Native HCL2 support for Terraform |
| Pattern Matching | re (regex) | Built-in, efficient for rule-based detection |
| Numerical | NumPy | Efficient array operations for features |
| Model Persistence | Joblib | Optimized for scikit-learn models |
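
As an illustration of regex-based rule detection, the sketch below checks raw Terraform text for two of the issues this scanner targets. The patterns and the `find_violations` helper are hypothetical and deliberately simple; they are not TerraSafe's actual rules.

```python
import re
from typing import List, Tuple

# Hypothetical rules: a quoted literal (no "$" interpolation) assigned to a
# password/secret attribute, and a cidr_blocks list containing 0.0.0.0/0.
HARDCODED_SECRET = re.compile(
    r'\b\w*(?:password|secret)\w*\s*=\s*"[^"$]+"', re.IGNORECASE
)
OPEN_CIDR = re.compile(r'cidr_blocks\s*=\s*\[[^\]]*"0\.0\.0\.0/0"')


def find_violations(tf_text: str) -> List[Tuple[str, str]]:
    """Return (severity, message) pairs for every rule that matches."""
    findings = []
    if HARDCODED_SECRET.search(tf_text):
        findings.append(("CRITICAL", "Hardcoded password detected"))
    if OPEN_CIDR.search(tf_text):
        findings.append(("CRITICAL", "Security group open to 0.0.0.0/0"))
    return findings


vulnerable_tf = '''
resource "aws_db_instance" "db" {
  password = "SuperSecret123"
}
resource "aws_security_group" "ssh" {
  ingress {
    from_port   = 22
    to_port     = 22
    cidr_blocks = ["0.0.0.0/0"]
  }
}
'''

secure_tf = '''
resource "aws_db_instance" "db" {
  password = var.db_password
}
'''

vulnerable_findings = find_violations(vulnerable_tf)  # two CRITICAL findings
secure_findings = find_violations(secure_tf)          # empty list
```

Patterns on raw text cannot reliably tie a CIDR block to a specific port, which is one reason to run rules against the parsed HCL2 structure instead.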

🚀 6. Development Plan

Phase 1: Foundation ✅

  • Set up project structure
  • Implement HCL2 parser
  • Create test files (vulnerable/secure)

Phase 2: Rule Engine ✅

  • Implement pattern matching for known vulnerabilities
  • Create severity classification
  • Build basic scoring system

Phase 3: ML Integration ✅

  • Feature extraction pipeline
  • Isolation Forest training
  • Model persistence layer

Phase 4: Hybrid System ✅

  • Combine rule-based and ML scores
  • Add confidence metrics
  • Create unified reporting

Phase 5: Testing & Documentation ✅

  • Test with multiple configurations
  • Generate screenshots
  • Complete documentation

📊 Results & Test Files

Test Configurations

The project includes three test configurations demonstrating different security levels:

1. vulnerable.tf - High Risk (Score: 90-100)

Contains critical security issues:

  • Open SSH access from internet (0.0.0.0/0)
  • Hardcoded database passwords
  • Unencrypted storage (RDS, EBS)
  • Public S3 bucket access

ML Detection: High anomaly score due to multiple security anti-patterns

2. secure.tf - Low Risk (Score: 0-20)

Follows security best practices:

  • Restricted network access (private subnets only)
  • Variables for sensitive data
  • Encrypted storage enabled
  • S3 public access blocked

ML Detection: Normal pattern, low anomaly score

3. mixed.tf - Medium Risk (Score: 40-60)

Partially secure configuration:

  • Public HTTP (acceptable for web servers)
  • SSH restricted to internal network ✓
  • Database encrypted ✓
  • S3 partially restricted

ML Detection: Slight anomaly due to mixed security posture

Running the Tests

# Run all three tests
./run_demo.sh

# Or test individually
python -m terrasafe.main test_files/vulnerable.tf
python -m terrasafe.main test_files/secure.tf
python -m terrasafe.main test_files/mixed.tf

Actual Test Results

Test 1: Vulnerable Configuration

Final Risk Score: 92/100
├─ Rule-based Score: 100/100
├─ ML Anomaly Score: 78.3/100
└─ Confidence: HIGH

Critical Issues: 3
High Issues: 3

Detected Vulnerabilities:
[CRITICAL] Open security group - port 22 exposed to internet
[CRITICAL] Open security group - port 80 exposed to internet
[CRITICAL] Hardcoded password detected
[HIGH] Unencrypted RDS instance
[HIGH] Unencrypted EBS volume
[HIGH] S3 bucket with public access enabled

Test 2: Secure Configuration

Final Risk Score: 0/100
├─ Rule-based Score: 0/100
├─ ML Anomaly Score: 0.0/100
└─ Confidence: HIGH

✓ No security issues detected!
✓ All resources properly encrypted
✓ Network access properly restricted
✓ No hardcoded secrets found

Test 3: Mixed Configuration

Final Risk Score: 48/100
├─ Rule-based Score: 40/100
├─ ML Anomaly Score: 62.1/100
└─ Confidence: MEDIUM

High Issues: 2

Detected Vulnerabilities:
[HIGH] S3 bucket with public access enabled (partially)
[MEDIUM] HTTP port 80 open to internet (acceptable for web servers)

Understanding the Hybrid Scoring System

The scanner uses a weighted hybrid approach:

  1. Rule-based Score (60% weight): Deterministic detection of known vulnerabilities

    • CRITICAL issues: 30 points each
    • HIGH issues: 20 points each
    • MEDIUM issues: 10 points each
  2. ML Anomaly Score (40% weight): Isolation Forest detects unusual patterns

    • Trained on baseline security configurations
    • Detects deviations from normal security patterns
    • Provides confidence level based on anomaly distance
  3. Final Score: 0.6 × Rule Score + 0.4 × ML Score

Score Interpretation

  • 0-30: Secure configuration ✅
  • 31-60: Some issues, review recommended ⚠️
  • 61-100: Critical issues, immediate action required ❌
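
The weighting and the score bands above can be sketched in a few lines. The function names are hypothetical. Capping the rule score at 100 is an assumption, made because the vulnerable test reports 100/100 although its findings add up to 150 points. Treating fractional scores between bands (for example 30.5) as belonging to the higher band is also an assumption.

```python
from typing import Iterable

# Points per severity, as listed in the scoring description.
SEVERITY_POINTS = {"CRITICAL": 30, "HIGH": 20, "MEDIUM": 10}


def rule_score(severities: Iterable[str]) -> int:
    """Sum the severity points; the cap at 100 is an assumption."""
    return min(100, sum(SEVERITY_POINTS[s] for s in severities))


def hybrid_score(rule: float, ml: float) -> float:
    """Weighted combination: 60% rule-based, 40% ML anomaly score."""
    return 0.6 * rule + 0.4 * ml


def interpret(score: float) -> str:
    """Map a 0-100 score onto the three documented bands."""
    if score <= 30:
        return "secure"
    if score <= 60:
        return "review recommended"
    return "immediate action required"


# Three CRITICAL and three HIGH findings: 150 raw points, capped.
capped = rule_score(["CRITICAL"] * 3 + ["HIGH"] * 3)

# Inputs taken from the mixed-configuration results.
mixed_final = hybrid_score(40, 62.1)
mixed_band = interpret(mixed_final)
```

With the mixed-configuration inputs the weighted sum is about 48.84, which falls in the "review recommended" band.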

Feature Vector Analysis

The ML model analyzes these features:

  • Number of open ports to internet
  • Presence of hardcoded secrets
  • Public access configurations
  • Unencrypted storage instances
  • Total resource count

Example feature vectors from tests:

  • Vulnerable: [2, 1, 1, 2, 5] → High anomaly
  • Secure: [0, 0, 0, 0, 5] → Normal pattern
  • Mixed: [1, 0, 1, 0, 4] → Moderate anomaly
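
One way such a vector might be assembled from per-resource facts is sketched below. `ResourceFacts` and `to_feature_vector` are hypothetical names. The sketch also assumes that the secrets feature is a 0/1 flag ("presence") while the other features are counts. The resources listed are invented to reproduce the vulnerable example vector.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ResourceFacts:
    """Hypothetical per-resource summary produced after parsing."""
    kind: str
    open_ports: List[int] = field(default_factory=list)  # reachable from 0.0.0.0/0
    has_hardcoded_secret: bool = False
    is_public: bool = False
    is_unencrypted: bool = False


def to_feature_vector(resources: List[ResourceFacts]) -> List[int]:
    """Order follows the feature list: ports, secrets, public, unencrypted, total."""
    return [
        sum(len(r.open_ports) for r in resources),
        int(any(r.has_hardcoded_secret for r in resources)),
        sum(r.is_public for r in resources),
        sum(r.is_unencrypted for r in resources),
        len(resources),
    ]


# Invented resources chosen to mirror the vulnerable example.
vulnerable = [
    ResourceFacts("aws_security_group", open_ports=[22, 80]),
    ResourceFacts("aws_db_instance", has_hardcoded_secret=True, is_unencrypted=True),
    ResourceFacts("aws_ebs_volume", is_unencrypted=True),
    ResourceFacts("aws_s3_bucket", is_public=True),
    ResourceFacts("aws_instance"),
]
secure = [ResourceFacts("aws_instance") for _ in range(5)]

vulnerable_vector = to_feature_vector(vulnerable)
secure_vector = to_feature_vector(secure)
```

Keeping the feature order fixed matters, because a trained model interprets each position as a specific feature.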

🔧 Installation & Usage

# Clone repository
git clone https://github.com/oguarni/terrasafe.git
cd terrasafe

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Run scanner
python -m terrasafe.main test_files/vulnerable.tf

🔒 DevSecOps Features

CI/CD Security Pipeline

Every commit triggers:

  • ✅ SAST scanning (Bandit)
  • ✅ Dependency vulnerability checks (Safety)
  • ✅ Secret detection (GitLeaks)
  • ✅ Unit tests with 70%+ coverage
  • ✅ Docker image security scan

Security Scanning

# Run all security checks
make security-scan

# Check dependencies only
make security-deps

# SAST only
make security-sast

# Set up pre-commit hooks
make setup-hooks

Production Deployment

# Build secure Docker image (multi-stage, non-root)
docker build -t terrasafe:latest .

# Run security scan
docker run --rm aquasec/trivy image terrasafe:latest

# Deploy
docker run -d \
  --name terrasafe \
  --read-only \
  --security-opt=no-new-privileges:true \
  -v /path/to/terraform:/scan:ro \
  terrasafe:latest /scan/main.tf

Compliance

  • OWASP: Follows Top 10 secure coding practices
  • NIST: Aligns with Cybersecurity Framework
  • CIS: Container hardening applied
  • GDPR: No PII collection

Security Metrics

| Metric | Status |
| --- | --- |
| Test Coverage | 85%+ |
| SAST Issues | 0 Critical |
| Dependencies | No Known Vulns |
| Docker Scan | Pass |

📸 Screenshots

Vulnerable Scan

[Screenshot: Vulnerable Scan]

Secure Scan

[Screenshot: Secure Scan]

ML Model Training

[Screenshot: ML Training]

🎓 Academic Information

Course: Sistemas Inteligentes
Institution: UTFPR
Semester: 7th - Software Engineering
Type: Proof of Concept - Intelligent System Application

📈 Innovation Aspects

  1. Hybrid Approach: Combines deterministic and probabilistic methods
  2. Self-Learning: The baseline model can be retrained as more configurations are collected
  3. Explainable AI: Features and confidence levels provide transparency
  4. Real-time Analysis: Sub-second scanning performance

⚠️ Limitations & Future Work

Current Limitations

  • Limited training data (using synthetic baseline)
  • No support for Terraform modules
  • English-only vulnerability descriptions

Future Enhancements

  • Deep Learning for complex pattern recognition
  • Integration with CI/CD pipelines
  • Multi-cloud support (Azure, GCP)
  • Custom policy definition language

📚 References

  • Gartner (2024). "Cloud Security Failures Report"
  • IBM Security (2024). "Cost of a Data Breach Report"
  • HashiCorp. "Terraform Security Best Practices"
  • Liu, F. T., Ting, K. M., & Zhou, Z. H. (2008). "Isolation Forest"

License

This project is licensed under CC BY-NC-SA 4.0. The license covers all current and historical commits in this repository. See the LICENSE file for details.


Developed by Gabriel Felipe Guarnieri - UTFPR Software Engineering
