Skip to content

hongsam14/sigraph

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

150 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Sigraph - Malware Knowledge Graph

    ________   ________   ________
   |        | |        | |        |
   |  ◉──┐  | |  ◉──┐  | |  ◉──┐  |
   |     β”‚  | |     β”‚  | |     β”‚  |
   | sig1β”‚β”€β”€β”Όβ”€β”˜ sig2β”‚β”€β”€β”Όβ”€β”˜ sig3│──►
   |_____|  | |_____|  | |_____|   

      S    I    G    R    A    P    H


A knowledge graph-based system for detecting and analyzing behavioral patterns from System Provenance OpenTelemetry (Otel) Traces. Sigraph leverages AI agents and graph databases to provide intelligent malware analysis and threat detection capabilities.

🎯 Features

  • System Provenance Graph: Build and analyze relationships between system events using Neo4j graph database
  • AI-Powered Analysis: Integrate with multiple AI models (Google Gemini, OpenAI GPT, Ollama) for intelligent threat analysis
  • Behavioral Pattern Detection: Detect malicious behavioral patterns from system call traces
  • Vector Search: Advanced RAG (Retrieval Augmented Generation) capabilities for knowledge graph queries
  • OpenTelemetry Integration: Process system provenance data from Otel traces
  • Multi-Modal Storage: Combine graph database (Neo4j) with document search (OpenSearch)
  • REST API: FastAPI-based backend for easy integration
  • Interactive UI: Streamlit-based chat interface for querying malware reports
  • Docker Support: Fully containerized deployment with Docker Compose

πŸ—οΈ Architecture

The system consists of four main components:

  1. Neo4j Graph Database: Stores system provenance graph with APOC plugin support
  2. OpenSearch: Document store for syslog data with full-text search capabilities
  3. FastAPI Backend: RESTful API server for data ingestion and querying
  4. Streamlit Frontend: Interactive chat interface for malware analysis
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Streamlit UI  │────▢│  FastAPI Backend │────▢│  Neo4j Graph DB β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                               β”‚
                               β”‚
                               β–Ό
                        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                        β”‚  OpenSearch  β”‚
                        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ“‹ Prerequisites

  • Docker and Docker Compose
  • Python 3.13+ (for local development)
  • API keys for AI models (Google Gemini, OpenAI, or Ollama setup)

πŸš€ Installation

Using Docker Compose (Recommended)

  1. Clone the repository

    git clone https://github.com/enki-polvo/malware-knowledge-graph.git
    cd malware-knowledge-graph
  2. Create environment file

    cp env.sample .env
  3. Edit .env file with your configuration

    # Required settings
    NEO4J_USER=neo4j
    NEO4J_PASSWORD=YourSecurePassword123#
    
    # AI Model configuration (choose one)
    AI_MODEL=gemini-1.5-flash
    # AI_MODEL=gpt-4o-mini
    # AI_MODEL=gpt-oss-120b
    
    AI_API_KEY=your_api_key_here
    AI_CHUNK_SIZE=400
    AI_OVERLAP=40
  4. Create required volume directories

    mkdir -p volume/{logs,config,data,plugins,opensearch,app-logs}
  5. Start the services

    docker-compose up -d
  6. Verify services are running

    docker-compose ps

Local Development Setup

  1. Install Python dependencies

    pip install -r requirements.txt
  2. Set up environment variables

    export NEO4J_URI=localhost
    export NEO4J_USER=neo4j
    export NEO4J_PASSWORD=YourPassword
    export OPENSEARCH_URI=localhost
    export BACKEND_URI=localhost
    export BACKEND_PORT=8765
  3. Run the backend server

    cd src
    python backend_app.py
  4. Run the Streamlit UI (in a separate terminal)

    cd src
    streamlit run streamlit_app.py

βš™οΈ Configuration

Environment Variables

Variable Description Default Required
NEO4J_URI Neo4j database URI neo4j Yes
NEO4J_USER Neo4j username neo4j Yes
NEO4J_PASSWORD Neo4j password - Yes
OPENSEARCH_URI OpenSearch URI opensearch Yes
OPENSEARCH_INDEX OpenSearch index name syslog_index Yes
BACKEND_URI Backend server host 0.0.0.0 Yes
BACKEND_PORT Backend server port 8765 Yes
AI_MODEL AI model to use - Optional
AI_REALTIME_MODEL Real-time AI model - Optional
AI_CHUNK_SIZE Text chunk size for AI 400 Optional
AI_OVERLAP Overlap size for chunks 40 Optional
AI_API_KEY API key for AI service - Optional

Ports

  • 7474: Neo4j Browser (HTTP)
  • 7687: Neo4j Bolt protocol
  • 9200: OpenSearch REST API
  • 5601: OpenSearch Dashboards
  • 8765: FastAPI Backend

πŸ“– Usage

Starting the System

docker-compose up -d

Accessing the Interfaces

API Examples

1. Post System Call Event

curl -X POST "http://localhost:8765/api/v1/db/syscall" \
  -H "Content-Type: application/json" \
  -d '{
    "trace_id": "trace123",
    "span_id": "span456",
    "unit_id": "550e8400-e29b-41d4-a716-446655440000",
    "system_provenance": "PROCESS_LAUNCH",
    "timestamp": "2024-01-01T12:00:00Z",
    "weight": 1,
    "process_name": "malicious.exe"
  }'

2. Post Syslog Data

curl -X POST "http://localhost:8765/api/v1/db/syslog" \
  -H "Content-Type: application/json" \
  -d '[{
    "unit_id": "550e8400-e29b-41d4-a716-446655440000",
    "trace_id": "trace123",
    "timestamp": "2024-01-01T12:00:00Z",
    "message": "Process execution detected"
  }]'

3. Query System Provenance

curl -X GET "http://localhost:8765/api/v1/db/unit/550e8400-e29b-41d4-a716-446655440000/provenance"

4. Analyze Behavior with AI

curl -X POST "http://localhost:8765/api/v1/ai/analyze" \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What malicious behaviors were detected in trace123?"
  }'

5. Get Syslog Sequence

curl -X GET "http://localhost:8765/api/v1/db/syslog/sequence/550e8400-e29b-41d4-a716-446655440000/trace123"

Using the Streamlit Interface

  1. Navigate to http://localhost:8501
  2. Enter your query in the chat input (e.g., "What suspicious network activities were detected?")
  3. View the AI-generated response with supporting context
  4. Toggle "Show context" to see the retrieved knowledge graph data

πŸ“ Project Structure

malware-knowledge-graph/
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ ai/                      # AI agent modules
β”‚   β”‚   β”œβ”€β”€ ai_agent.py          # Main AI agent with LangChain integration
β”‚   β”‚   β”œβ”€β”€ ai_court.py          # Multi-agent debate system
β”‚   β”‚   β”œβ”€β”€ output_format.py     # Response formatting
β”‚   β”‚   └── prompt.py            # AI prompts for different stages
β”‚   β”œβ”€β”€ app/
β”‚   β”‚   β”œβ”€β”€ backend/             # FastAPI backend
β”‚   β”‚   β”‚   β”œβ”€β”€ api.py           # Main API router
β”‚   β”‚   β”‚   └── v1/
β”‚   β”‚   β”‚       └── api.py       # V1 API endpoints (DB & AI)
β”‚   β”‚   β”œβ”€β”€ streamlit/           # Streamlit frontend
β”‚   β”‚   β”‚   └── utils.py         # UI utilities
β”‚   β”‚   └── config.py            # Application configuration
β”‚   β”œβ”€β”€ db/                      # Database layer
β”‚   β”‚   β”œβ”€β”€ db_model.py          # Syslog data models
β”‚   β”‚   β”œβ”€β”€ db_session.py        # OpenSearch session manager
β”‚   β”‚   └── exceptions.py        # Database exceptions
β”‚   β”œβ”€β”€ graph/                   # Graph database layer
β”‚   β”‚   β”œβ”€β”€ graph_model.py       # Graph node models
β”‚   β”‚   β”œβ”€β”€ graph_session.py     # Neo4j session manager
β”‚   β”‚   β”œβ”€β”€ graph_client/        # Graph query client
β”‚   β”‚   β”œβ”€β”€ graph_element/       # Graph element definitions
β”‚   β”‚   └── provenance/          # System provenance types
β”‚   β”œβ”€β”€ backend_app.py           # FastAPI application entry point
β”‚   β”œβ”€β”€ streamlit_app.py         # Streamlit application entry point
β”‚   β”œβ”€β”€ Dockerfile               # Backend container definition
β”‚   └── gunicorn.conf.py         # Gunicorn configuration
β”œβ”€β”€ neo4j/
β”‚   β”œβ”€β”€ Dockerfile               # Neo4j with APOC plugin
β”‚   └── apoc-5.22.0-core.jar     # APOC plugin JAR
β”œβ”€β”€ volume/                      # Docker volumes (gitignored)
β”œβ”€β”€ docker-compose.yml           # Multi-container orchestration
β”œβ”€β”€ requirements.txt             # Python dependencies
β”œβ”€β”€ env.sample                   # Environment variables template
└── README.md                    # This file

πŸ› οΈ Technologies

Backend

  • FastAPI: Modern, high-performance web framework
  • Uvicorn/Gunicorn: ASGI server for production deployment
  • Pydantic: Data validation and settings management
  • Loguru: Advanced logging library

Database & Storage

  • Neo4j 5.x: Graph database with APOC plugin
  • OpenSearch 2.13: Search and analytics engine
  • py2neo: Neo4j Python driver
  • opensearch-py: OpenSearch Python client

AI & Machine Learning

  • LangChain: Framework for LLM application development
  • LangChain Neo4j: Neo4j integration for LangChain
  • Google Gemini: AI model integration (via langchain-google-genai)
  • OpenAI: GPT model integration (via langchain-openai)
  • Ollama: Local LLM support (via langchain-ollama)
  • Vector Stores: Neo4jVector for embedding-based retrieval

Frontend

  • Streamlit: Interactive web UI framework
  • Altair: Declarative visualization library

DevOps

  • Docker & Docker Compose: Containerization and orchestration
  • Python-dotenv: Environment variable management

πŸ”’ Security Considerations

  1. Change default passwords: Always update the NEO4J_PASSWORD in production
  2. API Keys: Store AI API keys securely and never commit them to version control
  3. Network isolation: Use Docker networks to isolate services
  4. HTTPS: Consider using a reverse proxy (nginx/traefik) for HTTPS in production
  5. OpenSearch security: The current setup disables security plugins for development; enable them in production

πŸ§ͺ Testing

Run the existing tests:

python -m pytest src/graph/provenance/tests/

πŸ“Š System Provenance Types

The system supports various provenance action types:

Process Actions

  • LAUNCH: Process creation
  • REMOTE_THREAD: Remote thread injection
  • ACCESS: Process access
  • TAMPERING: Process tampering

Network Actions

  • CONNECT: Network connection
  • ACCEPT: Accepting network connection

File Actions

  • CREATE: File creation
  • RENAME: File rename
  • DELETE: File deletion
  • MODIFY: File modification
  • RAW_ACCESS_READ: Raw disk access

Registry Actions

  • REG_ADD: Add registry key
  • REG_DELETE: Delete registry key
  • REG_SET: Set registry value
  • REG_RENAME: Rename registry key
  • REG_QUERY: Query registry

🀝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

πŸ“„ License

This project is licensed under the GNU General Public License v2.0 - see the LICENSE file for details.

πŸ™ Acknowledgments

  • Neo4j APOC library for graph algorithms
  • LangChain for LLM orchestration framework
  • OpenSearch for powerful search capabilities

πŸ“ž Support

For issues, questions, or contributions, please open an issue on the GitHub repository.


Note: This project is designed for malware analysis and threat detection research. Use responsibly and in accordance with applicable laws and regulations.

About

It is a knowledge base that detects behavioral patterns from nanolite-agent

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors