```
 ________       ________       ________
|        |     |        |     |        |
|  ████  |     |  ████  |     |  ████  |
|   █    |     |   █    |     |   █    |
| sig1───┼──►  | sig2───┼──►  | sig3───►
|________|     |________|     |________|

                S I G R A P H
```
A knowledge graph-based system for detecting and analyzing behavioral patterns from system provenance OpenTelemetry (OTel) traces. Sigraph leverages AI agents and graph databases to provide intelligent malware analysis and threat detection capabilities.
- System Provenance Graph: Build and analyze relationships between system events using Neo4j graph database
- AI-Powered Analysis: Integrate with multiple AI models (Google Gemini, OpenAI GPT, Ollama) for intelligent threat analysis
- Behavioral Pattern Detection: Detect malicious behavioral patterns from system call traces
- Vector Search: Advanced RAG (Retrieval Augmented Generation) capabilities for knowledge graph queries
- OpenTelemetry Integration: Process system provenance data from Otel traces
- Multi-Modal Storage: Combine graph database (Neo4j) with document search (OpenSearch)
- REST API: FastAPI-based backend for easy integration
- Interactive UI: Streamlit-based chat interface for querying malware reports
- Docker Support: Fully containerized deployment with Docker Compose
The system consists of four main components:
- Neo4j Graph Database: Stores system provenance graph with APOC plugin support
- OpenSearch: Document store for syslog data with full-text search capabilities
- FastAPI Backend: RESTful API server for data ingestion and querying
- Streamlit Frontend: Interactive chat interface for malware analysis
```
┌──────────────────┐     ┌───────────────────┐     ┌──────────────────┐
│   Streamlit UI   │────►│  FastAPI Backend  │────►│  Neo4j Graph DB  │
└──────────────────┘     └───────────────────┘     └──────────────────┘
                                   │
                                   ▼
                          ┌───────────────┐
                          │  OpenSearch   │
                          └───────────────┘
```
- Docker and Docker Compose
- Python 3.13+ (for local development)
- API keys for AI models (Google Gemini, OpenAI, or Ollama setup)
1. Clone the repository:

   ```bash
   git clone https://github.com/enki-polvo/malware-knowledge-graph.git
   cd malware-knowledge-graph
   ```

2. Create the environment file:

   ```bash
   cp env.sample .env
   ```

3. Edit the `.env` file with your configuration:

   ```bash
   # Required settings
   NEO4J_USER=neo4j
   NEO4J_PASSWORD=YourSecurePassword123#

   # AI Model configuration (choose one)
   AI_MODEL=gemini-1.5-flash
   # AI_MODEL=gpt-4o-mini
   # AI_MODEL=gpt-oss-120b
   AI_API_KEY=your_api_key_here
   AI_CHUNK_SIZE=400
   AI_OVERLAP=40
   ```

4. Create the required volume directories:

   ```bash
   mkdir -p volume/{logs,config,data,plugins,opensearch,app-logs}
   ```

5. Start the services:

   ```bash
   docker-compose up -d
   ```

6. Verify the services are running:

   ```bash
   docker-compose ps
   ```
To run the backend and UI locally instead of in Docker:

1. Install the Python dependencies:

   ```bash
   pip install -r requirements.txt
   ```

2. Set the environment variables:

   ```bash
   export NEO4J_URI=localhost
   export NEO4J_USER=neo4j
   export NEO4J_PASSWORD=YourPassword
   export OPENSEARCH_URI=localhost
   export BACKEND_URI=localhost
   export BACKEND_PORT=8765
   ```

3. Run the backend server:

   ```bash
   cd src
   python backend_app.py
   ```

4. Run the Streamlit UI (in a separate terminal):

   ```bash
   cd src
   streamlit run streamlit_app.py
   ```
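For local runs, the exported variables above can be gathered into a small settings helper. The sketch below uses only the standard library; `Settings` and `_env` are hypothetical names for illustration (the project's actual configuration lives in `src/app/config.py`), while the variable names and defaults follow the configuration reference:

```python
import os
from dataclasses import dataclass
from typing import Optional


def _env(name: str, default: Optional[str] = None) -> str:
    """Read an environment variable, falling back to a default if one is given."""
    value = os.environ.get(name, default)
    if value is None:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


@dataclass(frozen=True)
class Settings:
    neo4j_uri: str
    neo4j_user: str
    neo4j_password: str
    opensearch_uri: str
    backend_uri: str
    backend_port: int

    @classmethod
    def from_env(cls) -> "Settings":
        # Defaults mirror the configuration reference table; only the
        # password has no default and must be set explicitly.
        return cls(
            neo4j_uri=_env("NEO4J_URI", "neo4j"),
            neo4j_user=_env("NEO4J_USER", "neo4j"),
            neo4j_password=_env("NEO4J_PASSWORD"),
            opensearch_uri=_env("OPENSEARCH_URI", "opensearch"),
            backend_uri=_env("BACKEND_URI", "0.0.0.0"),
            backend_port=int(_env("BACKEND_PORT", "8765")),
        )
```

With the exports above in place, `Settings.from_env()` yields one immutable object you can pass around instead of reading `os.environ` at call sites.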
| Variable | Description | Default | Required |
|---|---|---|---|
| `NEO4J_URI` | Neo4j database URI | `neo4j` | Yes |
| `NEO4J_USER` | Neo4j username | `neo4j` | Yes |
| `NEO4J_PASSWORD` | Neo4j password | - | Yes |
| `OPENSEARCH_URI` | OpenSearch URI | `opensearch` | Yes |
| `OPENSEARCH_INDEX` | OpenSearch index name | `syslog_index` | Yes |
| `BACKEND_URI` | Backend server host | `0.0.0.0` | Yes |
| `BACKEND_PORT` | Backend server port | `8765` | Yes |
| `AI_MODEL` | AI model to use | - | Optional |
| `AI_REALTIME_MODEL` | Real-time AI model | - | Optional |
| `AI_CHUNK_SIZE` | Text chunk size for AI | `400` | Optional |
| `AI_OVERLAP` | Overlap size for chunks | `40` | Optional |
| `AI_API_KEY` | API key for AI service | - | Optional |
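`AI_CHUNK_SIZE` and `AI_OVERLAP` control how text is split into overlapping windows before it is handed to the AI layer. As a rough illustration of those two parameters (a standalone sketch, not the project's actual splitter, which comes from LangChain):

```python
def chunk_text(text: str, chunk_size: int = 400, overlap: int = 40) -> list:
    """Split text into windows of chunk_size characters, where each window
    shares its first `overlap` characters with the end of the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far each window advances
    return [text[i : i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]
```

With the defaults, a 1000-character report becomes three chunks, and each consecutive pair shares a 40-character overlap so context is not cut mid-sentence at chunk boundaries.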
- 7474: Neo4j Browser (HTTP)
- 7687: Neo4j Bolt protocol
- 9200: OpenSearch REST API
- 5601: OpenSearch Dashboards
- 8765: FastAPI Backend
```bash
docker-compose up -d
```

- Streamlit UI: http://localhost:8501
- FastAPI Docs: http://localhost:8765/docs
- Neo4j Browser: http://localhost:7474
- OpenSearch Dashboards: http://localhost:5601
Ingest a syscall event:

```bash
curl -X POST "http://localhost:8765/api/v1/db/syscall" \
  -H "Content-Type: application/json" \
  -d '{
    "trace_id": "trace123",
    "span_id": "span456",
    "unit_id": "550e8400-e29b-41d4-a716-446655440000",
    "system_provenance": "PROCESS_LAUNCH",
    "timestamp": "2024-01-01T12:00:00Z",
    "weight": 1,
    "process_name": "malicious.exe"
  }'
```

Ingest syslog entries:

```bash
curl -X POST "http://localhost:8765/api/v1/db/syslog" \
  -H "Content-Type: application/json" \
  -d '[{
    "unit_id": "550e8400-e29b-41d4-a716-446655440000",
    "trace_id": "trace123",
    "timestamp": "2024-01-01T12:00:00Z",
    "message": "Process execution detected"
  }]'
```

Retrieve the provenance graph for a unit:

```bash
curl -X GET "http://localhost:8765/api/v1/db/unit/550e8400-e29b-41d4-a716-446655440000/provenance"
```

Ask the AI agent a question:

```bash
curl -X POST "http://localhost:8765/api/v1/ai/analyze" \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What malicious behaviors were detected in trace123?"
  }'
```

Retrieve the syslog sequence for a unit and trace:

```bash
curl -X GET "http://localhost:8765/api/v1/db/syslog/sequence/550e8400-e29b-41d4-a716-446655440000/trace123"
```

- Navigate to http://localhost:8501
- Enter your query in the chat input (e.g., "What suspicious network activities were detected?")
- View the AI-generated response with supporting context
- Toggle "Show context" to see the retrieved knowledge graph data
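The curl examples above can also be driven from Python with the standard library. In this sketch the endpoint paths come from those examples, while `build_request` and `analyze` are hypothetical helper names; a live call assumes the backend is running on localhost:8765:

```python
import json
import urllib.request

BASE = "http://localhost:8765/api/v1"


def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for the Sigraph backend."""
    return urllib.request.Request(
        f"{BASE}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(question: str) -> dict:
    """Send a question to the AI analysis endpoint and return the JSON reply."""
    req = build_request("/ai/analyze", {"question": question})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Note that `analyze(...)` performs a live HTTP call, so it needs the stack from `docker-compose up -d` to be running.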
```
malware-knowledge-graph/
├── src/
│   ├── ai/                    # AI agent modules
│   │   ├── ai_agent.py        # Main AI agent with LangChain integration
│   │   ├── ai_court.py        # Multi-agent debate system
│   │   ├── output_format.py   # Response formatting
│   │   └── prompt.py          # AI prompts for different stages
│   ├── app/
│   │   ├── backend/           # FastAPI backend
│   │   │   ├── api.py         # Main API router
│   │   │   └── v1/
│   │   │       └── api.py     # V1 API endpoints (DB & AI)
│   │   ├── streamlit/         # Streamlit frontend
│   │   │   └── utils.py       # UI utilities
│   │   └── config.py          # Application configuration
│   ├── db/                    # Database layer
│   │   ├── db_model.py        # Syslog data models
│   │   ├── db_session.py      # OpenSearch session manager
│   │   └── exceptions.py      # Database exceptions
│   ├── graph/                 # Graph database layer
│   │   ├── graph_model.py     # Graph node models
│   │   ├── graph_session.py   # Neo4j session manager
│   │   ├── graph_client/      # Graph query client
│   │   ├── graph_element/     # Graph element definitions
│   │   └── provenance/        # System provenance types
│   ├── backend_app.py         # FastAPI application entry point
│   ├── streamlit_app.py       # Streamlit application entry point
│   ├── Dockerfile             # Backend container definition
│   └── gunicorn.conf.py       # Gunicorn configuration
├── neo4j/
│   ├── Dockerfile             # Neo4j with APOC plugin
│   └── apoc-5.22.0-core.jar   # APOC plugin JAR
├── volume/                    # Docker volumes (gitignored)
├── docker-compose.yml         # Multi-container orchestration
├── requirements.txt           # Python dependencies
├── env.sample                 # Environment variables template
└── README.md                  # This file
```
- FastAPI: Modern, high-performance web framework
- Uvicorn/Gunicorn: ASGI server for production deployment
- Pydantic: Data validation and settings management
- Loguru: Advanced logging library
- Neo4j 5.x: Graph database with APOC plugin
- OpenSearch 2.13: Search and analytics engine
- py2neo: Neo4j Python driver
- opensearch-py: OpenSearch Python client
- LangChain: Framework for LLM application development
- LangChain Neo4j: Neo4j integration for LangChain
- Google Gemini: AI model integration (via langchain-google-genai)
- OpenAI: GPT model integration (via langchain-openai)
- Ollama: Local LLM support (via langchain-ollama)
- Vector Stores: Neo4jVector for embedding-based retrieval
- Streamlit: Interactive web UI framework
- Altair: Declarative visualization library
- Docker & Docker Compose: Containerization and orchestration
- Python-dotenv: Environment variable management
- Change default passwords: Always update `NEO4J_PASSWORD` in production
- API Keys: Store AI API keys securely and never commit them to version control
- Network isolation: Use Docker networks to isolate services
- HTTPS: Consider using a reverse proxy (nginx/traefik) for HTTPS in production
- OpenSearch security: The current setup disables security plugins for development; enable them in production
Run the existing tests:
```bash
python -m pytest src/graph/provenance/tests/
```

The system supports various provenance action types:
Process:
- `LAUNCH`: Process creation
- `REMOTE_THREAD`: Remote thread injection
- `ACCESS`: Process access
- `TAMPERING`: Process tampering

Network:
- `CONNECT`: Network connection
- `ACCEPT`: Accepting network connection

File:
- `CREATE`: File creation
- `RENAME`: File rename
- `DELETE`: File deletion
- `MODIFY`: File modification
- `RAW_ACCESS_READ`: Raw disk access

Registry:
- `REG_ADD`: Add registry key
- `REG_DELETE`: Delete registry key
- `REG_SET`: Set registry value
- `REG_RENAME`: Rename registry key
- `REG_QUERY`: Query registry
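These action types map naturally onto enumerations, one per category. The sketch below is illustrative only (class names are hypothetical; the authoritative definitions live under `src/graph/provenance/`):

```python
from enum import Enum


class ProcessAction(Enum):
    LAUNCH = "LAUNCH"                    # process creation
    REMOTE_THREAD = "REMOTE_THREAD"      # remote thread injection
    ACCESS = "ACCESS"                    # process access
    TAMPERING = "TAMPERING"              # process tampering


class NetworkAction(Enum):
    CONNECT = "CONNECT"                  # outbound network connection
    ACCEPT = "ACCEPT"                    # accepting a network connection


class FileAction(Enum):
    CREATE = "CREATE"                    # file creation
    RENAME = "RENAME"                    # file rename
    DELETE = "DELETE"                    # file deletion
    MODIFY = "MODIFY"                    # file modification
    RAW_ACCESS_READ = "RAW_ACCESS_READ"  # raw disk access


class RegistryAction(Enum):
    REG_ADD = "REG_ADD"                  # add registry key
    REG_DELETE = "REG_DELETE"            # delete registry key
    REG_SET = "REG_SET"                  # set registry value
    REG_RENAME = "REG_RENAME"            # rename registry key
    REG_QUERY = "REG_QUERY"              # query registry
```

Grouping the actions this way lets ingestion code validate a `system_provenance` string (such as the `PROCESS_LAUNCH` seen in the syscall example) against a closed set instead of free-form text.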
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the GNU General Public License v2.0 - see the LICENSE file for details.
- Neo4j APOC library for graph algorithms
- LangChain for LLM orchestration framework
- OpenSearch for powerful search capabilities
For issues, questions, or contributions, please open an issue on the GitHub repository.
Note: This project is designed for malware analysis and threat detection research. Use responsibly and in accordance with applicable laws and regulations.