==============================================================================
███╗ ██╗ ██████╗ ██████╗ ███████╗███████╗███████╗███╗ ██╗███████╗███████╗
████╗ ██║██╔═══██╗██╔══██╗██╔════╝██╔════╝██╔════╝████╗ ██║██╔════╝██╔════╝
██╔██╗ ██║██║ ██║██║ ██║█████╗ ███████╗█████╗ ██╔██╗ ██║███████╗█████╗
██║╚██╗██║██║ ██║██║ ██║██╔══╝ ╚════██║██╔══╝ ██║╚██╗██║╚════██║██╔══╝
██║ ╚████║╚██████╔╝██████╔╝███████╗███████║███████╗██║ ╚████║███████║███████╗
╚═╝ ╚═══╝ ╚═════╝ ╚═════╝ ╚══════╝╚══════╝╚══════╝╚═╝ ╚═══╝╚══════╝╚══════╝
=============================================================================
N O D E   M O N I T O R I N G   P L A T F O R M
NodeSense is a platform for continuously monitoring container and VM nodes. Every node runs an agent that collects system metrics (CPU consumption, memory usage, active processes, etc.) and continuously sends them to a central collection service.
This platform is structured as a Docker Swarm stack and provides services for data collection, aggregation, visualization, and alerting.
- Continuous Metric Collection: An agent (Node Agent) runs on each node to collect vital system metrics (CPU, RAM, IO) and sends them to the Metrics Collector via REST.
- High Availability API Gateway: User access is managed through an API Gateway that is replicated so that if one instance fails, the others take over the traffic without interruption.
- Distributed Rate Limiting: A distributed rate-limiting mechanism is implemented using a Redis Cluster.
- Data Persistence and Analysis:
  - Collected data is aggregated and saved in TimescaleDB (a time-series database) for analysis.
  - Metrics are exposed to Prometheus and Grafana at the same time.
- Visualization and Dashboards: Grafana provides dashboards and visualization.
- Alerting Service: An Alerting Service notifies administrators when critical thresholds are exceeded (e.g., unusually high CPU consumption, unresponsive node).
- Single Sign-On (SSO): User authentication and authorization are handled via a Keycloak service.
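The Node Agent described in the feature list above could look roughly like the following minimal sketch. It uses only the Python standard library. The collector URL, the payload field names, and the function names are assumptions made for illustration; the README does not specify the real agent's wire format.

```python
import json
import os
import socket
import time
import urllib.request

# Hypothetical ingestion URL; the real path and port are not given in this README.
COLLECTOR_URL = "http://127.0.0.1:8000/ingest"


def build_payload(node_id: str, metrics: dict[str, float]) -> dict:
    """Wrap raw readings in a timestamped envelope (field names are assumptions)."""
    return {
        "node_id": node_id,
        "timestamp": time.time(),
        "metrics": [{"name": k, "value": float(v)} for k, v in sorted(metrics.items())],
    }


def read_metrics() -> dict[str, float]:
    """Collect a few cheap readings using only the standard library."""
    readings = {"cpu_count": float(os.cpu_count() or 0)}
    if hasattr(os, "getloadavg"):  # not available on Windows
        readings["load_1m"] = os.getloadavg()[0]
    return readings


def send(payload: dict, url: str = COLLECTOR_URL, timeout: float = 5.0) -> int:
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Print the payload instead of sending it, so the sketch runs without a collector.
    print(json.dumps(build_payload(socket.gethostname(), read_metrics()), indent=2))
```

A real agent would call `send` in a loop with a fixed sleep interval and would likely read CPU and RAM usage from a dedicated library rather than the load average.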
The platform is implemented as a Docker Swarm stack consisting of the following containers/services:
| Component | Role |
|---|---|
| Keycloak | Authentication and authorization (SSO) |
| API Gateway (Replicated) | Routing, token validation, rate-limiting |
| Redis Cluster | Storage for distributed rate limiting |
| Node Agent | Collecting metrics from nodes |
| Metrics Collector (Replicated) | Aggregation and processing of metrics |
| TimescaleDB | Storing metrics |
| Prometheus | Collecting metrics exposed by the system |
| Grafana | Dashboards and visualization |
| Alerting Service | Generating notifications based on thresholds |
To deploy the NodeSense platform, follow these steps:
1. Clone the repository: `git clone https://github.com/mariapana/NodeSense.git`, then `cd NodeSense`.
2. Run setup and deploy the stack: `./setup.sh`, then `./deploy.sh`.
3. Access the platform:
   - Frontend UI: Open http://127.0.0.1:3002 in your browser.
     - Login: Use `admin`/`admin` (or the credentials you set).
     - Dashboard: View active nodes and real-time metrics.
     - System Topology: (Admin only) View Docker Swarm services and logs.
   - Grafana: Open http://127.0.0.1:3001
     - Login: `admin`/`admin`
     - View dashboards in `Dashboards/NodeSense Metrics`.
   - Keycloak: Open http://127.0.0.1:8080 for user management.
4. Verify implementation: Run the comprehensive test suite to verify all features (Auth, Security, Alerting, Rate Limiting): `./test_suite.sh admin admin [client_secret]`
5. Clean up: `./cleanup.sh`
The platform utilizes a dedicated Keycloak service to handle Single Sign-On (SSO), authentication and authorization. The integration is designed to be fully automated, secure, and reproducible, eliminating the need for manual configuration via the Keycloak UI.
Key Implementation Features:
- Automated Realm Import: Keycloak is configured to automatically import the `NodeSense` realm upon startup using the `--import-realm` argument.
- Secure Secret Management:
  - Template-Based: The realm configuration is maintained in a version-controlled template (`keycloak/import/NodeSense-realm.template.json`) which contains no secrets.
  - Dynamic Generation: During the deployment process (`deploy.sh`), a helper script (`keycloak/generate-realm-json.sh`) injects the sensitive credentials (admin password, viewer password, and API Gateway client secret) provided interactively by the user.
  - Result: This generates a transient `NodeSense-realm.json` file used for the actual import, ensuring that secrets are never committed to the repository.
- Predefined Access Control:
  - Users: Automatically provisions `admin` and `viewer` users.
  - Roles: Assigns specific roles (`admin`, `viewer`) that are embedded in the JWT tokens for downstream service authorization.
- Streamlined UX: The configuration overrides the default user profile to remove the mandatory "Update Account Information" step, ensuring a seamless login flow.
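The template-plus-injection approach described above can be illustrated with a short Python sketch. This is not the project's `generate-realm-json.sh` (which is a shell script not shown here); the placeholder names, the template fragment, and the `api-gateway` client ID are all hypothetical. The point is the mechanism: substitute secrets into a secret-free template, escape them for JSON, and validate the result before Keycloak imports it.

```python
import json
from string import Template

# Hypothetical template fragment; the real NodeSense-realm.template.json is not shown here.
REALM_TEMPLATE = """
{
  "realm": "NodeSense",
  "users": [
    {"username": "admin", "credentials": [{"type": "password", "value": "${ADMIN_PASSWORD}"}]},
    {"username": "viewer", "credentials": [{"type": "password", "value": "${VIEWER_PASSWORD}"}]}
  ],
  "clients": [{"clientId": "api-gateway", "secret": "${GATEWAY_CLIENT_SECRET}"}]
}
"""


def render_realm(template_text: str, secrets: dict[str, str]) -> str:
    """Substitute placeholders, JSON-escaping each secret, and validate the result."""
    # json.dumps(...)[1:-1] escapes quotes and backslashes without the outer quotes.
    escaped = {key: json.dumps(value)[1:-1] for key, value in secrets.items()}
    # substitute() raises KeyError if the template names a secret that was not supplied.
    rendered = Template(template_text).substitute(escaped)
    json.loads(rendered)  # fail early if the output is not valid JSON
    return rendered
```

Escaping matters because a password containing a double quote or backslash would otherwise produce an invalid realm file and a failed import at startup.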
NodeSense uses TimescaleDB to persist monitoring metrics in an optimized time-series format.
Design Overview:
- A dedicated TimescaleDB service is deployed as part of the Docker Swarm stack.
- The database schema is automatically created at startup using SQL initialization scripts.
- Monitoring data is organized using a normalized model:
  - `nodes` - represents monitored nodes (VMs or containers)
  - `metrics` - a time-series hypertable storing metric values over time
Schema Highlights:
- The `metrics` table is defined as a hypertable, enabling efficient queries over time intervals.
- Indexes are created for `(node_id, time DESC)` to optimize common access patterns.
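As a rough illustration of the schema described above, the following Python module holds initialization statements and one example query. Only the table names, the hypertable, and the `(node_id, time DESC)` index come from this README; the column names, types, and the index name are assumptions, and the project's actual SQL initialization scripts may differ.

```python
# Column names and types are assumptions; only the table names, the hypertable,
# and the (node_id, time DESC) index come from the description above.
SCHEMA_STATEMENTS = [
    """
    CREATE TABLE IF NOT EXISTS nodes (
        id   SERIAL PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    )
    """,
    """
    CREATE TABLE IF NOT EXISTS metrics (
        time    TIMESTAMPTZ NOT NULL,
        node_id INTEGER NOT NULL REFERENCES nodes (id),
        name    TEXT NOT NULL,
        value   DOUBLE PRECISION NOT NULL
    )
    """,
    # create_hypertable is TimescaleDB's function for partitioning a table by time.
    "SELECT create_hypertable('metrics', 'time', if_not_exists => TRUE)",
    "CREATE INDEX IF NOT EXISTS idx_metrics_node_time ON metrics (node_id, time DESC)",
]


def recent_metrics_sql(minutes: int) -> str:
    """Build a query for one node's latest readings; $1 is the node id parameter."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return (
        "SELECT time, name, value FROM metrics "
        f"WHERE node_id = $1 AND time > now() - INTERVAL '{int(minutes)} minutes' "
        "ORDER BY time DESC"
    )
```

The query filters on `node_id` and orders by `time DESC`, which is the access pattern the composite index is meant to serve.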
The API Gateway acts as the central entry point for all incoming traffic, ensuring security, scalability, and control.
Implementation Details:
- Technology Stack: Built with FastAPI (Python) for high performance and async capabilities.
- Security:
  - JWT Validation: Enforces authentication by validating JSON Web Tokens (JWT) issued by Keycloak.
  - Role-Based Access Control (RBAC): Restricts sensitive endpoints (e.g., `DELETE`, System Topology) to users with the `admin` role.
- Rate Limiting:
  - Implements a Distributed Rate Limiting algorithm using a Redis Cluster backend.
  - Limits traffic to 1000 requests per minute per client IP to prevent abuse.
- System Integration:
  - Mounts the Docker Socket (`/var/run/docker.sock`) to query Swarm state (services, replicas).
  - Proxies metric ingestion requests to the Collector service via internal Docker DNS.
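The README does not say which rate-limiting algorithm the Gateway uses, so the sketch below assumes a simple fixed-window counter. To keep it runnable without a Redis Cluster, an in-memory dictionary stands in for Redis; the class name and key layout are hypothetical.

```python
import time
from collections import defaultdict
from typing import Optional


class FixedWindowLimiter:
    """Fixed-window counter; a dict stands in for Redis so the sketch runs anywhere."""

    def __init__(self, limit: int = 1000, window_seconds: int = 60):
        self.limit = limit
        self.window_seconds = window_seconds
        self._counts: dict[tuple[str, int], int] = defaultdict(int)

    def allow(self, client_ip: str, now: Optional[float] = None) -> bool:
        """Count the request and report whether it fits in the current window."""
        now = time.time() if now is None else now
        window = int(now // self.window_seconds)
        # Drop counters from past windows (Redis would do this with key expiry).
        for key in [k for k in self._counts if k[1] < window]:
            del self._counts[key]
        # With Redis this would be an atomic INCR on a per-IP, per-window key,
        # followed by EXPIRE the first time the key is created.
        self._counts[(client_ip, window)] += 1
        return self._counts[(client_ip, window)] <= self.limit
```

Moving the counter into Redis is what makes the limit distributed: every Gateway replica increments the same key, so the 1000 requests per minute budget is shared rather than applied per replica.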
The Collector is responsible for high-throughput ingestion and persistence of monitoring data.
Implementation Details:
- High Performance: Uses FastAPI and asyncpg (asynchronous PostgreSQL driver) to handle concurrent write operations efficiently.
- Data Pipeline:
  - Validates incoming JSON payloads against a strict Pydantic schema.
  - Updates real-time Prometheus gauges (`node_metric`, `node_last_seen`) for scraping.
  - Persists normalized data into TimescaleDB using transactional writes.
- Resiliency: Designed to be stateless and horizontally scalable (replicated).
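The validation step of the pipeline above can be sketched as follows. To stay dependency-free, this example uses a standard-library dataclass as a stand-in for the Pydantic schema the Collector actually uses; the field names and the `parse_sample` helper are assumptions, not the project's real model.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class MetricSample:
    """One validated reading (hypothetical shape, not the project's real schema)."""
    node_id: str
    name: str
    value: float
    time: datetime


def parse_sample(payload: dict) -> MetricSample:
    """Validate one raw JSON object and return a typed sample, or raise ValueError."""
    allowed = {"node_id", "name", "value", "timestamp"}
    unknown = set(payload) - allowed
    missing = allowed - set(payload)
    if unknown or missing:  # strict: extra fields are rejected too
        raise ValueError(f"unknown={sorted(unknown)} missing={sorted(missing)}")
    for field in ("node_id", "name"):
        if not isinstance(payload[field], str) or not payload[field]:
            raise ValueError(f"{field} must be a non-empty string")
    for field in ("value", "timestamp"):
        number = payload[field]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(number, bool) or not isinstance(number, (int, float)):
            raise ValueError(f"{field} must be a number")
    return MetricSample(
        node_id=payload["node_id"],
        name=payload["name"],
        value=float(payload["value"]),
        time=datetime.fromtimestamp(payload["timestamp"], tz=timezone.utc),
    )
```

Rejecting malformed payloads before touching the database keeps bad rows out of the hypertable and lets the service return a clear client error instead of failing mid-transaction.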
The Alerting Service continuously monitors the system for critical anomalies and ensures administrators are notified.
Implementation Details:
- Polling Engine: A Python-based persistent service that queries TimescaleDB at fixed intervals (every 60s).
- Detection Rules:
  - High CPU: Triggers when CPU usage exceeds 90%.
  - Node Down: Triggers when a node has not reported metrics for more than 2 minutes.
- Persistence: Alerts are stored in a dedicated `alerts` table for auditing and UI retrieval.
- Logging: Outputs structured warning logs for integration with external log aggregators.
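The two detection rules above can be expressed as a small pure function, which makes the thresholds easy to test in isolation. This is a sketch under assumptions: the function name, the alert labels, and the choice to skip the CPU check for a node that is already considered down are illustrative, not taken from the service's source.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

CPU_THRESHOLD_PERCENT = 90.0            # "exceeds 90%"
NODE_DOWN_AFTER = timedelta(minutes=2)  # "more than 2 minutes"


def evaluate_node(cpu_percent: Optional[float], last_seen: datetime, now: datetime) -> list[str]:
    """Return the alert kinds that apply to one node at time `now`."""
    if now - last_seen > NODE_DOWN_AFTER:
        # The last CPU reading is stale for a silent node, so only report it as down.
        return ["node_down"]
    if cpu_percent is not None and cpu_percent > CPU_THRESHOLD_PERCENT:
        return ["high_cpu"]
    return []


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(evaluate_node(95.0, now - timedelta(seconds=30), now))
```

In the polling loop, a function like this would be called once per node every 60 seconds with the latest values read from TimescaleDB, and any returned alert would be written to the `alerts` table and logged.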
The Frontend provides a modern, responsive user interface for visualizing the platform's state.
Implementation Details:
- Tech Stack: Built with React, Vite, and TailwindCSS for rapid development and optimized builds.
- Features:
  - Real-Time Dashboard: Displays a grid of active nodes with status indicators (Online/Offline) and last-seen timestamps.
  - System Topology: Visualizes Docker Swarm services, including replica counts and image versions (Admin only).
  - Log Viewer: Live access to service logs via the Gateway API.
  - Live Simulator: Integrated tool to spawn virtual nodes/metrics directly from the browser for testing.
  - Alert Notifications: Polling-based Toast notifications for immediate alert visibility.
