mariapana/NodeSense

==============================================================================
███╗   ██╗ ██████╗ ██████╗ ███████╗███████╗███████╗███╗   ██╗███████╗███████╗
████╗  ██║██╔═══██╗██╔══██╗██╔════╝██╔════╝██╔════╝████╗  ██║██╔════╝██╔════╝
██╔██╗ ██║██║   ██║██║  ██║█████╗  ███████╗█████╗  ██╔██╗ ██║███████╗█████╗
██║╚██╗██║██║   ██║██║  ██║██╔══╝  ╚════██║██╔══╝  ██║╚██╗██║╚════██║██╔══╝
██║ ╚████║╚██████╔╝██████╔╝███████╗███████║███████╗██║ ╚████║███████║███████╗
╚═╝  ╚═══╝ ╚═════╝ ╚═════╝ ╚══════╝╚══════╝╚══════╝╚═╝  ╚═══╝╚══════╝╚══════╝
=============================================================================
                           N O D E   M O N I T O R I N G   P L A T F O R M

NodeSense: Container/VM Monitoring Platform

NodeSense is a platform for continuously monitoring container and VM nodes. Every node runs an agent that collects system metrics (CPU consumption, memory usage, active processes, etc.) and continuously sends them to a central collection service.

This platform is structured as a Docker Swarm stack and provides services for data collection, aggregation, visualization, and alerting.


Key Features

  • Continuous Metric Collection: An agent (Node Agent) runs on each node to collect vital system metrics (CPU, RAM, IO) and sends them to the Metrics Collector via REST.
  • High Availability API Gateway: User access is managed through an API Gateway that is replicated so that if one instance fails, the others take over the traffic without interruption.
  • Distributed Rate Limiting: A distributed rate-limiting mechanism is implemented using a Redis Cluster.
  • Data Persistence and Analysis:
    • Collected data is aggregated and saved in TimescaleDB (a time-series database) for analysis.
    • Metrics are also exposed to Prometheus and Grafana.
  • Visualization and Dashboards: Grafana provides dashboards and visualization.
  • Alerting Service: An Alerting Service notifies administrators when critical thresholds are exceeded (e.g., unusually high CPU consumption, unresponsive node).
  • Single Sign-On (SSO): User authentication and authorization are handled via a Keycloak service.
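
As an illustration of the first feature (agent collects, then sends via REST), here is a minimal sketch of a Node Agent cycle using only the Python standard library. The endpoint URL, the payload field names, and the use of the 1-minute load average as a CPU proxy are assumptions for this example and are not taken from the NodeSense code.

```python
import json
import os
import socket
import time
import urllib.request

# Hypothetical ingestion endpoint; the real URL and path are not stated in this README.
COLLECTOR_URL = "http://127.0.0.1:8000/metrics"


def collect_metrics() -> dict:
    """Gather a few system metrics without third-party dependencies."""
    try:
        load_1m = float(os.getloadavg()[0])  # Unix only
    except (AttributeError, OSError):
        load_1m = 0.0  # platform does not expose load averages
    return {
        "node_id": socket.gethostname(),
        "timestamp": time.time(),
        "metrics": {
            "load_1m": load_1m,
            "cpu_count": os.cpu_count() or 1,
        },
    }


def send_metrics(payload: dict, url: str = COLLECTOR_URL) -> int:
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

A real agent would call `send_metrics(collect_metrics())` in a loop with a sleep between iterations and would likely report richer metrics (RAM, IO) than this sketch does.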

Architecture Overview

The platform is implemented as a Docker Swarm stack consisting of the following containers/services:

| Component | Role |
| --- | --- |
| Keycloak | Authentication and authorization (SSO) |
| API Gateway (Replicated) | Routing, token validation, rate-limiting |
| Redis Cluster | Storage for distributed rate limiting |
| Node Agent | Collecting metrics from nodes |
| Metrics Collector (Replicated) | Aggregation and processing of metrics |
| TimescaleDB | Storing metrics |
| Prometheus | Collecting metrics exposed by the system |
| Grafana | Dashboards and visualization |
| Alerting Service | Generating notifications based on thresholds |

Architecture Diagram

Getting Started

To deploy the NodeSense platform, follow these steps:

  1. Clone the Repository:

    git clone https://github.com/mariapana/NodeSense.git
    cd NodeSense
  2. Run Setup and Deploy the Stack:

    ./setup.sh
    ./deploy.sh
  3. Access the Platform:

    • Frontend UI: Open http://127.0.0.1:3002 in your browser.
      • Login: Use admin / admin (or the credentials you set).
      • Dashboard: View active nodes and real-time metrics.
      • System Topology: (Admin only) View Docker Swarm services and logs.
    • Grafana: Open http://127.0.0.1:3001
      • Login: admin / admin
      • View dashboards in Dashboards/NodeSense Metrics.
    • Keycloak: Open http://127.0.0.1:8080 for user management.
  4. Verify Implementation: Run the comprehensive test suite to verify all features (Auth, Security, Alerting, Rate Limiting):

    ./test_suite.sh admin admin [client_secret]
  5. Clean Up:

    ./cleanup.sh

Keycloak Integration (Authentication & Authorization)

The platform uses a dedicated Keycloak service to handle Single Sign-On (SSO), authentication, and authorization. The integration is fully automated and reproducible, so no manual configuration through the Keycloak UI is needed.

Key Implementation Features:

  • Automated Realm Import: Keycloak is configured to automatically import the NodeSense realm upon startup using the --import-realm argument.
  • Secure Secret Management:
    • Template-Based: The realm configuration is maintained in a version-controlled template (keycloak/import/NodeSense-realm.template.json) which contains no secrets.
    • Dynamic Generation: During the deployment process (deploy.sh), a helper script (keycloak/generate-realm-json.sh) injects the sensitive credentials (admin password, viewer password and API Gateway client secret) provided interactively by the user.
    • Result: This generates a transient NodeSense-realm.json file used for the actual import, ensuring that secrets are never committed to the repository.
  • Predefined Access Control:
    • Users: Automatically provisions admin and viewer users.
    • Roles: Assigns specific roles (admin, viewer) that are embedded in the JWT tokens for downstream service authorization.
  • Streamlined UX: The configuration overrides the default user profile to remove the mandatory "Update Account Information" step, ensuring a seamless login flow.
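
The template-based secret injection described above can be sketched as follows. In the repository this step is performed by the shell script keycloak/generate-realm-json.sh; the Python below is only an illustration of the idea, and the `${ADMIN_PASSWORD}` placeholder syntax and the `render_realm` name are assumptions, not the project's actual format.

```python
import json
from string import Template


def render_realm(template_text: str, secrets: dict) -> str:
    """Fill secret placeholders in a realm template and return valid JSON text."""
    # JSON-escape each secret so quotes or backslashes cannot break the document.
    escaped = {key: json.dumps(value)[1:-1] for key, value in secrets.items()}
    # safe_substitute leaves unknown ${...} expressions untouched.
    rendered = Template(template_text).safe_substitute(escaped)
    json.loads(rendered)  # fail early (ValueError) if the result is not valid JSON
    return rendered
```

The rendered text would be written to the transient NodeSense-realm.json file, which stays out of version control, while the template itself contains no secrets.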

TimescaleDB Integration (Time-Series Metrics Storage)

NodeSense uses TimescaleDB to persist monitoring metrics in an optimized time-series format.

Design Overview:

  • A dedicated TimescaleDB service is deployed as part of the Docker Swarm stack.
  • The database schema is automatically created at startup using SQL initialization scripts.
  • Monitoring data is organized using a normalized model:
    • nodes - represents monitored nodes (VMs or containers)
    • metrics - a time-series hypertable storing metric values over time

Schema Highlights:

  • The metrics table is defined as a hypertable, enabling efficient queries over time intervals.
  • An index on (node_id, time DESC) optimizes the most common access pattern: fetching the latest metrics for a given node.
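
For illustration, a schema along these lines could look like the sketch below, kept here as SQL strings inside Python. Only the table names (nodes, metrics), the hypertable, and the (node_id, time DESC) index come from this README; every column name and type, the index name, and the example query are assumptions.

```python
# Hypothetical initialization script; column names and types are assumptions.
SCHEMA_SQL = """
CREATE EXTENSION IF NOT EXISTS timescaledb;

CREATE TABLE IF NOT EXISTS nodes (
    id   SERIAL PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);

CREATE TABLE IF NOT EXISTS metrics (
    time    TIMESTAMPTZ NOT NULL,
    node_id INTEGER NOT NULL REFERENCES nodes (id),
    name    TEXT NOT NULL,
    value   DOUBLE PRECISION NOT NULL
);

SELECT create_hypertable('metrics', 'time', if_not_exists => TRUE);

CREATE INDEX IF NOT EXISTS metrics_node_time_idx
    ON metrics (node_id, time DESC);
"""

# Example of the access pattern the index is meant to serve ($1 is the node id).
LATEST_FOR_NODE_SQL = """
SELECT time, name, value
FROM metrics
WHERE node_id = $1 AND time > now() - INTERVAL '5 minutes'
ORDER BY time DESC
"""


def split_statements(sql: str) -> list:
    """Split a script into individual statements (naive: assumes no ';' in literals)."""
    return [part.strip() for part in sql.split(";") if part.strip()]
```

Each statement returned by `split_statements(SCHEMA_SQL)` could then be run in order by any PostgreSQL client at startup.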

API Gateway (High Availability & Security)

The API Gateway acts as the central entry point for all incoming traffic, ensuring security, scalability, and control.

Implementation Details:

  • Technology Stack: Built with FastAPI (Python) for high performance and async capabilities.
  • Security:
    • JWT Validation: Enforces authentication by validating JSON Web Tokens (JWT) issued by Keycloak.
    • Role-Based Access Control (RBAC): Restricts sensitive operations (e.g., DELETE requests, the System Topology view) to users with the admin role.
  • Rate Limiting:
    • Implements a Distributed Rate Limiting algorithm using a Redis Cluster backend.
    • Limits traffic to 1000 requests per minute per client IP to prevent abuse.
  • System Integration:
    • Mounts the Docker Socket (/var/run/docker.sock) to query Swarm state (services, replicas).
    • Proxies metric ingestion requests to the Collector service via internal Docker DNS.
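
The role check and the per-IP limit described above can be sketched as follows. This is not the Gateway's actual code: it assumes realm roles appear under `realm_access.roles` in the decoded token (Keycloak's default location for realm roles), it uses a fixed-window counter as one possible rate-limiting algorithm, and it replaces Redis with an in-memory stand-in so the example runs on its own. All class and function names are hypothetical.

```python
import time
from typing import Optional


def has_role(claims: dict, role: str) -> bool:
    """Check for a realm role in already-decoded and verified token claims."""
    return role in claims.get("realm_access", {}).get("roles", [])


class InMemoryCounterStore:
    """Stand-in for Redis: incr() mimics INCR with an expiry set on first hit."""

    def __init__(self) -> None:
        self._counters = {}  # key -> (count, expires_at)

    def incr(self, key: str, ttl_seconds: int, now: float) -> int:
        count, expires_at = self._counters.get(key, (0, now + ttl_seconds))
        if now >= expires_at:  # expired key: start a fresh counter
            count, expires_at = 0, now + ttl_seconds
        count += 1
        self._counters[key] = (count, expires_at)
        return count


class FixedWindowRateLimiter:
    """Allow at most `limit` requests per client IP in each time window."""

    def __init__(self, store, limit: int = 1000, window_seconds: int = 60) -> None:
        self.store = store
        self.limit = limit
        self.window_seconds = window_seconds

    def allow(self, client_ip: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        window = int(now // self.window_seconds)
        # One counter per (IP, window); replicas sharing the store share the count.
        key = f"ratelimit:{client_ip}:{window}"
        return self.store.incr(key, self.window_seconds, now) <= self.limit
```

Because every Gateway replica would increment the same key in the shared store, the limit applies across replicas rather than per instance, which is the point of keeping the counters outside the Gateway process.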

Metrics Collector (Data Aggregation)

The Collector is responsible for high-throughput ingestion and persistence of monitoring data.

Implementation Details:

  • High Performance: Uses FastAPI and asyncpg (asynchronous PostgreSQL driver) to handle concurrent write operations efficiently.
  • Data Pipeline:
    1. Validates incoming JSON payloads against a strict Pydantic schema.
    2. Updates real-time Prometheus gauges (node_metric, node_last_seen) for scraping.
    3. Persists normalized data into TimescaleDB using transactional writes.
  • Resiliency: Designed to be stateless and horizontally scalable (replicated).
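
The three pipeline steps above can be sketched as follows. To stay self-contained, this example uses a dataclass with explicit checks in place of Pydantic, plain dictionaries in place of Prometheus gauges, and a list in place of TimescaleDB. The payload field names and the gauge keying are assumptions; only the gauge names node_metric and node_last_seen come from this README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricSample:
    node_id: str
    name: str
    value: float
    timestamp: float


def _is_number(obj) -> bool:
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(obj, (int, float)) and not isinstance(obj, bool)


def validate(payload: dict) -> MetricSample:
    """Step 1: strict validation (no silent type coercion)."""
    node_id, name = payload.get("node_id"), payload.get("name")
    value, timestamp = payload.get("value"), payload.get("timestamp")
    if not isinstance(node_id, str) or not node_id:
        raise ValueError("node_id must be a non-empty string")
    if not isinstance(name, str) or not name:
        raise ValueError("name must be a non-empty string")
    if not _is_number(value) or not _is_number(timestamp):
        raise ValueError("value and timestamp must be numbers")
    return MetricSample(node_id, name, float(value), float(timestamp))


class Collector:
    def __init__(self) -> None:
        self.node_metric = {}     # stand-in gauge: (node_id, name) -> value
        self.node_last_seen = {}  # stand-in gauge: node_id -> timestamp
        self.rows = []            # stand-in for the metrics hypertable

    def ingest(self, payload: dict) -> MetricSample:
        sample = validate(payload)                                      # step 1
        self.node_metric[(sample.node_id, sample.name)] = sample.value  # step 2
        self.node_last_seen[sample.node_id] = sample.timestamp
        self.rows.append(sample)                                        # step 3
        return sample
```

Validation runs first so that a rejected payload touches neither the gauges nor the stored rows.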

Alerting Service (Anomaly Detection)

The Alerting Service continuously monitors the system for critical anomalies and ensures administrators are notified.

Implementation Details:

  • Polling Engine: A Python-based persistent service that queries TimescaleDB at fixed intervals (every 60s).
  • Detection Rules:
    • High CPU: Triggers when CPU usage exceeds 90%.
    • Node Down: Triggers when a node has not reported metrics for more than 2 minutes.
  • Persistence: Alerts are stored in a dedicated alerts table for auditing and UI retrieval.
  • Logging: Outputs structured warning logs for integration with external log aggregators.
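
The two detection rules above can be sketched as a pure function over the latest known state of each node. The thresholds (90% CPU, 2 minutes) come from this README; the `NodeStatus` shape, the alert labels, and the choice to skip the CPU check for a node that is already considered down are assumptions made for the example.

```python
from dataclasses import dataclass

CPU_THRESHOLD_PERCENT = 90.0
NODE_DOWN_AFTER_SECONDS = 120.0


@dataclass(frozen=True)
class NodeStatus:
    node_id: str
    cpu_percent: float  # most recent CPU reading
    last_seen: float    # Unix timestamp of the most recent report


def evaluate(nodes, now: float) -> list:
    """Return (node_id, alert_type) pairs for every rule that fires."""
    alerts = []
    for node in nodes:
        if now - node.last_seen > NODE_DOWN_AFTER_SECONDS:
            alerts.append((node.node_id, "node_down"))
        elif node.cpu_percent > CPU_THRESHOLD_PERCENT:
            # Only checked for live nodes: a stale CPU reading is not trusted.
            alerts.append((node.node_id, "high_cpu"))
    return alerts
```

A polling loop would call `evaluate` every 60 seconds with rows read from TimescaleDB, then write the resulting alerts to the alerts table and the log.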

Frontend Application (Dashboard & Control)

The Frontend provides a modern, responsive user interface for visualizing the platform's state.

Implementation Details:

  • Tech Stack: Built with React, Vite, and TailwindCSS for rapid development and optimized builds.
  • Features:
    • Real-Time Dashboard: Displays a grid of active nodes with status indicators (Online/Offline) and last-seen timestamps.
    • System Topology: Visualizes Docker Swarm services, including replica counts and image versions (Admin only).
    • Log Viewer: Live access to service logs via the Gateway API.
    • Live Simulator: Integrated tool to spawn virtual nodes/metrics directly from the browser for testing.
    • Alert Notifications: Polling-based Toast notifications for immediate alert visibility.
