Monitoring
This part is delegated to PetalsLink easierBSM.
The goal of the technical monitoring is to be able to monitor the PLAY platform components in 'real time' in order to:
- Check system behavior/performance
- Check component behavior/performance
- Be alerted when something abnormal occurs
- ...
- The main idea is to keep the approach used throughout the project, i.e. the pub/sub communication paradigm. This approach has many advantages, but the main difficulty is keeping a dynamic/elastic system connected to the monitoring platform so that the platform always offers a consistent view.
- The monitoring platform needs to be extensible. A generic approach must allow the platform developers to inject the monitoring data of their choice into the monitoring platform.
- The monitoring platform must adapt itself to new monitoring data/formats without any software update.
- To be continued
Instead of developing such a technical monitoring system from scratch, we chose to reuse and extend existing open source components. Here is an initial list of software to look at:
- http://graphite.wikidot.com : Scalable real-time graphing.
- https://github.com/etsy/statsd : A network daemon that runs on the Node.js platform and listens for statistics, like counters and timers, sent over UDP and sends aggregates to one or more pluggable backend services.
statsd
Statsd is a daemon that opens a socket to receive very simple UDP messages carrying monitoring information (counters, gauges, etc.). It is easy to deploy (a Node.js service of fewer than 200 lines of code) and performs well thanks to its implementation choices. It can be connected to several 'backends' to which it pushes data. One of the provided backend implementations targets Graphite, which is briefly described below.
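To give an idea of how simple these UDP messages are, here is a minimal Python sketch of a client. It assumes a statsd daemon listening on its default port 8125 on the local machine; the metric names are hypothetical and only illustrate the `<name>:<value>|<type>` line format (`c` for counters, `g` for gauges, `ms` for timers).

```python
import socket


def format_metric(name: str, value, metric_type: str) -> str:
    """Build one statsd line: <name>:<value>|<type>."""
    if metric_type not in ("c", "g", "ms"):
        raise ValueError(f"unsupported metric type: {metric_type}")
    return f"{name}:{value}|{metric_type}"


def send_metric(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one datagram to the daemon; UDP is fire-and-forget, no reply is read."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))
```

A component would then call, for example, `send_metric(format_metric("play.eventcloud.messages", 1, "c"))` each time it handles a message. Because the transport is UDP, a lost datagram or an unreachable daemon does not block the monitored component.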
graphite
Graphite is composed of two components: the backend and the frontend. The backend is designed for performance, and the frontend displays live data for each monitoring feed. A monitoring feed can be seen as a key/value store: monitored components send key/value pairs to a socket, and the backend aggregates the monitoring data by key.
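As an illustration of the key/value feed, the sketch below formats data points for the plaintext protocol of the Graphite backend (Carbon), where each line is `<key> <value> <timestamp>`. It assumes a Carbon listener on its default plaintext port 2003 on the local machine; the key names are hypothetical.

```python
import socket
import time


def format_datapoint(key: str, value, timestamp=None) -> str:
    """Build one plaintext line: '<key> <value> <unix timestamp>' plus newline."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{key} {value} {ts}\n"


def send_datapoints(lines, host: str = "127.0.0.1", port: int = 2003) -> None:
    """Open a TCP connection to the backend and send all lines in one payload."""
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)
```

For example, `send_datapoints([format_datapoint("play.system.cpu", 0.42)])` would publish one CPU sample under the key `play.system.cpu`.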
Graphite also provides a 'time window' feature per monitoring feed: one can define how historical monitoring data is kept, period by period. For example, we can keep a high resolution for the last 5 minutes (say, one data point per second) and a low resolution for the last year (one data point per hour).
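The example above corresponds to a retention definition such as `1s:5m,1h:1y` (resolution:history pairs, as used by the `retentions` setting of Carbon's `storage-schemas.conf`). The following small sketch, with hypothetical helper names, works out how many data points each archive has to hold, which gives a feel for the storage cost of a policy.

```python
# Seconds per unit suffix; a year is counted as 365 days.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}


def to_seconds(spec: str) -> int:
    """Convert a duration such as '5m' or '1y' to seconds."""
    return int(spec[:-1]) * UNIT_SECONDS[spec[-1]]


def archive_sizes(retentions: str):
    """Return (seconds per point, number of points) for each archive."""
    sizes = []
    for archive in retentions.split(","):
        resolution, history = archive.strip().split(":")
        step = to_seconds(resolution)
        sizes.append((step, to_seconds(history) // step))
    return sizes


print(archive_sizes("1s:5m,1h:1y"))  # [(1, 300), (3600, 8760)]
```

So the high-resolution window holds 300 points and the one-year window holds 8760 points per monitoring key.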
Graphite provides a GUI to display monitoring information and a JSON/XML REST API to query monitoring data. This API can be used with external libraries ([GitHub uses it](https://github.com/blog/1240-new-status-site); there are also some JS libraries based on d3.js, such as graphene).
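A query against this API could look like the sketch below. It assumes the `/render` endpoint with `format=json`, which returns a list of series, each with a `target` name and a list of `[value, timestamp]` data points; the base URL and key name are hypothetical, and the HTTP call itself is left out so the example stays self-contained.

```python
import json
from urllib.parse import urlencode


def render_url(base_url: str, target: str, since: str = "-5min") -> str:
    """Build a render URL asking for JSON data for one monitoring key."""
    query = urlencode({"target": target, "from": since, "format": "json"})
    return f"{base_url.rstrip('/')}/render?{query}"


def latest_value(response_text: str):
    """Return (value, timestamp) of the most recent non-empty point, or None."""
    series = json.loads(response_text)
    if not series:
        return None
    for value, timestamp in reversed(series[0]["datapoints"]):
        if value is not None:  # empty time slots are returned as null
            return value, timestamp
    return None
```

The URL built by `render_url("http://graphite.example.org", "play.system.cpu")` can be fetched with any HTTP client, and the response body passed to `latest_value`.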
The tools described in the previous section:
- Do not use the pub/sub paradigm, i.e. data has to be pushed to the monitoring tool
- Are just 'display' tools, i.e. we cannot raise alerts when a violation occurs, for example
While these two restrictions are not aligned with the features we listed above, there are several ways to work around them, which will be described and implemented in the next iterations. The goals of the first iteration are to:
- Quickly create the monitoring engine. This will be achieved by using the tools listed above.
- Provide a monitoring API which is independent of the underlying tools. The key/value approach is generic enough to be used by each PLAY component. With it, we can easily monitor everything from the system level to the middleware level: CPU and memory usage, messages, EventCloud instances, EventCloud data, ...
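Since the API is still to be defined (see the TODO below), the following is only a sketch of what a tool-independent key/value API could look like. All names (`MonitoringBackend`, `Monitor`, `InMemoryBackend`) are hypothetical; a statsd or Graphite backend would implement the same `publish` method.

```python
import time
from typing import List, Optional, Protocol, Tuple


class MonitoringBackend(Protocol):
    """Anything able to accept a (key, value, timestamp) sample."""

    def publish(self, key: str, value: float, timestamp: int) -> None: ...


class InMemoryBackend:
    """Trivial backend that stores samples; useful for tests and demos."""

    def __init__(self) -> None:
        self.samples: List[Tuple[str, float, int]] = []

    def publish(self, key: str, value: float, timestamp: int) -> None:
        self.samples.append((key, value, timestamp))


class Monitor:
    """Entry point used by components; hides which tools receive the data."""

    def __init__(self, prefix: str, backends: List[MonitoringBackend]) -> None:
        self.prefix = prefix
        self.backends = backends

    def record(self, key: str, value: float, timestamp: Optional[int] = None) -> None:
        ts = int(time.time()) if timestamp is None else timestamp
        for backend in self.backends:
            backend.publish(f"{self.prefix}.{key}", value, ts)


backend = InMemoryBackend()
monitor = Monitor("play.eventcloud", [backend])
monitor.record("instances", 3, timestamp=1700000000)
print(backend.samples)  # [('play.eventcloud.instances', 3, 1700000000)]
```

With this shape, components only depend on `Monitor.record`, so the underlying tools, message format, and protocol can change without touching the monitored code.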
TODO:
- API
- Message format (JSON, Raw, ...)
- Protocol choice (UDP, HTTP, ...)