ubeda edited this page Jun 6, 2013 · 16 revisions

Facts / Publicity:

  • As of DIRAC v7r0, RSS will be the system in charge of the status of every element.
  • RSS will take actions on some elements ( provided that there are policies for them ).
  • RSS is generic enough to handle almost anything we could imagine, and incredibly easy to extend.

How RSS monitoring works:

RSS keeps a set of caches that are populated by an agent ( CacheFeederAgent ). Policies then read those caches and make decisions based on that information. This has two implications:

  • Our policies can react only as quickly as the CacheFeederAgent fetches new information.
  • Our knowledge is bounded by the problems we can foresee: the CacheFeederAgent only collects the information we have instrumented it to fetch.
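As a rough illustration of the first implication, the sketch below shows a policy that can only judge what is in the cache, and that refuses to decide when the cached value is too old. All names here ( CacheEntry, evaluate_policy, the "Unknown" result and the thresholds ) are hypothetical and are not part of the RSS API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class CacheEntry:
    """One cached measurement, as a feeder agent might store it (hypothetical)."""
    element: str
    value: float          # e.g. a failure rate between 0 and 1
    fetched_at: datetime  # when the feeder last refreshed this entry


def evaluate_policy(entry, now, max_age, threshold):
    """Return a status proposal based only on the cached value.

    The policy never looks at the element itself, so its answer is
    at best as fresh as the cache entry it reads.
    """
    if now - entry.fetched_at > max_age:
        # Stale data: the policy cannot say anything meaningful.
        return "Unknown"
    return "Banned" if entry.value > threshold else "Active"
```

With a feeder that refreshes every 15 minutes, a failure that starts right after a refresh stays invisible to the policy until the next cycle.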

This proposal wants to tackle three problems:

  • instrument DIRAC components to notify error symptoms to RSS
  • use it to monitor DIRAC services and agents
  • become active: a-la "HammerCloud" / "SAM" so that we can use probes effectively AND evaluate the usage of pilots

1.- instrument DIRAC components to notify error symptoms to RSS

THE GOAL: detect symptoms of failures on grid elements as soon as possible.

First, we will instrument RSS with a table ( ErrorReportBuffer ) with the following columns:

  • ElementName
  • ElementType
  • Reporter
  • Timestamp
  • ErrorMessage
  • Operation
  • Arguments

The idea is to put the table and the client methods in place in DIRAC v6r9. They are completely harmless, as they are NOT going to be used yet. Once they are in place, we need to identify which components, and which of their operations, we want to instrument to report errors to RSS. An example would be the ReplicaManager.putFile operation.
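A minimal sketch of what the table and a client method could look like is shown below. The column names come from the list above; everything else is an assumption made for illustration: sqlite3 stands in for the real database backend, the column types are guesses, and open_buffer / report_error are hypothetical names, not existing DIRAC methods.

```python
import json
import sqlite3
from datetime import datetime, timezone

# Column names follow the proposal; types are illustrative guesses.
SCHEMA = """
CREATE TABLE IF NOT EXISTS ErrorReportBuffer (
    ElementName  TEXT NOT NULL,
    ElementType  TEXT NOT NULL,
    Reporter     TEXT NOT NULL,
    Timestamp    TEXT NOT NULL,
    ErrorMessage TEXT NOT NULL,
    Operation    TEXT NOT NULL,
    Arguments    TEXT NOT NULL
)
"""


def open_buffer(path=":memory:"):
    """Open (and create if needed) the buffer; sqlite3 is only a stand-in."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def report_error(conn, element_name, element_type, reporter,
                 error_message, operation, arguments, timestamp=None):
    """Hypothetical client method: append one error report to the buffer."""
    ts = (timestamp or datetime.now(timezone.utc)).isoformat()
    conn.execute(
        "INSERT INTO ErrorReportBuffer VALUES (?, ?, ?, ?, ?, ?, ?)",
        (element_name, element_type, reporter, ts,
         error_message, operation, json.dumps(arguments)),
    )
    conn.commit()
```

An instrumented component would call report_error from its failure path, for example with reporter "ReplicaManager" and operation "putFile".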

Once a few components / operations are identified, we will have to gather data and decide how to evaluate it. One suggestion is to discard single failures and consider only trends, so that glitches are not taken into account ( to be seen ). Depending on how many reporters and operations are taken into account, we will need to put a few data mining procedures in place.
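One simple way to ignore glitches, sketched here under assumptions, is to count the reports per element inside a sliding time window and flag only the elements that reach a minimum count. The function name, the 30 minute window and the threshold of 3 are all hypothetical; the real evaluation is still to be decided.

```python
from collections import Counter
from datetime import datetime, timedelta


def elements_with_failure_trend(reports, now,
                                window=timedelta(minutes=30),
                                min_failures=3):
    """Return the elements with at least min_failures reports in the window.

    reports is an iterable of (element_name, timestamp) pairs, e.g. rows
    read from the error buffer. Isolated failures fall below the threshold
    and are therefore discarded.
    """
    cutoff = now - window
    counts = Counter(name for name, ts in reports if cutoff <= ts <= now)
    return sorted(name for name, n in counts.items() if n >= min_failures)
```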

Nevertheless, once the ErrorReportBuffer is in place, policies can use it either directly or indirectly, through the conclusions drawn from its data.

2.- use it to monitor DIRAC services and agents

THE GOAL: notify when a DIRAC module ( service / agent; executors to be seen ) is performing badly, is dead, or is being hammered.

This idea overlaps with the Framework.ComponentMonitoringDB, so it will be a joint action. For the time being, the database will remain in the Framework module. However, RSS will decide which components are ok and which ones are not. To do this, we need to add to the ComponentMonitoringDB a column / table to store those decisions and a timestamp. For services, it has been proposed to take into account memory consumption and the number of running threads. For agents, we will need to instrument the monitoring thread to detect when the time between logs reaches 2 x pollingTime. The action to be taken is sending alerts ( for the moment ).
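The agent check can be sketched as a single comparison. The function below is a hypothetical helper, not part of the Framework or RSS code; it assumes the monitoring thread knows when the agent last logged and what its pollingTime is.

```python
from datetime import datetime, timedelta


def agent_is_stalled(last_log, now, polling_time, factor=2):
    """True when the silence since the last log reaches factor x pollingTime.

    factor defaults to 2, following the "2 x pollingTime" rule of thumb.
    """
    return (now - last_log) >= polling_time * factor
```

For an agent with a pollingTime of 120 seconds, an alert would be raised once 240 seconds pass without a log line.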

3.- become active: a-la "HammerCloud" / "SAM" so that we can use probes effectively AND evaluate the usage of pilots

THE GOAL: make use of the pilots to validate the status set by RSS

This task comprises three sub-tasks:

  a) Certify the Un-Banning decision when the status is forced by the RSS machinery to Probing
  b) Get rid of the old SAMSystem and port it to DIRAC
  c) Instrument pilots to run tests for us, on demand or automatically

First, we need to instrument pilots so that they can fetch a Probe instead of a regular Job. Once that is done, we need to make RSS ready to reply to pilots asking whether to match a Probe or a Job. This decision will be based on the time elapsed between the last Probe on that ce / wn / site ( granularity to be defined ) and the decision time. All this information will be stored in RSS, more specifically in a cache table in ResourceManagementDB ( PilotProbesCache ).

So, as long as our elements are Active, pilots will be sent. Every now and then, one ( or many, will see ) of those pilots will run a probe ( VO specific ) and report back to RSS. Those results will then be picked up by the usual RSS machinery.

However, if the element is Banned, no pilots are sent. The SAM system ( being part of RSS ) will be modified to react when there are elements in Probing state ( first state after Banned ) and submit generic pilots. As no Probes have run on that element during the Banned period, these pilots will run Probes. The outcome of the probes will determine whether the element status is confirmed and moved to Active or put back into Banned.
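The two decisions described above can be sketched as follows. This is only an illustration under assumptions: match_probe_or_job and status_after_probe are hypothetical names, the probe interval is a free parameter, and last_probe stands for whatever PilotProbesCache would hold for the element ( None if no Probe has run yet ).

```python
from datetime import datetime, timedelta


def match_probe_or_job(last_probe, now, probe_interval):
    """Decide what a pilot asking RSS should be matched with.

    A Probe is due when none has run on the element yet, or when the
    time elapsed since the last one reaches probe_interval.
    """
    if last_probe is None or now - last_probe >= probe_interval:
        return "Probe"
    return "Job"


def status_after_probe(current_status, probe_passed):
    """Resolve the Probing state from the outcome of a probe.

    Probing becomes Active on success and goes back to Banned on failure.
    Other statuses are left untouched here, since their probe results
    are handled by the usual RSS machinery.
    """
    if current_status == "Probing":
        return "Active" if probe_passed else "Banned"
    return current_status
```

Because an element coming out of a Banned period has no recent Probe recorded, the first generic pilot that lands on it is matched with a Probe under this rule.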
