
RFC 5 RSS Extension

ubeda edited this page Apr 17, 2012 · 24 revisions

Authors: M. Ubeda Garcia

Last Modified: 17.IV.2012

This document does not aim to be the full technical specification of the ResourceStatusSystem (RSS) compatible with RFC 5. Rather than listing only the changes needed to get RSS working, it gives an overview of the modules.

Before going into details, here is the content index.

  1. Design
    • Database
    • Service
    • Client
    • Agent
  2. Scalability

1. Design

There are three major components: Database, Service and Client. The approach is to keep the database and the service as simple as possible and to keep the logic, if any, in the client. This means the service and the database are just middleware between logic and information. If heavy information processing were needed, this model might not be sustainable, because it makes the use of caches less efficient. Luckily, the RSS is mainly a big getter / setter system without any complex processing task behind it.

1.1. Database

Databases rely on the ( badly named ) MySQLMonkey module. This module will be replaced, at least partially, by Pull 669. The monkey module certainly must be updated to work with the latest MySQL module, but my intention is to keep using it. With the naive database + service defined here, defining new databases / tables and client methods is a piece of cake.

Four methods on the database:

  • insert
  • update
  • get ( select would make more sense )
  • delete

Each one receives two dictionaries, named params and meta. The first carries the information that narrows the query; the second carries meta information for the WHERE statement. Direct benefits are:

  • SQL statements generated on the fly
  • Database with only four methods ( and service as well ) instead of X ( and X ), where X is a number between 30 and 100.
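
As a rough illustration of how a SQL statement could be generated on the fly from the two dictionaries, here is a minimal sketch. The function name `build_select`, the `operators` key inside `meta` and the `%s` placeholder style are assumptions made for this example; they are not taken from MySQLMonkey.

```python
# Hypothetical helper: not the MySQLMonkey implementation.
ALLOWED_OPERATORS = {"=", "<", ">", "<=", ">=", "!="}


def build_select(table, params, meta=None):
    """Build a parameterized SELECT statement.

    params: column name -> value; columns set to None do not narrow the query.
    meta:   optional dictionary; only an 'operators' mapping
            (column name -> comparison operator) is understood here (assumption).
    Returns the SQL string and the list of values to bind.
    """
    meta = meta or {}
    operators = meta.get("operators", {})
    clauses, values = [], []
    for column in sorted(params):
        value = params[column]
        if value is None:
            continue
        operator = operators.get(column, "=")
        if operator not in ALLOWED_OPERATORS:
            raise ValueError("unsupported operator: %r" % operator)
        clauses.append("%s %s %%s" % (column, operator))
        values.append(value)
    sql = "SELECT * FROM %s" % table
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, values
```

Values are returned separately so that the database driver can bind them, which keeps user-supplied values out of the generated SQL string.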

1.1.1. Schema

There are two databases in RSS: ResourceStatus and ResourceManagement. Leaving the second one aside for the moment, we focus on ResourceStatus, which is in charge of keeping the statuses of the "IT resources defined on /Resources" ( from now on, "elements" ).

There are four identical tables for each element ( plus one to be discussed ), following the current schema, given that we have three element levels ( Site, Resource and Node ).

Note:

  1. In bold you can find the UNIQUE_TOGETHER relations
  2. StatusType can be anything that makes sense, but preferably defined on the CS Operations/RSSConfiguration ( e.g. ReadAccess for SEs ).
  3. Status can also be defined on the CS, by default RSS provides ( Active, Bad, Probing, Banned ).

Tables:

Without any doubt, the most important table. It keeps the status of the element, plus further information mainly used by RSS internally.

  • ElementStatus
    • Name
    • StatusType
    • Status
    • Reason ( human readable reason )
    • DateEffective ( status dateEffective )
    • LastCheckTime
    • TokenOwner ( owner of the status, can be RSS or a human )
    • TokenExpiration

This table is only interesting for monitoring and debugging purposes; I would indeed make it an ARCHIVE table instead of InnoDB. It is mainly a log table. Every time there is a change in the ElementStatus table, a new entry is registered here ( if the appropriate flag in the CS is Active ).

A variant of this table, ElementHistory, can also be put in place. It would keep only entries whose status differs from that of the previous entry for a given ( name, statusType ) pair. The information in this table is mainly used by the web portal.

  • ElementLog
    • UID
    • Name
    • StatusType
    • Status
    • Reason
    • DateEffective
    • DateEnd
    • TokenOwner
    • TokenExpiration
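
The ElementHistory variant mentioned above can be thought of as a filter over the log entries. The sketch below is only an illustration: the function name `history_from_log` is hypothetical, entries are modelled as plain dictionaries, and they are assumed to arrive ordered by time.

```python
def history_from_log(log_entries):
    """Keep only entries whose status differs from the previous entry
    for the same ( Name, StatusType ) pair.

    Assumes log_entries is ordered by time and Status is never None.
    """
    last_status = {}
    history = []
    for entry in log_entries:
        key = (entry["Name"], entry["StatusType"])
        if last_status.get(key) != entry["Status"]:
            history.append(entry)
            last_status[key] = entry["Status"]
    return history
```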

ElementPresent can be a table or a VIEW, as it is now. It turns out to be very handy to have the current and the former status of an element in the same row; the table exists purely for that convenience.

  • ElementPresent
    • Name
    • StatusType
    • Status
    • FormerStatus
    • DateEffective
    • FormerDateEffective
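
To make the shape of an ElementPresent row concrete, the following sketch folds a time-ordered list of status changes into one row per ( Name, StatusType ) pair. How the real table or VIEW is populated is not described here, so the function `present_rows` and its input format are assumptions for illustration only.

```python
def present_rows(history_entries):
    """Build one row per ( Name, StatusType ) holding the current and
    the former status. Assumes entries are ordered by time and that
    consecutive entries for the same pair have different statuses.
    """
    rows = {}
    for entry in history_entries:
        key = (entry["Name"], entry["StatusType"])
        previous = rows.get(key)
        rows[key] = {
            "Name": entry["Name"],
            "StatusType": entry["StatusType"],
            "Status": entry["Status"],
            "FormerStatus": previous["Status"] if previous else None,
            "DateEffective": entry["DateEffective"],
            "FormerDateEffective": previous["DateEffective"] if previous else None,
        }
    return list(rows.values())
```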

DownTimes are a good use case for the ElementScheduled table. If we know the status of an element beforehand, we enter it in the system, which will handle it. ( This part is not currently active. )

  • ElementScheduled
    • Name
    • StatusType
    • Status
    • Reason
    • DateEffective
    • DateEnd
    • TokenOwner
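
Since this part is not active, the following is only a sketch of how a scheduled entry could be looked up for a given moment. The function `scheduled_status` and the half-open interval [ DateEffective, DateEnd ) are assumptions, not existing RSS behaviour.

```python
from datetime import datetime


def scheduled_status(entries, name, status_type, now):
    """Return the first scheduled entry covering `now`, or None.

    An entry covers `now` when DateEffective <= now < DateEnd (assumption).
    """
    for entry in entries:
        if (
            entry["Name"] == name
            and entry["StatusType"] == status_type
            and entry["DateEffective"] <= now < entry["DateEnd"]
        ):
            return entry
    return None


# Example: a downtime entered in advance for a hypothetical site.
downtime = {
    "Name": "SiteA",
    "StatusType": "all",
    "Status": "Banned",
    "Reason": "Scheduled downtime",
    "DateEffective": datetime(2012, 5, 1, 8, 0),
    "DateEnd": datetime(2012, 5, 1, 12, 0),
    "TokenOwner": "RSS",
}
```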

The existence of this table is uncertain. We can get the same information from the CS. Is it worth it? The price to pay for having it is very low.

  • Element
    • Name
    • Parent ?
    • Type ?

REQUIREMENTS

R1: a helper to get the information: at least the list of Sites, the list of Resources and the list of Nodes. If we keep their types and parent information, then that is needed too. RSS takes care of staying in sync with the CS; the module RSS.Utilities.Synchronizer, used in the ResourceStatusHandler, does the job.

1.2. Service

The service has no magic inside. Same four methods as on the database:

  • insert
  • update
  • get ( select would make more sense )
  • delete

plus the following Authentication configuration:

  • Default : SiteManager ( to be discussed )
  • get : authenticated

One service per database, two services in total.

1.3. Client

The client offers all methods to extract the information from the database. It offers five methods per table in the database by default:

  • insert + TableName
  • update + TableName
  • get + TableName ( select + TableName )
  • delete + TableName
  • addOrModify + TableName

Starting from the bottom, addOrModify is an insert / update method: if it cannot insert because the entry exists, it updates.
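
The insert-then-update fallback can be sketched as follows. `InMemoryGate` is a hypothetical stand-in for the server or database, the result dictionaries with an `OK` flag are an assumption of this example, and the ( name, statusType ) key is assumed as well.

```python
class InMemoryGate:
    """Hypothetical stand-in for the gate; rows are keyed by ( name, statusType )."""

    def __init__(self):
        self.rows = {}

    def insert(self, table, params):
        key = (table, params["name"], params["statusType"])
        if key in self.rows:
            return {"OK": False, "Message": "entry exists"}
        self.rows[key] = dict(params)
        return {"OK": True, "Value": None}

    def update(self, table, params):
        key = (table, params["name"], params["statusType"])
        if key not in self.rows:
            return {"OK": False, "Message": "no such entry"}
        self.rows[key].update(params)
        return {"OK": True, "Value": None}


def add_or_modify(gate, table, params):
    """Try to insert; if the insert fails, fall back to an update.

    This sketch falls back on any insert failure; a real implementation
    would probably check that the failure is a duplicate entry.
    """
    result = gate.insert(table, params)
    if result["OK"]:
        return result
    return gate.update(table, params)
```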

The other four follow this pattern:

methodName( column1, column2, .. , meta = None )

  • methodName indicates which of the four server methods is used and which table is affected.
  • all columns in the table are represented with columnX, using lower camelCase.

For example, the ElementStatus insert method for the Site elements

insertSiteStatus( name, statusType, status, reason, dateEffective, lastCheckTime, tokenOwner, tokenExpiration, meta = None )

Of course, there are special requirements, and additional methods are provided to meet them. In any case, the primitive methods <insert, get ( select ), update and delete> are the only ones that access the gate ( which can be either the server or the database; the default is the server ). Any other method uses, one way or another, some of the primitive methods.
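
One way to picture the verb + TableName pattern is a small factory that turns a column list into a client method. This is a sketch under assumptions: `RecordingGate`, `make_method` and the table name `SiteStatus` are hypothetical, and the gate is passed explicitly instead of being held by a client object.

```python
STATUS_COLUMNS = [
    "name", "statusType", "status", "reason", "dateEffective",
    "lastCheckTime", "tokenOwner", "tokenExpiration",
]


class RecordingGate:
    """Hypothetical gate that only records the primitive calls it receives."""

    def __init__(self):
        self.calls = []

    def call(self, verb, table, params, meta):
        self.calls.append((verb, table, params, meta))
        return {"OK": True, "Value": None}


def make_method(verb, table, columns):
    """Return a function named verb + table mapping its arguments to params."""

    def method(gate, *args, **kwargs):
        meta = kwargs.pop("meta", None)
        if len(args) > len(columns):
            raise TypeError("too many positional arguments")
        params = dict.fromkeys(columns)  # columns not given stay None
        params.update(zip(columns, args))
        for key, value in kwargs.items():
            if key not in params:
                raise TypeError("unknown column: %s" % key)
            params[key] = value
        return gate.call(verb, table, params, meta or {})

    method.__name__ = verb + table
    return method


insertSiteStatus = make_method("insert", "SiteStatus", STATUS_COLUMNS)
getSiteStatus = make_method("get", "SiteStatus", STATUS_COLUMNS)
```

A call such as `getSiteStatus( gate, "LCG.CERN.ch", statusType = "ReadAccess" )` then reaches the gate as a single `get` with a params dictionary in which every other column is None.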

1.4. Agent

The list of agents will be:

  • SiteInspectorAgent : takes care of policies that apply to sites
  • ResourceInspectorAgent : takes care of policies that apply to resources
  • NodeInspectorAgent : takes care of policies that apply to nodes
  • CacheFeederAgent : feeds a few tables used by the monitoring part of RSS
  • CacheCleanerAgent : cleans old stuff on the monitoring part of RSS
  • TokenAgent : takes care of the element's token ( mainly alerting humans when their tokens are about to expire )

2. Scalability

In order to deal with high loads, a few more small modules have been added. The ResourceStatusClient gives users access to any information stored in the RSS databases, but most queries will likely concern the status of a certain element or group of elements.

RSS provides a helper, RSS.Client.ResourceStatus, which currently has dual access to the CS and RSS, switched with the flag Operations/RSSConfiguration/Active. Given that RSS will eventually be the only source of element status information, we can forget about the CS here.

ResourceStatus helper will have two methods per element:

  • getElementStatus
  • setElementStatus

The helper, behaving as a singleton, provides a cache per element, which is refreshed automatically every 5 minutes. If the wanted item(s) are not in the cache, the ResourceStatusClient is called. If the operation is a write instead of a read, the cache is locked, the DB is updated and finally the cache is refreshed before the lock is released. The assumption is that the whole table is in the cache, so a miss has two possible causes: a wrong element name, or changes that have not been propagated yet. To prevent massive refreshes if a loop picks, for whatever reason, a bad name, a miss does not invalidate the cache. Within 5 minutes you will know whether it was a real miss or a propagation delay.
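
The caching behaviour described above can be sketched as follows. This is not the ResourceStatus helper itself: the class `StatusCache`, its constructor arguments and the three callables ( `load_all`, `store`, `fetch_one` ) are assumptions standing in for the real client calls, and the refresh is done lazily on access rather than by a background task.

```python
import threading
import time


class StatusCache:
    """Sketch of a whole-table cache with a fixed lifetime (assumed API)."""

    def __init__(self, load_all, store, fetch_one, lifetime=300, clock=time.time):
        self._load_all = load_all    # callable returning {name: status} for the whole table
        self._store = store          # callable( name, status ) writing to the DB
        self._fetch_one = fetch_one  # callable( name ) used on a cache miss
        self._lifetime = lifetime    # seconds; 300 matches the 5 minutes in the text
        self._clock = clock
        self._lock = threading.Lock()
        self._data = {}
        self._loaded_at = None

    def _reload(self):
        self._data = dict(self._load_all())
        self._loaded_at = self._clock()

    def get(self, name):
        with self._lock:
            if self._loaded_at is None or self._clock() - self._loaded_at >= self._lifetime:
                self._reload()
            status = self._data.get(name)
        if status is None:
            # A miss does not invalidate the cache; ask the client directly.
            return self._fetch_one(name)
        return status

    def set(self, name, status):
        # Lock the cache, update the DB, refresh, then release.
        with self._lock:
            self._store(name, status)
            self._reload()
```

Injecting the clock makes the 5-minute lifetime easy to exercise in a test without sleeping.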

.. .. TO BE TESTED .. ..
