From fd040d675727bdb0cd0d458d8ba21620a07d031c Mon Sep 17 00:00:00 2001
From: Giuseppe Carboni
Date: Tue, 5 May 2020 16:41:03 +0200
Subject: [PATCH 1/2] Fix #334, add some doc. on how to replace the Manager in case of failure

---
 doc/production.rst | 57 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 57 insertions(+)

diff --git a/doc/production.rst b/doc/production.rst
index 7b7df866..40e7e349 100644
--- a/doc/production.rst
+++ b/doc/production.rst
@@ -4,6 +4,15 @@
 Production
 **********
 
+Unlike the Development environment, that uses Vagrant pre-configured virtual
+machines, when dealing with production machines, you have to perform some
+preliminary tasks in order for the provisioning procedure to be completed
+successfully. It is required that you configure the to-be-provisioned
+machines' network interfaces, as well as their disk partitions. You also have
+to install on them the desired Operating System (CentOS 6.8 for ACS running
+machines, CentOS 7.2 for storage). Without these preliminary tasks, the
+provisioning procedure will most likely fail.
+
 Machines deployment
 ===================
 To deploy the system in production, you have to specify a *cluster* of machines,
@@ -74,3 +83,51 @@ tag you want to install on the machines:
 argument from both the ``discos-deploy`` and ``discos-get`` scripts. If you
 pass the ``--station`` argument anyway, if the given argument does not match
 the correct station you will receive an error and the procedure will stop.
+
+Replace the Manager in case of failure
+--------------------------------------
+In case the Manager machine suffers a failure of some sort, it has to be
+replaced. In order to do this, the first thing to do is perform again the
+provisioning procedure on a newly installed machine (after putting the new
+Manager's IP address in the Ansible inventory's hosts file).
In order for
+the whole system to behave correctly it is also necessary to perform some
+manual tweaking on the other DISCOS machines as well (in case the DISCOS
+control system is running on a distributed environment. This is the case
+for the SRT and Medicina stations).
+
+The tweaks to be performed in order for the DISCOS control system to work as
+expected are the following:
+
+- Replace the old ACS Manager IP address reference with the new one in the
+  ``/discos-sw/config/misc/bash_profile`` file on the ``discos-console``
+  machine. It is stored as an environment variable called ``MNG_IP``.
+- Replace the old Manager IP address with the new one in some files in the
+  DISCOS CDB. More specifically, one file has to be corrected in order for the
+  control system to be able to properly communicate with the ``TotalPower``
+  backend; you can find this file in the repository of the currently deployed
+  release of DISCOS, at the path
+  ``SRT/Configuration/CDB/alma/BACKENDS/TotalPower/TotalPower.xml``.
+  The variable to be corrected is called ``DataIPAddress``. This has to be
+  performed on the new Manager machine itself before launching the control
+  system.
+- Make sure that all the station systems and machines accept incoming
+  connections from the newly allocated Manager's IP address. Specifically, the
+  ``TotalPower`` backend and the ``CalMux`` machines have to be tweaked in
+  order to allow them to be controlled by the new Manager.
+
+In order for the whole environment to work properly it is also necessary to
+perform some other tweaks on the other DISCOS machines, but not related to
+the control system itself:
+
+- Replace the old Manager IP address with the new one in the ``/etc/hosts``
+  file on the ``discos-console`` and ``discos-storage`` machines (in case the
+  DISCOS control software is running on a distributed environment). This will
+  allow other services such as the Lustre service on the ``discos-storage``
+  machine to point again to the correct IP address.
+- Perform the SSH key exchange procedure between the ``discos`` user of the
+  newly installed Manager and the ones present on the ``discos-console`` and
+  ``discos-storage`` machines. The same procedure has to be performed between
+  the ``root`` users as well. This will allow components such as the Lustre
+  service on the ``discos-storage`` machine and the ``discos-addProject`` and
+  ``discos-removeProject`` scripts on the ``discos-console`` machine to
+  perform some remote tasks that would otherwise be impossible.

From 193b99bbb0b94c7a386bd9afeba6e15de73c5538 Mon Sep 17 00:00:00 2001
From: Giuseppe Carboni
Date: Mon, 14 Sep 2020 09:41:41 +0200
Subject: [PATCH 2/2] Update on doc file

---
 doc/production.rst | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/doc/production.rst b/doc/production.rst
index 40e7e349..8d272a1f 100644
--- a/doc/production.rst
+++ b/doc/production.rst
@@ -4,7 +4,7 @@
 Production
 **********
 
-Unlike the Development environment, that uses Vagrant pre-configured virtual
+Unlike the development environment, which uses Vagrant pre-configured virtual
 machines, when dealing with production machines, you have to perform some
 preliminary tasks in order for the provisioning procedure to be completed
 successfully. It is required that you configure the to-be-provisioned
@@ -87,13 +87,13 @@ tag you want to install on the machines:
 Replace the Manager in case of failure
 --------------------------------------
 In case the Manager machine suffers a failure of some sort, it has to be
-replaced. In order to do this, the first thing to do is perform again the
+replaced. In order to do this, the first thing to do is to perform the
 provisioning procedure on a newly installed machine (after putting the new
-Manager's IP address in the Ansible inventory's hosts file).
In order for -the whole system to behave correctly it is also necessary to perform some -manual tweaking on the other DISCOS machines as well (in case the DISCOS -control system is running on a distributed environment. This is the case -for the SRT and Medicina stations). +Manager's IP address in the Ansible inventory's hosts file). In order +for the whole system to behave correctly it is also necessary to perform +some manual tweaking on the other DISCOS machines as well (in case the +DISCOS control system is running on a distributed environment. This is the +case for the SRT and Medicina stations). The tweaks to be performed in order for the DISCOS control system to work as expected are the following: