Description
I opened a SAPHanaSR-ScaleOut issue, but I am not sure that is the right location, so I am opening this new one:
During probes, when the saphana_monitor_clone() operation is run, we run a crm_resource command to check which node should be master; all other nodes are marked as DEMOTED:
saphana-controller-lib Line 1782
However, the crm_resource -W -r "$OCF_RESOURCE_INSTANCE" command will not report any node that is currently undergoing a probe operation; it skips outputting nodes doing a probe (by design):
# No output from rhel8-node2, which is the current master:
$ cat SAPHana_RH1_HDB01.monitor.2025-08-12.11:04:15
-----------------------------------------8<-----------------------------------------
++ 10:54:04: saphana_monitor_clone:3109: crm_resource -W -r SAPHana_RH1_HDB01
+ 10:54:04: saphana_monitor_clone:3109: super_ocf_log info 'DBG PROBE:
resource SAPHana_RH1_HDB01 is running on: rhel8-node4
resource SAPHana_RH1_HDB01 is running on: rhel8-node5
resource SAPHana_RH1_HDB01 is running on: rhel8-node3
resource SAPHana_RH1_HDB01 is running on: rhel8-node2'
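For comparison, outside of a probe the promoted node is normally reported with its role appended, roughly as below (illustrative only; the exact wording depends on the Pacemaker version, which is also why the RA regex accepts both "Master" and "Promoted"):
$ crm_resource -W -r SAPHana_RH1_HDB01
resource SAPHana_RH1_HDB01 is running on: rhel8-node4
resource SAPHana_RH1_HDB01 is running on: rhel8-node5
resource SAPHana_RH1_HDB01 is running on: rhel8-node3
resource SAPHana_RH1_HDB01 is running on: rhel8-node2 Master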
The way this is written guarantees that all probe operations will see all nodes as "DEMOTED". Probes where resources are active as master are rare, since this special monitor type really only runs at cluster start, where everything is already demoted. In the rare cases where a probe does run on a node with an active master, however, this can result in a demote being issued against the current master.
This issue is reproducible by creating a probe against the promoted instance of the SAPHanaController resource, although this can be tricky, since probes are difficult to generate on demand.
A "pcs resource refresh" while the cluster is in a managed state should work; alternatively, the following steps generate a probe against just the primary (see the command sketch after the list):
- Create any monitor failure against the "SAPHanaController" primary resource while replication is down, so the primary is recovered to the same node.
- Following recovery, run a "pcs resource cleanup" to clear the errors. This will also run probes against resources with previous failures to check their current status.
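A minimal command sketch of the above, assuming the promotable clone is named SAPHana_RH1_HDB01-clone (hypothetical name, adjust to the actual cluster; the failure-injection step is environment specific and not shown):
# With system replication down, cause a monitor failure on the primary so it is
# recovered on the same node (environment specific, e.g. stopping HANA outside the cluster).
# Then clear the errors; this also schedules probes against the previously failed resource:
$ pcs resource cleanup SAPHana_RH1_HDB01-clone
# Alternatively, a refresh while the cluster is managed re-probes all instances, including the primary:
$ pcs resource refresh SAPHana_RH1_HDB01-clone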
I have tested using crm_attribute instead, checking against the promotion score, and found that crm_attribute is not susceptible to the same issue. The change below bypasses the issue:
# Updating RA as below:
if ocf_is_probe; then
    super_ocf_log info "DEC: PROBE ONLY"
    # TODO PRIO1: NG - could check the Master status in a probe - after a refresh this status should be lost?
    # TODO PRIO1: NG - score, if CLONE_ATTRIBUTE is known
    # TODO PRIO1: NG - is 'Master' still correct for promoted clones?
    #
    # check during probe, if this instance is *NOT* running as master
    # setting clone_state to "DEMOTED" is needed to avoid misleading "PROMOTED"/"PROMOTED"
    #
    crm_res=$(crm_resource -W -r "$OCF_RESOURCE_INSTANCE")
    [[ "$crm_res" =~ "is running on: "(.+)" "(Promoted|Master)"" ]] && master_node_name="${BASH_REMATCH[1]}" || master_node_name=""  # <---- From this
    if [ "$master_node_name" != "$NODENAME" ]; then                                                                                 # <----
-----------------------------------------8<-----------------------------------------
    node_score=$( crm_attribute --promotion="$OCF_RESOURCE_INSTANCE" -G | awk -F 'value=' '{ print $2 }' )  # <---- To this
    if [ "$node_score" -ne 150 ]; then                                                                      # <----