mqdev
Posted: Mon Jul 15, 2019 7:23 am    Post subject: RDQM in "Ended Unexpectedly" status - clueless..

Hello, mine is a 3-node RDQM cluster which hosts 4 RDQM queue managers: QM0/1/2/3.

QM1/2/3 are healthy and in "Running" status on one of the 3 nodes. QM0, however, is in "Ended unexpectedly" status on one node, while the other 2 nodes report "Running elsewhere" for it.

Here is my investigation so far:
1. QM1/2/3 are healthy and able to fail over etc. with no issues whatsoever.
2. QM0 is generating an FDC with the following:
 
| Major Errorcode   :- xecF_E_UNEXPECTED_SYSTEM_RC                            |
| Minor Errorcode   :- OK                                                     |
| Probe Type        :- MSGAMQ6119                                             |
| Probe Severity    :- 1                                                      |
| Probe Description :- AMQ6119S: An internal IBM MQ error has occurred (Rc=5  |
|   from            write)                                                    |
| FDCSequenceNumber :- 0                                                      |
| Comment1          :- Rc=5 from            write                             |
 
 
 MQM Function Stack
 amqzxma0
 zxcExecutionController
 zxcStartupLogic
 kpiStartup
 apiStartup
 almPerformReDoPass
 apiReplayLogRecord
 aqmReplayLog
 aqlDoReplayLog
 aqpWriteData
 adhWrite
 adiWriteFile
 xcsFFST
 
 MQM Trace History
 -------{ xcsFreeMemFn
 Data: 0x00000000 0x0243eb30
 -------} xcsFreeMemFn rc=OK
 ------} xcsFreeQuickCellEnumerator rc=OK
 -----} xcsSearchQuickCellBlock rc=xecU_W_KEY_NOT_FOUND
 -----{ xlsInitMutex
 ------{ xlsSpinLockInit
 
 The above FDC was generated when I rebooted the node where QM0 is "Ended unexpectedly".
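For what it's worth, "Rc=5 from write" looks like errno 5 (EIO), i.e. a write() issued during log replay at queue manager startup failed at the operating system level, so I have also been checking the DRBD device and filesystem underneath QM0. These are just the checks I ran; the resource name qm0_test_rdqm is taken from our Pacemaker configuration:

Code:
# state of the DRBD resource backing QM0
drbdadm status qm0_test_rdqm

# any I/O errors reported by the kernel around the time of the FDC
dmesg -T | grep -i -e drbd -e 'i/o error' | tail -20

# is the replicated filesystem for QM0 mounted and not out of space?
df -h /var/mqm/vols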
 
 
 root@host01 /var/mqm/errors
 # dspmq
 QMNAME(QM0.TEST_RDQM)                              STATUS(Ended unexpectedly)
 QMNAME(QM1.TEST_RDQM)                              STATUS(Running elsewhere)
 QMNAME(QM2.TEST_RDQM)                              STATUS(Running elsewhere)
 QMNAME(QM3.TEST_RDQM)                              STATUS(Running elsewhere)
 
root@host01 /var/mqm/errors
#
 
 
 
Now (post reboot), if I attempt to start the QM (strmqm <QM Name>), I get the following error:
 
 # strmqm QM0.TEST_RDQM
 IBM MQ queue manager 'QM0.TEST_RDQM' HA sub-system is already active.
 
 root@host01 /var/mqm/errors
 
 
 
Any ideas as to how to troubleshoot this error? There is nothing suspicious under /var/mqm/vols/qmgrs for QM0, the QM under duress.
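For reference, these are the first-line checks I have been running myself; the wildcarded path is just my assumption about the RDQM directory layout, so adjust as needed:

Code:
# RDQM's detailed per-queue-manager view on this node
rdqmstatus -m QM0.TEST_RDQM

# recent node-level MQ error logs and FDCs
ls -lrt /var/mqm/errors/

# queue manager error logs on the replicated volume, if it is mounted on this node
# (path pattern is an assumption about the RDQM layout)
ls -lrt /var/mqm/vols/*/qmgrs/*/errors/ 2>/dev/null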
 
hughson
Posted: Mon Jul 15, 2019 3:01 pm
			  
What is the output of the following commands:-

Code:
rdqmstatus

and:-

Code:
dspmq

and are there any logged issues with Pacemaker

Code:
systemctl status pacemaker

or with Corosync

Code:
systemctl status corosync

Cheers,
Morag
 _________________
 Morag Hughson @MoragHughson
 IBM MQ Technical Education Specialist
 Get your IBM MQ training here!
 MQGem Software
		
mqdev
Posted: Tue Jul 16, 2019 8:11 am
			  
Results from host03:
 root@host03 ~
 # rdqmstatus
 Node:                                   host03.xyz.com
 
 Queue manager name:                     QM0.TEST.RDQM
 Queue manager status:                   Running elsewhere
 HA current location:                    host01.xyz.com
 
 Queue manager name:                     QM1.TEST.RDQM
 Queue manager status:                   Running elsewhere
 HA current location:                    host02.xyz.com
 
 Queue manager name:                     QM2.TEST.RDQM
 Queue manager status:                   Running elsewhere
 HA current location:                    host02.xyz.com
 
 Queue manager name:                     QM3.TEST.RDQM
 Queue manager status:                   Running elsewhere
 HA current location:                    host02.xyz.com
 
 root@host03 ~
 root@host03 ~
 # dspmq
 QMNAME(QM0.TEST.RDQM)                              STATUS(Running elsewhere)
 QMNAME(QM1.TEST.RDQM)                              STATUS(Running elsewhere)
 QMNAME(QM2.TEST.RDQM)                              STATUS(Running elsewhere)
 QMNAME(QM3.TEST.RDQM)                              STATUS(Running elsewhere)
 
 root@host03 ~
 
 ============================================================================
 
 Results from host02:
 
 root@host02 ~
 # rdqmstatus
 Node:                                   host02.xyz.com
 
 Queue manager name:                     QM0.TEST.RDQM
 Queue manager status:                   Running elsewhere
 HA current location:                    host01.xyz.com
 
 Queue manager name:                     QM1.TEST.RDQM
 Queue manager status:                   Running
 HA current location:                    This node
 
 Queue manager name:                     QM2.TEST.RDQM
 Queue manager status:                   Running
 HA current location:                    This node
 
 Queue manager name:                     QM3.TEST.RDQM
 Queue manager status:                   Running
 HA current location:                    This node
 
 root@host02 ~
 root@host02 ~
 # dspmq
 QMNAME(QM0.TEST.RDQM)                              STATUS(Running elsewhere)
 QMNAME(QM1.TEST.RDQM)                              STATUS(Running)
 QMNAME(QM2.TEST.RDQM)                              STATUS(Running)
 QMNAME(QM3.TEST.RDQM)                              STATUS(Running)
 
 root@host02 ~
 
 ============================================================================
 
 
 Results from host01:
 
 root@host01 /var/mqm/errors
 # rdqmstatus
 Node:                                   host01.xyz.com
 
 Queue manager name:                     QM0.TEST.RDQM
 Queue manager status:                   Ended unexpectedly
 HA current location:                    This node
 
 Queue manager name:                     QM1.TEST.RDQM
 Queue manager status:                   Running elsewhere
 HA current location:                    host02.xyz.com
 
 Queue manager name:                     QM2.TEST.RDQM
 Queue manager status:                   Running elsewhere
 HA current location:                    host02.xyz.com
 
 Queue manager name:                     QM3.TEST.RDQM
 Queue manager status:                   Running elsewhere
 HA current location:                    host02.xyz.com
 
 root@host01 /var/mqm/errors
 # dspmq
 QMNAME(QM0.TEST.RDQM)                              STATUS(Ended unexpectedly)
 QMNAME(QM1.TEST.RDQM)                              STATUS(Running elsewhere)
 QMNAME(QM2.TEST.RDQM)                              STATUS(Running elsewhere)
 QMNAME(QM3.TEST.RDQM)                              STATUS(Running elsewhere)
 
 -------------------------------------------------------------------------------------------------------------------------
 
 crm status output:
 ------------------
 
 Stack: corosync
 Current DC: host03.xyz.com (version 1.1.15.linbit-2.0+20160622+e174ec8.el7-e174ec8) - partition with quorum
 Last updated: Tue Jul 16 12:02:11 2019          Last change: Tue Jul 16 10:06:33 2019 by root via crm_attribute on host02.xyz.com
 
 3 nodes and 24 resources configured
 
 Online: [ host02.xyz.com host01.xyz.com host03.xyz.com ]
 
 Full list of resources:
 
 Master/Slave Set: ms_drbd_qm0_test_rdqm [p_drbd_qm0_test_rdqm]
 Masters: [ host01.xyz.com ]
 Slaves: [ host02.xyz.com host03.xyz.com ]
 p_fs_qm0_test_rdqm      (ocf::heartbeat:Filesystem):    Started host01.xyz.com
 p_rdqmx_qm0_test_rdqm   (ocf::ibm:rdqmx):       Started host01.xyz.com
 qm0_test_rdqm   (ocf::ibm:rdqm):        Stopped
 Master/Slave Set: ms_drbd_qm1_test_rdqm [p_drbd_qm1_test_rdqm]
 Masters: [ host02.xyz.com ]
 Slaves: [ host01.xyz.com host03.xyz.com ]
 p_fs_qm1_test_rdqm      (ocf::heartbeat:Filesystem):    Started host02.xyz.com
 p_rdqmx_qm1_test_rdqm   (ocf::ibm:rdqmx):       Started host02.xyz.com
 qm1_test_rdqm   (ocf::ibm:rdqm):        Started host02.xyz.com
 Master/Slave Set: ms_drbd_qm2_test_rdqm [p_drbd_qm2_test_rdqm]
 Masters: [ host02.xyz.com ]
 Slaves: [ host01.xyz.com host03.xyz.com ]
 p_fs_qm2_test_rdqm      (ocf::heartbeat:Filesystem):    Started host02.xyz.com
 p_rdqmx_qm2_test_rdqm   (ocf::ibm:rdqmx):       Started host02.xyz.com
 qm2_test_rdqm   (ocf::ibm:rdqm):        Started host02.xyz.com
 Master/Slave Set: ms_drbd_qm3_test_rdqm [p_drbd_qm3_test_rdqm]
 Masters: [ host02.xyz.com ]
 Slaves: [ host01.xyz.com host03.xyz.com ]
 p_fs_qm3_test_rdqm      (ocf::heartbeat:Filesystem):    Started host02.xyz.com
 p_rdqmx_qm3_test_rdqm   (ocf::ibm:rdqmx):       Started host02.xyz.com
 qm3_test_rdqm   (ocf::ibm:rdqm):        Started host02.xyz.com
 
 Failed Actions:
 * qm0_test_rdqm_start_0 on host03.xyz.com 'unknown error' (1): call=142, status=complete, exitreason='none',
 last-rc-change='Mon Jul 15 10:36:39 2019', queued=1ms, exec=915ms
 * qm0_test_rdqm_start_0 on host01.xyz.com 'unknown error' (1): call=91, status=complete, exitreason='none',
 last-rc-change='Tue Jul 16 10:15:35 2019', queued=0ms, exec=932ms
 * qm0_test_rdqm_start_0 on host02.xyz.com 'unknown error' (1): call=132, status=complete, exitreason='none',
 last-rc-change='Mon Jul 15 10:36:34 2019', queued=0ms, exec=1244ms
 
 -------------------------------------------------------------------------------------------------------------------------
 
 Pacemaker status:
 
 # systemctl status pacemaker
 
 ● pacemaker.service - Pacemaker High Availability Cluster Manager
 Loaded: loaded (/usr/lib/systemd/system/pacemaker.service; enabled; vendor preset: disabled)
 Active: active (running) since Tue 2019-07-16 10:15:08 EDT; 1h 49min ago
 Docs: man:pacemakerd
 http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/index.html
 Main PID: 6247 (pacemakerd)
 CGroup: /system.slice/pacemaker.service
 ├─6247 /usr/sbin/pacemakerd -f
 ├─6270 /usr/libexec/pacemaker/cib
 ├─6271 /usr/libexec/pacemaker/stonithd
 ├─6272 /usr/libexec/pacemaker/lrmd
 ├─6273 /usr/libexec/pacemaker/attrd
 ├─6274 /usr/libexec/pacemaker/pengine
 └─6275 /usr/libexec/pacemaker/crmd
 
 Jul 16 10:15:36 host01.xyz.com attrd[6273]:   notice: Update relayed from host03.xyz.com
 Jul 16 10:15:36 host01.xyz.com attrd[6273]:   notice: Sending flush op to all hosts for: last-failure-qm0_test_rdqm (1563286536)
 Jul 16 10:15:36 host01.xyz.com attrd[6273]:   notice: Sent update 22: last-failure-qm0_test_rdqm=1563286536
 Jul 16 10:15:36 host01.xyz.com attrd[6273]:   notice: Update relayed from host03.xyz.com
 Jul 16 10:15:36 host01.xyz.com attrd[6273]:   notice: Sending flush op to all hosts for: fail-count-qm0_test_rdqm (INFINITY)
 Jul 16 10:15:36 host01.xyz.com attrd[6273]:   notice: Sent update 24: fail-count-qm0_test_rdqm=INFINITY
 Jul 16 10:15:36 host01.xyz.com attrd[6273]:   notice: Update relayed from host03.xyz.com
 Jul 16 10:15:36 host01.xyz.com rdqm[8663]: stop(qm0_test_rdqm)
 Jul 16 10:15:36 host01.xyz.com rdqm[8663]: stop(qm0_test_rdqm) rc(0)
 Jul 16 10:15:36 host01.xyz.com crmd[6275]:   notice: Result of stop operation for qm0_test_rdqm on host01.xyz.com: 0 (ok)
 
 root@host01 /var/mqm/errors
 
 -------------------------------------------------------------------------------------------------------------------------
 
 Corosync status:
 
 # systemctl status corosync
 ● corosync.service - Corosync Cluster Engine
 Loaded: loaded (/usr/lib/systemd/system/corosync.service; disabled; vendor preset: disabled)
 Active: active (running) since Tue 2019-07-16 10:15:08 EDT; 1h 51min ago
 Process: 5943 ExecStart=/usr/share/corosync/corosync start (code=exited, status=0/SUCCESS)
 Main PID: 6108 (corosync)
 CGroup: /system.slice/corosync.service
 └─6108 corosync
 
 Jul 16 10:15:08 host01.xyz.com corosync[6108]:   [TOTEM ] adding new UDPU member {172.24.102.130}
 Jul 16 10:15:08 host01.xyz.com corosync[6108]:   [TOTEM ] adding new UDPU member {172.24.103.121}
 Jul 16 10:15:08 host01.xyz.com corosync[6108]:   [TOTEM ] adding new UDPU member {172.24.103.151}
 Jul 16 10:15:08 host01.xyz.com corosync[6108]:   [TOTEM ] A new membership (172.24.103.121:2196) was formed. Members joined: 2
 Jul 16 10:15:08 host01.xyz.com corosync[6108]:   [TOTEM ] A new membership (172.24.102.130:2200) was formed. Members joined: 1 3
 Jul 16 10:15:08 host01.xyz.com corosync[6108]:   [QUORUM] This node is within the primary component and will provide service.
 Jul 16 10:15:08 host01.xyz.com corosync[6108]:   [QUORUM] Members[3]: 1 2 3
 Jul 16 10:15:08 host01.xyz.com corosync[6108]:   [MAIN  ] Completed service synchronization, ready to provide service.
 Jul 16 10:15:08 host01.xyz.com corosync[5943]: Starting Corosync Cluster Engine (corosync): [  OK  ]
 Jul 16 10:15:08 host01.xyz.com systemd[1]: Started Corosync Cluster Engine.
 
 
 
 -------------------------------------------------------------------------------------------------------------------------
 
		
mqdev
Posted: Tue Jul 16, 2019 8:13 am
			  
host01 seems to be the problematic node - there are errors in Pacemaker as seen above.

host02 and host03 are reporting a valid status for QM0 (Running elsewhere), while host01 is reporting an error for QM0 (Ended unexpectedly).
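To dig into host01 specifically, these are the logs I have been going through; the timestamp is just the window around the failed start shown in crm status:

Code:
# Pacemaker and rdqm resource agent messages around the failed start of QM0
journalctl -u pacemaker --since "2019-07-16 10:15:00" | grep -i qm0

# the rdqm resource agent also logs via syslog
grep rdqm /var/log/messages | tail -50

# one-shot cluster view including fail counts and inactive resources
crm_mon -1rf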
		
		
mqdev
Posted: Tue Jul 16, 2019 8:18 am
			  
I also tried the following command (on host01):

Code:
crm resource cleanup p_drbd_qm0_test_rdqm

Result: it did not work...
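Looking at the crm status output again, the failed start actions are recorded against qm0_test_rdqm, the rdqm resource itself, rather than against p_drbd_qm0_test_rdqm, which is the DRBD resource I cleaned up above. So the next thing on my list to try, untested so far, is a cleanup against that resource:

Code:
# clear the recorded start failures (and the fail count) for the rdqm resource
crm resource cleanup qm0_test_rdqm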
		
		
hughson
Posted: Wed Jul 17, 2019 1:53 pm
			  
Also: I'll prod Hursley to take a look at this post.
 
 Cheers,
 Morag
 _________________
 Morag Hughson @MoragHughson
 IBM MQ Technical Education Specialist
 Get your IBM MQ training here!
 MQGem Software
		
hughson
Posted: Thu Jul 18, 2019 1:52 am
			  
Hursley said you should raise a PMR.
 Cheers,
 Morag
 _________________
 Morag Hughson @MoragHughson
 IBM MQ Technical Education Specialist
 Get your IBM MQ training here!
 MQGem Software
		
mqdev
Posted: Thu Jul 18, 2019 4:44 am
			  
Thanks Morag... we already raised a PMR. Not much progress there either, I'm afraid. Will keep pushing though...

Thanks
-mqdev
		
mqdev
Posted: Thu Jul 18, 2019 4:48 am
			  
I am just curious about the following entries in the Pacemaker log:
 Jul 16 10:15:36 host01.xyz.com attrd[6273]: notice: Update relayed from host03.xyz.com
 Jul 16 10:15:36 host01.xyz.com attrd[6273]: notice: Sending flush op to all hosts for: last-failure-qm0_test_rdqm (1563286536)
 Jul 16 10:15:36 host01.xyz.com attrd[6273]: notice: Sent update 22: last-failure-qm0_test_rdqm=1563286536
 Jul 16 10:15:36 host01.xyz.com attrd[6273]: notice: Update relayed from host03.xyz.com
 Jul 16 10:15:36 host01.xyz.com attrd[6273]: notice: Sending flush op to all hosts for: fail-count-qm0_test_rdqm (INFINITY)
 Jul 16 10:15:36 host01.xyz.com attrd[6273]: notice: Sent update 24: fail-count-qm0_test_rdqm=INFINITY
 Jul 16 10:15:36 host01.xyz.com attrd[6273]: notice: Update relayed from host03.xyz.com
 Jul 16 10:15:36 host01.xyz.com rdqm[8663]: stop(qm0_test_rdqm)
 Jul 16 10:15:36 host01.xyz.com rdqm[8663]: stop(qm0_test_rdqm) rc(0)
 Jul 16 10:15:36 host01.xyz.com crmd[6275]: notice: Result of stop operation for qm0_test_rdqm on host01.xyz.com: 0 (ok)
 
Questions:

1. What exactly does "INFINITY" signify here? If it indicates some unrecoverable condition, what is that condition, and how can we identify this (and similar conditions) going forward? More than tackling this particular problem, we are looking for tips on how to troubleshoot such problems with RDQM...

2. It looks like QM0 was stopped... why?
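For question 1, my current reading of the Pacemaker documentation (not yet confirmed with support) is that a failed start operation sets the resource's fail-count on that node to INFINITY by default (start-failure-is-fatal=true), which bans the resource from that node until the failure is cleared; the stop in the log would then just be Pacemaker's normal recovery after the failed start, which may also answer question 2. If that is right, these are the sort of commands involved (resource and node names taken from the output above):

Code:
# show the fail count Pacemaker has recorded for QM0's rdqm resource on host01
crm resource failcount qm0_test_rdqm show host01.xyz.com

# clear the fail count and failed-action history once the underlying cause is fixed
crm resource cleanup qm0_test_rdqm host01.xyz.com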