mqdev
Posted: Mon Jul 15, 2019 7:23 am Post subject: RDQM in "Ended Unexpectedly" status - clueless..
Hello,
I have a 3-node RDQM HA group hosting 4 replicated queue managers - QM0/1/2/3.
QM1/2/3 are healthy and show "Running" on one of the 3 nodes. QM0, however, is in "Ended unexpectedly" status on one node, while the other 2 nodes report "Running elsewhere" for it.
Here is my investigation:
1. QM1/2/3 are healthy - able to fail over etc. with no issues whatsoever.
2. QM0 is generating an FDC with the following:
Major Errorcode   :- xecF_E_UNEXPECTED_SYSTEM_RC
Minor Errorcode   :- OK
Probe Type        :- MSGAMQ6119
Probe Severity    :- 1
Probe Description :- AMQ6119S: An internal IBM MQ error has occurred (Rc=5 from write)
FDCSequenceNumber :- 0
Comment1          :- Rc=5 from write
MQM Function Stack
amqzxma0
zxcExecutionController
zxcStartupLogic
kpiStartup
apiStartup
almPerformReDoPass
apiReplayLogRecord
aqmReplayLog
aqlDoReplayLog
aqpWriteData
adhWrite
adiWriteFile
xcsFFST
MQM Trace History
-------{ xcsFreeMemFn
Data: 0x00000000 0x0243eb30
-------} xcsFreeMemFn rc=OK
------} xcsFreeQuickCellEnumerator rc=OK
-----} xcsSearchQuickCellBlock rc=xecU_W_KEY_NOT_FOUND
-----{ xlsInitMutex
------{ xlsSpinLockInit
The above FDC was generated when I rebooted the node where QM0 is "Ended unexpectedly".
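(For the record, the function stack above shows the failure happening while the queue manager replays its recovery log during startup, with a write returning rc=5. A quick way to pull together the FDC history and recent error-log entries on the node - a sketch only, assuming the standard tools shipped with MQ:)
Code:
cd /var/mqm/errors && ffstsummary
tail -50 /var/mqm/errors/AMQERR01.LOG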
root@host01 /var/mqm/errors
# dspmq
QMNAME(QM0.TEST_RDQM) STATUS(Ended unexpectedly)
QMNAME(QM1.TEST_RDQM) STATUS(Running elsewhere)
QMNAME(QM2.TEST_RDQM) STATUS(Running elsewhere)
QMNAME(QM3.TEST_RDQM) STATUS(Running elsewhere)
Now (post reboot), if I attempt to start the QM (strmqm <QM Name>), I get the following error:
# strmqm QM0.TEST_RDQM
IBM MQ queue manager 'QM0.TEST_RDQM' HA sub-system is already active.
root@host01 /var/mqm/errors
Any ideas on how to troubleshoot this error? There is nothing suspicious under /var/mqm/vols/qmgrs for QM0, the QM under duress.
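In case it helps anyone reading along, here is a first-pass set of checks worth running on the affected node (a sketch only - adjust the queue manager name; it assumes the usual RDQM stack where Pacemaker and DRBD manage the queue manager):
Code:
# detailed HA view of this queue manager on this node
rdqmstatus -m QM0.TEST_RDQM
# replication state of the DRBD volumes backing the RDQM queue managers
drbdadm status
# Pacemaker's view of the cluster, including any failed resource actions
crm status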
hughson
Posted: Mon Jul 15, 2019 3:01 pm
What is the output of the following commands, and are there any logged issues with Pacemaker
Code:
systemctl status pacemaker
or with Corosync?
Code:
systemctl status corosync
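If the status snippets have already scrolled past the problem, the full logs can also be pulled from the journal - a sketch, assuming a systemd-based host and adjusting the time window:
Code:
journalctl -u pacemaker --since "2019-07-15"
journalctl -u corosync --since "2019-07-15"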
Cheers,
Morag
mqdev
Posted: Tue Jul 16, 2019 8:11 am
Results from host03:
root@host03 ~
# rdqmstatus
Node: host03.xyz.com
Queue manager name: QM0.TEST.RDQM
Queue manager status: Running elsewhere
HA current location: host01.xyz.com
Queue manager name: QM1.TEST.RDQM
Queue manager status: Running elsewhere
HA current location: host02.xyz.com
Queue manager name: QM2.TEST.RDQM
Queue manager status: Running elsewhere
HA current location: host02.xyz.com
Queue manager name: QM3.TEST.RDQM
Queue manager status: Running elsewhere
HA current location: host02.xyz.com
root@host03 ~
root@host03 ~
# dspmq
QMNAME(QM0.TEST.RDQM) STATUS(Running elsewhere)
QMNAME(QM1.TEST.RDQM) STATUS(Running elsewhere)
QMNAME(QM2.TEST.RDQM) STATUS(Running elsewhere)
QMNAME(QM3.TEST.RDQM) STATUS(Running elsewhere)
root@host03 ~
============================================================================
Results from host02:
root@host02 ~
# rdqmstatus
Node: host02.xyz.com
Queue manager name: QM0.TEST.RDQM
Queue manager status: Running elsewhere
HA current location: host01.xyz.com
Queue manager name: QM1.TEST.RDQM
Queue manager status: Running
HA current location: This node
Queue manager name: QM2.TEST.RDQM
Queue manager status: Running
HA current location: This node
Queue manager name: QM3.TEST.RDQM
Queue manager status: Running
HA current location: This node
root@host02 ~
root@host02 ~
# dspmq
QMNAME(QM0.TEST.RDQM) STATUS(Running elsewhere)
QMNAME(QM1.TEST.RDQM) STATUS(Running)
QMNAME(QM2.TEST.RDQM) STATUS(Running)
QMNAME(QM3.TEST.RDQM) STATUS(Running)
root@host02 ~
============================================================================
Results from host01:
root@host01 /var/mqm/errors
# rdqmstatus
Node: host01.xyz.com
Queue manager name: QM0.TEST.RDQM
Queue manager status: Ended unexpectedly
HA current location: This node
Queue manager name: QM1.TEST.RDQM
Queue manager status: Running elsewhere
HA current location: host02.xyz.com
Queue manager name: QM2.TEST.RDQM
Queue manager status: Running elsewhere
HA current location: host02.xyz.com
Queue manager name: QM3.TEST.RDQM
Queue manager status: Running elsewhere
HA current location: host02.xyz.com
root@host01 /var/mqm/errors
# dspmq
QMNAME(QM0.TEST.RDQM) STATUS(Ended unexpectedly)
QMNAME(QM1.TEST.RDQM) STATUS(Running elsewhere)
QMNAME(QM2.TEST.RDQM) STATUS(Running elsewhere)
QMNAME(QM3.TEST.RDQM) STATUS(Running elsewhere)
-------------------------------------------------------------------------------------------------------------------------
crm status output:
------------------
Stack: corosync
Current DC: host03.xyz.com (version 1.1.15.linbit-2.0+20160622+e174ec8.el7-e174ec8) - partition with quorum
Last updated: Tue Jul 16 12:02:11 2019 Last change: Tue Jul 16 10:06:33 2019 by root via crm_attribute on host02.xyz.com
3 nodes and 24 resources configured
Online: [ host02.xyz.com host01.xyz.com host03.xyz.com ]
Full list of resources:
Master/Slave Set: ms_drbd_qm0_test_rdqm [p_drbd_qm0_test_rdqm]
Masters: [ host01.xyz.com ]
Slaves: [ host02.xyz.com host03.xyz.com ]
p_fs_qm0_test_rdqm (ocf::heartbeat:Filesystem): Started host01.xyz.com
p_rdqmx_qm0_test_rdqm (ocf::ibm:rdqmx): Started host01.xyz.com
qm0_test_rdqm (ocf::ibm:rdqm): Stopped
Master/Slave Set: ms_drbd_qm1_test_rdqm [p_drbd_qm1_test_rdqm]
Masters: [ host02.xyz.com ]
Slaves: [ host01.xyz.com host03.xyz.com ]
p_fs_qm1_test_rdqm (ocf::heartbeat:Filesystem): Started host02.xyz.com
p_rdqmx_qm1_test_rdqm (ocf::ibm:rdqmx): Started host02.xyz.com
qm1_test_rdqm (ocf::ibm:rdqm): Started host02.xyz.com
Master/Slave Set: ms_drbd_qm2_test_rdqm [p_drbd_qm2_test_rdqm]
Masters: [ host02.xyz.com ]
Slaves: [ host01.xyz.com host03.xyz.com ]
p_fs_qm2_test_rdqm (ocf::heartbeat:Filesystem): Started host02.xyz.com
p_rdqmx_qm2_test_rdqm (ocf::ibm:rdqmx): Started host02.xyz.com
qm2_test_rdqm (ocf::ibm:rdqm): Started host02.xyz.com
Master/Slave Set: ms_drbd_qm3_test_rdqm [p_drbd_qm3_test_rdqm]
Masters: [ host02.xyz.com ]
Slaves: [ host01.xyz.com host03.xyz.com ]
p_fs_qm3_test_rdqm (ocf::heartbeat:Filesystem): Started host02.xyz.com
p_rdqmx_qm3_test_rdqm (ocf::ibm:rdqmx): Started host02.xyz.com
qm3_test_rdqm (ocf::ibm:rdqm): Started host02.xyz.com
Failed Actions:
* qm0_test_rdqm_start_0 on host03.xyz.com 'unknown error' (1): call=142, status=complete, exitreason='none',
last-rc-change='Mon Jul 15 10:36:39 2019', queued=1ms, exec=915ms
* qm0_test_rdqm_start_0 on host01.xyz.com 'unknown error' (1): call=91, status=complete, exitreason='none',
last-rc-change='Tue Jul 16 10:15:35 2019', queued=0ms, exec=932ms
* qm0_test_rdqm_start_0 on host02.xyz.com 'unknown error' (1): call=132, status=complete, exitreason='none',
last-rc-change='Mon Jul 15 10:36:34 2019', queued=0ms, exec=1244ms
-------------------------------------------------------------------------------------------------------------------------
Pacemaker status:
# systemctl status pacemaker
● pacemaker.service - Pacemaker High Availability Cluster Manager
Loaded: loaded (/usr/lib/systemd/system/pacemaker.service; enabled; vendor preset: disabled)
Active: active (running) since Tue 2019-07-16 10:15:08 EDT; 1h 49min ago
Docs: man:pacemakerd
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/index.html
Main PID: 6247 (pacemakerd)
CGroup: /system.slice/pacemaker.service
├─6247 /usr/sbin/pacemakerd -f
├─6270 /usr/libexec/pacemaker/cib
├─6271 /usr/libexec/pacemaker/stonithd
├─6272 /usr/libexec/pacemaker/lrmd
├─6273 /usr/libexec/pacemaker/attrd
├─6274 /usr/libexec/pacemaker/pengine
└─6275 /usr/libexec/pacemaker/crmd
Jul 16 10:15:36 host01.xyz.com attrd[6273]: notice: Update relayed from host03.xyz.com
Jul 16 10:15:36 host01.xyz.com attrd[6273]: notice: Sending flush op to all hosts for: last-failure-qm0_test_rdqm (1563286536)
Jul 16 10:15:36 host01.xyz.com attrd[6273]: notice: Sent update 22: last-failure-qm0_test_rdqm=1563286536
Jul 16 10:15:36 host01.xyz.com attrd[6273]: notice: Update relayed from host03.xyz.com
Jul 16 10:15:36 host01.xyz.com attrd[6273]: notice: Sending flush op to all hosts for: fail-count-qm0_test_rdqm (INFINITY)
Jul 16 10:15:36 host01.xyz.com attrd[6273]: notice: Sent update 24: fail-count-qm0_test_rdqm=INFINITY
Jul 16 10:15:36 host01.xyz.com attrd[6273]: notice: Update relayed from host03.xyz.com
Jul 16 10:15:36 host01.xyz.com rdqm[8663]: stop(qm0_test_rdqm)
Jul 16 10:15:36 host01.xyz.com rdqm[8663]: stop(qm0_test_rdqm) rc(0)
Jul 16 10:15:36 host01.xyz.com crmd[6275]: notice: Result of stop operation for qm0_test_rdqm on host01.xyz.com: 0 (ok)
root@host01 /var/mqm/errors
-------------------------------------------------------------------------------------------------------------------------
Corosync status:
# systemctl status corosync
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/usr/lib/systemd/system/corosync.service; disabled; vendor preset: disabled)
Active: active (running) since Tue 2019-07-16 10:15:08 EDT; 1h 51min ago
Process: 5943 ExecStart=/usr/share/corosync/corosync start (code=exited, status=0/SUCCESS)
Main PID: 6108 (corosync)
CGroup: /system.slice/corosync.service
└─6108 corosync
Jul 16 10:15:08 host01.xyz.com corosync[6108]: [TOTEM ] adding new UDPU member {172.24.102.130}
Jul 16 10:15:08 host01.xyz.com corosync[6108]: [TOTEM ] adding new UDPU member {172.24.103.121}
Jul 16 10:15:08 host01.xyz.com corosync[6108]: [TOTEM ] adding new UDPU member {172.24.103.151}
Jul 16 10:15:08 host01.xyz.com corosync[6108]: [TOTEM ] A new membership (172.24.103.121:2196) was formed. Members joined: 2
Jul 16 10:15:08 host01.xyz.com corosync[6108]: [TOTEM ] A new membership (172.24.102.130:2200) was formed. Members joined: 1 3
Jul 16 10:15:08 host01.xyz.com corosync[6108]: [QUORUM] This node is within the primary component and will provide service.
Jul 16 10:15:08 host01.xyz.com corosync[6108]: [QUORUM] Members[3]: 1 2 3
Jul 16 10:15:08 host01.xyz.com corosync[6108]: [MAIN ] Completed service synchronization, ready to provide service.
Jul 16 10:15:08 host01.xyz.com corosync[5943]: Starting Corosync Cluster Engine (corosync): [ OK ]
Jul 16 10:15:08 host01.xyz.com systemd[1]: Started Corosync Cluster Engine.
-------------------------------------------------------------------------------------------------------------------------
mqdev
Posted: Tue Jul 16, 2019 8:13 am
host01 seems to be the problematic node - there are errors in the Pacemaker log, as seen above.
host02 and host03 are reporting a valid status for QM0 ("Running elsewhere"), while host01 is reporting an error for it ("Ended unexpectedly").
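Given that the FDC reports "Rc=5 from write" during log replay, it may also be worth checking host01 for disk or replication I/O errors around the failure time (a sketch only; if that 5 is an errno it would be EIO):
Code:
# kernel messages (block device / DRBD errors) around the failure window
journalctl -k --since "2019-07-15" | grep -iE 'drbd|i/o error'
# replication state of the DRBD resources on this node
drbdadm status
# MQ system error log on this node
tail -100 /var/mqm/errors/AMQERR01.LOG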
mqdev
Posted: Tue Jul 16, 2019 8:18 am
I also tried the following command (on host01):
crm resource cleanup p_drbd_qm0_test_rdqm
Result: it did not work...
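(Noting for completeness: the entries under "Failed Actions" in the crm status output are recorded against the qm0_test_rdqm resource itself rather than the DRBD resource, so the cleanup may need to target that resource instead - a sketch, assuming the crmsh tooling shipped with RDQM:)
Code:
crm resource cleanup qm0_test_rdqm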
hughson
Posted: Wed Jul 17, 2019 1:53 pm
Also, I'll prod Hursley to take a look at this post.
Cheers,
Morag
hughson
Posted: Thu Jul 18, 2019 1:52 am
Hursley said you should raise a PMR.
Cheers,
Morag
mqdev
Posted: Thu Jul 18, 2019 4:44 am
Thanks Morag... we have already raised a PMR. Not much progress there either, I'm afraid. Will keep pushing though...
Thanks
-mqdev
mqdev
Posted: Thu Jul 18, 2019 4:48 am
I am just curious about the following entries in the Pacemaker log:
Jul 16 10:15:36 host01.xyz.com attrd[6273]: notice: Update relayed from host03.xyz.com
Jul 16 10:15:36 host01.xyz.com attrd[6273]: notice: Sending flush op to all hosts for: last-failure-qm0_test_rdqm (1563286536)
Jul 16 10:15:36 host01.xyz.com attrd[6273]: notice: Sent update 22: last-failure-qm0_test_rdqm=1563286536
Jul 16 10:15:36 host01.xyz.com attrd[6273]: notice: Update relayed from host03.xyz.com
Jul 16 10:15:36 host01.xyz.com attrd[6273]: notice: Sending flush op to all hosts for: fail-count-qm0_test_rdqm (INFINITY)
Jul 16 10:15:36 host01.xyz.com attrd[6273]: notice: Sent update 24: fail-count-qm0_test_rdqm=INFINITY
Jul 16 10:15:36 host01.xyz.com attrd[6273]: notice: Update relayed from host03.xyz.com
Jul 16 10:15:36 host01.xyz.com rdqm[8663]: stop(qm0_test_rdqm)
Jul 16 10:15:36 host01.xyz.com rdqm[8663]: stop(qm0_test_rdqm) rc(0)
Jul 16 10:15:36 host01.xyz.com crmd[6275]: notice: Result of stop operation for qm0_test_rdqm on host01.xyz.com: 0 (ok)
Questions:
1. What exactly does "INFINITY" signify here? If it is some unrecoverable condition, what is it? How can we identify this (and similar conditions) going forward? More than tackling this particular problem, we are looking for tips on troubleshooting such problems with RDQM (see the sketch after this list).
2. It looks like QM0 was stopped... why?
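(For reference while the PMR is open: as far as I understand Pacemaker's defaults, a failed start sets the resource's fail-count on that node to INFINITY because start-failure-is-fatal defaults to true, and the resource is then kept off that node until the count is cleared. A sketch for inspecting and clearing it, assuming the crmsh tooling shipped with RDQM:)
Code:
# show the fail count recorded for the resource on host01
crm resource failcount qm0_test_rdqm show host01.xyz.com
# clear it for that node, then let Pacemaker retry the start
crm resource failcount qm0_test_rdqm delete host01.xyz.com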