Challenger
Posted: Mon Jan 05, 2009 8:00 am    Post subject: Challenge Question - 01 / 2009
Centurion
Joined: 31 Mar 2008    Posts: 115
Dear Forum Members:
Happy New Year to you all, and here comes your January 2009 Challenge:
Scenario:
A large organisation, operating a mixture of UNIX-based (though not Linux) Version 5.3 queue managers at CSD11 to CSD12, 'separates' its applications by using clusters. Over time this has led to organic growth in the number of clusters as more applications were added to existing servers, and further servers were added for scaling, which has increased the number of queue managers. A cluster may contain as few as 6 queue managers, but any given queue manager may belong to as few as one or two clusters, or to 15 or more. Some servers have only one queue manager, others have two or three.
Because of this growth, documentation is very fragmented, and unravelling the cluster memberships is both time-consuming and difficult – there are very few queue managers that are only Full Repositories, or only Partial Repositories – and the use of Namelists in the cluster channel definitions (rather than separate channels per cluster) makes it hard to identify problems quickly.
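For anyone not familiar with the namelist approach, the channels in question are typically defined along the following lines – a minimal sketch only; the cluster, channel and host names here are purely illustrative and not taken from the real environment:
Code:
* one namelist holding every cluster the queue manager participates in
DEFINE NAMELIST(APP.CLUSTERS) NAMES(CLUSTER.A,CLUSTER.B,CLUSTER.C)
* a single cluster-receiver carries all of those clusters via CLUSNL
DEFINE CHANNEL(TO.QM1) CHLTYPE(CLUSRCVR) TRPTYPE(TCP) CONNAME('host1(1414)') CLUSNL(APP.CLUSTERS)
With one channel serving many clusters, working out which queue manager is in which cluster means chasing namelist contents rather than reading it straight off the channel definitions.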
Recently it has been noticed that on some queue managers the Repository Process has failed; the issue appeared when some queue managers were added to one of the clusters in which they participate. The affected queue managers were restarted and their Repository Process started too, but after a period of time the same issue recurred; the error is reported very poorly in the queue manager logs and on the console. On the servers with multiple queue managers, not all of the queue managers were added to the same cluster as those now presenting the issue.
Within the Production environment it is deemed unacceptable to keep ‘bouncing’ queue managers and an interim ‘fix’ is required to restart the failed Repository Process of a queue manager while keeping the queue manager running. This fix has been found, but occasionally when the Repository Process is restarted it immediately fails again, with the following as an example:
Repository manager started
AMQ9422: Repository manager error, RC=545284148
AMQ9409: Repository manager ended abnormally
Further remedial action identified a way to prevent this occurring.
Challenge:
1. Identify the probable cause of why a previously healthy queue manager’s Repository Process now fails, and how to remove the cause.
2. State the manual actions required to restart the failed Repository Process, including the additional action required when a restarted Repository Process fails immediately.
Good Luck;
Challenger
dgolding
Posted: Thu Jan 08, 2009 4:07 am    Post subject:
Yatiri
Joined: 16 May 2001    Posts: 668    Location: Switzerland
Possibly a corrupt message on SYSTEM.CLUSTER.COMMAND.QUEUE - when a queue manager in the cluster is doing an automatic resend every 30-ish days? The origin of this message needs to be identified - which queue manager - and then fixed.
I always like the brute-force approach to fixing cluster repository problems - burn it all down and build it all up again. Works every time - except when it doesn't.
1) On the offending queue manager, kill the amqrrmfa process for that queue manager - and get-inhibit the cluster command queue
2) CLEAR QLOCAL(SYSTEM.CLUSTER.REPOSITORY.QUEUE)
3) Restart the queue manager
It should start refreshing straight away.
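On a UNIX box that boils down to roughly the following - a sketch only, with an invented queue manager name QM1:
Code:
# 1) get-inhibit the cluster command queue, then stop the repository process
echo "ALTER QLOCAL(SYSTEM.CLUSTER.COMMAND.QUEUE) GET(DISABLED)" | runmqsc QM1
ps -ef | grep '[a]mqrrmfa -m QM1'     # note the PID, then:
kill <PID>
# 2) throw away the cached repository data
echo "CLEAR QLOCAL(SYSTEM.CLUSTER.REPOSITORY.QUEUE)" | runmqsc QM1
# 3) bounce the queue manager and re-enable the command queue
endmqm -i QM1
strmqm QM1
echo "ALTER QLOCAL(SYSTEM.CLUSTER.COMMAND.QUEUE) GET(ENABLED)" | runmqsc QM1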
PeterPotkay
Posted: Thu Jan 08, 2009 6:51 am    Post subject: Re: Challenge Question - 01 / 2009
Poobah
Joined: 15 May 2001    Posts: 7719
Challenger wrote:
Within the Production environment it is deemed unacceptable to keep 'bouncing' queue managers and an interim 'fix' is required to restart the failed Repository Process of a queue manager while keeping the queue manager running. This fix has been found,
What is this fix? How do you restart the Repository Process on its own, since all documentation points to the fact that this is a QM-internal process with no commands to control it separately?
_________________
Peter Potkay
Keep Calm and MQ On
fjb_saper
Posted: Thu Jan 08, 2009 3:22 pm    Post subject:
Grand High Poobah
Joined: 18 Nov 2003    Posts: 20729    Location: LI,NY
Was in those dire straits today. The process seemed to be running but the qmgr threw cluster resolution errors. Had to bounce the qmgr (6.0.2.5)...
Yes, it is a brute-force approach, but it is also what gets you back on your feet the fastest.
_________________
MQ & Broker admin
Challenger
Posted: Thu Jan 08, 2009 5:18 pm    Post subject:
Centurion
Joined: 31 Mar 2008    Posts: 115
OK...firstly my apologies for the lack of response, but my ISP has only just managed to restore service after three days - outsourcing is a wonderful thing!
dgolding
You've hit the nail on the head but it's the symptom, not the cause. As regards which queue manager may be the cause...the one consistent factor is that the Repository process failure is only occurring on FR queue managers.
PeterPotkay
As far as I am aware the 'fix' isn't documented, and I would be surprised if IBM condoned it. I'm not going to give the details yet, that will have to wait until someone finds the answer.
fjb_saper
Like you and dgolding, I too favour the brute-force approach. The problem, however, is that the clusters function well enough that management only allow a queue manager bounce when the Repository process cannot be recovered by the 'fix'. Even liberal application of trout cannot dissuade them!
So, in summary, we have an identified symptom - a 'poison' message - but the actual cause of that message requires identification.
PeterPotkay
Posted: Fri Jan 09, 2009 10:10 am    Post subject:
Poobah
Joined: 15 May 2001    Posts: 7719
Can we assume that we are not dealing with an internal bug that can only be solved by applying a Fix Pack or Hot Fix from IBM?
_________________
Peter Potkay
Keep Calm and MQ On
Challenger
Posted: Fri Jan 09, 2009 11:30 am    Post subject:
Centurion
Joined: 31 Mar 2008    Posts: 115
Peter,
A correct assumption; this is most definitely a user-induced problem. Furthermore, it has been possible to replicate it in V6.0.
Challenger
Posted: Thu Jan 22, 2009 2:36 am    Post subject:
Centurion
Joined: 31 Mar 2008    Posts: 115
It's all gone very quiet!
Is there anyone who wishes to attempt an explanation of what causes the 'poison' message to be generated - specifically, what administrator-configured object caused the Repository Process to fail?
fjb_saper
Posted: Thu Jan 22, 2009 4:27 am    Post subject:
Grand High Poobah
Joined: 18 Nov 2003    Posts: 20729    Location: LI,NY
Could it be that you have FRs on a higher version than the PRs and some messages are not understood? Or the other way round? I would have to review the cluster manual to verify which is the non-recommended configuration. IIRC, you are supposed to upgrade the FRs first...
So what happens when you have one FR @ 6.0 and the other @ 5.3?
_________________
MQ & Broker admin
Challenger
Posted: Thu Jan 22, 2009 8:03 am    Post subject:
Centurion
Joined: 31 Mar 2008    Posts: 115
fjb_saper wrote:
Could it be that you have FRs on a higher version than the PRs and some messages are not understood? Or the other way round? I would have to review the cluster manual to verify which is the non-recommended configuration. IIRC, you are supposed to upgrade the FRs first...
Nice try...but no cigar!
fjb_saper wrote:
So what happens when you have one FR @ 6.0 and the other @ 5.3?
When I have both the time and the equipment, this is something I want to find out. My personal view is that for the small amount of time that one FR will be at a lower version it should not be an issue, or I would expect the manual to recommend creating new FRs at the higher level, migrating the existing PRs to them, and demoting the original FRs.
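For reference, the promotion and demotion themselves are only a couple of MQSC commands per queue manager - shown here as a sketch with an invented cluster name:
Code:
* on the queue manager being promoted to full repository
ALTER QMGR REPOS(APP.CLUSTER)
* (use REPOSNL(...) instead where a namelist of clusters is in play)
* on the original full repository, once the new FRs hold the cluster data
ALTER QMGR REPOS(' ')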
In relation to the challenge: all the FRs are at the same version (5.3) and are kept in sync as regards fix packs.
mq_developer
Posted: Mon Jan 26, 2009 3:46 pm    Post subject:
Voyager
Joined: 18 Feb 2002    Posts: 82
From past experience: the repository manager is crashing because of a buffer overrun, and the buffer overrun is caused by one of the partial or full repository queue managers sending a poison-pill message.
Remedy: identify the first message in the REPOS queue and go from there.
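Assuming the 'REPOS queue' here means the SYSTEM.CLUSTER.COMMAND.QUEUE mentioned earlier, the quickest way to look at that first message is the browse sample shipped with MQ - the install path and queue manager name are assumptions:
Code:
# browse and dump every message on the cluster command queue, MQMD first
/usr/mqm/samp/bin/amqsbcg SYSTEM.CLUSTER.COMMAND.QUEUE QM1 | more
The MQMD fields in the dump (ReplyToQMgr, PutApplName and so on) usually give a good hint as to which queue manager sent the offending message.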
mq_developer
Posted: Mon Jan 26, 2009 3:59 pm    Post subject:
Voyager
Joined: 18 Feb 2002    Posts: 82
If this be it, credit goes to the greatest tool of this age: GOOGLE
Problem summary
****************************************************************
USERS AFFECTED:
Customers who have defined multiple cluster receiver channels
with identical names, and who exploit the reset cluster command.
Platforms affected:
All Distributed (iSeries, all Unix and Windows)
****************************************************************
PROBLEM SUMMARY:
When multiple cluster receivers with the same name are defined
then only one of these channels will be 'in-use' concurrently.
If clustering control messages get queued for the instance of
the channel that is not 'in-use' then the error handling in
amqrmppa is incorrect and the process terminates.
Problem conclusion
amqrrmfa has been changed to ignore the messages for the
duplicate channel name. In order to fully resolve this problem
the customer must delete and redefine the duplicate cluster
receiver with a unique name, and issue a RESET CLUSTER command
to propagate the corrected configuration around the cluster.
---------------------------------------------------------------
The fix is targeted for delivery in the following PTFs:
v5.3
Platform Fix Pack 13
-------- --------------------
Windows U200246
AIX U804647
HP-UX (PA-RISC) U804874
Solaris (SPARC) U804876
iSeries SI24234
Linux (x86) U804877
Linux (zSeries) U804879
Quote:
Furthermore it has been possible to replicate it in V6.0.
v6.0
Platform Fix Pack 6.0.1.1
-------- --------------------
Windows U200247
AIX U804921
HP-UX (PA-RISC) U805233
HP-UX (Itanium) U805767
Solaris (SPARC) U805234
Solaris (x86-64) U805768
iSeries SI21854
Linux (x86) U805235
Linux (x86-64) U805770
Linux (zSeries) U805236
Linux (Power) U805237
Linux (s390x) U805769
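Reading that Problem conclusion back into commands, the clean-up would look roughly like this - the channel, cluster and queue manager names are invented for the example:
Code:
* on the queue manager owning the duplicate-named CLUSRCVR: replace it with a uniquely named one
DELETE CHANNEL(TO.DUPLICATE)
DEFINE CHANNEL(TO.QM1.APP) CHLTYPE(CLUSRCVR) TRPTYPE(TCP) CONNAME('host1(1414)') CLUSTER(APP.CLUSTER)
* on a full repository: flush the stale entry so the corrected definition propagates
RESET CLUSTER(APP.CLUSTER) ACTION(FORCEREMOVE) QMNAME(QM1) QUEUES(NO)
* the forcibly removed queue manager then typically needs REFRESH CLUSTER(APP.CLUSTER) REPOS(NO) to rejoin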
PeterPotkay
Posted: Mon Jan 26, 2009 6:32 pm    Post subject:
Poobah
Joined: 15 May 2001    Posts: 7719
PeterPotkay wrote:
Can we assume that we are not dealing with an internal bug that can only be solved by applying a Fix Pack or Hot Fix from IBM?
Challenger wrote:
Peter,
A correct assumption; this is most definitely a user-induced problem.
_________________
Peter Potkay
Keep Calm and MQ On
zax
Posted: Mon Jan 26, 2009 10:18 pm    Post subject:
Newbie
Joined: 20 Jan 2009    Posts: 2
The repos mgr reads SYSTEM.CLUSTER.COMMAND.QUEUE. For it to crash some action by the user must have caused a bad msg to be put to the queue.
Maybe some apps put non-PCF msgs to the queue, or some other msg that the repos mgr cannot recognise or deal with. It may not be the apps' fault; the cmdq might be a resolved name in a queue alias or remote queue.
The presence of such msgs could be determined by browsing the queue. It would not be easy to make much sense of them, but it should be possible to determine whether the msg at the front of the queue is legitimate or not.
The solution is to do a destructive read of the first msg from the queue, or as many msgs as need to be removed, and restart the repos mgr only. It can be started by entering the same command line as it displays in ps output, for example:
Code:
nohup /usr/mqm/bin/amqrrmfa -m QMGR -t2332800 -s2592000 -p2592000 -g5184000 -c3600 > /dev/null 2>&1 &
This will restart the repos mgr. The EC will no longer have a record of the repos mgr PID, so there will not be a ZX005025 FFST if the repos mgr fails again, but that is bearable until the qmgr can be restarted.
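One small precaution while everything is still healthy: capture that command line in advance so it is to hand when the repos mgr dies - the queue manager name here is illustrative:
Code:
# record the repository manager's full command line for later re-use
ps -ef | grep '[a]mqrrmfa' | grep QMGR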
Challenger
Posted: Tue Jan 27, 2009 3:32 am    Post subject:
Centurion
Joined: 31 Mar 2008    Posts: 115
zax
Top marks! And your first post too! You have correctly answered the second part of the question - how to manually recover the Repository Manager.
PeterPotkay
zax's answer explains my reason for not giving you the explanation earlier in the challenge.
All
A number of people have identified that a poison message is what causes the Repository Manager to fail. As I will be closing the challenge this coming Thursday (29th January) so the winner can be announced on Friday, does anyone want to have a stab at what the user-induced error is?
Hint: It is not application related in any way - it is a configuration issue.