Challenge Question - 01 / 2009

Challenger
PostPosted: Mon Jan 05, 2009 8:00 am    Post subject: Challenge Question - 01 / 2009

Centurion

Joined: 31 Mar 2008
Posts: 115

Dear Forum Members:

Happy New Year to you all

and, here comes your January 2009 Challenge:

Scenario:

A large organisation, operating a mixture of UNIX-based (though not Linux) Version 5.3 queue managers at CSD11 to CSD12, ‘separates’ its applications by using clusters. Over time this has led to organic growth in the number of clusters as more applications were added to existing servers, and further servers were added for scaling, which has increased the number of queue managers. A cluster may contain as few as 6 queue managers, but any given queue manager may be in as few as one or two clusters, or in as many as 15 or more. Some servers have only one queue manager; others have two or three.

Because of this growth, documentation is very fragmented, and unravelling the cluster memberships is both time consuming and difficult – there are very few queue managers that are only Full Repositories or only Partial Repositories – and the use of namelists in the cluster channel definitions (rather than separate channels per cluster) makes it hard to identify problems quickly.
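To give a flavour of the untangling involved, the memberships can at least be pulled out of each queue manager with runmqsc. A minimal sketch (the attribute selections are only a suggestion of what is useful; nothing here is specific to this installation):

Code:

* Sketch only: run against each queue manager to map its cluster memberships.
DISPLAY QMGR REPOS REPOSNL
DISPLAY NAMELIST(*) NAMES
DISPLAY CHANNEL(*) CHLTYPE(CLUSRCVR) CLUSTER CLUSNL CONNAME
DISPLAY CHANNEL(*) CHLTYPE(CLUSSDR) CLUSTER CLUSNL CONNAME
DISPLAY CLUSQMGR(*) QMTYPE

Correlating the namelist contents with the CLUSQMGR output per queue manager is what turns this into a membership map, which is exactly the work the fragmented documentation fails to save.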

Recently it has been noticed that on some queue managers the Repository Process has failed; the issue appeared when some queue managers were added to one of the clusters in which they participate. The affected queue managers were restarted and their Repository Process started with them, but after a period of time the same issue recurred; the error is reported very poorly in the queue manager logs and on the console. On the servers with multiple queue managers, not all of the queue managers were added to the same cluster as the queue managers that are now presenting the issue.

Within the Production environment it is deemed unacceptable to keep ‘bouncing’ queue managers and an interim ‘fix’ is required to restart the failed Repository Process of a queue manager while keeping the queue manager running. This fix has been found, but occasionally when the Repository Process is restarted it immediately fails again, with the following as an example:

Repository manager started
AMQ9422: Repository manager error, RC=545284148
AMQ9409: Repository manager ended abnormally

Further remedial action identified a way to prevent this occurring.

Challenge:

1. Identify the probable cause of a previously healthy queue manager’s Repository Process now failing, and explain how to remove that cause.
2. State the manual actions required to restart the failed Repository Process, including the additional action required when a restarted Repository Process fails immediately.

Good Luck;
Challenger
dgolding
PostPosted: Thu Jan 08, 2009 4:07 am

Yatiri

Joined: 16 May 2001
Posts: 668
Location: Switzerland

Possible corrupt message in SYSTEM.CLUSTER.COMMAND.QUEUE - when a queue manager in the cluster is doing an automatic resend every 30-ish days? The origin of this message needs to be identified - which queue manager - and then fixed.

I always like the brute-force approach to fixing cluster repository problems - burn it all down and build it all up again. Works every time - except when it doesn't

1) On the offending queue manager, kill the amqrrmfa process for that queue manager and get-inhibit the cluster command queue
2) clear qlocal(system.cluster.repository.queue)
3) restart the queue manager

It should start refreshing straight away
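A rough sketch of those three steps as commands, assuming an illustrative queue manager name of QM1 (this is the brute-force route described above, not an IBM-endorsed procedure):

Code:

# Sketch only - QM1 is an illustrative queue manager name.
# 1) kill the repository process and get-inhibit the cluster command queue
ps -ef | grep amqrrmfa | grep QM1            # note the PID
kill <pid>
echo "ALTER QLOCAL(SYSTEM.CLUSTER.COMMAND.QUEUE) GET(DISABLED)" | runmqsc QM1
# 2) throw away the locally cached repository
echo "CLEAR QLOCAL(SYSTEM.CLUSTER.REPOSITORY.QUEUE)" | runmqsc QM1
# 3) bounce the queue manager and re-enable the command queue
endmqm -i QM1
strmqm QM1
echo "ALTER QLOCAL(SYSTEM.CLUSTER.COMMAND.QUEUE) GET(ENABLED)" | runmqsc QM1

(REFRESH CLUSTER with REPOS(YES) is the documented way to rebuild a partial repository's view of a cluster, with its own caveats, but that is a per-cluster operation.)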
PeterPotkay
PostPosted: Thu Jan 08, 2009 6:51 am    Post subject: Re: Challenge Question - 01 / 2009

Poobah

Joined: 15 May 2001
Posts: 7717

Challenger wrote:

Within the Production environment it is deemed unacceptable to keep ‘bouncing’ queue managers and an interim ‘fix’ is required to restart the failed Repository Process of a queue manager while keeping the queue manager running. This fix has been found,


What is this fix? How do you restart the Repository Process on its own, since all documentation points to the fact that this is a QM-internal process with no commands to control it separately?
_________________
Peter Potkay
Keep Calm and MQ On
fjb_saper
PostPosted: Thu Jan 08, 2009 3:22 pm

Grand High Poobah

Joined: 18 Nov 2003
Posts: 20696
Location: LI,NY

Was in those dire straits today. The process seemed to be running but the qmgr threw a cluster resolution error. Had to bounce the qmgr (6.0.2.5)...

Yes, it is a brute-force approach, but it is also what gets you back on your feet the fastest.
_________________
MQ & Broker admin
Challenger
PostPosted: Thu Jan 08, 2009 5:18 pm

Centurion

Joined: 31 Mar 2008
Posts: 115

OK...firstly my apologies for the lack of response, but my ISP has only just managed to restore service after three days - outsourcing is a wonderful thing!


dgolding
You've hit the nail on the head, but that's the symptom, not the cause. As regards which queue manager may be the cause...the one consistent factor is that the Repository Process failure is only occurring on FR queue managers.

PeterPotkay
As far as I am aware the 'fix' isn't documented, and I would be surprised if IBM condoned it. I'm not going to give the details yet, that will have to wait until someone finds the answer.

fjb_saper
Like you and dgolding, I too favour the brute-force approach. The problem, however, is that the clusters function well enough that management only allow a queue manager bounce when the Repository Process cannot be recovered by the 'fix'. Even liberal application of trout cannot dissuade them!

So, in summary, we have an identified symptom - a 'poison' message - but the actual cause of that message requires identification.
PeterPotkay
PostPosted: Fri Jan 09, 2009 10:10 am

Poobah

Joined: 15 May 2001
Posts: 7717

Can we assume that we are not dealing with an internal bug that can only be solved by applying a Fix Pack or Hot Fix from IBM?
_________________
Peter Potkay
Keep Calm and MQ On
Challenger
PostPosted: Fri Jan 09, 2009 11:30 am

Centurion

Joined: 31 Mar 2008
Posts: 115

Peter,

A correct assumption; this is most definitely a user-induced problem. Furthermore it has been possible to replicate it in V6.0.
Challenger
PostPosted: Thu Jan 22, 2009 2:36 am

Centurion

Joined: 31 Mar 2008
Posts: 115

It's all gone very quiet!

Is there anyone who wishes to attempt an explanation of what causes the 'poison' message to be generated - specifically, what administrator-configured object caused the failure of the Repository Process?
fjb_saper
PostPosted: Thu Jan 22, 2009 4:27 am

Grand High Poobah

Joined: 18 Nov 2003
Posts: 20696
Location: LI,NY

Could it be that you have the FRs on a higher version than the PRs and some messages are not understood? Or the other way round? I would have to review the cluster manual to verify which is not a recommended configuration, but IIRC you are supposed to upgrade the FRs first...
So what happens when you have 1 FR @ 6.0 and the other @ 5.3 ?
_________________
MQ & Broker admin
Challenger
PostPosted: Thu Jan 22, 2009 8:03 am

Centurion

Joined: 31 Mar 2008
Posts: 115

fjb_saper wrote:
Could it be that you have the FRs on a higher version than the PRs and some messages are not understood? Or the other way round? I would have to review the cluster manual to verify which is not a recommended configuration, but IIRC you are supposed to upgrade the FRs first...


Nice try...but no cigar!

fjb_saper wrote:
So what happens when you have 1 FR @ 6.0 and the other @ 5.3 ?


When I have both the time and the equipment, this is something I want to find out. My personal view is that, for the small amount of time that one FR will be at a lower version, it should not be an issue; otherwise I would expect the manual to recommend creating new FRs at the higher level, migrating the existing PRs to them, and demoting the original FRs.

In relation to the challenge: all the FRs are at the same version (5.3) and are kept in sync as regards fix packs.
mq_developer
PostPosted: Mon Jan 26, 2009 3:46 pm

Voyager

Joined: 18 Feb 2002
Posts: 82

From past experience: the repository manager is crashing because of a buffer overrun, and the buffer overrun is caused by one of the partial or full repository queue managers sending a poison-pill message.

Remedy: identify the first message in the REPOS queue and go from there.
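A minimal first check along those lines, assuming an illustrative queue manager QM1 and that the queue being fed the poison pill is SYSTEM.CLUSTER.COMMAND.QUEUE, which the repository manager serves:

Code:

# Sketch only - QM1 is illustrative.
# Is anything stuck on the cluster command queue, and is anyone reading it?
echo "DISPLAY QLOCAL(SYSTEM.CLUSTER.COMMAND.QUEUE) CURDEPTH IPPROCS" | runmqsc QM1

A non-zero depth with no open input handle is a strong hint that the repository manager has gone away and left the offending message at the head of the queue.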
mq_developer
PostPosted: Mon Jan 26, 2009 3:59 pm

Voyager

Joined: 18 Feb 2002
Posts: 82

If this is it, credit goes to the greatest tool of this age: GOOGLE

Problem summary
****************************************************************
USERS AFFECTED:
Customers who have defined multiple cluster receiver channels
with identical names, and who exploit the reset cluster command.

Platforms affected:
All Distributed (iSeries, all Unix and Windows)
****************************************************************
PROBLEM SUMMARY:
When multiple cluster receivers with the same name are defined
then only one of these channels will be 'in-use' concurrently.
If clustering control messages get queued for the instance of
the channel that is not 'in-use' then the error handling in
amqrmppa is incorrect and the process terminates.
Problem conclusion
amqrrmfa has been changed to ignore the messages for the
duplicate channel name. In order to fully resolve this problem
the customer must delete and redefine the duplicate cluster
receiver with a unique name, and issue a RESET CLUSTER command
to propagate the corrected configuration around the cluster.

---------------------------------------------------------------
The fix is targeted for delivery in the following PTFs:

v5.3
Platform Fix Pack 13
-------- --------------------
Windows U200246
AIX U804647
HP-UX (PA-RISC) U804874
Solaris (SPARC) U804876
iSeries SI24234
Linux (x86) U804877
Linux (zSeries) U804879

Quote:
Furthermore it has been possible to replicate it in V6.0.

v6.0
Platform Fix Pack 6.0.1.1
-------- --------------------
Windows U200247
AIX U804921
HP-UX (PA-RISC) U805233
HP-UX (Itanium) U805767
Solaris (SPARC) U805234
Solaris (x86-64) U805768
iSeries SI21854
Linux (x86) U805235
Linux (x86-64) U805770
Linux (zSeries) U805236
Linux (Power) U805237
Linux (s390x) U805769
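As a sketch of the corrective action described in that APAR text (the channel, cluster, namelist and connection names below are purely illustrative, not taken from the thread):

Code:

* On the queue manager that owns the duplicate cluster receiver:
DELETE CHANNEL(TO.DUPLICATE.NAME)
DEFINE CHANNEL(TO.QM1.UNIQUE) CHLTYPE(CLUSRCVR) TRPTYPE(TCP) +
       CONNAME('host1(1414)') CLUSNL(APP.CLUSTERS.NL)
* Then, from a full repository, force out the stale entry so the corrected
* definition can propagate around the cluster:
RESET CLUSTER(APP.CLUSTER1) QMNAME(QM1) ACTION(FORCEREMOVE) QUEUES(NO)

After the FORCEREMOVE the queue manager rejoins the cluster through its now uniquely named cluster receiver; a REFRESH CLUSTER on it may be needed to speed that up.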
PeterPotkay
PostPosted: Mon Jan 26, 2009 6:32 pm

Poobah

Joined: 15 May 2001
Posts: 7717

PeterPotkay wrote:
Can we assume that we are not dealing with an internal bug that can only be solved by applying a Fix Pack or Hot Fix from IBM?



Challenger wrote:
Peter,

A correct assumption; this is most definitely a user-induced problem.

_________________
Peter Potkay
Keep Calm and MQ On
zax
PostPosted: Mon Jan 26, 2009 10:18 pm

Newbie

Joined: 20 Jan 2009
Posts: 2

The repos mgr reads SYSTEM.CLUSTER.COMMAND.QUEUE. For it to crash some action by the user must have caused a bad msg to be put to the queue.
Maybe some apps put non-PCF msgs to the queue, or some other msg that the repos mgr cannot recognise or deal with. It may not be the apps' fault; the cmdq might be a resolved name in a queue alias or remote queue.
The presence of such msgs could be determined by browsing the queue. It would not be easy to make much sense of them, but it should be possible to determine whether the msg at the front of the queue is legitimate or not.
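For instance, the browse could be done with the amqsbcg sample, which prints the MQMD and the message body in hex and character form. A sketch, assuming QM1 and the usual UNIX sample path (both illustrative):

Code:

# Sketch only - non-destructively browse what is waiting for the repos mgr.
/usr/mqm/samp/bin/amqsbcg SYSTEM.CLUSTER.COMMAND.QUEUE QM1

That at least shows whether the message at the head of the queue looks like genuine clustering traffic or something an application has put there.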

The solution is to do a destructive read of the first msg from the queue, or as many msgs as need to be removed, and restart the repos mgr only. It can be started by entering the same command line as it displays in ps output, for example:

Code:

nohup /usr/mqm/bin/amqrrmfa -m QMGR -t2332800 -s2592000 -p2592000 -g5184000 -c3600 > /dev/null 2>&1 &


This will restart the repos mgr. The EC (execution controller) will no longer have a record of the repos mgr PID, so there will not be a ZX005025 FFST if the repos mgr fails again, but that is bearable until the qmgr can be restarted.
Challenger
PostPosted: Tue Jan 27, 2009 3:32 am

Centurion

Joined: 31 Mar 2008
Posts: 115

zax
Top marks! And your first post too! You have correctly answered the second part of the question - how to manually recover the Repository Manager.

PeterPotkay
zax's answer explains my reason for not giving you the explanation earlier in the challenge.

All
A number of people have identified that a poison message is what causes the Repository Manager to fail. As I will be closing the challenge this coming Thursday (29th January) so that the winner can be announced on Friday, does anyone want to have a stab at what the user-induced error is?

Hint: It is not application related in any way - it is a configuration issue.