xmitq queue depth rise on vendor side.

v4vball
PostPosted: Wed Jul 13, 2022 11:05 pm    Post subject: xmitq queue depth rise on vendor side.

Novice

Joined: 17 Aug 2018
Posts: 10

We recently had back-to-back production incidents involving an external vendor. In both incidents, reply messages from the vendor arrived with high latency.

In incident 1, the response time of MQ replies from the vendor rose gradually from 200+ ms to over 20 seconds over 6 hours, following a weekend MQ maintenance window. The reply queue on our end filled up 15 minutes after the vendor received their xmitq depth high alert. The queue filled with orphaned messages because the consuming application had timed out. The incident was resolved by the vendor restarting their sender channel.

Incident 2 happened a week later. The vendor's MQ response time degraded 6 times over a period of 24 hours, each episode lasting a few minutes. The degradation was much faster this time: within a minute or so, message response time rose from 200+ ms to over 20 seconds. Yet the latency was short-lived and our queue never filled up; again the vendor received an xmitq depth high alert.

I suspect an issue with the channel, yet the network teams on both ends found nothing during the first incident, and no trace of any sort was taken for the second. The vendor refuses to believe the issue is on their end.

My questions are:

Other than a filled-up remote queue or network latency, what else can cause the xmitq on the other end to rise?

Any reason to believe the issue could be on either end?

Do you think these two incidents have the same root cause? Why did the first one not recover until the sender channel was recycled?

bruce2359
PostPosted: Thu Jul 14, 2022 12:52 pm    Post subject: Re: xmitq queue depth rise on vendor side.

Poobah

Joined: 05 Jan 2008
Posts: 9394
Location: US: west coast, almost. Otherwise, enroute.

v4vball wrote:
... received their xmitq depth high alert. The queue filled with orphaned messages because the consuming application had timed out. The incident was resolved by the vendor restarting their sender channel.

Why did the consuming app time out? Why did it time out so quickly? The consumer should stay resident, attempting to get the next message, for far longer than 20 seconds.
v4vball wrote:
I suspect an issue with the channel, yet the network teams on both ends found nothing during the first incident, and no trace of any sort was taken for the second. The vendor refuses to believe the issue is on their end.
Where did the network folks look for errors?

v4vball wrote:
My questions are:

Other than a filled-up remote queue or network latency, what else can cause the xmitq on the other end to rise?

Any reason to believe the issue could be on either end?

Do you think these two incidents have the same root cause? Why did the first one not recover until the sender channel was recycled?

I've been misled by network folks from time to time. I'm going to guess (speculate) that your channels are experiencing transient issues. It could be a failing NIC, cabling, a firewall, ...

Look at the MQ error logs (AMQERR01.LOG is the current error log file) for the affected qmgr(s).

What are the RETRY (both SHORT and LONG) settings for the misbehaving channel?

When this slow-down next occurs, do a DISPLAY CHSTATUS(channel-name).
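
For illustration, with a placeholder channel name, the sort of thing to capture would be:

Code:
DISPLAY CHANNEL(TO.PARTNER) SHORTRTY SHORTTMR LONGRTY LONGTMR
DISPLAY CHSTATUS(TO.PARTNER) ALL

In the CHSTATUS output, STATUS, SUBSTATE, MSGS and BATCHES, plus XQTIME and NETTIME at the sending end, give a quick read on whether the channel is waiting on the transmission queue, the network, or the remote queue manager.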
_________________
I like deadlines. I like to wave as they pass by.
ב''ה
Lex Orandi, Lex Credendi, Lex Vivendi. As we Worship, So we Believe, So we Live.

gbaddeley
PostPosted: Thu Jul 14, 2022 4:01 pm    Post subject: Re: xmitq queue depth rise on vendor side.

Jedi

Joined: 25 Mar 2003
Posts: 2492
Location: Melbourne, Australia

v4vball wrote:
Other than a filled-up remote queue or network latency, what else can cause the xmitq on the other end to rise?

Any reason to believe the issue could be on either end?

Do you think these two incidents have the same root cause? Why did the first one not recover until the sender channel was recycled?

If the remote app is running normally, issues with sender/receiver channels are usually network or TCP stack related. However, a gradual increase in reply time points more toward the remote app than the network.
Ask the external vendor to examine their MQ qmgr error logs during those time frames, or to provide them to you (good luck with that).
To protect yourself, increase the maxdepth on the reply queue to cover these scenarios. There is no excuse for not increasing maxdepth, as you have no control over the behavior of the external vendor.
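
For example (queue name and value are placeholders only):

Code:
ALTER QLOCAL(VENDOR.REPLY.QUEUE) MAXDEPTH(100000)

The extra depth costs nothing until messages actually accumulate on the queue.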
_________________
Glenn

v4vball
PostPosted: Thu Jul 14, 2022 10:17 pm

Novice

Joined: 17 Aug 2018
Posts: 10

The so-called application on our end is actually a set of RESTful APIs acting as JMS consumers, all sharing a similar MQ design pattern. The design requires the request and reply to complete within seconds, to serve "impatient" Internet users. Given normal performance, the 20-second timeout on MQGET is actually overkill.

The reply queue maxdepth is 20K, sufficient for the busiest hours. A larger maxdepth does buy more buffer time, yet with the high MQPUT rate we would need a substantial increase just to avoid a queue-full situation. My plan is to implement a trigger on this queue that starts a JCL job when the queue depth is high; the job will discard all expired messages by browsing the queue (expired messages are discarded when browsed or read). That way we don't have to rely on the queue manager's expiry scavenger task, which runs on a 15-minute interval. That 15-minute interval is actually the bottleneck of queue depth recovery: within 15 minutes a large number of reply messages can accumulate on the xmitq, and they will flood the queue again once the receiver channel is back in business.
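
As a sketch of the triggering piece (all names and the depth value are placeholders; something still has to service the initiation queue and submit the JCL, whether that is a site-written trigger monitor or existing automation):

Code:
DEFINE PROCESS(DRAIN.EXPIRED) APPLTYPE(MVS) APPLICID('DRAINJOB')
ALTER QLOCAL(VENDOR.REPLY.QUEUE) TRIGGER TRIGTYPE(DEPTH) TRIGDPTH(15000) +
      INITQ(OUR.INIT.QUEUE) PROCESS(DRAIN.EXPIRED)

One gotcha with TRIGTYPE(DEPTH): the queue manager sets NOTRIGGER after generating the trigger message, so the drain job has to re-enable TRIGGER when it finishes.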

To my knowledge, in incident 1 the network folks checked network latency and found no issue. I questioned the vendor's application performance, yet was told their other consumers had no issue.

We checked the CHIN and MSTR job logs, SYSLOG, and SMF Type 116; other than the queue depth full alert, nothing else really stood out. That said, my exposure to SMF is very limited, and only Type 116 queue accounting data was provided to me by our performance team. The MQPUT latency on the receiver channel is actually 0. I am under the impression that Type 115 CHIN statistics should tell me more about channel performance; unfortunately we don't have those.

Our receiver channel has been set up to retry forever, every second. The vendor's xmitq already started rising before our queue was full. Is it possible that retrying happened on our end for some other reason? If it did, shouldn't there be messages in the log showing the channel pausing? The thing is, we don't see any.

Network-related issues are at the top of the suspect list; the symptoms in incident 1 also suggest the degradation trend was in sync with traffic.

I guess we will have to act quicker and do more when this issue recurs. Can anyone please share experience with generating an IBM MQ GTF trace on the CHIN job, or with examining SMF Type 115?

fjb_saper
PostPosted: Fri Jul 15, 2022 4:34 am

Grand High Poobah

Joined: 18 Nov 2003
Posts: 20696
Location: LI,NY

Other things to look at:
Does the channel serve other queues besides the reply queue?

Any queue-full problem on your side will substantially slow down the channel if it has a message for that queue. So I would suggest making sure there is NO queue that is full on your queue manager, then seeing whether you still get the slowdowns.
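
If it helps, queue-full and depth-high events make that visible without polling; a sketch, with placeholder names:

Code:
ALTER QMGR PERFMEV(ENABLED)
ALTER QLOCAL(SOME.APP.QUEUE) QDPMAXEV(ENABLED) QDPHIEV(ENABLED) QDEPTHHI(80)

The event messages arrive on SYSTEM.ADMIN.PERFM.EVENT, where your monitoring can pick them up.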

Potentially you can also alter the message-retry behavior of your receiver channel so that it puts the failing message on the DLQ immediately, thus avoiding a slowdown on other queues when one of the serviced queues is full.
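
For example (channel name is a placeholder, and the queue manager must have a dead-letter queue defined):

Code:
ALTER CHANNEL(VENDOR.TO.US) CHLTYPE(RCVR) MRRTY(0) MRTMR(0)

With message retry switched off, a put that fails with queue-full goes to the DLQ rather than the channel sitting in message retry and delaying everything behind it.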

The impact of a full queue on other queues and on channel communications is often underestimated, with the response being "Oh, but this does not concern our application." Wrong: it does concern your application if it slows down the channel that delivers messages to your application.

Example: say you have 2 queues, A & B, serviced by the same channel.
Application A is online and sends 100 messages for every one message sent by application B.
Messages from application A go to queue A, messages from application B go to queue B.

Application B has a problem and queue B fills up.
As a result, one batch out of 50 may have a message to deliver to queue B.
For that batch the message-retry processing of the channel gets engaged and the message eventually gets put on the DLQ with reason Queue Full. But this also means that messages in the following batches get delayed by the retry time of the channel and may no longer be relevant (expired/orphaned) when they arrive on queue A.

This acts like a rolling snowball until no messages for queue A are relevant any more (expired, or the consumer has timed out), through no fault of the request-servicing application. The symptom: the XMITQ at the request-servicing end builds up.


Hope this helps
_________________
MQ & Broker admin

bruce2359
PostPosted: Fri Jul 15, 2022 9:05 am

Poobah

Joined: 05 Jan 2008
Posts: 9394
Location: US: west coast, almost. Otherwise, enroute.

v4vball wrote:
We checked the CHIN and MSTR job logs, SYSLOG, ...

So the receiver end of the channel is a z/OS qmgr. What OS is at the sender end?

Use the ISPF panels or issue this command: /cpf DIS CHINIT
Post the complete results here. I'm specifically interested in the number of adapters and dispatchers.

Or, look at the SYSLOG for the CHIN address space. You should see CSQXnnn messages like these:
CSQX141I MQ## CSQXADPI 8 adapter subtasks started, 0 failed
CSQX160E MQ## CSQXGIMP SSL communications unavailable
CSQX151I MQ## CSQXSSLI 0 SSL server subtasks started, 0 failed
CSQX410I MQ## CSQXREPO Repository manager started
CSQT975I MQ## CSQXDPSC Distributed Pub/Sub Controller has started
CSQX015I MQ## CSQXSPRI 5 dispatchers started, 0 failed

I'm puzzled that there are NO messages on SYSLOG, like CSQX500I MQ00 CSQXRCTL Channel MQ00.MQ0B started

If you believe you need a CHIN trace, look at this: https://www.ibm.com/support/pages/generating-websphere-mq-chin-trace-ibm-zos
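
On the SMF Type 115 question: on MQ for z/OS V8 or later, CHINIT statistics and per-channel accounting are written when class 4 of the statistics and accounting traces is active (a sketch; cpf is your command prefix, and the interval comes from the STATIME system parameter):

Code:
/cpf START TRACE(STAT) CLASS(4)
/cpf START TRACE(ACCTG) CLASS(4)
/cpf DISPLAY TRACE(*)

The first gives SMF 115 channel initiator statistics, the second SMF 116 channel accounting records.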

Also, display the SVRCONN channel(s) that the JMS apps use to connect to the qmgr.
_________________
I like deadlines. I like to wave as they pass by.
ב''ה
Lex Orandi, Lex Credendi, Lex Vivendi. As we Worship, So we Believe, So we Live.

gbaddeley
PostPosted: Sun Jul 17, 2022 6:19 pm

Jedi

Joined: 25 Mar 2003
Posts: 2492
Location: Melbourne, Australia

v4vball wrote:
The reply queue maxdepth is 20K, sufficient for the busiest hours. A larger maxdepth does buy more buffer time, yet with the high MQPUT rate we would need a substantial increase just to avoid a queue-full situation.

20K is very small for production. Make it 100K, or higher. This does not consume any storage, until there are messages on the queue.

Quote:
My plan is to implement a trigger on this queue that starts a JCL job when the queue depth is high; the job will discard all expired messages by browsing the queue (expired messages are discarded when browsed or read). That way we don't have to rely on the queue manager's expiry scavenger task, which runs on a 15-minute interval. That 15-minute interval is actually the bottleneck of queue depth recovery: within 15 minutes a large number of reply messages can accumulate on the xmitq, and they will flood the queue again once the receiver channel is back in business.


OK. How does the app get reply messages, by message id or correl id? Have you got the correct indexing set on the queue? Without proper indexing, MQ will be constantly scanning through the entire queue for matches. This can noticeably degrade performance if the depth is a few hundred or more.
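
On z/OS the relevant attribute is INDXTYPE; for a request/reply pattern matched on correlation ID, something like this (queue name is a placeholder):

Code:
ALTER QLOCAL(VENDOR.REPLY.QUEUE) INDXTYPE(CORRELID)

Use MSGID instead if the app matches replies by message ID.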

Quote:
Our receiver channel has been set up to retry forever, every second. The vendor's xmitq already started rising before our queue was full. Is it possible that retrying happened on our end for some other reason? If it did, shouldn't there be messages in the log showing the channel pausing? The thing is, we don't see any.

If you don't see any messages for channel retry in the logs, it is not doing it. One second is very low.

If there is an issue with the channel starting, retrying every second will generally not help; it just creates more overhead. 30-60 seconds would be more reasonable for a production channel, and it should only do short retries for an hour or so.

If there is an issue while the channel is running, the MCAs will have socket timeouts and waits, and should be able to cope with fairly slow network response. This is not counted as channel "retry".
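
As a sketch of that kind of tuning on the sending side (channel name and values are examples only):

Code:
ALTER CHANNEL(TO.PARTNER) CHLTYPE(SDR) +
      SHORTRTY(60) SHORTTMR(60) LONGRTY(999999999) LONGTMR(1200)

That gives roughly an hour of one-minute short retries before falling back to 20-minute long retries.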

Quote:
Network-related issues are at the top of the suspect list; the symptoms in incident 1 also suggest the degradation trend was in sync with traffic.
I guess we will have to act quicker and do more when this issue recurs.


What is the average message size? Are any other queues being serviced by this channel? Is the network path long or complex? What is the average ICMP ping time, and does it vary much?
_________________
Glenn

Andyh
PostPosted: Mon Jul 18, 2022 12:37 am

Master

Joined: 29 Jul 2010
Posts: 237

Is the vendor QM a z/OS or a distributed qmgr?
If distributed, are there any signs of deep queues and message "loading" on the vendor qmgr?

With distributed MQ, a queue which remains unreferenced for a period of time will be evicted from memory (a.k.a. queue unloading); in particular, the queue index will be discarded from memory. When the queue is next referenced the index will be rebuilt, which involves reading the MQMDs of every message on that queue from disk. This can lead to random delays when the queue is deep. This behaviour dates back to times when memory was a much more limited resource (MQ is 25+ years old!).

Shortly before I retired from IBM (spring 2021) I changed this algorithm so that the depth of the queue is considered in the decision as to whether to unload an unreferenced deep queue, in an attempt to reduce this sort of impact. I'm afraid I can't recall exactly which release that shipped in. Thus a 'current' queue manager should be much less likely to suffer from these sorts of delays.

The tell-tale signs of this sort of issue are messages in AMQERR01.LOG showing large numbers of messages being loaded around the time of the issue, and (in extreme cases) long-lock-wait FDCs being raised around the same time.

The deep queue being unloaded/reloaded might not be directly involved with the app suffering the slowdown; for example, if a receiver channel has to reopen such a queue, no other messages will flow down that channel until the queue has been completely loaded.

gbaddeley
PostPosted: Mon Jul 18, 2022 3:04 pm

Jedi

Joined: 25 Mar 2003
Posts: 2492
Location: Melbourne, Australia

Andyh wrote:
With distributed MQ, a queue which remains unreferenced for a period of time will be evicted from memory (a.k.a. queue unloading); in particular, the queue index will be discarded from memory. When the queue is next referenced the index will be rebuilt, which involves reading the MQMDs of every message on that queue from disk. This can lead to random delays when the queue is deep.


Yes, there is a message in the qmgr error log for every 10,000 messages that are reloaded into memory. If an MQ PCF command requires the queue object to be referenced, it will completely stall the MQ command processor until the reload is complete.

e.g. tools like MO71 appear to hang, and there is a non-zero depth on SYSTEM.ADMIN.COMMAND.QUEUE.
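
For reference, that depth check looks like this in MQSC (a sketch):

Code:
DISPLAY QSTATUS(SYSTEM.ADMIN.COMMAND.QUEUE) TYPE(QUEUE) CURDEPTH IPPROCS OPPROCS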
_________________
Glenn

Andyh
PostPosted: Tue Jul 19, 2022 12:03 am

Master

Joined: 29 Jul 2010
Posts: 237

"If a MQ PCF command requires the queue object to be referenced, it will completely stall the MQ command processor until the reload is complete."

This is a bit off topic for the original post, but I thought might be worthy of a little further remark. The command server doesn't know what messges the requesting application might need to be processed in sequence and so ends up assuming a worst case scenario and processes ALL of the messages in the sequence implied by the command queue. Hence a request which implies the command server loading a deep queue blocks all subsequent requests, even those that don't interact with that queue. The command server is as old as MQ itself and so we've always had this implcit strong serialization. Had grouped messages existed when the command server was written then it might have been possible to do better, however the error handling implications make grouped messages a fairly unattractive option in most cases (for example if the last message in a non-persistent message group is ever lost).

Back on topic, the receiving end of an message channel is in a similar situation, it has no knowledge of how messages are interrelated and so if a delay is implicit in putting one message then all subsequent messages on that channel are delayed regardless of whether they are targetted at the same queue. I'm unaware if there's ever been an RFE requesting an "unordered queue" where the serialization is explicitly relaxed, but without such a concept we're stuck with the existing implied serialization and these sorts of implications. The cheap solution of being more reluctant to unload a deep queue is quite a good fit in the circumstances. It's pretty unusual in this day and age for there to be a shortage of virtual memory !

bruce2359
PostPosted: Tue Jul 19, 2022 5:08 am    Post subject: Re: xmitq queue depth rise on vendor side.

Poobah

Joined: 05 Jan 2008
Posts: 9394
Location: US: west coast, almost. Otherwise, enroute.

v4vball wrote:
We recently had back-to-back production incidents involving an external vendor. In both incidents, reply messages from the vendor arrived with high latency.

In incident 1, the response time of MQ replies from the vendor rose gradually from 200+ ms to over 20 seconds over 6 hours, following a weekend MQ maintenance window. ...

Please be precise. What MQ maintenance?

Maintenance to IBM MQ software? If so, what maintenance?

Maintenance to your or the vendor's MQ object definitions?

Maintenance to your or the vendor's application code?

Did restoring the environment to pre-maintenance levels resolve the issue?
_________________
I like deadlines. I like to wave as they pass by.
ב''ה
Lex Orandi, Lex Credendi, Lex Vivendi. As we Worship, So we Believe, So we Live.

gbaddeley
PostPosted: Tue Jul 19, 2022 3:51 pm

Jedi

Joined: 25 Mar 2003
Posts: 2492
Location: Melbourne, Australia

Andyh wrote:
"If a MQ PCF command requires the queue object to be referenced, it will completely stall the MQ command processor until the reload is complete."

This is a bit off topic for the original post, but I thought might be worthy of a little further remark. The command server doesn't know what messges the requesting application might need to be processed in sequence and so ends up assuming a worst case scenario and processes ALL of the messages in the sequence implied by the command queue. Hence a request which implies the command server loading a deep queue blocks all subsequent requests, even those that don't interact with that queue.


Hi Andy. Thanks for your insights. I have observed that if a "deep" queue has not been opened for a while (10 minutes?), and then something opens the queue to put a message, the "loading" message appears in the qmgr error log. This is typical behavior for an error or exception logging queue. We sometimes have 200K+ messages on that queue (I know, it's not ideal), so it can take a couple of minutes to complete the index memory load. In the meantime, I am trying to use MO71 to list the queues (i.e. it's doing a PCF inquiry on queue attributes only), and it doesn't respond until the load is complete. It's not browsing messages, so why should there be any delay?
_________________
Glenn

Andyh
PostPosted: Wed Jul 20, 2022 6:49 am

Master

Joined: 29 Jul 2010
Posts: 237

Opening a queue for MQOO_INQUIRE shouldn't normally involve loading the queue.
The obvious exception would be a query of queue depth following a crash recovery. Queue depth is a slightly odd "queue attribute" in that it's very dynamic. When a queue is loaded the depth is part of the memory image, but -1 is written into the depth field in the queue attributes on disk. If the queue manager shuts down normally then the proper queue depth will be written out to disk, but if the qmgr ends abruptly it can find -1 on restart, indicating that in order to determine the depth the messages must be loaded. This is all done to avoid needing to update the queue depth on disk every time there is a put or a get. This would only apply to the first time the depth was accessed after a crash restart, so repeated queries should never need to repeatedly reload the queue. I'm not familiar with the source code of MO71 and so I can't comment on whether it's using the right open options. If it were to over-specify its open options to imply some intent to access the messages themselves then the queue could be loaded as part of the MQOPEN.

gbaddeley
PostPosted: Wed Jul 20, 2022 3:43 pm

Jedi

Joined: 25 Mar 2003
Posts: 2492
Location: Melbourne, Australia

Andyh wrote:
Opening a queue for MQOO_INQUIRE shouldn't normally involve loading the queue. The obvious exception would be a query of queue depth following a crash recovery.

OK, we have not had any crash recoveries. The queue index load is being forced because some other process has just opened it for output, to put a message, not due to something doing an inquiry.

Quote:
Queue depth is a slightly odd "queue attribute" in that it's very dynamic .... I'm not familiar with the source code of MO71 and so I can't comment on whether it's using the right open options. If it were to over-specify its open options to imply some intent to access the messages themselves then the queue could be loaded as part of the MQOPEN.

I'm not familiar either, but I would expect MO71 to only be requesting queue attribute values via PCF, and not opening each queue in the list. I can do an MO71 "Queue List" of 15,000 queues (including curdepth) and it only takes a few seconds. This indicates it is not opening any queues in the list.
_________________
Glenn

hughson
PostPosted: Wed Jul 20, 2022 7:11 pm

Padawan

Joined: 09 May 2013
Posts: 1914
Location: Bay of Plenty, New Zealand

gbaddeley wrote:
I'm not familiar either, but I would expect MO71 to only be requesting queue attribute values via PCF, and not opening each queue in the list. I can do an MO71 "Queue List" of 15,000 queues (including curdepth) and it only takes a few seconds. This indicates it is not opening any queues in the list.

I can confirm that if you use MO71's "Queue List" then it is using the PCF Inquire Queues command and the only queues that are MQOPENed are the command server queue to MQPUT the PCF command message, and the Reply Queue (likely via a QMODEL) to MQGET the response messages. The queues in the list are not MQOPENed.

Cheers,
Morag
_________________
Morag Hughson @MoragHughson
IBM MQ Technical Education Specialist
Get your IBM MQ training here!
MQGem Software