zpat
Posted: Sun Sep 27, 2020 12:31 am Post subject: How to solve message retry delays impacting others
Jedi Council
Joined: 19 May 2001 Posts: 5859 Location: UK
If you have a channel from one QM to another (whether a standard or cluster sender), it's possible for a full queue at the destination to cause the channel to enter message retry.
The default (before the DLQ is used) is 10 retries at 1-second intervals, causing a 10-second delay before the message goes to the DLQ. This is then repeated for every message intended for the full queue.
It seems that during this retry period nothing else can use the channel, so messages intended for other queues that are not full are also delayed behind the ones in retry.
For high-volume, low-latency messaging applications this is a very serious issue, as thousands of messages each retried for 10 seconds essentially block the channel for long periods, even for the apps whose queues have plenty of space.
I can't see an obvious solution other than turning off message retries entirely, but I am interested in what other people do to avoid this issue. _________________ Well, I don't think there is any question about it. It can only be attributable to human error. This sort of thing has cropped up before, and it has always been due to human error.
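For reference, a minimal MQSC sketch of where this retry behaviour lives and how it can be switched off; the channel name is hypothetical, and MRRTY/MRTMR are set on the receiving end of the channel (RCVR, RQSTR or CLUSRCVR):
Code:
* Check the current message-retry settings on the receiving channel
DISPLAY CHANNEL(TO.DEST.QM) MRRTY MRTMR
* The defaults are MRRTY(10) and MRTMR(1000), i.e. the 10 retries at
* 1-second intervals described above before a message goes to the DLQ
* Setting the retry count to zero sends failing messages straight to the DLQ
ALTER CHANNEL(TO.DEST.QM) CHLTYPE(RCVR) MRRTY(0)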
bruce2359
Posted: Sun Sep 27, 2020 4:12 am
Poobah
Joined: 05 Jan 2008 Posts: 9442 Location: US: west coast, almost. Otherwise, enroute.
Yes, turn off retries.
One option: Let the messages destined for the full queue go directly to the DLQ, and let the dead-letter queue handler deal with them out-of-band.
Another option, to handle (avert) queue-full conditions: enable and monitor depth events. When queue depth reaches 80%, increase (alter) maxdepth by 25%.
If the underlying problem is an insufficient number of concurrent consumers, here's an opportunity for TRIGTYPE(EVERY). _________________ I like deadlines. I like to wave as they pass by.
ב''ה
Lex Orandi, Lex Credendi, Lex Vivendi. As we Worship, So we Believe, So we Live.
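A minimal MQSC sketch of the depth-event and trigger-every suggestions above; the queue, process and threshold values are hypothetical, and the monitoring that reacts to the event is assumed to exist separately:
Code:
* Performance events must be switched on at the queue manager
ALTER QMGR PERFMEV(ENABLED)
* Emit a queue depth high event when the queue reaches 80% full
ALTER QLOCAL(APP.DEST.QUEUE) QDEPTHHI(80) QDPHIEV(ENABLED)
* Monitoring reads the event from SYSTEM.ADMIN.PERFM.EVENT and reacts,
* for example by raising MAXDEPTH by 25% (here 100000 -> 125000)
ALTER QLOCAL(APP.DEST.QUEUE) MAXDEPTH(125000)
* If too few consumers is the real problem, trigger a consumer per message
ALTER QLOCAL(APP.DEST.QUEUE) TRIGGER TRIGTYPE(EVERY) PROCESS(APP.CONSUMER.PROC) INITQ(SYSTEM.DEFAULT.INITIATION.QUEUE)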
gbaddeley
Posted: Sun Sep 27, 2020 3:11 pm
Jedi Knight
Joined: 25 Mar 2003 Posts: 2527 Location: Melbourne, Australia
Do everything possible to avoid a queue-full situation. Set the max depth to a very high value. Monitor the queue depth. Monitor the app that is supposed to be consuming the messages. _________________ Glenn
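A short MQSC sketch of the same advice, with a hypothetical queue name:
Code:
* Set MAXDEPTH well above any realistic backlog
ALTER QLOCAL(APP.DEST.QUEUE) MAXDEPTH(5000000)
* CURDEPTH shows the backlog; IPPROCS shows whether any consumer is connected
DISPLAY QSTATUS(APP.DEST.QUEUE) TYPE(QUEUE) CURDEPTH IPPROCS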
PeterPotkay
Posted: Sun Sep 27, 2020 5:26 pm
Poobah
Joined: 15 May 2001 Posts: 7719
Turn off Message Retry. How many times does it actually accomplish its mission? Almost always the reason for the failed PUTs is not going to resolve itself in a few seconds, so why waste time retrying? Sure, once in a blue moon it will pull it off, delivering messages that would otherwise have gone to a dead-letter queue. But it's not worth the risk of impacting other messages on a shared channel.
One place I absolutely use Message Retry is on my Edge Queue Managers that have channels from other companies coming to us. On those dedicated channels between our company and one other company, if there is any funny business occurring that causes my RCVR to send to my DLQ, I want Message Retry kicking in to throttle what's coming across. If my RCVR is sending to the DLQ, something is seriously wrong, maybe even something malicious. I want that RCVR slowing waaaaay down so we have time to react to the alert for the messages arriving on the DLQ. _________________ Peter Potkay
Keep Calm and MQ On
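A minimal MQSC sketch of the two contrasting policies described above; the channel names are hypothetical:
Code:
* Shared internal channel: no message retry, failing messages go straight to the DLQ
ALTER CHANNEL(INTERNAL.TO.APPQM) CHLTYPE(CLUSRCVR) MRRTY(0)
* Dedicated partner-facing receiver: retry slowly (100 retries at 10-second
* intervals) to throttle inbound traffic while the DLQ alert is investigated
ALTER CHANNEL(PARTNER.TO.EDGEQM) CHLTYPE(RCVR) MRRTY(100) MRTMR(10000)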
zpat
Posted: Mon Sep 28, 2020 1:30 am
Jedi Council
Joined: 19 May 2001 Posts: 5859 Location: UK
The destination queue in this case is a z/OS shared queue using a Coupling Facility structure.
These are memory-based and inherently limited in size. We actually hit the CF structure full condition before the max queue depth.
We have QDEPTHHI event alerts at 50%, but it filled very quickly.
So I am going to get the CF structure made bigger, but it's the collateral impact on other applications (using the same channel) that caused most grief.
I agree that message retry rarely has any value; it's one of those MQ defaults that probably belongs in a museum now. (There is a nice museum in Hursley!) _________________ Well, I don't think there is any question about it. It can only be attributable to human error. This sort of thing has cropped up before, and it has always been due to human error.
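For the shared-queue case, a hedged sketch (z/OS only, structure name hypothetical) of watching the CF structure itself rather than just queue depth, since the structure can fill before MAXDEPTH is ever reached:
Code:
* Show how full the application structure currently is
DISPLAY CFSTATUS(APPSTRUC) TYPE(SUMMARY)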
bruce2359
Posted: Mon Sep 28, 2020 3:19 am
Poobah
Joined: 05 Jan 2008 Posts: 9442 Location: US: west coast, almost. Otherwise, enroute.
So, the consuming app (apps) can't keep up with message arrival rate. Why? What is the bottleneck? _________________ I like deadlines. I like to wave as they pass by.
ב''ה
Lex Orandi, Lex Credendi, Lex Vivendi. As we Worship, So we Believe, So we Live.
zpat
Posted: Mon Sep 28, 2020 3:47 am
Jedi Council
Joined: 19 May 2001 Posts: 5859 Location: UK
Against my advice, the consuming application was not set up with HA or at least some kind of automated restart.
They relied on manual alerting, which failed. However, this is not really the issue, since there could also be planned downtime.
My concern is avoiding impact on unrelated (and often more important) applications when some lesser application fills up its queue. _________________ Well, I don't think there is any question about it. It can only be attributable to human error. This sort of thing has cropped up before, and it has always been due to human error.
bruce2359
Posted: Mon Sep 28, 2020 5:15 am
Poobah
Joined: 05 Jan 2008 Posts: 9442 Location: US: west coast, almost. Otherwise, enroute.
Messages of lesser importance should not be sent across the same channel as the important messages.
CF storage is real, not virtual, and therefore limited real estate. Not likely the CF admins will provision much more (way more) structure storage. Does an SMDS data set back up the offending queue? _________________ I like deadlines. I like to wave as they pass by.
ב''ה
Lex Orandi, Lex Credendi, Lex Vivendi. As we Worship, So we Believe, So we Live.
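A small MQSC sketch of how to answer that question; the structure name is hypothetical, and SMDS offload requires the structure to be at CFLEVEL(5):
Code:
* Show whether this application structure can offload message data to SMDS
DISPLAY CFSTRUCT(APPSTRUC) CFLEVEL OFFLOAD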
zpat
Posted: Mon Sep 28, 2020 6:13 am
Jedi Council
Joined: 19 May 2001 Posts: 5859 Location: UK
Non-persistent messages don't go to the SMDS. The queue that fills is not that important; it's the other applications that matter more.
It's easy to say "don't use the same channel", but hard to achieve without creating a new cluster.
Even then, there will never be one cluster per queue, so collateral damage is still possible.
Message priority is another option I am considering. _________________ Well, I don't think there is any question about it. It can only be attributable to human error. This sort of thing has cropped up before, and it has always been due to human error.
bruce2359
Posted: Mon Sep 28, 2020 9:08 am
Poobah
Joined: 05 Jan 2008 Posts: 9442 Location: US: west coast, almost. Otherwise, enroute.
zpat wrote:
Message priority is another option I am considering.
Where exactly? Sending-side XMITQ? Destination queue? _________________ I like deadlines. I like to wave as they pass by.
ב''ה
Lex Orandi, Lex Credendi, Lex Vivendi. As we Worship, So we Believe, So we Live.
zpat
Posted: Mon Sep 28, 2020 11:07 am
Jedi Council
Joined: 19 May 2001 Posts: 5859 Location: UK
MQMD.Priority is set by the original MQPUT, so it would be the sending-side application (possibly using an attribute on a queue alias) that sets a higher priority for more critical applications (or vice versa). _________________ Well, I don't think there is any question about it. It can only be attributable to human error. This sort of thing has cropped up before, and it has always been due to human error.
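A minimal MQSC sketch of the queue-alias idea; the object names are hypothetical, and it assumes the application puts with MQPRI_PRIORITY_AS_Q_DEF (Priority = -1) rather than an explicit value:
Code:
* The critical app puts to the alias and its messages inherit DEFPRTY(9);
* less critical apps keep using the base queue and its lower default priority
DEFINE QALIAS(CRITICAL.APP.ALIAS) TARGET(APP.DEST.QUEUE) DEFPRTY(9)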
bruce2359
Posted: Mon Sep 28, 2020 11:36 am
Poobah
Joined: 05 Jan 2008 Posts: 9442 Location: US: west coast, almost. Otherwise, enroute.
And how will this prevent the queue-full condition?
What about additional consumers? _________________ I like deadlines. I like to wave as they pass by.
ב''ה
Lex Orandi, Lex Credendi, Lex Vivendi. As we Worship, So we Believe, So we Live.
zpat
Posted: Mon Sep 28, 2020 1:01 pm
Jedi Council
Joined: 19 May 2001 Posts: 5859 Location: UK
It won't prevent queue full. But it will mean higher-priority messages are sent in preference over the channel (if the SCTQ is priority-sequenced), so after one lower-priority message retry it would send all the higher-priority messages before attempting to send another lower-priority message.
We can never guarantee the queue won't get full, and I can't make them run multiple consumers if they refuse to do so. But I can find a way to stop them impacting more important applications. _________________ Well, I don't think there is any question about it. It can only be attributable to human error. This sort of thing has cropped up before, and it has always been due to human error.
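A one-line MQSC check of the assumption in parentheses above, i.e. that the cluster transmission queue delivers in priority order rather than FIFO:
Code:
* MSGDLVSQ(PRIORITY) means the channel drains higher-priority messages first
DISPLAY QLOCAL(SYSTEM.CLUSTER.TRANSMIT.QUEUE) MSGDLVSQ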
gbaddeley
Posted: Mon Sep 28, 2020 5:56 pm
Jedi Knight
Joined: 25 Mar 2003 Posts: 2527 Location: Melbourne, Australia
FWIW, our DR planning looks at queue depth and storage usage that could accumulate during the expected time for DR restoration. We set maxdepth and storage allocation to cope with this situation.
Short outages or higher than normal peaks during normal operation do not even approach the settings we have for DR, so we very rarely see queue full conditions.
There is an argument that max depth, max message length and storage allocation should not stand in the way of normal or abnormal app message processing, when there is no good reason to do so. We had a couple of instances where app messages crept up over 4 MB in length and broke several interfaces, due to arbitrary max message lengths set by MQ admins years before on queues and channels. We made an executive decision to set everything to 100 MB. _________________ Glenn
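A minimal MQSC sketch of that decision, with hypothetical object names; 104857600 bytes is MQ's 100 MB maximum message length:
Code:
* Raise the limit at every point a message passes through
ALTER QMGR MAXMSGL(104857600)
ALTER QLOCAL(APP.DEST.QUEUE) MAXMSGL(104857600)
ALTER CHANNEL(TO.DEST.QM) CHLTYPE(SDR) MAXMSGL(104857600)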
zpat
Posted: Mon Sep 28, 2020 11:05 pm
Jedi Council
Joined: 19 May 2001 Posts: 5859 Location: UK
Getting off topic, but setting channel max message length to 100 MB can seriously consume CHIN storage on z/OS.
As this queue of ours is QSG-shared, the CF (real) storage has to be available, and that's quite expensive compared to standard disk (which I agree is always worth over-allocating rather than having to deal with queue-full conditions).
Most of our critical queues are on z/OS, and MQ on z/OS is many times more difficult and inconvenient to administer than distributed MQ, as you have to worry about CF size, page set size, SMDS size, buffer pool size, CHIN region size and all the other joys of "Ye Olde" MVS. _________________ Well, I don't think there is any question about it. It can only be attributable to human error. This sort of thing has cropped up before, and it has always been due to human error.