Author |
Message
|
fjb_saper |
Posted: Wed Feb 04, 2009 7:38 am Post subject: |
|
|
 Grand High Poobah
Joined: 18 Nov 2003 Posts: 20767 Location: LI,NY
|
Vitor wrote: |
PeterPotkay wrote: |
So you are saying that when the new flows are active, dequeue rates across multiple queues on the QM, even unrelated queues, drop? And when the flows are stopped, the rates return to normal? |
After an apparently random period of normality, yes. It's a subtle effect as we generate far more audit than we do useful messages, but that seems to be the case. Once dequeue rate falls on this queue (easily noticed by the rapidly increasing depth), dequeue rates drop across the queue manager. |
I'd say symptom of a rapidly filling destination queue. Some parameters can be set on the channel to reduce both the interval between retries and the number of retries, so that messages go to the DLQ faster.
Check the DLQ on the destination system for messages with reason 2053. What queue are they for? The audit queue?  _________________ MQ & Broker admin |
|
exerk |
Posted: Wed Feb 04, 2009 7:41 am Post subject: |
|
|
 Jedi Council
Joined: 02 Nov 2006 Posts: 6339
|
fjb_saper wrote: |
...I'd say symptom of a rapidly filling destination queue... |
Would that have a 'global' effect on other queues in the queue manager? _________________ It's puzzling, I don't think I've ever seen anything quite like this before...and it's hard to soar like an eagle when you're surrounded by turkeys. |
|
Vitor |
Posted: Wed Feb 04, 2009 7:43 am Post subject: |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
fjb_saper wrote: |
Check the DLQ on the destination system for messages with reason 2053. What queue are they for? The audit queue?  |
DLQ on the destination system is empty, and the audit database (the final resting place for these messages) shows some updates for the times in question.
I repeat, the channel remains running throughout. _________________ Honesty is the best policy.
Insanity is the best defence. |
|
fjb_saper |
Posted: Wed Feb 04, 2009 7:46 am Post subject: |
|
|
 Grand High Poobah
Joined: 18 Nov 2003 Posts: 20767 Location: LI,NY
|
exerk wrote: |
fjb_saper wrote: |
...I'd say symptom of a rapidly filling destination queue... |
Would that have a 'global' effect on other queues in the queue manager? |
Imagine you have a flood that needs to go through a funnel.
The funnel examines each message and directs it to its slot.
Now for 2/3 of the messages coming through the funnel, you need to look at the message, try to deliver it, pause for 10 seconds, try again, repeat 10 times and then put it to the DLQ.
What do you think that will do to your throughput on the channel?
All destinations on the remote qmgr will be affected by the one destination.
You can alleviate that somewhat by sending the operational messages at a higher priority than the audit messages. What you really need is to scale the app reading the audit messages off the queue, and to allow a bigger queue depth to accommodate spikes  _________________ MQ & Broker admin |
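The funnel argument above can be put into rough numbers. What follows is only a back-of-the-envelope sketch, assuming the receiving channel handles messages strictly one after another and that every blocked message waits out all of its retries before going to the DLQ. The function name, the 1 ms base cost and the batch of 300 messages are made up for illustration; the 2/3 fraction, 10 retries and 10 second pause come from the post. The channel parameters being referred to are presumably the message-retry count and interval (MRRTY and MRTMR in MQSC), but the thread does not name them.

```python
def channel_seconds(total_msgs, blocked_fraction, retries, retry_interval_s, base_s=0.001):
    """Estimate how long a serial channel needs to work through a batch.

    Assumption: messages are delivered one at a time, and a message bound
    for a full queue is retried `retries` times, pausing `retry_interval_s`
    seconds each time, before it is put to the DLQ.
    """
    blocked = round(total_msgs * blocked_fraction)
    # Every message pays the (hypothetical) base delivery cost.
    delivering = total_msgs * base_s
    # Each blocked message additionally waits out every retry pause.
    waiting = blocked * retries * retry_interval_s
    return delivering + waiting


if __name__ == "__main__":
    print("no full queue:        ", channel_seconds(300, 0.0, 10, 10.0))
    print("2/3 blocked, 10 x 10s:", channel_seconds(300, 2 / 3, 10, 10.0))
    print("2/3 blocked, 1 x 0.1s:", channel_seconds(300, 2 / 3, 1, 0.1))
```

With these made-up inputs the blocked case takes on the order of 20,000 seconds against a fraction of a second for the healthy case, and shortening the retry settings brings it back to about 20 seconds. The model only illustrates why one full destination can hold up every other destination behind the same channel; it says nothing about whether that was the cause in this thread.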
|
PeterPotkay |
Posted: Wed Feb 04, 2009 7:51 am Post subject: |
|
|
 Poobah
Joined: 15 May 2001 Posts: 7723
|
Vitor wrote: |
PeterPotkay wrote: |
So you are saying that when the new flows are active, dequeue rates across multiple queues on the QM, even unrelated queues, drop? And when the flows are stopped, the rates return to normal? |
After an apparently random period of normality, yes. It's a subtle effect as we generate far more audit than we do useful messages, but that seems to be the case. Once dequeue rate falls on this queue (easily noticed by the rapidly increasing depth), dequeue rates drop across the queue manager. |
Could it be possible that after a period of time the flows have put so much under syncpoint and not committed that you get into a QM rolling back log scenario?
Or after a period of time the new flows decide to do something that is super I/O or CPU intensive, starving the server of resources? _________________ Peter Potkay
Keep Calm and MQ On |
|
fjb_saper |
Posted: Wed Feb 04, 2009 7:54 am Post subject: |
|
|
 Grand High Poobah
Joined: 18 Nov 2003 Posts: 20767 Location: LI,NY
|
PeterPotkay wrote: |
Could it be possible that after a period of time the flows have put so much under syncpoint and not committed that you get into a QM rolling back log scenario?
Or after a period of time the new flows decide to do something that is super I/O or CPU intensive, starving the server of resources? |
Uncommitted messages do not fit the scenario. If the messages were uncommitted, they should not be able to remove them using qload. The scenario rather fits a queue full on the remote qmgr.
Server starved of resources is more interesting to pursue...  _________________ MQ & Broker admin |
|
PeterPotkay |
Posted: Wed Feb 04, 2009 7:57 am Post subject: |
|
|
 Poobah
Joined: 15 May 2001 Posts: 7723
|
fjb_saper wrote: |
PeterPotkay wrote: |
Could it be possible that after a period of time the flows have put so much under syncpoint and not committed that you get into a QM rolling back log scenario?
Or after a period of time the new flows decide to do something that is super I/O or CPU intensive, starving the server of resources? |
Uncommitted messages do not fit the scenario. If the messages were uncommitted, they should not be able to remove them using qload. The scenario rather fits a queue full on the remote qmgr.  |
Q Full on a remote queue would drop the dequeue rate on the one transmission queue, not across the board for multiple queues.
This particular XMITQ doesn't have to be the one that is filling up the logs with uncommitted messages. Having said that, I would think this type of problem would have shown up in the QM logs, and Vitor says there is nothing odd there, so I guess this isn't it.
What if a bad message gets into the flow and the flow starts looping, using all the CPU or I/O? No mention in this thread yet of these stats while the problem is happening. _________________ Peter Potkay
Keep Calm and MQ On |
|
Vitor |
Posted: Wed Feb 04, 2009 8:04 am Post subject: |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
PeterPotkay wrote: |
What if a bad message gets into the flow and the flow starts looping, using all the CPU or I/O? No mention in this thread yet of these stats while the problem is happening. |
We have not ruled out one of the new flows having a bad loop in it, but can't find anything as yet. It's theoretically possible that one is receiving a reply which is making it repeat the question, but that's not easy to determine.
Another thing hard to determine is the utilisation of the box at the times of problem. The best numbers I have are that CPU at server level is around 70% at the time, and I/O does not vary much from the "normal" levels.
I'm trying to obtain something a bit more scientific. _________________ Honesty is the best policy.
Insanity is the best defence. |
|
Vitor |
Posted: Thu Feb 05, 2009 6:23 am Post subject: |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
As most experienced hands on the forum could have predicted, there was not a single cause of this, but the outcome (for the record) was thus:
1) Audit records were changed to be written to a local queue and moved on; this helped but not very much.
2) The 7 audit points in the flow were reduced to a more manageable number, reducing the overall number of messages to deal with.
3) A bug was identified while looking into the transactional / non-transactional question where a flow wrote out a request message and used a MQGet node to read the reply. Regrettably the MQOutput was in the same UOW as the flow, so there was never a reply because the request was never committed. Hence the flow sat for 15 seconds waiting for the get to expire, then went through some complicated failure processing. High numbers of these cause the execution group to lock resources, run out of threads and all sorts of bad things, leading to high resource usage in the server.
4) The practice of having a single EG holding every single production flow has been called into question
I thank all concerned for their valuable input. _________________ Honesty is the best policy.
Insanity is the best defence. |
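Item 3 in the post above is easier to see with a toy model. The sketch below is not the MQ or broker API: `SyncpointQueue`, `UnitOfWork`, `responder` and `flow` are hypothetical names, and the whole thing is an in-memory stand-in that assumes only one rule, namely that a message put under syncpoint is invisible to other applications until the unit of work commits.

```python
class UnitOfWork:
    """Holds puts made under syncpoint until commit."""

    def __init__(self):
        self.pending = []

    def commit(self):
        # Only now do the pending messages become visible to getters.
        for target_queue, msg in self.pending:
            target_queue.visible.append(msg)
        self.pending.clear()


class SyncpointQueue:
    """Minimal queue: committed messages are visible, uncommitted ones are not."""

    def __init__(self):
        self.visible = []

    def put(self, msg, uow=None):
        if uow is None:
            self.visible.append(msg)          # put outside syncpoint
        else:
            uow.pending.append((self, msg))   # put inside the unit of work

    def get(self):
        return self.visible.pop(0) if self.visible else None


def responder(request_q, reply_q):
    """Stand-in for the back-end application that answers requests."""
    request = request_q.get()
    if request is not None:
        reply_q.put("reply to " + request)


def flow(put_in_flow_uow):
    """Send a request, let the back end run, then look for the reply."""
    request_q, reply_q = SyncpointQueue(), SyncpointQueue()
    uow = UnitOfWork()
    request_q.put("request", uow if put_in_flow_uow else None)
    responder(request_q, reply_q)   # runs while the flow is waiting
    reply = reply_q.get()           # stands in for the get wait expiring
    uow.commit()                    # the flow's UOW only ends afterwards
    return reply


if __name__ == "__main__":
    print("request put inside the flow's UOW: ", flow(True))
    print("request put outside the flow's UOW:", flow(False))
```

In the first call the back end never sees the request, so the get comes back empty, which corresponds to the 15 second wait expiring in the real flow. In the second call the request is visible immediately and the reply arrives. The thread does not say how the flow was fixed; putting the request outside the flow's unit of work is the usual approach, offered here as an assumption.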
|
mqjeff |
Posted: Thu Feb 05, 2009 6:28 am Post subject: |
|
|
Grand Master
Joined: 25 Jun 2008 Posts: 17447
|
Vitor wrote: |
3) A bug was identified while looking into the transactional / non-transactional question where a flow wrote out a request message and used a MQGet node to read the reply. Regrettably the MQOutput was in the same UOW as the flow, so there was never a reply because the request was never committed. Hence the flow sat for 15 seconds waiting for the get to expire, then went through some complicated failure processing. High numbers of these cause the execution group to lock resources, run out of threads and all sorts of bad things, leading to high resource usage in the server. |
This would also lead to reserved space on the MQ transaction logs during the timeout, which when there was enough of it would cause the generation of additional secondary logs if possible - and if using circular logs could cause any transaction the queue manager is participating in to grind to a halt. |
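To get a feel for how little headroom circular logging can leave, here is a rough calculation. It assumes the log is sized by the primary file count, secondary file count and file size in 4 KB pages (the LogPrimaryFiles, LogSecondaryFiles and LogFilePages settings in qm.ini); the function names, the example values and the logging rate are hypothetical and not taken from the thread.

```python
PAGE_BYTES = 4096  # log file size is specified in 4 KB pages


def circular_log_capacity_mb(primary_files, secondary_files, log_file_pages):
    """Total circular log space in MB once all secondary files are allocated."""
    total_pages = (primary_files + secondary_files) * log_file_pages
    return total_pages * PAGE_BYTES / (1024 * 1024)


def seconds_until_full(capacity_mb, log_mb_per_second):
    """Crude estimate of how long uncommitted work can be outstanding
    before the log has no reusable space, at a constant logging rate."""
    return capacity_mb / log_mb_per_second


if __name__ == "__main__":
    capacity = circular_log_capacity_mb(primary_files=3, secondary_files=2,
                                        log_file_pages=1024)
    print("log capacity (MB):", capacity)
    print("seconds of headroom at 0.5 MB/s:", seconds_until_full(capacity, 0.5))
```

This ignores everything the queue manager does to protect itself (checkpoints, rolling back long-running units of work), so treat the result as an order-of-magnitude indication only. The point it illustrates is the one made above: units of work that stay open for a 15 second timeout, in large numbers, can plausibly consume a small circular log.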
|
Vitor |
Posted: Thu Feb 05, 2009 6:31 am Post subject: |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
mqjeff wrote: |
and if using circular logs could cause any transaction the queue manager is participating in to grind to a halt. |
Leading to the poor dequeue performance we were seeing. _________________ Honesty is the best policy.
Insanity is the best defence. |
|
mqjeff |
Posted: Thu Feb 05, 2009 6:34 am Post subject: |
|
|
Grand Master
Joined: 25 Jun 2008 Posts: 17447
|
Vitor wrote: |
mqjeff wrote: |
and if using circular logs could cause any transaction the queue manager is participating in to grind to a halt. |
Leading to the poor dequeue performance we were seeing. |
 |
|
PeterPotkay |
Posted: Thu Feb 05, 2009 7:01 am Post subject: |
|
|
 Poobah
Joined: 15 May 2001 Posts: 7723
|
mqjeff wrote: |
Vitor wrote: |
3) A bug was identified while looking into the transactional / non-transactional question where a flow wrote out a request message and used a MQGet node to read the reply. Regrettably the MQOutput was in the same UOW as the flow, so there was never a reply because the request was never committed. Hence the flow sat for 15 seconds waiting for the get to expire, then went through some complicated failure processing. High numbers of these cause the execution group to lock resources, run out of threads and all sorts of bad things, leading to high resource usage in the server. |
This would also lead to reserved space on the MQ transaction logs during the timeout, which when there was enough of it would cause the generation of additional secondary logs if possible - and if using circular logs could cause any transaction the queue manager is participating in to grind to a halt. |
Wouldn't there be corresponding errors in the QM Error logs? _________________ Peter Potkay
Keep Calm and MQ On |
|
PeterPotkay |
Posted: Thu Feb 05, 2009 7:07 am Post subject: |
|
|
 Poobah
Joined: 15 May 2001 Posts: 7723
|
Vitor wrote: |
4) The practice of having a single EG holding every single production flow has been called into question
|
Without any other criteria to go by, shoot for 1 EG for every CPU core your server has, and divvy up the flows between them as best you can. I also dedicate one of my EGs to any flows that deal with batch jobs. That way when a flood of transactions comes through, driving that EG to 100% CPU, that EG is only driving one of the CPU cores to 100%, leaving the other cores to service the other EGs that are doing more timely non-batch work.
Or, as you have painfully seen, if one EG is housing a bad flow that uses a lot of resources, hopefully the other EGs will have access to the other CPUs and not be impacted. _________________ Peter Potkay
Keep Calm and MQ On |
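The rule of thumb in the post above (one EG per core, with one EG set aside for batch flows) can be sketched as a simple assignment. This is only one way to "divvy up the flows", assuming you have some relative load estimate per flow; the function name, the flow names, the load figures and the EG names are all hypothetical.

```python
import heapq


def assign_flows(flows, cores):
    """Spread flows over `cores` execution groups.

    `flows` is a list of (name, estimated_load, is_batch) tuples.
    Batch flows share one dedicated group when there is more than one core;
    the remaining flows go, heaviest first, to the least-loaded group.
    """
    groups = {}
    batch = [f for f in flows if f[2]]
    if batch and cores > 1:
        groups["EG_BATCH"] = [name for name, _, _ in batch]
        online = [f for f in flows if not f[2]]
        online_groups = cores - 1
    else:
        online = list(flows)   # nothing to separate, or only one core
        online_groups = cores

    for i in range(online_groups):
        groups[f"EG{i + 1}"] = []

    # Min-heap of (accumulated load, group index): pop gives the emptiest group.
    heap = [(0.0, i) for i in range(online_groups)]
    heapq.heapify(heap)
    for name, load, _ in sorted(online, key=lambda f: -f[1]):
        total, idx = heapq.heappop(heap)
        groups[f"EG{idx + 1}"].append(name)
        heapq.heappush(heap, (total + load, idx))
    return groups


if __name__ == "__main__":
    example = [("orders", 5, False), ("quotes", 3, False),
               ("audit", 2, False), ("nightly_load", 9, True)]
    for eg, names in assign_flows(example, cores=3).items():
        print(eg, names)
```

A greedy split like this is a starting point rather than a recommendation; the isolation benefit described above comes from having several EGs at all, not from the particular balancing scheme.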
|
Vitor |
Posted: Thu Feb 05, 2009 7:12 am Post subject: |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
PeterPotkay wrote: |
Or, as you have painfully seen, if one EG is housing a bad flow that uses a lot of resources, hopefully the other EGs will have access to the other CPUs and not be impacted. |
The bittersweet part of this is watching the great and the good asking who decided to lump all these flows into the default EG, and getting responses ranging from "it's always been like that" to "I think it was <insert name of long departed employee> who decided that", with all shades in between.
How many times in the average organisation do you find design decisions which have not been made but arrived at through inertia?
(This question is intended to be rhetorical. If you actually wish to discuss it, please start a new thread!!! ) _________________ Honesty is the best policy.
Insanity is the best defence. |
|