Delay in delivery of messages to clustered queues

rujova
PostPosted: Wed Apr 01, 2020 4:15 pm    Post subject: Delay in delivery of messages to clustered queues

Novice

Joined: 07 Jan 2015
Posts: 13

Hi guys!

I am facing a situation where some messages are being delivered late to clustered queues.

I've searched the Cluster forum threads for a while, and I also took some of the advice given there, but I have run out of possible solutions.

To give you a little context, I have two servers (A and B) in separate data centers within the same country (one processing the production workload and the other idle). Additionally, there are two query servers (APP1 and APP2) in separate geographic sites, which process transactions forwarded by A or B depending on which clustered queues are enabled. The 4 servers are clustered against two full repositories (FR1 and FR2) in separate geographic sites.

When A works against APP1 (within the same data center), operation is optimal. When the production workload is moved to APP2 (separate geographic site), queuing is observed in the SCTQ (SYSTEM.CLUSTER.TRANSMIT.QUEUE) for seconds at a time, causing messages to be delivered late and generating timeouts. The service wait time for a response is set to 800 ms.

There are no FDCs and no errors written to the MQ logs during the events. The channels kept running without interruption. No product errors that could be used in a PMR were seen; it seems to be a performance issue.

MQ Version:
A:9.0.0.8
B: 9.0.0.9
APP1 & APP2: 9.0.0.5
FR1 & FR2: 9.0.0.5

Points that I already checked:
1. Queue depths in remote QMGRs. There are no full queues in the destination queue managers, nor in the other queue managers shared in the cluster.
2. Trace route from A to APP2. 0 ms reported.
3. Stress test from A to APP2. There are time frames where messages start to build up in the SCTQ, but only for a few seconds, no more than 3 to 5 seconds.
4. Messages are put as non-persistent.
5. Network capture to validate packet loss or network problems. No errors were reported.

I read that it is not recommended to create a separate transmission queue to isolate the problem.
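
For reference, the mechanism I read about is the per-channel cluster transmission queue option (MQ v8 and later), which would at least show which channel a backlog belongs to. A sketch of the setting, which we have not applied:

Code:
* Give each new cluster-sender its own transmit queue,
* SYSTEM.CLUSTER.TRANSMIT.<channel name>, instead of the shared SCTQ
ALTER QMGR DEFCLXQ(CHANNEL)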

Any suggestions on where to continue investigating, or on how to rule out MQ as the culprit?
_________________
Looking Forward,

Rujova
hughson
PostPosted: Wed Apr 01, 2020 5:05 pm    Post subject: Re: Delay in delivery of messages to clustered queues

Padawan

Joined: 09 May 2013
Posts: 1914
Location: Bay of Plenty, New Zealand

rujova wrote:
When the production workload is moved to APP2 (separate geographic site), queuing is observed in the SCTQ for seconds, causing messages to be delivered late and generating timeouts.

When the production workload is moved to APP2, are the channels reading the SCTQ already running, or are you paying for channel start times in these measurements?

Remember it is perfectly legal to issue manual START CHANNEL commands against cluster-sender channels.
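
For example, ahead of the cutover (channel names as used in this thread; a minimal sketch):

Code:
* Start the cluster-sender before moving the workload so the first
* messages do not pay the channel start-up cost, then confirm its state
START CHANNEL(TO.APP2)
DISPLAY CHSTATUS(TO.APP2) STATUS SUBSTATE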

Cheers,
Morag
_________________
Morag Hughson @MoragHughson
IBM MQ Technical Education Specialist
Get your IBM MQ training here!
MQGem Software
rujova
PostPosted: Wed Apr 01, 2020 5:32 pm    Post subject:

Novice

Joined: 07 Jan 2015
Posts: 13

Hey Morag!

Both channels are configured with DISCINT(0). I am attaching the channel properties.

It is important to note that the timeouts are not generated immediately after the workload change to APP2; they occur sporadically and without a defined pattern.

Code:
dis channel(TO.APP1) ALL
AMQ8414: Display Channel details.
CHANNEL(TO.APP1)                        CHLTYPE(CLUSRCVR)
ALTDATE(2018-04-10)                     ALTTIME(15.48.49)
BATCHHB(0)                              BATCHINT(0)
BATCHLIM(5000)                          BATCHSZ(50)
CERTLABL( )                             CLUSNL(CLUSTERS)
CLUSTER( )                              CLWLPRTY(0)
CLWLRANK(0)                             CLWLWGHT(50)
COMPHDR(NONE)                           COMPMSG(NONE)
CONNAME(x.x.x.x(XXXX))                  CONVERT(NO)
DESCR()
DISCINT(0)                              HBINT(300)
KAINT(AUTO)                             LOCLADDR( )
LONGRTY(999999999)                      LONGTMR(1200)
MAXMSGL(4194304)                        MCANAME( )
MCATYPE(THREAD)                         MCAUSER( )
MODENAME( )                             MONCHL(QMGR)
MRDATA( )                               MREXIT( )
MRRTY(10)                               MRTMR(1000)
MSGDATA( )                              MSGEXIT( )
NETPRTY(0)                              NPMSPEED(FAST)
PROPCTL(COMPAT)                         PUTAUT(DEF)
RCVDATA( )                              RCVEXIT( )
RESETSEQ(NO)                            SCYDATA( )
SCYEXIT( )                              SENDDATA( )
SENDEXIT( )                             SEQWRAP(999999999)
SHORTRTY(10)                            SHORTTMR(60)
SSLCAUTH(REQUIRED)                      SSLCIPH( )
SSLPEER( )                              STATCHL(QMGR)
TPNAME( )                               TRPTYPE(TCP)
USEDLQ(YES)



Code:
AMQ8414: Display Channel details.
CHANNEL(TO.APP2)                        CHLTYPE(CLUSRCVR)
ALTDATE(2018-04-10)                     ALTTIME(16.06.04)
BATCHHB(0)                              BATCHINT(0)
BATCHLIM(5000)                          BATCHSZ(50)
CERTLABL( )                             CLUSNL(CLUSTERS)
CLUSTER( )                              CLWLPRTY(0)
CLWLRANK(0)                             CLWLWGHT(50)
COMPHDR(NONE)                           COMPMSG(NONE)
CONNAME(y.y.y.y(XXXX))                  CONVERT(NO)
DESCR()
DISCINT(0)                              HBINT(300)
KAINT(AUTO)                             LOCLADDR( )
LONGRTY(999999999)                      LONGTMR(1200)
MAXMSGL(4194304)                        MCANAME( )
MCATYPE(THREAD)                         MCAUSER( )
MODENAME( )                             MONCHL(QMGR)
MRDATA( )                               MREXIT( )
MRRTY(10)                               MRTMR(1000)
MSGDATA( )                              MSGEXIT( )
NETPRTY(0)                              NPMSPEED(FAST)
PROPCTL(COMPAT)                         PUTAUT(DEF)
RCVDATA( )                              RCVEXIT( )
RESETSEQ(NO)                            SCYDATA( )
SCYEXIT( )                              SENDDATA( )
SENDEXIT( )                             SEQWRAP(999999999)
SHORTRTY(10)                            SHORTTMR(60)
SSLCAUTH(REQUIRED)                      SSLCIPH( )
SSLPEER( )                              STATCHL(QMGR)
TPNAME( )                               TRPTYPE(TCP)
USEDLQ(YES)

_________________
Looking Forward,

Rujova
fjb_saper
PostPosted: Wed Apr 01, 2020 6:35 pm    Post subject:

Grand High Poobah

Joined: 18 Nov 2003
Posts: 20696
Location: LI,NY

It is nice that the network layer reported no errors.
Did anybody verify the NICs?
Note that you are complaining about timeouts, but that is to be expected with your setup: with non-persistent messages, a message would simply be discarded if there is a problem...
What is the rate of transmission? Would it look better if you had 2 channels instead of one?

I would expect that you will probably find a negotiation problem (for example, a speed/duplex mismatch) between the NICs.
_________________
MQ & Broker admin
hughson
PostPosted: Wed Apr 01, 2020 7:55 pm    Post subject:

Padawan

Joined: 09 May 2013
Posts: 1914
Location: Bay of Plenty, New Zealand

rujova wrote:
It is important to note that timeouts are not generated immediately after the workload change to APP2, they occur sporadically and without defined patterns.

Do you see an increase in DISPLAY CHSTATUS NETTIME when the timeouts occur? You can also look back on this historically by collecting STATCHL records.
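
Something like this, assuming monitoring and statistics are enabled at the queue manager (your channels already have MONCHL(QMGR) and STATCHL(QMGR)):

Code:
* Online view; NETTIME and XQTIME report short,long averages in microseconds
DISPLAY CHSTATUS(TO.APP2) NETTIME XQTIME
* For history, enable channel statistics at the qmgr...
ALTER QMGR MONCHL(MEDIUM) STATCHL(MEDIUM)
* ...then format the records from SYSTEM.ADMIN.STATISTICS.QUEUE with the
* supplied sample, e.g.: amqsmon -m <qmgr> -t statistics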

Cheers,
Morag
_________________
Morag Hughson @MoragHughson
IBM MQ Technical Education Specialist
Get your IBM MQ training here!
MQGem Software
rujova
PostPosted: Wed Apr 01, 2020 9:02 pm    Post subject:

Novice

Joined: 07 Jan 2015
Posts: 13

Hey fjb_saper!

fjb_saper wrote:
the message would be discarded if there is a problem...


Totally agree, but note that the responses are in fact being delivered, just late: they queue up in the reply-to queue because the requester application's wait time has already expired.

fjb_saper wrote:
What is the rate of transmission.


About 20-50 messages per second.

fjb_saper wrote:
Would it look better if you had 2 channels instead of one?



I have not tried it. Is it as simple as defining a new CLUSRCVR channel with a different name than TO.APP1 / TO.APP2? Something like TO.APP1A / TO.APP2B.
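
Something like this, perhaps (sketch only, reusing the same listener address):

Code:
* Hypothetical second cluster-receiver alongside TO.APP2
DEFINE CHANNEL(TO.APP2B) CHLTYPE(CLUSRCVR) TRPTYPE(TCP) +
       CONNAME('y.y.y.y(XXXX)') CLUSNL(CLUSTERS)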

I will ask my pals to check the NICs.
_________________
Looking Forward,

Rujova
rujova
PostPosted: Wed Apr 01, 2020 9:30 pm    Post subject:

Novice

Joined: 07 Jan 2015
Posts: 13

Hey Morag!

hughson wrote:
Do you see an increase in DISPLAY CHSTATUS NETTIME when the timeouts occur? You can also look back on this historically by collecting STATCHL records.


I was curious about that statistic, so I ran the command without processing the workload. Here are the results, both from QMGR A:

Code:
DISPLAY CHSTATUS(TO.APP1) NETTIME
AMQ8417: Display Channel Status details.
CHANNEL(TO.APP1)                        CHLTYPE(CLUSSDR)
CONNAME(x.x.x.x(XXXX))                  CURRENT
NETTIME(760,977)                        RQMNAME(APP1)
STATUS(RUNNING)                         SUBSTATE(MQGET)
XMITQ(SYSTEM.CLUSTER.TRANSMIT.QUEUE)

AMQ8417: Display Channel Status details.
CHANNEL(TO.APP2)                        CHLTYPE(CLUSSDR)
CONNAME(y.y.y.y(XXXX))                  CURRENT
NETTIME(43096,43096)                    RQMNAME(APP2)
STATUS(RUNNING)                         SUBSTATE(MQGET)
XMITQ(SYSTEM.CLUSTER.TRANSMIT.QUEUE)


I am going to check it once the workload is moved to the APP2 site.

Following that path, I also ran the command from QMGR A to QMGRs CP1, CP2, CA1 and CA2 (same distribution as APP1 & APP2). Here are the results:

Code:

AMQ8417: Display Channel Status details.
CHANNEL(TO.CP1)                         CHLTYPE(CLUSSDR)
CONNAME(x.x.x.x1(XXXX))                 CURRENT
NETTIME(1310,3886)                      RQMNAME(CP1)
STATUS(RUNNING)                         SUBSTATE(MQGET)
XMITQ(SYSTEM.CLUSTER.TRANSMIT.QUEUE)

AMQ8417: Display Channel Status details.
CHANNEL(TO.CP2)                         CHLTYPE(CLUSSDR)
CONNAME(x.x.x.x2(XXXX))                 CURRENT
NETTIME(757,751)                        RQMNAME(CP2)
STATUS(RUNNING)                         SUBSTATE(MQGET)
XMITQ(SYSTEM.CLUSTER.TRANSMIT.QUEUE)

AMQ8417: Display Channel Status details.
CHANNEL(TO.CA1)                         CHLTYPE(CLUSSDR)
CONNAME(y.y.y.y1(XXXX))                 CURRENT
NETTIME(48169,51142)                    RQMNAME(CA1)
STATUS(RUNNING)                         SUBSTATE(MQGET)
XMITQ(SYSTEM.CLUSTER.TRANSMIT.QUEUE)

AMQ8417: Display Channel Status details.
CHANNEL(TO.CA2)                         CHLTYPE(CLUSSDR)
CONNAME(y.y.y.y2(XXXX))                 CURRENT
NETTIME(50299,98647)                    RQMNAME(CA2)
STATUS(RUNNING)                         SUBSTATE(MQGET)
XMITQ(SYSTEM.CLUSTER.TRANSMIT.QUEUE)
 

_________________
Looking Forward,

Rujova
fjb_saper
PostPosted: Thu Apr 02, 2020 2:36 pm    Post subject:

Grand High Poobah

Joined: 18 Nov 2003
Posts: 20696
Location: LI,NY

You definitely have something in the network.
Some of your channels average a NETTIME of about 1,000 microseconds; others are around 50,000!
Enjoy
_________________
MQ & Broker admin
gbaddeley
PostPosted: Thu Apr 02, 2020 2:40 pm    Post subject:

Jedi

Joined: 25 Mar 2003
Posts: 2492
Location: Melbourne, Australia

In the cross-geo network path, there may be other apps competing for network bandwidth or switch/router resources, causing delays to MQ packets.
_________________
Glenn
hughson
PostPosted: Thu Apr 02, 2020 4:29 pm    Post subject:

Padawan

Joined: 09 May 2013
Posts: 1914
Location: Bay of Plenty, New Zealand

rujova wrote:
Service wait-time for response is set to 800ms.


rujova wrote:
Code:
DISPLAY CHSTATUS(TO.APP1) NETTIME
AMQ8417: Display Channel Status details.
CHANNEL(TO.APP1)                    CHLTYPE(CLUSSDR)
CONNAME(x.x.x.x(XXXX))              CURRENT
NETTIME(760,977)                    RQMNAME(APP1)
STATUS(RUNNING)                     SUBSTATE(MQGET)
XMITQ(SYSTEM.CLUSTER.TRANSMIT.QUEUE)

AMQ8417: Display Channel Status details.
CHANNEL(TO.APP2)                    CHLTYPE(CLUSSDR)
CONNAME(y.y.y.y(XXXX))              CURRENT
NETTIME(43096,43096)                 RQMNAME(APP2)
STATUS(RUNNING)                     SUBSTATE(MQGET)
XMITQ(SYSTEM.CLUSTER.TRANSMIT.QUEUE)


NETTIME is measured in microseconds, so 43,000 microseconds is 43 milliseconds, which is less than the 800 millisecond service wait time. This in itself should not be a problem, but it is clear that, not surprisingly, the network to APP2 in the separate geographic site is slower. It will be interesting to see how the numbers look when you move the workload to it.

You've shown us the different MQ versions of the various servers in the picture, but are there any other differences between the machines? For example, when you move the workload to APP2, is it capable of running the load given to it?

Cheers,
Morag
_________________
Morag Hughson @MoragHughson
IBM MQ Technical Education Specialist
Get your IBM MQ training here!
MQGem Software
rujova
PostPosted: Wed Apr 22, 2020 2:51 pm    Post subject:

Novice

Joined: 07 Jan 2015
Posts: 13

Hi Morag!

hughson wrote:
Be interesting to see how the numbers are when you move the workload to it.


We are planning to move the workload to the APP2 environment, looking for fresh statistics. As soon as I have them, I will share the results.

hughson wrote:
You've shown us the different MQ versions of the various servers in the picture, but are there any other differences between the machines? For example, when you move the workload to APP2, is it capable of running the load given to it?


Both servers have the same capacity (CPUs, RAM, OS).

I was researching XQTIME (like NETTIME, it shows high values towards APP2). Is it valid to add XQTIME + NETTIME as a statistic? Giving something like:

XQTIME = 55 ms
NETTIME = 80 ms
MQ transport time = 135 ms
_________________
Looking Forward,

Rujova
gbaddeley
PostPosted: Wed Apr 22, 2020 3:27 pm    Post subject:

Jedi

Joined: 25 Mar 2003
Posts: 2492
Location: Melbourne, Australia

It may be useful to try network ICMP ping from A to APP2. On a high bandwidth network with available capacity, ping times should be consistently below 10ms. I would be concerned if any pings take longer than say 30ms. Your networking team may be able to assist with identifying capacity bottlenecks.
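
For example (hypothetical hostname; sample repeatedly to catch sporadic spikes rather than a single average):

Code:
# 100 pings at 0.2 s intervals from A towards APP2's host
ping -c 100 -i 0.2 app2.example.com | tail -3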
_________________
Glenn
rujova
PostPosted: Tue Oct 27, 2020 10:43 am    Post subject:

Novice

Joined: 07 Jan 2015
Posts: 13

Hi guys, it's me again.

gbaddeley wrote:
It may be useful to try network ICMP ping from A to APP2. On a high bandwidth network with available capacity, ping times should be consistently below 10ms. I would be concerned if any pings take longer than say 30ms. Your networking team may be able to assist with identifying capacity bottlenecks.


We tried an ICMP ping, which resulted in 28 ms from A to APP2. However, the NETTIME and XQTIME values were considerably higher.

To recap and return to the topic, based on an event that occurred on October 16: we have 5 geographically distributed servers. C1, C2 and C3 are the ones that originate the messages; APP1 and APP2 are where the applications that process and respond to C1, C2 and C3 are hosted.

When C1, C2 and C3 have their workload processed against APP1, everything works. Later, when we move the processing to APP2, we begin to notice delays in the delivery of messages, but only for C2 and C3. C1 doesn't report them (which is quite strange).

Previously, we were going to run operational relocation tests, but we were not able to do so (COVID).

On October 16, a Business Continuity Test was carried out to process the messaging in APP2. The NETTIME and XQTIME for C2 -> APP2 exceeded 450 ms (APP2 -> C2 was around 40 ms). C3 was also affected, but not as badly as C2.

Here are some of the samples I took from the STATCHL:

Code:


10/16/2020:16:09:13.65 , NETTIME(40153,159380) SUBSTATE(MQGET), XBATCHSZ(43,1) XQTIME(11944,143799) , CURDEPTH(0)

10/16/2020:16:12:38.27 , NETTIME(39311,157411) SUBSTATE(MQGET), XBATCHSZ(43,1) XQTIME(97176,176968) , CURDEPTH(0)

10/16/2020:16:12:42.25 , NETTIME(39311,157411) SUBSTATE(RECEIVE), XBATCHSZ(43,1) XQTIME(85029,174202) , CURDEPTH(38)

10/16/2020:16:15:04.28 , NETTIME(368359,196696) SUBSTATE(RECEIVE), XBATCHSZ(43,1) XQTIME(13836,146775) , CURDEPTH(218)

10/16/2020:16:15:49.16 , NETTIME(326463,194141) SUBSTATE(RECEIVE), XBATCHSZ(43,1) XQTIME(189196,276878) , CURDEPTH(30)

10/16/2020:16:21:47.13 , NETTIME(258265,189225) SUBSTATE(RECEIVE), XBATCHSZ(43,1) XQTIME(8617,176986) , CURDEPTH(141)

10/16/2020:16:27:41.12 , NETTIME(206991,184595) SUBSTATE(RECEIVE), XBATCHSZ(43,1) XQTIME(836,80534) , CURDEPTH(37)

10/16/2020:18:35:24.15 , NETTIME(43684,119215) SUBSTATE(RECEIVE), XBATCHSZ(43,1) XQTIME(94679,747118) , CURDEPTH(9)

10/16/2020:18:35:28.20 , NETTIME(387625,161027) SUBSTATE(MQGET), XBATCHSZ(43,1) XQTIME(94679,747118) , CURDEPTH(0) 


The networking team tells us that everything is fine and that a 70 ms time is guaranteed over the WAN. I've been doing some research on the batch size (the XBATCHSZ status shows what is achieved; the BATCHSZ attribute is at 50). I did some testing in Development, increasing it to 2000, but I didn't see a positive effect. I also checked the TCP buffers and made some changes to increase them, but I didn't see results either.
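
The Development change looked like this (illustrative only; note the effective batch size is negotiated down to the lower of the sender's and receiver's values, so both ends matter):

Code:
* Development-only test on the cluster-receiver; the new value reaches
* the auto-defined cluster-senders when the channel next starts
ALTER CHANNEL(TO.APP2) CHLTYPE(CLUSRCVR) BATCHSZ(2000)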

I'm running out of pieces and am very close to checkmate.
_________________
Looking Forward,

Rujova
fjb_saper
PostPosted: Tue Oct 27, 2020 8:38 pm    Post subject:

Grand High Poobah

Joined: 18 Nov 2003
Posts: 20696
Location: LI,NY

It might be worth running a traceroute between the producers' queue manager host and the service queue manager host.

If a different route is involved, that could also affect the result.
Check in particular whether there are any static routes on C1 that are not present on C2 and C3.
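
For example (hypothetical hostnames):

Code:
# Compare the hop list from each producer host towards APP2
traceroute app2.example.com
# List the routing table; look for static routes on C1 absent from C2/C3
netstat -rn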

Hope this helps
_________________
MQ & Broker admin
rujova
PostPosted: Mon Nov 09, 2020 12:16 pm    Post subject:

Novice

Joined: 07 Jan 2015
Posts: 13

Hey guys!

Well, we tried one more time, but we had the same symptom. Timeouts are generated sporadically, and both the NETTIME and XQTIME increase. What I find quite curious is that there are time spans where no gets are performed on the transmission queue. At the remote site there is no queuing, and there are no error logs that would let me explain this effect.

| Date and time | Server | XMITQ | Put | Get | High Depth |
| 9/11/2020 08:20:10 | QMC3 | SYSTEM.CLUSTER.TRANSMIT.QUEUE | 4 | 4 | 0 |
| 9/11/2020 08:20:11 | QMC3 | SYSTEM.CLUSTER.TRANSMIT.QUEUE | 1 | 1 | 0 |
| 9/11/2020 08:20:12 | QMC3 | SYSTEM.CLUSTER.TRANSMIT.QUEUE | 3 | 3 | 0 |
| 9/11/2020 08:20:13 | QMC3 | SYSTEM.CLUSTER.TRANSMIT.QUEUE | 3 | 3 | 0 |
| 9/11/2020 08:20:15 | QMC3 | SYSTEM.CLUSTER.TRANSMIT.QUEUE | 38 | 38 | 3 |
| 9/11/2020 08:20:16 | QMC3 | SYSTEM.CLUSTER.TRANSMIT.QUEUE | 59 | 59 | 1 |
| 9/11/2020 08:20:17 | QMC3 | SYSTEM.CLUSTER.TRANSMIT.QUEUE | 52 | 16 | 36 |
| 9/11/2020 08:20:18 | QMC3 | SYSTEM.CLUSTER.TRANSMIT.QUEUE | 67 | 0 | 103 |
| 9/11/2020 08:20:19 | QMC3 | SYSTEM.CLUSTER.TRANSMIT.QUEUE | 61 | 0 | 164 |
| 9/11/2020 08:20:20 | QMC3 | SYSTEM.CLUSTER.TRANSMIT.QUEUE | 31 | 100 | 167 |
| 9/11/2020 08:20:22 | QMC3 | SYSTEM.CLUSTER.TRANSMIT.QUEUE | 1 | 96 | 96 |
| 9/11/2020 08:20:23 | QMC3 | SYSTEM.CLUSTER.TRANSMIT.QUEUE | 3 | 3 | 0 |
| 9/11/2020 08:20:24 | QMC3 | SYSTEM.CLUSTER.TRANSMIT.QUEUE | 3 | 3 | 0 |
| 9/11/2020 08:20:25 | QMC3 | SYSTEM.CLUSTER.TRANSMIT.QUEUE | 1 | 1 | 0 |
| 9/11/2020 08:20:27 | QMC3 | SYSTEM.CLUSTER.TRANSMIT.QUEUE | 21 | 21 | 0 |
| 9/11/2020 08:20:28 | QMC3 | SYSTEM.CLUSTER.TRANSMIT.QUEUE | 3 | 3 | 0 |

We checked with the network team and there were no packet losses or high WAN times during those events.
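
One more thing worth watching during the next test (assuming MONQ is enabled so MSGAGE is returned):

Code:
* Age of the oldest message and time of the last get on the XMITQ; a
* growing MSGAGE with a stale LGETTIME would confirm the channel stopped
* draining the queue, rather than the applications stopping their puts
DISPLAY QSTATUS(SYSTEM.CLUSTER.TRANSMIT.QUEUE) TYPE(QUEUE) CURDEPTH MSGAGE LGETDATE LGETTIME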

I am tempted to increase BATCHSZ further, but the Knowledge Center indicates that this can have a negative impact, delaying the first messages placed in the batch while waiting for a commit. I also found old threads that discuss the topic. I don't know what other parameter or configuration could help us reduce those timeout peaks.
_________________
Looking Forward,

Rujova