QSG CHIN high CPU / workload skew - why?
GheorgheDragos
PostPosted: Thu Sep 24, 2020 1:09 pm    Post subject: QSG CHIN high CPU / workload skew - why?

Acolyte

Joined: 28 Jun 2018
Posts: 51

Dear MQ community,

For a good few days now, up to a week or more, we on the MQ team have been getting complaints that two of our channel initiators (from different QSGs but on the same LPAR) are running with high CPU and causing performance issues for the whole LPAR. The problem does not occur in production. The QSG setup is the same. We use two gateway queue managers (one on each LPAR) in the queue sharing group. The number of dispatcher tasks is the same on both, and MAXCHL is the same. How should I interpret the data below? Why is the workload going mostly to B? And why is B showing such high numbers in the charts below?

CPUA is the OK one, CPUB is not (these are placeholder names, not the real LPAR names). Stats were gathered with MP1B.
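(For illustration, the kind of MQSC below, issued on each queue manager in turn, is what I mean by the settings being the same - the attributes are the z/OS CHINIT ones:)

Code:
* Compare the CHINIT tuning attributes on MQA and MQB
DISPLAY QMGR CHIDISPS SSLTASKS MAXCHL ACTCHL
* Show the running channel initiator (dispatchers, SSL tasks, listeners in use)
DISPLAY CHINIT
* Rough count of channel instances currently active on this queue manager
DISPLAY CHSTATUS(*) STATUS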

Code:
DNS
mvs   qm   qsg   date   time   Type   Requests   Busy%   Total CPU    CPU%   avg CPU   avg ET
 CPUA   MQA   QSG   24/09/2020   18:01:28   DNS    0     0.0    0.000000    0.0    0.0   0
 CPUA   MQA   QSG   24/09/2020   18:31:22   DNS    0     0.0    0.000000    0.0    0.0   0
 CPUA   MQA   QSG   24/09/2020   19:01:16   DNS    0     0.0    0.000000    0.0    0.0   0
 CPUA   MQA   QSG   24/09/2020   19:31:11   DNS    0     0.0    0.000000    0.0    0.0   0
 CPUA   MQA   QSG   24/09/2020   19:59:53   DNS    982     0.0    0.004972    0.0    5.1   1
 
mvs   qm   qsg   date   time   Type   Requests   Busy%   Total CPU    CPU%   avg CPU   avg ET
 CPUB   MQB   QSG   24/09/2020   18:04:42   DNS    2622     0.0    0.023521    0.0    9.0   4
 CPUB   MQB   QSG   24/09/2020   18:34:37   DNS    2490     0.0    0.022688    0.0    9.1   3
 CPUB   MQB   QSG   24/09/2020   19:04:31   DNS    2548     0.0    0.019088    0.0    7.5   2
 CPUB   MQB   QSG   24/09/2020   19:34:25   DNS    2492     0.0    0.023597    0.0    9.5   3

SSL
 mvs   qm   qsg   date   time   Type   Requests   Busy%   Total CPU    CPU%   avg CPU   avg ET   longest request   Date   Time
 CPUA   MQA   QSG   24/09/2020   18:01:28   SSL    2239868     0.2    2.025935    0.2    5.5   7   73   24/09/2020   18:01:01.537272
 CPUA   MQA   QSG   24/09/2020   18:31:22   SSL    2610255     0.2    2.454760    0.2    5.5   7   13   24/09/2020   18:02:12.108316
 CPUA   MQA   QSG   24/09/2020   19:01:16   SSL    1893426     0.1    1.661657    0.1    5.1   6   21   24/09/2020   18:32:06.223394
 CPUA   MQA   QSG   24/09/2020   19:31:11   SSL    1909699     0.1    1.786110    0.1    5.3   6   70   24/09/2020   19:02:10.645409
 CPUA   MQA   QSG   24/09/2020   19:59:53   SSL    1872263     0.2    3.279193    0.2    9.9   11   17   24/09/2020   19:45:21.998967

 mvs   qm   qsg   date   time   Type   Requests   Busy%   Total CPU    CPU%   avg CPU   avg ET   longest request   Date   Time
 CPUB   MQB   QSG   24/09/2020   18:04:42   SSL    2944971     3.2    8.067913    0.5   15.7   97   1721   24/09/2020   18:02:05.566018
 CPUB   MQB   QSG   24/09/2020   18:34:37   SSL    2939129     1.8    8.179115    0.5   15.4   55   240   24/09/2020   18:05:12.318716
 CPUB   MQB   QSG   24/09/2020   19:04:31   SSL    3728686     1.1    8.400311    0.5   13.1   26   154   24/09/2020   18:45:00.996398
 CPUB   MQB   QSG   24/09/2020   19:34:25   SSL    3709952     3.3    8.438746    0.6   13.4   80   371   24/09/2020   19:30:00.976832

DISP
 mvs   qm   qsg   date   time   Type   Requests   Busy%   Total CPU    CPU%   avg CPU   avg ET
 CPUA   MQA   QSG   24/09/2020   18:01:28   DISP   6726835     0.4    0.157831    0.4    7.9   9
 CPUA   MQA   QSG   24/09/2020   18:31:22   DISP   7827337     0.4    0.187892    0.3    7.8   8
 CPUA   MQA   QSG   24/09/2020   19:01:16   DISP   5726524     0.2    0.173564    0.2    6.9   6
 CPUA   MQA   QSG   24/09/2020   19:31:11   DISP   5790634     0.2    0.176444    0.2    7.1   7
 CPUA   MQA   QSG   24/09/2020   19:59:53   DISP   5606207     0.2    0.466262    0.2    7.1   6
 mvs   qm   qsg   date   time   Type   Requests   Busy%   Total CPU    CPU%   avg CPU   avg ET
 CPUB   MQB   QSG   24/09/2020   18:04:42   DISP   6832301     0.9    8.178705    0.5   11.9   23
 CPUB   MQB   QSG   24/09/2020   18:34:37   DISP   6752118     0.6    8.971264    0.4   11.8   16
 CPUB   MQB   QSG   24/09/2020   19:04:31   DISP   9018203     0.6    8.376705    0.5   10.7   12
 CPUB   MQB   QSG   24/09/2020   19:34:25   DISP   9010019     0.9    9.356962    0.6   11.9   17


In the TASKET tab I have 2525 counts with a total CPU time of 81 seconds for CPUA, while for CPUB I have 10000 counts with a total of only 75 seconds of CPU time! How does that make sense? This is beyond my current understanding.

Thank you in advance.
gbaddeley
PostPosted: Thu Sep 24, 2020 4:05 pm

Jedi

Joined: 25 Mar 2003
Posts: 2492
Location: Melbourne, Australia

High CHIN CPU can indicate excessive MQ channel activity, including connect / disconnect and MQI calls made by Client channels. Also, check your queue depths and indexing.
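For example, something like this (the queue name is just a placeholder) shows depth and index type for a suspect queue, plus the channel instances currently running:

Code:
* On z/OS, deep queues that are read by MsgId/CorrelId should be indexed
DISPLAY QLOCAL(APP.REQUEST.QUEUE) CURDEPTH MAXDEPTH INDXTYPE
* A high rate of short-lived client connections shows up as churn here
DISPLAY CHSTATUS(*) STATUS CONNAME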
_________________
Glenn
hughson
PostPosted: Thu Sep 24, 2020 6:57 pm    Post subject: Re: QSG CHIN high CPU / workload skew - why?

Padawan

Joined: 09 May 2013
Posts: 1914
Location: Bay of Plenty, New Zealand

GheorgheDragos wrote:
Why is the workload going mostly to B ?

Could you tell us a little more about how the workload chooses where to go? Are these short-lived client connections, or connections in from other queue managers (which one would assume are longer lived)? Do you use Sysplex Distributor, VIPAs, CCDTs, or something else to have these connections choose A or B?

GheorgheDragos wrote:
In the TASKET tab I have 2525 counts with a total CPU time of 81 seconds for CPUA, while for CPUB I have 10000 counts with a total of 75s CPU time! How does that make sense ?

You don't show us these numbers in your output - could you add that data too?

One way that those totals you mention could make sense would be if the connections which go to CPUA manage to do some work, but those that go to CPUB don't spend very long there, which might suggest that they fail because a resource they need is missing. Are MQA and MQB identical in all ways - i.e. do they have all the queues that are needed, and so on?

Cheers,
Morag
_________________
Morag Hughson @MoragHughson
IBM MQ Technical Education Specialist
Get your IBM MQ training here!
MQGem Software
GheorgheDragos
PostPosted: Thu Sep 24, 2020 10:01 pm

Acolyte

Joined: 28 Jun 2018
Posts: 51

Hello,

The problem is that only the gateway queue managers show very high CPU, on one LPAR, while the other has almost nothing. Both of the GW QMGRs are listening on the same port.
There is an equal number of local queue managers on CPUA and CPUB. Connection to the QSG is done via SVRCONN/CLNTCONN. How the QSG chooses where to distribute the work I don't know exactly, since as far as I know MQ does not have a real workload balancer; however, the numbers seem balanced in terms of shared puts/gets (also taken from MP1B). So why does one channel initiator, on an LPAR that is identical to the other, spend so much more CPU% in DNS, DISP and SSL? I don't understand. The queue managers on CPUA and CPUB all host shared queues, obviously, but they also host local queues for non-HA applications. But why would an HA application access non-HA queues? Everything else is identical: number of CICS regions, etc.
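(For context, the shared inbound path is defined roughly like this - a group-disposition SVRCONN plus a group listener on the shared port; the channel name and port here are made up:)

Code:
* Defined once, available on every queue manager in the QSG
DEFINE CHANNEL(QSG.SVRCONN) CHLTYPE(SVRCONN) TRPTYPE(TCP) QSGDISP(GROUP)
* Each CHINIT starts a listener with group disposition on the shared port
START LISTENER TRPTYPE(TCP) PORT(1414) INDISP(GROUP)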

Code:

CPUA
 Date   Time   Type   Tran1   Tran2   Count   CPU S   logBytesMB   put MB   Get MB
 2020/09/24   17:00:00   M   xxxxxxxx   10.212.203   50   0,00893   0   0   0
 2020/09/24   18:00:00   M   xxxxxxxx   10.224.228   2   0,03164   0.84247   0   0
 2020/09/24   18:00:00   M   xxxxxxxx   10.212.202   2   0,02260   0   0   0
 2020/09/24   18:00:00   M   xxxxxxxx   10.212.194   4   0,03793   0   0   0
 2020/09/24   18:00:00   M   xxxxxxxx   10.212.194   10   0,11330   0   0   0
 2020/09/24   18:00:00   M   xxxxxxxx   10.212.194   20   5,94150   0   0   0
 2020/09/24   18:00:00   M   xxxxxxxx   10.212.203   2000   4,81220   2.8282   0   0.54474
 2020/09/24   18:00:00   M   xxxxxxxx   10.212.203   20   1,80220   0   0   108.71
 2020/09/24   18:00:00   M   xxxxxxxx   10.212.224   20   1,64010   0   0   0
 2020/09/24   18:00:00   M   xxxxxxxx   10.212.225   100   18,08600   0.23779   0   0.13488
 2020/09/24   18:00:00   M   xxxxxxxx   10.212.194   4   0,16914   0   0   0
 2020/09/24   19:00:00   M   xxxxxxxx   10.224.228   2   0,01723   1.0133   0   0
 2020/09/24   19:00:00   M   xxxxxxxx   10.212.202   2   0,02396   0   0   0
 2020/09/24   19:00:00   M   xxxxxxxx   10.212.194   4   0,04046   0   0   0
 2020/09/24   19:00:00   M   xxxxxxxx   10.212.194   10   0,12677   0   0   0
 2020/09/24   19:00:00   M   xxxxxxxx   10.212.194   50   6,38420   24.371   0   0.015259
 2020/09/24   19:00:00   M   xxxxxxxx   10.212.203   20   1,87190   0   0   125
 2020/09/24   19:00:00   M   xxxxxxxx   10.212.224   20   1,74550   0   0   0
 2020/09/24   19:00:00   M   xxxxxxxx   10.212.225   100   21,67600   4.4541   0   1.3639
 2020/09/24   19:00:00   M   xxxxxxxx   10.212.194   4   0,17811   0   0   0
 2020/09/24   20:00:00   M   xxxxxxxx   10.224.228   1   0,00004   0   0   0
 2020/09/24   20:00:00   M   xxxxxxxx   10.212.202   1   0,01179   0   0   0
 2020/09/24   20:00:00   M   xxxxxxxx   10.212.194   2   0,02047   0   0   0
 2020/09/24   20:00:00   M   xxxxxxxx   10.212.194   6   0,06355   0   0   0
 2020/09/24   20:00:00   M   xxxxxxxx   10.212.194   10   3,20660   0   0   0
 2020/09/24   20:00:00   M   xxxxxxxx   10.212.203   9   0,91672   0   0   63.203
 2020/09/24   20:00:00   M   xxxxxxxx   10.212.224   10   0,91649   0.03764   0   6.5819
 2020/09/24   20:00:00   M   xxxxxxxx   10.212.225   40   11,38800   0   0   0
 2020/09/24   20:00:00   M   xxxxxxxx   10.212.194   2   0,08643   0   0   0

CPUB
 Date   Time   Type   Tran1   Tran2   Count   CPU S   logBytesMB   put MB   Get MB
 2020/09/24   17:00:00   M   XXXXX   10.212.194   3   0,00102   0.081135   0   0.02721
 2020/09/24   17:00:00   M   XXXXX   10.212.203   50   0,01605   0   0   0
 2020/09/24   18:00:00   M   XXXXX   10.212.194   50   0,51360   0   0   0
 2020/09/24   18:00:00   M   XXXXX   10.212.194   500   20,97100   0.13149   0   0.23563
 2020/09/24   18:00:00   M   XXXXX   10.173.71.   20   0,38541   0.0083981   0   0.0034599
 2020/09/24   18:00:00   M   XXXXX   10.212.203   3000   4,65350   2.3298   0   0.46959
 2020/09/24   18:00:00   M   XXXXX   10.212.202   20   2,28460   0   0   0
 2020/09/24   18:00:00   M   XXXXX   10.212.202   20   1,65450   0   0   0
 2020/09/24   18:00:00   M   XXXXX   10.212.202   4   0,44455   0   0   0
 2020/09/24   18:00:00   M   XXXXX   10.212.202   4   0,44264   0   0   0
 2020/09/24   18:00:00   M   XXXXX   10.212.202   80   3,42220   0.53914   0   0.16206
 2020/09/24   18:00:00   M   XXXXX   10.212.225   2   0,01920   0.069759   0   0.065398
 2020/09/24   18:00:00   M   XXXXX   10.212.129   4   0,07144   5.9453   0   3.4928
 2020/09/24   19:00:00   M   XXXXX   10.212.194   50   0,51573   0   0   0
 2020/09/24   19:00:00   M   XXXXX   10.212.194   500   22,13100   5.2621   0   3.5808
 2020/09/24   19:00:00   M   XXXXX   10.173.71.   10   0,37586   0   0   0
 2020/09/24   19:00:00   M   XXXXX   10.212.203   6000   8,92010   6.7903   0   1.3451
 2020/09/24   19:00:00   M   XXXXX   10.212.202   20   2,29600   0   0   0
 2020/09/24   19:00:00   M   XXXXX   10.212.202   20   1,63590   0   0   0
 2020/09/24   19:00:00   M   XXXXX   10.212.202   4   0,43805   0   0   0
 2020/09/24   19:00:00   M   XXXXX   10.212.202   4   0,43040   0   0   0
 2020/09/24   19:00:00   M   XXXXX   10.212.202   70   3,42850   0.055463   0   0.13924
 2020/09/24   19:00:00   M   XXXXX   10.212.225   2   0,06561   0.36759   0   0.3596
 2020/09/24   19:00:00   M   XXXXX   10.212.129   4   0,02668   2.1645   0   1.4387
 2020/09/24   20:00:00   M   XXXXX   10.212.194   4   0,00212   0.047628   0   0.0092936
 2020/09/24   20:00:00   M   XXXXX   10.212.203   80   0,34718   0.047455   0   0.022057
hughson
PostPosted: Thu Sep 24, 2020 10:07 pm

Padawan

Joined: 09 May 2013
Posts: 1914
Location: Bay of Plenty, New Zealand

GheorgheDragos wrote:
Both of the GW QMGRS are listening on the same port.

And what is the technology that routes the clients to one or other LPAR?

From everything you have said so far, it would seem that a lot more clients are routed to CPUB than to CPUA, so the above question is quite important! This would account for the extra CPU in the DNS task, SSL tasks and DISPatcher tasks, as those are the things client connections use.
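As a quick sanity check, you could count the active SVRCONN instances on each gateway with something along these lines, and compare how many come back on MQA versus MQB:

Code:
DISPLAY CHSTATUS(*) CHLTYPE CONNAME STATUS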

Cheers,
Morag
_________________
Morag Hughson @MoragHughson
IBM MQ Technical Education Specialist
Get your IBM MQ training here!
MQGem Software
GheorgheDragos
PostPosted: Thu Sep 24, 2020 10:13 pm

Acolyte

Joined: 28 Jun 2018
Posts: 51

I'm sorry, I don't understand. A CLNTCONN uses a CCDT file to connect to a shared channel on the mainframe. How the workload is split between A and B I don't know yet - the TCP/IP task on z/OS, I guess.


edit

Apparently there is a Sysplex Distributor running to balance the IPs. I had never heard of that one before.

edit2 - the problem doesn't seem to be uneven workload distribution, but the fact that on one LPAR the gateway CHIN spends much more time and CPU than on the other.
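(If it is Sysplex Distributor doing the routing, the TCP/IP side can show how connections are actually being spread - something like the following console command, with your TCP/IP procedure name filled in, reports per-target connection counts for the distributed DVIPA/port:)

Code:
D TCPIP,tcpipproc,NETSTAT,VDPT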
GheorgheDragos
PostPosted: Thu Sep 24, 2020 10:52 pm

Acolyte

Joined: 28 Jun 2018
Posts: 51

Moreover, in MSGM we have more or less equal numbers

Code:

 MVS   QM   Date   Time   Puts   Put1s   Gets   Open   Close   Inquire   Set   Close all H   Sub   SubR   Reg CB   Control   Stat   Publish   PersPuts   NPersPuts   PersBytsPut   NPersBytsPut
 CPUA   MQA   24/09/2020   18:01:28   10825   7   515095   58182   58178   9184   0   0   0   0   542218   495164   0   0   402   10436   331242   23468749
 CPUA   MQA   24/09/2020   18:31:22   10335   48   508796   49264   49295   8774   0   0   0   0   517361   478681   0   0   247   10142   214584   20857070
 CPUA   MQA   24/09/2020   19:01:16   10073   0   469256   1532   1533   988   0   0   0   0   387988   387991   0   0   82   9997   61957   16595862
 CPUA   MQA   24/09/2020   19:31:10   10243   498   471700   1530   1532   988   0   0   0   0   393000   395003   0   0   729   10018   1573570   17225862
 CPUA   MQA   24/09/2020   20:01:05   10952   0   473516   3946   3901   3667   0   0   0   0   388872   388961   0   0   951   10007   25096289   15517766


Code:

 MVS   QM   Date   Time   Puts   Put1s   Gets   Open   Close   Inquire   Set   Close all H   Sub   SubR   Reg CB   Control   Stat   Publish   PersPuts   NPersPuts   PersBytsPut   NPersBytsPut
 CPUB   MQB   24/09/2020   18:04:42   10998   0   578826   80184   80064   13764   0   0   0   0   215880   148814   0   0   887   10129   3010612   16039250
 CPUB   MQB   24/09/2020   18:34:37   10577   0   594522   75927   75815   11850   0   0   0   0   206953   141334   0   0   475   10118   1614906   15852338
 CPUB   MQB   24/09/2020   19:04:31   10335   0   657688   123020   122841   19433   0   0   0   0   337048   232711   0   0   223   10130   575430   16042447
 CPUB   MQB   24/09/2020   19:34:25   10981   0   623083   131931   131812   19800   0   0   0   0   362101   251467   0   0   887   10110   2730701   15617762
bruce2359
PostPosted: Fri Sep 25, 2020 6:10 am

Poobah

Joined: 05 Jan 2008
Posts: 9394
Location: US: west coast, almost. Otherwise, enroute.

GheorgheDragos wrote:
I'm sorry I don't understand, a CLNTCONN uses a CCDT file to connect to a shared channel on the mainframe. How the workload is split between A and B I don't know yet. TCPIP task on z/OS I guess I don't know.

If I understand correctly, the inbound connections are from midrange Windows or UNIX. What channel and IP address are in the CCDT? Is the IP defined as a shared port on z/OS, or as a local port for a single qmgr?
_________________
I like deadlines. I like to wave as they pass by.
ב''ה
Lex Orandi, Lex Credendi, Lex Vivendi. As we Worship, So we Believe, So we Live.
GheorgheDragos
PostPosted: Fri Sep 25, 2020 7:45 am

Acolyte

Joined: 28 Jun 2018
Posts: 51

Hello Bruce,

All incoming connections go to the shared port, and the connection then reaches one of the two clones, on LPAR A or B. How this is chosen, in depth, I don't have much of a clue, as there is no load balancer between the incoming connections and the QSG. We use an F5 interface between agencies, ARMs etc., but not in this case. And it is curious, since the setup is identical, and in PRD we do not have the issue (the CPU% usage of both gateway CHINs is reasonably balanced).
PMR opened with IBM today.

Dragos
hughson
PostPosted: Sat Sep 26, 2020 8:02 pm

Padawan

Joined: 09 May 2013
Posts: 1914
Location: Bay of Plenty, New Zealand

You have told us that a CCDT is used, that both GW QMgrs listen on the same port, and that a Sysplex Distributor is also in the picture. Could you complete our understanding of this part of the puzzle by showing us the CCDT CLNTCONN definitions, or at a minimum telling us whether their CONNAMEs are the same (i.e. the Sysplex Distributor is doing the routing) or different (i.e. the CCDT is choosing the route).
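To illustrate the two shapes I mean (channel name, host names and port here are invented):

Code:
* Shape 1: one address, the distributed DVIPA - Sysplex Distributor picks the LPAR
DEFINE CHANNEL(QSG.SVRCONN) CHLTYPE(CLNTCONN) TRPTYPE(TCP) +
       CONNAME('dvipa.example.com(1414)') QMNAME(QSG)
* Shape 2: both LPAR addresses listed - the client works through the list itself
DEFINE CHANNEL(QSG.SVRCONN) CHLTYPE(CLNTCONN) TRPTYPE(TCP) +
       CONNAME('lpara.example.com(1414),lparb.example.com(1414)') QMNAME(QSG)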

You opened this thread saying that more connections were going to one system than the other, but now you say they are even. What data has led you to this new understanding?

Cheers,
Morag
_________________
Morag Hughson @MoragHughson
IBM MQ Technical Education Specialist
Get your IBM MQ training here!
MQGem Software
fjb_saper
PostPosted: Sun Sep 27, 2020 10:58 am

Grand High Poobah

Joined: 18 Nov 2003
Posts: 20696
Location: LI,NY

You say that you have HA traffic and non-HA traffic...

Do you separate the channels for these? Channels that do not use the same IP?
Channels with the correct IP in the local address (LOCLADDR) field?
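For example (purely hypothetical names and addresses), separate client channels per traffic type might look like this:

Code:
* HA traffic: via the distributed address, can land on either LPAR
DEFINE CHANNEL(QSG.HA.SVRCONN) CHLTYPE(CLNTCONN) TRPTYPE(TCP) +
       CONNAME('dvipa.example.com(1414)') QMNAME(QSG)
* Non-HA traffic: pinned to one LPAR via its own static VIPA and port
DEFINE CHANNEL(MQB.LOCAL.SVRCONN) CHLTYPE(CLNTCONN) TRPTYPE(TCP) +
       CONNAME('lparb.example.com(1415)') QMNAME(MQB)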

_________________
MQ & Broker admin
GheorgheDragos
PostPosted: Tue Sep 29, 2020 4:11 am

Acolyte

Joined: 28 Jun 2018
Posts: 51

Dear MQ community,

The last few days have been hectic. A quick update, though: with your suggestions, the IBM support team and Gwydion, we found out that the load balancing is seriously skewed towards CPUB, and, after activating STATCHL and running the newest version of MP1B, we found the culprit for the very high CPU%: two or three channels from the same application opening, putting, getting, closing and disconnecting for every single message!
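(For anyone finding this later, enabling the per-channel data on z/OS is roughly the following - illustrative commands; check the exact trace classes for your MQ level:)

Code:
* Collect per-channel statistics/accounting for channels set to STATCHL(QMGR)
ALTER QMGR STATCHL(HIGH)
* Start the CHINIT statistics and channel accounting traces to SMF
* (class 4 on MQ V8 and later, if I remember correctly) so MP1B can report on them
START TRACE(STAT) DEST(SMF) CLASS(4)
START TRACE(ACCTG) DEST(SMF) CLASS(4)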

Will continue with updates that might help newer members in the future.

Dragos
zpat
PostPosted: Tue Sep 29, 2020 6:04 am

Jedi Council

Joined: 19 May 2001
Posts: 5849
Location: UK

SCTQ "message relocation" can consume a lot of CPU; check for retrying channels.
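For instance, something along these lines will list any channels stuck in retry:

Code:
DISPLAY CHSTATUS(*) WHERE(STATUS EQ RETRYING)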
_________________
Well, I don't think there is any question about it. It can only be attributable to human error. This sort of thing has cropped up before, and it has always been due to human error.
GheorgheDragos
PostPosted: Tue Oct 27, 2020 3:30 am

Acolyte

Joined: 28 Jun 2018
Posts: 51

Hello, good afternoon.
The problem has been found - uneven LPAR weight allocation was forcing more connections to prefer one LPAR over the other.

Dragos