MQSeries.net :: View topic - Confusion over Heartbeat

Confusion over Heartbeat

mvic

Posted: Wed Nov 25, 2009 6:42 am Post subject:

Jedi

Joined: 09 Mar 2004
Posts: 2080

mqjeff wrote:

Regardless of situation, a needlessly running channel doesn't have "zero impact". So if the app can't show any real gain from having the channel running all of the time, you are still wasting some resources for no gain.

DISCINT 0 keeps your channel "up" but it burns negligible CPU while waiting for the next message. One thing that I've seen interfere with long-running channels that go idle is when there is some active piece of the network (firewall etc.) that chops the connection. Other than this, if the system capacity is up to it (RAM, kernel capacity, MaxActiveChannels etc.) then DISCINT 0 shouldn't give any problems.

Quote:

And any process runs the risk of failure over an extended period of uptime. You still IPL your mainframe, right? So why not quiesce your channels for the same reasons.

No need to do so unless there are actually problems being seen. In which case I'm sure IBM would want to help solve that.

fjb_saper

Posted: Wed Nov 25, 2009 4:29 pm Post subject:

Grand High Poobah

Joined: 18 Nov 2003
Posts: 20767
Location: LI,NY

mvic wrote:

mqjeff wrote:

Quote:

And any process runs the risk of failure over an extended period of uptime. You still IPL your mainframe, right? So why not quiesce your channels for the same reasons.

No need to do so unless there are actually problems being seen. In which case I'm sure IBM would want to help solve that.

The most common problem I have seen is that the sender channel is in retrying mode because the receiver channel never realized that the connection was broken... (between dist. and MF). Using adopt new MCA could certainly alleviate some of that. I'd expect that a SVRCONN channel is not subject to the same adopt new MCA rules..... as that would potentially raise a whole other number of questions.

Manual intervention used to be required in the above case and we had to force stop the receiver chl on the mainframe... Haven't seen any of those problems in years though (not since using disconnect interval on the channel).

_________________
MQ & Broker admin

jcv

Posted: Sat Nov 28, 2009 4:33 am Post subject:

Chevalier

Joined: 07 May 2007
Posts: 411
Location: Zagreb

mqjeff wrote:

In those in-between cases...

Let me explain what I meant and how I see this scenario. Let's say that our application has one kind of idle periods which cannot be determined when they exactly begin and when they end, and can only be estimated how long at most they can be, for example during working day's nights, and we add some tolerance to that maximum expected idle period in order to define discint which we are fairly sure cannot be reached over such periods, and especially not during busy daily periods, and other kind of idle periods which can be determined when they exactly end, so that we can actually use scheduling tools to start the channels sufficiently before the application activity starts, that kind may be during non-working days, then I guess that for that other kind of idle period can also be known when it begins, and we can use scheduling tools to stop channels sufficiently after the application activity ends without depending on previously estimated discint to expire, saving more resources that way. That wouldn't mean much more work since one way or another we have to schedule start, in order to fulfill SLA. If we are concerned about the mentioned potential leak of some kind, which I fortunately have never faced, although I have a history of running non-mature versions too, I would say ending the instance would help solve such problems in 99% of cases. If we want to be sure, and execute exactly the required (by that recommendation) branch of code, we may in the same schedule alter discint to 1, start the channel and after it immediately ages out, alter it back. Back to 0 I would say, because there is no need to avoid that value in such a scenario, and it saves us the effort of estimating discint. Naturally, mature sw should be able to run whenever it is needed, for as long as needed, that is, free of any leaks. Hence, such altering can hardly be needed, and is hardly an argument against keeping discint at 0.
Now, in that scenario, our channels are non-stop up let's say 5 or 6 days out of 7 regardless of discint set to 0 or to the estimation. When a channel encounters a retryable error during that period it will go into retry regardless of discint; when it encounters a non-retryable error it will stop regardless of discint. So if we set an appropriately long retry count, what's the benefit of avoiding 0 with respect to avoiding manual interventions on channels in order to restart them? None.
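To put a number on "appropriately long retry count", here is a rough Python sketch. It assumes the usual sender behaviour of SHORTRTY attempts spaced SHORTTMR seconds apart, followed by LONGRTY attempts spaced LONGTMR seconds apart, after which the channel stops. The function names and the sample values are made up for illustration and are not taken from anyone's configuration.

```python
# Rough estimate of how long a sender channel keeps retrying before it stops.
# Assumption: SHORTRTY attempts SHORTTMR seconds apart, then LONGRTY attempts
# LONGTMR seconds apart, and then the channel goes to STOPPED.

def retry_window_seconds(shortrty, shorttmr, longrty, longtmr):
    """Total time covered by the short and long retry cycles, in seconds."""
    return shortrty * shorttmr + longrty * longtmr


def survives_outage(outage_seconds, shortrty, shorttmr, longrty, longtmr):
    """True if the channel is still retrying when the outage ends."""
    return retry_window_seconds(shortrty, shorttmr, longrty, longtmr) > outage_seconds


# Illustrative values only: a four hour network outage.
four_hours = 4 * 3600
print(retry_window_seconds(10, 60, 0, 1200))            # 600
print(survives_outage(four_hours, 10, 60, 0, 1200))     # False
print(survives_outage(four_hours, 10, 60, 100, 1200))   # True
```

Note that DISCINT does not appear anywhere in the calculation, which is the point being made above: once the channel is in retry, only the retry attributes decide whether it comes back by itself.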

PeterPotkay wrote:

Shirley, see my previous comments about how a SNDR channel can auto recover since it initiates work, but the RCVR channel just sits there.

Excuse me, I don't see the relevance here, because at that moment we were already discussing discints, while this must be solved by heartbeats and adoptnewmca. Discint cannot help the rcvr during 6 days out of 7, since the channel may not rest. It is also not obliged to help, because those two do that instead. I am probably missing something here?
In a scenario in which there are only nights (type 1), but all days are equally working (no type 2), avoiding 0 is even more pointless, since the channel never rests. It doesn't have to be actually busy all the time, it just cannot be shut down or aged out. Abstractly speaking, there can't be idle periods of type 3, for which you know when it ends but you don't know when it starts.

mvic wrote:

One thing that I've seen interfere with long-running channels that go idle is when there is some active piece of the network (firewall etc.) that chops the connection.

Isn't that always a recoverable error if heartbeats and adoptnewmca are used?
I'm also not clear about the clearing of retryable and non-retryable errors while a channel is inactive and does not notice them. To gain that benefit, which I saw people emphasizing as a reason to let channels go inactive, which teams usually must do something manually, or is it usually automatic nowadays? Are both types of errors equally solvable without any manual intervention? Obviously, if the MQ admin team has to intervene, then there is no actual benefit gained by channels being inactive.
Thanks in advance for answers to my questions and for any corrections of my thoughts.

bruce2359

Posted: Sat Nov 28, 2009 6:37 am Post subject:

Poobah

Joined: 05 Jan 2008
Posts: 9482
Location: US: west coast, almost. Otherwise, enroute.

Paragraph 1 may be the longest ever written on a post on this site. This paragraph is very long, and is very difficult to read and follow the train of thought.

Having said all that... and in summary:

Disconnect interval enables a channel to go inactive when there are no more messages to transfer. Heartbeats allow the channel ends to keep track of their peer's health - up to the point of disconnect. Triggered channels allow the next message that arrives in an xmit queue to restart the inactive channel - without the need for external automation (job schedulers).
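For anyone who wants to see the three pieces side by side, here is a small Python sketch that only renders MQSC text for a triggered sender channel. It does not connect to a queue manager. The queue manager, channel and host names are hypothetical, and the DISCINT and HBINT values are placeholders to be tuned per site.

```python
# Renders MQSC text for a triggered sender channel; nothing is executed.
# All object names passed in by the caller are hypothetical examples.

def triggered_sender_mqsc(xmitq, channel, conname, discint, hbint):
    """Return MQSC for a transmission queue that triggers its sender channel."""
    lines = [
        # First message on the xmit queue starts the channel named in TRIGDATA.
        f"DEFINE QLOCAL({xmitq}) USAGE(XMITQ) +",
        f"       TRIGGER TRIGTYPE(FIRST) TRIGDATA({channel}) +",
        "       INITQ(SYSTEM.CHANNEL.INITQ) REPLACE",
        # DISCINT lets the idle channel end; HBINT sets the heartbeat interval.
        f"DEFINE CHANNEL({channel}) CHLTYPE(SDR) TRPTYPE(TCP) +",
        f"       CONNAME('{conname}') XMITQ({xmitq}) +",
        f"       DISCINT({discint}) HBINT({hbint}) REPLACE",
    ]
    return "\n".join(lines)


print(triggered_sender_mqsc("QMB", "QMA.TO.QMB", "qmb.example.com(1414)", 6000, 300))
```

The output can be pasted into runmqsc after review. With the queue triggered this way, a channel that has gone inactive on DISCINT restarts when the next message lands on the transmission queue.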

Given the nature of network hardware and software (message workloads that vary over time, routers, packets, firewalls, cables, nic cards, back-hoes, etc.), channels sometimes fail. WMQ offers some tools to keep channels alive (there is a Hursley post of a similar name that is worth reading).

In an ideal world (99% error-free), 99% error-free would be unacceptable. Rather, we deal with the tools we have.
_________________
I like deadlines. I like to wave as they pass by.
ב''ה
Lex Orandi, Lex Credendi, Lex Vivendi. As we Worship, So we Believe, So we Live.

PeterPotkay

Posted: Sat Nov 28, 2009 7:55 am Post subject:

Poobah

Joined: 15 May 2001
Posts: 7723

With proper use of AdoptNewMCA, a long or 0 DISCINT is no longer as problematic as it used to be, since the RCVR channel can recover from blocking on an orphaned socket.

There are still reasons I will never code 0 for DISCINT.

Consider Queue Manager A that has RCVR channels from 1000 other QMs. If one of those 1000 other QMs permanently goes away, and someone doesn't follow due process and clean everything up, do you want that orphaned RCVR channel running forever and ever? Throw a DISCINT of 999,999 versus zero if you must, at least that will allow these types of things to clean up.

Some shops monitor channel status and alert on retrying channels. Consider a channel that gets no traffic from midnight to 6 AM. At 1 AM the cleaning crew spills coffee on a router causing a network outage which is fixed by 5 AM. The MQ Admin whose DISCINT is set to allow that channel to end on its own stays sleeping all night. The MQ Admin who has DISCINT set to 0 is getting paged for a retrying channel.
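The coffee scenario can be written down as a toy model in Python. This is a sketch under simplifying assumptions, not a description of how the channel code works internally: an idle channel with a non-zero DISCINT is treated as inactive once DISCINT seconds have passed since its last message, a channel that is still running when the network drops is treated as ending up in RETRYING, and the monitor pages on any retrying channel. The function names are made up.

```python
# Toy model of the overnight outage scenario. Times are seconds since midnight.
# Assumption: no traffic arrives between idle_start and the end of the outage.

def channel_state_at_outage(idle_start, outage_start, discint):
    """State of an idle channel at the moment the network drops.

    A DISCINT of 0 means the channel never disconnects because of idleness.
    """
    if discint > 0 and idle_start + discint <= outage_start:
        return "INACTIVE"
    return "RUNNING"


def admin_is_paged(idle_start, outage_start, discint):
    """Paged only if the channel was still running and so goes into retry."""
    return channel_state_at_outage(idle_start, outage_start, discint) == "RUNNING"


midnight, one_am = 0, 3600
print(admin_is_paged(midnight, one_am, 0))      # True: DISCINT 0, channel retries
print(admin_is_paged(midnight, one_am, 1800))   # False: channel ended at 00:30
print(admin_is_paged(midnight, one_am, 6000))   # True: still running at 01:00
```

In this model the DISCINT has to be shorter than the gap between the last message and the outage for the admin to keep sleeping, so the benefit depends on how the value compares with the real idle periods.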

And mqjeff's point - a QM with 1000 RCVR channels, all of them running unnecessarily, is a waste of resources, however small. If anything, it makes my DIS CHSTATUS(*) command take a lot longer to run and its output a lot bigger than it needs to be when I'm chasing down some other problem.

It's kinda like turning out the light when you leave the room. Is it some great crime if you don't? No. But that doesn't mean it's not the right thing to do if you aren't going to be in the room for a while. There are lots of factors in deciding if it's the right thing to do to turn out the light if you are only going to be out of the room for a minute. I'm talking about when you are leaving for a while, like an hour or more. Same thing for MQ channels - I don't have my DISCINT set to some ridiculously small number either. That's as bad or worse than DISCINT 0.
_________________
Peter Potkay
Keep Calm and MQ On

jcv

Posted: Sat Nov 28, 2009 3:11 pm Post subject:

Chevalier

Joined: 07 May 2007
Posts: 411
Location: Zagreb

Let me repeat what my point was, and what it was not. I never said that idle channels of any type should be running unnecessarily if there is no reason (tight SLA) which forces that. I just said that it makes no difference in your scenario whether you keep your channels running by setting discint to 0 or to the estimated # that is sufficient for the idle night period. Now, if you have 1000 idle channels that must be running because of a tight SLA, I really don't see what you can do about it. I don't have that, because I don't have such a tight SLA. I would say the same thing about monitoring retrying status. That's more probably a problem for you than for an MQ admin who doesn't have that tight SLA, and who can use for example the default DISCINT.

bruce2359

Posted: Sat Nov 28, 2009 4:19 pm Post subject:

Poobah

Joined: 05 Jan 2008
Posts: 9482
Location: US: west coast, almost. Otherwise, enroute.

jcv wrote:

I really don't see what you can do about it... and who can use for example default DISCINT.

There's a general consensus:
1) networks are prone to failures from various sources
2) WMQ provides some tools for managing your channels (disconnect interval, heartbeats, retry counts and intervals, and other channel attributes, triggering)
3) there are other tools available from IBM (Tivoli)
4) there are other tools available from 3rd-party vendors
5) we keep trying this and that until it gets better, gets worse, or stays the same.
_________________
I like deadlines. I like to wave as they pass by.
ב''ה
Lex Orandi, Lex Credendi, Lex Vivendi. As we Worship, So we Believe, So we Live.
