IBM Power Ideas Portal


This portal is to open public enhancement requests against IBM Power Systems products, including IBM i. To view all of your ideas submitted to IBM, create and manage groups of Ideas, or create an idea explicitly set to be either visible by all (public) or visible only to you and IBM (private), use the IBM Unified Ideas Portal (https://ideas.ibm.com).



Status: Not under consideration
Workspace: IBM i
Categories: Core OS
Created by: Guest
Created on: Sep 5, 2017

Improve the cancel handler to avoid hangs when ending jobs that behave badly.

We've had a lot of hang situations when running ENDSBS *ALL over the year, and they always result in us having to restart the LPAR using the HMC.

The cause has in almost all cases been found to be an application program not responding properly to the cancel handler, which induces a sort of Catch-22 situation. The LPAR is completely unresponsive. Not even the console works.

I believe that PMR 43079,110,846 provides a good technical description of the problem.

Also see the following PMRs:
PMR 25960,110,846
PMR 43077,110,846
PMR 43079,110,846
PMR 43590,110,846
PMR 43831,110,846
PMR 44505,110,846
PMR 44506,110,846
PMR 44659,110,846
PMR 44666,110,846
PMR 44667,110,846
PMR 26751,110,846
PMR 26767,110,846
PMR 26934,110,846
PMR 26986,110,846
PMR 27554,110,840
(another one is being created as I write this)

The reply to all these PMRs has simply been that it works as designed and that we need to modify the application to avoid this behavior.

The developers are, however, having difficulty finding the problem because it is intermittent.

My request is simply that you improve the cancel handler so that it can end badly behaving jobs and an uncontrolled restart is no longer needed.


Use Case:

Avoid having to deliberately crash an LPAR to get it out of the hang condition described above. This will avoid potential object damage and calls to the person on call.


Idea priority: Medium
  • Guest
    Mar 28, 2022
    Cancel handlers can be part of the application or part of the operating system. Basically, any program can provide a cancel handler that runs when the program is interrupted and its call stack entries are terminated by something other than a normal return. The cancel handler needs to be very careful about what it does: it needs to clean up quickly (e.g., unlock locks, undo incomplete operations, clean up temporary objects, and so on). But cancel handlers should never wait forever (e.g., for a message, a socket, or an event), since the sending thread may already have ended. There is no single "cancel handler" processing point that can be "fixed" to avoid hangs. Generally, you need to capture and analyze the call stacks of the hung jobs to determine where they were hung and what they were waiting on, and then make the necessary code change. If the cancel handler is an operating system program, a PTF may be required. If the cancel handler is an application program, the application needs to be fixed.
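
Editor's illustration: the advice above (release resources quickly, never wait forever) can be sketched in a language-neutral way. The Python below is only an analogy and does not use any IBM i API; `run_with_cleanup`, `failing_work`, and the timeout value are hypothetical names and numbers chosen for the example. The cleanup path releases its lock first and then waits for a peer acknowledgement with a time limit, so it still returns when the peer has already ended.

```python
import queue
import threading


def run_with_cleanup(work, ack_queue, lock, timeout_s=0.2):
    """Run work() while holding lock; clean up quickly on any exit path."""
    lock.acquire()
    try:
        work()
        return "completed"
    except Exception:
        # Abnormal exit: continue to the cleanup steps below.
        pass
    finally:
        # Release the lock first so no other thread is blocked by us.
        lock.release()

    # Reached only after an abnormal exit. The peer that would acknowledge
    # may already be gone, so the wait is bounded instead of indefinite.
    try:
        ack_queue.get(timeout=timeout_s)
        return "cancelled, peer acknowledged"
    except queue.Empty:
        return "cancelled, peer gone"


def failing_work():
    # Stands in for work that is interrupted before a normal return.
    raise RuntimeError("simulated cancellation")


lock = threading.Lock()
result = run_with_cleanup(failing_work, queue.Queue(), lock)
print(result)
```

The design point is the `timeout` argument: an unbounded `get()` on an empty queue would block forever, which is the kind of wait the comment warns against.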

    The only thing the operating system can do is provide the "big hammer" to end the hung job, which is what the ENDJOBABN command is for. In 7.2 and later releases, we added the Abnormal End Delay Time (ABNENDDLY) parameter to the ENDSBS command. It automatically runs the ENDJOBABN command for you when running in batch restricted state and restricted state has not been reached within the Batch Time Limit (BCHTIMLMT).
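
Editor's illustration: the escalation described above (request a controlled end, wait for a time limit, then end abnormally) follows a common pattern. The Python sketch below shows it with an ordinary operating system process on a POSIX system. It is an analogy only and does not model ENDSBS, ENDJOBABN, ABNENDDLY, or BCHTIMLMT themselves; `end_with_escalation`, `CHILD`, and the half-second grace period are hypothetical.

```python
import subprocess
import sys

# Child process that ignores the polite termination request (POSIX SIGTERM),
# standing in for a job that does not respond to a controlled end.
CHILD = (
    "import signal, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)


def end_with_escalation(proc, grace_s):
    """Request a controlled end; force the end if the time limit passes."""
    proc.terminate()  # controlled end request
    try:
        proc.wait(timeout=grace_s)
        return "ended in a controlled way"
    except subprocess.TimeoutExpired:
        proc.kill()  # the "big hammer"
        proc.wait()
        return "ended abnormally after the time limit"


proc = subprocess.Popen(
    [sys.executable, "-c", CHILD], stdout=subprocess.PIPE, text=True
)
proc.stdout.readline()  # wait until the child has installed its handler
outcome = end_with_escalation(proc, grace_s=0.5)
proc.stdout.close()
print(outcome)
```

As with the real commands, the forced end skips the child's own cleanup, so it is a last resort rather than a replacement for fixing the program that does not respond.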
  • Guest
    Sep 6, 2017

    Will have to agree with IBM on this one, as what you're experiencing is called a wait state. You are the one who will have to modify your programs to respond to the termination signals.

    Good luck, IBM design from inception!