IBM Power Ideas Portal



Status Delivered
Workspace AIX
Created by Guest
Created on Aug 11, 2020

Add a new POLLEXCL flag for the poll() function to solve the thundering herd problem, with flag logic like EPOLLEXCLUSIVE for Linux epoll()

We have developed a multiprocessing middleware named Enduro/X, in which several processes may listen for messages on several IPC queues (System V, for example). The processes use poll() to wait for an event on some queue, and work on the queues may be load balanced across several processes doing poll() on the same set of queues.

The problem: when several identical processes (say 500) are started for load balancing and all of them are listening (POLLIN) on the same set of queues via poll(), then when one message arrives on a queue the processes are waiting on, all 500 processes are woken up. After the wakeup, only a single process receives the message via msgrcv(), while the others get ENOMSG; this is the thundering herd problem. The extra wakeup of the 499 processes is unneeded CPU work done by the kernel.

I therefore propose that IBM AIX implement a new POLLEXCL flag which would indicate that when an event occurs, the kernel wakes up only one process from the group of poll() waiters on the same resource (queue or fd) for which struct pollfd.events or struct pollmsg.events has the POLLEXCL bit set. For processes that wait on the same resources but do not specify POLLEXCL, the wakeups should stay as they are today.
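The wakeup pattern is easy to reproduce. Below is a minimal, portable sketch (an illustration only, not the Enduro/X code): instead of a System V queue it uses a non-blocking pipe, so every worker wakes from poll() for a single byte, but only one read() succeeds and the rest see EAGAIN, the analogue of msgrcv() returning ENOMSG.

--------------------------------------------------------------------
/* Thundering herd demo: NWORKERS processes poll() the same pipe fd.
 * One written byte wakes all of them; only one read() wins. */
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

#define NWORKERS 8

int main(void)
{
    int p[2];
    if (pipe(p) == -1) { perror("pipe"); return 1; }
    fcntl(p[0], F_SETFL, O_NONBLOCK);   /* losers must not block in read() */

    for (int i = 0; i < NWORKERS; i++) {
        if (fork() == 0) {
            struct pollfd pfd = { .fd = p[0], .events = POLLIN };
            char c;
            poll(&pfd, 1, 5000);        /* all workers wake up here...     */
            if (read(p[0], &c, 1) == 1)
                printf("worker %d: got the message\n", i);
            else if (errno == EAGAIN)   /* ...but only one of them wins    */
                printf("worker %d: wasted wakeup\n", i);
            _exit(0);
        }
    }
    sleep(1);               /* let the workers block in poll()  */
    write(p[1], "x", 1);    /* one event -> NWORKERS wakeups    */
    for (int i = 0; i < NWORKERS; i++)
        wait(NULL);
    return 0;
}
--------------------------------------------------------------------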

POLLEXCL could work in the following modes:
1. If no processes are listening (all are busy with other work) and the level of the polled resource has changed, then the first process to enter poll() shall receive the event; as long as the level does not change again, processes entering poll() after it shall not receive the event.

2. If several processes are already listening in poll() for an event on the resource, and the level of the resource changes, then only one process shall be notified.

3. In the mixed case, where some processes specify POLLEXCL and some do not, processes without POLLEXCL shall get the event in both cases: on entering poll() against a resource whose level has changed, and while already listening on the resource. Processes with POLLEXCL shall work as described in 1. and 2., i.e. only one process is notified and the others are not.
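To make the intent concrete, here is a minimal sketch of a worker loop using the proposed flag. This is hypothetical: POLLEXCL does not exist until the change is delivered (per the comments below it shipped with AIX 7.3), and consume_one_message() is a placeholder for the application's msgrcv()/read() logic.

--------------------------------------------------------------------
#include <poll.h>

/* Hypothetical worker loop: of the N workers blocked here on the same
 * resource, the kernel would wake exactly one per event. */
void worker_loop(int queue_fd)
{
    struct pollfd pfd;
    pfd.fd = queue_fd;
    pfd.events = POLLIN | POLLEXCL;   /* proposed: wake only one waiter */

    for (;;) {
        if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLIN))
            consume_one_message(queue_fd);   /* placeholder helper */
    }
}
--------------------------------------------------------------------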

Idea priority Medium
  • Guest | Oct 21, 2021

    Available in the December 2021 GA of AIX 7.3

  • Guest | Oct 2, 2020

    Hello,

    One more aspect should be considered. If poll() is monitoring several FDs that have POLLEXCL set, and events are present on several of those FDs at the same time, then poll() shall return only one event per call. Successive poll() calls shall return events from the busy FDs by selecting a single FD in round-robin fashion.

    This logic is required to make use of the load balancing provided by poll() when several processes/threads are doing the polling, so that jobs are split evenly between them; otherwise one process would grab all the events while the others process none.

    If IBM sees the previously described "POLLEXCL" logic, where several FDs may be returned, as suitable, then please add a new flag such as "POLLONE" to activate only one event per poll() call in round-robin fashion.

    From the Mavimax perspective, for our middleware to work effectively (and this is the whole purpose of this change request), we need that when several exclusive FDs are monitored, only one event is returned per poll() call.

    So two major changes:

    - POLLEXCL FDs are triggered only in one thread's/process's poll() call (if several are polling).
    - POLLONE makes the local process receive only one event from the several FDs monitored, in round-robin fashion, so that the other FDs trigger other threads'/processes' poll() calls when there are concurrent events.
    - From the Mavimax perspective, this "POLLONE" flag logic may be built automatically into the POLLEXCL flag logic.

    Illustrated here:

    --------------------------------------------------------------------
    struct pollfd fds[3];
    int ret;

    fds[0].fd = fd1;
    fds[0].events = POLLIN | POLLEXCL | POLLONE;

    fds[1].fd = fd2;
    fds[1].events = POLLIN | POLLEXCL | POLLONE;

    fds[2].fd = fd3;
    fds[2].events = POLLIN | POLLEXCL | POLLONE;


    /* Assume that fd1, fd2, and fd3 all have pending POLLIN events */

    /* then the 1st call to: */
    poll(fds, 3, TIMEOUT * 1000);
    /* shall return 1, with fds[0].revents set to POLLIN; all other revents are 0 */

    /* then the 2nd call to: */
    poll(fds, 3, TIMEOUT * 1000);
    /* shall return 1, with fds[1].revents set to POLLIN; all other revents are 0 */

    /* then the 3rd call to: */
    poll(fds, 3, TIMEOUT * 1000);
    /* shall return 1, with fds[2].revents set to POLLIN; all other revents are 0 */

    /* then the 4th call to: */
    poll(fds, 3, TIMEOUT * 1000);
    /* shall return 1, with fds[0].revents set to POLLIN; all other revents are 0 */

    --------------------------------------------------------------------

  • Guest | Aug 30, 2020

    Attachment (Description)

  • Guest | Aug 30, 2020

    Hello,

    - This affects poll(), because it supports System V message queue polling; according to the documentation, pollset() does not.
    - However, we did some testing with pollset() over unnamed pipes, where each thread monitors the pipe via its own new pollset, and we get the same thundering herd issue. This scenario with several pollsets (rather than one) is a must-have for our middleware, as we basically run several executables on shared resources, so the pollset object cannot be shared.

    Ideally IBM could:
    1. Add a POLLEXCL flag to poll() so that a single wakeup occurs when an event appears on a shared resource (socket/pipe/message queue).
    2. Add the same flag to pollset().
    3. Add support for message queue monitoring via the pollset() API.

    For us, item 1. is critical; if IBM could also make changes 2. and 3., that would be even better.
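    For reference, here is a minimal sketch of the per-thread pollset pattern under test. This is an illustration assuming the documented AIX pollset interfaces (pollset_create, pollset_ctl, pollset_poll), not the attached pollset2.c; read_fd is assumed to be the shared, non-blocking pipe read end.

    -------------------------------------------------------------------------------------------
    #include <sys/poll.h>
    #include <sys/pollset.h>
    #include <unistd.h>

    /* Each worker thread builds its own pollset over the shared pipe fd;
     * the pollset object itself is not shared between the threads. */
    void *worker(void *arg)
    {
        int read_fd = *(int *)arg;
        pollset_t ps = pollset_create(-1);   /* -1: no explicit fd limit */

        struct poll_ctl ctl;
        ctl.cmd = PS_ADD;
        ctl.events = POLLIN;
        ctl.fd = read_fd;
        pollset_ctl(ps, &ctl, 1);

        for (;;) {
            struct pollfd ev;
            if (pollset_poll(ps, &ev, 1, -1) > 0) {  /* every pollset wakes... */
                char c;
                if (read(read_fd, &c, 1) != 1)
                    ;   /* ...but only one read() wins: a wasted wakeup */
            }
        }
        /* not reached: pollset_destroy(ps); */
    }
    -------------------------------------------------------------------------------------------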

    Here is a test run of pollset() on unnamed pipes (source attached as pollset2.c):

    -------------------------------------------------------------------------------------------

    $ cc pollset2.c -lpthreads

    $ ./a.out 1 1000000
    Wait 5s for threads to start... (num_threads=1 num_msg=1000000)
    server: FD = R: 3 W: 4
    server: START 2020-08-30 16:37:57
    server: STOP 2020-08-30 16:38:03 DELTA sec: 6
    server: Messages 1000000 sent successfully
    server: waiting 1s... M_msg_proc=967252, num_msg=1000000
    server: Messages 1000000
    server: COMPLETED 2020-08-30 16:38:04 WASTED WAEKUPS: 0
    server: done waiting. remove threads...
    server: done waiting. remove pipes...

    -------------------------------------------------------------------------------------------

    With one thread, the test runs for 6 seconds and gets 0 wasted wakeups.

    Then, testing with 500 polling threads:

    -------------------------------------------------------------------------------------------
    $ ./a.out 500 1000000
    Wait 5s for threads to start... (num_threads=500 num_msg=1000000)
    server: FD = R: 3 W: 4
    server: START 2020-08-30 16:38:15
    server: STOP 2020-08-30 16:39:22 DELTA sec: 67
    server: Messages 1000000 sent successfully
    server: Messages 1000000
    server: COMPLETED 2020-08-30 16:39:22 WASTED WAEKUPS: 1178438
    server: done waiting. remove threads...
    server: done waiting. remove pipes...
    -------------------------------------------------------------------------------------------

    We see that the same workload took 67 seconds, with 1178438 wasted (thundering herd) wakeups. So load balancing across multiple processes (or independent threads) actually worsened the run time roughly elevenfold (from 6 s to 67 s, about a 1000% increase).

  • Guest | Aug 28, 2020

    Can you please clarify if this enhancement request is for the poll() or pollset() service? Would one be preferable over the other?

    Note that pollset() is the scalable I/O event notification service for AIX and is similar to Linux epoll().
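    For comparison, this is how the Linux analogue mentioned in the idea title is used; a minimal sketch, assuming Linux 4.5+ where EPOLLEXCLUSIVE is available:

    -------------------------------------------------------------------------------------------
    #include <sys/epoll.h>

    /* Register fd so that, of the many epoll instances waiting on it,
     * not all waiters are woken per event (typically only one). Valid
     * only with EPOLL_CTL_ADD. */
    int add_exclusive(int epfd, int fd)
    {
        struct epoll_event ev = { 0 };
        ev.events = EPOLLIN | EPOLLEXCLUSIVE;
        ev.data.fd = fd;
        return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
    }
    -------------------------------------------------------------------------------------------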

  • Guest | Aug 27, 2020

    The attached example source code and results show the unwanted effect of the thundering herd problem: with 1 thread vs. 500 threads waiting in poll(), the same workload takes 1 minute vs. 11 minutes. With the introduction of the POLLEXCL flag the results would be the same, i.e. such a workload would take 1 minute regardless of the polling thread count.

  • Guest | Aug 12, 2020

    Attachment (Description): Test results on AIX 7.2. Contains two runs for a bulk of 10M messages: in the first test case, with one receiver thread (poll() + msgrcv()), the run completes within 1 minute; in the second test case, with 500 threads doing poll() + msgrcv(), it completes within 11 minutes.

  • Guest | Aug 12, 2020

    Attachment (Description): Test program example: the main program sends a bulk of messages, while a configurable number of threads receive them via poll(); this may be used to test the thundering herd issue with poll().