IBM Power Ideas Portal



Status Delivered
Workspace AIX
Created by Guest
Created on Aug 11, 2020

Add a new POLLEXCL flag for the poll() function to solve the thundering herd problem, with flag logic like EPOLLEXCLUSIVE for Linux epoll()

We have developed a multiprocessing middleware named Enduro/X, in which several processes may listen for messages on several IPC queues (System V, for example). The processes use poll() to wait for an event on some queue, and work on the queues may be load balanced across several processes doing poll() on the same set of queues.

The problem: when several identical processes (say 500) are started for load balancing and all of them are listening (POLLIN) on the same set of queues via poll(), then when one message arrives on a queue the processes are waiting on, all 500 processes are woken up. After the wakeup, only a single process receives the message via msgrcv(), while the others get ENOMSG; this is the thundering herd problem. The extra wakeup of the 499 processes is unneeded CPU work done by the kernel.

I therefore propose that IBM AIX implement a new POLLEXCL flag which would indicate that when an event occurs, the kernel wakes up only one process from the group of poll() waiters on the same resource (queue or fd) for which struct pollfd.events or struct pollmsg.events has the POLLEXCL bit set. For processes that wait on the same resources but do not specify POLLEXCL, the wakeups should stay as they are today.
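The wakeup pattern is easy to reproduce. Below is a minimal, portable sketch (an illustration only, not the Enduro/X code): instead of a System V queue it uses a non-blocking pipe, so every worker wakes from poll() for a single byte, but only one read() succeeds and the rest see EAGAIN, the analogue of msgrcv() returning ENOMSG.

--------------------------------------------------------------------
/* Thundering herd demo: NWORKERS processes poll() the same pipe fd.
 * One written byte wakes all of them; only one read() wins. */
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

#define NWORKERS 8

int main(void)
{
    int p[2];
    if (pipe(p) == -1) { perror("pipe"); return 1; }
    fcntl(p[0], F_SETFL, O_NONBLOCK);   /* losers must not block in read() */

    for (int i = 0; i < NWORKERS; i++) {
        if (fork() == 0) {
            struct pollfd pfd = { .fd = p[0], .events = POLLIN };
            char c;
            poll(&pfd, 1, 5000);        /* all workers wake up here...     */
            if (read(p[0], &c, 1) == 1)
                printf("worker %d: got the message\n", i);
            else if (errno == EAGAIN)   /* ...but only one of them wins    */
                printf("worker %d: wasted wakeup\n", i);
            _exit(0);
        }
    }
    sleep(1);               /* let the workers block in poll()  */
    write(p[1], "x", 1);    /* one event -> NWORKERS wakeups    */
    for (int i = 0; i < NWORKERS; i++)
        wait(NULL);
    return 0;
}
--------------------------------------------------------------------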

POLLEXCL could work in the following modes:
1. If no processes are listening (all are busy with other work) and the level of the polled resource has changed, then the first process to enter poll() shall receive the event; as long as the level does not change again, processes entering poll() after it shall not receive the event.

2. If several processes are already listening in poll() for an event on the resource, and the level of the resource changes, then only one process shall be notified.

3. In the mixed case, where some processes specify POLLEXCL and some do not, processes without POLLEXCL shall get the event in both cases: on entering poll() against a resource whose level has changed, and while already listening on the resource. Processes with POLLEXCL shall work as described in 1. and 2., i.e. only one process is notified and the others are not.
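To make the intent concrete, here is a minimal sketch of a worker loop using the proposed flag. This is hypothetical: POLLEXCL does not exist until the change is delivered (per the comments below it shipped with AIX 7.3), and consume_one_message() is a placeholder for the application's msgrcv()/read() logic.

--------------------------------------------------------------------
#include <poll.h>

/* Hypothetical worker loop: of the N workers blocked here on the same
 * resource, the kernel would wake exactly one per event. */
void worker_loop(int queue_fd)
{
    struct pollfd pfd;
    pfd.fd = queue_fd;
    pfd.events = POLLIN | POLLEXCL;   /* proposed: wake only one waiter */

    for (;;) {
        if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLIN))
            consume_one_message(queue_fd);   /* placeholder helper */
    }
}
--------------------------------------------------------------------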

Idea priority Medium
  • Guest | Oct 21, 2021

    Available in the December 2021 GA of AIX 7.3

  • Guest | Oct 2, 2020

    Hello,

    One more aspect should be considered. If poll() is monitoring several FDs that have POLLEXCL set, and events are present on several of those FDs at the same time, then poll() shall return only one event per call. Successive poll() calls shall return events from the busy FDs by selecting a single FD in round-robin fashion.

    This logic is required to make use of the load balancing provided by poll() when several processes/threads are doing the polling, so that jobs are split evenly between them; otherwise one process would grab all the events while the others process none.

    If IBM sees the previously described "POLLEXCL" logic, where several FDs may be returned, as suitable, then please add a new flag such as "POLLONE" to activate only one event per poll() call in round-robin fashion.

    From the Mavimax perspective, for our middleware to work effectively (and this is the whole purpose of this change request), we need that when several exclusive FDs are monitored, only one event is returned per poll() call.

    So two major changes:

    - POLLEXCL FDs are triggered only in one thread's/process's poll() call (if several are polling).
    - POLLONE makes the local process receive only one event from the several FDs monitored, in round-robin fashion, so that the other FDs trigger other threads'/processes' poll() calls when there are concurrent events.
    - From the Mavimax perspective, this "POLLONE" flag logic may be built automatically into the POLLEXCL flag logic.

    Illustrated here:

    --------------------------------------------------------------------
    struct pollfd fds[3];
    int ret;

    fds[0].fd = fd1;
    fds[0].events = POLLIN | POLLEXCL | POLLONE;

    fds[1].fd = fd2;
    fds[1].events = POLLIN | POLLEXCL | POLLONE;

    fds[2].fd = fd3;
    fds[2].events = POLLIN | POLLEXCL | POLLONE;


    /* Assume that fd1, fd2, and fd3 all have pending POLLIN events */

    /* then the 1st call to: */
    poll(fds, 3, TIMEOUT * 1000);
    /* shall return 1, with fds[0].revents set to POLLIN; all other revents are 0 */

    /* then the 2nd call to: */
    poll(fds, 3, TIMEOUT * 1000);
    /* shall return 1, with fds[1].revents set to POLLIN; all other revents are 0 */

    /* then the 3rd call to: */
    poll(fds, 3, TIMEOUT * 1000);
    /* shall return 1, with fds[2].revents set to POLLIN; all other revents are 0 */

    /* then the 4th call to: */
    poll(fds, 3, TIMEOUT * 1000);
    /* shall return 1, with fds[0].revents set to POLLIN; all other revents are 0 */

    --------------------------------------------------------------------

  • Guest | Aug 30, 2020

    Attachment (Description)

  • Guest | Aug 30, 2020

    Hello,

    - This affects poll(), because it supports System V message queue polling; according to the documentation, pollset() does not.
    - However, we did some testing with pollset() over unnamed pipes, where each thread monitors the pipe via its own new pollset, and we get the same thundering herd issue. This scenario with several pollsets (rather than one) is a must-have for our middleware, as we basically run several executables on shared resources, so the pollset object cannot be shared.

    Ideally IBM could:
    1. Add a POLLEXCL flag to poll() so that a single wakeup occurs when an event appears on a shared resource (socket/pipe/message queue).
    2. Add the same flag to pollset().
    3. Add support for message queue monitoring via the pollset() API.

    For us, item 1. is critical; if IBM could also make changes 2. and 3., that would be even better.
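    For reference, here is a minimal sketch of the per-thread pollset pattern under test. This is an illustration assuming the documented AIX pollset interfaces (pollset_create, pollset_ctl, pollset_poll), not the attached pollset2.c; read_fd is assumed to be the shared, non-blocking pipe read end.

    -------------------------------------------------------------------------------------------
    #include <sys/poll.h>
    #include <sys/pollset.h>
    #include <unistd.h>

    /* Each worker thread builds its own pollset over the shared pipe fd;
     * the pollset object itself is not shared between the threads. */
    void *worker(void *arg)
    {
        int read_fd = *(int *)arg;
        pollset_t ps = pollset_create(-1);   /* -1: no explicit fd limit */

        struct poll_ctl ctl;
        ctl.cmd = PS_ADD;
        ctl.events = POLLIN;
        ctl.fd = read_fd;
        pollset_ctl(ps, &ctl, 1);

        for (;;) {
            struct pollfd ev;
            if (pollset_poll(ps, &ev, 1, -1) > 0) {  /* every pollset wakes... */
                char c;
                if (read(read_fd, &c, 1) != 1)
                    ;   /* ...but only one read() wins: a wasted wakeup */
            }
        }
        /* not reached: pollset_destroy(ps); */
    }
    -------------------------------------------------------------------------------------------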

    Here is a test run of pollset() on unnamed pipes (source attached as pollset2.c):

    -------------------------------------------------------------------------------------------

    $ cc pollset2.c -lpthreads

    $ ./a.out 1 1000000
    Wait 5s for threads to start... (num_threads=1 num_msg=1000000)
    server: FD = R: 3 W: 4
    server: START 2020-08-30 16:37:57
    server: STOP 2020-08-30 16:38:03 DELTA sec: 6
    server: Messages 1000000 sent successfully
    server: waiting 1s... M_msg_proc=967252, num_msg=1000000
    server: Messages 1000000
    server: COMPLETED 2020-08-30 16:38:04 WASTED WAEKUPS: 0
    server: done waiting. remove threads...
    server: done waiting. remove pipes...

    -------------------------------------------------------------------------------------------

    With one thread, the test runs for 6 seconds and gets 0 wasted wakeups.

    Then, testing with 500 polling threads:

    -------------------------------------------------------------------------------------------
    $ ./a.out 500 1000000
    Wait 5s for threads to start... (num_threads=500 num_msg=1000000)
    server: FD = R: 3 W: 4
    server: START 2020-08-30 16:38:15
    server: STOP 2020-08-30 16:39:22 DELTA sec: 67
    server: Messages 1000000 sent successfully
    server: Messages 1000000
    server: COMPLETED 2020-08-30 16:39:22 WASTED WAEKUPS: 1178438
    server: done waiting. remove threads...
    server: done waiting. remove pipes...
    -------------------------------------------------------------------------------------------

    We see that the same workload took 67 seconds, with 1178438 wasted (thundering herd) wakeups. So load balancing across multiple processes (or independent threads) actually worsened the run time roughly elevenfold (from 6 s to 67 s, about a 1000% increase).

  • Guest | Aug 28, 2020

    Can you please clarify if this enhancement request is for the poll() or pollset() service? Would one be preferable over the other?

    Note that pollset() is the scalable I/O event notification service for AIX and is similar to Linux epoll().
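    For comparison, this is how the Linux analogue mentioned in the idea title is used; a minimal sketch, assuming Linux 4.5+ where EPOLLEXCLUSIVE is available:

    -------------------------------------------------------------------------------------------
    #include <sys/epoll.h>

    /* Register fd so that, of the many epoll instances waiting on it,
     * not all waiters are woken per event (typically only one). Valid
     * only with EPOLL_CTL_ADD. */
    int add_exclusive(int epfd, int fd)
    {
        struct epoll_event ev = { 0 };
        ev.events = EPOLLIN | EPOLLEXCLUSIVE;
        ev.data.fd = fd;
        return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
    }
    -------------------------------------------------------------------------------------------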

  • Guest | Aug 27, 2020

    The attached example source code and results show the unwanted effect of the thundering herd problem: with 1 thread vs. 500 threads waiting in poll(), the same workload takes 1 minute vs. 11 minutes. With the introduction of the POLLEXCL flag the results would be the same, i.e. such a workload would take 1 minute regardless of the polling thread count.

  • Guest | Aug 12, 2020

    Attachment (Description): Test results on AIX 7.2. Contains two runs for a bulk of 10M messages: in the first test case, with one receiver thread (poll() + msgrcv()), the run completes within 1 minute; in the second test case, with 500 threads doing poll() + msgrcv(), it completes within 11 minutes.

  • Guest | Aug 12, 2020

    Attachment (Description): Test program example: the main program sends a bulk of messages, while a configurable number of threads receive them via poll(); this may be used to test the thundering herd issue with poll().