On our server, paths are being marked as failed when one process already has an adapter open and another process tries to open the same adapter. Please refer to PMR 24147,000,834. In this case, paths were marked as failed because lspv could not query through an fscsi adapter that was already held by the hbaapp process from VCS.
Please advise whether an enhancement can be made so that AIX drops or delays one of the requests rather than failing the path. There is nothing wrong with the path itself, so failing it because two processes tried to access it at the same time does not seem logical. Specifically, we ask whether the OS can either drop or delay one of the requests when multiple processes query or open the same path or adapter when disks are in a failed state.
In addition, paths do not recover on disks that are in the closed state. Can a parameter be added to hcheck_mode to scan disks whose paths are in the closed state and recover any paths that have failed? (Or is there any other way to perform a health check on disks that are in the closed state?)
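The "delay rather than fail" behaviour requested above can be illustrated with a minimal user-space sketch. This is not AIX driver code and does not reflect how the disk driver is implemented; the names open_with_retry and FakeAdapter, the retry count, and the delay are all hypothetical and chosen only for illustration. The idea is simply that an EBUSY result is treated as transient and retried a few times before the caller gives up.

```python
import errno
import time


def open_with_retry(open_fn, retries=3, delay=0.05):
    """Call open_fn(); on EBUSY wait and retry, otherwise re-raise.

    Hypothetical helper: 'retries' and 'delay' are illustrative values.
    """
    for attempt in range(retries + 1):
        try:
            return open_fn()
        except OSError as exc:
            # Only EBUSY is treated as transient; give up after the last attempt.
            if exc.errno != errno.EBUSY or attempt == retries:
                raise
            time.sleep(delay)


class FakeAdapter:
    """Simulated adapter that reports EBUSY for the first 'busy_for' opens."""

    def __init__(self, busy_for):
        self.busy_for = busy_for
        self.calls = 0

    def open(self):
        self.calls += 1
        if self.calls <= self.busy_for:
            raise OSError(errno.EBUSY, "adapter busy")
        return "opened"


if __name__ == "__main__":
    adapter = FakeAdapter(busy_for=2)
    print(open_with_retry(adapter.open), "after", adapter.calls, "attempts")
```

With a short delay, a brief conflict such as the one described in the PMR would be absorbed by the retry instead of being reported as a failure.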
PMR Summary:
From the trace we know that a lspv command on hdisk14 ran into an error:
3C4 lspv 255 33030335 424.900427
FCPS entry_ioctl: errno: 00 devno: 8000001200000003
op: 0000000000000383 flag: 0000000040000002 chan: 0000000000000000
ext: 0000000000000000
...
3C4 lspv 255 33030335 424.900855
FCPS exit_ioctl: errno: 16 devno: 8000001200000003
221 lspv 255 33030335 424.900856
SCDISKDD mpioPathOpenFail: dev_handle=F1000A0230272200 path_id=0007
flag=0002 identifier=50060E8007298140
499 lspv 255 33030335 424.900856
SCSDISKDD_ERR mpioPathOpenFail:
Trace 00 F1000A0230272200 0000000000000002 50060E8007298140
0000000000000007 00000000061D8160
221 lspv 255 33030335 424.900867
SCDISKDD mpioPathOpenFail: dev_handle=F1000A0230272200 path_id=0007
flag=0002 identifier=50060E8007298140
499 lspv 255 33030335 424.900868
SCSDISKDD_ERR scsidisk_open_adapter:
Exit 04 0000000000000016 F1000A0238028000 0000000000000007
50060E8007298140 0000000000000000
221 lspv 255 33030335 424.900869
SCDISKDD mpioSetPathState: ddi=F1000A0230272200
path_id=0007 state=0004
...
1D1 lspv 255 33030335 424.900875
ERRLG errput: error id=0000000002A8BC99
So we know that an ioctl returned with errno 16 (EBUSY).
At the same time, another ioctl was running against the same adapter,
issued by hbaapp:
3C4 hbaapp 255 98959583 424.900299
FCPS entry_ioctl: errno: 00 devno: 8000001200000003
op: 0000000000000381 flag: 0000000000000001 chan: 0000000000000000
ext: 0000000000000000
...
3C4 hbaapp 255 98959583 424.900898
FCPS exit_ioctl: errno: 00 devno: 8000001200000003
So there was a conflict between the ioctl issued by hbaapp (to get
adapter statistics on a regular basis) and the ioctl that was forced
by lspv (as the disk was not open).
Going back up the process tree, we saw that the lspv was ultimately
issued by the ITM kuxagent process.
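The timing of the conflict can be checked directly from the trace timestamps. The short sketch below copies the FCPS entry_ioctl and exit_ioctl times for hbaapp and lspv from the excerpts above and tests whether the two intervals overlap; the helper names overlaps and contained are hypothetical and used only for this illustration.

```python
# FCPS entry_ioctl / exit_ioctl timestamps (seconds) copied from the trace.
events = {
    "hbaapp": (424.900299, 424.900898),
    "lspv": (424.900427, 424.900855),
}


def overlaps(a, b):
    """True if the two (start, end) intervals share any time."""
    return a[0] < b[1] and b[0] < a[1]


def contained(inner, outer):
    """True if 'inner' lies entirely within 'outer'."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


if __name__ == "__main__":
    print("overlap:", overlaps(events["lspv"], events["hbaapp"]))
    print("lspv inside hbaapp window:", contained(events["lspv"], events["hbaapp"]))
```

By these timestamps, the lspv ioctl both started and finished while the hbaapp ioctl was still in progress on the same devno, which is consistent with the EBUSY result.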
The closed path health check was added in AIX 7.2 TL4. See the "dk_closed_path_recovery" option of the ioo command for details. The enhancement to the open function has NOT been made; it can be resubmitted as a separate RFE if the capability is still needed.