Solaris I/O timeout variables
All of the Solaris disk I/O layers are shown in this figure.
At the lower layers, just after the HBA, are the fp and fcp layers. FP refers to "fibre port" and FCP refers to "Fibre Channel Protocol". If a path failure or loss of connectivity occurs in the fabric, a Registered State Change Notification (RSCN) is sent, and the fp_offline_ticker (default 90 seconds) and fcp_offline_delay (default 20 seconds) timers start. The total waiting time is therefore 110 seconds by default. This interval gives temporary problems a chance to recover. If no recovery occurs, the LUNs are reported as OFFLINE and all I/O to those LUNs is considered FAILED. These default values can be changed in /kernel/drv/fp.conf and /kernel/drv/fcp.conf, but this must be done with care and only if really necessary.
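For reference, here is how those defaults would look in the driver configuration files (a sketch using standard driver.conf name=value syntax; verify the tunable names against your Solaris release before editing):

# /kernel/drv/fp.conf
fp_offline_ticker=90;

# /kernel/drv/fcp.conf
fcp_offline_delay=20;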
If there is no problem in the fabric, there will be no RSCN messages and no failed/offline issues. After fcp, I/O is handled by the sd or ssd layer (sd = SCSI disk, ssd = FC-AL disk). Once sd or ssd issues an I/O, additional timeout values determine whether the I/O has failed.
Parameters for sd are SD_RETRY_COUNT (default 5) and SD_IO_TIME (default 60 seconds).
Parameters for ssd are SSD_RETRY_COUNT (default 3) and SSD_IO_TIME (default 60 seconds).
Total default timeouts are 360 seconds for sd and 240 seconds for ssd, since retries are counted in addition to the first I/O attempt: (1 + 5) * 60 = 360 and (1 + 3) * 60 = 240.
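If you want to verify the values currently in effect on a running system, you can read the kernel variables with mdb (a sketch, assuming the stock sd/ssd drivers; run as root):

[server1]~#echo "sd_retry_count/D" | mdb -k
[server1]~#echo "sd_io_time/D" | mdb -k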
You can also change these values in /etc/system, with care and only if really necessary.
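For example, entries like the following in /etc/system would lower the sd retry count and I/O time (illustrative values only, not a recommendation; a reboot is required for /etc/system changes to take effect):

set sd:sd_retry_count=3
set sd:sd_io_time=30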
If a fabric error occurs while an I/O timeout period is running, the default 110-second fabric interval starts; if that 110-second interval expires first, the rest of the I/O timeout period is not waited out.
**** The maximum delay before an application is notified of an I/O failure will be:

HBA delay * [sd/ssd]_RETRY_COUNT * [sd/ssd]_IO_TIME * multipath_software_retry_count * number of paths * number of pending I/Os
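For illustration only, assume a hypothetical HBA delay factor of 1, the sd defaults above (retry count 5, I/O time 60 seconds), a multipath software retry count of 1, two paths, and one pending I/O; the maximum delay would then be 1 * 5 * 60 * 1 * 2 * 1 = 600 seconds.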
Note: The timeout values are set by the target drivers (sd/ssd) and passed down to the HBA layer, which implements the timeout mechanism for the I/O subsystem. Newer HBA drivers leave the I/O retry mechanism to the target drivers (sd/ssd), so the overhead due to HBA driver retries is very minimal.
With the implementation of the B_FAILFAST flag, the time taken to fail a submirror has been drastically reduced.
The failfast mechanism is generally activated automatically on Solaris when conditions require it, so you do not need to enable or disable it manually; Solaris does it for you. You can check whether a device driver supports failfast with the command below: if the ddi-failfast-supported attribute exists, the driver has support; if it does not, the driver has no support, and a driver update may be needed if failfast is essential.
[server1]~#prtconf -v /dev/rdsk/c3t3FFD20B60F066856d0s2 | grep -i ddi-failfast
name='ddi-failfast-supported' type=boolean
dev=none
[server1]~#
Solaris failfast details are explained in this article. The iostat command output related to all of this information is also explained here.
*** Please feel free to contact me at bulent.yucesoy@gmail.com