Solaris I/O timeout variables
All of the Solaris disk I/O layers are shown in this figure.
At the lower layers, just after the HBA, are the fp and fcp layers. FP refers to "fibre port" and FCP refers to "Fibre Channel Protocol". If a path failure or loss of connectivity occurs in the fabric, a Registered State Change Notification (RSCN) is sent, and the fp_offline_ticker (default 90 seconds) and fcp_offline_delay (default 20 seconds) timers start. The total waiting time is therefore 110 seconds by default. This interval gives temporary problems a chance to recover. If no recovery occurs, the LUNs are reported as OFFLINE and all I/O to those LUNs is considered FAILED. These default values can be changed in /kernel/drv/fp.conf and /kernel/drv/fcp.conf, but this must be done with care and only if really necessary.
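For reference, here is how those defaults would look in the driver configuration files (a sketch using standard driver.conf name=value syntax; verify the tunable names against your Solaris release before editing):

# /kernel/drv/fp.conf
fp_offline_ticker=90;

# /kernel/drv/fcp.conf
fcp_offline_delay=20;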
If there is no problem in the fabric, there will be no RSCN messages and no failed/offline issues. After fcp, I/O is handled by the sd or ssd layer (sd = SCSI disk, ssd = FC-AL disk). Once sd or ssd issues an I/O, additional timeout values determine whether the I/O has failed.
Parameters for sd are SD_RETRY_COUNT (default 5) and SD_IO_TIME (default 60 seconds).
Parameters for ssd are SSD_RETRY_COUNT (default 3) and SSD_IO_TIME (default 60 seconds).
Total default timeouts are 360 seconds for sd and 240 seconds for ssd, since retries are counted in addition to the first I/O attempt: (1 + 5) * 60 = 360 and (1 + 3) * 60 = 240.
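If you want to verify the values currently in effect on a running system, you can read the kernel variables with mdb (a sketch, assuming the stock sd/ssd drivers; run as root):

[server1]~#echo "sd_retry_count/D" | mdb -k
[server1]~#echo "sd_io_time/D" | mdb -k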
You can also change these values in /etc/system, with care and only if really necessary.
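For example, entries like the following in /etc/system would lower the sd retry count and I/O time (illustrative values only, not a recommendation; a reboot is required for /etc/system changes to take effect):

set sd:sd_retry_count=3
set sd:sd_io_time=30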
If a fabric error occurs while an I/O timeout period is running, the default 110-second fabric interval starts; if that 110-second interval expires first, the rest of the I/O timeout period is not waited out.
**** The maximum delay before an application is notified of an I/O failure will be:

HBA delay * [sd/ssd]_RETRY_COUNT * [sd/ssd]_IO_TIME * multipath_software_retry_count * number of paths * number of pending I/Os
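For illustration only, assume a hypothetical HBA delay factor of 1, the sd defaults above (retry count 5, I/O time 60 seconds), a multipath software retry count of 1, two paths, and one pending I/O; the maximum delay would then be 1 * 5 * 60 * 1 * 2 * 1 = 600 seconds.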
Note: The timeout values are set by the target drivers (sd/ssd) and passed down to the HBA layer, which implements the timeout mechanism for the I/O subsystem. Newer HBA drivers leave the I/O retry mechanism to the target drivers (sd/ssd), so the overhead due to HBA driver retries is very minimal.
With the implementation of the B_FAILFAST flag, the time taken to fail a submirror has been drastically reduced.
The failfast mechanism is generally activated automatically on Solaris when conditions require it, so you do not need to enable or disable it manually; Solaris does it for you. You can check whether a device driver supports failfast with the command below: if the ddi-failfast-supported attribute exists, the driver has support; if it does not, the driver has no support, and a driver update may be needed if failfast is essential.
[server1]~#prtconf -v /dev/rdsk/c3t3FFD20B60F066856d0s2 | grep -i ddi-failfast
name='ddi-failfast-supported' type=boolean
dev=none
[server1]~#
Solaris failfast details are explained in this article. The iostat command output related to all of this information is also explained here.
*** Please feel free to contact me at bulent.yucesoy@gmail.com