Failfast for Solaris IO
By Lewis Thompson on May 10, 2010
Introduction
Recently I have been trying to better understand Solaris IO, specifically what goes on once a process enters the biowait(\*buf) function. As part of this investigation I needed to learn about failfast for Solaris IO, which I discuss in this blog post. Failfast was first presented as PSARC/2002/126: Buf Flag for Faster Failover. If you do not already have a general knowledge of Solaris IO internals, I strongly recommend reading General Flow of Control from the Writing Device Drivers book, available for free on docs.sun.com.
At a later date I would like to expand this article with more discussion of the interaction between ZFS & SVM and the failfast flag, along with a more general Solaris IO entry.
Functionality
Data buffers (buffer, buf or bp from now on) passed into the (s)sd driver ("the driver") are encapsulated within a scsi_pkt structure (packet or pkt) in sd_initpkt_for_buf(\*buf, \*\*scsi_pkt) before they are passed to the transport (glm, qlc, etc.) via scsi_transport(\*scsi_pkt). sd_initpkt_for_buf() sets the scsi_pkt's command completion routine, pkt_comp, to sdintr(\*scsi_pkt). When the transport has finished processing a packet (e.g. due to a completion, timeout or error) it sets pkt_reason as required and then calls pkt_comp to pass the packet back to the driver.
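The hand-off described above is essentially a callback registered on the packet. A minimal user-space sketch follows; every type and function here is a simplified mock written for illustration (only the field names pkt_comp and pkt_reason come from the text), not the kernel's actual definitions.

```c
#include <stddef.h>

/* Mock completion reasons; the real pkt_reason codes live in the kernel. */
enum mock_reason { MOCK_CMD_CMPLT, MOCK_CMD_TIMEOUT, MOCK_CMD_INCOMPLETE };

struct mock_buf { int b_error; };

struct mock_pkt {
    void (*pkt_comp)(struct mock_pkt *); /* completion routine */
    enum mock_reason pkt_reason;          /* set by the transport */
    struct mock_buf *pkt_buf;             /* hypothetical link to the wrapped buf */
};

static int mock_sdintr_calls;             /* how often the driver was called back */
static enum mock_reason mock_last_reason; /* reason seen on the last callback */

/* Plays the role of sdintr(): the driver receives its packet back here. */
static void mock_sdintr(struct mock_pkt *pkt)
{
    mock_sdintr_calls++;
    mock_last_reason = pkt->pkt_reason;
}

/* Plays the role of sd_initpkt_for_buf(): wrap the buf, register the callback. */
static void mock_initpkt_for_buf(struct mock_buf *bp, struct mock_pkt *pkt)
{
    pkt->pkt_buf = bp;
    pkt->pkt_comp = mock_sdintr;
    pkt->pkt_reason = MOCK_CMD_CMPLT;
}

/* Plays the role of the transport finishing a packet for some reason. */
static void mock_transport_finish(struct mock_pkt *pkt, enum mock_reason why)
{
    pkt->pkt_reason = why;  /* record why processing ended */
    pkt->pkt_comp(pkt);     /* pass the packet back to the driver */
}
```

The transport never needs to know which driver owns the packet; it only sets the reason and invokes whatever completion routine was registered.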
Commands time out if no response is received from the target within pkt_time, which sd_initpkt_for_buf() sets to SD_IO_TIME (60s). On a timeout the driver will retry the packet up to SD_RETRY_COUNT times (3 for fibre channel, otherwise 5). This means that without failfast it can take up to 5 minutes for, e.g., a read to return an error in the case of a non-responsive disk.
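The 5 minute figure is simple arithmetic on the values quoted above. The sketch below counts one timeout period per allowed retry, which reproduces that figure; whether the initial attempt adds a further period is not modelled here. The EX_ names are illustrative only and are not the driver's definitions.

```c
/* Values quoted in the text, under illustrative names. */
#define EX_IO_TIME_SECS  60
#define EX_RETRIES_FC    3
#define EX_RETRIES_OTHER 5

/* Worst-case seconds spent waiting if every counted attempt times out. */
static int ex_worst_case_secs(int io_time_secs, int retry_count)
{
    return io_time_secs * retry_count;
}
```

With these inputs the non-fibre-channel case gives 300 seconds and the fibre channel case gives 180 seconds.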
Failfast is a mechanism within the driver that fails a pending buf more quickly and informs the upper layer Volume Manager (VM; e.g. ZFS, SVM, VxVM). Co-operation is required from the VM, which must set B_FAILFAST in the buf b_flags mask to enable the behaviour (the VM can check the driver's ddi-failfast-supported property to know whether B_FAILFAST can be used).
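From the VM's side, opting in amounts to setting one bit in b_flags once support has been confirmed. A small sketch, assuming a hypothetical helper and placeholder bit values (the real flag values and the property lookup are not reproduced here):

```c
#include <stdbool.h>

#define EX_B_READ     0x01u  /* placeholder bit values, not the real ones */
#define EX_B_FAILFAST 0x02u

struct ex_buf { unsigned int b_flags; };

/* Hypothetical VM helper: request failfast only when the device below
 * advertises support (e.g. via the ddi-failfast-supported property). */
static void ex_vm_prepare_read(struct ex_buf *bp, bool failfast_supported)
{
    bp->b_flags = EX_B_READ;
    if (failfast_supported)
        bp->b_flags |= EX_B_FAILFAST;
}
```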
Most VMs tend to round-robin read IO when multiple copies of the data exist. In the case of a mirror where one disk has gone away, we ultimately expect all of our read IOs to be serviced by the working disk. For this to happen the driver must return a failure code (EIO) to the VM so that it can retry with the working disk. When B_FAILFAST is set we can return EIO faster, thereby reducing the overall average IO time. The B_FAILFAST flag was initially proposed as B_ALTDATASRC, as this accurately describes the condition that needs to be true for us to want failfast behaviour: an alternative source for the data exists.
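To make the mirror case concrete, here is a toy two-sided mirror that round-robins reads and falls back to the other side on EIO. It is a sketch of the general idea only; the structure and function names are invented and do not correspond to any real volume manager.

```c
#include <errno.h>
#include <stdbool.h>

#define EX_NSIDES 2

struct ex_mirror {
    bool side_ok[EX_NSIDES]; /* false models a disk that has gone away */
    int next;                /* round-robin cursor */
};

/* Stand-in for issuing a read to one side: 0 on success, EIO on failure. */
static int ex_side_read(const struct ex_mirror *m, int side)
{
    return m->side_ok[side] ? 0 : EIO;
}

/* Round-robin read; on EIO, retry on the remaining side(s).
 * On success, stores the side that served the read in *served. */
static int ex_mirror_read(struct ex_mirror *m, int *served)
{
    int first = m->next;
    m->next = (m->next + 1) % EX_NSIDES;
    for (int i = 0; i < EX_NSIDES; i++) {
        int side = (first + i) % EX_NSIDES;
        if (ex_side_read(m, side) == 0) {
            *served = side;
            return 0;
        }
    }
    return EIO; /* no working copy left */
}
```

In this toy the failed side answers instantly. On real hardware the cost of each read that lands on the dead side is however long the driver takes to return EIO, which is exactly the delay failfast aims to shorten.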
Implementation
Within sd each physical target LUN is represented as an sd instance (sd_lun or un), each of which tracks its internal failfast state in un_failfast_state and un_failfast_bp. The instance may be in one of three states: SD_FAILFAST_INACTIVE, failfast pending (an inferred state where un_failfast_bp != NULL) and SD_FAILFAST_ACTIVE.
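Because the pending state is inferred rather than stored, it can help to see how the three states fall out of the two fields. A sketch using mock types (only the field names un_failfast_state and un_failfast_bp are taken from the text; the enums and struct layout are invented for illustration):

```c
#include <stddef.h>

enum ex_ff_state { EX_FAILFAST_INACTIVE, EX_FAILFAST_ACTIVE };
enum ex_ff_phase { EX_PHASE_INACTIVE, EX_PHASE_PENDING, EX_PHASE_ACTIVE };

struct ex_buf { int b_flags; };

struct ex_lun {
    enum ex_ff_state un_failfast_state; /* stored state: inactive or active */
    struct ex_buf *un_failfast_bp;      /* non-NULL while failfast is pending */
};

/* Derive the three logical states from the two stored fields. */
static enum ex_ff_phase ex_failfast_phase(const struct ex_lun *un)
{
    if (un->un_failfast_state == EX_FAILFAST_ACTIVE)
        return EX_PHASE_ACTIVE;
    if (un->un_failfast_bp != NULL)
        return EX_PHASE_PENDING;
    return EX_PHASE_INACTIVE;
}
```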
When any packet (i.e. regardless of B_FAILFAST) returned to sd via pkt_comp qualifies for a retry due to a timeout condition specified in pkt_reason (these are CMD_TIMEOUT, and CMD_INCOMPLETE where the incomplete reason is a selection timeout), we call into sd_retry_command(\*sd_lun, \*buf, int retry_check_flag, ...) with the buf and SD_RETRIES_FAILFAST set in retry_check_flag. sd_retry_command() and sd_return_command(\*sd_lun, \*buf) change the instance failfast state. Every instance begins in SD_FAILFAST_INACTIVE.
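The qualifying conditions can be written as a small predicate. This is a sketch with mock reason codes; in particular, selection_timeout is a hypothetical boolean standing in for however the driver actually detects a selection timeout.

```c
#include <stdbool.h>

/* Mock stand-ins for pkt_reason codes; values are arbitrary. */
enum ex_reason { EX_CMD_CMPLT, EX_CMD_INCOMPLETE, EX_CMD_TIMEOUT, EX_CMD_OTHER };

/* True when a retry should carry the failfast retry flag:
 * a command timeout, or an incomplete command caused by a selection timeout. */
static bool ex_is_failfast_retry(enum ex_reason reason, bool selection_timeout)
{
    if (reason == EX_CMD_TIMEOUT)
        return true;
    return reason == EX_CMD_INCOMPLETE && selection_timeout;
}
```

Note that the predicate never looks at B_FAILFAST: the flag on the buf matters only later, when deciding whether that buf is failed or retried.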
Transition to failfast pending: The first buf to enter sd_retry_command() with SD_RETRIES_FAILFAST set will take the sd instance into the failfast pending state by registering itself as the un_failfast_bp. The buf is then retried normally. Subsequent SD_RETRIES_FAILFAST bufs will be retried without changing any failfast state.
Transition to SD_FAILFAST_ACTIVE: When the un_failfast_bp buf returns to sd_retry_command() it transitions the instance to SD_FAILFAST_ACTIVE by setting un_failfast_state and clearing un_failfast_bp. sd_failfast_flushq(\*sd_lun) is then called, which arranges for all B_FAILFAST bufs on the wait queue to be returned to the caller with a suitable error set (this is done via a thread). The un_failfast_bp buf itself is also returned with an error set if it has B_FAILFAST set; otherwise it is retried.
Transition to SD_FAILFAST_INACTIVE: Any buf that either completes successfully (via sd_return_command()) or requires a retry for any reason other than those that take us into failfast pending will transition us into SD_FAILFAST_INACTIVE by updating un_failfast_state and clearing un_failfast_bp. It should now be clear from the above that only B_FAILFAST bufs are affected by the failfast state, which means any subsequent buf without B_FAILFAST (or indeed any buf currently in the transport) can allow the transition back to SD_FAILFAST_INACTIVE.
Any buf passed into an SD_FAILFAST_ACTIVE sd instance with B_FAILFAST set is immediately failed in sd_core_iostart(int index, \*sd_lun, \*buf).
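Putting the transitions together, the bookkeeping can be modelled as a small user-space state machine. This is a sketch of the behaviour as described in this post, not the driver's code: all ex_ names and bit values are invented, the queue flush is reduced to a comment, and the handling of timed-out bufs other than the registered one while already active is an assumption.

```c
#include <stdbool.h>
#include <stddef.h>

#define EX_B_FAILFAST 0x1 /* placeholder bit value */

enum ex_ff_state { EX_FAILFAST_INACTIVE, EX_FAILFAST_ACTIVE };
enum ex_action { EX_RETRY, EX_FAIL };

struct ex_buf { int b_flags; };

struct ex_lun {
    enum ex_ff_state un_failfast_state;
    struct ex_buf *un_failfast_bp;
};

/* Models the failfast bookkeeping described for sd_retry_command(). */
static enum ex_action ex_retry_command(struct ex_lun *un, struct ex_buf *bp,
                                       bool retries_failfast)
{
    if (!retries_failfast) {
        /* A retry for any other reason drops back to inactive. */
        un->un_failfast_state = EX_FAILFAST_INACTIVE;
        un->un_failfast_bp = NULL;
        return EX_RETRY;
    }
    if (un->un_failfast_state == EX_FAILFAST_INACTIVE) {
        if (un->un_failfast_bp == NULL) {
            un->un_failfast_bp = bp;  /* -> failfast pending */
            return EX_RETRY;
        }
        if (un->un_failfast_bp != bp)
            return EX_RETRY;          /* pending on another buf: no change */
        un->un_failfast_state = EX_FAILFAST_ACTIVE; /* -> active */
        un->un_failfast_bp = NULL;
        /* The real driver flushes queued B_FAILFAST bufs at this point. */
    }
    /* Active: only B_FAILFAST bufs are failed early (assumed for bufs
     * other than the one that triggered the transition). */
    return (bp->b_flags & EX_B_FAILFAST) ? EX_FAIL : EX_RETRY;
}

/* Models a successful completion via sd_return_command(). */
static void ex_return_command(struct ex_lun *un)
{
    un->un_failfast_state = EX_FAILFAST_INACTIVE;
    un->un_failfast_bp = NULL;
}

/* Models the early check in sd_core_iostart(): true means fail immediately. */
static bool ex_core_iostart_fails(const struct ex_lun *un,
                                  const struct ex_buf *bp)
{
    return un->un_failfast_state == EX_FAILFAST_ACTIVE &&
        (bp->b_flags & EX_B_FAILFAST) != 0;
}
```

Walking one B_FAILFAST buf through two consecutive timeouts takes the instance from inactive to pending to active, after which new B_FAILFAST bufs fail at once until some buf completes successfully.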