Aladdin Lookup Services: Model, Protocols, and Analysis
Anish Arora, Ohio State Univ., Columbus, OH
Yi-Min Wang, Microsoft Research, Redmond, WA
Wilf Russell, Microsoft Research, Redmond, WA
“Providers” make information available for lookup by “clients”. Lookup information is of various types: “I-am-alive” information, address information, state information, etc. Providers operate asynchronously from clients, they create/update/revoke information in the lookup service at any time, and they may cease to operate at any time without explicitly revoking their information. It is thus entirely possible that a client lookup occurs before creation/update or after revocation/death of some provider information.
A delay exists between a provider starting up and creating its information in the lookup service. So, if the information is a server address, a client lookup may indicate that no such server is available even though the server is actually running. Dually, a delay exists between a provider revoking its information (or a provider dying) and the removal of that information from the lookup service. So, a client may get an invalid service address, and calls to that address will fail. Clients may choose to deal with the uncertainty induced by these delays by retrying the lookup, optimistically expecting that the information they subsequently get is fresh (or, equivalently, not stale). Alternatively, clients may choose to subscribe to the lookup service so that the information is delivered to them whenever it becomes available.
The lookup service may or may not acknowledge provider creation/update/revocation of information. In other words, the lookup service and the provider may negotiate (using meta-updates) whether or not an acknowledgement of successful completion of the provider event will be sent by the lookup service to the provider.
In several cases, the information being provided is such that its updates are idempotent and commutative. By idempotent, we mean that multiple updates (or even a single update following multiple unsuccessful updates) yield the same final information as a single update. By commutative, we mean that the order of the updates does not matter, i.e. all orders yield the same final information.
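For illustration, consider a minimal sketch (with hypothetical names; this is not the Aladdin implementation) in which a keep-alive entry is merged by taking the maximum of the provider-supplied timestamps. Such a merge is idempotent (re-applying a refresh changes nothing) and commutative (the order of refreshes is irrelevant):

/* Sketch only: keep-alive entry merged by max() on the provider timestamp. */
#include <stdio.h>

typedef struct {
    int  provider_id;
    long last_alive;                        /* provider timestamp of latest "I-am-alive" */
} Entry;

static void apply_refresh(Entry *e, long provider_ts)
{
    if (provider_ts > e->last_alive)        /* max() merge */
        e->last_alive = provider_ts;
}

int main(void)
{
    Entry a = { 7, 0 }, b = { 7, 0 };
    /* same refreshes, different order, one duplicate */
    apply_refresh(&a, 100); apply_refresh(&a, 130); apply_refresh(&a, 130);
    apply_refresh(&b, 130); apply_refresh(&b, 100);
    printf("a=%ld b=%ld\n", a.last_alive, b.last_alive);   /* both print 130 */
    return 0;
}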
As mentioned above, in the home-networking/device world, it is normal for providers to cease operation without explicit announcement. The idea of soft-state refresh is therefore naturally applicable---each provider periodically refreshes its information. If no refresh is received within some corresponding interval, the provider is assumed to have ceased operation and the soft state disappears; the space allocated by the lookup service to its information is thus reclaimed and, more importantly, within bounded time, clients stop getting stale information. Likewise, if the provider is in the process of recovery and its state is timed out, clients are prevented from performing potentially wasteful, unstable, or even divergent computation.
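The following sketch (again with hypothetical names and constants, not the Aladdin code) illustrates soft-state expiry: a lookup returns an entry only if it was refreshed within the expiry interval; otherwise the entry is reclaimed, which bounds how long stale information can be returned:

/* Sketch only: soft-state entries expire unless refreshed within EXPIRY seconds. */
#include <stdio.h>
#include <string.h>
#include <time.h>

#define MAX_ENTRIES 16
#define EXPIRY      30                       /* e.g. 3 x a 10-second refresh interval */

typedef struct {
    char   key[32];
    char   value[64];
    time_t last_refresh;
    int    in_use;
} SoftEntry;

static SoftEntry table[MAX_ENTRIES];

static void refresh(const char *key, const char *value)
{
    int free_slot = -1;
    for (int i = 0; i < MAX_ENTRIES; i++) {
        if (table[i].in_use && strcmp(table[i].key, key) == 0) {
            strncpy(table[i].value, value, sizeof table[i].value - 1);
            table[i].last_refresh = time(NULL);          /* refresh resets the timer */
            return;
        }
        if (!table[i].in_use && free_slot < 0) free_slot = i;
    }
    if (free_slot >= 0) {                                /* create a new soft-state entry */
        table[free_slot].in_use = 1;
        strncpy(table[free_slot].key, key, sizeof table[free_slot].key - 1);
        strncpy(table[free_slot].value, value, sizeof table[free_slot].value - 1);
        table[free_slot].last_refresh = time(NULL);
    }
}

static const char *lookup(const char *key)
{
    time_t now = time(NULL);
    for (int i = 0; i < MAX_ENTRIES; i++) {
        if (!table[i].in_use || strcmp(table[i].key, key) != 0) continue;
        if (now - table[i].last_refresh > EXPIRY) {      /* timed out: reclaim, return nothing */
            table[i].in_use = 0;
            return NULL;
        }
        return table[i].value;
    }
    return NULL;
}

int main(void)
{
    refresh("light.kitchen", "on");
    const char *v = lookup("light.kitchen");
    printf("%s\n", v ? v : "(no fresh entry)");
    return 0;
}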
By the same rationale, soft-state refresh of client subscriptions is naturally applicable to deal with clients that cease operation without explicitly revoking their pending subscriptions. Another application is to simplify client programming, e.g. in scenarios such as “this information request will expire at noon” or “raise an alert if this information request is not processed by noon”.
Soft-state lookup furthermore enables clients to access the state of devices that cannot be monitored or polled but that can announce their state periodically (albeit at a low frequency should they be subject to energy constraints). It also enables the design of efficient and simple fail-over protocols; a somewhat subtle example is our lookup service itself, which dynamically maintains a group of active nodes. Group maintenance is based on each node periodically refreshing its status.
Lookup services should respond to client queries with the freshest information they have received from providers. In the presence of faults, however, the freshest information may become unavailable. For instance, provider updates initiated in the presence of faults may be delayed, if not lost. We therefore measure the quality of a fault-tolerant lookup service by the maximum decrease in the freshness (or, equivalently, the maximum increase in the staleness) of the state it provides in the presence of faults. A second metric is the maximum length of lookup outage in the presence of faults, i.e. the maximum time period during which client queries or provider events fail. Of course, in the presence of catastrophic faults, such as long power outages or permanent failure of multiple machines, it may not be possible to bound outage length. In this case, the best that can be hoped for is that eventually the lookup service recovers to providing fresh information and/or the outage is notified to the system administrator.
In the Aladdin environment, a variety of faults must be considered. These include: message loss, message delays, PC failure, PC reboot (which may hang or succeed), power outages, clock drifts, and state corruption. The effects of these faults on the lookup service vary widely, and depend crucially on the combination and number of the occurring faults. We therefore begin by classifying faults as “common”, “rare”, and “catastrophic” and then attempt to provide fault-tolerance separately for each of these fault-classes. In particular, the goal is to tune the quality of tolerance to be commensurate with the frequency/severity of the fault-class, the best quality being reserved for the common failure-class that consists of the high frequency and low severity faults.
The common failure case for Aladdin is at most one of the following faults:
· loss of a single message
· failure of a single node (possibly permanent)
· a short power outage
Any number of node repairs may occur in the common failure case. In the Aladdin system, we assume there are at least two nodes and at least one of the nodes is connected to an uninterruptible power supply (i.e., there is a designated UPS-node). It follows from this assumption and the faults in the common failure case that at least one node remains available to service the clients and providers if only common failures occur. Note that this assumption stems from the importance of continuing critical sensing and controlling in the home networking environment in the presence of power failures.
The rare failure case for Aladdin is any finite number of the following faults:
· permanent failure of some (but not all) nodes
· node repairs
· message loss, loss of clock synchronization, and transient state corruption
Combinations of rare faults lead to scenarios where there are no nodes available to service the clients and providers. However, we can achieve the weaker tolerance whereby in these scenarios the lookup service eventually recovers to states from where subsequent client queries and provider events are dealt with correctly.
Finally, the catastrophic failure case for Aladdin is any number of the following faults:
· long power outage
· permanent failure of all nodes
In this case, the best tolerance that can be achieved is to notify the Aladdin administrator that Aladdin is shutting down and needs support.
Given the lookup service model, providing absolute consistency of lookup information via a tightly coupled group of replicated databases is not necessary; the overhead in the absence of faults may well outweigh its benefits in the presence of faults and may negatively impact lookup service scalability. Loose consistency is therefore preferable, provided the metrics for quality of fault-tolerant service are acceptable, especially in the presence of common failures.
We categorize the basic approaches to providing loose consistency for lookup services in terms of (a) whether consistency is ensured at information update time or information query time and (b) whether consistency enforcement is the responsibility of the lookup service, the provider, or the client. The approaches include:
1. Update-time, provider-side consistency: In this case, one of the lookup information base replicas, the “leader”, is the most consistent one, and providers communicate directly with the leader. The leader ensures that other replicas receive provider updates. Likewise, clients communicate directly with the leader to obtain information. Consistent leader information has therefore to be provided to the providers and clients, even in the presence of faults.
2. Update-time, server-side consistency: In this case, providers are not aware of the replicas and, therefore, which replica is leader; hence providers multicast the information. All replicas process the multicasts, coordinate with each other to ensure update atomicity, and the leader communicates to providers whether their updates succeeded. Clients are not aware of the replicas either, so they multicast their queries as well, and the leader responds accordingly. Unlike case (1), consistent leader information is needed only among the replicas.
3. Query-time, server-side consistency: As in case (2), both providers and clients multicast their requests. All replicas process provider information but independently, i.e. replicas do not coordinate with each other to ensure update atomicity. The leader communicates acknowledgments to providers that wish to know whether their updates succeeded, but these acknowledgements only confirm that the leader replica succeeded in performing the update. The leader also responds to client queries. Consistent leader information has to exist among the replicas only, but since update coordination between replicas is avoided in this case, the leader information is needed only in a “weak” sense: replicas need to know whether they are the leader, but not who the leader is.
4. Query-time, client-side consistency: In this case, providers multicast information, which all replicas process independently. No acknowledgment is communicated to providers. Clients also multicast their queries, and one or more replicas respond accordingly. Clients are responsible for suppressing duplicates and reconciling inconsistent responses from different replicas. Consistency requirements for leader information are low in this case: only weak leader information is needed, and it suffices that the leader information stabilizes eventually, from which point only one replica responds to client queries.
Of these approaches, in Aladdin, we focus our attention on (3). The consistency requirements that (3) imposes on the lookup service are low compared to (1) and (2). Clients and providers do not need to receive leader information, which significantly simplifies the leader failover protocol. Also, replicas do not need strong leader information, which significantly simplifies the leader election protocol (especially since Aladdin is subject to multiple fault-classes) and reduces outage time. Since replicas perform updates independently in (3) but in a coordinated manner in (1) and (2), (3) imposes less overhead on the update response time than do (1) and (2). Moreover, (3) provides fresher information than (4), since leaders in (3) are more consistent than leaders in (4); furthermore, in (4) it is difficult for clients to choose among the inconsistent responses they receive.
In principle, the freshness of information in (3) is less than that of (1) and (2): consider, for example, the scenario where a non-leader replica loses an update due to message loss and then becomes the leader due to leader node failure; the information it provides will be relatively stale. Two observations about Aladdin mitigate this freshness concern. One, in the common failure case, message loss and node failure do not both occur in the same period, so the scenario described above occurs only rarely. Two, dealing with message loss can be conditioned on the refresh period of updates. Specifically, we may distinguish the handling of updates whose refresh period is small, i.e. high frequency updates, from those whose period is large, i.e. low frequency updates. In the high frequency case, the maximum decrease in freshness is naturally bounded. Dealing with message loss is more crucial in the low-frequency case, so we insist that low-frequency updates always receive an acknowledgement from the leader. By way of contrast, high-frequency updates may be performed optimistically (especially if the information is idempotent and commutative, as in keep-alive information).
We now proceed to present the Aladdin lookup service protocol based on (3) and give an analysis of its fault-tolerance.
We develop our lookup protocol in a stepwise manner. We begin with the relatively simple case of optimistic high-frequency updates and deal only with common failures. Then, we augment the protocol to accommodate pessimistic updates (including low-frequency updates). Later, we augment the protocol to deal with rare failures. (We omit discussion of the routine engineering issues needed to deal with catastrophic failures.) Finally, we discuss how to improve lookup service performance by dynamically adapting the frequency of updates.
We refer to optimistic provider updates as provider refreshes. Recall that provider refreshes are multicast periodically. We assume an upper bound of Max_Volatile_Refresh_Interval on the refresh period. “Volatile” indicates that this data can be kept in volatile storage. The lookup service runs on one node, namely the leader node. Providers optimistically expect the leader node to perform the refreshes, i.e. providers do not receive acknowledgements that their refreshes were performed. Client queries are also multicast, and the leader node responds to them. Note that should an update message be lost, the lookup service will only be able to provide stale information for that item. The maximum staleness in this case is 2 Max_Volatile_Refresh_Interval + δ, where δ is a statistical minimum on the time difference between query initiation time and the most recent refresh prior to the query for which an ideal lookup service can guarantee correct (i.e., 0-stale) information. Bear in mind that Max_Volatile_Refresh_Interval + δ is a theoretical lower bound for the maximum staleness.
Should the leader node fail, service is restored on some non-failed node. Detection of the leader node failure is achieved by using soft-state heartbeats, thus detection that masks a single message loss occurs within 3 SS_Heartbeat_Interval time of the leader node failure. Soft-state heartbeats also enable election of the new leader. As explained previously, only "weak leader" election is needed; such election is achieved simply by having nodes independently check whether they have the highest "id" among eligible nodes when they detect failure of the leader node.
Note that upon leader failover, the new leader "warms up" for a period of Max_Volatile_Refresh_Interval, in which time the lookup data is reestablished and after which all queries are performed correctly. The service outage due to node failure is therefore of length 3 SS_Heartbeat_Interval + Max_Volatile_Refresh_Interval. Thus, the maximum staleness of the data received in response to a query is 3 SS_Heartbeat_Interval + 2 Max_Volatile_Refresh_Interval + δ.
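The two mechanisms just described can be sketched as follows (the constants and data layout are illustrative assumptions, not the Aladdin code): a node presumes a peer down only after three silent heartbeat intervals, which masks a single lost heartbeat, and it considers itself the weak leader exactly when no peer it still presumes up has a higher id:

/* Sketch only: heartbeat-based failure detection and weak-leader check. */
#include <stdio.h>
#include <time.h>

#define SS_HEARTBEAT_INTERVAL 5                   /* seconds, for illustration */
#define MAX_NODES             8

typedef struct {
    int    id;
    time_t last_heartbeat;                        /* local arrival time of latest heartbeat */
} Peer;

static int presumed_up(const Peer *p, time_t now)
{
    /* tolerate one lost heartbeat: declare down only after 3 intervals of silence */
    return (now - p->last_heartbeat) <= 3 * SS_HEARTBEAT_INTERVAL;
}

static int i_am_weak_leader(int my_id, const Peer peers[], int n, time_t now)
{
    for (int i = 0; i < n; i++)
        if (presumed_up(&peers[i], now) && peers[i].id > my_id)
            return 0;                             /* a live node with a higher id exists */
    return 1;
}

int main(void)
{
    time_t now = time(NULL);
    Peer peers[MAX_NODES] = {
        { 3, now - 2 },                           /* fresh heartbeat: presumed up   */
        { 9, now - 4 * SS_HEARTBEAT_INTERVAL }    /* silent too long: presumed down */
    };
    printf("node 5 is weak leader? %d\n", i_am_weak_leader(5, peers, 2, now));  /* prints 1 */
    return 0;
}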
The maximum staleness of the previous protocol can be substantially reduced if we maintain a fixed set of "hot" spares. Each hot node performs all refreshes, but hot spares do not respond to queries unless they are elected leader. Therefore, assuming that a hot spare is available during leader failover, the maximum staleness is reduced to Max_Volatile_Refresh_Interval + δ, since the common failure assumption implies no updates are lost if a node fails. The maximum staleness in the case of a message loss remains unchanged. Hence, the overall maximum staleness is reduced to 2 Max_Volatile_Refresh_Interval + δ.
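The staleness budgets derived above reduce to a few simple formulas; the sketch below merely restates them with sample values (the numbers are illustrative, not measurements):

/* Sketch only: staleness budgets for the volatile (high-frequency) protocol. */
#include <stdio.h>

static double stale_message_loss(double refresh, double delta)
{ return 2.0 * refresh + delta; }                       /* one lost refresh            */

static double stale_failover_cold(double refresh, double heartbeat, double delta)
{ return 3.0 * heartbeat + 2.0 * refresh + delta; }     /* failover without hot spares */

static double stale_failover_hot(double refresh, double delta)
{ return refresh + delta; }                             /* failover onto a hot spare   */

int main(void)
{
    double R = 10.0, H = 2.0, d = 0.1;                  /* sample values, in seconds */
    printf("message loss           : %.1f s\n", stale_message_loss(R, d));
    printf("failover, no hot spare : %.1f s\n", stale_failover_cold(R, H, d));
    printf("failover, hot spare    : %.1f s\n", stale_failover_hot(R, d));
    return 0;
}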
Final protocol for high-frequency lookup service:
The hot-start lookup protocol works correctly provided nodes leave, but do not (re)join, the set of hot nodes. Otherwise, if a hot node fails and then repairs, race conditions may arise in weak-leader election whereby multiple nodes concurrently become leader. Since the lookup state of one leader may be inconsistent with, or phase delayed with respect to, that of another leader, the leaders may start providing conflicting responses to queries.
To allow failed nodes to become hot spares upon repair or, more generally, for the protocol to work correctly for a dynamic set of spare nodes, we augment the protocol with a "join" method that each node executes when it wishes to become a hot spare (in particular, when it repairs or when some hot node fails). The join method works as follows: before joining, each node multicasts a "Can I Join" message. The leader node (if there is one) determines whether the node may join--specifically, the protocol attempts to maintain at most t+1 hot nodes, where t>0--and sends an "Ack"/"Nack" message accordingly. To deal with the loss of the Can I Join or the Ack/Nack message as well as leader failover, the node times out if it does not get a response, waits for 3 SS_Heartbeat_Interval time (which we prove is the upper bound on the leader failover time) and then retries twice more.
The maximum staleness of the high-frequency protocol, given below, remains 2 Max_Volatile_Refresh_Interval + δ and its maximum outage length is 3 SS_Heartbeat_Interval.
//==========================================================================================
int Join()
{ for ( int retry_cnt = 0; retry_cnt < 3; retry_cnt++ )
  { multicast("Can I join");
    wait_on_recv(msg, ROUNDTRIP_COMMUNICATION_DELAY);
    switch (msg)
    { case "Ack":
        bIsActive = TRUE;
        activeStartTime = GetLocalTime();
        Sleep(MAX_VOLATILE_REFRESH_INTERVAL);          // warm up: rebuild volatile lookup state
        SSS_Update("LS_start_time", activeStartTime, heartbeat_interval);
        // SSS stands for soft-state store
        create_main_thread_for_receiving_announcements_and_queries();
        return();
      case "Nack":                                     // t+1 active nodes are already running
        return();
      case ROUNDTRIP_COMMUNICATION_DELAY:              // timed out: no Ack/Nack received
        if (retry_cnt < 2) { Sleep(3 * SS_HEARTBEAT_INTERVAL); break; }  // wait out a possible leader failover, then retry
        else return();
      default: break;
    }
  }
  return();
}
//=========================================================================================
int leader_thread()
{ while (1)
{ recv(msg);
switch (msg)
{case "Can I join":
if (number of LS_active_nodes < t || sender_is_UPS_designated_node)
reply(ack);
else reply(nack);
break;
default: break;
}
}
}
//=======================================================================================
int main_thread_for_receiving_announcements_and_queries()
{ while (1)
{ recv(msg);
switch (msg)
{case "query":
if (bIsLeader) respond_to_query();
break;
case "update":
if (bIsActive) perform_update();
break;
default: break;
}
}
}
// =======================================================================================
int main()
{ BOOL bIsActive = FALSE; // Starts as a passive server
BOOL bIsLeader = FALSE;
BOOL bIsUPSnode = FALSE; // set to TRUE only on the UPS-designated node
int activeStartTime = INFINITE; // for computing the age of LS daemon
Join();
HANDLE stateMayNeedToChangeEvent; // If any LS daemon dies, this daemon
stateMayNeedToChangeEvent = CreateEvent(); // may need to join or become the leader
SSS_SubscribeEvent("any LS_start_time entry changes", stateMayNeedToChangeEvent);
while (1)
{ SSS_PutValue(bIsActive, activeStartTime);//This code is called Maintainer
if (!bIsActive) // Node is passive
{ if (number of LS_active_nodes < t || bIsUPSnode )
Join(); // t+1 = target number of LS active nodes
}
else // Node is active; if there are too many
{ // LS active nodes, youngest should stop
if (number of LS_active_nodes > (t+1) &&
activeStartTime is the latest among nonUPS LS_active_nodes)
{ bIsActive = FALSE;
activeStartTime = INFINITE;
bIsLeader = FALSE;
SSS_Delete("LS_start_time");
} // Declare node leader if it is oldest
if (!bIsLeader && activeStartTime is the earliest among LS_active_nodes)
{ bIsLeader = TRUE;
create_leader_thread();
}
}
WaitForSingleObject(stateMayNeedToChangeEvent, INFINITE);
SSS_GetValue("set_of_up_nodes");
SSS_GetValue("set_of_LS_active_nodes");
}
return;
}
//======================================================================================
The following predicate is an invariant of the protocol:
(I1a and I1b) and (forall j: j is an up node: I2a and I2b and I2c and I2d)
where
(I1a) there is an active node that has been up for the last
(3 SS_Heartbeat_Interval +2 roundtrip-communication-delay) interval
(I1b) designated UPS node is active
<= it has been up for the last
(3 SS_Heartbeat_Interval + 2 roundtrip-communication-delay) interval
(I2a) st.j=passive <=> age.j=infinity
and (#k::st.k=active) >=1 <= st.j=passive for the last
(3 SS_Heartbeat_Interval + 2 roundtrip-communication-delay) interval
and (#k::st.k=active) >1 <= st.j=passive and no node went down for the last
(3 SS_Heartbeat_Interval + 2 roundtrip-communication-delay) interval
(I2b) j is down => age.j.k in SSS.k is either missing or greater than clock.k
and st.j=passive => age.j.k in SSS.k is either infinity or a former time greater than clock.j
and st.j=active => age.j.k in SSS.k is either age.j or infinity
and age.j.k is not in SSS.k <= j has been down for the last (3 SS_Heartbeat_Interval) interval
and age.j.k is not in SSS.k => j was down at some point in the last (3 SS_Heartbeat_Interval) interval
and age.j.k is infinity in SSS.k <= st.j=passive for the last (SS_Heartbeat_Interval) interval
and age.j.k is infinity in SSS.k => st.j=passive held at some time in the last (SS_Heartbeat_Interval) interval
and age.j.k=age.j in SSS.k <= st.j=active for the last (SS_Heartbeat_Interval) interval
(I2c) l.j => st.j=active
l.j => <age.j, j> is the maximum <age.k,k> among all active nodes k
(I2d) <age.j, j> is max <age.k,k> of all active k
=> <age.j.k,j> in SSS.k is max or
k will receive a "node down(l)" message for each active l of greater age
and <age.j, j> is max <age.k.j,k> of all active nodes k in SSS.j
=> l.j
(I2e) j receives "node down(k)" message
=> k failed at some point in the last (3 SS_Heartbeat_Interval) interval
and j receives "node down(k)" message
<= k was failed during the last (3 SS_Heartbeat_Interval) interval
It is easy to see that in the absence of faults, upon starting from any invariant state, the protocol satisfies its specification; i.e., every computation of the protocol correctly performs provider refreshes and client queries without any increase in maximum staleness.
Now, observe that at invariant states, at least one of the nodes is active, but it may be that none is a leader. In any state where there is no leader, by I2d, the active node with the maximum <age.j,j> will receive a "node down()" message for each failed node of greater age within (3 SS-heartbeat-interval) time, which causes execution of the maintainer, and hence the election of a new leader. Thus, we have:
Lemma: Leader failover occurs within (3 SS-heartbeat-interval) time.
If no node has failed in the last (3 SS-heartbeat-interval) time, then by I2e no node receives a "node down()" message in that time. By I2d, the active node that had the maximum age at the beginning of that interval also has the maximum <age.k.j,k> of all active nodes k in its local SSS and is hence a leader at that time. It follows that
Lemma: If no node has failed in the last (3 SS-heartbeat-interval) time, then currently there is a leader node.
It follows that the maximum outage time is 3 SS-heartbeat-interval, and the maximum staleness, in case of node failure, is Max_Volatile_Refresh_Interval + δ (because of the common failure assumption and I1a-b) and, in case of message loss, is 2 Max_Volatile_Refresh_Interval + δ.
Finally, to prove that the invariant predicate is indeed an invariant, we show that if the predicate holds before any protocol action or common failure action executes, it holds afterwards as well. We consider separately the preservation of each conjunct upon executing an action in any invariant state. (Protocol inspection suffices to prove that timing invariant I2e holds in all computations. When a node k fails, the state of k times out at SSS.j within (3 SS_Heartbeat_Interval) and SSS.j notifies the lookup service daemon with a "node down(k)" message. Conversely, if the daemon receives a "node down(k)" message, it must be that k failed at some point within the last (3 SS_Heartbeat_Interval) interval, for otherwise k would have sent at least 2 heartbeat messages to j in that interval, both of which could not have been lost due to the common failure assumption. Henceforth, we restrict ourselves to reasoning about I1a-b and I2a-d.)
I1a is preserved by the protocol actions since the only action that could violate it is the maintainer. Before the maintainer at j changes its state from active to passive, it detects from SSS.j that there are at least t+1 nodes that j believes are active and that <age.j,j> is the minimum among the non-UPS active nodes. From the common failure assumption, at most one of them may be down; hence at least t of them are active and remain active when j changes its state to passive, and from I2a, at least one of the remaining nodes has been active during the last (3 SS_Heartbeat_Interval + 2 roundtrip-communication-delay) interval. I1a is preserved by the common failure actions, since it is unaffected by message loss and a single node failure (cf. I2a).
I1b is preserved by the protocol actions since they never make the designated UPS-powered node passive. Also, once that node has been up for a (3 SS_Heartbeat_Interval + 2 roundtrip-communication-delay) interval, its constructor ensures that within two attempts to join (during which a message may be lost or a leader failover may occur), it becomes active as well. Also, if a common failure action fails the designated node, I1b is trivially preserved.
I2a (first conjunct) is preserved by the protocol actions since they set st.j to passive iff they set the age to infinity, and by the common failure actions trivially. I2a (second and third conjuncts) are preserved by the protocol actions since a node becomes passive only if it is not the designated UPS-powered node, its SSS detects more than t active nodes (cf. I2b), and it has the least age among them. Hence the consequents of the conjuncts are not violated. If the antecedent of the second conjunct becomes true, the second conjunct must hold: otherwise, the first time it were violated, some node would have been up and passive for (3 SS_Heartbeat_Interval + 2 roundtrip-communication-delay), which means its join would have failed, but then the leader would have been aware of more than t active nodes. If the antecedent of the third conjunct becomes true, then either the previous argument holds and there are more than t active nodes, or some node failed more than (3 SS_Heartbeat_Interval + 2 roundtrip-communication-delay) ago; in the latter case, a node-down() message would have occurred within (3 SS_Heartbeat_Interval) time, thereby causing passive nodes to execute their maintainer and join if the number of active nodes were less than t; hence the number of active nodes would be more than 1. I2a (second conjunct) is preserved by the common failure actions since if a node fails, then by I2a the number of active nodes was previously more than 1 and now remains at least one. (The message loss case is trivial.) I2a (third conjunct) is preserved by the common failure actions trivially.
The three follows-from (<=) conjuncts of I2b and the last two implies (=>) conjuncts of I2b are straightforward timing invariants of the protocol: they hold for all protocol computations, regardless of the initial state. The remaining three conjuncts of I2b are preserved by the protocol actions. If a node j fails or becomes passive, then its age value in SSS.k is either correct or a former value (i.e., greater than the current time); likewise, if j becomes active, then its age value in SSS.k is either correct or infinity (it cannot be a former value since the time between two consecutive activations of j is at least (3 SS_Heartbeat_Interval)). I2b is preserved by the common failure actions: the node failure case is as discussed above and the message loss case is trivial.
I2c is preserved by the protocol actions since j becomes a leader only if it is active and its SSS shows that it has the greatest age. If its SSS does not have the correct age of some active k, then, by I2b, k must have become active only within the last (SS_Heartbeat_Interval) interval and hence cannot have the greatest age (by I1a). Also, if the age of j is maximum among active nodes, then that age remains maximum since the age of active nodes is not changed; and if j becomes passive, then it sets bIsLeader to false. I2c is preserved by the common failure actions trivially.
I2d is preserved by the protocol actions since, by I1a, the age of j becomes maximum only when an active l of greater age fails, in which case all nodes k will receive a "node down(l)" message within (3 SS_Heartbeat_Interval) time. When j receives this message (which is how its age becomes the maximum of all age values in its SSS), it elects itself leader, thereby preserving the second conjunct. Also, when j sets its bIsLeader to true, its age is the maximum of all age values in its SSS. Moreover, when the age of j stops being the maximum of the age values in SSS.k, it is either the case that j failed or that the age of j is no longer the maximum among the active nodes. I2d is preserved by the common failure actions: the single node failure case was discussed above, and the message loss case is trivial since it does not truthify the antecedents or falsify the consequents.
As mentioned previously, for providers whose events are refreshed with low frequency, the loss of the first event ("Create") may increase the staleness of query responses unacceptably; in other words, 2 Max_Persistent_Refresh_Interval + δ may be too large. (“Persistent” indicates that low frequency information will be stored in stable storage.) By the same token, the loss of the last event ("Revoke") and of the event denoting a change of the refresh period ("Meta-Update") is critical. If low frequency update events are idempotent and commutative, then the loss of these events is not critical; but if we are dealing with updates that contain device state, then it is likely that update events are also critical. In order to reduce the maximum increase in staleness for low frequency devices, we distinguish the treatment of low frequency critical events: the leader always acknowledges whether it has successfully performed these events, and the provider retries the event until it receives a positive acknowledgement.
The modifications to the protocol to deal with pessimistic and low-frequency updates are few: The update case of the main_thread_for_receiving_announcements_and_queries() is modified to:
if (bIsLeader && update_needs_acknowledgement)
{ reply(status); // status is status of the update on the persistent state
}
the reply(ack) message sent in the leader_thread() method is replaced with:
stream persistent store;
reply(ack(UPS_token));
and the corresponding receive case "Ack" in the Join() method is replaced with:
case "streamed persistent store and ack(UPS_token)":
We refer to this modified protocol as the composite frequency lookup service protocol. The protocol deals with the loss of a single message relatively quickly: if the loss occurs en route to the leader node or on the return channel from the leader node, the provider is forced to retry the event until it is successfully performed. If the loss occurs en route to a non-leader node, then by our common failure assumption that node will not become a leader in a long enough interval (during which the event will be refreshed or the state will be timed out at all active nodes). We assume that the period over which the provider continues to retry the event exceeds the leader failover interval; hence every low frequency event will eventually be performed successfully despite common failures.
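A provider-side sketch of this retry discipline is given below; the channel and timer primitives are stubs standing in for the real multicast and acknowledgement path, and the constants are illustrative:

/* Sketch only: a critical low-frequency event is re-multicast until the leader
 * acknowledges it; the retry period exceeds the worst-case leader failover time. */
#include <stdio.h>

#define SS_HEARTBEAT_INTERVAL 5                           /* seconds, for illustration         */
#define RETRY_PERIOD (3 * SS_HEARTBEAT_INTERVAL + 1)      /* must exceed 3 heartbeat intervals */
#define MAX_RETRIES  5

/* Stand-ins for the real multicast channel and timer. */
static void multicast_event(const char *event) { printf("multicast: %s\n", event); }
static void wait_seconds(int s)                { (void)s; /* Sleep()/sleep() in real code */ }
static int  ack_received(void)
{
    static int attempt = 0;
    return ++attempt >= 3;                                /* pretend the first two acks are lost */
}

static int send_critical_event(const char *event)
{
    for (int i = 0; i < MAX_RETRIES; i++) {
        multicast_event(event);
        if (ack_received()) return 0;                     /* leader performed the update         */
        wait_seconds(RETRY_PERIOD);                       /* ride out a possible leader failover */
    }
    return -1;                                            /* give up; alert the administrator    */
}

int main(void)
{
    if (send_critical_event("Revoke(light.kitchen)") == 0)
        printf("event acknowledged by leader\n");
    return 0;
}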
The invariant predicate of the high frequency protocol is also invariantly true in all computations of the composite protocol, even in the presence of common failures. It follows that the maximum outage time for the composite protocol remains 3 SS-heartbeat-interval, and the maximum staleness of high frequency information is also unchanged. The maximum staleness of low-frequency information in case of a single message loss is Max_Persistent_Refresh_Interval + 2 roundtrip-communication-delay + δ, and in case of node failure is Max_Persistent_Refresh_Interval + 3 SS-heartbeat-interval + roundtrip-communication-delay + δ, since an update that occurs concurrently with a query may actually have started at the beginning of the most recent outage. If we relax the assumption that providers retry immediately to the assumption that providers retry within an upper bound Θ, then the maximum staleness in case of node failure becomes Max_Persistent_Refresh_Interval + 3 SS-heartbeat-interval + Θ + roundtrip-communication-delay + δ. (Recall that Max_Persistent_Refresh_Interval + δ is a lower bound.)
To deal with rare failures, we need to deal with states where there are zero or multiple leader nodes. More generally, we need to deal with all protocol states that can be reached in the presence of rare failures. Instead of explicitly characterizing a weaker invariant that is satisfied in the presence of rare failures, we modify the protocol to be self-stabilizing: upon starting from an arbitrary state, it eventually reaches a state where the original invariant of the protocol is satisfied.
For the case where there are no leaders, a node that attempts to join may time out after all three of its "Can I Join" messages. In this case, we let the node assume that there is no leader node and itself become the leader. Specifically, we modify the timeout case in the join method to:
case ROUNDTRIP_COMMUNICATION_DELAY:              // timed out: no Ack/Nack received
  if (retry_cnt < 2) { Sleep(3 * SS_HEARTBEAT_INTERVAL); break; }  // wait out a possible failover, then retry
  else                                           // assume no leader exists; become active
  { bIsActive = TRUE;
    activeStartTime = GetLocalTime();
    Sleep(MAX_VOLATILE_REFRESH_INTERVAL);        // build up volatile state
    SSS_Update("LS_start_time", activeStartTime, heartbeat_interval);
    create_main_thread_for_receiving_announcements_and_queries();
    return();
  }
For the case where the leader information is inconsistent or there are multiple leaders, we modify the maintainer to first check in case the node is passive (i.e., if (!bIsActive) ) that
if (INFINITE != activeStartTime || bIsLeader)
{ activeStartTime = INFINITE;
bIsLeader = FALSE;
}
and in case the node is active (i.e., if (bIsActive) ) that
if (bIsLeader && activeStartTime is not the earliest among LS_active_nodes)
{ bIsLeader = FALSE;
terminate_leader_thread();
}
Moreover, we require that each node execute the modified maintainer periodically when some LONG_TIMEOUT expires, instead of only when some stateMayNeedToChangeEvent occurs. Specifically, we modify the WaitForSingleObject in the maintainer to:
WaitForSingleObject(stateMayNeedToChangeEvent, LONG_TIMEOUT);
To enable low frequency state to be available despite long-power outages, we require that all low frequency state be persisted; i.e. maintained in stable store.
Stabilizing tolerance proof. The modified maintainer is the key to the stabilization of the composite protocol from arbitrary start states to states where the original invariant holds. We begin by showing that periodic execution of the maintainer ensures that eventually for all up nodes j I2c and I2d hold forever.
If there are no active nodes in the start state, then within (3 SS-heartbeat-interval) time the SSS of some passive node will detect this fact. The first such passive node to execute its maintainer will, upon executing join, become the active node with maximum age and set its bIsLeader to true. Since "fake" age values of any node j in the SSS of any other node disappear within (3 SS-heartbeat-interval) time and SSS_PutValue calls and heartbeats occur periodically, the leader never receives a greater age value, and hence its value eventually remains the maximum in the SSS of all nodes. Periodic execution of the maintainer at each up node ensures that for all j I2c and I2d hold forever.
If there is an active node in the start state (and given the argument immediately above, in all successive states in the protocol computation), let M be the maximum age value in the SSS of all active nodes in the start state. Henceforth the maximum age of all active nodes is never greater than M. Also, due to timeout of the "fake" age values and the periodic SSS_PutValue and the SS_Heartbeat_Interval, eventually the maximum age value in the SSS of all nodes is (and remains) M. Periodic execution of the maintainer at each node ensures that for all j I2c and I2d hold forever.
It follows that eventually I1a holds forever and that there is a leader. Also, if the designated UPS powered node is passive, eventually the SSS of the leader detects this fact. Hence, eventually the designated node executes its maintainer and (upon executing join) becomes active. Once active, it remains active, its age stabilizes, and thus eventually I1b also holds forever.
Since each node j executes its maintainer periodically, j eventually truthifies the first conjunct of I2a: j sets its "st" to passive if and only if it sets its age to infinity. Once the age of all non-designated active nodes stabilizes, the number of such active nodes stabilizes to at most t and these nodes remain active. Likewise if the designated node becomes active it remains active. Periodic execution of the maintainer at each node ensures that eventually for all j the remaining conjuncts of I2a hold forever.
Timeout of "fake" SSS state and periodic SSS_PutValue/SS_Heartbeat_Interval ensure that eventually for all j I2b holds forever. Finally, since all providers update their data periodically and the data items that are not refreshed are timed out, eventually all lookup service data at the active nodes is consistent.
The frequency of provider updates can be tuned to suit system characteristics. We have identified several scenarios where such tuning is appropriate.
a) If demand for a particular soft-state item increases, its update frequency should stay high.
b) As age of an item increases, its update frequency should become lower.
c) On-demand, the update frequency of an item may be increased.
d) If SSS load increases, update frequency of a group of items may be decreased.
The basic implication of adaptivity is that the update frequency must be (re)negotiable. Negotiation is achieved using “meta-update” events, which may be piggybacked on other events or responses. When a new update frequency is negotiated, the lookup service protocol adapts to give acknowledgements if need be, and the provider protocol adapts to deal with acknowledgements and to retry as need be.
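A possible realization of rules (a)-(d), given here only as a hypothetical sketch with made-up thresholds, computes the next refresh interval from per-item statistics; the new interval would then be negotiated via a meta-update:

/* Sketch only: adapt an item's refresh interval to demand, age, on-demand requests, and SSS load. */
#include <stdio.h>

typedef struct {
    double base_interval;       /* negotiated baseline, in seconds        */
    double queries_per_min;     /* (a) demand for this item               */
    double age_hours;           /* (b) time since the item was created    */
    int    on_demand_boost;     /* (c) a client asked for faster updates  */
    double sss_load;            /* (d) load on the soft-state store, 0..1 */
} ItemStats;

static double next_refresh_interval(const ItemStats *s)
{
    double interval = s->base_interval;
    if (s->queries_per_min > 10.0) interval *= 0.5;     /* (a) keep hot items fresh     */
    if (s->age_hours > 24.0)       interval *= 2.0;     /* (b) let old items slow down  */
    if (s->on_demand_boost)        interval *= 0.25;    /* (c) explicit client request  */
    if (s->sss_load > 0.8)         interval *= 2.0;     /* (d) shed load when stressed  */
    return interval;
}

int main(void)
{
    ItemStats s = { 10.0, 25.0, 2.0, 0, 0.3 };
    printf("new refresh interval: %.1f s\n", next_refresh_interval(&s));   /* prints 5.0 */
    return 0;
}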
In summary, the Aladdin lookup service provides:
· Rich client subscription model, including asynchronous and persistent queries, to deal naturally with asynchrony between clients and providers.
· High quality of lookup fault-tolerance for common failures.
o Outage length in common failures is 3 SS-heartbeat-interval
o Maximum staleness of high frequency information is 2 Max_Volatile_Refresh_Interval + δ
o Maximum staleness of low-frequency information is Max_Persistent_Refresh_Interval + 3 SS-heartbeat-interval + Θ + roundtrip-communication-delay + δ
· Stabilizing lookup fault-tolerance for rare failures.
· Multi-time-scale soft-state-based lookup with both volatile and persistent storage.
· Adaptive refreshing intervals based on client-provider usage patterns and service load.