L. Gasieniec, A. Pelc, Adaptive broadcasting with faulty nodes, Parallel Computing 22 (1996), 903-912.
We consider broadcasting from a fault-free source to all nodes of a completely connected n-node network in the presence of k faulty nodes. Every node can communicate with at most one other node in a unit of time and during this period every pair of communicating nodes can exchange information packets. Faulty nodes cannot send information. Broadcasting is adaptive, i.e., a node schedules its next communication on the basis of information currently available to it. Assuming that the fraction of faulty nodes is bounded by a constant smaller than 1, we construct a broadcasting algorithm working in worst-case time O(log2 n).
A. Pelc, Fault-tolerant broadcasting and gossiping in communication networks, Networks 28 (1996), 143-156.
A preliminary version appeared as: A. Pelc, Fast fault-tolerant broadcasting and gossiping (invited paper), Proc. 2nd Colloquium on Structural Information and Communication Complexity, SIROCCO'95, June 1995, Olympia, Greece, pp. 159-172.
Broadcasting and gossiping are fundamental tasks in network communication. In broadcasting, or one-to-all communication, information originally held in one node of the network (called the source) must be transmitted to all other nodes. In gossiping, or all-to-all communication, every node holds a message which has to be transmitted to all other nodes. As communication networks grow in size, they become increasingly vulnerable to component failures. Thus, capabilities for fault-tolerant broadcasting and gossiping gain importance. The present paper is a survey of the fast growing area of research investigating these capabilities. We focus on two most important efficiency measures of broadcasting and gossiping algorithms: running time and number of elementary transmissions required by the communication process. We emphasize the unifying thread in most results from the research in fault-tolerant communication: the trade-offs between efficiency of communication schemes and their fault-tolerance.
B. S. Chlebus, K. Diks, A. Pelc, Reliable broadcasting in hypercubes with random link and node failures, Combinatorics, Probability & Computing 5 (1996), 337-350.
We consider the problem of broadcasting in an n-node hypercube whose links and nodes fail independently with given probabilities p < 1 and q < 1, respectively. Information held in a fault-free node, called the source , has to reach all other fault-free nodes. Messages may be directly transmitted to adjacent nodes only, and every node may communicate with at most one neighbor in a unit of time. A message can be transmitted only if both communicating neighbors and the link joining them are fault free. For parameters p and q satisfying (1-p)(1-q) > 0.99 (e.g., p = q = 0.5% ), we give an algorithm working in time O(log n) and broadcasting source information to all fault-free nodes with probability exceeding 1-cn-d , for some positive constants c,d , depending on p and q but not depending on n
K. Diks, A. Pelc, Fault-tolerant linear broadcasting, Nordic Journal of Computing 3 (1996), 188-201.
A preliminary version appeared in: Proc. First Canada-France Conference on Parallel and Distributed Computing, Theory and Practice, Montreal, Canada, May 1994, LNCS 805, pp. 207-217.
In linear broadcasting, packets originally stored in one node, called the source, have to visit all other nodes of the network. Every packet has a predetermined route indicating in which order it visits the nodes. A faulty link or node of the network destroys all packets passing through it. A linear broadcasting scheme consisting of packets' routes is f-fault-tolerant if every fault-free node is visited by at least one packet for any configuration of at most f link or node failures. We estimate the minimum number of packets for which there exists an f-fault-tolerant linear broadcasting scheme in complete networks, and we construct schemes using few packets. Variations of this problem when faults can occur only in links or only in nodes are also considered.
K. Diks, A. Pelc, Reliable computations on faulty EREW PRAM, Theoretical Computer Science 164 (1996), 107-122.
We consider the problem of efficient and reliable computing on EREW PRAM whose processors are subject to random independent stop-failures with constant probability p < 1. An algorithm for such a fault-prone machine is called safe if it solves a problem of size n with probability exceeding 1 - d/n, for some constant d independent of n. Our main contribution is a safe algorithm for the well known list ranking problem, working in time O(log n) on an O(n log n)-processor EREW PRAM. We also show an optimal safe algorithm for computing prefix sums, which works in time O(log n) on an O(n/log n)-processor EREW PRAM. The methods presented in this paper can be applied to a wide class of EREW PRAM algorithms making them safe and simultaneously preserving their complexity.
B. S. Chlebus, K. Diks, A. Pelc, Broadcasting in synchronous networks with dynamic faults, Networks 27 (1996), 309-318.
The problem of broadcasting in a network is to disseminate information from one node to all other nodes by transmitting it over communication links that connect nodes. We consider the time of broadcasting in the presence of at most k dynamic link failures. If a node knows source information, then in the next step all its neighbors connected by operational links also get to know it. Faults are dynamic, in the sense that a link may alternate arbitrarily between being operational or faulty, provided that, at every time step, the number of faulty links does not exceed k. The time bounds on broadcasting are considered with respect to two parameters: the number k of faulty links and the diameter d of the underlying graph. Broadcasting is guaranteed to be successful if and only if the edge connectivity of the network exceeds k; and we consider only such networks. For a fixed k, it is shown that broadcasting is always completed in time O(dk+1), where the bound is a function of diameter d. For a fixed d, it is shown that broadcasting is always completed in time O(kd/2 - 1), where the bound is a function of k. We prove that these orders of magnitude cannot be improved in general. Among networks, particularly interesting are those in which broadcasting time is close to their diameter in the presence of at most k dynamic faults, where k+1 is the edge-connectivity of the network. We show that multidimensional tori have this property.
K. Diks, A. Pelc, Broadcasting with universal lists, Networks 27 (1996), 183-196.
A preliminary version appeared in: Proc. 28th Annual Hawaii International Conference on System Sciences, January 1995, Vol.2, pp. 564-573.
In broadcasting, information originally held in one node of a communication network (called the source) has to be transmitted to all other nodes. In a unit of time, every node which already received the source message can transmit it to one neighbor. In classical broadcasting, the choice of neighbors to be informed by a node and the order in which they are informed may depend on the source. Thus, nodes need to store many transmission lists corresponding to different possible sources and need to know the source to adapt their behavior accordingly. In this paper we consider a variant of broadcasting in which every node is given a priori a single ordered list containing some of its neighbors. This list is meant to be universal for all possible sources. Upon obtaining the source message, a node transmits it to the neighbors from its list in prescribed order and then stops. This requires substantially less local memory devoted to schedule communication but usually increases broadcasting time. We compare broadcasting time in this and in the classical model and design optimal broadcasting schemes in the universal-list model for trees, rings and grids. For tori and for complete graphs we give upper bounds on broadcasting time.
K. Diks, A. Pelc, Efficient gossiping by packets in networks with random faults, SIAM Journal on Discrete Mathematics 9 (1996), pp. 8-17.
Every node of a communication network has a constant size value which should be made known to all other nodes. Nodes and links fail independently with constant probabilities p < 1 and q < 1, respectively. Faults are permanent and of crash type: a faulty link does not transmit messages and a faulty node neither sends nor receives messages. In a unit of time every node can send a packet of information to at most one neighbor and receive a packet from at most one neighbor. The size of each packet does not exceed b(n), where n is the number of nodes. For every d > 0 we present an algorithm to exchange values between all fault-free nodes of an n-node network in time O(n/b(n) + log n) , with probability exceeding 1-n-d, for sufficiently large n. This order of magnitude of running time is optimal.
A. Pelc, Efficient fault location with small risk, Proc. 3rd Coll. on Structural Information and Communication Complexity, SIROCCO'96, Certosa di Pontignano, Siena, June 1996, p. 292-300.
We consider the problem of fault diagnosis in multiprocessor systems. Processors perform tests on their neighbors; fault-free testers correctly identify the fault status of tested neighbors, while faulty testers can give arbitrary test results. Processors fail independently with constant probability p < 1/2 and the goal is to identify correctly the status of all processors, based on the set of test results. For 0 < q < 1 , q-diagnosis is a fault diagnosis algorithm whose probability of error does nor exceed q. Since testing is costly in terms of local computation time and communication between neighboring processors, it is important to design diagnosis algorithms combining high reliability with a small number of tests. We show that the minimum number of tests to perform q-diagnosis in a n-processor system is of the order of n log (1/q) and show q-diagnosis achieving this bound. We also prove that the minimum degree of a multiprocessor system permitting q-diagnosis is of the order of log n + log (1/q) and show q-diagnosis with a low number of tests that can work in such systems. Our results show precisely the trade-off between reliability of system diagnosis and the cost of the testing process.