
ECE 257A: FaultTolerant Computing
Behrooz Parhami: 2007/06/19  Email: parhami@ece.ucsb.edu  Problems: webadmin@ece.ucsb.edu Other contact info at: Bottom of this page  Go up to: B. Parhami's course syllabi or his home page
On June 19, 2007, Professor Parhami's UCSB ECE website moved to a new location. For an uptodate version of this page, visit it at the new address: http://www.ece.ucsb.edu/~parhami/ece_257a.htm Background and history of ECE 257A Professor Parhami took over the teaching of ECE 257A in the fall quarter of 1998. Previously, the course had been taught primarily by Dr. John Kelly, who instituted the twocourse sequence ECE 257A/B, the first covering general topics and the second (now discontinued) devoted to his research focus on software fault tolerance. Borrowing from his experience in teaching fault tolerance at other universities and based on an extensive survey of the field that he published in 1994, Professor Parhami oriented the course toward an original multilevel view of impairments to computer system dependability and techniques for avoiding or tolerating them. The levels of this models, in increasing order of abstraction, are: defects, faults, errors, malfunctions, degradations, and failures. A textbook based on this multilevel model of dependable computing is in preparation. Link to previous offerings of ECE 257AECE 257A: Fall Quarter 2006 offeringThis area reserved for important course announcements: Files for lecture 18 have been posted. If you have any lastminute questions about homework or the course material, I will be in my office all morning (9:0012:00) on Friday, 12/8. Per our agreement in class on Thursday 11/30, the course's final exam will be on Friday, December 8, 1:003:00 PM, in Room 4164 (ECE Conference Room in Harold Frank Hall).
Lecture plan:
Homework: General Requirements Solutions should be brought to class on the due date
and handed in before the lecture begins. Late
homework will not be accepted, so plan to start work on your assignments early. Use a cover page that includes your name, course and assignment number for your solutions. Staple
the sheets and write your name on top of every sheet in case sheets are
separated. Although
some cooperation is permitted, direct copying will have severe consequences. ECE 257A f2006, Homework #1: Dependability measures and modeling (Lectures 14, due Thursday, Oct. 12) 1. [Mean time to failure; 10 points] A particular computer model has a constant failure rate of l. If a large number m of identical machines of this type run continuously for a period of time equal to their MTTF, how many of them are expected to be still working at the end? (Hint: The correct answer isn't m/2.) By what time do we expect 90% of the machines to have failed? 2. [Failure independence; 15 points] A department has three networked servers. The maintenance logs for the month of September indicate that all three servers were up and running 83% of the time, two were running (while the third one was down) 16% of the time, and only one was running at all other times. Show that the logged data are consistent with, but do not prove, the hypothesis that the three servers are equally reliable and fail independently. 3. [Data availability with triplication; 25 points] Redo example 1.2 of the first course handout (Oct. 5), quantifying the probability of being able to access a data file if, whenever link malfunctions make direct access impossible, the requesting site can obtain a copy indirectly via the remaining site that does not itself holds a copy of the data, provided the latter is operational and can access the required file. State your assumptions clearly. Because the availability expression for this problem can grow quite complicated, use the parameters L and S in lieu of a_{L} and a_{S} for brevity. 4. [Combinational modeling; 10 points] A safety critical subsystem is built of three computation modules, each having the reliability r_{M}, and a voter with reliability r_{V}. Represent this system as a seriesparallel system. Verify that your model is indeed correct by computing its reliability and comparing to that of the original system. 5. [Statespace modeling; 20 points] A computer system experiences one failure per 1000 hours of operation. Upon failure detection, a diagnostic process is run. In 95% of the cases, the cause of failure can be removed based on the diagnostic (e.g., through system reset or software reload) and service is restored in about 15 minutes. In the remaining 5% of the cases, a repair person must be summoned and the expected time to service restoration is around 12 hours. Construct a statespace model for this system and derive its availability.
6. [Reading assignment; 20 points] Read the paper
“Evaluation of
SoftwareImplemented FaultTolerance (SIFT) Approach in Gracefully Degradable
MultiComputer Systems ECE 257A f2006, Homework #2: Dealing with lowlevel impairments (Lectures 58, due Thursday, Oct. 26) 1. [Bathtub curve; 15 points] Argue that, with proper interpretation, the hardware failure rate trend represented in a bathtub curve is applicable to software as well. Take time t = 0 to represent the completion of the first fully functional prototype of the software, which is then followed by various prerelease versions and, eventually, the first released version, with the latter ideally occurring at the end of infant mortality. Reference for the aging phase: Grottke, Michael, Lei Li, Kalyanaraman Vaidyanathan, and Kishor S. Trivedi, “Analysis of Software Aging in a Web Server,” IEEE Trans. Reliability, Vol. 55, No. 3, pp. 411420, September 2006. 2. [Fault testing; 15 points] Find a minimal set of tests for all sa0 and sa1 faults in the fulladder circuit shown in slide 12 of lecture 6. 3. [Testability; 15 points] Quantify the testability of each line of the fulladder circuit shown in slide 12 of lecture 6. Based on the results obtained, where would you place a single testpoint? 4. [Modeling for fault masking; 15 points] Consider the following faultmasking redundancy scheme with voting. There are four modules. Initially, three modules are operating concurrently and one is a spare. When a disagreement is detected between the three operating modules, the disagreeing module is purged (switched out) and the spare is used to replace it. After the spare has been switched in, a detected disagreement causes the disagreeing module and one of the good ones to be purged, leaving a simplex system. State all your assumptions clearly. Construct and solve a statespace reliability model for this system, assuming perfect coverage, switching, and voting. 5. [Switchvoter design for fault masking; 20 points] For the redundancy scheme discussed in Problem 4, design the required 4input, 1output voterswitch, with 4 internal FFs indicating which units are contributing to the formation of the output. Feel free to use of muxes, comparators, and other standard circuits in your design (the design need not be at the gate level). 6. [Error detection; 20 points] A multibit unidirectional error affects the bits in the same direction; that is, all changes are either 0to1 or 1to0, but not both. Show that a product code with check modulus 2^{a}  1 detects all burst errors affecting a  1 or fewer consecutive bits of a codeword and all unidirectional errors affecting any a  1 or fewer bits. ECE
257A f2006, Homework #3: Dealing with highlevel impairments ( 1. [Triangular code for error correction; 15 points] Instead of arranging k data bits in a square array and attaching 2k^{1/2} + 1 parity bits to its rows and columns, we can use a triangular (halfsquare) shape, as shown below, and place the parity bits on its diagonal. Each parity bit covers all data bits in the same row and the same column, so every parity bit covers the same number of data bits. Compare this code to the 2D square code with regard to redundancy and error correction/detection capabilities. d d d d d p d d d d p d d d p d d p d p p 2. [Biproduct codes; 15 points] We discussed biresidue codes and their error correction capabilities in class, noting, for example, that a biresidue code with the lowcost check moduli 7 and 15 can correct any singlebit error within 12 data bits. Consider now a biproduct code with the lowcost check moduli 7 and 15 (that is, a coding scheme whereby numbers are multiplied by 7 ´ 15 = 105). Derive the error correction capabilities of this code, compare with the biresidue code mentioned above, and generalize to a biproduct code with relatively prime check moduli 2^{a} – 1 and 2^{b} – 1. 3. [Sequential diagnosis; 15 points] Consider an nnode directed ring, with nodes numbered 0 to n – 1. Augment this ring with additional test links that allow nodes 2, 3, . . . , 2t – 1 (besides node n – 1) to test node 0. Show that the resulting system is sequentially tdiagnosable. Hint: Let x be the number of testers, from among the 2t – 1 testers, that indicate node 0 to be malfunctioning, and consider various subcases according to the magnitude of x. 4. [Modeling of failsoft systems; 20 points] A failsoft system is composed of 5 modules, the health of any 3 of which is adequate for system availability. Modules malfunction at the rate l and repair is performed at the rate m = 100l, regardless of the number of existing malfunctions. Upon each malfunction, the system recovers (by changing to a configuration having one fewer module) with probability c, representing the coverage factor. Find the availability of the system as a function of c, plot its value as c varies from 0.8 to 1.0, and discuss the observed trend. 5. [Number of checkpoints; 15 points] Consider a computation that takes 100 hours to run. The failure rate (events that create a need to backtrack to the previous checkpoint) is l per hour. Each checkpoint adds 100 s to the computation in way of overhead. Two checkpointing schemes are being considered: (1) inserting 3 checkpoints, one after every 25 hours of computation.; (2) inserting a total of 9 checkpoints, one after every 10 hours of computation. Determine the values of l for which each checkpointing scheme would be preferable to the other. 6. [Failure avoidance/confinement; 20 points] The November 2006 issue of IEEE Spectrum (p. 22, “German Maglev Tragedy”) reports in a very brief news story that on 22 September 2006, a train built to demonstrate the magnetic levitation technology in northwest Germany collided with a maintenance vehicle left on the track, killing 25 of the 29 people aboard. Using Internet sources, write a 2page report on this incident, focusing in particular on the role that computer systems played (or should have played) and the inadequacy of failure avoidance, failure confinement, and fault tolerance strategies. ECE
257A f2006, Homework #4:
1. [Redundant hardware structures; 15 points] Compare the reliabilities and MTTFs for the following three redundancy schemes, each with 4 computational modules. Assume that the voting and switching components are perfectly reliable and that there is no repair. (1) 3outof4 voting. (2) 2outof3 voting and one spare module that is brought online when one of the active ones fails; assume identical failure rates for active and spare modules. (3) The pairandspare scheme. 2. [Time redundancy with selfdual functions; 20 points] (a) Prove that a kbit unsigned, binary adder, with 2k + 1 inputs and k + 1 outputs, is selfdual with respect to each of its outputs. (b) Repeat part a for a 1’scomplement binary adder that has its carryout connected to its carryin. Such an adder has 2k inputs and k outputs. (c) Repeat part a for a 2’scomplement binary adder with c_{in} = 0 (2k inputs, k + 1 outputs). Is selfduality maintained if the 2’scomplement adder is used as an addersubtractor via a control signal that causes complementation of the second operand and the insertion of c_{in} = 1? 3. [Totally selfchecking tworail checker; 15 points] We discussed the design of a circuit, with 4 inputs and 2 outputs, that combines error indicators from two modules (tworail checker). Show that the design is selftesting with respect to all single stuckat and multiple unidirectional faults. 4. [Totally selfchecking tworail comparator; 15 points] Discuss the design of a TSC tworail comparator circuit. The inputs are tworail encodings of two kbit numbers for an arbitrary word width k. The output is 01 if the input numbers are equal, 10 if they are unequal, and 00 or 11 if there is an input error or if the comparator has an internal fault. Present a complete design with 4bit tworail inputs. 5. [Generalized and weighted voting; 20 points] A generalized voting scheme can be specified by listing its agreement sets. For example, standard 2outof3 majority voting with inputs A, B, and C has the agreement sets {A, B}, {B, C}, {C, A}. Show that each of the agreement sets in parts ad below corresponds to a weighted threshold voting scheme and present a simple hardware voter implementation for at least one of the four cases. (a) {A, B}, {A, C}, {A, D}, {B, C, D}. (b) {A, B}, {A, C, D}, {B, C, D}. (c){A, B, C}, {A, C, D}, {B, C, D}. (d) {A, B}, {A, C, D}, {B, D}. (e) Is there any generalized voting scheme on 4 inputs that cannot be realized as a weighted voting scheme? 6. [Reconfiguration; 15 points] We have discussed the use of reconfiguration for fault tolerance. Reconfiguration can also be used for flexibility; i.e., use of a common computational resource, such as an FPGA, to solve different problems at the hardware level. Discuss the similarities and differences of these two uses for reconfigurable computing. The following reference might be helpful: Compton, K., and S. Hauck, “Reconfigurable Computing: A Survey of Systems and Software,” ACM Computing Surveys, Vol. 34, No. 2, pp. 171–210, June 2002. ECE 257A f2006, Notes on homework solutions and other topics None at this time. ECE 257A f2006, Possible Research Topics
List of topics will be posted here if some students indicate
interest in the research option in lieu of final exam. Before starting work on
your report, consult useful guidelines on
organization and formatting of a research paper. Return to: Top of this page  Go up to: B. Parhami's course syllabi or his home page

