Behrooz Parhami's ECE 257A Course Page for Fall 2009

Fault-Tolerant Computing

Enrollment code: 55111
Prerequisite: ECE 154 (or equivalent)
Class meetings: TR 10:00-11:15, Phelps 1437
Instructor: Professor Behrooz Parhami
Open office hours: M 10:30-11:50, W 12:30-1:50; HFH 5155
Course announcements: Listed in reverse chronological order
Course calendar: Lecture, homework, and exam schedules
Homework assignments: Four assignments, worth a total of 20%
Exams: Open-book midterm and final, each worth 40%
Research paper: Does not apply to fall 2009
Research paper guidelines: Brief guide to format and contents
Grades: Statistics for homework and exam grades
References: Textbook and other sources (Textbook's web page)
Lecture slides: Via the textbook's Web page; for Lecture 0, see below
Miscellaneous information: Motivation, catalog entry, history

Course Announcements

2009/11/19: Homework 4 and lecture slides for Part VI of the textbook have been posted. Homework 4 will be due on R 12/03, not T 12/01. The PDF file containing the four chapters of the textbook's Part VI (in preliminary form) will be posted by F 11/20.
2009/11/12: Homework 3 has been available since R 11/05. Lecture slides for Part V of the textbook have been posted. The PDF file containing the four chapters of Part V of the textbook (in preliminary form) will be posted by F 11/13.
2009/10/26: Lecture slides for Part IV of the textbook have been posted. The PDF file containing the four chapters of Part IV of the textbook (in preliminary form) will be posted by W 10/28.
2009/10/20: Lecture slides for Part III of the textbook have been posted. The PDF file containing the four chapters of Part III of the textbook will be posted by R 10/22.
2009/10/15: Homework 2 has been posted below. Lecture slides for Part II of the textbook and the PDF file containing the four chapters of Part II (in preliminary form) are now available from the book's Web page.
2009/10/01: Homework 1 has been posted below, the slides for Part I of the textbook have been updated, and the PDF file containing Part I of the textbook has been posted to the book's Web page.
2009/09/20: The course schedule and coverage have changed compared with the initial version posted here in July. This is to bring the course in line with the textbook's content and structure. Please discard any earlier version of this page that you may have saved.
2009/07/09: Welcome to the ECE 257A web page for fall 2009. The following tentative information is provided for planning purposes only. Details will be finalized in late September 2009 and updated weekly thereafter.

Course Calendar

Course lectures, homework assignments, and exams have been scheduled as follows. This schedule will be strictly observed. In particular, no extension is possible for homework due dates. Please begin work on your assignments early. Each lecture corresponds to topics in 1-2 chapters of the instructor's forthcoming textbook on dependable computing. Chapter numbers are provided in parentheses, after day & date.

Day & Date (book chapters) Lecture topic [Homework posted/due] {Special notes}
R 09/24 (0) Course introduction: Goals, pretest, class survey {ppt, pdf}

T 09/29 (1) Background and motivation {class surveys due}
R 10/01 (2) Dependability attributes [HW1 posted]

T 10/06 (3) Combinational modeling
R 10/08 (4) State-space modeling

T 10/13 (5-6) Defect avoidance and circumvention [HW1 due]
R 10/15 (7-8) Shielding and hardening; Yield enhancement [HW2 posted]

T 10/20 (9, 11) Fault testing; Design for testability
R 10/22 (10, 12) Fault masking; Replication with voting

T 10/27 (13-14) Error detection and correction [HW2 due]
R 10/29 (15-16) Self-checking modules; Redundant disk arrays

T 11/03 No lecture {Instructor away at conference}
R 11/05 (1-12) Midterm exam, open-book/notes, 10:00-11:45 [HW3 posted]

T 11/10 (17-18) Malfunction diagnosis and tolerance
R 11/12 (19-20) Standby redundancy; Robust parallel processing

T 11/17 (21-22) Degradation allowance and management [HW3 due; extended to 11/19]
R 11/19 (23-24) Resilient algorithms; Software redundancy [HW4 posted]

T 11/24 (25-26) Failure confinement and recovery
R 11/26 No lecture: Thanksgiving Holiday

T 12/01 (27-28) Agreement and adjudication; Fail-safe systems [HW4 due; extended to 12/03]
R 12/03 (A) Conclusion: Past, present, and future {Instructor and course evaluations?}

W 12/09 Final exam, open-book/notes, 9:00-11:00

W 12/16 Course grades to be submitted by midnight

Homework Assignments

-Turn in solutions in class before the lecture begins.
-Because solutions will be handed out on the due date, no extension can be granted.
-Use a cover page that includes your name, course name, and assignment number.
-Staple the sheets and write your name on top of each sheet in case they are separated.
-Although some cooperation is permitted, direct copying will have severe consequences.

Homework 1: Dependability and its modeling (ch. 1-4, due T 2009/10/13, 10:00 AM)
Do problems 1.7, 1.15, 2.16, 3.5, and 4.1 from the textbook.

Homework 2: Defects and faults (ch. 5-12, due T 2009/10/27, 10:00 AM)
Do problems 6.7, 8.1, 9.1, 10.6, and 12.4 from the textbook.

Homework 3: Errors and malfunctions (ch. 13-20, due T 2009/11/17, 10:00 AM; extended to R 11/19)
Do problems 13.1, 15.2, 16.3, 17.2, and 20.3 from the textbook.

Homework 4: Degradations and failures (ch. 21-28, due T 2009/12/01, 10:00 AM; extended to R 12/03)
Do problems 21.2, 23.1, 24.4, 27.6, and A.1 from the textbook; the last two problems appear below.
27.6 Generalized and weighted voting: A generalized voting scheme can be specified by listing its agreement sets. For example, simple 2-out-of-3 majority voting with inputs A, B, and C has the agreement sets {A, B}, {B, C}, {C, A}. This means, for instance, that if B and C agree, then their common output will be taken as the voting outcome. Show that each of the agreement sets below corresponds to a weighted threshold voting scheme, providing the corresponding weights and threshold value. Then, present a simple hardware voting unit implementation for one of the four cases.
a. {A, B}, {A, C}, {A, D}, {B, C, D}
b. {A, B}, {A, C, D}, {B, C, D}
c. {A, B, C}, {A, C, D}, {B, C, D}
d. {A, B}, {A, C, D}, {B, D}
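Illustration (not a solution to parts a-d): the relationship between weights, threshold, and agreement sets can be explored with a short script. The sketch below is only one possible way to do this; the function name `minimal_agreement_sets` and its parameters are hypothetical, and it assumes all weights are positive, so that any superset of a winning set is also winning. It lists the minimal sets of inputs whose combined weight reaches the threshold.

```python
from itertools import combinations


def minimal_agreement_sets(weights, threshold):
    """List the minimal groups of inputs whose total weight reaches the threshold.

    weights   -- dict mapping input name to a positive weight (assumption)
    threshold -- minimum total weight needed for a group to decide the outcome
    """
    names = sorted(weights)
    winning = []
    # Enumerate subsets by increasing size, so smaller winning sets come first.
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            if sum(weights[n] for n in subset) < threshold:
                continue
            # Skip any subset that contains a smaller winning set already found.
            if any(set(w) <= set(subset) for w in winning):
                continue
            winning.append(subset)
    return winning


# Simple 2-out-of-3 majority voting: unit weights, threshold 2.
print(minimal_agreement_sets({"A": 1, "B": 1, "C": 1}, 2))
# prints [('A', 'B'), ('A', 'C'), ('B', 'C')]
```

With unit weights and a threshold of 2, the script reproduces the three agreement sets of the majority-voting example in the problem statement. Trying other weight and threshold values is one way to check a candidate answer against a given list of agreement sets.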
A.1 Safety-critical systems: It was reported widely that on Thursday, 2009/11/19, a computer failure at the US Federal Aviation Administration (FAA) caused massive flight cancellations and delays. The failure made some flight data (such as flight numbers, destinations, and altitudes) unavailable, leading to manual data entry and forcing flight controllers to space the aircraft further apart for safety reasons. So far, very few technical details are available about the incident, with news reports blaming it on the failure of a single circuit board in Salt Lake City, Utah. However, more details will surely emerge by the end of the month. Prepare a 2-page report about this incident, citing the main technical reasons for the disruption and the role played by dependability enhancement features, or lack thereof.

Sample Exams and Study Guide

The following sample exam problems are meant to indicate the types and levels of problems, rather than the coverage (which is outlined in the course calendar).
Students are responsible for all sections and topics in the textbook and class handouts that are not explicitly excluded in the study guide that follows each sample exam, even if the material was not covered in class lectures.

Sample Midterm Exam (105 minutes)
Problems 3.12, 4.4, 9.4, and 12.1 from the textbook.

Midterm Exam Study Guide
Nothing specific; just study Chapters 1-12 and review the problems in homework assignments 1-2.

Sample Final Exam (120 minutes)
Problems 13.5, 15.5, 17.1, and 21.2 from the textbook.

Final Exam Study Guide
Study Chapters 13-28. There will be one problem from each of the four parts. Pay special attention to the problems in homework assignments 3-4. Note that the coverage in the sample final exam above is more limited than that of the current quarter.

Research Paper and Presentation

This section does not apply to the fall 2009 offering of the course. Please ignore.
Each student will review a subfield of dependable computing or do original research on a selected and approved topic. A tentative list of research topics is provided below; however, students should feel free to propose their own topics for approval. A publishable report earns an "A" for the course, regardless of homework and midterm grades. See the course calendar for schedule and due dates and Research Paper Guidelines for formatting tips.

1. Topic 1 (Assigned to: TBD)
2. Topic 2 (Assigned to: TBD)
3. Topic 3 (Assigned to: TBD)
4. Topic 4 (Assigned to: TBD)
5. Topic 5 (Assigned to: TBD)
6. Topic 6 (Assigned to: TBD)
7. Topic 7 (Assigned to: TBD)
8. Topic 8 (Assigned to: TBD)

Grade Statistics

All grades listed are in percent.
HW1 grades: Range = [66, 86], Mean = 78, SD = 7, Median = 79
HW2 grades: Range = [55, 80], Mean = 69, SD = 8, Median = 70
HW3 grades: Range = [xx, xx], Mean = xx, SD = xx, Median = xx
HW4 grades: Range = [xx, xx], Mean = xx, SD = xx, Median = xx
Midterm exam grades: Range = [50, 73], Mean = 63, SD = 7, Median = 64
Final exam grades: Range = [xx, xx], Mean = xx, SD = xx, Median = xx
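
The weights stated at the top of this page (homework 20% in total, midterm and final exams 40% each) combine into an overall course percentage roughly as sketched below. This is an illustration only: the function name `course_grade` is hypothetical, equal weighting of the four homework assignments is an assumption not stated on this page, and the sample inputs are made-up values rather than actual grades.

```python
def course_grade(homework, midterm, final):
    """Combine percent scores using the 20% / 40% / 40% weights.

    homework -- list of homework scores in percent (assumed equally weighted)
    midterm  -- midterm exam score in percent
    final    -- final exam score in percent
    """
    homework_average = sum(homework) / len(homework)
    return 0.20 * homework_average + 0.40 * midterm + 0.40 * final


# Made-up scores for illustration only.
overall = course_grade([80, 70, 90, 60], midterm=75, final=85)
print(round(overall, 1))
```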

References

Required text: B. Parhami, Dependable Computing: A Multilevel Approach (forthcoming); chapters will be posted as they become available. Please visit the textbook's web page for general information. Lecture slides and (preliminary) sample chapters are also available there.

Some useful books (not required):
Koren/Krishna, Fault-Tolerant Systems, Morgan Kaufmann, 2007 (ISBN 0-12-088525-5)
Shooman, Reliability of Computer Systems and Networks, Wiley, 2002 (ISBN 0-471-29342-3)
Siewiorek/Swarz, Reliable Computer Systems, Digital Press, 1992 (ISBN 1-55558-075-0)
Johnson, Design and Analysis of Fault-Tolerant Digital Systems, Addison Wesley, 1989 (ISBN 0-201-07570-9)

Research resources:
Proc. IEEE/IFIP Int'l Conf. Dependable Systems and Networks (DSN), formerly known as Fault-Tolerant Computing Symp. (FTCS), annual, since 1971.
IEEE Trans. Dependable and Secure Computing, quarterly journal, published since 2004.
IEEE Trans. Reliability, quarterly journal, published since 1955.
IEEE Trans. Computers, monthly journal, published since 1952.
UCSB library's electronic journals, collections, and other resources
UCSB library's research guide in ECE

Miscellaneous Information

Motivation: Dependability concerns are integral parts of engineering design. Ideally, we would like our computer systems to be perfect, always yielding timely and correct results. However, just as bridges collapse and airplanes crash occasionally, so too computer hardware and software cannot be made totally immune to unpredictable behavior. Despite great strides in component reliability and programming methodology, the exponentially increasing complexity of integrated circuits and software systems makes the design of perfect computer systems nearly impossible. In this course, we study the causes of computer system failures (impairments to dependability), techniques for ensuring correct and timely computations despite such impairments, and tools for evaluating the quality of proposed or implemented solutions.

Catalog entry: 257A. Fault-Tolerant Computing. (4) PARHAMI. Prerequisites: ECE 154. Lecture, 3 hours. Basic concepts of dependable computing. Reliability of nonredundant and redundant systems. Dealing with circuit-level defects. Logic-level fault testing and tolerance. Error detection and correction. Diagnosis and reconfiguration for system-level malfunctions. Degradation management. Failure modeling and risk assessment.

History: Professor Parhami took over the teaching of ECE 257A in the fall quarter of 1998. Previously, the course had been taught primarily by Dr. John Kelly, who instituted the two-course sequence ECE 257A/B, the first covering general topics and the second (now discontinued) devoted to his research focus on software fault tolerance. Borrowing from his experience in teaching dependable computing at other universities and based on an extensive survey of the field that he published in 1994, Professor Parhami oriented the course toward an original multilevel view of impairments to computer system dependability and techniques for avoiding or tolerating them. The levels of this model, in increasing order of abstraction, are: defects, faults, errors, malfunctions, degradations, and failures. A textbook based on this multilevel model of dependable computing is in preparation.
Offering of ECE 257A in fall 2007 (PDF file)
Offerings of ECE 257A in 1998 and 2006 (PDF file)